Clojure for Data Science

Confidence intervals

Since the standard error of our sample measures how closely we expect our sample mean to match the population mean, we could also consider the inverse—the standard error measures how closely we expect the population mean to match our measured sample mean. In other words, based on our standard error, we can infer that the population mean lies within some expected range of the sample mean with a certain degree of confidence.

Taken together, the "degree of confidence" and the "expected range" define a confidence interval. When stating confidence intervals, it is fairly standard to quote the 95 percent interval—we are 95 percent sure that the population parameter lies within the interval. Of course, there remains a 5 percent possibility that it does not.


Whatever the standard error, the population mean will lie within 1.96 standard errors of the sample mean 95 percent of the time. The value 1.96 is therefore the critical z-value for a 95 percent confidence interval.

Note

The name z-value comes from the fact that the normal distribution is also called the z-distribution.

The number 1.96 is so commonly used that it's worth remembering, but we can also calculate the critical value using the s/quantile-normal function. Our confidence-interval function that follows expects a value for p between zero and one. This will be 0.95 for our 95 percent confidence interval. To calculate the size of each of the two tails (2.5 percent for the 95 percent confidence interval), we need to subtract p from one and divide it by two:

(defn confidence-interval [p xs]
  (let [x-bar  (s/mean xs)
        se     (standard-error xs)
        z-crit (s/quantile-normal (- 1 (/ (- 1 p) 2)))]
    [(- x-bar (* se z-crit))
     (+ x-bar (* se z-crit))]))
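As a quick cross-check of the z-crit calculation outside Incanter, Python's standard library can reproduce the same two-tailed logic (a sketch, assuming Python 3.8+'s statistics.NormalDist):

```python
from statistics import NormalDist

def critical_z(p):
    # Two-tailed interval: half of the remaining (1 - p) probability
    # falls in each tail.
    tail = (1 - p) / 2               # 0.025 for a 95 percent interval
    return NormalDist().inv_cdf(1 - tail)

print(round(critical_z(0.95), 2))    # 1.96
```

The quantile function (inverse CDF) at 0.975 yields the familiar 1.96, just as s/quantile-normal does above.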

(defn ex-2-9 []
  (let [may-1 (f/parse-local-date "2015-05-01")]
    (->> (load-data "dwell-times.tsv")
         (with-parsed-date)
         (filtered-times {:date {:$eq may-1}})
         (confidence-interval 0.95))))

;; [83.53415272762004 97.75306531749274]

The result tells us that we can be 95 percent confident that the population mean lies between 83.53 and 97.75 seconds. Indeed, the population mean we calculated previously lies well within this range.
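The "95 percent confident" claim can be checked by simulation: if we repeatedly draw samples and construct an interval from each, about 95 percent of those intervals should contain the true mean. Here is a sketch in Python using a normal population for simplicity; the 90-second mean and 94-second standard deviation are illustrative values, not figures from the book's data:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(1)
z = NormalDist().inv_cdf(0.975)      # critical z for 95 percent
true_mean, true_sd = 90.0, 94.0      # illustrative population parameters
hits, trials, n = 0, 1000, 100

for _ in range(trials):
    xs = [random.gauss(true_mean, true_sd) for _ in range(n)]
    se = stdev(xs) / n ** 0.5        # standard error of this sample
    lo, hi = mean(xs) - z * se, mean(xs) + z * se
    if lo <= true_mean <= hi:        # did the interval capture the truth?
        hits += 1

print(hits / trials)                 # close to 0.95
```

The observed coverage hovers near 0.95, which is exactly what the confidence level promises.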

Sample comparisons

After a viral marketing campaign, the web team at AcmeContent take a sample of dwell times from a single day for us to analyze. They'd like to know whether their latest campaign has brought more engaged visitors to the site. Confidence intervals provide us with an intuitive way to compare the two samples.

We load the dwell times from the campaign as we did earlier and summarize them in the same way:

(defn ex-2-10 []
  (let [times (->> (load-data "campaign-sample.tsv")
                   (i/$ :dwell-time))]
    (println "n:      " (count times))
    (println "Mean:   " (s/mean times))
    (println "Median: " (s/median times))
    (println "SD:     " (s/sd times))
    (println "SE:     " (standard-error times))))

;; n:       300
;; Mean:    130.22
;; Median:  84.0
;; SD:      136.13370714388046
;; SE:      7.846572839994115

The mean seems to be much larger than the means we have been looking at previously—130 seconds compared to around 90 seconds. It could be that there is some significant difference here, although the standard error is over twice that of our previous single-day sample, owing to the smaller sample size and larger standard deviation. We can calculate the 95 percent confidence interval for the population mean based on this data using the same confidence-interval function as before:

(defn ex-2-11 []
  (->> (load-data "campaign-sample.tsv")
       (i/$ :dwell-time)
       (confidence-interval 0.95)))

;; [114.84099983154137 145.59900016845864]

The 95 percent confidence interval for the population mean is 114.8s to 145.6s. This doesn't overlap at all with the 90s population mean we calculated previously. There appears to be a large underlying population difference that is unlikely to have occurred through sampling error alone. Our task now is to find out why.
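We can verify the interval arithmetic by hand from the summary statistics printed by ex-2-10 (a quick cross-check using Python's standard library; the mean and standard error are the values shown above):

```python
from statistics import NormalDist

mean_dt = 130.22                     # sample mean from ex-2-10
se = 7.846572839994115               # standard error from ex-2-10
z = NormalDist().inv_cdf(0.975)      # critical z for 95 percent

lower, upper = mean_dt - z * se, mean_dt + z * se
print(round(lower, 1), round(upper, 1))  # 114.8 145.6
```

Mean plus or minus 1.96 standard errors reproduces the interval returned by ex-2-11.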

Bias

A sample should be representative of the population from which it is drawn. In other words, it should avoid bias that would result in certain kinds of population members being systematically excluded (or included) over others.

A famous example of sample bias is the 1936 Literary Digest poll for the US Presidential Election. It was one of the largest and most expensive polls ever conducted with 2.4 million people being surveyed by mail. The results were decisive—Republican governor of Kansas Alfred Landon would defeat Franklin D. Roosevelt, taking 57 percent of the vote. In the event, Roosevelt won the election with 62 percent of the vote.

The primary cause of the magazine's huge sampling error was sample selection bias. In their attempt to gather as many voter addresses as possible, the Literary Digest scoured telephone directories, magazine subscription lists, and club membership lists. In an era when telephones were more of a luxury item, this process was guaranteed to be biased in favor of upper- and middle-class voters and was not representative of the electorate as a whole. A secondary cause of bias was nonresponse bias—less than a quarter of those who were approached actually responded to the survey. This is a kind of selection bias that favors only those respondents who actually wish to participate.

A common way to avoid sample selection bias is to ensure that the sampling is randomized in some way. Introducing chance into the process makes it less likely that experimental factors will unfairly influence the quality of the sample. The Literary Digest poll was focused on getting the largest sample possible, but an unbiased small sample is much more useful than a badly chosen large sample.
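Randomized selection is straightforward to sketch (here in Python, with a hypothetical list of visitor IDs standing in for the population):

```python
import random

population = list(range(100000))       # hypothetical visitor IDs
random.seed(42)                        # reproducible for this sketch

# Simple random sampling: every member of the population has an equal
# chance of selection, unlike sampling from telephone directories or
# subscription lists.
sample = random.sample(population, 300)

print(len(sample))                     # 300
```

Because inclusion is decided by chance alone, no subgroup—wealthy or poor, eager or reluctant—is systematically favored.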

If we open up the campaign-sample.tsv file, we'll discover that our sample has come exclusively from June 6, 2015. This was a weekend, a fact we can easily confirm with clj-time:

(p/weekend? (t/date-time 2015 6 6))
;; true

Our summary statistics so far have all been based on the data we filtered just to include weekdays. This is a bias in our sample, and if the weekend visitor behavior turns out to be different from the weekday behavior—a very likely scenario—then we would say that the samples represent two different populations.