Clojure for Data Science
Samples and populations
The words "sample" and "population" mean something very particular to statisticians. A population is the entire collection of entities that a researcher wishes to understand or draw conclusions about. For example, in the second half of the 19th century, Gregor Johann Mendel, the originator of genetics, recorded observations about pea plants. Although he was studying specific plants in a laboratory, his objective was to understand the underlying mechanisms behind heredity in all possible pea plants.
Since populations may be large—or in the case of Mendel's pea plants, infinite—we must study representative samples and draw inferences about the population from them. To distinguish the measurable attributes of our samples from the inaccessible attributes of the population, we use the word statistics to refer to the sample attributes and parameters to refer to the population attributes.
In fact, statistics and parameters are distinguished through the use of different symbols in mathematical formulae:
Mean: x̄ (sample statistic), µx (population parameter)
Standard deviation: Sx (sample statistic), σx (population parameter)
Here, x̄ is pronounced as "x-bar," µx is pronounced as "mu x," and σx is pronounced as "sigma x."
If you refer back to the equation for the standard error, you'll notice that it is calculated from the population standard deviation σx, not the sample standard deviation Sx. This presents us with a paradox: we can't calculate the sample statistic using population parameters when the population parameters are precisely the values we are trying to infer. In practice, though, for sample sizes above about 30 the sample standard deviation is assumed to be a close enough estimate of the population standard deviation to be used in its place.
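The equation referred to above is not reproduced in this section. As a reminder, the standard error of the mean in its usual form, together with the large-sample substitution just described, can be written as follows. The symbol n for the sample size is assumed here.

```latex
\[
SE = \frac{\sigma_x}{\sqrt{n}} \approx \frac{S_x}{\sqrt{n}}
\qquad \text{for } n \gtrsim 30
\]
```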
Let's calculate the standard error from a single day's sample of dwell times. For example, let's take May 1:
(defn ex-2-8 []
  (let [may-1 (f/parse-local-date "2015-05-01")]
    (->> (load-data "dwell-times.tsv")
         (with-parsed-date)
         (filtered-times {:date {:$eq may-1}})
         (standard-error))))

;; 3.627
Although we have only taken a sample from one day, the standard error we calculate is very close to the standard deviation of all the sample means: 3.6s compared to 3.7s. It's as if, like a cell containing DNA, each sample encodes information about the entire population within it.
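To see this effect without the dwell-time file, the following is a minimal sketch that draws samples from a synthetic population and compares the two quantities. Everything here is an assumption made for illustration: the helper names (`mean`, `sample-sd`, `standard-error`, `draw-sample`) are hypothetical stand-ins rather than the functions used earlier, and the exponential population with a mean of 90 and the sample size of 500 are arbitrary choices.

```clojure
;; Hypothetical helpers for illustration only; they are not the
;; definitions used elsewhere in this chapter.

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sample-sd
  "Sample standard deviation (divides by n - 1)."
  [xs]
  (let [m  (mean xs)
        ss (reduce + (map #(let [d (- % m)] (* d d)) xs))]
    (Math/sqrt (/ ss (dec (count xs))))))

(defn standard-error
  "Estimates the standard error of the mean from a single sample,
  using the sample standard deviation in place of sigma."
  [xs]
  (/ (sample-sd xs) (Math/sqrt (count xs))))

(defn draw-sample
  "Draws n values from a synthetic exponential population with the
  given mean, using inverse transform sampling."
  [^java.util.Random rng n population-mean]
  (vec (repeatedly n #(* (- population-mean)
                         (Math/log (- 1.0 (.nextDouble rng)))))))

(let [rng          (java.util.Random. 42)   ; fixed seed for repeatability
      n            500                      ; assumed sample size
      one-sample   (draw-sample rng n 90.0)
      sample-means (vec (repeatedly 1000 #(mean (draw-sample rng n 90.0))))]
  (println "Standard error from one sample:" (standard-error one-sample))
  (println "SD of 1,000 sample means:      " (sample-sd sample-means)))
```

Because an exponential population with a mean of 90 also has a standard deviation of 90, both printed values should land near 90/√500, or roughly 4.0. The first is computed from a single sample of 500 values, while the second needs 1,000 separate samples.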