Clojure for Data Science

Samples and populations

The words "sample" and "population" mean something very particular to statisticians. A population is the entire collection of entities that a researcher wishes to understand or draw conclusions about. For example, in the second half of the 19th century, Gregor Johann Mendel, the originator of genetics, recorded observations about pea plants. Although he was studying specific plants in a laboratory, his objective was to understand the underlying mechanisms behind heredity in all possible pea plants.

Note

Statisticians refer to the group of entities from which a sample is drawn as the population, whether or not the objects being studied are people.

Since populations may be large—or in the case of Mendel's pea plants, infinite—we must study representative samples and draw inferences about the population from them. To distinguish the measurable attributes of our samples from the inaccessible attributes of the population, we use the word statistics to refer to the sample attributes and parameters to refer to the population attributes.

Note

Statistics are the attributes we can measure from our samples. Parameters are the attributes of the population we are trying to infer.

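In fact, statistics and parameters are distinguished through the use of different symbols in mathematical formulae. The sample mean is written x̄ and the sample standard deviation Sx, whereas the population mean is written µx and the population standard deviation σx.

Here, x̄ is pronounced as "x-bar," µx is pronounced as "mu x," and σx is pronounced as "sigma x."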

If you refer back to the equation for the standard error, you'll notice that it is calculated from the population standard deviation σx, not the sample standard deviation Sx. This presents us with a paradox: we can't calculate the sample statistic using population parameters when the population parameters are precisely the values we are trying to infer. In practice, though, for sample sizes above about 30, the sample standard deviation is assumed to be a close enough approximation to the population standard deviation that it can be used in its place.
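The substitution described above can be sketched in a few lines. This is only an illustration: the helper names (mean, sample-sd, standard-error-estimate) and the input values are hypothetical, and the n - 1 denominator is an assumption that may differ from the standard-error function used elsewhere in this chapter.

```clojure
;; Hypothetical helpers for illustration; not the chapter's own definitions.
(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn sample-sd
  "Sample standard deviation Sx, using the n - 1 denominator (an assumption)."
  [xs]
  (let [m (mean xs)
        n (count xs)]
    (Math/sqrt (/ (reduce + (map #(let [d (- % m)] (* d d)) xs))
                  (dec n)))))

(defn standard-error-estimate
  "Estimates the standard error of the mean as Sx / sqrt(n),
   substituting the sample standard deviation Sx for the unknown σx."
  [xs]
  (/ (sample-sd xs) (Math/sqrt (count xs))))

;; Made-up values, purely to show the call.
(standard-error-estimate [2 4 4 4 5 5 7 9])
```

With only eight values, this sample is far below the rule-of-thumb size of 30, so the result should be read as a demonstration of the arithmetic rather than a trustworthy estimate.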

Let's calculate the standard error from a single day's dwell times, say those of May 1:

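;; load-data, with-parsed-date, filtered-times, and standard-error are
;; assumed to be defined earlier in the chapter; f is assumed to alias
;; clj-time.format.
(defn ex-2-8 []
  (let [may-1 (f/parse-local-date "2015-05-01")]
    (->> (load-data "dwell-times.tsv")
         (with-parsed-date)
         ;; keep only the rows recorded on May 1
         (filtered-times {:date {:$eq may-1}})
         (standard-error))))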

;; 3.627

Although we have only taken a sample from one day, the standard error we calculate is very close to the standard deviation of all the sample means (3.6s compared to 3.7s). It's as if, like a cell containing DNA, each sample encodes information about the entire population within it.
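The same effect can be reproduced with synthetic data. The following is a minimal sketch, not the chapter's dwell-time analysis: it assumes a normally distributed population with a mean of 90 and a standard deviation of 30 (both invented for the example), draws 1,000 samples of 100 values each, and compares the spread of the sample means with the standard error estimated from the first sample alone. All names are hypothetical.

```clojure
;; Illustrative simulation with synthetic data (not the dwell-times file).
(defn mean [xs] (/ (reduce + xs) (double (count xs))))

(defn sd
  "Sample standard deviation with the n - 1 denominator."
  [xs]
  (let [m (mean xs)]
    (Math/sqrt (/ (reduce + (map #(Math/pow (- % m) 2) xs))
                  (dec (count xs))))))

(defn draw-sample
  "Draws n values from an assumed normal population (mean 90, sd 30)."
  [^java.util.Random rng n]
  (vec (repeatedly n #(+ 90.0 (* 30.0 (.nextGaussian rng))))))

(def rng (java.util.Random. 42)) ; fixed seed so runs are repeatable
(def samples (vec (repeatedly 1000 #(draw-sample rng 100))))

;; Spread of the 1,000 sample means: the quantity the standard error estimates.
(def sd-of-means (sd (map mean samples)))

;; Standard error estimated from the first sample alone: Sx / sqrt(n).
(def se-from-one
  (let [xs (first samples)]
    (/ (sd xs) (Math/sqrt (count xs)))))

(println sd-of-means se-from-one)
```

Under these assumptions the theoretical standard error is 30 / sqrt(100) = 3, so both printed values should land near 3, with the single-sample estimate varying somewhat more from seed to seed.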