Module: Basic statistics
Quote of the page
Science is built up of facts, as a house is built of stones; but an accumulation of facts is no more a science than a heap of stones is a house.
- Henri Poincaré
Perhaps the most important uses of statistical reasoning are in dealing with samples. For example, a pollster who wants to find out how many people in Hong Kong are satisfied with the performance of the Chief Executive doesn't interview everyone in Hong Kong, for obvious reasons. Instead, the pollster interviews a sample of Hong Kong residents, and uses statistical techniques to draw conclusions about the population as a whole.
But sampling isn't limited to surveys and opinion polls. In manufacturing industries, sampling is used in quality control; for example, from the number of defects in a sample of electrical components, one can infer the overall reliability of the manufacturing process. In medical research, sampling is used to identify health risks; for example, from the difference in heart disease rates between a sample of smokers and a sample of non-smokers, one can infer the overall effect of smoking on the risk of developing heart disease. In general, sampling is necessary in any case where examining every member of the relevant population of people or things would be too expensive or too time consuming.
In the next few sections, we are going to look at reasoning from samples, and some common ways it can go wrong. Ideally, the properties we are interested in should be the same in the sample as in the whole population. For example, the pollster hopes that the proportion of people who are satisfied with the Chief Executive's performance is the same in the sample as in the population of Hong Kong as a whole. In general, this kind of assumption can fail in two different ways: bias and sampling error. We will examine them in turn.
The next few sections cite several statistical results without proof. If you want to see where they come from, see any statistics text (e.g. Larry Gonick and Woollcott Smith (1993), The Cartoon Guide to Statistics. New York: HarperCollins).
We study samples to learn about a population. In this context, the population is just the set of things we are interested in; they could be people, but they could also be companies or fish or door-handles. The sample is the subset of the population that we actually investigate. We collect data from the sample, and calculate the figure we are interested in--say, the mean number of employees in a sample of Hong Kong companies, or the mean mercury level in a sample of locally caught fish. We want to conclude that the figure applies also to the population as a whole--that we now know something about the mean number of employees in Hong Kong companies generally, or the mean mercury level in locally caught fish generally. Under what circumstances are such inferences justified?
Part of the answer is that the sample must be random. (For the rest of the answer, you'll have to wait until the next section.) This doesn't mean that the sample is chosen in a haphazard way; often it takes a lot of care to make sure that a sample is random. What it means is that each item in the population has an equal probability of being included in the sample. (A random sample is sometimes also called a representative sample, although this name is somewhat misleading, since a random sample can fail to accurately represent the population, as we will see in the next section.)
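To make the definition concrete, here is a minimal sketch in Python of drawing a simple random sample, in which every item has the same chance of being chosen. The population of 1,000 numbered companies and the helper name simple_random_sample are invented for illustration; the sketch also assumes you have a complete list of the population to draw from, which in practice is often the hard part.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k distinct items so that every item is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(list(population), k)

# Hypothetical population: 1,000 companies identified by number.
companies = range(1000)

sample = simple_random_sample(companies, 50, seed=1)
print(len(sample), "companies drawn, for example:", sorted(sample)[:5])
```

Note that the care goes into the list, not the draw: if some companies are missing from the list, they have zero chance of being included, and the sample is no longer random in the sense defined above.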
A sample which is not random is called biased. In a biased sample, some members of the population have a greater chance of being included in the sample than others. Because of this, any figures calculated on the basis of the sample may not be applicable to the population as a whole. For example, suppose we collect a sample of 100 fish from the waters next to an industrial area, and calculate their mean mercury level. Clearly such a sample is biased, since all the fish living away from the industrial area have zero chance of being included in the sample. One might well expect that the mercury level of fish living close to the industrial area will be higher, on average, than that of fish living elsewhere. This may or may not in fact be the case, but since it could be true, the mercury level we have calculated is not necessarily a good guide to the population as a whole.
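The effect of this kind of bias can be illustrated with a small simulation. The sketch below is in Python, and every number in it is invented: it simply assumes that fish near the industrial area do carry more mercury than fish elsewhere, which, as noted above, may or may not be true of real fish. It then compares the mean from a sample taken only near the industrial area with the mean from a random sample of the whole population.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population of 10,000 fish. The mercury levels are made up:
# we ASSUME fish near the industrial area average 0.8 and others 0.3.
near = [rng.gauss(0.8, 0.1) for _ in range(1000)]
far = [rng.gauss(0.3, 0.1) for _ in range(9000)]
population = near + far

# Biased sample: 100 fish, all caught next to the industrial area.
biased_sample = rng.sample(near, 100)

# Random sample: 100 fish, each fish in the population equally likely.
random_sample = rng.sample(population, 100)

population_mean = mean(population)
biased_mean = mean(biased_sample)
random_mean = mean(random_sample)

print(f"population mean:    {population_mean:.2f}")
print(f"biased sample mean: {biased_mean:.2f}")
print(f"random sample mean: {random_mean:.2f}")
```

Under these assumptions the biased sample badly overstates the population mean, while the random sample lands close to it. If the two groups of fish had the same mercury levels, the biased sample would do no harm, but you could not know that from the biased sample alone.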
Now suppose you take one netful of fish from every square mile of the relevant area. Is your sample now random? Suppose your net has 3-inch holes. Then you won't catch any fish smaller than 3 inches long, and fish less than 3 inches thick have a smaller chance of being caught than fish over 3 inches thick. So strictly speaking, your sample of fish still isn't random, since some fish have a higher chance of being caught than others. Is this a problem? You may have no reason to think that small fish have different mercury levels than big fish, but of course, it's possible. The beauty of a truly random sample is that it doesn't matter what other factors might be correlated with mercury level in fish, since each fish has an equal chance of being caught.
As this example illustrates, getting a truly random sample is often very difficult. Not all sources of bias are equally serious, but it is always best to obtain a sample that is as random as possible, as only then are the statistical results of the next section fully justified.
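The net example can be sketched in the same way. In the Python sketch below, the fish lengths and the catch_probability function are hypothetical: we assume the net never holds a fish shorter than 3 inches, holds fish between 3 and 6 inches half the time, and always holds larger fish. The point is only to show how unequal chances of being caught change the make-up of the sample.

```python
import random
from statistics import mean

rng = random.Random(7)  # fixed seed so the run is repeatable

# Hypothetical fish lengths in inches, spread evenly between 1 and 12.
lengths = [rng.uniform(1.0, 12.0) for _ in range(5000)]

def catch_probability(length, hole=3.0):
    """ASSUMED chance that a net with the given hole size holds a fish."""
    if length < hole:
        return 0.0   # slips straight through the net
    if length < 2 * hole:
        return 0.5   # sometimes wriggles through
    return 1.0       # always caught

# Each fish is caught with a probability that depends on its length.
caught = [x for x in lengths if rng.random() < catch_probability(x)]

print(f"mean length, all fish:    {mean(lengths):.1f} inches")
print(f"mean length, caught fish: {mean(caught):.1f} inches")
print(f"shortest fish caught:     {min(caught):.1f} inches")
```

The caught fish are longer on average than the population, so any property that happens to vary with length, possibly including mercury level, would be misestimated from this sample.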
Identify possible sources of bias in each of the following examples: