[T08] Samples and biases


§1. Samples

Perhaps the most important uses of statistical reasoning are in dealing with samples. For example, a pollster who wants to find out how many people in Hong Kong are satisfied with the performance of the Chief Executive doesn't interview everyone in Hong Kong, for obvious reasons. Instead, the pollster interviews a sample of Hong Kong residents, and uses statistical techniques to draw conclusions about the population as a whole.

But sampling isn't limited to surveys and opinion polls. In manufacturing industries, sampling is used in quality control; for example, from the number of defects in a sample of electrical components, one can infer the overall reliability of the manufacturing process. In medical research, sampling is used to identify health risks; for example, from the difference in heart disease rates between a sample of smokers and a sample of non-smokers, one can infer the overall effect of smoking on the risk of developing heart disease. In general, sampling is necessary whenever examining every member of the relevant population of people or things would be too expensive or too time-consuming.

In the next few sections, we are going to look at reasoning from samples, and some common ways it can go wrong. Ideally, what one wants from a sample is that the properties one is interested in are the same in the sample as in the whole population. For example, the pollster hopes that the proportion of people who are satisfied with the Chief Executive's performance is the same in the sample as in the population of Hong Kong as a whole. In general, this kind of assumption can fail in two different ways--bias and sampling error. We will examine them in turn.

The next few sections cite several statistical results without proof. If you want to see where they come from, see any statistics text (e.g. Larry Gonick and Woollcott Smith (1993), The Cartoon Guide to Statistics. New York: HarperCollins).

§2. Bias

We study samples to learn about a population. In this context, the population is just the set of things we are interested in; they could be people, but they could also be companies or fish or door-handles. The sample is the subset of the population that we actually investigate. We collect data from the sample, and calculate the figure we are interested in--say, the mean number of employees in a sample of Hong Kong companies, or the mean mercury level in a sample of locally caught fish. We want to conclude that the figure applies also to the population as a whole--that we now know something about the mean number of employees in Hong Kong companies generally, or the mean mercury level in locally caught fish generally. Under what circumstances are such inferences justified?
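To make the terminology concrete, here is a small sketch in Python. The employee counts below are invented purely for illustration; the point is only the distinction between the population, the sample, and the figure calculated from the sample.

    import random

    # Hypothetical population: number of employees at each of 5,000 companies.
    # The figures are invented for illustration only.
    random.seed(1)
    population = [random.randint(1, 500) for _ in range(5000)]

    # The sample is the subset of the population we actually investigate.
    sample = random.sample(population, 100)

    # The figure we are interested in: the mean number of employees.
    sample_mean = sum(sample) / len(sample)
    population_mean = sum(population) / len(population)

    print("mean in the sample:    ", round(sample_mean, 1))
    print("mean in the population:", round(population_mean, 1))

The question the next two sections address is when the figure calculated from the sample is a trustworthy guide to the corresponding figure for the population.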

Part of the answer is that the sample must be random. (For the rest of the answer, you'll have to wait until the next section.) This doesn't mean that the sample is chosen in a haphazard way; often it takes a lot of care to make sure that a sample is random. What it means is that each item in the population has an equal probability of being included in the sample. (A random sample is sometimes also called a representative sample, although this name is somewhat misleading, since a random sample can fail to accurately represent the population, as we will see in the next section.)
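One way to see what "equal probability of being included" amounts to is to draw many random samples from a small population and count how often each item turns up. The sketch below is illustrative only; it uses Python's random.sample, which gives every item the same chance of selection, so over many repetitions each item is included at roughly the same rate.

    import random
    from collections import Counter

    population = list(range(10))   # ten items, labelled 0 to 9
    counts = Counter()

    # Draw 10,000 random samples of size 3 and record which items appear.
    random.seed(2)
    for _ in range(10_000):
        counts.update(random.sample(population, 3))

    # Each item should appear in roughly 3/10 of the samples (about 3,000 times).
    for item in population:
        print(item, counts[item])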

A sample which is not random is called biased. In a biased sample, some members of the population have a greater chance of being included in the sample than others. Because of this, any figures calculated on the basis of the sample may not be applicable to the population as a whole. For example, suppose we collect a sample of 100 fish from the waters next to an industrial area, and calculate their mean mercury level. Clearly such a sample is biased, since all the fish living away from the industrial area have a zero chance of being included in the sample. One might well expect that the mercury level of fish living close to the industrial area will be higher, on average, than that of fish living elsewhere. This may or may not in fact be the case, but since it could be true, the mercury level we have calculated is not necessarily a good guide to the population as a whole.

Now suppose you take one netful of fish from every square mile of the relevant area. Now is your sample random? Suppose your net has 3-inch holes. Then you won't catch any fish smaller than 3 inches long, and fish less than 3 inches thick have a smaller chance of being caught than fish over 3 inches thick. So strictly speaking, your sample of fish still isn't random, since some fish have a higher chance of being caught than others. Is this a problem? You may have no reason to think that small fish have different mercury levels from big fish, but of course, it's possible. The beauty of a truly random sample is that it doesn't matter what other factors might be correlated with mercury level in fish, since each fish has an equal chance of being caught.
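The following simulation illustrates how unequal inclusion probabilities can distort a figure. The numbers are invented: suppose small fish are less likely to be caught by the net, and suppose, for the sake of the example, that small fish also tend to have lower mercury levels. Then the mean mercury level of the catch overstates the mean for the population.

    import random

    random.seed(3)

    # Hypothetical population of 10,000 fish: (thickness in inches, mercury in ppm).
    # Purely illustrative: smaller fish are given lower mercury levels on average.
    population = []
    for _ in range(10_000):
        thickness = random.uniform(1, 6)
        mercury = 0.1 + 0.05 * thickness + random.gauss(0, 0.02)
        population.append((thickness, mercury))

    # Biased sampling: fish thinner than 3 inches are caught only 20% of the time.
    catch = [fish for fish in population
             if fish[0] >= 3 or random.random() < 0.2]

    def mean_mercury(fish_list):
        return sum(m for _, m in fish_list) / len(fish_list)

    print("mean mercury in the population:", round(mean_mercury(population), 3))
    print("mean mercury in the catch:     ", round(mean_mercury(catch), 3))

With a truly random sample, where every fish has the same chance of being caught, the two figures would agree apart from sampling error, which is the topic of the next section.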

As this example illustrates, getting a truly random sample is often very difficult. Not all sources of bias are equally serious, but it is always best to obtain a sample that is as random as possible, as only then are the statistical results of the next section fully justified.

Identify possible sources of bias in each of the following examples:

  1. A political poll is conducted by calling numbers picked at random from the telephone directory.
  2. A survey on views about redevelopment in a particular residential area is conducted by knocking on doors of a random sample of homes in that area.
  3. Tests to determine the incidence of hepatitis in the population are conducted on a random sample of people attending a blood donation centre.


© 2004-2024 Joe Lau & Jonathan Chan