Module: Basic statistics
Suppose we succeed in obtaining a truly random sample of fish from a given area of sea, and we calculate the mean mercury level in our sample. Will this level tell us the mean mercury level in local fish overall? Not necessarily. Suppose our random sample consists of two fish. What guarantee do we have that the mercury level in these two fish is representative of the mercury level in the population overall? It could well be that both of these fish just happen to have particularly high levels, or particularly low levels.
What if we take a sample of a hundred fish? Again, it's possible that all these fish have particularly high or particularly low mercury levels, in which case our sample mean doesn't accurately represent the population mean. But intuitively, this seems much less likely in the case of a hundred fish than in the case of two fish. If the fish are really chosen at random, it seems like we would have to be very unlucky to get a sample of fish which had a very different mean mercury level than the population at large.
In fact, this intuition can be given a rigorous statistical justification. The basic idea is that for a small sample, any quantity you calculate has a large sampling error. The sampling error provides a measure of how likely it is that the quantity you have calculated is a specified distance from the true population value. The larger the sample, the smaller the sampling error. Note that sampling error isn't an error in the sense of a mistake; every figure calculated from a sample has a sampling error, however good the sampling process. Due to the chance nature of random sampling, you can never completely eliminate the possibility that your sample is misleading. But if the sampling error is small enough, that tells you that it is unlikely that your calculated value is a long way from the true value in the population.
Several different statistics can be used as a measure of the sampling error, but the one you will encounter most frequently is called a confidence interval. For example, suppose you collect a random sample of 100 fish, measure their mercury levels, and calculate the mean and standard deviation of the results. Suppose you obtain a mean of 0.265 ppm (parts per million) and a standard deviation of 0.081 ppm. Your best estimate of the mean mercury level of the fish in the population as a whole is 0.265 ppm, but how far is this likely to be from the true population mean? Using the standard deviation of the sample, you can calculate a quantity called the standard error by dividing the standard deviation by √n, where n is the sample size. In our case, the standard error is 0.081/√100 = 0.081/10 = 0.0081 ppm. As long as the sample is reasonably large (say, more than 30), the 95% confidence interval is approximately an interval of two standard errors on either side of the sample mean. In our case, then, the 95% confidence interval for the mean mercury level in local fish is 0.265 ± 0.016 ppm.
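The arithmetic in this example can be checked with a short Python sketch. The function name confidence_interval_95 is made up for this illustration, and it uses the rule of thumb of two standard errors from the text rather than a more precise multiplier.

```python
import math

def confidence_interval_95(mean, std_dev, n):
    """Approximate 95% confidence interval for a mean (large-sample rule of thumb).

    Hypothetical helper for illustration: uses two standard errors
    on either side of the sample mean, as described in the text.
    """
    standard_error = std_dev / math.sqrt(n)  # standard deviation divided by sqrt(n)
    margin = 2 * standard_error              # two standard errors
    return standard_error, mean - margin, mean + margin

# Figures from the fish example: mean 0.265 ppm, standard deviation 0.081 ppm, n = 100
se, low, high = confidence_interval_95(0.265, 0.081, 100)
print(f"standard error: {se:.4f} ppm")        # standard error: 0.0081 ppm
print(f"95% CI: {low:.3f} to {high:.3f} ppm")  # 95% CI: 0.249 to 0.281 ppm
```

The interval endpoints, 0.249 ppm and 0.281 ppm, are just the sample mean minus and plus 0.016 ppm, rounded to three decimal places.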
What does the 95% confidence interval tell us? Often, statistics books say things like the following: We can have 95% confidence that the true population mean lies between 0.249 ppm and 0.281 ppm. But this is slightly misleading, as it suggests that confidence intervals are about a particular kind of feeling of confidence you should have towards your result. What the confidence interval actually means is that if you take random samples of 100 fish over and over again, then 95% of the time, the confidence interval will contain the true population mean. Only 5% of the time will the true population mean lie outside the confidence interval, so you would have to be quite unlucky for the confidence interval not to contain the true mean. How this is connected with any feeling of confidence you may have about your result is a difficult philosophical question.
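The "over and over again" reading of a confidence interval can be illustrated by simulation. The sketch below is only an illustration under invented assumptions: it pretends that the true population is normally distributed with mean 0.265 ppm and standard deviation 0.081 ppm (real mercury levels need not be), and the function name coverage_rate is hypothetical. It draws many random samples of 100 fish and counts how often the interval of two standard errors around the sample mean contains the true mean.

```python
import math
import random
import statistics

def coverage_rate(true_mean, true_sd, n, trials, seed=0):
    """Fraction of simulated samples whose 95% interval contains the true mean."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Assumption for illustration: the population is normally distributed
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        sample_mean = statistics.mean(sample)
        standard_error = statistics.stdev(sample) / math.sqrt(n)
        low = sample_mean - 2 * standard_error
        high = sample_mean + 2 * standard_error
        if low <= true_mean <= high:
            hits += 1
    return hits / trials

rate = coverage_rate(0.265, 0.081, n=100, trials=2000)
print(f"fraction of intervals containing the true mean: {rate:.3f}")
```

The printed fraction should come out close to 0.95. Each individual interval either contains the true mean or it does not; the 95% describes the procedure across repeated samples.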
Note that the width of the 95% confidence interval is proportional to 1/√n, where n is the sample size. In fact, this is generally true of all measures of sampling error. This provides a useful link between sampling error and sample size; for example, if you want to cut your sampling error in half, you need to multiply the size of your sample by four. Sampling error is purely a result of the size of the sample, unlike bias, which is a result of the way the sample is collected.
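The link between sample size and sampling error can be seen numerically. The following sketch assumes the standard deviation stays at the 0.081 ppm of the fish example while the sample size grows; the helper name standard_error is made up for this illustration.

```python
import math

def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation divided by sqrt(n)."""
    return std_dev / math.sqrt(n)

sd = 0.081  # ppm, taken from the fish example and assumed constant here

# Each step multiplies the sample size by four
for n in (100, 400, 1600):
    print(f"n = {n:5d}: standard error = {standard_error(sd, n):.5f} ppm")

# Ratio of the standard error at n = 400 to the standard error at n = 100
ratio = standard_error(sd, 400) / standard_error(sd, 100)
print(f"ratio: {ratio:.2f}")  # ratio: 0.50
```

Every fourfold increase in the sample size halves the standard error, and with it the width of the confidence interval.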
Confidence intervals are often found in reports of the results of surveys and political polls. For example, political polls often quote a "margin of error", usually of around 3 percentage points. What this means is that the 95% confidence interval is the cited figure plus or minus 3 percentage points. If the poll says that 52% of voters support Lee, then the 95% confidence interval is 52% ± 3 percentage points, that is, from 49% to 55%. In 95% of such polls, you can expect the true level of voter support to be within the confidence interval. Remember, though, that the confidence interval is just a measure of the sampling error; it assumes that the sampling method has ruled out all sources of bias, which, as we have seen, is very difficult to achieve.
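The text does not say how a poll's margin of error is calculated. One common approach, assuming a simple random sample, treats the standard error of a sample proportion p from n respondents as √(p(1 − p)/n) and takes two standard errors as the margin. The sketch below applies this to a poll of 1000 people showing 52% support; the function name margin_of_error is made up for this illustration, and real pollsters may use more elaborate methods.

```python
import math

def margin_of_error(p, n, multiplier=2.0):
    """Approximate 95% margin of error for a sample proportion.

    Assumes a simple random sample; the multiplier of 2 follows the
    rule of thumb of two standard errors used in the text.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    return multiplier * standard_error

moe = margin_of_error(0.52, 1000)
print(f"margin of error: {100 * moe:.1f} percentage points")  # margin of error: 3.2 percentage points
```

A poll of about a thousand people therefore gives a margin of error of roughly 3 percentage points, which matches the figure typically quoted.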
Suppose that Lee hires a pollster to estimate his support in the forthcoming election. A poll of 1000 people indicates that he has the support of 52% of people, with a margin of error of 3 percentage points. Lee would like a more accurate estimate. How many people would need to be polled in order to reduce the margin of error to 1 percentage point?