Perhaps the most important uses of statistical reasoning are
in dealing with samples. For example, a pollster who wants to
find out how many people in Hong Kong are satisfied with the
performance of the Chief Executive doesn't interview everyone
in Hong Kong, for obvious reasons. Instead, the pollster
interviews a *sample* of Hong Kong residents, and uses
statistical techniques to draw conclusions about the
population as a whole.

But sampling isn't limited to surveys and opinion polls. In manufacturing industries, sampling is used in quality control; for example, from the number of defects in a sample of electrical components, one can infer the overall reliability of the manufacturing process. In medical research, sampling is used to identify health risks; for example, from the difference in heart disease rates between a sample of smokers and a sample of non-smokers, one can infer the overall effect of smoking on the risk of developing heart disease. In general, sampling is necessary in any case where examining every member of the relevant population of people or things would be too expensive or too time-consuming.

In the next few sections, we are going to look at reasoning
from samples, and some common ways it can go wrong. Ideally,
what we want from a sample is that the properties we are
interested in are the same in the sample as in the whole
population. For example, the pollster hopes that the
proportion of people who are satisfied with the Chief
Executive's performance is the same in the sample as in the
population of Hong Kong as a whole. In general, this kind of
assumption can fail in two different ways--*bias* and
*sampling error*. We will examine them in turn.

The next few sections cite several statistical results without
proof. If you want to see where they come from, see any
statistics text (e.g. Larry Gonick and Woollcott Smith
(1993), *The Cartoon Guide to Statistics*. New York:
HarperCollins).

We study samples to learn about a population. In this
context, the *population* is just the set of things we
are interested in; they could be people, but they could also
be companies or fish or door-handles. The *sample* is the
subset of the population that we actually investigate. We
collect data from the sample, and calculate the figure we are
interested in--say, the mean number of employees in a sample
of Hong Kong companies, or the mean mercury level in a sample
of locally caught fish. We want to conclude that the figure
applies also to the population as a whole--that we now know
something about the mean number of employees in Hong Kong
companies generally, or the mean mercury level in locally
caught fish generally. Under what circumstances are such
inferences justified?

*Part* of the answer is that the sample must be *random*.
(For the rest of the answer, you'll have to wait
until the next section.) This doesn't mean that the sample is
chosen in a haphazard way; often it takes a lot of care to
make sure that a sample is random. What it means is that each
item in the population has an equal probability of being
included in the sample. (A random sample is sometimes also
called a *representative* sample, although this name is
somewhat misleading, since a random sample can fail to
accurately represent the population, as we will see in the
next section.)
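
The idea of equal inclusion probability can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of company sizes is made up, and the helper name `simple_random_sample` is hypothetical. The standard library call `random.Random.sample` draws without replacement in such a way that every member of the population is equally likely to be included.

```python
import random

def simple_random_sample(population, k, seed=None):
    """Draw k members so that every member is equally likely to be included."""
    rng = random.Random(seed)
    return rng.sample(population, k)

# Hypothetical population: employee counts for 10,000 made-up companies.
rng = random.Random(0)
population = [rng.randint(1, 500) for _ in range(10_000)]

sample = simple_random_sample(population, 200, seed=1)

population_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

The two means will usually be close but not identical. How close they can be expected to be is the subject of the next section.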

A sample which is not random is called *biased*. In a
biased sample, some members of the population have a greater
chance of being included in the sample than others. Because
of this, any figures calculated on the basis of the sample may
not be applicable to the population as a whole. For example,
suppose we collect a sample of 100 fish from the waters next
to an industrial area, and calculate their mean mercury level.
Clearly such a sample is biased, since all the fish living
away from the industrial area have a zero chance of being
included in the sample. One might well expect that the
mercury level of fish living close to the industrial area will
be higher, on average, than that of fish living elsewhere.
This may or may not in fact be the case, but since it could be
true, the mercury level we have calculated is not necessarily
a good guide to the population as a whole.
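
A small simulation may make the point concrete. Everything in it is assumed for illustration: the numbers of fish, the mercury levels, and in particular the assumption that fish near the industrial area carry more mercury. It compares a sample drawn only from the industrial area with a random sample drawn from the whole population.

```python
import random

rng = random.Random(42)

# Hypothetical population of fish: (region, mercury level in ppm).
# The higher mercury level near the industrial area is an assumption
# built in purely for illustration.
fish = (
    [("industrial", rng.gauss(0.60, 0.10)) for _ in range(1_000)]
    + [("elsewhere", rng.gauss(0.30, 0.10)) for _ in range(9_000)]
)

def mean_mercury(sample):
    """Mean mercury level of a list of (region, level) pairs."""
    return sum(level for _, level in sample) / len(sample)

# Biased sample: only fish near the industrial area can be chosen.
near_industry = [f for f in fish if f[0] == "industrial"]
biased_sample = rng.sample(near_industry, 100)

# Random sample: every fish in the population is equally likely to be chosen.
random_sample = rng.sample(fish, 100)

print(f"population mean:    {mean_mercury(fish):.2f}")
print(f"biased sample mean: {mean_mercury(biased_sample):.2f}")
print(f"random sample mean: {mean_mercury(random_sample):.2f}")
```

Under these assumptions the biased sample overstates the population mean, while the random sample lands near it. If the two regions had the same mercury levels, the biased sample would do no harm, but we could not know that from the biased sample alone.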

Now suppose you take one netful of fish from every square mile of the relevant area. Is your sample random now? Suppose your net has 3-inch holes. Then you won't catch any fish less than 3 inches long, and fish less than 3 inches thick have a smaller chance of being caught than fish over 3 inches thick. So strictly speaking, your sample of fish still isn't random, since some fish have a higher chance of being caught than others. Is this a problem? You may have no reason to think that small fish have different mercury levels than big fish, but of course, it's possible. The beauty of a truly random sample is that it doesn't matter what other factors might be correlated with mercury level in fish, since each fish has an equal chance of being caught.
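
The net example can also be simulated. The sketch below is hypothetical throughout: the function `catch_probability` is an invented model of a net that lets small fish escape, and the two scenarios (mercury unrelated to length, mercury rising with length) are assumptions chosen to show when unequal catch chances matter.

```python
import random

rng = random.Random(7)

def catch_probability(length_in):
    """Hypothetical net: fish under 3 inches always escape; longer fish are caught more often."""
    if length_in < 3:
        return 0.0
    return min(1.0, length_in / 12)

def netted_mean(fish):
    """Mean mercury level of the fish the net happens to catch."""
    caught = [m for length, m in fish if rng.random() < catch_probability(length)]
    return sum(caught) / len(caught)

def population_mean(fish):
    """Mean mercury level of every fish, caught or not."""
    return sum(m for _, m in fish) / len(fish)

# Made-up fish lengths in inches.
lengths = [rng.uniform(1, 12) for _ in range(20_000)]

# Scenario A: mercury level unrelated to length.
unrelated = [(length, rng.gauss(0.30, 0.05)) for length in lengths]
# Scenario B: mercury level rises with length (assumed for illustration).
related = [(length, 0.10 + 0.03 * length + rng.gauss(0, 0.05)) for length in lengths]

gap_unrelated = netted_mean(unrelated) - population_mean(unrelated)
gap_related = netted_mean(related) - population_mean(related)

print(f"unrelated: netted mean minus population mean = {gap_unrelated:+.3f}")
print(f"related:   netted mean minus population mean = {gap_related:+.3f}")
```

In scenario A the net's preference for big fish does little damage, because size tells us nothing about mercury. In scenario B the same net overstates the mean. A truly random sample would be safe in both scenarios, without our needing to know which one holds.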

As this example illustrates, getting a truly random sample is often very difficult. Not all sources of bias are equally serious, but it is always best to obtain a sample that is as random as possible, as only then are the statistical results of the next section fully justified.

Identify possible sources of bias in each of the following examples:

- A political poll is conducted by calling numbers picked at random from the telephone directory.
- A survey on views about redevelopment in a particular residential area is conducted by knocking on doors of a random sample of homes in that area.
- Tests to determine the incidence of hepatitis in the population are conducted on a random sample of people attending a blood donation centre.