Data aren't only summarized by means of graphs and diagrams; quite frequently, data are summarized using numbers. The most frequently cited number is the average of the data. For example, the 25 June, 2000 issue of the South China Morning Post (which just happened to be close at hand) cites averages in every section. In news features, researchers interviewing drug users "found that 18 per cent had shared needles or syringes with three other people on average" ("HIV rate among addicts sharing needles soars", p. 3). In business features, Thai officials report that "more than 360,000 tourists came for golf holidays last year, spending an average of 8,000 baht a day, almost twice that of the average visitor" ("Cheap health care may yet provide the biggest tourist lure for Thailand", p. 4). In sports, "ever since making a full debut against Brazil in 1994, as a 21-year-old, Milosevic has averaged something very close to a goal every two games" ("Yugoslav Villan turned hero", p. 14). Even the weather report tells us that "total rainfall since January 1st is 1,334.5 mm. against an average of 926.6 mm." (p. 2).
When newspapers talk about the average value of some quantity, they are almost always referring to what statisticians call the mean. The mean of a set of N numbers is the sum of the numbers, divided by N. So, for example, the mean of the data set {2, 5, 5, 8, 10} is (2 + 5 + 5 + 8 + 10) / 5 = 6. The mean gives an idea of where the "middle" of the data set lies.
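The definition of the mean can be written out as a short Python sketch. The function name `mean` and the variable `data` are chosen here for illustration and are not part of the original text.

```python
def mean(values):
    """Return the sum of the values divided by how many values there are."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


data = [2, 5, 5, 8, 10]
print(mean(data))  # 6.0
```

Python's standard library offers the same calculation as `statistics.mean`, so in practice there is rarely a need to write it by hand.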
However, the mean is not the only way to express the "middle" of a set of data. Another way is to cite the median of the data. The median is quite literally the middle value; list the numbers in the data set in increasing order, and the median is the middle one. So, for example, the median of the data set {2, 5, 5, 8, 10} is 5. If the size of the data set is even, then there are two numbers in the middle, and the median is the mean of these two numbers. So, for example, the median of {2, 5, 5, 8, 10, 12} is (5 + 8) / 2 = 6.5.
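The two cases (odd and even data set size) can be sketched in Python as follows. The function name `median` is an illustrative choice; the steps simply follow the description above: sort the values, then take either the middle one or the mean of the two middle ones.

```python
def median(values):
    """Return the middle value of the data, or the mean of the two middle values."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    ordered = sorted(values)          # list the numbers in increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd size: a single middle value
        return ordered[mid]
    # even size: the mean of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 5, 5, 8, 10]))      # 5
print(median([2, 5, 5, 8, 10, 12]))  # 6.5
```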
The reason that the mean is most often used to typify a data set is that it has a central place in the theoretical machinery of statistics, and is closely connected with concepts such as probability and expected value. For example, suppose you play a game in which you toss a fair coin, and you win $2 if the coin lands heads and lose $1 if the coin lands tails. The probability of each outcome is 0.5, so the expected value of the game is (0.5 × $2) − (0.5 × $1) = $0.50. The concept of expected value is related to that of the mean in that if you play the game over and over again, your mean winnings per game will eventually get closer and closer to the expected value.
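The link between expected value and mean winnings can be explored with a small simulation. This is only a sketch: the function names `expected_value` and `mean_winnings`, the seed, and the numbers of games are all arbitrary choices made here, and the simulated means will vary from one seed to another.

```python
import random


def expected_value(outcomes):
    """Sum of probability times payoff over all (probability, payoff) pairs."""
    return sum(p * payoff for p, payoff in outcomes)


def mean_winnings(n_games, rng):
    """Play the coin game n_games times and return the mean winnings per game."""
    total = 0
    for _ in range(n_games):
        # Heads (probability 0.5) wins $2; tails loses $1.
        total += 2 if rng.random() < 0.5 else -1
    return total / n_games


game = [(0.5, 2), (0.5, -1)]
print(expected_value(game))  # 0.5

rng = random.Random(2000)    # fixed seed so that repeated runs agree
for n in (10, 1_000, 100_000):
    print(n, mean_winnings(n, rng))
```

With only 10 games the mean winnings can be far from $0.50; with 100,000 games they should typically be very close to it.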
Despite these advantages, the mean is not always the best way to summarize a data set. One advantage of the median over the mean is that it is not sensitive to outliers. An outlier is an extreme value which is exceptional in some way, and hence not representative of the quantity you are interested in. As we saw before, the mean of the data set {2, 5, 5, 8, 10} is 6, and the median is 5. If we change the data set to {2, 5, 5, 8, 100}, the mean rises to 24, but the median remains unchanged at 5. In many cases, when the data set contains outliers the median provides a better way of summarizing the data. For example, perhaps the data represent the number of aeroplane flights taken by a sample of Hong Kong residents in the past year, and the figure of 100 comes from a professional pilot; in this case, the median of 5 flights per year is probably more representative of the population at large than the mean of 24 flights per year. More examples can be found in the following self-test questions.
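The effect of the outlier is easy to reproduce using Python's standard `statistics` module. The variable names below are illustrative only.

```python
from statistics import fmean, median

original = [2, 5, 5, 8, 10]
with_outlier = [2, 5, 5, 8, 100]   # the largest value replaced by an outlier

print(fmean(original), median(original))          # 6.0 5
print(fmean(with_outlier), median(with_outlier))  # 24.0 5
```

Replacing a single value quadruples the mean, while the median does not move at all.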
| Student number | Salary (HK$/month) |
|---|---|
| 00001 | 14,000 |
| 00002 | 14,500 |
| 00003 | 14,000 |
| 00004 | 16,000 |
| 00005 | 19,000 |
| 00006 | 12,000 |
| 00007 | 15,500 |
| 00008 | 86,500 |
| 00009 | 16,500 |
| 00010 | 13,000 |

Calculate the mean and median values for these data. Why are they so different? Which is likely to be the better way to summarize the data?
2. The following table shows students' marks for a particular coursework assignment (the figures are invented):

| Student number | Mark |
|---|---|
| 00001 | 59 |
| 00002 | 61 |
| 00003 | 57 |
| 00004 | 0 |
| 00005 | 51 |
| 00006 | 64 |
| 00007 | 70 |
| 00008 | 0 |
| 00009 | 55 |
| 00010 | 0 |

Calculate the mean and median values for these data. Why are they so different? Suppose there is a policy that the mean mark for each assignment must be close to 58; if the mean is more than 5 points below 58, a fixed quantity is added to each student's mark so that the mean becomes 58. What does the policy require in this case? Does it seem like the right response in this instance?
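The adjustment policy described in the question can be sketched as a small Python function. The name `adjust_marks`, its parameters, and the two sample mark lists are hypothetical and chosen for illustration; the sample lists are deliberately not the marks from the table, so the question is left for you to work out.

```python
def adjust_marks(marks, target=58, tolerance=5):
    """Apply the policy: if the mean is more than `tolerance` points below
    `target`, add the same amount to every mark so the mean becomes `target`."""
    mean = sum(marks) / len(marks)
    shortfall = target - mean
    if shortfall > tolerance:
        return [mark + shortfall for mark in marks]
    return list(marks)                 # mean is close enough: no change


# Hypothetical marks, not taken from the table above.
print(adjust_marks([40, 50, 60]))      # [48.0, 58.0, 68.0]
print(adjust_marks([55, 56, 57]))      # [55, 56, 57]
```

Adding the same quantity to every mark raises the mean by exactly that quantity, so the policy changes where the marks sit but not how spread out they are.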
The ultimate court of appeal is observation and experiment... not authority.

Thomas Henry Huxley