
OpenCourseWare on critical thinking, logic, and creativity



Measuring the spread

Although the average is the most frequently used statistic for summarizing a set of data, it is often quite uninformative without some idea of how widely spread the data is: whether the data points are clustered tightly around the average, or whether they range far from it. Two ways of expressing data spread are the standard deviation and the interquartile range; the first is based on the mean, and the second is based on the median.

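The standard deviation is obtained, roughly speaking, by finding the average distance of the data points from the mean. But this has to be done in a particular way. The most obvious way of finding this average is to subtract the mean from each of the N data points, add the resulting numbers and divide by N. But this won't work. Why not? Because the points below the mean give negative differences that exactly cancel the positive differences from the points above it, so the sum is always zero, however widely the data are spread.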
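A small numerical check makes the problem with the obvious method concrete. The sketch below uses five made-up data points (they are not from this page) and shows that the plain differences from the mean cancel out.

```python
# Hypothetical data set, chosen only for illustration.
data = [2, 4, 4, 5, 10]

mean = sum(data) / len(data)  # 25 / 5 = 5.0

# Subtract the mean from each data point: some results are negative,
# some are positive.
deviations = [x - mean for x in data]  # [-3.0, -1.0, -1.0, 0.0, 5.0]

# Adding them up and dividing by N gives zero, whatever the spread.
average_deviation = sum(deviations) / len(data)
print(average_deviation)  # 0.0
```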

So instead, the standard deviation is calculated using the following recipe:

  • For each of your N data points, calculate the square of the distance between the data point and the mean.
  • Add these numbers together.
  • Divide the result by N - 1.
  • Take the square root.
This recipe is rather involved; fortunately, most calculators can compute a standard deviation for you.
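As a rough illustration, the recipe above can be written out step by step in Python. The function name standard_deviation and the five data points are made up for this sketch. The result is compared with statistics.stdev from Python's standard library, which also divides by N - 1.

```python
import math
import statistics


def standard_deviation(data):
    """Follow the four-step recipe: square, add, divide by N - 1, root."""
    n = len(data)
    mean = sum(data) / n
    # Step 1: squared distance between each data point and the mean.
    squared_distances = [(x - mean) ** 2 for x in data]
    # Step 2: add these numbers together.
    total = sum(squared_distances)
    # Step 3: divide the result by N - 1.
    variance = total / (n - 1)
    # Step 4: take the square root.
    return math.sqrt(variance)


# Hypothetical data set, chosen only for illustration.
data = [2, 4, 4, 5, 10]
sd = standard_deviation(data)
print(sd)                      # 3.0
print(statistics.stdev(data))  # library result, for comparison
```

Here the mean is 5, the squared distances are 9, 1, 1, 0 and 25, their sum is 36, and 36 divided by 4 is 9, whose square root is 3.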

Despite the complications of calculating it, the standard deviation is the most commonly cited measure of spread. This is because it is used in many statistical techniques, and it has useful connections to some common ways in which data points are distributed. For example, in many real-life situations, the distribution of data points approximately follows what is known as the normal distribution (or bell curve). For data distributed in this way, it can be shown that about 68% of the data (roughly two-thirds) lie within one standard deviation of the mean, about 95% lie within two standard deviations of the mean, and about 99.7% lie within three standard deviations of the mean. The number of standard deviations from the mean for a particular result can be used as a measure of how unusual that result is; it is not terribly unusual to get a result over one standard deviation from the mean, but it is quite unusual to get a result over two standard deviations from the mean, and very unusual to get a result over three standard deviations from the mean. We will come back to this topic later.
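The percentages quoted for the normal distribution can be reproduced with a short calculation. The sketch below assumes an exactly normal distribution and uses the standard identity that the fraction within k standard deviations of the mean is erf(k / sqrt(2)). The function names and the example figures passed to them are hypothetical.

```python
import math


def fraction_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


def standard_deviations_from_mean(value, mean, sd):
    """How many standard deviations a result lies above (+) or below (-) the mean."""
    return (value - mean) / sd


for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973

# Hypothetical example: a result of 11 when the mean is 5 and the
# standard deviation is 3 lies two standard deviations above the mean.
print(standard_deviations_from_mean(11, 5, 3))  # 2.0
```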

The interquartile range is based on the median of a data set. Remember that the median m divides the data set in two; half the data points are below it and half the data points are above it. Now take the points below m (including m, if it is a data point) and find their median. Call this the first quartile. Then take the points above m (including m, if it is a data point) and find their median. Call this the third quartile. The second quartile is m itself. What we have done is to divide the data into four roughly equal parts; about a quarter of the data points are below the first quartile, a quarter are between the first and second quartiles, a quarter are between the second and third, and a quarter are above the third quartile. The interquartile range, as its name implies, is the distance between the first and third quartiles.
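The procedure just described can be sketched in Python as follows. The function name quartiles and the seven data points are made up for illustration. One assumption is labeled in the code: the data are split by position in the sorted list, so m is shared by both halves only when the number of points is odd. Calculators and statistics packages use several slightly different quartile conventions, so their answers may differ a little from this one.

```python
import statistics


def quartiles(data):
    """Return (first quartile, median, third quartile) using the
    median-of-each-half method, with m included in both halves
    when it is itself a data point."""
    points = sorted(data)
    n = len(points)
    m = statistics.median(points)
    half = n // 2
    if n % 2 == 1:
        # Odd N: the median is the middle data point, so include it
        # in both halves (assumption: halves are split by position).
        lower = points[: half + 1]
        upper = points[half:]
    else:
        # Even N: the median falls between two data points.
        lower = points[:half]
        upper = points[half:]
    return statistics.median(lower), m, statistics.median(upper)


# Hypothetical data set, chosen only for illustration.
data = [7, 1, 10, 4, 3, 8, 6]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3)  # 3.5 6 7.5
print(iqr)         # 4.0
```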

The advantage of the interquartile range is that it is easy to calculate and easy to visualize; about half the data points fall within the range. However, it does not have the nice connections to other statistical techniques that the standard deviation has.

For large data sets, finer-grained distinctions into percentiles are sometimes used. Percentiles are like quartiles, but they divide the data set into 100 equal parts. For example, the 34th percentile of a data set is the value such that 34% of the data points are below it and 66% are above it. The 50th percentile is the median.
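Percentiles can be computed with statistics.quantiles from Python's standard library (Python 3.8 or later), which returns the 99 cut points that divide a data set into 100 parts. The sketch below uses a made-up data set of the whole numbers 0 to 100. The choice of method="inclusive" is an assumption made for this example; other interpolation conventions give slightly different values.

```python
import statistics

# Hypothetical data set: the 101 whole numbers from 0 to 100.
data = list(range(101))

# 99 cut points; cut_points[k - 1] is the k-th percentile.
cut_points = statistics.quantiles(data, n=100, method="inclusive")

p34 = cut_points[33]  # 34th percentile
p50 = cut_points[49]  # 50th percentile

# Share of data points strictly below the 34th percentile.
share_below = sum(1 for x in data if x < p34) / len(data)

print(p34)                      # 34.0
print(round(share_below, 3))    # 0.337, roughly 34%
print(p50 == statistics.median(data))  # True: the 50th percentile is the median
```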

  • Self-test question:
The following table shows (approximate) monthly market turnover for the Hong Kong Stock Exchange for 1999. (Source: Hong Kong Securities and Futures Commission).
Month        Turnover (billion shares)
January       50
February      35
March         75
April        100
May          130
June         120
July         135
August        85
September    215
October      125
November     145
December     190
Calculate the mean and the standard deviation. Find the median and the interquartile range. How many standard deviations below the mean is the February figure? How many standard deviations above the mean is the September figure?

