Tutorial: Measures of Dispersion

8.4: Measures of Dispersion

This tutorial: Part A: Variance and Standard Deviation of a Set of Scores

Next tutorial: Part B: Variance and Standard Deviation of a Random Variable

Based on Section 8.4 in Finite Mathematics and Finite Mathematics and Applied Calculus

Variance and Standard Deviation of a Set of Scores

Consider the following two sets of scores:

Set 1:

Set 2:

Both these sets have the same mean (50), but the second set is a lot more widely dispersed ("scattered") than the first.


[Figure: Set 1 and Set 2]

Q How do we measure the dispersion of a set of scores?
A Here's how to do it graphically: First, measure the distance of each point from the mean, square each distance, and then take the average of all those squared distances. This measurement is called the population variance of the set of scores.


Set 1: Sum of squared distances: 10² + 0² + 10² + 10² + 10² + 0² = 400. Population variance: 400/6 ≈ 66.67
Set 2: Sum of squared distances: 50² + 50² + 25² + 25² + 30² + 30² = 8050. Population variance: 8050/6 ≈ 1341.67

Q Is there a formula to measure this?
A Actually, there are a couple. First, notice that the signed distance of a typical score x from the mean μ is found by subtracting the mean from it: x - μ. Therefore, the squared distance is (x - μ)². We then get the average of these squared distances (the population variance) by adding them up and dividing by the number of scores.

Population Variance and Standard Deviation

The population variance (written σ²) is the average squared distance from the mean μ:

σ² = [(x₁ - μ)² + (x₂ - μ)² + ... + (x_n - μ)²] / n

The population standard deviation is the square root of the population variance, and is written as σ.

Example
The population mean of the scores {1, -1, 2, 3} is

μ = (1 - 1 + 2 + 3)/4 = 5/4 = 1.25

Its population variance and standard deviation are given by:

σ² = [(1 - 1.25)² + (-1 - 1.25)² + (2 - 1.25)² + (3 - 1.25)²] / 4
   = [(-0.25)² + (-2.25)² + 0.75² + 1.75²] / 4
   = 8.75/4
   = 2.1875	(population variance)

σ = √2.1875 ≈ 1.4790	(population standard deviation)
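
The calculation above can also be checked by machine. Here is a minimal Python sketch of the definition; the function names population_variance and population_std are our own choices for illustration and do not come from the textbook.

```python
import math

def population_variance(scores):
    """Average squared distance of the scores from their mean (divide by n)."""
    n = len(scores)
    mean = sum(scores) / n
    return sum((x - mean) ** 2 for x in scores) / n

def population_std(scores):
    """Square root of the population variance."""
    return math.sqrt(population_variance(scores))

scores = [1, -1, 2, 3]
print(population_variance(scores))       # 2.1875
print(round(population_std(scores), 4))  # 1.479
```

The same two functions can be used to check your answers to the exercise that follows.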

Here is one for you to try. The values you enter must be accurate to 4 decimal places:

Population: 0, 2, 4, 4, 10
μ =	σ² =	σ =

Sample Variance and Standard Deviation

The sample variance is the statistic we use when using a sample of scores instead of all of them (the whole population). The sample variance is written as s², and is computed in almost the same way as the population variance except that, instead of dividing by n, we divide by n-1:

s² = [(x₁ - x̄)² + (x₂ - x̄)² + ... + (x_n - x̄)²] / (n - 1)

Notice that the sample mean x̄ is computed in exactly the same way as the population mean μ -- we just use a different symbol for it.

The sample standard deviation is the square root of the sample variance, and is written as s.

Example
The mean of the sample {1, -1, 2, 3} is

x̄ = (1 - 1 + 2 + 3)/4 = 5/4 = 1.25

Its sample variance and standard deviation are given by:

s² = [(1 - 1.25)² + (-1 - 1.25)² + (2 - 1.25)² + (3 - 1.25)²] / (4 - 1)
   = [(-0.25)² + (-2.25)² + 0.75² + 1.75²] / 3
   = 8.75/3
   ≈ 2.91667	(sample variance)

s = √2.91667 ≈ 1.7078	(sample standard deviation)
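
If you would rather not code the formula yourself, Python's standard statistics module provides both versions. The short sketch below uses it on the same four scores; the variable names are ours. Note which functions divide by n - 1 and which divide by n.

```python
import statistics

sample = [1, -1, 2, 3]

mean = statistics.mean(sample)         # sample mean
s2 = statistics.variance(sample)       # sample variance: divides by n - 1
s = statistics.stdev(sample)           # sample standard deviation
sigma2 = statistics.pvariance(sample)  # population variance: divides by n

print(mean, round(s2, 5), round(s, 4), sigma2)
```

The only difference between variance and pvariance is the divisor, which is exactly the difference between the two formulas in this tutorial.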

Here is one for you to try, this time with the data treated as a sample. The values you enter must be accurate to 4 decimal places:

Sample: 1, 1, 2, 3
x̄ =	s² =	s =

Q Why do we divide by n when computing the variance and standard deviation for a population, but by n-1 when doing it for a sample?
A When we have a sample, we do not have all the data in the population, but we would still like the sample variance to approximate the population variance. We can interpret this to mean that we would like the average of a very large number of calculations of sample variances for different samples to be very close to the population variance. It turns out that the formula for s² given above is the formula that accomplishes this task. The sample variance s² as we have defined it is referred to by statisticians as an "unbiased estimator" of the population variance σ²; if, instead, we divided by n in the formula for s², we would, on average, tend to underestimate the population variance. (See the on-line text on Sampling Distributions for more discussion of unbiased estimators.)
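
For a small population we can check this claim exactly rather than by taking "a very large number" of samples. The Python sketch below is our own illustration, not part of the text: it assumes samples of size 2 drawn with replacement from the population {1, -1, 2, 3}, lists all 16 equally likely ordered samples, and averages each version of the variance formula over them. The function name average_variance_estimates is hypothetical.

```python
from fractions import Fraction
from itertools import product

def average_variance_estimates(population, size):
    """Average both variance formulas over every equally likely ordered
    sample of the given size, drawn with replacement from the population."""
    samples = list(product(population, repeat=size))
    total_n_minus_1 = Fraction(0)
    total_n = Fraction(0)
    for sample in samples:
        mean = Fraction(sum(sample), size)
        squares = sum((x - mean) ** 2 for x in sample)
        total_n_minus_1 += squares / (size - 1)  # divide by n - 1
        total_n += squares / size                # divide by n
    return total_n_minus_1 / len(samples), total_n / len(samples)

population = [1, -1, 2, 3]  # population variance is 2.1875 = 35/16
unbiased, biased = average_variance_estimates(population, size=2)
print(unbiased)  # 35/16
print(biased)    # 35/32
```

Dividing by n - 1 gives an average of exactly 35/16 = 2.1875, the population variance, while dividing by n gives only half of that, an underestimate.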

A Tabular Method for Calculation of Variance & Standard Deviation

Here is a nice way of organizing the data we used in computing the variance and standard deviation of the sample 1, -1, 2, 3:

x	x - x̄	(x - x̄)²
1	1 - 1.25 = -0.25	(-0.25)² = 0.0625
-1	-1 - 1.25 = -2.25	(-2.25)² = 5.0625
2	2 - 1.25 = 0.75	0.75² = 0.5625
3	3 - 1.25 = 1.75	1.75² = 3.0625
5	0	8.75

In the first column go the given values of x. The total goes at the bottom, and we use it to compute the mean, x̄ = 5/4 = 1.25, before we carry on filling out the table.
In the next column go the differences x - x̄: We subtract the mean from each value of x. Note that their sum (at the bottom) should always be zero. (If the sum is not zero, then you have done something wrong.)
In the right-most column go the squares of the numbers in the middle column, with the sum of these squares on the bottom. Note that squares can never be negative. (To square, say, -2.25 on your calculator or in Excel, enter (-2.25)^2, and not -2.25^2. Why?)
We can now compute the sample variance by dividing the sum at the bottom right by n-1 (or the population variance by dividing by n): s² = 8.75/3 ≈ 2.9167 (or σ² = 8.75/4 = 2.1875).
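
The tabular method translates directly into a few lines of Python. The sketch below is an illustration of our own (the function name deviation_table is hypothetical): it builds the three columns, adds them up, and then shows what goes wrong when the parentheses around a negative number are left out.

```python
def deviation_table(scores):
    """Return the rows (x, x - mean, (x - mean)^2) and the three column sums."""
    mean = sum(scores) / len(scores)
    rows = [(x, x - mean, (x - mean) ** 2) for x in scores]
    sums = tuple(sum(column) for column in zip(*rows))
    return rows, sums

rows, sums = deviation_table([1, -1, 2, 3])
for row in rows:
    print(*row, sep="\t")
print(*sums, sep="\t")    # column sums: 5, 0.0, 8.75

n = len(rows)
print(sums[2] / (n - 1))  # sample variance
print(sums[2] / n)        # population variance: 2.1875

# In Python the power is computed before the minus sign is applied,
# so the parentheses matter:
print((-2.25) ** 2)       # 5.0625
print(-2.25 ** 2)         # -5.0625
```

With data that does not come out as evenly as this, the middle column may sum to a tiny number such as 1e-16 instead of exactly zero; that is ordinary rounding error, not a mistake in your table.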

Ecneret Keane, the Minister of Health of Utarek, Mars, is concerned about reports of mercury contamination in Martian striped sandworm (a staple of the Martian diet). The Martian Environmental Protection Agency has determined a safe level of less than 5.8 micrograms of mercury per liter of blood for humans, so Ec Keane decided to conduct tests of blood mercury levels on 6 randomly chosen Utarek (human) citizens.

His measurements (in mcg/liter) are: 5.5, 5.8, 6.0, 6.2, 5.5, 5.8

Complete the following table in order to compute the mean and sample standard deviation of mercury blood levels. Check your calculations after completing each column.


x	x - x̄	(x - x̄)²
5.5
5.8
6.0
6.2
5.5
5.8
Sum:		Sum:

x̄:		s²:

Q OK. We know how to calculate the standard deviation, which is a measure of dispersion. Is there anything more specific that it tells us?
A There are two ways we can use the standard deviation to get specific information about a set of scores. One of these ways, called the empirical rule (see below), gives us a great deal of information, but applies only to distributions of scores that are both bell-shaped and symmetric.

Q What does it mean for a distribution of scores to be bell-shaped?
A It means that if you group the scores into suitable measurement classes (see the tutorial for Section 8.1) and then graph the frequencies or probabilities, you get a nice bell-shaped symmetric curve:


[Figure: three frequency distributions: bell-shaped and symmetric; not symmetric; not bell-shaped]

Empirical Rule

For a set of data whose frequency distribution is bell-shaped and symmetric, the following is true:

Approximately 68% of the scores fall within 1 standard deviation of the mean (within the interval [x̄ - s, x̄ + s] for samples or [μ - σ, μ + σ] for populations).
Approximately 95% of the scores fall within 2 standard deviations of the mean (within the interval [x̄ - 2s, x̄ + 2s] for samples or [μ - 2σ, μ + 2σ] for populations).
Approximately 99.7% of the scores fall within 3 standard deviations of the mean (within the interval [x̄ - 3s, x̄ + 3s] for samples or [μ - 3σ, μ + 3σ] for populations). In other words, almost all of the scores lie within 3 standard deviations of the mean.

Examples

1. If the mean of a sample with a bell-shaped symmetric distribution is 20 with standard deviation s = 2, then approximately 95% of the scores lie in the interval [20-2(2), 20 + 2(2)] = [16, 24]. In other words, approximately 95% of the scores lie between 16 and 24.

Note that this also means that approximately 5% of the scores lie outside this range: approximately 2.5% are above 24 and approximately 2.5% are below 16 (since the distribution is symmetric.)
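
To see where the percentages 68%, 95% and 99.7% come from, we can take the normal distribution as our model of a bell-shaped symmetric curve. That is an assumption on our part; the tutorial itself speaks only of "bell-shaped" distributions. The Python sketch below uses NormalDist from the standard statistics module, with the mean 20 and standard deviation 2 from the example; the function name share_within is hypothetical.

```python
from statistics import NormalDist

def share_within(mean, sd, k):
    """Share of a normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mu=mean, sigma=sd)
    return dist.cdf(mean + k * sd) - dist.cdf(mean - k * sd)

for k in (1, 2, 3):
    low, high = 20 - k * 2, 20 + k * 2
    print(f"[{low}, {high}]: {share_within(20, 2, k):.4f}")
```

The three shares come out close to 0.68, 0.95 and 0.997. Real data that is only roughly bell-shaped will match these figures only roughly, which is why the rule says "approximately."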

Q The Empirical Rule tells us how to interpret the standard deviation for bell-shaped symmetric distributions. What about distributions that are not bell-shaped and symmetric?
A In cases where the distribution is not nice, we cannot be nearly so accurate. What we can always say is the following:

Chebyshev's Rule

For an arbitrary set of data (not necessarily bell-shaped or symmetric) the following is true:

At least 3/4 of the scores fall within 2 standard deviations of the mean (within the interval [x̄ - 2s, x̄ + 2s] for samples or [μ - 2σ, μ + 2σ] for populations).
At least 8/9 of the scores fall within 3 standard deviations of the mean (within the interval [x̄ - 3s, x̄ + 3s] for samples or [μ - 3σ, μ + 3σ] for populations).
At least 15/16 of the scores fall within 4 standard deviations of the mean (within the interval [x̄ - 4s, x̄ + 4s] for samples or [μ - 4σ, μ + 4σ] for populations).
. . .
In general, for any k > 1, at least (k²-1)/k² of the scores fall within k standard deviations of the mean (within the interval [x̄ - ks, x̄ + ks] for samples or [μ - kσ, μ + kσ] for populations). (We write k here rather than n to avoid confusion with the number of scores.)

Examples

1. If the mean of a sample is 20 with standard deviation s = 2, then at least 3/4, or 75%, of the scores lie in the interval [20 - 2(2), 20 + 2(2)] = [16, 24]. In other words, at least 75% of the scores lie between 16 and 24.

Note that this also means that at most 25% of the scores lie outside this range. We cannot say that at most 12.5% are above 24 and at most 12.5% are below 16 unless we know that the distribution is symmetric.
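
Chebyshev's guarantee can be tested on data that is far from bell-shaped. In the Python sketch below, the list skewed is made-up data chosen only because it is lopsided, and the function names are hypothetical. The bound is written as 1 - 1/k², which is the same quantity as the general formula in the rule, with k standing for the number of standard deviations.

```python
import statistics

def chebyshev_bound(k):
    """Guaranteed minimum share of scores within k standard deviations (k > 1)."""
    return 1 - 1 / k ** 2

def observed_share(scores, k):
    """Actual share of the scores within k population standard deviations of the mean."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    inside = [x for x in scores if abs(x - mean) <= k * sd]
    return len(inside) / len(scores)

skewed = [1, 1, 1, 2, 2, 3, 4, 20]  # hypothetical, deliberately lopsided data
for k in (2, 3, 4):
    print(k, chebyshev_bound(k), observed_share(skewed, k))
```

For k = 2 the rule promises at least 0.75, and the data actually has 7 of its 8 scores (0.875) within 2 standard deviations. The observed share is often well above the guaranteed minimum, because the rule has to hold for every possible set of scores.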

Now try some of the exercises in Section 8.4 of Finite Mathematics and Finite Mathematics and Applied Calculus. However, to be able to do all the exercises, you will need to go on to the next tutorial, which deals with random variables rather than sets of scores.
