Measures of Variation

In addition to measures of central tendency that indicate the central value of a dataset, there are also measures of variation that indicate how spread out a dataset is. For this reason, measures of variation are also called measures of spread or dispersion. Standard deviation and variance are measures of spread that pair naturally with the mean. The range, interquartile range and interpercentile ranges, on the other hand, go with the median. There is no measure of spread that naturally pairs with the mode.

Variance and Standard Deviation

Variance and standard deviation (which is the square root of variance) make use of how far each data point is away from the mean.

Ungrouped Data

Essentially, the variance \sigma^2 is the average of the squared differences between each data point and the mean (squared so that positive and negative differences do not cancel each other out) and is given by:


\sigma^2\hspace{5pt}=\hspace{10pt}\frac{\sum (x-\bar{x})^2}{n}\hspace{10pt}=\hspace{10pt}\frac{\sum x^2}{n}-\left(\frac{\sum x}{n}\right)^2 

The second formula can be thought of as ‘the mean of the squares minus the square of the mean’. The variance is often written in terms of a summary statistic, \sigma^2=\frac{S_{xx}}{n}, where the summary statistic is given by

S_{xx}=\sum\left(x-\bar{x}\right)^2=\sum x^2-\frac{(\sum x)^2}{n}

This summary statistic is given in the formula booklet for the Edexcel A-Level Maths syllabus.
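
The two forms of S_{xx} can be checked by expanding the square and using \sum x=n\bar{x}. A short sketch of the algebra, using only the symbols defined above:

```latex
\begin{aligned}
S_{xx} &= \sum (x-\bar{x})^2 \\
       &= \sum x^2 - 2\bar{x}\sum x + n\bar{x}^2 \\
       &= \sum x^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
       &= \sum x^2 - n\bar{x}^2 \\
       &= \sum x^2 - \frac{(\sum x)^2}{n}
\end{aligned}
```

Dividing the last line by n gives the second form of the variance formula.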

The standard deviation is hence given by

\sigma\hspace{5pt}=\hspace{10pt}\sqrt{\frac{\sum (x-\bar{x})^2}{n}}\hspace{10pt}=\hspace{10pt}\sqrt{\frac{\sum x^2}{n}-\left(\frac{\sum x}{n}\right)^2}\hspace{10pt}=\hspace{10pt}\sqrt{\frac{S_{xx}}{n}}

Because the differences are squared when calculating variance, its units are the square of the original units, which makes it a less natural measure of spread. The standard deviation is the square root of the variance, so it has the same units as the original datapoints and gives a more natural measure of spread.
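
As a rough check of the two variance formulas, the following Python sketch computes both forms and the standard deviation. The function names are illustrative choices, and the data are the exam scores used in the examples on this page.

```python
from math import sqrt

def variance(data):
    """Variance as the mean of the squared differences from the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

def variance_shortcut(data):
    """Variance as 'the mean of the squares minus the square of the mean'."""
    n = len(data)
    return sum(x * x for x in data) / n - (sum(data) / n) ** 2

scores = [76, 84, 52, 62, 81, 55, 63]  # exam scores from the examples
var = variance(scores)
sd = sqrt(var)
print(round(var, 3), round(sd, 3))  # prints 139.102 11.794
```

Both functions divide by n, matching the formulas above. Rounding is left until the final step so that accuracy is maintained throughout.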

Grouped Data

Estimates for the variance and standard deviation can be obtained using midpoints when grouped data is given. This is very much like when estimating the mean for grouped data. The formulae change slightly to include the frequencies (f) and the x are now the midpoints of the intervals:

\sigma^2\hspace{5pt}=\hspace{10pt}\frac{\sum f(x-\bar{x})^2}{\sum f}\hspace{10pt}=\hspace{10pt}\frac{\sum fx^2}{\sum f}-\left(\frac{\sum fx}{\sum f}\right)^2

\sigma\hspace{5pt}=\hspace{10pt}\sqrt{\frac{\sum f (x-\bar{x})^2}{\sum f}}\hspace{10pt}=\hspace{10pt}\sqrt{\frac{\sum fx^2}{\sum f}-\left(\frac{\sum fx}{\sum f}\right)^2}

These formulae also apply to ungrouped data that is given in frequency tables.
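
The frequency version of the formula can be sketched in Python as follows. The function name is an illustrative choice. The first call uses the x and f values from the frequency table in the examples, and the second uses the midpoints and frequencies from the queue-time example.

```python
from math import sqrt

def frequency_variance(values, freqs):
    """Variance from a frequency table: sum(f x^2)/sum(f) - (sum(f x)/sum(f))^2.

    For grouped data, `values` should hold the interval midpoints,
    so the result is only an estimate.
    """
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / n
    mean_of_squares = sum(f * x * x for x, f in zip(values, freqs)) / n
    return mean_of_squares - mean ** 2

# Ungrouped data given in a frequency table
var_table = frequency_variance([0, 1, 2, 3, 4, 5], [3, 2, 0, 4, 5, 1])
print(round(var_table, 3), round(sqrt(var_table), 3))  # prints 2.773 1.665

# Grouped data: midpoints of the queue-time intervals
var_queue = frequency_variance([5, 15, 27.5, 42.5, 70], [11, 4, 18, 27, 40])
print(round(var_queue, 2), round(sqrt(var_queue), 2))  # prints 518.48 22.77
```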

Ranges

Range

It is likely that you have heard of the range before. It is simply given as follows:

{\text{Range = Largest - Smallest }}

Evidently, the range is affected by extreme values.

Quartiles and Percentiles

Just as the median, also known as Quartile 2 \left(Q2\right), marks the value below which 50% of the data values lie, Quartile 1 \left(Q1\right) and Quartile 3 \left(Q3\right) mark the values below which 25% and 75% of the data values lie respectively. All three quartiles can be seen on a box plot. See below for how to find quartiles. Similarly, percentiles indicate where any percentage of the data points lie. For example, P_{10} represents the value below which 10% of the data points lie. Percentiles can be identified on cumulative frequency diagrams (see the exam-style example here) or by using interpolation.

Finding the quartiles for discrete data: In some examples, such as Example 1 on the box plots page, it is easy to locate the quartiles after listing the data in order. Otherwise, in general, Q1 can be found by dividing the number of data points n by 4. If n/4 is a whole number, then Q1 is halfway between the data point in this position and the one above. If n/4 is not a whole number, round up to the next whole number and take the data point in this position as Q1. For Q3, do the same thing but for 3n/4.
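
The rule above can be sketched in Python as follows. The function name is an illustrative choice, and applying the same rule with n/2 to obtain Q2 is an assumption (it reproduces the usual median). The data are the gorilla ages from the examples on this page.

```python
import math

def quartile(sorted_data, k):
    """k-th quartile (k = 1, 2 or 3) of data already sorted in ascending order."""
    n = len(sorted_data)
    position = k * n / 4  # 1-based position in the ordered list
    if position == int(position):
        i = int(position)
        # Whole number: halfway between this data point and the one above
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    # Otherwise round up and take the data point in that position
    return sorted_data[math.ceil(position) - 1]

ages = sorted([8, 30, 5, 14, 61, 15, 52, 25, 22, 32, 43, 7, 41, 18, 55])
print(quartile(ages, 1), quartile(ages, 2), quartile(ages, 3))  # prints 14 25 43
```

Other conventions for locating quartiles exist, so software packages may return slightly different values for the same data.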

Finding the quartiles for grouped continuous data: similarly to finding a mean for grouped data, the quartiles for grouped data will be estimates. Q1, Q2 and Q3 can be found by estimating the (n/4)th, (n/2)th and (3n/4)th values respectively. This can be done using interpolation. Note that there are different conventions for finding quartiles and so answers within a certain interval are usually accepted.
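
Linear interpolation within a class can be sketched in Python as follows. This assumes the convention stated above, i.e. estimating the (n/4)th, (n/2)th and (3n/4)th values, so other conventions will give slightly different answers. The function name is an illustrative choice, and the data are the queue times from the examples on this page.

```python
def interpolate(classes, freqs, position):
    """Estimate the value at a given position by linear interpolation.

    classes: list of (lower, upper) class boundaries in ascending order
    freqs:   frequency of each class
    """
    cumulative = 0
    for (lower, upper), f in zip(classes, freqs):
        if f > 0 and cumulative + f >= position:
            # Fraction of the way through this class, scaled by the class width
            return lower + (position - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("position is larger than the total frequency")

classes = [(0, 10), (10, 20), (20, 35), (35, 50), (50, 90)]
freqs = [11, 4, 18, 27, 40]
n = sum(freqs)

q1 = interpolate(classes, freqs, n / 4)
q2 = interpolate(classes, freqs, n / 2)
q3 = interpolate(classes, freqs, 3 * n / 4)
print(round(q1, 2), round(q2, 2), round(q3, 2))
```

The same function estimates a percentile if the position is changed, for example 0.1 * n for P_{10} under the same assumed convention.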

Interquartile and Interpercentile Ranges

As well as the range which is calculated by subtracting the smallest from the largest data value, there are the interquartile and interpercentile ranges. The interquartile range (IQR) is given by

{\text{IQR = Quartile 3 - Quartile 1 }}

The IQR is not affected by extreme values but it does only take into account the spread of the middle 50% of data.
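
The effect of an extreme value on the range and the IQR can be illustrated with a short Python sketch. It uses the gorilla ages from the examples on this page together with a hypothetical altered dataset in which the largest age is replaced by 161. The altered value is made up purely for illustration, and the helper names are illustrative choices.

```python
import math

def quartile(sorted_data, k):
    """k-th quartile using the n/4 rule for discrete data described above."""
    n = len(sorted_data)
    position = k * n / 4
    if position == int(position):
        i = int(position)
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    return sorted_data[math.ceil(position) - 1]

def data_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)

def iqr(data):
    """Quartile 3 minus Quartile 1."""
    ordered = sorted(data)
    return quartile(ordered, 3) - quartile(ordered, 1)

ages = [8, 30, 5, 14, 61, 15, 52, 25, 22, 32, 43, 7, 41, 18, 55]
# Hypothetical extreme value: replace the largest age, 61, with 161
with_extreme = [161 if a == 61 else a for a in ages]

print(data_range(ages), iqr(ages))                  # prints 56 29
print(data_range(with_extreme), iqr(with_extreme))  # prints 156 29
```

The range grows by 100 while the IQR is unchanged, because the extreme value lies outside the middle 50% of the data.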

There are multiple interpercentile ranges. For example, P_{90}-P_{10} shows the spread of the middle 80% of data. Since this considers more data than the IQR and still isn’t affected by extreme values, it is often favoured over the IQR.

Examples

Find the variance and standard deviation for each of the following:

  1. The scores obtained in an exam sat by 7 students: 76, 84, 52, 62, 81, 55, 63.
  2. The set of 11 datapoints whose summary statistic is given by S_{xx}=567
  3. The data in the following frequency table:

x | 0 | 1 | 2 | 3 | 4 | 5
f | 3 | 2 | 0 | 4 | 5 | 1

Solution

  1. The mean of the scores is 67.571. The sum of the squares is 32935 which divided by 7 is 4705. Subtracting the mean squared from this gives the variance as \sigma ^2=139.102 (maintaining accuracy) and it follows that the standard deviation is given by 11.794 to 3 decimal places. Be sure to maintain accuracy throughout your calculations.
  2. The variance is given by \sigma^2=\frac{S_{xx}}{n}=\frac{567}{11}=51.55 to 2 decimal places. Hence, the standard deviation is \sigma=\sqrt{\frac{567}{11}}=7.18 to 2 decimal places.
  3. The mean is given by \bar{x}=\frac{39}{15}=2.6. Using the first frequency formula, where \sum f(x-\bar{x})^2=3(-2.6)^2+2(-1.6)^2+0(-0.6)^2+4(0.4)^2+5(1.4)^2+1(2.4)^2=41.6 and \sum f=3+2+0+4+5+1=15, gives \sigma^2=\frac{41.6}{15}=2.773 and \sigma=1.665 to 3 decimal places.

The following table shows the times that 100 people waited in a queue for KFC at a specific store in the first hour of reopening during the coronavirus lockdown in 2020.

Queue Time, t (minutes) | Frequency
0\leq t<10 | 11
10\leq t<20 | 4
20\leq t<35 | 18
35\leq t<50 | 27
50\leq t<90 | 40

Find the variance and standard deviation for this grouped data.

Solution

Mid-point (MP), mid-point squared (MP^2), mid-point times frequency (MPF) and mid-point squared times frequency (MP^2F) columns can be added to the table:

(MP) | (MP^2) | (MPF) | (MP^2F)
5 | 25 | 55 | 275
15 | 225 | 60 | 900
27.5 | 756.25 | 495 | 13612.5
42.5 | 1806.25 | 1147.5 | 48768.75
70 | 4900 | 2800 | 196000
Total: | | 4557.5 | 259556.25

Using the second formula for the variance of grouped data we have: \sigma^2=\frac{259556.25}{100}-\left(\frac{4557.5}{100}\right)^2=518.48 and \sigma=22.77 to 2 decimal places.

See some more statistical analysis for this example.

This dataset shows the ages of 15 female gorillas kept at a conservation park in the Congo:

8, 30, 5, 14, 61, 15, 52, 25, 22, 32, 43, 7, 41, 18, 55

1. Find the median and the interquartile range for this dataset.

2. The 10th to 90th interpercentile range for the male gorillas at the same park is 37. Without finding it, do you think that the 10th to 90th interpercentile range would be higher or lower than this for the female gorillas? Justify your answer.

Solution

1. First of all, write the ages of the female gorillas in ascending order:

5, 7, 8, 14, 15, 18, 22, 25, 30, 32, 41, 43, 52, 55, 61

With n=15, the median is the 8th value, 25. Since n/4=3.75 and 3n/4=11.25 are not whole numbers, rounding up gives Q1 as the 4th value, 14, and Q3 as the 12th value, 43. Hence, the interquartile range is 43-14=29.

2. Consider the data value that is one below Q1, i.e. 8, and the data value that is one above Q3, i.e. 52. These are the 3rd and 13th of the 15 ordered values, so both lie inside the 10th to 90th percentile range, which must therefore be at least 52-8=46. This is already bigger than 37. For this reason, the 10th to 90th interpercentile range for the females will be greater than that of the males.

See more statistical analysis for this example (Box Plots – Example 1)

The following table shows the prices in millions of pounds of some houses in a given postcode:

House Price, P (£m) | Frequency
0.2\leq P<0.3 | 21
0.3\leq P<0.4 | 37
0.4\leq P<0.5 | 24
0.5\leq P<0.6 | 11
0.6\leq P<0.7 | 7

  1. Using interpolation, find an estimate for the median and the interquartile range.
  2. Also using interpolation, find the 5th to 95th interpercentile range. Estimate how many houses fall within this range.

Solution

  1. Q1=£308,100, Q2=£375,700 and Q3=£466,700 (see interpolation for details). Hence, the median is £375,700 and the interquartile range is £466,700-£308,100=£158,600.
  2. P_{5}=£219,000 and P_{95}=£641,300 (see interpolation for details). Hence, the 5th to 95th interpercentile range is £641,300-£219,000=£422,300. Since this range covers the middle 90% of the data, approximately 90 of the 100 houses fall within it.

See more statistical analysis for this example.