In addition to measures of central tendency, which indicate the central value of a dataset, there are measures of variation, which indicate how spread out a dataset is. For this reason, measures of variation are also called measures of spread or dispersion. Standard deviation and variance are measures of spread that pair naturally with the mean. The range, interquartile range and interpercentile ranges, on the other hand, go with the median. There is no measure of spread that naturally pairs with the mode.

Variance and Standard Deviation

Variance and standard deviation (the square root of the variance) are based on how far each data point is from the mean.

Ungrouped Data

Essentially, the variance $\sigma^2$ is the average of the squared differences between each data point and the mean (the differences are squared so that positive and negative differences do not cancel out) and is given by:

${\text{Variance }}\hspace{5pt}=\hspace{5pt}\sigma^2\hspace{5pt}=\hspace{10pt}\frac{\sum (x-\bar{x})^2}{n}\hspace{10pt}=\hspace{10pt}\frac{\sum x^2}{n}-\left(\frac{\sum x}{n}\right)^2$

The second formula can be thought of as ‘the mean of the squares minus the square of the mean’. The variance is often written in terms of a summary statistic, $\sigma^2=\frac{S_{xx}}{n}$, where the summary statistic is given by

$S_{xx}=\sum\left(x-\bar{x}\right)^2=\sum x^2-\frac{(\sum x)^2}{n}$

This summary statistic is given in the formula booklet for the Edexcel A-Level Maths syllabus.
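To see why the two forms of $S_{xx}$ agree, expand the square and substitute $\bar{x}=\frac{\sum x}{n}$. This is a short sketch of the algebra rather than something quoted from the formula booklet:

$\sum\left(x-\bar{x}\right)^2=\sum x^2-2\bar{x}\sum x+n\bar{x}^2=\sum x^2-\frac{2(\sum x)^2}{n}+\frac{(\sum x)^2}{n}=\sum x^2-\frac{(\sum x)^2}{n}$

Dividing through by $n$ gives the second form of the variance, $\frac{\sum x^2}{n}-\left(\frac{\sum x}{n}\right)^2$.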

The standard deviation is hence given by

${\text{Standard Deviation}}\hspace{5pt}=\hspace{5pt}\sigma\hspace{5pt}=\hspace{10pt}\sqrt{\frac{\sum (x-\bar{x})^2}{n}}\hspace{10pt}=\hspace{10pt}\sqrt{\frac{\sum x^2}{n}-\left(\frac{\sum x}{n}\right)^2}\hspace{10pt}=\hspace{10pt}\sqrt{\frac{S_{xx}}{n}}$

Because the differences are squared when calculating variance, its units are the square of the original units, which can make it feel like an unnatural measure of spread. Since standard deviation is the square root of variance, it has the same units as the original data points and so gives a more natural measure of spread.
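As a rough check that the two variance formulas agree, here is a minimal Python sketch. The function names are illustrative choices, and the data are the seven exam scores from Example 1, part 1 below.

```python
from math import sqrt

def variance_from_deviations(data):
    """Mean of the squared deviations from the mean."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n

def variance_from_sums(data):
    """Mean of the squares minus the square of the mean."""
    n = len(data)
    return sum(x * x for x in data) / n - (sum(data) / n) ** 2

scores = [76, 84, 52, 62, 81, 55, 63]  # exam scores from Example 1, part 1

var_a = variance_from_deviations(scores)
var_b = variance_from_sums(scores)
sd = sqrt(var_a)  # standard deviation is the square root of the variance

print(round(var_a, 2), round(var_b, 2), round(sd, 2))
```

Both functions should give a variance of about 139.1 and a standard deviation of about 11.79, up to floating-point rounding.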

Grouped Data

Estimates for the variance and standard deviation can be obtained using midpoints when grouped data is given. This is very much like estimating the mean for grouped data. The formulae change slightly to include the frequencies $f$, and each $x$ is now the midpoint of its interval:

${\text{Variance }}\hspace{5pt}=\hspace{5pt}\sigma^2\hspace{5pt}=\hspace{10pt}\frac{\sum f(x-\bar{x})^2}{\sum f}\hspace{10pt}=\hspace{10pt}\frac{\sum fx^2}{\sum f}-\left(\frac{\sum fx}{\sum f}\right)^2$

${\text{Standard Deviation}}\hspace{5pt}=\hspace{5pt}\sigma\hspace{5pt}=\hspace{10pt}\sqrt{\frac{\sum f (x-\bar{x})^2}{\sum f}}\hspace{10pt}=\hspace{10pt}\sqrt{\frac{\sum fx^2}{\sum f}-\left(\frac{\sum fx}{\sum f}\right)^2}$

These formulae also apply to ungrouped data that is given in frequency tables.
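The frequency-weighted formulae can be sketched in Python as follows. The function name `grouped_variance` is an illustrative choice, and the classes and frequencies are taken from the queue-time table in Example 2 (Grouped) below.

```python
from math import sqrt

def grouped_variance(values, freqs):
    """Estimate the variance from values (or class midpoints) and frequencies."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(values, freqs)) / total
    mean_of_squares = sum(f * x * x for x, f in zip(values, freqs)) / total
    return mean_of_squares - mean ** 2

# Queue-time classes from Example 2 as (lower, upper) boundaries in minutes.
classes = [(0, 10), (10, 20), (20, 35), (35, 50), (50, 90)]
freqs = [11, 4, 18, 27, 40]
midpoints = [(lo + hi) / 2 for lo, hi in classes]

var_est = grouped_variance(midpoints, freqs)
sd_est = sqrt(var_est)

print(round(var_est, 2), round(sd_est, 2))
```

With these midpoints the estimates come out at roughly 518.48 for the variance and 22.77 minutes for the standard deviation. Passing the actual data values instead of midpoints handles an ungrouped frequency table in the same way.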

Example 1 (Ungrouped)

Find the variance and standard deviation for each of the following:

1. The scores obtained in an exam sat by 7 students: 76, 84, 52, 62, 81, 55, 63.
2. The set of 11 datapoints whose summary statistic is given by $S_{xx}=567$
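3. The data in the following frequency table:

| No. of days off sick | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| Frequency | 3 | 2 | 0 | 4 | 5 | 1 |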

Example 2 (Grouped)

The following table shows the times 100 people waited in a queue for KFC at a specific store in the first hour of reopening during the coronavirus lockdown in 2020.

| Queue Time, t (minutes) | Frequency |
|---|---|
| $0\leq t<10$ | 11 |
| $10\leq t<20$ | 4 |
| $20\leq t<35$ | 18 |
| $35\leq t<50$ | 27 |
| $50\leq t<90$ | 40 |

Find the variance and standard deviation for this grouped data.

Ranges

Range

It is likely that you have heard of the range before. It is simply given as follows:

${\text{Range = Largest - Smallest }}$

Evidently, the range is affected by extreme values.

Quartiles and Percentiles

Just as the median, also known as Quartile 2 $\left(Q2\right)$, marks the value below which 50% of the data lie, Quartile 1 $\left(Q1\right)$ and Quartile 3 $\left(Q3\right)$ mark the values below which 25% and 75% of the data lie respectively. All three quartiles can be seen on a box plot. See below for how to find quartiles. Similarly, percentiles indicate where any given percentage of the data points lie. For example, $P_{10}$ represents the value that 10% of the data points are less than. Percentiles can be identified on cumulative frequency diagrams (see the exam-style example here) or by using interpolation.

Finding the quartiles for discrete data: In some examples, such as Example 1 on the box plots page, it is easy to locate the quartiles after listing the data in order. Otherwise, in general, Q1 can be found by dividing the number of data points n by 4. If n/4 is a whole number, then Q1 is halfway between the (n/4)th data point (in ascending order) and the one above it. If n/4 is not a whole number, round up to the next whole number and take the data point in that position as Q1. For Q3, do the same thing but with 3n/4.
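The rule above can be sketched in Python as follows. The function name `quartile` is an illustrative choice, and applying the same rule with n/2 to get the median is an assumption, since the text only states it for Q1 and Q3. The data are the gorilla ages from Example 1 (Ungrouped) further down.

```python
def quartile(data, k):
    """k-th quartile (k = 1, 2, 3) using the n/4 rule described above."""
    ordered = sorted(data)
    n = len(ordered)
    whole, remainder = divmod(k * n, 4)
    if remainder == 0:
        # k*n/4 is a whole number: halfway between this data point
        # and the one above (positions are counted from 1).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # Otherwise round up to the next whole number and take that data point.
    return ordered[whole]

ages = [8, 30, 5, 14, 61, 15, 52, 25, 22, 32, 43, 7, 41, 18, 55]

q1 = quartile(ages, 1)
median = quartile(ages, 2)  # assumption: same rule applied with n/2
q3 = quartile(ages, 3)
iqr = q3 - q1

print(q1, median, q3, iqr)
```

For the 15 ages, n/4 = 3.75 rounds up to the 4th value and 3n/4 = 11.25 rounds up to the 12th value, so this sketch gives Q1 = 14, Q3 = 43 and a median of 25.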

Finding the quartiles for grouped continuous data: similarly to finding a mean for grouped data, the quartiles for grouped data will be estimates. Q1, Q2 and Q3 can be found by estimating the (n/4)th, (n/2)th and (3n/4)th values respectively. This can be done using interpolation. Note that there are different conventions for finding quartiles and so answers within a certain interval are usually accepted.
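One way to carry out the interpolation is sketched below in Python. It assumes the values are spread evenly within each class, and the function name `interpolate` is an illustrative choice. The classes and frequencies come from the house-price table in Example 2 (Grouped) further down. Other conventions for the positions exist, so treat the results as estimates.

```python
def interpolate(boundaries, freqs, position):
    """Estimate the value at a given position by linear interpolation.

    boundaries holds the class boundaries, so it has one more entry than freqs.
    """
    cumulative = 0
    for i, f in enumerate(freqs):
        if f > 0 and position <= cumulative + f:
            lower, upper = boundaries[i], boundaries[i + 1]
            # Move through the class in proportion to how far into it we are.
            return lower + (position - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("position exceeds the total frequency")

# House prices in £m from Example 2 (Grouped).
boundaries = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
freqs = [21, 37, 24, 11, 7]
n = sum(freqs)

q1 = interpolate(boundaries, freqs, n / 4)
q2 = interpolate(boundaries, freqs, n / 2)
q3 = interpolate(boundaries, freqs, 3 * n / 4)
p5 = interpolate(boundaries, freqs, 0.05 * n)
p95 = interpolate(boundaries, freqs, 0.95 * n)

print(round(q2, 3), round(q3 - q1, 3), round(p95 - p5, 3))
```

Under these assumptions the median estimate is about £0.378m, the interquartile range about £0.160m, and the 5th to 95th interpercentile range about £0.405m.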

Interquartile and Interpercentile Ranges

As well as the range which is calculated by subtracting the smallest from the largest data value, there are the interquartile and interpercentile ranges. The interquartile range (IQR) is given by

${\text{IQR = Quartile 3 - Quartile 1 }}$

The IQR is not affected by extreme values but it does only take into account the spread of the middle 50% of data.
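The effect of an extreme value on the range and the IQR can be illustrated with a short Python sketch. It uses the gorilla ages from Example 1 (Ungrouped) below, plus a hypothetical variant in which the oldest age is replaced by an invented extreme value. The standard library function `statistics.quantiles` uses its own convention for quartiles, which may differ from the rule described above for other datasets.

```python
from statistics import quantiles

ages = [8, 30, 5, 14, 61, 15, 52, 25, 22, 32, 43, 7, 41, 18, 55]
# Hypothetical variant: the oldest age (61) replaced by an invented extreme value.
with_outlier = [161 if a == 61 else a for a in ages]

def spread(data):
    """Return (range, IQR) for a dataset."""
    q1, _, q3 = quantiles(data, n=4)
    return max(data) - min(data), q3 - q1

print(spread(ages))
print(spread(with_outlier))
```

The range jumps from 56 to 156 when the extreme value is introduced, while the IQR stays at 29 in both cases.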

There are multiple interpercentile ranges. For example, $P_{90}-P_{10}$ shows the spread of the middle 80% of data. Since this considers more data than the IQR and still isn’t affected by extreme values, it is often favoured over the IQR.

Example 1 (Ungrouped)

This dataset shows the ages of 15 female gorillas kept at a conservation park in the Congo:

8, 30, 5, 14, 61, 15, 52, 25, 22, 32, 43, 7, 41, 18, 55

1. Find the median and the interquartile range for this dataset.

2. The 10th to 90th interpercentile range for the male gorillas at the same park is 37. Without finding it, do you think that the 10th to 90th interpercentile range would be higher or lower than this for the female gorillas? Justify your answer.

See more statistical analysis for this example (Box Plots – Example 1).

Example 2 (Grouped)

The following table shows the prices in millions of pounds of some houses in a given postcode:

| House Price, P (£m) | Frequency |
|---|---|
| $0.2\leq P<0.3$ | 21 |
| $0.3\leq P<0.4$ | 37 |
| $0.4\leq P<0.5$ | 24 |
| $0.5\leq P<0.6$ | 11 |
| $0.6\leq P<0.7$ | 7 |

1. Using interpolation, find an estimate for the median and the interquartile range.
2. Also using interpolation, find the 5th to 95th interpercentile range. Estimate how many houses fall within this range.