Linear interpolation & Coding
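It was mentioned on the Measures of Variation page that the quartiles and percentiles for grouped continuous data can be found using linear interpolation (or simply interpolation). For a data set of $n$ values, we identify the quartiles according to the following:
- $Q_1$ – the $\frac{n}{4}$th value
- $Q_2$ – the $\frac{n}{2}$th value
- $Q_3$ – the $\frac{3n}{4}$th value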
We assume that the values within each interval are evenly distributed, and so we split each interval evenly according to its frequency.
We use the following example for illustration. The following table shows the prices in millions of pounds of some houses in a given postcode:
House Price, P (£m) | Frequency |
 | 21 |
0.3 to 0.4 | 37 |
0.4 to 0.5 | 24 |
 | 11 |
 | 7 |
Hence, the quartiles are identified as:
- $Q_1$ – the 25th value is the 4th value in the second interval. Split the interval evenly between 37 values and let the first value be 0.3. $Q_1$ is then given by the 4th value: $0.3 + \frac{3}{37} \times 0.1 \approx 0.3081$. Hence, the lower quartile is £308,100.
- $Q_2$ – the 50th value is the 29th value in the second interval. $Q_2$ is given by: $0.3 + \frac{28}{37} \times 0.1 \approx 0.3757$. Hence, the median is £375,700.
- $Q_3$ – the 75th value is the 17th value in the third interval. Split the interval evenly between 24 values and let the first value be 0.4. $Q_3$ is then given by the 17th value: $0.4 + \frac{16}{24} \times 0.1 \approx 0.4667$. Hence, the upper quartile is £466,700.
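The quartile calculations above can be sketched in Python. This is a minimal illustration rather than a definitive method: the function names `locate` and `interpolate` are hypothetical, the ranks are taken as the 25th, 50th and 75th of the 100 values, and the interval bounds (second interval starting at 0.3, third at 0.4, each 0.1 wide) are assumed from the worked example.

```python
# Frequencies from the house-price table (100 houses in total).
freqs = [21, 37, 24, 11, 7]

def locate(freqs, rank):
    """Return (interval index, position within that interval) of the rank-th value.

    The rank and the returned position are 1-based; the interval index is 0-based.
    """
    below = 0  # number of values in the intervals before the current one
    for i, f in enumerate(freqs):
        if rank <= below + f:
            return i, rank - below
        below += f
    raise ValueError("rank exceeds the total frequency")

def interpolate(lower, width, freq, position):
    """Spread freq values evenly over the interval, the first one sitting at lower."""
    return lower + (position - 1) / freq * width

# Assumed from the worked example: lower bounds of the second and third
# intervals (0-based indices 1 and 2), both intervals being 0.1 wide.
lower_bounds = {1: 0.3, 2: 0.4}
width = 0.1

n = sum(freqs)
quartiles = {}
for name, rank in [("Q1", n // 4), ("Q2", n // 2), ("Q3", 3 * n // 4)]:
    i, pos = locate(freqs, rank)
    quartiles[name] = interpolate(lower_bounds[i], width, freqs[i], pos)

for name, value in quartiles.items():
    print(name, round(value, 4))
# prints:
# Q1 0.3081
# Q2 0.3757
# Q3 0.4667
```

Multiplying each result by one million gives the prices quoted in the bullet points, to the nearest hundred pounds.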
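The percentiles can be found in a similar way. For example, $P_5$ and $P_{95}$ can be found by locating the 5th and 95th values. The 5th and 95th values are the 5th value in the first interval and the 2nd value in the fifth interval respectively. $P_5$ and $P_{95}$ then follow by interpolating within those intervals in the same way as for the quartiles.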
If there were 70 values, $P_{10}$ and $P_{90}$ would be found from the 7th and 63rd values, for example.
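As a rough sketch of how the percentile positions are located, the snippet below computes the rank of a percentile as p × n / 100 and then uses the cumulative frequencies to find which interval that rank falls in. The helper names `percentile_rank` and `locate` are hypothetical, and the rank rule is an assumption inferred from the figures quoted above.

```python
from bisect import bisect_left
from itertools import accumulate

freqs = [21, 37, 24, 11, 7]            # frequencies from the table
cumulative = list(accumulate(freqs))    # [21, 58, 82, 93, 100]

def percentile_rank(p, n):
    """Rank of the value used for the p-th percentile of n values (assumed rule)."""
    return p * n / 100

def locate(rank):
    """Return (interval number, position within the interval), both 1-based."""
    i = bisect_left(cumulative, rank)            # first interval whose running total reaches rank
    below = cumulative[i - 1] if i > 0 else 0    # values in earlier intervals
    return i + 1, rank - below

n = sum(freqs)
for p in (5, 95):
    rank = percentile_rank(p, n)
    print(p, rank, locate(rank))

# With 70 values, the 10th and 90th percentiles use the 7th and 63rd values.
print(percentile_rank(10, 70), percentile_rank(90, 70))
```

Once the interval and position are known, the percentile itself is estimated by interpolating inside that interval, which requires the interval's lower bound and width.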
Note that, for listed data, the quartiles are found in a more specific way. This is because we know all of the data points exactly. For grouped data, we don't know them exactly and can only obtain estimates, which is why we find them according to the bullet points above. Also note that when we estimate the mean for grouped data, splitting the interval evenly according to the frequency is exactly the same as assigning the midpoint to each member (it balances out).
Coding
When ‘coding’ is mentioned in statistics, it refers to simplifying a set of data. This is not to be confused with the type of coding a computer programmer does. A set of data values (say x-values) can be coded into a set of y-values so that the y-values are much simpler to perform calculations on. The formula for coding is usually of the form:
$y = \frac{x - a}{b}$
for constants a and b (see the Examples below). The mean and standard deviation of the coded y-values ($\bar{y}$ and $\sigma_y$ respectively) are related to the mean and standard deviation of the original x-values ($\bar{x}$ and $\sigma_x$ respectively) via the equations:
$\bar{y} = \frac{\bar{x} - a}{b}$, $\sigma_y = \frac{\sigma_x}{b}$
It follows that the original statistics can be found using $\bar{x} = b\bar{y} + a$ and $\sigma_x = b\sigma_y$ when necessary. Note that coding may also be applied to grouped data, where the transformation should be applied to the midpoints.
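A short numerical check may make the relationship concrete. The sketch below assumes the coding has the form y = (x - a) / b with b positive; the data values and the constants a = 1000 and b = 10 are hypothetical, chosen only so that the coded values are easy to work with.

```python
import math
import statistics

# Hypothetical data and coding constants (not taken from the text).
x = [1010, 1020, 1030, 1040, 1050]
a, b = 1000, 10

# Assumed coding formula: y = (x - a) / b
y = [(xi - a) / b for xi in x]      # [1.0, 2.0, 3.0, 4.0, 5.0]

# Statistics of the coded data (population standard deviation).
mean_y = statistics.mean(y)
sd_y = statistics.pstdev(y)

# Undo the coding to recover the statistics of the original data.
mean_x = b * mean_y + a
sd_x = b * sd_y

# Compare with the statistics computed directly from the x-values.
print(mean_x, statistics.mean(x))
print(math.isclose(sd_x, statistics.pstdev(x)))
```

Subtracting a shifts the mean but leaves the spread unchanged, while dividing by b scales both, which is why a appears only in the equation for the mean.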