Linear interpolation & Coding

It was mentioned on the Measures of Variation page that the quartiles and percentiles for grouped continuous data can be found using linear interpolation (or simply interpolation). We identify the quartiles according to the following:

  • {\text{Q1}} =\frac{n}{4}{\text{th value}}
  • {\text{Q2}} =\frac{n}{2}{\text{th value}}
  • {\text{Q3}} =\frac{3n}{4}{\text{th value}}

We assume that the values within each interval are evenly distributed, so we split each interval evenly according to its frequency.

To illustrate, the following table shows the prices, in millions of pounds, of some houses in a given postcode:

House Price, P (£m)    Frequency
0.2\leq P<0.3    21
0.3\leq P<0.4    37
0.4\leq P<0.5    24
0.5\leq P<0.6    11
0.6\leq P<0.7    7

Hence, the quartiles are identified as:

  • {\text{Q1}} =\frac{100}{4}{\text{th value = 25th value}} – the 25th value is the 4th value in the second interval. Split the interval evenly between 37 values and let the first value be 0.3. Q1 is then given by the 4th value: 0.3+3\times \frac{0.1}{37}\approx 0.3081. Hence, the lower quartile is £308,100.
  • {\text{Q2}} =\frac{100}{2}{\text{th value = 50th value}} – the 50th value is the 29th value in the second interval. Q2 is given by: 0.3+28\times \frac{0.1}{37}\approx 0.3757. Hence, the median is £375,700.
  • {\text{Q3}} =\frac{300}{4}{\text{th value = 75th value}} – the 75th value is the 17th value in the third interval. Split the interval evenly between 24 values and let the first value be 0.4. Q3 is then given by the 17th value: 0.4+16\times \frac{0.1}{24}\approx 0.4667. Hence, the upper quartile is £466,700.
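
The three calculations above can be sketched in Python. This is only an illustration of the convention used on this page (the first value in an interval sits at its lower bound, and successive values are spaced by the interval width divided by the frequency); the names `HOUSE_PRICES` and `kth_value` are made up for this sketch.

```python
# Grouped data from the table: (lower bound, upper bound, frequency).
HOUSE_PRICES = [
    (0.2, 0.3, 21),
    (0.3, 0.4, 37),
    (0.4, 0.5, 24),
    (0.5, 0.6, 11),
    (0.6, 0.7, 7),
]


def kth_value(groups, k):
    """Estimate the k-th value (1-indexed) by linear interpolation.

    Assumed convention: the first value in an interval sits at its lower
    bound and successive values are width / frequency apart.
    """
    seen = 0  # number of values in the intervals already passed
    for lower, upper, freq in groups:
        if k <= seen + freq:
            position = k - seen  # position of the value inside this interval
            return lower + (position - 1) * (upper - lower) / freq
        seen += freq
    raise ValueError("k exceeds the total frequency")


n = sum(freq for _, _, freq in HOUSE_PRICES)  # 100 houses in total
q1 = kth_value(HOUSE_PRICES, n / 4)
q2 = kth_value(HOUSE_PRICES, n / 2)
q3 = kth_value(HOUSE_PRICES, 3 * n / 4)
print(round(q1, 4), round(q2, 4), round(q3, 4))  # 0.3081 0.3757 0.4667
```

The same function can be called with any position k, so it also covers percentile positions such as the 5th or 95th value.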

The percentiles can be found in a similar way. For example, P_{5} and P_{95} can be found by locating the 5th and 95th values. The 5th value is the 5th value in the first interval and the 95th value is the 2nd value in the last interval, so they are 0.2+4\times \frac{0.1}{21}\approx 0.2190 and 0.6+1\times \frac{0.1}{7}\approx 0.6143 respectively. Hence P_{5}=£219,000 and P_{95}=£614,300.

If there were 70 values, for example, P_{10} and P_{90} would be found from the 7th and 63rd values.

Note that, for listed data, the quartiles are found in a more specific way. This is because we know all of the data points exactly. For grouped data, we don’t know them exactly, so we can only obtain estimates, which is why we locate the quartiles according to the bullet points above. Also note that when we estimate the mean for grouped data, spreading the values evenly across each interval gives the same result as assigning the midpoint to every member (the values either side of the midpoint balance out).

Coding

When ‘coding’ is mentioned in statistics, it refers to simplifying a set of data. This is not to be confused with the type of coding a computer programmer does. A set of data values (say x values) can be coded into a set of y-values so that the y values are much simpler to perform calculations on. The formula for coding is usually of the form:

y=\frac{x-a}{b}

for constants a and b (see the Examples below). The mean and standard deviation of the coded y-values (\bar{y} and \sigma_y respectively) are related to the mean and standard deviation of the original x-values (\bar{x} and \sigma_x respectively) via the equations:

\bar{y}=\frac{\bar{x}-a}{b}, \qquad \sigma_y=\frac{\sigma_x}{b}

It follows that the original statistics can be found using \bar{x}=b\bar{y}+a and \sigma_x =b\sigma_y when necessary. Note that coding may also be applied to grouped data, in which case the transformation is applied to the midpoints.
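
A short sketch of why these relations hold may be helpful. It writes n for the number of data values and x_i, y_i for the individual values (symbols introduced here for illustration), uses the standard deviation that divides by n, and assumes b>0:

\bar{y}=\frac{1}{n}\sum y_i=\frac{1}{n}\sum \frac{x_i-a}{b}=\frac{1}{b}\left(\frac{1}{n}\sum x_i-a\right)=\frac{\bar{x}-a}{b}

\sigma_y^2=\frac{1}{n}\sum (y_i-\bar{y})^2=\frac{1}{n}\sum \left(\frac{x_i-\bar{x}}{b}\right)^2=\frac{\sigma_x^2}{b^2}

Taking the positive square root gives \sigma_y=\frac{\sigma_x}{b}. Subtracting a shifts every value by the same amount, so it changes the mean but has no effect on the spread.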

Examples

The temperatures (T^\circ {\text{F}}) of some sick patients in a hospital were measured as follows:

109^\circ {\text{F}}, 97^\circ {\text{F}}, 101^\circ {\text{F}}, 99^\circ {\text{F}}, 102^\circ {\text{F}}, 104^\circ {\text{F}}

  1. Code the temperatures according to the following formula: t=\frac{T-100}{10}.
  2. Find the mean and standard deviation of the coded data.
  3. Use your answer in part 2 to calculate the mean and standard deviation of the original temperatures.

Solution:

  1. Take the first value 109 as an example. The coded version of this value is (109-100)/10=0.9. The full set of coded values is given by 0.9, -0.3, 0.1, -0.1, 0.2, 0.4.
  2. {\text{Mean}}=\bar{t}=\frac{0.9-0.3+0.1-0.1+0.2+0.4}{6}=0.2 and {\text{S.D.}}=\sigma_{t}=\sqrt{\frac{0.7^2+(-0.5)^2+(-0.1)^2+(-0.3)^2+0^2+0.2^2}{6}}\approx 0.383, where each term in the numerator is the squared deviation of a coded value from \bar{t}.
  3. The original mean is given by \bar{T}=10\bar{t}+100=102^\circ {\text F} and the original standard deviation is \sigma_T=10\sigma_t=3.83^\circ {\text F}
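
The three parts of this solution can be checked with a few lines of Python. This is a minimal sketch using the standard library's `statistics` module; `pstdev` is the population standard deviation (dividing by n), which matches the calculation in part 2. The variable names are chosen for this sketch only.

```python
from statistics import fmean, pstdev

temperatures = [109, 97, 101, 99, 102, 104]  # original values in degrees F

# Part 1: code each temperature with t = (T - 100) / 10.
coded = [(T - 100) / 10 for T in temperatures]

# Part 2: mean and population standard deviation of the coded data.
mean_t = fmean(coded)
sd_t = pstdev(coded)

# Part 3: undo the coding to recover the original statistics.
mean_T = 10 * mean_t + 100
sd_T = 10 * sd_t

print(round(mean_t, 3), round(sd_t, 3))
print(round(mean_T, 2), round(sd_T, 2))
```

Computing `pstdev(temperatures)` directly on the uncoded values should agree with `sd_T`, which is a quick way to confirm that the coding was undone correctly.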

A dataset (X) with mean \bar{X}=23.574 is coded to produce a new dataset (Y) with mean \bar{Y}=6.1. Given that the original data set has standard deviation \sigma_X=1.15 and the new data set has standard deviation \sigma_Y=0.5, find the formula used to code the X values into the Y values.

Solution:

Using the standard deviation formula \sigma_Y=\frac{\sigma_X}{b}, it follows that 0.5=\frac{1.15}{b}. Hence, b=2.3. Using the formula for coding the mean, \bar{Y}=\frac{\bar{X}-a}{b}, we now have 6.1=\frac{23.574-a}{2.3}. It follows that a=23.574-6.1\times 2.3=9.544\approx 9.54, and the formula used to code the X values into the Y values is Y=\frac{X-9.54}{2.3}.
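
The same rearrangement can be written as a small Python sketch. The function name `recover_coding` is hypothetical, and the sketch assumes a coding of the form y = (x - a) / b with b > 0; the inputs are the values stated in the question.

```python
def recover_coding(mean_x, sd_x, mean_y, sd_y):
    """Return (a, b) for a coding y = (x - a) / b, assuming b > 0."""
    b = sd_x / sd_y          # rearranged from sd_y = sd_x / b
    a = mean_x - b * mean_y  # rearranged from mean_y = (mean_x - a) / b
    return a, b


a, b = recover_coding(mean_x=23.574, sd_x=1.15, mean_y=6.1, sd_y=0.5)
print(round(a, 2), round(b, 2))  # 9.54 2.3
```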