Hypothesis Testing for Correlation

We learned how to conduct hypothesis tests for binomial probabilities in AS Maths. In A2 Maths, we extend the ideas of hypothesis testing to normal distributions and to correlation. On this page we learn how to conduct hypothesis tests for correlation, that is, how to use the product moment correlation coefficient of a paired sample to decide whether there is evidence of a linear relationship between two variables in the whole population. First, we recall a couple of other things we learned in AS Maths.

Product Moment Correlation Coefficient (PMCC)

When studying correlation and regression in AS Maths, we briefly learned about the Product Moment Correlation Coefficient (PMCC). It is a number between -1 and 1 inclusive, calculated from two related sets of data (paired data) measuring two variables. The PMCC is a measure of how closely the data points resemble a straight line when plotted on a scatter diagram. From here we will denote the PMCC with the letter r.

  • Positive Correlation – If r is close to 1, then the data points will be close to a straight line that has a positive gradient – we say that the variables are ‘positively correlated’. This means that if one of the variables goes up, then the other variable will go up too. If r is exactly 1, then the data points will sit exactly on the line with positive gradient.
  • Negative Correlation – Similarly, if r is close to -1, the data points will be close to a straight line with negative gradient and we say that the variables are ‘negatively correlated’. In this case, if one of the variables goes up, then the other variable will go down. If r is exactly -1, then the data points will sit exactly on a straight line with negative gradient.
  • Zero Correlation – If r is close to 0, then there is no indication that the variables are linearly related at all.

For mid-range values of r, we can say that the variables are weakly correlated as in the images above. For example, a PMCC of 0.5 suggests that the variables are weakly positively correlated.

Note that the value of r measures the strength of the correlation and whether it is positive or negative. It does not, however, tell you the gradient of the straight line that the data points resemble. That is found with regression.
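As a rough illustration of how r is computed and why it says nothing about the gradient, the following Python sketch uses the standard formula r = S_xy / sqrt(S_xx S_yy). The function name `pmcc` and the sample data are made up for this illustration and are not part of the original page.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient r of paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and products about the means.
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Raises ZeroDivisionError if either variable is constant.
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data: three exact straight lines with very different gradients.
xs = [1, 2, 3, 4, 5]
steep = [10 * x + 3 for x in xs]
shallow = [0.1 * x + 3 for x in xs]
falling = [-2 * x + 7 for x in xs]

print(pmcc(xs, steep), pmcc(xs, shallow), pmcc(xs, falling))
```

Up to floating-point rounding, the steep and shallow lines both give r = 1 and the falling line gives r = -1, so r reflects how well the points fit a line and the sign of its gradient, not its size.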

A value of r close to 0 suggests that the variables do not have a linear relationship. However, it is possible that there is another relationship between the variables or a linear relationship can be revealed by a transformation of the variables. See the next section.

Exponential Curves

Recall that when studying Exponential & Logarithmic Graphs in AS Maths, we learned that performing a log transformation to curves of the form y=ax^n and y=kb^x resulted in a linear relationship between the variables:

  • For y=ax^n, it follows that \log(y)=\log(ax^n)=\log(a)+\log(x^n)=\log(a)+n\log(x) using the log rules. We leave the base out of these equations since it works for any choice of base, provided it is used consistently. It follows that if \log(y)=\log(a)+n\log(x), then there is a linear relationship between \log(y) and \log(x). Plotting \log(y) against \log(x) will show a straight line with gradient n and y-intercept \log(a).
  • For y=kb^x, it follows that \log(y)=\log(kb^x)=\log(k)+\log(b^x)=\log(k)+x\log(b) also using the log rules. It follows here that if \log(y)=\log(k)+x\log(b), then there is a linear relationship between \log(y) and x. Plotting \log(y) against x will show a straight line with gradient \log(b) and y-intercept \log(k).

If a hypothesis test reveals evidence to suggest there is a correlation between \log(y) and \log(x), then there is evidence to suggest that the variables have a polynomial relationship of the form y=ax^n. Similarly, if a hypothesis test reveals evidence to suggest there is a correlation between \log(y) and x, then there is evidence to suggest that the variables have an exponential relationship of the form y=kb^x.
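The effect of the log transformation can be checked numerically. The sketch below is only an illustration: the constants k, b, a and n and the x values are invented, and `pmcc` is a helper written for this example. It compares r before and after taking logs.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient r of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical constants, chosen only for illustration.
k, b = 2.0, 1.5  # exponential curve y = k * b^x
a, n = 3.0, 2.0  # power curve y = a * x^n
xs = [1, 2, 3, 4, 5, 6]

exp_ys = [k * b ** x for x in xs]
pow_ys = [a * x ** n for x in xs]

# Exponential data: x against y is curved, x against log(y) is a straight line.
r_exp_raw = pmcc(xs, exp_ys)
r_exp_log = pmcc(xs, [math.log10(y) for y in exp_ys])

# Power data: log(x) against log(y) is a straight line.
r_pow_log = pmcc([math.log10(x) for x in xs], [math.log10(y) for y in pow_ys])

print(r_exp_raw, r_exp_log, r_pow_log)
```

For exact curves like these, the transformed data give r equal to 1 up to rounding, while the untransformed exponential data give a smaller value. Real data will be noisier, which is why a hypothesis test is needed.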

Hypothesis Testing for No Correlation

We learned how to conduct one-tailed and two-tailed hypothesis tests for binomial probabilities in AS Maths – you should revise this first. We can now look at hypothesis testing with the PMCC. In order to make inferences about correlation in a population, we take a sample from it and calculate r from the sample. Performing a hypothesis test on r will tell us about the PMCC of the whole population which we will call R. The null hypothesis H_0 suggests that there is no correlation in the population, that is to say R=0:

H_0: R=0

For a one-tailed test, the alternate hypothesis H_1 suggests that either R is positive or R is negative:

H_1: R>0 or H_1: R<0

For a two-tailed test, the alternate hypothesis suggests that R is different from 0:

H_1: R\ne 0

The hypotheses are tested using the value of r found from the sample: we compare it against the critical values in the product moment correlation coefficient table supplied in the Formula Booklet.

© Pearson, Edexcel Formula Booklet – correlation coefficient critical values table.

Note that this table gives significance level against sample size so these must be known beforehand. See Example 1 for testing a linear relationship between two variables and Example 2 for testing a linear relationship after a transformation of variables.
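The comparison with the critical value can be written as a small decision rule. This is a sketch under stated assumptions: the function name `in_critical_region` and its `alternative` labels are invented for illustration, and the critical value must still be looked up in the table by hand. For a two-tailed test, the lookup uses half the significance level, as in Example 2.

```python
def in_critical_region(r, critical_value, alternative):
    """Return True if the sample r lies in the critical region.

    alternative is one of:
      "positive"  for H_1: R > 0
      "negative"  for H_1: R < 0
      "two-sided" for H_1: R != 0
    critical_value is the positive value read from the table.
    """
    if not 0 < critical_value <= 1:
        raise ValueError("critical value must lie in (0, 1]")
    if alternative == "positive":
        return r > critical_value
    if alternative == "negative":
        return r < -critical_value
    if alternative == "two-sided":
        return abs(r) > critical_value
    raise ValueError("unknown alternative: " + repr(alternative))


# Values quoted in Example 1 on this page: n = 6, 0.5% level, one-tailed.
print(in_critical_region(0.922, 0.9172, "positive"))
```

If the function returns True, there is sufficient evidence to reject H_0 at the chosen significance level. Otherwise H_0 is not rejected.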

Examples

Example 1. The following tables show the average temperature at Heathrow, UK, for the months May to October in the years 1987 and 2015:

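| Month in 1987 | May | Jun | Jul | Aug | Sep | Oct |
|---|---|---|---|---|---|---|
| Average Temperature (^\circ C) | 11.7 | 14.6 | 17.5 | 17.2 | 15.3 | 11.3 |

© Crown Copyright Met Office

| Month in 2015 | May | Jun | Jul | Aug | Sep | Oct |
|---|---|---|---|---|---|---|
| Average Temperature (^\circ C) | 13.2 | 16.8 | 18.8 | 18.1 | 14.4 | 12.5 |

© Crown Copyright Met Office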

A meteorologist claims that there is a strong positive correlation between average monthly temperature in 1987 and 2015. Perform a hypothesis test, by first calculating the product moment correlation coefficient for these data, at the 0.5% significance level to see if there is evidence to support the meteorologist’s claim.

The correlation coefficient r can be found using an appropriate calculator. For example, using a CASIO fx-85GT, one can enter linear statistics mode by pressing ‘MODE’ then ‘2’ and ‘2’ again. This will bring up a table where you can enter the x and y values using ‘=’ and navigating with the arrows. To calculate r, press ‘AC’ to exit the table, then press ‘SHIFT’ and ‘1’ to calculate statistics, followed by ‘5’ for regression, then ‘3’ and ‘=’ for the PMCC. This gives the value r=0.922 to 3 decimal places. We state the null and alternate hypotheses as follows:

H_0: R=0, H_1: R> 0

where R is the population correlation coefficient. This is a one-tailed test and the significance level is 0.5%. We use the table in the Formula Booklet to find the critical value – for a sample size of 6, this is 0.9172. This means that the critical region is r>0.9172. Since the sample value r=0.922>0.9172, r lies in the critical region and there is sufficient evidence to reject the null hypothesis at the 0.5% significance level. We conclude that there is evidence, at the 0.5% significance level, to suggest a positive correlation between the average monthly temperatures in 1987 and 2015.
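For readers without the calculator to hand, the calculation can be reproduced with a short Python sketch. The variable names and the `pmcc` helper are chosen for this illustration. The data and the critical value 0.9172 are those quoted in the example.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient r of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


# Average temperatures (degrees C) for May to October, from the tables above.
temps_1987 = [11.7, 14.6, 17.5, 17.2, 15.3, 11.3]
temps_2015 = [13.2, 16.8, 18.8, 18.1, 14.4, 12.5]

r = pmcc(temps_1987, temps_2015)

# One-tailed test (H_1: R > 0) at the 0.5% level with n = 6.
# The critical value is read from the Formula Booklet table, not computed.
critical_value = 0.9172
reject_h0 = r > critical_value

print(f"r = {r:.3f}, reject H0: {reject_h0}")
```

The computed r agrees with the calculator value of 0.922 to 3 decimal places, and it lies in the critical region.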

Example 2. A demographer wants to conduct a hypothesis test to see if there is evidence for a polynomial or exponential relationship between time and the population of a city. The following table shows the population of Cambridge, UK (obtained from the Office for National Statistics), to the nearest thousand, every 10 years from 1981:

| No. of years after 1981, x | 0 | 10 | 20 | 30 | 40 |
|---|---|---|---|---|---|
| Population to nearest thousand, y | 87 | 92 | 109 | 124 | 146 |

Population of Cambridge, UK, for census years from 1981
  1. Plot a scatter diagram of \log_{10}(y) (the log of the population in thousands) against x (the number of years after 1981).
  2. State whether the relationship between x and y could be of the form y=ax^n where a and n are constants or of the form y=kb^x where k and b are constants. Give a reason for your answer.
  3. The correlation coefficient between x and \log_{10}(y) is 0.990. Conduct a two-tailed hypothesis test, at the 2% significance level, to see whether there is evidence to suggest there is an exponential relationship between time and population.
  4. A regression analysis gives the relationship between x and \log_{10}(y) to be \log_{10}(y)=1.924+0.0058x. Find the unknown constants from part 2) to determine the approximate relationship between time and population for Cambridge.

1. We can make a new table where, instead of y, we can put \log_{10}(y) in the second row:

| x | 0 | 10 | 20 | 30 | 40 |
|---|---|---|---|---|---|
| \log_{10}(y) | 1.9395 | 1.9638 | 2.0374 | 2.0934 | 2.1644 |
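The second row of this table can be generated rather than typed in by hand. A minimal sketch follows, with variable names chosen for this illustration. It takes y as the population in thousands, as in the question.

```python
import math

years_after_1981 = [0, 10, 20, 30, 40]
population_thousands = [87, 92, 109, 124, 146]

# Base-10 logs of the population figures, rounded to 4 decimal places.
log_population = [round(math.log10(y), 4) for y in population_thousands]

for x, log_y in zip(years_after_1981, log_population):
    print(x, log_y)
```

The pairs printed here are the points to plot on the scatter diagram.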

Plotting these points gives the following scatter diagram:

x plotted against log(y) with line of best fit

2. The relationship between x and y could be of the form y=kb^x where k and b are constants. This is because there appears to be a linear relationship between x and \log_{10}(y), that is, \log_{10}(y)=A+Bx for constants A and B. If we write A=\log_{10}(k) and B=\log_{10}(b), then \log_{10}(y)=\log_{10}(k)+x\log_{10}(b)=\log_{10}(k)+\log_{10}(b^x)=\log_{10}(kb^x) using the log rules. It follows that y=kb^x.

3. We state the null and alternate hypotheses as follows:

H_0: R=0, H_1: R\ne 0

where R is the population correlation coefficient. This is a two-tailed test at the 2% significance level, so the significance level in each tail is 1%. We use the table in the Formula Booklet to find the critical values – for a sample size of 5, these are \pm 0.9343. This means that the critical region is r<-0.9343 or r>0.9343. Since the sample value r=0.990>0.9343, r lies in the critical region and there is sufficient evidence to reject the null hypothesis at the 2% significance level. We conclude that there is evidence to suggest a correlation between x and \log_{10}(y), and hence an exponential relationship between time and population, at the 2% significance level.
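The quoted correlation coefficient and the test decision in part 3 can be checked with the following sketch. The `pmcc` helper and variable names are written for this illustration. The critical value 0.9343 is the one read from the table above, not something the code derives.

```python
import math


def pmcc(xs, ys):
    """Product moment correlation coefficient r of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return s_xy / math.sqrt(s_xx * s_yy)


years_after_1981 = [0, 10, 20, 30, 40]
population_thousands = [87, 92, 109, 124, 146]
log_population = [math.log10(y) for y in population_thousands]

r = pmcc(years_after_1981, log_population)

# Two-tailed test at the 2% level with n = 5: 1% in each tail.
critical_value = 0.9343
reject_h0 = abs(r) > critical_value

print(f"r = {r:.3f}, reject H0: {reject_h0}")
```

Using the absolute value of r covers both tails of the critical region in a single comparison.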

4. Following part 2) and comparing coefficients, we can set \log_{10}(k)=1.924 and \log_{10}(b)=0.0058. It follows that k=10^{1.924}=83.946 and b=10^{0.0058}=1.013 to 3 decimal places where we have used the log rules. Hence, we find the approximate relationship between x and y to be y=83.946\times 1.013^x.
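The regression line and the back-transformation in part 4 can also be sketched in Python. The least-squares helper `fit_line` is written for this illustration. Because it works with unrounded values, its constants differ slightly in the later decimal places from those obtained from the rounded coefficients 1.924 and 0.0058.

```python
import math


def fit_line(xs, ys):
    """Least-squares regression line of ys on xs: returns (intercept, gradient)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gradient = s_xy / s_xx
    intercept = mean_y - gradient * mean_x
    return intercept, gradient


years_after_1981 = [0, 10, 20, 30, 40]
population_thousands = [87, 92, 109, 124, 146]
log_population = [math.log10(y) for y in population_thousands]

# Fit log10(y) = A + B x, where A = log10(k) and B = log10(b).
A, B = fit_line(years_after_1981, log_population)

# Undo the logs to recover the constants in y = k * b^x.
k = 10 ** A
b = 10 ** B

print(f"log10(y) = {A:.3f} + {B:.4f} x")
print(f"y = {k:.1f} * {b:.3f}^x")
```

Here k estimates the population (in thousands) in 1981 and b is the growth factor per year, so b of about 1.013 corresponds to growth of roughly 1.3% per year.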
