# Hypothesis Testing for Correlation

We learned how to conduct hypothesis tests for binomial probabilities in AS Maths. In A2 Maths, we extend the ideas of hypothesis testing to normal distributions and to correlation. On this page we learn how to conduct hypothesis tests for correlation, that is, how to determine whether there is a linear relationship between two variables in the whole population by looking at the product moment correlation coefficient from a paired sample. First, we recall a couple of other things we learned in AS Maths.

## Product Moment Correlation Coefficient (PMCC)

When studying correlation and regression in AS Maths, we briefly learned about the Product Moment Correlation Coefficient (PMCC). It is a number that lies between -1 and 1 and is calculated from two related sets of data (or paired data) measuring two variables. The PMCC is a measure of how closely the data points resemble a straight line when plotted on a scatter diagram. From here on we will denote the PMCC with the letter $r$.

• Positive Correlation – If $r$ is close to 1, then the data points will be close to a straight line that has a positive gradient, and we say that the variables are ‘positively correlated’. This means that if one of the variables goes up, then the other variable tends to go up too. If $r$ is exactly 1, then the data points will sit exactly on a straight line with positive gradient.
• Negative Correlation – Similarly, if $r$ is close to -1, the data points will be close to a straight line with negative gradient and we say that the variables are ‘negatively correlated’. In this case, if one of the variables goes up, then the other variable tends to go down. If $r$ is exactly -1, then the data points will sit exactly on a straight line with negative gradient.
• Zero Correlation – If $r$ is close to 0, then there is no indication that the variables are linearly related at all.

For mid-range values of $r$, we can say that the variables are weakly correlated. For example, a PMCC of 0.5 suggests that the variables are weakly positively correlated.

Note that the value of $r$ is a measure of the strength of the correlation and whether it is positive or negative. It does not, however, tell you the gradient of the straight line the data points resemble; that is found using regression.
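The point that the PMCC measures strength and direction but not gradient can be checked numerically. The sketch below assumes the standard formula $r = S_{xy} / \sqrt{S_{xx} S_{yy}}$; the function name `pmcc` and the data are made up for illustration and do not come from this page.

```python
from math import sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Illustrative (made-up) paired data, not taken from this page.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pmcc(xs, ys)
r_steeper = pmcc(xs, [10 * y for y in ys])  # same pattern, gradient 10 times larger
r_flipped = pmcc(xs, [-y for y in ys])      # same pattern, negative gradient

print(round(r, 4), round(r_steeper, 4), round(r_flipped, 4))
```

Multiplying every $y$ value by 10 makes the line ten times steeper but leaves the PMCC unchanged, while reversing the sign of $y$ only changes the sign of the PMCC.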

A value of $r$ close to 0 suggests that the variables do not have a linear relationship. However, it is possible that there is another relationship between the variables, or that a linear relationship can be revealed by a transformation of the variables. See the next section.

## Exponential Curves

Recall that when studying Exponential & Logarithmic Graphs in AS Maths, we learned that performing a log transformation on curves of the form $y = ax^n$ and $y = kb^x$ resulted in a linear relationship between the transformed variables:

• For $y = ax^n$, it follows that $\log y = \log a + n\log x$ using the log rules. We leave the base out of these equations since it works for any choice of base, provided it is used consistently. It follows that if $y = ax^n$, then there is a linear relationship between $\log x$ and $\log y$. Plotting $\log y$ against $\log x$ will show a straight line with gradient $n$ and y-intercept $\log a$.
• For $y = kb^x$, it follows that $\log y = \log k + x\log b$ also using the log rules. It follows here that if $y = kb^x$, then there is a linear relationship between $x$ and $\log y$. Plotting $\log y$ against $x$ will show a straight line with gradient $\log b$ and y-intercept $\log k$.

If a hypothesis test reveals evidence to suggest there is a correlation between $\log x$ and $\log y$, then there is evidence to suggest that the variables have a polynomial relationship in the form $y = ax^n$. Similarly, if a hypothesis test reveals evidence to suggest there is a correlation between $x$ and $\log y$, then there is evidence to suggest that the variables have an exponential relationship in the form $y = kb^x$.
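As a rough numerical illustration of these transformations, the sketch below generates exact data from a power curve $y = ax^n$ and an exponential curve $y = kb^x$ and compares the PMCC before and after taking logs. The symbols, the constants and the helper `pmcc` are assumptions chosen for this sketch, and base 10 logs are used throughout (any base would do if used consistently).

```python
from math import log10, sqrt

def pmcc(xs, ys):
    """Product moment correlation coefficient of paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

# Illustrative constants, chosen for this sketch only.
a, n = 3.0, 2.0  # power model        y = a * x**n
k, b = 5.0, 1.5  # exponential model  y = k * b**x

xs = [1, 2, 3, 4, 5, 6]
power_ys = [a * x ** n for x in xs]
expo_ys = [k * b ** x for x in xs]
log_xs = [log10(x) for x in xs]

# Power model: log y against log x is exactly linear.
r_power_raw = pmcc(xs, power_ys)
r_power_loglog = pmcc(log_xs, [log10(y) for y in power_ys])

# Exponential model: log y against x is exactly linear.
r_expo_raw = pmcc(xs, expo_ys)
r_expo_semilog = pmcc(xs, [log10(y) for y in expo_ys])

print(r_power_raw, r_power_loglog)
print(r_expo_raw, r_expo_semilog)
```

In both cases the untransformed data give a PMCC below 1, while the appropriately transformed data give a PMCC of 1 up to floating-point rounding.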

## Hypothesis Testing for No Correlation

We learned how to conduct one-tailed and two-tailed hypothesis tests for binomial probabilities in AS Maths, and you should revise this first. We can now look at hypothesis testing with the PMCC. In order to make inferences about correlation in a population, we take a sample from it and calculate $r$ from the sample. Performing a hypothesis test on $r$ will tell us about the PMCC of the whole population, which we will call $\rho$. The null hypothesis states that there is no correlation in the population, that is to say $\rho = 0$:

$$H_0: \rho = 0$$

For a one-tailed test, the alternate hypothesis states that either $\rho$ is positive or $\rho$ is negative:

$$H_1: \rho > 0 \quad \text{or} \quad H_1: \rho < 0$$

For a two-tailed test, the alternate hypothesis states that $\rho$ is different from 0:

$$H_1: \rho \neq 0$$

The hypotheses are tested against the value of $r$ found from the sample, and we use the Product Moment Coefficient table, as supplied in the Formula Booklet, to compare it against the critical values given there.

Note that this table gives significance level against sample size so these must be known beforehand. See Example 1 for testing a linear relationship between two variables and Example 2 for testing a linear relationship after a transformation of variables.
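The table lookup and comparison can be sketched in code. The snippet below is only an illustration: it hard-codes the rows for sample sizes 5 and 6 of the critical-value table, indexed by the significance level in one tail, and the names `CRITICAL_VALUES`, `critical_value` and `correlation_test` are hypothetical. A two-tailed test halves the significance level before looking up the table.

```python
# Excerpt of the product moment critical-value table (rows n = 5 and n = 6 only).
# Inner keys are the significance level in ONE tail.
CRITICAL_VALUES = {
    5: {0.10: 0.6870, 0.05: 0.8054, 0.025: 0.8783, 0.01: 0.9343, 0.005: 0.9587},
    6: {0.10: 0.6084, 0.05: 0.7293, 0.025: 0.8114, 0.01: 0.8822, 0.005: 0.9172},
}

def critical_value(n, per_tail_level):
    """Look up the critical value for sample size n and one-tail level."""
    row = CRITICAL_VALUES[n]  # KeyError if n is not in this excerpt
    for level, value in row.items():
        if abs(level - per_tail_level) < 1e-12:
            return value
    raise KeyError(f"level {per_tail_level} is not in the table excerpt")

def correlation_test(r, n, significance, alternative):
    """Test H0: rho = 0 and return (critical value, reject H0?).

    alternative is 'greater' (H1: rho > 0), 'less' (H1: rho < 0)
    or 'two-sided' (H1: rho != 0).
    """
    if alternative == "two-sided":
        crit = critical_value(n, significance / 2)  # split level across both tails
        return crit, abs(r) > crit
    if alternative == "greater":
        crit = critical_value(n, significance)
        return crit, r > crit
    if alternative == "less":
        crit = critical_value(n, significance)
        return crit, r < -crit
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")

# r = 0.990 with n = 5, two-tailed at 2% (the figures quoted in Example 2).
print(correlation_test(0.990, 5, 0.02, "two-sided"))  # (0.9343, True)

# A hypothetical sample value r = 0.90 with n = 6, one-tailed at 0.5%.
print(correlation_test(0.90, 6, 0.005, "greater"))    # (0.9172, False)
```

The second call shows that even a sample value as high as 0.90 is not enough to reject the null hypothesis when the sample is small and the significance level is strict.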

## Examples

### Example 1

The following tables show the average temperature in Heathrow, UK, for the months May to October in the years 1987 and 2015:

A meteorologist claims that there is a strong positive correlation between average monthly temperature in 1987 and 2015. Perform a hypothesis test, by first calculating the product moment correlation coefficient for these data, at the 0.5% significance level to see if there is evidence to support the meteorologist’s claim.

The correlation coefficient can be found using an appropriate calculator. For example, using a CASIO fx-85GT, one can enter linear statistics mode by pressing ‘MODE’ then ‘2’ and ‘2’ again. This will bring up a table where you can enter the x and y values using ‘=’ and navigating with the arrows. To calculate $r$, press ‘AC’ to exit the table, then press ‘SHIFT’ and ‘1’ to calculate statistics, followed by ‘5’ for regression, then ‘3’ and ‘=’ for the PMCC. This gives the value of $r$ to 3 decimal places. We state the null and alternate hypotheses as follows:

$$H_0: \rho = 0, \qquad H_1: \rho > 0$$

where $\rho$ is the population correlation coefficient. This is a one-tailed test and the significance level is 0.5%. We use the table in the Formula Booklet to find the critical value: for a sample size of 6, this is 0.9172. This means that the critical region is $r > 0.9172$. Since the sample value of $r$ lies in the critical region, there is sufficient evidence to reject the null hypothesis at the 0.5% significance level. We conclude that there is evidence to suggest a positive correlation between the average monthly temperatures in 1987 and 2015 at the 0.5% significance level.

### Example 2

A demographer wants to conduct a hypothesis test to see if there is evidence for a polynomial or exponential relationship between time and population. The following table shows the population of Cambridge, UK (obtained from the Office for National Statistics), to the nearest thousand, every 10 years from 1981:

1. Plot a scatter diagram of $x$ (the number of years after 1981) against $\log y$ (the log of the population $y$, rounded to the nearest 1000).
2. State whether the relationship between $x$ and $y$ could be of the form $y = ax^n$ where $a$ and $n$ are constants or of the form $y = kb^x$ where $k$ and $b$ are constants. Give a reason for your answer.
3. The correlation coefficient between $x$ and $\log y$ is 0.990. Conduct a two-tailed hypothesis test, at the 2% significance level, to see whether there is evidence to suggest there is an exponential relationship between time and population.
4. A regression analysis gives the relationship between $x$ and $\log y$ to be a straight line of the form $\log y = mx + c$. Find the unknown constants from part 2) to determine the approximate relationship between time and population for Cambridge.

1. We can make a new table where, instead of $y$, we put $\log y$ in the second row:

Plotting these points gives the following scatter diagram:

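2. The relationship between $x$ and $y$ could be of the form $y = kb^x$ where $k$ and $b$ are constants. This is because there appears to be a linear relationship between $x$ and $\log y$, that is, $\log y = mx + c$ for constants $m$ and $c$. If we write $m = \log b$ and $c = \log k$, then $\log y = x\log b + \log k = \log(kb^x)$ using the log rules. It follows that $y = kb^x$.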

3. We state the null and alternate hypotheses as follows:

$$H_0: \rho = 0, \qquad H_1: \rho \neq 0$$

where $\rho$ is the population correlation coefficient. This is a two-tailed test at the 2% significance level, so the significance level in each tail is 1%. We use the table in the Formula Booklet to find the critical value: for a sample size of 5, this is 0.9343. This means that the critical region is $r < -0.9343$ or $r > 0.9343$. Since the sample value $r = 0.990$ lies in the critical region, there is sufficient evidence to reject the null hypothesis at the 2% significance level. We conclude that there is evidence to suggest a correlation between $x$ and $\log y$, and hence an exponential relationship between time and population, at the 2% significance level.

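4. Following part 2) and comparing coefficients, we can set $\log b$ equal to the gradient of the regression line and $\log k$ equal to its intercept. Solving for $b$ and $k$ using the log rules, and rounding to 3 decimal places, gives the approximate relationship between $x$ and $y$ in the form $y = kb^x$.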
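The step of comparing coefficients can be sketched as follows. This assumes base 10 logs, so that a regression line $\log_{10} y = mx + c$ gives $b = 10^m$ and $k = 10^c$. The function name `exponential_constants` and the values of `m` and `c` are hypothetical placeholders, not the regression coefficients for the Cambridge data.

```python
from math import log10

def exponential_constants(gradient, intercept):
    """Given log10(y) = gradient * x + intercept, return (k, b) with y = k * b**x."""
    b = 10 ** gradient   # since log10(b) = gradient
    k = 10 ** intercept  # since log10(k) = intercept
    return k, b

# Hypothetical regression coefficients, NOT the ones from the Cambridge example.
m, c = 0.01, 5.0
k, b = exponential_constants(m, c)
print(round(k, 3), round(b, 3))

# Sanity check: the recovered curve reproduces the regression line at x = 20.
x = 20
assert abs(log10(k * b ** x) - (m * x + c)) < 1e-9
```

If natural logs were used in the regression instead, the same idea applies with $e$ in place of 10.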