# Basic Concepts of Correlation

Definition 1: The covariance between two sample random variables x and y is a measure of the linear association between the two variables, and is defined by the formula

$$cov(x, y) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

Observation: The covariance is similar to the variance, except that the covariance is defined for two variables (x and y above) whereas the variance is defined for only one variable. In fact, cov(x, x) = var(x).

The covariance can be thought of as the sum of matches and mismatches among the pairs of data elements for x and y: a match occurs when both elements in the pair are on the same side of their mean; a mismatch occurs when one element in the pair is above its mean and the other is below its mean.

The covariance is positive when the matches outweigh the mismatches and is negative when the mismatches outweigh the matches. The size of the covariance in absolute value indicates the intensity of the linear relationship between x and y: the stronger the linear relationship, the larger the absolute value of the covariance will be. The size of the covariance is also influenced by the scale of the data elements, and so, to eliminate the scale factor, the correlation coefficient is used as a scale-free measure of linear relationship.
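The matches-and-mismatches picture can be made concrete with a short Python sketch. It assumes the usual sample form of the covariance (divisor n – 1); the helper names `sample_covariance` and `count_matches` are made up for this illustration, and the data are the two five-element samples used in one of the comments further down the page.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of deviation products divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal size n >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def count_matches(xs, ys):
    """Count pairs on the same side of their means (matches) and on
    opposite sides (mismatches). Pairs lying exactly on a mean count as neither."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    matches = mismatches = 0
    for x, y in zip(xs, ys):
        product = (x - mean_x) * (y - mean_y)
        if product > 0:
            matches += 1
        elif product < 0:
            mismatches += 1
    return matches, mismatches


xs = [3, 4, 5, 1, 5]
ys = [5, 2, 7, 3, 4]

print(round(sample_covariance(xs, ys), 4))  # about 1.35
print(count_matches(xs, ys))                # (2, 3)
print(round(sample_covariance(xs, xs), 4))  # about 2.8, the sample variance of xs
```

In this example there are only two matches against three mismatches, yet the covariance is positive. Each pair is weighted by the size of its deviation product, so a few large matches can outweigh several small mismatches.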

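Definition 2: The correlation coefficient between two sample variables x and y is a scale-free measure of linear association between the two variables, and is given by the formula

$$r = \frac{cov(x, y)}{s_x s_y}$$

where $s_x$ and $s_y$ are the sample standard deviations of x and y.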

If necessary we can write r as rxy to explicitly show the two variables.

We also use the term coefficient of determination for $r^2$.

Observation: Just as we saw for the variance in Measures of Variability, the covariance can be calculated as

$$cov(x, y) = \frac{\sum_{i=1}^n x_i y_i - n \bar{x} \bar{y}}{n - 1}$$

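As a result, we can also calculate the correlation coefficient as

$$r = \frac{\sum_{i=1}^n x_i y_i - n \bar{x} \bar{y}}{\sqrt{\left(\sum_{i=1}^n x_i^2 - n \bar{x}^2\right)\left(\sum_{i=1}^n y_i^2 - n \bar{y}^2\right)}}$$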
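As a quick check that the two routes to r agree, here is a minimal Python sketch. It assumes the usual computational forms, in which the sums of squared deviations are rewritten as raw sums minus n times the squared mean (the n – 1 factors cancel in r). The function names are hypothetical, and the data are the samples from one of the comments below.

```python
import math


def correlation_by_definition(xs, ys):
    """r = cov(x, y) / (s_x * s_y), written with deviation sums
    (the n - 1 divisors cancel between numerator and denominator)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_shortcut(xs, ys):
    """Same quantity computed from raw sums of products and squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    sum_yy = sum(y * y for y in ys)
    numerator = sum_xy - n * mean_x * mean_y
    denominator = math.sqrt((sum_xx - n * mean_x ** 2) * (sum_yy - n * mean_y ** 2))
    return numerator / denominator


xs = [3, 4, 5, 1, 5]
ys = [5, 2, 7, 3, 4]

r_def = correlation_by_definition(xs, ys)
r_short = correlation_shortcut(xs, ys)
print(round(r_def, 6), round(r_short, 6))  # both about 0.419425
```

The shortcut form can lose precision through cancellation when the means are large relative to the spread of the data, so the deviation form is usually the safer choice in code.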

Observation: If r is close to 1 then x and y are positively correlated. A positive linear correlation means that high values of x are associated with high values of y and low values of x are associated with low values of y.

If r is close to -1 then x and y are negatively correlated. A negative linear correlation means that high values of x are associated with low values of y, and low values of x are associated with high values of y.

When r is close to 0 there is little linear relationship between x and y.

Observation: We have defined covariance and the correlation coefficient for data samples. We can also define covariance and correlation coefficient for populations, based on their pdf.

Definition 3: The covariance between two random variables x and y for a population with discrete or continuous pdf is

$$cov(x, y) = E[(x - \mu_x)(y - \mu_y)]$$

Definition 4: The (Pearson’s product moment) correlation coefficient for two variables x and y for a population with discrete or continuous pdf is

$$\rho = \frac{cov(x, y)}{\sigma_x \sigma_y}$$

Property 4: The following is true for both the sample and population definitions of covariance:

If x and y are independent then cov(x, y) = 0

Property 5: The following are true both for samples and populations:

Observation: Click here for additional properties of covariance and correlation, as well as the proofs of the properties given above.

Observation: It turns out that r is not an unbiased estimate of ρ. A relatively unbiased estimate of $\rho^2$ is given by the adjusted coefficient of determination $r_{adj}^2$:

$$r_{adj}^2 = 1 - \frac{(1 - r^2)(n - 1)}{n - 2}$$

While $r_{adj}^2$ is a better estimate of the population coefficient of determination, especially for small values of n, for large values of n it is easy to see that $r_{adj}^2 \approx r^2$. Note too that $r_{adj}^2 \le r^2$, and while $r_{adj}^2$ can be negative, this is relatively rare.
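The behavior described above can be checked numerically. The sketch below assumes the adjustment $1 - (1 - r^2)(n - 1)/(n - 2)$, which is consistent with the n = 4 case raised in the comments (the adjusted value turns negative for |r| below roughly 0.58). The Python function `rsq_adj` is a hypothetical helper for illustration, not the Real Statistics implementation.

```python
def rsq_adj(r, n):
    """Adjusted coefficient of determination for a sample correlation r
    and sample size n, assuming the (n - 1) / (n - 2) adjustment."""
    if n < 3:
        raise ValueError("n must be at least 3")
    return 1 - (1 - r ** 2) * (n - 1) / (n - 2)


# Small sample: the adjustment is large and the result can be negative.
print(rsq_adj(0.5, 4))               # -0.125

# Large sample: the adjusted value is close to r^2 = 0.25 but never above it.
print(round(rsq_adj(0.5, 1000), 5))  # about 0.24925
```

A negative value has no interpretation as a squared correlation; it only signals that, for such a small sample, the observed r is within the range that chance alone could easily produce.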

An even less biased estimate of the population correlation coefficient associated with normally distributed data is given by

Excel Functions: Excel provides the following functions regarding the covariance and correlation coefficient:

COVAR(R1, R2) = the population covariance between the data in arrays R1 and R2. If R1 contains data {x1,…,xn}, R2 contains {y1,…,yn}, $\bar{x}$ = AVERAGE(R1) and $\bar{y}$ = AVERAGE(R2), then COVAR(R1, R2) has the value

$$\frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{n}$$

This is the same as the formula given in Definition 1, with n – 1 replaced by n. Versions of Excel prior to Excel 2010 don’t have a sample version of the covariance, although this can be calculated using the formula:

n * COVAR(R1, R2) / (n – 1)

CORREL(R1, R2) = the correlation coefficient of data in arrays R1 and R2. This function can be used for both the sample and population versions of the correlation coefficient. Note that:

CORREL(R1, R2) = COVAR(R1, R2) / (STDEVP(R1) * STDEVP(R2)) = the population version of the correlation coefficient

CORREL(R1, R2) = n * COVAR(R1, R2) / (STDEV(R1) * STDEV(R2) * (n – 1)) = the sample version of the correlation coefficient
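The two identities above can be verified outside Excel. The following Python sketch uses hypothetical helper names that mirror what COVAR, STDEVP and STDEV are described as computing; it does not call Excel, and the data are the samples from one of the comments below.

```python
import math


def covar_pop(xs, ys):
    """Population covariance (divisor n), as COVAR is described above."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def covar_sample(xs, ys):
    """Sample covariance (divisor n - 1)."""
    n = len(xs)
    return n * covar_pop(xs, ys) / (n - 1)


def stdev_pop(values):
    """Population standard deviation, the role played by STDEVP."""
    return math.sqrt(covar_pop(values, values))


def stdev_sample(values):
    """Sample standard deviation, the role played by STDEV."""
    return math.sqrt(covar_sample(values, values))


xs = [3, 4, 5, 1, 5]
ys = [5, 2, 7, 3, 4]
n = len(xs)

r_population = covar_pop(xs, ys) / (stdev_pop(xs) * stdev_pop(ys))
r_sample = n * covar_pop(xs, ys) / (stdev_sample(xs) * stdev_sample(ys) * (n - 1))

print(round(covar_pop(xs, ys), 4), round(covar_sample(xs, ys), 4))  # about 1.08 and 1.35
print(round(r_population, 6), round(r_sample, 6))                   # both about 0.419425
```

The covariances differ by the factor n / (n – 1), but the same factor appears in the product of the standard deviations, which is why a single CORREL value serves for both the sample and population versions.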

Excel also provides the following, less useful, functions:

PEARSON(R1, R2) = CORREL(R1, R2)

RSQ(R1, R2) = CORREL(R1, R2) ^ 2

Excel 2010/2013 also provide COVARIANCE.S(R1, R2) to compute the sample covariance as well as COVARIANCE.P(R1, R2) which is equivalent to COVAR(R1, R2). Also, the Real Statistics supplemental functions COVARP(R1, R2) and COVARS(R1, R2) compute the population and sample covariances respectively.

Finally, there is a Correlation data analysis tool, which we demonstrate in Example 1 of Multiple Correlation.

Real Statistics Functions: The Real Statistics Resource Pack contains the following functions:

RSQ_ADJ(R1, R2) = adjusted coefficient of determination $r^2_{adj}$ for the data sets contained in ranges R1 and R2.

CORREL_ADJ(R1, R2) = estimated correlation coefficient $\rho_{est}$ for the data sets contained in ranges R1 and R2.

RSQ_ADJ(r, n) = adjusted coefficient of determination $r^2_{adj}$ corresponding to a sample correlation coefficient r for a sample of size n.

CORREL_ADJ(r, n) = estimated correlation coefficient $\rho_{est}$ corresponding to a sample correlation coefficient r for a sample of size n.

### 57 Responses to Basic Concepts of Correlation

1. rhitz says:

Hello
I am working on a report in which I have chosen certain parameters which are state wise population growth of a country as independent variable and its impact on growing demand on housing (dependent variable ) and I have also consider the amount which has been disbursed (dependent variable ) in order to meet the demand by then population. how can I apply correlation and regression model in order to understand the impact of independent variable on other variables

• Charles says:

Rhitz,
This sounds like a multiple regression problem. Please look at the regression portion of the website. You can start with the following webpages:
Linear Regression
Multiple Linear Regression
Charles

• Viqui says:

what is the definition of correlated pairs?

• Charles says:

Viqui,
I don’t know of a precise definition of correlated pair, but probably it is a pair of samples with a correlation that is significantly different from zero.
Charles

2. Vinnie Sappingfield says:

Hi ,
How can I find the joint probability distribution from given correlation?

• Charles says:

Vinnie,
In general, you can’t find the joint probability distribution from the correlation.
Charles

3. Reznik Trevor says:

Hello Charles!

What is the relation between correlation and joint probability distribution of two random variables?

• Charles says:

Reznik,
The correlation coefficient is a statistic (or parameter) of the joint probability distribution of two random variables.
Charles

4. jaton says:

What is the difference between the adjusted correlation coefficient and the adjusted coefficient of determination?

Thanks

vic

• Charles says:

Vic,
The coefficient of determination is the square of the correlation coefficient. The adjusted value tries to modify the sample statistic so that it is a more accurate (i.e. less biased) approximation of the corresponding population parameter.
Charles

5. Rabie Lotfy Abdel Aziz Ramadan says:

Dear Charles,
I want to test the relationship between AMH in plasma samples obtained from Holstein cows and their response to super stimulation by a gonadotropic hormonal preparation. the question is how to interpret such correlations into an excel figure (x, y figure) using Excel 2007 or newer versions
Many thanks

• Charles says:

Dear Rabie,
Sorry, but I don’t completely understand what you mean by “interpret such correlations into an excel figure (x, y figure)”. Are you trying to chart the correlation via a trend line in Excel? Are you trying to interpret such a trendline? Please explain.
Charles

6. AQ says:

Dear Charles,

I am conducting a study which measures the relationship between three variables; quality of life, medication adherence and healthcare satisfaction. Research suggests that all three variables directly affect one another (a triangular-shaped relationship). I am wondering what a relationship between three variables is called?

Many thanks

• Charles says:

AQ,
Of course it depends on the relationship that you are referring to, but probably you are looking for “they are correlated” or “they have an association”.
Charles

7. Martin says:

Hello Charles,

I have a quick question regarding ANOVA and correlation factor, I am trying to analyze different experiments to test treatments. I get inconsistent and not high enough correlation factors to prove linear relationship among the variables. I also ran a one way anova and I get a p value of 0.000 using minitab.

Can we only use ANOVA when there is a linear relationship? or can I trust my results and if so what can I determine from them?

I would appreciate any help, thanks!

• Charles says:

Martin,
You should be able to run ANOVA without making a separate test for correlation. This should come out of the ANOVA results anyway. If you are getting inconsistent results perhaps you have made an error in conducting one of the tests. Without better information about your scenario I am unable to comment further.
Charles

8. Nirosha says:

Dear Charles,
I am having 30 sample size and need to test relationship with individual age, education level with their perception towards several variables which measures using likert scale.(+1 strongly agree to -1 strongly disagree).
can I use pearson correlation test to measure correlation between two group of this sample:
for example my hypothesis will be:
educated officers have best choice of selecting best employee or
experiences of officers have positive relationship with best practices of officers etc.

I have data on age and education level as categorical data and perception as ranking data.

hope you can understand my issue

thanks
Nirosha

• Charles says:

Nirosha,

The more points your Likert scale has, the more accurate tests designed for continuous data will be. With a 7-point scale (e.g. strongly agree, fairly strongly agree, mildly agree, neutral, mildly disagree, fairly strongly disagree, strongly disagree), a continuous test should generally work fine. It is also common to use such a test with a 5-point scale, although there is more risk. Better yet would be to allow any value between -1 and +1.

You can certainly use Pearson’s correlation to measure the associations that you have listed. You can also test whether these correlation coefficients are significantly different from zero. Provided the data is reasonably normally distributed, this is equivalent to conducting a t test. See the webpage Relationship between correlation and t test.

You have stated that you plan to compare two groups. You can also compare more than two groups. This is equivalent to running ANOVA.

Charles

9. jane Israel says:

am hapi wt d work ur doing, pls I’m working on gender and socioeconomic status as correlates of students’s academic achievement. pls what statistical tool should I use to analyze the data..tanx in advance

• Charles says:

Jane, it really depends on what hypothesis you are trying to test. The Correlation section describes a number of tests that you could use. See the webpage
Correlation
Charles

10. Eric W. says:

I have a large data set. I am trying to determine the correlation between a distance variable and a probability variable. The distance is in increments of 5 (there are 1000+ data points for each distance increment). Most of the probabilities are zero (~10%). If I run Excel Correl() on the complete data, there is very little correlation. If I run Correl() on the average probability for each distance, there is strong correlation. Am I using Correl() in some way that is violating the built in assumptions?

• Charles says:

Eric,
There are no assumptions for using the Correl function. There may be some assumptions when you test the correlation value.
Without seeing your data, I can’t tell you why you are getting such different results.
Charles

11. Dear, Excuse me, I am very confused that How to find R1 and R2 please details inform me. thanks

• Charles says:

R1 and R2 are two ranges on an Excel spreadsheet which contain the data for the two samples.
Charles

12. Steve Thomas says:

Im trying to work out the Standard deviation on Excel, and some of the cells contain a (0) which results in the function returning with an error code.

Does anyone know how I can get around this?

Thanks,

Steve

• Charles says:

Steve,
Just because some of the cells contain a zero shouldn’t necessarily result in the standard deviation function STDEV.S returning an error.
If you send me an Excel file with your data, I can try to figure out where the problem is.
Charles

13. Akira says:

Greetings Charles,

I’ve more than 500 relationship (one to one function) to be study either they have a possible relationship or not. Therefore my first step is by using Correlation Coefficient to segregate the possible relationship or not before move to the next step.

My question, is that okay to used that method to find the possible relationship or there is another method of that more reliable to segregate those one to one into possible relationship group and vise versa.

• Charles says:

Sorry, but you haven’t provided enough information for me to answer your question.
Charles

• Akira says:

For example, in crude assays there are hundred of parameter (properties, such as sulfur content, Specific Gravity, viscosity and etc).

This one to one function, for example Sulfur vs SPG or Sulfur vs Viscosity and so on. The study is to find any possible relationship among the properties of crude oil.

Since the study doesn’t really wide and only few people attempt to do the statistical analysis on whole crude oil, I have to start with random variables.

So, I wonder if I can segregate the possible relationship just by using coefficient of determination or there is another method that much better compared to R-square.

Sorry but thank you in advance.

• Charles says:

Sorry Akira, but I don’t understand your question.
Charles

14. Alessio says:

Hi, something is the matter with the radj formula: (i) it cannot give negative r coefficients (I guess one needs to add a “sign(r)” factor before the sqrt; (ii) the content of the square root can be negative. Eg. when N=4 all r between -0.58 and 0.58 produce imaginary sqrt.

• Charles says:

Alessio,
You are correct. For this reason it is better to speak about the adjusted coefficient of determination (the square of the correlation coefficient). I have now changed the webpage to reflect this. Thanks very much for identifying this problem.
Charles

15. Emeka says:

Greetings Charles!
Weldone for this rich site. Pls can I run CORREL on two sets of data with different units. Eg. X has units in molecules/cm^2 while Y has units in molecules/cm^3. Thanks in advance

• Charles says:

Emeka,
Yes, the units won’t matter. The correlation coefficient is independent of the units.
Charles

16. Ramin says:

Hi!

I have an important correlation related question:

I am analyzing spiking Neurons.
I have 4 Island each contains 16 spiking neurons. Each neuron fires spikes randomly in a time frame of 250 us.

I want to find the correlation between this 4 islands, how can i do it ?

17. Tara says:

Thanks a lot Charles. Now I can find my way better.

18. mohsen says:

hi
may you please explain about correlation coeficient in multi variables i.e. y and x1,x2,x3,…
y=ax1+bx2+bx3+…. how to find a,b,c,… so that we attain best fitting.

19. Tanya says:

Hello,

I’m using excel to do a quick correlation. I was reading through the variables. I’m trying to make a correlation between performance metrics (rating scale is 1-5) and Versant Exam scores (rating is 1-100). Would it matter if the scales are different when I do the correlation?

• Charles says:

Tanya,
You can calculate the correlation coefficient even if the scales are different.
Charles

20. Tara says:

Hi
Please can you help about finding correlation coefficient between two dependent variables, each variable with four level but I want to find over all correlation between the two dependent variables without making regards for the levels which I do not know how to do it. Any insight will be helpful. Many thanks.

• Charles says:

Tara,
Sorry, but I don’t understand your question. In particular I don’t understand what levels you are referring to. Are these part of an ANOVA?
Charles

• Tara says:

Thanks Charles
Yes the levels are part of an ANOVA. I meant I want to find correlation between two dependent variables over the 4 levels that factor have. And My question is that can I use mean value for each level when I calculate correlation between ? if so I think I will be able to have correlation over that 4b levels of the two dependent variables.

• Charles says:

Tara,
I am afraid that I still don’t understand your question.
Charles

• Tara says:

I am sorry that I have not been able to explain my question.
For each dependent variable there are 2 factors one factor has 4 levels and the other factor has 2 levels. I can separate the factor with two levels when I test correlation but I want to keep the 4 levels together of the other factor when I test correlation. So I want to test correlation for factor 1 (a,b,c,d) with factor 2(a) then find correlation between factor 1(a,b,c,d) with factor 2(b). I test correlation between two dependent variables. Is this possible?
If so, can I use mean value of levels(a,b,c,d) when I test correlation?
I hope I could explained my question well.
Thanks a lot.

• Charles says:

Tara,

I’m not sure why you want to do this, but in any case here is my response to your question based on my understanding of what you are asking.

Suppose the data for 4 variables x1, x2, x3 and x4 are contained in the range R1 (with 4 columns, one for each variable) and the data for another variable y is contained in the range R2 (with 1 column and the same number of rows as R1). The correlation of x1, x2, x3 and x4 with y can be calculated by the Real Statistics formula MultipleR(R1, R2). This is essentially the R value in multiple linear regression.

The Correlation test described in Correlation Testing is between two variables x and y. If you define the x sample values as the mean of the corresponding values of x1, x2, x3 and x4, you can then test the correlation of x with y. It is not clear to me why this would be useful though.

Charles

21. Ni says:

Thank you for your prompt response.

If I don’t possess entity level information for any participants within category subgroups can I really correlate subgroups between categories?

Though I possess the standard deviations and means of the categories and subgroups within categories, I don’t see how I can calculate covariance. If I can’t calculate covariance is there another way to calculate correlation?

• Charles says:

You clearly need more than just the means and standard deviations of the samples to calculate the covariance, and, as you observed, you need to know the covariance to calculate the correlation.
Charles

22. Ni says:

Mr. Zaiontz,

Great website in so many respects.

Have a correlation question for you.

Here is my data structure:
1. Over fifty categories with the same two subgroups per category. Subgroup 1 Passes and Subgroup 2 Fails.
2. Not all categories possess the same size subgroups and not all categories are the same size.
3. Data for each category contains both subgroup means and standard deviations as well as the overall category mean and standard deviation.
4. The same participant population was evaluated in all categories. A fail in one category is also a fail in all the other categories.

Question:
With data formatted in this manner is it possible to correlate the categories?

• Charles says:

Sorry, but I don’t completely understand the premise.
Charles

23. jerome says:

can you please explain pair wise correlation ?

• Charles says:

Jerome,
If you have, say, 4 variables A, B, C and D, there are C(4,2) = 6 different pairs of these variables, namely AB, AC, AD, BC, BD, CD. Correlation coefficients are calculated on pairs of variables. Thus with 4 variables there are 6 pairwise correlations, namely correlation(A,B), correlation(A,C), etc.
Charles
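The counting in this reply can be sketched in Python. The data for A, B, C and D below are made-up values chosen only for illustration, and `pearson` is a hypothetical helper; `itertools.combinations` enumerates the C(4,2) = 6 pairs.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: B is exactly 2 * A, so correlation(A, B) should be 1.
data = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10],
    "C": [5, 3, 4, 1, 2],
    "D": [2, 7, 1, 8, 3],
}

# One correlation per unordered pair of variable names.
pairwise = {
    (a, b): pearson(data[a], data[b])
    for a, b in combinations(data, 2)
}

for (a, b), r in pairwise.items():
    print(f"correlation({a},{b}) = {r:.4f}")
```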

24. John Rodri says:

Hi Charles,

Is it possible to obtain any extra information from the correlation value? For example, if I have a correlation value of 0,85 could we say that 85% of the values correlate. Or can we say that all values correlate with 15% error?

• Charles says:

John,
Neither of these is true, although you could say that it is 15% short of perfect correlation.
Charles

25. John Gonzales says:

what is meant by the definition of correlation coefficient ” The correlation coefficient between two sample variables x and y is a scale-free measure of linear association between the two variables, and is given by the formula,” specifically scale-free measure? Please respond as soon as possible as this is for a project due this Sunday. Thank you for your time. -John G.

• Charles says:

John,

The covariance is a measure of the linear association between the two variables, but it is not scale-free. E.g. if the sample for variable x is {3,4,5,1,5} and the sample for variable y is {5,2,7,3,4}, then the covariance coefficient (population version, as computed by COVAR) is 1.08. If instead I multiply each of the sample elements by 10, the covariance coefficient will be 108, i.e. 10 x 10 = 100 times higher. Thus the covariance coefficient is not scale-free since scale matters (here scale means the size of the input data, not just their relationship to each other).

The correlation coefficient is an attempt to make the covariance coefficient scale-free. In this way only the relationship between the two variables is captured. Using the above example, the correlation coefficient for the original samples is .419425, the same as the correlation coefficient for the samples that are 10 times bigger. This is a scale-free measure. In fact, no matter what the size of the original data the correlation coefficient has a value between -1 and +1. The closer the correlation coefficient is to +1 the better (higher) the linear association between the two variables (i.e. when x is high, y tends to be high too and when x is low, y tends to be low). The closer the correlation coefficient is to 0 the worse (lower) the linear association between the two variables.

The same is true in the negative range, namely the closer the correlation coefficient is to -1 the better (higher) the linear association between the two variables, except that this time the association is the inverse of the positive association (i.e. when x is high, y tends to be low and when x is low, y tends to be high).

Charles
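The numbers in this reply can be reproduced with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the covariance is taken in its population form (divisor n), which is the version that gives 1.08 for these samples.

```python
import math


def population_covariance(xs, ys):
    """Population covariance (divisor n)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return population_covariance(xs, ys) / math.sqrt(
        population_covariance(xs, xs) * population_covariance(ys, ys)
    )


xs = [3, 4, 5, 1, 5]
ys = [5, 2, 7, 3, 4]
xs_scaled = [10 * x for x in xs]
ys_scaled = [10 * y for y in ys]

cov_original = population_covariance(xs, ys)
cov_scaled = population_covariance(xs_scaled, ys_scaled)
r_original = correlation(xs, ys)
r_scaled = correlation(xs_scaled, ys_scaled)

print(round(cov_original, 2), round(cov_scaled, 2))  # about 1.08 and 108
print(round(r_original, 6), round(r_scaled, 6))      # both about 0.419425
```

Multiplying both samples by 10 multiplies the covariance by 100 and each standard deviation by 10, so the ratio that defines the correlation coefficient is unchanged.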