# Multiple Correlation

We can also calculate the correlation between more than two variables.

Definition 1: Given variables x, y and z, we define the multiple correlation coefficient

where rxz, ryz, rxy are as defined in Definition 2 of Basic Concepts of Correlation. Here x and y are viewed as the independent variables and z is the dependent variable.

We also define the multiple coefficient of determination to be the square of the multiple correlation coefficient.

Often the subscripts are dropped and the multiple correlation coefficient and multiple coefficient of determination are written simply as R and R2 respectively. These definitions may also be expanded to more than two independent variables. With just one independent variable the multiple correlation coefficient  is simply r.

Unfortunately R is not an unbiased estimate of the population multiple correlation coefficient, which is evident for small samples. A relatively unbiased version of R is given by R adjusted.

Definition 2: If R is Rz,xy as defined above (or similarly for more variables) then the adjusted multiple coefficient of determination is

where k = the number of independent variables and n = the number of data elements in the sample for z (which should be the same as the samples for x and y).

Excel Data Analysis Tools: In addition to the various correlation functions described elsewhere, Excel provides the Covariance and Correlation data analysis tools. The Covariance tool calculates the pairwise population covariances for all the variables in the data set. Similarly the Correlation tool calculates the various correlation coefficients as described in the following example.

Example 1: We expand the data in Example 2 of Correlation Testing via the t Test to include a number of other statistics. The data for the first few states are as described in the Figure 1:

Figure 1 – Data for Example 1

Using Excel’s Correlation data analysis tool we can compute the pairwise correlation coefficients for the various variables in the table in Figure 1. The results are shown in Figure 2.

Figure 2 – Correlation coefficients for data in Example 1

We can also single out the first three variables, poverty, infant mortality and white (i.e. the percentage of the population that is white) and calculate the multiple correlation coefficients, assuming poverty is the dependent variable, as defined in Definition 1 and 2. We use the data in Figure 2 to obtain the values $r_{PW}$$r_{PI}$ and $r_{WI}$.

Definition 3: Given x, y and z as in Definition 1, the partial correlation of x and z holding y constant is defined as follows:

In the semi-partial correlation, the correlation between x and y is eliminated, but not the correlation between x and z and y and z:

Observation: Suppose we look at the relationship between GPA (grade point average) and Salary 5 years after graduation and discover there is a high correlation between these two variables. As has been mentioned elsewhere, this is not to say that doing well in school causes a person to get a higher salary. In fact it is entirely possible that there is a third variable, say IQ, that correlates well with both GPA and Salary (although this would not necessarily imply that IQ is the cause of the higher GPA and higher salary).

In this case, it is possible that the correlation between GPA and Salary is a consequence of the correlation between IQ and GPA and between IQ and Salary. To test this we need to determine the correlation between GPA and Salary eliminating the influence of IQ from both variables, i.e. the partial correlation $r_{(GS,I)}$.

Property 1:

Proof: The first assertion follows since

The second assertion follows since:

Example 2: Calculate $r_{(PW,I)}$ and $r_{P(W,I)}$ for the data in Example 1.

We can see that Property 1 holds for this data since

Observation: Since the coefficients of determination is a measure of the portion of variance attributable to the variables involved, we can look at the meaning of the concepts defined above using the following Venn diagram, where the rectangular represents the total variance of the poverty variable.

Figure 3 – Breakdown of variance for poverty

Using the data from Example 1, we can calculate the breakdown of the variance for poverty in Figure 4:

Figure 4 – Breakdown of variance for poverty continued

Note that we can calculate B in a number of ways: (A + B –  A, (B + C) – C, (A + B + C) – (A + C), etc. and get the same answer in each case. Also note that

where D = 1 – (A + B + C).

Figure 5 – Breakdown of variance for poverty continued

Property 2: From Property 1, it follows that:

If the independent variables are mutually independent, this reduces to

Real Statistics Functions: The Real Statistics Resource Pack contains the following supplemental functions:

CORREL_ADJ(R1, R2) = adjusted correlation coefficient for the data sets defined by ranges R1 and R2

MCORREL(R, R1, R2) = multiple correlation of dependent variable z with x and y where the samples for z, x and y are the ranges R, R1 and R2 respectively

Observation: Definition 1 defines the multiple correlation coefficient Rz,xy and corresponding multiple coefficient of determination for three variables x, y and z. These definitions can be extended to more than three variables as described in Advanced Multiple Correlation.

E.g. if R1 is an m × n data range containing the data for n variables then the supplemental function RSquare(R1, k) calculates the multiple coefficient of determination for the kth variable with respect to the other variables in R1. The multiple correlation coefficient for the kth variable with respect to the other variables in R1 can be calculated by the formula =SQRT(RSquare(R1, k)).

Thus if R1, R2 and R3 are the three columns of the m × 3 data range R, with R1 and R2 containing the samples for the independent variables x and y and R3 containing the sample data for dependent variable z, then =MCORREL(R3, R1, R2) yields the same result as =SQRT(RSquare(R, 3)).

Observation: Similarly the definition of the partial correlation coefficient (Definition 3) can be extended to more than three variables as described in Advanced Multiple Correlation.

### 211 Responses to Multiple Correlation

1. Vikas Handa says:

Sir
I came across your site while browsing some information on Multiple correlation. It is described very nicely. It would be great if you could expand it to four variables also.

Vikas Handa

• Charles says:

Hi Vikas,

When using more than 3 variables it is best to use matrices. For example in the Multiple Correlation webpage, I describe the function Rz,xy which is the multiple correlation coefficient of z with x and y. If there are 4 (or more variables), I refer to a matrix (or range in Excel) R1 which contains the data for all 4 variables and then look for the function which gives the correlation coefficient of the kth variable in R1 with respect to the other variables in R1.

This is better explored in the Multiple Regression webpage http://www.real-statistics.com/multiple-regression/multiple-regression-analysis/. At the end of that page you will find the supplemental function RSquare(R1,k) which calculates the square of this correlation coefficient. Thus the correlation coefficient of the kth variable with respect to all the other variables is given by the formula =SQRT(RSquare(R1,k)).

It turns out that this can also be calculated by first finding the correlation matrix for R1, then taking its inverse and then selecting the kth element on the diagonal. If this has the value c then the desired correlation coefficient is the square root of 1-1/c.

Thus the desired correlation coefficient can also be calculated by the formula =SQRT(1-1/INDEX(DIAG(MINVERSE(CORR(R1))),k)).

I hope this helps.

Charles

• bekalu says:

hi charles. woud you mind helping me out with this
when two variables (x and y) are correlated, then define the four possible explanations of correlation and give your example for each?

2. Rizky Tri Asmono says:

Dear
Did you have a reference book about multiple correlation and multiple regression?

• Charles says:

Dear Rizky,
The Bibliography of the website contains a number of good references which covers these topics. The website itself should serve as a good reference for these topics. Are there some things that you would like to see covered in more detail or more clearly in the website?
Charles

Hello 🙂

In my syllabus, there’s a topic called ‘basic matrices and their uses’…i dont understand relevance of matrices…so if u could please guide…or give some reference…thank u.

• Charles says:

Matrices are used quite often in statistics. There are many examples throughout the Real Statistics website. E.g.
http://www.real-statistics.com/multiple-regression/least-squares-method-multiple-regression/
http://www.real-statistics.com/multivariate-statistics/
Charles

Thank you sir,
What i mean is like correlation is a tool to know the strength and direction of relation between two variables, regression is to predict a dependent variable’s value through independent variable/s, what is the relevance of ‘matrices’? what is the uniqueness of matrices as a statistical tool ? i know i am missing something really basic., if you could pls help.
Regards.

• Charles says:

Sorry, but I still don’t understand your question. Matrices are a tool used in performing statistical analysis.
Charles

Im sorry sir, may be I cant explain what i want to ask…i meant that just like every statistical tool has a purpose, why were matrices devised in the first place? we can add, subtract, multiply numbers in any case, why we do we need to put them into a rectangular array and then do basic math on them like they do in matrices? Sir, if I am not making sense, please ignore this msg. Regards

• KP says:

I will try my best, what i understand from your questions.
Matrix is a rectangular array of variables or (numbers, symbols, or expressions). This arrangement is for systematically analyze the large number of variable or computation of complex relationship of variable (play with multivariate relation) . For example it is very easy to solve equations with two variables but it would be complex for more then two variable, but it is easy to compute through matrix system. It is useful to solve and generalize mathematical relationship of various field i.e. Statistics, Biological Sciences Economics etc. Because of its practical usefulness Matrix become new discipline in Mathematics.
Thank You
Regards

5. Ila says:

Sir,
I downloaded your Realstats Resource Pak , I put the Realstats.xlam in a folder called Realstats in my folder on a server. I clicked on Realstats.xlam, clicked on File>Help|Options>Add-Ins, clicked Go. Add-ins available are AnalysisToolPak, Analysis ToolPak – VBA, Euro Currency Tools, Solver Add-In. It does not contain Realstats. Could you please help me. Appreciate that very much. Thank you very much, Ila

• Charles says:

Hi Ila,
That Realstats does not appear is not surprising. You need to click on the Browse button and navigate to where you stored the file and then click on OK. RealStats should now appear. Click on Realstats and you should be able to use the supplemental functions and data analysis tools.
Charles

6. tnzal says:

If at initial state I only have 3 variables : length, width and thickness.
Can I create new independent variables by taking above variables,
for example : Weight = length x width x thickness x density, Volume = length x width x thickness.
Now I have 5 variables length, width, thickness, Weight and Volume to be tested.

All variables give a significant correlation factor When I did the correlation factor by using excel data analysis.
how to select the variables will be giving more representing model.

7. mikebinz says:

Hi Charles

Great posting, which I am really struggling to understand properly, having forgotten all the statistics I ever did (long ago!)

I’m wondering if you could briefly interpret the meaning of each Line of Figures 4 & 5 to gibve me a starting point

eg A+B+C Rp,wi2 0.33105
means that Proverty is well correlated to Infant Mortality, independent of being White, or whatever

• Charles says:

Mike,

I have updated the webpage so that things are explained a little better. Here is a bit more info:

A+B+C = R_p,wi^2 = .33105 means that 33.1% of the variance in P is explained from the association with W and I (and so 66.9% is not explained in this way). This is especially useful with linear regression since it gives you some idea about how good a linear regression model for P would be based on W and I. Here it is best to look at the webpages on linear regression and multiple regression.

From Figure 3 and 4, we see that the value R_p,wi^2 = r_pw^2 + r_p(i,w)^2. This is because the left side of the equation is A+B+C and the first term on the right side is A+B while the last term is C. This also follows by Property 2.

r_PI = .5644 (cell M5 of Figure 2) means there is a moderately high correlation between P (poverty) and I (infant mortality). If we factor out the effects of being white (W) from this correlation we get the partial correlation r_PI,W. From cell X38 of Figure 5, we see that the square of this number is .322505, and so taking the square root we get that r_PI,W = .5679, which is only slightly higher that the value for r_PI. This means that the being white (W) has virtually no impact on the correlation between P and I (and the little impact that it has is a dampening effect).

I know that this is a little confusing, but I hope that this explanation helps, at least a little bit.

Charles

8. Serge says:

Sir,
I’m new to this, so I may be asking a rhetorical question. Are we allowed to compare a simple correlation r of a criterion with a multiple correlation of R of the same criterion?

I think that R in this case will be larger than r and hence, R is closer to a perfect correlation. Is this always true?

• Charles says:

Serge,

Both R and r represent correlation. In fact with two data sets R and r are the same thing. Depending on what sort of assertion you are trying to make you can certainly compare them.

As can be seen from Figures 3 and 4 of the referenced webpage, R^2_p,wi = A + B + C and r^2_p,w = A + B, and so it follows that R^2_p,wi >= r^2_p,w. Thus R_p,wi >= r_p,w. This is generally true.

Charles

9. shailesh verma says:

sir ,
sir i request for ,
in multiple co-relation of cofficient what meaning of

b123X2 + b132X3=e123

this equation where e is the error. i dont no which error

• Charles says:

Shailesh, Sorry, but I don’t understand your question. Charles

10. Dana says:

Hi Charles,

I’m trying to compute a partial correlation but from the examples above I am still not clear on how to do it. Let’s say I have a 30 x 5 matrix (column 1 = subjects in an experiment, column 2 = variable 1, column 3 = variable 2, column 4 = variable 3, column 5 = variable 4). I have computed the correlation between variable 1 and 2 using the correlation function in Analysis ToolPak. But what I’m really interested in is computing the partial correlation between variable 1 and 2 while controlling for variable 3 and/or variable 4.

Your help would be greatly appreciated.

Best,
Dana

11. Tim says:

“Unfortunately R is not an unbiased estimate of the population multiple correlation coefficient, which is evident for small samples”

Golly, that really is unfortunate!

12. Claire says:

My daughter is trying to do a correlation assignment using three variables. She tried using tennis results (height, weight and age) but found no correlation. She also tried cricket results (age, height & success rate). I am despairing as I can not help her. Do you have any suggestions? She is in Year 12 and has chosen sport results but it could be anything!

• Charles says:

Hi Claire,
Not sure what the problem is. Is she trying to find three variables that correlate? What is she trying to accomplish?
Charles

13. Daniel says:

Hi Charles,

You give the multiple correlation coefficient in Definition 1, and an adjusted multiple correlation coefficient in Definition 2. I’m trying to find a correlation between 1 dependent variable and 2 independent variables, so do I have to use the adjusted multiple correlation coefficient (Definition 2) to accomplish this? Or do I just use the multiple correlation coefficient in Definition 1?

Dan

• Daniel says:

Also, how can you determine the p-value of such correlation?

• Charles says:

Daniel,

The test that the multiple correlation coefficient R is zero is the same as the test that the multiple regression model is a good fit for the data as explained on the webpage http://www.real-statistics.com/multiple-regression/multiple-regression-analysis/.

The test statistic is F = (R^2/k)/((1-R^2)/df) where k = # of independent variables, n = the sample size and df = n – k – 1. p-value = FDIST(F,k,df).

Charles

• Arman says:

Charles,
Can you please tell what is FDIST here? I am also stuck with the same problem of how to find p-value of such correlation.

• Charles says:

Arman,
Sorry, but I don’t see any reference to FDIST on the referenced webpage.
Charles

• Charles says:

Daniel,
Generally R^2 is used (where R = the unadjusted multiple correlation coefficient), even though the adjusted multiple correlation coefficient is a less biased estimate of the population correlation coefficient, and so should be a better estimate.
Charles

14. Rahul says:

Are multiple correlation and multiple regression same?

• Charles says:

Rahul,
Not exactly, although they are clearly interrelated. To calculate the multiple correlation coefficient you can use the results for R^2 from multiple regression.
Charles

15. Simao says:

Hi

I am trying to do a multiple correlation but my independent variables were obtained from different sized populations.
regards

• Charles says:

Hi Simao,

Do you mean different sized populations or different sized samples? The calculation of the multiple correlation coefficient described on the referenced webpae is based on the sample data used and not on the underlying populations. I will try to answer the question based on missing data. If this is not what you intended, please elaborate.

On the referenced webpage the correlation coefficient is calculated from a sample of 3-tuples. If one or more of the data elements in the 3-tuples is missing, generally you have three choices: (1) drop that 3-tuple from the sample, (2) calculate the correlation using a pairwise approach whereby only non-missing pairs are used in calculating the correlation coeffcient for that pair of variables (note that the calculation of the multiple correlation coefficient is based on the values of the three pairwise correlation coefficients) and (3) impute the value(s) of the missing elements(s).

The first two approaches are described in more detail on http://www.real-statistics.com/multiple-regression/least-squares-method-multiple-regression/ when I explain two approaches for calculating a covariance or correlation matrix. The third approach is described in http://www.real-statistics.com/handling-missing-data/.

Charles

• Simao says:

Yes I meant different sized populations. Each population produce the same amount of samples that I pretend to correlate. But the populations are different in size. Should I do any adjustment for this?

regards

• Charles says:

Simao,

I don’t have a precise answer for you, but I can offer the following suggestion:

If the finite populations are large, you probably don’t need to do anything. The usual correction factor for finite populations is to multiple the standard deviation by the square root of (N-n)/(N-1) where N = the population size and n = the sample size.

Charles

16. mhymhy says:

sir,

i have a question, what statistical instrument can i used in my thesis if i have 4 variables????

17. LauHai says:

Dear Charles,

I have some set of data for data analysis and l seek your assistance in analyzing them using correlation and regression techniques

• Charles says:

What sort of assistance are you looking for? You are welcome to ask questions.
Charles

18. Gorge says:

How you, Charles?
I was confused by the printout. I did not how to interpret it. Please help me. Thank you!
A scholar was interested in determining if there were differences in correlations between anger and depression when removing the effects of self-esteem for both variables among students (Group 1), teachers (Group 2), and farmers (Group 3). Results are as follows. Please provide a complete conclusion and explanation:
The partial correlation for group 1=0.34
The F value= 5.99 with a probability=o.018
The partial correlation for group 2=0.8018
The F value=84.6 with a probability =.00
The partial correlation for group 3=.3
The F value=4.7 with a probability=0.03
The global test for equality=18 and it has a probability 0.0001

3. Step Differences
1 st partial corr 2 ND partial corr Rstat Prob
(1) .30 (3) .80 5.4 0.00
2. Step differences
1 st partial corr 2Nd partial corr Rstat Prob
(1) .3 (2) .33 .25 1
(2) .3 (3) .8 5.11 0.00

• Charles says:

Gorge,
It sounds like you are referring to a printout from some other statistics tool (SPSS, SAS, etc.), which I don’t have, and so I am not able to comment.
Charles

19. cesare says:

Dear Charles,

thanks for the nice breakdown.

I have a simple problem where i have 3 dependent variables (a, b and c) and i would like to isolate c to see how the a affect b independent of c.
i assume i have to use the partial correlation formula in definition 3 right? (the first of the two…
and i assume the r values are the r (correlation coefficients) i get from a pearson correlation.
i insert those in my formula and i get my results, but they are not between 1 and 0.
how is that possible?

i checked the excel formula several times.
have I made a mistake or is it possible that i get a value above 1? (1.19)

c

• Charles says:

Dear Cesare,
Yes, Definition 3 and Pearson’s are correct, but you should never get a value larger than 1. If you send me an Excel spreadsheet with your data I will try to figure out what has gone wrong.
Charles

20. shaista says:

Hello,
Hope you will be fine. I need your assistant about computing R formula if i have 5 independent and one dependent variable. please guide me how can i modify R formula according to my scenario

• Charles says:

Sorry, but I don’t use R. The site is about using Excel for statistics.
Charles

21. shaista says:

I mean how to calculate regression coefficient (R) for 5 variables.

• Charles says:

This is explained on the webpage Multiple Correlation – Advanced. Alternatively you can use multiple regression to calculate the value of R.
Charles

22. Matteo says:

Hi Charles

Can you use interchangeably the pairwise correlation coefficient between independent variable x and dependent variable z, and partial dependence plot of z on x?

Thanks,
Matteo

• Charles says:

Matteo,
If you are asking whether the correlation of z on x and y is the same as the correlation of z on y and x, the answer is yes.
Charles

23. Tanuj says:

Hi Charles,

I am looking for a relationship or a limit on correlation coefficient between x and z, given the corr. coefficients b/w x and y, and y and z. Could you please elaborate on this kind of problem?

Thanks a lot!

• Charles says:

Sorry, but >I don’t understand your question.
Charles

24. Kareem says:

Hi,

can i use the multiple correlation formula of definition 1 for 3 independent variables?
Also for the adjusted formula, what is n? Thanks.

• Charles says:

Hi Kareem,
You need to consider one of the variables to be the dependent variable, otherwise Definition 1 doesn’t make sense.
n is the sample size of variable z (which is equal to the sample size for x or y).
Charles

25. S. Siamai Kromah says:

This Excel’s CORREL(variable1, variable2) function to run autocorrelation (ACF)for four series – et, variables A, B, & C, results not correct.

Can anybody help?

• Charles says:

Excel’s CORREL is intended to calculate the correlation between two data sets. If instead you are looking for the correlation between 4 data sets, then you need a different function. You can use the approach described on the webpage Multiple Correlation – Advanced.
Charles

26. Anurag Singh says:

Hi,
I have one variable output temperature which is dependent on 6 variable which are length,depth,velocity,conductivity,time and diameter. how can i make correlation between them and how accurate it will be? Thanks in advance

• Charles says:

Hi,

You can simply run the regression model and this will calculate the desired correlation (as well as the adjusted correlation, which is a less biased estimate).

Alternatively, you can calculate the correlation directly using the Rsquare function, as described on the webpage

Charles

27. Victor says:

Good day sir pls can you recomend text books on statistics for higher learning here is my line 08160817081

• Charles says:

Victor,
There are lots of textbooks available. It really depends on (1) which topics you are interested in, (2) are you interested in the theory or just how to conduct the tests, (3) how mathematical should it be, etc.
Charles

28. Dakota Parker says:

I have a rather large data set and need to know how to calculate a multitude of correlations, for example the data set I have is combined of demographic data and crime data and I need to know how each demographic data point correlates to each crime data point. Is there anyway to use EXCEL to calculate such data correlations or is it only possible to calculate a small data set at a time.

• Charles says:

You can use the CORREL function to calculate the correlation between two data sets even if they are large.

If you want to calculate all pairwise correlations between a number of data sets, you can use the Excel Correlation data analysis tool or the Real Statistics CORR function.

There are many other possibilities in Excel depending on the specific problem you are trying to solve and the format of your data.

Charles

29. Richa Dubey says:

Can u plz send me interpretation of any table of multiple correlation

• Charles says:

What table of multiple correlations are you referring to?
Charles

30. Jeffrey says:

Hello Charles,

According to wikipedia, the multiple correlation coefficient should be between 0 and 1. Let’s generate 10 random numbers between -1 and 1 for each of r_xz, r_yz, and r_xy and then compute R.

r_xz = -0.42484496 0.57661027 -0.18204616 0.76603481 0.88093457 -0.90888700 0.05621098 0.78483809 0.10287003 -0.08677053

r_yz = 0.91366669 -0.09333169 0.35514127 0.14526680 -0.79415063 0.79964994 -0.50782453 -0.91588093 -0.34415856 0.90900730

r_xy = 0.77907863 0.38560681 0.28101363 0.98853955 0.31141160 0.41706094 0.08813205 0.18828404 -0.42168053 -0.70577271

R = 2.0302762 0.6704749 0.4608397 4.1256652 1.4283935 1.5836626 0.5178415 1.3375018 0.3472992 1.1998119

It is clear that there are values of which R is greater than 1. Am I missing something here?

I would really appreciate your help!

Cheers

• Charles says:

Jeffrey,
You can’t simply generate 10 random numbers since the values of r_xz, r_yz and r_xy are not independent. You can generate random numbers for x, y and z and then compute the values of r_xz, r_yz and r_xy using CORREL. When you then calculate the values for R, you should get values less than or equal to 1.
Charles

31. Amaka says:

Pls how can I compute for 4 variables without y variables

32. Wytek says:

Hi Charles,
This page really has a lot of readers responding. I am looking at this page from the regression point of view. I am looking at the State Rankings data and wonder how to explain the variation in say Poverty in terms of the remaining data. In other words, how much does the variation in income or education contribute to the variation in poverty?

So, compute the overall R^2 for the entire data set and then compute the R^2 for the data set but leave income data out. The difference between the two R^2 values is the semi-partial correlation coefficient and it account for the contribution made by income to the variation in poverty.

semi-partial r^2 = R^2 – Ri^2
where the i in the second term indicates the ith independent variable excluded.

There is another quantity VIF (Variance Inflation Factor) where
VIF = 1.0/(1-Ri^2)

Interestingly, we can compute the partial r^2 like so
partial r^2 = semi-partial r^2 * VIF = (R^2 – Ri^2)/(1 – Ri^2)

33. Maaz says:

Hello, I Have to make a Report. It is complex data with more than 3 variables. My boss needs a Report for the sales (more than 1500 types of model) of two years separately, with their respective customers, their locations, Promoters at those locations. Each customer has more than 35 branches all over the country to whom we supply our Goods.

I am not an expert.

• Charles says:

Sorry Maaz, but you have not provided enough information for me to even give you a suggestion as to how to proceed.
Charles

34. Gollapinni Karthik says:

This is well explained and it is really helpful. But I have two doubts for the extension of these.
1) How can I get correlation from correlation coefficient when there are two independent variables(x,y)?
2) How can I get correlation from correlation coefficient when there are three independent variables(x,y,z)?

I am talking about reverse engineering I guess. I have the standard correlation coefficient values but I want to get correlation out of them.
The problem is in some cases I have 2 independent variables or 3 independent variables and in some cases I 4 independent variables.

So my doubt is how do I get correlation value from correlation coefficients when I have many independent variables

• Charles says:

The referenced webpage describes how to calculate the correlation coefficient with 1 dependent variable and 2 independent variables. With more than two independent variables, please see the following webpage:
Charles

• Gollapinni Karthik says:

Hi Charles,

Thank you. I have gone through them.
I have the correlation coefficient already with me. I want to to calculate correlations from the coefficients.

Thanks,
Karthik

• Gollapinni Karthik says:

Hi Charles,

I have mailed you the issue. Kindly have a look at it.

Thanks,

Karthik

35. Kristina Fassbender says:

Charles,

I am analyzing language production data for 18 individuals across 3 sampling times. The stimuli are the same each time and my variable is number of productive words. Would a multivariate correlation be appropriate for this type of data to examine the test retest stability (i.e., is the group stable over three times or are they significantly different)?

Thank you for making this software available for all!

Cheers,

Kristina

• Charles says:

Kristina,
To find out whether there are differences over the 3 times, you might use Repeated Measures ANOVA. This is described on the Real Statistics website.
Charles

36. Ivan says:

Hello Charles,

Thanks for the explanation of this, but, as a novice i am still having some troubles understanding. You can say I am a learn-by-reallife-example kind of guy.

Say i have to do debt purchase scoring, meaning what % of the nominal price i am willing to pay based on account characteristics. I have benchmarks including various types of data, including those characteristics. Lets say that each account in my benchmark has:
– product type
– debt amount
– days past due
– age of debtor
– employment status of debtor
– principal/interest ratio
– monthly cash collection hitory

From benchmarks, i should somehow get weight, or score, for each of the mentioned characteristics, so when i compare it to new accounts, i can say that if i am estimating a new account that is a housing loan, i will give lets say 2% of value, if the age is 35-45, additional 1%, if due is 1000-2000\$, additional 4%, etc. I think you get the point.

My questions are:
– how to determine the weight(or % value) of each account characteristic
– how to compensate the missing data (if i do not have age for example, i can not hive it 0%, but i can not also give it max %; should i use some average, or utilize some sort of compensation factor?)

Ivan

• Charles says:

Ivan,

1. how to determine the weight(or % value) of each account characteristic

2. how to compensate the missing data
See the following webpage
Missing Data

Charles

• Ivan says:

Charles,

For the 1st question, What kind of data would you need?
What i have is historical data for benchmarking. So basicaly, like in the example list above, i have an account like:

Housing Loan, Debt is 2456\$, Days past due is 411 days, debtor is unemploced and 46 years old, ratio between principal and interest is 1,8, and he has payments (cash inflow) in the amount of 220\$ since he is in my portfolio.

I have data like that for approx 60 thousand accounts. And based on that, i have to weigh the value of each characteristic regarding cash inflow, to basicaly see how valuable is the new portfolio to me (to break even with my investment and gain profit after the initial period).

So based on benchmark i should be able to know is the loan, or current account, or anything else, in newcoming portfolio worth 2% of the nominal value, od 3% or something else? And all that based on historical data.

Thank you

• Charles says:

Ivan,
Unless I am missing something, you need to determine the weights based on your knowledge of the problem that you are trying to solve. This doesn’t appear to be a statistical issue, but one that has to do with your knowledge of the real-world problem.
Charles

• Ivan says:

Charles,

Well, in my opinion, if the historical data suggests that, for instance, 20% of Housing Loans will pay approx 30% of their debt in 1st year after purchase, that should be considered as a statistical conclusion? Unless I am missing something in the very foundation of the matter.

Thank You,

Ivan

• Charles says:

Ivan,
Ok, maybe you are right that this can be considered a statistical matter, but in any case I don’t know how to calculate the weights based on the information that you have provided.
Charles

• Ivan says:

Charles,

This is a quote from a web site i stumbled upon a few days ago but i seem to have lost the hyperlink. But basicaly, it is what i am looking for, but can not seem to get a grasp of it in excel.

—quote—
Statistical models function in much the same way as judgmental models. However, in choosing the factors to be scored and weighted they rely on statistical methods rather than the experience and judgment of a credit executive.

Statistical models consider many factors simultaneously, a process that calculates and analyzes multivariate correlation to identify the relevant tradeoffs among factors, and assigns statistically derived weights used in the model. The key factors are generally captured from credit agency reports and the credit files of the client.

Statistical models are often described as a scorecard, a pooled scorecard, and a custom scorecard. A scorecard uses data from one firm. A pooled scorecard uses data from many firms. A custom scorecard blends a statistical model with some of the factors used in a judgmental model.
—end quote—

37. Pragati says:

Hi Charles,
I need to predict number of orders for next week for each weekday at different time intervals based on previous 6-7 weeks.I was trying to use the trend function in Excel but problem is that my output is dependent on 2 factors-weekday and time of the day.How do I go about it?

• Charles says:

Pragati,
It looks like you have to deal with seasonality. See the following webpage for help:
Regression Analysis with Seasonality
Charles

• Pragati says:

Thank you Charles!

38. Parth says:

hey i want to find a relation between 3 quantities x,y and z; x and y are independent and z depends on x and y, such that z is directly proportional to x raised to m and y raised to n. is there a way i can find m and n?

• Charles says:

Parth,
If the relationship is z = ax^m + by^n, then you have a nonlinear regression problem. One way to find values for m and n (and a and b) is to use Solver. The approach is similar to that shown for exponential regression. See the following webpage for details:
Exponential Regression using Solver
Charles

39. H says:

Charles,
I believe I found the answer to this question as I read along, but I would like confirmation before I move forward. I have test scores that I would like to correlate–simply looking for the strength of relationship between skills assessed. The scores are on different scales (standard score–average 100, scaled score–average 10, and raw scores–# correct). I’m a little rusty on my stats. I’m thinking that I do not need to convert these scores to the same scale before I calculate correlation. Can you confirm? My other option would be to convert all scores to raw scores then correlate, but I’m thinking I don’t have to do that.
Thanks much,
H

• Charles says:

H,
If the conversion that you have in mind is to multiple the score by some constant and/or add some constant, then the conversion will have no affect on the correlation coefficient. You will get the same answer whether you make the conversion or not. Try it.
Charles

40. Dear Sir
Namaste
I have soil data(nitrogen,phosphorus,potassium,calcium,magnesium content in soil),Plant data(nitrogen,phosphorus,potassium,calcium,magnesium content in leaf) and yield data of ber plant.Sir How i calculate multiple correlation between soil data and ber yield and between leaf data and ber yield.
With regards

• Charles says:

This really depends on the details, but perhaps you can use the approach described on the following webpage:
Charles

41. faizalhakim says:

Hi Sir

Hope you in good health always.

I’m working on the land use change affecting the climate changes.
I have 9 parameters of climate for 3 periods and
8 type of land uses for same 3 periods.

I want to observe whether changes in the land use affected by the climate parameters,
and I want to identify the most affected climate parameter to the land use changes.

I’m inquire your advice and opinion sir cause I’m really poor in statistical analysis.

• Charles says:

Sorry, but you haven’t provide me with enough information to be able to offer any advice. Perhaps you could use ANOVA, MANOVA or regression, but I can’t really tell.
Charles

42. Sylvia says:

Hi!

First of all, thanks a lot for your help, as it is very useful for people like me, who are not very familiar with statistics.

I have maybe an stupid question. I would like to see the correlation between two different cell populations (let’s see cells A and cells B, how the population of A change according to the population of B). In addition, I have two different kind of animals (WT and KO). I can see correlation between these cell population in KO but not in WT, probably because i dont have too many animals (4 in each condition).

My question is: as i would like to know the global changes of these 2 cells populations, is it possible to merge WT and KO to do the correlation? When i do this, i observe very nice correlations, and within the graph i can observe how WT mice have for example less A cell populations and KO more…My boss considered that you can only do the correlation using one mice population, but not altogether, but i think this is possible to do it (maybe im wrong!), as i wonder the global changes of these 2 cells populations independently of these cells comes from WT or KO mice.

I am not sure if i explain appropiartly. I look forward to hearing from you. Again, Thanks a lot for your help!

• Charles says:

Sylvia,
Sorry, but I don’t understand the scenario you are describing or your questions.
Charles

43. Alexa Manabat says:

Hi Charles!
What statistical treatment would you recommend for us to use if we are trying to detemine the relationship between a single dependent variable and 10 independent variables. Thank you.

• Charles says:

Alexa,
It really depends on what you mean by “the relationship between”. E.g., you could use multiple regression for this. If, however, you want the correlation coefficient, then you could use the value calculated as shown on the following webpage:
Charles

44. Abiola says:

Hello sir,
I have the following data
independent variable- age (in groups e.g 31 to 40, 41 to 50, 51-60)
Dependent variables are 25 different factors influencing turnover e.g. pay, tenure, job insecurity, job stress. Responses are on a 4-point Likert-type scale.
All respondents scored each of the 25 dependent variables on the Likert scale.
I need to find out if age has an effect on the factors affecting turnover. For instance is there a difference in the factors important to age group 31 to 40 and age group 41-50 and age group 51-60?
What technique do I employ?

Thank you

• Charles says:

Abiola,
If I understand the situation properly, MANOVA might be a good technique to use, assuming that there is some correlation between the factors. You have one independent variable (Age) and 25 dependent variables.
Charles

45. Aku says:

Dear Dr. Zaiontz,
I have one dependent variable (binary categorical) and 7 independent variables; 6 of them are binary categorical and the seventh independent variable is age. I want to use correlation to determine if the independent variables affect the dependent variable (falls). Can I use multiple correlation for this?
Thank you,
Akubue

46. BGoertz says:

Dr. Zaiontz,

Good afternoon. I am hoping I can get your guidance on performing a forecast analysis.

I have two independent variable: operating days and number of physicians

I have two dependent variables: total revenue and total number of patients

I have 36 months worth of data and am trying to predict 12 months out but want to be able to understand how much ‘weight’ each independent variable may have over the dependent variables. Obviously, the more physicians you have, the more patients you can see. Same for days to patients. There is a relationship between Total Revenue and Total Patients as well.

I have calculated the correlation between the variables and can these individual correlations to predict Total Revenue 12 months out but how do I factor in ALL of the variables? Is that possible?

• Charles says:

You can use multivariate multiple linear regression. Unfortunately, the website doesn’t yet support this technique.
Charles

47. hemjay says:

i have five independent variables and i want to determine the relationship between them please how can i?

48. semir says:

Hi charles i have one question regarding to rank correlation coefficient .how i can compute Rs in having three rank like rank by A,B,and C and how to find the nearest approach. Thank you

• Charles says:

Sorry, but I don’t understand your question. Perhaps you are looking to use the RANK.AVG function to establish the ranks and then the Real Statistics MCORREL to calculate the correlation.
Charles

49. Catarina says:

Hi!
I’m trying to see the correlation between a dependent variable and a vector that is defined by two independent variables (one is the x and the other is the y, when put together they define an exact point). How can I do that?

• Charles says:

Catarina,
This certainly sounds like the situation where you use multiple correlation. In particular, you can use the Real Statistics MCORREL function.
Charles

50. Baahr says:

Hi Charles,
I am trying to see if there is a correlation between three antibacterial medicine on three species (specifically on genes).
Example:
Gene SpeciesA SpeciesB SpeciesC
aa 1 1 0
ab 1 0 0
ac 0 0 1
1=sensitive, 0=resistant
I wanted to check if SpeciesA response is dependent or independent of SpeciesB or C for the particular medicine. In other words, is there any correlation of sensitivity or resistance to different species?
Baahr

• Charles says:

Sorry, but I don’t know how to define the concept of “correlation of sensitivity or resistance to different species”
Charles

• Baahr says:

Hi Charles,
Basically I wanted to see the correlation between 3 things (species) using binary data. I have 123 observations or data points. What are possible options for such data.
Thanks.
B

• Charles says:

Baahr,
Probably you can use multiple correlation, e.g. by using the MCORREL Real Statistics function.
Charles

51. Pankaj says:

Hi all,
How to find the correlation b/w the variables such as x,y,z in which x has only 0,1 values, y has continuous values and z has categorical values.
Is there any correlation b/w them ??

• Charles says:

You should be able to use the technique shown on the referenced webpage. Whether the result is meaningful is another issue.
Charles

52. Kanak Roy Chowdhury says:

Using Data Analysis tab in excel, I tried multiple regression by selecting Regression Analysis. When I tried to select multiple columns for Input X Range, it rejects saying only 1 column to select. Can’t we use data analysis tool for multiple regression?

• Charles says:

You should be able to do multiple regression with up to 16 independent variables using Excel’s Regression data analysis tool.
Charles

53. Binod Chhetri Bhandari says:

I am planing to have research on multidimensional relationship of Socio-economic status(SES) and ethnicity on WASH (Water sanitation and Hygiene) and nutrition. Please would you suggest me the good statistical method for showing linkage or not within them.

• Charles says:

Charles

• Kanak Roy Chowdhury says:

Thanks. I have found it.

54. Kanak Roy Chowdhury says:

How to understand, standard deviation of a sample is high or low? Calculated mean from the set is 80.2 & standard deviation is 13.4

• Charles says:

Kanak,
There is no real assessment about whether a standard deviation is high or low. You can look at the ratio mean divided by standard deviation, but whether this is high or low depends on knowledge about the domain being studied.
Charles

55. bekalu says:

The link had really helped me. honestly, it was not a homework. i was just arguing with my peers. thank for your help. i will suggest my friends to look over the page.

• Charles says:

Charles

56. Edmond says:

hello thanks for this forum..please i really need help here.. i am doing a master 2 research on physical education and sports..and my topic is ” A comparative study of the uchikomi shuttle run test, special judo fitness test and 20m shuttle run test in physical fitness for cameroon judoists”. i have three independent variables and one dependent variables..pls i need to know wch test i can use to analyse my data

• Charles says:

Edmond,
You have not provided enough information for me to be able to give an answer. What hypothesis are you trying to test?
Charles

Hi I am working on a research project where i have to determine the co relation between Locus of control – Internal and External and stress and anxiety. Here there are two independent variables and two dependent variable. Which method of calculating co relation and which tool will be beneficial for that for a sample of 100.

• Charles says:

What hypothesis (or hypotheses) are you trying to test?
Charles

The hypothesis is that there is relationship between locus of control and stress and anxiety.
Now locus of control is both internal and external which here are independent variables while individually stress and anxiety are dependent variable.

• Charles says:

Generally, correlation is defined when there is one dependent variable and one or more independent variables. This is related to linear regression. It is not clear what you mean by correlation when you have two dependent variables and two independent variables.
Charles

Hi! I am working on my dissertation in which I have to correlate students math test score with different variables like confidence, interest, perseverance etc. I have tried to get help from online tutorials but they are not very helpful. Can you please guide me how to perform correlation with all these around 5 variables. Within one variable I have framed around 5 questions. An immediate reply will be highly appreciated.

• Charles says:

You can do this by performing a linear regression and using the R value, which is the correlation.
Charles

59. kiran says:

Hi,
I am doing my research on socio-economic determinants of child labor. I am really confused to use which method as i have more than 10 variables like poverty,parental education and unemployement, flaws in education,government ignorance,dowry,early marriages,parental death, migration, shocks. And are these factors effecting child labor if yes than to what extent. kindly guide me to use which method and explain a bit how?

60. Elena says:

Hello, there.

I am conducting a mini research project for my degree.

I am researching based on three variables which include depression, neuroticism, and procrastination. I am looking to find a relationship between all three variables. I am not sure of what kind of analyses to use. Could you give me any advice?

• Charles says:

Elena,
Which analyses to use depends on a number of factors, including: (1) what sort of data you have, (2) what you mean by “relationship”, (3) what sort of hypotheses you might want to test, etc.
In any case, in these sort of situations, it would be common to use multiple correlation or multiple regression.
Charles

61. Oscar says:

Hi Charles,

I think I’ve made a know in my head. My thesis is on ethical leadership, inclusion and employee voice. Basically, I hypothesis that if employees perceive their leadership to be ethical it will lead to higher perception of inclusion and employee voice. I also want to collect demographics as control variables.

I’m using 3 Likert scale questionnaires (ordinal data), to ask employees about 1. Ethical Leadership 2. Climate of inclusion 3. Employee voice (employees speaking up). On one hand this looks like a correlation, as there’s no cause (independent variable), though on the other hand wouldn’t Ethical leadership be an independent variable? In the sense that I hypothesis it will affect climate of inclusion and employee voice?

Oscar

• Charles says:

Oscar,
You can certainly view ethical leadership as an independent variable. Yes, in this case you can test whether more or less “ethical leadership affects [i.e. correlates with] a climate of inclusion and employee voice”. You might use correlation, regression, t tests, ANOVA, etc. to test this.
I don’t know what you mean by “I also want to collect demographics as control variables.” In what sense will these be control variables?
Charles

62. Dr. Abhishek Jayant says:

hi charles,
can u please help me with this. my thesis is MDCT PREDICTORS OF SURGICAL INTERVENTION IN ADHESIVE SMALL BOWEL OBSTRUCTION. so basically i am trying to predict whether a patient with intestinal obstruction needs surgery or not on the basis of CT scan findings. i have taken 75 retrospective cases and evaluated whether these findings (6) were present or absent in case of intestinal obstr. patient. and then i saw in the medical records, how many of them were operated. so i calculated the senisitivity, specificity, PPV and NPV of each of these findings/variables in predicting surgery in a case of intestinal obstruction. but now i want to apply these findings in 25 prospective cases to see whether or bot my hypothesis is right. please tell me if there is a way to do it and if yes how. thanks.

63. John Brown says:

Hello Charles, please advise. I am currently trying to run a regression analysis with 5 Independent variables, and one dependent. I am trying to determine if the factors( IV) influence the DV. The regression analysis continues to yield very high P-values and only one significant T-Stat. I believe that for this problem I should be finding the correlation but, I do not know how to find the correlation of data with regard to multiple IV’s. What is the best way to find the correlations and see if the IV’s indeed are influencing the DV? I thank you in advance!

• Charles says:

John,
The regression analysis actually includes this result (i.e. the Multiple Correlation value on the report). If only one IV has a significant p-value, then only that variable has a significant influence on predicting the DV.
Charles

64. Marion Dahousi says:

Dear Charles,

First of all, thank you very much for the description above, this is really helpful. I have a question about one of my analysis and can’t find the answer, I hope you don’t mind asking you.
I am analysing biomechanics of hands. For this, I measured the pressure a hand can apply to a handle after different neurological issues. I measured both hands. Between the left and the right hand was no significant difference in none of the parameters I am using. Now, I am not really sure how to deal with it. Can I use the measured values from both hands or do I need to analyse left and right always separately? Is there a possibility to correct for the inter-hand correlation to use all values?

• Charles says:

Marion,
How did you determine that there was no significant difference between the left and right hands? Did you use a paired t test (or Wilcoxon signed-ranks test)? You don’t need to consider correlations with these tests.
Charles

• Marion Dahousi says:

Dear Charles,

Thank you very much for your reply. Yes, I conducted a two-tailed paired t-test to get the statistical significance between both hand powers.
My question is, if I now want to use the parameter power to do multiple regression analysis, if I can use the power values from both hands as one sample or if I do need to split it (as they are not independent)?
Marion

• Charles says:

Marion,
In the regression analysis you can use the variable “left hand” (or some parameter about the left hand) and the variable “right hand” or some parameter about the right hand. Thus for each element in the sample you will have a value for left and right hands. What you can’t do is have a variable “hand” (or some parameter about hand) and have a sample element for left hand and another sample element for right hand. This would violate the randomness assumption of your sample.
Charles

• Marion Dahousi says:

Dear Charles,

Thank you very much for your help!

Best wishes,
Daniela

65. Hello Charles,
Basically i am not into math. But I need some help in correlation which I need to use in my Structural engineering research work. Hope you would help me out with this.
I have the data for different states in India, for the construction type of the building i.e., Material of wall and material of roof (Sheet 1). And I have the data for these two tables correlated (Sheet 2). But the problem is I am not able to find the methodology about how the values were executed in sheet 2. I am attaching the excel files link.
https://www.dropbox.com/s/n5wtcq4q83hjq1d/census%20state%20wise.xlsx?dl=0

• Charles says:

Ravi,
I was only able to access Sheet 1.
For simple correlations (between two samples), you can use Excel’s CORREL function. For correlations between one variable and multiple variables you can use regression. For more complicated analyses, perhaps Canonical Correlation would be useful.
Charles

66. Elakkiya says:

Hi charles, I found your blog is so helpful. thank you very much. I am doing research in computer Science Engineering. I am not good at Math. i need to find out the correlation between the two variables with respect to third variable . For example i have 2 features like GPA, IQ ,SALARY and class labels(PROFESSIONAL, NON PROFESSIONAL). now i want to know the correlation between the GPA and SALARY with respect to Class label.what kind of Formula i should use. what if i refer i can find out my answer? please help me.

• Charles says:

Elakkiya,
You can use the MCORREL function as described on the referenced webpage. Alternatively you can perform multiple regression and use the calculated correlation value.
Charles

• Elakkiya says:

Thank you very much

67. Chankit says:

hey,
I have data of 600 people out of which only 15 got passed. I want to find the correlation between these two entries. How can I calculate??

• Charles says:

You can use the CORREL function to calculate the correlation between two data sets.
Charles

68. Fatima Charloos says:

Hi Charles

Please could you assist me. I am required to find correlation between 4 variables: time, temperature, rainfall and stream flow. Please could you advise a formula that I could use to find the correlation between these 4 variables. Thank you

• Charles says:
• Fatima Charloos says:

Thank you!

69. APARNA says:

9. To understand the work commitment level of employees at KEL
The following factors induces work commitment into me :

Factors
Highly Satisfied
Satisfied
Neutral
Dissatisfied Highly Dissatisfied
a. Working Hours

b. Security Measures

c. Job Security
d. Transportation Facility
e. Leave Policy
f. ESI , EPF Benefits

10. To assess thse motivation level of employees at KEL
The following factors induces motivation into me :

Factors
Highly Satisfied
Satisfied
Neutral
Dissatisfied Highly Dissatisfied
a. Training & Development

b. Social Status

c. Recognition & Reward for performance
d. Monetary Reward
e. Promotion Policy
f. Counseling

11. To assess the job satisfaction level of employees at KEL
The following factors induces job satisfaction factors into me :

Factors
Highly Satisfied
Satisfied
Neutral
Dissatisfied Highly Dissatisfied
a. I consider my job pleasant

b. I feel real enjoyment at my work

c. I am enthusiastic about my work
d. I feel to stay with my organization even if new opportunity is advantageous to me
e. I feel happy to spent rest of my career at this organization
f. I feel part of the family at my organization

12. To assess the Employee – Employer relationship at KEL
The following factors build Employee – Employer relationship :

Factors
Highly Satisfied
Satisfied
Neutral
Dissatisfied Highly Dissatisfied
a. Conflict Resolution

b. Grievance Handling Mechanism

d. Healthy Communication
f. Ethical work practices

sir i want to done the mutiple correlation for the above 4 questions according to my projct work.I am liitle wek in maths.can you help me sir

• Charles says:

You can create the pairwise correlations using Excel’s CORREL function. You can get all such correlations using Excel’s Correlation data analysis tool.
To get the multiple correlation of any one of the questions against the other 3 questions you can perform a multiple linear regression using Excel’s Regression data analysis tool and use the Multiple R value.
How to perform each of these is described on the Real Statistics website.
Charles

70. Shabir says:

Hi Charles, I want to test a model consisting of four main variables conflict intensity, flaks, escalation and de-escalation. Each of these variable further contains three more variable high-medium and low. Please guide me what statistical test is best in this case. I treat them as dependent variables. I have read some material on it and am suggested to apply the multi-correlation on it. Waiting for your answer. Best

• Charles says:

Shabir,
What hypotheses do you want to test?
Charles

71. Taylor says:

Dear Charles;
Your thoroughness and timeliness of response is amazing! I have read the comments below but want to be sure I am heading in the right direction. I have a two part question:
1. I have > 10,000 components. When I sell one component I want to know if multiple other component could have also been sold. So for example when a shirt is sold based upon historic sales history there is an 80% chance a tie could have been sold as well and there is a 75% chance a sweater could have been sold. How do I best determine what the likelihood of additional products type could be sold?

2. Is excel the best software to handle this or should I develop the analysis in some other software?

Thank you! Taylor

• Taylor says:

A slight modification to question 1 above: I’d like to also know if I sold the shirt and the tie together what is the likelihood of selling the sweater (or numerous other products with given combinations). Thanks again!

• Charles says:

Taylor,
This might be a simple probability calculation. You might use simulation. It really depends on the details.
Charles

• Charles says:

Taylor,
This really depends on the nature of the data that you have. You might be able to get this information from multiple regression.
Charles

• Taylor says:

Hello Charles;
Thanks for your reply! I would definitely like to avoid simulation if I can as I don’t want to have to learn a given software to conduct the analysis. I will though if you feel that would serve me best.

A bit more detail to provide you with the nature of the data that you mention above:
Think of the components as parts of a car. If I sold someone a spark plug they may have benefited from having spark plug wires, a distributor cap and perhaps a tool to gap the plug and a tool to install the plug. Now they won’t always need the extra components and tools. However, I want to be able to view the sale to a given customer and determine what other parts could have possibly been sold and what is the likelihood of the customer needing the other components.

So Customer A bought spark plugs I’d like to know that there was an 80% of the time that others customers who purchased spark plugs also purchased spark plug wires. 50% of the time distributors were purchased and 45% a gaping tool.

Hopefully this gives you insight into the type of data. If not please let me know what further information you may need.

Again, thank you for your expertise!

• Charles says:

Taylor,
In your last comment you said, “So Customer A bought spark plugs I’d like to know that there was an 80% of the time that others customers who purchased spark plugs also purchased spark plug wires. 50% of the time distributors were purchased and 45% a gaping tool.”
This sounds like simply a problem of having the historic data, exactly of the type you described. If so, you simply take the percentages, not complicated statistics are required.
Charles

• Taylor says:

Thank you Charles

72. H says:

Hi Charles, I have a real issue getting my head around stats and as much as I’m trying, I am having some issues handling my data. I have nine sites that I have visited 10 times, taking quadrat samples four times at each site visit and I am trying to find a test to compare my quadrat samples from the site visits to assess how similar or different the sites are from one another in terms of flower coverage. I know the likes of T-testing exists, but I wonder if you could suggest something to allow me to compare all site to each other, at once? Thanks.

• Charles says:

H,
The expansion of a t test to multiple sites is ANOVA.
Charles

73. Hi Charles can you help me to do my thesis? I don’t know what I will gonna do. Thank you

• Charles says:

Katherine,
I happy to answer specific questions.
Charles

74. Hi Charles. Can you give us ideas on how to use our data for our thesis. Thank you very much.

• Charles says:

Katherine,
Can you be more specific as to what you need to do with the data?
Charles

75. Lai says:

hye Charles, wish u have a good day. My name is Lai. Here i got several variables of heavy metals like Zn, Cu, Pb etc. my question is, when i do the correlation test using excel, does it matter the position of each variable? or i can put them randomly and excel will calculate that variables accurately? Thanks 🙂

• Charles says:

Lai,
The order of the pairs (or n-tuples) doesn’t matter.
Charles

76. Mike Mole says:

Hi,
I work for the government in the engineering and environmental regulatory industry for oil and gas. I am currently working on a project to analyze air quality data form an air shed. parameters monitored are windspeed, temp, wind direction, h2s, s02, methane hydrocarbons and non methane hydrocarbon concentrations. A third party has been contracted to produce annual reports however the reports produced contain only windroses, bar graphs and some pretty basic statistical methods; albeit powerful because they get the point across and are easy to undertand. However, I need to come up with a statistical method to analyze the data to determine
1. on 0.05 significance level, alpha, if the average concentration of each pollutions is increasing yearly

2. A statistical analysis method to pin point sources of said pollutants during spike in readings. For example oil and gas pollutants will have a certain signature or pollution components e.g high methane, h2s and so2 reading however, bush burning activities will have a lot of particulates, c02 and nox pollutants. My idea was to couple two pollutants or parameters and see if there is a correlation and determine a linear regression equation. if I can establish that there is correlation between 2 or more pollutants I can assume that they were releases either from the same source or released at the same time. is there a better way to approach this and which test will suffice? A t test, F test or ANOVA?

• Charles says:

Mike,
To better understand the scenario that you are describing and the types of analyses that you want to perform requires some discussion. This looks to be a consulting engagement.
Charles

• Mike Mole says:

Hi Mike,
It is not a consulting engagement (on my part), however it is a side project that I am working on as I love my job so much and I am very passionate about statistical analysis… I am a young engineer eager to make a difference and a name for my self.

• Dear Mike,

Pierre Lafaye de Micheaux
Senior Lecturer, UNSW Sydney, Australia

77. Khezmo says:

Hi Charles,
I need to find a correlation between capital expenditure amount and tenure of the capital expenditure to the length of payback period. Could you please let me know how we do this?

Thanks

• Charles says:

Khezmo,
You can use the MCORREL function as described on this webpage or simply perform linear regression and use the multiple correlation output.
Charles

78. Subhrajit saha says:

What is the statistic to use the tests regarding multiple correlation coefficient with full derivation of this statistic….??

• Charles says:

Sorry, but I don’t understand your question.
Charles

79. Francis says:

Hi Charles,
I need to find a correlation between the Levels of intelligence and aggression to black humor. Could you please help me know how to compute this?

• Francis says:

To be specific, I will compute for the relationship of black humor to intelligence, black humor to aggression,and intelligence and aggression.

• Charles says:

Francis,
To compute the correlation of black humor to intelligence use Excel’s CORREL function
To compute the correlation of black humor to intelligence and aggression use the Real Statistics MCORREL function or look at the correlation value when you use Excel’s Regression data analysis tool where you use intelligence and aggression as the independent variables and black humor as the dependent variable.
Charles

80. Sonam says:

Hi Charles,

Please help me if you have time! I’m wokring on a project to find if there is a correlation between influence of advertisements on purchasing habits. I have collected data (age, gender, occupation, no. of social media accounts, no. of purchases) and scale of percieved influence from 1 to 5 (1 being totally disagree to 5 being strongly agree). What can I do to make sense of these datas?

• Charles says:

This is a very broad question. You can use various descriptive statistics approaches described on the website to better understand the data (calculate mean, variance, etc., create box plots, histograms, QQ plots, etc.).
You can also perform statistical analyses on this data (e.g. regression).
What you do depends on what hypotheses you want to test and what else you want to learn from the data.
These topics are all covered on the Real Statistics website.
Charles

81. Ines says:

Hi! I have to write a sentence based on 3 variables correlation , I don’t really understand how to put all of the data into a 1-2 sentence. Can you help me? thank you

• Charles says:

Ines,
Are you asking how to report the results of the correlation between three variables?
Is this for a publication or a school assignment? Publications have their own guidelines for these sorts of things.
Charles

82. Florian says:

Hi Charles,

sqrt(1-(1-0.331)*49/47) = 0.55 and not 0.425. In your Excel spreadsheet k = 9 and not 2, as it should be for this case.

Best wishes,
Florian

• Charles says:

Florian,
The example explicitly refers to only two of these variables, namely infant mortality and white, with poverty as the dependent variable. Thus k = 2.
Charles

• Florian says:

Hi Charles,

Yes, I got that. Therefore the value in your text should be adjusted since right now the equation is not correct.
sqrt(1-(1-0.331)*49/(50 -2 – 1) ) = 0.55 and in the text above it says sqrt(1-(1-0.331)*49/(50 – 2 – 1)) = 0.425, which is not correct.
sqrt(1-(1-0.331)*49/(50 – 9 – 1)) is 0.425 .

Best wishes,
Florian

• Charles says:

Florian,
Yes, you are correct. In fact, there is another error as well. The calculation of Adjusted R-square should not use a square root. Adjusted R would use a square root.
I have now corrected the webpage. Thanks for your help in identifying this error and improving the accuracy of the website.
Charles

83. Damon says:

Charles,
I have never actually taken a statistics class but am trying to calculate the correlation coefficient for two variables. I don’t have a problem with the the equation but my r value is .9 . I have determined that this is caused by an excessively high standard deviation for both my x and y variables. I am attempting to show the correlation between violent crime and individuals killed by police for all 50 states and DC. The three highest states have values of 166,883, 113, 227, 93,626, but 45 of the states have values under 40,000. The result generated a standard deviation of over 30,000 placing 39 of the 51 variable for x withing the Sx. I am concerned that per capita measurements will dilute the value. What would be my best option to produce the most accurate correlation coefficient value? Do I consider the top three states to be outliers and eliminate them or use the per capita values? If I use the per capita value for the violent crimes would I also need to apply that to my individuals killed data? Thank you in advance for your time and assistance.
Very Respectfully
Damon

• Charles says:

Damon,
What is the problem with having a value of r = .9? This shows that there is a high correlation.
Charles

84. Damon says:

Charles,
Thank you for taking the time to read my post. I know but that correlation is being driven by only 3 of the 51 variables. when I removed CA r= .846; remove CA and TX r=.757; finally, remove CA, TX and FL r=.637. There is still evidence of a moderately strong to strong positive linear correlation but a far cry from a .9 . I don’t want to over represent the correlation and compromise the validity of my data. I am just concerned that if I use the per capita measurement for violent crime I am introducing population as a variable. I would actually be comparing the ratio of violent crime to population, to the ratio of police killings to population. This may be a basic question that deserves nothing more than a basic answer. Like I said before I have never taken a statistics class or worked with statistics professionally; mathematically I peaked at algebra so I wont be offended by a simplistic answer. Thank you for your time in advance.
V/R
Damon

• Charles says:

Damon,
If you like can calculate a multiple correlation coefficient using all 51 variables. You don’t need to be restricted to only 3 variables.
This topic is covered in the Multiple Regression section: http://www.real-statistics.com/multiple-regression/. Using multiple regression, you can determine which variables to retain in your model, which variables have the most weight, etc.
In the example you gave with three variables, you might want to retain all the variables, although the Multiple Regression techniques will help you to determine whether this is necessary.
Charles

• Damon says:

Charles,
I appreciate the recommendation and I am contemplating the option, but the time for completing this study is fast approaching. I think for now I will go forward with the data and correlations that I have and simply identify the situation and use population as a control variable. I will state that my correlations are based only on states that have populations of less than <20,000,000 people and why. If I manage to complete the paper with some time remaining I will try and teach myself how to do a multiple correlation. I knowledge of statistics is limited what I have learned from various online sources. If I do attempt a multiple correlation I will let you know how well your tutorials aided me in the process. Thank you for taking the time to assist me on this study.
V/R
Damon