Correlation

In this section we explore the concept of correlation (especially using Pearson’s correlation coefficient) and how to perform one-sample and two-sample hypothesis testing, especially to determine whether the correlation between populations is zero (in which case the populations are independent, provided they are jointly normally distributed) or whether two correlations are equal. We briefly explore alternative measures of correlation, namely Spearman’s rho and Kendall’s tau, as well as the relationship between the t-test and chi-square test for independence and the correlation between dichotomous variables.

74 Responses to Correlation

  1. Rowena Dsouza says:

    Hello Charles,
    Our study is about the penetration of core values among the employees for which we have designed a questionnaire with likert scale- Strongly disagree(1) , Disagree(2), Agree (3), Strongly agree(4) for superiors and subordinates commenting about each other. Now we want to analyse the data using statistical tools. Kindly suggest the most apt statistical tools for the same. Our sample size is 50.

    • Charles says:

      Rowena,
      First you need to decide what hypotheses you want to test. What is the objective of your research?
      Charles

  2. dhruv says:

    Hello Charles,

    Your website is really helpful. I have a problem for which i need certain answers.

    background: I have to analyze a database to know which is (are) the driving parameter (like Industry, cognitive function etc.)that affects my Y variable most strongly. My Y variable is a continuous variable. I have four parameters and each has about 4-5 categories in them (for example, Industry has: Oil, manufacturing, coal, nuclear, marine).
    I want to know few things
    1) Which correlation should I use in this case?
    2) for each parameter, I will be coding the categories with number like oil= 1, manufacturing= 2,coal= 3,nuclear= 4, marine=5. Will this coding have an effect on the correlation? i.e if I change the order of coding then will the correlation change?
    If it does then is there any correlation that I can do which is independent of coding?
    3) And lastly, what tests can I use to test the correlation values?

    I would really appreciate your help on this.

    Best regards,

    Dhruv

    • Charles says:

      Dhruv,
      1. The answer to your question depends on things that you haven’t mentioned in your description. Also, why do you want to calculate the correlation coefficient at all?
      2. The order will have a big effect on the correlation coefficient. You should not use this sort of coding with nominal data.
      3. The tests used to analyze the correlation are described on the website in detail, but I am not sure any of these tests will be that helpful to you if all your data is nominal.
      In general you should use “dummy coding” for nominal data. This is explained on the website.
      Charles
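
Charles's point about coding order can be illustrated with a minimal Python sketch. The data, the two integer codings, and the helper names (`pearson`, `dummy_code`) are hypothetical and only for illustration; the same labels and Y values yield very different values of r depending on which integers are assigned, whereas dummy coding does not depend on any ordering.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: an industry label and a continuous Y for eight records.
industry = ["oil", "coal", "marine", "oil", "nuclear", "coal", "marine", "nuclear"]
y = [10.0, 14.0, 30.0, 12.0, 22.0, 15.0, 28.0, 20.0]

# Two arbitrary integer codings of the same nominal categories.
coding_a = {"oil": 1, "coal": 2, "nuclear": 3, "marine": 4}
coding_b = {"marine": 1, "oil": 2, "nuclear": 3, "coal": 4}

r_a = pearson([coding_a[label] for label in industry], y)
r_b = pearson([coding_b[label] for label in industry], y)

def dummy_code(labels, reference):
    """One 0/1 column per category, omitting the reference category."""
    categories = sorted(set(labels) - {reference})
    return {c: [1 if label == c else 0 for label in labels] for c in categories}

dummies = dummy_code(industry, reference="oil")

print(r_a, r_b)   # strongly positive under coding_a, negative under coding_b
print(dummies)    # columns that could be used as predictors in a regression
```

The dummy columns would typically be used as predictors in a regression rather than correlated one at a time.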

  3. Hanis says:

    Hi Charles,
    I have a question for you about my project I am doing. What other analysis can I use to predict stock price movement other than the correlation coefficient? I have different factors that may affect the stock price but I don’t know what analysis I should be using for this. One of the factors that affects the stock price is the Consumer Price Index.

    Please reply me . Thanks (:

    • Charles says:

      Hanis,
      Generally, some form of regression is used to address these sorts of issues, more specifically some form of time series analysis.
      Charles

  4. Timothy says:

    Sir,
    I have a problem with my hypothesis testing. My hypothesis is : application of value management at briefing stage significantly minimise challenges involved in developing the client’s brief. I have computed means of problems in clients brief and means of solutions provided by value management; using 4 points likert scale for each. Now I want to test the correlation between the two groups in order to accept or reject the hypothesis. How can I approach this? Secondly, how should I link the MOST appropriate solution to each problem using correlation

  5. Peter Lynch says:

    Hello Charles, I wonder if you can help me decide on which statistic to use.

    I have Likert scale (1-5) data (40 questions representing 8 factors) for two types of employees (A and B). I have calculated the means for A and B for each of the 8 factors and scatter plotted the data (A on the X-axis and B on the Y-axis). This has given me a linear plot, with each point on the graph representing one of the 8 factors. I have calculated the correlation coefficient (Pearsons) showing a very strong positive correlation.

    I just don’t think this is the correct way to do this, but can’t reason why. I feel I should concentrate on one of the 8 factors only, taking the actual results and plotting accordingly for each one. It seems to me in the approach above, I would have a linear plot for 8 factors which in effect could be used to predict the scores for all factors in any future study provided I had one score only – this seems ridiculous to me.

    Can you help me work out where I’m going wrong and what better statistic I could use?

    Thanks in advance
    Peter

    • Charles says:

      Peter,
      Let’s start at the beginning. Before I can help you decide on which statistic to use, I (and you) need to understand what real-world problem you are trying to address or what hypothesis you are trying to test. This drives the selection of statistic to use.
      I await further information from you.
      Charles

      • Peter Lynch says:

        Many thanks Charles

        The real world problem is the measure of safety climate in an organisation, comparing two groups of employees. Safety Climate is measured by responses to 40 questions that are grouped to address 8 specific factors (5 questions per factor). Each question is scored by Likert scale choice of 1-5. The questions are established and already validated etc. by safety organisations, so they themselves are good to use.

        I have taken the means of the Likert scores for each question (over 100 respondents) and then calculated the means for each specific factor (so 8 results). This I have done for both employee populations, giving a total of 16 results. I have then scatter plotted the results for each factor/population and calculated the correlation coefficient for the plot, which is strongly positive (0.91).

        I wish to look at the relationship between safety climate and the perceptions of the two employee populations. I want to see if they are correlated/related in any way. But the measure of safety climate is multi-factorial (8 factors).

        Many thanks for any help you can give.

        Peter

  6. Leo says:

    Good day Charles,

    Here`s the details for my research, I have 4 Ivs and 1 Dv

    Each Iv has several statements on a 5 point likert scale.

    Now I wanted to run SPSS, Bivariate correlation test and the test run every statements in the Iv to Dv

    Can I just group all the statement in the Iv as one and to Dv`s statement as one ?

  7. Belle Perez says:

    Hi Sir!

    I would like to ask for advice on which statistical treatment to use on this study. We would like to look at the correlation of local government unit (LGU) assistance and Self concept of indigenous people (IP) teachers. There are 3 subcategories for LGU assistance with 5 questions each, (likert-type scale). The same for self-concept. Is it alright to use pearson or spearman? Im getting confused with likert-type scale data, because I’ve read some study that treated such as interval data. Thanks!

    • Charles says:

      Hi Belle,
      If, for example, you have a Likert scale of 1, 2, 3, 4, 5, the real question is whether you can assume that the intervals between the scores are equally spaced. If so, you can treat the data as continuous (although a 7-point Likert scale is better than a 5-point one), and so Pearson’s is probably ok. If the intervals between the scores are not equal, then you should probably use Spearman’s rather than Pearson’s.
      Charles
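
To make the Pearson versus Spearman distinction concrete, here is a small Python sketch using only the standard library. Spearman's rho is computed as Pearson's r applied to the ranks, with tied values sharing the average of their ranks. The Likert responses and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def average_ranks(values):
    """Ranks from 1 to n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical 5-point Likert scores for ten respondents.
lgu_assistance = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]
self_concept = [2, 1, 3, 3, 4, 3, 5, 4, 5, 5]

print(pearson(lgu_assistance, self_concept))
print(spearman(lgu_assistance, self_concept))
```

Because Spearman's rho depends only on the ordering of the scores, it does not require the equal-interval assumption discussed above.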

  8. david says:

    Hello sir, I used two soils for antibiotics uptake studies, I have determined the soil properties and want to run correlation matrix to see if there is any correlation between the soils and the antibiotics. I am not able to do it. Can correlation be done with two data set? which statistical tool can I use to know if there is any relationship between the soils and the uptake of antibiotics

  9. Abe says:

    Hello, I have a problem. I need to validate my 12 item questionnaire with a 7 point likert scale against a 19 item questionnaire with a 6 point likert scale. I know how to run a correlations test between similar scales but not when one is 7 points and the other a 6 point likert scale. How would I do this on SPSS please? Many thanks.

    • Charles says:

      Sorry, but I don’t use SPSS. This website is about statistical analysis using Excel. In any case you can use the correlation coefficient even if the scales are different.
      Charles
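
The reason different scales are not a problem is that Pearson's r is unchanged by multiplying a variable by a positive constant or adding a constant to it. A minimal Python sketch with made-up questionnaire totals (all names and numbers here are hypothetical):

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical total scores on two questionnaires with different ranges.
scale_a = [52, 61, 47, 70, 66, 58, 49, 75]
scale_b = [60, 72, 55, 90, 81, 70, 62, 95]

r_original = pearson(scale_a, scale_b)

# Rescale the second questionnaire (multiply by 1.8, then shift by 10).
scale_b_rescaled = [1.8 * v + 10 for v in scale_b]
r_rescaled = pearson(scale_a, scale_b_rescaled)

print(r_original, r_rescaled)  # the two values agree up to rounding error
```

So there is no need to stretch one score so that both have the same maximum before computing the correlation.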

  10. Ashad says:

    I have a problem, in my analysis correlation is positive but in t-test null hypothesis is accepted. Have any problem about that….please answer me….

    • Charles says:

      Ashad,
      There is no problem if the correlation is positive. The important thing is whether this value is statistically different from zero (which is what the t test is designed to test). If the positive value is relatively small, the test may well fail to reject the null hypothesis. Just because the null hypothesis is that the population correlation is zero doesn’t mean that the sample correlation will be exactly zero.
      Charles
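
The t test referred to here uses the statistic t = r·sqrt(n − 2) / sqrt(1 − r²) with n − 2 degrees of freedom. The sketch below, in Python, uses a hypothetical sample correlation and sample size, and takes the two-tailed 5% critical value for 28 degrees of freedom (2.048) from a standard t table rather than computing it.

```python
from math import sqrt

def correlation_t_statistic(r, n):
    """t statistic for H0: population correlation = 0 (n - 2 degrees of freedom)."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

# Hypothetical sample: a small positive correlation from 30 pairs.
r, n = 0.2, 30
t = correlation_t_statistic(r, n)

# Two-tailed critical value for alpha = 0.05 and df = 28, read from a t table.
t_crit = 2.048
reject_null = abs(t) > t_crit

print(round(t, 3), reject_null)  # t is about 1.08, so the null is not rejected
```

This is the situation described in the question: the sample correlation is positive, yet it is too small relative to the sample size to rule out a population correlation of zero.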

  11. mukhtar says:

    I was used my thesis sample linear regression but unfortunate the models of the question Y= ax+ b plz help me if you have any idea?

  12. Rosa says:

    Hello Sir,

    I would appreciate for your help.
    My problem is that I would like to test the relationship between the ordinal (5 point Likert scale – Strongly agree, agree, neutral, disagree and strongly disagree) and dichotomous (Yes/No question), is it appropriate to use Spearman’s rho test? If not, which test would you suggest?

    The hypotheses is finding out whether consumers’ attitude towards have positive relationship with their purchase intention.

    Thank you.

  13. alsim says:

    Hi sir,

    Pls Help me, im using likert scale, and i have 3 variable (2 independent, and 1 dependent). this 2 independent has 9 question each, and 1 dependent has 5 question, with 5 point likert scale. how can i do correlation analysis between them ? if independent max score is 9 Question * 5 =45 and dependent max score = 5 question* 5 = 25, did i need to make those variable have same big score ? like : max score div max dependent = 45/25 = 1.8, so all total score for dependent must multiply by 1.8 for each respondent ?

    Thank you

    • Charles says:

      On what basis have you decided which variables are dependent and which are independent? Why do you want to do correlation analysis?
      Charles

      • alsim says:

        base on model that has been using on many research. correlation analysis used to see if theres a correlation between them and whether the independent variables affect the dependent variable. also i want to know how big in percent was the affect

        • Charles says:

          I am not 100% sure I understand the question, but assuming that you use the total score for the 9 questions for each independent variable and the total score for the 5 questions for the dependent variable (or average score), you can use the multiple correlation coefficient (calculated as described on the Multiple Correlation webpage). Alternatively you can perform a multiple linear regression (see Multiple Regression). You can use R^2 as the effect size.
          Charles
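
For the case of exactly two independent variables, R^2 can be obtained from the three pairwise correlations as R^2 = (r_y1^2 + r_y2^2 − 2·r_y1·r_y2·r_12) / (1 − r_12^2). The Python sketch below applies this formula to hypothetical correlation values; the function name is an illustrative choice, and with more than two predictors a full regression would be needed instead.

```python
from math import sqrt

def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R^2 for regressing y on two predictors, from the pairwise correlations.

    r_y1, r_y2: correlation of y with each predictor
    r_12: correlation between the two predictors
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors are perfectly correlated; R^2 is undefined")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

# Hypothetical pairwise correlations.
r2 = r_squared_two_predictors(0.6, 0.5, 0.3)
multiple_r = sqrt(r2)

print(r2, multiple_r)
```

R^2 read as a percentage is the share of the variance in the dependent variable accounted for by the two independent variables together.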

          • alsim says:

            Thank you sir for your help. im new at statistics, so your answere very help me alot, btw sir my i ask 1 more things, what is relation between demographics and regression analysis, what i know demographics used to map respondent, was this map should be used to populate and calculate all matters and put it in regression analysis ? etc i want to make prediction from sample (using questioner) with independent is user satisfaction and dependent is user impact using services. i map demographic all respondent base on age, and my question is should i use that demographics (which mean sorted base on age) and calculate for each variable and do analysis ?

          • Charles says:

            I don’t completely understand your question “…should i use that demographics (which mean sorted base on age) and calculate for each variable and do analysis ?” But you can certainly perform regression using demographic data plus the other types of data that you have listed.
            Charles

  14. leizel says:

    i used likert type scale

  15. leizel says:

    hello,
    pls help me what statistical test i must use if i want to know the profile of my respondents then if i want to know if there is a significant relationship

    • Charles says:

      You might use the correlation coefficient, but you need to describe what you are trying to accomplish in more detail.
      Charles

  16. John Leung says:

    Hi sir,
    I would like to know what type of tests (e.g. anova, t-test) will be suitable for the questionnaire.I want to compare both Qn 1 and Qn 2 with a suitable test. What test should I use? Qn1) If the product weights between 500g to 1 kg, would you accept the weight range for this product? Data collected using likert scale: Likely 4 male, 1 female. Neutral 25 male, 7 female. Unlikely 7 male, 4 female. Mostly unlikely 2 male.
    Qn2) Would you accept the weight of the product if its is above 1 kg? Likely 3 male. Neutral 23 male, 6 female. Unlikely 9 male, 8 female. Most unlikely 3 male.
    Thank you for your help

  17. Patni says:

    Hello Sir,
    I am glad I find this blog.
    I want to analyze questionnaire data about students attitude for a study. I distributed to 50 students questionnaire that consists of 20 questions which then are grouped into 5 categories (variables). The overall cronbach alpha reliability is 0.87. But when analyzed per group, cronbach alpha for variable 1, 2, 3, 4, 5 are 0.61, 0.70, 0.65, 0.81, 0.80 respectively. If I delete two out of 6 questions in variable 1, cronbach alpha becomes 0.73. However, cronbach alpha is not increased if any one of 3 questions in variable 3 is deleted. I have several questions below:
    1. how to calculate inter-correlation among items in the questionnaire, so that I have excuse to still use variable 3
    2. how to know if the data is normally distributed? should i do it for each question item, or for each student, or for all data? How?
    3. if I want to see relationship between variables, do I have to calculate the average score of all questions in the variable so the result becomes score of the variable for each student?

    Thank you so much for your help.

    • Charles says:

      1. I am not sure why you want to do this, but in any case you can look at the Intraclass Correlation webpage to find out how to do this.

      2. The webpage Testing for Normality and Symmetry provides a variety of methods for testing whether a data set is normally distributed. You should test the specific data sets for normality based on the requirements of the analysis tool that you are planning to use. Some tests don’t require normality at all.

      3. It really depends on what you want to do with this information.

      Charles

      • Patni says:

        Thank you for replying my question.
        Actually I want to study students’ attitude towards e-learning. And honestly I do not know if it has normality test requirement.

        I distribute 5-point Likert type scale questionnaire containing 20 questions. Then I categorize these questions into 5 variables. Variable 1 (design of website) contains 6 questions, variable 2 (efficacy to use e-learning) contains 3 questions,variable 3 (enjoyment using e-learning) contains 3 questions, Variable 4 (usefulness of e-learning) contains 6 questions, and variable 5 (intention to use e-learning) contains 2 questions.
        I want to calculate correlation between variable 2 and variable 5, variable 3 and variable 5, variable 4 and variable 5.

        Because each variable has more than one question and thus more than one response, should I calculate the average response of all questions in each variable, so the result becomes the value for corresponding variable? In order to get use the formula for Pearson product-moment correlation, r?

        Many thanks.

  18. Lucy says:

    dear sir… greetings
    please i need your help on how to conduct correlation analysis in excel. i have rainfall and water flow data i need to know if there is any relationship between rainfall and water flow.
    Many thanks

    • Charles says:

      Just use the CORREL(R1, R2) where R1 contains the rainfall data elements and R2 contains the corresponding water flow data elements. You can also do hypothesis testing as described on the website.
      Charles
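
For readers who want to see what CORREL computes, here is a rough Python equivalent. The function name `correl` simply mirrors the Excel function, and the rainfall and flow figures are invented for illustration.

```python
from math import sqrt

def correl(r1, r2):
    """Pearson's correlation coefficient, analogous to Excel's CORREL(R1, R2)."""
    if len(r1) != len(r2):
        raise ValueError("both ranges must contain the same number of elements")
    n = len(r1)
    m1, m2 = sum(r1) / n, sum(r2) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    s12 = sum((a - m1) * (b - m2) for a, b in zip(r1, r2))
    s11 = sum((a - m1) ** 2 for a in r1)
    s22 = sum((b - m2) ** 2 for b in r2)
    return s12 / sqrt(s11 * s22)

# Hypothetical monthly rainfall (mm) and water flow (m^3/s), paired by month.
rainfall = [12.0, 30.5, 45.2, 80.1, 60.3, 22.4]
water_flow = [1.1, 2.0, 2.9, 5.2, 4.1, 1.6]

print(correl(rainfall, water_flow))
```

Each rainfall value must be paired with the water flow value from the same period, just as the two Excel ranges must line up row by row.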

  19. anita says:

    hello charles,
    I introduced a new chart for the nurses to use/practice documentation. After introduction i distributed questionnaire with likert scale type to find out if the new chart was useful and easy to use with few more questions like if it was evidence based practice, also included a question if it added to the burden of nursing documentation etc. it is more like an audit. it was introduced in two different wards where a mixture different years of experienced nurses work. now what type of data analysis should i use please suggest something that i can use with excel spread sheet please. really confused.
    it would be great help. thank you.

    • Charles says:

      Anita,
      The data analysis tool to use depends on what you are trying to demonstrate, e.g. what hypothesis are you trying to prove or disprove.
      E.g. suppose you want to test whether the responses to the question “is the new chart easy to use” are different for nurses with more than 5 years of experience than for those with less than 5 years of experience; then a t test with two independent samples might be the right analysis.
      You need to first decide what you want to analyze. Then you can determine which is the best test to use.
      Charles

  20. zanzi says:

    Hi,
    SA means strongly Agree, A means agree, U stands for undecided, D means disagree and SD means strongly disagree. The numbers in the brackets represents the proportion of the sample population with same response choice. Like I wrote earlier the questionnaire had 25 questions in total and was administered to 165 people.
    Many thanks.

    • Charles says:

      Sorry, but you haven’t really provided enough information for me to give you a definitive answer. Knowing how many responses fall in each Likert category doesn’t really help. It looks like you want to perform a correlation test. Why?
      Charles

  21. zanzi says:

    Many thanks for your reply, OHSMS means Occupational health and Safety Management System, I got my data from questionnaires (containing 4 sections with 25 questions in total) administered to a sample size of 165 . Like I mentioned earlier, I used a Likert scale structure and have summed my responses from each questions to have sets of data in this format SA(69), A(46), U(6),D(16), SD(3). I don’t want to rely only on median and inter quantile analysis.

  22. zanzi says:

    Sir,
    I would appreciate your help, am carrying out a research on impact of effective OHSMS on work performance. Can I do a correlation analysis on the following data I got from my questionnaire(I used Likert scale) SA(69), A(46), U(6),D(16), SD(3).

    • Charles says:

      Sorry, but I don’t know what OHSMS stands for and you haven’t provided enough detail for me to answer your question.
      Charles

  23. Jamie says:

    Hi charles,
    Just wanna ask u wht method should i use if my research is about determining the awareness of eclampsia among women Age 21 to 45?

    • Charles says:

      Jamie,
      You need to supply more information before I am able to answer your question. In particular, what are you trying to demonstrate?
      Charles

  24. Ezin says:

    Sir, How to compute correlation of gender to level of awareness (poor, average and good). Do I need to assign female as 1 and male as 2? I have 100 respondents and 86 answered the gender profile and 4 respondents leave it blank.

  25. zach says:

    hello…i want to ask a specific method for my case…my objective is to assess relationship between socio demographic of visitors and attitude of visitors…the attitude for visitors used likert scale which from 1 to 5…(1.strongly agree …….5. strongly disagree.) but i do not know how to used my data to do the test…whether i would use correlation or other method…tq

    • Charles says:

      It really depends on what you mean by “assess relationship”. It sounds like you want the correlation coefficient as described on the referenced webpage.
      Charles

  26. Godspower says:

    Pls i have a problem on split half test reliability, i don’t know how to compute for the “r” in the formula 2r/(1+r).

    • Charles says:

      r is the correlation coefficient between the data in the two halves. Once you split the data in half (into ranges R1 and R2) you can use Excel’s CORREL(R1, R2) function to calculate r. See webpage Split Half Methodology for more details.
      Charles
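
A minimal Python sketch of this calculation follows. It assumes an odd/even split of the items, which is only one of several possible ways to form the halves, and the response matrix is hypothetical. The correlation r between the two half totals is then adjusted with the formula 2r/(1+r) from the question.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def split_half_reliability(r):
    """Adjusted split-half reliability, 2r / (1 + r)."""
    return 2 * r / (1 + r)

# Hypothetical responses: one row per respondent, one column per item.
scores = [
    [4, 5, 4, 5, 3, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 5, 5, 4, 5, 5],
    [1, 2, 1, 1, 2, 1],
]

# Total score per respondent on each half (assumed odd/even item split).
half_1 = [sum(row[0::2]) for row in scores]
half_2 = [sum(row[1::2]) for row in scores]

r = pearson(half_1, half_2)
reliability = split_half_reliability(r)

print(r, reliability)
```

Here `half_1` and `half_2` play the role of the ranges R1 and R2 passed to CORREL.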

  27. Certainty says:

    Sir pls which method of analysis and statistical tool will i use to analyze “relationship between parental variables and academic achievement of secondary schools”.

  28. Danielle says:

    Hello Charles,
    I’m having a problem analyzing my data. We polled 5 experts and asked them to rank 6 tests for 82 scenarios. They were asked to rank the tests in order of how likely they were to use that test given a particular scenario. My issue is that one expert gave one test the same rank across all scenarios. When using the correlation function from Excel’s data analysis package, this “constant” gives a #DIV/0! error. I’m trying to see how the experts overall responses correlate. Do they agree for the most part? Is there a different statistical test I can use to find my answer? My statistics skills are not very strong and I’m becoming lost in the details. Any help is greatly appreciated.
    Thank you,
    Danielle

    • Charles says:

      Danielle,
      Yes, the correlation coefficient will be undefined if all the elements in one data set are the same. Generally you can use measures such as Cohen’s kappa, but this too will give disappointing results (or zero no matter what elements are in the other data set).
      Charles
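
The #DIV/0! error arises because a constant data set has zero variance, and that variance sits in the denominator of the correlation formula. A small Python sketch with hypothetical rankings; the function name and the choice to return None for the undefined case are illustrative:

```python
from math import sqrt

def pearson_or_none(x, y):
    """Pearson's r, or None when either list is constant (zero variance)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # The denominator sqrt(sxx * syy) would be zero: r is undefined.
        return None
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sqrt(sxx * syy)

# Hypothetical ranks given to one test across six scenarios by two experts.
expert_1 = [3, 3, 3, 3, 3, 3]   # same rank every time
expert_2 = [1, 4, 2, 6, 5, 3]

print(pearson_or_none(expert_1, expert_2))  # None: correlation is undefined
```

No choice of values for the second expert changes this result, since the first expert's rankings carry no variation to correlate with.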

      • Danielle says:

        Charles,
        I appreciate your quick reply. Do you have a recommendation for analyzing the data in another manner? Or will the results always be disappointing because all of the elements in one array are the same?
        Thank you,
        Danielle

        • Charles says:

          Danielle,
          I don’t have another recommendation for you. I would guess that all the results will be disappointing because all the elements in one sample are the same.
          Charles

  29. Natalie says:

    Hello,
    I really hope you can help me solve this problem,

    I have calculate the correlation of return between 10 sectors in stocks using excel.
    As the results, the correlation between Manufacturing and Miscellaneous sector is around 87%. I want to create a range of correlation between 75%-95% and see how it affect the sectors’ mean and standard deviation. Can I use data table for that? Can you explain how to create the data table?

    Please help me.
    Thank You

  30. Anisah says:

    Hi! I really need your help.
    I want to know the appropriate statistical analysis used to test my hypotheses.
    Here are my hypotheses:
    Ho1: There is no significant relationship (independent) between business profile of the SMEs, to the level of awareness on climate change and related business risks.
    Ha1: There is a significant relationship (dependent) business profile of the SMEs, and the level of awareness on climate change and related business risks.
    Ho2: There is no significant relationship (independent) between the level of awareness on climate change and the related business risk, and the adaptive measures employed by the SMEs.
    Ha2: There is a significant relationship (dependent) between the level of awareness on climate change and the related business risk, and the adaptive measures employed by the SMEs.

    The content of my questionnaire
    I. Business profiles composed of:
    Type of Ownership: Sole Proprietorship, Partnership, Corporation
    Number of Years operating: 0-10 years, 11-20 years, 21-30 years, 30 years above
    Number of employees: 0-10 employees, 11- 50 employees, 51- 250 employees
    Initial Capitalization: 0-3,000,000, 3,000,001-15,000,000, 15,000,001-100,000,000

    II. Level of Awareness about Climate Change
    10 questions answerable by Aware (Rating scale: 1) and Unaware (Rating scale:0)

    III.Level of Awareness about Business Risk associated with Climate Change
    A total of 21 questions (7 risk: financial, logistics, legal and regulatory, market, people, operational and physical….3 questions each risk)
    And still answerable by Aware (Rating scale: 1) and Unaware (Rating scale:0)

    IV. Adaptive measure
    A total of 21 statements—>adaptive measures (7 aspects: financial, logistics, legal and regulatory, market, people, operational and physical….3 statements each aspect)
    Answerable by Adapted (Rating scale: 1) and Not Adapted (Rating scale:0)

    Please help me. Thank you.

    • Charles says:

      Based on a very quick and preliminary review of what you wrote, my first thought is to use Manova. The business profile is the independent variable and Level of awareness and business risk are the dependent variables.
      Charles

  31. farah says:

    hello sir,
    i really hope u will help me with this problem

    i have 19 questions that use likert scale 1-4 (1 never, 2 rarely, 3sometime,4 always)
    between this 19 questions i only choose 6 questions that i can say positive (e.g question 1: do you use seat belt?) to indicate positive practice in driving so do the rest of the question. Moreover this questionnaire doesn’t have total score.

    so now, how can i analyze this data?
    my research question is
    1) there is significant different between good practice and gender
    2) there is significant different between good practice and year of driving(1: 1-2 years, 2: 3-4 years, 3: 5-6 years, 4: 7 years above)
