Correlation and Chi-square Test for Independence

In Independence Testing we used the chi-square test to determine whether two variables were independent. We now look at the same problem using dichotomous variables.

Example 1: Calculate the point-biserial correlation coefficient for the data in Example 2 of Independence Testing (repeated in Figure 1) using dichotomous variables.


Figure 1 – Contingency table for data in Example 1

This time let x = 1 if the patient is cured and x = 0 if the patient is not cured, and let y = 1 if therapy 1 is used and y = 0 if therapy 2 is used. Thus for 31 patients x = 1 and y = 1, for 11 patients x = 0 and y = 1, for 57 patients x = 1 and y = 0 and for 51 patients x = 0 and y = 0.

If we list all 150 pairs of x and y as in range U4:U153 of Figure 2 (only the first 6 data rows are shown), we can calculate the correlation coefficient using the CORREL function to get r = .192.


Figure 2 – Calculation of the point-biserial correlation coefficient
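As a cross-check on this calculation, the same computation can be sketched outside of Excel. The snippet below (Python; cell counts taken from the paragraph above) expands the four counts into the 150 individual (x, y) pairs and computes the Pearson correlation directly:

```python
from math import sqrt

# Cell counts as listed in the text: (x, y) = (cured?, therapy 1?)
counts = {(1, 1): 31, (0, 1): 11, (1, 0): 57, (0, 0): 51}

# Expand into the 150 individual (x, y) pairs (the analogue of range U4:U153)
xs, ys = [], []
for (x, y), f in counts.items():
    xs += [x] * f
    ys += [y] * f

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)
r = sxy / sqrt(sxx * syy)  # same value CORREL returns for the listed pairs
print(round(r, 3))  # 0.192
```

The result agrees with the value .192 obtained with CORREL.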

Observation: Instead of listing all the n pairs of sample values and using the CORREL function, we can calculate the correlation coefficient using Property 3 of Relationship between Correlation and t Test, which is especially useful for large values of n. This is shown in Figure 3.


Figure 3 – Alternative approach

Actually, a little algebra shows that the correlation coefficient can also be calculated using the formula =(B4*C6-C4*B6)/SQRT(B6*C6*D4*D5).
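In other words, r reduces to the phi coefficient of the 2×2 table, (ad − bc)/√((a+b)(c+d)(a+c)(b+d)). A minimal sketch, assuming the four counts sit in the table as described in the text (Figure 1 itself is not reproduced here):

```python
from math import sqrt

# 2x2 counts: rows = therapy 1/2, columns = cured / not cured (assumed layout)
a, b = 31, 11   # therapy 1: cured, not cured
c, d = 57, 51   # therapy 2: cured, not cured

# phi coefficient = point-biserial correlation for two dichotomous variables
r = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(r, 3))  # 0.192
```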

Property 1: For problems such as those in Example 1, if ρ = 0 (the null hypothesis), then nr² ~ χ²(1).

Observation: Property 1 provides an alternative method for carrying out chi-square tests such as the one we did in Example 2 of Independence Testing.
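Property 1 rests on the identity nr² = χ² for a 2×2 table, which can be checked numerically. A short sketch using the counts listed in Example 1:

```python
from math import sqrt

obs = [[31, 11], [57, 51]]              # rows = therapy 1/2, cols = cured / not
n = sum(map(sum, obs))                  # 150 patients
row = [sum(r_) for r_ in obs]
col = [sum(c_) for c_ in zip(*obs)]

# Pearson chi-square statistic: sum of (O - E)^2 / E over the four cells
chi2 = sum((obs[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
           for i in range(2) for j in range(2))

# Point-biserial (phi) correlation for the same table
a, b = obs[0]
c, d = obs[1]
r = (a * d - b * c) / sqrt(row[0] * row[1] * col[0] * col[1])

print(abs(n * r ** 2 - chi2) < 1e-9)  # True: n r^2 equals chi-square
```

Under the null hypothesis ρ = 0 this common value follows a χ²(1) distribution.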

Example 2: Using Property 1, determine whether there is a significant difference in the two therapies for curing patients of cocaine dependence based on the data in Figure 1.


Figure 4 – Chi-square test for Example 2

Note that the chi-square value of 5.67 is the same as we saw in Example 2 of Chi-square Test of Independence. Since the p-value = CHISQ.DIST.RT(5.67,1) = .017 < .05 = α, we again reject the null hypothesis and conclude that there is a significant difference between the two therapies.
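For one degree of freedom the right-tail chi-square probability has the closed form erfc(√(x/2)), so this p-value can be reproduced outside of Excel; a minimal sketch:

```python
from math import erfc, sqrt

def chi2_sf_1df(x):
    """Right-tail chi-square probability with 1 df,
    i.e. the value of Excel's CHISQ.DIST.RT(x, 1)."""
    return erfc(sqrt(x / 2))

p = chi2_sf_1df(5.67)
print(round(p, 3))  # 0.017
```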

Observation: If we calculate the value of χ² for independence as in Independence Testing, from the previous observation we conclude that r = √(χ²/n). This gives us a way to measure the effect size of the chi-square test of independence, namely φ = √(χ²/n).

Care should be taken with the use of φ since even relatively small values can indicate an important effect. E.g. in the previous example there is clearly an important difference between the two therapies (not just a significant difference), but if we look at r² we see that only about 3.7% of the variance is explained by the choice of therapy.

Observation: In Example 1 we calculated the correlation coefficient of x with y by listing all 150 pairs of values and then using Excel’s correlation function CORREL. The following is an alternative approach for calculating r, which is especially useful if n is very large.


Figure 5 – Calculation of r for data in Example 1

First we repeat the data from Figure 1 using the dummy variables x and y (in range F4:H7). Essentially this is a frequency table. We then calculate the means of x and y. E.g. the mean of x (in cell F10) is calculated by the formula =SUMPRODUCT(F4:F7,H4:H7)/H8.

Next we calculate Σ(xᵢ – x̄)(yᵢ – ȳ), Σ(xᵢ – x̄)² and Σ(yᵢ – ȳ)² (in cells L8, M8 and N8). E.g. the first of these terms is calculated by the formula =SUMPRODUCT(L4:L7,O4:O7). Now the point-biserial correlation coefficient is the first of these terms divided by the square root of the product of the other two, i.e. r = L8/SQRT(M8*N8).
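The same frequency-table arithmetic can be written line by line in Python, mirroring the SUMPRODUCT formulas (cell references in the comments follow Figure 5):

```python
from math import sqrt

# Frequency table from range F4:H7: each entry is (x, y, frequency)
freq = [(1, 1, 31), (0, 1, 11), (1, 0, 57), (0, 0, 51)]

n = sum(f for _, _, f in freq)            # cell H8 (grand total)
mx = sum(x * f for x, _, f in freq) / n   # cell F10: =SUMPRODUCT(F4:F7,H4:H7)/H8
my = sum(y * f for _, y, f in freq) / n   # mean of y, computed the same way

# Frequency-weighted sums corresponding to cells L8, M8 and N8
sxy = sum((x - mx) * (y - my) * f for x, y, f in freq)
sxx = sum((x - mx) ** 2 * f for x, _, f in freq)
syy = sum((y - my) ** 2 * f for _, y, f in freq)

r = sxy / sqrt(sxx * syy)                 # =L8/SQRT(M8*N8)
print(round(r, 3))  # 0.192
```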

11 Responses to Correlation and Chi-square Test for Independence

  1. Raghu says:

    Hi Charles,

    Consider the following sample dataset. The values represent counts (the number of occurrences of each category).

    A = {889, 889, 3549, 1746, 2385, 3132, 5293, 1821, 1995, 1995}
    B = {845, 845, 3372, 1659, 2266, 2975, 5028, 1730, 1895, 1895}

    Is Chi Square Test result not impacted by
    (a) scaling (multiplying all elements of Set A by a constant value 0.95 to get Set B as shown above)
    (b) adding a constant value to all elements of Set A to get Set B

    Pearson correlation and Cosine similarity also appear to be invariant to scaling

    Thanks.

    • Charles says:

      For contingency tables used in the chi-square test for independence you need to have multiple rows and columns (not simply a string of numbers as in A), and so I am not sure how you want me to interpret the numbers in A. In any case, if I look at contingency tables, then the chi-square test is indeed impacted by multiplying all the columns by a constant or adding a constant to all the columns.
      Charles

      • Raghu says:

        Hi Charles,

        Thanks for the quick reply.

        The scenario is: We performance test a website for an hour twice (Run A and Run B). The website has ten unique transactions (Tx 1 to Tx 10). The number in a cell denotes the count of execution of each transaction.

        Both Run and Transaction are Nominal (categorical) attributes. The variations in counts between the two runs may be because of system performance etc.

        Tx 1 Tx 2 Tx 3 Tx 4 Tx 5 Tx 6 Tx7 Tx8 Tx 9 Tx 10
        Run A 1821 1995 1997 887 889 3549 1746 2383 3132 5291
        Run B 1787 1854 1852 899 897 3589 1764 2424 3185 5384

        The Chi-Square Test of Independence p-value is 0.166 (not significant at alpha = 0.05)

        1) I need to check if there is a significant difference between the two runs with respect to the transactions executed. Is the Chi-Square Test of Independence suitable for this, or the Chi-Square Goodness-of-Fit Test (taking the proportions of Run A as the target)?

        Tx 1 Tx 2 Tx 3 Tx 4 Tx 5 Tx 6 Tx7 Tx8 Tx 9 Tx 10
        Run A 1821 1995 1997 887 889 3549 1746 2383 3132 5291
        Run A’ 1730 1895 1897 843 845 3372 1659 2264 2975 5026

        (A’ = 0.95A; this simulates a constant 5% reduction in counts in Run B)

        The Chi-Square Test p-value is 1

        Tx 1 Tx 2 Tx 3 Tx 4 Tx 5 Tx 6 Tx7 Tx8 Tx 9 Tx 10
        Run A 1821 1995 1997 887 889 3549 1746 2383 3132 5291
        Run A” 1638 1796 1796 799 799 3195 1571 2146 2818 4763

        (A” = 0.90A; this simulates a constant 10% reduction in counts in Run B)

        The Chi-Square Test p-value is 1

        2) This implies the test results do not change when I multiply one dataset by a constant value. Is this understanding correct?

        Tx 1 Tx 2 Tx 3 Tx 4 Tx 5 Tx 6 Tx7 Tx8 Tx 9 Tx 10
        Run A 1821 1995 1997 887 889 3549 1746 2383 3132 5291
        Run A+100 1921 2095 2097 987 989 3649 1846 2483 3232 5391

        (this simulates a constant increase in count by 100 in Run B)

        The Chi-Square Test p-value is 0.716

        3) This implies the test results change when I add a constant value to one dataset. Is this understanding correct?

        Note: The Chi-Square test on this site is not displaying the results. I have used most of the other tests and graphs and they are working fine. A very useful website for researchers.

        Thanks,
        raghu

        • Charles says:

          Raghu,

          Sorry, but I still don’t understand the situation that you are describing. In any case, let me comment on whether the chi-square test result will change if you multiply by a constant or add a constant.

          The following is a 2 x 2 contingency table. The p-value for the chi-square test of independence for this table is .93021

          5  7
          6  9

          If I add 1 to the second row, I get the following contingency table. The p-value for this table is .97894

          5  7
          7  10

          If I multiply the second row by 2, I get the following contingency table. The p-value for this table is .92081

          5  7
          12 18

          As you can see, the p-values are all different.

          Charles
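The three p-values quoted above can be verified with a short computation, assuming the tables are the 2×2 tables with rows (5, 7)/(6, 9), (5, 7)/(7, 10) and (5, 7)/(12, 18):

```python
from math import erfc, sqrt

def chi2_p_2x2(a, b, c, d):
    """p-value of the chi-square test of independence for the
    2x2 table [[a, b], [c, d]] (1 df, no continuity correction)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(chi2 / 2))  # right-tail chi-square probability, 1 df

print(round(chi2_p_2x2(5, 7, 6, 9), 5))    # 0.93021
print(round(chi2_p_2x2(5, 7, 7, 10), 5))   # 0.97894
print(round(chi2_p_2x2(5, 7, 12, 18), 5))  # 0.92081
```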

  2. Aisha Anwar says:

    Hi Charles,
    I want to ask whether there exists a relationship between my two variables. I'm a little confused about whether to use correlation or chi-square, because one variable is ordinal and the other one is a scale variable. Hope to hear from you soon.

  3. Alex says:

    Thank you for the very useful tool. I noticed that Real Statistics gives Alpha=5 instead of 0.05, which results in #NUM! errors in the x-crit and sig columns of the CHI-SQUARE table. Correcting the value of Alpha gives the right results.

    Regards
    Alex

    • Charles says:

      Alex,
      Unfortunately, this is a common problem with some versions of Excel where decimals are represented by 0,05 instead of 0.05. The software seems to work properly in some cases, but not in others. The good news is that you just need to enter the value you want in the dialog box (instead of using the default) and then the tool works properly.
      Charles

  4. Ara says:

    Hi Charles,
    I would like to ask whether the grand total must always be equal to the sample size. I have two variables, age and symptoms, and I need to test whether these two are independent of each other. Under symptoms I have back pain, itchiness, etc., and one respondent can choose more than one symptom. The problem is that when I make a contingency table its grand total will be higher than the sample size. Is that okay? Thanks!

    • Charles says:

      Ara,
      The grand total equals the sample size only when each respondent can choose just one symptom. Since your respondents can choose more than one, you can't use the chi-square test of independence in the form described.
      Charles
