In Independence Testing we used the chi-square test to determine whether two variables were independent. We now look at the same problem using dichotomous variables.
Example 1: Calculate the point-biserial correlation coefficient for the data in Example 2 of Independence Testing (repeated in Figure 1) using dichotomous variables.
Figure 1 – Contingency table for data in Example 1
This time let x = 1 if the patient is cured and x = 0 if the patient is not cured, and let y = 1 if therapy 1 is used and y = 0 if therapy 2 is used. Thus for 31 patients x = 1 and y = 1, for 11 patients x = 0 and y = 1, for 57 patients x = 1 and y = 0 and for 51 patients x = 0 and y = 0.
If we list all 150 pairs of x and y as in range U4:U153 of Figure 2 (only the first 6 data rows are shown), we can calculate the correlation coefficient using the CORREL function to get r = .192.
Figure 2 – Calculation of the point-biserial correlation coefficient
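The same calculation can be reproduced outside of Excel. The following is a minimal Python sketch (using the counts 31, 11, 57, 51 from Figure 1) that expands the contingency table into the 150 (x, y) pairs and computes the Pearson correlation directly, mirroring what CORREL does:

```python
from math import sqrt

# Expand the contingency table of Figure 1 into 150 (x, y) pairs:
# x = 1 if the patient is cured, y = 1 if therapy 1 was used
pairs = [(1, 1)] * 31 + [(0, 1)] * 11 + [(1, 0)] * 57 + [(0, 0)] * 51

n = len(pairs)
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]
mx, my = sum(xs) / n, sum(ys) / n

# Pearson correlation coefficient, i.e. what Excel's CORREL computes
num = sum((x - mx) * (y - my) for x, y in pairs)
den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
r = num / den  # ≈ .192, the point-biserial correlation
```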
Observation: Instead of listing all n pairs of sample values and using the CORREL function, we can calculate the correlation coefficient using Property 3 of Relationship between Correlation and t Test, which is especially useful for large values of n. This is shown in Figure 3.
Figure 3 – Alternative approach
In fact, a little algebra shows that the correlation coefficient can also be calculated using the formula =(B4*C6-C4*B6)/SQRT(B6*C6*D4*D5).
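In terms of the four cell counts a, b, c, d and the four marginal totals, this shortcut is r = (ad − bc)/√(r₁·r₂·c₁·c₂). A quick Python check using the counts from Figure 1:

```python
from math import sqrt

# Cell counts from Figure 1
a, b = 31, 11   # therapy 1: cured, not cured
c, d = 57, 51   # therapy 2: cured, not cured

# r = (ad - bc) / sqrt(product of the four marginal totals)
r = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
# gives the same value as CORREL on the full list of pairs
```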
Property 1: For problems such as those in Example 1, if ρ = 0 (the null hypothesis), then nr² ~ χ²(1).
Observation: Property 1 provides an alternative method for carrying out chi-square tests such as the one we did in Example 2 of Independence Testing.
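Property 1 is easy to verify numerically: for the Figure 1 counts, nr² reproduces the usual Σ(O − E)²/E statistic exactly. A Python sketch (counts as in Figure 1):

```python
from math import sqrt

obs = [[31, 11], [57, 51]]        # observed counts from Figure 1
n = sum(sum(row) for row in obs)  # 150 patients

# Classical chi-square statistic: sum of (O - E)^2 / E over the four cells
rows = [sum(row) for row in obs]
cols = [sum(col) for col in zip(*obs)]
chi2 = sum((obs[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
           for i in range(2) for j in range(2))

# n * r^2, with r from the shortcut formula
r = (obs[0][0] * obs[1][1] - obs[0][1] * obs[1][0]) / sqrt(rows[0] * rows[1] * cols[0] * cols[1])
nr2 = n * r ** 2  # agrees with chi2
```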
Example 2: Using Property 1, determine whether there is a significant difference in the two therapies for curing patients of cocaine dependence based on the data in Figure 1.
Figure 4 – Chi-square test for Example 2
Note that the chi-square value nr² = 150(.192)² ≈ 5.52 is the same (up to rounding) as the value we saw in Example 2 of Independence Testing. Since the p-value = CHIDIST(5.52, 1) = .019 < .05 = α, we again reject the null hypothesis and conclude there is a significant difference between the two therapies.
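The p-value can also be recovered without Excel: for 1 degree of freedom, the chi-square survival function reduces to erfc(√(χ²/2)). A sketch, taking the test statistic as nr² with r = .192 from Example 1:

```python
from math import erfc, sqrt

chi2 = 150 * 0.192 ** 2         # n * r^2 for the Figure 1 data (≈ 5.53)

# For 1 df, P(X >= chi2) = erfc(sqrt(chi2 / 2)),
# the same value Excel's CHIDIST(chi2, 1) returns
p_value = erfc(sqrt(chi2 / 2))  # < .05, so reject the null hypothesis
```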
Observation: If we calculate the value of χ² for independence as in Independence Testing, from the previous observation we conclude that r = √(χ²/n). This gives us a way to measure the effect size of the chi-square test of independence, namely φ = √(χ²/n).
Care should be taken with the use of φ since even relatively small values can indicate an important effect. E.g. in the previous example, there is clearly an important difference between the two therapies (not just a significant difference), but if we look at r² we see that only 3.7% of the variance is explained by the choice of therapy.
Observation: In Example 1 we calculated the correlation coefficient of x with y by listing all 150 pairs of values and then using Excel’s correlation function CORREL. The following is an alternative approach for calculating r, which is especially useful when n is very large.
Figure 5 – Calculation of r for data in Example 1
First we repeat the data from Figure 1 using the dummy variables x and y (in range F4:H7). Essentially this is a frequency table. We then calculate the mean of x and y. E.g. the mean of x (in cell F10) is calculated by the formula =SUMPRODUCT(F4:F7,H4:H7)/H8.
Next we calculate the sums Σ(xi – x̄)(yi – ȳ), Σ(xi – x̄)² and Σ(yi – ȳ)² (in cells L8, M8 and N8). E.g. the first of these terms is calculated by the formula =SUMPRODUCT(L4:L7,O4:O7). The point-biserial correlation coefficient is then the first of these terms divided by the square root of the product of the other two, i.e. r = L8/SQRT(M8*N8).
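The frequency-table computation of Figure 5 can be mirrored in Python. The following sketch uses the same four (x, y, frequency) rows, with frequency-weighted means and sums of squares playing the role of the SUMPRODUCT formulas:

```python
from math import sqrt

# Frequency table as in Figure 5: (x, y, count)
table = [(1, 1, 31), (0, 1, 11), (1, 0, 57), (0, 0, 51)]
n = sum(f for _, _, f in table)

# Weighted means (the SUMPRODUCT(...)/H8 step)
mx = sum(x * f for x, _, f in table) / n
my = sum(y * f for _, y, f in table) / n

# Weighted sums of cross products and squared deviations (cells L8, M8, N8)
sxy = sum(f * (x - mx) * (y - my) for x, y, f in table)
sxx = sum(f * (x - mx) ** 2 for x, _, f in table)
syy = sum(f * (y - my) ** 2 for _, y, f in table)

r = sxy / sqrt(sxx * syy)  # same value as CORREL on the full 150 pairs
```

Because the frequency table has only four rows regardless of the sample size, this approach scales to arbitrarily large n.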