We show how to calculate Kendall’s tau and how to use it for hypothesis testing.
Definition 1: Let x1, …, xn be a sample for random variable x and let y1, …, yn be a sample for random variable y of the same size n. There are C(n, 2) possible ways of selecting distinct pairs (xi, yi) and (xj, yj). For any such assignment of pairs, define each pair as concordant, discordant or neither as follows:
- concordant if (xi > xj and yi > yj) or (xi < xj and yi < yj)
- discordant if (xi > xj and yi < yj) or (xi < xj and yi > yj)
- neither if xi = xj or yi = yj (i.e. ties are not counted).
Now let C = the number of concordant pairs and D = the number of discordant pairs. Then define tau as
τ = (C – D) / C(n, 2)
Observation: If there are no ties, then C(n, 2) = C + D, and so C – D = C(n, 2) – 2D. Thus
τ = 1 – 2D / C(n, 2)
Observation: To facilitate the calculation of C – D it is best to first put all the x data elements in ascending order. If x and y are perfectly positively correlated, then all the values of y would be in ascending order too, and so if there are no ties then C = C(n, 2) and τ = 1.
Otherwise, there will be some inversions. With the x values in ascending order, for each i count the number of j > i for which yj < yi. The sum of these counts is D. If x and y are perfectly negatively correlated, then all the values of y would be in descending order, and so if there are no ties then D = C(n, 2) and τ = -1.
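The inversion-counting procedure described above can be sketched in Python as follows. This is a minimal illustration that assumes there are no ties; the function name tau_by_inversions and the sample data are hypothetical and are not taken from the figures.

```python
from math import comb

def tau_by_inversions(x, y):
    """Kendall's tau for data without ties, via inversion counting."""
    # Sort the pairs by x so that only the order of the y values matters.
    ys = [b for _, b in sorted(zip(x, y))]
    n = len(ys)
    # D = number of inversions: positions i < j with ys[j] < ys[i].
    d = sum(1 for i in range(n) for j in range(i + 1, n) if ys[j] < ys[i])
    # With no ties, tau = 1 - 2D / C(n, 2).
    return d, 1 - 2 * d / comb(n, 2)

# Hypothetical data with no ties (not the data from the figures).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
d, tau = tau_by_inversions(x, y)
print(d, tau)
```

The double loop is quadratic in n, which is fine for small samples such as those in the examples.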
Property 1: –1 ≤ τ ≤ 1
Proof: This is a result of the fact that there are C(n, 2) pairings with C(n, 2) = C + D + T where T = the number of tied pairs. Thus
τ = (C – D) / (C + D + T)
τ is maximum when D = T = 0 and so τ = 1. τ is minimum when C = T = 0 and so τ = -1.
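Definition 1 and the identity C(n, 2) = C + D + T can be checked directly by classifying every pair. The sketch below does this by brute force; the function names pair_counts and kendall_tau and the sample data are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

def pair_counts(x, y):
    """Return (C, D, T): concordant, discordant and tied pair counts."""
    c = d = t = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:        # both differences have the same sign
            c += 1
        elif s < 0:      # the differences have opposite signs
            d += 1
        else:            # xi == xj or yi == yj
            t += 1
    return c, d, t

def kendall_tau(x, y):
    """tau = (C - D) / C(n, 2), with ties counted as neither."""
    c, d, _ = pair_counts(x, y)
    return (c - d) / comb(len(x), 2)

# Hypothetical data containing one tie in x and one tie in y.
x = [1, 2, 2, 3]
y = [1, 3, 2, 2]
c, d, t = pair_counts(x, y)
print(c, d, t, c + d + t == comb(len(x), 2), kendall_tau(x, y))
```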
Definition 2: Unlike Spearman’s rho, there is a commonly accepted measure of standard error for Kendall’s tau (assuming the null hypothesis that x and y are independent), namely
sτ = (1/3) ∙ √((2n + 5) / C(n, 2))
Property 2: For sufficiently large n (generally n ≥ 10), the following statistic has a standard normal distribution (under the assumption that x and y are independent):
z = τ / sτ
Observation: Property 2 can be used to test the null hypothesis that x and y are independent, i.e. the (population) correlation coefficient is zero.
For smaller values of n the table of critical values found in Kendall’s Tau Table can be used. The values of the elements in this table can be found using the following supplemental function.
Real Statistics Function: The following function is provided in the Real Statistics Resource Pack:
TauCRIT(n, α, t) = the critical value of the Kendall’s tau test for samples of size n, for the given value of alpha (default .05), and t = 1 (one tail) or 2 (two tails), the default.
Example 1: Repeat the analysis for Example 1 of One Sample Hypothesis Testing for Correlation using Kendall’s tau (to determine whether longevity is independent of smoking) where the last two data items have been modified as shown in range A3:B18 of Figure 1 (this was done to eliminate any ties).
Figure 1 – Hypothesis testing for Kendall’s tau
We begin by sorting the data in range A3:B18 in ascending order by life expectancy putting the results in range D4:E18. This can be done by using Excel’s sort capability (Data > Sort & Filter|Sort) or by using the Real Statistics supplemental array function =QSORTRows(A4:B18,1).
Since there are n = 15 people in the sample, there are C(15, 2) = 105 pairs of elements. We next calculate how many inversions D we have for each of the pairs. Let’s begin by looking at the number of inversions which corresponds to the person in row 8 (i.e. F8).
Since the number of cigarettes smoked by that person is 14 (the value in cell E8), we count the entries in column E below E8 that have value smaller than 14. This is 5 since only the entries in cells E10, E14, E15, E16 and E18 have smaller values. We carry out the same calculation for each of the rows and sum the result to get 80 (the value in cell F19).
This calculation is carried out by putting the formula =COUNTIF(E5:E$19,"<"&E4) in cell F4 (see Built-in Excel Functions for a description of COUNTIF). Next highlight the range F4:F18 and press Ctrl-D to copy this formula into all the relevant cells in column F. Cell F8 now contains the formula =COUNTIF(E9:E$19,"<"&E8). This approach works as long as cell E19 is left empty or contains non-numeric text; a numeric entry there, such as 0, would be counted whenever it is smaller than the value being compared.
We now can calculate the key statistics (column I) as described in Figure 1 where column K lists the formulas that are found in the cells in column I. We see that tau is -.524 (cell I8). Since p-value = .003 < .025 = α/2 (two-tail test), the null hypothesis is rejected (that smoking and longevity are independent), and so we conclude there is a negative correlation between smoking and longevity.
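The key statistics can be reproduced outside Excel from the two numbers stated in the text, n = 15 and D = 80. The following is a sketch under the normal approximation of Property 2, using the usual no-ties standard error (1/3)√((2n + 5)/C(n, 2)); the function name tau_z_test is hypothetical.

```python
from math import comb, erfc, sqrt

def tau_z_test(tau, n):
    """Return (standard error, z, two-tailed p) under the normal approximation."""
    se = sqrt((2 * n + 5) / comb(n, 2)) / 3
    z = tau / se
    p_two = erfc(abs(z) / sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return se, z, p_two

n, d = 15, 80                        # values stated in Example 1
tau = 1 - 2 * d / comb(n, 2)         # valid because there are no ties
se, z, p_two = tau_z_test(tau, n)
print(round(tau, 3), round(se, 3), round(z, 3))
print("two-tailed p:", round(p_two, 4), "one-tailed p:", round(p_two / 2, 4))
```

Comparing the one-tailed p-value with α/2 is equivalent to comparing the two-tailed p-value with α.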
We can also establish a 95% confidence interval for tau as follows:
τ ± zcrit ∙ sτ = -.524 ± (1.96)(.192) = (-.901, -.147)
where the endpoints are computed from the unrounded values of τ and sτ.
Also note that TauCRIT(15,.05,2) = .395. Since τcrit = .395 < .524 = |τ|, we can again reject the null hypothesis.
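The confidence interval can be recomputed from n = 15 and D = 80 as a quick check. This sketch uses the no-ties standard error (1/3)√((2n + 5)/C(n, 2)) and obtains the critical value from Python's statistics.NormalDist rather than from a table; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, d = 15, 80                              # values stated in Example 1
tau = 1 - 2 * d / comb(n, 2)               # no ties, so tau = 1 - 2D / C(n, 2)
se = sqrt((2 * n + 5) / comb(n, 2)) / 3    # standard error under independence
z_crit = NormalDist().inv_cdf(0.975)       # two-sided 95% critical value
lower, upper = tau - z_crit * se, tau + z_crit * se
print(round(lower, 3), round(upper, 3))
```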
Observation: If there are a large number of ties, then the denominator C(n, 2) in the above definition should be replaced by
√((C(n, 2) – nx) ∙ (C(n, 2) – ny))
where nx is the number of pairs with a tie in variable x and ny is the number of pairs with a tie in variable y.
The calculation of ny is similar to that of D given above, namely for each i, count the number of j > i for which yi = yj. This sum is ny. Calculating nx is similar (count the j > i for which xi = xj), although potentially easier since the xi are in ascending order.
Since in general C(m, 2) = 1 + 2 +⋯+ (m–1) = m(m–1)/2, it follows that
nx = Σ C(ti, 2) = Σ ti(ti – 1)/2 and ny = Σ C(uj, 2) = Σ uj(uj – 1)/2
where ti = the number of elements in the ith group of ties among the x values and uj = the number of elements in the jth group of ties among the y values.
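The tie adjustment can be sketched in Python by counting tie groups with collections.Counter and dividing C – D by √((C(n, 2) – nx)(C(n, 2) – ny)). The function names tied_pairs and tau_with_ties, and the sample data, are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
from math import comb, sqrt

def tied_pairs(values):
    """Pairs tied on one variable: the sum of C(t, 2) over its tie groups."""
    return sum(comb(t, 2) for t in Counter(values).values())

def tau_with_ties(x, y):
    """(C - D) / sqrt((C(n,2) - nx) * (C(n,2) - ny))."""
    total = comb(len(x), 2)
    c = d = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        c += s > 0       # concordant pair
        d += s < 0       # discordant pair
    nx, ny = tied_pairs(x), tied_pairs(y)
    return (c - d) / sqrt((total - nx) * (total - ny))

# Hypothetical data: one tied pair in x and one tied pair in y.
x = [1, 2, 2, 3]
y = [1, 3, 2, 2]
print(tied_pairs(x), tied_pairs(y), tau_with_ties(x, y))
```

When there are no ties, nx = ny = 0 and the result reduces to (C – D) / C(n, 2).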
Example 2: Repeat the analysis for Example 1 using Kendall’s tau for the data in range A3:B18 of Figure 2.
As we did in Example 1, we first sort the data, placing the results in range D3:E18. This time we can see that there are ties.
The calculation is similar to that used for Example 1, except that we need to account for the ties. In particular, the formula for inversions (D) needs to be modified. E.g. cell F4 now contains the formula =COUNTIFS(E5:E$19,"<"&E4,D5:D$19,">"&D4).
In order to calculate the modified denominator for tau we need to calculate nx and ny. E.g. the calculation for nx is carried out by putting the formula =COUNTIF(D5:D$19,"="&D4) in cell H4, which counts the later x values that are tied with D4. Next highlight the range H4:H18 and press Ctrl-D to copy this formula into all the relevant cells in column H. Placing the formula =SUM(H4:H18) in cell H19 yields the value for nx.
ny (cell I19) can be calculated in a similar way.
We can calculate the value of C as the sum of the concordance elements in a fashion similar to that used to calculate D.
E.g. cell G4 contains the formula =COUNTIFS(E5:E$19,">"&E4,D5:D$19,">"&D4). Alternatively, we note that C = C(n, 2) – D – T. Now C(n, 2) = C(15, 2) = 105 (cell M5), D = 72 (cell F19) and T = nx + ny – nx&y = 7 + 4 – 1 = 10, and so C = 105 – 72 – 10 = 23. The number of tied pairs T is equal to the number of pairs tied in x plus the number of pairs tied in y minus the number of pairs tied in both x and y, nx&y. We calculate nx&y as the sum of the cells in column J where, for example, cell J4 contains the formula =COUNTIFS(D5:D$19,"="&D4,E5:E$19,"="&E4).
Kendall’s tau (cell M8) is calculated by the formula =(M7-M6)/SQRT((M5-H19)*(M5-I19)).
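The arithmetic of Example 2 can be retraced from the counts quoted in the text (n = 15, D = 72, nx = 7, ny = 4, nx&y = 1). This is only a sketch of the calculation; Figure 2 is not reproduced here, so the resulting value of tau is not checked against the worksheet.

```python
from math import comb, sqrt

n = 15
total = comb(n, 2)             # C(15, 2) = 105
d = 72                         # discordant pairs, as stated in the text
nx, ny, nxy = 7, 4, 1          # pairs tied in x, in y, and in both
t = nx + ny - nxy              # tied pairs, counted once each
c = total - d - t              # concordant pairs
tau = (c - d) / sqrt((total - nx) * (total - ny))
print(total, t, c, round(tau, 3))
```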
Observation: If there are a lot of ties we also need to modify the calculation of the standard error as follows:
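A commonly cited tie-corrected form of the test, written in terms of the tie-group sizes ti and uj defined earlier, is sketched below. The labels v0, vt, vu, v1 and v2 are chosen here as an assumption and may differ from the notation used in Figure 4.

```latex
\begin{aligned}
z   &= \frac{C - D}{\sqrt{v}}, \qquad v = \frac{v_0 - v_t - v_u}{18} + v_1 + v_2 \\
v_0 &= n(n-1)(2n+5) \\
v_t &= \sum_i t_i (t_i - 1)(2 t_i + 5), \qquad v_u = \sum_j u_j (u_j - 1)(2 u_j + 5) \\
v_1 &= \frac{\sum_i t_i (t_i - 1) \, \sum_j u_j (u_j - 1)}{2n(n-1)} \\
v_2 &= \frac{\sum_i t_i (t_i - 1)(t_i - 2) \, \sum_j u_j (u_j - 1)(u_j - 2)}{9n(n-1)(n-2)}
\end{aligned}
```

In this form √v is the standard error of C – D rather than of τ. When there are no ties, v reduces to n(n–1)(2n+5)/18 and z equals τ divided by the untied standard error (1/3)√((2n + 5)/C(n, 2)).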
Example 3: Repeat the analysis for Example 2 using the improved version of the standard error and z described above.
We show the analysis in Figure 3.
Figure 3 – Hypothesis testing for Kendall’s tau: improved version
C and D are calculated as before, but this time we handle the ties using the formulas
Column H contains a non-zero value only for those values in column D (the x values) which are the first one of a group of ties. This value is one less than the number of ties in that group. Similarly column I handles the ties from column E (the y values). E.g. the value 78 occurs 4 times in column D, the first of these in cell D12, and so cell H12 contains the value 4 – 1 = 3. This is done using the formula
Thus there are C(4, 2) = 6 pairs with value 78.
Since for any m, C(m, 2) = m(m–1)/2, and column H holds ti – 1 for each group of ties, we can calculate the number of tied pairs for x, nx = Σ C(ti, 2), by the formula =SUMPRODUCT(H4:H18,H4:H18+1)/2, and similarly for ny. In a similar fashion we can calculate the values of all the formulas in the previous observation, as shown in Figure 4.
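The column H bookkeeping and the SUMPRODUCT step can be mimicked in Python as follows. This sketch assumes the values are already sorted, as the x values are; the function names and the sample list are hypothetical, apart from the value 78 occurring four times as in the text.

```python
def first_of_group_marks(sorted_values):
    """Mimic column H: (group size - 1) at the first item of each tie group, else 0."""
    marks = [0] * len(sorted_values)
    i = 0
    while i < len(sorted_values):
        j = i
        # Advance j to the last element equal to sorted_values[i].
        while j + 1 < len(sorted_values) and sorted_values[j + 1] == sorted_values[i]:
            j += 1
        marks[i] = j - i
        i = j + 1
    return marks

def tied_pairs_from_marks(marks):
    """Same arithmetic as =SUMPRODUCT(H, H+1)/2: sum of h(h+1)/2 over the column."""
    return sum(h * (h + 1) for h in marks) // 2

values = [70, 72, 78, 78, 78, 78, 80, 80]   # hypothetical sorted x values
marks = first_of_group_marks(values)
print(marks, tied_pairs_from_marks(marks))
```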
Figure 4 – Calculation of standard error
From these values we can calculate the standard error (cell L10) and the z-score (cell L12) as shown in Figure 3.
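For readers working outside Excel, a tie-corrected z-score can be sketched as below. It uses a widely quoted variance formula for C – D in the presence of ties; whether this matches the worksheet formulas in Figures 3 and 4 cell for cell is an assumption, and the function name tie_corrected_z is hypothetical.

```python
from collections import Counter
from math import sqrt

def tie_corrected_z(c, d, x, y):
    """z = (C - D) / sqrt(v), with v adjusted for tie groups in x and y (needs n >= 3)."""
    n = len(x)
    t = list(Counter(x).values())   # sizes of the groups of equal x values
    u = list(Counter(y).values())   # sizes of the groups of equal y values
    v0 = n * (n - 1) * (2 * n + 5)
    vt = sum(k * (k - 1) * (2 * k + 5) for k in t)
    vu = sum(k * (k - 1) * (2 * k + 5) for k in u)
    v1 = (sum(k * (k - 1) for k in t) * sum(k * (k - 1) for k in u)
          / (2 * n * (n - 1)))
    v2 = (sum(k * (k - 1) * (k - 2) for k in t)
          * sum(k * (k - 1) * (k - 2) for k in u)
          / (9 * n * (n - 1) * (n - 2)))
    v = (v0 - vt - vu) / 18 + v1 + v2
    return (c - d) / sqrt(v)

# Sanity check without ties: n = 15, D = 80, C = 105 - 80 = 25 (Example 1).
no_ties = list(range(15))           # placeholder values; only tie structure is used
print(round(tie_corrected_z(25, 80, no_ties, no_ties), 3))
```

Groups of size 1 contribute zero to every sum, so with no ties the result coincides with τ divided by the untied standard error.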