Kendall’s Tau Correlation Detailed

We show how to calculate Kendall’s tau and how to use it for hypothesis testing.

Definition 1: Let x1, …, xn be a sample for random variable x and let y1, …, yn be a sample for random variable y of the same size n. There are C(n, 2) possible ways of selecting distinct pairs (xi, yi) and (xj, yj). For any such assignment of pairs, define each pair as concordant, discordant or neither as follows:

  • concordant if (xi > xj and yi > yj) or (xi < xj and yi < yj)
  • discordant if (xi > xj and yi < yj) or (xi < xj and yi > yj)
  • neither if xi = xj or yi = yj (i.e. ties are not counted).

Now let C = the number of concordant pairs and D = the number of discordant pairs. Then define tau as

τ = (C – D) / C(n, 2)
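A minimal sketch of Definition 1 in Python, assuming tau is taken as (C - D) / C(n, 2). The function name `kendall_tau_pairs` and the sample data are hypothetical, chosen only for illustration; the loop simply classifies every pair as concordant, discordant or neither.

```python
from itertools import combinations
from math import comb

def kendall_tau_pairs(x, y):
    """Return (C, D, tau) by classifying every pair as in Definition 1."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            concordant += 1      # both differences have the same sign
        elif sign < 0:
            discordant += 1      # the differences have opposite signs
        # sign == 0 means a tie in x or y: counted as neither
    tau = (concordant - discordant) / comb(len(x), 2)
    return concordant, discordant, tau

# Hypothetical data, not taken from the examples on this page
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(kendall_tau_pairs(x, y))
```

This brute-force version inspects all C(n, 2) pairs, which is fine for small samples like the ones used on this page.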

Observation: If there are no ties, then C(n, 2) = C + D. Thus

τ = (C – D) / (C + D) = 1 – 2D / C(n, 2)

Alternatively

τ = 2C / C(n, 2) – 1

Observation: To facilitate the calculation of C – D it is best to first put all the x data elements in ascending order. If x and y are perfectly positively correlated, then all the values of y would be in ascending order too, and so if there are no ties then C = C(n, 2) and τ  = 1.

Otherwise, there will be some inversions. With the data sorted by x, for each i count the number of j > i for which yj < yi. The sum of these counts is D. If x and y are perfectly negatively correlated, then all the values of y would be in descending order, and so if there are no ties then D = C(n, 2) and τ = -1.
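The inversion-counting procedure can be sketched as follows, assuming there are no ties so that C + D = C(n, 2) and hence tau = 1 - 2D / C(n, 2). The name `tau_by_inversions` is hypothetical.

```python
from math import comb

def tau_by_inversions(x, y):
    """Return (D, tau) for data without ties, via inversion counting."""
    # Sort the pairs by x, then read off the y values in that order
    y_sorted = [yi for _, yi in sorted(zip(x, y))]
    n = len(y_sorted)
    # For each position i, count the later y values that are smaller
    per_row = [sum(1 for later in y_sorted[i + 1:] if later < y_sorted[i])
               for i in range(n)]
    d = sum(per_row)
    # No ties: C = C(n, 2) - D, so tau = (C - D) / C(n, 2) = 1 - 2D / C(n, 2)
    return d, 1 - 2 * d / comb(n, 2)
```

If y is already in ascending order after sorting by x, every per-row count is zero and tau is 1; if y is in descending order, D equals C(n, 2) and tau is -1.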

Property 1:
-1 ≤ τ ≤ 1

Proof: This follows from the fact that there are C(n, 2) pairs in total, with C(n, 2) = C + D + T where T = the number of tied pairs. Thus

τ = (C – D) / (C + D + T)

τ is maximum when D = T = 0 and so τ = 1. τ is minimum when C = T = 0 and so τ = -1.

Definition 2: Unlike Spearman’s rho, there is a commonly accepted measure of standard error for Kendall’s tau (assuming the null hypothesis that x and y are independent), namely

sτ = √(2(2n + 5) / (9n(n – 1)))

Property 2: For sufficiently large n (generally n ≥ 10), the following statistic has a standard normal distribution (under the assumption that x and y are independent):

z = τ / sτ
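A hedged sketch of the normal approximation in Python, assuming the no-ties standard error sqrt(2(2n + 5) / (9n(n - 1))) and the statistic z = tau / s. The function name `kendall_z_test` and its `tails` parameter are hypothetical (the latter mirrors the one-tail/two-tail switch used elsewhere on this page).

```python
from math import sqrt
from statistics import NormalDist

def kendall_z_test(tau, n, tails=2):
    """Return (standard error, z, p-value) under H0: x and y are independent."""
    se = sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))  # standard error of tau under H0
    z = tau / se
    p_one_tail = NormalDist().cdf(-abs(z))          # area beyond |z| in one tail
    return se, z, tails * p_one_tail
```

With `tails=2` the returned p-value is compared directly with alpha; with `tails=1` it is the one-tail area.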

Observation: Property 2 can be used to test the null hypothesis that x and y are independent, i.e. the (population) correlation coefficient is zero.

For smaller values of n the table of critical values found in Kendall’s Tau Table can be used. The values of the elements in this table can be found using the following supplemental function.

Real Statistics Function: The following function is provided in the Real Statistics Resource Pack:

TauCRIT(n, α, t) = the critical value of the Kendall’s tau test for samples of size n, for the given value of alpha (default .05), and t = 1 (one tail) or 2 (two tails), the default.

Example 1: Repeat the analysis for Example 1 of One Sample Hypothesis Testing for Correlation using Kendall’s tau (to determine whether longevity is independent of smoking) where the last two data items have been modified as shown in range A3:B18 of Figure 1 (this was done to eliminate any ties).


Figure 1 – Hypothesis testing for Kendall’s tau

We begin by sorting the data in range A3:B18 in ascending order by life expectancy putting the results in range D4:E18. This can be done by using Excel’s sort capability (Data > Sort & Filter|Sort) or by using the Real Statistics supplemental array function =QSORTRows(A4:B18,1).

Since there are n = 15 people in the sample, there are C(15, 2) = 105 pairs of elements. We next count the inversions contributed by each row of the sorted data; the sum of these counts is D. Let’s begin by looking at the number of inversions which corresponds to the person in row 8 (i.e. cell F8).

Since the number of cigarettes smoked by that person is 14 (the value in cell E8), we count the entries in column E below E8 that have value smaller than 14. This is 5 since only the entries in cells E10, E14, E15, E16 and E18 have smaller values. We carry out the same calculation for each of the rows and sum the result to get 80 (the value in cell F19).

This calculation is carried out by putting the formula =COUNTIF(E5:E$19,"<"&E4) in cell F4 (see Built-in Excel Functions for a description of COUNTIF). Next highlight the range F4:F18 and press Ctrl-D to copy this formula into all the relevant cells in column F. Cell F8 now contains the formula =COUNTIF(E9:E$19,"<"&E8). This approach works as long as cell E19 is left empty or contains a blank.

We now can calculate the key statistics (column I) as described in Figure 1 where column K lists the formulas that are found in the cells in column I. We see that tau is -.524 (cell I8). Since p-value = .003 < .025 = α/2 (two-tail test), the null hypothesis is rejected (that smoking and longevity are independent), and so we conclude there is a negative correlation between smoking and longevity.
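As a quick arithmetic check, the key statistics can be recomputed from the two numbers given in the text, n = 15 and D = 80. This is only a sketch: it assumes no ties (so tau = 1 - 2D / C(n, 2)) and uses the no-ties standard error; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, d = 15, 80                       # sample size and inversion total from Example 1
pairs = comb(n, 2)                  # 105 pairs
tau = 1 - 2 * d / pairs             # no ties, so C = pairs - D
se = sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
z = tau / se
p_one_tail = NormalDist().cdf(z)    # lower tail, since tau is negative

print(round(tau, 3), round(se, 3), round(z, 2), round(p_one_tail, 3))
```

Up to rounding, this should reproduce the tau of -.524 and the p-value of .003 quoted above.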

We can also establish a 95% confidence interval for tau as follows:

τ ± zcrit ∙ sτ = -.524 ± (1.96)(.192) = (-.901, -.147)

Also note that TauCRIT(15,.05,2) = .395. Since τcrit = .395 < .524 = |τ| again we can reject the null hypothesis.

Observation: If there are a large number of ties, then the denominator in the above definition should be replaced by

√((C(n, 2) – nx)(C(n, 2) – ny))

where nx is the number of pairs with a tie in variable x and ny is the number of pairs with a tie in variable y.

The calculation of ny is similar to that of D given above, namely for each i, count the number of j > i for which yi = yj. This sum is ny. Calculating nx is similar, although potentially easier since the xi are in ascending order.

Since in general C(m, 2) = 1 + 2 +⋯+ (m–1), it follows that

nx = Σi C(ti, 2) = Σi ti(ti – 1)/2

ny = Σj C(uj, 2) = Σj uj(uj – 1)/2

where ti = the number of elements in the ith group of ties among the x values and uj = the number of elements in the jth group of ties among the y values.
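The two ways of counting tied pairs can be compared with a short Python sketch. Both function names are hypothetical: the first counts equal pairs directly (for each i, the j > i with the same value), and the second sums C(t, 2) over the groups of ties.

```python
from collections import Counter
from itertools import combinations
from math import comb

def tied_pairs_direct(values):
    """Count pairs (i, j) with i < j whose values are equal."""
    return sum(1 for a, b in combinations(values, 2) if a == b)

def tied_pairs_grouped(values):
    """Same count via group sizes: sum of C(t, 2) over each group of ties."""
    return sum(comb(t, 2) for t in Counter(values).values())

# Hypothetical values: one group of four ties and one group of two
sample = [78, 78, 78, 78, 80, 80, 81]
print(tied_pairs_direct(sample), tied_pairs_grouped(sample))
```

Values that occur only once form groups of size 1 and contribute C(1, 2) = 0, so they can be left in the sum.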

Example 2: Repeat the analysis for Example 1 using Kendall’s tau for the data in range A3:B18 of Figure 2.

Figure 2 – Hypothesis testing for Kendall’s tau (with ties)

As we did in Example 1, we first sort the data, placing the results in range D3:E18. This time we can see that there are ties.

The calculation is similar to that used for Example 1, except that we need to account for the ties. In particular, the formula for inversions (D) needs to be modified. E.g. cell F4 now contains the formula =COUNTIFS(E5:E$19,"<"&E4,D5:D$19,">"&D4).

In order to calculate the modified denominator for tau we need to calculate nx and ny. E.g. the calculation for nx is carried out by putting the formula =COUNTIF(D5:D$19,"="&D4) in cell H4. Next highlight the range H4:H18 and press Ctrl-D to copy this formula into all the relevant cells in column H. Placing the formula =SUM(H4:H18) in cell H19 yields the value for nx.

ny (cell I19) can be calculated in a similar way.

We can calculate the value of C as the sum of the concordance elements in a fashion similar to that used to calculate D.

E.g. cell G4 contains the formula =COUNTIFS(E5:E$19,">"&E4,D5:D$19,">"&D4). Alternatively we note that C = C(n, 2) – D – T. Now C(n, 2) = C(15, 2) = 105 (cell M5), D = 72 (cell F19) and T = nx + ny – nx&y = 7 + 4 – 1 = 10, and so C = 105 – 72 – 10 = 23. The number of tied pairs is equal to the number of pairs tied in x plus the number of pairs tied in y minus the number of pairs tied in both x and y, nx&y. We calculate nx&y as the sum of the cells in column J where for example cell J4 contains the formula =COUNTIFS(D5:D$19,"="&D4,E5:E$19,"="&E4).

Kendall’s tau (cell M8) is calculated by the formula =(M7-M6)/SQRT((M5-H19)*(M5-I19)).
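The same arithmetic can be sketched in Python from the counts quoted in the text (n = 15, D = 72, nx = 7, ny = 4, nx&y = 1). It is an assumption here that cells M7 and M6 hold C and D respectively; the variable names are illustrative.

```python
from math import comb, sqrt

n = 15
d = 72                   # discordant pairs (cell F19)
nx, ny, nxy = 7, 4, 1    # pairs tied in x, in y, and in both

pairs = comb(n, 2)       # C(15, 2) = 105
t = nx + ny - nxy        # total number of tied pairs
c = pairs - d - t        # concordant pairs
# Tie-adjusted denominator: sqrt((C(n,2) - nx) * (C(n,2) - ny))
tau = (c - d) / sqrt((pairs - nx) * (pairs - ny))

print(c, round(tau, 3))
```

Subtracting nxy avoids counting twice the pairs that are tied in both variables.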

Observation: If there are a lot of ties we also need to modify the calculation of the standard error as follows:

[Equations: definitions of the terms in the tie-corrected standard error]

Thus

[Equation: the resulting z statistic]
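For orientation, here is a sketch of one widely used tie correction for the normal approximation, the form usually quoted for Kendall's tau-b. It is an assumption that this corresponds to the formulas referenced in the observation above; the names `tie_corrected_z`, `v0`, `vt`, `vu`, `v1` and `v2` belong to this sketch only. The statistic is z = (C - D) / sqrt(v), where v is the tie-corrected variance of C - D.

```python
from collections import Counter
from math import sqrt

def tie_corrected_z(x, y):
    """z = (C - D) / sqrt(v) with the usual tie-corrected variance v (n >= 3)."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            c += s > 0           # concordant pair
            d += s < 0           # discordant pair
    t = list(Counter(x).values())    # sizes of the tie groups in x
    u = list(Counter(y).values())    # sizes of the tie groups in y
    v0 = n * (n - 1) * (2 * n + 5)
    vt = sum(k * (k - 1) * (2 * k + 5) for k in t)
    vu = sum(k * (k - 1) * (2 * k + 5) for k in u)
    v1 = (sum(k * (k - 1) for k in t) * sum(k * (k - 1) for k in u)
          / (2 * n * (n - 1)))
    v2 = (sum(k * (k - 1) * (k - 2) for k in t)
          * sum(k * (k - 1) * (k - 2) for k in u)
          / (9 * n * (n - 1) * (n - 2)))
    v = (v0 - vt - vu) / 18 + v1 + v2
    return (c - d) / sqrt(v)
```

When there are no ties, every group has size 1, the correction terms vanish, and v reduces to n(n - 1)(2n + 5)/18, which gives the same z as τ / sτ with the no-ties standard error.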

Example 3: Repeat the analysis for Example 2 using the improved version of the standard error and z described above.

We show the analysis in Figure 3.


Figure 3 – Hypothesis testing for Kendall’s tau: improved version

C and D are calculated as before, but this time we handle the ties using the formulas

nx = Σi C(ti, 2) = Σi ti(ti – 1)/2

ny = Σj C(uj, 2) = Σj uj(uj – 1)/2

Column H contains a non-zero value only for those values in column D (the x values) which are the first one of a group of ties. This value is one less than the number of ties in that group. Similarly column I handles the ties from column E (the y values). E.g. the value 78 occurs 4 times in column D, the first of these in cell D12, and so cell H12 contains the value 4 – 1 = 3. This is done using the formula

=IF(COUNTIF(D$3:D11,D12)=0,COUNTIF(D13:D$19,D12),0).

Thus there are C(4, 2) = 6 pairs with value 78.

Since for any m, C(m, 2) = m(m–1)/2, we can calculate the number of ties for x, nx = Σi C(ti, 2), by the formula =SUMPRODUCT(H4:H18,H4:H18+1)/2, and similarly for ny. In a similar fashion we can calculate the values of all the formulas in the previous observation, as shown in Figure 4.
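The column H approach can be mimicked in Python as a rough sketch. Both function names are hypothetical: the first stores, at the first occurrence of each value, how many more times that value occurs later (and 0 elsewhere), and the second plays the role of the SUMPRODUCT formula.

```python
def first_of_group_extras(values):
    """At the first occurrence of each value, store the number of later
    occurrences (group size minus one); store 0 everywhere else."""
    extras = []
    for i, v in enumerate(values):
        seen_before = v in values[:i]
        extras.append(0 if seen_before else values[i + 1:].count(v))
    return extras

def tied_pairs_from_extras(extras):
    """Equivalent of SUMPRODUCT(H, H+1)/2: sum of h(h+1)/2 = sum of C(t, 2)."""
    return sum(h * (h + 1) for h in extras) // 2

# Hypothetical column in which 78 occurs four times
column = [70, 78, 78, 78, 78, 80]
extras = first_of_group_extras(column)
print(extras, tied_pairs_from_extras(extras))
```

For a group of four equal values the stored entry is 3, and 3 × 4 / 2 = 6 matches C(4, 2).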


Figure 4 – Calculation of standard error

From m we can calculate the standard error (cell L10) and the z-score (cell L12) as shown in Figure 3.

11 Responses to Kendall’s Tau Correlation Detailed

  1. LB says:

    I think c9 should be 9 not 7

    and k7 should be I4-I5-I6 not I11

    don’t you think so?

    • Charles says:

      LB,
      Thanks for catching the mistake in cell C9. There was a mistake in the formula, which I have now corrected. The value should be 9 as you stated. Regarding cell K7 its value was already =I4-I5-I6. Thanks for your help.
      Charles

      • GTL says:

        Actually…LB is correct. K7 is shown as "=I4-I5-I11" in Figure 1 and should be "=I4-I5-I6". Take a look.

        • Charles says:

          Thanks GTL,
          The formula used in cell I7 was correct, but the description of the formula written in cell K7 was incorrect as you and LB correctly pointed out. I have now corrected this on the website. Thanks for your help.
          Charles

  2. Swapna says:

    Hi Charles,
    Can you please generalize the formula to calculate standard error?

    Thanking You
    Swapna

    • Charles says:

      Hi Swapna,
      The formula for standard error is given in cell K10 of Figure 1. I am not sure what you mean by generalizing this formula.
      Charles

  3. Swapna says:

    Thank you Charles.
    I am having some confusion kindly help me to sort it out.The range of Kendall Tau’s test is from -1 to +1. I do not know why but, in my case the result showing (-) ve tau-value are only significant otherwise (+) ve value are not showing significant why?
    If only the negative values will shows the significant in the test then what is the purpose of (+) and (-) rank correlation calculation.

    I have analysed 15 different sets of data 4 are (-) ve and 11 are (+) result. All 4 results are significant only due to they are having negative value other 11 result are not significant. I have tried to explain you in detail what problem exactly I am having, So kindly have some idea to solve this.

    Best regards
    SWAPNA

    • Charles says:

      Swapna,
      I don’t really understand your question. Can you send me an example which you believe will help me understand?
      Charles

      • Swapna says:

        Hi Charles
        Thank you so much for your kind attention. By this time, I have come over that problem of confusion and got the expected result which was suppose to be from my data, so no need to worry.
        Regards
        SWAPNA

  4. KEH says:

    Hi Charles,
    I think the p-value calculation shown in your first example (Fig. 1, with no ties) should be =NORMSDIST(I11) not F11?
    Thanks, KEH

    • Charles says:

      Hi KEH,
      Thanks for catching this typo. I have now changed the formula on the webpage as you have suggested.
      Charles
