We show how to calculate Kendall’s tau and how to use it for hypothesis testing.

**Definition 1:** Let *x*_{1}, …, *x*_{n} be a sample for random variable *x* and let *y*_{1}, …, *y*_{n} be a sample for random variable *y* of the same size *n*. There are *C*(*n*, 2) possible ways of selecting distinct pairs (*x*_{i}, *y*_{i}) and (*x*_{j}, *y*_{j}). For any such assignment of pairs, define each pair as concordant, discordant or neither as follows:

- concordant if (*x*_{i} > *x*_{j} and *y*_{i} > *y*_{j}) or (*x*_{i} < *x*_{j} and *y*_{i} < *y*_{j})
- discordant if (*x*_{i} > *x*_{j} and *y*_{i} < *y*_{j}) or (*x*_{i} < *x*_{j} and *y*_{i} > *y*_{j})
- neither if *x*_{i} = *x*_{j} or *y*_{i} = *y*_{j} (i.e. ties are not counted).

Now let *C* = the number of concordant pairs and *D* = the number of discordant pairs. Then define tau as

$$\tau = \frac{C - D}{C(n, 2)}$$

**Observation**: If there are no ties, then *C*(*n*, 2) = *C + D*. Thus

$$\tau = \frac{C - D}{C + D} = 1 - \frac{2D}{C(n, 2)}$$

**Observation**: To facilitate the calculation of *C – D* it is best to first put all the *x* data elements in ascending order. If *x* and y are perfectly positively correlated, then all the values of y would be in ascending order too, and so if there are no ties then *C* = *C*(*n*, 2) and *τ* = 1.

Otherwise, there will be some inversions. For each *i*, count the number of *j > i* for which *y*_{j} < *y*_{i}. The sum of these counts is *D*. If *x* and *y* are perfectly negatively correlated, then all the values of *y* would be in descending order, and so if there are no ties then *D* = *C*(*n*, 2) and *τ* = -1.
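The pair classification and the inversion count described above can be sketched in Python as follows. This is a minimal illustration, not the spreadsheet method itself; the function names and the five-point data set are hypothetical, and the inversion shortcut assumes there are no ties.

```python
from itertools import combinations
from math import comb


def classify_pairs(x, y):
    """Return (C, D, T): counts of concordant, discordant and tied pairs."""
    c = d = t = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            c += 1          # x and y move in the same direction
        elif sign < 0:
            d += 1          # x and y move in opposite directions
        else:
            t += 1          # tie in x or in y
    return c, d, t


def inversions_after_sorting(x, y):
    """Sort by x, then count pairs i < j with y_j < y_i (assumes no ties)."""
    ys = [yi for _, yi in sorted(zip(x, y))]
    return sum(
        1
        for i in range(len(ys))
        for j in range(i + 1, len(ys))
        if ys[j] < ys[i]
    )


# Hypothetical data without ties, for illustration only.
x = [3, 1, 4, 2, 5]
y = [2, 1, 5, 4, 3]

c, d, t = classify_pairs(x, y)
d_inv = inversions_after_sorting(x, y)
n_pairs = comb(len(x), 2)

tau = (c - d) / n_pairs                    # tau from the pair counts
tau_from_inversions = 1 - 2 * d_inv / n_pairs  # no-ties shortcut
print(c, d, t, tau, tau_from_inversions)
```

Both routes give the same value here because the data contain no ties, so every pair is either concordant or discordant.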

**Property 1**: -1 ≤ *τ* ≤ 1

Proof: This is a result of the fact that there are *C*(*n*, 2) pairings with *C*(*n*, 2) = *C + D + T* where *T* = the number of tied pairs. Thus

$$\tau = \frac{C - D}{C + D + T}$$

*τ* is maximum when *D = T* = 0 and so *τ* = 1. *τ* is minimum when *C = T* = 0 and so *τ* = -1.

**Definition 2**: Unlike Spearman’s rho, there is a commonly accepted measure of standard error for Kendall’s tau (assuming the null hypothesis that *x* and *y* are independent), namely

$$s_\tau = \sqrt{\frac{2(2n+5)}{9n(n-1)}}$$

**Property 2**: For sufficiently large *n* (generally *n* ≥ 10), the following statistic has a standard normal distribution (under the assumption that *x* and *y* are independent):

$$z = \frac{\tau}{s_\tau}$$

**Observation**: Property 2 can be used to test the null hypothesis that *x* and y are independent, i.e. the (population) correlation coefficient is zero.

For smaller values of *n* the table of critical values found in Kendall’s Tau Table can be used. The values of the elements in this table can be found using the following supplemental function.

**Real Statistics Function**: The following function is provided in the Real Statistics Resource Pack:

**TauCRIT**(*n, α, t*) = the critical value of the Kendall’s tau test for samples of size *n*, for the given value of alpha (default .05), and *t* = 1 (one tail) or 2 (two tails), the default.

**Example 1**: Repeat the analysis for Example 1 of One Sample Hypothesis Testing for Correlation using Kendall’s tau (to determine whether longevity is independent of smoking) where the last two data items have been modified as shown in range A3:B18 of Figure 1 (this was done to eliminate any ties).

**Figure 1 – Hypothesis testing for Kendall’s tau**

We begin by sorting the data in range A3:B18 in ascending order by life expectancy putting the results in range D4:E18. This can be done by using Excel’s sort capability (**Data > Sort & Filter|Sort**) or by using the Real Statistics supplemental array function =QSORTRows(A4:B18,1).

Since there are *n* = 15 people in the sample, there are C(15, 2) = 105 pairs of elements. We next calculate how many inversions *D* we have for each of the pairs. Let’s begin by looking at the number of inversions which corresponds to the person in row 8 (i.e. F8).

Since the number of cigarettes smoked by that person is 14 (the value in cell E8), we count the entries in column E below E8 that have value smaller than 14. This is 5 since only the entries in cells E10, E14, E15, E16 and E18 have smaller values. We carry out the same calculation for each of the rows and sum the result to get 80 (the value in cell F19).

This calculation is carried out by putting the formula =COUNTIF(E5:E$19,"<"&E4) in cell F4 (see Built-in Excel Functions for a description of COUNTIF). Next highlight the range F4:F18 and press **Ctrl-D** to copy this formula into all the relevant cells in column F. Cell F8 now contains the formula =COUNTIF(E9:E$19,"<"&E8). This approach works as long as cell E19 is left empty.

We now can calculate the key statistics (column I) as described in Figure 1 where column K lists the formulas that are found in the cells in column I. We see that tau is -.524 (cell I8). Since p-value = .003 < .025 = *α*/2 (two-tail test), the null hypothesis is rejected (that smoking and longevity are independent), and so we conclude there is a negative correlation between smoking and longevity.

We can also establish a 95% confidence interval for tau as follows:

*τ* ± *z*_{crit} ∙ *s*_{τ} = -.524 ± (1.96)(.192) = (-.901, -.147)

Also note that TauCRIT(15,.05,2) = .395. Since *τ _{crit}* = .395 < .524 = |τ| again we can reject the null hypothesis.
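The arithmetic of Example 1 can be checked with a short Python sketch. It uses only the two values stated in the text (*n* = 15 and *D* = 80) rather than the raw data in Figure 1, and it assumes the usual no-ties standard error, sqrt(2(2*n*+5) / (9*n*(*n*–1))), which reproduces the .192 quoted above. The variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n = 15                      # sample size from Example 1
d = 80                      # number of inversions (cell F19)

pairs = comb(n, 2)          # C(15, 2) = 105
c = pairs - d               # no ties, so C = C(n, 2) - D

tau = (c - d) / pairs
se = sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))  # assumed no-ties standard error
z = tau / se

std_normal = NormalDist()
p_one_tail = std_normal.cdf(z)        # lower tail, since tau is negative
z_crit = std_normal.inv_cdf(0.975)    # about 1.96 for a 95% interval
ci = (tau - z_crit * se, tau + z_crit * se)

print(f"tau = {tau:.3f}, s_tau = {se:.3f}, z = {z:.3f}")
print(f"one-tail p = {p_one_tail:.3f}")
print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The one-tail p-value is compared with *α*/2 = .025, which is equivalent to comparing twice that p-value with *α* = .05.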

**Observation**: If there are a large number of ties, then the denominator in the above definition should be replaced by

$$\sqrt{\bigl(C(n, 2) - n_x\bigr)\bigl(C(n, 2) - n_y\bigr)}$$

where *n*_{x} is the number of pairs with a tie in variable *x* and *n*_{y} is the number of pairs with a tie in variable *y*.

The calculation of *n*_{y} is similar to that of *D* given above, namely for each *i*, count the number of *j > i* for which *y*_{i} = *y*_{j}. This sum is *n*_{y}. Calculating *n*_{x} is similar, although potentially easier since the *x*_{i} are in ascending order.

Since in general *C*(*m*, 2) = 1 + 2 + ⋯ + (*m* – 1), it follows that

$$n_x = \sum_i C(t_i, 2) \qquad n_y = \sum_j C(u_j, 2)$$

where *t*_{i} = the number of elements in the *i*th group of ties among the *x* values and *u*_{j} = the number of elements in the *j*th group of ties among the *y* values.
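As a quick check of the group-size formula, the following Python sketch counts tied pairs in two ways: by summing *C*(*t*, 2) over the groups of equal values, and by brute-force comparison of every pair. The helper name `tied_pairs` and the data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import comb


def tied_pairs(values):
    """Number of tied pairs: sum of C(t, 2) over groups of equal values."""
    group_sizes = Counter(values).values()
    return sum(comb(size, 2) for size in group_sizes)


# Hypothetical x values: one group of four ties and one group of two.
x = [70, 72, 78, 78, 78, 78, 80, 80]

n_x = tied_pairs(x)
n_x_brute = sum(1 for a, b in combinations(x, 2) if a == b)
print(n_x, n_x_brute)
```

Values that occur only once contribute *C*(1, 2) = 0, so they do not need to be filtered out.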

**Example 2**: Repeat the analysis for Example 1 using Kendall’s tau for the data in range A3:B18 of Figure 2.

**Figure 2 – Hypothesis testing for Kendall’s tau (with ties)**

As we did in Example 1, we first sort the data, placing the results in range D3:E18. This time we can see that there are ties.

The calculation is similar to that used for Example 1, except that we need to account for the ties. In particular, the formula for inversions (*D*) needs to be modified. E.g. cell F4 now contains the formula =COUNTIFS(E5:E$19,"<"&E4,D5:D$19,">"&D4).

In order to calculate the modified denominator for tau we need to calculate *n*_{x} and *n*_{y}. E.g. the calculation for *n*_{x} is carried out by putting the formula =COUNTIF(D5:D$19,"="&D4) in cell H4. Next highlight the range H4:H18 and press **Ctrl-D** to copy this formula into all the relevant cells in column H. Placing the formula =SUM(H4:H18) in cell H19 yields the value for *n*_{x}.

*n*_{y} (cell I19) can be calculated in a similar way.

We can calculate the value of *C* as the sum of the concordance elements in a fashion similar to that used to calculate *D*.

E.g. cell G4 contains the formula =COUNTIFS(E5:E$19,">"&E4,D5:D$19,">"&D4). Alternatively, we note that *C* = *C*(*n*, 2) – *D* – *T*. Now *C*(*n*, 2) = *C*(15, 2) = 105 (cell M5), *D* = 72 (cell F19) and *T* = *n*_{x} + *n*_{y} – *n*_{x&y} = 7 + 4 – 1 = 10. The number of tied pairs is equal to the number of ties in *x* plus the number of ties in *y* minus the number of ties in both *x* and *y*, *n*_{x&y}. We calculate *n*_{x&y} as the sum of the cells in column J where, for example, cell J4 contains the formula =COUNTIFS(D5:D$19,"="&D4,E5:E$19,"="&E4).

Kendall’s tau (cell M8) is calculated by the formula =(M7-M6)/SQRT((M5-H19)*(M5-I19)).
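The tie-adjusted calculation can be sketched in Python as follows. This is a brute-force illustration of the formula (*C* – *D*) / sqrt((*C*(*n*, 2) – *n*_{x})(*C*(*n*, 2) – *n*_{y})), not a translation of the worksheet; the function name `kendall_tau_b` and the small data set are hypothetical.

```python
from itertools import combinations
from math import comb, sqrt


def kendall_tau_b(x, y):
    """Kendall's tau with the tie-adjusted denominator."""
    c = d = n_x = n_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        if xi == xj:
            n_x += 1            # pair tied in x
        if yi == yj:
            n_y += 1            # pair tied in y
        sign = (xi - xj) * (yi - yj)
        if sign > 0:
            c += 1              # concordant
        elif sign < 0:
            d += 1              # discordant
    pairs = comb(len(x), 2)
    return (c - d) / sqrt((pairs - n_x) * (pairs - n_y))


# Hypothetical data with one tie in x and one tie in y.
x = [1, 2, 2, 3, 4]
y = [1, 3, 2, 2, 5]

tau_b = kendall_tau_b(x, y)
print(round(tau_b, 4))
```

When there are no ties, *n*_{x} = *n*_{y} = 0 and the denominator reduces to *C*(*n*, 2), so the function returns the same value as the original definition.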

**Observation**: If there are a lot of ties we also need to modify the calculation of the standard error as follows:

Thus

**Example 3**: Repeat the analysis for Example 2 using the improved version of the standard error and *z* described above.

We show the analysis in Figure 3.

**Figure 3 – Hypothesis testing for Kendall’s tau: improved version**

*C* and *D* are calculated as before, but this time we handle the ties using the formulas

Column H contains a non-zero value only for those values in column D (the *x* values) which are the first one of a group of ties. This value is one less than the number of ties in that group. Similarly column I handles the ties from column E (the y values). E.g. the value 78 occurs 4 times in column D, the first of these in cell D12, and so cell H12 contains the value 4 – 1 = 3. This is done using the formula

=IF(COUNTIF(D$3:D11,D12)=0,COUNTIF(D13:D$19,D12),0).

Thus there are *C*(4, 2) = 6 pairs with value 78.

Since for any *m*, *C*(*m*, 2) = *m*(*m* – 1)/2, we can calculate the number of ties for *x*, *n*_{x} = Σ *C*(*t*_{i}, 2), by the formula =SUMPRODUCT(H4:H18,H4:H18+1)/2, and similarly for *n*_{y}. In a similar fashion we can calculate the values of all the formulas in the previous observation, as shown in Figure 4.

**Figure 4 – Calculation of standard error**

From *m* we can calculate the standard error (cell L10) and the z-score (cell L12) as shown in Figure 3.
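For readers who want to experiment outside Excel, the following Python sketch computes a tie-corrected z-score. It uses the tie-corrected variance of *S* = *C* – *D* as it is commonly stated for Kendall's tau with ties; it is an assumption that this matches the formulas used in Figures 3 and 4, which are not reproduced in the text. The function name and the tie group sizes in the usage lines are hypothetical.

```python
from math import sqrt


def tie_corrected_z(s, n, x_group_sizes, y_group_sizes):
    """z-score for S = C - D using a commonly stated tie-corrected variance.

    x_group_sizes / y_group_sizes list the sizes of each group of tied
    values in x and y (groups of size 1 may be omitted).
    """
    t, u = x_group_sizes, y_group_sizes
    v0 = n * (n - 1) * (2 * n + 5)
    vt = sum(k * (k - 1) * (2 * k + 5) for k in t)
    vu = sum(k * (k - 1) * (2 * k + 5) for k in u)
    v1 = (sum(k * (k - 1) for k in t) * sum(k * (k - 1) for k in u)
          / (2 * n * (n - 1)))
    v2 = (sum(k * (k - 1) * (k - 2) for k in t)
          * sum(k * (k - 1) * (k - 2) for k in u)
          / (9 * n * (n - 1) * (n - 2)))
    var_s = (v0 - vt - vu) / 18 + v1 + v2
    return s / sqrt(var_s)


# With no ties this reduces to tau / s_tau from Property 2
# (n = 15 and S = C - D = 25 - 80 = -55, as in Example 1).
z_no_ties = tie_corrected_z(-55, 15, [], [])

# Hypothetical tie groups: the variance shrinks, so |z| grows.
z_ties = tie_corrected_z(-55, 15, [4, 2], [2])
print(round(z_no_ties, 3), round(z_ties, 3))
```

Working with *S* directly avoids dividing by the tie-adjusted denominator twice, since *z* = *S* / sqrt(var(*S*)) does not depend on how tau itself is scaled.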

I think c9 should be 9 not 7

and k7 should be I4-I5-I6 not I11

don’t you think so?

LB,

Thanks for catching the mistake in cell C9. There was a mistake in the formula, which I have now corrected. The value should be 9 as you stated. Regarding cell K7 its value was already =I4-I5-I6. Thanks for your help.

Charles

Actually… LB is correct. K7 is shown as "=I4-I5-I11" in Figure 1 and should be "=I4-I5-I6". Take a look.

Thanks GTL,

The formula used in cell I7 was correct, but the description of the formula written in cell K7 was incorrect as you and LB correctly pointed out. I have now corrected this on the website. Thanks for your help.

Charles

Hi Charles,

Can you please generalize the formula to calculate standard error?

Thanking You

Swapna

Hi Swapna,

The formula for standard error is given in cell K10 of Figure 1. I am not sure what you mean by generalizing this formula.

Charles

Thank you Charles.

I am somewhat confused and would appreciate your help. The range of Kendall’s tau is from -1 to +1. In my case, only the results with a negative tau value are significant; the positive values are not. Why is that?

If only negative values show significance in the test, then what is the purpose of calculating both positive and negative rank correlations?

I have analysed 15 different sets of data: 4 give a negative result and 11 a positive one. All 4 negative results are significant, while the other 11 are not. I have tried to explain in detail exactly what problem I am having, so kindly suggest how to solve it.

Best regards

SWAPNA

Swapna,

I don’t really understand your question. Can you send me an example which you believe will help me understand?

Charles

Hi Charles

Thank you so much for your kind attention. By this time, I have come over that problem of confusion and got the expected result which was suppose to be from my data, so no need to worry.

Regards

SWAPNA

Hi Charles,

I think the p-value calculation shown in your first example (Fig. 1, with no ties) should be =NORMSDIST(I11) not F11?

Thanks, KEH

Hi KEH,

Thanks for catching this typo. I have now changed the formula on the webpage as you have suggested.

Charles

Hey Charles,

This is a great resource, do you have a source for the equations you gathered here?

-Joe

Thanks Joe,

The information comes from a few sources, but principally from [Ho] and [IB] in the Bibliography of the website.

Charles