When the requirements for the t-test for two independent samples are not satisfied, the **Wilcoxon Rank-Sum** non-parametric test can often be used provided the two independent samples are drawn from populations with an ordinal distribution.

For this test we use the following null hypothesis:

H_{0}: the observations come from the same population

From a practical point of view, this implies:

H_{0}: if one observation is made at random from each population (call them *x*_{0} and y_{0}), then the probability that *x*_{0} > y_{0} is the same as the probability that *x*_{0} < y_{0}, and so the populations for each sample have the same medians.

We illustrate the technique with the following examples.

**Example 1**: Repeat Example 2 from Two Sample t Test with Unequal Variances to test whether a new hay fever drug is effective, but this time using the data from Figure 1.

**Figure 1 – Data for Example 1**

When we look at the QQ Plot for the Control group we see that it is not very normal, but more concerning is that the Box Plot for the group that took the drug shows that the data is not very symmetric (see Figure 2). We therefore decide to use the Wilcoxon Sign-Rank test instead of the t-test.

The results of the Wilcoxon Rank-Sum test are displayed in Figure 3.

We begin by calculating the ranks of the combined 24 raw scores using the supplemental RANK_AVG function (or the standard RANK.AVG function in Excel 2010). See Ranking for details. E.g., the contents of cell D6 is the rank of the first participant in the Control group, namely RANK_AVG(A6,$A$6:$B$17,1) which is the same as

=RANK(A6,$A$6:$B$17,1) + (COUNTIF($A$6:$B$17,A6)-1)/2.

using the standard Excel 2007 rank function (see Ranking).

We then calculate the sum of the ranks for each group to arrive at the rank sums *R _{1}* = 119.5 and

*R*= 180.5. Since the sample sizes are equal, the value of the test statistic

_{2}*W*= the smaller of

*R*and

_{1}*R*, which for this example means that

_{2}*W*= 119.5 (cell H10).

We next compare *W* with the critical value *W _{crit}*, which can be found in the Wilcoxon Rank-Sum Table. Since the sample sizes are both 12, we look up the critical value in the table for

*α*= .05 (two-tail) where

*n*= 12, and find that

_{1}= n_{2}*W*= 115. This represents the smallest value we could expect to obtain for

_{crit}*W*if the null hypothesis were true. Since

*W*= 119.5 > 115 =

*W*, we cannot reject the null hypothesis, and so conclude there is no significant difference between the effectiveness of the drug and the control.

_{crit}**Example 2**: Repeat Example 1 with the last data element for the group that took the drug removed.

We again use the Wilcoxon Rank-Sum test, but this time the sample sizes are unequal. The test is as in Figure 4.

The rank sums are calculated as in the previous example, although since some of the data may be blank, we need to use a formula such as

=IF(A6<>””,RANK_AVG(A6,$A$6:$B$17,1),””).

Since the sample sizes are different, a bit more care is required. Essentially *W* represents the left tail statistic and so we need to also evaluate the right tail statistic *W′*, which can be obtained by using reverse ranking as described in Figure 5:

**Figure 5 – Calculation of W′ using reverse ranks**

The value of *W′* is therefore the sum of the ranks for the smaller sample, i.e. 105.5. Fortunately, because of symmetry, W’ can more easily be obtained via the formula

where (the smaller sample size) and (the larger sample size). Thus we obtain

*W′* = 11(11+12+1) –158.5 = 105.5 (the value in cell H11)

For the two tailed test, which is what we usually require, we compare the smaller of *W* and *W′ *with *W _{crit}*. To find the value of

*W*, we again use the Wilcoxon Rank-Sum Table for

_{crit}*α*= .05 (two-tail) where

*n*= 11 and

_{1}*n*= 12 to obtain

_{2}*W*= 99. Since min(

_{crit}*W*,

*W′*) = min(158.5,105.5) = 105.5 > 99 =

*W*, once again we cannot reject the null hypothesis.

_{crit}**Observation**: When* n _{1}* =

*n*, then

_{2}*W′*=

*R*, i.e. the rank sum of the larger sample. Thus in Example 1,

_{2}*W′*= 180.5

**Property 1**: Suppose sample 1 has size *n _{1} *and rank sum

*R*and sample 2 has size

_{1}*n*and rank sum

_{2}*R*, then

_{2}*R*+

_{1}*R*=

_{2}*n*(

*n*+1)/2 where

*n*=

*n*+

_{1}*n*.

_{2}**Property 2**: When the two samples are sufficiently large (say of size > 10, although some say 20), then the *W* statistic is approximately normal *N*(*μ, σ*) where

**Observation**: Click here for a proof of Property 1 or 2.

**Observation**: Using Property 2, for samples sufficiently large, we can test *W* using the techniques from Sampling Distributions. Note that the result is the same whether we use *W* or *W′*.

**Observation**: Since it compares rank sums, the Wilcoxon Rank-Sum test is more robust than the t-test as it is less likely to indicate spurious results based on the presence of outliers. Even for large samples where the assumptions for the t-test are met, the Wilcoxon Rank-Sum test is only a little less efficient than the t-test.

**Example 3**: The objective of a study was to determine whether there is a significant difference in the median life expectancy between smokers and non-smokers. 38 smokers and 40 non-smokers were chosen at random and their age at death recorded in Figure 6.

A table of ranks is created and the values of *W* and *W′* are calculated as in Examples 1 and 2. Since the sample sizes are sufficiently large, we can test *W* (or *W′*) using the normal distribution as described in Figure 7.

Since there are fewer smokers than non-smokers, *W* = the rank sum for the smokers = 1227 (cell U8). We calculate the mean (cell U14) and variance (cell U15) for *W* using the formulas =U6*(T6+U6+1)/2 and =U14*T6/6 respectively. The standard deviation (cell U16) is then given by the formula =SQRT(U15) as usual.

We now calculate the p-value (cell U17) using the formula =NORMDIST(U8, U14, U16, TRUE) since *W* < *W̄*. If *W* > *W̄*, as usual we would use the formula =1 – NORMDIST(U8, U14, U16, TRUE). Alternatively, we could have created the z-score and calculated the p-value using NORMSDIST.

Since p-value = .03 < .05 = *α*, we reject the null hypothesis (one tail test) and conclude that there is a significant difference between the life expectancy of smokers and non-smokers.

Note that if we had used *W′* (column T of Figure 7), we would get the same p-value and come to the same conclusion.

**Real Statistics Excel Functions**: The following functions are provided in the Real Statistics Pack:

**RANK_COMBINED**(*x*, R1, R2, *d*) = the ranging of element *x* in the combination of ranges R1 and R2. If *d* = 0 (or is omitted), then the ranking is in decreasing order; otherwise it is in increasing order. The rank is corrected for ties as in RANK.AVG or RANK_AVG (see Ranking).

**RANK_SUM**(R1, R2, *d*) = sum of the ranks of all the elements in range R1 based on the combination of ranges R1 and R2. If *d* = 0 (or is omitted), then the ranking is in decreasing order; otherwise it is in increasing order. Rankings are corrected for ties as in RANK.AVG or RANK_AVG (see Ranking).

**RANK_SUM**(R1, *k*, *d*) = sum of the ranks of all the elements in the *k*th column of range R1. If *d* = 0 (or is omitted), then the ranking is in decreasing order; otherwise it is in increasing order. Rankings are corrected for ties as in RANK.AVG or RANK_AVG (see Ranking).

**WILCOXON**(R1, R2) = minimum of *W* and *W′* for the samples contained in ranges R1 and R2

**WILCOXON**(R1, *n*) = minimum of *W* and *W′* for the samples contained in the first *n* columns of range R1 and the remaining columns of range R1. If the second argument is omitted it defaults to 1.

**WTEST**(R1, R2, *t*) = p-value of the Wilcoxon rank-sum test for the samples contained in ranges R1 and R2; *t* = the # of tails: *t* = 1 (default) or* t* = 2.

**WTEST**(R1, *n, t*) = p-value of the Wilcoxon rank-sum test for the samples contained in the first *n* columns of range R1 and the remaining columns of range R1. If the second argument is omitted it defaults to 1. *t* = the # of tails: *t* = 1 (default) or* t* = 2.

**WCRIT**(*n _{1}*,

*n*,

_{2}*α*,

*t*) = critical value of the Wilcoxon Rank-Sum test for samples of size

*n*and

_{1}*n*, for the given value of alpha and

_{2}*t*= 1 (one tail) or 2 (two tails) based on the Wilcoxon Rank Sum Table.

**WPROB**(*x, n*1*, n*2*, tails, iter*) = an approximate p-value for Wilcoxon rank-sum test *x* (= the minimum of *W* and *W*′) for samples of size *n*1 and *n*2 and *tails* = 1 (one tail) or 2 (two tails, default) based on a linear interpolation of the values in the Wilcoxon Rank Sum Table using *iter* number of iterations (default = 40).

Note that the values for α in the Wilcoxon Rank Sum Table range from .01 to .2 for tails = 2 and .005 to .1 for tails = 1. If the p-value is less than .01 (tails = 2) or .005 (tails = 1) then the p-value is given as 0 and if the p-value is greater than .2 (tails = 2) or .1 (tails = 1) then the p-value is given as 1.

The WILCOXON and WTEST functions ignore any empty or non-numeric cells.

**Caution**: For releases of the Real Statistics Resource Pack prior to Release 2.16.1, *n*1 ≥ *n*2 in WCRIT, while *n*1 ≤ *n*2 in WPROB. Also there are some errors in the calculation of the WPROB function. These shortcomings are eliminated starting with Release 2.16.1.

**Observation**: If R1 represents the first *n* columns of range R and R2 represents the remaining columns in range R, then WILCOXON(R, *n*) = WILCOXON(R1, R2) and WTEST(R, *n*) = WTEST(R1, R2). Of course, WILCOXON(R1, R2) and WTEST(R1, R2) can also be used when the two ranges are not contiguous.

Similarly, if R1 represents the first *n* columns of range R and R2 represents the remaining columns in range R, then RANK_COMBINED(*x*, R1, R2, *d*) = RANK_AVG(*x*, R, *d*). The RANK_COMBINED function is especially useful, however, when R1 and R2 are not contiguous.

**Observation**: In Example 2, we can use the supplemental function to arrive at the same value for the minimum of *W* and *W′*, namely WILCOXON(A6:B17) = 105.5. Also RANK_COMBINED(34, A6:A17, B6:B7, 1) = 2.5, RANK_SUM(A6:A17, B6:B17) = 170.5 and RANK_SUM(B6:B17, A6:A17) = 105.5.

Also WCRIT(H5,I5,H8,H9) = WCRIT(12, 11, .05, 2) = 99 (the value in cell H12 of Figure 4). Finally note that the p-value = WPROB(H11,I5,H5,H9) = WPROB(105.5, 11, 12, 2) = .125 > .05 = *α*, and so once again we can’t reject the null hypothesis.

Similarly in Example 3, we can use the supplemental function to arrive at the same value for the minimum of *W* and *W′*, namely WILCOXON(J6:Q15, 4) = WILCOXON(J6:M15, N6:Q15) = 1227, as well as the same p-value (assuming a normal approximation), namely WTEST(J6:Q15, 4) = WTEST(J6:M15, N6:Q15) = 0.003081. Also RANK_COMBINED(72, J6:M15, N6:Q15, 1) = 37, RANK_SUM(J6:M15, N6:Q15) = 1854 and RANK_SUM(N6:Q15, J6:M15) = 1227

**Observation**: The **effect size** for the Wilcoxon Rank Sum test is given by the correlation coefficient (see Basic Concepts of Correlation). The correlation coefficient for the Wilcoxon Rank Sum test is given by the formula

where the z-score is

For Example 3,

and so

As described in Correlation in Relation to t-test, a rough estimate of effect size is that *r* = *.*5 represents a large effect size, *r* = .3 represents a medium effect size and *r* = .1 represents a small effect. Thus, for Example 3 we have a medium sized effect.

Also see Mann-Whitney Test (including Figure 2) for more information about how to calculate the effect size *r* in Excel.

**Exact Test**

Click here for a description of the exact version of the Wilcoxon Rank-Sum Exact Test using the permutation function.

Sir

At the end of Example 1, you wrote:” Since W = 119.5 > 115 = Wcrit, we cannot reject the null hypothesis, and so conclude there is no significant difference between the effectiveness of the drug and the control.”

Is that right?

Colin,

Yes. It is correct. When W > W-crit you cannot reject the null hypothesis.

Charles

Charles:

Indeed, well explained, but I am still not sure why we cannot reject the null hypothesis (as oppose to t-test) because W = 119.5 and 115 = W-crit. According to your eariler tutorial “Hypothesis Testing”, my understanding is to reject the null hypothesis since W-value is within the critical region.

For this and other non-parametric tests the critical region is the area less than the critical value. You can think of W-crit as the critical value on the left tail.

Charles

Sir

In Real Statistics Excel Functions, when d = o or omitted the ranking is in descending order.

Colin

Colin,

You are correct. I have corrected on the website. Thanks for catching this error.

Charles

First of all, congratulations with your site.

I have a question related to the use of the W score in the Wilcoxon rank sum test.

If you define W as the smallest of R1 and R2, why do you use a two-tailed test and not just a one tailed?

Jean-Pierre,

If n1 = n2, you will get the same test result whether you use R1 or R2. If I remember correctly one should be compared with the left critical value and the other with the right critical value. The smaller one corresponds to the left critical value, which can be compared with the values in the critical values.

This very similar to the t test where negative t value is compared with the left critical value and the positive t value is compared with the right critical value. Given symmetry to do a two-sided test you just pick one side and compare with the t-critical value determined by halving the value of alpha. A similar thing happens in the Wilcoxon Rank Sum test.

Charles

suppose I have two samples with unequal sizes, how can I compare them using with Wilcoxon rank sum?

Kembo,

Examples 2 and 3 on the referenced webpage compare two samples of unequal size. I suggest that you look at these.

Charles