Wilcoxon Rank Sum Test for Independent Samples

When the requirements for the t-test for two independent samples are not satisfied, the Wilcoxon Rank-Sum non-parametric test can often be used provided the two independent samples are drawn from populations with an ordinal distribution.

For this test we use the following null hypothesis:

H0: the observations come from the same population

From a practical point of view, this implies:

H0: if one observation is made at random from each population (call them x0 and y0), then the probability that x0 > y0 is the same as the probability that x0 < y0, and so the populations for each sample have the same medians.

We illustrate the technique with the following examples.

Example 1: Repeat Example 2 from Two Sample t Test with Unequal Variances to test whether a new hay fever drug is effective, but this time using the data from Figure 1.

Figure 1 – Data for Example 1

When we look at the QQ Plot for the Control group we see that the data is not very normal, but more concerning is that the Box Plot for the group that took the drug shows that the data is not very symmetric (see Figure 2). We therefore decide to use the Wilcoxon Rank-Sum test instead of the t-test.

Figure 2 – QQ Plot and Box Plots for data in Example 1

The results of the Wilcoxon Rank-Sum test are displayed in Figure 3.

Figure 3 – Wilcoxon Rank-Sum Test for Example 1

We begin by calculating the ranks of the combined 24 raw scores using the supplemental RANK_AVG function (or the standard RANK.AVG function in Excel 2010). See Ranking for details. E.g., cell D6 contains the rank of the first participant in the Control group, namely =RANK_AVG(A6,\$A\$6:\$B\$17,1), which is the same as

=RANK(A6,\$A\$6:\$B\$17,1) + (COUNTIF(\$A\$6:\$B\$17,A6)-1)/2.

using the standard Excel 2007 rank function (see Ranking).
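The tie correction in the formula above can be illustrated outside Excel. The following is a minimal Python sketch, not the RANK_AVG implementation itself; the function name `rank_avg` and the sample data are hypothetical. It mirrors the formula: a basic rank plus (number of ties - 1)/2.

```python
def rank_avg(x, values, ascending=True):
    """Tie-averaged rank of x within values (a sketch of RANK.AVG-style ranking)."""
    if ascending:
        ahead = sum(1 for v in values if v < x)   # items ranked before x
    else:
        ahead = sum(1 for v in values if v > x)
    ties = sum(1 for v in values if v == x)       # includes x itself
    # The basic rank is ahead + 1; tied items share the mean of the
    # positions they jointly occupy, which adds (ties - 1) / 2.
    return ahead + 1 + (ties - 1) / 2


# Hypothetical scores, not the data from Figure 1
data = [12, 15, 15, 9, 20, 15]
ranks = [rank_avg(v, data) for v in data]
print(ranks)       # the three tied 15s share the average of ranks 3, 4 and 5
print(sum(ranks))  # equals n(n+1)/2 for n = 6
```

Because tied values share the average of the positions they occupy, the ranks still add up to n(n+1)/2.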

We then calculate the sum of the ranks for each group to arrive at the rank sums R1 = 119.5 and R2 = 180.5. Since the sample sizes are equal, the value of the test statistic W = the smaller of R1 and R2, which for this example means that W = 119.5 (cell H10).

We next compare W with the critical value Wcrit, which can be found in the Wilcoxon Rank-Sum Table. Since the sample sizes are both 12, we look up the critical value in the table for α = .05 (two-tail) where n1 = n2 = 12, and find that Wcrit = 115. This represents the smallest value we could expect to obtain for W if the null hypothesis were true. Since W = 119.5 > 115 = Wcrit, we cannot reject the null hypothesis, and so conclude there is no significant difference between the effectiveness of the drug and the control.

Example 2: Repeat Example 1 with the last data element for the group that took the drug removed.

We again use the Wilcoxon Rank-Sum test, but this time the sample sizes are unequal. The test is as in Figure 4.

Figure 4 – Wilcoxon Rank-Sum Test for Example 2

The rank sums are calculated as in the previous example, although since some of the data may be blank, we need to use a formula such as

=IF(A6<>"",RANK_AVG(A6,\$A\$6:\$B\$17,1),"")

Since the sample sizes are different, a bit more care is required. Essentially W represents the left tail statistic and so we need to also evaluate the right tail statistic W′, which can be obtained by using reverse ranking as described in Figure 5:

Figure 5 – Calculation of W′ using reverse ranks

The value of W′ is therefore the sum of the reverse ranks for the smaller sample, i.e. 105.5. Fortunately, because of symmetry, W′ can more easily be obtained via the formula

$$W' = n_1(n_1 + n_2 + 1) - W$$

where n1 = 11 (the smaller sample size) and n2 = 12 (the larger sample size). Thus we obtain

W′ = 11(11+12+1) – 158.5 = 105.5 (the value in cell H11)

For the two-tailed test, which is what we usually require, we compare the smaller of W and W′ with Wcrit. To find the value of Wcrit, we again use the Wilcoxon Rank-Sum Table for α = .05 (two-tail) where n1 = 11 and n2 = 12 to obtain Wcrit = 99. Since min(W,W′) = min(158.5,105.5) = 105.5 > 99 = Wcrit, once again we cannot reject the null hypothesis.
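The relationship between W and W′ can be checked numerically. The following is a minimal Python sketch, not the Real Statistics implementation; the helper names and the two small samples are hypothetical. It shows that summing reverse ranks for the smaller sample agrees with W′ = n1(n1+n2+1) − W.

```python
def avg_ranks(values, reverse=False):
    """Map each distinct value to its tie-averaged rank in the combined data."""
    ordered = sorted(values, reverse=reverse)
    ranks = {}
    for v in set(ordered):
        first = ordered.index(v) + 1        # 1-based position of first occurrence
        ties = ordered.count(v)
        ranks[v] = first + (ties - 1) / 2   # average over the tied positions
    return ranks


def w_prime_from_formula(w, n1, n2):
    """W' obtained by symmetry, where n1 is the size of the smaller sample."""
    return n1 * (n1 + n2 + 1) - w


# Hypothetical data, not the data from Figure 4
small = [3, 8, 5]
large = [1, 4, 8, 10]
combined = small + large

forward = avg_ranks(combined)
backward = avg_ranks(combined, reverse=True)

w = sum(forward[x] for x in small)           # rank sum of the smaller sample
w_reverse = sum(backward[x] for x in small)  # same sample, reverse ranks

print(w, w_reverse, w_prime_from_formula(w, len(small), len(large)))

# Values reported for Example 2: W = 158.5, n1 = 11, n2 = 12
print(w_prime_from_formula(158.5, 11, 12))
```

For the two-tailed test, the smaller of the two printed statistics is the one compared with Wcrit.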

Observation: When n1 = n2, W′ is simply the larger of the two rank sums. Thus in Example 1, W′ = R2 = 180.5.

Property 1: Suppose sample 1 has size n1 and rank sum R1 and sample 2 has size n2 and rank sum R2, then R1 + R2 = n(n+1)/2 where n = n1 + n2.
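Property 1 is easy to verify against the rank sums reported on this page. The short Python sketch below does so for Examples 1 and 3; the function name `total_rank_sum` is hypothetical and is used only for illustration.

```python
def total_rank_sum(n1, n2):
    """Sum of the ranks 1..n for the combined sample, n = n1 + n2."""
    n = n1 + n2
    return n * (n + 1) / 2


# (n1, n2, R1, R2) as reported in the text
examples = {
    "Example 1": (12, 12, 119.5, 180.5),
    "Example 3": (38, 40, 1227, 1854),
}

for name, (n1, n2, r1, r2) in examples.items():
    print(name, r1 + r2, total_rank_sum(n1, n2))
```

The property holds even with ties, since averaging ranks over tied values leaves their total unchanged.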

Property 2: When the two samples are sufficiently large (say of size > 10, although some say 20), then the W statistic is approximately normal N(μ, σ) where

$$\mu = \frac{n_1(n_1+n_2+1)}{2} \qquad \sigma^2 = \frac{n_1 n_2 (n_1+n_2+1)}{12}$$

Observation: Using Property 2, for samples sufficiently large, we can test W using the techniques from Sampling Distributions. Note that the result is the same whether we use W or W′.

Observation: Since it compares rank sums, the Wilcoxon Rank-Sum test is more robust than the t-test as it is less likely to indicate spurious results based on the presence of outliers. Even for large samples where the assumptions for the t-test are met, the Wilcoxon Rank-Sum test is only a little less efficient than the t-test.

Example 3: The objective of a study was to determine whether there is a significant difference in the median life expectancy between smokers and non-smokers. 38 smokers and 40 non-smokers were chosen at random and their age at death recorded in Figure 6.

Figure 6 – Life expectancy for both groups

A table of ranks is created and the values of W and W′ are calculated as in Examples 1 and 2. Since the sample sizes are sufficiently large, we can test W (or W′) using the normal distribution as described in Figure 7.

Figure 7 – Wilcoxon rank-sum test using normal approximation

Since there are fewer smokers than non-smokers, W = the rank sum for the smokers = 1227 (cell U8). We calculate the mean (cell U14) and variance (cell U15) for W using the formulas =U6*(T6+U6+1)/2 and =U14*T6/6 respectively. The standard deviation (cell U16) is then given by the formula =SQRT(U15) as usual.

We now calculate the p-value (cell U17) using the formula =NORMDIST(U8, U14, U16, TRUE) since W < μ. If W > μ, as usual we would use the formula =1-NORMDIST(U8, U14, U16, TRUE). Alternatively, we could have created the z-score and calculated the p-value using NORMSDIST.

Since p-value = .003081 < .05 = α, we reject the null hypothesis (one tail test) and conclude that there is a significant difference between the life expectancy of smokers and non-smokers.
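The normal approximation used in this example can be reproduced without Excel. The following Python sketch is one way to do it, assuming the mean n1(n1+n2+1)/2 and variance n1·n2·(n1+n2+1)/12 with no continuity correction, as in the worksheet formulas above. The function name is hypothetical, and the standard normal CDF is computed with `math.erfc` in place of NORMDIST.

```python
import math


def wilcoxon_normal_approx(w, n1, n2):
    """One-tailed p-value for rank sum w, where n1 is the size of the sample w belongs to."""
    mean = n1 * (n1 + n2 + 1) / 2
    variance = n1 * n2 * (n1 + n2 + 1) / 12
    sd = math.sqrt(variance)
    z = (w - mean) / sd
    cdf = 0.5 * math.erfc(-z / math.sqrt(2))   # standard normal CDF at z
    p_one_tail = cdf if w < mean else 1 - cdf  # use the tail that w falls in
    return mean, variance, z, p_one_tail


# Example 3: 38 smokers (W = 1227) and 40 non-smokers
mean, variance, z, p = wilcoxon_normal_approx(1227, 38, 40)
print(mean, round(variance, 2), round(z, 4), round(p, 6))

# Using W' = n1(n1 + n2 + 1) - W gives the same p-value from the other tail
w_prime = 38 * (38 + 40 + 1) - 1227
print(round(wilcoxon_normal_approx(w_prime, 38, 40)[3], 6))
```

Doubling the one-tailed p-value gives the two-tailed value under the same approximation.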

Note that if we had used W′ (column T of Figure 7), we would get the same p-value and come to the same conclusion.

Real Statistics Excel Functions: The following functions are provided in the Real Statistics Pack:

RANK_COMBINED(x, R1, R2, d) = the ranking of element x in the combination of ranges R1 and R2. If d = 0 (or is omitted), then the ranking is in decreasing order; otherwise it is in increasing order. The rank is corrected for ties as in RANK.AVG or RANK_AVG (see Ranking).

RANK_SUM(R1, R2, d) = sum of the ranks of all the elements in range R1 based on the combination of ranges R1 and R2. If d = 0 (or is omitted), then the ranking is in decreasing order; otherwise it is in increasing order. Rankings are corrected for ties as in RANK.AVG or RANK_AVG (see Ranking).

RANK_SUM(R1, k, d) = sum of the ranks of all the elements in the kth column of range R1. If d = 0 (or is omitted), then the ranking is in decreasing order; otherwise it is in increasing order. Rankings are corrected for ties as in RANK.AVG or RANK_AVG (see Ranking).

WILCOXON(R1, R2) = minimum of W and W′ for the samples contained in ranges R1 and R2

WILCOXON(R1, n) = minimum of W and W′ for the samples contained in the first n columns of range R1 and the remaining columns of range R1. If the second argument is omitted it defaults to 1.

WTEST(R1, R2, tails) = p-value of the Wilcoxon rank-sum test for the samples contained in ranges R1 and R2; tails = the # of tails: 1 (default) or 2.

WTEST(R1, n, tails) = p-value of the Wilcoxon rank-sum test for the samples contained in the first n columns of range R1 and the remaining columns of range R1. If the second argument is omitted it defaults to 1. tails = the # of tails: 1 (default) or 2.

WCRIT(n1, n2, α, tails, h) = critical value of the Wilcoxon Rank-Sum test for samples of size n1 and n2 for the given value of alpha (default α = .05) and tails = 1 (one tail) or 2 (two tails, default) based on the Wilcoxon Rank-Sum Table. If h = TRUE (default) harmonic interpolation is used; otherwise linear interpolation is used.

WPROB(x, n1, n2, tails, iter, h) = an approximate p-value for the Wilcoxon rank-sum test statistic x (= the minimum of W and W′) for samples of size n1 and n2 and tails = 1 (one tail) or 2 (two tails, default) based on a linear interpolation (if h = FALSE) or harmonic interpolation (if h = TRUE, default) of the values in the Wilcoxon Rank-Sum Table using iter number of iterations (default = 40).

Note that the values for α in Wilcoxon Rank Sum Table range from .01 to .2 for tails = 2 and .005 to .1 for tails = 1. If the p-value is less than .01 (tails = 2) or .005 (tails = 1) then the p-value is given as 0 and if the p-value is greater than .2 (tails = 2) or .1 (tails = 1) then the p-value is given as 1.

Any empty or non-numeric cells in R1 or R2 are ignored.
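To make the described behaviour of the ranking functions concrete, here is a rough Python sketch of what RANK_COMBINED and the two-range form of RANK_SUM compute according to the descriptions above. It is not the add-in's code. The lower-case function names and the sample ranges are hypothetical, and blank cells are assumed to be represented by `None` or empty strings.

```python
def _numeric(cells):
    """Keep only numeric cells; blanks (None, '') and text are skipped."""
    return [c for c in cells
            if isinstance(c, (int, float)) and not isinstance(c, bool)]


def rank_combined(x, r1, r2, d=0):
    """Tie-averaged rank of x in the pooled data (x is assumed to occur in the pool)."""
    pool = _numeric(r1) + _numeric(r2)
    if d == 0:
        ahead = sum(1 for v in pool if v > x)  # decreasing order: largest gets rank 1
    else:
        ahead = sum(1 for v in pool if v < x)  # increasing order: smallest gets rank 1
    ties = sum(1 for v in pool if v == x)
    return ahead + 1 + (ties - 1) / 2


def rank_sum(r1, r2, d=0):
    """Sum of the pooled ranks of the numeric elements of r1."""
    return sum(rank_combined(x, r1, r2, d) for x in _numeric(r1))


# Hypothetical ranges containing a blank cell each
r1 = [3, 8, 5, None]
r2 = [1, 4, 8, 10, ""]

print(rank_sum(r1, r2, 1))  # increasing order
print(rank_sum(r1, r2))     # decreasing order (d omitted)
```

With d = 1 the result is the ordinary rank sum of the first range, while omitting d gives the reverse-rank sum.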

Observation: If R1 represents the first n columns of range R and R2 represents the remaining columns in range R, then WILCOXON(R, n) = WILCOXON(R1, R2) and WTEST(R, n) = WTEST(R1, R2). Of course, WILCOXON(R1, R2) and WTEST(R1, R2) can also be used when the two ranges are not contiguous.

Similarly, if R1 represents the first n columns of range R and R2 represents the remaining columns in range R, then RANK_COMBINED(x, R1, R2, d) = RANK_AVG(x, R, d). The RANK_COMBINED function is especially useful, however, when R1 and R2 are not contiguous.

Observation: In Example 2, we can use the supplemental function to arrive at the same value for the minimum of W and W′, namely WILCOXON(A6:B17) = 105.5. Also RANK_COMBINED(34, A6:A17, B6:B17, 1) = 21.5, RANK_SUM(A6:A17, B6:B17) = 170.5 and RANK_SUM(B6:B17, A6:A17) = 105.5.

Also WCRIT(H5,I5,H8,H9) = WCRIT(12, 11, .05, 2) = 99 (the value in cell H12 of Figure 4). Finally note that the p-value = WPROB(H11,I5,H5,H9) = WPROB(105.5, 11, 12, 2) = .125 > .05 = α, and so once again we can’t reject the null hypothesis.

Similarly in Example 3, we can use the WILCOXON function to arrive at the same value for the minimum of W and W′, namely WILCOXON(A6:H15, 4) = WILCOXON(A6:D15, E6:H15) = 1227, as well as the same p-value (assuming a normal approximation), namely WTEST(A6:H15, 4) = WTEST(A6:D15, E6:H15) = 0.003081. Also RANK_COMBINED(72, A6:D15,E6:H15,1) = 37, RANK_SUM(A6:D15,E6:H15,1) = 1854 and RANK_SUM(E6:H15, A6:D15,1) = 1227.

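Observation: The effect size for the Wilcoxon Rank Sum test is given by the correlation coefficient r (see Basic Concepts of Correlation). The correlation coefficient for the Wilcoxon Rank Sum test is given by the formula

$$r = \frac{|z|}{\sqrt{n_1 + n_2}}$$

where the z-score is

$$z = \frac{W - \mu}{\sigma}$$

For Example 3,

$$z = \frac{1227 - 1501}{100.03} = -2.739$$

and so

$$r = \frac{2.739}{\sqrt{78}} = .310$$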

As described in Correlation in Relation to t-test, a rough estimate of effect size is that r = .5 represents a large effect size, r = .3 represents a medium effect size and r = .1 represents a small effect size. Thus, for Example 3 we have a medium sized effect.
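The effect size for Example 3 can be computed in a few lines. The sketch below assumes the commonly used form r = |z|/√(n1+n2), with z taken from the normal approximation without continuity correction; the function name `effect_size_r` is hypothetical.

```python
import math


def effect_size_r(w, n1, n2):
    """Effect size r = |z| / sqrt(n1 + n2), where n1 is the size of the sample w belongs to."""
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return abs(z) / math.sqrt(n1 + n2)


# Example 3: W = 1227 with 38 smokers and 40 non-smokers
r = effect_size_r(1227, 38, 40)
print(round(r, 3))
```

Because the absolute value of z is used, the same r is obtained whether W or W′ is supplied.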

Also see Mann-Whitney Test (including Figure 2) for more information about how to calculate the effect size r in Excel.

Exact Test

See Wilcoxon Rank-Sum Exact Test for a description of the exact version of the test using the permutation function.

53 Responses to Wilcoxon Rank Sum Test for Independent Samples

1. Colin says:

Sir
At the end of Example 1, you wrote: “Since W = 119.5 > 115 = Wcrit, we cannot reject the null hypothesis, and so conclude there is no significant difference between the effectiveness of the drug and the control.”
Is that right?

• Charles says:

Colin,
Yes. It is correct. When W > W-crit you cannot reject the null hypothesis.
Charles

• Tze says:

Charles:

Indeed, well explained, but I am still not sure why we cannot reject the null hypothesis (as opposed to the t-test) when W = 119.5 and W-crit = 115. According to your earlier tutorial “Hypothesis Testing”, my understanding is that we should reject the null hypothesis since the W-value is within the critical region.

• Charles says:

For this and other non-parametric tests the critical region is the area less than the critical value. You can think of W-crit as the critical value on the left tail.
Charles

2. Colin says:

Sir

In Real Statistics Excel Functions, when d = 0 or is omitted the ranking is in descending order.

Colin

• Charles says:

Colin,
You are correct. I have corrected this on the website. Thanks for catching this error.
Charles

3. Jean-Pierre Baeyens says:

First of all, congratulations on your site.

I have a question related to the use of the W score in the Wilcoxon rank sum test.
If you define W as the smallest of R1 and R2, why do you use a two-tailed test and not just a one tailed?

• Charles says:

Jean-Pierre,

If n1 = n2, you will get the same test result whether you use R1 or R2. If I remember correctly one should be compared with the left critical value and the other with the right critical value. The smaller one corresponds to the left critical value, which can be compared with the values in the table of critical values.

This is very similar to the t test, where a negative t value is compared with the left critical value and a positive t value is compared with the right critical value. Given symmetry, to do a two-sided test you just pick one side and compare with the t-critical value determined by halving the value of alpha. A similar thing happens in the Wilcoxon Rank Sum test.

Charles

4. kembo says:

Suppose I have two samples with unequal sizes. How can I compare them using the Wilcoxon rank sum test?

• Charles says:

Kembo,
Examples 2 and 3 on the referenced webpage compare two samples of unequal size. I suggest that you look at these.
Charles

5. Sarah says:

Hi, I am doing a wilcoxon test with two uneven samples. I don’t understand your equation in example 2:

=IF(A6””,RANK_AVG(A6,\$A\$6:\$B\$17,1),””).

What is the “” supposed to indicate.

Sarah

• Charles says:

Sarah,

Text information is surrounded by quote marks in Excel. Thus “London” means the capital of the UK. When the text is empty (i.e. blank) then there is nothing between the quote marks and you see “”

Also the formula is =IF(A6<>"",RANK_AVG(A6,\$A\$6:\$B\$17,1),"")

Charles

6. sar says:

Hi,
Suppose I have two very large samples of several thousand observations each. One sample is a few thousand larger than the other. With uneven samples, I would use the smaller W value, and refer to the critical value of the left tail. If W-smaller sample is larger than the W-critical value, I cannot reject the null hypothesis. Is that correct?

Now let’s say I am using SAS to perform the wilcoxon test. For this wilcoxon test, SAS generates this for a NEGATIVE W value:
pr Z = .00001.
Would this mean that I cannot reject the null hypothesis?

• Charles says:

If W (smaller) < W-crit then you would reject the null hypothesis (at least based on the table of critical values that I have provided on the website). I am not familiar with how SAS performs the test, and so I can't answer your question, although it seems very surprising that SAS would generate a negative value.
Charles

7. sar says:

It seems my message wasn’t uploaded correctly, SAS generates this for the negative W value:

pr less than Z = .00001

8. sar says:

(pr is shorthand for probability)

I should note the Chi Square was significant for this test..

9. Nicolas says:

Charles,

This is brilliant. Thank you for all your effort.

Unfortunately I am having problems with using your functions with array formulas. A typical sample code would look like this.

{=WTEST(IF(\$D\$28:\$D\$30=F\$21,\$C\$28:\$C\$30),IF(\$D\$21:\$D\$27=F\$22,\$C\$21:\$C\$27),2)}

Have you heard of similar problems? Do you know what could cause these problems?

Thank you very much in advance.

Regards,

Nicolas

• Charles says:

Nicolas,

Many of the functions were intended to reference specific ranges and not formulas that output arrays that are equivalent to matrices. I have begun changing these functions so that they work in array formulas of the type that you have described.

I have already revised the WTEST function, although I believe the revised version will be in the next release of the software. It is important to recall that although the formula you have written outputs a single value, it has an embedded array formula and so you must press Ctrl-Shift-Enter for it to work.

Charles

10. Ro says:

Hello, thank you for the website. It has helped a lot in translating a lot of the formulas for these tests to excel.

I was just wondering about the calculation of the variance in example 3. Your formula for variance reads U14*T6/6. I was just wondering where the 6 came from.

• Charles says:

As you can see from the referenced webpage the formula for the variance is n1*n2*(n1+n2+1)/12. But the formula for the mean is n1*(n1+n2+1)/2. Using simple algebra, this means that an alternative formula for the variance is mean*n2/6.
Charles

• Ro says:

Hello again Dr. Charles,

I am in a bit of a predicament as I have some survey data in which I have sampled the same individuals both before and after, but I don’t have any way to link their before and after results to one another (as the survey itself was anonymous). In addition, the before and after groups have different numbers of responses. The data is from Likert items (not scales) so I assume non-parametric tests would be the way to go. My only question is whether it would be appropriate to use the Wilcoxon Rank Sum test even though I cannot assume independent samples. The loss in power would give more conservative results, but I was wondering if another test would be more appropriate.

• Charles says:

I assume that you are trying to see whether there is a significant difference between Before and After. I am not sure how you would test such data since the Wilcoxon Rank Sum test requires independent samples. I can’t think of another test, but frankly I haven’t had enough time to really think too much about the situation that you have described.
Charles

11. Bessie says:

My N1 is only 16, but N2 is 5035. How am I supposed to find alpha then?

• Charles says:

Bessie,

You won’t be able to use the Wilcoxon Rank Sum Table with such a high value for N2. Instead you use the normal approximation, which doesn’t rely on the table, as described in Example 3 of the referenced webpage.

Also the table doesn’t give you alpha. It gives you the critical values.

Charles

• Bessie says:

Thanks !
I am actually still confused here. My N1 data set isn’t normal, and since N2 is so large, we assume it to be normal. My problem is to compare the means of these two data sets to see if they are significantly different from each other.
N2 is actually my population

• Charles says:

Bessie,
The Wilcoxon Rank Sum Test doesn’t compare the two data sets, it compares the ranks of the values in the data set. These will be approximately normally distributed (even if the original data is not normally distributed). If one set is a sample from the second set (i.e. the population), then you are violating the independence assumption of the Wilcoxon Rank Sum Test; in fact the Wilcoxon Rank Sum Test is really testing whether the two data sets come from the same population, which in this case would clearly be true since one of the sets is the population from which the other is derived.
Charles

• Bessie says:

Thanks very much!

12. Ahmed Abbas says:

Dear Dr. Charles,

I have two methods. Each method is tested on 8 samples and for each sample we have Precision, Recall, F-score. The method X has higher average F-score than method Y. However, the difference is small. I am asked to calculate the p-value of the difference.

Is the Wilcoxon rank sum test the correct way, or I should think in another direction?

How to calculate the p-value of the difference? Should I list the array F-score for X and array F-score for Y in Matlab and use the command ranksum?

Thanks a lot

• Charles says:

Ahmed,
I can’t tell from your description what Precision, Recall and F-score represent. Are Precision and Recall the two random variables? Is F-score the F statistic?
I am not familiar with Matlab’s ranksum command and so can’t comment on that.
Charles

13. Bee cee says:

Kindly help with this, it’s very urgent. What study design can be used for the sign test, Wilcoxon signed-rank test, median test and Mann-Whitney test? Thanks in anticipation.

• Charles says:

Please look at the webpages for each of these tests to get the information that you are looking for.
Charles

14. Peter says:

Hello, I am searching for the significance levels of a Wilcoxon rank sum (Mann-Whitney) test. I used Stata to generate the p-values but I am wondering at which level I should say the figures are significant, e.g. 0.01, 0.05 or 0.20? Is there a way I could select the level of significance in Stata?

• Charles says:

Peter,
The significance level really depends on you. It simply states the level of Type I error you are willing to accept for the test. The typical value is .05 (i.e. one type I error every 20 tests). You can set it lower if you like. See Null and Alternative Hypothesis for details.
Charles

15. Parul says:

Dear Sir,

I am using the Wilcoxon rank sum test for my research results. I have results of two algorithms for 30 functions, which means n1 is 30 and n2 is also 30. I calculated the p-value and used a significance level of .05. Now, I want to find which values of n1 (out of 30) are significantly different from n2. If any of the values is significantly different, then which one is better?

Thank you in anticipation.

• Charles says:

Parul,
Sorry, but I don’t understand your question.
Charles

16. narges says:

hi
I have a question in order to modify data by using wilcoxon rank-sum non-parametric rank. suppose I have a rating for 1 parameters which I have nitrate concentrates as well. I am going to modify rating respect to nitrate concentration. How would I be able to modify rating by Wilcoxon test?
for example:
rate nitrate concentration modified rate
4 1.3 ?
5 2 ?
8 18.5 ?

• Charles says:

Sorry, but I don’t know what you mean by “modify data using wilcoxon rank-sum non-parametric rank”.
Charles

17. Alberto M Pendas says:

May I use a Wilcoxon signed-rank test when the variances are not similar between the two groups compared?

• Charles says:

Alberto,
The Wilcoxon Signed Ranks test operates on the differences between the data items and so the variances won’t matter. The situation is different for the Wilcoxon Rank Sum test.
Charles

18. FELIX says:

Hi Charles:

I have a question for example 2 (unequal samples).
n1=12; R1=117,5; R1’=170,5
n2=11; R2=158,5; R2’=105,5
Ws=min(158,5;105,5)=105,5

I don’t know if 158,5 is chosen because it is the bigger value of the left tail or because it is the value of the smaller sample, no matter whether it is the bigger value or not.

Best regards

Felix

• Charles says:

Felix,
The smaller sample is chosen.
Charles

19. jesamae says:

Hi Charles! Do you know about the WILCOXON MANN-WHITNEY TEST, and how to get the U-statistic?

20. Lara Pozzato says:

Hello,
first of all thanks for this very clear explanation!
I do have a “practical” question: I am trying to test big sample sizes, n1=40 and n2=29 and I cannot manage to find a table that gives me the critical values for n>20…. how can I find my critical W for a=0.05 to compare to my W left and W’ right?
Thanks a lot
Kind regards

• Charles says:

Lara,
The largest tables I have seen only go up to n1 = 40 and n2 = 20, but with samples so large you can safely use the normal approximation instead of the tables of critical values. This approach is described on the referenced webpage.
Charles

21. Tariku Zekarias says:

Hi Charles, thanks for the clear explanation of the stages. My question is how to enter those data in SPSS software?

• Charles says:

Tariku,
Sorry, but I don’t use SPSS.
Charles

22. KyleKim says:

Thank you for your excellent website. For a short comment, should it be the Wilcoxon Rank Sum test instead of the Sign-Rank test in the sentence below from the text?
“We therefore decide to use the Wilcoxon Sign-Rank test instead of the t-test.”

• Charles says:

KyleKim,
The Wilcoxon Rank Sum test (or equivalently the Mann-Whitney test) can be used instead of the two independent sample t test, while the Wilcoxon Signed Ranks test is used in place of the paired t test.
Charles

23. HJ says:

Dear Charles,
Thanks for introducing this new test.
In practice, I have one job which requires me to test if there is a drift for runs wkN vs wkN-1. I use the t-test. But I noticed there are cases whereby the runs are fewer than 30, and on top of that, the population is not normally distributed. In this case, can I say the Wilcoxon Rank Sum Test will be more appropriate?
If yes, we first use a QQ plot to validate that the two samples from the past 2 weeks are not normal, i.e. the points do not sit close to the 45 degree line. Second, we construct the WRS Test and compare the W value with Wcritical to conclude whether or not there is a drift (two-tailed since we do not care about the direction)?
Is my understanding and steps to make conclusion correct?

• Charles says:

HJ,
You can still use the t test even when the population is not normally distributed provided the data is not too far from normal, especially if the data is reasonably symmetric.
You should also make sure that the two samples are independent. If not then instead of using the Wilcoxon Rank Sum test (or the Mann-Whitney test, which is equivalent), you should use the Wilcoxon Signed-Ranks test.
Provided you have two independent samples, then what you have stated seems correct.
Charles

24. Francis says:

Please, when you have negative observations in the Wilcoxon rank sum test, how do you go about the ranking?
Something like this (2,0,-1,5,6,1) .

• Charles says:

Francis,
The ranking is done in the same way; the fact that there are negative observations doesn't change anything. (2,0,-1,5,6,1) has ranks (4,2,1,5,6,3) if the lowest value is ranked 1 or (3,5,6,2,1,4) if the highest value is ranked 1.
Charles