The following version of the Shapiro-Wilk Test handles samples between 12 and 5,000 elements, although samples of at least 20 elements are recommended. We also show how to handle samples with more than 5,000 elements.

Assuming that the sample has *n* elements, perform the following steps:

1. Sort the data in ascending order *x*_{1} ≤ … ≤ *x _{n}*

2. Define the values *m*_{1}*, …, m _{n}* by

*m _{i}* = NORMSINV((

*i −*.375)/(

*n*+ .25))

3. Let *M* = [*m _{i}*] be the

*n*× 1 column vector whose elements are these

*m*and let

_{i}If *M* is represented by the *n* × 1 range R1 in Excel, then =SUMSQ(R1) calculates the value *m*.

4. Set *u* = 1/ and define the coefficients *a _{1}*

*, …, a*where

_{n}*a _{i}* =

*m*/ for 2 <

_{i}*i*<

*n*− 1

*a*_{2} = −*a _{n}*

_{-1}

*a*

_{1}= −

*a*

_{n}It turns out that *a _{i}* = −

*a*

_{n-i+}_{1}

*for all*

*i*and that

_{ }

where* A* = [*a _{i}*] is the

*n*× 1 column vector whose elements are the

*a*

_{i}.5. The *W* statistic is now defined by

Because of the above properties of the coefficients *a _{1}*

*, …, a*it turns out that

_{n}*W*= the square of the correlation coefficient between

*a*

_{1}*, …, a*and

_{n}*x*

_{1}*, …, x*. Thus the values of

_{n}*W*are always between 0 and 1.

It also turns out that for values of *n* between 12 and 5,000 the statistic ln (1−*W*) is approximately normally distributed with the following mean and standard deviation:

6. Thus we can test the statistic

using the standard normal distribution. If the p-value ≤ *α* then we reject the null hypothesis that the original data is normally distributed.

**Example 1**: Repeat Example 1 of Shapiro Wilk Original Test using the expanded test.

**Figure 1 – Expanded Shapiro-Wilk Test**

We carry out the calculations described above to get the results shown in Figure 1 (see Figure 2 for key formulas used). The *W* statistic is 0.971066. The p-value = .921649 > .05 = *α* shows that there are no grounds for rejecting the null hypothesis that the data is normally distributed.

**Figure 2 – Key formulas from Figure 1**

**Real Statistics Excel Functions**: The Real Statistics Resource Pack contains the following supplemental functions where R1 consists of only numerical data without headings.

**SHAPIRO**(R1) = the Shapiro-Wilk test statistic *W* for the data in the range R1 using the expanded method

**SWTEST**(R1) = p-value of the Shapiro-Wilk test on the data in R1 using the expanded method

**SWCoeff**(*n, j*) = the *j*th coefficient for samples of size *n*

**SWCoeff**(R1, C1) = the coefficient corresponding to cell C1 within sorted range R1

**SWPROB**(*n , W*) = p-value for the Shapiro-Wilk test for a sample of size *n* and statistic *W*

Note that these functions can optionally take an additional argument *b, *i.e. SHAPIRO(R1, *b*), SWTEST(R1, *b*), SWCoeff(*n, j, b*) and SWCoeff(R1, C1, *b*) and SWPROB(*n, W, b*). When omitted this argument defaults to TRUE (i.e. the values for the expanded Shapiro-Wilk test as described above are used). If *b* is set to FALSE then the values for the original Shapiro-Wilk test are used instead.

**Observation**: The functions SHAPIRO and SWTEST ignore all empty and non-numeric cells. The range R1 in SWCoeff(R1, C1) should not contain any empty or non-numeric cells.

**Example 2**: Determine whether the data in range A3:E14 of Figure 3 is normally distributed using the Shapiro-Wilk’s test.

**Figure 3 – SW Test using Real Statistics formulas**

This time we use the Real Statistics functions described above to obtain the results shown in Figure 3. The value of *W* and the p-value are as indicated using the formulas indicated. Since p-value = 0.019314 < .05 = *α*, we reject the hypothesis that the data is normally distributed. Note that we don’t need to sort the data and the data does not have to be arranged in a column to use the formulas.

If for some reason we want to obtain the coefficients, we need to sort the data. This is done by highlighting the range G3:K14 and entering =QSORT(A3:E14) and simultaneously pressing **Ctrl-Shft-Enter**. The first coefficient is obtained by entering the formula =SWCoeff($G$3:$K$14,G3) in cell M3. If you highlight the range M3:Q14 and press **Ctrl-R** and **Ctrl-D**, all the coefficients will be displayed as shown in Figure 3.

**Observation**: If a sample larger than 5,000, you can randomly divide the larger sample into a number of approximately equal-sized smaller samples and then run the SW algorithm as described above on each sample to obtain the z score for each smaller sample. Suppose that there are *k* such samples with z scores of* z*_{1}*, …, z _{k}.* Recall that if range R1 contains sample

*i*then

*z*= NORMSINV(SWTEST(R1)).

_{i}The average of the z-scores will be an approximation of the z value for the whole sample. The expected mean of *z* is the average of the means of the *z _{i},* namely 0 and the standard deviation of

*z*should be the standard deviation of the

*z*divided by √

_{i}*k*, namely 1/. Thus you should test

*z*using the standard normal distribution.

**Real Analysis Data Analysis Tool**: The **Descriptive Statistics and Normality** supplemental data analysis tool in the Real Statistics Resource pack has a Shapiro-Wilk option. You can use this option to perform the Royston version of the SW Test, as described in Example 3.

**Example 3**: Determine which of the three samples displayed on the left side (range A3:C16) of Figure 4 is normally distributed.

**Figure 4 – Using the Descriptive Statistics and Normality data analysis tool (part 1)**

Enter **Ctrl-m** and select the **Descriptive Statistics and Normality** tool from the dialog box that appears. The dialog box shown in Figure 5 now appears. Choose the **Descriptive Statistics**, **Box Plot** and **Shapiro-Wilk** options as shown and press the **OK** button.

**Figure 5 – Dialog box for Descriptive Statistics and Normality**

The resulting output is shown on the right side of Figure 4 and in Figure 6. As we can see, based on the Shapiro-Wilk test only Sample 2 shows a significant departure from normality (p-value = 0.044 < .05 = *α*).

**Figure 6 – Descriptive Statistics and Normality data analysis tool (part 2)**

This conclusion is supported by the fact that the kurtosis for Sample 2 is high (3.326) and the fact that the Box Plot for Sample 2 is not very symmetric.

Note that choosing the Shapiro-Wilk option also displays the d’Agostino-Pearson test, as described in d’Agostino-Pearson Test.

Hi,

I’m updating my statistical textbook (a Finnish one) and found your site when seeking a simple calculation of Royston’s procedure. Here I found it – if it can be simple…. Thanks for it. I’ll make a link from the book to your page for thise who are interested in working in the Excel environment.I myself have demonstrated practically all the methods first by using Excel and then by SPSS.

In this page, there seems to be a tiny mistake which may be good to update. Namely in the formula of W, there seems to miss the second power of SUM(aixi). In your Excel figure the value is calculated by using the second power but the formula misses it. Am I correct?

JMe

Hello Jari,

I am very pleased that my website has been helpful. I also appreciate your adding a link from your textbook to my site.

Thanks also for catching the typo on the web page. I have now corrected the error.

Charles

Dear Charles,

I’m trying to apply the Shapiro-Wilk expanded Test to my dataset but I don’t understand how to calculate the parameters used to estimate the coefficients a(n), a(n-1),…

Could you help me?

Thank you.

Carole

Hi Carole,

The website offers three choices. If your sample has at most 50 elements you can use the a(n), … coefficients in the original paper of Shapiro and Wilk. A table of these coefficients can be found on the webpage http://www.real-statistics.com/statistics-tables/shapiro-wilk-table/. Note that a(n) = a(1), a(n-1) = a(2), etc. A description of how you use these coefficients can be found on the webpage http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/shapiro-wilk-test/

Alternatively you can calculate the coefficients as described on the webpage http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/shapiro-wilk-expanded-test

The Real Statistics Resource Pack also provides an Excel function called SWCoeff(n, j). This returns the coefficient a(j) for a sample of size n. You can download the Real Statistics Resource Pack for free as described on the webpage http://www.real-statistics.com/free-download/real-statistics-resource-pack/

You can also use the function SWTEST which carries out the Shapiro-Wilk test without having to worry about the actual values of the a(j) coefficients. This function can also be found in the Real Statistics Resource Pack.

Charles

Carole,

I forgot to mention one other important thing. You can look at the actual Excel worksheet that carries out the calculation found in Figure 1 of the webpage http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/shapiro-wilk-expanded-test/. This will tell you exactly how I was able to calculate the values for a(n), … This and all the other examples in the website are contained in the Real Statistics Examples Workbook.

You can download Real Statistics Examples Workbook for free from the webpage http://www.real-statistics.com/free-download/real-statistics-examples-workbook/.

Charles

Hello Mr. Zaionts,

thank you very much for all that great work you did here. I could follow you explanations to the expanded Shapiro Wilk test and I could calculate the values for the first example (the ages statistics). The only thing I couldn’t find is how to actually calculate the p-value for samples with n>50. In the example the p-value is calculated with 0,921649 with W=0,971066. How is the p-value calculated? and what is the z-value for.

Thank you very much!

Joerg

Hello Joerg,

As described in step 5 on the webpage http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/shapiro-wilk-expanded-test/, it turns out that ln(1-W) has a normal distribution with the mean and standard deviation as given in step 5. z is simply the z-score for ln(1-W); i.e. z = STANDARDIZE(ln(1-W),mean,stdev). Thus we can test z using NORMSDIST or NORM.S.DIST, as described in step 6; i.e. p-value = 1-NORMSDIST(z).

For Example 1, z is calculated by the formula =STANDARDIZE(LN(1-Y9),Y10,Y11) and p-value is calculated by the formula =1-NORMSDIST(Y12).

Charles

Hi Charles,

Thanks for the article. To tell you the truth though, I am really struggling to figure out how you are calculating the a coefficients.

Is the value of mn in the coefficient function: an = …, equal to the last item in the M vector (e.g., in this example, is mn equal to 1.63504)?

Also, is mn-1 in the second function, a(n-1) equal to the second to last item in the M vector?

I think it would really help me, and many others who view this already great article, if you would show the exact values that are being plugged into the a(n), a(n-1), e = …, functions. And if you could just please show a complete ai() example, that would make this article even better.

Thank you so much,

Hi MWS,

I will update the webpage to provide a complete example as you suggest. I will do this after I finish Release 2.8 of the Real Statistics Resource Pack which I hope to release in the next few days.

Charles

Sorry to keep bugging you Charles, but could you please show me us a complete example for calculating the first two values and the last two values of the A vector? Anyways, thanks again for the helpful article. Now if only I could get these a values calculated 🙂

No problem. Now that I have finished the latest releases, I plan to update the referenced webpage to provide a more complete explanation. Charles

I have updated the referenced webpage. See Figure 2 for a description of most of the formulas shown in Figure 1, including the formulas for the A vector. Charles

Charles,

Great tools – thanks very much. I have a query on the use of the NORMSDIST function to generate a p-value.

The NORMSDIST function returns a cumulative probability. One minus the NORMSDIST output returns the right sided probability (p for the tail). This is a one-sided test result as I understand it. If you want to obtain a two-sided p-value then we ask for 2*(1-NORMSDIST()).

Is this not more appropriate. Is there some reason why the Shapiro-Wilks test should be based on a on-sided z-test?

Thanks

Nigel

Nigel, sorry, but I don’t know why Shapiro and Wilk designed the test in this way. Charles

Sir

In the last Observation (sample size > 5000), you wrote: zi = NORMSINV(SWTEST(R1)). Why not zi = NORMSINV(1-SWTEST(R1)).

And you said “Thus you should test z/sqrt(k) using the standard normal distribution”, is that a typo? Do you mean z*sqrt(k)

Colin

Suppose i have a values of W= 0.96, W=0.87, W=0.92, W=0.91 and sample of 80 for my ECG data what could be the P value for each W?? Kindly let me the P value by using the Extended shapiro test..

The p-value is calculated from W using the fact that ln(1-W) is approximately normally distributed with mean and standard deviation as given on the referenced webpage just before Example 1. From this fact you should be able to calculate the p-values using NORM.DIST, NORM.S.DIST, NORMDIST or NORMSDIST.

Charles

Hi Charles,

Thanks, I am trying to do this in excel and really appreciate your website. I have some question regarding to formulas.

W: I can not get the value as your excel :0.97XXX, it looks like the formula is wrong. Not sure if current formula of fig 2 is correct.

P value: can I know where I can find putt he formula?

thanks,

Kim

Kim,

I just checked all the formulas in Figure 2 and they seem correct. Which cells do you think have the wrong formulas, and what do you think is the correct formula (or why do you think the formula for W is wrong)? I don’t understand your question about the p value.

Charles

Hey Charles,

I’d like to perform a Shapiro-Wilk expanded test with correction for ties as suggested by Royston (1989). It bassically implies adding the Sheppard correction to the sum of squares about the mean: -(n-1)h²/12. Can you give me a hint how to realise this with your Real Statistics Resource Pack? What makes it even more complicated is that grouping intervals are of unequal size.

All best

Philippe

Philippe,

I am not familiar with this correction factor for ties in the SW test and so don’t have any suggestions for how to do this with the Real Statistics software.

Charles

Thanks a lot, for your wonderful site. This is really the best site, I know of, to lean statistics from basics. at the same time, your addin is too good. Congratulations for painstakingly carrying out this great service !

Two, inquiries:

1. Here, I am not able to find out the equation for the calculation of p-value.

2. In my laptop I could use the add in, but in office computer (windows 7, MSoffice 2007), I am not able to use your addin, the message comes is as follows: “can’t find project or library” in a visual basic windows. Can you please help !

Once again, I express my sincerest thanks, Sir. Regards!

1. The formula for the p-value is =1-NORMSDIST(Y12) (see Figure 1 and 2)

2. Which version of Excel are you using on your office computer? Is it the same version of Excel as you are using on your laptop?

Charles

Hello – I also appreciate your excellent instruction. However using =SHAPIRO(R1,TRUE) [ver 3.5.1] for the data in in Figure 1 – Expanded Shapiro-Wilk Test, I’m unable to duplicate the W of 0.971066 worked out in that table. I get W = 0.9711225. Could you please show me my error? Thanks again for this resource!

Wade,

Thanks very much for your comment. It turns out that there is an error in the =SHAPIRO(R1,TRUE) formula. The value for W of 0.971066 is correct. In the calculation of the an coefficient I used 0.22157u instead of 0.221157u. As you can see the difference is quite small, but in any case I will correct this error in the next release of the software.

Charles

Wade,

I have now corrected the error in the Shapiro-Wilk capabilities. This is included in the latest release of the Real Statistics Resource Pack, Release 3.6.1, which was issued today.

Charles

Sir Charles,

Thank you for this tool. It will help me a lot in my research. All credits to you sir. This will significantly reduce my computations.

Again, thanks a lot. More power to you sir.

Jem

i am struggling on how to interpolate the data .can someone please help me and show me the clear formula.because I know how to calculate the value of W .

If you are using the expanded version version of the SW test, you shouldn’t need to interpolate. If you are using the original SW test, then you might indeed need to interpolate, although the Real Statistics function will do this for you. In any case, to get more information about how to interpolate see the webpage http://www.real-statistics.com/excel-capabilities/table-lookup/.

Charles

Hi Dr. Zaiontz,

By chance do you have a reference on where the values for an and an-1 came from:

a_n=-2.706056u^5+4.434685u^4-2.071190u^3-0.147981u^2+0.221157u+m_n m^(-0.5)

a_(n-1)=〖-3.582633u〗^5+〖5.682633u〗^4 〖-1.752461u〗^3+〖-0.293762u〗^2+0.042981u+m_(n-1) m^(-0.5)

Thank you so much

RobS

Hi RobS,

The reference is Royston’s paper. See Bibliography for details.

Charles

Hello Charles,

I find your website very interesting. Thank you.

When trying to follow your lead, I encounter that I would have used =NORMDIST(z;mu;sigma;1). If I just write =NORMDIST() I assume that it is a N(0,1) which is not the case but that’s the way I obtain your p-value.

What am I misunderstanding?

Thank you very much

Gaussiano,

The formula =NORMDIST() is not valid and will result in an error. I used =NORMSDIST(x) which is N(0,1).

Charles

Dear Charles

Thank you for a very useful add-on. However when I try to use e.g. a test for normality I get an error message: Compile error in hidden module: Analysis. I get an error when I use any of the functions.

Can you help?

Many thanks

Philip

Dear Philip,

In order to help you I need some additional information, namely:

1. Does the function =VER() work? If not, then you probably haven’t installed the software as indicated on the website. When you press Alt-TI you should see RealStats in the list that is displayed with a check mark next to it. If not, you need to click on the Browse button and locate where you have stored the file containing the Real Statistics add-in that you downloaded from the website.

2. If this is not the problem, then please tell me what is displayed when you use the =VER() function and what release of Excel you are using (Excel 2007, Excel 2010, etc.).

Charles

I would like to thank you for all the support I found here during the development of my final work (I’m student yet). Never stop your work, it’s very helpful and you write clear and understandable.

Regards,

André Vendramini

Thank you very much André for your support. I am glad that you found the website helpful.

Charles

I am looking for a version of the Shapiro & Wilk tables (coefficients and p-values) that I can easil paste in to my spreadshhet. Thanks, Kevin.

Kevin,

You might want to use the version described on the following webpage:

http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/shapiro-wilk-test/

You can use the version on the examples workbook, which you can download for free.

Charles

Hi!

First of all, thanks for your work. I’ve used your webpage several times to understand some statistics.

I have one doubt here, the .375 and .25 numbers used in “mi = NORMSINV((i − .375)/(n + .25))” are totally independent of the data?

I have a problem with that argument because my data has negative values, so excel can’t return anything in that formula, given that I’m entering a negative probability.

Could you help me with that? Thank you!

Alvaro,

If your data values are say -3, 7, 10, -1, -5, then then data needs to be arranged in increasing order as -5, -3, -1, 7, 10.

When calculating mi, it is important to note that i is not the data value but its index. Thus -5 corresponds to 1, -3 corresponds to 2, etc. These indices are all positive integers.

Charles

Thank you so much! I was using the actual data, instead of the index. My fault.

Hi Charles,

Thank you for your time addressing the comments posted on here. It is much appreciated.

I have managed to expand a spreadsheet to hold and statistically analyse up to 5000 data points(from 200), though I still cant get the w-test to calculate correctly. Are you able to help or are you able to put in my touch with someone who can? I believe it would be a simple fix but I am baffled.

Eliza,

If you send me an Excel file with your data and analysis, I will try figure out what is going wrong.

You can find my email address at Contact Us.

Charles

Thank you so much Charles.

I am sending through an email now.

Hello,

I am working with radar data that was taken to analyse flooding. I use GIS to determine water flow and accumulation. I have made a histogram and QQ plot of the radar elevations using GIS, and elevations are not normally distributed. The graphics are great, but I also need statistics to show the abnormal distribution. The radar data contains 9,853,404 elevation points. How can I use stats to show the elevations values are not normally distributed?

Jackie,

You can use the Shapiro-Wilk test as described on the referenced webpage.

Charles

Hello Charles,

First thanks a lot for this very clear and helpful document! I just had a question, if I wanted to check whether my data follows a log-normal law, which formulas would I have to change? I was wondering if it would work by just taking the log of my datas and then applying the same formulas as before… Thanks in advance for your reply!

Fiona,

Yes, that approach should work fine.

Charles

Thanks for the quick reply Charles!

Hey Charles,

Could the expanded Shapiro-Wilk test also be applicable to less than 12 samples? Or do I have to apply the original Shapiro-Wilk test?

Please let me know! Thanks.

David, the expanded Shapiro-Wilk test is applicable to less than 12 samples. It requires a minimum of 3 elements.

Charles

Dear Charles,

First I want to say that you have an amazing site, I have no words to describe the superb content. Thank you Charles 🙂

I tried to calculate the coefficients a(1), …, a(n) {actually we use only a(1),…, a (n/2)}

a(n) is positive.

Did you write above that a1=-a(n), which say a(1) is negative?

Also a(i)=m(i)/sqrt(eps) is negative.

But in the Shapiro-Wilk Tables, I can see all a(i) values are positive?

Another question, the approximating support slightly difference results.

For example for n=6: a(1), a(2), a(3)

Shapiro-Wilks: (0.6431, 0.2806, 0.0875

Approximating: ( -0.64397, -0.28071, -0.08825)

What a(i) values is better to use for i=1 to 50?

I understand that Shapiro-Wilks values are also not accurate.

Is the approximating you describe above more accurate the original Shapiro-Wilks values?

Thanks

Oren,

1. a(1) = -a(n) doesn’t mean that a(1) is negative. E.g. if a(n) is -0.3, then a(1) is 0.3, which is positive. In any case, half the coefficients are positive and the other half are the negative of the positive values.

2. When you say approximate value, then I assume that you are referring to the Royston approximation. Probably the original Shapiro-Wilk values are more accurate for values in the table.

Charles

Hi Charles,

When I use the Royston approximating you mention above I get negative amounts instead of positive numbers in SW table. [ n=1 to 50. table a(1)-a(25) ]

Do you have any idea?

Thanks,

a(i) = m(i)/sqrt(eps)

1. I assume sqrt(eps) is positive.

2. SW table contains only the first half – m(1) to m(n/2): which are on the left side of the normal distribution, so must be negative

3. positive/negative => Negative

example for n=6:

i=3 => (i-0.375)/(n+0.25)=(3-0.75)/(6+0.25)=0.42

m(3)=NORM.S.INV(0.42)=-0.20189

Method: a(1), a(2), a(3)

Shapiro-Wilks: 0.6431, 0.2806, 0.0875

Approximating: -0.64397, -0.28071, -0.08825

Oren,

Yes, I just checked and the coefficients using the Royston version of the test are the negative of those using the original version. E.g.

data coeff

34 -0.605060857

56 -0.316302802

78 -0.175109665

99 -0.056484528

111 0.056484528

189 0.175109665

211 0.316302802

290 0.605060857

Here the first coefficient (in cell B2) is calculated using the formula =SWCoeff($A$2:$A$9,A2), assuming that the data is in range A2:A9 and the coefficients are in range B2:B9

Charles