Shapiro-Wilk Expanded Test

The following version of the Shapiro-Wilk Test handles samples between 12 and 5,000 elements, although samples of at least 20 elements are recommended. We also show how to handle samples with more than 5,000 elements.

Assuming that the sample has n elements, perform the following steps:

1. Sort the data in ascending order x1 ≤ … ≤ xn

2. Define the values m1, …, mn by

mi = NORMSINV((i − .375)/(+ .25))

3. Let M = [mi] be the n × 1 column vector whose elements are these mi and let

image8101

If M is represented by the n × 1 range R1 in Excel, then =SUMSQ(R1) calculates the value m.

4. Set u = 1/\sqrt{n} and define the coefficients a1, …, an  where

image8102

image8103

ai = mi /\sqrt{\epsilon} for 2 < in − 1

a2 = −an-1        a1 = −an

where
image8104

It turns out that ai = −an-i+1  for all i and that

image8105

where A = [ai] is the n × 1 column vector whose elements are the ai.

5. The W statistic is now defined by

Shapiro Wilk W

Because of the above properties of the coefficients a1, …, an it turns out that W = the square of the correlation coefficient between a1, …, an and x1, …, xn. Thus the values of W are always between 0 and 1.

It also turns out that for values of n between 12 and 5,000 the statistic ln (1−W) is approximately normally distributed with the following mean and standard deviation:

image8107

image8108

6. Thus we can test the statistic

image8109

using the standard normal distribution. If the p-value ≤ α then we reject the null hypothesis that the original data is normally distributed.

Example 1: Repeat Example 1 of Shapiro Wilk Original Test using the expanded test.

Shapiro Wilk Excel Royston

Figure 1 – Expanded Shapiro-Wilk Test

We carry out the calculations described above to get the results shown in Figure 1 (see Figure 2 for key formulas used). The W statistic is 0.971066. The p-value = .921649 > .05 = α shows that there are no grounds for rejecting the null hypothesis that the data is normally distributed.

Shapiro-Wilk formula listFigure 2 – Key formulas from Figure 1

Real Statistics Excel Functions: The Real Statistics Resource Pack contains the following supplemental functions where R1 consists of only numerical data without headings.

SHAPIRO(R1) = the Shapiro-Wilk test statistic W for the data in the range R1 using the expanded method

SWTEST(R1) = p-value of the Shapiro-Wilk test on the data in R1 using the expanded method

SWCoeff(n, j) = the jth coefficient for samples of size n

SWCoeff(R1, C1) = the coefficient corresponding to cell C1 within sorted range R1.

Note that these functions can optionally take an additional argument b, i.e. SHAPIRO(R1, b), SWTEST(R1, b), SWCoeff(n, j, b) and SWCoeff(R1, C1, b). When omitted this argument defaults to True (i.e. the values for the expanded Shapiro-Wilk test as described above are used). If b is set to False then the values for the original Shapiro-Wilk test are used instead.

Observation: The functions SHAPIRO and SWTEST ignore all empty and non-numeric cells. The range R1 in SWCoeff(R1, C1) should not contain any empty or non-numeric cells.

Example 2: Determine whether the data in range A3:E14 of Figure 3 is normally distributed using the Shapiro-Wilk’s test.

Shapiro Wilk Excel normality

Figure 3 – SW Test using supplemental formulas

This time we use the supplemental functions described above to obtain the results shown in Figure 3. The value of W and the p-value are as indicated using the formulas indicated. Since p-value = 0.019314 < .05 = α, we reject the hypothesis that the data is normally distributed. Note that we don’t need to sort the data and the data does not have to be arranged in a column to use the formulas.

If for some reason we want to obtain the coefficients, we need to sort the data. This is done by highlighting the range G3:K14 and entering =QSORT(A3:E14) and simultaneously pressing Ctrl-Shft-Enter. The first coefficient is obtained by entering the formula =SWCoeff($G$3:$K$14,G3) in cell M3. If you highlight the range M3:Q14 and press Ctrl-R and Ctrl-D, all the coefficients will be displayed as shown in Figure 3.

Observation: If a sample larger than 5,000, you can randomly divide the larger sample into a number of approximately equal-sized smaller samples and then run the SW algorithm as described above on each sample to obtain the z score for each smaller sample. Suppose that there are k such samples with z scores of z1, …, zk. Recall that if range R1 contains sample i then  zi = NORMSINV(SWTEST(R1)).

The average of the z-scores will be an approximation of the z value for the whole sample. The expected mean of z is the average of the means of the zi, namely 0 and the standard deviation of z should be the standard deviation of the zi divided by √k, namely 1/\sqrt{k}. Thus you should test z\sqrt{k} using the standard normal distribution.

Real Analysis Data Analysis Tool: The Descriptive Statistics and Normality supplemental data analysis tool in the Real Statistics Resource pack has a Shapiro-Wilk option. You can use this option to perform the Royston version of the SW Test, as described in Example 3.

Example 3: Determine which of the three samples displayed on the left side (range A3:C16)  of Figure 4 is normally distributed.

Shapiro-Wilk data analysis

Figure 4 – Using the Descriptive Statistics and Normality data analysis tool

Enter Ctrl-m and select the Descriptive Statistics and Normality tool from the dialog box that appears. The dialog box shown in Figure 5 now appears. Choose the Descriptive Statistics, Box Plot and Shapiro-Wilk options as shown.

Shapiro-Wilk dialog box

Figure 5  Dialog box for Descriptive Statistics and Normality

The resulting output is shown on the right side of Figure 4. As we can see, based on the Shapiro-Wilk test only Sample 2 shows a significant departure from normality (p-value = 0.044 < .05 = α).

This conclusion is supported by the fact that the kurtosis for Sample 2 is high (3.326) and the fact that the Box Plot for Sample 2 is not very symmetric.

Note that at present, the SW Test only produces the correct answer if there are no blank or non-numeric cells in the data.

15 Responses to Shapiro-Wilk Expanded Test

  1. Jari Metsämuuronen says:

    Hi,

    I’m updating my statistical textbook (a Finnish one) and found your site when seeking a simple calculation of Royston’s procedure. Here I found it – if it can be simple…. Thanks for it. I’ll make a link from the book to your page for thise who are interested in working in the Excel environment.I myself have demonstrated practically all the methods first by using Excel and then by SPSS.

    In this page, there seems to be a tiny mistake which may be good to update. Namely in the formula of W, there seems to miss the second power of SUM(aixi). In your Excel figure the value is calculated by using the second power but the formula misses it. Am I correct?

    JMe

    • Charles says:

      Hello Jari,

      I am very pleased that my website has been helpful. I also appreciate your adding a link from your textbook to my site.

      Thanks also for catching the typo on the web page. I have now corrected the error.

      Charles

  2. Carole says:

    Dear Charles,

    I’m trying to apply the Shapiro-Wilk expanded Test to my dataset but I don’t understand how to calculate the parameters used to estimate the coefficients a(n), a(n-1),…

    Could you help me?

    Thank you.

    Carole

  3. Joerg Schmidt says:

    Hello Mr. Zaionts,

    thank you very much for all that great work you did here. I could follow you explanations to the expanded Shapiro Wilk test and I could calculate the values for the first example (the ages statistics). The only thing I couldn’t find is how to actually calculate the p-value for samples with n>50. In the example the p-value is calculated with 0,921649 with W=0,971066. How is the p-value calculated? and what is the z-value for.

    Thank you very much!
    Joerg

  4. MWS says:

    Hi Charles,

    Thanks for the article. To tell you the truth though, I am really struggling to figure out how you are calculating the a coefficients.

    Is the value of mn in the coefficient function: an = …, equal to the last item in the M vector (e.g., in this example, is mn equal to 1.63504)?

    Also, is mn-1 in the second function, a(n-1) equal to the second to last item in the M vector?

    I think it would really help me, and many others who view this already great article, if you would show the exact values that are being plugged into the a(n), a(n-1), e = …, functions. And if you could just please show a complete ai() example, that would make this article even better.

    Thank you so much,

    • Charles says:

      Hi MWS,
      I will update the webpage to provide a complete example as you suggest. I will do this after I finish Release 2.8 of the Real Statistics Resource Pack which I hope to release in the next few days.
      Charles

  5. MWS says:

    Sorry to keep bugging you Charles, but could you please show me us a complete example for calculating the first two values and the last two values of the A vector? Anyways, thanks again for the helpful article. Now if only I could get these a values calculated :)

    • Charles says:

      No problem. Now that I have finished the latest releases, I plan to update the referenced webpage to provide a more complete explanation. Charles

    • Charles says:

      I have updated the referenced webpage. See Figure 2 for a description of most of the formulas shown in Figure 1, including the formulas for the A vector. Charles

  6. Charles,
    Great tools – thanks very much. I have a query on the use of the NORMSDIST function to generate a p-value.
    The NORMSDIST function returns a cumulative probability. One minus the NORMSDIST output returns the right sided probability (p for the tail). This is a one-sided test result as I understand it. If you want to obtain a two-sided p-value then we ask for 2*(1-NORMSDIST()).

    Is this not more appropriate. Is there some reason why the Shapiro-Wilks test should be based on a on-sided z-test?
    Thanks
    Nigel

  7. Colin says:

    Sir

    In the last Observation (sample size > 5000), you wrote: zi = NORMSINV(SWTEST(R1)). Why not zi = NORMSINV(1-SWTEST(R1)).
    And you said “Thus you should test z/sqrt(k) using the standard normal distribution”, is that a typo? Do you mean z*sqrt(k)

    Colin

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>