**Histogram**

A histogram can be used to test whether data is normally distributed. This test simply consists of looking at the histogram and discerning whether it approximates the bell curve of a normal distribution.

**Example 1**: Determine whether the data in column B of Figure 1 are normally distributed using a histogram.

**Figure 1 – Testing for normality using a histogram**

The sample contains 20 data elements. To make sure that the intervals in the histogram are equal and consistent, we first standardize the data points (in column C) as in Expectation. E.g. the formula in cell C4 is =STANDARDIZE(B4,B24,B25). Choosing bins from -2 to 2 standard deviations, we create a histogram as described in Histograms.

As you can see from Figure 1, the histogram doesn’t look particularly normal.

**QQ Plot**

A PP plot (**point-point plot**) is simply a scatter diagram comparing two samples of the same size. The more similar the underlying distributions, the more closely the scatter points will conform to a line with slope 1. If the data is standardized then the scatter points would be close to the line y* = x.*

We also use PP plots to compare a data set with a distribution. If the distribution has cdf *F*(*x*) and the data set has elements *x*_{1}, …, *x _{n}* in ascending order, then the PP plot is the scatter diagram of the set {

*F*(

*x*

_{1}), …,

*F*(

*x*)} versus the set {1/2

_{n}*n*, 3/2

*n*, …, 1−1/2

*n*}. Here the second set is an attempt to divide the interval between 0 and

*n*into

*n*evenly spaced intervals (except for the first and last elements which are half the length).

A **QQ plot** (**quantile-quantile plot**) is also used to compare a data set with a distribution, and consists of a scatter diagram of the data set {*x*_{1}, …, *x _{n}*} in ascending order with the values {

*F*

^{-1}(1/2

*n*),

*F*

^{-1}(3/2

*n*)

*,*…,

*F*

^{-1}(1−1/2

*n*)}. Here the

*i*th value

*F*

^{-1}(

*i*/

*n*−1/2

*n*) in the second set is the inverse of the cdf at

*i*/

*n*−1/2

*n*(these are the

**quantiles**).

As for the PP plots, if the points on the scatter plot align with the diagonal line y = *x* then the data set conforms with the distribution.

When using a QQ plot to see whether a data set is normally distributed, you create a scatter diagram between range R1 consisting of the elements *x*_{1}, …, *x _{n}* in ascending order and R2 consisting of the values NORM.INV(1/2

*n*,

*x̄*,

*s*), …, NORM.INV(1−1/2

*n*,

*x̄*,

*s*), where

*x̄*= AVERAGE(R1) and

*s*= STDEV(R1).

Alternatively, you can create a scatter diagram between range R1 consisting of the standardized elements *z*_{1}, …, z* _{n}*, where each

*z*= STANDARDIZE(

_{i}*x*,

_{i}*x̄*,

*s*), and R2 consisting of the values NORM.S.INV(1/2

*n*), …, NORM.S.INV(1−1/2

*n*).

Commonly, the QQ plot is used much more often than the PP plot. PP plots tend to magnify deviations from the distribution in the center, QQ plots tend to magnify deviation in the tails.

**Example 2**: Using a QQ plot determine whether the data set with 8 elements {-5.2, -3.9, -2.1, 0.2, 1.1, 2.7, 4.9, 5.3} is normally distributed.

The mean of this data set is .375 and the standard deviation is 3.89. If the data set is normally distributed then for any value *x*, the cumulative distribution at *x* would be given by

*F*(*x*) = NORMDIST(*x*, .375, 3.89)

We now split the interval (-∞, ∞) into 8 sub-intervals (-∞, y_{1} ), (y_{1}, y_{2}), …, (y_{7}, y_{8}), (y_{8}, ∞) such that the area under the standard normal curve for the 2nd through 7th intervals are equal and the area under the curve of the first and last intervals are half the middle intervals. This is equivalent to finding points *z _{1}, z_{3}, z_{5}, z_{7}, z_{9}, z_{11}, z_{13}* and

*z*such that

_{15}*z*= NORMSINV(

_{i}*i*/16). Thus y

*=*

_{i}*z*. If the original data is normally distributed then

_{2i-1}*F*(*x _{i}*) = NORMSINV((2

*i–*1)/16).

We summarize this approach in Figure 2, where we have also standardized the original data so that it is easier to compare the standardized data with the standard normal approximation for each data point (under the assumption that the original data was normally distributed). Finally we have included a scatter diagram (the QQ plot) of the data vs. the standardized normal data.

**Figure 2 – Using a QQ plot to test for normality**

Cell E5 contains the Excel formula =2*COUNT(A4:A11), while cell E10 contains the formula =NORMSINV(C10/E5) and cell F10 contains =STANDARDIZE(D10,D$6,D7), and similarly for the other cells in columns E and F.

We can see that the data pretty well fits with the trend line, which is a good indicator that the original data is roughly normal. In fact, if the original data is normally distributed, then when the standardized data is plotted against the standard normal values the trend line should be the line .

**Real Statistics Data Analysis Tool**: The **Descriptive Statistics and Normality** data analysis tool contained in the Real Statistics Resource Pack allows you to create QQ plots automatically. We illustrate the use of the **QQ Plot** data analysis tool in the following example.

**Example 3**: Determine whether the data in Example 1 is normal by using a QQ plot.

The data is repeated in range A3:A23 of Figure 3. To run the analysis press **Ctrl-m** and select the **Descriptive Statistics and Normality** option. Fill in the dialog box that appears as shown in Figure 3, choosing the **QQ Plot** option, and press the **OK** button.

**Figure 3 – QQ Plot dialog box**

The output is shown in Figure 4

This time you can see that the data is not quite so normal.

**Box Plots**

While box plots can’t actually be used to test for normality, they can be useful for testing for symmetry, which often is a sufficient substitute for normality.

**Example 4**: Use a box plot to gain more evidence as to whether the data in Example 1 is normally distributed.

To produce the Box Plot, press **Ctrl-m** and select the **Descriptive Statistics and Normality** option. Fill in the dialog box that appears as shown in Figure 3, choosing the **Box Plot** option instead of (or in addition to) the QQ Plot option, and press the **OK** button. The output is shown in Figure 5.

As we can see from Figure 5, the data is relatively symmetric, and so although as we saw in Example 1 and 3, the data is probably not normally distributed, it does appear to be relatively symmetric, which is sufficient for some of the tests that we would like to use.

this formula:

F10 contains the formula =STANDARDIZE(B4,B$15,B$16),

isn’t right, because your B cells are empty

Thanks Afke for catching this error. I must have changed the worksheet and forgot to change the reference on the webpage. The correct formula is =STANDARDIZE(D10,D$6,D$7). I have now corrected this on the referenced webpage. Charles

Normally I use spss for statistic work, but the company i do my internship don’t have spps. So I must do it with Excel. So I found this helpful website.

I am not sure what I am mising, but when drawing the standard normal, I do not get a straight line – using excel. I am assuming that the entire point is to standardize the actual data and plot it over what a normal distribution (with same mean and SD) would look like. Am I wrong?

Are you trying to create a QQ Plot as in Figure 4 of the referenced webpage? If so you should get a straight line when using data from a standard normal distribution. If you send me your data I will try to see what went wrong.

Charles

Sure Charles. can you please provide me with your email address?

You can find it when you click on the Contact Us menu.

Charles

I’m trying to plot a QQ plot but I don’t understand how the interval is caluclated in Example 2, Figure 2?

If the sample has n elements then the intervals are based on the endpoints 1, 3, 5, …, 2n-1; i.e. the odd numbers from 1 to 1 less than 2n.

Charles

I’m very happy to learning from your website, amazing explanation, easy to understand. Great website!

I have a problem about this statement:

A QQ plot (quantile-quantile plot) is a PP plot where the samples points are equally spaced. Given a sample of size n which is normally distributed, if we sort the data elements in ascending order then “the area under the normal curve between each point will be equal.”

the area under the normal curve between each point will be equal, is that means the probability between each point will be the same? I’m try to calculate the area between each point with different between F(x) and can’t get equal values. Am I wrong?

Yes, these paragraphs have not been expressed properly. I have just updated the text to express the concepts more clearly and more accurately

Thanks for identifying this problem.

Charles

While judging the QQ plot, what is the criteria that it is hugging the diagonal reasonably well. Is it only visual or there is a mathematical treatment? I have a variable with more than 1000 data values. Its QQ plot appears reasonably well (to me) but the Shapiro Will test declares it to be non-normal. Will be grateful for guidance.

Pramod,

As far as I am aware of, use of a QQ plot is based on visual inspection. I find Shapiro-Wilk to be the most useful test for normality, but its results should agree reasonably well with the QQ plot.

If you send me an Excel file with your data and test results, I will take a look at it to see why the results are in disagreement. You can find my email address at Contact Us.

Charles

I am extremely sorry I did not see this reply from you and hence had raised a fresh query on the same subject. It is very kind of you to have explained the subject and offered to have a look at my data. I will forward the same.

You have created a great and very useful website on the subject.

Pramod,

Thank you for sending me your data. Let me make the following observations:

1. You are testing for the normality of three samples. The Shapiro-Wilk result is for each of the three samples, while the QQ plot that you have created combines all the data into one sample. You need to create three separate QQ plots. The result won’t be that different, although the QQ plot for sample 1 looks a little less normal than for sample 2.

2. There are a lot of repetitions in the data and so it is a bit difficult for me to determine normality from looking at the QQ plots.

3. While, in my opinion, in general the Shapiro-Wilk test is the best test for normality, it has a weak spot, namely when there are a lot of repetitions. In this case it is better to use the D’Agostino-Pearson test. The p-values for this test for the three samples are .068, .533 and .541. This would indicate that for each sample you can’t reject the null hypothesis that the sample is normally distributed. Clearly, the result for sample 1 is more marginal than for samples 2 and 3 (i.e. there is a higher probability of a type I error).

Charles

Thanks for taking effort to create this fantastic site and coming up with the toolbox.

I have a data of 54 values. After removing one outlier, I check the normality using the descriptive statistics function from the toolbox and I get a low p value for the shapiro- wilk test. The Q-Q plot also is not linear and histogram shows a right skewed distribution when standardized , and more like a 2 modal distribution when non- standardized. I would like to know what distribution best fits this data. How should I progress in the next steps.?

Thanks for your advise in advance

Madan,

You can try to use the chi-square or KS tests to fit the data with various distributions. I have not automated this approach yet.

Charles

Hi Charles

Thanks for quick the reply.

Your example describes how to use KS tests to check for normality, you also mentioned that it is for bigger sample size.

My sample size is only 54 and I get confirmation that it is not following normal distribution based on Box plot, QQ plot and SW test.

The standardized histogram for my data shows a right skewed distributions. Which distribution should I test for ? How should I do the tests using chi-square or KM and where do I get the critical values.

Thank you for your advise in advance

Madan,

I prefer the chi-square test instead of the KS test since tables of critical values for different distributions is not necessary.

See the following webpage for how to use the chi-square test

Goodness of Fit — see Fitting data to a distribution

As to which distribution to test against, it is best to choose the distribution based on domain knowledge or theoretical considerations. When this is not available, you might need to use trial and error, experimenting with different distributions.

A sample of size 54 should be sufficient to fit your data.

Charles

Hi Charles, I came across your page and found it very useful to run statistics with Excel. I’ve seen elsewhere that it is possible to plot a confidence tube around the QQ scatter, which would greatly improve the reading of the plot. Thus, I was wondering whether it was possible to do so also in Excel. Thank you!

Jes,

I’ll look into this and consider adding it to the list of future enhancements.

Charles

Fine, I hope the updates will come soon! Thank you for your prompt reply

Jes,

I can’t promise this since I have a long list of potential enhancements and so need to prioritize which ones I can work on.

Charles

Charles,

I am using Excel 2010 and have downloaded the new .xlam file. I cannot get the qqplot to work. The array function qsort isn’t sorting the data numbers and is getting an error on invalid values. I have deleted the data column in the qqplot data table and rerun the qsort array command myself the the qqplot is then fine. I think there is a bug in the qqplot for Excel 2010.

By the way, the new features and this qqplot in particular are excellent additions to your Excel add-in. Well done!

Regards,

Joe

Joe,

I have not seen this problem before. Can you send me an Excel file with your data and the error that you have identified? You can send this to me at czaiontz@gmail.com.

Charles

Charles,

Also re the qqplot: it would be good to plot the standard data values on the x axis rather than the raw values. Either as standard or allow the option to plot raw or standardised values. I think qqplots are normally standardised. With raw data the x axis has to be re-scaled after it is drawn.

Thanks,

Joe

Joe,

Thanks for your comment. Actually, the only difference is that the x and y axes are flipped and the data is translated left or right. It shouldn’t need to be re-scaled. In any case, I will consider implementing your suggestion.

Charles

Why do you use NORM.S.INV(1/2n)? for the first element of the partition? Then the area over the tails is half the area over the other intervals- right?

Andy,

It is a bit arbitrary, but this seems to be the best way to deal with the left and right tails of the normal distribution which are actually infinitely long.

Charles

i am very grateful for this site i enjoyed it very much keep it up