**Hypothesis Testing**

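**Definition 1**: Let $x_1, \dots, x_n$ be an ordered sample with $x_1 \le \dots \le x_n$ and define $S_n(x)$ as follows:

$$S_n(x) = \begin{cases} 0 & x < x_1 \\ k/n & x_k \le x < x_{k+1} \\ 1 & x \ge x_n \end{cases}$$

Now suppose that the sample comes from a population with cumulative distribution function $F(x)$ and define $D_n$ as follows:

$$D_n = \max_x \lvert F(x) - S_n(x) \rvert$$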

**Observation**: It can be shown that the distribution of $D_n$ doesn’t depend on $F$ (provided $F$ is continuous). Since $S_n(x)$ depends on the sample chosen, $D_n$ is a random variable. Our objective is to use $D_n$ as a way of estimating $F(x)$.
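As an illustration of the definition, here is a minimal Python sketch that computes $D_n$ for a raw (ungrouped) sample against a hypothesized distribution function. The function name `ks_statistic`, the sample values and the N(50, 10) population are assumptions made up for this example. Because $S_n$ is a step function, the sketch checks the gap on both sides of each step; the worksheet examples on this page compare $F$ and $S_n$ only at the tabulated $x$ values, which is a simplification.

```python
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """Return D_n = max over x of |F(x) - S_n(x)| for a raw sample."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # S_n jumps from (i-1)/n to i/n at x, so check both sides of the step.
        d = max(d, abs(i / n - f), abs(f - (i - 1) / n))
    return d

# Hypothetical data, tested against a hypothesized N(50, 10) population.
sample = [38.2, 41.5, 44.0, 47.3, 49.9, 52.4, 55.1, 58.6, 61.0, 66.7]
d_n = ks_statistic(sample, NormalDist(mu=50, sigma=10).cdf)
print(round(d_n, 4))
```

Any function that maps a value to a cumulative probability can be passed as `cdf`, so the same sketch works for distributions other than the normal.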

The distribution of $D_n$ can be calculated (see Kolmogorov Distribution), but for our purposes the important aspect of this distribution is its critical values. These can be found in the Kolmogorov-Smirnov Table.

If $D_{n,\alpha}$ is the critical value from the table, then $P(D_n \le D_{n,\alpha}) = 1 - \alpha$.

$D_n$ can be used to test the hypothesis that a random sample came from a population with a specific distribution function $F(x)$. If

$$\max_x \lvert F(x) - S_n(x) \rvert \le D_{n,\alpha}$$

then the sample data is a good fit with $F(x)$.

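Also from the definition of $D_n$ given above, it follows that

$$1 - \alpha = P(D_n \le D_{n,\alpha}) = P\left(\max_x \lvert F(x) - S_n(x) \rvert \le D_{n,\alpha}\right) = P\left(S_n(x) - D_{n,\alpha} \le F(x) \le S_n(x) + D_{n,\alpha} \text{ for all } x\right)$$

Thus $S_n(x) \pm D_{n,\alpha}$ provides a $1 - \alpha$ confidence interval for $F(x)$.

**Example 1**: Determine whether the data represented in the following frequency table is normally distributed.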

This means that 8 elements have an *x* value less than 100, 25 elements have an *x* value between 101 and 200, etc. We need to find the mean and standard deviation of this data. Since this is a frequency table, we can’t simply use Excel’s AVERAGE and STDEV functions. Instead we first use the midpoints of each interval and then use an approach similar to that described in Frequency Tables as follows:

Thus, the mean is 481.4 and the standard deviation is 155.2. We can now build the table that allows us to carry out the KS test, namely:
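The midpoint calculation can be sketched in Python as follows. Since the frequency table for Example 1 is not reproduced here, the sketch uses the small three-interval table that a reader works through in the comments below (midpoints 15, 25, 35 with frequencies 24, 32, 33). The function name `grouped_mean_sd` is made up for this illustration, and the sketch assumes the sample (n − 1) form of the standard deviation, as in that comment.

```python
import math

def grouped_mean_sd(midpoints, freqs):
    """Mean and sample standard deviation of grouped data via interval midpoints."""
    n = sum(freqs)
    mean = sum(m * f for m, f in zip(midpoints, freqs)) / n
    mean_sq = sum(m * m * f for m, f in zip(midpoints, freqs)) / n
    var_pop = mean_sq - mean ** 2        # population variance
    var_sample = var_pop * n / (n - 1)   # rescale to the sample variance
    return n, mean, math.sqrt(var_sample)

midpoints = [15, 25, 35]   # illustrative data taken from a reader comment
freqs = [24, 32, 33]
n, mean, sd = grouped_mean_sd(midpoints, freqs)
print(n, round(mean, 3), round(sd, 3))
```

For this data the sketch gives n = 89, a mean of about 26.011 and a standard deviation of about 7.98, in line with the figures quoted in that comment.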

Columns A and B contain the data from the original frequency table. Column C contains the corresponding cumulative frequency values, and column D simply divides these values by the sample size (*n* = 1000) to yield the empirical cumulative distribution function $S_n(x)$.

Column E uses the mean and standard deviation calculated previously to standardize the values of *x* from column A. E.g. the formula in cell E4 is =STANDARDIZE(A4,N$5,N$10), where cell N5 contains the mean and cell N10 contains the standard deviation. Column F uses these standardized values to calculate the cumulative distribution function values assuming that the original data is normally distributed. E.g. cell F4 contains the formula =NORMSDIST(E4). Finally, column G contains the absolute differences between the values in columns D and F. E.g. cell G4 contains the formula =ABS(F4-D4). If the original data is normally distributed, these differences will be close to zero.
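The column-by-column construction can be mirrored in a short Python sketch. This is only an illustration of the worksheet logic, not the Real Statistics implementation: the function name `ks_table` is made up, and the data, mean and standard deviation are the illustrative values from the three-interval table a reader posts in the comments below (with the midpoints used as the x values, as in that comment).

```python
from statistics import NormalDist

def ks_table(xs, freqs, mean, sd):
    """Rows of (x, cumulative freq, S_n(x), z, F(x), |F(x) - S_n(x)|)."""
    n = sum(freqs)
    rows, cum = [], 0
    for x, f in zip(xs, freqs):
        cum += f                       # column C: cumulative frequency
        s_n = cum / n                  # column D: S_n(x)
        z = (x - mean) / sd            # column E: Excel's STANDARDIZE
        f_x = NormalDist().cdf(z)      # column F: Excel's NORMSDIST
        rows.append((x, cum, s_n, z, f_x, abs(f_x - s_n)))  # column G
    return rows

# Illustrative values taken from a reader comment, not from Example 1.
rows = ks_table([15, 25, 35], [24, 32, 33], mean=26.0112, sd=7.9836)
d_n = max(row[-1] for row in rows)
for row in rows:
    print(row[0], row[1], *(round(v, 3) for v in row[2:]))
print("D_n =", round(d_n, 3))
```

The largest entry in the last column plays the role of $D_n$; for this data it is about 0.186.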

Now $D_n$ = the largest value in column G, which in our case is 0.0117. If the data is normally distributed, then $D_n$ should be smaller than the critical value $D_{n,\alpha}$ (this happens with probability $1 - \alpha$). From the Kolmogorov-Smirnov Table we see that

$$D_{n,\alpha} = D_{1000,.05} = \frac{1.36}{\sqrt{1000}} = 0.043007$$
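For large samples the table entry has the form of a constant divided by the square root of n. The following Python sketch reproduces the arithmetic above and, as a cross-check, derives the constant from the usual one-term large-sample approximation $c(\alpha) = \sqrt{-\tfrac{1}{2}\ln(\alpha/2)}$, of which 1.36 is the rounded value for α = .05. The function name `ks_critical_large_n` is an assumption for this illustration, and the approximation should not be used for small n, where the table gives exact values.

```python
import math

def ks_critical_large_n(n, alpha=0.05):
    """Large-sample two-tailed KS critical value, c(alpha) / sqrt(n)."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha / math.sqrt(n)

n = 1000
table_value = 1.36 / math.sqrt(n)     # coefficient as printed in the KS table
asymptotic = ks_critical_large_n(n)   # uses the unrounded coefficient
print(round(table_value, 6), round(asymptotic, 6))
```

The two values differ only in the fourth decimal place because the table rounds the coefficient to 1.36.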

Since $D_n$ = 0.0117 < 0.043007 = $D_{n,\alpha}$, we conclude that the data is a good fit with the normal distribution.

**Example 2**: Using the KS test, determine whether the data in Example 1 of Graphical Tests for Normality and Symmetry is normally distributed.

We follow the same procedure as in the previous example to obtain the following results. Since the frequencies are all 1, this example should be a bit easier to understand.

The Kolmogorov-Smirnov Table shows that the critical value $D_{n,\alpha} = D_{15,.05} = .338$.

Since $D_n$ = 0.1874988 < 0.338 = $D_{n,\alpha}$, we conclude that the data is a reasonably good fit with the normal distribution (more precisely, that there is no significant difference between the data and data which is normally distributed). Note that this is not the same conclusion we reached from looking at the histogram and QQ plot.

**Real Statistics Excel Functions**: The following functions are provided in the Real Statistics Resource Pack:

**KSCRIT**(*n, α, tails, h*) = the critical value of the Kolmogorov-Smirnov test for a sample of size *n*, for the given value of alpha (default = .05) and *tails* = 1 (one tail) or 2 (two tails, default), based on the KS Table. If *h* = TRUE (default) harmonic interpolation is used; otherwise linear interpolation is used.

**KSPROB**(*x, n, tails, iter*) = an approximate p-value for the KS test for value equal to *x* for a sample of size *n* and *tails* = 1 (one tail) or 2 (two tails, default) based on a linear interpolation of the values in the Kolmogorov-Smirnov Table, using *iter* number of iterations (default = 40).

Note that the values for *α* in the Kolmogorov-Smirnov Table range from .01 to .2 (for tails = 2) and .005 to .1 for tails = 1. If the p-value is less than .01 (tails = 2) or .005 (tails = 1) then the p-value is given as 0 and if the p-value is greater than .2 (tails = 2) or .1 (tails = 1) then the p-value is given as 1.

For Example 2, KSCRIT(15, .05, 2) = .338 (the same as given in cell H21 of Figure 4). Also note that the p-value = KSPROB(H20, B21) = KSPROB(0.1874988, 15) = 1 (meaning that p-value > .2), and so once again we can’t reject the null hypothesis that the data is normally distributed.

If the value of $D_n$ had been .35 in Example 2, then $D_n$ = .35 > .338 = $D_{crit}$, and so we would have rejected the null hypothesis that the data is normally distributed. In this case we would have seen that p-value = KSPROB(.35,15) = .0427, which once again leads us to reject the null hypothesis.

**Kolmogorov Distribution**

As referenced above, the Kolmogorov distribution can be useful in conducting the Kolmogorov-Smirnov test. See Kolmogorov Distribution for more information about this distribution, including some useful functions provided by the Real Statistics Resource Pack.
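To give a feel for how a p-value can come out of this distribution, here is a minimal Python sketch of the large-sample (limiting) Kolmogorov formula, $p \approx 2\sum_{k \ge 1} (-1)^{k-1} e^{-2k^2 n D_n^2}$. This is not how KSPROB works (KSPROB interpolates the table, as described above), and the function name `ks_pvalue_asymptotic` is made up for this illustration. The approximation is rough for small n and unreliable when $D_n\sqrt{n}$ is very close to zero.

```python
import math

def ks_pvalue_asymptotic(d, n, terms=100):
    """Two-tailed p-value from the limiting Kolmogorov distribution."""
    z = d * math.sqrt(n)
    if z <= 0:
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * z * z)
                for k in range(1, terms + 1))
    # Clamp to [0, 1] to absorb truncation error in the series.
    return min(1.0, max(0.0, 2 * total))

# Values quoted by a reader in the comments below: n = 20, D = .416.
p = ks_pvalue_asymptotic(0.416, 20)
print(round(p, 4))
```

For n = 20 and D = .416 the sketch gives a p-value of roughly .002, which agrees with the two-sided SPSS figure quoted in that comment.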

**Lilliefors Test**

When the population mean and standard deviation for the Kolmogorov-Smirnov Test are estimated from the sample mean and standard deviation, as was done in Examples 1 and 2, then the **Kolmogorov-Smirnov Table** yields results that are too conservative. More accurate results can be derived from the Lilliefors Table as described in the Lilliefors Test for Normality.
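The effect can be seen with a small simulation. The Python sketch below draws many normal samples of size 15, estimates the mean and standard deviation from each sample, computes $D_n$ against the fitted normal distribution, and reads off the 95th percentile as an approximate critical value. It is only an illustration under stated assumptions: the function names, the number of repetitions and the seed are arbitrary choices, and the result is a Monte Carlo estimate rather than a value from the Lilliefors Table.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def ks_stat_estimated_normal(sample):
    """D_n against a normal CDF whose mean and sd are estimated from the sample."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(x)
        # Check the gap on both sides of the step of S_n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def simulated_critical_value(n, alpha=0.05, reps=5000, seed=1):
    """Empirical (1 - alpha) quantile of D_n under normality with estimated parameters."""
    rng = random.Random(seed)
    stats = sorted(
        ks_stat_estimated_normal([rng.gauss(0, 1) for _ in range(n)])
        for _ in range(reps)
    )
    return stats[math.ceil((1 - alpha) * reps) - 1]

crit = simulated_critical_value(15)
print(round(crit, 3))
```

The simulated critical value for n = 15 should come out noticeably smaller than the .338 taken from the Kolmogorov-Smirnov Table, which is why the KS table is conservative when the parameters are estimated from the sample.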

Dear Sir:

I am looking for a test to check whether a sub-sample of size “n” taken from a sample of size “N” (source sample), with n<<N, has the same attributes as the source sample.

Is Kolmogorov-Smirnov the best test?

The source sample has a multimodal distribution (fish size frequencies), and I have some doubts about how to construct the cumulative sample for the KS test.

The data is in a table of frequencies by size range.

Thanks for your answer

Renato

Renato,

Whether the KS test is the right one depends on what you mean by “has the same attributes”. In any case, the webpage at http://www.real-statistics.com/tests-normality-and-symmetry/statistical-tests-normality-symmetry/kolmogorov-smirnov-test/ describes in detail the steps you need to perform the KS test. You can also download the Real Statistics Examples Workbook and look at the Excel worksheet for the KS test and use it as a model for your test. I have not yet created a supplemental function to automate the calculation of the KS test, but I will eventually add this.

Charles

Renato,

I have now provided another example of how to apply the KS test to determine whether a sample follows a specified distribution. See the webpage http://www.real-statistics.com/non-parametric-tests/one-sample-kolmogorov-smirnov-test/.

Charles

Hi, SPSS uses Z K-S = D*SQRT(n) and a p-value, but I can’t reproduce the p-value; it is not a probability from the normal distribution.

Example, n = 20 D = .416, ZK-S =.416*SQRT(20) = 1.861 SPSS P-value (two sided) = .002.

But, 2*(1-NORMSDIST(1.861)) is not .002

Do you know how the p-value is calculated?

Thanks a lot

P.S. Sorry, my English is not the best

Hi Juan Pablo,

You need the distribution function. You can find this at http://www.jstatsoft.org/v08/i18/paper or http://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test.

Charles

Hello sir, I found this article very helpful. I need to fit a log-normal distribution using either the chi-square or the K-S test. You have explained only the normal distribution. Please explain the log-normal distribution as well.

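Here is my test data:

mean = 5.1439

δ = 0.2506

median = 4.99

σlnz = 0.247

| interval (from) | interval (to) | observed frequency |
| --- | --- | --- |
| 1.81 | 2.759 | 9 |
| 2.759 | 3.708 | 61 |
| 3.708 | 4.657 | 116 |
| 4.657 | 5.606 | 155 |
| 5.606 | 6.555 | 120 |
| 6.555 | 7.504 | 42 |
| 7.504 | 8.453 | 7 |
| 8.453 | 9.402 | 2 |
| 9.402 | 10.351 | 2 |
| 10.351 | 11.3 | 3 |

sum = 517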

The procedure for using the K-S test with the log normal distribution is pretty much the same as for the normal distribution. E.g. in Figure 3, you won’t need the E column. Simply enter the formula for the log-normal distribution in column F. E.g. cell F4 would contain a formula like =LOGNORMDIST(A4,N5,N10). The rest is the same as in the examples provided on the webpage.

Charles

thank you very much sir for your reply.

Sir, I have one more doubt: should we use the “mean and standard deviation” or the “median and σlnz” in the lognormal distribution?

Sandeep,

If I understand your original question correctly, then you should use the mean and std dev, esp. since Excel has the LOGNORM.DIST function available, which uses these two parameters (note that for this function they are the mean and std dev of ln(x), not of x itself). Why do you think the median and σlnz might be good choices? Perhaps this is correct and I am not answering the right question.

Charles

thank you sir

Hi, may I know what the p-value means and how to find the p-value for the Kolmogorov-Smirnov test?

Besides that, is it possible to use the test statistic of another distribution as a critical value to find the p-value of the KS test?

For example, use the z value of the normal distribution to find the p-value for the KS test.

Sally,

Sorry, but I don’t understand your question. In any case I will be adding the KS p-value shortly.

Charles

Sally,

I am revising the KS part of the website/software and will add the p-value. Stay tuned.

Charles

Sally,

I have now provided a way of calculating the p-value for the KS test, using the functions KSPROB and KSDIST. These are available in the latest release of the Real Statistics Resource Pack (Rel 2.15).

Charles

Sir,

I am trying to determine if Rokeach value survey (RVS) responses for two different groups are statistically significant. The RVS has subjects rank 18 values in order of importance to them. I have calculated the mean response for each value within each group and ordered them from most important (lowest mean) to least important (highest mean). I was told I could use the Kolmogorov-Smirnov Test to determine if differences in mean value rankings between groups are statistically significant.

I would appreciate an explanation of this process in Excel.

Thank you in advance,

Kevin, Excel expert, stats neophyte

P.S. I have learned more practical statistics from your site than my undergrad and masters professors have been able to drill into me… Well done, Sir!

Kevin,

It is good to hear that the site has been helpful. My goal was exactly as you stated, to help people make practical use of (and understand) statistics in the environment that is probably the most available for most people, namely Excel.

If your goal is to determine whether there is a significant difference between the means of the two groups, you probably want to use the t test (if the data in the two groups are normally distributed) or the Mann-Whitney test if they are not. You could also use the two-sample Kolmogorov-Smirnov Test to determine whether the two groups of data come from the same population. I have already described the one sample Kolmogorov-Smirnov Test on the website, but not the two sample test.

Fortunately, I have just implemented the two sample test in the Real Statistics Resource Pack (Release 2.15) and have written the description for the website (including two examples). I plan to release these in the next couple of days. Stay tuned.

Charles

Kevin,

The two-sample KS test is now included in the Real Statistics Resource Pack. The procedure is described on the webpage http://www.real-statistics.com/non-parametric-tests/two-sample-kolmogorov-smirnov-test/.

Charles

Hi all,

I am trying to fit an appropriate probability distribution to my data. I know that I can use the K-S test, but my problem is that, as I am going to use MATLAB or Excel for this purpose, I do not know how to carry out this test in these programs. I have never seen an example of this test for the exponential distribution or for distributions other than the normal and lognormal. How can I decide whether, for example, the lognormal distribution or the exponential distribution is appropriate?

Thank you very much for your help in advance.

Hi Zohreh,

The approach for using the KS Test to test whether the data is exponentially distributed is very similar to that shown on the referenced webpage. I will add an example using the exponential distribution to the website in the next couple of days. This should help you.

Charles

Zohreh,

I have now added a description of how to determine whether data fits the exponential distribution using the KS test. See the webpage http://www.real-statistics.com/non-parametric-tests/one-sample-kolmogorov-smirnov-test/.

Charles

Can we use the Kolmogorov-Smirnov test if we want to know whether the data follow a binomial distribution?

Cathy,

Yes, you can use the KS test for this purpose. In addition to the referenced webpage, which shows how to use the KS test to determine whether data fits with the normal distribution, I give an example of how to do this for the exponential distribution on the webpage http://www.real-statistics.com/non-parametric-tests/one-sample-kolmogorov-smirnov-test/. The approach for the binomial distribution is similar. Also note that if the sample size is sufficiently large the binomial distribution can be approximated by a normal distribution, as described on the webpage http://www.real-statistics.com/binomial-and-related-distributions/relationship-binomial-and-normal-distributions/.

Charles

Dear Sir,

Can you give an example where we use the KS table to determine whether data follows a Poisson distribution? An Excel worksheet would be helpful.

Regards,

Jerome

jeromegomes89@gmail.com

Jerome,

You can use the one sample KS test as described on the webpage One Sample KS Test.

The only problem is that the test is more accurate if you know the mean of the distribution instead of estimating it from the sample.

You can also use the chi-square goodness of fit test as described on the webpage Goodness of Fit.

Charles

Charles,

With your tool it is possible to use the Shapiro-Wilk test on a time series and get, besides the p-value, a “yes” or “no” for the normality assumption. Therefore I can do this test for multiple series in parallel with only one formula, which is very nice.

Is there also a possibility to test for other distributions (Poisson, Stuttering Poisson, Gamma, Negative Binomial, etc.) for multiple series (KS-Test or Chi-Square-Test), so I can see which distribution would fit best?

Sven

Sven,

I haven’t yet implemented software versions of chi-square or KS to test for a fit with a specific distribution. The One Sample Kolmogorov-Smirnov Test and Goodness of Fit webpages explain how this can be done, however.

Charles

Hello Sir,

I am searching for the two-sample Kolmogorov-Smirnov test in Excel. Can you help me?

See the webpage Two Sample Kolmogorov-Smirnov Test

Charles

Great article. I now understand how you calculate the p-value for the KS test. Thanks so much. However, when I try to replicate this in Excel, the NORMDIST function does not return the same values. Is there something different you are doing? Excel asks me for the mean and stdev (which I input) but does not return the same values you have in your sheet.

many thanks

Avi,

I don’t see any reference to the NORMDIST function on the referenced webpage. There is a reference to =NORMSDIST(E4), which is the standard normal distribution function (mean = 0 and standard deviation = 1).

Charles

Dear Sir,

Thank you for sharing this.

I have a question: why in the first example we calculate the Z-score with x=100, 200, etc., but with mean and standard deviation calculated from the mid points (150, 250, etc.)?

Shouldn’t the midpoints of the intervals be used in column A for the Z-score calculation?

Best regards,

Gianma

Gianma,

Probably either approach is acceptable, but here I have used the endpoints of the various intervals with the mean and stdev based on the midpoint of the intervals.

Charles

Dear Charles,

Sorry for insisting, but it’s not a negligible difference: using the midpoints of the intervals for calculating the Z-score, the resulting Dn is equal to 0.117>Dn,a, so the overall result is the opposite (the data is NOT a good fit with the normal distribution)…

Considering that the definition of Z is (Xi-u)/S, where u is the mean of the X values and S is their stdev, I think that only the midpoints of the intervals should be used, if u is calculated as their mean.

Otherwise, we can use the endpoints of the intervals as Xi, but in this case also mean and stdev should be calculated on these values, and not on the midpoints.

Do you agree?

Gianma,

I realize that depending on the choice you make you might come to a different conclusion. This is why it is important to view significance values such as alpha = .05 not as absolute things. In fact if you set alpha = .05 as your significance value, any p-value near .05 can be viewed with some caution.

Unfortunately, this is the nature of statistics. If you get a p-value of .0003 you are fairly confident of your result (at least as far as type I error is concerned), but often depending on which test you choose to use (or which version of a test you use), you might get different outcomes.

Charles

Dear Sir,

Thank you for your answer. Sincerely, I’m not 100% convinced, but at least this discussion forced me to look deeper into this topic, and review some forgotten page of statistics!

Best regards,

Gianmarco

How do I find the cumulative distribution function F(x)?

Examples 1 and 2 on the referenced webpage explain how to compute the cumulative distribution function F(x).

Charles

Charles, thanks, but I too have a question.

May I perform the KS test on two samples with different counts or n values?

For example, there are 7 possible categories, and there are 3 individual samples that will distribute within those 7 categories (dealing with sediment, sieves, and weights). I need to compare this set of samples to another set of samples, however, the number of samples here is 7. So, 2 sets of samples. The first, has 3 samples, and the second has 7. There are 7 sieve sizes or categories into which the samples are distributed. Can KS test be run on them?

If not, then would it be permissible to take the means of each sample, thus giving congruency to the data (same n values, but with means), and use the n from the sample size (n=10 (3 from first, and 7 from second)), rather than the mean (n=2) to establish the critical value, or would I need to use the n from the mean sample size to establish the CV?

Thanks,

Brody

Brody,

Although I don’t completely understand your description, you can perform a two sample KS test with samples of different sizes to determine whether these samples come from populations with the same distribution. See the following webpage for more details: Two Sample KS Test

Charles

Dear Charles I appreciate your contributions.

Please consider the following, in your second example you state the following:

“Since Dn = 0.1874988 < 0.338 = Dn,α, we conclude that the data is a reasonably good fit with the normal distribution (more precisely that there is no significant difference between the data and data which is normally distributed). Note that is not the same conclusion we reached from looking at the histogram and QQ plot"

So the same remains for Dn = 0.1874988 .338 = Dcrit, and so we would have rejected the null hypothesis that the data is normally distributed. In this case we would have seen that p-value = KSPROB(.35,15) = .0427, which once again leads us to reject the null hypothesis”

But if the α=0.01 then the critical value is 0.404 and Dn = 0.35 < 0.404 = Dn,α,

Then, should we conclude that data is normally distributed ???

I´ll appreciate your comments,

Kind regards

Edgar

Edgar,

Changing the value of alpha from .05 to .01, changes the value for Dcrit, but doesn’t change the value of Dn. I don’t see where the Dn = 0.35 comes from?

The null hypothesis that the data comes from a normal population cannot be rejected if Dn < Dcrit.

Charles

Dear Sir,

Thank you very much, I’m learning a lot from your website.

Unfortunately, my data set does not fit the normal distribution.

I have very large data and I read in this paper (Open Access): Langlois, T. J., Fitzpatrick, B. R., Fairclough, D. V., Wakefield, C. B., Hesp, S. A., McLean, D. L., … Meeuwig, J. J. (2012). Similarities between Line Fishing and Baited Stereo-Video Estimations of Length-Frequency: Novel Application of Kernel Density Estimates. PLoS ONE, 7(11), 1–9. doi:10.1371/journal.pone.0045973

“We used Monte Carlo simulations to overcome uncertainty regarding the asymptotic distributions of KS test statistics under the null hypothesis”.

How can I do the simulation in excel so my data can fit with normal distribution, so I can run the KS test for my data.

Thank you very much

I want to do the KS two sample test

See the following webpage: Kolmogorov-Smirnov Two Sample Test.

Charles

Sorry, but I don’t understand your question. You don’t do simulations to make data fit a distribution. If you knew the data was normally distributed then you wouldn’t need to perform the KS test. Please clarify what you are trying to do.

Charles

Hi,

great site I learn from it a lot.

Can I please ask, how did you calculate a column K? midpt-sq?

Thank you.

Rob,

Cell K4 contains the formula =I4^2, and similarly for the other cells in column K.

Charles

Thank you very much Charles.

I don’t know if I have understood the numbers in column A (x values) correctly.

Let me explain on this example.

I have scale of loneliness and results can be

10-20 – low loneliness

20-30 – average loneliness

30-40 – high loneliness

so I calculate:

| data | midp | freq | midp^2 |
| --- | --- | --- | --- |
| 10-20 | 15 | 24 | 225 |
| 20-30 | 25 | 32 | 625 |
| 30-40 | 35 | 33 | 1225 |

So:

n: 89

M: 26,011

Msq: 676,584

Sq-sum/n: 739,606

Varp: 63,022

Var: 63,738

Stdev: 7,983

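| x | freq | Cum | Snx | Z | F(x) | D |
| --- | --- | --- | --- | --- | --- | --- |
| 15 | 24 | 24 | 0,270 | -1,379 | 0,084 | 0,186 |
| 25 | 32 | 56 | 0,629 | -0,127 | 0,450 | 0,180 |
| 35 | 33 | 89 | 1 | 1,126 | 0,870 | 0,130 |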

Dmax 0,185

Dkrit 0,144

0,185>0,144 so data is not normally distributed.

Is this calculated right or not? I am not sure about choosing the data for column A.

Rob,

If you assume that the data in each interval is concentrated at the midpoint then the calculation is correct. I have typically assumed this for the calculation of the mean, but have used the right end-point of the intervals for the KS calculation. I can see advantages with both approaches.

I suggest that, instead of using the KS table to calculate the critical value, you use the Lilliefors Table. It is more accurate for determining whether data is normal when you use the sample mean and standard deviation. See the following webpage: Lilliefors Test

Charles