The two-sample Kolmogorov-Smirnov test is used to test whether two samples come from the same distribution. The procedure is very similar to the One-Sample Kolmogorov-Smirnov Test (see also Kolmogorov-Smirnov Test for Normality).
Suppose that the first sample has size m with an observed cumulative distribution function of F(x) and that the second sample has size n with an observed cumulative distribution function of G(x). Define

$$D_{m,n} = \max_x |F(x) - G(x)|$$
The null hypothesis is $H_0$: both samples come from a population with the same distribution. As for the Kolmogorov-Smirnov test for normality, we reject the null hypothesis (at significance level α) if $D_{m,n} > D_{m,n,\alpha}$, where $D_{m,n,\alpha}$ is the critical value.
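The test statistic can be illustrated with a minimal Python sketch. It assumes the statistic is the largest absolute gap between the two empirical cumulative distribution functions; the function name and the two small samples are hypothetical and are not taken from the worksheets in this article.

```python
from bisect import bisect_right


def ks_two_sample_statistic(xs, ys):
    """Return the largest absolute gap between the two empirical CDFs."""
    xs_sorted = sorted(xs)
    ys_sorted = sorted(ys)
    m, n = len(xs_sorted), len(ys_sorted)
    d = 0.0
    # Empirical CDFs only change at observed values, so it is enough
    # to compare them at every distinct data point.
    for v in set(xs_sorted) | set(ys_sorted):
        f = bisect_right(xs_sorted, v) / m  # share of first sample <= v
        g = bisect_right(ys_sorted, v) / n  # share of second sample <= v
        d = max(d, abs(f - g))
    return d


# Hypothetical illustration data
sample_a = [1, 2, 3, 4]
sample_b = [3, 4, 5, 6]
print(ks_two_sample_statistic(sample_a, sample_b))  # 0.5
```

The null hypothesis would then be rejected when this value exceeds the critical value for the chosen α.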
For m and n sufficiently large,

$$D_{m,n,\alpha} = c(\alpha)\sqrt{\frac{m+n}{mn}}$$

where c(α) is the inverse of the Kolmogorov distribution at α. The critical value can be calculated in Excel as
Dm,n,α = KINV(α)*SQRT((m+n)/(m*n))
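For readers working outside Excel, the large-sample critical value can be sketched in Python. This is not the Real Statistics implementation of KINV: it assumes the standard series for the upper tail of the Kolmogorov distribution and inverts it by bisection, and all function names are hypothetical.

```python
import math


def kolmogorov_upper_tail(x, terms=100):
    """Upper tail 1 - K(x) of the Kolmogorov distribution (series form)."""
    if x < 0.2:
        # The tail is 1 to within about 1e-12 here, and the series
        # converges slowly for tiny x, so return the limit directly.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def kolmogorov_inverse(alpha, iterations=60):
    """Find c with upper tail equal to alpha by bisection on [0, 5]."""
    lo, hi = 0.0, 5.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if kolmogorov_upper_tail(mid) > alpha:
            lo = mid  # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2


def ks_critical_value(alpha, m, n):
    """Large-sample critical value c(alpha) * sqrt((m + n) / (m * n))."""
    return kolmogorov_inverse(alpha) * math.sqrt((m + n) / (m * n))


print(ks_critical_value(0.05, 50, 50))
```

For α = .05 and two samples of size 50, this sketch gives a critical value of about 0.27. As with the Excel formula, it is only appropriate when m and n are large.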
Example 1: Determine whether the two samples on the left side of Figure 1 come from the same distribution.
Figure 1 – Two-sample Kolmogorov-Smirnov test
We carry out the analysis on the right side of Figure 1. Column E contains the cumulative distribution for Men (based on column B), column F contains the cumulative distribution for Women and column G contains the absolute value of the differences. E.g. cell E4 contains the formula =B4/B14, cell E5 contains the formula =B5/B14+E4 and cell G4 contains the formula =ABS(E4-F4).
Cell G14 contains the formula =MAX(G4:G13) for the test statistic and cell G15 contains the formula =KSINV(G1,B14,C14) for the critical value. Since D-stat = .229032 > .224317 = D-crit, we conclude that there is a significant difference between the distributions of the two samples.
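The worksheet steps (cumulative proportions in columns E and F, absolute differences in column G, then the maximum) can be mirrored in a short Python sketch. The frequency counts below are hypothetical placeholders, not the values from Figure 1, and the function name is an assumption.

```python
from itertools import accumulate


def ks_from_frequency_table(freq_a, freq_b):
    """Return (D-stat, n1, n2) from two aligned columns of frequencies."""
    total_a, total_b = sum(freq_a), sum(freq_b)
    # Running totals divided by the sample size, like columns E and F
    cum_a = [c / total_a for c in accumulate(freq_a)]
    cum_b = [c / total_b for c in accumulate(freq_b)]
    # Absolute differences, like column G
    diffs = [abs(a - b) for a, b in zip(cum_a, cum_b)]
    return max(diffs), total_a, total_b


# Hypothetical counts for ten categories (not the data in Figure 1)
men = [5, 8, 12, 10, 6, 4, 3, 1, 1, 0]
women = [2, 4, 7, 11, 12, 8, 4, 1, 1, 0]
d_stat, n1, n2 = ks_from_frequency_table(men, women)
print(d_stat, n1, n2)
```

The two frequency columns must refer to the same ordered categories, just as columns B and C share the labels in column A.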
We can also use the following supplemental functions to carry out the analysis:
Real Statistics Function: The following functions are provided in the Real Statistics Resource Pack:
KSDIST(x, n1, n2, b, m) = the p-value of the two-sample Kolmogorov-Smirnov test at x (i.e. D-stat) for samples of size n1 and n2.
KSINV(p, n1, n2, b, iter, m) = the critical value for significance level p of the two-sample Kolmogorov-Smirnov test for samples of size n1 and n2.
As usual, m is the number of iterations used in calculating an infinite sum in KDIST and KINV (default = 10), and iter is the number of iterations used to calculate KINV (default = 40).
When the argument b = TRUE (default) then an approximate value is used which works better for small values of n1 and n2 (since no tables for critical values for small values of n1 and n2 are being provided). If b = FALSE then it is assumed that n1 and n2 are sufficiently large so that the approximation described previously can be used.
For Example 1, we have the following:
D-crit = KSINV(G1,B14,C14) = .224526
p-value = KSDIST(G14,B14,C14) = .043055
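A rough Python counterpart to the p-value calculation is sketched below. It assumes the large-sample approximation only, that is, the upper tail of the Kolmogorov distribution evaluated at D-stat scaled by sqrt(mn/(m+n)). It does not reproduce the small-sample adjustment that KSDIST applies when b = TRUE, so it will generally not match the values above exactly. The function names are hypothetical.

```python
import math


def kolmogorov_upper_tail(x, terms=100):
    """Upper tail 1 - K(x) of the Kolmogorov distribution (series form)."""
    if x < 0.2:
        # The tail is 1 to within about 1e-12 for such small x
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * x * x)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


def ks_asymptotic_p_value(d_stat, m, n):
    """Large-sample p-value for the two-sample KS statistic."""
    scaled = d_stat * math.sqrt(m * n / (m + n))
    return kolmogorov_upper_tail(scaled)


# Hypothetical inputs: D-stat of 0.3 with samples of size 40 and 60
print(ks_asymptotic_p_value(0.3, 40, 60))
```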
We can also use the following array function to perform the test:
Real Statistics Function: The following function is provided in the Real Statistics Resource Pack:
KS2TEST(R1, R2, lab, alpha, b, iter, m) is an array function which outputs a column vector with the values D-stat, p-value, D-crit, n1, n2 from the two-sample KS test for the samples in ranges R1 and R2, where alpha is the significance level (default = .05) and b, iter and m are as in KSINV.
If R2 is omitted (the default) then R1 is treated as a frequency table (e.g. range B4:C13 in Figure 1).
If lab = TRUE then an extra column of labels is included in the output; thus the output is a 5 × 2 range instead of the 5 × 1 range produced when lab = FALSE (default).
For Example 1, the formula =KS2TEST(B4:C13,,TRUE) inserted in range F21:G25 generates the output shown in Figure 2.
Figure 2 – Output from KS2TEST function
Example 2: Determine whether the samples for Italy and France in Figure 3 come from the same distribution.
Figure 3 – Two data samples
We first show how to perform the KS test manually and then we will use the KS2TEST function.
Figure 4 – Two sample KS test
The approach is to create a frequency table (range M3:O11 of Figure 4) similar to that found in range A3:C14 of Figure 1, and then use the same approach as was used in Example 1. This is done by using the supplemental array formula =SortUnique(J4:K11) in range M4:M10, then inserting the formula =COUNTIF(J$4:J$11,$M4) in cell N4, highlighting the range N4:O10 and pressing Ctrl-R and Ctrl-D. Finally, the formulas =SUM(N4:N10) and =SUM(O4:O10) are inserted in cells N11 and O11.
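The same frequency-table construction can be sketched in Python: collect the sorted distinct values from both samples (the role of SortUnique) and count how often each occurs in each sample (the role of COUNTIF). The sample values and function name below are hypothetical and are not the Italy and France data from Figure 3.

```python
from collections import Counter


def frequency_table(sample_a, sample_b):
    """Return rows of (value, count in sample_a, count in sample_b)."""
    values = sorted(set(sample_a) | set(sample_b))  # sorted unique values
    count_a, count_b = Counter(sample_a), Counter(sample_b)
    # A Counter returns 0 for values that never occur in a sample
    return [(v, count_a[v], count_b[v]) for v in values]


# Hypothetical raw samples
sample_a = [3, 5, 5, 7, 8]
sample_b = [4, 5, 7, 7, 9, 9]
for row in frequency_table(sample_a, sample_b):
    print(row)
```

The two count columns can then be accumulated and compared exactly as in Example 1.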
We can also calculate the p-value using the formula =KSDIST(S11,N11,O11), getting the result of .62169.
We see from Figure 4 (or from the fact that p-value > .05) that the null hypothesis is not rejected, and so there is no significant difference between the distributions of the two samples. The same result can be obtained by applying the KS2TEST array function to the two samples in ranges J4:J11 and K4:K11, which produces the output in Figure 5.
Figure 5 – Output from KS2TEST function