A number of unplanned comparisons are available. A few of the commonly used post-hoc tests (e.g. Fisher’s LSD, Student Newman-Keuls (SNK) and Tukey’s B) are not very accurate and usually should not be used. The key issue is to correct for experiment-wise error.
Although the Bonferroni and Dunn/Sidák correction factors can be used, since we are considering unplanned tests, we must assume that all pairwise tests will be made (or at least taken into account). For k groups, this results in m = C(k, 2) = k(k–1)/2 tests. For an experiment-wise error of α we need to use α/m as the alpha for each test (Bonferroni) or $1 - (1 - \alpha)^{1/m}$ (Dunn/Sidák). This makes these tests too conservative.
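To see how quickly the per-test alpha shrinks, here is a minimal Python sketch of the two corrections described above. The function name `per_test_alpha` is made up for illustration and is not part of the Real Statistics Resource Pack.

```python
from math import comb

def per_test_alpha(k, alpha=0.05):
    """Return (m, Bonferroni alpha, Dunn/Sidak alpha) for all pairwise tests among k groups."""
    m = comb(k, 2)                       # number of pairwise tests, k(k-1)/2
    bonferroni = alpha / m               # Bonferroni per-test alpha
    sidak = 1 - (1 - alpha) ** (1 / m)   # Dunn/Sidak per-test alpha
    return m, bonferroni, sidak

m, bonferroni, sidak = per_test_alpha(4)
print(m, round(bonferroni, 5), round(sidak, 5))  # 6 0.00833 0.00851
```

With only four groups, each test must already be run at an alpha below .01, which is why these corrections lose power when all pairs are compared.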
More useful tests are Tukey’s HSD and REGWQ. These tests are designed only for pairwise comparisons (i.e. no complex contrasts). We also describe extensions to Tukey’s HSD test (Tukey-Kramer and Games and Howell) where the sample sizes or variances are unequal. We also describe the Scheffé test, which can be used for non-pairwise comparisons.
We discuss these tests on this webpage. Where the variances are unequal we can also use the Brown-Forsythe F* Test.
General guidelines are:
- Tukey’s test is usually the safe choice. It is a good choice for comparing large numbers of means.
- REGWQ test is even better (i.e. has more power) for comparing all pairs of means, but should not be used when group sizes are different.
- Hochberg’s GT2 (not reviewed here) is best when sample sizes are very different.
- Games-Howell is useful when uncertain about whether population variances are equivalent.
Tukey’s HSD (Honestly Significant Difference)
The idea behind this test is to focus on the largest value of the difference between two group means. The relevant statistic is

$$q = \frac{\bar{x}_{max} - \bar{x}_{min}}{\text{s.e.}} \qquad \text{where} \qquad \text{s.e.} = \sqrt{\frac{MS_W}{n}}$$

and n = the size of each of the group samples. The statistic q has a distribution called the studentized range q (see Studentized Range Distribution). The critical values for this distribution are presented in the Studentized Range q Table based on the values of α, k (the number of groups) and dfW. If q > qcrit then the two means are significantly different.
This test is equivalent to

$$\bar{x}_{max} - \bar{x}_{min} > q_{crit} \sqrt{\frac{MS_W}{n}}$$
Picking the largest pairwise difference in means allows us to control the experiment-wise error for all possible pairwise contrasts; in fact, Tukey’s HSD keeps experiment-wise α = .05 for the largest pairwise contrast, and is conservative for all other comparisons.
Note that the statistic q is related to the usual t statistic by q = √2 · t. Thus we can use the following t statistic

$$t = \frac{\bar{x}_{max} - \bar{x}_{min}}{\text{s.e.}} \qquad \text{where} \qquad \text{s.e.} = \sqrt{\frac{2\,MS_W}{n}}$$

The critical value for t is now given by tcrit = qcrit/√2. If t > tcrit then we reject the null hypothesis that H0: μmax = μmin, and similarly for other pairs.
As described above, to control type I error, we can’t simply use the usual critical value for the distribution, but instead use a critical value based on the largest difference of the means.
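From these observations we can calculate confidence intervals in the usual way:

$$\bar{x}_i - \bar{x}_j \pm t_{crit} \sqrt{\frac{2\,MS_W}{n}} \;=\; \bar{x}_i - \bar{x}_j \pm q_{crit} \sqrt{\frac{MS_W}{n}}$$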
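The quantities above can be collected in a few lines of Python. This is only a sketch under stated assumptions: the function name `tukey_hsd_pair` and the input numbers are hypothetical, the group sizes are assumed equal, and qcrit must be supplied by the caller (from the Studentized Range q Table or from software) since it is not computed here.

```python
from math import sqrt

def tukey_hsd_pair(mean_i, mean_j, msw, n, q_crit):
    """Tukey HSD quantities for one pair of group means (equal group size n)."""
    diff = mean_i - mean_j
    se_q = sqrt(msw / n)          # standard error used with the q statistic
    se_t = sqrt(2 * msw / n)      # standard error used with the t statistic
    q = abs(diff) / se_q
    t = abs(diff) / se_t          # q = sqrt(2) * t
    t_crit = q_crit / sqrt(2)
    half_width = q_crit * se_q    # identical to t_crit * se_t
    return {
        "q": q,
        "t": t,
        "t_crit": t_crit,
        "critical_difference": half_width,
        "significant": q > q_crit,
        "ci": (diff - half_width, diff + half_width),
    }

# Hypothetical inputs, chosen only to exercise the function
result = tukey_hsd_pair(mean_i=10.0, mean_j=7.0, msw=4.0, n=16, q_crit=3.8)
print(result["q"], result["significant"], result["ci"])
```

The test on q and the test on t always agree, because both the statistic and its critical value are rescaled by the same factor √2.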
Example 1: Analyze the data from Example 3 of Planned Comparisons using Tukey’s HSD test to compare the population means of women taking the drug and the control group taking the placebo.
Using the Studentized Range q Table with α = .05, k = 4 and dfW = 44, we get qcrit = 3.7775. Note that since there is no table entry for df = 44, we need to interpolate between the entries for df = 40 and df = 48. Alternatively, we can employ Excel’s table lookup capabilities. We can also use the supplemental function QCRIT(4,44,.05,2), as described below, to get the same result of 3.7775.
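The interpolation step can be sketched as follows. This is an illustration, not the Real Statistics implementation: the helper name `interpolate_crit` is hypothetical, "harmonic" is taken to mean linear interpolation in 1/df, and the two table entries (3.791 for df = 40 and 3.764 for df = 48) are assumed values that should be checked against the Studentized Range q Table.

```python
def interpolate_crit(df, df_lo, crit_lo, df_hi, crit_hi, harmonic=True):
    """Interpolate a critical value between two tabulated df rows.

    harmonic=True interpolates linearly in 1/df (assumed meaning of "harmonic");
    harmonic=False interpolates linearly in df.
    """
    if harmonic:
        w = (1 / df_lo - 1 / df) / (1 / df_lo - 1 / df_hi)
    else:
        w = (df - df_lo) / (df_hi - df_lo)
    return crit_lo + w * (crit_hi - crit_lo)

# Assumed table entries for alpha = .05, k = 4 (verify against the table)
Q_40, Q_48 = 3.791, 3.764

linear = interpolate_crit(44, 40, Q_40, 48, Q_48, harmonic=False)
harmonic = interpolate_crit(44, 40, Q_40, 48, Q_48, harmonic=True)
print(round(linear, 4), round(harmonic, 4))
```

Under these assumed entries, linear interpolation gives the midpoint of the two rows, while harmonic interpolation lands slightly closer to the df = 48 entry.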
The critical value for differences in means is

$$q_{crit} \sqrt{\frac{MS_W}{n}} = 3.7775 \cdot \sqrt{\frac{MS_W}{n}} = 1.8046$$
Since the difference between the means for women taking the drug and women in the control group is 1.75, which is smaller than 1.8046, we conclude that the difference falls just short of significance. The following table shows the same comparisons for all pairs of variables:
Figure 1 – Pairwise tests using Tukey’s HSD for Example 1
From Figure 1 we see that the only significant difference in means is between women taking the drug and men in the control group (i.e. the pair with largest difference in means). We can also use the t-statistic to calculate the 95% confidence interval as described above. In Figure 2 we compute the confidence interval for the comparison requested in the example as well as for the variables with maximum difference.
Figure 2 – Tukey HSD confidence intervals for Example 1
Real Statistics Function: The following function is provided in the Real Statistics Resource Pack:
QCRIT(k, df, α, tails, h) = the critical value of the Studentized range q for k independent variables, the given degrees of freedom and value of alpha, and tails = 1 (one tail) or 2 (two tails, default). If h = TRUE (default) harmonic interpolation is used; otherwise linear interpolation is used.
The QCRIT function is based on the table of critical values provided in Studentized Range q Table. The Real Statistics Resource Pack also provides the following functions which provide estimates for the Studentized range distribution and its inverse based on a somewhat complicated algorithm.
QDIST(q, k, df) = the value of the Studentized range distribution at q for k independent variables and df degrees of freedom.
QINV(p, k, df, tails) = the inverse of the Studentized range distribution at p for k independent variables, df degrees of freedom and tails = 1 or 2 (default 2).
Observation: Note that the values calculated by QCRIT and QINV will be similar, at least within the range of alpha values in the table of critical values. E.g. QINV(.015,4,18,2) = 4.82444 while QCRIT(4,18,.015,2) = 4.75289.
Note that QDIST outputs a two-tailed value. E.g. QDIST(4.82444,4,18) = 0.015. To get the usual cdf value for the Studentized range distribution, you need to divide the result from QDIST by 2, which for this example is .0075, as confirmed by the fact that QINV(.0075,4,18,1) = 4.82444.
Finally note that the algorithm used to calculate QINV (and QDIST) is pretty accurate except at low values of p and df. In particular, for df = 1 and certainly when p ≤ .025, QCRIT will be more accurate than QINV (at least for those values found in the table of critical values). This is also true when df = 2 and p ≤ .01 or when df = 3 and p = .001.
Real Statistics Data Analysis Tool: The Real Statistics Resource Pack contains a Tukey’s HSD Test data analysis tool which produces output very similar to that shown in Figure 2.
For example, to produce the first test in Figure 2, use the following steps: Enter Ctrl-m and select the Analysis of Variance data analysis tool from the list. On the dialog that appears, select the Single Factor Anova option. A dialog box similar to that shown in Figure 1 of Confidence Interval for ANOVA appears. Enter A3:D15 in the Input Range, check Column headings included with data, select the Tukey HSD option and click on the OK button.
The report shown in Figure 3 now appears. We see that only MC-WD is significant, although WC-WD is close.
Figure 3 – Real Statistics Tukey HSD data analysis
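Tukey-Kramer Test
When sample sizes are unequal, the Tukey test can be modified by replacing $\frac{1}{n}$ by $\frac{1}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)$ in the above formulas, where n_i and n_j are the sizes of the two samples being compared. In particular, the standard error for the q statistic becomes

$$\text{s.e.} = \sqrt{\frac{MS_W}{2}\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$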
Note that the Real Statistics Tukey HSD data analysis tool described above actually performs the Tukey-Kramer Test when the sample sizes are unequal.
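A minimal Python sketch of the Tukey-Kramer adjustment, assuming the standard form of the standard error, sqrt((MSW/2)(1/n_i + 1/n_j)). The function names are hypothetical, and qcrit is again supplied by the caller rather than computed.

```python
from math import sqrt

def tukey_kramer_se(msw, n_i, n_j):
    """Standard error of the q statistic for two groups of sizes n_i and n_j."""
    return sqrt((msw / 2) * (1 / n_i + 1 / n_j))

def tukey_kramer_test(mean_i, mean_j, msw, n_i, n_j, q_crit):
    """Return (q, significant) for one pair of means with unequal group sizes."""
    q = abs(mean_i - mean_j) / tukey_kramer_se(msw, n_i, n_j)
    return q, q > q_crit

# With equal group sizes the standard error reduces to sqrt(MSW / n)
print(tukey_kramer_se(4.0, 16, 16))  # 0.5
```

Because the formula reduces to the Tukey HSD standard error when n_i = n_j, a single routine can serve both the equal and unequal sample size cases.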
Example 2: Analyze the data in range A3:D15 of Figure 4 using the Tukey-Kramer test to compare the population means of women taking the drug and the control group taking the placebo. This example is the same as Example 1 but with some data missing, and so there are unequal sample sizes.
Figure 4 – Data and ANOVA for Example 2
Enter Ctrl-m and select Single Factor Anova and Follow-up Tests from the menu. A dialog box similar to that shown in Figure 1 of Confidence Interval for ANOVA appears. Enter A3:D15 in the Input Range, check Column headings included with data, select the Tukey HSD option and click on OK.
The output is shown in Figure 5.
Figure 5 – Tukey-Kramer Data Analysis
Games and Howell Test
A better alternative to Tukey-Kramer when variances are unequal is Games and Howell. Here the standard error becomes

$$\text{s.e.} = \sqrt{\frac{1}{2}\left(\frac{s_i^2}{n_i} + \frac{s_j^2}{n_j}\right)}$$
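Thus we use different pooled variances for each pair instead of the same pooled variance MSW. We employ the following test

$$|\bar{x}_i - \bar{x}_j| > q'_{crit} \cdot \text{s.e.}$$

where the standard error is as above and q′crit is the critical value of the Studentized range statistic but with the degrees of freedom given by df′ as defined in Two Sample t-Test with Unequal Variances, namely

$$df' = \frac{\left(\dfrac{s_i^2}{n_i} + \dfrac{s_j^2}{n_j}\right)^2}{\dfrac{(s_i^2/n_i)^2}{n_i - 1} + \dfrac{(s_j^2/n_j)^2}{n_j - 1}}$$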
In this way we also take care of the case where the variances are unequal in exactly the same manner as in Theorem 1 of Two Sample t-Test with Unequal Variances, except that we now use the q-statistic instead of the t-statistic. Note that the supplemental function DF_POOLED can be used to calculate df′.
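The two Games-Howell ingredients can be sketched in Python as follows, assuming the standard form of the standard error and the usual unequal-variance (Welch-Satterthwaite) degrees of freedom. The function names are hypothetical; `welch_df` is not claimed to reproduce DF_POOLED exactly.

```python
from math import sqrt

def games_howell_se(var_i, n_i, var_j, n_j):
    """Standard error of the q statistic using each group's own sample variance."""
    return sqrt(0.5 * (var_i / n_i + var_j / n_j))

def welch_df(var_i, n_i, var_j, n_j):
    """Degrees of freedom df' for two samples with unequal variances."""
    a, b = var_i / n_i, var_j / n_j
    return (a + b) ** 2 / (a ** 2 / (n_i - 1) + b ** 2 / (n_j - 1))

# Hypothetical sample variances and sizes
print(games_howell_se(4.0, 10, 9.0, 12), welch_df(4.0, 10, 9.0, 12))
```

The resulting df′ is generally not an integer, so the critical value q′crit has to be interpolated from the table or obtained from software.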
Example 3: Repeat Example 2 using the Games-Howell test.
We repeat the same steps as we used in Example 2 except that we choose the Games-Howell option. The output of the test is shown in Figure 6.
Figure 6 – Games-Howell Data Analysis
REGWQ Test
The Ryan, Einot, Gabriel, Welsch Studentized Range Q (REGWQ) test uses what is known as a step-down approach. No confidence intervals are calculated.
First, the equality of all of the means is tested at the αk level. If the null hypothesis of equality of means is rejected, then each subset of k – 1 means is tested at the αk-1 level; otherwise, the procedure stops.
In general, if the hypothesis of equality of a set of p means is rejected at the αp level, then each subset of p – 1 means is tested at the αp-1 level; otherwise, the set of p means is considered not to differ significantly and none of its subsets is tested. One continues in this manner until no subsets remain to be tested.
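First we arrange the sample means in descending order x̄1 ≥ x̄2 ≥ … ≥ x̄k. At each stage p, we define αp as follows:

$$\alpha_p = \begin{cases} 1 - (1 - \alpha)^{p/k} & \text{if } p < k - 1 \\ \alpha & \text{if } p \ge k - 1 \end{cases}$$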
Now, the equality of the means x̄i, …, x̄j where p = j – i + 1 is rejected when

$$\bar{x}_i - \bar{x}_j > q_{crit}(\alpha_p, p, df_W) \sqrt{\frac{MS_W}{n}}$$

as for Tukey’s HSD test, except that the αp are defined as above. The only problem with this approach is that qcrit values are required for values of α that are usually not found in the Studentized Range q Table. To partially address this issue we provide a second table of values for q (also in Studentized Range q Table) for values of α = .025, .005 and .001. The best we can do for values not in the table is to interpolate.
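The stage-dependent alpha levels are easy to compute. The sketch below assumes the standard REGWQ definition, in which αp = 1 – (1 – α)^(p/k) for p < k – 1 and αp = α for p = k – 1 and p = k; the function name `regwq_alpha` is hypothetical.

```python
def regwq_alpha(p, k, alpha=0.05):
    """Alpha level for testing a set of p means out of k under REGWQ."""
    if p >= k - 1:
        return alpha
    return 1 - (1 - alpha) ** (p / k)

# Alpha levels for k = 4 groups and p = 2, 3, 4
for p in (2, 3, 4):
    print(p, round(regwq_alpha(p, 4), 5))
```

For k = 4 only the p = 2 stage uses a reduced alpha (about .0253), which illustrates why critical values are needed for alpha levels that standard tables rarely list.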
Example 4: Apply the REGWQ test to the data in Example 1 of Confidence Interval for ANOVA.
Figure 7 – REGWQ for Example 4
There are three tables in Figure 7. The table in the upper left hand side of the figure consists of the variables sorted from highest to lowest mean. We will refer to the index (i.e. rank) of these variables, where i = 1 refers to the variable with highest mean (i.e. women who take the drug) and i = 4 refers to the variable with lowest mean (i.e. men in the control).
The table in the upper right side of the figure is used to determine the critical values of mean differences. These are based on the possible values of p, namely 2, 3 and 4. For each p, first we determine the required values of αp (range R8:T8) as described above. For each p we then find the qcrit values from the Studentized Range q Table (interpolating as necessary via rows 10 and 11 in Figure 7) corresponding to p, df = 44 and αp (range R12:T12). We multiply these values by

$$\sqrt{\frac{MS_W}{n}}$$

to get the critical values for the mean differences corresponding to p = 2, 3 and 4 (range R13:T13).
The final table iterates over all possible pairs of variables (given by the indices i and j, with i < j, which point to the entries in the first table). For each pair the value of p is calculated as j – i + 1 (row 17). The difference of the means corresponding to indices i and j is calculated in row 20. Each mean difference is then compared with the critical value calculated in the second table for the value of p in that column. The pairs whose mean difference is larger than the critical value are significant.
For our example, only the mean differences in the first two columns are significant. This means that there is a significant difference between women who take the drug and men in the control (first column) and there is also a significant difference between women who take the drug and women in the control.
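The pair-by-pair comparison in the final table can be sketched in Python. This is an illustration with hypothetical group labels, means and critical differences (not the values in Figure 7), using p = j – i + 1 for means sorted in descending order. Note that it compares every pair independently and does not enforce the stopping rule under which subsets of a non-rejected set are left untested.

```python
from itertools import combinations

def regwq_pairs(sorted_means, crit_diff):
    """Compare each pair of means with the critical difference for its p.

    sorted_means: list of (label, mean) in descending order of mean.
    crit_diff: dict mapping p to the critical mean difference for that p.
    """
    rows = []
    for i, j in combinations(range(1, len(sorted_means) + 1), 2):
        p = j - i + 1                                   # number of means spanned
        label_i, mean_i = sorted_means[i - 1]
        label_j, mean_j = sorted_means[j - 1]
        diff = mean_i - mean_j
        rows.append((label_i, label_j, p, diff, diff > crit_diff[p]))
    return rows

# Hypothetical sorted means and critical differences for p = 2, 3, 4
means = [("A", 9.0), ("B", 7.5), ("C", 7.0), ("D", 5.0)]
crit = {2: 1.6, 3: 1.9, 4: 2.0}
for row in regwq_pairs(means, crit):
    print(row)
```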
Dunnett’s Test
Dunnett’s test is used when we want to compare one group (usually the control treatment) with the other groups. In this case, Dunnett’s test is more powerful than the other tests described on this webpage.
The test is similar to Tukey’s HSD, except that instead of testing

$$\bar{x}_{max} - \bar{x}_{min} > q_{crit} \sqrt{\frac{MS_W}{n}}$$

we test whether

$$|\bar{x}_j - \bar{x}_0| > t_d \sqrt{\frac{2\,MS_W}{n}}$$

where n = size of the group samples, x̄0 = mean of the control group, x̄j is the mean of any other group and td is the (two-tailed) Dunnett’s critical value given in the Dunnett’s Table. The table contains the values td(k, dfW, α) where k = the number of groups (treatments) including the control.
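The test can also be used when the group sizes are not equal via

$$|\bar{x}_j - \bar{x}_0| > t_d \sqrt{MS_W \left(\frac{1}{n_0} + \frac{1}{n_j}\right)}$$

where n0 is the size of the control group and nj is the size of the group being compared, although, since Dunnett’s Table is based on equal group sizes, the above formula is only accurate if the group sizes are not too different.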
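A minimal Python sketch of the comparison with the control, assuming the standard error sqrt(MSW(1/n0 + 1/nj)), which reduces to sqrt(2·MSW/n) for equal group sizes. The function name and the means and MSW below are hypothetical; the critical value td must be looked up separately (e.g. in Dunnett’s Table).

```python
from math import sqrt

def dunnett_compare(mean_control, mean_j, msw, n_control, n_j, t_d):
    """Return (t, significant) for one group compared with the control."""
    se = sqrt(msw * (1 / n_control + 1 / n_j))
    t = abs(mean_j - mean_control) / se
    return t, t > t_d

# Hypothetical means and MSW; t_d is supplied by the caller
t, significant = dunnett_compare(mean_control=7.0, mean_j=10.0, msw=4.0,
                                 n_control=8, n_j=8, t_d=2.483)
print(t, significant)  # 3.0 True
```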
Real Statistics Function: The following function is provided in the Real Statistics Resource Pack:
DCRIT(k, df, α, h) = the critical value td for k independent variables, the given degrees of freedom and value of alpha. If h = TRUE (default) harmonic interpolation is used; otherwise linear interpolation is used.
Real Statistics Data Analysis Tool: We now show how to use the Dunnett’s Test data analysis tool to address Example 5.
Example 5: Assuming that Method 1 is the control group in Example 1, compare the mean of this method with the means of the other methods using Dunnett’s test. The data is repeated in Figure 8.
Enter Ctrl-m, select Analysis of Variance and press the OK button. Next select Single Factor Anova from the dialog box that appears. A dialog box similar to that shown in Figure 1 of Confidence Interval for ANOVA appears. Enter A3:D11 in the Input Range, check Column headings included with data, select the Dunnett’s Test option and click on the OK button.
The output is similar to that shown in Figure 8, although initially the contrast coefficients in range G4:G7 are blank. You need to enter a +1 and -1 in two of these cells, one of which must correspond to the control group (Method 1 in this example). The output for the comparison of Methods 1 and 3 is shown in Figure 8.
Figure 8 – Dunnett’s Test
We see that there is a significant difference between Method 1 and 3 (cell L11). We can of course compare Method 1 with Method 2 or Method 4 by simply changing where the value for the +1 contrast is placed.
Note that cell I11 contains the formula =DCRIT(COUNT(I4:I7),H11,L2). Thus the critical value in Dunnett’s Table when k = 4, df = 28 and α = .05 is 2.483.
Scheffé’s Test
To carry out Scheffé’s test use the following steps:
- Calculate the planned comparison t-test
- Square the t-statistic to get F (since F = t²)
- Find the critical value of F with dfB, dfW degrees of freedom for the given value of α and multiply it by dfB. Thus the critical value is dfB * FINV(α, dfB, dfW).
- If F > the critical value then reject the null hypothesis
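The steps above can be sketched in Python as follows. The function name `scheffe_test` is hypothetical, and the F critical value is passed in by the caller (for instance from FINV(α, dfB, dfW) in Excel), since the Python standard library has no inverse F distribution. The numbers below are illustrative only.

```python
def scheffe_test(t_stat, df_between, f_crit):
    """Return (F, critical value, reject) for Scheffe's test of one contrast.

    t_stat: t statistic of the planned comparison.
    f_crit: critical value of F with (dfB, dfW) degrees of freedom at level alpha.
    """
    f_stat = t_stat ** 2                    # F = t^2
    critical_value = df_between * f_crit    # dfB * F critical value
    return f_stat, critical_value, f_stat > critical_value

# Illustrative values: t = 3.0, dfB = 3, assumed F critical value of 2.95
print(scheffe_test(3.0, 3, 2.95))
```

Multiplying the F critical value by dfB is what makes the test conservative: a contrast must clear a much higher bar than it would as a planned comparison.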
Since Scheffé’s test is very conservative, it is not recommended for pairwise comparisons where better tests are available, but it can be useful for more complicated comparisons where the other unplanned tests don’t apply.
Real Statistics Data Analysis Tool: Scheffé’s test can be carried out using the Single Factor Anova and Follow-up Tests data analysis tool provided by the Real Statistics Resource Pack. We show how this is done in the following example.
Example 6: Carry out the test in Example 2 from Planned Comparisons for ANOVA using Scheffé’s test.
The steps are very similar to those used for Tukey’s HSD, as described above, namely enter Ctrl-m and select Single Factor Anova and Follow-up Tests from the menu. A dialog box similar to that shown in Figure 1 of Confidence Interval for ANOVA will appear. Enter I21:L29 in the Input Range, check Column headings included with data, select the Scheffe option and click on OK.
A report similar to that shown in Figure 9 will appear but with no numbers in the shaded range O23:O26. You must now enter the contrast values. Setting the contrast coefficients for Methods 1 and 2 to -0.5 and the contrast coefficient for Method 4 to 1, we get the output shown in Figure 9.
Figure 9 – Scheffé data analysis for Example 6
This figure shows there is no significant difference between Method 4 and the average of Methods 1 and 2.