In an unbalanced ANOVA the sample sizes for the various cells are unequal. Provided the cell sizes are not too different, this is not a big problem for one-way ANOVA, but for factorial ANOVA, the approaches described in Factorial ANOVA are generally not adequate. In these cases the regression approach described in ANOVA using Regression can be used instead.
Usually when conducting a study, the intention is to create groups of equal size, but it is often difficult to maintain such equality, perhaps because one or more subjects pull out of the study at the last minute or for some other reason. The technique we now review is appropriate when any differences in cell size are due to random factors.
If, for example, the cells correspond to questions in a survey and some cell has fewer entries because many people were offended by the question, or found it ambiguous or too difficult, and so didn’t answer it, then the difference in cell size is not random, and so the approach given here is not applicable. In this case the survey should be redesigned, unless non-response is itself being tested for, so that a non-answer can be treated as a type of response, in which case we can use a balanced model.
Example 1: Perform ANOVA for the situation in Example 2 of ANOVA using Regression on the sample data in the table on the left side of Figure 1 using multiple regression.
Figure 1 – Data for Example 1 plus coding of dichotomous variables
We also provide a coding for the data in Figure 1. As you can see, the cells are unequal in size. Since we assume that any differences are due to random factors, we would like to treat each cell as having equal weight. In such a case, we are better off taking each row mean to be the simple average of the cell means in that row, and similarly for the columns. Thus we have the following modified means:
Figure 2 – Means with equal weights
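The difference between equally weighted means and the usual pooled means can be sketched in a few lines of Python. The data below are hypothetical (they are not the values in Figure 1), and the labels "Y" and "Third" are placeholders for the second blend and the third crop.

```python
from statistics import mean

# Hypothetical unbalanced data: (blend, crop) -> observed values.
# "Y" and "Third" are placeholder labels, not taken from Figure 1.
cells = {
    ("X", "Corn"): [10, 12],
    ("X", "Soy"): [20, 22, 24],
    ("X", "Third"): [15],
    ("Y", "Corn"): [14, 16, 18],
    ("Y", "Soy"): [13, 15],
    ("Y", "Third"): [21, 23, 25, 27],
}

# Mean of each cell, regardless of how many observations it holds.
cell_means = {key: mean(values) for key, values in cells.items()}


def equal_weight_mean(position, level):
    """Average the cell means for one factor level (0 = blend, 1 = crop)."""
    return mean(m for key, m in cell_means.items() if key[position] == level)


def pooled_mean(position, level):
    """Average the raw observations for one level, so larger cells count more."""
    return mean(
        v
        for key, values in cells.items()
        if key[position] == level
        for v in values
    )


# Equally weighted grand mean: the average of the six cell means.
grand_mean = mean(cell_means.values())

print("Blend X, equal weights:", equal_weight_mean(0, "X"))
print("Blend X, pooled:       ", pooled_mean(0, "X"))
print("Grand mean, equal weights:", grand_mean)
```

With equal cell sizes the two kinds of mean coincide; they differ only when the design is unbalanced.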
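The regression model takes the form y = β0 + β1 t1 + β2 t2 + β3 t3 + β4 (t1 * t2) + β5 (t1 * t3) + ε, where t1, t2 and t3 are the coded variables from Figure 1.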
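Using the same approach as in Example 1 of ANOVA using Regression, ignoring the error term, we obtain six equations, one for each cell, that express the population mean of the cell in terms of the coefficients.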
Since the population grand mean is the sum of the terms on the left sides of the above six equations divided by 6, we see that the grand mean is the sum of the terms on the right side of the above equations divided by 6. This turns out to be β0, i.e. β0 = μ. Similarly, averaging the equations for the three Blend X cells gives μX = β0 + β1.
Thus we conclude that β1 = μX – μ.
Since averaging the equations for the two Corn cells gives μCorn = β0 + β2, it follows that β2 = μCorn – μ. Similarly, β3 = μSoy – μ.
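The derivation can be written out explicitly. The following is a sketch which assumes that Figure 1 uses effect coding: t1 = 1 for Blend X and –1 for the other blend (written Y here), t2 = 1 for Corn, t3 = 1 for Soy, t2 = t3 = –1 for the remaining crop (written R here), and 0 otherwise. The labels Y and R are placeholders rather than names taken from the example.

```latex
\begin{aligned}
\mu_{X,\mathrm{Corn}} &= \beta_0 + \beta_1 + \beta_2 + \beta_4 \\
\mu_{X,\mathrm{Soy}}  &= \beta_0 + \beta_1 + \beta_3 + \beta_5 \\
\mu_{X,R}             &= \beta_0 + \beta_1 - \beta_2 - \beta_3 - \beta_4 - \beta_5 \\
\mu_{Y,\mathrm{Corn}} &= \beta_0 - \beta_1 + \beta_2 - \beta_4 \\
\mu_{Y,\mathrm{Soy}}  &= \beta_0 - \beta_1 + \beta_3 - \beta_5 \\
\mu_{Y,R}             &= \beta_0 - \beta_1 - \beta_2 - \beta_3 + \beta_4 + \beta_5 \\[6pt]
\mu &= \tfrac{1}{6}\bigl(\mu_{X,\mathrm{Corn}} + \mu_{X,\mathrm{Soy}} + \mu_{X,R}
      + \mu_{Y,\mathrm{Corn}} + \mu_{Y,\mathrm{Soy}} + \mu_{Y,R}\bigr) = \beta_0 \\
\mu_X &= \tfrac{1}{3}\bigl(\mu_{X,\mathrm{Corn}} + \mu_{X,\mathrm{Soy}} + \mu_{X,R}\bigr)
       = \beta_0 + \beta_1 \\
\mu_{\mathrm{Corn}} &= \tfrac{1}{2}\bigl(\mu_{X,\mathrm{Corn}} + \mu_{Y,\mathrm{Corn}}\bigr)
       = \beta_0 + \beta_2 \\
\mu_{\mathrm{Soy}} &= \tfrac{1}{2}\bigl(\mu_{X,\mathrm{Soy}} + \mu_{Y,\mathrm{Soy}}\bigr)
       = \beta_0 + \beta_3 \\
\mu_{X,\mathrm{Corn}} - \mu_X - \mu_{\mathrm{Corn}} + \mu &= \beta_4
\end{aligned}
```

Under this assumed coding every coefficient other than β0 cancels when the six cell equations are summed, which is why the equally weighted grand mean equals β0.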
We summarize these results as follows:
- β0 = population grand mean
- β1 = population effect for Blend X group (i.e. group mean – grand mean)
- β2 = population effect for Corn group
- β3 = population effect for Soy group
- β4 = population effect for Blend X × Corn interaction (i.e. Blend X × Corn mean – Blend X group mean – Corn group mean + population grand mean)
- β5 = population effect for Blend X × Soy interaction
The coefficient table from Excel’s Regression data analysis tool is given in Figure 3.
Figure 3 – Coefficients from regression analysis for Example 1
Note that these coefficients can be obtained using the equally weighted means from Figure 2 as follows:
- β0 = sample grand mean = 152.62
- β1 = sample Blend X group mean – grand mean = 149.2 – 152.62 = -3.42
- β2 = sample Corn group mean – grand mean = 147.5 – 152.62 = -5.12
- β3 = sample Soy group mean – grand mean = 157.38 – 152.62 = 4.76
- β4 = Blend X × Corn mean – Blend X group mean – Corn group mean + grand mean = 135.4 – 149.2 – 147.5 + 152.62 = -8.68
- β5 = Blend X × Soy mean – Blend X group mean – Soy group mean + grand mean = 171 – 149.2 – 157.38 + 152.62 = 17.04
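The relationship between the regression coefficients and the equally weighted means can be checked numerically. The sketch below uses hypothetical data (not the values in Figure 1), assumes effect coding for t1, t2 and t3, and fits the full model with a small hand-written least-squares routine; the labels "Y" and "Third" and the helper names `solve` and `ols` are illustrative only.

```python
# Hypothetical unbalanced data: (blend, crop) -> observed values.
# "Y" and "Third" are placeholder labels, not taken from Figure 1.
cells = {
    ("X", "Corn"): [10, 12],
    ("X", "Soy"): [20, 22, 24],
    ("X", "Third"): [15],
    ("Y", "Corn"): [14, 16, 18],
    ("Y", "Soy"): [13, 15],
    ("Y", "Third"): [21, 23, 25, 27],
}

# Assumed effect coding for the dichotomous variables t1, t2 and t3.
T1 = {"X": 1, "Y": -1}
T2 = {"Corn": 1, "Soy": 0, "Third": -1}
T3 = {"Corn": 0, "Soy": 1, "Third": -1}


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# One design row per observation: intercept, t1, t2, t3, t1*t2, t1*t3.
rows, y = [], []
for (blend, crop), values in cells.items():
    t1, t2, t3 = T1[blend], T2[crop], T3[crop]
    for value in values:
        rows.append([1, t1, t2, t3, t1 * t2, t1 * t3])
        y.append(value)

coefficients = ols(rows, y)
for name, b in zip(["b0", "b1", "b2", "b3", "b4", "b5"], coefficients):
    print(f"{name} = {b:.4f}")
```

For this hypothetical data the cell means are 11, 22, 15, 16, 14 and 24, so the equally weighted grand mean is 17, the Blend X mean is 16 and the Corn mean is 13.5. The fitted coefficients should therefore be 17, –1, –3.5, 1, –1.5 and 5, even though the cells differ in size.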
To create the ANOVA we proceed in a manner similar to what we did in Example 2 of ANOVA using Regression. The output from the Regression data analysis tool for the full model, i.e. α + β + αβ, where α, β and αβ (= δ) are as defined in Definition 1 of Two Factor ANOVA with Replication, is as follows:
Figure 4 – Regression analysis for the complete model
We now run the regression analysis with only t1, t2 and t3 (no interaction terms) to obtain the α + β model, and then run the analysis with t1, t1 * t2 and t1 * t3 to obtain the α + αβ model. Finally we run the analysis with t2, t3, t1 * t2 and t1 * t3 to obtain the β + αβ model. Although we don’t show the output for each of these models, we summarize the key results in the upper portion of Figure 5.
But A = (α + β + αβ) – (β + αβ), B = (α + β + αβ) – (α + αβ) and AB = (α + β + αβ) – (α + β), and so we obtain the values for the SSReg, dfReg and R Square for A, B and AB (the lower part of Figure 5) from the values in the upper part of Figure 5.
Figure 5 – Regression analysis for partial models
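For example, SSReg for A is the SSReg for the full model α + β + αβ minus the SSReg for the β + αβ model, and similarly for the other terms in the table above.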
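The model-comparison step can be sketched as follows. The data are hypothetical (not the values in Figure 1), the effect coding of t1, t2 and t3 is an assumption, and `solve` and `ss_reg` are illustrative helper names. Each reduced model drops the terms for one effect, and the sum of squares for that effect is the resulting drop in SSReg.

```python
# Hypothetical unbalanced data: (blend, crop) -> observed values.
# "Y" and "Third" are placeholder labels, not taken from Figure 1.
cells = {
    ("X", "Corn"): [10, 12],
    ("X", "Soy"): [20, 22, 24],
    ("X", "Third"): [15],
    ("Y", "Corn"): [14, 16, 18],
    ("Y", "Soy"): [13, 15],
    ("Y", "Third"): [21, 23, 25, 27],
}

# Assumed effect coding for the dichotomous variables.
T1 = {"X": 1, "Y": -1}
T2 = {"Corn": 1, "Soy": 0, "Third": -1}
T3 = {"Corn": 0, "Soy": 1, "Third": -1}

# One record per observation: named predictors plus the response.
data = []
for (blend, crop), values in cells.items():
    t1, t2, t3 = T1[blend], T2[crop], T3[crop]
    x = {"t1": t1, "t2": t2, "t3": t3, "t1*t2": t1 * t2, "t1*t3": t1 * t3}
    data.extend((x, value) for value in values)

y = [value for _, value in data]
y_bar = sum(y) / len(y)
ss_total = sum((v - y_bar) ** 2 for v in y)


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ss_reg(terms):
    """SSReg for the regression model with an intercept plus the given terms."""
    rows = [[1] + [x[t] for t in terms] for x, _ in data]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return ss_total - ss_res


full = ss_reg(["t1", "t2", "t3", "t1*t2", "t1*t3"])      # alpha + beta + alpha*beta
ss_a = full - ss_reg(["t2", "t3", "t1*t2", "t1*t3"])     # minus beta + alpha*beta
ss_b = full - ss_reg(["t1", "t1*t2", "t1*t3"])           # minus alpha + alpha*beta
ss_ab = full - ss_reg(["t1", "t2", "t3"])                # minus alpha + beta
ss_w = ss_total - full                                   # SSRes of the full model

print(f"SSA = {ss_a:.4f}, SSB = {ss_b:.4f}, SSAB = {ss_ab:.4f}, SSW = {ss_w:.4f}")
```

The degrees of freedom follow the same subtraction pattern, using the number of predictors in each model in place of SSReg.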
Using these facts, we can obtain the output for ANOVA as shown in Figure 6. The values for Rows (A), Columns (B) and Interaction (AB) come from Figure 5. The Within values come from the values for SSRes, dfRes, MSRes in the complete regression model (Figure 4). The Total values also come from the values in Figure 4.
Figure 6 – ANOVA output for Example 1
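Assembling the ANOVA table from the sums of squares and degrees of freedom can be sketched as below. The input values are hypothetical and are not those in Figure 6; `anova_table` is an illustrative helper name.

```python
def anova_table(ss, df):
    """Return {source: (SS, df, MS, F)}, where F = MS / MS for Within.

    F is None for the Within row, since it serves as the error term.
    """
    ms_within = ss["Within"] / df["Within"]
    table = {}
    for source in ss:
        ms = ss[source] / df[source]
        f = None if source == "Within" else ms / ms_within
        table[source] = (ss[source], df[source], ms, f)
    return table


# Hypothetical sums of squares and degrees of freedom for a 2 x 3 design
# with 15 observations (dfW = 15 - 6 = 9). These are NOT the Figure 6 values.
ss = {"Rows (A)": 30.0, "Columns (B)": 80.0, "Interaction (AB)": 120.0, "Within": 90.0}
df = {"Rows (A)": 1, "Columns (B)": 2, "Interaction (AB)": 2, "Within": 9}

table = anova_table(ss, df)
for source, (ss_value, df_value, ms, f) in table.items():
    f_text = "" if f is None else f"{f:.2f}"
    print(f"{source:18s} {ss_value:8.2f} {df_value:3d} {ms:8.2f} {f_text:>6s}")
```

The p-value for each effect then comes from the right tail of the F distribution with the effect's df and the Within df, for example via Excel's F.DIST.RT function.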
Note that SSA + SSB + SSAB + SSW = 14513.90 < 14702.07 = SST, since the above model doesn’t quite account for all the variation. While in the balanced model A, B and AB partition the total variation, in the case of unbalanced models A, B and AB overlap. The model we have given leaves out the overlap, but corresponds to the approach of equally weighting the cell means.
From the analysis we see that there are no significant differences between the Blend means or between the Crop means, but there is a significant interaction. We can now perform the usual comparison t-tests to further investigate these differences.
Real Statistics Data Analysis Tool: The Two Factor ANOVA data analysis tool provided by the Real Statistics Resource Pack contains a Regression option which automates the above procedure. If the input data is in Excel Two Factor ANOVA format, the data is first converted to standard format, and then the appropriate regression models are generated.
For example, to perform the analysis for Example 1, click on cell F1 (where the output will start), press Ctrl-m and select the Two Factor ANOVA option from the menu that appears.
Figure 7 – Dialog box for unbalanced ANOVA models
When the dialog box in Figure 7 appears, enter A4:D14 in the Input Range, click on Column/row headings included with data, select Standard format as the Input Format, select Regression as the Analysis Type and click on the OK button. The output is shown in Figures 8 and 9.
Figure 8 – Unbalanced Two Factor ANOVA (part 1)
The first step in the analysis is that the input data is converted to standard format. The usual descriptive statistics are then calculated and the regression form of analysis of variance is computed. Finally the rows and columns of the original input data are exchanged to facilitate follow-up tests.
If the input data had been in standard format the analysis would have proceeded in a similar manner, although no conversions would have been performed.
Observation: The approach described in this section requires that every interaction cell (i.e. every combination of a row level and a column level) contain at least one element. E.g. in Figure 8, if one of the cells in the range K5:M6 contains a zero value, then the output from the analysis will be in error.
Observation: When the Regression option of the Two Factor ANOVA data analysis tool is chosen you are limited to 64 independent variables (i.e. the same limitation as the Linear Regression data analysis tool described in Multiple Regression Analysis). This means that if a = the number of levels for factor A and b = the number of levels for factor B, then ab can be at most 64.
Observation: The approach described here for two factor ANOVA can be extended to ANOVA with more than two factors. In Three Factor ANOVA using Regression we show how this is done.