Unbalanced Factorial ANOVA

In an unbalanced ANOVA the sample sizes for the various cells are unequal. Provided the cell sizes are not too different, this is not a big problem for one-way ANOVA, but for factorial ANOVA, the approaches described in Factorial ANOVA are generally not adequate. In these cases the regression approach described in ANOVA using Regression can be used instead.

Usually when conducting a study, the intention is to create groups of equal size, but it is often difficult to maintain such equality, perhaps because one or more subjects pull out of the study at the last minute or for some other reason. The technique we now review is appropriate when any differences in cell size are due to random factors.

If, for example, the cells correspond to questions in a survey and some cell has fewer entries because many people were offended by the question, or found it ambiguous or too difficult, and so didn’t answer it, then the difference in cell size is not random and the approach given here is not applicable. In this case the survey should be redesigned, unless non-response is itself being tested for, so that a non-answer can be treated as a type of response, in which case we can use a balanced model.

Example 1: Perform ANOVA for the situation in Example 2 of ANOVA using Regression on the sample data in the table on the left side of Figure 1 using multiple regression.

Figure 1 – Data for Example 1 plus coding of dichotomous variables

We also provide a coding for the data in Figure 1. As you can see, the cells are unequal in size. Since we assume that any differences are due to random factors, we would like to treat each cell as having equal weight. In that case, we are better off taking each row mean to be simply the average of the cell means in that row, and similarly for the columns. Thus we have the following modified means:

Figure 2 – Means with equal weights
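
To make the equal-weighting idea concrete, here is a minimal Python sketch. The 2 × 3 layout, the level names A1, A2, B1, B2, B3 and all the values are made up for illustration; they are not the data in Figure 1. The sketch contrasts the equally weighted row mean (the average of the cell means) with the ordinary mean of all the observations in the row.

```python
from statistics import mean

# Hypothetical unbalanced 2 x 3 layout: (row level, column level) -> observations.
# These values are invented for illustration and are NOT the Figure 1 data.
cells = {
    ("A1", "B1"): [10, 12, 14],
    ("A1", "B2"): [20, 22],
    ("A1", "B3"): [30],
    ("A2", "B1"): [16, 18],
    ("A2", "B2"): [24, 26, 28, 30],
    ("A2", "B3"): [36, 38],
}

cols = sorted({c for _, c in cells})
cell_mean = {key: mean(obs) for key, obs in cells.items()}


def unweighted_row_mean(r):
    """Average of the cell means in row r (every cell counts equally)."""
    return mean(cell_mean[(r, c)] for c in cols)


def weighted_row_mean(r):
    """Ordinary mean of all observations in row r (larger cells count more)."""
    return mean(x for c in cols for x in cells[(r, c)])


# Equally weighted grand mean: the average of the six cell means
unweighted_grand = mean(cell_mean.values())

print(unweighted_row_mean("A1"), weighted_row_mean("A1"), unweighted_grand)
```

In this made-up layout the equally weighted mean for row A1 is 21 while the ordinary mean of its observations is 18, because the largest cell in that row pulls the ordinary mean towards its own value.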

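The regression model takes the form

y = β0 + β1 t1 + β2 t2 + β3 t3 + β4 t1 t2 + β5 t1 t3 + ε

where t1, t2 and t3 are the coding variables from Figure 1. Using the same approach as in Example 1 of ANOVA using Regression, ignoring the error term, we obtain six equations, one for each cell, each expressing that cell’s population mean in terms of β0, …, β5. For example,

μX,Corn = β0 + β1 + β2 + β4
μX,Soy = β0 + β1 + β3 + β5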

Since the population grand mean is the sum of the terms on the left sides of the above six equations divided by 6, we see that the grand mean is the sum of the terms on the right side of the above equations divided by 6. This turns out to be β0, i.e. β0 = μ. Similarly,

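averaging the equations for the three Blend X cells gives

μX = β0 + β1

and so β1 = μX – β0. Thus we conclude that β1 = μX – μ.

In the same way, averaging the equations for the two Corn cells gives μCorn = β0 + β2, from which it follows that β2 = μCorn – μ. Similarly, β3 = μSoy – μ.

Finally, since the equation for the Blend X × Corn cell is μX,Corn = β0 + β1 + β2 + β4, we have

β4 = μX,Corn – μX – μCorn + μ

and similarly

β5 = μX,Soy – μX – μSoy + μ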

We summarize these results as follows:

  • β0 = population grand mean
  • β1 = population effect for Blend X group (i.e. group mean – grand mean)
  • β2 = population effect for Corn group
  • β3 = population effect for Soy group
  • β4 = population effect for Blend X × Corn interaction (i.e. Blend X × Corn mean – Blend X group mean – Corn group mean + population grand mean)
  • β5 = population effect for Blend X × Soy interaction
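
The summary above can be checked numerically with a short Python sketch. It assumes effect coding (t1 = 1 for Blend X and –1 for the other blend; (t2, t3) = (1, 0) for Corn, (0, 1) for Soy and (–1, –1) for the third crop), which is an assumption here since the actual coding is in Figure 1. The cell means and the placeholder level names B2 and C3 are hypothetical.

```python
from statistics import mean

# ASSUMED effect coding (the real coding is in Figure 1, not reproduced here).
# "B2" and "C3" are placeholder names for the second blend and the third crop.
BLEND = {"X": 1, "B2": -1}
CROP = {"Corn": (1, 0), "Soy": (0, 1), "C3": (-1, -1)}

# Hypothetical cell means, invented for illustration
cell = {
    ("X", "Corn"): 12.0, ("X", "Soy"): 21.0, ("X", "C3"): 30.0,
    ("B2", "Corn"): 17.0, ("B2", "Soy"): 27.0, ("B2", "C3"): 37.0,
}

# Equally weighted grand, row and column means
grand = mean(cell.values())
blend_mean = {b: mean(cell[(b, c)] for c in CROP) for b in BLEND}
crop_mean = {c: mean(cell[(b, c)] for b in BLEND) for c in CROP}

# Coefficients computed exactly as in the summary list
beta = [
    grand,
    blend_mean["X"] - grand,
    crop_mean["Corn"] - grand,
    crop_mean["Soy"] - grand,
    cell[("X", "Corn")] - blend_mean["X"] - crop_mean["Corn"] + grand,
    cell[("X", "Soy")] - blend_mean["X"] - crop_mean["Soy"] + grand,
]


def design_row(blend, crop):
    """Multipliers of (b0, ..., b5) for one cell under the assumed coding."""
    t1 = BLEND[blend]
    t2, t3 = CROP[crop]
    return (1, t1, t2, t3, t1 * t2, t1 * t3)


# Plugging the coefficients back into the model should reproduce every cell mean
rebuilt = {key: sum(b * t for b, t in zip(beta, design_row(*key))) for key in cell}
print(beta)
print(rebuilt)
```

Under this assumed coding, all six cell means are recovered from the six coefficients, including the four cells that do not appear explicitly in the summary list.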

The coefficient table from Excel’s Regression data analysis tool is given in Figure 3.

Figure 3 – Coefficients from regression analysis for Example 1

Note that these coefficients can be obtained using the equally weighted means from Figure 2 as follows:

  • β0 = sample grand mean = 152.62
  • β1 = sample Blend X group mean – grand mean = 149.2 – 152.62 = -3.42
  • β2 = sample Corn group mean – grand mean = 147.5 – 152.62 = -5.12
  • β3 = sample Soy group mean – grand mean = 157.38 – 152.62 = 4.76
  • β4 = Blend X × Corn mean – Blend X group mean – Corn group mean + grand mean = 135.4 – 149.2 – 147.5 + 152.62 = -8.68
  • β5 = Blend X × Soy mean – Blend X group mean – Soy group mean + grand mean = 171 – 149.2 – 157.38 + 152.62 = 17.04
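
The arithmetic in this list can be reproduced with a few lines of Python. The six means are the values quoted above; the variable names are just illustrative.

```python
# Equally weighted sample means quoted in the list above (from Figure 2)
grand = 152.62
blend_x = 149.2
corn, soy = 147.5, 157.38
x_corn, x_soy = 135.4, 171.0

b0 = grand
b1 = blend_x - grand                   # Blend X effect
b2 = corn - grand                      # Corn effect
b3 = soy - grand                       # Soy effect
b4 = x_corn - blend_x - corn + grand   # Blend X x Corn interaction
b5 = x_soy - blend_x - soy + grand     # Blend X x Soy interaction

coefficients = [round(b, 2) for b in (b0, b1, b2, b3, b4, b5)]
print(coefficients)  # [152.62, -3.42, -5.12, 4.76, -8.68, 17.04]
```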

To create the ANOVA we proceed in a manner similar to what we did in Example 2 of ANOVA using Regression. The output from the Regression data analysis tool for the full model, i.e. α + β + αβ, where α, β and αβ (= δ) are as defined in Definition 1 of Two Factor ANOVA with Replication, is as follows:

Figure 4 – Regression analysis for the complete model

We now run the regression analysis with only t1, t2 and t3 (no interaction terms) to obtain the α + β model, and then run the analysis with t1, t1 * t2 and t1 * t3 to obtain the α + αβ model. Finally, we run the analysis with t2, t3, t1 * t2 and t1 * t3 to obtain the β + αβ model. Although we don’t show the output for each of these models, we summarize the key results in the upper portion of Figure 5.

But A = (α + β + αβ) – (β + αβ), B = (α + β + αβ) – (α + αβ) and AB = (α + β + αβ) – (α + β), and so we obtain the values for the SSReg, dfReg and R Square for A, B and AB (the lower part of Figure 5) from the values in the upper part of Figure 5.
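
The model-subtraction step can be sketched in Python as follows. This is only an illustration under stated assumptions: the data are made up (not the Figure 1 data), effect coding is assumed for t1, t2 and t3, and the helper names solve, ss_reg and pick are hypothetical. Least squares is done through the normal equations so that the sketch needs no external libraries.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(n):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [x - f * y for x, y in zip(m[r], m[i])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ss_reg(X, y):
    """Regression sum of squares for a least-squares fit of y on X plus an intercept."""
    X = [[1.0] + list(row) for row in X]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    ybar = sum(y) / len(y)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in X]
    return sum((f - ybar) ** 2 for f in fitted)


# Hypothetical unbalanced data: assumed effect codes (t1, t2, t3) -> observations
cells = {
    (1, 1, 0): [10, 12, 14],
    (1, 0, 1): [20, 22],
    (1, -1, -1): [30],
    (-1, 1, 0): [16, 18],
    (-1, 0, 1): [24, 26, 28, 30],
    (-1, -1, -1): [36, 38],
}
X, y = [], []
for (t1, t2, t3), obs in cells.items():
    for value in obs:
        X.append((t1, t2, t3, t1 * t2, t1 * t3))
        y.append(value)


def pick(*cols):
    """Keep only the listed columns of the design matrix."""
    return [[row[c] for c in cols] for row in X]


full = ss_reg(pick(0, 1, 2, 3, 4), y)   # alpha + beta + alpha*beta
no_a = ss_reg(pick(1, 2, 3, 4), y)      # beta + alpha*beta
no_b = ss_reg(pick(0, 3, 4), y)         # alpha + alpha*beta
no_ab = ss_reg(pick(0, 1, 2), y)        # alpha + beta

# Each effect is the full model minus the model that omits that effect
ss_a, ss_b, ss_ab = full - no_a, full - no_b, full - no_ab
print(ss_a, ss_b, ss_ab)
```

In practice a library routine such as numpy.linalg.lstsq would normally replace the hand-written solver; the elimination code is used here only to keep the example self-contained.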

Figure 5 – Regression analysis for partial models

Note too that

and similarly for the other terms in the table above.

Using these facts, we can obtain the output for ANOVA as shown in Figure 6. The values for Rows (A), Columns (B) and Interaction (AB) come from Figure 5. The Within values come from the values for SSRes, dfRes, MSRes in the complete regression model (Figure 4). The Total values also come from the values in Figure 4.
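
As a rough sketch of how such a table is assembled, the Python function below takes the SS and df for each effect together with the residual SS and df from the full model, and computes MS and F for each row. The function name anova_rows and the input numbers are hypothetical, chosen only to make the arithmetic easy to follow; they are not the values in Figure 6.

```python
def anova_rows(effects, ss_res, df_res):
    """Build ANOVA rows from effect sums of squares.

    effects: {name: (SS, df)} for A, B and AB.
    Returns {name: (SS, df, MS, F)}, with MS_res as the denominator of each F.
    """
    ms_res = ss_res / df_res
    table = {}
    for name, (ss, df) in effects.items():
        ms = ss / df
        table[name] = (ss, df, ms, ms / ms_res)
    table["Within"] = (ss_res, df_res, ms_res, None)
    return table


# Hypothetical inputs (not the Figure 5 values)
table = anova_rows(
    {"A": (40.0, 1), "B": (90.0, 2), "AB": (120.0, 2)},
    ss_res=200.0,
    df_res=20,
)
for name, row in table.items():
    print(name, row)
```

A p-value for each F could then be obtained from an F distribution routine (for example scipy.stats.f.sf with the effect df and the residual df), which is left out here to keep the sketch free of external dependencies.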

Figure 6 – ANOVA output for Example 1

Note that SSA + SSB + SSAB + SSW = 14513.90 < 14702.07 = SST, since the above model doesn’t quite account for all the variation. While in the balanced model A, B and AB partition the total variation, in the case of unbalanced models A, B and AB overlap. The model we have given leaves out the overlap, but corresponds to the approach of equally weighting the cell means.

From the analysis we see that there are no significant differences between the Blend means or between the Crop means, but the interaction is significant. We can now perform the usual comparison t-tests to further investigate these differences.

Real Statistics Data Analysis Tool: The Two Factor ANOVA data analysis tool provided by the Real Statistics Resource Pack contains a Regression option which automates the above procedure. If the input data is in Excel Two Factor ANOVA format, the data is first converted to standard format and the appropriate regression models are generated.

For example, to perform the analysis for Example 1, click on cell F1 (where the output will start), press Ctrl-m and select the Two Factor ANOVA option from the menu that appears.

Figure 7 – Dialog box for unbalanced Anova models

When the dialog box in Figure 7 appears, enter A4:D14 in the Input Range, click on Column/row headings included with data, select Excel format as the Input Format, select Regression as the Analysis Type and click on the OK button. The output is shown in Figures 8 and 9.

Figure 8 – Unbalanced Two Factor ANOVA (part 1)

Figure 9 – Unbalanced Two Factor ANOVA (part 2)

The first step in the analysis is that the input data is converted to standard format. The usual descriptive statistics are then calculated and the regression form of analysis of variance is computed. Finally the rows and columns of the original input data are exchanged to facilitate follow-up tests.

If the input data had been in standard format the analysis would have proceeded in a similar manner, although no conversions would have been performed.
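
The conversion step can be pictured with the following Python sketch. The layouts are assumptions made for illustration: the Excel-style input is taken to be blocks of replicate rows per row level with one column per column level and empty cells where data are missing, and the standard format is taken to be one (row level, column level, value) record per observation. The function name, the placeholder level names Y and C3, and the values are hypothetical.

```python
def to_standard_format(col_labels, blocks):
    """Convert a wide two-factor layout to long (row level, column level, value) records.

    blocks: {row_level: list of rows}, each row holding one entry per column level,
    with None marking an empty cell (as happens with unbalanced data).
    """
    records = []
    for row_level, rows in blocks.items():
        for row in rows:
            for col_level, value in zip(col_labels, row):
                if value is not None:  # skip empty cells
                    records.append((row_level, col_level, value))
    return records


# Hypothetical values: 2 row levels x 2 replicate rows x 3 column levels
wide = {
    "X": [[1, 2, 3], [4, None, 6]],
    "Y": [[7, 8, None], [None, 11, 12]],
}
long_form = to_standard_format(["Corn", "Soy", "C3"], wide)
print(len(long_form), long_form[0])
```

Skipping the empty cells is what lets the long format represent an unbalanced design without padding.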

Observation: The approach described in this section requires that every cell (i.e. every combination of a level of factor A with a level of factor B) contain at least one element. E.g. in Figure 8, if one of the cells in the range K5:M6 contains a zero value, then the output from the analysis will be in error.
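
A quick way to check this requirement before running the analysis is to count the observations in each cell. The Python sketch below assumes the data are available as (A level, B level, value) records; the function name empty_cells, the placeholder level name Y and the values are hypothetical.

```python
from collections import Counter


def empty_cells(observations, a_levels, b_levels):
    """Return the (A level, B level) combinations that have no observations.

    observations: iterable of (a_level, b_level, value) records.
    """
    counts = Counter((a, b) for a, b, _ in observations)
    return [(a, b) for a in a_levels for b in b_levels if counts[(a, b)] == 0]


# Hypothetical records, invented for illustration
data = [("X", "Corn", 135), ("X", "Soy", 170), ("Y", "Corn", 160)]
missing = empty_cells(data, ["X", "Y"], ["Corn", "Soy"])
print(missing)  # [('Y', 'Soy')]
```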

Observation: When the Regression option of the Two Factor ANOVA data analysis tool is chosen you are limited to 64 independent variables (i.e. the same limitation as the Linear Regression data analysis tool described in Multiple Regression Analysis). This means that if a = the number of levels for factor A and b = the number of levels for factor B, then ab can be at most 64.

Observation: The approach described here for two factor ANOVA can be extended to ANOVA with more than two factors. In Three Factor ANOVA using Regression we show how this is done.

23 Responses to Unbalanced Factorial ANOVA

  1. Rhyth9m says:

    Hello Charles,

    Is it possible to use this method in this case? I have 4 groups and each group contains very different number of samples, one has 32, 19, 28 and the last group contains only 1 sample. This seemed hopeless, but hope you can help. =) Thanks.

    • Charles says:

      This method works with groups that have different numbers of samples, but not when one sample has only one element.

      • Rhyth9m says:

        Oh I see. Thank you Charles.
        But are there other possible statistical methods applicable for this case? =(

        • Charles says:

          The only thing I can think of is to drop the group with only one sample or to combine it with another group.

  2. Lipika Ray says:

    If the degrees of freedom is 99 for columns and inter in Figure 9 type of data, it cannot calculate the SS or F or any P-value. Is there any limit for df? How to deal with that? Thanks.

    • Charles says:

      I don’t know of any such limit, but if you send me an Excel file with your data and calculations I will try to figure out what is going on. You can find my email address at:
      Contact Us

  3. sandy says:

    Could you explain why it is better philosophically to use regression (which is essentially an unweighted means approach) than to use sub-sampling to achieve equal n or simply to run ANOVA? What are the implications?

    • Charles says:

      In any case ANOVA (with or without equal samples) is really a type of regression. If you can achieve a balanced model then by all means run the classical ANOVA model (the regression approach will yield the same results as the ANOVA approach).

  4. merahe says:

    Among the different types of ANOVA, which ones cannot be handled by multiple linear regression?

    • Charles says:

      All the forms of ANOVA that I am familiar with can be handled by multiple linear regression or some other form of regression.

  5. Domingos Motta says:

    Could you please explain how to estimate the standard error of the coefficients given in Figure 3. Thank you in advance.

  6. Rachel says:

    I believe I am following all of your directions correctly, but I keep getting the following message: “input in standard form cannot contain an empty cell.”

    • Charles says:

      If your input data is in the format of range A4:D14 of Figure 8 (part 1), then choose the Excel format and Regression options. Your input may have some empty cells but you shouldn’t get an error message.
      If your input data is in the format of range F3:H29 of Figure 8 (part 1), then choose the Standard format and Regression options. Your data cannot have any empty cells or you will receive an error message.
      If you are doing either of these correctly and are still getting the error message, then something else has gone wrong. In this case, if you send me an Excel file with your data I will try to figure out what has gone wrong.

      • Rachel says:

        Oh, I see. I was choosing standard format when I should have been choosing excel. Now I get the following message: “number of rows per sample must divide number of rows in input range evenly.” Your example has an even number of rows across conditions, whereas I have 36 in one and 51 in the other. Is it still possible to use Excel to analyze these data?
        Thanks again for your help,

        • Charles says:

          I believe that all you need to do is make the number of rows per group the same. Since the model is unbalanced you just need to fill the smaller group with empty cells.

  7. Bronwyn says:

    Hi Charles,

    After using the unbalanced two-factor ANOVA, is it possible to run a post-hoc test to determine between which factors there are significant differences? I have two independent variables, one with two levels and the other with three and one dependent variable.

    As Tukey’s HSD requires equal group sizes, which test would you recommend? I have read that the Scheffe procedure allows different group sizes but is very conservative. Is it at all possible to carry this out using Excel?

    Thanks for this page, it has been very helpful.

  8. Statistic student says:

    Hello Charles,
    My experiment is – we have asked set of questions to 4 different org- 1) with leadership A &B 2) With leadership A & not B 3) with leadership B & not A 4) no A and no B
    The questions answered on likert scale of 1-5. I am planning to use ANOVA with regression to see how the answers to question differ based on type of leadership. the sample sizes are unequal. My questions- 1.using ANOVA with regression is correct?
    2. category with no leadership represents the intercept or should it be considered as separate group

    • Charles says:

      In general, you should be able to use regression to perform the ANOVA. I need to better understand the 4 org and your data better to answer your specific questions. E.g. are the samples for the 4 different orgs independent? Please provide more details.

      • Statistic student says:

        The samples for different org are independent. the sample sizes for each type is-
        with leadership A &B – 67
        With leadership A & not B -4
        3) with leadership B & not A – 94
        4) no A and no B- 45
