# Unbalanced Factorial ANOVA

In an unbalanced ANOVA the sample sizes for the various cells are unequal. Provided the cell sizes are not too different, this is not a big problem for one-way ANOVA, but for factorial ANOVA, the approaches described in Factorial ANOVA are generally not adequate. In these cases the regression approach described in ANOVA using Regression can be used instead.

Usually when conducting a study, the intention is to create groups of equal size, but it is often difficult to maintain such equality, perhaps because one or more subjects pull out of the study at the last minute or for some other reason. The technique we now review is appropriate when any differences in cell size are due to random factors.

If, for example, the cells correspond to questions in a survey and some cell has fewer entries because many people were offended by the question, or found it ambiguous or too difficult, and so didn’t answer it, then the difference in cell size is not random, and the approach given here is not applicable. In fact, in this case the survey should be redesigned, unless non-response is itself being tested for, so that a non-answer can be considered a type of response; in that case we can use a balanced model.

Example 1: Perform ANOVA for the situation in Example 2 of ANOVA using Regression on the sample data in the table on the left side of Figure 1 using multiple regression.

Figure 1 – Data for Example 1 plus coding of dichotomous variables

We also provide a coding for the data in Figure 1. As you can see, the cells are unequal in size. Since we assume that any differences are due to random factors, we would like to treat each cell as having equal weight. In such a case, we are better off assuming that the row means are simply the average of the cell means in each row, and similarly for the columns. Thus we have the following modified means:

Figure 2 – Means with equal weights
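The equal-weight means described above can be sketched in a few lines of Python. This is only an illustration: the data are hypothetical (they are not the values from Figure 1), and the function name `equal_weight_means` is invented for this example. Each row or column mean is the plain average of the cell means in that row or column, so a cell with many observations counts no more than a cell with few.

```python
from statistics import mean

# Hypothetical unbalanced data: (row level, column level) -> observations.
cells = {
    ("X", "Corn"): [120.0, 140.0, 130.0],
    ("X", "Soy"): [170.0, 172.0],
    ("Y", "Corn"): [160.0, 150.0, 155.0, 155.0],
    ("Y", "Soy"): [140.0],
}

def equal_weight_means(cells):
    """Return (cell means, row means, column means, grand mean), weighting every cell equally."""
    cell_mean = {key: mean(values) for key, values in cells.items()}
    rows = sorted({r for r, _ in cells})
    cols = sorted({c for _, c in cells})
    # Average the cell means, not the raw observations.
    row_mean = {r: mean(cell_mean[(r, c)] for c in cols) for r in rows}
    col_mean = {c: mean(cell_mean[(r, c)] for r in rows) for c in cols}
    grand = mean(cell_mean.values())
    return cell_mean, row_mean, col_mean, grand

cell_mean, row_mean, col_mean, grand = equal_weight_means(cells)

# For contrast: the observation-weighted mean of row X pools all of its observations.
pooled_x = mean(cells[("X", "Corn")] + cells[("X", "Soy")])
print("equal-weight mean of row X:", row_mean["X"])
print("pooled mean of row X:", pooled_x)
```

The two printed values differ whenever the cells in a row have different sizes, which is the situation this section addresses.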

The regression model takes the form

y = β0 + β1 t1 + β2 t2 + β3 t3 + β4 t1 * t2 + β5 t1 * t3 + ε

Using the same approach as in Example 1 of ANOVA using Regression, ignoring the error term, we see that

Since the population grand mean is the sum of the terms on the left sides of the above six equations divided by 6, we see that the grand mean is the sum of the terms on the right sides of the above equations divided by 6. This turns out to be β0, i.e. β0 = μ. Similarly,

and so

Thus we conclude that β1 = μX – μ.

From

it follows that β2 = μCorn – μ. Similarly, β3 = μSoy – μ.

Since

we have

and similarly

We summarize these results as follows:

• β0 = population grand mean
• β1 = population effect for Blend X group (i.e. group mean – grand mean)
• β2 = population effect for Corn group
• β3 = population effect for Soy group
• β4 = population effect for Blend X × Corn interaction (i.e. Blend X × Corn mean – Blend X group mean – Corn group mean + population grand mean)
• β5 = population effect for Blend X × Soy interaction
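The six cell equations behind this summary can be written out under one assumption about the coding in Figure 1: effect coding, with t1 = 1 for Blend X and -1 for Blend Y, t2 = 1 for Corn, 0 for Soy and -1 for the third crop, and t3 = 0 for Corn, 1 for Soy and -1 for the third crop. The third crop is called Rice here, following a reader comment below, and the cell-mean symbols such as μ_{X,Corn} are introduced only for this sketch. Under that assumption the derivation would run roughly as follows, and it reproduces the conclusions listed above.

```latex
\begin{aligned}
\mu_{X,\mathrm{Corn}} &= \beta_0 + \beta_1 + \beta_2 + \beta_4 \\
\mu_{X,\mathrm{Soy}}  &= \beta_0 + \beta_1 + \beta_3 + \beta_5 \\
\mu_{X,\mathrm{Rice}} &= \beta_0 + \beta_1 - \beta_2 - \beta_3 - \beta_4 - \beta_5 \\
\mu_{Y,\mathrm{Corn}} &= \beta_0 - \beta_1 + \beta_2 - \beta_4 \\
\mu_{Y,\mathrm{Soy}}  &= \beta_0 - \beta_1 + \beta_3 - \beta_5 \\
\mu_{Y,\mathrm{Rice}} &= \beta_0 - \beta_1 - \beta_2 - \beta_3 + \beta_4 + \beta_5 \\[6pt]
\mu &= \tfrac{1}{6}\sum_{i,j}\mu_{i,j} = \beta_0 \\
\mu_X &= \tfrac{1}{3}\left(\mu_{X,\mathrm{Corn}} + \mu_{X,\mathrm{Soy}} + \mu_{X,\mathrm{Rice}}\right)
       = \beta_0 + \beta_1
       \;\Longrightarrow\; \beta_1 = \mu_X - \mu \\
\mu_{\mathrm{Corn}} &= \tfrac{1}{2}\left(\mu_{X,\mathrm{Corn}} + \mu_{Y,\mathrm{Corn}}\right)
       = \beta_0 + \beta_2
       \;\Longrightarrow\; \beta_2 = \mu_{\mathrm{Corn}} - \mu \\
\beta_4 &= \mu_{X,\mathrm{Corn}} - \beta_0 - \beta_1 - \beta_2
       = \mu_{X,\mathrm{Corn}} - \mu_X - \mu_{\mathrm{Corn}} + \mu
\end{aligned}
```

The expressions for β3 and β5 follow in the same way, with Soy in place of Corn.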

The coefficient table from Excel’s Regression data analysis tool is given in Figure 3.

Figure 3 – Coefficients from regression analysis for Example 1

Note that these coefficients can be obtained using the equally weighted means from Figure 2 as follows:

• β0 = sample grand mean = 152.62
• β1 = sample Blend X group mean – grand mean = 149.2 – 152.62 = -3.42
• β2 = sample Corn group mean – grand mean = 147.5 – 152.62 = -5.12
• β3 = sample Soy group mean – grand mean = 157.38 – 152.62 = 4.76
• β4 = Blend X × Corn mean – Blend X group mean – Corn group mean + grand mean = 135.4 – 149.2 – 147.5 + 152.62 = -8.68
• β5 = Blend X × Soy mean – Blend X group mean – Soy group mean + grand mean = 171 – 149.2 – 157.38 + 152.62 = 17.04
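The arithmetic in the list above can be checked with a short Python snippet. It uses only the sample means quoted in the text; the variable names are chosen for this sketch. The final consistency check assumes that the Blend X × Corn cell is coded with t1 = 1 and t2 = 1, so that its fitted mean is b0 + b1 + b2 + b4.

```python
# Equal-weight sample means quoted in the text (Figure 2).
grand = 152.62
blend_x = 149.2
corn, soy = 147.5, 157.38
x_corn, x_soy = 135.4, 171.0

coefficients = {
    "b0": grand,
    "b1": blend_x - grand,
    "b2": corn - grand,
    "b3": soy - grand,
    "b4": x_corn - blend_x - corn + grand,
    "b5": x_soy - blend_x - soy + grand,
}
for name, value in coefficients.items():
    print(f"{name} = {value:.2f}")

# Consistency check (assumes the Blend X x Corn cell has t1 = 1 and t2 = 1):
# the cell mean should be recovered as b0 + b1 + b2 + b4.
recovered_x_corn = sum(coefficients[k] for k in ("b0", "b1", "b2", "b4"))
print(f"recovered Blend X x Corn mean = {recovered_x_corn:.2f}")
```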

To create the ANOVA we proceed in a manner similar to what we did in Example 2 of ANOVA using Regression. The output, using the Regression data analysis, for the full model, i.e. α + β + αβ, where α, β and αβ (= δ) are as defined in Definition 1 of Two Factor ANOVA with Replication, is as follows:

Figure 4 – Regression analysis for the complete model

We now run the regression analysis with only t1, t2 and t3 (no interaction terms) to obtain the α + β model, and then run the analysis with t1, t1 * t2 and t1 * t3 to obtain the α + αβ model. Finally we run the analysis with t2, t3, t1 * t2 and t1 * t3 to obtain the β + αβ model. Although we don’t show the output for each of these models, we summarize the key results in the upper portion of Figure 5.

But A = (α + β + αβ) – (β + αβ), B = (α + β + αβ) – (α + αβ) and AB = (α + β + αβ) – (α + β), and so we obtain the values for the SSReg, dfReg and R Square for A, B and AB (the lower part of Figure 5) from the values in the upper part of Figure 5.

Figure 5 – Regression analysis for partial models
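The model-differencing procedure described above can be sketched in plain Python. This is a minimal illustration, not the Real Statistics implementation: the data are hypothetical, the effect coding (+1/-1 for the two blends; 1, 0, -1 codes for Corn, Soy and a third crop called Rice here) is an assumption about Figure 1, and the helper names `solve`, `ss_reg`, `effect_codes` and `anova_by_regression` are invented for this example.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ss_reg(columns, y):
    """Regression sum of squares for y on an intercept plus the given columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)  # least squares via the normal equations
    y_bar = sum(y) / len(y)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    return sum((f - y_bar) ** 2 for f in fitted)

def effect_codes(blend, crop):
    """Assumed coding: t1 = +1/-1 for Blend X/Y; t2, t3 are effect codes for Corn and Soy."""
    t1 = 1.0 if blend == "X" else -1.0
    t2 = {"Corn": 1.0, "Soy": 0.0, "Rice": -1.0}[crop]
    t3 = {"Corn": 0.0, "Soy": 1.0, "Rice": -1.0}[crop]
    return t1, t2, t3, t1 * t2, t1 * t3

def anova_by_regression(observations):
    """observations: list of (blend, crop, y). Returns sums of squares by model differencing."""
    y = [obs[2] for obs in observations]
    coded = [effect_codes(b, c) for b, c, _ in observations]
    t1, t2, t3, t12, t13 = (list(col) for col in zip(*coded))
    full = ss_reg([t1, t2, t3, t12, t13], y)        # alpha + beta + alpha*beta
    y_bar = sum(y) / len(y)
    total = sum((v - y_bar) ** 2 for v in y)
    return {
        "A": full - ss_reg([t2, t3, t12, t13], y),  # full minus (beta + alpha*beta)
        "B": full - ss_reg([t1, t12, t13], y),      # full minus (alpha + alpha*beta)
        "AB": full - ss_reg([t1, t2, t3], y),       # full minus (alpha + beta)
        "W": total - full,                          # residual of the full model
        "T": total,
    }

# Hypothetical unbalanced data, for illustration only.
data = [
    ("X", "Corn", 128.0), ("X", "Corn", 141.0), ("X", "Soy", 166.0),
    ("X", "Soy", 175.0), ("X", "Soy", 172.0), ("X", "Rice", 139.0),
    ("X", "Rice", 145.0), ("Y", "Corn", 158.0), ("Y", "Corn", 162.0),
    ("Y", "Corn", 155.0), ("Y", "Soy", 140.0), ("Y", "Soy", 149.0),
    ("Y", "Rice", 163.0), ("Y", "Rice", 168.0), ("Y", "Rice", 160.0),
]
ss = anova_by_regression(data)
for term in ("A", "B", "AB", "W", "T"):
    print(term, round(ss[term], 2))
```

With balanced data the four components returned by this sketch add up to the total; with unbalanced data they generally do not.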

Note too that

and similarly for the other terms in the table above.

Using these facts, we can obtain the output for ANOVA as shown in Figure 6. The values for Rows (A), Columns (B) and Interaction (AB) come from Figure 5. The Within values come from the values for SSRes, dfRes, MSRes in the complete regression model (Figure 4). The Total values also come from the values in Figure 4.

Figure 6 – ANOVA output for Example 1
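Assembling the ANOVA table from the sums of squares and degrees of freedom is simple arithmetic, sketched below. The numbers are hypothetical (they are not the values in Figure 6) and the function name `anova_table` is invented for this example. Each mean square is SS / df and each F statistic is the term's mean square divided by the Within mean square. The p-values are omitted because they require an F distribution function, which the Python standard library does not provide.

```python
def anova_table(ss, df):
    """Build rows of (source, SS, df, MS, F); F is MS divided by the within-cell MS."""
    ms_within = ss["W"] / df["W"]
    table = []
    for source in ("A", "B", "AB"):
        ms = ss[source] / df[source]
        table.append((source, ss[source], df[source], ms, ms / ms_within))
    table.append(("W", ss["W"], df["W"], ms_within, None))
    return table

# Hypothetical sums of squares for a 2 x 3 design with 15 observations.
ss = {"A": 30.0, "B": 120.0, "AB": 900.0, "W": 270.0}
a, b, n = 2, 3, 15
df = {"A": a - 1, "B": b - 1, "AB": (a - 1) * (b - 1), "W": n - a * b}

table = anova_table(ss, df)
for row in table:
    print(row)
```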

Note that SSA + SSB + SSAB + SSW = 14513.90 < 14702.07 = SST, since the above model doesn’t quite account for all the variation. While in the balanced model A, B and AB partition the total variation, in the case of unbalanced models A, B and AB overlap. The model we have given leaves out the overlap, but corresponds to the approach of equally weighting the cell means.

From the analysis we see that there are no significant differences between the Blend means or between the Crop means, but there is a significant interaction. We can now perform the usual comparison t-tests to further investigate these differences.

Real Statistics Data Analysis Tool: The Two Factor ANOVA data analysis tool provided by the Real Statistics Resource Pack contains a Regression option which automates the above procedure. If the input data is in Excel Two Factor ANOVA format, the data is first converted to standard format, and then the appropriate regression models are generated.

For example, to perform the analysis for Example 1, click on cell F1 (where the output will start), press Ctrl-m and select the Two Factor ANOVA option from the menu that appears.

Figure 7 – Dialog box for unbalanced Anova models

When the dialog box in Figure 7 appears, enter A4:D14 in the Input Range, click on Column/row headings included with data, select Standard format as the Input Format, select Regression as the Analysis Type and click on the OK button. The output is shown in Figures 8 and 9.

Figure 8 – Unbalanced Two Factor ANOVA (part 1)

Figure 9 – Unbalanced Two Factor ANOVA (part 2)

The first step in the analysis is that the input data is converted to standard format. The usual descriptive statistics are then calculated and the regression form of analysis of variance is computed. Finally the rows and columns of the original input data are exchanged to facilitate follow-up tests.

If the input data had been in standard format the analysis would have proceeded in a similar manner, although no conversions would have been performed.

Observation: The approach described in this section requires that every interaction cell (i.e. every combination of a row level and a column level) contain at least one data element. E.g. in Figure 8, if one of the cells in the range K5:M6 contains a zero value, then the output from the analysis will be in error.

Observation: When the Regression option of the Two Factor ANOVA data analysis tool is chosen you are limited to 64 independent variables (i.e. the same limitation as the Linear Regression data analysis tool described in Multiple Regression Analysis). This means that if a = the number of levels for factor A and b = the number of levels for factor B, then ab can be at most 64.

Observation: The approach described here for two factor ANOVA can be extended to ANOVA with more than two factors. In Three Factor ANOVA using Regression we show how this is done.

### 45 Responses to Unbalanced Factorial ANOVA

1. Statistic student says:

Hello Charles,
My experiment is – we have asked set of questions to 4 different org- 1) with leadership A &B 2) With leadership A & not B 3) with leadership B & not A 4) no A and no B
The questions answered on likert scale of 1-5. I am planning to use ANOVA with regression to see how the answers to question differ based on type of leadership. the sample sizes are unequal. My questions- 1.using ANOVA with regression is correct?
2. category with no leadership represents the intercept or should it be considered as seperate group

• Charles says:

In general, you should be able to use regression to perform the ANOVA. I need to better understand the 4 org and your data better to answer your specific questions. E.g. are the samples for the 4 different orgs independent? Please provide more details.
Charles

• Statistic student says:

The samples for different org are independent. the sample sizes for each type is-
with leadership A &B – 67
With leadership A & not B -4
3) with leadership B & not A – 94
4) no A and no B- 45

• Charles says:

If the four samples are independent then each of the four would be a separate group.
Charles

2. Bronwyn says:

Hi Charles,

After using the unbalanced two-factor ANOVA, is it possible to run a post-hoc test to determine between which factors there are significant differences? I have two independent variables, one with two levels and the other with three and one dependent variable.

As Tukey’s HSD requires equal group sizes, which test would you recommend? I have read that the Scheffe procedure allows different group sizes but is very conservative. Is it at all possible to carry this out using Excel?

3. Rachel says:

Hi,
I believe I am following all of your directions correctly, but I keep getting the following message: “input in standard form cannot contain an empty cell.”
Thanks,
Rachel

• Charles says:

Rachel,
If your input data is in the format of range A4:D14 of Figure 8 (part 1), then choose the Excel format and Regression options. Your input may have some empty cells but you shouldn’t get an error message.
If your input data is in the format of range F3:H29 of Figure 8 (part 1), then choose the Standard format and Regression options. Your data cannot have any empty cells or you will receive an error message.
If you are doing either of these correctly and are still getting the error message, then something else has gone wrong. In this case, if you send me an Excel file with your data I will try to figure out what has gone wrong.
Charles

• Rachel says:

Oh, I see. I was choosing standard format when I should have been choosing excel. Now I get the following message: “number of rows per sample must divide number of rows in input range evenly.” Your example has an even number of rows across conditions, whereas I have 36 in one and 51 in the other. Is it still possible to use Excel to analyze these data?
Rachel

• Charles says:

Rachel,
I believe that all you need to do is make the number of rows per group the same. Since the model is unbalanced you just need to fill the smaller group with empty cells.
Charles

• Rachel says:

That worked. Thanks!

4. Domingos Motta says:

Could you please explain how to estimate the standard error of the coefficients given in Figure 3. Thank you in advance.

5. merahe says:

among the different types of anova, which ones cannot be handled by multiple linear regression?

• Charles says:

All the forms of ANOVA that I am familiar with can be handled by multiple linear regression or some other form of the regression.
Charles

6. sandy says:

Could you explain why it is better philosophically to use regression (which is essentially an unweighted means approach) than to use sub-sampling to achieve equal n or simply to run ANOVA? What are the implications?

• Charles says:

Sandy,
In any case ANOVA (with or without equal samples) is really a type of regression. If you can achieve a balanced model then by all means run the classical ANOVA model (the regression approach will yield the same results as the ANOVA approach).
Charles

7. Lipika Ray says:

If the degrees of freedom is 99 for columns and inter in Figure 9 type of data, it cannot calculate the SS or F or any P-value. Is there any limit for df? How to deal with that? Thanks.

• Charles says:

I don’t know of any such limit, but if you send me an Excel file with your data and calculations I will try to figure out what is going on. You can find my email address at:
Charles

8. Rhyth9m says:

Hello Charles,

Is it possible to use this method in this case? I have 4 groups and each group contains very different number of samples, one has 32, 19, 28 and the last group contains only 1 sample. This seemed hopeless, but hope you can help. =) Thanks.

• Charles says:

This method works with groups that have different numbers of samples, but not when one sample has only one element.
Charles

• Rhyth9m says:

Oh I see. Thank you Charles.
But are there other possible statistical methods applicable for this case? =(

• Charles says:

The only thing I can think of is to drop the group with only one sample or to combine it with another group.
Charles

Hi. Thank you for this post.
I would like to figure out if this is the approach i need to take.

here is my situation.
I have two groups of students (X and Y)
A survey was given at 2 time points (Pre and Post)
The survey has 4 categories of questions (1-4).

some students did not do both surveys.
the X and Y groups do not have equal numbers of students.

After removing the students who only did one survey, I think the proper analysis would be a repeated-measures 2-way unbalanced ANOVA (one analysis for each category of questions seems ok — rather than a 3-way ANOVA?)

does doing the regression like this work when (1) only pre/post is a repeated measure and (2) the X and Y groups are unbalanced?

• Charles says:

I don’t believe that the version of repeated measures ANOVA that is described on the website or supported by the Real Statistics software will work in this case. I plan to add a new version that will handle unbalanced models.
Charles

thank you charles for the quick response. Am I right in thinking that that is what I need?

I do have access to other software like matlab that can do regression. I just don’t know how to set it up. Can you hint as to what would be necessary?

• Charles says:

This would depend on the software that you use. I don’t use matlab, and so can’t help you with that.
Charles

Hi Charles,

thank you for the input. I have a related question.
is it possible to turn my unbalanced ANOVA into a balanced one, and then use Real Statistics to do a 2-way mixed (one factor is a repeated measure but the other is not) balanced ANOVA?

one of my groups has 50 samples and the other has 20. I was thinking of taking a random sample of 20 from the larger group to equalize the sizes before analysis. But i wonder if i would need to repeat that procedure 1000 times or something and then if the stats are significant (or not) 95% of the time then it would be ok. Does that make sense or is it totally unkosher?

• Charles says:

You can use this approach. The main drawback is that you will lose power because of the reduced sample size.
Charles

10. Wayne says:

Hello there,
Thank you for the page.

I’m wondering if I’ve got this right.

For 10 minutes I observed randomly chosen snorkellers and noted their rate of contact with the reef according to whether they came with a boat that had a tour guide or not.

At the same time I noted how far each observed snorkeller was from the guide (5m) throughout the observation period and if they swam alone or in a group.

Because of the nature of tourism at the study site, there is unequal numbers of people in each of the groups (ie people who stayed near the guide; people who stayed far from the guide; people who had no guide; and group size (1,2,3 etc)).

If I sqrt the counts, to investigate guide and group size can I use 2way ANOVA with regression?

Thank you.

• Charles says:

Wayne,
You can use 2-way ANOVA using regression, but whether or not this is the correct test to use depends on what you are trying to test. Also, why do you want to take the square root of the counts?
Charles

Hi Dr.
Many thanks for this great page & tools.
I have 3 questions:
1. I’m confused on how to determine dummy variables. For your example, rows is Fertilizer. So i determined t1 (Blend X), t2 (Blend Y), t3 (Corn), t4 (Soy), t5 (Rice).
To compute SSrows: It’s regression between Y (values) and t1, t2, t1*t3, t1*t4, t1*t5, t2*t3, t2*t4, t2*t5. But using yours tools it’s inverted. Where is the problem.
2. What is the difference between Split Plot & Two Factor ANOVA with replications? Is it the same methodology?
3. Have you developed a page on Principal Component Analysis (PCA)?

Thanks,

• Charles says:

1) Sorry, but I don’t understand why you say it is inverted.
2) Two-factor ANOVA with replications and Split-plot ANOVA are not the same
3) Sorry, but I don’t understand your question
Charles

Thanks Dr.
Question 1 is solved, i was confusing dummy variables.
Question 3: need step by step to do a principal component analysis (PCA).
Thank you,

12. Tao Huang says:

Hi Dr
Do you think the table below is suitable for the two-way unbalanced ANOVA?

| Straw management \ Nitrogen management | N0 | N150 | N300 | N200+M100 | N200+W100 |
|---|---|---|---|---|---|
| Straw return | 23.8 | 25.3 | 35.6 | 27.9 | 27.7 |
| | 20.8 | 24.9 | 28.2 | 32.0 | 35.0 |
| | 21.8 | 25.6 | 28.4 | 35.4 | 31.4 |
| Straw removal | 16.6 | 19.2 | 24.9 | None | None |
| | 19.7 | 26.4 | 24.0 | None | None |
| | 18.1 | 26.9 | 25.8 | None | None |

Note: N0 means chemical nitrogen input rate is 0 kg N ha-1 yr-1; M100 and W100 means manure nitrogen and waste nitrogen input rate is 100 kg N ha-1 yr-1, respectively.

• Charles says:

Tao,
The unbalanced model still needs data. That the last two columns for Straw removal are completely empty will likely cause a problem.
Charles

• Tao Huang says:

Thank you very much.
However, I heard from a reviewer that these data could be analyzed with a two-way unbalanced ANOVA in R Studio. What do you think?
Wishes

• Charles says:

Tao,
I don’t use R Studio, so I can’t say, but I suggest that you try. If you get a result that is good news.
Charles

13. Tao Huang says:

Thank you very much!

14. Tom Sullivan says:

I have some uneven data sets and want to run ANOVA two factor with replication. I have downloaded the Addins and they are checked off on my Addins (Alt TI) but when I select ANOVA two factor with replication I do not have the Excel/Standard option in ANOVA. Any suggestions? I have Excel Version 10.

• Charles says:

Tom,
If I understand correctly you are using Excel 2002, which is no longer supported by Microsoft. Although it is supported by Real Statistics, this version of Real Statistics is not being updated with new features. You will need to use Excel 2007, 2010, 2011, 2013 or 2016 to access these newer capabilities.
Charles

• Tom Sullivan says:

Sorry, I should have indicated that I am using Excel Version: 14.0.7194.5000 (32-bit)

• Charles says:

Tom,
Thanks for this information. The latest version of the Real Statistics software does support Standard format for two factor ANOVA (as well as the past few preceding versions). The only reason I can think of for not finding the Standard format option is that you are looking at the standard Excel version of the ANOVA data analysis tool (which does not have this option) and not the Real Statistics version (which does). To access Real Statistics you can press Ctrl-m.
Charles

• Tom Sullivan says:

Hi Charles,