In an unbalanced ANOVA the sample sizes for the various cells are unequal. Provided the cell sizes are not too different, this is not a big problem for one-way ANOVA, but for factorial ANOVA, the approaches described in Factorial ANOVA are generally not adequate. In these cases the regression approach described in ANOVA using Regression can be used instead.

Usually when conducting a study, the intention is to create groups of equal size, but it is often difficult to maintain such equality, perhaps because one or more subjects pull out of the study at the last minute or for some other reason. The technique we now review is appropriate when any differences in cell size are due to random factors.

If, for example, the cells correspond to questions in a survey and it turns out that some cell has fewer entries because many people were offended by the question or found the question ambiguous or too difficult and so didn’t answer the question, then this is not a random difference in cell size, and so the approach given here is not applicable. In fact, in this case the survey should be redesigned unless the fact that the question was unanswered is being tested for and so a non-answer could be considered as a type of response, in which case we can use a balanced model.

**Example 1**: Perform ANOVA for the situation in Example 2 of ANOVA using Regression on the sample data in the table on the left side of Figure 1 using multiple regression.

**Figure 1 – Data for Example 1 plus coding of dichotomous variables**

We also provide a coding for the data in Figure 1. As you can see, the cells are unequal in size. Since we assume that any differences are due to random factors, we would like to treat each cell as having equal weight. In such a case, we are better off assuming that each row mean is simply the average of the cell means in that row, and similarly for the columns. Thus we have the following modified means:

**Figure 2 – Means with equal weights**
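The idea of equally weighted means can be sketched in a few lines of Python. The cell values below are hypothetical (they are not the data in Figure 1), and the 2 × 3 layout is only an assumption chosen to mirror the example; the point is that row, column and grand means are averages of cell means rather than averages of the raw observations.

```python
import numpy as np

# Hypothetical 2 x 3 layout (rows = factor A, columns = factor B).
# These values are illustrative only; they are NOT the data in Figure 1.
cells = {
    (0, 0): [10.0, 12.0, 14.0],
    (0, 1): [20.0, 22.0],
    (0, 2): [30.0],
    (1, 0): [11.0, 13.0],
    (1, 1): [21.0, 23.0, 25.0, 27.0],
    (1, 2): [31.0, 33.0],
}

# Mean of each cell, arranged as a 2 x 3 array
cell_means = np.array(
    [[np.mean(cells[(i, j)]) for j in range(3)] for i in range(2)]
)

# Equal-weight means: average the cell means, ignoring the cell sizes
row_means = cell_means.mean(axis=1)
col_means = cell_means.mean(axis=0)
grand_mean = cell_means.mean()

# For comparison: row means weighted by cell size (pool the raw observations)
pooled_row_means = np.array(
    [np.mean(np.concatenate([cells[(i, j)] for j in range(3)])) for i in range(2)]
)

print("equal-weight row means:", row_means)
print("pooled row means:      ", pooled_row_means)
print("equal-weight column means:", col_means)
print("equal-weight grand mean:", grand_mean)
```

With unequal cell sizes the two kinds of row mean differ; the regression approach in this section corresponds to the equal-weight version.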

The regression model takes the form

Using the same approach as in Example 1 of ANOVA using Regression, ignoring the error term, we see that

Since the population grand mean is the sum of the terms on the left sides of the above six equations divided by 6, we see that the grand mean is the sum of the terms on the right sides of the above equations divided by 6. This turns out to be *β_{0}*, i.e. *β_{0} = μ*. Similarly, we conclude that *β_{1} = μ_{X} – μ*, that *β_{2} = μ_{Corn} – μ* and that *β_{3} = μ_{Soy} – μ*.

We summarize these results as follows:

- *β_{0}* = population grand mean
- *β_{1}* = population effect for Blend X group (i.e. group mean – grand mean)
- *β_{2}* = population effect for Corn group
- *β_{3}* = population effect for Soy group
- *β_{4}* = population effect for Blend X × Corn interaction (i.e. Blend X × Corn mean – Blend X group mean – Corn group mean + population grand mean)
- *β_{5}* = population effect for Blend X × Soy interaction
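The interpretation of the coefficients summarized above is consistent with a model of the following form. This is a sketch under an assumption: the coding is taken to be effect (sum-to-zero) coding, with *t_{1}* = 1 for Blend X and –1 for the other blend, and *t_{2}*, *t_{3}* equal to 1 for Corn and Soy respectively, 0 for the other of these two crops and –1 for the third crop. The labels "Y" for the second blend and "3" for the third crop are placeholders, not names taken from the text.

```latex
% Assumed effect coding; "Y" and "3" are placeholder labels.
\begin{align*}
y &= \beta_0 + \beta_1 t_1 + \beta_2 t_2 + \beta_3 t_3
     + \beta_4 t_1 t_2 + \beta_5 t_1 t_3 + \varepsilon \\[4pt]
\mu_{X,\mathrm{Corn}} &= \beta_0 + \beta_1 + \beta_2 + \beta_4 \\
\mu_{X,\mathrm{Soy}}  &= \beta_0 + \beta_1 + \beta_3 + \beta_5 \\
\mu_{X,3}             &= \beta_0 + \beta_1 - \beta_2 - \beta_3 - \beta_4 - \beta_5 \\
\mu_{Y,\mathrm{Corn}} &= \beta_0 - \beta_1 + \beta_2 - \beta_4 \\
\mu_{Y,\mathrm{Soy}}  &= \beta_0 - \beta_1 + \beta_3 - \beta_5 \\
\mu_{Y,3}             &= \beta_0 - \beta_1 - \beta_2 - \beta_3 + \beta_4 + \beta_5 \\[4pt]
\mu &= \tfrac{1}{6}\textstyle\sum_{i,j}\mu_{i,j} = \beta_0 \\
\mu_X &= \tfrac{1}{3}\left(\mu_{X,\mathrm{Corn}} + \mu_{X,\mathrm{Soy}} + \mu_{X,3}\right)
       = \beta_0 + \beta_1 \\
\mu_{\mathrm{Corn}} &= \tfrac{1}{2}\left(\mu_{X,\mathrm{Corn}} + \mu_{Y,\mathrm{Corn}}\right)
       = \beta_0 + \beta_2 \\
\mu_{X,\mathrm{Corn}} - \mu_X - \mu_{\mathrm{Corn}} + \mu &= \beta_4
\end{align*}
```

Under this assumed coding, averaging the six cell equations cancels every term except *β_{0}*, and averaging within a row or column leaves only the corresponding main effect.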

The coefficient table from Excel’s **Regression** data analysis tool is given in Figure 3.

**Figure 3 – Coefficients from regression analysis for Example 1**

Note that these coefficients can be obtained using the equally weighted means from Figure 2 as follows:

- *β_{0}* = sample grand mean = 152.62
- *β_{1}* = sample Blend X group mean – grand mean = 149.2 – 152.62 = -3.42
- *β_{2}* = sample Corn group mean – grand mean = 147.5 – 152.62 = -5.12
- *β_{3}* = sample Soy group mean – grand mean = 157.38 – 152.62 = 4.76
- *β_{4}* = Blend X × Corn mean – Blend X group mean – Corn group mean + grand mean = 135.4 – 149.2 – 147.5 + 152.62 = -8.68
- *β_{5}* = Blend X × Soy mean – Blend X group mean – Soy group mean + grand mean = 171 – 149.2 – 157.38 + 152.62 = 17.04
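The arithmetic above can be checked with a short Python sketch. The six means are the values quoted in the text; the variable names are illustrative only.

```python
# Equally weighted sample means quoted in the text (from Figure 2)
grand = 152.62
blend_x = 149.2
corn = 147.5
soy = 157.38
x_corn = 135.4   # Blend X x Corn cell mean
x_soy = 171.0    # Blend X x Soy cell mean

coefficients = {
    "b0": grand,                              # grand mean
    "b1": blend_x - grand,                    # Blend X effect
    "b2": corn - grand,                       # Corn effect
    "b3": soy - grand,                        # Soy effect
    "b4": x_corn - blend_x - corn + grand,    # Blend X x Corn interaction
    "b5": x_soy - blend_x - soy + grand,      # Blend X x Soy interaction
}

for name, value in coefficients.items():
    print(f"{name} = {value:.2f}")
```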

To create the ANOVA we proceed in a manner similar to what we did in Example 2 of ANOVA using Regression. The output from the Regression data analysis tool for the full model, i.e. *α + β + αβ*, where *α, β* and *αβ* (= *δ*) are as defined in Definition 1 of Two Factor ANOVA with Replication, is as follows:

**Figure 4 – Regression analysis for the complete model**

We now run the regression analysis with only *t_{1}, t_{2}* and *t_{3}* (no interaction terms) to obtain the *α + β* model, and then run the analysis with *t_{1}, t_{1}×t_{2}* and *t_{1}×t_{3}* to obtain the *α + αβ* model. Finally we run the analysis with *t_{2}, t_{3}, t_{1}×t_{2}* and *t_{1}×t_{3}* to obtain the *β + αβ* model. Although we don’t show the output for each of these models, we summarize the key results in the upper portion of Figure 5.

But A = (*α + β + αβ*) – (*β + αβ*), B = (*α + β + αβ*) – (*α + αβ*) and AB = (*α + β + αβ*) – (*α + β*), and so we obtain the values for the *SS_{Reg}, df_{Reg}* and R Square for A, B and AB (the lower part of Figure 5) from the values in the upper part of Figure 5.

**Figure 5 – Regression analysis for partial models**

For example, *SS_{A}* is *SS_{Reg}* for the full model minus *SS_{Reg}* for the *β + αβ* model, and similarly for the other terms in the table above.
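The full-model-minus-reduced-model procedure can be sketched in Python with NumPy. This is a minimal sketch, not the Real Statistics implementation: it assumes effect coding for both factors, the function names are hypothetical, and the data at the bottom are made up (they are not the data of Example 1).

```python
import numpy as np

def effect_codes(levels, n_levels):
    """Effect-code integer levels 0..n_levels-1 into n_levels-1 columns."""
    levels = np.asarray(levels)
    codes = np.zeros((len(levels), n_levels - 1))
    for k in range(n_levels - 1):
        codes[levels == k, k] = 1.0
    codes[levels == n_levels - 1, :] = -1.0   # last level is coded -1 everywhere
    return codes

def ss_reg(columns, y):
    """Regression sum of squares for a least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum((X @ beta - y.mean()) ** 2))

def anova_by_regression(a, b, y, n_a, n_b):
    """SS for A, B and AB as full-model SS minus reduced-model SS."""
    y = np.asarray(y, dtype=float)
    A = effect_codes(a, n_a)
    B = effect_codes(b, n_b)
    # Interaction columns are products of the main-effect columns
    AB = np.column_stack(
        [A[:, i] * B[:, j] for i in range(n_a - 1) for j in range(n_b - 1)]
    )
    full = ss_reg([A, B, AB], y)               # alpha + beta + alpha*beta
    ss_total = float(np.sum((y - y.mean()) ** 2))
    return {
        "A": full - ss_reg([B, AB], y),        # full minus (beta + alpha*beta)
        "B": full - ss_reg([A, AB], y),        # full minus (alpha + alpha*beta)
        "AB": full - ss_reg([A, B], y),        # full minus (alpha + beta)
        "Within": ss_total - full,             # residual SS of the full model
        "Total": ss_total,
    }

# Hypothetical unbalanced 2 x 3 data; every cell has at least one observation
a = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
b = [0, 0, 1, 1, 1, 2, 0, 0, 0, 1, 2, 2, 2]
y = [12.0, 15.0, 21.0, 19.0, 24.0, 30.0, 14.0, 11.0, 13.0, 28.0, 25.0, 27.0, 22.0]

result = anova_by_regression(a, b, y, n_a=2, n_b=3)
for term, ss in result.items():
    print(f"{term:>6}: {ss:.3f}")
```

For balanced data the A, B, AB and Within values returned by this sketch add up to the Total value; for unbalanced data they generally do not.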

Using these facts, we can obtain the output for ANOVA as shown in Figure 6. The values for Rows (A), Columns (B) and Interaction (AB) come from Figure 5. The Within values come from the values for *SS _{Res}, df_{Res}, MS_{Res}* in the complete regression model (Figure 4). The Total values also come from the values in Figure 4.

**Figure 6 – ANOVA output for Example 1**

Note that *SS_{A} + SS_{B} + SS_{AB} + SS_{W}* = 14513.90 < 14702.07 = *SS_{T}*, since the above model doesn’t quite account for all the variation. While in the balanced model A, B and AB partition the total variation, in the case of unbalanced models A, B and AB overlap. The model we have given leaves out the overlap, but corresponds to the approach of equally weighting the cell means.

From the analysis we see that there are no significant differences between Blend or Crop means, but there is a significant difference in the interactions. We can now perform the usual comparison t-tests to further investigate these differences.

**Real Statistics Data Analysis Tool**: The **Two Factor ANOVA** data analysis tool provided by the Real Statistics Resource Pack contains a **Regression** option which automates the above procedure. If the input data is in Excel Two Factor ANOVA format, the data is first converted to standard format and then the appropriate regression models are generated.

For example, to perform the analysis for Example 1, click on cell F1 (where the output will start), press **Ctrl-m** and select the **Two Factor ANOVA** option from the menu that appears.

**Figure 7 – Dialog box for unbalanced Anova models**

When the dialog box in Figure 7 appears, enter A4:D14 in the **Input Range**, click on **Column/row headings included with data**, select **Excel format** as the **Input Format**, select **Regression** as the **Analysis Type** and click on the **OK** button. The output is shown in Figures 8 and 9.

**Figure 8 – Unbalanced Two Factor ANOVA (part 1)**

**Figure 9 – Unbalanced Two Factor ANOVA (part 2)**

The first step in the analysis is that the input data is converted to standard format. The usual descriptive statistics are then calculated and the regression form of analysis of variance is computed. Finally the rows and columns of the original input data are exchanged to facilitate follow-up tests.

If the input data had been in standard format the analysis would have proceeded in a similar manner, although no conversions would have been performed.

**Observation**: The approach described in this section requires that every combination of factor levels (i.e. every cell) contains at least one data element. E.g. in Figure 8, if one of the cells in the range K5:M6 contains a zero value, then the output from the analysis will be in error.

**Observation**: When the **Regression** option of the **Two Factor ANOVA** data analysis tool is chosen you are limited to 64 independent variables (i.e. the same limitation as the **Linear Regression** data analysis tool described in Multiple Regression Analysis). This means that if *a* = the number of levels for factor *A* and *b* = the number of levels for factor *B,* then *ab* can be at most 64.

**Observation**: The approach described here for two factor ANOVA can be extended to ANOVA with more than two factors. In Three Factor ANOVA using Regression we show how this is done.

Hi. Thank you for this post.

I would like to figure out if this is the approach i need to take.

here is my situation.

I have two groups of students (X and Y)

A survey was given at 2 time points (Pre and Post)

The survey has 4 categories of questions (1-4).

some students did not do both surveys.

the X and Y groups do not have equal numbers of students.

After removing the students who only did one survey, I think the proper analysis would be a repeated-measures 2-way unbalanced ANOVA (one analysis for each category of questions seems ok — rather than a 3-way ANOVA?)

does doing the regression like this work when (1) only pre/post is a repeated measure and (2) the X and Y groups are unbalanced?

Adam,

I don’t believe that the version of repeated measures ANOVA that is described on the website or supported by the Real Statistics software will work in this case. I plan to add a new version that will handle unbalanced models.

Charles

thank you charles for the quick response. Am I right in thinking that that is what I need?

I do have access to other software like matlab that can do regression. I just don’t know how to set it up. Can you hint as to what would be necessary?

Adam,

This would depend on the software that you use. I don’t use matlab, and so can’t help you with that.

Charles

Hi Charles,

thank you for the input. I have a related question.

is it possible to turn my unbalanced ANOVA into a balanced one, and then use Real Statistics to do a 2-way mixed (one factor is a repeated measure but the other is not) balanced ANOVA?

one of my groups has 50 samples and the other has 20. I was thinking of taking a random sample of 20 from the larger group to equalize the sizes before analysis. But i wonder if i would need to repeat that procedure 1000 times or something and then if the stats are significant (or not) 95% of the time then it would be ok. Does that make sense or is it totally unkosher?

Adam,

You can use this approach. The main drawback is that you will lose power because of the reduced sample size.

Charles

Hello Charles,

Is it possible to use this method in this case? I have 4 groups and each group contains very different number of samples, one has 32, 19, 28 and the last group contains only 1 sample. This seemed hopeless, but hope you can help. =) Thanks.

This method works with groups that have different numbers of samples, but not when one sample has only one element.

Charles

Oh I see. Thank you Charles.

But are there other possible statistical methods applicable for this case? =(

The only thing I can think of is to drop the group with only one sample or to combine it with another group.

Charles

If the degrees of freedom is 99 for columns and inter in Figure 9 type of data, it cannot calculate the SS or F or any P-value. Is there any limit for df? How to deal with that? Thanks.

I don’t know of any such limit, but if you send me an Excel file with your data and calculations I will try to figure out what is going on. You can find my email address at:

Contact Us

Charles

Could you explain why it is better philosophically to use regression (which is essentially an unweighted means approach) than to use sub-sampling to achieve equal n or simply to run ANOVA? What are the implications?

Sandy,

In any case ANOVA (with or without equal samples) is really a type of regression. If you can achieve a balanced model then by all means run the classical ANOVA model (the regression approach will yield the same results as the ANOVA approach).

Charles

among the different types of anova, which ones cannot be handled by multiple linear regression?

All the forms of ANOVA that I am familiar with can be handled by multiple linear regression or some other form of the regression.

Charles

Could you please explain how to estimate the standard error of the coefficients given in Figure 3. Thank you in advance.

The output in Figure 3 comes from Excel’s Regression data analysis tool. How to estimate the standard errors of regression coefficients is explained on the webpage Multiple Regression, esp. the Least Squares Method and Multiple Regression Analysis.

Charles

Hi,

I believe I am following all of your directions correctly, but I keep getting the following message: “input in standard form cannot contain an empty cell.”

Thanks,

Rachel

Rachel,

If your input data is in the format of range A4:D14 of Figure 8 (part 1), then choose the Excel format and Regression options. Your input may have some empty cells but you shouldn’t get an error message.

If your input data is in the format of range F3:H29 of Figure 8 (part 1), then choose the Standard format and Regression options. Your data cannot have any empty cells or you will receive an error message.

If you are doing either of these correctly and are still getting the error message, then something else has gone wrong. In this case, if you send me an Excel file with your data I will try to figure out what has gone wrong.

Charles

Oh, I see. I was choosing standard format when I should have been choosing excel. Now I get the following message: “number of rows per sample must divide number of rows in input range evenly.” Your example has an even number of rows across conditions, whereas I have 36 in one and 51 in the other. Is it still possible to use Excel to analyze these data?

Thanks again for your help,

Rachel

Rachel,

I believe that all you need to do is make the number of rows per group the same. Since the model is unbalanced you just need to fill the smaller group with empty cells.

Charles

That worked. Thanks!

Hi Charles,

After using the unbalanced two-factor ANOVA, is it possible to run a post-hoc test to determine between which factors there are significant differences? I have two independent variables, one with two levels and the other with three and one dependent variable.

As Tukey’s HSD requires equal group sizes, which test would you recommend? I have read that the Scheffe procedure allows different group sizes but is very conservative. Is it at all possible to carry this out using Excel?

Thanks for this page, it has been very helpful.

Hi,

A good choice in this case may be to use the Games-Howell test. See the webpage http://www.real-statistics.com/one-way-analysis-of-variance-anova/unplanned-comparisons/ for this and other choices. This test is included in the Real Statistics Resource Pack for use in Excel.

Charles

Hello Charles,

My experiment is – we have asked set of questions to 4 different org- 1) with leadership A &B 2) With leadership A & not B 3) with leadership B & not A 4) no A and no B

The questions answered on likert scale of 1-5. I am planning to use ANOVA with regression to see how the answers to question differ based on type of leadership. the sample sizes are unequal. My questions- 1.using ANOVA with regression is correct?

2. Does the category with no leadership represent the intercept, or should it be considered a separate group?

In general, you should be able to use regression to perform the ANOVA. I need to better understand the 4 org and your data better to answer your specific questions. E.g. are the samples for the 4 different orgs independent? Please provide more details.

Charles

The samples for different org are independent. the sample sizes for each type is-

with leadership A &B – 67

With leadership A & not B -4

3) with leadership B & not A – 94

4) no A and no B- 45

If the four samples are independent then each of the four would be a separate group.

Charles