ANOVA using Regression

As we saw in Linear Regression Models for Comparing Means, categorical variables can often be used in a regression analysis by first replacing the categorical variable by a dummy variable (also called a tag variable).

We now illustrate more complex examples, and show how to perform Two Factor ANOVA using multiple regression. See Three Factor ANOVA using Regression for information about how to apply these techniques to factorial ANOVA with more than two factors.

Example 1: Repeat the analysis from Example 1 of Basic Concepts for ANOVA with the sample data in the table on the left of Figure 1 using multiple regression.


Figure 1 – Data for Example 1

Our objective is to determine whether there is a significant difference between the three flavorings. In this example, we have reduced the sample size from Example 1 of Basic Concepts for ANOVA to better illustrate the key concepts. Instead of doing the analysis using ANOVA as we did there, this time we will use regression analysis. First we define the following two dummy variables and map the original data into the model on the right side of Figure 1.

t1 = 1 if flavoring 1 is used; = 0 otherwise
t2 = 1 if flavoring 2 is used; = 0 otherwise

Note that, in general, if the categorical variable takes k values, the model will require k – 1 dummy variables.

The null hypothesis is

H0: µ1 = µ2 = µ3

where µj = the population mean of Flavor group j. The linear regression model is

y = β0 + β1t1 + β2t2 + ε

Note that

µ1 = β0 + β1

since for the Flavor 1 group, t1 = 1 and t2 = 0

µ2 = β0 + β2

since for the Flavor 2 group, t1 = 0 and t2 = 1

µ3 = β0

since for the Flavor 3 group, t1 = 0 and t2 = 0

Thus the null hypothesis given above is equivalent to

H0: β0 + β1 = β0 + β2 = β0

Simplifying, this means that the null hypothesis is equivalent to:

H0:  β1 = β2  = 0

The results of the regression analysis are displayed in Figure 2.


Figure 2 – Regression analysis for data in Example 1

We now compare the regression results from Figure 2 with the ANOVA on the same data found in Figure 3. Note that the F value 0.66316 is the same as that in the regression analysis. Similarly, the p-value .52969 is the same in both models.
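This equivalence is easy to verify programmatically. The following Python sketch uses hypothetical scores (chosen so that the group means match those noted below – 12, 11.5 and 14 – but NOT the actual Figure 1 data, so the F value will differ from 0.66316) and checks that regression on two dummy variables yields exactly the same F statistic and p-value as one-way ANOVA.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three flavor groups, chosen so that the group
# means are 12, 11.5 and 14 as in the text; NOT the actual Figure 1 data
groups = [[13, 11, 12, 12], [10, 12, 13, 11], [14, 15, 13, 14]]
y = np.concatenate(groups)

# Dummy coding with Flavor 3 as the reference group
t1 = np.concatenate([np.ones(4), np.zeros(8)])
t2 = np.concatenate([np.zeros(4), np.ones(4), np.zeros(4)])
X = np.column_stack([np.ones(12), t1, t2])

# Ordinary least squares fit and the overall F test of H0: beta1 = beta2 = 0
b, *_ = np.linalg.lstsq(X, y, rcond=None)
ss_res = np.sum((y - X @ b) ** 2)
ss_reg = np.sum((X @ b - y.mean()) ** 2)
df_reg, df_res = 2, len(y) - 3
f_reg = (ss_reg / df_reg) / (ss_res / df_res)
p_reg = stats.f.sf(f_reg, df_reg, df_res)

# One-way ANOVA on the same data gives the identical F and p-value
f_anova, p_anova = stats.f_oneway(*groups)
print(f_reg, f_anova)
```

The fitted coefficients also illustrate the observations below: b0 equals the Flavor 3 mean (14), while b1 and b2 are the differences of the Flavor 1 and Flavor 2 means from the Flavor 3 mean (-2 and -2.5).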


Figure 3 – ANOVA for data in Example 1

Note the following about the regression coefficients:

  • The intercept b0 = mean of the Flavor 3 group = 14.
  • The coefficient b1 for variable t1 = mean of the Flavor 1 group – mean of the Flavor 3 group = 12 – 14 = -2
  • The coefficient b2 for variable t2 = mean of the Flavor 2 group – mean of the Flavor 3 group = 11.5 – 14 = -2.5

This is consistent with what we noted above when relating the population group means to the population coefficients, namely µ3 = β0, µ1 = β0 + β1 and µ2 = β0 + β2.

Example 1 (alternative approach): An alternative coding for Example 1 is as follows

t1 = 1 if flavoring 1 is used; = -1 if flavoring 3 is used; = 0 otherwise
t2 = 1 if flavoring 2 is used; = -1 if flavoring 3 is used; = 0 otherwise

In general, if there are k groups, then for j = 1, …, k – 1 the jth dummy variable tj = 1 for the jth group, = -1 for the kth group, and = 0 otherwise.

The data now can be expressed as in the table on the left of Figure 4.


Figure 4 – Alternative coding for data in Example 1

The null hypothesis and linear regression model are as before. Now we have:

µ1 = β0 + β1

since for the Flavor 1 group, t1 = 1 and t2 = 0

µ2 = β0 + β2

since for the Flavor 2 group, t1 = 0 and t2 = 1

µ3 = β0 – (β1 + β2)

since for the Flavor 3 group, t1 = -1 and t2 = -1

Thus the null hypothesis is equivalent to β0 + β1 = β0 + β2 = β0 – (β1 + β2). Simplifying, this means once again that the null hypothesis is equivalent to:

H0: β1 = β2  = 0

Note too that μ3 = β0 – (β1 + β2) = β0 – (μ1 – β0 + μ2 – β0), and so β0 = (μ1 + μ2 + μ3)/3, i.e. β0 = the population grand mean. Also β1 = μ1 – β0 and β2 = μ2 – β0, and so β1 = the population Flavor 1 mean less the population grand mean and β2 = the population Flavor 2 mean less the population grand mean.

The results of the regression analysis are given on the right side of Figure 4.

The first Summary and ANOVA tables are identical to the results from the previous analysis, and so once again we see that the results are the same as for the ANOVA. The regression coefficients, however, are different.

Figure 5 displays the grand mean, the group means and the group effect sizes (i.e. the group mean less the grand mean).


Figure 5 – Group means and group effect sizes

We note that the intercept of the regression model is the grand mean 12.5 and the other coefficients correspond to the group effects for the Flavor 1 and Flavor 2 groups.
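This property of effect coding can also be checked in code. The sketch below (again with hypothetical three-group data whose means are 12, 11.5 and 14, NOT the values in Figure 4) fits the effect-coded model and confirms that, for a balanced design, the intercept equals the grand mean (12.5) and the other coefficients equal the group effects.

```python
import numpy as np

# Hypothetical scores for three flavor groups with means 12, 11.5 and 14
# (NOT the actual data in Figure 4)
groups = [[13, 11, 12, 12], [10, 12, 13, 11], [14, 15, 13, 14]]
y = np.concatenate(groups)

# Effect coding: tj = 1 for group j, -1 for the last group, 0 otherwise
t1 = np.concatenate([np.ones(4), np.zeros(4), -np.ones(4)])
t2 = np.concatenate([np.zeros(4), np.ones(4), -np.ones(4)])
X = np.column_stack([np.ones(12), t1, t2])

b, *_ = np.linalg.lstsq(X, y, rcond=None)

# For a balanced design the intercept is the grand mean and the other
# coefficients are the group effects (group mean less grand mean)
grand_mean = y.mean()
effects = [np.mean(g) - grand_mean for g in groups]
print(b[0], grand_mean)   # intercept = grand mean = 12.5
print(b[1], effects[0])   # Flavor 1 effect = -0.5
print(b[2], effects[1])   # Flavor 2 effect = -1.0
```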

Example 2: Repeat the analysis from Example 1 of Two Factor ANOVA with Replication on the reduced sample data in the table on the left of Figure 6 using multiple regression.


Figure 6 – Data for Example 2

This time we show how to perform a two factor ANOVA using multiple regression. As we did in the previous example, we first define the dummy variables as follows:

t1 = 1 if Blend X; = 0 otherwise
t2 = 1 if Corn; = 0 otherwise
t3 = 1 if Soy; = 0 otherwise

The data now takes the form shown in Figure 7 where y is the yield.


Figure 7 – Coded data for Example 2

Note that this time we model the interaction of t1 with t2 and t3, as described in Interaction. The regression model that we use is of form

y = β0 + β1t1 + β2t2 + β3t3 + β4t1t2 + β5t1t3 + ε

We now build a table of the means for each of the 6 groups (i.e. cells), as described in Figure 8.


Figure 8 – Group means for Example 2

This table can be constructed by calculating the means of each of the above 6 groups from the original data or by applying the AVERAGEIFS function to the transformed data.

As we did in Example 1, we note that the mean for the case of Blend Y and Rice (i.e. where t1 = t2 = t3 = 0) is given by

µY,Rice = β0

And similarly for the other combinations:

µX,Rice = β0 + β1
µY,Corn = β0 + β2
µY,Soy = β0 + β3
µX,Corn = β0 + β1 + β2 + β4
µX,Soy = β0 + β1 + β3 + β5

Solving the simultaneous equations, we get the following values for the coefficients:

b0 = 165.4   b1 = -24.2   b2 = -5.8   b3 = -25.2   b4 = 0   b5 = 59.8

We get the same results when we run the Regression data analysis tool (see Figure 9).


Figure 9 – Regression for data in Example 2

The relatively high value of Multiple R and the low value of Significance F indicate that the above model is a reasonably good fit. Using the ANOVA: Two factor data analysis tool, we get the output shown in Figure 10.


Figure 10 – Two factor ANOVA for the data in Example 2

We now show how to obtain the ANOVA results from the Regression model and vice versa. Note that MSW = 450.33 = MSRes, which is as expected since both of these denote the portion of the variation due to error. Also note that MST = 17457.87/29 = 602.00 for both models, and so the total variation for both models is the same as well. For the ANOVA model, the combined between-groups variation is

MSBet = (SSRow + SSCol + SSInt) / (dfRow + dfCol + dfInt) = (136.53 + 553.27 + 5960.07) / (1 + 2 + 2) = 6649.87/5 = 1329.97

This is the same as MSReg = 6649.87/5 = 1329.97 for the Regression model.

To obtain the Rows (A), Columns (B) and Interaction (AB) values in the ANOVA model from the Regression model, first rerun the regression analysis using only t1 as an independent variable. The values obtained for SSReg, dfReg and MSReg are the values of SSRow, dfRow and MSRow in the ANOVA model. Then rerun the regression analysis using only t2 and t3. The values obtained for SSReg, dfReg and MSReg are the values of SSCol, dfCol and MSCol in the ANOVA model. Now SSInt = SSBet – SSRow – SSCol (and similarly for the df terms), where SSBet is the value of SSReg in the original (complete) regression model.
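This rerun-the-regression recipe can be sketched in code. The example below uses hypothetical balanced 2×3 data with 2 replications per cell (NOT the Example 2 values) and recovers SSRow from a regression on t1 alone, SSCol from a regression on t2 and t3, and SSInt by subtraction from SSBet.

```python
import numpy as np

def ss_reg(X, y):
    """Regression sum of squares for an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((X @ b - y.mean()) ** 2)

# Hypothetical balanced 2x3 layout, 2 replications per cell (12 observations)
rng = np.random.default_rng(0)
t1 = np.repeat([1, 0], 6)                 # rows: Blend X vs Blend Y
t2 = np.tile(np.repeat([1, 0, 0], 2), 2)  # columns: Corn
t3 = np.tile(np.repeat([0, 1, 0], 2), 2)  # columns: Soy
y = 100 + 5 * t1 - 3 * t2 + 4 * t1 * t3 + rng.normal(0, 2, 12)

# Full model gives SSBet; partial models give SSRow and SSCol
ss_bet = ss_reg(np.column_stack([t1, t2, t3, t1 * t2, t1 * t3]), y)
ss_row = ss_reg(t1.reshape(-1, 1), y)
ss_col = ss_reg(np.column_stack([t2, t3]), y)
ss_int = ss_bet - ss_row - ss_col
print(ss_row, ss_col, ss_int)
```

For balanced data, ss_row and ss_col computed this way match the classical between-groups formulas, and the full-model SSBet equals the sum of squares of the cell means about the grand mean.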

Finally note that the value of R Square = .381. This has two interpretations. First, it is the square of Multiple R (whose value = .617), the multiple correlation coefficient. Second, it measures the proportion of variation explained by the regression model (or by the ANOVA model), which is

SSReg/SST = 6649.87/17457.87 = 0.381

which is also equal to 1 – SSW/SST from the ANOVA model.

Observation: Just as we did in the single factor ANOVA of Example 1, we can obtain similar results for Example 2 using the alternative coding of dummy variables, namely

t1 = 1 if Blend X; = -1 otherwise
t2 = 1 if Corn; = -1 if Rice; = 0 otherwise
t3 = 1 if Soy; = -1 if Rice; = 0 otherwise

This approach is especially useful in creating unbalanced ANOVA models, i.e. where the sample sizes are not equal in a factorial ANOVA (see Unbalanced Factorial Anova).

Real Statistics Function: The following array supplemental function is contained in the Real Statistics Resource Pack.

SSAnova2(R1, r) – returns a column array with SSRow, SSCol, SSInt and SSW for a two factor ANOVA for the data in R1 using a regression model; if r > 0 then R1 is assumed to be in Excel Anova format with r rows per sample, while if r = 0 or is omitted then R1 is assumed to be in standard format; data is without headings.

29 Responses to ANOVA using Regression

  1. Himanshu Arora says:

    Hi Charles,
    Can you show me an example of how to convert a one variable linear regression problem to ANOVA. I am struggling in how to divide the values in 2 groups.
    Please help.
    Distance (miles) Cost (USD)
    337 59.5
    2565 509.5
    967 124.5
    5124 1480.4
    2398 696.23
    2586 559.5
    7412 1481.5
    522 474.5
    1499 737.5

    The distance in miles is my predictor variable, and Cost(in USD) is my dependent variable.
    Regards

    • Charles says:

      The referenced webpage describes how to convert an ANOVA problem into a linear regression problem, not the reverse.
      Why do you want to convert a linear regression into an ANOVA problem?
      Charles

  2. Michelle Wah says:

    Hi. I’m stuck. I’m studying independent restaurants featuring vegan food in three categories of restaurant types. I want to determine which of the three are successful by using the rate of return of the past three years. Does this fall under the category of ANOVA regression? And to confirm, vegan and RofR are dependent variables and my three categories are IV?

    Thanks

    • Charles says:

      Michelle,
      I don’t completely understand the scenario that you are describing. Can you give a specific example?
      For example, suppose you have categories of restaurants A, B and C and 10 restaurants in each category along with their rate of return over the past three years. You should be able to use one-way ANOVA to determine whether there is a significant difference in the rate of return among the three categories. Note that since this is a balanced model (all categories have the same number of restaurants in the samples), you don’t need to use regression. See One-way ANOVA for details.
      Charles

  3. Alessandro says:

    Hello,
    Thanks for the clear description, that helped me a lot.
    I have two further questions about ANOVA using regression:
    1) time series: suppose that the data of example 2 are a subset of a 3 years long experiment, i.e., that the same data have been also collected in two subsequent years. Suppose also that the results of year 1 influence those of year 2, and that those of year 1 and 2 influence those of year 3, i.e., that this should be considered as a time series. In order to estimate the year effect, could it be enough to add a further dummy variable t4 accounting for the year effect, with values 1; 2; 3 (or -1; 0; +1?) for the corresponding years? And should be the interaction terms accounted for by t1*t4, t2*t4, t3*t4, t1*t2*t4 and t1*t3*t4?
    2) nested factors: now suppose that (always starting from example 2) instead of “blend X” and “blend Y” we have “rice” and “corn” and that for both rice and corn we have 3 varieties (rice1, rice2, rice3 for rice; corn1, corn2, corn3 for corn). How do we deal with the fact that the varieties are nested within the plant types? And what about year effect (like in previous example) and interactions in this case?
    Thanks for the patience, even only for reading the questions until the end 🙂
    Alessandro

    • Charles says:

      Alessandro,

      1. I can view this as ANOVA with two fixed factors and one repeated measures factor. Alternatively, I can look at this as some sort of multivariate time series analysis. Your approach seems reasonable, but I honestly haven’t had time to verify that it is correct.

      2. See Nested ANOVA for more information about such examples.

      Charles

  4. Amina says:

    Hello
    I would like to know the importance of ANOVA after running a moderated and hierarchical regression analysis. Research question: Do employee job satisfaction and demographic variables explain a significant amount of variance of ostracism?

  5. Cheryl Hennessy says:

    I just had a question: how is ANOVA like a multiple regression, as in what are the specific similarities? Hope you can answer my question soon. Thank you!

    • Charles says:

      Cheryl,
      I am asserting that you can carry out ANOVA by redefining it as a regression problem.
      Charles

      • Cheryl Hennessy says:

        I’m not working on a problem exactly. It’s actually a question my professor gave as homework to answer and I have found it difficult to answer! The question is literally: How is ANOVA like a multiple regression? That’s all it says and it goes to the next question. So, I’m stuck!

  6. debanandabehera says:

    What is the definition of analysis of variance in a k-variable regression model?

  7. Gerard says:

    Thank you for your extremely useful website.
    I have a question. I want to know whether a specific chicken feed affects height, length and weight of chickens. So I have a team of three raters, each recorded weight, length and height of each 50 chicken at time 0, 1 and 2 months after this specific feed.
    How should I analyse these data?
    I plan to do intraclass correlation coefficient first to ensure the reliability of different raters. Should I use one factor at one time-point (such as weight at time 0)?
    How do I test for normality? Do I average weights of each chicken and test for normality, then do the same for length and weight?
    If I want to look at one factor (such as height), I shall then do repeated measures ANOVA. But if I want to look at three factors (height, weight and length), do I do repeated measures ANOVA for each factor separately? Is there a better way? From my understanding, two-factor ANOVA with replication does not apply to this situation.
    If I reject the null hypothesis, do I then do repeated measures ANOVA for each factor separately?
    If I reject the null hypothesis of weight alone, how do I do post-hoc analysis in this situation?
    I hope my question is not too troublesome. I look forward to hearing from you in due course and thank you in advance for your help.
    Kind regards
    Gerard

    • Charles says:

      Gerard,

      Before answering any of these questions, it is important to understand what it is that you are trying to prove; i.e. what hypothesis you are trying to test. It seems that you are trying to understand whether the chicken feed is significantly better (in terms of the height, length and weight of chickens) compared to something unstated, probably the existing chicken feed. In this case you probably want to use a MANOVA test (since you are comparing height, length and weight all at the same time).

      You have given three time references 0, 1 and 2 months. If you care about the comparisons for all three periods then you will need a repeated measures test. This might lead to a repeated measures MANOVA.

      You also have three raters, but I don’t understand their role. Does each rater measure the height, length and weight of each chicken or is their role different?

      Before answering the detailed questions, please clarify the above issues.

      Charles

  8. Craig atkinson says:

    Can you give a brief answer on why the value of a given coefficient in a predictive equation is exactly half the numerical value of the corresponding effect size in a factorial DOE ANOVA analysis?

    • Charles says:

      Craig,
      Which measure of effect size for a factorial Anova are you referring to? Can you give me an example where the coefficient is half the effect size?
      Charles

  9. Ashutosh says:

    Ok. I shall go through it. Thanks a ton.

  10. Ashutosh says:

    Hi Charles, did you get my question? Your first response to my question was inspiring and expecting again a good response to my question please. Thank you.

  11. Ashutosh says:

    So I want to know how do we calculate Sequential Sum of Squares and Adjusted Sum of Squares for Interaction of a Factor with self. i.e. A^2 or B^2 etc…

    • luc says:

      Hi,

      Same problem for me.

      Have you found any solution for quadratic term SS computation?

      Thanks a lot

      Luc

  12. Ashutosh says:

    Hi Charles,
    Thanks a lot for your reply. We are talking about the ANOVA problem. And yes, Whether the model fits or not is a different issue.

    I am not looking for regression…Basically if you take Response Surface Designs (Central Composite or Box Behnken). Let me explain what I am looking for…You have 3 factors (A,B,C). Now in CCD or BBD you get ANOVA for

    Term Seq.SS Adj. SS Adj.MS F and P
    ————————————————–
    A
    B
    C
    AB
    AC
    BC
    A^2
    B^2
    C^2

    Here, I got the solution to calculate Seq. SS of Factor A,B,C. Cross Interaction of Factors – AB,AC and BC. But unable to get the value of Quadratic Terms i.e. A^2 or B^2 or C^2.
    Same way, I could generate the exact value in Adj. SS for Factor A,B,C. Cross Interaction of Factors (AB,AC and BC). But unable to generate the exact value of Quadratic Terms i.e. (A^2 = AA, B^2 and C^2).

  13. Ashutosh says:

    Hi,

    Everybody talks about and the formulas are given for calculating sequential and adjusted sum of squares of individual factor i.e. Factor A…

    But nobody has given an explanation of how to calculate the sequential or adjusted sum of squares for a factor squared, i.e. Factor A^2. Any idea how to calculate the sequential sum of squares for a factor squared (A^2)?

    • Charles says:

      Ashutosh,
      If Factor A^2 means the square of the values in factor A then it seems to me that we are still just talking about an Anova problem using the same formulas as before and merely squaring the data values. Whether the model fits is a different issue. The situation is similar for polynomial regression problems as described in http://www.real-statistics.com/multiple-regression/polynomial-regression/. Is this the issue that you are raising? Perhaps I am missing the point.
      Charles
