As we saw in Linear Regression Models for Comparing Means, categorical variables can often be used in a regression analysis by first replacing the categorical variable by a dummy variable (also called a tag variable).

We now illustrate more complex examples, and show how to perform Two Factor ANOVA using multiple regression. See Three Factor ANOVA using Regression for information about how to apply these techniques to factorial ANOVA with more than two factors.

**Example 1**: Repeat the analysis from Example 1 of Basic Concepts for ANOVA with the sample data in the table on the left of Figure 1 using multiple regression.

Our objective is to determine whether there is a significant difference between the three flavorings. In this example, we have reduced the sample size from Example 1 of Basic Concepts for ANOVA to better illustrate the key concepts. Instead of performing the analysis using ANOVA as we did there, this time we will use regression analysis. First, we define the following two dummy variables and map the original data into the model on the right side of Figure 1.

*t*_{1} = 1 if flavoring 1 is used; = 0 otherwise

*t*_{2} = 1 if flavoring 2 is used; = 0 otherwise

Note that in general, if the original categorical variable has *k* values, the model will require *k* – 1 dummy variables.
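The *k* – 1 rule can be illustrated with a short sketch in Python. The function name `dummy_code` and the level labels are assumptions made for illustration only; the last level in the list is treated as the reference group, matching the role of Flavor 3 in this example.

```python
def dummy_code(value, levels):
    """Return k-1 dummy variables for `value`; the last level is the reference group."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    # One 0/1 indicator per non-reference level; the reference level maps to all zeros.
    return [1 if value == level else 0 for level in levels[:-1]]


flavors = ["Flavor 1", "Flavor 2", "Flavor 3"]  # Flavor 3 is the reference group
for flavor in flavors:
    print(flavor, dummy_code(flavor, flavors))
```

With three flavorings the function returns two values (*t*_{1}, *t*_{2}) per observation, and the reference group is identified by both being zero.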

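The null hypothesis is

H_{0}: *µ*_{1} = *µ*_{2} = *µ*_{3}

where *µ*_{*j*} is the population mean of *x*_{*j*}, the score for Flavor group *j*. The linear regression model is

*y* = *β*_{0} + *β*_{1}*t*_{1} + *β*_{2}*t*_{2} + *ε*

where *y* is the score. Note that

*µ*_{1} = *β*_{0} + *β*_{1}, since for the Flavor 1 group, *t*_{1} = 1 and *t*_{2} = 0

*µ*_{2} = *β*_{0} + *β*_{2}, since for the Flavor 2 group, *t*_{1} = 0 and *t*_{2} = 1

*µ*_{3} = *β*_{0}, since for the Flavor 3 group, *t*_{1} = 0 and *t*_{2} = 0

Thus the null hypothesis given above is equivalent to

*β*_{0} + *β*_{1} = *β*_{0} + *β*_{2} = *β*_{0}

Simplifying, this means that the null hypothesis is equivalent to:

H_{0}: *β*_{1} = *β*_{2} = 0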

The results of the regression analysis are displayed in Figure 2.

We now compare the regression results from Figure 2 with the ANOVA on the same data found in Figure 3. Note that the *F* value 0.66316 is the same as that in the regression analysis. Similarly, the p-value .52969 is the same in both models.

Note the following about the regression coefficients:

- The intercept *b*_{0} = mean of the Flavor 3 group = 14.
- The coefficient *b*_{1} for variable *t*_{1} = mean of the Flavor 1 group – mean of the Flavor 3 group = 12 – 14 = -2.
- The coefficient *b*_{2} for variable *t*_{2} = mean of the Flavor 2 group – mean of the Flavor 3 group = 11.5 – 14 = -2.5.

This is consistent with what we noted above when relating the population group means to the population coefficients, namely *µ*_{3} = *β*_{0}, *µ*_{1} = *β*_{0} + *β*_{1} and *µ*_{2} = *β*_{0} + *β*_{2}.
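The relationship between the regression coefficients and the group means, and the agreement of the regression and ANOVA *F* statistics, can be checked numerically. The following is a minimal sketch in Python under stated assumptions: the scores are hypothetical (they are not the data of Figure 1), and `ols` is an illustrative helper that solves the normal equations directly rather than any particular library routine.

```python
def ols(X, y):
    """Least-squares coefficients obtained by solving the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [v / pv for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][p] for i in range(p)]


# Hypothetical scores, four per flavoring (NOT the data from Figure 1).
scores = {1: [13, 17, 19, 11], 2: [12, 8, 10, 14], 3: [20, 16, 18, 18]}

# Design matrix rows [1, t1, t2], with Flavor 3 as the reference group.
X, y = [], []
for group, values in scores.items():
    for v in values:
        X.append([1, int(group == 1), int(group == 2)])
        y.append(v)

b0, b1, b2 = ols(X, y)
means = {g: sum(v) / len(v) for g, v in scores.items()}

# F statistic from the regression decomposition
n = len(y)
grand = sum(y) / n
fitted = [b0 * r[0] + b1 * r[1] + b2 * r[2] for r in X]
ss_reg = sum((f - grand) ** 2 for f in fitted)
ss_res = sum((yi - f) ** 2 for yi, f in zip(y, fitted))
f_regression = (ss_reg / 2) / (ss_res / (n - 3))

# F statistic from the one-way ANOVA decomposition
ss_bet = sum(len(v) * (means[g] - grand) ** 2 for g, v in scores.items())
ss_w = sum((yi - means[g]) ** 2 for g, v in scores.items() for yi in v)
f_anova = (ss_bet / (len(scores) - 1)) / (ss_w / (n - len(scores)))

print(b0, b1, b2)
print(f_regression, f_anova)
```

In this sketch the intercept reproduces the mean of the reference group, each slope reproduces a difference of group means, and the two *F* values agree up to rounding error.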

**Example 1** (**alternative approach**): An alternative coding for Example 1 is as follows:

*t*_{1} = 1 if flavoring 1 is used; = -1 if flavoring 3 is used; = 0 otherwise

*t*_{2} = 1 if flavoring 2 is used; = -1 if flavoring 3 is used; = 0 otherwise

In general, if there are *k* groups, then the *j*th dummy variable *t*_{*j*} = 1 for the *j*th group, *t*_{*j*} = -1 for the *k*th group and *t*_{*j*} = 0 otherwise.

The data now can be expressed as in the table on the left of Figure 4.

The null hypothesis and linear regression model are as before. Now we have:

*µ*_{1} = *β*_{0} + *β*_{1}, since for the Flavor 1 group, *t*_{1} = 1 and *t*_{2} = 0

*µ*_{2} = *β*_{0} + *β*_{2}, since for the Flavor 2 group, *t*_{1} = 0 and *t*_{2} = 1

*µ*_{3} = *β*_{0} – *β*_{1} – *β*_{2}, since for the Flavor 3 group, *t*_{1} = -1 and *t*_{2} = -1

Thus the null hypothesis is equivalent to *β*_{0} + *β*_{1} = *β*_{0} + *β*_{2} = *β*_{0} – (*β*_{1} + *β*_{2}). Simplifying, this means once again that the null hypothesis is equivalent to:

H_{0}: *β*_{1} = *β*_{2} = 0

Note too that *µ*_{3} = *β*_{0} – (*β*_{1} + *β*_{2}) = *β*_{0} – (*µ*_{1} – *β*_{0} + *µ*_{2} – *β*_{0}), and so *β*_{0} = (*µ*_{1} + *µ*_{2} + *µ*_{3})/3, i.e. *β*_{0} = the population grand mean. Also *β*_{1} = *µ*_{1} – *β*_{0} and *β*_{2} = *µ*_{2} – *β*_{0}, and so *β*_{1} = the population Flavor 1 mean less the population grand mean and *β*_{2} = the population Flavor 2 mean less the population grand mean.

The results of the regression analysis are given on the right side of Figure 4.

The first Summary and ANOVA tables are identical to the results from the previous analysis, and so once again we see that the results are the same as for the ANOVA. The regression coefficients, however, are different.

Figure 5 displays the grand mean, the group means and the group effect sizes (i.e. the group mean less the grand mean).

We note that the intercept of the regression model is the grand mean 12.5 and the other coefficients correspond to the group effects for the Flavor 1 and Flavor 2 groups.
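The same relationship can be reproduced from the group means quoted earlier (12, 11.5 and 14). The sketch below is in Python; the helper `effect_code` and the variable names are illustrative assumptions, and a balanced design is assumed so that the grand mean equals the mean of the group means.

```python
def effect_code(value, levels):
    """k-1 effect-coded variables: the last level is coded -1 on every variable."""
    if value == levels[-1]:
        return [-1] * (len(levels) - 1)
    return [1 if value == level else 0 for level in levels[:-1]]


# Group means quoted in the text for Example 1
group_means = {"Flavor 1": 12.0, "Flavor 2": 11.5, "Flavor 3": 14.0}
levels = list(group_means)

# Balanced design assumed: the grand mean is the mean of the group means.
grand_mean = sum(group_means.values()) / len(group_means)

# Intercept = grand mean; each slope = group mean less the grand mean.
b = [grand_mean] + [group_means[level] - grand_mean for level in levels[:-1]]

# The fitted value b0 + b1*t1 + b2*t2 should reproduce every group mean.
fitted = {}
for level in levels:
    t = effect_code(level, levels)
    fitted[level] = b[0] + sum(bj * tj for bj, tj in zip(b[1:], t))
    print(level, t, fitted[level])
```

Note that the Flavor 3 effect does not appear as a coefficient; it is recovered as minus the sum of the other two effects.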

**Example 2**: Repeat the analysis from Example 1 of Two Factor ANOVA with Replication on the reduced sample data in the table on the left of Figure 6 using multiple regression.

This time we show how to perform a two factor ANOVA using multiple regression. As we did in the previous example, we first define the dummy variables as follows:

*t*_{1} = 1 if Blend X; = 0 otherwise

*t*_{2} = 1 if Corn; = 0 otherwise

*t*_{3} = 1 if Soy; = 0 otherwise

The data now takes the form shown in Figure 7 where y is the yield.

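Note that this time we model the interaction of *t*_{1} with *t*_{2} and *t*_{3}, as described in Interaction. The regression model that we use is of the form

*y* = *β*_{0} + *β*_{1}*t*_{1} + *β*_{2}*t*_{2} + *β*_{3}*t*_{3} + *β*_{4}*t*_{1}*t*_{2} + *β*_{5}*t*_{1}*t*_{3} + *ε*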

We now build a table of the means for each of the 6 groups (i.e. cells), as described in Figure 8.

This table can be constructed by calculating the means of each of the above 6 groups from the original data or by applying the AVERAGEIFS function to the transformed data.
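For readers working outside Excel, a rough Python analogue of this step is sketched below. It builds the dummy and interaction columns for each observation and then averages the yield within each (*t*_{1}, *t*_{2}, *t*_{3}) combination, which is the role AVERAGEIFS plays on the transformed data. The observations are hypothetical (they are not the Figure 6 data), and `design_row` and `cell_means` are illustrative names.

```python
from collections import defaultdict


def design_row(blend, crop):
    """Map (blend, crop) to [t1, t2, t3, t1*t2, t1*t3]; Blend Y and Rice are the reference levels."""
    t1 = int(blend == "X")
    t2 = int(crop == "Corn")
    t3 = int(crop == "Soy")
    return [t1, t2, t3, t1 * t2, t1 * t3]


# Hypothetical observations (blend, crop, yield); not the Figure 6 data.
observations = [
    ("X", "Corn", 130), ("X", "Corn", 140),
    ("X", "Soy", 170), ("X", "Soy", 180),
    ("X", "Rice", 140), ("X", "Rice", 144),
    ("Y", "Corn", 158), ("Y", "Corn", 162),
    ("Y", "Soy", 138), ("Y", "Soy", 142),
    ("Y", "Rice", 160), ("Y", "Rice", 170),
]


def cell_means(rows):
    """Average the yield within each (t1, t2, t3) combination."""
    groups = defaultdict(list)
    for blend, crop, y in rows:
        t1, t2, t3, _, _ = design_row(blend, crop)
        groups[(t1, t2, t3)].append(y)
    return {key: sum(values) / len(values) for key, values in groups.items()}


means = cell_means(observations)
for key in sorted(means):
    print(key, means[key])
```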

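As we did in Example 1, we note that the mean in the case for Blend Y and Rice (i.e. where *t*_{1} = *t*_{2} = *t*_{3} = 0) is given by

mean(Blend Y, Rice) = *b*_{0}

And similarly for the other combinations:

mean(Blend X, Rice) = *b*_{0} + *b*_{1}

mean(Blend Y, Corn) = *b*_{0} + *b*_{2}

mean(Blend Y, Soy) = *b*_{0} + *b*_{3}

mean(Blend X, Corn) = *b*_{0} + *b*_{1} + *b*_{2} + *b*_{4}

mean(Blend X, Soy) = *b*_{0} + *b*_{1} + *b*_{3} + *b*_{5}

where *b*_{4} and *b*_{5} are the coefficients of the interaction terms *t*_{1}*t*_{2} and *t*_{1}*t*_{3} respectively, and each mean on the left is the corresponding cell mean from Figure 8.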

Solving the simultaneous equations, we get the following values for the coefficients:

*b*_{0} = 165.4

*b*_{1} = -24.2

*b*_{2} = -5.8

*b*_{3} = -25.2

*b*_{4} = 0

*b*_{5} = 59.8

We get the same results when we run the Regression data analysis tool (see Figure 9).
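Because the model has six coefficients and six cells, the simultaneous equations can be solved by simple substitution. The Python sketch below assumes that *b*_{4} and *b*_{5} are the coefficients of *t*_{1}*t*_{2} and *t*_{1}*t*_{3}. The cell means used here are back-calculated from the coefficients listed above rather than copied from Figure 8, so they should be read as an assumption; the function name is illustrative.

```python
def coefficients_from_cell_means(m):
    """m maps (blend, crop) to a cell mean; Blend Y and Rice are the reference levels."""
    b0 = m[("Y", "Rice")]                      # t1 = t2 = t3 = 0
    b1 = m[("X", "Rice")] - b0                 # Blend X effect at the Rice level
    b2 = m[("Y", "Corn")] - b0                 # Corn effect at the Blend Y level
    b3 = m[("Y", "Soy")] - b0                  # Soy effect at the Blend Y level
    b4 = m[("X", "Corn")] - b0 - b1 - b2       # Blend X by Corn interaction
    b5 = m[("X", "Soy")] - b0 - b1 - b3        # Blend X by Soy interaction
    return [b0, b1, b2, b3, b4, b5]


# Cell means implied by the coefficients in the text (an assumption, not Figure 8 itself).
cell_means = {
    ("Y", "Rice"): 165.4, ("X", "Rice"): 141.2,
    ("Y", "Corn"): 159.6, ("X", "Corn"): 135.4,
    ("Y", "Soy"): 140.2, ("X", "Soy"): 175.8,
}

coefficients = coefficients_from_cell_means(cell_means)
print([round(b, 1) for b in coefficients])
```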

The relatively high value of *R* and low value of Significance *F* show that the above model is a sufficiently good fit. Using the ANOVA: Two factor data analysis tool, we get the output shown in Figure 10.

We now show how to obtain the ANOVA results from the Regression model and vice versa. Note that *MS*_{W} = 450.33 = *MS*_{Res}, which is as expected since both of these denote the portion of the variation due to error. Also note that *MS*_{T} = 17457.87/29 = 602.00 for both models, and so the systematic variation for both models is the same as well. For the ANOVA model this is

*MS*_{Bet} = (136.53 + 553.27 + 5960.07) / (1 + 2 + 2) = 6649.87/5 = 1329.97

This is the same as *MS*_{Reg} = 6649.87/5 = 1329.97 for the Regression model.

To obtain the Rows (A), Columns (B) and Interaction (AB) values in the ANOVA model from the Regression model, first rerun the regression analysis using only *t*_{1} as an independent variable. The values obtained for *SS*_{Reg}, *df*_{Reg} and *MS*_{Reg} are the values of *SS*_{Row}, *df*_{Row} and *MS*_{Row} in the ANOVA model. Then rerun the regression analysis using only *t*_{2} and *t*_{3}. The values obtained for *SS*_{Reg}, *df*_{Reg} and *MS*_{Reg} are the values of *SS*_{Col}, *df*_{Col} and *MS*_{Col} in the ANOVA model. Now *SS*_{Interaction} = *SS*_{Bet} – *SS*_{Row} – *SS*_{Col} (and similarly for the *df* terms) where *SS*_{Bet} is *SS*_{Reg} in the original (complete) regression model.

Finally note that the value of R Square = .381. This has two interpretations. First it is the square of Multiple R (whose value = .617), which is simply the correlation coefficient *r* between the observed and predicted values of *y*. Second it measures the proportion of variation explained by the regression model (or by the ANOVA model), which is

*SS*_{Reg}/*SS*_{T} = 6649.87/17457.87 = 0.381

which is also equal to 1 – *SS*_{W}/*SS*_{T} from the ANOVA model.
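The arithmetic linking the two outputs can be verified with a few lines of Python. The figures are those quoted in the text; the assignment of 136.53, 553.27 and 5960.07 to rows, columns and interaction is inferred from the order of the degrees of freedom (1 + 2 + 2), and the variable names are illustrative.

```python
import math

# Sums of squares and degrees of freedom quoted in the text
ss_row, ss_col, ss_int = 136.53, 553.27, 5960.07
df_row, df_col, df_int = 1, 2, 2
ss_total, df_total = 17457.87, 29

# Between-groups (ANOVA) = regression (full model)
ss_reg = ss_row + ss_col + ss_int
df_reg = df_row + df_col + df_int
ms_reg = ss_reg / df_reg

# Within-groups (ANOVA) = residual (regression)
ss_res = ss_total - ss_reg
df_res = df_total - df_reg
ms_res = ss_res / df_res

ms_total = ss_total / df_total
r_square = ss_reg / ss_total
multiple_r = math.sqrt(r_square)
f_stat = ms_reg / ms_res

print(round(ms_reg, 2), round(ms_res, 2), round(ms_total, 2))
print(round(r_square, 3), round(multiple_r, 3), round(f_stat, 3))
```

The same script also confirms that 1 – *SS*_{W}/*SS*_{T} gives the same R Square value, since *SS*_{W} = *SS*_{T} – *SS*_{Reg}.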

**Observation**: Just as we did in the single factor ANOVA of Example 1, we can obtain similar results for Example 2 using the alternative coding of dummy variables, namely

*t*_{1} = 1 if Blend X; = -1 otherwise

*t*_{2} = 1 if Corn; = -1 if Rice; = 0 otherwise

*t*_{3} = 1 if Soy; = -1 if Rice; = 0 otherwise

This approach is especially useful in creating unbalanced ANOVA models, i.e. where the sample sizes are not equal in a factorial ANOVA (see Unbalanced Factorial Anova).

**Real Statistics Function**: The following array supplemental function is contained in the Real Statistics Resource Pack.

**SSAnova2**(R1, *r*) – returns a column array with *SS*_{Row}, *SS*_{Col}, *SS*_{Int} and *SS*_{W} for a two factor ANOVA for the data in R1 using a regression model; if *r* > 0 then R1 is assumed to be in Excel Anova format with *r* rows per sample, while if *r* = 0 or is omitted then R1 is assumed to be in standard format; data is without headings.
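To make the four returned quantities concrete, here is a rough Python sketch that computes them for a balanced design. It is not the Real Statistics implementation: it uses the classical balanced-design formulas rather than a regression model, the function name `ss_anova2_balanced` is hypothetical, and the data are made up for illustration.

```python
def ss_anova2_balanced(cells):
    """cells[i][j] lists the observations for row level i and column level j (equal cell sizes).

    Returns [SS_Row, SS_Col, SS_Int, SS_W] using the classical balanced-design formulas.
    """
    a, b = len(cells), len(cells[0])
    n = len(cells[0][0])
    all_values = [y for row in cells for cell in row for y in cell]
    grand = sum(all_values) / len(all_values)
    cell_mean = [[sum(cell) / n for cell in row] for row in cells]
    row_mean = [sum(cell_mean[i]) / b for i in range(a)]
    col_mean = [sum(cell_mean[i][j] for i in range(a)) / a for j in range(b)]

    ss_row = b * n * sum((m - grand) ** 2 for m in row_mean)
    ss_col = a * n * sum((m - grand) ** 2 for m in col_mean)
    ss_cells = n * sum((cell_mean[i][j] - grand) ** 2
                       for i in range(a) for j in range(b))
    ss_int = ss_cells - ss_row - ss_col
    ss_w = sum((y - cell_mean[i][j]) ** 2
               for i in range(a) for j in range(b) for y in cells[i][j])
    return [ss_row, ss_col, ss_int, ss_w]


# Hypothetical yields: rows = Blend X, Blend Y; columns = Corn, Soy, Rice; two per cell.
data = [
    [[130, 140], [170, 180], [140, 144]],
    [[158, 162], [138, 142], [160, 170]],
]
ss_row, ss_col, ss_int, ss_w = ss_anova2_balanced(data)
print(ss_row, ss_col, ss_int, ss_w)
```

For unbalanced data these formulas no longer apply, which is the situation where the regression approach described above is needed.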

Hey,

I have a complete ANOVA table and want to draw a regression fit line.

How to do that ?

Ahmad,

You can draw a regression fit line, at least for the case with only one x variable, by using the Trendline option of a Scatter chart. If you have more than one x variable or are employing Anova using Regression, then the regression fit is not a line but a hyperplane, and it won’t be easy to draw.

Charles

Hi Dr.

Many thanks for this great work. Statistics has become very easy with your help. Thanks.

Could you explain how to do ANOVA using regression by hand, step by step with formulas? I need to understand it so that I can explain it to my students and then use your software.

Thanks,

The referenced webpage shows how to perform ANOVA by manually modifying the output from the Excel regression data analysis tool. Elsewhere on the website I show how to manually perform the same calculations as Excel’s regression data analysis tool. If you combine both of these, you have what you are requesting.

Charles

Hi Charles,

Can you show me an example of how to convert a one variable linear regression problem to ANOVA? I am struggling with how to divide the values into 2 groups.

Please help.

| Distance (miles) | Cost (USD) |
|---|---|
| 337 | 59.5 |
| 2565 | 509.5 |
| 967 | 124.5 |
| 5124 | 1480.4 |
| 2398 | 696.23 |
| 2586 | 559.5 |
| 7412 | 1481.5 |
| 522 | 474.5 |
| 1499 | 737.5 |

The distance in miles is my predictor variable, and Cost(in USD) is my dependent variable.

Regards

The referenced webpage describes how to convert an ANOVA problem into a linear regression problem, not the reverse.

Why do you want to convert a linear regression into an ANOVA problem?

Charles

Hi. I’m stuck. I’m studying independent restaurants featuring vegan food in three categories of restaurant types. I want to determine which of the three are successful by using the rate of return of the past three years. Does this fall under the category of ANOVA regression? And to confirm, vegan and RofR are dependent variables and my three categories are the IV?

Thanks

Michelle,

I don’t completely understand the scenario that you are describing. Can you give a specific example?

For example, suppose you have categories of restaurants A, B and C and 10 restaurants in each category along with their rate of return over the past three years. You should be able to use one-way ANOVA to determine whether there is a significant difference in the rate of return among the three categories. Note that since this is a balanced model (all categories have the same number of restaurants in the samples), you don’t need to use regression. See One-way ANOVA for details.

Charles

I’m feeling confident already.

Thanks for your assistance and confirmation.

Hello,

Thanks for the clear description, that helped me a lot.

I have two further questions about ANOVA using regression:

1) time series: suppose that the data of example 2 are a subset of a 3 year long experiment, i.e., that the same data have also been collected in two subsequent years. Suppose also that the results of year 1 influence those of year 2, and that those of years 1 and 2 influence those of year 3, i.e., that this should be considered as a time series. In order to estimate the year effect, would it be enough to add a further dummy variable t4 accounting for the year effect, with values 1; 2; 3 (or -1; 0; +1?) for the corresponding years? And should the interaction terms be accounted for by t1*t4, t2*t4, t3*t4, t1*t2*t4 and t1*t3*t4?

2) nested factors: now suppose that (always starting from example 2) instead of “blend X” and “blend Y” we have “rice” and “corn” and that for both rice and corn we have 3 varieties (rice1, rice2, rice3 for rice; corn1, corn2, corn3 for corn). How do we deal with the fact that the varieties are nested within the plant types? And what about year effect (like in previous example) and interactions in this case?

Thanks for the patience, even only for reading the questions until the end 🙂

Alessandro

Alessandro,

1. I can view this as ANOVA with two fixed factors and one repeated measures factor. Alternatively, I can look at this as some sort of multivariate time series analysis. Your approach seems reasonable, but I honestly haven’t had time to verify that it is correct.

2. See Nested ANOVA for more information about such examples.

Charles

Thanks!

Hello

I would like to know the importance of ANOVA after running a moderated and hierarchical regression analysis. Research question: Do employee job satisfaction and demographic variables explain a significant amount of variance of ostracism?

Sorry Amina, but I don’t yet support these topics.

Charles

I just had a question: how is ANOVA like a multiple regression, as in what are the specific similarities? Hope you can answer my question soon. Thank you!

Cheryl,

I am asserting that you can carry out ANOVA by redefining it as a regression problem.

Charles

I’m not working on a problem exactly. It’s actually a question my professor gave as homework to answer and i have found it difficult to answer! The question is literally: How is ANOVA like a multiple regression? That’s all it says and goes to the next question. So, I’m stuck!

what is definition of analysis of variance in k-variable regression model.

Analysis of Variance is defined on the webpage One-way Analysis of Variance.

Regression is a way of performing the calculations required to create the model.

Charles

Thank you for your extremely useful website.

I have a question. I want to know whether a specific chicken feed affects height, length and weight of chickens. So I have a team of three raters, each of whom recorded the weight, length and height of each of 50 chickens at time 0, 1 and 2 months after this specific feed.

How should I analyse these data?

I plan to do intraclass correlation coefficient first to ensure the reliability of different raters. Should I use one factor at one time-point (such as weight at time 0)?

How do I test for normality? Do I average weights of each chicken and test for normality, then do the same for length and weight?

If I want to look at one factor (such as height), I shall then do repeated measures ANOVA. But if I want to look at three factors (height, weight and length), do I do repeated measures ANOVA for each factor separately? Is there a better way? From my understanding, two-factor ANOVA with replication does not apply to this situation.

If I reject the null hypothesis, do I then do repeated measures ANOVA for each factor separately?

If I reject the null hypothesis of weight alone, how do I do post-hoc analysis in this situation?

I hope my question is not too troublesome. I look forward to hearing from you in due course and thank you in advance for your help.

Kind regards

Gerard

Gerard,

Before answering any of these questions, it is important to understand what it is that you are trying to prove; i.e. what hypothesis you are trying to test. It seems that you are trying to understand whether the chicken feed is significantly better (in terms of the height, length and weight of chickens) compared to something unstated, probably the existing chicken feed. In this case you probably want to use a MANOVA test (since you are comparing height, length and weight all at the same time).

You have given three time references 0, 1 and 2 months. If you care about the comparisons for all three periods then you will need a repeated measures test. This might lead to a repeated measures MANOVA.

You also have three raters, but I don’t understand their role. Does each rater measure the height, length and weight of each chicken or is their role different?

Before answering the detailed questions, please clarify the above issues.

Charles

Can you give a brief answer on why the value of a given coefficient in a predictive equation is exactly half the numerical value of the corresponding effect size in a Factorial DOE ANOVA analysis?

Craig,

Which measure of effect size for a factorial Anova are you referring to? Can you give me an example where the coefficient is half the effect size?

Charles

Ok. I shall go through it. Thanks a ton.

Hi Charles, did you get my question? Your first response to my question was inspiring and expecting again a good response to my question please. Thank you.

See the response I just sent to your other comment. Charles

So I want to know how do we calculate Sequential Sum of Squares and Adjusted Sum of Squares for Interaction of a Factor with self. i.e. A^2 or B^2 etc…

Hi,

Same problem for me.

Have you found any solution for the quadratic term SS computation?

Thanks a lot

Luc

Hi Charles,

Thanks a lot for your reply. We are talking about the ANOVA problem. And yes, Whether the model fits or not is a different issue.

I am not looking for regression…Basically if you take Response Surface Designs (Central Composite or Box Behnken). Let me explain what I am looking for…You have 3 factors (A,B,C). Now in CCD or BBD you get ANOVA for

Term Seq.SS Adj. SS Adj.MS F and P

————————————————–

A

B

C

AB

AC

BC

A^2

B^2

C^2

Here, I got the solution to calculate Seq. SS of Factor A,B,C. Cross Interaction of Factors – AB,AC and BC. But unable to get the value of Quadratic Terms i.e. A^2 or B^2 or C^2.

Same way, I could generate the exact value in Adj. SS for Factor A,B,C. Cross Interaction of Factors (AB,AC and BC). But unable to generate the exact value of Quadratic Terms i.e. (A^2 = AA, B^2 and C^2).

Hi Ashutosh,

I am not familiar with Response Surface Designs (Central Composite or Box Behnken), and so I don’t have an immediate answer for you. My first impression is to use regression of the type y = b0 + b1 t1 + b2 t2 + b3 t3 + b4 t1 t2 + b5 t1 t3 + b6 t2 t3 + b7 t1^2 + b8 t2^2 + b9 t3^2 (perhaps with more dummy variables). This is the approach used to create 3 factor ANOVA models (using regression). Instead of an ABC term you need A^2, B^2 and C^2 terms.

See http://www.real-statistics.com/multiple-regression/anova-using-regression/ and http://www.real-statistics.com/multiple-regression/three-factor-anova-using-regression/ for more details on how to use regression to build ANOVA models.

Charles

Yes instead of ABC, I need A^2, B^2 and C^2.

Hi,

Everybody talks about and the formulas are given for calculating sequential and adjusted sum of squares of individual factor i.e. Factor A…

But nobody has given an explanation of how to calculate the sequential or adjusted sum of squares for a squared factor, i.e. Factor A^2. Any idea how to calculate the sequential sum of squares for a squared factor (A^2)?

Ashutosh,

If Factor A^2 means the square of the values in factor A then it seems to me that we are still just talking about an Anova problem using the same formulas as before and merely squaring the data values. Whether the model fits is a different issue. The situation is similar for polynomial regression problems as described in http://www.real-statistics.com/multiple-regression/polynomial-regression/. Is this the issue that you are raising? Perhaps I am missing the point.

Charles