Sometimes the dependent variable depends not just on the independent variables but also on the interaction between the variables. The model to use in this case is:

This is equivalent to a usual multiple regression model

studied in Multiple Regression Analysis where *x*_{3}* = x*_{1 }· *x*_{2}.

**Example 1**: We postulate that the amount of votes a candidate gets depends on the amount of amount of money they spend and their quality (position on issues, ability to debate, charisma, organizational abilities, etc.). The table on the left of Figure 1 shows the percentage of votes 10 candidates received in different elections along with the amount of money spent and their quality. Determine the relationship between votes, money and quality.

**Figure 1 – Data for Example 1 plus interaction model**

To capture the interaction between money and quality, we add an independent variable called “Interaction” (as described in the table on the right of Figure 1). Interaction is simply the product of the money and quality values. We now use the Regression data analysis tool on the interaction model. The resulting output is shown in Figure 2.

**Figure 2 – Regression with interaction**

This model is almost a perfect fit for the data (99.7% Adjusted R Square), and shows that we can predict the percentage of votes a candidate will get via the formula:

Votes = -12.22 – 0.86 * Money + 4.86 * Quality + 1.56 * Money * Quality

We can also run the Regression data analysis tool on the original data without the interaction variable, obtaining the output in Figure 3.

**Figure 3 – Regression without interaction**

This model is also a good fit for the data (p-value = 0.000499 < .05 = *α*), but with an Adjusted R Square value of 77.4%, not quite as good as the model with interaction.

We can use the Real Statistics **Extract Columns from a Data Range** data analysis tool to automate the process of creating the interaction between two variables.

For example, to create the interaction between Money and Quality in Example 1, press **Ctrl-m** and select **Extract Columns from a Data Range** from the menu. Now enter A3:D19 into the **Input Range** of the dialog box that appears (as described in Figure 4 of Categorical Coding in Regression) and press the **OK** button.

Now, select both Money and Quality from the list box in the dialog box that appears as shown on the right side of Figure 4 (by clicking on Money and, while holding down the **Ctrl** key, clicking on Quality) and press the **Add Inter** button. Since neither Money nor Quality have yet been added to the output, these too are copied over along with the interaction. The result is as shown in range E4:G16 of Figure 4.

**Figure 4 – Adding interaction using the Extract Columns data analysis tool**

I have a scenario in which I need to incorporate age into a model – where the concept of age is not so straight forward – and I’d be interested in what you think.

When the population being studied is premature infants, there are a couple ways to think about age. One is their day of life (the is straightforward – the number of days since they were born). The other is “corrected gestational age,” which is the number of days since conception.

Examples: 40 weeks after conception, an infant born prematurely at 36 weeks of pregnancy will have a day of life of 28, and a corrected gestation age of 40 weeks. A full term baby born at 40 weeks of pregnancy would have a day of life of 0 at birth, and a corrected gestational age of 40 weeks. Conversely, a two week old baby born at 30 weeks of pregnancy and a two week old full term baby have the same day of life, but different corrected gestational ages and very different biology.

Gestational age at birth (time since conception/length of pregnancy) is typically the variable that one knows and uses to convert day of life into corrected gestational age, but it is also an important variable on its own as it directly indicates how developed an infant typically is at birth.

Corrected gestational age (CGA) is meaningful because it captures developmental age. Day of life (DOL) is meaningful because it captures the amount of time out of the womb and in the world. Biologically, these are distinct factors and we want to account for both of them.

When we attempt to build a model that captures these different aspects of age and development, it seems to me we have some choices:

1) We can use DOL and CGA both as terms in the model

2) We can use DOL, CGA, and the interaction term CGA*DOL as terms in the model

3) We can use DOL, CGA, and gestational age at birth (GAB) as terms in the model

4) We can use DOL, GAB, and the interaction term DOL*GAB as terms in the model

5) We could derive another term such as CGA-DOL, or GAB+DOL, to replace GAB or CGA, respectively…

It’s tricky because these variables aren’t really independent – given any two you can compute the third – but I’m not sure that in a regression model, no information would be lost by using only two of the variables.

This question isn’t really specific to your software or to Excel. I just like the way you explain these kinds of things and you seem willing to answer questions. I would appreciate your input very much, even if it were only to point me towards additional information.

Alex,

In general, if you have two independent variables x1 and x2 and x2 can be expressed as a linear combination of x1, i.e. there are constants a and b such that x2 = b*x1 + a, then the regression model will fail due to collinearity. With three variables, collinearity can occur if there are constants a, b and c such that x3 = c*x1 + b*x2 + a, in which case the linear regression will again fail. If you don’t have this problem, then all the approaches that you have proposed should yield a regression model.

Charles

Thank you for the clear explanation.

Thanks for this very useful explanation! But what should I do on excel or minitab if I want to get a quadratic polynomial regression for something like YS = B0 + B1*H + B2*T + B3*H^2+ B4*T^2 + B5 H*T, where Bs are the coefficients, T is temperature of a hardness test, H is the hardness measured and YS is the Yield strength of a tool steel that was compressed?

See the webpage Polynomial Regression.

Charles

What is y and x

Saya,

You need to look at the webpage Multiple Regression Analysis.

Charles

In what situations do you use interaction multiple regression?

Jas,

You use in interaction when the interaction between two or more variables makes a significant contribution to the accuracy of the regression model (see the webpage http://www.real-statistics.com/multiple-regression/testing-significance-extra-variables-regression-model/).

You can also get more information about this at http://en.wikipedia.org/wiki/Interaction_(statistics)

Charles