In this section we show how to use dummy variables to model categorical variables using linear regression in a way that is similar to that employed in Dichotomous Variables and the t-test. In particular we show that hypothesis testing of the difference between means using the t-test (see Two Sample t Test with Equal Variances and Two Sample t Test with Unequal Variances) can be done by using linear regression.

**Example 1**: Repeat the analysis of Example 1 of Two Sample t Test with Equal Variances (comparing means from populations with equal variance) using linear regression.

**Figure 1 – Regression analysis of data in Example 1**

The leftmost table in Figure 1 contains the original data from Example 1 of Two Sample t Test with Equal Variances. We define the dummy variable* x* so that *x* = 0 when the data element is from the New group and *x* = 1 when the data element is from the Old group. The data can now be expressed with an independent variable and a dependent variable as described in the middle table in Figure 1.

Running the Regression data analysis tool on *x* and y, we get the results on the right in Figure 1. We can now compare this with the results we obtained using the t-test data analysis tool, which we repeat here in Figure 2.

**Figure 2 – t-****test on data in Example 1**

We now make some observations regarding this comparison:

*F*= 4.738 in the regression analysis is equal to the square of the t-stat (2.177) from the t-test, which is consistent with Property 1 of F Distribution- R Square = .208 in the regression analysis is equal to = where t is the t-stat from the t-test, which is consistent with the observation following Theorem 1 of One Sample Hypothesis Testing for Correlation
- The p-value = .043 from the regression analysis (called Significance
*F*) is the same as the p-value from the test (called P(T<=t) two-tail).

**Effect Size**

We can also see from the above discussion that the regression coefficient can be expressed as a function of the t-stat using the following formula:

The impact of this is that the effect size for the t-test can be expressed in terms of the regression coefficient. The general guidelines are that *r* = .1 is viewed as a small effect, *r* = .3 as a medium effect and *r* = .5 as a large effect. For Example 1, *r* = 0.456 which is close to .5, and so is viewed as a large effect.

Note that this formula can also be used to measure the effect size for t-tests even when the population variances are unequal (see next example) and for the case of paired samples.

**Model coefficients**

Also note that the coefficients in the regression model y = *bx + a* can be calculated directly from the original data as follows. First calculate the means of the data for each flavoring (new and old). The mean of the data in the new flavoring sample is 15 and the mean of the data in the old flavoring sample is 11.1. Since *x* = 0 for the new flavoring sample and *x* = 1 for the old flavoring sample, we have

This means that *a* = 15 and *b* = 11.1 – *a* = 11.1 – 15 = -3.9, and so the regression line is y = 15 – 3.9*x*, which agrees with the coefficients in Figure 1.

**Unequal variance**

As was mentioned in the discussion following Figure 4 of Testing the Regression Line Slope, the Regression data analysis tool provides an optional Residuals Plot. The output for Example 1 is displayed in Figure 3.

**Figure 3 – Residual plot for data in Example 1**

From the chart we see how the residual values corresponding to *x* = 0 and *x* = 1 are distributed about the mean of zero. The spreading about *x* = 1 is a bit larger than for *x* = 0, but the difference is quite small, which is an indication that the variances for *x* = 0 and *x* = 1 are quite equal. This suggests that the variances for the New and Old samples are roughly equal.

**Example 2**: Repeat the analysis of Example 2 of Two Sample t Test with Unequal Variances (comparing means from populations with unequal variance) using linear regression.

**Figure 4 – Regression analysis of data in Example 2**

We note that the regression analysis displayed in Figure 4 agrees with the t-test analysis assuming equal variances (the table on the left of Figure 5).

**Figure 5 – ** **t-tests on data in Example 2**

Unfortunately, since the variances are quite unequal, the correct results are given by the table on the right in Figure 5. This highlights the importance of the requirement that variances of the values for each be equal for the results of the regression analysis to be useful.

Also note that the plot of the Residuals for the regression analysis clearly shows that the variances are unequal (see Figure 6).

**Figure 6 – Residual plot for data in Example 2**

So we could also use regression analysis if only if the two groups have the same variance? Correct me if I’m wrong. Thanks

You can use the t test (with equal variances) as long as the variances are not too different. If not you can use the t test with unequal variances. The t tests work even with moderate violations of the normality assumption. The regression approach, as described on the website, is equivalent to the t test (as explained at http://www.real-statistics.com/regression/linear-regression-models-for-comparing-means/). If the normality assumption is severely violated (esp. if the data is clearly unsymmetric), then the Mann-Whitney test is probably your best choice.

Charles

Charles my study have four independet variable and one dependent varible iwant to use linear regression method for analysis but the problem is the sample size which will give the acceptable r-square value for generalizations of my findings

January,

You can use the Real Statistics Statistical Power and Sample Size data analysis tool to calculate the sample size necessary for your regression analysis based on the expected effect size (i.e. R-square) value. See the following webpage for more details

Sample Size Regression

Charles

Hello Charles,

The example that you describe is between groups. Would it be appropriate to compare two means using regression with a within subjects design? All of my participants were exposed to both conditions. Can I still use the method that you described above?

Thank you

Agr,

You can use regression in this case, but I have not yet described how to do this on the website.

Charles

Charles,

similar to Agr, I am looking to compare two biological measurements taken from the same group of individuals to see the effect of tissue type on the value of this measurement. Will you be describing how one sets up a regression when observations are not independent any time soon? This would be immensely helpful.

Thanks for the clear and detailed explanations you’ve already provided above.

Elisabeth

Elsabeth,

I understood that Agr was looking to perform a version of repeated measures ANOVA using regression.

The website describes how to perform repeated measures ANOVA. See the following webpage:

Repeated Measures ANOVA

I still plan to develop a method which uses regression to provide a more robust solution.

Charles

Hi Charles

Can you give me the reference for the general guidelines for r (r = .1 small effect, r = .3 médium effect, r = .5 large effect)?

Many thanks

Helena

Helena,

You can find this in many places. Just google “interpreting effect size r”. Here is one such reference:

http://www.polyu.edu.hk/mm/effectsizefaqs/thresholds_for_interpreting_effect_sizes2.html

Charles

Many thanks, H