In this section we show how to use dummy variables to model categorical variables using linear regression in a way that is similar to that employed in Dichotomous Variables and the t-test. In particular we show that hypothesis testing of the difference between means using the t-test (see Two Sample t Test with Equal Variances and Two Sample t Test with Unequal Variances) can be done by using linear regression.
Example 1: Repeat the analysis of Example 1 of Two Sample t Test with Equal Variances (comparing means from populations with equal variance) using linear regression.
Figure 1 – Regression analysis of data in Example 1
The leftmost table in Figure 1 contains the original data from Example 1 of Two Sample t Test with Equal Variances. We define the dummy variable x so that x = 0 when the data element is from the New group and x = 1 when the data element is from the Old group. The data can now be expressed with an independent variable and a dependent variable as described in the middle table in Figure 1.
Running the Regression data analysis tool on x and y, we get the results on the right in Figure 1. We can now compare this with the results we obtained using the t-test data analysis tool, which we repeat here in Figure 2.
Figure 2 – t-test on data in Example 1
We now make some observations regarding this comparison:
- F = 4.738 in the regression analysis is equal to the square of the t-stat (2.177) from the t-test, which is consistent with Property 1 of F Distribution.
- R Square = .208 in the regression analysis is equal to $t^2/(t^2 + df)$, where t is the t-stat from the t-test and df is its degrees of freedom, which is consistent with the observation following Theorem 1 of One Sample Hypothesis Testing for Correlation.
- The p-value = .043 from the regression analysis (called Significance F) is the same as the p-value from the t-test (called P(T<=t) two-tail).
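These identities can be checked numerically. The following is a minimal sketch using only the Python standard library. The sample values are illustrative (an assumption, since the data table of Figure 1 is not reproduced in this text); they were picked to have the group means 15 and 11.1 quoted in this section. The function names `pooled_t` and `dummy_regression` are hypothetical helpers, not part of any tool described here.

```python
from math import sqrt
from statistics import mean

# Illustrative values (an assumption): chosen to have group means 15 and 11.1
new = [13, 17, 19, 11, 20, 15, 18, 9, 12, 16]   # coded x = 0
old = [12, 8, 6, 16, 12, 14, 10, 18, 4, 11]     # coded x = 1

def pooled_t(g0, g1):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    n0, n1 = len(g0), len(g1)
    m0, m1 = mean(g0), mean(g1)
    ss0 = sum((v - m0) ** 2 for v in g0)
    ss1 = sum((v - m1) ** 2 for v in g1)
    df = n0 + n1 - 2
    sp2 = (ss0 + ss1) / df                      # pooled variance
    return (m0 - m1) / sqrt(sp2 * (1 / n0 + 1 / n1)), df

def dummy_regression(g0, g1):
    """Regress y on a 0/1 dummy variable; returns (F, R Square)."""
    x = [0] * len(g0) + [1] * len(g1)
    y = list(g0) + list(g1)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    ss_reg = ss_tot - ss_res
    F = ss_reg / (ss_res / (len(y) - 2))
    return F, ss_reg / ss_tot

t, df = pooled_t(new, old)
F, r_square = dummy_regression(new, old)
print(round(t, 3), round(F, 3), round(r_square, 3))   # 2.177 4.738 0.208
print(abs(F - t ** 2) < 1e-9)                          # True
print(abs(r_square - t ** 2 / (t ** 2 + df)) < 1e-9)   # True
```

The p-values are not computed here because the standard library has no t or F distribution; a statistics package would be needed for that part of the comparison.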
We can also see from the above discussion that the correlation coefficient r can be expressed as a function of the t-stat using the following formula:
$$r = \sqrt{\frac{t^2}{t^2 + df}}$$
The impact of this is that the effect size for the t-test can be expressed in terms of the correlation coefficient. The general guidelines are that r = .1 is viewed as a small effect, r = .3 as a medium effect and r = .5 as a large effect. For Example 1, r = 0.456, which is close to .5 and so is viewed as a large effect.
Note that this formula can also be used to measure the effect size for t-tests even when the population variances are unequal (see next example) and for the case of paired samples.
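As a small sketch of how this effect size might be computed, the hypothetical helper `effect_size_r` below takes a t-stat and its degrees of freedom and returns r = sqrt(t²/(t² + df)). The value df = 18 is an assumption (two groups of ten observations, which is what the reported R Square of .208 implies); it is not stated in the text.

```python
from math import sqrt

def effect_size_r(t, df):
    """Effect size r = sqrt(t^2 / (t^2 + df)) for a t-test with df degrees of freedom."""
    return sqrt(t * t / (t * t + df))

# t-stat quoted for Example 1; df = 18 is an assumption (two groups of ten)
r = effect_size_r(2.177, 18)
print(round(r, 2))   # 0.46
```

Because df is passed in explicitly, the same function can be applied to the unequal-variances and paired-sample cases by supplying the degrees of freedom from those tests. The small gap from the 0.456 quoted above comes from using the rounded t-stat.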
Also note that the coefficients in the regression model y = bx + a can be calculated directly from the original data as follows. First calculate the means of the data for each flavoring (new and old). The mean of the data in the new flavoring sample is 15 and the mean of the data in the old flavoring sample is 11.1. Since x = 0 for the new flavoring sample and x = 1 for the old flavoring sample, we have
$$15 = b \cdot 0 + a = a \qquad\text{and}\qquad 11.1 = b \cdot 1 + a = b + a$$
This means that a = 15 and b = 11.1 – a = 11.1 – 15 = -3.9, and so the regression line is y = 15 – 3.9x, which agrees with the coefficients in Figure 1.
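The same relationship can be confirmed by fitting the line by least squares and comparing the result with the group means. This is a sketch on illustrative values (an assumption: they have the means 15 and 11.1 quoted above, but the data table itself is not reproduced in this text).

```python
from statistics import mean

# Illustrative values (an assumption) with group means 15 and 11.1
new = [13, 17, 19, 11, 20, 15, 18, 9, 12, 16]   # x = 0
old = [12, 8, 6, 16, 12, 14, 10, 18, 4, 11]     # x = 1

x = [0] * len(new) + [1] * len(old)
y = new + old

# Ordinary least-squares slope b and intercept a for y = bx + a
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar

# Intercept equals the mean of the x = 0 group;
# slope equals the difference between the two group means
print(round(a, 4), round(b, 4))
print(abs(a - mean(new)) < 1e-9, abs(b - (mean(old) - mean(new))) < 1e-9)
```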
As was mentioned in the discussion following Figure 4 of Testing the Regression Line Slope, the Regression data analysis tool provides an optional Residuals Plot. The output for Example 1 is displayed in Figure 3.
Figure 3 – Residual plot for data in Example 1
From the chart we see how the residual values corresponding to x = 0 and x = 1 are distributed about the mean of zero. The spread about x = 1 is a bit larger than the spread about x = 0, but the difference is small, which suggests that the variances for the New (x = 0) and Old (x = 1) samples are roughly equal.
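The visual comparison can be backed up with a quick calculation of the residual variance in each group. The sketch below uses illustrative values (an assumption, not read from the figures) and relies on the fact shown earlier that the fitted value is the group mean for each value of x.

```python
from statistics import mean, variance

# Illustrative values (an assumption) with group means 15 and 11.1
new = [13, 17, 19, 11, 20, 15, 18, 9, 12, 16]   # x = 0
old = [12, 8, 6, 16, 12, 14, 10, 18, 4, 11]     # x = 1

a = mean(new)                 # intercept: fitted value when x = 0
b = mean(old) - mean(new)     # slope: change in fitted value from x = 0 to x = 1

res_new = [v - a for v in new]          # residuals at x = 0
res_old = [v - (a + b) for v in old]    # residuals at x = 1

var_new = variance(res_new)   # sample variance of the residuals at x = 0
var_old = variance(res_old)   # sample variance of the residuals at x = 1
print(round(var_new, 2), round(var_old, 2), round(var_old / var_new, 2))
```

For these values the residuals at x = 1 have the larger variance, with a ratio of about 1.4, which matches the impression of a slightly wider spread on that side of the plot.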
Example 2: Repeat the analysis of Example 2 of Two Sample t Test with Unequal Variances (comparing means from populations with unequal variance) using linear regression.
Figure 4 – Regression analysis of data in Example 2
We note that the regression analysis displayed in Figure 4 agrees with the t-test analysis assuming equal variances (the table on the left of Figure 5).
Figure 5 – t-tests on data in Example 2
Unfortunately, since the variances are quite unequal, the correct results are given by the table on the right in Figure 5. This highlights the importance of the requirement that the variances of the y values for each value of x be equal for the results of the regression analysis to be useful.
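To illustrate why the regression tracks the equal-variance test rather than the unequal-variance one, here is a sketch on hypothetical samples with very different spreads (they are NOT the Example 2 data, which is not reproduced in this text). The unequal-variance statistic is computed with the standard Welch formula and Satterthwaite degrees of freedom, which is assumed to be what the unequal-variance t-test tool reports.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical samples with very different spreads; NOT the Example 2 data
g0 = [10, 11, 9, 10, 12, 8]                   # x = 0
g1 = [4, 20, 12, 1, 25, 15, 8, 19, 6, 30]     # x = 1

n0, n1 = len(g0), len(g1)
m0, m1 = mean(g0), mean(g1)
v0, v1 = variance(g0), variance(g1)

# Equal-variance (pooled) t-test
df_pooled = n0 + n1 - 2
sp2 = ((n0 - 1) * v0 + (n1 - 1) * v1) / df_pooled
t_pooled = (m1 - m0) / sqrt(sp2 * (1 / n0 + 1 / n1))

# Unequal-variance t-test (Welch statistic, Satterthwaite degrees of freedom)
se2 = v0 / n0 + v1 / n1
t_welch = (m1 - m0) / sqrt(se2)
df_welch = se2 ** 2 / ((v0 / n0) ** 2 / (n0 - 1) + (v1 / n1) ** 2 / (n1 - 1))

# t statistic of the slope when y is regressed on the 0/1 dummy variable
x = [0] * n0 + [1] * n1
y = g0 + g1
x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
a = y_bar - b * x_bar
ms_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (len(y) - 2)
t_slope = b / sqrt(ms_res / sxx)

print(round(t_slope, 3), round(t_pooled, 3), df_pooled)   # regression matches pooled test
print(round(t_welch, 3), round(df_welch, 2))              # unequal-variance test differs
```

The slope's t statistic coincides with the pooled t statistic, while the unequal-variance test gives a different statistic and fewer degrees of freedom, so the regression output inherits the equal-variance assumption.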
Also note that the plot of the Residuals for the regression analysis clearly shows that the variances are unequal (see Figure 6).
Figure 6 – Residual plot for data in Example 2