In this section we test the following null hypothesis:

H_{0}: the regression line doesn’t capture the relationship between the variables

If we reject the null hypothesis it means that the line is a good fit for the data. We now express the null hypothesis in a way that is more easily testable:

H_{0}: σ^{2}_{Reg} ≤ σ^{2}_{Res}

As described in Two Sample Hypothesis Testing to Compare Variances, we can use the *F* test to compare the variances in two samples. To test the above null hypothesis we set *F* = *MS*_{Reg}/*MS*_{Res} and use *df*_{Reg}, *df*_{Res} degrees of freedom.

**Observation**: The use of the linear regression model is based on the following assumptions:

- Linearity of the phenomenon measured
- Constant variance of the error term
- Independence of the error terms
- Normality of the error term distribution

In fact the normality assumption is equivalent to the condition that the sample comes from a population with a bivariate normal distribution. See Multivariate Normal Distribution for more information about this distribution. The homogeneity of the variance assumption is equivalent to the condition that for any values *x*_{1} and *x*_{2} of *x*, the variances of *y* at those values of *x* are equal, i.e. *var*(*y* | *x* = *x*_{1}) = *var*(*y* | *x* = *x*_{2}).

**Observation**: Linear regression can be effective with a sample size as small as 20.

**Example 1**: Test whether the regression line in Example 1 of Method of Least Squares is a good fit for the data.

**Figure 1 – Goodness of fit of regression line for data in Example 1**

We note that *SS*_{T} = DEVSQ(B4:B18) = 1683.7 and *r* = CORREL(A4:A18, B4:B18) = -0.713, and so by Property 3 of Regression Analysis, *SS*_{Reg} = *r*^{2}·*SS*_{T} = (1683.7)(0.713)^{2} = 857.0. By Property 1 of Regression Analysis, *SS*_{Res} = *SS*_{T} – *SS*_{Reg} = 1683.7 – 857.0 = 826.7. From these values, it is easy to calculate *MS*_{Reg} and *MS*_{Res}.

We now calculate the test statistic *F* = *MS*_{Reg}/*MS*_{Res} = 857.0/63.6 = 13.5. Since *F*_{crit} = FINV(*α*, *df*_{Reg}, *df*_{Res}) = FINV(.05, 1, 13) = 4.7 < 13.5 = *F*, we reject the null hypothesis, and so accept that the regression line is a good fit for the data (with 95% confidence). Alternatively, we note that p-value = FDIST(*F*, *df*_{Reg}, *df*_{Res}) = FDIST(13.5, 1, 13) = 0.0028 < .05 = *α*, and so once again we reject the null hypothesis.
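
The arithmetic in Example 1 can also be checked outside Excel. The following is a minimal Python sketch, not part of the original worksheet. The function names `f_statistic`, `t_pdf` and `p_value_df1` are hypothetical, the inputs are the rounded values quoted in the example, and *n* = 15 is assumed from the range B4:B18. The p-value routine assumes *df*_{Reg} = 1 (simple linear regression), in which case *F* is the square of a *t* statistic with *df*_{Res} degrees of freedom; the tail area is approximated with Simpson's rule rather than with Excel's FDIST.

```python
import math

def f_statistic(ss_t, ss_reg, n):
    """Return (F, df_reg, df_res) for simple linear regression on n points."""
    df_reg = 1            # one independent variable (assumption of this sketch)
    df_res = n - 2        # n observations minus slope and intercept
    ss_res = ss_t - ss_reg
    ms_reg = ss_reg / df_reg
    ms_res = ss_res / df_res
    return ms_reg / ms_res, df_reg, df_res

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def p_value_df1(f, df_res, steps=2000):
    """Upper-tail p-value of F(1, df_res), using F = t^2 and Simpson's rule."""
    upper = math.sqrt(f)
    h = upper / steps     # steps must be even for Simpson's rule
    total = t_pdf(0.0, df_res) + t_pdf(upper, df_res)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df_res)
    area = total * h / 3  # approximates P(0 < T < sqrt(f))
    return 2 * (0.5 - area)  # P(|T| > sqrt(f)) = P(F > f)

# Rounded values quoted in Example 1.
f, df_reg, df_res = f_statistic(ss_t=1683.7, ss_reg=857.0, n=15)
p = p_value_df1(f, df_res)
print(round(f, 1), df_reg, df_res, round(p, 4))
```

Because the inputs are rounded, the last digit of the results may differ slightly from what the worksheet shows. The null hypothesis is rejected when the computed p-value is below *α*.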

**Observation**: There are many ways of calculating *SS*_{Reg}, *SS*_{Res} and *SS*_{T}. E.g., using the worksheet in Figure 1 of Regression Analysis, we note that *SS*_{Reg} = DEVSQ(K5:K19) and *SS*_{Res} = DEVSQ(L5:L19). These formulas are valid since the means of the y values and ŷ values are equal by Property 5(b) of Regression Analysis.

Also by Definition 2 of Regression Analysis, *SS*_{Res} = Σ(*y*_{i} – *ŷ*_{i})^{2} = SUMXMY2(J5:J19, K5:K19). Finally, *SS*_{T} = DEVSQ(J5:J19), but alternatively *SS*_{T} = *var*(y) ∙ *df*_{T} = VAR(J5:J19) * (COUNT(J5:J19)-1).

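
The equivalences in this Observation can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the data are hypothetical (not the data from Example 1 or from Figure 1 of Regression Analysis), and `devsq` and `sumxmy2` are hand-written stand-ins that mimic what the Excel functions DEVSQ and SUMXMY2 compute.

```python
import statistics

def devsq(values):
    """Sum of squared deviations from the mean (mimics Excel's DEVSQ)."""
    m = statistics.fmean(values)
    return sum((v - m) ** 2 for v in values)

def sumxmy2(xs, ys):
    """Sum of squared pairwise differences (mimics Excel's SUMXMY2)."""
    return sum((a - b) ** 2 for a, b in zip(xs, ys))

# Hypothetical data, chosen only for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.7]

# Least-squares slope and intercept.
x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / devsq(x)
intercept = y_bar - slope * x_bar

y_hat = [intercept + slope * a for a in x]        # predicted values
resid = [b - p for b, p in zip(y, y_hat)]         # residuals

ss_t = devsq(y)                                   # like DEVSQ on the y column
ss_reg = devsq(y_hat)                             # like DEVSQ on the predictions
ss_res = devsq(resid)                             # like DEVSQ on the residuals
ss_res_alt = sumxmy2(y, y_hat)                    # like SUMXMY2(y, predictions)
ss_t_alt = statistics.variance(y) * (len(y) - 1)  # like VAR(y) * (COUNT(y) - 1)

print(ss_t, ss_reg + ss_res, ss_t_alt)
print(ss_res, ss_res_alt)
```

Up to floating-point rounding, the two values of *SS*_{Res} agree, the two values of *SS*_{T} agree, and *SS*_{T} = *SS*_{Reg} + *SS*_{Res}.
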
The cells that you are using for calculations in the final Observation section must have been changed. There are no columns J, K, or L in Figure 1.

John,

These formulas are references to the spreadsheet in Figure 1 of the Regression Analysis webpage: http://www.real-statistics.com/regression/regression-analysis/

I have now updated the referenced webpage to make this clearer.

Thanks for bringing this problem to my attention.

Charles

Sir

In Example 1 you wrote: “We note that SST = DEVSQ(A4:A18) = 1683.7”. It seems to be a typo. It should be DEVSQ(B4:B18).

In the second-to-last paragraph you said “SSRes = \sum {(y_i^2 – \hat{y}_i^2)} = SUMX2MY2 (J5:J19, K5:K19).” It looks like a mistake. I’ve checked it in the spreadsheet.

Hi Colin,

You are correct on all counts. I have made the changes that you have identified on the webpage. Thanks again for your diligence. You have certainly helped make things clearer for everyone who looks at the site.

Charles