In Correlation we study the linear correlation between two random variables x and y. We now look at the line in the xy plane that best fits the data (x1, y1), …, (xn, yn).
Recall that the equation for a straight line is y = bx + a, where
b = the slope of the line
a = y-intercept, i.e. the value of y where the line intersects with the y-axis
For our purposes we write the equation of the best fit line as
For each i, we define ŷi as the y-value of xi on this line, and so
The best fit line is the line for which the sum of the distances between each of the n data points and the line is as small as possible. A mathematically useful approach is therefore to find the line with the property that the sum of the following squares is minimum.
Theorem 1: The best fit line for the points (x1, y1), …, (xn, yn) is given by
Click here for the proof of Theorem 1. Two proofs are given, one of which does not use calculus.
Definition 1: The best fit line is called the regression line.
Observation: The theorem shows that the regression line passes through the point (x̄, ȳ) and has equation
Note too that b = cov(x,y)/var(x). Since the terms involving n cancel out, this can be viewed as either the population covariance and variance or the sample covariance and variance. Thus a and b can be calculated in Excel as follows where R1 = the array of y values and R2 = the array of x values:
b = SLOPE(R1, R2) = COVAR(R1, R2) / VARP(R2)
a = INTERCEPT(R1, R2) = AVERAGE(R1) – b * AVERAGE(R2)
Proof: By Definition 2 of Correlation,
and so by the above observation we have
Excel Functions: Excel provides the following functions for forecasting the value of y for any x based on the regression line. Here R1 = the array of y data values and R2 = the array of x data values:
SLOPE(R1, R2) = slope of the regression line as described above
INTERCEPT(R1, R2) = y-intercept of the regression line as described above
FORECAST(x, R1, R2) calculates the predicted value y for the given value of x. Thus FORECAST(x, R1, R2) = a + b * x where a = INTERCEPT(R1, R2) and b = SLOPE(R1, R2).
TREND(R1, R2) = array function which produces an array of predicted y values corresponding to x values stored in array R2, based on the regression line calculated from x values stored in array R2 and y values stored in array R1.
TREND(R1, R2, R3) = array function which predicts the y values corresponding to the x values in R3 based on the regression line based on the x values stored in array R2 and y values stored in array R1.
To use TREND(R1, R2), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by a right parenthesis. Finally press Crtl-Shft-Enter.
To use TREND(R1, R2, R3), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by another comma and highlight the array R3 containing the values for x for which you want to predict y values based on the regression line. Now enter a right parenthesis and press Crtl-Shft-Enter.
Excel 2016 Function: Excel 2016 introduces a new function FORECAST.LINEAR, which is equivalent to FORECAST.
Example 1: Calculate the regression line for the data in Example 1 of One Sample Hypothesis Testing for Correlation and plot the results.
Figure 1 – Fitting a regression line to the data in Example 1
Using Theorem 1 and the observation following it, we can calculate the slope b and y-intercept a of the regression line that best fits the data as in Figure 1 above. Using Excel’s charting capabilities we can plot the scatter diagram for the data in columns A and B above and then select Layout > Analysis|Trendline and choose a Linear Trendline from the list of options. This will display the regression line given by the equation y = bx + a (see Figure 1).