In Correlation we study the linear correlation between two random variables *x* and y. We now look at the line in the *x*y plane that best fits the data (*x*_{1}, y_{1}), …, (*x _{n},* y

*).*

_{n}Recall that the equation for a straight line is y = *bx + a*, where

*b* = the slope of the line

*a* = y-intercept, i.e. the value of y where the line intersects with the x-axis

For our purposes we write the equation of the best fit line as

For each *i*, we define ŷ* _{i}* as the y-value of

*x*on this line, and so

_{i}The best fit line is the line for which the sum of the distances between each of the *n* data points and the line is as small as possible. A mathematically useful approach is therefore to find the line with the property that the sum of the following squares is minimum.

**Theorem 1**: The best fit line for the points (*x*_{1}, y_{1}), …, (*x _{n}*, y

*) is given by*

_{n}Click here for the proof of Theorem 1. Two proofs are given, one of which does not use calculus.

**Definition 1**: The best fit line is called the **regression line**.

**Observation**: The theorem shows that the regression line passes through the point (*x̄*, ȳ) and has equation

Note too that *b* = *cov*(*x*,y)/*var*(*x*). Since the terms involving *n* cancel out, this can be viewed as either the population covariance and variance or the sample covariance and variance. Thus *a* and *b* can be calculated in Excel as follows where R1 = the array of y values and R2 = the array of *x* values:

*b* = SLOPE(R1, R2) = COVAR(R1, R2) / VARP(R2)

*a* = INTERCEPT(R1, R2) = AVERAGE(R1) – *b* * AVERAGE(R2)

Proof: By Definition 2 of Correlation,

and so by the above observation we have

**Excel Functions**: Excel provides the following functions for forecasting the value of y for any *x* based on the regression line. Here R1 = the array of y data values and R2 = the array of *x* data values:

**SLOPE**(R1, R2) = slope of the regression line as described above

**INTERCEPT**(R1, R2) = y-intercept of the regression line as described above

**FORECAST**(*x*, R1, R2) calculates the predicted value y for the given value of *x*. Thus FORECAST(x, R1, R2) = *a + b * x* where *a* = INTERCEPT(R1, R2) and *b* = SLOPE(R1, R2).

**TREND**(R1, R2) = array function which produces an array of predicted y values corresponding to *x* values stored in array R2, based on the regression line calculated from *x* values stored in array R2 and y values stored in array R1.

**TREND**(R1, R2, R3) = array function which predicts the y values corresponding to the *x* values in R3 based on the regression line based on the *x* values stored in array R2 and y values stored in array R1.

To use TREND(R1, R2), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for *x* (array R2) followed by a right parenthesis. Finally press Crtl-Shft-Enter.

To use TREND(R1, R2, R3), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for *x* (array R2) followed by another comma and highlight the array R3 containing the values for *x* for which you want to predict y values based on the regression line. Now enter a right parenthesis and press Crtl-Shft-Enter.

**Example 1**: Calculate the regression line for the data in Example 1 of One Sample Hypothesis Testing for Correlation and plot the results.

Using Theorem 1 and the observation following it, we can calculate the slope *b* and y-intercept *a* of the regression line that best fits the data as in Figure 1 above. Using Excel’s charting capabilities we can plot the scatter diagram for the data in columns A and B above and then select **Layout > Analysis|Trendline** and choose a **Linear Trendline** from the list of options. This will display the regression line given by the equation y = *bx + a* (see Figure 1).

How do you balance the accuracy of the trendline showed with its r2?

I am studying very similar trends in a proyect and i have my doubts regarding how cautious one must be, specially with values like 50, a bit far away from the central data

In this case i see that there isn´t a regression that tends to average ( like in studies of heights in families ). How can you increase the likeliness of this doesn´t happening in your study?

thanks!

Sorry Andrés, but I don’t understand your comment.

Charles

Hello there Sir,

The first three equations doesn’t make sense to me yet. Can you elaborate on the meaning of each symbol, like where does “c” and “x-bar”come from and what is the reason of introducing them into the original linear equation? Can you provide me references for further understanding these equations?

Suggestion: Is it possible for you to put equation references like (Eq. 2.11).

Thank you so much!

x-bar is the mean of the x sample values. This is standard notation and is used throughout the website.

An equation of a straight line takes the form y = b*x + a (slope b and y-intercept a). An alternative form (from high school algebra) is y – y0 = b(x – x0) where (x0, y0) is any point on the line (a straight is determined by any point on the line and its slope). I am choosing to use a point who x-value is x-bar and whose y-value is an unknown value c. Thus y – c = b*(x – x-bar). By algebra y = b*x – b*x-bar + c. But y = b*x + a, and so b*x – b*x-bar + c = b*x + a, from which it follows that -b*x-bar + c = a, and so c = a + b*x-bar: i.e. (x-bar, a + b*x-bar) lies on the line.

Charles

What is the difference between the FORECAST(x, R1, R2) and TREND(R1, R2, R3) functions?

Ryan,

When R2 contains a single column (simple linear regression) then FORECAST(x, R1, R2) is equivalent to TREND(R1, R2, x) and FORECAST(R3, R1, R2) is equivalent to TREND(R1, R2, R3). TREND can be used when R2 contains more than one column (multiple regression) while FORECAST cannot.

Charles

Nice website!

Is there a function for the slope of a regression line, when forced to have an intercept of zero?

I know I can plot the data, fit a trend line, and then print the equation, but is there a more direct way?

Thanks.

Charles,

For the case where there is only one independent variable x, the formula for the slope is b = ∑x_iy_i/∑x_i^2.

I plan to add information about this situation to the website in the future.

Charles

if my observed values of y are greater than the values of x how can the slope be .9 less than one?

Remember that the intercept plays a role as well as the slope. E.g. in y = x/2 + 1000 (slope .5) as long as x < 2000 x will be less than y. Charles

hy good job ur work is very good plz improv it more charles. i hav 2 question plz ans me. 1.what is correction factor and why we subtract it from to.s.s. and othr. 2.why we take standard deviation of data aftr takng variance. MEHRAN FROM PAKISTAN

Hi Mehran,

Thanks for using the website. I don’t see anything about a correction factor on the referenced webpage (i.e. Method of Least Squares). Which webpage are you referring to?

Charles