Method of Least Squares

In Correlation we study the linear correlation between two random variables x and y. We now look at the line in the xy plane that best fits the data (x1, y1), …, (xn, yn).

Recall that the equation for a straight line is y = bx + a, where

b = the slope of the line
a = y-intercept, i.e. the value of y where the line intersects with the x-axis

For our purposes we write the equation of the best fit line as

image1667

and so the y-intercept is
image1668

For each i, we define ŷi as the y-value of xi on this line, and so

image1670

The best fit line is the line for which the sum of the distances between each of the n data points and the line is as small as possible. A mathematically useful approach is therefore to find the line with the property that the sum of the following squares is minimum.

image1671

Theorem 1: The best fit line for the points (x1, y1), …, (xn, yn) is given by

image1673

where
image1674

Click here for the proof of Theorem 1. Two proofs are given, one of which does not use calculus.

Definition 1: The best fit line is called the regression line.

Observation: The theorem shows that the regression line passes through the point (, ȳ) and has equation

image1673

where the slope is
image1674

and the y-intercept is
image1676

Note too that b = cov(x,y)/var(x). Since the terms involving n cancel out, this can be viewed as either the population covariance and variance or the sample covariance and variance. Thus a and b can be calculated in Excel as follows where R1 = the array of y values and R2 = the array of x values:

b = SLOPE(R1, R2) = COVAR(R1, R2) / VARP(R2)

a = INTERCEPT(R1, R2) = AVERAGE(R1) – b * AVERAGE(R2)

Property 1image1679

Proof: By Definition 2 of Correlation,

image1680

and so by the above observation we have

image1681

Excel Functions: Excel provides the following functions for forecasting the value of y for any x based on the regression line. Here R1 = the array of y data values and R2 = the array of x data values:

SLOPE(R1, R2) = slope of the regression line as described above

INTERCEPT(R1, R2) = y-intercept of the regression line as described above

FORECAST(x, R1, R2) calculates the predicted value y for the given value of x. Thus FORECAST(x, R1, R2) = a + b * x where a = INTERCEPT(R1, R2) and b = SLOPE(R1, R2).

TREND(R1, R2) = array function which produces an array of predicted y values corresponding to x values stored in array R2, based on the regression line calculated from x values stored in array R2 and y values stored in array R1.

TREND(R1, R2, R3) = array function which predicts the y values corresponding to the x values in R3 based on the regression line based on the x values stored in array R2 and y values stored in array R1.

To use TREND(R1, R2), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by a right parenthesis. Finally press Crtl-Shft-Enter.

To use TREND(R1, R2, R3), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by another comma and highlight the array R3 containing the values for x for which you want to predict y values based on the regression line. Now enter a right parenthesis and press Crtl-Shft-Enter.

Example 1: Calculate the regression line for the data in Example 1 of One Sample Hypothesis Testing for Correlation and plot the results.

Regression line fit Excel

Figure 1 – Fitting a regression line to the data in Example 1

Using Theorem 1 and the observation following it, we can calculate the slope b and y-intercept a of the regression line that best fits the data as in Figure 1 above. Using Excel’s charting capabilities we can plot the scatter diagram for the data in columns A and B above and then select Layout > Analysis|Trendline and choose a Linear Trendline from the list of options. This will display the regression line given by the equation y = bx + a (see Figure 1).

4 Responses to Method of Least Squares

  1. Mehran says:

    hy good job ur work is very good plz improv it more charles. i hav 2 question plz ans me. 1.what is correction factor and why we subtract it from to.s.s. and othr. 2.why we take standard deviation of data aftr takng variance. MEHRAN FROM PAKISTAN

    • Charles says:

      Hi Mehran,
      Thanks for using the website. I don’t see anything about a correction factor on the referenced webpage (i.e. Method of Least Squares). Which webpage are you referring to?
      Charles

  2. shade may says:

    if my observed values of y are greater than the values of x how can the slope be .9 less than one?

    • Charles says:

      Remember that the intercept plays a role as well as the slope. E.g. in y = x/2 + 1000 (slope .5) as long as x < 2000 x will be less than y.
      Charles

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>