Method of Least Squares

In Correlation we study the linear correlation between two random variables x and y. We now look at the line in the xy plane that best fits the data (x1, y1), …, (xn, yn).

Recall that the equation for a straight line is y = bx + a, where

b = the slope of the line
a = y-intercept, i.e. the value of y where the line intersects with the y-axis

For our purposes we write the equation of the best fit line as

image1667

and so the y-intercept is
image1668

For each i, we define ŷi as the y-value of xi on this line, and so

image1670

The best fit line is the line for which the sum of the distances between each of the n data points and the line is as small as possible. A mathematically useful approach is therefore to find the line with the property that the sum of the following squares is minimum.

image1671

Theorem 1: The best fit line for the points (x1, y1), …, (xn, yn) is given by

image1673

where
image1674

Click here for the proof of Theorem 1. Two proofs are given, one of which does not use calculus.

Definition 1: The best fit line is called the regression line.

Observation: The theorem shows that the regression line passes through the point (, ȳ) and has equation

image1673

where the slope is
image1674

and the y-intercept is
image1676

Note too that b = cov(x,y)/var(x). Since the terms involving n cancel out, this can be viewed as either the population covariance and variance or the sample covariance and variance. Thus a and b can be calculated in Excel as follows where R1 = the array of y values and R2 = the array of x values:

b = SLOPE(R1, R2) = COVAR(R1, R2) / VARP(R2)

a = INTERCEPT(R1, R2) = AVERAGE(R1) – b * AVERAGE(R2)

Property 1image1679

Proof: By Definition 2 of Correlation,

image1680

and so by the above observation we have

image1681

Excel Functions: Excel provides the following functions for forecasting the value of y for any x based on the regression line. Here R1 = the array of y data values and R2 = the array of x data values:

SLOPE(R1, R2) = slope of the regression line as described above

INTERCEPT(R1, R2) = y-intercept of the regression line as described above

FORECAST(x, R1, R2) calculates the predicted value y for the given value of x. Thus FORECAST(x, R1, R2) = a + b * x where a = INTERCEPT(R1, R2) and b = SLOPE(R1, R2).

TREND(R1, R2) = array function which produces an array of predicted y values corresponding to x values stored in array R2, based on the regression line calculated from x values stored in array R2 and y values stored in array R1.

TREND(R1, R2, R3) = array function which predicts the y values corresponding to the x values in R3 based on the regression line based on the x values stored in array R2 and y values stored in array R1.

To use TREND(R1, R2), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by a right parenthesis. Finally press Crtl-Shft-Enter.

To use TREND(R1, R2, R3), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by another comma and highlight the array R3 containing the values for x for which you want to predict y values based on the regression line. Now enter a right parenthesis and press Crtl-Shft-Enter.

Excel 2016 Function: Excel 2016 introduces a new function FORECAST.LINEAR, which is equivalent to FORECAST.

Example 1: Calculate the regression line for the data in Example 1 of One Sample Hypothesis Testing for Correlation and plot the results.

Regression line fit Excel

Figure 1 – Fitting a regression line to the data in Example 1

Using Theorem 1 and the observation following it, we can calculate the slope b and y-intercept a of the regression line that best fits the data as in Figure 1 above. Using Excel’s charting capabilities we can plot the scatter diagram for the data in columns A and B above and then select Layout > Analysis|Trendline and choose a Linear Trendline from the list of options. This will display the regression line given by the equation y = bx + a (see Figure 1).

23 Responses to Method of Least Squares

  1. Knowledgeless lady says:

    Nice page!

    But I’m looking for more specific info more concerning trends in existing time series (e.g. temperature) than in estimations of the future.

    Having in an excel table a column with dates and one with temperature values (or whatever else) , I can easily construct a chart giving a line linking all values, and then, by selecting that line, produce a trend line with the info:
    y = ax + b; R² = .

    (About the exact significance of R², there are about as many meanings as web pages talking about it.)

    But this is still not quite what I expect: I would like Excel computing the trend value for the data series, possibly with a standard error associated to the trend, like done here:

    http://www.ysbl.york.ac.uk/~cowtan/applets/trend/trend.html

    Is it possible to invoke in Excel a function computing the trend as understood here?
    The Excel trend function is certainly not the right candidate!

    Thanks in advance

    • Charles says:

      Glad you liked the referenced webpage.

      As you probably know, you can add a linear trendline to an Excel scatter chart. This will provide the trendline, but not the standard error.

      If you know the standard error and so can compute the equations of the upper and lower lines (as in the site you referenced), then you can add these lines manually to the Excel chart. In fact for any line once you know two points on the line you can create a line through these points using Excel’s Scatter with Straight Lines chart capability.

      Charles

      • Knowledgeless lady says:

        Thanks for the quick answer, Charles, but… it is exactly what I already know and did not want to to.

        Maybe you misunderstood me: I’m not interested in incorporating the lines in Kevin’s charts, which seem to denote some kind of “uncertainty surface” encompassing the value deviations from the computed trend.

        I even don’t need the error deviation he computes, as e.g. for RSS data between 1979 and today:

        Trend: 0.126 ±0.063 °C/decade

        Solely the trend would be enough.

        Thus my surprise when trying, in an Excel table I read RSS data in, to obtain the same trend info in a cell where I invoke the trend function with a year sequence as xes and the anomaly sequence as ys.

        The result is bare nonsense, what tells me no more than that I don’t use that function properly.

        I can’t imagine Excel displaying a trend line in a chart but refusing to put in a cell what it itself had computed before drawing!

        • Charles says:

          Sorry, but I don’t understand.
          Charles

          • Knowledgeless lady says:

            That’s a pity indeed! Here’s some stuff that hopefully might help.

            The example above you can see displayed by Excel in a chart here (in pdf form):

            http://fs5.directupload.net/images/160317/3zuwxkzk.pdf

            You see
            – a curve linking 444 data points together (stored in a table column)
            – the trend line together with
            – its specification
            — y = 0.001 x – 0.1183
            — R2 = 0.3029

            What I miss here: the trend value itself along these values, which is
            — Trend: 0.126
            when the x-axis runs in months from 1979 to 2015.

            And, Knowledgeless lady as I am, I naively thought that invoking in a cell on the table the f(x)

            TREND(C1:C444; A1:A444)

            would give exactly that value. However, I get as result

            -0,113

            Thus I don’t invoke the TREND fonction properly. Where is the mistake?

          • Charles says:

            Assuming that C1:C444 contains the y values of your data and A1:A444 contains the x values, =TREND(C1:C444,A1:A444) returns the forecasted y value for the first x value. If you treat =TREND(C1:C444,A1:A444) as an array formula, then you need to highlight a column range with 444 cells enter the formula =TREND(C1:C444,A1:A444) and press Ctrl-Shft-Enter (not just Enter) and in this case you would get the forecasted values corresponding to all 444 data elements. This is equivalent to the trend line.
            Charles

          • Knowledgeless lady says:

            Many thanks for “March 19, 2016 at 6:59 pm”.

            Inbetween I found a pretty good alternative (“linest”) giving trend, standard deviation and R^2 in one step.

  2. LP says:

    First, this is great stuff. Thanks for putting this out there! One question, the phrase at the top: “the value of y where the line intersects with the x-axis”…isn’t this always zero (0)? I think we mean “[the value of y] when x=0”. Or am I missing something?

    • Charles says:

      LP,
      You are correct. Thank you very much for catching this error. I have just revised the webpage to reflect this change.
      Charles

  3. SHoxha says:

    Can you tell me the whole steeps finding m and c

  4. Andrés says:

    How do you balance the accuracy of the trendline showed with its r2?

    I am studying very similar trends in a proyect and i have my doubts regarding how cautious one must be, specially with values like 50, a bit far away from the central data

    In this case i see that there isn´t a regression that tends to average ( like in studies of heights in families ). How can you increase the likeliness of this doesn´t happening in your study?

    thanks!

  5. Dyadee says:

    Hello there Sir,

    The first three equations doesn’t make sense to me yet. Can you elaborate on the meaning of each symbol, like where does “c” and “x-bar”come from and what is the reason of introducing them into the original linear equation? Can you provide me references for further understanding these equations?

    Suggestion: Is it possible for you to put equation references like (Eq. 2.11).

    Thank you so much!

    • Charles says:

      x-bar is the mean of the x sample values. This is standard notation and is used throughout the website.

      An equation of a straight line takes the form y = b*x + a (slope b and y-intercept a). An alternative form (from high school algebra) is y – y0 = b(x – x0) where (x0, y0) is any point on the line (a straight is determined by any point on the line and its slope). I am choosing to use a point who x-value is x-bar and whose y-value is an unknown value c. Thus y – c = b*(x – x-bar). By algebra y = b*x – b*x-bar + c. But y = b*x + a, and so b*x – b*x-bar + c = b*x + a, from which it follows that -b*x-bar + c = a, and so c = a + b*x-bar: i.e. (x-bar, a + b*x-bar) lies on the line.

      Charles

  6. Ryan says:

    What is the difference between the FORECAST(x, R1, R2) and TREND(R1, R2, R3) functions?

    • Charles says:

      Ryan,
      When R2 contains a single column (simple linear regression) then FORECAST(x, R1, R2) is equivalent to TREND(R1, R2, x) and FORECAST(R3, R1, R2) is equivalent to TREND(R1, R2, R3). TREND can be used when R2 contains more than one column (multiple regression) while FORECAST cannot.
      Charles

  7. Charles Steinmeyer says:

    Nice website!

    Is there a function for the slope of a regression line, when forced to have an intercept of zero?

    I know I can plot the data, fit a trend line, and then print the equation, but is there a more direct way?

    Thanks.

    • Charles says:

      Charles,
      For the case where there is only one independent variable x, the formula for the slope is b = ∑x_iy_i/∑x_i^2.
      I plan to add information about this situation to the website in the future.
      Charles

  8. shade may says:

    if my observed values of y are greater than the values of x how can the slope be .9 less than one?

    • Charles says:

      Remember that the intercept plays a role as well as the slope. E.g. in y = x/2 + 1000 (slope .5) as long as x < 2000 x will be less than y. Charles

  9. Mehran says:

    hy good job ur work is very good plz improv it more charles. i hav 2 question plz ans me. 1.what is correction factor and why we subtract it from to.s.s. and othr. 2.why we take standard deviation of data aftr takng variance. MEHRAN FROM PAKISTAN

    • Charles says:

      Hi Mehran,
      Thanks for using the website. I don’t see anything about a correction factor on the referenced webpage (i.e. Method of Least Squares). Which webpage are you referring to?
      Charles

Leave a Reply

Your email address will not be published. Required fields are marked *