# Method of Least Squares

In Correlation we study the linear correlation between two random variables x and y. We now look at the line in the xy plane that best fits the data (x1, y1), …, (xn, yn).

Recall that the equation for a straight line is y = bx + a, where

b = the slope of the line
a = y-intercept, i.e. the value of y where the line intersects with the y-axis

For our purposes we write the equation of the best fit line as

and so the y-intercept is

For each i, we define ŷi as the y-value of xi on this line, and so

The best fit line is the line for which the sum of the distances between each of the n data points and the line is as small as possible. A mathematically useful approach is therefore to find the line with the property that the sum of the following squares is minimum.

Theorem 1: The best fit line for the points (x1, y1), …, (xn, yn) is given by

where

Click here for the proof of Theorem 1. Two proofs are given, one of which does not use calculus.

Definition 1: The best fit line is called the regression line.

Observation: The theorem shows that the regression line passes through the point (, ȳ) and has equation

where the slope is

and the y-intercept is

Note too that b = cov(x,y)/var(x). Since the terms involving n cancel out, this can be viewed as either the population covariance and variance or the sample covariance and variance. Thus a and b can be calculated in Excel as follows where R1 = the array of y values and R2 = the array of x values:

b = SLOPE(R1, R2) = COVAR(R1, R2) / VARP(R2)

a = INTERCEPT(R1, R2) = AVERAGE(R1) – b * AVERAGE(R2)

Property 1

Proof: By Definition 2 of Correlation,

and so by the above observation we have

Excel Functions: Excel provides the following functions for forecasting the value of y for any x based on the regression line. Here R1 = the array of y data values and R2 = the array of x data values:

SLOPE(R1, R2) = slope of the regression line as described above

INTERCEPT(R1, R2) = y-intercept of the regression line as described above

FORECAST(x, R1, R2) calculates the predicted value y for the given value of x. Thus FORECAST(x, R1, R2) = a + b * x where a = INTERCEPT(R1, R2) and b = SLOPE(R1, R2).

TREND(R1, R2) = array function which produces an array of predicted y values corresponding to x values stored in array R2, based on the regression line calculated from x values stored in array R2 and y values stored in array R1.

TREND(R1, R2, R3) = array function which predicts the y values corresponding to the x values in R3 based on the regression line based on the x values stored in array R2 and y values stored in array R1.

To use TREND(R1, R2), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by a right parenthesis. Finally press Crtl-Shft-Enter.

To use TREND(R1, R2, R3), highlight the range where you want to store the predicted values of y. Then enter TREND and a left parenthesis. Next highlight the array of observed values for y (array R1), enter a comma and highlight the array of observed values for x (array R2) followed by another comma and highlight the array R3 containing the values for x for which you want to predict y values based on the regression line. Now enter a right parenthesis and press Crtl-Shft-Enter.

Excel 2016 Function: Excel 2016 introduces a new function FORECAST.LINEAR, which is equivalent to FORECAST.

Example 1: Calculate the regression line for the data in Example 1 of One Sample Hypothesis Testing for Correlation and plot the results.

Figure 1 – Fitting a regression line to the data in Example 1

Using Theorem 1 and the observation following it, we can calculate the slope b and y-intercept a of the regression line that best fits the data as in Figure 1 above. Using Excel’s charting capabilities we can plot the scatter diagram for the data in columns A and B above and then select Layout > Analysis|Trendline and choose a Linear Trendline from the list of options. This will display the regression line given by the equation y = bx + a (see Figure 1).

### 36 Responses to Method of Least Squares

1. Mehran says:

hy good job ur work is very good plz improv it more charles. i hav 2 question plz ans me. 1.what is correction factor and why we subtract it from to.s.s. and othr. 2.why we take standard deviation of data aftr takng variance. MEHRAN FROM PAKISTAN

• Charles says:

Hi Mehran,
Thanks for using the website. I don’t see anything about a correction factor on the referenced webpage (i.e. Method of Least Squares). Which webpage are you referring to?
Charles

if my observed values of y are greater than the values of x how can the slope be .9 less than one?

• Charles says:

Remember that the intercept plays a role as well as the slope. E.g. in y = x/2 + 1000 (slope .5) as long as x < 2000 x will be less than y. Charles

3. Charles Steinmeyer says:

Nice website!

Is there a function for the slope of a regression line, when forced to have an intercept of zero?

I know I can plot the data, fit a trend line, and then print the equation, but is there a more direct way?

Thanks.

• Charles says:

Charles,
For the case where there is only one independent variable x, the formula for the slope is b = ∑x_iy_i/∑x_i^2.
Charles

4. Ryan says:

What is the difference between the FORECAST(x, R1, R2) and TREND(R1, R2, R3) functions?

• Charles says:

Ryan,
When R2 contains a single column (simple linear regression) then FORECAST(x, R1, R2) is equivalent to TREND(R1, R2, x) and FORECAST(R3, R1, R2) is equivalent to TREND(R1, R2, R3). TREND can be used when R2 contains more than one column (multiple regression) while FORECAST cannot.
Charles

Hello there Sir,

The first three equations doesn’t make sense to me yet. Can you elaborate on the meaning of each symbol, like where does “c” and “x-bar”come from and what is the reason of introducing them into the original linear equation? Can you provide me references for further understanding these equations?

Suggestion: Is it possible for you to put equation references like (Eq. 2.11).

Thank you so much!

• Charles says:

x-bar is the mean of the x sample values. This is standard notation and is used throughout the website.

An equation of a straight line takes the form y = b*x + a (slope b and y-intercept a). An alternative form (from high school algebra) is y – y0 = b(x – x0) where (x0, y0) is any point on the line (a straight is determined by any point on the line and its slope). I am choosing to use a point who x-value is x-bar and whose y-value is an unknown value c. Thus y – c = b*(x – x-bar). By algebra y = b*x – b*x-bar + c. But y = b*x + a, and so b*x – b*x-bar + c = b*x + a, from which it follows that -b*x-bar + c = a, and so c = a + b*x-bar: i.e. (x-bar, a + b*x-bar) lies on the line.

Charles

6. Andrés says:

How do you balance the accuracy of the trendline showed with its r2?

I am studying very similar trends in a proyect and i have my doubts regarding how cautious one must be, specially with values like 50, a bit far away from the central data

In this case i see that there isn´t a regression that tends to average ( like in studies of heights in families ). How can you increase the likeliness of this doesn´t happening in your study?

thanks!

• Charles says:

Sorry Andrés, but I don’t understand your comment.
Charles

7. SHoxha says:

Can you tell me the whole steeps finding m and c

• Charles says:

I don’t see any “m” on the referenced webpage. What are you referring to_
Charles

8. LP says:

First, this is great stuff. Thanks for putting this out there! One question, the phrase at the top: “the value of y where the line intersects with the x-axis”…isn’t this always zero (0)? I think we mean “[the value of y] when x=0”. Or am I missing something?

• Charles says:

LP,
You are correct. Thank you very much for catching this error. I have just revised the webpage to reflect this change.
Charles

Nice page!

But I’m looking for more specific info more concerning trends in existing time series (e.g. temperature) than in estimations of the future.

Having in an excel table a column with dates and one with temperature values (or whatever else) , I can easily construct a chart giving a line linking all values, and then, by selecting that line, produce a trend line with the info:
y = ax + b; R² = .

(About the exact significance of R², there are about as many meanings as web pages talking about it.)

But this is still not quite what I expect: I would like Excel computing the trend value for the data series, possibly with a standard error associated to the trend, like done here:

http://www.ysbl.york.ac.uk/~cowtan/applets/trend/trend.html

Is it possible to invoke in Excel a function computing the trend as understood here?
The Excel trend function is certainly not the right candidate!

• Charles says:

Glad you liked the referenced webpage.

As you probably know, you can add a linear trendline to an Excel scatter chart. This will provide the trendline, but not the standard error.

If you know the standard error and so can compute the equations of the upper and lower lines (as in the site you referenced), then you can add these lines manually to the Excel chart. In fact for any line once you know two points on the line you can create a line through these points using Excel’s Scatter with Straight Lines chart capability.

Charles

Thanks for the quick answer, Charles, but… it is exactly what I already know and did not want to to.

Maybe you misunderstood me: I’m not interested in incorporating the lines in Kevin’s charts, which seem to denote some kind of “uncertainty surface” encompassing the value deviations from the computed trend.

I even don’t need the error deviation he computes, as e.g. for RSS data between 1979 and today:

Solely the trend would be enough.

Thus my surprise when trying, in an Excel table I read RSS data in, to obtain the same trend info in a cell where I invoke the trend function with a year sequence as xes and the anomaly sequence as ys.

The result is bare nonsense, what tells me no more than that I don’t use that function properly.

I can’t imagine Excel displaying a trend line in a chart but refusing to put in a cell what it itself had computed before drawing!

• Charles says:

Sorry, but I don’t understand.
Charles

That’s a pity indeed! Here’s some stuff that hopefully might help.

The example above you can see displayed by Excel in a chart here (in pdf form):

You see
– a curve linking 444 data points together (stored in a table column)
– the trend line together with
– its specification
— y = 0.001 x – 0.1183
— R2 = 0.3029

What I miss here: the trend value itself along these values, which is
— Trend: 0.126
when the x-axis runs in months from 1979 to 2015.

And, Knowledgeless lady as I am, I naively thought that invoking in a cell on the table the f(x)

TREND(C1:C444; A1:A444)

would give exactly that value. However, I get as result

-0,113

Thus I don’t invoke the TREND fonction properly. Where is the mistake?

• Charles says:

Assuming that C1:C444 contains the y values of your data and A1:A444 contains the x values, =TREND(C1:C444,A1:A444) returns the forecasted y value for the first x value. If you treat =TREND(C1:C444,A1:A444) as an array formula, then you need to highlight a column range with 444 cells enter the formula =TREND(C1:C444,A1:A444) and press Ctrl-Shft-Enter (not just Enter) and in this case you would get the forecasted values corresponding to all 444 data elements. This is equivalent to the trend line.
Charles

Many thanks for “March 19, 2016 at 6:59 pm”.

Inbetween I found a pretty good alternative (“linest”) giving trend, standard deviation and R^2 in one step.

10. barry says:

Hi, great page.

Forgive this 101 question.

I’ve been using the stats add-on for Excel than includes the regression function.

I want to know which box to read to see what the trend/slope is.

Here’s what I do in pictures, to make it clearer.

First I open the statistics add-on package and choose Regression: Picture 1
Then I select Labels, Residuals, Residual plots, Standardised residuals and Line fit plots, plug in the X and Y variables in their respective dialogue boxes: Picture 2
When I click OK the results appear in a new spreadsheet: Picture 3

Is the slope given in one of the boxes? Or do I need to make another calculation?

Not being a regular user of Excel, I’ve simply subtracted the first point of the line fit from the last and divided by the number of data points to get the trend (in this case, divide by number of months).

Does one of the boxes in Picture 3 give me the slope?

• Charles says:

Barry,
Picture 3 shows the slope. It is the coefficient (below intercept).
Charles

• barry says:

Thanks very much!

I noticed Knowledgeless lady (below) was working with monthly global lower tropospheric temperature data from Remote Sensing Systems. My examples were based on another source that provides data for the same metric – University of Huntsville, Alabama.

I would not be surprised if you got a few drop-ins following/participating in the online climate debates. Global temperatures are a hot-button issue in the semi-popular debate, and there are a few online apps that give trends (and some give the confidence intervals, such as at the link she provided) for various climate data just by pressing a few buttons. Many regulars, including me, have no training in stats at all, and some of us are trying to get to grips with it conceptually, even if the math is beyond us.

Don’t know if you know of it, but there’s an excellent blog devoted to analysing climate trends and educating people on how statistics work, including common – and uncommon – pitfalls. The author is a high-end statistical analyst and his posts (if you ignore the biffo) are wonderfully clear and concise. If you’ve a few minutes and it sounds interesting, to you, I’d recommend checking it out.

This is the blog:

https://tamino.wordpress.com/

His older posts have been archived here.

Once again, thanks for your help.

• Charles says:

Thanks Barry for sharing this information with us. I have actually seen this blog before.
Charles

11. Ima says:

Dear Charles
First, I would like to thank you for you great page. Second, my problem is; I have 3 input data (time, speed, acceleration) and 1 output data ( emissions). I would like to establish the relitionship between input and output data .
Can you help me what method that I can used it. Finally, thank you for your kind support in advance
Ima

12. Shahar Dagan says:

I don’t understand nothing that write here where I should begin to study this subject to understand this(Some free internet basic Course) ?

• Charles says:

Shahar,
Sorry, but we don-t offer free courses.
Charles

13. Jonathan Bechtel says:

Charles,

For the equation of the best fine line, what does c stand for in:

y-c=b(x-xavg)??

• Charles says:

Jonathan,
c is the value of y when x is the average of the x values.
Charles

• Jonathan Bechtel says:

Thank you. Do y and x represent the vector of values for X and Y when not denoted by Xi and Yi?

• Charles says:

Jonathan,
Yes, you can view y as representing the vector consisting of the elements yi. Alternatively y can be viewed as a random variable.
The same is true for x, except that now in addition to being viewed as a vector consisting of the elements xi, it can also be viewed as a matrix with values xij (this is the multiple linear regression case).
Charles