Method of Least Squares for Multiple Regression

We will now extend the method of least squares to equations with multiple independent variables of the form

y = b0 + b1x1 + b2x2 + ⋯ + bkxk

As in Method of Least Squares, we express this line in the form

ŷ – ȳ = b1(x1 – x̄1) + b2(x2 – x̄2) + ⋯ + bk(xk – x̄k)

Thus,

b0 = ȳ – b1x̄1 – b2x̄2 – ⋯ – bkx̄k

Given a set of n points (x11, …, x1k, y1), …, (xn1, …, xnk, yn), our objective is to find a line of the above form which best fits the points. As in the simple regression case, this means finding the values of the bj coefficients for which the following sum of squared errors is minimized:

(y1 – ŷ1)² + (y2 – ŷ2)² + ⋯ + (yn – ŷn)²

where ŷi is the y-value on the best fit line corresponding to xi1, …, xik.

Definition 1: The best fit line is called the (multiple) regression line.

Theorem 1: The regression line has form

ŷ – ȳ = b1(x1 – x̄1) + b2(x2 – x̄2) + ⋯ + bk(xk – x̄k)

where the coefficients bm are the solutions to the following k equations in k unknowns.

cov(y, xm) = b1 cov(x1, xm) + b2 cov(x2, xm) + ⋯ + bk cov(xk, xm)   for m = 1, …, k

Click here for a proof of Theorem 1 (using calculus).

Observation: We can use either the population or sample formulas for covariance (as long as we stick to one or the other). Thus, we can use the Excel function COVAR for the population covariance (or COVARIANCE.P in Excel 2010/2013) or COVARIANCE.S for the sample covariance in Excel 2010/2013 (or the supplemental function COVARS), although as we will see below there are ways of calculating all the covariances together. Note too that where j = m

cov(xj, xm) = cov(xm, xm) = var(xm)
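Theorem 1 can be checked numerically. The following sketch (using NumPy and illustrative random data, not the article's figures) builds the k covariance equations and solves them; the resulting bj are exactly the least-squares coefficients:

```python
import numpy as np

# A numerical sketch of Theorem 1 (illustrative data, not from the article):
# solve cov(y, x_m) = b1*cov(x1, x_m) + ... + bk*cov(xk, x_m) for the b_j.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                      # n = 50 samples, k = 3 variables
y = X @ np.array([2.0, -1.0, 0.5]) + 3.0 + rng.normal(size=50)

S = np.cov(X, rowvar=False)                       # k x k matrix of cov(x_j, x_m)
c = np.array([np.cov(X[:, j], y)[0, 1] for j in range(3)])  # cov(y, x_m)
b = np.linalg.solve(S, c)                         # b_1, ..., b_k
b0 = y.mean() - X.mean(axis=0) @ b                # intercept from the means
```

As the Observation notes, using population covariances throughout would give the same b's, since the n/(n – 1) factor cancels on both sides of each equation.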

Example 1: A jeweler prices diamonds on the basis of quality (with values from 0 to 8, with 8 being flawless and 0 containing numerous imperfections) and color (with values from 1 to 10, with 10 being pure white and 1 being yellow). Based on the price per carat (in hundreds of dollars) of the following 11 diamonds weighing between 1.0 and 1.5 carats, determine the relationship between quality, color and price.


Figure 1 – Data for Example 1

As described above, we need to solve the following equations:

cov(y, x1) = b1 cov(x1, x1) + b2 cov(x2, x1)

cov(y, x2) = b1 cov(x1, x2) + b2 cov(x2, x2)

where x1 = color and x2 = quality, which for our problem yields the following equations (using the sample covariances, which can be calculated as described in Basic Concepts of Correlation):

20.5 = 5.80b1 – 2.10b2

15.35 = –2.10b1 + 6.82b2

For this example, finding the solution is quite straightforward: b1 = 4.90 and b2 = 3.76. Thus the regression line takes the form

ŷ – ȳ = 4.90(x1 – x̄1) + 3.76(x2 – x̄2)

Using the means found in Figure 1, the regression line for Example 1 is

(Price – 47.18) = 4.90 (Color – 6.00) + 3.76 (Quality – 4.27)

or equivalently

Price = 4.90 ∙ Color + 3.76 ∙ Quality + 1.75

Thus, the coefficients are b0 = 1.75, b1 = 4.90 and b2 = 3.76.
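As a quick check of this arithmetic, the two normal equations can be solved with NumPy (using the covariance values and means quoted above):

```python
import numpy as np

# Solve 20.5 = 5.80*b1 - 2.10*b2 and 15.35 = -2.10*b1 + 6.82*b2,
# then recover b0 from the means in Figure 1.
A = np.array([[ 5.80, -2.10],
              [-2.10,  6.82]])
c = np.array([20.5, 15.35])
b1, b2 = np.linalg.solve(A, c)             # b1 ≈ 4.90, b2 ≈ 3.76
b0 = 47.18 - b1 * 6.00 - b2 * 4.27         # close to the quoted 1.75
```

Small differences in b0 (1.76 here vs. 1.75 in the text) come from rounding of the quoted covariances and means.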

Observation: The fact that coefficient b1 is larger than b2 doesn’t mean that it plays a stronger role in the prediction described by the regression line. If, however, we standardize all the variables used to create the regression line, then the coefficients with a larger absolute value do have a greater influence on the prediction defined by the regression line. Note that if we do this, the intercept will be zero.
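This standardization remark can be verified with a short sketch (illustrative random data, not the diamond example): after standardizing y and the x's, the fitted intercept vanishes.

```python
import numpy as np

# Illustrative check: regress standardized y on standardized x's; the
# intercept is zero (up to floating-point rounding).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2)) * np.array([1.0, 10.0])   # very different scales
y = 2.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(size=30)

Z = np.column_stack([X, y])
Z = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)       # standardize each column

A = np.column_stack([np.ones(30), Z[:, :2]])
coef, *_ = np.linalg.lstsq(A, Z[:, 2], rcond=None)     # [b0, b1, b2]
# coef[0] (the intercept) is numerically zero
```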

Observation: With only two independent variables, it is relatively easy to calculate the coefficients for the regression line as described above. With more variables, this approach becomes tedious, and so we now define a more refined method.

Definition 2: Given m random variables x1, x2, …, xm and a sample x1j, x2j, …, xnj of size n for each random variable xj, the covariance matrix is an m × m array of form [cij] where cij = cov(xi, xj). The correlation matrix is an m × m array of form [cij] where cij is the correlation coefficient between xi and xj.

The sample covariance matrix is the covariance matrix where the cij refer to the sample covariances and the population covariance matrix is the covariance matrix where the cij refer to the population covariances.

Since the corresponding sample and population correlation matrices are the same, we refer to them simply as the correlation matrix.

Property 0: If X is the n × m array [xij] and X̄ is the 1 × m array [x̄j] of column means, then the sample covariance matrix S and the population covariance matrix Σ have the following property (here X – X̄ means that the row of column means is subtracted from each row of X):

S = (X – X̄)ᵀ(X – X̄) / (n – 1)

Σ = (X – X̄)ᵀ(X – X̄) / n
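Property 0 translates directly into matrix code. A minimal NumPy sketch (with illustrative data) for both versions:

```python
import numpy as np

# Property 0 in NumPy: covariance matrices from the mean deviations,
# dividing by n-1 (sample) or n (population).
rng = np.random.default_rng(2)
X = rng.normal(size=(11, 3))            # n = 11 rows, m = 3 variables
n = X.shape[0]
D = X - X.mean(axis=0)                  # subtract each column's mean

S = D.T @ D / (n - 1)                   # sample covariance matrix
Sigma = D.T @ D / n                     # population covariance matrix
```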

Example 2: Find the regression line for the data in Example 1 using the covariance matrix.

The approach is described in Figure 2.


Figure 2 – Creating the regression line using the covariance matrix

The sample covariance matrix for this example is found in the range G6:I8. Since we have 3 variables, it is a 3 × 3 matrix. In general, the covariance matrix is a (k+1) × (k+1) matrix where k = the number of independent variables. The sample covariance matrix can be created in Excel, cell by cell, using the COVARIANCE.S or COVARS function. Alternatively, using Property 0, it can be created by highlighting the range G6:I8 and using the following array formula:

=MMULT(TRANSPOSE(A4:C14-A15:C15),A4:C14-A15:C15)/(B17-1)

(see Matrix Operations for more information about these matrix operations). The sample covariance matrix can also be created using the following supplemental array function (as described below):

=COV(A4:C14)

Note that the linear equations that need to be solved arise from the first 2 rows (in general, the first k rows) of the covariance matrix, which we have repeated in the range G12:I13 of Figure 2.

Solving this system of linear equations is equivalent to solving the matrix equation AX = C where X is the k × 1 column vector consisting of the bj, C = the k × 1 column vector consisting of the constant terms and A is the k × k matrix consisting of the coefficients of the bj terms in the above equations. Using the techniques of Matrix Operations and Simultaneous Linear Equations, the solution is given by X = A⁻¹C. For this example the solution A⁻¹C is located in the range K16:K17, and can be calculated by the array formula:

=MMULT(MINVERSE(F16:G17),I16:I17)

Thus b1 is the value in cell K16 (or G20) and b2 is the value in cell K17 (or G21). The value of the coefficient b0 (in cell G19) is found using the following Excel formula:

=C15-MMULT(A15:B15,G20:G21)
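The whole spreadsheet pipeline of Figure 2 can be sketched in NumPy terms (with illustrative random data standing in for the Figure 1 values): build the (k+1) × (k+1) covariance matrix, extract the k × k block A and k × 1 column C, then solve.

```python
import numpy as np

# Mirror of the Figure 2 pipeline (illustrative data, not Figure 1):
rng = np.random.default_rng(3)
data = rng.normal(size=(11, 3))         # columns: x1, x2, y
S = np.cov(data, rowvar=False)          # like the COV(A4:C14) array formula
A = S[:2, :2]                           # coefficients of b1, b2
C = S[:2, 2:]                           # constant terms cov(y, x_j)
B = np.linalg.inv(A) @ C                # like MMULT(MINVERSE(...), ...)
b0 = data[:, 2].mean() - data[:, :2].mean(axis=0) @ B   # like =C15-MMULT(...)
```

The test of correctness is that these coefficients agree exactly with an ordinary least-squares fit of y on x1 and x2.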

Real Statistics Excel Support: The Real Statistics Resources Pack provides the following supplemental array functions:

COV(R1, b) = the covariance matrix for the sample data contained in range R1, organized by columns. If R1 is an n × k array (i.e. k variables, each with a sample of size n), then COV(R1) is a k × k array.

COVP(R1, b) = the population covariance matrix for the data contained in range R1. The result is the same as COV(R1) except that entries use the population version of covariance (i.e. division by n instead of n – 1).

CORR(R1, b) = the correlation matrix for the data contained in range R1.

If b = TRUE (default) then any row in R1 which contains a blank or non-numeric cell is not used, while if b = FALSE then correlation/covariance coefficients are calculated pairwise (by columns) and so any row which contains non-numeric data for either column in the pair is not used to calculate that coefficient value.

The Real Statistics Resource Pack also contains a Matrix Operations data analysis tool which includes similar functionality.

Observation: Let R1 be an n × k range which contains only numeric values, let R2 be a 1 × k range containing the means of the columns in R1, and let R3 be a 1 × k range containing the standard deviations of the columns in R1. Then

COV(R1) = MMULT(TRANSPOSE(R1-R2),R1-R2)/(ROWS(R1)-1)

CORR(R1) = MMULT(TRANSPOSE((R1-R2)/R3),(R1-R2)/R3)/(ROWS(R1)-1)
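These two array formulas translate directly to matrix code. A short NumPy sketch (illustrative data), with R2 holding the column means and R3 the column sample standard deviations:

```python
import numpy as np

# COV from mean deviations, CORR from standardized deviations.
rng = np.random.default_rng(4)
R1 = rng.normal(size=(11, 3))
n = R1.shape[0]
R2 = R1.mean(axis=0)                    # column means
R3 = R1.std(axis=0, ddof=1)             # column sample standard deviations

cov = (R1 - R2).T @ (R1 - R2) / (n - 1)
corr = ((R1 - R2) / R3).T @ ((R1 - R2) / R3) / (n - 1)
```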

17 Responses to Method of Least Squares for Multiple Regression

  1. Rainer says:

    Hi,
    I think the second formula below figure 1 contains an error:
    http://www.real-statistics.com/wp-content/uploads/2013/02/image1824.png
    It should probably read:
    cov(y,x2) =
    Regards,
    Rainer

    • Charles says:

      Rainer,
      Another good catch. I have made the change. Thanks for your careful examination. It is much appreciated.
      Charles

  2. Tapan kumar mahanta says:

    Hi,
    sir how you found 20.5= 5.80b1-2.10b2 from figure 1 by using the equation
    cov(y,x1)=b1 cov(x1,x1)+b2 cov(x2,x1)

    • Charles says:

      You have only written one of the equations; there are two linear equations in two unknowns. The coefficients b1 and b2 are the unknowns, while the values for cov(y,x1), cov(x1,x2), etc. are known (they can be calculated from the sample data values). Normal algebra can be used to solve two equations in two unknowns. With more equations and more unknowns you can still use algebra, but you can also use the techniques shown elsewhere on the site.
      Charles

  3. Joshua Dunn says:

    Hi Charles,

    Just wanted to say a massive thank you for creating this whole site. It has fantastically written pieces with the relevant mathematically formulations for those who wish to fully understand the processes and brilliant examples for those who just wish to use them. I wished to do both so to have both in one place in depth is invaluable.

    Many, many thanks,
    Josh

  4. Emrah says:

    Hi Charles, do you have a matlab code for the least square with an example?

    • Charles says:

      Hi Emrah,
      Sorry, but I don’t have matlab code for the least square. Examples using Excel can be found on my website.
      Charles

      • Joshua Dunn says:

        I do if it is still not too late. There are also a lot of regression add-ons for matlab on the File Exchange. Let me know and good luck

  5. Vaish says:

    Thanks a lot for this website! Really helped a lot..

  6. Rhijuta says:

    Thank you for the formulas. My textbook and Prof ( for online course) are both useless. This helped a lot and I was actually able to do my assignment. Appreciate it!!

  7. Pierre van Eeden says:

    Good, clearly written approach – thanks!

  8. leonidas says:

    Hi Charles,

    I made the calculations of the equations below figure 1 and i found that there is a trouble… the result of my calculation is
    20.5 =5.8b1 -2.1b2
    15.34 = -2.1b1 – 6.82b2
    S0, cov(y,x1)= 15,34, cov(x1,x2)=-2.10, cov(x1,x1)=6.82, cov(x2,x2)= 5.8
    and cov(y,x2)= 20,5
    with color=x2 and quality= x1 (as you say in the start of the text)
    Are there any mistakes in the equations?

    • Charles says:

      Everything looks good except that you made a typo in the second equation. It should be 15.34 = -2.1b1 + 6.82b2.
      Charles

  9. Stan says:

    How did you end up with the factors for b1 and b2, 4.9 and 3.76, respectively?

    • Charles says:

      Stan,

      This is explained on the referenced webpage. Essentially you generate the following two linear equations

      20.5 = 5.80b1 – 2.10b2
      15.35 = -2.10b1 + 6.82b2

      and then you find the solution using high school algebra. E.g. if you multiply the first equation by 2.10 and multiply the second equation by 5.80 and then add the two equations together, the b1 term will drop out and you can solve the resulting equation for b2. Once you have the value for b2, you can substitute it in the first equation and solve for b1.

      Charles

  10. Stan says:

    Can you please explain the following:
    For this example, finding the solution is quite straightforward: b1 = 4.90 and b2 = 3.76.
