We will now extend the method of least squares to equations with multiple independent variables of form

As in Method of Least Squares, we express this line in the form

Given a set of *n* points (*x*_{11}, …, *x*_{1k}, *y*_{1}), …, (*x*_{n1}, …, *x*_{nk}, *y _{n}*), our objective is to find a line of the above form which best fits the points. As in the simple regression case, this means finding the values of the *b _{j}* coefficients for which the sum of the squares, expressed as follows, is minimum:

where *ŷ _{i}* is the y-value on the best fit line corresponding to *x*_{i1}, …, *x*_{ik}.

**Definition 1**: The best fit line is called the (**multiple**) **regression line**.

**Theorem 1**: The regression line has form

where the coefficients *b _{m}* are the solutions to the following *k* equations in *k* unknowns.

Click here for a proof of Theorem 1 (using calculus).

**Observation**: We can use either the population or sample formulas for covariance (as long as we stick to one or the other). Thus, we can use the Excel function COVAR for the population covariance (or COVARIANCE.P in Excel 2010/2013) or COVARIANCE.S for the sample covariance in Excel 2010/2013 (or the supplemental function COVARS), although as we will see below there are ways of calculating all the covariances together. Note too that when *j* = *m*, cov(*x _{j}*, *x _{m}*) = var(*x _{j}*).
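The system of *k* normal equations, cov(*y*, *x _{m}*) = Σ_{j} *b _{j}* cov(*x _{j}*, *x _{m}*), can be set up and solved directly. A minimal NumPy sketch, using hypothetical sample data (any covariance convention works, as long as it is used consistently for both sides):

```python
import numpy as np

# Hypothetical sample data: rows are observations, columns are x1, x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]])
y = np.array([3.1, 4.2, 7.9, 9.1, 12.8])

# Covariance matrix of (x1, x2, y); np.cov uses the sample convention.
full = np.cov(np.column_stack([X, y]), rowvar=False)
A = full[:2, :2]           # the k x k block of cov(x_j, x_m)
c = full[:2, 2]            # the k-vector of cov(y, x_m)

b = np.linalg.solve(A, c)  # coefficients b_1, ..., b_k
b0 = y.mean() - X.mean(axis=0) @ b  # intercept recovered from the means
```

The same approach scales to any number of independent variables, which is the point of the covariance-matrix method developed below.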

**Example 1**: A jeweler prices diamonds on the basis of quality (with values from 0 to 8, with 8 being flawless and 0 containing numerous imperfections) and color (with values from 1 to 10, with 10 being pure white and 1 being yellow). Based on the price per carat (in hundreds of dollars) of the following 11 diamonds weighing between 1.0 and 1.5 carats, determine the relationship between quality, color and price.

As described above, we need to solve the following equations:

where *x*_{1} = color and *x*_{2} = quality, which for our problem yields the following equations (using the covariances that can be calculated via the COVAR function as described in Basic Concepts of Correlation):

For this example, finding the solution is quite straightforward: *b*_{1} = 4.90 and *b*_{2} = 3.76. Thus the regression line takes the form

Using the means found in Figure 1, the regression line for Example 1 is

(Price – 47.18) = 4.90 (Color – 6.00) + 3.76 (Quality – 4.27)

or equivalently

Price = 4.90 ∙ Color + 3.76 ∙ Quality + 1.75

Thus, the coefficients are *b*_{0} = 1.75, *b*_{1} = 4.90 and *b*_{2} = 3.76.
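The intercept arithmetic can be checked directly from the means quoted above. A quick sketch (the small gap from 1.75 comes from rounding *b*_{1} and *b*_{2} to two decimals):

```python
# Means from Figure 1 and the solved slope coefficients (rounded).
mean_price, mean_color, mean_quality = 47.18, 6.00, 4.27
b1, b2 = 4.90, 3.76  # color and quality coefficients

# b0 = ybar - b1 * x1bar - b2 * x2bar
b0 = mean_price - b1 * mean_color - b2 * mean_quality
print(round(b0, 2))  # 1.72 (close to 1.75; the gap is rounding of b1, b2)
```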

**Observation**: The fact that coefficient *b*_{1} is larger than *b*_{2} doesn’t mean that it plays a stronger role in the prediction described by the regression line. If, however, we standardize all the variables that are used to create the regression line, then indeed the coefficients that have a larger absolute value do have a greater influence in the prediction defined by the regression line. Note that if we do this the intercept will be zero.

**Observation**: With only two independent variables, it is relatively easy to calculate the coefficients for the regression line as described above. With more variables, this approach becomes tedious, and so we now define a more refined method.

**Definition 2**: Given *m* random variables *x*_{1}, *x*_{2}, …, *x _{m}* and a sample *x*_{1j}, *x*_{2j}, …, *x*_{nj} of size *n* for each random variable *x _{j}*, the **covariance matrix** is an *m* × *m* array of form [*c _{ij}*] where *c _{ij}* = cov(*x _{i}*, *x _{j}*). The **correlation matrix** is an *m* × *m* array of form [*c _{ij}*] where *c _{ij}* is the correlation coefficient between *x _{i}* and *x _{j}*.

The **sample covariance matrix** is the covariance matrix where the *c _{ij}* refer to the sample covariances, and the **population covariance matrix** is the covariance matrix where the *c _{ij}* refer to the population covariances.

Since the corresponding sample and population correlation matrices are the same, we refer to them simply as the correlation matrix.

**Property 0**: If *X* is the *n* × *m* array [*x _{ij}*] and *x̄* is the 1 × *m* array [*x̄ _{j}*] of column means, then the sample covariance matrix *S* and the population covariance matrix *Σ* have the following property:

*S* = (*X* – *x̄*)^{T}(*X* – *x̄*) / (*n* – 1)          *Σ* = (*X* – *x̄*)^{T}(*X* – *x̄*) / *n*

where *X* – *x̄* means that *x̄ _{j}* is subtracted from every entry in column *j* of *X*.
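Property 0 translates directly into matrix code. A short NumPy sketch with hypothetical data, cross-checked against NumPy's built-in sample covariance:

```python
import numpy as np

# Hypothetical n x m data array (rows = observations, columns = variables).
X = np.array([[4.0, 7.0, 50.0],
              [6.0, 5.0, 45.0],
              [8.0, 9.0, 62.0],
              [3.0, 6.0, 40.0]])
n = X.shape[0]
xbar = X.mean(axis=0)        # the 1 x m array of column means

D = X - xbar                 # subtract each column's mean (broadcasting)
S = D.T @ D / (n - 1)        # sample covariance matrix
Sigma = D.T @ D / n          # population covariance matrix

# Agrees with NumPy's built-in sample covariance.
assert np.allclose(S, np.cov(X, rowvar=False))
```

This is exactly what the Excel array formula below does with MMULT and TRANSPOSE.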

_{j}**Example 2**: Find the regression line for the data in Example 1 using the covariance matrix.

The approach is described in Figure 2.

The sample covariance matrix for this example is found in the range G6:I8. Since we have 3 variables, it is a 3 × 3 matrix. In general, the covariance matrix is a (*k*+1) × (*k*+1) matrix where *k* = the number of independent variables. The sample covariance matrix can be created in Excel, cell by cell, using the COVARIANCE.S or COVARS function. Alternatively, using Property 0, it can be created by highlighting the range G6:I8 and using the following array formula:

=MMULT(TRANSPOSE(A4:C14-A15:C15),A4:C14-A15:C15)/(B17-1)

(see Matrix Operations for more information about these matrix operations). The sample covariance matrix can also be created using the following supplemental array function (as described below):

=COV(A4:C14)

Note that the linear equations that need to be solved arise from the first 2 rows (in general, the first *k* rows) of the covariance matrix, which we have repeated in the range G12:I13 of Figure 2.

Solving this system of linear equations is equivalent to solving the matrix equation *AX* = *C*, where *X* is the *k* × 1 column vector consisting of the *b _{j}*, *C* is the *k* × 1 column vector consisting of the constant terms, and *A* is the *k* × *k* matrix consisting of the coefficients of the *b _{j}* terms in the above equations. Using the techniques of Matrix Operations and Simultaneous Linear Equations, the solution is given by *X* = *A*^{-1}*C*. For this example the solution *A*^{-1}*C* is located in the range K16:K17, and can be calculated by the array formula:

=MMULT(MINVERSE(F16:G17),I16:I17)

Thus *b*_{1} is the value in cell K16 (or G20) and *b*_{2} is the value in cell K17 (or G21). The value of the coefficient *b*_{0} (in cell G19) is found using the following Excel formula:

=C15-MMULT(A15:B15,G20:G21)

**Real Statistics Excel Support**: The Real Statistics Resources Pack provides the following supplemental array functions:

**COV**(R1, *b*) = the covariance matrix for the sample data contained in range R1, organized by columns. If R1 is an *n × k* array (i.e. *k* variables, each with a sample of size *n*), then COV(R1) is a *k × k* array.

**COVP**(R1, *b*) = the population covariance matrix for the data contained in range R1. The result is the same as COV(R1) except that entries use the population version of covariance (i.e. division by *n* instead of *n* – 1).

**CORR**(R1, *b*) = the correlation matrix for the data contained in range R1.

If *b* = TRUE (default) then any row in R1 which contains a blank or non-numeric cell is not used, while if *b* = FALSE then correlation/covariance coefficients are calculated pairwise (by columns) and so any row which contains non-numeric data for either column in the pair is not used to calculate that coefficient value.
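The difference between the *b* = TRUE (listwise) and *b* = FALSE (pairwise) behaviours can be sketched in NumPy with hypothetical data containing a missing value; this is an illustration of the two deletion strategies, not the Real Statistics implementation itself:

```python
import numpy as np

# Hypothetical data with a missing value (NaN) in one row.
data = np.array([[1.0, 2.0, 3.0],
                 [2.0, np.nan, 5.0],
                 [3.0, 6.0, 7.0],
                 [4.0, 8.0, 6.0]])

# b = TRUE (listwise): drop any row containing a missing value.
complete = data[~np.isnan(data).any(axis=1)]
cov_listwise = np.cov(complete, rowvar=False)

# b = FALSE (pairwise): each entry uses only the rows where BOTH
# columns of the pair are present.
m = data.shape[1]
cov_pairwise = np.empty((m, m))
for i in range(m):
    for j in range(m):
        ok = ~np.isnan(data[:, i]) & ~np.isnan(data[:, j])
        pair = np.cov(data[ok][:, [i, j]], rowvar=False)
        cov_pairwise[i, j] = pair[0, 1]
```

Note that pairwise entries for columns with no missing values use all the rows, so the two matrices generally differ whenever any data are missing.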

The Real Statistics Resource Pack also contains a **Matrix Operations** data analysis tool which includes similar functionality.

**Definition 3**: We now reformulate the least squares model using matrix notation (see Basic Concepts of Matrices and Matrix Operations for more details about matrices and how to operate with matrices in Excel).

We start with a sample {*y*_{1}, …, *y _{n}*} of size *n* for the dependent variable *y* and samples {*x*_{1j}, *x*_{2j}, …, *x*_{nj}} of size *n* for each of the independent variables *x _{j}*, for *j* = 1, 2, …, *k*.

Let *Y* = an *n* × 1 column vector with the entries y_{1}, …, y_{n}. Let *X* be the *n* × (*k*+1) matrix (called the **design matrix**):

Let *B* be a (*k*+1) × 1 column vector consisting of the coefficients *b*_{0}, *b*_{1}, …, *b _{k}*, and let *Y*-hat be the *n* × 1 column vector consisting of the entries ŷ_{1}, …, ŷ* _{n}*. Then the least squares model can be expressed as

Furthermore we define the *n* × *n* **hat matrix** *H* as
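A NumPy sketch of the hat matrix, assuming the usual definition *H* = *X*(*X*^{T}*X*)^{-1}*X*^{T} and hypothetical data; applying *H* to *Y* "puts the hat on" *Y*:

```python
import numpy as np

# Hypothetical design matrix X (first column of 1s) and response Y.
X = np.array([[1.0, 4.0, 7.0],
              [1.0, 6.0, 5.0],
              [1.0, 8.0, 9.0],
              [1.0, 3.0, 6.0],
              [1.0, 5.0, 4.0]])
Y = np.array([50.0, 45.0, 62.0, 40.0, 38.0])

H = X @ np.linalg.inv(X.T @ X) @ X.T  # n x n hat matrix
Y_hat = H @ Y                         # predicted values

# H is symmetric and idempotent (applying it twice changes nothing).
assert np.allclose(H, H.T)
assert np.allclose(H @ H, H)
```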

**Property 1**: The regression coefficients and predicted values are given by

*B* = (*X*^{T}*X*)^{-1}*X*^{T}*Y* and *Y*-hat = *HY*

Proof: The second assertion follows from the first since by Definition 3,

**Property 2**: The covariance matrix of *B* can be represented by

In particular, the diagonal of *C* = [*c _{ij}*] contains the variances of the *b _{j}*, and so the standard error of *b _{j}* can be expressed as

**Example 3**: Calculate the linear regression coefficients for the data in Example 1 using Property 1.

The result is displayed in Figure 3. Range E4:G14 contains the design matrix *X* and range I4:I14 contains *Y*. The matrix (*X*^{T}*X*)^{-1} in range E17:G19 can be calculated using the array formula

=MINVERSE(MMULT(TRANSPOSE(E4:G14),E4:G14))

Per Property 1, the coefficient vector *B* (in range K4:K6) can be calculated using the array formula:

=MMULT(E17:G19,MMULT(TRANSPOSE(E4:G14),I4:I14))

The predicted values of *Y*, i.e. *Y*-hat, can then be calculated using the array formula

=MMULT(E4:G14,K4:K6)
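The three Excel array formulas above correspond line for line to matrix operations. A sketch with hypothetical data in the same layout (design matrix with a column of 1s, then color and quality, plus a price vector), cross-checked against NumPy's least squares solver:

```python
import numpy as np

# Hypothetical design matrix (like E4:G14) and Y vector (like I4:I14).
X = np.array([[1.0, 9.0, 8.0],
              [1.0, 6.0, 4.0],
              [1.0, 4.0, 2.0],
              [1.0, 7.0, 5.0],
              [1.0, 5.0, 7.0],
              [1.0, 8.0, 3.0]])
Y = np.array([70.0, 45.0, 30.0, 52.0, 48.0, 55.0])

core = np.linalg.inv(X.T @ X)  # =MINVERSE(MMULT(TRANSPOSE(X),X))
B = core @ X.T @ Y             # =MMULT(core,MMULT(TRANSPOSE(X),Y))
Y_hat = X @ B                  # =MMULT(X,B)

# Cross-check: B agrees with NumPy's dedicated least squares routine.
assert np.allclose(B, np.linalg.lstsq(X, Y, rcond=None)[0])
```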

The standard error of each of the coefficients in *B* can be calculated as follows. First calculate the array of error terms *E* as C4:C14 – I4:I14. Then, just as in the simple regression case, *SS _{Res}* = DEVSQ(O4:O14) = 277.36, *df _{Res}* = *n* – *k* – 1 = 11 – 2 – 1 = 8 and *MS _{Res}* = *SS _{Res}*/*df _{Res}* = 34.67 (see Multiple Regression Analysis for more details).

By Property 2 it follows that *MS _{Res}*(*X*^{T}*X*)^{-1} is the covariance matrix for the coefficients, and so the square roots of the diagonal terms are the standard errors of the coefficients. In particular, the standard error of the intercept *b*_{0} (in cell K9) is given by the formula =SQRT(I17), the standard error of the color coefficient *b*_{1} (in cell K10) by the formula =SQRT(J18), and the standard error of the quality coefficient *b*_{2} (in cell K11) by the formula =SQRT(K19).
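Putting the standard-error calculation together in matrix form, again with hypothetical data of the same shape (residuals, *SS _{Res}*, *MS _{Res}*, then the square roots of the diagonal of *MS _{Res}*(*X*^{T}*X*)^{-1}):

```python
import numpy as np

# Hypothetical design matrix (column of 1s, color, quality) and Y.
X = np.array([[1.0, 9.0, 8.0],
              [1.0, 6.0, 4.0],
              [1.0, 4.0, 2.0],
              [1.0, 7.0, 5.0],
              [1.0, 5.0, 7.0],
              [1.0, 8.0, 3.0]])
Y = np.array([70.0, 45.0, 30.0, 52.0, 48.0, 55.0])
n, k = X.shape[0], X.shape[1] - 1

core = np.linalg.inv(X.T @ X)
B = core @ X.T @ Y
E = Y - X @ B                  # residuals
ss_res = E @ E                 # SS_Res (= DEVSQ of the residuals,
                               #   since they sum to zero here)
ms_res = ss_res / (n - k - 1)  # MS_Res = SS_Res / df_Res

cov_B = ms_res * core          # covariance matrix of the coefficients
se_B = np.sqrt(np.diag(cov_B)) # standard errors of b0, b1, ..., bk
```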

**Real Statistics Excel Functions**: The Real Statistics Resource Pack contains the following supplemental array functions. Here R1 is an *n* × *k* array containing the *X* sample data and R2 is an *n* × 1 array containing the *Y* sample data.

**DESIGN**(R1) = *n* × (*k*+1) design matrix

**HAT**(R1) = *n × n* hat matrix

**CORE**(R1) = (*k*+1) × (k+1) matrix (*X*^{T}*X*)^{-1} which makes up the core of the hat matrix

**RegCov**(R1, R2) = (*k*+1) × (*k*+1) matrix *MS _{Res}*(*X*^{T}*X*)^{-1}, which is the covariance matrix for the regression coefficients

**RegCoeff**(R1, R2) = (*k*+1) × 1 coefficient vector *B*

**RegCoeffSE**(R1, R2) = (*k*+1) × 1 vector of the standard errors of the coefficient vector *B*

**RegY**(R1, R2) = *n* × 1 column vector of predicted values *Y*-hat for *Y*

**RegE**(R1, R2) = *n* × 1 vector of residuals

See also the list of supplemental functions in Multiple Regression Analysis.


Hi,

Sir, how did you find 20.5 = 5.80b1 – 2.10b2 from Figure 1 by using the equation

cov(y,x1) = b1 cov(x1,x1) + b2 cov(x2,x1)?

You have only written one equation, but there are two equations, not just one: two linear equations in two unknowns. The coefficients b1 and b2 are the unknowns; the values for cov(y,x1), cov(x1,x2), etc. are known (they can be calculated from the sample data values). Normal algebra can be used to solve two equations in two unknowns. With more equations and more unknowns you can still use algebra, but you can also use the techniques shown elsewhere on the site.

Charles
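To make the two-equations-in-two-unknowns point concrete, here is a sketch. The first row's numbers (20.5 = 5.80·b1 – 2.10·b2) come from the question above; the second row's values (7.00 and 16.03) are placeholders, chosen only so that the system reproduces the article's solution b1 = 4.90, b2 = 3.76, since the actual second equation is not quoted here:

```python
import numpy as np

# Normal equations, one per independent variable:
#   cov(y, x1) = b1*cov(x1, x1) + b2*cov(x2, x1)
#   cov(y, x2) = b1*cov(x1, x2) + b2*cov(x2, x2)
A = np.array([[5.80, -2.10],
              [-2.10, 7.00]])  # cov(x_j, x_m); second row hypothetical
c = np.array([20.5, 16.03])    # cov(y, x_m);   16.03 hypothetical

b1, b2 = np.linalg.solve(A, c)
print(round(b1, 2), round(b2, 2))  # 4.9 3.76
```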