Weighted Linear Regression

Basic Concepts

Given a set of n points (x11, …, x1k, y1), …, (xn1, …, xnk, yn), in ordinary least squares (OLS) the objective is to find coefficients b0, …, bk so as to minimize

SSE = e1^2 + e2^2 + ⋯ + en^2

where ei = yi − ŷi and ŷi = b0 + b1xi1 + ⋯ + bkxik

In weighted least squares, for a given set of weights w1, …, wn, we seek coefficients b0, …, bk so as to minimize

SSE = w1e1^2 + w2e2^2 + ⋯ + wnen^2

where, as before, ei = yi − ŷi and ŷi = b0 + b1xi1 + ⋯ + bkxik

In ordinary least squares, each point has equal weight, while in weighted least squares, points with higher weights contribute more than points with lower weights.

Using the same approach as is employed in ordinary least squares, we find that the coefficient vector B is given by

B = (XᵀWX)⁻¹XᵀWY

where W is the n × n diagonal matrix whose diagonal consists of the weights w1, …, wn.
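As an illustrative sketch (with made-up numbers, not data from this article), the closed-form solution B = (XᵀWX)⁻¹XᵀWY can be computed directly:

```python
import numpy as np

# Made-up data for illustration: 4 points, one predictor plus an intercept column
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])
w = np.array([1.0, 2.0, 1.0, 0.5])   # weights w1, ..., wn

W = np.diag(w)                        # n x n diagonal weight matrix
# Solve the weighted normal equations (X^T W X) B = X^T W y
B = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
```

`B[0]` is the intercept b0 and `B[1]` the slope b1. With all weights equal to 1 this reduces to ordinary least squares.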

As for ordinary multiple regression, we make the following definitions

[Table: definitions of SSReg, SSRes and SST, and the corresponding degrees of freedom, for weighted linear regression]

It follows that SST = SSReg + SSRes and dfT = dfReg + dfRes.

We don’t really need to come up with new techniques for performing weighted linear regression since we can use ordinary least squares on the transformed data (√w1∙x11, …, √w1∙x1k, √w1∙y1), …, (√wn∙xn1, …, √wn∙xnk, √wn∙yn); multiplying each point by √wi works because wi∙ei^2 = (√wi∙ei)^2. We now show how to use this approach to perform linear regression when the homogeneity of variance assumption is not met (aka heteroscedasticity or heteroskedasticity).
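A quick numerical check of this equivalence (random illustrative data): running OLS after scaling each point by √wi reproduces the weighted least squares coefficients, since wi∙ei^2 = (√wi∙ei)^2.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(6), rng.normal(size=6)])  # intercept column + one predictor
y = rng.normal(size=6)
w = rng.uniform(0.5, 2.0, size=6)                      # arbitrary positive weights

# Weighted least squares via the closed-form normal equations
W = np.diag(w)
b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Ordinary least squares on the data scaled by sqrt(w_i)
s = np.sqrt(w)
Xs, ys = X * s[:, None], y * s
b_ols, *_ = np.linalg.lstsq(Xs, ys, rcond=None)

# b_wls and b_ols agree
```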

Dealing with non-homogeneous variances

Suppose that we have data (x11, …, x1k, y1), …, (xn1, …, xnk, yn), where for each i, yi is the mean value of y for all samples where x1 = xi1, …, xk = xik, and σi is the standard deviation of these y values. Now we convert the usual linear regression model

yi = β0 + β1xi1 + ⋯ + βkxik + εi

into the equivalent form

yi/σi = β0/σi + β1xi1/σi + ⋯ + βkxik/σi + εi/σi

Since σi^2 = var(yi) = var(εi), it follows that var(εi/σi) = 1, and so the homogeneity of variance assumption (aka homoscedasticity) of this second model is met. This model has the form

yi* = β0xi0* + β1xi1* + ⋯ + βkxik* + εi*

where yi* = yi/σi, xij* = xij/σi and εi* = εi/σi

The coefficients of this model can be found using an OLS model without intercept (here xi0* = 1/σi).
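A small sketch with hypothetical grouped data (points chosen to lie exactly on y = 3 + 2x so the recovered coefficients are easy to check): the transformed design has columns 1/σi and xi/σi, and an OLS fit without intercept returns b0 and b1 of the original model.

```python
import numpy as np

# Hypothetical grouped data lying exactly on y = 3 + 2x, with differing sigmas
x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ybar  = 3.0 + 2.0 * x                      # group means of y
sigma = np.array([0.5, 0.8, 1.1, 1.5, 2.0])  # group standard deviations

# Transformed no-intercept design: x*_{i0} = 1/sigma_i, x*_{i1} = x_i/sigma_i
Xstar = np.column_stack([1.0 / sigma, x / sigma])
ystar = ybar / sigma

# OLS without intercept on the transformed data
b, *_ = np.linalg.lstsq(Xstar, ystar, rcond=None)
# b[0] is the intercept b0 and b[1] the slope b1 of the original model
```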

Example 1: A survey was conducted to compile data about the relationship between CEO compensation and company size. The summarized data from 200 respondents is shown in Figure 1. Create a regression model for this data and use it to predict the wages of a CEO for a company whose annual revenues are $200 million.


Figure 1 – Relationship between company size and CEO compensation

The companies were divided into eight bands, as shown in columns A through C of Figure 1: band 1 consists of companies whose revenues are between $2 million and $25 million, while band 8 consists of companies with revenues between $5 billion and $10 billion. The mean wages for the CEOs in each band are shown in column F with the corresponding standard deviations shown in column G.

Our goal is to build a regression model of the form

wages = b0 + b1 ∙ LN(mean company size)

where LN(mean company size) for the 8 bands are shown in column D of Figure 1. E.g. the value in cell D5 is calculated by the formula =LN(AVERAGE(B5,C5)).

Note that the standard deviations, and therefore the variances, for the different bands are quite different, and so we decide not to use an OLS regression model, but instead we use an OLS model without intercept as described above, namely

wages/stdev = b0/stdev + b1 ∙ LN(mean company size)/stdev

To do this, we create the transformed data shown in range F23:H31 of Figure 2. E.g. the values in cells F24, G24 and H24 are calculated respectively by the formulas =F5/G5, =1/G5 and =D5/G5.


Figure 2 – Dealing with heteroscedasticity using a model without intercept

We next use the Real Statistics Multiple Linear Regression data analysis tool with the without intercept option on the input data in range F23:H31, as described in section 19.14, to obtain the regression analysis shown on the right side of Figure 2. This model takes the form

wages/stdev = -100.846/stdev + 126.8453 ∙ LN(mean company size)/stdev

or equivalently

wages = -100.846 + 126.8453 ∙ LN(mean company size)

Thus, the predicted average wages of a CEO in a company with $200 million in revenues is

wages = -100.846 + 126.8453 ∙ LN(200) = 571.221

Thus, a CEO for a company with $200 million in revenues is estimated to earn $571,221 in wages.
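The arithmetic can be checked in one line (LN is the natural logarithm):

```python
import math

# Predicted wages (in $thousands) for a company with $200 million in revenues
wages = -100.846 + 126.8453 * math.log(200)
# wages ≈ 571.221
```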

Note that if instead of the transformation described above, we had performed the usual OLS regression, we would have calculated coefficients of b0 = -204.761 and b1 = 149.045, which would have resulted in an estimate of $429,979 instead of $571,221.

We can simplify this process by using a weighted regression model where the weights are the reciprocals of the variances of the bands, as shown in column H of Figure 1 – e.g. the weight for band 1 is 0.00027321 (cell H5) as calculated by the formula =1/G5^2.
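As a sketch on hypothetical band data (not the survey values), weighted regression with weights 1/σi² produces exactly the same coefficients as the no-intercept fit on the σ-divided data:

```python
import numpy as np

# Hypothetical band data: predictor, mean response, band standard deviation
x     = np.array([1.0, 2.0, 3.0, 4.0])
ybar  = np.array([2.0, 4.1, 5.9, 8.3])
sigma = np.array([0.4, 0.9, 1.3, 2.1])

w = 1.0 / sigma**2                     # reciprocal-variance weights (like =1/G5^2)
W = np.diag(w)
X = np.column_stack([np.ones_like(x), x])
b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ ybar)

# Same fit via OLS without intercept on the sigma-divided data
Xstar = np.column_stack([1.0 / sigma, x / sigma])
b_ols, *_ = np.linalg.lstsq(Xstar, ybar / sigma, rcond=None)
# b_wls and b_ols agree
```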

Real Statistics Data Analysis Tool: This type of regression can be performed by the Weighted Linear Regression data analysis tool as described below.

Example 2: Perform the regression for Example 1 using Real Statistics’ Weighted Linear Regression data analysis tool.

Press Ctrl-m and double-click on the Regression option in the dialog box that appears. Next, select Weighted Linear Regression from the list of options. Fill in the dialog box that appears as shown in Figure 3 (with references to Figure 1).


Figure 3 – Weighted Linear Regression dialog box

Upon clicking on the OK button, the output shown in Figure 4 appears. Note that this output is very similar to that shown in Figure 2.


Figure 4 – Weighted Linear Regression

Real Statistics Function: The following array function computes the coefficients and their standard errors for weighted linear regression. Here R1 is an n × k array containing the X sample data, R2 is an n × 1 array containing the Y sample data and R3 is an n × 1 array containing the weights.

WRegCoeff(R1, R2, R3) = k × 2 range consisting of the regression coefficient vector followed by vector of standard errors of these coefficients

For example, the range K17:L18 contains the array formula

=WRegCoeff(D5:D12,F5:F12,H5:H12)

where references are to cells in Figure 1.

 

19 Responses to Weighted Linear Regression

  1. Goita says:

    I cannot get the “Weighted Linear Regression dialog box” when I press Ctrl-m?

    • Charles says:

      Goita,
      Click on the Reg tab.
      Charles

      • Robert says:

        Hi Charles. I’m having this same problem. I click on CTRL+m and only get the beep from Excel.

        • Charles says:

          Robert,
          When you press Alt-TI, do you see RealStats and Solver on the list of addins with check marks next to them?
          What do you see when you enter the formula =VER() in any cell?
          Charles

          • Robert says:

            Charles,
            Yes, RealStats and Solver are present and ticked on the list of addins, and when I enter =VER() in any cell I get the #NAME? after pressing enter.
            Robert

          • Charles says:

            Robert,
            Since =VER() is not working, Excel doesn’t recognize the RealStats software. Are you using Excel 2007? If so, check to make sure that you have stored the Real Statistics file in the correct place. See the instructions on the webpage from which you downloaded the Real Statistics file.
            Charles

          • Robert says:

            I thought it would be something like that, but I’m slightly confused. As both RealStats and Solver are on the Active Application Add-ins list. Also, the location I’ve used is the one stated on the webpage.

          • Robert says:

            Forgot to mention, I’m using Excel 2013.

          • Charles says:

            Robert,
            This is strange. If you have tried all the troubleshooting steps, then all I can suggest is that you delete the RealStats file and start all over. Close Excel. Open Excel and remove RealStats from the list of addins (to do this press Alt-TI and uncheck RealStats). If you get a message to delete RealStats from the list of addins, agree. If not, recheck RealStats and hopefully you will get this message.
            Once you have completed these steps, download the software from the website and repeat the installation.

  2. BP says:

    When I run Weighted Linear regression I get R and Rsq values that exceed 9.5. When I run linear regression, I get R values of the order of .3 . I would expect similar R values since when I run weighted correlation coefficients and unweighted correlation coefficients there is a small difference.

  3. wytek szymanski says:

    Figure 4 confuses me. I would expect only Ln(mean) but not Weight. Weight is not the intercept, right?

    Is there an option for robust std. errors with weighted linear regression?

    • Charles says:

      Wytek,
      Weight is one of the independent variables and not an intercept.
      Currently there is no robust errors option for the Weighted Linear Regression data analysis tool. There is a robust errors option for the Multiple Regression data analysis tool.
      Charles

      • wytek szymanski says:

        I am still confused. Figure 3 shows that the independent variables (X) are D4:D12 while the Weights are H4:H12 and are required to be assigned to a separate input box.

        I have imported the excel data into another tool (gretl) and the output indicates that the Weights is indeed the intercept.

        Here are the results:
                    coefficient    std. error    t-ratio    p-value
        const       −100.846       53.2965       −1.892     0.1073
        Lnmean       126.845       11.8291       10.72      3.88e-05 ***

        • Charles says:

          Wytek,
          The way that I implemented weighted regression I turned the problem into one with linear regression without intercept. I guess gretl implemented this in a different way. From what I understand, the results are the same.
          Charles

          • wytek szymanski says:

            Charles,
            Without intercept gretl produces this:
                        coefficient    std. error    t-ratio    p-value
            Lnmean      105.860        4.81325       21.99      1.01e-07 ***

            You may want to add an intercept checkbox to the dialog box above to see the difference.

            As you have correctly indicated above, the weights are a multiplier matrix W. Replace this W matrix with the identity matrix I and you get plain OLS.

          • Charles says:

            Wytek,
            Thanks for your suggestion. I’ll consider doing this at some point in the future.
            Charles

  4. Abel says:

    Hello Charles –

    Thank you for the explanations. I have a question around cell G (std dev) and how to compute it for a simple data set. Suppose we have for a planned regression of Y on X the following points. X=(1,2,3,4,5) and Y=(1.1,2.5,3.4,3.8,7). How would I calculate the standard deviation for each data point (in trying to determine the weights for each data point) in order to transform the model?

    Once I get this I understand how to arrive at the variance and hence the weights.

    Thank you and regards

    • Charles says:

      Abel,
      If each “group” consists of one data point, then the standard deviation would be undefined (division by zero) and so you couldn’t calculate weights in this way.
      Charles
