Total Least Squares

In ordinary linear regression, our goal is to find the equation for a straight line y = bx + a which best fits the data (x1, y1), …, (xn, yn). This results in values ŷi = bxi + a. The approach is to select values for a and b which minimize the following

Least squares

As we can see from Figure 1, this minimizes the sum of the distances squared (i.e. e2) only in the y direction.

Distance between line point

Figure 1 – Distance between a point and a line

The actual distance is actually shorter, as shown by d in Figure 1. Here (\hat x, \tilde y) is the point on the line y = bx + a that is closest to (x0, y0). Note that

perpendicular point

In total least squares regression, (aka orthogonal linear regression) we find the values of a and b that minimize the sum of the squared Euclidean distances from the points to the regression line (i.e. the d2). It turns out that this is equivalent to minimizing:

Euclidean distance squared

The value of b that minimizes this expression is given by

Slope TLS regression

where

TLS regression formulas

and  and ȳ are the means of the xi and yi values respectively. The intercept can now be expressed asIntercept coefficient TLS regressionExample 1: Repeat Example 1 of Least Squares using total least squares regression (the data are replicated in Figure 2).

The calculations are shown in Figure 2.

TLS regression coefficients

Figure 2 – Total Least Squares Regression

We see that the regression line based on total least squares is y = -0.83705x + 89.77211. This is as compared to the ordinary linear regression line y = -0.6282x + 85.72042.

In Figure 3, we graph the ordinary regression line (in blue) from Example 1 versus the regression line based on total least squares (in red).

TLS vs. OLS

Figure 3 – TLS (red) vs. OLS (blue)

Real Statistics Function: For array or range R1 containing x values and R2 containing y values, we have the following array functions.

TRegCoeff0(R1, R2, lab) = 2 × 1 column array consisting of the intercept and slope coefficients based on total linear regression using the data in R1 and R2.

If lab = TRUE (default FALSE), then an extra column is appended to the output from TRegCoeff containing the labels “intercept” and “slope”.

For Example 1, the output from =TRegCoeff0(A4:A18,B4:B18) is the same as shown in range E11:E12 of Figure 2.

Caution: The version of the TRegCoeff0 function in Rel 5.4.2 and earlier releases has a bug that will be corrected in the next release. In the meantime, please use the  formula TRegCoeff(R1, R2, lab), which will give the correct result.

14 Responses to Total Least Squares

  1. Jiayue Wang says:

    How to evaluate the goodness of the fit using Total Least Square Method? Is it in the same way with that of OLS, like R^2.

  2. Jaco Delport says:

    How did you manage to plot figure 3?
    Did you make use of the Real Statistics Resource Pack that is available on your website?
    Regards

    • Charles says:

      Jaco,
      No, I didn’t need to use the Real Statistics Resource Pack to create the plot, but I did use it to find the Total Least Squares regression coefficients.
      The plot is simply two scatter plots superimposed, including linear trendlines. The first scatter plot is for the data in columns A and B, while the second is for the data in columns Q and R.
      Charles

  3. Ryan Sheftel says:

    Thanks for this solution in Excel. I have a question: when I implement your solution I get the answer you provided where the slope for Lif Exp as a function of Cig is -0.83705. I would expect then that the slope for the Cig as a function of Life Exp would be the inverse or -1.19467, and that is verified as correct by using the PCA method.

    But when I reverse the inputs and copy the Life Exp numbers to the A column and Cig to the B column I get a slope of 0.83705. This does not seem correct, do you know if I am doing something wrong? Why does switching the dependent and independent variables not have the expected effect? Thanks.

    • Charles says:

      Ryan,
      I suggest that you perform the following experiment. Calculate the sum of the distances squared from each of the sample data points to the line y = -.83705x + 89.7721 as shown on the referenced webpage. Then do the same with the line y = -1.19467x + a (where a is the intercept that you believe is correct). If the sum calculated from this second line is smaller than that from the first line, then clearly I have made an error.
      Charles

      • Ryan Sheftel says:

        That works fine. My question is how come switching the X and Y inputs does not change the output? Are you able to get a result of -1.19467 for the beta when the inputs are switched, I cannot.

  4. Otto Hänninen says:

    Charles, thank you for your statistics lessons. Comprehensive, well rooted in simple demonstration how to do it in Excel. Excellent, this is among the best content I find online. Excellent work, very valuable and very much appreciated!

  5. Alex says:

    Hi Charles,
    Thank you for the great post!

    I tried the method with your example data, and I got the same result. Then I started plugging in my own data and it’s doing something weird.
    x->1,2,3,4,5,6,7,8,9,10
    y->0.98, 0.75, 0.78, 0.85, 0.89, 1.13, 1.26, 1.57, 1.94, 1.61

    While the ordinary regression would give me slope = 0.12 and intercept = 0.53, the Total Least Squares gives me slope = -8.46 and intercept = 47.71

    I don’t understand why it’s like this. Can you please help me? Thanks a lot!

    • Charles says:

      Alex,
      The slope and intercept values do seem very different from the values from ordinary regression.
      I see that if in calculating the slope I use the formula n = w + sign(u-v) * sqrt(w^2 + 1), i.e. I use sign(u-v) instead of sign(v-u), then I get a slope of .1182 and intercept of .5259, which are pretty close to the ordinary regression values. I will check this out.
      In the meantime, thanks very much for bringing this issue to my attention.
      Charles

      • Alex says:

        Thanks for your reply, Charles. I will play with it. Let me know if you figure this out. Thanks again.

        • Charles says:

          Alex,
          I have corrected the error on the webpage. The new version of the formula for the slope is equivalent to the previous version except that the sign problem before the square root term is now fixed.The TRegCoeff0 function still has the error. I will fix this shortly. In the meantime you can use the TRegCoeff function.
          Thanks again for your help.
          Charles

Leave a Reply

Your email address will not be published. Required fields are marked *