In this section we extend the concepts from Linear Regression to models which use more than one independent variable. We explore how to find the coefficients for these multiple linear regression models using the method of least square, how to determine whether independent variables are making a significant contribution to the model and the impact of interactions between variables on the model.

We show how to apply the techniques of multiple linear regression to polynomial models and to analysis of variables (ANOVA). We also review the impact of issues such as collinearity, autocorrelation, outliers and influencers on the models.

Topics:

- Method of Least Squares
- Multiple Regression Analysis
- Polynomial Regression
- Logarithmic Transformations
- Interaction
- ANOVA using Regression
- Unbalanced Factorial ANOVA
- Three Factor ANOVA using Regression
- Effect Size for ANOVA
- Residuals
- Outliers and Influencers
- Autocorrelation
- Collinearity
- Testing the significance of extra variables on the model
- Stepwise Regression
- Multiple Correlation
- Statistical Power and Sample Size for Multiple Regression
- Confidence Intervals of Effect Size and Power for Regression
- Multiple Regression without a Constant Term
- Weighted Multiple Linear Regression
- Robust Standard Errors and Heteroscedasticity
- Shapley-Owen Decomposition
- Least Absolute Deviation (LAD) Regression
- Total Least Squares (TLS) Regression

Dear Charlse

Your tools help me a lot. Thank you!

I have a question regarding stepwise regression.

I used the stepwise regression option to determine which variables to use in the final model.

I have 10 variables in total, and I used the option, which concluded that there are only two variables that are statistically significant.

The thing is that, the p-values given in the first step in the output range is not identical with the p-values I get from a separate multiple linear regression using exactly the same variables and data.

To help you understand, I provide the p-values I got below.

P-values in the first step of stepwise regression:

0.027634262

0.83288323

0.007923924

0.218299547

6.31971E-21

0.790620417

0.627783482

0.866607918

0.068914612

P-vlues in a separate multiple linear regression:

0.502954508

0.773073026

0.314536192

0.563318878

0.000627492

2.73835E-23

0.434534911

0.57708841

0.431243818

0.729185186

According to the stepwise regression, I think the first and the third variables should’ve appeared statistically significant (p<0.05).

As you can see, however, that was not the case.

Why do the p-values differ?

Thank you,

Juyoung

Juyoung,

I can’t explain what is going on without seeing the data. If you send me an Excel file with your data and analysis, I will try to figure out what is happening. You can send the file to my email address as specified at Contact Us.

Charles

Dear Charles

Thank you for your wonderful website.

I have a question about the result obtained from the weighted multiple linear regression.

In the auto-generated ANOVA table, I found that the degree of freedom for regression is 2 from your programme. When I used others statistic softwares for conducting weighted regression, the degree of freedom is 1 instead . However, the coefficient of slope and y-intercept and its std errors are all the same from both your programme and others for the same data set.

Since both degree of freedoms are different, the SS and the R square are eventually different. Your programme can give a higher R square.

May i know what cause the difference?

thank you so much.

Regards

Tom

Tom,

These are different versions of weighted regression. Some packages offer both versions. Sometime in the future I plan to offer the version that you are referring to.

Charles

Hello Charles,

I’m interested in doing a stepwise regression approach to build a regression model that best predicts Earnings from golfers, but not sure which type of regression to use…ie linear, multiple, ANOVA, etc. I’ll choose 3 indicators as well. Here is some sample data…

Player Earnings Yards/Drive Age Greens in Regulation Putting Average Eagles

Xname 6,683,215 284.1 34 67.3 1.7 3

xname2 6,860,005 290 34 66 1.7 1

There is more data, but just wanted to understand which regression to choose using Real Statistics tool or if any manual excel formulas if the tool can’t do it.

Please share your thoughts?

I meant to adjust the format to shorten the fields

Player Earnings Yards/Drive Age Greens in Reg Put Avg Eagles

Xname 6,683,215 284.1 34 67.3 1.7 3

xname2 6,860,005 290 34 66 1.7 1

Jamel,

Although which analysis tool to use depends on your objectives, this does sound like a multiple linear regression problem. See the following webpage for details about how to perform stepwise regression.

Stepwise Regression

Charles

Thank you very much! That confirmed what I was doing by using multiple regression.

I am enjoying the site – thanks. I thought I would add a little info about using the Regression tools in Excel. I started using this technique back in Lotus 1-2-3 in the 80s and Excel works the same way. I frequently need to use 5th order polynomial to 7th order polynomial curve fitting to more accurately represent data than any multiple linear model will do. In order to do this I take my Y variable and it is single value I will relative to the value I want modeled and the X variable will be the X value and then the next column will be the X^2 and the next column is X^3, on up to however high of an order I need. Then I run the regression and it gives me the coefficients and I can calculate the new model numbers with the basic equation – aX^5 + bX^4+ cX^3 + dX^2 + eX + f. Using the original X I can calculate an extremely accurate curve fit that is not linear. I have also used a similar process and have one of the cells be a sin function or a log function and create a sinusoidal or logarithmic dampened curve to fit data that fits that kind of profile. I will generally play with the function until I get a high enough R squared. I hope this can help:).

George,

Thanks for sharing your insights. I plan to add a new function to the Real Statistics Resource Pack in the next release which should help determine the order of the polynomial to use.

Charles

Hello charles,

I have 10 independent variables but I also want to use a certain coefficients to 5 of them and then run multiple regression analysis to determine the coefficient of the rest . can I do that ? using excl or STATA

thank you for your help

Nada,

You can do this using Excel’s Solver. E.g. see how it is done for Exponential Regression of Logistic Regression.

Charles

Hi,

Thanks for all these great explanations.

I might have completely missed it but I don’t understand how you come up with the weights in a weighted linear regression.

John,

I show how to use the standard deviations to come up with weights, but in general you need to come up with weights based on factors about the scenario that you are trying to model. For more information, see Weighted Regression.

Charles

Dear Charles,

I have a sample of 30 to measure the factors constraining to the adoption of technology. I am thinking to run regression analysis to adoption rate ( if it is more that 50% considered as 1 and less than 50 is 0 taking 50 percent adoption as threshold limit).

to measure the constraining factors , I used the 5 point likert scale. (highly significant to least significant) and already extracted important variables using principal component analysis.

Now what kind of regression analysis should I use to measure the relative importance of each factors ?. Linear or multiple ?

Nirosha

Nirosha,

When you say “multiple” I assume that you mean “multiple linear regression”, which just means that you have more than one independent variable. When you have only one independent variable often the term “linear regression” or “simple linear regression” is used. Since you say that you have multiple factors, you would often use multiple linear regression.

Since your outcome (dependent variable) could be viewed as dichotomous (0 or 1), you might find that logistic regression gives a better fit for the data. You can compare AIC values for this.

Charles

Dear Charles,

thank you very much. I have one independent variable named as adoption rate ( less than 50 is 0 and higher than 1). Indeed, I have to measure the relative importance of 8 factors which affect to adoption rate which has measured using likert scale.

According to you, I think i must used linear regression. if so, which indicator should I used to measure the relative importance? is that coefficient value suitable to measure relative importance of factors.

can i use SPSS to run logistic regression?

Hi Charles,

Your compilation on regression analysis is very extensive and impressive. I wonder if you would like to extend it a little more by including model selection and regularization. That is Ridge Regression, LASSO, Bias-Variance trade off, and other techniques that will help fit a models well enough to make believable predictions.

Thanks Wytek, I will add this to my list of enhancements.

Charles

Hello Charles,

Thank you for this amazing resource. I was wondering if you could help me. I want to understand whether frequency of visit to a particular store for a person depends on quality, variety, Prices of products or location of the store(apparel). Can i build a regression model using dependent varibal as frequency of visit (low or high) based on ratings on a scale of 1 to 5 for independent variable like quality, price, variety or location? Which model will be useful to prove such hypothesis?

I appreciate your help!

Thanks in advance

Samantha

Samantha,

Regression requires that the independent variables be continuous, but Likert scale values are commonly used as independent variables, provided it is reasonable to assume that the distance between scale values are equal (e.g. for the quality variable, a the difference in quality between 3 and 2 is the same as the difference between 5 and 4). A Likert scale of 1 to 5 is ok, 1 to 7 would be better. If the independent variable is not ordered, then you are better off using dummy variables to code the independent variable (as explained on the Real Statistics website).

Charles

Hi Charles,

I am wondering if you can help me. I have measured three parameters (x,y, and z) that are all dependent on each other. I am trying to formulate a method to predict x by measuring only y and z. I have done the regressions between x-y and x-z individually, but I am wondering if there is a way to perform a regression to predict x from y and z simultaneously that will strengthen the correlation. Is this even possible? Or does the fact that there is no independent variable make this nonsensical?

Thanks,

Kevin,

Yes, you use multiple regression, as explained on the referenced webpage, where you assume that y and z are the independent variables and x is the dependent variable.

Charles

Charles:

Developing a VBA code for Partial Least Squares Regression (PLS Regression) is part of your future plans? It could be a very interesting tool in Real Statistics.

Thank you.

William Agurto.

William,

After the next release plan to turn my attention back to regression. I hope to add partial least squares regression as part of this process.

Charles

How to Calculate: Coefficient of Regression, Standard Error and t-Value when we are having more than 2 Independent Variables, and 1 Dependent Variable? Please Guide.

This is described on some of the webpages listed on the referenced webpage. In particular, see the webpage

http://www.real-statistics.com/multiple-regression/multiple-regression-analysis/multiple-regression-analysis-excel/

Charles

Hello Charles,

My dependent variable is “Returns of the Stock in %” and my independent variables are factors that affect the Price of the Stock like “% Change in Net Profit to Sales of the Company, Inflation Rate”

When I run the Regression Analysis in the Excel, I get disastrous results. For instance, my r-square is 0.0132 and F value is 0.16 I don’t seem to understand where I could be going wrong. Could you please help me?

If you could provide me your email ID in the comment, I shall forward you the excel to look at my data.

Thanks in advance!

You can send the Excel worksheet to the email address in the Contact Us menu.

Charles

Hi! I am supposed to identify a multiple linear regression model to predict one variable and then discuss the best model. What kind of method is appropriate for these?

Thank you so much! 🙂

The appropriate model is “multiple linear regression”.

Charles

I have one dependent and three independent variable. Sir, help me to calculate my raw scores on excel or guide me with example on excel. W8ng 4 ur +ve reply

Sorry, but I don’t understand your question.

Charles

I am using logistic regression. When I run it the first time, everything looks fine. However, when I run the program second time on the same input data set, it shows runtime error. Also when I copy the data set to new sheet, it runs fine only the first time. Could you please help?

That is very strange and is the first time I have heard of this problem. It is important that the second time you run the program you aren’t trying to overwrite the results of the first run. Make sure that your output range the second time does not overlap the output from the first run.

If this is not the problem, then if you send me the spreadsheet with your data I will try to figure out where the problem is.

Charles

Rich,

This looks to be a great tool, but it does not seem to address the need I was looking for. Specifically, calculating confidence intervals and prediction intervals for multiple regressions. Is there something I’m missing? Thanks,

Tom

Tom,

You are correct. I explain confidence intervals and prediction intervals for linear regressions with one independent variable, but not where there are multiple independent variables. I will add this shortly.

Charles

Tom,

I have just added information about calculating confidence intervals and prediction intervals for multiple regression. See http://www.real-statistics.com/multiple-regression/confidence-and-prediction-intervals/

I plan to add these capabilities to the software in the next release.

Charles

Hi, Charles

Saw all the new tools regarding power and sample size you just added. Wondered if there might be similar discussion or tool(s) for regression studies at some future time?

Regards,

Rich

Rich,

I plan to add additional power and sample size capabilities, including for regression.

Charles

Hi, Charles

Let’s suppose I have determined a multiple regression model to be used in predicting a new sample’s values. After I have collected all the actual results from the new sample, I want to analyze these actual results with the predictions from the model… perhaps looking to continue or to scrap this particular model.

Is there a minimum suite of your tools to use to determine how successful(?) the model was with its predictions and what actually transpired? Or a trusted easy-to-understand reference?

Thanks, Rich

Rich,

I have touched on some of these issues in the website; especially regarding residuals for multiple regression and ROC for logistic regression. I plan to delve more deeply into risk assessment in the future. At this time I don’t have an easy-to-understand reference to recommend, but I will try to provide you with one once I focus more on this topic.

Charles

Using the new release that allows output range to be a new sheet.

Tried to do multiple regression outlier analysis using the Cooks Distance tool.

Keep getting Runtime error 424, object required.

Tried specifying a specific cell in the output range, and also erased the value that was there leaving the field blank. Either way gets the error popup.

Any ideas?

Rich,

Sorry that the tool didn’t work properly. I checked and the problem occurs even when you place the output on the same page as the input. I have now corrected the problem. If you download the Real Statistics Resource Pack (Release 1.7.2) now you should have a version that works correctly. If not, please let me know.

Charles