In this section we extend the concepts from Linear Regression to models which use more than one independent variable. We explore how to find the coefficients for these multiple linear regression models using the method of least square, how to determine whether independent variables are making a significant contribution to the model and the impact of interactions between variables on the model.

We show how to apply the techniques of multiple linear regression to polynomial models and to analysis of variables (ANOVA). We also review the impact of issues such as collinearity, autocorrelation, outliers and influencers on the models.

Topics:

- Method of Least Squares
- Multiple Regression Analysis
- Polynomial Regression
- Logarithmic Transformations
- Interaction
- ANOVA using Regression
- Unbalanced Factorial ANOVA
- Three Factor ANOVA using Regression
- Effect Size for ANOVA
- Residuals
- Outliers and Influencers
- Autocorrelation
- Collinearity
- Testing the significance of extra variables on the model
- Stepwise Regression
- Multiple Correlation
- Statistical Power and Sample Size for Multiple Regression
- Confidence Intervals of Effect Size and Power for Regression
- Multiple Regression without a Constant Term
- Weighted Multiple Linear Regression
- Robust Standard Errors and Heteroscedasticity
- Shapley-Owen Decomposition
- Least Absolute Deviation (LAD) Regression
- Total Least Squares (TLS) Regression

Dear Charles,

First of all, thank you for being such a great resource of statistics knowledge.

I have a question regarding moderated multiple regression analysis. I have managed to confirm that the variable ‘psychological resources during career change’ moderates the relationship between ‘future identity'(IV) and ‘proactive career behaviours’ (DV). The moderator ‘psychological resources for career change’ comprises 5 different sub-scales (readiness, confidence, locus of control, social support, and autonomy of decision making) with their own validity. I would like to take my analysis further and evaluate which of these 5 above-mentioned sub-scales has the greatest effect on the relationship between independent variable (future identity) and outcome (proactive career behaviour). Would you know how and what I should look into?

Thank you in advance!

Warm regards,

Maria

Maria,

I can’t say whether the following approach is the correct one for your situation, but a good way to identify which variables are most import is to use the Shapley-Owen Decomposition, which is described at

Shapley-Owen Decomposition

Charles

Thank you very much, Charles. I will have a look.

Regards,

Marija

Dear charles,

can you please explain how to find independent variables with dependent variable in multiple regression

Vinni,

Sorry, but I don’t understand your question.

Charles

Hi there,

my study is about the relationship between Action-centered leadership and team performance

I have used a five scale questionnaire. My difficulty is to identify dependent and independent variables

Ramos,

It sounds like team performance is the dependent variable and action-centered leadership is the independent variable. But this really depends on the details.

Charles

Dear Charles

Your website has aided me greatly with understanding statistics! However I still encounter some problems with my dissertation…

I need to research if X1 has an increasing influence on Y if the educational level rises. Dito for X2. I’m also interested in which X has the greater influence on Y.

The difficulty I encountered is the fitting of the proper model… I’m wondering if this is a regression model or a Three way ANOVA? I learned to execute both models from your website!

Kind regards

Joe,

I am not able to determine this from your description, but it is more likely to be regression. With regression you can use Shapley-Owen decomposition to determine which independent variables are more important. This is described on the Real Statistics website.

Charles

Pingback: Weighting Correlation Co-Efficients to measure performance

Hello Charles,

I’m working on a simulation test to forecast quarterly revenues impacted by several market decision making variables, such as large customer marketing budget, training, product improvement, price, and amount of sales personnel. However, I’m not sure if I should use multiple regression analysis or time series forecasting. The goal is to start with data from the previous quarter revenue, let’s say sarting with $3 million revenue, then use that data to predict the next four quarters, and forecast which variables to increase that could generate a trend towards $7million by the 4th quarter. Which regression model or forecasting model should I use in Real statistics for this please?

James,

I would need more information to say for sure, but this sounds like time series forecasting probably with seasonality.

See the following webpage for details: http://www.real-statistics.com/time-series-analysis/

Charles

Thank you Charles..I’ll read more on time series. I was leaning more towards that one. AS far as more information…

I believe using a marketing mix model of to analyze what lead indicators or drivers should I focus on increasing to forecast more sales, market share, and revenue per quarter for maybe the next 2 years for a marketing decision simulation.

Example: We sell one product. If customer A is my largest customer that pays high premium for Power, customer B pays high premium for Temperature and Power, Customer C only cares about Temperature and low price, and D only cares about lower price. They all care about sales support when they buy the product.

If I start out the quarter with $2million in revenue, what will happen if I increase budget for Sales support Training by X only on the highest paying customers, increase marketing budget by X to only largest, and increase product development budget by X to hopefully forecast an increase in the revenue overall and market share for the next quarter and so forth.

Y=B0+b1(x1)+b2(x2)+b3(x3)…etc. I was wondering if time series would help forecast my goal to reach 7 to 8 million? which model can do this for me in real statistics or in excel. If I can do this easily with the tools that would help. Does this sound feasible and attainable through the tools?

James,

People do build forecast models of the type that you are describing using regression and time series approaches. You have listed a number of ideas for parameters that you would like to tweak, but “the devil is in the details”. Which approaches to use depend on these details.

Charles

Dear Charlse

Your tools help me a lot. Thank you!

I have a question regarding stepwise regression.

I used the stepwise regression option to determine which variables to use in the final model.

I have 10 variables in total, and I used the option, which concluded that there are only two variables that are statistically significant.

The thing is that, the p-values given in the first step in the output range is not identical with the p-values I get from a separate multiple linear regression using exactly the same variables and data.

To help you understand, I provide the p-values I got below.

P-values in the first step of stepwise regression:

0.027634262

0.83288323

0.007923924

0.218299547

6.31971E-21

0.790620417

0.627783482

0.866607918

0.068914612

P-vlues in a separate multiple linear regression:

0.502954508

0.773073026

0.314536192

0.563318878

0.000627492

2.73835E-23

0.434534911

0.57708841

0.431243818

0.729185186

According to the stepwise regression, I think the first and the third variables should’ve appeared statistically significant (p<0.05).

As you can see, however, that was not the case.

Why do the p-values differ?

Thank you,

Juyoung

Juyoung,

I can’t explain what is going on without seeing the data. If you send me an Excel file with your data and analysis, I will try to figure out what is happening. You can send the file to my email address as specified at Contact Us.

Charles

Dear Charles

Thank you for your wonderful website.

I have a question about the result obtained from the weighted multiple linear regression.

In the auto-generated ANOVA table, I found that the degree of freedom for regression is 2 from your programme. When I used others statistic softwares for conducting weighted regression, the degree of freedom is 1 instead . However, the coefficient of slope and y-intercept and its std errors are all the same from both your programme and others for the same data set.

Since both degree of freedoms are different, the SS and the R square are eventually different. Your programme can give a higher R square.

May i know what cause the difference?

thank you so much.

Regards

Tom

Tom,

These are different versions of weighted regression. Some packages offer both versions. Sometime in the future I plan to offer the version that you are referring to.

Charles

Hello Charles,

I’m interested in doing a stepwise regression approach to build a regression model that best predicts Earnings from golfers, but not sure which type of regression to use…ie linear, multiple, ANOVA, etc. I’ll choose 3 indicators as well. Here is some sample data…

Player Earnings Yards/Drive Age Greens in Regulation Putting Average Eagles

Xname 6,683,215 284.1 34 67.3 1.7 3

xname2 6,860,005 290 34 66 1.7 1

There is more data, but just wanted to understand which regression to choose using Real Statistics tool or if any manual excel formulas if the tool can’t do it.

Please share your thoughts?

I meant to adjust the format to shorten the fields

Player Earnings Yards/Drive Age Greens in Reg Put Avg Eagles

Xname 6,683,215 284.1 34 67.3 1.7 3

xname2 6,860,005 290 34 66 1.7 1

Jamel,

Although which analysis tool to use depends on your objectives, this does sound like a multiple linear regression problem. See the following webpage for details about how to perform stepwise regression.

Stepwise Regression

Charles

Thank you very much! That confirmed what I was doing by using multiple regression.

I am enjoying the site – thanks. I thought I would add a little info about using the Regression tools in Excel. I started using this technique back in Lotus 1-2-3 in the 80s and Excel works the same way. I frequently need to use 5th order polynomial to 7th order polynomial curve fitting to more accurately represent data than any multiple linear model will do. In order to do this I take my Y variable and it is single value I will relative to the value I want modeled and the X variable will be the X value and then the next column will be the X^2 and the next column is X^3, on up to however high of an order I need. Then I run the regression and it gives me the coefficients and I can calculate the new model numbers with the basic equation – aX^5 + bX^4+ cX^3 + dX^2 + eX + f. Using the original X I can calculate an extremely accurate curve fit that is not linear. I have also used a similar process and have one of the cells be a sin function or a log function and create a sinusoidal or logarithmic dampened curve to fit data that fits that kind of profile. I will generally play with the function until I get a high enough R squared. I hope this can help:).

George,

Thanks for sharing your insights. I plan to add a new function to the Real Statistics Resource Pack in the next release which should help determine the order of the polynomial to use.

Charles

Hello charles,

I have 10 independent variables but I also want to use a certain coefficients to 5 of them and then run multiple regression analysis to determine the coefficient of the rest . can I do that ? using excl or STATA

thank you for your help

Nada,

You can do this using Excel’s Solver. E.g. see how it is done for Exponential Regression of Logistic Regression.

Charles

Hi,

Thanks for all these great explanations.

I might have completely missed it but I don’t understand how you come up with the weights in a weighted linear regression.

John,

I show how to use the standard deviations to come up with weights, but in general you need to come up with weights based on factors about the scenario that you are trying to model. For more information, see Weighted Regression.

Charles

Dear Charles,

I have a sample of 30 to measure the factors constraining to the adoption of technology. I am thinking to run regression analysis to adoption rate ( if it is more that 50% considered as 1 and less than 50 is 0 taking 50 percent adoption as threshold limit).

to measure the constraining factors , I used the 5 point likert scale. (highly significant to least significant) and already extracted important variables using principal component analysis.

Now what kind of regression analysis should I use to measure the relative importance of each factors ?. Linear or multiple ?

Nirosha

Nirosha,

When you say “multiple” I assume that you mean “multiple linear regression”, which just means that you have more than one independent variable. When you have only one independent variable often the term “linear regression” or “simple linear regression” is used. Since you say that you have multiple factors, you would often use multiple linear regression.

Since your outcome (dependent variable) could be viewed as dichotomous (0 or 1), you might find that logistic regression gives a better fit for the data. You can compare AIC values for this.

Charles

Dear Charles,

thank you very much. I have one independent variable named as adoption rate ( less than 50 is 0 and higher than 1). Indeed, I have to measure the relative importance of 8 factors which affect to adoption rate which has measured using likert scale.

According to you, I think i must used linear regression. if so, which indicator should I used to measure the relative importance? is that coefficient value suitable to measure relative importance of factors.

can i use SPSS to run logistic regression?

Hi Charles,

Your compilation on regression analysis is very extensive and impressive. I wonder if you would like to extend it a little more by including model selection and regularization. That is Ridge Regression, LASSO, Bias-Variance trade off, and other techniques that will help fit a models well enough to make believable predictions.

Thanks Wytek, I will add this to my list of enhancements.

Charles

Hello Charles,

Thank you for this amazing resource. I was wondering if you could help me. I want to understand whether frequency of visit to a particular store for a person depends on quality, variety, Prices of products or location of the store(apparel). Can i build a regression model using dependent varibal as frequency of visit (low or high) based on ratings on a scale of 1 to 5 for independent variable like quality, price, variety or location? Which model will be useful to prove such hypothesis?

I appreciate your help!

Thanks in advance

Samantha

Samantha,

Regression requires that the independent variables be continuous, but Likert scale values are commonly used as independent variables, provided it is reasonable to assume that the distance between scale values are equal (e.g. for the quality variable, a the difference in quality between 3 and 2 is the same as the difference between 5 and 4). A Likert scale of 1 to 5 is ok, 1 to 7 would be better. If the independent variable is not ordered, then you are better off using dummy variables to code the independent variable (as explained on the Real Statistics website).

Charles

Hi Charles,

I am wondering if you can help me. I have measured three parameters (x,y, and z) that are all dependent on each other. I am trying to formulate a method to predict x by measuring only y and z. I have done the regressions between x-y and x-z individually, but I am wondering if there is a way to perform a regression to predict x from y and z simultaneously that will strengthen the correlation. Is this even possible? Or does the fact that there is no independent variable make this nonsensical?

Thanks,

Kevin,

Yes, you use multiple regression, as explained on the referenced webpage, where you assume that y and z are the independent variables and x is the dependent variable.

Charles

Charles:

Developing a VBA code for Partial Least Squares Regression (PLS Regression) is part of your future plans? It could be a very interesting tool in Real Statistics.

Thank you.

William Agurto.

William,

After the next release plan to turn my attention back to regression. I hope to add partial least squares regression as part of this process.

Charles

How to Calculate: Coefficient of Regression, Standard Error and t-Value when we are having more than 2 Independent Variables, and 1 Dependent Variable? Please Guide.

This is described on some of the webpages listed on the referenced webpage. In particular, see the webpage

http://www.real-statistics.com/multiple-regression/multiple-regression-analysis/multiple-regression-analysis-excel/

Charles

Hello Charles,

My dependent variable is “Returns of the Stock in %” and my independent variables are factors that affect the Price of the Stock like “% Change in Net Profit to Sales of the Company, Inflation Rate”

When I run the Regression Analysis in the Excel, I get disastrous results. For instance, my r-square is 0.0132 and F value is 0.16 I don’t seem to understand where I could be going wrong. Could you please help me?

If you could provide me your email ID in the comment, I shall forward you the excel to look at my data.

Thanks in advance!

You can send the Excel worksheet to the email address in the Contact Us menu.

Charles

Hi! I am supposed to identify a multiple linear regression model to predict one variable and then discuss the best model. What kind of method is appropriate for these?

Thank you so much! 🙂

The appropriate model is “multiple linear regression”.

Charles

I have one dependent and three independent variable. Sir, help me to calculate my raw scores on excel or guide me with example on excel. W8ng 4 ur +ve reply

Sorry, but I don’t understand your question.

Charles

I am using logistic regression. When I run it the first time, everything looks fine. However, when I run the program second time on the same input data set, it shows runtime error. Also when I copy the data set to new sheet, it runs fine only the first time. Could you please help?

That is very strange and is the first time I have heard of this problem. It is important that the second time you run the program you aren’t trying to overwrite the results of the first run. Make sure that your output range the second time does not overlap the output from the first run.

If this is not the problem, then if you send me the spreadsheet with your data I will try to figure out where the problem is.

Charles

Rich,

This looks to be a great tool, but it does not seem to address the need I was looking for. Specifically, calculating confidence intervals and prediction intervals for multiple regressions. Is there something I’m missing? Thanks,

Tom

Tom,

You are correct. I explain confidence intervals and prediction intervals for linear regressions with one independent variable, but not where there are multiple independent variables. I will add this shortly.

Charles

Tom,

I have just added information about calculating confidence intervals and prediction intervals for multiple regression. See http://www.real-statistics.com/multiple-regression/confidence-and-prediction-intervals/

I plan to add these capabilities to the software in the next release.

Charles

Hi, Charles

Saw all the new tools regarding power and sample size you just added. Wondered if there might be similar discussion or tool(s) for regression studies at some future time?

Regards,

Rich

Rich,

I plan to add additional power and sample size capabilities, including for regression.

Charles

Hi, Charles

Let’s suppose I have determined a multiple regression model to be used in predicting a new sample’s values. After I have collected all the actual results from the new sample, I want to analyze these actual results with the predictions from the model… perhaps looking to continue or to scrap this particular model.

Is there a minimum suite of your tools to use to determine how successful(?) the model was with its predictions and what actually transpired? Or a trusted easy-to-understand reference?

Thanks, Rich

Rich,

I have touched on some of these issues in the website; especially regarding residuals for multiple regression and ROC for logistic regression. I plan to delve more deeply into risk assessment in the future. At this time I don’t have an easy-to-understand reference to recommend, but I will try to provide you with one once I focus more on this topic.

Charles

Using the new release that allows output range to be a new sheet.

Tried to do multiple regression outlier analysis using the Cooks Distance tool.

Keep getting Runtime error 424, object required.

Tried specifying a specific cell in the output range, and also erased the value that was there leaving the field blank. Either way gets the error popup.

Any ideas?

Rich,

Sorry that the tool didn’t work properly. I checked and the problem occurs even when you place the output on the same page as the input. I have now corrected the problem. If you download the Real Statistics Resource Pack (Release 1.7.2) now you should have a version that works correctly. If not, please let me know.

Charles