Multiple Regression

In this part of the website, we extend the concepts from Linear Regression to models that use more than one independent variable. We explore how to find the coefficients for these multiple linear regression models using the method of least squares, how to determine whether independent variables are making a significant contribution to the model, and the impact of interactions between variables on the model.

In addition, we show how to apply the techniques of multiple linear regression to polynomial models and to the analysis of variance (ANOVA). We also review the impact of issues such as collinearity, autocorrelation, outliers, and influencers on the models.
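
As a rough illustration of the least-squares idea mentioned above, the following sketch fits a model with two independent variables by solving the normal equations (X'X)b = X'y. The function names and the data are made up for this example, and the small Gaussian elimination routine is only one of several ways to solve the system; it is not the method used by any particular tool.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix (perfectly collinear columns?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_regression(rows, y):
    """Return [b0, b1, ..., bk] minimizing the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_multiple_regression(rows, y)
print([round(b, 6) for b in coefficients])
```

Because the example data contain no noise, the fitted coefficients should come out very close to the intercept 1 and slopes 2 and 3 used to generate them.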

Topics

References

Howell, D. C. (2010) Statistical methods for psychology (7th ed.). Wadsworth, Cengage Learning.
https://labs.la.utexas.edu/gilden/files/2016/05/Statistics-Text.pdf

94 thoughts on “Multiple Regression”

  1. Dr. Charles, one question:
    Is it possible to transform only part of the data when running a multiple regression model?
    That is, if we transform the independent variables, do we have to transform all of them? I ask because the interpretation might be a problem when some independent variables are transformed and others are not. Thanks for your answer.

    • You don’t need to transform all the variables. You can transform only one variable. The interpretation will depend on the transformation (even if you transform all of the variables).
      Charles

    • Eduardo,
      I have not added a hierarchical linear model capability to Real Statistics yet, although I started working on it a couple of months ago. Unfortunately, I wasn’t able to finish this activity since I needed to complete some other developments for Release 7.4, which was issued a few days ago. I still plan to add this capability.
      Charles

  2. Respected Doctor Zaionits, please accept my cordial greetings and thanks for your excellent page.
    How can I develop a general linear model (GLM) using Real Statistics?
    Thank you

  3. Hello,
    I ran a multiple regression in Excel, where multiple R is 0.82 and the p-values of all the coefficients are well below 0.05; only the intercept’s p-value is high, at 0.62. All VIF values are less than 3, so there is no multicollinearity.
    Why is the p-value of the intercept high? Can I use the model?

  4. Hi Charles.

    I ran a hierarchical multiple regression and am trying to figure out how to interpret and convert the unstandardized coefficients (-9.236E-5) for variable 1 and (1.882E-5) for variable 2, so that I can report them more concisely in an APA-style table. All other unstandardized coefficients are reported with 3 decimal places.

    Any help would be most appreciated.

    • Hello Naeem,
      In this case you need to violate the APA-style guidelines since you clearly need more than 3 decimal places.
      The only thing that I can think of is to rescale the data for those variables by some constant (perhaps a factor of 1000). This change in units may result in coefficients that meet the APA guidelines.
      Charles
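
A minimal sketch of the change-of-units idea, using a single predictor for brevity. The variable names and numbers are hypothetical. Note that in this sketch only the predictor is rescaled: dividing x by 1000 multiplies its slope by 1000, whereas multiplying both x and y by the same constant would leave the slope unchanged.

```python
def simple_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical data: a predictor recorded in single units (e.g. dollars).
income = [20000, 35000, 50000, 65000, 80000]
score = [5.1, 3.9, 2.4, 1.2, -0.3]

slope_raw = simple_slope(income, score)

# Same predictor expressed in thousands; the fit itself is unchanged.
income_thousands = [v / 1000 for v in income]
slope_rescaled = simple_slope(income_thousands, score)

print(f"per unit:     {slope_raw:.3e}")
print(f"per thousand: {slope_rescaled:.3f}")
```

If this route is taken, the table would need to state the new unit (for example, "per 1000 units") so that the reported coefficient is interpreted correctly.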

  5. Hi,
    I have a question about ranking regression coefficients, for example for a Lasso model.
    1. I scaled my data X and Y (subtracted the mean of each variable and divided by the standard deviation).
    2. I obtained the regression coefficients. Some are zero (unimportant variables), and the rest take values such as 0.2, 1, 2, 3.

    Question 1) Can I order them by absolute value, so that I can say the variable with the highest absolute value is the most important?

    Question 2) Since it is a data-driven model, is there any rule of thumb to reject variables even if the regression coefficient is > 0? Or should I just reject variables with coefficients = 0?

    Thanks in advance

  6. Hi Charles,

    I would like to test the hypothesis ‘Electronics from country X are perceived as less favorable than products from country Y’.

    I measured this on a 5 point Likert scale with statements such as ‘Compared to products from country X, products from country Y are…’

    The answers include 8 levels, such as ‘…more innovative’, ‘…more reliable’, etc.

    Is it correct to calculate the Pearson correlation coefficients (for most variables > 0.3) and then to conduct a factor analysis (PCA)? I assume it would all load onto the factor ‘product perception’. Then I have to compute Cronbach’s alpha afterwards to check the reliability.

    But how would I proceed? Which statistical tests are relevant here? Do I have to compute any coefficients? Does a multiple linear regression make sense? How can I answer my hypothesis correctly?

    Thank you so much for your help, I appreciate it!!

    • Hello Sally,
      Please clarify the following:
      1. Does each respondent give a score in the 5 point Likert scale to country X and another score for country Y? Or does each respondent give one score which compares the two countries (for each of the 8 levels)?
      2. Assuming the latter, I understand that each respondent gives 8 scores. In this case the approach using Factor Analysis may make sense to find the latent factors (thereby reducing 8 initial factors to a lesser number). Assuming the scores are based on some questionnaire, you can use Cronbach’s alpha to determine reliability.
      3. Whether you have 8 factors or a lower number after using Factor Analysis, you can use some version of MANOVA for statistical testing. E.g. assuming each respondent gives 8 scores and factor analysis is not used, we can code the 5 point Likert scale as -2, -1, 0, 1, 2 where the negative values favor country X and the positive scores favor country Y. Then a one-sample Hotelling’s T-square test (a simple version of MANOVA) could be used to see whether there is a significant difference between the sample and a neutral rating, i.e. a vector of 8 zeros.
      Charles
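
To make the last point more concrete, here is a small sketch of the one-sample Hotelling T-square statistic against a zero vector. To keep it short it handles only 2 items rather than 8, so the inverse of the covariance matrix can be written out by hand; the function name and the responses are hypothetical. It returns the statistic T² = n·m'S⁻¹m (m is the mean vector, S the sample covariance matrix) and the equivalent F statistic, F = (n − p)/(p(n − 1))·T² with (p, n − p) degrees of freedom. The p-value is not computed here and would come from the F distribution.

```python
def hotelling_t2_two_items(rows):
    """One-sample Hotelling T-square against (0, 0) for two variables.

    Returns (t2, f_stat, (df1, df2)).
    """
    n = len(rows)
    p = 2
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    # Sample variances and covariance (divisor n - 1).
    s11 = sum((r[0] - m1) ** 2 for r in rows) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in rows) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows) / (n - 1)
    det = s11 * s22 - s12 ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Quadratic form m' S^-1 m using the closed-form 2x2 inverse.
    quad = (s22 * m1 * m1 - 2 * s12 * m1 * m2 + s11 * m2 * m2) / det
    t2 = n * quad
    f_stat = (n - p) / (p * (n - 1)) * t2
    return t2, f_stat, (p, n - p)


# Hypothetical responses coded -2..2 (negative favors X, positive favors Y),
# one row per respondent, one column per item.
responses = [(1, 2), (0, 1), (2, 1), (-1, 0), (1, 1), (2, 2), (0, -1), (1, 0)]
t2, f_stat, df = hotelling_t2_two_items(responses)
print(t2, f_stat, df)
```

With all 8 items the same formula applies, but a general matrix inverse (or a statistics package) would be needed in place of the 2x2 shortcut.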

      • Hello Charles,

        thank you so much for helping me.
        Each participant gives one score for each of the 8 levels. How would I conduct such a factor analysis in Excel? Do I have to code the Likert Data somehow in advance? Do I have to compare the means as well (e.g. ‘respondents agree that products from country Y are more expensive compared to products from country X with a mean of 3.8 based on a 5 point Likert scale’)? And would I need ANOVA to compare those means?

        Thank you in advance!

        • Hello Sally,
          1. Factor Analysis can be used to find the latent factors (thereby reducing 8 initial factors to a smaller number). You need to provide some sort of coding in advance; probably using the Likert scale values.
          2. You can compare the means using a one-sample t test if you are comparing the results you have obtained with a fixed value. If you are comparing 8 means with each other then ANOVA would be used.
          Charles

  7. Hi…
    I’m using Ridge Regression and get a pop-up window with the message ‘Input X must have more rows of data than columns’. Currently I have 43 X rows and 48 X columns. Is there a way of working around this problem? This is just the start; I have more than 400 X columns.
    Regards
    Simon

    • Hi Simon,
      In general, regression only works when you have more rows than columns. I don’t know of any workaround for this, although perhaps there is some version that you can find on the Internet.
      Charles

  8. Hello Sir,

    So far I have used the Essential Regression software in Excel for multiple regression. However, that software is only supported up to Windows 8 and not beyond. I am currently using Windows 10 and was wondering how to do an Auto Regression (AutoFit) for a given source of data.

    For example, if I have 2 columns X (1,2,3,4) and Y (3,6,9,12), how would I do an auto-fit regression with this data? In the Essential Regression software this would happen with one click, as there was an Auto-Fit button which I would just click and the job would be done!

    Please let me know if there is any solution or alternative to this. I would really appreciate it. Thank You,

    Ravi

    • Ravi,
      By Auto Regression do you mean autoregression, which is part of time series analysis? If so, then you can do this as part of the Time Series data analysis tools.
      If you just want to automatically build a multiple regression model, then simply use Real Statistics’ Multiple Linear Regression data analysis tool.
      Charles

  9. Hi Charles, hope you are well.

    Do you have formulas to correct for restriction of range?

    I am wondering how to correct for restriction of range, especially if you don’t know the standard deviation of the unrestricted group. For instance, you run a selection process in which the first stage is a cognitive ability test, and as a result you would not know the standard deviation of the population.

    Best regards

      • Thanks for your attention, Charles.
        I mean, how do I draw a regression line on a scatter plot for multiple regression?
        I tried to draw a line as in simple linear regression, but I can’t find how to draw one for multiple regression.

        • Ujang,
          The problem is that this is a graph in more than two dimensions. With k independent variables you would need k+1 dimensions. We can cope pretty well with 2 dimensions, and to some extent with 3 dimensions, but beyond that it is pretty hopeless. In other words, for most of us, you need to look at one independent variable at a time, i.e. 2 dimensions.
          Charles

  10. Hello Charles,

    I find your Excel-based statistical analysis work excellent and knowledge-enhancing.
    Are you going to add Partial Least Squares?
    If so, when?

    Best Wishes!

    • Ramesh,
      I am pleased that you are getting value from the Real Statistics software.
      Yes, I expect to add Partial Least Squares, but I don’t have a specific date for this enhancement. I expect to look at regression enhancements in the release after next.
      Charles

  11. Dear Charles,
    I’m new to this. I’m a student right now, so I don’t really understand multiple regression and simple regression. Can you tell me why multiple regression is far better than simple regression? If you don’t mind, can you explain using an appropriate model, supported by statistical and mathematical notation?

      • So that means the formula is almost the same. The difference is the independent variables that are used.
        Thanks for your help, Charles. I read the note over and over. It was a little bit confusing, but I get it now.

  12. Dear Charles,

    First of all, thank you for being such a great resource of statistics knowledge.

    I have a question regarding moderated multiple regression analysis. I have managed to confirm that the variable ‘psychological resources during career change’ moderates the relationship between ‘future identity’ (IV) and ‘proactive career behaviours’ (DV). The moderator ‘psychological resources for career change’ comprises 5 different sub-scales (readiness, confidence, locus of control, social support, and autonomy of decision making) with their own validity. I would like to take my analysis further and evaluate which of these 5 above-mentioned sub-scales has the greatest effect on the relationship between the independent variable (future identity) and the outcome (proactive career behaviour). Would you know how I should do this and what I should look into?

    Thank you in advance!

    Warm regards,

    Maria

  13. Dear Charles,
    Can you please explain how to identify the independent variables and the dependent variable in multiple regression?

  14. Hi there,
    My study is about the relationship between action-centered leadership and team performance.
    I have used a questionnaire with a five-point scale. My difficulty is identifying the dependent and independent variables.

    • Ramos,
      It sounds like team performance is the dependent variable and action-centered leadership is the independent variable. But this really depends on the details.
      Charles

  15. Dear Charles

    Your website has aided me greatly with understanding statistics! However I still encounter some problems with my dissertation…

    I need to research whether X1 has an increasing influence on Y as the educational level rises. Ditto for X2. I’m also interested in which X has the greater influence on Y.

    The difficulty I encountered is the fitting of the proper model… I’m wondering if this is a regression model or a Three way ANOVA? I learned to execute both models from your website!

    Kind regards

  16. Hello Charles,

    I’m working on a simulation test to forecast quarterly revenues impacted by several market decision-making variables, such as large customer marketing budget, training, product improvement, price, and number of sales personnel. However, I’m not sure if I should use multiple regression analysis or time series forecasting. The goal is to start with data from the previous quarter’s revenue, let’s say starting with $3 million in revenue, then use that data to predict the next four quarters, and forecast which variables to increase to generate a trend towards $7 million by the 4th quarter. Which regression model or forecasting model should I use in Real Statistics for this, please?

      • Thank you, Charles. I’ll read more on time series. I was leaning more towards that one. As for more information…

        I believe I should use a marketing mix model to analyze which lead indicators or drivers I should focus on increasing in order to forecast more sales, market share, and revenue per quarter for maybe the next 2 years, for a marketing decision simulation.

        Example: We sell one product. If customer A is my largest customer that pays high premium for Power, customer B pays high premium for Temperature and Power, Customer C only cares about Temperature and low price, and D only cares about lower price. They all care about sales support when they buy the product.

        If I start out the quarter with $2 million in revenue, what will happen if I increase the budget for sales support training by X only for the highest-paying customers, increase the marketing budget by X only for the largest, and increase the product development budget by X, hoping to forecast an increase in overall revenue and market share for the next quarter and so forth?

        Y = b0 + b1(x1) + b2(x2) + b3(x3) … etc. I was wondering if time series would help forecast my goal of reaching 7 to 8 million. Which model can do this for me in Real Statistics or in Excel? If I can do this easily with the tools, that would help. Does this sound feasible and attainable through the tools?

        • James,
          People do build forecast models of the type that you are describing using regression and time series approaches. You have listed a number of ideas for parameters that you would like to tweak, but “the devil is in the details”. Which approaches to use depend on these details.
          Charles

  17. Dear Charles

    Your tools help me a lot. Thank you!

    I have a question regarding stepwise regression.
    I used the stepwise regression option to determine which variables to use in the final model.
    I have 10 variables in total, and I used the option, which concluded that there are only two variables that are statistically significant.
    The thing is that the p-values given in the first step in the output range are not identical to the p-values I get from a separate multiple linear regression using exactly the same variables and data.
    To help you understand, I provide the p-values I got below.

    P-values in the first step of stepwise regression:
    0.027634262
    0.83288323
    0.007923924
    0.218299547
    6.31971E-21
    0.790620417
    0.627783482
    0.866607918
    0.068914612

    P-values in a separate multiple linear regression:
    0.502954508
    0.773073026
    0.314536192
    0.563318878
    0.000627492
    2.73835E-23
    0.434534911
    0.57708841
    0.431243818
    0.729185186

    According to the stepwise regression, I think the first and the third variables should’ve appeared statistically significant (p<0.05).
    As you can see, however, that was not the case.
    Why do the p-values differ?

    Thank you,
    Juyoung

    • Juyoung,
      I can’t explain what is going on without seeing the data. If you send me an Excel file with your data and analysis, I will try to figure out what is happening. You can send the file to my email address as specified at Contact Us.
      Charles

  18. Dear Charles
    Thank you for your wonderful website.
    I have a question about the result obtained from the weighted multiple linear regression.
    In the auto-generated ANOVA table from your programme, I found that the degrees of freedom for regression is 2. When I used other statistical software to conduct weighted regression, the degrees of freedom is 1 instead. However, the slope and y-intercept coefficients and their standard errors are all the same in both your programme and the others for the same data set.
    Since the degrees of freedom differ, the SS and the R square end up different. Your programme gives a higher R square.
    May I know what causes the difference?
    Thank you so much.
    Regards
    Tom

    • Tom,
      These are different versions of weighted regression. Some packages offer both versions. Sometime in the future I plan to offer the version that you are referring to.
      Charles

  19. Hello Charles,

    I’m interested in using a stepwise regression approach to build a regression model that best predicts earnings for golfers, but I’m not sure which type of regression to use, i.e. linear, multiple, ANOVA, etc. I’ll choose 3 indicators as well. Here is some sample data…

    Player  Earnings   Yards/Drive  Age  Greens in Regulation  Putting Average  Eagles
    Xname   6,683,215  284.1        34   67.3                  1.7              3
    xname2  6,860,005  290          34   66                    1.7              1

    There is more data, but I just wanted to understand which regression to choose using the Real Statistics tool, or whether there are any manual Excel formulas if the tool can’t do it.
    Could you please share your thoughts?

    • I meant to adjust the format to shorten the fields
      Player  Earnings   Yards/Drive  Age  Greens in Reg  Put Avg  Eagles
      Xname   6,683,215  284.1        34   67.3           1.7      3
      xname2  6,860,005  290          34   66             1.7      1

    • Jamel,
      Although which analysis tool to use depends on your objectives, this does sound like a multiple linear regression problem. See the following webpage for details about how to perform stepwise regression.
      Stepwise Regression
      Charles

  20. I am enjoying the site, thanks. I thought I would add a little info about using the regression tools in Excel. I started using this technique back in Lotus 1-2-3 in the 80s, and Excel works the same way. I frequently need to use 5th-order to 7th-order polynomial curve fitting to represent data more accurately than any multiple linear model will do. To do this, I take as my Y variable the single column of values that I want modeled. The first X column holds the X values, the next column holds X^2, the next X^3, and so on up to however high an order I need. Then I run the regression, which gives me the coefficients, and I can calculate the new model values with the basic equation aX^5 + bX^4 + cX^3 + dX^2 + eX + f. Using the original X, I can calculate an extremely accurate curve fit that is not linear. I have also used a similar process in which one of the columns is a sin function or a log function, to create a sinusoidal or logarithmically damped curve for data that fits that kind of profile. I will generally play with the function until I get a high enough R squared. I hope this can help :).
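
The column-of-powers approach described in this comment can be sketched outside a spreadsheet as follows. This is only an illustration under stated assumptions: the function names and data are made up, a cubic is used instead of a 5th to 7th order polynomial to keep the example small, and the normal equations are solved with a basic Gaussian elimination. For higher orders the power columns become nearly collinear, so a numerically more stable solver is usually preferable.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(x, y, degree):
    """Return [c0, c1, ..., c_degree] for c0 + c1*x + ... + c_degree*x**degree."""
    # One column per power of x, like the X, X^2, X^3, ... spreadsheet columns.
    design = [[xi ** p for p in range(degree + 1)] for xi in x]
    k = degree + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from y = 1 - 2x + 0.5x^3.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [1 - 2 * xi + 0.5 * xi ** 3 for xi in x]
coeffs = fit_polynomial(x, y, degree=3)
print([round(c, 6) for c in coeffs])
```

Note that the coefficients are returned from the constant term upward, which is the reverse of the aX^5 + ... + f ordering used in the comment.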

    • George,
      Thanks for sharing your insights. I plan to add a new function to the Real Statistics Resource Pack in the next release which should help determine the order of the polynomial to use.
      Charles

  21. Hello Charles,

    I have 10 independent variables, but I want to fix the coefficients of 5 of them at certain values and then run a multiple regression analysis to determine the coefficients of the rest. Can I do that using Excel or Stata?
    Thank you for your help.

    • Nada,
      You can do this using Excel’s Solver. E.g. see how it is done for Exponential Regression or Logistic Regression.
      Charles
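
Solver is the route suggested here. For ordinary least squares specifically, a different but equivalent technique is to subtract the contribution of the fixed coefficients from y and then regress that remainder on the other variables. The sketch below illustrates this with one fixed coefficient and one free coefficient; the names, the fixed value 2.5, and the data are all hypothetical.

```python
def simple_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical model y = b0 + b1*x1 + b2*x2 where b1 is fixed in advance
# and only b0 and b2 are estimated.
FIXED_B1 = 2.5
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [4 + FIXED_B1 * a - 1.5 * b for a, b in zip(x1, x2)]

# Move the known contribution to the left-hand side, then fit what is left.
remainder = [yi - FIXED_B1 * a for yi, a in zip(y, x1)]
b0, b2 = simple_fit(x2, remainder)
print(round(b0, 6), round(b2, 6))
```

With several free variables, the remainder would be regressed on all of them with a multiple regression instead of the single-predictor fit used here.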

  22. Hi,
    Thanks for all these great explanations.
    I might have completely missed it but I don’t understand how you come up with the weights in a weighted linear regression.

    • John,
      I show how to use the standard deviations to come up with weights, but in general you need to come up with weights based on factors about the scenario that you are trying to model. For more information, see Weighted Regression.
      Charles
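
As one illustration of weights derived from standard deviations, the sketch below assumes the common inverse-variance choice w = 1/sd², so that noisier observations count for less. That choice, the function name, and the data are assumptions made for this example and are not necessarily what the Weighted Regression page uses. A single predictor is used so that the weighted fit has a simple closed form.

```python
def weighted_fit(x, y, sd):
    """Weighted least-squares line, weighting each point by 1 / sd**2.

    Returns (intercept, slope).
    """
    w = [1.0 / s ** 2 for s in sd]
    total = sum(w)
    # Weighted means replace the ordinary means of simple regression.
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: the last observation is much noisier than the others.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 12.0]
sd = [0.1, 0.1, 0.1, 0.1, 2.0]

intercept, slope = weighted_fit(x, y, sd)
print(round(intercept, 4), round(slope, 4))
```

Because the last point has a large standard deviation, it receives a small weight and the fitted line is driven mainly by the first four points.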

  23. Dear Charles,
    I have a sample of 30 to measure the factors constraining the adoption of technology. I am thinking of running a regression analysis on the adoption rate (coded as 1 if it is more than 50% and 0 if it is less than 50%, taking 50 percent adoption as the threshold).
    To measure the constraining factors, I used a 5-point Likert scale (highly significant to least significant) and have already extracted the important variables using principal component analysis.
    Now, what kind of regression analysis should I use to measure the relative importance of each factor? Linear or multiple?

    Nirosha

    • Nirosha,

      When you say “multiple” I assume that you mean “multiple linear regression”, which just means that you have more than one independent variable. When you have only one independent variable often the term “linear regression” or “simple linear regression” is used. Since you say that you have multiple factors, you would often use multiple linear regression.

      Since your outcome (dependent variable) could be viewed as dichotomous (0 or 1), you might find that logistic regression gives a better fit for the data. You can compare AIC values for this.

      Charles

      • Dear Charles,
        Thank you very much. I have one independent variable named adoption rate (less than 50 is 0 and higher is 1). Indeed, I have to measure the relative importance of 8 factors that affect the adoption rate, which were measured using a Likert scale.
        According to you, I think I must use linear regression. If so, which indicator should I use to measure the relative importance? Is the coefficient value suitable for measuring the relative importance of the factors?
        Can I use SPSS to run logistic regression?

  24. Hi Charles,

    Your compilation on regression analysis is very extensive and impressive. I wonder if you would like to extend it a little more by including model selection and regularization, that is, Ridge Regression, LASSO, the bias-variance trade-off, and other techniques that help fit models well enough to make believable predictions.

  25. Hello Charles,

    Thank you for this amazing resource. I was wondering if you could help me. I want to understand whether a person’s frequency of visits to a particular (apparel) store depends on the quality, variety, or prices of the products or on the location of the store. Can I build a regression model using frequency of visit (low or high) as the dependent variable, based on ratings on a scale of 1 to 5 for independent variables like quality, price, variety or location? Which model would be useful for testing such a hypothesis?

    I appreciate your help!
    Thanks in advance
    Samantha

    • Samantha,
      Regression requires that the independent variables be continuous, but Likert scale values are commonly used as independent variables, provided it is reasonable to assume that the distances between scale values are equal (e.g. for the quality variable, the difference in quality between 3 and 2 is the same as the difference between 5 and 4). A Likert scale of 1 to 5 is ok; 1 to 7 would be better. If the independent variable is not ordered, then you are better off using dummy variables to code the independent variable (as explained on the Real Statistics website).
      Charles
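
For the unordered case, dummy coding can be sketched as below. The function name and the example categories are hypothetical; the sketch drops one reference level so that k categories become k - 1 columns of 0s and 1s, which avoids perfect collinearity with the intercept.

```python
def dummy_code(values, reference=None):
    """Code a categorical variable as 0/1 columns, omitting a reference level.

    Returns (column_names, coded_rows).
    """
    levels = sorted(set(values))
    if reference is None:
        reference = levels[0]  # default: first level in sorted order
    kept = [lv for lv in levels if lv != reference]
    coded = [[1 if v == lv else 0 for lv in kept] for v in values]
    return kept, coded


# Hypothetical unordered variable: where the store is located.
locations = ["mall", "street", "online", "mall", "online"]
names, coded = dummy_code(locations)
print(names)
for row in coded:
    print(row)
```

Each coded column can then be used as an ordinary independent variable; its coefficient is read as the difference from the reference level.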

  26. Hi Charles,

    I am wondering if you can help me. I have measured three parameters (x,y, and z) that are all dependent on each other. I am trying to formulate a method to predict x by measuring only y and z. I have done the regressions between x-y and x-z individually, but I am wondering if there is a way to perform a regression to predict x from y and z simultaneously that will strengthen the correlation. Is this even possible? Or does the fact that there is no independent variable make this nonsensical?

    Thanks,

    • Kevin,
      Yes, you use multiple regression, as explained on the referenced webpage, where you assume that y and z are the independent variables and x is the dependent variable.
      Charles

  27. Charles:

    Is developing VBA code for Partial Least Squares Regression (PLS Regression) part of your future plans? It could be a very interesting tool in Real Statistics.

    Thank you.

    William Agurto.

    • William,
      After the next release I plan to turn my attention back to regression. I hope to add partial least squares regression as part of this process.
      Charles

  28. How do I calculate the regression coefficients, standard errors, and t-values when there are more than 2 independent variables and 1 dependent variable? Please guide me.

  29. Hello Charles,
    My dependent variable is “Returns of the Stock in %” and my independent variables are factors that affect the price of the stock, such as “% Change in Net Profit to Sales of the Company” and “Inflation Rate”.

    When I run the regression analysis in Excel, I get disastrous results. For instance, my R-square is 0.0132 and the F value is 0.16. I don’t understand where I could be going wrong. Could you please help me?

    If you could provide your email ID in the comment, I will forward you the Excel file so you can look at my data.

    Thanks in advance!

  30. Hi! I am supposed to identify a multiple linear regression model to predict one variable and then discuss the best model. What kind of method is appropriate for these?

    Thank you so much! 🙂

  31. I have one dependent and three independent variables. Sir, please help me calculate my raw scores in Excel, or guide me with an example in Excel. Waiting for your positive reply.

  32. I am using logistic regression. When I run it the first time, everything looks fine. However, when I run the program a second time on the same input data set, it shows a runtime error. Also, when I copy the data set to a new sheet, it runs fine only the first time. Could you please help?

    • That is very strange and is the first time I have heard of this problem. It is important that the second time you run the program you aren’t trying to overwrite the results of the first run. Make sure that your output range the second time does not overlap the output from the first run.

      If this is not the problem, then if you send me the spreadsheet with your data I will try to figure out where the problem is.

      Charles

  33. Rich,

    This looks to be a great tool, but it does not seem to address the need I was looking for. Specifically, calculating confidence intervals and prediction intervals for multiple regressions. Is there something I’m missing? Thanks,

    Tom

  34. Hi, Charles
    Saw all the new tools regarding power and sample size you just added. Wondered if there might be similar discussion or tool(s) for regression studies at some future time?
    Regards,
    Rich

