Linear Regression

The goal of linear regression analysis is to describe the relationship between two variables based on observed data and to predict the value of the dependent variable based on the value of the independent variable.

Topics

73 thoughts on “Linear Regression”

Jojo

December 15, 2023 at 2:00 am

Hello! Do I need to find the missing data for this question? If yes, how do I get with it on excel? (A video will help) Below is the data and what I want to get
Use excel to run this data. Generate summery statistics, ANOVA Table and descriptive statistics.
Employees
A
B
C
D
E
F
G
H
I
Talent_Mgt
12
X
16
X
20
X
X
30
32
Performance
56
67
X
X
X
X
89
X
100
Reply
- Charles
  
  December 15, 2023 at 6:48 pm
  
  Hello Jojo,
  Standard Excel doesn’t provide such a capability, but you can use the approaches described on this website using Excel.
  See https://real-statistics.com/handling-missing-data/
  Charles
  Reply
SP

December 13, 2023 at 10:15 pm

I have the data, which is as below:
x = [1, 2, 3, 4, 5]
y1 = [26.5, 139, 269, 337, 409]
y2 = [26.2, 138, 273, 343, 404]
Using the raw data, I want to have following Results of Regression Analysis as per web attachment. Thanks for the help.
Reply
- SP
  
  December 13, 2023 at 10:15 pm
  
  image web address is https://ibb.co/r7WFrzm
  Reply
Maaruf Bussri

July 28, 2023 at 9:31 pm

Dear Sir,

Please look at my work below. I am not sure if this method is acceptable.
I want to generate a synthetic data for testing or teaching linear regression.
I use the Excel random function to generate;

Linear regression modeled as;
y(i) = a + b*x(i) + e(i)

a & b is constant, thus for all i;
a = RAND()*(U_a – L_a) + L_a
b = RAND()*(U_b – L_b) + L_b

U_a & L_a is upper & lower bound for a
U_b & L_b is upper & lower bound for b

x(i) is a variable, thus for all i;
x(1) = RAND()*(U_x – L_x) + L_x
x(2) = RAND()*(U_x – L_x) + L_x
…
x(n-1) = RAND()*(U_x – L_x) + L_x
x(n) = RAND()*(U_x – L_x) + L_x

U_x & L_x is upper & lower bound for x

e(i) is a variable, normally distributed with mean 0, thus for all i;
e(1) = NORM.INV(RAND(), m, s)
e(2) = NORM.INV(RAND(), m, s)
…
e(n-1) = NORM.INV(RAND(), m, s)
e(n) = NORM.INV(RAND(), m, s)

m is mean which is equal to 0
s is standard deviation

therefore, variable y(i) for all i;
y(1) = a + b*x(1) + e(1)
y(2) = a + b*x(2) + e(2)
…
y(n-1) = a + b*x(n-1) + e(n-1)
y(n) = a + b*x(n) + e(n)

please note that, a & b is constant. Therefore, it only have 1 random value respectively.
So, when dragging, use cell reference and anchor, to avoid a & b generate multiple value for all i.
Reply
- Charles
  
  August 8, 2023 at 6:04 pm
  
  Hello Maaruf,
  If I understand aht you are doing correctly, your approach does seem to be correct.
  Charles
  Reply
Bhushan

October 2, 2022 at 3:35 pm

Hello Charles,
I have conducted a survey of patients with a chronic medical condition. I have received 39 responses to the questionnaire. I need to identify if there is a relation between their mobility level (independent variable) (mobility is classified into 5; level 1 -level 5 with decreasing mobility) and them receiving a transition plan (dependent variable; yes or no)
What statistical test will serve my purpose best? I think logistic regression is not feasible due to the small sample size.
Thank you so much for taking the time to reply.

Regards
Bhushan
Reply
- Charles
  
  October 4, 2022 at 7:01 pm
  
  Hello Bhushan,
  You can divide the 39 people who answered the questionnaire into two groups: Group A who received a transition plan and group B who didn’t receive a transition plan. Now you can perform a two sample t-test between the two samples based on the mobility value. Since it is likely that the normality assumption will fail, you can instead perform a Mann-Whitney test.
  Charles
  Reply
bongjun kim

April 10, 2022 at 1:37 pm

RegCoeff(x,y) cause error when there are missing values in x.
Is there any method to ignore missing values in x and export regression coefficients with non-missing data?
Reply
- Charles
  
  April 11, 2022 at 11:45 am
  
  Yes, you need to first eliminate any rows with missing data. The usual approach is to use listwise deletion. This means that any row that is missing any element of X or Y data is removed. If, for example, your X data is contained in range A1:C10 and your Y data is contained in range D1:D10, then you can use the Real Statistics array formula =DELROWS(A1:D10) to remove any rows with missing data. Next, you can use RegCoeff to find the regression coefficients.
  Listwise elimination may be appropriate when you have very few missing data entries. There are other techniques, however, for dealing with missing data. These consist of imputing values for the missing data based on the data that is not missing. These are described at
  https://www.real-statistics.com/handling-missing-data/
  Charles
  Reply
Genamo Irensa

March 13, 2022 at 8:47 am

what if Hosmer and Lemeshow Test is greater than 0.05?
Reply
- Charles
  
  March 13, 2022 at 8:25 pm
  
  Hello Genamo,
  A significant result (i.e. when p-value < .05) indicates that the logistic regression model is not a good fit for the data. See the following webpage for details: https://www.real-statistics.com/logistic-regression/hosmer-lemeshow-test/
  Charles
  Reply
Carlos Gazani

November 28, 2020 at 7:33 pm

Hello Charles;
I am stack trying to test the homoscedasticity and normality assumptions, but my simple regression lineal model does not accomplish both. The Pearson coefficient is about 0.95.
My data is the following
Polvo Respirable, Silice Cristalino
1.2 0.084
0.76 0.039
7.9 0.33
14 0.7
27 2.6
4 0.11
3.1 0.25
4.1 0.35
3.3 0.3
7.1 0.47
Can I keep working with my model even if not accomplished those assumptions, how valid my model will be?
Thank you for your help
Reply
- Charles
  
  November 28, 2020 at 10:41 pm
  
  Carlos,
  See the following webpages for the answers to your questions:
  1. Homoscedasticity: https://www.real-statistics.com/multiple-regression/heteroskedasticity/
  2. Normality: You need to check whether the residuals are normally distributed:
  https://www.real-statistics.com/multiple-regression/residuals/
  https://www.real-statistics.com/tests-normality-and-symmetry/
  3. You can use the regression model for forecasting even when the assumptions are not met. p-values and confidence intervals may not be accurate though.
  Charles
  Reply
  - Carlos Gazani
    
    November 28, 2020 at 11:48 pm
    
    Thank you Charles, you help me a lot.
    God bless you and your family
    Reply
- Carlos Gazani
  
  November 29, 2020 at 12:26 am
  
  Charles
  
  I click Reg, but doesn’t show me breusch-Pagan test, maybe I have office 2007
  Reply
  - Charles
    
    November 30, 2020 at 11:13 am
    
    Carlos,
    Enter the formula =VER() to see whether you are using the Excel 2007 version of the software.
    Charles
    Reply
    - Carlos Gazani
      
      November 30, 2020 at 7:40 pm
      
      Thanks Charles,
      When I enter =VER(), it shows me
      5.11 Excel 2007
      
      I did manually the test for homoscedasticity by the following three methods:
      1.- NWRSR – The Neter-Wasserman / Ramsey / Spearman Rho T,
      2.- NWGQ – Neter-Wasserman / Goldfeld-Quandt , and 3.- Breusch-Pagan-Godfrey.
      When I used an significance level alpha= 0.05 there is not homoscedasticity, but when I used an alpha=0.01 in all pass the test for homoscedasticity.
      It is valid to use an alpha =0.01 for my model?
      Reply
      - Charles
        
        December 9, 2020 at 9:07 pm
        
        Hello Carlos,
        Yes, this seems to be a perplexing development. This is a consequence of the fact that usually, the alternative hypothesis is what we are trying to demonstrate, but in this case, the null hypothesis is the desirable state.
        I don’t have an easy answer for you. I simply conclude that homoscedasticity is questionable and move forward with the ordinary regression. You might want to correct for heteroscedasticity and see whether the results are different.
        Charles
Maria sanchez

September 28, 2020 at 12:20 am

Se puede realizar pronosticos de demanda con 4 años base?
Reply
- Charles
  
  September 28, 2020 at 10:02 am
  
  Sorry, but I don’t understand your question. What do the 4 base years refer to and what sort of forecast do you want to make?
  Charles
  Reply
Ian Moffit

May 14, 2020 at 8:25 pm

Dear Charles,

Does RealStats allow one to run mixed model linear regression?

Thanks,

Ian Moffit
Reply
- Charles
  
  May 21, 2020 at 12:13 pm
  
  Hello Ian,
  RealStats doesn’t currently support mixed model linear regression. I do plan to add this capability in the future.
  Charles
  Reply
RRC

March 31, 2020 at 3:39 am

Hi Chareles,

Would you please help me out on the following doubts?

Do I need to have normal y and x values for regression analysis? I am not talking about the errors but the variables. Or, I can start without checking the normality of the available data. Awaiting for your kind response.
Reply
- Charles
  
  March 31, 2020 at 10:04 am
  
  Normality is not a requirement for regression. The normality of the residuals is required for the various tests (e.g. whether or not a regression coefficient is significantly different from zero). You can use the regression model to make predictions even if the normality assumption is not met.
  Charles
  Reply
  - RRC
    
    March 31, 2020 at 12:00 pm
    
    Many thanks
    Reply
Shlomi

March 11, 2020 at 11:21 am

Dear Charles,

Thank you so much for this helpful tool!

I need to compare two measurement units, but their units are different. The “gold-standard” is 0.7-1 (variance of 0.03) and the new method is 3-6 (variance of 0.2).
I was thinking of model ii regression and ICC, but I’m not sure which to use and how to interpret the results of both methods.

Could you please expand on their interpretation and if I’m doing the right thing with this type of analysis?

Many thanks
Reply
- Charles
  
  March 11, 2020 at 6:35 pm
  
  Hello Shlomi,
  I would need more information before I could determine whether either of these is appropriate and which is better.
  Other approaches for comparison with a gold standard are Bland Altman, Lin’s Concordance Correlation Coefficient and Passing-Bablok Regression. All of these are described on the Real Statistics website.
  Charles
  Reply
  - Shlomi
    
    March 12, 2020 at 7:57 am
    
    Hi Charles,
    
    Thank you for your reply.
    
    Since I’m comparing 2 measurements with different units, I believe I can only use one of your suggestions if I normalize the data somehow (min-max or z) so I can compare both.
    Otherwise, I thought about Deming and ICC since, as much as I know, I can use them even if my units are different.
    
    I have one array (n=33) from the gold-standard of scores between 0.7 to 1, and the other (also 33) from the new device of scores between 3-6.
    
    Many thanks for your help
    Reply
Tim

March 6, 2020 at 7:07 pm

To begin, I am not a statistician, so please excuse my ignorance. I’m trying to determine the effects of several factors on the results of a finite element analysis. I have several categorical variables and some continuous ones. I’ve read about and have completed the categorical coding for regression and the linear regression analysis using Real Statistics Using Excel. What I don’t understand is how to use the results of the analysis. Can you tell me where do find this information?
Reply
- Charles
  
  March 9, 2020 at 9:14 am
  
  Hi Tim,
  This is explained on the website. Since you have several independent variables, you should look at
  Multiple Regression
  See especially Multiple Regression Analysis
  Usually, regression is used to make forecasts or predictions (although there are other motivations). This can be done using the TREND function, as explained at:
  https://real-statistics.com/multiple-regression/multiple-regression-analysis/multiple-regression-analysis-excel/
  
  Charles
  Reply
Lina

September 1, 2018 at 2:45 am

Hi Charles,
Thanks for providing a wonderful resource! I have a question regarding linear regression. Assuming normality, I think the mean and the standard error of the y-estimate is a function of x. Is there a function currently available to find the standard error of the y-estimate at each point of x? I am assuming mean(y-hat) = a + bx, Var(y-hat) = Var(a) + x^2*Var(b) + 2*x*Cov(a,b). Thanks for your help!
Reply
- Charles
  
  September 1, 2018 at 8:47 am
  
  Lina,
  See https://real-statistics.com/regression/confidence-and-prediction-intervals/
  Charles
  Reply
  - Lina
    
    September 1, 2018 at 2:41 pm
    
    Thank you Charles!
    Reply
Steven

August 13, 2018 at 8:50 pm

Hi Charles,
I am trying to compare the regressions of the responses of 4 mutants against a control. All are independent. I need to test whether the intercepts are significantly different. Could you point me in the right direction. Many thanks.
Reply
- Charles
  
  August 14, 2018 at 9:06 am
  
  Steven
  Are you saying that you have 4 linear regression models, each one consists of one of the mutants vs the control, and want to test whether the intercepts are equal (or show a significant difference)?
  Are the 4 regression lines parallel?
  Charles
  Reply
Adam

November 13, 2017 at 2:19 pm

Hi. Does your Excel add-in support quantile regression, as well as supporting linear models for the mean? Thanks. Adam
Reply
- Charles
  
  November 13, 2017 at 4:40 pm
  
  Adam,
  The Real Statistics addin supports quantile regression in the case where the quantile is the median (i.e. LAD regression), but not the general case.
  Charles
  Reply
  - Adam
    
    November 14, 2017 at 11:12 pm
    
    Thanks.
    Reply
Diego

November 7, 2017 at 6:58 pm

Buenas tardes,
Tengo la siguiente consulta:
Realizo un trabajo entre variables continuas,
el numero del coeficiente de correlación es -0,496630749
significa que la relación lineal entre las variables es negativa y muy débil , sin embargo cuando aumenta x disminuye y.
como obtengo el numero de p= probabilidad ?
Muchas gracias !!
Reply
- Charles
  
  November 8, 2017 at 8:10 am
  
  Diego,
  Please look at the following webpage
  One sample hypothesis testing for correlation
  Charles
  Reply
mary ann r. aquino

March 7, 2017 at 4:07 am

HELLO, the computed r is 0.201566193 which means that tere is a weak correlatio between the variables, while the pvalue is 6.63411E-10 which tells me that i have to reject a null hypothesis. is this result really possible?
Reply
- Charles
  
  March 7, 2017 at 10:16 am
  
  Mary Ann,
  This is certainly possible. If you are using simple linear regression, then the p-value being so low only means that there is a significant difference between the population correlation and zero. It doesn’t mean that the population value of r is high; it just means that it is not likely to be zero.
  Charles
  Reply
Marcos

October 30, 2016 at 8:44 pm

Charles,
Thanks for all the interesting information you have available here.
I’m relatively new to regressions and I’m hoping you can give me your thoughts on the following:
I’m trying to roughly estimate/predict what the hourly energy prices ($/MWh), at a certain grid point, will be going forward, out 12 months.
The issue I’m having is that the price I’m trying to estimate, lets call it “Price A”, is relatively new, with only 6 months of hourly historical prices available. Knowing that this price is highly correlated to a different price (r = 0.98 and r-squared = 95%), let’s call it “Price B”, and that Price B does have available historical data going back multiple years, here’s what I’ve done: calculated in Excel, using the equation y=m*x+a (where y = price A and x = price B) and parameters calculated in Excel (“m” and “a”), what the prices would had been at point A, let’s say for the last 12 months.
Essentially, I would appreciate if you could tell me whether or not this is valid approach and also what would I should be doing next to estimate the prices for the next 12 months.
I hope this isn’t too confusing, please let me know otherwise.
Thank you in advance.
Reply
- Charles
  
  October 31, 2016 at 3:38 pm
  
  Marcos,
  It seems like a reasonable approach. You should be aware of issues such as seasonality and autocorrelation.
  Charles
  Reply
Sandy

July 21, 2016 at 10:40 am

Hi Charles

For linear regression, if R square is .209 (20%) and ANOVA p= 0.01, what does this imply? Isn’t R square value should be greater than 30% ?
Reply
- Charles
  
  July 21, 2016 at 11:36 am
  
  Sandy,
  p = .01 means that the slope is significantly different from zero. I don’t know of any rule that R square needs to be greater than 30%.
  Charles
  Reply
  - Sandy
    
    July 25, 2016 at 2:36 pm
    
    Thank you Charles
    Reply
Will

April 12, 2016 at 1:57 am

Hi Charles,
Thanks so much for helping on here first off. My question is that I am attempting to compare “OB GYN per capita population” for each state to the “percent of total deliveries that are via cesarean section” to see if there is a coorelation between the two. So basically less OB GYNs is perhaps associated with higher rates of cesarean section for instance. Would this be a simple linear regression? Would it be proper to enter data for all 50 states to arrive at conclusion? Thanks again
Reply
- Charles
  
  April 12, 2016 at 8:40 am
  
  Hi Will,
  Yes, this could be a way to test this hypothesis. You don’t know a priori whether the relationship is linear (or quadratic or something else), but it makes sense to start with simple linear regression.
  Charles
  Reply
  - Will
    
    April 12, 2016 at 5:58 pm
    
    Thanks. After obtaining result is there some other step to progress to based on result? Statistics is not my strong point
    Reply
    - Charles
      
      April 13, 2016 at 8:17 am
      
      Will,
      The next step depends on why you decided to use regression in the first place. E.g., you can use the regression for forecasting.
      Charles
      Reply
      - Will
        
        April 14, 2016 at 1:29 am
        
        I’m merely trying to determine if the two variables are correlated in some way to some statistical significance p=.05. Does this make sense or no?
      - Charles
        
        April 14, 2016 at 9:26 am
        
        Will,
        To test whether there is a significant correlation between two variables, please look at the following webpage
        Correlation Testing
        There are a number of different tests that can be used, as explained on the website.
        Charles
Alexander

March 12, 2016 at 4:15 pm

Hello Charles.

If a model has R-square 55% is it doing a good job?

Some people say it is good others say it is weak

i would appreciate it if you helped me
Reply
- Charles
  
  March 12, 2016 at 10:51 pm
  
  Alexander,
  There is no set answer to this question. It really depends on a umber of factors. I certainly wouldn’t say that 55% necessarily means that the model is a bad fit. See the following webpage for more details
  http://people.duke.edu/~rnau/rsquared.htm
  Charles
  Reply
Connor L

January 20, 2016 at 3:21 am

Hello!

I need to run regressions using this software for my BUS252 class. However, when I click “Linear Regression” and it prompts me to highlight the X and Y values, I’m unable to do so. Im working from a Macbook, but can’t seem to get it to work. Any help would be much appreciated!
Reply
- Charles
  
  January 20, 2016 at 11:01 am
  
  Hello Connor,
  For the Mac version of the software, you can’t highlight the range, but need to enter the ranges manually. E.g. for range D5:E24, you need to type in the value D5:E24 in the Input Range field. Good luck with your course.
  Charles
  Reply
hani

January 14, 2016 at 6:59 pm

hi Charles,

i have a result from one of the independent variables showing a highest standardized beta but the p-value are not significant.
so it is still accepted or rejected to explain which is the most influential variables?
how do i interpret this result?

help me.
tq!
Reply
- Charles
  
  January 16, 2016 at 3:45 pm
  
  Hi Hani,
  
  Standardizing the variables doesn’t necessarily lead to identifying the most important/influential variable in linear regression. For example, see the webpage
  http://www.jerrydallal.com/lhsp/importnt.htm.
  
  Depending on what you mean by most influential, you might want to compare partial correlation coefficients or compare the R-square values when each variable is dropped from the model.
  
  Charles
  Reply
Michael Broderick

November 25, 2015 at 7:46 pm

You have created a wonderful resource in this website and the Excel add-in. I’m wondering if you’re planning to add standardized coefficients to the linear regression routine. Is it already there and I missed it?

Keep up the great work!
Reply
- Charles
  
  November 25, 2015 at 8:17 pm
  
  Michael,
  I have not added standard coefficients yet, but it will eventually be added.
  Charles
  Reply
Gustavo

July 29, 2015 at 6:52 am

When running a regression in Excel what do you use to determine if the null hypothesis is accepted or rejected? What parts of the regression output are used in the regression equation?
Reply
- Charles
  
  July 30, 2015 at 1:20 pm
  
  Gustavo,
  
  Q. When running a regression in Excel what do you use to determine if the null hypothesis is accepted or rejected?
  A. Significance F
  
  Q. What parts of the regression output are used in the regression equation?
  A. The coefficients
  
  Charles
  Reply
Ryan

June 16, 2015 at 2:28 pm

Charles,

Can you help me understand the difference between errors and residuals?

Thanks
Reply
- Charles
  
  June 16, 2015 at 4:21 pm
  
  Ryan,
  In this context, error values and residuals are the same thing.
  Charles
  Reply
Daniel

March 13, 2015 at 4:59 am

Ive loaded Real Stats as an add in but I cant see it up in the data analysis icon area? How do I use it, where do I access it it in the excel page?
Dan
Reply
kalina

March 4, 2015 at 12:51 am

Thank you so much for this site. It is very helpful. I have to make a project for bachelor’s degree but I don’t know how to find some data so I can make the regression analysis. I want you to tell me what variable to take, to make the regression analysis?I have the theme of agriculture and environmental. Thank you

Regards
Kalina
Reply
- Charles
  
  March 4, 2015 at 8:31 am
  
  Kalina,
  You haven’t provided enough information for me to tell you what variables to use, but think about what hypothesis in the area of agriculture and environment you want to make. From this will naturally follow which variables you need to study.
  Charles
  Reply
paula

October 1, 2014 at 4:33 am

I’m so glad I found this site…..thanks so much Professor. You explain all the concepts so well!!!
Reply
William Agurto

September 4, 2014 at 5:03 pm

Dear Charles:

Thank you for the explanation. This information will be very useful for my future calculations.

Regards.

William Agurto.
Reply
William Agurto

August 14, 2014 at 3:39 pm

Dear Charles:

I found a little bug in the residuals and Cook’s D sections (when that options are selected in linear regression analysis):
1. There´s a section at the right of the Cook’s D table that calculates dfE (degrees of freedom of errors) substracting (k+1) instead of substracting k.
2. Because of that, there’s a difference between Data Analysis Excel Add-in results and Real Statistics 2.16.1 results in the calculation of Standard Residuals (SResidual in Cook’s D table).
3. Also, for that reason, there are some mistakes in the calculation of: ModMSE, RStudent, T-Test, Cook’s D, and DFFITS columns in Cook’s D table.

I hope these comments will be helpful.

Regards.

William agurto.
Reply
- William Agurto
  
  September 2, 2014 at 6:07 pm
  
  Dear Charles:
  
  The bug in the SResidual calculation is still unfixed in Real Statistics 2.17.
  I present an example for making the explanation simpler:
  
  1. Data for analysis:
  X Y
  9 0.98
  14 0.96
  25 0.94
  30 0.92
  59 0.9
  122 0.88
  128 0.86
  138 0.84
  144 0.82
  
  2. Results obtained in Excel 2010 (using Data Analysis Add-in) for RESIDUAL OUTPUT:
  
  Observation Predicted Y Residuals Standard Residuals
  1 0.959434128 0.020565872 1.305567744
  2 0.954885598 0.005114402 0.324673749
  3 0.944878831 -0.004878831 -0.309719156
  4 0.940330301 -0.020330301 -1.290613152
  5 0.913948826 -0.013948826 -0.885502786
  6 0.856637346 0.023362654 1.483113757
  7 0.851179109 0.008820891 0.559969944
  8 0.842082049 -0.002082049 -0.132173143
  9 0.836623813 -0.016623813 -1.055316957
  
  3. Results obtained in Real Statistics 2.17 for Cook’s D Table (only columns 1, and 4 to 7 are shown):
  
  Obs Pred Y Residual Leverage SResidual
  1 0.959434128 0.020565872 0.271566883 1.221246799
  2 0.954885598 0.005114402 0.247947105 0.303704483
  3 0.944878831 -0.004878831 0.202599637 -0.289715742
  4 0.940330301 -0.020330301 0.184994445 -1.207258058
  5 0.913948826 -0.013948826 0.11994921 -0.82831201
  6 0.856637346 0.023362654 0.1965224 1.387325886
  7 0.851179109 0.008820891 0.219377825 0.523803919
  8 0.842082049 -0.002082049 0.263484784 -0.123636654
  9 0.836623813 -0.016623813 0.293557711 -0.987158622
  
  4. As you see, the Standard residuals obtained by Data Analysis Add-in is different from those obtained in Real Statistics 2.17. As I commented in the prior message, this is because Real Statistics 2.17 calculates dfE (degrees of freedom of errors) substracting (k+1) instead of substracting k (in the example, 7 instead of 8). When I manually correct this difference (I write 8 instead of 7), all the standard residuals are OK.
  
  I hope this comment will be helpfull.
  
  Regards.
  
  William Agurto.
  Reply

Topics

73 thoughts on “Linear Regression”

Leave a Comment Cancel reply