# Linear Regression

The goal of linear regression analysis is to describe the relationship between two variables based on observed data and to predict the value of the dependent variable based on the value of the independent variable.

### 41 Responses to Linear Regression

1. Abhishek Saurabh says:

Excellent work, Professor Charles. The explanations you have provided here have helped me not only learn Excel better but also implement models in R. I would appreciate it if you could provide a downloadable .pdf version of all the topics here.

• Charles says:

Abhishek,
I plan to publish an ebook covering the various topics shortly. Stay tuned.
Charles

2. William Agurto says:

Dear Charles:

I found a little bug in the residuals and Cook's D sections (when those options are selected in linear regression analysis):
1. There's a section to the right of the Cook's D table that calculates dfE (degrees of freedom of the errors) by subtracting (k+1) instead of subtracting k.
2. Because of that, there's a difference between the Data Analysis Excel Add-in results and the Real Statistics 2.16.1 results in the calculation of the Standard Residuals (SResidual in the Cook's D table).
3. Also, for that reason, there are some mistakes in the calculation of the ModMSE, RStudent, T-Test, Cook's D, and DFFITS columns in the Cook's D table.

Regards.

William Agurto.

• William Agurto says:

Dear Charles:

The bug in the SResidual calculation is still unfixed in Real Statistics 2.17.
I present an example to make the explanation simpler:

1. Data for analysis:
| X   | Y    |
|-----|------|
| 9   | 0.98 |
| 14  | 0.96 |
| 25  | 0.94 |
| 30  | 0.92 |
| 59  | 0.9  |
| 122 | 0.88 |
| 128 | 0.86 |
| 138 | 0.84 |
| 144 | 0.82 |

2. Results obtained in Excel 2010 (using Data Analysis Add-in) for RESIDUAL OUTPUT:

| Observation | Predicted Y | Residuals    | Standard Residuals |
|-------------|-------------|--------------|--------------------|
| 1           | 0.959434128 | 0.020565872  | 1.305567744        |
| 2           | 0.954885598 | 0.005114402  | 0.324673749        |
| 3           | 0.944878831 | -0.004878831 | -0.309719156       |
| 4           | 0.940330301 | -0.020330301 | -1.290613152       |
| 5           | 0.913948826 | -0.013948826 | -0.885502786       |
| 6           | 0.856637346 | 0.023362654  | 1.483113757        |
| 7           | 0.851179109 | 0.008820891  | 0.559969944        |
| 8           | 0.842082049 | -0.002082049 | -0.132173143       |
| 9           | 0.836623813 | -0.016623813 | -1.055316957       |

3. Results obtained in Real Statistics 2.17 for Cook’s D Table (only columns 1, and 4 to 7 are shown):

| Obs | Pred Y      | Residual     | Leverage    | SResidual    |
|-----|-------------|--------------|-------------|--------------|
| 1   | 0.959434128 | 0.020565872  | 0.271566883 | 1.221246799  |
| 2   | 0.954885598 | 0.005114402  | 0.247947105 | 0.303704483  |
| 3   | 0.944878831 | -0.004878831 | 0.202599637 | -0.289715742 |
| 4   | 0.940330301 | -0.020330301 | 0.184994445 | -1.207258058 |
| 5   | 0.913948826 | -0.013948826 | 0.11994921  | -0.82831201  |
| 6   | 0.856637346 | 0.023362654  | 0.1965224   | 1.387325886  |
| 7   | 0.851179109 | 0.008820891  | 0.219377825 | 0.523803919  |
| 8   | 0.842082049 | -0.002082049 | 0.263484784 | -0.123636654 |
| 9   | 0.836623813 | -0.016623813 | 0.293557711 | -0.987158622 |

4. As you can see, the standard residuals obtained by the Data Analysis Add-in are different from those obtained in Real Statistics 2.17. As I commented in the prior message, this is because Real Statistics 2.17 calculates dfE (degrees of freedom of the errors) by subtracting (k+1) instead of subtracting k (in the example, 7 instead of 8). When I manually correct this difference (writing 8 instead of 7), all the standard residuals are OK.

I hope this comment will be helpful.

Regards.

William Agurto.

• Charles says:

William,
I have looked at the issue that you have raised and indeed there is a difference between the results for the standard residuals calculated by Excel and the ones in the Cook’s D calculated by the Real Statistics Resource Pack. Excel calculates the standard residuals by dividing each residual by the standard deviation of all 9 residuals. This is not the usually accepted approach. Instead the usually accepted approach is to divide each residual by the standard error, which is the square root of MSE, as is done by the Real Statistics software. The relationship between the two approaches is as you describe. I will add a description of the discrepancy to the website, but would refer to use the commonly accepted approach rather than the one employed by Excel.
Charles
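[Editor's note: the discrepancy discussed above can be checked numerically. The sketch below, using NumPy and William's data from the comment above, computes the standard residuals both ways.]

```python
import numpy as np

# Data from William's example above
x = np.array([9, 14, 25, 30, 59, 122, 128, 138, 144], dtype=float)
y = np.array([0.98, 0.96, 0.94, 0.92, 0.9, 0.88, 0.86, 0.84, 0.82])

# Fit the simple linear regression y-hat = b0 + b1*x
b1, b0 = np.polyfit(x, y, 1)
resid = y - (b0 + b1 * x)

n, k = len(y), 1            # 9 observations, 1 independent variable
sse = np.sum(resid ** 2)    # error sum of squares

# Excel's Data Analysis add-in: divide by the standard deviation of the
# residuals, i.e. sqrt(SSE/(n-1)), since the residuals have mean zero
std_resid_excel = resid / np.sqrt(sse / (n - 1))

# Commonly accepted approach (used by Real Statistics): divide by the
# standard error of the estimate, sqrt(MSE), where MSE = SSE/(n-k-1)
std_resid_mse = resid / np.sqrt(sse / (n - k - 1))

print(std_resid_excel[0])  # ≈ 1.3056, matching Excel's table above
print(std_resid_mse[0])    # ≈ 1.2212, matching the Cook's D table above
```

The two versions differ only by the constant factor sqrt((n-1)/(n-k-1)), which is why correcting the degrees of freedom reconciles the two tables.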

3. William Agurto says:

Dear Charles:

Thank you for the explanation. This information will be very useful for my future calculations.

Regards.

William Agurto.

4. paula says:

I'm so glad I found this site. Thanks so much, Professor. You explain all the concepts so well!

5. kalina says:

Thank you so much for this site. It is very helpful. I have to complete a project for my bachelor's degree, but I don't know how to find data for the regression analysis. Can you tell me which variables to use for the regression analysis? My topic is agriculture and the environment. Thank you

Regards
Kalina

• Charles says:

Kalina,
You haven't provided enough information for me to tell you which variables to use, but think about what hypothesis in the area of agriculture and the environment you want to test. The variables you need to study will follow naturally from this.
Charles

6. Daniel says:

I've loaded Real Stats as an add-in, but I can't see it in the data analysis icon area. How do I use it, and where do I access it on the Excel page?
Dan

• Charles says:

Daniel,

You can use the Real Statistics functions just like any other Excel function. You can access the Real Statistics data analysis tools by pressing Ctrl-m. More detailed descriptions of this can be found on the following webpages:

Charles

7. Ryan says:

Charles,

Can you help me understand the difference between errors and residuals?

Thanks

• Charles says:

Ryan,
In this context, error values and residuals are the same thing.
Charles

8. Gustavo says:

When running a regression in Excel what do you use to determine if the null hypothesis is accepted or rejected? What parts of the regression output are used in the regression equation?

• Charles says:

Gustavo,

Q. When running a regression in Excel what do you use to determine if the null hypothesis is accepted or rejected?
A. Significance F

Q. What parts of the regression output are used in the regression equation?
A. The coefficients

Charles
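[Editor's note: for readers who want to see where these two pieces of regression output come from, here is a minimal sketch that computes the Significance F and the coefficients directly. The data are made up for illustration.]

```python
import numpy as np
from scipy import stats

# Illustrative (made-up) data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 2.9, 3.8, 5.2, 5.9, 7.1, 7.8, 9.2])

n, k = len(y), 1
b1, b0 = np.polyfit(x, y, 1)           # the coefficients: y-hat = b0 + b1*x
pred = b0 + b1 * x

ssr = np.sum((pred - y.mean()) ** 2)   # regression sum of squares
sse = np.sum((y - pred) ** 2)          # error sum of squares

# Significance F = p-value of the overall F test (H0: slope = 0)
F = (ssr / k) / (sse / (n - k - 1))
sig_f = stats.f.sf(F, k, n - k - 1)

print(b0, b1)   # the coefficients used in the regression equation
print(sig_f)    # reject H0 when Significance F < alpha (e.g. 0.05)
```

For simple linear regression, this Significance F equals the two-sided p-value for the slope coefficient, since F = t².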

9. Michael Broderick says:

You have created a wonderful resource in this website and the Excel add-in. I’m wondering if you’re planning to add standardized coefficients to the linear regression routine. Is it already there and I missed it?

Keep up the great work!

• Charles says:

Michael,
I have not added standardized coefficients yet, but they will eventually be added.
Charles

10. hani says:

hi Charles,

One of my independent variables has the highest standardized beta, but its p-value is not significant.
Is it still accepted or rejected as the most influential variable?
How do I interpret this result?

Please help me.
Thank you!

• Charles says:

Hi Hani,

Standardizing the variables doesn’t necessarily lead to identifying the most important/influential variable in linear regression. For example, see the webpage
http://www.jerrydallal.com/lhsp/importnt.htm.

Depending on what you mean by most influential, you might want to compare partial correlation coefficients or compare the R-square values when each variable is dropped from the model.

Charles
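[Editor's note: the second suggestion above can be sketched numerically. In the following example the data are made up, with x1 deliberately more influential than x2, and the drop in R-square when each variable is removed from the model is compared.]

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Made-up data: y depends strongly on x1, weakly on x2
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

def r_squared(X, y):
    """R^2 of an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

r2_full = r_squared(np.column_stack([x1, x2]), y)
drop_x1 = r2_full - r_squared(x2.reshape(-1, 1), y)   # R^2 lost without x1
drop_x2 = r2_full - r_squared(x1.reshape(-1, 1), y)   # R^2 lost without x2
print(drop_x1, drop_x2)  # dropping x1 costs far more R^2 than dropping x2
```

The variable whose removal costs the most R-square is, in this sense, the most influential one in the model.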

11. Connor L says:

Hello!

I need to run regressions using this software for my BUS252 class. However, when I click "Linear Regression" and it prompts me to highlight the X and Y values, I'm unable to do so. I'm working on a MacBook but can't seem to get it to work. Any help would be much appreciated!

• Charles says:

Hello Connor,
For the Mac version of the software, you can't highlight the range; you need to enter the ranges manually. E.g., for the range D5:E24, you would type D5:E24 into the Input Range field. Good luck with your course.
Charles

12. Alexander says:

Hello Charles.

If a model has an R-square of 55%, is it doing a good job?

Some people say it is good; others say it is weak.

I would appreciate it if you could help me.

• Charles says:

Alexander,
There is no set answer to this question. It really depends on a number of factors. I certainly wouldn't say that 55% necessarily means that the model is a bad fit. See the following webpage for more details:
http://people.duke.edu/~rnau/rsquared.htm
Charles

13. Will says:

Hi Charles,
Thanks so much for your help here, first off. My question is that I am attempting to compare "OB GYN per capita population" for each state to the "percent of total deliveries that are via cesarean section" to see if there is a correlation between the two. So, basically, fewer OB GYNs is perhaps associated with higher rates of cesarean section, for instance. Would this be a simple linear regression? Would it be proper to enter data for all 50 states to arrive at a conclusion? Thanks again

• Charles says:

Hi Will,
Yes, this could be a way to test this hypothesis. You don’t know a priori whether the relationship is linear (or quadratic or something else), but it makes sense to start with simple linear regression.
Charles

• Will says:

Thanks. After obtaining the result, is there some other step to progress to based on the result? Statistics is not my strong point.

• Charles says:

Will,
The next step depends on why you decided to use regression in the first place. E.g., you can use the regression for forecasting.
Charles

• Will says:

I'm merely trying to determine whether the two variables are correlated in some way at statistical significance p = .05. Does this make sense?

• Charles says:

Will,
To test whether there is a significant correlation between two variables, please look at the following webpage
Correlation Testing
There are a number of different tests that can be used, as explained on the website.
Charles

14. Sandy says:

Hi Charles

For linear regression, if R square is .209 (about 20%) and the ANOVA p = 0.01, what does this imply? Shouldn't the R square value be greater than 30%?

• Charles says:

Sandy,
p = .01 means that the slope is significantly different from zero. I don’t know of any rule that R square needs to be greater than 30%.
Charles

• Sandy says:

Thank you Charles

15. Marcos says:

Charles,
Thanks for all the interesting information you have available here.
I’m relatively new to regressions and I’m hoping you can give me your thoughts on the following:
I'm trying to roughly estimate/predict what the hourly energy prices (\$/MWh) at a certain grid point will be going forward, out 12 months.
The issue I'm having is that the price I'm trying to estimate, let's call it "Price A", is relatively new, with only 6 months of hourly historical prices available. I know that this price is highly correlated with a different price (r = 0.98 and r-squared = 95%), let's call it "Price B", and that Price B does have historical data going back multiple years. So here's what I've done: using the equation y = m*x + a (where y = Price A and x = Price B), with the parameters "m" and "a" calculated in Excel, I estimated what the prices would have been at point A, let's say for the last 12 months.
Essentially, I would appreciate it if you could tell me whether or not this is a valid approach, and also what I should be doing next to estimate the prices for the next 12 months.
I hope this isn't too confusing; please let me know otherwise.

• Charles says:

Marcos,
It seems like a reasonable approach. You should be aware of issues such as seasonality and autocorrelation.
Charles

16. mary ann r. aquino says:

Hello, the computed r is 0.201566193, which means that there is a weak correlation between the variables, while the p-value is 6.63411E-10, which tells me that I have to reject the null hypothesis. Is this result really possible?

• Charles says:

Mary Ann,
This is certainly possible. If you are using simple linear regression, then the p-value being so low only means that there is a significant difference between the population correlation and zero. It doesn’t mean that the population value of r is high; it just means that it is not likely to be zero.
Charles
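[Editor's note: to see why this happens, recall that the test statistic for H0: rho = 0 is t = r·sqrt(n-2)/sqrt(1-r²), so a large sample can make even a weak r highly significant. A quick sketch, where the sample size n is assumed for illustration since it is not given in the comment:]

```python
import numpy as np
from scipy import stats

r, n = 0.2016, 900                  # r from the comment above; n is assumed
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(abs(t), n - 2)   # two-sided p-value for H0: rho = 0
print(p)  # far below .05: significant, even though r itself is weak
```

With a sample of this size, the test can detect that the population correlation is nonzero without that correlation being large.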

17. Diego says:

Good afternoon,
I have the following question:
I am working with continuous variables.
The correlation coefficient is -0.496630749,
which means that the linear relationship between the variables is negative and very weak; however, when x increases, y decreases.
How do I obtain the p-value (probability)?
Thank you very much!!