The goal of linear regression analysis is to describe the relationship between two variables based on observed data and to predict the value of the dependent variable based on the value of the independent variable.

Topics:

- Method of Least Squares
- Regression Analysis
- Testing if the regression line is a good fit
- Testing the significance of the slope of the regression line
- Confidence and prediction intervals for regression
- Exponential Regression Model
- Power Regression Model
- Linear regression models for comparing means

Charles,

Thanks for all the interesting information you have available here.

I’m relatively new to regressions and I’m hoping you can give me your thoughts on the following:

I’m trying to roughly estimate/predict what the hourly energy prices ($/MWh), at a certain grid point, will be going forward, out 12 months.

The issue I’m having is that the price I’m trying to estimate, lets call it “Price A”, is relatively new, with only 6 months of hourly historical prices available. Knowing that this price is highly correlated to a different price (r = 0.98 and r-squared = 95%), let’s call it “Price B”, and that Price B does have available historical data going back multiple years, here’s what I’ve done: calculated in Excel, using the equation y=m*x+a (where y = price A and x = price B) and parameters calculated in Excel (“m” and “a”), what the prices would had been at point A, let’s say for the last 12 months.

Essentially, I would appreciate if you could tell me whether or not this is valid approach and also what would I should be doing next to estimate the prices for the next 12 months.

I hope this isn’t too confusing, please let me know otherwise.

Thank you in advance.

Marcos,

It seems like a reasonable approach. You should be aware of issues such as seasonality and autocorrelation.

Charles

Hi Charles

For linear regression, if R square is .209 (20%) and ANOVA p= 0.01, what does this imply? Isn’t R square value should be greater than 30% ?

Sandy,

p = .01 means that the slope is significantly different from zero. I don’t know of any rule that R square needs to be greater than 30%.

Charles

Thank you Charles

Hi Charles,

Thanks so much for helping on here first off. My question is that I am attempting to compare “OB GYN per capita population” for each state to the “percent of total deliveries that are via cesarean section” to see if there is a coorelation between the two. So basically less OB GYNs is perhaps associated with higher rates of cesarean section for instance. Would this be a simple linear regression? Would it be proper to enter data for all 50 states to arrive at conclusion? Thanks again

Hi Will,

Yes, this could be a way to test this hypothesis. You don’t know a priori whether the relationship is linear (or quadratic or something else), but it makes sense to start with simple linear regression.

Charles

Thanks. After obtaining result is there some other step to progress to based on result? Statistics is not my strong point

Will,

The next step depends on why you decided to use regression in the first place. E.g., you can use the regression for forecasting.

Charles

I’m merely trying to determine if the two variables are correlated in some way to some statistical significance p=.05. Does this make sense or no?

Will,

To test whether there is a significant correlation between two variables, please look at the following webpage

Correlation Testing

There are a number of different tests that can be used, as explained on the website.

Charles

Hello Charles.

If a model has R-square 55% is it doing a good job?

Some people say it is good others say it is weak

i would appreciate it if you helped me

Alexander,

There is no set answer to this question. It really depends on a umber of factors. I certainly wouldn’t say that 55% necessarily means that the model is a bad fit. See the following webpage for more details

http://people.duke.edu/~rnau/rsquared.htm

Charles

Hello!

I need to run regressions using this software for my BUS252 class. However, when I click “Linear Regression” and it prompts me to highlight the X and Y values, I’m unable to do so. Im working from a Macbook, but can’t seem to get it to work. Any help would be much appreciated!

Hello Connor,

For the Mac version of the software, you can’t highlight the range, but need to enter the ranges manually. E.g. for range D5:E24, you need to type in the value D5:E24 in the Input Range field. Good luck with your course.

Charles

hi Charles,

i have a result from one of the independent variables showing a highest standardized beta but the p-value are not significant.

so it is still accepted or rejected to explain which is the most influential variables?

how do i interpret this result?

help me.

tq!

Hi Hani,

Standardizing the variables doesn’t necessarily lead to identifying the most important/influential variable in linear regression. For example, see the webpage

http://www.jerrydallal.com/lhsp/importnt.htm.

Depending on what you mean by most influential, you might want to compare partial correlation coefficients or compare the R-square values when each variable is dropped from the model.

Charles

You have created a wonderful resource in this website and the Excel add-in. I’m wondering if you’re planning to add standardized coefficients to the linear regression routine. Is it already there and I missed it?

Keep up the great work!

Michael,

I have not added standard coefficients yet, but it will eventually be added.

Charles

When running a regression in Excel what do you use to determine if the null hypothesis is accepted or rejected? What parts of the regression output are used in the regression equation?

Gustavo,

Q. When running a regression in Excel what do you use to determine if the null hypothesis is accepted or rejected?

A. Significance F

Q. What parts of the regression output are used in the regression equation?

A. The coefficients

Charles

Charles,

Can you help me understand the difference between errors and residuals?

Thanks

Ryan,

In this context, error values and residuals are the same thing.

Charles

Ive loaded Real Stats as an add in but I cant see it up in the data analysis icon area? How do I use it, where do I access it it in the excel page?

Dan

Daniel,

You can use the Real Statistics functions just like any other Excel function. You can access the Real Statistics data analysis tools by pressing Ctrl-m. More detailed descriptions of this can be found on the following webpages:

http://www.real-statistics.com/free-download/real-statistics-resource-pack/

http://www.real-statistics.com/excel-capabilities/supplemental-data-analysis-tools/accessing-supplemental-data-analysis-tools/

Charles

Thank you so much for this site. It is very helpful. I have to make a project for bachelor’s degree but I don’t know how to find some data so I can make the regression analysis. I want you to tell me what variable to take, to make the regression analysis?I have the theme of agriculture and environmental. Thank you

Regards

Kalina

Kalina,

You haven’t provided enough information for me to tell you what variables to use, but think about what hypothesis in the area of agriculture and environment you want to make. From this will naturally follow which variables you need to study.

Charles

I’m so glad I found this site…..thanks so much Professor. You explain all the concepts so well!!!

Dear Charles:

Thank you for the explanation. This information will be very useful for my future calculations.

Regards.

William Agurto.

Dear Charles:

I found a little bug in the residuals and Cook’s D sections (when that options are selected in linear regression analysis):

1. There´s a section at the right of the Cook’s D table that calculates dfE (degrees of freedom of errors) substracting (k+1) instead of substracting k.

2. Because of that, there’s a difference between Data Analysis Excel Add-in results and Real Statistics 2.16.1 results in the calculation of Standard Residuals (SResidual in Cook’s D table).

3. Also, for that reason, there are some mistakes in the calculation of: ModMSE, RStudent, T-Test, Cook’s D, and DFFITS columns in Cook’s D table.

I hope these comments will be helpful.

Regards.

William agurto.

Dear Charles:

The bug in the SResidual calculation is still unfixed in Real Statistics 2.17.

I present an example for making the explanation simpler:

1. Data for analysis:

X Y

9 0.98

14 0.96

25 0.94

30 0.92

59 0.9

122 0.88

128 0.86

138 0.84

144 0.82

2. Results obtained in Excel 2010 (using Data Analysis Add-in) for RESIDUAL OUTPUT:

Observation Predicted Y Residuals Standard Residuals

1 0.959434128 0.020565872 1.305567744

2 0.954885598 0.005114402 0.324673749

3 0.944878831 -0.004878831 -0.309719156

4 0.940330301 -0.020330301 -1.290613152

5 0.913948826 -0.013948826 -0.885502786

6 0.856637346 0.023362654 1.483113757

7 0.851179109 0.008820891 0.559969944

8 0.842082049 -0.002082049 -0.132173143

9 0.836623813 -0.016623813 -1.055316957

3. Results obtained in Real Statistics 2.17 for Cook’s D Table (only columns 1, and 4 to 7 are shown):

Obs Pred Y Residual Leverage SResidual

1 0.959434128 0.020565872 0.271566883 1.221246799

2 0.954885598 0.005114402 0.247947105 0.303704483

3 0.944878831 -0.004878831 0.202599637 -0.289715742

4 0.940330301 -0.020330301 0.184994445 -1.207258058

5 0.913948826 -0.013948826 0.11994921 -0.82831201

6 0.856637346 0.023362654 0.1965224 1.387325886

7 0.851179109 0.008820891 0.219377825 0.523803919

8 0.842082049 -0.002082049 0.263484784 -0.123636654

9 0.836623813 -0.016623813 0.293557711 -0.987158622

4. As you see, the Standard residuals obtained by Data Analysis Add-in is different from those obtained in Real Statistics 2.17. As I commented in the prior message, this is because Real Statistics 2.17 calculates dfE (degrees of freedom of errors) substracting (k+1) instead of substracting k (in the example, 7 instead of 8). When I manually correct this difference (I write 8 instead of 7), all the standard residuals are OK.

I hope this comment will be helpfull.

Regards.

William Agurto.

William,

I have looked at the issue that you have raised and indeed there is a difference between the results for the standard residuals calculated by Excel and the ones in the Cook’s D calculated by the Real Statistics Resource Pack. Excel calculates the standard residuals by dividing each residual by the standard deviation of all 9 residuals. This is not the usually accepted approach. Instead the usually accepted approach is to divide each residual by the standard error, which is the square root of MSE, as is done by the Real Statistics software. The relationship between the two approaches is as you describe. I will add a description of the discrepancy to the website, but would refer to use the commonly accepted approach rather than the one employed by Excel.

Charles

Excellent work Professor Charles. The explanations you have provided for here has helped me learn not only Excel better but also allowed me to implement models in R. I would appreciate if you can please provide a downloadable .pdf version of all the topics here.

Abhishek,

I plan to publish an ebook covering the various topics shortly. Stay tuned.

Charles