The 95% confidence interval for the forecasted values ŷ of *x* is

This means that there is a 95% probability that the true linear regression line of the population will lie within the confidence interval of the regression line calculated from the sample data.

**Figure 1 – Confidence vs. prediction intervals**

In the graph on the left of Figure 1, a linear regression line is calculated to fit the sample data points. The confidence interval consists of the space between the two curves (dotted lines). Thus there is a 95% probability that the true best-fit line for the population lies within the confidence interval (e.g. any of the lines in the figure on the right above).

There is also a concept called **prediction interval**. Here we look at any specific value of *x*, *x*_{0}, and find an interval around the predicted value ŷ_{0} for *x*_{0} such that there is a 95% probability that the real value of y (in the population) corresponding to *x*_{0} is within this interval (see the graph on the right side of Figure 1).

The 95% prediction interval of the forecasted value ŷ_{0} for *x*_{0} is

where the **standard error of the prediction** is

For any specific value *x*_{0} the prediction interval is more meaningful than the confidence interval.

**Example 1**: Find the 95% confidence and prediction intervals for the forecasted life expectancy for men who smoke 20 cigarettes in Example 1 of Method of Least Squares.

**Figure 2 – Confidence and prediction intervals**

Referring to Figure 2, we see that the forecasted value for 20 cigarettes is given by FORECAST(20,B4:B18,A4:A18) = 73.16. The confidence interval, calculated using the standard error 2.06 (found in cell E12), is (68.70, 77.61).

The prediction interval is calculated in a similar way using the prediction standard error of 8.24 (found in cell J12). Thus life expectancy of men who smoke 20 cigarettes is in the interval (55.36, 90.95) with 95% probability.

**Example 2**: Test whether the y-intercept is 0.

We use the same approach as that used in Example 1 to find the confidence interval of ŷ when *x* = 0 (this is the y-intercept). The result is given in column M of Figure 2. Here the standard error is

and so the confidence interval is

Since 0 is not in this interval, the null hypothesis that the y-intercept is zero is rejected.

Thank you, as with all the other commenters this operation procedure helped me very much.

In your example we assume data to be t-distributed. Can I follow the same steps only replacing T.INV.2T(0.05;df) with NORM.INV(0.05;meanX;stdevX), if I assume my data is normally distributed? Also if this is possible, do I have to calculate NORM.INV(0.025;meanX;stdevX) for two tailed data? I did not find a description of this function to be sure on this point.

Thank you and best regards, Christian

Christian,

Is there any reason why you want to replace the t distribution by the normal distribution? You would usually use the t distribution when the population mean and standard deviation are unknown (and so are estimated from the sample).

Charles

Hello Charles,

there are a few reasons why I wanted to check and see if choosing either distribution would produce different results. First, my data-set is a population and log-transformed normally distributed so I can calculate the geometric- population mean (or mean of log data) and geometric standard deviation. Second, the population size is around 2000 and I wanted to see if with so many degrees of freedom, choosing normal of t-distribution still makes a significant difference. In the end I am simply not sure which would be more correct however when I compared the results of T.INV(0.05,df) and NORM.INV(0.025, mean,stdev) for my data the values are quite similar, same with confidence levels 0.1 vs. 0.05. The only difference are that the second calculations gives me negative numbers.

T.INV(0.05,df)=1.9613

T.INV(0.1,df)=1.646

NORM.INV(0.025, mean,stdev)=-1.916

NORM.INV(0.05, mean,stdev)=-1.604

Christian,

For df high enough the values should be almost the same. Also the correct comparison is

T.INV(0.05,df)=1.9613

T.INV(0.1,df)=1.646

NORM.INV(1-0.025, mean,stdev)=1.916

NORM.INV(1-0.05, mean,stdev)=1.604

Charles

Perfect, that works for me. Thanks again, much appreciated.

CAn you extend your Fig 1 to a pooled within-group regression slope as obtained from ANOVA

Roger,

Sorry, but what do you consider to be the pooled within-group regression slope as obtained from ANOVA?

Charles

Can you extend this to a pooled within-group slope as obtained from ANCOVA

Roger,

Sorry, but what do you consider to be the pooled within-group regression slope as obtained from ANCOVA?

Charles

Hi Charles

I am not a statistician – please bear this in mind.

When you have two variables (x,y) and several groups, one can use an analysis of covariance to test the difference between groups for y when regressed against x. This does not use the regression line with the data pooled irrespective of group, but the pooled within group regression line. For example Statistical Methods, Snedechor and Cochran Chapter 14 fig 14:6:1 (6th ed) . I want the prediction intervals around this pooled-within-group regression line. I hope this is clear.

Roger

Sorry Roger, but I don’t have access to Statistical Methods, Snedechor and Cochran Chapter 14 fig 14:6:1 (6th ed).

Charles

Thank you so much! This example was extremely helpful in finding prediction intervals in Excel and exactly what I was looking for!

Hi Charles

Thanks so much for this, I hadn’t found a description of prediction intervals in various textbooks. The example is really clear.

Are the confidence and prediction intervals still valid outside of your range of known x’s? eg a colleague has produced a linear regression model for some data of payments against time to predict future payments. They have created confidence intervals for the predicted future payments.

I was going to suggest an exponential smoothing method to predict the next values with confidence intervals but they have already used their regression model and wanted to know whether it was suitable.

Will,

As to whether or not prediction are valid outside the data range, the answer (as is often the case) is that it depends. The following webpage explores this issue: http://stats.stackexchange.com/questions/86258/using-regression-equation-to-estimate-values-outside-of-the-range-of-data

I don’t have enough information to comment on whether exponential smoothing can be used. I have just added a description of exponential smoothing to the website. See the webpage Exponential Smoothing

Charles

Thanks for such a fast response, Charles,

The data represents payments over time. They decay exponentially and are expected to reach zero eventually. A log transformation was performed before producing the linear regression model. The data is remarkably close to being a straight line after the log transformation. R squared is close to 1. Time is deemed to be a highly significant explanatory variable.

I believe I could use exponential smoothing with a multiplicative trend and no seasonal component. If I wanted to predict payments at a certain point in future, do you think we would be better to use regression with prediction intervals or exponential smoothing? (I have 32 monthly observations but I’m particularly interested in the predicted value in 16 months time).

Thanks again!

Will,

I can’t tell which is better a priori. If there are theoretical reasons for using one over the other, then that should guide your choice. Otherwise I would try each and see which looks like a better fit. Of course, the limitations with this approach to the decision are (1) are you able to tell which provides a better fit and (2) does the choice overfit the data, i.e. it matches the known data better, but it is less of a fit for future values — this is why it is often better to base the decision on your domain knowledge (what I called theoretical considerations previously.

Charles

Charles,

I am trying to determine a confidence interval around a natural exponential function (y = ae^bx or ln(y) = ln(a)+bx) using Excel. I have reviewed this website but am unsure about the confidence interval calculation.

http://www.tushar-mehta.com/publish_train/data_analysis/16.htm

Thank you!

Steve,

Since you are treating ln(y) = ln(a) + bx as a linear regression z = bx + c where z = ln(y) and c = ln(a), one approach to creating a confidence interval is to use the confidence interval for z = bx + c, as described on the webpage

http://www.real-statistics.com/regression/confidence-and-prediction-intervals/

This will give you a confidence interval for z of form [h, k] you then need to convert this into a confidence interval for y. The simplest approach for this is to use the confidence interval [e^h, e^k].

Charles

Everything seems to follow in my software (i wanted to transfer confidence intervals into a bespoke dashboard from my software) for a single X variable

If i have two x variables how do i calculate the SE of the overall confidence interval?

I.e what does the Equation in cell G12 look like?

Alex,

For multiple regression you use a slightly different approach. See the following webpage for details:

http://www.real-statistics.com/multiple-regression/confidence-and-prediction-intervals/

Charles

Thank you, ive been after this for ages! My statistics software doesnt make it clear what is going on in its calculations for the confidence intervals, but seeing this, i can reverse engineer it AND understand it.

I wasnt even aware Excel had these capabilities, i had twigged at MMULT.

Very pleased to find a site that isnt filled with mathematics requiring a mathematics degree to comprehend!

Hello charles,

Thank you for the explanation. I have a question though:

what if I have x and I would like to predict y instead but with a confidence interval? is it the same principle and the same formula?

Hello Sma,

In this case you use the prediction interval.

Charles

Hello Charles,

Could you tell me what the first part in cell G12 is (it is kind of blurry on my monitor), for the calculation of s.e. 2.06?

Also, is there a shortcut (function) in Excel for the s.e. calculation?

Thank you,

John,

Cell G12 contains the following formula: s_Res * SQRT(1/n + x_0 – k) – 2/SS_x)

There is no Excel formula for calculating this. The Real Statistics form REGPRED will automatically calculate the s.e. in the prediction interval case, but not the confidence interval case. I will add a similar formula for the confidence interval shortly. See the following webpage for more details:

Prediction and Confidence Intervals

Charles

Thank you, Dr. Zaiontz. Pardon my ignorance, but could you tell me what s_Res and how I obtain it?

John

Looks like its the standard error of y and x, the 7.97 in cell E10. Is this correct? The underscore in sRes threw me off a little bit.

Yes. The underscore was supposed to indicate a subscript.

Charles

Hi Charles,

If I understand you correctly, you describe the confidence interval as the range of possible values for model parameters, e.g. the range of values slope and intercept may be for a linear regression for a given set of data and specified confidence interval. You describe prediction interval as the interval around a predicted Y for a specific X0.

However, in your example you calculate confidence interval for a specific X0 and do the same for a prediction interval. I am confused as the example does not appear to match the discussion.

Obviously I don’t understand you correctly!

On a related matter, when one does a linear regression with Excel, Excel reports the Lower and Upper confidence intervals for “intercept” and “X Variable”, i.e. values for the slope and intercept. How does this Excel output relate to your discussion above?

Thank you

Gary,

I will respond to your first question shortly.

Regarding your second question: the confidence intervals for the intercept and x variable are really for the intercept and x coefficient (not for the prediction or confidence interval of data elements).

Charles

Gary,

Here is my response to your first question:

The confidence interval focuses on the population mean. If you create many random samples that are normally distributed and for each sample you calculate a confidence interval for the mean, then about 95% of those intervals will contain the true value of the population mean.

The prediction interval focuses on the true y value for any set of x values. If you create many random samples that are normally distributed and for each sample you calculate a prediction interval for the y value corresponding to some set of x values, then about 95% of those intervals will contain the true y value.

Charles

Hi, Charles. Can you make a video on plotting a 95% confidence interval. Above instruction was really confusing. Will be grateful.

Ang,

Thanks for your input. I will try to improve the explanation, but first I need to finish up the work I am doing on time series analysis.

Charles

Hi Charles,

I need like to plot the 95% Confidence Interval curves just like they are shown within Figure 1 (e.g. the dotted lines). How would you recommend doing this in Excel? I was informed that other programs may provide this feature, but I prefer to continue working in Excel if at all possible.

Thank you in advance,

Andy

Yes, you can do this in Excel. E.g. to create the dotted line for the upper confidence interval curve, fill in a range of x and y values (say A1:B100) as follows: (1) in column A insert x values of the appropriate scale. If for example you are looking at values between 0 and 10, insert 0 in cell A1 and the formula =A1+.1 in cell A2 and then highlight the range A2:A100 and press Ctrl-D, in (2) in column B insert the y values for the upper confidence interval. E.g. in cell B1 insert the Excel formula for the upper confidence interval value corresponding to the x value in cell A1 (this is as described in cell E15 of Figure 2 of the referenced webpage), then highlight range B1:B100 and press Ctrl-D. (Make sure that you use absolute addressing for all the parts of the formula in B1 that don’t depend on the x value in cell A1.) (3) Finally, highlight range A1:B100 and select Insert > Charts|Scatter.

Charles

Hi Charles,

I wanted to point out a common misunderstanding about confidence intervals: CI’s say nothing about the probability that the true value of the population parameter lies within them – it either does or it doesn’t. A 95% CI just tells you that, if you were to repeat your experiment (sampling) an infinite number of times and run the statistics on each sample, the true parameter will lie within 95% of those confidence intervals. What you described as a CI in the first section of your post is actually Bayesian credible interval, which is a bit more complicated to calculate, but it does tell you the probability that your population parameter lies within the interval.

Cheers,

Phil

Phil,

Thanks for correcting my somewhat appealing, but, in the end, inaccurate statement. Shortly I will update the website with a more accurate characterization of the confidence interval.

I appreciate your help in making the site more accurate.

Charles

Hello! Thank you very much this incredibly helpful guide! Do you have a tutorial on how to find the C.I and P.I using multiple regression?

Mari,

Not yet. I plan to include this shortly. I am working on adding information about Survival Analysis now. After that the next thing I will do will include CI/PI for multiple regression.

Charles

Hi Charles. Was wondering if you’ve added the section covering CIs/PIs for Multi Linear Regression yet.

Really love the site and it has helped out tremendously. Keep up the good work!

Hi Thomas,

Thanks for your continued support.

The section covering CIs/PIs for Multi Linear Regression is located on the following webpage:

http://www.real-statistics.com/multiple-regression/confidence-and-prediction-intervals/

Charles

Hi Charles,

I notice that you use n-2 for degrees of freedom, whilst other publications, for example UKAS M3003, use n-1. Although a small difference is seen in the calculation of the degrees of freedom, it can greatly affect the interval magnitude. Could you give some advise as to which calculation should be used for the df value? Thank you in advance for your reply.

By the way, your website has been an extremely useful aid, Thanks

Paul,

I am not familiar with UKAS M3003, but I just checked a few other publications and they all show df = n-2 when calculating the confidence interval for regression.

Charles

Charles,

What’s the rationale for the added 1 under the square root of the standard error of the prediction? Theoretically, why 1?

Thank you

Ryan,

The standard error for a prediction interval adds the variability of the points around the predicted mean. mathematically this is where the 1 comes from.

I plan to eventually show the proofs of the formulas for the confidence and prediction intervals. You will then be able to see from the proofs more precisely where the 1 comes from.

Charles

Hi Charles,

This post has helped me so much already, really very insightful and easy to follow!

I am trying to do a prediction interval for some metal fatigue test data but I am trying to find 90% confidence in the 98% reliability data. In order to do this is just using 0.1 as the probability in TINV sufficient for the method to change it to the 90% probability or do I need to make more changes please?

Thanks for your help!

Tristan

Tristan,

TINV(.1,df) (i.e. alpha = .1) is part of the formula for a 90% prediction interval. But I don’t understand what the 98% reliability data means.

Charles

Hi Charles,

I am using a exponential/hyperbolic function to fit my data. The model is (a*X+b)-SQRT[(a*X+b)^2-((4*a*X*b*c)/(2*c))].

I wonder if I can calculate prediction intervals in the way you show, or if there is any parameter that is different for this type of models. If so, can you tell me how I should calculate it?

Thank you very much

Elisa

Elisa,

I can’t think of a way of doing this. Perhaps someone else has suggestion.

Charles

Hi Charles,

Is Syx = Sres = STEYX(Y,X)? Is it same as Syx = SQRT((SUM(yi – Yi)^2)/(degrees of freedom)), where (xi,yi) are given data and Y is any nonlinear model (not a straight line, say a sigmoidal or logistic curve), which fits the data.

Rizwan,

Sorry for the long delay in responding to you. I just realized that I overlooked responding to you.

It is true that Syx = Sres = STEYX(Y,X) = SQRT((SUM(yi – Yi)^2)/df) for linear models. I am not sure what these terms would even mean for non linear models such as logistic regression models.

Charles

Hi Charles,

Thanks for your contributions on this site. I’m a bit confused by your base formula, though. Where you use the sum of squared deviations of x (SSx, calculated as DEVSQ(x) or DEVSQ(A4:A:18), I’ve learned to use the standard deviation of x times (n-1), or STDEV.S(A4:A:18)*(n-1) in Excel speak. This would yield a value of CI SE of 2.090695467 and a PI SE of 8.244184143. The difference is small in your dataset, but where deviations are larger the use of sums of squared deviations instead of the method I’ve heard will yield very different results. That said, I’m not at all certain which method is correct–can you point to some references for your formula, please?

Thanks in advance for taking the time to clarify this issue for me.

David

Hi David,

Note that for any range R1, the square root of DEVSQ(R1) is STDEV.S(R1)*(n-1). In fact, in the formula for cell E12 in Figure 2 of the referenced page I do take the square root of SSx, and so it seems that we should get the same answer. Can you send me an example of your calculation so that I can see why the results are not the same?

Charles

Thanks for your contribution. I would like to know how to calculate CI and PI if there are two independent variables. Thanks.

Zhang,

Sorry but I haven’t had enough time to figure out or find an answer to your question.

Charles

Please help how u got value of SSx which I suppose to be:-271.6

Anu,

SSx (cell E11) is calculated by the formula =DEVSQ(A4:A18). It has the value 2171.6.

Charles

Hi Charles,

Great. Thank u.

/Kristian

Hi,

Whats the formula in J12? Cannot get the same results…

Thanks

/Kristian

Hi Kristian,

J12 contains the same value as cell E9. The formula in E9 is =FORECAST(E8,B4:B18,A4:A18).

Charles

Hi Charles,

I’m refering to J12, not J11 J12 contains the formula for se (prediction standard error) and formula result i 8.236857, which I cannot get by using the exact same numbers you do.

What formula is in cell J12??

I think it is in the (x – x_)^2 that something is wrong!

Thanks

/ristian

Hi Kristian,

The formula in cell J12 is =E10*SQRT(1+1/E5+(E8-E7)^2/E11).

Charles

Dr. Zaiontz,

Very neat and concise example. I’m particularly interested in a one sided C.I. (lower bound)

Would you agree to use

\hat{y} – t_{crit} s.e.

where t_{crit} should be calculated in Excel using =TINV(2*\alpha,df),

where \alpha = 1-p?

Regards,

Joaquin

Joaquin,

I believe that what you wrote is correct.

Charles

Hi all,

It would be good to point out that the function TINV gives a two-sided confidence interval. The new function T.INV (with a dot) gives a one-sided confidence interval. In case of using the new function, you should take \alpha/2; furthermore, it uses the 1-\alpha/2 value, thus, T.INV(0.975,df).

Emiel

Emiel,

Actually, more simply you should use the T.INV.2T function for the two-sided critical value. It is equivalent to TINV.

This is explained on the webpage http://www.real-statistics.com/excel-capabilities/built-in-statistical-functions/

I will shortly update this information to better explain the various usages of the functions.

Charles

Hi Charles,

Your post has been of tremendous help to me.

Although it took me some time I have applied most of the solutions and they have worked just fine.

I need clarification on example 2. Why did you have to eliminate the # 1 before 1/15?

Would really appreciate your prompt response.

Thanks.

The test uses the confidence interval and not the prediction interval. This is why a 1 is not inserted before 1/15. Shortly I will update the webpage to explain better when the prediction interval is used and when the confidence interval is used.

Charles