Confidence and Prediction Intervals

Basic Concepts

Let’s assume that we have a regression line

where the regression coefficients b_j are based on the n × (k+1) data array X (with ones in the first column).

For any specific values of the x_j, say x₀₁, x₀₂, …, x_0k, we have the predicted value

If X₀ is the column array with values 1, x₀₁, x₀₂, …, x_0k, then an unbiased estimate of the standard error of ŷ₀, called the standard error of the fit, is given by the formula

The 1 – α confidence interval for the true value of ŷ₀ is therefore

Here t_crit is the critical value of the t distribution with df_Res = n − k − 1 degrees of freedom with significance level α/2, i.e. t_crit = T.INV.2T(α, df_Res).

The prediction interval is calculated in a similar way, except that now the variance is the variance of the residual y − ŷ, which is

The 1 – α prediction interval of ŷ₀ is therefore

Here, the square root term is called the standard error of the prediction.

Observation: Click here for proofs of the above formulas.

Example

Example 1: Find the 95% confidence and prediction intervals for Poverty where Infant Mortality is 7.0, White = 80, and Crime = 400 based on the data in Example 2 of Multiple Regression Analysis using Excel, which is reproduced in Figure 1 (in two blocks to fit better on the page).

Figure 1 – US State Data

We next perform a regression analysis using columns C, D, and E as the independent variables and column B as the dependent variable. Alternatively, we can use Excel’s Regression data analysis tool or, as we have done on the left side of Figure 2, by using the Real Statistics Linear Regression data analysis tool.

Figure 2 – Calculation of Confidence and Prediction Intervals

We have added the required data for which we want to calculate the confidence/prediction intervals in range O18:O22. We have also inserted the matrix (X^TX)^-1 in range J6:M9, which we calculate using the Real Statistics formula =CORE(C4:E52), referencing the data in Figure 1. Now we calculate the confidence and prediction intervals, as shown in range O3:Q13. The formulas used for the confidence interval are shown in column S of Figure 3.

Figure 3 – Key formulas for the confidence interval

The calculations for the prediction interval are identical except that the standard error (cell Q11) is calculated by the formula

=SQRT(P8*(1+MMULT(TRANSPOSE(O19:O22),MMULT(J6:M9,O19:O22))))

Note that this formula and the one in cell P11 are array formulas, and so you need to press Ctrl-Shft-Enter, even though they produce a single value.

The model predicts that 12.867% (cell P7) of the population will be below the poverty level when infant mortality is 7.0 (per 1,000 births), 70% of the population is white and crime is 400 (per 100,000 people). The 95% confidence level for this prediction is (12.14%, 13.59%) and the 95% prediction interval is (7.84%, 17.89%).

Worksheet Functions

Real Statistics Functions: The Real Statistics Resource Pack contains the following array function. Here R0 is a k × 1 array containing X data for which we want to predict a y value, R1 is an n × k array containing the X sample data and R2 is an n × 1 array containing the Y sample data.

RegPred(R0, R1, R2, lab, alpha) = 7 × 1 column range containing the predicted y value for the data in R0, the standard error for the confidence interval, the lower and upper ends of the 1 – alpha confidence interval, the standard error for the prediction interval, the lower and upper ends of the 1 – alpha prediction interval. If lab = TRUE (default is FALSE) then a column of labels is appended to the output.

There is also the following function where the regression coefficients are known. This version will also handle regression without an intercept (simply by not including an intercept coefficient).

RegPredCC(R0, Rc) = predicted y values for x values in range R0 based on the regression coefficients in the column range Rc. Here R0 is an m × k array containing m X data vectors for which we want to predict a y value and Rc is an k × 1 (w/o intercept) or k+1 × 1 (with intercept) range containing the regression coefficients.

There is also the following non-array function:

RegPredC(R0, Rc) = predicted y values for x values in range R0 based on the regression coefficients in range Rc. R0 and Rc can either be a column or row range.

Examples Workbook

Click here to download the Excel workbook with the examples described on this webpage.

63 thoughts on “Confidence and Prediction Intervals”

Declan

January 2, 2024 at 9:24 pm

Hi

Really sounds confusing. So we 95% confident that the prediction falls between 12.14 and 13.59 but the prediction interval could be way below that and we 95% confident of that.

“The 95% confidence level for this prediction is (12.14%, 13.59%) and the 95% prediction interval is (7.84%, 17.89%)”

kind regards
Declan
Reply
- Charles
  
  January 9, 2024 at 10:15 am
  
  Declan,
  1. The prediction interval is wider than the confidence interval since it needs to take into account the variability of the prediction of the single point. I use the word “wider” since it is higher at the right end of the interval and lower than the left end of the interval
  2. Why are are you adding the percentage symbol to the intervals? Unless your data consists of percentages, you should use (12.14, 13.59) and (7.84, 17.89).
  Charles
  Reply
Brian Clark

January 10, 2022 at 8:42 pm

Where do the values from O19:O22 come from? I cannot seem to get this out of the Multiple Linear Regression tool in Real Statistics or in the excel function.
Reply
- Charles
  
  January 11, 2022 at 9:04 am
  
  Hi Brian,
  It is input data. You can use whatever values that you like (except that the intercept must be 1).
  Charles
  Reply
Bien

May 23, 2020 at 12:42 am

Help, the CORE function, which is supposed to output an array (k+1 x k+1), only outputs one value, which is the first value shown in the supposed array. How do I do this correctly?
Reply
- Charles
  
  May 23, 2020 at 3:47 pm
  
  Bien,
  This is an array function and so to get all the values see
  Array Formulas and Functions Array Formulas and Functions
  Charles
  Reply
Mark

March 19, 2020 at 4:10 am

I understand inside your package you programed the calculation of inverse of X’X. If I have 14 variables the inverse of X’X will be a matrix of 15 by 15, right? Is it possible I program inverse of X’X, using basic Excel or other basic spreadsheet?
Thank you!
Reply
- Charles
  
  March 19, 2020 at 9:29 am
  
  Hello Mark,
  Yes, the inverse of X’X is a 15 x 15 matrix
  The Real Statistics COR(R1) function can be used to calculate the inverse of X’X where R1 contains X without the extra column of ones.
  See https://real-statistics.com/multiple-regression/multiple-regression-analysis/real-statistics-capabilities-for-multiple-regression/
  Charles
  Reply
David A Vaccari

February 27, 2020 at 2:59 pm

Slight typo: Where you state: “The 1 – α prediction interval of ŷ0 is therefore … [equation]…Here the last term is called the standard error of the prediction.” The last term in the referenced equation includes the t-crit value. Shouldn’t the standard error be the square-root part of the term without the t-crit? In other words, the last term is t-crit times the s.e.
Reply
- Charles
  
  March 2, 2020 at 2:01 pm
  
  Hi David,
  Yes, that is what I meant by the last term. In any case, I can see how this would cause some confusion. I have now changed the wording. Please let me know whether this is now clear enough. Also, thanks for bringing this issue to my attention.
  Charles
  Reply
Sun Kim

July 30, 2019 at 6:36 am

Charles,
There is a typo in Figure 3 for the T critical value formula. It should be said =T.INV.2T instead of TINV to be a two-sided value.

Thanks,
-Sun
Reply
- Charles
  
  July 30, 2019 at 10:45 am
  
  Hello Sun,
  TINV yields the two-tailed critical value and is equivalent to T.INV.2T. It is the older version of the function and is the only one available for Excel 2007 users.
  Charles
  Reply
Jill

June 30, 2019 at 5:14 am

how did you get the CORE MATRIX????
Is there a link to this spreadsheet with the formulas for the core, please share!
Reply
- Charles
  
  June 30, 2019 at 7:30 am
  
  Hi Jill,
  The CORE array function is defined at
  https://real-statistics.com/multiple-regression/multiple-regression-analysis/real-statistics-capabilities-for-multiple-regression/
  You can get the definitions of all the Real Statistics functions on the Tools menu of the Real Statistics website.
  Charles
  Reply
  - Jill
    
    June 30, 2019 at 11:45 pm
    
    you have to calculate those or are they available in the excel summary results? https://www.youtube.com/watch?v=_ZgWScL3F-A. While I like your layout better this calc made finding the confidence interval much easier. Thanks for your site. I think I might survive to the end of the semester. df = degrees of freedom! SS all this stat jargon!
    Reply
    - Charles
      
      July 9, 2019 at 7:22 pm
      
      Hello Jill,
      This webpage describes how to calculate the confidence and prediction intervals no matter the number of independent variables. If you only have one independent variable you can use the approach described at
      https://real-statistics.com/regression/confidence-and-prediction-intervals/
      In any case, you can use the Real Statistics function RegPred to calculate these values for you.
      Charles
      Reply
Nick R.

June 19, 2019 at 4:24 pm

Hello,

When I attempt to use the RegPred function, the only value that is returned/displayed is the linear regression prediction. The other parts of the output, such as confidence interval and prediction interval, are not displayed. Do you have any suggestions?

Many thanks,

Nick
Reply
- Charles
  
  June 22, 2019 at 11:47 am
  
  Hello Nick,
  This function is an array function and so you can’t simply press the Enter key to get all the values. The approach to use is pretty simple and is described at
  Array Formulas and Functions
  Charles
  Reply
Susana Pasciullo

May 10, 2019 at 12:50 pm

Buen día Charles, muy util todas sus presentaciones de Estadística en Excel. En este ejemplo de la construcción de los intervalos de confianza para la predicción nos habla de información para 50 estados y en efecto en la salida de Excel 50 es la cantidad de observaciones, pero la información original presentada contiene 49 estados. ¿Podré contar con la información faltante para resolver el ejercicio y que mis resultados coinsidan con los suyos? Muy agradecida, soy profesora de Estasdística II en la Universidad CAECE (Centro de Altos Estudios en Ciencias Exactas) en Buenos Aires, Argentina y enfáticamente recomiendo sus publicaciones a mis alumnos.
Reply
- Charles
  
  May 11, 2019 at 8:56 am
  
  Hello Susana,
  The sample has all 50 states. Wyoming was inadvertently left off the list on the webpage. I have now corrected this and shown Wyoming as well.
  Charles
  Reply
Carolyn

May 7, 2019 at 8:40 am

Hi Charles, I am struggling to figure out where X0 comes from. In my example, we are given some matrices and asked to find some of the statistics from them and then give the confidence interval. The given matrices are (X’X), (X’X)-1, (X’y), y’y, and b’. Every other example given to us has come from a textbook word problem or given data. Thank you for your help.
Reply
- Charles
  
  May 7, 2019 at 9:11 am
  
  Hi Carolyn,
  X0 is a column of data. It can be any data that you want to analyse.
  Charles
  Reply
Takayoshi

April 6, 2019 at 12:59 pm

Hello Mr Zaiontz,

In the first sentence of the third paragraph of this page, you wrote “Here X is the (k+1) × 1 column vector”. However, in a textbook called 《Introduction to Linear Regression Analysis》 by Douglas C.Montgomery, it is indicated that X is the same old (n) × (k+1) matrix which you have shown in “Multiple Regression using Matrices” as the “design matrix”.

I am confused by the two seemingly contradictory statements. I wonder if it is convenient for you to teach me about which one is true?

Thank you
Reply
Sree

November 18, 2018 at 1:04 pm

Hi Charles,

Thanks for this post! Could you please help me with my query below:

I have developed a predictive model based on a dataset that contains 100 samples. The response variable of the model (Y_orig) is dependent on three input parameters, X1, X2 and X3. When I use the regression model to predict the response variable (Y_pred) for an X1,X2,X3 combination from within the sample dataset, I get a response Y_pred that is, say 30% different to the Y_orig. I do this for all the samples within the dataset. Some samples get predicted more accurately than the others. So, I can find out the residue (Y_pred-Y_orig) for all the samples in the dataset. How do I obtain a prediction interval for the model with 95% confidence.. i.e., an interval that conveys to the reader that if I forecast a value of Y_pred for a different combination of X1,X2,X3 that is not within the sample dataset, what is the interval within which this model can predict the Y_pred value. Or, should I use a confidence interval based on the Mean and S.D of the sample to define this? Please advise ASAP. Thanks!
Reply
- Charles
  
  November 30, 2018 at 11:46 am
  
  You can calculate the prediction interval even for a combination of x1,x2,x3 not in the sample data set. You use the approach described on this webpage.
  E.g. the x value = (7, 80, 400) in Example 1 is not part of the sample, yet the 95% prediction interval is calculated.
  Charles
  Reply
Festus

March 1, 2018 at 2:51 pm

Hi Charles
my only problem is the CORE function how can i make this table with the basic excel functions, i am doing this for academic purpose, i fear i might get in trouble by using a different package
thanks in advanced
Reply
- Charles
  
  March 1, 2018 at 3:38 pm
  
  Festus,
  Sorry, but I don’t understand your question. Which table are you referring to? What is your issue with the CORE function?
  Charles
  Reply
Gautam Dugar

November 22, 2017 at 2:01 pm

It’s a wonderful example and tool! Some people may have issues in building the Core matrix. The best way to do so is to free download the example excel files and the statistical add-in for Excel. Use the example to build a template for your own question.
Reply
Fred

November 6, 2017 at 10:21 am

Hello again Charles,
What is the relationship between the parameters confidence interval (for b0, b1, b2…) and the confidence intervals for the ys?
Does the uncertainty add up?
Fred
Reply
- Charles
  
  November 6, 2017 at 10:28 am
  
  Fred,
  There is a relationship, but it is not so direct. The formula for the s.e. for the y’s involves the inverse of X’X. The square roots of the diagonal elements on the inverse of X’X are the standard errors of the coefficients b0, b1, …ù
  Charles
  Reply
Issah Kadir

March 28, 2017 at 1:14 pm

Please how should I approach this questions?
A) Discuss the coefficient of gender, group size, household size.
Given the following figures as coefficient (-0.08137, -0.01013 and -0.01033) respectively.
B)the variable group size and gender are significant at 1% and 5% respectively, what policy implications do they portray.
Please I need your help, thank you in advance.
Reply
- Charles
  
  March 29, 2017 at 10:04 am
  
  Issah,
  A) It really depends on the dependent variable. E.g. if this variable is “expense (in dollars)”, then the coefficient -.010033 for household size, means that for every increase of 1 in household size, the expense decreases by $.01.
  B) The term “policy implications” doesn’t seem like a statistical concept, and so I can’t really comment.
  Charles
  Reply
  - Issah Kadir
    
    April 2, 2017 at 9:27 am
    
    Thank you Mr Charles
    Reply
Faiza Hafeez

December 26, 2016 at 7:19 pm

Hi,
Does MS Excel or SPSS allows for polynomial (cubic curvilinear) regression with 4-6 independent variables? If yes, kindly tell method.
Reply
- Charles
  
  December 26, 2016 at 8:42 pm
  
  Faiza,
  Excel does support this as explained on the following webpage:
  Polynomial Regression
  I don’t know what sort of support SPSS provides.
  Charles
  Reply
Dan P.

September 3, 2016 at 5:50 pm

Thanks. I’d been struggling with the computation of the s.e. term because my go-to statistics reference book (I’m old-school, I admit) punted when it came to how to do the calculation and instead deferred to a MINITAB printout. I was able to use your recommendations and that MINITAB printout to fill in the blanks !
Reply
Peter

July 7, 2016 at 6:14 pm

If one were to calculate a confidence and/or prediction interval for each predicted-y (ŷi) calculated from observation (Xi), would it have the same kind of arcing shape that the confidence and prediction intervals that result from doing this type of calculation on a single variable linear regression?
Reply
- Charles
  
  July 8, 2016 at 6:16 am
  
  Peter,
  Probably so, but I have not tried to do this. It is easy enough to create a chart to show whether this is true or not. I suggest that you try to do this.
  Charles
  Reply
Fern

July 3, 2016 at 9:48 pm

Sorry, I forgot to ask this; I think this is clearly true, but I have been wrong before:

Say we get a regression output :

y= b_0 + a_1x1+a_2x2+…..+a_kbk . (b_0 is the intercept)

Does it follow that the regression coefficient of y with respect to any x_i ; i=1,..,k. is a_i? And, is there a way of recovering the intercept b_i from b_0?
Thanks.

by
Reply
- Charles
  
  July 4, 2016 at 9:14 pm
  
  Fern,
  The regression model takes the form y = b0 + b1*x1 + b2*x2 + … + bk*xk
  There is only one intercept, namely b0. There is no bi intercept, just a bi coefficient corresponding to xi.
  Charles
  Reply
Fern

July 3, 2016 at 9:39 pm

Hi, great presentations.
A small question: I am not getting the right prdicted value for infant mortality=7, percent white is 80% and crime rate of 400 per 100,000. I am using the coefficients
1.279, 0.0364, 0.001421 from the output , and I am getting:

Predicted(7,80%, 400)= (1.28)(7)+(0.0364)(80)+(0.001421)(400)=(Approx.) 12.47

Am I doing something wrong?
Reply
- Charles
  
  July 4, 2016 at 9:15 pm
  
  Fern,
  Does my answer to your other question help?
  Charles
  Reply
Bahaa Siha

March 17, 2016 at 6:29 pm

Hi Charles,
Thank you for the great information. I just saw your blog while I’m trying to understand how to build confidence intervals for simple and multiple variables regressions.
I saw a simple method using the information provided from ANOVA within excel:
80% confidence = Se * t_stat/SQT(n) where:
Se = Standard Error of the regression
t_stat = TINV(0.2, df), df = deg of freedom of the risidual
n = number of observation.
Is that correct or it’s approximation?
The sight I got it from is:
https://www.youtube.com/watch?v=_ZgWScL3F-A

Thank you very much
Bahaa
Reply
- Charles
  
  March 18, 2016 at 7:08 am
  
  Hi Bahaa,
  
  Yes, this looks correct, except that you shouldn’t divide by SQRT(n) since s.e. already includes this. In the case where there is only one independent variable, the following webpage describes the correct calculations. These are along the lines that you describe.
  
  https://real-statistics.com/regression/confidence-and-prediction-intervals/
  
  Charles
  Reply
Rob Shaw

February 19, 2016 at 3:55 pm

Charles

Thanks for this article. I have a couple of questions:

For the core function, could you tell if I could replicate this using inbuilt Excel functions please? The link on your page https://real-statistics.com/real-statistics-environment/supplemental-functions/ points to your LS MR page, on which I can’t find a reference to it

Could you say how the method you’ve set out is different when using weighted data please?
Reply
- Charles
  
  February 19, 2016 at 7:40 pm
  
  Rob,
  
  If you are referring to Example 1 on the referenced webpage, then the link listed was incorrect. The link should be to the following webpage:
  Multiple Regression Analysis using Excel
  This has now been corrected.
  
  You reference the “core function”. Are you referring to CORE. This is described on the webpage
  Real Statistics Capabilities for Multiple Regression
  
  If you are referring to some other function, please let me know which one you are referring to.
  
  Charles
  Reply
Haijin

January 8, 2016 at 9:12 pm

Hi Charles,

How to use use the CORE function? I cannot figure it out.

Thanks,

Haijin
Reply
- Haijin
  
  January 8, 2016 at 9:59 pm
  
  Charles,
  
  Please forget about my question. I figured it out anyway.
  
  Thank you for the great stat package.
  
  Haijin
  Reply
Chris

January 4, 2016 at 6:42 pm

Charles, this is a very helpful site, thank you for putting all this time into it.

Can you advise on applying this method to time series forecasting where the model is built on multivariate regression as you have here? Acknowledging all the usual caveats about out-of-sample forecasting (model uncertainty, parameter uncertainty over time etc.), I would like to apply the prediction intervals as you give them to a time series iterating, say, three steps into the future.

I imagine the first PI for one step into the future can be calculated very similarly to the in-sample PI you give on this page. I know the values of the independent variables into the future and just plug them into the MV regression equation estimate for n+1 and apply the PI based on the sample of n.

But for the PI for the second step into the future, it should capture some of the uncertainty of the first step, such that predictions intervals expand further into the future (like a fan chart to capture the increasing uncertainty of future predictions). Can you advise on this?
Many thanks, Chris
Reply
- Charles
  
  January 4, 2016 at 9:10 pm
  
  Chris,
  I plan to provide this sort of capability for univariate time series forecasting in the next release of the software. Multivariate capabilities of this type will take some time.
  Charles
  Reply
Dean

December 2, 2015 at 3:42 pm

Charles,

Thanks so much for your work. I am trying to duplicate your work for the confidence and prediction intervals for multiple regression, but the site only shows the data for 49 states – Wyoming is not shown in your posted example. Your calculations indicate 50 observations, but I have only been able to input 49. Is it possible to send/post the Poverty, Infant Mort, White, and Crime data for Wyoming?

Thank you.
Reply
- Charles
  
  December 3, 2015 at 9:08 am
  
  You can download the full example from the webpage Real Statistics Examples Workbooks.
  Charles
  Reply
Philip

July 27, 2015 at 1:02 pm

Hi Charles,

I have a model for remaining tread on a car tyre:

Tread = b0 + b1* (Tread when new) + b2*(age years)

Can the confidence or prediction intervals answer these questions:

1. What is the remaining tread on a tyre which gives a 95% chance that it won’t fall below a legal limit in 1 more year?

2. If say N tyres are re-used with a certain remaining tread. What fraction of them will fall below the legal limit in the next year?

Any pointers on how to solve these questions would be great!

Thanks
Reply
- Charles
  
  August 13, 2015 at 10:29 am
  
  Philip,
  Sorry, but I still haven’t had a chance to think about the question you have raised.
  Charles
  Reply
ZPan

December 24, 2014 at 7:53 pm

Wouldn’t the core matrix be a 3×3 one as there are only 3 variables?
Reply
- ZPan
  
  December 24, 2014 at 8:45 pm
  
  Got it, there is a X0. But CORE(C4:E52) doesn’t match in dimension though.
  Reply
  - Charles
    
    December 25, 2014 at 6:32 pm
    
    CORE(R1) is (XTX)-1 where X is the matrix in range R1 with a first column of all ones added as the first column. If R1 is m x n then X is m x (n+1) and CORE(R1) is (n+1) x (n+1).
    Charles
    Reply
- Charles
  
  December 25, 2014 at 6:28 pm
  
  You need to add a column containing ones (for the constant term)
  Charles
  Reply

Basic Concepts

Example

Worksheet Functions

Examples Workbook

63 thoughts on “Confidence and Prediction Intervals”

Leave a Reply to Festus Cancel reply