In Method of Least Squares for Multiple Regression we review how to fit data to a straight line. Sometimes, however, the data fits better with a polynomial curve.

On this webpage we explore how to construct polynomial regression models using standard Excel capabilities. Click here to learn more about Real Statistics capabilities that support polynomial regression.

**Excel Capabilities**

We look at a quadratic model, although it is straightforward to extend this to any higher-order polynomial. The quadratic model takes the form

y = β0 + β1x + β2x^2

This is equivalent to the usual multiple regression model

y = β0 + β1x1 + β2x2

studied in Multiple Regression Analysis where x1 = x and x2 = x^2.

**Example 1**: A group of senior citizens who have never used the Internet before are given training. A sample of 5 people is chosen at random and the number of hours of Internet use is recorded for 6 months, as shown in the table on the upper left side of Figure 1. Determine whether a quadratic regression line is a good fit for the data.

**Figure 1 – Data for polynomial regression in Example 1**

We next create the table on the right in Figure 1 from this data, adding a second independent variable (MonSq) which is equal to the square of the month. We now run the Regression data analysis tool using the table on the right (quadratic model) in columns I, J and K as the input. The results are displayed in Figure 2.
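The same construction can be sketched outside Excel: add a squared column and run ordinary multiple regression on both columns. The Python sketch below uses made-up hours values (the actual sample is in Figure 1), and `solve` and `quad_fit` are illustrative helper names, not part of any library; the fit is done by solving the normal equations directly.

```python
# Quadratic regression y = b0 + b1*x + b2*x^2 fitted as ordinary multiple
# regression on the columns [1, x, x^2] (x^2 plays the role of MonSq).

def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        u[r] = (M[r][n] - sum(M[r][c] * u[c] for c in range(r + 1, n))) / M[r][r]
    return u

def quad_fit(xs, ys):
    """Least-squares coefficients (b0, b1, b2) from the normal equations X'X b = X'y."""
    X = [[1.0, x, x * x] for x in xs]
    XtX = [[sum(row[r] * row[c] for row in X) for c in range(3)] for r in range(3)]
    Xty = [sum(row[r] * y for row, y in zip(X, ys)) for r in range(3)]
    return solve(XtX, Xty)

months = [1, 2, 3, 4, 5, 6]
hours = [9, 10, 22, 42, 75, 120]    # hypothetical values, not the article's sample
print(quad_fit(months, hours))
```

For a low-degree polynomial this reproduces what a regression tool computes; for high-degree polynomials the normal equations become ill-conditioned and a QR-based solver is preferable.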

**Figure 2 – Quadratic regression output**

The Adjusted R Square value of 95% and p-value (Significance F) close to 0 show that the model is a good fit for the data. The fact that the p-value for the MonSq variable is near 0 also confirms that the quadratic coefficient is significant. This is further confirmed by looking at the scatter diagram in Figure 1, which shows that the quadratic trend line is a better fit for the data than the linear trend line. (To display the quadratic trend line select **Layout > Analysis|Trendline** and then More Trendline Options… On the display box which appears choose a Polynomial trendline of Order 2.)

Figure 2 also shows that the regression quadratic that best fits the data is

Hours of Use = 21.92 – 24.55 * Month + 8.06 * Month^2

Thus to predict the number of hours that a particular senior will use the Internet after 3 months, we plug 3 into the model (or use the TREND function) to get 20.8 hours of use.
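Plugging Month = 3 into the fitted quadratic can be checked with a few lines of Python (a sketch of the arithmetic; `predict` is an illustrative name):

```python
# Fitted model from Figure 2: Hours of Use = 21.92 - 24.55*Month + 8.06*Month^2
def predict(month):
    return 21.92 - 24.55 * month + 8.06 * month ** 2

print(round(predict(3), 1))   # -> 20.8
```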

We can also run the Regression data analysis tool on the original data to compare the above results with the linear model studied in Regression Analysis. The linear model is generated by using only columns I and K from Figure 1. The output is shown in Figure 3.

**Figure 3 – Linear regression output**

That the quadratic model is a better fit for the data is apparent from the fact that the adjusted R-square value is higher (95.2% vs. 83.5%) and the standard error is lower (13.2 vs. 24.5).

**Real Statistics Capabilities**

Click here to learn more about Real Statistics capabilities that support polynomial regression.

Hi Charles,

The example above shows using a quadratic equation with one independent variable. Is it possible to use a quadratic or cubic equation with 2 or 3 independent variables? In that case how will the equation look?

Bhushan,

It can have many forms. E.g. y = b0 + b1*x1 + b2*x1^2 + b3*x1^3 + b4*x2^2 + b5*x1*x2.

Charles

Charles,

I believe Bhushan is asking how to carry out this multivariate polynomial regression using your code. Currently the polynomial regression tab only allows for one dependent variable.

Chris,

I understood from his comment that he has multiple independent variables (not dependent variables). E.g. y = b0 + b1*x1 + b2*x1^2 + b3*x1^3 + b4*x2^2 + b5*x1*x2. In these cases you can use multiple linear regression where you treat terms such as x1^2 as a new independent variable (whose value is x1^2).

Charles
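Charles's point above is that a polynomial in several variables is still linear in its coefficients, so each power or cross-product term becomes one more column of an ordinary multiple regression. A sketch, with `expand` as an illustrative helper for the example form y = b0 + b1*x1 + b2*x1^2 + b3*x1^3 + b4*x2^2 + b5*x1*x2:

```python
# One row of the expanded design matrix: the intercept column plus one
# column per polynomial term. Feeding these rows to any multiple linear
# regression routine fits the multivariate polynomial.
def expand(x1, x2):
    return [1.0, x1, x1 ** 2, x1 ** 3, x2 ** 2, x1 * x2]

print(expand(2.0, 3.0))   # -> [1.0, 2.0, 4.0, 8.0, 9.0, 6.0]
```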

Hi Charles,

Thank you for making this easier to understand – with the learnings from my statistics classes already blurred this was an excellent brush up!

I am trying to show whether there can be talk of herding behaviour in stock markets. For this I have obtained market return data (r_m,t) to calculate the cross-sectional absolute deviation (CSAD) value. Now, in order to identify herding behaviour I have to detect a negative correlation between CSAD and r_m,t, using the formula below (with D^Event being a dummy for certain days):

CSAD_{m,t} = γ_0 + γ_1 D^Event |R_{m,t}| + γ_2 (1 - D^Event)|R_{m,t}| + γ_3 D^Event R_{m,t}^2 + γ_4 (1 - D^Event) R_{m,t}^2 + e_t

My question is now whether you have any advice as to how I can estimate these coefficients (γ_3 and γ_4 in particular) in Excel.

Many thanks,

Maja

Sorry Maja, but I don’t understand the formula that you are using.

Charles

Hello Sir,

I want to ask about the application of polynomial regression.

I want to find a correlation between brain activity and enzyme activity during an emotional state. So at first I performed linear correlation/regression, but almost all the results showed no significant correlations (even though some had a large r), and I believe my variables are not correlated.

Then, if I use this polynomial regression to aim for that correlation, is it relevant? Or, if it is not correlation, what is polynomial regression actually aiming at?

Thank you,

Hayati

Hayati,

It is not clear from your description what sort of polynomial regression you would use. It is possible that the (linear) correlation between x and y is, say, .2, while the linear correlation between x^2 and y is .9. Thus, the polynomial regression y = b*x^2+a might yield a better model (e.g. for predictions) than the linear regression model y = b*x+a.

Charles

Thank you for the reply, Mr Charles.

The polynomial regression that I meant is as in this chapter. I am not really familiar with statistics so I do not know if there are any types besides this.

So, as in your reply, I can still use polynomial regression (or multiple regression, as explained in this chapter) to find correlation?

Then, I want to add one more question:

Should we go further with the order (quadratic, then cubic) to find better results? (Results might be regression values, R^2 or p-values. I am not sure, but in my case I aim for correlation.)

Hayati

Hayati,

You can define the correlation coefficient for nonlinear relationships (i.e. based on a nonlinear regression) as the square root of 1 – SSE/SST, where

SSE = the sum of the squared residuals (i.e. where for each data value the residual is the difference between the observed y value and the y value predicted by the regression model)

SST = the sum of the squared differences between the observed y values and the mean of the observed y values

For linear regression this definition is equivalent to the usual definition of the linear correlation coefficient.

You can use polynomial regression to find the polynomial correlation coefficient. You can do this for quadratic, cubic, etc. regression/correlation.

Charles
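The sqrt(1 – SSE/SST) definition above translates directly into code. A minimal Python sketch (`nonlinear_r` is an illustrative name), taking the observed y values and the model's predicted values:

```python
import math

def nonlinear_r(ys, preds):
    """Correlation coefficient sqrt(1 - SSE/SST) for a fitted regression model."""
    mean_y = sum(ys) / len(ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))   # sum of squared residuals
    sst = sum((y - mean_y) ** 2 for y in ys)             # total sum of squares
    return math.sqrt(1 - sse / sst)

print(nonlinear_r([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))   # perfect fit -> 1.0
```

A model that always predicts the mean gives r = 0, and a perfect fit gives r = 1; for a fitted straight line this agrees with the usual Pearson correlation. (For a least-squares fit that includes an intercept, SSE ≤ SST, so the square root is well defined.)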

Hello

What is the p value for the polynomial line?

If I have a data series and I determine that the polynomial line is a better fit than the linear one and R-square is higher, how do I determine the p-value for the polynomial line? Do I have to change all my values to the squares of the original values from the data series, run regression with Excel and present the p-value I get?

Beck,

The referenced webpage describes how to calculate the p-value for the linear and quadratic coefficients of the polynomial regression model. There is one p-value for each coefficient (corresponding to the degree of the polynomial). There is one R-square value for the entire regression model.

Charles

Hey,

I want to do a polynomial model with four independent variables in software R. How can I go on about that?

Thanks

Sorry Varada, but this website is about statistics in Excel, not R. In fact, I don’t use R.

Charles

Thank you for this academic materials. Would you please give a guideline for analysis of third order polynomial regression model?

Regards

Shahadat

It is exactly as in Example 1 of the referenced webpage, except that now you must add another column with the cubes of the x values of the input data.

Charles

Dear Charles

Thank you for your response. Would you please illustrate the meaning of a fitted third order polynomial regression curve/model i.e. how I can explain it.

Regards

Shahadat

A fitted third order curve is one of the form y = ax^3 + bx^2 + cx + d. You are generally looking for the curve of this type that best fits the data. There are various versions of what best fit means. If you want to use linear regression then you are essentially viewing y = ax^3 + bx^2 + cx + d as a multiple linear regression model, where x^3, x^2 and x are the three independent variables. This is the approach used on the referenced webpage to find the best values of a, b, c and d. Here “best” means the smallest value of the sum of squared differences between the observed values of y_i and the values of y_i calculated when x_i is substituted for x in the equation y = ax^3 + bx^2 + cx + d.

You can also use a non-linear model to find the best values of a, b, c and d. This approach is illustrated on the following webpage (using Excel’s Solver):

http://www.real-statistics.com/regression/exponential-regression-models/exponential-regression-using-solver/

Charles
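The least-squares criterion described above, the smallest sum of squared differences, can be written out as a short sketch (`cubic` and `sse` are illustrative names):

```python
# y = a*x^3 + b*x^2 + c*x + d, and the objective that least squares minimizes.
def cubic(x, a, b, c, d):
    return a * x ** 3 + b * x ** 2 + c * x + d

def sse(params, xs, ys):
    """Sum of squared differences between observed and fitted y values."""
    a, b, c, d = params
    return sum((y - cubic(x, a, b, c, d)) ** 2 for x, y in zip(xs, ys))
```

Treating x, x^2 and x^3 as three independent variables and running multiple linear regression finds the (a, b, c, d) that minimize this quantity.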

Hi there, I just want to know if a transcendental model function can be done in Excel.

What transcendental model functions are you referring to?

Charles

Hi Charles

I have a set of data (let's call them X and Y). I fit a quadratic regression to them and I get an R^2 = 99.29%. Now my problem is to estimate the error of new values produced by the fitted polynomial. To be more exact, I am interested in the point where the fitted curve crosses the x axis, or in other words where a*x^2 + b*x + c = 0. What is the variance of this point going to be?

Stamatis,

A polynomial regression is just a special case of multiple linear regression. Therefore you can use the approach shown on the following webpage

Confidence and Prediction Interval

to calculate the standard error (i.e. the square root of the variance) at any point. In particular, you can use the Real Statistics REGPRED array function to do this.

You may need to actually calculate the two roots of the quadratic polynomial a*x^2 + b*x + c = 0. This can be done using the quadratic formula. Alternatively, you can use the new Real Statistics ROOTS function. I will add a description of the ROOTS function to the website shortly.

Charles
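The quadratic-formula step mentioned above is easy to script (a sketch; `quad_roots` is an illustrative name, not the Real Statistics ROOTS function):

```python
import math

def quad_roots(a, b, c):
    """Real roots of a*x^2 + b*x + c = 0 via the quadratic formula; () if none."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return ()
    r = math.sqrt(disc)
    return ((-b - r) / (2 * a), (-b + r) / (2 * a))

print(quad_roots(1, -3, 2))   # x^2 - 3x + 2 -> (1.0, 2.0)
```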

Hi

I am doing multiple regression and getting compile error in hidden module.

Please help.

Rishav Garg

Rishav,

To try to figure out what is happening, please answer the following questions:

1. What do you see when you enter the formula =VER() in any spreadsheet cell?

2. What do you see when you press Ctrl-m?

3. Which release of Excel and Windows are you using?

Charles

Hi Charles, would you be able to give guidance on a method within excel of applying ± 95% confidence limits to a 3rd order polynomial. The limits would then be used to control a process. Thank you in advance for your reply,

Paul,

For which variable are you looking for a 95% confidence interval?

Charles

Hours of use per month, as in your example above.

Paul, I’m not sure that I understand what you mean by applying a ±95% confidence limit. Generally, you should have more confidence in the accuracy of a statistic when its confidence interval is narrow. In particular, if the confidence interval contains zero then the coefficient for that variable is not significantly different from zero, which means that that variable (at least the cube of that variable in this case) is not making a significant contribution to the regression model.

Charles

Hi Charles, Thanks for your response. I was hoping to plot a ±95% confidence interval about the polynomial trend. I understand the function when applying it to linear regression; not so easy for polynomial, I guess (=t*SYX*SQRT(1/n+(A18-XAVG)^2/SSX)).

Paul,

Polynomial Regression is identical to multiple linear regression except that instead of independent variables like x1, x2, …, xn, you use the variables x, x^2, …, x^n. Thus, the formulas for confidence intervals for multiple linear regression also hold for polynomial regression. See the webpage Confidence Intervals for Multiple Regression.

Charles
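The multiple-regression confidence formula referred to above can be sketched in Python, assuming the X'X matrix and the residual standard error s have already been computed from the data (`solve` and `se_fit` are illustrative names): at a point x0 with design row v = [1, x0, x0^2, ..., x0^n], the standard error of the fitted value is s*sqrt(v'(X'X)^-1 v), and the band is the fitted value ± t times this.

```python
import math

def solve(A, b):
    """Solve A u = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    u = [0.0] * n
    for r in range(n - 1, -1, -1):
        u[r] = (M[r][n] - sum(M[r][c] * u[c] for c in range(r + 1, n))) / M[r][r]
    return u

def se_fit(x0, XtX, s, degree):
    """Standard error of the fitted polynomial value at x0: s*sqrt(v'(X'X)^-1 v)."""
    v = [x0 ** p for p in range(degree + 1)]
    u = solve(XtX, v)          # u = (X'X)^-1 v, via a linear solve
    return s * math.sqrt(sum(vi * ui for vi, ui in zip(v, u)))
```

For the simple linear case (degree 1) this reduces to the familiar t*SYX*SQRT(1/n + (x0 - x̄)^2/SSX) band quoted earlier in the thread.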

Is the high collinearity (or correlation) between Month and Month^2 a concern?

Ryan,

The correlation between Month and Month^2 is .9789, which is quite high, but it is also not necessarily at the level of collinearity. If you perform regression with Month and Month^2, the result won’t be very different from the result with just Month^2.

Charles
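The .9789 figure can be reproduced from the Month column (1 through 6) with a short Pearson-correlation sketch (`pearson` is an illustrative name):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

months = [1, 2, 3, 4, 5, 6]
print(round(pearson(months, [m * m for m in months]), 4))   # -> 0.9789
```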

So in that case, you would probably remove Month from the model and fit a new model using only Month^2 as your explanatory variable? My question is about the worrisome correlation between two independent variables in the model.

Thank you for your work.

Linus