In regression analysis, curve fitting is the process of specifying the model that provides the best fit to the specific curves in your dataset. Curved relationships between variables are not as straightforward to fit and interpret as linear relationships.

For linear relationships, as you increase the independent variable by one unit, the mean of the dependent variable always changes by a specific amount. This relationship holds true regardless of where you are in the observation space.

Unfortunately, the real world isn’t always nice and neat like this. Sometimes your data have curved relationships between variables. In a curved relationship, the change in the dependent variable associated with a one unit shift in the independent variable varies based on the location in the observation space. In other words, the effect of the independent variable is not a constant value.
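To make the contrast concrete, here is a minimal sketch. Both equations are made-up illustrations (they are not fitted to the dataset in this post), and the function names are hypothetical. It computes the change in the output for a one-unit increase in the input at several starting points.

```python
def linear(x):
    # Hypothetical straight-line model: y = 2 + 2x
    return 2 + 2 * x


def curved(x):
    # Hypothetical curved model: y = 2 + 2x + x^2
    return 2 + 2 * x + x ** 2


def unit_changes(model, xs):
    """Change in the model's output for a one-unit increase at each x."""
    return [model(x + 1) - model(x) for x in xs]


xs = [1, 5, 10]
linear_effects = unit_changes(linear, xs)  # the same change at every x
curved_effects = unit_changes(curved, xs)  # the change depends on x

print(linear_effects)
print(curved_effects)
```

For the straight line, the effect is the same wherever you start. For the curved model, the effect grows as the input grows, which is exactly why a single slope coefficient can't describe it.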

Read my post where I discuss how to interpret regression coefficients for both linear and curvilinear relationships to see this in action.

In this post, I cover various curve fitting methods using both linear regression and nonlinear regression. I’ll also show you how to determine which model provides the best fit.

## Why You Need to Fit Curves in a Regression Model

The fitted line plot below illustrates the problem of using a linear relationship to fit a curved relationship. The R-squared is high, but the model is clearly inadequate. You need to do curve fitting!

When you have one independent variable, it’s easy to see the curvature using a fitted line plot. However, with multiple regression, curved relationships are not always so apparent. For these cases, residual plots are a key indicator for whether your model adequately captures curved relationships.

If you see a pattern in the residual plots, your model doesn’t provide an adequate fit for the data. A common reason is that your model incorrectly models the curvature. Plotting the residuals by each of your independent variables can help you locate the curved relationship.

**Related post**: Check Your Residual Plots to Ensure Trustworthy Results!

In other cases, you might need to depend on subject-area knowledge to do curve fitting. Previous experience or research can tell you that the effect of one variable on another varies based on the value of the independent variable. Perhaps there’s a limit, threshold, or point of diminishing returns where the relationship changes?

To compare curve fitting methods, I’ll fit models to the curve in the fitted line plot above because it is not an easy fit. Let’s assume that these data are from a physical process with very precise measurements. We need to produce accurate predictions of the output for any specified input. You can download the CSV dataset for these examples: CurveFittingExample.

## Curve Fitting using Polynomial Terms in Linear Regression

Despite its name, you can fit curves using linear regression. The most common method is to include polynomial terms in the linear model. Polynomial terms are independent variables that you raise to a power, such as squared or cubed terms.

To determine the correct polynomial term to include, simply count the number of bends in the line. Take the number of bends in your curve and add one for the model order that you need. For example, quadratic terms model one bend while cubic terms model two. In practice, cubic terms are very rare, and I’ve never seen quartic terms or higher. When you use polynomial terms, consider standardizing your continuous independent variables.

**Linear**

**Quadratic**

**Cubic**

Our data has one bend. Let’s fit a linear model with a quadratic term.
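If you want to try this outside a statistics package, here is a minimal, dependency-free sketch of fitting a model with a quadratic term by ordinary least squares. The helper `fit_polynomial` is a hypothetical name, and the data are made up from the curve y = 2 + 2x + x² so you can see that the fit recovers the coefficients. It is not the dataset from this post.

```python
def fit_polynomial(x, y, degree):
    """Least-squares coefficients [b0, b1, ..., b_degree] via the normal equations."""
    n_terms = degree + 1
    # Normal equations (X'X) b = X'y, where column j of X holds x**j.
    a = [[sum(xi ** (r + c) for xi in x) for c in range(n_terms)]
         for r in range(n_terms)]
    b = [sum((xi ** r) * yi for xi, yi in zip(x, y)) for r in range(n_terms)]
    # Gaussian elimination with partial pivoting.
    for col in range(n_terms):
        pivot = max(range(col, n_terms), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n_terms):
            factor = a[r][col] / a[col][col]
            for c in range(col, n_terms):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    coef = [0.0] * n_terms
    for r in reversed(range(n_terms)):
        s = sum(a[r][c] * coef[c] for c in range(r + 1, n_terms))
        coef[r] = (b[r] - s) / a[r][r]
    return coef


# Made-up data with one bend: y = 2 + 2x + x^2.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 + 2 * xi + xi ** 2 for xi in x]

coefficients = fit_polynomial(x, y, degree=2)
print([round(c, 4) for c in coefficients])
```

The model is still linear in the statistical sense because it is a sum of coefficients times terms. The squared column is just one more predictor.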

The R-squared has increased, but the regression line doesn’t quite fit correctly. The fitted line over- and under-predicts the data at different points along the curve. The high R-squared reinforces the point I make in my post about how to interpret R-squared: high R-squared values don’t always represent good models, and you need to check the residual plots!

Let’s try other models.

## Curve Fitting using Reciprocal Terms in Linear Regression

When your dependent variable descends to a floor or ascends to a ceiling (i.e., approaches an asymptote), you can try curve fitting using a reciprocal of an independent variable (1/X). Use a reciprocal term when the effect of an independent variable decreases as its value increases.

The value of this term decreases as the independent variable (X) increases because it is in the denominator. In other words, as X increases, the effect of this term decreases and the slope flattens. X cannot equal zero for this type of model because you can’t divide by zero.

For our data, the increases in Output flatten out as the Input increases. There appears to be an asymptote near 20. Let’s try curve fitting with a reciprocal term. In the data set, I created a column for 1/Input (InvInput). I fit a model with a linear reciprocal term (top) and another with a quadratic reciprocal term (bottom).

For our example dataset, the quadratic reciprocal model provides a much better fit to the curvature. The plots change the x-axis scale to 1/Input, which makes it difficult to see the natural curve in the data.

To show the natural scale of the data, I created the scatterplot below using the regression equations. Clearly, the green data points are closer to the quadratic line.

On the fitted line plots, the quadratic reciprocal model has a higher R-squared value (good) and a lower S-value (good) than the quadratic model. It also doesn’t display biased fitted values. This model provides the best fit to the data so far!

## Curve Fitting with Log Functions in Linear Regression

A log transformation allows linear models to fit curves that are otherwise possible only with nonlinear regression.

For instance, you can express the nonlinear function:

$$Y = e^{B_{0}} X_{1}^{B_{1}} X_{2}^{B_{2}}$$

In the linear form:

$$\ln Y = B_{0} + B_{1} \ln X_{1} + B_{2} \ln X_{2}$$

Your model can take logs on both sides of the equation, which is the double-log form shown above. Or, you can use a semi-log form which is where you take the log of only one side. If you take logs on the independent variable side of the model, it can be for all or a subset of the variables.

Using log transformations is a powerful method to fit curves. There are too many possibilities to cover them all. Choosing between a double-log and a semi-log model depends on your data and subject area. If you use this approach, you’ll need to do some investigation.

Let’s apply this to our example curve. A semi-log model can fit curves that flatten as the independent variable increases. Let’s see how a semi-log model fits our data!

In the fitted line plot below, I transformed the independent variable.

Like the first quadratic model we fit, the semi-log model provides a biased fit to the data points. Additionally, the S and R-squared values are very similar to that model. The model with the quadratic reciprocal term continues to provide the best fit.

So far, we’ve performed curve fitting using only linear models. Let’s switch gears and try a nonlinear regression model.

**Related post**: Using Log-Log Plots to Determine Whether Size Matters

## Curve Fitting with Nonlinear Regression

Nonlinear regression is a very powerful alternative to linear regression. It provides more flexibility in fitting curves because you can choose from a broad range of nonlinear functions. In fact, there are so many possible functions that the trick becomes finding the function that best fits the particular curve in your data.

Most statistical software packages that perform nonlinear regression have a catalog of nonlinear functions. You can use that to help pick the function. Further, because nonlinear regression uses an iterative algorithm to find the best solution, you might need to provide the starting values for all of the parameters in the function.

Our data approaches an asymptote, which helps us choose the nonlinear function from the catalog below.

The diagram in the catalog helps us determine the starting values. Theta1 is the asymptote. For our data, that’s near 20. Based on the shape of our curve, Theta2 and Theta3 must be both greater than 0.

Consequently, I’ll use the following starting values for the parameters:

- Theta1: 20
- Theta2: 1
- Theta3: 1

The fitted line plot below displays the nonlinear regression model.

The nonlinear model provides an excellent, unbiased fit to the data. Let’s compare models and determine which one fits our curve the best.

## Comparing the Curve-Fitting Effectiveness of the Different Models

R-squared is not valid for nonlinear regression. So, you can’t use that statistic to assess the goodness-of-fit for this model. However, the standard error of the regression (S) is valid for both linear and nonlinear models and serves as a great way to compare fits between these types of models. A small standard error of the regression indicates that the data points are closer to the fitted values.
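If your software doesn't report S, you can compute it from the observed and fitted values as the square root of the sum of squared residuals divided by the degrees of freedom. The sketch below does that. The function name and the five data points are made up for illustration, and R-squared is returned only for use with linear models.

```python
import math


def fit_statistics(observed, fitted, n_params):
    """Return (S, R-squared) from observed and fitted values.

    n_params counts every estimated coefficient, including the intercept.
    R-squared is only meaningful when the model is linear.
    """
    n = len(observed)
    mean_y = sum(observed) / n
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    sst = sum((y - mean_y) ** 2 for y in observed)
    s = math.sqrt(sse / (n - n_params))
    r_squared = 1 - sse / sst
    return s, r_squared


# Made-up observed and fitted values from a two-parameter model.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
fitted = [1.1, 1.9, 3.2, 3.8, 5.0]
s, r_squared = fit_statistics(observed, fitted, n_params=2)
print(f"S = {s:.4f}, R-squared = {r_squared:.1%}")
```

S is in the units of the dependent variable, so it tells you roughly how far the data points fall from the fitted values on average.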

| Model | R-squared (%) | S | Unbiased |
|---|---|---|---|
| Reciprocal – Quadratic | 99.9 | 0.134828 | Yes |
| Nonlinear | N/A | 0.179746 | Yes |
| Quadratic | 99.0 | 0.518387 | No |
| Semi-Log | 98.6 | 0.565293 | No |
| Reciprocal – Linear | 90.4 | 1.49655 | No |
| Linear | 84.0 | 1.93253 | No |

We have two models at the top that are equally good at producing accurate and unbiased predictions. These two models are the linear model that uses the quadratic reciprocal term and the nonlinear model.

The standard error of the regression for the nonlinear model (0.179746) is almost as low as the S for the reciprocal model (0.134828). The difference between them is so small that you can use either. However, with the linear model, you also obtain p-values for the independent variables (not shown) and R-squared.

For reporting purposes, these extra statistics can be handy. However, if the nonlinear model had provided a much better fit, we’d want to go with it even without those statistics. Learn why you can’t obtain P values for the variables in a nonlinear model.

**Related posts**: The Difference between Linear and Nonlinear Regression Models and How to Choose Between Linear and Nonlinear Regression.

## Closing Thoughts

Curve fitting isn’t that difficult. There are various methods you can use that provide great flexibility to fit most any type of curve. Further, identifying the best model involves assessing only a few statistics and the residual plots.

Setting up your study and collecting the data is a time intensive process. It’s definitely worth the effort to find the model that provides the best fit.

Any time you are specifying a model, you need to let subject-area knowledge and theory guide you. Additionally, some study areas might have standard practices and functions for modeling the data.

Here’s one final caution. You’d like a great fit, but you don’t want to overfit your regression model. An overfit model is too complex, it begins to model the random error, and it falsely inflates the R-squared. Adjusted R-squared and predicted R-squared are tools that can help you avoid this problem.
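As a quick illustration of how adjusted R-squared guards against this, the sketch below applies the standard formula to two hypothetical models. The sample size and R-squared values are invented for the example.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared; n_predictors excludes the intercept."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Hypothetical comparison on 12 observations: three extra terms
# raise R-squared only slightly.
simple = adjusted_r_squared(0.990, n_obs=12, n_predictors=2)
complex_ = adjusted_r_squared(0.991, n_obs=12, n_predictors=5)

print(f"2 terms: adjusted R-squared = {simple:.4f}")
print(f"5 terms: adjusted R-squared = {complex_:.4f}")
```

The larger model has the higher R-squared but the lower adjusted R-squared, which suggests the extra terms aren't earning their place.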

Learn how to choose the correct regression model!

If you’re learning regression, check out my Regression Tutorial!

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Josey says

I’m curious. I use non-linear regression to model the progression of prostate cancer. The independent variable is PSA (Prostate-specific antigen), a product of healthy and cancerous prostate cells. A half-life regression model works well in predicting whether the cancer is growing or, conversely, whether the cancer treatment is working. In the former case, cells numbers are doubling. In the latter case, cells numbers are halving.

Is it possible to model both at the same time with the same data? In other words, is it possible to estimate how many cells are doubling in number because they are resistant to treatment and how many are halving in number because the treatment is effective?

Ahmed says

Hi,

I have 5 variables with 3 levels and 1 variable with 2 levels. Based on that, I designed 18 mixes and tested one response at different ages (4 periods).

Now I have 8 columns: 6 for the variables (x1, x2, … x6), one for the age, and finally the response.

I selected the option Regression, Fit Regression Model, and obtained the full ANOVA table and the regression model.

The problem is that the main effects plot shows the relationship between these variables and the response as a straight line, and I don’t know how to change it to a curve.

Someone told me I need to divide the response by the square root of the age.

Please help me. I appreciate it.

Jim Frost says

Hi Ahmed, you need to fit a model that can handle the curvature, such as by including polynomial terms (e.g., X^2). Based on the analysis names, it sounds like you’re using Minitab. If so, include your variables on the main dialog box, then click Model, and there you can include the higher-order terms (polynomials and interactions). Then, when you create a main effect plot, it should display any curvature that is present. It sounds like either you’re not fitting those polynomial terms or, if you did, maybe curvature isn’t present? How do the residual plots look?

Patrik Silva says

Hi Jim, this post is definitely wonderful, because it provides the foundation of regression analysis…I always thought that linear regression is the one where the relationship looks like a straight line, which is the impression you said we might have.

Very, very clear!!! However i have some questions:

1) How do we convert back to the original unit of the data, and how can we interpret the coefficient, after using the transformation and polynomial terms.

2) When we transform the data, for example with a reciprocal transformation, the curve looks inverted. That’s OK, because that is what we want. But how can we interpret the graph? I think we lose all the power to explain the graph since it is not in a readable unit. For example, if the line has a positive slope, we say that as X increases, Y tends to increase as well. But with the reciprocal everything is inverted, and I got lost at this point.

I hope you understand my question and clarify it to me.

Thank you!

Patrik Silva

Jim Frost says

Hi Patrik,

You’re right, the names of the analyses (linear and nonlinear regression) really give the wrong impression about when you should use each one!

On to your questions.

For converting the transformed data back to the original units, you can do the calculations yourself. The precise calculations depend on the nature of the transformations that you’ve used. However, most statistical software should be able to back convert the values for you. So, I’d check that out first. One thing to note, if you use polynomials to fit curvature, you don’t need to back transform anything. All the units use their original scale. For example, suppose your model is: y = 2 + 2X + X^2. If your X = 2, then your y = (2 + 4 + 4 = 10). No transformation is necessary! However, it does make understanding the relationship between the X and the Y more complex because it changes.

In general, most statistical software can produce main effects plots that incorporate all the transformations. These plots display the relationship between an independent variable and the dependent variable while incorporating transformations and polynomials. If the relationship is curved, you’ll see it in these graphs. Looking at the graph helps you characterize the nature of the relationship, which brings me to your second question.

For the model that uses the reciprocal, I had to actually create the Linear vs Quadratic Reciprocal Model comparison graph by hand because the software couldn’t do that for reciprocal variables. However, once I created the graph, I can use it to describe the relationship because it’s all in natural units at that point.

The way that I’d characterize the quadratic reciprocal relationship is that as input increases, the output also increases. Initially the output increases at a very high rate, but as input increases, the rate at which output increases slows down as it asymptotically approaches ~20. For example, looking at the quadratic curve in that graph, you can see that increasing X by 1 unit corresponds to different changes in Y. If your input is at 1 and you increase it to 2, the output increases by quite a bit, from ~6 to ~10. However, if your input is at 10 and you increase it to 11, the output barely increases at all and stays right around ~19 for both settings. Our model incorporates all of that mathematically!

You raise an important point. While these transformations help you fit curves that are present in your data, they can obscure the reality behind the relationships. You need to transform the numbers back to their natural units and use graphs to understand the relationships. Fortunately, statistical software can automate that process. One advantage of nonlinear models is that you don’t transform the data but rather you specify the model that fits the data without transformation. However, the relationships, coefficients, etc. can be just as hard to understand! As an example of that, just look at the nonlinear model in this post and you’ll see that the equation is cryptic! The data aren’t transformed, but the equation is not easier to understand. Again, use the graph to better understand the nature of the relationship.

I hope this helps!

Patrik Silva says

It helped for sure!

Thank you Jim, for your prompt answer. I understood very well.

I see all the affection that you are giving us here.

Thank you for sharing your valuable (and I imagine scarce) time with me. I thank you very much.

Wisley Wan says

Hi Jim,

Thank you very much for the blog. It is very clear and helps me understand the issue better.

I tried the polynomial linear regression using Excel (with the IV standardized), but it is weird that the intercept is 0 while the other coefficients are both correct. The R-squared dropped to only 7%. I have checked but couldn’t find what went wrong…. Could you please give me some tips?

Thanks!

Wisley Wan says

Hi Jim, Please ignore my previous message – I’ve found it!

Jim Frost says

Hi Wisley,

I’m glad that you find the blog to be helpful. That means a lot to me! I’m also glad that you were able to try the example out yourself!

bob says

Hi, thanks for your helpful webpage! I’m running some statistical analyses in SPSS to check for both linear and non-linear effects (about 10 predictor variables and one outcome variable, all continuous) in a multiple linear regression. My goal is to see whether I can arrive at a better model for predicting the outcome variable if I check for possible non-linear effects. I took the following steps. Is this a good approach?

- I made a linear model with only the significant predictors (function: analyse, regression, linear, “backward”, “forward”).

- For the predictors where the literature suggests a possible quadratic effect, I made new variables by squaring them (so I did a transformation).

- I put the squared variables in the total model and checked whether they are significant.

thanks

Jim Frost says

Hi Bob,

A quick terminology issue before we get to your question. Linear and nonlinear have very specific meanings in statistics that refer to the form of the model and not whether the line is curved. I know that’s confusing! That’s why I wrote an entire post about that issue–The Difference Between Linear and Nonlinear Regression. In statistical terms, your model with squared variables is a linear model even though it will fit a curve!

Your general process sounds correct. Although, I have a few suggestions. For one thing, be sure to assess the residual plots for the model without the squared variables. If there is curvature that you need to fit, you’ll often see it in the residual plots. And, those plots are a great way to verify that you’re fitting any curvature adequately.

When you include the squared terms, check their p-values to see if they’re significant. That can help you determine whether those terms are good additions to the model.

Finally, it looks like you’re using a stepwise procedure to select your model. Just be aware that research shows that stepwise procedures generally only get you close to the best model but not exactly to it. Read my post about Stepwise Regression for more information. Stepwise chooses the final model based strictly on statistical significance. To specify the correct model, you typically need to use subject-area knowledge and theory to guide you along with the statistical measures. Read my post about Model Specification for more about this!

I hope this helps!

Adriano says

Hi Jim.

Thanks for all the straight-to-the-point information.

I have a rather not so simple question, and hoping for a as simple as possible explanation.

I have 10 predictors which affect a specific beer consumption, like: price, trade penetration, advertising, temperature etc.

What is a procedure of fitting a nonlinear regression with more predictors?

Thanks

Jim Frost says

Hi Adriano,

First off, we need to clarify whether you mean a true nonlinear model or a linear model that uses polynomials to fit curvature. There are huge differences between the two types. In fact, I’ve never heard of a true nonlinear model that has 10 predictors. One seems to be the most common case. So, I’m going to assume that you actually mean a linear model that uses polynomials and/or data transformation. To be sure about this, you should read my post, The Differences between Linear and Nonlinear Models. You’ll be able to tell the difference and know what type of model you’re using.

As for fitting a model with 10 predictors and potential curvature. Choosing a model to fit your data is known as model specification. You should read my post about it: Model Specification: Choosing the Correct Regression Model. This post goes over all the different statistical and non-statistical methods for choosing the best model. In addition to that information, given that you are particularly interested in modeling curvature, you should graph the individual relationships between each predictor and the response. This process will help you visually assess curvature and help you include the correct polynomial terms–or possibly use other methods to fit the curve. You should also think about the potential curvature from a theoretical basis. These are always important tasks to perform, but more so because you’re specifically concerned about curvature.

One final warning. Because you have 10 predictors and possible polynomials, you need to worry about overfitting your model. You need a certain number of observations per term in your model or you risk obtaining invalid, misleading results. Read my post about overfitting for more information.

I hope this helps!

Xie Chang says

Hello Jim,

I have one question regarding multiple regression. Actually I’m trying to find the energy (E) of an object using the mass (M) and the shape factor (s) multiplied by the velocity (V) as independent variables:

E= β+ β1(M)+ β2 (sV)^2

In this case, I’m using Excel (data analysis: regression option) to find (β, β1 and β2). The best fit (highest R^2) is obtained if the term (sV) is squared. in this case, it is still a Multiple linear regression or Multiple nonlinear regression because one of the terms is squared??

Thanks,

Xie

Jim Frost says

Hi Xie,

It is still linear regression analysis. To learn why, read my post about the difference between linear and nonlinear regression.

Have you tried including the sV term (not squared) as well?

Best of luck with your analysis!

Albert says

Hi Jim, Thank you for this thorough explanation!

Al says

Hi jim,

Why does a linear regression model with an x and an x-square term not have high multicollinearity automatically? The correlation between x and x-squared should be very high.

Jim Frost says

Hi Al,

Yes, you’re completely correct–and squared terms do cause very high multicollinearity. If you check the VIFs (that measure multicollinearity), you’ll find very high values. Fortunately, there is an easy solution to fix multicollinearity caused by these types of terms. Read my post about multicollinearity for more information!