In regression analysis, curve fitting is the process of specifying the model that provides the best fit to the specific curves in your dataset. Curved relationships between variables are not as straightforward to fit and interpret as linear relationships.

For linear relationships, as you increase the independent variable by one unit, the mean of the dependent variable always changes by a specific amount. This relationship holds true regardless of where you are in the observation space.

Unfortunately, the real world isn’t always nice and neat like this. Sometimes your data have curved relationships between variables. In a curved relationship, the change in the dependent variable associated with a one unit shift in the independent variable varies based on the location in the observation space. In other words, the effect of the independent variable is not a constant value.

Read my post where I discuss how to interpret regression coefficients for both linear and curvilinear relationships to see this in action.

In this post, I cover various curve fitting methods using both linear regression and nonlinear regression. I’ll also show you how to determine which model provides the best fit.

## Why You Need to Fit Curves in a Regression Model

The fitted line plot below illustrates the problem of using a linear relationship to fit a curved relationship. The R-squared is high, but the model is clearly inadequate. You need to do curve fitting!

When you have one independent variable, it’s easy to see the curvature using a fitted line plot. However, with multiple regression, curved relationships are not always so apparent. For these cases, residual plots are a key indicator for whether your model adequately captures curved relationships.

If you see a pattern in the residual plots, your model doesn’t provide an adequate fit for the data. A common reason is that your model incorrectly models the curvature. Plotting the residuals by each of your independent variables can help you locate the curved relationship.
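As a rough illustration of what a residual pattern looks like, the sketch below fits a straight line to synthetic data that flatten toward 20. The data and variable names are assumptions made up for this example; they are not the CurveFittingExample dataset.

```python
import numpy as np

# Synthetic data (an assumption, not the post's dataset): output rises toward ~20.
x = np.linspace(1, 10, 30)
y = 20 - 15 * np.exp(-0.5 * x)

# Fit a straight line and compute the residuals.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# With an adequate model, residuals scatter randomly around zero.
# Here they form an arch: negative at both ends, positive in the middle.
ends = np.concatenate([residuals[:3], residuals[-3:]])
middle = residuals[12:18]
print("mean residual at the ends:  ", round(ends.mean(), 3))
print("mean residual in the middle:", round(middle.mean(), 3))
```

A systematic arch like this is the kind of pattern that suggests the model is missing curvature.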

**Related post**: Check Your Residual Plots to Ensure Trustworthy Results!

In other cases, you might need to depend on subject-area knowledge to do curve fitting. Previous experience or research can tell you that the effect of one variable on another varies based on the value of the independent variable. Perhaps there’s a limit, threshold, or point of diminishing returns where the relationship changes?

To compare curve fitting methods, I’ll fit models to the curve in the fitted line plot above because it is not an easy fit. Let’s assume that these data are from a physical process with very precise measurements. We need to produce accurate predictions of the output for any specified input. You can download the CSV dataset for these examples: CurveFittingExample.

## Curve Fitting using Polynomial Terms in Linear Regression

Despite its name, you can fit curves using linear regression. The most common method is to include polynomial terms in the linear model. Polynomial terms are independent variables that you raise to a power, such as squared or cubed terms.

To determine the correct polynomial term to include, simply count the number of bends in the line. Take the number of bends in your curve and add one for the model order that you need. For example, quadratic terms model one bend while cubic terms model two. In practice, cubic terms are very rare, and I’ve never seen quartic terms or higher. When you use polynomial terms, consider standardizing your continuous independent variables.

**Linear**

**Quadratic**

**Cubic**

Our data has one bend. Let’s fit a linear model with a quadratic term.
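Here is a minimal sketch of adding a quadratic term to a linear model with NumPy least squares. The data are a synthetic stand-in (an assumption), so the R-squared values it prints will not match the post's output.

```python
import numpy as np

# Synthetic stand-in for the post's data (an assumption): one bend, flattening near 20.
x = np.linspace(1, 10, 30)
y = 20 - 15 * np.exp(-0.5 * x)

def r_squared(y, fitted):
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Design matrices: intercept + X, and intercept + X + X^2.
X_lin = np.column_stack([np.ones_like(x), x])
X_quad = np.column_stack([np.ones_like(x), x, x ** 2])

b_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
b_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)

r2_lin = r_squared(y, X_lin @ b_lin)
r2_quad = r_squared(y, X_quad @ b_quad)
print(f"linear R-squared:    {r2_lin:.3f}")
print(f"quadratic R-squared: {r2_quad:.3f}")
```

The squared column is just another independent variable, which is why this is still linear regression.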

The R-squared has increased, but the regression line doesn’t quite fit correctly. The fitted line over- and under-predicts the data at different points along the curve. The high R-squared reinforces the point I make in my post about how to interpret R-squared: high R-squared values don’t always represent good models, and you need to check the residual plots!

Let’s try other models.

## Curve Fitting using Reciprocal Terms in Linear Regression

When your dependent variable descends to a floor or ascends to a ceiling (i.e., approaches an asymptote), you can try curve fitting using a reciprocal of an independent variable (1/X). Use a reciprocal term when the effect of an independent variable decreases as its value increases.

The value of this term decreases as the independent variable (X) increases because it is in the denominator. In other words, as X increases, the effect of this term decreases and the slope flattens. X cannot equal zero for this type of model because you can’t divide by zero.

For our data, the increases in Output flatten out as the Input increases. There appears to be an asymptote near 20. Let’s try curve fitting with a reciprocal term. In the data set, I created a column for 1/Input (InvInput). I fit a model with a linear reciprocal term (top) and another with a quadratic reciprocal term (bottom).
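The two reciprocal models can be sketched as follows. The data are generated from made-up coefficients (an assumption) with no noise, so the fit recovers them exactly; `inv_input` plays the role of the InvInput column.

```python
import numpy as np

# Synthetic data (assumption): Output = 20 - 12/Input - 2/Input^2, asymptote at 20.
inp = np.linspace(1, 10, 30)
out = 20 - 12 / inp - 2 / inp ** 2

inv_input = 1 / inp  # the InvInput column; Input must never be zero

# Linear reciprocal model: Output = b0 + b1*(1/Input)
X1 = np.column_stack([np.ones_like(inp), inv_input])
b1, *_ = np.linalg.lstsq(X1, out, rcond=None)

# Quadratic reciprocal model: Output = b0 + b1*(1/Input) + b2*(1/Input)^2
X2 = np.column_stack([np.ones_like(inp), inv_input, inv_input ** 2])
b2, *_ = np.linalg.lstsq(X2, out, rcond=None)

print("quadratic reciprocal coefficients:", np.round(b2, 3))
print("estimated asymptote (intercept):", round(b2[0], 3))
```

Because 1/Input shrinks toward zero as Input grows, the intercept of a reciprocal model is its estimate of the asymptote.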

For our example dataset, the quadratic reciprocal model provides a much better fit to the curvature. The plots change the x-axis scale to 1/Input, which makes it difficult to see the natural curve in the data.

To show the natural scale of the data, I created the scatterplot below using the regression equations. Clearly, the green data points are closer to the quadratic line.

On the fitted line plots, the quadratic reciprocal model has a higher R-squared value (good) and a lower S-value (good) than the quadratic model. It also doesn’t display biased fitted values. This model provides the best fit to the data so far!

## Curve Fitting with Log Functions in Linear Regression

A log transformation allows linear models to fit curves that are otherwise possible only with nonlinear regression.

For instance, you can express the nonlinear function:

$$Y = e^{B_0} X_1^{B_1} X_2^{B_2}$$

In the linear form:

$$\ln Y = B_0 + B_1 \ln X_1 + B_2 \ln X_2$$

Your model can take logs on both sides of the equation, which is the double-log form shown above. Or, you can use a semi-log form which is where you take the log of only one side. If you take logs on the independent variable side of the model, it can be for all or a subset of the variables.
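To see the double-log form at work, this sketch generates data from the nonlinear function and recovers the parameters with ordinary least squares on the logged variables. The parameter values and sample are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.uniform(1, 10, 200)
x2 = rng.uniform(1, 10, 200)

# True parameters for the demo (hypothetical values).
B0, B1, B2 = 0.5, 1.5, -0.7
y = np.exp(B0) * x1 ** B1 * x2 ** B2

# Double-log form: ln Y = B0 + B1 ln X1 + B2 ln X2, which is linear in the parameters.
X = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
print(np.round(coef, 3))
```

For a semi-log model, you would log only one side, for example regressing `y` on `np.log(x1)`.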

Using log transformations is a powerful method to fit curves. There are too many possibilities to cover them all. Choosing between a double-log and a semi-log model depends on your data and subject area. If you use this approach, you’ll need to do some investigation.

Let’s apply this to our example curve. A semi-log model can fit curves that flatten as the independent variable increases. Let’s see how a semi-log model fits our data!

In the fitted line plot below, I transformed the independent variable.

Like the first quadratic model we fit, the semi-log model provides a biased fit to the data points. Additionally, the S and R-squared values are very similar to that model. The model with the quadratic reciprocal term continues to provide the best fit.

So far, we’ve performed curve fitting using only linear models. Let’s switch gears and try a nonlinear regression model.

**Related post**: Using Log-Log Plots to Determine Whether Size Matters

## Curve Fitting with Nonlinear Regression

Nonlinear regression is a very powerful alternative to linear regression. It provides more flexibility in fitting curves because you can choose from a broad range of nonlinear functions. In fact, there are so many possible functions that the trick becomes finding the function that best fits the particular curve in your data.

Most statistical software packages that perform nonlinear regression have a catalog of nonlinear functions. You can use that to help pick the function. Further, because nonlinear regression uses an iterative algorithm to find the best solution, you might need to provide the starting values for all of the parameters in the function.

Our data approaches an asymptote, which helps us choose the nonlinear function from the catalog below.

The diagram in the catalog helps us determine the starting values. Theta1 is the asymptote. For our data, that’s near 20. Based on the shape of our curve, Theta2 and Theta3 must both be greater than 0.

Consequently, I’ll use the following starting values for the parameters:

- Theta1: 20
- Theta2: 1
- Theta3: 1
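As a hedged sketch of how starting values feed an iterative fit, the code below uses SciPy's `curve_fit`, not the software used in this post. The functional form `theta1 - theta2 * exp(-theta3 * x)` is an assumption: it is one common asymptotic curve in which Theta1 is the asymptote, but the catalog entry isn't spelled out in the text. The data are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

def asymptotic(x, theta1, theta2, theta3):
    # Assumed functional form; theta1 is the asymptote.
    return theta1 - theta2 * np.exp(-theta3 * x)

x = np.linspace(1, 10, 30)
true_params = (20.0, 15.0, 0.5)   # hypothetical values for the synthetic data
y = asymptotic(x, *true_params)

start = [20.0, 1.0, 1.0]          # starting values from the list above
estimates, covariance = curve_fit(asymptotic, x, y, p0=start)
print(np.round(estimates, 3))
```

If the algorithm fails to converge on your own data, trying starting values closer to what the plot suggests is usually the first thing to check.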

The fitted line plot below displays the nonlinear regression model.

The nonlinear model provides an excellent, unbiased fit to the data. Let’s compare models and determine which one fits our curve the best.

## Comparing the Curve-Fitting Effectiveness of the Different Models

R-squared is not valid for nonlinear regression. So, you can’t use that statistic to assess the goodness-of-fit for this model. However, the standard error of the regression (S) is valid for both linear and nonlinear models and serves as a great way to compare fits between these types of models. A small standard error of the regression indicates that the data points are closer to the fitted values.

| Model | R-squared (%) | S | Unbiased |
|---|---|---|---|
| Reciprocal – Quadratic | 99.9 | 0.134828 | Yes |
| Nonlinear | N/A | 0.179746 | Yes |
| Quadratic | 99.0 | 0.518387 | No |
| Semi-Log | 98.6 | 0.565293 | No |
| Reciprocal – Linear | 90.4 | 1.49655 | No |
| Linear | 84.0 | 1.93253 | No |
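For reference, S is the square root of the sum of squared residuals divided by the residual degrees of freedom. A minimal sketch, using a tiny made-up example rather than the post's data:

```python
import numpy as np

def standard_error_of_regression(y, fitted, n_params):
    """S = sqrt(SSE / (n - p)), in the units of the dependent variable."""
    y = np.asarray(y, dtype=float)
    fitted = np.asarray(fitted, dtype=float)
    sse = np.sum((y - fitted) ** 2)
    return np.sqrt(sse / (len(y) - n_params))

# Tiny made-up example: 5 observations, a model with 2 estimated parameters.
y_obs = [1.0, 2.0, 3.0, 4.0, 5.0]
y_fit = [1.1, 1.9, 3.2, 3.8, 5.0]
s = standard_error_of_regression(y_obs, y_fit, n_params=2)
print(round(s, 4))
```

Because the function needs only observed values, fitted values, and a parameter count, the same calculation applies to linear and nonlinear models alike.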

We have two models at the top that are equally good at producing accurate and unbiased predictions. These two models are the linear model that uses the quadratic reciprocal term and the nonlinear model.

The standard error of the regression for the nonlinear model (0.179746) is almost as low as the S for the reciprocal model (0.134828). The difference between them is so small that you can use either. However, with the linear model, you also obtain p-values for the independent variables (not shown) and R-squared.

For reporting purposes, these extra statistics can be handy. However, if the nonlinear model had provided a much better fit, we’d want to go with it even without those statistics. Learn why you can’t obtain P values for the variables in a nonlinear model.

**Related posts**: The Difference between Linear and Nonlinear Regression Models and How to Choose Between Linear and Nonlinear Regression.

## Closing Thoughts

Curve fitting isn’t that difficult. There are various methods you can use that provide great flexibility to fit most any type of curve. Further, identifying the best model involves assessing only a few statistics and the residual plots.

Setting up your study and collecting the data is a time intensive process. It’s definitely worth the effort to find the model that provides the best fit.

Any time you are specifying a model, you need to let subject-area knowledge and theory guide you. Additionally, some study areas might have standard practices and functions for modeling the data.

Here’s one final caution. You’d like a great fit, but you don’t want to overfit your regression model. An overfit model is too complex, it begins to model the random error, and it falsely inflates the R-squared. Adjusted R-squared and predicted R-squared are tools that can help you avoid this problem.
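The sketch below computes all three statistics for an ordinary least squares fit so you can see how they react to extra terms. Predicted R-squared is calculated from the PRESS statistic using the hat-matrix diagonal. The data are synthetic and the function name `fit_stats` is made up for this example.

```python
import numpy as np

def fit_stats(X, y):
    """Return R-squared, adjusted R-squared and predicted R-squared for OLS.
    X must already contain a column of ones for the intercept."""
    n, k = X.shape                      # k counts the intercept
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    # Leave-one-out residuals via the hat matrix diagonal (PRESS statistic).
    hat_diag = np.sum((X @ np.linalg.pinv(X.T @ X)) * X, axis=1)
    press = np.sum((resid / (1 - hat_diag)) ** 2)
    pred_r2 = 1 - press / ss_tot
    return r2, adj_r2, pred_r2

# Synthetic data: a truly linear relationship plus noise.
rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 20)
y = 2 + 3 * x + rng.normal(0, 0.5, size=x.size)

results = {}
for degree in (1, 8):
    X = np.vander(x, degree + 1, increasing=True)
    results[degree] = fit_stats(X, y)
    r2, adj_r2, pred_r2 = results[degree]
    print(f"degree {degree}: R2={r2:.3f}  adjusted={adj_r2:.3f}  predicted={pred_r2:.3f}")
```

Regular R-squared can never go down when you add terms, which is why comparing it against the adjusted and predicted versions helps flag an overfit model.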

Learn how to choose the correct regression model!

If you’re learning regression, check out my Regression Tutorial!

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Soroush says

Hello Jim,

Does using a linear relationship to fit a curved relationship always cause errors to be heteroscedastic, nonnormal and correlated simultaneously? If so, when I encounter such problems (heteroscedastic, nonnormal or correlated errors), how can I tell whether they are a sign of using the wrong functional form (using a linear relationship to fit a curved relationship) or of something else?

Jim Frost says

Hi Soroush,

No, using linear models to fit curvature doesn’t necessarily cause those problems. In fact, I use a polynomial in a model that uses body mass index (BMI) to predict body fat percentage and, because the model fits the data well, it doesn’t have those problems. It all depends on how well your model fits the data. Sometimes linear models can adequately fit the curvature and there are no problems. Other times they can’t.

There are various ways to assess the functional form of your model. I’ve written about a bunch of them and rather than retyping them here, please go read my post about model specification for an overview with links to more details.

I hope this helps!

Myriam says

Hello!

Thanks for your help with this blog.

I would like to know if it is possible to compare two curves from two datasets instead of a curve with a non-linear regression.

For example if you have two sub-datasets A and B and you want to know if A and B are from the same data or not. Do you have a test that will let you know if the curves of A and B are fitting?

Jim Frost says

Hi Myriam,

I don’t know of a test for nonlinear regression. That’s assuming you’re using the statistically correct definition for nonlinear (not just fitting a curve but the form of the model itself is nonlinear). Given that you can’t obtain p-values out of the box for nonlinear parameter estimates, I doubt there is such a test “out of the box.” A statistician might be able to devise a custom test for particular functions. That’s my hunch, but I haven’t investigated that question specifically.

However, if you’re using linear regression to model curves, such as with polynomial terms, you’re in luck. You just need to combine the two datasets into one and create a categorical variable that identifies the original dataset. Then include the appropriate interaction terms. I discuss the process in my post about statistically comparing regression lines. I don’t discuss curvature in it, but you’d just need to include interaction terms for the terms that model the curvature in addition to the other interactions I mention.
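A rough sketch of that setup, assuming a quadratic model and two synthetic datasets labeled A and B (all values are made up): the indicator column and its interactions estimate how B's curve differs from A's.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.tile(np.linspace(-2, 2, 25), 2)
group = np.repeat([0.0, 1.0], 25)          # 0 = dataset A, 1 = dataset B

# Synthetic example: B has a different linear slope and a different curvature.
y = 1 + 2 * x + 0.5 * x**2 + group * (0.3 - 1.0 * x + 0.8 * x**2)
y = y + rng.normal(0, 0.1, size=x.size)

X = np.column_stack([
    np.ones_like(x), x, x**2,               # curve for dataset A
    group, group * x, group * x**2,         # how dataset B differs
])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("difference in intercept, slope, curvature:", np.round(coef[3:], 2))
```

This sketch only estimates the differences. Statistical software would also report p-values for the interaction terms, which is what tells you whether the curves differ significantly.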

Kiran S says

hi jim

I am Kiran. I want to fit a regression curve to data that contain four independent variables and one dependent variable. Of the four independent variables, two have nonlinear relationships and two have linear ones. I want to fit a curve to these data. Please help me.

Shaz says

Hi Jim,

I am getting my head around on understanding one thing. My dependent variable has lots of zeros. The residuals will potentially be non-normal in this case. But how can the zeros pose a challenge to non-linearity of the relationship? I am confusing the concept of non-linear relationship and non-normal errors in the context of a highly skewed dependent variable with lots of zeros.

Moreover, how can log transformation correct for non-linearity and non-normality here?

Jim Frost says

Hi Shaz,

This is a fairly complicated problem that affects some subject areas more than others. Unfortunately, I don’t have any first-hand knowledge of dealing with it, which limits how much I can help.

Typically, this type of problem goes beyond using transformation to resolve it.

If you are dealing with count data, you might look into zero inflated models. I discuss those a bit in my post about choosing the correct type of regression analysis. You’ll find that in the count data section at the end.

Another method I’ve heard a bit about is to separate your dataset into two datasets. One dataset indicates the presence of whatever you’re measuring. The other is the amount. You create separate models for each. Model the presence dataset using logistic regression and the other with ordinary regression. Then, you merge the models. That might or might not work for your data.

This issue is something that will probably take a bit of research on your part. What I write above is really the extent of my knowledge. I’m sure there are also a variety of subject specific variations on this issue as well.

I hope this helps to at least point you in the right direction!

cristina says

how could I fit a nonlinear data set to a linear function?

Jim Frost says

Hi Cristina,

In the first portion of this post, I show you a variety of ways that you can fit curves using a linear model.

Al says

Hi jim,

Why does a linear regression model with an x and an x-square term not have high multicollinearity automatically? The correlation between x and x-squared should be very high.

Jim Frost says

Hi Al,

Yes, you’re completely correct–and squared terms do cause very high multicollinearity. If you check the VIFs (that measure multicollinearity), you’ll find very high values. Fortunately, there is an easy solution to fix multicollinearity caused by these types of terms. Read my post about multicollinearity for more information!
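As a quick numerical check, the sketch below compares the correlation between X and X-squared before and after centering X. Centering is used here as an assumed stand-in for the fix; standardizing also divides by the standard deviation. The X values are made up.

```python
import numpy as np

x = np.linspace(1, 10, 30)
r_raw = np.corrcoef(x, x**2)[0, 1]

xc = x - x.mean()                  # centering (standardizing also divides by the SD)
r_centered = np.corrcoef(xc, xc**2)[0, 1]

# VIF for a two-predictor model is 1 / (1 - r^2).
vif_raw = 1 / (1 - r_raw**2)
vif_centered = 1 / (1 - r_centered**2)

print(f"raw:      r = {r_raw:.3f}, VIF = {vif_raw:.1f}")
print(f"centered: r = {r_centered:.3f}, VIF = {vif_centered:.1f}")
```

With evenly spaced X values like these, centering removes the correlation almost entirely. With skewed data, it typically reduces the correlation rather than eliminating it.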

Al says

Hi Jim

Albert says

Hi Jim, Thank you for this thorough explanation!

Xie Chang says

Hello Jim,

I have one question regarding multiple regression. Actually I’m trying to find the energy (E) of an object using the mass (M) and the shape factor (s) multiplied by the velocity (V) as independent variables:

E= β+ β1(M)+ β2 (sV)^2

In this case, I’m using Excel (data analysis: regression option) to find β, β1 and β2. The best fit (highest R^2) is obtained if the term (sV) is squared. In this case, is it still multiple linear regression, or is it multiple nonlinear regression because one of the terms is squared?

Thanks,

Xie

Jim Frost says

Hi Xie,

It is still linear regression analysis. To learn why, read my post about the difference between linear and nonlinear regression.
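To illustrate, the sketch below fits a model of that form with ordinary least squares. The squared term is computed first and then treated as just another column, so the model stays linear in β, β1 and β2. The data and coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.uniform(1, 5, 40)
s = rng.uniform(0.5, 1.5, 40)
V = rng.uniform(1, 10, 40)

# Hypothetical coefficients for illustration.
E = 3.0 + 2.0 * M + 0.5 * (s * V) ** 2

# Columns: intercept, M, and the precomputed (sV)^2 term.
X = np.column_stack([np.ones_like(M), M, (s * V) ** 2])
beta, *_ = np.linalg.lstsq(X, E, rcond=None)
print(np.round(beta, 3))
```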

Have you tried including the sV term (not squared) as well?

Best of luck with your analysis!

Adriano says

Hi Jim.

Thanks for all the straight-to-the-point information.

I have a not-so-simple question, and I’m hoping for an explanation that is as simple as possible.

I have 10 predictors which affect a specific beer consumption, like: price, trade penetration, advertising, temperature etc.

What is a procedure of fitting a nonlinear regression with more predictors?

Thanks

Jim Frost says

Hi Adriano,

First off, we need to clarify whether you mean a true nonlinear model or a linear model that uses polynomials to fit curvature. There are huge differences between the two types. In fact, I’ve never heard of a true nonlinear model that has 10 predictors. One seems to be the most common case. So, I’m going to assume that you actually mean a linear model that uses polynomials and/or data transformation. To be sure about this, you should read my post, The Differences between Linear and Nonlinear Models. You’ll be able to tell the difference and know what type of model you’re using.

As for fitting a model with 10 predictors and potential curvature. Choosing a model to fit your data is known as model specification. You should read my post about it: Model Specification: Choosing the Correct Regression Model. This post goes over all the different statistical and non-statistical methods for choosing the best model. In addition to that information, given that you are particularly interested in modeling curvature, you should graph the individual relationships between each predictor and the response. This process will help you visually assess curvature and help you include the correct polynomial terms–or possibly use other methods to fit the curve. You should also think about the potential curvature from a theoretical basis. These are always important tasks to perform, but more so because you’re specifically concerned about curvature.

One final warning. Because you have 10 predictors and possible polynomials, you need to worry about overfitting your model. You need a certain number of observations per term in your model or you risk obtaining invalid, misleading results. Read my post about overfitting for more information.

I hope this helps!

bob says

Hi, thanks for your helpful webpage! I’m running some statistical analysis in SPSS to check for both linear and non-linear effects (about 10 predictor variables and one outcome variable, all continuous) in a multiple linear regression. My goal is to see whether I can arrive at a better model for predicting the outcome variable if I check for possible non-linear effects. I took the following steps. Is this a good approach?

-made a linear model with only the significant predictors (function, analyse, regression, linear, “backward”, “forward”)

-made an extra variable for each predictor where the literature suggests a possible quadratic effect, so I created new variables by squaring them (so I did a transformation)

-I put the squared variables in the total model, and checked whether they are significant

thanks

Jim Frost says

Hi Bob,

A quick terminology issue before we get to your question. Linear and nonlinear have very specific meanings in statistics that refer to the form of the model and not whether the line is curved. I know that’s confusing! That’s why I wrote an entire post about that issue–The Difference Between Linear and Nonlinear Regression. In statistical terms, your model with squared variables is a linear model even though it will fit a curve!

Your general process sounds correct. Although, I have a few suggestions. For one thing, be sure to assess the residual plots for the model without the squared variables. If there is curvature that you need to fit, you’ll often see it in the residual plots. And, those plots are a great way to verify that you’re fitting any curvature adequately.

When you include the squared terms, check their p-values to see if they’re significant. That can help you determine whether those terms are good additions to the model.

Finally, it looks like you’re using a stepwise procedure to select your model. Just be aware that research shows that stepwise procedures generally only get you close to the best model but not exactly to it. Read my post about Stepwise Regression for more information. Stepwise chooses the final model based strictly on statistical significance. To specify the correct model, you typically need to use subject-area knowledge and theory to guide you along with the statistical measures. Read my post about Model Specification for more about this!

I hope this helps!

Wisley Wan says

Hi Jim, Please ignore my previous message – I’ve found it!

Jim Frost says

Hi Wisley,

I’m glad that you find the blog to be helpful. That means a lot to me! I’m also glad that you were able to try the example out yourself!

Wisley Wan says

Hi Jim,

Thank you very much for the blog. It is very clear and helps me understand the issue better.

I tried the polynomial linear regressions using Excel (standardized the IV), but it is weird that the intercept is 0 while the other coefficients are both correct. The R-squared dropped to only 7%. I have checked but couldn’t find what went wrong. Could you please give me some tips?

Thanks!

Patrik Silva says

It helped for sure!

Thank you Jim, for your prompt answer. I understood very well.

I see all the affection that you are giving us here.

Thank you for sharing your valuable (and I imagine scarce) time with me. I thank you very much.

Patrik Silva says

Hi Jim, this post is definitely wonderful, because it provides the foundation of regression analysis. I always thought that linear regression was the one where the relationship looks like a line, just as you said we might think.

Very, very clear!!! However, I have some questions:

1) How do we convert back to the original units of the data, and how can we interpret the coefficients, after using transformations and polynomial terms?

2) When transforming the data, for example with a reciprocal transformation, the curve looks inverted. That’s OK, because that is what we want. But how can we interpret the graph? I think we lose all the power to explain the graph since it is not in a readable unit. For example, if the line has a positive slope, we say that as X increases, the Y variable tends to increase also. But with the reciprocal, everything is inverted. I got lost at this point.

I hope you understand my question and clarify it to me.

Thank you!

Patrik Silva

Jim Frost says

Hi Patrik,

You’re right, the names of the analyses (linear and nonlinear regression) really give the wrong impression about when you should use each one!

On to your questions.

For converting the transformed data back to the original units, you can do the calculations yourself. The precise calculations depend on the nature of the transformations that you’ve used. However, most statistical software should be able to back convert the values for you. So, I’d check that out first. One thing to note, if you use polynomials to fit curvature, you don’t need to back transform anything. All the units use their original scale. For example, suppose your model is: y = 2 + 2X + X^2. If your X = 2, then your y = (2 + 4 + 4 = 10). No transformation is necessary! However, it does make understanding the relationship between the X and the Y more complex because it changes.
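A tiny sketch of that example model shows both points: the prediction is already in natural units, and the effect of a one-unit change in X depends on where you start.

```python
def y(x):
    # Example model from the reply above: y = 2 + 2X + X^2
    return 2 + 2 * x + x ** 2

print("y at X = 2:", y(2))
for start in (2, 5):
    change = y(start + 1) - y(start)
    print(f"X: {start} -> {start + 1}, change in y: {change}")
```

Moving X from 2 to 3 changes y by 7, while moving it from 5 to 6 changes y by 13.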

In general, most statistical software can produce main effects plots that incorporate all the transformations. These plots display the relationship between an independent variable and the dependent variable while incorporating transformations and polynomials. If the relationship is curved, you’ll see it in these graphs. Looking at the graph helps you characterize the nature of the relationship, which brings me to your second question.

For the model that uses the reciprocal, I had to actually create the Linear vs Quadratic Reciprocal Model comparison graph by hand because the software couldn’t do that for reciprocal variables. However, once I created the graph, I can use it to describe the relationship because it’s all in natural units at that point.

The way that I’d characterize the quadratic reciprocal relationship is that as input increases, the output also increases. Initially the output increases at a very high rate, but as input increases, the rate by which output increases slows down as it asymptotically approaches ~20. For example, looking at the quadratic curve in that graph, you can see that increasing X by 1 unit corresponds to different changes in Y. If your input is at 1 and you increase it to 2, the output increases by quite a bit, from ~6 to ~10. However, if your input is at 10 and you increase it to 11, the output barely increases at all and stays right around ~19 for both settings. Our model incorporates all of that mathematically!

You raise an important point. While these transformations help you fit curves that are present in your data, they can obscure the reality behind the relationships. You need to transform the numbers back to their natural units and use graphs to understand the relationships. Fortunately, statistical software can automate that process. One advantage of nonlinear models is that you don’t transform the data but rather you specify the model that fits the data without transformation. However, the relationships, coefficients, etc. can be just as hard to understand! As an example of that, just look at the nonlinear model in this post and you’ll see that the equation is cryptic! The data aren’t transformed but the equation is not easier to understand. Again, use the graph to better understand the nature of the relationship.

I hope this helps!

Ahmed says

HI

I have 5 variables with 3 levels and 1 variable with 2 levels. Based on that, I designed 18 mixes and tested one response at different ages (4 periods).

Now, I have 8 columns: 6 for the variables (x1, x2, … x6), one for the age, and finally the response.

I selected the option Regression > Fit Regression Model and obtained the full ANOVA table and regression model.

The problem is that the relationship between these variables and the response in the main effects plot is a straight line, and I don’t know how to change it to a curve.

Someone told me that I need to divide the response by the square root of the age.

Please help me, I would appreciate it.

Jim Frost says

Hi Ahmed, you need to fit a model that can handle the curvature, such as by including polynomial terms (e.g., X^2). Based on the analysis names, it sounds like you’re using Minitab. If so, include your variables on the main dialog box, then click Model, and there you can include the higher-order terms (polynomials and interactions). Then, when you create a main effect plot, it should display any curvature that is present. It sounds like either you’re not fitting those polynomial terms or, if you did, maybe curvature isn’t present? How do the residual plots look?

Josey says

I’m curious. I use non-linear regression to model the progression of prostate cancer. The independent variable is PSA (Prostate-specific antigen), a product of healthy and cancerous prostate cells. A half-life regression model works well in predicting whether the cancer is growing or, conversely, whether the cancer treatment is working. In the former case, cell numbers are doubling. In the latter case, cell numbers are halving.

Is it possible to model both at the same time with the same data? In other words, is it possible to estimate how many cells are doubling in number because they are resistant to treatment and how many are halving in number because the treatment is effective?