In regression analysis, curve fitting is the process of specifying the model that provides the best fit to the specific curves in your dataset. Curved relationships between variables are not as straightforward to fit and interpret as linear relationships.
For linear relationships, as you increase the independent variable by one unit, the mean of the dependent variable always changes by a specific amount. This relationship holds true regardless of where you are in the observation space.
Unfortunately, the real world isn’t always nice and neat like this. Sometimes your data have curved relationships between variables. In a curved relationship, the change in the dependent variable associated with a one-unit shift in the independent variable varies based on the location in the observation space. In other words, the effect of the independent variable is not a constant value.
In this post, I cover various curve fitting methods using both linear regression and nonlinear regression. I’ll also show you how to determine which model provides the best fit.
Related post: What are Independent and Dependent Variables?
Why You Need to Fit Curves in a Regression Model
The fitted line plot below illustrates the problem of using a linear relationship to fit a curved relationship. The R-squared is high, but the model is clearly inadequate. You need to do curve fitting!
When you have one independent variable, it’s easy to see the curvature using a fitted line plot. However, with multiple regression, curved relationships are not always so apparent. For these cases, residual plots are a key indicator for whether your model adequately captures curved relationships.
If you see a pattern in the residual plots, your model doesn’t provide an adequate fit for the data. A common reason is that your model incorrectly models the curvature. Plotting the residuals by each of your independent variables can help you locate the curved relationship.
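To make that residual check concrete, here’s a minimal sketch in Python. The data are hypothetical stand-ins (the actual example dataset isn’t reproduced here), generated to mimic a curve that flattens toward an asymptote near 20. A straight line fit to curved data leaves long runs of same-signed residuals rather than random scatter:

```python
import numpy as np

# Hypothetical stand-in data: Output rises steeply, then flattens
# toward an asymptote near 20 (mimicking the curve in the post).
x = np.linspace(1.0, 20.0, 40)
y = 20.0 - 15.0 * np.exp(-0.3 * x)

# Fit a straight line, then inspect the residuals.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# A curved relationship leaves a systematic pattern: here the line
# over-predicts at both ends and under-predicts in the middle, so
# the residuals trace an arc instead of scattering around zero.
```

In a residual plot, that sign pattern shows up as a visible arc, which is the signal that the model isn’t capturing the curvature.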
In other cases, you might need to depend on subject-area knowledge to do curve fitting. Previous experience or research can tell you that the effect of one variable on another varies based on the value of the independent variable. Perhaps there’s a limit, threshold, or point of diminishing returns where the relationship changes?
To compare curve fitting methods, I’ll fit models to the curve in the fitted line plot above because it is not an easy fit. Let’s assume that these data are from a physical process with very precise measurements. We need to produce accurate predictions of the output for any specified input. You can download the CSV dataset for these examples: CurveFittingExample.
Curve Fitting using Polynomial Terms in Linear Regression
Despite its name, you can fit curves using linear regression. The most common method is to include polynomial terms in the linear model. Polynomial terms are independent variables that you raise to a power, such as squared or cubed terms.
To determine the correct polynomial term to include, simply count the number of bends in the line. Take the number of bends in your curve and add one for the model order that you need. For example, quadratic terms model one bend while cubic terms model two. In practice, cubic terms are very rare, and I’ve never seen quartic terms or higher. When you use polynomial terms, consider standardizing your continuous independent variables.
Our data has one bend. Let’s fit a linear model with a quadratic term.
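To sketch what that looks like in code (a minimal example using hypothetical stand-in data, not the actual example dataset): the model stays linear in the coefficients, so the squared term is simply one more column in the design matrix and ordinary least squares still applies.

```python
import numpy as np

# Hypothetical stand-in data with one bend (not the actual dataset).
x = np.linspace(1.0, 20.0, 40)
y = 20.0 - 15.0 * np.exp(-0.3 * x)

# The model is still linear in the coefficients:
#   Output = b0 + b1*Input + b2*Input^2
# so the squared term is just another column in the design matrix.
X = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef

fitted = X @ coef
r_squared = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)
```

Because the curve bends downward toward its ceiling, the quadratic coefficient comes out negative, and the R-squared is high even though the fit is visibly biased, which is exactly the situation described below.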
The R-squared has increased, but the regression line doesn’t quite fit correctly. The fitted line over- and under-predicts the data at different points along the curve. The high R-squared reinforces the point I make in my post about how to interpret R-squared: high R-squared values don’t always represent good models, and you need to check the residual plots!
Let’s try other models.
Curve Fitting using Reciprocal Terms in Linear Regression
When your dependent variable descends to a floor or ascends to a ceiling (i.e., approaches an asymptote), you can try curve fitting using a reciprocal of an independent variable (1/X). Use a reciprocal term when the effect of an independent variable decreases as its value increases.
The value of this term decreases as the independent variable (X) increases because it is in the denominator. In other words, as X increases, the effect of this term decreases and the slope flattens. X cannot equal zero for this type of model because you can’t divide by zero.
For our data, the increases in Output flatten out as the Input increases. There appears to be an asymptote near 20. Let’s try curve fitting with a reciprocal term. In the data set, I created a column for 1/Input (InvInput). I fit a model with a linear reciprocal term (top) and another with a quadratic reciprocal term (bottom).
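Here’s a minimal sketch of both reciprocal models in Python, again using hypothetical stand-in data that flatten toward an asymptote near 20. The `InvInput` column from the dataset becomes a `1/x` array, and each model is still ordinary least squares:

```python
import numpy as np

# Hypothetical stand-in data: Output flattens toward an asymptote near 20.
x = np.linspace(1.0, 20.0, 40)
y = 20.0 - 15.0 * np.exp(-0.3 * x)

inv_x = 1.0 / x  # the reciprocal term (the InvInput column in the dataset)

def fit_s(design, y):
    """Ordinary least squares; returns the coefficients and the standard
    error of the regression S = sqrt(SSE / (n - p))."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    s = np.sqrt(np.sum(resid**2) / (len(y) - design.shape[1]))
    return coef, s

# Linear reciprocal model: Output = b0 + b1*(1/X)
_, s_linear = fit_s(np.column_stack([np.ones_like(x), inv_x]), y)

# Quadratic reciprocal model: Output = b0 + b1*(1/X) + b2*(1/X)^2
_, s_quad = fit_s(np.column_stack([np.ones_like(x), inv_x, inv_x**2]), y)
```

On data shaped like this, the quadratic reciprocal model produces a noticeably smaller S than the linear reciprocal model, matching the comparison below.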
For our example dataset, the quadratic reciprocal model provides a much better fit to the curvature. The plots change the x-axis scale to 1/Input, which makes it difficult to see the natural curve in the data.
To show the natural scale of the data, I created the scatterplot below using the regression equations. Clearly, the green data points are closer to the quadratic line.
On the fitted line plots, the quadratic reciprocal model has a higher R-squared value (good) and a lower S-value (good) than the quadratic model. It also doesn’t display biased fitted values. This model provides the best fit to the data so far!
Curve Fitting with Log Functions in Linear Regression
A log transformation allows linear models to fit curves that are otherwise possible only with nonlinear regression.
For instance, you can express the nonlinear function:

Y = e^(B0) · X1^(B1) · X2^(B2)

in the linear form:

ln(Y) = B0 + B1·ln(X1) + B2·ln(X2)
Your model can take logs on both sides of the equation, which is the double-log form shown above. Or you can use a semi-log form, where you take the log of only one side. If you take logs on the independent variable side of the model, you can do so for all or just a subset of the variables.
Using log transformations is a powerful method to fit curves. There are too many possibilities to cover them all. Choosing between a double-log and a semi-log model depends on your data and subject area. If you use this approach, you’ll need to do some investigation.
Let’s apply this to our example curve. A semi-log model can fit curves that flatten as the independent variable increases. Let’s see how a semi-log model fits our data!
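As a quick sketch (hypothetical stand-in data again, not the actual dataset), the semi-log fit is just ordinary least squares after transforming the independent variable:

```python
import numpy as np

# Hypothetical stand-in data for the flattening curve in the post.
x = np.linspace(1.0, 20.0, 40)
y = 20.0 - 15.0 * np.exp(-0.3 * x)

# Semi-log model: transform only the independent variable.
#   Output = b0 + b1 * ln(Input)
X = np.column_stack([np.ones_like(x), np.log(x)])
(b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = b0 + b1 * np.log(x)
```

The slope on ln(Input) is positive for an increasing curve, and the logarithm’s diminishing growth is what lets the fitted line flatten out at larger inputs.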
In the fitted line plot below, I transformed the independent variable.
Like the first quadratic model we fit, the semi-log model provides a biased fit to the data points. Additionally, the S and R-squared values are very similar to that model. The model with the quadratic reciprocal term continues to provide the best fit.
So far, we’ve performed curve fitting using only linear models. Let’s switch gears and try a nonlinear regression model.
Related post: Using Log-Log Plots to Determine Whether Size Matters
Curve Fitting with Nonlinear Regression
Nonlinear regression is a very powerful alternative to linear regression. It provides more flexibility in fitting curves because you can choose from a broad range of nonlinear functions. In fact, there are so many possible functions that the trick becomes finding the function that best fits the particular curve in your data.
Most statistical software packages that perform nonlinear regression have a catalog of nonlinear functions. You can use that to help pick the function. Further, because nonlinear regression uses an iterative algorithm to find the best solution, you might need to provide the starting values for all of the parameters in the function.
Our data approach an asymptote, which helps us choose a nonlinear function from the catalog below.
The diagram in the catalog helps us determine the starting values. Theta1 is the asymptote. For our data, that’s near 20. Based on the shape of our curve, Theta2 and Theta3 must both be greater than 0.
Consequently, I’ll use the following starting values for the parameters:
- Theta1: 20
- Theta2: 1
- Theta3: 1
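To sketch the nonlinear fit in code, I’ll use SciPy’s `curve_fit`, which runs the same kind of iterative algorithm. The asymptotic function below, Theta1 − Theta2·exp(−Theta3·X), is one common catalog choice for data that approach a ceiling; treat it as an assumed form, not necessarily the exact function from the software’s catalog, and the data are hypothetical stand-ins for the example dataset.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical stand-in data generated from an asymptotic curve near 20.
x = np.linspace(1.0, 20.0, 40)
y = 20.0 - 15.0 * np.exp(-0.3 * x)

def asymptotic(x, theta1, theta2, theta3):
    # An assumed asymptotic form: theta1 is the ceiling the curve approaches.
    return theta1 - theta2 * np.exp(-theta3 * x)

# Starting values from the post: Theta1 near the asymptote (20),
# Theta2 and Theta3 both positive.
params, _ = curve_fit(asymptotic, x, y, p0=[20.0, 1.0, 1.0])
theta1, theta2, theta3 = params
```

The starting values matter because the algorithm iterates from them; a poor choice can fail to converge or land on a bad solution, which is why the catalog diagram is worth consulting first.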
The fitted line plot below displays the nonlinear regression model.
The nonlinear model provides an excellent, unbiased fit to the data. Let’s compare models and determine which one fits our curve the best.
Comparing the Curve-Fitting Effectiveness of the Different Models
R-squared is not valid for nonlinear regression. So, you can’t use that statistic to assess the goodness-of-fit for this model. However, the standard error of the regression (S) is valid for both linear and nonlinear models and serves as a great way to compare fits between these types of models. A small standard error of the regression indicates that the data points are closer to the fitted values.
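S is simple to compute from any model’s fitted values, which is exactly why it works for both linear and nonlinear fits. A minimal sketch, with made-up numbers rather than the post’s dataset:

```python
import numpy as np

def standard_error_of_regression(y, fitted, n_params):
    """S = sqrt(SSE / (n - p)): the typical distance between the data
    points and the fitted values, in the units of the dependent variable."""
    y = np.asarray(y)
    fitted = np.asarray(fitted)
    sse = np.sum((y - fitted) ** 2)
    return np.sqrt(sse / (len(y) - n_params))

# Tiny worked example (hypothetical numbers): 4 observations,
# a 2-parameter model (intercept and slope).
y = [2.0, 4.0, 6.0, 8.0]
fitted = [2.1, 3.9, 6.2, 7.8]
s = standard_error_of_regression(y, fitted, n_params=2)
```

Because S only needs observed values, fitted values, and the parameter count, you can put a linear and a nonlinear model side by side on the same scale, which is what the comparison table below does.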
| Model | R-squared (%) | S | Unbiased? |
| --- | --- | --- | --- |
| Reciprocal – Quadratic | 99.9 | 0.134828 | Yes |
| Reciprocal – Linear | 90.4 | 1.49655 | No |
We have two models at the top that are equally good at producing accurate and unbiased predictions. These two models are the linear model that uses the quadratic reciprocal term and the nonlinear model.
The standard error of the regression for the nonlinear model (0.179746) is almost as low as the S for the reciprocal model (0.134828). The difference between them is so small that you can use either. However, with the linear model, you also obtain p-values for the independent variables (not shown) and R-squared.
For reporting purposes, these extra statistics can be handy. However, if the nonlinear model had provided a much better fit, we’d want to go with it even without those statistics. Learn why you can’t obtain p-values for the variables in a nonlinear model.
Curve fitting isn’t that difficult. There are various methods you can use that provide great flexibility to fit almost any type of curve. Further, identifying the best model involves assessing only a few statistics and the residual plots.
Setting up your study and collecting the data is a time intensive process. It’s definitely worth the effort to find the model that provides the best fit.
Any time you are specifying a model, you need to let subject-area knowledge and theory guide you. Additionally, some study areas might have standard practices and functions for modeling the data.
Here’s one final caution. You’d like a great fit, but you don’t want to overfit your regression model. An overfit model is too complex: it begins to model the random error, and it falsely inflates the R-squared. Adjusted R-squared and predicted R-squared are tools that can help you avoid this problem.
If you’re learning regression, check out my Regression Tutorial!
Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.