Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. In regression analysis, overfitting can produce misleading R-squared values, regression coefficients, and p-values. In this post, I explain how overfitting models is a problem and how you can identify and avoid it.

Overfit regression models have too many terms for the number of observations. When this occurs, the regression coefficients represent the noise rather than the genuine relationships in the population.

That’s problematic by itself. However, there is another problem. Each sample has its own unique quirks. Consequently, a regression model that becomes tailor-made to fit the random quirks of one sample is unlikely to fit the random quirks of another sample. Thus, overfitting a regression model reduces its generalizability outside the original dataset.

Taking the above in combination, an overfit regression model describes the noise, and it’s not applicable outside the sample. That’s not very helpful, right? I’d really like these problems to sink in because overfitting often occurs when analysts chase a high R-squared. In fact, inflated R-squared values are a *symptom* of overfit models! Despite the misleading results, it can be difficult for analysts to give up that nice high R-squared value.

When choosing a regression model, our goal is to approximate the true model for the whole population. If we accomplish this goal, our model should fit most random samples drawn from that population. In other words, our results are more generalizable—we can expect that the model will fit other samples.

**Related post**: Model Specification: Choosing the Correct Regression Model

## Graphical Illustration of Overfitting Regression Models

The image below illustrates an overfit model. The green line represents the true relationship between the variables. The random error inherent in the data causes the data points to fall randomly around the green fit line. The red line represents an overfit model. This model is too complex, and it attempts to explain the random error present in the data.

The example above is very clear. However, it’s not always that obvious. Below, the fitted line plot shows an overfit model. In the graph, it appears that the model explains a good proportion of the dependent variable variance. Unfortunately, this is an overfit model, and I’ll show you how to detect it shortly.

If you have more than two independent variables, it’s not possible to graph them in this manner, which makes it harder to detect.
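The situation in this section can be imitated with simulated data. The following is a minimal sketch, not the data behind the plots in this post: it assumes a true linear relationship (y = 2 + 3x plus normal noise), a fitting sample of 12 points, and an 8th-order polynomial as the overly complex model. All of those values, and the function names, are illustrative assumptions.

```python
import numpy as np

def r_squared(y, y_hat):
    """R-squared of predictions y_hat for observed values y."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def draw_sample(rng, n):
    """Draw n points from the assumed true model: y = 2 + 3x + noise."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=n)
    return x, y

def fit_and_score(degree, train, test):
    """Fit a polynomial on train; return (R-squared on train, R-squared on test)."""
    x_tr, y_tr = train
    x_te, y_te = test
    coefs = np.polyfit(x_tr, y_tr, degree)
    return (r_squared(y_tr, np.polyval(coefs, x_tr)),
            r_squared(y_te, np.polyval(coefs, x_te)))

rng = np.random.default_rng(0)
train = draw_sample(rng, 12)    # small sample used to fit the models
test = draw_sample(rng, 1000)   # fresh sample from the same population

for degree in (1, 8):
    r2_train, r2_test = fit_and_score(degree, train, test)
    print(f"order {degree}: R-squared on fitting sample = {r2_train:.2f}, "
          f"on new sample = {r2_test:.2f}")
```

Because the linear model is nested inside the higher-order one, the complex fit can never have a lower R-squared on the sample it was fit to. The number to watch is how far R-squared drops when the same equation is applied to the new sample.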

## How Overfitting a Model Causes these Problems

Let’s go back to the basics of inferential statistics to understand how overfitting models causes problems. You use inferential statistics to draw conclusions about a population from a random sample. An important consideration is that the sample size limits the quantity and quality of the conclusions you can draw about a population. The more you need to learn, the larger the sample must be.

This concept is fairly intuitive. Suppose we have a total sample size of 20 and we need to estimate one population mean using a 1-sample t-test. We’ll probably obtain a good estimate. However, if we want to use a 2-sample t-test to estimate the means of two populations, it’s not as good because we have only ten observations to estimate each mean. If we want to estimate three or more means using one-way ANOVA, it becomes pretty bad.

As the number of observations per estimate decreases (20, 10, 6.7, etc.), the estimates become more erratic. Furthermore, a new sample is unlikely to replicate the inconsistent estimates produced by the smaller sample sizes.
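A small simulation can make the "observations per estimate" idea concrete. This is a sketch under stated assumptions: a normal population with standard deviation 1, a total sample of 20, and group counts of 1, 2, and 4 (chosen only because they divide 20 evenly). It measures how much a single group's estimated mean varies from one simulated study to the next.

```python
import numpy as np

def spread_of_mean_estimates(total_n, n_groups, n_sims=5000, sigma=1.0, seed=0):
    """Standard deviation of one group's sample mean across simulated studies.

    The total sample is split evenly, so each mean is estimated from
    total_n // n_groups observations.
    """
    rng = np.random.default_rng(seed)
    per_group = total_n // n_groups
    # Each row holds the observations available for one mean in one study.
    samples = rng.normal(0.0, sigma, size=(n_sims, per_group))
    return samples.mean(axis=1).std(ddof=1)

for k in (1, 2, 4):
    per_group = 20 // k
    simulated = spread_of_mean_estimates(20, k)
    theory = 1.0 / np.sqrt(per_group)   # sigma / sqrt(n) for a sample mean
    print(f"{k} mean(s), {per_group} obs each: "
          f"simulated spread = {simulated:.3f}, sigma/sqrt(n) = {theory:.3f}")
```

The spread of the estimates grows as the same 20 observations are divided among more means, which is the erratic behavior described above.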

In short, the quality of the estimates deteriorates as you draw more conclusions from a sample. This idea is directly related to the degrees of freedom in the analysis. To learn more about this concept, read my post: Degrees of Freedom in Statistics.

## Applying These Concepts to Overfitting Regression Models

Overfitting a regression model is similar to the example above. The problems occur when you try to estimate too many parameters from the sample. Each term in the model forces the regression analysis to estimate a parameter using a fixed sample size. Therefore, the size of your sample restricts the number of terms that you can safely add to the model before you obtain erratic estimates.

Similar to the example with the means, you need a sufficient number of observations for each term in the regression model to help ensure trustworthy results. Statisticians have conducted simulation studies* which indicate you should have at least 10-15 observations for each term in a linear model. The number of terms in a model is the sum of all the independent variables, their interactions, and polynomial terms to model curvature.

For instance, if the regression model has two independent variables and their interaction term, you have three terms and need 30-45 observations. Although, if the model has multicollinearity or if the effect size is small, you might need more observations.
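The arithmetic behind this rule of thumb is simple enough to write down. The helper below is a hypothetical convenience function, not part of any statistics package; it only multiplies the number of terms by the 10-15 range quoted above.

```python
def required_observations(n_terms, per_term_low=10, per_term_high=15):
    """Return the (low, high) sample-size range from the 10-15 per term guideline."""
    return n_terms * per_term_low, n_terms * per_term_high

# Two independent variables plus their interaction term = 3 terms.
low, high = required_observations(3)
print(f"3 terms: {low}-{high} observations")
```

Treat the result as a floor rather than a target, since multicollinearity or small effect sizes can call for more.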

To obtain reliable results, you need a sample size that is large enough to handle the model complexity that your study requires. If your study calls for a complex model, you must collect a relatively large sample size. If the sample is too small, you can’t dependably fit a model that approaches the true model for your dependent variable. In that case, the results can be misleading.

## How to Detect Overfit Models

As I discussed earlier, generalizability suffers in an overfit model. Consequently, you can detect overfitting by determining whether your model fits new data as well as it fits the data used to estimate the model. In statistics, we call this cross-validation, and it often involves partitioning your data.

However, for linear regression, there is an excellent accelerated cross-validation method called predicted R-squared. This method doesn’t require you to collect a separate sample or partition your data, and you can obtain the cross-validated results as you fit the model. Statistical software calculates predicted R-squared using the following automated procedure:

- Removes a data point from the dataset.
- Calculates the regression equation.
- Evaluates how well the model predicts the missing observation.
- Repeats this for all data points in the dataset.

Predicted R-squared has several cool features. First, you can just include it in the output as you fit the model without any extra steps on your part. Second, it’s easy to interpret. You simply compare predicted R-squared to the regular R-squared and see if there is a big difference.

If there is a large discrepancy between the two values, your model doesn’t predict new observations as well as it fits the original dataset. The results are not generalizable, and there’s a good chance you’re overfitting the model.
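The leave-one-out procedure described above can be written out directly. This is a minimal sketch, assuming the common definition of predicted R-squared as 1 minus the sum of squared leave-one-out prediction errors divided by the total sum of squares. Statistical packages may compute it with a faster formula, and the simulated data here (a response that is pure noise, fit with a cubic) are an illustrative assumption, not the data from the fitted line plot.

```python
import numpy as np

def r_squared(X, y):
    """Regular R-squared for an OLS fit; X must include the constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def predicted_r_squared(X, y):
    """Leave-one-out predicted R-squared for an OLS fit."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                      # remove one data point
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)  # refit
        press += (y[i] - X[i] @ beta) ** 2            # error on the removed point
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
n = 15
x = rng.uniform(0.0, 10.0, size=n)
y = rng.normal(50.0, 5.0, size=n)                     # no real relationship with x
X = np.column_stack([np.ones(n), x, x ** 2, x ** 3])  # constant plus cubic terms

print(f"R-squared:           {r_squared(X, y):.3f}")
print(f"predicted R-squared: {predicted_r_squared(X, y):.3f}")
```

Computed this way, predicted R-squared can never exceed the regular R-squared, and it can come out negative when the model predicts held-out points worse than the overall mean would.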

For the fitted line plot above, the model produces a predicted R-squared (not shown) of 0%, which reveals the overfitting. For more information, read my post about how to interpret predicted R-squared, which also covers the model in the fitted line plot in more detail.

## How to Avoid Overfitting Models

To avoid overfitting a regression model, you should draw a random sample that is large enough to handle all of the terms that you expect to include in your model. This process requires that you investigate similar studies before you collect data. The goal is to identify relevant variables and terms that you are likely to include in your own model. After you get a sense of the typical complexity of models in your study area, you’ll be able to estimate a good sample size.

For more information about successful regression modeling, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.

### Reference

Babyak, M.A., What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models, *Psychosomatic Medicine* 66:411-421 (2004).

Mark says

Hi Jim, It’s Mark again. I wondered if you might be able to clear up an uncertainty I have about polynomial least squares regression. I have tried to fit a polynomial with increasing order to some y data (there are 2 regressors). I’ve used JMP and it generates model coefficients. I guess the resulting model minimizes the sum of squares (the sum of the squared “distance” between the predicted model value and the actual value). My question is this: will the residuals for a model obtained by least squares always sum to zero? I thought that the answer would be that they would sum to zero, but I’m finding that they do for low-order models (n=0, n=1, n=2) but not for order n=3, for example. The n=3 model has the form:

y = b0 + (b1.x1 + b2.x1^2 + b3.x1^3) + (b4.x2 + b5.x2^2 + b6.x2^3)

Thanks, Mark

Jim Frost says

Hi Mark,

Yes, they should always sum to zero as long as you include the constant in the model. Including the constant forces them to sum to zero. Including polynomials should not affect that even with higher-order terms. So, I’m not sure what is happening in JMP!
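The claim in this reply can be checked numerically. The sketch below uses simulated data (the coefficients, noise level, and sample size are arbitrary assumptions) and a third-order model in two regressors like the one in Mark's question. It says nothing about what JMP does internally.

```python
import numpy as np

def residual_sum(X, y):
    """Sum of the OLS residuals for design matrix X and response y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum(y - X @ beta))

rng = np.random.default_rng(2)
n = 30
x1 = rng.uniform(-1.0, 1.0, size=n)
x2 = rng.uniform(-1.0, 1.0, size=n)
y = 5.0 + 2.0 * x1 - x2 ** 2 + rng.normal(0.0, 0.5, size=n)

# Third-order terms for both regressors, with and without the constant.
cubic_terms = np.column_stack([x1, x1 ** 2, x1 ** 3, x2, x2 ** 2, x2 ** 3])
with_constant = np.column_stack([np.ones(n), cubic_terms])

print(f"with constant:    {residual_sum(with_constant, y):.2e}")
print(f"without constant: {residual_sum(cubic_terms, y):.2e}")
```

With the constant in the model, least squares makes the residuals orthogonal to the column of ones, so their sum is zero up to floating-point rounding regardless of the polynomial order. Without the constant, nothing forces that.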

Mark says

Hi Jim,

Thanks for your reply. I guess that you need to have data or have some idea of what the relationship is likely to be before you propose a model of any order or decide to apply transformations. So taking a stab at order=3 just because it can fit somewhat complex curves isn’t a sensible approach : )

Mark

Jim Frost says

Hi Mark,

Right, it’s not an approach I’d recommend. Graph your data, consider theory, fit the model that makes sense, check residual plots, and make adjustments as needed. The thing with higher-order polynomials is that they’re very good at fitting noise!

Mark says

Hi Jim, Without knowing the true relationship between y and x, is there a minimum polynomial order that is a go-to? For example, if there is curvature, then a model of order 1, e.g. y = a0 + a1x, wouldn’t be a good fit. A model y = a0 + a1x + a2x^2 would be better but wouldn’t be a good fit if there was both a minimum and a maximum present. Then a model of order 3, e.g. y = a0 + a1x + a2x^2 + a3x^3, would be better. My feeling is that order = 3 is the minimum order required to fit “wiggly data,” and so order = 4 would be a safe bet. If the relationship between y and x has many hills and valleys (not in a regular sinusoidal way), then maybe an order higher than 4 would be required, but usually relationships are smooth and continuous (though sometimes sharp discontinuities occur, e.g. at x = Twater = 0 degC, where maybe even a 4th order polynomial would not approximate the data well). In the case I’m considering, I don’t think there are any sharp discontinuities, so I’m thinking that 4th order is a good “go-to” choice for polynomial order. If I were to use a 10th order polynomial (to maximize R^2), I suspect I’d just be fitting the noise rather than the underlying true relationship. Just wondered if you’d agree with my assessment of a 4th degree polynomial being a good “go-to” choice? Thanks, Mark

Jim Frost says

Hi Mark,

As a general rule of thumb for most analysts in most situations? No, I’d say that a 4th order polynomial is far too high. For most situations, that would be too many. Depending on the nature of your data and sample size, you’d either be overfitting the data or just have a number of terms that are not significant–which can reduce the precision of your model.

I’d think about it from the opposite direction. Start with graphing your data and subject-area knowledge to get an idea of what curvature you need to fit. That will hopefully make it clear right there. If unsure, I’d start with a lower-order polynomial, and then check the residual plots. If necessary, you can increase the model order based on the residual plots.

In practice, I’ve never seen a 4th order polynomial, or even a 3rd order. What I have seen is that by the time a 3rd order would’ve been called for based on the number of bends, it was actually a nonlinear model that fit better. That’s not necessarily a general rule for how it works, just what I’ve seen. I’m sure that varies by subject area. However, you’ll need to use subject-area knowledge and theory to guide you. If there is a specific reason why a 4th order polynomial or higher makes theoretical sense, it could well be justified.

As a counter example, in the fitted line plot in this post with the cubic model, there’s just no theoretical reason for why the rankings would first increase, then decline, and then increase again as approval increases. It looks like it provides a good fit in the plot but it doesn’t make theoretical sense. The predicted R-squared also makes it clear that it’s not a good model. The cubic model just forces the model to play connect the dots.

In order to start with a 4th order polynomial, you’d need a good reason for why the model calls for that. In other words, an explanation for why there should be three bends in the data. That’s not going to be the normal case for most analysts. However, again, use your subject-area knowledge to make this call. Maybe it’s appropriate for what you’re studying?

If you go that route, be sure to check predicted R-squared. Also note that for each term in your model you should have 10-15 observations minimum. With a 4th order model, you should have 40 – 60. With 10, you’d need 100 – 150! That’s assuming there are no other terms in the model other than the individual predictor and its higher-order terms.

I hope this helps!

Vansh says

Thanks..

Great explanation..

Amir says

Hi Jim,

Excellent explanation! You nailed it! I have a question, though. Could overfitting affect the size of the coefficients, such as making them larger than usual? I have a probit model with a fairly large number of observations (4000) and a couple of interaction terms. When I estimate the model, I get an R-squared of 0.67, but only two interactions are significant, with coefficients greater than 10, while the coefficients of both main effects are less than 0.5. This has made me really concerned about the model.

Jim Frost says

Hi Amir,

Yes, overfitting can do all sorts of strange things including affecting the size of the coefficients. However, having interaction coefficients that are larger than the main effect coefficients isn’t necessarily a problem. In fact, it can be OK to have main effects = 0 and large interaction effects. It all depends on the subject-area. By itself, that’s no reason to be concerned. With 4000 observations, you’d have to have a very complex model to be overfitting your model. I doubt that’s happening. Check your residual plots to make sure they look good and graph the interaction effects to determine whether they make sense using your subject-area knowledge.

Best of luck with your analysis!

Dave says

Jim, Awesome stuff! I’ve been an algorithm developer for 20 years using mostly neural networks. I really appreciate your posts on over-fitting. I just wanted to make sure I understand your rule of thumb for observations. So if I have independent variables k and j and k*j in my model then that would count as 3 terms and I should have at least 30 (3×10) observations to develop the model. Is that correct?

Jim Frost says

Hi Dave,

Yes, that’s absolutely correct for OLS. You should have at least 30 observations. Other forms of regression analysis can have different requirements. If you have weak effects, you might need even more to detect them.

Boris Droz says

Hi Jim,

Thank you for your website, which is very helpful for non-statisticians such as me. You provide good “cooking recipes,” if I can use this term.

I have a question: I did exactly what you did to detect overfitting (comparing the model R2 and the cross-validated R2), and I have seen this procedure a couple of times in different papers. But I am struggling to find the threshold values between the best-case scenario (difference = 0), an acceptable scenario (maybe up to 0.2), small overfitting, and overfitting.

Do some thresholds exist?

Thank you in advance for your answer

Boris

Jim Frost says

Hi Boris,

It’s difficult to come up with a specific value. I’m sure if you ask 5 different statisticians, they’d give you 5 different answers. However, I’d agree that once you get to a difference of 0.2, you should definitely start wondering and looking into it as a potential problem.

Rohit says

Thanks Jim. Your work is really increasing understanding of statistics.

Jim Frost says

Thank you, Rohit! I’m glad you’ve found my blog to be helpful!

reet khatri says

This is so easy to understand, thank you.

Jim Frost says

Hi Reet, you’re very welcome! I’m happy to hear that you found it to be helpful!

Ed says

I’ve been asked to write a proof for why the number of regressors K cannot exceed N. I understand the intuition but need some help proving it mathematically.

Jim Frost says

Hi Ed, here’s a pointer in the right direction. When the number of parameters = N, there are zero error degrees of freedom. Note that the parameters include the constant. So, if you have five observations, you can estimate the parameters for the constant and four predictors.
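The pointer above can be illustrated numerically, though an illustration is not the proof Ed was asked for. The sketch assumes five simulated observations and random predictors. With a constant and four predictors, the fit reproduces every observation exactly and leaves zero error degrees of freedom. Adding one more predictor gives more columns than rows, so the design matrix cannot have full column rank and the coefficients are no longer uniquely determined.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
predictors = rng.normal(size=(n, 4))
y = rng.normal(size=n)

# Constant plus four predictors: five parameters for five observations.
X = np.column_stack([np.ones(n), predictors])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
error_df = n - X.shape[1]

print("error degrees of freedom:", error_df)
print("largest absolute residual:", np.max(np.abs(residuals)))

# One more predictor: six columns but only five rows.
X_too_many = np.column_stack([X, rng.normal(size=n)])
rank = np.linalg.matrix_rank(X_too_many)
print("columns:", X_too_many.shape[1], "rank:", rank)
```

The rank of a matrix can never exceed its number of rows, which is the linear algebra fact a formal proof would build on.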

Md Rabiul Kabir says

Very helpful site

Jim Frost says

Thank you! I’m glad you found it to be helpful!

Ramskrishna says

Wonderful job thank you

Jim Frost says

Thanks so much for your kind comment. It made my day!