Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. In regression analysis, overfitting can produce misleading R-squared values, regression coefficients, and p-values. In this post, I explain how overfitting models is a problem and how you can identify and avoid it.

Overfit regression models have too many terms for the number of observations. When this occurs, the regression coefficients represent the noise rather than the genuine relationships in the population.

That’s problematic by itself. However, there is another problem. Each sample has its own unique quirks. Consequently, a regression model that becomes tailor-made to fit the random quirks of one sample is unlikely to fit the random quirks of another sample. Thus, overfitting a regression model reduces its generalizability outside the original dataset.

Taking the above in combination, an overfit regression model describes the noise, and it’s not applicable outside the sample. That’s not very helpful, right? I’d really like these problems to sink in because overfitting often occurs when analysts chase a high R-squared. In fact, inflated R-squared values are a *symptom* of overfit models! Despite the misleading results, it can be difficult for analysts to give up that nice high R-squared value.

When choosing a regression model, our goal is to approximate the true model for the whole population. If we accomplish this goal, our model should fit most random samples drawn from that population. In other words, our results are more generalizable—we can expect that the model will fit other samples.

**Related post**: Model Specification: Choosing the Correct Regression Model

## Graphical Illustration of Overfitting Regression Models

The image below illustrates an overfit model. The green line represents the true relationship between the variables. The random error inherent in the data causes the data points to fall randomly around the green fit line. The red line represents an overfit model. This model is too complex, and it attempts to explain the random error present in the data.
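A minimal simulated sketch of the same idea, assuming NumPy is available. The true model (`y = 2 + 3x + noise`), the sample size of 12, and the 7th-order polynomial are all hypothetical choices for illustration, not the data behind the image. The sketch fits a simple and an overly complex model to one sample and then scores both on a second sample from the same population.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

def make_sample(n):
    """Draw n points from a hypothetical true model: y = 2 + 3x + noise."""
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, n)
    return x, y

def sse(coefs, x, y):
    """Sum of squared errors of a fitted polynomial on the data (x, y)."""
    return float(np.sum((y - np.polyval(coefs, x)) ** 2))

x_train, y_train = make_sample(12)  # the sample used to estimate the models
x_new, y_new = make_sample(12)      # a second sample from the same population

simple = np.polyfit(x_train, y_train, deg=1)    # matches the true form
complex_ = np.polyfit(x_train, y_train, deg=7)  # far too many terms for 12 points

for name, coefs in [("linear", simple), ("7th-order", complex_)]:
    print(name,
          "| SSE on original sample:", round(sse(coefs, x_train, y_train), 3),
          "| SSE on new sample:", round(sse(coefs, x_new, y_new), 3))
```

The more complex model always fits the original sample at least as closely because it contains the linear model as a special case. The comparison that matters is how each model does on the new sample.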

The example above is very clear. However, it’s not always that obvious. Below, the fitted line plot shows an overfit model. In the graph, it appears that the model explains a good proportion of the dependent variable variance. Unfortunately, this is an overfit model, and I’ll show you how to detect it shortly.

If you have more than two independent variables, it’s not possible to graph them in this manner, which makes overfitting harder to detect.

## How Overfitting a Model Causes these Problems

Let’s go back to the basics of inferential statistics to understand how overfitting models causes problems. You use inferential statistics to draw conclusions about a population from a random sample. An important consideration is that the sample size limits the quantity and quality of the conclusions you can draw about a population. The more you need to learn, the larger the sample must be.

This concept is fairly intuitive. Suppose we have a total sample size of 20 and we need to estimate one population mean using a 1-sample t-test. We’ll probably obtain a good estimate. However, if we want to use a 2-sample t-test to estimate the means of two populations, it’s not as good because we have only ten observations to estimate each mean. If we want to estimate three or more means using one-way ANOVA, it becomes pretty bad.

As the number of observations per estimate decreases (20, 10, 6.7, etc.), the estimates become more erratic. Furthermore, a new sample is unlikely to replicate the inconsistent estimates produced by the smaller sample sizes.
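Here is a small simulation sketch of that pattern using only the Python standard library. The population (normal with mean 0 and standard deviation 1) and the number of simulated studies are assumptions made for illustration. Note that integer division gives 6 observations per group for three groups rather than 6.7.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def spread_of_group_means(total_n, groups, trials=2000):
    """Return (observations per group, std. dev. of one group's mean across trials)."""
    per_group = total_n // groups  # observations available for each estimate
    means = []
    for _ in range(trials):
        sample = [random.gauss(0.0, 1.0) for _ in range(per_group)]
        means.append(statistics.mean(sample))
    return per_group, statistics.stdev(means)

for groups in (1, 2, 3):
    per_group, spread = spread_of_group_means(20, groups)
    print(f"{groups} mean(s): {per_group} observations each, "
          f"spread of the estimate = {spread:.3f}")
```

The spread of the estimated mean grows as the same 20 observations are divided among more estimates, which is the erratic behavior described above.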

In short, the quality of the estimates deteriorates as you draw more conclusions from a sample. This idea is directly related to the degrees of freedom in the analysis. To learn more about this concept, read my post: Degrees of Freedom in Statistics.

## Applying These Concepts to Overfitting Regression Models

Overfitting a regression model is similar to the example above. The problems occur when you try to estimate too many parameters from the sample. Each term in the model forces the regression analysis to estimate a parameter using a fixed sample size. Therefore, the size of your sample restricts the number of terms that you can safely add to the model before you obtain erratic estimates.

Similar to the example with the means, you need a sufficient number of observations for each term in the regression model to help ensure trustworthy results. Statisticians have conducted simulation studies* which indicate you should have at least 10-15 observations for each term in a linear model. The number of terms in a model is the sum of all the independent variables, their interactions, and polynomial terms to model curvature.

For instance, if the regression model has two independent variables and their interaction term, you have three terms and need 30-45 observations. However, if the model has multicollinearity or if the effect size is small, you might need more observations.
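The arithmetic behind this guideline can be sketched in a few lines. The function names `count_terms` and `required_observations` are hypothetical helpers for this illustration, and the result is only the rule-of-thumb minimum, not a power analysis.

```python
def count_terms(main_effects, interactions=0, polynomial_terms=0):
    """Count model terms: independent variables, interactions, and polynomial terms."""
    return main_effects + interactions + polynomial_terms

def required_observations(n_terms, per_term=(10, 15)):
    """Return the (low, high) sample size implied by 10-15 observations per term."""
    low, high = per_term
    return n_terms * low, n_terms * high

# Two independent variables plus their interaction.
terms = count_terms(main_effects=2, interactions=1)
print(terms, required_observations(terms))
```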

To obtain reliable results, you need a sample size that is large enough to handle the model complexity that your study requires. If your study calls for a complex model, you must collect a relatively large sample. If the sample is too small, you can’t dependably fit a model that approaches the true model for your dependent variable. In that case, the results can be misleading.

## How to Detect Overfit Models

As I discussed earlier, generalizability suffers in an overfit model. Consequently, you can detect overfitting by determining whether your model fits new data as well as it fits the data used to estimate the model. In statistics, we call this cross-validation, and it often involves partitioning your data.

However, for linear regression, there is an excellent accelerated cross-validation method called predicted R-squared. This method doesn’t require you to collect a separate sample or partition your data, and you can obtain the cross-validated results as you fit the model. Statistical software calculates predicted R-squared using the following automated procedure:

- Removes a data point from the dataset.
- Calculates the regression equation.
- Evaluates how well the model predicts the missing observation.
- Repeats this for all data points in the dataset.

Predicted R-squared has several cool features. First, you can just include it in the output as you fit the model without any extra steps on your part. Second, it’s easy to interpret. You simply compare predicted R-squared to the regular R-squared and see if there is a big difference.

If there is a large discrepancy between the two values, your model doesn’t predict new observations as well as it fits the original dataset. The results are not generalizable, and there’s a good chance you’re overfitting the model.
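The leave-one-out procedure and the comparison described above can be sketched as follows, assuming NumPy is available. The sketch uses the standard definition of predicted R-squared as 1 − PRESS/SST, where PRESS is the sum of the squared leave-one-out prediction errors. The data are hypothetical (the response is pure noise fit with a cubic model) and are not the data in the fitted line plot.

```python
import numpy as np

def r_squared(X, y):
    """Regular R-squared for an OLS fit; X must already include the constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sst = np.sum((y - y.mean()) ** 2)
    return float(1.0 - np.sum(resid ** 2) / sst)

def predicted_r_squared(X, y):
    """Leave-one-out predicted R-squared: 1 - PRESS / SST."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i                                  # remove one data point
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)  # refit without it
        press += (y[i] - X[i] @ beta) ** 2                        # error predicting it
    sst = np.sum((y - y.mean()) ** 2)
    return float(1.0 - press / sst)

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 12)
y = rng.normal(0, 1, 12)  # hypothetical response with no real relationship to x
X_cubic = np.column_stack([np.ones_like(x), x, x**2, x**3])

print("R-squared:          ", round(r_squared(X_cubic, y), 3))
print("Predicted R-squared:", round(predicted_r_squared(X_cubic, y), 3))
```

Predicted R-squared can never exceed the regular R-squared, and it can come out negative when the model predicts held-out points worse than the sample mean would.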

For the fitted line plot above, the model produces a predicted R-squared (not shown) of 0%, which reveals the overfitting. For more information, read my post about how to interpret predicted R-squared, which also covers the model in the fitted line plot in more detail.

## How to Avoid Overfitting Models

To avoid overfitting a regression model, you should draw a random sample that is large enough to handle all of the terms that you expect to include in your model. This process requires that you investigate similar studies before you collect data. The goal is to identify relevant variables and terms that you are likely to include in your own model. After you get a sense of the typical complexity of models in your study area, you’ll be able to estimate a good sample size.

For more information about successful regression modeling, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.


### Reference

Babyak, MA., What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models, *Psychosomatic Medicine* 66:411-421 (2004).

EISHA AKANKSHA says

I AM NEW TO DATA SCIENCE BUT THIS LINK IS REALLY HELPFUL

Sarah Napier says

Hi Jim,

This is so useful. Thanks so much for taking the time to explain this, I really appreciate it.

Jim Frost says

You’re very welcome, Sarah! Best of luck with your research!

Sarah Napier says

Apologies, I realise I wasn’t quite clear – my sample size is 1,860 but in regards to my dependent variable I am modelling an outcome experienced by 314 (who answered yes) out of 1,860 people

Jim Frost says

Hi Sarah,

Thanks for the additional information. There are several common guidelines related to binary logistic regression and sample size, and they don’t always give the same answer! Some use an events per variable (EPV) calculation. I’ve seen some guidelines say that you need 50 EPV and others say that EPV ≥ 10 is sufficient. So, with 314 events, you can have 6 variables using the more stringent 50 EPV but as many as 31 variables if you go with EPV ≥ 10.

Yet another guideline is N = 10 k / p, where N is the sample size, k is the number of variables, and p is the proportion of events or non-events, whichever is smaller. In your case, p is the proportion of events: 314/1860 = 0.169. So, if we solve for k (IVs) in 1860 = 10*k/0.169, we get k = 31 IVs.

Given that you have 18 IVs, you’re well under 31, which we get using two of the guidelines. Using the 50 EPV guideline, you wouldn’t include all of those–only 6. However, you have 17.4 EPV. In my experience that should be fine. My opinion is that it is OK to include all of them.
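As an illustration of the arithmetic in the reply above, here is a short sketch. The helper name `max_predictors_by_epv` is hypothetical, and the numbers are the ones quoted in this thread.

```python
def max_predictors_by_epv(events, epv):
    """Largest whole number of predictors that keeps at least `epv` events per variable."""
    return events // epv

n, events, predictors = 1860, 314, 18
p = min(events, n - events) / n  # proportion of the less common outcome

print("Max predictors at 50 EPV:", max_predictors_by_epv(events, 50))
print("Max predictors at 10 EPV:", max_predictors_by_epv(events, 10))
print("k from N = 10k/p:        ", round(n * p / 10, 1))
print("EPV with 18 predictors:  ", round(events / predictors, 1))
```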

Sarah Napier says

Hi Jim I found your article extremely useful, thank you. I am conducting analysis of an online survey I administered. I have a sample of 1,860 respondents, and wish to use a logistic regression to test the effect of 18 predictor variables on the dependent variable, which is binary (yes/no) (N=314). Can I include all 18 predictor variables in the same logistic regression model, or will this cause overfitting? I note the rule of 10-15 observations per predictor, and I believe my sample size would allow this, but I wasn’t sure if there was a maximum number of variables you can use? Also, I guess I need to run the model before I know how many observations I have per predictor variable? Thanks in advance!

Alana says

For a sample size of 1648 would you caution using 33 variables?

Kathryn says

As I understand it, for binary logistic regression, it’s not the number of cases but the number of events in the smaller category of the dichotomous outcome that is used with the rule of thumb for capping the number of independent variables (see https://www.cs.vu.nl/~eliens/sg/local/theory/overfitting.pdf).

For instance, when modeling an outcome experienced by 80 out of 200 cases, it would be a basis of 80 to which the rule of thumb would be applied.

So you were correct in your suspicion that it wouldn’t be more lax.

Robert Parker says

Thanks for the response. Thank you for your insight regarding over-fitting. Also, it looks like Lasso regression and PLS will not address our problems as we are testing a hypothesized variable. I will consult with my co-authors.

Jim Frost says

Hi Robert,

You’re very welcome! That was my concern after you mentioned the control variables. I figured you were testing a specific variable. Best of luck going forward!

Robert Parker says

Jim,

I like the way you explain things.

I am doing an academic study. I ran a logistic regression (binary dependent variable, yes/no) with many predictor variables, maybe too many. I suspect over-fitting problems.

The sample size is very low, about 80. I know that most statisticians will argue that my sample size is way too small for what I am attempting to do. However, I am stuck with this as collecting additional data is not feasible.

Model 1.

23 predictor variables; 22 are control variables from prior studies. Two predictor variables have significant p-values (p<.05), including my hypothesized variable.

A reviewer at a journal is arguing that I should add 12 additional control variables. This seems ill-advised to me as my degrees of freedom would be really low. When I run the regression with 35 predictors (23+12), no predictor variable has a significant p-value. Does this reflect over-fitting? Are the standard errors of the regression coefficients inflated, which leads to insignificant p-values? If this is true, how can I prove this to the satisfaction of the reviewer? I have thought about a stepwise procedure to delete some of the control variables. Many journals do not like this approach. Also, I have read that Lasso regression might be an option.

Please advise.

Bob

Jim Frost says

Hi Bob,

I agree with the idea that your sample size is way too small for what you want to do. It’s already too small for the 23 variables you have in it–much less the additional 12. This article discusses a good rule of thumb for OLS regression. I don’t know offhand what a good rule is for binary logistic regression, but I doubt it will be more lax.

The problem with overfitting is that it can create completely untrustworthy results that appear to be statistically significant. You’re fitting the noise in the data. I would not say that the lack of significance with the 35 predictors is necessarily overfitting. Overfitting can produce misleading but statistically significant results. You could try reducing the number of predictors by removing the ones that are not significant. The problem with that approach is that you’ll be trying various combinations of predictors and making decisions about what to leave in the model based on many different combinations of variables. That’s called data mining and can cause problems similar to overfitting. Read my post about data mining for more information.

Overfitting can cause biased coefficients. Inflated standard errors are more typically associated with multicollinearity. I don’t know if your model has multicollinearity or not. If it does, that’s an additional problem above and beyond overfitting.

You’re in a tough spot and, unfortunately, I don’t have an easy answer. It sounds like you want to do too much with too little data. The best solution would be to obtain more data, but you indicate that’s not possible. As I mention in this article, I recommend reviewing the literature to determine the likely complexity of your model and then using that information to determine the necessary sample size.

I recommend consulting with a statistician who can devote the time to your project that it deserves. With a more in-depth look, they might be able to find a solution for you. I suspect it will involve compromises because you have far too few observations for the complexity of model you want to fit. For example, Lasso regression is a possibility when you have overfitting. However, its purpose is more for prediction than drawing inferences about the nature of the relationships between variables. Partial least squares (PLS) can also work when you have too many predictors for a given dataset. That procedure reduces the number of variables down to a smaller set of components and then performs least squares regression on those components. Both of those focus on prediction rather than the relationships between variables, and I don’t know if that suits your purposes or not. Given that you have control variables, I’m guessing not.

Best of luck with your analysis!

Jacri says

Awesome, Jim. Can this approach be applied to Cox regression?

Victor says

Hi Jim,

You wrote that statistical software calculates predicted R-squared using the following steps:

It removes a data point from the dataset.

Calculates the regression equation.

Evaluates how well the model predicts the missing observation.

And, repeats this for all data points in the dataset.

Can you provide more details on how you then get the predicted R-squared after these steps, for example by giving a formula?

Andrea Berdondini says

You must always choose the polynomial whose results are less likely to be obtained randomly. Remember that increasing the degree of the polynomial drastically increases the probability of obtaining the same results randomly.

So you have to develop a Monte Carlo simulator in order to calculate this probability for the various polynomials.

Jim Frost says

Hi Andrea,

That’s certainly one way to do it. However, there are other methods. You can assess Predicted R-squared to see if you’re overfitting your model. And, you can use your subject area knowledge to determine what the relationship should be like. Usually you’ll know if theory suggests you should have multiple bends in the line or not. Using a cubic term is very rare. Anything higher and you’re almost definitely overfitting the model unless you have strong theoretical reasons to support it.

I suppose you could use a Monte Carlo simulation, but it’s not a required method. Also, using a simulation like that suggests you already know the correct form, which might not be the case. Predicted R-squared does not make that assumption.

Andrea Berdondini says

Overfitting is simply the direct consequence of treating the statistical parameters, and therefore the results obtained, as useful information without checking that they were not obtained by chance. Therefore, to estimate the presence of overfitting, we have to run the algorithm on a dataset equivalent to the real one but with randomly generated values. By repeating this operation many times, we can estimate the probability of obtaining equal or better results by chance. If this probability is high, we are most likely in an overfitting situation. For example, the probability that a fourth-degree polynomial has a correlation of 1 with 5 random points on a plane is 100%, so this correlation is useless and we are in an overfitting situation.
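The fourth-degree example in this comment can be checked with a rough Monte Carlo sketch, assuming NumPy is available. The function names, the uniform random points, the 0.99 threshold, and the number of trials are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)  # fixed seed so the sketch is repeatable

def r_squared_of_polyfit(x, y, degree):
    """R-squared of a least-squares polynomial fit of the given degree."""
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def share_of_random_fits_at_least(threshold, n_points, degree, trials=500):
    """Fraction of purely random datasets whose fit reaches the R-squared threshold."""
    hits = 0
    for _ in range(trials):
        x = rng.uniform(0, 1, n_points)  # random x values
        y = rng.uniform(0, 1, n_points)  # random y values, unrelated to x
        if r_squared_of_polyfit(x, y, degree) >= threshold:
            hits += 1
    return hits / trials

for degree in (1, 2, 4):
    share = share_of_random_fits_at_least(0.99, n_points=5, degree=degree)
    print(f"degree {degree}: share of random datasets with R-squared >= 0.99 = {share}")
```

A fourth-degree polynomial has five coefficients, so it can pass through any five points with distinct x values. A near-perfect fit in that setting says nothing about the data.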

Sarah says

You’ve put an asterisk in the body text (Statisticians have conducted simulation studies*) which I have presumed is there to provide a reference for the following conclusion (which indicate you should have at least 10-15 observations for each term in a linear model). However, I can’t seem to locate this link. Can you please provide the reference to this analysis? Thanks.

Jim Frost says

Hi Sarah, it’s the Babyak reference at the bottom of this blog post. Just above the comments section.

Jae says

Hi Jim,

Thank you for your intuitive website.

I said at an interview, “I developed a multiple regression model.” Then the interviewer asked me how I knew whether the model was good or not. I said R-squared, or rather, adjusted R-squared was high.

Then the interviewer asked me what the difference is between R-squared and adjusted R-squared.

I couldn’t answer. She explained quickly, but I didn’t understand.

Then the interviewer asked me about the overfitting issue. Of course, I couldn’t explain that issue for regression models either.

Is there a relationship between overfitting and R-squared?

I don’t know how I failed to understand anything at the interview even though I earned a master’s in statistics 10 years ago. Too dumb. I am going to study again.

best,

Jae

Jim Frost says

Hi Jae,

To quickly learn many things about regression analysis, I highly recommend that you read my ebook about regression analysis!

In terms of how you tell whether a model is good, there are various things to look for. Do your residuals appear to be random, or are there patterns in them? A high R-squared can be nice, but by itself it doesn’t mean you have a good model. And, you can have a low R-squared but, as long as you have significant independent variables, it might still be a good model. One model might be good at explaining the relationships in the data but bad at making precise predictions. Another model might be the opposite: good at making predictions but bad at explaining the relationships between the variables. So, much depends on the purpose of your model and how you define good. And, the subject area also affects what is considered good. In some study areas, high R-squared values are not possible.

Back to overfitting. Typically, if you’re overfitting a model, your R-squared is higher than it should be. However, you might not know what it should be, so you might not know that it is too high. Yes, it’s possible that R-squared is too high! One of the best ways to detect overfitting is, as I explain in this post, by using predicted R-squared.

Best of luck with your studies!

Mark says

Hi Jim, It’s Mark again. I wondered if you might be able to clear up an uncertainty I have about polynomial least squares regression. I have tried to fit a polynomial with increasing order to some y data (there are 2 regressors). I’ve used JMP and it generates model coefficients. I guess the resulting model minimizes the sum of squares (the sum of the squared “distance” between the predicted model value and the actual value). My question is this: will the residuals for a model obtained by least squares always sum to zero? I thought the answer would be that they sum to zero, but I’m finding that they do for low-order models (n=0, n=1, n=2) but not for order n=3, for example. The n=3 model has the form:

y = b0 + (b1.x1 + b2.x1^2 + b3.x1^3) + (b4.x2 + b5.x2^2 + b6.x2^3)

Thanks, Mark

Jim Frost says

Hi Mark,

Yes, they should always sum to zero as long as you include the constant in the model. Including the constant forces them to sum to zero. Including polynomials should not affect that even with higher-order terms. So, I’m not sure what is happening in JMP!
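This property can be checked numerically with a sketch like the one below, assuming NumPy is available. The simulated data and the helper `residual_sum` are hypothetical and are not Mark's JMP model. The same cubic terms in two regressors are fit with and without the constant.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed so the sketch is repeatable
n = 40
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
y = 1.0 + 0.5 * x1 - 0.3 * x2**2 + rng.normal(0, 1, n)  # hypothetical response

def residual_sum(columns, y):
    """Fit OLS on the given design columns and return the sum of the residuals."""
    X = np.column_stack(columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.sum(y - X @ beta))

cubic_terms = [x1, x1**2, x1**3, x2, x2**2, x2**3]

with_constant = residual_sum([np.ones(n)] + cubic_terms, y)
without_constant = residual_sum(cubic_terms, y)

print("Sum of residuals with the constant:   ", with_constant)
print("Sum of residuals without the constant:", without_constant)
```

With the constant in the model, the sum is zero up to floating-point rounding, regardless of the polynomial order. A clearly nonzero sum suggests that the constant was dropped or that the residuals were rounded before being added up.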

Mark says

Hi Jim,

Thanks for your reply. I guess that you need to have data or have some idea of what the relationship is likely to be before you propose a model of any order or decide to apply transformations. So taking a stab at order=3 just because it can fit somewhat complex curves isn’t a sensible approach : )

Mark

Jim Frost says

Hi Mark,

Right, it’s not an approach I’d recommend. Graph your data, consider theory, fit the model that makes sense, check residual plots, and make adjustments as needed. The thing with higher-order polynomials is that they’re very good at fitting noise!

Mark says

Hi Jim, Without knowing the true relationship between y and x, is there a minimum polynomial order that is a go-to? For example, if there is curvature, then a model of order 1, e.g. y = a0 + a1x, wouldn’t be a good fit. A model y = a0 + a1x + a2x^2 would be better but wouldn’t be a good fit if there was both a minimum and a maximum present. Then a model of order 3, e.g. y = a0 + a1x + a2x^2 + a3x^3, would be better. My feeling is that order = 3 is the minimum order required to fit “wiggly data”, so order = 4 would be a safe bet. If the relationship between y and x has many hills and valleys (not in a regular sinusoidal way), then maybe an order higher than 4 would be required, but usually relationships are smooth and continuous (although sharp discontinuities sometimes occur, e.g. at x = Twater = 0degC, where maybe even a 4th order polynomial would not approximate the data well). In the case I’m considering, I don’t think there are any sharp discontinuities, so I’m thinking that 4th order is a good “go to” choice for polynomial order. If I were to use a 10th order polynomial (to maximize R^2), I suspect I’d just be fitting the noise rather than the underlying true relationship. Just wondered if you’d agree with my assessment of a 4th degree polynomial being a good “go to” choice? Thanks, Mark

Jim Frost says

Hi Mark,

As a general rule of thumb for most analysts in most situations? No, I’d say that a 4th order polynomial is far too high. For most situations, that would be too many. Depending on the nature of your data and sample size, you’d either be overfitting the data or just have a number of terms that are not significant–which can reduce the precision of your model.

I’d think about it from the opposite direction. Start with graphing your data and subject-area knowledge to get an idea of what curvature you need to fit. That will hopefully make it clear right there. If unsure, I’d start with a lower-order polynomial, and then check the residual plots. If necessary, you can increase the model order based on the residual plots.

In practice, I’ve never seen a 4th order polynomial, or even a 3rd order. What I have seen is that by the time a 3rd order would’ve been called for based on the number of bends, it was actually a nonlinear model that fit better. That’s not necessarily a general rule for how it works, just what I’ve seen. I’m sure that varies by subject area. However, you’ll need to use subject-area knowledge and theory to guide you. If there is a specific reason why a 4th order polynomial or higher makes theoretical sense, it could well be justified.

As a counter example, in the fitted line plot in this post with the cubic model, there’s just no theoretical reason for why the rankings would first increase, then decline, and then increase again as approval increases. It looks like it provides a good fit in the plot but it doesn’t make theoretical sense. The predicted R-squared also makes it clear that it’s not a good model. The cubic model just forces the model to play connect the dots.

In order to start with a 4th order polynomial, you’d need a good reason for why the model calls for that. In other words, an explanation for why there should be three bends in the data. That’s not going to be the normal case for most analysts. However, again, use your subject-area knowledge to make this call. Maybe it’s appropriate for what you’re studying?

If you go that route, be sure to check predicted R-squared. Also note that for each term in your model you should have 10-15 observations minimum. With a 4th order model, you should have 40 – 60. With 10, you’d need 100 – 150! That’s assuming there are no other terms in the model other than the individual predictor and its higher-order terms.

I hope this helps!

Vansh says

Thanks..

Great explanation..

Amir says

Hi Jim,

Excellent explanation! You nailed it! I have a question, though. Could overfitting affect the size of the coefficients, such as making them larger than usual? I have a probit model with a fairly large number of observations (4,000) and a couple of interaction terms. When I estimate the model, I get an R-squared of 0.67, but only two interactions are significant, with coefficients greater than 10, while the coefficients of both main effects are less than 0.5. This has made me really concerned about the model.

Jim Frost says

Hi Amir,

Yes, overfitting can do all sorts of strange things including affecting the size of the coefficients. However, having interaction coefficients that are larger than the main effect coefficients isn’t necessarily a problem. In fact, it can be OK to have main effects = 0 and large interaction effects. It all depends on the subject-area. By itself, that’s no reason to be concerned. With 4000 observations, you’d have to have a very complex model to be overfitting your model. I doubt that’s happening. Check your residual plots to make sure they look good and graph the interaction effects to determine whether they make sense using your subject-area knowledge.

Best of luck with your analysis!

Dave says

Jim, Awesome stuff! I’ve been an algorithm developer for 20 years using mostly neural networks. I really appreciate your posts on over-fitting. I just wanted to make sure I understand your rule of thumb for observations. So if I have independent variables k and j and k*j in my model then that would count as 3 terms and I should have at least 30 (3×10) observations to develop the model. Is that correct?

Jim Frost says

Hi Dave,

Yes, that’s absolutely correct for OLS. You should have at least 30 observations. Other forms of regression analysis can have different requirements. If you have weak effects, you might need even more to detect them.

Boris Droz says

Hi Jim,

Thanks for your website; it’s very helpful for non-statisticians such as me. You provide good “cooking recipes”, if I can use this term.

I have a question: I did exactly what you did to detect overfitting (comparing the model R2 and the cross-validated R2), and I have seen this procedure a couple of times in different papers. But I am struggling to find the threshold values that separate the best-case scenario (difference = 0), an acceptable scenario (maybe up to 0.2), slight overfitting, and overfitting.

Do some thresholds exist?

Thank you in advance for your answer

Boris

Jim Frost says

Hi Boris,

It’s difficult to come up with a specific value. I’m sure if you ask 5 different statisticians, they’d give you 5 different answers. However, I’d agree that once you get to a difference of 0.2, you should definitely start wondering and looking into it as a potential problem.

Rohit says

Thanks Jim. Your work is really increasing my understanding of statistics.

Jim Frost says

Thank you, Rohit! I’m glad you’ve found my blog to be helpful!

reet khatri says

This is so easy to understand, thank you.

Jim Frost says

Hi Reet, you’re very welcome! I’m happy to hear that you found it to be helpful!

Ed says

I’ve been asked to write a proof for why the number of regressors K cannot exceed N. I understand the intuition but need some help proving it mathematically.

Jim Frost says

Hi Ed, here’s a pointer in the right direction. When the number of parameters = N, there are zero error degrees of freedom. Note that the parameters include the constant. So, if you have five observations, you can estimate the parameters for the constant and four predictors.
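A numerical sketch of that pointer, assuming NumPy is available; it is an illustration, not a proof. The random data and the helper `fit_summary` are hypothetical. With five observations, a constant plus four predictors fits the data exactly, and adding more predictors cannot raise the rank of the design matrix above five, so the coefficients are no longer uniquely determined.

```python
import numpy as np

rng = np.random.default_rng(5)  # fixed seed so the sketch is repeatable
n_obs = 5

def fit_summary(n_predictors):
    """Fit OLS with a constant plus random predictors and summarize the fit."""
    X = np.column_stack([np.ones(n_obs), rng.normal(size=(n_obs, n_predictors))])
    y = rng.normal(size=n_obs)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rank = int(np.linalg.matrix_rank(X))
    return {
        "parameters": X.shape[1],   # constant + predictors
        "rank": rank,               # can never exceed the number of observations
        "error_df": n_obs - rank,   # degrees of freedom left for error
        "sse": float(np.sum((y - X @ beta) ** 2)),
    }

for k in (2, 4, 6):
    print(k, "predictors:", fit_summary(k))
```

Once the number of parameters reaches N, the error sum of squares is zero up to rounding even though the response is pure noise, which ties the proof back to overfitting.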

Md Rabiul Kabir says

Very helpful site

Jim Frost says

Thank you! I’m glad you found it to be helpful!

Ramskrishna says

Wonderful job thank you

Jim Frost says

Thanks so much for your kind comment. It made my day!