Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. In regression analysis, overfitting can produce misleading R-squared values, regression coefficients, and p-values. In this post, I explain how overfitting models is a problem and how you can identify and avoid it.
Overfit regression models have too many terms for the number of observations. When this occurs, the regression coefficients represent the noise rather than the genuine relationships in the population.
That’s problematic by itself. However, there is another problem. Each sample has its own unique quirks. Consequently, a regression model that becomes tailor-made to fit the random quirks of one sample is unlikely to fit the random quirks of another sample. Thus, overfitting a regression model reduces its generalizability outside the original dataset.
Taking the above in combination, an overfit regression model describes the noise, and it’s not applicable outside the sample. That’s not very helpful, right? I’d really like these problems to sink in because overfitting often occurs when analysts chase a high R-squared. In fact, inflated R-squared values are a symptom of overfit models! Despite the misleading results, it can be difficult for analysts to give up that nice high R-squared value.
When choosing a regression model, our goal is to approximate the true model for the whole population. If we accomplish this goal, our model should fit most random samples drawn from that population. In other words, our results are more generalizable—we can expect that the model will fit other samples.
Related post: Model Specification: Choosing the Correct Regression Model
Graphical Illustration of Overfitting Regression Models
The image below illustrates an overfit model. The green line represents the true relationship between the variables. The random error inherent in the data causes the data points to fall randomly around the green fit line. The red line represents an overfit model. This model is too complex, and it attempts to explain the random error present in the data.
The example above is very clear. However, it’s not always that obvious. Below, the fitted line plot shows an overfit model. In the graph, it appears that the model explains a good proportion of the dependent variable variance. Unfortunately, this is an overfit model, and I’ll show you how to detect it shortly.
If you have more than two independent variables, it’s not possible to graph them in this manner, which makes overfitting harder to detect.
How Overfitting a Model Causes these Problems
Let’s go back to the basics of inferential statistics to understand how overfitting models causes problems. You use inferential statistics to draw conclusions about a population from a random sample. An important consideration is that the sample size limits the quantity and quality of the conclusions you can draw about a population. The more you need to learn, the larger the sample must be.
This concept is fairly intuitive. Suppose we have a total sample size of 20 and we need to estimate one population mean using a 1-sample t-test. We’ll probably obtain a good estimate. However, if we want to use a 2-sample t-test to estimate the means of two populations, it’s not as good because we have only ten observations to estimate each mean. If we want to estimate three or more means using one-way ANOVA, it becomes pretty bad.
As the number of observations per estimate decreases (20, 10, 6.7, etc.), the estimates become more erratic. Furthermore, a new sample is unlikely to replicate the inconsistent estimates produced by the smaller sample sizes.
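This intuition is easy to demonstrate with a quick simulation. The sketch below (Python with NumPy; the population values are made up purely for illustration) estimates group means from a total sample of 20 split one, two, or three ways, and shows that the estimates grow more erratic as the observations per estimate shrink:

```python
import numpy as np

# A sketch: with a fixed total sample of 20, split the observations
# among 1, 2, or 3 group means and watch how much the estimates vary
# across 10,000 repeated samples from the same population.
rng = np.random.default_rng(42)
total_n = 20

for n_groups in (1, 2, 3):
    n_per_group = total_n // n_groups  # 20, 10, or 6 observations per mean
    means = [rng.normal(loc=100, scale=15, size=n_per_group).mean()
             for _ in range(10_000)]
    print(f"{n_groups} mean(s), n = {n_per_group} each: "
          f"spread of estimates (SD) = {np.std(means):.2f}")

# The spread grows as each mean is estimated from fewer observations,
# so a new sample is less likely to replicate any one estimate.
```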
In short, the quality of the estimates deteriorates as you draw more conclusions from a sample. This idea is directly related to the degrees of freedom in the analysis. To learn more about this concept, read my post: Degrees of Freedom in Statistics.
Applying These Concepts to Overfitting Regression Models
Overfitting a regression model is similar to the example above. The problems occur when you try to estimate too many parameters from the sample. Each term in the model forces the regression analysis to estimate a parameter using a fixed sample size. Therefore, the size of your sample restricts the number of terms that you can safely add to the model before you obtain erratic estimates.
Similar to the example with the means, you need a sufficient number of observations for each term in the regression model to help ensure trustworthy results. Statisticians have conducted simulation studies* which indicate you should have at least 10-15 observations for each term in a linear model. The number of terms in a model is the sum of all the independent variables, their interactions, and polynomial terms to model curvature.
For instance, if the regression model has two independent variables and their interaction term, you have three terms and need 30-45 observations. However, if the model has multicollinearity or if the effect size is small, you might need more observations.
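If it helps to make the rule concrete, here’s a tiny sketch of that arithmetic (the function name is just for illustration):

```python
def minimum_observations(n_terms: int) -> tuple[int, int]:
    """Apply the 10-15 observations-per-term rule of thumb."""
    return 10 * n_terms, 15 * n_terms

# Two independent variables plus their interaction = 3 terms.
low, high = minimum_observations(3)
print(f"Aim for roughly {low} to {high} observations.")  # 30 to 45
```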
To obtain reliable results, you need a sample size that is large enough to handle the model complexity that your study requires. If your study calls for a complex model, you must collect a relatively large sample. If the sample is too small, you can’t dependably fit a model that approaches the true model for your dependent variable. In that case, the results can be misleading.
How to Detect Overfit Models
As I discussed earlier, generalizability suffers in an overfit model. Consequently, you can detect overfitting by determining whether your model fits new data as well as it fits the data used to estimate the model. In statistics, we call this cross-validation, and it often involves partitioning your data.
However, for linear regression, there is an excellent accelerated cross-validation method called predicted R-squared. This method doesn’t require you to collect a separate sample or partition your data, and you can obtain the cross-validated results as you fit the model. Statistical software calculates predicted R-squared using the following automated procedure (a code sketch follows the list):
- It removes a data point from the dataset.
- Calculates the regression equation.
- Evaluates how well the model predicts the missing observation.
- And, repeats this for all data points in the dataset.
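Here’s a minimal sketch of that procedure in Python, assuming an ordinary least squares model with the intercept included as a column of ones (the helper name is mine, not a standard library function):

```python
import numpy as np

def predicted_r_squared(X: np.ndarray, y: np.ndarray) -> float:
    """Leave-one-out cross-validated (predicted) R-squared for OLS.

    X must already include a column of ones for the intercept.
    """
    n = len(y)
    press = 0.0  # PRESS: the predicted residual sum of squares
    for i in range(n):
        mask = np.arange(n) != i                                  # remove one data point
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)  # refit the model
        press += (y[i] - X[i] @ beta) ** 2                        # predict the left-out point
    sst = np.sum((y - y.mean()) ** 2)
    return 1 - press / sst

# Example with noisy straight-line data:
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 30)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x + rng.normal(0, 1, 30)
print(predicted_r_squared(X, y))
```

In practice, statistical software computes the same quantity without refitting the model n times by using leverage values, but the loop above matches the steps as described.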
Predicted R-squared has several cool features. First, you can just include it in the output as you fit the model without any extra steps on your part. Second, it’s easy to interpret. You simply compare predicted R-squared to the regular R-squared and see if there is a big difference.
If there is a large discrepancy between the two values, your model doesn’t predict new observations as well as it fits the original dataset. The results are not generalizable, and there’s a good chance you’re overfitting the model.
For the fitted line plot above, the model produces a predicted R-squared (not shown) of 0%, which reveals the overfitting. For more information, read my post about how to interpret predicted R-squared, which also covers the model in the fitted line plot in more detail.
How to Avoid Overfitting Models
To avoid overfitting a regression model, you should draw a random sample that is large enough to handle all of the terms that you expect to include in your model. This process requires that you investigate similar studies before you collect data. The goal is to identify relevant variables and terms that you are likely to include in your own model. After you get a sense of the typical complexity of models in your study area, you’ll be able to estimate a good sample size.
If you’re really stuck in a situation where you have too many variables and too few observations, consider using principal component analysis to create a smaller set of indices you can model. Learn more in my post, Principal Component Analysis Guide and Example.
To read about an analysis I performed where I had to be extremely careful to avoid overfit models, read Understanding Historians’ Rankings of U.S. Presidents using Regression Models.
For more information about successful regression modeling, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.
If you’re learning regression and like the approach I use in my blog, check out my Intuitive Guide to Regression Analysis book! You can find it on Amazon and other retailers.
Reference
Babyak, M. A. (2004). What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosomatic Medicine, 66, 411-421.
John says
Hi Jim,
I have several data sets from different projects I am working on. Each set has several hundred to several thousand individual X,Y measures in it. I want to perform regression analysis on them, but I am concerned about overfitting the models. How do I select the right number of variables to obtain an accurate R-squared value?
Sjoerd says
Hi Jim, Your website contains a lot of useful information for people like me, with limited statistical training. In my case this training goes back more than 40 years, so I must have missed a lot of more recent developments. Nevertheless, I decided to try my hand at mortality statistics with two aims. One is to unravel the effect of the mean weekly ambient temperature, and the second is to come to an estimate of the excess mortality during the first wave of the COVID-19 pandemic that is more accurate than those published by our national statistics bureau and national health authority, which are both based on moving average models.
I am relying on gretl OLS regression models but keep getting into trouble. My dataset covers the period 2006-2020 and has gaps corresponding to the yearly influenza epidemics. These periods were left out since they obviously distort the regression line. Apart from the requirements of a reasonable fit and no obvious autoregression, simple logic dictates that the aggregate mortality during the periods between epidemics ought to be close to zero.
In the model for females 80 years of age and over (2006-2019), the fit is fairly good, but there is a high probability of autocorrelation and there are several intervals with aggregate mortalities that are outside the +/- 2SD range. The latter problem can be remedied by including a correction factor for each of the intervals, but this does not completely solve the autocorrelation issue. Adding factors to model the time course in each of the intervals brings the D-value into the inconclusive range but probably leads to overfitting. Moreover, both additions preclude extrapolation to the 2020 period. Some advice would be more than welcome.
Satish Kumar says
Hi Jim,
Thanks for such a nice explaination.
I have applied Decision Tree and Random Forest regression models to a time series dataset.
R-squared of DT on train data is 65.55% and on test data is 65.24%.
R-squared of RF on train data is 99.71% and on test data is 99.76%.
Even though DT is showing a moderate R2 score, whereas RF is showing a very high R2 score, I strongly believe that there is overfitting in both models.
Any thoughts on this would be highly appreciated.
Regards
Satish
Jim Frost says
Hi Satish,
Those are interesting results. The fact that, for both models, the test data have nearly as high an R-squared as the training data suggests that they aren’t overfit. But overfitting still seems possible, particularly for the RF model!
These are time series data. Is there a consistent upward or downward trend over time? If so, have you detrended the data so you’re only trying to predict the changes? If not, that might explain the overfitting in both the training and test data.
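For readers wondering what detrending looks like in practice, here’s a minimal sketch. First differences are one common approach; the series below is made up purely for illustration:

```python
import numpy as np

# A made-up series with a steady upward trend.
y = np.array([100.0, 103.0, 105.0, 108.0, 112.0, 115.0])

# First differences: the model then predicts period-to-period
# changes rather than the trending level itself.
changes = np.diff(y)
print(changes)  # [3. 2. 3. 4. 3.]
```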
Melissa Costagliola-Ray says
Hi Jim,
Thank you, I found this extremely helpful!
I just wanted to ask when you talk about requiring a certain number of observations, would this relate to the number of surveys undertaken or the number of animals recorded within each survey?
Many thanks,
Melissa
Jim Frost says
Hi Melissa,
The number of observations relates to the number of animals in your survey.
In your datasheet, the number of observations is typically equivalent to the number of rows (minus any rows with headings), where each row can have multiple variables (DV and IV) associated with it. In the regression context, each animal is one observation with multiple variables. The values for one animal are typically stored in one row. For example, you might measure the height, weight, other characteristics, and the outcome you’re assessing (the dependent variable) and record all the values for one animal in one row.
Retno Maruti says
Hi Jim, excellent work! Thank you so much for sharing. I have a question. I use high-dimensional data and use OLS to investigate the predictability relationship between the depvar and the indepvars. I divide the data into two large groups: testing and training. When I use OLS, I get a quite high R-squared for the testing sample data, so I assume that there must be an overfitting issue. Then I use Lasso (cross-validated regression) and get a lower R-squared (I divide the sample into two large groups as well). How do I interpret the R-squared in Lasso? From the papers I’ve read previously, the R-squared in Lasso is usually higher compared to OLS, as Lasso handles the overfitting issue and drops the irrelevant variables.
Jim Frost says
Hi Retno,
I’m not an expert in LASSO regression. However, I believe the interpretation for R-squared is the same in LASSO regression as it is for OLS because both are linear models, it’s just the estimation methods that differ.
I don’t know what the typical results are for R-square in OLS vs. LASSO models. However, I’m not surprised that R-squared values can be lower. Remember that LASSO shrinks coefficients down and, unlike Ridge regression, can shrink them down to zero, which effectively removes the predictor from the model. By removing predictors, you’d expect R-squared to decrease. That’s particularly true if you have an inflated R-squared due to overfitting and LASSO is rectifying the overfitting.
Krishnan says
How many observations do I need if I have 4 independent variables?
A minimum of 40 observations?
Jim Frost says
Hi Krishnan,
Assuming you’re talking about linear least squares and your IVs are all continuous, yes, an absolute minimum would be 40. However, if you need to fit curvature or interaction effects, you’ll need more observations for those terms.
berihun nega says
I learned a lot from the explanation. Thank you very much. I have one question: is there any minimum and maximum standard to say whether a regression model fits or overfits?
Halie Kang says
Hi Jim.
Thank you for the postings. I am a graduate student in a social science major, and your postings are very helpful, covering some things that are hard to find on other blogs or websites.
I have a question, and I am not sure if it is related to this posting, but I am in desperate need of help with this issue.
I am aware that having too many categorical variables in a regression model might use up degrees of freedom (which would mean we can’t trust the results), especially when the sample size is small.
1) I want to understand the reason more precisely. Why is that?
2) Does it apply to logistic regression analysis as well?
3) I have data with 260 samples and I’m conducting binary logistic regression. I have 11 independent variables so far, and 9 of them are categorical. With this very small sample size, I am concerned that this creates a problem.
4) What would be the way to check whether my model is a good model in Stata in this case?
Jim Frost says
Hi Halie,
Yes, what you describe is overfitting. I describe why that happens in this post so I won’t retype it in the comments.
Yes, it applies to logistic regression, although the guidelines for sample sizes are a bit different than for least squares regression. I think there is more emphasis on the number of observations in the smaller of the two groups for your dependent variable, although I don’t recall offhand. I believe the article I reference in the blog post provides some information about that, so you might want to check it out. Unfortunately, I don’t have the article on hand.
For your third question, it’s starting to sound problematic. However, it depends on how many levels your categorical variables have. Each categorical variable uses N-1 degrees of freedom, where N equals the number of levels for that variable. If each one has three or more levels, you’re running into problems! Again, I don’t recall offhand the guidelines for logistic regression, but you’d be running into trouble even with least squares regression, and logistic regression has more stringent guidelines. Beware!
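To make that concrete, here’s a quick sketch of the accounting. The three-level variables are hypothetical, since I don’t know your actual level counts:

```python
def model_terms(levels_per_categorical: list[int], n_continuous: int) -> int:
    """Each categorical variable spends (levels - 1) degrees of freedom;
    each continuous variable spends one."""
    return sum(k - 1 for k in levels_per_categorical) + n_continuous

# 9 categorical IVs with, say, 3 levels each, plus 2 continuous IVs:
print(model_terms([3] * 9, n_continuous=2))  # 20 terms from 260 observations
```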
I haven’t used Stata, so I can’t help you there. But you need to check the residuals as with other models. And do check into the possibility of overfitting because it looks like you might be running into it.
I hope that helps!
Gunalan says
Hi Jim
Thank you for your recommendation. I will check out the book.
Gunalan says
Dear Jim
Thank you for your quick reply. I would like to know if you could recommend some reading material about Measurement Uncertainty.
Jim Frost says
Hi Gunalan,
I’d recommend EMP III: Evaluating the Measurement Process & Using Imperfect Data by Donald J. Wheeler. That’s a pretty thorough book on the topic. If you’re looking for an overview rather than an entire book, I don’t have a good reference at hand.
Gunalan says
Hi Jim,
I’m an R&D chemist and I would like to gain more knowledge in regression analysis and method validation. Kindly advise what kind of books you would recommend.
Jim Frost says
Hi Gunalan,
I’ve written a book on regression analysis that I recommend you read. To learn about it, click this link: Regression Analysis: An Intuitive Guide. You can also get a free sample that includes the first two chapters. Just go to My Web Store and click on the free sample version of my regression book. No credit card is required for the free sample. I think you’ll find this book to be very helpful!
EISHA AKANKSHA says
I am new to data science, but this link is really helpful!
Sarah Napier says
Hi Jim,
This is so useful. Thanks so much for taking the time to explain this, I really appreciate it.
Jim Frost says
You’re very welcome, Sarah! Best of luck with your research!
Sarah Napier says
Apologies, I realise I wasn’t quite clear – my sample size is 1,860, but in regard to my dependent variable, I am modelling an outcome experienced by 314 people (who answered yes) out of 1,860.
Jim Frost says
Hi Sarah,
Thanks for the additional information. There are several common guidelines related to binary logistic regression and sample size, and they don’t always give the same answer! Some use an events per variable (EPV) calculation. I’ve seen different guidelines say that you need 50 EPV and others say that EPV ≥ 10 is sufficient. So, with 314 events, you can have 6 variables using the more stringent 50 EPV but as many as 31 variables if you go with EPV ≥ 10.
And, yet another guideline is N = 10k/p, where N is the sample size, k is the number of variables, and p is the proportion of events or non-events, whichever is smaller. p in your case is 314/1860 = 0.169. So, if we solve for k (IVs): 1860 = 10*k/0.169, we get k = 31 IVs.
Given that you have 18 IVs, you’re well under 31, which we get using two of the guidelines. Using the 50 EPV guideline, you wouldn’t include all of those–only 6. However, you have 17.4 EPV. In my experience that should be fine. My opinion is that it is OK to include all of them.
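For anyone who wants to reproduce that arithmetic, a small sketch (the function names are just for illustration):

```python
def max_predictors_by_epv(events: int, epv: int) -> int:
    """Events-per-variable rule: cap the predictor count at events / EPV."""
    return events // epv

def max_predictors_by_n(n: int, events: int) -> int:
    """The N = 10k/p guideline solved for k, where p is the smaller of
    the event and non-event proportions."""
    p = min(events, n - events) / n
    return int(n * p / 10)

events, n = 314, 1860
print(max_predictors_by_epv(events, 50))  # 6  (stringent 50 EPV rule)
print(max_predictors_by_epv(events, 10))  # 31 (EPV >= 10 rule)
print(max_predictors_by_n(n, events))     # 31 (N = 10k/p rule)
```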
Sarah Napier says
Hi Jim I found your article extremely useful, thank you. I am conducting analysis of an online survey I administered. I have a sample of 1,860 respondents, and wish to use a logistic regression to test the effect of 18 predictor variables on the dependent variable, which is binary (yes/no) (N=314). Can I include all 18 predictor variables in the same logistic regression model, or will this cause overfitting? I note the rule of 10-15 observations per predictor, and I believe my sample size would allow this, but I wasn’t sure if there was a maximum number of variables you can use? Also, I guess I need to run the model before I know how many observations I have per predictor variable? Thanks in advance!
Alana says
For a sample size of 1648 would you caution using 33 variables?
Kathryn says
As I understand it, for binary logistic regression, it’s not the total number of cases but the smaller of the number of events or non-events on the dichotomous outcome that is used with the rule of thumb for capping the number of independent variables (see https://www.cs.vu.nl/~eliens/sg/local/theory/overfitting.pdf).
For instance, when modeling an outcome experienced by 80 out of 200 cases, the rule of thumb would be applied to a basis of 80.
So you were correct in your suspicion that it wouldn’t be more lax.
Robert Parker says
Thanks for the response, and thank you for your insight regarding over-fitting. Also, it looks like Lasso regression and PLS will not address our problems, as we are testing a hypothesized variable. I will consult with my co-authors.
Jim Frost says
Hi Robert,
You’re very welcome! That was my concern after you mentioned the control variables. I figured you were testing a specific variable. Best of luck going forward!
Robert Parker says
Jim,
I like the way you explain things.
I am doing an academic study. I ran a logistic regression (binary dependent variable, yes/no) with many predictor variables — maybe too many. I suspect over-fitting problems.
The sample size is very low, about 80. I know that most statisticians will argue that my sample size is way too small for what I am attempting to do. However, I am stuck with this as collecting additional data is not feasible.
Model 1.
23 predictor variables — 22 are control variables from prior studies. Two predictor variables have significant p-values (p < .05), including my hypothesized variable.
A reviewer at a journal is arguing that I should add 12 additional control variables. This seems ill-advised to me, as my degrees of freedom would be really low. When I run the regression with 35 predictors (23+12), no predictor variable has a significant p-value. Does this reflect over-fitting? Are the standard errors of the regression coefficients over-inflated, which leads to insignificant p-values? If this is true, how can I prove it to the satisfaction of the reviewer? I have thought about a step-wise procedure to delete some of the control variables, but many journals do not like this approach. Also, I have read that Lasso regression might be an option.
Please advise.
Bob
Jim Frost says
Hi Bob,
I agree with the idea that your sample size is way too small for what you want to do. It’s already too small for the 23 variables you have in it–much less the additional 12. This article discusses a good rule of thumb for OLS regression. I don’t know offhand what a good rule is for binary logistic regression, but I doubt it will be more lax.
The problem with overfitting is that it can create completely untrustworthy results that appear to be statistically significant. You’re fitting the noise in the data. I would not say that the lack of significance with the 35 predictors is necessarily overfitting. Overfitting can produce misleading but statistically significant results. You could try reducing the number of predictors by removing the ones that are not significant. The problem with that approach is that you’ll be trying various combinations of predictors and making decisions about what to leave in the model based on many different combinations of variables. That’s called data mining and can cause problems similar to overfitting. Read my post about data mining for more information.
Overfitting can cause biased coefficients. Inflated standard errors is more typically associated with multicollinearity. I don’t know if your model has multicollinearity or not. If you do, that’s an additional problem above and beyond overfitting.
You’re in a tough spot and, unfortunately, I don’t have an easy answer. It sounds like you want to do too much with too little data. The best solution would be to obtain more data, but you indicate that’s not possible. As I mention in this article, I recommend reviewing the literature to determine the likely complexity of your model and then using that information to determine the necessary sample size.
I recommend consulting with a statistician who can devote the time to your project that it deserves. With a more in-depth look, they might be able to find a solution for you. I suspect it will involve compromises because you have far too few observations for the complexity of model you want to fit. For example, Lasso regression is a possibility when you have overfitting. However, its purpose is more for prediction than drawing inferences about the nature of the relationships between variables. Partial least squares (PLS) can also work when you have too many predictors for a given dataset. That procedure reduces the number of variables down to a smaller set of components and then performs least squares regression on those components. Both of those focus on prediction rather than the relationships between variables, and I don’t know if that suits your purposes or not. Given that you have control variables, I’m guessing not.
Best of luck with your analysis!
Jacri says
Awesome, Jim. Can this approach be used with Cox regression?
Victor says
Hi Jim,
You wrote that the statistical software calculates the predicted R-squared using the following steps:
It removes a data point from the dataset.
Calculates the regression equation.
Evaluates how well the model predicts the missing observation.
And, repeats this for all data points in the dataset.
Can you provide more details on how you then get the predicted R-squared from these steps, i.e., a formula?
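For readers with the same question: one standard formulation, assuming the usual PRESS-based definition for OLS, uses the ordinary residuals \(e_i\), the leverage values \(h_{ii}\) from the hat matrix, and the total sum of squares SST:

\[
\text{PRESS} = \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_{ii}} \right)^{2},
\qquad
R^2_{\text{pred}} = 1 - \frac{\text{PRESS}}{\text{SST}}
\]

Dividing each residual by \(1 - h_{ii}\) gives exactly the prediction error you would get by deleting that observation and refitting, so the software never has to refit the model n times.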
Andrea Berdondini says
You must always choose the polynomial whose results are less likely to be obtained randomly. Remember that increasing the degree of the polynomial drastically increases the probability of obtaining the same result randomly.
So you have to develop a Monte Carlo simulator in order to calculate this probability for the various polynomials.
Jim Frost says
Hi Andrea,
That’s certainly one way to do it. However, there are other methods. You can assess Predicted R-squared to see if you’re overfitting your model. And, you can use your subject area knowledge to determine what the relationship should be like. Usually you’ll know if theory suggests you should have multiple bends in the line or not. Using a cubic term is very rare. Anything higher and you’re almost definitely overfitting the model unless you have strong theoretical reasons to support it.
I suppose you could use a Monte Carlo simulation, but it’s not a required method. Also, using a simulation like that suggests you already know the correct form, which might not be the case. Predicted R-squared does not make that assumption.
Andrea Berdondini says
Overfitting is simply the direct consequence of treating the statistical parameters, and therefore the results obtained, as useful information without checking that they were not obtained in a random way. Therefore, in order to estimate the presence of overfitting, we have to run the algorithm on a dataset equivalent to the real one but with randomly generated values; repeating this operation many times, we can estimate the probability of obtaining equal or better results in a random way. If this probability is high, we are most likely in an overfitting situation. For example, the probability that a fourth-degree polynomial has a correlation of 1 with 5 random points on a plane is 100%, so this correlation is useless and we are in an overfitting situation.
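The closing example is easy to verify numerically: with five distinct x-values, a fourth-degree polynomial has five coefficients and can interpolate the points exactly, every time. A quick sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(5):
    x = rng.uniform(0, 1, 5)              # five random points on a plane
    y = rng.uniform(0, 1, 5)
    coeffs = np.polyfit(x, y, 4)          # 5 coefficients for 5 points
    max_error = np.max(np.abs(y - np.polyval(coeffs, x)))
    print(f"max residual: {max_error:.2e}")  # ~0: a perfect fit, always
```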
Sarah says
You’ve put an asterisk in the body text (Statisticians have conducted simulation studies*) which I have presumed is there to provide a reference for the following conclusion (which indicate you should have at least 10-15 observations for each term in a linear model). However, I can’t seem to locate this link. Can you please provide the reference to this analysis? Thanks.
Jim Frost says
Hi Sarah, it’s the Babyak reference at the bottom of this blog post. Just above the comments section.
Jae says
Hi Jim,
Thank you for your intuitive website.
I said at an interview, “I developed a multiple regression model.” Then the interviewer asked me how I knew whether the model was good or not. I said the R-squared, ah, the adjusted R-squared, was high.
Then the interviewer asked me what the difference is between R-squared and adjusted R-squared.
I couldn’t answer. She explained quickly, but I didn’t understand.
Then the interviewer asked me about the overfitting issue. Of course, I couldn’t explain that issue for regression models either.
Is there a relationship between overfitting and R-squared?
I don’t know how I failed to answer anything at the interview even though I earned a master’s in statistics 10 years ago. Too dumb. I am going to study again.
best,
Jae
Jim Frost says
Hi Jae,
To quickly learn many things about regression analysis, I highly recommend that you read my ebook about regression analysis!
In terms of how you tell whether a model is good, there are various things to look for. Do your residuals appear to be random, or are there patterns in them? A high R-squared can be nice, but by itself doesn’t mean you have a good model. And you can have a low R-squared, but as long as you have significant independent variables, it might still be a good model. One model might be good at explaining the relationships in the data but bad at making precise predictions. Another model might be the opposite: good at making predictions but bad at explaining the relationships between the variables. So, much depends on the purpose of your model and how you define good. And the subject-area also affects what is considered good. In some study areas, high R-squared values are not possible.
Back to overfitting. Typically, if you’re overfitting a model, your R-squared is higher than it should be. However, you might not know what it should be, so you might not know that it is too high. Yes, it’s possible that R-squared is too high! One of the best ways to detect overfitting is, as I explain in this post, by using predicted R-squared.
Best of luck with your studies!
Mark says
Hi Jim, It’s Mark again. I wondered if you might be able to clear up an uncertainty I have about polynomial least squares regression. I have tried to fit polynomials of increasing order to some y data (there are 2 regressors). I’ve used JMP and it generates model coefficients. I guess the resulting model minimizes the sum of squares (the sum of the squared “distances” between the predicted model values and the actual values). My question is this: will the residuals for a model obtained by least squares always sum to zero? I thought the answer would be that they sum to zero, but I’m finding that they do for low-order models (n=0, n=1, n=2) and not for order n=3, for example. The n=3 model has the form:
y = b0 + (b1*x1 + b2*x1^2 + b3*x1^3) + (b4*x2 + b5*x2^2 + b6*x2^3)
Thanks, Mark
Jim Frost says
Hi Mark,
Yes, they should always sum to zero as long as you include the constant in the model. Including the constant forces them to sum to zero. Including polynomials should not affect that even with higher-order terms. So, I’m not sure what is happening in JMP!
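This is straightforward to check numerically. A sketch with NumPy (any OLS routine should behave the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 1 + 2 * x + 0.5 * x**3 + rng.normal(0, 5, 50)

# Cubic fit WITH a constant column: residuals sum to ~0.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.sum(y - X @ beta))        # ~0, up to floating-point error

# The same fit WITHOUT the constant: no such guarantee.
X_nc = np.column_stack([x, x**2, x**3])
beta_nc, *_ = np.linalg.lstsq(X_nc, y, rcond=None)
print(np.sum(y - X_nc @ beta_nc))  # generally nonzero
```

If the sums aren’t zero in JMP for a model that includes the intercept, rounding in the reported residuals is one mundane possibility worth ruling out.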
Mark says
Hi Jim,
Thanks for your reply. I guess that you need to have data or have some idea of what the relationship is likely to be before you propose a model of any order or decide to apply transformations. So taking a stab at order=3 just because it can fit somewhat complex curves isn’t a sensible approach : )
Mark
Jim Frost says
Hi Mark,
Right, it’s not an approach I’d recommend. Graph your data, consider theory, fit the model that makes sense, check residual plots, and make adjustments as needed. The thing with higher-order polynomials is that they’re very good at fitting noise!
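To see that in action, here’s a sketch that fits increasing polynomial orders to data that are truly linear plus noise (made-up values for illustration, reusing the leave-one-out idea from the post): the ordinary R-squared climbs with the order while the predicted R-squared collapses.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.sort(rng.uniform(0, 10, 15))
y = 2 + 1.5 * x + rng.normal(0, 3, 15)    # truly linear plus noise

def fit_stats(order: int) -> tuple[float, float]:
    X = np.vander(x, order + 1)           # columns x^order ... x^0
    sst = np.sum((y - y.mean()) ** 2)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r2 = 1 - np.sum((y - X @ beta) ** 2) / sst
    press = 0.0                           # leave-one-out prediction errors
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        b, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        press += (y[i] - X[i] @ b) ** 2
    return r2, 1 - press / sst

for order in (1, 2, 4, 8):
    r2, pred_r2 = fit_stats(order)
    print(f"order {order}: R-squared = {r2:.3f}, predicted R-squared = {pred_r2:.3f}")
```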
Mark says
Hi Jim, Without knowing the true relationship between y and x, is there a minimum polynomial order that is a go-to? For example, if there is curvature, then a model of order 1, e.g. y = a0 + a1x, wouldn’t be a good fit. A model y = a0 + a1x + a2x^2 would be better, but wouldn’t be a good fit if there were both a minimum and a maximum present. Then a model of order 3, e.g. y = a0 + a1x + a2x^2 + a3x^3, would be better.
My feeling is that order = 3 is the minimum order required to fit “wiggly data,” so to be safe, order = 4 would be a safe bet. If the relationship between y and x has many hills and valleys (not in a regular sinusoidal way), then maybe an order higher than 4 would be required, but usually relationships are smooth and continuous (though sometimes sharp discontinuities occur, e.g. if x = Twater = 0degC, where maybe even a 4th order polynomial would not approximate the data well). In the case I’m considering, I don’t think there are any sharp discontinuities, so I’m thinking that 4th order is a good “go-to” choice for polynomial order. If I were to use a 10th order polynomial (to maximize R^2), I suspect I’d just be fitting the noise rather than the underlying true relationship. Just wondering if you’d agree with my assessment of a 4th degree polynomial being a good “go-to” choice? Thanks, Mark
Jim Frost says
Hi Mark,
As a general rule of thumb for most analysts in most situations? No, I’d say that a 4th order polynomial is far too high. For most situations, that would be too many. Depending on the nature of your data and sample size, you’d either be overfitting the data or just have a number of terms that are not significant–which can reduce the precision of your model.
I’d think about it from the opposite direction. Start with graphing your data and subject-area knowledge to get an idea of what curvature you need to fit. That will hopefully make it clear right there. If unsure, I’d start with a lower-order polynomial, and then check the residual plots. If necessary, you can increase the model order based on the residual plots.
In practice, I’ve never seen a 4th order polynomial, or even a 3rd order. What I have seen was that by the time a 3rd order would’ve been called for based on the number of bends, it was actually a nonlinear model that fit better. That’s not necessarily a general rule for how it works, just what I’ve seen. I’m sure that varies by subject-area. However, you’ll need to use subject-area knowledge and theory to guide you. If there is a specific reason why a 4th order polynomial or higher makes theoretical sense, it could well be justified.
As a counter example, in the fitted line plot in this post with the cubic model, there’s just no theoretical reason for why the rankings would first increase, then decline, and then increase again as approval increases. It looks like it provides a good fit in the plot but it doesn’t make theoretical sense. The predicted R-squared also makes it clear that it’s not a good model. The cubic model just forces the model to play connect the dots.
In order to start with a 4th order polynomial, you’d need a good reason for why the model calls for that. In other words, an explanation for why there should be three bends in the data. That’s not going to be the normal case for most analysts. However, again, use your subject-area knowledge to make this call. Maybe it’s appropriate for what you’re studying?
If you go that route, be sure to check predicted R-squared. Also note that for each term in your model you should have 10-15 observations minimum. With a 4th order model, you should have 40-60. With a 10th order model, you’d need 100-150! That’s assuming there are no other terms in the model besides the individual predictor and its higher-order terms.
I hope this helps!
Vansh says
Thanks..
Great explanation..
Amir says
Hi Jim,
Excellent explanation! You nailed it! I have a question, though. Could overfitting affect the size of the coefficients, such as making them larger than usual? I have a probit model with a fairly big number of observations, i.e., 4000, and a couple of interaction terms. When I estimate the model, I get a 0.67 R-squared, but only two interactions are significant, with coefficient sizes greater than 10, while the sizes of the coefficients of both main effects are less than 0.5. This has made me really concerned about the model.
Jim Frost says
Hi Amir,
Yes, overfitting can do all sorts of strange things including affecting the size of the coefficients. However, having interaction coefficients that are larger than the main effect coefficients isn’t necessarily a problem. In fact, it can be OK to have main effects = 0 and large interaction effects. It all depends on the subject-area. By itself, that’s no reason to be concerned. With 4000 observations, you’d have to have a very complex model to be overfitting your model. I doubt that’s happening. Check your residual plots to make sure they look good and graph the interaction effects to determine whether they make sense using your subject-area knowledge.
Best of luck with your analysis!
Dave says
Jim, Awesome stuff! I’ve been an algorithm developer for 20 years using mostly neural networks. I really appreciate your posts on over-fitting. I just wanted to make sure I understand your rule of thumb for observations. So if I have independent variables k and j and k*j in my model, then that would count as 3 terms and I should have at least 30 (3×10) observations to develop the model. Is that correct?
Jim Frost says
Hi Dave,
Yes, that’s absolutely correct for OLS. You should have at least 30 observations. Other forms of regression analysis can have different requirements. If you have weak effects, you might need even more to detect them.
Boris Droz says
Hi Jim,
Thanks for your website; it’s very helpful for a non-statistician such as me. You provide good “cooking recipes,” if I can use that term.
I have a question: I did exactly what you did to detect overfitting (comparing the model R2 and cross-validated R2), and I have seen this procedure a couple of times in different papers. But I am struggling to find the threshold value between the best-case scenario (difference = 0), an acceptable scenario (maybe up to 0.2), a slightly overfit scenario, and an overfit scenario.
Do some thresholds exist?
Thank you in advance for your answer
Boris
Jim Frost says
Hi Boris,
It’s difficult to come up with a specific value. I’m sure if you ask 5 different statisticians, they’d give you 5 different answers. However, I’d agree that once you get to a difference of 0.2, you should definitely start wondering and looking into it as a potential problem.
Rohit says
Thanks Jim. Your work is really increasing my understanding of statistics.
Jim Frost says
Thank you, Rohit! I’m glad you’ve found my blog to be helpful!
reet khatri says
This is so easy to understand, thank you!
Jim Frost says
Hi Reet, you’re very welcome! I’m happy to hear that you found it to be helpful!
Ed says
I’ve been asked to write a proof for why the number of regressors K cannot exceed N. I understand the intuition but need some help proving it mathematically.
Jim Frost says
Hi Ed, here’s a pointer in the right direction. When the number of parameters = N, there are zero error degrees of freedom. Note that the parameters include the constant. So, if you have five observations, you can estimate the parameters for the constant and four predictors.
Md Rabiul Kabir says
Very helpful site
Jim Frost says
Thank you! I’m glad you found it to be helpful!
Ramskrishna says
Wonderful job thank you
Jim Frost says
Thanks so much for your kind comment. It made my day!