
Statistics By Jim

Making statistics intuitive


Overfitting Regression Models: Problems, Detection, and Avoidance

By Jim Frost

Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. In regression analysis, overfitting can produce misleading R-squared values, regression coefficients, and p-values. In this post, I explain why overfitting models is a problem and how you can detect and avoid it.

Overfit regression models have too many terms for the number of observations. When this occurs, the regression coefficients represent the noise rather than the genuine relationships in the population.

That’s problematic by itself. However, there is another problem. Each sample has its own unique quirks. Consequently, a regression model that becomes tailor-made to fit the random quirks of one sample is unlikely to fit the random quirks of another sample. Thus, overfitting a regression model reduces its generalizability outside the original dataset.

Taken together, an overfit regression model describes the noise, and it isn’t applicable outside the sample. That’s not very helpful, right? I’d really like these problems to sink in because overfitting often occurs when analysts chase a high R-squared. In fact, inflated R-squared values are a symptom of overfit models! Despite the misleading results, it can be difficult for analysts to give up that nice high R-squared value.

When choosing a regression model, our goal is to approximate the true model for the whole population. If we accomplish this goal, our model should fit most random samples drawn from that population. In other words, our results are more generalizable—we can expect that the model will fit other samples.

Related post: Model Specification: Choosing the Correct Regression Model

Graphical Illustration of Overfitting Regression Models

The image below illustrates an overfit model. The green line represents the true relationship between the variables. The random error inherent in the data causes the data points to fall randomly around the green fit line. The red line represents an overfit model. This model is too complex, and it attempts to explain the random error present in the data.

Graphical illustration of overfitting a regression model.

The example above is very clear. However, it’s not always that obvious. Below, the fitted line plot shows an overfit model. In the graph, it appears that the model explains a good proportion of the dependent variable variance. Unfortunately, this is an overfit model, and I’ll show you how to detect it shortly.

Fitted line plot that displays another example of overfitting a regression model.

If you have more than two independent variables, it’s not possible to graph them in this manner, which makes overfitting harder to detect.
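To see this numerically, here’s a minimal simulation (illustrative only; it is not the data behind the plots above). A ninth-order polynomial fits one noisy sample better than the true straight-line model does, but that advantage evaporates on a fresh sample from the same population:

```python
import numpy as np

rng = np.random.default_rng(42)

def r_squared(y, y_pred):
    """Proportion of variance explained: 1 - SSE/SST."""
    return 1 - np.sum((y - y_pred) ** 2) / np.sum((y - y.mean()) ** 2)

# True population model: y = 2 + 3x plus random error.
x = rng.uniform(0, 10, 15)
y = 2 + 3 * x + rng.normal(0, 5, 15)

# Fit a correctly specified model (degree 1) and an overfit one (degree 9).
simple = np.polynomial.Polynomial.fit(x, y, deg=1)
overfit = np.polynomial.Polynomial.fit(x, y, deg=9)

# The overfit model always "wins" on the sample it was fit to...
r2_simple, r2_overfit = r_squared(y, simple(x)), r_squared(y, overfit(x))

# ...but check both models against a new sample from the same population.
x_new = rng.uniform(0, 10, 15)
y_new = 2 + 3 * x_new + rng.normal(0, 5, 15)
r2_new_simple = r_squared(y_new, simple(x_new))
r2_new_overfit = r_squared(y_new, overfit(x_new))

print(f"Training R2: simple={r2_simple:.3f}, overfit={r2_overfit:.3f}")
print(f"New-sample R2: simple={r2_new_simple:.3f}, overfit={r2_new_overfit:.3f}")
```

With most random seeds, the degree-9 model posts a near-perfect R-squared on its own training sample and a far worse (often negative) one on the new sample, which is exactly the generalizability failure described above.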

How Overfitting a Model Causes these Problems

Let’s go back to the basics of inferential statistics to understand how overfitting models causes problems. You use inferential statistics to draw conclusions about a population from a random sample. An important consideration is that the sample size limits the quantity and quality of the conclusions you can draw about a population. The more you need to learn, the larger the sample must be.

This concept is fairly intuitive. Suppose we have a total sample size of 20 and we need to estimate one population mean using a 1-sample t-test. We’ll probably obtain a good estimate. However, if we want to use a 2-sample t-test to estimate the means of two populations, it’s not as good because we have only ten observations to estimate each mean. If we want to estimate three or more means using one-way ANOVA, it becomes pretty bad.

As the number of observations per estimate decreases (20, 10, 6.7, etc.), the estimates become more erratic. Furthermore, a new sample is unlikely to replicate the inconsistent estimates produced by the smaller sample sizes.
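A quick simulation makes this concrete (an illustrative sketch, not from the post’s examples): hold the total sample at 20 observations, split it across more and more group means, and watch the spread of the estimates grow:

```python
import numpy as np

rng = np.random.default_rng(1)
n_total, n_sims = 20, 5000
true_mean, true_sd = 100, 15

# For each design, estimate one group mean from its share of the 20
# observations, many times over, and measure how erratic that estimate is.
spread = {}
for n_groups in (1, 2, 4):
    per_group = n_total // n_groups
    estimates = rng.normal(true_mean, true_sd, (n_sims, per_group)).mean(axis=1)
    spread[n_groups] = estimates.std()
    print(f"{n_groups} mean(s), {per_group} obs each: "
          f"SD of the estimate = {spread[n_groups]:.2f}")
```

The standard deviation of the estimates climbs as each mean gets fewer observations (theoretically 15/sqrt(n) per group), mirroring the 1-sample t-test versus one-way ANOVA comparison above.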

In short, the quality of the estimates deteriorates as you draw more conclusions from a sample. This idea is directly related to the degrees of freedom in the analysis. To learn more about this concept, read my post: Degrees of Freedom in Statistics.

Applying These Concepts to Overfitting Regression Models

Overfitting a regression model is similar to the example above. The problems occur when you try to estimate too many parameters from the sample. Each term in the model forces the regression analysis to estimate a parameter using a fixed sample size. Therefore, the size of your sample restricts the number of terms that you can safely add to the model before you obtain erratic estimates.

Similar to the example with the means, you need a sufficient number of observations for each term in the regression model to help ensure trustworthy results. Statisticians have conducted simulation studies* which indicate you should have at least 10-15 observations for each term in a linear model. The number of terms in a model is the sum of all the independent variables, their interactions, and polynomial terms to model curvature.

For instance, if the regression model has two independent variables and their interaction term, you have three terms and need 30-45 observations. Although, if the model has multicollinearity or if the effect size is small, you might need more observations.
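That term counting is easy to script. Here’s a tiny helper based on the 10-15 observations-per-term guideline (the function name is mine, not from any statistics package):

```python
def required_sample_size(n_terms, obs_per_term=(10, 15)):
    """Rough minimum and comfortable sample sizes for a linear model,
    given the total number of terms (IVs + interactions + polynomial terms)."""
    low, high = obs_per_term
    return n_terms * low, n_terms * high

# Two independent variables plus their interaction = 3 terms.
print(required_sample_size(3))  # -> (30, 45)
```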

To obtain reliable results, you need a sample size that is large enough to handle the model complexity that your study requires. If your study calls for a complex model, you must collect a relatively large sample. If the sample is too small, you can’t dependably fit a model that approaches the true model. In that case, the results can be misleading.

How to Detect Overfit Models

As I discussed earlier, generalizability suffers in an overfit model. Consequently, you can detect overfitting by determining whether your model fits new data as well as it fits the data used to estimate the model. In statistics, we call this cross-validation, and it often involves partitioning your data.

However, for linear regression, there is an excellent accelerated cross-validation method called predicted R-squared. This method doesn’t require you to collect a separate sample or partition your data, and you can obtain the cross-validated results as you fit the model. Statistical software calculates predicted R-squared using the following automated procedure:

  • Remove a data point from the dataset.
  • Recalculate the regression equation.
  • Evaluate how well the model predicts the missing observation.
  • Repeat this for all data points in the dataset.
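That leave-one-out procedure is straightforward to sketch by hand. Below is a minimal PRESS-based implementation (an illustration of the idea, not any particular package’s code):

```python
import numpy as np

def predicted_r_squared(X, y):
    """Predicted R-squared: 1 - PRESS/SST, where PRESS sums the squared
    leave-one-out prediction errors. X excludes the intercept column."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X)])
    press = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i          # remove one data point
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        press += (y[i] - X[i] @ beta) ** 2     # error predicting it back
    return 1 - press / np.sum((y - y.mean()) ** 2)

# Demo: with a strong linear signal, predicted R-squared stays high.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2 + 3 * x + rng.normal(0, 1, 30)
r2_pred = predicted_r_squared(x.reshape(-1, 1), y)
print(f"Predicted R-squared: {r2_pred:.3f}")
```

An overfit model (too many terms for the sample) drives PRESS up and this statistic down, which is why it works as an accelerated cross-validation check.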

Predicted R-squared has several cool features. First, you can just include it in the output as you fit the model without any extra steps on your part. Second, it’s easy to interpret. You simply compare predicted R-squared to the regular R-squared and see if there is a big difference.

If there is a large discrepancy between the two values, your model doesn’t predict new observations as well as it fits the original dataset. The results are not generalizable, and there’s a good chance you’re overfitting the model.

For the fitted line plot above, the model produces a predicted R-squared (not shown) of 0%, which reveals the overfitting. For more information, read my post about how to interpret predicted R-squared, which also covers the model in the fitted line plot in more detail.

How to Avoid Overfitting Models

To avoid overfitting a regression model, you should draw a random sample that is large enough to handle all of the terms that you expect to include in your model. This process requires that you investigate similar studies before you collect data. The goal is to identify relevant variables and terms that you are likely to include in your own model. After you get a sense of the typical complexity of models in your study area, you’ll be able to estimate a good sample size.

If you’re really stuck in a situation where you have too many variables and too few observations, consider using principal component analysis to create a smaller set of indices you can model. Learn more in Principal Component Analysis Guide and Example.

To read about an analysis I performed where I had to be extremely careful to avoid overfit models, read Understanding Historians’ Rankings of U.S. Presidents using Regression Models.

For more information about successful regression modeling, read my post: Five Regression Analysis Tips to Avoid Common Mistakes.

If you’re learning regression and like the approach I use in my blog, check out my Intuitive Guide to Regression Analysis book! You can find it on Amazon and other retailers.

Cover for my ebook, Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models.

Reference

Babyak, M. A. (2004). What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models. Psychosomatic Medicine, 66, 411–421.




Comments

  1. Satish Kumar says

    September 8, 2022 at 12:19 pm

    Hi Jim,
    Thanks for such a nice explanation.

    I have applied Decision tree and Random forest regression model on a time series dataset.
    R-squared of DT on Train data is 65.55% and on test data is 65.24%
    R-squared of RF on Train data is 99.71% and on test data is 99.76%.

    Even though the DT shows a moderate R2 score while the RF shows a very high one, I strongly believe that there is overfitting in both models.
    Any thoughts on this would be highly appreciated.

    Regards
    Satish

    • Jim Frost says

      September 12, 2022 at 11:40 pm

      Hi Satish,

      Those are interesting results. The fact that, in both cases, the test data have nearly as high an R-squared as the training data suggests the models aren’t overfit. But overfitting still seems likely, particularly for the RF model!

      These are time series data. Is there a consistent upward or downward trend over time? If so, have you detrended the data so you’re only trying to predict the changes? If not, that might explain the overfitting in both the training and test data.

  2. Melissa Costagliola-Ray says

    July 26, 2022 at 6:23 am

    Hi Jim,

    Thank you, I found this extremely helpful!

    I just wanted to ask when you talk about requiring a certain number of observations, would this relate to the number of surveys undertaken or the number of animals recorded within each survey?

    Many thanks,

    Melissa

    • Jim Frost says

      July 26, 2022 at 1:47 pm

      Hi Melissa,

      The number of observations relates to the number of animals in your survey.

      In your datasheet, the number of observations is typically equivalent to the number of rows (minus any rows with headings), where each row can have multiple variables (DV and IV) associated with it. In the regression context, each animal is one observation with multiple variables. The values for one animal are typically stored in one row. For example, you might measure the height, weight, other characteristics, and the outcome you’re assessing (the dependent variable) and record all the values for one animal in one row.

  3. Retno Maruti says

    April 15, 2022 at 12:55 pm

    Hi Jim, excellent work! Thank you so much for sharing. I have a question. I use high-dimensional data and use OLS to investigate the predictability relationship between the depvar and the indepvars. I divide the data into two large groups: testing and training. I then use OLS and get a quite high R-squared for the testing sample data. I assume that there must be an overfitting issue. Then I use Lasso (cross-validated regression) and get a lower R-squared (I divide the sample into two large groups as well). How do I interpret the R-squared in Lasso? Because from the papers I’ve read previously, the R-squared in Lasso is usually higher compared to OLS, as it handles the overfitting issue and drops the irrelevant variables.

    • Jim Frost says

      April 15, 2022 at 4:24 pm

      Hi Retno,

      I’m not an expert in LASSO regression. However, I believe the interpretation for R-squared is the same in LASSO regression as it is for OLS because both are linear models, it’s just the estimation methods that differ.

      I don’t know what the typical results are for R-square in OLS vs. LASSO models. However, I’m not surprised that R-squared values can be lower. Remember that LASSO shrinks coefficients down and, unlike Ridge regression, can shrink them down to zero, which effectively removes the predictor from the model. By removing predictors, you’d expect R-squared to decrease. That’s particularly true if you have an inflated R-squared due to overfitting and LASSO is rectifying the overfitting.

  4. Krishnan says

    November 14, 2021 at 11:32 pm

    How many observations do I need if I have 4 independent variables

    Minimum of 40 observations ?

    • Jim Frost says

      November 14, 2021 at 11:52 pm

      Hi Krishnan,

      Assuming you’re talking about linear least squares and your IVs are all continuous, yes, an absolute minimum would be 40. However, if you need to fit curvature or interaction effects, you’ll need more observations for those terms.

  5. berihun nega says

    July 2, 2021 at 3:26 pm

    I learned a lot from the explanation. Thank you very much. I have one question: is there any minimum or maximum standard to say a regression model fits or overfits?

  6. Halie Kang says

    May 29, 2021 at 3:19 pm

    Hi Jim.
    Thank you for the postings. I am a graduate student in a social science major, and your postings are very helpful, covering things that are hard to find in other blogs or websites.

    I have a question, and I am not sure if it is related to this posting, but I am in desperate need of help with this issue.
    I am aware that having too many categorical variables in a regression model might affect the degrees of freedom (which would mean we can’t trust the results), especially when the sample size is small.

    1) I want to understand the reason more precisely. Why is that?
    2) Does it apply to logistic regression analysis as well?
    3) I have a dataset with 260 observations, and I’m conducting binary logistic regression. So far I have 11 independent variables, 9 of which are categorical. With this very small sample size, I am concerned that this is a problem.
    4) What would be the way to check whether my model is a good model in Stata in this case?

    • Jim Frost says

      May 29, 2021 at 8:07 pm

      Hi Halie,

      Yes, what you describe is overfitting. I describe why that happens in this post so I won’t retype it in the comments.

      Yes, it applies to logistic regression, although the guidelines for sample sizes are a bit different than for least squares regression. I think there is more emphasis on the number of observations in the smaller of the two groups of your dependent variable, although I don’t recall offhand. I believe the article I reference in the blog post provides some information about that, so you might want to check it out. Unfortunately, I don’t have the article on hand.

      For your third question, it’s starting to sound problematic. However, it depends on how many levels your categorical variables have. Each categorical variable uses N − 1 DF, where N equals the number of levels for that variable. If each one has three or more levels, you’re running into problems! Although, again, I don’t recall offhand the guidelines for logistic regression, you’d be running into trouble even with least squares regression, and logistic regression has more stringent guidelines. Beware!

      I haven’t used Stata, so I can’t help you there. But you should check the residuals as you would with other models. And do check into the possibility of overfitting, because it looks like you might be running into it.

      I hope that helps!

  7. Gunalan says

    October 20, 2020 at 10:54 pm

    Hi Jim
    Thank you for your recommendation. I will check out the book.

  8. Gunalan says

    October 20, 2020 at 9:20 pm

    Dear Jim
    Thank you for your quick reply. I would like to know if you could recommend some reading material about Measurement Uncertainty.

    • Jim Frost says

      October 20, 2020 at 9:28 pm

      Hi Gunalan,

      I’d recommend EMP III: Evaluating the Measurement Process & Using Imperfect Data by Donald J. Wheeler. That’s a pretty thorough book on the topic. If you’re looking for an overview rather than an entire book, I don’t have a good reference at hand.

  9. Gunalan says

    October 20, 2020 at 12:06 am

    Hi Jim,
    I’m an R&D chemist and I would like to gain more knowledge in regression analysis and method validation. Kindly advise what kind of books you would recommend.

    • Jim Frost says

      October 20, 2020 at 12:37 am

      Hi Gunalan,

      I’ve written a book on regression analysis that I recommend you read. To learn about it, click this link: Regression Analysis: An Intuitive Guide. You can also get a free sample that includes the first two chapters. Just go to My Web Store and click on the free sample version of my regression book. No credit card is required for the free sample. I think you’ll find this book to be very helpful!

  10. EISHA AKANKSHA says

    June 2, 2020 at 10:58 am

    I am new to data science, but this link is really helpful.

  11. Sarah Napier says

    May 3, 2020 at 2:24 am

    Hi Jim,

    This is so useful. Thanks so much for taking the time to explain this, I really appreciate it.

    • Jim Frost says

      May 3, 2020 at 3:06 am

      You’re very welcome, Sarah! Best of luck with your research!

  12. Sarah Napier says

    May 2, 2020 at 9:17 am

    Apologies, I realise I wasn’t quite clear. My sample size is 1,860, but with regard to my dependent variable, I am modelling an outcome experienced by 314 people (who answered yes) out of 1,860.

    • Jim Frost says

      May 2, 2020 at 10:55 pm

      Hi Sarah,

      Thanks for the additional information. There are several common guidelines related to binary logistic regression and sample size and they don’t always give the same answer! Some use an event per variable (EPV) calculation. I’ve seen different guidelines say that you need 50 EPV and others say that EPV ≥ 10 is sufficient. So, with 314 events, you can have 6 variables using the more stringent 50 EPV but as many as 31 variables if you go with EPV ≥ 10.

      And, yet another guideline is N = 10 k / p where N is sample size, k is the number of variables and p is the proportion of events or non-events, whichever is smaller. P in your case is events 314/1860 = 0.169. So, if we solve for k (IVs): 1860 = 10*k/0.169, we get k = 31 IVs.

      Given that you have 18 IVs, you’re well under 31, which we get using two of the guidelines. Using the 50 EPV guideline, you wouldn’t include all of those–only 6. However, you have 17.4 EPV. In my experience that should be fine. My opinion is that it is OK to include all of them.
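For anyone who wants to reproduce that arithmetic, here’s a quick script (the variable names are mine; the numbers come from the thread above):

```python
# Numbers from the question: 1,860 respondents, 314 events, 18 candidate IVs.
events, n, n_ivs = 314, 1860, 18

max_ivs_strict = events // 50            # 50 events per variable (EPV)
max_ivs_lenient = events // 10           # EPV >= 10
p = min(events, n - events) / n          # proportion of the smaller outcome
max_ivs_formula = int(n * p / 10)        # solve N = 10k/p for k

epv = events / n_ivs                     # events per variable actually used
print(max_ivs_strict, max_ivs_lenient, max_ivs_formula, round(epv, 1))
```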

  13. Sarah Napier says

    April 30, 2020 at 11:19 pm

    Hi Jim I found your article extremely useful, thank you. I am conducting analysis of an online survey I administered. I have a sample of 1,860 respondents, and wish to use a logistic regression to test the effect of 18 predictor variables on the dependent variable, which is binary (yes/no) (N=314). Can I include all 18 predictor variables in the same logistic regression model, or will this cause overfitting? I note the rule of 10-15 observations per predictor, and I believe my sample size would allow this, but I wasn’t sure if there was a maximum number of variables you can use? Also, I guess I need to run the model before I know how many observations I have per predictor variable? Thanks in advance!

  14. Alana says

    April 23, 2020 at 1:19 pm

    For a sample size of 1648 would you caution using 33 variables?

  15. Kathryn says

    April 10, 2020 at 2:07 am

    As I understand it, for binary logistic regression, it’s not the total number of cases but the smaller number of events on the dichotomous outcome that is used with the rule of thumb for capping the number of independent variables (see https://www.cs.vu.nl/~eliens/sg/local/theory/overfitting.pdf).

    For instance, when modeling an outcome experienced by 80 out of 200 cases, it would be a basis of 80 to which the rule of thumb would be applied.

    So you were correct in your suspicion that it wouldn’t be more lax.

  16. Robert Parker says

    February 22, 2020 at 10:51 pm

    Thanks for response. Thank you for your insight regarding over-fitting. Also, it looks like Lasso regression and PLS will not address our problems as we are testing a hypothesized variable. I will consult with my co-authors.

    • Jim Frost says

      February 22, 2020 at 11:17 pm

      Hi Robert,

      You’re very welcome! That was my concern after you mentioned the control variables. I figured you were testing a specific variable. Best of luck going forward!

  17. Robert Parker says

    February 17, 2020 at 12:01 am

    Jim,

    I like the way you explain things.

    I am doing an academic study. I ran a logistic regression (binary dependent variable, yes/no) with many predictor variables — maybe too many. I suspect overfitting problems.

    The sample size is very low, about 80. I know that most statisticians will argue that my sample size is way too small for what I am attempting to do. However, I am stuck with this as collecting additional data is not feasible.

    Model 1.
    23 predictor variables — 22 are control variables from prior studies. Two predictor variables have significant p-values (p < .05), including my hypothesized variable.

    A reviewer at a journal is arguing that I should add 12 additional control variables. This seems ill advised to me, as my degrees of freedom would be really low. When I run the regression with 35 predictors (23 + 12), no predictor variable has a significant p-value. Does this reflect overfitting? Are the standard errors of the regression coefficients inflated, leading to insignificant p-values? If this is true, how can I prove it to the satisfaction of the reviewer? I have thought about a stepwise procedure to delete some of the control variables, but many journals do not like this approach. I have also read that Lasso regression might be an option.

    Please advise.

    Bob

    • Jim Frost says

      February 20, 2020 at 3:41 pm

      Hi Bob,

      I agree with the idea that your sample size is way too small for what you want to do. It’s already too small for the 23 variables you have in it–much less the additional 12. This article discusses a good rule of thumb for OLS regression. I don’t know offhand what a good rule is for binary logistic regression, but I doubt it will be more lax.

      The problem with overfitting is that it can create completely untrustworthy results that appear to be statistically significant. You’re fitting the noise in the data. I would not say that the lack of significance with the 35 predictors is necessarily overfitting. Overfitting can produce misleading but statistically significant results. You could try reducing the number of predictors by removing the ones that are not significant. The problem with that approach is that you’ll be trying various combinations of predictors and making decisions about what to leave in the model based on many different combinations of variables. That’s called data mining and can cause problems similar to overfitting. Read my post about data mining for more information.

      Overfitting can cause biased coefficients. Inflated standard errors is more typically associated with multicollinearity. I don’t know if your model has multicollinearity or not. If you do, that’s an additional problem above and beyond overfitting.

      You’re in a tough spot and, unfortunately, I don’t have an easy answer. It sounds like you want to do too much with too little data. The best solution would be to obtain more data, but you indicate that’s not possible. As I mention in this article, I recommend reviewing the literature to determine the likely complexity of your model and then using that information to determine the necessary sample size.

      I recommend consulting with a statistician who can devote the time to your project that it deserves. With a more in-depth look, they might be able to find a solution for you. I suspect it will involve compromises because you have far too few observations for the complexity of model you want to fit. For example, Lasso regression is a possibility when you have overfitting. However, its purpose is more for prediction than for drawing inferences about the nature of the relationships between variables. Partial least squares (PLS) can also work when you have too many predictors for a given dataset. That procedure reduces the variables down to a smaller set of components and then performs least squares regression on those components. Both of those focus on prediction rather than the relationships between variables, and I don’t know whether that suits your purposes or not. Given that you have control variables, I’m guessing not.

      Best of luck with your analysis!

  18. Jacri says

    December 3, 2019 at 7:55 pm

    Awesome, Jim. Can this approach be used with Cox regression?

  19. Victor says

    September 19, 2019 at 3:20 am

    Hi Jim,

    You wrote that the statistical software calculates the predicted R-squared using the following steps:
    It removes a data point from the dataset.
    Calculates the regression equation.
    Evaluates how well the model predicts the missing observation.
    And, repeats this for all data points in the dataset.

    Can you provide more details on how you then get the predicted R-squared after these steps, i.e., give a formula?

  20. Andrea Berdondini says

    August 21, 2019 at 10:20 am

    You must always choose the polynomial whose results are least likely to be obtained randomly. Remember that increasing the degree of the polynomial drastically increases the probability of obtaining the same results randomly.
    So you have to develop a Monte Carlo simulator in order to calculate this probability for the various polynomials.

    • Jim Frost says

      August 21, 2019 at 11:49 am

      Hi Andrea,

      That’s certainly one way to do it. However, there are other methods. You can assess Predicted R-squared to see if you’re overfitting your model. And, you can use your subject area knowledge to determine what the relationship should be like. Usually you’ll know if theory suggests you should have multiple bends in the line or not. Using a cubic term is very rare. Anything higher and you’re almost definitely overfitting the model unless you have strong theoretical reasons to support it.

      I suppose you could use a Monte Carlo simulation, but it’s not a required method. Also, using a simulation like that suggests you already know the correct form, which might not be the case. Predicted R-squared does not make that assumption.

  21. Andrea Berdondini says

    August 15, 2019 at 6:41 am

    Overfitting is simply the direct consequence of treating the statistical parameters, and therefore the results obtained, as useful information without checking that they were not obtained in a random way. Therefore, in order to estimate the presence of overfitting, we have to run the algorithm on a dataset equivalent to the real one but with randomly generated values; repeating this operation many times, we can estimate the probability of obtaining equal or better results in a random way. If this probability is high, we are most likely in an overfitting situation. For example, the probability that a fourth-degree polynomial has a correlation of 1 with 5 random points on a plane is 100%, so this correlation is useless and we are in an overfitting situation.

  22. Sarah says

    April 2, 2019 at 1:49 pm

    You’ve put an asterisk in the body text (Statisticians have conducted simulation studies*) which I have presumed is there to provide a reference for the following conclusion (which indicate you should have at least 10-15 observations for each term in a linear model). However, I can’t seem to locate this link. Can you please provide the reference to this analysis? Thanks.

    • Jim Frost says

      April 2, 2019 at 1:54 pm

      Hi Sarah, it’s the Babyak reference at the bottom of this blog post. Just above the comments section.

  23. Jae says

    March 30, 2019 at 12:30 pm

    Hi Jim,

    Thank you for your intuitive website.

    I said at an interview, “I developed a multiple regression model.” Then the interviewer asked me how I knew whether the model was good or not. I said the R-squared, ah! the adjusted R-squared, was high.
    Then the interviewer asked me what the difference is between R-squared and adjusted R-squared.
    I couldn’t answer. She explained quickly, but I didn’t understand.
    Then the interviewer asked me about the overfitting issue. Of course, I couldn’t explain that issue for regression models either.

    Is there a relationship between overfitting and R-squared?

    I don’t know how I couldn’t answer anything at an interview even though I got a master’s in statistics 10 years ago. Too dumb. I am going to study again.

    best,
    Jae

    • Jim Frost says

      March 30, 2019 at 7:28 pm

      Hi Jae,

      To quickly learn many things about regression analysis, I highly recommend that you read my ebook about regression analysis!

      In terms of how you tell whether a model is good, there are various things to look for. Do your residuals appear to be random, or are there patterns in them? A high R-squared can be nice, but by itself it doesn’t mean you have a good model. And you can have a low R-squared, but as long as you have significant independent variables, it might still be a good model. One model might be good at explaining the relationships in the data but bad at making precise predictions. Another model might be the opposite: good at making predictions but bad at explaining the relationships between the variables. So, much depends on the purpose of your model and how you define good. And the subject area also affects what is considered good. In some study areas, high R-squared values are not possible.

      Back to overfitting. Typically, if you’re overfitting a model, your R-squared is higher than it should be. However, you might not know what it should be, so you might not know that it is too high. Yes, it’s possible that R-squared is too high! One of the best ways to detect overfitting is, as I explain in this post, by using predicted R-squared.

      Best of luck with your studies!

  24. Mark says

    February 12, 2019 at 1:23 pm

    Hi Jim, It’s Mark again. I wondered if you might be able to clear up an uncertainty I have about Polynomial Squares Regression. I have tried to fit a polynomial with increasing order to some y data (there are 2 regressors). I’ve used JMP and it generates model coefficients. I guess these the resulting model minimizes the sum of squares (the sum of the squared “distance” between the predicted model value and the actual value. My question is this – Will the residuals for a model obtained by Least squares always sum to zero. I thought that the answer would be they would sum to zero but I’m finding that they do for low order models n=0, n=1, n=2 but not for order n=3 for example (so the n=3 order has the form:

    y = b0 + (b1*x1 + b2*x1^2 + b3*x1^3) + (b4*x2 + b5*x2^2 + b6*x2^3)

    Thanks, Mark

    Reply
    • Jim Frost says

      February 12, 2019 at 7:54 pm

      Hi Mark,

      Yes, they should always sum to zero as long as you include the constant in the model. Including the constant forces them to sum to zero. Including polynomials should not affect that even with higher-order terms. So, I’m not sure what is happening in JMP!
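      As an illustrative sketch of this point (my own example with simulated data, not part of the original reply), here is a numpy demonstration: when the design matrix includes a constant column, the least-squares residuals are orthogonal to that column, so they sum to zero regardless of how many polynomial terms are present.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, 50)

# Design matrix with a constant (intercept) column plus cubic polynomial terms.
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

# Least-squares residuals are orthogonal to every column of X, including
# the constant column, so their sum is (numerically) zero.
print(abs(residuals.sum()))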

      Reply
  25. Mark says

    January 30, 2019 at 10:59 am

    Hi Jim,
    Thanks for your reply. I guess that you need to have data or have some idea of what the relationship is likely to be before you propose a model of any order or decide to apply transformations. So taking a stab at order=3 just because it can fit somewhat complex curves isn’t a sensible approach : )

    Mark

    Reply
    • Jim Frost says

      January 30, 2019 at 11:16 am

      Hi Mark,

      Right, it’s not an approach I’d recommend. Graph your data, consider theory, fit the model that makes sense, check residual plots, and make adjustments as needed. The thing with higher-order polynomials is that they’re very good at fitting noise!

      Reply
  26. Mark says

    January 30, 2019 at 6:18 am

    Hi Jim, Without knowing the true relationship between y and x, is there a minimum polynomial order that is a go-to? For example, if there is curvature, then a model of order 1, e.g., y = a0 + a1x, wouldn’t be a good fit. A model y = a0 + a1x + a2x^2 would be better but wouldn’t be a good fit if there were both a minimum and a maximum present. Then a model of order 3, e.g., y = a0 + a1x + a2x^2 + a3x^3, would be better. My feeling is that order = 3 is the minimum order required to fit “wiggly data,” and so order = 4 would be a safe bet. If the relationship between y and x has many hills and valleys (not in a regular sinusoidal way), then maybe an order higher than 4 would be required, but usually relationships are smooth and continuous (though sometimes sharp discontinuities occur, e.g., if x = Twater = 0 degC, where maybe even a 4th order polynomial would not approximate the data well). In the case I’m considering, I don’t think there are any sharp discontinuities, so I’m thinking that 4th order is a good “go-to” choice for polynomial order. If I were to use a 10th order polynomial (to maximize R^2), I suspect I’d just be fitting the noise rather than the underlying true relationship. Just wondered if you’d agree with my assessment of a 4th degree polynomial being a good “go-to” choice? Thanks, Mark

    Reply
    • Jim Frost says

      January 30, 2019 at 10:07 am

      Hi Mark,

      As a general rule of thumb for most analysts in most situations? No, I’d say that a 4th order polynomial is far too high. Depending on the nature of your data and sample size, you’d either be overfitting the data or just have a number of terms that are not significant–which can reduce the precision of your model.

      I’d think about it from the opposite direction. Start with graphing your data and subject-area knowledge to get an idea of what curvature you need to fit. That will hopefully make it clear right there. If unsure, I’d start with a lower-order polynomial, and then check the residual plots. If necessary, you can increase the model order based on the residual plots.

      In practice, I’ve never seen a 4th order polynomial, or even a 3rd order. What I have seen is that by the time a 3rd order would’ve been called for based on the number of bends, a nonlinear model actually fit better. That’s not necessarily a general rule for how it works, just what I’ve seen. I’m sure that varies by subject area. However, you’ll need to use subject knowledge and theory to guide you. If there is a specific reason why a 4th order polynomial or higher makes theoretical sense, it could well be justified.

      As a counterexample, in the fitted line plot in this post with the cubic model, there’s just no theoretical reason why the rankings would first increase, then decline, and then increase again as approval increases. It looks like it provides a good fit in the plot, but it doesn’t make theoretical sense. The predicted R-squared also makes it clear that it’s not a good model. The cubic terms just force the model to play connect the dots.

      In order to start with a 4th order polynomial, you’d need a good reason for why the model calls for that. In other words, an explanation for why there should be three bends in the data. That’s not going to be the normal case for most analysts. However, again, use your subject-area knowledge to make this call. Maybe it’s appropriate for what you’re studying?

      If you go that route, be sure to check predicted R-squared. Also note that for each term in your model you should have 10-15 observations at a minimum. With a 4th order model, you should have 40-60. With a 10th order model, you’d need 100-150! That’s assuming there are no terms in the model other than the individual predictor and its higher-order terms.

      I hope this helps!
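      To make the predicted R-squared idea concrete, here is a rough sketch (my own illustration with simulated data, not from the post) that computes it from leave-one-out prediction errors (the PRESS statistic). When the true relationship is linear, higher-order polynomials chase the noise, and the predicted R-squared falls off:

```python
import numpy as np

def predicted_r_squared(x, y, degree):
    """Predicted R-squared from leave-one-out prediction errors (PRESS)."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        coef = np.polyfit(x[mask], y[mask], degree)    # fit without point i
        press += (y[i] - np.polyval(coef, x[i])) ** 2  # predict held-out point
    return 1 - press / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
y = 2 + 5 * x + rng.normal(0, 1, 20)  # the true relationship is linear

# Higher-order terms fit the noise, so predicted R-squared deteriorates.
for d in (1, 4, 8):
    print(d, round(predicted_r_squared(x, y, d), 3))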

      Reply
  27. Vansh says

    November 15, 2018 at 9:59 am

    Thanks..
    Great explanation..

    Reply
  28. Amir says

    November 11, 2018 at 3:45 am

    Hi Jim,

    Excellent explanation! You nailed it! I have a question, though. Could overfitting affect the size of the coefficients, such as making them larger than usual? I have a probit model with a fairly big number of observations (i.e., 4,000) and a couple of interaction terms. When I estimate the model, I get an R-squared of 0.67, but only two interactions are significant, with coefficients greater than 10, while the coefficients of both main effects are less than 0.5. This has made me really concerned about the model.

    Reply
    • Jim Frost says

      November 12, 2018 at 12:26 am

      Hi Amir,

      Yes, overfitting can do all sorts of strange things including affecting the size of the coefficients. However, having interaction coefficients that are larger than the main effect coefficients isn’t necessarily a problem. In fact, it can be OK to have main effects = 0 and large interaction effects. It all depends on the subject-area. By itself, that’s no reason to be concerned. With 4000 observations, you’d have to have a very complex model to be overfitting your model. I doubt that’s happening. Check your residual plots to make sure they look good and graph the interaction effects to determine whether they make sense using your subject-area knowledge.

      Best of luck with your analysis!

      Reply
  29. Dave says

    August 22, 2018 at 8:13 am

    Jim, Awesome stuff! I’ve been an algorithm developer for 20 years using mostly neural networks. I really appreciate your posts on over-fitting. I just wanted to make sure I understand your rule of thumb for observations. So if I have independent variables k and j and k*j in my model then that would count as 3 terms and I should have at least 30 (3×10) observations to develop the model. Is that correct?

    Reply
    • Jim Frost says

      August 23, 2018 at 2:22 am

      Hi Dave,

      Yes, that’s absolutely correct for OLS. You should have at least 30 observations. Other forms of regression analysis can have different requirements. If you have weak effects, you might need even more to detect them.

      Reply
  30. Boris Droz says

    August 19, 2018 at 11:37 am

    Hi Jim,

    Thanks for your website, which is very helpful for a non-statistician such as me. You provide good “cooking recipes,” if I can use that term.
    I have a question: I did exactly what you did to detect overfitting (comparing the model R2 and the cross-validated R2), and I have seen this procedure a couple of times in different papers. But I am struggling to find the threshold values separating the best-case scenario (difference = 0), an acceptable scenario (maybe up to 0.2), slight overfitting, and outright overfitting.
    Do such thresholds exist?

    Thank you in advance for your answer

    Boris

    Reply
    • Jim Frost says

      August 23, 2018 at 1:48 am

      Hi Boris,

      It’s difficult to come up with a specific value. I’m sure if you ask 5 different statisticians, they’d give you 5 different answers. However, I’d agree that once you get to a difference of 0.2, you should definitely start wondering and looking into it as a potential problem.

      Reply
  31. Rohit says

    May 14, 2018 at 11:52 am

    Thanks Jim. Your work is really increasing my understanding of statistics.

    Reply
    • Jim Frost says

      May 14, 2018 at 11:54 am

      Thank you, Rohit! I’m glad you’ve found my blog to be helpful!

      Reply
  32. reet khatri says

    February 26, 2018 at 3:50 am

    This is so easy to understand, thank you!

    Reply
    • Jim Frost says

      February 26, 2018 at 10:01 am

      Hi Reet, you’re very welcome! I’m happy to hear that you found it to be helpful!

      Reply
  33. Ed says

    January 25, 2018 at 9:57 am

    I’ve been asked to write a proof for why the number of regressors K cannot exceed N. I understand the intuition but need some help proving it mathematically.

    Reply
    • Jim Frost says

      January 25, 2018 at 11:24 am

      Hi Ed, here’s a pointer in the right direction. When the number of parameters = N, there are zero error degrees of freedom. Note that the parameters include the constant. So, if you have five observations, you can estimate the parameters for the constant and four predictors.
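      As a small illustration of the zero-degrees-of-freedom point (my own sketch with simulated data, not part of the original reply): with as many parameters as observations, the fit passes through every point exactly, even when y is pure noise.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.arange(5.0)
y = rng.normal(size=5)  # pure noise -- there is nothing real to model

# A degree-4 polynomial has 5 parameters (constant + 4 terms) for 5 points,
# leaving zero error degrees of freedom: the "fit" interpolates exactly.
coef = np.polyfit(x, y, 4)
residuals = y - np.polyval(coef, x)
print(np.allclose(residuals, 0))  # True: a perfect fit that explains nothing
```

With a sixth observation and the same five parameters, there would be one error degree of freedom and the residuals would no longer all be zero, which is why estimation requires N to exceed the number of parameters.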

      Reply
  34. Md Rabiul Kabir says

    October 6, 2017 at 10:38 pm

    Very helpful site

    Reply
    • Jim Frost says

      October 6, 2017 at 10:51 pm

      Thank you! I’m glad you found it to be helpful!

      Reply
      • Ramskrishna says

        October 9, 2017 at 11:43 am

        Wonderful job thank you

        Reply
        • Jim Frost says

          October 9, 2017 at 12:08 pm

          Thanks so much for your kind comment. It made my day!

          Reply

    Copyright © 2023 · Jim Frost · Privacy Policy