Data mining and regression seem to go together naturally. I’ve described regression as a seductive analysis because it is so tempting and so easy to add more variables in the pursuit of a larger R-squared. In this post, I’ll begin by illustrating the problems that data mining creates. To do this, I’ll show how data mining with regression analysis can take randomly generated data and produce a misleading model that appears to have significant variables and a good R-squared. Then, I’ll explain how data mining creates these deceptive results and how to avoid them.
When you think of data mining, you probably think of big data. However, it can also occur on the scale of a smaller research study, where it’s often referred to as data dredging or a fishing expedition. In fact, data mining problems can be more pronounced when you’re using smaller data sets, and that’s the context I’m writing about.
Data mining is the process of exploring a data set and allowing the patterns in the sample to suggest the correct model rather than being guided by theory. This process is easy because you can quickly test numerous combinations of independent variables to uncover statistically significant relationships. In fact, automated model building procedures, such as stepwise and best subsets regression, can fit thousands of models quickly. You can continue adding statistically significant variables as you find them, and R-squared always increases.
Over the years, I’ve heard numerous comments about how it makes sense to look at many different variables, their interactions, and polynomials in all sorts of combinations. After all, if you end up with a model that is full of statistically significant variables, a high R-squared, and good looking residual plots, what can possibly be wrong? That’s exactly what I’m going to show you! Data mining or dredging is a form of p-hacking.
Related post: What is P-Hacking: Methods & Best Practices
Regression Example that Illustrates the Problems of Data Mining
The first thing I want to show is the severity of the problems. That way, if you use this approach, you understand the potential problems. Luckily, it’s easy to demonstrate because data mining can find statistically significant correlations in data that are randomly generated. Data mining can take a set of randomly generated independent variables and use them to explain the majority of the variation in a randomly generated dependent variable.
For this demonstration, I’ve created 101 columns of data, and each one contains 30 rows of entirely random data. The first column (C1) will be the dependent variable, and the other 100 columns are potential independent variables. I’ll use stepwise regression to pick the model. Here is the CSV data file: Random_data.
This scenario forces the procedure to dredge through 100 models just to pick the first variable, and then repeat that for the next variables. That’s a lot of models to fit! We’ll talk more about that later because it’s a defining characteristic of data mining.
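If you want to recreate a similar setup yourself, here is a minimal sketch in Python. The random seed, the alpha-to-enter threshold, and the cap on the model size are illustrative assumptions rather than the exact settings behind the results below.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# 30 rows of purely random data: C1 is the dependent variable,
# C2 through C101 are the 100 candidate predictors.
rng = np.random.default_rng(1)  # arbitrary seed
data = pd.DataFrame(rng.normal(size=(30, 101)),
                    columns=[f"C{i}" for i in range(1, 102)])
y = data["C1"]
candidates = list(data.columns[1:])

# Simple forward selection: at each step, add the candidate with the
# smallest p-value, as long as it falls below the alpha-to-enter level.
selected, alpha_enter, max_terms = [], 0.15, 3
while candidates and len(selected) < max_terms:
    pvals = {}
    for var in candidates:
        X = sm.add_constant(data[selected + [var]])
        pvals[var] = sm.OLS(y, X).fit().pvalues[var]
    best = min(pvals, key=pvals.get)
    if pvals[best] > alpha_enter:
        break
    selected.append(best)
    candidates.remove(best)

print(selected)  # even purely random data yields "significant" picks
```

Even this stripped-down version fits 100 candidate models for the first step alone, which is exactly the dredging behavior described above.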
Using Stepwise Regression on Random Data
Initially, the stepwise procedure adds 28 independent variables to the model, which explains 100% of the variance! Because we have a sample size of only 30, we’re obviously overfitting the model. Overfitting a model is a different issue that also inflates R-squared.
Related post: Five Reasons Why Your R-squared can be Too High
In this post, I want to address only the problems related to data mining, so I’ll reduce the number of independent variables to avoid an overfit model. A good rule of thumb is to include a maximum of one variable for every 10 observations. With 30 observations, I’ll include only the first three variables that stepwise regression picks: C35, C28, and C87. The stepwise regression output for the first three variables is below.
In step three, the coefficient P values are all statistically significant. The R-squared of 61.38% can be considered either strong or moderate depending on the field of study. However, for random data, it’s unbelievable—literally! In actual research, you’re likely to have some real effects mixed in, which can produce an even higher R-squared.
Neither the adjusted R-squared nor the predicted R-squared indicates any problems. In fact, all three R-squared values increase with each additional term. That’s what you want to see. The residual plots look good (not shown).
Just to be sure, let’s graph the relationship between an independent variable (C35) and the dependent variable (C1). We’ll see if it looks like a real relationship. Seeing is believing!
This plot looks good. The graph shows that as C35 increases, the dependent variable (C1) tends to decrease, which is consistent with the negative coefficient in the output. The data sure look like they follow a real relationship. If we didn’t know that the data are random, we’d think it’s a great model!
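If you’d like to verify this yourself, here is a minimal sketch that fits the reduced three-variable model and draws the scatterplot. It assumes you’ve saved the Random_data CSV locally with column headers C1 through C101; adjust the file name to match your copy.

```python
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

# Assumes the random data set from above, with headers C1 through C101.
df = pd.read_csv("Random_data.csv")

# Fit the reduced model with only the first three stepwise picks.
model = smf.ols("C1 ~ C35 + C28 + C87", data=df).fit()
print(model.summary())  # coefficient p-values, R-squared, adjusted R-squared

# Plot one "significant" predictor against the dependent variable.
plt.scatter(df["C35"], df["C1"])
plt.xlabel("C35")
plt.ylabel("C1")
plt.title("A chance correlation in purely random data")
plt.show()
```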
Lessons Learned from the Data Mining Example
The example above shows how data mining symptoms can be hard to detect. There are no visible signs of problems even though all of the results are deceptive. The statistical output and chart look great. Unfortunately, these results don’t reflect actual relationships but instead represent chance correlations that are guaranteed to occur with enough opportunities.
In the introduction, I asked, “What can possibly be wrong?” Now you know: everything can be wrong! The regression model suggests that random data can explain other random data, which is impossible. If you didn’t already know that there are no actual relationships between these variables, these results would lead you to completely inaccurate conclusions. Additionally, this model’s ability to predict new observations is zero, despite what the predicted R-squared suggests.
The problems are real. Now, let’s move on to explaining how they happen and how to avoid them.
How Data Mining Causes These Problems
For all hypothesis tests, including tests for regression coefficients, there is always the chance of rejecting a null hypothesis that is actually true (Type I error). This error rate equals your significance level, which is often 5%. In other words, in cases where the null hypothesis is correct, you’ll have false positives 5% of the time.
A false positive in this context indicates that you have a statistically significant P value, but no effect or relationship exists in the population. These false positives occur due to chance patterns in the sample data that are not present in the population. The more hypothesis tests you perform, the greater your probability of encountering false positives.
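To put a rough number on it, if the tests were independent (a simplifying assumption), the chance of at least one false positive among k tests at a 5% significance level is 1 - (1 - 0.05)^k. A quick calculation shows how fast that grows:

```python
# Probability of at least one false positive among k independent tests
# conducted at a 5% significance level.
alpha = 0.05
for k in (1, 10, 100, 1000):
    print(k, round(1 - (1 - alpha) ** k, 3))
# Roughly: 1 -> 0.05, 10 -> 0.40, 100 -> 0.99, 1000 -> essentially 1
```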
Related post: How Hypothesis Tests Work: Significance Levels and P values
Let’s apply these concepts to data mining with regression analysis. When you fit many models with different combinations of variables, you are performing many hypothesis tests. In fact, if you use an automated procedure like stepwise or best subsets regression, you are performing hundreds if not thousands of hypothesis tests on the same data.
With this many tests, you will inevitably find variables that appear to be significant but are actually false positives. If you are guided mainly by statistical significance, you’ll keep these variables in the model, and it will fill up with false positives.
That’s precisely what occurred in our example. We had 100 candidate independent variables, and stepwise regression scoured through hundreds of potential models to find these chance correlations.
Next, I’ll explain how you can specify your model without using data mining and avoid these problems.
Let Theory Guide You and Avoid Data Mining
Don’t get me wrong. Data mining can help build a regression model in the exploratory stage, particularly when there isn’t much theory to guide you. However, if you use data mining as the primary way to specify your model, you are likely to experience some problems. You should perform a confirmation study using a new dataset to verify data mining results. There can be costly consequences if you don’t. Imagine if we made decisions based on the example model!
Instead of data mining, use theory to guide you while fitting models and evaluating results. This approach reduces the number of models that you need to fit. Additionally, you can evaluate the model’s properties using subject-area considerations.
The best practice is to develop an understanding of the relevant independent variables, their relationships with the dependent variable, and the expected coefficient signs and effect magnitudes before you begin collecting data. This method requires that you review the subject-area literature and similar studies.
This advance research allows you to:
- Collect the correct data in the first place.
- Specify a good model without data mining.
- Compare your results to theory.
Using statistics in a scientific study requires a lot of planning. For more information about this process, read 5 Steps for Conducting Scientific Studies with Statistical Analyses.
Never make a decision about including a variable in the model based on statistical significance alone. If there are discrepancies between the results and theory, be sure to investigate. Either explain the discrepancy or alter your model. For instance, compare the coefficient signs in your results to those that theory predicts. And, compare your R-squared to those from similar studies.
Related posts: Model Specification: Choosing the Correct Regression Model and Five Regression Analysis Tips to Avoid Common Problems
In conclusion, you want to develop knowledge that can guide you rather than relying on automated procedures to build your model. After all, it’s unreasonable to expect simple algorithms based on statistical significance to model the complex world better than a subject-area expert can. Use your smarts before brute force!
Hello Jim.
Dr. Karemera of South Carolina State University wants me to run a “System GMM” on data on which I have already run several other types of regressions.
I have told him that the System GMM is unnecessary. I told him this because the correlations of the independent variables with the residuals are all very close to zero.
However, he insists that I run the system GMM.
I have found some websites and a PDF that explain, in the language of regression, how to do a GMM regression.
However, I don’t fully understand the notations.
Is it possible that you can describe, in simple algebraic language, the step by step process of running a “system GMM” regression?
Great post sir
Hi Jim,
I have one question regarding multiple regression. I have done the coding in Python, where I have three independent variables (X, Y, Z) and one dependent variable. Two of my independent variables are categorical, so I have used dummy variables. I am confused about the p-values. For example, if one of the categorical variables, X, has three values a, b, and c, then the regression model shows three p-values (one for a, one for b, and one for c). How do I get the p-value for my categorical independent variable X as a whole, since I am looking at the association of all of my independent variables with the dependent variable? And how do I interpret the p-values of my three independent variables (X, Y, Z)? Please reply.
Thank you.
Hi Rabia,
In my regression analysis ebook, I cover both the coding and testing in more detail. You might want to check that out.
Here’s a summary. Yes, you need a series of binary/indicator variables for categorical variables as you describe. However, you have to leave one level out as a baseline. For your example with three levels for categorical variable X, you’d include two of them, and the third level is the baseline. You can use any level for your baseline and get consistent results, but there’s often a level that makes the most sense for your analysis. Even if Python doesn’t give you an error, you shouldn’t include all three levels because doing so creates perfect multicollinearity and distorts the results.
You get a p-value for each level, which tests the difference between that level and the baseline. These p-values are based on t-tests.
If you want to determine the significance for the entire categorical variable (e.g., Variable X in your example), you’ll need to use an F-test. T-tests can only test the significance of one model term. However, F-tests can test the significance between models when more than one term is different. For your case, the F-test would assess the model with and without Variable X. And, of course, X is actually two binary variables using your example. That test assesses the joint significance of those two binary variables together, which allows you to say that X as a whole is significant or not.
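If it helps to see that comparison concretely, here is a minimal sketch in Python with statsmodels. The file name, the outcome name, and the variable names are placeholders for your own data.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Placeholder data: X is categorical with levels a/b/c; Y and Z are numeric.
df = pd.read_csv("your_data.csv")

# The formula interface dummy-codes C(X) automatically and drops one
# level as the baseline, so you avoid including all three levels.
full = smf.ols("Outcome ~ C(X) + Y + Z", data=df).fit()
reduced = smf.ols("Outcome ~ Y + Z", data=df).fit()

# The t-tests in full.summary() compare each level of X to the baseline.
# The F-test below compares the models with and without X, which assesses
# the joint significance of X as a whole.
print(anova_lm(reduced, full))
```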
I hope that helps!
Hello Jim, you say: “Specify a good model without data mining” … but how?
Hi Lucas,
Thanks for the great question!
I talk about how in the last section of this post, “Let Theory Guide You . . .” I also explain more in my post about Choosing the Correct Regression Model. Focus on those to learn more.
The point is to avoid going purely by statistical measures whenever possible. In exploratory studies where there isn’t theory or other research to help guide your efforts, you might need to rely on purely statistical considerations. In those cases, just be aware of the potential problems and understand that additional studies are needed to confirm or refute the original results.
Hi all
Thought to share the following with you. I just came across a reference that mentions the rule-of-thumb of 10 observations per predictor. In fact, the source suggests having at least 10 observations per predictor *degree of freedom*. Not too surprisingly, model complexity affects the required number of observations. The second reference is a good and often-cited tutorial for biostatistics.
Harrell et al. 1984. “Regression modelling strategies for improved prognostic prediction”
Harrell et al. 1996. “Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors.”
Hi Norman,
Thanks so much for sharing the references! That recommendation is consistent with what I present in my post about how to avoid overfitting your model. There is an additional reference in that post as well.
Hi Jim. Your response was very helpful, and quick. Thank you very much!
Regarding my first question (Bonferroni correction): true, I never saw this in the context of model selection either. The notion of phantom DFs is new to me. I will give it a brief literature review to see what is out there and how to deal with this. And for questions 2) and 3), you seem to confirm my view. Thanks for the hints and clarifications.
Best wishes
Norman
Hi Norman,
You’re very welcome! As for phantom degrees of freedom, the best answer is to go in armed with knowledge so you can limit the number of models you need to fit!
Hi Jim. Thanks for these very useful and intuitive posts on regression and related topics. I am consulting your articles in preparation for a statistical analysis I need to perform for my studies.
In the context of model selection and the possible pitfalls, I came across alternative approaches to avoid the type of problems you illustrated in this post. I am not entirely sure if they apply, though.
1) Wouldn’t it be useful to simply compensate for multiple testing, for instance via Bonferroni correction? In this approach, the significance level is scaled with the number of separate tests in order to reduce the risk of making type I errors.
2) How about cross-validation? Several methods exist, I guess. A common one is to split the data into training and validation sets. The model selection procedure is executed using the training data, whereas the predictive performance is measured with the validation data. The complete procedure is repeated multiple times, and finally, models are selected only if they have been observed consistently in a majority of trials.
3) An obvious way to avoid the problem is to collect further data, or at least to assess the critical number of observations required prior to performing a statistical analysis. From courses on statistics and machine learning, I remember that such approaches exist (though they are sometimes difficult to perform). Is this the case for regression too?
And a last question: The rule-of-thumb that you mentioned (1 predictor per 10 observations) is new to me. Has this been examined analytically in a study that I can cite?
Many thanks for your response
Hi Norman, I’m so happy to hear that my blog posts are helpful! You ask some great questions, so let’s dive into them one-by-one!
1) I’ve heard of using Bonferroni corrections for post hoc analyses, but I haven’t heard of using them for model selection. That’s not to say it hasn’t been done; I’m just not familiar with that approach. What I have heard of in the data mining context is the notion of “phantom degrees of freedom.” These are degrees of freedom that you use up when you try out different models, but they are phantoms because they don’t show up in the DF for your final model. What you’d need is software that keeps track of all the different versions of the models you’ve tried (adding and subtracting predictors, trying different combinations) and then reports the cumulative degrees of freedom you have used. You’re performing all of these various tests, but the cumulative DF is not usually tracked. Typically, the software displays the DF for only the final model. I don’t know if there is software out there that does this or not!
2) I think cross-validation is very important. And, more generally, replication is crucial. If you can afford to set aside a portion of your data to cross-validate, that is fantastic. If you’re in a scenario where you are making a decision based on the results, I think cross-validation and/or separate validation runs at the settings you choose are essential. If it’s more scientific research with no immediate applied application, cross-validation is still valuable, or at least keep in mind that replication by other researchers is still required. I never believe that any individual study is sufficient to prove anything. It’s the accumulation of replicated results that is important, and I’d include cross-validation as a form of replication. In this post, I mention performing a confirmation study for these reasons! At some point, I need to write a post about cross-validation studies! (There’s a short sketch of the basic idea at the end of this reply.)
3) There are power and sample size procedures for various hypothesis tests. However, with regression analysis, that type of analysis is much more complicated because you can have different numbers of predictors, continuous and/or categorical variables, interaction terms, and polynomials, among other model variants! The best approach I’ve seen is to do your background research first, build up a list of potential predictors, and develop an idea of how complex the model will be. Then, you can let that guide you to the minimum sample size that is required. This approach also helps you avoid trying many different models because you already have some ideas before starting. I write about this approach in my post about overfitting models. That post also contains the reference for the rule-of-thumb that you’re asking about. I highly recommend that article!
That all said, sometimes you’re in the exploratory stages of a new research area, and there’s not much existing information to guide you. In that case, you may have to wing it and go by educated guesses. But if information exists that can guide your study, it’s always worthwhile to find it.
Best of luck with your analysis!
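P.S. For anyone who wants a concrete starting point for cross-validation, here is a minimal sketch in Python with scikit-learn. The generated data and the plain linear model are placeholders, and it only illustrates the basic fit-and-score idea rather than Norman’s full repeated model-selection scheme.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data; substitute your own predictors (X) and response (y).
X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)

# 5-fold cross-validation: the model is repeatedly fit on part of the
# data and scored (R-squared by default) on the held-out portion.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores, scores.mean())
```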
You say data mining works better with large data sets. Do you have any guidance on how many observations you should have to avoid this problem in data mining?
For starters see the work of Ewout Steyerberg and Karel Moons.