Data mining and regression seem to go together naturally. I’ve described regression as a seductive analysis because it is so tempting and so easy to add more variables in the pursuit of a larger R-squared. In this post, I’ll begin by illustrating the problems that data mining creates. To do this, I’ll show how data mining with regression analysis can take randomly generated data and produce a misleading model that appears to have significant variables and a good R-squared. Then, I’ll explain how data mining creates these deceptive results and how to avoid them.
When you think of data mining, you tend to think of big data. However, it can also occur on the scale of a smaller research study, where it’s often referred to as data dredging or a fishing expedition. In fact, data mining problems can be more pronounced when you’re using smaller data sets. That’s the context that I’m writing about.
Data mining is the process of exploring a data set and allowing the patterns in the sample to suggest the correct model rather than being guided by theory. This process is easy because you can quickly test numerous combinations of independent variables to uncover statistically significant relationships. In fact, automated model building procedures, such as stepwise and best subsets regression, can fit thousands of models quickly. You can continue adding statistically significant variables as you find them, and R-squared always increases.
Over the years, I’ve heard numerous comments about how it makes sense to look at many different variables, their interactions, and polynomials in all sorts of combinations. After all, if you end up with a model that is full of statistically significant variables, a high R-squared, and good looking residual plots, what can possibly be wrong? That’s exactly what I’m going to show you! Data mining or dredging is a form of p-hacking.
Learn more about What is P-Hacking: Methods & Best Practices.
Regression Example that Illustrates the Problems of Data Mining
The first thing I want to show is the severity of the problems. That way, if you use this approach, you understand the potential problems. Luckily, it’s easy to demonstrate because data mining can find statistically significant correlations in data that are randomly generated. Data mining can take a set of randomly generated independent variables and use them to explain the majority of the variation in a randomly generated dependent variable.
For this demonstration, I’ve created 101 columns of data, and each one contains 30 rows of entirely random data. The first column (C1) will be the dependent variable, and the other 100 columns are potential independent variables. I’ll use stepwise regression to pick the model. Here is the CSV data file: Random_data.
This scenario forces the procedure to dredge through 100 models just to pick the first variable, and then repeat the search for each subsequent variable. That’s a lot of models to fit! We’ll talk more about that later because it’s a defining characteristic of data mining.
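To get a feel for that first pass, here is a minimal Python sketch. It does not use the Random_data file; it assumes standard normal draws from NumPy with an arbitrary seed, so the column it picks will differ from the one in this post. It correlates each of the 100 candidate columns with the dependent variable and reports the strongest one.

```python
import numpy as np

# Assumption: standard normal random data with an arbitrary seed,
# standing in for the post's data file (30 rows, 101 columns).
rng = np.random.default_rng(0)
data = rng.standard_normal((30, 101))
y = data[:, 0]   # C1, the dependent variable
X = data[:, 1:]  # C2..C101, the candidate independent variables

# Pearson correlation of every candidate column with y.
yc = y - y.mean()
Xc = X - X.mean(axis=0)
r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

# The "winner" of the first pass is simply the largest chance correlation.
best = int(np.argmax(np.abs(r)))
print(f"best candidate: C{best + 2}, r = {r[best]:.3f}, "
      f"R-squared = {r[best] ** 2:.3f}")
```

With only 30 rows, the strongest of 100 unrelated columns will usually show a correlation that looks respectable on its own, which is the starting point for everything that follows.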
Using Stepwise Regression on Random Data
Initially, the stepwise procedure adds 28 independent variables to the model, which explains 100% of the variance! Because we have a sample size of only 30, we’re obviously overfitting the model. Overfitting a model is a different issue that also inflates R-squared.
Related post: Five Reasons Why Your R-squared can be Too High
In this post, I want to address only the problems related to data mining, so I’ll reduce the number of independent variables to avoid an overfit model. A good rule of thumb is to include a maximum of one variable for every 10 observations. With 30 observations, I’ll include only the first three variables that stepwise regression picks: C35, C28, and C87. The stepwise regression output for the first three variables is below.
In step three, the coefficient P values are all statistically significant. The R-squared of 61.38% can be considered either strong or moderate depending on the field of study. However, for random data, it’s unbelievable—literally! In actual research, you’re likely to have some real effects mixed in, which can produce an even higher R-squared.
Neither the adjusted R-squared nor the predicted R-squared indicates any problems. In fact, all three R-squared values increase with each additional term. That’s what you want to see. The residual plots look good (not shown).
Just to be sure, let’s graph the relationship between an independent variable (C35) and the dependent variable (C1). We’ll see if it looks like a real relationship. Seeing is believing!
This plot looks good. The graph shows that as C35 increases, the dependent variable (C1) tends to decrease, which is consistent with the negative coefficient in the output. The data sure look like they follow a real relationship. If we didn’t know that the data are random, we’d think it’s a great model!
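If you want to try a similar experiment yourself, the sketch below is one way to do it in Python. It is a simplified greedy forward selection, not the stepwise procedure used for the output discussed above: it has no alpha-to-enter or alpha-to-remove thresholds and just adds whichever column raises R-squared the most. The helper names `r_squared` and `forward_select`, the seed, and the normally distributed data are all assumptions for illustration, so the selected columns and R-squared values will not match the ones in this post.

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def forward_select(X, y, n_terms):
    """Greedy forward selection: each step adds the column that raises R-squared most."""
    selected, path = [], []
    for _ in range(n_terms):
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        scores = [r_squared(X[:, selected + [j]], y) for j in remaining]
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        path.append(max(scores))
    return selected, path

# Assumption: 30 rows of pure noise, one response and 100 candidates.
rng = np.random.default_rng(1)
data = rng.standard_normal((30, 101))
y, X = data[:, 0], data[:, 1:]

selected, path = forward_select(X, y, n_terms=3)
for step, (j, r2) in enumerate(zip(selected, path), start=1):
    print(f"step {step}: add C{j + 2}, R-squared = {r2:.3f}")
```

Stopping at three terms mirrors the one-variable-per-10-observations rule of thumb. Even so, the R-squared climbs at every step although nothing in the data is related to anything else.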
Lessons Learned from the Data Mining Example
The example above shows how data mining symptoms can be hard to detect. There are no visible signs of problems even though all of the results are deceptive. The statistical output and chart look great. Unfortunately, these results don’t reflect actual relationships but instead represent chance correlations that are guaranteed to occur with enough opportunities.
In the introduction, I asked, “What can possibly be wrong?” Now you know—everything can be wrong! The regression model suggests that random data can explain other random data, which is impossible. If you didn’t already know that there are no actual relationships between these variables, these results would lead you to completely inaccurate conclusions. Additionally, the capability of this model to predict new observations is zero despite the predicted R-squared.
The problems are real. Now, let’s move on to explaining how they happen and how to avoid them.
How Data Mining Causes these Problems
For all hypothesis tests, including tests for regression coefficients, there is always the chance of rejecting a null hypothesis that is actually true (Type I error). This error rate equals your significance level, which is often 5%. In other words, in cases where the null hypothesis is correct, you’ll have false positives 5% of the time.
A false positive in this context indicates that you have a statistically significant P value, but no effect or relationship exists in the population. These false positives occur due to chance patterns in the sample data that are not present in the population. The more hypothesis tests you perform, the greater your probability of encountering false positives.
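A quick calculation shows how fast that probability grows. As a simplifying assumption, treat the m tests as independent, each run at significance level α with a true null hypothesis. The tests inside a model selection procedure are not strictly independent, so read this as a rough guide rather than an exact figure.

```latex
\[
P(\text{at least one false positive}) = 1 - (1 - \alpha)^{m}
\]
\[
\alpha = 0.05:\qquad
m = 10:\; 1 - 0.95^{10} \approx 0.40,
\qquad
m = 100:\; 1 - 0.95^{100} \approx 0.994
\]
```

With 100 tests at the 5% level, a false positive is close to certain.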
Let’s apply these concepts to data mining with regression analysis. When you fit many models with different combinations of variables, you are performing many hypothesis tests. In fact, if you use an automated procedure like stepwise or best subsets regression, you are performing hundreds if not thousands of hypothesis tests on the same data.
With this many tests, you will inevitably find variables that appear to be significant but are actually false positives. If you are guided mainly by statistical significance, you’ll keep these variables in the model, and it will fill up with false positives.
That’s precisely what occurred in our example. We had 100 candidate independent variables and stepwise regression scoured through hundreds of potential models to find the chance correlations.
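A small simulation makes the same point. This sketch assumes normally distributed random data and an arbitrary seed. It repeatedly generates a data set shaped like the example (30 rows, one dependent variable, 100 candidates), tests each candidate on its own at the 5% level, and counts the false positives. The test is the usual t-test for a correlation, written as a cutoff on |r| using the two-sided 5% critical t value for 28 degrees of freedom.

```python
import numpy as np

N_ROWS, N_CANDIDATES, N_SIMS = 30, 100, 2000
T_CRIT = 2.048  # two-sided 5% critical t value for N_ROWS - 2 = 28 df
# |t| > T_CRIT is equivalent to |r| > R_CRIT for a simple correlation test.
R_CRIT = T_CRIT / np.sqrt(T_CRIT ** 2 + (N_ROWS - 2))

# Assumption: every column is independent standard normal noise.
rng = np.random.default_rng(2)
data = rng.standard_normal((N_SIMS, N_ROWS, N_CANDIDATES + 1))
data -= data.mean(axis=1, keepdims=True)             # center each column
data /= np.linalg.norm(data, axis=1, keepdims=True)  # scale to unit length

y, X = data[:, :, :1], data[:, :, 1:]
r = np.sum(X * y, axis=1)  # correlations, shape (N_SIMS, N_CANDIDATES)

false_positives = (np.abs(r) > R_CRIT).sum(axis=1)
print(f"critical |r| for n = {N_ROWS}: {R_CRIT:.3f}")
print(f"mean false positives per data set: {false_positives.mean():.2f}")
print(f"share of data sets with at least one: {(false_positives > 0).mean():.3f}")
```

On average, about 5 of the 100 unrelated candidates clear the significance threshold in each data set, so a procedure that hunts for significant variables will nearly always find some.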
Next, I’ll explain how you can specify your model without using data mining and avoid these problems.
Let Theory Guide You and Avoid Data Mining
Don’t get me wrong. Data mining can help build a regression model in the exploratory stage, particularly when there isn’t much theory to guide you. However, if you use data mining as the primary way to specify your model, you are likely to experience some problems. You should perform a confirmation study using a new dataset to verify data mining results. There can be costly consequences if you don’t. Imagine if we made decisions based on the example model!
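Here is a rough sketch of what such a confirmation check can look like, again on simulated noise. The setup is an assumption for illustration: the three candidates with the largest correlations stand in for a mined model, the model is fit on the 30 original rows, and it is then scored on 1,000 freshly generated rows (a large fresh sample just keeps the estimate stable). The helper names `fit_ols` and `r_squared` are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients (intercept first) for y on the columns of X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r_squared(beta, X, y):
    """R-squared of fixed coefficients beta evaluated on the data (X, y)."""
    A = np.column_stack([np.ones(len(y)), X])
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

# Assumption: both samples are pure noise drawn from the same distribution.
rng = np.random.default_rng(3)
train = rng.standard_normal((30, 101))    # the "mined" sample
fresh = rng.standard_normal((1000, 101))  # the confirmation sample

y_train, X_train = train[:, 0], train[:, 1:]
corr = np.array([np.corrcoef(X_train[:, j], y_train)[0, 1] for j in range(100)])
chosen = np.argsort(-np.abs(corr))[:3]    # three strongest chance correlations

beta = fit_ols(X_train[:, chosen], y_train)
r2_in = r_squared(beta, X_train[:, chosen], y_train)
r2_out = r_squared(beta, fresh[:, 1:][:, chosen], fresh[:, 0])

print(f"R-squared on the mined sample:        {r2_in:.3f}")
print(f"R-squared on the confirmation sample: {r2_out:.3f}")
```

An R-squared at or below zero on the new data means the model predicts no better than the mean of the new sample, which is what you should expect from a model built on chance correlations.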
Instead of data mining, use theory to guide you while fitting models and evaluating results. This approach reduces the number of models that you need to fit. Additionally, you can evaluate the model’s properties using subject-area considerations.
The best practice is to develop an understanding of the relevant independent variables, their relationships with the dependent variable, and the expected coefficient signs and effect magnitudes before you begin collecting data. This method requires that you review the subject-area literature and similar studies.
The advance research allows you to:
- Collect the correct data in the first place.
- Specify a good model without data mining.
- Compare your results to theory.
Using statistics in a scientific study requires a lot of planning. For more information about this process, read 5 Steps for Conducting Scientific Studies with Statistical Analyses.
Never make a decision about including a variable in the model based on statistical significance alone. If there are discrepancies between the results and theory, be sure to investigate. Either explain the discrepancy or alter your model. For instance, compare the coefficient signs in your results to those that theory predicts. And, compare your R-squared to those from similar studies.
In conclusion, you want to develop knowledge that can guide you rather than relying on automated procedures to build your model. After all, it’s unreasonable to expect simple algorithms based on statistical significance to model the complex world better than a subject-area expert. Use your smarts before brute force!