Precision in predictive analytics refers to how close the model’s predictions are to the observed values. The more precise the model, the closer the data points are to the predictions. When you have an imprecise model, the observations tend to be further away from the predictions, thereby reducing the usefulness of the predictions. If you have a model that is not sufficiently precise, you risk making costly mistakes!

Regression models are a critical part of predictive analytics. These models can help you make predictions in applied situations. By entering values into the regression equation, you can predict the average outcome. However, predictions are not quite this simple because you need to understand the precision.

In this blog post, I present research that shows how surprisingly easy it is for even statistical experts to make mistakes related to misunderstanding the precision of the predictions. The research shows that how you present regression results influences the probability of making a wrong decision. I’ll show you a variety of potential solutions so you can avoid these traps!

## The Illusion of Predictability

Emre Soyer and Robin M. Hogarth study behavioral decision-making. They found that experts in applied regression analysis frequently make incorrect decisions based on applied regression models because they misinterpret the prediction precision.*

Decision-makers can use regression equations for predictive analytics. However, predictions are not as straightforward as entering numbers into an equation and making a decision based on the particular value of the prediction. Instead, decisions based on regression predictions need to incorporate the margin of error around the predicted value.

Regression predictions are for the *mean* of the dependent variable. If you think of any mean, you know that there is variation around that mean. The same concept applies to the predicted mean of the dependent variable. There is a spread of data points around regression lines. We need to quantify that scatter to know how close the predictions are to the observed values. If the range is too large, the predictions won’t provide useful information.

Soyer and Hogarth conclude that analysts frequently perceive the outcomes to be more predictable than the model justifies. The apparent simplicity of inputting numbers into a regression equation and obtaining a particular prediction frequently deceives the analysts into believing that the value is an exact estimate. It *seems* like the regression equation is giving you the correct answer exactly, but it’s not. Soyer and Hogarth call this phenomenon the illusion of predictability.

I’ll show you this illusion in action, and then present some ways to mitigate its effect.

## Studying How Experts Perceive Prediction Uncertainty

Soyer and Hogarth recruited 257 economists and asked them to assess regression results and use them to make a decision. Many empirical economic studies use regression models, so this is familiar territory for economists.

The researchers displayed the regression output using the most common tabular format that appears in the top economic journals: descriptive statistics, regression coefficients, constant, standard errors, R-squared, and the number of observations. Then, they asked the participants, who were mainly professors in applied economics and econometrics, to make a decision using the model. Here’s an example.

### Use a regression model to make a decision

To be sure that you have a 95% probability of obtaining a positive outcome (Y > 0), what is the minimum value of X that you need?

The regression coefficient is statistically significant at the 95% level, and standard errors are in parentheses.

| Variable | Mean | Std. Dev. |
|---|---|---|
| X | 50.72 | 28.12 |
| Y | 51.11 | 40.78 |

| Statistic | Estimate (standard error) |
|---|---|
| X Coefficient | 1.001 (0.033) |
| Constant | 0.32 (1.92) |
| R-squared | 0.50 |
| N | 1000 |

### The difference between perception and reality

76% of the participants indicated that a very small X (X < 10) is sufficient to ensure a 95% probability of a positive Y.

Let’s work through their logic using the regression equation that you can construct from the information in the table: Y = 0.32 + 1.001X.

If you enter a value of 10 in the equation for X, you obtain a predicted Y of 10.33. This prediction seems sufficiently above zero to virtually assure a positive outcome, right? The predicted value is the average outcome, but it doesn’t factor in the precision of the predictions around the mean.

When you factor in the variability around the average outcome, you find that the correct answer is 47! Unfortunately, only 20% of the experts gave an answer that was near the correct value even though it is possible to solve it mathematically using the information in the table. (These are experts, after all, and I wouldn’t expect most people to be able to solve it mathematically. I’ll cover easier methods below.)
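For readers who want to see the arithmetic, here is a minimal sketch of one way to land near 47 using only the table. It rests on a few assumptions of mine: the residuals are normally distributed, the scatter around the line is approximately SD(Y) × √(1 − R²), and the uncertainty in the coefficient estimates is small enough to ignore (reasonable with N = 1000). The variable names are illustrative, not the researchers’.

```python
from math import sqrt
from statistics import NormalDist

# Values taken from the regression table above
intercept = 0.32
slope = 1.001
sd_y = 40.78
r_squared = 0.50

# Approximate residual standard deviation (the scatter around the line)
resid_sd = sd_y * sqrt(1 - r_squared)

# One-sided 95% critical value from the standard normal distribution
z = NormalDist().inv_cdf(0.95)

# We need: predicted Y - z * resid_sd >= 0. Solve that for X.
min_x = (z * resid_sd - intercept) / slope

# Chance of a positive Y at X = 10 under the same assumptions
pred_at_10 = intercept + slope * 10
p_positive_at_10 = 1 - NormalDist(mu=pred_at_10, sigma=resid_sd).cdf(0)

print(f"Residual SD: {resid_sd:.1f}")
print(f"Minimum X for a 95% chance of Y > 0: {min_x:.1f}")
print(f"P(Y > 0) at X = 10: {p_positive_at_10:.2f}")
```

Under these assumptions, an X of 10 gives only about a 64% chance of a positive outcome, which is a long way from the 95% the question asks for.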

Imagine if an important decision depended on this answer! That’s how costly mistakes can be made!

### Low R-squared values should have warned of low precision

The researchers asked the same question for a model with an R-squared of 25%, and the results were essentially the same. No changes were made in their answers to address the greater uncertainty!

The participants severely overestimated the precision of the regression predictions. Again, this is the illusion of predictability. It’s a psychological phenomenon where the apparent exactness of the regression equation gives the impression that the predictions are more precise than they are in reality. The end result is that a majority of experts severely underestimated the variability, which can lead to expensive mistakes. If the numeric results deceive most applied regression *experts*, imagine how common this mistake must be among less experienced analysts!

I’ve written that a high R-squared value isn’t always critical *except* for when you require precise predictions. In the first model, the R-squared of 50% should have set off alarm bells about imprecise predictions. Even more so for the model with an R-squared of 25%! Later in this post, I’ll show you a different goodness-of-fit statistic that is better than R-squared at evaluating precision.
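To see why those R-squared values should raise concerns, it can help to translate R-squared into scatter. The sketch below uses the approximation that the residual standard deviation is √(1 − R²) times the standard deviation of Y, ignoring the small degrees-of-freedom adjustment. The function name is my own.

```python
from math import sqrt

def scatter_fraction(r_squared: float) -> float:
    """Residual SD as a fraction of SD(Y), ignoring the degrees-of-freedom adjustment."""
    return sqrt(1 - r_squared)

for r2 in (0.25, 0.50, 0.75, 0.90):
    frac = scatter_fraction(r2)
    print(f"R-squared = {r2:.0%}: residual scatter is about {frac:.0%} of SD(Y)")
```

Even at an R-squared of 50%, the scatter around the line is still roughly 71% of the total spread in Y, and you need an R-squared of 75% just to cut that scatter in half.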

## Graph the Model to Highlight the Variability

In the next phase of the experiment, the researchers asked two new groups of experts the same questions about the same models, but they presented the regression results differently. One group saw the results tables with fitted line plots, and the other group saw only the fitted line plots. Fitted line plots display both the data points and the fitted regression line. Surprisingly, the group that saw only the fitted line plots had the largest percentage of correct answers.

The fitted line plot below is for the same R-squared = 50% model that produced the regression results in the tables above.

Among participants who assessed the fitted line plot, only 10% answered with an X < 10, while 66% were close to 47. Look at the graph, and it’s easy to see that at around 47 most of the data points are greater than zero. You can also understand why answers of X < 10 are way off!

The graph brings the imprecision of the predictions to life. You see the variability of the data points around the fitted line.

## Graphs Are Only One Way to Pierce the Illusion of Predictability

I completely agree with Soyer and Hogarth’s call to change how analysts present the results for predictive analytics. I use fitted line plots in my blog posts as often as possible. It’s a fantastic tool that makes regression results more intuitive. Seeing is believing!

However, the scenario that the researchers present is especially favorable to a visual analysis. For a start, there is only one independent variable, which allows you to use a fitted line graph. Furthermore, there are many data points (N = 1000) that are evenly distributed throughout the full range of both variables. Collectively, this situation produces a clearly visible location on the graph where you are unlikely to obtain negative values.

What do you do when you have multiple independent variables and can’t use a fitted line plot? What about models that have interaction and polynomial terms? How about cases where you don’t have such a large amount of nicely arranged data? For these less tidy cases, we must still factor in the real-world variability to understand the precision in predictive analytics. Read on!

## Prediction Intervals Show the Precision to Improve Your Decision-Making

In predictive analytics, a prediction interval is the range where a single new observation is likely to fall given specific values of the independent variables. Narrower prediction intervals represent more precise predictions. Prediction intervals factor in the variability around the mean outcome. Use prediction intervals to determine whether the predictions are sufficiently precise to satisfy your requirements.

Prediction intervals have a confidence level and can be a two-sided range, or be an upper or lower bound. Let’s see how prediction intervals can help us!

**Related posts**: Making Predictions with Regression Analysis and Confidence Intervals vs Prediction Intervals vs Tolerance Intervals

## Display Prediction Intervals on Fitted Line Plots to Assess Precision

I’ve created a dataset that is very similar to the data that Soyer and Hogarth use for their study. You can download the CSV data file to try this yourself: SimpleRegressionPrecision.

Let’s start out with a simple case by using prediction intervals to answer the same question they asked in their study. Then, we’ll look at several more complex cases.

What is the minimum value of X that ensures a positive result (Y > 0) with 95% probability?

To choose the correct value, we need a 95% lower bound for the prediction, which is a one-sided prediction interval with a 95% confidence level. Unfortunately, the software I’m using can’t display a one-sided prediction interval on a fitted line plot, but the lower limit of a two-sided 90% prediction interval is equivalent to a 95% lower bound. Consequently, on the fitted line plot below, we’ll use only the lower green line.

In the plot, I placed the crosshairs over the point where the 95% lower bound crosses zero on the y-axis. The software displays the values for this point in the upper-left corner of the graph. The results tell us that we need an X of 47.1836 to obtain a Y greater than zero with 95% confidence.
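If you’d rather compute the crossing point than read it off a graph, here is a rough sketch using NumPy and SciPy. It does not use the SimpleRegressionPrecision file. Instead, it simulates stand-in data with a similar slope, intercept, and scatter, so the seed, the uniform X distribution, and the noise level are all assumptions. It applies the standard prediction-interval formula for simple regression and scans for the first X where the 95% lower bound reaches zero.

```python
import numpy as np
from scipy import stats

# Simulated stand-in data (NOT the SimpleRegressionPrecision file)
rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(0, 100, n)
y = 0.32 + 1.001 * x + rng.normal(0, 28.8, n)

# Fit the least-squares line
slope, intercept = np.polyfit(x, y, 1)
resid = y - (intercept + slope * x)
s = np.sqrt(np.sum(resid**2) / (n - 2))  # standard error of the regression
sxx = np.sum((x - x.mean()) ** 2)

def lower_bound(x0, level=0.95):
    """One-sided lower prediction bound for a single new observation at x0."""
    t_crit = stats.t.ppf(level, df=n - 2)
    se_pred = s * np.sqrt(1 + 1 / n + (x0 - x.mean()) ** 2 / sxx)
    return intercept + slope * x0 - t_crit * se_pred

# Scan a fine grid and take the first X whose lower bound is non-negative
grid = np.linspace(0, 100, 10001)
x_cross = grid[np.argmax(lower_bound(grid) >= 0)]
print(f"95% lower bound crosses zero near X = {x_cross:.1f}")
```

Because the data are simulated, the crossing point will not match the graph exactly, but it should fall in the same neighborhood.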

As I noted earlier, this dataset is particularly conducive to visual analysis. What if we have fewer data points that aren’t so consistently arranged?

I randomly sampled 50 observations from the complete data set and created the fitted line plot below.

With this dataset, it’s hard to determine the answer visually. Prediction intervals really shine here. Even though the sample is only 1/20th the size of the full dataset, the results are very close. Using the crosshairs again, we see that the answer is 41.7445.

## Example of Using Prediction Intervals with Predictive Analytics

The previous models have only one independent variable, which allowed us to graph the model and the prediction intervals. If you have more than one independent variable, you can’t graph prediction intervals, but you can still use them.

We’ll use a regression model as an example of predictive analytics in order to decide how to set the pressure and fuel flow in our process. These variables predict the heat that the process generates. Download the CSV data file to try it yourself: MultipleRegressionPrecision. The regression output is below.

To prevent equipment damage, we must avoid excessive heat. We need to set the pressure and fuel flow so that we can be 95% confident that the heat will be less than 250. However, we don’t want to go too low because it reduces the efficiency of the system.

We could plug numbers into the regression equation to find values that produce an average heat of 250. However, we know that there will be variation around this average. Consequently, we’ll need to set the pressure and fuel flow to produce an average that is somewhat less than 250. How much lower is sufficient? We’ll use prediction intervals to find out!

## Creating Prediction Intervals to Assess Precision in Predictive Analytics

Finding the correct settings to use for pressure and fuel flow requires subject-area knowledge to determine settings that are both feasible and will produce temperatures in the right ballpark. Using a combination of experience and trial and error, you want to produce results where the 95% upper bound is near 250.

Most statistical software allows you to create prediction intervals based on a regression model. While the process varies by statistical software package, I’m using Minitab, and below I show how I enter the settings and the results that it calculates. It’s convenient because the software calculates the mean outcome and the prediction interval using the regression model that you fit. I’m entering process settings of 36 for pressure and 17.5 for fuel flow. I’ve also set it so that the software will calculate a 95% upper bound.

The output shows that if we set the pressure and fuel flow at 36 and 17.5 respectively, the average temperature is 232.574 and the upper bound is 248.274. We can be 95% confident that the next temperature measurement at these settings will be below 248. That’s just what we need! We’re using the prediction interval to show us the precision of the predictions to incorporate the process’s inherent variability into our decision-making.

We can use this same procedure even when our regression model includes more independent variables, curvature, and interaction terms.
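If you don’t have Minitab, the same kind of upper bound can be sketched with NumPy and SciPy. The data below are hypothetical, not the MultipleRegressionPrecision file, and the coefficients and noise level used to generate them are made up for illustration. The calculation uses the general matrix form of the prediction interval, which carries over to models with more terms.

```python
import numpy as np
from scipy import stats

# Hypothetical process data (NOT the MultipleRegressionPrecision file)
rng = np.random.default_rng(7)
n = 60
pressure = rng.uniform(30, 40, n)
fuel_flow = rng.uniform(15, 20, n)
heat = 20 + 4.0 * pressure + 3.9 * fuel_flow + rng.normal(0, 9, n)

# Design matrix with an intercept column, then an ordinary least-squares fit
X = np.column_stack([np.ones(n), pressure, fuel_flow])
coef, *_ = np.linalg.lstsq(X, heat, rcond=None)

resid = heat - X @ coef
p = X.shape[1]                      # number of estimated coefficients
s2 = resid @ resid / (n - p)        # mean squared error
xtx_inv = np.linalg.inv(X.T @ X)

def upper_bound(settings, level=0.95):
    """Return (predicted mean, one-sided upper prediction bound) for new settings."""
    x0 = np.array([1.0, *settings])
    fit = x0 @ coef
    se_pred = np.sqrt(s2 * (1 + x0 @ xtx_inv @ x0))
    return fit, fit + stats.t.ppf(level, df=n - p) * se_pred

fit, upper = upper_bound((36, 17.5))
print(f"Predicted mean heat: {fit:.1f}")
print(f"95% upper bound:     {upper:.1f}")
```

To hunt for workable settings, you could call `upper_bound` with different pressure and fuel flow values until the bound sits just under your limit.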

## Other Tips to Avoid Costly Mistakes in Predictive Analytics

- **Assess predicted R-squared**: Even when a regression model has a high R-squared value, it might not be able to predict new observations as well. Use predicted R-squared to evaluate how well your model predicts for new observations. Read my post about predicted R-squared.
- **Assess the Standard Error of the Regression (S)**: As I mentioned earlier, R-squared doesn’t directly assess the precision of your regression model. However, the standard error of the regression (S) is a different goodness-of-fit statistic that directly assesses precision using the units of the dependent variable. The predicted value plus/minus 2*S is a quick estimate of a 95% prediction interval. To learn more, read my post about the standard error of the regression.
- **Perform validation runs**: After using regression analysis and the prediction intervals to identify candidate settings, perform some validation runs at these settings to be sure that the real world behaves as your model predicts it should!
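As a quick illustration of the plus/minus 2*S rule of thumb, here is a small sketch applied to the study’s example. The table doesn’t report S, so I approximate it from the standard deviation of Y and R-squared. That approximation, and the function name, are my own assumptions.

```python
from math import sqrt

# Approximate S from the study's table: SD of Y = 40.78, R-squared = 0.50
s = 40.78 * sqrt(1 - 0.50)

def quick_interval(predicted, s):
    """Rough 95% prediction interval: predicted value plus/minus 2 * S."""
    return predicted - 2 * s, predicted + 2 * s

predicted = 0.32 + 1.001 * 10  # prediction at X = 10
low, high = quick_interval(predicted, s)
print(f"Prediction {predicted:.2f}, rough 95% interval ({low:.1f}, {high:.1f})")
```

The interval around the prediction at X = 10 stretches well below zero, which is exactly the warning that the point prediction alone hides.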

**Have you used predictive analytics to make a decision?**

### Reference

Emre Soyer, Robin M. Hogarth, The illusion of predictability: How regression statistics mislead experts, International Journal of Forecasting, Volume 28, Issue 3, July–September 2012, Pages 695-711.

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

abdul says

Hi Jim,

First and foremost, thank you a lot for this extremely useful post!

How could you calculate the prediction interval with a series of measurements for one constant true (integer) value, instead of a regression line?

I want to define the precision of an assay in determining integer values, but I feel like the confidence intervals are deceptively narrow.

Guilherme says

Hi Jim,

Thanks for this very interesting post. Could you please explain how to solve your example mathematically (the case in which factoring in the variability leads to an answer of 47 instead of 10)?

So, if I understood it correctly, we can’t establish the precision of a coefficient per se, but instead should look at the precision of its prediction given a certain value of the variable, correct?

Thanks,

Guil

Jim Frost says

Hi Guilherme,

The solution is something I wouldn’t expect most people to calculate. And, I do think using prediction intervals is easier. But, here you go!

As an interesting point, when I first read the article, I wrote to the authors and they were impressed by the usefulness of prediction intervals. I don’t think they really considered using them before.

It is possible to establish the precision of the coefficient estimates. Just have your software calculate the confidence intervals for the coefficients. They work like other CIs. In this case, each CI incorporates a margin of error around the point estimate (coefficient) to provide a range of values that likely contains the true population parameter. Narrower CIs represent more precise estimates.

So, really, the choice comes down to whether you need to know the precision of your model’s predictions or the precision of the coefficient estimates. Of course, you can do both! Just know that they’re different things.

I hope that helps!

Jicheng Xia says

Thanks, Jim, for the information. Looking forward to your new e-books. I will definitely purchase them, not only to support you, but to save this valuable knowledge in a safer place :-). I believe these books will be worth every penny! I am planning to recommend them to my colleagues too, and I believe they are going to love them!

Jicheng Xia says

Hi Jim,

I have been reading your posts over the last several months, and I have benefited a lot from them.

One request here: is it possible to make a content directory for all these posts so that it is easier to search?

Lastly, please keep posting and working on this blog. I really appreciate all the knowledge sharing. This is one of the best resources online regarding applied statistics!!

Jim Frost says

Hi Jicheng,

I’m happy to hear that my website has been helpful for you! Thank you for the kind words and encouragement! I will keep working at it! 🙂

On the right hand navigation pane, you’ll see a search box on all pages. You can use that to search for posts about specific topics. Also, across the top you will see a menu bar for broader categories (Basics, Hypothesis Tests, Regression, etc.). Click on those to go to a page or pages that group all posts in those broad categories.

Additionally, I’m in the process of writing ebooks about all the topics. Currently, my ebook on regression analysis is complete. My Introduction to Statistics ebook will be ready this September. And, next year, I’ll complete one on hypothesis testing. These books not only organize everything nicely but they also go into the subject areas much deeper than the blog posts.

Happy reading!

Ben Coltman says

Thanks indeed for your response. I’ll let you know how I get on…

Ben Coltman says

Thanks for this engaging blog post. Remarkable that so many economists failed to properly take R-squared into account when assessing the precision of the regression’s predictions.

I’m interested in predictive models. In particular, I’m trying to build a model that predicts the value of residential properties in the UK. My first attempt is to build a hedonic regression model. I bought your book, Regression Analysis: An Intuitive Guide, but I’m aware this is more of a how-to guide on building explanatory regression models rather than predictive ones.

I wonder whether you could point me to some further guidance on how to build predictive regression models (books/textbooks/other blogs).

In addition, are there any other model forms that you recommend I try for my house price model?

Thanks,

Ben

Jim Frost says

Hi Ben,

First, thank you so much for buying my ebook! And, in that ebook, there’s a chapter about using regression analysis to make predictions. You should definitely read that. And, my advice would be throughout the book, pay extra attention to places where I’m talking about assessing the standard error of the regression (S) and prediction interval. These tools are particularly helpful when you’re using regression to make predictions. The mistakes in this blog post would be entirely avoided by assessing those two things–and they’re simple to use! People just get too hung up on R-squared and it doesn’t help you when it comes to prediction.

I actually had an econometrics professor way back who published a journal article about housing prices. He used OLS regression, which worked fine. I took his class over 20 years ago so the details are foggy. But, I know he included a ton of predictors. These included continuous variables like square footage and distance from schools and parks, as well as categorical variables to capture things like architectural style. One of the strongest predictors was whether a bathroom was remodeled. During the class, he complained that after he sold his own house he had a large negative residual!!

I think you’re on the right track. The trick will be to identify the predictors and gather the data. There’s published research on this exact type of model too.

Best of luck with your analysis!