If you were able to make predictions about something important to you, you’d probably love that, right? It’s even better if you know that your predictions are sound. In this post, I show how to use regression analysis to make predictions and determine whether they are both unbiased and precise.
You can use regression equations to make predictions. Regression equations are a crucial part of the statistical output after you fit a model. The coefficients in the equation define the relationship between each independent variable and the dependent variable. However, you can also enter values for the independent variables into the equation to predict the mean value of the dependent variable.
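As a minimal sketch of what "entering values into the equation" looks like, the Python snippet below evaluates a quadratic regression equation of the kind fitted later in this post. The function name and the coefficient values are hypothetical placeholders, not the estimates from this analysis; substitute the coefficients that your own software reports.

```python
def predict_mean(bmi, intercept, linear_coef, squared_coef):
    """Return the predicted mean of the dependent variable for one BMI value."""
    return intercept + linear_coef * bmi + squared_coef * bmi ** 2


# Hypothetical coefficients, for illustration only.
INTERCEPT, LINEAR, SQUARED = -30.0, 4.0, -0.06

predicted = predict_mean(20, INTERCEPT, LINEAR, SQUARED)
print(f"Predicted mean body fat % at BMI 20: {predicted:.1f}")
```

The result is the estimated mean for everyone who shares that BMI, not a guaranteed value for any one person.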
The Regression Approach for Predictions
Using regression to make predictions doesn’t necessarily involve predicting the future. Instead, you predict the mean of the dependent variable given specific values of the independent variable(s). For our example, we’ll use one independent variable to predict the dependent variable. I measured both of these variables at the same point in time.
Psychic predictions are things that just pop into mind and are not often verified against reality. Unsurprisingly, predictions in the regression context are more rigorous. We need to collect data for relevant variables, formulate a model, and evaluate how well the model fits the data.
The general procedure for using regression to make good predictions is the following:
- Research the subject area so you can build on the work of others. This research helps with the subsequent steps.
- Collect data for the relevant variables.
- Specify and assess your regression model.
- If you have a model that adequately fits the data, use it to make predictions.
While this process involves more work than the psychic approach, it provides valuable benefits. With regression, we can evaluate the bias and precision of our predictions:
- Bias in a statistical model indicates that the predictions are systematically too high or too low.
- Precision represents how close the predictions are to the observed values.
When we use regression to make predictions, our goal is to produce predictions that are both correct on average and close to the real values. In other words, we need predictions that are both unbiased and precise.
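To make the two ideas concrete, here is a small Python sketch that summarizes a set of predictions by their average error (bias) and the spread of their errors (precision). The numbers are made up for illustration, and using the standard deviation of the errors as the spread measure is my assumption; other measures would work as well.

```python
from statistics import mean, stdev


def bias_and_spread(observed, predicted):
    """Return (mean error, standard deviation of the errors)."""
    errors = [p - o for o, p in zip(observed, predicted)]
    return mean(errors), stdev(errors)


# Made-up numbers for illustration only.
observed = [20.0, 25.0, 30.0, 35.0]
too_high = [23.0, 28.0, 33.0, 38.0]   # always 3 points too high: biased but tight
scattered = [14.0, 31.0, 24.0, 41.0]  # errors of -6, +6, -6, +6: unbiased but loose

print(bias_and_spread(observed, too_high))   # mean error 3, spread 0
print(bias_and_spread(observed, scattered))  # mean error 0, spread about 6.9
```

Neither set of predictions is satisfactory: the first is systematically off, and the second is right on average but far from the individual values.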
Example Scenario for Regression Predictions
We’ll use a regression model to predict body fat percentage based on body mass index (BMI). I collected these data for a study with 92 middle school girls. The variables we measured include height, weight, and body fat measured by a Hologic DXA whole-body system. I’ve calculated the BMI using the height and weight measurements. DXA measurements of body fat percentage are considered to be among the best.
You can download the CSV data file: Predict_BMI.
Why might we want to use BMI to predict body fat percentage? It’s more expensive to obtain your body fat percentage through a direct measure like DXA. If you can use your BMI to predict your body fat percentage, that provides valuable information more easily and cheaply. Let’s see if BMI can produce good predictions!
Finding a Good Regression Model for Predictions
We have the data. Now, we need to determine whether there is a statistically significant relationship between the variables. Relationships, or correlations between variables, are crucial if we want to use the value of one variable to predict the value of another. We also need to evaluate the suitability of the regression model for making predictions.
We have only one independent variable (BMI), so we can use a fitted line plot to display its relationship with body fat percentage. The relationship between the variables is curvilinear. I’ll use a polynomial term to fit the curvature. In this case, I’ll include a quadratic (squared) term. The fitted line plot below suggests that this model fits the data.
Related post: Curve Fitting using Linear and Nonlinear Regression
This curvature is readily apparent because we have only one independent variable and we can graph the relationship. If your model has more than one independent variable, use separate scatterplots to display the association between each independent variable and the dependent variable so you can evaluate the nature of each relationship.
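For readers who want to see the mechanics, here is a minimal Python sketch of fitting a model with a linear and a squared term by ordinary least squares. It solves the normal equations directly using only the standard library. The data are synthetic and noise-free, generated from hypothetical coefficients, so they are not the study data. In practice, statistical software does this step for you.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x^2; returns [b0, b1, b2]."""
    rows = [[1.0, xi, xi * xi] for xi in x]
    # Build the augmented normal equations (X'X | X'y).
    a = [
        [sum(row[i] * row[j] for row in rows) for j in range(3)]
        + [sum(row[i] * yi for row, yi in zip(rows, y))]
        for i in range(3)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


# Synthetic, noise-free data generated from hypothetical coefficients.
bmi = [float(b) for b in range(15, 36)]
body_fat = [-30.0 + 4.0 * b - 0.06 * b * b for b in bmi]

b0, b1, b2 = fit_quadratic(bmi, body_fat)
print(f"body fat % = {b0:.2f} + {b1:.2f} * BMI + {b2:.4f} * BMI^2")
```

A negative coefficient on the squared term is what produces a curve that flattens as BMI increases.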
Assess the residual plots
You should also assess the residual plots. If you see patterns in the residual plots, you know that your model is incorrect and that you need to reevaluate it. Non-random residuals indicate that the predicted values are biased. You need to fix the model to produce unbiased predictions.
The residual plots below are consistent with an unbiased fit because the residuals fall randomly around zero and follow a normal distribution.
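To show what a problematic pattern looks like numerically, the Python sketch below deliberately fits a straight line to curved data and then averages the residuals within the low, middle, and high thirds of the BMI range. The data are synthetic and noise-free, built from hypothetical coefficients, and the three-bin summary is just one simple way to expose a pattern that you would normally spot in a residual plot.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


# Synthetic, noise-free curved data (hypothetical coefficients).
bmi = list(range(15, 36))
body_fat = [-30.0 + 4.0 * b - 0.06 * b * b for b in bmi]

# Deliberately omit the squared term to create a misspecified model.
intercept, slope = fit_line(bmi, body_fat)
residuals = [y - (intercept + slope * b) for b, y in zip(bmi, body_fat)]

# Average residual in the low, middle, and high thirds of the BMI range.
thirds = [residuals[0:7], residuals[7:14], residuals[14:21]]
low, middle, high = (mean(group) for group in thirds)
print(f"low: {low:.2f}, middle: {middle:.2f}, high: {high:.2f}")
```

The straight line overpredicts at both ends of the range and underpredicts in the middle. That systematic pattern is the signature of a biased model, and adding the squared term removes it.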
Interpret the regression output
In the statistical output below, the p-values indicate that both the linear and squared terms are statistically significant. Based on all of this information, we have a model that provides a statistically significant and unbiased fit to these data. We have a valid regression model. However, there are additional issues we must consider before we can use this model to make predictions.
As an aside, the curved relationship is interesting. The flattening curve indicates that higher BMI values are associated with smaller increases in body fat percentage.
Other Considerations for Valid Predictions
Precision of the Predictions
Previously, we established that our regression model provides unbiased predictions of the observed values. That’s good. However, it doesn’t address the precision of those predictions. Precision measures how close the predictions are to the observed values. We want the predictions to be both unbiased and close to the actual values. Predictions are precise when the observed values cluster close to the predicted values.
Regression predictions are for the mean of the dependent variable. If you think of any mean, you know that there is variation around that mean. The same applies to the predicted mean of the dependent variable. In the fitted line plot, the regression line is nicely in the center of the data points. However, there is a spread of data points around the line. We need to quantify that spread to know how close the predictions are to the observed values. If the spread is too large, the predictions won’t provide useful information.
Later, I’ll generate predictions and show you how to assess the precision.
Goodness-of-fit measures, like R-squared, assess the scatter of the data points around the fitted values. The R-squared for our model is 76.1%, which is good but not great. For a given dataset, higher R-squared values represent predictions that are more precise. However, R-squared doesn’t tell us directly how precise the predictions are in the units of the dependent variable. We can use the standard error of the regression (S) to assess the precision in this manner. However, for this post, I’ll use prediction intervals to evaluate precision.
Related post: Standard Error of the Regression vs. R-squared
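Here is a short Python sketch of how R-squared and S can be computed from observed and fitted values. The values are made up for illustration and do not come from an actual fit. The `n_params` argument, a name I chose, counts the estimated coefficients including the constant.

```python
from math import sqrt


def r_squared_and_s(observed, fitted, n_params):
    """Return (R-squared, standard error of the regression S)."""
    n = len(observed)
    y_bar = sum(observed) / n
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))  # unexplained variation
    sst = sum((y - y_bar) ** 2 for y in observed)              # total variation
    return 1 - sse / sst, sqrt(sse / (n - n_params))


# Made-up values for illustration only.
observed = [18.0, 22.0, 27.0, 29.0, 34.0]
fitted = [19.0, 22.0, 25.0, 30.0, 34.0]

r2, s = r_squared_and_s(observed, fitted, n_params=2)
print(f"R-squared = {r2:.1%}, S = {s:.2f} percentage points")
```

R-squared is unitless, while S is expressed in the units of the dependent variable, which is why S speaks more directly to precision.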
New Observations versus Data Used to Fit the Model
R-squared and S indicate how well the model fits the observed data. We need predictions for new observations that the analysis did not use during the model estimation process. Assessing that type of fit requires a different goodness-of-fit measure, the predicted R-squared.
Predicted R-squared measures how well the model predicts the value of new observations. Statistical software packages calculate it by sequentially removing each observation, fitting the model, and determining how well the model predicts the removed observations.
If the predicted R-squared is much lower than the regular R-squared, you know that your regression model doesn’t predict new observations as well as it fits the current dataset. This situation should make you wary of the predictions.
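The leave-one-out procedure can be sketched in a few lines of Python. To keep the example short, it fits a straight line rather than the quadratic model used in this post, and the data are made up. Statistical packages typically use a faster algebraic shortcut instead of refitting the model for every observation.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope


def total_ss(y):
    """Total sum of squares around the mean."""
    y_bar = sum(y) / len(y)
    return sum((yi - y_bar) ** 2 for yi in y)


def r_squared(x, y):
    """Regular R-squared: fit once using every observation."""
    intercept, slope = fit_line(x, y)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - sse / total_ss(y)


def predicted_r_squared(x, y):
    """1 - PRESS / SST, refitting the line once per held-out observation."""
    press = 0.0
    for i in range(len(x)):
        x_rest, y_rest = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        intercept, slope = fit_line(x_rest, y_rest)
        press += (y[i] - (intercept + slope * x[i])) ** 2
    return 1 - press / total_ss(y)


# Made-up data for illustration only.
x = [16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0]
y = [20.5, 23.0, 27.5, 28.0, 31.5, 32.0, 35.5, 36.0]

regular = r_squared(x, y)
predicted = predicted_r_squared(x, y)
print(f"R-squared = {regular:.1%}, predicted R-squared = {predicted:.1%}")
```

Because each held-out observation plays no part in the fit that predicts it, the predicted R-squared cannot exceed the regular R-squared for a least-squares fit.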
The statistical output below shows that the predicted R-squared (74.14%) is nearly equal to the regular R-squared (76.06%) for our model. We have reason to believe that the model predicts new observations nearly as well as it fits the dataset.
Make Predictions Only Within the Range of the Data
Regression predictions are valid only for the range of data used to estimate the model. The relationship between the independent variables and the dependent variable can change outside of that range. In other words, we don’t know whether the shape of the curve changes. If it does, our predictions will be invalid.
The graph shows that the observed BMI values range from 15 to 35. We should not make predictions outside of this range.
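If you automate predictions, one simple safeguard is to refuse inputs that fall outside the observed range. The Python sketch below assumes the 15 to 35 range read from the graph; the function and constant names are hypothetical.

```python
OBSERVED_BMI_RANGE = (15.0, 35.0)  # approximate range read from the fitted line plot


def check_in_range(bmi, observed_range=OBSERVED_BMI_RANGE):
    """Return bmi unchanged, or raise ValueError if it is outside the fitted data."""
    low, high = observed_range
    if not low <= bmi <= high:
        raise ValueError(
            f"BMI {bmi} is outside the observed range {low}-{high}; "
            "the model has not been evaluated there."
        )
    return bmi


print(check_in_range(18))  # inside the range, so the value is returned
```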
Make Predictions Only for the Population You Sampled
The relationships that a regression model estimates might be valid for only the specific population that you sampled. Our data were collected from middle school girls who are 12-14 years old. The relationship between BMI and body fat percentage might be different for males and for other age groups.
Using our Regression Model to Make Predictions
We have a valid regression model that appears to produce unbiased predictions and can predict new observations nearly as well as it predicts the data used to fit the model. Let’s go ahead and use our model to make a prediction and assess the precision.
It is possible to use the regression equation and calculate the predicted values ourselves. However, I’ll use statistical software to do this for us. Not only is this approach easier and more accurate, but I’ll also have it calculate the prediction intervals so we can assess the precision.
I’ll use the software to predict the body fat percentage for a BMI of 18. The prediction output is below.
Interpreting the Regression Prediction Results
The output indicates that the mean value associated with a BMI of 18 is estimated to be ~23% body fat. Again, this mean applies to the population of middle school girls. Let’s assess the precision using the confidence interval (CI) and the prediction interval (PI).
The confidence interval is the range where the mean value for girls with a BMI of 18 is likely to fall. We can be 95% confident that this mean is between 22.1% and 23.9%. However, this confidence interval does not help us evaluate the precision of individual predictions.
A prediction interval is the range where a single new observation is likely to fall. Narrower prediction intervals represent more precise predictions. For an individual middle school girl with a BMI of 18, we can be 95% confident that her body fat percentage is between 16% and 30%.
The prediction interval is always wider than the confidence interval because predicting an individual value involves more uncertainty than estimating the mean.
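To see why, here is a Python sketch of the standard formulas for the two intervals. The confidence interval uses only the standard error of the fitted value, while the prediction interval adds the scatter of individual observations (S) to it. The inputs are hypothetical values that I chose to land on a scale similar to the example; they are not taken from the statistical output. The critical value of about 1.99 assumes a 95% level and 89 degrees of freedom (92 observations minus 3 estimated coefficients).

```python
from math import sqrt


def intervals(fit, se_fit, s, t_crit):
    """Return ((ci_low, ci_high), (pi_low, pi_high)) for one fitted value."""
    ci_half = t_crit * se_fit                      # uncertainty in the mean only
    pi_half = t_crit * sqrt(s ** 2 + se_fit ** 2)  # adds scatter of individuals
    return (fit - ci_half, fit + ci_half), (fit - pi_half, fit + pi_half)


# Hypothetical inputs for illustration only.
ci, pi = intervals(fit=23.0, se_fit=0.45, s=3.5, t_crit=1.99)
print(f"95% CI: {ci[0]:.1f} to {ci[1]:.1f}")
print(f"95% PI: {pi[0]:.1f} to {pi[1]:.1f}")
```

Collecting more data shrinks the standard error of the fit, which narrows the confidence interval, but it does not shrink S. That is why the prediction interval stays wide even with a large sample.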
Is this prediction sufficiently precise? To make this determination, we’ll need to use our subject-area knowledge in conjunction with any specific requirements we have. I’m not a medical expert, but I’d guess that the 14-percentage-point range of 16-30% is too imprecise to provide meaningful information. If this is true, our regression model is too imprecise to be useful.
Don’t Focus On Only the Fitted Values
As we saw in this post, using regression analysis to make predictions is a multi-step process. After collecting the data, you need to specify a valid model. The model must satisfy several conditions before you make predictions. Finally, be sure to assess the precision of the predictions. It’s all too easy to get lulled into a false sense of security by focusing on only the fitted value and not considering the prediction interval.