If you were able to make predictions about something important to you, you’d probably love that, right? It’s even better if you know that your predictions are sound. In this post, I show how to use regression analysis to make predictions and determine whether they are both unbiased and precise.

You can use regression equations to make predictions. Regression equations are a crucial part of the statistical output after you fit a model. The coefficients in the equation define the relationship between each independent variable and the dependent variable. However, you can also enter values for the independent variables into the equation to predict the mean value of the dependent variable.
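To make the idea of entering values into the equation concrete, here is a minimal sketch. The function name `predict_mean` and the coefficients are hypothetical, chosen only so the arithmetic is easy to check; they do not come from any model in this post.

```python
def predict_mean(intercept, coefficients, values):
    """Return the predicted mean of the dependent variable.

    coefficients and values are parallel sequences, one entry per
    independent variable (or per term, such as x and x squared).
    """
    return intercept + sum(c * v for c, v in zip(coefficients, values))


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
example = predict_mean(2.0, [0.5, -1.0], [10.0, 3.0])
print(example)  # 2.0 + 0.5*10.0 - 1.0*3.0 = 4.0
```

The result is the predicted mean for cases with those independent variable values, not a guaranteed value for any single case.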

**Related post**: When Should I Use Regression Analysis?

## The Regression Approach for Predictions

Using regression to make predictions doesn’t necessarily involve predicting the future. Instead, you predict the mean of the dependent variable given specific values of the independent variable(s). For our example, we’ll use one independent variable to predict the dependent variable. I measured both of these variables at the same point in time.

Psychic predictions are things that just pop into mind and are not often verified against reality. Unsurprisingly, predictions in the regression context are more rigorous. We need to collect data for relevant variables, formulate a model, and evaluate how well the model fits the data.

The general procedure for using regression to make good predictions is the following:

- Research the subject-area so you can build on the work of others. This research helps with the subsequent steps.
- Collect data for the relevant variables.
- Specify and assess your regression model.
- If you have a model that adequately fits the data, use it to make predictions.

While this process involves more work than the psychic approach, it provides valuable benefits. With regression, we can evaluate the bias and precision of our predictions:

- Bias in a statistical model indicates that the predictions are systematically too high or too low.
- Precision represents how close the predictions are to the observed values.

When we use regression to make predictions, our goal is to produce predictions that are both correct on average and close to the real values. In other words, we need predictions that are both unbiased and precise.

## Example Scenario for Regression Predictions

We’ll use a regression model to predict body fat percentage based on body mass index (BMI). I collected these data for a study with 92 middle school girls. The variables we measured include height, weight, and body fat measured by a Hologic DXA whole-body system. I’ve calculated the BMI using the height and weight measurements. DXA measurements of body fat percentage are considered to be among the best.
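For reference, BMI is conventionally defined as weight in kilograms divided by the square of height in meters. The sketch below assumes metric units, which the post does not state for this dataset, and the example values are hypothetical.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in meters squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical girl: 45 kg and 1.50 m tall.
print(round(bmi(45, 1.50), 1))  # 45 / 2.25 = 20.0
```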

You can download the CSV data file: Predict_BMI.

Why might we want to use BMI to predict body fat percentage? It’s more expensive to obtain your body fat percentage through a direct measure like DXA. If you can use your BMI to predict your body fat percentage, that provides valuable information more easily and cheaply. Let’s see if BMI can produce good predictions!

## Finding a Good Regression Model for Predictions

We have the data. Now, we need to determine whether there is a statistically significant relationship between the variables. Relationships, or correlations between variables, are crucial if we want to use the value of one variable to predict the value of another. We also need to evaluate the suitability of the regression model for making predictions.

We have only one independent variable (BMI), so we can use a fitted line plot to display its relationship with body fat percentage. The relationship between the variables is curvilinear. I’ll use a polynomial term to fit the curvature. In this case, I’ll include a quadratic (squared) term. The fitted line plot below suggests that this model fits the data.
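If you want to see what fitting a quadratic term involves, here is a minimal sketch that solves the least squares normal equations using only the Python standard library. The data are synthetic values invented for illustration, not the study data, and `fit_quadratic` is a hypothetical helper. In practice you would more likely rely on statistical software or a library routine such as `numpy.polyfit`.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    # Power sums of x (orders 0..4) and cross sums of y with x**k (k = 0..2).
    s = [sum(xi ** k for xi in x) for k in range(5)]
    t = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    # Augmented matrix [X'X | X'y] for the three coefficients.
    a = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(3):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][3] / a[i][i] for i in range(3)]


# Synthetic, made-up (BMI, body fat %) pairs; NOT the study data.
bmi_values = [16, 18, 20, 22, 25, 28, 31, 34]
body_fat = [18.5, 23.0, 26.5, 29.5, 33.0, 35.5, 37.0, 38.0]
b0, b1, b2 = fit_quadratic(bmi_values, body_fat)
print(b0, b1, b2)  # a negative b2 means the curve flattens as BMI rises
```

The normal equations can be numerically sensitive when the predictor values are large, which is one reason library routines are usually the safer choice for real analyses.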

**Related post**: Curve Fitting using Linear and Nonlinear Regression

This curvature is readily apparent because we have only one independent variable and we can graph the relationship. If your model has more than one independent variable, use separate scatterplots to display the association between each independent variable and the dependent variable so you can evaluate the nature of each relationship.

### Assess the residual plots

You should also assess the residual plots. If you see patterns in the residual plots, you know that your model is incorrect and that you need to reevaluate it. Non-random residuals indicate that the predicted values are biased. You need to fix the model to produce unbiased predictions.

Learn how to choose the correct regression model.

The residual plots below also confirm the unbiased fit because the data points fall randomly around zero and follow a normal distribution.

### Interpret the regression output

In the statistical output below, the p-values indicate that both the linear and squared terms are statistically significant. Based on all of this information, we have a model that provides a statistically significant and unbiased fit to these data. We have a valid regression model. However, there are additional issues we must consider before we can use this model to make predictions.

As an aside, the curved relationship is interesting. The flattening curve indicates that higher BMI values are associated with smaller increases in body fat percentage.

## Other Considerations for Valid Predictions

### Precision of the Predictions

Previously, we established that our regression model provides unbiased predictions of the observed values. That’s good. However, it doesn’t address the precision of those predictions. Precision measures how close the predictions are to the observed values. We want the predictions to be both unbiased *and* close to the actual values. Predictions are precise when the observed values cluster close to the predicted values.

Regression predictions are for the *mean* of the dependent variable. If you think of any mean, you know that there is variation around that mean. The same applies to the predicted mean of the dependent variable. In the fitted line plot, the regression line is nicely in the center of the data points. However, there is a spread of data points around the line. We need to quantify that spread to know how close the predictions are to the observed values. If the spread is too large, the predictions won’t provide useful information.

Later, I’ll generate predictions and show you how to assess the precision.

**Related post**: Understand Precision in Applied Regression to Avoid Costly Mistakes

### Goodness-of-Fit Measures

Goodness-of-fit measures, like R-squared, assess the scatter of the data points around the fitted values. The R-squared for our model is 76.1%, which is good but not great. For a given dataset, higher R-squared values represent predictions that are more precise. However, R-squared doesn’t tell us directly how precise the predictions are in the units of the dependent variable. We can use the standard error of the regression (S) to assess the precision in this manner. However, for this post, I’ll use prediction intervals to evaluate precision.
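To show how the two measures relate, the sketch below computes R-squared and S from observed and fitted values using their standard definitions. The numbers are made up, and `r_squared_and_s` is a hypothetical helper; `n_params` counts every estimated coefficient, including the intercept.

```python
from math import sqrt


def r_squared_and_s(observed, fitted, n_params):
    """Return (R-squared, standard error of the regression S)."""
    n = len(observed)
    mean_y = sum(observed) / n
    # Residual sum of squares: scatter around the fitted values.
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    # Total sum of squares: scatter around the overall mean.
    ss_tot = sum((o - mean_y) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    s = sqrt(ss_res / (n - n_params))  # in the units of the dependent variable
    return r2, s


# Made-up observed and fitted values from a straight-line model (2 parameters).
r2, s = r_squared_and_s([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.8, 5.0], 2)
print(r2, s)
```

R-squared is a unitless proportion, while S is expressed in the units of the dependent variable, which is why S speaks more directly to precision.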

**Related post**: Standard Error of the Regression vs. R-squared

### New Observations versus Data Used to Fit the Model

R-squared and S indicate how well the model fits the observed data. We need predictions for new observations that the analysis did not use during the model estimation process. Assessing that type of fit requires a different goodness-of-fit measure, the predicted R-squared.

Predicted R-squared measures how well the model predicts the value of new observations. Statistical software packages calculate it by sequentially removing each observation, fitting the model, and determining how well the model predicts the removed observations.

If the predicted R-squared is much lower than the regular R-squared, you know that your regression model doesn’t predict new observations as well as it fits the current dataset. This situation should make you wary of the predictions.
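The leave-one-out procedure described above can be sketched as follows. To keep the example short, it fits a straight line rather than the quadratic model used in this post, and the data are made up. The function names are hypothetical. The calculation uses the common definition of predicted R-squared as 1 minus PRESS divided by the total sum of squares.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def predicted_r_squared(x, y):
    """1 - PRESS / SST, where each point is predicted by a fit that omits it."""
    press = 0.0
    for i in range(len(x)):
        # Refit without observation i (x and y are expected to be lists).
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (b0 + b1 * x[i])) ** 2
    my = sum(y) / len(y)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - press / ss_tot


# Made-up data that are nearly, but not exactly, linear.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
print(predicted_r_squared(x, y))
```

Because each observation is predicted by a model that never saw it, the result can never exceed the regular R-squared for the same least squares model.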

The statistical output below shows that the predicted R-squared (74.14%) is nearly equal to the regular R-squared (76.06%) for our model. We have reason to believe that the model predicts new observations nearly as well as it fits the dataset.

**Related post**: How to Interpret Adjusted R-squared and Predicted R-squared

### Make Predictions Only Within the Range of the Data

Regression predictions are valid only for the range of data used to estimate the model. The relationship between the independent variables and the dependent variable can change outside of that range. In other words, we don’t know whether the shape of the curve changes. If it does, our predictions will be invalid.

The graph shows that the observed BMI values range from 15-35. We should not make predictions outside of this range.
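If you generate predictions in code, a simple guard can enforce this rule. The sketch below is one possible approach, not a feature of any particular statistics package. `predict_in_range` and `toy_model` are hypothetical names, and the toy model's coefficients are made up purely to exercise the check.

```python
def predict_in_range(predict, x_new, x_min, x_max):
    """Call predict(x_new) only if x_new lies inside the fitted data range."""
    if not (x_min <= x_new <= x_max):
        raise ValueError(
            f"{x_new} is outside the observed range [{x_min}, {x_max}]"
        )
    return predict(x_new)


# Placeholder model with made-up coefficients, only to exercise the guard.
def toy_model(bmi):
    return 1.0 + 0.5 * bmi


print(predict_in_range(toy_model, 18, 15, 35))  # inside the 15-35 range
try:
    predict_in_range(toy_model, 40, 15, 35)     # outside the range: refused
except ValueError as err:
    print("Refused:", err)
```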

### Make Predictions Only for the Population You Sampled

The relationships that a regression model estimates might be valid for only the specific population that you sampled. Our data were collected from middle school girls who are 12-14 years old. The relationship between BMI and body fat percentage might be different for males and different age groups.

## Using our Regression Model to Make Predictions

We have a valid regression model that appears to produce unbiased predictions and can predict new observations nearly as well as it predicts the data used to fit the model. Let’s go ahead and use our model to make a prediction and assess the precision.

It is possible to use the regression equation and calculate the predicted values ourselves. However, I’ll use statistical software to do this for us. Not only is this approach easier and more accurate, but I’ll also have it calculate the prediction intervals so we can assess the precision.

I’ll use the software to predict the body fat percentage for a BMI of 18. The prediction output is below.

## Interpreting the Regression Prediction Results

The output indicates that the mean value associated with a BMI of 18 is estimated to be ~23% body fat. Again, this mean applies to the population of middle school girls. Let’s assess the precision using the confidence interval (CI) and the prediction interval (PI).

The confidence interval is the range where the mean value for girls with a BMI of 18 is likely to fall. We can be 95% confident that this mean is between 22.1% and 23.9%. However, this confidence interval does not help us evaluate the precision of individual predictions.

A prediction interval is the range where a single new observation is likely to fall. Narrower prediction intervals represent more precise predictions. For an individual middle school girl with a BMI of 18, we can be 95% confident that her body fat percentage is between 16% and 30%.

The range of the prediction interval is always wider than the confidence interval due to the greater uncertainty of predicting an individual value rather than the mean.
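The formulas behind the two intervals make the difference visible. The sketch below uses the textbook formulas for a straight-line model, which is a simplification of the quadratic model in this post, and the data are made up. The caller supplies the t critical value as `t_crit`, and `mean_ci_and_pi` is a hypothetical helper name.

```python
from math import sqrt


def mean_ci_and_pi(x, y, x0, t_crit):
    """Fit y = b0 + b1*x and return (fit, CI for the mean, PI) at x0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    b0 = my - b1 * mx
    # Standard error of the regression (S), with n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = sqrt(sse / (n - 2))
    fit = b0 + b1 * x0
    leverage = 1 / n + (x0 - mx) ** 2 / sxx
    half_ci = t_crit * s * sqrt(leverage)      # uncertainty in the mean
    half_pi = t_crit * s * sqrt(1 + leverage)  # plus individual scatter
    return fit, (fit - half_ci, fit + half_ci), (fit - half_pi, fit + half_pi)


# Made-up data; 2.776 is the two-sided 95% t critical value for 4 df.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.8]
fit, ci, pi = mean_ci_and_pi(x, y, 3.5, 2.776)
print(fit, ci, pi)
```

The only difference between the two half-widths is the extra 1 under the square root for the prediction interval. That term represents the scatter of individual observations around the mean, and it does not shrink as the sample size grows.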

Is this prediction sufficiently precise? To make this determination, we’ll need to use our subject-area knowledge in conjunction with any specific requirements we have. I’m not a medical expert, but I’d guess that the 14-point range of 16-30% is too imprecise to provide meaningful information. If this is true, our regression model is too imprecise to be useful.

## Don’t Focus On Only the Fitted Values

As we saw in this post, using regression analysis to make predictions is a multi-step process. After collecting the data, you need to specify a valid model. The model must satisfy several conditions before you make predictions. Finally, be sure to assess the precision of the predictions. It’s all too easy to get lulled into a false sense of security by focusing on only the fitted value and not considering the prediction interval.


Saran Karthick says

Hi Jim,

I’m starting out in Predictive Analytics and found your article very useful and informative.

I’m currently working on a use case where the quality of a product is directly affected by a temperature parameter (which was found by root cause analysis). So our objective is to maintain the temperature at the nominal value and provide predictions on when the temperature may vary. But unfortunately, quality data is not available. Hence, we need to work with the temperature and additional process parameter data available to us.

My queries are as follows:

Can I predict the temperature variance and assume that the quality of the product will be in sync to a certain extent?

Is regression analysis the best methodology for my use case?

Are there any open source tools available for doing this predictive analytics?

N'Dah Kolani says

Hello dear,

Thank you for all your interesting posts.

I’m a beginner in regression and I would like to use a logistic model to predict surrenders in life insurance.

I would like to understand the predicted probabilities well.

In my model I use the age (in months) of the contract in the portfolio, the gender of the policyholder, …

When making a prediction, for age 57 and gender M for example, what does the predicted probability mean?

Does it mean that it’s the probability of the contract being surrendered at age 57 given the gender of the policyholder?

Jim Frost says

Hi N’Dah,

Yes, the prediction is the probability that a 57-year-old male will surrender the policy. That assumes the model provides a good fit and satisfies the necessary assumptions. I write more about binary logistic regression in another post. It uses binary logistic regression to analyze a political group in the U.S., but I do talk about interpreting the output, which might be helpful.

I hope that helps!

RG says

Why is the standard error of estimate or prediction higher when the predictive quality of variables is lower?

Aanchal Iyer says

Good Read. Easy to understand keep it up.

Musarrat Abbas Khan says

I really appreciate your support in regression analysis. I have data on the milk yield of buffaloes. Different buffaloes yield milk over different numbers of days. In order to rank the buffaloes, I need to adjust milk yield to a standard lactation period of 305 days. Some buffaloes have a lactation length longer than 305 days, others shorter than 305 days. How do I develop factors for the correction/prediction of milk yield for all buffaloes on one standard?

Jim Frost says

Hi Musarrat,

The process of identifying the correct variables to include in your model is a mix between subject area knowledge and statistics. To develop an initial list of potential factors, you’ll need to research the subject area and use your expertise to identify candidates. I don’t know the dairy industry so, unfortunately, I can’t help you there.

I suggest you read my post about choosing the correct regression model for some tips. Additionally, consider buying my ebook on regression analysis which can help you with the process.

I hope this helps!

philoinme says

Unlike the standard error of the regression (https://statisticsbyjim.com/regression/standard-error-regression-vs-r-squared/), the assessment by calculating prediction intervals in this article doesn’t seem to be comprehensive. With the SE of the regression, it is clear from the rule of thumb that a certain number of points must fall within the bounds based on the confidence level (95%, 99%). This of course depends on how precise we want to be.

In the case of prediction intervals, the usage of subject matter expertise was mentioned and the calculations were based on every point (where the conditions of independent variables are given). Now, I wonder how to quantify and assess the precision of model based on a one-off calculation?

Considering such a scenario, is the SE of the regression typically used unless one has a lot of subject-area expertise and ways to calculate the PI for all the data points and subsequently assess the precision of the predictions?

Thanks Jim!

Keryn says

I am one week before my thesis submission and wish I had found your site much earlier. Your explanations are so clear and concise. You are a great teacher Jim!

Jim Frost says

Thank you so much, Keryn! I strive to make these explanations as straightforward as possible, so I really appreciate your kind words!

Shudak Marty says

Using the body mass index data set as an example, suppose these results were gained from several different groups. For example, one group worked out regularly, one group didn’t work out but maintained a healthy diet, one group didn’t work out and maintained a poor diet, etc. Can we use the group average differences between estimated results (based on the regression equation) and the actual results to determine if one group was significantly different from the others in terms of that group being consistently above or below the regression line?

Jay Jay says

Hello, how can I predict the dependent variable for a new case in SPSS?

James says

Nice article. Very clear and easy to understand. Bravo.

Jim Frost says

Thank you, James!

Ginalyn says

Oh my! I came across you during the final week of my stat class. You just enlightened me in this regression area. I wish I came upon you during my first week of class. It is easier to grasp stats when it is explained plainly and their correlation with whatever in life you will be doing. Safe to say I passed (barely) my stat basically with following step by step without understanding why I am doing it in such a way and why.

Jim Frost says

Hi Ginalyn, thanks for taking the time to write such a nice comment! It made my day! I’m glad you found my blog to be helpful. I always try to explain statistics in the most straightforward, simple manner possible. I’m glad you passed!

Consolatha J Ngonyani Mhaiki says

Thanks for the deep insight; indeed your idea brings me back in trying to seek as much closer to reality predictions on our daily life phenomenal. As this universe in as much as the orderly chaotic manner, some predictions becomes erroneously to the extent that they are rendered uncertain for the decision making. In validation of a model in question, the uncertainty would be clarified by using a set of conditions for prediction and suitable intervals (limits).

BIRUK AYALEW Wondem says

like it