If you were able to make predictions about something important to you, you’d probably love that, right? It’s even better if you know that your predictions are sound. In this post, I show how to use regression analysis to make predictions and determine whether they are both unbiased and precise.
You can use regression equations to make predictions. Regression equations are a crucial part of the statistical output after you fit a model. The coefficients in the equation define the relationship between each independent variable and the dependent variable. However, you can also enter values for the independent variables into the equation to predict the mean value of the dependent variable.
Related post: When Should I Use Regression Analysis?
The Regression Approach for Predictions
Using regression to make predictions doesn’t necessarily involve predicting the future. Instead, you predict the mean of the dependent variable given specific values of the independent variable(s). For our example, we’ll use one independent variable to predict the dependent variable. I measured both of these variables at the same point in time.
Psychic predictions are things that just pop into mind and are not often verified against reality. Unsurprisingly, predictions in the regression context are more rigorous. We need to collect data for relevant variables, formulate a model, and evaluate how well the model fits the data.
The general procedure for using regression to make good predictions is the following:
- Research the subject area so you can build on the work of others. This research helps with the subsequent steps.
- Collect data for the relevant variables.
- Specify and assess your regression model.
- If you have a model that adequately fits the data, use it to make predictions.
While this process involves more work than the psychic approach, it provides valuable benefits. With regression, we can evaluate the bias and precision of our predictions:
- Bias in a statistical model indicates that the predictions are systematically too high or too low.
- Precision represents how close the predictions are to the observed values.
When we use regression to make predictions, our goal is to produce predictions that are both correct on average and close to the real values. In other words, we need predictions that are both unbiased and precise.
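As a rough illustration of these two ideas, the sketch below summarizes prediction errors (observed minus predicted) for a handful of made-up values. The numbers and the helper name `bias_and_spread` are assumptions for illustration, not data from this study. The mean error gauges bias, and the standard deviation of the errors is one simple gauge of precision.

```python
from statistics import mean, stdev

def bias_and_spread(observed, predicted):
    """Return (mean error, standard deviation of errors)."""
    errors = [obs - pred for obs, pred in zip(observed, predicted)]
    return mean(errors), stdev(errors)

# Made-up values for illustration only.
observed = [20.0, 25.0, 30.0, 35.0]
unbiased_pred = [21.0, 24.0, 31.0, 34.0]  # errors alternate between -1 and +1
biased_pred = [23.0, 28.0, 33.0, 38.0]    # every prediction is 3 too high

bias_a, spread_a = bias_and_spread(observed, unbiased_pred)
bias_b, spread_b = bias_and_spread(observed, biased_pred)
print(f"unbiased model: mean error {bias_a:+.2f}, spread {spread_a:.2f}")
print(f"biased model:   mean error {bias_b:+.2f}, spread {spread_b:.2f}")
```

The second set of predictions has no scatter at all yet is systematically off, which is why bias and precision need to be checked separately.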
Example Scenario for Regression Predictions
We’ll use a regression model to predict body fat percentage based on body mass index (BMI). I collected these data for a study with 92 middle school girls. The variables we measured include height, weight, and body fat measured by a Hologic DXA whole-body system. I’ve calculated the BMI using the height and weight measurements. DXA measurements of body fat percentage are considered to be among the best.
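For reference, BMI is conventionally defined as weight in kilograms divided by the square of height in meters. The sketch below applies that standard definition; the units and the example measurements are assumptions, because the post does not state how height and weight were recorded.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2

# Hypothetical measurements, not taken from the study data.
print(round(bmi(45.0, 1.50), 1))  # 45 / 2.25 = 20.0
```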
You can download the CSV data file: Predict_BMI.
Why might we want to use BMI to predict body fat percentage? It’s more expensive to obtain your body fat percentage through a direct measure like DXA. If you can use your BMI to predict your body fat percentage, that provides valuable information more easily and cheaply. Let’s see if BMI can produce good predictions!
Finding a Good Regression Model for Predictions
We have the data. Now, we need to determine whether there is a statistically significant relationship between the variables. Relationships, or correlations between variables, are crucial if we want to use the value of one variable to predict the value of another. We also need to evaluate the suitability of the regression model for making predictions.
We have only one independent variable (BMI), so we can use a fitted line plot to display its relationship with body fat percentage. The relationship between the variables is curvilinear. I’ll use a polynomial term to fit the curvature. In this case, I’ll include a quadratic (squared) term. The fitted line plot below suggests that this model fits the data.
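A model with a squared term is still fit by ordinary least squares. The sketch below shows one way to do that with NumPy. It uses synthetic stand-in data with an arbitrary flattening curve, not the Predict_BMI file, so the coefficients it prints are not the ones from this analysis.

```python
import numpy as np

# Synthetic stand-in data; NOT the Predict_BMI file. The curve is an
# arbitrary flattening shape chosen only so there is curvature to fit.
rng = np.random.default_rng(0)
x = rng.uniform(15, 35, size=92)
y = -50 + 6.0 * x - 0.1 * x**2 + rng.normal(0, 3, size=92)

# Design matrix with intercept, linear, and squared terms.
X = np.column_stack([np.ones_like(x), x, x**2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coefs
print(f"y = {b0:.2f} + {b1:.3f}*x + {b2:.4f}*x^2")
```

A negative coefficient on the squared term corresponds to a curve that flattens as x increases.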
Related post: Curve Fitting using Linear and Nonlinear Regression
This curvature is readily apparent because we have only one independent variable and we can graph the relationship. If your model has more than one independent variable, use separate scatterplots to display the association between each independent variable and the dependent variable so you can evaluate the nature of each relationship.
Assess the residual plots
You should also assess the residual plots. If you see patterns in the residual plots, you know that your model is incorrect and that you need to reevaluate it. Non-random residuals indicate that the predicted values are biased. You need to fix the model to produce unbiased predictions.
Learn how to choose the correct regression model.
The residual plots below also confirm the unbiased fit because the data points fall randomly around zero and follow a normal distribution.
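Residual patterns can also be checked numerically. The sketch below, which uses synthetic curved data rather than the study data, fits a straight line and a quadratic and then averages the residuals in the low, middle, and high thirds of x. The helper names are assumptions for illustration. A model that misses the curvature tends to show averages that drift away from zero in a systematic way.

```python
import numpy as np

# Synthetic curved data, not the study data.
rng = np.random.default_rng(1)
x = np.sort(rng.uniform(15, 35, size=90))
y = -50 + 6.0 * x - 0.1 * x**2 + rng.normal(0, 2, size=90)

def residuals(degree):
    """Residuals from a polynomial least-squares fit of the given degree."""
    coefs = np.polyfit(x, y, degree)
    return y - np.polyval(coefs, x)

def mean_residual_by_third(res):
    """Average residual in the low, middle, and high thirds of x (x is sorted)."""
    return [float(part.mean()) for part in np.array_split(res, 3)]

lin = mean_residual_by_third(residuals(1))
quad = mean_residual_by_third(residuals(2))
print("linear:   ", np.round(lin, 2))
print("quadratic:", np.round(quad, 2))
```

Note that the overall mean of the residuals is zero for any least-squares fit with an intercept, so it is the pattern across the range of x that reveals bias.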
Interpret the regression output
In the statistical output below, the p-values indicate that both the linear and squared terms are statistically significant. Based on all of this information, we have a model that provides a statistically significant and unbiased fit to these data. We have a valid regression model. However, there are additional issues we must consider before we can use this model to make predictions.
As an aside, the curved relationship is interesting. The flattening curve indicates that higher BMI values are associated with smaller increases in body fat percentage.
Other Considerations for Valid Predictions
Precision of the Predictions
Previously, we established that our regression model provides unbiased predictions of the observed values. That’s good. However, it doesn’t address the precision of those predictions. Precision measures how close the predictions are to the observed values. We want the predictions to be both unbiased and close to the actual values. Predictions are precise when the observed values cluster close to the predicted values.
Regression predictions are for the mean of the dependent variable. If you think of any mean, you know that there is variation around that mean. The same applies to the predicted mean of the dependent variable. In the fitted line plot, the regression line is nicely in the center of the data points. However, there is a spread of data points around the line. We need to quantify that spread to know how close the predictions are to the observed values. If the spread is too large, the predictions won’t provide useful information.
Later, I’ll generate predictions and show you how to assess the precision.
Related post: Understand Precision in Applied Regression to Avoid Costly Mistakes
Goodness-of-Fit Measures
Goodness-of-fit measures, like R-squared, assess the scatter of the data points around the fitted values. The R-squared for our model is 76.1%, which is good but not great. For a given dataset, higher R-squared values represent predictions that are more precise. However, R-squared doesn’t tell us directly how precise the predictions are in the units of the dependent variable. We can use the standard error of the regression (S) to assess the precision in this manner. However, for this post, I’ll use prediction intervals to evaluate precision.
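To make the two measures concrete, here is a minimal sketch of how R-squared and S are computed from observed and fitted values. The values below are made up, and the function name `r_squared_and_s` is an assumption for illustration.

```python
from math import sqrt

def r_squared_and_s(observed, fitted, n_params):
    """Return (R-squared, standard error of the regression S).

    n_params counts every estimated coefficient, including the intercept.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1 - ss_res / ss_tot
    s = sqrt(ss_res / (n - n_params))  # in the units of the dependent variable
    return r2, s

# Made-up observed and fitted values, for illustration only.
observed = [18.0, 22.0, 27.0, 29.0, 34.0]
fitted = [19.0, 22.0, 25.0, 30.0, 34.0]
r2, s = r_squared_and_s(observed, fitted, n_params=2)
print(f"R-squared = {r2:.3f}, S = {s:.2f}")
```

R-squared is a unitless proportion, while S is expressed in the units of the dependent variable, which is why S speaks more directly to precision.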
Related post: Standard Error of the Regression vs. R-squared
New Observations versus Data Used to Fit the Model
R-squared and S indicate how well the model fits the observed data. We need predictions for new observations that the analysis did not use during the model estimation process. Assessing that type of fit requires a different goodness-of-fit measure, the predicted R-squared.
Predicted R-squared measures how well the model predicts the value of new observations. Statistical software packages calculate it by sequentially removing each observation, fitting the model, and determining how well the model predicts the removed observations.
If the predicted R-squared is much lower than the regular R-squared, you know that your regression model doesn’t predict new observations as well as it fits the current dataset. This situation should make you wary of the predictions.
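The leave-one-out procedure described above can be sketched as follows. The data are synthetic stand-ins rather than the study data, and the function name is an assumption. The loop refits the model once per observation, which is slow for large datasets but mirrors the description directly.

```python
import numpy as np

def r2_and_predicted_r2(X, y):
    """Regular R-squared and leave-one-out predicted R-squared for OLS."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_tot = np.sum((y - y.mean()) ** 2)
    ss_res = np.sum((y - X @ coefs) ** 2)

    # Refit the model n times, each time leaving one observation out,
    # and record the squared error when predicting the held-out point.
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        loo_coefs, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        press += (y[i] - X[i] @ loo_coefs) ** 2

    return 1 - ss_res / ss_tot, 1 - press / ss_tot

# Synthetic stand-in data, not the study data.
rng = np.random.default_rng(2)
x = rng.uniform(15, 35, size=92)
y = -50 + 6.0 * x - 0.1 * x**2 + rng.normal(0, 3, size=92)
X = np.column_stack([np.ones_like(x), x, x**2])

r2, pred_r2 = r2_and_predicted_r2(X, y)
print(f"R-squared = {r2:.3f}, predicted R-squared = {pred_r2:.3f}")
```

Because each held-out point is predicted by a model that never saw it, the predicted R-squared cannot exceed the regular R-squared. The size of the gap is what matters.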
The statistical output below shows that the predicted R-squared (74.14%) is nearly equal to the regular R-squared (76.06%) for our model. We have reason to believe that the model predicts new observations nearly as well as it fits the dataset.
Related post: How to Interpret Adjusted R-squared and Predicted R-squared
Make Predictions Only Within the Range of the Data
Regression predictions are valid only for the range of data used to estimate the model. The relationship between the independent variables and the dependent variable can change outside of that range. In other words, we don’t know whether the shape of the curve changes. If it does, our predictions will be invalid.
The graph shows that the observed BMI values range from 15 to 35. We should not make predictions outside of this range.
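If predictions are generated in code, a small guard can enforce this rule. The sketch below uses the 15 to 35 range stated above; the function name `check_in_range` is a hypothetical helper, not part of any statistical package.

```python
BMI_MIN, BMI_MAX = 15.0, 35.0  # observed range reported in this post

def check_in_range(bmi, low=BMI_MIN, high=BMI_MAX):
    """Raise ValueError if bmi falls outside the range used to fit the model."""
    if not low <= bmi <= high:
        raise ValueError(
            f"BMI {bmi} is outside the fitted range [{low}, {high}]; "
            "the model should not be used to predict here."
        )
    return bmi

check_in_range(18)      # inside the range, so no error
try:
    check_in_range(40)  # extrapolation beyond the data
except ValueError as err:
    print(err)
```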
Make Predictions Only for the Population You Sampled
The relationships that a regression model estimates might be valid for only the specific population that you sampled. Our data were collected from middle school girls who are 12 to 14 years old. The relationship between BMI and body fat percentage might be different for males and for other age groups.
Using our Regression Model to Make Predictions
We have a valid regression model that appears to produce unbiased predictions and can predict new observations nearly as well as it predicts the data used to fit the model. Let’s go ahead and use our model to make a prediction and assess the precision.
It is possible to use the regression equation and calculate the predicted values ourselves. However, I’ll use statistical software to do this for us. Not only is this approach easier and more accurate, but I’ll also have it calculate the prediction intervals so we can assess the precision.
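Calculating a predicted value by hand simply means plugging a value into the fitted equation. A minimal sketch for a model with linear and squared terms is below. The coefficients are placeholders chosen for easy arithmetic and are NOT the fitted values from this model.

```python
def predict_mean(x, b0, b1, b2):
    """Predicted mean of the dependent variable from a quadratic equation."""
    return b0 + b1 * x + b2 * x ** 2

# Placeholder coefficients, NOT the fitted values from this model.
print(predict_mean(2.0, b0=1.0, b1=2.0, b2=-0.5))  # 1 + 4 - 2 = 3.0
```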
I’ll use the software to predict the body fat percentage for a BMI of 18. The prediction output is below.
Interpreting the Regression Prediction Results
The output indicates that the mean value associated with a BMI of 18 is estimated to be ~23% body fat. Again, this mean applies to the population of middle school girls. Let’s assess the precision using the confidence interval (CI) and the prediction interval (PI).
The confidence interval is the range where the mean value for girls with a BMI of 18 is likely to fall. We can be 95% confident that this mean is between 22.1% and 23.9%. However, this confidence interval does not help us evaluate the precision of individual predictions.
A prediction interval is the range where a single new observation is likely to fall. Narrower prediction intervals represent more precise predictions. For an individual middle school girl with a BMI of 18, we can be 95% confident that her body fat percentage is between 16% and 30%.
The prediction interval is always wider than the confidence interval because predicting an individual value involves greater uncertainty than estimating the mean.
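The difference between the two intervals comes from one extra term in the standard error. The sketch below computes both from the usual least-squares formulas on synthetic stand-in data, so the numbers will not match the output discussed above. It also makes one simplifying assumption: it uses a normal critical value as a large-sample stand-in for the t critical value that statistical software would use.

```python
import numpy as np
from statistics import NormalDist

def intervals(X, y, x0, level=0.95):
    """Fitted mean at x0 with approximate confidence and prediction intervals."""
    n, p = X.shape
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    s2 = resid @ resid / (n - p)      # residual variance, S squared
    _, R = np.linalg.qr(X)            # X'X = R'R
    z = np.linalg.solve(R.T, x0)
    leverage = z @ z                  # equals x0' (X'X)^-1 x0
    # Assumption: normal critical value approximates the t critical value.
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    fit = float(x0 @ coefs)
    half_ci = crit * np.sqrt(s2 * leverage)        # uncertainty in the mean
    half_pi = crit * np.sqrt(s2 * (1 + leverage))  # plus individual scatter
    return fit, (fit - half_ci, fit + half_ci), (fit - half_pi, fit + half_pi)

# Synthetic stand-in data, not the study data.
rng = np.random.default_rng(3)
x = rng.uniform(15, 35, size=92)
y = -50 + 6.0 * x - 0.1 * x**2 + rng.normal(0, 3, size=92)
X = np.column_stack([np.ones_like(x), x, x**2])

fit, ci, pi = intervals(X, y, x0=np.array([1.0, 18.0, 18.0**2]))
print(f"fit = {fit:.1f}")
print(f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
print(f"95% PI = ({pi[0]:.1f}, {pi[1]:.1f})")
```

The `1 +` inside the prediction-interval standard error accounts for the scatter of individual observations around the mean, which is why collecting more data narrows the confidence interval much more than the prediction interval.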
Is this prediction sufficiently precise? To make this determination, we’ll need to use our subject-area knowledge in conjunction with any specific requirements we have. I’m not a medical expert, but I’d guess that the 14-point range of 16-30% is too imprecise to provide meaningful information. If this is true, our regression model is too imprecise to be useful.
Don’t Focus On Only the Fitted Values
As we saw in this post, using regression analysis to make predictions is a multi-step process. After collecting the data, you need to specify a valid model. The model must satisfy several conditions before you make predictions. Finally, be sure to assess the precision of the predictions. It’s all too easy to get lulled into a false sense of security by focusing on only the fitted value and not considering the prediction interval.
If you’re learning regression and like the approach I use in my blog, check out my eBook!
God bless you. This would be perfect if it were done in Excel, so laymen could have more insight into what is happening. Thank you.
Hello Jim,
Thank you for the nice text. I have found in my job that academic research using observational datasets has surprisingly little focus on the prediction accuracy of a model. Furthermore, the model selection process is often blurry and the final model might have been chosen quite haphazardly. Some model assumptions might be checked along with some goodness-of-fit test, but usually nothing is mentioned about prediction accuracy.
Even the absolute correct model can have large (parameter/function) variance. For prediction, there is also the irreducible error. And even if one uses an unbiased model (unbiased parameter estimates), research shows (Harrell, Zhang, Chatfield, Faraway, Breiman etc.) that model bias will be present. Thus, we don’t even have unbiasedness. And on top of that, we have variance.
I am quite keen on machine learning, where the focus is on prediction accuracy. The approach is kind of like “the proof is in the pudding”. I find it not to be the case for “traditional statistics”, where the aim is more on interpretation (inference). Obviously, a machine learning model is not readily interpretable and it could even be impossible.
If a statistical model focused on inference (interpretation of the parameters) does not predict well, what is its use? If it’s a poor model, it most likely will predict poorly. So you should test that. Even if it’s the correct model, the predictions can be poor because of the variance. Even with an absolute correct unbiased model with large variance, your sample is probably way off from the truth. This leads to poor predictions. How happy can you really be if and when even with a correct model you predict poorly?
Having said all this, I’m leaning towards the opinion that every statistical model should incorporate prediction, preferably on a new dataset (from the same phenomenon). I think this could help with the reproducibility problem disrupting the academic research world.
Any thoughts on this?
Hello, I enjoy reading through your post. Following from South Eastern Kenya University.
Hi Seku! Welcome to my website! I’m so glad that you’ve found it to be helpful! Happy reading!
Hello Jim,
Thanks a lot for this great post and all the sub-links, which were really useful in helping me understand everything I need to build a regression model and to forecast.
My question is related to multiple regression. What if one important variable is categorical but has many values that are difficult to group? How can I encode it without distorting my model with many numeric categories?
Thanks a lot
Hi Lily,
Coding your categorical variable is a very subject-area specific process. Consequently, I can’t give you a specific answer. However, you’ll need to find a system of sorting all your observations into a set of categories. You must find a method so that all observations fall unambiguously into one, and only one, category. These categories must be mutually exclusive. All observations in your study must fall within one category.
Hope you are doing well. If a researcher has constructed a new test and would like to investigate to what extent the new test is able to predict the subjects’ performance on an already established test, which test should be taken as a predictor and which one as the outcome measure in the regression analysis?
My intuition is that if the results of the new test can predict subjects’ scores on the old test, we have to consider the new test as the predictor as we are interested in finding out to what extent it can predict the unique variance of the old test.
Thanks in advance and
Hi sir, I have a hypothesis: the amount customers have spent at a store in the last 12 months predicts the likelihood that they recommend the brand to others. Which type of regression would this be, and what are the measurement scales for each IV and DV? Thanks!
Hi Jim, very interesting read. I was wondering, I’ve read a little on Cox for prediction modelling (though not much I’ve found compared to logistic regression models). In prediction time is always important I suppose. Is there any benefit to using Cox over LR? I am looking at risk of developing a condition within 3 years based on certain subject characteristics. Many thanks for your help.
Hi Jim, an excellent and helpful read thanks. I was hoping you could help me confirm how I would apply the logistic regression equation to generate a risk score for participants to calculate a ROC curve? Thanks!
Nice explanation. It helped in my project.
Hi professor,
I have followed your posts, and they are really valuable and appreciated. However, I have a question: if I have a dependent variable and 4 or 5 independent variables, what is the best method to develop a correct statistical equation that relates all of them?
Thanks
Hello Sir. How can we predict final exam results from class assignment marks?
Hi Donald,
You’re in the right post for the answers you seek! I spell out the process here. If you have more specific questions, please post them after reading thoroughly.
Hello professor,
Your posts helped me a lot in reshaping my knowledge in regression models.
I want to ask how we can use time as a predictor alongside other predictors to perform prediction.
What I can’t understand is that when I plot time against my dependent variable, I find no correlation.
So how can I design my study using time?
I hope that I made myself clear.
Thank you again.
Hi Ines,
Using regression to analyze time series data is possible. However, it raises a number of other considerations. It’s far too complex to go into in the comments section. However, you should first determine whether time is related to your dependent variable. Instead of a correlation, try graphing it using a time series plot. You can then see if there’s any sort of relationship between time and your DV. Cyclical patterns might not show up as a correlation but would be visible on a time series plot. There’s a bunch of time series analysis methods that you can incorporate into regression analysis. At some point, I might write posts about that. However, it involves many details that I can’t quickly summarize. But, you can factor in the effect of time along with other factors that relate to your DV.
I wish I could be more helpful. And perhaps down the road I’ll have something just perfect for you. But alas I don’t right now. I’m sure you could do a search and find more information though.
Hello Professor Jim, I am a profound admirer of your work, and your posts have helped me very much.
When I read this post, I thought you were also going to talk about forecasts. But you were talking about regular regression predictions.
So I would like to ask you something important to the scientific investigation I am working on.
Suppose that, besides predicting the impact of an IV on a DV, I decide to use the model that I will build to forecast future values of my dependent variable. Do you think it would add a considerable amount of work in terms of modelling and code building for the calculations?
Thank you very much.
Hi Daiane,
I’m so happy to hear that my posts have been helpful! 🙂
Forecasting in regression uses the same methodology as predictions. You’re still using the IVs to predict the DV. The difference, of course, is that you’re using past values of the IVs to predict future values of the DV. If you’re familiar with fitting regression models, fitting a forecasting model isn’t necessarily going to be more work than a regular regression model. You’ll still need to go through the process of determining which variables to include in your model. Given the forecast nature, you’ll need to think about the variables, the timing of the variables, and how they influence the DV. In addition to the more typical IVs, you’ll need to consider things such as seasonal patterns and other trends over time. Given that the model incorporates time, you will need to pay more attention to the potential problem of autocorrelation in the residuals, which I describe in my post about least squares assumptions. So, there are definitely some different considerations for a forecast model, but, again I wouldn’t say that it’s necessarily harder than a non-forecast model. As usual, it comes down to research, getting the right data, including the correct variables, and checking the assumptions.
I hope this helps!
Hi Jim,
I’m starting out in Predictive Analytics and found your article very useful and informative.
I’m currently working on a use case where the quality of a product is directly affected by a temperature parameter (which was found by root cause analysis). So our objective is to maintain the temperature at the nominal value and provide predictions on when the temperature may vary. But unfortunately quality data is not available. Hence we need to work with the temperature and additional process parameters data available to us.
My queries are as follows:
Can I predict the temperature variance and assume that the quality of the product will be in sync to a certain extent ?
Is regression analysis the best methodology for my use case ?
Are there any open source tools available for doing this predictive analytics ?
Hello dear,
Thank you for all your interesting posts.
I’m a beginner in regression and I would like to use a logistic model to predict surrenders in life insurance.
I would like to understand the predicted probabilities well.
In my model I use the age (in months) of the contract in the portfolio, the gender of the policy holder, …
When making a prediction, for age 57 and gender M for example, what does the predicted probability mean?
Does it mean that it’s the probability of the contract being surrendered at age 57 given the gender of the policy holder?
Hi N’Dah,
Yes, the prediction is the probability that a 57 year old male will surrender the policy. That assumes the model provides a good fit and satisfies the necessary assumptions. I write more about binary logistic regression in a post that uses it to analyze a political group in the U.S. In that post, I also talk about interpreting the output, which might be helpful.
I hope that helps!
Why is the standard error of estimate or prediction higher when the predictive quality of variables is lower?
Good Read. Easy to understand keep it up.
I really appreciate your support in regression analysis. I have data on the milk yield of buffaloes. Different buffaloes yield milk over different numbers of days. In order to rank buffaloes, I need to adjust milk yield to a standard milk period of 305 days. Some buffaloes have a lactation length longer than 305 days, others shorter than 305 days. How can I develop factors for the correction/prediction of the milk yield of all buffaloes to one standard?
Hi Musarrat,
The process of identifying the correct variables to include in your model is a mix between subject area knowledge and statistics. To develop an initial list of potential factors, you’ll need to research the subject area and use your expertise to identify candidates. I don’t know the dairy industry so, unfortunately, I can’t help you there.
I suggest you read my post about choosing the correct regression model for some tips. Additionally, consider buying my ebook on regression analysis which can help you with the process.
I hope this helps!
Unlike the standard error of the regression (https://statisticsbyjim.com/regression/standard-error-regression-vs-r-squared/), the assessment by calculating prediction intervals in this article doesn’t seem to be comprehensive. With the SE of the regression, it is clear from the rule of thumb that a certain number of points must fall within the bounds based on the confidence level (95%, 99%). This of course depends on how precise we want to be.
In the case of prediction intervals, the usage of subject matter expertise was mentioned and the calculations were based on every point (where the conditions of independent variables are given). Now, I wonder how to quantify and assess the precision of model based on a one-off calculation?
Considering such a scenario, is the SE of the regression typically used unless one has a lot of subject expertise and a way to calculate PIs for all the data points and subsequently assess the precision of the predictions?
Thanks Jim!
I am one week before my thesis submission and wish I had found your site much earlier. Your explanations are so clear and concise. You are a great teacher Jim!
Thank you so much, Keryn! I strive to make these explanations as straightforward as possible, so I really appreciate your kind words!
Using the body mass index data set as an example, suppose these results were gained from several different groups. For example, one group worked out regularly, one group didn’t work out but maintained a healthy diet, one group didn’t work out and maintained a poor diet, etc. Can we use the group average differences between estimated results (based on the regression equation) and the actual results to determine if one group was significantly different from the others in terms of that group being consistently above or below the regression line?
Hello, how can I predict the dependent variable for a new case in SPSS?
Nice article. Very clear and easy to understand. Bravo.
Thank you, James!
Oh my! I came across you during the final week of my stat class. You just enlightened me in this regression area. I wish I had come upon you during my first week of class. It is easier to grasp stats when it is explained plainly, along with how it relates to whatever you will be doing in life. Safe to say I passed my stat class (barely), basically by following the steps without understanding why I was doing them that way.
Hi Ginalyn, thanks for taking the time to write such a nice comment! It made my day! I’m glad you found my blog to be helpful. I always try to explain statistics in the most straightforward, simple manner possible. I’m glad you passed!
Thanks for the deep insight. Your idea brings me back to trying to make predictions about everyday phenomena that are as close to reality as possible. Because this universe is as chaotic as it is orderly, some predictions become so error-prone that they are too uncertain for decision making. When validating a model, that uncertainty can be clarified by using a set of conditions for prediction and suitable intervals (limits).
like it