Excel can perform various statistical analyses, including regression analysis. It is a great option because nearly everyone can access Excel. This post is an excellent introduction to performing and interpreting regression analysis, even if Excel isn’t your primary statistical software package.
In this post, I provide step-by-step instructions for using Excel to perform multiple regression analysis. Importantly, I also show you how to specify the model, choose the right options, assess the model, check the assumptions, and interpret the results.
I include links to additional resources I’ve written, which present clear explanations of relevant regression analysis concepts that you won’t find in Excel’s documentation. And, I use an example dataset for us to work through and interpret together!
Before proceeding, ensure that Excel’s Data Analysis ToolPak is installed. On the Data tab, look for Data Analysis, as shown below.
If you don’t see Data Analysis, install that ToolPak. Learn how to install it in my post about using Excel to perform t-tests. It’s free!
Multiple Regression Analysis in Excel
Regression analysis describes the relationships between a set of independent variables and the dependent variable. It produces an equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions. Excel performs ordinary least squares regression.
For more information, read my post about when to use regression analysis.
To perform regression analysis in Excel, arrange your data so that each variable is in a column, as shown below. The independent variables must be next to each other.
For our regression example, we’ll use a model to determine whether pressure and fuel flow are related to the temperature of a manufacturing process. These two variables predict the heat that the process generates. The variables in this analysis are the following:
- Temperature (C): Dependent variable
- Pressure: Independent variable
- Fuel Rate: Independent variable
Related post: What are Independent and Dependent Variables?
Download the Excel file that contains the data for this example: MultipleRegression.
In Excel, click Data Analysis on the Data tab, as shown above. In the Data Analysis popup, choose Regression, and then follow the steps below.
Specifying the correct model is an iterative process where you fit a model, check the results, and possibly modify it. For more details about this process, read my post about Specifying the Correct Regression Model.
Step-by-Step Instructions for Filling In Excel’s Regression Box
- Under Input Y Range, select the range for your dependent variable. The dependent variable is a variable that you want to explain or predict using the model. The values of this variable depend on other variables. It’s also known as the response variable, outcome variable, and it is commonly denoted using a Y. Traditionally, analysts graph dependent variables on the vertical Y-axis.
- Under Input X Range, select the range for your independent variable(s). In Excel, these variables must be next to each other so you can choose them all in one range. Independent variables are the variables you include in the model to explain or predict changes in the dependent variable. In randomized controlled experiments, researchers systematically set and change the values of the independent variables. However, in observational studies, the values of the independent variables are not set by researchers but observed instead. These variables are also known as predictor variables, input variables, and are commonly denoted using Xs. On graphs, analysts place independent variables on the horizontal X-axis.
- Check the Labels checkbox if you have meaningful variable names in row 1. This option makes the output easier to interpret.
- Check the Constant is Zero if you want to force the regression line through the origin. Do not check this box unless you’re absolutely sure you know what you’re doing! For more information, read my post about the regression constant.
- Check Confidence Level if you want to display confidence intervals for the coefficient estimates.
- Check Residual Plots to display the values of the residuals and graph them.
- Click OK.
For this example, your popup should look like the following:
Interpreting Excel’s Regression Analysis Results
After Excel creates the statistical output, I autofit some of the columns for clarity.
We’ll work our way down from the top of Excel’s regression analysis output. If you want to learn more about the statistics, be sure to click the links for more detailed information!
Regression Statistics Table
The Regression Statistics table provides statistical measures of how well the model fits the data.
Multiple R is not a standard measure for regression and it is difficult to interpret. So, we’ll skip it and go to the two R-squared values.
The R-squared value of ~0.858 indicates that our model accounts for about 85.8% of the dependent variable’s variance. Usually, higher R-squared values are better. However, there are important caveats about that!
The adjusted R-squared value helps us compare regression models with differing numbers of independent variables. For example, if you compare a model with one independent variable to a model with two, you often favor the model with the higher adjusted R-squared.
The standard error of the regression indicates the typical size of the residuals. This statistic shows how wrong the regression model is on average. You want lower values because it signifies that the distances between the data points and the fitted values are smaller. Conveniently, this value uses the measurement units of the dependent variable.
From the output, we know that the standard distance between the predicted and observed values is 8.93 degrees Celsius.
For more information, read my posts about:
In Excel’s ANOVA table, the most important statistic is Significance F. This is the p-value for the F-test of overall significance. This test determines whether your model with all of its independent variables does a better job explaining the dependent variable’s variability than a model with no independent variables. If this test result is statistically significant, it suggests you have a good model.
Our p-value for the overall F-test is 8.93783E-12. It’s written in scientific notation because it is a tiny value. The E-12 indicates that we need to move the decimal point 12 places to the left. This value is smaller than any reasonable significance level. Consequently, we can conclude that our regression model as a whole is statistically significant. Learn about How to Use Scientific Notation.
For more information, read my post about the overall F-test of significance.
The coefficients table displays the parameter estimates for the independent variables in our model, along with the intercept value (constant). I won’t interpret the intercept because it is usually meaningless. For more information about it, read my post about the Y-intercept (constant).
We included two independent variables in our model: Pressure and Fuel Rate.
The coefficient for Pressure is approximately 4.79. The positive sign indicates that as pressure increases, temperature also tends to increase. There is a positive association between these two variables. For every one-unit increase in pressure, temperature increases by an average of 4.79 degrees.
The coefficient for Fuel Rate is -24.21. The negative sign indicates that as the fuel rate increases, temperature tends to decrease. There is a negative association between these two variables. For every one-unit increase in fuel rate, temperature decreases by an average of 24.21 degrees.
The p-values for the coefficients indicate whether the dependent variable is statistically significant. When the p-value is less than your significance level, you can reject the null hypothesis that the coefficient equals zero. Zero indicates no relationship.
For our two variables, Excel again displays the p-values using scientific notation because they’re both miniscule. Pressure and Fuel Rate are both statistically significant!
The confidence interval for a coefficient indicates the range of values that the actual population parameter is likely to fall. Keep in mind that the coefficient values in the output are sample estimates and are unlikely to equal the population value exactly.
For example, the confidence interval for Pressure is [2.84, 6.75]. We can be 95% confident that the actual population parameter for Pressure falls within this range.
To learn how least squares regression calculates the coefficients and y-intercept with a worked example, read my post Least Squares Regression: Definition, Formulas & Example.
That covers the numeric output. Now we’ll get to the residual plots!
Excel’s Residual Plots for Regression Analysis
It’s crucial to examine the residual plots. If the residual plots don’t look good, you can’t trust any of the previous numerical results! While I covered the numeric output first, you shouldn’t get too invested in them before checking the residual plots. These plots tell you whether the model fits the data, and they even provide ideas for improving your model.
In general, you want your residuals to be randomly scattered around zero. If you see patterns, then you have a problem and need to change your model. For more information about residual plots, why they’re essential, and how they can help you improve your model, read my post Check Your Residual Plots to Ensure Trustworthy Regression Results!
When you check the Residual Plots checkbox, Excel includes both a table of residuals and a residual plot for each independent variable in your model.
On these graphs, the X-axis (horizontal) displays the value of an independent variable. Excel has a strange tendency of extending the X-axis to zero on these charts even when the independent variable’s values aren’t near zero. That looks weird. So, I’ve changed the scaling.
The Y-axis displays the residuals.
In our two residual plots below, the residuals are randomly scattered around zero without any visible patterns. No problems are apparent.
Those graphs are great. However, I usually recommend looking at the residuals versus fitted values plot. If there is a problem there, you can look at the variable specific graphs for clues about where the problem lies more specifically. However, Excel doesn’t create the residuals versus fits plot, just the variable specific graphs shown above.
Using Excel’s Residual Output table (shown below), you can create your own plot. Just select those two columns and create a scatterplot. The one I made is below.
What Excel’s Regression Analysis Does Not Include
It’s great that you can use Excel to perform multiple regression. However, it doesn’t include several vital features. Fortunately, there are workarounds for most of them.
Least squares regression can fit curvature using polynomials. Unlike most statistical software, you’ll have to create additional columns with the squared or cubed terms. That is easy to do using Excel’s formulas.
Similarly, in many statistical software packages, you tell it that you want to include interaction terms and it does the rest. In Excel, you’ll need to create columns for the interaction terms in your worksheet. An interaction term is just the product of two or more variables. For the interaction of A*B, simply create a column with a formula that contains the product of those two variables!
Interaction plots are a handy way to interpret interaction terms. While Excel doesn’t create those, you can make them in Excel. I might write a post on how to do that if there is sufficient interest.
Multicollinearity is a problem that occurs when the independent variables are highly correlated. VIFs are a crucial statistic for determining if your model has problematic levels of multicollinearity, but Excel doesn’t include this statistic in its output. However, you can have Excel calculate VIFs for you with a bit of effort.
If you’re learning regression and like the approach I use in my blog, check out my Intuitive Guide to Regression Analysis book! You can find it on Amazon and other retailers.