Excel can perform various statistical analyses, including regression analysis. It is a great option because nearly everyone can access Excel. This post is an excellent introduction to performing and interpreting regression analysis, even if Excel isn’t your primary statistical software package.
In this post, I provide step-by-step instructions for using Excel to perform multiple regression analysis. Importantly, I also show you how to specify the model, choose the right options, assess the model, check the assumptions, and interpret the results.
I include links to additional resources I’ve written, which present clear explanations of relevant regression analysis concepts that you won’t find in Excel’s documentation. And, I use an example dataset for us to work through and interpret together!
Before proceeding, ensure that Excel’s Data Analysis ToolPak is installed. On the Data tab, look for Data Analysis, as shown below.
If you don’t see Data Analysis, you’ll need to install the ToolPak. Learn how to install it in my post about using Excel to perform t-tests. It’s free!
Multiple Regression Analysis in Excel
Regression analysis describes the relationships between a set of independent variables and the dependent variable. It produces an equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions. Excel performs ordinary least squares regression.
For more information, read my post about when to use regression analysis.
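Under the hood, ordinary least squares finds the coefficients that minimize the sum of squared residuals. As a minimal sketch of what Excel’s Regression tool computes (the data and variable names below are made up, not the example dataset we use later):

```python
import numpy as np

# Hypothetical data: two predictor columns (x1, x2) and one response (y).
X = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.0],
              [4.0, 2.5],
              [5.0, 4.0]])
y = np.array([3.1, 4.0, 6.2, 7.1, 9.0])

# Prepend a column of ones so the model includes an intercept,
# just as Excel does unless you check "Constant is Zero".
X1 = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: solve for the coefficients that minimize
# the sum of squared residuals.
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
intercept, b1, b2 = coef
predictions = X1 @ coef
```

Excel reports these same coefficients in its output, along with the standard errors, t-statistics, and p-values that quantify their uncertainty.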
To perform regression analysis in Excel, arrange your data so that each variable is in a column, as shown below. The independent variables must be next to each other.
For our regression example, we’ll use a model to determine whether pressure and fuel flow are related to the temperature of a manufacturing process. These two variables predict the heat that the process generates. The variables in this analysis are the following:
- Temperature (C): Dependent variable
- Pressure: Independent variable
- Fuel Rate: Independent variable
Related post: What are Independent and Dependent Variables?
Download the Excel file that contains the data for this example: MultipleRegression.
In Excel, click Data Analysis on the Data tab, as shown above. In the Data Analysis popup, choose Regression, and then follow the steps below.
Specifying the correct model is an iterative process where you fit a model, check the results, and possibly modify it. For more details about this process, read my post about Specifying the Correct Regression Model.
Step-by-Step Instructions for Filling In Excel’s Regression Box
- Under Input Y Range, select the range for your dependent variable. The dependent variable is a variable that you want to explain or predict using the model. The values of this variable depend on other variables. It’s also known as the response or outcome variable, and it is commonly denoted with a Y. Traditionally, analysts graph dependent variables on the vertical Y-axis.
- Under Input X Range, select the range for your independent variable(s). In Excel, these variables must be next to each other so you can choose them all in one range. Independent variables are the variables you include in the model to explain or predict changes in the dependent variable. In randomized controlled experiments, researchers systematically set and change the values of the independent variables. However, in observational studies, the values of the independent variables are not set by researchers but observed instead. These variables are also known as predictor variables, input variables, and are commonly denoted using Xs. On graphs, analysts place independent variables on the horizontal X-axis.
- Check the Labels checkbox if you have meaningful variable names in row 1. This option makes the output easier to interpret.
- Check Constant is Zero if you want to force the regression line through the origin. Do not check this box unless you’re absolutely sure you know what you’re doing! For more information, read my post about the regression constant.
- Check Confidence Level if you want to display confidence intervals for the coefficient estimates.
- Check Residual Plots to display the values of the residuals and graph them.
- Click OK.
For this example, your popup should look like the following:
Interpreting Excel’s Regression Analysis Results
After Excel creates the statistical output, I autofit some of the columns for clarity.
We’ll work our way down from the top of Excel’s regression analysis output. If you want to learn more about the statistics, be sure to click the links for more detailed information!
Regression Statistics Table
The Regression Statistics table provides statistical measures of how well the model fits the data.
Multiple R is not a standard measure for regression and it is difficult to interpret. So, we’ll skip it and go to the two R-squared values.
The R-squared value of ~0.858 indicates that our model accounts for about 85.8% of the dependent variable’s variance. Usually, higher R-squared values are better. However, there are important caveats about that!
The adjusted R-squared value helps us compare regression models with differing numbers of independent variables. For example, if you compare a model with one independent variable to a model with two, you often favor the model with the higher adjusted R-squared.
The standard error of the regression indicates the typical size of the residuals. This statistic shows how wrong the regression model is on average. You want lower values because it signifies that the distances between the data points and the fitted values are smaller. Conveniently, this value uses the measurement units of the dependent variable.
From the output, we know that the standard distance between the predicted and observed values is 8.93 degrees Celsius.
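This number comes straight from the residuals: S is the square root of the sum of squared residuals divided by the error degrees of freedom, n − k − 1, where k is the number of independent variables. A minimal sketch with hypothetical residuals:

```python
import math

# Hypothetical residuals from a fitted model.
residuals = [1.2, -0.8, 0.5, -1.1, 0.9, -0.4, 0.3, -0.6]
n = len(residuals)   # number of observations
k = 2                # number of independent variables

# S = sqrt(SSE / (n - k - 1)): the typical size of a residual,
# in the units of the dependent variable.
sse = sum(r ** 2 for r in residuals)
s = math.sqrt(sse / (n - k - 1))
```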
For more information, read my posts about:
ANOVA Table
In Excel’s ANOVA table, the most important statistic is Significance F. This is the p-value for the F-test of overall significance. This test determines whether your model with all of its independent variables does a better job explaining the dependent variable’s variability than a model with no independent variables. If this test result is statistically significant, it suggests you have a good model.
Our p-value for the overall F-test is 8.93783E-12. It’s written in scientific notation because it is a tiny value. The E-12 indicates that we need to move the decimal point 12 places to the left. This value is smaller than any reasonable significance level. Consequently, we can conclude that our regression model as a whole is statistically significant. Learn about How to Use Scientific Notation.
For more information, read my post about the overall F-test of significance.
Coefficients Table
The coefficients table displays the parameter estimates for the independent variables in our model, along with the intercept value (constant). I won’t interpret the intercept because it is usually meaningless. For more information about it, read my post about the Y-intercept (constant).
We included two independent variables in our model: Pressure and Fuel Rate.
The coefficient for Pressure is approximately 4.79. The positive sign indicates that as pressure increases, temperature also tends to increase. There is a positive association between these two variables. For every one-unit increase in pressure, temperature increases by an average of 4.79 degrees.
The coefficient for Fuel Rate is -24.21. The negative sign indicates that as the fuel rate increases, temperature tends to decrease. There is a negative association between these two variables. For every one-unit increase in fuel rate, temperature decreases by an average of 24.21 degrees.
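Putting the two coefficients together, the fitted equation predicts temperature from pressure and fuel rate. In the sketch below, b0 is a placeholder intercept for illustration only; the real value comes from Excel’s coefficients table:

```python
# The fitted equation has the form:
#   Temperature = b0 + 4.79 * Pressure - 24.21 * FuelRate
# b0 is a hypothetical intercept; read the real value from Excel's output.
b0 = 300.0

def predict_temperature(pressure, fuel_rate):
    return b0 + 4.79 * pressure - 24.21 * fuel_rate

# Increasing pressure by one unit changes the prediction by its coefficient:
delta = predict_temperature(11, 5) - predict_temperature(10, 5)
```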
The p-values for the coefficients indicate whether each independent variable is statistically significant. When the p-value is less than your significance level, you can reject the null hypothesis that the coefficient equals zero. Zero indicates no relationship.
For our two variables, Excel again displays the p-values using scientific notation because they’re both minuscule. Pressure and Fuel Rate are both statistically significant!
The confidence interval for a coefficient indicates the range of values within which the actual population parameter is likely to fall. Keep in mind that the coefficient values in the output are sample estimates and are unlikely to equal the population value exactly.
For example, the confidence interval for Pressure is [2.84, 6.75]. We can be 95% confident that the actual population parameter for Pressure falls within this range.
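The arithmetic behind such an interval is simply the estimate plus or minus a t critical value times the coefficient’s standard error. The standard error and critical value below are hypothetical, chosen only to roughly reproduce the interval reported above:

```python
# CI for a coefficient: estimate ± t_crit * SE.
estimate = 4.79   # coefficient for Pressure (from the output)
se = 0.93         # assumed standard error (hypothetical)
t_crit = 2.10     # assumed 97.5th percentile of t for the error df

lower = estimate - t_crit * se
upper = estimate + t_crit * se
```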
To learn how least squares regression calculates the coefficients and y-intercept with a worked example, read my post Least Squares Regression: Definition, Formulas & Example.
For more detailed information about interpreting regression results, read my posts about Regression Coefficients and P-values and Linear Regression Equations Explained.
That covers the numeric output. Now we’ll get to the residual plots!
Excel’s Residual Plots for Regression Analysis
It’s crucial to examine the residual plots. If the residual plots don’t look good, you can’t trust any of the previous numerical results! While I covered the numeric output first, you shouldn’t get too invested in them before checking the residual plots. These plots tell you whether the model fits the data, and they even provide ideas for improving your model.
In general, you want your residuals to be randomly scattered around zero. If you see patterns, then you have a problem and need to change your model. For more information about residual plots, why they’re essential, and how they can help you improve your model, read my post Check Your Residual Plots to Ensure Trustworthy Regression Results!
When you check the Residual Plots checkbox, Excel includes both a table of residuals and a residual plot for each independent variable in your model.
On these graphs, the X-axis (horizontal) displays the value of an independent variable. Excel has a strange tendency to extend the X-axis to zero on these charts even when the independent variable’s values aren’t near zero. That looks weird. So, I’ve changed the scaling.
The Y-axis displays the residuals.
In our two residual plots below, the residuals are randomly scattered around zero without any visible patterns. No problems are apparent.
Those graphs are great. However, I usually recommend looking at the residuals versus fitted values plot. If there is a problem there, you can look at the variable-specific graphs for clues about where the problem lies more specifically. However, Excel doesn’t create the residuals versus fits plot, just the variable-specific graphs shown above.
Using Excel’s Residual Output table (shown below), you can create your own plot. Just select those two columns and create a scatterplot. The one I made is below.
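If you’d rather compute the plot data yourself, note that the Residual Output table is just observed minus predicted for each row. A sketch with hypothetical values standing in for that table:

```python
# Hypothetical values standing in for Excel's Residual Output table.
predicted = [310.2, 295.7, 301.4, 288.9, 305.5]   # fitted values
observed  = [312.0, 293.5, 303.0, 290.1, 303.8]   # actual temperatures

# Each residual is observed minus predicted, which is what Excel reports.
residuals = [o - p for o, p in zip(observed, predicted)]

# The residuals-versus-fits plot is a scatterplot of these pairs.
points = list(zip(predicted, residuals))
```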
What Excel’s Regression Analysis Does Not Include
It’s great that you can use Excel to perform multiple regression. However, it doesn’t include several vital features. Fortunately, there are workarounds for most of them.
Least squares regression can fit curvature using polynomials. Unlike most statistical software, Excel requires you to create additional columns containing the squared or cubed terms yourself. That is easy to do using Excel’s formulas.
Similarly, in many statistical software packages, you tell it that you want to include interaction terms and it does the rest. In Excel, you’ll need to create columns for the interaction terms in your worksheet. An interaction term is just the product of two or more variables. For the interaction of A*B, simply create a column with a formula that contains the product of those two variables!
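Both workarounds boil down to adding derived columns to the worksheet. In Excel you’d fill down formulas like =A2^2 and =A2*B2; the equivalent transformation looks like this in Python (the column names are made up):

```python
# Worksheet columns as Python lists. In Excel, each derived column is a
# formula such as =A2^2 or =A2*B2 filled down the rows.
pressure  = [10.0, 12.5, 15.0, 17.5]
fuel_rate = [5.0, 6.0, 7.0, 8.0]

# Squared term for fitting curvature:
pressure_sq = [p ** 2 for p in pressure]

# Interaction term: the row-wise product of the two predictors:
pressure_x_fuel = [p * f for p, f in zip(pressure, fuel_rate)]
```

Include the new columns in the Input X Range along with the original variables when you run the Regression tool.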
Interaction plots are a handy way to interpret interaction terms. While Excel doesn’t create those, you can make them in Excel. I might write a post on how to do that if there is sufficient interest.
Multicollinearity is a problem that occurs when the independent variables are highly correlated. VIFs are a crucial statistic for determining if your model has problematic levels of multicollinearity, but Excel doesn’t include this statistic in its output. However, you can have Excel calculate VIFs for you with a bit of effort.
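That “bit of effort” is an auxiliary regression: to get the VIF for a predictor, regress it on the other predictors and compute 1 / (1 − R²). A minimal sketch with hypothetical data (in Excel, you’d run the Regression tool once per predictor and apply the same formula to each R²):

```python
import numpy as np

# Hypothetical predictor matrix: two correlated columns.
X = np.array([[10.0, 5.0],
              [12.0, 5.5],
              [15.0, 6.5],
              [18.0, 7.0],
              [20.0, 8.0],
              [23.0, 9.5]])

def vif(X, j):
    """VIF for column j: regress it on the other columns, then 1 / (1 - R^2)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])   # include an intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r_squared = 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r_squared)

vifs = [vif(X, j) for j in range(X.shape[1])]
```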
Related posts: Curve Fitting in Regression, Understanding Interaction Effects and Plots, and Multicollinearity: Problems, Detection, and Solutions.
If you’re learning regression and like the approach I use in my blog, check out my Intuitive Guide to Regression Analysis book! You can find it on Amazon and other retailers.
Gustavo Muriel says
Hi Jim
I have a dependent variable whose units are cubic feet (volume), and two independent variables whose units are square feet (area). According to my correlation analysis, the independent variables show a strong linear correlation with the dependent variable. However, when doing the regression analysis, the Significance F is 0, and the P-value for both independent variables is 0 as well. Is this the right regression analysis for this type of data?
Thanks,
Gustavo Muriel
Jim Frost says
Hi Gustavo,
A regression model can handle IVs with different units. So, that’s not inherently a problem. But, your model will have high multicollinearity given the mathematical relationship between the variables. We call this structural multicollinearity. Click the link to read about that problem.
I’ll assume there’s some reason why you’re estimating the volume rather than using the known areas. But if you can just calculate the volume directly, that would be better.
You could center your data to reduce the multicollinearity. But you probably don’t need to do that. Multicollinearity reduces your statistical power by increasing the p-values. But you know the relationship exists presumably.
What is the Significance F? Is that the p-value for the overall test of significance? Or is that the F-value? It would make sense if that’s the overall p-value, telling you that your model is significant. And the p-values for each IV are telling you that they are also significant. That’s not surprising because you know there must be a relationship between surface area and volume (assuming it’s surface area on the object on which the volume is based).
Assuming the Significance F is a p-value, all the results make sense. Your surface area predicts (unsurprisingly) the volume.
Olivia Greathead says
I ran the regression analysis for 3 independent variables in 8 different conditions, all using the exact same dependent variable, and all 8 times Excel did not give me a p-value. Also, does an F-statistic of 1 mean the model is insignificant?
Jim Frost says
Hi Olivia,
You didn’t get any p-values at all? Outside of doing something wrong in Excel, the only thing I can think of is that you included too many variables relative to the sample size. How many observations did you have in your dataset?
I’m not sure what you mean by having three independent variables in 8 conditions. Are they categorical variables each with 8 values? Why are you running the model 8 times?
I guess I don’t totally understand your model. If you can better explain what you’re doing that would be helpful.
F = 1 is not significant.
Dylan Wilson says
I ran a regression analysis in Excel with two independent variables: one with a p-value of 0.08, and another with a p-value <0.0001. The overall F-test of significance was also very low (1.1E-06). Should I not include the non-significant variable in the model despite the overall significance?
Jim Frost says
Hi Dylan,
That’s a great question. And the answer isn’t totally straightforward. Often, analysts will remove nonsignificant IVs from a regression model. However, if theory or subject-area knowledge suggests that the variable should be included, then it’s ok to leave it in. Just document the reasons for that.
Remember, it’s much better to leave in an unnecessary variable than to remove one that is important. Removing an important variable can bias your results whereas leaving in an extra variable won’t bias your results.
You can also try removing the nonsignificant variable to see if the remaining variable changes much. If it doesn’t, then no bias is evident. If it does, that adds complications because then you have to decide which results are better. Again, you’d discuss and explain your rationale for whichever approach you use.
Read my post about Choosing the Best Regression Model. I give some tips to consider. Pay extra attention to the end part where I talk about theory vs. statistical significance.
Erica says
Aside from the excel regression tool, what else can I use to compare the significance of predictor variables?
Dalia Heggo says
Thanks a lot for your interesting way of explanation. I like your articles and would be grateful to see, if possible, a similar article on how to perform a regression analysis/strategy on MATLAB.
Ali Ahmad says
Much obliged. Thanks
Ali Ahmad says
It was great reading this post.
Is Fuel Rate held constant when Pressure is varied for calculating the residuals in the Pressure Residual Plot? If this is the case, then at what value is Fuel Rate held constant? If this is not the case, then how are the residuals in the Pressure Residual Plot calculated?
Jim Frost says
Hi Ali,
It’s just how the model fits the data, and it simplifies how you interpret the results. If you have a coefficient of 2 for an IV, you know that if you increase the value of that IV by one unit and hold all the other variables at a constant value, the DV increases by an average of 2. It isolates the effect of each variable so you can more easily understand the role of each IV without worrying about what the other IVs are doing. You don’t consider that in relation to the residuals but in how you interpret the regression coefficients. Click the link for more details.
Paul says
I would like to perform linear regression (OLS) using a dataset of continuous variables. Instead of individual points, only sample means and variances for both independent (predictors, X) and dependent (responses, Y) variables are known, like in this illustrative dataset:
X_bar(1) = 10, s^2(X1) = 1; Y_bar(1) = 11, s^2(Y1) = 2
X_bar(2) = 21, s^2(X2) = 2; Y_bar(2) = 20, s^2(Y2) = 1
X_bar(3) = 29, s^2(X3) = 4; Y_bar(3) = 31, s^2(Y3) = 3
X_bar(4) = 42, s^2(X4) = 3; Y_bar(4) = 39, s^2(Y4) = 4
Is there any proper way to conduct the OLS and compute all the statistics of the model?
Tony says
Hi Jim,
Apologies, I didn’t realize that I posted my question in the wrong section. I hope I have posted it in the correct section now.
I ran a regression analysis in Excel. The intercept has a p-value of 0.0662 which indicates that the null hypothesis can be accepted. However, the x coefficient has a p-value of 0.000969 which is far below the alpha of 0.05. That being the case, the null hypothesis would be rejected for the x coefficient. The significance F is also 0.000969. In the case where one p-value indicates that the null hypothesis should be rejected and another p-value indicates that it should be accepted, what is the appropriate conclusion to draw?
Thanks,
Tony
Jim Frost says
Hi Tony,
No worries! I just like to keep the conversations organized in the right areas. It’s easier that way for people reading through comments to learn about a particular topic.
In regression, there are multiple sets of null and alternative hypotheses. It’s possible to reject the null for some but not the others.
There are different sets of hypotheses for the constant, coefficients, and the overall model. We can work through these for your model.
For your constant, the p-value indicates that you fail to reject the null. That indicates you don’t have sufficient evidence to conclude that the constant doesn’t equal zero. However, interpreting the constant (and its p-value) is usually meaningless. Except for very rare cases, you don’t need to worry about this one. For more information, read my post about the constant in regression analysis.
On the other hand, your coefficient is statistically significant. You can reject the null and conclude that the coefficient does not equal zero in the population. This result is the most important. It indicates the relationship between the independent variable and dependent variable likely exists in the population and not just your sample. For more information, read my post about interpreting regression coefficients and their p-values.
Finally, you have the overall F-test of significance. This relates to the entire model. When you have just one IV, this test result equals the result for the IV’s coefficient, which is the case for your model. For more information on this measure, read my post about the F-test of overall significance.
Your model is significant and your lone IV is significant. Be sure to check the residual plots to identify potential problems!
Alexander says
Hi Jim, how would you comment on two models with multiple variables (but slightly different sets of them) that both provide very good F significance (it equals 0 in both models), but the F statistic is bigger for one set of variables than for the other? Which one should I choose? R-squared is nearly the same (0.85).
Thanks!
Renatus Daniel Mbamilo says
Hi Jim, I trust you are doing very well.
I wish you could write on Estimation of Parameter Values as part of Inferential Statistics, besides the Hypothesis Testing approach. My strong interest is in learning intuitively about the Method of Moments Estimation, Maximum Likelihood Estimation, Bayesian Estimation, Ordinary Least Squares (OLS), the Rank Method for Censored Data, and the Probability Plotting Method. Would you also mind expounding on interval estimation techniques? If you can’t cover each method now, then I kindly ask you to give insights on each in 3 to 5 sentences.
Do you have a book on this topic of Inferential statistics? Kindly let me know so I can buy it.
Remain ever blessed
Dr Virendra Mishra says
Sir, have you conducted any statistical study on the presently prevailing Covid-19 pandemic, considering the parameters that are responsible for its spread?
Joe says
I forgot to mention that I have two of your ebooks, and they are indispensable.
Real quick . . . when I use the regression tool, I occasionally get a #NUM error in the p-value column (and 65535.00 in the t Stat column). Can I assume that’s an indication some of my categorical variables are collinear/cross-correlated?
Cheers,
Joe
Toms River, NJ, USA
Jim Frost says
Hi Joe,
I don’t know the inner workings of Excel on this, but I have a guess. A t-value of ~65,000 is absolutely massive. It’s off the charts. A t-value of 3 already gets you far out into the tails of the t-distribution. So, you can imagine how far out in the tail 65k is! That will return a minuscule but non-zero value for the p-value. I’m thinking that’s where Excel’s problem lies. The p-value must be too low for Excel to calculate given however many decimal places it can handle. Yet, it’s not zero.
At any rate, it’s an extremely significant p-value!
DEVAPRIYO Chakravarty says
The e-books that I purchased are very useful – thanks! I have seen the articles specifically on Excel, which are really good since I mostly use Excel! Are there some things that I need to look out for while doing descriptive statistics analysis in Excel? Maybe you could have an overall post on calculating various descriptive statistics in Excel.
Jim Frost says
Hi, thanks so much for supporting my books. I really appreciate that! I’m thrilled to hear that you’re finding them to be very helpful!
That’s a great idea for a blog post! I will write one on that topic in the near future. Thanks for the suggestion!
James says
I did some regression analysis in Excel, and after ticking a few boxes I got a fitted values vs. residuals plot and a fitted values vs. standard residuals plot. Not sure which box this was, as I just ticked all of them without knowing what each of them did!! The normal probability plot came out as a straight line, so not sure if this was useful either.
Jerry Miller says
Hi Jim, This is a great refresher/primer on regression and what all those output tables mean.
I have a question: if your R-squared is very high, I assume this USUALLY means that the residuals are close to the fitted line. But what if they’re close but, say, most of these points are above the line? Or if they form some other pattern? Would closeness to the fitted line (i.e., a small standard error of the regression) override any of these other considerations?
Also, the pattern of points in the residual plot for the fuel rate are evenly scattered above and below zero, but the pattern is somewhat fan-shaped, being farther from the zero line as the fuel rate goes up. Does an uneven (although symmetrical) pattern of residual points along the horizontal axis mean anything?
And finally, you did not define “VIF” in this post. And yes, we’d love to see you write more about interaction plots!
Jim Frost says
Hi Jerry,
A number of my other posts that I link to go into more detail about some of your questions. I wanted to keep this post as an overview and make the more detailed information readily available.
You’re correct. As R-squared increases within a dataset, the points move closer to a line. That sounds like it’s a good thing, but it’s not always. Read my article about R-squared that I link to for details. In it, I show an example that is similar to what you describe. I have a model where the R-squared is high but there are patterns in the residuals. The model doesn’t adequately fit the data and, as I mention in the section about residuals, it means you can’t trust any of the other results. I write in the R-squared post that high R-squared values aren’t always good and low R-squared values aren’t always bad!
For this reason, assessing the residuals plots is crucial! It’s not just how close to the fitted line but any patterns in the data. I link to a post about residual plots that looks at this in more detail. Residual plots can help you make your model better! And, I didn’t link to this in the post itself, but residuals are the primary way to check the OLS assumptions. If you’re not satisfying those assumptions, you can’t trust the other results.
If you include the constant in your model, that forces the mean of the residuals to equal zero. So, you won’t have the case that you describe where most of the residuals are on top. Well, I suppose that can occur if you have an outlier but the average would still be zero! You can also have cases where the residuals are generally equally above and below the line but you’ll have patches where more are above and patches where more are below.
In short, patterns in the residuals indicate an assumption violation, which means you can’t trust the results. Fortunately, the type of pattern will often provide clues about how to fix the problem!
There might be slight heteroscedasticity, as indicated by the fan shape you noticed. Ideally, we’d have more data points to really know. If you remove a point or two, it looks random. The residuals at the very high end of Fuel Rate are not notably more extreme than the low Fuel Rates. The most extreme is in the middle of the range. So, it’s not a strong case for heteroscedasticity. With more data points, we’d either see the case become stronger or it would go away. These aren’t real data, so I didn’t want to focus too much on that aspect. But, it’s great that you’re looking out for that. More data would likely answer the question in a real-world study! Read my post for more information about the effects of heteroscedasticity and how to fix it.
For more about VIFs, read the post I link to about multicollinearity. I talk about them in detail there.
I suspect I’ll write an advanced regression and ANOVA post where I cover things like interaction plots, curvature, VIFs, and maybe even coding for categorical variables!
Thanks for writing with the thoughtful comment!
Steven Harris says
I’ve purchased two of your e books and have found them to be very useful. Please post more on statistics using Excel
Jim Frost says
Hi Steven,
Thanks so much for supporting my books. I really appreciate that! I’m glad they have been so helpful!
I’ve written a number of posts about statistics using Excel. There are two easy ways to find them.
First, I have a search box in the right margin part way down the page. Just search for Excel.
Second, at the very end of this post, just above where the comments section starts, you’ll see a section that says, “Tagged With.” I’ve tagged all the Excel posts with the Excel tag. Click Excel in the tags and you’ll be taken to a page that lists all of my posts where I use Excel!
I also plan to write more posts using Excel as well.
Thanks for writing!
Joe Lombardi says
I haven’t even perused this post yet from top to bottom, but I think I can say it’s your best, most useful post ever. And thanks for writing those tools. I am eager to employ them to automate my sheets. (Oh, and BTW, I was plotting my residuals against the actual data, not the fitted data, so thanks for that tip.)
Jim Frost says
Hi Joe,
You’re welcome for the tip!
I’m glad to hear that this post was so helpful. I’m thinking of following this up with an advanced regression using Excel post soon!