P-values and coefficients in regression analysis work together to tell you which relationships in your model are statistically significant and the nature of those relationships. The coefficients describe the mathematical relationship between each independent variable and the dependent variable. The p-values for the coefficients indicate whether these relationships are statistically significant.

After fitting a regression model, check the residual plots first to be sure that you have unbiased estimates. After that, it’s time to interpret the statistical output. Linear regression analysis can produce a lot of results, which I’ll help you navigate. In this post, I cover interpreting the p-values and coefficients for the independent variables.

**Related post**: When Should I Use Regression Analysis?

## Interpreting P-Values for Variables in a Regression Model

Regression analysis is a form of inferential statistics. The p-values help determine whether the relationships that you observe in your sample also exist in the larger population. The p-value for each independent variable tests the null hypothesis that the variable has no correlation with the dependent variable. If there is no correlation, there is no association between the changes in the independent variable and the shifts in the dependent variable. In other words, there is insufficient evidence to conclude that there is an effect at the population level.

If the p-value for a variable is less than your significance level, your sample data provide enough evidence to reject the null hypothesis for the entire population. Your data favor the hypothesis that there *is* a non-zero correlation. Changes in the independent variable *are* associated with changes in the response at the population level. This variable is statistically significant and probably a worthwhile addition to your regression model.

On the other hand, a p-value that is greater than the significance level indicates that there is insufficient evidence in your sample to conclude that a non-zero correlation exists.

The regression output example below shows that the South and North predictor variables are statistically significant because their p-values equal 0.000. On the other hand, East is not statistically significant because its p-value (0.092) is greater than the usual significance level of 0.05.

It is standard practice to use the coefficient p-values to decide whether to include variables in the final model. For the results above, we would consider removing East. Keeping variables that are not statistically significant can reduce the model’s precision.
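Here is a minimal sketch of that decision rule. The p-values for South, North, and East are the ones quoted above; the dictionary and the helper function are illustrative names of my own, not part of any statistical package.

```python
# P-values quoted in the text for the three predictors.
p_values = {"South": 0.000, "North": 0.000, "East": 0.092}

ALPHA = 0.05  # the usual significance level mentioned above


def is_significant(p_value, alpha=ALPHA):
    """Return True when the p-value is below the significance level."""
    return p_value < alpha


for name, p in p_values.items():
    verdict = "significant" if is_significant(p) else "not significant"
    print(f"{name}: p = {p:.3f} -> {verdict}")

# Predictors you might consider removing from the final model.
to_review = [name for name, p in p_values.items() if not is_significant(p)]
print("Consider removing:", to_review)
```

Treat the list as a prompt for judgment rather than an automatic rule; subject-area knowledge can justify keeping a variable that misses the cutoff.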

Related post: F-test of overall significance in regression

## Interpreting Regression Coefficients for Linear Relationships

The sign of a regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.

The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others.
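To make "holding other variables constant" concrete, here is a small sketch with a made-up two-predictor model. The coefficient values and variable names are hypothetical and chosen only for illustration.

```python
# Hypothetical fitted model: y = const + b1*x1 + b2*x2 (all values are made up).
coefficients = {"const": 10.0, "x1": 2.5, "x2": -1.2}


def predict(x1, x2, coefs=coefficients):
    """Predicted mean of the dependent variable for given predictor values."""
    return coefs["const"] + coefs["x1"] * x1 + coefs["x2"] * x2


# Increase x1 by one unit while holding x2 fixed at an arbitrary value.
change_at_x2_7 = predict(x1=5.0, x2=7.0) - predict(x1=4.0, x2=7.0)

# Repeat with x2 held at a different value: the change is the same.
change_at_x2_0 = predict(x1=5.0, x2=0.0) - predict(x1=4.0, x2=0.0)

print(f"Change with x2 = 7: {change_at_x2_7:.2f}")
print(f"Change with x2 = 0: {change_at_x2_0:.2f}")
```

In both cases the change in the predicted mean equals the x1 coefficient, no matter where x2 is held.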

The coefficients in your statistical output are estimates of the actual population parameters. To obtain unbiased coefficient estimates that have the minimum variance, and to be able to trust the p-values, your model must satisfy the seven classical assumptions of OLS linear regression.

## Graphical Representation of Regression Coefficients

A simple way to grasp regression coefficients is to picture them as linear slopes. The fitted line plot illustrates this by graphing the relationship between a person’s height (IV) and weight (DV). The numeric output and the graph display information from the same model.

The height coefficient in the regression equation is 106.5. This coefficient represents the mean increase of weight in kilograms for every additional one meter in height. If your height increases by 1 meter, the average weight increases by 106.5 kilograms.

The regression line on the graph visually displays the same information. If you move to the right along the x-axis by one meter, the line increases by 106.5 kilograms. Keep in mind that it is only safe to interpret regression results within the observation space of your data. In this case, the height and weight data were collected from middle-school girls and range from 1.3 m to 1.7 m. Consequently, we can’t shift along the line by a full meter for these data.
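The sketch below fits a line to a handful of made-up height and weight values (not the data behind the plot) and refuses to predict outside the observed range. The data, function names, and the range check are assumptions for illustration only.

```python
# Hypothetical heights (m) and weights (kg); NOT the data behind the plot.
heights = [1.30, 1.40, 1.50, 1.60, 1.70]
weights = [24.0, 35.0, 45.5, 56.0, 67.0]


def fit_line(x, y):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(heights, weights)


def predict_weight(height):
    """Predict mean weight, refusing to extrapolate beyond the observed heights."""
    if not min(heights) <= height <= max(heights):
        raise ValueError("height is outside the observed range")
    return intercept + slope * height


# A 0.1 m step stays inside the observed range, unlike a full 1 m step.
step_change = predict_weight(1.60) - predict_weight(1.50)
print(f"slope = {slope:.1f} kg per m, change per 0.1 m = {step_change:.2f} kg")
```

Scaling the slope to a step that fits inside the data, such as 0.1 m, is often the more practical way to report it.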

Let’s suppose that the regression line was flat, which corresponds to a coefficient of zero. For this scenario, the mean weight wouldn’t change no matter how far along the line you move. That’s why a near zero coefficient suggests there is no effect—and you’d see a high (insignificant) p-value to go along with it.

The plot really brings this to life. However, plots can display only results from simple regression—one predictor and the response. For multiple linear regression, the interpretation remains the same.

## Use Polynomial Terms to Model Curvature in Linear Models

The previous linear relationship is relatively straightforward to understand. A linear relationship indicates that the change remains the same throughout the regression line. Now, let’s move on to interpreting the coefficients for a curvilinear relationship, where the effect depends on your location on the curve. The interpretation of the coefficients for a curvilinear relationship is less intuitive than for a linear relationship.

As a refresher, in linear regression, you can use polynomial terms to model curves in your data. It is important to keep in mind that we’re still using linear regression to model curvature rather than nonlinear regression. That’s why I refer to curvilinear relationships in this post rather than nonlinear relationships. Nonlinear has a very specialized meaning in statistics. To learn about this distinction, read my post: The Difference between Linear and Nonlinear Regression Models.

This regression example uses a quadratic (squared) term to model curvature in the data set. You can see that the p-values are statistically significant for both the linear and quadratic terms. But, what the heck do the coefficients mean?

## Graphing the Data for Regression with Polynomial Terms

Graphing the data really helps you visualize the curvature and understand the regression model.

The chart shows how the effect of machine setting on mean energy usage depends on where you are on the regression curve. On the x-axis, if you begin with a setting of 12 and increase it by 1, energy consumption should decrease. On the other hand, if you start at 25 and increase the setting by 1, you should see increased energy usage. Near 20, you wouldn’t expect much change.
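As a rough sketch of why the effect depends on your location on the curve, the code below uses a made-up quadratic model whose minimum sits at a setting of 20. The coefficients are hypothetical and are not the values from the regression output.

```python
# Hypothetical quadratic model (made-up coefficients with a minimum at setting 20):
# energy = B0 + B1 * setting + B2 * setting**2
B0, B1, B2 = 100.0, -4.0, 0.1


def mean_energy(setting):
    """Predicted mean energy usage at a machine setting."""
    return B0 + B1 * setting + B2 * setting ** 2


def effect_of_one_unit(setting):
    """Change in predicted energy when the setting increases by 1."""
    return mean_energy(setting + 1) - mean_energy(setting)


for start in (12, 20, 25):
    print(f"setting {start} -> {start + 1}: {effect_of_one_unit(start):+.2f}")
```

With these assumed coefficients, the one-unit effect is negative at 12, close to zero at 20, and positive at 25, which is the pattern described above. No single coefficient captures that on its own.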

Regression analysis that uses polynomials to model curvature can make interpreting the results trickier. Unlike a linear relationship, the effect of the independent variable changes based on its value. Looking at the coefficients won’t make the picture any clearer. Instead, graph the data to truly understand the relationship. Expert knowledge of the study area can also help you make sense of the results.

Related post: Curve Fitting using Linear and Nonlinear Regression

## Regression Coefficients and Relationships Between Variables

Regression analysis is all about determining how changes in the independent variables are associated with changes in the dependent variable. Coefficients tell you about these changes and p-values tell you if these coefficients are significantly different from zero.

All of the effects in this post have been main effects, which are the direct relationships between an independent variable and a dependent variable. However, sometimes the relationship between an IV and a DV changes based on another variable. This condition is an interaction effect. Learn more about these effects in my post: Understanding Interaction Effects in Statistics.

In this post, I didn’t cover the constant term. Be sure to read my post about how to interpret the constant!

The statistics I cover in the post tell you how to interpret the regression equation, but they don’t tell you how well your model fits the data. For that, you should also assess R-squared.

If you’re learning regression and like the approach I use in my blog, check out my eBook!

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Nancy Lohalo says

May 20, 2019 at 1:33 pm

Hi Jim, thank you very much for this insightful post! I have encountered a few problems with the dependent variable Y in the linear regression model. The data collected showed a decreasing trend for the past 20 years, and my hypothesis stated that X1 will have a positive impact on Y. When I ran the regression test, almost all of the independent variables had negative coefficients. How can I interpret it? Thank you!

Jim Frost says

May 20, 2019 at 2:38 pm

Hi Nancy,

It’s difficult for me to say much about your specific case because there’s so little information. It sounds like your hypothesis was that X1 would have a positive coefficient but your analysis produced a negative coefficient. I’m going to assume that X1 is negative and statistically significant. If it’s negative but not significant, it’s not distinguishable from zero and you can’t assume that it has a negative value in the population. Given those assumptions about the situation, there are two general possibilities.

1) Your hypothesis was incorrect. I have no way to know about that. But, it’s something you can investigate.

2) Your hypothesis is correct but your regression model has a problem that produces biased coefficients. This problem is causing the analysis to produce a negative coefficient when it should be a positive coefficient. There are a number of reasons why this can occur, including confounding variables, overfitting, data mining, and a misspecified model, among other possibilities. Be sure to go through the OLS assumptions and see if your model violates any of them. It will probably take some effort to check these potential problems.

Because you’re performing a study with regression analysis, you might consider buying my ebook about regression analysis. In this ebook, I provide much more information about regression analysis.

Best of luck with your study!

Karis says

May 17, 2019 at 6:02 pm

Hi Jim, thank you so much for this post it’s helped a lot! I’m learning this stuff at uni and have come across a question which has completely confused me and wondered if you could help? The question asks to interpret the regression analysis result and its significance of these regression results:

R^2 = 0.74 (F = 16.82, p>0.01; t = 0.54, p<0.01).

However, the differing levels of confidence levels has thrown me? Does the fact that the F ratio is not within the confidence threshold mean that the regression model altogether is not statistically significant? Thank you!

Jim Frost says

May 20, 2019 at 3:08 pm

Hi Karis,

So, the F-test and R-squared go together. These are measures of Goodness-of-Fit. I’m assuming that the F-value and its p-value are for the F-test of overall significance. That test indicates that your R-squared (0.74) is not significantly different from zero–assuming that alpha is 0.01. Your model is no better at predicting the DV than just using the mean. That’s kind of odd for a model with an R-squared as high as 0.74. There might be a very small sample size or some problem with the model. I can’t tell from these results. Read more about that in my post about the F-test of overall significance. Read my post to see how to interpret R-squared.

The t-value and its p-value are for a term in the model, such as an independent variable. That particular IV is statistically significant. This post details what that means. For this model, the overall significance and significance for a particular IV disagree. The post about the F-test of overall significance describes how this disagreement can happen.

Note that none of the statistics you provide relate to confidence, as in confidence levels or confidence intervals. However, there is a disagreement about statistical significance. Read the post about the F-test to understand that issue.

I do think it’s odd that R-squared is reasonably high but that the overall F-test is not significant. I suspect something odd is going on.

I hope this helps!

Klaudia Pająk says

April 25, 2019 at 5:37 pm

The question is: when I run a regression analysis, SPSS shows the results and COEF has some value… When I describe these results on paper, should I define the coeff value as b or β?

Thank you in advance

Jim Frost says

April 26, 2019 at 11:54 am

Hi Klaudia,

You should be able to work this out from the information provided. Coefficients are estimates of population parameters. And, b is an estimate of a parameter. Therefore, b = coefficients in this context because they are both estimates. Conversely, Beta is a population parameter and not an estimate.

I hope this helps!

Klaudia says

April 25, 2019 at 8:31 am

Hello Jim,

I’d like to ask what does the “COEF” mean. Is it the same thing as b or β?

Klaudia 😉

Jim Frost says

April 25, 2019 at 11:00 am

Hi Klaudia,

COEF stands for coefficient. These are the values that the procedure estimates from your data. In a regression equation, these values multiply the independent variables.

Technically, β is the parameter value for the population. Your regression equation estimates these parameter values. In textbooks, these estimates are often denoted using beta-hats. That’s a β with a ^ on top. Some sources use a lower-case b to indicate that it’s an estimate. The key thing to note is that some forms (β) refer to the true population parameters while others (beta-hat and b) refer to the estimates of the parameters. The coefficients in your output are estimates of the parameters.

One caution, SPSS for some strange reason uses the term “beta” to refer to standardized coefficients!

I hope this helps!

Curt Miller says

April 18, 2019 at 1:25 pm

Hi Jim,

Do we still use p-values in determining whether or not a predictor variable should remain in the model, even when we are building a model on full population data?

Thank you,

curt

Jim Frost says

April 18, 2019 at 2:08 pm

Hi Curt,

When you’re working with data for an entire population, there is no need to use any p-values. P-values are an integral part of hypothesis tests that help you determine whether an apparent effect that exists in your sample also exists in the population. When you have the population data, all effects that you observe by definition do exist in the population. There’s no need to perform any hypothesis testing to confirm it because you’re looking at all the data for the population. This applies to regression analysis and other forms of hypothesis testing, such as 2-sample t-tests!

Phil A. says

April 7, 2019 at 5:50 pm

Hi Jim,

Quick question for a special type of regression… I have the following equation but I am not clear on the interpretation of the coefficient I obtain:

log($RealGDP) = B0 + B1(Junk-Bond Yield %) + e

My X1 data is in terms of percentage points (%) and my Y-variable (in log-scale) is in terms of dollars ($).

After I run my regression, my B1 coefficient = -0.005

As of now, I am interpreting the B1 coefficient as “A 1% increase in the Junk-Bond yield leads to a -0.5% decrease in Real GDP” – does this sound like the correct interpretation?

My main confusion is around the “1% increase in X” …. If the junk-bond spread is currently at 5%, do I interpret “a 1% change” as the junk-bond yield moving from 5% to 6%? Or do I interpret it as a 1% change of 5% (ex: 5% to 5.05%)?
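For readers with the same question, here is a rough sketch of the usual interpretation when only the dependent variable is logged. It assumes a natural log and uses the coefficient quoted in the comment above; the function names are my own. Under that assumption, a one-unit increase in X multiplies Y by e^b, so the percent change is (e^b − 1) × 100, which is close to 100 × b when b is small. If the yield is recorded in percentage points, a one-unit increase means moving from 5% to 6%, not from 5% to 5.05%.

```python
import math


def percent_change_per_unit(b):
    """Exact percent change in Y for a one-unit increase in X when ln(Y) is modeled."""
    return (math.exp(b) - 1.0) * 100.0


def approx_percent_change_per_unit(b):
    """Small-coefficient approximation: 100 * b."""
    return 100.0 * b


b1 = -0.005  # coefficient quoted in the comment above

exact = percent_change_per_unit(b1)
approx = approx_percent_change_per_unit(b1)
print(f"exact:  {exact:.4f}%")
print(f"approx: {approx:.4f}%")
```

For a coefficient this small, the exact and approximate values are nearly identical. The gap widens as the coefficient moves away from zero, so the exact form is the safer default.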

Olga Pap says

March 4, 2019 at 4:19 am

Hello Jim.

Hello All.

I have one question. Specifically, when the dependent variable (e.g. earnings) is expressed on a logarithmic form (and not the independent variables) via mincer equation, does the interpretation of coefficients follow the below rules?

• For an increase of one-unit of the independent variable “X”, with coefficient b, then the change for dependent variable “Y” in logarithmic form should be e^b?

• And only for very small values of b (b < |0.1|) and having in mind that

e^b ≈ 1 + b, increase of one-unit of the independent variable “X”, with coefficient b, then the change for dependent variable “Y” should be equal to (100 × b)?

Thank you in advance.

Olga Pap says

March 4, 2019 at 3:03 am

Is it possible please to answer me on the above question?

Jessica says

March 3, 2019 at 6:06 pm

Thank you! It does, though, when I’m looking at a scatterplot, I’ve seen an R value. This is not to be interpreted as the same thing as the correlation coefficient, r . . . correct? Even though the R value is not R-squared, it is still not the same as r . . . right?

Jim Frost says

March 3, 2019 at 10:18 pm

Hi Jessica,

Ah, yes, I jumped straight to R-squared because that is used much more frequently. R is the coefficient of multiple correlation whereas R-squared is the coefficient of multiple determination. The use of the capital letter R for both of these statistics indicates that they are sample estimates. I’ve described R-squared so onto R!

The calculation for R is (unsurprisingly) just taking the positive square root of R-squared. R represents the correlation between a set of variables with another variable. In the regression context, this could be the correlation between your set of independent variables and the dependent variable. The interpretation of R is not intuitive. Hence, R-squared is used more frequently.

Lower case r is the correlation between two variables and it is commonly used. R involves more than two variables.

I haven’t seen R used much at all. Perhaps it is in some specialized context. But, you probably don’t need to worry about R.

Jessica says

March 3, 2019 at 4:41 pm

Hi, I know this may seem to be a very simple question, but is there a difference between R and r? Do they stand for the same thing in regression analysis?

Jim Frost says

March 3, 2019 at 5:57 pm

Hi Jessica,

Yes, r and R-squared are related as they both measure the strength of relationships between variables. r is a correlation coefficient that ranges from -1 to +1. It measures the strength of the linear relationship between two continuous variables. R-squared measures the strength of the relationship between a set of independent variables and the dependent variable. It’s a percentage that ranges from 0 – 100%.

Suppose you have a pair of variables, say X and Y, and the correlation coefficient (r) is 0.7. If you perform a simple regression using these two variables, you will obtain an R-squared of 0.49 (49%). We know this because 0.7^2 = 0.49. However, unlike correlation coefficients (r), you can use R-squared when you have more than two variables.
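A quick way to convince yourself of this is to compute both quantities separately and compare them. The sketch below uses a small made-up data set; the data and the function names are assumptions for illustration only.

```python
import math

# Hypothetical paired observations, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.1, 6.3, 5.9]


def sums_of_squares(x, y):
    """Return (Sxx, Syy, Sxy) computed around the sample means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxx, syy, sxy


def pearson_r(x, y):
    """Correlation coefficient between two variables."""
    sxx, syy, sxy = sums_of_squares(x, y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(x, y):
    """R-squared of the least-squares line: 1 - SSE / SST."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx, syy, sxy = sums_of_squares(x, y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - sse / syy


r = pearson_r(x, y)
print(f"r = {r:.4f}, r squared = {r * r:.4f}, R-squared = {r_squared(x, y):.4f}")
```

The two printed values for r squared and R-squared agree, but only because this is a simple regression with one predictor.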

I write about that aspect in my post about correlation. You can also read more about R-squared.

I hope this helps!

Olga Pap says

February 14, 2019 at 10:13 am

Hi Jim. I would be very grateful if you could help me. Specifically, when the dependent variable (e.g. earnings) is expressed on a logarithmic form (and not the independent variables) via mincer equation, does the interpretation of coefficients follow the below rules?

• For an increase of one-unit of the independent variable “X”, with coefficient b, then the change for dependent variable “Y” in logarithmic form should be e^b?

• And only for very small values of b (b < |0.1|) and having in mind that

e^b ≈ 1 + b, increase of one-unit of the independent variable “X”, with coefficient b, then the change for dependent variable “Y” should be equal to (100 × b)?

Thank you in advance.

Tesfakiros Semere says

February 14, 2019 at 5:26 am

What a clear, simple, and easy to understand. You saved my time from reading lots of books. It is really helpful.

Would it be possible to get them all in Pdf just to print and read when I am out of network

THANK YOU SO MUCH Kim.

Jim Frost says

February 15, 2019 at 4:55 pm

Hi Kim, thanks so much for your kind words! They made my day! While I don’t have PDFs of the blog posts, in several weeks I’ll be releasing an ebook all about regression analysis. If you like the simple and easy to understand approach in my blog posts, you’ll love this book. It should be out in early March 2019!

Digambar salunkhe says

February 13, 2019 at 1:38 am

Thank you so much for sharing this blog…It’s really helpful and easy to understand the concept of whole regression model.

Adu Emmanuel Ifedayo says

February 13, 2019 at 1:30 am

Thank you.

Neven says

December 6, 2018 at 12:19 pm

Hi Jim! Great blog, very clear and very helpful. The best I have found in this field! Thanks.

Qmars Safikhani says

November 28, 2018 at 8:20 am

Hi Jim,

Thanks a lot for sharing your knowledge through this article. I found it very interesting as you explained somehow difficult concepts in an easy way. Well done

Hans says

November 19, 2018 at 12:23 pm

Hey Jim,

Great Blog! You helped us a lot preparing for our studies at university. We have a question regarding the p-value… Is there an explanation for a p-value being exactly 1.0? Does it mean that there is a 100 percent chance that the independent variable has no effect on the dependent one? Or is there anything else to consider? Thanks a lot for your help and keep that great work going!

Jim Frost says

November 19, 2018 at 3:08 pm

Hi Hans, thank you so much! It’s great to hear that it’s been helpful for you all. That makes my day!

Yes, you can obtain a p-value of 1.0. To get exactly 1.0, your sample statistic would have to exactly equal the null hypothesis value. For example, suppose you perform a 1-sample t-test and your null hypothesis is that the population mean equals 10. If your sample statistic is exactly 10, you obtain a p-value of exactly 1.0. In regression analysis, typically the null for a coefficient is that it equals zero. So, if the estimated coefficient equals zero exactly, you’d again get a p-value of 1.0.

The interpretation of a p-value in general is the probability of obtaining the observed sample statistic or more extreme if you assume the null hypothesis is true. The reason p = 1.0 when the sample statistic equals the null hypothesis value makes sense when you think about it with that interpretation in mind. When the sample stat equals the null value, there is a 100% probability that a sample statistic will equal the null value or be more extreme! That’s true by definition because that case covers the entire range of the sampling distribution (i.e., you’d shade the entire area beneath the sampling distribution curve).

To see these sampling distributions in action for a hypothesis test, read my post about p-values and significance levels.

Of course, the probability of obtaining a sample statistic that exactly equals your null hypothesis is minuscule. When using statistical software in the field, if you see a p-value = 1, it’s more likely due to rounding.
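To see the "shade the entire area" idea numerically, here is a minimal sketch that computes a two-sided p-value from a t-statistic by integrating the t density. The function names and the use of Simpson's rule are my own choices for illustration; statistical software uses more refined routines.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(t_stat, df, steps=2000):
    """P(|T| >= |t_stat|), using Simpson's rule on the interval [0, |t_stat|]."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0  # the statistic equals the null value: the whole curve is shaded
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and +|t|
    return max(0.0, 1.0 - central_area)


print(two_sided_p_value(0.0, df=10))    # statistic equal to the null value
print(two_sided_p_value(2.228, df=10))  # near the usual 0.05 cutoff for 10 df
```

This sketch is meant for modest degrees of freedom; `math.gamma` overflows for very large values.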

Paul says

October 29, 2018 at 11:20 pm

Hi. I want to find out if simple or multiple regressions can be used to explain effects (as in experimental studies)?

Thank you.

Jim Frost says

October 30, 2018 at 10:38 am

Hi Paul,

You bet they can! The coefficients describe the effects and the p-values determine whether the effects are statistically significant.

Rashan says

October 13, 2018 at 9:15 am

This is very helpful. Thank you

Surya says

October 1, 2018 at 10:08 pm

Thanks Jim

Surya says

October 1, 2018 at 8:50 am

Hi Jim, I have just subscribed to your posts after reading the wonderful post on residual plots.

Could you please let me know how do we interpret the SE of coefficients , T statistic as well.. Or do you already have an article on them… Please reply.. Thanks..

Jim Frost says

October 1, 2018 at 9:34 pm

Hi Surya,

Thanks so much! I’m glad that post was helpful!

The standard error (SE) of the coefficient measures the precision of the coefficient estimate. Smaller values represent more precise estimates. Standard errors are the standard deviations of sampling distributions. If you were to perform your study many times, drawing the same sample size, and fitting the same model, you’d obtain a distribution of coefficient estimates. That’s the sampling distribution of a coefficient estimate. The standard error of a coefficient is the standard deviation of that sampling distribution. The SE is used to create confidence intervals for the coefficient estimate, which I find more intuitive to interpret.

The t-statistic in the context of regression analysis is the test statistic that the analysis uses to calculate the p-value. I’ve written a post about how it works in the context of t-tests. It’s fairly similar for coefficient estimates. Read my post about how t-tests work, but replace sample mean with coefficient estimate, and you’ll get a good idea.
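Here is a minimal sketch of how these pieces fit together for a simple regression. It uses a small made-up data set, the textbook formula for the standard error of a slope, and a hard-coded 95% critical value for 4 degrees of freedom; the variable names are my own.

```python
import math

# Hypothetical paired observations, used only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.1, 6.3, 5.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Least-squares estimates for the slope and the constant.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual variance uses n - 2 degrees of freedom (two estimated parameters).
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
df = n - 2
residual_variance = sum(e * e for e in residuals) / df

# Standard error of the slope and the t-statistic for H0: slope = 0.
se_slope = math.sqrt(residual_variance / sxx)
t_stat = slope / se_slope

# 95% confidence interval; 2.776 is the two-sided critical t-value for 4 df.
T_CRIT_95_DF4 = 2.776
ci = (slope - T_CRIT_95_DF4 * se_slope, slope + T_CRIT_95_DF4 * se_slope)

print(f"slope = {slope:.4f}, SE = {se_slope:.4f}, t = {t_stat:.2f}")
print(f"95% CI for the slope: ({ci[0]:.4f}, {ci[1]:.4f})")
```

A smaller standard error produces a larger t-statistic and a narrower confidence interval for the same coefficient.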

I hope that helps!

[email protected] says

September 26, 2018 at 1:33 am

Been reading your posts all night (morning now). I can’t stop because it’s like a light bulb keeps going off. Been studying this stuff for weeks, now I finally get it thanks to your post. Thank you :)

-Extremely tired data science grad student.

Jim Frost says

September 26, 2018 at 9:33 am

Hi, I’m sorry my posts caused you to lose some sleep last night, but I love your analogy about light bulbs going off! I’m really happy to hear that they were helpful. That really makes my day! Best of luck with your studies!

Tracey says

August 26, 2018 at 3:57 am

Hi Jim. Thank you so much for this as it helped clear up some things in my mind as I prepare a research paper.

Jim Frost says

August 26, 2018 at 11:17 pm

Hi Tracey, you’re very welcome. I am happy to hear that it was helpful!

Qiumei Jing says

August 23, 2018 at 6:02 am

Thank you for your explanation, Jim. That’s really great!

When I’m doing multiple linear regression, I have a question. The linear regression has three independent variables (A, B, C) and one dependent variable (D). I got a significant p-value in the ANOVA table, but in the Coefficients table, the constant p-value is 0.237, which is not significant, and one predictor (Variable A) has a p-value of 0.211. The other two predictors have significant p-values (P=0.000). In that case, how can I interpret the results? The hypotheses for the two significant predictors (variables B and C) are “there is a relationship between B and D” and “there is a relationship between C and D.” In this case, can I say the two hypotheses were supported? And how can I interpret the one (A) with an insignificant p-value in the coefficient table? Thank you in advance!

Jim Frost says

August 24, 2018 at 2:06 am

Hi Qiumei,

It’s generally not worthwhile interpreting the constant, so I’d skip that. To learn why, click the link for interpreting the constant in this post.

Here’s how you can interpret the significant predictors.

The sample provides sufficient evidence to conclude that changes in both independent variables B and C are correlated with changes in the dependent variable D. Statistical significance indicates that the correlation does not equal zero. In other words, you can reject the null hypothesis that the coefficients equal zero.

For the insignificant variable, the sample provides insufficient evidence to conclude that there is a relationship between this variable and the dependent variable. In other words, you fail to reject the null hypothesis that its coefficient equals zero.

For more elaboration, reread this post where I talk about this in depth.

Appadu says

August 19, 2018 at 7:58 am

Dear Jim

Thank you for your explanations on how to Interpret Regression Coefficients for Linear Relationships and p-value. It is very clear appreciate you time to put this together.

I have one question I was looking at an example on Estimated standardised OLS beta coefficient data. The results show R squared (%) as 26.2 and F-Value 18.14. Please advise how to interpret this 2 figures. Thank you

Jim Frost says

August 23, 2018 at 2:02 am

Hi Appadu,

When you standardize the continuous independent variables in your model, the output produces standardized coefficients. Standardization is when you take the original data for each variable, subtract the variable’s mean from each observation and divide by the variable’s standard deviation. The main reason I’m aware of for performing this standardization is to reduce the multicollinearity caused by including polynomials and interaction terms in your model. I write about that in my post about multicollinearity.

In terms of interpreting the standardized coefficient, it represents the mean change in the dependent variable given a one standard deviation change in the independent variable. Another reason statisticians use it is as a possible measure for identifying which variable is the most important.
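As a small sketch of that interpretation, the code below standardizes a single made-up predictor and compares the slope before and after. The data and function names are hypothetical; only the independent variable is standardized, as described above.

```python
import statistics

# Hypothetical data, used only for illustration.
x = [12.0, 15.0, 19.0, 22.0, 26.0, 31.0]
y = [40.0, 47.0, 52.0, 61.0, 66.0, 79.0]


def ols_slope(x, y):
    """Least-squares slope for a single predictor."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# Standardize x: subtract its mean and divide by its standard deviation.
mean_x = statistics.mean(x)
sd_x = statistics.stdev(x)
z = [(xi - mean_x) / sd_x for xi in x]

raw_slope = ols_slope(x, y)           # change in y per one unit of x
standardized_slope = ols_slope(z, y)  # change in y per one SD of x

print(f"raw slope = {raw_slope:.4f} per unit")
print(f"standardized slope = {standardized_slope:.4f} per SD")
print(f"raw slope * SD of x = {raw_slope * sd_x:.4f}")
```

The last two printed values match: the standardized coefficient is the raw coefficient rescaled by the predictor's standard deviation.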

As for interpreting R-squared and the F-test of overall significance, those don’t change from the usual interpretations. Click on the links to read my blog post about interpreting each statistic.

I hope this helps!

Hrishikesh Geed says

July 10, 2018 at 4:30 pm

Thanks for the explanation, Jim!

I have one doubt, how do you calculate the p-value corresponding to each coefficient?

How do you decide the standard deviation,and the sample mean for calculating the z value for each coefficient?

Thanks

Hrishi

eric says

July 2, 2018 at 6:35 pm

Thank you very much for the explanation Jim!

If the p-value is under the significance level, this would indicate that there is enough evidence to reject the null hypothesis. The null hypothesis being here that there is no correlation between 2 variables (in a simple linear regression).

Here is my first question: how do we decide how to set the significance level? Is it purely arbitrary?

My second question is: since the coefficient of correlation varies between -1 and 1, it is tempting to conclude that there is a significant correlation (positive or negative) between 2 variables if the coefficient of correlation is close to -1 or 1 and that there is no correlation when the coefficient of correlation is close to 0. However I think this assumption is false but can’t get the intuition to understand why.

Could you help me about those questions?

Many thanks for your time and your attention

Best regards

Eric

Hanan Shteingart says

July 1, 2018 at 4:04 pm

The following claim is not true if the features are correlated, what’s known as multicollinearity: “The sign of a regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable”. In fact, a feature could have a positive correlation with the target yet a negative coefficient and vice versa.

Jim Frost says

July 2, 2018 at 2:07 am

Hi Hanan,

You raise a good point. The interpretation that I present, including the portion that you quote, is accurate when your model doesn’t contain a severe problem. However, if your model does contain a severe problem, it can produce unreliable results, which includes the possibility that the coefficients don’t accurately describe the relationship between the independent variables and the dependent variable. The problem isn’t with how to interpret coefficients, but rather with a condition in the model that causes it to produce coefficients that you can’t trust.

As you point out, multicollinearity can produce unreliable, erratic coefficients. In some cases, the sign of the coefficient can even be incorrect. However, the sign switch doesn’t necessarily have to happen when your model has multicollinearity. I write more about multicollinearity, including switched signs, in this post: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions.

By the way, there are a number of other potential problems that can cause your model to produce results that you can’t trust. Multicollinearity is just scratching the surface of that. These problems include an incorrectly specified model, overfitting the model, heteroscedasticity, and data mining, among others. I spend quite a bit of time talking about these problems, how they can invalidate your results, and what you can do to address them.

I hope this helps!

MN says

May 24, 2018 at 1:43 pm

Thank you very much for the wonderful elaboration. Amazing!!

Jim Frost says

May 25, 2018 at 2:20 pm

You’re very welcome, MN! I’m glad it’s helpful!

Rajasekar says

May 7, 2018 at 10:26 am

I am currently working on a multiple regression model, where i have 4 x variable and all my variable are not statistically significant. I know when this happen i can reject null hypothesis but like to know what might be the wrong , do i need to add some more x variable in this case.Also the R Square =0.109842937

Adjusted R Square =0.034084889

Ayush says

April 20, 2018 at 12:27 am

This is really one of the best websites I have come across for DATA SCIENCE… Great effort put up by Sir Jim…

Jim Frost says

April 20, 2018 at 10:25 am

Thank you, Ayush!

Rali says

March 24, 2018 at 10:39 am

Hi Mr. Jim

Thanks for the helpful blog

all the best

Jim Frost says

March 24, 2018 at 5:50 pm

Hi Rali, you’re very welcome! I’m glad it was helpful!

ADIL HUSSAIN RESHI says

February 22, 2018 at 1:44 am

Really fabulous ..it cleared all my doubts about p- value

Jim Frost says

February 22, 2018 at 9:55 am

Hi Adil, Thanks! I’m so glad to hear that it was helpful!

Javed Iqbal says

January 29, 2018 at 3:19 am

Thanks Jim for the nice explanation. This regression seems to violate one of the model assumptions, namely homoscedasticity. Log transformation should work here.

Jim Frost says

January 29, 2018 at 9:55 am

Hi Javed, thanks for your comment. The residuals for this model are homoscedastic–or very close to it. Their variance is fairly equal across the entire range. The variance might appear to be lower in the very low end of the range, but there are also fewer observations in that region, which can make the dispersion appear to be smaller. At any rate, it is close enough. To see how a true case of heteroscedasticity appears, along with multiple methods for correcting it, read my post about heteroscedasticity. By the way, I explain in that post why I always recommend trying other methods of addressing this problem before using a transformation.

Toby says

January 1, 2018 at 10:13 pm

Great blog with detailed explanation! It helps clear my doubts for p-value.

Thank you Jim! and Happy new year! 😀

Jim Frost says

January 1, 2018 at 10:28 pm

Thank you, Toby! And, I’m very happy you found the blog to be helpful! Happy new year to you too!!