P-values and coefficients in regression analysis work together to tell you which relationships in your model are statistically significant and the nature of those relationships. The coefficients describe the mathematical relationship between each independent variable and the dependent variable. The p-values for the coefficients indicate whether these relationships are statistically significant.

After fitting a regression model, check the residual plots first to be sure that you have unbiased estimates. After that, it’s time to interpret the statistical output. Linear regression analysis can produce a lot of results, which I’ll help you navigate. In this post, I cover interpreting the p-values and coefficients for the independent variables.

**Related post**: When Should I Use Regression Analysis?

## Interpreting P-Values for Variables in a Regression Model

Regression analysis is a form of inferential statistics. The p-values help determine whether the relationships that you observe in your sample also exist in the larger population. The p-value for each independent variable tests the null hypothesis that the variable has no correlation with the dependent variable. If there is no correlation, there is no association between the changes in the independent variable and the shifts in the dependent variable. In other words, there is insufficient evidence to conclude that there is an effect at the population level.

If the p-value for a variable is less than your significance level, your sample data provide enough evidence to reject the null hypothesis for the entire population. Your data favor the hypothesis that there *is* a non-zero correlation. Changes in the independent variable *are* associated with changes in the response at the population level. This variable is statistically significant and probably a worthwhile addition to your regression model.

On the other hand, a p-value that is greater than the significance level indicates that there is insufficient evidence in your sample to conclude that a non-zero correlation exists.

The regression output example below shows that the South and North predictor variables are statistically significant because their p-values equal 0.000. On the other hand, East is not statistically significant because its p-value (0.092) is greater than the usual significance level of 0.05.

It is standard practice to use the coefficient p-values to decide whether to include variables in the final model. For the results above, we would consider removing East. Keeping variables that are not statistically significant can reduce the model’s precision.
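The decision rule described in this section can be sketched in a few lines of Python. The p-values are the ones quoted in the text for South, North, and East. The dictionary layout and the function name `is_significant` are illustrative assumptions, not output from any particular statistics package.

```python
# P-values quoted in the text. A displayed 0.000 means the value rounds to zero
# at three decimals, not that it is exactly zero.
p_values = {"South": 0.000, "North": 0.000, "East": 0.092}

ALPHA = 0.05  # the usual significance level mentioned in the text


def is_significant(p_value, alpha=ALPHA):
    """Return True when the p-value is less than the significance level."""
    return p_value < alpha


for name, p in p_values.items():
    verdict = "significant" if is_significant(p) else "not significant"
    print(f"{name}: p = {p:.3f} -> {verdict}")
```

With these inputs, South and North are flagged as significant and East is not, which matches the reading given above.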

**Related post**: F-test of overall significance in regression

## Interpreting Regression Coefficients for Linear Relationships

The sign of a regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.

The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others.
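To make "holding other variables constant" concrete, here is a minimal sketch. The fitted equation, its coefficients, and the names `x1`, `x2`, and `predict_mean` are all hypothetical and chosen only for illustration.

```python
# Hypothetical fitted equation: y = 10 + 2.5*x1 - 4*x2
INTERCEPT = 10.0
COEF = {"x1": 2.5, "x2": -4.0}


def predict_mean(x1, x2):
    """Mean of the dependent variable predicted by the fitted equation."""
    return INTERCEPT + COEF["x1"] * x1 + COEF["x2"] * x2


# Raise x1 by one unit while holding x2 fixed at several different values.
for x2 in (0.0, 3.0, 7.5):
    change = predict_mean(6.0, x2) - predict_mean(5.0, x2)
    print(f"x2 held at {x2}: change in predicted mean = {change}")
```

Whatever value `x2` is held at, the one-unit increase in `x1` shifts the predicted mean by 2.5, which is exactly the `x1` coefficient. That is the sense in which a coefficient isolates one variable's effect.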

The coefficients in your statistical output are estimates of the actual population parameters. To obtain unbiased coefficient estimates that have the minimum variance, and to be able to trust the p-values, your model must satisfy the seven classical assumptions of OLS linear regression.

## Graphical Representation of Regression Coefficients

A simple way to grasp regression coefficients is to picture them as linear slopes. The fitted line plot illustrates this by graphing the relationship between a person’s height (IV) and weight (DV). The numeric output and the graph display information from the same model.

The height coefficient in the regression equation is 106.5. This coefficient represents the mean increase of weight in kilograms for every additional one meter in height. If your height increases by 1 meter, the average weight increases by 106.5 kilograms.

The regression line on the graph visually displays the same information. If you move to the right along the x-axis by one meter, the line increases by 106.5 kilograms. Keep in mind that it is only safe to interpret regression results within the observation space of your data. In this case, the height and weight data were collected from middle-school girls and range from 1.3 m to 1.7 m. Consequently, we can’t shift along the line by a full meter for these data.
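The arithmetic behind that caution can be sketched as follows. The coefficient (106.5) and the observed range (1.3 m to 1.7 m) come from the text. The function name and the choice to raise an error outside the range are my own assumptions for illustration.

```python
HEIGHT_COEF_KG_PER_M = 106.5   # coefficient quoted in the text
OBSERVED_RANGE_M = (1.3, 1.7)  # range of heights in the sample


def mean_weight_change(height_from_m, height_to_m):
    """Change in mean weight (kg) implied by moving between two heights."""
    low, high = OBSERVED_RANGE_M
    for h in (height_from_m, height_to_m):
        if not low <= h <= high:
            raise ValueError(f"{h} m is outside the observed range {low}-{high} m")
    return HEIGHT_COEF_KG_PER_M * (height_to_m - height_from_m)


print(round(mean_weight_change(1.4, 1.5), 2))  # a 0.1 m increase -> 10.65
```

A 0.1 m increase inside the observed range corresponds to a mean increase of about 10.65 kg. Asking for a full one-meter shift, such as 1.0 m to 2.0 m, raises an error in this sketch because it leaves the range the data support.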

Let’s suppose that the regression line was flat, which corresponds to a coefficient of zero. For this scenario, the mean weight wouldn’t change no matter how far along the line you move. That’s why a near zero coefficient suggests there is no effect—and you’d see a high (insignificant) p-value to go along with it.
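Here is a small sketch of the flat-line scenario. Regression software reports a p-value from a t-test. To stay within the Python standard library, this example swaps in a permutation test, which addresses the same null hypothesis of a zero slope but is a different technique. The data sets and function names are made up for illustration.

```python
import random


def ols_slope(x, y):
    """Least-squares slope for a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the null hypothesis of a zero slope."""
    rng = random.Random(seed)
    observed = abs(ols_slope(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between x and y
        if abs(ols_slope(x, shuffled)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


x = [1, 2, 3, 4, 5, 6, 7, 8]
flat_y = [3, 1, 2, 2, 2, 2, 1, 3]  # made-up data with a slope of exactly zero
rising_y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.0]  # made-up, slope near 2

print("flat slope:", ols_slope(x, flat_y), "p =", permutation_p_value(x, flat_y))
print("rising slope:", ols_slope(x, rising_y), "p =", permutation_p_value(x, rising_y))
```

For the flat data the slope is zero, so every shuffled data set is at least as extreme and the p-value is 1. For the rising data the p-value should come out well below 0.05, because almost no shuffle produces a slope as steep as the observed one.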

The plot really brings this to life. However, plots can display only results from simple regression—one predictor and the response. For multiple linear regression, the interpretation remains the same.

## Use Polynomial Terms to Model Curvature in Linear Models

The previous linear relationship is relatively straightforward to understand. A linear relationship indicates that the change remains the same throughout the regression line. Now, let’s move on to interpreting the coefficients for a curvilinear relationship, where the effect depends on your location on the curve. The interpretation of the coefficients for a curvilinear relationship is less intuitive than it is for linear relationships.

As a refresher, in linear regression, you can use polynomial terms to model curves in your data. It is important to keep in mind that we’re still using linear regression to model curvature rather than nonlinear regression. That’s why I refer to curvilinear relationships in this post rather than nonlinear relationships. Nonlinear has a very specialized meaning in statistics. To learn about this distinction, read my post: The Difference between Linear and Nonlinear Regression Models.

This regression example uses a quadratic (squared) term to model curvature in the data set. You can see that the p-values are statistically significant for both the linear and quadratic terms. But, what the heck do the coefficients mean?

## Graphing the Data for Regression with Polynomial Terms

Graphing the data really helps you visualize the curvature and understand the regression model.

The chart shows how the effect of machine setting on mean energy usage depends on where you are on the regression curve. On the x-axis, if you begin with a setting of 12 and increase it by 1, energy consumption should decrease. On the other hand, if you start at 25 and increase the setting by 1, you should see increased energy usage. Near 20, you wouldn’t expect much change.

Regression analysis that uses polynomials to model curvature can make interpreting the results trickier. Unlike a linear relationship, the effect of the independent variable changes based on its value. Looking at the coefficients won’t make the picture any clearer. Instead, graph the data to truly understand the relationship. Expert knowledge of the study area can also help you make sense of the results.
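A short sketch shows why a single coefficient can't summarize a quadratic relationship. The coefficients below are hypothetical, not the ones from the energy model. I picked them only so that the curve bottoms out near a setting of 20, like the pattern described above.

```python
# Hypothetical quadratic fit: energy = 500 - 40*setting + 1*setting**2
B0, B1, B2 = 500.0, -40.0, 1.0


def mean_energy(setting):
    """Predicted mean energy usage at a given machine setting."""
    return B0 + B1 * setting + B2 * setting ** 2


def effect_of_one_unit_increase(setting):
    """Change in predicted mean energy when the setting rises by 1."""
    return mean_energy(setting + 1) - mean_energy(setting)


for start in (12, 20, 25):
    print(f"start at {start}: change = {effect_of_one_unit_increase(start)}")
```

With these made-up coefficients, a one-unit increase changes predicted energy by -15 at a setting of 12, by +1 at 20, and by +11 at 25. The same two coefficients produce a decrease, almost no change, and an increase, depending on where you start.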

**Related post**: Curve Fitting using Linear and Nonlinear Regression

## Regression Coefficients and Relationships Between Variables

Regression analysis is all about determining how changes in the independent variables are associated with changes in the dependent variable. Coefficients tell you about these changes and p-values tell you if these coefficients are significantly different from zero.

All of the effects in this post have been main effects, which are the direct relationships between an independent variable and a dependent variable. However, sometimes the relationship between an IV and a DV changes based on another variable. This condition is an interaction effect. Learn more about these effects in my post: Understanding Interaction Effects in Statistics.

In this post, I didn’t cover the constant term. Be sure to read my post about how to interpret the constant!

The statistics I cover in the post tell you how to interpret the regression equation, but they don’t tell you how well your model fits the data. For that, you should also assess R-squared.

If you’re learning regression and like the approach I use in my blog, check out my eBook!

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Jennifer says

Hi Jim,

Thank you for this awesome post. I had a question about removing variables that are not significant (p>0.05) from my multiple logistic regression model. I have 10 predictors, and have been building models using every combination of these predictors (using the glmulti function in R) and selecting the best model with the lowest AIC score. Then I looked at a summary of the best model and found that there are some variables with coefficients with p>0.05, but when I remove these, the AIC score increases. Would you still recommend removing variables that are not significant from the regression model if removing the variables increases the AIC?

Jim Frost says

Hi Jennifer,

There are a number of potential issues here.

You’re using a data dredging technique that might be finding a model that has the lowest AIC by chance rather than because it’s actually the best model. I write about that in my post about using data mining to select regression models. You should read that. I don’t use AIC in that post. Instead, I use R-squared, but the principles are the same. The problem is you can’t necessarily trust p-values when you use that approach.

Another issue I’m unsure about is your sample size. Is it large enough to support 10 predictors? If not, you can’t trust the p-values for that reason either! Read about that: Overfitting your model.

Now, on to your specific question. One thing to consider is how much AIC changes when you remove the non-significant predictor. Perhaps it doesn’t reflect much of a change in terms of model fit? Typically, it is good to remove variables that are not significant unless theory strongly suggests you keep them in. Without knowing the subject area and the amount of change, I can’t give a concrete answer.

I’ve written an article about selecting the best regression model. This process is a mix of statistical measures and subject area knowledge. It sounds like you’re predominantly using statistical measures and I think applying more subject area knowledge will be really helpful. I’d read that article and pay particular attention to the section about theory near the end. Again, I don’t talk about AIC, but I do use other similar measures, such as R-squared and its variants. Also, pay particular attention to the discussion about residual plots. I wonder if in your case removing the non-significant variable actually creates patterns in the residual plots? If that’s the case, it would be good to not remove. But, you’d need to check the residual plots to make that determination.

As you can see, there’s a lot to consider. I wish I could give you a simple answer!

Dr kashif says

Hello, can I determine a p-value when one variable is constant? For example, the correlation of a disease with some cancer, where cancer could be yes or no but the disease is constant. Please reply.

Jim Frost says

Hi Kashif,

I’m not clear on what two variables you want to correlate. However, in general, to have a relationship or correlation between two variables, you must have variation in both variables. Correlation means that as two variables vary, they tend to change either in the same direction or in opposite directions. If one of the variables does not vary at all as the other variable varies, there is no relationship. If I understand your example correctly, cancer can be present or not present while the disease is always present. In that case, there is no relationship. However, if you had cancer present/not present and disease present/not present, there is variability in both variables and you could use a test of independence to determine whether a relationship exists. Is there a relationship between the presence of the cancer and the presence of the disease?

But, if all observations have the disease (or all observations don’t have the disease) there is no relationship. P-value equals 1 (or you might get an error depending on the analysis).
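A quick sketch shows why a variable that never varies can't be correlated with anything. The data are made up, and `pearson_r` is a hand-written helper for illustration, not a library function. It returns `None` where some software would report an error.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; returns None when either variable never varies."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return None  # zero variance: the correlation is undefined
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


cancer = [1, 0, 1, 1, 0, 0]            # made-up: cancer present (1) or absent (0)
disease_constant = [1, 1, 1, 1, 1, 1]  # every observation has the disease
disease_varies = [1, 0, 1, 0, 0, 0]    # made-up: the disease varies too

print(pearson_r(cancer, disease_constant))  # None: nothing to correlate
print(pearson_r(cancer, disease_varies))    # a defined correlation
```

When the disease column is constant, the denominator of the correlation formula is zero, so there is no value to report. Once both columns vary, a correlation can be calculated.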

Prisca Keery says

Hi Jim

I am hoping you can help with my statistical question. I am looking to conduct a study with low sample size with one IV and either 3 DV’s or 9 DV’s. What statistical analysis issues may I encounter with the more DV’s I include in my study given the low sample size?

Hadas says

Sorry Jim; I got good information from your answer to Lola about how to handle insignificant variables, but I am still not clear. What if the variables are assumed to be very important for the study and their p-values are greater than 5%? What if too many variables have p-values greater than 5% at the 95% confidence level? The questions are Likert-type and I measure participation in 5 levels. Is that the problem here? By the way, the results from the descriptive analyses showed that the variables have effects on the dependent variable. You also said that removing an important variable is potentially more problematic than leaving in a variable that’s not important. So what is your suggestion?

Thank you so much for your unreserved helps

Shareful says

I’m doing a multi-regression analysis using infection rate of COVID-19 as dependent variable and air pollution, meteorological and social data as independent variables in order to understand their relationship. I have found very good results from this analysis and I’m considering p value, R2, Beta coefficient and influence results in my write up. All are OK to be added in my paper? What do you suggest in my case. By the way, I’m from Bangladesh, my study area will be in Dhaka.

Lola says

Thanks ever so much Jim. This is so helpful and insightful. I am so appreciative!!! thanks!!!

Jim Frost says

You’re very welcome Lola! Best of luck with your research!

Lola says

Dear Jim

You mention that “It is standard practice to use the coefficient p-values to decide whether to include variables in the final model”

Pls I would like to understand the following

(1) What do you mean by ‘final model’?

(2) Does it mean that the regression has to be re-run with the insignificant variable taken out, if that variable’s coefficient is NOT statistically significant??

(3) If the variable is left in the model as it is, how detrimental could this be in reality? Or is there a possibility that that one ‘problematic’ variable will not matter much if it’s just that one?

Jim Frost says

Hi Lola,

By final model, I just mean the model that researchers consider to be the best model. Researchers will often remove variables that are not significant. If you leave too many insignificant variables in the model, the model is less precise. And, yes, removing a variable means that you re-run the model without that variable. However, you don’t have to remove insignificant variables. Sometimes you want to leave them in because you are specifically testing them. Or, perhaps theory suggests it’s important even if the p-value indicates otherwise. Leaving in some insignificant variables will generally not reduce precision too much. And, if you’re not sure that it is unimportant, it’s often better to leave a variable in. Removing an important variable is potentially more problematic than leaving in a variable that’s not important. Again, if you leave in too many unimportant variables, it can reduce precision.

So, it’s a balance. But, leaving in one insignificant variable shouldn’t usually be a problem unless you have a small sample size for the complexity of your model. For more information, read my post about choosing the correct model.

Harry says

Hi Jim,

Thank you for the helpful guide. I was wondering whether the p-value for the dependent value is important and if this also has to be below 0.05 for the null hypothesis to be rejected?

Harry

Jim Frost says

Hi Harry, you only get p-values for the independent variables and constant. You don’t get one for the dependent variable.

Sami says

Hi, what if one of the independent variables takes negative and positive values. How can we interpret the coefficient associated to such a variable (I mean its effect on the dependent variable) ? And what if this variable takes only negative values ?

Jim Frost says

Hi Sami, if you have a negative coefficient and a positive coefficient, that just indicates that each independent variable has a different type of relationship with the dependent variable. For the IV with the positive coefficient, you know that as that IV increases in value, the DV also tends to increase in value. There’s a positive correlation between that IV and DV. For the IV with a negative coefficient, you know that as that IV increases, the DV tends to decrease in value. There’s a negative correlation between that IV and DV.

One thing you write confuses me. If “one of the independent variables takes negative and positive values.” One IV can’t have both positive and negative values in one model. It has just one value, whether it’s positive or negative. Unless you mean you’re fitting different models with different variables and they changed based on the specific combination of IVs in the model. That’s a different matter. If that’s the case, let me know!

Nikhil Talwar says

Thanks for the explanation Sir. I have one basic question on interpretation of Beta Values ( coefficient of independent variables). If the independent variables are categorical/qualitative then how do we interpret?

Example let’s say there are 3 categorical independent variables

1.Marital status – Married

2.Marital status – single

3.Marital status – Widower

with Beta Values -1.233;9.234;-2.878 respectively

Then how do we interpret these? Assume the dependent variables is the premium rate

Jim Frost says

Hi Nikhil,

In regression, you interpret the coefficients as the difference in means between the categorical value in question and a baseline category. So, you have to know which category is the baseline. The output should indicate. If it doesn’t state it explicitly, it’s the category that is not listed in the output or does not have a coefficient value. The associated p-value allows you to determine whether the mean difference between a category and the baseline category is not zero.
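As a rough illustration of the baseline idea, the sketch below uses made-up premium rates and assumes Married is the baseline category. When the categorical variable is the only predictor in the model, each coefficient equals that category's mean minus the baseline mean, so plain group means are enough to show the pattern.

```python
from statistics import mean

# Made-up premium rates grouped by marital status.
premiums = {
    "Married": [100.0, 104.0, 102.0],
    "Single": [110.0, 114.0, 112.0],
    "Widower": [98.0, 99.0, 100.0],
}
BASELINE = "Married"  # assumption: the category left out of the output

baseline_mean = mean(premiums[BASELINE])

# Each coefficient is the difference between a category mean and the baseline mean.
coefficients = {
    level: mean(values) - baseline_mean
    for level, values in premiums.items()
    if level != BASELINE
}
print(coefficients)
```

Here the Single coefficient is 10 and the Widower coefficient is -3, meaning mean premiums are 10 units higher and 3 units lower than the Married baseline. The baseline itself gets no coefficient. With other predictors in the model, the coefficients become adjusted differences rather than raw differences in group means.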

I write much more about categorical variables in my regression ebook.

Ghada MGA says

Hi,

Your explanation was very helpful for understanding regression. I have some data to interpret. The t-values of the predictors are all statistically insignificant (p > 0.05), but the F-statistic of the whole model is significant!

Is it possible to have a situation like that?

Jim Frost says

Hi Ghada,

Yes, it is possible! In my post about the overall F-test of significance, I write about that type of situation and why it occurs. Read that post and see if it answers your questions.

Danijela B. says

Hello Jim,

I found your explanation to be very thorough and easy to follow. I have a dilemma and question that I am hoping you can answer. I am expecting a positive sign, and my results show a negative coefficient but statistically significant. Am I right to interpret it as:

The results show that there is statistical significance because the p-value is 0.0108, which is less than the significance level of 0.10. The sample data provide enough evidence to reject the null hypothesis. Beta values take into account standard errors, which are used to determine if the value is significantly different from zero by evaluating the t-statistic value. For the model, the beta value is -1.660618, the t-value is -2.561538, and the p-value is 0.0108. This suggests that this variable is significant, and further explains that the IV negatively affects the DV, and the relationship is significant. The coefficient value is -1.660618, which indicates that when the independent variable increases by one unit, the dependent variable will decrease by 1.660618.

My professor said that I cannot reject the null hypothesis if the sign is not what I was expecting even though the p-value is significant. I know you talk about omitted variable bias in some of the previous comments, but assuming I have the right model and fit, am I interpreting this correctly?

Jim Frost says

Hi Danijela,

I have to agree with your professor. There are several ways you can get erroneous coefficients. Omitted variable bias is one key way. You might also be overfitting your data or have multicollinearity (although you probably would’ve had insignificant results then). It’s also possible that you’re not accounting for curvature. Having significant variables doesn’t mean that you’re fitting the correct model. So, I wouldn’t just assume that you have the right model.

I write about these exact cases (unexpected coefficient signs) in the section about Theory in my post about model specification. What I’d recommend is checking your residual plots and doing research to see what others have found. What variables did they use? At the very least, you’ll need to have an explanation for why the unexpected sign is correct.

There are also other possibilities such as overfitting or data mining.

And about whether you’re interpreting the results correctly. What you write is the correct interpretation of the statistical output. The problem is that your coefficient is probably biased. Imagine you’re looking at a clock that runs slow. You can correctly read the clock and know that it says 9AM. However, if you have a meeting at 9AM, you’ll be late! You read the clock correctly, but didn’t make sure that the clock was running correctly. In your case, you’re interpreting the statistical output correctly. However, there’s probably some sort of assumption violation that is biasing your results. So, definitely kudos for the great interpretation. I don’t think most students would explain the results nearly so well as you have! But, we need to see what’s going on with the assumptions and how the model actually fits the data.

Best of luck with your analysis! 🙂

Dustin says

Thank you, Jim, for the very helpful blog post!

I had a couple of questions:

1. Assume we have a linear model with many predictors, and that their corresponding p-values confirm significance. If Beta X1 > Beta X2, can we simply state that X1 has a greater positive impact on y, or do we need to do additional testing to compare the two coefficients?

2. Using that same model, let’s say we wanted to compare a subset of predictors with another subset of predictors (all within the same model), and we wanted to prove that X1, X3, X4 collectively have a greater positive impact on y than X2, X5, X6. My instinct is to find the average of the coefficients for each group and compare the two averages, but I have a feeling it’s not that simple. Do I need to re-create the model with those predictors grouped together or how could I prove this?

Cheers

Jim Frost says

Hi Dustin,

Determining the relative importance of the predictors can be difficult. But, you’re in luck! I’ve written a post about that. Identifying the Most Important Variables in a Regression Model. Read that post, and if you still have questions post them there!

Marisol says

Hi Jim, thank you for your post.

I’m wondering if regression includes both effect size and significance tests.

I think it does include effect size given that there are several ways to measure effect size in a regression analysis, including through the correlation coefficients, regression coefficients, partial and semi-partial coefficients, squared coefficients, and proportions of variance.

But I’m unsure if it includes significance tests. What do you think?

Jim Frost says

Hi Marisol,

Coefficients are the effects. P-values for the coefficients are the significance tests for the effects.

Ivo Brito says

I’m a master’s student, currently developing my thesis on Post-Earnings Announcement Drift (a Euro Stoxx 50 analysis) between 2012 and 2017. I’ve defined the event window (-20,0,20) and computed the normal returns and market model parameters for each firm in my study (51 companies). However, the beta parameters of all aren’t statistically significant (p-value > 5%), which makes my study irrelevant. I don’t know if I did my calculations wrong or if it’s how the sample is, but I don’t think the calculations are wrong since I tested them in both Excel and EViews.

Can I still find something worth mentioning in the fact that all the betas of my companies are not statistically significant?

Jim Frost says

Hi Ivo,

Several questions. Have you checked the residual plots? And, were any of your IVs significant? (I’m not quite clear if you’re saying that they all are not significant.)

Sarah says

Hi Jim, for my regression analysis, my p-value is significant but my standardised beta value is -.13. My prediction was that when there is an increase in the value of my predictor variable there would be an increase in the value of the dependent variable as well. Does the standardised beta value affect my results in any way, even though the predictor was significant?

Jim Frost says

Hi Sarah,

A coefficient for a standardized independent variable represents the mean change in the dependent variable given a one standard deviation change in the independent variable. The sign for a standardized variable will match the sign for an unstandardized variable. In your case, the negative sign indicates that as the IV increases the DV tends to decrease, which is a negative relationship.
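The link between the two versions of a coefficient can be checked with a small sketch. The data are made up and `ols_slope` is a hand-written helper. Only the independent variable is standardized here, matching the description above.

```python
from statistics import mean, stdev


def ols_slope(x, y):
    """Least-squares slope for a simple regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


x = [2.0, 4.0, 6.0, 8.0, 10.0]  # made-up IV
y = [9.0, 8.5, 7.0, 6.5, 5.0]   # made-up DV that falls as x rises

# Standardize the IV: subtract its mean and divide by its standard deviation.
x_std = [(xi - mean(x)) / stdev(x) for xi in x]

raw = ols_slope(x, y)
standardized = ols_slope(x_std, y)

print("raw coefficient:", raw)
print("standardized coefficient:", standardized)
```

The standardized coefficient equals the raw coefficient multiplied by the standard deviation of the IV, so the magnitude changes but the sign cannot flip. Both values are negative for these data.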

If theory/other research suggests that there is a positive relationship (they both tend to increase together), you should investigate. I talk about this in my post about choosing the correct regression model. Look in the section about Theory.

Omitted variable bias (aka confounding variables) might be the reason. Perhaps your model excludes an important variable that is correlated with both the DV and IV in question? For more information, read my post about omitted variable bias.

Best of luck with your analysis!

Almoghirah says

Thanks Jim for your valuable comments and clear answers.

I read the section on prediction carefully because I’m interested in the practical use of regression analysis. I have data for the cost of different medical tests, so I regressed cost against the number of patients who had the test and the price of the test. Although the model fits well, I found the prediction from the coefficients different from reality. To be more specific: the model tells me that when the price increases by one unit the cost increases by 8495 units, holding the number of patients who had the test constant, but when I used Excel and increased the price by one unit for each test I found the cost was different. Am I wrong?

Almoghirah says

Dear Jim

Thanks again and again

I would like to inquire about the characteristics of the data set for regression to be fulfilled before running the model in a computerized program.

Jim Frost says

Hi Almoghirah,

Regression analysis is interesting in terms of checking the assumptions. For other analyses, you can test some of the assumptions before performing the test (e.g., normality, equal variances). However, for regression analysis, the assumptions typically relate to the residuals, which you can check only after fitting the model. For specifics about these assumptions, read my post about least squares assumptions.

Almoghirah says

Dear Jim

Thanks a lot for the very, very nice and clear explanation. This is more than useful.

Nik says

Dear Jim!

Nice presentation!

Please explain me one issue. After logistic regression analysis I have found p=0.34, OR=1.4, 95%CI 0.9-3.4.

I understood that independent X does associated with outcomes Y (p=0.34). But OR=1.4 and it included in CI. How can I explain it?

Thank you.

Jim Frost says

Hi Nik,

The results of your model don’t show that there is a relationship between your IV and DV. The p-value indicates this because it is higher than any reasonable significance level. Additionally, the CI for the odds ratio (OR) includes one. In short, your results are not statistically significant. Your sample data do not provide strong enough evidence to conclude that this relationship exists in the population. However, keep in mind that non-significant results do not prove that the effect/relationship doesn’t exist. Just that your sample didn’t provide strong enough evidence to conclude that it exists. It could be that the sample size is too small or there’s too much variability in the data. Or, perhaps you need to include more variables in the model to control for potential confounding variables.
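The confidence interval check for an odds ratio can be written as a short sketch. The numbers are the ones from the question (OR = 1.4, 95% CI 0.9 to 3.4). The function name is hypothetical, and the no-effect value of 1 is the standard one for a ratio.

```python
def ci_excludes_no_effect(ci_low, ci_high, no_effect_value=1.0):
    """Return True when the confidence interval excludes the no-effect value."""
    return not (ci_low <= no_effect_value <= ci_high)


# Values from the question: the 95% CI for the odds ratio runs from 0.9 to 3.4.
print(ci_excludes_no_effect(0.9, 3.4))  # False: the interval contains 1
```

Because an odds ratio of 1 means no association, an interval that contains 1 agrees with the non-significant p-value of 0.34.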

Riya Jain says

Hey thanks for the post !

I have a doubt

what happens if my X variable coefficient is -0.647042012003429 and my significance level is 1.70654E-15

Jim Frost says

Hi Riya,

The negative coefficient indicates that for every one-unit increase in X, the mean of Y decreases by the absolute value of the coefficient (0.647042012003429).

Your p-value is displayed using scientific notation. You need to move the decimal point to the left 15 places, which produces a very, very small p-value. Your results are statistically significant, which means you can conclude that your coefficient is significantly different than zero (i.e., an effect exists in the population).
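If scientific notation is unfamiliar, a couple of lines of Python will write the number out in full. The value is the one from the question, and 0.05 is the usual significance level assumed throughout this post.

```python
p_value = float("1.70654E-15")  # the value as the software printed it

print(f"{p_value:.20f}")  # the same number written as an ordinary decimal
print(p_value < 0.05)     # True: far below the usual significance level
```

The first line of output shows fourteen zeros after the decimal point before the first non-zero digit, which is why the result is so clearly significant.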

Anurag Maheshwari says

Amazing post Jim!! Thank you for the detailed elaboration.

I just have one doubt that how can we choose one among many equal performing linear models on test dataset?

Example:

Consider two linear models:

L1: y = 39.76x + 32.648628

and

L2: y = 43.2x + 19.8

Given the fact that both models perform equally well on the test data set, which one should we prefer, and why?

Jim Frost says

Hi Anurag,

Both of your models have the same IV (X) and same DV (Y), yet they’re producing different estimates. What is going on there? If you fit the same model to the same dataset, you should get the same estimates. Or, are these estimates based on different datasets? Unfortunately, there’s crucial information missing from your question.

In general, it is possible to get conflicting information about which model is best. Read my post about how to choose the correct model for many tips!

Sania Gul says

hello,

I have a few questions, if you could please answer them.

1: what does it mean when the t-value is negative?

2: in mediation when the direct relationship is significant and after adding mediator the indirect relationship become insignificant what kind of mediation is this? zero , full or partial.

3: what is meant by zero mediation?

looking for ur kind response

thanks

Jim Frost says

Hi Sania,

You get a negative t-value when the regression coefficient is negative. The absolute value of the t-value determines whether the test is significant for the typical two-sided test. Usually you can just assess the p-value, which is based on the t-value.

A mediator (M) explains the underlying mechanism of the relationship between an independent variable (X) and dependent variable (Y). When you have a mediator, it means that X has a relationship with M, and M has a relationship with Y.

First, test to see if there is a significant relationship between X and M. If that relationship exists, you can then fit a model that includes both X and M as independent variables and use Y as the dependent variable.

In the model X + M –> Y, if the effect of X on Y completely disappears and M is statistically significant, M fully mediates X and Y. In other words, there is no direct relationship between X and Y at all. It all works through the mediator.

However, if the effect of X on Y still exists with M in the model but it is smaller than in the X –> Y model, and M has a significant relationship with Y, then M partially mediates X and Y. This condition indicates that some of the observed relationship between X and Y exists directly and some of it exists through mediation. In practice, partial mediation is more common than full mediation.

Zero mediation indicates that the relationship fully exists through the direct relationship between X and Y. Zero mediation exists when there is no relationship between X and M and/or no relationship between M and Y. For any mediation to exist, both the X/M and the M/Y relationships must be significant.
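The comparison described above can be sketched with simulated data. Everything here is an assumption made for illustration: the variable names, the sample size, and the path strengths (a partial-mediation setup in which X affects Y both directly and through M). It only shows the model comparisons, not a formal test of the indirect effect.

```python
import numpy as np
from scipy import stats

def ols(X, y):
    """OLS with an intercept; returns coefficients and two-sided p-values."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    df = len(y) - X.shape[1]
    se = np.sqrt(np.diag(resid @ resid / df * np.linalg.inv(X.T @ X)))
    return beta, 2 * stats.t.sf(np.abs(beta / se), df)

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
m = 0.7 * x + rng.normal(size=n)             # simulated X -> M path
y = 0.2 * x + 0.5 * m + rng.normal(size=n)   # direct X effect plus M -> Y path

b_xm, p_xm = ols(x, m)                             # is X related to M?
b_total, p_total = ols(x, y)                       # X -> Y without the mediator
b_full, p_full = ols(np.column_stack([x, m]), y)   # X + M -> Y

print("X -> M:", b_xm[1], p_xm[1])
print("X -> Y alone:", b_total[1], p_total[1])
print("X with M in the model:", b_full[1], p_full[1])
print("M with X in the model:", b_full[2], p_full[2])
```

In this setup the X coefficient shrinks once M enters the model while M stays significant, which is the partial-mediation pattern. Setting the direct effect (the 0.2) to zero would move the simulation toward full mediation.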

If I understand your scenario correctly, you’re saying that the relationship between X and Y is significant. Then, you add M to the model, which is not significant. That indicates there is no relationship between M and Y, which would be zero mediation.

I hope that helps!

Joe Stringer says

Hello Jim,

I appreciate your blogs and have shared them with numerous friends. Thank you for sharing your knowledge in a way so easily understandable! I am looking forward to purchasing your book.

Following a regression, an IV was found to be significant. When graphing the relationship however, the slope appears to be very close to 0. I am unsure how to interpret this. What would you recommend?

Thank you!

Jim Frost says

Hi Joe,

Thank you so much for your support. I really appreciate it!! 🙂

There are three possibilities that come to mind.

Those would be the first possibilities I look into!

Johann Bachelor says

Thank you!

Explanations are way better than most other sources!

Furb says

Hi Jim,

I have a problem. My logistic regression (ordinal data (sleep hours) on mental health (binary) appears to have U-Shaped relationship. This relationship is significant, however, colleagues tell me that the linear relationship is untrustworthy and I should use Curvilinear?

Can i trust the results?

best

F

Jim Frost says

Hi Furb,

If you’re using a linear relationship to model a curved relationship, then you can’t trust the results. I write about this problem and the need to fit curvature properly in my post about curve fitting. While I write about it in the context of a continuous dependent variable and you’re talking about a binary dependent variable, the ideas are the same.

You should check the residual plots for your model. If your model doesn’t fit the data, you’ll see it in those plots. So, check those plots out and they’ll help you answer your question about whether you can trust the results.

Best of luck with your study!

Michelle says

Hi Jim,

I’m running a regression with quite a small sample size due to data limitations, n=200. Co-efficients are very small, p-values very large, and r-squared very small. Ultimately, I’d like to conclude there is a very weak or almost non-existent relationship. Can I do so if my results are not significant?

Thanks,

Michelle

Jim Frost says

Hi Michelle,

The large p-values indicate that your sample data do not provide sufficient evidence of a relationship between the independent variables and the dependent variable. The low R-squared also indicates that your model explains a small proportion of the variability in the DV around its mean. Both of those suggest a weak or non-existent relationship. I’d also suggest that a sample size of 200 is not usually considered small, although that depends on the complexity of the model and other issues such as the presence of multicollinearity.

There’s no general rule for determining the strength of the relationship by the size of the coefficients. You’ll need to compare the coefficient values to values that would be considered weak, medium, and strong. Those values will vary by subject area. Additionally, the size of the coefficients will depend on the measurement units. For example, suppose you have an IV that is weight and you measure it in both grams and kilos. If you fit one model with grams and the other with kilos, the coefficient in the model that uses kilos will be 1000 times greater than the coefficient in the model that uses grams. That doesn’t mean that weight in grams, with a coefficient 1/1000 the size, is less important. Both models indicate the same effect size, but the units affect the size of the coefficients. However, because your p-values are not significant, you can’t conclude that the coefficients are significantly different from zero anyway.
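Here is a small sketch of the grams versus kilos point, using simulated data. The weights, the response, and the helper name `slope_and_t` are all hypothetical. The same IV is fit once in grams and once in kilos, and the slope changes by a factor of 1000 while the t-value stays the same.

```python
import numpy as np

def slope_and_t(x, y):
    """Simple regression of y on x with an intercept; returns the slope and its t-value."""
    X = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)
    se = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))
    return beta[1], beta[1] / se

rng = np.random.default_rng(2)
weight_g = rng.uniform(1000, 5000, size=200)        # hypothetical weights in grams
y = 3.0 + 0.002 * weight_g + rng.normal(size=200)   # hypothetical response
weight_kg = weight_g / 1000                         # same weights in kilos

b_g, t_g = slope_and_t(weight_g, y)
b_kg, t_kg = slope_and_t(weight_kg, y)
print(b_g, b_kg, t_g, t_kg)
```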

I hope this helps!

Rosie says

Dear Jim,

Thank you very sincerely for your time and your kind explanation.

I like your book, and I introduced it to my friends, too.

Have a nice day!

Rosie says

Dear Jim,

Thank you sincerely for your time and your kind explanation.

I like your book.

Have a nice day!

Rosie says

Hello Jim,

Sorry that I have 2 more questions for you.

1) As far as I know, with sample size of few hundreds, it’s normal to have few outliers. However, when I tried removing outliers, I got 1 more predictor significant. Thus, could you please kindly advise me should I remove outliers in this case?

2) I got the Mahal. Distance’s maximum value equals 52.361 which is far higher than the critical value (11.07) of df=5 (as I have 5 predictor variables) taken from Chi-squared distribution table at 0.05 alpha level. This indicates there are outliers which may place undue influence on the model.

– Whether my above understanding is correct?

– I tried removing the outliers by running “Select cases” with condition of “MAH1<11.07" and run the regression again. But then I still see the Mahal. Distance's maximum value equals around 15. Although it is already much lower but it is still higher than the critical value of 11.07. So can I stop with this lower value of Mahal. Distance and go ahead with interpreting the regression results, or I still need to do something else regarding removing the outliers?

Thank you so much for your kind explanation so far. I really appreciate it.

Rosie

Jim Frost says

Hi Rosie,

When you have a sample of that size, it’s typical for outlier tests to find a few outliers. However, that doesn’t mean those values are actually outliers. If you use these tests, you should consider the values as candidates that you need to investigate. Don’t assume that just because a test identifies values as being outliers that they are actually outliers. You don’t want to automatically remove outliers based on statistical tests only. Additionally, rerunning outlier tests after removing outliers can be problematic in some cases. Instead, you’ll need to investigate each outlier candidate and determine whether you should remove them based on what you find out and subject area knowledge. If you do remove an outlier, you need to be able to explain why for each data point.
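To see why a test-based cutoff flags some points even in clean data, here is a small simulation sketch. The data are simulated with no contamination at all, and the sample size of 300 is an assumption. The cutoff is the 95th percentile of a chi-squared distribution with 5 degrees of freedom, which is the 11.07 value quoted in the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, k = 300, 5                  # hypothetical sample size; 5 predictors as in the question
X = rng.normal(size=(n, k))    # clean simulated data with no true outliers

critical = stats.chi2.ppf(0.95, df=k)   # about 11.07 for df = 5

# Squared Mahalanobis distance of each row from the sample mean.
diff = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

flagged = int((d2 > critical).sum())
print(round(critical, 2), flagged)
```

With a 5% cutoff, you would expect roughly 5% of perfectly ordinary observations to exceed it, which is why flagged values are only candidates to investigate.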

It’s not surprising that removing outliers made a predictor become significant. By removing unusual values you’re reducing the variability in your data, which tends to increase statistical power. However, that doesn’t indicate that removing the values is the correct approach. Again, you’ll need to make that determination on a case-by-case basis.

I’ve just recently written two posts about outliers that you’ll probably find helpful. These posts aren’t written from the regression point of view but the general approaches are still applicable. Read Five Ways to Identify Outliers and Determining Whether to Remove Outliers.

Additionally, outliers are more complicated in regression because there are a variety of ways that an observation can be unusual. I cover this in detail from the regression perspective specifically in my ebook, Regression Analysis: An Intuitive Guide. If you haven’t bought it already, you should consider getting it.

I hope this helps!

Rosie says

Dear Jim,

Thank you for your response.

A nice day to you.

Rosie says

Dear Jim,

I would like to consult you on the conflict results that Pearson correlation and Multiple Regression test produce.

For example, my hypothesis is:

H1: There is a positive relationship between subjective norms and purchase intention for eco-products.

If my Pearson correlation test shows that there is a positive relationship between these 2 variables, but my regression test shows that subjective norms and purchase intention are not significant (I have several independent variables in multiple regression analysis and “subjective norms” is one of them. In my regression test, “purchase intention” is outcome variable).

So is it correct if I made conclusion for my hypothesis H1 based on result of Pearson correlation test; and for multiple regression result, I just can say and discuss that “subjective norms” is not an effective predictor of “purchase intention”?

(As Pearson test and Regression test show conflict results so I wonder for only hypotheses, conclusion should be based on which test.)

Thank you so much.

Jim Frost says

Hi Rosie,

This discrepancy sounds like a form of omitted variable bias. You have to remember that these two analyses are testing different models. Pairwise correlation only assesses two variables at a time while your multiple regression model has at least two independent variables and the dependent variable. The regression model tells you the significance of each IV after accounting for the variance that the other IVs explain. When a model excludes an important variable, it potentially biases the relationships for the variables in the model. Hence, omitted variable bias. For more information, read my post about omitted variable bias. That post tells you more about it along with conditions under which it can occur.

In your case, the Pearson correlation is essentially a model with one IV and the DV whereas your multiple regression model contains multiple IVs. The difference is the number of IVs. While I can’t say whether either model is correct, I’d lean towards your multiple regression model because it controls for additional variables. Of course, you’ll have to be sure that the model and its results make theoretical sense and that the residual plots look good.
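A simulated sketch of how this discrepancy can arise. The data and variable names are hypothetical: `x1` has no effect of its own on `y`, but it is correlated with `x2`, which does. The pairwise correlation picks up the shared variation, while the multiple regression coefficient for `x1` is estimated after accounting for `x2`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 500
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + 0.6 * rng.normal(size=n)   # x1 is correlated with x2
y = 2.0 * x2 + rng.normal(size=n)          # only x2 truly affects y

# Pairwise view: correlation between x1 and y, ignoring x2.
r, r_p = stats.pearsonr(x1, y)

# Multiple regression view: y on x1 and x2 together.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
df = n - X.shape[1]
se = np.sqrt(np.diag(resid @ resid / df * np.linalg.inv(X.T @ X)))
p = 2 * stats.t.sf(np.abs(beta / se), df)

print("correlation of x1 with y:", r, r_p)
print("x1 coefficient controlling for x2:", beta[1], p[1])
print("x2 coefficient controlling for x1:", beta[2], p[2])
```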

I hope that helps!

KB says

Hello Jim,

Really really helpful blog, still getting my head around multiple regression statistics so nice to find someone who simplifies and is clear.

I have a question. I have an ANOVA F value of 0.06. Both my variables have negative Beta coefficients with first P=0.02 and the second P=0.07. I understand this means the variables relationship with the dependent is inverse, but is it normal to have a good F value and one variable to be deemed not statistically significant.

Grateful for any guidance

KB

Jim Frost says

Hi KB,

It sounds like you’re referring to the Overall F-test of Significance. Click that link to read a post I’ve written about it, which discusses the type of situation you’re experiencing. Read that post and if you have more questions, don’t hesitate to post them there!

Rosie says

Dear Jim,

Thank you very sincerely for your quick response and clear explanation!

This is the most helpful site I’ve ever found!

Rosie says

Dear Jim,

Thank you so much for your post!

Could you please kindly help me with the following question:

The p-value of my ANOVA test is smaller than 0.05, revealing a statistical finding that there is a linear relationship between dependent variable and independent variables. However, the p-values of all independent variables in “Coefficients” table show that among five independent variables, only 2 have a statistically significant impact on the outcome variable. Is it possible? (Because I think that if ANOVA test shows a statistical finding that there is a linear relationship between dependent variable and independent variables, there also should have statistically significance for all independent variables)

(By the way, R-square I got = 0.316, showing that 31.6% of the variance in the dependent variable is explained by the independent variables. Is this % too low?)

With great thanks again!

Jim Frost says

Hi Rosie,

I’m assuming the p-value you’re referring to is for the F-test of overall significance. Click that link for a post I’ve written about that test specifically. In a nutshell, when that test is significant, it indicates that your model predicts the dependent variable significantly better than just using the mean of the dependent variable. In other words, your model explains the variability of the dependent variable around its mean better than a model with no independent variables. While that means your model has some explanatory power, it doesn’t guarantee that all of the independent variables in your model are individually significant. The test assesses the collective effect of all the independent variables. For example, if your overall F-test is significant and then you add another independent variable to the model that has no relationship with the dependent variable, your overall F-test is still likely to be significant.

So, yes, it’s quite possible to have a significant F-test for the entire model but have some independent variables that are not significant.
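As a rough sketch of that situation, the simulation below builds a model with five IVs where only the first one truly affects the DV. The data, sample size, and effect size are assumptions for illustration. It computes the overall F-test from the sums of squares and the individual t-test p-values from the coefficient standard errors.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, k = 100, 5
X = rng.normal(size=(n, k))
y = 1.0 * X[:, 0] + rng.normal(size=n)   # only the first IV truly affects y

Xc = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xc, y, rcond=None)
resid = y - Xc @ beta
df_resid = n - k - 1

# Overall F-test: does the model beat an intercept-only (mean) model?
sse = resid @ resid
sst = np.sum((y - y.mean()) ** 2)
f_stat = ((sst - sse) / k) / (sse / df_resid)
f_p = stats.f.sf(f_stat, k, df_resid)

# Individual t-tests for each coefficient.
se = np.sqrt(np.diag(sse / df_resid * np.linalg.inv(Xc.T @ Xc)))
t_p = 2 * stats.t.sf(np.abs(beta / se), df_resid)

print("overall F p-value:", f_p)
print("p-values for the five IVs:", t_p[1:])
```

The overall test picks up the one real effect, while the IVs that are pure noise will typically show large individual p-values.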

As for the R-squared, I’ve written several posts for that. You should read one about how high does R-squared need to be. You’ll find it varies depending on your subject area and the purpose of your model. Also read my post about low R-squared values and how they can provide important information.

Best of luck with your analysis!

Nadal Merquez says

Hi Jim Frost,

Thanks for your post and the amazing books! They have been really helpful. However, I’d like to ask you two pressing questions regarding the use of p-values.

1. I do not see how the assumption of normality of the error term is needed in order to make use of p-values. For the derivation of the asymptotic normality of the estimators, the normality of the error term is not needed. Could you elaborate why the normality of the error term is needed in order to make use of the p-value?

2. I noticed from computing robust regression methods in R that the p-value is usually not given. Do you know what complicates the derivation of the p-value in the case of robust regression models? How would one know if coefficients are significant in the case of robust regressions?

I’d love to hear from you!

Nadia

Jim Frost says

Hi Nadia,

I’m glad my posts and my books have been helpful! I really appreciate you supporting my books! On to your questions!

1. The distribution of the error term is intrinsically tied to the sampling distribution of the coefficient estimates. One of the properties of the normal distribution is that any linear function of normally distributed variables is itself normally distributed. Given this property, it’s not difficult to prove mathematically that normally distributed error terms imply that the sampling distributions of the coefficient estimates are also exactly normal, whatever the sample size. If the error distribution is nonnormal, that exact result no longer holds, and in small samples the hypothesis tests based on it might not be valid. As you note, with a sufficiently large sample the estimators are asymptotically normal, so the tests remain approximately valid even without normal errors.

2. Unfortunately, I don’t have much experience using robust regression. As I understand it, robust regression first performs OLS, analyzes the residuals, and then reweights the observations based on the residuals. The fact that the residuals are random means that the weights themselves are random. Weighted regression assumes that the weights are fixed. Hence, the problem. I gather there is a procedure to work around that to produce hypothesis tests and CIs. However, there are criticisms that the procedure or analysts need to specify a scaling factor and tuning constant, which can cause large changes in the results. That’s the extent of my knowledge on that!

I hope that helps!

Robiul says

I have got my R square .997 and adjusted R squared is .995 is that bad /or how can i reduce the value ?

Jim Frost says

Hi Robiul,

There’s no general rule whether that’s good or bad. You’ll need to use subject-area knowledge as well as knowledge about your model fitting process to make that determination. It could be good if your study area has low noise measurements and it involves something that is inherently very predictable (such as modeling physical laws). But, it could represent something like overfitting your model, which indicates that the R-squared is too high and your coefficients are likely invalid.

I’ve written a post about why your R-squared might be too high. That post will help you answer this question.

nah says

Thank you Jim

you help us a lot.

Juston Shen says

Hi Jim,

Page 284 of the Regression Analysis book mentions effect size, statistical significance, and practical significance. Could you let us know the difference between statistical significance and practical significance? How many types of effect size are there in regression analysis?

Jim Frost says

Hi Juston,

First, thanks so much for supporting my ebook. I really appreciate that!

There are really two primary measures of effect size for regression coefficients. The first is the raw regression coefficient. The coefficient tells you how much the DV changes given a 1 unit increase in the IV. Of course, you have to be careful about determining causality. It might just be an association but not causation. I cover causation vs. correlation in detail in my new Introduction to Statistics ebook by the way.

Another way to look at it is the standardized coefficient, which I also write about in my regression ebook. The standardized effect size is better for comparing the magnitude of effect across different types of IVs. This measure tells you how much the DV changes given a 1 standard deviation change in the IV. Because it’s all on a common standardized scale, you can compare the coefficients.
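Here is a minimal sketch of the two measures side by side, using simulated data. The variables `income` and `age`, their scales, and their effects are all hypothetical. The standardized version rescales each IV by its standard deviation before fitting, so each coefficient is the change in the DV per one standard deviation change in that IV.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
income = rng.normal(50_000, 12_000, size=n)   # hypothetical IV measured in dollars
age = rng.normal(40, 10, size=n)              # hypothetical IV measured in years
y = 0.0002 * income + 0.1 * age + rng.normal(size=n)

def fit(X, y):
    """OLS with an intercept; returns the coefficient vector."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

X_raw = np.column_stack([income, age])
sd = X_raw.std(axis=0, ddof=1)
X_std = (X_raw - X_raw.mean(axis=0)) / sd     # standardize the IVs only

raw = fit(X_raw, y)[1:]   # DV change per 1 unit of each IV
std = fit(X_std, y)[1:]   # DV change per 1 standard deviation of each IV
print("raw coefficients:", raw)
print("standardized coefficients:", std)
```

In this setup the raw coefficient for age is far larger than the one for income simply because of the units, while the standardized coefficients put the two on a comparable scale. Each standardized coefficient equals the raw coefficient multiplied by that IV's standard deviation.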

Finally, for the question about statistical and practical significance, let me point you to a blog post that I’ve written all about practical vs. statistical significance. In a nutshell, statistical significance is all about whether your sample provides enough evidence to conclude that the effect exists in the population. Practical significance is about whether the estimated size of that effect is large enough to be meaningful. That’s based on subject-area knowledge and can’t be computed mathematically. Anyway, read the post on it!

I hope this helps!

eric godson says

sir what if all the result shown in the T test shows negative sign or the significant is greater than 0.05

Jim Frost says

Hi Eric,

A negative t-value just means the coefficient is negative. If a negative coefficient is statistically significant, it indicates that as that independent variable increases, the mean of the dependent variable decreases.

I’ve written a post about t-values. It’s written in the context of t-tests for assessing group means. However, the same principles apply to t-tests in regression analysis. I suggest you read the following post, and when I write about group means, just think about regression coefficients (which are a type of mean: a mean change in the DV). Read about t-values and t-distributions.

Omoleye Ojuri says

Hi Jim,

You are a great teacher Jim. The use of simple languages and expressions fascinated me to your website. Please just a quick one. My case is MRQAP model, do I have to plot residual plots to indicate the fit of the MRQAP model? And if yes, please what are the values to use to compute the residual plots (unstandardised coefficients, standardised coefficients etc). Or are p values and R square enough to indicate the fitness of the MRQAP model.

Omo

Jim Frost says

Hi Omo,

I have to apologize, but I don’t know MRQAP models well enough to provide an answer. I just looked into them and they sound interesting. I will need to learn more!

Julie says

Hie Jim

I just stumbled on your postings and found them to be extremely useful. Kindly help me with something. If you found a variable to be statistically insignificant for your final panel regression model, can you explain the coefficient of the insignificant variable, or once the variable is insignificant then the coefficient sign is not to be considered? I found one variable to be statistically insignificant but its coefficient sign supports previous studies.

Jim Frost says

Hi Julie,

There are several considerations here.

First, when the p-value is not significant, the coefficient is indistinguishable from zero statistically. In other words, your sample provides insufficient evidence to conclude that the sample effect exists in the population. In that light, you don’t consider the sign.

However, there’s another question about leaving an insignificant variable in your model. Often analysts will remove insignificant variables from the model. In your case, you have theoretical expectations that this particular variable is relevant and the sign is consistent with expectations. Removing this variable would potentially bias the other coefficients. Consequently, I’d leave the variable in the model even though it is not significant. While it’s not good to include too many insignificant variables in the model (reduces the precision), it can be worse to remove one relevant variable, even when not significant, because it can bias the model.

In the write up, I’d explain that you left the variable in the model because of theoretical expectations and not wanting to bias the model. However, your sample doesn’t provide additional support for the effect of this variable.

I talk about some of these issues in my post about choosing the correct regression model.

I hope this helps!

Angeles Dorantes says

Are the coefficients in the hierarchical beta regression interpreted in a similar way?

Jim Frost says

Hi Angeles,

Your question contains several terms, hierarchical and beta, that mean different things in different settings and software packages.

If you’re referring to hierarchical regression as the practice of entering independent variables in groups, such as a group of demographic variables followed by a group of variables you’re testing, yes, you interpret them the same. However, there is one caveat. If a group that is entered into the model later has a statistically significant IV, it’s possible that the earlier models without that significant variable have omitted variable bias.

Beta in SPSS refers to standardized coefficients. If that’s the case for your model, then you must use a different interpretation for these coefficients. Standardized coefficients represent the mean change in the DV given a one standard deviation change in the IV. I talk about why you might use standardized values in this post about identifying the most important variables in your model.

Jose Chvaicer says

Hi Jim, your articles have helped me understand a lot of previously unclear points. A question remains in mind however: I’ve been asked to force the intercept to pass by the zero point in spite of observed data giving a value for the “a” in Y= a+bx. What I noticed is that the residuals do change much for the modified model (Y=bx) . So what is the gain? What consequences are expected? What happens to the p-value?

Thank you.

Jim Frost says

Hi Jose,

In most cases you should NOT force the regression line to go through the origin (y intercept equals zero). The fact that you’re observing changes in the residuals suggests that you should not do this. The best case scenario is that forcing the line to go through does not change the residuals.

If you don’t fit the constant in your model, it forces the constant to equal zero. For more information, read my post about the regression constant. In that post, I show why it’s almost always good to include the constant in your model. I would say there are no benefits for excluding it. Excluding it can bias your coefficients and produce misleading p-values (check those residual plots). Excluding it also changes the meaning of the R-squared value. It almost always increases R-squared but it completely changes the meaning of it. You cannot compare R-squared values between models with and without the constant.
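To make the R-squared point concrete, here is a sketch with simulated data whose true intercept is far from zero. The data are hypothetical. For the no-constant model, the sketch uses the uncentered R-squared (based on the sum of squared y values rather than deviations from the mean), which is what many packages report when the constant is omitted.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 200
x = rng.uniform(0, 10, size=n)
y = 50 + 2 * x + rng.normal(scale=5, size=n)   # true intercept is far from zero

# Model with a constant: y = a + b*x
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
sse_const = np.sum((y - X @ beta) ** 2)
r2_const = 1 - sse_const / np.sum((y - y.mean()) ** 2)   # usual (centered) R-squared

# Model forced through the origin: y = b*x
b0 = (x @ y) / (x @ x)
sse_origin = np.sum((y - b0 * x) ** 2)
r2_origin = 1 - sse_origin / np.sum(y ** 2)   # uncentered R-squared

print("slope with constant:", beta[1], " slope through origin:", b0)
print("residual sum of squares:", sse_const, sse_origin)
print("R-squared:", r2_const, r2_origin)
```

The forced-origin model has a badly distorted slope and larger residuals, yet its reported R-squared is higher, because the two R-squared values are measured against different baselines.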

Rashid says

Where to know if Regression coefficient is not significant at 5, but at 10% or viceversa?

Hello Sir, I hope my question finds you,

In some articles Regression coefficients are mentioned to be significant at 5% level and some other predictors significant at 10% level. So, where to know if Regression coefficient is not significant at 5, but at 10%?

Jim Frost says

Hi Rashid,

The significance level is something that the researchers decide before they start the analysis. There are advantages and disadvantages to using higher and lower significance levels. I’ve written about significance levels in the context of hypothesis testing. In summary:

Higher significance levels (e.g., 0.10) require weaker evidence to determine that an effect is significant. The tests are more sensitive–more likely to detect an effect when one truly exists. However, false positives are also more likely.

Lower significance levels (e.g., 0.01) require stronger evidence to determine that an effect is significant. The tests are less sensitive. They are less likely to detect an effect when one exists. On the good side, false positives are less likely to occur.

Analysts often use a significance level of 0.05 as a compromise between the pros and cons of higher and lower values.
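This tradeoff can be sketched with a quick simulation. It uses one-sample t-tests rather than regression to keep the code short, and the sample size, number of simulations, and the 0.4 standard deviation effect are all assumptions. One set of samples has no true effect, the other has a real one, and the same p-values are judged against three significance levels.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_sims, n = 2000, 30
null_data = rng.normal(loc=0.0, size=(n_sims, n))   # no true effect
alt_data = rng.normal(loc=0.4, size=(n_sims, n))    # hypothetical true effect of 0.4 SD

p_null = stats.ttest_1samp(null_data, 0.0, axis=1).pvalue
p_alt = stats.ttest_1samp(alt_data, 0.0, axis=1).pvalue

alphas = [0.01, 0.05, 0.10]
false_pos = [float(np.mean(p_null < a)) for a in alphas]   # significant with no effect
power = [float(np.mean(p_alt < a)) for a in alphas]        # significant with a real effect

for a, fp, pw in zip(alphas, false_pos, power):
    print(f"alpha={a:.2f}  false positive rate={fp:.3f}  power={pw:.3f}")
```

Raising the significance level increases both columns together: more real effects are detected, and more false positives slip through.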

You can read more in my posts about significance levels and p-values and errors in hypothesis testing.

Best of luck with your analysis!

Mahshameen Munawar John says

Sir, thank you so much for the prompt response. Yes, the first model is significant (P=.02). However, as you also mentioned there seems to be no increase in the predictive capacity when I add the IV (R square remains almost the same in both models) …is that a negative thing? Yes the p value for the IV in the second model is significant.

Thankyou again for all your guidance.

Jim Frost says

Hi, you’re very welcome!

It sounds like your results disagree a bit. That happens because the F-test and the t-test for the coefficients measure different things. The F-test measures the amount of variance your model accounts for. In this case, you’re seeing whether the 2nd model accounts for significantly more variance than the first model. The t-test for the coefficient assesses whether the coefficient is significantly different from zero (no effect).

While it might sound bad to say the 2nd model doesn’t account for significantly more variance than the first model, it’s actually good news overall for you. We know in the second model that your IV is statistically significant even when controlling for the demographic variables. The first model doesn’t include the IV even though we know it is significant. In other words, we know the first model is incomplete. In fact, the first model might have omitted variable bias because it does not include a significant IV.

Consequently, even though the second model doesn’t necessarily explain significantly more of the variance, it does include a significant IV and is, therefore, less likely to have biased coefficients. You should ask yourself, does the sign and magnitude of the IV coefficient match theoretical expectations and other research? If so, it looks like the IV is a good addition to the model. Of course, check your residual plots to be sure that you’re not violating any OLS assumptions.

Because you’re using regression analysis, you might consider buying my ebook about regression analysis, which includes far more information about it.

Mahshameen Munawar John says

Hello Sir, your posts have been a great help for me, thank you very much! I have been experiencing much confusion while interpreting the P values for Hierarchical Regression. i have one IV and DV , I controlled the demographics in the first step. The Sig. F Change value from the Model Summary output shows that it is not significant (P= .98) for the second model, where I introduced the IV. The same model is significant in ANOVA Table (F=2.15, P=.02). Could you please explain how to you interpret this result. Is the model valid and meaningful? I have searched but could not find an explanation or understand where the problem lies.

Your reply will mean a lot.

Jim Frost says

Hi Mahshameen,

Is the first model with the demographics significant?

If it is, then the results seem to indicate that both the first and second model are significant. However, adding the IV in the second model did not significantly improve the model. In other words, both models are significant but you can’t say that the second model is better.

However, I think the more crucial statistic to assess is the p-value for the IV in the second model. That statistic will tell you specifically whether that IV is significant while controlling for all the demographic variables. I think that’s what you really want to know.

Best of luck with your analysis!

Nancy Lohalo says

Hi Jim, thank you very much for this insightful post! I have encountered a few problems with the dependent variable Y in the linear regression model. The data collected showed a decreasing trend for the past 20 years, and my hypothesis stated that X1 will have a positive impact on Y. When I ran the regression test, almost all of the independent variables had negative coefficients. How can I interpret it? Thank you!

Jim Frost says

Hi Nancy,

It’s difficult for me to say much about your specific case because there’s so little information. It sounds like your hypothesis was that X1 would have a positive coefficient but your analysis produced a negative coefficient. I’m going to assume that X1 is negative and statistically significant. If it’s negative but not significant, it’s not distinguishable from zero and you can’t assume that it has a negative value in the population. Given those assumptions about the situation, there are two general possibilities.

1) Your hypothesis was incorrect. I have no way to know about that. But, it’s something you can investigate.

2) Your hypothesis is correct but your regression model has a problem that produces biased coefficients. This problem is causing the analysis to produce a negative coefficient when it should be a positive coefficient. There are a number of reasons why this can occur, including confounding variables, overfitting, data mining, and a misspecified model among other possibilities. Be sure to go through the OLS assumptions and see if your model violates any of them. It will probably take some effort to check these potential problems.

Because you’re performing a study with regression analysis, you might consider buying my ebook about regression analysis. In this ebook, I provide much more information all about regression analysis.

Best of luck with your study!

Karis says

Hi Jim, thank you so much for this post it’s helped a lot! I’m learning this stuff at uni and have come across a question which has completely confused me and wondered if you could help? The question asks to interpret the regression analysis result and its significance of these regression results:

R^2 = 0.74 (F = 16.82, p>0.01; t = 0.54, p<0.01).

However, the differing levels of confidence levels has thrown me? Does the fact that the F ratio is not within the confidence threshold mean that the regression model altogether is not statistically significant? Thank you!

Jim Frost says

Hi Karis,

So, the F-test and R-squared go together. These are measures of goodness-of-fit. I’m assuming that the F-value and its p-value are for the F-test of overall significance. That test indicates that your R-squared (0.74) is not significantly different from zero, assuming that alpha is 0.01. Your model is no better at predicting the DV than just using the mean. That’s kind of odd for a model with an R-squared as high as 0.74. There might be a very small sample size or some problem with the model. I can’t tell from these results. Read more about that in my post about the F-test of overall significance. Read my post to see how to interpret R-squared.

The t-value and its p-value are for a term in the model, such as an independent variable. That particular IV is statistically significant. This post details what that means. For this model, the overall significance and significance for a particular IV disagree. The post about the F-test of overall significance describes how this disagreement can happen.

Note that none of the statistics you provide relate to confidence, as in confidence levels or confidence intervals. However, there is a disagreement about statistical significance. Read the post about the F-test to understand that issue.

I do think it’s odd that R-squared is reasonably high but that the overall F-test is not significant. I suspect something odd is going on.
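To see how R-squared and the overall F-statistic go together, here is a minimal sketch of the standard relationship for an OLS model that includes a constant: F = (R² / k) / ((1 − R²) / (n − k − 1)). The function name and the values of n (observations) and k (predictors) are assumptions for illustration only, because the output in the question does not report either one.

```python
def overall_f(r_squared, n, k):
    """F-statistic of overall significance for an OLS model with a constant.

    r_squared: the model's R-squared
    n: number of observations (hypothetical here)
    k: number of predictors, excluding the constant (hypothetical here)
    """
    explained_per_df = r_squared / k
    unexplained_per_df = (1 - r_squared) / (n - k - 1)
    return explained_per_df / unexplained_per_df

# Hold R-squared at 0.74 and vary the hypothetical sample size and model size.
for n, k in [(8, 1), (12, 5), (30, 5)]:
    print(f"n={n:2d}  k={k}  F={overall_f(0.74, n, k):.2f}")
```

With R-squared held fixed, the F-value shrinks as the sample gets smaller or the number of predictors grows, which is one way a high R-squared can sit next to a weak overall test.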

I hope this helps!

Klaudia Pająk says

The question is: when I run a regression analysis, SPSS shows the results and COEF has some value… When I describe these results on paper, should I define the coeff value as b or β?

Thank you in advance

Jim Frost says

Hi Klaudia,

You should be able to work this out from the information provided. Coefficients are estimates of population parameters. And, b is an estimate of a parameter. Therefore, b = coefficients in this context because they are both estimates. Conversely, Beta is a population parameter and not an estimate.

I hope this helps!

Klaudia says

Hello Jim,

I’d like to ask what does the “COEF” mean. Is it the same thing as b or β?

Klaudia 😉

Jim Frost says

Hi Klaudia,

COEF stands for coefficient. These are the values that the procedure estimates from your data. In a regression equation, these values multiply the independent variables.

Technically, β is the parameter value for the population. Your regression equation estimates these parameter values. In textbooks, these estimates are often denoted using beta-hats. That’s a β with a ^ on top. Some sources use a lower-case b to indicate that it’s an estimate. The key thing to note is that some forms (β) refer to the true population parameters while others (beta-hat and b) refer to the estimates of the parameters. The coefficients in your output are estimates of the parameters.

One caution, SPSS for some strange reason uses the term “beta” to refer to standardized coefficients!

I hope this helps!

Curt Miller says

Hi Jim,

Do we still use p-values in determining whether or not a predictor variable should remain in the model, even when we are building a model on full population data?

Thank you,

curt

Jim Frost says

Hi Curt,

When you’re working with data for an entire population, there is no need to use any p-values. P-values are an integral part of hypothesis tests that help you determine whether an apparent effect that exists in your sample also exists in the population. When you have the population data, all effects that you observe by definition do exist in the population. There’s no need to perform any hypothesis testing to confirm it because you’re looking at all the data for the population. This applies to regression analysis and other forms of hypothesis testing such as 2-sample t-tests.

Phil A. says

Hi Jim,

Quick question for a special type of regression… I have the following equation but I am not clear on the interpretation of the coefficient I obtain:

log($RealGDP) = B0 + B1(Junk-Bond Yield %) + e

My X1 data is in terms of percentage points (%) and my Y-variable (in log-scale) is in terms of dollars ($).

After I run my regression, my B1 coefficient = -0.005

As of now, I am interpreting the B1 coefficient as “A 1% increase in the Junk-Bond yield leads to a -0.5% decrease in Real GDP” – does this sound like the correct interpretation?

My main confusion is around the “1% increase in X” …. If the junk-bond spread is currently at 5%, do I interpret “a 1% change” as the junk-bond yield moving from 5% to 6%? Or do I interpret it as a 1% change of 5% (ex: 5% to 5.05%)?
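For a model of this form, where the dependent variable is logged and the independent variable is not, a common reading is that a one-unit increase in X multiplies Y by e^B1. The sketch below makes that arithmetic explicit. It assumes the log is the natural log and that one unit of X is one percentage point (for example, the yield moving from 5% to 6%); the function name is hypothetical.

```python
import math

def percent_change_in_y(b, delta_x=1.0):
    """Percent change in Y for a delta_x increase in X when ln(Y) = B0 + b*X."""
    return 100 * (math.exp(b * delta_x) - 1)

b1 = -0.005  # coefficient reported in the comment above

exact = percent_change_in_y(b1)   # uses e^b
approx = 100 * b1                 # small-coefficient shortcut, 100 * b

print(f"exact: {exact:.4f}%   shortcut: {approx:.4f}%")
```

Because X enters the model in percentage points, the coefficient describes a one-percentage-point change in X under these assumptions, not a 1% relative change in the yield.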

Olga Pap says

Hello Jim.

Hello All.

I have one question. Specifically, when the dependent variable (e.g. earnings) is expressed on a logarithmic form (and not the independent variables) via mincer equation, does the interpretation of coefficients follow the below rules?

• For an increase of one-unit of the independent variable “X”, with coefficient b, then the change for dependent variable “Y” in logarithmic form should be e^b?

• And only for very small values of b (|b| < 0.1) and having in mind that

e^b ≈ 1 + b, increase of one-unit of the independent variable “X”, with coefficient b, then the change for dependent variable “Y” should be equal to (100 × b)?

Thank you in advance.
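The two rules in this question can be compared numerically. The sketch below assumes natural logs and a model of the form ln(Y) = a + bX, and it tabulates the exact percent change in Y per one-unit increase in X, 100 × (e^b − 1), next to the shortcut 100 × b. The coefficient values and function name are made up for illustration.

```python
import math

def exact_pct(b):
    """Exact percent change in Y per one-unit increase in X when ln(Y) = a + b*X."""
    return 100 * (math.exp(b) - 1)

for b in [0.01, 0.05, 0.10, 0.25, 0.50]:
    shortcut = 100 * b  # relies on e^b being close to 1 + b
    gap = exact_pct(b) - shortcut
    print(f"b={b:.2f}  exact={exact_pct(b):6.2f}%  shortcut={shortcut:6.2f}%  gap={gap:5.2f}")
```

The gap between the two columns widens as b moves away from zero, which is why the shortcut is usually reserved for small coefficients.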

Olga Pap says

Is it possible please to answer me on the above question?

Jessica says

Thank you! It does, though, when I’m looking at a scatterplot, I’ve seen an R value. This is not to be interpreted as the same thing as the correlation coefficient, r . . . correct? Even though the R value is not R-squared, it is still not the same as r . . . right?

Jim Frost says

Hi Jessica,

Ah, yes, I jumped straight to R-squared because that is used much more frequently. R is the coefficient of multiple correlation whereas R-squared is the coefficient of multiple determination. The use of the capital letter R for both of these statistics indicates that they are sample estimates. I’ve described R-squared so onto R!

The calculation for R is (unsurprisingly) just taking the positive square root of R-squared. R represents the correlation between a set of variables with another variable. In the regression context, this could be the correlation between your set of independent variables and the dependent variable. The interpretation of R is not intuitive. Hence, R-squared is used more frequently.

Lower case r is the correlation between two variables and it is commonly used. R involves more than two variables.

I haven’t seen R used much at all. Perhaps it is in some specialized context. But, you probably don’t need to worry about R.

Jessica says

Hi, I know this may seem to be a very simple question, but is there a difference between R and r? Do they stand for the same thing in regression analysis?

Jim Frost says

Hi Jessica,

Yes, r and R-squared are related as they both measure the strength of relationships between variables. r is a correlation coefficient that ranges from -1 to +1. It measures the strength of the linear relationship between two continuous variables. R-squared measures the strength of the relationship between a set of independent variables and the dependent variable. It’s a percentage that ranges from 0 – 100%.

Suppose you have a pair of variables, say X and Y, and the correlation coefficient (r) is 0.7. If you perform a simple regression using these two variables, you will obtain an R-squared of 0.49 (49%). We know this because 0.7^2 = 0.49. However, unlike correlation coefficients (r), you can use R-squared when you have more than two variables.
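The link between r and R-squared in simple regression can be checked directly. The sketch below uses a small made-up dataset and hypothetical helper names; it computes Pearson’s r and, separately, the R-squared from a one-predictor least squares fit, so the two can be compared.

```python
from math import sqrt
from statistics import fmean

def pearson_r(x, y):
    """Pearson correlation coefficient between two variables."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simple_regression_r_squared(x, y):
    """R-squared from fitting y = intercept + slope * x by least squares."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Made-up data for illustration
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.0, 1.5, 4.1, 3.2, 6.0, 4.8, 7.9, 6.5]

r = pearson_r(x, y)
r2 = simple_regression_r_squared(x, y)
print(f"r squared = {r ** 2:.6f}   R-squared = {r2:.6f}")
```

The two printed values agree up to floating-point rounding. This identity holds only for a model with one predictor and a constant.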

I write about that aspect in my post about correlation. You can also read more about R-squared.

I hope this helps!

Olga Pap says

Hi Jim. I would be very grateful if you could help me. Specifically, when the dependent variable (e.g. earnings) is expressed on a logarithmic form (and not the independent variables) via mincer equation, does the interpretation of coefficients follow the below rules?

• For an increase of one-unit of the independent variable “X”, with coefficient b, then the change for dependent variable “Y” in logarithmic form should be e^b?

• And only for very small values of b (|b| < 0.1) and having in mind that

e^b ≈ 1 + b, increase of one-unit of the independent variable “X”, with coefficient b, then the change for dependent variable “Y” should be equal to (100 × b)?

Thank you in advance.

Tesfakiros Semere says

What a clear, simple, and easy to understand. You saved my time from reading lots of books. It is really helpful.

Would it be possible to get them all in Pdf just to print and read when I am out of network

THANK YOU SO MUCH Kim.

Jim Frost says

Hi Kim, thanks so much for your kind words! They made my day! While I don’t have PDFs of the blog posts, in several weeks I’ll be releasing an ebook all about regression analysis. If you like the simple and easy to understand approach in my blog posts, you’ll love this book. It should be out in early March 2019!

Digambar salunkhe says

Thank you so much for sharing this blog…It’s really helpful and easy to understand the concept of whole regression model.

Adu Emmanuel Ifedayo says

Thank you.

Neven says

Hi Jim! Great blog, very clear and very helpful. The best I have found in this field! Thanks.

Qmars Safikhani says

Hi Jim,

Thanks a lot for sharing your knowledge through this article. I found it very interesting as you explained somehow difficult concepts in an easy way. Well done

Hans says

Hey Jim,

Great Blog! You helped us a lot preparing for our studies at university. We have a question regarding the p-value… Is there an explanation for a p-value being exactly 1.0? Does it mean that there is a 100 percent chance that the independent variable has no effect on the dependent one? Or is there anything else to consider? Thanks a lot for your help and keep that great work going!

Jim Frost says

Hi Hans, thank you so much! It’s great to hear that it’s been helpful for you all. That makes my day!

Yes, you can obtain a p-value of 1.0. To get exactly 1.0, your sample statistic would have to exactly equal the null hypothesis value. For example, suppose you perform a 1-sample t-test and your null hypothesis is that the population mean equals 10. If your sample statistic is exactly 10, you obtain a p-value of exactly 1.0. In regression analysis, typically the null for a coefficient is that it equals zero. So, if the estimated coefficient equals zero exactly, you’d again get a p-value of 1.0.

The interpretation of a p-value in general is the probability of obtaining the observed sample statistic or more extreme if you assume the null hypothesis is true. The reason p = 1.0 when the sample statistic equals the null hypothesis value makes sense when you think about it with that interpretation in mind. When the sample stat equals the null value, there is a 100% probability that a sample statistic will equal the null value or be more extreme! That’s true by definition because that case covers the entire range of the sampling distribution (i.e., you’d shade the entire area beneath the sampling distribution curve).

To see these sampling distributions in action for a hypothesis test, read my post about p-values and significance levels.

Of course, the probability of obtaining a sample statistic that exactly equals your null hypothesis value is minuscule. When using statistical software in the field, if you see a p-value = 1, it’s more likely due to rounding.
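The "shade the entire curve" idea can be sketched in a few lines. This example swaps in a standard normal distribution in place of the t-distribution that regression software actually uses, which is an assumption that is reasonable only for large samples. The function name is hypothetical.

```python
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a test statistic z under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Test statistic of 0: the estimate sits exactly on the null value,
# so both tails together cover the whole distribution.
print(two_sided_p(0.0))

# A statistic far from the null leaves only a small area in the tails.
print(two_sided_p(1.96))
```

The same logic carries over to the t-distribution: a test statistic of zero always shades the full area, whatever the degrees of freedom.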

Paul says

Hi. I want to find out if simple or multiple regressions can be used to explain effects (as in experimental studies)?

Thank you.

Jim Frost says

Hi Paul,

You bet they can! The coefficients describe the effects and the p-values determine whether the effects are statistically significant.

Rashan says

This is very helpful. Thank you

Surya says

Thanks Jim

Surya says

Hi Jim, I have just subscribed to your posts after reading the wonderful post on residual plots.

Could you please let me know how do we interpret the SE of coefficients , T statistic as well.. Or do you already have an article on them… Please reply.. Thanks..

Jim Frost says

Hi Surya,

Thanks so much! I’m glad that post was helpful!

The standard error (SE) of the coefficient measures the precision of the coefficient estimate. Smaller values represent more precise estimates. Standard errors are the standard deviations of sampling distributions. If you were to perform your study many times, drawing the same sample size, and fitting the same model, you’d obtain a distribution of coefficient estimates. That’s the sampling distribution of a coefficient estimate. The standard error of a coefficient is the standard deviation of that sampling distribution. The SE is used to create confidence intervals for the coefficient estimate, which I find more intuitive to interpret.
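The "perform your study many times" idea can be simulated. The sketch below is an illustration under stated assumptions: a made-up true model (intercept 2.0, slope 0.5, normal errors with standard deviation 1.0), fixed X values from 1 to 20, and 2,000 repeated samples. It compares the standard deviation of the simulated slope estimates with the textbook standard error for a simple regression slope, σ / √Σ(x − x̄)².

```python
import random
from math import sqrt
from statistics import fmean, stdev

def fit_slope(x, y):
    """Least squares slope for a one-predictor model with a constant."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx

rng = random.Random(0)                       # fixed seed so the run is repeatable
x = [float(i) for i in range(1, 21)]         # same X values in every "study"
true_intercept, true_slope, sigma = 2.0, 0.5, 1.0   # hypothetical population values

slopes = []
for _ in range(2000):
    y = [true_intercept + true_slope * a + rng.gauss(0, sigma) for a in x]
    slopes.append(fit_slope(x, y))

# Standard deviation of the sampling distribution of the slope estimate
simulated_se = stdev(slopes)

# Textbook standard error of the slope when the error SD is known
mx = fmean(x)
theoretical_se = sigma / sqrt(sum((a - mx) ** 2 for a in x))

print(f"simulated: {simulated_se:.4f}   theoretical: {theoretical_se:.4f}")
```

In real output the error standard deviation is unknown, so software estimates it from the residuals; the simulation uses the known value only to keep the comparison simple.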

The t-statistic in the context of regression analysis is the test statistic that the analysis uses to calculate the p-value. I wrote a post about how it works in the context of t-tests. It’s fairly similar for coefficient estimates. Read that post but replace sample mean with coefficient estimate and you’ll get a good idea: How t-tests work.
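As a rough worked example of how the pieces fit together, the sketch below fits a one-predictor model to a small made-up dataset and computes the slope, its standard error, and the t-statistic for the null hypothesis that the slope equals zero. All variable names and data are hypothetical. Turning the t-statistic into a p-value requires the t-distribution with n − 2 degrees of freedom, which statistical software handles and which this sketch leaves out.

```python
from math import sqrt
from statistics import fmean

# Made-up data for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

mx, my = fmean(x), fmean(y)
sxx = sum((a - mx) ** 2 for a in x)

# Coefficient estimates
slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
intercept = my - slope * mx

# Residual variance uses n - 2 degrees of freedom (two estimated coefficients)
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
mse = sum(e ** 2 for e in residuals) / (n - 2)

# Standard error of the slope, then the t-statistic against a null value of 0
se_slope = sqrt(mse / sxx)
t_stat = (slope - 0) / se_slope

print(f"slope={slope:.3f}  SE={se_slope:.4f}  t={t_stat:.2f}")
```

The t-statistic is simply the estimate measured in units of its own standard error, which is the same structure as a 1-sample t-test with the sample mean swapped for the coefficient estimate.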

I hope that helps!

[email protected] says

been reading your posts all night, (morning now).. I can’t stop because it’s like a light bulb keeps going off. Been studying this stuff for weeks, now I finally get it thanks to your post. Thank you:)

-Extremely tired data science grad student.

Jim Frost says

Hi, I’m sorry my posts caused you to lose some sleep last night, but I love your analogy about light bulbs going off! I’m really happy to hear that they were helpful. That really makes my day! Best of luck with your studies!

Tracey says

Hi Jim. Thank you so much for this as it helped clear up some things in my mind as I prepare a research paper.

Jim Frost says

Hi Tracey, you’re very welcome. I am happy to hear that it was helpful!

Qiumei Jing says

Thank you for your explanation, Jim. That’s really great!

When I’m doing multiple linear regression, I have a question. The linear regression has three independent variables (A, B, C) and one dependent variable (D). I got a significant p-value in the ANOVA table, but in the Coefficients table, the constant’s p-value is 0.237, which is not significant, and one predictor (Variable A) has a p-value of 0.211. The other two predictors have significant p-values (P=0.000). In that case, how can I interpret the results? The hypotheses for the two significant predictors (variables B and C) are “there is a relationship between B and D” and “there is a relationship between C and D.” In this case, can I say the two hypotheses were supported? And how can I interpret the one (A) with an insignificant p-value in the coefficient table? Thank you in advance!

Jim Frost says

Hi Qiumei,

It’s generally not worthwhile interpreting the constant, so I’d skip that. To learn why, click the link for interpreting the constant in this post.

Here’s how you can interpret the significant predictors.

The sample provides sufficient evidence to conclude that changes in both independent variables B and C are correlated with changes in the dependent variable D. Statistical significance indicates that the correlation does not equal zero. In other words, you can reject the null hypothesis that the coefficients equal zero.

For the insignificant variable, the sample provides insufficient evidence to conclude that there is a relationship between this variable and the dependent variable. In other words, you fail to reject the null hypothesis that this coefficient equals zero.

For more elaboration, reread this post where I talk about this in depth.

Appadu says

Dear Jim

Thank you for your explanations on how to Interpret Regression Coefficients for Linear Relationships and p-value. It is very clear, and I appreciate your time to put this together.

I have one question. I was looking at an example on Estimated standardised OLS beta coefficient data. The results show R squared (%) as 26.2 and F-Value 18.14. Please advise how to interpret these 2 figures. Thank you

Jim Frost says

Hi Appadu,

When you standardize the continuous independent variables in your model, the output produces standardized coefficients. Standardization is when you take the original data for each variable, subtract the variable’s mean from each observation and divide by the variable’s standard deviation. The main reason I’m aware of for performing this standardization is to reduce the multicollinearity caused by including polynomials and interaction terms in your model. I write about that in my post about multicollinearity.

In terms of interpreting the standardized coefficient–it represents the mean change in the dependent variable given a one standard deviation change in the independent variable. Another reason statisticians use it is as a possible measure for identifying which variable is the most important.
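The "one standard deviation" reading can be verified on a small example. The sketch below uses made-up data and hypothetical names, and it standardizes only the independent variable, as described above. It fits the model twice, once on the raw predictor and once on the standardized predictor, and compares the two slopes.

```python
from statistics import fmean, stdev

def fit_slope(x, y):
    """Least squares slope for a one-predictor model with a constant."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx

# Made-up data for illustration
x = [10, 12, 15, 18, 20, 25, 30]
y = [3.1, 3.8, 4.4, 5.9, 6.1, 7.7, 9.2]

# Standardize: subtract the mean, then divide by the standard deviation
x_mean, x_sd = fmean(x), stdev(x)
z = [(a - x_mean) / x_sd for a in x]

raw_slope = fit_slope(x, y)   # change in y per one unit of x
std_slope = fit_slope(z, y)   # change in y per one standard deviation of x

print(f"raw slope x SD(x) = {raw_slope * x_sd:.4f}   standardized slope = {std_slope:.4f}")
```

The standardized slope equals the raw slope multiplied by the predictor’s standard deviation, so it puts predictors measured in different units on a comparable scale.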

As for interpreting R-squared and the F-test of overall significance, those don’t change from the usual interpretations. Click on the links to read my blog post about interpreting each statistic.

I hope this helps!

Hrishikesh Geed says

Thanks for the explanation, Jim!

I have one doubt, how do you calculate the p-value corresponding to each coefficient?

How do you decide the standard deviation,and the sample mean for calculating the z value for each coefficient?

Thanks

Hrishi

eric says

Thank you very much for the explanation Jim!

If the p-value is under the significance level, this would indicate that there is enough evidence to reject the null hypothesis. The null hypothesis being here that there is no correlation between 2 variables (in a simple linear regression).

Here is my first question: how do we decide how to set the significance level? Is it purely arbitrary?

My second question is: since the coefficient of correlation varies between -1 and 1, it is tempting to conclude that there is a significant correlation (positive or negative) between 2 variables if the coefficient of correlation is close to -1 or 1 and that there is no correlation when the coefficient of correlation is close to 0. However I think this assumption is false but can’t get the intuition to understand why.

Could you help me about those questions?

Many thanks for your time and your attention

Best regards

Eric

Hanan Shteingart says

The following claim is not true if the features are correlated, what’s known as multicollinearity: “The sign of a regression coefficient tells you whether there is a positive or negative correlation between each independent variable the dependent variable”. In fact, a feature could have a positive correlation with the target yet a negative coefficient and vice versa.

Jim Frost says

Hi Hanan,

You raise a good point. The interpretation that I present, including the portion that you quote, is accurate when your model doesn’t contain a severe problem. However, if your model does contain a severe problem, it can produce unreliable results, which includes the possibility that the coefficients don’t accurately describe the relationship between the independent variables and the dependent variable. The problem isn’t with how to interpret coefficients, but rather with a condition in the model that causes it to produce coefficients that you can’t trust.

As you point out, multicollinearity can produce unreliable, erratic coefficients. In some cases, the sign of the coefficient can even be incorrect. However, the sign switch doesn’t necessarily have to happen when your model has multicollinearity. I write more about multicollinearity, including switched signs, in this post: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions.
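The situation raised in this exchange, a predictor that is positively correlated with the dependent variable yet carries a negative coefficient, can be reproduced with a tiny constructed example. The sketch below uses made-up, noise-free data in which the second predictor is nearly a copy of the first, and it solves the two-predictor least squares problem by hand. All names and values are hypothetical.

```python
from math import sqrt
from statistics import fmean

def centered(values):
    """Subtract the mean so the constant drops out of the normal equations."""
    m = fmean(values)
    return [v - m for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1]        # nearly a copy of x1
y = [-a + 2 * b for a, b in zip(x1, x2)]    # built with coefficients -1 and +2

c1, c2, cy = centered(x1), centered(x2), centered(y)
s11, s22, s12 = dot(c1, c1), dot(c2, c2), dot(c1, c2)
s1y, s2y = dot(c1, cy), dot(c2, cy)

# Solve the 2x2 normal equations for the two slopes (Cramer's rule)
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

# Pairwise correlation between x1 and y, ignoring x2
r_x1_y = s1y / sqrt(s11 * dot(cy, cy))

print(f"b1={b1:.3f}  b2={b2:.3f}  corr(x1, y)={r_x1_y:.3f}")
```

Here the negative coefficient on x1 is not an error: it describes the relationship while holding x2 constant, whereas the pairwise correlation does not control for x2. With noisy real data and predictors this strongly correlated, the estimates would also be far less stable.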

By the way, there are a number of other potential problems that can cause your model to produce results that you can’t trust. Multicollinearity is just scratching the surface of that. These problems include an incorrectly specified model, overfitting the model, heteroscedasticity, and data mining among others. I spend quite a bit of time talking about these problems, how they can invalidate your results, and what you can do to address them.

I hope this helps!

MN says

Thank you very much for the wonderful elaboration. Amazing!!

Jim Frost says

You’re very welcome, MN! I’m glad it’s helpful!

Rajasekar says

I am currently working on a multiple regression model, where I have 4 x variables and none of my variables are statistically significant. I know when this happens I can reject the null hypothesis but would like to know what might be wrong. Do I need to add some more x variables in this case? Also the R Square =0.109842937

Adjusted R Square =0.034084889
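The gap between these two values comes from the penalty that adjusted R-squared applies for each predictor: adjusted R² = 1 − (1 − R²)(n − 1) / (n − k − 1). The sketch below applies that standard formula with k = 4 predictors, as stated in the comment. The sample size n = 52 is an assumption that is not given in the comment; it is used here because it reproduces the reported adjusted value. The function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for a model with a constant.

    n: number of observations (assumed, not reported in the comment)
    k: number of predictors, excluding the constant
    """
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

r_squared = 0.109842937   # reported R Square
print(adjusted_r_squared(r_squared, n=52, k=4))
```

With an R-squared this low, the penalty for four predictors removes most of the apparent explanatory power.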

Ayush says

This is really one of the best websites I have come across for DATA SCIENCE… Great effort put up by Sir Jim…

Jim Frost says

Thank you, Ayush!

Rali says

Hi Mr. Jim

Thanks for the helpful blog

all the best

Jim Frost says

Hi Rali, you’re very welcome! I’m glad it was helpful!

ADIL HUSSAIN RESHI says

Really fabulous ..it cleared all my doubts about p- value

Jim Frost says

Hi Adil, Thanks! I’m so glad to hear that it was helpful!

Javed Iqbal says

Thanks Jim for the nice explanation. This regression seems to violate one of the model assumption namely the homoskedasticity. Log transformation should work here.

Jim Frost says

Hi Javed, thanks for your comment. The residuals for this model are homoscedastic–or very close to it. Their variance is fairly equal across the entire range. The variance might appear to be lower in the very low end of the range, but there are also fewer observations in that region, which can make the dispersion appear to be smaller. At any rate, it is close enough. To see how a true case of heteroscedasticity appears, along with multiple methods for correcting it, read my post about heteroscedasticity. By the way, I explain in that post why I always recommend trying other methods of addressing this problem before using a transformation.

Toby says

Great blog with detailed explanation! It helps clear my doubts for p-value.

Thank you Jim! and Happy new year! 😀

Jim Frost says

Thank you, Toby! And, I’m very happy you found the blog to be helpful! Happy new year to you too!!