P values and coefficients in regression analysis work together to tell you which relationships in your model are statistically significant and the nature of those relationships. The linear regression coefficients describe the mathematical relationship between each independent variable and the dependent variable. The p values for the coefficients indicate whether these relationships are statistically significant.

After fitting a regression model, check the residual plots first to be sure that you have unbiased estimates. After that, it’s time to interpret the statistical output. Linear regression analysis can produce a lot of results, which I’ll help you navigate. In this post, I cover interpreting the linear regression p-values and coefficients for the independent variables.

**Related posts**: When Should I Use Regression Analysis? and How to Perform Regression Analysis Using Excel

## Interpreting P Values in Regression for Variables

Regression analysis is a form of inferential statistics. The p values in regression help determine whether the relationships that you observe in your sample also exist in the larger population. The linear regression p value for each independent variable tests the null hypothesis that the variable has no correlation with the dependent variable. If there is no correlation, there is no association between the changes in the independent variable and the shifts in the dependent variable. In other words, there is insufficient evidence to conclude that there is an effect at the population level.

If the p-value for a variable is less than your significance level, your sample data provide enough evidence to reject the null hypothesis for the entire population. Your data favor the hypothesis that there *is* a non-zero correlation. Changes in the independent variable *are* associated with changes in the dependent variable at the population level. This variable is statistically significant and probably a worthwhile addition to your regression model.

On the other hand, when a p value in regression is greater than the significance level, it indicates there is insufficient evidence in your sample to conclude that a non-zero correlation exists.

The regression output example below shows that the South and North predictor variables are statistically significant because their p-values are displayed as 0.000, which means they are less than 0.0005. On the other hand, East is not statistically significant because its p-value (0.092) is greater than the usual significance level of 0.05.

It is standard practice to use the coefficient p-values to decide whether to include variables in the final model. For the results above, we would consider removing East. Keeping variables that are not statistically significant can reduce the model’s precision.
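As a minimal sketch of that decision rule, the snippet below compares each p-value from the example to a significance level of 0.05. The dictionary and variable names are only illustrative. The p-values are the ones quoted above, and a displayed 0.000 is a rounded value rather than an exact zero.

```python
# P-values as quoted in the example output (0.000 is a rounded display value).
p_values = {"South": 0.000, "North": 0.000, "East": 0.092}
alpha = 0.05  # the usual significance level

# A term is statistically significant when its p-value is below alpha.
significant = {term: p < alpha for term, p in p_values.items()}

for term, is_sig in significant.items():
    label = "significant" if is_sig else "not significant"
    print(f"{term}: p = {p_values[term]:.3f} -> {label}")
```

A term flagged as not significant is only a candidate for removal. Theory or the study design can still justify keeping it.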

**Related posts**: F-test of overall significance in regression and What are Independent and Dependent Variables?

## Interpreting Linear Regression Coefficients

What does the coefficient mean? The sign of a linear regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable. A positive coefficient indicates that as the value of the independent variable increases, the mean of the dependent variable also tends to increase. A negative coefficient suggests that as the independent variable increases, the dependent variable tends to decrease.

The coefficient value signifies how much the mean of the dependent variable changes given a one-unit shift in the independent variable while holding other variables in the model constant. This property of holding the other variables constant is crucial because it allows you to assess the effect of each variable in isolation from the others.
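To make "holding other variables constant" concrete, here is a small sketch that uses a made-up fitted equation with two predictors. The coefficients and the `predict` helper are assumptions for illustration only and do not come from any output in this post.

```python
# Hypothetical fitted equation: y = 10 + 2.5*x1 - 1.2*x2 (made-up coefficients).
INTERCEPT, B1, B2 = 10.0, 2.5, -1.2

def predict(x1, x2):
    """Mean of the dependent variable predicted by the hypothetical model."""
    return INTERCEPT + B1 * x1 + B2 * x2

# Raise x1 by one unit while holding x2 fixed at the same value.
change_at_x2_low = predict(4, 3) - predict(3, 3)
change_at_x2_high = predict(4, 8) - predict(3, 8)

print(change_at_x2_low, change_at_x2_high)  # both match B1 up to rounding
```

Whatever value x2 is held at, a one-unit increase in x1 shifts the predicted mean by the x1 coefficient. That is what the coefficient measures in a model without interaction terms.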

The linear regression coefficients in your statistical output are estimates of the actual population parameters. To obtain unbiased coefficient estimates that have the minimum variance, and to be able to trust the p-values, your model must satisfy the seven classical assumptions of OLS linear regression.

Statisticians consider linear regression coefficients to be an unstandardized effect size because they indicate the strength of the relationship between variables using values that retain the natural units of the dependent variable. Effect sizes help you understand how important the findings are in a practical sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.

To learn how least squares regression calculates the coefficients and y-intercept with a worked example, read my post Least Squares Regression: Definition, Formulas & Example.

Linear regression uses the Slope Intercept Form of a Linear Equation. Click the link for a refresher!

**Related posts**: Linear Regression and Linear Regression Equations Explained

## Graphical Representation of Linear Regression Coefficients

A simple way to grasp regression coefficient interpretation is to picture them as linear slopes. The fitted line plot illustrates this by graphing the relationship between a person’s height (IV) and weight (DV). The numeric output and the graph display information from the same model.

The height coefficient in the regression equation is 106.5. This coefficient represents the mean increase of weight in kilograms for every additional one meter in height. If your height increases by 1 meter, the average weight increases by 106.5 kilograms.

The regression line on the graph visually displays the same information. If you move to the right along the x-axis by one meter, the line increases by 106.5 kilograms. Keep in mind that it is only safe to interpret regression results within the observation space of your data. In this case, the height and weight data were collected from middle-school girls and range from 1.3 m to 1.7 m. Consequently, we can’t shift along the line by a full meter for these data.
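A more realistic use of the coefficient is a shift that stays inside the observed range. The sketch below applies the 106.5 coefficient to a 0.1 m difference in height and refuses to work outside 1.3 m to 1.7 m. The function name and the range check are my own illustration, not part of the original analysis.

```python
HEIGHT_COEF = 106.5          # kg per meter, from the regression equation
OBSERVED_RANGE = (1.3, 1.7)  # meters, the range of heights in the data

def mean_weight_change(height_from, height_to):
    """Change in mean weight (kg) between two heights inside the observed range."""
    low, high = OBSERVED_RANGE
    if not (low <= height_from <= high and low <= height_to <= high):
        raise ValueError("height outside the observed range; do not extrapolate")
    return HEIGHT_COEF * (height_to - height_from)

print(round(mean_weight_change(1.4, 1.5), 2))  # 0.1 m taller
```

A 0.1 m difference corresponds to about one tenth of the coefficient, which is a far more sensible comparison for these data than a full meter.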

Let’s suppose that the regression line was flat, which corresponds to a coefficient of zero. For this scenario, the mean weight wouldn’t change no matter how far along the line you move. That’s why a near zero coefficient suggests there is no effect—and you’d see a high (insignificant) p-value to go along with it.

The plot really brings this to life. However, plots can display only results from simple regression—one predictor and the response. For multiple linear regression, the interpretation remains the same.

Contour plots can graph two independent variables and the dependent variable. For more information, read my post Contour Plots: Using, Examples, and Interpreting.

## Use Polynomial Terms to Model Curvature in Linear Models

The previous linear relationship is relatively straightforward to understand. A linear relationship indicates that the change remains the same throughout the regression line. Now, let’s move on to interpreting the coefficients for a curvilinear relationship, where the effect depends on your location on the curve. The interpretation of the coefficients for a curvilinear relationship is less intuitive than linear relationships.

As a refresher, in linear regression, you can use polynomial terms to model curves in your data. It is important to keep in mind that we’re still using linear regression to model curvature rather than nonlinear regression. That’s why I refer to curvilinear relationships in this post rather than nonlinear relationships. Nonlinear has a very specialized meaning in statistics. To read about this distinction, read my post: The Difference between Linear and Nonlinear Regression Models.

This regression example uses a quadratic (squared) term to model curvature in the data set. You can see that the p-values are statistically significant for both the linear and quadratic terms. But, what the heck do the coefficients mean?

## Graphing the Data for Regression with Polynomial Terms

Graphing the data really helps you visualize the curvature and understand the regression model.

The chart shows how the effect of machine setting on mean energy usage depends on where you are on the regression curve. On the x-axis, if you begin with a setting of 12 and increase it by 1, energy consumption should decrease. On the other hand, if you start at 25 and increase the setting by 1, you should experience an increased energy usage. Near 20, you wouldn’t expect much change.

Regression analysis that uses polynomials to model curvature can make interpreting the results trickier. Unlike a linear relationship, the effect of the independent variable changes based on its value. Looking at the coefficients won’t make the picture any clearer. Instead, graph the data to truly understand the relationship. Expert knowledge of the study area can also help you make sense of the results.
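If you want a number to go with the graph, the slope of a quadratic model at a given setting is the linear coefficient plus twice the quadratic coefficient times the setting. The sketch below uses invented coefficients chosen so the curve bottoms out near a setting of 20, as the chart description suggests. They are not the values from the output in this post.

```python
# Hypothetical quadratic fit: energy = B0 + B1*setting + B2*setting**2.
# These coefficients are invented so the curve bottoms out near a setting of 20;
# they are not the values from the output discussed above.
B0, B1, B2 = 500.0, -40.0, 1.0

def slope_at(setting):
    """Instantaneous change in mean energy use per one-unit increase in setting."""
    return B1 + 2 * B2 * setting

for setting in (12, 20, 25):
    print(f"setting {setting}: slope = {slope_at(setting):+.1f}")
```

The sign of the slope flips as the setting moves past the bottom of the curve, which is why a single coefficient cannot summarize the effect in a model with a squared term.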

**Related post**: Curve Fitting using Linear and Nonlinear Regression

## Regression Coefficients and Relationships Between Variables

Regression analysis is all about determining how changes in the independent variables are associated with changes in the dependent variable. Coefficients tell you about these changes and p-values tell you if these coefficients are significantly different from zero.

All of the effects in this post have been main effects, which is the direct relationship between an independent variable and a dependent variable. However, sometimes the relationship between an IV and a DV changes based on another variable. This condition is an interaction effect. Learn more about these effects in my post: Understanding Interaction Effects in Statistics.

In this post, I didn’t cover the constant term. Be sure to read my post about how to interpret the constant!

The statistics I cover in the post tell you how to interpret the regression equation, but they don’t tell you how well your model fits the data. For that, you should also assess R-squared.

If you’re learning regression and like the approach I use in my blog, check out my Intuitive Guide to Regression Analysis book! You can find it on Amazon and other retailers.

**Note: I wrote a different version of this post that appeared elsewhere. I’ve completely rewritten and updated it for my blog site.**

Miriam says

Hello Jim thank you for your amazing responses so far!

Please What is your decision if your ROA coefficient is -1.065269 with a t-statistic of -2.062978 and probability of 0.0452 and std error of 0.516374 with a 5% significance and your claims ratio coefficient is -0.118924 with a t-statistic of -0.901914 and probability of 0.3721 and std error of 0.131857 with a 5% significance

Jim Frost says

Hi Miriam,

You’re going to have to help me understand the context better. ROA coefficient? Return on Assets? I need to better understand the model before being able to interpret it. The IVs and DV.

Here’s what I can tell you from a statistical perspective. ROA is statistically significant and has a negative relationship with your DV. For each one-unit increase in ROA, your DV declines by an average of 1.065269 units of the DV (whatever that is).

On the other hand, the claims ratio coefficient is not significant. While its coefficient is negative, it’s indistinguishable from zero. You might consider dropping it from the model unless theory strongly suggests leaving it in.
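As a quick consistency check on output like this, the t-statistic for a coefficient is the estimate divided by its standard error. The sketch below redoes that division with the numbers quoted in the question. The dictionary layout is only for illustration.

```python
# Values quoted in the question: (coefficient, standard error, reported t-statistic).
terms = {
    "ROA": (-1.065269, 0.516374, -2.062978),
    "claims ratio": (-0.118924, 0.131857, -0.901914),
}

t_stats = {}
for name, (coef, std_err, reported_t) in terms.items():
    # The t-statistic for a coefficient is the estimate divided by its standard error.
    t_stats[name] = coef / std_err
    print(f"{name}: t = {t_stats[name]:.4f} (reported {reported_t})")
```

Turning a t-statistic into a p-value also requires the error degrees of freedom, which the question does not give.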

Mavis Mok says

May I know this is written by what date and year ya?

Jim Frost says

Hi Mavis,

When citing online resources, you typically use an “Accessed” date rather than a publication date because online content can change over time. For more information, read Purdue University’s Citing Electronic Resources.

Mariana says

Thank you so much for such clear information!

Ellen says

Hello Mr Jim,

thank you for making such wonderfully easy-to-understand explanations about regression. Even though I’m in college some of these concepts still seem elusive to me.

I’m so very sorry if this question is long, I would be so grateful for help!! I feel so stupid that I don’t understand it properly. I am doing multiple ordinary regression for a set of variables X1,X2,X3,X4. The assignment is to make statistically valid conclusions, as if they were going to be published for real. I’ve checked for multicollinearity. I’m sorry for the improvised matrices, I hope they’re readable. I knew X4 was heavily correlated with Y pre-hand, and I use it as a control variable.

Using only one independent variable at a time, I get (AR = adjusted r-squared, C = coefficient) :

| | X1 | X2 | X3 | X4 |
|---|---|---|---|---|
| AR | 0.567 | 0.0632 | 0.0740 | 0.645 |
| C | 0.77 | -0.32 | 0.34 | 0.95 |
| p-value | 0.0001 | 0.03 | 0.027 | 0.004 |

And with two independent variables, one being X4, I get:

| | (X1, X4) | (X2, X4) | (X3, X4) |
|---|---|---|---|
| AR | 0.757 | 0.747 | 0.63 |
| C | (0.43, 0.64) | (-0.34, 0.96) | (-0.03, 0.93) |
| p-value | (0.00001, 0.000008) | (0.000031, 0.000002) | (0.73, 0.0000008) |

All models have a constant, and I don’t know if it’s relevant, but it varies in value, positive and negative, and sometimes has a p-value<0.05 and sometimes not. There are 52 datapoints. The variables are about human behaviour, so all R^2 seem pretty high. I really expect X1 and X2 to be at least a little correlated to Y.

My questions are:

1. X2: The adjusted r^2 with (X2, X4) is 0.747, even though X2 by itself only has AR = 0.0632. Is there something wrong with my model? Or could the little explanatory power X2 have over Y (6%) be one that X4 is missing? In general, could an independent variable add to the model even though it by itself doesn't seem to be linearly correlated? All p-values are significant.

2. X1: It clearly looks as if X1 and X4 are independently correlated with Y. If I combine them, the adjusted R^2 increases to 0.757 from 0.567 and 0.645 respectively. Since all p-values are <0.05, can I draw the conclusion that X1 has some explanatory power over Y that isn't already covered by X4? That is, is the increase in adjusted r^2 enough to say that (X1,X4) is a better model than just X4?

3. X3: X3 has AR 0.34 by itself, but is insignificant when combined with X4. Does this mean I can't claim its correlated to Y at all?

A thousand thanks,

Ellen

vsal1 says

Hi Jim,

Thanks for answering my previous question.

How do you compare the beta weights to rank them? Would you create a table to present this information and explain the order of explanatory importance? I’m not so sure on what to do after the comparison step and what is produced by comparing them.

Jim Frost says

Hi,

A table would be a fine way to present the results. As for interpreting the results and knowing what it all means, read the other post I linked to in my previous reply. All detailed in it!

vsal1 says

Thank you very much, the other post answers my question exactly!

Your blog is very informative, definitely using and sharing.

Jim Frost says

So glad I could help! Thanks for sharing too! 🙂

VSal says

After doing the regression analysis steps, I found that the last step is to compare the beta weights of the independent variables in order to rank them in order of explanatory importance.

I was expecting to use regression analysis to analyze regression results in a new way. I don’t know if I can build off on that last step of regression analysis to analyze the beta data results in a new way. (These results are secondary data)

Is there a potential way to analyze in a new way?

Jim Frost says

Hi,

Using beta weights to assess the explanatory importance is one possible assessment. And there are large caveats that go with that. I write about this in my post about identifying the most important variables in a model. What I refer to as standardized coefficients in that article are synonymous with beta weights.

I don’t know what you mean by “analyze in a new way.”

Nick says

Hello, I really enjoyed this explanation and was wondering if there was a certain textbook you have referenced for this? I am writing a paper and would like to be able to cite something when I go over this information.

Jim Frost says

Hi Nick, I’m glad the explanation was helpful! I’m sure that any regression textbook would give the same interpretation. I always use Applied Linear Statistical Models by Neter et al. Of course, there is also my own book on the subject! 🙂

Ivan Vlahek says

Dear Jim,

my question is closely connected with this comment of yours.

I performed linear regression analysis with several categorical independent variables. F-test indicates that one variable is not significant, however, specific differences (between different levels of the categorical variable) are significant. Is there any simple explanation why this occurs and how to interpret it?

I’m asking this because you wrote that “This type of inconsistent results is not quite as strong but there is still some evidence of effects”.

Many thanks for your dedicated work in statistics,

Ivan

Toby says

Hi Jim, hope you are well.

I was curious about a question which came up in a past exam. It asked “are my coefficients over or under estimated”, how would you go about answering this problem?

Many thanks

Agatha says

Hello Jim

If i have 4 age group categories and two show significance while the other 2 are not, in my interpretation do i state that two age groups are significant while the other 2 were not?

Jim Frost says

Hi Agatha,

Typically, I’d report the F-test results for the categorical variable as a whole first and foremost. That F-test tells you the overall significance of your age group variable. You can then describe the specific differences that are significant (which use individual t-tests) as needed for your discussion.

If the F-test results are not significant but the specific differences are significant (which can happen occasionally), you can report them as you indicate in your comment along with the nonsignificant F-test result. This type of inconsistent results is not quite as strong but there is still some evidence of effects.

I hope that helps!

Thomas says

Dear Jim,

The following explanation is correct for p value in linear regression?

P-value describes the significance of the findings given the sample size. But what does significant mean? In this population sample, 29 observations are used. Since this is a regression analysis of a small sample, we want to know whether we will still see the resulting coefficients if we include another 29 observations, or another 29,000. Will the slope of the line be 1.42, or will it be 0 or negative? Here, the p-value indicates there is a 0% chance the coefficients will change beyond the standard error given the addition of more data points or different samples. Most important, it indicates a 0% chance the slope will become negative. In other words, significance means that, regardless of how many times the data are sampled, the relationship will hold

Jim Frost says

Hi Thomas,

Unfortunately, no, that explanation is not correct. P-values in general do not relate to the probability of future results if you were to add additional observations. It is true that when you have a higher p-value, it’s less surprising if the results change, the effect vanishes, or even flips direction. But p-values don’t measure the probability of that happening. And a very significant p-value doesn’t indicate there is zero probability of the slope becoming negative (to use your example). It is less likely to occur with a low p-value than with a high p-value, but you can’t use the p-value to know the probability of that occurrence. Additionally, you’ll never have a zero probability that the results will change with additional observations when you’re working with samples rather than an entire population because there’s always some degree of uncertainty associated with using a sample.

For regression coefficients, p-values indicate the probability of observing the coefficient value, or more extreme, if the null hypothesis is correct. To learn more about that, read my post about interpreting p-values. In that article, I’m not discussing regression coefficients, but the concepts still apply. One key is that p-values assume the null is true and then determine how unlikely your sample estimate (the coefficient in regression) is given that assumption.
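One way to see the "assume the null is true" logic is a permutation test, sketched below. Note that this is a different technique from the t-test that regression software uses for coefficients. It shuffles the response so that any real link to the predictor is broken, then counts how often the shuffled slope is at least as extreme as the observed one. The data, function names, and number of shuffles are all made up for illustration.

```python
import random

def slope(x, y):
    """Least-squares slope of y on x for a simple (one-predictor) regression."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Share of shuffled data sets whose slope is at least as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(slope(x, y))
    shuffled = list(y)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # shuffling y breaks any real link to x (the null)
        if abs(slope(x, shuffled)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)

# Made-up data: y rises with x plus a little noise.
x = list(range(1, 13))
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3, 13.9, 16.2, 18.0, 19.7, 22.1, 24.2]

p_value = permutation_p_value(x, y)
print(f"observed slope = {slope(x, y):.3f}, permutation p-value = {p_value:.4f}")
```

The result says how surprising the observed slope would be if there were no relationship. It says nothing about the probability that the slope changes when more data arrive.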

phaimm says

Hi Jim,

First time commenting, love your site.

Is it good practice/appropriate to adjust your significance level when interpreting p-values in a multiple linear regression? i.e. If I include 5 variables in the model and wanted to interpret significance, then assuming I use a simple Bonferroni correction, would I only consider a p-value as significant if it was less than 0.01 assuming an overall type 1 error rate of 0.05?

Jim Frost says

Hi, thanks so much for the kind words!

No, typically you do not make those adjustments for multiple variables in multiple linear regression. Typically, you’ll make those adjustments when using post hoc analyses to make multiple comparisons, such as comparing the means between a set of groups.

To understand why, there’s a tradeoff between including too many variables (reduced precision) and too few (increased bias). In general, there’s more danger with including too few because even being just one variable short can cause a large amount of bias. It’s often better to include an extra variable than remove it when you’re not sure. Although, that too can be taken to extremes and cause real problems through overfitting and chance correlations due to data dredging.

But making it harder to include variables in the model through an adjustment potentially biases your results.

I’ve written an in-depth article about specifying the correct regression model. I recommend reading that for more tips about different considerations.
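For reference, the Bonferroni arithmetic in the question is correct, even though the adjustment is not typical for regression coefficients. The sketch below reproduces it and, under the simplifying assumption that the five tests are independent and every null hypothesis is true, shows the family-wise error rate the correction is meant to control.

```python
alpha = 0.05     # desired overall (family-wise) Type I error rate
n_tests = 5      # one test per variable in the question

# Bonferroni: divide the overall rate by the number of tests.
bonferroni_threshold = alpha / n_tests

# If the five tests were independent and every null were true, testing each at
# 0.05 would produce at least one false positive this often:
familywise_unadjusted = 1 - (1 - alpha) ** n_tests

print(f"per-test threshold: {bonferroni_threshold:.3f}")
print(f"family-wise rate without adjustment: {familywise_unadjusted:.3f}")
```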

Danial Bazli bin Mohd Rosli says

Well, if I run a statistical test between two variables and get a p-value of 0.06, I know that we fail to reject the null hypothesis, but how do I know which one is better? Let’s say that I’m comparing my typing speed on an iPad and on a laptop.

Jim Frost says

Hi Danial,

Assuming that you’re assessing the two groups of iPads and Laptops for mean typing speed, if your results are not significant, then you have insufficient evidence to conclude that the population means are different. You’re unable to say either group is faster on average at the population level.

Adam Abu Basyar says

Why my significance value is under 0.05 but my standardized and unstandardized coefficients is negative?

Jim Frost says

They’re telling you two different things. The p-value indicates whether your results are statistically significant while the sign (+ or -) indicates the direction of the relationship. For your case, your results are statistically significant and there is a negative relationship between the two variables. As one variable increases, the other variable tends to decrease.

Max says

Hi Jim, thank you for the detailed explanation!

A very naïve question : If I run regression on the entire population data, in that case, how do I interpret the p values?

Jim Frost says

Hi Max,

When you have the entire population, you don’t need to use or interpret p-values. P-values are for inferential statistics, which is when you use a sample to draw inferences about a population. P-values incorporate sampling error. However, when you measure the entire population, there is no sampling error. You’re not using a sample to understand a population. Consequently, you don’t need to use p-values.

For instance, if you calculate the mean for the entire population, you know mu (the population parameter for the mean). You don’t need to estimate it and, hence, there is no need for p-values.

Chris H says

Hi Jim, I have five models for each spatial resolution dataset. I want to test to see if there is a significant difference between e.g. 100 x 100 m model and the 20 x 20 model, because both have a high r2, but are 100 m significantly better model compared to 20 m model. How would I do that?

| Spatial resolution (area) | Regression equation | r² | P | F(1, 199) |
|---|---|---|---|---|
| 10 x 10 m (100 m²) | y = -1869.5x² + 567.62x - 23.703 | 0.8187 | 1.48E-53 | 847.00 |
| 20 x 20 m (400 m²) | y = -1870.6x² + 569.13x - 23.769 | 0.8324 | 1.84E-76 | 922.68 |
| 30 x 30 m (900 m²) | y = -2288.5x² + 626.32x - 25.7 | 0.8212 | 2.98E-73 | 842.19 |
| 50 x 50 m (2 500 m²) | y = -2472.1x² + 643.45x - 26.0 | 0.8214 | 1.34E-73 | 850.63 |
| 100 x 100 m (10 000 m²) | y = 3235x² - 54.33x - 6.2795 | 0.8684 | 2.73E-86 | 1210.24 |

Charlene Mae says

Hello, what if my p-value is 1.0665E-15? My teacher said that “if p-value is less than .05 = significant” and “if p-value is more than .05 = non-significant.” Is it still significant or not?

Jim Frost says

Hi Charlene,

That is scientific notation, which is used for very large and very small numbers. In your case, it’s a very small number, which is good when it’s a p-value!

The number to the right of the E (-15) tells you how many places to move the decimal point. The negative value indicates you need to move the decimal point to the left by 15 places.

It’s equivalent to: 0.0000000000000010665

It is significant.

I hope that helps!
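If you would rather let software do the conversion, here is a minimal Python sketch. It uses the standard `decimal` module to write the number out in full and then compares the value to 0.05. The variable names are only illustrative.

```python
from decimal import Decimal

p_text = "1.0665E-15"  # the p-value exactly as the software printed it

# Decimal keeps every digit, so formatting with "f" writes the number out in full.
plain = format(Decimal(p_text), "f")
print(plain)

# Python's float() understands the E notation directly, so the comparison is simple.
print(float(p_text) < 0.05)
```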

Albert says

If all other data are known, how to calculate the t-statistics and p-value in a simple regression model.

Mooney says

Hi Jim,

What does it mean to have missing values in the ANOVA table for the intercept? (St Err, t value, p value and CIs?

Thank you for the great platform and information you share with us!

Jim Frost says

Hi Mooney,

It’s hard to say for sure without knowing any of the details. However, my guess would be that you have a saturated model. That’s a model that contains as many parameter estimates as the number of observations. In that case, it’s impossible to calculate those particular statistics.

Kate Moody says

Hello Jim, thanks for your explanation. I wonder if you could help me with a logistic regression I have done. I am looking at whether a set of variables could be used to predict the outcome of a test. I have got these results:

    Coefficients:
                       Estimate Std. Error z value Pr(>|z|)
    (Intercept)         0.79864    0.18491   4.319 1.57e-05 ***
    ID1_Gender          0.82500    0.14725   5.603 2.11e-08 ***
    ID1_SEN            -2.79217    0.28673  -9.738  < 2e-16 ***
    ID1_Disadvantage   -0.36606    0.40065  -0.914   0.3609
    `ID1_FSM Eligible` -0.04059    0.41307  -0.098   0.9217
    ID1_EAL             0.11384    0.33430   0.341   0.7334
    GLD_Yes             0.38788    0.15147   2.561   0.0104 *
    GLD_No             -0.48422    0.11672  -4.149 3.34e-05 ***

I'm really struggling to interpret these results. I'm assuming I could strengthen my model by removing some of the variables but I'm not sure which ones. Could you point me in the right direction?

Jim Frost says

Hi Kate,

First, let me point you towards a post I wrote about specifying your model. That’ll give you some good pointers.

One thing you could start with is removing, one at a time, the variables with the highest p-values until you have only significant variables in the model plus any variables that theory strongly suggests you include. Including insignificant IVs can reduce the precision of your model.

You should also assess the prediction intervals to evaluate the precision of the predictions. If you’re going to use the model to make predictions, you’ll need to know how good the predictions are, which you can’t do with the table of stats provided.
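As a rough sketch of the first step in that one-at-a-time process, the snippet below picks out the term with the largest p-value from the table in the question. The p-values are copied from that table and the intercept is left out. Everything else, including the names, is illustrative. After removing a term you would refit the model, because the remaining p-values will change.

```python
# P-values copied from the coefficient table in the question
# ("< 2e-16" is entered as 2e-16, its upper bound).
p_values = {
    "ID1_Gender": 2.11e-08,
    "ID1_SEN": 2e-16,
    "ID1_Disadvantage": 0.3609,
    "ID1_FSM Eligible": 0.9217,
    "ID1_EAL": 0.7334,
    "GLD_Yes": 0.0104,
    "GLD_No": 3.34e-05,
}
alpha = 0.05

# The first candidate for removal is the term with the largest p-value,
# provided that p-value is above the significance level.
candidate = max(p_values, key=p_values.get)
first_to_drop = candidate if p_values[candidate] > alpha else None
print(first_to_drop)
```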

Tracey Rachelle Doss says

We are doing multiple Regression and I found in the second IV, though the Beta is possible, and the odds are 1.003, the significance value is P>=0.850>0.05, what does this mean? My dependent variable is gender, male and female and my IV are democracy and trust in the ruling party.

Jim Frost says

Hi Tracey,

It looks like you’re using binary logistic regression, which makes sense because gender is a binary DV. Both the odds ratio and p-value suggest that your IV does not predict a person’s gender. You cannot conclude that there is a relationship between your IV and gender.
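To see why an odds ratio of 1.003 points to essentially no effect, recall that in logistic regression the odds ratio is the exponential of the coefficient. The sketch below converts the quoted odds ratio back to a coefficient and to a percent change in the odds. The variable names are only illustrative.

```python
import math

odds_ratio = 1.003  # the value quoted in the question

# In logistic regression the odds ratio is exp(coefficient),
# so the coefficient is the natural log of the odds ratio.
coefficient = math.log(odds_ratio)

# Percent change in the odds for a one-unit increase in the IV.
percent_change_in_odds = (odds_ratio - 1) * 100

print(f"coefficient = {coefficient:.5f}")
print(f"change in odds per unit = {percent_change_in_odds:.1f}%")
```

An odds ratio of exactly 1 corresponds to a coefficient of zero, which is the null hypothesis value.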

somesh984 says

Hello, i have three variables in my model and out of three, two are significant and one is not. Can I show the value of the one which is insignificant on the regression equation?

Jim Frost says

Hi, many times when a variable is not significant, analysts will remove them from the model. However, there are cases where you might want to leave it in. For example, if you’re performing an experiment and specifically testing that variable or if theory suggests it belongs in the model. For more information, read my post about specifying the correct model.

somesh984 says

hello. Do we need to show the value of the variable on the regression equation that is not significant?

Sen says

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | -103.0575448 | 16.01243451 | -6.436094692 | 0.00134569 |
| X Variable 1 | 1.349104859 | 0.154799081 | 8.715199434 | 0.00032925 |

How can I interpret the intercept, the negative t star and the P-Values of this regression? I need your help!

Jim Frost says

Hi Sen,

For information about the intercept, including negative values, please read my post about the Y-intercept. You can post questions about that topic in that post. Thanks!

Luã Reis says

Good work! I bought your books and it really makes a difference in learning statistics with them, thanks!

Jim Frost says

Hi Luã, I’m so glad to hear that my books were helpful!! Thanks for letting me know! 🙂

RABIA NOUSHEEN says

Hi Jim

I would like to ask that if my GLM gives me estimates for each level of factor while I am interested to get the overall estimate for that particular factor rather than each level then what should I do? Is it rights to Add up the estimates of all levels in this regard? e.g factor is polystyrene and has four levels. I sum up the estimates : level 1 (Intercept, reference category) + level 2 + level 3 + level4 = estimate for polystyrene.

Can you please clear my concept about it?

Best Regards

Jim Frost says

Hi Rabia,

There are two key ways to determine significance for categorical variables in a linear model.

One is for the categorical variable as a whole. That uses an F-test, and you can think of it like an ANOVA. It tells you whether the variable as a whole is significant, but it doesn’t tell you whether the differences between levels are statistically significant. The other method is the one you describe, where it assesses the differences between each factor level mean and either the reference level or overall mean, depending on which coding scheme you use. That method uses t-tests.

Unfortunately, you can’t just add them up to get the overall significance. You actually need to switch statistical tests. The individual differences use t-tests while the overall significance uses the F-test. Usually, when you have a significant difference (a significant t-test) the overall significance (F-test) will also be significant. But not always! You’ll need to tell your software to perform an F-test for the categorical variable to get its overall significance.

I hope that helps!

What you describe is the

kayum says

Hi jim , your blogs always give my all the answers but one doubt i had is why t- test is used for individual testing why not z test ?

Jim Frost says

Hi Kayum,

Use t-tests when you have a sample estimate of the population’s variability. Use a z-test when you know the population’s variability. You’ll almost never know the actual variability of an entire population and almost always use the sample to estimate it. Consequently, you’ll almost always use t-tests instead of Z-tests!

As the DF for a t-test get over 20, the t-distribution begins to approximate the normal distribution and the two tests will give very similar results. However, with smaller sample sizes, Z-tests will be inaccurate. But follow the general rule about using t-tests when you estimate the variability and you’ll always be the most accurate.

Many students will learn the Z-test even though it’s not used much in practice. The reason is that it’s easier to calculate by hand.

I hope that helps!
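To see the two distributions converge as the degrees of freedom grow, here is a sketch that uses only the Python standard library. It integrates the t-distribution density numerically with Simpson's rule, which is my own shortcut for illustration, and compares two-sided p-values for the same test statistic. The function names and the choice of 2.0 as the statistic are assumptions.

```python
import math
from statistics import NormalDist

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value for a t-statistic, via Simpson's rule on [0, |t|]."""
    upper = abs(t_stat)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # P(0 < T < |t|)
    return 1 - 2 * area           # both tails together

def z_two_sided_p(z_stat):
    """Two-sided p-value for a z-statistic from the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z_stat)))

statistic = 2.0  # an arbitrary test statistic for the comparison
for df in (5, 20, 100):
    print(f"t-test, df = {df:>3}: p = {t_two_sided_p(statistic, df):.4f}")
print(f"z-test:           p = {z_two_sided_p(statistic):.4f}")
```

With few degrees of freedom the t-test gives a noticeably larger p-value for the same statistic, so a Z-test would overstate the evidence in small samples.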

Jack says

Hi Jim,

Thanks a lot for taking the time to respond. Very informative and makes a lot of sense!

Thanks a lot,

Jack says

Hi Jim,

I analyzed the impact of the Log + 1 transformation on 4 numeric variables, one of which is my Y.

All four are highly skewed to the right. I ran the loop below and the distributions normalized, improving my MSE and R2. However, the scale on each of the variables changed, which also impacted my EDA, plots, and overall stats like mean, median, etc. Is there a way to re-scale the data after running np.log?

My residuals and MSE are too high without the log but it is impacting the actual data – confused where to go from here.

    for colname in cols_to_log:
        df[colname + '_log'] = np.log(cars[colname] + 1)  # log(x + 1) transform

    df.drop(cols_to_log, axis=1, inplace=True)

Jim Frost says

Hi Jack,

Be aware that you cannot compare goodness-of-fit measures between models that are not-transformed and transformed. You’re changing the units of the dependent variable which affects the variance structure, etc. What you really need to determine is whether the transformation has resolved the problems you identified in the untransformed model. Don’t use transformation to reduce the size of your residuals because that’s not what the end result is. They might look smaller, but they’ve just been transformed. Instead, identify problems with the residuals (nonnormality, heteroscedasticity, etc.) and see if the transformations resolve those issues. However, transformations should be your last resort for fixing those problems.

It’s possible to back transform the values from the model after the analysis. You’d need to do that if you want to interpret things like means, coefficients, predictions, intervals, etc. using the natural data units rather than the transformed units. Some software can do that for you automatically, but it can be a process if it doesn’t!
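As a rough illustration of back transforming, here is a minimal sketch that assumes the same log(x + 1) transformation as in Jack's loop. The `price` values are made up for the example, and `np.log1p` / `np.expm1` are NumPy's built-in pair for log(x + 1) and its inverse.

```python
import numpy as np

# Hypothetical right-skewed values standing in for one of the columns.
price = np.array([1200.0, 3500.0, 8000.0, 15000.0, 72000.0])

# Forward transform: log(x + 1), written with the numerically stable log1p.
price_log = np.log1p(price)

# Back transform: exp(y) - 1 undoes log(x + 1) and returns the original units.
price_back = np.expm1(price_log)
print(np.allclose(price, price_back))

# Back transforming the mean of the logged values does NOT give the arithmetic
# mean of the original data. It gives a geometric-type mean, which sits at or
# below the arithmetic mean.
back_transformed_mean = np.expm1(price_log.mean())
print(back_transformed_mean, price.mean())
```

One option that may help with EDA is to keep the original columns alongside the `_log` columns instead of dropping them, so plots and summary statistics can stay in natural units while the model uses the transformed ones.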

Lara says

Hi Jim,

I am currently analysing results from a hierarchical multiple regression. I have tried to use the SPSS manual to report my results, however the example they give has the same number for Sig. and Sig. F Change, and I don’t understand which value goes where and what they therefore mean for my results.

I currently have p (sig.) = .036 and p (sig. f change) = .065

As you can see, one is significant and one is not; therefore, they have a substantial impact on my results.

My reporting so far is as follows:

After the addition of BAS and FFFS in Step 2 the total variance explained was 20%, R2adjusted = .11, F (4, 40) = 2.415, p = .036. The two measurements explained an additional 15% of the variance after controlling for gender and years of experience, F change (4, 40) = 3.620, p = .065.

Are the values (sig. and sig. f change) in the wrong places?

Nadja says

Hi Jim!

When I do a simple linear regression, the independent variable is positive and significant. But when I put it in a model with other variables, it turns negative and significant. How can the variable go from having a positive to a negative effect on the dependent variable?

Jim Frost says

Hi Nadja,

I’d say the most likely reason for your scenario is that in your simple linear model, you’re witnessing omitted variable bias in action. One of the other independent variables in the multiple regression model is a confounder. When you include that individual IV into the multiple regression model, the presence of the confounder reduces that bias. To see how this works, read my post about omitted variable bias and confounders.
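To see how a sign can flip, here is a small simulation sketch. Everything in it is hypothetical: the variable names, the effect sizes, and the `ols` helper are invented for illustration, and the data are generated so that `z` is a confounder related to both `x` and `y`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: z is the confounder, positively related to both x and y.
z = rng.normal(size=n)
x = 0.9 * z + rng.normal(scale=0.5, size=n)
# True model: x has a NEGATIVE effect on y once z is held constant.
y = -1.0 * x + 3.0 * z + rng.normal(size=n)

def ols(columns, response):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(response))] + columns)
    coefs, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coefs

simple = ols([x], y)        # z omitted: x absorbs part of z's positive effect
multiple = ols([x, z], y)   # z included: x's own negative effect shows up

print("slope for x, z omitted: ", simple[1])
print("slope for x, z included:", multiple[1])
```

In this setup the simple regression slope for `x` comes out positive while the multiple regression slope is negative, which is the same pattern described in the question.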

Mohd Fikri Rosely rosely says

hye Jim,

What if there are 3 categorical variables and each variable has 4 levels to run multiple linear regression? How do I go about it?

Jim Frost says

Hi Mohd,

Unfortunately, I don’t have a blog post to refer you to. However, in my regression book, I discuss using categorical variables in multiple regression at length.

mohamed omran says

good morning Jim

I want to ask about a p value in a logistic regression I did for multiple independent variables and one dependent variable. The p value appeared as #NUM!. What does that mean?

thanks

Amanuel says

Hi Jim

What do you make of the significance in the multinomial logistic regression SPSS output below? I am still a bit uncomfortable about the meaning of significance (and accept, reject) with regard to logistic regression (as opposed to other types) and the associated p-value interpretation. Thanks for your reflections.

Model Fitting Information

| Model | Model Fitting Criteria: -2 Log Likelihood | Likelihood Ratio Tests: Chi-Square | df | Sig. |
|---|---|---|---|---|
| Intercept Only | 179.427 | | | |
| Final | 129.239 | 50.188 | 6 | .000 |

Raven Cay says

What do you think the best analytical tool for likert scale ? and questions answerable by yes or no?

Ken Abbott says

Hey Jim. Thanks for your blog. It’s useful.

I’ve run thousands of regressions but have never done any categorical work. Do I interpret the s.e.’s and t’s the same way? If I have a block of, say, 5 categories (think: brands) and the t’s are all small, should I do an F-test to see if they as a group are significant?

Jim Frost says

Hi Ken,

Yes, I’d recommend calculating the F-test for all the indicator variables that comprise the categorical variable–except leaving one out for the baseline level. So, for a categorical variable with 5 levels, you’d include 4 of the indicator variables and calculate the F-test for that set of 4. Basically you want to compare the model with all the variables including those 4 to the model with all the variables EXCEPT those 4 indicator variables. In that manner, you can determine whether the categorical variable as a whole provides a statistically significant contribution.
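The full-versus-reduced comparison can be sketched as follows. This is only an illustration under made-up assumptions: the data are simulated, `brand` is a hypothetical 5-level categorical variable, and `rss` is a helper written for this example rather than a library function.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical data: one continuous predictor and a 5-level categorical one.
x = rng.normal(size=n)
brand = rng.integers(0, 5, size=n)
brand_effect = np.array([0.0, 0.3, -0.2, 0.4, 0.1])
y = 2.0 + 1.5 * x + brand_effect[brand] + rng.normal(size=n)

# 4 indicator columns; level 0 is the baseline and gets no column.
indicators = np.column_stack([(brand == k).astype(float) for k in range(1, 5)])

def rss(design, response):
    """Residual sum of squares from a least-squares fit."""
    coefs, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coefs
    return float(resid @ resid)

ones = np.ones(n)
reduced = np.column_stack([ones, x])            # without the categorical variable
full = np.column_stack([ones, x, indicators])   # with the 4 indicators

q = indicators.shape[1]          # number of coefficients tested together (4)
df_resid = n - full.shape[1]     # residual degrees of freedom of the full model
f_stat = ((rss(reduced, y) - rss(full, y)) / q) / (rss(full, y) / df_resid)
print(f"F({q}, {df_resid}) = {f_stat:.3f}")
```

The p-value is the upper tail of the F distribution with `q` and `df_resid` degrees of freedom. If SciPy is installed, `scipy.stats.f.sf(f_stat, q, df_resid)` returns it.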

Susan says

Hi Jim, thanks for the valuable information. I have a question, my scatter plot in SPSS shows linear relationship between two constructs A and B, nevertheless, the p-value shows that the relationship between A and B is not significant. I was wondering if this is possible? if so, how can I interpret this result? Thanks in advance!

Best regards,

Susan

Jim Frost says

Hi Susan,

It’s definitely possible. If the relationship in the graph is weak and/or you have a small sample size, your data might not provide sufficient evidence to conclude that the relationship exists in the population. It’s possible that random chance created the appearance of this relationship in your sample but it does not exist in the population. Or, perhaps it exists in the population but the small sample size and/or weak relationship made it so the hypothesis test cannot detect it (i.e., the hypothesis test had insufficient statistical power). What’s your sample size and how strong is the correlation?

Abhi says

hi Jim,

Does your interpretation of coefficients also apply to time series regression? Meaning that a 1 unit change in an independent variable indicates a ‘coeff_value’ unit change in the dependent variable?

many thanks for your helpful website !

Abhi.

Jim says

Hi Jim,

Thanks for the detailed explanation on the interpretation of p-values.

I have one question:

Why does adding a second IV into a simple regression model alter the p-value of the 1st IV? For example, adding x2 into my model increased the p-value of x1 hence x1 now is statistically insignificant while x2 is statistically significant.

Thanks in advance.

Regards,

J

Jim Frost says

Hi Jim (great name!),

What you’re seeing is omitted variable bias in action. Leaving variables out of the model can bias the coefficients and p-values of the variables in the model. This occurs when there is a correlation structure between the IV, DV, and omitted variable. If you check the correlations between these variables, I’m sure you’ll see that it exists between all of them. When you add in the omitted variable, the bias from that particular omitted variable goes away and the results change.

For more information about the specifics of how/why this happens, read my post about omitted variable bias where I walk you through an example.

Dr. Sampark Acharya says

What if the constant is not significant but the residuals are significant?

Jim Frost says

Hi,

Can you please clarify what you mean by the residuals being significant? Do you mean a normality test or something? Thanks.

Di says

Hi Jim,

Sorry to bother you

“This estimate shows that if the realization of the random variable X increases by 1%, the expectation of the random variable Y decreases by 0.3 units.” Which model are we talking about:

(a) E[ln(Y) | X] = 7 − 0.3 ln(X)

(b) E[Y | X] = 5 − 0.3 ln(X)

(c) E[ln(Y) | X] = 2 − 0.3X

My professor told me to look if X and Y are relative or absolute. I know that X is relative because is in %, and Y is absolute because is in units. But I dont understand the responses.

Thank you in advance

Zubayda says

Hello Jim, thanks for your informative blog. It is really helpful^^

I have same problem. My Intercept is 9.49622E-05 on P-value. How can we interpret them. Is it normal (or bad) model if intercept is bigger than 0.05?

Aleeha says

Hi Jim, your articles are really informative and helpful.

If we have an insignificant y-intercept and slope, then how can we interpret them?

Jim Frost says

Hi Aleeha,

Your results indicate that you have insufficient evidence to conclude that the constant is different from zero. That’s no big deal. As I discuss in this post, an insignificant slope coefficient indicates that you have insufficient evidence to conclude that the coefficient is different from zero. That’s a big deal. Read through the article to see what that means.

Muazzam Hashmi says

just need to know that how we can find Raw R square on stata. stuck here

kgotso rasentsoere says

in different estimated regression equations, why estimated regression coefficients differ from one another?

Muazzam Hashmi says

Hey there,

Just need your help to explain the results of a variable with no coefficient. its just we cannot use normal R square. So what will be used to explain the results of a variable from no coefficient.

Jim Frost says

Hi Muazzam,

In regression analysis, all independent variables in the model will have a coefficient.

Naeem Aslam says

Hello everyone

I need your help to resolve a question as i am not Statistician hence your support will be highly appreciable.

Question is

What will be the expected ticket sales if

Distance from capital city is 150km

Population of the city 15000

Ticket barrier 30000

Demographic profile of town is 3

Following is the regression info i have:

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | (60,374.10) | 17,044.98 | (3.54) | 0.0016 | (95,553.21) | (25,194.99) |
| X Variable 3 | 0.41 | 0.12 | 3.33 | 0.0028 | 0.15 | 0.66 |

Jim Frost says

Hi Naeem,

I can’t tell which coefficient belongs with which variable. In fact, it looks like they might not all be included in the output. But, all you do is enter those values into your regression equation and calculate the answer.

Jacquie says

Hi Jim,

I’m currently working on a practice problem with the below information regarding housing prices:

Price = 42,000 (sales price in $) + 30.0 SQFT

T statistics: (3.72) (7.67)

P Value: (.001) (.001)

R2 = .723. F = 18.76 (0.002). N=274

What formula would I use to calculate what the house would cost if it is priced as the regressions say it should be?

Jim Frost says

Hi Jacquie,

The first line of your output is the regression equation. Converting into Xs and Ys:

Y = 42,000 + 30X

You enter the X value, which is the square footage. The equation predicts Y, which is the mean house price. 42,000 is the constant. 30 is the coefficient for X.
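As a quick worked example, the equation can be written as a small function. The function name `predicted_price` and the 2,000 square foot input are made up for illustration; the constant and coefficient come from the output above.

```python
def predicted_price(sqft: float) -> float:
    """Mean predicted price in dollars from Price = 42,000 + 30 * SQFT."""
    constant = 42_000.0   # constant from the regression output
    slope = 30.0          # dollars per additional square foot
    return constant + slope * sqft

# 2,000 square feet is a hypothetical input.
print(predicted_price(2000))  # 102000.0
```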

Disha Singh says

Hi Jim,

I have a dataset where most of the columns are just categorical variables, with value being either 0 or 1 in the columns. On performing the logistic regression i get few of these variables’ p value greater than 0.05, (0.05 being considered as the significance level), what should be the next approach, would applying transformation on these categorical column make any difference? or should i just drop these columns or should i proceed with these variables in my dataset?

Jim Frost says

Hi Disha,

Those are called indicator variables and indicate the presence or absence of a particular characteristic. In a binary logistic regression model, using those variables as independent variables, you can learn how the presence of a characteristic relates to the odds of the “event” happening. The event is one of the two possible values that you have for your binary outcome variable.

Typically, when an independent variable is not significant, as you describe, you at least consider removing it from the model. You don’t want to include too many variables that are not significant because it reduces the precision of your model. However, you do need to let theory guide you as well. I discuss this process in my post about model specification. That’s written in the context of OLS, but also applies to binary logistic regression. It’ll give you some tips about which variables to include and exclude. Pay particular attention to parts about using theory.

I hope this helps!

Aurel says

Hello,

This was very helpful thank you very much for sharing it with us.

But, I am having this kind of issue and I have no idea what is it.

metser (the variable) and i have the results of the coefficient -3.10e-06

What is the e-06?

Thank you very much in advance.

Jim Frost says

Hi Aurel, that’s scientific notation, which is used for very large or very small numbers. In your case, you have a very small number. The number to the right of the e tells you how many places to move the decimal point. The negative sign on that exponent means you need to move it to the left by 6 places. For your coefficient, it corresponds to the following: -0.00000310
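If you want to check a conversion like this yourself, most programming languages read scientific notation directly. A minimal Python sketch using the coefficient from the question:

```python
coef = float("-3.10e-06")

# Fixed-point formatting with 8 decimals spells out the shifted decimal point.
print(f"{coef:.8f}")          # -0.00000310

# The two notations are the same number.
print(coef == -0.0000031)     # True
```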

Carolina C Cardenas says

Jim,

your blog is extremely helpful. I am trying to get a handle on when you use the standardized vs. unstandardized coefficients in the regression equation. I used Intellectus Statistics for my linear regression and the equation uses the B (unstandardized) vs the beta (standardized). I know what they each are, but most of the equations I see use the beta. Looking it up in APA, which explains it as using what is most “interpretable” I can’t understand the “why”? Can you explain that? Or is there another blog you have posted I can go to? Thank you!

Jim Frost says

Hi Carolina,

There are two reasons I’m familiar with about why you’d want to use standardized coefficients.

For one thing, standardized coefficients help you compare the importance of various IVs when they use different scales. Standardizing them places them all on a common scale, allowing comparisons. I write about that in my post about identifying the most important variable in a regression model.

The other reason is when the units for an IV are not meaningful themselves. I gather this happens frequently in psychology where you have assessments and inventories for different characteristics. Unstandardized coefficients measure the mean change in the DV given a one-unit increase in the IV. However, if the IV units are meaningless, then the unstandardized coefficient also does not have much meaning. By standardizing them, the coefficient represents the change for a one standard deviation change in the IV. That can be more meaningful. I think that is getting to the “interpretable” aspect that you found.

That usage becomes more important for the less tangible units. However, for cases when you’re measuring more tangible units, such as height, weight, temperature, etc., those are inherently more interpretable in the original units. So, you wouldn’t need to standardize them to make them more interpretable.

The best coefficients to use really depend on the variables you’re including in your model. Are you measuring something fairly concrete and tangible? Or something where the units don’t have an inherent meaning? You can also look at other studies that include the same variables to see how they handled them.
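For a concrete view of how B and beta relate, here is a minimal sketch on simulated data. It assumes the common convention where both the IVs and the DV are converted to z-scores before fitting. The variable names and the `slopes` and `zscore` helpers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical predictors on very different scales.
height_cm = rng.normal(170, 10, size=n)
score = rng.normal(3, 0.5, size=n)          # e.g. an inventory score
y = 0.05 * height_cm + 2.0 * score + rng.normal(size=n)

def slopes(predictors, response):
    """Least-squares slopes (an intercept is fitted but not returned)."""
    design = np.column_stack([np.ones(len(response)), predictors])
    coefs, *_ = np.linalg.lstsq(design, response, rcond=None)
    return coefs[1:]

def zscore(a):
    """Center each column and scale it to a standard deviation of 1."""
    return (a - a.mean(axis=0)) / a.std(axis=0, ddof=1)

X = np.column_stack([height_cm, score])
b = slopes(X, y)                         # unstandardized: DV units per IV unit
beta = slopes(zscore(X), zscore(y))      # standardized: SDs of DV per SD of IV

# The two sets are linked by beta = b * sd(x) / sd(y).
print(b, beta)
print(np.allclose(beta, b * X.std(axis=0, ddof=1) / y.std(ddof=1)))
```

The unstandardized values depend on the measurement units of each IV, while the standardized ones are all expressed in standard deviations, which is what makes them comparable across IVs.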

I hope that helps!

Jessy says

Hey Jim,

I like your hp. your articles are very clear! Maybe you can help me: I still don’t understand how estimates are interpreted at GLMs.

in my case the negative estimate e.g. of migration influences the dependent variable e.g. Fish negative. And the greater this estimate, the greater the influence. In this case, migration has the greatest impact that is negative. And the larger the value, the lower the probability that fish will appear with increasing migration. Am I right?

Thanks and regards,

Jessy

Jim Frost says

Hi Jessy,

I think there are two issues you’re asking about. How to interpret a negative coefficient and which coefficient has the greatest influence.

When you have a negative coefficient, it means that as the value of the independent variable increases, the mean of the dependent variable tends to decrease. And, yes, if the value of that IV is larger, you’d expect the DV to be even lower. I don’t fully understand the particular variables in your model so I can’t interpret them specifically, but that’s the gist of how it works.

In terms of the magnitude of the estimate, that gets a bit more complicated. I’ve written a post about identifying the most important variable in a model. It’s not as simple as looking at the estimate’s absolute value for a variety of reasons that I cover in that post. I’d read that post for more information.

I hope that helps!

Emma says

Hi Jim,

I’m wondering if you could help me understand the following:

I ran a multiple regression (two model) predicting a continuous variable and included covariates in first model and Dummy coded categorical variables (n-1) added to the second model. In sum, the regression shows one ethnicity is statistically different to the reference ethnicity.

However, when running an ANCOVA (with the same DV and the same variables as Covariates, ethnicity as fixed factor) I find that all my ethnicities are different to the reference ethnicity category I used for regression (this is what I expected).

I’m wondering why the results from the regression and ANCOVA are different.

I don’t know whether you have covered this on your website (as I can’t find it).

Your assistance in understanding this difference would be really helpful.

Kind regards,

Emma

Jim Frost says

Hi Emma,

I don’t know exactly what is happening. To know the real details, you’ll need to see how your software calculates significance. However, I do have a guess that I feel fairly confident in. First, I’ll assume that you’re using the same baseline level for your categorical variable.

I’m guessing that it has to do with when the software enters each variable into the model. For regression models, software usually uses the adjusted sums of squares, which means that the results for each variable are based on it being entered last in the model. In that case, the significance of the categorical levels depends on what is not already accounted for by the continuous variables.

However, I’m guessing that your ANCOVA model is entering the categorical variable first and then covariates. For some statisticians/disciplines, covariates are treated as nuisance factors that are entered into the model after the primary variables of interest. However, that’s not a consistent practice. Consequently, I can’t say for sure that is happening in your case. However, it would explain why you’re obtaining more significance using ANCOVA than regression. If this scenario is correct, the categorical variables are, in a sense, being given a priority in explaining the variability in the DV over the covariates.

That’s my best guess!

Biya says

ANOVA

| Model | | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|---|
| 1 | Regression | 2326.985 | 1 | 2326.985 | 3.937 | .048 |
| | Residual | 248257.025 | 420 | 591.088 | | |
| | Total | 250584.010 | 421 | | | |

What does this ANOVA table mean from linear regression? Is there a significant correlation?

Jim Frost says

Hi Biya,

You’re looking at the F-test of overall significance. Click the link to read my post about what it is. If the p-value = 0.048 (hard to tell with the formatting), it does look significant at the 0.05 level. The other post tells you exactly how to interpret that. It doesn’t apply to a single IV/predictor but rather the entire model.
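To show where the F-value in a table like this comes from, here is a short arithmetic sketch using the sums of squares and degrees of freedom from Biya's output. The R-squared line is an extra calculation from the same numbers, not something shown in the original table.

```python
# Values copied from the ANOVA table in the comment above.
ss_regression, df_regression = 2326.985, 1
ss_residual, df_residual = 248257.025, 420

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares.
f_stat = ms_regression / ms_residual
print(round(ms_residual, 3))  # 591.088
print(round(f_stat, 3))       # 3.937

# R-squared = regression SS / total SS.
r_squared = ss_regression / (ss_regression + ss_residual)
print(round(r_squared, 4))    # 0.0093
```

With these numbers the model is statistically significant but accounts for less than 1% of the variance, so significance by itself does not imply a strong relationship.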

I hope that helps!

Chris says

Hi Jim

Your blog helps a lot.

Thanks!!

(Chris from Kenya, East Africa)

Nur Lyana says

sorry for the late response,

I’m writing using Harvard format and it requires the month/year for web pages. If the dates are not available, no worries, I can still cite with (, no date). thanks!

Jim Frost says

Hi Nur,

Thanks for the clarification. Here’s a link to the Harvard Format for citations. Scroll down to near the bottom of the page until you see the section, “How to Cite a Website in Harvard Format.” They show how to use a date accessed format.

The thing with web pages is that they change over time. So, the date that you access it is much more important than when the page first appeared.

Best of luck with your paper!

Sophie says

Hi Jim,

First of all, thanks for your article!

I’m performing a hierarchical logistic regression for my thesis, using three predictors to determine whether they would predict whether an individual has language dominance in the left or right hemisphere. In my full model (the final step of the regression where all three predictors are included), only one predictor appears to be significant (P<.05). However, the 'Constant' also appears to be non-significant also (p=.075) in this full model, but was significant (p=.041) in the first level analysis (including only the DV and 1 predictor). I'm just a little bit confused as to why the p-value appears to increase to a point of non-significance, and was wondering whether you could explain it to me?

Thanks again!

Jim Frost says

Hi Sophie,

I always urge caution when trying to interpret the constant and its significance. To learn why, read my post about interpreting the constant.

For your case, it just means that with fewer predictors, the model was able to determine that the difference between the estimated constant and zero was statistically significant. In other words, the constant doesn’t equal zero. However, as you added predictors, that difference was not significant. You have insufficient evidence to conclude that the constant doesn’t equal zero in your final model. Alternatively, for the final model, if you look at the CI for the constant, it will include zero.

Unfortunately, that conclusion doesn’t mean much in most circumstances. Again, read my other post to learn why.

Hemant Yawalikar says

Hi Jim

Many many thanks

This article is really helpful interpreting regression analysis. It helps me to clear the relationship between IV and DV in terms of coefficient. It also help me how to interpret graph.

Thanks a lot

Regards

Hemant

Nur Lyana says

Hi Jim,

Many thanks, your blog really helped me to get clear understanding on various topics. May I know when was this article was written so that I can cite for my thesis? or is there any link/archive where I can view each article publish date for citation?

cheers.

Jim Frost says

Hi Nur,

I’m glad this article was helpful for you! When you cite web pages, you actually use the date you accessed the article. See this link at Purdue University for Electronic Citations. Look in the section for “A Page on a Website.” Thanks!

SK says

First of all thanks for your reply.

I am basically measuring the effect of succession planning practices on employee engagement…

HR policy related practices is one of the independent variables.

Following 5point likert statements were used to measure HR policy related practices:

– In my organization, there is a clear career path for each employee on his/her work position.

– There is an HR policy in the workplace to prepare personnel for their next roles.

– There is an existing HR policy to develop the employees to replace the aging workforce.

– There are systems (job postings, position descriptions etc.) and open communication so employees can gain information about opportunities within the organization.

– Grooming and promoting people for key positions from within constitute a part of accepted organization’s policies/philosophy.

I did not come across anything that shows a negative relationship;

One reason could be lack of transparency in communicating the HR policy to employees.

Plz guide me as to how I should interpret the results and what more investigation do I have to do?

Thanks and lookin forward.

Jim Frost says

Hi,

You should always note the type of variables with which you’re working. Likert scale data are ordinal data. If you’re using Likert item scores as your observation values, it becomes a bit tricky. They combine some of the properties of continuous data and some properties of categorical data. You’ll have to choose one or the other. I don’t have a blog post that covers this but I do write about it in my ebook about regression analysis, which you should consider.

I can say that, while residual plots are always important to check, you have more reason to do so with ordinal data because they are tricky data.

You’ll need to get a handle on what the values represent and what it means to increase an HR variable by one unit. What is each variable measuring exactly? And, yes, the negative coefficient tells you that you have a negative relationship. As the value of the HR variable increases, your DV tends to decrease. That’s a negative relationship.

For those items you list, are you using each variable as an independent variable? Or, are you averaging or summing them? Was the negative coefficient that was significant for one of them?

Upen says

Hello,

I have linear regression in this form

log(price) = B0 – B1 . X1 + B2 . X1 ^ (1/3). How should I interpret this one? One unit of change in X1 increases price by (B1 + B2) * 100 %?

SK says

Hello Jim…

I found your way of explanation very helpful.

Currently in the process of submitting my thesis… I am confused about the results that i got and I need your help…

In my Multiple regression table: 2 B coefficient values are negative

X1 (Promotion and Internal Recruitment) —– Beta coefficient = -.029; whereas it’s p value = .763

I interpreted it as this shows an inverse relationship; where if X1 (Promotion and Internal Recruitment) increases by 1 unit, holding other variables constant, then the value of Y “employee engagement” will decrease by 0.029.

it’s p value .763 > 0.05, therefore it shows that it doesn’t have significant impact on employee engagement, so the null hypothesis is accepted and the alternative hypothesis is rejected.

however for my other variable: HR policy the Beta coefficient is negative but the p value is less than 0.05

X2 (Human Resources Policy)——- Beta coefficient = -.183; whereas it’s p value =.025

I interpreted it as If X2 (Human Resources Policy) increases by 1 unit, the value of Y will decrease by 0.183.

but as p value =.025 < 0.05, which indicates that it has a significant impact on employee engagement,

therefore the null hypothesis is rejected and the alternative hypothesis is accepted.

MY MAIN QUESTION IS REGARDING THE SECOND VARIABLE I.E. HR POLICY

how do I explain the results; why something like this happens?

Jim Frost says

Hi SK,

First, one thing for you to note. You never accept the null. I know that sounds strange but you fail to reject the null. Click the link to read an article where I explain why understanding that is crucial!

As for your interpretation of the significant coefficient, yes, you are correct assuming it is a continuous variable. The results suggest that for every one-unit increase in HR policy (whatever that means?!), employee engagement decreases by an average of 0.183. (Be sure to note that it is a mean decrease.) However, not understanding the nature of the HR Policy variable and what it measures numerically, I can’t really interpret it more or explain why.

Does that negative relationship fit theory? If not, you have some investigation to do!

jacqueline says

Hi Jim!

Thank you so much for your posts – they’re very helpful.

I was wondering what it means if the coefficient isn’t statistically significant – do we then disregard the R^2 value?

thank you!

Giulio Castelli says

Dear Jim, thanks a lot and congrats for the brilliant blog!

Hadas says

thank you Jim for your help

Giulio Castelli says

Hi Jim,

what to do if the p-value of the constant is over 0.05 and should be removed from the model, while the other ones are ok?

Jim Frost says

Hi Giulio,

You almost always want to leave the constant in even when it’s not significant. Removing it can bias all the other coefficient estimates.
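Here is a small simulation sketch of that bias. The data are hypothetical, generated from a known relationship with a nonzero constant, so the fitted slopes can be compared with the true value of 2.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Hypothetical data whose true relationship is y = 10 + 2x + noise.
x = rng.uniform(1, 5, size=n)
y = 10.0 + 2.0 * x + rng.normal(size=n)

# Fit with a constant (column of ones) and without one.
with_const, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), x]), y, rcond=None)
no_const, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)

print("slope with constant:   ", with_const[1])   # close to the true 2
print("slope without constant:", no_const[0])     # pulled well above 2
```

Forcing the line through the origin makes the slope compensate for the missing constant, which is why the estimate drifts away from the true value.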

Jennifer says

Hi Jim,

Thank you for this awesome post. I had a question about removing variables that are not significant (p>0.05) from my multiple logistic regression model. I have 10 predictors, and have been building models using every combination of these predictors (using the glmulti function in R) and selecting the best model with the lowest AIC score. Then I looked at a summary of the best model and found that there are some variables with coefficients with p>0.05, but when I remove these, the AIC score increases. Would you still recommend removing variables that are not significant from the regression model if removing the variables increases the AIC?

Jim Frost says

Hi Jennifer,

There are a number of potential issues here.

You’re using a data dredging technique that might be finding a model that has the lowest AIC by chance rather than because it’s actually the best model. I write about that in my post about using data mining to select regression models. You should read that. I don’t use AIC in that post. Instead, I use R-squared, but the principles are the same. The problem is you can’t necessarily trust p-values when you use that approach.

Another issue I’m unsure about is your sample size. Is it large enough to support 10 predictors? If not, you can’t trust the p-values for that reason either! Read about that: Overfitting your model.

Now, on to your specific question. One thing to consider is how much does AIC change when you remove the non-significant predictor? Perhaps it doesn’t reflect much of a change in terms of model fit? Typically, it is good to remove variables that are not significant unless theory strongly suggests you keep them in. Without knowing the subject area and the amount of change, I can’t give a concrete answer.

I’ve written an article about selecting the best regression model. This process is a mix of statistical measures and subject area knowledge. It sounds like you’re predominantly using statistical measures and I think applying more subject area knowledge will be really helpful. I’d read that article and pay particular attention to the section about theory near the end. Again, I don’t talk about AIC, but I do use other similar measures, such as R-squared and its variants. Also, pay particular attention to the discussion about residual plots. I wonder if in your case removing the non-significant variable actually creates patterns in the residual plots? If that’s the case, it would be good to not remove. But, you’d need to check the residual plots to make that determination.

As you can see, there’s a lot to consider. I wish I could give you a simple answer!

Dr kashif says

Hello, can I determine a p value when one variable is constant? Like the correlation of a disease with some cancer, where cancer could be yes or no but the disease is constant. Please reply.

Jim Frost says

Hi Kashif,

I’m not clear on what two variables you want to correlate. However, in general, to have a relationship or correlation between two variables, you must have variation in both variables. Correlation means that as two variables vary, they tend to change either in the same direction or in opposite directions. If one of the variables does not vary at all as the other variable varies, there is no relationship. If I understand your example correctly, cancer can be present or not present while the disease is always present. In that case, there is no relationship. However, if you had cancer present/not present and disease present/not present, there is variability in both variables and you could use a test of independence to determine whether a relationship exists. Is there a relationship between the presence of the cancer and the presence of the disease?

But, if all observations have the disease (or all observations don’t have the disease) there is no relationship. P-value equals 1 (or you might get an error depending on the analysis).
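To make the point about variation concrete, here is a minimal sketch in Python. The data and the `pearson_r` helper are made up for illustration (1 = present, 0 = absent). It shows that the correlation is undefined when one variable never changes, which is why software may report an error in that case.

```python
import math

def pearson_r(x, y):
    """Return Pearson's r, or None when either variable has zero variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return None  # a constant variable: the correlation is undefined
    return sxy / math.sqrt(sxx * syy)

# Hypothetical observations
cancer = [1, 0, 1, 1, 0, 0, 1, 0]
disease_constant = [1, 1, 1, 1, 1, 1, 1, 1]  # every observation has the disease
disease_varies = [1, 0, 1, 0, 0, 0, 1, 1]    # disease present in some, absent in others

print(pearson_r(cancer, disease_constant))  # no variation, so nothing to correlate
print(pearson_r(cancer, disease_varies))    # variation in both, so r can be computed
```

For two binary variables like these, a test of independence on the 2x2 table is the more usual analysis. The correlation here only illustrates why variation in both variables is required.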

Prisca Keery says

Hi Jim

I am hoping you can help with my statistical question. I am looking to conduct a study with low sample size with one IV and either 3 DV’s or 9 DV’s. What statistical analysis issues may I encounter with the more DV’s I include in my study given the low sample size?

Hadas says

Sorry Jim; I got good information from your answer to Lola about how to handle insignificant variables, but I am still not clear: what if the variables are assumed to be very important for the study and their p-values are greater than 5%? What if too many variables have p-values greater than 5% at the 95% confidence level? The questions are Likert-type and I measure participation in 5 levels. Is that the problem here? By the way, results from descriptive analyses showed that the variables have effects on the dependent variable. You also said that removing an important variable is potentially more problematic than leaving in a variable that’s not important. So what is your suggestion?

Thank you so much for your unreserved helps

Shareful says

I’m doing a multi-regression analysis using infection rate of COVID-19 as dependent variable and air pollution, meteorological and social data as independent variables in order to understand their relationship. I have found very good results from this analysis and I’m considering p value, R2, Beta coefficient and influence results in my write up. All are OK to be added in my paper? What do you suggest in my case. By the way, I’m from Bangladesh, my study area will be in Dhaka.

Lola says

Thanks ever so much Jim. This is so helpful and insightful. I am so appreciative!!! thanks!!!

Jim Frost says

You’re very welcome Lola! Best of luck with your research!

Lola says

Dear Jim

You mention that “It is standard practice to use the coefficient p-values to decide whether to include variables in the final model”

Pls I would like to understand the following

(1) What do you mean by ‘final model’?

(2) Does it mean that the regression has to be re-run with the insignificant variable taken out, if that variable’s coefficient is NOT statistically significant??

(3) If the variable is left in the model as it is, how detrimental could this be in reality? Or is there a possibility that that one ‘problematic’ variable will not matter much if its just that one

Jim Frost says

Hi Lola,

By final model, I just mean the model that researchers consider to be the best model. Researchers will often remove variables that are not significant. If you leave too many insignificant variables in the model, the model is less precise. And, yes, removing a variable means that you re-run the model without that variable. However, you don’t have to remove insignificant variables. Sometimes you want to leave them in because you are specifically testing them. Or, perhaps theory suggests it’s important even if the p-value indicates otherwise. Leaving in some insignificant variables will generally not reduce precision too much. And, if you’re not sure that it is unimportant, it’s often better to leave a variable in. Removing an important variable is potentially more problematic than leaving in a variable that’s not important. Again, if you leave in too many unimportant variables, it can reduce precision.

So, it’s a balance. But, leaving in one insignificant variable shouldn’t usually be a problem unless you have a small sample size for the complexity of your model. For more information, read my post about choosing the correct model.

Harry says

Hi Jim,

Thank you for the helpful guide. I was wondering whether the p-value for the dependent value is important and if this also has to be below 0.05 for the null hypothesis to be rejected?

Harry

Jim Frost says

Hi Harry, you only get p-values for the independent variables and constant. You don’t get one for the dependent variable.

Sami says

Hi, what if one of the independent variables takes negative and positive values. How can we interpret the coefficient associated to such a variable (I mean its effect on the dependent variable) ? And what if this variable takes only negative values ?

Jim Frost says

Hi Sami, if you have a negative coefficient and a positive coefficient, that just indicates that each independent variable has a different type of relationship with the dependent variable. For the IV with the positive coefficient, you know that as that IV increases in value, the DV also tends to increase in value. There’s a positive correlation between that IV and DV. For the IV with a negative coefficient, you know that as that IV increases, the DV tends to decrease in value. There’s a negative correlation between that IV and DV.

One thing you write confuses me: “one of the independent variables takes negative and positive values.” One IV can’t have both a positive and a negative coefficient in one model. It has just one coefficient, whether it’s positive or negative. Unless you mean you’re fitting different models with different variables and the sign changed based on the specific combination of IVs in the model. That’s a different matter. If that’s the case, let me know!

Nikhil Talwar says

Thanks for the explanation Sir. I have one basic question on interpretation of Beta Values ( coefficient of independent variables). If the independent variables are categorical/qualitative then how do we interpret?

Example let’s say there are 3 categorical independent variables

1.Marital status – Married

2.Marital status – single

3.Marital status – Widower

with Beta Values -1.233;9.234;-2.878 respectively

Then how do we interpret these? Assume the dependent variables is the premium rate

Jim Frost says

Hi Nikhil,

In regression, you interpret the coefficients as the difference in means between the categorical value in question and a baseline category. So, you have to know which category is the baseline. The output should indicate. If it doesn’t state it explicitly, it’s the category that is not listed in the output or does not have a coefficient value. The associated p-value allows you to determine whether the mean difference between a category and the baseline category is not zero.
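As a rough illustration of the baseline idea, the sketch below fits a dummy-coded model to made-up premium data. The premiums, the choice of Married as the baseline, and the small `ols` helper are all assumptions for the example, not anything from Nikhil’s actual output. With a single categorical predictor, each coefficient equals that category’s mean minus the baseline mean.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y]
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical premium rates by marital status
premiums = {
    "Married": [100, 110, 120],   # baseline category (gets no dummy column)
    "Single": [130, 140, 150],
    "Widower": [90, 100, 95],
}

X, y = [], []
for status, values in premiums.items():
    for value in values:
        # Columns: intercept, Single indicator, Widower indicator
        X.append([1, int(status == "Single"), int(status == "Widower")])
        y.append(value)

intercept, b_single, b_widower = ols(X, y)
print(intercept)   # mean premium of the baseline category (Married)
print(b_single)    # mean(Single) minus mean(Married)
print(b_widower)   # mean(Widower) minus mean(Married)
```

Changing the baseline category changes every coefficient, which is why you need to know the baseline before interpreting the output.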

I write much more about categorical variables in my regression ebook.

Ghada MGA says

Hi,

Your explanation was very helpful to understand the regression.I have some data to interpret ,the t values of the predictors are all statistically insignificant “p>0.05” but the F statistic of the whole model is significant!

is it possible to have a situation like that ?

Jim Frost says

Hi Ghada,

Yes, it is possible! In my post about the overall F-test of significance, I write about that type of situation and why it occurs. Read that post and see if it answers your questions.

Danijela B. says

Hello Jim,

I found your explanation to be very thorough and easy to follow. I have a dilemma and question that I am hoping you can answer. I am expecting a positive sign, and my results show a negative coefficient but statistically significant. Am I right to interpret it as:

The results show that there is a statistical significance because its p-value is 0.0108, which is less than the significance level of 0.10. The sample data provide enough evidence to reject the null hypothesis. Beta values take into account standard errors, which are used to determine if the value is significantly different from zero by evaluating the t – statistic value. For the model, the beta value is -1.660618, the t-value is -2.561538, and the p-value is 0.0108. This suggests that this variable is significant, and further explains that IV negatively affect DV, and the relationship is significant. The coefficient value is -1.660618, which indicates that when the independent variable increases by one unit, the dependent variable will decrease by 1.660618.

My professor said that I cannot reject the null hypothesis if the sign is not what I was expecting even though the p values is significant. I know you talk about omitted variable bias in some of the previous comments, but assuming I have the right model and fit, I am interpreting this correctly?

Jim Frost says

Hi Danijela,

I have to agree with your professor. There are several ways you can get erroneous coefficients. Omitted variable bias is one key way. You might also be overfitting your data or have multicollinearity (although you probably would’ve had insignificant results then). It’s also possible that you’re not accounting for curvature. Having significant variables doesn’t mean that you’re fitting the correct model. So, I wouldn’t just assume that you have the right model.

I write about this exact case (unexpected coefficient signs) in the section about Theory in my post about model specification. What I’d recommend is checking your residual plots and doing research to see what others have found. What variables did they use? At the very least, you’ll need to have an explanation for why the unexpected sign is correct.

There are also other possibilities such as overfitting or data mining.

And about whether you’re interpreting the results correctly. What you write is the correct interpretation of the statistical output. The problem is that your coefficient is probably biased. Imagine you’re looking at a clock that runs slow. You can correctly read the clock and know that it says 9AM. However, if you have a meeting at 9AM, you’ll be late! You read the clock correctly, but didn’t make sure that the clock was running correctly. In your case, you’re interpreting the statistical output correctly. However, there’s probably some sort of assumption violation that is biasing your results. So, definitely kudos for the great interpretation. I don’t think most students would explain the results nearly so well as you have! But, we need to see what’s going on with the assumptions and how the model actually fits the data.

Best of luck with your analysis! 🙂

Dustin says

Thank you, Jim, for the very helpful blog post!

I had a couple of questions:

1. Assume we have a linear model with many predictors, and that their corresponding p-values confirm significance. If Beta X1 > Beta X2, can we simply state that X1 has a greater positive impact on y, or do we need to do additional testing to compare the two coefficients?

2. Using that same model, let’s say we wanted to compare a subset of predictors with another subset of predictors (all within the same model), and we wanted to prove that X1, X3, X4 collectively has a greater positive impact on y than X2, X5, X6. My instinct is to find the average of the coefficients for each group and compare the two averages, but I have a feeling it’s not that simple. Do I need to re-create the mode with those predictors grouped together or how could I prove this?

Cheers

Jim Frost says

Hi Dustin,

Determining the relative importance of the predictors can be difficult. But, you’re in luck! I’ve written a post about that. Identifying the Most Important Variables in a Regression Model. Read that post, and if you still have questions post them there!

Marisol says

Hi Jim, thank you for your post.

I’m wondering if regression includes both effect size and significance tests.

I think it does include effect size given that there are several ways to measure effect size in a regression analysis, including through the correlation coefficients, regression coefficients, partial and semi-partial coefficients, squared coefficients, and proportions of variance.

But I’m unsure if it includes significance tests. What do you think?

Jim Frost says

Hi Marisol,

Coefficients are the effects. P-values for the coefficients are the significance tests for the effects.

Ivo Brito says

Im a master student, currently developing my thesis on Post-Earnings Announcement Drift (An Euro Stoxx 50 Analysis) between 2012 and 2017. I’ve defined the event window (-20,0,20) and computed the normal returns and market model parameters for each firm in my study(51 companies). However, the Beta parameters of all aren’t statistically significant (p-value > 5%), which makes my study irrelevant, I don’t know if i did my calculations wrong or its how the sample is but i dont think the calculations are wrong since i tested them in both Excel and Eviews.

Can i still find something worth mentioning on the fact all the betas of my companies are not stastically significant?

Jim Frost says

Hi Ivo,

Several questions. Have you checked the residual plots? And, were any of your IVs significant? (I’m not quite clear if you’re saying that they all are not significant.)

Sarah says

Hi Jim, for my regression analysis, my p value is significant but my standardised beta value is -.13. My prediction was that when there is an increase in the value of my predictor variable, there would be an increase in the value of the dependent variable as well. Does the standardised beta value affect my results in any way, even though the predictor was significant?

Jim Frost says

Hi Sarah,

A coefficient for a standardized independent variable represents the mean change in the dependent variable given a one standard deviation change in the independent variable. The sign for a standardized variable will match the sign for an un-standardized variable. In your case, the negative sign indicates that as the IV increases the DV tends to decrease–a negative relationship.
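Here is a small sketch of that relationship using made-up numbers. It assumes only the independent variable is standardized (the dependent variable stays in its original units) and uses a hypothetical `slope` helper for simple regression. The standardized coefficient is the raw coefficient multiplied by the standard deviation of the IV, so the sign cannot change.

```python
import statistics

def slope(x, y):
    """Simple-regression slope of y on x."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data with a negative relationship
x = [1, 2, 3, 4, 5]
y = [10, 8, 9, 5, 4]

sd_x = statistics.stdev(x)
mean_x = statistics.fmean(x)
z = [(a - mean_x) / sd_x for a in x]  # standardized version of the IV

raw_coef = slope(x, y)   # change in y per one-unit change in x
std_coef = slope(z, y)   # change in y per one-standard-deviation change in x

print(raw_coef, std_coef)
print(raw_coef * sd_x)   # matches std_coef up to rounding
```

If your software also standardizes the dependent variable, the magnitude will differ from this sketch, but the sign still matches.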

If theory/other research suggests that there is a positive relationship (they both tend to increase together), you should investigate. I talk about this in my post about choosing the correct regression model. Look in the section about Theory.

Omitted variable bias (aka confounding variables) might be the reason. Perhaps your model excludes an important variable that is correlated with both the DV and IV in question? For more information, read my post about omitted variable bias.

Best of luck with your analysis!

Almoghirah says

Thanks Jim for your valuable comments and clear answers.

I read well the section of (prediction) because I’m interested in practical use of regression analysis. I have data for cost of different medical tests, so I regress cost against number of patients had the test and the price of the test. Although the model fits well but I found the prediction from the coefficients different from the reality. To be more specific: the model tells me that when the price increases by one unit the cost increases by 8495 units holding the number of patients had the test constant, but when I used Excel and increased the price by one unit for each test I found the cost different. Am I wrong?

Almoghirah says

Dear Jim

Thanks again and again

I would like to inquire about the characteristics of the data set for regression to be fulfilled before running the model in a computerized program.

Jim Frost says

Hi Almoghirah,

Regression analysis is interesting in terms of checking the assumptions. For other analyses, you can test some of the assumptions before performing the test (e.g., normality, equal variances). However, for regression analysis, the assumptions typically relate to the residuals, which you can check only after fitting the model. For specifics about these assumptions, read my post about least squares assumptions.

Almoghirah says

Dear Jim

Thanks a lot for very very nice and clear explanation. This is more that useful.

Nik says

Dear Jim!

Nice presentation!

Please explain me one issue. After logistic regression analysis I have found p=0.34, OR=1.4, 95%CI 0.9-3.4.

I understood that independent X does associated with outcomes Y (p=0.34). But OR=1.4 and it included in CI. How can I explain it?

Thank you.

Jim Frost says

Hi Nik,

The results of your model don’t show that there is a relationship between your IV and DV. The p-value indicates this because it is higher than any reasonable significance level. Additionally, the CI for the odds ratio (OR) includes one. In short, your results are not statistically significant. Your sample data do not provide strong enough evidence to conclude that this relationship exists in the population. However, keep in mind that non-significant results do not prove that the effect/relationship doesn’t exist. Just that your sample didn’t provide strong enough evidence to conclude that it exists. It could be that the sample size is too small or there’s too much variability in the data. Or, perhaps you need to include more variables in the model to control for potential confounding variables.
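The check on the confidence interval can be written out as a short sketch. The function names are hypothetical, and the p-value reconstruction assumes a Wald-type interval that is symmetric on the log-odds scale, which may not match how your software computed it. So treat the p-value as approximate.

```python
import math

def ci_excludes_one(ci_low, ci_high):
    """An odds ratio of 1 means no association, so significance requires excluding 1."""
    return not (ci_low <= 1.0 <= ci_high)

def wald_p_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Approximate two-sided p-value, assuming a 95% Wald interval on the log-odds scale."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area

# Values reported in the question: OR = 1.4, 95% CI 0.9 to 3.4
print(ci_excludes_one(0.9, 3.4))            # False: the interval contains 1
print(wald_p_from_ci(1.4, 0.9, 3.4) > 0.05) # True: not significant at the 0.05 level
```

The CI and the p-value agree here, as they should when both come from the same test: an interval that contains 1 goes with a p-value above the significance level.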

Riya Jain says

Hey thanks for the post !

I have a doubt

what happens if my X variable coeffcient is -0.647042012003429 and my significance level is 1.70654E-15

Jim Frost says

Hi Riya,

The negative coefficient indicates that for every one-unit increase in X, the mean of Y decreases by the absolute value of the coefficient (0.647042012003429).

Your p-value is displayed using scientific notation. You need to move the decimal point to the left 15 places, which produces a very, very small p-value. Your results are statistically significant, which means you can conclude that your coefficient is significantly different than zero (i.e., an effect exists in the population).
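If the scientific notation is hard to read, a couple of lines of Python will expand it. This is only a convenience sketch, and the 0.05 significance level is an assumed example.

```python
p_value = float("1.70654E-15")

# Expand the scientific notation into ordinary decimal form
print(f"{p_value:.20f}")

# Compare against a significance level (0.05 is an assumed example)
print(p_value < 0.05)
```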

Anurag Maheshwari says

Amazing post Jim!! Thank you for the detailed elaboration.

I just have one doubt that how can we choose one among many equal performing linear models on test dataset?

Example:

Consider two linear models:

L1: y = 39.76x + 32.648628

and

L2: y = 43.2x + 19.8

Given the fact that both the models perform equally well on the test data set, which

one should be prefer

and why?

Jim Frost says

Hi Anurag,

Both of your models have the same IV (X) and same DV (Y), yet they’re producing different estimates. What is going on there? If you fit the same model to the same dataset, you should get the same estimates. Or, are these estimates based on different datasets? Unfortunately, there’s crucial information missing from your question.

In general, it is possible to get conflicting information about which model is best. Read my post about how to choose the correct model for many tips!

Sania Gul says

hello,

i have few question if possible answers me.

1: what does it mean when the t-value is negative?

2: in mediation when the direct relationship is significant and after adding mediator the indirect relationship become insignificant what kind of mediation is this? zero , full or partial.

3: what is meant by zero mediation?

looking for ur kind response

thanks

Jim Frost says

Hi Sania,

You get a negative t-value when the regression coefficient is negative. The absolute value of the t-value determines whether the test is significant for the typical two-sided test. Usually you can just assess the p-value, which is based on the t-value.
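A quick sketch of why the sign of the t-value doesn’t matter for a two-sided test. The coefficient and standard error are made-up numbers, and the p-value uses a large-sample normal approximation instead of the exact t-distribution your software would use.

```python
import math

def t_value(coef, std_error):
    """The t-value is the coefficient divided by its standard error."""
    return coef / std_error

def two_sided_p_approx(t):
    """Large-sample normal approximation to the two-sided p-value (an assumption)."""
    return math.erfc(abs(t) / math.sqrt(2))

# Hypothetical estimates with the same magnitude but opposite signs
t_negative = t_value(-1.5, 0.5)
t_positive = t_value(1.5, 0.5)

print(t_negative, t_positive)
print(two_sided_p_approx(t_negative))
print(two_sided_p_approx(t_positive))  # identical, because only |t| is used
```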

A mediator (M) explains the underlying mechanism of the relationship between an independent variable (X) and dependent variable (Y). When you have a mediator, it means that X has a relationship with M, and M has a relationship with Y.

First, test to see if there is a significant relationship between X and M. If that relationship exists, you can then fit a model that includes both X and M as independent variables and use Y as the dependent variable.

In the model X + M –> Y, if the effect of X on Y completely disappears and M is statistically significant, M fully mediates X and Y. In other words, there is no direct relationship between X and Y at all. It all works through the mediator.

However, if the effect of X on Y still exists with M in the model but it is smaller than in the X –> Y model, and M has a significant relationship with Y, then M partially mediates X and Y. This condition indicates that some of the observed relationship between X and Y exists directly and some of it exists through mediation. In practice, partial mediation is more common than full mediation.

Zero mediation indicates that the relationship fully exists through the direct relationship between X and Y. Zero mediation exists when there is no relationship between X and M and/or no relationship between M and Y. For any mediation to exist, both the X/M and the M/Y relationships must be significant.

If I understand your scenario correctly, you’re saying that the relationship between X and Y is significant. Then, you add M to the model, which is not significant. That indicates there is no relationship between M and Y, which would be zero mediation.
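The decision rules above can be summarized in a short sketch. The function and its argument names are hypothetical, and it treats “the effect of X disappears” as “X is no longer significant,” which is a simplifying assumption. You fill in the four yes/no answers from your own model output.

```python
def classify_mediation(x_predicts_m, m_significant, x_significant, x_effect_smaller):
    """Classify mediation from the X -> M model and the X + M -> Y model.

    x_predicts_m:     is the X -> M relationship significant?
    m_significant:    is M significant in the X + M -> Y model?
    x_significant:    is X still significant in the X + M -> Y model?
    x_effect_smaller: is X's effect smaller than in the X -> Y model?
    """
    if not x_predicts_m or not m_significant:
        return "zero"      # one link of the X -> M -> Y chain is missing
    if not x_significant:
        return "full"      # X's direct effect disappears once M is in the model
    if x_effect_smaller:
        return "partial"   # X still matters, but less than in the X -> Y model
    return "unclear"       # both links exist, yet X's effect did not shrink

# The scenario in the question: M is not significant after adding it to the model
print(classify_mediation(x_predicts_m=True, m_significant=False,
                         x_significant=True, x_effect_smaller=False))
```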

I hope that helps!

Joe Stringer says

Hello Jim,

I appreciate your blogs and have shared them with numerous friends. Thank you for sharing your knowledge in a way so easily understandable! I am looking forward to purchasing your book.

Following a regression, an IV was found to be significant. When graphing the relationship however, the slope appears to be very close to 0. I am unsure how to interpret this. What would you recommend?

Thank you!

Jim Frost says

Hi Joe,

Thank you so much for your support. I really appreciate it!! 🙂

There are three possibilities that come to mind.

Those would be the first possibilities I look into!

Johann Bachelor says

Thank you!

Explanations are way better than most other sources!

Furb says

Hi Jim,

I have a problem. My logistic regression (ordinal data (sleep hours) on mental health (binary) appears to have U-Shaped relationship. This relationship is significant, however, colleagues tell me that the linear relationship is untrustworthy and I should use Curvilinear?

Can i trust the results?

best

F

Jim Frost says

Hi Furb,

If you’re using a linear relationship to model a curved relationship, then you can’t trust the results. I write about this problem and the need to fit curvature properly in my post about curve fitting. While I write about it in the context of a continuous dependent variable and you’re talking about a binary dependent variable, the ideas are the same.

You should check the residual plots for your model. If your model doesn’t fit the data, you’ll see it in those plots. So, check those plots out and they’ll help you answer your question about whether you can trust the results.

Best of luck with your study!

Michelle says

Hi Jim,

I’m running a regression with quite a small sample size due to data limitations, n=200. Co-efficients are very small, p-values very large, and r-squared very small. Ultimately, I’d like to conclude there is a very weak or almost non-existent relationship. Can I do so if my results are not significant?

Thanks,

Michelle

Jim Frost says

Hi Michelle,

The large p-values indicate that your sample data do not indicate there is a relationship between the independent variables and the dependent variables. The low R-squared also indicates that your model explains a small proportion of the variability in the DV around its mean. Both of those suggest weak or non-existent relationship. I’d also suggest that usually a sample size of 200 is not considered small. Although that depends on the complexity of the model and other issues such as the presence of multicollinearity.

There’s no general rule for determining the strength of the relationship by the size of the coefficients. You’ll need to compare the coefficient values to values that would be considered weak, medium, and strong. Those values will vary by subject area. Additionally, the size of the coefficients will depend on the measurement units. For example, suppose you have an IV that is weight and you measure it in both grams and kilos. If you fit one model with grams and the other with kilos, the coefficient for the model that uses kilos will be 1000 times greater than the coefficient for the model that uses grams. That doesn’t mean that weight in grams, with a coefficient 1/1000 the size, is less important. Both models indicate the same effect size but the units will affect the size of the coefficients. However, because your p-values are not significant, you can’t conclude that the coefficients are significantly different from zero anyway.
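To see the units effect directly, here is a minimal sketch with made-up data and a hypothetical `slope` helper for simple regression. The same weights are expressed in kilos and in grams, and the two coefficients differ by exactly the conversion factor.

```python
def slope(x, y):
    """Simple-regression slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data: the same weights in two different units
weight_kg = [1, 2, 3, 4]
weight_g = [w * 1000 for w in weight_kg]
outcome = [3, 5, 7, 9]

coef_kg = slope(weight_kg, outcome)  # change in outcome per kilo
coef_g = slope(weight_g, outcome)    # change in outcome per gram

print(coef_kg, coef_g)
print(coef_kg / coef_g)  # the ratio is the unit conversion factor
```

Both models describe the same relationship and produce the same predictions. Only the scale of the coefficient changes.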

I hope this helps!

Rosie says

Dear Jim,

Thank you very sincerely for your time and your kind explaination.

I like your book, and I introduced it to my friends, too.

Have a nice day!

Rosie says

Dear Jim,

Thank you sincerely for your time and your kind explaination.

I like your book.

Have a nice day!

Rosie says

Hello Jim,

Sorry that I have 2 more questions for you.

1) As far as I know, with sample size of few hundreds, it’s normal to have few outliers. However, when I tried removing outliers, I got 1 more predictor significant. Thus, could you please kindly advise me should I remove outliers in this case?

2) I got the Mahal. Distance’s maximum value equals 52.361 which is far higher than the critical value (11.07) of df=5 (as I have 5 predictor variables) taken from Chi-squared distribution table at 0.05 alpha level. This indicates there are outliers which may place undue influence on the model.

– Whether my above understanding is correct?

– I tried removing the outliers by running “Select cases” with condition of “MAH1<11.07" and run the regression again. But then I still see the Mahal. Distance's maximum value equals around 15. Although it is already much lower but it is still higher than the critical value of 11.07. So can I stop with this lower value of Mahal. Distance and go ahead with interpreting the regression results, or I still need to do something else regarding removing the outliners?

Thank you so much for your kind explaination so far. I really appreciate it.

Rosie

Jim Frost says

Hi Rosie,

When you have a sample of that size, it’s typical for outlier tests to find a few outliers. However, that doesn’t mean those values are actually outliers. If you use these tests, you should consider the values as candidates that you need to investigate. Don’t assume that just because a test identifies values as being outliers that they are actually outliers. You don’t want to automatically remove outliers based on statistical tests only. Additionally, rerunning outlier tests after removing outliers can be problematic in some cases. Instead, you’ll need to investigate each outlier candidate and determine whether you should remove them based on what you find out and subject area knowledge. If you do remove an outlier, you need to be able to explain why for each data point.

It’s not surprising that removing outliers made a predictor become significant. By removing unusual values you’re reducing the variability in your data, which tends to increase statistical power. However, that doesn’t indicate that removing the values is the correct approach. Again, you’ll need to make that determination on a case-by-case basis.

I’ve just recently written two posts about outliers that you’ll probably find helpful. These posts aren’t written from the regression point of view but the general approaches are still applicable. Read Five Ways to Identify Outliers and Determining Whether to Remove Outliers.

Additionally, outliers are more complicated in regression because there are a variety of ways that an observation can be unusual. I cover this in detail from the regression perspective specifically in my ebook, Regression Analysis: An Intuitive Guide. If you haven’t bought it already, you should consider getting it.

I hope this helps!

Rosie says

Dear Jim,

Thank you for your response.

A nice day to you.

Rosie says

Dear Jim,

I would like to consult you on the conflict results that Pearson correlation and Multiple Regression test produce.

For example, my hypothesis is:

H1: There is a positive relationship between subjective norms and purchase intention for eco-products.

If my Pearson correlation test shows that there is a positive relationship between these 2 variables, but my regression test shows that subjective norms and purchase intention are not significant (I have several indepdent variables in multiple regression analysis and “subjective norms” is one of them. In my regression test, “purchase intention” is outcome variable).

So is it correct if I made conclusion for my hypothesis H1 based on result of Pearson correlation test; and for multiple regression result, I just can say and discuss that “subjective norms” is not an effective predictor of “purchase intention”?

(As Pearson test and Regression test show conflict results so I wonder for only hypotheses, conslusion should be based on which test.)

Thank you so much.

Jim Frost says

Hi Rosie,

This discrepancy sounds like a form of omitted variable bias. You have to remember that these two analyses are testing different models. Pairwise correlation only assesses two variables at a time while your multiple regression model has at least two independent variables and the dependent variable. The regression model tells you the significance of each IV after accounting for the variance that the other IVs explain. When a model excludes an important variable, it potentially biases the relationships for the variables in the model. Hence, omitted variable bias. For more information, read my post about omitted variable bias. That post tells you more about it along with conditions under which it can occur.

In your case, the Pearson correlation is essentially a model with one IV and the DV whereas your multiple regression model contains multiple IVs. The difference is the number of IVs. While I can’t say whether either model is correct, I’d lean towards your multiple regression model because it controls for additional variables. Of course, you’ll have to be sure that the model and its results make theoretical sense and that the residual plots look good.
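Here is a deliberately contrived sketch of how the two analyses can disagree. The data are made up so that y depends only on x2, while x1 is merely correlated with x2. The `ols` helper is a hypothetical minimal solver, not a replacement for statistical software, and the example shows coefficients only, not significance tests.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical data: y is driven entirely by x2; x1 just tends to move with x2
x1 = [1, 1, 2, 3, 3, 4]
x2 = [1, 2, 3, 4, 5, 6]
y = [3 + 2 * v for v in x2]

# Model with x1 only (the pairwise view): x1 looks positively related to y
_, b1_alone = ols([[1, a] for a in x1], y)

# Model with both IVs: x1 contributes nothing once x2 is accounted for
_, b1_adjusted, b2_adjusted = ols([[1, a, b] for a, b in zip(x1, x2)], y)

print(b1_alone)      # clearly positive
print(b1_adjusted)   # essentially zero
print(b2_adjusted)   # carries the whole relationship
```

Real data will be noisier, but the pattern is the same: the pairwise relationship borrows strength from a variable the simple model leaves out.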

I hope that helps!

KB says

Hello Jim,

Really really helpful blog, still getting my head multiple regression statistics so nice to find someone who simplifies and is clear.

I have a question. I have an ANOVA F value of 0.06. Both my variables have negative Beta coefficents with first P=0.02 and the second P=0.07. I understand this means the variables relationship with the dependent is inverse, but is it normal to have a good F value and one variable to be deemed not statistically significant.

Grateful for any guidance

KB

Jim Frost says

Hi KB,

It sounds like you’re referring to the Overall F-test of Significance. Click that link to read a post I’ve written about it, where I discuss the type of situation you’re experiencing. Read that post and if you have more questions, don’t hesitate to post them there!

Rosie says

Dear Jim,

Thank you very sincerely for your quick response and clear explaination!

This is the most helpful site I’ve ever found!

Rosie says

Dear Jim,

Thank you so much for your post!

Could you please kindly help me with the following question:

The p-value of my ANOVA test is smaller than 0.05, revealing a statistical finding that there is a linear relationship between dependent variable and independent variables. However, the p-values of all independent variables in “Coefficients” table show that among five independent variables, only 2 have a statistically significant impact on the outcome variable. Is it possible? (Because I think that if ANOVA test shows a statistical finding that there is a linear relationship between dependent variable and independent variables, there also should have statistically significance for all independent variables)

(By the way, R-square I got = 0.316, showing that 31.6% of the variance in the dependent variable is explained by the independent variables. Is this % too low?)

With great thanks again!

Jim Frost says

Hi Rosie,

I’m assuming the p-value you’re referring to is for the F-test of overall significance. Click that link for a post I’ve written about that test specifically. In a nutshell, when that test is significant, it indicates that your model predicts the dependent variable significantly better than just using the mean of the dependent variable itself. In other words, your model explains the variability of the dependent variable around its mean better than a model with no independent variables. While your model has some explanatory power, that doesn’t guarantee that all of the independent variables in your model are individually significant. The test assesses the collective effect of all the independent variables. For example, if your overall F-test is significant and then you add another independent variable to the model that has no relationship with the dependent variable, your overall F-test is still likely to be significant.

So, yes, it’s quite possible to have a significant F-test for the entire model but have some independent variables that are not significant.
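As a rough illustration of why the overall test is a statement about the whole model, the F statistic can be written purely in terms of R-squared, the sample size, and the number of predictors. The sketch below uses the R-squared (0.316) and predictor count (5) from the question; the sample size of 100 is a made-up assumption, since the question does not give it, and the function name is only illustrative.

```python
def overall_f_statistic(r_squared, n_obs, n_predictors):
    """F statistic for the null hypothesis that all slope coefficients are zero."""
    df_model = n_predictors
    df_error = n_obs - n_predictors - 1
    explained_per_df = r_squared / df_model
    unexplained_per_df = (1.0 - r_squared) / df_error
    return explained_per_df / unexplained_per_df, df_model, df_error


# R-squared and the number of predictors come from the question above.
# n_obs=100 is a hypothetical sample size chosen only for illustration.
f_stat, df1, df2 = overall_f_statistic(0.316, n_obs=100, n_predictors=5)
print(f"F({df1}, {df2}) = {f_stat:.2f}")
```

Nothing in this formula looks at individual coefficients, so a large F says only that the predictors are useful as a group.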

As for the R-squared, I’ve written several posts for that. You should read one about how high does R-squared need to be. You’ll find it varies depending on your subject area and the purpose of your model. Also read my post about low R-squared values and how they can provide important information.

Best of luck with your analysis!

Nadal Merquez says

Hi Jim Frost,

Thanks for your post and the amazing books! They have been really helpful. However, I’d like to ask you two pressing questions regarding the use of p-values.

1. I do not see how the assumption of normality of the error term is needed in order to make use of p-values. For the derivation of the asymptotic normality of the estimators, the normality of the error term is not needed. Could you elaborate why the normality of the error term is needed in order to make use of the p-value?

2. I noticed from computing robust regression methods in R that the p-value is usually not given. Do you know what complicates the derivation of the p-value in the case of robust regression models? How would one know if coefficients are significant in the case of robust regressions?

I’d love to hear from you!

Nadia

Jim Frost says

Hi Nadia,

I’m glad my posts and my books have been helpful! I really appreciate you supporting my books! On to your questions!

1. The distribution of the error term is intrinsically tied to the sampling distribution of the coefficient estimates. One of the properties of the normal distribution is that any linear function of normally distributed variables is itself normally distributed. Given this property, it’s not difficult to prove mathematically that the assumption of the normality of the error terms implies that the sampling distributions of the coefficient estimates are also normal. If the error distribution is nonnormal, that guarantee no longer holds for the sampling distributions. In that case, the hypothesis tests based on them are not exactly valid, particularly in small samples.

2. Unfortunately, I don’t have much experience using robust regression. As I understand it, robust regression first performs OLS, analyzes the residuals, and then reweights the observations based on the residuals. The fact that the residuals are random means that the weights themselves are random. Weighted regression assumes that the weights are fixed. Hence, the problem. I gather there is a procedure to work around that to produce hypothesis tests and CIs. However, there are criticisms that the procedure or analysts need to specify a scaling factor and tuning constant, which can cause large changes in the results. That’s the extent of my knowledge on that!

I hope that helps!

Robiul says

I have got my R square .997 and adjusted R squared is .995 is that bad /or how can i reduce the value ?

Jim Frost says

Hi Robiul,

There’s no general rule whether that’s good or bad. You’ll need to use subject-area knowledge as well as knowledge about your model fitting process to make that determination. It could be good if your study area has low noise measurements and it involves something that is inherently very predictable (such as modeling physical laws). But, it could represent something like overfitting your model, which indicates that the R-squared is too high and your coefficients are likely invalid.

I’ve written a post about why your R-squared might be too high. That post will help you answer this question.

nah says

Thank you Jim

you help us a lot.

Juston Shen says

Hi Jim,

Page 284 of the Regression Analysis book mentions effect size, statistical significance, and practical significance. Could you let us know the difference between statistical significance and practical significance? How many types of effect size are there in regression analysis?

Jim Frost says

Hi Juston,

First, thanks so much for supporting my ebook. I really appreciate that!

There are really two primary measures of effect size for regression coefficients. The first is the raw regression coefficient. The coefficient tells you how much the DV changes given a 1 unit increase in the IV. Of course, you have to be careful about determining causality. It might just be an association but not causation. I cover causation vs. correlation in detail in my new Introduction to Statistics ebook by the way.

Another way to look at it is the standardized coefficient, which I also write about in my regression ebook. The standardized effect size is better for comparing the magnitude of effect across different types of IVs. This measure tells you how much the DV changes given a 1 standard deviation change in the IV. Because it’s all on a common standardized scale, you can compare the coefficients.
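To make the two effect-size measures concrete, here is a minimal sketch for a simple (one-predictor) regression. The data and variable names are hypothetical. It computes the raw slope, rescales it by the ratio of standard deviations, and checks that this matches the slope obtained after z-scoring both variables (the fully standardized convention; some sources standardize only the IVs).

```python
import statistics


def slope(x, y):
    """Least-squares slope of y on x for a simple regression."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def z_scores(values):
    """Center on the mean and scale by the sample standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data: height in cm (IV) and weight in kg (DV).
height_cm = [150, 160, 165, 170, 175, 180, 190]
weight_kg = [52, 58, 63, 66, 72, 75, 85]

raw_b = slope(height_cm, weight_kg)  # kg per 1 cm
std_b = raw_b * statistics.stdev(height_cm) / statistics.stdev(weight_kg)
std_b_check = slope(z_scores(height_cm), z_scores(weight_kg))

print(f"raw coefficient:          {raw_b:.3f} kg per cm")
print(f"standardized coefficient: {std_b:.3f} SDs of DV per SD of IV")
```

The raw coefficient keeps the original units, which is usually easier to explain; the standardized one drops the units so that predictors measured on different scales can be compared.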

Finally, for the question about statistical and practical significance, let me point you to a blog post that I’ve written all about practical vs. statistical significance. In a nutshell, statistical significance is all about whether your sample provides enough evidence to conclude that the effect exists in the population. Practical significance is about whether the estimated size of that effect is large enough to be meaningful. That’s based on subject-area knowledge and can’t be computed mathematically. Anyway, read the post on it!

I hope this helps!

eric godson says

sir what if all the result shown in the T test shows negative sign or the significant is greater than 0.05

Jim Frost says

Hi Eric,

A negative t-value just means the coefficient is negative. If a negative coefficient is statistically significant, it indicates that as that independent variable increases, the mean of the dependent variable decreases.
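A small sketch may help show where the sign comes from: the t-value is the coefficient divided by its standard error, and a standard error is always positive, so the t-value inherits the sign of the coefficient. The data below are hypothetical and the calculation is for a simple regression with one predictor.

```python
import math

# Hypothetical data in which y tends to fall as x rises.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [9.8, 9.1, 8.7, 7.4, 7.9, 6.2, 5.8, 5.1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual variance uses n - 2 degrees of freedom (intercept and slope estimated).
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
residual_variance = sum(r ** 2 for r in residuals) / (n - 2)

se_slope = math.sqrt(residual_variance / sxx)
t_value = slope / se_slope

print(f"slope = {slope:.3f}, SE = {se_slope:.3f}, t = {t_value:.2f}")
```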

I’ve written a post about t-values. It’s written in the context of t-tests for when you’re assessing group means. However, the same principles apply to t-tests in regression analysis. I suggest you read the following post, and when I write about group means, just think about regression coefficients (which are a type of mean, a mean change in the DV). Read about t-values and t-distributions.

Omoleye Ojuri says

Hi Jim,

You are a great teacher Jim. The use of simple languages and expressions fascinated me to your website. Please just a quick one. My case is MRQAP model, do I have to plot residual plots to indicate the fit of the MRQAP model? And if yes, please what are the values to use to compute the residual plots (unstandardised coefficients, standardised coefficients etc). Or are p values and R square enough to indicate the fitness of the MRQAP model.

Omo

Jim Frost says

Hi Omo,

I have to apologize, but I don’t know MRQAP models well enough to provide an answer. I just looked in to them and they sound interesting. I will need to learn more!

Julie says

Hie Jim

I just stumbled on your postings and found them to be extremely useful. Kindly help me with something. If you found a variable to be statistically insignificant for your final panel regression model, can you explain the coefficient of the insignificant variable, or once the variable is insignificant is the coefficient sign not to be considered? I found one variable to be statistically insignificant but its coefficient sign supports previous studies.

Jim Frost says

Hi Julie,

There are several considerations here.

First, when the p-value is not significant, the coefficient is indistinguishable from zero statistically. In other words, your sample provides insufficient evidence to conclude that the sample effect exists in the population. In that light, you don’t consider the sign.

However, there’s another question about leaving an insignificant variable in your model. Often analysts will remove insignificant variables from the model. In your case, you have theoretical expectations that this particular variable is relevant and the sign is consistent with expectations. Removing this variable would potentially bias the other coefficients. Consequently, I’d leave the variable in the model even though it is not significant. While it’s not good to include too many insignificant variables in the model (reduces the precision), it can be worse to remove one relevant variable, even when not significant, because it can bias the model.
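The bias from dropping a relevant variable can be seen in a small simulation. This is only a sketch under made-up assumptions: the true coefficients (2 for `x1`, 3 for `x2`), the correlation between the two predictors, and the sample size are all hypothetical, and the effect is deliberately exaggerated so it is easy to see.

```python
import random

random.seed(1)
n = 2000

# x2 is correlated with x1; both truly affect y (true slopes: 2 and 3).
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.8 * a + random.gauss(0, 0.6) for a in x1]
y = [1 + 2 * a + 3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]


def centered_cross(u, v):
    """Sum of cross-products of u and v after centering each on its mean."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


s11, s22, s12 = centered_cross(x1, x1), centered_cross(x2, x2), centered_cross(x1, x2)
s1y, s2y = centered_cross(x1, y), centered_cross(x2, y)

# Model that omits x2: simple regression of y on x1.
b1_short = s1y / s11

# Model with both predictors: solve the two normal equations directly.
det = s11 * s22 - s12 ** 2
b1_full = (s22 * s1y - s12 * s2y) / det
b2_full = (s11 * s2y - s12 * s1y) / det

print(f"x1 coefficient without x2: {b1_short:.2f}")
print(f"x1 coefficient with x2:    {b1_full:.2f}")
```

With `x2` left out, the `x1` coefficient absorbs part of the `x2` effect and drifts far from its true value of 2, which is the kind of distortion described above.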

In the write up, I’d explain that you left the variable in the model because of theoretical expectations and not wanting to bias the model. However, your sample doesn’t provide additional support for the effect of this variable.

I talk about some of these issues in my post about choosing the correct regression model.

I hope this helps!

Angeles Dorantes says

Are the coefficients in the hierarchical beta regression interpreted in a similar way?

Jim Frost says

Hi Angeles,

Your question contains several terms, hierarchical and beta, that mean different things in different settings and software packages.

If you’re referring to hierarchical regression as the practice of entering independent variables in groups, such as a group of demographic variables followed by a group of variables you’re testing, yes, you interpret them the same. However, there is one caveat. If a group that is entered into the model later has a statistically significant IV, it’s possible that the earlier models without that significant variable have omitted variable bias.

Beta in SPSS refers to standardized coefficients. If that’s the case for your model, then you must use a different interpretation for these coefficients. Standardized coefficients represent the mean change in the DV given a one standard deviation change in the IV. I talk about why you might use standardized values in this post about identifying the most important variables in your model.

Jose Chvaicer says

Hi Jim, your articles have helped me understand a lot of previously unclear points. A question remains in mind however: I’ve been asked to force the intercept to pass through the zero point in spite of observed data giving a value for the “a” in Y = a + bx. What I noticed is that the residuals do change much for the modified model (Y = bx). So what is the gain? What consequences are expected? What happens to the p-value?

Thank you.

Jim Frost says

Hi Jose,

In most cases you should NOT force the regression line to go through the origin (y intercept equals zero). The fact that you’re observing changes in the residuals suggests that you should not do this. The best case scenario is that forcing the line to go through does not change the residuals.

If you don’t fit the constant in your model, it forces the constant to equal zero. For more information, read my post about the regression constant. In that post, I show why it’s almost always good to include the constant in your model. I would say there are no benefits for excluding it. Excluding it can bias your coefficients and produce misleading p-values (check those residual plots). Excluding it also changes the meaning of the R-squared value. It almost always increases R-squared but it completely changes the meaning of it. You cannot compare R-squared values between models with and without the constant.
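Here is a minimal sketch of why the two R-squared values are not comparable. It assumes the common convention in which a no-constant model measures total variation around zero (an "uncentered" total sum of squares) rather than around the mean of the DV; check your own software's documentation, because this is an assumption about how it reports R-squared. The data are hypothetical.

```python
# Hypothetical noisy data with a nonzero intercept.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [3.5, 2.1, 6.4, 4.0, 8.9, 5.2, 10.1, 7.6]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Model with a constant: y = a + b*x
b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
a = mean_y - b * mean_x
sse_with = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sst_centered = sum((yi - mean_y) ** 2 for yi in y)  # variation around the mean
r2_with = 1 - sse_with / sst_centered

# Model forced through the origin: y = b0*x
b0 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
sse_origin = sum((yi - b0 * xi) ** 2 for xi, yi in zip(x, y))
sst_uncentered = sum(yi * yi for yi in y)  # variation around zero
r2_origin = 1 - sse_origin / sst_uncentered

print(f"with constant:    SSE = {sse_with:.2f}, R-squared = {r2_with:.3f}")
print(f"through origin:   SSE = {sse_origin:.2f}, R-squared = {r2_origin:.3f}")
```

In this example the through-origin model fits worse (its residual sum of squares is larger), yet its reported R-squared is higher, because the baseline it is compared against has changed.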

Rashid says

Where to know if a regression coefficient is not significant at 5%, but at 10% or vice versa?

Hello Sir, I hope my question finds you,

In some articles, regression coefficients are mentioned to be significant at the 5% level and some other predictors significant at the 10% level. So, where to know if a regression coefficient is not significant at 5%, but at 10%?

Jim Frost says

Hi Rashid,

The significance level is something that the researchers decide before they start the analysis. There are advantages and disadvantages to using higher and lower significance levels. I’ve written about significance levels in the context of hypothesis testing. In summary:

Higher significance levels (e.g., 0.10) require weaker evidence to determine that an effect is significant. The tests are more sensitive–more likely to detect an effect when one truly exists. However, false positives are also more likely.

Lower significance levels (e.g., 0.01) require stronger evidence to determine that an effect is significant. The tests are less sensitive. They are less likely to detect an effect when one exists. On the good side, false positives are less likely to occur.

Analysts often use a significance level of 0.05 as a compromise between the pros and cons of higher and lower values.
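The trade-off summarized above can be checked with a quick simulation. This sketch swaps in a simple z-test on a sample mean (known standard deviation) instead of a regression t-test so that it needs only the standard library; the effect size, sample size, and number of simulated studies are arbitrary assumptions.

```python
import random
from statistics import NormalDist

random.seed(42)
std_normal = NormalDist()


def two_sided_p(sample, null_mean, sigma):
    """Two-sided z-test p-value for a sample mean with known sigma."""
    n = len(sample)
    z = (sum(sample) / n - null_mean) / (sigma / n ** 0.5)
    return 2 * (1 - std_normal.cdf(abs(z)))


def simulate_p_values(true_mean, n_studies=5000, n_obs=30):
    """P-values from many simulated studies testing the null mean = 0."""
    return [
        two_sided_p([random.gauss(true_mean, 1.0) for _ in range(n_obs)], 0.0, 1.0)
        for _ in range(n_studies)
    ]


null_p_values = simulate_p_values(true_mean=0.0)    # no effect exists
effect_p_values = simulate_p_values(true_mean=0.4)  # a real effect exists

alphas = (0.10, 0.05, 0.01)
false_positive_rate = {
    a: sum(p < a for p in null_p_values) / len(null_p_values) for a in alphas
}
detection_rate = {
    a: sum(p < a for p in effect_p_values) / len(effect_p_values) for a in alphas
}

for a in alphas:
    print(
        f"alpha = {a:.2f}: false positives = {false_positive_rate[a]:.3f}, "
        f"detections = {detection_rate[a]:.3f}"
    )
```

Whether a given coefficient is "significant at 10% but not at 5%" then comes down to comparing its p-value with each threshold: a p-value of 0.07, for example, falls below 0.10 but not below 0.05.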

You can read more in my posts about significance levels and p-values and errors in hypothesis testing.

Best of luck with your analysis!

Mahshameen Munawar John says

Sir, thank you so much for the prompt response. Yes, the first model is significant (P = .02). However, as you also mentioned there seems to be no increase in the predictive capacity when I add the IV (R square remains almost the same in both models) … is that a negative thing? Yes, the p value for the IV in the second model is significant.

Thankyou again for all your guidance.

Jim Frost says

Hi, you’re very welcome!

It sounds like your results disagree a bit. That happens because the F-test and t-test for the coefficients measure different things. The F-test measures the amount of variance your model accounts for. In this case, you’re seeing whether the 2nd model accounts for significantly more variance than the first model. The t-test for the coefficient assesses whether the coefficient is significantly different from zero (no effect).

While it might sound bad to say the 2nd model doesn’t account for significantly more variance than the first model, it’s actually good news overall for you. We know in the second model that your IV is statistically significant even when controlling for the demographic variables. The first model doesn’t include the IV even though we know it is significant. In other words, we know the first model is incomplete. In fact, the first model might have omitted variable bias because it does not include a significant IV.

Consequently, even though the second model doesn’t necessarily explain significantly more of the variance, it does include a significant IV and is, therefore, less likely to have biased coefficients. You should ask yourself, does the sign and magnitude of the IV coefficient match theoretical expectations and other research? If so, it looks like the IV is a good addition to the model. Of course, check your residual plots to be sure that you’re not violating any OLS assumptions.

Because you’re using regression analysis, you might consider buying my ebook about regression analysis, which includes far more information about it.

Mahshameen Munawar John says

Hello Sir, your posts have been a great help for me, thank you very much! I have been experiencing much confusion while interpreting the P values for Hierarchical Regression. i have one IV and DV , I controlled the demographics in the first step. The Sig. F Change value from the Model Summary output shows that it is not significant (P= .98) for the second model, where I introduced the IV. The same model is significant in ANOVA Table (F=2.15, P=.02). Could you please explain how to you interpret this result. Is the model valid and meaningful? I have searched but could not find an explanation or understand where the problem lies.

Your reply will mean a lot.

Jim Frost says

Hi Mahshameen,

Is the first model with the demographics significant?

If it is, then the results seem to indicate that both the first and second model are significant. However, adding the IV in the second model did not significantly improve the model. In other words, both models are significant but you can’t say that the second model is better.

However, I think the more crucial statistic to assess is the p-value for the IV in the second model. That statistic will tell you specifically whether that IV is significant while controlling for all the demographic variables. I think that’s what you really want to know.

Best of luck with your analysis!

Nancy Lohalo says

Hi Jim, thank you very much for this insightful post! I have encountered a few problems with the dependent variable Y in the linear regression model. The data collected showed a decreasing trend for the past 20 years, and my hypothesis stated that X1 will have a positive impact on Y. When I ran the regression test, almost all of the independent variables had negative coefficients. How can I interpret it? Thank you!

Jim Frost says

Hi Nancy,

It’s difficult for me to say much about your specific case because there’s so little information. It sounds like your hypothesis was that X1 would have a positive coefficient but your analysis produced a negative coefficient. I’m going to assume that X1 is negative and statistically significant. If it’s negative but not significant, it’s not distinguishable from zero and you can’t assume that it has a negative value in the population. Given those assumptions about the situation, there are two general possibilities.

1) Your hypothesis was incorrect. I have no way to know about that. But, it’s something you can investigate.

2) Your hypothesis is correct but your regression model has a problem that produces biased coefficients. This problem is causing the analysis to produce a negative coefficient when it should be a positive coefficient. There are a number of reasons why this can occur, including confounding variables, overfitting, data mining, and a misspecified model among other possibilities. Be sure to go through the OLS assumptions and see if your model violates any of them. It will probably take some effort to check these potential problems.

Because you’re performing a study with regression analysis, you might consider buying my ebook about regression analysis. In this ebook, I provide much more information all about regression analysis.

Best of luck with your study!

Karis says

Hi Jim, thank you so much for this post it’s helped a lot! I’m learning this stuff at uni and have come across a question which has completely confused me and wondered if you could help? The question asks to interpret the regression analysis result and its significance of these regression results:

R^2 = 0.74 (F = 16.82, p>0.01; t = 0.54, p<0.01).

However, the differing levels of confidence levels has thrown me? Does the fact that the F ratio is not within the confidence threshold mean that the regression model altogether is not statistically significant? Thank you!

Jim Frost says

Hi Karis,

So, the F-test and R-squared go together. These are measures of goodness-of-fit. I’m assuming that the F-value and its p-value are for the F-test of overall significance. That test indicates that your R-squared (0.74) is not significantly different from zero, assuming that alpha is 0.01. Your model is no better at predicting the DV than just using the mean. That’s kind of odd for a model with an R-squared as high as 0.74. There might be a very small sample size or some problem with the model. I can’t tell from these results. Read more about that in my post about the F-test of overall significance. Read my post to see how to interpret R-squared.

The t-value and its p-value are for a term in the model, such as an independent variable. That particular IV is statistically significant. This post details what that means. For this model, the overall significance and significance for a particular IV disagree. The post about the F-test of overall significance describes how this disagreement can happen.

Note that none of the statistics you provide relate to confidence, as in confidence levels or confidence intervals. However, there is a disagreement about statistical significance. Read the post about the F-test to understand that issue.

I do think it’s odd that R-squared is reasonably high but that the overall F-test is not significant. I suspect something odd is going on.

I hope this helps!

Klaudia Pająk says

The question is- when I make the analyse of regression, SPSS shows the results and COEF has some value… When I describe these results on paper- should I define the coeff value as a b or β?

Thank you in advance

Jim Frost says

Hi Klaudia,

You should be able to work this out from the information provided. Coefficients are estimates of population parameters. And, b is an estimate of a parameter. Therefore, b = coefficients in this context because they are both estimates. Conversely, Beta is a population parameter and not an estimate.

I hope this helps!

Klaudia says

Hello Jim,

I’d like to ask what does the “COEF” mean. Is it the same thing as b or β?

Klaudia 😉

Jim Frost says

Hi Klaudia,

COEF stands for coefficient. These are the values that the procedure estimates from your data. In a regression equation, these values multiply the independent variables.

Technically, β is the parameter value for the population. Your regression equation estimates these parameter values. In textbooks, these estimates are often denoted using beta-hats. That’s a β with a ^ on top. Some sources use a lower-case b to indicate that it’s an estimate. The key thing to note is that some forms (β) refer to the true population parameters while others (beta-hat and b) refer to the estimates of the parameters. The coefficients in your output are estimates of the parameters.

One caution, SPSS for some strange reason uses the term “beta” to refer to standardized coefficients!

I hope this helps!

Curt Miller says

Hi Jim,

Do we still use p-values in determining whether or not a predictor variable should remain in the model, even when we are building a model on full population data?

Thank you,

curt

Jim Frost says

Hi Curt,

When you’re working with data for an entire population, there is no need to use any p-values. P-values are an integral part of hypothesis tests that help you determine whether an apparent effect that exists in your sample also exists in the population. When you have the population data, all effects that you observe by definition do exist in the population. There’s no need to perform any hypothesis testing to confirm it because you’re looking at all the data for the population. This applies to regression analysis and other forms of hypothesis testing such as 2-sample t-tests, etc.!

Phil A. says

Hi Jim,

Quick question for a special type of regression… I have the following equation but I am not clear on the interpretation of the coefficient I obtain:

log($RealGDP) = B0 + B1(Junk-Bond Yield %) + e

My X1 data is in terms of percentage points (%) and my Y-variable (in log-scale) is in terms of dollars ($).

After I run my regression, my B1 coefficient = -0.005

As of now, I am interpreting the B1 coefficient as “A 1% increase in the Junk-Bond yield leads to a -0.5% decrease in Real GDP” – does this sound like the correct interpretation?

My main confusion is around the “1% increase in X” …. If the junk-bond spread is currently at 5%, do I interpret “a 1% change” as the junk-bond yield moving from 5% to 6%? Or do I interpret it as a 1% change of 5% (ex: 5% to 5.05%)?

Olga Pap says

Hello Jim.

Hello All.

I have one question. Specifically, when the dependent variable (e.g. earnings) is expressed on a logarithmic form (and not the independent variables) via mincer equation, does the interpretation of coefficients follow the below rules?

• For an increase of one-unit of the independent variable “X”, with coefficient b, then the change for dependent variable “Y” in logarithmic form should be e^b?

• And only for very small values of b (|b| < 0.1), and having in mind that e^b ≈ 1 + b, for an increase of one unit of the independent variable “X”, with coefficient b, the change for dependent variable “Y” should be equal to (100 × b)?

Thank you in advance.
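For readers following this question, the arithmetic for a model of the form log(Y) = b0 + b1·X can be sketched as follows. This is a general property of log-transformed dependent variables rather than a reply from the author: a one-unit increase in X multiplies Y by e^b, so the percent change is 100 × (e^b − 1), and 100 × b is only an approximation for small b. "One unit" means one unit of X as it is recorded in the data, so if X is stored in percentage points, one unit is one percentage point. The function name and the example coefficients are illustrative.

```python
import math


def percent_change_in_y(b, delta_x=1.0):
    """Exact percent change in Y for a change of delta_x in X when log(Y) is modeled."""
    return 100.0 * (math.exp(b * delta_x) - 1.0)


# Example coefficients: a very small one, a small one, and a large one.
for b in (-0.005, 0.05, 0.5):
    exact = percent_change_in_y(b)
    approx = 100.0 * b
    print(f"b = {b:+.3f}: exact = {exact:+.3f}%, approximation 100*b = {approx:+.1f}%")
```

For the smallest coefficient the two figures are nearly identical; for b = 0.5 the approximation understates the change considerably, which is why the exact form is safer for larger coefficients.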

Olga Pap says

Is it possible please to answer me on the above question?

Jessica says

Thank you! It does, though, when I’m looking at a scatterplot, I’ve seen an R value. This is not to be interpreted as the same thing as the correlation coefficient, r . . . correct? Even though the R value is not R-squared, it is still not the same as r . . . right?

Jim Frost says

Hi Jessica,

Ah, yes, I jumped straight to R-squared because that is used much more frequently. R is the coefficient of multiple correlation whereas R-squared is the coefficient of multiple determination. The use of the capital letter R for both of these statistics indicates that they are sample estimates. I’ve described R-squared so onto R!

The calculation for R is (unsurprisingly) just taking the positive square root of R-squared. R represents the correlation between a set of variables with another variable. In the regression context, this could be the correlation between your set of independent variables and the dependent variable. The interpretation of R is not intuitive. Hence, R-squared is used more frequently.

Lower case r is the correlation between two variables and it is commonly used. R involves more than two variables.

I haven’t seen R used much at all. Perhaps it is in some specialized context. But, you probably don’t need to worry about R.

Jessica says

Hi, I know this may seem to be a very simple question, but is there a difference between R and r? Do they stand for the same thing in regression analysis?

Jim Frost says

Hi Jessica,

Yes, r and R-squared are related as they both measure the strength of relationships between variables. r is a correlation coefficient that ranges from -1 to +1. It measures the strength of the linear relationship between two continuous variables. R-squared measures the strength of the relationship between a set of independent variables and the dependent variable. It’s a percentage that ranges from 0 – 100%.

Suppose you have a pair of variables, say X and Y, and the correlation coefficient (r) is 0.7. If you perform a simple regression using these two variables, you will obtain an R-squared of 0.49 (49%). We know this because 0.7^2 = 0.49. However, unlike correlation coefficients (r), you can use R-squared when you have more than two variables.
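The relationship can be verified numerically. The sketch below uses hypothetical data, computes Pearson's r directly, then fits a simple regression and computes R-squared from the residuals; the two routes should agree up to floating-point error.

```python
import math

# Hypothetical paired observations.
x = [2, 4, 5, 7, 8, 10, 11, 13]
y = [3.1, 4.8, 4.2, 7.5, 6.1, 9.0, 8.2, 11.4]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

# Pearson correlation coefficient between the two variables.
r = sxy / math.sqrt(sxx * syy)

# Simple regression of y on x, then R-squared = 1 - SSE / SST.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
r_squared = 1 - sse / syy

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, R-squared = {r_squared:.4f}")
```

This identity holds only for simple regression with one predictor and a constant; with several predictors, R-squared is instead the square of the multiple correlation R.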

I write about that aspect in my post about correlation. You can also read more about R-squared.

I hope this helps!

Olga Pap says

Hi Jim. I would be very grateful if you could help me. Specifically, when the dependent variable (e.g. earnings) is expressed on a logarithmic form (and not the independent variables) via mincer equation, does the interpretation of coefficients follow the below rules?

• For an increase of one-unit of the independent variable “X”, with coefficient b, then the change for dependent variable “Y” in logarithmic form should be e^b?

• And only for very small values of b (|b| < 0.1), and having in mind that e^b ≈ 1 + b, for an increase of one unit of the independent variable “X”, with coefficient b, the change for dependent variable “Y” should be equal to (100 × b)?

Thank you in advance.

Tesfakiros Semere says

What a clear, simple, and easy to understand. You saved my time from reading lots of books. It is really helpful.

Would it be possible to get them all in Pdf just to print and read when I am out of network

THANK YOU SO MUCH Kim.

Jim Frost says

Hi Kim, thanks so much for your kind words! They made my day! While I don’t have PDFs of the blog posts, in several weeks I’ll be releasing an ebook all about regression analysis. If you like the simple and easy to understand approach in my blog posts, you’ll love this book. It should be out in early March 2019!

Digambar salunkhe says

Thank you so much for sharing this blog…It’s really helpful and easy to understand the concept of whole regression model.

Adu Emmanuel Ifedayo says

Thank you.

Neven says

Hi Jim! Great blog, very clear and very helpful. The best I have found in this field! Thanks.

Qmars Safikhani says

Hi Jim,

Thanks a lot for sharing your knowledge through this article. I found it very interesting as you explained somehow difficult concepts in an easy way. Well done

Hans says

Hey Jim,

Great Blog! You helped us a lot preparing for our studies at university. We have a question regarding the p-value… Is there an explanation for a p-value being exactly 1.0? Does it mean that there is a 100 percent chance that the independent variable has no effect on the dependent one? Or is there anything else to consider? Thanks a lot for your help and keep that great work going!

Jim Frost says

Hi Hans, thank you so much! It’s great to hear that it’s been helpful for you all. That makes my day!

Yes, you can obtain a p-value of 1.0. To get exactly 1.0, your sample statistic would have to exactly equal the null hypothesis. For example, if you perform a 1-sample t-test and your null hypothesis is that the population mean equals 10. If your sample statistic is exactly 10, you obtain a p-value of exactly 1.0. In regression analysis, typically the null for a coefficient is that it equals zero. So, if the estimated coefficient equals zero exactly, you’d again get a p-value of 1.0.

The interpretation of a p-value in general is the probability of obtaining the observed sample statistic or more extreme if you assume the null hypothesis is true. The reason p = 1.0 when the sample statistic equals the null hypothesis value makes sense when you think about it with that interpretation in mind. When the sample stat equals the null value, there is a 100% probability that a sample statistic will equal the null value or be more extreme! That’s true by definition because that case covers the entire range of the sampling distribution (i.e., you’d shade the entire area beneath the sampling distribution curve).
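A short calculation illustrates the idea. To stay within the standard library, this sketch uses a normal (z) approximation in place of the t-distribution, so the numbers are illustrative rather than what a t-test would report exactly; the null value of 10 follows the example above, and the standard error of 0.8 is a made-up assumption.

```python
from statistics import NormalDist


def two_sided_p_value(estimate, null_value, standard_error):
    """Two-sided p-value using a normal approximation to the sampling distribution."""
    z = (estimate - null_value) / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Sample statistic exactly equal to the null value: both tails cover everything.
p_equal = two_sided_p_value(10.0, 10.0, 0.8)

# Sample statistic very close to, but not equal to, the null value.
p_near = two_sided_p_value(10.0004, 10.0, 0.8)

print(f"estimate equals null: p = {p_equal}")
print(f"estimate near null:   p = {p_near:.6f}, shown to 3 decimals as {p_near:.3f}")
```

The second case is the more realistic one: the p-value is slightly below 1 but displays as 1.000 once it is rounded to three decimals.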

To see these sampling distributions in action for a hypothesis test, read my post about p-values and significance levels.

Of course, the probability of obtaining a sample statistic that exactly equals your null hypothesis is minuscule. When using statistical software in the field, if you see a p-value = 1, it’s more likely due to rounding.

Paul says

Hi. I want to find out if simple or multiple regressions can be used to explain effects (as in experimental studies)?

Thank you.

Jim Frost says

Hi Paul,

You bet they can! The coefficients describe the effects and the p-values determine whether the effects are statistically significant.

Rashan says

This is very helpful. Thank you

Surya says

Thanks Jim

Surya says

Hi Jim, I have just subscribed to your posts after reading the wonderful post on residual plots.

Could you please let me know how do we interpret the SE of coefficients , T statistic as well.. Or do you already have an article on them… Please reply.. Thanks..

Jim Frost says

Hi Surya,

Thanks so much! I’m glad that post was helpful!

The standard error (SE) of the coefficient measures the precision of the coefficient estimate. Smaller values represent more precise estimates. Standard errors are the standard deviations of sampling distributions. If you were to perform your study many times, drawing the same sample size, and fitting the same model, you’d obtain a distribution of coefficient estimates. That’s the sampling distribution of a coefficient estimate. The standard error of a coefficient is the standard deviation of that sampling distribution. The SE is used to create confidence intervals for the coefficient estimate, which I find more intuitive to interpret.
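The "perform your study many times" idea can be simulated directly. The sketch below repeatedly generates data from a known line, refits the slope each time, and compares the standard deviation of those slopes with the textbook formula for the standard error of a simple-regression slope, sigma / sqrt(Sxx). The true intercept, slope, error standard deviation, and number of repetitions are arbitrary assumptions.

```python
import random
import statistics

random.seed(7)

# Fixed design: the same 20 x-values are reused in every simulated study.
x = list(range(1, 21))
mean_x = statistics.mean(x)
sxx = sum((xi - mean_x) ** 2 for xi in x)

true_intercept, true_slope, sigma = 5.0, 2.0, 3.0  # hypothetical population values


def fitted_slope(y):
    """Least-squares slope of y on the fixed x-values."""
    mean_y = statistics.mean(y)
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx


# Repeat the "study" many times and collect the slope estimate from each.
slopes = []
for _ in range(3000):
    y = [true_intercept + true_slope * xi + random.gauss(0, sigma) for xi in x]
    slopes.append(fitted_slope(y))

empirical_se = statistics.stdev(slopes)  # SD of the sampling distribution
theoretical_se = sigma / sxx ** 0.5      # formula-based standard error

print(f"mean of slope estimates: {statistics.mean(slopes):.3f}")
print(f"empirical SE:   {empirical_se:.4f}")
print(f"theoretical SE: {theoretical_se:.4f}")
```

In a real analysis you have only one sample, so software estimates sigma from the residuals; the simulation just shows what quantity that estimate is aiming at.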

The t-statistic in the context of regression analysis is the test statistic that the analysis uses to calculate the p-value. I’ve written a post about how t-tests work. It’s fairly similar for coefficient estimates. Read that post but replace sample mean with coefficient estimate and you’ll get a good idea.

I hope that helps!

[email protected] says

been reading your posts all night, (morning now).. I can’t stop because it’s like a light bulb keeps going off. Been studying this stuff for weeks, now I finally get it thanks to your post. Thank you:)

-Extremely tired data science grad student.

Jim Frost says

Hi, I’m sorry my posts caused you to lose some sleep last night, but I love your analogy about light bulbs going off! I’m really happy to hear that they were helpful. That really makes my day! Best of luck with your studies!

Tracey says

Hi Jim. Thank you so much for this as it helped clear up some things in my mind as I prepare a research paper.

Jim Frost says

Hi Tracey, you’re very welcome. I am happy to hear that it was helpful!

Qiumei Jing says

Thank you for your explanation, Jim. That’s really great!

I have a question about multiple linear regression. My model has three independent variables (A, B, C) and one dependent variable (D). The ANOVA table shows a significant p-value, but in the coefficients table the constant’s p-value is 0.237, which is not significant, and one predictor (variable A) has a p-value of 0.211. The other two predictors are significant (P = 0.000). How can I interpret these results? The hypotheses for the two significant predictors (variables B and C) are “there is a relationship between B and D” and “there is a relationship between C and D.” Can I say these two hypotheses were supported? And how should I interpret variable A, which has an insignificant p-value in the coefficients table? Thank you in advance!

Jim Frost says

Hi Qiumei,

It’s generally not worthwhile interpreting the constant, so I’d skip that. To learn why, click the link for interpreting the constant in this post.

Here’s how you can interpret the significant predictors.

The sample provides sufficient evidence to conclude that changes in both independent variables B and C are correlated with changes in the dependent variable D. Statistical significance indicates that the correlation does not equal zero. In other words, you can reject the null hypothesis that the coefficients equal zero.

For the insignificant variable A, the sample provides insufficient evidence to conclude that there is a relationship between it and the dependent variable. In other words, you fail to reject the null hypothesis that its coefficient equals zero.
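The decision rule can be sketched in a few lines using the p-values reported in the question. The significance level of 0.05 is an assumption (the question doesn't state one), and a displayed value of 0.000 really means "less than 0.0005."

```python
# Assumed significance level; the question does not state one.
alpha = 0.05

# P-values as reported in the question for predictors A, B, and C.
# A displayed "0.000" means the p-value is below 0.0005, not exactly zero.
p_values = {"A": 0.211, "B": 0.000, "C": 0.000}

decisions = {}
for name, p in p_values.items():
    if p < alpha:
        decisions[name] = "reject H0: evidence of a relationship with D"
    else:
        decisions[name] = "fail to reject H0: insufficient evidence"

for name, decision in decisions.items():
    print(f"Variable {name} (p = {p_values[name]:.3f}): {decision}")
```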

For more elaboration, reread this post where I talk about this in depth.

Appadu says

Dear Jim

Thank you for your explanations on how to interpret regression coefficients for linear relationships and p-values. It is very clear, and I appreciate your time in putting this together.

I have one question. I was looking at an example of estimated standardised OLS beta coefficient data. The results show R-squared (%) as 26.2 and an F-value of 18.14. Please advise how to interpret these two figures. Thank you

Jim Frost says

Hi Appadu,

When you standardize the continuous independent variables in your model, the output produces standardized coefficients. Standardization is when you take the original data for each variable, subtract the variable’s mean from each observation and divide by the variable’s standard deviation. The main reason I’m aware of for performing this standardization is to reduce the multicollinearity caused by including polynomials and interaction terms in your model. I write about that in my post about multicollinearity.

In terms of interpreting the standardized coefficient, it represents the mean change in the dependent variable given a one standard deviation change in the independent variable. Another reason statisticians use it is as a possible measure for identifying which variable is the most important.
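Here is a small sketch of that interpretation with made-up height and weight data (the values and the `slope` helper are purely illustrative). It standardizes the independent variable only, as described above, and shows that the resulting coefficient equals the raw coefficient multiplied by the variable's standard deviation.

```python
import statistics

def slope(x, y):
    """Least-squares slope of y on x for a simple regression."""
    x_bar = statistics.fmean(x)
    y_bar = statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx

# Hypothetical data, for illustration only.
heights_cm = [150, 160, 165, 170, 180, 185]
weights_kg = [52, 58, 63, 66, 75, 80]

mean_height = statistics.fmean(heights_cm)
sd_height = statistics.stdev(heights_cm)

# Standardize: subtract the mean, then divide by the standard deviation.
heights_std = [(h - mean_height) / sd_height for h in heights_cm]

raw_coef = slope(heights_cm, weights_kg)    # kg per 1 cm of height
std_coef = slope(heights_std, weights_kg)   # kg per 1 SD of height

print(f"Raw coefficient:          {raw_coef:.3f} kg per cm")
print(f"Standardized coefficient: {std_coef:.3f} kg per SD of height")
print(f"Raw coefficient x SD:     {raw_coef * sd_height:.3f}")
```

The last two printed values match, which is why the standardized coefficient reads as the mean change in the dependent variable per one standard deviation change in the predictor.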

As for interpreting R-squared and the F-test of overall significance, those don’t change from the usual interpretations. Click on the links to read my blog post about interpreting each statistic.

I hope this helps!

Hrishikesh Geed says

Thanks for the explanation, Jim!

I have one doubt, how do you calculate the p-value corresponding to each coefficient?

How do you decide the standard deviation and the sample mean for calculating the z-value for each coefficient?

Thanks

Hrishi

eric says

Thank you very much for the explanation Jim!

If the p-value is under the significance level, this would indicate that there is enough evidence to reject the null hypothesis. The null hypothesis here is that there is no correlation between two variables (in a simple linear regression).

Here is my first question: how do we decide how to set the significance level? Is it purely arbitrary?

My second question is: since the correlation coefficient varies between -1 and 1, it is tempting to conclude that there is a significant correlation (positive or negative) between two variables if the correlation coefficient is close to -1 or 1, and that there is no correlation when it is close to 0. However, I think this assumption is false, but I can’t get the intuition to understand why.

Could you help me about those questions?

Many thanks for your time and your attention

Best regards

Eric

Hanan Shteingart says

The following claim is not true if the features are correlated, which is known as multicollinearity: “The sign of a regression coefficient tells you whether there is a positive or negative correlation between each independent variable and the dependent variable”. In fact, a feature could have a positive correlation with the target yet a negative coefficient, and vice versa.

Jim Frost says

Hi Hanan,

You raise a good point. The interpretation that I present, including the portion that you quote, is accurate when your model doesn’t contain a severe problem. However, if your model does contain a severe problem, it can produce unreliable results, which includes the possibility that the coefficients don’t accurately describe the relationship between the independent variables and the dependent variable. The problem isn’t with how to interpret coefficients, but rather with a condition in the model that causes it to produce coefficients that you can’t trust.

As you point out, multicollinearity can produce unreliable, erratic coefficients. In some cases, the sign of the coefficient can even be incorrect. However, the sign switch doesn’t necessarily have to happen when your model has multicollinearity. I write more about multicollinearity, including switched signs, in this post: Multicollinearity in Regression Analysis: Problems, Detection, and Solutions.
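To illustrate the sign question with correlated predictors, here is a simulated sketch. All of it is assumed for illustration: the data-generating model (true coefficients of −1 for x1 and +2 for x2), the strength of the correlation between x1 and x2, and the helper name `cross_sum`. In this setup, x1 has a positive simple correlation with y even though its coefficient in the two-predictor model is negative.

```python
import random

def cross_sum(u, v):
    """Sum of cross-products of deviations from the means of u and v."""
    u_bar = sum(u) / len(u)
    v_bar = sum(v) / len(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 200

# x2 is built from x1, so the two predictors are strongly correlated.
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + rng.gauss(0, 0.5) for a in x1]
# Assumed true model: y = -1*x1 + 2*x2 + error.
y = [-1.0 * a + 2.0 * b + rng.gauss(0, 0.5) for a, b in zip(x1, x2)]

s11, s22, s12 = cross_sum(x1, x1), cross_sum(x2, x2), cross_sum(x1, x2)
s1y, s2y, syy = cross_sum(x1, y), cross_sum(x2, y), cross_sum(y, y)

# Simple (pairwise) correlation between x1 and y.
corr_x1_y = s1y / (s11 * syy) ** 0.5

# Least-squares coefficients for the two-predictor model (normal equations).
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

print(f"Correlation of x1 with y: {corr_x1_y:.2f}")
print(f"Coefficient of x1:        {b1:.2f}")
print(f"Coefficient of x2:        {b2:.2f}")
```

Here the negative coefficient for x1 is the correct one for the assumed model, because it describes x1's relationship with y while holding x2 constant. That differs from the case where multicollinearity is severe enough to make the estimated sign itself untrustworthy.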

By the way, there are a number of other potential problems that can cause your model to produce results that you can’t trust. Multicollinearity is just scratching the surface. These problems include an incorrectly specified model, overfitting the model, heteroscedasticity, and data mining, among others. I spend quite a bit of time talking about these problems, how they can invalidate your results, and what you can do to address them.

I hope this helps!

MN says

Thank you very much for the wonderful elaboration. Amazing!!

Jim Frost says

You’re very welcome, MN! I’m glad it’s helpful!

Rajasekar says

I am currently working on a multiple regression model where I have 4 x variables, and none of them are statistically significant. I know that when this happens I cannot reject the null hypothesis, but I’d like to know what might be wrong. Do I need to add some more x variables in this case? Also, the R Square = 0.109842937

Adjusted R Square =0.034084889

Ayush says

This is really one of the best websites I have come across for DATA SCIENCE… Great effort put up by Sir Jim…

Jim Frost says

Thank you, Ayush!

Rali says

Hi Mr. Jim

Thanks for the helpful blog

all the best

Jim Frost says

Hi Rali, you’re very welcome! I’m glad it was helpful!

ADIL HUSSAIN RESHI says

Really fabulous. It cleared all my doubts about p-values.

Jim Frost says

Hi Adil, Thanks! I’m so glad to hear that it was helpful!

Javed Iqbal says

Thanks, Jim, for the nice explanation. This regression seems to violate one of the model assumptions, namely homoskedasticity. A log transformation should work here.

Jim Frost says

Hi Javed, thanks for your comment. The residuals for this model are homoscedastic, or very close to it. Their variance is fairly equal across the entire range. The variance might appear to be lower in the very low end of the range, but there are also fewer observations in that region, which can make the dispersion appear to be smaller. At any rate, it is close enough. To see how a true case of heteroscedasticity appears, along with multiple methods for correcting it, read my post about heteroscedasticity. By the way, I explain in that post why I always recommend trying other methods of addressing this problem before using a transformation.

Toby says

Great blog with detailed explanation! It helps clear my doubts for p-value.

Thank you Jim! and Happy new year! 😀

Jim Frost says

Thank you, Toby! And, I’m very happy you found the blog to be helpful! Happy new year to you too!!