Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.

As a statistician, I should probably tell you that I love all statistical analyses equally, like parents with their kids. But, shhh, I have a secret! Regression analysis is my favorite because it provides tremendous flexibility, which makes it useful in so many different circumstances. In fact, I’ve described regression analysis as taking correlation to the next level!

In this blog post, I explain the capabilities of regression analysis, the types of relationships it can assess, how it controls the variables, and generally why I love it! You’ll learn when you should consider using regression analysis.

## Use Regression to Analyze a Wide Variety of Relationships

Regression analysis can handle many things. For example, you can use regression analysis to do the following:

- Model multiple independent variables
- Include continuous and categorical variables
- Use polynomial terms to model curvature
- Assess interaction terms to determine whether the effect of one independent variable depends on the value of another variable
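To make these capabilities concrete, here is a minimal sketch of one model that uses all four at once. It assumes NumPy is available, and the variable names (`hours`, `group`) and the data are hypothetical and noise-free, chosen only so that the fitted coefficients can be checked against known values.

```python
import numpy as np

# Hypothetical, noise-free data made up for illustration only.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])   # continuous IV
group = np.array(["a", "b", "a", "b", "a", "b", "a", "b"])   # categorical IV

is_b = (group == "b").astype(float)   # indicator (dummy) variable for group "b"

# Columns: intercept, continuous term, categorical term,
# polynomial (squared) term, and interaction term.
X = np.column_stack([
    np.ones_like(hours),
    hours,
    is_b,
    hours ** 2,
    hours * is_b,
])

# Build the response from known coefficients so the fit can be checked.
true_coefs = np.array([10.0, 2.0, 5.0, -0.1, 0.5])
y = X @ true_coefs

# Ordinary least squares fit.
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coefs, 3))
```

Because the data contain no noise, the fit recovers the coefficients used to build `y`. With real data you would also get standard errors and p-values from your statistical software.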

These capabilities are all cool, but they don’t include an almost magical ability. Regression analysis can unscramble very intricate problems where the variables are entangled like spaghetti. For example, imagine you’re a researcher studying any of the following:

- Do socio-economic status and race affect educational achievement?
- Do education and IQ affect earnings?
- Do exercise habits and diet affect weight?
- Are drinking coffee and smoking cigarettes related to mortality risk?
- Does a particular exercise intervention have an impact on bone density that is a distinct effect from other physical activities?

More on the last two examples later!

All these research questions have entwined independent variables that can influence the dependent variables. How do you untangle a web of related variables? Which variables are statistically significant and what role does each one play? Regression comes to the rescue because you can use it for all of these scenarios!

## Use Regression Analysis to Control the Independent Variables

As I mentioned, regression analysis describes how the changes in each independent variable are related to changes in the dependent variable. Crucially, regression also statistically controls every variable in your model.

### What does controlling for a variable mean?

When you perform regression analysis, you need to isolate the role of each variable. For example, I participated in an exercise intervention study where our goal was to determine whether the intervention increased the subjects’ bone mineral density. We needed to isolate the role of the exercise intervention from everything else that can impact bone mineral density, which ranges from diet to other physical activity.

To accomplish this goal, you must minimize the effect of confounding variables. Regression analysis does this by estimating the effect that changing one independent variable has on the dependent variable while holding all the other independent variables constant. This process allows you to learn the role of each independent variable without worrying about the other variables in the model. Again, you want to isolate the effect of each variable.
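One way to see numerically what “holding the other variables constant” means is the Frisch-Waugh-Lovell result: the coefficient for an IV in a multiple regression equals the slope you get after removing the part of that IV and of the DV that the other IVs explain. The sketch below assumes NumPy and uses simulated data with hypothetical names (`diet`, `exercise`, `bone_density`); it is not the data from the study mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical simulated data: two correlated IVs and one DV.
diet = rng.normal(size=n)
exercise = 0.6 * diet + rng.normal(size=n)
bone_density = 1.5 * exercise + 0.8 * diet + rng.normal(size=n)

def ols(X, y):
    """Return least-squares coefficients for y regressed on the columns of X."""
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

ones = np.ones(n)

# Multiple regression: intercept, exercise, diet.
full = ols(np.column_stack([ones, exercise, diet]), bone_density)

# Partial out diet from both exercise and the DV, then regress the leftovers.
Z = np.column_stack([ones, diet])
exercise_resid = exercise - Z @ ols(Z, exercise)
density_resid = bone_density - Z @ ols(Z, bone_density)
partial_slope = ols(exercise_resid[:, None], density_resid)[0]

print(full[1], partial_slope)  # the two exercise estimates agree
```

The exercise coefficient from the full model matches the slope computed from the residualized variables, which is why simply including a variable in the model is enough to control for it.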

### How do you control the other variables in regression?

A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model! Let’s look at this in action with an example.

A recent study analyzed the effect of coffee consumption on mortality. The first results indicated that higher coffee intake is related to a higher risk of death. However, coffee drinkers frequently smoke, and the researchers did not include smoking in their initial model. After they included smoking in the model, the regression results indicated that coffee intake lowers the risk of mortality while smoking increases it. This model isolates the role of each variable while holding the other variable constant. You can assess the effect of coffee intake while controlling for smoking. Conveniently, you’re also controlling for coffee intake when looking at the effect of smoking.
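The sign flip in this example can be reproduced with simulated data. The sketch below assumes NumPy, and every number in it is made up: it is not the study’s data, and the “true” effects (-0.5 for coffee, +2.0 for smoking) are assumptions chosen only to mimic the pattern described above.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Simulated, made-up data: smokers tend to drink more coffee.
smoking = rng.normal(size=n)
coffee = 0.8 * smoking + 0.6 * rng.normal(size=n)

# Assumed "true" effects for the simulation: coffee lowers risk, smoking raises it.
risk = -0.5 * coffee + 2.0 * smoking + rng.normal(size=n)

def fit(y, *ivs):
    """OLS with an intercept; returns coefficients for the IVs only."""
    X = np.column_stack([np.ones(len(y)), *ivs])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs[1:]

coffee_only = fit(risk, coffee)                 # smoking omitted
coffee_adj, smoking_adj = fit(risk, coffee, smoking)

print(f"coffee, smoking omitted:  {coffee_only[0]:+.2f}")
print(f"coffee, smoking included: {coffee_adj:+.2f}")
print(f"smoking:                  {smoking_adj:+.2f}")
```

With smoking left out, the coffee coefficient absorbs part of smoking’s effect and comes out positive. Once smoking is in the model, the coffee coefficient turns negative, in line with the effect built into the simulation.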

Note that the study also illustrates how excluding a relevant variable can produce misleading results. Omitting an important variable causes it to be uncontrolled, and it can bias the results for the variables that you do include in the model. This warning is particularly applicable for observational studies where the effects of omitted variables might be unbalanced. On the other hand, the randomization process in a true experiment tends to distribute the effects of these variables equally, which lessens omitted variable bias.
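The contrast between observational studies and randomized experiments can also be sketched with a simulation. It assumes NumPy, the data are made up, and `unmeasured` is a hypothetical important variable that never makes it into the model. The only difference between the two scenarios is how treatment is assigned.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
true_effect = 1.0

def slope(x, y):
    """Slope from a simple regression of y on x (with intercept)."""
    X = np.column_stack([np.ones(len(x)), x])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs[1]

# An important variable that the analyst never measures.
unmeasured = rng.normal(size=n)
noise = rng.normal(size=n)

# Observational study: subjects with high values of the unmeasured
# variable are more likely to end up in the treated group.
treated_obs = (unmeasured + rng.normal(size=n) > 0).astype(float)
y_obs = true_effect * treated_obs + 2.0 * unmeasured + noise

# Randomized experiment: a coin flip decides treatment, so treatment
# is unrelated to the unmeasured variable on average.
treated_rand = rng.integers(0, 2, size=n).astype(float)
y_rand = true_effect * treated_rand + 2.0 * unmeasured + noise

bias_obs = slope(treated_obs, y_obs) - true_effect
bias_rand = slope(treated_rand, y_rand) - true_effect
print(f"bias, observational: {bias_obs:+.2f}")
print(f"bias, randomized:    {bias_rand:+.2f}")
```

In the observational scenario the omitted variable inflates the estimated treatment effect. Under random assignment the same omission leaves the estimate close to the true value, although it is still less precise than it would be with the variable included.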

**Related post**: Confounding Variables and Omitted Variable Bias

## How to Interpret Regression Output

To answer questions using regression analysis, you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values. When you have a low p-value (typically < 0.05), the independent variable is statistically significant. The coefficients represent the average change in the dependent variable given a one-unit change in the independent variable (IV) while controlling the other IVs.

For instance, if your dependent variable is income and your IVs include IQ and education (among other relevant variables), you might see output like this:

The low p-values indicate that both education and IQ are statistically significant. The coefficient for IQ indicates that each additional IQ point increases your income by an average of approximately $4.80 while controlling everything else in the model. Furthermore, an additional unit of education increases average earnings by $24.22 while holding the other variables constant.
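For readers who want to see where coefficients and p-values come from, here is a rough sketch that assumes NumPy. The data are simulated and made up, not the data behind the example: the simulation borrows the two coefficient values quoted above as its “true” values, while the intercept, noise level, and units of education are my own assumptions. The p-values use a normal approximation, which is reasonable only because the simulated sample is large. Statistical software uses the t-distribution instead.

```python
import math
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Simulated, made-up data. The "true" coefficients borrow the values quoted above.
iq = rng.normal(100, 15, size=n)
education = rng.integers(8, 21, size=n).astype(float)   # hypothetical units
income = 50.0 + 4.80 * iq + 24.22 * education + rng.normal(0, 40, size=n)

X = np.column_stack([np.ones(n), iq, education])
coefs, *_ = np.linalg.lstsq(X, income, rcond=None)

# Standard errors from the residual variance and (X'X)^-1.
residuals = income - X @ coefs
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

# t-statistics and two-sided p-values. The normal approximation is an
# assumption that is reasonable here only because the sample is large.
t_stats = coefs / std_err
p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

for name, b, se, p in zip(["Constant", "IQ", "Education"], coefs, std_err, p_values):
    print(f"{name:<10} coef={b:8.2f}  se={se:6.2f}  p={p:.3f}")

# Prediction from the fitted equation for one hypothetical person.
predicted = coefs @ np.array([1.0, 110.0, 16.0])
print(f"predicted income: {predicted:.2f}")
```

Each coefficient is the estimated average change in the DV for a one-unit change in that IV with the other IV held constant, and the last lines show how the same equation produces a prediction.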

Regression analysis is a form of inferential statistics. The p-values help determine whether the relationships that you observe in your sample also exist in the larger population. I’ve written an entire blog post about how to interpret regression coefficients and their p-values, which I highly recommend.

## Obtaining Trustworthy Regression Results

With the vast power of using regression comes great responsibility. Sorry, but that’s the way it must be. To obtain regression results that you can trust, you need to do the following:

- Specify the correct model. As we saw, if you fail to include all the important variables in your model, the results can be biased.
- Check your residual plots. Be sure that your model fits the data adequately.
- Assess multicollinearity, which is correlation between the independent variables. As we saw, some multicollinearity is OK. However, excessive multicollinearity can be a problem.
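As one hedged illustration of checking for correlation between the independent variables, the sketch below computes variance inflation factors (VIFs), a common diagnostic that is not named above. It assumes NumPy, and the IVs `x1`, `x2`, and `x3` are simulated, with `x3` deliberately built as a near copy of `x1`.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 500

# Simulated, made-up IVs: x3 is almost a copy of x1, x2 is unrelated.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)
ivs = np.column_stack([x1, x2, x3])

def r_squared(X, y):
    """R-squared from regressing y on X plus an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def vif(ivs):
    """Variance inflation factor for each column: 1 / (1 - R^2_j)."""
    out = []
    for j in range(ivs.shape[1]):
        others = np.delete(ivs, j, axis=1)
        out.append(1.0 / (1.0 - r_squared(others, ivs[:, j])))
    return np.array(out)

vifs = vif(ivs)
print(dict(zip(["x1", "x2", "x3"], np.round(vifs, 1))))
```

A VIF near 1 means an IV is nearly uncorrelated with the others. Large values, as for `x1` and `x3` here, flag IVs whose coefficient estimates will be imprecise. Rules of thumb for what counts as “large” vary, so treat any cutoff as a guideline rather than a test.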

Using regression analysis gives you the ability to separate the effects of complicated research questions. You can disentangle the spaghetti noodles by modeling and controlling all relevant variables, and then assess the role that each one plays.

There are many different regression analysis procedures. Read my post to determine which type of regression is correct for your data.

If you’re learning regression, check out my Regression Tutorial!

NARAYANASWAMY AUDINARAYANA says

In linear regression, can we use categorical variables as Independent variables? If yes, what should be the minimum or maximum categories in an Independent variable?

Jim Frost says

Hi, yes you can use categorical variables as independent variables! The number of groups really depends on what makes sense for your study area. Of course, the minimum is two. There really is no maximum in theory. It depends on what makes sense for your study. However, in practice, having more groups requires a larger total sample size, which can become expensive. If you have 2-9 groups, you should have at least 15 in each group. For 10-12 groups, you should have 20. These numbers are based on simulation studies for ANOVA, but they also apply to categorical variables in regression. In a nutshell, figure out what makes sense for your study and then be sure to collect enough data!

I hope this helps!

Jim

Khadidja Benallou says

Thank you Mr. Jim

Jim Frost says

You’re very welcome!

Shamsun Naher says

In my model, I use different independent variables. My question is: before using regression, do I need to check the distribution of the data? If yes, please give the names of the tests. My title is Education and Productivity Nexus: evidence from the pharmaceutical sector in Bangladesh.

Jim Frost says

Hi Shamsun, typically you test the distribution of the residuals after you fit a model. I’ve written a blog post about checking your residual plots that you should read.

I hope this helps!

Jim

Ghulam Mustafa says

Sir, usually we take the 5% level of significance for comparing. Why 0?

Jim Frost says

Hi Ghulam, yes, the significance level is usually 0.05. I’m not sure what you’re asking about in regards to zero? The p-values in the example output are all listed as 0.000, which is less than the significance level of 0.05, so they are statistically significant.

ghulam mustafa says

Why do we usually use the 5% level of significance for comparing instead of 1% or another level?

Jim Frost says

Hi, I actually write about this topic in a post about hypothesis testing. It’s basically a tradeoff between several different error rates–and a dash of tradition. Read that post and see if it answers your questions.

Kaleem says

Hello, Jim.

What is the impact* on the independent variables in the model if I omit a variable that is a determinant of dependent variable but is not related to any of the independent variables?

*Here impact relates to the independent variables’ p-values and the coefficients.

Kind regards.

Jim Frost says

Hi Kaleem,

If the independent variable is not correlated with the other independent variables, it’s likely that there would be a minimal effect on the other independent variables. Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable. You should also check the residual plots to be sure that by removing the variable you’re not introducing bias.

Kaleem says

Thanks for the reply. Jim.

I am unable to understand “Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable”. Are you stating that other independent variables will be fine but r-square will become low? I will be grateful if you can explain this.

Kind regards

Jim Frost says

Hi, you indicated that the removed independent variable is related to the dependent variable, but it is not correlated with the other independent variables. Consequently, removing that independent variable should reduce R-squared. For one thing, that’s the typical result of removing variables, even when they’re not statistically significant. In this case, because it is not correlated to the other independent variables, you know that the removed variable is supplying unique information. Taking that variable out means that information is no longer included in the model. R-squared will definitely go down, possibly dramatically.

R-squared measures the strength of the relationship between the entire set of IVs and the DV. Read my post about R-squared for more information.
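As a rough illustration of this scenario, the sketch below assumes NumPy and uses simulated, made-up data in which `z` affects `y` but is independent of `x`. It fits the model with and without `z` and compares the coefficient for `x`, its standard error, and R-squared.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 2000

# Simulated, made-up data: z affects y but is independent of x.
x = rng.normal(size=n)
z = rng.normal(size=n)
y = 1.0 * x + 2.0 * z + rng.normal(size=n)

def fit(y, *ivs):
    """Return (coefficients, standard errors, R-squared) for OLS with intercept."""
    X = np.column_stack([np.ones(len(y)), *ivs])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coefs
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return coefs, std_err, r2

with_z = fit(y, x, z)
without_z = fit(y, x)

print("x coefficient:", with_z[0][1], "->", without_z[0][1])
print("x std. error: ", with_z[1][1], "->", without_z[1][1])
print("R-squared:    ", with_z[2], "->", without_z[2])
```

In this setup the coefficient for `x` barely moves when `z` is dropped, but R-squared falls and the standard error for `x` grows, because the variation that `z` used to explain now sits in the error term.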

Kaleem says

Further clarification on my above post. From internet I found that if a variable (z) that is related to y but unrelated to x then inclusion of z will reduce standard errors of x. So, if z is excluded, but f-stat and adjusted r-square are fine then does high standard errors create problems? I look forward to your reply.

Jim Frost says

Yes, what you read is correct. Typically, if Z is statistically significant, you should include it in your model. If you exclude it, the precision of your coefficient estimates will be lower (higher standard errors). You also risk a biased model because you are not including important information in the model–check the residual plots. The F-test of overall significance and adjusted R-squared depend on the other IVs in your model. If Z is by far the best variable, it’s possible that removing it will cause the F-test to not be significant and adjusted R-square might drop noticeably. Again, that depends on how the explanatory power of Z compares to the other IVs. Why do you want to remove a significant variable?

Kaleem says

Thanks for your reply. I really appreciate it. Could you please also provide an answer to my query mentioned below for further clarification?

Kaleem says

Thanks for the reply. I apologise if I am taking a considerable time out of your schedule.

Based on the literature, there isn’t any conclusive evidence that z is a determinant of y. So, that is why I intend to remove z. Some studies include it while some do not and some find significant association (between y and z) while some find the association insignificant. Hence, I think I can safely remove it.

Moreover, I will be grateful if you can answer another query. From a statistical viewpoint, is it fine if I use the generalized method of moments (GMM) for a binary dependent variable?

Kind regards.

Jim Frost says

While I can’t offer you a concrete statement about whether you should include or exclude the variable (clearly there is disagreement in your own field), I do suggest that you read my article about specifying the correct regression model. I include a number of tips and considerations.

Unfortunately, I don’t know enough about GMM to make a recommendation. All of the examples I have seen personally are for continuous data, but I don’t know about binary data.

Kaleem says

Thanks for your reply and for the guidance.

I read your posts which are very helpful. After reading them, I concluded that only the independent variables which have a well-established association with the dependent variable should be included. Hence, in my case, variable Z should not be included given that the association of Z with dependent variable is not well-established.

Furthermore, suppose there is another variable (A) and literature suggests that it, in general, has an association with dependent variable. However, assume that A does not affect any independent variables so there is no omitted variable bias. In this case, if there is no data available for A (due to the study being conducted in different environment/context) then what statistical techniques can be deployed to address any problems caused due to the exclusion of A?

I look forward to your reply and I will be grateful for your reply.

Kind regards.

Hari says

Very nicely explained. Thank you.

Jim Frost says

Thank you, Hari!

Martin Amsteus says

No statistical tool or method turns a survey or correlation study into an experiment, i.e. regression does not test or imply a cause-and-effect relationship. A positive relationship between smoking and cancer in a regression analysis does not mean that smoking causes cancer. You have not controlled for what you are unaware of.

Jim Frost says

Hi Martin, you are 100% correct about the fact that correlation doesn’t imply causation. This issue is one that I plan to cover in future posts.

There are two issues at play here. The type of study under which the data were collected and the statistical findings.

Being able to determine causation comes down to the difference between an observational study versus a randomized experiment. You actually use the same analyses to assess both types of designs. In an observational study, you can only establish correlation and not causality. However, in a randomized experiment, the same patterns and correlations in the data can suggest causality. So, regression analysis *can* help establish causality, but only when it’s performed on data that were collected through a randomized experiment.

Martin Amsteus says

Yes. But it may be that you miss my point. I argue that a proper and sound experiment will allow you to test for causality, regardless of whether you deploy e.g. Pearson’s r or regression. With no experimental design, neither Pearson’s r nor a regression will test for an effect relationship between the variables. Randomisation makes a better case for controlling for variables that you are unaware of than picking a few and then proclaiming that your study found that x will cause an increase in y or that x has an effect on y. You may as well argue that you don’t need to control for any variables and that any correlational study tests for effect relationships.

Jim Frost says

Hi Martin, yes, that is *exactly* what I’m saying. Whether you can draw causal conclusions depends on whether you used a randomized experiment to collect your data. If it’s an observational study, you can’t assume it’s anything other than correlation. What you write in your comment agrees with what I’m saying.

The controlling for other variables that I mention in this post is a different matter. Yes, if you include a variable in a regression model, it is held constant while estimating the effects of the other variables. That doesn’t mean you can assume causality though.

Cara says

Hi Jim,

I was wondering if you can help me? I am doing my dissertation and I have 1 within-subjects IV, and 3 between-subjects IVs.. most of my variables are categorical, but one is not categorical, it is a questionnaire which I am using to determine sleep quality, with both Likert scales and own answers to amount of sleep (hours), amount of times woke in the night etc. Can I use a regression when making use of both categorical data and other? I also have multiple DVs (angry/sad Likert ratings).. but I *could* combine those into one overall ’emotion’ DV. Any help would be much appreciated!

Jim Frost says

Hi Cara, because your DVs use the Likert scale, you really should be using Ordinal Logistic Regression. This type of regression is designed for ordinal dependent variables like yours. As for the IVs, it can be tricky using ordinal variables. They’re not quite either continuous or categorical. My suggestion is to give them a try as continuous variables and check the residual plots to see how they look. If they look good, then it’s probably OK. However, if they don’t look good, you can try refitting the model using them as categorical variables and then rechecking the residual plots. If the residuals still don’t look good, you can then try using the chi-square test of independence for ordinal data.

As for combining the data, that would seem to be a subject-area specific decision, and I don’t know that area well enough to make an informed recommendation.

Ahmed says

Hi Jim

Hope all is well.

I have a problem with plotting the relationship between the dependent variable (response) and the independent variables. When I create the main effects plots, I get a straight, increasing line (y = x, a linear trend). To change it, I need to make y = square root of time.

I’m stuck on this and couldn’t find a solution for it.

Regards

SUBROTO Chatterjee says

In the coffee intake and smoking example, the first result showed that higher coffee intake leads to higher mortality, but after including smoking, coffee intake leads to lower or no mortality? Smoking was revealed to cause the mortality, but how did coffee intake now result in the opposite? Was a separate test taken for this result? Please let me know.

S. CHATTERJEE

Jim Frost says

Hi, that’s a great question. It turns out that coffee and smoking are correlated. The negative effects of smoking on mortality are well documented. However, for some reason, the researchers did not originally include smoking in their model. Because drinking coffee and smoking are correlated, the variable for coffee consumption took on some of smoking’s effect on mortality.

Put another way, because smoking was not included in the model, it was not being controlled (held constant). So, as you increased coffee consumption, smoking also tended to increase because it is both positively correlated with coffee consumption and not in the model. Therefore, it appeared as though increased coffee consumption is correlated with higher mortality rates but only because smoking was not included in the model.

Presumably, the researchers had already collected data about smoking. So, all they had to do was include the smoking variable in their regression model. Voila, the model now controls for smoking and the new output displays the new estimate of the effect that coffee has on mortality.

This point illustrates a potential problem. If the researchers had not collected the smoking information, they would have really been stuck. Before conducting any study researchers need to do a lot of background research to be sure that they are collecting the correct data!

I hope this helps!

Patrik Silva says

Hi Jim, I am everywhere in your post!

I am starting to love statistics, that’s why I am not quiet.

I have some questions for you:

To use OLS regression, one of the assumptions is that the dependent variable is normally distributed. To achieve this requirement what I should do with my data? Should I check the normality of my dependent variable, for example using Shapiro test (etc)? If I conclude that my dependent variable is not following the normal distribution I should start to see data transformation, right?

Another way that I have used to see people analyzing the normality is by plotting the dependent variable with the independent variable and if the relationship doesn’t follow linear trend then they go to data transformation (which one you recommend me?)

Should I perform the regression using my data (original) and then the residuals will show me non-normality if do exists?

When should I transform my independent variables, and what is the consequence of transforming them?

Sorry, I use to ask many questions in a single comment, but I think this is the way to understand the full picture of my doubt,

You are being so useful to me,

Thank you again!

Patrik

Jim Frost says

Hi Patrik, I’m so happy to hear that you’re starting to love statistics! It’s a great field that is exciting. The thrill of discovery combined with getting the most value out of your data. I’m not sure if you’ve read my post about The Importance of Statistics, but if you haven’t, I recommend it. It explains why the field of statistics is more important than ever!

In OLS regression, the dependent variable does not have to be normally distributed. Instead, you need to assess the distribution of the residuals using residual plots. If your residuals are not normally distributed, there are a variety of possible reasons and different ways to resolve that issue. I always recommend that transforming your data is the last resort. For example, the residuals might be nonnormal because the model is specified incorrectly. Maybe there is curvature in the data that you aren’t modeling correctly? If so, transforming the data might mask the problem. You really want to specify the best possible model. However, if all else fails, you might need to transform the data. When you transform the data, you’ll need to back transform the results to make sense of the results because everything applies to the transformed data. Most statistical software should do this for you.

Be aware that you can’t trust R-squared and the standard error of the regression when you transform your dependent variable because they apply to the transformed data rather than the raw data (backtransformation won’t help there).

In terms of testing the normality of the residuals, I recommend using normal probability plots. You can usually tell at a glance whether they are normally distributed. If you need a test, I generally use the Anderson-Darling test–which you can see in action in my post about identifying the distribution of your data. By the way, as a case in point, the data in that post are not normal, but I use it as the dependent variable in OLS regression in this post about using regression to make predictions. The residuals are normally distributed even though the dependent variable is not.
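The point that the dependent variable can be nonnormal while the residuals are normal can be sketched with simulated data. The example assumes NumPy, the data are made up, and sample skewness stands in as a crude numeric substitute for looking at a normal probability plot. It checks symmetry only, not normality in general.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 5000

# Simulated, made-up data: a right-skewed IV makes the DV right-skewed,
# while the error term itself is normal.
x = rng.exponential(scale=1.0, size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coefs

def skewness(values):
    """Sample skewness: about 0 for symmetric data such as a normal sample."""
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

print("skewness of DV:       ", skewness(y))
print("skewness of residuals:", skewness(residuals))
```

The DV is clearly skewed, yet the residuals are close to symmetric, because the skew comes from the IV and the model accounts for it.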

Patrik Silva says

Many Thanks Jim!!! You have no idea about how much you helped me.

Very well clarified!!!

God bless you always!!!

Patrik

Patrik Silva says

Hi Jim, I hope you are having good time!

I would like to ask you a question, please!

I have 24 observations to perform a regression analysis (let’s say Zones), and I have many independent variables (IVs). I would like to know the minimum number of observations I should have to fit a reasonable linear regression model. I would also like to hear from you about how to test many regression models with different IVs, since I cannot use many IVs in a model where I have few observations (24).

Thank you in advance!

Patrik

Jim Frost says

Hi Patrik, great to hear from you again!

Those are great questions. For 24 observations, I’d say that you usually wouldn’t want more than 2 IVs. I write an entire post about how many variables you can include in a regression model. Including too many IVs (and other terms such as interactions and polynomials) is known as overfitting the model. Check that post out because it’ll provide guidance and show you the dangers of including too many.

There’s another issue at play too because you want to compare a number of different regression models to each other. If you compare many models, it’s a form of data mining. The risk here is that if you compare enough models, you will uncover chance correlations. These chance correlations look like the real thing but only appear in your sample and not the population. I’ve written a post about how using this type of data mining to choose a regression model causes problems. This concern is particularly problematic with a small sample size like yours. It can find “patterns” in randomly generated data.

So, there are really two issues for you to watch out for: overfitting and chance correlations found through data mining!
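A small simulation shows how easily chance correlations turn up with 24 observations. It assumes NumPy, all of the data are random noise, and the choice of 100 candidate IVs is a hypothetical number picked for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n_obs = 24          # sample size from the question above
n_candidates = 100  # hypothetical number of candidate IVs to screen

# Pure noise: the DV has no real relationship with any candidate IV.
y = rng.normal(size=n_obs)
candidates = rng.normal(size=(n_obs, n_candidates))

# Correlation of each candidate with the DV.
correlations = np.array([
    np.corrcoef(candidates[:, j], y)[0, 1] for j in range(n_candidates)
])

best = np.argmax(np.abs(correlations))
print(f"strongest chance correlation: r = {correlations[best]:.2f}")
```

Even though every variable is random, the best of the candidates shows a correlation that would look meaningful if you saw it in isolation. Picking a model by screening many candidates this way is how such spurious “patterns” end up in a final model.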

Hope this helps!

Patrik Silva says

Thank you Jim!

I really appreciate it!

PS

Saeed Anowar says

Thanks

Kaushal Kumar Bhagat says

Is it necessary to conduct correlation analysis before regression analysis?

Jim Frost says

Hi Kaushal,

No it’s not absolutely required. I actually prefer producing a series of scatterplots (or a matrix plot) so I can see the nature of the different relationships. That helps give me a better feel for the data along with the types of relationships. However, if you have a good theory and a solid background knowledge on which variables should be included in the model, you can go straight to modeling. I think it depends a lot on your existing level of knowledge.

That all said, I personally like knowing the correlation structure between all of the variables. It gives me a better feel for the data.

Rashmi Bs says

Dear sir,

I have a few questions about when to use ANOVA and when to use regression analysis. In my study, I conducted an experiment with temperature, pH, and weight of a compound as independent variables and extraction as the dependent variable (I describe them very generally here, but I have some specific independent and dependent variables along with these). I did the statistical analysis using one-way ANOVA with Tukey’s test, and I used the grouping method (letters a, b, c, …) to show significance based on the p-value. My question is: for this type of data, can I use regression analysis? And what is the main difference between Tukey’s test and regression analysis?

Jim Frost says

Hi Rashmi,

Both regression analysis and ANOVA are linear models. As linear models, both types of analyses have the same math “under the hood.” You can even use them interchangeably and get the same results. Traditionally, you use ANOVA when you have only, or mainly, categorical factors–although you can add in covariates (continuous variables). On the other hand, you tend to use regression when you have only, or mainly, continuous variables–although you can add in categorical variables.
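To illustrate the “same math” point, here is a minimal sketch that assumes NumPy and uses made-up measurements for three hypothetical groups. It computes the one-way ANOVA F-statistic from sums of squares, then fits a regression with indicator variables and computes the overall F-statistic from that fit.

```python
import numpy as np

# Hypothetical measurements for three groups, made up for illustration.
groups = {
    "a": np.array([4.1, 5.0, 4.6, 5.3]),
    "b": np.array([6.2, 5.8, 6.9, 6.4]),
    "c": np.array([5.1, 4.7, 5.6, 5.2]),
}
y = np.concatenate(list(groups.values()))
n, k = len(y), len(groups)

# One-way ANOVA F-statistic from between- and within-group sums of squares.
grand_mean = y.mean()
ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups.values())
ss_within = sum(np.sum((v - v.mean()) ** 2) for v in groups.values())
f_anova = (ss_between / (k - 1)) / (ss_within / (n - k))

# Regression with indicator variables; group "a" is the reference level.
labels = np.repeat(list(groups), [len(v) for v in groups.values()])
X = np.column_stack([np.ones(n), labels == "b", labels == "c"]).astype(float)
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coefs
ss_resid = resid @ resid
ss_total = np.sum((y - grand_mean) ** 2)
f_regression = ((ss_total - ss_resid) / (k - 1)) / (ss_resid / (n - k))

print(f_anova, f_regression)  # the two F-statistics agree
```

The regression intercept equals the mean of the reference group, and each indicator coefficient equals the difference between that group’s mean and the reference mean, which is why the two analyses give the same overall test.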

Because ANOVA focuses on categorical factors and comparing multiple group means, statisticians have developed additional post hoc analyses to work with ANOVA, such as Tukey’s test. Typically, you’ll perform the ANOVA first and then the post hoc test. Suppose you perform a one-way ANOVA and obtain significant results. This significance tells you that not all of the group means are equal. However, it does not tell you which differences are statistically significant.

That point is where post hoc tests come in. These tests do two things. They’ll tell you which differences are statistically significant. They also control the family error rate for the group of comparisons. When you compare multiple differences like that, you increase the risk of a Type I error, which is when you say there is a difference but there really isn’t. When you compare multiple means, the Type I error rate will be higher than your significance level (alpha). These post hoc tests (other than Fisher’s) maintain the Type I error rate so it continues to equal alpha, which is what you would expect.

So, use an ANOVA first. If you obtain significant results for a categorical factor, you can use post hoc tests like Tukey’s to explore the differences between the various factor levels.

I really need to write a blog post about this! I will soon!

In the meantime, I hope this helps!

Antonio Padua says

Thank you for this, Jim. I’ve always felt a common sense explanation minus all the impressive math formulas is what is needed in statistics for data science. This is a big part of the basics I’ve been missing. I’m looking forward to your Logistic Regression Tutorial. How is that coming along for you?

Jim Frost says

Hi Antonio,

Thanks so much for your kind words! They mean a lot to me! Yes, I totally agree, explanations should focus on being intuitive and helping people grasp the concepts.

I have written a post on binary logistic regression. Unfortunately, it’ll be a while before I have a chance to write a more in-depth article. There are just too many subjects to write about!

V.G.Subramanian says

Hi Jim

As you must be well aware, the government releases price indices, and these are broadly used to determine the effect on base prices during a given period of time.

The construction industry normally uses these price indices running over a period of time to redetermine the prices based on the movement between the base date and current date, which is called price adjustment.

Govt after a few years of time releases a new series of price indices where we may not have the data of indices with old series which will necessitate us to use these new indices with a conversion factor to arrive at the equivalent value of the base price.

Where do you feel that Regression Analysis could be of help where we have to determine the current value of the base price using the new indices.

It is a bit amusing that someone was suggesting to me.

Regards

V.G.Subramanian

Jim Frost says

Hi V.G.,

I agree that switching price indices can be a problem. If the indices overlap, you can perform regression analysis where the old index is the independent variable and the new index is the dependent variable. However, that is problematic if you don’t have both indices. If you had both indices, I suppose it wouldn’t be a problem to begin with!

Ideally, you’d understand the differences behind how the government calculates both indices, and you could use that to estimate the value of the other index.

I’m not particularly familiar with this practice, so I don’t have a whole lot of insight into it. I hope this helps somewhat!

Almadi Obere says

Hi Jim

Thanks for your enlightened explanations. However, I want to engage you a bit. Under how to interpret regression results, you indicated that a small p-value indicates that the “independent variable is statistically significant”. I tend not to agree. Since the null hypothesis is that the coefficient of the independent variable is equal to zero, its rejection, as evidenced by a low p-value, should imply that it is the coefficient that is significantly different from zero, not the variable.

almadi

Jim Frost says

Hi Almadi,

Yes, you’re correct that the p-value tests whether the coefficient estimate is significantly different from zero. If it is, you can say that the coefficient is statistically significant. Alternatively, statisticians often say that the independent variable is statistically significant. In this context, these are two different ways of saying the same thing because the coefficient is a property of the variable itself.
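For readers who want the mechanics behind both phrasings, the test can be written out as follows. This is the standard t-test for a single coefficient in ordinary least squares. The symbols are notation introduced here for illustration: $\beta_j$ is the population coefficient of the $j$-th independent variable, $\hat{\beta}_j$ is its estimate, $n$ is the number of observations, and $p$ is the number of independent variables.

```latex
H_0:\ \beta_j = 0 \qquad \text{versus} \qquad H_A:\ \beta_j \neq 0

t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)},
\qquad t_j \sim t_{\,n-p-1} \ \text{under } H_0
```

A small p-value for $t_j$ is evidence against $\beta_j = 0$. Saying "the variable is significant" is shorthand for exactly that statement about its coefficient.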

Tetyana says

Dear Jim, thank you very much for this post! Could you please explain the following?

You write: “you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values”

What if I have a small R-squared, but the coefficients are statistically significant with small values?

Aisling Dunphy says

Hi Jim,

Your blog has been really helpful! :) I am currently completing my master’s thesis, and my primary outcome is to assess the relationship between diabetes distress and blood glucose control. I am a newbie to SPSS, and I am at a loss as to how best to analyse my small data set, which is not normally distributed either before or after data transformation.

I have been advised that regression analysis may be appropriate and better than correlations. However, my data do not appear to be linear.

My diabetes distress variables consist of a score of 1-6 based on a Likert scale and are also categorical (low, moderate, high distress). My blood glucose data are continuous, and I also have a categorical variable for poorly controlled versus well controlled blood glucose.

At the moment I am struggling to complete this analysis. Any help would be greatly appreciated :)

V.G.Subramanian says

Hi Jim

Unfortunately, I have not received your reply to my comment of 18/09/2018.

Jim Frost says

Hi V.G.,

Sorry about the delay. As you can no doubt imagine, my schedule gets busy and things can fall through the cracks.

I replied under your original comment.

V.G.Subramanian says

Hi Jim

Thank you so much for your reply. I am really eager to know much more about this. I shall keep sending mails seeking your reply, which I hope you will not mind.

Salam says

Hi,

What makes us use linear regression instead of other types of regression? In other words, what is the motivation for selecting a linear model?

Jim Frost says

Hi Salam,

Typically, try linear regression first. If your data contain curvature, you might still be able to use linear regression. Linear regression is generally easier to use and includes some useful statistics that nonlinear regression can’t provide, such as p-values for the coefficients and R-squared.

However, if you can’t adequately fit the curvature in your data, it might be time to try nonlinear regression. While both types allow you to fit curvature, nonlinear regression is more flexible because it allows your model to fit more types of curvature.

I’ve written a post about how to choose between linear and nonlinear regression that you should read. Within that post are various related links that talk about how to fit curves using both types of regression, along with additional information about both types.

I hope this helps!
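As a rough illustration of how linear regression can still fit curvature, the sketch below adds a squared term to an otherwise ordinary least squares model. The model stays linear in its coefficients, which is what makes it "linear" regression. The data and the helper names (`solve_linear_system`, `fit_least_squares`, `r_squared`) are made up for this example, and it uses plain Python rather than any particular statistics package.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_least_squares(columns, y):
    """Fit y = sum(b_k * columns[k]) using the normal equations (X'X) b = X'y."""
    k = len(columns)
    xtx = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j]))
            for j in range(k)] for i in range(k)]
    xty = [sum(ci * yi for ci, yi in zip(columns[i], y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def r_squared(columns, y, coefs):
    """Share of the variation in y explained by the fitted model."""
    fitted = [sum(b * col[i] for b, col in zip(coefs, columns))
              for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Made-up data with visible curvature.
x = [0, 1, 2, 3, 4, 5, 6]
y = [2.1, 2.4, 4.1, 6.4, 10.2, 14.4, 20.1]
ones = [1.0] * len(x)
x_squared = [xi ** 2 for xi in x]

straight = fit_least_squares([ones, x], y)            # y = b0 + b1*x
curved = fit_least_squares([ones, x, x_squared], y)   # y = b0 + b1*x + b2*x^2

r2_straight = r_squared([ones, x], y, straight)
r2_curved = r_squared([ones, x, x_squared], y, curved)
print(f"R-squared, straight line: {r2_straight:.3f}")
print(f"R-squared, with squared term: {r2_curved:.3f}")
```

If a polynomial term like this still leaves a pattern in the residual plots, that is the point at which nonlinear regression becomes worth considering.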

Rachel Wang says

Hi Jim,

Suppose I have a total of 15 data points at x=0, x=40, and x=80 (5 data points at each x value). I can use regression to estimate y when x=60. But what if I want to estimate the average when x=0? Should I just use the 5 data points at x=0, or use the intercept from the regression line? Which gives the best estimate for a 95% CI for the average y value when x=0?

Thank you :)

Jim Frost says

Hi Rachel,

Assuming that the model provides a good fit to the data (check the residual plots), I’d use all the data to come up with the CI for the fitted value that corresponds to X = 0. That approach uses more data to calculate the estimate. Your CI might even be more precise (narrower) using all the data.
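Here is a minimal sketch of the two calculations being compared, using made-up data with 5 observations at each of x = 0, 40, and 80. It assumes a simple straight-line model with constant variance. The t critical values are hard-coded from a standard t table so the example needs only the standard library, and all variable names are illustrative.

```python
import math

# Made-up data: 5 observations at each of x = 0, 40, 80.
x = [0] * 5 + [40] * 5 + [80] * 5
y = [2.4, 2.9, 3.1, 2.6, 3.0,
     4.8, 5.6, 5.1, 4.4, 5.9,
     7.2, 8.9, 6.8, 8.1, 9.0]

# Two-sided 95% t critical values, copied from a standard t table.
T_CRIT_DF13 = 2.160  # regression: n - 2 = 13 degrees of freedom
T_CRIT_DF4 = 2.776   # single group: 5 - 1 = 4 degrees of freedom

# Fit the straight line by ordinary least squares.
n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual standard error of the fitted line.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

# CI for the mean response at x0 = 0, using all 15 points.
x0 = 0
fit = intercept + slope * x0
se_fit = s * math.sqrt(1 / n + (x0 - mean_x) ** 2 / sxx)
reg_ci = (fit - T_CRIT_DF13 * se_fit, fit + T_CRIT_DF13 * se_fit)

# CI for the mean using only the 5 points observed at x = 0.
group = [yi for xi, yi in zip(x, y) if xi == 0]
group_mean = sum(group) / len(group)
group_sd = math.sqrt(
    sum((g - group_mean) ** 2 for g in group) / (len(group) - 1)
)
half_width = T_CRIT_DF4 * group_sd / math.sqrt(len(group))
group_ci = (group_mean - half_width, group_mean + half_width)

print(f"Regression CI at x=0: ({reg_ci[0]:.3f}, {reg_ci[1]:.3f})")
print(f"Group-only CI at x=0: ({group_ci[0]:.3f}, {group_ci[1]:.3f})")
```

The regression interval borrows the residual variation from all three x values, so which interval ends up narrower depends on how the spread at x = 0 compares with the spread elsewhere.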

Rachel Wang says

Hi Jim,

Thank you so much for the quick response!

I checked the residual plots. They give me a pretty trend line at y=0, and my R-squared = 0.87. However, the CI I get by using all 15 points (regression inference) is a little wider (2.012 – 3.655) than if I just use the 5 points at x=0 (2.245 – 3.355). In this case, would you still prefer using all 15 points?

Thank you!

Jim Frost says

That’s tricky. I hate to throw out data, but it does seem warranted. At least you have a good rationale for not using the data!

CIs of the mean for a point at the end of a data range in a regression model do tend to be wider than in the middle of the range. Still, I’m not sure why it would be wider than the CI from the five points alone. Are the variances of the groups roughly equal? If not, that might well be the reason.

Rachel Wang says

Hi Jim,

The variances (SD) for the 3 groups are 0.45, 0.7 and 1. Would you say that they vary by a lot?

Another follow-up question: does a narrower CI equal a better estimate?

Thanks!

Jim Frost says

Yes, that’s definitely it!

I would suggest using Welch’s one-way ANOVA to analyze it and potentially use that analysis to calculate the CI. You’re essentially performing a one-way ANOVA. And, in ANOVA, there is the assumption of equal variances between groups, which your data do not satisfy. In regression, we’d refer to it as heteroscedasticity. In Welch’s ANOVA, you don’t need to satisfy that assumption. That makes it a simple solution for your case.

In terms of CIs, yes, narrower CIs indicate that the estimate is more precise than if you had a wider CI. Think of the CI as a margin of error around the estimate and it’s good to have a smaller margin of error. With a narrower CI, you can expect the actual mean to fall closer to the fitted value.
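For anyone curious what Welch’s one-way ANOVA computes, the sketch below works through the test statistic on made-up data with three groups whose spreads clearly differ. It follows the usual textbook formulas for Welch’s F and its degrees of freedom. The data and the `welch_anova` function name are hypothetical, and the sketch stops short of the p-value, which requires the F distribution from a statistics package.

```python
import statistics

# Made-up groups with visibly unequal spread (hypothetical data).
groups = [
    [2.4, 2.9, 3.1, 2.6, 3.0],
    [4.8, 5.6, 5.1, 4.4, 5.9],
    [7.2, 8.9, 6.8, 8.1, 9.0],
]


def welch_anova(groups):
    """Return Welch's F statistic and its numerator and denominator df."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [statistics.mean(g) for g in groups]
    variances = [statistics.variance(g) for g in groups]  # sample variances

    # Each group is weighted by its precision, n_i / s_i^2,
    # so noisy groups count for less. No pooled variance is used.
    weights = [n / v for n, v in zip(sizes, variances)]
    total_weight = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_weight

    between = sum(
        w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    lam = sum(
        (1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes)
    )

    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2


f_stat, df1, df2 = welch_anova(groups)
print(f"Welch F = {f_stat:.2f} with df1 = {df1} and df2 = {df2:.2f}")
```

Because each group keeps its own variance, a CI for a single group mean under this approach rests on that group’s own standard deviation rather than on a pooled estimate.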

Rachel Wang says

Thank you so much, Jim! This is really helpful :)

Jim Frost says

You’re very welcome! Best of luck with your analysis!

Asad says

Hello Sir!

Is it necessary for the dependent variable in a multiple regression model to have values? I have a number of independent variables (age of property, stories in building, location close to a park) and a single dependent variable (property value).

Some independent variables decrease the value of the dependent variable, while others increase it.

Can I enter the values of my single dependent variable as (a. <200000, b. <300000, c. d. 500000)?

Jim Frost says

Hi Asad,

Why can’t you enter the actual property values? Ideally, that’s what you would do. If you are missing a value for a particular observation, you typically need to exclude the entire observation from the analysis. However, there are some ways to estimate missing values. For example, SPSS has advanced methods for imputing missing values. But, you should use those only to estimate a few missing values. Your plan should be to obtain the property values. If you can’t do that, it will be difficult to perform regression analysis.

There are some cases where you can’t record the exact values and it’s usually related to the observation time. This is known as censored data. A common example is in reliability analysis where you record failure times for a product. You run the experiment for a certain amount of time and you obtain some failures and know their failure times. However, some products don’t fail and you only know that their failure time is greater than the test time. There are censored regression models you can use in situations like that. However, I don’t think that applies to your subject-area, at least as far as I can tell.

Best of luck with your analysis!

Asad says

Thank you so much for your time!

Actually, I don’t have authentic data about property values (the dependent variable), and neither do the institutions concerned. Can I ask the property owners directly for the property value through a walk interview?

Jim Frost says

You really need to have valid data. Using a self-reported valuation might be better than no data. However, be aware there might be differences between what the property owner says and the true market value. Your model would describe self-valuation rather than market valuation. Typically, I’ve seen studies like yours use actual sales prices.

Asad says

OK. Thank you so much.