Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.
As a statistician, I should probably tell you that I love all statistical analyses equally—like parents with their kids. But, shhh, I have a secret! Regression analysis is my favorite because it provides tremendous flexibility, which makes it useful in so many different circumstances. In fact, I’ve described regression analysis as taking correlation to the next level!
In this blog post, I explain the capabilities of regression analysis, the types of relationships it can assess, how it controls the variables, and generally why I love it! You’ll learn when you should consider using regression analysis.
Related post: What are Independent and Dependent Variables?
Use Regression to Analyze a Wide Variety of Relationships
Regression analysis can handle many things. For example, you can use regression analysis to do the following:
- Model multiple independent variables
- Include continuous and categorical variables
- Use polynomial terms to model curvature
- Assess interaction terms to determine whether the effect of one independent variable depends on the value of another variable
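To make this concrete, here's a minimal sketch using Python's statsmodels formula interface. The data are fabricated and the names (df, Output, Temp, Machine) are placeholders rather than from a real study, but the single formula shows all four capabilities at once.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical process data, purely for illustration.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Temp": rng.uniform(50, 100, n),
    "Machine": rng.choice(["A", "B"], n),
})
effect = np.where(df["Machine"] == "B", 0.4, 0.1)
df["Output"] = 3 + 0.5 * df["Temp"] - 0.003 * df["Temp"]**2 \
    + effect * df["Temp"] + rng.normal(0, 1, n)

# One model with a continuous IV, a categorical IV, a polynomial term
# for curvature, and an interaction term.
model = smf.ols("Output ~ Temp + I(Temp**2) + C(Machine) + Temp:C(Machine)", data=df)
print(model.fit().summary())
```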
These capabilities are all cool, but they don’t include an almost magical ability. Regression analysis can unscramble very intricate problems where the variables are entangled like spaghetti. For example, imagine you’re a researcher studying any of the following:
- Do socio-economic status and race affect educational achievement?
- Do education and IQ affect earnings?
- Do exercise habits and diet affect weight?
- Are drinking coffee and smoking cigarettes related to mortality risk?
- Does a particular exercise intervention have an impact on bone density that is distinct from the effects of other physical activities?
More on the last two examples later!
All these research questions have entwined independent variables that can influence the dependent variables. How do you untangle a web of related variables? Which variables are statistically significant and what role does each one play? Regression comes to the rescue because you can use it for all of these scenarios!
Use Regression Analysis to Control the Independent Variables
As I mentioned, regression analysis describes how the changes in each independent variable are related to changes in the dependent variable. Crucially, regression also statistically controls every variable in your model.
What does controlling for a variable mean?
When you perform regression analysis, you need to isolate the role of each variable. For example, I participated in an exercise intervention study where our goal was to determine whether the intervention increased the subjects’ bone mineral density. We needed to isolate the role of the exercise intervention from everything else that can impact bone mineral density, which ranges from diet to other physical activity.
To accomplish this goal, you must minimize the effect of confounding variables. Regression analysis does this by estimating the effect that changing one independent variable has on the dependent variable while holding all the other independent variables constant. This process allows you to learn the role of each independent variable without worrying about the other variables in the model. Again, you want to isolate the effect of each variable.
Regression models help you prevent spurious correlations from confusing your results by controlling for confounders.
How do you control the other variables in regression?
A beautiful aspect of regression analysis is that you hold the other independent variables constant by merely including them in your model! Let’s look at this in action with an example.
A recent study analyzed the effect of coffee consumption on mortality. The first results indicated that higher coffee intake is related to a higher risk of death. However, coffee drinkers frequently smoke, and the researchers did not include smoking in their initial model. After they included smoking in the model, the regression results indicated that coffee intake lowers the risk of mortality while smoking increases it. This model isolates the role of each variable while holding the other variable constant. You can assess the effect of coffee intake while controlling for smoking. Conveniently, you’re also controlling for coffee intake when looking at the effect of smoking.
Note that the study also illustrates how excluding a relevant variable can produce misleading results. Omitting an important variable causes it to be uncontrolled, and it can bias the results for the variables that you do include in the model. This warning is particularly applicable for observational studies where the effects of omitted variables might be unbalanced. On the other hand, the randomization process in a true experiment tends to distribute the effects of these variables equally, which lessens omitted variable bias.
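If you want to see omitted variable bias in action, here's a small simulation in the spirit of the coffee example. Every number in it is invented purely for illustration: smoking is built in as a confounder that raises both coffee intake and mortality risk, and omitting it flips the sign of coffee's coefficient.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000
smoking = rng.binomial(1, 0.5, n)                 # the confounder
coffee = 2 + 1.5 * smoking + rng.normal(0, 1, n)  # smokers drink more coffee
# True coffee effect is -0.3 (protective); smoking adds +2.0 to risk.
risk = 5 - 0.3 * coffee + 2.0 * smoking + rng.normal(0, 1, n)

# Misspecified model: smoking omitted, so coffee picks up its effect
# and the coefficient comes out positive (~ +0.18).
print(sm.OLS(risk, sm.add_constant(coffee)).fit().params)

# Correct model: smoking included, coffee's coefficient is near -0.3.
X = sm.add_constant(np.column_stack([coffee, smoking]))
print(sm.OLS(risk, X).fit().params)
```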
Related post: Confounding Variables and Omitted Variable Bias
How to Interpret Regression Output
To answer questions using regression analysis, you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values. When you have a low p-value (typically < 0.05), the independent variable is statistically significant. The coefficients represent the average change in the dependent variable given a one-unit change in the independent variable (IV) while controlling the other IVs.
For instance, if your dependent variable is income and your IVs include IQ and education (among other relevant variables), you might see output like this:
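(The original post displays a statistical software output table at this point. As a stand-in, here's a minimal sketch of fitting such a model in Python; the data are fabricated, with coefficients chosen to echo the interpretation below.)

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Fabricated illustration data -- not the post's actual output.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({"IQ": rng.normal(100, 15, n), "Education": rng.normal(14, 2, n)})
df["Income"] = 10 + 4.8 * df["IQ"] + 24.22 * df["Education"] + rng.normal(0, 50, n)

results = smf.ols("Income ~ IQ + Education", data=df).fit()
print(results.summary())  # coefficient estimates with p-values, like the table above
```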
The low p-values indicate that both education and IQ are statistically significant. The coefficient for IQ indicates that each additional IQ point increases your income by an average of approximately $4.80 while controlling everything else in the model. Furthermore, an additional unit of education increases average earnings by $24.22 while holding the other variables constant.
Regression analysis is a form of inferential statistics. The p-values help determine whether the relationships that you observe in your sample also exist in the larger population. I’ve written an entire blog post about how to interpret regression coefficients and their p-values, which I highly recommend.
Obtaining Trustworthy Regression Results
With the vast power of using regression comes great responsibility. Sorry, but that’s the way it must be. To obtain regression results that you can trust, you need to do the following:
- Specify the correct model. As we saw, if you fail to include all the important variables in your model, the results can be biased.
- Check your residual plots. Be sure that your model fits the data adequately.
- Watch for multicollinearity. Correlation between the independent variables is called multicollinearity. As we saw with the coffee example, some multicollinearity is OK. However, excessive multicollinearity can be a problem.
Using regression analysis gives you the ability to separate the effects of complicated research questions. You can disentangle the spaghetti noodles by modeling and controlling all relevant variables, and then assess the role that each one plays.
There are many different regression analysis procedures. Read my post to determine which type of regression is correct for your data.
If you’re learning regression and like the approach I use in my blog, check out my eBook!
First of all, many thanks for this fantastic website that makes statistics seem a little bit simpler and clearer. It’s a fantastic resource.
I have a dataset from an experiment. It has a dependent variable, choice reaction time (CRT), and an independent variable, visual task. (The visual task includes two types of tasks: cognitively involved questions and minimally cognitive questions. These questions come in three types according to the number of choices/options (2, 4, or 8), i.e., 1, 2, or 3 bits: questions with only two options to choose one answer, 4-option questions, and 8-option questions.)
First I used linear regression in SPSS to check the fit of the model (Hick's law). But unfortunately the R-squared value was very, very low.
Now, my professor is pushing me to build a new model using that dataset.
Please suggest some steps and hints so I can start working on it.
Hi Jim,
Following are my research objectives
a. To identify youth's competencies in entrepreneurship in the area.
b. To identify the factors of youth involvement in agricultural entrepreneurship in the area.
I have used opinion-based questions designed as 5-point Likert scale items, except for the demographic questions at the beginning of my survey. The questionnaire contains simple opinion-based questions; there are no dependent and independent items in the questionnaire.
My question is which analysis is suitable for my research?
Regression analysis, descriptive analysis, or both?
Hi Nik,
The question of whether there is a dependent variable and one or more independent variables is separate from the question of whether you need to use inferential or descriptive statistics. And regression analysis can be either a descriptive or inferential procedure. Although, it is almost always an inferential procedure. Let’s go through these issues.
If you just want to describe a sample and you’re not generalizing from the sample to a population, you’re performing descriptive statistics. In this case, you don’t need to use hypothesis testing and confidence intervals.
However, if you have a representative sample and you want to infer the properties of an entire population, then you need to perform hypothesis testing and look at confidence intervals. Read my post about the Difference between Descriptive and Inferential Statistics for more information.
Regression analysis can apply to either of these cases. You perform the same analysis but if you’re only describing the sample, you can ignore the p-values and confidence intervals. Instead, you’ll focus on using the coefficients to describe the relationships between the variables within the sample. There’s less to worry about but you only know what is happening within that sample and can’t apply the results to a larger population. Conversely, if you do want to generalize to a population, then you must consider the p-values and confidence intervals and determine whether the coefficients are statistically significant. Most analysts performing regression analysis do want to generalize to the population, making it an inferential procedure.
However, regression analysis does specify independent and dependent variables. If you don't need to specify those types of variables, then just use a correlation. Likert data is ordinal data. And for that data type, you need to use Spearman's correlation. And, like regression analysis, correlation can be either a descriptive or inferential procedure. You either pay attention to the p-values (inferential) or not (descriptive). In both cases, you are interested in the correlation coefficients. You'll see the relationships between the variables without needing to specify independent and dependent variables. You could calculate medians or modes for each item but not the mean because that's not appropriate for ordinal data.
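Here's a minimal sketch of Spearman's correlation in Python, with two made-up Likert items standing in for your survey data:

```python
from scipy.stats import spearmanr

# Two hypothetical Likert items (1-5 responses) from the same respondents.
item_a = [3, 4, 2, 5, 4, 3, 1, 2, 5, 4]
item_b = [2, 5, 2, 4, 4, 3, 2, 1, 5, 3]

rho, p = spearmanr(item_a, item_b)
print(f"Spearman's rho = {rho:.2f}, p = {p:.3f}")
# Descriptive use: report rho and skip p. Inferential use: check p too.
```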
I hope that helps!
Hi Jim,
Supposing I'm interested in establishing an explanatory relationship between two variables, profits and average age of employees, using regression analysis, and I have access to data from the entire population of interest, e.g., all 30 firms in a particular industry. Do I still need to perform statistical inference? What would be the meaning of p-values, F-tests, etc., given that I am not intending to generalize the results to firms outside the industry? Do I still need to perform a power analysis given that I have access to the entire population of 30 firms? Is the population of 30 firms too small for reliable statistical deductions? Thanks in advance, Jim.
Hi Patrick,
If you are truly interested in only those 30 companies and have access to data for all their employees, then you don’t need to perform inferential statistics. You’ve got the entire population. Hence, you know the population parameters. Hypothesis tests account for sampling error. But when you measure the entire population, there is zero sampling error and, hence, zero need to perform a hypothesis test.
However, if your average ages are based on only a sample of the employees in the 30 firms, then you’re still working with samples. To generalize from the sample to the population of all employees at the 30 firms, you’d need to use hypothesis testing in that case.
So, you just need to determine whether you really have access to the data for the entire population.
Hi, Following are my research objectives
a. To investigate the effectiveness of the asynchronous and synchronous modes of online education.
b. To identify challenges that both teachers and students encounter in the synchronous and asynchronous modes of online education.
I have used Pearson correlation to find the relationship of the effectiveness of the synchronous mode with the asynchronous mode and the challenges of the online mode, and vice versa.
I have used opinion-based questions designed as 5-point Likert scale items. The questionnaire contains simple opinion-based questions; there are no dependent and independent items in the questionnaire.
My question is whether correlation is sufficient or whether I have to run other tests to support my hypothesis.
Hi Aliya,
Because you have Likert scale data, you should use Spearman's correlation because that is more appropriate for ordinal data.
Another possibility would be to use a nonparametric test and evaluate the median difference between the asynchronous and synchronous modes of education for each item.
A scientist determined the intensity of solar radiation and temperature of plantains every hour throughout the day. He used correlation to describe the association between the two variables. A friend said he would get more information using regression. What are your views?
Hi Mary,
Yes, I'd agree that regression provides more information than correlation. But it's also important to understand how correlation and regression present effect sizes differently, because in some cases you might want to use correlation even though it provides less information.
Correlation gives you a standardized effect size (i.e., the correlation coefficient). Standardized effect sizes don’t provide information using the natural units of the data. In other words, you can’t relate a correlation coefficient to what’s going on with the natural data units. However, it does allow you to compare correlations between dissimilar variables.
Conversely, regression gives you unstandardized effect sizes in the coefficients. They tell you exactly what's going on between an independent variable and the dependent variable using the DV's natural data units. But it's harder to compare results between regression models with dissimilar DV units. Regression does have its own standardized measure of the overall strength of the model in R-squared, but that doesn't cover the individual variables. Additionally, in regression, you can standardize the regression coefficients, which facilitates comparisons within a regression model but not between them.
In some cases, while correlation gives you less information, you might want to use it to facilitate comparisons between studies.
Regression allows you to predict the mean outcome. It also gives you the tools to understand the amount of error between the predicted and observed values. Additionally, you can model a variety of different types of relationships (curvature and interactions). Correlation doesn't provide those.
So, yes, in general, regression provides more information, but it also provides a different take on the nature of the relationships.
I hope that helps!
Hey Jim.
First, congrats and many thanks on this wonderful website, which makes statistics look a bit easier and more understandable. It's a great resource, both for students and professionals. Thanks again.
A request for a bit of help, if you'd be kind enough to comment. I'm doing some research on the pharmaceutical industry, regulations, and their effects. I am looking at a) the probable effects (if any) of drug price increases on other consumption categories (like food and travel), and b) the effects of pricing regulations on drug shortages. For 'a', I've got inflation data and average consumption expense by quintiles. For 'b', I've got the last 6 years of data on drug shortages, mainly due to government-administered pricing. However, I'd need to show statistical significance (and, additionally, whether it could predict anything statistically significant about drug shortages in the future).
What kind of stat methodology would be appropriate in terms of ‘a’ and ‘b’? Would appreciate your help.
Thank you so much Sir.
Hello Mr. Jim,
Thank you very much for your opinion. Much helpful.
I've another case with 2 DVs and multiple IVs, and the scope is to determine the validity of the data. So for this case, can I run MANOVA as the regression analysis and look at the significance values and the null hypothesis for the validity test?
Hoping to hear from you soon.
Kind Regards,
A.Kaur
Hello Mr. Jim,
Thank you for your reply, Mr. Jim. My goal is to determine which approach best predicts the CRI measure.
CRI-I: Disaster Management Cycle (DMC) based approach (Variable: PP, RS, RC, MP-contain all indices according to its phases)
CRI- II: Sustainability based approach (Physical, Economy, Social-contain all indices according to its phases)
CRI-III: Overall indices of data (24 indices from all the listed variable)
I've chosen PP and MP as my DVs, and RS and RC as my IVs, since my goal focuses on DMC.
Hope I’m clear now. And hoping to hear from you soon Mr. Jim. Thank you.
Hi Reet,
One approach would be to fit a regression model for each approach and the DV. Then assess the goodness-of-fit measures. You’d be particularly interested in the standard error of the regression. This measure tells you how wrong the model is typically. You’d be looking for the model that produces the lowest value because it indicates it’s less wrong at predicting the outcome.
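A rough sketch of that comparison in Python, with placeholder data and the variable names from your description:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in data so the sketch runs; swap in your real indices.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 8)),
                  columns=["CRI", "PP", "RS", "RC", "MP", "Physical", "Economy", "Social"])

m1 = smf.ols("CRI ~ PP + RS + RC + MP", data=df).fit()           # DMC-based approach
m2 = smf.ols("CRI ~ Physical + Economy + Social", data=df).fit()  # sustainability-based

# S = sqrt(mean squared residual): the typical size of the prediction error
# in the DV's own units. Lower S suggests the less-wrong model.
for name, m in [("DMC", m1), ("Sustainability", m2)]:
    print(f"{name}: S = {np.sqrt(m.mse_resid):.3f}")
```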
Good day Mr. Jim,
I've decided to run regression analysis after the correlation test. My research is about the reliability and validity of a dataset for 3 approaches to a community resilience index (CRI): DMC-based, sustainability-based, and an overall-indices approach. So now, I'm literally confused about how to interpret the data with regression analysis. Can I use OLS and GLM to interpret the data?
3 approaches: 1:PP,RS,RC,MP {DMC}
2: PY,EC,SC {Sustainability}
3: Overall indices {24 indices}
For your information, all those approaches are proposed in 1 dataset that contains 24 indices. Also, I previously conducted a Likert questionnaire (5-point scale) to collect my data.
I hope my question is clear. Hoping to hear from you soon.
Hi Reet,
I'm sorry, but I don't completely understand the goal of your analysis. Are you trying to determine which approach best predicts sustainability? What are your IVs and DV? It wasn't totally clear from your description. Thanks!
Going through your blog gave me a good understanding of when to use regression analysis. Honestly, it's an amazing blog.
Thanks so much, Robin!
Hey Jim, thanks for all the information.
I would like to ask: are there any limitations to the multiple regression method? Is there another method in mathematics that can be more accurate than regression?
Sincerely,
Mythili
Hi Mythili,
There are definitely limitations for regression! That’s a broad question that could be answered with a book. But, a good place to start is to consider the assumptions for least squares regression. Click the link to learn more. You can think of those as limitations because if you violate the assumptions, you can’t necessarily trust the results! In fact, when you violate an assumption, you might need to switch to a different analysis or perform it a different way.
Additionally, the Gauss-Markov theorem states that least squares regression is the most efficient regression, but only when you satisfy those assumptions!
Hi Sir,
In regression analysis specifically multiple linear regression, should all variables (dependent and independent variables) be normally distributed?
Thank you,
Helena
Hi Helena,
In least squares regression analysis, you don't assess the normality of the variables. Instead, you assess the normality of the residuals. However, there is some connection, because if you have a dependent variable that follows a very non-normal distribution, it can be harder to obtain normal residuals. But it's really the residuals that you need to focus on. I discuss that in my article about the least squares (OLS) regression assumptions.
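If it helps, here's a minimal sketch of that residual check with a normal Q-Q plot; the model and data are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

# Stand-in data; use your own model here.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
df["y"] = 1 + 2 * df["x1"] - df["x2"] + rng.normal(size=100)

results = smf.ols("y ~ x1 + x2", data=df).fit()

# Assess the residuals, not the raw variables: points near the line
# suggest approximately normal residuals.
sm.qqplot(results.resid, line="45", fit=True)
plt.show()
```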
I hope that helps!
Hi Sir, I'm a senior high school student currently struggling with my quantitative research. As a statistician, what statistical treatment would you recommend for identifying an impact? To answer the question "What is the impact of the development of an educational brochure in minimizing cyberbullying in terms of: 3.1 Mental health 3.2 Self-esteem".
Waiting for your reply, desperate for answers lol
Jane
Hi Jim, thank you
So would you advise an ordinal regression or another type? I have a survey identifying whether they use the new social media, which will place them into 2 groups. Then I'll compare the 2 groups (1: use the new social media, 2: don't use it) with a control (FB use) to compare their happiness scores (obtained from a survey as well; higher score = happier). Would the conclusions I can draw be causal, or more an indication that, for example, the new users have lower happiness?
Also, is there a graph that can be drawn after a regression?
On a side note, when would it be advisable to do correlations? For example, have both groups complete the happiness score and conduct correlations for this, plus a regression to control for covariates? Or is this not statistically advisable?
Hi Sam,
I highly recommend you get my book about regression analysis because I think it would be really helpful with these nuts and bolts types of questions. You can find it in My Web Store.
As for the type of regression, as I mentioned, that depends largely on what you use for your dependent variable. If it’s a single Likert item, then you’d use ordinal logistic regression. If it’s the sum or average of multiple Likert items, you can often use the regular least squares regression. But, I don’t have a good handle on exactly how you’re defining your dependent variable.
There are graphs you can create afterwards to illustrate the results. I cover those in my book. I don’t have a good post to refer you to that shows them. Fitted line plots are good when you have simple regression (just one independent variable), but when you have more there are other types.
You can do correlations but be aware that they don’t control for other variables. If there are confounders, your correlations might exhibit omitted variable bias and differ from the relationships you’ll find in the regression model. Personally, I would just stick to the regression results because they control for confounders that you include in the model.
Hi, sorry, as you can tell I'm a little confused about what's best to do. Is it advisable to form 2 groups, users of the new social media and non-users of that new social media, then do a t-test to compare their happiness scores? Then have participants answer a Facebook use questionnaire to control for this by conducting a hierarchical regression where I enter it in, to identify how much of the variance is explained by Facebook use?
Many thanks
Hi Sam, you wouldn’t be able to do all of that with t-tests. I think regression is a better bet. You can still include an indicator variable to identify the two groups you mention AND include the controlling variables in that model. That way you can determine whether the difference between those two groups is statistically significant while controlling for the other IVs. All in one regression model!
Hi, I wanted to ask if regression is the best test for me. I am looking at happiness scores and time spent on a new social media site. Since other social media sites have a relationship with happiness, and people don't use just one social media site, I was going to control for this 'other social media' use. My 1st group would be users of the new social media site who also use Facebook, and the 2nd group would be Facebook users. They would complete a happiness questionnaire and a questionnaire about their time/use. Any advice is really appreciated.
I have read around and found partial correlations. Do you advise that? So instead, participants would complete a questionnaire on their use of this new social media, then also do a questionnaire on their Facebook use and a happiness questionnaire. I would do a partial correlation between the new social media app use and the happiness score, while controlling for Facebook use.
Thank you
Hi Sam,
This case sounds like a good time to use regression analysis. The type of regression depends largely on the nature of the dependent variable, which comes from your survey. Perhaps it's a Likert scale item? If it's a single item, that's an ordinal scale and you'd need to use ordinal logistic regression. If you're summing multiple items for the DV, you might be able to use regular linear regression. Ordinal independent variables are a bit problematic. You'd need to use them as either continuous or categorical variables. You'd include the questions about FB use to control for that.
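For the ordinal route, here's a hedged sketch using statsmodels' OrderedModel (still marked experimental in that library); all the variable names and data are invented placeholders:

```python
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Stand-in survey data; the DV is a single 1-5 Likert item.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "new_sm_hours": rng.uniform(0, 5, n),  # time on the new social media site
    "fb_hours": rng.uniform(0, 5, n),      # Facebook use, included as a control
})
latent = -0.4 * df["new_sm_hours"] - 0.2 * df["fb_hours"] + rng.normal(size=n)
df["happiness"] = pd.cut(latent, bins=5, labels=False) + 1  # crude 1-5 item

mod = OrderedModel(df["happiness"], df[["new_sm_hours", "fb_hours"]], distr="logit")
res = mod.fit(method="bfgs", disp=False)
print(res.summary())
```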
I hope that helps!
Thank you very much for your answer,
I understand your point of view. However, that dataset consists of the companies investing the largest sums in R&D, not the companies with the best results. Some of them even show a loss of operating profit. Is that still a factor biasing my results?
Have a nice day,
Natasha
Thank you, it was very useful.
Hi Jim,
I am working on my thesis, which is about evaluating the motivation of firms to invest in R&D for new products. I am specifically interested in the automotive sector. I have data on the R&D rankings of the world's top 2500 companies (by industry), which consists of their R&D expenses (also R&D one-year growth), net sales (also one-year growth), R&D intensity, capex, operating profit (also one-year growth), profitability, employees (also one-year growth), and market cap (also one-year growth).
My question is: which type of analysis would you recommend to fulfill the topic requirements?
Hi Natasha,
You could certainly use regression analysis to see which variables relate to R&D spending.
However, be aware that by using that list of companies, you are potentially biasing your results. For one thing, it's a list of top R&D companies, and you'd certainly want more of a mix of companies across the full range of R&D. You can learn from those who weren't so good at R&D too. Also, by using a list of the top R&D companies, you'll introduce some survival bias into the results because these are companies that made it and made it big (presumably). Again, you'd want a mix of companies that had varying degrees of success and even some failures! If you limit your data to top companies, and particularly top companies in R&D, you'll limit how much you can learn. You might still be able to learn some, but just be aware that you're potentially biasing your results.
I hope that helps!
Hi Mr. Jim! Thank you so much for your response. Well appreciated!
You’re very welcome, Violetta!
Hi! I'm currently doing my research paper, and I am confused about whether I can use regression analysis, since my title is "New Normal Workplace Setting towards Employee's Engagement with their Workloads".
For the moment, I have used a correlational approach since it deals with the relationship of two variables. But I'm still confused about what would be best for my research. Hope I can get a response soon. Thank you so much!
Hi Violetta,
If you're working with just two variables, you have a choice. You can use either correlation or regression. You can even use both together! It depends on the goals of your research. Correlation coefficients are standardized measures of effect size, while regression coefficients are unstandardized effect sizes. I write about the difference between standardized and unstandardized effect sizes. Click the link to read about that. I discuss both correlation and regression in that context. It should help you decide what is best for your research goals.
I hope that helps!
Hi Jim,
I am undertaking an MSc dissertation and would like to ask some questions on analysis, please.
The research is health related and I am looking at determinants of outcome.
I have 5 continuous independent variables, and I would like to know if they have an association with the outcome of a treatment. They involve age, temperature, and blood test values. The dependent variable is binary: the treatment either was successful or was not.
I am looking to do a logistic regression analysis.
Questions I have:
1. Do I first need to do tests to find out whether each variable is statistically significant before I do the regression analysis, or can I go straight in?
2. If so, will I need to carry out tests to find out if I have skewed data in order to know whether I need to do parametric or nonparametric tests?
Thank you.
Hi Lucki,
You should go in with a bunch of theory and background knowledge about which independent variables you should include. Look to other research studies for guides. When you have a set of IVs identified, it's usually OK to include them all and see what's significant. An important caveat: if you have a small number of observations, you don't want to overfit your model. However, statistical significance shouldn't be your only guide for which variables to include and exclude.
To learn more about model specification, read my post about specifying your regression model. I write about it in the context of linear regression rather than binary logistic regression, but the ideas are the same.
In terms of the distribution of your data, typically, you assess the residuals rather than the data itself. Usually, you can assess the residual plots.
Hi Jim,
Treating both ordinal variables as continuous seems to solve my problems with non-mutually exclusive levels that arise if I enter the variables as categorical. My main concern is to look at each variable as a whole, not by its levels, so it might be what I need; the measurement ranges were based on an established rating system and don't carry any weight for my analysis. Though I'll have to look more into it, as well as the residual plots etc., before deciding. Thank you for highlighting this option!
Is it correct if I assign the numerical value to the levels like this? 1 to 5, from lowest to highest.
Spacing
1: less than 60mm
2: 60-200mm
3: 200-600mm
4:0.6-2m
5: more than 2m
length
1: less than 1m
2: 1-3m
3: 3-10m
4: 10-20m
5: more than 20m
As for the data repetition, what I meant was, say the data for Site A is:
Set 1 (quantity: 25) SP3 PER5
Set 2 (quantity: 30) SP4 PER6
set 3 (quantity: 56) SP2 PER3
So in the data input, I entered set 1's data 25 times, set 2's data 30 times, and set 3's data 56 times. From what I have gathered from fellow students and my lecturer, it is correct, but I'd like confirmation from a statistician. Thanks again!
Hi Jim,
I'm sorry, again the levels disappeared. Maybe because I used (>) and (<), it's messing up the coding of the comment.
spacing levels:
SP1: less than 60mm
SP2: 60-200mm
SP3: 200-600mm
SP4:0.6-2m
SP5: more than 2m
length level:
PER1: more than 20m
PER2: 10-20m
PER3: 3-10m
PER4: 1-3m
PER5: less than 1m
Spacing and length were recorded as ranges since they were estimated rather than measured individually, as it'd take too much time to measure each one (1 set of cracks may have at least 10 cracks, some can reach 50 or more, and the measurements are not exactly the same between cracks belonging to the same set).
I've input the dummies as in my previous reply when running the model, though the resulting equation I've provided does not include length. Can an ordinal variable be converted to/treated as a continuous variable?
Also, since each set has its own quantity, I repeated the data in the input according to the quantity. Is that the right way of doing it?
Thanks!
Hi Lyana,
Technically, those are ordinal variables. I write about this in more detail in my book about regression analysis, but you can enter these variables as either continuous variables (if you assign a numeric value to the groups) or as categorical variables. If you go the categorical route, you'll need to use the indicator variable scheme and leave out a reference level, as we discussed. The approach you should use depends on a combination of your analysis goals, the nature of your data, and the ability to adequately fit the model (i.e., the properties of the residual plots).
I don’t exactly know what you mean by “repeated the data in the input.” However, you have levels for each categorical variable. Let’s use the lowest level for each variable as the reference level. Here’s how you’d use indicator variables to include both categorical variables in your model (some statistical software will do that for you behind the scenes).
Spacing variable:
Leave out SP1. It’s the reference.
Include an indicator variable for:
SP2
SP3
SP4
SP5
Length Variable:
Leave PER5 out as reference.
Include indicator variables for:
PER1
PER2
PER3
PER4
And just code each indicator variable appropriately based on the presence or absence of the corresponding characteristic. All zeros in a set of indicator variables for a categorical variable represent the reference level for that categorical variable.
As you can see, you’ll need to include many indicator variables (8), which is a drawback of entering them as categorical variables. You can quickly get into overfitting your model.
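For what it's worth, here's a small sketch of that indicator coding in Python with pandas; the column names are placeholders, and most statistical software does the equivalent behind the scenes:

```python
import pandas as pd

# Stand-in crack-set data; one row per observation.
df = pd.DataFrame({
    "spacing": ["SP1", "SP3", "SP4", "SP2", "SP5"],
    "length": ["PER5", "PER3", "PER4", "PER1", "PER2"],
})

# drop_first=True leaves out each variable's first category as its reference
# level. For spacing that's SP1; for length we reorder the categories so that
# PER5 (the reference level used above) is the one left out.
df["length"] = pd.Categorical(df["length"],
                              categories=["PER5", "PER1", "PER2", "PER3", "PER4"])
X = pd.get_dummies(df, drop_first=True, dtype=int)
print(X)  # all zeros within a variable's indicator set = its reference level
```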
I’m sorry I had just noticed that the levels are missing
SP1: less than 60mm … SP5: more than 2m
For my case, I'm studying the crack sets on a rock face, and I have two independent categorical variables (spacing and length) that have 5 levels of measurement ranges each. The dependent variable is the blasted rock size, i.e., I want to know how the spacing and length of the existing cracks on a rock face would affect the size of the blasted rocks.
E.g., for spacing: SP1 = less than 60mm … SP5 = more than 2m
I’ve coded the levels to run the regression model into:
SP1 SP2 SP3 SP4
SP1 1 0 0 0
SP2 0 1 0 0
SP3 0 0 1 0
SP4 0 0 0 1
SP5 0 0 0 0
From the coding (leaving SP5 out as the reference level) above, after running the model, I have obtained the equation:
Blasted rock size (mm) = 1849.146 + 332.224SP1 + 137.624SP2 – 115.268SP3 – 103.604SP4
1 rock slope could consist of 2 or more crack sets, hence the situation where more than 1 level of spacing and length can be observed. As an example, rock face A consists of 3 crack sets, with set #1 having SP1, set #2 SP3, and set #3 SP4. To predict the blasted rock size for rock face A using the equation, I'll have to insert "1" for SP1, SP3, and SP4, which is actually the wrong way of doing it since they are not mutually exclusive? Or can I calculate each crack set separately using the same equation and then average the blasted rock sizes for these 3 crack sets?
From the method in your explanation, does this mean that I’ll have to separate each level into 10 different variables and code them as 1=yes and 0=no? If so, for spacing, will the coding be
SP1 SP2 SP3 SP4 SP5
SP1 1 0 0 0 0
SP2 0 1 0 0 0
SP3 0 0 1 0 0
SP4 0 0 0 1 0
SP5 0 0 0 0 1
in the input table, which would be similar to the initial one except with SP5 included? But if I were to include all levels when running the model, SPSS would automatically exclude 1 level, since I ran several rock faces (belonging to a single location) in a model, so all levels of spacing and length are present in the data set.
The other way I can think of is to create interactions for all possible combinations and dummy code them, but wouldn't that end up with a super long equation?
I’m sorry for imposing like this but I couldn’t grasp this problem on my own. Your help is very much appreciated.
Hi Lyana,
Ah, OK, it sounds like you have two separate categorical variables. In that case, for each observation, you can have one level for each variable. Additionally, for each categorical variable, you'll leave out one level as its own reference level.
I do have a question. Spacing and length sound like continuous measurements. Why are you including them as categorical variables? There might be a good reason, but it almost seems like you could include them as continuous predictors. Perhaps you don't have the raw measurements, but instead they're in groups? In that case, they might actually be ordinal variables. You can include ordinal variables as categorical variables. But sometimes they'll still work as continuous variables.
I see, sorry I couldn't fully understand your previous reply before this; thanks for the clarification. However, I am dealing with a situation where 2 or more levels of a variable could be observed simultaneously. Is it theoretically right to use dummies, or is there another method around it?
thanks!
Hi Lyana,
That sounds like you're dealing with more than one variable rather than one categorical variable. Within an individual categorical variable, the levels of the variable are mutually exclusive. In your case, you need to sort out which categorical variables you have and be sure that the levels are mutually exclusive. If you're looking at the presence and absence of certain characteristics, you can use a series of indicator variables. If these characteristics are not the mutually exclusive levels of a single categorical variable, you don't use the rule about leaving one out.
For example, in a medical setting, you might include characteristics of a patient using a series of indicator variables: gender (1 = female 0 = male), high blood pressure (1 = Yes, 0 = No), On medication, etc. These are separate characteristics (not part of one larger categorical variable) and you can just include an indicator variable to indicate the presence or absence of that characteristic.
Perhaps that it what you need? But be aware that what you describe with multiple levels possible does not work for a single categorical variable. But the method I describe might be what you need if you’re talking about separate characteristics.
Thank you , sir
Thanks for the answer Jim,
Does that mean the predicted value when both L4 and L1 are observed is the same as when only L1 is observed without L4? (Y = 133)
thanks again!
The groups must be mutually exclusive. Hence, an observation could not be in both L1 and L4.
Hi Jim,
I have a question regarding dummy coding of categorical variables; I can't seem to find any post about this topic. Hope you don't mind me asking here.
I ran a regression model with a categorical variable containing 4 levels, using the 4th level as the reference group. That means the equation will only contain levels 1 to 3, since level 4 is the reference. Say the equation is Y = 120 + 13L1 – 6L2 + 15L3; to predict Y with L4, I'll have Y = 120, right?
My question is: what if I want to predict Y when there is L1 but no L4? If I calculate Y = 120 + 13L1, would that mean I am including L4 in the equation, or am I wrong about this?
Thank you in advance.
Hi Lyana,
I cover how this works in my book about regression analysis. If you’re using regression for a project, you might consider it.
It sounds like your approach is correct. You always leave one level out for the reference group. And, yes, given your equation, the predicted value for level 4 is 120.
For observations where the subject/item belongs to group 1, your equation stays the same, but you enter a 1 for L1 and 0s for L2 and L3. Hence, the predicted value is 133. In other words, you don't change the equation given the level; you change the X values in the equation. When an observation belongs to group 4, you'll enter 0s for L1, L2, and L3, which is why the predicted Y is 120. For a given categorical variable, you'll only enter a single 1 for observations that belong to a non-reference group, and all 0s for observations belonging to the reference group. But the equation stays the same in all cases. I hope that makes sense!
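The same arithmetic as a tiny sketch, purely to illustrate that only the indicator values change:

```python
# The fitted equation from the question; the indicators are the only inputs
# that change between groups.
def predict_y(L1, L2, L3):
    return 120 + 13 * L1 - 6 * L2 + 15 * L3

print(predict_y(1, 0, 0))  # group 1: 133
print(predict_y(0, 1, 0))  # group 2: 114
print(predict_y(0, 0, 0))  # group 4 (reference, all zeros): 120
```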
Hi Jim,
May I just ask if there is a difference between a true and simple linear regression model? I can only think that their difference is the presence of a random error. Thanks a lot!
Hi Anthony,
I've never heard the dichotomy stated as true vs. simple linear regression. I take 'true model' to refer to the model that is correctly specified for the population. A simple regression model is just one that has a single predictor, whereas multiple regression has more than one predictor. The true model has as many terms as are required, which includes predictors and other terms that fit curvature and interactions as needed.
Hi Jim,
I find your explanation to questions very good and so important. Thanks for that.
Please, I need your help with my thesis work. My question is: if, for example, I want to measure the level of resilience capacity in a company's safety management system, what tool would you advise? Regression or another one?
Thanks
Kwame
Hi Kwame,
The type of analysis you use depends on the data you collect as well as a variety of other factors. The answer is entirely specific to your research question, field of study, data, etc. After you make those determinations, you can begin to figure out which type of analysis to use. I recommend researching your study area to answer all of those questions, including which type of analysis to use. If you need help after you start developing the answers to the preliminary questions, I'd be able to provide more input.
Also, I really recommend reading my post about designing a study that includes statistical analyses. That’ll help you understand what type of information you need to collect and questions you need to answer.
Thank you so much for your answer, Jim!
Hello Jim, I have a question. I have one independent variable and two dependent variables. I will explain the case before asking my question. I obtained the data for the independent variable using a questionnaire, and one of my dependent variables also comes from a questionnaire. But the other dependent variable, my second variable, comes from an official website, which is secondary data, unlike the other variables. So my question is: is it okay to use regression analysis to analyze these three variables, or do I have to use another statistical analysis that suits these variables best? Thanks in advance.
Hi Cindy,
Most forms of regression analysis allow you to use one dependent variable and multiple independent variables. Because you have two dependent variables, you’ll need to fit two regression models, one for each dependent variable.
In regression, you need to be able to tie together all corresponding values of an observation for the dependent variable and the independent variables. We'll use an example with people. To fit a regression model, for each person, you'll need to know their values for the dependent variable and all the independent variables in the model. In your case, it sounds like you're mixing data from an official website and a survey. If those data sources contain the same people and you can link their values as described, that can work. However, if those data sources have different people, or you can't link their scores, you won't be able to perform regression analysis.
Hi Jim, if you’ve got three predictors and one dependent variable, is it ever worth doing linear regression on each individual predictor beforehand or should you just dive into the multiple regression? Thanks a lot!
Hi Kristian,
You should probably just dive right into multiple regression. There’s a risk of being misled by starting out with regressions with individual predictors. It’s possible that omitted variable bias can increase or decrease the observed effect. By leaving out the other predictors, the model can’t control for them, which can cause that bias.
However, that said, it’s often a good idea to graph the relationship between pairs of variables using scatterplots to get an idea of the nature of each relationship. That’s a great place to start. Those plots not only reveal the direction of the relationship but also whether you need to model curvature.
I’d start with graphs and then try modeling with all the variables. You can always remove insignificant variables.
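A minimal sketch of that graph-first step; the data frame and column names are generic placeholders:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in data; replace with your three predictors and DV.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=80), "x2": rng.normal(size=80),
                   "x3": rng.normal(size=80)})
df["y"] = 2 * df["x1"] + df["x2"]**2 + rng.normal(size=80)

fig, axes = plt.subplots(1, 3, figsize=(12, 4), sharey=True)
for ax, col in zip(axes, ["x1", "x2", "x3"]):
    ax.scatter(df[col], df["y"], alpha=0.6)  # look for direction and curvature
    ax.set_xlabel(col)
axes[0].set_ylabel("y")
plt.show()
```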
Hi Jim,
do you think it is correct to estimate a regression model based on historical data as Y=aX+b
and then use the model for the forecast as Y=aX?
Would this be biased?
if the variables involved are growth rates, would it be preferable to directly estimate the model without the intercept?
Thank you in advance
Stefania
Hi Stefania,
The answer to that question depends on a very close understanding of the subject area. However, there are very few cases where fitting a model without a constant is advisable. Bias would be very likely. Read my article about the y-intercept, where I discuss this issue specifically.
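To illustrate the risk, here's a small sketch that fits both versions on fabricated data with a true intercept of 5; the no-constant slope absorbs the missing intercept and is biased:

```python
import numpy as np
import statsmodels.api as sm

# Fabricated data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, 200)
y = 5 + 2 * X + rng.normal(0, 1, 200)

with_const = sm.OLS(y, sm.add_constant(X)).fit()  # Y = b + aX
no_const = sm.OLS(y, X).fit()                     # Y = aX, forced through the origin

print(with_const.params)  # close to [5, 2]
print(no_const.params)    # slope biased upward to absorb the missing intercept
```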
Nice article. Thank you for sharing.
If your outcome variable is pass or fail, then it is binomial logistic. My undergrad thesis was on this topic. Maybe I can offer some help, as this topic is of interest to me. Azad ([email protected])
Sir, what is Cox regression analysis?
Hi Jim,
A friend recommended your help with a stats question for my dissertation. I am currently looking at data regarding pass rates and student characteristics. I have collected multiple data points. One example is student pass rate (pass or fail) and observation hours (a continuous variable, 0-1000). Would this be a binomial logistic regression? Can that be performed in Excel?
Additionally, I am looking at pass rates in relation to faculty characteristics. Another example is pass rate (a percentage out of 100%, maybe continuous data 0-100) and categorical data (level of degree: bachelor's, master's, doctorate). Additionally, for pass rate (percentage out of 100) and the ratio of faculty to students within the classroom (continuous data), which test would be appropriate for this type of data comparison? Linear regression?
Thanks for your guidance!
Hi Jim. Concepts were well explained. Thank you so much for making this content available.
I have the data of mortgage loan customers who are currently in default. There are various parameters for why the default would have happened, but predominantly there are two factors where we could have gone wrong while sanctioning the loan: underwriting the loan (credit risk) and/or the property valuation (technical risk). I have data on the sub-parameters coming under credit and technical risk at the point of sanction.
Now I want to arrive at an output showing where I predominantly went wrong: technical risk, credit risk, or both. Which model of regression analysis can help in solving this?
Dear sir,
I'm currently a final-year undergraduate in a BSc Radiography degree, so I chose risk estimation of cardiovascular diseases using several risk factors via regression analysis as my undergraduate research.
I want to predict a percentage value for my cardiovascular risk estimation as a dependent variable using regression analysis.
How can I do that, sir? I'm very pleased to have your answer, sir.
Thank you very much.
Hi, it sounds like you might need to use binary logistic regression. If your dependent variable indicates the presence or absence (i.e., a binary outcome measure) of a cardiovascular condition, binary logistic regression will predict the probability of having that condition given the values of your independent variables.
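A minimal sketch of what that looks like in Python's statsmodels; the patient data and risk factor names are fabricated placeholders, not a recommendation for your model:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Fabricated patient data, purely for illustration.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.uniform(30, 80, n),
    "smoker": rng.binomial(1, 0.3, n),
})
logit_p = -8 + 0.1 * df["age"] + 1.2 * df["smoker"]
df["condition"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

model = smf.logit("condition ~ age + smoker", data=df).fit()
print(model.summary())

# Predicted probability of the condition, expressed as a percentage.
new_patients = pd.DataFrame({"age": [45, 70], "smoker": [0, 1]})
print(model.predict(new_patients) * 100)
```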
Hi Jim
Thank you for all the information on your page. I am currently beginning to get into statistics and wanted to ask your advice about something.
I am a business analyst with MI skills, building dashboards etc. and using sales data and KPIs.
I am wondering, for regression, would a good independent variable be the significance of a salesperson's sales performance over the team's total sales performance, or am I on the wrong track with that?
Dear Jim… I am a first-year MBA student with very little exposure to research. Please have patience and explain to me whether I can use regression to determine the impact of a variable on a 'construct'.
Hi Jim,
which criteria does an independent variable need to meet in order to use it in a regression analysis?
How do you deal with data that does not meet these requirements?
Hi Jiren,
I recommend you read my post about specifying the correct regression model. That deals directly with which variables to include in the model. If you have further questions on the specifics, please post them in the comments section there.
Hi Jim,
How should we interpret a factor A that becomes non-significant when fitted together with factor B in a model? Can I conclude that factor B incorporates factor A and just ignore the effect of factor A?
Hello Mr.Jim and friends,
I have one dependent variable Y and six independent variables X1…X6. I have to find the effect of all the independent variables on Y, specifically X6, to check whether it is effective or not.
1) Can I use OLS regression?
2) Which other tests do I need to do before or after the regression analysis?
Hi,
If your dependent variable is continuous, then OLS is a good place to start. You’ll need to check the OLS assumptions for your model.
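A bare-bones sketch of that model in Python's statsmodels, using the X1-X6 and Y names from your question with stand-in data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Stand-in data; replace with your real X1-X6 and Y.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 6)), columns=[f"X{i}" for i in range(1, 7)])
df["Y"] = 1.5 * df["X6"] + 0.5 * df["X1"] + rng.normal(size=200)

results = smf.ols("Y ~ X1 + X2 + X3 + X4 + X5 + X6", data=df).fit()
print(results.summary())       # full coefficient table
print(results.pvalues["X6"])   # X6's significance while controlling X1-X5
```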
Good, very explicit processes.
Jim,
I hope this comment reaches you in good health as we are living in some pretty tough times right now. Also, thank you for building this website as it is an excellent resource for novice statisticians such as myself. My question has to do with the first paragraph of this post. In it you state,
“Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.”
Is it possible to use regression analysis to produce a regression equation when you have two independent variables and two dependent variables? Also, while I hopefully have your attention, would I need to do the regression analysis twice (once for each dependent variable versus the independent variables)?
Hi Damian,
Typically, you would fit separate regression models for each dependent variable. There are a few exceptions. For example, if you use multivariate ANOVA (MANOVA), you can include multiple dependent variables. If those DVs are correlated, using MANOVA provides some benefits. You can include covariates in the MANOVA model. For more information, read my post about MANOVA.
In my study, I intervened with an instructional practice. My intervention has 4 independent variables (A, B, C, and D). In the literature, each subskill can be graded alone, and we can get one whole score.
In the literature, the effect of the intervention is holistic (A, B, and C together predict the performance on D).
So, I conducted a multiple regression (enter method) before and after the intervention, where the individual scores of A, B, and C were added as predictors of D.
I added Group (Experimental vs. Control) to remove any baseline difference between the experimental and control groups. No significant effect was noticed except for the individual scores of A and C on D. The model had a weak fit.
However, after the intervention, I repeated the same regression. The group (Experimental vs. Control) was the best predictor. No significant effect of A was noticed, but significant effects of B and C were noticed.
—
How do you think I can interpret the change in the significance value of A? It is relevant in the literature, but after the intervention it was not significant. Does the significance have to do with the increase in the significance of the Group?
I’d like to ask a question that builds on your example of income regressed on IQ and education. In the dataset I am sure there would be a range of incomes. Let’s say you want to find ways to bring up the low income earners based on the data from this regression.
Can I use the coefficients from the regression to guide ideas on how to improve the lower income earners as an estimate of how much improvement would be expected? For example, if I take the lowest earner and find that he is also below average in IQ and education, could I suggest that he gets another degree and try to improve IQ test results to potentially gain $X (n*IQ + m*Edu) in income?
This example may not be strictly usable because I imagine there are many other factors for income. Assuming that we are confident that we’ve captured most of the variables that affect income, can the numbers be used in this way?
If this is not an appropriate application, how would one go about this? Thanks.
Hello
I am completing a reflection paper for Math 221. I work in a call center; can I use a regression analysis for this type of work?
Hello Jim,
I am a total novice when it comes to statistics. My challenge is that I am working on the relationship between the population growth of a town and the class sizes of secondary schools in that same town (about 10 schools) over a period of years (2008-2018). Having gathered my data, I don't know what to use to analyze it to show this relationship.
Thanks
Hi Jim!
I'm just a student who's trying to finish her science investigation 🙂
but I have a question.
What is linear regression and how do we know if this method is appropriate for our data?
Hi Marlene,
I think this blog post describes pretty well when to use regression analysis generally. Linear regression analysis is a specific form of regression. Linear refers to the form of the model, not whether it can fit curvature. I talk about this in my post about the differences between linear and nonlinear regression. I always suggest that you start with linear regression because it's an easier analysis to use. However, sometimes linear regression can't fit your data. It can fit curvature in your data, but it can't fit all types of curves. Nonlinear regression is more flexible in the types of curves it can fit.
As for determining whether linear regression is appropriate for your data, you need to see if it can provide an adequate fit to your data. To make that determination, please read my posts about residual plots because that’s how you can tell.
Best of luck with your research!! 🙂
Hello Jim, thank you for this wonderful page. It has enlightened me about when to use regression analysis. However, I am a complete beginner at using SPSS (and statistics at that), so I am hoping you can help me with my specific problem.
I intend to use a linear regression analysis. My dependent variable is continuous, I would think, or perhaps ordinal (the data was obtained through a 5-point Likert scale). I have two independent variables (also obtained through 5-point Likert scales). However, I also intend to use 7 control variables, and this is where my problem lies. My control variables are all (I think) nominal (or is that called categorical in statistics?). They are as follows:
Age – 4 categories
Gender – 2 categories
Marital Status – 4 categories
Education level – 11 categories
Household income – 4 categories
Nationality – 4 categories
Country of origin – 9 categories
Do I input these control variables as they are? Or do I have to do something beforehand? I have heard about creating dummy variables. However, if I create dummy variables for each control variable, won't I end up with many variables?
Please give me some advice regarding this. I have been stuck in this process for a while now. I look forward to hearing from you, thanks.
Hi Jinky,
There are several issues to address in your questions. I’ll provide some information. However, my regression ebook goes it into the details much further. So, I highly recommend you get that.
In terms of the dependent variable, the answer is clear. Likert scale data, if it's the actual values of 1, 2, 3, 4, and 5, is ordinal data and is not considered continuous. You'll need to use ordinal logistic regression. If the DV is an average of multiple Likert score items for each individual, so an individual might have a 3.4, that is continuous data and you can try using linear least squares regression.
Categorical data and nominal data are the same. There are different naming conventions, but they are synonyms.
For categorical data, it's true that you need to recode them as indicator variables, though most software will do that automatically behind the scenes. As you noticed, however, the recoding (even if your software does it for you) can involve creating many indicator variables (dummy variables), particularly when you have many categorical variables and/or many levels within a categorical variable. That can use up your degrees of freedom! My ebook covers this in more detail.
For the Likert IVs: again, if each is an average of multiple Likert items, you can probably include it as a continuous variable. However, if it's the actual Likert values of 1, 2, 3, 4, and 5, you'll need to decide whether to include it as a continuous or categorical variable. There are pros and cons to both approaches, and the best answer depends on both your data and your goals. My ebook describes this in more detail.
Yes, as a general rule, you want to include your control variables and IVs that you are specifically testing. Control variables are just more IVs, but they’re usually not your main focus of study. You include them so that you can account for them while testing your main variables of interest. Excluding relevant IVs that are significant can bias the estimates for the variables you’re interested in. However, if you include control variables and find they’re not significant, you can consider removing them from the model.
So, those are some pointers to start with!
Hi Jim and everyone!
I'm starting some statistical analysis, and this has been really useful. I have a question regarding variables and samples.
I need to see if there is any relationship between days of the week and the number of robberies. I already have the data, but I wonder: if my variables (# of robberies on each day of the week (independent) and # of total robberies (dependent)) come from the same data sample, can that be a problem?
Thanks!
Thank you Jim this was really helpful
I have a question
How do you interpret an independent variable, let's say AGE, with categories that are insignificant?
For example, I ran the regression analysis for the variable age with categories.
Age as a whole was found to be significant, but there appears to be insignificance within the categories. It was as follows:
Age = 0.002
<30 years = 0.201
30-44 years = 0.161
45+ (ref cat)
I had another scenario:
Occupation = 0.000
Peasant farmers = 0.061
Petty businessmen = 0.003
Other occupation (ref cat)
My research question was "What are the effects of socio-demographic characteristics on men's attendance at education classes?"
I failed to interpret these results. Kindly help.
Hi Lisa,
For categorical variables, the linear regression procedure uses two tests of significance. It uses an F-test to determine the overall significance of the categorical variable across all its levels jointly. And, it uses separate t-tests to determine whether each individual level is different from the reference level. If you change the reference level, it can change the significance of t-tests because that changes the levels that the procedure directly compares. However, changing the reference level won’t change the F-test for the variable as a whole.
In your case, I’m guessing that the mean for <30 is on one side (high or low) compared to the reference category of 45+ while the mean of 30-44 is on the other side of 45+. These two categories are not far enough from 45+ to be significant. However, given the very low p-value for age, I'd guess that if you change the reference level from 45+ to one of the other two groups, you'll see significant p-values for at least one of the t-tests. The very low p-value for Age indicates that the means for the different levels are not all equal. However, given the reference level, you can't tell which means are different. Using a different reference level might provide more meaningful information.
For occupation, the low p-value for the F-test indicates that not all the means for the different types of occupations are equal. The t-test results indicate that the difference in means between petty businessmen and other (reference level) is statistically significant. The difference between peasant farmers and the reference category is not quite significant.
You don't include the coefficients, but those would indicate how those means differ.
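To make this concrete, here's a hedged Python sketch with statsmodels (entirely hypothetical data and variable names) showing both tests: the t-tests against a chosen reference level and the overall F-test for the variable:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical outcome and age-group factor
df = pd.DataFrame({
    "attendance": [4, 6, 5, 9, 8, 10, 3, 5, 4, 8, 9, 7],
    "age": ["<30", "30-44", "45+"] * 4,
})

# Treatment() sets the reference level; each t-test in the summary compares
# one of the remaining levels against that reference.
model = smf.ols("attendance ~ C(age, Treatment(reference='45+'))", data=df).fit()
print(model.summary())

# The joint F-test for the factor as a whole; it does not change when you
# switch the reference level.
print(sm.stats.anova_lm(model, typ=2))
```

Refitting with reference='<30' changes which pairwise t-tests you see, but not the F-test row for age.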
Because you're using regression analysis, you should consider getting my regression ebook. I cover this topic, and others, in more detail in the book.
Best of luck with your analysis!
Hi Jim, I have followed your discussion, and I want to know if I can apply this analysis in a case study.
Hi Jim
I really appreciate your expertise in regression analysis.
Please, would you help with the steps to draw a single fitted line for several IVs, say five, against a single DV?
With regards
Hi Nigatu,
It sounds like you're dealing with multiple regression because you have more than one IV. Each IV requires an axis (or dimension) on a graph. So, for a two-dimensional graph, you can use the X-axis (horizontal) for the IV and the Y-axis for the DV. If you have two IVs, you could theoretically show them as a hologram in three dimensions: two dimensions for the IVs and one for the DV. However, when you get to three or more IVs, there's just no way to graph them! You'd need four or more dimensions. So, what can you do?
You can view residual plots to see how the model with all 5 IVs fits the data. And, you can predict specific values by plugging numbers into the equation. But you can’t graph all 5 IVs against the DV at the same time.
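Here's a minimal Python/statsmodels sketch (simulated data, made-up variable names) of that prediction step with five IVs:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: one DV and five IVs
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(50, 6)),
                  columns=["y", "x1", "x2", "x3", "x4", "x5"])

model = smf.ols("y ~ x1 + x2 + x3 + x4 + x5", data=df).fit()

# "Plugging numbers into the equation": predict the DV at specific
# values of all five IVs at once.
new_point = pd.DataFrame({"x1": [0.5], "x2": [-1.0], "x3": [0.0],
                          "x4": [1.2], "x5": [0.3]})
print(model.predict(new_point))
```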
You could graph them individually. Each IV by itself against the DV. However, that approach doesn’t control for the other variables in the model and can produce biased results.
The best thing you can do to show the relationship between an individual IV and the DV while controlling for all the variables in a model is to use main effects plots and interaction plots. You can see interaction plots here. Unfortunately, I don't have a blog post about main effects plots, but I do write about them in my ebook, which I highly recommend you get to understand regression! Learn more about my ebook!
I hope this helps!
Many thanks. I appreciate it.
Hello Jim,
I stumbled across your website in hopes of finding an answer to a couple of questions regarding the methodology of my political science paper. If you could help, I would be very grateful.
My research question is “Why do North-South regional trade agreements tend to generate economic convergence while South-South agreements sooner cause economic divergence?”. North = OECD developed countries and South = non-OECD developing countries.
This is my lineup of variables and hypotheses:
DV: Economic convergence between country members in a regional trade agreement
IV1: Complementarity (differentness) of relative factor abundance
IV2: Market size of region
IV3: Economic policy coordination (Harmonization of Foreign Direct Investment (FDI) policy)
H1: The higher the factor endowment difference between countries, the greater the convergence
H2: The larger the market size, the greater the convergence
H3: The greater the harmonization of FDI policies, the greater the convergence
I am not sure what the best methodological approach is. I will have to take North-South and South-South groups of countries and assign values for the groupings. I want to show the relationship between the IVs and DV, so I thought to use a regression. But there are at least two issues:
1. I feel the variables are not appropriate for a time series, which is usually used to show relationships. This is because, e.g., the market size of a region will not change with time. Can I forgo a time series and still have meaningful results?
2. The IVs are not completely independent of one another. How can I work with that?
Also, what kind of regression would be most appropriate in your view?
Many sincere thanks in advance.
Irina
Hi Irina,
I'm not an expert in that specific field, so I can't give you concrete advice, but here are some things to consider.
The question about whether you need to include time related information in the model depends on the nature of your data and whether you expect temporal effects to exist. If your data are essentially collected at the same time and refer to the same time period, you probably don’t need to account for time effects. If theory suggests that the outcome does not change over time, you probably don’t need to include variables for time effects.
However, if your data are collected at, or otherwise describe, different points in time, and you suspect that the relationships between the IVs and DV change over time, or that there is an overall shift over time, then yes, you'd need to account for the time effects in your model. In that case, failure to account for the effects of time can bias your other coefficients; basically, there's the potential for omitted variable bias.
I don’t know the subject area well enough to be able to answer those questions, but that’s what I’d think about.
You mention that the IVs are potentially correlated (multicollinearity). That might or might not be a problem; it depends on the degree of the correlation. Some correlation is OK. I'd perform the analysis and check the VIFs, which measure multicollinearity. Read my post about multicollinearity, which discusses how to detect it, determine whether it's a problem, and some corrective measures.
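If you work in Python, here's a small sketch of that VIF check using statsmodels (simulated data with deliberately correlated IVs):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated IVs; x2 is built to correlate strongly with x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.5, size=100)
x3 = rng.normal(size=100)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# One VIF per predictor; values above roughly 5-10 usually signal
# problematic multicollinearity.
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))
```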
I’d start with linear regression. Move away from that only if you have specific reason to do so.
Best of luck with your analysis!
Hi Jim
I was wondering if you could help. I'm currently doing a lab report on numerical cognition in human and non-human primates, where we are looking at whether the size, quantity, and visibility of food affect choice. We have tested humans so far and are going to test chimps in the future. My IV is condition (visible vs. opaque containers) and my DV is the number of correct responses. So far, I have compared the mean numbers of correct responses for both conditions using a one-way repeated measures ANOVA, but I don't think this is correct. After having a look at your website, should I run a regression analysis instead? Sorry for the confusion; I'm really a rookie at this. Hope you can help!
Hi Lizzy,
Linear regression analysis and ANOVA are really the same type of analysis: linear models. They both use the same math "under the hood." They each have their own historical traditions and terminology, but they're really the same thing. In general, ANOVA tends to focus on categorical (nominal) independent variables while regression tends to focus on continuous IVs. However, you can add continuous variables to an ANOVA model and categorical variables to a regression model. If you fit the same model with ANOVA as with regression, you'll get the same results.
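Here's a small Python demonstration of that equivalence (hypothetical correct-response counts; scipy and statsmodels assumed):

```python
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical correct-response counts for the two conditions
df = pd.DataFrame({
    "correct": [14, 12, 15, 11, 13, 9, 8, 10, 7, 9],
    "condition": ["visible"] * 5 + ["opaque"] * 5,
})

# One-way ANOVA ...
visible = df.loc[df.condition == "visible", "correct"]
opaque = df.loc[df.condition == "opaque", "correct"]
print(stats.f_oneway(visible, opaque))

# ... and regression with the same categorical IV: the F and p-value match.
model = smf.ols("correct ~ C(condition)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```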
So, for your study, you can use either ANOVA or regression. However, because you have only one categorical IV, I’d normally suggest using one-way ANOVA. In fact, if you have only those two groups (visible vs opaque), you can use a 2-sample t-test.
Although you mention repeated measures, you'd use that only if you in fact have pre-test and post-test conditions. You could even use a paired t-test if you have only the two groups and pre- and post-tests.
There is one potential complication. You mention that the DV is a count of correct responses. Counts often do not follow the normal distribution but can follow other distributions, such as the Poisson and negative binomial distributions. Counts can approximate the normal distribution when the mean is high enough (>~20). And, if you have two groups and each group has more than 15 observations, the analyses are robust to departures from the normal distribution.
I hope this helps! Best of luck with your analysis!
Thank you so much for the reply. I appreciate it. I finally worked it out and got a good mark on the lab report, which was good 🙂. I appreciate your time replying; you explain things very clearly, so thank you.
Hi there. I am currently doing a lab report and have not done stats in years, so I'm hoping someone can help, as it's due tomorrow.
When I do a bivariate correlation test, it shows the correlation is not significant between a personality trait and a particular cognitive task. Yet when I conduct a simple t-test, it shows a significant p-value and gives the 95% confidence interval. If I want to show that higher scores on one trait tend to mean higher scores on a particular cognitive task, should I be doing a regression? We were told basic correlations, so I did the bivariate option and just stated that the Pearson's r is not significant (r = .., n = , p = .84, for example). Yet if I do a regression analysis for each, it is significant. Why could this be?
Thankyou
Hi Kris,
There aren't quite enough details to know for sure what is happening, but here are some ideas.
Be aware that a series of pairwise correlations is not equivalent to performing regression analysis with multiple predictors. Suppose you have your outcome variable and two predictors (Y, X1, X2). When you perform the pairwise correlations (X1 and Y, X2 and Y), each correlation does not account for the other X. However, when you include both X1 and X2 in a regression model, it estimates the relationship between each X and Y while accounting for the other X.
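Here's a simulated Python sketch of how that difference can play out; all the data and names are made up, and the predictors are built to be correlated:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated suppression effect: x1 and x2 are strongly correlated,
# and y depends on their difference.
rng = np.random.default_rng(42)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.4, size=200)
y = x1 - x2 + rng.normal(scale=0.3, size=200)
df = pd.DataFrame({"y": y, "x1": x1, "x2": x2})

# The pairwise correlations with y look weak ...
print(df.corr().round(2))

# ... but each predictor is clearly significant once the other is controlled.
print(smf.ols("y ~ x1 + x2", data=df).fit().summary())
```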
If the correlation and regression model results differ as you describe, you might well have a confounding variable, which biases your correlation results. I write about this in my post about omitted variable bias. You’d favor the regression results in this situation.
As for the difference between the 2-sample t-test and correlation, that’s not surprising because they are doing two entirely different things. The 2-sample t-test requires a continuous outcome variable and a categorical grouping variable and it tests the mean difference between the two groups. Correlations measure the linear association between two continuous variables. It’s not surprising the results can differ.
It sounds like you should probably use regression analysis and include your multiple continuous variables in the model along with your categorical grouping variables as independent variables to model your outcome variable.
Best of luck with your analysis!
This is Kathlene, and I am a Grade 12 student. I am currently doing my research; it's a quantitative study. I am having a little trouble with how to approach my statistical treatment. My research is entitled "Emotional Quotient and Academic Performance Among Senior High School Students in Tarlac National High School: Basis to a Guidance Program."
I was battling over what to use to determine the relationship between the variables in my study.
I'm thinking of using the chi-square method, but a friend said it would be more accurate to use regression analysis. Math is not really my field of study, so I badly need your opinion regarding this.
I’m hoping you could lend me a helping hand.
Thank you.
Hi Kathlene,
It sounds like you’re in a great program! I wish more 12th grade students were conducting studies and analyzing their results! 🙂
To determine how to model the relationships between your variables, it depends on the type of variables you have. It sounds like your outcome variable is academic performance. If that’s a continuous variable, like GPA, then I’d agree with your friend that regression analysis would be a good place to start!
Chi-square assesses the relationship between categorical variables.
Best of luck with your analysis!
Hi Mr Jim,
I am using an orthogonal design with 7 factors at three levels. I have done a regression analysis in Minitab, but I don't know how to explain or interpret the results. I need your help in this regard.
Hi Umar,
I have a lot of content throughout my blog that will help you, including how to interpret the results. For a complete list for regression analysis, check out my regression tutorial.
Also, early next year I’ll be publishing a book about regression analysis as well that contains even more information.
If you have a more specific question after reading my other posts, you can ask them in the comments for the appropriate blog post.
Best of luck!
By the way, my gun laws vs. VCR question is part of a regression model. Any help you can give, I'd greatly appreciate.
Mr. Jim, I have a problem. I'm working on a research design on gun laws vs. homicides, with my dependent variable being the violent crime rate. My sig is .308. The constant's (VCR) standard error is 24.712. My n for violent crime rate is 430.44. I really need help ASAP. I don't know how to interpret this well. Please help!!!
Hi Ty,
There’s not enough information for me to know how to interpret the results. How are you measuring gun laws? Also, VCR is your dependent variable, not the constant as you state. You don’t usually interpret the constant. All I can really say is that based on your p-value, it appears your independent variable is not statistically significant. You have insufficient evidence to conclude that there is a relationship between gun laws and homicides (or is it VCR?).
Hi Jim
Your blog has been very useful. I have a query: if I am conducting a multiple regression, is it okay to have an outcome variable which is normally distributed (I winsorized an outlier to achieve this) and two predictor variables which are not normally distributed (their normality test scores were significant)?
I have read in many places that you have to transform your data to achieve normality for the entire data set to conduct a multiple regression, but doing so has not helped me at all. Please advise.
Hi Angela,
I'm dubious about the winsorizing process in general. Winsorizing reduces the effect of outliers. However, this process is fairly indiscriminate in terms of identifying outliers. It simply defines outliers as being more extreme than an upper and lower percentile and changes those extreme values to equal the specified percentiles. Identifying outliers should be a point-by-point investigation; simply changing unusual values is not a good process. It might improve the fit of your model, but it is an artificial improvement that overstates the true precision of the estimates. If that point is truly an outlier, it might be better to remove it altogether, but make sure you have a good explanation for why it's an outlier.
For regression analysis, the distributions of your predictors and response don't necessarily need to be normally distributed. However, it's helpful, and generally expected, to have residuals that are normally distributed. So, check your residual plots! For more information, read my post about OLS assumptions so you know what you need to check!
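For those who check residuals in code, here's a minimal Python sketch (simulated data) of the two plots I usually look at first:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Simulated model fit
rng = np.random.default_rng(7)
x = rng.normal(size=100)
y = 2 + 3 * x + rng.normal(size=100)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs. fitted values: you want random scatter around zero
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="gray")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")

# Normal probability (Q-Q) plot of the residuals
sm.qqplot(model.resid, line="45", fit=True)
plt.show()
```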
If your residuals are nonnormally distributed, sometimes transforming the response can help. There are many transformations you can try; it's a bit of trial and error. I suggest you look into the Box-Cox and Johnson transformations. Both methods assess families of transformations and pick the one that works best for your data. However, it sounds like your outcome is already normally distributed, so you might not need to do that.
Also, see what other researchers in your field have done with similar data. There’s little general advice I can offer other than to check the residuals and make sure they look good. If there are patterns in the residuals, make sure you’re fitting curvature that might be present. You can graph the various predictors by the residuals to find where the problem lies. You can also try transforming the variables as I describe earlier. While the variables don’t need to follow the normal distribution, if they’re very nonnormally distributed, it can cause problems in the residuals.
Best of luck with your analysis!
Hi, I am confused about the assumption of independent observations in multiple linear regression. Here's the case. I have heart rate data at five-minute intervals over a day for 14 people. The dependent variable is the heart rate. During the day, the workers worked for 8 hours (8 am to 5 pm), so basically, I have 90 data points per worker for the day. That makes 1260 data points (90 times 14) to be included in the model. Is it valid to use multiple linear regression for this type of data?
Hi DMA,
It sounds like your model is more of a time series model. You can model those using regression analysis as well, but there are special concerns that you need to address. Your data are not independent: if someone has a high heart rate during one measurement, it's very likely it'll also be high 5 minutes later. The residuals are likely to be serially correlated, which violates one of the OLS assumptions.
You'll likely need to include other variables in your model that capture this time-dependent information, such as lagged variables. There are various considerations you'll need to address that go beyond the scope of these comments. You'll need to do some additional research into using regression analysis for time series data.
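As a starting point, here's a hedged Python sketch (simulated data) of detecting serial correlation in the residuals and of the lagged-variable idea:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.stattools import durbin_watson

# Simulated, serially correlated heart-rate series for one worker
rng = np.random.default_rng(3)
n = 90
df = pd.DataFrame({"hr": 70 + np.cumsum(rng.normal(size=n)),
                   "t": np.arange(n)})

# Fit ignoring the time structure, then test the residuals.
# Durbin-Watson near 2 suggests independence; values near 0 or 4
# indicate positive or negative serial correlation.
naive = smf.ols("hr ~ t", data=df).fit()
print(durbin_watson(naive.resid))

# One common fix: add a lagged dependent variable as a predictor.
df["hr_lag1"] = df["hr"].shift(1)
lagged = smf.ols("hr ~ t + hr_lag1", data=df.dropna()).fit()
print(durbin_watson(lagged.resid))
```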
Best of luck with your analysis!
Ok.Thank you so much.
Thank you so much for your time!
Actually, I don't have authentic data about property values (the dependent variable), nor do the relevant institutions have this data. Can I ask the property owners for the property values directly through walk-in interviews?
You really need to have valid data. Using a self-reported valuation might be better than no data. However, be aware there might be differences between what the property owner says and the true market value. Your model would describe self-valuation rather than market valuation. Typically, I’ve seen studies like yours use actual sales prices.
Hello Sir!
Is it necessary for the dependent variable in a multiple regression model to have values? I have a number of independent variables (age of property, stories in the building, location close to a park) and a single dependent variable (property values).
Some independent variables decrease the value of the dependent variable, while other independent variables increase it.
Can I enter the values of my single dependent variable as categories (a. <200000, b. <300000, c. …, d. 500000)?
Hi Asad,
Why can't you enter the actual property values? Ideally, that's what you would do. If you are missing a value for a particular observation, you typically need to exclude the entire observation from the analysis. However, there are some ways to estimate missing values. For example, SPSS has advanced methods for imputing missing values. But, you should use those only to estimate a few missing values. Your plan should be to obtain the property values. If you can't do that, it will be difficult to perform regression analysis.
There are some cases where you can’t record the exact values and it’s usually related to the observation time. This is known as censored data. A common example is in reliability analysis where you record failure times for a product. You run the experiment for a certain amount of time and you obtain some failures and know their failure times. However, some products don’t fail and you only know that their failure time is greater than the test time. There are censored regression models you can use in situations like that. However, I don’t think that applies to your subject-area, at least as far as I can tell.
Best of luck with your analysis!
thank you so much Jim! this is really helpful 🙂
You’re very welcome! Best of luck with your analysis!
Hi Jim,
The standard deviations for the 3 groups are 0.45, 0.7, and 1. Would you say that they vary by a lot?
Another follow-up question: does a narrower CI equal a better estimate?
thanks!
Yes, that’s definitely it!
I would suggest using Welch’s one-way ANOVA to analyze it and potentially use that analysis to calculate the CI. You’re essentially performing a one-way ANOVA. And, in ANOVA, there is the assumption of equal variances between groups, which your data do not satisfy. In regression, we’d refer to it as heteroscedasticity. In Welch’s ANOVA, you don’t need to satisfy that assumption. That makes it a simple solution for your case.
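If you want to run this in code, one option is the third-party pingouin package in Python, which implements Welch's ANOVA; here's a sketch with made-up data standing in for your three groups:

```python
import pandas as pd
import pingouin as pg  # third-party package; one of several offering this test

# Made-up measurements for three groups with unequal spreads
df = pd.DataFrame({
    "y": [2.1, 2.5, 2.3, 2.8, 2.2,
          3.0, 3.9, 2.6, 3.4, 4.1,
          1.5, 4.0, 2.9, 5.1, 3.3],
    "group": ["a"] * 5 + ["b"] * 5 + ["c"] * 5,
})

# Welch's ANOVA does not assume equal variances across groups
print(pg.welch_anova(data=df, dv="y", between="group"))
```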
In terms of CIs, yes, narrower CIs indicate that the estimate is more precise than if you had a wider CI. Think of the CI as a margin of error around the estimate and it’s good to have a smaller margin of error. With a narrower CI, you can expect the actual mean to fall closer to the fitted value.
Hi Jim,
Thank you so much for the quick response!
I checked the residual plots; they give me a pretty flat trend line at y = 0, and my R-squared = 0.87. However, the CI I get by using all 15 points (regression inference) is a little wider (2.012 – 3.655) than if I just use those 5 points (2.245 – 3.355). In this case, would you still prefer using all 15 points?
Thank you!
That’s tricky. I hate to throw out data, but it does seem warranted. At least you have a good rationale for not using the data!
CIs of the mean for a point at the end of a data range in a regression model do tend to be wider than in the middle of the range. Still, I'm not sure why it would be wider than the CI from the group alone. Are the variances of the groups roughly equal? If not, that might well be the reason.
Hi Jim,
Suppose I have a total of 15 data points at x = 0, x = 40, and x = 80 (5 data points at each x value). I can use regression to estimate y when x = 60. But what if I want to estimate the average y when x = 0? Should I just use the 5 data points at x = 0, or use the intercept from the regression line? Which gives the best estimate and 95% CI for the average y value when x = 0?
Thank you 🙂
Hi Rachel,
Assuming the model provides a good fit to the data (check the residual plots), I'd use all the data to come up with the CI for the fitted value that corresponds to x = 0. That approach uses more data to calculate the estimate. Your CI might even be more precise (narrower) using all the data.
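Here's a minimal Python/statsmodels sketch of that calculation; the numbers are made up to mirror the design you describe (5 points at each of x = 0, 40, 80):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data mirroring the design: 5 observations at each x value
df = pd.DataFrame({
    "x": [0] * 5 + [40] * 5 + [80] * 5,
    "y": [2.6, 2.9, 2.7, 3.1, 2.8,
          4.0, 4.3, 3.9, 4.2, 4.1,
          5.3, 5.6, 5.2, 5.5, 5.4],
})

model = smf.ols("y ~ x", data=df).fit()

# 95% CI for the mean response at x = 0, using all 15 points
pred = model.get_prediction(pd.DataFrame({"x": [0]}))
print(pred.conf_int(alpha=0.05))
```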
Hi,
What makes us use linear regression instead of other types of regression? In other words, what is the motivation for selecting a linear model?
Hi Salam,
Typically, try linear regression first. If your data contain curvature, you might still be able to use linear regression. Linear regression is generally easier to use and includes some useful statistics that nonlinear regression can’t provide, such as p-values for the coefficients and R-squared.
However, if you can't adequately fit the curvature in your data, it might be time to try nonlinear regression. While both types allow you to fit curvature, nonlinear regression is more flexible because it allows your model to fit more types of curves.
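To illustrate the first point, here's a short Python sketch (simulated data) of linear regression fitting a curve through a polynomial term; the model stays "linear" in the coefficients even though the fitted line bends:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated curved relationship
rng = np.random.default_rng(5)
x = np.linspace(0, 10, 60)
y = 1 + 0.5 * x - 0.04 * x**2 + rng.normal(scale=0.3, size=60)
df = pd.DataFrame({"x": x, "y": y})

# I(x**2) adds a squared term, letting linear regression fit the curvature
quadratic = smf.ols("y ~ x + I(x**2)", data=df).fit()
print(quadratic.params)
```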
I’ve written a post about how to choose between linear and nonlinear regression that you should read. Within that post are various related links that talk about how to fit curves using both types of regression, along with additional information about both types.
I hope this helps!
Hi Jim
Thank you so much for your reply. I am really eager to learn much more about this. I shall keep sending emails seeking your reply, which I hope you will not mind.
Hi Jim
I have been unfortunate in not getting your reply to my comment of 18/09/2018.
Hi V.G.,
Sorry about the delay. As you can no doubt imagine, my schedule gets busy and things can fall through the cracks.
I replied under your original comment.
Hi Jim,
Your blog has been really helpful! 🙂 I am currently completing my Masters thesis, and my primary outcome is to assess the relationship between diabetes distress and blood glucose control. I am a newbie to SPSS, and I am at a loss as to how best to analyse my small data set (not normally distributed pre- and post-data-transformation).
I have been advised that regression analysis may be appropriate and better than correlations? However, my data does not appear to be linear.
My diabetes distress variable consists of a score of 1-6 based on a Likert scale and also has a categorical version (low, moderate, high distress); my blood glucose consists of continuous data and also a categorical variable (poorly controlled vs. well controlled blood glucose).
At the moment I am struggling to complete this analysis. Any help would be greatly appreciated 🙂
Dear Jim, thank you very much for this post! Could you please explain the following.
You are writing: “you first need to fit and verify that you have a good model. Then, you look through the regression coefficients and p-values”
What if I have a small R-squared, but the coefficients are statistically significant with small p-values?
Hi Jim
Thanks for your enlightening explanations. However, I want to engage you a bit. Under how to interpret regression results, you indicated that a small p-value indicates that the "independent variable is statistically significant." I tend not to agree. Note that since the null hypothesis is that the coefficient of the independent variable is equal to zero, its rejection, as evidenced by a low p-value, should imply that it is the coefficient which is significantly different from zero, and not the variable.
almadi
Hi Almadi,
Yes, you’re correct that the p-value tests whether the coefficient estimate is significantly different from zero. If it is, you can say that the coefficient is statistically significant. Alternatively, statisticians often say that the independent variable is statistically significant. In this context, these are two different ways of saying the same thing because the coefficient is a property of the variable itself.
Hi Jim
As you must be well aware, the government releases price indices, and these are broadly used to determine the effect of base prices during a given period of time.
The construction industry normally uses these price indices, running over a period of time, to redetermine prices based on the movement between the base date and the current date, which is called price adjustment.
After a few years, the government releases a new series of price indices, and we may not have index data from the old series, which necessitates using the new indices with a conversion factor to arrive at the equivalent value of the base price.
Where do you feel that regression analysis could be of help when we have to determine the current value of the base price using the new indices?
Someone was suggesting it to me, which I find a bit amusing.
Regards
V.G.Subramanian
Hi V.G.,
I agree that switching price indices can be a problem. If the indices overlap, you can perform regression analysis where the old index is the independent variable and the new index is the dependent variable. However, that is problematic if you don’t have both indices. If you had both indices, I suppose it wouldn’t be a problem to begin with!
Ideally, you’d understand the differences behind how the government calculates both indices, and you could use that to estimate the value of the other index.
I’m not particularly familiar with this practice, so I don’t have a whole lot of insight into it. I hope this helps somewhat!
Thank you for this, Jim. I’ve always felt a common sense explanation minus all the impressive math formulas is what is needed in statistics for data science. This is a big part of the basics I’ve been missing. I’m looking forward to your Logistic Regression Tutorial. How is that coming along for you?
Hi Antonio,
Thanks so much for your kind words! They mean a lot to me! Yes, I totally agree, explanations should focus on being intuitive and helping people grasp the concepts.
I have written a post on binary logistic regression. Unfortunately, it'll be a while before I have a chance to write a more in-depth article; there are just too many subjects to write about!
Dear sir,
I have a few questions about when to use ANOVA and when to use regression analysis. In my study, I conducted an experiment with temperature, pH, and the weight of a compound as independent variables and extraction as the dependent variable (I describe this very generally, but I have some specific independent and dependent variables along with these). I did the statistical analysis using a one-way ANOVA with Tukey's test, and I used a grouping method (using the letters a, b, c, …) to show significance based on the p-values. My question is: for this type of data, can I use regression analysis? And what is the main difference between Tukey's test and regression analysis?
Hi Rashmi,
Both regression analysis and ANOVA are linear models. As linear models, both types of analyses have the same math “under the hood.” You can even use them interchangeably and get the same results. Traditionally, you use ANOVA when you have only, or mainly, categorical factors–although you can add in covariates (continuous variables). On the other hand, you tend to use regression when you have only, or mainly, continuous variables–although you can add in categorical variables.
Because ANOVA focuses on categorical factors and comparing multiple group means, statisticians have developed additional post hoc analyses to work with ANOVA, such as Tukey’s test. Typically, you’ll perform the ANOVA first and then the post hoc test. Suppose you perform a one-way ANOVA and obtain significant results. This significance tells you that not all of the group means are equal. However, it does not tell you which differences are statistically significant.
That point is where post hoc tests come in. These tests do two things. They tell you which differences are statistically significant, and they control the family error rate for the group of comparisons. When you compare multiple differences like that, you increase the risk of a Type I error, which is when you say there is a difference but there really isn't. When you compare multiple means, the Type I error rate will be higher than your significance level (alpha). These post hoc tests (other than Fisher's) maintain the Type I error rate so it continues to equal alpha, which is what you would expect.
So, use an ANOVA first. If you obtain significant results for a categorical factor, you can use post hoc tests like Tukey’s to explore the differences between the various factor levels.
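For readers who want to run this sequence in code, here's a minimal Python sketch of the Tukey step using statsmodels (hypothetical data and factor names):

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical extraction results at three temperature levels
df = pd.DataFrame({
    "extraction": [12, 14, 13, 15, 13,
                   18, 19, 17, 20, 18,
                   11, 10, 12, 11, 13],
    "temp": ["low"] * 5 + ["med"] * 5 + ["high"] * 5,
})

# Tukey's test: all pairwise comparisons while controlling the family
# error rate across the set of comparisons
result = pairwise_tukeyhsd(df["extraction"], df["temp"], alpha=0.05)
print(result)
```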
I really need to write a blog post about this! I will soon!
In the meantime, I hope this helps!
Is it necessary to conduct correlation analysis before regression analysis?
Hi Kaushal,
No, it's not absolutely required. I actually prefer producing a series of scatterplots (or a matrix plot) so I can see the nature of the different relationships. That gives me a better feel for the data and the types of relationships. However, if you have a good theory and solid background knowledge about which variables should be included in the model, you can go straight to modeling. I think it depends a lot on your existing level of knowledge.
That all said, I personally like knowing the correlation structure between all of the variables. It gives me a better feel for the data.
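If you like to do this in code, here's a quick Python sketch (simulated data) of both the matrix plot and the correlation structure:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated stand-ins for a DV and two IVs
rng = np.random.default_rng(4)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=["y", "x1", "x2"])

# Scatterplot matrix: every pairwise relationship at a glance
pd.plotting.scatter_matrix(df)
plt.show()

# Correlation matrix for the same variables
print(df.corr().round(2))
```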
Thanks
Thank you Jim!
I really appreciate it!
PS
Hi Jim, I hope you are having a good time!
I would like to ask you a question, please!
I have 24 observations to perform a regression analysis on (let's say zones), and I have many independent variables (IVs). I would like to know the minimum number of observations I should have to fit a reasonable linear regression model. I would also like to hear from you about how to test many regression models with different IVs, since I cannot use many IVs in a model where I have few observations (24).
Thank you in advance!
Patrik
Hi Patrik, great to hear from you again!
Those are great questions. For 24 observations, I’d say that you usually wouldn’t want more than 2 IVs. I write an entire post about how many variables you can include in a regression model. Including too many IVs (and other terms such as interactions and polynomials) is known as overfitting the model. Check that post out because it’ll provide guidance and show you the dangers of including too many.
There's another issue at play too, because you want to compare a number of different regression models to each other. If you compare many models, it's a form of data mining. The risk here is that if you compare enough models, you will uncover chance correlations. These chance correlations look like the real thing but appear only in your sample and not in the population. I've written a post about how using this type of data mining to choose a regression model causes problems. This concern is particularly problematic with a small sample size like yours. Data mining can find "patterns" in randomly generated data.
So, there’s really two issues for you to watch out for–overfitting and chance correlations found through data mining!
Hope this helps!
Many Thanks Jim!!! You have no idea about how much you helped me.
Very well clarified!!!
God bless you always!!!
Patrik
Hi Jim, I am everywhere in your post!
I am starting to love statistics; that's why I am not quiet.
I have some questions for you:
To use OLS regression, one of the assumptions is that the dependent variable is normally distributed. To achieve this requirement, what should I do with my data? Should I check the normality of my dependent variable, for example using the Shapiro test (etc.)? If I conclude that my dependent variable does not follow the normal distribution, should I start looking at data transformations?
Another approach I have seen people use to analyze normality is plotting the dependent variable against the independent variable; if the relationship doesn't follow a linear trend, they go to data transformation (which one do you recommend?).
Or should I perform the regression using my original data, and then the residuals will show me non-normality if it exists?
When should I transform my independent variables, and what is the consequence of transforming them?
Sorry, I use to ask many questions in a single comment, but I think this is the way to understand the full picture of my doubt,
You are being so useful to me,
Thank you again!
Patrik
Hi Patrik, I'm so happy to hear that you're starting to love statistics! It's an exciting field: the thrill of discovery combined with getting the most value out of your data. I'm not sure if you've read my post about The Importance of Statistics, but if you haven't, I recommend it. It explains why the field of statistics is more important than ever!
In OLS regression, the dependent variable does not have to be normally distributed. Instead, you need to assess the distribution of the residuals using residual plots. If your residuals are not normally distributed, there are a variety of possible reasons and different ways to resolve the issue. I always recommend that transforming your data be the last resort. For example, the residuals might be nonnormal because the model is specified incorrectly. Maybe there is curvature in the data that you aren't modeling correctly? If so, transforming the data might mask the problem. You really want to specify the best possible model. However, if all else fails, you might need to transform the data. When you do, you'll need to back-transform the results to make sense of them because everything applies to the transformed data. Most statistical software should do this for you.
Be aware that you can’t trust R-squared and the standard error of the regression when you transform your dependent variable because they apply to the transformed data rather than the raw data (backtransformation won’t help there).
In terms of testing the normality of the residuals, I recommend using normal probability plots. You can usually tell at a glance whether they are normally distributed. If you need a test, I generally use the Anderson-Darling test–which you can see in action in my post about identifying the distribution of your data. By the way, as a case in point, the data in that post are not normal, but I use it as the dependent variable in OLS regression in this post about using regression to make predictions. The residuals are normally distributed even though the dependent variable is not.
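Here's a minimal Python sketch of that Anderson-Darling check using scipy (simulated residuals):

```python
import numpy as np
from scipy import stats

# Simulated residuals standing in for a fitted model's residuals
rng = np.random.default_rng(11)
residuals = rng.normal(size=80)

# Anderson-Darling test for normality: compare the statistic to the
# critical values at each listed significance level
result = stats.anderson(residuals, dist="norm")
print(result.statistic)
print(result.critical_values)
print(result.significance_level)
```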
In the coffee intake and smoking example, the first result showed that higher coffee intake leads to higher mortality, but after including smoking, coffee intake leads to lower or no mortality? Smoking was revealed to cause the mortality, but how did coffee intake now result in the opposite? Was a separate test taken for this result? Please let me know.
S. CHATTERJEE
Hi, that’s a great question. It turns out that coffee and smoking are correlated. The negative effects of smoking on mortality are well documented. However, for some reason, the researchers did not originally include smoking in their model. Because drinking coffee and smoking are correlated, the variable for coffee consumption took on some of smoking’s effect on mortality.
Put another way, because smoking was not included in the model, it was not being controlled (held constant). So, as you increased coffee consumption, smoking also tended to increase because it is both positively correlated with coffee consumption and not in the model. Therefore, it appeared as though increased coffee consumption is correlated with higher mortality rates but only because smoking was not included in the model.
Presumably, the researchers had already collected data about smoking. So, all they had to do was include the smoking variable in their regression model. Voila, the model now controls for smoking and the new output displays the new estimate of the effect that coffee has on mortality.
This point illustrates a potential problem: if the researchers had not collected the smoking information, they would have been really stuck. Before conducting any study, researchers need to do a lot of background research to be sure that they are collecting the correct data!
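To see that mechanism in action, here's a small simulation in Python; the data are entirely made up just to illustrate the omitted variable effect described above, not to reproduce the actual study:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated version of the story: smoking drives risk and is positively
# correlated with coffee drinking; coffee's true effect is zero.
rng = np.random.default_rng(2)
n = 1000
smoking = rng.normal(size=n)
coffee = 0.7 * smoking + rng.normal(size=n)
risk = 2.0 * smoking + rng.normal(size=n)
df = pd.DataFrame({"risk": risk, "coffee": coffee, "smoking": smoking})

# Omitting smoking: coffee picks up part of smoking's effect (biased)
print(smf.ols("risk ~ coffee", data=df).fit().params)

# Including smoking: coffee's coefficient drops to about zero
print(smf.ols("risk ~ coffee + smoking", data=df).fit().params)
```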
I hope this helps!
Hi Jim
Hope all thing is well,
I have faced a problem with plotting the relationship between the dependent variable (response) and the independent variables.
When I do the main effects plots, I get a straight, increasing line:
y = x, a linear trend.
To change it, I need to make y equal the square root of time.
I'm stuck with this and couldn't find a solution for it.
Regards
Hi Jim,
I was wondering if you can help me? I am doing my dissertation, and I have 1 within-subjects IV and 3 between-subjects IVs. Most of my variables are categorical, but one is not: it is a questionnaire I am using to determine sleep quality, with both Likert scales and free-form answers for amount of sleep (hours), number of times woken in the night, etc. Can I use a regression when making use of both categorical data and other types? I also have multiple DVs (angry/sad Likert ratings), but I *could* combine those into one overall "emotion" DV. Any help would be much appreciated!
Hi Cara, because your DVs use the Likert scale, you really should be using ordinal logistic regression. This type of regression is designed for ordinal dependent variables like yours. As for the IVs, it can be tricky using ordinal variables; they're not quite either continuous or categorical. My suggestion is to try them as continuous variables and check the residual plots to see how they look. If they look good, then it's probably OK. However, if they don't look good, you can try refitting the model using them as categorical variables and then rechecking the residual plots. If the residuals still don't look good, you can try using the chi-square test of independence for ordinal data.
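If you end up fitting this in code, here's a hedged sketch using OrderedModel from recent versions of statsmodels (the data and column names are hypothetical):

```python
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

# Hypothetical Likert-style DV (1-5) and one continuous IV
df = pd.DataFrame({
    "rating": [1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 2, 3, 4, 5, 1],
    "hours_sleep": [4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 5, 6, 7, 9, 4],
})
# The DV must be an ordered categorical variable
df["rating"] = pd.Categorical(df["rating"], ordered=True)

# Ordinal (proportional odds) logistic regression
model = OrderedModel(df["rating"], df[["hours_sleep"]], distr="logit")
print(model.fit(method="bfgs", disp=False).summary())
```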
As for combining the data, that would seem to be a subject-area specific decision, and I don’t know that area well enough to make an informed recommendation.
Yes. But it may be that you missed my point. I argue that a proper and sound experiment will allow you to test for causality, regardless of whether you deploy, e.g., Pearson's r or regression. With no experimental design, neither Pearson's r nor a regression will test for an effect relationship between the variables. Randomization makes a better case for controlling for variables that you are unaware of than picking a few and then proclaiming that your study found that x will cause an increase in y or that x has an effect on y. You may as well argue that you don't need to control for any variables and claim that any correlational study tests for effect relationships.
Hi Martin, yes, that is exactly what I’m saying. Whether you can draw causal conclusion depends on whether you used a randomized experiment to collect your data. If it’s an observational study, you can’t assume it’s anything other than correlation. What you write in your comment agrees with what I’m saying.
The controlling for other variables that I mention in this post is a different matter. Yes, if you include a variable in a regression model, it is held constant while estimating the effects of the other variables. That doesn’t mean you can assume causality though.
No statistical tool or method turns a survey or correlational study into an experiment; i.e., regression does not test for or imply a cause-and-effect relationship. A positive relationship between smoking and cancer in a regression analysis does not mean that smoking causes cancer. You have not controlled for what you are unaware of.
Hi Martin, you are 100% correct about the fact that correlation doesn’t imply causation. This issue is one that I plan to cover in future posts.
There are two issues at play here. The type of study under which the data were collected and the statistical findings.
Being able to determine causation comes down to the difference between an observational study versus a randomized experiment. You actually use the same analyses to assess both types of designs. In an observational study, you can only establish correlation and not causality. However, in a randomized experiment, the same patterns and correlations in the data can suggest causality. So, regression analysis can help establish causality, but only when it’s performed on data that were collected through a randomized experiment.
Very nicely explained. Thank you.
Thank you, Hari!
Thanks for your reply and for the guidance.
I read your posts, which are very helpful. After reading them, I concluded that only independent variables with a well-established association with the dependent variable should be included. Hence, in my case, variable Z should not be included, given that its association with the dependent variable is not well established.
Furthermore, suppose there is another variable (A), and the literature suggests that it generally has an association with the dependent variable. However, assume that A does not affect any independent variables, so there is no omitted variable bias. In this case, if there is no data available for A (due to the study being conducted in a different environment/context), what statistical techniques can be deployed to address any problems caused by the exclusion of A?
I look forward to your reply and I will be grateful for your reply.
Kind regards.
Thanks for the reply. I apologise if I am taking a considerable time out of your schedule.
Based on the literature, there isn't any conclusive evidence that z is a determinant of y, so that is why I intend to remove z. Some studies include it while others do not, and some find a significant association between y and z while others find the association insignificant. Hence, I think I can safely remove it.
Moreover, I will be grateful if you can answer another query. From a statistical viewpoint, is it fine to use the generalized method of moments (GMM) for a binary dependent variable?
Kind regards.
While I can’t offer you a concrete statement about whether you should include or exclude the variable (clearly there is disagreement in your own field), I do suggest that you read my article about specifying the correct regression model. I include a number of tips and considerations.
Unfortunately, I don’t know enough about GMM to make a recommendation. All of the examples I have seen personally are for continuous data, but I don’t know about binary data.
Thanks for your reply. I really appreciate it. Could you please also provide an answer to my query mentioned below for further clarification?
Further clarification on my post above. From the internet, I found that if a variable (z) is related to y but unrelated to x, then the inclusion of z will reduce the standard errors of x. So, if z is excluded but the F-stat and adjusted R-squared are fine, do the higher standard errors create problems? I look forward to your reply.
Yes, what you read is correct. Typically, if Z is statistically significant, you should include it in your model. If you exclude it, the precision of your coefficient estimates will be lower (higher standard errors). You also risk a biased model because you are not including important information in the model–check the residual plots. The F-test of overall significance and adjusted R-squared depend on the other IVs in your model. If Z is by far the best variable, it’s possible that removing it will cause the F-test to not be significant and adjusted R-square might drop noticeably. Again, that depends on how the explanatory power of Z compares to the other IVs. Why do you want to remove a significant variable?
Thanks for the reply, Jim.
I am unable to understand “Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable”. Are you stating that other independent variables will be fine but r-square will become low? I will be grateful if you can explain this.
Kind regards
Hi, you indicated that the removed independent variable is related to the dependent variable, but it is not correlated with the other independent variables. Consequently, removing that independent variable should reduce R-squared. For one thing, that’s the typical result of removing variables, even when they’re not statistically significant. In this case, because it is not correlated to the other independent variables, you know that the removed variable is supplying unique information. Taking that variable out means that information is no longer included in the model. R-squared will definitely go down, possibly dramatically.
R-squared measures the strength of the relationship between the entire set of IVs and the DV. Read my post about R-squared for more information.
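Here's a small Python simulation illustrating both effects at once; the data are made up, with z related to y but uncorrelated with x:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated check: dropping z lowers R-squared and inflates x's standard error
rng = np.random.default_rng(9)
n = 200
x = rng.normal(size=n)
z = rng.normal(size=n)                 # independent of x
y = 1.5 * x + 2.0 * z + rng.normal(size=n)
df = pd.DataFrame({"y": y, "x": x, "z": z})

full = smf.ols("y ~ x + z", data=df).fit()
reduced = smf.ols("y ~ x", data=df).fit()

print(full.rsquared, reduced.rsquared)   # R-squared drops without z
print(full.bse["x"], reduced.bse["x"])   # x's standard error grows
```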
Hello, Jim.
What is the impact* on the independent variables in the model if I omit a variable that is a determinant of the dependent variable but is not related to any of the independent variables?
*Here impact relates to the independent variables’ p-values and the coefficients.
Kind regards.
Hi Kaleem,
If the independent variable is not correlated with the other independent variables, it’s likely that there would be a minimal effect on the other independent variables. Your model won’t fit the data as well as before depending on the strength of the relationship between the dropped independent variable and the dependent variable. You should also check the residual plots to be sure that by removing the variable you’re not introducing bias.
Why do we usually use a 5% significance level for comparisons instead of 1% or some other level?
Hi, I actually write about this topic in a post about hypothesis testing. It’s basically a tradeoff between several different error rates–and a dash of tradition. Read that post and see if it answers your questions.
Sir, usually we take a 5% level of significance for comparisons. Why 0?
Hi Ghulam, yes, the significance level is usually 0.05. I’m not sure what you’re asking about in regards to zero? The p-values in the example output are all listed as 0.000, which is less than the significance level of 0.05, so they are statistically significant.
In my model, I use different independent variables. Now my question is: before using regression, do I need to check the distribution of the data? If yes, then please name the tests. My title is "Education and Productivity Nexus: Evidence from the Pharmaceutical Sector in Bangladesh."
Hi Shamsun, typically you check the distribution of the residuals after you fit a model. I've written a blog post about checking your residual plots that you should read.
I hope this helps!
Jim
Thank you Mr. Jim
You’re very welcome!
In linear regression, can we use categorical variables as independent variables? If yes, what should be the minimum or maximum number of categories in an independent variable?
Hi, yes you can use categorical variables as independent variables! The number of groups really depends on what makes sense for your study area. Of course, the minimum is two. There really is no maximum in theory. It depends on what makes sense for your study. However, in practice, having more groups requires a larger total sample size, which can become expensive. If you have 2-9 groups, you should have at least 15 in each group. For 10-12 groups, you should have 20. These numbers are based on simulation studies for ANOVA, but they also apply to categorical variables in regression. In a nutshell, figure out what makes sense for your study and then be sure to collect enough data!
I hope this helps!
Jim