Heteroscedasticity means unequal scatter. In regression analysis, we talk about heteroscedasticity in the context of the residuals or error term. Specifically, heteroscedasticity is a systematic change in the spread of the residuals over the range of measured values. Heteroscedasticity is a problem because ordinary least squares (OLS) regression assumes that all residuals are drawn from a population that has a constant variance (homoscedasticity).
To satisfy the regression assumptions and be able to trust the results, the residuals should have a constant variance. In this blog post, I show you how to identify heteroscedasticity, explain what produces it and the problems it causes, and work through an example demonstrating several solutions.
How to Identify Heteroscedasticity with Residual Plots
Let’s start with how you detect heteroscedasticity because that is easy.
In my post about checking the residual plots, I explain the importance of verifying the OLS linear regression assumptions. You want these plots to display random residuals (no patterns) that are uncorrelated and uniform. Generally speaking, if you see patterns in the residuals, your model has a problem, and you might not be able to trust the results.
Heteroscedasticity produces a distinctive fan or cone shape in residual plots. To check for heteroscedasticity, you need to assess the residuals by fitted value plots specifically. Typically, the telltale pattern for heteroscedasticity is that as the fitted values increase, the variance of the residuals also increases.
You can see an example of this cone-shaped pattern in the residuals by fitted value plot below. Note how the vertical range of the residuals increases as the fitted values increase. Later in this post, we’ll return to the model that produces this plot when we try to fix the problem and produce homoscedasticity.
What Causes Heteroscedasticity?
Heteroscedasticity, also spelled heteroskedasticity, occurs more often in datasets that have a large range between the largest and smallest observed values. While there are numerous reasons why heteroscedasticity can exist, a common explanation is that the error variance changes proportionally with a factor. This factor might be a variable in the model.
In some cases, the variability increases with this factor in absolute terms but remains constant as a percentage of it. For instance, a 10% change in a small number such as 100 is much smaller than a 10% change in a large number such as 100,000. In this scenario, you expect to see larger residuals associated with higher values. That’s why you need to be careful when working with wide ranges of values!
Because large ranges are associated with this problem, some types of models are more prone to heteroscedasticity.
Heteroscedasticity in cross-sectional studies
Cross-sectional studies often have very small and large values and, thus, are more likely to have heteroscedasticity. For example, a cross-sectional study that involves the United States can have very low values for Delaware and very high values for California. Similarly, cross-sectional studies of incomes can have a range that extends from poverty to billionaires.
Heteroscedasticity in time-series models
A time-series model can have heteroscedasticity if the dependent variable changes significantly from the beginning to the end of the series. For example, if we model the sales of DVD players from their first sales in 2000 to the present, the number of units sold will be vastly different. Additionally, if you’re modeling time series data and measurement error changes over time, heteroscedasticity can be present because regression analysis includes measurement error in the error term. For example, if measurement error decreases over time as better methods are introduced, you’d expect the error variance to diminish over time as well.
Example of heteroscedasticity
Let’s take a look at a classic example of heteroscedasticity. If you model household consumption based on income, you’ll find that the variability in consumption increases as income increases. Lower income households are less variable in absolute terms because they need to focus on necessities and there is less room for different spending habits. Higher income households can purchase a wide variety of luxury items, or not, which results in a broader spread of spending habits.
Pure versus impure heteroscedasticity
You can categorize heteroscedasticity into two general types.
- Pure heteroscedasticity refers to cases where you specify the correct model and yet you observe non-constant variance in the residual plots.
- Impure heteroscedasticity refers to cases where you incorrectly specify the model, and that causes the non-constant variance. When you leave an important variable out of a model, the omitted effect is absorbed into the error term. If the effect of the omitted variable varies throughout the observed range of data, it can produce the telltale signs of heteroscedasticity in the residual plots.
When you observe heteroscedasticity in the residual plots, it is important to determine whether you have pure or impure heteroscedasticity because the solutions are different. If you have the impure form, you need to identify the important variable(s) that have been left out of the model and refit the model with those variables. For the remainder of this blog post, I talk about the pure form of heteroscedasticity.
Related post: How to Specify the Correct Regression Model
The causes for heteroscedasticity vary widely by subject-area. If you detect heteroscedasticity in your model, you’ll need to use your expertise to understand why it occurs. Often, the key is to identify the proportional factor that is associated with the changing variance.
What Problems Does Heteroscedasticity Cause?
As I mentioned earlier, linear regression assumes that the spread of the residuals is constant across the range of fitted values. Anytime you violate an assumption, there is a chance that you can’t trust the statistical results.
Why fix this problem? There are two big reasons why you want homoscedasticity:
- While heteroscedasticity does not cause bias in the coefficient estimates, it does make them less precise. Lower precision increases the likelihood that the coefficient estimates are further from the correct population value.
- Heteroscedasticity tends to produce p-values that are smaller than they should be. This effect occurs because heteroscedasticity increases the variance of the coefficient estimates, but the OLS procedure does not detect this increase. Consequently, OLS calculates the t-values and F-values using an underestimated amount of variance. This problem can lead you to conclude that a model term is statistically significant when it is actually not significant. (The simulation sketch just below illustrates this effect.)
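Here’s a quick simulation sketch of that second problem, using Python’s statsmodels. The true slope is zero, but the error spread grows with the predictor, so OLS rejects the (true) null noticeably more often than the nominal 5%:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, sims, alpha = 100, 2000, 0.05
x = np.linspace(1, 10, n)
X = sm.add_constant(x)

false_positives = 0
for _ in range(sims):
    # True slope is zero, but the error standard deviation grows with x.
    y = 5 + rng.normal(scale=x, size=n)
    p_value = sm.OLS(y, X).fit().pvalues[1]
    false_positives += p_value < alpha

print(f"Rejection rate: {false_positives / sims:.3f} (nominal: {alpha})")
```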
Related post: How to Interpret Regression Coefficients and P-values
If you see the characteristic fan shape in your residual plots, what should you do? Read on!
How to Fix Heteroscedasticity
If you can figure out the reason for the heteroscedasticity, you might be able to correct it and improve your model. I’ll show you three common approaches for turning heteroscedasticity into homoscedasticity.
To illustrate how these solutions work, we’ll use an example cross-sectional study to model the number of automobile accidents by the population of towns and cities. These data are fictional, but they correctly illustrate the problem and how to resolve it. You can download the CSV data file to try it yourself: Heteroscedasticity. We’ll use Accident as the dependent variable and Population for the independent variable.
Imagine that we just fit the model and produced the residual plots. Typically, you see heteroscedasticity in the residuals by fitted values plot. So, when we see the plot shown earlier in this post, we know that we have a problem.
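If you’re following along in your own software, here’s a minimal sketch of that first fit and plot using Python’s statsmodels (the file and column names are my assumptions based on the dataset description above):

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("Heteroscedasticity.csv")  # assumed file name for the download
X = sm.add_constant(df["Population"])
ols_fit = sm.OLS(df["Accident"], X).fit()

# Residuals by fitted values: look for the telltale fan/cone shape.
plt.scatter(ols_fit.fittedvalues, ols_fit.resid)
plt.axhline(0, linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```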
Cross-sectional studies have a larger risk of residuals with non-constant variance because of the larger disparity between the largest and smallest values. For our study, imagine the huge range of populations, from small towns to major cities!
Generally speaking, you should identify the source of the non-constant variance to resolve the problem. A good place to start is a variable that has a large range.
We’ve detected heteroscedasticity, now what can we do about it? There are various methods for resolving this issue. I’ll cover three methods that I list in my order of preference. My preference is based on minimizing the amount of data manipulation. You might need to try several approaches to see which one works best. These methods are appropriate for pure heteroscedasticity but are not necessarily valid for the impure form.
Redefining the variables
If your model is a cross-sectional model that includes large differences between the sizes of the observations, you can find different ways to specify the model that reduce the impact of the size differential. To do this, change the model from using the raw measure to using rates and per capita values. Of course, this type of model answers a slightly different kind of question. You’ll need to determine whether this approach is suitable for both your data and what you need to learn.
I prefer this method when it is appropriate because it involves the least amount of tinkering with the original data. You adjust only the specific variables that need to be changed in a manner that often makes sense. Indeed, this practice forces you to think about different ways to specify your model, which frequently improves it beyond just removing heteroscedasticity.
For our original model, we were using population to predict the number of accidents. If you think about it, it isn’t surprising that larger cities have more accidents. That’s not particularly enlightening.
However, we can change the model so that we use population to predict the accident rate. This approach discounts the impact of scale and gets to the underlying behavior. Let’s try this with our example data set. I’ll use Accident Rate as the dependent variable and Population as the independent variable. The residual plot is below.
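Here’s a sketch of that refit. I’m assuming the rate is accidents divided by population; check the column names in the actual download:

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("Heteroscedasticity.csv")               # assumed file name
df["AccidentRate"] = df["Accident"] / df["Population"]   # assumed definition

X = sm.add_constant(df["Population"])
rate_fit = sm.OLS(df["AccidentRate"], X).fit()

plt.scatter(rate_fit.fittedvalues, rate_fit.resid)
plt.axhline(0, linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```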
The residuals by fitted value plot looks better. If it weren’t for a few pesky values in the very high range, it would be usable. If this approach had produced homoscedasticity, I would have stuck with this solution and not used the following methods.
Weighted regression
Weighted regression is a method that assigns each data point a weight based on the variance of its fitted value. The idea is to give small weights to observations associated with higher variances to shrink their squared residuals. Weighted regression minimizes the sum of the weighted squared residuals. When you use the correct weights, heteroscedasticity is replaced by homoscedasticity.
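In symbols, OLS chooses the coefficients that minimize Σ(yᵢ − ŷᵢ)², while weighted least squares minimizes Σ wᵢ(yᵢ − ŷᵢ)². Observations with small weights, the high-variance ones here, contribute less to the fit.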
I prefer this approach somewhat less than redefining the variables. For one thing, weighted regression involves more data manipulation because it applies the weights to all variables. It’s also less intuitive. And, if you skip straight to this, you might miss the opportunity to specify a more meaningful model by redefining the variables.
For our data, we know that higher populations are associated with higher variances. Consequently, we need to assign lower weights to observations of large populations. Finding the theoretically correct weight can be difficult. However, when you can identify a variable that is associated with the changing variance, a common approach is to use the inverse of that variable as the weight. In our case, the Weight column in the dataset equals 1 / Population.
I’ll go back to using Accident as the dependent variable and Population as the independent variable. However, I’ll tell the software to perform weighted regression and apply the column of weights. The residual plot is below. For weighted regression, it is important to assess the standardized residuals because only that type of residual will show us whether weighted regression fixed the heteroscedasticity.
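Here’s a minimal sketch of that weighted fit, again with the assumed file and column names:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

df = pd.read_csv("Heteroscedasticity.csv")  # assumed file name

# Weight each observation by 1 / Population, as described above.
X = sm.add_constant(df["Population"])
wls_fit = sm.WLS(df["Accident"], X, weights=1.0 / df["Population"]).fit()

# Assess the weighted (whitened) residuals scaled by the residual SD,
# a simple standardization. The raw residuals would still look fan shaped.
std_resid = wls_fit.wresid / np.sqrt(wls_fit.scale)
plt.scatter(wls_fit.fittedvalues, std_resid)
plt.axhline(0, linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Standardized residuals")
plt.show()
```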
This residual plot looks great! The variance of the residuals is constant across the full range of fitted values. Homoscedasticity!
Transform the dependent variable
I always save transforming the data as a last resort because it involves the most manipulation. It also makes interpreting the results very difficult because the units of your data are gone. The idea is that you transform your original data into different values that produce good looking residuals. If nothing else works, try a transformation to produce homoscedasticity.
I’ll refit the original model but use a Box-Cox transformation on the dependent variable.
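Here’s a sketch of that step. SciPy’s boxcox picks the lambda by maximum likelihood, and Box-Cox requires a strictly positive dependent variable:

```python
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

df = pd.read_csv("Heteroscedasticity.csv")  # assumed file name
X = sm.add_constant(df["Population"])

# boxcox returns the transformed values and the lambda it chose.
y_trans, lam = stats.boxcox(df["Accident"])
bc_fit = sm.OLS(y_trans, X).fit()

plt.scatter(bc_fit.fittedvalues, bc_fit.resid)
plt.axhline(0, linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals (transformed scale)")
plt.show()
```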
As you can see, the data transformation didn’t produce homoscedasticity in this dataset. That’s good because I didn’t want to use this approach anyway! We’ll stick with the weighted regression model.
Keep in mind that there are many different reasons for heteroscedasticity. Identifying the cause and resolving the problem in order to produce homoscedasticity can require extensive subject-area knowledge. In most cases, remedial actions for severe heteroscedasticity are necessary. However, if your primary goal is to predict the total amount of the dependent variable rather than estimating the specific effects of the independent variables, you might not need to correct non-constant variance.
If you’re learning regression and like the approach I use in my blog, check out my Intuitive Guide to Regression Analysis book! You can find it on Amazon and other retailers.
SE says
Thanks very much for the helpful explanation !
In Stata, is the robust standard errors option weighting the residuals (your second approach to achieving homoscedasticity)?
Rita Fontes says
Hi Jim,
I have a question regarding this matter.
Supposing we have a model with several independent variables and, overall, the model is homoscedastic, but when analysing the plot of each regressor versus the residuals, some regressors show heteroscedasticity. In this case, should we correct the heteroscedasticity for each variable, or is the overall plot (fitted values vs. residuals) the only thing that matters?
Thanks in advance!
Sean Pitcher says
Hey Jim,
Bought your book on linear regression. Best $14.00 I ever spent. Well written and answered a LOT of questions. I have a linear regression I did in R. I like doing residual QQ plots because you can really see if your residuals are normal. What’s your opinion on using these to see if the residuals are normally distributed? Also, what’s your opinion on a plot like this that shows perfect adherence of the residuals to the QQ straight line but some pretty serious deviance at -2 and +2?
Thanks so much for writing these books. Gonna get my hands on the others in your series.
Jim Frost says
Hi Sean,
I’m so glad to hear the great review about my regression book! Glad it was helpful!
I’m a big fan of Q-Q plots. Q-Q plots are the rare case where I think the graph is better than a hypothesis test in making a decision. The problem with normality tests is that with small samples, everything looks normal to these tests, and with large samples, everything looks non-normal. And Q-Q plots are much easier to use than histograms for comparing data to a normal (or other) distribution.
To read more on my thoughts about Q-Q plots, read my posts about identifying the distribution of your data and another one where I compare Q-Q plots to histograms for assessing normality. In both those posts, I’m talking about the distribution of your data, but the same principles apply to assessing residuals for normality.
As for how much deviation is too much, at the statistical software company where I used to work, we taught the fat pencil test for Q-Q plots. If you were to overlay a fat pencil on the straight line of a Q-Q plot, would it cover all the data points? If so, you’re probably OK. If not, I’d be concerned. Your data just have to be roughly normal.
Dr Shamshad says
Hi Jim
We faced the same issue in our SPSS workshop. We plotted the residuals against the unstandardized predicted values and found that heteroscedasticity may be present. But all the studentized residuals were within 3 standard deviations. So my question is: when you say the residuals should have constant variance, do you mean within 3 SDs?
Jim Frost says
Hello,
Heteroscedasticity has more to do with the fact that the variances are unequal. The 3 SD guideline helps detect outliers, but that’s a separate matter. Residuals can have non-constant variance even when they’re all within 3 SDs.
Matt says
Hi there, great post!
I am a bit confused about heteroscedasticity in time-series analyses… is it okay to have? My data is 7.25 years, and with an intervention 3.90 years in. There was a drastic change in the post-intervention (~48% reduction). My data is non-normal because I want to reflect the true situation. I did remove the top and bottom 1% which were major outliers. When I run a linear regression on the Pre alone and the Post alone, there is heteroscedasticity in both. However, the predictors for the dependent variable are vastly different between the Pre and Post. Log-transforming the data, while making it linear, does not appear to reflect the true situation, as the predictors for the dependent variable Pre/Post intervention are nearly exactly the same, whereas they were different (and they were different for a logical reason!).
Best,
Matt
Chuck Utterback says
Hi Jim – great posts, I’ve read several of yours on linear regression. Your posts have some of the most insightful guidance on how to practically use linear regression with confidence. Thanks! Chuck Utterback.
Jim Frost says
Hi Chuck,
Thank you so much for your kind words! 🙂 I’m glad my posts have been helpful!
Jim Knaub says
Rabia –
Are you asking me? I was only commenting that p-values by themselves are not useful. I don’t know what you expect to get out of it.
As for binomial regression, I don’t know much about that. I suppose that you could use graphical residual analysis for the beta x part. But if you estimate the coefficient of heteroscedasticity there, I haven’t looked at how that relates to the binomial. Not something I’ve done. Don’t know.
Best wishes –
Jim Knaub
RABIA NOUSHEEN says
Hi Jim
Thank you for your comments. I have checked the standard errors and they are far less than those obtained from the untransformed data. Here I want to ask about heteroscedasticity-consistent (HC) standard errors. Please tell me, is it a good choice to report p-values corresponding to HC standard errors when our dependent variable shows heteroscedasticity and a non-normal distribution even after transformation? Or is it better to switch to another statistical test when the assumptions of the GLM are not met? I also want to know what the difference is between heteroscedasticity and overdispersion; are these different terms?
I hope you are getting my point. Looking forward to your advice. Thanks a lot for being so helpful.
Jim Knaub says
Jim Knaub again –
Kirsty – I know it is Kirsty, not Kristy, but my phone keeps changing spelling without my permission. 🙂
Rabia –
Please let me just note that a lone p-value is rather meaningless. It changes with changes in sample size. (So does a standard error, as opposed to a standard deviation, but at least a standard error has a more intuitive meaning.)
Just sayin’ – Jim Knaub
Jim Knaub says
Jim Knaub here. I hope you don’t mind my interjecting something here, Jim Frost and Kristy.
I’m thinking that you might want to use a graphical residual analysis to check out your model fit, which often also includes a look at heteroscedasticity, if you know what to “look for.” If you plot the predicted y values on the x-axis and estimated residuals on the y-axis, such a graph can be helpful. You can research the terms “graphical residual analysis” for studying model fit, and “cross-validation” to study the possibility that you have overfit to your particular sample.
It is sometimes hard to recognize heteroscedasticity, and best to determine a good coefficient of heteroscedasticity, a measure, rather than to use an hypothesis test. Please see https://www.researchgate.net/project/OLS-Regression-Should-Not-Be-a-Default-for-WLS-Regression, and various updates. There is an update in there regarding hypothesis tests.
BTW, you used some logs. Was that because of the relationship between the independent and dependent variables? You could graph them, though the use of other independent variables may modify that relationship. You could try different things to see what model did well on the graphical residual analysis without overfitting, considering cross-validation.
Best wishes – Jim Knaub
RABIA NOUSHEEN says
Hi Jim
I have a query about heteroscedasticity in the model. I have data from a binomial family. I did best-fit model selection based on AIC. My first model with full parameterization had a normal distribution of residuals and constant variance. The final model did not show normality, and the variance was not constant either. I transformed the dependent variable using the arcsine square root transformation method. When the final (best fit) GLM model was run with the transformed response variable and I tested normality and homogeneity of variance, I got the following p-values:
White test (variance test): p-value = 0.04
Shapiro-Wilk test (normality test): p-value = 0.05
What should I conclude? Should I still be concerned about heteroscedasticity? Is the approach I followed the right one?
I shall be thankful for your help.
Kirsty Debono says
Hi Sir. First of all thank you for your very insightful articles!
I wished to ask whether there are any particular implications on heteroscedasticity when regressing a cross-sectional data model in which most of the regressors are dummy variables?
I ask because I am currently working on my undergraduate dissertation and am not detecting any problem of heteroscedasticity, which I’m finding particularly strange given the use of cross-sectional data. My model consists of a linear dependent variable, and most of my regressors are dummy variables, except for ln(income), ln(netwealth), and household size. I am suspecting that the seemingly ‘perfectly’ homoscedastic errors I am getting could be a result of the extensive use of dummies in my model. Could this be the case? Or is there perhaps a specific test that is ideal for testing for heteroscedasticity in this type of model? I am currently using the Breusch-Pagan-Godfrey test.
Thank you in advance.
Mike says
Jim, Thank-you for the great post. I was hoping to tempt you into commenting on a problem that touches on the impure aspects of heteroscedasticity. Let’s say in your example of Accidents vs Population, after running through all three methods of addressing the heteroscedasticity issue, you were still unable to “fully” remove the fan shape in the fitted residuals and you were convinced that you were missing one or more variables. In talking to the national traffic safety lead about the problem, he indicated that lane size is also a contributing factor in accidents. After plotting Accidents vs Lane Size (let’s say they range from 8.5′ to 13′) the scatter plot looks “pseudo-random” but with a curious peak (increased number of accidents) in the data between 10 and 11.5′. When you follow up with the lead, he qualifies his statement by saying that one of his inspectors has noticed over the years that in low population areas, the number of accidents remains fairly constant and low. But as the population and lane size increase, they see an increase in the number of accidents “up to a point,” and then after that the number of accidents drops off with increasing lane size irrespective of population. When you go back to the office, aggregate the data based on population ranges, and re-plot Accidents vs Lane Size as a function of population, you see a fairly clear layered pattern emerge based on the population. How do you proceed? My initial thought was to perform OLS on Accidents vs Lane Size for each distinctive population range. Is this a valid approach or are there other steps that must be considered?
Jim Frost says
Hi Mike,
This problem illustrates the importance of subject-area knowledge. There might very well be some other variable behind the scenes that relates to low/high population areas to a degree but would be worthwhile adding or using as a weighted value. So, I’d think along those lines. That’s not my area of expertise, but here are some things to consider.
I’m assuming that you’re using accident rates? Or something like that rather than raw numbers of accidents? If you’re using the raw values, I’m not surprised that you’d see the fan shape!
Did you try using population or lane size as the weights?
Are they including a polynomial to capture the curvature in the relationship between lane size and accidents? Be sure to check the residuals by other variables to see if there are any uncaptured relationships.
Could there also be an interaction effect between population and lane size?
Some of my suggestions don’t directly address heteroscedasticity but could help explain more of the variance. In general, try to think of some variable that increases with accidents. Perhaps there’s some measure of average traffic density? Perhaps that could be the weights. There might be roads in largely lower density areas that, for whatever reason, have higher density traffic (popular areas).
Hopefully some of those ideas might be helpful! Those are the types of thing I’d think about.
You could try dividing up the populations by sizes and fitting separate models as you describe. If the separate models still meet your needs, I don’t see anything inherently wrong with that. If that works, you could consider including the size categories as a categorical variable in the model with all data to see if that similarly fixes it.
Joseph Lombardi says
Is there any value in plotting a histogram of the residuals? Do — or should — the residuals follow a normal distribution, or would you expect to see all the “buckets” filled approximately equally? If you see a histogram skewed left, would that be an indication of heteroskedasticity?
Cheers,
Joe
Jim Frost says
Hi Joe,
When I worked at Minitab, we intentionally started to deemphasize using histograms of the residuals to assess normality (and in general, not just for residuals). Their residual plots still include the histogram, but I wouldn’t use it for that purpose. Instead, I recommend using a normal probability plot of the residuals. It’s simply easier to determine whether the residuals follow a normal distribution with that type of plot. To learn the reasons, go to my post about using normal probability plots. It’s not written in the context of residuals, but the same reasoning applies.
A skewed distribution doesn’t necessarily mean that you have heteroscedasticity. When your range of residuals increases, you get both more high and low values (i.e., both ends of the distribution) than you’d expect. Instead, check the residuals vs. fits plot and look for the fanning pattern I describe in this post.
Jose Manuel PereñÃguez López says
Thanks Jim!!
Yes, my case could be kind of similar to the one you posed. I build time intervals (e.g., 1 hour) because, in ecology, researchers usually do that to relate variables (e.g., animals’ height with the number of humans present in a specific area in the same 1-hour time interval). My `old` method has the disadvantage of using a device whose battery is consumed in a few hours, and we also have to retrieve the device from the animal. The `new` device can last for months and data is sent (we do not have to catch the animal again to get the data). The problem with the `new` method is that data is received by the receivers only if the animal is not hidden, so the type of data we get depends on the animal’s behaviour. So there is a trade-off. I am investigating the pros and cons of the `novel` method compared to the `old` one, which is more accurate for sure, and thus I want to calculate `R²` to know how close my novel method is to the old one. I also want to know how critical the number of records I get in those time intervals is for the predictions, and that’s why I incorporate `z` in the model.
Thanks for the tips. I will continue to search for a solution…
James Knaub says
Jose –
If time is involved, that might possibly indicate a more complex model may be needed. However, I cannot really advise on panel regression or hierarchical models, except to say that there seems to be some “cross-validation” that you can do.
Cheers – Jim
PS – I just took a quick look at that stackexchange posting, and I’m not clear on the role of time there. Perhaps it is only to try to avoid repeatedly measuring the same animal, I don’t know. But comparing an old with a new measure for the same thing sounds functionally similar to when the office I was in switched data collection modes for collecting energy data, and I suggested that if both modes were used simultaneously for a while, the results of one could be used to predict the other in a ratio model (one regressor/predictor/independent variable, with no intercept term). I did not look at the stackexchange site for long, and do not know if this carries over to your work to some degree, but perhaps it may, and heteroscedasticity is very much an important consideration. – Best wishes.
Jose Manuel PereñÃguez López says
Dear Jim,
Thank you very much for all the time you took :). You are a wonderful communicator.
First, I have to say that I made a horrible mistake, since `X` is the predictor of `Y`, not the opposite. That’s why one variable is `Y` and the other is `X`. However, I don’t think I can edit my comment.
I posted this question with further details on Cross Validated (https://stats.stackexchange.com/questions/488423/can-i-validate-a-residual-plot-although-it-has-residual-patterns-if-i-am-not-int), where I can upload pictures showing the relationship between `Y` and `X`, the distribution of `Y`, and the residual plots. I said that I calculate R² because `Y` is a worldwide-validated technique for measuring a variable and `X` is a novel technique for measuring the same thing, but this new technique has some constraints and I want to know how well (= R²) it works. One of the constraints of this technique is that you get more or fewer samples (-> `Z`) for a given time interval depending on the animal’s behaviour. That’s why I introduced `Z` in the model.
Thank you very much for your valuable time. I asked you given your huge expertise in those issues 🙂
James Knaub says
Jose –
The way regression equations are written now, y is a random variable, and though there can be errors-in-variables regression, all the ‘independent’ variables are on the right side of the equation, along with the estimated residual term. So when you say “I am assessing how well a variable (`y`) predicts another variable (`x`), and the effect of a second quantitative predictor (`z`),” that is the reverse of modern usage, though z could be another x “predictor” for y. But, here, we instead seem to have z and y as independent variables and x as the dependent variable, when y should be the dependent variable. Further, heteroscedasticity, which impacts the estimated variance of the prediction error for prediction intervals associated with predicted y, is impacted by predicted y, not just any one individual independent variable, unless one independent variable is all that is needed to adequately ‘predict’ y.
As predicted y gets larger, we should generally see larger sigma for the estimated residuals, and also often larger estimated variance of the prediction error, though there are other considerations there.
Your software may let you enter a regression weight, w, to account for heteroscedasticity. If y*’ is a preliminary predicted y, even if not the final predicted y, y*, then w can be y*’ ^(-2gamma), where gamma is the coefficient of heteroscedasticity I gave earlier. (It is often practical to let preliminary predicted y, y*’, be y_hat, the OLS predicted y.)
You asked “How can this heteroscedasticity affect to my conclusions?” I guess you mean if you ignore it. Well, I have not worked with a gamma distribution, just skewed data where all variables were skewed and straight line relationships occurred, but I suspect that if you ignore heteroscedasticity your predictions may not change a great deal from the better predictions you could have, at least in most cases, I think, but the estimated variance of the prediction error will possibly be very far in error in many or most cases. This is because larger predictions should be associated with larger sigma for the estimated residuals, and smaller predictions should be associated with smaller sigma for the estimated residuals, but you will have assumed constant such sigma.
You said that “My problem is that I find higher variance in my residuals for low values of `x`….” That sounds backward, but I can think of a simple case where that could happen as shown here:
y = a – bx + e
where a is a constant, often written as b with a subscript 0, and e is the estimated residual (which can be factored into random and nonrandom factors). Here predicted y is a – bx, so predicted y gets smaller with increased x, but if negative y is not possible, then there is a restriction as to the range for which this will apply.
I haven’t worked with GLM and may be missing some point here. But also I’m a little confused, starting with your statement that “I am assessing how well a variable (`y`) predicts another variable (`x`)….”
Cheers – Jim
Jose Manuel PereñÃguez López says
Hi Jim,
Thank you for this really nice post.
I wanted to ask you something very crucial for me. I am assessing how well a variable (`y`) predicts another variable (`x`), and the effect of a second quantitative predictor (`z`). I use a GLMM with a gamma distribution because `y` follows a non-normal distribution and I have 6 tracked individuals. What I do is just compare different models that have my predictors by their AIC, to identify which combination of predictors is the best to predict `X`. Then, I calculate r2m (variance explained by the fixed factors). My problem is that I find higher variance in my residuals for low values of `x`, and I know that this is pure heteroscedasticity since I know that `x` at low values predicts `y` badly.
Could I just use my model although I see some heteroscedasticity in the residuals? How can this heteroscedasticity affect my conclusions? I don’t look at the coefficients.
Thanks!!!
James R. Knaub, Jr. (Jim) says
Joey –
I’m not sure if your question was directed toward Jim Frost, of statisticsbyjim, or me, Jim Knaub, but I’d like to place a reply here.
You ask “…how do we check for heteroscedasticity if the predictor is a categorical variable?”
A graphical residual analysis is a great way to check on model fit. (A cross-validation study is good to help avoid overfitting.) A graphical residual analysis, with predicted y on the x-axis, and estimated residuals on the y-axis can also be the first step in measuring heteroscedasticity, and I do not see that it matters if some or all of your independent variables may be categorical. You can use the paper I linked earlier and the Excel tool I provided as long as your regression is of the form y = y* + e, where y* is your predicted y, and e is the estimated residual, where e will be factored into a random factor, and a nonrandom factor. The nonrandom factor is a size measure (predicted y) raised to the coefficient of heteroscedasticity (gamma). The regression weight is the size measure raised to the negative of two times gamma.
There is more in that project link I provided.
However, if you are talking about logistic regression, that is a substantial step further from the y = y* + e format. Perhaps someone else could give you some information on that.
Cheers – Jim Knaub
Joey says
Hi Jim!
Great post!
May I know how do we check for heteroscedasticity if the predictor is a categorical variable?
Thanks,
Joey
James R. Knaub, Jr. says
Ghidena –
If your software lets you enter a regression weight expression, then it should use the correct formulas. These formulas are a result of minimizing the sum of squares of estimated residuals, subject to the regression weights. That is, we minimize Q, the sum of the squares of the random factors of the estimated residuals. For examples of that, please see the bottom of page 2 of https://www.researchgate.net/publication/263036348_Properties_of_Weighted_Least_Squares_Regression_for_Cutoff_Sampling_in_Establishment_Surveys for a simple example, and pages 597 and 598 in https://www.researchgate.net/publication/261534907_WEIGHTED_MULTIPLE_REGRESSION_ESTIMATION_FOR_SURVEY_MODEL_SAMPLING, for a more complex case where multiple independent variables, or even an intercept term, means solving simultaneous equations. These formulas should already have been included in your software.
To determine the regression weights, more than one method is noted in https://www.researchgate.net/publication/333642828_Estimating_the_Coefficient_of_Heteroscedasticity, but the primary emphasis here is on using a coefficient of heteroscedasticity. That is because https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity shows this to be naturally occurring. Examples in “Estimating the Coefficient of Heteroscedasticity,” show you how this works. To apply it yourself you can use https://www.researchgate.net/publication/333659087_Tool_for_estimating_coefficient_of_heteroscedasticityxlsx. This is discussed in various updates to https://www.researchgate.net/project/OLS-Regression-Should-Not-Be-a-Default-for-WLS-Regression, which is on the fundamental nature and magnitude of heteroscedasticity, for regressions of form y = y* + e, most useful in predictions for finite populations.
Cheers – Jim Knaub
Ghidena says
Sir, do you mean that we should leave the data after checking its heteroscedasticity? And is there no formula for this?
Jim Frost says
Hi Ghidena,
I don’t know what you mean by “leave the data.” You should check the residual plots for evidence of heteroscedasticity. If you see heteroscedasticity, you should fix it using one of the methods I cover. I don’t know what formula you want.
Jim Knaub says
Well, let me say that I have learned from Statistics by Jim (Frost), and I highly recommend it. These question and answer participations are a nice additional feature. – Jim K
Jim Knaub says
Richard Hart –
Perhaps I could interject something here.
Your project is not entirely clear to me, but yes, the main feature missed by ignoring heteroscedasticity is the large influence on the estimated variance of the prediction error, which with weighted least squares (WLS) regression will be larger with larger predictions. The OLS predicted values themselves are unbiased, but could be substantially different in practice from the WLS predictions when sample sizes are small.
Here is something I have sent to other people. Please note the spreadsheet, at the end, for finding a better value for the coefficient of heteroscedasticity than 0, which would indicate OLS:
=========================
Regarding heteroscedasticity:
If you might be interested, the following is with regard to the nature of heteroscedasticity, and consideration of its magnitude, for various linear regressions, including polynomials, which might be further extendable:
https://www.researchgate.net/project/OLS-Regression-Should-Not-Be-a-Default-for-WLS-Regression
Please particularly note
https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity
and
https://www.researchgate.net/publication/333642828_Estimating_the_Coefficient_of_Heteroscedasticity
A spreadsheet tool for estimating or considering a default value for the coefficient of heteroscedasticity, developed for linear regression, is found here (with references):
https://www.researchgate.net/publication/333659087_Tool_for_estimating_coefficient_of_heteroscedasticityxlsx
That leads to the regression weight expression which can be entered as “w” into SAS PROC REG. I assume this is similar for other statistical software. Note that OLS regression is a special case of WLS (weighted least squares) regression, where the coefficient of heteroscedasticity is zero and weights are all equal.
See Brewer, K.R.W.(2002), Combined survey sampling inference: Weighing Basu’s elephants, Arnold: London and Oxford University Press, especially pages 111, and 87, 130, 137, 142, and 203.
========================
I am not certain that your problem is related to predicting totals from a sample, with a previous census as regressor data, but the following is for that case and some cases a little more complicated:
https://www.researchgate.net/publication/319914742_Quasi-Cutoff_Sampling_and_the_Classical_Ratio_Estimator_-_Application_to_Establishment_Surveys_for_Official_Statistics_at_the_US_Energy_Information_Administration_-_Historical_Development
(There I often used a coefficient of heteroscedasticity of 0.5, as described in the links above, which appeared robust to data quality issues often occurring with smaller establishments. Otherwise a larger value would usually have been an improvement.)
I hope this is helpful for you, but as I said, I’m not really certain that you are looking at the same thing. Maybe it is somewhat related though.
Cheers – Jim Knaub
Jim Frost says
Thanks for your thoughtful, detailed reply, Jim Knaub!
Richard, Jim K. is definitely more knowledgeable in this topic than I am! It looks like he gave you some top-notch advice!
Richard Hart says
Hi Jim – What a helpful article. We’re hoping you might be a life-saver for our project. We have a linear equation that describes an industrial facility’s energy use based on the prior year of operation. It shows heteroscedasticity at the low end of production. One of our reviewers doesn’t like this. We’ve struggled to find a better model but we can’t.
At the bottom of the article you state, “if your primary goal is to predict the total amount of the dependent variable rather than estimating the specific effects of the independent variables, you might not need to correct non-constant variance.” So we’re wondering if we really need to fix this issue because what we’re ultimately measuring is the cumulative sum of differences between the modeled consumption and the actual consumption.
What do you think?
phabdallah says
Hi,
Thank you so much for your great explanation. Is there heteroscedasticity in the graphs in the following links:
1- [link removed]
2- [link removed]
The second graph is for same data but after excluding patients aged over 85 years.
I am a student doing an analysis on a large sample (n = 168,614).
Thank you for your help.
Jim Frost says
Hi,
I don’t believe so. In the first one with the residuals by fitted values, it does taper off at the high end. But, that might be a consequence of there simply being fewer observations at the very high end, which means you don’t see the full spread of values. You’re less likely to get the relatively less common observations that fall further away.
The 2nd model appears to be like the first model but with a truncated range on the residuals by fitted values, cut off at the +2 standardized fitted value. I don’t know if it’s exactly the same data or not, but it looks like it. I don’t see evidence of heteroscedasticity there either. I just reread your comment more closely and noticed that they are the same data with patients aged 85+ removed. Interestingly, that lops off your standardized fitted values at +2 very precisely. I wonder if including age-related info in the model might improve the model? The predictions appear to be related to age.
However, both models appear to have a negative trend in the residuals by fitted values plot. You might need to address that.
Best of luck with your analysis!
Jim Knaub says
Understood. Interesting, thanks. I’m having some trouble following back on my phone to see what was said about the BMI data. You do still have heteroscedasticity when it takes fewer points in the y direction on the right side of the scatterplot to cover the same range as on the left. I estimated the coefficient of heteroscedasticity from the data and got about 0.6. I like including regression weight w, written in terms of the coefficient of heteroscedasticity, which should not matter much for predictions here, but could for the estimated variance of the prediction error for each predicted y.
Jim Knaub says
Nice interpretation and good point Jim. According to page 111 of Brewer, K.R.W. (2002), Combined survey sampling inference: Weighing Basu’s elephants, Arnold: London and Oxford University Press, 1/x^2 would be the extreme (not accounting for influences such as omitted variables and data quality issues).
Jim Frost says
Thanks, Jim K. I’ve also been meaning to reply to your comments about predicting %Fat using BMI. I think there are several reasons why that model produces homoscedasticity.
The dependent variable is a rate (percent). That’s often a great way to recode variables to reduce heteroscedasticity. If I had used an absolute value, such as total fat in grams, the potential for heteroscedasticity increases because there would be a larger range from the smallest to largest values. Additionally, if the study had included a wider range of individuals (e.g., age), that would’ve produced an even wider range of values. It appears that using a percentage and a narrowly defined population probably helped produce homoscedasticity.
I mention in the post that recoding variables from absolute values to rates is my preferred method for fixing heteroscedasticity. I also think expressing variables as rates is often more meaningful than the absolute measure in these cases. Of course, that will vary by subject area and specific application.
Marissa Albers says
Jim! Thank you! Very helpful and intuitive article.
Jim Frost says
Thank you, Marissa!
James R. Knaub, Jr (Jim) says
Howdy Jim Frost –
Something to consider:
Besides the analysis by Ken Brewer, it makes sense that a predicted value of 10,000,000 would have a larger variance of the prediction error than there would be for a predicted value of 10.
So if heteroscedasticity is the natural default, I’ve been looking at reasons why many regression scatterplots look homoscedastic. One interesting category is plots that look homoscedastic but really are heteroscedastic. An example is the nonlinear plot you have for predicting %Fat from BMI. We are interested in the vertical distribution of the points. It looks like the vertical range is a little greater on the lower end of the applicable BMI range, and thus likely the lower predicted y, i.e. y*, range. However, because the vertical point density at lower y* is greater, the variance of the prediction error is greater for larger y*. That is, points on the right side of the graph are spread out more in the vertical direction. When I measured it, the estimated coefficient of heteroscedasticity proved to be in the 0.5 to 1.0 range in my paper I last noted – closer to 0.5.
BMI is a very good (nonlinear) predictor for %Fat, at least for the range of BMI given. But for other models, we don’t always have such good predictors. But here we do have a straightforward model, apparently good data, and heteroscedasticity in the expected range for the coefficient of heteroscedasticity.
Cheers – Jim Knaub
James R. Knaub, Jr (Jim) says
Joe, Jim F., et.al. –
From work done by Ken Brewer, discussed in https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity, it seems apparent that the best size measure which should be raised to the coefficient of heteroscedasticity to obtain the nonrandom factor of the estimated residuals is a good preliminary prediction of y. I’m working on background now to discuss how heteroscedasticity should be “ordinary,” and we only approximate homoscedasticity when the model has (a) omitted variable(s) of substantial importance, or maybe some other model issue, and/or data quality issues. So if one independent variable seems to have a lot to do with heteroscedasticity, it might often be a good single regressor. However, I think all the regressors should be used together for the size measure.
Cheers –
Jim K.
Jim Frost says
Hi Jim K.,
Thanks for sharing your article! I’ll need to take some time to digest it!
Joe C. says
You state,
“However, when you can identify a variable that is associated with the changing variance, a common approach is to use the inverse of that variable as the weight. In our case, the Weight column in the dataset equals 1 / Population.”
Since you’re squaring the error before summing, shouldn’t the weight be 1/Population^2?
Jim Frost says
Hi Joe,
It’s true that you square the differences and then sum them for the SSE or variance, but that’s a separate matter from what weighted least squares (WLS) regression is doing. WLS uses the weight to shrink the variances of the residuals. The best weight depends on the relationship between the increase in fitted value and the increase in the residuals’ variances. That’s not necessarily going to be 1/X^2.
Waseem Jan says
can you please explain how to eliminate heteroscedasticity from panel data regression?
Jim Frost says
Hi Waseem, I’m pretty sure you’d use the same techniques that I discuss in this post. I’m not familiar with any techniques unique to panel data.
Patrik Silva says
Hi Jim, I am here again!
I would like to ask you a question that’s cooking my neurons!
I have a set of independent variables (IVs) that I am assuming will affect my dependent variable (DV). However, when I use them in step-wise regression, they are revealing some consistency with my previous hypothesis; in fact, they are statistically significant with a reasonable R-squared. Therefore, in this case, do I need to check the residuals for normality and homoscedasticity?
For now, I just want to prove that a particular variable (IV) is associated with my DV. If I am not interested in knowing how much, should I stop only by analyzing the R-squared and p-value? In this case of linear regression, which value should I talk about (R, R-squared, or adjusted R-squared)?
Please try to answer this question by explaining a bit about the differences between predictions, estimations, and inferences.
Thank you in advance!
Sulaiman Inuwa Muhammad says
Thank you Jim for the wonderful explanation
Jim Knaub says
Fidel/Jim Frost –
I do not know what is available on R and other software, but one thing available on SAS is SAS PROC REG, for WLS. There you enter the regression weight as “w.”
Also, in SAS PROC REG, the estimated residuals can be printed, but unless it has changed, I noted long ago that it wasn’t really printing estimated residuals under the heading “residuals,” but actually the random factors of the estimated residuals (after accounting for regression weight).
Have a nice weekend.
Jim Knaub
Jim Frost says
Thanks, Jim Knaub! As usual, I appreciate your contributions!
fidel says
Hi, Jim. Would you please explain how to tell the software to perform the weighted regression? Should we go to Regression – Weight Estimation or Regression – Linear – WLS Weight?
Because I just went to Regression – Weight Estimation, and there was no option to Save the Residuals
Thank you, Jim
Great articles indeed you’ve posted. You do explain everything intuitively.
Jim Frost says
Hi Fidel, thanks so much for your kind words!
How to do this depends on the specific software that you’re using. The type of weighted regression that I refer to in this post is weighted least squares (WLS). So, I think that’s your best choice if you’re trying to do what I describe.
Jim Knaub says
PS – I have just worked with heteroscedasticity common to establishment surveys (finite population, cross-sectional surveys) that are used for production of official statistics.
https://www.researchgate.net/publication/320853387_Essential_Heteroscedasticity
https://www.researchgate.net/publication/324706010_Nonessential_Heteroscedasticity
https://www.researchgate.net/publication/319914742_Quasi-Cutoff_Sampling_and_the_Classical_Ratio_Estimator_-_Application_to_Establishment_Surveys_for_Official_Statistics_at_the_US_Energy_Information_Administration_-_Historical_Development
Jim Knaub says
Jim Frost – I am not sure what Frank means. Do you think he is saying that the ‘errors’ are not independent, so he needs to use the full GLS (generalized least squares) approach, not just dealing with heteroscedasticity?
Frank – Don’t know, but you may find software that deals with GLS. Pretty sure SAS has that, maybe R and others. Whatever you do, you can test your model performance by saving out some data you have and checking to see if you would predict it well … besides the usual graphical residual analysis. – Maybe Jim Frost, the owner of this site, or someone else might have a better idea of what you need.
Cheers – Jim Knaub
Frank Sauerbier says
Hi Jim, thank you for this helpful information.
Could you give me a hint on how to deal with heteroscedasticity if the dependent variable is somehow related to variance itself? In detail, I investigate numbers (that are the standard deviation of a measure) and their dependence on the error rate (false positives). The residual plot is like a fan. But I think it could be intrinsic that the variance increases with the error rate.
Thanks in advance!
Frank
Jim Knaub says
Sani –
Probably the most used measure of heteroscedasticity is determined by the Iterated Reweighted Least Squares (IRLS) method. The algorithm is well-described in Carroll and Ruppert(1988), Transformation and Weighting in Regression, Chapman & Hall, Ltd. London, UK.
You could also see the following:
https://www.researchgate.net/publication/263809034_Alternative_to_the_Iterated_Reweighted_Least_Squares_Method_-_Apparent_Heteroscedasticity_and_Linear_Regression_Model_Sampling
and
https://www.researchgate.net/publication/263032446_Weighting_in_Regression_for_Use_in_Survey_Methodology (Note that the use of “w” in the notation on page 2 was not a good choice when avoiding confusion with the regression weight, of which this is a part. Sorry.)
I’d say that “pure” heteroscedasticity is ‘a feature, not a bug.’
You can use WLS and estimate variances of prediction errors and prediction intervals, keeping the heteroscedasticity in the error structure where it naturally occurs, with no transformation.
Cheers – J. Knaub
Jim Frost says
Thank you for sharing the excellent information!
Hiral Godhania says
Hello sir,
All of your articles are very useful. Sir, could you please explain again how to detect heteroscedasticity and what the implications of it are?
Jim Frost says
Hi Hiral, this very post provides all of the information that you seek. It describes how to use residual plots to detect heteroscedasticity, the problems it causes, and possible solutions. You’re looking in the right place for the information you need!
Sani says
Good day Jim,
Please, can you help me with any method that can measure (not test) heteroscedasticity in a data set?
Jim Frost says
Hi Sani,
There are tests for heteroscedasticity, but I’m not overly familiar with them. I always go by the residuals by fits plot myself because the pattern is very distinct. I’ve heard of the White test and the Breusch-Pagan test. You can look into those if you need to.
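If you want to try them, here’s a minimal sketch using Python’s statsmodels, with toy data just to show the calls:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan, het_white

# Toy data where the error spread grows with x.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
X = sm.add_constant(x)
y = 2 + 3 * x + rng.normal(scale=x)

results = sm.OLS(y, X).fit()
bp_lm, bp_pval, _, _ = het_breuschpagan(results.resid, results.model.exog)
w_lm, w_pval, _, _ = het_white(results.resid, results.model.exog)
print(f"Breusch-Pagan p-value: {bp_pval:.4g}")
print(f"White test p-value: {w_pval:.4g}")
```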
Sorry, I just noticed that you’re looking for a measure rather than a test. Hmmm, these tests presumably need to measure and use a test statistic, maybe that is what you’re looking for? I’m not sure.
Jim
Senghort Kheang says
Dear guys
Do you have a book 📚 or an article related to machine learning?
I am just starting to learn, but my professor gave me a research topic: real estate data visualization in Phnom Penh.
Can anyone help me?
Nick says
Really nice post. Thanks. Perhaps you could also mention using robust standard errors (Huber-White, etc.). This involves no data transformation but trades efficiency for robustness.