The Gauss-Markov theorem states that if your linear regression model satisfies the first six classical assumptions, then ordinary least squares (OLS) regression produces unbiased estimates that have the smallest variance of all possible linear unbiased estimators.

The proof of this theorem goes well beyond the scope of this blog post. However, the critical point is that when you satisfy the classical assumptions, you can be confident that you are obtaining the best possible coefficient estimates. The Gauss-Markov theorem does not merely say that these are the best estimates the OLS procedure can produce; it says they are the best possible estimates from *any* linear unbiased estimator. Think about that!

In my post about the classical assumptions of OLS linear regression, I explain those assumptions and how to verify them. In this post, I take a closer look at the nature of OLS estimates. What does the Gauss-Markov theorem mean exactly when it states that OLS estimates are the best estimates when the assumptions hold true?

## The Gauss-Markov Theorem: OLS is BLUE!

The Gauss-Markov theorem famously states that OLS is BLUE. BLUE is an acronym for the following:

Best Linear Unbiased Estimator

In this context, the definition of “best” refers to the minimum variance or the narrowest sampling distribution. More specifically, when your model satisfies the assumptions, OLS coefficient estimates follow the tightest possible sampling distribution of unbiased estimates compared to other linear estimation methods.

Let’s dig deeper into everything that is packed into that sentence!

## What Does OLS Estimate?

Regression analysis is like any other inferential methodology. Our goal is to draw a random sample from a population and use it to estimate the properties of that population. In regression analysis, the coefficients in the equation are estimates of the actual population parameters.

The notation for the model of a population is the following:

y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε

The betas (β) represent the population parameter for each term in the model. Epsilon (ε) represents the random error that the model doesn’t explain. Unfortunately, we’ll never know these population values because it is generally impossible to measure the entire population. Instead, we’ll obtain estimates of them using our random sample.

The notation for an estimated model from a random sample is the following:

y = β̂0 + β̂1x1 + β̂2x2 + ⋯ + β̂kxk + e

The hats over the betas indicate that these are parameter estimates while e represents the residuals, which are estimates of the random error.

Typically, statisticians consider estimates to be useful when they are unbiased (correct on average) and precise (minimum variance). To apply these concepts to parameter estimates and the Gauss-Markov theorem, we’ll need to understand the sampling distribution of the parameter estimates.

## Sampling Distributions of the Parameter Estimates

Imagine that we repeat the same study many times. We collect random samples of the same size, from the same population, and fit the same OLS regression model repeatedly. Each random sample produces different estimates for the parameters in the regression equation. After this process, we can graph the distribution of estimates for each parameter. Statisticians refer to this type of distribution as a sampling distribution, which is a type of probability distribution.
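This repeated-sampling idea is easy to demonstrate with a quick simulation. The sketch below uses made-up numbers (a one-predictor model with a true slope of 3 and normal errors) rather than real data: it fits OLS to 2,000 independent random samples and summarizes the resulting slope estimates, which trace out the sampling distribution.

```python
import random
import statistics

random.seed(42)

TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 3.0

def ols_slope(xs, ys):
    # Simple-regression OLS slope: sum((x - x̄)(y - ȳ)) / sum((x - x̄)²)
    xbar = statistics.fmean(xs)
    ybar = statistics.fmean(ys)
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = sum((x - xbar) ** 2 for x in xs)
    return num / den

def draw_sample(n=30):
    # One simulated "study": same population, new random sample.
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + random.gauss(0, 2) for x in xs]
    return xs, ys

# Repeat the study 2,000 times and collect the slope estimates.
slopes = []
for _ in range(2000):
    xs, ys = draw_sample()
    slopes.append(ols_slope(xs, ys))

# The collection of estimates approximates the sampling distribution.
print("mean of estimates:", statistics.fmean(slopes))  # close to the true slope, 3
print("spread (std dev): ", statistics.stdev(slopes))
```

Plotting a histogram of `slopes` would reproduce the bell-shaped curves described below: centered on the true value (unbiased) with a measurable spread (variance).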

Keep in mind that each curve represents the sampling distribution of the estimates for a single parameter. The graphs below tell us which values of parameter estimates are more and less common. They also indicate how far estimates are likely to fall from the correct value.

Of course, when you conduct a real study, you’ll perform it once, not know the actual population value, and you definitely won’t see the sampling distribution. Instead, your analysis draws one value from the underlying sampling distribution for each parameter. However, using statistical principles, we can understand the properties of the sampling distributions without having to repeat a study many times. Isn’t the field of statistics grand?!

Hypothesis tests also use sampling distributions to calculate p-values and create confidence intervals. For more information about this process, read my post: How Hypothesis Tests Work.

## Unbiased Estimates: Sampling Distributions Centered on the True Population Parameter

In the graph below, beta represents the true population value. The curve on the right centers on a value that is too high. This model tends to produce estimates that are too high, which is a positive bias. It is not correct on average. However, the curve on the left centers on the actual value of beta. That model produces parameter estimates that are correct on average. The expected value is the actual value of the population parameter. That’s what we want and satisfying the OLS assumptions helps us!

Keep in mind that the curve on the left doesn’t indicate that an individual study necessarily produces an estimate that is right on target. Instead, it means that OLS produces the correct estimate on average when the assumptions hold true. Different studies will generate values that are sometimes higher and sometimes lower—as opposed to having a tendency to be too high or too low.

## Minimum Variance: Sampling Distributions are Tight Around the Population Parameter

In the graph below, both curves center on beta. However, one curve is wider than the other because the variances are different. Broader curves indicate that there is a higher probability that the estimates will be further away from the correct value. That’s not good. We want our estimates to be close to beta.

Both studies are correct on average. However, we want our estimates to follow the narrower curve because they’re likely to be closer to the correct value than the wider curve. The Gauss-Markov theorem states that satisfying the OLS assumptions keeps the sampling distribution as tight as possible for unbiased estimates.

The Best in BLUE refers to the sampling distribution with the minimum variance. That’s the tightest possible distribution of all unbiased linear estimation methods!
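You can see "best" in action by pitting OLS against another linear unbiased estimator and comparing their sampling distributions. The simulation sketch below (hypothetical data again) uses Wald's grouping estimator as the competitor: split the observations at the median x and run a line between the two group means. It is linear in y and unbiased, but the Gauss-Markov theorem says its sampling distribution must be at least as wide as the OLS distribution.

```python
import random
import statistics

random.seed(1)

TRUE_SLOPE = 3.0

def ols_slope(xs, ys):
    # Standard OLS slope for simple regression.
    xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
    return (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))

def grouping_slope(xs, ys):
    # Wald's grouping estimator: split at the median x, then take
    # (mean y of upper half - mean y of lower half) divided by the
    # same difference in x. Linear in y and unbiased, but not minimum variance.
    pairs = sorted(zip(xs, ys))
    half = len(pairs) // 2
    lo, hi = pairs[:half], pairs[half:]
    y_lo = statistics.fmean(y for _, y in lo)
    y_hi = statistics.fmean(y for _, y in hi)
    x_lo = statistics.fmean(x for x, _ in lo)
    x_hi = statistics.fmean(x for x, _ in hi)
    return (y_hi - y_lo) / (x_hi - x_lo)

ols_est, grp_est = [], []
for _ in range(2000):
    xs = [random.uniform(0, 10) for _ in range(30)]
    ys = [2.0 + TRUE_SLOPE * x + random.gauss(0, 2) for x in xs]
    ols_est.append(ols_slope(xs, ys))
    grp_est.append(grouping_slope(xs, ys))

# Both estimators are centered on the true slope (unbiased),
# but OLS has the tighter sampling distribution.
print("OLS:      mean", statistics.fmean(ols_est), "sd", statistics.stdev(ols_est))
print("Grouping: mean", statistics.fmean(grp_est), "sd", statistics.stdev(grp_est))
```

Both means land near 3, but the OLS standard deviation is the smaller of the two, which is exactly what BLUE promises.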

## Gauss-Markov Theorem OLS Estimates and Sampling Distributions

As you can see, the best estimates are those that are unbiased and have the minimum variance. When your model satisfies the assumptions, the Gauss-Markov theorem states that the OLS procedure produces unbiased estimates that have the minimum variance. The sampling distributions are centered on the actual population value and are the tightest possible distributions. Finally, these aren’t just the best estimates that OLS can produce, but the best estimates that any linear unbiased estimator can produce. Powerful stuff!

Karthik M says

Hello Jim,

One question, I always get stuck at one point. Here is my question.

After I ran my regression, I have an estimate Beta_1_hat. This is not the true population value Beta_1, but just one sample drawn from a normal distribution centered at the true population Beta_1. If I had the luxury of repeating this experiment 1000 times, then I would get a good understanding of how good the estimate Beta_1_hat (obtained by running the regression analysis just once) is, granted that I have a sense of the standard deviation of the sampling distribution. But without knowing the TRUE Beta_1, how can I comment on how good my initial estimate (Beta_1_hat) was compared to (Beta_1)?

I know I am missing something, please clarify.

Jim Frost says

Hi Karthik,

What you’re asking about gets to the heart of inferential statistics, which treats your study’s random sample as one of an infinite number of random samples it could have drawn. And, you’re correct that these procedures estimate the sampling distribution for the parameter that you’re estimating–in your case, beta coefficients. In this scenario, the “how good” you’re asking about is the concept of “precision.” How close is your parameter estimate likely to be to the true parameter value? The easiest way to assess precision is by using confidence intervals. The narrower the CI, the more precise your estimate. You can have your software calculate CIs for your coefficient estimates.
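As a sketch of how that works for a slope coefficient (simulated data, and the 95% t critical value of 2.048 for 28 degrees of freedom is hardcoded rather than looked up), the standard error of the slope comes from the residual variation, and the CI is the estimate plus or minus t times that standard error:

```python
import math
import random
import statistics

random.seed(7)

# One simulated "study". In practice the true slope (3.0 here) is unknown.
xs = [random.uniform(0, 10) for _ in range(30)]
ys = [2.0 + 3.0 * x + random.gauss(0, 2) for x in xs]

n = len(xs)
xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
sxx = sum((x - xbar) ** 2 for x in xs)
slope = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
intercept = ybar - slope * xbar

# Residual standard error (standard error of the regression), df = n - 2.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))

# Standard error of the slope estimate, then the 95% confidence interval.
se_slope = s / math.sqrt(sxx)
t_crit = 2.048  # approximate 95% t critical value for df = 28
ci = (slope - t_crit * se_slope, slope + t_crit * se_slope)
print(f"slope = {slope:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The width of that interval is the "how good" in your question: a narrower CI means the single estimate you drew is likely closer to the true Beta_1.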

I write about CIs in the context of sample means using a t-test. If you’ll notice, coefficient estimates have a t-value and are themselves an average. Coefficients are the average change in the DV given a one-unit increase in the IV. So, my other post covers a different context, but the ideas are similar. Read that post to see how the procedure uses the sampling distribution to create a CI. I think that’ll answer many of your questions!

Larry says

Hi,

Thanks for site. Is it accurate to say , if we are given sample data (x1,y1),(x2,y2),…,(xn,yn)

with y the DV and x the IV:

We compute y^, the sample mean:

We define the random variables E_i:= ith error as:

E_i := y_i – y^

Sorry if this is a strange/nonsensical question: How would we compute the mean and variance of an error? Do we define the RV ith error as the ith error of each sample, so that E_i=0 means the mean of all the ith errors of different samples are 0?

**BUT** I don’t understand the ordering of these errors. What is the RV , say, E1?

Does it mean if we take another sample (x1′, y1′),(x2′, y2′),….., (xn’, yn’)

Here again, we compute the sample mean : x’^, as above and define the ith error

as x’^ -x’_i

Are the values x_1, x’_1, x”_1

Meaning, are the first ( and nth) data points a statistical population?

Thanks, Jim.

Jim Frost says

Hi Larry,

I’m not 100% sure that I understand what you’re asking. But, I can offer several clarifications.

First, y^, or y-hat, is not the sample mean. Y-hat is the fitted or predicted value for an observation when you input its independent variable (IV) values into the model.

Y-bar (a y with a horizontal line over it) is the mean of the dependent variable.

Your formula for error is actually the residual for the ith observation. A residual is the difference between an observed value and its corresponding fitted value. In other words, it’s the difference between the observed value and the value that the model predicts for that observation. You have one residual for each observation in your model.

If you specify the correct model or include the constant in your model, the mean of the residuals will equal zero. The standard error of the regression is the standard deviation of the residuals. You’re not collecting new samples but instead understanding the mean and variability of the entire set of residuals.

For more information, read my post about residual plots, which explains why you need to assess residuals and how to do so. For even more information, get my ebook about regression analysis. In that book, I cover error and the sum of squared errors and other sums of squares in more detail.

I hope that helps!

Chris says

Very nice. Most behavior is nonlinear. Clearly BLUE is very appealing. BNUE = Best Nonlinear Unbiased Estimator. Comments, thoughts on BNUE vs BLUE?

Shahbaz Hussain says

Sir, Needed Characteristics of BLUE

Jim Frost says

Hi Shahbaz, yes, you’re in the right place. This post covers the characteristics of BLUE. BLUE refers to the characteristics of the coefficient estimates, which I discuss in detail throughout this post.

SAMUEL, Gabriel says

Wow! Your style of explanation is quite amazing, so simple and easy to comprehend. I have benefitted a lot from your blog. Thanks a lot and God bless you.

Jim Frost says

Hi Samuel, thank you so much! I really appreciate your kind words. They made my day!

Ameya Patankar says

Hi Jim,

First of all thanks a lot for this blog! This is like an entire Statistics 101 course.

My only suggestion would be: can you please suggest an order in which the articles in each subsection should be read? That would really give an idea of what should be clear before moving on to a certain topic.

Regards,

Ameya

Luong Duy Nam says

Thank you sir, your explanation is very clear in plain English

Amit says

Yes sir, I think when the correlation in the sample errors is zero, meaning one error does not predict another, the population errors (since they are estimated by the sample) will also have correlation zero, and that is what is required. Now this can also be stated for one observation (since in the population it can have one distribution): the errors are uncorrelated.

Although the three conditions are true for the population but many times we state them for the subpopulations at given observation for the population model.

E(e|X) = 0, and also E(e) = 0

V(e|X) = σ², and also V(e) = σ²

Cov(e_i, e_j|X) = 0, and also Cov(e_i, e_j) = 0

This doubt is because sir many times both the forms are written.

Adrien says

This is the best plain english explanation I have encountered regarding this Therorem! Thanks

Amit says

Sir my question is :

Four conditions for betas to be BLUE:

1. E(e) = 0, where e is the population error

2. V(e) = σ²

3. Cov(e_i, e_j) = 0

4. Although normality of the errors is not a condition, it is still useful.

All of these hold at a given X, since X is treated as a constant. My question is: when we write the third condition, Cov(e_i, e_j) = 0, if I take e_i from one X1 and e_j from another X2, does it still satisfy the third condition?

Same for normality since in a sample we take residuals at varying X.

Pls clear the concept sir.

Jim Frost says

Hi Amit,

These conditions all apply to the error for the population, which we estimate using the sample at hand. As for the correlation (or covariance) of errors, you should not be able to use the error for one observation to predict the value of the error for another observation. When we know the correlation based on the sample, we can determine whether one residual predicts another. If the correlation is zero, we know that residuals don’t predict other residuals.
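One simple way to check this on a fitted model is to compute the lag-1 correlation of the residuals in observation order (a rough, hand-rolled version of what a formal test like Durbin-Watson does). A sketch with simulated independent errors:

```python
import random
import statistics

random.seed(11)

def pearson(a, b):
    # Plain Pearson correlation between two equal-length lists.
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den

# Fit OLS to data whose errors really are independent.
xs = [random.uniform(0, 10) for _ in range(500)]
ys = [1.0 + 2.0 * x + random.gauss(0, 1.0) for x in xs]

xbar, ybar = statistics.fmean(xs), statistics.fmean(ys)
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
intercept = ybar - slope * xbar
resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# If one residual doesn't predict the next, this is near zero.
lag1 = pearson(resid[:-1], resid[1:])
print("lag-1 residual correlation:", lag1)
```

A lag-1 correlation far from zero would suggest that the errors are correlated and that the no-autocorrelation condition is violated.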

I’m not 100% sure that I understand your question, but I hope this helps.

NANCY G says

highly useful. explained in simple language.

Jim Frost says

Thanks Nancy!

Becky says

Hello, I’m trying to figure out how to calculate the estimated error variance σ². Am I in the right direction? Is this going to help me calculate it?