The Gauss-Markov theorem states that if your linear regression model satisfies the first six classical assumptions, then ordinary least squares (OLS) regression produces unbiased estimates that have the smallest variance of all possible linear unbiased estimators.

The proof of this theorem goes well beyond the scope of this blog post. However, the critical point is that when you satisfy the classical assumptions, you can be confident that you are obtaining the most precise unbiased coefficient estimates that a linear estimator can provide. The Gauss-Markov theorem does not state that these are just the best possible estimates for the OLS procedure, but the best possible estimates for *any* linear unbiased estimator. Think about that!

In my post about the classical assumptions of OLS linear regression, I explain those assumptions and how to verify them. In this post, I take a closer look at the nature of OLS estimates. What does the Gauss-Markov theorem mean exactly when it states that OLS estimates are the best estimates when the assumptions hold true?

## The Gauss-Markov Theorem: OLS is BLUE!

The Gauss-Markov theorem famously states that OLS is BLUE. BLUE is an acronym for the following:

Best Linear Unbiased Estimator

In this context, the definition of “best” refers to the minimum variance or the narrowest sampling distribution. More specifically, when your model satisfies the assumptions, OLS coefficient estimates follow the tightest possible sampling distribution of unbiased estimates compared to other linear estimation methods.

Let’s dig deeper into everything that is packed into that sentence!

## What Does OLS Estimate?

Regression analysis is like any other inferential methodology. Our goal is to draw a random sample from a population and use it to estimate the properties of that population. In regression analysis, the coefficients in the equation are estimates of the actual population parameters.

The notation for the model of a population is the following:

$$
Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon
$$

The betas (β) represent the population parameter for each term in the model. Epsilon (ε) represents the random error that the model doesn’t explain. Unfortunately, we’ll never know these population values because it is generally impossible to measure the entire population. Instead, we’ll obtain estimates of them using our random sample.

The notation for an estimated model from a random sample is the following:

$$
Y = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_k X_k + e
$$

The hats over the betas indicate that these are parameter estimates while e represents the residuals, which are estimates of the random error.
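To make the difference between population parameters and sample estimates concrete, here is a minimal Python sketch. It assumes NumPy is available, and the population values (an intercept of 2 and a slope of 0.5), the sample size, and the helper name `fit_ols` are all hypothetical choices of mine for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Return (coefficient estimates, residuals) for an OLS fit with an intercept."""
    design = np.column_stack([np.ones(len(y)), X])       # prepend the intercept column
    beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)  # least-squares solution
    residuals = y - design @ beta_hat                     # e = observed minus fitted
    return beta_hat, residuals

rng = np.random.default_rng(0)

# Hypothetical population: Y = 2 + 0.5 * X + epsilon (values assumed for illustration)
x = rng.uniform(0, 10, size=50)
epsilon = rng.normal(0, 1, size=50)
y = 2.0 + 0.5 * x + epsilon

beta_hat, e = fit_ols(x, y)
print("estimated intercept and slope:", beta_hat)
print("first few residuals:", e[:3])
```

In a real study, you never see `epsilon` or the true coefficients. You only see `beta_hat` and the residuals `e`, which stand in for them.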

Typically, statisticians consider estimates to be useful when they are unbiased (correct on average) and precise (minimum variance). To apply these concepts to parameter estimates and the Gauss-Markov theorem, we’ll need to understand the sampling distribution of the parameter estimates.

## Sampling Distributions of the Parameter Estimates

Imagine that we repeat the same study many times. We collect random samples of the same size, from the same population, and fit the same OLS regression model repeatedly. Each random sample produces different estimates for the parameters in the regression equation. After this process, we can graph the distribution of estimates for each parameter. Statisticians refer to this type of distribution as a sampling distribution, which is a type of probability distribution.
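The thought experiment above is easy to try in a simulation. The sketch below assumes NumPy and uses made-up population values (intercept 2, slope 0.5, error standard deviation 1) along with arbitrary choices for the sample size and the number of repeated studies.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population values, chosen only for this illustration
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 0.5
N_OBS, N_STUDIES = 30, 5000

def one_study():
    """Draw one random sample and return the OLS slope estimate."""
    x = rng.uniform(0, 10, size=N_OBS)
    y = TRUE_INTERCEPT + TRUE_SLOPE * x + rng.normal(0, 1, size=N_OBS)
    slope, intercept = np.polyfit(x, y, deg=1)  # polyfit returns highest degree first
    return slope

# Repeat the "same study" many times and collect the slope estimates
slopes = np.array([one_study() for _ in range(N_STUDIES)])

# The histogram counts approximate the sampling distribution of the slope
counts, bin_edges = np.histogram(slopes, bins=20)
print("mean of estimates:", slopes.mean())
print("standard deviation of estimates:", slopes.std(ddof=1))
```

Plotting `counts` against `bin_edges` gives a curve like the ones discussed in this post: most estimates fall near the true slope, and fewer fall far away.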

Keep in mind that each curve represents the sampling distribution of the estimates for a single parameter. The graphs below tell us which values of parameter estimates are more and less common. They also indicate how far estimates are likely to fall from the correct value.

Of course, when you conduct a real study, you’ll perform it once, you won’t know the actual population value, and you definitely won’t see the sampling distribution. Instead, your analysis draws one value from the underlying sampling distribution for each parameter. However, using statistical principles, we can understand the properties of the sampling distributions without having to repeat a study many times. Isn’t the field of statistics grand?!

Hypothesis tests also use sampling distributions to calculate p-values and create confidence intervals. For more information about this process, read my post: How Hypothesis Tests Work.

## Unbiased Estimates: Sampling Distributions Centered on the True Population Parameter

In the graph below, beta represents the true population value. The curve on the right centers on a value that is too high. This model tends to produce estimates that are too high, which is a positive bias. It is not correct on average. However, the curve on the left centers on the actual value of beta. That model produces parameter estimates that are correct on average. The expected value is the actual value of the population parameter. That’s what we want and satisfying the OLS assumptions helps us!

Keep in mind that the curve on the left doesn’t indicate that an individual study necessarily produces an estimate that is right on target. Instead, it means that OLS produces the correct estimate on average when the assumptions hold true. Different studies will generate values that are sometimes higher and sometimes lower—as opposed to having a tendency to be too high or too low.
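Here is a rough simulation of what unbiased and biased sampling distributions look like. As one possible source of bias, I assume a model that leaves out a predictor that is correlated with the one we keep. That scenario, the coefficient values, and the function name `slope_estimates` are my own assumptions for this sketch, which relies on NumPy.

```python
import numpy as np

rng = np.random.default_rng(1)

BETA1 = 0.5                      # hypothetical true coefficient on x1
N_OBS, N_STUDIES = 40, 4000

def slope_estimates(omit_confounder):
    """Estimate the x1 coefficient in many simulated studies."""
    estimates = np.empty(N_STUDIES)
    for i in range(N_STUDIES):
        x1 = rng.normal(0, 1, N_OBS)
        x2 = 0.8 * x1 + rng.normal(0, 1, N_OBS)          # x2 is correlated with x1
        y = 1.0 + BETA1 * x1 + 1.0 * x2 + rng.normal(0, 1, N_OBS)
        if omit_confounder:
            columns = [np.ones(N_OBS), x1]               # misspecified: x2 left out
        else:
            columns = [np.ones(N_OBS), x1, x2]           # correctly specified
        design = np.column_stack(columns)
        beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates[i] = beta_hat[1]                       # coefficient on x1
    return estimates

full = slope_estimates(omit_confounder=False)
short = slope_estimates(omit_confounder=True)

print("true value:", BETA1)
print("average estimate, correct model:", full.mean())
print("average estimate, x2 omitted:", short.mean())
```

Under these assumptions, the correct model's estimates should average out near the true value, while the model that omits `x2` should center on a value that is too high. Individual estimates from either model still land above and below their own center.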

## Minimum Variance: Sampling Distributions are Tight Around the Population Parameter

In the graph below, both curves center on beta. However, one curve is wider than the other because the variances are different. Broader curves indicate that there is a higher probability that the estimates will be further away from the correct value. That’s not good. We want our estimates to be close to beta.

Both curves are correct on average. However, we want our estimates to follow the narrower curve because they’re likely to be closer to the correct value than estimates from the wider curve. The Gauss-Markov theorem states that satisfying the OLS assumptions keeps the sampling distribution as tight as possible for unbiased estimates.

The Best in BLUE refers to the sampling distribution with the minimum variance. That’s the tightest possible distribution of all unbiased linear estimation methods!
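To see "best" in action, we can compare OLS against another estimator that is also linear and unbiased. In the sketch below, the competitor splits the data at the median of X and divides the difference in mean Y by the difference in mean X. That competitor, the fixed X values, and the population values are my own choices for illustration, and the code assumes NumPy.

```python
import numpy as np

rng = np.random.default_rng(7)

# Fixed X values reused in every simulated study (an assumption of this sketch)
x = np.linspace(0, 10, 20)
BETA0, BETA1, SIGMA = 2.0, 0.5, 1.0   # hypothetical population values
N_STUDIES = 5000

upper = x > np.median(x)              # observations in the upper half of X

ols = np.empty(N_STUDIES)
grouped = np.empty(N_STUDIES)
for i in range(N_STUDIES):
    y = BETA0 + BETA1 * x + rng.normal(0, SIGMA, x.size)
    ols[i] = np.polyfit(x, y, 1)[0]   # OLS slope
    # Competing linear unbiased estimator: ratio of group-mean differences
    grouped[i] = (y[upper].mean() - y[~upper].mean()) / (
        x[upper].mean() - x[~upper].mean()
    )

print("mean of OLS slopes:", ols.mean(), "variance:", ols.var(ddof=1))
print("mean of grouped slopes:", grouped.mean(), "variance:", grouped.var(ddof=1))
```

Both sets of estimates should center on the true slope, but the grouped estimator's variance should come out noticeably larger. This is only one competitor, so the simulation illustrates the theorem rather than proving it.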

## Gauss-Markov Theorem: OLS Estimates and Sampling Distributions

As you can see, the best estimates are those that are unbiased and have the minimum variance. When your model satisfies the assumptions, the Gauss-Markov theorem states that the OLS procedure produces unbiased estimates that have the minimum variance. The sampling distributions are centered on the actual population value and are the tightest possible distributions. Finally, these aren’t just the best estimates that OLS can produce, but the best estimates that any linear unbiased estimator can produce. Powerful stuff!
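If you prefer symbols, this summary can be sketched in matrix form. The notation here ($y$ for the vector of responses, $X$ for the matrix of predictors including the intercept column, $\sigma^2$ for the constant error variance, and $\tilde{\beta} = Cy$ for any competing linear estimator) is shorthand I am assuming for this sketch, and everything is conditional on the observed $X$.

$$
\hat{\beta} = (X^\top X)^{-1} X^\top y, \qquad E[\hat{\beta} \mid X] = \beta, \qquad \operatorname{Var}(\hat{\beta} \mid X) = \sigma^2 (X^\top X)^{-1}
$$

$$
\operatorname{Var}(\tilde{\beta}_j \mid X) \ge \operatorname{Var}(\hat{\beta}_j \mid X) \quad \text{for every coefficient } j \text{ and every linear unbiased estimator } \tilde{\beta} = Cy
$$

The first line says that OLS is linear in $y$ and unbiased. The second line is the "best" part: no other linear unbiased estimator has a smaller variance for any coefficient.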

Luong Duy Nam says

Thank you sir, your explanation is very clear in plain English

Amit says

Yes sir, I think when the correlation in our sample errors is zero, it means one error does not predict another. Thus the population errors (since they are estimated by the sample) will also have zero correlation, and that is what is required. Now, this can also be stated for a single observation (since in the population it can have a distribution): the errors are uncorrelated.

Although the three conditions hold for the population, many times we state them for the subpopulation at a given observation in the population model.

E(e|X) = 0, also E(e) = 0

V(e|X) = σ², also V(e) = σ²

Cov(ei, ej|X) = 0, also Cov(ei, ej) = 0

I have this doubt, sir, because both forms are often written.

Adrien says

This is the best plain English explanation I have encountered regarding this theorem! Thanks

Amit says

Sir my question is :

Four conditions for betas to be BLUE:

1. E(e) = 0, where e is the population error

2. V(e) = σ²

3. Cov(ei, ej) = 0

4. Although normality of the errors is not a condition, it is still useful.

All these are at a given X, since X is treated as a constant. My question is: when we write the third condition, Cov(ei, ej) = 0, if I take ei from one X1 and ej from another X2, does that also satisfy the third condition?

The same goes for normality, since in a sample we take residuals at varying X.

Please clarify the concept, sir.

Jim Frost says

Hi Amit,

These conditions all apply to the error for the population, which we estimate using the sample at hand. As for the correlation (or covariance) of errors, you should not be able to use the error for one observation to predict the value of the error for another observation. When we know the correlation based on the sample, we can determine whether one residual predicts another. If the correlation is zero, we know that residuals don’t predict other residuals.

I’m not 100% sure that I understand your question, but I hope this helps.

NANCY G says

highly useful. explained in simple language.

Jim Frost says

Thanks Nancy!

Becky says

Hello, I’m trying to figure out how to calculate the estimated error variance σ². Am I headed in the right direction, and is this going to help me calculate it?