In statistics, degrees of freedom (DF) is a difficult concept to explain. However, it is an important idea that appears in many different contexts throughout statistics including hypothesis tests, probability distributions, and regression analysis. Learn more about degrees of freedom in this blog post!
I’ll start by defining degrees of freedom. However, I’ll quickly move on to practical examples in a variety of contexts because they make this concept easier to understand.
Definition of Degrees of Freedom
Degrees of freedom are the number of independent values that a statistical analysis can estimate. You can also think of it as the number of values that are free to vary as you estimate parameters. I know, it’s starting to sound a bit murky!
Degrees of freedom encompasses the notion that the amount of independent information you have limits the number of parameters that you can estimate. Typically, the degrees of freedom equal your sample size minus the number of parameters you need to calculate during an analysis. It is usually a positive whole number.
Degrees of freedom is a combination of how much data you have and how many parameters you need to estimate. It indicates how much independent information goes into a parameter estimate. In this vein, it’s easy to see that you want a lot of information to go into parameter estimates to obtain more precise estimates and more powerful hypothesis tests. So, you want many degrees of freedom!
Independent Information and Restrictions on Values
The definitions talk about independent information. You might think this refers to the sample size, but it’s a little more complicated than that. To understand why, we need to talk about the freedom to vary. The best way to illustrate this concept is with an example.
Suppose we collect the random sample of observations shown below. Now, imagine that we know the mean, but we don’t know the value of an observation—the X in the table below.
The mean is 6.9, and it is based on 10 values. So, we know that the values must sum to 69 based on the equation for the mean.
Using simple algebra (64 + X = 69), we know that X must equal 5.
Estimating Parameters Imposes Constraints on the Data
As you can see, that last number has no freedom to vary. It is not an independent piece of information because it cannot be any other value. Estimating the parameter, the mean in this case, imposes a constraint on the freedom to vary. The last value and the mean are entirely dependent on each other. Consequently, after estimating the mean, we have only 9 independent pieces of information even though our sample size is 10.
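Here is a minimal Python sketch of the same algebra. It uses only the numbers from the example (a mean of 6.9, a sample size of 10, and nine known values that sum to 64). The function name `missing_value` is made up for this illustration.

```python
def missing_value(mean, n, known_sum):
    """Return the single value that is forced once the mean and the
    other n - 1 values are known."""
    required_total = mean * n          # the n values must sum to mean * n
    return required_total - known_sum  # whatever is left is the last value


# Numbers from the example: mean 6.9, n = 10, nine known values sum to 64.
x = missing_value(mean=6.9, n=10, known_sum=64)
print(round(x, 6))  # 5.0

# Estimating the mean uses one degree of freedom, so n - 1 remain.
degrees_of_freedom = 10 - 1
print(degrees_of_freedom)  # 9
```

Change any of the nine known values and the function simply returns a different last value. That last value never gets a say of its own, which is why it doesn't count as independent information.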
That’s the basic idea for degrees of freedom in statistics. In a general sense, DF are the number of observations in a sample that are free to vary while estimating statistical parameters. You can also think of it as the amount of independent data that you can use to estimate a parameter.
Degrees of Freedom and Probability Distributions
Degrees of freedom also define the distributions for the test statistics of various hypothesis tests. For example, hypothesis tests use the t-distribution, F-distribution, and the chi-square distribution to determine statistical significance. Each of these distributions is a family of distributions where the degrees of freedom define the shape. Hypothesis tests use these distributions to calculate p-values. So, the DF are directly linked to p-values through these distributions!
Next, let’s look at how these distributions work for several hypothesis tests.
Degrees of Freedom for t-Tests and the t-Distribution
T-tests are hypothesis tests for the mean and use the t-distribution to determine statistical significance.
A 1-sample t-test determines whether the difference between the sample mean and the null hypothesis value is statistically significant. Let’s go back to our example of the mean above. We know that when you have a sample and estimate the mean, you have n – 1 degrees of freedom, where n is the sample size. Consequently, for a 1-sample t-test, the degrees of freedom is n – 1.
The DF define the shape of the t-distribution that your t-test uses to calculate the p-value. The graph below shows the t-distribution for several different degrees of freedom. Because the degrees of freedom are so closely related to sample size, you can see the effect of sample size. As the degrees of freedom decrease, the t-distribution has thicker tails. This property allows for the greater uncertainty associated with small sample sizes.
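To see the thicker tails in numbers, the sketch below computes the two-sided p-value for the same t-value at several degrees of freedom. It is an illustration rather than how statistical software does it: it integrates the standard t-distribution density with Simpson's rule using only Python's standard library. The function names, the t-value of 2.0, and the DF values are all arbitrary choices for the demonstration.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t_value, df, steps=2000):
    """Approximate P(|T| >= |t_value|) with Simpson's rule (steps must be even)."""
    upper = abs(t_value)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    # The density is symmetric, so the central area is twice this.
    return 1.0 - 2.0 * area_zero_to_t


# Same t-value, different degrees of freedom (arbitrary illustrative choices).
for df in (2, 5, 30):
    print(df, round(two_sided_p(2.0, df), 4))
```

The p-value for the same t-value shrinks as the degrees of freedom grow, because less of the distribution sits in the tails.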
To dig into t-tests, read my post about How t-Tests Work. I show how the different t-tests calculate t-values and use t-distributions to calculate p-values.
The F-test in ANOVA also tests group means. It uses the F-distribution, which is defined by the degrees of freedom. However, you calculate the DF for an F-distribution differently. For more information, read my post about How F-tests Work in ANOVA.
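This post doesn't give the ANOVA formulas, but as a quick sketch, the standard rule for one-way ANOVA is that the F-distribution has two DF values: the number of groups minus one for the numerator, and the total sample size minus the number of groups for the denominator. The function name and the group sizes below are hypothetical.

```python
def one_way_anova_df(group_sizes):
    """Return (numerator DF, denominator DF) for a one-way ANOVA F-test."""
    k = len(group_sizes)        # number of groups
    n_total = sum(group_sizes)  # total number of observations
    return k - 1, n_total - k


# Hypothetical study: three groups of 10 observations each.
print(one_way_anova_df([10, 10, 10]))  # (2, 27)
```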
Degrees of Freedom for the Chi-Square Test of Independence
The chi-square test of independence determines whether there is a statistically significant relationship between categorical variables. Just like other hypothesis tests, this test incorporates degrees of freedom. For a table with r rows and c columns, the general rule for calculating degrees of freedom for a chi-square test is (r – 1)(c – 1).
However, we can create tables to understand it more intuitively. The degrees of freedom for a chi-square test of independence is the number of cells in the table that can vary before you can calculate all the other cells. In a chi-square table, the cells represent the observed frequency for each combination of categorical variables. The constraints are the totals in the margins.
Chi-Square 2 X 2 Table
For example, in a 2 X 2 table, after you enter one value in the table, you can calculate the remaining cells.
In the table above, I entered the bold 15, and then I can calculate the remaining three values in parentheses. Therefore, this table has 1 DF.
Chi-Square 3 X 2 Table
Now, let’s try a 3 X 2 table. The table below illustrates the example that I use in my post about the chi-square test of independence. In that post, I determine whether there is a statistically significant relationship between uniform color and deaths on the original Star Trek TV series.
In the table, one categorical variable is shirt color, which can be blue, gold, or red. The other categorical variable is status, which can be dead or alive. After I enter the two bolded values, I can calculate all the remaining cells. Consequently, this table has 2 DF.
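The fill-in-the-cells idea can be sketched in a few lines of Python. Given the margin totals and the (r – 1)(c – 1) free cells, the function below calculates every remaining cell. The margin totals and most cell values here are hypothetical numbers chosen for illustration, not the values from the tables in this post. Only the 15 echoes the 2 X 2 example.

```python
def complete_table(free_cells, row_totals, col_totals):
    """Fill in an r x c table from its margins and its (r-1) x (c-1) free cells."""
    r, c = len(row_totals), len(col_totals)
    if sum(row_totals) != sum(col_totals):
        raise ValueError("row and column totals must give the same grand total")
    if len(free_cells) != r - 1 or any(len(row) != c - 1 for row in free_cells):
        raise ValueError("expected (r - 1) rows of (c - 1) free cells")

    table = [[0] * c for _ in range(r)]
    for i in range(r - 1):
        for j in range(c - 1):
            table[i][j] = free_cells[i][j]
        # The last cell in each row is forced by the row total.
        table[i][c - 1] = row_totals[i] - sum(table[i][:c - 1])
    for j in range(c):
        # The whole last row is forced by the column totals.
        table[r - 1][j] = col_totals[j] - sum(table[i][j] for i in range(r - 1))
    return table


# 2 X 2 table: one free cell, so 1 DF (margins are hypothetical).
print(complete_table([[15]], row_totals=[40, 60], col_totals=[35, 65]))
# [[15, 25], [20, 40]]

# 3 X 2 table: two free cells, so 2 DF (all numbers are hypothetical).
print(complete_table([[10], [5]], row_totals=[30, 20, 50], col_totals=[25, 75]))
# [[10, 20], [5, 15], [10, 40]]
```

The number of values you have to supply in `free_cells` is exactly (r – 1)(c – 1), which matches the general rule.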
Read my post, Chi-Square Test of Independence and an Example, to see how this test works and how to interpret the results using the Star Trek example.
Like the t-distribution, the chi-square distribution is a family of distributions where the degrees of freedom define the shape. Chi-square tests use this distribution to calculate p-values. The graph below displays several chi-square distributions.
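To make the link between DF and p-values concrete, here is a small standard-library sketch that computes the upper-tail probability of a chi-square distribution for whole-number degrees of freedom. It relies on a standard recurrence for the regularized upper incomplete gamma function, which is not necessarily how statistical packages do it. The function name and the test statistic of 6.0 are arbitrary choices for the demonstration.

```python
import math


def chi_square_p_value(statistic, df):
    """Upper-tail probability P(X >= statistic) for a chi-square distribution
    with a whole number of degrees of freedom."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive whole number")
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")

    z = statistic / 2
    # Closed-form starting points: df = 2 (a = 1) and df = 1 (a = 1/2).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence: Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / gamma(a + 1).
    while a < df / 2:
        q += z ** a * math.exp(-z) / math.gamma(a + 1)
        a += 1
    return q


# Same test statistic, different degrees of freedom.
for df in (1, 2, 3):
    print(df, round(chi_square_p_value(6.0, df), 4))
```

For the same chi-square value, the p-value gets larger as the degrees of freedom increase, so the same statistic can be significant for a small table and not for a larger one.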
Degrees of Freedom in Regression Analysis
Degrees of freedom in regression are a bit more complicated, so I’ll keep it on the simple side. In a regression model, each term is an estimated parameter that uses one degree of freedom. In the regression output below, you can see how each term requires a DF. There are 28 total degrees of freedom, and the two independent variables use two of them. The remaining 26 degrees of freedom are displayed in Error.
The error degrees of freedom are the independent pieces of information that are available for estimating your coefficients. For precise coefficient estimates and powerful hypothesis tests in regression, you must have many error degrees of freedom. This equates to having many observations for each model term.
As you add terms to the model, the error degrees of freedom decrease. You have fewer pieces of information available to estimate the parameters. This situation reduces the precision of the parameters and the power of the tests. When you have too few remaining degrees of freedom, you can’t trust the regression results. If you use all your degrees of freedom, the p-values can’t be calculated.
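A rough sketch of this bookkeeping is below. It assumes the usual convention that the constant (intercept) also counts as an estimated parameter, and it uses a hypothetical sample of 20 observations. The function names are made up for this illustration.

```python
def error_df(n_observations, n_terms, intercept=True):
    """Error DF = observations minus estimated parameters.

    Assumption: the constant uses one degree of freedom when it is in the model.
    """
    n_parameters = n_terms + (1 if intercept else 0)
    return n_observations - n_parameters


def residual_variance(sse, df_error):
    """Mean squared error: the sum of squared errors divided by the error DF."""
    if df_error <= 0:
        raise ValueError("no error degrees of freedom left to estimate the variance")
    return sse / df_error


# Hypothetical sample of 20 observations with more and more model terms.
n = 20
for terms in (2, 5, 10, 19):
    print(terms, "terms ->", error_df(n, terms), "error DF")
```

With 19 terms plus the constant, the model uses all 20 degrees of freedom. The error DF is zero, so `residual_variance` has nothing to divide by. Standard errors and p-values depend on that variance, which is why they can't be calculated in this situation.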
For more information about the problems that occur when you use too many degrees of freedom and how many observations you need, read my blog post about overfitting your model.
Even though they might seem murky, degrees of freedom are essential to any statistical analysis! In a nutshell, DF define the amount of information you have relative to the number of parameters that you want to estimate. If you don’t have enough information for what you want to do, you’ll have imprecise estimates and low statistical power.