Chi-squared tests of independence determine whether a relationship exists between two categorical variables. Do the values of one categorical variable depend on the value of the other categorical variable? If the two variables are independent, knowing the value of one variable provides no information about the value of the other variable.
I’ve previously written about Pearson’s chi-square test of independence using a fun Star Trek example. Are the uniform colors related to the chances of dying? You can test the notion that the infamous red shirts have a higher likelihood of dying. In that post, I focus on the purpose of the test, applied it to this example, and interpreted the results.
In this post, I’ll take a bit of a different approach. I’ll show you the nuts and bolts of how to calculate the expected values, chi-square value, and degrees of freedom. Then you’ll learn how to use the chi-squared distribution in conjunction with the degrees of freedom to calculate the p-value.
I’ve used the same approach to explain how:
Of course, you’ll usually just let your statistical software perform all calculations. However, understanding the underlying methodology helps you fully comprehend the analysis.
Chi-Squared Example Dataset
For the Star Trek example, uniform color and status are the two categorical variables. The contingency table below shows the combination of variable values, frequencies, and percentages.
Blue | Gold | Red | Row total | |
Dead | 7 | 9 | 24 | 40 |
Alive | 129 | 46 | 215 | 390 |
Column total | 136 | 55 | 239 | N = 430 |
Column percentage (Dead) | 5.15% | 16.36% | 10.04% |
If uniform color and fatality rates are independent, we’d expect the column percentage in the bottom row to be roughly equal for all uniform colors. After all, if there is no connection between these variables, there’s no reason for the fatality rates to be different.
However, our fatality rates are not equal. Gold has the highest fatality rate at 16.36%, while Blue has the lowest at 5.15%. Red is in the middle at 10.04%. Does this inequality in our sample suggest that the fatality rates are different in the population? Does a relationship exist between uniform color and fatalities?
Thanks to random sampling error, our sample’s fatality rates don’t exactly equal the population’s rates. If the population rates are equal, we’d likely still see differences in our sample. So, the question becomes, after factoring in sampling error, are the fatality rates in our sample different enough to conclude that they’re different in the population? In other words, we want to be confident that the observed differences represent a relationship in the population rather than merely random fluctuations in the sample. That’s where Pearson’s chi-squared test for independence comes in!
Hypotheses for Our Test
The two hypotheses for the chi-squared test of independence are the following:
- Null: The variables are independent. No relationship exists.
- Alternative: A relationship between the variables exists.
Related posts: Hypothesis Testing Overview and Guide to Data Types
Calculating the Expected Frequencies for the Chi-squared Test of Independence
The chi-squared test of independence compares our sample data in the contingency table to the distribution of values we’d expect if the null hypothesis is correct. Let’s construct the contingency table we’d expect to see if the null hypothesis is true for our population.
For chi-squared tests, the term “expected frequencies” refers to the values we’d expect to see if the null hypothesis is true. To calculate the expected frequency for a specific combination of categorical variables (e.g., blue shirts who died), multiply the column total (Blue) by the row total (Dead), and divide by the sample size.
Row total X Column total / Sample Size = Expected value for one table cell
To calculate the expected frequency for the Dead/Blue cell in our dataset, do the following:
- Find the row total for Dead (40)
- Find the column total for Blue (136)
- Multiply those two values and divide by the sample size (430)
40 * 136 / 430 = 12.65
If the null hypothesis is true, we’d expect to see 12.65 fatalities for wearers of the Blue uniforms in our sample. Of course, we can’t have a fraction of a death, but that doesn’t affect the results.
Contingency Table with the Expected Values
I’ll calculate the expected values for all six cells that represent the combinations of the three uniform colors and two statuses. I’ll also include the observed values in our sample. Expected values are in parentheses.
Blue | Gold | Red | Row total | |
Dead | 7 (12.65) | 9 (5.12) | 24 (22.23) | 40 |
Alive | 129 (123.35) | 46 (49.88) | 215 (216.77) | 390 |
Column% (Expected Dead) | 9.3% | 9.3% | 9.3% |
In this table, notice how the column percentages for the expected dead are all 9.3%. This equality occurs when the null hypothesis is valid, which is the condition that the expected values represent.
Using this table, we can also compare the values we observe in our sample to the frequencies we’d expect if the null hypothesis that the variables are not related is correct.
For example, the observed frequency for Blue/Dead is less than the expected value (7 < 12.65). In our sample, deaths of those in blue uniforms occurred less frequently than we’d expect if the variables are independent. On the other hand, the observed frequency for Gold/Dead is greater than the expected value (9 > 5.12). Meanwhile, the observed frequency for Red/Dead approximately equals the expected value. This interpretation matches what we concluded by assessing the column percentages in the first contingency table.
Pearson’s chi-squared test works by mathematically comparing observed frequencies to the expected values and boiling all those differences down into one number. Let’s see how it does that!
Related post: Using Contingency Tables to Calculate Probabilities
Calculating the Chi-Squared Statistic
Most hypothesis tests calculate a test statistic. For example, t-tests use t-values and F-tests use F-values as their test statistics. These statistical tests compare your observed sample data to what you would expect if the null hypothesis is true. The calculations reduce your sample data down to one value that represents how different your data are from the null. Learn more about Test Statistics.
For chi-squared tests, the test statistic is, unsurprisingly, chi-squared, or χ^{2}.
The chi-squared calculations involve a familiar concept in statistics—the sum of the squared differences between the observed and expected values. This concept is similar to how regression models assess goodness-of-fit using the sum of the squared differences.
Here’s the formula for chi-squared.
Let’s walk through it!
To calculate the chi-squared statistic, take the difference between a pair of observed (O) and expected values (E), square the difference, and divide that squared difference by the expected value. Repeat this process for all cells in your contingency table and sum those values. The resulting value is χ^{2}. We’ll calculate it for our example data shortly!
Important Considerations about the Chi-Squared Statistic
Notice several important considerations about chi-squared values:
Zero represents the null hypothesis. If all your observed frequencies equal the expected frequencies exactly, the chi-squared value for each cell equals zero, and the overall chi-squared statistic equals zero. Zero indicates your sample data exactly match what you’d expect if the null hypothesis is correct.
Squaring the differences ensures both that cell values must be non-negative and that larger differences are weighted more than smaller differences. A cell can never subtract from the chi-squared value.
Larger values represent a greater difference between your sample data and the null hypothesis. Chi-squared tests are one-tailed tests rather than the more familiar two-tailed tests. The test determines whether the entire set of differences exceeds a significance threshold. If your χ^{2} passes the limit, your results are statistically significant! You can reject the null hypothesis and conclude that the variables are dependent–a relationship exists.
Related post: One-tailed and Two-tailed Hypothesis Tests
Calculating Chi-Squared for our Example Data
Let’s calculate the chi-squared statistic for our example data! To do that, I’ll rearrange the contingency table, making it easier to illustrate how to calculate the sum of the squared differences.
The first two columns indicate the combination of categorical variable values. The next two are the observed and expected values that we calculated before. The last column is the squared difference divided by the expected value for each row. The bottom line sums those values.
Our chi-squared test statistic is 6.17. Ok, great. What does that mean? Larger values indicate a more substantial divergence between our observed data and the null hypothesis. However, the number by itself is not useful because we don’t know if it’s unusually large. We need to place it into a broader context to determine whether it is an extreme value.
Using the Chi-Squared Distribution to Test Hypotheses
One chi-squared test produces a single chi-squared value. However, imagine performing the following process.
First, assume the null hypothesis is valid for the population. At the population level, there is no relationship between the two categorical variables. Now, we’ll repeat our study many times by drawing many random samples from this population using the same design and sample size. Next, we perform the chi-squared test of independence on all the samples and plot the distribution of the chi-squared values. This distribution is known as a sampling distribution, which is a type of probability distribution.
If we follow this procedure, we create a graph that displays the distribution of chi-squared values for a population where the null hypothesis is true. We use sampling distributions to calculate probabilities for how unlikely our sample statistic is if the null hypothesis is correct. Chi-squared tests use the chi-square distribution.
Fortunately, we don’t need to collect many random samples to create this graph! Statisticians understand the properties of chi-squared distributions so we can estimate the sampling distribution using the details of our design.
Our goal is to determine whether our sample chi-squared value is so rare that it justifies rejecting the null hypothesis for the entire population. The chi-squared distribution provides the context for making that determination. We’ll calculate the probability of obtaining a chi-squared value that is at least as high as the value that our study found (6.17).
This probability has a name—the P-value! A low probability indicates that our sample data are unlikely when the null hypothesis is true.
Alternatively, you can use a chi-square table to determine whether our study’s chi-square test statistic exceeds the critical value.
Related posts: Sampling Distributions, Understanding Probability Distributions and Interpreting P-values
Graphing the Chi-Squared Test Results for Our Example
For chi-squared tests, the degrees of freedom define the shape of the chi-squared distribution for a design. Chi-square tests use this distribution to calculate p-values. The graph below displays several chi-square distributions with differing degrees of freedom.
For a table with r rows and c columns, the method for calculating degrees of freedom for a chi-square test is (r-1) (c-1). For our example, we have two rows and three columns: (2-1) * (3-1) = 2 df.
Read my post about degrees of freedom to learn about this concept along with a more intuitive way of understanding degrees of freedom in chi-squared tests of independence.
Below is the chi-squared distribution for our study’s design.
The distribution curve displays the likelihood of chi-squared values for a population where there is no relationship between uniform color and status at the population level. I shaded the region that corresponds to chi-square values greater than or equal to our study’s value (6.17). When the null hypothesis is correct, chi-square values fall in this area approximately 4.6% of the time, which is the p-value (0.046). With a significance level of 0.05, our sample data are unusual enough to reject the null hypothesis.
The sample evidence suggests that a relationship between the variables exists in the population. While this test doesn’t indicate red shirts have a higher chance of dying, there is something else going on with red shirts. Read my other post chi-squared to learn about that!
Related Reading
When you have smaller sample sizes, you might need to use Fisher’s exact test instead of the chi-square version. To learn more, read my post, Fisher’s Exact Test: Using and Interpreting.
Learn more about How to Find the P Value.
You can also read about the chi-square goodness of fit test, which assesses the distribution of outcomes for a categorical or discrete variable.
Pearson’s chi-squared test for independence doesn’t tell you the effect size. To understand the strength of the relationship, you’d need to use something like Cramér’s V, which is a measure of association like Pearson’s correlation—except for categorical variables. That’s the topic of a future post!
David says
Jim – I want to start by saying that I love your site. It has helped me out greatly during many occasions. In this particular example I am interested in understanding the logic around the math for the expected values. For example, can you explain how I should interpret scaling the total number dead by the total number blue?
From there I get that we divide by the total number of people to get the number of blue deaths expected within the group of 430 people. Is this a formula that is well known for contingency tables or did you apply that strictly for this scenario?
Hopefully this question made sense?
Either way, thanks for the contributing to the community!
Jim Frost says
Hi David,
I’m so glad to hear that my site has been helpful!
I’m not 100% sure what you’re asking, so I’m not sure if I’m answering your question. To start, the formulas are the standard ones for the chi-squared test of independence, which you use in conjunction with contingency tables. You’d use the same methods and formulas for other datasets.
The portion you’re asking about is how to calculate the expected number for blue deaths if there is no association between uniform color and deaths (i.e., the null hypothesis of the test is true). So, the interpretation of the value is: If there is no relationship between uniform color and deaths, we’d expect 12.6 fatalities among those wearing blue uniforms. The test as a whole compares these expected values (for all table cells) to the observed values to determine whether the data support rejecting the null hypothesis and concluding that there is a relationship between the variables.
Monica says
I teach AP Stat and am planning on using your example. However, in checking conditions I would like to be able to give background on the origin of the data. I went to your link and found that this data was collected for the TV episodes. Are those the episodes just for the original series?
Thanks!
Jim Frost says
Hi Monica,
That’s great you’re teaching an AP Stats class! 🙂
Yes, the data I use are from the original TV series that aired from 1966-69.
Sean says
Thank you for your gracious reply. I’m especially happy because it meant that I actually understood! You’ve done a great service with this blog; I plan to return regularly! Thank you.
Jim Frost says
Hi Sean,
I was think exactly that after fixing the comment. It would make a perfect comprehension test. Read this article and find the two incorrect letters! You passed! 🙂
Sean says
I very much appreciate your clear explanations. I’m a “50 something” trying to finish a PhD in Library Science and my brain needs the help!
One question, please?
You write above:
Larger values represent a greater difference between your sample data and the null hypothesis. Chi-squared tests are one-tailed tests rather than the more familiar two-tailed tests. The test determines whether the entire set of differences exceeds a significance threshold. If your χ2 passes the limit, your results are statistically significant! You can reject the null hypothesis and conclude that the variables are independent.
I thought that rejecting the null hypothesis allowed you to conclude the opposite. If the null hypothesis is
Null: The variables are independent. No relationship exists.
Then rejecting the Null hypothesis means rejecting that the variables are independent, not concluding that the variables are independent.
This is, please, a honest question, (not being “that guy”; i’m not smart enough!).
Again, thank you for your work!! I’m going to check to see if you cover Kendall’s W, as it’s central to a paper I’m reading!
Jim Frost says
Hi Sean,
First, I definitely welcome all questions! And, especially in this case because you caught a typo! You’re correct about what rejecting the null hypothesis means for this test. I’ve updated the text to say “and conclude that the variables are dependent.” I double-checked elsewhere through article and all the other text about the conclusions based on significance are correct. Just a brain malfunction on my part! I’m grateful you caught that as that little slip changes the entire meaning!
Alas, I don’t cover Kendall’s W–at least not yet. I plan to add that down the road.
Vaishali Sarathy says
Thanks Jim. Your explanations are so effective, yet easy to understand!
Sumeet says
Thank you Jim. Great post and reply. I have a question which is an extension of Michael’s question.
In general, it seems like one could build any test statistic. Find the distribution of your statistic under the null (say using bootstrap), and that will give you a p-value for your dataset.
Are chi-squared, t, or F-statistics special in some way? Or do we continue to use them simply because people have used them historically?
Jim Frost says
Hi Sumeet,
Originally, hypothesis tests that used these distributions were easier to calculate. You could calculate the test statistic using a simple formula and then look it up in a table. Later, it got even easier when the computer could both calculate the test statistic and tell you its p-value. It’s really the ease of calculation that made them special along with the theories behind them.
Now, we have such powerful computers that they can easily construct very large sets of bootstrap samples. That would’ve been difficult earlier. So, a large part of the answer is that bootstrapping really wasn’t feasible earlier and so the use of the chi-squared, t, and F distributions became the norm. The historically accepted standards.
It’s possible that over time bootstrap methods will gain be used more. I haven’t done extensive research into how efficient they are compared to using the various distributions, but what I have done indicates they are at least roughly on par. If you haven’t, I’d suggest reading my post about bootstrapping for more information.
Thanks for asking the great question!
PRADEEP PANICKER says
Nice explanation
Cathy says
This has started my year, so far so good,
Thank you Jim.
desalegn says
great lesson thanks
djalma paiva sampaio neto says
Thankyou Jim, I will read and calc this lesson today, at 3 o’clock Brasilia time.
Phani says
Thank You Sir
Michael says
Great post, thanks for writing it. I am looking forward to the Cramer’s V post!
As a person just starting to dive into statistics I am curios why we so often square the differences to make calculations. It seems squaring a difference will put to much weight on large differences. For example, in the chi-square test what if we used the absolute value of observed and expected differences? Just something I have been wondering about.
Jim Frost says
Hi Michael,
There’s several ways of looking at your question. In some cases, if you just want to know how far observations are from the mean for a dataset, you would be justified using the mean absolute deviation rather than the standard deviation, which incorporates squared deviations but then takes the square root.
However, in other cases, the squared deviations are built into the underlying analysis. Such as in linear regression where it penalizes larger errors which helps force them to be smaller. Otherwise, the regression line would not “consider” larger errors to be much worse than smaller errors. Here’s an article about it in the regression context.
Or, if you’re working with the normal distribution and using it calculate probabilities or what not, that distribution has the mean and standard deviation as parameters. And the standard deviation incorporates squared differences. You could not work with the normal distribution using mean absolute deviations (MAD).
In a similar vein for chi-squared tests, you have to realize that the chi-squared distribution is based on squared differences. So, if you wanted to do a similar analysis but with the mean absolute deviation (MAD), you’d have to devise an entirely new test statistic and sampling distribution for it! You couldn’t just use the chi-squared distribution because that is specifically for these differences that use squaring. Same thing for F-tests which use ratios of variances, and variances are of course based on squared differences. Again, to use MAD for something like ANOVA, you’d need to come up with a new test statistic and sampling distribution!
But, the general reason is that squaring does weight large differences more heavily and that fits in with the rational that given a distribution of values, outlier values should be weighted more because they are relatively unlikely to occur so when they do it’s noteworthy. It makes those large differences between the expected and the observed more “odd.” And, some analyses use an underlying sampling distribution that is based on a test statistic calculated using squared differences in some fashion.
Niyungeko Antoine says
Thank you Jim.
Tobden says
Great lesson Jim! You’re putting it a very simple ways for non-statisticians. Thanks for sharing the knowledge!
MIna says
Thanks for sharing, Jim!