The Chi-square test of independence determines whether there is a statistically significant relationship between categorical variables. It is a hypothesis test that answers the question: do the values of one categorical variable depend on the value of another categorical variable?
As you no doubt guessed, I’m a huge fan of statistics. I’m also a big Star Trek fan. Consequently, it’s not surprising that I’m writing a blog post about both! In the Star Trek TV series, Captain Kirk and the crew wear different colored uniforms to identify the crewmember’s work area. Those who wear red shirts have the unfortunate reputation of dying more often than those who wear gold or blue shirts.
In this post, I’ll show you how the Chi-square test of independence works. Then, I’ll show you how to perform the analysis and interpret the results by working through the example. I’ll use this test to determine whether wearing the dreaded red shirt in Star Trek is the kiss of death!
If you need a primer on the basics, read my hypothesis testing overview.
Overview of the Chi-Square Test of Independence
The Chi-square test of independence (also known as the Chi-square test of association) evaluates relationships between categorical variables. Like any statistical hypothesis test, the Chi-square test has both a null hypothesis and an alternative hypothesis.
- Null hypothesis: There are no relationships between the categorical variables. If you know the value of one variable, it does not help you predict the value of another variable.
- Alternative hypothesis: There are relationships between the categorical variables. Knowing the value of one variable does help you predict the value of another variable.
The Chi-square test of independence works by comparing the distribution that you observe to the distribution that you expect if there is no relationship between the categorical variables. In the Chi-square context, the word “expected” is equivalent to what you’d expect if the null hypothesis is true. If your observed distribution is sufficiently different from the expected distribution (no relationship), you can reject the null hypothesis and infer that the variables are related.
For a Chi-square test, a p-value that is less than or equal to your significance level indicates there is sufficient evidence to conclude that the observed distribution is not the same as the expected distribution. You can conclude that a relationship exists between the categorical variables.
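To make that comparison concrete, here is a minimal sketch in Python (using numpy and scipy, which is an assumption on my part rather than the software behind the output in this post). It computes the expected counts under the null hypothesis of independence, the Chi-square statistic, and the p-value for a small made-up 2×2 table.

```python
import numpy as np
from scipy import stats

# Made-up 2x2 contingency table: rows are groups, columns are outcomes
observed = np.array([[30, 10],
                     [20, 40]])

# Expected count for each cell under independence:
# (row total * column total) / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_sq = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1)
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p-value: probability of a statistic at least this large when the null is true
p_value = stats.chi2.sf(chi_sq, df)

print("Expected counts:\n", expected)
print(f"Chi-square = {chi_sq:.3f}, df = {df}, p-value = {p_value:.4f}")
```

The statistic sums the squared differences between observed and expected counts, scaled by the expected counts, so larger departures from independence produce a larger statistic and a smaller p-value.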
Star Trek Fatalities by Uniform Colors
We’ll perform a Chi-square test of independence to determine whether there is a statistically significant association between shirt color and deaths. We need to use this test because these variables are both categorical. Shirt color can be only blue, gold, or red. Status can be only dead or alive.
The color of the uniform represents each crewmember’s work area. We will statistically assess whether there is a connection between uniform color and the fatality rate. Believe it or not, there are “real” data about the crew from authoritative sources and the show portrayed the deaths onscreen. The table below shows how many crewmembers are in each area and how many have died.
| Color | Areas | Crew | Fatalities |
|-------|-------|------|------------|
| Blue | Science and Medical | 136 | 7 |
| Gold | Command and Helm | 55 | 9 |
| Red | Operations, Engineering, and Security | 239 | 24 |
| Ship’s total | All | 430 | 40 |
Performing the Chi-Square Test of Independence for Uniform Color and Fatalities
For our example, we will determine whether the observed counts of deaths by uniform color are different from the distribution that we’d expect if there is no association between the two variables.
The table below shows how I’ve entered the data into the worksheet. You can also download the CSV dataset for StarTrekFatalities.
| Color | Status | Frequency |
|-------|--------|-----------|
| Blue | Dead | 7 |
| Blue | Alive | 129 |
| Gold | Dead | 9 |
| Gold | Alive | 46 |
| Red | Dead | 24 |
| Red | Alive | 215 |
You can use the dataset to perform the analysis in your preferred statistical software. The Chi-squared test of independence results are below. As an aside, I use this example in my post about degrees of freedom in statistics. Learn why there are two degrees of freedom for the table below.
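If you’d like to reproduce the results in Python instead of a dedicated statistics package, here is a minimal sketch (assuming scipy is installed; this is my illustration, not the software used for the output in this post) that runs the same test on the data above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the table above: rows = Blue, Gold, Red; columns = Alive, Dead
observed = np.array([[129,  7],
                     [ 46,  9],
                     [215, 24]])

# Pearson Chi-square test of independence
chi_sq, p_value, df, expected = chi2_contingency(observed)

print(f"Chi-square = {chi_sq:.3f}, df = {df}, p-value = {p_value:.3f}")
print("Expected counts:\n", expected.round(2))
# Expect roughly: Chi-square = 6.189, df = 2, p-value = 0.045
```

The function returns the Pearson Chi-square statistic, the p-value, the degrees of freedom, and the table of expected counts.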
In our statistical results, both p-values are less than 0.05. We can reject the null hypothesis and conclude there is a relationship between shirt color and deaths. The next step is to define that relationship.
Describing the relationship between categorical variables involves comparing the observed count to the expected count in each cell of the Dead column. I’ve annotated this comparison in the statistical output above. Additionally, you can graph each cell’s contribution to the Chi-square statistic, which is below.
Surprise! It’s the blue and gold uniforms that contribute the most to the Chi-square statistic and produce the statistical significance! Red shirts add almost nothing. In the statistical output, the comparison of observed counts to expected counts shows that blue shirts die less frequently than expected, gold shirts die more often than expected, and red shirts die at the expected rate.
The graph below reiterates these conclusions by displaying fatality percentages by uniform color along with the overall death rate.
The Chi-square test indicates that red shirts don’t die more frequently than expected. Hold on. There’s more to this story!
Time for a bonus lesson and a bonus analysis in this blog post!
2 Proportions test to compare Security Red-Shirts to Non-Security Red-Shirts
The bonus lesson is that it is vital to include the genuinely pertinent variables in the analysis. Perhaps the color of the shirt is not the critical variable but rather the crewmember’s work area. Crewmembers in Security, Engineering, and Operations all wear red shirts. Maybe only security guards have a higher death rate?
We can test this theory using the 2 Proportions test. We’ll compare the fatality rates of red-shirts in security to red-shirts who are not in security.
The summary data are below. In the table, the events represent the counts of deaths, while the trials are the number of personnel.
| | Events | Trials |
|--------------|--------|--------|
| Security | 18 | 90 |
| Not security | 6 | 149 |
The p-value of 0.000 (i.e., less than 0.0005) signifies that the difference between the two proportions is statistically significant. Security has a mortality rate of 20%, while the other red-shirts are at only 4%.
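For readers who want to check this result themselves, here is a minimal sketch of a pooled two-proportion z-test in Python (numpy and scipy assumed; this is the standard textbook calculation, not necessarily the exact procedure behind the output above).

```python
import numpy as np
from scipy.stats import norm

# Deaths (events) and personnel (trials) from the table above
deaths = np.array([18, 6])      # Security, Not security
crew = np.array([90, 149])

p1, p2 = deaths / crew          # 0.20 vs. roughly 0.04
pooled = deaths.sum() / crew.sum()

# Pooled standard error and z statistic for the difference in proportions
se = np.sqrt(pooled * (1 - pooled) * (1 / crew[0] + 1 / crew[1]))
z = (p1 - p2) / se

# Two-sided p-value from the standard normal distribution
p_value = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, p-value = {p_value:.5f}")   # p-value well below 0.001
```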
Security officers have the highest mortality rate on the ship, closely followed by the gold-shirts. Red-shirts that are not in security have a fatality rate similar to the blue-shirts.
As it turns out, it’s not the color of the shirt that affects fatality rates; it’s the duty area. That makes more sense.
Risk by Work Area Summary
The Chi-square test of independence and the 2 Proportions test both indicate that the death rate varies by work area on the U.S.S. Enterprise. Doctors, scientists, engineers, and those in ship operations are the safest, with about a 5% fatality rate. Crewmembers who are in command or security have death rates that exceed 15%!
Hi there, I’m just wondering if it would be appropriate to use a Chi-square test in the following scenario:
– A data set of 1000 individuals
– Calculate Score A for all 1000 individuals; results are continuous numerical data, e.g., 2.13, 3.16, which then allow individuals to be placed in categories; low risk (3.86)
– Calculate Score B for the same 1000 individuals; results are discrete numerical data, e.g., 1, 6, 26, 4, which then allow individuals to be placed in categories; low risk (26).
– I then want to compare the two scoring systems A & B to see if (1) the individuals are scoring similarly on both scores, and (2) since I have reason to believe one of the scores overestimates the risk, I’d like to test this.
Thank you, I haven’t been able to find any similar examples and it’s stressing me out 🙁
Hi Jim,
Would you be able to advise?
My organization is sending out 6 different emails to employees, in which they have to click on a link in the email. We want to see if one variation in language might get a higher click rate for the link. So we have 6 between-subjects conditions, and the response can either be ‘clicked on the link’ or ‘NOT clicked on the link’.
Is this a Chi-square test of independence? Also, how would I know where the difference lies, if the test is significant? (i.e., what is the non-parametric equivalent of running an ANOVA and follow-up pairwise comparisons?)
Thanks Jim!
I am working on the press coverage of civil-military relations in the Pakistani press from 2008 to 2018. I want to check whether there is a difference in coverage between two tenures, i.e., 2008 to 2013 and 2013 to 2018. Secondly, I want to check the difference in coverage between two types of newspapers, i.e., English newspapers and Urdu newspapers. Furthermore, I also want to check the category-wise difference in coverage across the tenure 2008 to 2018.
I have divided my data into three different categories: 1 is pro-civilian, 2 is pro-military, and 3 is neutral.
Hi, thank you so much for this. I would like to ask: if the study is about whether factors such as pricing, marketing, and brand affect the intention of the buyer to purchase the product, can I use the Chi-square test for the statistical treatment? And if not, may I ask what statistical treatment you would suggest? Thank you so much again.
Jim,
Thank you for the post. You displayed a lot of creativity linking the two lessons to Star Trek. Your website and ebook offerings are very inspiring to me.
Bill
Thanks so much, Bill. I really appreciate the kind words and I’m happy that the website and ebooks have been helpful!
Thank you for your explanation. I am trying to help my son with his final school year investigation. He has raw data which he collected from 21 people of varying experience. They all threw a rugby ball at a target, and the accuracy, time of ball in the air, and experience (rated from 1-5) were all recorded. He has calculated the speed and the displacement, and used correlation to compare speed versus accuracy and experience versus accuracy. He needs to incrementally increase the difficulty of the maths he uses in his analysis and he was thinking of the Chi-square test as a next step; however, from your explanation above, the current form of his data would not be suitable for this test. Is there a way of re-arranging the data so that we can use the Chi-square test? Thanks!
Hi Rhonwen,
The chi-squared test of independence looks for correlation between categorical variables. From your description, I’m not seeing a good pair of categorical variables to test for correlation. To me, the next step for this data appears to be regression analysis.
Hi Jim,
Thank you for the detailed teaching! I think this explains chi square much better than other websites I have found today. Do you mind sharing which software you use to get Expected Count and contribution to Chi square? Thank you for your help.
Good day jim!
I was wondering what kind of data analysis I should use if I am going to conduct research on knowledge, attitudes, and practices?
Looking forward to your reply!
Thank you!
Very informative and easy to understand.
Thank you so much sir
Hi
I wanted to know how the significance probability can be calculated if the significance level wasn’t given.
Thank you
Hi, you don’t need to know the significance level to be able to calculate the p-value. For calculating the p-value, you must know the null hypothesis, which we do for this example.
However, I do use a significance level of 0.05 for this example, making the results statistically significant.
What summary statistics can I use to describe the graph of categorical data? Good presentation, by the way. Very insightful.
Hi Michael,
For categorical data like the type in this example, which is in a two-way contingency table, you’d often use counts or percentages. A bar chart is often a good choice for graphing counts or percentages by multiple categories. I show an example of graphing data for contingency tables in my Introduction to Statistics ebook.
Thank you for your answer. I saw online that bar graphs can be used to visualise the data (I guess it would be the percentage of deaths in my case) with 95% confidence intervals for the error bars. Is this also applicable if I only have a 2×2 contingency table? If not, what could be my error bar?
Hi John, you can obtain CIs for proportions, which are basically percentages. And bar charts are often good for graphing contingency tables.
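As a minimal sketch of what a proportion CI could look like in Python (statsmodels assumed; I’m using the red-shirt fatality count from this post purely as the example):

```python
from statsmodels.stats.proportion import proportion_confint

# 24 deaths out of 239 red-shirts (counts taken from this post)
lower, upper = proportion_confint(count=24, nobs=239, alpha=0.05, method="wilson")
print(f"Fatality rate: {24 / 239:.1%}, 95% CI: ({lower:.1%}, {upper:.1%})")
```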
Hi! So I am working on this little project where I am trying to find a relationship between sex and mortality brought on by this disease, so my variables are: sex (male or female) and status (dead or alive). I am new to statistics, so I do not know much. Is there any way to check the normality of categorical data? There is a part wherein our data must be based on data normality, but I am not sure if this applies to categorical data. Thank you for your answer!
Hi John,
The normal distribution is for continuous data. You have discrete data values–two binary variables to be precise. So, the normal distribution is not applicable to your data.
Hi Jim, this was really helpful. I am in the midst of my proposal on a research to determine the association between burnout and physical activity among anaesthesia trainees.
They are both categorical variables:
physical activity – 3 categories: high, moderate, low
burnout – 2 categories: high and low
How do I calculate my sample size for my study?
Hi Jaishree,
I suggest you download a free sample size and power calculation program called G*Power.
Experiment and adjust the values to see how that changes the output. You want to find a sample size that produces sufficient power while incorporating your best estimates of the other parameters (effect size, etc.).
Learned so much from this post!! This was such a clear example that it is the first time some statistical tests have really made sense to me.
Thank you so much for sharing your knowledge, Jim!!
Hello,
the information that you have given here has been so useful to me – really understand it much better now. So, thank you very much!
Just a quick question: how did you graph the contributions to the chi-square statistic? I’ve been using Stata to do some data analysis and I’m not sure how I would be able to create a graph like that for my own data. Any insight that you can give would be extremely useful.
Hi Daisy,
I used Minitab statistical software for the graphs. I think graphs often bring the data to life more than just a table of numbers.
Hi, Jim.
I have the results of two Exit Satisfaction Surveys related to two cohorts (graduates of 2017-18 and graduates of 2018-19). The information I received was just the “number” of ratings on each of the 5 points on the Likert Scale (e.g., 122 respondents Strongly Agreed to a given item). I changed the raw ratings into percentages for comparison, e.g., for Part A of the Survey (Proficiency and Knowledge in my major field), I calculated the minimum and maximum percentages on the Strongly Agree point and did the same for the other points on the scale. My questions are (1) can I report the range of percentages on each point on the scale for each item, or is it better to report an overall agreement/disagreement? and (2) what are the best statistics to compare the satisfaction of the two cohorts in the same survey? The 2017-18 cohort included 126 graduates, and the 2018-19 cohort included 296 graduates.
I checked out your Introduction to Statistics book that I purchased, but I couldn’t decide about the appropriate statistics for the analysis of each of the surveys as well as comparison of both cohorts.
My sincere thanks in advance for your time and advice,
All the best,
Ellie
Thank you for an excellent post! I will soon perform a Chi-square test of independence on survey responses with two variables, and now think it might be good to start with a 2 proportions test (is a Z-test with 2 proportions what you use in this example?). Since you don’t discuss whether the Star Trek data meets the assumptions of the two tests you use, I wonder if they share approximately the same assumptions? I have already made certain that my data may be used with the Chi-square (my data is, by the way, not necessarily normally distributed, and has unknown mean and variance); can I therefore be comfortable with using a 2 proportions Z-test too? I hope you have the time to help me out here!
Excellent post. Btw, is it similar to what they call the Test of Association that uses a contingency table? The way they compute the expected value is (row total × column total)/(sample total). And to check if there is a relationship between two variables, you check if the calculated chi-squared value is greater than the critical value of the chi-squared distribution. Is it just the same?
Hi Hephzibah,
Yes, they’re the same test–test of independence and test of association. I’ll add something to that effect to the article to make that more clear.
Jim, thanks for creating and publishing this great content. In the initial chi-square test for independence we determined that shirt color does have a relationship with death rate. The Pearson chi-square statistic is 6.189; is this number meaningful? How do we interpret this in plain English?
Hi Michael,
There’s really no direct interpretation of the chi-square value. That’s the test statistic, similar to the t-value in t-tests and the F-value in F-tests. These values are placed in the chi-square probability distribution that has the specified degrees of freedom (df=2 for this example). By placing the value into the probability distribution, the procedure can calculate probabilities, such as the p-value. I’ve been meaning to write a post that shows how this works for chi-squared tests. I show how this works for t-tests and for F-tests in one-way ANOVA. Read those to get an idea of the process. Of course, this chi-squared test uses chi-squared as the test statistic and probability distribution.
I’ll write a post soon about how this test works, both in terms of calculating the chi-square value itself and then using it in the probability distribution.
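In the meantime, here is the gist as a minimal sketch in Python (scipy assumed): you place the test statistic in a chi-square distribution with the appropriate degrees of freedom and find the probability of a value at least that large.

```python
from scipy.stats import chi2

# Place the test statistic from this example (6.189) into a chi-square
# distribution with 2 degrees of freedom; the upper-tail area is the p-value
p_value = chi2.sf(6.189, df=2)
print(round(p_value, 3))   # about 0.045, matching the output above
```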
Would Chi-squared test be the statistical test of choice, for comparing the incidence rates of disease X between two states? Many thanks.
Hi Michaela,
It sounds like you’d need to use a two-sample proportions test. I show an example of this test using real data in my post about the effectiveness of flu vaccinations. The reason you’d need to use a proportions test is because your observed data are presumably binary (diseased/not diseased).
You could use the chi-squared test, but I think for your case the results are easier to understand using a two-sample proportions test.
Jim,
Let’s say the expected salary for a position is 20,000 dollars. In our observed salaries we have various figures a little above and below 20,000, and we want to do a hypothesis test. These salaries are ratio data, so does that mean we cannot use Chi-square? Do we have to convert? How? In fact, when you run a chi-square on the salary data, the Chi-square statistic turns out to be very high, sort of off the Chi-square critical value chart.
thanks
Lou
Hi Louis,
Chi-square analysis requires two or more categorical (nominal) variables. Salary is a continuous (ratio) variable. Consequently, you can’t use chi-square.
If you have the one continuous variable of salary and you want to determine whether the difference between the mean salary and $20,000 is statistically significant or not, you’d need to use a one-sample t-test. My post about the different forms of t-tests should be helpful for you.
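Here is a minimal sketch of that one-sample t-test in Python (scipy assumed; the salary figures below are made up purely for illustration).

```python
from scipy.stats import ttest_1samp

# Hypothetical salaries for the position (made-up numbers for illustration)
salaries = [19500, 20800, 21250, 19900, 20400, 18750, 21100, 20650]

# Test whether the mean salary differs from $20,000
t_stat, p_value = ttest_1samp(salaries, popmean=20000)
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")
```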
Jim,
I don’t know how to thank you for your detailed informative reply. And I am happy that a specialist like you found this study interesting yoohoo 🙂
As to your comment on how we (my graduate student, whose thesis I am directing, and I) tracked the errors from writing samples 1 to 5 for each participant, we did it manually through a close content analysis. I had no idea of a better alternative since going through 25 writing samples needed meticulous comparison for each participant. I advised my student to tabulate the number, frequency, and type of errors for each participant separately so we could keep track of their (lack of) improvement depending on the participant’s proficiency level.
Do you have any suggestion to make it more rigorous?
Very many thanks,
Ellie
Hi, Jim. I first decided to choose chi-square to analyze my data, but now I am thinking of Poisson regression since my dependent variable is a count. I want to see if there is any significant difference between Grade 10 students’ perceptions of their writing problems and the frequency of their writing errors in the five paragraphs they wrote. Here is the detailed situation:
1. Five sample paragraphs were collected from 5 students at 5 proficiency levels based on their total marks in English final exam in the previous semester (from Outstanding to Poor).
2. The students participated in an interview and expressed their perceptions of their problem areas in writing.
3. The students submitted their paragraphs every 2 weeks during the semester.
4. The paragraphs were marked based on the school’s marking rubrics.
5. Errors were categorized under five components (e.g., grammar, word choice, etc.).
6. Paragraphs were compared for measuring the students’ improvement by counting errors manually in each and every paragraph.
7. The students’ errors were also compared to their perceived problem areas to study the extent of their awareness of their writing problems. This comparison showed that students were not aware of a major part of their errors while their perceived errors were not necessarily observed in their writing samples.
8. Comparison of Paragraphs 1 and 5 for each student showed a decrease in the number of errors in some language components, while some errors still persisted.
9. I’m also interested to see if proficiency level has any impact on students’ perceptions of their real problem areas and the frequency of their errors in each language category.
My question is which test should be used to answer Qs 7 and 8?
As to Q9, one of the dependent variables is a count and the other one is nominal. One correlation I’m thinking of is eta squared (interval-nominal), but for proficiency-frequency I’m not sure.
My sincere apologies for this long query and many thanks for any clues to the right stats.
Ellie
Hi Ellie,
That sounds like a very interesting study!
I think that you’re correct to use some form of regression rather than chi-square. The chi-squared test of independence doesn’t work with counts within an observation. Chi-squared looks at the multiple characteristics of an observation and essentially places it in a basket for that combination. For example, you have a red shirt/dead basket and a red shirt/alive basket. The procedure looks at each observation and places it into one of the baskets. Then it counts the observations in each basket.
What you have are counts (of errors) within each observation. You want to understand the IVs that relate to those counts. That’s a regression thing. Now, which form of regression? Because it involves counts, Poisson regression is a good possibility. You might also read up on negative binomial regression, which is related. Sometimes you can have count data that doesn’t meet certain requirements of the Poisson distribution, but you can use negative binomial regression. For more information, look at pages 321-322 of my ebook that you just bought! 🙂 I talk a bit about regression with counts.
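As a rough sketch of what a Poisson regression could look like in Python (statsmodels assumed; the variable names and data below are hypothetical placeholders, not your actual dataset):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical data: error counts per writing sample with a proficiency rating
df = pd.DataFrame({
    "errors":      [12, 9, 7, 5, 4, 20, 16, 14, 11, 9],
    "proficiency": [ 5, 5, 5, 5, 5,  1,  1,  1,  1, 1],
    "sample_num":  [ 1, 2, 3, 4, 5,  1,  2,  3,  4, 5],
})

# Poisson regression of error counts on proficiency and sample number
model = smf.glm("errors ~ proficiency + sample_num",
                data=df, family=sm.families.Poisson()).fit()
print(model.summary())
```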
And, there’s a chance that you might be able to use OLS regression. That depends on how you’re handling the multiple assessments and the average number of errors. The Poisson distribution begins to approximate the normal distribution at around a mean of 25-ish. If the number of errors tends to fall around there or higher, OLS might be the ticket! If you’re summing multiple observations together, that might help in this regard.
I don’t understand the design of how you’re tracking changes in the number of errors over time, and how you’ll model that. You might include lagged values of errors to explain current errors, along with other possible IVs.
I found point number 7 to be really interesting. Is it that the blind spot allows the error to persist in greater numbers and that awareness of errors had reduced numbers of those types? Your interpretation of that should be very interesting!
Oh, and for the nominal dependent variable, use nominal logistic regression (p. 319-320)!
I hope this helps!
Thanks for your clear posts. Could you please give some insight, as in the t-test and F-test posts, into how we can calculate a chi-square test statistic value and how to convert it to a p-value?
Hi Gopi,
I have that exact topic in mind for a future blog post! I’ll write one up similar to the t-test and F-test posts in the near future. It’s too much to do in the comments section, but soon an entire post for it! I’ll aim for sometime in the next couple of months. Stay tuned!
This was great. 🙂
Thanks, I have learnt a lot.
Hello, thanks for the nice tutorial. Can you please explain how the ‘Expected count’ is calculated in the table “Tabulated statistics: Uniform color, Status”?
Hi Shihab, that’s an excellent question!
You calculate the expected value for each cell by first multiplying the column proportion by the row proportion that are associated with each cell. This calculation produces the expected proportion for that cell. Then, you take the expected proportion and multiply it by the total number of observations to obtain the expected count. Let’s work through an example!
I’ll calculate the expected value for wearing a Blue uniform and being Alive. That’s the top-left cell in the statistical output.
At the bottom of the Alive column, we see that 90.7% of all observations are alive. So, 0.907 is the proportion for the Alive column. The output doesn’t display the proportion for the Blue row, but we can calculate that easily. We can see that there are 136 total counts in the Blue row and there are 430 total crew members. Hence, the proportion for the Blue row is 136/430 = 0.31627.
Next, we multiply 0.907 * 0.31627 = 0.28685689. That’s the expected proportion that should fall in that Blue/Alive cell.
Now, we multiply that proportion by the total number of observations to obtain the expected count for that cell:
0.28685689 * 430 = 123.348
You can see in the statistical output that this value has been rounded to 123.35.
You simply repeat that procedure for the rest of the cells.
I hope this helps!
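If you want to verify all of the expected counts at once, here is a minimal sketch in Python (numpy assumed, as my own illustration) that applies that same row total × column total / grand total logic to every cell.

```python
import numpy as np

# Observed counts: rows = Blue, Gold, Red; columns = Alive, Dead
observed = np.array([[129,  7],
                     [ 46,  9],
                     [215, 24]])

row_totals = observed.sum(axis=1, keepdims=True)   # 136, 55, 239
col_totals = observed.sum(axis=0, keepdims=True)   # 390, 40
grand_total = observed.sum()                       # 430

# Expected count for each cell = row total * column total / grand total
expected = row_totals * col_totals / grand_total
print(expected.round(2))   # the Blue/Alive cell is 123.35, as calculated above
```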
very nice, thanks
Amazing post!! In the tabulated statistics section, you ran a Pearson Chi-square and a Likelihood Ratio Chi-square test. Are both of these necessary, and do BOTH have to fall below the significance level for the null to be rejected? I’m assuming so. I don’t know what the difference is between these two tests, but I will look it up. That was the only part that lost me. :)
Thanks again, Jessica! I really appreciate your kind words!
When the two p-values are in agreement (e.g., both significant or insignificant), that’s easy. Fortunately, in my experience, these two p-values usually do agree. And, as the sample size increases, the agreement between them also increases.
I’ve looked into what to do when they disagree and have not found any clear answers. This paper suggests that as long as all expected frequencies are at least 5, you should use the Pearson Chi-square test. When any expected frequency is less than 5, the article recommends an adjusted Chi-square test, which is neither of the displayed tests!
These tests are most likely to disagree when you have borderline results to begin with (near your significance level), and particularly when you have a small sample. Either of these conditions alone make the results questionable. If these tests disagree, I’d take it as a big warning sign that more research is required!
I hope this helps!
Nice post.
Thank you!
A good presentation. My experience with researchers in health sciences and clinical studies is that very often people do not bother about the hypotheses (null and alternative) but run after a p-value, even more so with the Chi-square test of independence! Your narration is excellent.
Helpful post. I can understand now
Excellent Example, Thank you.
You’re very welcome. I’m glad it was helpful!