In a previous blog post, I introduced the basic concepts of hypothesis testing and explained the need for performing these tests. In this post, I’ll build on that and compare various types of hypothesis tests that you can use with different types of data, explore some of the options, and explain how to interpret the results. Along the way, I’ll point out important planning considerations, related analyses, and pitfalls to avoid.
A hypothesis test uses sample data to assess two mutually exclusive theories about the properties of a population. Hypothesis tests allow you to use a manageable-sized sample from the process to draw inferences about the entire population.
I’ll cover common hypothesis tests for three types of variables—continuous, binary, and count data. Recognizing the different types of data is crucial because the type of data determines the hypothesis tests you can perform and, critically, the nature of the conclusions that you can draw. If you collect the wrong data, you might not be able to get the answers that you need.
Related posts: Qualitative vs. Quantitative Data, Guide to Data Types and How to Graph Them, Discrete vs. Continuous, and Nominal, Ordinal, Interval, and Ratio Scales
Hypothesis Tests for Continuous Data
Continuous data can take on any numeric value, and it can be meaningfully divided into smaller increments, including fractional and decimal values. There are an infinite number of possible values between any two values. You often measure a continuous variable on a scale. For example, when you measure height, weight, and temperature, you have continuous data. With continuous variables, you can use hypothesis tests to assess the mean, median, and standard deviation.
When you collect continuous data, you usually get more bang for your data buck compared to discrete data. The two key advantages of continuous data are that you can:
- Draw conclusions with a smaller sample size.
- Use a wider variety of analyses, which allows you to learn more.
I’ll cover two of the more common hypothesis tests that you can use with continuous data—t-tests to assess means and variance tests to evaluate dispersion around the mean. Both of these tests come in one-sample and two-sample versions. One-sample tests allow you to compare your sample estimate to a target value. The two-sample tests let you compare the samples to each other. I’ll cover examples of both types.
There is also a group of tests that assess the median rather than the mean. These are known as nonparametric tests and practitioners use them less frequently. However, consider using a nonparametric test if your data are highly skewed and the median better represents the actual center of your data than the mean.
Related posts: Nonparametric vs. Parametric Tests and Determining Which Measure of Central Tendency is Best for Your Data
Graphing the data for the example scenario
Suppose we have two production methods, and our goal is to determine which one produces a stronger product. To evaluate the two methods, we draw a random sample of 30 products from each production line and measure the strength of each unit. Before performing any analyses, it’s always a good idea to graph the data because it provides an excellent overview. Here is the CSV data file in case you want to follow along: Continuous_Data_Examples.
These histograms suggest that Method 2 produces a higher mean strength while Method 1 produces more consistent strength scores. The higher mean strength is good for our product, but the greater variability might produce more defects.
Graphs provide a good picture, but they do not test the data statistically. The differences in the graphs might be caused by random sample error rather than an actual difference between production methods. If the observed differences are due to random error, it would not be surprising if another sample showed different patterns. It can be a costly mistake to base decisions on “results” that vary with each sample. Hypothesis tests factor in random error to improve our chances of making correct decisions.
Keep these graphs in mind when we look at binary data because they illustrate how much more information continuous data convey.
Related posts: Using Histograms to Understand Your Data and How Hypothesis Tests Work: Significance Levels and P-values
Two-sample t-test to compare means
The first thing we want to determine is whether one of the methods produces stronger products. We’ll use a two-sample t-test to determine whether the population means are different. The hypotheses for our 2-sample t-test are:
- Null hypothesis: The mean strengths for the two populations are equal.
- Alternative hypothesis: The mean strengths for the two populations are different.
A p-value less than the significance level indicates that you can reject the null hypothesis. In other words, the sample provides sufficient evidence to conclude that the population means are different. Below is the output for the analysis.
The p-value (0.034) is less than 0.05. From the output, we can see that the difference between the mean of Method 2 (98.39) and Method 1 (95.39) is statistically significant. We can conclude that Method 2 produces a stronger product on average.
That sounds great, and it appears that we should use Method 2 to manufacture a stronger product. However, there are other considerations. The t-test tells us that Method 2’s mean strength is greater than Method 1’s, but it says nothing about the variability of strength values. For that, we need to use another test.
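If you want to try a two-sample t-test outside of a statistics package, here is a minimal Python sketch using `scipy.stats.ttest_ind`. The data are synthetic stand-ins generated with assumed means and standard deviations, not the values in the CSV file, so the numbers will not match the output discussed above. The sketch uses Welch's version of the test (`equal_var=False`), which is an assumption; the post does not say whether the original analysis pooled the variances.

```python
import random
from statistics import mean

from scipy import stats

# Synthetic stand-in data (NOT the post's CSV): 30 strength values per method.
# The means and standard deviations passed to gauss() are assumed for illustration.
rng = random.Random(1)
method1 = [rng.gauss(95.4, 3.0) for _ in range(30)]
method2 = [rng.gauss(98.4, 6.0) for _ in range(30)]

# Welch's t-test: does not assume the two populations share a variance.
result = stats.ttest_ind(method1, method2, equal_var=False)

alpha = 0.05  # significance level
print(f"mean 1 = {mean(method1):.2f}, mean 2 = {mean(method2):.2f}")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print("reject H0" if result.pvalue < alpha else "fail to reject H0")
```

To run it on real data, replace the two lists with the measured strengths for each method.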
Related posts: How T-Tests Work, How to Interpret P-values Correctly, and Step-by-Step Instructions for How to Do t-Tests in Excel.
2-Variances test to compare variability
A production method that has excessive variability creates too many defects. Consequently, we will also assess the standard deviations of both methods. To determine whether either method produces greater variability in the product’s strength, we’ll use the 2 Variances test. The hypotheses for our 2 Variances test are:
- Null hypothesis: The standard deviations for the populations are equal.
- Alternative hypothesis: The standard deviations for the populations are different.
A p-value less than the significance level indicates that you can reject the null hypothesis. In other words, the sample provides sufficient evidence for concluding that the population standard deviations are different. The 2-Variances output for our product is below.
Both of the p-values are less than 0.05. The output indicates that the variability of Method 1 is significantly less than Method 2. We can conclude that Method 1 produces a more consistent product.
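A comparison of variability can be sketched in Python with `scipy.stats.levene`. This is one of several tests for equal variances and may not be the same method that produced the p-values discussed above. The data are again synthetic, generated with assumed parameters, so treat the output as illustrative only.

```python
import random

from scipy import stats

# Synthetic stand-in data (NOT the post's CSV); parameters are assumptions.
rng = random.Random(2)
method1 = [rng.gauss(95.4, 3.0) for _ in range(30)]
method2 = [rng.gauss(98.4, 6.0) for _ in range(30)]

# Levene's test with center="median" is less sensitive to non-normal data
# than a classical F-test of two variances.
result = stats.levene(method1, method2, center="median")

alpha = 0.05
print(f"W = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print("reject H0" if result.pvalue < alpha else "fail to reject H0")
```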
Related post: Measures of Variability
What we learned and did not learn with the hypothesis tests
The hypothesis test results confirm the patterns in the graphs. Method 2 produces stronger products on average while Method 1 produces a more consistent product. The statistically significant test results indicate that these results are likely to represent actual differences between the production methods rather than sampling error.
Our example also illustrates how you can assess different properties using continuous data, which can point towards different decisions. We might want the stronger products of Method 2 but the greater consistency of Method 1. To navigate this dilemma, we’ll need to use our process knowledge.
Finally, it’s crucial to note that the tests produce estimates of population parameters—the population means (μ) and the population standard deviations (σ). While these parameters can help us make decisions, they tell us little about where individual values are likely to fall. In certain circumstances, knowing the proportion of values that fall within specified intervals is crucial.
For the examples, the products must fall within spec limits. Even when the mean falls within the spec limits, it’s possible that too many individual items will fall outside them if the variability is too high.
Other types of analyses
To better understand the distribution of individual values rather than the population parameters, use the following analyses:
Tolerance intervals: A tolerance interval is a range that likely contains a specific proportion of a population. For our example, we might want to know the range where 99% of the population falls for each production method. We can compare the tolerance interval to our requirements to determine whether there is too much variability.
Capability analysis: This type of analysis uses sample data to determine how effectively a process produces output with characteristics that fall within the spec limits. These tools incorporate both the mean and spread of your data to estimate the proportion of defects.
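To see how the mean and spread combine to determine the proportion of defects, here is a rough Python sketch. It assumes the measurements follow a normal distribution, and the spec limits, mean, and standard deviations are hypothetical values chosen only for illustration. Real capability analysis involves more checks than this.

```python
from statistics import NormalDist


def fraction_out_of_spec(mu, sigma, lsl, usl):
    """Expected fraction of items outside [lsl, usl], assuming normality."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(lsl) + (1.0 - dist.cdf(usl))


def cpk(mu, sigma, lsl, usl):
    """Cpk index: distance from the mean to the nearer limit, in units of 3 sigma."""
    return min(usl - mu, mu - lsl) / (3.0 * sigma)


# Hypothetical spec limits and process parameters (not from the post).
LSL, USL = 90.0, 110.0
for sigma in (3.0, 6.0):
    frac = fraction_out_of_spec(100.0, sigma, LSL, USL)
    print(f"sigma = {sigma}: out of spec = {frac:.4%}, Cpk = {cpk(100.0, sigma, LSL, USL):.2f}")
```

Both hypothetical processes are centered between the limits, yet the one with the larger standard deviation is expected to produce far more out-of-spec items.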
Related post: Confidence Intervals vs. Prediction Intervals vs. Tolerance Intervals
Proportion Hypothesis Tests for Binary Data
Let’s switch gears and move away from continuous data. Suppose we take another random sample of our product from each of the production lines. However, instead of measuring a characteristic, inspectors evaluate each product and either accept or reject it.
Binary data can have only two values. If you can place an observation into only two categories, you have a binary variable. For example, pass/fail and accept/reject data are binary. Quality improvement practitioners often use binary data to record defective units.
Binary data are useful for calculating proportions or percentages, such as the proportion of defective products in a sample. You simply take the number of defective products and divide by the sample size. Hypothesis tests that assess proportions require binary data and allow you to use sample data to make inferences about the proportions of populations.
2 Proportions test to compare two samples
For our first example, we will make a decision based on the proportions of defective parts. Our goal is to determine whether the two methods produce different proportions of defective parts.
To make this determination, we’ll use the 2 Proportions test. For this test, the hypotheses are as follows:
- Null hypothesis: The proportions of defective parts for the two populations are equal.
- Alternative hypothesis: The proportions of defective parts for the two populations are different.
A p-value less than the significance level indicates that you can reject the null hypothesis. In this case, the sample provides sufficient evidence for concluding that the population proportions are different. The 2 Proportions output for our product is below.
Both p-values are less than 0.05. The output indicates that the difference between the proportion of defective parts for Method 1 (~0.062) and Method 2 (~0.146) is statistically significant. We can conclude that Method 1 produces defective parts less frequently.
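Here is a minimal Python sketch of a two-proportion z-test using only the standard library. The defect counts (8 and 19 out of 130 each) are back-calculated from the rounded proportions and the sample size mentioned in this post, so they are an assumption rather than the raw data. The sketch uses the pooled normal approximation, which may differ from the methods behind the two p-values above.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for equal proportions using the pooled estimate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Assumed counts: 8/130 is about 0.062 and 19/130 is about 0.146.
z, p_value = two_proportion_ztest(8, 130, 19, 130)
print(f"z = {z:.3f}, p = {p_value:.4f}")
```

With small counts, an exact method such as Fisher's exact test is often preferred over the normal approximation.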
1 Proportion test example: comparison to a target
The 1 Proportion test is also handy because you can compare a sample to a target value. Suppose you receive parts from a supplier who guarantees that less than 3% of all parts they produce are defective. You can use the 1 Proportion test to assess this claim.
First, collect a random sample of parts and determine how many are defective. Then, use the 1 Proportion test to compare your sample estimate to the target proportion of 0.03. Because we are interested in detecting only whether the population proportion is greater than 0.03, we’ll use a one-sided test. One-sided tests have greater power to detect differences in one direction, but no ability to detect differences in the other direction. Our one-sided 1 Proportion test has the following hypotheses:
- Null hypothesis: The proportion of defective parts for the population equals 0.03 or less.
- Alternative hypothesis: The proportion of defective parts for the population is greater than 0.03.
For this test, a significant p-value indicates that the supplier is in trouble! The sample provides sufficient evidence to conclude that the proportion of all parts from the supplier’s process is greater than 0.03 despite their assertions to the contrary.
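The one-sided test against the 0.03 target can be sketched as an exact binomial calculation in plain Python. The sample below (11 defective parts out of 200) is hypothetical; the post does not give data for this scenario.

```python
from math import comb


def binom_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0): the exact one-sided p-value."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


# Hypothetical sample: 11 defective parts out of 200 inspected.
defective, sample_size, target = 11, 200, 0.03
p_value = binom_upper_tail(defective, sample_size, target)

alpha = 0.05
print(f"observed proportion = {defective / sample_size:.3f}, p = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

The p-value is the probability of seeing at least this many defective parts if the true proportion were exactly 0.03.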
Comparing continuous data to binary data
Think back to the graphs for the continuous data. At a glance, you can see both the central location and spread of the data. If we added spec limits, we could see how many data points are close and far away from them. Is the process centered between the spec limits? Continuous data provide a lot of insight into our processes.
Now, compare that to the binary data that we used in the 2 Proportions test. All we learn from that data is the proportion of defects for Method 1 (0.062) and Method 2 (0.146). There is no distribution to analyze, no indication of how close the items are to the specs, and no indication of how they failed the inspection. We only know the two proportions.
Additionally, the sample sizes are much larger for the binary data than for the continuous data (130 vs. 30). When the difference between proportions is smaller, the required sample sizes can become quite large. Had we used a sample size of 30 like before, we almost certainly would not have detected this difference.
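The effect of sample size can be explored with a rough power calculation. The sketch below uses a normal approximation and treats the observed proportions (0.062 and 0.146) as if they were the true population values, which is an assumption made only for illustration. Dedicated power analysis software may use a different method and give somewhat different numbers.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test, n per group."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    shift = abs(p1 - p2) / se
    # Probability that the test statistic lands beyond either critical value.
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


for n in (30, 130):
    print(f"n = {n} per group: approximate power = {approx_power(0.062, 0.146, n):.2f}")
```

Comparing the two lines of output shows how much the chance of detecting this difference drops with only 30 items per group.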
In general, binary data provide less information than an equivalent amount of continuous data. If you can collect continuous data, it’s the better route to take!
If you have a binary outcome variable and at least one independent variable, consider performing Logistic Regression.
Related post: Estimating a Good Sample Size for Your Study Using Power Analysis
Poisson Hypothesis Tests for Count Data
Count data can have only non-negative integers (e.g., 0, 1, 2, etc.). In statistics, we often model count data using the Poisson distribution. Poisson data are a count of the presence of a characteristic, result, or activity over a constant amount of time, area, or other length of observation. For example, you can use count data to record the number of defects per item or defective units per batch. With Poisson data, you can assess a rate of occurrence.
For this scenario, we’ll assume that we receive shipments of parts from two different suppliers. Each supplier sends the parts in the same sized batch. We need to determine whether one supplier produces fewer defects per batch than the other supplier.
To perform this analysis, we’ll randomly sample batches of parts from both suppliers. The inspectors examine all parts in each batch and record the count of defective parts. We’ll randomly sample 30 batches from each supplier. Here is the CSV data file for this example: Count_Data_Example.
Performing the Two-Sample Poisson Rate Test
We’ll use the 2-Sample Poisson Rate test. For this test, the hypotheses are as follows:
- Null hypothesis: The rates of defective parts for the two populations are equal.
- Alternative hypothesis: The rates of defective parts for the two populations are different.
A p-value less than the significance level indicates that you can reject the null hypothesis because the sample provides sufficient evidence to conclude that the population rates are different. The 2-Sample Poisson Rate output for our product is below.
Both p-values are less than 0.05. The output indicates that the difference between the rate of defects per batch for Supplier 1 (3.56667) and Supplier 2 (5.36667) is statistically significant. We can conclude that Supplier 1 produces defects at a lower rate than Supplier 2.
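For readers who want to try a two-sample Poisson rate comparison in code, here is a plain Python sketch of a conditional exact test. The total defect counts (107 and 161) are back-calculated from the reported rates and 30 batches per supplier, so they are an assumption. This is one standard way to compare two Poisson rates and is not necessarily the calculation behind the output above.

```python
from math import comb


def poisson_two_sample_exact(c1, t1, c2, t2):
    """Two-sided conditional exact test of equal Poisson rates.

    c1, c2 are total counts; t1, t2 are exposures (here, numbers of batches).
    Under H0, c1 given c1 + c2 follows Binomial(c1 + c2, t1 / (t1 + t2)).
    """
    n = c1 + c2
    p0 = t1 / (t1 + t2)
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    lower = sum(pmf[: c1 + 1])  # P(X <= c1)
    upper = sum(pmf[c1:])       # P(X >= c1)
    return min(1.0, 2 * min(lower, upper))


# Assumed totals: 3.56667 * 30 is about 107 and 5.36667 * 30 is about 161.
p_value = poisson_two_sample_exact(107, 30, 161, 30)
print(f"rate 1 = {107 / 30:.3f}, rate 2 = {161 / 30:.3f}, p = {p_value:.5f}")
```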
Hypothesis tests are a great tool that allows you to take relatively small samples and draw conclusions about entire populations. There is a selection of tests available, and different options within the tests, which make them useful for a wide variety of situations.
If you have a count outcome variable and at least one independent variable, consider performing Poisson Regression.
To see an alternative approach to these traditional hypothesis testing methods, learn about bootstrapping in statistics!
Ash says
Ah that explains why I couldn’t find an R function that returned the same output! Thank you very much for your reply – I can stop looking now!
Ash says
Dear Jim,
Thank you for this article, I’ve learned a lot from reading it.
Your Two-Sample Poisson Rate Test example is very similar in structure to my data so I am trying to follow the same approach. The results pictured look like output from an R function – but I have been unable to find one that outputs results in this way. If these were indeed created by an R library/function, would you mind sharing which one you used, please?
Kind regards,
Ash
Jim Frost says
Hi Ash,
Sorry, I’m not using R. The results are from Minitab statistical software.
Daba says
Hello guys, what is the outcome variable in an independent samples t-test? Binary or not? Because it compares the means of two independent populations to see which is greater or lower.
Jim Frost says
Hi Daba,
The outcome variable for an independent sample t-test is continuous because you’re using it to calculate the means for two groups.
The binary variable is the categorical factor in the design. The binary variable defines the two groups for which you’re calculating the means. For example, your binary variable could be gender (male/female), experimental group (control/treatment), or material type (A/B). But the outcome variable is continuous so you can calculate the means for both groups and compare them. Click the link to learn more about the test.
Elisa says
The document can’t be found, is the link still working?
Jim Frost says
You’ll need to specify which link you’re talking about so I can check it. All links should be working.
Sameer Sippy says
Dear Jim,
Greetings!!! Very intuitive explanation. Liked the way you have explained with sufficient examples.
Jim based on Inferential Statistics, could you include an article on A/B Testing Methodology incorporating from basics like —Data Collection Process, Dataset Splitting Procedures & Duration for carrying out such experiments.
Also if you could incorporate illustrations from different industries viz. Healthcare, Manufacturing, Logistics, Quality, Ecommerce, Marketing, Advertisement Domains, this would indeed be useful.
Nowadays A/B Testing & Multivariate Testing is being incorporated & implemented in a robust manner across Data Science domain. Article or Write-up regarding this would immensely be useful.
Looking forward to a favourable and positive response.
Thomas C Omer says
The Poisson test example has an N of 30. I am wondering what the appropriate distribution is if the sample is lower than 30. Is it a t statistic or chi-square?
John says
Hi, great post! I have an expected and observed data set and want to do additional testing to see if they differ significantly from each other. Furthermore, I want to find the specific entries that contribute the most weight to that significance, or places that should have special attention. I did chi-square goodness of fit, but want to go further. Just to add, this is count data.
Jim Frost says
Hi John,
I’m not 100% sure what you want to do to go further. Because it’s count data, you could model it with the Poisson or Negative Binomial distribution. If you have other relevant variables, you can fit a Poisson or Negative Binomial regression model to explore relationships in your data. I talk a bit about those types of models in my post about choosing the correct type of regression model. You can also perform hypothesis tests designed for that type of data. The chi-squared test you performed seems like a good one to see how the expected and observed differs!
Grace Diki says
HI JIM
How do you do a independence test in Stata for a categorical variable with 6 levels and a binary variable.
Jim Frost says
Hi Grace,
I’m not a Stata user, but it sounds like you need to perform a chi-square test of independence.
jazytsax says
Hi Jim – thank you for this great site!
I have a situation where there is a reference standard (tells me if there is truly fat in a mass) and I have 2 different methods of detecting if there is (or is not) fat in the mass. My null hypothesis is that there is no difference in detection. I have a set of masses where I know if there is fat in the masses and used the 2 methods to detect whether they were able to detect the fat. Is the 2 proportions test the most appropriate for this question?
Thank you so much!
Habtu says
Thank you Jim for the wonderful post. It was clearly written and I enjoyed reading through it.
I have an additional query. I wanted to compare the variances of two methods of measurement applied at each observation point of a field survey. The variables from both methods have binary data type. How can I do the statistical test. Thank you in advance for your help.
Jim Frost says
Hi Habtu,
With binary data, you can’t compare variances. You can compare proportions using a proportions test. I discuss these tests in the binary section of this post. To read an example of a 2-sample proportions test, read my post about flu shot effectiveness. In it, I use 2-proportions tests to evaluate real flu study data. Or read my post about Mythbusters test about whether yawns are contagious, where I use a 2-proportions test. That way you can see what these tests can do. I cover them, and many other tests, in much more detail in my Hypothesis Testing book!
I hope this helps!
Jack says
Hi Jim. Just wanted to follow up and see if you’ve had a chance to review this question yet?
Jim Frost says
Hi Jack, thanks for the reminder! I just replied!
Rod Pedersen says
Hi Jim , a green belt has a project on flu vaccinations , with 5 data points, % vaccination rates per year averaging about 36% of staff numbers. Her project was to increase vaccination rates this year , and has accumulated a lot of data points to measure vaccination rates in different office areas as a percent of total staff numbers which have almost doubled. Should she use 2 sample t test to measure difference in means between before and after data ( continuous) or should she use 2 sample test for proportions (attribute). There is small sample size for before data and large sample size for after data
Jim Frost says
Hi Rod,
I see two possibilities. The choice depends on whether she measured the same sites in the before and after. If they’re different sites before and after, she has independent groups and can use the regular 2-sample proportions test.
If they’re the same sites before and after, she can use either the test of marginal homogeneity or McNemar’s test. I have not used these tests myself and cannot provide more information. However, if she used the same sites before and after, she has proportions data for dependent groups (same groups) and should not use the regular 2-sample proportions test. These two tests can handle proportions for dependent groups.
I hope this helps!
Jack says
Hi Jim. Would I be able to use the 2 Proportion Test for comparing 2 proportions from different time periods? Example scenario: I run a satisfaction survey on a MOOC site during Q1 to a random sample of visitors and find that 80% of them were satisfied with their experience. The following quarter I run the same survey and find that 75% were satisfied. Is the 5 percentage point drop statistically significant or just due to random noise?
Jim Frost says
Hi Jack,
Sorry about the delay in replying! Sometimes comments slip through the cracks!
Yes, you can do as you suggest assuming the respondents are different in the two quarters and assuming that the data are binary (satisfied/not satisfied). The 2 proportions test is designed for independent groups and binary data.
I hope that helps even belatedly!
Ihshan Gumilar says
Thanks…Let me see that document
Ihshan Gumilar says
Hi,
I would like to ask about 2 sample poisson rate.
How do you calculate the 95% CI and test for difference ?
Your answer is really appreciated. Thank you so much for giving this tutorial.
Best,
Jim Frost says
Hi Ihshan,
This document describes the calculations. I hope it helps!
Shabahat says
Thank you so much for your kind support. Esteemed regards.
Shabahat says
Thanks for your helpful comments. Basically I have developed my research model based on competing theories. For example, I have one IV (Exploitation) and Two DV’s (Incremental innovation and Radical Innovation). Each variable in the model has own indicators. Some researchers claims that exploitation support only incremental innovations. On the other hand there are also studies that claims that in-depth exploitation also support radical innovation. However, these researchers claim that exploitation support radical innovations in a limited capacity as compared to incremental innovation. ON the basis of these competing theories I developed my hypothesis as:
Exploitation significantly and positively influences both incremental and radical innovation, however exploitation influence incremental innovation more than radical innovation.
Thank you very much for your quick response. Its really helpful.
Jim Frost says
Hi Shabahat,
Thank you for the additional information. I messed up one thing in my previous reply to you. Because you only have the one IV, you don’t need to worry about standardizing that IV for comparability. However, assuming the DVs use different units of measurement, that’ll make comparisons problematic. Consequently, you probably need to standardize the two dependent variables instead. You’d learn how a one unit change in the IV relates to the standard deviation of each DV. That puts the DVs on the same scale. See how other researchers have handled this situation in your field to be sure that’s an accepted approach.
Additionally, if you find a significant relationship between exploitation and radical innovation, then you’d have good evidence to support that claim you’re making.
Shabahat says
Hi Jim,
Its really amazing. However I have a query regarding my analysis. I have one independent variable (Continuous) and Two dependent variables (Continuous). In the linear regression, the Independent variable significantly explains both dependent variables. Problem: Now i want to compare the effect of my Independent variable on both dependent variables. How can I compare?. If the effect is different, how can I test whether the effect difference is statistically significant or not in SPSS.
Jim Frost says
Hi Shabahat,
The fact that you’re talking about different DVs complicates things because they’re presumably measuring different things and using different units, which makes comparison difficult. The standardized coefficients can help you get around that but it changes the interpretation of the results.
Assuming that the two DV variables measure different characteristics, you might try standardizing your IVs and fitting the model using the standardized values. This process produces standardized coefficients, which use the same units and allows you to compare–although it does change the interpretation of the coefficients. I write about this in my post about assessing the importance of your predictors. You can also look at the CIs for the standardized coefficients and see if they overlap. If they don’t overlap, you know the difference between the standardized coefficients is statistically significant.
Fernando says
Great Post!
Can we test proportions from a continuous variable with unknown distribution using the Poisson distribution, using a cut-off value for good and bad samples and counting them?
Sanjana Mukherjee says
Hi Jim,
I really enjoy reading your posts and they have cleared many stat concepts!
I had a question about the chi square probability distribution. Although it is a non-parametric test, why does it fall into a continuous probability distribution and why can we use the chi square distribution for categorical data if it’s a continuous probability distribution?
Jim Frost says
Hi Sanjana,
That’s a great question! I’m glad you’re thinking about the types of variables and distributions, and how they’re used together.
You’re correct on both counts. The chi-squared test of independence is nonparametric because it doesn’t assume a particular data distribution. Additionally, analysts use it to test the independence of categorical variables. There are other ways to use this distribution as well.
Now, onto why we use chi-square (a distribution for continuous data) with categorical variables! Yes, it involves categorical variables, but the analysis assesses the observed and expected counts of these variables. For each cell, the analysis takes the squared difference between the observed count and the expected count and then divides that by the expected count. These values are summed across all cells to produce the chi-square value. This process produces a continuous variable that is based on the differences between the observed and expected counts of the categorical variables. When the value of this variable is large enough, we know that the difference between the observed counts and the expected counts is large enough to be unlikely due to chance. And, that’s why we use a continuous distribution to analyze categorical variables.
Amanda says
This is very helpful! Thank you!
Amanda says
Hi Jim, Great post. I was wondering, do you know of any references that discuss the difference in sample size between binary and continuous data? I am looking for a reference to cite in a journal article.
Thanks,
Amanda.
Jim Frost says
Hi Amanda,
The article I cite below discusses the different sample sizes in terms of observations per model term in order to avoid overfitting your model. I also cover these ideas in my post about how to avoid overfitting regression models. For regression models, this provides a good context for sample size requirements.
Babyak, MA., What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models, Psychosomatic Medicine 66:411-421 (2004).
I hope this helps!
Sarika says
Hi Jim ,
I am totally new to statistics,
Following a small sample from my dataset.
Views PosEmo NegEmo
1650077 2.63 1.27
753826 2.39 0.47
926647 1.71 1.02
Views = Dependent Continuous Variable
PosEmo = Independent Continuous Variable
NegEmo = Independent Continuous Variable
My query :
1. How to run Hypothesis testing on same, Im pretty confused what to use , what to do , I am using SPSS modeler and SPSS statistics tool.
2.I think Multiple Regression is Ok for this . Let me know how to use it in SPSS modeler or stats tool.
Regards
Sarika
Jim Frost says
Hi Sarika, yes, it sounds like you can use multiple regression for those data. The hypothesis test in this case would be the p-values for the regression coefficients. Click that link to learn more about that. In your stats software, choose multiple linear regression and then specify the dependent variable and the two independent variables. Fit the model and then check the statistical output and the residual plots to see if you have a good model. Be sure to check out my regression tutorial too. That covers many aspects of regression analysis.
Jules says
Thanks for your sharing!
In the binary case (or proportion case), is there any comparison between “two proportion test” and “Chi-square” test? Is there any guideline to choose which test to use?
Jim Frost says
Hi Jules,
You’re welcome! Regarding your question, a two proportion test requires one categorical variable with two levels. For example, the variable could be “test result” and the two levels are “pass” and “fail.”
A chi-square test of independence requires at least two categorical variables. Those variables can have two or more levels. You can read an example of the chi-square test of independence that I’ve written about. The example is based on the original Star Trek TV series and determines whether the uniform color affects the fatality rate. That analysis has two categorical variables–fatalities and uniform color. Fatalities has two levels that indicate whether a crewmember survived or died. Uniform color has three levels–gold, blue, and red.
As you can see, the data requirements for the two tests are different.
I hope this helps!
Jim
Summi says
Tysm!!
MS says
Great post. Thanks for sharing your expertise.
Jim Frost says
Thank you! I’m glad it was helpful.
Manali Teli says
Very nice article. Could you explain more on hypothesis testing on median?
Jim Frost says
Thank you! For more information about testing the median, click the link in the article for where I compare parametric vs nonparametric analyses.
NARAYANASWAMY AUDINARAYANA says
Please let me know when one can use Probit Analysis. May I know the Procedure in SPSS.