In a previous blog post, I introduced the basic concepts of hypothesis testing and explained the need for performing these tests. In this post, I’ll build on that and compare various types of hypothesis tests that you can use with different types of data, explore some of the options, and explain how to interpret the results. Along the way, I’ll point out important planning considerations, related analyses, and pitfalls to avoid.

A hypothesis test uses sample data to assess two mutually exclusive theories about the properties of a population. Hypothesis tests allow you to use a manageable-sized sample from the process to draw inferences about the entire population.

I’ll cover common hypothesis tests for three types of data—continuous, binary, and count data. Recognizing the different types of data is crucial because the type of data determines the hypothesis tests you can perform and, critically, the nature of the conclusions that you can draw. If you collect the wrong data, you might not be able to get the answers that you need.

**Related post**: Guide to Data Types and How to Graph Them

## Hypothesis Tests for Continuous Data

Continuous data can take on any numeric value, and it can be meaningfully divided into smaller increments, including fractional and decimal values. There are an infinite number of possible values between any two values. You often measure a continuous variable on a scale. For example, when you measure height, weight, and temperature, you have continuous data. With continuous variables, you can use hypothesis tests to assess the mean, median, and standard deviation.

When you collect continuous data, you usually get more bang for your data buck compared to discrete data. The two key advantages of continuous data are that you can:

- Draw conclusions with a smaller sample size.
- Use a wider variety of analyses, which allows you to learn more.

I’ll cover two of the more common hypothesis tests that you can use with continuous data—t-tests to assess means and variance tests to evaluate dispersion around the mean. Both of these tests come in one-sample and two-sample versions. One-sample tests allow you to compare your sample estimate to a target value. The two-sample tests let you compare the samples to each other. I’ll cover examples of both types.

There is also a group of tests that assess the median rather than the mean. These are known as nonparametric tests and practitioners use them less frequently. However, consider using a nonparametric test if your data are highly skewed and the median better represents the actual center of your data than the mean.

**Related posts**: Nonparametric vs. Parametric Tests and Determining Which Measure of Central Tendency is Best for Your Data

### Graphing the data for the example scenario

Suppose we have two production methods and our goal is to determine which one produces a stronger product. To evaluate the two methods, we draw a random sample of 30 products from each production line and measure the strength of each unit. Before performing any analyses, it’s always a good idea to graph the data because it provides an excellent overview. Here is the CSV data file in case you want to follow along: Continuous_Data_Examples.

These histograms suggest that Method 2 produces a higher mean strength while Method 1 produces more consistent strength scores. The higher mean strength is good for our product, but the greater variability might produce more defects.

Graphs provide a good picture, but they do not test the data statistically. The differences in the graphs might be caused by random sample error rather than an actual difference between production methods. If the observed differences are due to random error, it would not be surprising if another sample showed different patterns. It can be a costly mistake to base decisions on “results” that vary with each sample. Hypothesis tests factor in random error to improve our chances of making correct decisions.

Keep these graphs in mind when we look at binary data because they illustrate how much more information continuous data convey.

**Related posts**: Using Histograms to Understand Your Data and How Hypothesis Tests Work: Significance Levels and P-values

### Two-sample t-test to compare means

The first thing we want to determine is whether one of the methods produces stronger products. We’ll use a two-sample t-test to determine whether the population means are different. The hypotheses for our 2-sample t-test are:

- **Null hypothesis:** The mean strengths for the two populations are equal.
- **Alternative hypothesis:** The mean strengths for the two populations are different.

A p-value less than the significance level indicates that you can reject the null hypothesis. In other words, the sample provides sufficient evidence to conclude that the population means are different. Below is the output for the analysis.

The p-value (0.034) is less than 0.05. From the output, we can see that the difference between the mean of Method 2 (98.39) and Method 1 (95.39) is statistically significant. We can conclude that Method 2 produces a stronger product on average.

That sounds great, and it appears that we should use Method 2 to manufacture a stronger product. However, there are other considerations. The t-test tells us that Method 2’s mean strength is greater than Method 1’s, but it says nothing about the variability of strength values. For that, we need to use another test.
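To make the mechanics of the two-sample t-test concrete, here is a minimal sketch that computes the t-statistic and degrees of freedom using only Python's standard library. It uses Welch's version of the test, which does not assume equal variances; that choice is an assumption, because the post does not say which version produced the output. The strength values are hypothetical, not the contents of the CSV file, and `welch_t` is an illustrative helper name.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # statistics.variance uses the n - 1 (sample) denominator.
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Hypothetical strength measurements (NOT the post's data).
method_1 = [94.1, 96.3, 95.0, 95.8, 94.7, 96.0, 95.2, 95.5]
method_2 = [99.5, 93.8, 101.2, 97.0, 103.4, 95.1, 100.3, 98.6]

t_stat, df = welch_t(method_1, method_2)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```

Converting the t-statistic to a p-value requires the t-distribution, which the standard library lacks. If SciPy is available, `scipy.stats.ttest_ind(method_1, method_2, equal_var=False)` returns both the statistic and the two-sided p-value.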

**Related posts**: How T-Tests Work and How to Interpret P-values Correctly and Step-by-Step Instructions for How to Do t-Tests in Excel.

### 2-Variances test to compare variability

A production method that has excessive variability creates too many defects. Consequently, we will also assess the standard deviations of both methods. To determine whether either method produces greater variability in the product’s strength, we’ll use the 2 Variances test. The hypotheses for our 2 Variances test are:

- **Null hypothesis:** The standard deviations for the populations are equal.
- **Alternative hypothesis:** The standard deviations for the populations are different.

A p-value less than the significance level indicates that you can reject the null hypothesis. In other words, the sample provides sufficient evidence for concluding that the population standard deviations are different. The 2-Variances output for our product is below.

Both of the p-values are less than 0.05. The output indicates that the variability of Method 1 is significantly less than that of Method 2. We can conclude that Method 1 produces a more consistent product.
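One classic way to compare two variances is the F-test, which takes the ratio of the two sample variances. The sketch below computes that ratio and its degrees of freedom. It is not necessarily the method behind the output discussed above, and the F-test is known to be sensitive to nonnormal data. The data are hypothetical and `variance_ratio` is an illustrative name.

```python
from statistics import variance


def variance_ratio(sample_a, sample_b):
    """Return (F ratio, numerator df, denominator df) for two samples."""
    f_ratio = variance(sample_a) / variance(sample_b)
    return f_ratio, len(sample_a) - 1, len(sample_b) - 1


# Hypothetical strength measurements (NOT the post's data).
method_1 = [94.1, 96.3, 95.0, 95.8, 94.7, 96.0, 95.2, 95.5]
method_2 = [99.5, 93.8, 101.2, 97.0, 103.4, 95.1, 100.3, 98.6]

f_ratio, df_num, df_den = variance_ratio(method_1, method_2)
print(f"F = {f_ratio:.3f} with ({df_num}, {df_den}) degrees of freedom")
```

A ratio far from 1 suggests unequal variances. Turning it into a p-value needs the F-distribution, for example from SciPy's `scipy.stats.f`.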

**Related post**: Measures of Variability

### What we learned and did not learn with the hypothesis tests

The hypothesis test results confirm the patterns in the graphs. Method 2 produces stronger products on average while Method 1 produces a more consistent product. The statistically significant test results indicate that these results are likely to represent actual differences between the production methods rather than sampling error.

Our example also illustrates how you can assess different properties using continuous data, which can point towards different decisions. We might want the stronger products of Method 2 but the greater consistency of Method 1. To navigate this dilemma, we’ll need to use our process knowledge.

Finally, it’s crucial to note that the tests produce estimates of population parameters—the population means (μ) and the population standard deviations (σ). While these parameters can help us make decisions, they tell us little about where individual values are likely to fall. In certain circumstances, knowing the proportion of values that fall within specified intervals is crucial.

For the examples, the products must fall within spec limits. Even when the mean falls within the spec limit, it’s possible that too many individual items will fall outside the spec limits if the variability is too high.
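The sketch below illustrates that point. It estimates the fraction of units outside the spec limits for each method, assuming strength follows a normal distribution. The means come from the t-test results above, but the standard deviations and spec limits are hypothetical values chosen for illustration, as is the helper name `fraction_out_of_spec`.

```python
from statistics import NormalDist


def fraction_out_of_spec(mean, sd, lower, upper):
    """Fraction of a normal population falling outside [lower, upper]."""
    dist = NormalDist(mean, sd)
    return dist.cdf(lower) + (1 - dist.cdf(upper))


# Hypothetical spec limits and standard deviations (assumptions).
LOWER, UPPER = 85.0, 110.0
method_1_out = fraction_out_of_spec(95.39, 3.0, LOWER, UPPER)
method_2_out = fraction_out_of_spec(98.39, 7.0, LOWER, UPPER)

print(f"Method 1: {method_1_out:.2%} out of spec")
print(f"Method 2: {method_2_out:.2%} out of spec")
```

Under these assumed values, both means sit comfortably inside the limits, yet the more variable method sends far more individual units outside them.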

### Other types of analyses

To better understand the distribution of individual values rather than the population parameters, use the following analyses:

**Tolerance intervals**: A tolerance interval is a range that likely contains a specific proportion of a population. For our example, we might want to know the range where 99% of the population falls for each production method. We can compare the tolerance interval to our requirements to determine whether there is too much variability.

**Capability analysis**: This type of analysis uses sample data to determine how effectively a process produces output with characteristics that fall within the spec limits. These tools incorporate both the mean and spread of your data to estimate the proportion of defects.

**Related post**: Confidence Intervals vs. Prediction Intervals vs. Tolerance Intervals

## Proportion Hypothesis Tests for Binary Data

Let’s switch gears and move away from continuous data. Suppose we take another random sample of our product from each of the production lines. However, instead of measuring a characteristic, inspectors evaluate each product and either accept or reject it.

Binary data can have only two values. If you can place an observation into only two categories, you have a binary variable. For example, pass/fail and accept/reject data are binary. Quality improvement practitioners often use binary data to record defective units.

Binary data are useful for calculating proportions or percentages, such as the proportion of defective products in a sample. You simply take the number of defective products and divide by the sample size. Hypothesis tests that assess proportions require binary data and allow you to use sample data to make inferences about the proportions of populations.

### 2 Proportions test to compare two samples

For our first example, we will make a decision based on the proportions of defective parts. Our goal is to determine whether the two methods produce different proportions of defective parts.

To make this determination, we’ll use the 2 Proportions test. For this test, the hypotheses are as follows:

- **Null hypothesis:** The proportions of defective parts for the two populations are equal.
- **Alternative hypothesis:** The proportions of defective parts for the two populations are different.

A p-value less than the significance level indicates that you can reject the null hypothesis. In this case, the sample provides sufficient evidence for concluding that the population proportions are different. The 2 Proportions output for our product is below.

Both p-values are less than 0.05. The output indicates that the difference between the proportion of defective parts for Method 1 (~0.062) and Method 2 (~0.146) is statistically significant. We can conclude that Method 1 produces defective parts less frequently.
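For readers who want to see the arithmetic, here is a minimal sketch of a two-proportion z-test based on the normal approximation with a pooled standard error. The defect counts (8 and 19 out of 130) are not stated in the post; they are inferred from the rounded proportions and the sample size of 130 per method, so treat them as an assumption. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equal proportions (normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)  # proportion under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Counts inferred from ~0.062 and ~0.146 with n = 130 (assumption).
z, p_value = two_proportion_z_test(8, 130, 19, 130)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

With these counts, the z-value is roughly -2.24 and the two-sided p-value is roughly 0.025, which agrees with the conclusion that the difference is statistically significant at the 0.05 level.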

### 1 Proportion test example: comparison to a target

The 1 Proportion test is also handy because you can compare a sample to a target value. Suppose you receive parts from a supplier who guarantees that less than 3% of all parts they produce are defective. You can use the 1 Proportion test to assess this claim.

First, collect a random sample of parts and determine how many are defective. Then, use the 1 Proportion test to compare your sample estimate to the target proportion of 0.03. Because we are interested in detecting only whether the population proportion is greater than 0.03, we’ll use a one-sided test. One-sided tests have greater power to detect differences in one direction, but no ability to detect differences in the other direction. Our one-sided 1 Proportion test has the following hypotheses:

- **Null hypothesis:** The proportion of defective parts for the population equals 0.03 or less.
- **Alternative hypothesis:** The proportion of defective parts for the population is greater than 0.03.

For this test, a significant p-value indicates that the supplier is in trouble! The sample provides sufficient evidence to conclude that the proportion of defective parts from the supplier’s process is greater than 0.03 despite their assertions to the contrary.
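A one-sided test like this can be sketched with an exact binomial calculation: the p-value is the probability of seeing at least the observed number of defects if the true proportion were 0.03. The sample below (11 defects in 200 parts) is hypothetical, and `binomial_upper_tail` is an illustrative name.

```python
from math import comb


def binomial_upper_tail(defects, n, p0):
    """P(X >= defects) when X ~ Binomial(n, p0): the one-sided p-value."""
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k)
        for k in range(defects, n + 1)
    )


# Hypothetical inspection result: 11 defective parts in a sample of 200.
TARGET = 0.03
p_value = binomial_upper_tail(11, 200, TARGET)
print(f"one-sided p-value = {p_value:.4f}")
```

Because the test is one-sided, a sample with very few defects never produces a small p-value; the function only accumulates probability in the upper tail.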

### Comparing continuous data to binary data

Think back to the graphs for the continuous data. At a glance, you can see both the central location and spread of the data. If we added spec limits, we could see how many data points are close and far away from them. Is the process centered between the spec limits? Continuous data provide a lot of insight into our processes.

Now, compare that to the binary data that we used in the 2 Proportions test. All we learn from that data is the proportion of defects for Method 1 (0.062) and Method 2 (0.146). There is no distribution to analyze, no indication of how close the items are to the specs, and no indication of how they failed the inspection. We only know the two proportions.

Additionally, the sample sizes are much larger for the binary data than the continuous data (130 vs. 30). When the difference between proportions is smaller, the required sample sizes can become quite large. Had we used a sample size of 30 like before, we almost certainly would not have detected this difference.
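A rough power calculation shows why the larger sample matters. The sketch below uses a normal approximation and assumes the true population proportions equal the sample estimates (0.062 and 0.146), which is an assumption made purely for illustration. The function name `approx_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion test, n per group."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return NormalDist().cdf(abs(p1 - p2) / se - z_crit)


power_30 = approx_power(0.062, 0.146, 30)
power_130 = approx_power(0.062, 0.146, 130)
print(f"n = 30:  power = {power_30:.2f}")
print(f"n = 130: power = {power_130:.2f}")
```

Under these assumptions, the power is about 0.19 with 30 units per method and about 0.61 with 130, so a sample of 30 would miss a difference of this size most of the time.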

In general, binary data provide less information than an equivalent amount of continuous data. If you can collect continuous data, it’s the better route to take!

**Related post**: Estimating a Good Sample Size for Your Study Using Power Analysis

## Poisson Hypothesis Tests for Count Data

Count data can have only non-negative integers (e.g., 0, 1, 2, etc.). In statistics, we often model count data using the Poisson distribution. Poisson data are a count of the presence of a characteristic, result, or activity over a constant amount of time, area, or other length of observation. For example, you can use count data to record the number of defects per item or defective units per batch. With Poisson data, you can assess a rate of occurrence.

For this scenario, we’ll assume that we receive shipments of parts from two different suppliers. Each supplier sends the parts in the same sized batch. We need to determine whether one supplier produces fewer defects per batch than the other supplier.

To perform this analysis, we’ll randomly sample batches of parts from both suppliers. The inspectors examine all parts in each batch and record the count of defective parts. We’ll randomly sample 30 batches from each supplier. Here is the CSV data file for this example: Count_Data_Example.

### Performing the Two-Sample Poisson Rate Test

We’ll use the 2-Sample Poisson Rate test. For this test, the hypotheses are as follows:

- **Null hypothesis:** The rates of defective parts for the two populations are equal.
- **Alternative hypothesis:** The rates of defective parts for the two populations are different.

A p-value less than the significance level indicates that you can reject the null hypothesis because the sample provides sufficient evidence to conclude that the population rates are different. The 2-Sample Poisson Rate output for our product is below.

Both p-values are less than 0.05. The output indicates that the difference between the rate of defects per batch for Supplier 1 (3.56667) and Supplier 2 (5.36667) is statistically significant. We can conclude that Supplier 1 produces defects at a lower rate than Supplier 2.
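Here is a minimal sketch of two common ways to test the difference between two Poisson rates: a normal approximation and an exact conditional test (given the combined count, the first supplier's count follows a binomial distribution under the null hypothesis). Whether these match the two methods in the output is an assumption. The total counts of 107 and 161 are inferred from the reported rates multiplied by 30 batches, and the function name is illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist


def poisson_rate_tests(count_1, units_1, count_2, units_2):
    """Return (z, normal-approximation p-value, exact conditional p-value)."""
    rate_1, rate_2 = count_1 / units_1, count_2 / units_2
    pooled = (count_1 + count_2) / (units_1 + units_2)
    se = sqrt(pooled * (1 / units_1 + 1 / units_2))
    z = (rate_1 - rate_2) / se
    p_normal = 2 * NormalDist().cdf(-abs(z))

    # Exact test: under H0, count_1 ~ Binomial(total, share of units).
    total = count_1 + count_2
    share = units_1 / (units_1 + units_2)
    pmf = [
        comb(total, k) * share ** k * (1 - share) ** (total - k)
        for k in range(total + 1)
    ]
    lower = sum(pmf[: count_1 + 1])
    upper = sum(pmf[count_1:])
    p_exact = min(1.0, 2 * min(lower, upper))
    return z, p_normal, p_exact


# Totals inferred from 3.56667 and 5.36667 defects per batch x 30 batches.
z, p_normal, p_exact = poisson_rate_tests(107, 30, 161, 30)
print(f"z = {z:.2f}, normal p = {p_normal:.4f}, exact p = {p_exact:.4f}")
```

With these totals, the z-value is roughly -3.30 and both p-values fall well below 0.05, in line with the conclusion above.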

Hypothesis tests are great tools that allow you to take relatively small samples and draw conclusions about entire populations. There is a selection of tests available, and different options within the tests, which make them useful for a wide variety of situations.

To see an alternative approach to these traditional hypothesis testing methods, learn about bootstrapping in statistics!

John says

Hi, great post! I have an expected and observed data set and want to do additional testing to see if they differ significantly from each other. Furthermore, I want to identify the specific entries that contribute the most weight to that significance, or the places that deserve special attention. I did a chi-square goodness-of-fit test, but want to go further. Just to add, this is count data.

Jim Frost says

Hi John,

I’m not 100% sure what you want to do to go further. Because it’s count data, you could model it with the Poisson or Negative Binomial distribution. If you have other relevant variables, you can fit a Poisson or Negative Binomial regression model to explore relationships in your data. I talk a bit about those types of models in my post about choosing the correct type of regression model. You can also perform hypothesis tests designed for that type of data. The chi-squared test you performed seems like a good one to see how the expected and observed counts differ!

Grace Diki says

HI JIM

How do you do an independence test in Stata for a categorical variable with 6 levels and a binary variable?

Jim Frost says

Hi Grace,

I’m not a Stata user, but it sounds like you need to perform a chi-square test of independence.

jazytsax says

Hi Jim – thank you for this great site!

I have a situation where there is a reference standard (tells me if there is truly fat in a mass) and I have 2 different methods of detecting if there is (or is not) fat in the mass. My null hypothesis is that there is no difference in detection. I have a set of masses where I know if there is fat in the masses and used the 2 methods to detect whether they were able to detect the fat. Is the 2 proportions test the most appropriate for this question?

Thank you so much!

Habtu says

Thank you Jim for the wonderful post. It was clearly written and I enjoyed reading through it.

I have an additional query. I want to compare the variances of two methods of measurement applied at each observation point of a field survey. The variables from both methods are binary. How can I do the statistical test? Thank you in advance for your help.

Jim Frost says

Hi Habtu,

With binary data, you can’t compare variances. You can compare proportions using a proportions test. I discuss these tests in the binary section of this post. To read an example of a 2-sample proportions test, read my post about flu shot effectiveness. In it, I use 2-proportions tests to evaluate real flu study data. Or read my post about Mythbusters test about whether yawns are contagious, where I use a 2-proportions test. That way you can see what these tests can do. I cover them, and many other tests, in much more detail in my Hypothesis Testing book!

I hope this helps!

Jack says

Hi Jim. Just wanted to follow up and see if you’ve had a chance to review this question yet?

Jim Frost says

Hi Jack, thanks for the reminder! I just replied!

Rod Pedersen says

Hi Jim, a green belt has a project on flu vaccinations with 5 data points: % vaccination rates per year, averaging about 36% of staff numbers. Her project was to increase vaccination rates this year, and she has accumulated a lot of data points to measure vaccination rates in different office areas as a percent of total staff numbers, which have almost doubled. Should she use a 2-sample t-test to measure the difference in means between the before and after data (continuous), or should she use a 2-sample test for proportions (attribute)? There is a small sample size for the before data and a large sample size for the after data.

Jim Frost says

Hi Rod,

I see two possibilities. The choice depends on whether she measured the same sites in the before and after. If they’re different sites before and after, she has independent groups and can use the regular 2-sample proportions test.

If they’re the same sites before and after, she can use either the test of marginal homogeneity or McNemar’s test. I have not used these tests myself and cannot provide more information. However, if she used the same sites before and after, she has proportions data for dependent groups (same groups) and should not use the regular 2-sample proportions test. These two tests can handle proportions for dependent groups.

I hope this helps!

Jack says

Hi Jim. Would I be able to use the 2 Proportion Test for comparing 2 proportions from different time periods? Example scenario: I run a satisfaction survey on a MOOC site during Q1 to a random sample of visitors and find that 80% of them were satisfied with their experience. The following quarter I run the same survey and find that 75% were satisfied. Is the 5 percentage point drop statistically significant or just due to random noise?

Jim Frost says

Hi Jack,

Sorry about the delay in replying! Sometimes comments slip through the cracks!

Yes, you can do as you suggest assuming the respondents are different in the two quarters and assuming that the data are binary (satisfied/not satisfied). The 2 proportions test is designed for independent groups and binary data.

I hope that helps even belatedly!

Ihshan Gumilar says

Thanks…Let me see that document

Ihshan Gumilar says

Hi,

I would like to ask about 2 sample poisson rate.

How do you calculate the 95% CI and test for difference ?

Your answer is really appreciated. Thank you so much for giving this tutorial.

Best,

Jim Frost says

Hi Ihshan,

This document describes the calculations. I hope it helps!

Shabahat says

Thank you so much for your kind support. Esteemed regards.

Shabahat says

Thanks for your helpful comments. Basically, I have developed my research model based on competing theories. For example, I have one IV (Exploitation) and two DVs (Incremental Innovation and Radical Innovation). Each variable in the model has its own indicators. Some researchers claim that exploitation supports only incremental innovation. On the other hand, there are also studies that claim that in-depth exploitation also supports radical innovation. However, these researchers claim that exploitation supports radical innovation in a limited capacity compared to incremental innovation. On the basis of these competing theories, I developed my hypothesis as:

Exploitation significantly and positively influences both incremental and radical innovation; however, exploitation influences incremental innovation more than radical innovation.

Thank you very much for your quick response. It’s really helpful.

Jim Frost says

Hi Shabahat,

Thank you for the additional information. I messed up one thing in my previous reply to you. Because you only have the one IV, you don’t need to worry about standardizing that IV for comparability. However, assuming the DVs use different units of measurement, that’ll make comparisons problematic. Consequently, you probably need to standardize the two dependent variables instead. You’d learn how a one unit change in the IV relates to the standard deviation of each DV. That puts the DVs on the same scale. See how other researchers have handled this situation in your field to be sure that’s an accepted approach.

Additionally, if you find a significant relationship between exploitation and radical innovation, then you’d have good evidence to support that claim you’re making.

Shabahat says

Hi Jim,

It’s really amazing. However, I have a query regarding my analysis. I have one independent variable (continuous) and two dependent variables (continuous). In the linear regression, the independent variable significantly explains both dependent variables. Problem: Now I want to compare the effect of my independent variable on both dependent variables. How can I compare them? If the effect is different, how can I test whether the difference is statistically significant in SPSS?

Jim Frost says

Hi Shabahat,

The fact that you’re talking about different DVs complicates things because they’re presumably measuring different things and using different units, which makes comparison difficult. The standardized coefficients can help you get around that but it changes the interpretation of the results.

Assuming that the two DVs measure different characteristics, you might try standardizing your IVs and fitting the model using the standardized values. This process produces standardized coefficients, which use the same units and allow you to compare them, although it does change the interpretation of the coefficients. I write about this in my post about assessing the importance of your predictors. You can also look at the CIs for the standardized coefficients and see if they overlap. If they don’t overlap, you know the difference between the standardized coefficients is statistically significant.

Fernando says

Great Post!

Can we test proportions from a continuous variable with an unknown distribution using the Poisson distribution, by applying a cut-off value for good and bad samples and counting them?

Sanjana Mukherjee says

Hi Jim,

I really enjoy reading your posts and they have cleared many stat concepts!

I had a question about the chi square probability distribution. Although it is a non-parametric test, why does it fall into a continuous probability distribution and why can we use the chi square distribution for categorical data if it’s a continuous probability distribution?

Jim Frost says

Hi Sanjana,

That’s a great question! I’m glad you’re thinking about the types of variables and distributions, and how they’re used together.

You’re correct on both counts. The chi-squared test of independence is nonparametric because it doesn’t assume a particular data distribution. Additionally, analysts use it to test the independence of categorical variables. There are other ways to use this distribution as well.

Now, onto why we use chi-square (a distribution for continuous data) with categorical variables! Yes, it involves categorical variables, but the analysis assesses the observed and expected counts of these variables. For each cell, the analysis takes the squared difference between the observed count and the expected count and then divides that by the expected count. These values are summed across all cells to produce the chi-square value. This process produces a continuous variable that is based on the differences between the observed and expected counts of the categorical variables. When the value of this variable is large enough, we know that the difference between the observed counts and the expected counts is large enough to be unlikely due to chance. And, that’s why we use a continuous distribution to analyze categorical variables.
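The calculation described above can be sketched in a few lines of Python. The contingency table is hypothetical, and `chi_square_statistic` is an illustrative name; expected counts are derived from the row and column totals, as in a test of independence.

```python
def chi_square_statistic(observed):
    """Return (chi-square statistic, degrees of freedom) for a 2-D table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df


# Hypothetical counts: rows are one variable's levels, columns the other's.
table = [[10, 20], [20, 10]]
chi2, df = chi_square_statistic(table)
print(f"chi-square = {chi2:.3f}, df = {df}")
```

Each cell here has an expected count of 15, so every cell contributes 25/15 to the total. The counts are whole numbers, but the statistic built from them can take any non-negative value.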

Amanda says

This is very helpful! Thank you!

Amanda says

Hi Jim, Great post. I was wondering, do you know of any references that discuss the difference in sample size between binary and continuous data? I am looking for a reference to cite in a journal article.

Thanks,

Amanda.

Jim Frost says

Hi Amanda,

The article I cite below discusses the different sample sizes in terms of observations per model term in order to avoid overfitting your model. I also cover these ideas in my post about how to avoid overfitting regression models. For regression models, this provides a good context for sample size requirements.

Babyak, MA., What You See May Not Be What You Get: A Brief, Nontechnical Introduction to Overfitting in Regression-Type Models, Psychosomatic Medicine 66:411-421 (2004).

I hope this helps!

Sarika says

Hi Jim ,

I am totally new to statistics,

Following is a small sample from my dataset.

| Views | PosEmo | NegEmo |
|---|---|---|
| 1650077 | 2.63 | 1.27 |
| 753826 | 2.39 | 0.47 |
| 926647 | 1.71 | 1.02 |

Views = Dependent continuous variable

PosEmo = Independent continuous variable

NegEmo = Independent continuous variable

My query :

1. How do I run hypothesis testing on this? I’m pretty confused about what to use and what to do. I am using the SPSS Modeler and SPSS Statistics tools.

2. I think multiple regression is OK for this. Let me know how to use it in SPSS Modeler or the stats tool.

Regards

Sarika

Jim Frost says

Hi Sarika, yes, it sounds like you can use multiple regression for those data. The hypothesis test in this case would be the p-values for the regression coefficients. Click that link to learn more about that. In your stats software, choose multiple linear regression and then specify the dependent variable and the two independent variables. Fit the model and then check the statistical output and the residual plots to see if you have a good model. Be sure to check out my regression tutorial too. That covers many aspects of regression analysis.

Jules says

Thanks for your sharing!

In the binary case (or proportion case), is there any comparison between “two proportion test” and “Chi-square” test? Is there any guideline to choose which test to use?

Jim Frost says

Hi Jules,

You’re welcome! Regarding your question, a two proportion test requires one categorical variable with two levels. For example, the variable could be “test result” and the two levels are “pass” and “fail.”

A chi-square test of independence requires at least two categorical variables. Those variables can have two or more levels. You can read an example of the chi-square test of independence that I’ve written about. The example is based on the original Star Trek TV series and determines whether the uniform color affects the fatality rate. That analysis has two categorical variables–fatalities and uniform color. Fatalities has two levels that indicate whether a crewmember survived or died. Uniform color has three levels–gold, blue, and red.

As you can see, the data requirements for the two tests are different.

I hope this helps!

Jim

Summi says

Tysm!!

MS says

Great post. Thanks for sharing your expertise.

Jim Frost says

Thank you! I’m glad it was helpful.

Manali Teli says

Very nice article. Could you explain more on hypothesis testing on median?

Jim Frost says

Thank you! For more information about testing the median, click the link in the article for where I compare parametric vs nonparametric analyses.

NARAYANASWAMY AUDINARAYANA says

Please let me know when one can use Probit Analysis. May I know the Procedure in SPSS.