In a previous blog post, I introduced the basic concepts of hypothesis testing and explained the need for performing these tests. In this post, I’ll build on that and compare various types of hypothesis tests that you can use with different types of data, explore some of the options, and explain how to interpret the results. Along the way, I’ll point out important planning considerations, related analyses, and pitfalls to avoid.

A hypothesis test uses sample data to assess two mutually exclusive theories about the properties of a population. Hypothesis tests allow you to use a manageable-sized sample from the process to draw inferences about the entire population.

I’ll cover common hypothesis tests for three types of data—continuous, binary, and count data. Recognizing the different types of data is crucial because the type of data determines the hypothesis tests you can perform and, critically, the nature of the conclusions that you can draw. If you collect the wrong data, you might not be able to get the answers that you need.

## Hypothesis Tests for Continuous Data

Continuous data can take on any numeric value, and it can be meaningfully divided into smaller increments, including fractional and decimal values. There are an infinite number of possible values between any two values. You often measure a continuous variable on a scale. For example, when you measure height, weight, and temperature, you have continuous data. With continuous variables, you can use hypothesis tests to assess the mean, median, and standard deviation.

When you collect continuous data, you usually get more bang for your data buck compared to discrete data. The two key advantages of continuous data are that you can:

- Draw conclusions with a smaller sample size.
- Use a wider variety of analyses, which allows you to learn more.

I’ll cover two of the more common hypothesis tests that you can use with continuous data—t-tests to assess means and variance tests to evaluate dispersion around the mean. Both of these tests come in one-sample and two-sample versions. One-sample tests allow you to compare your sample estimate to a target value. The two-sample tests let you compare the samples to each other. I’ll cover examples of both types.

There is also a group of tests that assess the median rather than the mean. These are known as nonparametric tests and practitioners use them less frequently. However, consider using a nonparametric test if your data are highly skewed and the median better represents the actual center of your data than the mean.

**Related post**: Nonparametric vs. Parametric Tests

### Graphing the data for the example scenario

Suppose we have two production methods and our goal is to determine which one produces a stronger product. To evaluate the two methods, we draw a random sample of 30 products from each production line and measure the strength of each unit. Before performing any analyses, it’s always a good idea to graph the data because it provides an excellent overview. Here is the CSV data file in case you want to follow along: Continuous_Data_Examples.

These histograms suggest that Method 2 produces a higher mean strength while Method 1 produces more consistent strength scores. The higher mean strength is good for our product, but the greater variability might produce more defects.

Graphs provide a good picture, but they do not test the data statistically. The differences in the graphs might be caused by random sample error rather than an actual difference between production methods. If the observed differences are due to random error, it would not be surprising if another sample showed different patterns. It can be a costly mistake to base decisions on “results” that vary with each sample. Hypothesis tests factor in random error to improve our chances of making correct decisions.

Keep this graph in mind when we look at binary data because they illustrate how much more information continuous data convey.

**Related post**: How Hypothesis Tests Work: Significance Levels and P-values

### Two-sample t-test to compare means

The first thing we want to determine is whether one of the methods produces stronger products. We’ll use a two-sample t-test to determine whether the population means are different. The hypotheses for our 2-sample t-test are:

**Null hypothesis:**The mean strengths for the two populations are equal.**Alternative hypothesis:**The mean strengths for the two populations are different.

A p-value less than the significance level indicates that you can reject the null hypothesis. In other words, the sample provides sufficient evidence to conclude that the population means are different. Below is the output for the analysis.

The p-value (0.034) is less than 0.05. From the output, we can see that the difference between the mean of Method 2 (98.39) and Method 1 (95.39) is statistically significant. We can conclude that Method 2 produces a stronger product on average.

That sounds great, and it appears that we should use Method 2 to manufacture a stronger product. However, there are other considerations. The t-test tells us that Method 2’s mean strength is greater than Method 1, but it says nothing about the variability of strength values. For that, we need to use another test.

**Related post**: How T-Tests Work

### 2-Variances test to compare variability

A production method that has excessive variability creates too many defects. Consequently, we will also assess the standard deviations of both methods. To determine whether either method produces greater variability in the product’s strength, we’ll use the 2 Variances test. The hypotheses for our 2 Variances test are:

**Null hypothesis:**The standard deviations for the populations are equal.**Alternative hypothesis:**The standard deviations for the populations are different.

A p-value less than the significance level indicates that you can reject the null hypothesis. In other words, the sample provides sufficient evidence for concluding that the population standard deviations are different. The 2-Variances output for our product is below.

Both of the p-values are less than 0.05. The output indicates that the variability of Method 1 is significantly less than Method 2. We can conclude that Method 1 produces a more consistent product.

**Related post**: How to Interpret P-values Correctly

### What we learned and did not learn with the hypothesis tests

The hypothesis test results confirm the patterns in the graphs. Method 2 produces stronger products on average while Method 1 produces a more consistent product. The statistically significant test results indicate that these results are likely to represent actual differences between the production methods rather than sampling error.

Our example also illustrates how you can assess different properties using continuous data, which can point towards different decisions. We might want the stronger products of Method 2 but the greater consistency of Method 1. To navigate this dilemma, we’ll need to use our process knowledge.

Finally, it’s crucial to note that the tests produce estimates of population parameters—the population means (μ) and the population standard deviations (σ). While these parameters can help us make decisions, they tell us little about where individual values are likely to fall. In certain circumstances, knowing the proportion of values that fall within specified intervals is crucial.

For the examples, the products must fall within spec limits. Even when the mean falls within the spec limit, it’s possible that too many individual items will fall outside the spec limits if the variability is too high. To better understand the distribution of individual values, you can use the following analyses:

**Tolerance intervals**: A tolerance interval is a range that likely contains a specific proportion of a population. For our example, we might want to know the range where 99% of the population falls for each production method. We can compare the tolerance interval to our requirements to determine whether there is too much variability.

**Capability analysis**: This type of analysis uses sample data to determine how effectively a process produces output with characteristics that fall within the spec limits. These tools incorporate both the mean and spread of your data to estimate the proportion of defects.

**Related post**: Confidence Intervals vs. Prediction Intervals vs. Tolerance Intervals

## Proportion Hypothesis Tests for Binary Data

Let’s switch gears and move away from continuous data. Suppose we take another random sample of our product from each of the production lines. However, instead of measuring a characteristic, inspectors evaluate each product and either accept or reject it.

Binary data can have only two values. If you can place an observation into only two categories, you have a binary variable. For example, pass/fail and accept/reject data are binary. Quality improvement practitioners often use binary data to record defective units.

Binary data are useful for calculating proportions or percentages, such as the proportion of defective products in a sample. You simply take the number of defective products and divide by the sample size. Hypothesis tests that assess proportions require binary data and allow you to use sample data to make inferences about the proportions of populations.

### 2 Proportions test to compare two samples

For our first example, we will make a decision based on the proportions of defective parts. Our goal is to determine whether the two methods produce different proportions of defective parts.

To make this determination, we’ll use the 2 Proportions test. For this test, the hypotheses are as follows:

**Null hypothesis:**The proportions of defective parts for the two populations are equal.**Alternative hypothesis:**The proportions of defective parts for the two populations are different.

A p-value less than the significance level indicates that you can reject the null hypothesis. In this case, the sample provides sufficient evidence for concluding that the population proportions are different. The 2 Proportions output for our product is below.

Both p-values are less than 0.05. The output indicates that the difference between the proportion of defective parts for Method 1 (~0.062) and Method 2 (~0.146) is statistically significant. We can conclude that Method 1 produces defective parts less frequently.

### 1 Proportion test example: comparison to a target

The 1 Proportion test is also handy because you can compare a sample to a target value. Suppose you receive parts from a supplier who guarantees that less than 3% of all parts they produce are defective. You can use the 1 Proportion test to assess this claim.

First, collect a random sample of parts and determine how many are defective. Then, use the 1 Proportion test to compare your sample estimate to the target proportion of 0.03. Because we are interested in detecting only whether the population proportion is greater than 0.03, we’ll use a one-sided test. One-sided tests have greater power to detect differences in one direction, but no ability to detect differences in the other direction. Our one-sided 1 Proportion test has the following hypotheses:

**Null hypothesis:**The proportion of defective parts for the population equals 0.03 or less.**Alternative hypothesis:**The proportion of defective parts for the population is greater than 0.03.

For this test, a significant p-value indicates that the supplier is in trouble! The sample provides sufficient evidence to conclude that the proportion of all parts from the supplier’s process is greater than 0.03 despite their assertions to the contrary.

### Comparing continuous data to binary data

Think back to the graphs for the continuous data. At a glance, you can see both the central location and spread of the data. If we added spec limits, we could see how many data points are close and far away from them. Is the process centered between the spec limits? Continuous data provide a lot of insight into our processes.

Now, compare that to the binary data that we used in the 2 Proportions test. All we learn from that data is the proportion of defects for Method 1 (0.062) and Method 2 (0.146). There is no distribution to analyze, no indication of how close the items are to the specs, and no indication of how they failed the inspection. We only know the two proportions.

Additionally, the samples sizes are much larger for the binary data than the continuous data (130 vs. 30). When the difference between proportions is smaller, the required sample sizes can become quite large. Had we used a sample size of 30 like before, we almost certainly would not have detected this difference.

In general, binary data provide less information than an equivalent amount of continuous data. If you can collect continuous data, it’s the better route to take!

## Poisson Hypothesis Tests for Count Data

Count data can have only non-negative integers (e.g., 0, 1, 2, etc.). In statistics, we often model count data using the Poisson distribution. Poisson data are a count of the presence of a characteristic, result, or activity over a constant amount of time, area, or other length of observation. For example, you can use count data to record the number of defects per item or defective units per batch. With Poisson data, you can assess a rate of occurrence.

For this scenario, we’ll assume that we receive shipments of parts from two different suppliers. Each supplier sends the parts in the same sized batch. We need to determine whether one supplier produces fewer defects per batch than the other supplier.

To perform this analysis, we’ll randomly sample batches of parts from both suppliers. The inspectors examine all parts in each batch and record the count of defective parts. We’ll randomly sample 30 batches from each supplier. Here is the CSV data file for this example: Count_Data_Example.

We’ll use the 2-Sample Poisson Rate test. For this test, the hypotheses are as follows:

**Null hypothesis:**The rates of defective parts for the two populations are equal.**Alternative hypothesis:**The rates of defective parts for the two populations are different.

A p-value less than the significance level indicates that you can reject the null hypothesis because the sample provides sufficient evidence to conclude that the population rates are different. The 2-Sample Poisson Rate output for our product is below.

Both p-values are less than 0.05. The output indicates that the difference between the rate of defects per batch for Supplier 1 (3.56667) and Supplier 2 (5.36667) is statistically significant. We can conclude that Supplier 1 produces defects at a lower rate than Supplier 2.

Hypothesis tests are a great tool that allow you to take relatively small samples and draw conclusions about entire populations. There is a selection of tests available, and different options within the tests, which make them useful for a wide variety of situations.

NARAYANASWAMY AUDINARAYANA says

Please let me know when one can use Probit Analysis. May I know the Procedure in SPSS.

Manali Teli says

Very nice article. Could you explain more on hypothesis testing on median?

Jim Frost says

Thank you! For more information about testing the median, click the link in the article for where I compare parametric vs nonparametric analyses.

MS says

Great post. Thanks for sharing your expertise.

Jim Frost says

Thank you! I’m glad it was helpful.

Summi says

Tysm!!

Jules says

Thanks for your sharing!

In the binary case (or proportion case), is there any comparison between “two proportion test” and “Chi-square” test? Is there any guideline to choose which test to use?

Jim Frost says

Hi Jules,

You’re welcome! Regarding your question, a two proportion test requires one categorical variable with two levels. For example, the variable could be “test result” and the two levels are “pass” and “fail.”

A chi-square test of independence requires at least two categorical variables. Those variables can have two or more levels. You can read an example of the chi-square test of independence that I’ve written about. The example is based on the original Star Trek TV series and determines whether the uniform color affects the fatality rate. That analysis has two categorical variables–fatalities and uniform color. Fatalities has two levels that indicate whether a crewmember survived or died. Uniform color has three levels–gold, blue, and red.

As you can see, the data requirements for the two tests are different.

I hope this helps!

Jim