Statistics By Jim

Making statistics intuitive

Heterogeneity in Data and Samples for Statistics

By Jim Frost 6 Comments

What is Heterogeneity?

Heterogeneity is defined as a dissimilarity between elements that comprise a whole. When heterogeneity is present, there is diversity in the characteristic under study. The parts of the whole are different, not the same. It is an essential concept in science and statistics. Heterogeneous is the opposite of homogeneous.

Photograph of a heterogeneous set of jelly beans by color.
Heterogeneous jelly beans!

In chemistry, a heterogeneous mixture has a composition that varies. For example, oil and vinegar, sand and water, and salt and pepper are all heterogeneous mixtures. Multiple samples of these mixtures will contain different proportions of each component.

In statistics, heterogeneity is a vital concept that appears in various contexts, and its definition varies accordingly. Heterogeneity can indicate differences within individual samples, between samples, and between experimental results in a meta-analysis. It also applies to an assumption violation regarding errors in linear models. This post focuses on these statistical definitions of heterogeneity and shows you how to identify and test it statistically.

Heterogeneity Within Samples

When you take a sample from a population, you can assess its heterogeneity. Do the individual items in a sample tend to be relatively similar (homogeneous) or dissimilar? Do your data contain variability? If so, how much? Heterogeneous samples occur when the items have differences.

You can use a measure of dispersion to assess heterogeneity in samples. For example, higher standard deviation values indicate the sample is more diverse. Conversely, lower values indicate the items tend to be similar. When there is perfect homogeneity, all the objects in the sample are the same, and the standard deviation equals zero.
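As a quick sketch of this idea, the comparison can be done with Python's standard library (the sample values below are made up for illustration):

```python
import statistics

# A perfectly homogeneous sample: every item is identical.
homogeneous = [5, 5, 5, 5, 5]

# A heterogeneous sample with the same mean but dissimilar items.
heterogeneous = [1, 3, 5, 7, 9]

# The population standard deviation is zero for perfect homogeneity
# and grows as the sample becomes more diverse.
print(statistics.pstdev(homogeneous))    # 0.0
print(statistics.pstdev(heterogeneous))  # larger value -> more diversity
```

The key point is the zero lower bound: a standard deviation of exactly zero means every item in the sample is the same.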

You can also plot your data to evaluate heterogeneity. In the histogram below, Sample C is more heterogeneous than Sample A because the items in Sample C spread out further. This broader spread represents greater heterogeneity.

Histograms in separate panels that display two groups with the same mean but different variability.

Related posts: Standard Deviation, Measures of Variability, and Using Histograms to Evaluate Your Data

Heterogeneity Between Samples

You can also consider whether the properties of different samples, or groups in your data, are heterogeneous. When you collect multiple samples, do they tend to be similar or different? In this context, you need to be careful to define the properties that you are assessing. Some properties of the different samples can be heterogeneous, while others are homogeneous. In this section, I show you how to assess heterogeneity between samples for continuous and categorical data.

Continuous data

With continuous data, you can assess heterogeneity in both the sample means and the variability. Using boxplots, you can display these characteristics and determine whether the data are heterogeneous.

In the boxplot below, the groups have roughly homogeneous means and standard deviations.

Boxplot displays homogeneous samples.

The samples below have heterogeneous means but homogeneous variability.

Boxplot displays heterogeneous sample means.

In the graph below, the groups have homogeneous means but heterogeneous variability.

Boxplot displays heterogeneous sample variability.

While these graphs visually depict heterogeneity in data, you can test these properties using statistical hypothesis tests.

For instance, ANOVA compares the means of multiple samples. It tests the heterogeneity of group means. However, the F-test in ANOVA assumes that the variability of the groups is equal. In other words, you can use ANOVA when group means are heterogeneous, but the variability should be homogeneous.

To determine whether the group means are statistically heterogeneous, use hypothesis tests such as t-tests and one-way ANOVA. To evaluate whether variability differs by group, use a variances test.
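Here is a hedged sketch of both tests using SciPy. The three groups are simulated, and their means, standard deviations, and the random seed are illustrative assumptions, not values from this post:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Three hypothetical groups: different means, similar variability.
group_a = rng.normal(loc=10, scale=2, size=50)
group_b = rng.normal(loc=12, scale=2, size=50)
group_c = rng.normal(loc=14, scale=2, size=50)

# One-way ANOVA tests whether the group means are heterogeneous.
f_stat, anova_p = stats.f_oneway(group_a, group_b, group_c)

# Levene's test checks whether the variability differs by group.
w_stat, levene_p = stats.levene(group_a, group_b, group_c)

print(f"ANOVA p-value:  {anova_p:.4f}")   # small -> heterogeneous means
print(f"Levene p-value: {levene_p:.4f}")  # large -> homogeneous variability
```

Because the simulated means differ by a full standard deviation between adjacent groups, the ANOVA p-value should be very small, while Levene's test should not flag the (equal) group variances.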

Related post: Boxplots vs. Individual Value Plots for Comparing Groups

Categorical data

For categorical data, you can assess the heterogeneity of the categories. We’ll consider M&M candies for these examples, which have six colors: brown, yellow, green, red, orange, and blue.

Again, note the difference between heterogeneity within a sample versus between samples.

A single M&M sample will be homogeneous if it contains only one color. The sample grows increasingly heterogeneous as the number of colors increases.

However, for multiple samples, homogeneity occurs when the number and proportions of colors are the same between them. Heterogeneous batches will have different color ratios.

The pie charts below display pairs of homogeneous and heterogeneous samples of M&M colors.

Pie chart displays a pair of homogeneous samples of M&M colors.

Pie chart displays a pair of heterogeneous samples of M&M colors.

You can test this statistically for categorical data using the chi-square test for homogeneity. When your p-value is low, reject the null hypothesis (homogeneity) and conclude that the samples are heterogeneous. The differences between the category proportions are dissimilar enough to be statistically significant.

The calculations for the chi-square test of homogeneity are the same as the test for independence. The difference between them lies in the hypotheses, testing logic, and sampling methods.
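A minimal sketch of that test in SciPy, assuming made-up color counts for two bags of M&Ms (one row per sample). Because the calculations match the test of independence, SciPy's `chi2_contingency` serves for both forms:

```python
from scipy import stats

# Hypothetical counts for brown, yellow, green, red, orange, blue.
observed = [
    [12, 15, 20, 18, 22, 13],  # Bag 1
    [30, 10,  8, 25, 12, 15],  # Bag 2
]

# A low p-value rejects homogeneity: the color proportions differ
# between the two samples.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}, df = {dof}")
```

With 2 rows and 6 columns, the degrees of freedom are (2 − 1) × (6 − 1) = 5, and these deliberately dissimilar counts produce a small p-value.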

Related post: Chi-square Test of Independence

Heterogeneity Between Scientific Studies

When you consider a series of scientific studies that all attempt to answer the same research question, you can assess the heterogeneity of their results. Meta-analysis does more than simply report the mean effect size for a set of studies. This type of analysis also considers the variability of effect sizes from the individual studies around the overall mean effect—which is where heterogeneity comes in!

Ideally, the study results are all similar (i.e., homogeneous). When that’s true, they’re all painting the same picture, giving you confidence about the real effect. However, if the results are heterogeneous, you’ll need to proceed carefully and understand the differences between the findings. You’ll also want to evaluate the degree of heterogeneity. Do the studies differ greatly, or only slightly?

I’ll show you a graphical and numeric way to evaluate heterogeneity in a meta-analysis.

Forest plots

A forest plot, also known as a blobbogram, is a specialized plot designed to display the results of different studies in a meta-analysis. These plots depict effect sizes on the horizontal axis and include a reference line for no effect. For each experiment, it displays a point estimate for the effect and a confidence interval (CI). You can use a forest plot to evaluate heterogeneity in a meta-analysis.

The forest plot below displays 13 studies and their estimates of the effectiveness of a Bacillus Calmette-Guérin (BCG) vaccine in preventing tuberculosis (TB).

A forest plot that illustrates heterogeneity amongst tuberculosis vaccine studies.

Overall, the studies favor the treatment group that received the vaccine over the control group, which did not receive it. However, there are differences between the studies. The CIs have different widths. Some CIs include the null value of zero (no effect), while others do not. One study’s point estimate even favors the control group! Several other estimates fall right on the no-effect line.

While the graph displays heterogeneity in the meta-analysis, we need to quantify it. This necessity brings us to the I² statistic, which I’ve circled on the forest plot.

Related post: Control Groups in Experiments

I² Statistic

The I² statistic quantifies the degree of heterogeneity in a series of studies within a meta-analysis. This statistic is a percentage that ranges from 0 – 100%. It indicates the proportion of the variation in observed effect sizes that reflects real differences between studies rather than sampling error.

Statisticians commonly use the following benchmark values to assess the degree of heterogeneity:

  • 25%: Small
  • 50%: Moderate
  • 75%: Large

On the forest plot above, the value is 92.22%. These studies have considerable heterogeneity. We must proceed with caution when assessing the overall effectiveness of the BCG vaccine. They are not telling a consistent story!
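A common way to compute I² starts from Cochran's Q statistic: I² = max(0, (Q − df)/Q) × 100, where df is the number of studies minus one. The sketch below implements that formula; the Q value in the example is hypothetical, chosen only to land near the 92% figure on the forest plot:

```python
def i_squared(q: float, num_studies: int) -> float:
    """I² = max(0, (Q - df) / Q) * 100, with df = k - 1 for k studies."""
    df = num_studies - 1
    if q <= 0:
        # Q at or below zero means no excess variation beyond sampling error.
        return 0.0
    return max(0.0, (q - df) / q) * 100

# Hypothetical: Q = 150 across 13 studies (df = 12).
print(f"I-squared = {i_squared(150, 13):.1f}%")  # (150 - 12) / 150 -> 92.0%
```

The max() truncation is why I² never goes negative: when Q falls below its degrees of freedom, the observed variation is no more than sampling error alone would produce.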

Heterogeneous Errors in Linear Models

Linear models assume that the errors are homogeneous. When you plot the residuals, you want to see dispersion that remains consistent throughout the entire range. Unfortunately, that’s not always the case. Statisticians refer to heterogeneous residuals as heteroscedasticity, which violates the assumption. The residual plot below shows this condition.

Residuals by fitted values plot that displays heteroscedasticity.

Notice how the spread of the residuals increases as you move to higher fitted values. Fortunately, there are several ways to address this condition.

Related post: Heteroscedasticity in Regression Analysis


Filed Under: Basics Tagged With: conceptual, graphs

Reader Interactions

Comments

  1. Habib says

    December 15, 2021 at 9:53 pm

    Good explanations of all the terms used. I appreciate your efforts, dear Jim.
    I recommend this site for individuals interested in learning statistics in a simple and easy way.

  2. Rodrigo Campos says

    October 27, 2021 at 6:37 am

    Clear and useful presentation about the issue! Good for teaching. Thanks Jim

    • Jim Frost says

      October 28, 2021 at 12:15 am

      Thanks, Rodrigo!

  3. Bal Ram Bhui says

    October 3, 2021 at 10:41 pm

    I enjoyed reading it. It makes the concept clear so that I can better understand the background of the t-test, ANOVA, and the assumptions of linear regression.

  4. Anoop says

    October 3, 2021 at 7:47 pm

    I would just add that the heterogeneity shown in the meta-analysis is in magnitude, not in direction. Almost all studies show benefits, so I think it is less of a concern.

    • Jim Frost says

      October 3, 2021 at 7:54 pm

      Hi Anoop,

      I’d agree that these studies do show a benefit overall, as I mention in the article itself. However, if you wanted to estimate the effect, there’d be a relatively wide spread of possibilities. So, you’ll gain some benefit, but it would be impossible to say precisely how much. Indeed, it might be difficult to determine whether the benefits are practically meaningful as opposed to just statistically significant.


    Copyright © 2023 · Jim Frost · Privacy Policy