Frequency is the number of times a specific data value occurs in your dataset. A frequency table lists a set of values and how often each one appears. They help you understand which data values are common and which are rare. These tables organize your data and are an effective way to present the results to others. Frequency tables are also known as frequency distributions because they allow you to understand the distribution of values in your dataset. [Read more…] about Frequency Table

# distributions

## Mean Absolute Deviation

The mean absolute deviation (MAD) is a measure of variability that indicates the average distance between observations and their mean. MAD uses the original units of the data, which simplifies interpretation. Larger values signify that the data points spread out further from the average. Conversely, lower values correspond to data points bunching closer to it. The mean absolute deviation is also known as the mean deviation and average absolute deviation. [Read more…] about Mean Absolute Deviation
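
To make the definition concrete, here is a minimal Python sketch of the calculation. The function name and the data are hypothetical and chosen only for illustration.

```python
def mean_absolute_deviation(values):
    """Average absolute distance between each observation and the mean."""
    center = sum(values) / len(values)
    return sum(abs(x - center) for x in values) / len(values)

# Hypothetical data, used only to illustrate the arithmetic.
scores = [2, 4, 6, 8, 10]
mad = mean_absolute_deviation(scores)
print(mad)
```

The result is in the same units as the data, so here the observations sit, on average, that many units away from their mean.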

## Stem and Leaf Plot

Stem and leaf plots display the shape and spread of a continuous data distribution. These graphs are similar to histograms, but instead of using bars, they show digits. They’re particularly valuable during exploratory data analysis. They can help you identify the central tendency, variability, and skewness of your distribution. Additionally, they can help you find outliers. Stem and leaf plots are also known as stemplots. [Read more…] about Stem and Leaf Plot

## Range of a Data Set

The range of a data set is the difference between the maximum and the minimum values. It measures variability using the same units as the data. Larger values represent greater variability.

The range is the easiest measure of dispersion to calculate and interpret in statistics, but it has some limitations. In this post, I’ll show you how to find the range mathematically and graphically, interpret it, explain its limitations, and clarify when to use it. [Read more…] about Range of a Data Set

## Z-score: Definition, Formula, and Uses

A z-score measures the distance between a data point and the mean using standard deviations. Z-scores can be positive or negative. The sign tells you whether the observation is above or below the mean. For example, a z-score of +2 indicates that the data point falls two standard deviations above the mean, while a -2 signifies it is two standard deviations below the mean. A z-score of zero indicates that the data point equals the mean. Statisticians also refer to z-scores as standard scores, and I’ll use those terms interchangeably. [Read more…] about Z-score: Definition, Formula, and Uses
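
As a quick illustration, the sketch below converts raw values to z-scores. The mean of 100 and standard deviation of 15 are hypothetical values I picked so the arithmetic is easy to follow.

```python
def z_score(x, mean, sd):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / sd

# Hypothetical population values, assumed for illustration only.
MEAN, SD = 100, 15

above = z_score(130, MEAN, SD)    # two standard deviations above the mean
below = z_score(70, MEAN, SD)     # two standard deviations below the mean
at_mean = z_score(100, MEAN, SD)  # exactly at the mean
print(above, below, at_mean)
```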

## Relative Frequencies and Their Distributions

A relative frequency indicates how often a specific kind of event occurs within the total number of observations. It is a type of frequency that uses percentages, proportions, and fractions.

In this post, learn about relative frequencies, the relative frequency distribution, and its cumulative counterpart. [Read more…] about Relative Frequencies and Their Distributions
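
Here is a small Python sketch of a relative frequency distribution and its cumulative counterpart. The color data are hypothetical, and ordering the categories from most to least common is just one reasonable choice.

```python
from collections import Counter

# Hypothetical categorical observations.
colors = ["red", "blue", "red", "green", "red", "blue", "red", "blue"]

counts = Counter(colors)
total = len(colors)

# Relative frequency: each count as a proportion of all observations.
relative = {value: count / total for value, count in counts.most_common()}

# Cumulative relative frequency: running total of the proportions.
cumulative = {}
running = 0.0
for value, proportion in relative.items():
    running += proportion
    cumulative[value] = running

print(relative)
print(cumulative)
```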

## Empirical Rule: Definition, Formula, and Uses

The empirical rule in statistics, also known as the 68-95-99.7 rule, states that for normal distributions, 68% of observed data points will lie inside one standard deviation of the mean, 95% will fall within two standard deviations, and 99.7% will occur within three standard deviations. [Read more…] about Empirical Rule: Definition, Formula, and Uses
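
You can check these percentages yourself. The sketch below uses the standard result that, for a normal distribution, the share of values within k standard deviations of the mean equals erf(k / √2). The function name is my own.

```python
from math import erf, sqrt

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The printed percentages round to the familiar 68, 95, and 99.7.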

## Interquartile Range (IQR): Definition and Uses

The interquartile range (IQR) measures the spread of the middle half of your data. It is the range for the middle 50% of your sample. Use the IQR to assess the variability where most of your values lie. Larger values indicate that the central portion of your data spreads out further. Conversely, smaller values show that the middle values cluster more tightly.

In this post, learn what the interquartile range means and the many ways to use it! I’ll show you how to find the interquartile range, use it to measure variability, graph it in boxplots to assess distribution properties, use it to identify outliers, and test whether your data are normally distributed.

The interquartile range is one of several measures of variability. To learn about the others and how the IQR compares, read my post, Measures of Variability.

## Interquartile Range Overview

To visualize the interquartile range, imagine dividing your data into quarters. Statisticians refer to these quarters as quartiles and label them from low to high as Q1, Q2, Q3, and Q4. The lowest quartile (Q1) covers the smallest quarter of values in your dataset. The upper quartile (Q4) comprises the highest quarter of values. The interquartile range is the middle half of the data that lies between the upper and lower quartiles. In other words, the interquartile range includes the 50% of data points that are above Q1 and below Q4. The IQR is the red area in the graph below, containing Q2 and Q3 (not labeled).

When measuring variability, statisticians prefer using the interquartile range instead of the full data range because extreme values and outliers affect it less. Typically, use the IQR with a measure of central tendency, such as the median, to understand your data’s center and spread. This combination creates a fuller picture of your data’s distribution.

Unlike the more familiar mean and standard deviation, the interquartile range and the median are robust measures. Outliers do not strongly influence either statistic because they don’t depend on every value. Additionally, like the median, the interquartile range is superb for skewed distributions. For normal distributions, you can use the standard deviation to determine the percentage of observations that fall specific distances from the mean. However, that doesn’t work for skewed distributions, and the IQR is an excellent alternative.

**Related post**: Median: Definition and Uses and What are Robust Statistics?

## How to Find the IQR by Hand

The formula for calculating the interquartile range takes the third quartile value and subtracts the first quartile value.

IQR = Q3 – Q1

Equivalently, the interquartile range is the region between the 75th and 25th percentiles (75 – 25 = 50% of the data).

To use the IQR formula, we need to find the values for Q3 and Q1. To do that, simply order your data from low to high and split the values into four equal portions.

I’ve divided the dataset below into quartiles. The interquartile range extends from the Q1 value to the Q3 value. For this dataset, the interquartile range is 39 – 20 = 19.

Note that different methods and statistical software programs will find slightly different Q1 and Q3 values, which affects the interquartile range. These variations stem from alternate ways of finding percentiles. For details about that, read my post about Percentiles: Interpretations and Calculations.
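
To see how the method matters, here is a minimal sketch using Python's standard library, which offers two quantile methods named "exclusive" and "inclusive". The data are hypothetical and are not the dataset from this post.

```python
from statistics import quantiles

# Hypothetical data, not the dataset from the illustration in this post.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

def iqr(values, method):
    """Return (Q1, Q3, IQR) using one of Python's two quantile methods."""
    q1, _median, q3 = quantiles(values, n=4, method=method)
    return q1, q3, q3 - q1

exclusive = iqr(data, "exclusive")
inclusive = iqr(data, "inclusive")
print("exclusive:", exclusive)
print("inclusive:", inclusive)
```

The same eleven values produce two different interquartile ranges, which is why it helps to report the method or software you used.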

## Finding the Interquartile Range using Excel

All statistical software packages will identify the interquartile range as part of their descriptive statistics. Here, I’ll show you how to find it using Excel because most readers can access this application.

To follow along, download the Excel file: IQR. This dataset is the same as the one I use in the illustration above. This file also includes the interquartile range calculations for finding outliers and the IQR normality test described later in this post.

In Excel, you’ll need to use the QUARTILE.EXC function, which has the following arguments: QUARTILE.EXC(array, quart)

- **Array**: Cell range of numeric values.
- **Quart**: Quartile you want to find.

In my spreadsheet, the data are in cells A2:A20. Consequently, I’ll use the following syntax to find Q1 and Q3, respectively:

- =QUARTILE.EXC(A2:A20,1)
- =QUARTILE.EXC(A2:A20,3)

As with my example of finding the interquartile range by hand, Excel indicates that Q3 is 39 and Q1 is 20. IQR = 39 – 20 = 19

**Related post**: Descriptive Statistics in Excel

## Using Boxplots to Graph the Interquartile Range

Boxplots are a great way to visualize interquartile ranges and their relation to the median and the overall distribution. These graphs display ranges of values based on quartiles and show asterisks for outliers that fall outside the whiskers. Boxplots work by splitting your data into quarters.

Let’s look at the boxplot anatomy before getting to the example. Notice how it divides your data into quartiles.

The box in the boxplot is your interquartile range! It contains 50% of your data. By comparing the size of these boxes, you can understand your data’s variability. More dispersed distributions have wider boxes.

Additionally, find where the median line falls within each interquartile box. If the median is closer to one side or the other of the box, it’s a skewed distribution. When the median is near the center of the interquartile range, your distribution is symmetric.

For example, in the boxplot below, method 3 has the highest variability in scores and is left-skewed. Conversely, method 2 has a tighter distribution that is symmetrical, although it also has an outlier—read the next section for more about that!

**Related post**: Boxplots versus Individual Value Plots

## Using the IQR to Find Outliers

The interquartile range can help you identify outliers. For other methods of finding outliers, the outliers themselves influence the calculations, potentially causing you to miss them. Fortunately, interquartile ranges are relatively robust against outlier influence and can avoid this problem. This method also does not assume the data follow the normal distribution or any other distribution. That’s why using the IQR to find outliers is one of my favorite methods!

To find outliers, you’ll need to know your data’s IQR, Q1, and Q3 values. Take these values and input them into the equations below. Statisticians call the result for each equation an outlier gate. I’ve included these calculations in the IQR example Excel file.

Q1 − 1.5 * IQR: Lower outlier gate.

Q3 + 1.5 * IQR: Upper outlier gate.

Using the same example dataset, I’ll calculate the two outlier gates. For that dataset, the interquartile range is 19, Q1 = 20, and Q3 = 39.

Lower outlier gate: 20 – 1.5 * 19 = -8.5

Upper outlier gate: 39 + 1.5 * 19 = 67.5

Then look for values in the dataset that are below the lower gate or above the upper gate. For the example dataset, there are no outliers. All values fall between these two gates.
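
The gate calculations can be sketched in a few lines of Python. The Q1 and Q3 values come from the example above, but the function names and the short list of values to screen are hypothetical.

```python
def outlier_gates(q1, q3, multiplier=1.5):
    """Return the (lower, upper) outlier gates based on the IQR."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

def find_outliers(values, lower, upper):
    """Values that fall below the lower gate or above the upper gate."""
    return [x for x in values if x < lower or x > upper]

# Q1 = 20 and Q3 = 39 are the values from the example dataset.
lower_gate, upper_gate = outlier_gates(q1=20, q3=39)
print(lower_gate, upper_gate)

# Hypothetical values to screen, not the example dataset itself.
sample = [12, 25, 31, 44, 70]
flagged = find_outliers(sample, lower_gate, upper_gate)
print(flagged)
```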

Boxplots typically use this method to identify outliers and display asterisks when they exist. In the teaching method boxplot above, notice that the Method 2 group has an outlier. The researchers should investigate that value.

**Related post**: Five Ways to Find Outliers

## Using the Interquartile Range to Test Normality

You can even use the interquartile range as a simple test to determine whether your data are normally distributed. When data follow a normal distribution, the interquartile range will have specific properties. The image below highlights these properties. Specifically, in our calculations below, we’ll use the standard deviations (σ) that correspond to the interquartile range, -0.67 and 0.67.

You can assess whether your IQR is consistent with a normal distribution. However, this test should not replace a formal normality hypothesis test.

To perform this test, you’ll need to know the sample standard deviation (s) and sample mean (x̅). Input these values into the formulas for Q1 and Q3 below.

- Q1 = x̅ − (s * 0.67)
- Q3 = x̅ + (s * 0.67)

Compare these calculated values to your data’s actual Q1 and Q3 values. If they are notably different, your data might not follow the normal distribution.

We’ll return to our example dataset from before. Our actual Q1 and Q3 are 20 and 39, respectively.

The sample average is 31.3, and its standard deviation is 14.1. I’ll input those values into the equations.

Q1 = 31.3 – (14.1 * 0.67) = 21.9

Q3 = 31.3 + (14.1 * 0.67) = 40.7

The calculated values are pretty close to the actual data values, suggesting that our data follow the normal distribution. I’ve included these calculations in the IQR example spreadsheet.
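
Here is a rough Python sketch of the same check. It uses the sample mean and standard deviation from the example, and it also shows where the 0.67 comes from: it is the rounded 75th percentile of the standard normal distribution. The function name is hypothetical.

```python
from statistics import NormalDist

# The 75th percentile of the standard normal distribution, about 0.6745.
z75 = NormalDist().inv_cdf(0.75)

def normal_quartiles(mean, sd, z=0.67):
    """Q1 and Q3 you would expect if the data were normally distributed."""
    return mean - sd * z, mean + sd * z

# Sample mean and standard deviation from the example dataset.
q1_expected, q3_expected = normal_quartiles(mean=31.3, sd=14.1)
print(round(z75, 4))
print(round(q1_expected, 1), round(q3_expected, 1))
```

Compare the printed values with the actual Q1 of 20 and Q3 of 39. How close counts as "close enough" is a judgment call, which is one reason this check does not replace a formal test.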

**Related posts**: Understanding the Normal Distribution and How to Identify the Distribution of Your Data

## Standard Deviation: Interpretations and Calculations

The standard deviation (SD) is a single number that summarizes the variability in a dataset. It represents the typical distance between each data point and the mean. Smaller values indicate that the data points cluster closer to the mean—the values in the dataset are relatively consistent. Conversely, higher values signify that the values spread out further from the mean. Data values become more dissimilar, and extreme values become more likely. [Read more…] about Standard Deviation: Interpretations and Calculations

## What is the Mean in Statistics?

In statistics, the mean summarizes an entire dataset with a single number representing the data’s center point or typical value. It is also known as the arithmetic average, and it is one of several measures of central tendency. It is likely the measure of central tendency with which you’re most familiar! Learn how to calculate the mean, and when it is and is not a good statistic to use!

## How Do You Find the Mean?

Finding the mean is very simple. Just add all the values and divide by the number of observations—the formula is below.

For example, if the heights of five people are 48, 51, 52, 54, and 56 inches, their average height is 52.2 inches.

(48 + 51 + 52 + 54 + 56) / 5 = 52.2
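
The same calculation as a short Python sketch, using the five heights from the example:

```python
# Heights in inches from the example above.
heights = [48, 51, 52, 54, 56]

# Add all the values, then divide by the number of observations.
average_height = sum(heights) / len(heights)
print(average_height)
```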

## When Do You Use the Mean?

Ideally, the mean indicates the region where most values in a distribution fall. Statisticians refer to it as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value. The histogram below illustrates the average accurately finding the center of the data’s distribution.

However, the mean does not always find the center of the data. It is sensitive to skewed data and extreme values. For example, when the data are skewed, it can miss the mark. In the histogram below, the average is outside the area with the most common values.

This problem occurs because outliers have a substantial impact on the mean. Extreme values in an extended tail pull it away from the center. As the distribution becomes more skewed, the average is drawn further away from the center.

In these cases, the mean can be misleading because it might not be near the most common values. Consequently, it’s best to use the average to measure the central tendency when you have a symmetric distribution.

For skewed distributions, it’s often better to use the median, which uses a different method to find the central location. Note that the mean provides no information about the variability present in a distribution. To evaluate that characteristic, assess the standard deviation.
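
A tiny example shows how a single extreme value affects the two statistics differently. The numbers below are hypothetical.

```python
from statistics import mean, median

# Hypothetical right-skewed data: four small values and one extreme value.
values = [1, 2, 3, 4, 100]

skewed_mean = mean(values)
skewed_median = median(values)
print(skewed_mean, skewed_median)
```

The mean lands well above four of the five values, while the median stays in the middle of the bulk of the data.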

**Related post**: Measures of Central Tendency: Mean, Median, and Mode

## Using Sample Means to Estimate Population Means

In statistics, analysts often use a sample average to estimate a population mean. For small samples, the sample mean can differ greatly from the population mean. However, as the sample size grows, the law of large numbers states that the sample average is likely to be close to the population value.
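
You can watch the law of large numbers at work with a simple simulation. This sketch assumes a fair six-sided die, whose population mean is 3.5, and uses a fixed seed only so the run is repeatable.

```python
import random

# Fixed seed so the simulation is repeatable; the seed value is arbitrary.
rng = random.Random(42)

def sample_mean(n):
    """Mean of n simulated rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

small_sample = sample_mean(10)
large_sample = sample_mean(100_000)
print(small_sample, large_sample)
```

The small sample can land noticeably away from 3.5, while the large sample should be very close to it.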

Hypothesis tests, such as t-tests and ANOVA, use samples to determine whether population means are different. Statisticians refer to this process of using samples to estimate the properties of entire populations as inferential statistics.

**Related post**: Descriptive Statistics Vs. Inferential Statistics

## Gamma Distribution

The gamma distribution is a continuous probability distribution that models right-skewed data. Statisticians have used this distribution to model cancer rates, insurance claims, and rainfall. Additionally, the gamma distribution is similar to the exponential distribution, and you can use it to model the same types of phenomena: failure times, wait times, service times, etc. [Read more…] about Gamma Distribution

## Exponential Distribution

The exponential distribution is a right-skewed continuous probability distribution that models variables in which small values occur more frequently than higher values. Small values have relatively high probabilities, which consistently decline as data values increase. Statisticians use the exponential distribution to model the amount of change in people’s pockets, the length of phone calls, and sales totals for customers. In all these cases, small values are more likely than larger values. [Read more…] about Exponential Distribution

## Weibull Distribution

The Weibull distribution is a continuous probability distribution that can fit an extensive range of distribution shapes. Like the normal distribution, the Weibull distribution describes the probabilities associated with continuous data. However, unlike the normal distribution, it can also model skewed data. In fact, its extreme flexibility allows it to model both left- and right-skewed data. [Read more…] about Weibull Distribution

## Poisson Distribution

The Poisson distribution is a discrete probability distribution that describes probabilities for counts of events that occur in a specified observation space. It is named after Siméon Denis Poisson.

In statistics, count data represent the number of events or characteristics over a given length of time, area, volume, etc. For example, you can count the number of cigarettes smoked per day, meteors seen per hour, the number of defects in a batch, and the occurrence of a particular crime by county. [Read more…] about Poisson Distribution
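
For reference, the Poisson probability of observing exactly k events when the average count is λ is λ^k e^(−λ) / k!. Here is a minimal Python sketch of that formula. The average of two events per interval is a hypothetical value.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the mean count is lam."""
    return lam**k * exp(-lam) / factorial(k)

# Hypothetical average of 2 events per observation interval.
LAM = 2
probabilities = [poisson_pmf(k, LAM) for k in range(6)]
for k, p in enumerate(probabilities):
    print(k, round(p, 4))
```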

## Dot Plots: Using, Examples, and Interpreting

Use dot plots to display the distribution of your sample data when you have continuous variables. These graphs stack dots along the horizontal X-axis to represent the frequencies of different values. More dots indicate greater frequency. Each dot represents a set number of observations. [Read more…] about Dot Plots: Using, Examples, and Interpreting

## Chebyshev’s Theorem in Statistics

Chebyshev’s Theorem estimates the minimum proportion of observations that fall within a specified number of standard deviations from the mean. This theorem applies to a broad range of probability distributions. Chebyshev’s Theorem is also known as Chebyshev’s Inequality. [Read more…] about Chebyshev’s Theorem in Statistics
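
The bound itself is simple: at least 1 − 1/k² of the observations fall within k standard deviations of the mean, for any k greater than 1. A short Python sketch, with a function name of my own choosing:

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1 for a useful bound")
    return 1 - 1 / k**2

within_two = chebyshev_min_proportion(2)
within_three = chebyshev_min_proportion(3)
print(within_two, round(within_three, 3))
```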

## Coefficient of Variation in Statistics

The coefficient of variation (CV) is a relative measure of variability that indicates the size of a standard deviation in relation to its mean. It is a standardized, unitless measure that allows you to compare variability between disparate groups and characteristics. It is also known as the relative standard deviation (RSD).

In this post, you will learn what the coefficient of variation is, how to calculate it, when it is particularly useful, and when to avoid it. [Read more…] about Coefficient of Variation in Statistics
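
As a rough illustration, the sketch below computes the CV as the standard deviation divided by the mean. It assumes the sample standard deviation, and both datasets are hypothetical.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    return stdev(values) / mean(values)

# Hypothetical measurements in different units.
heights_cm = [160, 170, 180]
weights_kg = [50, 60, 70]

cv_height = coefficient_of_variation(heights_cm)
cv_weight = coefficient_of_variation(weights_kg)
print(round(cv_height, 3), round(cv_weight, 3))
```

Both groups have the same standard deviation of 10, yet the weights vary more relative to their mean, which is the kind of comparison the CV makes possible.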

## How the Chi-Squared Test of Independence Works

Chi-squared tests of independence determine whether a relationship exists between two categorical variables. Do the values of one categorical variable depend on the value of the other categorical variable? If the two variables are independent, knowing the value of one variable provides no information about the value of the other variable.

I’ve previously written about Pearson’s chi-square test of independence using a fun Star Trek example. Are the uniform colors related to the chances of dying? You can test the notion that the infamous red shirts have a higher likelihood of dying. In that post, I focused on the purpose of the test, applied it to this example, and interpreted the results.

In this post, I’ll take a bit of a different approach. I’ll show you the nuts and bolts of how to calculate the expected values, chi-square value, and degrees of freedom. Then you’ll learn how to use the chi-squared distribution in conjunction with the degrees of freedom to calculate the p-value. [Read more…] about How the Chi-Squared Test of Independence Works
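
As a preview of those calculations, here is a minimal Python sketch. It assumes the usual formulas: each expected count is the row total times the column total divided by the grand total, and the degrees of freedom equal (rows − 1) × (columns − 1). The 2×2 table is hypothetical and is not the Star Trek data.

```python
# Hypothetical 2x2 table of observed counts (rows and columns are categories).
observed = [
    [10, 20],
    [30, 40],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for each cell if the two variables were independent.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)
print(expected, round(chi_square, 4), degrees_of_freedom)
```

Turning the statistic into a p-value requires the chi-squared distribution, which this sketch leaves out. A library function such as `scipy.stats.chi2.sf` can do that step.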

## Low Power Tests Exaggerate Effect Sizes

If your study has low statistical power, it will exaggerate the effect size. What?!

Statistical power is the ability of a hypothesis test to detect an effect that exists in the population. Clearly, a high-powered study is a good thing just for being able to identify these effects. Low power reduces your chances of discovering real findings. However, many analysts don’t realize that low power also inflates the effect size.

In this post, I show how this unexpected relationship between power and exaggerated effect sizes exists. I’ll also tie it to other issues, such as the bias of effects published in journals and other matters about statistical power. I think this post will be eye-opening and thought provoking! As always, I’ll use many graphs rather than equations. [Read more…] about Low Power Tests Exaggerate Effect Sizes

## Revisiting the Monty Hall Problem with Hypothesis Testing

The Monty Hall Problem is where Monty presents you with three doors, one of which contains a prize. He asks you to pick one door, which remains closed. Monty opens one of the other doors that does not have the prize. This process leaves two unopened doors—your original choice and one other. He allows you to switch from your initial choice to the other unopened door. Do you accept the offer?

If you accept his offer to switch doors, you’re twice as likely to win (66% versus 33%) as you are if you stay with your original choice.

Mind-blowing, right?

The solution to the Monty Hall Problem is tricky and counter-intuitive. It did trip up many experts back in the 1980s. However, the correct answer to the Monty Hall Problem is now well established using a variety of methods. It has been proven mathematically, with computer simulations, and with empirical experiments, including on television by both the Mythbusters (CONFIRMED!) and James May’s Man Lab. You won’t find any statisticians who disagree with the solution.

In this post, I’ll explore aspects of this problem that have arisen in discussions with some stubborn resisters to the notion that you can increase your chances of winning by switching!

The Monty Hall problem provides a fun way to explore issues that relate to hypothesis testing. I’ve got a lot of fun lined up for this post, including the following!

- Using a computer simulation to play the game 10,000 times.
- Assessing sampling distributions to compare the 66% hypothesis to another contender.
- Performing a power and sample size analysis to determine the number of times you need to play the Monty Hall game to get an answer.
- Conducting an experiment by playing the game repeatedly myself, recording the results, and using a proportions hypothesis test to draw conclusions! [Read more…] about Revisiting the Monty Hall Problem with Hypothesis Testing
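
To give a flavor of the simulation idea, here is a minimal Python sketch that plays the game 10,000 times for each strategy. It is my own illustration under the standard rules described above and is not the simulation from the full post. The function names and the seed are arbitrary.

```python
import random

def play(switch, rng):
    """Play one game and return True if the player wins the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the player's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the one remaining unopened door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize

def win_rate(switch, games=10_000, seed=1):
    """Proportion of games won using a fixed strategy."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(games))
    return wins / games

switch_rate = win_rate(switch=True)
stay_rate = win_rate(switch=False)
print(switch_rate, stay_rate)
```

With this many games, the switching strategy should win roughly two-thirds of the time and the staying strategy roughly one-third.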