Use dot plots to display the distribution of your sample data when you have continuous variables. These graphs stack dots along the horizontal X-axis to represent the frequencies of different values. More dots indicate greater frequency. Each dot represents a set number of observations. [Read more…] about Dot Plots: Using, Examples, and Interpreting

# distributions

## Chebyshev’s Theorem in Statistics

Chebyshev’s Theorem estimates the minimum proportion of observations that fall within a specified number of standard deviations from the mean. This theorem applies to a broad range of probability distributions. Chebyshev’s Theorem is also known as Chebyshev’s Inequality. [Read more…] about Chebyshev’s Theorem in Statistics
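The excerpt does not state the bound itself. In its standard form, at least 1 − 1/k² of observations fall within k standard deviations of the mean, for any k > 1. A minimal sketch follows; the function name is just an illustration.

```python
def chebyshev_min_proportion(k: float) -> float:
    """Minimum proportion of observations within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound is zero or negative, so it says nothing useful.
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k**2


for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_min_proportion(k):.1%} of observations")
```

Because the bound holds for such a broad range of distributions, it is usually far more conservative than what you would get from assuming a specific distribution such as the normal.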

## Coefficient of Variation in Statistics

The coefficient of variation (CV) is a relative measure of variability that indicates the size of a standard deviation in relation to its mean. It is a standardized, unitless measure that allows you to compare variability between disparate groups and characteristics. It is also known as the relative standard deviation (RSD).

In this post, you will learn what the coefficient of variation is, how to calculate it, when it is particularly useful, and when to avoid it. [Read more…] about Coefficient of Variation in Statistics
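As a rough sketch of the calculation, the CV is the standard deviation divided by the mean. The sample values below are made up for illustration, and the sketch uses the sample standard deviation.

```python
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless)."""
    mean = statistics.mean(values)
    if mean == 0:
        # The ratio is undefined when the mean is zero.
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Hypothetical measurements on very different scales.
small_scale = [10, 20, 30]
large_scale = [1000, 2000, 3000]
print(coefficient_of_variation(small_scale))
print(coefficient_of_variation(large_scale))
```

Both lists produce the same CV even though their standard deviations differ by a factor of 100, which is why the CV is handy for comparing groups measured on different scales.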

## How the Chi-Squared Test of Independence Works

Chi-squared tests of independence determine whether a relationship exists between two categorical variables. Do the values of one categorical variable depend on the value of the other categorical variable? If the two variables are independent, knowing the value of one variable provides no information about the value of the other variable.

I’ve previously written about Pearson’s chi-squared test of independence using a fun Star Trek example. Are the uniform colors related to the chances of dying? You can test the notion that the infamous red shirts have a higher likelihood of dying. In that post, I focused on the purpose of the test, applied it to this example, and interpreted the results.

In this post, I’ll take a bit of a different approach. I’ll show you the nuts and bolts of how to calculate the expected values, chi-square value, and degrees of freedom. Then you’ll learn how to use the chi-squared distribution in conjunction with the degrees of freedom to calculate the p-value. [Read more…] about How the Chi-Squared Test of Independence Works
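The calculation can be sketched in a few lines. The 2×2 table below is invented for illustration and is not the Star Trek data from the post.

```python
def chi_squared_independence(observed):
    """Return expected counts, the chi-squared statistic, and degrees of freedom."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    # Expected count for a cell = (row total * column total) / grand total.
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

    # Sum of (observed - expected)^2 / expected over all cells.
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, chi2, dof


# Hypothetical counts: rows are groups, columns are outcomes.
table = [[10, 20], [20, 10]]
expected, chi2, dof = chi_squared_independence(table)
print(expected, chi2, dof)
```

Turning the statistic into a p-value requires the chi-squared distribution with `dof` degrees of freedom. One option, assuming SciPy is installed, is `scipy.stats.chi2.sf(chi2, dof)`.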

## Low Power Tests Exaggerate Effect Sizes

If your study has low statistical power, it will exaggerate the effect size. What?!

Statistical power is the ability of a hypothesis test to detect an effect that exists in the population. Clearly, a high-powered study is a good thing just for being able to identify these effects. Low power reduces your chances of discovering real findings. However, many analysts don’t realize that low power also inflates the effect size.

In this post, I show how this unexpected relationship between power and exaggerated effect sizes arises. I’ll also tie it to other issues, such as the bias of effects published in journals and other matters about statistical power. I think this post will be eye-opening and thought-provoking! As always, I’ll use many graphs rather than equations. [Read more…] about Low Power Tests Exaggerate Effect Sizes
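One way to see the effect for yourself is a small simulation. This is only a sketch under assumptions of my choosing: a true effect of 0.2 standard deviations, 20 subjects per group, a known standard deviation of 1, and a two-sided z-test at the 0.05 level. None of these settings come from the post.

```python
import random
import statistics


def simulate_studies(true_effect=0.2, n_per_group=20, n_studies=5000, seed=1):
    """Simulate many two-group studies and compare all estimates to significant ones."""
    rng = random.Random(seed)
    std_error = (2 / n_per_group) ** 0.5  # SE of a difference in means when sigma = 1
    critical = 1.96 * std_error           # two-sided 0.05 cutoff for the difference
    all_estimates, significant = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        all_estimates.append(diff)
        if abs(diff) > critical:
            significant.append(diff)
    power = len(significant) / n_studies
    return statistics.fmean(all_estimates), statistics.fmean(significant), power


mean_all, mean_significant, power = simulate_studies()
print(f"Average estimate, all studies:         {mean_all:.2f}")
print(f"Average estimate, significant studies: {mean_significant:.2f}")
print(f"Estimated power:                       {power:.2f}")
```

Averaged over every simulated study, the estimate sits near the true effect. Averaged over only the studies that reach significance, it is much larger, because with low power only the overestimates clear the significance threshold.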

## Revisiting the Monty Hall Problem with Hypothesis Testing

The Monty Hall Problem is where Monty presents you with three doors, one of which contains a prize. He asks you to pick one door, which remains closed. Monty opens one of the other doors that does not have the prize. This process leaves two unopened doors—your original choice and one other. He allows you to switch from your initial choice to the other unopened door. Do you accept the offer?

If you accept his offer to switch doors, you’re twice as likely to win (66% versus 33%) as you are if you stay with your original choice.

Mind-blowing, right?

The solution to the Monty Hall Problem is tricky and counter-intuitive. It did trip up many experts back in the 1980s. However, the correct answer to the Monty Hall Problem is now well established using a variety of methods. It has been proven mathematically and confirmed with computer simulations and empirical experiments, including on television by both the Mythbusters (CONFIRMED!) and James May’s Man Lab. You won’t find any statisticians who disagree with the solution.

In this post, I’ll explore aspects of this problem that have arisen in discussions with some stubborn resisters to the notion that you can increase your chances of winning by switching!

The Monty Hall problem provides a fun way to explore issues that relate to hypothesis testing. I’ve got a lot of fun lined up for this post, including the following!

- Using a computer simulation to play the game 10,000 times.
- Assessing sampling distributions to compare the 66% hypothesis to another contender.
- Performing a power and sample size analysis to determine the number of times you need to play the Monty Hall game to get an answer.
- Conducting an experiment by playing the game repeatedly myself, recording the results, and using a proportions hypothesis test to draw conclusions! [Read more…] about Revisiting the Monty Hall Problem with Hypothesis Testing
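The first item in that list is easy to sketch. The simulation below is my own minimal version, not the code used in the post. It assumes Monty always opens a non-prize door that you did not pick, choosing at random when he has two options.

```python
import random


def play_once(switch, rng):
    """Play one game and return True if the player wins the prize."""
    doors = [0, 1, 2]
    prize = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the player's pick nor the prize.
    opened = rng.choice([d for d in doors if d != pick and d != prize])
    if switch:
        # Move to the only door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == prize


def win_rate(switch, n_games=10_000, seed=0):
    """Proportion of wins over n_games for a fixed strategy."""
    rng = random.Random(seed)
    return sum(play_once(switch, rng) for _ in range(n_games)) / n_games


print(f"Win rate when switching: {win_rate(switch=True):.3f}")
print(f"Win rate when staying:   {win_rate(switch=False):.3f}")
```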

## Percentiles: Interpretations and Calculations

Percentiles indicate the percentage of scores that fall below a particular value. They tell you where a score stands relative to other scores. For example, a person with an IQ of 120 is at the 91st percentile, which indicates that their IQ is higher than 91 percent of other scores.

Percentiles are a great tool to use when you need to know the relative standing of a value. Where does a value fall within a distribution of values? While the concept behind percentiles is straightforward, there are different mathematical methods for calculating them. In this post, learn about percentiles, special percentiles and their surprisingly flexible uses, and the various procedures for calculating them. [Read more…] about Percentiles: Interpretations and Calculations
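Here is a small sketch of the "percentage of scores below a value" definition. It also checks the IQ example, assuming IQ scores follow a normal distribution with a mean of 100 and a standard deviation of 15. That scaling is a common convention, but the excerpt does not state it.

```python
from statistics import NormalDist


def percentile_rank(scores, value):
    """Percentage of scores that fall strictly below the given value."""
    return 100 * sum(score < value for score in scores) / len(scores)


# Empirical version on a tiny made-up sample.
print(percentile_rank([85, 95, 100, 110, 120], 110))

# Theoretical version for the IQ example (assumed mean 100, SD 15).
iq = NormalDist(mu=100, sigma=15)
print(round(100 * iq.cdf(120)))
```

This uses only one definition of a percentile. Definitions that count values at or below the score, or that interpolate between ranks, can give slightly different answers on small samples.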

## Central Limit Theorem Explained

The central limit theorem in statistics states that, given a sufficiently large sample size, the sampling distribution of the mean for a variable will approximate a normal distribution regardless of that variable’s distribution in the population.

Unpacking the meaning from that complex definition can be difficult. That’s the topic for this post! I’ll walk you through the various aspects of the central limit theorem (CLT) definition, and show you why it is vital in statistics. [Read more…] about Central Limit Theorem Explained
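A quick simulation makes the definition more concrete. This sketch draws from an exponential distribution, which is strongly right-skewed, and measures how the skewness of the sample means shrinks as the sample size grows. The distribution, sample sizes, and helper names are my own choices for illustration.

```python
import random
import statistics


def sample_means(sample_size, n_samples=2000, seed=0):
    """Means of repeated samples drawn from a skewed (exponential) population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]


def skewness(values):
    """Simple moment-based skewness; about 0 for a symmetric distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


for n in (1, 5, 50):
    print(f"n = {n:>2}: skewness of sample means = {skewness(sample_means(n)):.2f}")
```

As the sample size increases, the skewness of the sampling distribution moves toward zero, which is what you would expect as it approaches a normal shape.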

## Introduction to Bootstrapping in Statistics with an Example

Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. This process allows you to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics. Bootstrap methods are alternative approaches to traditional hypothesis testing and are notable for being easier to understand and valid for more conditions.

In this blog post, I explain bootstrapping basics, compare bootstrapping to conventional statistical methods, and explain when it can be the better method. Additionally, I’ll work through an example using real data to create bootstrapped confidence intervals. [Read more…] about Introduction to Bootstrapping in Statistics with an Example
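To make the resampling idea concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the mean. The data are invented for illustration and are not the real dataset used in the post. The percentile method is only one of several bootstrap interval methods.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.fmean, n_resamples=5000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for a sample statistic."""
    rng = random.Random(seed)
    # Resample with replacement, keeping the original sample size each time.
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    alpha = 1 - confidence
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements.
data = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.8, 12.5, 9.5, 11.9]
print(bootstrap_ci(data))
```

Passing a different function as `stat`, such as `statistics.median`, gives an interval for that statistic without changing anything else.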

## Assessing Normality: Histograms vs. Normal Probability Plots

Because histograms display the shape and spread of distributions, you might think they’re the best type of graph for determining whether your data are normally distributed. However, I’ll show you how histograms can trick you! Normal probability plots are a better choice for this task and they are easy to use. Normal probability plots are also known as quantile-quantile plots, or Q-Q Plots for short!

[Read more…] about Assessing Normality: Histograms vs. Normal Probability Plots
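For readers curious about what goes into a normal probability plot, this sketch computes the points without drawing them. It pairs each sorted observation with the quantile expected from a fitted normal distribution. The plotting position (i + 0.5) / n is one common convention among several, and the data are made up.

```python
from statistics import NormalDist, fmean, stdev


def normal_qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    ordered = sorted(data)
    n = len(ordered)
    # Fit a normal distribution using the sample mean and standard deviation.
    fitted = NormalDist(fmean(ordered), stdev(ordered))
    return [
        (fitted.inv_cdf((i + 0.5) / n), observed)
        for i, observed in enumerate(ordered)
    ]


for theoretical, observed in normal_qq_points([4.1, 5.0, 5.2, 5.9, 6.3, 7.4, 9.8]):
    print(f"{theoretical:6.2f}  {observed:6.2f}")
```

If the data are approximately normal, the two columns track each other closely, so the points fall near a straight line when plotted.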

## Normal Distribution in Statistics

The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve.

The normal distribution is a probability function that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions. Extreme values in both tails of the distribution are similarly unlikely.

In this blog post, you’ll learn how to use the normal distribution, about its parameters, and how to calculate Z-scores to standardize your data and find probabilities. [Read more…] about Normal Distribution in Statistics
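As a brief sketch of those two steps, the code below standardizes a value into a Z-score and then uses the standard normal distribution to find probabilities. The mean of 100 and standard deviation of 15 are illustrative values of my own.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


standard_normal = NormalDist()  # mean 0, standard deviation 1

z = z_score(130, mean=100, sd=15)
print(f"Z-score: {z}")
print(f"P(value below 130): {standard_normal.cdf(z):.4f}")

# Probability of falling within one standard deviation of the mean.
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
print(f"P(within 1 SD): {within_one_sd:.4f}")
```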

## Understanding Probability Distributions

A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can assume. In other words, the values of the variable vary based on the underlying probability distribution.

Suppose you draw a random sample and measure the heights of the subjects. As you measure heights, you can create a distribution of heights. This type of distribution is useful when you need to know which outcomes are most likely, the spread of potential values, and the likelihood of different results.

In this blog post, you’ll learn about probability distributions for both discrete and continuous variables. I’ll show you how they work and give examples of how to use them. [Read more…] about Understanding Probability Distributions

## Measures of Variability: Range, Interquartile Range, Variance, and Standard Deviation

A measure of variability is a summary statistic that represents the amount of dispersion in a dataset. How spread out are the values? While a measure of central tendency describes the typical value, measures of variability define how far away the data points tend to fall from the center. We talk about variability in the context of a distribution of values. A low dispersion indicates that the data points tend to be clustered tightly around the center. High dispersion signifies that they tend to fall further away.

In statistics, variability, dispersion, and spread are synonyms that denote the width of the distribution. Just as there are multiple measures of central tendency, there are several measures of variability. In this blog post, you’ll learn why understanding the variability of your data is critical. Then, I explore the most common measures of variability—the range, interquartile range, variance, and standard deviation. I’ll help you determine which one is best for your data. [Read more…] about Measures of Variability: Range, Interquartile Range, Variance, and Standard Deviation
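The four measures can be sketched with Python's standard library. The dataset is invented. Note that `statistics.quantiles` uses one particular method for quartiles, so the interquartile range may differ slightly from other software.

```python
import statistics


def variability_summary(values):
    """Range, interquartile range, sample variance, and sample standard deviation."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.variance(values),
        "std_dev": statistics.stdev(values),
    }


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values
for name, value in variability_summary(data).items():
    print(f"{name}: {value:.3f}")
```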

## Measures of Central Tendency: Mean, Median, and Mode

A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value. In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of these measures calculates the location of the central point using a different method.

Choosing the best measure of central tendency depends on the type of data you have. In this post, I explore these measures of central tendency, show you how to calculate them, and explain how to determine which one is best for your data.

[Read more…] about Measures of Central Tendency: Mean, Median, and Mode
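A tiny sketch shows how the three measures can disagree. The values are made up and include one large value to illustrate why the choice of measure matters.

```python
import statistics

# Hypothetical data with one unusually large value.
data = [1, 2, 2, 3, 12]

print("mean:  ", statistics.mean(data))
print("median:", statistics.median(data))
print("mode:  ", statistics.mode(data))
```

The single large value pulls the mean upward, while the median and mode stay with the bulk of the data.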

## Maximize the Value of Your Binary Data with the Binomial and Other Probability Distributions

Binary data occur when you can place an observation into only two categories. They tell you that an event occurred or that an item has a particular characteristic. For instance, an inspection process produces binary pass/fail results. Or, when a customer enters a store, there are two possible outcomes: sale or no sale. In this post, I show you how to use the binomial, geometric, negative binomial, and hypergeometric distributions to glean more information from your binary data. [Read more…] about Maximize the Value of Your Binary Data with the Binomial and Other Probability Distributions
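As a sketch of the simplest case, the binomial distribution gives the probability of a specific number of events in a fixed number of independent trials. The 20% sale probability and 10 customers below are hypothetical numbers, not values from the post.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k events in n independent trials with event probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def prob_at_least(k, n, p):
    """Probability of k or more events in n trials."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))


# Hypothetical store: each of 10 customers buys with probability 0.2.
print(f"P(exactly 2 sales):  {binomial_pmf(2, 10, 0.2):.4f}")
print(f"P(at least 1 sale):  {prob_at_least(1, 10, 0.2):.4f}")
```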

## Flu Shots, How Effective Are They?

With the arrival of Fall in the Northern hemisphere, it’s flu season again.

Do you debate getting a flu shot every year? I do get flu shots every year. I realize that they’re not perfect, but I figure they’re a low-cost way to reduce my chances of a crummy week suffering from the flu.

The media report that flu shots have an effectiveness of approximately 68%. But what does that mean exactly? What is the absolute reduction in risk? Are there long-term benefits?

In this blog post, I explore the effectiveness of flu shots from a statistical viewpoint. We’ll statistically analyze the data ourselves to go beyond the simplified accounts that the media present. I’ll also model the long-term outcomes you can expect with regular flu vaccinations. By the time you finish this post, you’ll have a crystal clear picture of flu shot effectiveness. Some of the results surprised me! [Read more…] about Flu Shots, How Effective Are They?
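To illustrate the difference between relative and absolute risk reduction, here is a rough sketch. It assumes that "effectiveness" means the relative reduction in risk compared with being unvaccinated, which is the usual definition but is not spelled out in the excerpt. The 10% baseline risk is a made-up number, not a figure from the post.

```python
def vaccinated_risk(baseline_risk, effectiveness):
    """Risk for vaccinated people if effectiveness is a relative risk reduction."""
    return baseline_risk * (1 - effectiveness)


def absolute_risk_reduction(baseline_risk, effectiveness):
    """Difference in risk between unvaccinated and vaccinated people."""
    return baseline_risk - vaccinated_risk(baseline_risk, effectiveness)


baseline = 0.10        # hypothetical flu risk without a shot
effectiveness = 0.68   # figure quoted in the excerpt

print(f"Risk with a shot:        {vaccinated_risk(baseline, effectiveness):.3f}")
print(f"Absolute risk reduction: {absolute_risk_reduction(baseline, effectiveness):.3f}")
```

The absolute reduction depends heavily on the baseline risk, so the same relative effectiveness can translate into very different absolute benefits.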

## Goodness-of-Fit Tests for Discrete Distributions

Discrete probability distributions are based on discrete variables, which have a finite or countable number of values. In this post, I show you how to perform goodness-of-fit tests to determine how well your data fit various discrete probability distributions. [Read more…] about Goodness-of-Fit Tests for Discrete Distributions

## How to Identify the Distribution of Your Data

You’re probably familiar with data that follow the normal distribution. The normal distribution is that nice, familiar bell-shaped curve. Unfortunately, not all data are normally distributed or as intuitive to understand. You can picture the symmetric normal distribution, but what about the Weibull or Gamma distributions? This uncertainty might leave you feeling unsettled. In this post, I show you how to identify the probability distribution of your data. [Read more…] about How to Identify the Distribution of Your Data

## When is Easter this Year?

When is Easter this year? I ask this question every year! The next Easter occurs on April 17, 2022. And then, in the next year, Easter falls on April 9, 2023. I have a hard time remembering when it occurs in any given year. I think that March Easters are both early and unusual. Is that true?

Being a statistician, my first thought is to study the distribution of Easter dates. By analyzing the distribution, we can determine which dates are rare and which are common. How unusual are Easter dates in March? Are there patterns in the dates? [Read more…] about When is Easter this Year?
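If you want to generate Easter dates yourself, one option is the widely published "anonymous Gregorian" computus algorithm, sketched below. This is not necessarily how the dates in the post were obtained. The year range in the example is an arbitrary choice of mine.

```python
from datetime import date


def easter_date(year):
    """Gregorian (Western) Easter date via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day_index = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day_index + 1)


def march_share(start_year, end_year):
    """Proportion of Easters in the given year range that fall in March."""
    dates = [easter_date(y) for y in range(start_year, end_year + 1)]
    return sum(d.month == 3 for d in dates) / len(dates)


print(easter_date(2022), easter_date(2023))
print(f"Share of March Easters, 1900-2099: {march_share(1900, 2099):.1%}")
```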

## Statistics, Exoplanets, and the Search for Earthlike Planets

I love astronomy! The discovery of thousands of exoplanets has made it only more exciting. You often hear about the really weird planets in the news. You know, things like low-density puffballs, hot Jupiters, rogue planets, planets that orbit their star in hours, and even a Jupiter-mass planet that is one huge diamond! As neat as these discoveries are, I also want to know how Earth fits in. [Read more…] about Statistics, Exoplanets, and the Search for Earthlike Planets