The lognormal distribution is a continuous probability distribution that models right-skewed data. The shape of the lognormal distribution is comparable to the Weibull and loglogistic distributions. [Read more…] about Lognormal Distribution
In the United States, our Thanksgiving holiday is fast approaching. On this day, we give thanks for the good things in our lives.
For this post, I wanted to quantify how thankful we should be. Ideally, I’d quantify something truly meaningful, like happiness. Unfortunately, most countries are not like Bhutan, which measures the gross national happiness and incorporates it into their five-year development plans.
Instead, I’ll focus on something that is more concrete and regularly measured around the world—income. By examining income distributions, I’ll show that you have much to be thankful for, and so does most of the world! [Read more…] about A Statistical Thanksgiving: Global Income Distributions
The uniform distribution is a symmetric probability distribution where all outcomes have an equal likelihood of occurring. All values in the distribution have a constant probability. This distribution is also known as the rectangular distribution because of its shape in probability distribution plots, as I’ll show you below. [Read more…] about Uniform Distribution
A skewed distribution occurs when one tail is longer than the other. Skewness defines the asymmetry of a distribution. Unlike the familiar normal distribution with its bell-shaped curve, these distributions are asymmetric. The two halves of the distribution are not mirror images because the data are not distributed equally on both sides of the distribution’s peak. [Read more…] about Skewed Distribution
Heterogeneity is defined as a dissimilarity between elements that comprise a whole. When heterogeneity is present, there is diversity in the characteristic under study. The parts of the whole are different, not the same. It is an essential concept in science and statistics. Heterogeneous is the opposite of homogeneous. [Read more…] about Heterogeneity
The range of a data set is the difference between the maximum and the minimum values. It measures variability using the same units as the data. Larger values represent greater variability.
The range is the easiest measure of dispersion to calculate and interpret in statistics, but it has some limitations. In this post, I’ll show you how to find the range mathematically and graphically, interpret it, explain its limitations, and clarify when to use it. [Read more…] about Range of a Data Set
A relative frequency indicates how often a specific kind of event occurs within the total number of observations. It is a type of frequency that uses percentages, proportions, and fractions.
In this post, learn about relative frequencies, the relative frequency distribution, and its cumulative counterpart. [Read more…] about Relative Frequencies and Their Distributions
The empirical rule in statistics, also known as the 68-95-99.7 rule, states that for normal distributions, 68% of observed data points will lie inside one standard deviation of the mean, 95% will fall within two standard deviations, and 99.7% will occur within three standard deviations. [Read more…] about Empirical Rule: Definition, Formula, and Uses
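As a quick check of those percentages, the share of a normal distribution that lies within k standard deviations of the mean equals erf(k/√2). The sketch below is only an illustration that uses Python's standard library; the function and variable names are hypothetical.

```python
import math

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

# Evaluate the rule at one, two, and three standard deviations.
shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} SD: {share:.2%}")
```

The computed shares are roughly 68.3%, 95.4%, and 99.7%, which the rule rounds to 68, 95, and 99.7.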
The standard deviation (SD) is a single number that summarizes the variability in a dataset. It represents the typical distance between each data point and the mean. Smaller values indicate that the data points cluster closer to the mean—the values in the dataset are relatively consistent. Conversely, higher values signify that the values spread out further from the mean. Data values become more dissimilar, and extreme values become more likely. [Read more…] about Standard Deviation: Interpretations and Calculations
In statistics, the mean summarizes an entire dataset with a single number representing the data’s center point or typical value. It is also known as the arithmetic average, and it is one of several measures of central tendency. It is likely the measure of central tendency with which you’re most familiar! Learn how to calculate the mean, and when it is and is not a good statistic to use!
How Do You Find the Mean?
Finding the mean is very simple. Just add all the values and divide by the number of observations: mean = (sum of all values) / (number of observations).
For example, if the heights of five people are 48, 51, 52, 54, and 56 inches, their average height is 52.2 inches.
(48 + 51 + 52 + 54 + 56) / 5 = 261 / 5 = 52.2
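The same arithmetic can be sketched in a few lines of Python. This is a minimal illustration using the five heights from the example; the function name is my own choice.

```python
# Heights (in inches) from the example above.
heights = [48, 51, 52, 54, 56]

def mean(values):
    """Arithmetic mean: the sum of the values divided by how many there are."""
    return sum(values) / len(values)

average_height = mean(heights)
print(average_height)
```

Python's built-in statistics.mean function gives the same result, so you rarely need to write this yourself.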
When Do You Use the Mean?
Ideally, the mean indicates the region where most values in a distribution fall. Statisticians refer to it as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value. The histogram below illustrates the average accurately finding the center of the data’s distribution.
However, the mean does not always find the center of the data. It is sensitive to skewed data and extreme values. For example, when the data are skewed, it can miss the mark. In the histogram below, the average is outside the area with the most common values.
This problem occurs because outliers have a substantial impact on the mean. Extreme values in an extended tail pull it away from the center. As the distribution becomes more skewed, the average is drawn further away from the center.
In these cases, the mean can be misleading because it might not be near the most common values. Consequently, it’s best to use the average to measure the central tendency when you have a symmetric distribution.
For skewed distributions, it’s often better to use the median, which uses a different method to find the central location. Note that the mean provides no information about the variability present in a distribution. To evaluate that characteristic, assess the standard deviation.
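To see how a single extreme value pulls the mean away from the bulk of the data, here is a small sketch. The income figures are made up purely for illustration and do not come from any real dataset.

```python
import statistics

# Hypothetical right-skewed data: annual incomes in thousands of dollars.
incomes = [30, 32, 35, 38, 40, 42, 45, 400]

mean_income = statistics.mean(incomes)      # pulled upward by the 400
median_income = statistics.median(incomes)  # middle of the sorted values
spread = statistics.stdev(incomes)          # sample standard deviation

print(f"mean = {mean_income}, median = {median_income}, SD = {spread:.1f}")
```

In this made-up sample, the mean is larger than every value except the outlier, while the median stays among the typical values.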
Related post: Measures of Central Tendency: Mean, Median, and Mode
Using Sample Means to Estimate Population Means
In statistics, analysts often use a sample average to estimate a population mean. For small samples, the sample mean can differ greatly from the population mean. However, the law of large numbers states that as the sample size grows, the sample average is likely to be close to the population value.
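A short simulation can make this behavior concrete. The sketch below assumes a hypothetical normally distributed population with a mean of 50 and a standard deviation of 10; those values, the seed, and the sample sizes are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

POPULATION_MEAN = 50  # hypothetical population mean
POPULATION_SD = 10    # hypothetical population standard deviation

def sample_mean(n):
    """Draw n values from the simulated population and return their mean."""
    sample = [random.gauss(POPULATION_MEAN, POPULATION_SD) for _ in range(n)]
    return statistics.fmean(sample)

# Compare sample means across increasingly large samples.
results = {n: sample_mean(n) for n in (5, 50, 500, 50_000)}
for n, m in results.items():
    error = abs(m - POPULATION_MEAN)
    print(f"n = {n:>6}: sample mean = {m:.2f}, error = {error:.2f}")
```

Any single small sample can land close to the population mean by luck, but only the large samples are reliably close.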
Hypothesis tests, such as t-tests and ANOVA, use samples to determine whether population means are different. Statisticians refer to this process of using samples to estimate the properties of entire populations as inferential statistics.
Related post: Descriptive Statistics Vs. Inferential Statistics
In statistics, we usually use the arithmetic mean, which is the type I focus on in this post. However, there are other types of means, including the geometric mean. Read my post about the geometric mean to learn more.
The gamma distribution is a continuous probability distribution that models right-skewed data. Statisticians have used this distribution to model cancer rates, insurance claims, and rainfall. Additionally, the gamma distribution is similar to the exponential distribution, and you can use it to model the same types of phenomena: failure times, wait times, service times, etc. [Read more…] about Gamma Distribution
The exponential distribution is a right-skewed continuous probability distribution that models variables in which small values occur more frequently than higher values. Small values have relatively high probabilities, which consistently decline as data values increase. Statisticians use the exponential distribution to model the amount of change in people’s pockets, the length of phone calls, and sales totals for customers. In all these cases, small values are more likely than larger values. [Read more…] about Exponential Distribution
The Weibull distribution is a continuous probability distribution that can fit an extensive range of distribution shapes. Like the normal distribution, the Weibull distribution describes the probabilities associated with continuous data. However, unlike the normal distribution, it can also model skewed data. In fact, its extreme flexibility allows it to model both left- and right-skewed data. [Read more…] about Weibull Distribution
The Poisson distribution is a discrete probability distribution that describes probabilities for counts of events that occur in a specified observation space. It is named after Siméon Denis Poisson.
In statistics, count data represent the number of events or characteristics over a given length of time, area, volume, etc. For example, you can count the number of cigarettes smoked per day, meteors seen per hour, the number of defects in a batch, and the occurrence of a particular crime by county. [Read more…] about Poisson Distribution
Excel can calculate correlation coefficients and a variety of other statistical analyses. Even if you don’t use Excel regularly, this post is an excellent introduction to calculating and interpreting correlation.
In this post, I provide step-by-step instructions for having Excel calculate Pearson’s correlation coefficient, and I’ll show you how to interpret the results. Additionally, I include links to relevant statistical resources I’ve written that provide intuitive explanations. Together, we’ll analyze and interpret an example dataset! [Read more…] about Using Excel to Calculate Correlation
The standard error of the mean (SEM) is a bit mysterious. You’ll frequently find it in your statistical output. Is it a measure of variability? How does the standard error of the mean compare to the standard deviation? How do you interpret it?
In this post, I answer all these questions about the standard error of the mean, show how it relates to sample size considerations and statistical significance, and explain the general concept of other types of standard errors. In fact, I view standard errors as the doorway from descriptive statistics to inferential statistics. You’ll see how that works! [Read more…] about Standard Error of the Mean (SEM)
Autocorrelation is the correlation between two observations at different points in a time series. For example, values that are separated by an interval might have a strong positive or negative correlation. When these correlations are present, they indicate that past values influence the current value. Analysts use the autocorrelation and partial autocorrelation functions to understand the properties of time series data, fit the appropriate models, and make forecasts.
In this post, I cover both the autocorrelation function and partial autocorrelation function. You’ll learn about the differences between these functions and what they can tell you about your data. In later posts, I’ll show you how to incorporate this information in regression models of time series data and other time-series analyses.
Autocorrelation and Partial Autocorrelation Basics
Autocorrelation is the correlation between two values in a time series. In other words, the time series data correlate with themselves—hence, the name. We talk about these correlations using the term “lags.” Analysts record time-series data by measuring a characteristic at evenly spaced intervals—such as daily, monthly, or yearly. The number of intervals between the two observations is the lag. For example, the lag between the current and previous observation is one. If you go back one more interval, the lag is two, and so on.
In mathematical terms, the observations y_t and y_{t−k} are separated by k time units, where k is the lag. This lag can be days, quarters, or years depending on the nature of the data. When k = 1, you’re assessing adjacent observations. For each lag, there is a correlation.
The autocorrelation function (ACF) assesses the correlation between observations in a time series for a set of lags. The ACF for time series y is given by: Corr(y_t, y_{t−k}), k = 1, 2, ….
Analysts typically use graphs to display this function.
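Before graphing, it can help to see how one autocorrelation is calculated. The sketch below uses a common sample estimator: the lagged cross-products of deviations from the overall mean, divided by the total sum of squared deviations. Statistical packages may differ in small details, so treat this as an illustration rather than the exact method behind any particular plot.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag (lag >= 0)."""
    n = len(series)
    mean = sum(series) / n
    # Total variation around the mean (the denominator for every lag).
    denominator = sum((y - mean) ** 2 for y in series)
    # Co-variation between each observation and the one `lag` steps earlier.
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator

# A tiny made-up series that rises steadily.
example = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
acf_values = [autocorrelation(example, k) for k in range(0, 4)]
print(acf_values)
```

The lag 0 autocorrelation is always 1 because every observation correlates perfectly with itself.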
Autocorrelation Function (ACF)
Use the autocorrelation function (ACF) to identify which lags have significant correlations, understand the patterns and properties of the time series, and then use that information to model the time series data. From the ACF, you can assess the randomness and stationarity of a time series. You can also determine whether trends and seasonal patterns are present.
In an ACF plot, each bar represents the size and direction of the correlation. Bars that extend across the red line are statistically significant.
For random data, autocorrelations should be near zero for all lags. Analysts also refer to this condition as white noise. Non-random data have at least one significant lag. When the data are not random, it’s a good indication that you need to use a time series analysis or incorporate lags into a regression analysis to model the data appropriately.
This ACF plot indicates that these time series data are random.
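You can mimic this check on simulated white noise. The sketch below assumes the widely used large-sample approximation that puts the 95% significance limits at ±1.96/√n; the software that draws a given ACF plot may calculate its limits differently. The seed, series length, and number of lags are arbitrary.

```python
import math
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((y - mean) ** 2 for y in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator

random.seed(1)  # fixed seed so the run is repeatable
n = 400
white_noise = [random.gauss(0, 1) for _ in range(n)]

# Approximate 95% significance limit (an assumption, see the note above).
bound = 1.96 / math.sqrt(n)

acf = {k: autocorrelation(white_noise, k) for k in range(1, 21)}
significant = [k for k, r in acf.items() if abs(r) > bound]
print(f"limit = ±{bound:.3f}, lags beyond the limit: {significant}")
```

Even for truly random data, about one lag in twenty is expected to cross a 95% limit by chance, so an isolated significant bar is not strong evidence of non-randomness.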
Stationarity means that the time series has no trend, a constant variance, a constant autocorrelation pattern, and no seasonal pattern. The autocorrelation function declines to near zero rapidly for a stationary time series. In contrast, the ACF drops slowly for a non-stationary time series.
In this chart for a stationary time series, notice how the autocorrelations decline to non-significant levels quickly.
When trends are present in a time series, shorter lags typically have large positive correlations because observations closer in time tend to have similar values. The correlations taper off slowly as the lags increase.
In this ACF plot for metal sales, the autocorrelations decline slowly. The first five lags are significant.
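The slow decline is easy to reproduce with simulated data. This sketch builds a hypothetical upward-trending series (it is not the metal sales data) and prints a few autocorrelations; the slope, noise level, and seed are arbitrary.

```python
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((y - mean) ** 2 for y in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical series: a steady upward trend plus random noise.
trend_series = [0.5 * t + random.gauss(0, 3) for t in range(120)]

trend_acf = {k: autocorrelation(trend_series, k) for k in (1, 5, 10)}
for k, r in trend_acf.items():
    print(f"lag {k:>2}: {r:.2f}")
```

The correlations start high and shrink only gradually as the lag grows, which is the signature of a trend.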
When seasonal patterns are present, the autocorrelations are larger for lags at multiples of the seasonal frequency than for other lags.
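A simulated example shows the pattern. The sketch below assumes a hypothetical monthly series with a 12-period cycle plus noise; the amplitude, noise level, and seed are arbitrary choices.

```python
import math
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denominator = sum((y - mean) ** 2 for y in series)
    numerator = sum(
        (series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n)
    )
    return numerator / denominator

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical series: a 12-period seasonal cycle plus random noise.
seasonal_series = [
    math.sin(2 * math.pi * t / 12) + random.gauss(0, 0.3) for t in range(120)
]

seasonal_acf = {k: autocorrelation(seasonal_series, k) for k in (6, 9, 12, 24)}
for k, r in seasonal_acf.items():
    print(f"lag {k:>2}: {r:+.2f}")
```

Lags 12 and 24 line up with the cycle and show large positive correlations, lag 6 falls half a cycle away and is strongly negative, and lag 9 sits near zero.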
When a time series has both a trend and seasonality, the ACF plot displays a mixture of both effects. That’s the case in the autocorrelation function plot for the carbon dioxide (CO2) dataset from NIST. This dataset contains monthly mean CO2 measurements at the Mauna Loa Observatory. Download the CO2_Data.
Notice how you can see the wavy correlations for the seasonal pattern and the slowly diminishing correlations of a trend.
Partial Autocorrelation Function (PACF)
The partial autocorrelation function is similar to the ACF except that it displays only the correlation between two observations that the shorter lags between those observations do not explain. For example, the partial autocorrelation for lag 3 is only the correlation that lags 1 and 2 do not explain. In other words, the partial correlation for each lag is the unique correlation between those two observations after partialling out the intervening correlations.
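One standard way to obtain partial autocorrelations from ordinary autocorrelations is the Durbin-Levinson recursion. The sketch below is a minimal version of that recursion; I am not claiming it is what any particular statistical package uses, and the function name is hypothetical. The example feeds it the theoretical ACF of a first-order autoregressive process with a coefficient of 0.6, where the lag k autocorrelation is 0.6 raised to the power k.

```python
def pacf_from_acf(acf):
    """Partial autocorrelations for lags 1..len(acf)-1.

    acf[k] is the autocorrelation at lag k, with acf[0] equal to 1.
    """
    pacf = []
    prev = []  # coefficients of the order k-1 autoregression
    for k in range(1, len(acf)):
        if k == 1:
            phi_kk = acf[1]
            current = [phi_kk]
        else:
            # Remove the part of acf[k] that the shorter lags already explain.
            numerator = acf[k] - sum(prev[j] * acf[k - 1 - j] for j in range(k - 1))
            denominator = 1.0 - sum(prev[j] * acf[j + 1] for j in range(k - 1))
            phi_kk = numerator / denominator
            # Update the lower-order coefficients for the next step.
            current = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)]
            current.append(phi_kk)
        pacf.append(phi_kk)
        prev = current
    return pacf

# Theoretical ACF of a first-order autoregressive process (coefficient 0.6).
ar1_acf = [0.6 ** k for k in range(6)]
ar1_pacf = pacf_from_acf(ar1_acf)
print([round(p, 4) for p in ar1_pacf])
```

Although the autocorrelations of this process fade gradually, the partial autocorrelation at lag 1 equals 0.6 and all later lags are essentially zero, because lag 1 already explains the correlation at the longer lags.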
As you saw, the autocorrelation function helps assess the properties of a time series. In contrast, the partial autocorrelation function (PACF) is more useful during the specification process for an autoregressive model. Analysts use partial autocorrelation plots to specify regression models with time series data and autoregressive integrated moving average (ARIMA) models. I’ll focus on that aspect in posts about those methods.
Related post: Using Moving Averages to Smooth Time Series Data
For this post, I’ll show you a quick example of a PACF plot. Typically, you will use the ACF to determine whether an autoregressive model is appropriate. If it is, you then use the PACF to help you choose the model terms.
This partial autocorrelation plot displays data from the southern oscillations dataset from NIST. The southern oscillations refer to changes in the barometric pressure near Tahiti that predict El Niño. Download the southern_oscillations_data.
On the graph, the partial autocorrelations for lags 1 and 2 are statistically significant. The subsequent lags are nearly significant. Consequently, this PACF suggests fitting either a second- or third-order autoregressive model.
By assessing the autocorrelation and partial autocorrelation patterns in your data, you can understand the nature of your time series and model it!
Historians rank the U.S. Presidents from best to worst using all the historical knowledge at their disposal. Frequently, groups, such as C-Span, ask these historians to rank the Presidents and average the results together to help reduce bias. The idea is to produce a set of rankings that incorporates a broad range of historians, a vast array of information, and a historical perspective. These rankings include informed assessments of each President’s effectiveness, leadership, moral authority, administrative skills, economic management, vision, and so on. [Read more…] about Understanding Historians’ Rankings of U.S. Presidents using Regression Models
Spearman’s correlation in statistics is a nonparametric alternative to Pearson’s correlation. Use Spearman’s correlation for data that follow curvilinear, monotonic relationships and for ordinal data. Statisticians also refer to Spearman’s rank order correlation coefficient as Spearman’s ρ (rho).
In this post, I’ll cover what all that means so you know when and why you should use Spearman’s correlation instead of the more common Pearson’s correlation. [Read more…] about Spearman’s Correlation Explained
Exponential smoothing is a forecasting method for univariate time series data. This method produces forecasts that are weighted averages of past observations where the weights of older observations exponentially decrease. Forms of exponential smoothing extend the analysis to model data with trends and seasonal components. [Read more…] about Exponential Smoothing for Time Series Forecasting