Use scatterplots to show relationships between pairs of continuous variables. These graphs display symbols at the X, Y coordinates of the data points for the paired variables. Scatterplots are also known as scattergrams and scatter charts.
Use pie charts to compare the sizes of categories to the entire dataset. To create a pie chart, you must have a categorical variable that divides your data into groups. These graphs consist of a circle (i.e., the pie) with slices representing subgroups. The size of each slice is proportional to the relative size of each category out of the whole.
Use bar charts to compare categories when you have at least one categorical or discrete variable. Each bar represents a summary value for one discrete level, where longer bars indicate higher values. Types of summary values include counts, sums, means, and standard deviations. Bar charts are also known as bar graphs.
Use line charts to display a series of data points that are connected by lines. Analysts use line charts to emphasize changes in a metric on the vertical Y-axis by another variable on the horizontal X-axis. Often, the X-axis reflects time, but not always. Line charts are also known as line plots.
Use dot plots to display the distribution of your sample data when you have continuous variables. These graphs stack dots along the horizontal X-axis to represent the frequencies of different values. More dots indicate greater frequency. Each dot represents a set number of observations.
Use an empirical cumulative distribution function plot to display the data points in your sample from lowest to highest against their percentiles. These graphs require continuous variables and allow you to derive percentiles and other distribution properties. This function is also known as the empirical CDF or ECDF.
Excel can calculate correlation coefficients and a variety of other statistical analyses. Even if you don’t use Excel regularly, this post is an excellent introduction to calculating and interpreting correlation.
In this post, I provide step-by-step instructions for having Excel calculate Pearson’s correlation coefficient, and I’ll show you how to interpret the results. Additionally, I include links to relevant statistical resources I’ve written that provide intuitive explanations. Together, we’ll analyze and interpret an example dataset!
Autocorrelation is the correlation between two observations at different points in a time series. For example, values that are separated by an interval might have a strong positive or negative correlation. When these correlations are present, they indicate that past values influence the current value. Analysts use the autocorrelation and partial autocorrelation functions to understand the properties of time series data, fit the appropriate models, and make forecasts.
In this post, I cover both the autocorrelation function and partial autocorrelation function. You’ll learn about the differences between these functions and what they can tell you about your data. In later posts, I’ll show you how to incorporate this information in regression models of time series data and other time-series analyses.
Autocorrelation and Partial Autocorrelation Basics
Autocorrelation is the correlation between two values in a time series. In other words, the time series data correlate with themselves—hence, the name. We talk about these correlations using the term “lags.” Analysts record time-series data by measuring a characteristic at evenly spaced intervals—such as daily, monthly, or yearly. The number of intervals between the two observations is the lag. For example, the lag between the current and previous observation is one. If you go back one more interval, the lag is two, and so on.
In mathematical terms, the observations $y_t$ and $y_{t-k}$ are separated by $k$ time units, where $k$ is the lag. This lag can be days, quarters, or years depending on the nature of the data. When $k = 1$, you’re assessing adjacent observations. For each lag, there is a correlation.
The autocorrelation function (ACF) assesses the correlation between observations in a time series for a set of lags. The ACF for time series $y$ is given by: $\mathrm{Corr}(y_t, y_{t-k})$, $k = 1, 2, \dots$
Analysts typically use graphs to display this function.
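To make the definition concrete, here is a minimal R sketch that computes the sample autocorrelation for the first few lags by hand and compares it with R's built-in acf function. The simulated AR(1) series and the helper name sample_acf are assumptions for illustration only; they are not part of the original analysis.

```r
# Sample autocorrelation at lag k: the lag-k autocovariance divided by
# the lag-0 autocovariance (the common 1/n factors cancel).
sample_acf <- function(y, k) {
  n <- length(y)
  y_centered <- y - mean(y)
  sum(y_centered[(k + 1):n] * y_centered[1:(n - k)]) / sum(y_centered^2)
}

set.seed(1)
# Hypothetical example series: an AR(1) process with coefficient 0.7.
y <- as.numeric(arima.sim(model = list(ar = 0.7), n = 200))

manual <- sapply(1:5, function(k) sample_acf(y, k))
builtin <- acf(y, lag.max = 5, plot = FALSE)$acf[2:6]  # element 1 is lag 0

print(round(cbind(lag = 1:5, manual, builtin), 3))
```

The two columns should agree, because acf uses the same estimator. Calling acf(y) without plot = FALSE draws the usual bar chart.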
Autocorrelation Function (ACF)
Use the autocorrelation function (ACF) to identify which lags have significant correlations, understand the patterns and properties of the time series, and then use that information to model the time series data. From the ACF, you can assess the randomness and stationarity of a time series. You can also determine whether trends and seasonal patterns are present.
In an ACF plot, each bar represents the size and direction of the correlation. Bars that extend across the red line are statistically significant.
For random data, autocorrelations should be near zero for all lags. Analysts also refer to this condition as white noise. Non-random data have at least one significant lag. When the data are not random, it’s a good indication that you need to use a time series analysis or incorporate lags into a regression analysis to model the data appropriately.
This ACF plot indicates that these time series data are random.
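As a rough sketch of how the significance limits work, the R code below simulates white noise and counts how many lags fall outside approximate 95% bounds of $\pm 1.96/\sqrt{n}$. The simulated data are an assumption, and other software may draw its limits slightly differently from this approximation.

```r
set.seed(42)
n <- 500
white_noise <- rnorm(n)  # independent draws, so no true autocorrelation

r <- acf(white_noise, lag.max = 20, plot = FALSE)$acf[-1]  # drop lag 0

# Approximate 95% bounds for a single lag under the white-noise hypothesis.
bound <- qnorm(0.975) / sqrt(n)
n_outside <- sum(abs(r) > bound)
cat("bound:", round(bound, 3), " lags outside bounds:", n_outside, "of 20\n")

# Ljung-Box test of the joint hypothesis that lags 1 to 20 are all zero.
print(Box.test(white_noise, lag = 20, type = "Ljung-Box"))
```

Because each lag is checked at roughly the 5% level, about one lag in twenty can cross the bound by chance even for truly random data, which is why a joint test such as Ljung-Box is a useful companion to the plot.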
Stationarity means that the time series has no trend, a constant variance, a constant autocorrelation pattern, and no seasonal pattern. The autocorrelation function declines to near zero rapidly for a stationary time series. In contrast, the ACF drops slowly for a non-stationary time series.
In this chart for a stationary time series, notice how the autocorrelations decline to non-significant levels quickly.
When trends are present in a time series, shorter lags typically have large positive correlations because observations closer in time tend to have similar values. The correlations taper off slowly as the lags increase.
In this ACF plot for metal sales, the autocorrelations decline slowly. The first five lags are significant.
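The contrast between a stationary series and a trending one can be sketched in R with simulated data. Both series below are hypothetical stand-ins, not the metal sales data.

```r
set.seed(7)
n <- 200
stationary <- rnorm(n)                    # no trend, constant variance
trending <- 0.5 * seq_len(n) + rnorm(n)   # linear trend plus noise

acf_stationary <- acf(stationary, lag.max = 20, plot = FALSE)$acf[-1]
acf_trending <- acf(trending, lag.max = 20, plot = FALSE)$acf[-1]

lags <- c(1, 5, 10, 20)
print(round(data.frame(lag = lags,
                       stationary = acf_stationary[lags],
                       trending = acf_trending[lags]), 2))
```

The stationary column should hover near zero at every lag, while the trending column starts near one and tapers off slowly.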
When seasonal patterns are present, the autocorrelations are larger for lags at multiples of the seasonal frequency than for other lags.
When a time series has both a trend and seasonality, the ACF plot displays a mixture of both effects. That’s the case in the autocorrelation function plot for the carbon dioxide (CO2) dataset from NIST. This dataset contains monthly mean CO2 measurements at the Mauna Loa Observatory. Download the CO2_Data.
Notice how you can see the wavy correlations for the seasonal pattern and the slowly diminishing correlations of a trend.
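A similar picture can be reproduced in R with the co2 dataset that ships with base R, which also holds monthly Mauna Loa CO2 readings. Treat it as a stand-in: it is not necessarily identical to the NIST file used for the plot above. Differencing is added here as an assumption, to show the seasonal lags more clearly once the trend is mostly removed.

```r
# `co2` ships with base R (package datasets): monthly Mauna Loa CO2, 1959-1997.
y <- as.numeric(co2)  # drop ts attributes so lags are counted in months

acf_raw <- acf(y, lag.max = 36, plot = FALSE)$acf[-1]
# First differences remove most of the trend and leave the yearly cycle.
acf_diff <- acf(diff(y), lag.max = 36, plot = FALSE)$acf[-1]

lags <- c(6, 12, 18, 24)
print(round(data.frame(lag = lags,
                       raw = acf_raw[lags],
                       differenced = acf_diff[lags]), 2))
```

In the differenced series, lags at multiples of 12 months stand out against the half-year lags, which is the seasonal signature described above.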
Partial Autocorrelation Function (PACF)
The partial autocorrelation function is similar to the ACF except that it displays only the correlation between two observations that the shorter lags between those observations do not explain. For example, the partial autocorrelation for lag 3 is only the correlation that lags 1 and 2 do not explain. In other words, the partial correlation for each lag is the unique correlation between those two observations after partialling out the intervening correlations.
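One way to see what partialling out means is to regress each observation on its two previous values: the coefficient on the lag 2 value is, approximately, the partial autocorrelation at lag 2. The sketch below uses a simulated AR(2) series, which is an assumption for illustration. The two estimates differ slightly because pacf works from the sample autocorrelations rather than from least squares.

```r
set.seed(3)
# Hypothetical AR(2) series with coefficients 0.5 and 0.3.
y <- as.numeric(arima.sim(model = list(ar = c(0.5, 0.3)), n = 1000))

# embed() lines up y_t, y_{t-1}, y_{t-2} in columns 1, 2, 3.
lagged <- embed(y, 3)
fit <- lm(lagged[, 1] ~ lagged[, 2] + lagged[, 3])

coef_lag2 <- unname(coef(fit)[3])   # effect of y_{t-2} after accounting for y_{t-1}
pacf_lag2 <- pacf(y, lag.max = 2, plot = FALSE)$acf[2]

cat("regression coefficient:", round(coef_lag2, 3),
    " PACF at lag 2:", round(pacf_lag2, 3), "\n")
```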
As you saw, the autocorrelation function helps assess the properties of a time series. In contrast, the partial autocorrelation function (PACF) is more useful during the specification process for an autoregressive model. Analysts use partial autocorrelation plots to specify regression models with time series data and Auto Regressive Integrated Moving Average (ARIMA) models. I’ll focus on that aspect in posts about those methods.
Related post: Using Moving Averages to Smooth Time Series Data
For this post, I’ll show you a quick example of a PACF plot. Typically, you will use the ACF to determine whether an autoregressive model is appropriate. If it is, you then use the PACF to help you choose the model terms.
This partial autocorrelation plot displays data from the southern oscillations dataset from NIST. The southern oscillations refer to changes in the barometric pressure near Tahiti that predict El Niño. Download the southern_oscillations_data.
On the graph, the partial autocorrelations for lags 1 and 2 are statistically significant. The subsequent lags are nearly significant. Consequently, this PACF suggests fitting either a second or third-order autoregressive model.
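The same reading of a PACF can be sketched in R. The example below simulates an AR(2) series rather than using the southern oscillations data, flags the lags whose partial autocorrelations exceed approximate 95% bounds, and compares that with the order that ar selects by AIC. The simulated series and the $\pm 1.96/\sqrt{n}$ bounds are assumptions.

```r
set.seed(11)
n <- 1000
# Hypothetical AR(2) series with coefficients 0.5 and 0.3.
y <- as.numeric(arima.sim(model = list(ar = c(0.5, 0.3)), n = n))

partial <- pacf(y, lag.max = 10, plot = FALSE)$acf[, 1, 1]  # lags 1 to 10
bound <- qnorm(0.975) / sqrt(n)
significant_lags <- which(abs(partial) > bound)
print(significant_lags)

# ar() picks an order by AIC; compare its choice with the PACF reading.
fit <- ar(y, order.max = 10)
cat("order chosen by AIC:", fit$order, "\n")
```

For a true AR(2) process, lags 1 and 2 should be flagged, and any later lags that cross the bound are usually chance crossings.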
By assessing the autocorrelation and partial autocorrelation patterns in your data, you can understand the nature of your time series and model it!
Combinations in probability theory and other areas of mathematics refer to a sequence of outcomes where the order does not matter. For example, when you’re ordering a pizza, it doesn’t matter whether you order it with ham, mushrooms, and olives or olives, mushrooms, and ham. You’re getting the same pizza!
Permutations in probability theory and other branches of mathematics refer to sequences of outcomes where the order matters. For example, 9-6-8-4 is a permutation of a four-digit PIN because the order of numbers is crucial. When calculating probabilities, it’s frequently necessary to calculate the number of possible permutations to determine an event’s probability.
In this post, I explain permutations and show how to calculate the number of permutations both with repetition and without repetition. Finally, we’ll work through a step-by-step example problem that uses permutations to calculate a probability.
Historians rank the U.S. Presidents from best to worst using all the historical knowledge at their disposal. Frequently, groups, such as C-Span, ask these historians to rank the Presidents and average the results together to help reduce bias. The idea is to produce a set of rankings that incorporates a broad range of historians, a vast array of information, and a historical perspective. These rankings include informed assessments of each President’s effectiveness, leadership, moral authority, administrative skills, economic management, vision, and so on.
Spearman’s correlation in statistics is a nonparametric alternative to Pearson’s correlation. Use Spearman’s correlation for data that follow curvilinear, monotonic relationships and for ordinal data. Statisticians also refer to Spearman’s rank order correlation coefficient as Spearman’s ρ (rho).
In this post, I’ll cover what all that means so you know when and why you should use Spearman’s correlation instead of the more common Pearson’s correlation.
The multiplication rule in probability allows you to calculate the probability of multiple events occurring together using known probabilities of those events individually. There are two forms of this rule, the specific and general multiplication rules.
In this post, learn about when and how to use both the specific and general multiplication rules. Additionally, I’ll use and explain the standard notation for probabilities throughout, helping you learn how to interpret it. We’ll work through several example problems so you can see them in action. There’s even a bonus problem at the end!
Exponential smoothing is a forecasting method for univariate time series data. This method produces forecasts that are weighted averages of past observations where the weights of older observations exponentially decrease. Forms of exponential smoothing extend the analysis to model data with trends and seasonal components.
Descriptive statistics summarize your dataset, painting a picture of its properties. These properties include various central tendency and variability measures, distribution properties, outlier detection, and other information. Unlike inferential statistics, descriptive statistics only describe your dataset’s characteristics and do not attempt to generalize from a sample to a population.
Contingency tables are a great way to classify outcomes and calculate different types of probabilities. These tables contain rows and columns that display bivariate frequencies of categorical data. Analysts also refer to contingency tables as crosstabulation (cross tabs), two-way tables, and frequency tables.
Statisticians use contingency tables for a variety of reasons. I love these tables because they both organize your data and allow you to answer a diverse set of questions. In this post, I focus on using them to calculate different types of probabilities. These probabilities include joint, marginal, and conditional probabilities.
Excel can perform various statistical analyses, including regression analysis. It is a great option because nearly everyone can access Excel. This post is an excellent introduction to performing and interpreting regression analysis, even if Excel isn’t your primary statistical software package.
When comparing groups in your data, you can have either independent or dependent samples. The type of samples in your design impacts sample size requirements, statistical power, the proper analysis, and even your study’s costs. Understanding the implications of each type of sample can help you design a better study.
Moving averages can smooth time series data, reveal underlying trends, and identify components for use in statistical modeling. Smoothing is the process of removing random variations that appear as coarseness in a plot of raw time series data. It reduces the noise to emphasize the signal that can contain trends and cycles. Analysts also refer to the smoothing process as filtering the data.
Note: this is a guest post by Alexander Moreno, a Computer Science PhD student at the Georgia Institute of Technology. He blogs at www.boostedml.com
Survival analysis is an important subfield of statistics and biostatistics. These methods involve modeling the time to a first event such as death. In this post we give a brief tour of survival analysis. We first describe the motivation for survival analysis, and then describe the hazard and survival functions. We follow this with non-parametric estimation via the Kaplan Meier estimator. Then we describe Cox’s proportional hazard model and after that Aalen’s additive model. Finally, we conclude with a brief discussion.
Why Survival Analysis: Right Censoring
Modeling first event times is important in many applications. This could be time to death for severe health conditions or time to failure of a mechanical system. If one always observed the event time and it was guaranteed to occur, one could model the distribution directly. For instance, in the non-parametric setting, one could use the empirical cumulative distribution function to estimate the probability of death by some time. In the parametric setting one could do non-negative regression.
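For the fully observed case, the empirical CDF is a one-liner in base R. The death times below are made up for illustration.

```r
# Hypothetical fully observed death times (in months); no censoring.
death_times <- c(2, 4, 4, 7, 9, 12)

F_hat <- ecdf(death_times)   # empirical CDF as a step function
p_dead_by_5 <- F_hat(5)      # share of deaths at or before month 5

cat("P(T <= 5) estimate:", p_dead_by_5,
    " survival estimate:", 1 - p_dead_by_5, "\n")
```

Three of the six hypothetical participants have died by month 5, so the estimate is 0.5. This simple approach is what breaks down once some event times are censored.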
However, in some cases one might not observe the event time: this is generally called right censoring. In clinical trials with death as the event, this occurs when one of the following happens: 1) participants drop out of the study; 2) the study reaches a pre-determined end time, and some participants have survived until the end; 3) the study ends when a certain number of participants have died. In each case, after the surviving participants have left the study, we don’t know what happens to them. We then have the question:
- How can we model the empirical distribution or do non-negative regression when for some individuals, we only observe a lower bound on their event time?
The above figure illustrates right censoring. For participant 1 we see when they died. Participant 2 dropped out, and we know that they survived until then, but don’t know what happened afterwards. For participant 3, we know that they survived until the pre-determined study end, but again don’t know what happened afterwards.
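In R, right-censored data are usually stored as a follow-up time plus an indicator of whether the event was observed. The sketch below encodes the three participants described above with made-up times, using Surv from the survival package.

```r
library(survival)

# Hypothetical follow-up times for the three participants described above.
follow_up <- c(4, 6, 10)   # time of death, time of dropout, study end
died <- c(1, 0, 0)         # 1 = death observed, 0 = right censored

s <- Surv(time = follow_up, event = died)
print(s)  # censored times are printed with a trailing "+"
```

For the two censored participants, the recorded time is only a lower bound on the true event time.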
The Survival Function and the Hazard
Two of the key tools in survival analysis are the survival function and the hazard. The survival function describes the probability of the event not having happened by a time $t$. The hazard describes the instantaneous rate of the first event at any time $t$.
More formally, let $T$ be the event time of interest, such as the death time. Then the survival function is $S(t) = P(T > t)$. We can also note that this is related to the cumulative distribution function $F(t) = P(T \le t)$ via $S(t) = 1 - F(t)$.
For the hazard, the probability of the first event time being in the small interval $[t, t + dt)$, given survival up to $t$, is $P(T \in [t, t + dt) \mid T \ge t) \approx \lambda(t)\,dt$. This is illustrated in the following figure.
Rearranging terms and taking limits we obtain
$$\lambda(t) = \lim_{dt \to 0} \frac{P(T \in [t, t + dt) \mid T \ge t)}{dt} = \frac{f(t)}{S(t)},$$
where $f(t)$ is the density function of $T$ and the second equality follows from applying Bayes’ theorem. By rearranging again and solving a differential equation, we can use the hazard to compute the survival function via
$$S(t) = \exp\left(-\int_0^t \lambda(s)\,ds\right).$$
The key question then is how to estimate the hazard and/or survival function.
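The link between the hazard and the survival function can be checked numerically. The sketch below assumes a Weibull event time (an arbitrary choice for illustration), builds the hazard as density divided by survival, integrates it, and compares the exponential of the negative cumulative hazard with the survival function computed directly.

```r
shape <- 1.5
scale <- 2

# Hazard = density / survival function.
hazard <- function(t) {
  dweibull(t, shape, scale) / pweibull(t, shape, scale, lower.tail = FALSE)
}

t_grid <- c(0.5, 1, 2, 4)
cumulative_hazard <- sapply(
  t_grid,
  function(t) integrate(hazard, 0, t, rel.tol = 1e-8)$value
)
survival_from_hazard <- exp(-cumulative_hazard)
survival_direct <- pweibull(t_grid, shape, scale, lower.tail = FALSE)

print(round(cbind(t = t_grid, survival_from_hazard, survival_direct), 4))
```

The two survival columns should match up to numerical integration error.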
Non-Parametric Estimation with Kaplan Meier
In non-parametric survival analysis, we want to estimate the survival function without covariates, and with censoring. If we didn’t have censoring, we could start with the empirical CDF $\hat{F}(t) = \frac{1}{n} \sum_{i=1}^{n} 1(T_i \le t)$. This equation is a succinct representation of: what fraction of people have died by time $t$? The survival function would then be: what fraction of people are still alive? However, we can’t answer this question as posed when some people are censored by time $t$.
While we don’t necessarily know how many people have survived by an arbitrary time $t$, we do know how many people in the study are still at risk. We can use this instead. Partition the study time into $0 < t_1 < t_2 < \dots < t_n$, where each $t_i$ is either an event time or a censoring time for a participant. Assume that participants can only lapse at observed event times. Let $Y(t_i)$ be the number of people at risk just before time $t_i$. Assuming no one dies at exactly the same time (no ties), we can look at each time someone died. We say that the probability of dying at that specific time is $1/Y(t_i)$, and say that the probability of dying at any other time is $0$. We can then say that the probability of surviving at any event time $t_i$, given survival at previous candidate event times, is $1 - 1/Y(t_i)$. The probability of surviving up to a time $t$ is then
$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{1}{Y(t_i)}\right),$$
where the product runs over the death times $t_i \le t$.
We call this [1] the Kaplan Meier estimator. Under mild assumptions, including that participants have independent and identically distributed event times and that censoring and event times are independent, this gives an estimator that is consistent. The next figure gives an example of the Kaplan Meier estimator for a simple case.
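To make the product formula concrete, the sketch below computes the Kaplan Meier estimate by hand for a tiny made-up sample with no ties and compares it with survfit from the survival package. At each death time the running product is multiplied by one minus one over the number at risk; censoring times leave it unchanged.

```r
library(survival)

# Hypothetical sample with no tied times; status 1 = death, 0 = censored.
time <- c(2, 3, 5, 7, 11)
status <- c(1, 0, 1, 1, 0)

ord <- order(time)
at_risk <- length(time) - seq_along(time) + 1     # number at risk just before each time
km_manual <- cumprod(1 - status[ord] / at_risk)   # censoring contributes a factor of 1

km_survfit <- survfit(Surv(time, status) ~ 1)$surv
print(round(cbind(time = time[ord], at_risk, km_manual, km_survfit), 4))
```

For example, at time 5 only three people remain at risk, so the estimate drops from 0.8 to 0.8 × (1 − 1/3) ≈ 0.533.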
Kaplan Meier R Example
In R we can use the Surv and survfit functions from the survival package to fit a Kaplan Meier model. We can also use ggsurvplot from the survminer package to make plots. Here we will use the ovarian cancer dataset from the survival package. We will stratify based on treatment group assignment.
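    library(survminer)
    library(survival)

    # Right-censored response: follow-up time and death indicator
    kaplan_meier <- Surv(time = ovarian[['futime']], event = ovarian[['fustat']])

    # One Kaplan Meier curve per treatment group (rx)
    kaplan_meier_treatment <- survfit(kaplan_meier ~ rx, data = ovarian,
                                      type = 'kaplan-meier', conf.type = 'log')

    # Plot the curves with confidence bands
    ggsurvplot(kaplan_meier_treatment, data = ovarian, conf.int = TRUE)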
Semi-Parametric Regression with Cox’s Proportional Hazards Model
Kaplan Meier makes sense when we don’t have covariates, but often we want to model how some covariates affect death risk. For instance, how does one’s weight affect death risk? One way to do this is to assume that covariates have a multiplicative effect on the hazard. This leads us to Cox’s proportional hazard model, which involves the following functional form for the hazard:
$$\lambda(t \mid x) = \lambda_0(t) \exp(x^\top \beta).$$
The baseline hazard $\lambda_0(t)$ describes how the risk of a reference person (one whose covariates are all zero) evolves over time. The relative risk $\exp(x^\top \beta)$ describes how covariates affect the hazard. In particular, a unit increase in $x_j$ multiplies the hazard by a factor of $\exp(\beta_j)$.
Because of the non-parametric nuisance term $\lambda_0(t)$, it is difficult to maximize the full likelihood for $\beta$ directly. Cox’s insight [2] was that the assignment probabilities given the death times contain most of the information about $\beta$, and the remaining terms contain most of the information about $\lambda_0(t)$. The assignment probabilities give the following partial likelihood
$$L(\beta) = \prod_{i:\, t_i \text{ is a death time}} \frac{\exp(x_i^\top \beta)}{\sum_{j \in R(t_i)} \exp(x_j^\top \beta)},$$
where $R(t_i)$ is the set of participants still at risk just before $t_i$.
We can then maximize this to get an estimator of $\beta$. In [3,4] they show that this estimator is consistent and asymptotically normal.
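As an illustration of the partial likelihood, the sketch below writes out its logarithm for a single covariate (age in the ovarian dataset from the survival package), maximizes it with a one-dimensional optimizer, and compares the result with coxph. For each death, the term is that person's risk score minus the log of the summed exponentiated risk scores of everyone still at risk. The function name log_partial_likelihood is hypothetical, and the code assumes no tied death times, which holds for this dataset.

```r
library(survival)

time <- ovarian$futime
died <- ovarian$fustat
age <- ovarian$age

# Log partial likelihood for a single covariate, assuming no tied death times.
log_partial_likelihood <- function(beta) {
  risk_score <- age * beta
  death_idx <- which(died == 1)
  sum(sapply(death_idx, function(i) {
    at_risk <- time >= time[i]   # risk set at the i-th death time
    risk_score[i] - log(sum(exp(risk_score[at_risk])))
  }))
}

beta_manual <- optimize(log_partial_likelihood, interval = c(-1, 1),
                        maximum = TRUE, tol = 1e-8)$maximum
beta_coxph <- unname(coef(coxph(Surv(futime, fustat) ~ age, data = ovarian)))

cat("manual:", round(beta_manual, 4), " coxph:", round(beta_coxph, 4), "\n")
```

Note that the baseline hazard never appears in the function: it cancels out of each ratio, which is what makes the partial likelihood practical.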
Cox Proportional Hazards R Example
In R, we can use the Surv and coxph functions from the survival package. For the ovarian cancer dataset, the Kaplan Meier example suggests that the treatment effect is not proportional. Under a proportional hazards assumption, the curves would have the same pattern but diverge. However, instead they move apart and then move back together. Further, treatment does seem to lead to different survival patterns over shorter time horizons. We should not use it as a covariate, but we can stratify based on it. In R we can regress on age and ECOG performance status (ecog.ps).
    cox_fit <- coxph(Surv(futime, fustat) ~ age + ecog.ps + strata(rx), data = ovarian)
    summary(cox_fit)

which gives the following results

    Call:
    coxph(formula = Surv(futime, fustat) ~ age + ecog.ps + strata(rx),
        data = ovarian)

      n= 26, number of events= 12

                coef exp(coef) se(coef)      z Pr(>|z|)
    age      0.13853   1.14858  0.04801  2.885  0.00391 **
    ecog.ps -0.09670   0.90783  0.62994 -0.154  0.87800
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            exp(coef) exp(-coef) lower .95 upper .95
    age        1.1486     0.8706    1.0454     1.262
    ecog.ps    0.9078     1.1015    0.2641     3.120

    Concordance= 0.819  (se = 0.058 )
    Likelihood ratio test= 12.71  on 2 df,   p=0.002
    Wald test            = 8.43  on 2 df,   p=0.01
    Score (logrank) test = 12.24  on 2 df,   p=0.002
This suggests that age has a significant multiplicative effect on the hazard of death, and that a one-year increase in age increases instantaneous risk by a factor of about 1.15.
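The arithmetic behind this interpretation is short enough to reproduce directly from the reported coefficient and standard error. The ten-year comparison is an added illustration, not part of the output above.

```r
coef_age <- 0.13853   # coefficient for age from the output above
se_age <- 0.04801     # its standard error

hr_one_year <- exp(coef_age)
hr_ten_years <- exp(10 * coef_age)   # effects multiply, so exponents add
ci_one_year <- exp(coef_age + c(-1, 1) * qnorm(0.975) * se_age)

cat("HR per year:", round(hr_one_year, 4),
    " HR per decade:", round(hr_ten_years, 2),
    " 95% CI per year:", round(ci_one_year, 4), "\n")
```

The per-year hazard ratio and its interval match the exp(coef), lower .95, and upper .95 columns. Under this model a ten-year age difference corresponds to a hazard ratio close to 4.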
Aalen’s Additive Model
Cox regression makes two strong assumptions: 1) that covariate effects are constant over time; 2) that effects are multiplicative. Aalen’s additive model [5] relaxes the first, and replaces the second with the assumption that effects are additive. Here the hazard takes the form
$$\lambda(t \mid x) = \beta_0(t) + \beta_1(t) x_1 + \dots + \beta_p(t) x_p.$$
As this is a linear model, we can estimate the cumulative regression functions $B_j(t) = \int_0^t \beta_j(s)\,ds$ using a least squares type procedure.
Aalen’s Additive Model R Example
In R we can use the timereg package and the aalen function to estimate cumulative regression functions, which we can also plot.
    library(timereg)
    data(sTRACE)

    # Fits Aalen model
    out <- aalen(Surv(time, status == 9) ~ age + sex + diabetes + chf + vf,
                 sTRACE, max.time = 7, n.sim = 100)
    summary(out)

    # One panel per cumulative regression function
    par(mfrow = c(2, 3))
    plot(out)
This gives us
    Additive Aalen Model

    Test for nonparametric terms

    Test for non-significant effects
                Supremum-test of significance p-value H_0: B(t)=0
    (Intercept)                          7.29                0.00
    age                                  8.63                0.00
    sex                                  2.95                0.01
    diabetes                             2.31                0.24
    chf                                  5.30                0.00
    vf                                   2.95                0.03

    Test for time invariant effects
                Kolmogorov-Smirnov test p-value H_0:constant effect
    (Intercept)                 0.57700                        0.00
    age                         0.00866                        0.00
    sex                         0.11900                        0.18
    diabetes                    0.16200                        0.43
    chf                         0.12900                        0.06
    vf                          0.43500                        0.02

                Cramer von Mises test p-value H_0:constant effect
    (Intercept)              0.875000                        0.00
    age                      0.000179                        0.00
    sex                      0.017700                        0.29
    diabetes                 0.041200                        0.42
    chf                      0.053500                        0.02
    vf                       0.434000                        0.05

    Call:
    aalen(formula = Surv(time, status == 9) ~ age + sex + diabetes +
        chf + vf, data = sTRACE, max.time = 7, n.sim = 100)
The results first test whether the cumulative regression functions are non-zero, and then whether the effects are constant. The plots of the cumulative regression functions are given below.
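If timereg is not available, the survival package offers its own implementation of Aalen's additive model through the aareg function. The sketch below is an alternative, not the method used above: it fits a different dataset (lung, which ships with survival) with an arbitrary choice of covariates, so its numbers are not comparable to the sTRACE results.

```r
library(survival)

# aareg() fits Aalen's additive model; `lung` ships with the survival package.
fit <- aareg(Surv(time, status) ~ age + sex + ph.ecog, data = lung)
print(fit)   # per-covariate slopes and tests
```

Calling plot(fit) draws the estimated cumulative regression functions, one panel per term.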
In this post we did a brief tour of several methods in survival analysis. We first described why right censoring requires us to develop new tools. We then described the survival function and the hazard. Next we discussed the non-parametric Kaplan Meier estimator and the semi-parametric Cox regression model. We concluded with Aalen’s additive model.
[1] Kaplan, Edward L., and Paul Meier. “Nonparametric estimation from incomplete observations.” Journal of the American Statistical Association 53, no. 282 (1958): 457-481.
[2] Cox, David R. “Regression models and life-tables.” In Breakthroughs in Statistics, pp. 527-541. Springer, New York, NY, 1992.
[3] Tsiatis, Anastasios A. “A large sample study of Cox’s regression model.” The Annals of Statistics 9, no. 1 (1981): 93-108.
[4] Andersen, Per Kragh, and Richard David Gill. “Cox’s regression model for counting processes: a large sample study.” The Annals of Statistics (1982): 1100-1120.
[5] Aalen, Odd. “A model for nonparametric regression analysis of counting processes.” In Mathematical Statistics and Probability Theory, pp. 1-25. Springer, New York, NY, 1980.