Effect sizes in statistics quantify the differences between group means and the relationships between variables. While analysts often focus on statistical significance using p-values, effect sizes determine the practical importance of the findings.
Proxy Variables: The Good Twin of Confounding Variables
Proxy variables are easily measurable variables that analysts include in a model in place of a variable that cannot be measured or is difficult to measure. Proxy variables can be something that is not of any great interest itself, but has a close correlation with the variable of interest.
Multiplication Rule for Calculating Probabilities
The multiplication rule in probability allows you to calculate the joint probability of multiple events occurring together using known probabilities of those events individually. There are two forms of this rule, the specific and general multiplication rules.
In this post, learn about when and how to use both the specific and general multiplication rules. Additionally, I'll use and explain the standard notation for probabilities throughout, helping you learn how to interpret it. We'll work through several example problems so you can see them in action. There's even a bonus problem at the end!
Exponential Smoothing for Time Series Forecasting
Exponential smoothing is a forecasting method for univariate time series data. This method produces forecasts that are weighted averages of past observations where the weights of older observations exponentially decrease. Forms of exponential smoothing extend the analysis to model data with trends and seasonal components.
Descriptive Statistics in Excel
Descriptive statistics summarize your dataset, painting a picture of its properties. These properties include various central tendency and variability measures, distribution properties, outlier detection, and other information. Unlike inferential statistics, descriptive statistics only describe your dataset’s characteristics and do not attempt to generalize from a sample to a population.
Using Contingency Tables to Calculate Probabilities
Contingency tables are a great way to classify outcomes and calculate different types of probabilities. These tables contain rows and columns that display bivariate frequencies of categorical data. Analysts also refer to contingency tables as crosstabulation (cross tabs), two-way tables, and frequency tables.
Statisticians use contingency tables for a variety of reasons. I love these tables because they both organize your data and allow you to answer a diverse set of questions. In this post, I focus on using them to calculate different types of probabilities. These probabilities include joint, marginal, and conditional probabilities.
Probability Definition and Fundamentals
What is Probability?
The definition of probability is the likelihood of an event happening. Probability theory analyzes the chances of events occurring. You can think of probabilities as being the following:
- The long-term proportion of times an event occurs during a random process.
- The propensity for a particular outcome to occur.
Common terms for describing probabilities include likelihood, chances, and odds.
Using Applied Statistics to Expand Human Knowledge
My background includes working on scientific projects as the data guy. In these positions, I was responsible for establishing valid data collection procedures, collecting usable data, and statistically analyzing and presenting the results. In this post, I describe the excitement of being a statistician helping expand the limits of human knowledge, what I learned about applied statistics and data analysis during the first big project in my career, and the challenges along the way!
Variance Inflation Factors (VIFs)
Variance Inflation Factors (VIFs) measure the correlation among independent variables in least squares regression models. Statisticians refer to this type of correlation as multicollinearity. Excessive multicollinearity can cause problems for regression models.
In this post, I focus on VIFs and how they detect multicollinearity, why they’re better than pairwise correlations, how to calculate VIFs yourself, and how to interpret them. If you need a refresher about the types of problems that multicollinearity causes and how to fix them, read my post: Multicollinearity: Problems, Detection, and Solutions.
Assessing a COVID-19 Vaccination Experiment and Its Results
Moderna has announced encouraging preliminary results for their COVID-19 vaccine. In this post, I assess the available data and explain what the vaccine's effectiveness really means. I also look at Moderna's experimental design and examine how it incorporates statistical procedures and concepts that I discuss throughout my blog posts and books.
P-Values, Error Rates, and False Positives
In my post about how to interpret p-values, I emphasize that p-values are not an error rate. The number one misinterpretation of p-values is that they are the probability of the null hypothesis being correct.
The correct interpretation is that p-values indicate the probability of observing your sample data, or more extreme, when you assume the null hypothesis is true. If you don't solidly grasp that correct interpretation, please take a moment to read that post first.
Hopefully, that's clear.
Unfortunately, one part of that blog post confuses some readers. In that post, I explain how p-values are not a probability, or error rate, of a hypothesis. I then show how that misinterpretation is dangerous because it overstates the evidence against the null hypothesis.
How to Perform Regression Analysis using Excel
Excel can perform various statistical analyses, including regression analysis. It is a great option because nearly everyone can access Excel. This post is an excellent introduction to performing and interpreting regression analysis, even if Excel isn’t your primary statistical software package.
Coefficient of Variation in Statistics
The coefficient of variation (CV) is a relative measure of variability that indicates the size of a standard deviation in relation to its mean. It is a standardized, unitless measure that allows you to compare variability between disparate groups and characteristics. It is also known as the relative standard deviation (RSD).
In this post, you will learn about the coefficient of variation, how to calculate it, when it is particularly useful, and when to avoid it.
Independent and Dependent Samples in Statistics
When comparing groups in your data, you can have either independent or dependent samples. The type of samples in your experimental design impacts sample size requirements, statistical power, the proper analysis, and even your study's costs. Understanding the implications of each type of sample can help you design a better experiment.
Independent and Identically Distributed Data (IID)
Having independent and identically distributed (IID) data is a common assumption for statistical procedures and hypothesis tests. But what does that mouthful of words actually mean? That's the topic of this post! And, I'll provide helpful tips for determining whether your data are IID.
Using Moving Averages to Smooth Time Series Data
Moving averages, also known as rolling averages, can smooth time series data, reveal underlying trends, and identify components for use in statistical modeling. Smoothing is the process of removing random variations that appear as coarseness in a plot of raw time series data. It reduces the noise to emphasize the signal that can contain trends and cycles. Analysts also refer to the smoothing process as filtering the data.
A Tour of Survival Analysis
Note: this is a guest post by Alexander Moreno, a Computer Science PhD student at the Georgia Institute of Technology. He blogs at www.boostedml.com
Survival analysis is an important subfield of statistics and biostatistics. These methods involve modeling the time to a first event such as death. In this post we give a brief tour of survival analysis. We first describe the motivation for survival analysis, and then describe the hazard and survival functions. We follow this with non-parametric estimation via the Kaplan Meier estimator. Then we describe Cox’s proportional hazard model and after that Aalen’s additive model. Finally, we conclude with a brief discussion.
Why Survival Analysis: Right Censoring
Modeling first event times is important in many applications. This could be time to death for severe health conditions or time to failure of a mechanical system. If one always observed the event time and it was guaranteed to occur, one could model the distribution directly. For instance, in the non-parametric setting, one could use the empirical cumulative distribution function to estimate the probability of death by some time. In the parametric setting one could do non-negative regression.
However, in some cases one might not observe the event time: this is generally called right censoring. In clinical trials with death as the event, this occurs when one of the following happens: 1) participants drop out of the study, 2) the study reaches a pre-determined end time and some participants have survived until the end, or 3) the study ends when a certain number of participants have died. In each case, after the surviving participants have left the study, we don't know what happens to them. We then have the question:
- How can we model the empirical distribution or do non-negative regression when for some individuals, we only observe a lower bound on their event time?

The above figure illustrates right censoring. For participant 1 we see when they died. Participant 2 dropped out, and we know that they survived until then, but don’t know what happened afterwards. For participant 3, we know that they survived until the pre-determined study end, but again don’t know what happened afterwards.
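To make the censoring mechanism concrete, here is a minimal simulation sketch in base R. All of the numbers (sample size, rates, study end, seed) are hypothetical choices for illustration and do not come from any dataset in this post. It records, for each participant, the smaller of the event time and the censoring time, plus an indicator of which one was observed.

```r
set.seed(1)  # arbitrary seed, used only for reproducibility

n <- 1000
event_time  <- rexp(n, rate = 0.1)   # hypothetical true event times (mean 10)
dropout     <- rexp(n, rate = 0.05)  # hypothetical dropout times
study_end   <- 15                    # hypothetical pre-determined study end
censor_time <- pmin(dropout, study_end)

# What the study actually records
observed <- pmin(event_time, censor_time)          # follow-up time
status   <- as.integer(event_time <= censor_time)  # 1 = event seen, 0 = right censored

mean(status == 0)            # share of participants who are right censored
mean(observed[status == 1])  # naive mean using only the observed events
mean(event_time)             # true mean, unavailable in a real study
```

The naive mean only uses participants whose event happened before they were censored, so it is biased toward short event times. For censored participants, the recorded time is only a lower bound on the event time.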
The Survival Function and the Hazard
Two of the key tools in survival analysis are the survival function and the hazard. The survival function describes the probability of the event not having happened by a time t. The hazard describes the instantaneous rate of the first event at any time t.
More formally, let T be the event time of interest, such as the death time. Then the survival function is S(t) = P(T > t). This is the complement of the cumulative distribution function F(t) = P(T ≤ t):
$$S(t) = 1 - F(t)$$
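For the hazard, which we write as λ(t), the probability of the first event time being in the small interval [t, t + dt), given survival up to t, is approximately:
$$P(t \le T < t + dt \mid T \ge t) \approx \lambda(t)\,dt$$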
This is illustrated in the following figure.

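Rearranging terms and taking limits we obtain
$$\lambda(t) = \lim_{dt \to 0} \frac{P(t \le T < t + dt \mid T \ge t)}{dt} = \frac{f(t)}{S(t)}$$
where λ(t) is the hazard, f(t) is the density function of T, and the second equality follows from the definition of conditional probability (Bayes' theorem). Because f(t) = -S'(t), this is a differential equation in S(t). By rearranging again and solving it, we can use the hazard to compute the survival function via
$$S(t) = \exp\left(-\int_0^t \lambda(s)\,ds\right)$$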
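As a quick numerical check of the link between the hazard and the survival function, the sketch below uses a Weibull distribution. The distribution and its parameter values are assumptions chosen purely for illustration. It builds the hazard as density divided by survival, integrates it to get the cumulative hazard, and compares the exponential of the negative cumulative hazard with the survival probability computed directly.

```r
shape <- 1.5  # hypothetical Weibull shape
scale <- 10   # hypothetical Weibull scale

# Hazard = density / survival function
hazard <- function(t) {
  dweibull(t, shape, scale) / pweibull(t, shape, scale, lower.tail = FALSE)
}

# Survival rebuilt from the hazard: exp(-integral of the hazard from 0 to t)
survival_from_hazard <- function(t) {
  cumulative_hazard <- integrate(hazard, lower = 0, upper = t)$value
  exp(-cumulative_hazard)
}

t0 <- 7
survival_from_hazard(t0)                        # S(t0) computed from the hazard
pweibull(t0, shape, scale, lower.tail = FALSE)  # S(t0) = P(T > t0) directly
```

The two values should agree up to the numerical integration error.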
The key question then is how to estimate the hazard and/or survival function.
Non-Parametric Estimation with Kaplan Meier
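In non-parametric survival analysis, we want to estimate the survival function S(t) without covariates, and with censoring. If we didn't have censoring, we could start with the empirical CDF of the n observed event times T_1, ..., T_n:
$$\hat{F}(t) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(T_i \le t)$$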
This equation is a succinct representation of: how many people have died by time t? The survival function would then be: how many people are still alive? However, we can’t answer this question as posed when some people are censored by time t.
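For the uncensored case, base R's ecdf function gives the empirical CDF directly. The following sketch uses five made-up death times (hypothetical values, not from any dataset in this post) and takes one minus the empirical CDF as the estimated survival function.

```r
death_times <- c(3, 5, 8, 12, 20)  # hypothetical, fully observed event times

F_hat <- ecdf(death_times)         # empirical CDF: share who have died by time t
S_hat <- function(t) 1 - F_hat(t)  # share still alive at time t

F_hat(8)  # 3 of 5 have died by t = 8, so 0.6
S_hat(8)  # 2 of 5 are still alive, so 0.4
```

This only works because every event time is observed. With censoring, some entries would be lower bounds rather than event times, and the counts above would no longer be valid.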
While we don’t necessarily know how many people have survived by an arbitrary time t, we do know how many people in the study are still at risk. We can use this instead. Partition the study time into 0 < t1 < . . . < tn-1 <ย tn, where each ti is either an event time or a censoring time for a participant. Assume that participants can only lapse at observed event times. Let Y(t) be the number of people at risk at just before time t. Assuming no one dies at exactly the same time (no ties), we can look at each time someone died. We say that the probability of dying at that specific time is 1/Y(t), and say that the probability of dying at any other time is 0. We can then say that the probability of surviving at any event time Ti, given survival at previous candidate event times is:ย
The probability of surviving up to a time t is then:
We call this [1] the Kaplan Meier estimator. Under mild assumptions, including that participants have independent and identically distributed event times and that censoring and event times are independent, this gives an estimator that is consistent. The next figure gives an example of the Kaplan Meier estimator for a simple case.
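The estimator can be computed by hand in a few lines of base R. The sketch below uses a tiny made-up dataset (hypothetical times and censoring indicators, with no ties). At each observed death it counts the number still at risk, Y(t), and multiplies the running survival estimate by 1 - 1/Y(t).

```r
time   <- c(2, 3, 5, 7, 11)  # hypothetical follow-up times, no ties
status <- c(1, 0, 1, 1, 0)   # 1 = death observed, 0 = right censored

event_times <- sort(time[status == 1])

# Y(t): number of participants whose follow-up time is at least t
at_risk <- sapply(event_times, function(t) sum(time >= t))

# Running product of (1 - 1 / Y(t)) over the event times
km_curve <- cumprod(1 - 1 / at_risk)

data.frame(time = event_times, at_risk = at_risk, survival = km_curve)
```

Here the at-risk counts are 5, 3, and 2, so the estimated survival probabilities are 0.8, then 0.8 × 2/3 ≈ 0.533, then 0.533 × 1/2 ≈ 0.267. The censored participant at time 3 contributes no drop in the curve but does reduce the at-risk count for later deaths.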

Learn more about Hazard Ratios.
Kaplan Meier R Example
In R we can use the Surv and survfit functions from the survival package to fit a Kaplan Meier model. We can also use ggsurvplot from the survminer package to make plots. Here we will use the ovarian cancer dataset from the survival package. We will stratify based on treatment group assignment.
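library(survival)   # Surv(), survfit(), and the ovarian dataset
library(survminer)  # ggsurvplot()
# Survival response: follow-up time plus event indicator (1 = died, 0 = censored)
kaplan_meier <- Surv(time = ovarian[['futime']], event = ovarian[['fustat']])
# Fit a separate Kaplan Meier curve for each treatment group (rx)
kaplan_meier_treatment <- survfit(kaplan_meier ~ rx, data = ovarian, type = 'kaplan-meier', conf.type = 'log')
# Plot the curves with confidence bands (conf.int must be the logical TRUE, not a string)
ggsurvplot(kaplan_meier_treatment, data = ovarian, conf.int = TRUE)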

Semi-Parametric Regression with Cox’s Proportional Hazards Model
Kaplan Meier makes sense when we don't have covariates, but often we want to model how some covariates affect death risk. For instance, how does one's weight affect death risk? One way to do this is to assume that covariates have a multiplicative effect on the hazard. This leads us to Cox's proportional hazard model, which involves the following functional form for the hazard:
$$\lambda(t \mid x) = \lambda_0(t) \exp(\beta^T x)$$
The baseline hazard λ0(t) describes how the average person's risk evolves over time. The relative risk exp(β^T x) describes how covariates affect the hazard. In particular, a unit increase in xi multiplies the hazard by a factor of exp(βi).
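Because of the non-parametric nuisance term λ0(t), it is difficult to maximize the full likelihood for β directly. Cox's insight [2] was that the assignment probabilities given the death times contain most of the information about β, and the remaining terms contain most of the information about λ0(t). The assignment probabilities give the following partial likelihood
$$L(\beta) = \prod_{i:\ \text{death observed at } t_i} \frac{\exp(\beta^T x_i)}{\sum_{j \in R(t_i)} \exp(\beta^T x_j)}$$
where x_i is the covariate vector of the participant who died at time t_i and R(t_i) is the set of participants still at risk just before t_i. We can then maximize this to get an estimator of β. References [3, 4] show that this estimator is consistent and asymptotically normal.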
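To show what maximizing the partial likelihood involves, here is a minimal base R sketch for a single covariate and no tied event times. The data are made up for illustration, and the function name neg_log_partial_lik is hypothetical. In practice one would use coxph from the survival package, which handles ties, multiple covariates, and standard errors. Each observed death contributes the log of its own relative risk divided by the summed relative risks of everyone still at risk at that time.

```r
# Hypothetical toy data: follow-up time, event indicator, one covariate
time   <- c(5, 8, 12, 14, 20, 23, 30)
status <- c(1, 1, 0, 1, 1, 0, 1)  # 1 = death observed, 0 = right censored
x      <- c(2.1, 1.4, 0.3, 1.9, 0.2, 1.0, -0.5)

# Negative log partial likelihood for a scalar coefficient beta (no ties assumed)
neg_log_partial_lik <- function(beta, time, status, x) {
  eta <- beta * x
  contributions <- sapply(which(status == 1), function(i) {
    at_risk <- time >= time[i]              # risk set at the i-th death time
    eta[i] - log(sum(exp(eta[at_risk])))
  })
  -sum(contributions)
}

# One-dimensional search over an arbitrary interval for beta
fit <- optimize(neg_log_partial_lik, interval = c(-5, 5),
                time = time, status = status, x = x)
beta_hat <- fit$minimum
beta_hat       # estimated log hazard ratio per unit of x
exp(beta_hat)  # estimated hazard ratio per unit of x
```

Note that the baseline hazard never appears in the function: it cancels in each ratio, which is why β can be estimated without specifying it.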
Cox Proportional Hazards R Example
In R, we can use the Surv and coxph functions from the survival package. For the ovarian cancer dataset, the Kaplan Meier example suggests that the treatment effect does not satisfy the proportional hazards assumption. Under proportional hazards, the curves for the two groups would follow the same pattern while steadily diverging. Instead, they move apart and then move back together. Treatment does still seem to lead to different survival patterns over shorter time horizons, so rather than using it as a covariate, we stratify on it. We then regress on age and ECOG performance status (ecog.ps).
# Cox model for age and ECOG performance status, stratified by treatment group (rx)
cox_fit <- coxph(Surv(futime, fustat) ~ age + ecog.ps + strata(rx), data = ovarian)
summary(cox_fit)
which gives the following results
Call:
coxph(formula = Surv(futime, fustat) ~ age + ecog.ps + strata(rx),
    data = ovarian)

  n= 26, number of events= 12

            coef exp(coef) se(coef)      z Pr(>|z|)
age      0.13853   1.14858  0.04801  2.885  0.00391 **
ecog.ps -0.09670   0.90783  0.62994 -0.154  0.87800
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.1486     0.8706    1.0454     1.262
ecog.ps    0.9078     1.1015    0.2641     3.120

Concordance= 0.819  (se = 0.058 )
Likelihood ratio test= 12.71  on 2 df,   p=0.002
Wald test            = 8.43  on 2 df,   p=0.01
Score (logrank) test = 12.24  on 2 df,   p=0.002
This suggests that age has a statistically significant multiplicative effect on the hazard of death: holding ECOG performance status fixed, a one-year increase in age increases the instantaneous risk by a factor of about 1.15.
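The hazard ratio and its confidence interval can be recovered from the reported coefficient and standard error. The short sketch below copies those two numbers from the output above. The ten-year comparison is an added illustration that assumes the log-linear effect of age holds across that range.

```r
coef_age <- 0.13853  # age coefficient from the coxph output
se_age   <- 0.04801  # its standard error from the same output

hr_one_year <- exp(coef_age)       # about 1.149: hazard ratio per extra year of age
hr_ten_year <- exp(10 * coef_age)  # about 4.0: hazard ratio per extra ten years

# 95% Wald confidence interval for the one-year hazard ratio
ci_one_year <- exp(coef_age + c(-1, 1) * qnorm(0.975) * se_age)
ci_one_year                        # about 1.045 to 1.262, matching the output
```

Because effects are multiplicative on the hazard, a ten-year age difference corresponds to multiplying the one-year hazard ratio by itself ten times, not to adding ten increments of 0.15.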
Aalen’s Additive Model
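Cox regression makes two strong assumptions: 1) that covariate effects are constant over time, and 2) that effects are multiplicative. Aalen's additive model [5] relaxes the first, and replaces the second with the assumption that effects are additive. Here the hazard takes the form
$$\lambda(t \mid x) = \beta_0(t) + \beta_1(t)\,x_1 + \cdots + \beta_p(t)\,x_p$$
where each βj(t) is a regression function that is allowed to change over time.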
As this is a linear model, we can estimate the cumulative regression functions using a least squares type procedure.
Aalen’s Additive Model R Example
In R we can use the timereg package and the aalen function to estimate cumulative regression functions, which we can also plot.
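library(timereg)  # aalen() and the sTRACE dataset
data(sTRACE)
# Fit Aalen's additive model, treating status == 9 as the event
out <- aalen(Surv(time, status == 9) ~ age + sex + diabetes + chf + vf,
             sTRACE, max.time = 7, n.sim = 100)
summary(out)
# Plot the cumulative regression functions in a 2 x 3 grid
par(mfrow = c(2, 3))
plot(out)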
This gives us
Additive Aalen Model

Test for nonparametric terms

Test for non-significant effects
            Supremum-test of significance   p-value H_0: B(t)=0
(Intercept)                          7.29                  0.00
age                                  8.63                  0.00
sex                                  2.95                  0.01
diabetes                             2.31                  0.24
chf                                  5.30                  0.00
vf                                   2.95                  0.03

Test for time invariant effects
            Kolmogorov-Smirnov test   p-value H_0:constant effect
(Intercept)                 0.57700                          0.00
age                         0.00866                          0.00
sex                         0.11900                          0.18
diabetes                    0.16200                          0.43
chf                         0.12900                          0.06
vf                          0.43500                          0.02

            Cramer von Mises test   p-value H_0:constant effect
(Intercept)              0.875000                          0.00
age                      0.000179                          0.00
sex                      0.017700                          0.29
diabetes                 0.041200                          0.42
chf                      0.053500                          0.02
vf                       0.434000                          0.05

Call:
aalen(formula = Surv(time, status == 9) ~ age + sex + diabetes +
    chf + vf, data = sTRACE, max.time = 7, n.sim = 100)
The results first test whether the cumulative regression functions are non-zero, and then whether the effects are constant. The plots of the cumulative regression functions are given below.

Discussion
In this post we did a brief tour of several methods in survival analysis. We first described why right censoring requires us to develop new tools. We then described the survival function and the hazard. Next we discussed the non-parametric Kaplan Meier estimator and the semi-parametric Cox regression model. We concluded with Aalen’s additive model.
[1] Kaplan, Edward L., and Paul Meier. “Nonparametric estimation from incomplete observations.” Journal of the American Statistical Association 53, no. 282 (1958): 457-481.
[2] Cox, David R. “Regression models and life-tables.” In Breakthroughs in Statistics, pp. 527-541. Springer, New York, NY, 1992.
[3] Tsiatis, Anastasios A. “A large sample study of Cox’s regression model.” The Annals of Statistics 9, no. 1 (1981): 93-108.
[4] Andersen, Per Kragh, and Richard David Gill. “Cox’s regression model for counting processes: a large sample study.” The Annals of Statistics (1982): 1100-1120.
[5] Aalen, Odd. “A model for nonparametric regression analysis of counting processes.” In Mathematical Statistics and Probability Theory, pp. 1-25. Springer, New York, NY, 1980.
Time Series Analysis Introduction
Time series analysis tracks characteristics of a process at regular time intervals. It's a fundamental method for understanding how a metric changes over time and forecasting future values. Analysts use time series methods in a wide variety of contexts.
New eBook Release! Hypothesis Testing: An Intuitive Guide
I'm thrilled to release my new book! Hypothesis Testing: An Intuitive Guide for Making Data Driven Decisions.
Answering the Birthday Problem in Statistics
The Birthday Problem in statistics asks, how many people do you need in a group to have a 50% chance that at least two people will share a birthday? Go ahead and think about that for a moment. The answer surprises many people. We'll get to that shortly.
In this post, I'll not only answer the birthday paradox, but I'll also show you how to calculate the probabilities for any size group, run a computer simulation of it, and explain why the answer to the Birthday Problem is so surprising.