
Statistics By Jim

Making statistics intuitive

Blog

Independent and Dependent Samples in Statistics

By Jim Frost 14 Comments

When comparing groups in your data, you can have either independent or dependent samples. The type of samples in your experimental design impacts sample size requirements, statistical power, the proper analysis, and even your study’s costs. Understanding the implications of each type of sample can help you design a better experiment. [Read more…] about Independent and Dependent Samples in Statistics

Filed Under: Basics Tagged With: analysis example, choosing analysis, conceptual, experimental design

Independent and Identically Distributed Data (IID)

By Jim Frost 4 Comments

Having independent and identically distributed (IID) data is a common assumption for statistical procedures and hypothesis tests. But what does that mouthful of words actually mean? That’s the topic of this post! And, I’ll provide helpful tips for determining whether your data are IID. [Read more…] about Independent and Identically Distributed Data (IID)

Filed Under: Basics Tagged With: assumptions, conceptual

Using Moving Averages to Smooth Time Series Data

By Jim Frost 10 Comments

Moving averages can smooth time series data, reveal underlying trends, and identify components for use in statistical modeling. Smoothing is the process of removing random variations that appear as coarseness in a plot of raw time series data. It reduces the noise to emphasize the signal that can contain trends and cycles. Analysts also refer to the smoothing process as filtering the data. [Read more…] about Using Moving Averages to Smooth Time Series Data

Filed Under: Time Series Tagged With: analysis example, conceptual, Excel

A Tour of Survival Analysis

By Alexander Moreno 5 Comments

Note: This is a guest post by Alexander Moreno, a Computer Science PhD student at the Georgia Institute of Technology. He blogs at www.boostedml.com.

Survival analysis is an important subfield of statistics and biostatistics. Its methods model the time to a first event, such as death. In this post we give a brief tour of survival analysis. We first describe the motivation for survival analysis, then define the hazard and survival functions. We follow this with non-parametric estimation via the Kaplan-Meier estimator. Then we describe Cox's proportional hazards model and, after that, Aalen's additive model. Finally, we conclude with a brief discussion.

Why Survival Analysis: Right Censoring

Modeling first event times is important in many applications: the time to death for a severe health condition, say, or the time to failure of a mechanical system. If one always observed the event time and it was guaranteed to occur, one could model the distribution directly. For instance, in the non-parametric setting, one could use the empirical cumulative distribution function to estimate the probability of death by some time. In the parametric setting, one could do non-negative regression.
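As a quick illustration of the uncensored case, here is a minimal sketch in R using hypothetical, fully observed death times; ecdf is base R's empirical CDF function:

# Hypothetical death times, all observed (no censoring)
death_times <- c(2, 5, 6, 8, 11, 13)
F_hat <- ecdf(death_times)   # empirical CDF: estimates P(T <= t)
F_hat(8)                     # estimated probability of death by t = 8: 4/6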

However, in some cases one might not observe the event time: this is generally called right censoring. In clinical trials with death as the event, this occurs when one of the following happens: (1) participants drop out of the study; (2) the study reaches a pre-determined end time, and some participants have survived until the end; or (3) the study ends when a certain number of participants have died. In each case, after the surviving participants leave the study, we don't know what happens to them. We are left with the question:

  • How can we model the empirical distribution or do non-negative regression when, for some individuals, we only observe a lower bound on their event time?

To illustrate right censoring, consider three participants. For participant 1, we observe when they died. Participant 2 dropped out; we know they survived until then, but we don't know what happened afterwards. Participant 3 survived until the pre-determined study end, but again we don't know what happened afterwards.
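As a minimal sketch of how right censoring is encoded in practice (hypothetical times), the survival package in R pairs each follow-up time with an event indicator; censored observations print with a '+':

library(survival)
# Participant 1 died at t = 4; participant 2 dropped out at t = 3;
# participant 3 was still alive at the study end, t = 10
time   <- c(4, 3, 10)
status <- c(1, 0, 0)   # 1 = event observed, 0 = right censored
Surv(time, status)     # prints: 4  3+ 10+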

The Survival Function and the Hazard

Two of the key tools in survival analysis are the survival function and the hazard. The survival function describes the probability of the event not having happened by a time t. The hazard describes the instantaneous rate of the first event at any time t.

More formally, let T be the event time of interest, such as the death time. Then the survival function is S(t) = P(T > t). We can also note that this is related to the cumulative distribution function F(t) = P(T ≤ t):

S(t) = 1 − F(t)

For the hazard, the probability of the first event time being in the small interval (t, t + dt), given survival up to t, is:

P(t ≤ T < t + dt | T ≥ t) ≈ λ(t) dt

Rearranging terms and taking limits we obtain

λ(t) = lim_{dt→0} P(t ≤ T < t + dt | T ≥ t) / dt = f(t) / S(t)

where f(t) is the density function of T, and the second equality follows from applying Bayes' theorem. By rearranging again and solving a differential equation, we can use the hazard to compute the survival function via

S(t) = exp(−∫₀ᵗ λ(u) du)
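The intermediate steps are quick to verify. Since S(t) = 1 − F(t), we have f(t) = −S′(t), so

λ(t) = −S′(t)/S(t) = −(d/dt) log S(t)

Integrating both sides from 0 to t and using S(0) = 1 gives log S(t) = −∫₀ᵗ λ(u) du, and exponentiating yields the formula above.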

The key question then is how to estimate the hazard and/or survival function.

Non-Parametric Estimation with Kaplan-Meier

In non-parametric survival analysis, we want to estimate the survival function S(t) without covariates and with censoring. If we didn't have censoring, we could start with the empirical CDF:

F̂(t) = (1/n) Σᵢ 1(Tᵢ ≤ t)

This equation succinctly answers the question: what fraction of people have died by time t? The survival function would then be the fraction still alive. However, we can't answer this question as posed when some people are censored by time t.

While we don’t necessarily know how many people have survived by an arbitrary time t, we do know how many people in the study are still at risk. We can use this instead. Partition the study time into 0 < t1 < . . . < tn-1 <  tn, where each ti is either an event time or a censoring time for a participant. Assume that participants can only lapse at observed event times. Let Y(t) be the number of people at risk at just before time t. Assuming no one dies at exactly the same time (no ties), we can look at each time someone died. We say that the probability of dying at that specific time is 1/Y(t), and say that the probability of dying at any other time is 0. We can then say that the probability of surviving at any event time Ti, given survival at previous candidate event times is: 

P(T > Tᵢ | T > Tᵢ₋₁) = 1 − 1/Y(Tᵢ)

The probability of surviving up to a time t is then:

Ŝ(t) = ∏_{i: Tᵢ ≤ t} (1 − 1/Y(Tᵢ))

We call this the Kaplan-Meier estimator [1]. Under mild assumptions, including that participants have independent and identically distributed event times and that censoring and event times are independent, this estimator is consistent.
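As a simple worked example of the product formula (hypothetical data with no ties), we can trace the estimate by hand and check it against survfit from the survival package:

library(survival)
# Hypothetical data: deaths at t = 2, 5, 8; censoring at t = 4, 9
time   <- c(2, 4, 5, 8, 9)
status <- c(1, 0, 1, 1, 0)
# t = 2: Y = 5 at risk -> S = 1 - 1/5           = 0.800
# t = 5: Y = 3 at risk -> S = 0.800 * (1 - 1/3) = 0.533
# t = 8: Y = 2 at risk -> S = 0.533 * (1 - 1/2) = 0.267
km <- survfit(Surv(time, status) ~ 1)
summary(km)   # the survival column matches: 0.800, 0.533, 0.267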

Learn more about Hazard Ratios.

Kaplan-Meier R Example

In R, we can use the Surv and survfit functions from the survival package to fit a Kaplan-Meier model, and ggsurvplot from the survminer package to plot the result. Here we use the ovarian cancer dataset from the survival package and stratify by treatment group assignment.

library(survminer)
library(survival)
# Build a survival object from follow-up time and censoring status
kaplan_meier <- Surv(time = ovarian[['futime']], event = ovarian[['fustat']])
# Fit Kaplan-Meier curves stratified by treatment group (rx)
kaplan_meier_treatment <- survfit(kaplan_meier ~ rx, data = ovarian,
                                  type = 'kaplan-meier', conf.type = 'log')
# conf.int expects a logical, not the string 'True'
ggsurvplot(kaplan_meier_treatment, conf.int = TRUE)
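To read estimated survival probabilities off the fitted curves at particular follow-up times, summary accepts a times argument (futime in the ovarian data is measured in days):

# Estimated survival in each treatment group at roughly 6 and 12 months
summary(kaplan_meier_treatment, times = c(180, 360))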

Semi-Parametric Regression with Cox’s Proportional Hazards Model

Kaplan-Meier makes sense when we don't have covariates, but often we want to model how covariates affect death risk. For instance, how does one's weight affect death risk? One way to do this is to assume that covariates have a multiplicative effect on the hazard. This leads us to Cox's proportional hazards model, which uses the following functional form for the hazard:

λ(t | x) = λ₀(t) exp(βᵀx)

The baseline hazard λ₀(t) describes how the average person's risk evolves over time. The relative risk exp(βᵀx) describes how covariates affect the hazard. In particular, a unit increase in xᵢ multiplies the hazard by a factor of exp(βᵢ).

Because of the non-parametric nuisance term λ₀(t), it is difficult to maximize the full likelihood for β directly. Cox's insight [2] was that the assignment probabilities given the death times contain most of the information about β, while the remaining terms contain most of the information about λ₀(t). The assignment probabilities give the following partial likelihood:

L(β) = ∏ᵢ [ exp(βᵀxᵢ) / Σ_{j ∈ R(Tᵢ)} exp(βᵀxⱼ) ]

where the product runs over the observed death times Tᵢ and R(Tᵢ) denotes the set of participants still at risk just before Tᵢ.

We can then maximize this to get an estimator of β. References [3, 4] show that this estimator is consistent and asymptotically normal.

Cox Proportional Hazards R Example

In R, we can use the Surv and coxph functions from the survival package. For the ovarian cancer dataset, we notice from the Kaplan-Meier example that the treatment effect is not proportional: under a proportional hazards assumption, the curves would have the same pattern but diverge, whereas here they move apart and then come back together. Further, treatment does seem to lead to different survival patterns over shorter time horizons. We should therefore not use treatment as a covariate, but we can stratify on it. Here we regress on age and ECOG performance status (ecog.ps):

# Stratify by treatment (rx); regress on age and ECOG performance status
cox_fit <- coxph(Surv(futime, fustat) ~ age + ecog.ps + strata(rx), data = ovarian)
summary(cox_fit)

which gives the following results:

Call:
coxph(formula = Surv(futime, fustat) ~ age + ecog.ps + strata(rx), 
    data = ovarian)

  n= 26, number of events= 12 

            coef exp(coef) se(coef)      z Pr(>|z|)   
age      0.13853   1.14858  0.04801  2.885  0.00391 **
ecog.ps -0.09670   0.90783  0.62994 -0.154  0.87800   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.1486     0.8706    1.0454     1.262
ecog.ps    0.9078     1.1015    0.2641     3.120

Concordance= 0.819  (se = 0.058 )
Likelihood ratio test= 12.71  on 2 df,   p=0.002
Wald test            = 8.43  on 2 df,   p=0.01
Score (logrank) test = 12.24  on 2 df,   p=0.002

This suggests that age has a significant multiplicative effect on death risk, and that a one-year increase in age multiplies the instantaneous risk by a factor of about 1.15.
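Beyond eyeballing the Kaplan-Meier curves, the survival package can formally test the proportional hazards assumption for the covariates in the fitted model; small p-values flag violations:

# Test the proportional hazards assumption for age and ecog.ps
cox.zph(cox_fit)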

Aalen’s Additive Model

Cox regression makes two strong assumptions: (1) that covariate effects are constant over time; and (2) that effects are multiplicative. Aalen's additive model [5] relaxes the first and replaces the second with the assumption that effects are additive. Here the hazard takes the form

λ(t | x) = β₀(t) + β₁(t)x₁ + ⋯ + βₚ(t)xₚ

As this is a linear model, we can estimate the cumulative regression functions Bᵢ(t) = ∫₀ᵗ βᵢ(s) ds using a least-squares-type procedure.

Aalen’s Additive Model R Example

In R, we can use the aalen function from the timereg package to estimate the cumulative regression functions, which we can also plot.

library(timereg)
data(sTRACE)
# Fit Aalen's additive model; status == 9 marks death in sTRACE
out <- aalen(Surv(time, status == 9) ~ age + sex + diabetes + chf + vf,
             data = sTRACE, max.time = 7, n.sim = 100)
summary(out)
par(mfrow = c(2, 3))   # arrange the six cumulative regression plots
plot(out)

This gives us

Additive Aalen Model 

Test for nonparametric terms 

Test for non-significant effects 
            Supremum-test of significance p-value H_0: B(t)=0
(Intercept)                          7.29                0.00
age                                  8.63                0.00
sex                                  2.95                0.01
diabetes                             2.31                0.24
chf                                  5.30                0.00
vf                                   2.95                0.03

Test for time invariant effects 
                  Kolmogorov-Smirnov test
(Intercept)                       0.57700
age                               0.00866
sex                               0.11900
diabetes                          0.16200
chf                               0.12900
vf                                0.43500
            p-value H_0:constant effect
(Intercept)                        0.00
age                                0.00
sex                                0.18
diabetes                           0.43
chf                                0.06
vf                                 0.02
                    Cramer von Mises test
(Intercept)                      0.875000
age                              0.000179
sex                              0.017700
diabetes                         0.041200
chf                              0.053500
vf                               0.434000
            p-value H_0:constant effect
(Intercept)                        0.00
age                                0.00
sex                                0.29
diabetes                           0.42
chf                                0.02
vf                                 0.05

   
   
  Call: 
aalen(formula = Surv(time, status == 9) ~ age + sex + diabetes + 
    chf + vf, data = sTRACE, max.time = 7, n.sim = 100)

The results first test whether the cumulative regression functions are non-zero, and then whether the effects are constant over time. The plot(out) call above displays the estimated cumulative regression functions.

Discussion

In this post we took a brief tour of several methods in survival analysis. We first described why right censoring requires us to develop new tools. We then described the survival function and the hazard. Next, we discussed the non-parametric Kaplan-Meier estimator and the semi-parametric Cox regression model. We concluded with Aalen's additive model.

[1] Kaplan, Edward L., and Paul Meier. “Nonparametric estimation from incomplete observations.” Journal of the American Statistical Association 53, no. 282 (1958): 457-481.
[2] Cox, David R. “Regression models and life-tables.” In Breakthroughs in Statistics, pp. 527-541. Springer, New York, NY, 1992.
[3] Tsiatis, Anastasios A. “A large sample study of Cox’s regression model.” The Annals of Statistics 9, no. 1 (1981): 93-108.
[4] Andersen, Per Kragh, and Richard David Gill. “Cox’s regression model for counting processes: a large sample study.” The Annals of Statistics (1982): 1100-1120.
[5] Aalen, Odd. “A model for nonparametric regression analysis of counting processes.” In Mathematical Statistics and Probability Theory, pp. 1-25. Springer, New York, NY, 1980.

Filed Under: Survival Tagged With: analysis example, choosing analysis, conceptual

Time Series Analysis Introduction

By Jim Frost 28 Comments

Time series analysis tracks characteristics of a process at regular time intervals. It’s a fundamental method for understanding how a metric changes over time and forecasting future values. Analysts use time series methods in a wide variety of contexts. [Read more…] about Time Series Analysis Introduction

Filed Under: Time Series Tagged With: conceptual, data types, graphs

New eBook Release! Hypothesis Testing: An Intuitive Guide

By Jim Frost 10 Comments

I’m thrilled to release my new book! Hypothesis Testing: An Intuitive Guide for Making Data Driven Decisions. [Read more…] about New eBook Release! Hypothesis Testing: An Intuitive Guide

Filed Under: Hypothesis Testing Tagged With: ebook

Answering the Birthday Problem in Statistics

By Jim Frost 18 Comments

The Birthday Problem in statistics asks, how many people do you need in a group to have a 50% chance that at least two people will share a birthday? Go ahead and think about that for a moment. The answer surprises many people. We’ll get to that shortly.

In this post, I’ll not only answer the birthday paradox, but I’ll also show you how to calculate the probabilities for any size group, run a computer simulation of it, and explain why the answer to the Birthday Problem is so surprising. [Read more…] about Answering the Birthday Problem in Statistics

Filed Under: Fun Tagged With: Excel, graphs, probability

Coronavirus Mortality Rates by Country

By Jim Frost 54 Comments

UPDATED! April 3, 2020. The coronavirus mortality rate varies significantly by country. In this post, I look at the mortality rates for ten countries and assess factors that affect these numbers. After discussing the trends, I provide a rough estimate for where the actual fatality rate might lie. [Read more…] about Coronavirus Mortality Rates by Country

Filed Under: Basics Tagged With: coronavirus, graphs

Coronavirus: Exponential Growth and Hospital Beds

By Jim Frost 19 Comments

UPDATED March 24, 2020: As the number of confirmed coronavirus cases continues to grow exponentially, the capacity of the hospital system to treat these cases is becoming a concern. The goal of “flattening the curve” is that testing, isolation, and social distancing will slow the increase of new cases. Hopefully, these efforts reduce the numbers of new patients who require hospitalization to a rate that hospitals can handle.

In this post, I’ll identify the top 10 states in the United States that have the greatest likelihood of experiencing hospital capacity problems if coronavirus cases continue to grow exponentially. To recognize these states, I’ll assess per capita rates for both coronavirus infections and hospital beds. I’m looking for states that have a relatively large number of coronavirus cases given the size of their population and have a relatively low number of hospital beds. [Read more…] about Coronavirus: Exponential Growth and Hospital Beds

Filed Under: Basics Tagged With: coronavirus

Coronavirus Curves and Different Outcomes

By Jim Frost 247 Comments

Coronavirus particles as seen by negative stain electron microscopy. Notice the characteristic club-like projections on the membrane.

UPDATED May 9, 2020. The coronavirus, or COVID-19, has swept around the world. However, not all countries have had the same experiences. Outcomes have varied by the number of cases, the rate of increase, and how countries have responded.

In this post, I present coronavirus growth curves for 15 countries, graph their new cases per day and daily coronavirus deaths, and describe how each country approached controlling the virus. You can see the differences in outcomes and when the effects of coronavirus mitigation efforts started taking effect. I also include the per capita values for these countries in a table near the end.

At this time, there is plenty of good news with evidence that many of the 15 countries have slowed the growth rate of new cases. However, several other countries have reason to worry. And, we have one new cautionary tale about a country that had the virus contained but is now seeing a spike in new cases. [Read more…] about Coronavirus Curves and Different Outcomes

Filed Under: Basics Tagged With: coronavirus, graphs

Failing to Reject the Null Hypothesis

By Jim Frost 66 Comments

Failing to reject the null hypothesis is an odd way to state that the results of your hypothesis test are not statistically significant. Why the peculiar phrasing? “Fail to reject” sounds like one of those double negatives that writing classes taught you to avoid. What does it mean exactly? There’s an excellent reason for the odd wording!

In this post, learn what it means when you fail to reject the null hypothesis and why that’s the correct wording. While accepting the null hypothesis sounds more straightforward, it is not statistically correct! [Read more…] about Failing to Reject the Null Hypothesis

Filed Under: Hypothesis Testing Tagged With: conceptual

Understanding Significance Levels in Statistics

By Jim Frost 30 Comments

Significance levels in statistics are a crucial component of hypothesis testing. However, unlike other values in your statistical output, the significance level is not something that statistical software calculates. Instead, you choose the significance level. Have you ever wondered why?

In this post, I’ll explain the significance level conceptually, why you choose its value, and how to choose a good value. Statisticians also refer to the significance level as alpha (α). [Read more…] about Understanding Significance Levels in Statistics

Filed Under: Hypothesis Testing Tagged With: conceptual

How the Chi-Squared Test of Independence Works

By Jim Frost 21 Comments

Chi-squared tests of independence determine whether a relationship exists between two categorical variables. Do the values of one categorical variable depend on the value of the other categorical variable? If the two variables are independent, knowing the value of one variable provides no information about the value of the other variable.

I’ve previously written about Pearson’s chi-square test of independence using a fun Star Trek example. Are the uniform colors related to the chances of dying? You can test the notion that the infamous red shirts have a higher likelihood of dying. In that post, I focused on the purpose of the test, applied it to this example, and interpreted the results.

In this post, I’ll take a bit of a different approach. I’ll show you the nuts and bolts of how to calculate the expected values, chi-square value, and degrees of freedom. Then you’ll learn how to use the chi-squared distribution in conjunction with the degrees of freedom to calculate the p-value. [Read more…] about How the Chi-Squared Test of Independence Works

Filed Under: Hypothesis Testing Tagged With: analysis example, distributions, interpreting results

How to Test Variances in Excel

By Jim Frost 7 Comments

Use a variances test to determine whether the variability of two groups differs. In this post, we’ll work through a two-sample variances test that Excel provides. Even if Excel isn’t your primary statistical software, this post provides an excellent introduction to variance tests. Excel refers to this analysis as F-Test Two-Sample for Variances. [Read more…] about How to Test Variances in Excel

Filed Under: Hypothesis Testing Tagged With: analysis example, Excel, interpreting results

How to do Two-Way ANOVA in Excel

By Jim Frost 30 Comments

Use two-way ANOVA to assess differences between the group means that are defined by two categorical factors. In this post, we’ll work through two-way ANOVA using Excel. Even if Excel isn’t your main statistical package, this post is an excellent introduction to two-way ANOVA. Excel refers to this analysis as two factor ANOVA. [Read more…] about How to do Two-Way ANOVA in Excel

Filed Under: ANOVA Tagged With: analysis example, Excel, interpreting results

Guidelines for Removing and Handling Outliers in Data

By Jim Frost 66 Comments

Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it’s best to remove them from your data. But, that’s not always the case. Removing outliers is legitimate only for specific reasons. [Read more…] about Guidelines for Removing and Handling Outliers in Data

Filed Under: Basics Tagged With: assumptions, choosing analysis, conceptual

5 Ways to Find Outliers in Your Data

By Jim Frost 35 Comments

Outliers are data points that are far from other data points. In other words, they’re unusual values in a dataset. Outliers are problematic for many statistical analyses because they can cause tests to either miss significant findings or distort real results.

Unfortunately, there are no strict statistical rules for definitively identifying outliers. Finding outliers depends on subject-area knowledge and an understanding of the data collection process. While there is no solid mathematical definition, there are guidelines and statistical tests you can use to find outlier candidates. [Read more…] about 5 Ways to Find Outliers in Your Data

Filed Under: Basics Tagged With: analysis example, conceptual, graphs

How to do One-Way ANOVA in Excel

By Jim Frost 23 Comments

Use one-way ANOVA to test whether the means of at least three groups are different. Excel refers to this test as Single Factor ANOVA. This post is an excellent introduction to performing and interpreting a one-way ANOVA test even if Excel isn’t your primary statistical software package. [Read more…] about How to do One-Way ANOVA in Excel

Filed Under: ANOVA Tagged With: analysis example, Excel, interpreting results

How to do t-Tests in Excel

By Jim Frost 114 Comments

Excel can perform various statistical analyses, including t-tests. It is an excellent option because nearly everyone can access Excel. This post is a great introduction to performing and interpreting t-tests even if Excel isn’t your primary statistical software package.

In this post, I provide step-by-step instructions for using Excel to perform t-tests. Importantly, I also show you how to select the correct form of t-test, choose the right options, and interpret the results. I also include links to additional resources I’ve written, which present clear explanations of relevant t-test concepts that you won’t find in Excel’s documentation. And, I use an example dataset for us to work through and interpret together! [Read more…] about How to do t-Tests in Excel

Filed Under: Hypothesis Testing Tagged With: analysis example, Excel, interpreting results

New eBook Release! Introduction to Statistics: An Intuitive Guide

By Jim Frost 23 Comments

I’m thrilled to release my new book! Introduction to Statistics: An Intuitive Guide for Analyzing Data and Unlocking Discoveries. [Read more…] about New eBook Release! Introduction to Statistics: An Intuitive Guide

Filed Under: Basics Tagged With: ebook
