*Note: this is a guest post by Alexander Moreno, a Computer Science PhD student at the Georgia Institute of Technology. He blogs at www.boostedml.com*

Survival analysis is an important subfield of statistics and biostatistics. These methods involve modeling the time to a first event such as death. In this post we give a brief tour of survival analysis. We first describe the motivation for survival analysis, and then describe the hazard and survival functions. We follow this with non-parametric estimation via the Kaplan Meier estimator. Then we describe Cox’s proportional hazard model and after that Aalen’s additive model. Finally, we conclude with a brief discussion.

## Why Survival Analysis: Right Censoring

Modeling first event times is important in many applications. This could be time to death for severe health conditions or time to failure of a mechanical system. If one always observed the event time and it was guaranteed to occur, one could model the distribution directly. For instance, in the non-parametric setting, one could use the empirical cumulative distribution function to estimate the probability of death by some time. In the parametric setting one could do non-negative regression.

However, in some cases one might not observe the event time: this is generally called *right censoring*. In clinical trials with death as the event, this occurs when one of the following happens. 1) participants drop out of the study 2) the study reaches a pre-determined end time, and some participants have survived until the end 3) the study ends when a certain number of participants have died. In each case, after the surviving participants have left the study, we don’t know what happens to them. We then have the question:

*How can we model the empirical distribution or do non-negative regression when for some individuals, we only observe a lower bound on their event time?*

The above figure illustrates right censoring. For participant 1 we see when they died. Participant 2 dropped out, and we know that they survived until then, but don’t know what happened afterwards. For participant 3, we know that they survived until the pre-determined study end, but again don’t know what happened afterwards.

## The Survival Function and the Hazard

Two of the key tools in survival analysis are the survival function and the hazard. The survival function describes the probability of the event not having happened by a time *t*. The hazard describes the instantaneous rate of the first event at any time *t*.

More formally, let *t* be the event time of interest, such as the death time. Then the survival function is S(t) = P(T > t). We can also note that this is related to the cumulative distribution function:

For the hazard, the probability of the first event time being in the small interval (t,t+dt), given survival up to *t* is:

This is illustrated in the following figure.

Rearranging terms and taking limits we obtain

where f(t) is the density function of T and the second equality follows from applying Bayes theorem. By rearranging again and solving a differential equation, we can use the hazard to compute the survival function via

The key question then is how to estimate the hazard and/or survival function.

## Non-Parametric Estimation with Kaplan Meier

In non-parametric survival analysis, we want to estimate the survival function S(t) without covariates, and with censoring. If we didn’t have censoring, we could start with the empirical CDF:

This equation is a succinct representation of: how many people have died by time *t*? The survival function would then be: how many people are still alive? However, we can’t answer this question as posed when some people are censored by time *t*.

While we don’t necessarily know how many people have survived by an arbitrary time *t*, we do know how many people in the study are still at risk. We can use this instead. Partition the study time into 0 < t_{1} < . . . < t_{n-1} < t_{n}, where each t_{i} is either an event time or a censoring time for a participant. Assume that participants can only lapse at observed event times. Let Y(t) be the number of people at risk at just before time *t*. Assuming no one dies at exactly the same time (no ties), we can look at each time someone died. We say that the probability of dying at that specific time is 1/Y(t), and say that the probability of dying at any other time is 0. We can then say that the probability of surviving at any event time T_{i}, given survival at previous candidate event times is:

The probability of surviving up to a time *t* is then:

We call this [1] the Kaplan Meier estimator. Under mild assumptions, including that participants have independent and identically distributed event times and that censoring and event times are independent, this gives an estimator that is consistent. The next figure gives an example of the Kaplan Meier estimator for a simple case.

Learn more about Hazard Ratios.

### Kaplan Meier R Example

In R we can use the Surv and survfit functions from the survival package to fit a Kaplan Meier model. We can also use ggsurvplot from the survminer package to make plots. Here we will use the ovarian cancer dataset from the survival package. We will stratify based on treatment group assignment.

```
library(survminer)
library(survival)
kaplan_meier <- Surv(time = ovarian[['futime']], event = ovarian[['fustat']])
kaplan_meier_treatment<-survfit(kaplan_meier~rx,data=ovarian, type='kaplan-meier',conf.type='log')
ggsurvplot(kaplan_meier_treatment,conf.int = 'True')
```

## Semi-Parametric Regression with Cox’s Proportional Hazards Model

Kaplan Meier makes sense when we don’t have covariates, but often we want to model how some covariates affect death risk. For instance, how does one’s weight affect death risk? One way to do this is to assume that covariates have a multiplicative effect on the hazard. This leads us to Cox’s proportional hazard model, which involves the following functional form for the hazard:

The baseline hazard λ_{0}(t) describes how the average person’s risk evolves over time. The relative risk exp(β^{T}x) describes how covariates affect the hazard. In particular, a unit increase in x_{i} leads to an increase of the hazard by a factor of exp(β_{i}).

Because of the non-parametric nuisance term λ_{0}(t), it is difficult to maximize the full likelihood for β directly. Cox’s insight [2] was that the assignment probabilities given the death times contain most of the information about β, and the remaining terms contain most of the information about λ_{0}(t). The assignment probabilities give the following *partial likelihood*

We can then maximize this to get an estimator of β. In [3,4] they show that this estimator is consistent and asymptotically normal.

### Cox Proportional Hazards R Example

In R, we can use the Surv and coxph functions from the survival package. For the ovarian cancer dataset, we notice from the Kaplan Meier example that treatment is not proportional. Under a proportional hazards assumption, the curves would have the same pattern but diverge. However, instead they move apart and then move back together. Further, treatment does seem to lead to different survival patterns over shorter time horizons. We should not use it as a covariate, but we can stratify based on it. In R we can regress on age and presence of residual disease.

```
cox_fit <- coxph(Surv(futime, fustat) ~ age + ecog.ps+strata(rx), data=ovarian)
summary(cox_fit)
```

which gives the following results

```
Call:
coxph(formula = Surv(futime, fustat) ~ age + ecog.ps + strata(rx),
data = ovarian)
n= 26, number of events= 12
coef exp(coef) se(coef) z Pr(>|z|)
age 0.13853 1.14858 0.04801 2.885 0.00391 **
ecog.ps -0.09670 0.90783 0.62994 -0.154 0.87800
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
age 1.1486 0.8706 1.0454 1.262
ecog.ps 0.9078 1.1015 0.2641 3.120
Concordance= 0.819 (se = 0.058 )
Likelihood ratio test= 12.71 on 2 df, p=0.002
Wald test = 8.43 on 2 df, p=0.01
Score (logrank) test = 12.24 on 2 df, p=0.002
```

this suggests that age has a significant multiplicative effect on death, and that a one year increase in age increases instantaneous risk by a factor of 1.15.

## Aalen’s Additive Model

Cox regression makes two strong assumptions: 1) that covariate effects are constant over time 2) that effects are multiplicative. Aalen’s additive model [5] relaxes the first, and replaces the second with the assumption that effects are additive. Here the hazard takes the form

As this is a linear model, we can estimate the cumulative regression functions using a least squares type procedure.

### Aalen’s Additive Model R Example

In R we can use the timereg package and the aalen function to estimate cumulative regression functions, which we can also plot.

```
library(timereg)
data(sTRACE)
# Fits Aalen model
out<-aalen(Surv(time,status==9)~age+sex+diabetes+chf+vf, sTRACE,max.time=7,n.sim=100)
summary(out)
par(mfrow=c(2,3))
plot(out)
```

This gives us

```
Additive Aalen Model
Test for nonparametric terms
Test for non-significant effects
Supremum-test of significance p-value H_0: B(t)=0
(Intercept) 7.29 0.00
age 8.63 0.00
sex 2.95 0.01
diabetes 2.31 0.24
chf 5.30 0.00
vf 2.95 0.03
Test for time invariant effects
Kolmogorov-Smirnov test
(Intercept) 0.57700
age 0.00866
sex 0.11900
diabetes 0.16200
chf 0.12900
vf 0.43500
p-value H_0:constant effect
(Intercept) 0.00
age 0.00
sex 0.18
diabetes 0.43
chf 0.06
vf 0.02
Cramer von Mises test
(Intercept) 0.875000
age 0.000179
sex 0.017700
diabetes 0.041200
chf 0.053500
vf 0.434000
p-value H_0:constant effect
(Intercept) 0.00
age 0.00
sex 0.29
diabetes 0.42
chf 0.02
vf 0.05
Call:
aalen(formula = Surv(time, status == 9) ~ age + sex + diabetes +
chf + vf, data = sTRACE, max.time = 7, n.sim = 100)
```

The results first test whether the cumulative regression functions are non-zero, and then whether the effects are constant. The plots of the cumulative regression functions are given below.

## Discussion

In this post we did a brief tour of several methods in survival analysis. We first described why right censoring requires us to develop new tools. We then described the survival function and the hazard. Next we discussed the non-parametric Kaplan Meier estimator and the semi-parametric Cox regression model. We concluded with Aalen’s additive model.

[1] Kaplan, Edward L., and Paul Meier. “Nonparametric estimation from incomplete observations.” Journal of the American statistical association 53, no. 282 (1958): 457-481.

[2] Cox, David R. “Regression models and life-tables.” In *Breakthroughs in statistics*, pp. 527-541. Springer, New York, NY, 1992.

[3] Tsiatis, Anastasios A. “A large sample study of Cox’s regression model.” *The Annals of Statistics* 9, no. 1 (1981): 93-108.

[4] Andersen, Per Kragh, and Richard David Gill. “Cox’s regression model for counting processes: a large sample study.” *The annals of statistics* (1982): 1100-1120.

[5] Aalen, Odd. “A model for nonparametric regression analysis of counting processes.” In *Mathematical statistics and probability theory*, pp. 1-25. Springer, New York, NY, 1980.

Hania says

Thank you very much for a wonderful article, which is very helpful. I have to draw the Kaplan meier curve to show the proportion of clinical trials with and without results (on y-axis) w.r.t their result reporting time in months (on x-axis). The event of interest is to see if the trial has either reported the result or not (if reported then 1 and 0 for not reported).

For all those without results, we don’t have trial result reporting date to calculate the survival time (that is the diff between trial completion date and result reporting date) and these are right censored because the trials did not report the results even till the date of our data collection.

What I know from the literature on KMC is “ information of time to event can be missing for some entries. For instance, at the end of a study, a subset of patients (trials in our case) may still have not undergone the event of interest (that is result reporting in our case). In this scenario, the real-time to event information is missing in some way. However, survival time until a certain point is still important information that can be easily incorporated in survival analysis. These data are referred to as right censored”. Based on this, I thought that I should then use the date of data collection in result reporting column for all those trials with missing result reporting dates (censored) but the issue is by doing this they are not counted as right censored and it assumes they were reported at that day. Any idea to solve this issue because I want to show both trials- the trials with results and those without results (censored) on the x-axis. How to deal with trials not reporting the results even till the time of our data collection. What I need to put in result reporting column for these trials. Any suggestion to deal with this issue will be appreciated

Hania says

Thank you very much for a wonderful article, which is very helpful. I have to draw the Kaplan meier curve to show the proportion of clinical trials with and without results (on y-axis) w.r.t their result reporting time in months (on x-axis). The event of interest is to see if the trial has either reported the result or not (if reported then 1 and 0 for not reported).

For all those without results, we don’t have trial result reporting date to calculate the survival time (that is the diff between trial completion date and result reporting date) and these are right censored because the trials did not report the results even till the date of our data collection.

What I know from the literature on KMC is “ information of time to event can be missing for some entries. For instance, at the end of a study, a subset of patients (trials in our case) may still have not undergone the event of interest (that is result reporting in our case). In this scenario, the real-time to event information is missing in some way. However, survival time until a certain point is still important information that can be easily incorporated in survival analysis. These data are referred to as right censored”. Based on this, I thought that I should then use the date of data collection in result reporting column for all those trials with missing result reporting dates (censored) but the issue is by doing this they are not counted as right censored and it assumes they were reported at that day. Any idea to solve this issue because I want to show both trials- the trials with results and those without results (censored) on the x-axis. How to deal with trials not reporting the results even till the time of our data collection. What I need to put in result reporting column for these trials. Any suggestion to deal with this issue will be appreciated.

jsti says

It’s a good cover about survival analysis. But can u more expand on what is parametric, non-parametric, semi-parametric? and how are we relating them in survival analysis?

Charles says

I am looking forward to getting this book especially in the area of time to event analysis.

agnes mugagga says

thank you very much.