Statistics By Jim

Making statistics intuitive

Posts Tagged: choosing analysis

Spearman’s Correlation Explained

By Jim Frost 45 Comments

Spearman’s correlation in statistics is a nonparametric alternative to Pearson’s correlation. Use Spearman’s correlation for data that follow curvilinear, monotonic relationships and for ordinal data. Statisticians also refer to Spearman’s rank order correlation coefficient as Spearman’s ρ (rho).

In this post, I’ll cover what all that means so you know when and why you should use Spearman’s correlation instead of the more common Pearson’s correlation. [Read more…] about Spearman’s Correlation Explained
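
As a quick taste of the difference, here is a minimal R sketch with made-up data (not an example from the post), where the relationship is monotonic but strongly curved:

set.seed(11)
x <- runif(50, 1, 10)
y <- exp(x) + rnorm(50, sd = 100)   # monotonic but curvilinear relationship

cor(x, y, method = "pearson")    # understates the monotonic association
cor(x, y, method = "spearman")   # rank-based, so it captures it (near 1)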

Filed Under: Basics Tagged With: analysis example, choosing analysis, conceptual, data types, Excel, graphs

Multiplication Rule for Calculating Probabilities

By Jim Frost 7 Comments

The multiplication rule in probability allows you to calculate the probability of multiple events occurring together using known probabilities of those events individually. There are two forms of this rule, the specific and general multiplication rules.

In this post, learn about when and how to use both the specific and general multiplication rules. Additionally, I’ll use and explain the standard notation for probabilities throughout, helping you learn how to interpret it. We’ll work through several example problems so you can see them in action. There’s even a bonus problem at the end! [Read more…] about Multiplication Rule for Calculating Probabilities
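
For a flavor of both rules, here is a minimal R sketch; the coin-and-die and two-aces scenarios are my own illustrations, not necessarily the post's examples:

# Specific rule (independent events): P(A and B) = P(A) * P(B)
p_heads <- 0.5
p_six   <- 1 / 6
p_heads * p_six                  # coin shows heads AND die shows six: about 0.083

# General rule (any events): P(A and B) = P(A) * P(B | A)
p_first_ace  <- 4 / 52
p_second_ace <- 3 / 51           # conditional on the first card being an ace
p_first_ace * p_second_ace       # two aces without replacement: about 0.0045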

Filed Under: Probability Tagged With: analysis example, choosing analysis, conceptual

Independent and Dependent Samples in Statistics

By Jim Frost 14 Comments

When comparing groups in your data, you can have either independent or dependent samples. The type of samples in your experimental design impacts sample size requirements, statistical power, the proper analysis, and even your study’s costs. Understanding the implications of each type of sample can help you design a better experiment. [Read more…] about Independent and Dependent Samples in Statistics
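
As a small illustration of how the sample type changes the proper analysis, here is an R sketch with simulated before/after measurements (made-up numbers):

set.seed(10)
before <- rnorm(20, mean = 100, sd = 10)
after  <- before + rnorm(20, mean = 2, sd = 3)   # same subjects measured twice: dependent samples

t.test(after, before, paired = TRUE)   # paired analysis uses the within-subject differences
t.test(after, before)                  # treating the samples as independent wastes power here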

Filed Under: Basics Tagged With: analysis example, choosing analysis, conceptual, experimental design

A Tour of Survival Analysis

By Alexander Moreno 5 Comments

Note: this is a guest post by Alexander Moreno, a Computer Science PhD student at the Georgia Institute of Technology. He blogs at www.boostedml.com

Survival analysis is an important subfield of statistics and biostatistics. These methods involve modeling the time to a first event such as death. In this post we give a brief tour of survival analysis. We first describe the motivation for survival analysis, and then describe the hazard and survival functions. We follow this with non-parametric estimation via the Kaplan-Meier estimator. Then we describe Cox’s proportional hazards model and, after that, Aalen’s additive model. Finally, we conclude with a brief discussion.

Why Survival Analysis: Right Censoring

Modeling first event times is important in many applications. This could be time to death for severe health conditions or time to failure of a mechanical system. If one always observed the event time and it was guaranteed to occur, one could model the distribution directly. For instance, in the non-parametric setting, one could use the empirical cumulative distribution function to estimate the probability of death by some time. In the parametric setting one could do non-negative regression.

However, in some cases one might not observe the event time: this is generally called right censoring. In clinical trials with death as the event, right censoring occurs when one of the following happens:

  1. Participants drop out of the study.
  2. The study reaches a pre-determined end time, and some participants have survived until then.
  3. The study ends after a certain number of participants have died.

In each case, we don’t know what happens to the surviving participants after they leave the study. We then face the question:

  • How can we model the empirical distribution or do non-negative regression when, for some individuals, we only observe a lower bound on their event time?

To picture right censoring, consider three participants. For participant 1, we observe the time of death. Participant 2 dropped out, and we know that they survived until then, but we don’t know what happened afterwards. For participant 3, we know that they survived until the pre-determined study end, but again we don’t know what happened afterwards.

The Survival Function and the Hazard

Two of the key tools in survival analysis are the survival function and the hazard. The survival function describes the probability of the event not having happened by a time t. The hazard describes the instantaneous rate of the first event at any time t.

More formally, let T be the event time of interest, such as the death time. Then the survival function is S(t)=P(T>t). We can also note that this is related to the cumulative distribution function F(t)=P(T\leq t) via S(t)=1-F(t).

For the hazard, the probability of the first event time being in the small interval [t,t+dt), given survival up to t, is P(T\in [t,t+dt)|T\geq t)=\lambda(t)dt.

Rearranging terms and taking limits we obtain

\lambda(t)=\lim_{dt\rightarrow 0}\frac{P(T\in [t,t+dt)|T\geq t)}{dt}=\frac{f(t)}{S(t)}

where f(t) is the density function of T, and the second equality follows from applying Bayes’ theorem. By rearranging again and solving a differential equation, we can recover the survival function from the hazard via

S(t)=\exp(-\int_0^t \lambda(s)ds)
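
As a quick numerical sanity check of this relationship, here is a sketch assuming a constant hazard \lambda (so the event time is exponential), comparing the hazard-integral formula with the closed-form exponential survival function:

lambda <- 0.5                 # assumed constant hazard
t_grid <- seq(0, 5, by = 0.5)

# S(t) via the hazard integral: exp(-integral of lambda(s) ds from 0 to t)
surv_from_hazard <- exp(-vapply(t_grid, function(u)
  integrate(function(s) rep(lambda, length(s)), 0, u)$value, numeric(1)))

# Closed form for an exponential event time
surv_closed_form <- pexp(t_grid, rate = lambda, lower.tail = FALSE)

all.equal(surv_from_hazard, surv_closed_form)   # TRUE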

The key question then is how to estimate the hazard and/or survival function.

Non-Parametric Estimation with Kaplan-Meier

In non-parametric survival analysis, we want to estimate the survival function S(t) without covariates, and with censoring. If we didn’t have censoring, we could start with the empirical CDF \hat{F}(t)=\frac{1}{n}\sum_{i=1}^n I(T_i\leq t). This is a succinct answer to the question: what fraction of people have died by time t? The survival function would then answer the complementary question: what fraction are still alive? However, we can’t answer these questions as posed when some people are censored by time t.

While we don’t necessarily know how many people have survived by an arbitrary time t, we do know how many people in the study are still at risk, and we can use this instead. Partition the study time into 0<t_1<\cdots<t_{n-1}<t_n, where each t_i is either an event time or a censoring time for a participant, and assume the event can only occur at observed event times. Let Y(t) be the number of people at risk just before time t. Assuming no two people die at exactly the same time (no ties), we look at each time someone died. We estimate the conditional probability of dying at that specific time, given survival up to it, as \frac{1}{Y(t)}, and take the probability of dying at any other time to be 0. The probability of surviving past an event time T_i, given survival through the earlier event times, is then 1-\frac{1}{Y(T_i)}. The estimated probability of surviving up to a time t is the product

\hat{S}(t)=\prod_{T_i\leq t}\left(1-\frac{1}{Y(T_i)}\right)

This is the Kaplan-Meier estimator [1]. Under mild assumptions, including that participants have independent and identically distributed event times and that censoring and event times are independent, this estimator is consistent.
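
To make the product concrete before turning to R’s built-in tools, here is a minimal sketch that computes the estimator by hand on a tiny made-up dataset (status 1 = death observed, 0 = censored):

time   <- c(2, 3, 3.5, 5, 8)
status <- c(1, 1, 0, 1, 0)

km <- 1
for (i in order(time)) {
  at_risk <- sum(time >= time[i])                    # Y(t): still under observation just before t
  if (status[i] == 1) km <- km * (1 - 1 / at_risk)   # a new factor only at observed death times
  cat(sprintf("t = %.1f  S(t) = %.3f\n", time[i], km))
}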

Kaplan-Meier R Example

In R, we can use the Surv and survfit functions from the survival package to fit a Kaplan-Meier model, and ggsurvplot from the survminer package to plot the curves. Here we use the ovarian cancer dataset from the survival package and stratify by treatment group assignment.

library(survival)
library(survminer)

# Build a survival object from follow-up time and censoring status
kaplan_meier <- Surv(time = ovarian[['futime']], event = ovarian[['fustat']])
# Fit Kaplan-Meier curves stratified by treatment group (rx)
kaplan_meier_treatment <- survfit(kaplan_meier ~ rx, data = ovarian,
                                  type = 'kaplan-meier', conf.type = 'log')
ggsurvplot(kaplan_meier_treatment, conf.int = TRUE)

Semi-Parametric Regression with Cox’s Proportional Hazards Model

The Kaplan-Meier estimator makes sense when we don’t have covariates, but often we want to model how covariates affect the death risk. For instance, how does one’s weight affect death risk? One way to do this is to assume that covariates have a multiplicative effect on the hazard. This leads us to Cox’s proportional hazards model, which uses the following functional form for the hazard:

\lambda(t)=\lambda_0(t)\exp(\beta^T x)

The baseline hazard \lambda_0(t) describes how risk evolves over time for a baseline individual, one whose covariates are all zero. The relative risk \exp(\beta^T x) describes how the covariates scale the hazard. In particular, a unit increase in x_i multiplies the hazard by a factor of \exp(\beta_i).

Because of the non-parametric nuisance term \lambda_0(t), it is difficult to maximize the full likelihood for \beta directly. Cox’s insight [2] was that the assignment probabilities given the death times contain most of the information about \beta, while the remaining terms contain most of the information about \lambda_0(t). The assignment probabilities give the following partial likelihood, where \delta_i=1 indicates that participant i’s event was observed rather than censored, and each denominator sums over the risk set at T_i, that is, over participants with T_k\geq T_i:

L(\beta)=\prod_{i:\delta_i=1}\frac{\exp(\beta^T x^{(i)})}{\sum_{k:T_k\geq T_i}\exp(\beta^T x^{(k)})}

We can then maximize this partial likelihood to get an estimator \hat{\beta} of \beta. References [3, 4] show that this estimator is consistent and asymptotically normal.
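
Here is a sketch of that maximization for a single covariate with no tied event times, using made-up data. It evaluates the partial log-likelihood directly and maximizes it with optimize; coxph, used below, does this properly and at scale.

# Cox partial log-likelihood for one covariate, assuming no tied event times
partial_loglik <- function(beta, time, status, x) {
  ll <- 0
  for (i in which(status == 1)) {    # sum over observed events only
    risk_set <- time >= time[i]      # everyone still at risk at time[i]
    ll <- ll + beta * x[i] - log(sum(exp(beta * x[risk_set])))
  }
  ll
}

time   <- c(4, 3, 1, 2, 5)   # made-up event/censoring times
status <- c(1, 1, 1, 0, 0)   # 1 = event observed, 0 = censored
x      <- c(0, 2, 1, 1, 0)   # single covariate

opt <- optimize(partial_loglik, c(-5, 5), time = time, status = status,
                x = x, maximum = TRUE)
opt$maximum   # beta-hat; matches coxph(Surv(time, status) ~ x) up to tolerance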

Cox Proportional Hazards R Example

In R, we can use the Surv and coxph functions from the survival package. For the ovarian cancer dataset, we notice from the Kaplan-Meier example that the treatment effect does not look proportional: under a proportional hazards assumption the curves would have the same pattern and steadily diverge, but instead they move apart and then move back together. Still, treatment does seem to lead to different survival patterns over shorter time horizons. We therefore should not use it as a covariate, but we can stratify on it. Here we regress on age and ECOG performance status, stratifying by treatment.

# Cox model: age and ECOG performance status as covariates, stratified by treatment (rx)
cox_fit <- coxph(Surv(futime, fustat) ~ age + ecog.ps + strata(rx), data = ovarian)
summary(cox_fit)

which gives the following results

Call:
coxph(formula = Surv(futime, fustat) ~ age + ecog.ps + strata(rx), 
    data = ovarian)

  n= 26, number of events= 12 

            coef exp(coef) se(coef)      z Pr(>|z|)   
age      0.13853   1.14858  0.04801  2.885  0.00391 **
ecog.ps -0.09670   0.90783  0.62994 -0.154  0.87800   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.1486     0.8706    1.0454     1.262
ecog.ps    0.9078     1.1015    0.2641     3.120

Concordance= 0.819  (se = 0.058 )
Likelihood ratio test= 12.71  on 2 df,   p=0.002
Wald test            = 8.43  on 2 df,   p=0.01
Score (logrank) test = 12.24  on 2 df,   p=0.002

This suggests that age has a significant multiplicative effect on the death hazard: a one-year increase in age multiplies the instantaneous risk by a factor of about 1.15.

Aalen’s Additive Model

Cox regression makes two strong assumptions: 1) covariate effects are constant over time, and 2) effects are multiplicative. Aalen’s additive model [5] relaxes the first assumption and replaces the second with the assumption that effects are additive. Here the hazard takes the form

\lambda_i(t)=\beta_0(t)+\beta_1(t)x^{(i)}_1+\cdots+\beta_p(t)x^{(i)}_p

As this is a linear model, we can estimate the cumulative regression functions using a least squares type procedure.

Aalen’s Additive Model R Example

In R, we can use the aalen function from the timereg package to estimate the cumulative regression functions and plot them.

library(timereg)
data(sTRACE)

# Fit Aalen's additive model (in sTRACE, status == 9 indicates death)
out <- aalen(Surv(time, status == 9) ~ age + sex + diabetes + chf + vf,
             data = sTRACE, max.time = 7, n.sim = 100)
summary(out)
par(mfrow = c(2, 3))   # arrange the cumulative regression plots in a grid
plot(out)

This gives us

Additive Aalen Model 

Test for nonparametric terms 

Test for non-significant effects 
            Supremum-test of significance p-value H_0: B(t)=0
(Intercept)                          7.29                0.00
age                                  8.63                0.00
sex                                  2.95                0.01
diabetes                             2.31                0.24
chf                                  5.30                0.00
vf                                   2.95                0.03

Test for time invariant effects 
                  Kolmogorov-Smirnov test
(Intercept)                       0.57700
age                               0.00866
sex                               0.11900
diabetes                          0.16200
chf                               0.12900
vf                                0.43500
            p-value H_0:constant effect
(Intercept)                        0.00
age                                0.00
sex                                0.18
diabetes                           0.43
chf                                0.06
vf                                 0.02
                    Cramer von Mises test
(Intercept)                      0.875000
age                              0.000179
sex                              0.017700
diabetes                         0.041200
chf                              0.053500
vf                               0.434000
            p-value H_0:constant effect
(Intercept)                        0.00
age                                0.00
sex                                0.29
diabetes                           0.42
chf                                0.02
vf                                 0.05

   
   
  Call: 
aalen(formula = Surv(time, status == 9) ~ age + sex + diabetes + 
    chf + vf, data = sTRACE, max.time = 7, n.sim = 100)

The results first test whether the cumulative regression functions are non-zero, and then whether the effects are constant over time. The plot(out) call draws the estimated cumulative regression functions.

Discussion

In this post we took a brief tour of several methods in survival analysis. We first described why right censoring requires us to develop new tools. We then described the survival function and the hazard. Next we discussed the non-parametric Kaplan-Meier estimator and the semi-parametric Cox regression model. We concluded with Aalen’s additive model.

[1] Kaplan, Edward L., and Paul Meier. “Nonparametric estimation from incomplete observations.” Journal of the American Statistical Association 53, no. 282 (1958): 457-481.
[2] Cox, David R. “Regression models and life-tables.” In Breakthroughs in Statistics, pp. 527-541. Springer, New York, NY, 1992.
[3] Tsiatis, Anastasios A. “A large sample study of Cox’s regression model.” The Annals of Statistics 9, no. 1 (1981): 93-108.
[4] Andersen, Per Kragh, and Richard David Gill. “Cox’s regression model for counting processes: a large sample study.” The Annals of Statistics (1982): 1100-1120.
[5] Aalen, Odd. “A model for nonparametric regression analysis of counting processes.” In Mathematical Statistics and Probability Theory, pp. 1-25. Springer, New York, NY, 1980.

Filed Under: Survival Tagged With: analysis example, choosing analysis, conceptual

Guidelines for Removing and Handling Outliers in Data

By Jim Frost 63 Comments

Outliers are unusual values in your dataset, and they can distort statistical analyses and violate their assumptions. Unfortunately, all analysts will confront outliers and be forced to make decisions about what to do with them. Given the problems they can cause, you might think that it’s best to remove them from your data. But, that’s not always the case. Removing outliers is legitimate only for specific reasons. [Read more…] about Guidelines for Removing and Handling Outliers in Data

Filed Under: Basics Tagged With: assumptions, choosing analysis, conceptual

Using Histograms to Understand Your Data

By Jim Frost 23 Comments

Histograms are graphs that display the distribution of your continuous data. They are fantastic exploratory tools because they reveal properties about your sample data in ways that summary statistics cannot. For instance, while the mean and standard deviation can numerically summarize your data, histograms bring your sample data to life.

In this blog post, I’ll show you how histograms reveal the shape of the distribution, its central tendency, and the spread of values in your sample data. You’ll also learn how to identify outliers, how histograms relate to probability distribution functions, and why you might need to use hypothesis tests with them.
[Read more…] about Using Histograms to Understand Your Data
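
If you want to follow along in R, here is a minimal sketch with a simulated sample (made-up parameters):

set.seed(1)
sample_data <- rnorm(500, mean = 100, sd = 15)   # simulated continuous data

# The histogram reveals shape, central tendency, and spread at a glance
hist(sample_data, breaks = 20, xlab = "Value", main = "Distribution of the sample")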

Filed Under: Basics Tagged With: choosing analysis, data types, graphs

Boxplots vs. Individual Value Plots: Comparing Groups

By Jim Frost 27 Comments

Use boxplots and individual value plots when you have a categorical grouping variable and a continuous outcome variable. The levels of the categorical variable form the groups in your data, and you measure the continuous variable within each group. Both types of charts help you compare distributions of measurements between the groups. Boxplots are also known as box and whisker plots. [Read more…] about Boxplots vs. Individual Value Plots: Comparing Groups
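
Here is a quick R sketch that draws both chart types for the same made-up groups, so you can compare what each one shows:

set.seed(2)
group <- rep(c("A", "B", "C"), each = 30)
value <- rnorm(90, mean = rep(c(10, 12, 11), each = 30))

boxplot(value ~ group, main = "Boxplots by group")
stripchart(value ~ group, vertical = TRUE, method = "jitter",
           pch = 16, main = "Individual value plot")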

Filed Under: Basics Tagged With: choosing analysis, data types, graphs

Using Post Hoc Tests with ANOVA

By Jim Frost 125 Comments

Post hoc tests are an integral part of ANOVA. When you use ANOVA to test the equality of at least three group means, statistically significant results indicate that not all of the group means are equal. However, ANOVA results do not identify which particular differences between pairs of means are significant. Use post hoc tests to explore differences between multiple group means while controlling the experiment-wise error rate.

In this post, I’ll show you what post hoc analyses are, the critical benefits they provide, and help you choose the correct one for your study. Additionally, I’ll show why failure to control the experiment-wise error rate will cause you to have severe doubts about your results. [Read more…] about Using Post Hoc Tests with ANOVA
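
As a minimal R illustration, with simulated scores for three hypothetical teaching methods (not the post's example), Tukey's HSD is one common post hoc test that controls the experiment-wise error rate:

set.seed(3)
scores <- data.frame(method = rep(c("A", "B", "C"), each = 20),
                     score  = rnorm(60, mean = rep(c(70, 75, 72), each = 20), sd = 5))

fit <- aov(score ~ method, data = scores)
summary(fit)    # overall F-test: are all the group means equal?
TukeyHSD(fit)   # pairwise differences with family-wise error control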

Filed Under: ANOVA Tagged With: analysis example, choosing analysis, conceptual, graphs, interpreting results

Introduction to Bootstrapping in Statistics with an Example

By Jim Frost 106 Comments

Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. This process allows you to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics. Bootstrap methods are alternative approaches to traditional hypothesis testing and are notable for being easier to understand and valid for more conditions.

In this blog post, I explain bootstrapping basics, compare bootstrapping to conventional statistical methods, and explain when it can be the better method. Additionally, I’ll work through an example using real data to create bootstrapped confidence intervals. [Read more…] about Introduction to Bootstrapping in Statistics with an Example
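
Here is a bare-bones percentile bootstrap in R with a made-up skewed sample, just to show the resampling idea; the post works through a fuller example with real data:

set.seed(4)
x <- rexp(50, rate = 1)   # skewed sample where normal-theory intervals can struggle

# Resample with replacement many times, collecting the statistic of interest
boot_means <- replicate(10000, mean(sample(x, replace = TRUE)))

quantile(boot_means, c(0.025, 0.975))   # 95% percentile bootstrap CI for the mean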

Filed Under: Hypothesis Testing Tagged With: analysis example, assumptions, choosing analysis, conceptual, distributions, graphs, interpreting results

Comparing Hypothesis Tests for Continuous, Binary, and Count Data

By Jim Frost 39 Comments

In a previous blog post, I introduced the basic concepts of hypothesis testing and explained the need for performing these tests. In this post, I’ll build on that and compare various types of hypothesis tests that you can use with different types of data, explore some of the options, and explain how to interpret the results. Along the way, I’ll point out important planning considerations, related analyses, and pitfalls to avoid. [Read more…] about Comparing Hypothesis Tests for Continuous, Binary, and Count Data

Filed Under: Hypothesis Testing Tagged With: choosing analysis, data types, interpreting results, quality improvement

Choosing the Correct Type of Regression Analysis

By Jim Frost 577 Comments


Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data. [Read more…] about Choosing the Correct Type of Regression Analysis

Filed Under: Regression Tagged With: choosing analysis, data types

How to Choose Between Linear and Nonlinear Regression

By Jim Frost 32 Comments

As you fit regression models, you might need to make a choice between linear and nonlinear regression models. The field of statistics can be weird. Despite their names, both forms of regression can fit curvature in your data. So, how do you choose? In this blog post, I show you how to choose between linear and nonlinear regression models. [Read more…] about How to Choose Between Linear and Nonlinear Regression
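
The surprise is that “linear” refers to the parameters, not the shape of the curve. Here is a small R sketch with simulated curved data (the model and starting values are my own assumptions) showing both kinds of fit:

set.seed(12)
x <- seq(1, 10, length.out = 60)
y <- 5 * x^1.8 + rnorm(60, sd = 5)   # simulated curvilinear relationship

fit_linear    <- lm(y ~ x + I(x^2))   # linear in the parameters, yet fits curvature
fit_nonlinear <- nls(y ~ a * x^b,     # parameters enter nonlinearly
                     start = list(a = 1, b = 1))

AIC(fit_linear, fit_nonlinear)   # one way to compare the two fits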

Filed Under: Regression Tagged With: analysis example, assumptions, choosing analysis, conceptual, interpreting results

Confidence Intervals vs Prediction Intervals vs Tolerance Intervals

By Jim Frost 36 Comments

Intervals are estimation methods in statistics that use sample data to produce ranges of values that are likely to contain the population value of interest. In contrast, point estimates are single value estimates of a population value. Of the different types of statistical intervals, confidence intervals are the most well-known. However, certain kinds of analyses and situations call for other types of ranges that provide different information. [Read more…] about Confidence Intervals vs Prediction Intervals vs Tolerance Intervals
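
Confidence and prediction intervals are easy to contrast in R using a built-in dataset; tolerance intervals need an add-on package (for example, the tolerance package), so this sketch covers only the first two:

fit <- lm(dist ~ speed, data = cars)   # built-in cars dataset
new <- data.frame(speed = 15)

predict(fit, new, interval = "confidence")   # range for the mean stopping distance at speed 15
predict(fit, new, interval = "prediction")   # wider range for an individual stopping distance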

Filed Under: Hypothesis Testing Tagged With: choosing analysis, conceptual

Nonparametric Tests vs. Parametric Tests

By Jim Frost 108 Comments

Nonparametric tests don’t require that your data follow the normal distribution. They’re also known as distribution-free tests and can provide benefits in certain situations. Typically, people who perform statistical hypothesis tests are more comfortable with parametric tests than nonparametric tests.

You’ve probably heard it’s best to use nonparametric tests if your data are not normally distributed—or something along these lines. That seems like an easy way to choose, but there’s more to the decision than that. [Read more…] about Nonparametric Tests vs. Parametric Tests
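
In R, the parametric and nonparametric versions of a two-group comparison sit side by side; here is a sketch with made-up skewed samples:

set.seed(5)
a <- rlnorm(30)                 # skewed samples
b <- rlnorm(30, meanlog = 0.5)

t.test(a, b)        # parametric: compares means, assumes approximate normality
wilcox.test(a, b)   # nonparametric Mann-Whitney: no normality requirement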

Filed Under: Hypothesis Testing Tagged With: assumptions, choosing analysis

How t-Tests Work: 1-sample, 2-sample, and Paired t-Tests

By Jim Frost 15 Comments

T-tests are statistical hypothesis tests that analyze one or two sample means. When you analyze your data with any t-test, the procedure reduces your entire sample to a single value, the t-value. In this post, I describe how each type of t-test calculates the t-value. I don’t explain this just so you can understand the calculation, but I describe it in a way that really helps you grasp how t-tests work. [Read more…] about How t-Tests Work: 1-sample, 2-sample, and Paired t-Tests
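
For the 1-sample case, the t-value really is just signal over noise, which a few lines of R make plain (simulated sample and a hypothesized mean of 5, both made up):

set.seed(6)
x  <- rnorm(25, mean = 5.3, sd = 1)   # simulated sample
mu <- 5                               # hypothesized population mean

signal <- mean(x) - mu              # distance between the sample mean and the hypothesis
noise  <- sd(x) / sqrt(length(x))   # standard error of the mean
signal / noise                      # the t-value

t.test(x, mu = mu)$statistic        # matches R's built-in calculation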

Filed Under: Hypothesis Testing Tagged With: choosing analysis, conceptual

Benefits of Welch’s ANOVA Compared to the Classic One-Way ANOVA

By Jim Frost 63 Comments

Welch’s ANOVA is an alternative to the traditional analysis of variance (ANOVA), and it offers some serious benefits. One-way analysis of variance determines whether differences between the means of at least three groups are statistically significant. For decades, introductory statistics classes have taught the classic Fisher’s one-way ANOVA that uses the F-test. It’s a standard statistical analysis, and you might think it’s pretty much set in stone by now. Surprise, there’s a significant change occurring in the world of one-way analysis of variance! [Read more…] about Benefits of Welch’s ANOVA Compared to the Classic One-Way ANOVA
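
In R, Welch’s ANOVA is one argument away from the classic F-test; here is a sketch with made-up groups whose spreads differ:

set.seed(7)
g <- factor(rep(c("A", "B", "C"), times = c(20, 30, 25)))
y <- rnorm(75, mean = rep(c(10, 11, 12), times = c(20, 30, 25)),
           sd = rep(c(1, 2, 4), times = c(20, 30, 25)))   # unequal variances

oneway.test(y ~ g, var.equal = FALSE)   # Welch's ANOVA (oneway.test's default)
oneway.test(y ~ g, var.equal = TRUE)    # classic one-way ANOVA, assumes equal variances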

Filed Under: ANOVA Tagged With: analysis example, assumptions, choosing analysis, conceptual, interpreting results

How to Analyze Likert Scale Data

By Jim Frost 144 Comments

How do you analyze Likert scale data? Likert scales are the most broadly used method for scaling responses in survey studies. Survey questions that ask you to indicate your level of agreement, from strongly agree to strongly disagree, use the Likert scale. The data in the worksheet are five-point Likert scale data for two groups. [Read more…] about How to Analyze Likert Scale Data
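
Here is a minimal R sketch of this kind of comparison, with simulated five-point responses for two made-up groups:

set.seed(8)
group1 <- sample(1:5, 50, replace = TRUE, prob = c(0.10, 0.20, 0.30, 0.25, 0.15))
group2 <- sample(1:5, 50, replace = TRUE, prob = c(0.05, 0.15, 0.25, 0.30, 0.25))

wilcox.test(group1, group2)   # nonparametric Mann-Whitney comparison of the two groups
t.test(group1, group2)        # the parametric alternative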

Filed Under: Hypothesis Testing Tagged With: assumptions, choosing analysis, conceptual

Multivariate ANOVA (MANOVA) Benefits and When to Use It

By Jim Frost 152 Comments

Multivariate ANOVA (MANOVA) extends the capabilities of analysis of variance (ANOVA) by assessing multiple dependent variables simultaneously. ANOVA statistically tests the differences between three or more group means. For example, if you have three different teaching methods and you want to evaluate the average scores for these groups, you can use ANOVA. However, ANOVA does have a drawback. It can assess only one dependent variable at a time. This limitation can be an enormous problem in certain circumstances because it can prevent you from detecting effects that actually exist. [Read more…] about Multivariate ANOVA (MANOVA) Benefits and When to Use It
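
Here is a minimal R sketch of a MANOVA for the teaching-method example, with two simulated outcome variables (made-up numbers):

set.seed(9)
method       <- factor(rep(c("A", "B", "C"), each = 20))
test_score   <- rnorm(60, mean = rep(c(70, 74, 72), each = 20), sd = 5)
satisfaction <- rnorm(60, mean = rep(c(3.2, 3.8, 3.5), each = 20), sd = 0.5)

fit <- manova(cbind(test_score, satisfaction) ~ method)
summary(fit)   # tests both dependent variables jointly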

Filed Under: ANOVA Tagged With: analysis example, choosing analysis, conceptual, interpreting results

