When comparing groups in your data, you can have either independent or dependent samples. The type of samples in your design impacts sample size requirements, statistical power, the proper analysis, and even your study’s costs. Understanding the implications of each type of sample can help you design a better study. [Read more…] about Independent and Dependent Samples in Statistics
Moving averages can smooth time series data, reveal underlying trends, and identify components for use in statistical modeling. Smoothing is the process of removing random variations that appear as coarseness in a plot of raw time series data. It reduces the noise to emphasize the signal that can contain trends and cycles. Analysts also refer to the smoothing process as filtering the data.
Developed in the 1920s, the moving average is the oldest process for smoothing data and continues to be a useful tool today. This method relies on the notion that observations close in time are likely to have similar values. Consequently, the averaging removes random variation, or noise, from the data.
In this post, I look at using moving averages to smooth time series data. This method is the simplest form of smoothing. In future posts, I’ll explore more complex ways of smoothing.
What are Moving Averages?
Moving averages are a series of averages calculated using sequential segments of data points over a series of values. They have a length, which defines the number of data points to include in each average.
One-sided moving averages
One-sided moving averages include the current and previous observations for each average. For example, the formula for a moving average (MA) of X at time t with a length of 7 is the following:
$$MA_t = \frac{X_t + X_{t-1} + X_{t-2} + X_{t-3} + X_{t-4} + X_{t-5} + X_{t-6}}{7}$$
In the graph, the circled one-sided moving average uses the seven observations that fall within the red interval. The subsequent moving average shifts the interval to the right by one observation, and so on.
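As a minimal sketch of the one-sided calculation, the R snippet below applies equal weights of 1/7 to the current and six previous observations using stats::filter with sides = 1. The series x is a made-up illustration, not data from this post.

```r
# Hypothetical example series (illustrative only, not the data used in this post)
x <- c(5, 7, 6, 9, 12, 10, 8, 11, 13, 9)

# One-sided moving average of length 7:
# equal weights on X_t, X_{t-1}, ..., X_{t-6}
ma_one_sided <- stats::filter(x, rep(1 / 7, 7), sides = 1)

# The first six values are NA because fewer than seven observations exist yet
print(ma_one_sided)
```

Each non-missing value is simply the mean of the seven most recent observations, so the seventh value equals mean(x[1:7]).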
Centered moving averages
Centered moving averages include both previous and future observations to calculate the average at a given point in time. In other words, a centered moving average uses the observations that surround a point in both directions and, consequently, is also known as a two-sided moving average. The formula for a centered moving average of X at time t with a length of 7 is the following:
$$MA_t = \frac{X_{t-3} + X_{t-2} + X_{t-1} + X_t + X_{t+1} + X_{t+2} + X_{t+3}}{7}$$
In the plot below, the circled centered moving average uses the seven observations in the red interval. The next moving average shifts the interval to the right by one.
Centered intervals work out evenly for an odd number of observations because they allow for an equal number of observations before and after the point being averaged. However, when you have an even length, the calculations must adjust for that by using a weighted moving average. For example, the formula for a centered moving average with a length of 8 is as follows:
$$MA_t = \frac{0.5X_{t-4} + X_{t-3} + X_{t-2} + X_{t-1} + X_t + X_{t+1} + X_{t+2} + X_{t+3} + 0.5X_{t+4}}{8}$$
For a length of 8, the calculations incorporate the formula for a length of 7 (t-3 through t+3). Then, they extend the segment by one observation in both directions (t-4 and t+4). However, those two observations each have half the weight, which yields the equivalent of 7 + 2*0.5 = 8 data points.
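The centered versions can be sketched the same way in R. Here stats::filter is called with sides = 2 so the weights are centered on each time point; for the length of 8, the two end weights are halved as described above. The series x is again hypothetical.

```r
# Hypothetical example series (illustrative only)
x <- c(5, 7, 6, 9, 12, 10, 8, 11, 13, 9, 7, 10)

# Centered moving average of length 7: equal weights on t-3 through t+3
ma_centered_7 <- stats::filter(x, rep(1 / 7, 7), sides = 2)

# Centered moving average of length 8: nine weights, with the two
# outermost observations (t-4 and t+4) counted at half weight
weights_8 <- c(0.5, rep(1, 7), 0.5) / 8
ma_centered_8 <- stats::filter(x, weights_8, sides = 2)

print(ma_centered_7)
print(ma_centered_8)
```

The weights in weights_8 sum to 1, which is what makes the result an average over the equivalent of 8 data points.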
Using Moving Averages to Reveal Trends
Moving averages can remove seasonal patterns to reveal underlying trends. In future posts, I’ll write more about time series components and incorporating them into models for accurate forecasting. For now, we’ll work through an example to visually assess a trend.
When there is a seasonal pattern in your data and you want to remove it, set the length of your moving average to equal the pattern’s length. If there is no seasonal pattern in your data, choose a length that makes sense. Longer lengths will produce smoother lines.
Note that the term “seasonal” pattern doesn’t necessarily indicate a meteorological season. Instead, it refers to a repeating pattern that has a fixed length in your data.
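To see why matching the length to the pattern removes it, here is a small R sketch on simulated data. The linear trend and the weekly effects are assumptions chosen for illustration; because every window of seven consecutive days contains each weekly effect exactly once, the effects cancel in the average.

```r
# Simulated series: a linear trend plus a repeating 7-day pattern
# (both are made-up values for illustration)
n_days <- 28
trend <- seq_len(n_days)
weekly_pattern <- c(-3, 2, 4, 1, 0, -1, -3)  # effects sum to zero
x <- trend + rep(weekly_pattern, length.out = n_days)

# One-sided moving average whose length equals the pattern length
smoothed <- stats::filter(x, rep(1 / 7, 7), sides = 1)

# Compare the raw series with the smoothed one
print(cbind(day = seq_len(n_days), raw = x, smoothed = as.numeric(smoothed)))
```

In this simulation the smoothed values from day 7 onward follow the trend exactly, lagged by three days (the midpoint of the seven-day window), with no trace of the weekly pattern.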
Time Series Example: Daily COVID-19 Deaths in Florida
For our example, I’ll use daily COVID-19 deaths in the State of Florida. The time series plot below displays a recurring pattern in the number of daily deaths.
This pattern likely reflects a data artifact. We know the coronavirus does not operate on a seven-day weekly schedule! Instead, it must reflect some human-based scheduling factor that influences when causes of death are determined and recorded. Some of these activities must be less likely to occur on weekends because the lowest day of the week is almost always Sunday, and weekends, in general, tend to be low. Tuesdays are often the highest day of the week. Perhaps that is when the weekend backlog shows up in the data?
Because of this seasonal pattern, the number of recorded deaths for a particular day depends on the day of the week you’re evaluating. Let’s remove this seasonal pattern to reveal the underlying trend component. The original data are from Johns Hopkins University. Download my Excel spreadsheet: Florida Deaths Time Series.
The graph displays one-sided moving averages with a length of 7 days for these data. Notice how the seasonal pattern is gone and the underlying trend is visible. Each moving average point is the daily average of the past seven days. We can look at any date, and the day of the week no longer plays a role. We can see that the trend increases up to April 17, 2020. It plateaus, with a slight decline, until around June 22nd. Since then, there is an upward trend that appears to steepen at the end.
Smoothing time series data helps reveal the underlying trends in your data. That process can aid in the simple visual assessment of the data, as seen in this article. However, it can also help you fit the best time series model to your data. The moving average is a simple but very effective calculation!
Note: this is a guest post by Alexander Moreno, a Computer Science PhD student at the Georgia Institute of Technology. He blogs at www.boostedml.com
Survival analysis is an important subfield of statistics and biostatistics. These methods involve modeling the time to a first event such as death. In this post we give a brief tour of survival analysis. We first describe the motivation for survival analysis, and then describe the hazard and survival functions. We follow this with non-parametric estimation via the Kaplan Meier estimator. Then we describe Cox’s proportional hazards model and after that Aalen’s additive model. Finally, we conclude with a brief discussion.
Why Survival Analysis: Right Censoring
Modeling first event times is important in many applications. This could be time to death for severe health conditions or time to failure of a mechanical system. If one always observed the event time and it was guaranteed to occur, one could model the distribution directly. For instance, in the non-parametric setting, one could use the empirical cumulative distribution function to estimate the probability of death by some time. In the parametric setting one could do non-negative regression.
However, in some cases one might not observe the event time: this is generally called right censoring. In clinical trials with death as the event, this occurs when one of the following happens: 1) participants drop out of the study, 2) the study reaches a pre-determined end time, and some participants have survived until the end, or 3) the study ends when a certain number of participants have died. In each case, after the surviving participants have left the study, we don’t know what happens to them. We then have the question:
- How can we model the empirical distribution or do non-negative regression when for some individuals, we only observe a lower bound on their event time?
The above figure illustrates right censoring. For participant 1 we see when they died. Participant 2 dropped out, and we know that they survived until then, but don’t know what happened afterwards. For participant 3, we know that they survived until the pre-determined study end, but again don’t know what happened afterwards.
The Survival Function and the Hazard
Two of the key tools in survival analysis are the survival function and the hazard. The survival function describes the probability of the event not having happened by a time $t$. The hazard describes the instantaneous rate of the first event at any time $t$.
More formally, let $T$ be the event time of interest, such as the death time. Then the survival function is $S(t) = P(T > t)$. We can also note that this is related to the cumulative distribution function $F(t) = P(T \le t)$ via $S(t) = 1 - F(t)$.
For the hazard $\lambda(t)$, the probability of the first event time being in the small interval $[t, t + dt)$, given survival up to $t$, is $P(T \in [t, t + dt) \mid T \ge t) \approx \lambda(t)\,dt$. This is illustrated in the following figure.
Rearranging terms and taking limits we obtain
$$\lambda(t) = \lim_{dt \to 0} \frac{P(T \in [t, t + dt) \mid T \ge t)}{dt} = \frac{f(t)}{S(t)}$$
where $f(t)$ is the density function of $T$ and the second equality follows from applying Bayes’ theorem. By rearranging again and solving a differential equation, we can use the hazard to compute the survival function via
$$S(t) = \exp\left(-\int_0^t \lambda(s)\,ds\right)$$
The key question then is how to estimate the hazard and/or survival function.
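The link between the hazard and the survival function can be checked numerically. The sketch below assumes a Weibull event time (the shape and scale values are arbitrary choices for illustration), builds the hazard as the density divided by the survival function, integrates it, and compares the result with the survival function computed directly.

```r
# Assumed Weibull event-time distribution (parameters chosen for illustration)
shape <- 1.5
scale <- 2

# Hazard = density / survival function
hazard <- function(t) {
  dweibull(t, shape, scale) / pweibull(t, shape, scale, lower.tail = FALSE)
}

t0 <- 1.7

# Cumulative hazard: integral of the hazard from 0 to t0
cumulative_hazard <- integrate(hazard, lower = 0, upper = t0)$value

# Survival recovered from the hazard versus survival computed directly
survival_from_hazard <- exp(-cumulative_hazard)
survival_direct <- pweibull(t0, shape, scale, lower.tail = FALSE)

print(c(from_hazard = survival_from_hazard, direct = survival_direct))
```

The two values should agree up to numerical integration error.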
Non-Parametric Estimation with Kaplan Meier
In non-parametric survival analysis, we want to estimate the survival function without covariates, and with censoring. If we didn’t have censoring, we could start with the empirical CDF $\hat{F}(t) = \frac{1}{n} \sum_{i=1}^{n} 1(T_i \le t)$. This equation is a succinct representation of: what fraction of people have died by time $t$? The survival function would then be: what fraction of people are still alive? However, we can’t answer this question as posed when some people are censored by time $t$.
While we don’t necessarily know how many people have survived by an arbitrary time $t$, we do know how many people in the study are still at risk. We can use this instead. Partition the study time into $0 < t_1 < t_2 < \cdots < t_n$, where each $t_i$ is either an event time or a censoring time for a participant. Assume that participants can only lapse at observed event times. Let $Y_i$ be the number of people at risk just before time $t_i$. Assuming no one dies at exactly the same time (no ties), we can look at each time someone died. We say that the probability of dying at that specific time is $1/Y_i$, and say that the probability of dying at any other time is $0$. We can then say that the probability of surviving at any event time $t_i$, given survival at previous candidate event times, is $1 - 1/Y_i$. The probability of surviving up to a time $t$ is then
$$\hat{S}(t) = \prod_{i:\ t_i \le t,\ t_i \text{ a death time}} \left(1 - \frac{1}{Y_i}\right)$$
We call this [1] the Kaplan Meier estimator. Under mild assumptions, including that participants have independent and identically distributed event times and that censoring and event times are independent, this gives an estimator that is consistent. The next figure gives an example of the Kaplan Meier estimator for a simple case.
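To make the product formula concrete, here is a hand computation in base R on a tiny made-up dataset with no ties (the times and censoring indicators are hypothetical). Censoring times contribute a factor of 1, and death times contribute 1 - 1/Y_i.

```r
# Hypothetical data: observed times and event indicators
# (event = 1 means death observed, event = 0 means censored)
time  <- c(2, 3, 5, 7, 8)
event <- c(1, 0, 1, 0, 1)

# Sort by time so the risk set shrinks by one at each step (no ties assumed)
ord <- order(time)
time <- time[ord]
event <- event[ord]

n <- length(time)
at_risk <- n - seq_len(n) + 1  # Y_i: number at risk just before t_i

# Conditional survival at each time: 1 - 1/Y_i at deaths, 1 at censoring times
step_survival <- ifelse(event == 1, 1 - 1 / at_risk, 1)

# Kaplan Meier estimate: running product of the conditional survival terms
km_estimate <- cumprod(step_survival)

print(data.frame(time, event, at_risk, km_estimate))
```

With these numbers the estimate drops to 4/5 at time 2, to 4/5 * 2/3 at time 5, and to 0 at time 8 when the last participant at risk dies; it stays flat at the censoring times.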
Kaplan Meier R Example
In R we can use the Surv and survfit functions from the survival package to fit a Kaplan Meier model. We can also use ggsurvplot from the survminer package to make plots. Here we will use the ovarian cancer dataset from the survival package. We will stratify based on treatment group assignment.
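library(survminer)
library(survival)
# Survival object built from follow-up time and event status
kaplan_meier <- Surv(time = ovarian[['futime']], event = ovarian[['fustat']])
# Kaplan Meier curves stratified by treatment group (rx)
kaplan_meier_treatment <- survfit(kaplan_meier ~ rx, data = ovarian, type = 'kaplan-meier', conf.type = 'log')
ggsurvplot(kaplan_meier_treatment, data = ovarian, conf.int = TRUE)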
Semi-Parametric Regression with Cox’s Proportional Hazards Model
Kaplan Meier makes sense when we don’t have covariates, but often we want to model how some covariates affect death risk. For instance, how does one’s weight affect death risk? One way to do this is to assume that covariates have a multiplicative effect on the hazard. This leads us to Cox’s proportional hazards model, which involves the following functional form for the hazard:
$$\lambda(t \mid x) = \lambda_0(t) \exp(\beta^T x)$$
The baseline hazard $\lambda_0(t)$ describes how the average person’s risk evolves over time. The relative risk $\exp(\beta^T x)$ describes how covariates affect the hazard. In particular, a unit increase in $x_j$ multiplies the hazard by a factor of $\exp(\beta_j)$.
Because of the non-parametric nuisance term $\lambda_0(t)$, it is difficult to maximize the full likelihood for $\beta$ directly. Cox’s insight [2] was that the assignment probabilities given the death times contain most of the information about $\beta$, and the remaining terms contain most of the information about $\lambda_0(t)$. The assignment probabilities give the following partial likelihood
$$L(\beta) = \prod_{i:\ \delta_i = 1} \frac{\exp(\beta^T x_i)}{\sum_{j \in R(t_i)} \exp(\beta^T x_j)}$$
where $\delta_i = 1$ if the death of participant $i$ was observed and $R(t_i)$ is the set of participants still at risk at time $t_i$.
We can then maximize this to get an estimator of $\beta$. The authors of [3,4] show that this estimator is consistent and asymptotically normal.
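As a rough illustration of the partial likelihood, the base R sketch below evaluates its logarithm for a single covariate on a small made-up dataset with no tied event times, then maximizes it with optimize. The data and the search interval are assumptions for illustration; in practice coxph performs this fit.

```r
# Hypothetical toy data with no tied event times
time  <- c(1, 2, 3, 4, 5, 6)
event <- c(1, 1, 0, 1, 1, 0)   # 1 = death observed, 0 = censored
x     <- c(1.2, 0.3, 0.8, -0.5, 0.1, -1.0)

# Log partial likelihood for a single coefficient beta:
# for each observed death, compare the participant who died
# against everyone still at risk at that time
partial_loglik <- function(beta) {
  total <- 0
  for (i in which(event == 1)) {
    risk_set <- time >= time[i]
    total <- total + beta * x[i] - log(sum(exp(beta * x[risk_set])))
  }
  total
}

# Maximize over an assumed search interval
fit <- optimize(partial_loglik, interval = c(-5, 5), maximum = TRUE)
beta_hat <- fit$maximum

print(c(beta_hat = beta_hat, loglik = fit$objective))
```

Note that the baseline hazard never appears in partial_loglik, which is the point of Cox’s construction. At beta = 0 every participant in a risk set is equally likely to be the one who dies, so the log partial likelihood reduces to minus the sum of the logs of the risk set sizes.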
Cox Proportional Hazards R Example
In R, we can use the Surv and coxph functions from the survival package. For the ovarian cancer dataset, we notice from the Kaplan Meier example that the treatment effect does not look proportional. Under a proportional hazards assumption, the curves would have the same pattern but diverge. However, instead they move apart and then move back together. Further, treatment does seem to lead to different survival patterns over shorter time horizons. We should not use it as a covariate, but we can stratify based on it. In R we can regress on age and ECOG performance status (ecog.ps).
cox_fit <- coxph(Surv(futime, fustat) ~ age + ecog.ps + strata(rx), data = ovarian)
summary(cox_fit)
which gives the following results
Call:
coxph(formula = Surv(futime, fustat) ~ age + ecog.ps + strata(rx), data = ovarian)

  n= 26, number of events= 12

            coef exp(coef) se(coef)      z Pr(>|z|)
age      0.13853   1.14858  0.04801  2.885  0.00391 **
ecog.ps -0.09670   0.90783  0.62994 -0.154  0.87800
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        exp(coef) exp(-coef) lower .95 upper .95
age        1.1486     0.8706    1.0454     1.262
ecog.ps    0.9078     1.1015    0.2641     3.120

Concordance= 0.819  (se = 0.058 )
Likelihood ratio test= 12.71  on 2 df,   p=0.002
Wald test            = 8.43  on 2 df,   p=0.01
Score (logrank) test = 12.24  on 2 df,   p=0.002

This suggests that age has a significant multiplicative effect on the hazard of death, and that a one year increase in age increases instantaneous risk by a factor of 1.15.
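The factor of 1.15 and its confidence interval can be reproduced from the reported coefficient and standard error for age. The sketch below uses only the two numbers printed in the output above and assumes the usual normal approximation for the Wald interval.

```r
# Coefficient and standard error for age, copied from the coxph output
coef_age <- 0.13853
se_age   <- 0.04801

# Hazard ratio for a one-year increase in age
hazard_ratio <- exp(coef_age)

# Wald z statistic and two-sided p-value
z_value <- coef_age / se_age
p_value <- 2 * pnorm(-abs(z_value))

# 95% confidence interval: exponentiate the interval for the coefficient
ci_95 <- exp(coef_age + c(-1, 1) * qnorm(0.975) * se_age)

print(round(c(hazard_ratio = hazard_ratio, z = z_value, p = p_value,
              lower = ci_95[1], upper = ci_95[2]), 4))
```

These values match the exp(coef), z, and lower .95 / upper .95 columns in the summary for age.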
Aalen’s Additive Model
Cox regression makes two strong assumptions: 1) that covariate effects are constant over time and 2) that effects are multiplicative. Aalen’s additive model [5] relaxes the first, and replaces the second with the assumption that effects are additive. Here the hazard takes the form
$$\lambda(t \mid x) = \beta_0(t) + \beta_1(t) x_1 + \cdots + \beta_p(t) x_p$$
As this is a linear model, we can estimate the cumulative regression functions $B_j(t) = \int_0^t \beta_j(s)\,ds$ using a least squares type procedure.
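One way to picture the least squares type procedure: at each observed event time, regress the indicator of who died on the covariates of everyone still at risk, and add the resulting coefficients to a running total. The base R sketch below does this on made-up data. The function name aalen_increments and the dataset are hypothetical, and this is a simplified illustration, not the implementation used by the timereg package.

```r
# Simplified sketch of Aalen's least squares estimator (hypothetical helper)
aalen_increments <- function(time, event, X) {
  X <- as.matrix(X)
  event_times <- sort(unique(time[event == 1]))
  B <- matrix(NA_real_, nrow = length(event_times), ncol = ncol(X))
  current <- rep(0, ncol(X))
  for (k in seq_along(event_times)) {
    at_risk <- time >= event_times[k]
    Xk <- X[at_risk, , drop = FALSE]
    # 1 for the participant who died at this time, 0 for the others at risk
    dN <- as.numeric(event[at_risk] == 1 & time[at_risk] == event_times[k])
    XtX <- crossprod(Xk)
    # Stop once the at-risk design matrix is no longer of full rank
    if (qr(XtX)$rank < ncol(X)) break
    # Least squares increment, accumulated into the cumulative functions
    current <- current + drop(solve(XtX, crossprod(Xk, dN)))
    B[k, ] <- current
  }
  keep <- !is.na(B[, 1])
  list(time = event_times[keep], B = B[keep, , drop = FALSE])
}

# Hypothetical toy data: one binary covariate plus an intercept
time  <- c(2, 3, 5, 7, 8, 10)
event <- c(1, 0, 1, 1, 0, 1)
x     <- c(0, 1, 1, 0, 1, 0)

fit <- aalen_increments(time, event, cbind(intercept = 1, x = x))
print(cbind(time = fit$time, fit$B))
```

With an intercept only, each increment is 1 divided by the number at risk, so the sketch reduces to the Nelson-Aalen estimate of the cumulative hazard. In the toy data the last event time is dropped because only one participant remains at risk, which is not enough to estimate two coefficients.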
Aalen’s Additive Model R Example
In R we can use the timereg package and the aalen function to estimate cumulative regression functions, which we can also plot.
library(timereg)
data(sTRACE)
# Fits Aalen model
out <- aalen(Surv(time, status == 9) ~ age + sex + diabetes + chf + vf, sTRACE, max.time = 7, n.sim = 100)
summary(out)
# Plot the cumulative regression functions in a 2 x 3 grid
par(mfrow = c(2, 3))
plot(out)
This gives us
Additive Aalen Model

Test for nonparametric terms

Test for non-significant effects
            Supremum-test of significance  p-value H_0: B(t)=0
(Intercept)                          7.29                 0.00
age                                  8.63                 0.00
sex                                  2.95                 0.01
diabetes                             2.31                 0.24
chf                                  5.30                 0.00
vf                                   2.95                 0.03

Test for time invariant effects
            Kolmogorov-Smirnov test  p-value H_0:constant effect
(Intercept)                 0.57700                         0.00
age                         0.00866                         0.00
sex                         0.11900                         0.18
diabetes                    0.16200                         0.43
chf                         0.12900                         0.06
vf                          0.43500                         0.02

            Cramer von Mises test  p-value H_0:constant effect
(Intercept)              0.875000                         0.00
age                      0.000179                         0.00
sex                      0.017700                         0.29
diabetes                 0.041200                         0.42
chf                      0.053500                         0.02
vf                       0.434000                         0.05

Call:
aalen(formula = Surv(time, status == 9) ~ age + sex + diabetes + chf + vf, data = sTRACE, max.time = 7, n.sim = 100)
The results first test whether the cumulative regression functions are non-zero, and then whether the effects are constant. The plots of the cumulative regression functions are given below.
In this post we did a brief tour of several methods in survival analysis. We first described why right censoring requires us to develop new tools. We then described the survival function and the hazard. Next we discussed the non-parametric Kaplan Meier estimator and the semi-parametric Cox regression model. We concluded with Aalen’s additive model.
[1] Kaplan, Edward L., and Paul Meier. “Nonparametric estimation from incomplete observations.” Journal of the American Statistical Association 53, no. 282 (1958): 457-481.
[2] Cox, David R. “Regression models and life-tables.” In Breakthroughs in Statistics, pp. 527-541. Springer, New York, NY, 1992.
[3] Tsiatis, Anastasios A. “A large sample study of Cox’s regression model.” The Annals of Statistics 9, no. 1 (1981): 93-108.
[4] Andersen, Per Kragh, and Richard David Gill. “Cox’s regression model for counting processes: a large sample study.” The Annals of Statistics (1982): 1100-1120.
[5] Aalen, Odd. “A model for nonparametric regression analysis of counting processes.” In Mathematical Statistics and Probability Theory, pp. 1-25. Springer, New York, NY, 1980.
Chi-squared tests of independence determine whether a relationship exists between two categorical variables. Do the values of one categorical variable depend on the value of the other categorical variable? If the two variables are independent, knowing the value of one variable provides no information about the value of the other variable.
I’ve previously written about Pearson’s chi-square test of independence using a fun Star Trek example. Are the uniform colors related to the chances of dying? You can test the notion that the infamous red shirts have a higher likelihood of dying. In that post, I focused on the purpose of the test, applied it to this example, and interpreted the results.
In this post, I’ll take a bit of a different approach. I’ll show you the nuts and bolts of how to calculate the expected values, chi-square value, and degrees of freedom. Then you’ll learn how to use the chi-squared distribution in conjunction with the degrees of freedom to calculate the p-value. [Read more…] about How the Chi-Squared Test of Independence Works
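As a small preview of those calculations, the R sketch below works through them on a made-up 2 x 3 table of counts (the numbers are hypothetical, not the Star Trek data). Expected counts come from the row and column totals, and the p-value comes from the chi-squared distribution with (rows - 1) * (columns - 1) degrees of freedom.

```r
# Hypothetical table of observed counts (illustrative only)
observed <- matrix(c(20, 30, 50,
                     30, 30, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("A", "B"),
                                   outcome = c("x", "y", "z")))

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared statistic: sum of (observed - expected)^2 / expected
chi_square <- sum((observed - expected)^2 / expected)

# Degrees of freedom and p-value from the chi-squared distribution
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_square, df = df, lower.tail = FALSE)

print(c(chi_square = chi_square, df = df, p_value = p_value))
```

The same statistic can be obtained from R’s built-in chisq.test(observed), which is a convenient way to check a hand calculation.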
Use a variances test to determine whether the variability of two groups differs. In this post, we’ll work through a two-sample variances test that Excel provides. Even if Excel isn’t your primary statistical software, this post provides an excellent introduction to variance tests. Excel refers to this analysis as F-Test Two-Sample for Variances. [Read more…] about How to Test Variances in Excel
Use two-way ANOVA to assess differences between the group means that are defined by two categorical factors. In this post, we’ll work through two-way ANOVA using Excel. Even if Excel isn’t your main statistical package, this post is an excellent introduction to two-way ANOVA. Excel refers to this analysis as two factor ANOVA. [Read more…] about How to do Two-Way ANOVA in Excel
Outliers are data points that are far from other data points. In other words, they’re unusual values in a dataset. Outliers are problematic for many statistical analyses because they can cause tests either to miss significant findings or to distort real results.
Unfortunately, there are no strict statistical rules for definitively identifying outliers. Finding outliers depends on subject-area knowledge and an understanding of the data collection process. While there is no solid mathematical definition, there are guidelines and statistical tests you can use to find outlier candidates. [Read more…] about 5 Ways to Find Outliers in Your Data
Use one-way ANOVA to determine whether the means of at least three groups are different. Excel refers to this test as Single Factor ANOVA. This post is an excellent introduction to performing and interpreting one-way ANOVA even if Excel isn’t your primary statistical software package. [Read more…] about How to do One-Way ANOVA in Excel
Excel can perform various statistical analyses, including t-tests. It is an excellent option because nearly everyone can access Excel. This post is a great introduction to performing and interpreting t-tests even if Excel isn’t your primary statistical software package.
In this post, I provide step-by-step instructions for using Excel to perform t-tests. Importantly, I also show you how to select the correct form of t-test, choose the right options, and interpret the results. I also include links to additional resources I’ve written, which present clear explanations of relevant t-test concepts that you won’t find in Excel’s documentation. And, I use an example dataset for us to work through and interpret together! [Read more…] about How to do t-Tests in Excel
In the Monty Hall Problem, Monty presents you with three doors, one of which hides a prize. He asks you to pick one door, which remains closed. Monty then opens one of the other doors that does not have the prize. This process leaves two unopened doors: your original choice and one other. He allows you to switch from your initial choice to the other unopened door. Do you accept the offer?
If you accept his offer to switch doors, you’re twice as likely to win (66% versus 33%) as you are if you stay with your original choice.
The solution to the Monty Hall Problem is tricky and counter-intuitive. It tripped up many experts back in the 1980s. However, the correct answer to the Monty Hall Problem is now well established using a variety of methods. It has been confirmed mathematically, with computer simulations, and with empirical experiments, including on television by both the Mythbusters (CONFIRMED!) and James May’s Man Lab. You won’t find any statisticians who disagree with the solution.
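A simulation of this kind takes only a few lines of R. The sketch below is one possible version, not necessarily the simulation used in the full post; the helper name play_once and the seed are arbitrary choices. Each game places the prize at random, makes a random first pick, has Monty open a door that hides no prize and was not picked, and records whether staying or switching wins.

```r
set.seed(1)  # arbitrary seed so the run is reproducible

# Play one game and report whether each strategy wins (hypothetical helper)
play_once <- function() {
  doors <- 1:3
  prize <- sample(doors, 1)
  pick <- sample(doors, 1)
  # Monty can only open a door that is neither the prize nor the pick
  can_open <- setdiff(doors, c(prize, pick))
  opened <- can_open[sample.int(length(can_open), 1)]
  # Switching means taking the one remaining unopened door
  switched <- setdiff(doors, c(pick, opened))
  c(stay = pick == prize, switch = switched == prize)
}

n_games <- 10000
results <- replicate(n_games, play_once())

# Proportion of games won under each strategy
win_rates <- rowMeans(results)
print(win_rates)
```

Because exactly one of the two strategies wins every game, the two win rates always add up to 1, and the switching rate should land close to two thirds.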
In this post, I’ll explore aspects of this problem that have arisen in discussions with some stubborn resisters to the notion that you can increase your chances of winning by switching!
The Monty Hall problem provides a fun way to explore issues that relate to hypothesis testing. I’ve got a lot of fun lined up for this post, including the following!
- Using a computer simulation to play the game 10,000 times.
- Assessing sampling distributions to compare the 66% hypothesis to another contender.
- Performing a power and sample size analysis to determine the number of times you need to play the Monty Hall game to get an answer.
- Conducting an experiment by playing the game repeatedly myself, recording the results, and using a proportions hypothesis test to draw conclusions! [Read more…] about Revisiting the Monty Hall Problem with Hypothesis Testing
Post hoc tests are an integral part of ANOVA. When you use ANOVA to test the equality of at least three group means, statistically significant results indicate that not all of the group means are equal. However, ANOVA results do not identify which particular differences between pairs of means are significant. Use post hoc tests to explore differences between multiple group means while controlling the experiment-wise error rate.
In this post, I’ll show you what post hoc analyses are, the critical benefits they provide, and help you choose the correct one for your study. Additionally, I’ll show why failure to control the experiment-wise error rate will cause you to have severe doubts about your results. [Read more…] about Using Post Hoc Tests with ANOVA
Choosing whether to perform a one-tailed or a two-tailed hypothesis test is one of the methodology decisions you might need to make for your statistical analysis. This choice can have critical implications for the types of effects it can detect, the statistical power of the test, and potential errors.
In this post, you’ll learn about the differences between one-tailed and two-tailed hypothesis tests and their advantages and disadvantages. I include examples of both types of statistical tests. In my next post, I cover the decision between one and two-tailed tests in more detail.
[Read more…] about One-Tailed and Two-Tailed Hypothesis Tests Explained
Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. This process allows you to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics. Bootstrap methods are alternative approaches to traditional hypothesis testing and are notable for being easier to understand and valid for more conditions.
In this blog post, I explain bootstrapping basics, compare bootstrapping to conventional statistical methods, and explain when it can be the better method. Additionally, I’ll work through an example using real data to create bootstrapped confidence intervals. [Read more…] about Introduction to Bootstrapping in Statistics with an Example
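For a feel of the resampling idea, here is a minimal R sketch that builds a bootstrapped confidence interval for a mean. The sample values are made up, and the percentile interval used here is one common choice, assumed for illustration; the full post may use different data and settings.

```r
set.seed(42)  # arbitrary seed so the run is reproducible

# Hypothetical sample (illustrative values only)
x <- c(12.1, 9.8, 11.4, 10.2, 13.5, 8.9, 10.7, 12.8, 9.5, 11.9)

# Resample the data with replacement many times, keeping each resample's mean
n_resamples <- 5000
boot_means <- replicate(n_resamples, mean(sample(x, replace = TRUE)))

# Standard error and 95% percentile confidence interval from the resamples
boot_se <- sd(boot_means)
boot_ci <- quantile(boot_means, c(0.025, 0.975))

print(boot_se)
print(boot_ci)
```

Swapping mean for median or another statistic inside replicate gives the corresponding bootstrap distribution without any new formulas.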
Determining a good sample size for a study is always an important issue. After all, using the wrong sample size can doom your study from the start. Fortunately, power analysis can find the answer for you. Power analysis combines statistical analysis, subject-area knowledge, and your requirements to help you derive the optimal sample size for your study.
Statistical power in a hypothesis test is the probability that the test will detect an effect that actually exists. As you’ll see in this post, both under-powered and over-powered studies are problematic. Let’s learn how to find a good sample size for your study! [Read more…] about Estimating a Good Sample Size for Your Study Using Power Analysis
Interaction effects occur when the effect of one variable depends on the value of another variable. Interaction effects are common in regression analysis, ANOVA, and designed experiments. In this blog post, I explain interaction effects, how to interpret them in statistical designs, and the problems you will face if you don’t include them in your model. [Read more…] about Understanding Interaction Effects in Statistics
Log-log plots display data in two dimensions where both axes use logarithmic scales. When one variable changes as a constant power of another, a log-log graph shows the relationship as a straight line. In this post, I’ll show you why these graphs are valuable and how to interpret them. [Read more…] about Using Log-Log Plots to Determine Whether Size Matters
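The straight-line property follows from taking logs: if y = a * x^b, then log(y) = log(a) + b * log(x), so the slope on a log-log plot is the power b. The R sketch below checks this on made-up values (a = 3 and b = 0.75 are arbitrary choices for illustration).

```r
# Hypothetical power-law data: y = 3 * x^0.75
x <- c(1, 2, 4, 8, 16, 32)
y <- 3 * x^0.75

# A straight-line fit on the log-log scale recovers the power as the slope
fit <- lm(log(y) ~ log(x))
slope <- unname(coef(fit)[2])
multiplier <- exp(unname(coef(fit)[1]))

print(c(slope = slope, multiplier = multiplier))

# The same relationship drawn with both axes on logarithmic scales
plot(x, y, log = "xy", type = "b")
```

The fitted slope equals the exponent and the exponentiated intercept equals the multiplier, up to floating-point error.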
Standardization is the process of putting different variables on the same scale. In regression analysis, there are some scenarios where it is crucial to standardize your independent variables or risk obtaining misleading results.
In this blog post, I show when and why you need to standardize your variables in regression analysis. Don’t worry, this process is simple and helps ensure that you can trust your results. In fact, standardizing your variables can reveal essential findings that you would otherwise miss! [Read more…] about When Do You Need to Standardize the Variables in a Regression Model?
With the arrival of Fall in the Northern hemisphere, it’s flu season again.
Do you debate getting a flu shot every year? I do get flu shots every year. I realize that they’re not perfect, but I figure they’re a low-cost way to reduce my chances of a crummy week suffering from the flu.
The media report that flu shots have an effectiveness of approximately 68%. But what does that mean exactly? What is the absolute reduction in risk? Are there long-term benefits?
In this blog post, I explore the effectiveness of flu shots from a statistical viewpoint. We’ll statistically analyze the data ourselves to go beyond the simplified accounts that the media presents. I’ll also model the long-term outcomes you can expect with regular flu vaccinations. By the time you finish this post, you’ll have a crystal clear picture of flu shot effectiveness. Some of the results surprised me! [Read more…] about Flu Shots, How Effective Are They?
Precision in predictive analytics refers to how close the model’s predictions are to the observed values. The more precise the model, the closer the data points are to the predictions. When you have an imprecise model, the observations tend to be further away from the predictions, thereby reducing the usefulness of the predictions. If you have a model that is not sufficiently precise, you risk making costly mistakes! [Read more…] about Understand Precision in Predictive Analytics to Avoid Costly Mistakes
As you fit regression models, you might need to make a choice between linear and nonlinear regression models. The field of statistics can be weird. Despite their names, both forms of regression can fit curvature in your data. So, how do you choose? In this blog post, I show you how to choose between linear and nonlinear regression models. [Read more…] about How to Choose Between Linear and Nonlinear Regression