Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples. This process allows you to calculate standard errors, construct confidence intervals, and perform hypothesis testing for numerous types of sample statistics. Bootstrap methods are alternative approaches to traditional hypothesis testing and are notable for being easier to understand and valid for more conditions.

In this blog post, I explain bootstrapping basics, compare bootstrapping to conventional statistical methods, and explain when it can be the better method. Additionally, I’ll work through an example using real data to create bootstrapped confidence intervals.

## Bootstrapping and Traditional Hypothesis Testing Are Inferential Statistical Procedures

Both bootstrapping and traditional methods use samples to draw inferences about populations. To accomplish this goal, these procedures treat the single sample that a study obtains as only one of many random samples that the study could have collected.

From a single sample, you can calculate a variety of sample statistics, such as the mean, median, and standard deviation—but we’ll focus on the mean here.

Now, suppose an analyst repeats their study many times. In this situation, the mean will vary from sample to sample and form a distribution of sample means. Statisticians refer to this type of distribution as a sampling distribution. Sampling distributions are crucial because they place the value of your sample statistic into the broader context of many other possible values.

While performing a study many times is infeasible, both methods can estimate sampling distributions. Using the larger context that sampling distributions provide, these procedures can construct confidence intervals and perform hypothesis testing.

**Related posts**: Differences between Descriptive and Inferential Statistics

## Differences between Bootstrapping and Traditional Hypothesis Testing

A primary difference between bootstrapping and traditional statistics is how they estimate sampling distributions.

Traditional hypothesis testing procedures require equations that estimate sampling distributions using the properties of the sample data, the experimental design, and a test statistic. To obtain valid results, you’ll need to use the proper test statistic and satisfy the assumptions. I describe this process in more detail in other posts—links below.

The bootstrap method uses a very different approach to estimate sampling distributions. This method takes the sample data that a study obtains, and then resamples it over and over to create many simulated samples. Each of these simulated samples has its own properties, such as the mean. When you graph the distribution of these means on a histogram, you can observe the sampling distribution of the mean. You don’t need to worry about test statistics, formulas, and assumptions.

The bootstrap procedure uses these sampling distributions as the foundation for confidence intervals and hypothesis testing. Let’s take a look at how this resampling process works.

**Related posts**: How t-Tests Work and How the F-test Works in ANOVA

## How Bootstrapping Resamples Your Data to Create Simulated Datasets

Bootstrapping resamples the original dataset with replacement many thousands of times to create simulated datasets. This process involves drawing random samples from the original dataset. Here’s how it works:

- The bootstrap method has an equal probability of randomly drawing each original data point for inclusion in the resampled datasets.
- The procedure can select a data point more than once for a resampled dataset. This property is the “with replacement” aspect of the process.
- The procedure creates resampled datasets that are the same size as the original dataset.

The process ends with your simulated datasets having many different combinations of the values that exist in the original dataset. Each simulated dataset has its own set of sample statistics, such as the mean, median, and standard deviation. Bootstrapping procedures use the distribution of the sample statistics across the simulated samples as the sampling distribution.
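The resampling rules above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function name `bootstrap_distribution` and the sample values are hypothetical placeholders rather than anything from a real study.

```python
import random
import statistics

def bootstrap_distribution(data, statistic, n_resamples=10_000, seed=None):
    """Return the statistic computed on each bootstrap resample of `data`."""
    rng = random.Random(seed)
    n = len(data)
    results = []
    for _ in range(n_resamples):
        # Draw n values with replacement; each original point is equally likely,
        # and the resample has the same size as the original dataset.
        resample = rng.choices(data, k=n)
        results.append(statistic(resample))
    return results

# Hypothetical measurements, used only to exercise the function.
sample = [12.1, 15.3, 9.8, 20.4, 17.6, 11.2, 14.9, 18.3]
boot_means = bootstrap_distribution(sample, statistics.mean, n_resamples=2_000, seed=1)
print(len(boot_means), min(boot_means), max(boot_means))
```

Passing `statistics.median` or `statistics.stdev` instead of `statistics.mean` would give the bootstrap distribution of that statistic from the same loop.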

## Example of Bootstrap Samples

Let’s work through an easy case. Suppose a study collects five data points and creates four bootstrap samples, as shown below.

This simple example illustrates the properties of bootstrap samples. The resampled datasets are the same size as the original dataset and only contain values that exist in the original set. Furthermore, these values can appear more or less frequently in the resampled datasets than in the original dataset. Finally, the resampling process is random and could have created a different set of simulated datasets.
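If you want to generate a small case like this yourself, the following Python sketch draws four bootstrap samples from five values. The five numbers are hypothetical stand-ins, not the values from the study described above, and a different seed would produce a different set of resamples.

```python
import random

rng = random.Random(42)  # fixed seed so a rerun gives the same resamples

# Five hypothetical data points (illustration only).
original = [3, 7, 8, 12, 15]

# Four bootstrap samples, each the same size as the original dataset.
bootstrap_samples = [rng.choices(original, k=len(original)) for _ in range(4)]

for i, resample in enumerate(bootstrap_samples, start=1):
    print(f"Bootstrap sample {i}: {sorted(resample)}")
```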

Of course, in a real study, you’d hope to have a larger sample size, and you’d create thousands of resampled datasets. Given the enormous number of resampled datasets, you’ll always use a computer to perform these analyses.

## How Well Does Bootstrapping Work?

Resampling involves reusing your one dataset many times. It almost seems too good to be true! In fact, the term “bootstrapping” comes from the phrase “pulling yourself up by your own bootstraps,” which describes an impossible feat! However, using the power of computers to randomly resample your one dataset to create thousands of simulated datasets produces meaningful results.

The bootstrap method has been around since 1979, and its usage has increased. Various studies over the intervening decades have determined that bootstrap sampling distributions approximate the correct sampling distributions.

To understand how it works, keep in mind that bootstrapping does not create new data. Instead, it treats the original sample as a proxy for the real population and then draws random samples from it. Consequently, the central assumption for bootstrapping is that the original sample accurately represents the actual population.

The resampling process creates many possible samples that a study could have drawn. The various combinations of values in the simulated samples collectively provide an estimate of the variability between random samples drawn from the same population. The range of these potential samples allows the procedure to construct confidence intervals and perform hypothesis testing. Importantly, as the sample size increases, bootstrapping converges on the correct sampling distribution under most conditions.
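One rough way to check this idea is to compare the spread of bootstrapped means with the textbook standard error of the mean, s / sqrt(n). The Python sketch below does that for a synthetic, skewed sample. The exponential data, the sample size of 200, and the 2,000 resamples are all assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(0)

# Synthetic skewed "original sample": 200 draws from an exponential distribution.
sample = [rng.expovariate(1.0) for _ in range(200)]
n = len(sample)

# Bootstrap estimate of the standard error: the standard deviation of resampled means.
boot_means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(2_000)]
bootstrap_se = statistics.stdev(boot_means)

# Formula-based estimate of the standard error of the mean: s / sqrt(n).
formula_se = statistics.stdev(sample) / n ** 0.5

print(f"bootstrap SE: {bootstrap_se:.4f}, formula SE: {formula_se:.4f}")
```

The two estimates should land close to each other, even though the resampling loop never uses the standard error formula.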

Now, let’s see an example of this procedure in action!

## Example of Using Bootstrapping to Create Confidence Intervals

For this example, I’ll use bootstrapping to construct a confidence interval for a dataset that contains the body fat percentages of 92 adolescent girls. I used this dataset in my post about identifying the distribution of your data. These data do not follow the normal distribution. Because they do not meet the normality assumption of traditional statistics, they’re a good candidate for bootstrapping. However, the large sample size might let us bypass this assumption. The histogram below displays the distribution of the original sample data.

Download the CSV dataset to try it yourself: body_fat.

### Performing the bootstrap procedure

To create the bootstrapped samples, I’m using Statistics101, which is a giftware program. This is a great simulation program that I’ve also used to tackle the Monty Hall Problem!

Using its programming language, I’ve written a script that takes my original dataset and resamples it with replacement 500,000 times. This process produces 500,000 bootstrapped samples with 92 observations in each. The program calculates each sample’s mean and plots the resulting 500,000 means in the histogram below. Statisticians refer to this type of distribution as the sampling distribution of means. Bootstrapping methods create them using resampling, while traditional methods use equations. Download this script to run it yourself: BodyFatBootstrapCI.

To create the bootstrapped confidence interval, we simply use percentiles. For a 95% confidence interval, we need to identify the middle 95% of the distribution. To do that, we use the 97.5th percentile and the 2.5th percentile (97.5 – 2.5 = 95). In other words, if we order all sample means from low to high, and then chop off the lowest 2.5% and the highest 2.5% of the means, the middle 95% of the means remain. That range is our bootstrapped confidence interval!
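The percentile approach can be sketched in Python as follows. This is not the Statistics101 script used for the post; `percentile_ci` is a hypothetical helper, and the data are synthetic right-skewed values standing in for a sample of 92 observations, so the printed interval will not match the body fat results.

```python
import random
import statistics

def percentile_ci(data, statistic=statistics.fmean, confidence=0.95,
                  n_resamples=10_000, seed=None):
    """Percentile bootstrap confidence interval for `statistic` (a sketch)."""
    rng = random.Random(seed)
    n = len(data)
    # One statistic per bootstrap sample, ordered from low to high.
    estimates = sorted(
        statistic(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    # Chop off an equal share from each tail, e.g. 2.5% per side for 95%.
    tail = (1 - confidence) / 2
    lower = estimates[round(tail * (n_resamples - 1))]
    upper = estimates[round((1 - tail) * (n_resamples - 1))]
    return lower, upper

# Synthetic right-skewed values (assumption: lognormal, 92 observations).
data_rng = random.Random(0)
sample = [data_rng.lognormvariate(3.3, 0.25) for _ in range(92)]

low, high = percentile_ci(sample, seed=1)
print(f"95% bootstrap CI for the mean: [{low:.2f}, {high:.2f}]")
```

Indexing into the sorted list is a simple nearest-rank lookup. Statistical packages may interpolate between neighboring values, which changes the endpoints only slightly when there are thousands of resamples.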

For the body fat data, the program calculates a 95% bootstrapped confidence interval of the mean [27.16 30.01]. We can be 95% confident that the population mean falls within this range.

This interval has the same width as the traditional confidence interval for these data, and it is different by only several percentage points. The two methods are very close.
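To see how a percentile bootstrap interval can line up with a formula-based one, here is a hedged Python sketch on synthetic data. It assumes a normal-approximation interval (mean ± z × s / sqrt(n)) as the “traditional” method. That formula is an assumption for illustration, since a t-based interval would be slightly wider, and the numbers are not the body fat results.

```python
import random
import statistics

rng = random.Random(7)

# Synthetic right-skewed sample of 92 values (not the body fat data).
sample = [rng.lognormvariate(3.3, 0.25) for _ in range(92)]
n = len(sample)
mean = statistics.fmean(sample)

# Formula-based interval using the normal approximation: mean +/- z * s / sqrt(n).
z = statistics.NormalDist().inv_cdf(0.975)
margin = z * statistics.stdev(sample) / n ** 0.5
formula_ci = (mean - margin, mean + margin)

# Percentile bootstrap interval from 10,000 resampled means.
n_resamples = 10_000
boot_means = sorted(
    statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
)
bootstrap_ci = (
    boot_means[round(0.025 * (n_resamples - 1))],
    boot_means[round(0.975 * (n_resamples - 1))],
)

print(f"formula:   [{formula_ci[0]:.2f}, {formula_ci[1]:.2f}]")
print(f"bootstrap: [{bootstrap_ci[0]:.2f}, {bootstrap_ci[1]:.2f}]")
```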

Notice how the sampling distribution in the histogram approximates a normal distribution even though the underlying data distribution is skewed. This approximation occurs thanks to the central limit theorem. As the sample size increases, the sampling distribution converges on a normal distribution regardless of the underlying data distribution (with a few exceptions). For more information about this theorem, read my post about the Central Limit Theorem.

Compare this process to how traditional statistical methods create confidence intervals.

## Benefits of Bootstrapping over Traditional Statistics

Readers of my blog know that I love intuitive explanations of complex statistical methods. And, bootstrapping fits right in with this philosophy. This process is much easier to comprehend than the complex equations required for the probability distributions of the traditional methods. However, bootstrapping provides more benefits than just being easy to understand!

Bootstrapping does not make assumptions about the distribution of your data. You merely resample your data and use whatever sampling distribution emerges. Then, you work with that distribution, whatever it might be, as we did in the example.

Conversely, the traditional methods often assume that the data follow the normal distribution or some other distribution. For the normal distribution, the central limit theorem might let you bypass this assumption for sample sizes that are larger than ~30. Consequently, you can use bootstrapping for a wider variety of distributions, unknown distributions, and smaller sample sizes. Sample sizes as small as 10 can be usable.

In this vein, all traditional methods use equations that estimate the sampling distribution for a specific sample statistic when the data follow a particular distribution. Unfortunately, formulas for all combinations of sample statistics and data distributions do not exist! For example, there is no known sampling distribution for medians, which makes bootstrapping the perfect analysis for them. Other analyses have assumptions such as equality of variances. However, none of these issues are problems for bootstrapping.
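As a rough illustration of bootstrapping a statistic other than the mean, the Python sketch below builds a percentile interval for the median. The 15 values are hypothetical, and the small sample is used only to keep the example short.

```python
import random
import statistics

rng = random.Random(5)

# Hypothetical right-skewed sample of 15 values (illustration only).
sample = [2.1, 2.4, 2.7, 3.0, 3.2, 3.5, 3.9, 4.4,
          5.1, 6.0, 7.8, 9.5, 13.2, 21.7, 40.3]
n = len(sample)
n_resamples = 10_000

# Median of each bootstrap sample, ordered from low to high.
boot_medians = sorted(
    statistics.median(rng.choices(sample, k=n)) for _ in range(n_resamples)
)

# Middle 95% of the bootstrapped medians.
low = boot_medians[round(0.025 * (n_resamples - 1))]
high = boot_medians[round(0.975 * (n_resamples - 1))]

print(f"sample median: {statistics.median(sample)}, 95% CI: [{low}, {high}]")
```

Because the sample size here is odd, every bootstrapped median is one of the original values, so the interval endpoints are too. That discreteness is one reason small-sample median intervals can look coarse.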

## For Which Sample Statistics Can I Use Bootstrapping?

While this blog post focuses on the sample mean, the bootstrap method can analyze a broad range of sample statistics and properties. These statistics include the mean, median, mode, standard deviation, analysis of variance, correlations, regression coefficients, proportions, odds ratios, variance in binary data, and multivariate statistics among others.

There are several, mostly esoteric, conditions when bootstrapping is not appropriate, such as when the population variance is infinite, or when the population values are discontinuous at the median. And, there are various conditions where tweaks to the bootstrapping process are necessary to adjust for bias. However, those cases go beyond the scope of this introductory blog post.

In this post, you learned that bootstrapping takes real data and simulates samples. Next, learn about Monte Carlo simulations that use simulated data! Learn more in Monte Carlo Simulation: Make Better Decisions.

Josh says

Hi Jim,

Thanks for the great introduction to bootstrapping. I had a couple of questions I was hoping you could help me out with.

1. Is it improper to construct a tolerance interval based on statistics obtained from bootstrapping if the original sample appears to be non-normal?

2. If a data transformation can be used to make the original sample approximately normal, is it better to use traditional methods over bootstrapping?

Jim Frost says

Hi Josh,

Thanks for the great questions! And I’m glad you found the blog post helpful!

Using bootstrapping for a non-normal sample is a great idea. Bootstrapping doesn’t require making assumptions about the distribution. It just works with the sample as it is for the resampling.

There are other possible approaches. If you know the specific non-normal distribution that your data follows, many tolerance interval methods can work with that other distribution (e.g., Weibull, exponential, logistic, etc.). If you can identify the distribution, that’s the approach I’d recommend. This method incorporates the other probability function instead of the normal function. I’m not sure if it’s any better than using the bootstrapping method but it’s probably more of a standard approach and easier to explain & justify. Plus, it’s usually helpful knowing the distribution in general. However, in all likelihood, using a bootstrapped tolerance interval is probably just as good for those cases.

If you can’t identify the distribution, you could instead use Wilks’ standard nonparametric approach, which also doesn’t make assumptions about the distribution but uses a different method than bootstrapping. However, Wilks’ nonparametric method dates back to the 1940s and tends to require large sample sizes. So, I’d recommend bootstrapping for that situation.

Last resort, use a transformation. If you can successfully transform the data so it follows the normal distribution (not always possible), you can use that. After the analysis, you’ll need to back-transform the interval results to get the values in the natural data units.

Cealo says

Hello Jim!

thank you for sharing the post. Could you please explain the following calculation in your post please.

“For the body fat data, the program calculates a 95% bootstrapped confidence interval of the mean [27.16 30.01]. We can be 95% confident that the population mean falls within this range”

How it become 27.16 and 30.01 as 95% confidence interval ?

Jim Frost says

Hi Cealo,

Please read the paragraph immediately before the paragraph you quote, which discusses using percentiles. That paragraph describes the calculation and should answer your question. If something isn’t clear in that explanation, please let me know and I’ll elaborate.

Andy says

Hi~ I’m having trouble in R, and I don’t understand how boot() works. As I understand it, the second argument is the statistic function, and I don’t know how to choose it. I’m hoping you can help with my problem.

James says

Hi Jim,

Thank you very much for getting back to me. Since my question I have carried out further work and rejigged my methodology. I thought I’d update you as it might be of interest. To begin with I carried out time series analysis and didn’t identify any significant seasonal trends. I also used the Augmented Dickey-Fuller (ADF) test to confirm the time series was stationary (tested counts with and without log transformation). From here I decided to improve my ‘data granularity’ by aggregating daily counts when the events I am modelling occurred each year.

Stat software returned a statistically significant fit for the Poisson distribution. So I’ve gone ahead and produced a confidence interval for the Poisson error for each year based on the annual count and then one for the underlying change between separate years to see if the interval contains zero. The bootstrap comes into it as a final check of any years which are deemed to have statistically significant changes occurring. I’ve found that on many occasions the bootstrap results are very close to the Poisson based confidence intervals but there are some slight deviations, where some statistically significant results (assuming Poisson based) are now ‘on the margin’ so to speak.

I plan on working on my bootstrap code further as there are other intervals beyond percentile and empirical/pivot which I’ve used so far. I think I’ve accounted for the seasonality and improved the data granularity as the monthly count did return further off results. The sample size is bigger now and more ‘smooth’ than before, plus I’m using numpy to generate 100000 samples. My bootstrap results are close to the Poisson results with some deviation and I think this is logical as I’d expect to see plenty of similarity as the Poisson fit was statistically significant. Hopefully I’ve summarised where I am at adequately, if you can see any glaring holes in my improved methodology I’d be grateful for a heads up but I think I’m on the right track now. Thanks again for your help!

Jim Frost says

Hi James,

Thanks for the update! Sounds like an interesting project. I can’t think of any glaring issues offhand. If the Poisson distribution is a good fit, you might not need to use the bootstrap method. However, as I’m sure you know, not all count data follow the Poisson distribution.

James says

First off, let me thank you for the fantastic blog, which is a great resource. I have a scenario for which I’d much appreciate your feedback concerning the bootstrap. For example, if I wanted to get the bootstrap confidence interval for one annual count, could I set up a bootstrap that would take 1000 samples using monthly count data (sample size: 12), sum the total, and append it to a list in R? I would then use the 2.5 & 97.5 percentiles of the 1000 counts (based on summed monthly totals) to get the bootstrap interval around the annual total. This methodology could then be adapted, say, to a study of the mean across a 5 year period where you get the total of the sample and divide by 5 and repeat 1000 times.

As some background, I have both monthly & annual count data and I have carried out data modelling to assess the underlying rate of change using a Poisson distribution. I wanted to assess whether annual changes fall within expected bounds of natural variation or there is evidence of a statistically significant change occurring; this has worked well. I can continue using the Poisson distribution but I wanted to contrast my findings using bootstrap intervals, which I assume will work well with skewed count data. However, the methodology based on aggregated monthly counts will get around the small sample size when looking at annual figures if my method is valid. One last point is that I have the full ‘population’ in that I know all of the published events that have occurred.

I’d be interested to hear your thoughts, as perhaps the bootstrap can be adapted like this. I’m more familiar with fitting a distribution and working directly with Poisson and normal distributions. To date I have just worked with monthly data for the bootstrap without aggregation but I think my audience will be more interested in yearly data. More generally, I assume there may be readers who are working with count data in situations when ‘all the data has been gathered’ and want to look at statistically significant changes, so the bootstrap may be of interest. The bootstrap testing would also enable me to check any problems when using the Poisson distribution such as overdispersion, etc. Thank you for your time, best wishes

Jim Frost says

Hi James,

Certainly, the bootstrap method can be used with count data. I’m not sure about the part where you take the monthly data randomly and sum 12 months to create a bootstrapped yearly sample. That might be OK, but I’ve never seen that done before. So, I’d research the validity of that approach. More ideally, you’d bootstrap the annual data, but I understand the sample size issue. If you have patterns in your monthly data, that could be problematic for your approach. At the very least, you might need to incorporate that into your sampling. For example, suppose summers are a low point for your outcome. Then when you sample for a summer month in a year, you would need to sample from the summer months.

That would also be the preferred technique for fitting a Poisson distribution as well. For the Poisson distribution, you need to define a consistent length, which would be a year in your case.

I don’t think I’ve given a clear answer. Yes, bootstrapping will work with count data. However, I’m not sure about your specific approach of sampling 12 months to create a bootstrapped year. I’d look into that carefully before proceeding.
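For readers who want to see the mechanics of the month-matched sampling described above, here is a rough Python sketch. It is only an illustration with hypothetical counts and a hypothetical helper name, not a validated method: it assumes several years of monthly data, ignores any dependence between neighboring months, and does not settle whether summing resampled months is statistically sound.

```python
import random

# Hypothetical monthly counts (rows = years, columns = Jan..Dec), illustration only.
monthly_counts = [
    [14, 12, 15, 11, 9, 6, 5, 7, 10, 13, 15, 16],
    [16, 13, 14, 12, 8, 7, 4, 6, 11, 12, 17, 18],
    [13, 11, 16, 10, 10, 5, 6, 8, 9, 14, 14, 15],
    [15, 14, 13, 13, 7, 6, 5, 5, 12, 11, 16, 17],
]

def bootstrap_annual_totals(counts_by_year, n_resamples=1_000, seed=None):
    """Build simulated annual totals by resampling within each calendar month."""
    rng = random.Random(seed)
    n_months = len(counts_by_year[0])
    totals = []
    for _ in range(n_resamples):
        # For each calendar month, take that month's count from a randomly
        # chosen year, so a summer month is only replaced by a summer month.
        total = sum(rng.choice(counts_by_year)[m] for m in range(n_months))
        totals.append(total)
    return totals

totals = sorted(bootstrap_annual_totals(monthly_counts, seed=11))
low = totals[round(0.025 * (len(totals) - 1))]
high = totals[round(0.975 * (len(totals) - 1))]
print(f"95% percentile interval for the annual total: [{low}, {high}]")
```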

Mike Coulthart says

Emikel, please show our host and the rest of his guests on his website some courtesy and think carefully about what he is saying before putting up a new post (which is essentially a “Reply to all” for subscribers). His definitions and explanations of accuracy and precision are exactly what everyone in the field uses — whether expert or informed user — and are among the clearest I have ever seen.

Jim Frost says

Thanks, Mike!

Emikel says

Hi Jim,

First of all, I never claimed to be a statistician. I will be one in the future, as I’ll be pursuing a master’s degree in statistics this coming fall semester. Despite this, I’ve already taken a few statistics classes and done a lot of review in my own time. So, I feel that my statistics knowledge is moderate. I was just simply repeating the accuracy and precision definitions I saw online. All of the definitions for accuracy and precision I saw online (three google webpages) were the same or very similar. Some of these websites are managed by statisticians with Ph.D.s in statistics. Why did they state the accuracy definition wrong? They’re supposed to be experts in statistics, right? Here are just a few example websites that state the same definitions that I stated, https://wp.stolaf.edu/it/gis-precision-accuracy/ , and https://www.itl.nist.gov/div898/handbook/mpc/section1/mpc113.htm . The last website is a federal government website. I assumed the federal government has experts in statistics. Your website is the only website I saw online that had a different definition for accuracy. I’m curious as to why? Which accuracy definition is correct? Is this similar to the term “Kurtosis”? I saw different definitions for this term online. Some websites state that Kurtosis is the “fatness” or “thinness” of the probability distribution tails. Other websites state that Kurtosis is the shape of the “hill” or “hump” of the probability distribution.

On the other hand, your last paragraph does make some sense. If my sample items have high precision, then they are falling close or clustering together. As a result, with a large sample size and high confidence level of 90%, they should fall close to the true or population value. I am very optimistic about statistics. But, I’m a little concerned about the difference in some statistical definitions.

Emikel

Jim Frost says

Hi Emikel,

Those websites use the same exact definitions I use. Precision relates to how close measurements are to each other. Accuracy relates to how close measurements are to the correct value. I’m not sure why you think there is a difference, but their definitions are the same ones that I use in my article about accuracy vs. precision.

Kurtosis is a property of distributions.

I’d highly recommend that when you come to someone’s website asking questions that you also don’t cast aspersions! Really, politeness goes a long way. Trust me, I know what I’m talking about re: accuracy vs. precision. You’ve had a few classes while I’ve had a few decades!

Regarding your last paragraph, you can have very precise measurements that can also be inaccurate. In other words, measurements of the same item can be close together but not centered on the correct value. So, it doesn’t necessarily follow that high precision leads to high accuracy. It might or might not.

Emikel says

Hi Jim,

Sorry for the late response. I’ve been busy with college work and my part-time job. After reading both of your articles on bootstrap sampling, and parametric and nonparametric tests, I just feel more comfortable using bootstrap sampling. Some of the nonparametric tests I’ve never even heard of before. With small samples, I’ve heard more about bootstrap sampling and the T-distribution than the nonparametric tests. Regarding accuracy and precision, I know you have some issue about the real definition of accuracy and precision (I read your article on this). After some searching, all of the definitions I saw online for accuracy and precision are the same, including the definitions on some federal government websites. From what I saw, accuracy is the measure of how close a sample statistic is to a population parameter, for example a sample mean and population mean. Precision is a measure of how closely a group of sample statistics clump together, for example, a group of sample means. Since I’m doing inferential statistics, I’m more interested in accuracy. I want to be able to predict certain events with good accuracy, that’s why I’m so interested in collecting a large sample size. I do see the larger picture of options. Thanks for the information. Just one quick question, does the Central Limit Theorem also apply to other statistics, like the sample median or standard deviation, in addition to the sample mean?

Sincerely,

Emikel

Jim Frost says

Hi Emikel,

As I said, I think it’s fine going with a bootstrap procedure. I don’t believe they’re better with small samples than other procedures, but they’re probably not worse either.

It’s not that I “have some issue” with accuracy vs. precision. There’s a technical difference between those two concepts in statistics that you should understand if you want to converse with analysts and appear knowledgeable. From your comment, I can tell that you don’t understand the difference.

Accuracy relates to being unbiased. Virtually all statistical procedures attempt to be unbiased. Precision relates to how close values fall together. In a statistical model, you want precise predictions because that means the predicted values fall close to the actual values. So, when you state that you want to predict events, you’re entirely incorrect by saying that accuracy is your main focus. With predictions, accuracy is just a minimum condition whereas sufficient precision is what you really need for good predictions. Additionally, a large sample helps you with precision, but it does not help you with accuracy.

The CLT does not apply to the median or standard deviation. For other examples of what it does apply to, read my post about the central limit theorem.

Ron Swanson says

I am so glad I found your posts. First you put me onto Statistics101 which I did not know existed and second your explanation of bootstrapping is wonderful. I plan on going through many more of your posts.

Thank you for taking the time to produce those posts.

Jim Frost says

Hi Ron,

I’m so glad to hear that you found it helpful. That makes my day!

Also, I always have fun playing with Statistics101. I hope you do too!

Emikel says

Hi Jim,

Thanks for your reply. I’m using inferential statistics. What I want to obtain from my sample are patterns and trends to be able to make accurate predictions about the population. I was leaning towards the sample median not only because my probability distribution is not normal and is heavily skewed to the right (Skewness value is +3.58), but also because the variation of it is so high (Coefficient of Variation is 164.31%). Even though my sample size is large enough to use the Central Limit Theorem, I’m still cautious about using the sample mean because my probability distribution is so heavily skewed to the right and the variation of it is so high. I’ve always read that if my probability distribution is skewed then use the median instead of the mean because the median is a better measure of central value. I want to make the most accurate prediction possible. Accuracy is my goal. I don’t feel comfortable using the nonparametric tests after reading your parametric/nonparametric tests post. I feel more comfortable using Bootstrap sampling since so much praise has been given to it by other statisticians. Since accuracy is my goal and my variation is so high (COV: 164.31%), I’m leaning towards the sample median using Bootstrap sampling. What do you think? I’m only going to use Bootstrap sampling until I increase my sample size from 77 to my desired goal of 271.

Sincerely,

Emikel

Jim Frost says

Hi Emikel,

Part of what I wanted to impart in my previous comment was that you should go beyond a general rule of thumb you’ve read and instead apply a thorough understanding to your specific case. So, yes, the median is often a better measure for skewed distributions. But, as I point out in my other post, it really depends on what you want to learn from your data. Sometimes both measures are good, and it comes down to what you want to learn.

But the median is a good choice. And, certainly, bootstrapping is also a good choice. I suspect that it won’t give you a more precise estimate than nonparametric measures. I would imagine they’d be approximately the same in terms of precision. Out of curiosity, what made you not want to use a nonparametric method? Depending on your field, there might be more acceptance of the more familiar nonparametric tests.

But, again, using bootstrapping with the median is a good choice. I just wanted you to see the larger picture of options.

And you’ll note that I’m using the word precision here rather than accuracy. Either mean or median can be accurate in this context but you’re really looking for a precise estimate of the median. Read precision vs. accuracy.

Emikel says

Hello Jim,

I enjoyed reading your post about Bootstrap sampling. The timing of this post is perfect. It just so happens that I’m currently collecting a sample for a private project that I’m doing. I only have one sample and its current size is 77 items. My goal is to reach a sample size of 271 items with a 90% confidence level. I obtained this sample size by using Cochran’s sample size equation. It took me about six months to collect 77 sample items. By the rate I’m going, I’ll reach my sample size goal of 271 items in about two years. I don’t have the time or the money to wait that long. So, I’ve decided to use Bootstrap sampling. I just found out about Bootstrap sampling, and it does seem “too good to be true.” But if you and other statisticians say it works, then I’ll take your word for it and try it out. My question is do I use the sample mean or the sample median as my sample statistic for the Bootstrap sampling? I was thinking about using the sample median because the probability distribution for my sample is not normal and is heavily skewed to the right (Skewness value is +3.58). But I read in your Bootstrap post that the sample median may not be a good choice because the population values may be discontinuous at the median. What exactly do you mean when you say population values may be discontinuous at the median? My sample data is continuous data, not discrete data. Should I use the sample mean instead? Thanks.

Sincerely,

Emikel

Jim Frost says

Hi Emikel,

You can use either the mean or median. And you can use either bootstrapping, a parametric test, or a nonparametric test. So, you have a bunch of options!

A sample size of 77 is usually considered pretty good. However, it depends on what you want to do with it, the effect size you want to detect, and the variability in the data.

I’m not 100% sure what you want to learn from your data but let me cover your questions and some possibilities.

The question of mean or median often boils down to the distribution of your data and what you want to learn from it. Please read my post about nonparametric vs. parametric tests. Mean or median is an important issue in that context. Sometimes you can technically use either, but the distribution and what you want to learn determines which is better. I talk about that in this post.

Because your sample size is fairly large, the central limit theorem allows you to use even skewed data with a parametric test. Of course, bootstrapping can handle the mean or median. Nonparametric tests assess the median. Those are all options for your situation. I don’t know which one would be best. If you determine the mean is best, I’d probably stick with a parametric test because others are more familiar with that tried-and-true approach. If the median is better, you could use either bootstrapping or a nonparametric test. I couldn’t really recommend one over the other. They’re probably fairly equivalent.

As I see it, your real question is which is better for your subject area, the mean or median. I think it’ll be clearer after reading the post I link to above. From there, you have multiple approaches for both cases. Bootstrapping or not is really a secondary question after making that determination. There’s a chance you’d want to go with a parametric or nonparametric test over bootstrapping just because others are more familiar with those approaches. But that’s something you’d need to look into.

I hope that helps!

Mark M says

Jim,

Your explanation of bootstrapping was so good that I subscribed to your site. I found probability easier to understand than statistics in college, so I am very glad I found your site. I do have a conceptual question about your (adolescent girl body fat) example. How do you estimate a minimum reasonable size for the sample set you plan to bootstrap from?

It seems like a number of factors would have a significant impact on body fat. If we use age (ten annual buckets – 10-19 years), parental income/education level (at least 3 buckets) and race (at least 5 buckets) those factors would generate (10x3x5) = 150 buckets. Given that, it seemed as if 92 data points is not enough to accurately represent the underlying data.

How does one figure out a minimum reasonable sample size? If you address this in one of your books, I am glad to buy it.

Sorry if this is a silly question.

Mark

Sujay Dutta says

Thanks for a great article, Jim. I have a question. One interpretation of parametric confidence intervals is that values near the center of the interval are more likely for the population parameter concerned than values near the peripheries of the interval. Does this interpretation apply to bootstrap CIs also?

Jeremy says

Thanks for the very good intro to bootstrapping here, Jim! I have a couple of questions. If your sample has a wide range (i.e. a large standard deviation), or even just a few extreme outliers, would your bootstrap-derived sampling distribution end up being wider, too, or would it shift the mean? I guess the premise of bootstrapping is that the variation in your single sample will end up being mirrored in your bootstrap-derived distribution of sample means? My second question is how do you use bootstrapping for regression: do you simply get a bootstrapped distribution for each of your regression coefficients to get confidence intervals?

Mike Coulthart says

Thank you Jim, for this wonderfully clear explanation of the principles behind the classical nonparametric bootstrap. Have you considered presenting a similar exposition of Donald Rubin’s (1981) Bayesian bootstrap? I have tried and tried to read this paper, but still cannot intuitively grasp it.

Jim Frost says

Hi Mike,

Perhaps that can be fodder for a future blog post!

saeideh says

Hello, Thank you for the tutorial.

I have a question. I have some subjects with different numbers of trials. Since the number of trials affects the result, I first want to equalize the number of trials by, for example, randomly extracting 60 trials from each subject. How can I do this with bootstrapping?

Thank you in advance

Jim Frost says

Hi Saeideh,

My real strengths are in the parametric methods. In those methods, it’s ok to have different numbers of subjects between groups. It’s not the most efficient in terms of maximizing statistical power, but it’s ok. I know the same is true with at least some of the bootstrapping techniques. For example, I do know that you can compare the means of unequal size groups. There might be some alteration in the bootstrapping method to account for that size difference. I’d look into that before you attempt to equalize group numbers.

Tony says

Hi Jim,

I am not really understanding how to create a confidence interval from a bootstrap. Could you please give another example?

Thanks,

Tony

Jim Frost says

Hi Tony,

It’s really the process that’s important. Apply the following process to any sample.

All this method does is take your sample and resample it many times to create many bootstrap samples. It calculates the mean (or whatever you’re studying) for each bootstrap sample. Then it lines those up in order from low to high. From that lineup, it picks out the middle 95% of values, or whatever confidence level you’re using. If you use 95%, then you know that 95% of your bootstrap sample means fell within that range.

It’s really that simple for bootstrap CIs. Some methods will add correction factors, incorporate PDFs, etc., but I’m just showcasing the simplest version to illustrate the principles behind how they work.
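The process described above can be sketched in a few lines of Python. This is a minimal illustration of the simple percentile method only, assuming NumPy is available; the function name `percentile_bootstrap_ci`, its parameters, and the gamma-distributed data are made up for the example and are not the script or dataset from the post.

```python
import numpy as np

def percentile_bootstrap_ci(sample, stat=np.mean, n_boot=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI: resample, compute the statistic, take the middle `level`."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Each row of idx picks n observations with replacement (one bootstrap sample).
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_stats = stat(sample[idx], axis=1)
    # Keep the middle `level` share of the sorted bootstrap statistics.
    alpha = 1.0 - level
    lower, upper = np.percentile(boot_stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)

# Hypothetical right-skewed data, used only to exercise the function.
data = np.random.default_rng(1).gamma(shape=2.0, scale=5.0, size=92)
low, high = percentile_bootstrap_ci(data)
print(f"95% percentile bootstrap CI for the mean: [{low:.2f}, {high:.2f}]")
```

Passing `stat=np.median` gives the same procedure for the median, since the only requirement on `stat` is that it accepts an `axis` argument.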

Gemechu Asfaw says

Hi JIM.

I am new to the bootstrapping method. Is bootstrapping parametric or nonparametric? When do we use it?

Jim Frost says

Hi Gemechu,

There are parametric and nonparametric forms of bootstrapping. Read what I write in this comment about that issue.

I’d recommend using bootstrapping methods when you can’t satisfy the assumptions for a parametric or nonparametric non-bootstrapping test. You can read about that in my post about parametric vs. nonparametric tests. In some cases, you can use bootstrapping when a non-bootstrapping form of the test doesn’t exist.

I hope this helps!

Robert Matthews says

This was a really helpful addendum to Jim’s basic introduction; thanks for going to all the trouble of providing links too !

Adrian says

Dear Jim,

I am a big fan of your blog and admire not only your deep knowledge but also your extraordinary ability to teach it in a “digestive” way. That’s a virtue!

Now, let me add my few cents on these methods. They are very useful by relaxing the most problematic parametric assumption, but one should be very careful while using it. Let me list a few points, coming from my everyday work:

1) The proper method should be used for the proper goal. Bootstrap tests don’t evaluate the p-value under the null hypothesis unless additional steps are taken (shifting the data by the mean to simulate the null). Otherwise, one may be *very* surprised.

Please find:

* “Computing p-value using bootstrap with R” – https://tinyurl.com/y4d2lu6o

* “Why shift the mean of a bootstrap distribution when conducting a hypothesis test?” – https://tinyurl.com/y4wra7z2

1a) Permutation tests do that. But the permutation test can be run if and only if the samples are IID = same sample shape and same dispersion. It doesn’t have to be normal (it can be any distribution), but should be IID. Why? Because permutation directly assumes exchangeability of the data. If there is a difference – the rule is broken, so the method is broken.

1b) If, instead of an exact permutation test, an approximate test is used (only a subset of all permutations are employed), the p-value won’t be exact too.

2) bootstrap provides only asymptotic and only average coverage probability (“95%” approaches the requested 95%). In certain industries, like the Clinical Trials, it’s often unacceptable. Here we often use the conservative, exact methods, giving the minimum (not average) probability coverage (usually we also want the shortest CI).

BTW, the FDA advises against using bootstrap for the primary endpoints. The statistical properties still aren’t fully explored in case of strong skewness and kurtosis.

https://www.fda.gov/media/102657/download (it’s a draft guidance, but people comply)

3) it requires some data to work, a few tens at least. It must be sampled representatively, or – garbage in = garbage out, regardless of the number of samples.

4) it requires lots of samples. One may end up with tens to hundreds of thousands, or the estimates get unstable. There’s still no agreement on that.

* “Bootstrap confidence intervals – how many replications to choose?” https://tinyurl.com/y2sqnc54

* “Why on average does each bootstrap sample contain roughly two thirds of observations?” – https://tinyurl.com/y4qwb3yn

* “Rule of thumb for number of bootstrap samples” – https://tinyurl.com/y5jvpklb

* “Choosing the number of bootstrap resamples” – https://tinyurl.com/y33old8s

* “Can we use bootstrap samples that are smaller than original sample?” – https://tinyurl.com/y2pqxhff

5) excessive skewness / fat tails may affect it

6) One should decide on which type of CI to use. BCa is commonly advised, but too often fails to calculate (it happens to me very often). The percentile CIs are often criticized (more than once our sponsors disallowed them at work when the BCa failed; studentized CIs were requested instead).

* “What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum” – https://tinyurl.com/y55l5tel

* “Bootstrap confidence intervals: when, which, what? A practical guide for medical statisticians” – https://tinyurl.com/ycttwfnk

* “[SAS] The bias-corrected and accelerated (BCa) bootstrap interval” – https://tinyurl.com/y3qn5qf2

7) seed of the RNG should be always saved for reproducibility.

I hope you find something useful in it 🙂

David says

Hi Jim,

Thanks for the excellent report. FYI, Scientific American 1983 Vol 248(5):116 may be useful. My question is whether you have commented on the issue of multiple testing (and proper corrections, i.e. Bonferroni, Benjamini-Hochberg) versus resampling. Literature/blogs indicate this to be a longstanding and unresolved issue. I would be curious to read your comments.

Kudos, David.

Jim Frost says

Hi David, I’m much less familiar with multiple comparison methods for bootstrapping than for parametric tests. So, I don’t have a great deal of insight to add here. However, I’d imagine it’s a solvable problem. Both bootstrapping and parametric tests use a sampling distribution to calculate probabilities and determine significance for a single comparison. Bootstrapping uses resampling to create its sampling distributions while parametric methods use probability distributions (t, chi-square, F, etc.).

Multiple comparison methods will typically adjust what is considered significant. In simple terms, if you know the Type I error for a single comparison, you can figure out the family-wise error rate. I write about this in my post about ANOVA and multiple comparisons. You can then adjust the individual Type I error rate so that the family rate is a desired value. So, I don’t have first-hand knowledge about these methods for bootstrapping, but they should be solvable.
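As a rough illustration of that adjustment, the sketch below computes the family-wise error rate from a per-comparison Type I error rate and then applies a Bonferroni-style adjustment. It assumes the comparisons are independent, which is a simplification; the function names and the choice of 10 comparisons are hypothetical.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive across m independent comparisons."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_alpha(family_alpha, m):
    """Per-comparison significance level that keeps the family rate at or below family_alpha."""
    return family_alpha / m

m = 10  # hypothetical number of comparisons
unadjusted = familywise_error_rate(0.05, m)
adjusted_alpha = bonferroni_alpha(0.05, m)
adjusted = familywise_error_rate(adjusted_alpha, m)

print(f"Family-wise error rate at alpha = 0.05 each: {unadjusted:.3f}")
print(f"Bonferroni per-comparison alpha: {adjusted_alpha:.4f}")
print(f"Family-wise error rate after adjustment: {adjusted:.4f}")
```

The same adjusted significance level could in principle be applied to bootstrap p-values or bootstrap confidence levels, though dedicated resampling-based procedures exist and may behave differently.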

I did a quick search and there do seem to be multiple comparison methods available for bootstrapping. Here’s an article about using multiple comparison methods with bootstrapping: On Using the Bootstrap for Multiple Comparisons.

Thanks for the great question!

Emily Stern says

Thanks for your intuitive blog post. Very helpful.

Dan Grove says

Hey Jim, thanks for this post! I didn’t know anything about bootstrapping before reading this. Explaining it via statistical concepts that I am familiar with (and the easy to follow example) made it a really easy and informative read. Cheers!

Swathi says

Hi sir

Myself Swathi. Unknowingly I opened your blog. It’s really a lovely explanation about bootstrap. Before reading this article I thought that bootstrap is a complicated topic, but after reading it my thoughts have changed. Before your article I read many articles, but my problem was not solved. Thank you so much for your lovely explanation.

Akhil Mathew says

Hi Jim,

I love your blog posts and they have helped me understand many concepts. Now I came across a scenario where I need to know how would you solve this.

My question is regarding how to compare two or more bootstrap results. I am trying to do an AA test to find the best matching control cases for a test group from a big pile of observations. Basically, once I check for the differences, there should not be any, as I am giving no special treatment to the test group. I do a bootstrapping once I finish finding test and control cases to see the sampling distribution of the differences between the test and controls.

Problem is I am trying different approaches to match the test and controls, but I am not sure how to compare the bootstrap results. Most of the results have a mean around 0 but not exactly 0 (luckily none of my means were above 1) but with a wide range of standard deviations. I want to know whether there is any metric to find the best result with the least mean and least standard deviation. I have thought about just multiplying the mean and standard deviation, but I don’t think that has a statistical explanation. Differing means and standard deviations are actually not helping me with z-score values.

If you still didn’t understand my problem, this question is almost the same thing. https://www.researchgate.net/post/How-can-I-compare-two-errornormal-distributions-to-find-which-one-is-better-rather-than-simply-finding-difference-in-their-means

Most of the people here suggest to go with the distribution with least standard deviation which means least uncertainty, but what if the mean value was 0.9? I want the least values for both 🙁

Please let me know how would you solve this problem!

Thanks,

Akhil

Rachel says

Hi Jim. You say sample sizes as small as 10 can be used. Is this a rule of thumb? Can you point me to any relevant literature regarding this?

Al says

Hello Jim –

Thank you for this blog post. The explanation was great. You mentioned that bootstrapping can be used to estimate regression coefficients. Can you present a post on how this can be done using software like Excel or Stata or R?

Kalie says

This explanation was so helpful – much better than what I received in class. Thanks so much!

Jim Frost says

Hi Kalie,

Thanks so much for writing! Your comment made my day! 🙂

Mahmud says

That’s a very nice introduction about Bootstrap sampling. I had some ideas of Bootstrap sampling but I was not very clear about all the aspects. Your clear cut explanation makes everything very clear to me.

Thanks again. May Allah bless you.

Steven Zenos says

Hi Jim, if we believe the central assumption for bootstrapping to be true, namely that the original sampling results accurately represent the actual population, can it be used to increase precision in DOE, by bootstrapping every ‘y’ dependent variable outcome?

Jim Frost says

Hi Steven,

I’m not exactly sure what you’re proposing and what you’re using as a baseline level of precision for that comparison. However, if you’re talking about using bootstrapping to increase precision as compared to a parametric analysis, that depends. If the parametric analysis adequately fits the data, you probably won’t get better precision with bootstrapping. However, if there is some assumption violation that causes an unresolvable problem with the parametric analysis, then bootstrapping might improve your results. Although, that might do more to reduce bias than necessarily increasing precision–but it would be an improvement.

Lasheen says

Thanks a lot for your help.

Lasheen says

Thanks a lot for your reply. I think that I got your point. It seems to me that when we are talking about the bootstrap or nonparametric test, it is easier to speak in terms of the confidence interval than the hypothesis test. On the other hand, when we describe a t-test, it is more convenient to talk in terms of the hypothesis test than a confidence interval. Actually, they are two heads of the same coin. Is that right?

Jim Frost says

Hi Lasheen,

Again, no, t-tests and bootstrapping aren’t really more convenient to discuss in the way you’re saying. They’re actually fairly similar. The key difference is how they estimate the sampling distribution. After that point, the hypothesis test and CIs for both methods are fairly similar.

And, yes, hypothesis tests and CIs are two sides of the same coin. My post about how confidence intervals work explores this fact.

Lasheen says

Thanks for your reply. If we used the bootstrap distribution as the null distribution, and we set the mean value of that distribution as the null hypothesis, then we would need to prove that the sample lies within, for example, 95% of the distribution. That is opposite to the t-test, where we need to prove that the sample lies in the tail of the distribution. Does that make sense?

Jim Frost says

Hi Lasheen,

The bootstrap version of the t-test and the actual t-test follow the same basic approach. They are *not* opposite. They just use different methodology to calculate the sampling distribution. I’d read my post about how hypothesis tests work. That is based on a one-sample t-test. When I show the sampling distribution of the means, which is based on the t-distribution, just mentally replace it with the bootstrap distribution centering on the null hypothesis value (the reference/target value). If the sample mean lies beyond the 2.5th or 97.5th percentile of the bootstrap distribution (but centered on the null value), then it is statistically significant.
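One way to sketch the idea of a bootstrap distribution centered on the null value is shown below. It shifts the sample so that its mean equals the null value, resamples the shifted data, and checks whether the observed mean falls beyond the 2.5th or 97.5th percentile. This is a minimal sketch under that shifting assumption, not necessarily the procedure any particular software uses; the function name, parameters, and simulated data are hypothetical.

```python
import numpy as np

def bootstrap_one_sample_test(sample, null_value, n_boot=10_000, seed=0):
    """Two-sided bootstrap test of the mean against null_value using a shifted sample."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    observed = sample.mean()
    # Shift the data so the resampled means are centered on the null value.
    shifted = sample - observed + null_value
    idx = rng.integers(0, n, size=(n_boot, n))
    null_means = shifted[idx].mean(axis=1)
    lower, upper = np.percentile(null_means, [2.5, 97.5])
    # Share of null-centered means at least as far from the null as the observed mean.
    p_value = float(np.mean(np.abs(null_means - null_value) >= abs(observed - null_value)))
    significant = bool(observed < lower or observed > upper)
    return observed, (float(lower), float(upper)), p_value, significant

# Hypothetical data with a true mean of 30, tested against a reference value of 25.
data = np.random.default_rng(2).normal(loc=30.0, scale=5.0, size=50)
observed, cutoffs, p_value, significant = bootstrap_one_sample_test(data, null_value=25.0)
print(f"observed mean = {observed:.2f}, null cutoffs = {cutoffs}, p = {p_value:.4f}")
```

The percentile cutoffs and the p-value are two views of the same null-centered bootstrap distribution, which mirrors how a t-test uses the t-distribution.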

Lasheen says

Thanks a lot for your post, very easy to follow. I am wondering what is the null hypothesis (and hence null distribution) in that context.

Jim Frost says

Hi Lasheen,

It would still be the same null hypothesis as in the parametric scenario. In this case, we’re talking about the sample mean for one-sample. We’d need to use a reference or target value for the null hypothesis value just like we’d do for a 1-sample t-test. The test (either parametric or bootstrapping) determines whether the difference between your population parameter estimate and the reference value is statistically significant.

Michael says

Hi Collin. you brought up an interesting point, “if you could first calculate all the possible combinations”. I took a stab at making this calculation with Jim’s data.

Assuming 92 people, order does not matter and we can have a person in the output more than once I think the number of total possible unique samples is 7.2016213874e+53.

I imagine there is no software readily available that could feasibly run this number of simulations!

With Jim running 500,000 simulations, what is the probability that any of the simulations produced identical results!?

Jim Frost says

Hi Michael,

I describe how to calculate the number of unique samples in my reply to Collin. I’m not sure how you calculated yours, but the correct result is different.

famousdavispmp says

Collin, if you’re not too familiar with Monte Carlo simulation, you might find this spreadsheet helpful. I used it when doing a webinar a couple of months ago. This is an Excel spreadsheet, but it should work with Google Sheets users, too, since it uses built-in Excel functions (no plug-ins, nothing to install, it’s just a spreadsheet).

https://www.statisticalpert.com/download/1642/

In the spreadsheet, I simulate the rolling of two six-sided dice. The question I’m trying to answer through simulation is how likely am I to roll a 7? We already know the answer. There are six ways to get a “7” when rolling a pair of six-sided dice, and there are 36 possible combinations, so there is a 6/36 = 16.67% chance of rolling a 7 on any single roll.

In the spreadsheet, you’ll see how the greater the number of simulated trials, the more accurate the simulated results tend to be when compared to what we know is the true probability of getting a “7” (16.67%).

When we simulate with 100 trials, the results are not too accurate. Simulating with 1000 trials improves accuracy, and with 10,000 trials even more accuracy is achieved. Had I included a worksheet with 100,000 trials, the results would have been very, very accurate.

Of course, we don’t need to run a Monte Carlo simulation to solve a simple problem like this. We use Monte Carlo simulation when the problems are much more complex and the answers are anything but obvious and couldn’t be solved simply by using a math formula.
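For readers who prefer code to a spreadsheet, the same dice experiment can be sketched in Python. This is an independent illustration assuming NumPy, not a translation of the linked spreadsheet; the function name and seed are arbitrary.

```python
import numpy as np

def estimate_prob_of_seven(n_trials, seed=0):
    """Monte Carlo estimate of the chance that two six-sided dice sum to 7."""
    rng = np.random.default_rng(seed)
    # integers(1, 7) draws from 1..6 because the upper bound is exclusive.
    rolls = rng.integers(1, 7, size=(n_trials, 2))
    return float(np.mean(rolls.sum(axis=1) == 7))

true_prob = 6 / 36
for n_trials in (100, 1_000, 10_000, 100_000):
    estimate = estimate_prob_of_seven(n_trials)
    print(f"{n_trials:>7} trials: estimate = {estimate:.4f}, "
          f"error = {abs(estimate - true_prob):.4f}")
```

Any single run can be lucky or unlucky at a small trial count, so the error shrinks on average rather than strictly with every increase.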

Collin M says

thank you Jim for that insight

Collin M says

hi Jim, thank you for making statistics so intuitive. Am an undergraduate student of Bsc.Agriculture, your posts have made me feel like a PhD student who can interpret the whole research process.

But my question is, why did you choose 500,000 bootstrap samples rather than +/- 100,000 bootstrap samples for example.

I was thinking if you could first calculate all the possible “combinations” or “orders”, let me call them so, which can come out of that sample. In other words, how many different samples you can re-arrange from that sample without making repetitions of the bootstrap samples.

Jim Frost says

Hi Collin,

There are several reasons why I chose such a high number. For one thing, it’s more likely to produce a nice smooth graph. Additionally, if anyone tries to replicate the results, their results will tend to be closer to mine with higher numbers. By the way, I’ve added a link to my script in this post so anyone can try it on their own. You can easily change the number of bootstrap samples in the script.

However, I probably used far more bootstrap samples than needed. I just reran the analysis with 100,000 bootstrap samples and obtained virtually identical results. Modern computing power makes it easy to go overboard! For the program to create the 500,000 samples and perform all the follow up calculations, it took only a matter of seconds! And, there’s no harm with going with more samples than necessary. A good rule of thumb is to increase the number of bootstrap samples up to point where you’re getting consistent results from one run to the next. If the results change much between runs, you have too few bootstrap samples.

For calculating possible combinations, it’s just n^n, where n is the sample size. My dataset has 92 values, which means the number of possible combinations is 92^92 = 4.66e+180! That’s a huge number! While 500,000 seems like a lot, it’s a tiny proportion (1.07e-175 to be precise) of all possible combinations. Although, again, 500,000 is more than sufficient for this dataset.
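The arithmetic is easy to check with Python's arbitrary-precision integers. The sketch below computes n^n for n = 92, which counts ordered resamples, and, as a separate quantity, the number of resamples when order is ignored, which is the multiset count C(2n − 1, n). Which count is the relevant one depends on whether order matters for the question being asked; the variable names are arbitrary.

```python
import math

n = 92            # sample size of the dataset discussed in the thread
n_boot = 500_000  # number of bootstrap samples used in the post

# Ordered resamples: each of the n positions can be any of the n observations.
ordered = n ** n

# Unordered resamples (multisets of size n drawn from n items).
unordered = math.comb(2 * n - 1, n)

print(f"ordered resamples, n^n:        {ordered:.3e}")
print(f"unordered resamples, C(2n-1,n): {unordered:.3e}")
print(f"share of ordered resamples covered by {n_boot:,} draws: {n_boot / ordered:.3e}")
```

Because the mean does not depend on order, many ordered resamples give the same bootstrap mean, which is why the unordered count is so much smaller.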

famousdavispmp says

“I wasn’t quite clear on your Excel methodology. Did you create 500,000 bootstrapped samples using resampling with replacement where n=92 and then calculate the mean for each one?”

Yes, that’s exactly what I did. Once I setup the model, I just tell @Risk to run a simulation 500K times. Each simulation uses n=92 with replacement. @Risk creates the bell-curve thanks to the CLT, which is what I used to compare its CI with yours.

@Risk is sophisticated, so I’m sure there’s a reason why the CI is different, but it’s just a curiosity to me now because using Excel’s built-in functions I can almost identically match your results, so it confirmed I understand the process correctly.

Jim Frost says

Great! I’m not familiar with @Risk. I might need to look it up! There are different bootstrap CI methods. I’m sure they’re just using one of the other methods.

david says

Hi Jim! I wondered if maybe you know why using bootstrap to compute a confidence interval for the maximum value of a variable is problematic?

Jim Frost says

Hi David, the maximum value possible in the bootstrap simulated samples is the maximum observed value in the actual sample. Bootstrap samples are unable to go beyond that maximum sample value. It’s a sharp cutoff. However, sampling distributions based on a probability distribution can go beyond the observed value in a sample. Confidence intervals based on those probability distributions can therefore use that information that lies beyond the maximum observed value. The same is true for minimum values.
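A small simulation makes the cutoff visible. The sketch below draws a hypothetical sample, resamples it, and records the maximum of each bootstrap sample; no bootstrap maximum can exceed the sample maximum, and a large share of them equal it exactly. The data and names are made up for illustration, and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=30)  # hypothetical data

n_boot = 5_000
idx = rng.integers(0, sample.size, size=(n_boot, sample.size))
boot_max = sample[idx].max(axis=1)

# The bootstrap distribution of the maximum piles up on the observed maximum.
share_at_sample_max = float(np.mean(boot_max == sample.max()))
print(f"sample maximum:            {sample.max():.3f}")
print(f"largest bootstrap maximum: {boot_max.max():.3f}")
print(f"share of resamples whose maximum equals the sample maximum: {share_at_sample_max:.3f}")
```

The share is close to 1 − (1 − 1/n)^n, roughly 63% to 64% for moderate n, so the bootstrap distribution of the maximum is a spike at the observed maximum with nothing above it.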

famousdavispmp says

Hi Jim, here’s a follow-up comment. I tried using Excel’s built-in function, PERCENTILE.EXC, against 4000 simulated iterations (of 92 samples each) by creating a big data table in Excel (so I wouldn’t use my @Risk plug-in at all). Using this approach, I nearly exactly matched your results: 2.5% was 27.20 and the 97.5% was 29.99, which of course nearly matches your simulation.

So for some reason, when I use the Palisade @Risk program to run the simulation, I’m getting a slightly different result, but I have no idea why.

Jim Frost says

Hi,

I’m not familiar with Palisade @Risk or the methodology that they use. I do know that there are several methods for creating bootstrap CIs. I used the simplest because it was easiest to illustrate.

I wasn’t quite clear on your Excel methodology. Did you create 500,000 bootstrapped samples using resampling with replacement where n=92 and then calculate the mean for each one? I wasn’t quite clear about the portion where you write, “Then I take the AVERAGE of those 92 bootstrapped values, and run the simulation 500K times.” Maybe you’re saying the same thing a different way? The results in your second comment are almost identical!

famousdavispmp says

Hi Jim. Thank you for your article. I downloaded your dataset of body fat samples and tried bootstrapping in Excel to see if I could match your results. I’m using Excel’s RANDBETWEEN function 92 times for each bootstrapped iteration and finding the mean for each iteration. Then I use Palisade’s @Risk Excel plug-in to run the simulation 500,000 times just like you.

My results are a little different, though, and I’m wondering why? The mean from my simulation is 28.42, and the 95% confidence interval is 27.54 (2.5%) and 29.59 (97.5%). I knew my results wouldn’t necessarily match your results *exactly* but they are different enough to make me wonder why?

The model is pretty simple. I opened your body fat dataset, and in a column over I use =INDEX($A$2:$A$93,RANDBETWEEN(1,92)) and copy that 92 times, one for each of the 92 rows of data. Then I take the AVERAGE of those 92 bootstrapped values, and run the simulation 500K times. Using @Risk, I can see the statistics for the cell with the AVERAGE function which is where I obtained my confidence interval.

Any thoughts why I can’t match exactly your results? I noted that you’re using a different program to run your simulation.

[email protected] says

noted with thanks

Stan Alekman says

I have some very good articles on resampling: permutation, jackknife, bootstrapping. If you send me an email – [email protected] – I will send them to you.

Stan Alekman

Khan says

Can I have an article on types of bootstrapping? Which type of bootstrapping is used in SEM AMOS?

Jim Frost says

Hi Khan,

As of now, I have just this one article about bootstrapping. I might write more down the road. I’m not familiar with SEM AMOS other than it’s an SPSS module for structural equation modeling. So, I’m not sure what methodology they use.

Aarav says

Hi Jim,

Thanks for the wonderful post. Can you create a post of using bootstrapping for hypothesis testing?

Thanks

LJ Legaspi says

Sir, is bootstrapping a nonparametric test?

Jim Frost says

Hi LJ,

There are actually nonparametric and parametric forms of bootstrapping. The most common form is the method I show in this post, which is a nonparametric method. This method creates new samples of the same size using sampling with replacement as shown.

However, there is a parametric form of bootstrapping. That form assumes that the population you are studying follows a particular distribution, such as the normal distribution, Poisson, or whatever. The procedure then estimates the parameters for that distribution from your sample. The procedure then uses the estimated distribution to produce the new samples.

So, yes, the type I show, which is the most common, is nonparametric. However, just be aware that there is also a parametric type of bootstrapping.
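To make the contrast concrete, here is a minimal sketch of a parametric bootstrap for the mean. It assumes the population is normal, estimates the mean and standard deviation from the sample, and then simulates new samples from that fitted distribution instead of resampling the observed values. The normality assumption, function name, and simulated data are all choices made for this example.

```python
import numpy as np

def parametric_bootstrap_ci_normal(sample, n_boot=10_000, level=0.95, seed=0):
    """Percentile CI for the mean from samples simulated out of a fitted normal."""
    rng = np.random.default_rng(seed)
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    # Step 1: estimate the parameters of the assumed distribution.
    mu_hat = sample.mean()
    sigma_hat = sample.std(ddof=1)
    # Step 2: simulate new samples of the same size from the fitted distribution.
    simulated = rng.normal(loc=mu_hat, scale=sigma_hat, size=(n_boot, n))
    boot_means = simulated.mean(axis=1)
    # Step 3: take the middle `level` share of the simulated means.
    alpha = 1.0 - level
    lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lower), float(upper)

# Hypothetical data used only to exercise the function.
data = np.random.default_rng(3).normal(loc=28.0, scale=7.0, size=92)
low, high = parametric_bootstrap_ci_normal(data)
print(f"95% parametric bootstrap CI for the mean: [{low:.2f}, {high:.2f}]")
```

Unlike the nonparametric version, the simulated samples here can contain values that never appeared in the original data, and the result is only as good as the distributional assumption.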

Stan Alekman says

Resampling simply runs a Monte Carlo simulation on existing data to give some idea about the influence of extreme values. It does not ask why extreme values or outliers are present. It does not test for outliers and cull them. It simply tries to average out their effects. But in non-experimental settings, outliers are critical. They are the signals that tell us of the presence of assignable causes. Resampling sidesteps the assumption of independent and identically distributed random variables without having to deal with outliers. The emphasis is completely upon estimation of parameters, not process characterization or improvement. Given this difference in emphasis, it works.

If I see appreciably different results between the usual tests and resampling, I would suspect the data of having come from an unpredictable process. In that case the resampling results would provide estimates with less variation, but the question of whether or not those estimates were estimates of one parameter or many different parameters would remain unanswered. Resampling works with data that are mostly homogeneous with only a few outliers.

Stan Alekman

Dwasch says

Hello.

Thanks for this helpful summary. Am I correct in understanding that bootstrapping doesn’t rely on either a normality assumption or (for group comparison) a homogeneity of variances assumption? If so, could you point to a reference without too much hassle? It would be helpful for a revise and resubmit.

Thanks!

Stan Alekman says

The question then remains: is the bootstrap confidence interval more reliable (closer to the truth) than is the confidence interval by traditional means? Without an answer or consensus, decisions based on analysis will not necessarily be the best we can make. We strive to make the best evidence based decisions.

Jim Frost says

Hi Stan,

This being statistics, the answer is a definite, “it depends.” I know that’s not helpful but a blanket answer isn’t possible. There are some cases where your data just don’t fit an existing analysis. It might deviate from the assumptions too much. Or, perhaps the appropriate test does not exist. In those cases, bootstrapping is clearly superior.

However, in other cases where your data completely satisfy the assumptions of a proven test, it’s harder to make the case that either method is superior. I’d say that bootstrapping is more flexible in terms of the conditions and tests that it can handle. I also haven’t thoroughly researched bootstrapping and might be unaware of how it compares to traditional methods (such as t-tests and CIs) when your data do satisfy the assumptions. I wouldn’t be surprised if someone performed a simulation study to look into this question. If this is a question you face for a study, it would probably be wise to research it.

I also don’t know the properties of your data. My sense is that the more closely your data follow the normal distribution the more equivalent the two approaches become. However, as your data diverge from the normal distribution, I’d expect bootstrapping to become the better analysis. However, I’m not familiar enough with that literature to give you practical advice for making that decision.

In statistics, knowing which test is better typically depends on understanding the characteristics of your data and the stringency of the relevant requirements. This holds true for deciding between traditional vs. bootstrapping methods.

Stan Alekman says

I wonder. I collect a sample and estimate a mean and confidence interval by the traditional t-distribution.

Then I re-estimate the mean and confidence interval by bootstrapping and find a somewhat different mean and a narrower interval.

Is it appropriate to report the boot strap estimates? Will bootstrap estimates be acceptable for journal publications? Are boot strap estimates superior?

Jim Frost says

Hi Stan,

Unfortunately, I don’t have concrete answers for your questions. In terms of what journal publications will accept, that will vary by field and journal. Most journal articles I’ve read use the traditional t-distribution, tests, and CIs. I think that’s mainly due to familiarity and tradition rather than it being better. Most people are more familiar with the traditional hypothesis tests. However, that’s not to say that journals won’t accept bootstrap results. I’d look into what the journal has published as well as others in your field. There is a good case to be made for bootstrap methods.

Where I think the bootstrap method really shines is for cases where you don’t satisfy the assumptions for a traditional test. Or, perhaps there isn’t even a traditional method for what you want to accomplish. That’s where I’d say that bootstrapping is superior. If you have data that satisfy the assumptions, my sense is that both methods are similarly good.

Sorry for the vague answer. But, I don’t think a concrete one exists!
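To make the comparison Stan describes concrete, here is a minimal Python sketch that computes both intervals from the same sample. The data are simulated purely for illustration (they are not Stan's data), the function names are hypothetical, and the t critical value of 2.045 is the tabled value for 29 degrees of freedom, so it only applies to a sample of 30.

```python
import random
import statistics

def t_interval_95(sample, t_crit):
    """Traditional 95% CI for the mean: mean +/- t * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.fmean(sample)
    margin = t_crit * statistics.stdev(sample) / n ** 0.5
    return mean - margin, mean + margin

def percentile_bootstrap_ci(sample, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, same size as the original sample,
    # and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    lower = means[int(n_resamples * tail)]
    upper = means[int(n_resamples * (1 - tail)) - 1]
    return lower, upper

# Hypothetical sample: 30 simulated measurements (assumption for illustration).
data_rng = random.Random(42)
sample = [data_rng.gauss(50, 10) for _ in range(30)]

# 2.045 is the two-sided 95% t critical value for df = 29 (n = 30 only).
t_lo, t_hi = t_interval_95(sample, t_crit=2.045)
b_lo, b_hi = percentile_bootstrap_ci(sample)

print(f"t-interval:         ({t_lo:.2f}, {t_hi:.2f})")
print(f"bootstrap interval: ({b_lo:.2f}, {b_hi:.2f})")
```

With small samples, the percentile interval tends to come out a little narrower than the t-interval, so a narrower interval does not automatically mean a better one.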

Fizza says

Hey. How can I use bootstrapping for multiple regression?

Shashank Garg says

Thank you, Jim, for such a simple explanation of bootstrapping. I had been trying to grasp the basics of the design for a long time but was not able to figure it out. Now it will be easier for me to understand further details of it. I was also not a supporter of the theory that all phenomena are normally distributed. Although bootstrapping also makes assumptions, we still have something new to ponder.

Jim Frost says

Hi Shashank,

You’re very welcome! As someone who “grew up” on traditional hypothesis testing procedures, learning about bootstrapping was very interesting.

In traditional hypothesis testing, it’s true that not all distributions are normal. However, the central limit theorem is our friend in that regard because, with a large enough sample, the sampling distributions approximate the normal distribution, which satisfies the assumption for those tests.
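A quick simulation can illustrate the central limit theorem at work. The sketch below is only an illustration under stated assumptions: it draws samples from an exponential distribution (chosen here simply because it is strongly right-skewed), and the helper names are hypothetical. It measures how skewed the distribution of sample means is for several sample sizes. A normal distribution has a skewness of zero.

```python
import random
import statistics

def skewness(values):
    """Population skewness: the average cubed z-score."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_means(n, n_samples=5000, seed=0):
    """Means of many samples of size n drawn from a skewed population."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(n_samples)
    ]

# Skewness of the simulated sampling distribution for each sample size.
skews = {n: skewness(sample_means(n)) for n in (2, 10, 50)}
for n, skew in skews.items():
    print(f"n = {n:2d}: skewness of sample means = {skew:.2f}")
```

The skewness should shrink toward zero as the sample size grows, which is the sense in which the sampling distribution approximates the normal distribution even though the underlying data do not.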

saroja says

I love the way you make the concept clear.

Debanjan says

So, should my original mean fall within the bootstrap confidence interval or not?

Jim Frost says

For a 95% confidence interval, you can be 95% confident that the interval contains the population mean. The population mean is the unknowable parameter that we’re estimating with a sample. So, yes, the process typically produces intervals that contain the population parameter. However, occasionally it won’t because of an unusual sample. Of course, this assumes that you’re drawing a random, independent sample.
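One way to see what "typically contains the population parameter" means is to simulate many studies from a population whose mean you know, and count how often the bootstrap interval captures it. The following is a rough sketch, not a definitive check: the population (normal, mean 100, standard deviation 15), the sample size, and the function names are all assumptions made for illustration.

```python
import random
import statistics

def percentile_ci(sample, rng, n_resamples=500, level=0.95):
    """Percentile bootstrap CI for the mean of one sample."""
    n = len(sample)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - level) / 2
    return means[int(n_resamples * tail)], means[int(n_resamples * (1 - tail)) - 1]

def coverage(true_mean=100.0, sd=15.0, n=30, n_studies=200, seed=0):
    """Fraction of simulated studies whose CI contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        # Each "study" is a fresh random, independent sample.
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        lower, upper = percentile_ci(sample, rng)
        hits += lower <= true_mean <= upper
    return hits / n_studies

coverage_rate = coverage()
print(f"Share of intervals containing the population mean: {coverage_rate:.1%}")
```

Most of the simulated intervals should contain the population mean, and a small share should miss it because of unusual samples. With a small sample size, the percentile method's share can fall somewhat below the nominal 95%.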

Karan Desai says

Hi Jim,

This is the first time I read about bootstrapping and loved the concept. No wonder the name of the method is bootstrapping. You have explained it really well. Your blog is a gem.

Stanley Alekman says

Thanks for the explanation. I failed to understand that earlier.

Stan Alekman

Jim Frost says

You bet. And, it probably means I didn’t explain it clearly enough!

Stan Alekman says

Not to be argumentative, inference from a single non-representative sample as opposed to a hundred thousand resamples from a single non-representative sample seems like the wrong direction to take.

Jim Frost says

It actually works out to be fairly equivalent. The traditional approach uses the sample to calculate a sampling distribution, such as the t-distribution. That distribution is calculated from your one sample and it is equivalent to the distribution you’d obtain after performing the analysis (e.g., t-test) an infinite number of times. If your sample is not representative, that distribution will not be correct.

I’m not trying to convince you to use bootstrapping by any means. But a non-representative sample will affect the sampling distribution for both approaches because both use a single sample to estimate a sampling distribution. The methodology to produce that sampling distribution is different (resampling vs. formulas), but the end results are similar.

I haven’t used bootstrapping methods extensively myself. My training and experience have been with the traditional methods. However, the research that supports the validity of bootstrap methodology is very strong.

Stan Alekman says

Thanks for the info re bootstrapping regression coefficients, etc. Frankly, I am distrustful of bootstrap estimates. The underlying assumption is that the original sample mimics the population. It is very difficult to collect truly random samples in industrial settings.

Jim Frost says

I think in general it’s harder than commonly recognized to get a truly random, representative sample. The only thing that I’d add is that the traditional statistical methods also assume representative samples. So, if that’s a problem, it’ll affect both bootstrap and traditional methods.

Aijaz Ahmad Dar says

I am interested in bootstrapping and I am using it. But I have a question that I have asked many people without getting an answer. My question is how to find the confidence interval (CI) for the support parameter (I mean the situation where the MLE is the first-order or nth-order statistic), for example, in the Pareto distribution or the power distribution.

Stan Alekman says

Thank you. Look forward to it.

Stan Alekman says

Can you reply to my specific question regarding tolerance intervals by bootstrapped mean and bootstrapped standard deviation? This would be an excellent procedure, if valid, to generate precise tolerance intervals.

Thank you.

Stan Alekman

Jim Frost says

Hi Stan,

You can create bootstrapped tolerance intervals. I don’t know enough about it right now to give you an intelligent response about it. That’s forthcoming after I learn more!

Stan Alekman says

Can you prepare an article describing how to bootstrap regression coefficients, and regression coefficient confidence intervals?

Can bootstrap estimates of means and standard deviations (as in your example) be used to estimate tolerance intervals using the bootstrapped mean ± k × bootstrapped sigma, where k is the smallest value in the table, since hundreds of thousands of bootstrap sampling steps are used to estimate the bootstrapped sigma?

Regards,

Stan Alekman

Jim Frost says

Hi Stan,

I was wondering what the reaction would be to bootstrapping. I had hopes there would be interest in it. I think it’s safe to say that there will be more articles about it!

Matt says

Jim, great article, generating lots of discussion among my peers. Thanks.

“An Introduction to Statistical Learning with Applications in R” by Gareth James et al. has a short section (5.2, pages 187-190) on bootstrapping, with an example on regression coefficients. Essentially, the bootstrapped samples draw the X and Y data from the original, then you calculate the regression coefficient for each bootstrapped sample. Across all bootstrapped samples, calculate the statistic of interest for the coefficient.
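The procedure Matt describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the book's code: it uses simulated data, a single predictor, and hypothetical helper names. The key step is resampling whole (X, Y) pairs with replacement so each observation's values stay together. For multiple regression, the same loop would apply, refitting the full model on each resample.

```python
import random
import statistics

def slope(pairs):
    """Least-squares slope for a list of (x, y) pairs."""
    mean_x = statistics.fmean(x for x, _ in pairs)
    mean_y = statistics.fmean(y for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    return sxy / sxx

def bootstrap_slopes(pairs, n_resamples=5000, seed=0):
    """Refit the slope on resamples of whole (x, y) pairs."""
    rng = random.Random(seed)
    n = len(pairs)
    return [slope(rng.choices(pairs, k=n)) for _ in range(n_resamples)]

# Hypothetical data: y = 2 + 1.5x + noise (assumption for illustration).
data_rng = random.Random(1)
xs = [data_rng.uniform(0, 10) for _ in range(40)]
pairs = [(x, 2 + 1.5 * x + data_rng.gauss(0, 2)) for x in xs]

fitted_slope = slope(pairs)
slopes = sorted(bootstrap_slopes(pairs))

# Summaries across all bootstrapped samples.
slope_se = statistics.stdev(slopes)       # bootstrap standard error
ci_lo = slopes[int(len(slopes) * 0.025)]  # 2.5th percentile
ci_hi = slopes[int(len(slopes) * 0.975) - 1]  # 97.5th percentile

print(f"slope = {fitted_slope:.3f}, SE = {slope_se:.3f}, "
      f"95% CI = ({ci_lo:.3f}, {ci_hi:.3f})")
```

Resampling pairs is only one way to bootstrap a regression. Resampling the residuals is a common alternative, and which one suits a given study depends on the design.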

Sampath says

It’s a really interesting post. Thank you, Jim.

Jim Frost says

Thank you, Sampath! I’m glad you enjoyed it.

Mcpheson says

Nicely intuitive.

Jim Frost says

Thank you!

محمد عبدالله محمد احمد says

Thanks a lot Mr. Frost

Jim Frost says

You’re very welcome!

ihsanullah says

Please provide examples.

Jim Frost says

Hi, I include a great example right in this post! 🙂