Sample Statistics Are Always Wrong (to Some Extent)!

By Jim Frost 11 Comments

Here’s some shocking information for you—sample statistics are always wrong! When you use samples to estimate the properties of populations, you never obtain the correct values exactly. Don’t worry. I’ll help you navigate this issue using a simple statistical tool!

Statistics probably bombard you. You can find them all over the place, such as in the news media, surveys, commercials, and so on. Often these statistics aren’t meant to describe only the specific group of subjects that were measured. Instead, the goal is to infer properties about a larger population. This practice is called inferential statistics.

For example, when we read survey results, we are not learning about just the opinions of those who responded to the survey, but about an entire population. Or, when we see averages, such as health measures and salaries, we’re learning about them on the scale of a population, not just the few subjects in the study. Consequently, inferential statistics can provide extremely meaningful information.

Inferential statistics is a powerful tool because it allows you to use a relatively small sample to learn about an entire population. However, to have any chance of obtaining good results, you must follow essential procedures that help your sample to represent the population faithfully.

To learn more, read my posts about Descriptive vs. Inferential Statistics and Populations, Parameters, and Samples in Inferential Statistics.

Random Sampling Error is Always Present in Samples

Unfortunately, even when you diligently follow the proper methodology for performing a valid study, your estimates will almost always be at least a little wrong. I’m not referring to unscrupulous manipulation, mistakes, or methodology errors. I’m talking about cases where researchers use the correct sampling methodology and legitimate statistical calculations to produce the best possible estimates.

Why does this happen? Random sampling error is present in all samples. By sheer chance alone, your sample contains sampling error that causes the statistics to be at least a little bit off. Your data are not 100% representative of the population because they are not the entire population! A sample can never provide a perfect depiction of the population from which it is drawn.

All estimates are at least a little wrong, but they can be very wrong. Unfortunately, the media and other sources forget this point when they present statistics. Upon seeing an estimate, you should wonder—how large is the difference between the estimate and the real population value?

The answer depends on the sample size and the variability in the data. We never know the exact population value, which is known as the population parameter. But we can estimate the range the parameter is likely to fall within using a confidence interval. This handy statistical tool incorporates the sample size and variability to produce a range of likely values for parameters such as the population mean, standard deviation, and proportion. Confidence intervals provide the margin of error around a point estimate so we have an idea of how wrong the estimate might be. Similarly, the margin of error in a survey tells you how near you can expect the survey results to be to the correct population value.
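As a rough sketch of the arithmetic, the usual textbook t-based confidence interval for a population mean is shown here. The post doesn’t state which formula the software uses, so treat this as an assumption. Here \bar{x} is the sample mean, s is the sample standard deviation, n is the sample size, and t^{*} is the critical value from the t-distribution with n - 1 degrees of freedom at the chosen confidence level 1 - \alpha.

```latex
\bar{x} \;\pm\; t^{*}_{\,n-1,\;1-\alpha/2} \cdot \frac{s}{\sqrt{n}}
```

The term after the ± sign is the margin of error. It grows with the variability s and shrinks as the sample size n increases.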

Additionally, the law of large numbers states that as the sample size grows, sample statistics will converge on the population value. Furthermore, the standard error of the mean quantifies the relationship between sample size and precision.
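To make the link between sample size and precision concrete, here’s a minimal Python sketch. It assumes the IQ-like population used later in this post (mean 100, standard deviation 15). The function names are hypothetical, and it uses only the standard library. It compares the theoretical standard error of the mean, sigma / sqrt(n), with the spread of many simulated sample means.

```python
import math
import random
import statistics


def standard_error(sigma: float, n: int) -> float:
    """Theoretical standard error of the mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)


def simulated_standard_error(mu, sigma, n, n_samples=2000, seed=1):
    """Standard deviation of many sample means, each from a sample of size n."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    means = []
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)


# Hypothetical sample sizes chosen for illustration
for n in (10, 100, 400):
    theory = standard_error(15, n)
    simulated = simulated_standard_error(100, 15, n)
    print(f"n={n}: theoretical SEM={theory:.3f}, simulated SEM={simulated:.3f}")
```

The two columns should track each other closely. Because of the square root, quadrupling the sample size only halves the standard error.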

Simulating How Samples and Confidence Intervals Estimate the Properties of a Population

I’ll illustrate how confidence intervals work using a process that lets us be all powerful and all knowing! Sounds like fun!

Typically, we don’t know the exact properties of a population. However, using statistical software, we can define the features of a population and then draw random samples from it. In this manner, we can assess the sample statistics and confidence intervals while knowing the correct population values.

For these examples, I’ll define a population that follows a normal distribution with a mean of 100 and a standard deviation of 15. These properties happen to match the accepted values for the distribution of IQ scores. I’ll have the software draw simulated random samples of size 10 and 100 from this population, and we can see how the estimates and CIs compare.
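If you’d like to try this without dedicated statistical software, here’s a minimal sketch in Python using only the standard library. The function name and the seed are hypothetical choices, and your numbers will differ from the ones reported in this post because the random draws are different.

```python
import random
import statistics

POP_MEAN = 100.0  # population mean defined in the text
POP_SD = 15.0     # population standard deviation defined in the text


def draw_sample(n, seed):
    """Draw n simulated observations from the normal population."""
    rng = random.Random(seed)  # seeded so the same sample can be reproduced
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]


for n in (10, 100):
    sample = draw_sample(n, seed=42)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)
    print(f"n={n}: mean={mean:.2f}, sd={sd:.2f}")
```

Changing the seed gives a different random sample each time, which is a quick way to see how much the point estimates bounce around at each sample size.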

Here’s how the population appears in a probability distribution plot. We’ll see how the sample data appear in comparison.

You can download the CSV data file here: SimulatedIQData. Or, use statistical software and try it yourself!

Example: Sample Statistics and CIs for 10 Observations

For the sample size of 10, here are the summary statistics for my random sample.

Statisticians usually consider a sample size of 10 to be a bit on the small side. From the histogram, the data do not look much like the original population. The estimates for the mean and standard deviation are 103.25 and 12.89, respectively. These are the point estimates for the population parameters. They’re both in the right ballpark for the correct values of 100 and 15.

We have our point estimates, but we know that those aren’t exactly correct. Let’s check the confidence intervals to see the ranges for where the actual parameter values are likely to fall.

The confidence interval for the mean is [94.03 112.46], and for the standard deviation it is [8.86 23.52]. The population parameters usually fall within their confidence intervals. Typically, we don’t know the actual parameter values, but for this illustration we can see that both parameters (100 and 15) fall within their intervals. The sample does not provide an exact representation of the population, but the estimates are not too far off. If we didn’t know the actual values, the CIs would give us useful guidance.
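The reported intervals can be approximately reproduced from the summary statistics alone. This sketch assumes the standard t-based interval for the mean and the chi-square-based interval for the standard deviation. The post doesn’t say which methods the software applied, so that is an assumption. The function names are hypothetical, and the critical values are hard-coded from standard tables for 9 degrees of freedom.

```python
import math

# Critical values from standard tables, 9 degrees of freedom (n - 1 = 9)
T_975_DF9 = 2.262      # t value leaving 2.5% in the upper tail
CHI2_975_DF9 = 19.023  # chi-square value with 97.5% of the area to its left
CHI2_025_DF9 = 2.700   # chi-square value with 2.5% of the area to its left


def mean_ci(mean, sd, n, t_crit):
    """t-based confidence interval for a population mean."""
    margin = t_crit * sd / math.sqrt(n)
    return mean - margin, mean + margin


def sd_ci(sd, n, chi2_upper, chi2_lower):
    """Chi-square-based confidence interval for a population SD.

    Assumes the data come from a normal population.
    """
    df = n - 1
    return sd * math.sqrt(df / chi2_upper), sd * math.sqrt(df / chi2_lower)


# Summary statistics reported in the text for the sample of 10
print(mean_ci(103.25, 12.89, 10, T_975_DF9))
print(sd_ci(12.89, 10, CHI2_975_DF9, CHI2_025_DF9))
```

The results land within a couple of hundredths of the intervals quoted above. The small gaps are what you’d expect from working with rounded summary statistics and rounded table values.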

What does “95% Confidence” in the table indicate? Imagine that we collect 100 random samples from this population. We’d end up calculating 100 different confidence intervals for the mean. If we set the confidence level at 95%, we’d expect about 95 out of 100 confidence intervals to contain the actual population parameter.
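You can check this interpretation by simulation. The sketch below assumes the same population (mean 100, standard deviation 15), samples of size 10, and a t-based interval for the mean. The function name is hypothetical. It counts how often the interval captures the true mean.

```python
import math
import random
import statistics

T_975_DF9 = 2.262  # t critical value for 95% confidence, 9 degrees of freedom


def coverage(n_intervals, n=10, mu=100.0, sigma=15.0, t_crit=T_975_DF9, seed=0):
    """Fraction of t-based CIs for the mean that contain the true mean.

    The default t_crit matches n=10; change both together.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_intervals):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        margin = t_crit * statistics.stdev(sample) / math.sqrt(n)
        if mean - margin <= mu <= mean + margin:
            hits += 1
    return hits / n_intervals


print(coverage(100))     # 100 intervals: close to 0.95, but it varies by chance
print(coverage(10_000))  # many more intervals: settles nearer to 0.95
```

With only 100 intervals, the count won’t be exactly 95 every time. The 95% figure describes the long-run behavior of the procedure, not a guarantee for any particular batch.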

For a more in-depth explanation, please read my post: How Confidence Intervals Work.

Example: Sample Statistics and CIs for 100 Observations

Now, let’s look at the sample size of 100. Those results are below.

For this larger sample, the histogram is beginning to look more like the underlying population distribution. The estimates for the mean and standard deviation are 99.553 and 15.597, respectively. Both of these point estimates are closer to the actual population values than their counterparts in the smaller sample.

For the confidence intervals, both of the CIs again contain the parameters. However, notice that these intervals are tighter than for the sample size of 10. For example, the CI for the mean is [96.458 102.648] compared to [94.03 112.46] for the sample size of 10. That’s a range of about 6 IQ points rather than 18. The tighter intervals indicate that these estimates are more precise than those from the smaller sample. In other words, the difference between the estimate and the actual value is likely to be smaller for the larger sample.
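Here’s the width comparison as a short calculation, using the two intervals for the mean quoted above. For reference, it also prints the square root of the sample-size ratio, which is roughly how much you’d expect the interval to shrink if the sample standard deviation and critical value stayed the same (an idealization; the variable names are hypothetical).

```python
import math

# Confidence intervals for the mean, as reported in the text
ci_n10 = (94.03, 112.46)
ci_n100 = (96.458, 102.648)


def width(ci):
    """Width of an interval given as (low, high)."""
    low, high = ci
    return high - low


ratio = width(ci_n10) / width(ci_n100)
print(f"width at n=10:  {width(ci_n10):.2f}")
print(f"width at n=100: {width(ci_n100):.2f}")
print(f"observed ratio: {ratio:.2f}")
print(f"sqrt(100 / 10): {math.sqrt(100 / 10):.2f}")
```

The observed ratio is close to, but not exactly, the square-root value because the two samples have different standard deviations and the t critical value is smaller for the larger sample.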

Sample Size and the Precision of Sample Estimates

As I mentioned earlier, both sample size and variability affect the precision of sample estimates. However, we often can’t control the variability in the data, only the amount of data we collect. Increasing the sample size tends to improve the precision of the estimates. In other words, you can expect the difference between the estimate and the actual population value to be smaller when you use a larger sample.

Unfortunately, it is possible to use a larger sample size and yet still obtain a difference between the estimate and actual value that is relatively large. Sometimes you draw a fluky sample thanks to bad luck! However, that outcome becomes less and less likely as you gather more data.
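To put rough numbers on “less and less likely,” this sketch computes the chance that a sample mean misses the population mean by more than a chosen amount. It assumes a normal population with a known standard deviation of 15. The 5-point threshold and the function name are hypothetical choices for illustration.

```python
import math
from statistics import NormalDist


def prob_miss(threshold, n, sigma=15.0):
    """P(|sample mean - population mean| > threshold) for a normal population."""
    se = sigma / math.sqrt(n)      # standard error of the mean
    z = threshold / se             # threshold in standard-error units
    return 2 * (1 - NormalDist().cdf(z))  # two-sided tail probability


for n in (10, 30, 100):
    print(f"n={n}: P(miss by more than 5 points) = {prob_miss(5, n):.4f}")
```

The probability never reaches exactly zero, but it drops off quickly as the sample size grows.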

All sample statistics are wrong to some extent due to random error. Depending on your sample size, variability, and the luck of the draw, the difference can be large or small. Use confidence intervals to estimate the range of likely values for the population parameters. If this range is too broad to be meaningful, increase your sample size!

Comments

Aviral Pandey says

May 3, 2020 at 8:11 am

Thank you sir for wonderful notes. Can you tell us the name of software which you have used for measuring sample size in this note.

- Jim Frost says
  
  May 3, 2020 at 3:37 pm
  
  Hi Aviral, I’m using Minitab statistical software. I’m glad my website has been helpful!
  
Craig H. Appel says

November 22, 2019 at 8:12 pm

Indeed there is a lot to consider, and I don’t want to be diverting you into an intricate question of only limited interest. I guess the bottom line for me is your conclusion:

“….the test results and CIs are based on the sampling distributions, which often approximate the normal distribution even when the original data do not.”

and this I take to mean: since your sampling is, or is presumed to be, random, a random sample of a population with an unknown distribution, once you get more than “a few” of these random samples, they will tend increasingly towards a normal distribution. My problem, or concern, or point, is this: there is NO WAY to infer the population distribution, from your sampling distribution.

It’s like Yossarian trying to get out of flying more missions: he says, “I see everything twice.” Doc Daneeka thinks he’s malingering, and tests him. “How many fingers am I holding up?” he asks, holding up two fingers. “Two!” says Yossarian. The Doc holds up one finger. “Two!” says Yossarian. He holds up four fingers. “Two!” says Yossarian. “By Golly,” says Daneeka to the Colonel. “He DOES see everything twice!”

That is, WHATEVER the population distribution may be, the sampling distribution (for a reasonable number of samples) will always be normal. So the statistics for that (normal) sample distribution, have NO NECESSARY RELATION to the real distribution of the data in the population – they’re artifactual.

Or, am I confused? And Jim, if I AM confused, just tell me so and I’ll follow your links and work it out myself – you can’t be running a tutoring service here. And, thank you.

- Jim Frost says
  
  November 22, 2019 at 10:09 pm
  
  Hi Craig,
  
  I think you’re confusing “distribution of the sample data” with the “sampling distribution.” Suppose the population has a skewed distribution. As you collect larger and larger random samples from that population, you’d expect the distribution of the sample data to more closely reflect the true population distribution. Therefore, if the population is skewed, you’d expect a large random sample to reflect that skew.
  
  However, the sampling distribution is something else entirely. It does not and isn’t supposed to reflect the population or sample distribution. It’s used for entirely different purposes, such as creating CIs.
  
  The fact that the sampling distribution tends to more closely approximate the normal distribution as you gather a larger sample is a good thing. It means you can trust the results of the hypothesis test and CIs even when the underlying distribution isn’t normally distributed. I think reading through what I shared will be helpful.
  
Craig H. Appel says

November 22, 2019 at 7:36 am

(Thanks for quick response!)

Then if I read, in a medical-science paper, that for a certain value of parameter x, the measured value is y, the SD s = B, CI = (D < y < E), _I_ can assume that the AUTHOR assumes that the spread of values is the result of random error, that it's symmetrical about the reported (mean) value, that about 68% of the measurements fall between y-s and y+s, and so on.

In the real world, however, the distribution of repeated measurements might be ANYTHING; the distribution might be, say, bi-modal, representing differences in technique between two investigators, or it might be monotonically decreasing with time, representing a learning effect, or saturation. And unless the distribution of repeated measurements of y at x is examined for EACH x, the assumption of random error is unwarranted. Is that correct?

- Jim Frost says
  
  November 22, 2019 at 5:43 pm
  
  Hi Craig,
  
  There’s a lot to consider about this issue. Yes, the distribution of measurements could be anything. Hopefully the author checks the distribution. Additionally, many statistical tests (e.g., t-tests, ANOVA, etc.) assume the data follow the normal distribution, but they are robust to departures from that assumption when the sample size per group is large enough. For a table of those sample sizes by type of test, see my post about parametric vs. nonparametric tests.
  
  It’s thanks to the central limit theorem (CLT) that these tests can handle nonnormal distributions when they have a large enough sample size. This theorem states that the sampling distribution converges on a normal distribution as the sample size increases regardless of the underlying data distribution. Technically there are a few exceptions but it works for almost all distributions. The CLT is important because it’s the sampling distributions that determine the statistical significance and confidence intervals. For more information about the CLT, read my post about the Central Limit Theorem.
  
  So, back to your question about the SD and the CI. The CIs are based on the standard error of the mean (SEM), which is the standard deviation of the sampling distribution. Therefore, if the sample size is large enough, the sampling distribution approximates a normal distribution where the SD equals the SEM. And, voila, it all works out OK. The trick, of course, is to have enough observations so the CLT “kicks in.” The table in the other post provides good rules of thumb based on simulation studies. But, the number depends on how skewed or nonnormal the data are.
  
  I’ve tackled confidence intervals for skewed data in a post about the bootstrapping methodology. In that post, you can see how you start out with a skewed data distribution that produces a sampling distribution which approximates the normal distribution. Then, using that sampling distribution, we construct the confidence interval. I think it’s a more intuitive way of understanding how it all works even when the original distribution is skewed. The traditional CI method produces a result that is very close.
  
  At any rate, the key point to remember is that the test results and CIs are based on the sampling distributions, which often approximate the normal distribution even when the original data do not.
  
Craig Appel says

November 20, 2019 at 8:21 am

For which “distributions” is the standard deviation defined? Or, perhaps more to the point, for what distributions is the SD UNDEFINED?

- Jim Frost says
  
  November 22, 2019 at 1:11 am
  
  Are you talking about probability distributions that use standard deviation as a parameter? If so, the normal distribution is the only one.
  
Abhijeet says

August 30, 2018 at 11:28 am

“Unfortunately, it is possible to use a larger sample size and yet still obtain a difference between the estimate and actual value that is relatively large. Sometimes you draw a fluky sample thanks to bad luck!”

Usually in observational studies, the samples collected may not be representative of the whole population. It also might not be possible to get a larger sample. How can then one have more confidence in their estimates?

- Jim Frost says
  
  August 30, 2018 at 1:58 pm
  
  Hi Abhijeet,
  
  Observational studies are studies where the researchers cannot control the experimental conditions and independent variables, and there is no randomization. Usually it’s because it is impossible to randomize or it’s unethical. Instead, the researchers observe, measure, and then statistically derive the relationships between the data.
  
  Now if you are performing an observational study and you suspect that the sample of subjects does not represent the larger population, you might be able to use the statistical model that describes the relationships between the independent variables and the dependent (outcome) variable to help derive the mean for the actual population. Because you’re performing an observational study, you should already have an idea of which variables affect the outcome. That’s crucial to have any shot at producing meaningful results. So, basically you’d weight your sample mean by using what you know about the relevant variables and how they are different for your sample versus the population.
  
  For example, if income is related to the outcome variable and you know the properties of that relationship, and you know that your sample group is either higher or lower income than the overall population, you can statistically adjust the results. Of course, you’d need to weight it using all of the relevant variables. In essence, you’re predicting what the results of a representative sample would have been.
  
  I hope this helps!
  
Ifedayo Adu says

August 29, 2018 at 10:54 pm

😍
