Here’s some shocking information for you—sample statistics are always wrong! When you use samples to estimate the properties of populations, you never obtain the correct values exactly. Don’t worry. I’ll help you navigate this issue using a simple statistical tool!
Statistics probably bombard you. You can find them all over the place, such as in the news media, surveys, commercials, and so on. Often these statistics aren’t meant to describe only the specific group of subjects that were measured. Instead, the goal is to infer properties about a larger population. This practice is called inferential statistics.
For example, when we read survey results, we are not learning about just the opinions of those who responded to the survey, but about an entire population. Or, when we see averages, such as health measures and salaries, we’re learning about them on the scale of a population, not just the few subjects in the study. Consequently, inferential statistics can provide extremely meaningful information.
Inferential statistics is a powerful tool because it allows you to use a relatively small sample to learn about an entire population. However, to have any chance of obtaining good results, you must follow essential procedures that help your sample to represent the population faithfully.
To learn more, read my posts about Descriptive vs. Inferential Statistics and Populations, Parameters, and Samples in Inferential Statistics.
Random Sampling Error is Always Present in Samples
Unfortunately, even when you diligently follow the proper methodology for performing a valid study, your estimates will almost always be at least a little wrong. I’m not referring to unscrupulous manipulation, mistakes, or methodology errors. I’m talking about cases where researchers use the correct sampling methodology and legitimate statistical calculations to produce the best possible estimates.
Why does this happen? Random sampling error is present in all samples. By sheer chance alone, your sample contains sampling error that causes the statistics to be at least a little bit off. Your data are not 100% representative of the population because they are not the entire population! A sample can never provide a perfect depiction of the population from which it is drawn.
All estimates are at least a little wrong, and some are very wrong. Unfortunately, the media and other sources often forget this point when they present statistics. Upon seeing an estimate, you should ask: how large is the difference between the estimate and the real population value?
The answer depends on the sample size and the variability in the data. We never know the exact population value, which is known as the population parameter. However, we can use a confidence interval to estimate the range within which the parameter is likely to fall. This handy statistical tool incorporates the sample size and variability to produce a range of likely values for parameters such as the population mean, standard deviation, and proportion. Confidence intervals provide the margin of error around a point estimate, so we have an idea of how wrong the estimate might be. Similarly, the margin of error in a survey tells you how near you can expect the survey results to be to the correct population value.
Additionally, the law of large numbers states that as the sample size grows, sample statistics will converge on the population value. Furthermore, the standard error of the mean quantifies the relationship between sample size and precision.
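To make the link between sample size and precision concrete, here is a minimal sketch of the standard error of the mean, which is the standard deviation divided by the square root of the sample size. The standard deviation of 15 and the sample sizes are illustrative assumptions, not values from any particular study.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

ASSUMED_SD = 15  # illustrative standard deviation (an assumption)

for n in (10, 100, 1000):
    print(f"n = {n:>4}: standard error = {standard_error(ASSUMED_SD, n):.2f}")
```

Because of the square root, the sample size has to grow a hundredfold to cut the standard error by a factor of ten.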
Simulating How Samples and Confidence Intervals Estimate the Properties of a Population
I’ll illustrate how confidence intervals work using a process that lets us be all powerful and all knowing! Sounds like fun!
Typically, we don’t know the exact properties of a population. However, using statistical software, we can define the features of a population and then draw random samples from it. In this manner, we can assess the sample statistics and confidence intervals while knowing the correct population values.
For these examples, I’ll define a population that follows a normal distribution with a mean of 100 and a standard deviation of 15. These properties happen to match the accepted values for the distribution of IQ scores. I’ll have the software draw simulated random samples of size 10 and 100 from this population, and we can see how the estimates and CIs compare.
Here’s how the population appears in a probability distribution plot. We’ll see how the sample data appear in comparison.
You can download the CSV data file here: SimulatedIQData. Or, use statistical software and try it yourself!
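If you would like to try the simulation without dedicated statistical software, the following is a rough sketch using only Python's standard library. The seed and the helper name `draw_sample` are my own choices for illustration, so the numbers it prints will not match the samples discussed in this post.

```python
import random
import statistics

POP_MEAN, POP_SD = 100, 15   # population parameters from the example
rng = random.Random(1)       # arbitrary seed, used only for reproducibility

def draw_sample(n):
    """Draw n independent observations from Normal(POP_MEAN, POP_SD)."""
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]

# One simulated sample for each sample size used in the post.
samples = {n: draw_sample(n) for n in (10, 100)}

for n, data in samples.items():
    print(f"n = {n:>3}: mean = {statistics.mean(data):.2f}, "
          f"sd = {statistics.stdev(data):.2f}")
```

Changing the seed gives a different pair of samples, which is a quick way to see how much the estimates move around by chance alone.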
Example: Sample Statistics and CIs for 10 Observations
For the sample size of 10, here are the summary statistics for my random sample.
Statisticians usually consider a sample size of 10 to be a bit on the small side. From the histogram, the data do not look much like the original population. The estimates for the mean and standard deviation are 103.25 and 12.89, respectively. These are the point estimates for the population parameters. They’re both in the right ballpark for the correct values of 100 and 15.
We have our point estimates, but we know that those aren’t exactly correct. Let’s check the confidence intervals to see the ranges for where the actual parameter values are likely to fall.
The confidence interval for the mean is [94.03, 112.46], and for the standard deviation it is [8.86, 23.52]. Population parameters usually fall within their confidence intervals. Typically, we don’t know the actual parameter values, but for this illustration we can see that both parameters (100 and 15) fall within their intervals. The sample does not provide an exact representation of the population, but the estimates are not too far off. If we didn’t know the actual values, the CIs would give us useful guidance.
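For readers who want to see where such intervals come from, here is a sketch that recomputes them from the reported summary statistics. It assumes the software used the usual t-based interval for the mean and the chi-square-based interval for the standard deviation; the critical values are hard-coded from standard tables for 9 degrees of freedom rather than computed.

```python
import math

# Summary statistics reported for the sample of 10.
n, mean, sd = 10, 103.25, 12.89
df = n - 1

# 95% critical values for 9 degrees of freedom, copied from standard
# t and chi-square tables (an assumption about the method used).
T_CRIT = 2.262
CHI2_UPPER, CHI2_LOWER = 19.023, 2.700

# Mean: point estimate plus or minus t * (sd / sqrt(n)).
margin = T_CRIT * sd / math.sqrt(n)
mean_ci = (mean - margin, mean + margin)

# Standard deviation: sqrt(df * s^2 / chi-square critical value).
sd_ci = (math.sqrt(df * sd**2 / CHI2_UPPER),
         math.sqrt(df * sd**2 / CHI2_LOWER))

print(f"mean CI: [{mean_ci[0]:.2f}, {mean_ci[1]:.2f}]")
print(f"sd CI:   [{sd_ci[0]:.2f}, {sd_ci[1]:.2f}]")
```

The results agree with the reported intervals to within a hundredth or two; the small gaps are what you would expect from starting with rounded summary statistics.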
What does “95% Confidence” in the table indicate? Imagine that we collect 100 random samples from this population and calculate a confidence interval for the mean from each one, giving us 100 different intervals. If we set the confidence level at 95%, we’d expect about 95 of those 100 confidence intervals to contain the actual population parameter.
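This repeated-sampling idea is easy to check by simulation. The sketch below draws many samples of size 10 from the example population, builds a 95% t-interval for the mean from each, and reports the share of intervals that contain the true mean of 100. The number of replications, the seed, and the hard-coded table value for the t critical value are my assumptions.

```python
import math
import random
import statistics

POP_MEAN, POP_SD = 100, 15
N = 10
T_CRIT = 2.262           # 95% t critical value for 9 df, from a t-table
REPLICATIONS = 2000      # arbitrary; more replications steady the estimate

rng = random.Random(42)  # arbitrary seed for reproducibility

def ci_contains_mean():
    """Draw one sample and check whether its 95% CI covers POP_MEAN."""
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    margin = T_CRIT * statistics.stdev(sample) / math.sqrt(N)
    return abs(statistics.mean(sample) - POP_MEAN) <= margin

coverage = sum(ci_contains_mean() for _ in range(REPLICATIONS)) / REPLICATIONS
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should land close to 0.95, though rarely exactly on it, because the simulation itself is subject to random error.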
For a more in-depth explanation, please read my post: How Confidence Intervals Work.
Example: Sample Statistics and CIs for 100 Observations
Now, let’s look at the sample size of 100. Those results are below.
For this larger sample, the histogram is beginning to look more like the underlying population distribution. The estimates for the mean and standard deviation are 99.553 and 15.597, respectively. Both of these point estimates are closer to the actual population values than their counterparts in the smaller sample.
For the confidence intervals, both of the CIs again contain the parameters. However, notice that these intervals are tighter than for the sample size of 10. For example, the CI for the mean is [96.458, 102.648] compared to [94.03, 112.46] for the sample size of 10. That’s a range of about 6 IQ points rather than 18. The tighter intervals indicate that these estimates are more precise than those from the smaller sample. In other words, the difference between the estimate and the actual value is likely to be smaller for the larger sample.
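The width comparison can be reproduced from the summary statistics of the two samples. This sketch again assumes a standard t-based interval for the mean, with critical values hard-coded from a t-table; the dictionary names are hypothetical.

```python
import math

# 95% t critical values for n - 1 degrees of freedom (from a t-table).
T_CRIT = {10: 2.262, 100: 1.984}
# Reported summary statistics, keyed by sample size: (mean, sd).
SUMMARY = {10: (103.25, 12.89), 100: (99.553, 15.597)}

def mean_ci(n):
    """t-based 95% confidence interval for the mean at sample size n."""
    mean, sd = SUMMARY[n]
    margin = T_CRIT[n] * sd / math.sqrt(n)
    return mean - margin, mean + margin

widths = {}
for n in SUMMARY:
    low, high = mean_ci(n)
    widths[n] = high - low
    print(f"n = {n:>3}: CI = [{low:.2f}, {high:.2f}], width = {widths[n]:.1f}")
```

Most of the narrowing comes from the square root of the sample size in the denominator; the smaller critical value for the larger sample contributes a little as well.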
Sample Size and the Precision of Sample Estimates
As I mentioned earlier, both sample size and variability affect the precision of sample estimates. However, we often can’t control the variability in the data, only the amount of data we collect. Increasing the sample size tends to improve the precision of the estimates. In other words, you can expect the difference between the estimate and the actual population value to be smaller when you use a larger sample.
Unfortunately, it is possible to use a larger sample size and yet still obtain a difference between the estimate and actual value that is relatively large. Sometimes you draw a fluky sample thanks to bad luck! However, such a fluke becomes less and less likely as you gather more data.
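To put rough numbers on how quickly fluky samples become rare, the sketch below computes the probability that a sample mean lands more than 5 points away from the population mean for the example population with a standard deviation of 15. The 5-point cutoff and the sample sizes are hypothetical choices for illustration. The calculation relies on the fact that, for a normal population, the sample mean is normally distributed with a standard deviation equal to the standard error.

```python
import math
from statistics import NormalDist

POP_SD = 15
THRESHOLD = 5  # hypothetical cutoff for a "fluky" sample mean, in IQ points

def prob_far_off(n):
    """P(|sample mean - population mean| > THRESHOLD) for sample size n."""
    se = POP_SD / math.sqrt(n)
    return 2 * (1 - NormalDist().cdf(THRESHOLD / se))

for n in (10, 30, 100):
    print(f"n = {n:>3}: probability = {prob_far_off(n):.4f}")
```

With 10 observations, a miss of that size happens in more than a quarter of samples; with 100 observations, it happens in fewer than one sample in a thousand.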
All sample statistics are wrong to some extent due to random error. Depending on your sample size, variability, and the luck of the draw, the difference can be large or small. Use confidence intervals to estimate the range of likely values for the population parameters. If this range is too broad to be meaningful, increase your sample size!