Inferential statistics lets you draw conclusions about populations by using small samples. Consequently, inferential statistics provides enormous benefits because you typically can’t measure an entire population.
In this blog post, I discuss populations, subpopulations, parameters, and sample statistics, and how to obtain representative samples using random sampling.
Populations can include people, but other examples include objects, events, businesses, and so on. In statistics, there are two general types of populations.
First, a population can be the complete set of all similar items that exist. For example, the population of a country includes all people currently within that country. It’s a finite but potentially large list of members.
Second, a population can be a theoretical construct that is potentially infinite in size. For example, quality improvement analysts often consider all current and future output from a manufacturing line to be part of a population.
Populations share a set of attributes that you define. For example, the following are populations:
- Stars in the Milky Way galaxy.
- Parts from a production line.
- Citizens of the United States.
Before you begin a study, you must carefully define the population that you are studying. You can define a population narrowly to meet the needs of your analysis. For example, you might study adult Swedish women who are otherwise healthy but have osteoporosis.
Subpopulations Can Improve Your Analysis
Subpopulations share additional attributes. For instance, the population of the United States contains the subpopulations of men and women. You can also subdivide it in other ways such as region, age, socioeconomic status, and so on. Different studies that involve the same population can divide it into different subpopulations depending on what makes sense for the data and the analyses.
Understanding the subpopulations in your study helps you grasp the subject matter more thoroughly. They can also help you produce statistical models that fit the data better. Subpopulations are particularly important when they have characteristics that are systematically different from those of the overall population. When you analyze your data, you need to be aware of these deeper divisions. In fact, you can treat the relevant subpopulations as additional factors in later analyses.
For example, if you’re analyzing the average height of adults in the United States, you’ll improve your results by including male and female subpopulations because their heights are systematically different. I’ll cover that example in depth later in this post!
Population Parameters versus Sample Statistics
A parameter is a value that describes a characteristic of an entire population, such as the population mean. Because you can almost never measure an entire population, you usually don’t know the real value of a parameter. In fact, parameter values are nearly always unknowable. While we don’t know the value, it definitely exists.
For example, the average height of adult women in the United States is a parameter that has an exact value—we just don’t know what it is!
The population mean and standard deviation are two common parameters. In statistics, Greek symbols usually represent population parameters, such as μ (mu) for the mean and σ (sigma) for the standard deviation.
A statistic is a characteristic of a sample. If you collect a sample and calculate the mean and standard deviation, these are sample statistics. Inferential statistics allows you to use sample statistics to draw conclusions about a population. However, to draw valid conclusions, you must use particular sampling techniques. These techniques help ensure that samples produce unbiased estimates. Biased estimates are systematically too high or too low. You want unbiased estimates because they are correct on average.
In inferential statistics, we use sample statistics to estimate population parameters. For example, if we collect a random sample of adult women in the United States and measure their heights, we can calculate the sample mean and use it as an unbiased estimate of the population mean. We can also perform hypothesis tests on the sample estimate and create confidence intervals, which provide a range that likely contains the actual population value.
| Population Parameter | Sample Statistic |
|---|---|
| Mu (μ) | Sample mean |
| Sigma (σ) | Sample standard deviation |
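To make the parameter versus statistic distinction concrete, here is a minimal Python sketch. It assumes a simulated population of heights (the mean of 163 cm and standard deviation of 7 cm are illustrative values, not real measurements), draws one random sample, and computes the sample statistics along with an approximate 95% confidence interval for the mean. The variable names are my own choices for illustration.

```python
import math
import random
import statistics

# Simulated population of heights in cm. The mean (163) and standard
# deviation (7) are illustrative assumptions, not measured values.
rng = random.Random(42)
population = [rng.gauss(163.0, 7.0) for _ in range(100_000)]

# Parameters describe the whole population. We can compute them here only
# because the population is simulated; in a real study they are unknown.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Draw a simple random sample of 200 members without replacement.
sample = rng.sample(population, 200)

# Sample statistics estimate the parameters.
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # uses the n - 1 denominator

# Approximate 95% confidence interval for the mean, using the normal
# critical value 1.96 (a reasonable approximation for a sample this size).
margin = 1.96 * s / math.sqrt(len(sample))
ci_low, ci_high = x_bar - margin, x_bar + margin

print(f"parameter mu = {mu:.2f}, statistic x_bar = {x_bar:.2f}")
print(f"parameter sigma = {sigma:.2f}, statistic s = {s:.2f}")
print(f"approximate 95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```

In a real study, only the sample-based lines would be available to you. The parameters are printed here purely so you can compare them with their estimates.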
Representative Sampling and Simple Random Samples
In statistics, sampling refers to selecting a subset of a population. After drawing the sample, you measure one or more characteristics of all items in the sample, such as height, income, temperature, opinion, etc. If you want to draw conclusions about these characteristics in the whole population, that goal imposes restrictions on how you collect the sample. If you use an incorrect methodology, the sample might not represent the population, which can lead you to erroneous conclusions.
The best-known method for obtaining an unbiased, representative sample is simple random sampling. With this method, all items in the population have an equal probability of being selected. This process helps ensure that the sample includes the full range of the population. Additionally, the sample should incorporate all relevant subpopulations and, on average, represent them accurately. Simple random sampling minimizes bias and simplifies data analysis.
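The following sketch contrasts a simple random sample with a convenience sample that reaches only one subpopulation. Everything in it is an assumption made for illustration: the population is simulated, the subpopulation labels "A" and "B" are hypothetical, and the means and standard deviations are arbitrary.

```python
import random
import statistics

rng = random.Random(11)

# Simulated population with two equally sized subpopulations.
# Labels "A" and "B" and all numeric values are hypothetical.
population = (
    [("A", rng.gauss(160.0, 6.0)) for _ in range(20_000)]
    + [("B", rng.gauss(175.0, 6.0)) for _ in range(20_000)]
)
mu = statistics.fmean([value for _, value in population])


def mean_of(members):
    """Return the mean of the measured values in a list of (label, value)."""
    return statistics.fmean([value for _, value in members])


# Simple random sample: every member has the same chance of selection.
random_sample = rng.sample(population, 400)

# Convenience sample: only members of subpopulation B were reachable.
only_b = [member for member in population if member[0] == "B"]
convenience_sample = rng.sample(only_b, 400)

# Share of subpopulation A that ended up in the random sample.
share_a = sum(1 for label, _ in random_sample if label == "A") / len(random_sample)

print(f"population mean:         {mu:.2f}")
print(f"random sample mean:      {mean_of(random_sample):.2f}")
print(f"convenience sample mean: {mean_of(convenience_sample):.2f}")
print(f"share of A in the random sample: {share_a:.2f}")
```

The random sample picks up both subpopulations in roughly their true proportions, while the convenience sample misses one of them entirely, so its mean lands systematically too high.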
I’ll discuss sampling methodology in more detail in a future blog post, but there are several crucial caveats about simple random sampling. While this approach minimizes bias, it does not guarantee that your sample statistics exactly equal the population parameters. Instead, estimates from a specific sample are likely to be a bit high or low, but the process produces accurate estimates on average. Furthermore, it is possible to obtain unusual samples with random sampling; it’s just not the expected result.
Related post: Sample Statistics Are Always Wrong (to Some Extent)!
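A small simulation can illustrate what "accurate on average" means. This sketch assumes a simulated population (the mean of 170 and standard deviation of 10 are arbitrary) and a hypothetical helper named `sample_means`. It draws many random samples and compares the individual sample means with their overall average.

```python
import random
import statistics


def sample_means(population, n, repeats, seed=0):
    """Draw `repeats` simple random samples of size n; return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]


# Simulated population; the mean and standard deviation are arbitrary.
pop_rng = random.Random(1)
population = [pop_rng.gauss(170.0, 10.0) for _ in range(50_000)]
mu = statistics.fmean(population)

means = sample_means(population, n=50, repeats=2_000)

print(f"population mean:              {mu:.2f}")
print(f"lowest single sample mean:    {min(means):.2f}")
print(f"highest single sample mean:   {max(means):.2f}")
print(f"average of all sample means:  {statistics.fmean(means):.2f}")
```

Individual sample means scatter on both sides of the population mean, and a few land noticeably far from it, yet the average across all samples sits very close to the parameter.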
Additionally, random sampling might sound a bit haphazard and easy to do, but neither is true. Simple random sampling requires that you systematically compile a complete list of all people or items that exist in the population. You then randomly select subjects from that list and include them in the sample. It can be a very cumbersome process.
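Once the complete list exists, the selection step itself is short. Here is a minimal sketch using Python's standard `random.sample`, which selects without replacement. The list of part IDs and the `simple_random_sample` helper are hypothetical and exist only for illustration.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Select n distinct members from a complete list of the population."""
    if n > len(frame):
        raise ValueError("sample size cannot exceed the population size")
    rng = random.Random(seed)
    # random.sample draws without replacement, and every member of the
    # list has the same probability of being selected.
    return rng.sample(frame, n)


# Hypothetical complete list: 1,000 parts identified by made-up IDs.
frame = [f"part-{i:04d}" for i in range(1, 1001)]

sample = simple_random_sample(frame, 25, seed=7)
print(sample[:5])
```

Fixing the seed makes the selection reproducible, which helps when you need to document how a sample was drawn. The hard part in practice is building the list, not running the selection.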
Let’s bring these concepts to life!
Example of a Population with Important Subpopulations
Suppose we’re studying the heights of American citizens, and let’s further assume that we don’t know much about the subject. Consequently, we collect a random sample, measure the heights in centimeters, and calculate the sample mean and standard deviation. Here is the CSV data file: Heights.
We obtain the following results:
Because we gathered a random sample, we can assume that these sample statistics are unbiased estimates of the population parameters.
Now, suppose we learn more about the study area and include male and female as subpopulations. We obtain the following results.
Notice how the single broad distribution has been replaced by two narrower distributions? The distribution for each gender has a smaller standard deviation than the single distribution for all adults, which is consistent with the tighter spread around the means for both men and women in the graph. These results show how the mean provides more precise estimates when we assess heights by gender. In fact, the mean for the entire population does not equal the mean for either subpopulation. Used on its own, the overall mean is misleading!
During this process, we learn that gender is a crucial subpopulation that relates to height and increases our understanding of the subject matter. In future studies about height, we can include gender as a predictor variable.
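The same pattern can be reproduced with simulated data. The sketch below does not use the post's Heights file. Instead, it generates made-up heights for two groups (the group means and standard deviations are illustrative assumptions) and compares the overall summary with the per-group summaries. The `summarize` helper is a hypothetical name.

```python
import random
import statistics

rng = random.Random(3)

# Simulated heights in cm. These are NOT the values from the Heights file;
# the group means and standard deviations are illustrative assumptions.
heights = (
    [("female", rng.gauss(162.0, 7.0)) for _ in range(500)]
    + [("male", rng.gauss(176.0, 7.5)) for _ in range(500)]
)


def summarize(values):
    """Return the sample mean and sample standard deviation."""
    return statistics.fmean(values), statistics.stdev(values)


# Summary that ignores the subpopulations.
overall_mean, overall_sd = summarize([h for _, h in heights])

# Summaries computed separately for each subpopulation.
by_group = {}
for group, h in heights:
    by_group.setdefault(group, []).append(h)
group_stats = {group: summarize(values) for group, values in by_group.items()}

print(f"all adults: mean = {overall_mean:.1f}, sd = {overall_sd:.1f}")
for group, (mean, sd) in group_stats.items():
    print(f"{group}: mean = {mean:.1f}, sd = {sd:.1f}")
```

Because the two group means differ, pooling the groups adds the between-group gap to the spread, so each group's standard deviation comes out smaller than the overall one.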
This example is intentionally easy to understand, but imagine a study about a less obvious subject. There, the same process of identifying subpopulations helps you gain new insights and produce better statistical models.
Using your knowledge of populations, subpopulations, parameters, sampling, and sample statistics, you can draw valuable conclusions about large populations by using small samples. For more information about how you can test hypotheses about populations, read my Overview of Hypothesis Tests.