Inferential statistics lets you draw conclusions about populations by using small samples. Consequently, inferential statistics provides enormous benefits because you typically can’t measure an entire population.

However, to gain these benefits, you must understand the relationship between populations, subpopulations, population parameters, samples, and sample statistics.

In this blog post, I discuss these concepts, and how to obtain representative samples using random sampling.

**Related post**: Difference between Descriptive and Inferential Statistics

## Populations

Populations can include people, but other examples include objects, events, businesses, and so on. In statistics, there are two general types of populations.

Populations can be the complete set of all similar items that exist. For example, the population of a country includes all people currently within that country. It’s a finite but potentially large list of members.

However, a population can be a theoretical construct that is potentially infinite in size. For example, quality improvement analysts often consider all current and future output from a manufacturing line to be part of a population.

Populations share a set of attributes that you define. For example, the following are populations:

- Stars in the Milky Way galaxy.
- Parts from a production line.
- Citizens of the United States.

Before you begin a study, you must carefully define the population that you are studying. These populations can be narrowly defined to meet the needs of your analysis. For example, a study might define its population as adult Swedish women who are otherwise healthy but have osteoporosis.

## Subpopulations can Improve Your Analysis

Subpopulations share additional attributes. For instance, the population of the United States contains the subpopulations of men and women. You can also subdivide it in other ways such as region, age, socioeconomic status, and so on. Different studies that involve the same population can divide it into different subpopulations depending on what makes sense for the data and the analyses.

Understanding the subpopulations in your study helps you grasp the subject matter more thoroughly. They can also help you produce statistical models that fit the data better. Subpopulations are particularly important when they have characteristics that are systematically different from the overall population. When you analyze your data, you need to be aware of these deeper divisions. In fact, you can treat the relevant subpopulations as additional factors in later analyses.

For example, if you’re analyzing the average height of adults in the United States, you’ll improve your results by including male and female subpopulations because their heights are systematically different. I’ll cover that example in depth later in this post!

## Population Parameters versus Sample Statistics

A parameter is a value that describes a characteristic of an entire population, such as the population mean. Because you can almost never measure an entire population, you usually don’t know the real value of a parameter. In fact, parameter values are nearly always unknowable. While we don’t know the value, it definitely exists.

For example, the average height of adult women in the United States is a parameter that has an exact value—we just don’t know what it is!

The population mean and standard deviation are two common parameters. In statistics, Greek symbols usually represent population parameters, such as μ (mu) for the mean and σ (sigma) for the standard deviation.

A statistic is a characteristic of a sample. If you collect a sample and calculate the mean and standard deviation, these are sample statistics. Inferential statistics allow you to use sample statistics to make conclusions about a population. However, to draw valid conclusions, you must use particular sampling techniques. These techniques help ensure that samples produce unbiased estimates. Biased estimates are systematically too high or too low. You want unbiased estimates because they are correct on average.

In inferential statistics, we use sample statistics to estimate population parameters. For example, if we collect a random sample of adult women in the United States and measure their heights, we can calculate the sample mean and use it as an unbiased estimate of the population mean. We can also perform hypothesis testing on the sample estimate and create confidence intervals to construct a range that the actual population value likely falls within.

| Population Parameter | Sample Statistic |
|---|---|
| Mu (μ) | Sample mean |
| Sigma (σ) | Sample standard deviation |
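To make the parameter versus statistic distinction concrete, here is a minimal sketch in Python. It assumes a synthetic population of heights (the values 163 cm and 7 cm are illustrative assumptions, not real data), so that we can do what is normally impossible: compute the parameters directly and compare them to the statistics from one random sample.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: 100,000 heights in cm (synthetic, not real data).
population = [random.gauss(163.0, 7.0) for _ in range(100_000)]

# Parameters: computed from the entire population (normally unknowable).
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# Statistics: computed from one simple random sample of 50 members.
sample = random.sample(population, 50)
x_bar = statistics.fmean(sample)
s = statistics.stdev(sample)  # sample standard deviation (n - 1 denominator)

print(f"parameters: mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"statistics: x_bar = {x_bar:.2f}, s = {s:.2f}")
```

The sample values should land near the population values without matching them exactly. In a real study only the second pair of numbers is available.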

**Related posts**: Measures of Central Tendency and Measures of Variability

## Representative Sampling and Simple Random Samples

In statistics, sampling refers to selecting a subset of a population. After drawing the sample, you measure one or more characteristics of all items in the sample, such as height, income, temperature, or opinion. If you want to draw conclusions about these characteristics in the whole population, that goal imposes restrictions on how you collect the sample. If you use an incorrect methodology, the sample might not represent the population, which can lead you to erroneous conclusions.

The best-known method to obtain an unbiased, representative sample is simple random sampling. With this method, all items in the population have an equal probability of being selected. This process helps ensure that the sample includes the full range of the population. Additionally, all relevant subpopulations should be incorporated into the sample and represented accurately on average. Simple random sampling minimizes bias and simplifies data analysis.

I’ll discuss sampling methodology in more detail in a future blog post, but there are several crucial caveats about simple random sampling. While this approach minimizes bias, it does not guarantee that your sample statistics exactly equal the population parameters. Instead, estimates from a specific sample are likely to be a bit high or low, but the process produces accurate estimates on average. Furthermore, it is possible to obtain unusual samples with random sampling; it’s just not the expected result.
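A small simulation can illustrate these caveats. The sketch below assumes a synthetic population (the mean and spread are illustrative assumptions) and draws many simple random samples from it. Individual sample means miss the population mean, sometimes by a noticeable amount, yet their average sits very close to it.

```python
import random
import statistics

random.seed(2)

# Hypothetical population of heights in cm (synthetic, not real data).
population = [random.gauss(163.0, 7.0) for _ in range(50_000)]
mu = statistics.fmean(population)

# Draw many independent simple random samples and record each sample mean.
sample_means = [
    statistics.fmean(random.sample(population, 30))
    for _ in range(2_000)
]

mean_of_means = statistics.fmean(sample_means)
largest_miss = max(abs(m - mu) for m in sample_means)

print(f"population mean:         {mu:.2f}")
print(f"average of sample means: {mean_of_means:.2f}")
print(f"largest single miss:     {largest_miss:.2f}")
```

The gap between the population mean and the average of the sample means is what "unbiased" refers to. The largest single miss shows that an unusual sample is still possible.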

**Related post**: Sample Statistics Are Always Wrong (to Some Extent)!

Additionally, random sampling might sound a bit haphazard and easy to do, but neither is true. Simple random sampling assumes that you systematically compile a complete list of all people or items that exist in the population. You then randomly select subjects from that list and include them in the sample. It can be a very cumbersome process.
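Once the complete list exists, the selection step itself is short. Here is a rough sketch, assuming a hypothetical list of 10,000 member IDs (the names `sampling_frame` and `person_…` are made up for illustration). The hard part in practice is building that list, not running these lines.

```python
import random

random.seed(3)

# Hypothetical complete list of every member of the population.
sampling_frame = [f"person_{i:05d}" for i in range(1, 10_001)]

# Simple random sample without replacement:
# every member has the same chance of being selected.
sample_ids = random.sample(sampling_frame, k=100)

print(sample_ids[:5])
```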

Let’s bring these concepts to life!

## Example of a Population with Important Subpopulations

Suppose we’re studying the height of American citizens and let’s further assume that we don’t know much about the subject. Consequently, we collect a random sample, measure the heights in centimeters, and calculate the sample mean and standard deviation. Here is the CSV data file: Heights.

We obtain the following results:

Because we gathered a random sample, we can assume that these sample statistics are unbiased estimates of the population parameters.

Now, suppose we learn more about the study area and include male and female as subpopulations. We obtain the following results.

Notice how the single broad distribution has been replaced by two narrower distributions? The distribution for each gender has a smaller standard deviation than the single distribution for all adults, which is consistent with the tighter spread around the means for both men and women in the graph. These results show how the mean provides more precise estimates when we assess heights by gender. In fact, the mean for the entire population does not equal the mean for either subpopulation. It’s misleading!
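If you want to reproduce this pattern yourself, the following sketch shows the same kind of comparison. It does not use the Heights data file. Instead, it assumes synthetic heights with made-up group means and standard deviations, chosen only to illustrate how pooling two groups with different means widens the overall distribution.

```python
import random
import statistics

random.seed(4)

# Synthetic heights in cm. The group means and SDs are illustrative
# assumptions, not values from the Heights data file.
men = [random.gauss(176.0, 7.0) for _ in range(500)]
women = [random.gauss(162.0, 7.0) for _ in range(500)]
everyone = men + women

for label, data in [("all adults", everyone), ("men", men), ("women", women)]:
    print(f"{label:>10}: mean = {statistics.fmean(data):.1f}, "
          f"sd = {statistics.stdev(data):.1f}")
```

The pooled standard deviation comes out larger than either group’s, and the pooled mean falls between the two group means without describing either group well.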

During this process, we learn that gender is a crucial subpopulation that relates to height and increases our understanding of the subject matter. In future studies about height, we can include gender as a predictor variable.

This example uses a categorical grouping variable (Gender) and a continuous outcome variable (Heights). When you want to compare distributions of continuous values between groups like this example, consider using boxplots and individual value plots. These plots become more useful as the number of groups increases.

This example is intentionally easy to understand, but imagine a study about a less obvious subject. This process helps you gain new insights and produce better statistical models.

Using your knowledge of populations, subpopulations, parameters, sampling, and sample statistics, you can draw valuable conclusions about large populations by using small samples. For more information about how you can test hypotheses about populations, read my Overview of Hypothesis Tests.

Hamthal says

population vs. sample, and the terms

parameter vs. statistic which a researcher almost always use, and

why?

Jim Frost says

Hi Hamthal,

Read this article more carefully! It’s clear in this article that researchers will almost never know the population parameters. In fact, they are usually unknowable. Instead, researchers use sample statistics to estimate the parameter values.

Melissa says

Hi Jim, if the only sampling method that we can use is convenience sampling, or samples that are obtained by voluntary response (which are biased), should we still proceed with our research?

Jim Frost says

Hi Melissa

That’s a tricky situation. Often researchers will have samples that aren’t truly random. The question then becomes understanding the implications of the nonrandomness for your sample.

Are you talking about data that aren’t random but used a systematic technique such as a stratified or clustered sample? These methods approximate random sampling but use some intentional differences. I talk about these methods in my Introduction to Statistics ebook. There are techniques that can handle these types of samples.

Or, do you mean a convenience sample? In this case, you need to understand the ways in which your sample is different from your study population. Your results might be biased one way or another. There’s no firm answer I can give here because it depends on the specifics of how your sample is different from the population. It weakens your evidence undoubtedly. You can’t really trust the p-values and confidence intervals. Effects can be biased. How much these issues affect your results depends on your sample. How different is your sample from a random sample? You need to understand that. A place to start would be to look at the various properties of your sample and compare those properties to published values of the population you are studying. Are there any striking differences?

I hope that helps!

Yvonne Kennedy says

Hello Sir

Does the value of statistic necessarily equal parameter and why?

Jim Frost says

Hi Yvonne,

It can be surprising, but no, the sample statistic doesn’t necessarily equal the population parameter. In fact, the sample statistic is almost always at least a little different from the parameter. That difference between the statistic and the parameter is sampling error. A key goal of inferential statistics is estimating the size of sampling error so you can understand how good your estimate is. Sampling error occurs because your sample, even with appropriate random sampling methodology, won’t exactly represent the full population.

For more on this topic, read my post about how sample statistics are always wrong (to some extent).

Tinashe says

How bias influences the estimation of a population parameter

Trushasingh says

Such a helpful content ,understood the topic very clearly ,thanks uh so much sir for providing this kind of explanation

Extremely grateful !

–trusha

Allissa says

Hi there. Is a population parameter a value or the characteristic? i.e. is the population parameter ‘the proportion of faulty items in a production batch’ or ‘5% of items in a production batch are faulty’

Thank you 🙂

Jim Frost says

Hi Allissa,

A parameter is a value that describes a characteristic of a population. For example, the mean height of all women in the United States is a parameter. It has a specific value; we just don’t know what it is. That value is for a specific characteristic (height). So, it’s a value that uses units relevant to the characteristic (such as cm).

For your example, the parameter is the proportion of faulty items. The actual parameter value is a proportion for the entire population. Of course, we’ll never know it exactly. You mention “5% of a batch.” Now that is a sample estimate of the parameter, not the parameter itself. Usually, the best we can do is estimate a parameter.

So, parameters are values but we never know those values exactly. However, we can estimate them.

Haley says

I am trying to understand the importance of parameters in drawing conclusions when an exact value can be calculated. Can you explain this for me?

Musa Aniya says

What are population parameters and how can they be used for estimation?

Jim Frost says

Hi Musa,

This post answers your question. Read the section titled “Population Parameters versus Sample Statistics” more closely!

Lavan says

Dear Sir

Hope you are doing well, I want to ask a clarification when your time permit, please throw some light on it.

Which is the best way to estimate the (population) parameter?

1. Calculate the required sample size by defining the Z-score (95%, 1.96), the error (for example, 0.03), and p (say 0.5 for maximum sample size), then estimate the sample statistic (for example, the sample proportion). Then we say the calculated sample proportion is an unbiased estimator of the population proportion, and we are 95% confident that the population proportion lies within plus or minus 0.03 (this value was used for calculating sample size) of the sample proportion. That is,

p- 0.03=< P <= p + 0.03

Or

We take a small sample (not calculate sample size statistically, say 40) due to limitation but using sampling techniques (srs, cluster or ..) while selecting a sample, then calculate the sample proportion after that and its variance (using statistical techniques). Finally, we say population proportion-P lies between p + – Z SE(p). That is,

p- Z[SE(p)] =< P <= p + Z [SE(p)]

Please clarify it, when your time permits.

Jim Frost says

Hi Lavan,

I’m not sure exactly what you’re asking. There are established power analysis methods for estimating the sample sizes required to obtain the statistical power that you specify. There are also procedures for determining the precision of the estimate. For those types of procedures, you’ll need to enter information such as estimated effect size and estimated standard deviations. Read the post I link to for more information.

I hope this helps,

Jim

Sudarshan says

Content of this blog is awesome, quick absorb-able , Thank You

Jim Frost says

You’re very welcome, Sudarshan! I’m happy to hear that it was helpful!

Sajad Ahmad Mir says

Hello sir! I m always enjoying ur e-lectures. I hv a query. I m conducting research on gender based and type of school management based sample. Target population is secondary school students. I employed d probability sampling of randomization

Cud u tell me wheth i shud adopt straitified or simple random sampling technique m how. Remember d population is itself large but finite here