## What is a Population vs Sample?

Population vs sample is a crucial distinction in statistics. Typically, researchers use samples to learn about populations. Let’s explore the differences between these concepts!

**Population**: The whole group of people, items, or elements of interest.

**Sample**: A subset of the population that researchers select and include in their study.

Researchers might want to learn about the characteristics of a population, such as its mean and standard deviation. Unfortunately, populations are usually too large and expensive to study in their entirety.

Instead, the researchers draw a sample from the population to learn about it. Collecting data from a subset can be more efficient and cost-effective.

Inferential statistics use sample statistics, like the mean and standard deviation, to draw inferences about the corresponding population characteristics.

If we had to measure entire populations, we’d never be able to answer our research questions because populations tend to be too large and unwieldy. Fortunately, we can use a subset to move forward.

Read on to learn more about population and sample statistics, examples, and sampling methods.

**Related post**: Difference between Descriptive and Inferential Statistics

## Population and Sample Examples

For an example of population vs sample, researchers might be studying U.S. college students. This population contains about 19 million students and is too large and geographically dispersed to study fully. However, researchers can draw a subset of a manageable size to learn about its characteristics.

Or, medical researchers might want to understand the effect of a new medication on the general population, which contains a vast number of people. Obviously, they can’t administer the new drug to everyone and measure the results. Instead, they can recruit 2,000 participants, perform the experiment, and use the sample mean effect to estimate the population mean effect.

Surveys collect opinions from a sample of respondents to estimate the overall views of a population. For example, pollsters might want to understand political opinions in a state with millions of residents. They can survey 1,000 people to estimate the opinions of the entire state.
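To get a feel for how much a survey of 1,000 people can tell us, here is a minimal sketch of the usual normal-approximation margin of error for a sample proportion. The 52% figure is made up for illustration, and the calculation assumes a simple random sample from a population much larger than the sample.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Approximate margin of error for a sample proportion.

    Uses the normal approximation; z=1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Hypothetical poll: 1,000 respondents, 52% favor a measure (made-up figure).
n = 1000
p_hat = 0.52
moe = margin_of_error(p_hat, n)

print(f"estimate: {p_hat:.1%} +/- {moe:.1%}")
```

With these inputs, the margin of error works out to roughly three percentage points, which is why polls of about this size are so common.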

## Population vs Sample Statistics

Statisticians refer to population values as parameters and sample values as statistics. Learn more about Parameters vs Statistics: Examples & Differences.

Population parameters are precise but typically unknown values. For example, the population mean height for all U.S. women is a particular value. Unfortunately, parameter values tend to be unknowable. We can never measure the heights of all U.S. women, so we’ll never know the exact parameter.

Sample statistics estimate the value of the population parameter. For example, the mean height of a subset of women can estimate the parameter. The estimate almost never equals the parameter exactly. Consequently, there is always a margin of error around sample estimates.
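A small simulation can make the parameter vs statistic distinction concrete. In the sketch below, the "population" is simulated, so unlike in real research we can compute its parameters directly. The heights (mean 163 cm, standard deviation 7 cm) and the sizes are assumptions chosen purely for illustration.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population: 100,000 simulated heights in cm (made-up values).
population = [rng.gauss(163, 7) for _ in range(100_000)]

# Parameters: computed from every member of the population.
pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)  # population formula (divide by N)

# Statistics: computed from a simple random sample of 500 members.
sample = rng.sample(population, 500)
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)  # sample formula (divide by n - 1)

print(f"population: mean={pop_mean:.2f}, sd={pop_sd:.2f}")
print(f"sample:     mean={sample_mean:.2f}, sd={sample_sd:.2f}")
```

The sample values should land close to the population values without matching them exactly, which is the pattern described above.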

Sampling error is the difference between the correct population value and the sample estimate. Unfortunately, analysts never know the amount of sampling error precisely because they don’t know the parameter’s value. But statistical methods can estimate it. It might be shocking to learn, but Sample Statistics are Always Wrong (to Some Extent)!
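Here is a rough sketch of that idea. Because the population below is simulated, we can do something that is impossible in practice: compute the actual sampling error for many repeated samples. We can then compare the spread of those errors to the standard error calculated from a single sample. All values (mean 100, standard deviation 15, samples of 100) are assumptions for illustration.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population with a known mean, so sampling error can be computed.
population = [rng.gauss(100, 15) for _ in range(50_000)]
pop_mean = statistics.fmean(population)

n = 100
errors = []
for _ in range(1000):
    sample = rng.sample(population, n)
    # Sampling error: sample estimate minus the true population value.
    errors.append(statistics.fmean(sample) - pop_mean)

# A real study has only one sample. The standard error of the mean estimates
# the typical size of the sampling error from that one sample alone.
one_sample = rng.sample(population, n)
standard_error = statistics.stdev(one_sample) / n ** 0.5

print(f"spread of actual sampling errors: {statistics.stdev(errors):.2f}")
print(f"standard error from one sample:   {standard_error:.2f}")
```

The two printed numbers should be similar in size, which shows how a single sample can estimate the amount of sampling error even though the parameter itself stays unknown.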

Confidence intervals and margins of error are two methods for estimating sampling error.
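As a minimal sketch, the code below computes a margin of error and a 95% confidence interval for a mean. The sample is simulated rather than real data, and the interval uses the normal approximation. For small samples, analysts typically use a t critical value instead.

```python
import random
import statistics
from statistics import NormalDist

rng = random.Random(1)

# Hypothetical sample of 200 measurements (simulated, not real data).
sample = [rng.gauss(50, 10) for _ in range(200)]

n = len(sample)
mean = statistics.fmean(sample)
standard_error = statistics.stdev(sample) / n ** 0.5

# Critical value for 95% confidence under the normal approximation (about 1.96).
z = NormalDist().inv_cdf(0.975)

margin = z * standard_error            # margin of error
ci = (mean - margin, mean + margin)    # confidence interval

print(f"mean = {mean:.2f}, margin of error = {margin:.2f}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

The margin of error is simply half the width of the confidence interval, so the two measures carry the same information about precision.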

Learn more about Sampling Error: Definition, Sources & Minimizing and Sample Mean vs. Population Mean.

## Drawing Samples from Populations

Statisticians refer to the various processes of drawing subsets from populations as sampling methods. Ideally, these techniques produce representative samples with characteristics that look like the entire set of subjects. Representative samples are best for researchers who want to generalize their results to the population.

The various methods each have a set of pros and cons. Generally, the more expensive, complex procedures are better for obtaining representative samples. The less costly approaches tend to produce bias, making them less generalizable. Learn more about Sampling Methods in Research and Representative Samples.

In short, a tradeoff usually exists between representativeness and cost.

Probability sampling methods are better for representativeness. Common types include simple random, stratified, cluster, and systematic sampling.

Non-probability procedures are often more convenient and cheaper but tend to produce biased and non-representative results, limiting generalizability. Common types include convenience, quota, purposive, and snowball sampling.

Learn more about Populations and Parameters in Inferential Statistics.
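To illustrate how a few sampling methods differ mechanically, here is a short sketch using a made-up sampling frame of 1,000 student IDs. It shows three standard probability methods (simple random, systematic, and stratified) and one non-probability method (convenience). The frame, the strata, and the sample sizes are all hypothetical.

```python
import random

rng = random.Random(7)

# Hypothetical sampling frame: 1,000 student IDs, listed in blocks by class year.
years = ["freshman", "sophomore", "junior", "senior"]
frame = [(i, years[i // 250]) for i in range(1000)]

# Simple random sampling: every member has the same chance of selection.
simple = rng.sample(frame, 100)

# Systematic sampling: pick a random start, then take every k-th member.
k = len(frame) // 100
start = rng.randrange(k)
systematic = frame[start::k]

# Stratified sampling: draw separately within each stratum (25 per class year).
stratified = []
for year in years:
    stratum = [member for member in frame if member[1] == year]
    stratified.extend(rng.sample(stratum, 25))

# Convenience sampling (non-probability): take whoever is easiest to reach,
# represented here by the first 100 entries in the frame.
convenience = frame[:100]

for name, chosen in [("simple", simple), ("systematic", systematic),
                     ("stratified", stratified), ("convenience", convenience)]:
    counts = {year: sum(1 for m in chosen if m[1] == year) for year in years}
    print(name, counts)
```

In this setup, the convenience sample contains only freshmen because they happen to be listed first. That is the kind of bias that limits generalizability.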

John says

Hi Jim,

I have a question about where you draw the line between calling a set of data a sample or calling it a population. If I have a continuous stream of data, we’ll say it’s the number of valve orders that were rescheduled, by order line, by month, over the past 5 years, but are only concerned about the previous twelve months, do those twelve months constitute a population themselves or are they a sample of the entire 5 years? This 12 month set comprises thousands of individual records and all analyses would be conducted within this one year period and no other.

Thank you,

Michael says

Hi Jim,

Thank you for this highly informative post! My question: I’d like to determine the minimal sample size needed in genome wide association studies (GWAS) to discover DNA variants/mutations that are present in a population at frequencies of 1%, 0.1%, or 0.01% with confidence intervals of 95%, 99%, or 99.9%. In addition to the actual values, I’d be grateful for the equation to run these calculations in other scenarios.

Thank you very much in advance for your help!

Jim Frost says

Hi Michael,

I’m glad you found the previous information helpful and you have a very interesting question!

For full transparency, I have absolutely no experience analyzing data in the genetic realm. However, I did a little checking around. Please take it with a grain of salt, but the following is what I was able to gather.

Unfortunately, it seems like there’s no straightforward, universally accepted formula to calculate sample size for GWAS.

Sample size calculations for GWAS depend on several factors, including the underlying genetic architecture of the trait in question, the minor allele frequency (MAF) of the SNP, the desired statistical power, and the multiple testing burden. Due to the complexity and diversity of these factors, the calculations are typically carried out using specialized software like Quanto or G*Power, or via simulation-based approaches.

Please note that in practice, because of multiple testing in GWAS, the effective significance level is often much more stringent (commonly set at 5×10^-8) to control for false positives.
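As a rough illustration of where a threshold of that size can come from, a Bonferroni-style correction divides the overall significance level by the number of tests. The figure of one million tests below is an assumption for illustration only and does not come from the references that follow.

```python
# Bonferroni-style correction: per-test threshold = overall alpha / number of tests.
alpha = 0.05
n_tests = 1_000_000  # assumed number of independent tests (illustrative only)

threshold = alpha / n_tests
print(f"per-test significance threshold: {threshold:.0e}")
```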

Generally, the principles for calculating sample size for GWAS can be found in the book “Design and Analysis of Genetic Studies” (Springer 2013) by S. Zeggini and A. Morris.

Additionally, here are a few references that might help further your understanding:

1. Purcell S, Cherny SS, Sham PC. Genetic Power Calculator: design of linkage and association genetic mapping studies of complex traits. Bioinformatics. 2003;19(1):149-150. doi:10.1093/bioinformatics/19.1.149

2. Gauderman WJ, Morrison JM. QUANTO 1.1: A computer program for power and sample size calculations for genetic-epidemiology studies, 2006. http://biostats.usc.edu/Quanto.html

I recommend consulting these resources, and possibly discussing with a biostatistician or genetic epidemiologist if you’re planning a GWAS. This will help ensure your study is adequately powered.