Having independent and identically distributed (IID) data is a common assumption for statistical procedures and hypothesis tests. But what does that mouthful of words actually mean? That’s the topic of this post! And, I’ll provide helpful tips for determining whether your data are IID.
Let’s break the components down one-by-one.
We talk about independent and identically distributed variables in the context of samples. Samples are drawn from a population sequentially. And, IID relates to the values of a characteristic for the objects that you are sequentially sampling.
The “values of a characteristic” part is easy. That’s just the variable you are measuring. For example, imagine you measure the IQ of each person in your study. Let’s examine the properties of independence and identical distribution in depth!
In the context of sampling, events are independent when observing the current item doesn’t influence or provide insight about the value of the next item you measure, or any other items in the sample. There is no connection between the observations.
The classic example of independent events is flipping a coin. As you flip the coin, one result does not influence or predict the next outcome at all. Even if you get five heads in a row, the next coin flip still has a 50 percent chance of being heads.
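To see this in action, here is a minimal simulation sketch. It assumes a fair coin modeled with Python’s pseudorandom generator; the function name, seed, and number of flips are illustrative choices, not part of any particular study. It collects every flip that immediately follows a run of at least five heads and reports the share of heads among those flips.

```python
import random

def heads_rate_after_streak(n_flips=200_000, streak=5, seed=0):
    """Estimate P(heads) on flips that immediately follow `streak` or more heads in a row."""
    rng = random.Random(seed)
    run = 0          # length of the current run of heads before this flip
    followers = []   # outcomes observed right after a qualifying run
    for _ in range(n_flips):
        is_heads = rng.random() < 0.5   # assumed fair coin
        if run >= streak:
            followers.append(is_heads)
        run = run + 1 if is_heads else 0
    return sum(followers) / len(followers), len(followers)

rate, count = heads_rate_after_streak()
print(f"{count} flips followed a run of 5+ heads; share of heads among them: {rate:.3f}")
```

If the flips are independent, the reported share should sit close to 0.5, no matter how long the preceding streak was.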
You can apply the same thinking to other characteristics. For example, in our IQ study, if we measure an individual’s IQ, it shouldn’t provide any information about the next subject we assess. If we’re selecting subjects randomly, that should be true. However, if we’re not using random selection, it might not be accurate.
Imagine we assess one person’s IQ, and then measure their sibling because it is convenient. A correlation between their scores is likely. By measuring the first person, we gain some insight into the second person. Hence, they are not independent observations.
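The sibling scenario can be sketched with a small simulation. Everything numeric here is an assumption made for illustration: IQ is modeled as normal with mean 100 and standard deviation 15, and siblings are assumed to share half of the variance through a common family component. The point is only to contrast a sibling pair with a pair of randomly selected strangers.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)
n_pairs = 5000
shared_sd = 15 * math.sqrt(0.5)  # assumed: half the variance comes from the family
own_sd = 15 * math.sqrt(0.5)     # the other half is individual

first, sibling, stranger = [], [], []
for _ in range(n_pairs):
    family = rng.gauss(0, shared_sd)
    first.append(100 + family + rng.gauss(0, own_sd))
    sibling.append(100 + family + rng.gauss(0, own_sd))  # shares the family effect
    stranger.append(rng.gauss(100, 15))                  # drawn independently

r_sibling = pearson(first, sibling)
r_stranger = pearson(first, stranger)
print(f"first person vs. sibling:  r = {r_sibling:.2f}")
print(f"first person vs. stranger: r = {r_stranger:.2f}")
```

Under these assumptions, the sibling correlation lands near 0.5 while the stranger correlation hovers near zero, which is what independence looks like.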
Independence relates to how you define your population and the process by which you obtain your sample. It pretty much boils down to random sampling and not using a convenience sample. The best practice is to define your population and then draw a random sample from that population.
Most hypothesis tests assume that observations are independent. Violating that assumption can cause the results to be untrustworthy. However, a few tests work with dependent samples, such as paired t-tests.
Related post: Independent and Dependent Samples
Identically distributed relates to the probability distribution that describes the characteristic you are measuring. Specifically, one probability distribution should adequately model all values you observe in a sample. Consequently, a dataset should not contain trends because they indicate that one probability distribution does not describe all the data.
Probability distributions describe both discrete and continuous variables. Let’s look at what this entails for both types.
Discrete events have a few specific outcomes. For each event, the probabilities of all possible outcomes sum to one. Flipping a coin is the traditional example, but I’m going to use rolling a six on a die. I’ve found that the equal probability aspect of heads and tails gets mixed into the definition too often. Identically distributed does not require equal probabilities.
Analysts model rolling a six versus not rolling a six using the binomial distribution because the outcomes are binary (6 or not 6). The probability of rolling a six is 1/6, or about 16.7%. This probability should not change over your data collection run. If it stays consistent while you’re collecting data, the values are identically distributed.
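One rough way to check that the probability stays put is to split the rolls into consecutive blocks, in collection order, and compare the proportion of sixes in each block. The sketch below assumes a fair simulated die; the block size and helper name are illustrative choices.

```python
import random

def six_rate_by_block(rolls, block_size):
    """Proportion of sixes in each consecutive, non-overlapping block of rolls."""
    return [
        sum(r == 6 for r in rolls[i:i + block_size]) / block_size
        for i in range(0, len(rolls) - block_size + 1, block_size)
    ]

rng = random.Random(2)
rolls = [rng.randint(1, 6) for _ in range(60_000)]  # assumed fair die
rates = six_rate_by_block(rolls, 10_000)

for block, rate in enumerate(rates, start=1):
    print(f"block {block}: proportion of sixes = {rate:.3f} (expected about {1/6:.3f})")
```

For identically distributed rolls, every block’s proportion should stay close to 1/6, with only random scatter between blocks.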
Continuous events have an infinite number of possible outcomes. A probability distribution for continuous data defines the probabilities of these outcomes. The most common is the normal distribution. For continuous distributions, you often need to ensure there are no trends in multiple parameters. For example, the normal distribution has two parameters, the mean and standard deviation. There should not be a trend in either measure as you collect your sample.
For example, with our fictitious IQ study, IQ scores follow a normal distribution. As we collect our scores, we should ensure that our scores don’t tend to increase or decrease because that affects the mean. Additionally, we need to be sure that the spread of the IQ scores remains constant.
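As a quick and informal check of both parameters, you can compare the mean and standard deviation of the first and second halves of the data, kept in collection order. The sketch below uses simulated scores; the “changing” sample is a hypothetical in which the spread doubles partway through collection.

```python
import random
import statistics

def half_summaries(values):
    """Return (mean, standard deviation) for the first and second half of the data."""
    mid = len(values) // 2
    return [(statistics.mean(h), statistics.stdev(h)) for h in (values[:mid], values[mid:])]

rng = random.Random(3)
# Assumed IQ model: normal with mean 100 and standard deviation 15.
stable = [rng.gauss(100, 15) for _ in range(2000)]
# Hypothetical non-IID sample: the spread grows from 15 to 30 halfway through.
changing = ([rng.gauss(100, 15) for _ in range(1000)]
            + [rng.gauss(100, 30) for _ in range(1000)])

stable_halves = half_summaries(stable)
changing_halves = half_summaries(changing)
for label, halves in (("stable", stable_halves), ("changing", changing_halves)):
    for part, (mean, sd) in zip(("first half", "second half"), halves):
        print(f"{label:8s} {part:11s} mean = {mean:6.1f}, sd = {sd:5.1f}")
```

Splitting in half is a blunt tool. It can miss patterns that cancel out within each half, so treat it as a first look rather than a formal test.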
Related posts: Understanding Probability Distributions, Normal Distribution, and Distributions for Binary Data
Why Are Identically Distributed Data Important?
For both discrete and continuous data, there should be no trends, so that one probability distribution can describe your entire sample.
Identically distributed data are vital for most hypothesis tests because they indicate you are assessing a stable phenomenon. For example, if you measure the strength of a product and the mean strength increases as you collect more samples, it’s hard to draw conclusions. What is the mean strength of the product? It depends on when you measure it! Is it stronger than a particular value? That also depends on when you measure it!
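To illustrate the strength example, the sketch below fits a least-squares line to the measurements against their collection order and reports the slope. The strength values, units, and drift rate are all made up for illustration; a slope near zero is consistent with a stable mean, while a clearly nonzero slope suggests the mean depends on when you measure.

```python
import random

def slope_vs_order(values):
    """Least-squares slope of the values against their collection order (0, 1, 2, ...)."""
    n = len(values)
    x_bar = (n - 1) / 2
    y_bar = sum(values) / n
    sxy = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(values))
    sxx = sum((i - x_bar) ** 2 for i in range(n))
    return sxy / sxx

rng = random.Random(4)
# Hypothetical strength measurements with a constant mean of 50.
stable = [rng.gauss(50, 2) for _ in range(500)]
# Hypothetical drifting process: the mean rises by 0.01 units per item.
drifting = [rng.gauss(50 + 0.01 * i, 2) for i in range(500)]

stable_slope = slope_vs_order(stable)
drifting_slope = slope_vs_order(drifting)
print(f"stable process slope:   {stable_slope:+.4f} per item")
print(f"drifting process slope: {drifting_slope:+.4f} per item")
```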
Of course, if your analysis compares groups, the groups can have different means, proportions, or other properties, but each group must be identically distributed.
Finally, you can assess measures that have trends over time, but you’ll need to use an analysis designed for it, such as time series analysis.
Assessing IID in Your Dataset
How do you know whether your data are independent and identically distributed? Here are some tips!
For independence, consider how you collected your data. Did you use random sampling, or did you obtain a sample of convenience? If you used readily available subjects, do you believe sequential observations are related or influence each other, such as siblings’ IQ scores?
Understanding your data collection process and the subject area can help you determine whether your observations are independent. Random sampling is a great way to help ensure independent observations!
For the identically distributed portion, determine whether there are any trends in the data. Graphs can help you with this aspect. Graph your data in the order that you measured each item and look for patterns.
Fortunately, there’s a special type of graph that quality analysts use to assess this aspect—control charts! Unfortunately, analysts outside that field don’t use these charts often enough.
Control charts are specifically designed to track characteristics over time, including proportions, means, and variability. These charts indicate whether your sample has problematic trends or patterns suggesting your data don’t follow a single probability distribution. For more information, read my post about using control charts with hypothesis tests.
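As a rough illustration of the idea, here is a minimal sketch of one common type, the individuals chart, which sets limits at the mean plus or minus 2.66 times the average moving range. The data are simulated and the function names are my own for this example; statistical software will typically draw these charts for you and apply additional pattern rules that this sketch omits.

```python
import random

def individuals_chart_limits(baseline):
    """Return (lower, center, upper) limits for an individuals control chart."""
    center = sum(baseline) / len(baseline)
    # Moving range: absolute difference between consecutive observations.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    half_width = 2.66 * mr_bar  # standard individuals-chart constant (3 / 1.128)
    return center - half_width, center, center + half_width

def points_outside_limits(values, lower, upper):
    """Indices of observations that fall outside the control limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]

rng = random.Random(5)
baseline = [rng.gauss(50, 2) for _ in range(100)]  # hypothetical stable period
shifted = [rng.gauss(62, 2) for _ in range(20)]    # hypothetical later shift in the mean

lower, center, upper = individuals_chart_limits(baseline)
flagged = points_outside_limits(baseline + shifted, lower, upper)
print(f"limits: {lower:.1f} to {upper:.1f} (center {center:.1f})")
print(f"observations outside the limits: {flagged}")
```

Points from the shifted period fall outside the limits computed from the stable period, which signals that a single distribution no longer describes all of the data.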
Hopefully, you understand both the independent and identically distributed portions of the IID assumption. Also, keep in mind that there are statistical tests and procedures specifically designed for data that do not satisfy these assumptions!
Hi Jim, many thanks for your kind reply!
Reflecting on what you wrote I thought:
– when we have n data without a time label (no time series), we could apply a Ljung-Box test on them, to verify the presence of correlations (that is, of a certain form of dependence) between them; but we should do it by considering every possible sequence of data from the starting set. By doing so we could also find a form of dependence on one of these sequences, but this may not occur on all sequences, and therefore we could say that we have at most a suspicion that there is some form of dependence between the n data (and if I’m not mistaken that’s what you show at the post “Use Control Charts with Hypothesis Tests”);
– when we work with n data with time label (time series), by construction there is only one sequence of the n data, so in that case we can apply with more confidence a test like the Ljung-Box, and have a unique result (and I think this is precisely in line with your last sentence in the comment above).
Did I understand correctly?:-)
It’s a very nice post Jim!
Sorry, I have one question about the following sentence:
You wrote, “There’s no statistical test that can tell you whether your observations are independent.”
But, what about a test like the Ljung-Box for the autocorrelations (a form of dependence) in time series?
Thank you so much,
Jim Frost says
You raise a good point and I think I wrote that sentence too hastily. I’ll modify it.
Here is my train of thought on that, and autocorrelation makes a good basis for comparison.
Autocorrelation (as you no doubt know) is when sequential residuals are correlated. They should be independent and have no correlation. Autocorrelation can be a bit more insidious and have both known and unknown causes. In other words, there are some cases where you suspect you might have it in your model and you use the Ljung-Box test to determine whether your model adequately captures all time-related effects. In other cases, you might not be aware of time-order effects because they’re not intentionally part of your model. For example, the measurement device might drift over time. Or, events outside an experiment might impact subjects within the experiment. I’ve seen both of those in practice. The test will detect those as well.
That is sort of similar to the difference between independent and dependent samples. You’d expect paired observations in dependent samples to have higher correlations. Conversely, independent samples should have no correlation between pairs of observations. However, when you’re talking about dependent samples rather than autocorrelation among residuals, you don’t typically use a test to determine whether your samples are dependent. Instead, you use your knowledge of how you obtained the samples. You’ll know because the same people are in both groups, or because you used siblings, etc.
For residuals, it makes sense to use a test for the reasons indicated, whereas for samples you’ll use your knowledge of the sampling process instead of a test.
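For readers curious about what the Ljung-Box test computes, here is a minimal sketch using only the Python standard library. It applies the statistic Q = n(n + 2) Σ r_k² / (n − k), summed over lags k = 1 to h, to a raw simulated series, and it gets the p-value from a closed-form chi-square tail that only works for an even number of lags. The series, the lag count, and the AR(1) coefficient of 0.6 are assumptions for illustration. In practice you would normally use a statistics package, and when testing model residuals the degrees of freedom are usually reduced by the number of fitted parameters.

```python
import math
import random

def autocorrelation(x, lag):
    """Sample autocorrelation of the series x at the given lag."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    numer = sum((x[t] - mean) * (x[t - lag] - mean) for t in range(lag, n))
    return numer / denom

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    if df % 2 != 0:
        raise ValueError("this closed form requires an even number of degrees of freedom")
    half = x / 2
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total

def ljung_box(x, lags=10):
    """Return the Ljung-Box Q statistic and its p-value for the first `lags` lags."""
    n = len(x)
    q = n * (n + 2) * sum(autocorrelation(x, k) ** 2 / (n - k) for k in range(1, lags + 1))
    return q, chi2_sf_even_df(q, lags)

rng = random.Random(6)
white_noise = [rng.gauss(0, 1) for _ in range(500)]  # independent observations
ar1 = [rng.gauss(0, 1)]
for _ in range(499):
    ar1.append(0.6 * ar1[-1] + rng.gauss(0, 1))      # each value depends on the previous one

q_white, p_white = ljung_box(white_noise)
q_ar, p_ar = ljung_box(ar1)
print(f"white noise: Q = {q_white:.1f}, p = {p_white:.3f}")
print(f"AR(1):       Q = {q_ar:.1f}, p = {p_ar:.3g}")
```

A small p-value indicates autocorrelation in the order the data were recorded, which is why the test suits time-ordered data and residuals rather than samples with no natural order.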
Daniela Cajiao says
Thanks Jim for this great information!