A measure of variability is a summary statistic that represents the amount of dispersion in a dataset. How spread out are the values? While a measure of central tendency describes the typical value, measures of variability define how far away the data points tend to fall from the center. We talk about variability in the context of a distribution of values. A low dispersion indicates that the data points tend to be clustered tightly around the center. High dispersion signifies that they tend to fall further away.
In statistics, variability, dispersion, and spread are synonyms that denote the width of the distribution. Just as there are multiple measures of central tendency, there are several measures of variability. In this blog post, you’ll learn why understanding the variability of your data is critical. Then, I explore the most common measures of variability—the range, interquartile range, variance, and standard deviation. I’ll help you determine which one is best for your data.
The two plots below show the difference graphically for distributions with the same mean but more and less dispersion. The panel on the left shows a distribution that is tightly clustered around the average, while the distribution in the right panel is more spread out.
Related post: Measures of Central Tendency: Mean, Median, and Mode
Why Understanding Variability is Important
Let’s take a step back and first get a handle on why understanding variability is so essential. Analysts frequently use the mean to summarize the center of a population or a process. While the mean is relevant, people often react to variability even more. When a distribution has lower variability, the values in a dataset are more consistent. However, when the variability is higher, the data points are more dissimilar and extreme values become more likely. Consequently, understanding variability helps you grasp the likelihood of unusual events.
In some situations, extreme values can cause problems! Have you seen a weather report where the meteorologist shows extreme heat and drought in one area and flooding in another? Averaging those conditions together into one pleasant-sounding number wouldn't help anyone living through either extreme! Frequently, we feel discomfort at the extremes more than at the mean. Understanding the variability around the mean provides critical information.
Variability is everywhere. Your commute time to work varies a bit every day. When you order a favorite dish at a restaurant repeatedly, it isn’t exactly the same each time. The parts that come off an assembly line might appear to be identical, but they have subtly different lengths and widths.
These are all examples of real-life variability. Some degree of variation is unavoidable. However, too much inconsistency can cause problems. If your morning commute takes much longer than the mean travel time, you will be late for work. If the restaurant dish is much different from how it usually is, you might not like it at all. And, if a manufactured part is too much out of spec, it won't function as intended.
Some variation is inevitable, but problems occur at the extremes. Distributions with greater variability produce observations with unusually large and small values more frequently than distributions with less variability.
Variability can also help you assess the sample’s heterogeneity.
Example of Different Amounts of Variability
Let’s take a look at two hypothetical pizza restaurants. They both advertise a mean delivery time of 20 minutes. When we’re ravenous, they both sound equally good! However, this equivalence can be deceptive! To determine the restaurant that you should order from when you’re hungry, we need to analyze their variability.
Suppose we study their delivery times, calculate the variability for each place, and determine that their variabilities are different. We’ve computed the standard deviations for both restaurants—which is a measure that we’ll come back to later in this post. How significant is this difference in getting pizza to their customers promptly?
The graphs below display the distribution of delivery times and provide the answer. The restaurant with more variable delivery times has the broader distribution curve. I’ve used the same scales in both graphs so you can visually compare the two distributions.
In these graphs, we consider a 30-minute wait or longer to be unacceptable. We’re hungry after all! The shaded area in each chart represents the proportion of delivery times that surpass 30 minutes. Nearly 16% of the deliveries for the high variability restaurant exceed 30 minutes. On the other hand, only 2% of the deliveries take too long with the low variability restaurant. They both have an average delivery time of 20 minutes, but I know where I’d place my order when I’m hungry!
As this example shows, the central tendency doesn’t provide complete information. We also need to understand the variability around the middle of the distribution to get the full picture. Now, let’s move on to the different ways of measuring variability!
Range
Let's start with the range because it is the most straightforward measure of variability to calculate and the simplest to understand. The range of a dataset is the difference between the largest and smallest values in that dataset. For example, in the two datasets below, dataset 1 has a range of 38 – 20 = 18 while dataset 2 has a range of 52 – 11 = 41. Dataset 2 has a broader range and, hence, more variability than dataset 1.
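If you want to check this in code, here's a minimal Python sketch. The two lists are hypothetical datasets that share only the minimum and maximum values quoted above.

```python
# Hypothetical datasets that share only the minimum and maximum values quoted above.
dataset_1 = [20, 24, 27, 30, 33, 35, 38]
dataset_2 = [11, 18, 25, 31, 39, 46, 52]

def data_range(values):
    """The range is the largest value minus the smallest value."""
    return max(values) - min(values)

print(data_range(dataset_1))  # 38 - 20 = 18
print(data_range(dataset_2))  # 52 - 11 = 41
```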
While the range is easy to understand, it is based on only the two most extreme values in the dataset, which makes it very susceptible to outliers. If one of those numbers is unusually high or low, it affects the entire range even if it is atypical.
Additionally, the size of the dataset affects the range. In general, you are less likely to observe extreme values. However, as you increase the sample size, you have more opportunities to obtain these extreme values. Consequently, when you draw random samples from the same population, the range tends to increase as the sample size increases. For that reason, use the range to compare variability only when the sample sizes are similar.
For more details, read my post, The Range in Statistics.
Learn how you can use the range to estimate the standard deviation using the range rule of thumb.
The Interquartile Range (IQR) . . . and other Percentiles
The interquartile range is the middle half of the data. To visualize it, think about the median value that splits the dataset in half. Similarly, you can divide the data into quarters. Statisticians refer to these quarters as quartiles and denote them from low to high as Q1, Q2, Q3, and Q4. The lowest quartile (Q1) contains the quarter of the dataset with the smallest values. The upper quartile (Q4) contains the quarter of the dataset with the highest values. The interquartile range is the middle half of the data that lies between the upper and lower quartiles. In other words, the interquartile range includes the 50% of data points that fall within Q2 and Q3. The IQR is the red area in the graph below.
The interquartile range is a robust measure of variability in a similar manner that the median is a robust measure of central tendency. Neither measure is influenced dramatically by outliers because they don’t depend on every value. Additionally, the interquartile range is excellent for skewed distributions, just like the median. As you’ll learn, when you have a normal distribution, the standard deviation tells you the percentage of observations that fall specific distances from the mean. However, this doesn’t work for skewed distributions, and the IQR is a great alternative.
I've divided the dataset below into quartiles. The interquartile range (IQR) extends from the low end of Q2 to the upper limit of Q3. For this dataset, the interquartile range is 39 – 20 = 19.
Related posts: Quartile: Definition, Finding, and Using, Interquartile Range: Definition and Uses, and What are Robust Statistics?
Using other percentiles
When you have a skewed distribution, I find that reporting the median with the interquartile range is a particularly good combination. The interquartile range is equivalent to the region between the 75th and 25th percentile (75 – 25 = 50% of the data). You can also use other percentiles to determine the spread of different proportions. For example, the range between the 97.5th percentile and the 2.5th percentile covers 95% of the data. The broader these ranges, the higher the variability in your dataset.
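As a quick sketch (using made-up data rather than anything from this post), NumPy's percentile function computes both the interquartile range and the wider percentile ranges described above.

```python
import numpy as np

# Hypothetical right-skewed data, for illustration only.
rng = np.random.default_rng(1)
data = rng.exponential(scale=10, size=1000)

q1, q3 = np.percentile(data, [25, 75])
print(q3 - q1)          # interquartile range: spread of the middle 50% of the data

p2_5, p97_5 = np.percentile(data, [2.5, 97.5])
print(p97_5 - p2_5)     # wider range covering 95% of the data
```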
Related post: Percentiles: Interpretations and Calculations
Variance
Variance is the average squared difference of the values from the mean. Unlike the previous measures of variability, the variance includes all values in the calculation by comparing each value to the mean. To calculate this statistic, you calculate a set of squared differences between the data points and the mean, sum them, and then divide by the number of observations. Hence, it’s the average squared difference.
There are two formulas for the variance depending on whether you are calculating the variance for an entire population or using a sample to estimate the population variance. The equations are below, and then I work through an example in a table to help bring it to life.
Population variance
The formula for the variance of an entire population is the following:
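$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$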
In the equation, σ² is the population parameter for the variance, μ is the parameter for the population mean, and N is the number of data points, which should include the entire population.
Statisticians refer to the numerator portion of the variance formula as the sum of squares.
Sample variance
To use a sample to estimate the variance for a population, use the following formula. Using the previous equation with sample data tends to underestimate the variability. Because it’s usually impossible to measure an entire population, statisticians use the equation for sample variances much more frequently.
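$$s^2 = \frac{\sum (X - M)^2}{N - 1}$$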
In the equation, s² is the sample variance, and M is the sample mean. N-1 in the denominator corrects for the tendency of a sample to underestimate the population variance.
Example of calculating the sample variance
I’ll work through an example using the formula for a sample on a dataset with 17 observations in the table below. The numbers in parentheses represent the corresponding table column number. The procedure involves taking each observation (1), subtracting the sample mean (2) to calculate the difference (3), and squaring that difference (4). Then, I sum the squared differences at the bottom of the table. Finally, I take the sum and divide by 16 because I’m using the sample variance equation with 17 observations (17 – 1 = 16). The variance for this dataset is 201.
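To make the procedure concrete in code, here's a minimal Python sketch. The observations below are made up (they are not the 17 values from the table), so they won't reproduce the 201 figure; they simply walk through the same steps.

```python
# Made-up observations, not the 17 values from the table above.
data = [12, 15, 21, 25, 30, 34, 41]

mean = sum(data) / len(data)                            # sample mean
squared_diffs = [(x - mean) ** 2 for x in data]         # difference from the mean, then squared
sample_variance = sum(squared_diffs) / (len(data) - 1)  # divide the sum of squares by n - 1
print(round(sample_variance, 1))

# The statistics module performs the same sample variance calculation.
import statistics
print(round(statistics.variance(data), 1))
```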
Because the calculations use the squared differences, the variance is in squared units rather than the original units of the data. While higher values of the variance indicate greater variability, there is no intuitive interpretation for specific values. Despite this limitation, various statistical tests use the variance in their calculations. For an example, read my post about the F-test and ANOVA.
While it is difficult to interpret the variance itself, the standard deviation resolves this problem!
For more details, read my post about the Variance.
Standard Deviation
The standard deviation is the standard or typical difference between each data point and the mean. When the values in a dataset are grouped closer together, you have a smaller standard deviation. On the other hand, when the values are spread out more, the standard deviation is larger because the standard distance is greater.
Conveniently, the standard deviation uses the original units of the data, which makes interpretation easier. Consequently, the standard deviation is the most widely used measure of variability. For example, in the pizza delivery example, a standard deviation of 5 indicates that the typical delivery time is plus or minus 5 minutes from the mean. It’s often reported along with the mean: 20 minutes (s.d. 5).
The standard deviation is just the square root of the variance. Recall that the variance is in squared units. Hence, the square root returns the value to the natural units. The symbol for the standard deviation as a population parameter is σ while s represents it as a sample estimate. To calculate the standard deviation, calculate the variance as shown above, and then take the square root of it. Voila! You have the standard deviation!
In the variance section, we calculated a variance of 201 in the table.
Therefore, the standard deviation for that dataset is 14.177.
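In code, that final step is just a square root of the variance from the earlier example:

```python
import math

variance = 201
standard_deviation = math.sqrt(variance)  # returns the measure to the original units
print(round(standard_deviation, 3))       # 14.177
```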
The standard deviation is similar to the mean absolute deviation. Both use the original data units, and they compare the data values to the mean to assess variability. However, there are differences. To learn more, read my post about the mean absolute deviation (MAD).
People often confuse the standard deviation with the standard error of the mean. Both measures assess variability, but they have extremely different purposes. To learn more, read my post The Standard Error of the Mean.
Related post: Using the Standard Deviation
The Empirical Rule for the Standard Deviation of a Normal Distribution
When you have normally distributed data, or approximately so, the standard deviation becomes particularly valuable. You can use it to determine the proportion of the values that fall within a specified number of standard deviations from the mean. For example, in a normal distribution, 68% of the values will fall within +/- 1 standard deviation from the mean. This property is part of the Empirical Rule. This rule describes the percentage of the data that fall within specific numbers of standard deviations from the mean for bell-shaped curves.
| Mean +/- standard deviations | Percentage of data contained |
|---|---|
| 1 | 68% |
| 2 | 95% |
| 3 | 99.7% |
Let’s take another look at the pizza delivery example where we have a mean delivery time of 20 minutes and a standard deviation of 5 minutes. Using the Empirical Rule, we can use the mean and standard deviation to determine that 68% of the delivery times will fall between 15-25 minutes (20 +/- 5) and 95% will fall between 10-30 minutes (20 +/- 2*5).
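If you'd like to verify figures like these yourself, a short script works. The sketch below uses SciPy and assumes a mean of 20 minutes with a standard deviation of 5 for the low variability restaurant; the standard deviation of 10 for the high variability restaurant is an assumed value for illustration, since the post doesn't state it.

```python
from scipy.stats import norm

mean = 20
low_sd = 5    # stated in the post
high_sd = 10  # assumed value for the high-variability restaurant

# Proportion of deliveries longer than 30 minutes (the shaded areas in the graphs).
print(norm.sf(30, loc=mean, scale=low_sd))   # ~0.023, roughly 2%
print(norm.sf(30, loc=mean, scale=high_sd))  # ~0.159, nearly 16%

# Empirical Rule: proportion within +/- 1 and +/- 2 standard deviations.
print(norm.cdf(25, mean, low_sd) - norm.cdf(15, mean, low_sd))  # ~0.68
print(norm.cdf(30, mean, low_sd) - norm.cdf(10, mean, low_sd))  # ~0.95
```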
Related posts: The Normal Distribution and Empirical Rule
Which is Best—the Range, Interquartile Range, or Standard Deviation?
First off, you probably notice that I didn’t include the variance as one of the options in the heading above. That’s because the variance is in squared units and doesn’t provide an intuitive interpretation. So, I’ve crossed that off the list. Let’s go over the other three measures of variability.
When you are comparing samples that are the same size, consider using the range as the measure of variability. It’s a reasonably intuitive statistic. Just be aware that a single outlier can throw the range off. The range is particularly suitable for small samples when you don’t have enough data to calculate the other measures reliably, and the likelihood of obtaining an outlier is also lower.
When you have a skewed distribution, the median is a better measure of central tendency, and it makes sense to pair it with either the interquartile range or other percentile-based ranges because all of these statistics divide the dataset into groups with specific proportions.
For normally distributed data, or even data that aren't terribly skewed, using the tried and true combination of reporting the mean and the standard deviation is the way to go. This combination is by far the most common. You can still supplement this approach with percentile-based ranges as needed.
Except for variances, the statistics in this post are absolute measures of variability because they use the original variable’s measurement units. Read my post about the coefficient of variation to learn about a relative measure of variability that can be advantageous in some circumstances.
Analysts frequently use measures of variability to describe their datasets. Learn how to Analyze Descriptive Statistics in Excel.
If you’re learning about statistics and like the approach I use in my blog, check out my Introduction to Statistics book! It’s available at Amazon and other retailers.
Well narrated calculations and explanations.
Big up
Thanks so much!
Jim, how can I buy your e-book, which costs USD 9?
Hi Lewis, just go to My Webstore, which is where I sell my ebooks. Scroll down past the Amazon links to find my ebooks. You'll see my Introduction to Statistics ebook is available for USD 9.
Thank you for taking the time to write these amazing posts.
I’ve got time-series data on GDP as a measure of output for one country, and I’d like to see how volatile it is. My question is: how can I measure output volatility?
Hello, sir, I am happy to come here to find the answer to my question. Sir, I used NMDS, PERMANOVA, and PERMDISP to examine my data, and the results show that there is a location effect rather than a dispersion effect. Now I don't know what the impact of a location effect is on my data. Please help me understand the location effect in a dataset and its importance. Thank you very much.
On the measures of central tendency page, you told us that the mode is the only measure for finding the center of categorical data. I want to ask you about categorical data in measures of dispersion. Can we calculate measures of dispersion for categorical data?
Hi Daffa,
You’re correct, the mode is the only measure of central tendency for categorical data.
Variability for categorical variables is rarely used, but a form of it does exist. It's fairly different from dispersion for continuous data. There is a coefficient of unalikeability, which measures how similar or dissimilar the outcome values are for categorical data. Unalikeability assesses how often observations differ from one another. For example, if all the outcomes are in one category, they are very similar (identical in the extreme case). However, if they're spread out among the other categories more evenly, they become dissimilar. The coefficient of unalikeability measures that aspect.
I have never used it myself. But, it does exist! Perhaps I’ll write a post about it at some point. However, it’s not a commonly used measure.
You should note that you can’t use the standard measures of variability for categorical data. Typically, you won’t report variability for categorical data. You can report things like the mode and percentages for the various categories.
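If you're curious, here's a rough Python sketch of the unalikeability idea, computing the proportion of pairs of observations that fall in different categories. Treat it only as an illustration; the exact normalization varies across references.

```python
from itertools import combinations

# Hypothetical categorical outcomes.
outcomes = ["red", "red", "blue", "green", "red", "blue"]

pairs = list(combinations(outcomes, 2))
unalikeability = sum(a != b for a, b in pairs) / len(pairs)
print(round(unalikeability, 2))  # 0 means all outcomes identical; values near 1 mean very dissimilar
```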
Jim,
What do you mean by weighted average? Because all 3 tests use the same amount of sample data (10 measurements per test point). What I would like to do is combine these 3 standard deviations from each test (with different magnitudes) and determine an average value of the standard deviation that can represent the 3 tests. Is that possible and valid from a statistics point of view?
Thanks!
Hi Ricardo, you didn’t explain those extra details in your original question. I still don’t fully understand what you want to do so I’m unable to answer your question.
Hi Jim,
Continuing with the subject of estimated standard deviation, could be possible to calculate the average of 3 standard deviations calculated from 3 different sample data with different measurement values for each one (e.g. at 10, 50, and 90% of instrument range)? If so, which is the method to do it?
Hi Ricardo,
You’d calculate a standard deviation for each dataset. If for some reason you wanted to calculate a pooled standard deviation for all three, there are several approaches.
You could combine all the data into one larger dataset and calculate the standard deviation for it.
Or, you can calculate the separate standard deviations and then calculate a weighted average. The weights are based on sample sizes such that larger samples have more weight.
The first approach calculates the variability of all data points around the grand mean of the combined dataset. Conversely, the weighted average approach calculates the average variability of the data points in a group around their group's mean rather than the grand mean. So, the method depends on which measure you need, variability around the grand mean or around the group means.
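Here's a minimal sketch of both approaches with made-up samples. The weighting shown is the usual pooled variance weighting by each sample's degrees of freedom (n minus 1), which is one common way to implement the weighted average idea.

```python
import math
import statistics

# Three hypothetical samples (e.g., measurements at different test points).
samples = [
    [9.8, 10.1, 10.0, 9.9, 10.2],
    [50.3, 49.7, 50.1, 50.0, 49.9],
    [90.2, 89.8, 90.1, 89.9, 90.0],
]

# Approach 1: combine everything and measure the spread around the grand mean.
combined = [x for s in samples for x in s]
sd_grand = statistics.stdev(combined)

# Approach 2: pooled SD, weighting each sample's variance by its degrees of freedom.
num = sum((len(s) - 1) * statistics.variance(s) for s in samples)
den = sum(len(s) - 1 for s in samples)
sd_pooled = math.sqrt(num / den)

print(round(sd_grand, 2), round(sd_pooled, 2))
```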
Hi Mr. Frost
Congratulations on your website, it is an excellent resource!
I have read in several guides related to the field of metrology that when a Type A measurement uncertainty evaluation (repeatability test) is carried out, the way to statistically calculate the standard uncertainty (68%) from a series of repeated measurements of a normally distributed population is to use the estimated standard deviation (s). But in other guides, it is specified that for a normally distributed population, the best estimate of the standard uncertainty is the standard error of the mean.
My question is, what is the correct way to determine the estimated standard uncertainty (68%) in a Type A uncertainty evaluation assuming a normal distribution? Or when should one or the other be used?
Thanks in advance!
RV
Hi Ricardo,
I have not focused on metrology, so I’m not claiming to be an expert in that area. However, from my understanding you would use the standard deviation. For type A uncertainty, you are estimating the standard deviation from an observed frequency distribution. Whereas, for type B, it’s from an assumed distribution.
You almost certainly would not use the standard error of the mean. That measure is the standard deviation of the sampling distribution of the mean. It's used to calculate p-values and confidence intervals. The standard error of the mean is not used to estimate the dispersion of observations but rather the dispersion of sample means.
I hope that helps!
When calculating the standard deviation, what is X?
Hi Nimusiima,
X in the equation represents the individual observations/data points of your variable of interest. In the table in the section where I show an example of how to calculate the sample variance, you'll see a column called data point. Each one of those values in that column is an X in the equation. As I explain in that section, you take each individual observation (data point, X), subtract the mean of the variable from it, square that difference, sum all the squared differences, and then divide by the number of observations minus one.
For the standard deviation, which you're asking about, all the above applies (including the part about X), but you just add one step at the end: take the square root of that result.
I hope that helps!
Hi Jim
What are the various methods to measure variability between two data sets with the same variables?
Hi Jim,
I have bought 2 or 3 books now, mostly for my junior team members and for myself as a refresher. What I like most about your approach and the blogs is the simplicity of explaining the concepts with real examples. That is more valuable than reading other dry textbooks. My team is self-learning a lot as well. A book on using Excel and its functions (step by step), with example templates to download and a key, would be an excellent addition.
Hi JT,
Thanks so much for your kind words! And, I’m glad my books have been helpful.
Yes! An Excel book is very high on my priority list! I’m not sure when I’ll be able to release it but it’s a book I want to write!
Hello Sir
Please explain me in simple terms why are we using (N-1) for sample variance in the denominator ?
Thanking you in advance.
Hi Anubhab,
That’s known as degrees of freedom (DF). I’ve written a post about degrees of freedom and why you need to use N-1. That’ll explain it for you.
Thanks for writing!
What is the importance of measures of variability in educational assessment?
Hi. My name is Nana Zahid. I love the way it is explained in very simple English. I'm from Malaysia; your write-up helped me a lot in doing my assignment for my Ed.D. Thank you so much. Love from Malaysia.
Hi Nana,
I’m so happy to hear that this was helpful! Best of luck on your assignment! 🙂
Thank you so much, Mr. Frost! I appreciate your help on both fronts.
Hello,
This is super helpful! I was hoping to cite your work in a paper I’m writing for my statistics course at Colorado State University. Do you happen to have a date of publication available? I can’t seem to find one.
Thank you!
Hi Megan, I'm so glad it was helpful! With electronic resources, the date you accessed the URL is more important because pages can change. Please refer to Purdue's Citing Electronic Sources guidelines for details.
thank you! this is very helpful
Thank you so much, Sir! This has been helpful for me and my friends in understanding this topic of statistics! Thank you again and have a great day 😊
Hello Sir!
I have a question.
How do we compare the two standard deviations when the units of two sets of data are different?
Thank you so much, Sir! I love your book and your explanations. They have helped me indefinitely. Thank you from a high school student ☺️
Hi Abby!
It’s great that you’re taking statistics in High School! I think statistics often doesn’t get enough attention early enough, often not until college!
Unfortunately, when your data use different units, you can’t compare the standard deviations because those too will use different units. Sometimes, you can convert the original units to common units. For example, if you have weights in pounds and kilograms, you can convert the pounds into kilograms and you’ll have the same units. Then calculate the standard deviations while they’re in the same units and compare.
In some cases, you can’t convert the units because they’re inherently different. For example, you might want to compare the variability of weight measurements to strength measurements. In that case, you can use the coefficient of variation. I write about that measure in its own blog post: using the coefficient of variation.
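As a rough sketch of that idea with made-up numbers, the coefficient of variation is just the standard deviation divided by the mean, which removes the units:

```python
import statistics

# Hypothetical measurements in different, non-convertible units.
weights_kg = [70.2, 68.5, 72.1, 69.8, 71.0]
strength_n = [520, 480, 510, 495, 505]

def cv(values):
    """Coefficient of variation: SD relative to the mean (unitless)."""
    return statistics.stdev(values) / statistics.mean(values)

print(f"weights CV: {cv(weights_kg):.3f}")
print(f"strength CV: {cv(strength_n):.3f}")
```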
Thanks for writing with the great question!
Jim
Jim,
I have purchased your three books and I am trying to get an understanding of indices of dispersion. The standard deviation or Interquartile Range in isolation do not mean very much which is why your histograms and box plots are so useful.
Could you give me a description of some of the indices of dispersion such as Karl Pearson’s coefficient of dispersion (ratio of the standard deviation to the mean) or the Quartile Coefficient of Dispersion (Q3-Q1)/(Q3+Q1).
Are these useful and how are they interpreted? Are there other useful indices of dispersion?
Richard
Hi Richard,
Thanks so much for supporting my books. I really appreciate that!
I guess there are several things I’d point out. First, yes, I think graphs and numeric statistics often work best when they’re used together. The graphs can present the data in a way that’s very easy to understand in just a glance. Meanwhile, the numeric values provide objective measurements that don’t depend on scaling issues. So, I’m not surprised that the histograms added some much needed context! Ideally, that should be standard practice when presenting results.
I think the measures of variability provide more information than you might be giving them credit for. I agree that the range is often not helpful. For one thing, sample size affects the range. However, if you regularly collect a standard sample size, the range can be meaningful. Suppose you collect a daily sample of a certain size. The daily sample ranges will be relatively consistent between samples. However, if you see an unusually large range on a given day, it would be a signal to investigate that day's unusual variability. I've seen that approach often used in the quality control context, and it provides a statistic that is easy to understand in that context.
The standard deviation is meaningful because it’s in the units of the variable and represents the standard difference between the observed values and the mean. That would seem to make it a relatively intuitive measure by itself. But then consider the Empirical Rule. If your data follow a normal distribution, you can easily determine where most of the values fall. Take the IQ distribution for example, mean of 100, standard deviation of 15. Based on that alone, we know that 95% of IQs fall between 100 +/- 2 *15 or [70, 130].
I also like the IQR because it doesn’t depend on the distribution being normal. You know that half the values fall within the range and half fall outside. Additionally, you can use other percentiles if that’s more meaningful.
I have not used the coefficient of dispersion or quartile coefficient of dispersion much. They are less useful for understanding your own dataset. In my mind, they have a very specific use–for comparing the variability across datasets when measurements use different units. When data use different units, you need a method for standardizing the different units, and that’s where these other unitless measures of coefficients of dispersion come into play. These measures answer the question, “relative to some measure of central tendency, how large is the variability.” So, these coefficients of dispersion are relative measures rather than an absolute measure, like the ones I discuss in my book (and earlier in this comment).
For example if your dataset has a coefficient of dispersion of 0.1 and another study has 0.2, you know that your study has less variable data. That’s handy if you’re measuring different characteristics. However, if you’re measuring the same characteristic (say IQ again), then just compare the standard deviations because they’re in the same units.
I’m sure there are certain instances where the coefficient of dispersion can be particularly helpful, but I think that’s less common overall than the absolute measures for the reasons I discuss.
Was there some type of information or interpretation you were hoping to gain relating to variability?
This is great explanation, great thanks
Thanks, Jim, for taking the time to write such a helpful article. It's very well explained.
I am hoping to get one of my queries resolved.
If I have a test observation value along with its statistical standard deviation, but the observation value is near zero and, including the StdDev, the value ranges between the negative and positive zones, in such a case how do I report the value with the statistically correct sign?
E.g., I have a value of -5 +/- 7 MPa stress, and repetitions at the same location give values of +2 +/- 3 MPa, -5 +/- 2 MPa, etc. Should I report the sign? And what should be the rule of thumb for reporting? Standard deviation? Equipment repeatability?
Thanks in advance…
Suhail Mulla
Hi Suhail,
I'm not sure exactly what your concern is. I don't see a problem with using standard deviations in your scenario.
Why would the standard deviation likely not be a reliable measure of variability for a distribution of data that includes at least one extreme outlier?
Hi Eugene,
You have to think about the calculations for the variance, which is the foundation for the standard deviation. To calculate the variance, you sum the squared differences between the data points and the mean. The key is the squaring of the distance. So, think of squared values: 1, 4, 9, 16, 25, etc. If you suddenly have a value of 10 and square it, that's 100! The squared value is so large compared to the other values that it skews the results all by itself.
I’ve written a post about outliers and include an example for how a single outlier can affect both the mean and standard deviation greatly.
I hope that helps!
Well done! It was easy to understand and very concise. Enjoyed your article.
Excellent session. Thanks a lot.
CV, interquartile range, or range: which is most affected by an outlier?
I’d imagine it’s the range because a single value can change the entire range. The effect of that one value is not “diluted” by the other values.
Hey Jim, I just found out about your blog. Very sound explanations.
I have a question. I am faced with a client I don't know much about. As I immerse myself in their initial reports, I notice there is a tendency to compare the median to Q3 of lead times. This is reported as a "variance", which, semantically speaking, is incorrect.
My question is: is this comparison relevant? Does it translate to variability? Isn't it worth analyzing the capability of the process, its sigma scale?
Any insights would be welcome!
Thanks a lot
Carlos
In the section where you are discussing interquartile ranges, you say that Q1 has the lowest values and Q3 has the highest. However, should this not be Q4 that has the highest values, as there are 4 quartiles?
Hi, that’s a typo. Thank you so much for catching it! You’re absolutely correct. I’ve updated the text to indicate that Q4 has the highest values.
Dear Mr. Jim, thank you for your answers! Maybe it is a choice. I have read some other references on the CV, but it seems there are few references related to the CV when the data include both negative and positive values. It seems the problem remains to be solved, or maybe I have not found the related references. Whatever the case, thank you for your advice. Could you please recommend several references or books related to the CV when the data are negative? Thank you very much in advance!
Mr. Jim, thank you for the description of the statistics in terms of variability. They are very useful for understanding the indices for people whose majors are not statistics. I have a question. As you have mentioned, the coefficient of variation (CV) can be used when all the data are positive. However, in practice there are some negative data besides positive data, so how do we use the CV? If it is not suitable to use the CV, what are the other alternatives in terms of variability? I am confused by this question. Could you please give me some advice about these problems? I am looking forward to your advice! Thank you very much in advance!
Hello Mingming,
Yes, you can use the coefficient of variation with negative data. If the mean is negative, you’ll have a negative percentage for your CV, which you can interpret as if it was positive.
I hope this helps!
Hi Jim, great content. In what sequence should your articles be read for starters, and which are the more helpful articles for understanding VaR, Value at Risk? Thank you.
Hi Kai,
As it is now, the articles can be read in the order that makes sense to you and click the links to additional articles as needed. However, I do plan to write an ebook that serves as an introduction to statistics. This ebook will present content in a logical order and greatly expand the content. I did that with regression by writing an ebook for that analysis, and that has worked really well. So, stay tuned!
Hi Jim, I too have only praise for your blog.
I have another question regarding variance. Is there a way to compute the variance from the interquartile range? I am trying to perform a meta-analysis in Comprehensive Meta-Analysis (CMA) with medians and interquartile ranges, but as per the CMA developer this is only possible if I can enter the median plus the variance. The data I am trying to meta-analyze are by nature very skewed, which is why the mean +/- SD would be biased.
Thanks a lot for your insight.
Christoph
Hi Christoph,
Thanks for the kind words! I appreciate it!
To calculate the precise variance, you’d need the raw data. If your data had followed the normal distribution, you could use that to estimate the variance. You could find the normal distribution that produces 50% of the values falling within the interquartile range. Then square the standard deviation of that distribution to obtain the variance. I’ve also read that you can estimate the SD for a normal distribution by taking the IQR/1.35.
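As a quick check of that rule of thumb (assuming normality), you can compute the IQR of a normal distribution directly:

```python
from scipy.stats import norm

# For a normal distribution, the IQR spans about 1.349 standard deviations.
sigma = 10
iqr = norm.ppf(0.75, scale=sigma) - norm.ppf(0.25, scale=sigma)
print(round(iqr, 2))         # ~13.49
print(round(iqr / 1.35, 2))  # ~9.99, which recovers sigma reasonably well
```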
However, those approaches don’t work with nonnormal data. In fact, that’s probably why the authors of the original study presented the results in the manner they did. With very skewed data, you’d need to know the distribution to estimate the variance. I don’t know if you have that information. Offhand, I don’t know of another approach for estimating the variance.
I did find the reference to this study which might provide some helpful techniques. I haven’t read the article myself to know. The title doesn’t mention IQR, but perhaps you have the other information?
Hozo SP. et al. Estimating the mean and variance from the median, range, and the size of a sample. BMC Medical Research Methodology 2005, 5:13.
I hope this helps!
Hi sir, thanks a lot for your blogs, they are really awesome. I literally have no words to explain how helpful your blogs are.
I have a one question:
We are squaring the differences between the mean and the observation values because we get a resultant value (the sum of all the differences) of zero if we don't square them!
So what can we interpret from that? Why does this resultant value come out to zero, and what can we interpret from that?
Hi Raja, thank you so much! I’m glad to hear that my blog has been helpful!
When you have a symmetric distribution, you’ll have an equal number of values above the mean as below the mean, and at the same distances. So, imagine you have one distribution where you have many observations that are say equally at +10 and -10. And, another distribution where many scores are near +1 and -1 equally. Clearly the first distribution is much more spread out. However, both distributions will sum to approximately zero, and have an average of approximately zero. So, summing the difference doesn’t allow you to differentiate between these distributions. You want the variability score for the first distribution to be larger to accurately reflect the fact that it is more spread out.
That’s why we use the squared differences, because you can add them up without the plusses and minuses cancelling each other out.
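Here's a tiny sketch of that point in Python:

```python
data_wide = [10, -10, 10, -10]   # spread out
data_narrow = [1, -1, 1, -1]     # clustered near the mean

for data in (data_wide, data_narrow):
    mean = sum(data) / len(data)
    deviations = [x - mean for x in data]
    print(sum(deviations), sum(d ** 2 for d in deviations))

# The plain deviations sum to 0 for both datasets, so they can't distinguish them.
# The squared deviations (400 vs. 4) clearly separate the wide distribution from the narrow one.
```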
Hi Jim, First of all, thanks a lot for taking time out to share your Statistical knowledge with the world.
I have a question about Variance vs. Standard Deviation. Why do we even have Variance as a measure of dispersion when we know that it gives squared values which are big and we have to use standard deviation as the easy and more interpretative measure of dispersion anyways?
Hi Nitin,
That's a very good question! While the variance really doesn't mean much to us humans, it turns out that it is important in various statistical tests. ANOVA is, after all, the analysis of variance. The F-test assesses the ratio of variances to determine whether they are equal. Additionally, in linear models, we have the key notion of sums of squares, which is a similar concept to variance (being squared differences from the mean). So, it's a useful measure behind the scenes for statistical tests. However, I can't think of a real world situation where people would think that the actual value of the variance conveys anything meaningful.
That's wonderful and lucid! I hope it will clarify many of my statistical confusions.
So beautifully explained! Students all around the world would really benefit from your teachings!
Thank you, Kai. That means so much to me!
Hello sir,
I am a data science student. I have started following your articles. They give me a proper idea about statistics. They're very beneficial to all people from non-statistical backgrounds who really want to learn proper statistics.
Thanks and regards,
Hiral Godhania
Hi Hiral, I’m so happy that you’ve found my articles to be helpful. Thank you so much for taking the time to write such a nice message!
First of all, thank you for your time and help in spreading stats in a simple way.
My question is: I have a dataset with a mean of 2000 seconds and an SD of 1950 seconds. What should I do when I see such a big SD?
Hi Sundar, I’m glad you’ve found my blog to be helpful!
On to your question. Because all of your values are going to be greater than zero, it makes sense to compare the mean and standard deviation. If you could have negative values, then it doesn’t make sense.
You can say that your standard deviation is large compared to the mean. This indicates that while you have an estimate of the central tendency, you really can’t say for any given observation that it is likely to be near the mean. Your data have a lot of variability.
Additionally, I can virtually guarantee you that your data are skewed because you can’t have values less than zero seconds. You tend to get skewed data when you are near a limit. The limit here is zero seconds. And, how near you are to it is defined by the distance between the limit and the central tendency as measured by standard deviations, which is ~1 s.d. in your example. That’s close–so your data are skewed.
You can also think about it in terms of the Empirical Rule. For a bell-shaped curve, you'd expect 95% of the values to fall within +/- 2 standard deviations of the mean. However, that range includes negative values. The values that you'd expect to fall below zero must actually fall above zero. Hence, your distribution is skewed.
You should graph your data! As for what else you should do, it depends on your goal.
I hope this helps!
Hi Jim,
I saw your post shared on social media by Carmen and noticed you're at PSU! I'm a stats PhD student there. I found your blog interesting and intuitive and wanted to reach out to see if there are any resources you could share to help me improve my written and oral communication skills. I'm TAing for the first time this summer and want my class to be as interesting as your blog.
Hi Isaac! First, thanks so much for your kind words about my blog. That means a lot to me!
As for resources, I don’t have any about written and oral communications skills. I wish I had something helpful to point you towards. I’ve been explaining statistics for several decades and that’s helped me refine my approach.
For starters, the fact that you're asking about it indicates that you're already placing a value on clear communication, which is great! I always imagine someone trying to learn this material for the first time. Some of it is very complex. But you can often find a simple way to explain it. When you find ways that communicate a concept better than others, make a note of them and use them, always refining along the way.
I’ll often go out of my way to read material that teaches statistics and look for things that are missing or not clear, and it gives me an angle on my own writing. A deep understanding of statistics really helps this process. When I read something where I know a certain aspect is particularly important but perhaps the import isn’t clearly conveyed in the material, it stands out to me. Then, when I write about it, I’ll focus on that aspect more. I really try to hone it into something that a novice can grasp. It’s a process and involves refinement, trying new approaches, and seeing how others approach it (for better and worse).
I'm not at PSU anymore. It's been quite a while. Your class is lucky to have a TA who really values clear communication!
Hi,
Yes it helps a bit. Thank you for a detailed explanation.
You’re very welcome! Best of luck with your studies!
Hi,
I did not understand why we subtract 1 from the sample size in the formula for sample variance. Why the "-1"?
Hi Thejas,
Statisticians have found that samples tend to underestimate the variance when you simply divide by n. It turns out that the data points in a sample are closer together than they are in the population. Dividing by n-1, rather than n, solves this problem.
Reducing the denominator counteracts the tendency for underestimation. By dividing by a smaller number, the end result is a bit larger.
For example, let’s say that you have sum of squared differences of 100 and a sample size of 10. Dividing by n (10), you obtain a variance of 100/10 = 10. However, if you divide by n-1 (9), you obtain 100/9 = 11.1. Statisticians have found that using n tends to underestimate the variance (a biased estimator in statistical speak). However, n-1 is unbiased.
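Here's a small simulation sketch (with a made-up population, not data from the post) that shows the underestimation in action:

```python
import random
import statistics

random.seed(0)
population = [random.gauss(100, 15) for _ in range(100_000)]
print(round(statistics.pvariance(population), 1))  # the "true" population variance, ~225

divide_by_n, divide_by_n_minus_1 = [], []
for _ in range(5_000):
    sample = random.sample(population, 10)
    divide_by_n.append(statistics.pvariance(sample))         # denominator n
    divide_by_n_minus_1.append(statistics.variance(sample))  # denominator n - 1

print(round(statistics.mean(divide_by_n), 1))          # tends to come out too small
print(round(statistics.mean(divide_by_n_minus_1), 1))  # much closer to the population variance
```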
I hope this helps!
Hello Sir,
Thank you so much for this post. Last year I faced a huge problem in my paper due to my lack of attention to variation measurement. This post has cleared up some of my confusion. Thank you so much. Keep up the good work. Your explanation is easier to understand for a non-statistics/math student like me. I will explore your other blog posts.
Hi Mahmudul, you’re very welcome! I’m very glad that my blog posts have helped you. And, thank you for taking the time to write such a nice comment. It means a lot to me!
Great, thank you Jim
I'll look forward to that article.
As usual Jim, very clear explanation. Thank you!
Thanks, Roy!
Jim,
Thank you so much for taking the time to post these awesome articles. I have never seen statistics articles as intuitive as the ones you post. Thank you for taking the time to do this.
I have a request: could you please post an article on sample sizes and details about when we need to take Type I error (alpha) and Type II error (beta) into consideration, for both the mean and proportion cases? Please make sure to include formulas as well.
Thank you,
Ashwini
Hi Ashwini, thanks so much! I strive to make the articles as intuitive as possible.
That sounds like a great idea for an article. For the time being, I’ve written about some of the issues you mention but not all together in one place. Take a look at the following:
Comparing Hypothesis Tests–where I cover various tests, including those for the mean and proportions, and a little about the differences in required sample sizes.
As for your question about the significance level (alpha), I write about significance levels and p-values. Whenever you perform a hypothesis test, you need to worry about alpha. Beta is important too, but harder to quantify. I need to write about beta specifically at some point!
In regards to your question about sample sizes, sometime in March I’ll write about power and sample size analyses.
I hope this helps for now!
Hi. You did explain why the (n-1) denominator in sample variance is used. My question is why do many social science and education textbooks not use the (n-1) in descriptive statistic calculations?
Hi Jerry, the sample variance formula is used when using a sample to estimate a population. With descriptive statistics, the goal is not to estimate a population but only to describe the data in hand. In a sense, you’re treating that sample as a population and not using it to estimate one. At least that’s what I’m guessing the textbook’s rationale is based on your description. However, once analysts decide they want to estimate the population parameter, they should use the sample variance equation.
Thank you, sir!
Love you so much.
I will be waiting for your next article, and what you have written helps me a lot.
Thank you, Khursheed! I appreciate your kind words! And I'm glad that they have been helpful for you!