Descriptive statistics summarize your dataset, painting a picture of its properties. These properties include various central tendency and variability measures, distribution properties, outlier detection, and other information. Unlike inferential statistics, descriptive statistics only describe your dataset’s characteristics and do not attempt to generalize from a sample to a population.
Using a single function, Excel can calculate a set of descriptive statistics for your dataset. This post is an excellent introduction to interpreting descriptive statistics even if Excel isn’t your primary statistical software package.
In this post, I provide step-by-step instructions for using Excel to calculate descriptive statistics for your data. Importantly, I also show you how to interpret the results, determine which statistics are most applicable to your data, and help you navigate some of the lesser-known values.
Additionally, I include links to resources I’ve written that present clear explanations of relevant statistical concepts that you won’t find in Excel’s documentation. And, I use an example dataset for us to work through and interpret together!
Before proceeding, ensure that Excel’s Data Analysis ToolPak is installed. On the Data tab, look for Data Analysis, as shown below.
If you don’t see Data Analysis, install that ToolPak. Learn how to install it in my post about using Excel to perform t-tests. It’s free!
Descriptive Statistics in Excel
Let’s start with a caveat. Use descriptive statistics together with graphs. The statistical output contains numbers that describe the properties of your data. While they provide useful information, charts are often more intuitive. The best practice is to use graphs and statistical output together to maximize your understanding. At the end of this post, I display the histograms for the variables in this dataset.
For this example, we’ll assess two variables, the height and weight of preteen girls. I collected these data during a real experiment. To use this feature in Excel, arrange your data in columns or rows. I have my data in columns, as shown in the snippet below.
Download the Excel file that contains the data for this example: HeightWeight.
In Excel, click Data Analysis on the Data tab, as shown above. In the Data Analysis popup, choose Descriptive Statistics, and then follow the steps below.
Step-by-Step Instructions for Filling in Excel’s Descriptive Statistics Box
- Under Input Range, select the range for the variables that you want to analyze. You can include multiple variables as long as they form a contiguous block. While you can explore more than one variable, the analysis assesses each variable in a univariate manner (i.e., no correlation).
- In Grouped By, choose how your variables are organized. I always include one variable per column as this format is standard across software. Alternatively, you can include one variable per row.
- Check the Labels in first row checkbox if you have meaningful variable names in row 1. This option makes the output easier to interpret.
- In Output options, choose where you want Excel to display the results.
- Check the Summary statistics box to display most of the descriptive statistics (central tendency, dispersion, distribution properties, sum, and count).
- Check the Confidence Level for Mean box to display a confidence interval for the mean. Enter the confidence level. 95% is usually a good value. For more information about confidence levels, read my post about confidence intervals.
- Check Kth Largest and Kth Smallest to display a high and low value. If you enter 1, Excel displays the highest and lowest values. If you enter 2, it shows the 2nd highest and lowest values. Etc.
- Click OK.
For our example dataset, fill in the dialog box as shown below.
Interpreting Excel’s Descriptive Statistics Results
After Excel creates the statistical output, I autofit the columns for clarity.
As you can see, we’re assessing two variables, height in meters and weight in kilograms.
Generally, we’ll work our way down from the top of Excel’s descriptive statistics output. However, I’ll group the results into categories that make sense. Consequently, the following discussion doesn’t strictly follow the order of the output. If you want to learn more about the statistics, be sure to click the links for more detailed information!
Central Tendencies (Mean, Median, Mode)
A measure of central tendency describes where most of the values in the dataset occur. It’s the center of the distribution of values. Excel presents three measures of central tendency. Which one is best for your data?
- Mean: This measure is the one with which you’re most familiar. It’s the sum of all observations divided by the number of observations. It’s best for data that follow symmetric distributions.
- Median: This value splits your data in half. Half the values fall above the median while half are below it. It’s best for skewed distributions.
- Mode: This measure represents the value that occurs most frequently in your data. It’s best for categorical and ordinal data.
The example data are continuous variables. Excel frequently displays “N/A” for the mode when you have continuous data. That happens because continuous data are unlikely to have exactly duplicated values, a requirement for the mode. Thanks to a data collection artifact, my data are continuous, but Excel displays the mode anyway. The study’s nurse collected the underlying data in inches and pounds, rounded them to the nearest unit, and converted them to their metric equivalents. That process produced clumps of rounded values. However, the mode really is not a good measure for these data.
Related post: Data Types and How to Graph Them
Central Tendency for our Descriptive Statistics Example
What can we learn by comparing the mean and median for both variables? For the height data, they are virtually equal, 1.51m and 1.50m, respectively. For symmetric distributions, the mean and median will be very close together. That’s a good sign that the heights follow a symmetric distribution, making the mean a good choice. The mean tells us that the height distribution centers on 1.51m.
However, there is a difference between the weight mean (46.3kg) and median (44.9kg). When the mean is greater than the median, it indicates that the distribution is right-skewed. We should use the median for these data. Half the data points fall above 44.9kg, and half fall below.
For more information about the different measures of central tendency, their calculations, how data types and distribution properties affect them, graphical representations, and when to use each type, read my post about Measures of Central Tendency.
Measures of Dispersion (Standard Deviation, Variance, Range)
Previously, you saw how a measure of central tendency indicates where most observations fall. Measures of dispersion indicate how closely clustered or loosely spread the data points fall around the center. Excel presents three measures of dispersion. In general, as their values increase, data points spread out further from the center (i.e., the distribution becomes broader).
- Standard Deviation: The standard or typical difference between each data point and the mean. This measure uses the original units of the data, simplifying interpretation. Hence, analysts use this measure of variability the most frequently. The standard deviation is the square root of the variance.
- Variance: The average squared difference of the values from the mean. Because the calculations use squared differences, the variance is in squared units rather than the original data units. While higher values of the variance indicate greater variability, there is no intuitive interpretation for specific values.
- Range: The difference between the largest and smallest values in a dataset. The range is easy to understand but it is based on only the two most extreme values in the dataset, making it very susceptible to outliers. Additionally, the size of the dataset affects the range. As the sample size increases, the range tends to expand. Consequently, use the range to compare variability only when the sample sizes are similar.
Typically, use the standard deviation. When you have fairly skewed data, consider using the interquartile range (IQR), which Excel doesn’t provide, unfortunately.
Variability for our Descriptive Statistics Example
For the height data, the standard deviation is 0.07m (7cm). The typical height falls 7cm from the mean of 1.51m. The range tells us that the spread from the tallest to the shortest is 0.33m (33cm). You can draw similar conclusions from the weight data.
It might be tempting to compare the variability between heights and weights using the standard deviations. However, their standard deviations use different units, M and kg, making a direct comparison impossible. However, for some data, you can compare their coefficients of variation, which is easy to calculate using the standard deviation and means. For more information, read my post about the coefficient of variation.
For more information about the different measures of variability, their calculations, and when to use each type, read my post about Measures of Variability.
Distribution Shape Properties: Kurtosis and Skewness
Kurtosis and skewness are two measures that help you understand the general properties of your data’s distribution. These measures compare your distribution’s shape to a symmetric distribution and the normal distribution.
When either kurtosis or skewness significantly deviate from zero, it might indicate that your data do not follow a normal distribution. However, use a normality test or a normal distribution plot to make that determination.
I find that histograms present the same information more intuitively. However, graph axes and bin sizes can be manipulated to exaggerate or deemphasize characteristics while these statistics are completely objective.
Related post: Manually Adjusting Your Graph Axes
Kurtosis indicates how the peaks and tails of your distribution compare to the normal distribution. Is the peak taller or shorter than the normal distribution? Are the tails thicker or thinner? In the table, the red distributions have positive and negative kurtosis values while the blue distributions have a zero kurtosis value for comparison.
|Zero||Consistent with a normal distribution|
|Positive||Higher peak and thinner tails than the normal distribution|
|Negative||Shorter peak and thicker tails than the normal distribution|
For our example data, height has a kurtosis of -0.35. This value is close to zero, indicating that the tails are consistent with the normal distribution. However, weight has a kurtosis of 1.15, suggesting the tails are thinner than the normal distribution.
Skewness indicates the symmetry of your data’s distribution. Skewed data are asymmetric. The terms right-skewed and left-skewed indicate the direction in which the long tail points on a distribution curve.
|Zero||A perfectly symmetric distribution|
Note that a U-shaped distribution can be symmetric even though it is inverted compared to the normal distribution.
For our example data, height has a skewness of 0.11. This value is close to zero, signifying that these data have a symmetric distribution. However, weight has a skewness of 1.05, which indicates it is right-skewed.
The relative locations of the mean and median and these distribution properties paint a consistent picture of these two variables. For the height data, the mean and median are nearly equal, and kurtosis and skewness are both virtually zero. These measures collectively imply that the heights follow a symmetric distribution consistent with the normal distribution.
Conversely, the weight data have a mean that is higher than the median, a positive skew value, and a positive kurtosis value. These values suggest that the weights follow an asymmetric, right-skewed distribution that is not consistent with the normal distribution.
Minimum and Maximum
The minimum and maximum values in your dataset can help you understand where your data fall. For our example data, the heights fall between 1.33 – 1.66 M, while the weights fall between 29.26 – 80.74 kg. Additionally, these values can help you identify outliers. Frequently, data entry errors create values that fall outside the range of valid data. Look at the minimum and maximum values and see if they make sense for your data!
Related post: Five Ways to Find Outliers in Your Data
Sum and Count
The sum is simply the sum of all values for each variable. I’ve never found this to be helpful, but perhaps it will be for you. The count is the number of observations for each variable. Use this value to determine whether the sample size is what you expected. Both the height and weight variables have 88 observations.
Precision of the Mean: Standard Error and the Confidence Interval
The standard error and the confidence interval assess how precisely your sample mean estimates the population mean. A relatively precise estimate indicates that your sample estimate is likely to be close to the actual population value. Conversely, an imprecise estimate tends to be further away from the correct population value.
Technically, neither of the values belong in the descriptive statistics output because they use your sample data to infer the properties of a larger population (inferential statistics). Descriptive statistics only describes your data without considering a population. However, Excel includes them in the output, so I’ll interpret them here.
Be aware that inferential statistics impose additional requirements on data collection methodologies that do not apply to descriptive statistics. For example, you must use a representative sampling methodology, such as random sampling; otherwise, these measures are invalid.
For more information, read my post about the differences between descriptive and inferential statistics.
Standard Error of the Mean
The standard error of the mean is the standard deviation of the sampling distribution of the mean. What?!
If you took many samples from the same population and calculated each sample’s mean, you’d produce a distribution of sample means. That distribution has a standard deviation, which is the standard error of the mean.
Smaller standard errors indicate that your sample provides a more precise estimate of the population value. Unfortunately, there is no intuitive interpretation of these values. However, the calculations for confidence intervals (CIs) incorporate the standard error, and CIs are much easier to interpret. So, focus on the CIs and don’t worry about the standard errors!
Related post: Standard Error of the Mean
Confidence Interval (CI) of the Mean
A confidence interval of the mean is a range of values that a population mean is likely to fall within. Because of random sampling error, you know that your sample mean is unlikely to equal the population mean, but how large is that difference? CIs help you answer that question by providing a range of probable values for the population mean.
Narrow CIs indicate more precise estimates of the population mean. In other words, you can expect your sample mean to be relatively close to the population mean.
Excel doesn’t provide the range, but it does display the number to add and subtract from your mean to calculate the confidence interval.
For the height data, Excel displays 0.015530282, which I’m rounding to 0.02. To calculate the CI, take the average height and +/- this value. In other words, 1.51 +/- 0.02 creates a CI of 1.49 – 1.53. We can be confident that the mean height for this population falls between these two values.
Using the same process, the confidence interval for weight is [43.98 48.68]. We can be confident that the mean weight for the population falls between these values.
If you want to know more about standard errors, confidence intervals, and confidence levels, read my post about How Confidence Intervals Work.
Histograms of our Descriptive Statistics Data
Let’s see the histograms for our example data. These graphs are not a part of Excel’s descriptive statistics. However, my suggestion is that you graph your data first and then study the numbers. All the statistics in this post describe the data that created the graphs below.
Are there any surprises?
For myself, I expected the height data to be more perfectly symmetrical. However, they are very slightly skewed to the right. The weight data are more right skewed, consistent with the descriptive statistics.
While the Descriptive Statistics analysis can’t assess correlation, read my post about Using Excel to Calculate Correlation to evaluate the relationship between these two variables!