The Gauss-Markov theorem states that if your linear regression model satisfies the first six classical assumptions, then ordinary least squares (OLS) regression produces unbiased estimates that have the smallest variance of all possible linear estimators. [Read more…] about The Gauss-Markov Theorem and BLUE OLS Coefficient Estimates
Ordinary Least Squares (OLS) is the most common estimation method for linear models—and that’s true for a good reason. As long as your model satisfies the OLS assumptions for linear regression, you can rest easy knowing that you’re getting the best possible estimates. [Read more…] about 7 Classical Assumptions of Ordinary Least Squares (OLS) Linear Regression
The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores follow the normal distribution. It is also known as the Gaussian distribution and the bell curve.
The normal distribution is a probability function that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions. Extreme values in both tails of the distribution are similarly unlikely.
In this blog post, you’ll learn how to use the normal distribution, its parameters, and how to calculate Z-scores to standardize your data and find probabilities. [Read more…] about Normal Distribution in Statistics
A probability distribution is a function that describes the likelihood of obtaining the possible values that a random variable can assume. In other words, the values of the variable vary based on the underlying probability distribution.
Suppose you draw a random sample and measure the heights of the subjects. As you measure heights, you can create a distribution of heights. This type of distribution is useful when you need to know which outcomes are most likely, the spread of potential values, and the likelihood of different results.
In this blog post, you’ll learn about probability distributions for both discrete and continuous variables. I’ll show you how they work and examples of how to use them. [Read more…] about Understanding Probability Distributions
A correlation between variables indicates that as one variable changes in value, the other variable tends to change in a specific direction. Understanding that relationship is useful because we can use the value of one variable to predict the value of the other variable. For example, height and weight are correlated—as height increases, weight also tends to increase. Consequently, if we observe an individual who is unusually tall, we can predict that his weight is also above the average.
In statistics, correlation is a quantitative assessment that measures both the direction and the strength of this tendency to vary together. There are different types of correlation that you can use for different kinds of data. In this post, I cover the most common type of correlation—Pearson’s correlation coefficient.
Before we get into the numbers, let’s graph some data first so we can understand the concept behind what we are measuring.
Graph Your Data to Find Correlations
Scatterplots are a great way to check quickly for relationships between pairs of continuous data. The scatterplot below displays the height and weight of pre-teenage girls. Each dot on the graph represents an individual girl and her combination of height and weight. These data are actual data that I collected during an experiment.
At a glance, you can see that there is a relationship between height and weight. As height increases, weight also tends to increase. However, it’s not a perfect relationship. If you look at a specific height, say 1.5 meters, you can see that there is a range of weights associated with it. You can also find short people who weigh more than taller people. However, the general tendency that height and weight increase together is unquestionably present.
Pearson’s correlation takes all of the data points on this graph and represents them as a single number. In this case, the statistical output below indicates that the correlation is 0.694.
What do the correlation and p-value mean? We’ll interpret the output soon. First, let’s look at a range of possible correlation values so we can understand how our height and weight example fits in.
How to Interpret the Pearson’s Correlation Coefficient
Pearson’s correlation coefficient is represented by the Greek letter rho (ρ) for the population parameter and r for a sample statistic. This coefficient is a single number that measures both the strength and direction of the linear relationship between two continuous variables. Values can range from -1 to +1.
- Strength: The greater the absolute value of the coefficient, the stronger the relationship.
- The extreme values of -1 and 1 indicate a perfectly linear relationship where a change in one variable is accompanied by a perfectly consistent change in the other. For these relationships, all of the data points fall on a line. In practice, you won’t see either type of perfect relationship.
- A coefficient of zero represents no linear relationship. As one variable increases, there is no tendency in the other variable to either increase or decrease.
- When the value is in-between 0 and +1/-1, there is a relationship, but the points don’t all fall on a line. As r approaches -1 or 1, the strength of the relationship increases and the data points tend to fall closer to a line.
- Direction: The coefficient sign represents the direction of the relationship.
- Positive coefficients indicate that when the value of one variable increases, the value of the other variable also tends to increase. Positive relationships produce an upward slope on a scatterplot.
- Negative coefficients represent cases when the value of one variable increases, the value of the other variable tends to decrease. Negative relationships produce a downward slope.
Examples of Positive and Negative Correlations
An example of a positive correlation is the relationship between the speed of a wind turbine and the amount of energy it produces. As the turbine speed increases, electricity production also increases.
An example of a negative correlation is the relationship between outdoor temperature and heating costs. As the temperature increases, heating costs decrease.
Graphs for Different Correlations
Graphs always help bring concepts to life. The scatterplots below represent a spectrum of different relationships. I’ve held the horizontal and vertical scales of the scatterplots constant to allow for valid comparisons between them.
Discussion about the Correlation Scatterplots
For the scatterplots above, I created one positive relationship between the variables and one negative relationship between the variables. Then, I varied only the amount of dispersion between the data points and the line that defines the relationship. That process illustrates how correlation measures the strength of the relationship. The stronger the relationship, the closer the data points fall to the line. I didn’t include plots for weaker correlations that are closer to zero than 0.6 and -0.6 because they start to look like blobs of dots and it’s hard to see the relationship.
A common misinterpretation is assuming that a negative correlation coefficient indicates that there is no relationship. After all, a negative correlation sounds suspiciously like no relationship. However, the scatterplots for the negative correlations display real relationships. For negative relationships, high values of one variable are associated with low values of another variable. For example, there is a negative correlation between school absences and grades. As the number of absences increases, the grades decrease.
Earlier I mentioned how crucial it is to graph your data to understand them better. However, a quantitative measurement of the relationship does have an advantage. Graphs are a great way to visualize the data, but the scaling can exaggerate or weaken the appearance of a relationship. Additionally, the automatic scaling in most statistical software tends to make all data look similar.
Fortunately, Pearson’s correlation coefficient is unaffected by scaling issues. Consequently, a statistical assessment is better for determining the precise strength of the relationship.
Graphs and the relevant statistical measures often work better in tandem.
Pearson’s Correlation Coefficient Measures Linear Relationship
Pearson’s correlation measures only linear relationships. Consequently, if your data contain a curvilinear relationship, the correlation coefficient will not detect it. For example, the correlation for the data in the scatterplot below is zero. However, there is a relationship between the two variables—it’s just not linear.
This example illustrates another reason to graph your data! Just because the coefficient is near zero, it doesn’t necessarily indicate that there is no relationship.
Hypothesis Test for Correlations
Correlations have a hypothesis test. As with any hypothesis test, this test takes sample data and evaluates two mutually exclusive statements about the population from which the sample was drawn. For Pearson correlations, the two hypotheses are the following:
- Null hypothesis: There is no linear relationship between the two variables. ρ = 0.
- Alternative hypothesis: There is a linear relationship between the two variables. ρ ≠ 0.
A correlation of zero indicates that no linear relationship exists. If your p-value is less than your significance level, the sample contains sufficient evidence to reject the null hypothesis and conclude that the correlation does not equal zero. In other words, the sample data support the notion that the relationship exists in the population.
Related post: Overview of Hypothesis Tests
Interpreting our Height and Weight Correlation Example
Now that we have seen a range of positive and negative relationships, let’s see how our correlation of 0.694 fits in. We know that it’s a positive relationship. As height increases, weight tends to increase. Regarding the strength of the relationship, the graph shows that it’s not a very strong relationship where the data points tightly hug a line. However, it’s not an entirely amorphous blob with a very low correlation. It’s somewhere in between. That description matches our moderate correlation of 0.694.
For the hypothesis test, our p-value equals 0.000. This p-value is less than any reasonable significance level. Consequently, we can reject the null hypothesis and conclude that the relationship is statistically significant. The sample data support the notion that the relationship between height and weight exists in the population of preteen girls.
Correlation Does Not Imply Causation
I’m sure you’ve heard this expression before, and it is a crucial warning. Correlation between two variables indicates that changes in one variable are associated with changes in the other variable. However, correlation does not mean that the changes in one variable actually cause the changes in the other variable.
Sometimes it is clear that there is a causal relationship. For the height and weight data, it makes sense that adding more vertical structure to a body causes the total mass to increase. Or, increasing the wattage of lightbulbs causes the light output to increase.
However, in other cases, a causal relationship is not possible. For example, ice cream sales and shark attacks are positively correlated. Clearly, selling more ice cream does not cause shark attacks (or vice versa). Instead, a third variable, outdoor temperatures, causes changes in the other two variables. Higher temperatures increase both sales of ice cream and the number of swimmers in the ocean, which creates the apparent relationship between ice cream sales and shark attacks.
In statistics, you typically need to perform a randomized, controlled experiment to determine that a relationship is causal rather than merely correlation.
How Strong of a Correlation is Considered Good?
What is a good correlation? How high should it be? These are commonly asked questions. I have seen several schemes that attempt to classify correlations as strong, medium, and weak.
However, there is only one correct answer. The correlation coefficient should accurately reflect the strength of the relationship. Take a look at the correlation between the height and weight data, 0.694. It’s not a very strong relationship, but it accurately represents our data. An accurate representation is the best-case scenario for using a statistic to describe an entire dataset.
The strength of any relationship naturally depends on the specific pair of variables. Some research questions involve weaker relationships than other subject areas. Case in point, humans are hard to predict. Studies that assess relationships involving human behavior tend to have correlations weaker than +/- 0.6.
However, if you analyze two variables in a physical process, and have very precise measurements, you might expect correlations near +1 or -1. There is no one-size fits all best answer for how strong a relationship should be. The correct correlation value depends on your study area.
Taking Correlation to the Next Level with Regression Analysis
Wouldn’t it be nice if instead of just describing the strength of the relationship between height and weight, we could define the relationship itself using an equation? Regression analysis does just that. That analysis finds the line and corresponding equation that provides the best fit to our dataset. We can use that equation to understand how much weight increases with each additional unit of height and to make predictions for specific heights. Read my post where I talk about the regression model for the height and weight data.
Regression analysis allows us to expand on correlation in other ways. If we have more variables that explain changes in weight, we can include them in the model and potentially improve our predictions. And, if the relationship is curved, we can still fit a regression model to the data.
Additionally, a form of the Pearson correlation coefficient shows up in regression analysis. R-squared is a primary measure of how well a regression model fits the data. This statistic represents the percentage of variation in one variable that other variables explain. For a pair of variables, R-squared is simply the square of the Pearson’s correlation coefficient. For example, squaring the height-weight correlation coefficient of 0.694 produces an R-squared of 0.482, or 48.2%. In other words, height explains about half the variability of weight in preteen girls.
To learn more about regression analysis, read my regression tutorial.
Determining a good sample size for a study is always an important issue. After all, using the wrong sample size can doom your study from the start. Fortunately, power analysis can find the answer for you. Power analysis combines statistical analysis, subject-area knowledge, and your requirements to help you derive the optimal sample size for your study.
Statistical power in a hypothesis test is the probability that the test will detect an effect that actually exists. As you’ll see in this post, both under-powered and over-powered studies are problematic. Let’s learn how to find a good sample size for your study! [Read more…] about Estimating a Good Sample Size for Your Study Using Power Analysis
A measure of variability is a summary statistic that represents the amount of dispersion in a dataset. How spread out are the values? While a measure of central tendency describes the typical value, measures of variability define how far away the data points tend to fall from the center. We talk about variability in the context of a distribution of values. A low dispersion indicates that the data points tend to be clustered tightly around the center. High dispersion signifies that they tend to fall further away.
In statistics, variability, dispersion, and spread are synonyms that denote the width of the distribution. Just as there are multiple measures of central tendency, there are several measures of variability. In this blog post, you’ll learn why understanding the variability of your data is critical. Then, I explore the most common measures of variability—the range, interquartile range, variance, and standard deviation. I’ll help you determine which one is best for your data. [Read more…] about Measures of Variability: Range, Interquartile Range, Variance, and Standard Deviation
A measure of central tendency is a summary statistic that represents the center point or typical value of a dataset. These measures indicate where most values in a distribution fall and are also referred to as the central location of a distribution. You can think of it as the tendency of data to cluster around a middle value. In statistics, the three most common measures of central tendency are the mean, median, and mode. Each of these measures calculates the location of the central point using a different method.
Choosing the best measure of central tendency depends on the type of data you have. In this post, I explore these measures of central tendency, show you how to calculate them, and how to determine which one is best for your data. [Read more…] about Measures of Central Tendency: Mean, Median, and Mode
Descriptive and inferential statistics are two broad categories in the field of statistics. In this blog post, I show you how both types of statistics are important for different purposes. Interestingly, some of the statistical measures are similar, but the goals and methodologies are very different. [Read more…] about Difference between Descriptive and Inferential Statistics
In the field of statistics, data are vital. Data are the information that you collect to learn, draw conclusions, and test hypotheses. After all, statistics is the science of learning from data. However, there are different types of variables, and they record various kinds of information. Crucially, the type of information determines what you can learn from it, and, importantly, what you cannot learn from it. Consequently, it’s essential that you understand the different types of data. [Read more…] about Guide to Data Types and How to Graph Them in Statistics
Binary data occur when you can place an observation into only two categories. It tells you that an event occurred or that an item has a particular characteristic. For instance, an inspection process produces binary pass/fail results. Or, when a customer enters a store, there are two possible outcomes—sale or no sale. In this post, I show you how to use the binomial, geometric, negative binomial, and the hypergeometric distributions to glean more information from your binary data. [Read more…] about Maximize the Value of Your Binary Data with the Binomial and Other Probability Distributions
Anecdotal evidence is a story told by individuals. It comes in many forms that can range from product testimonials to word of mouth. It’s often testimony, or a short account, about the truth or effectiveness of a claim. Typically, anecdotal evidence focuses on individual results, is driven by emotion, and presented by individuals who are not subject area experts. [Read more…] about Learn How Anecdotal Evidence Can Trick You!
The field of statistics is the science of learning from data. Statistical knowledge helps you use the proper methods to collect the data, employ the correct analyses, and effectively present the results. Statistics is a crucial process behind how we make discoveries in science, make decisions based on data, and make predictions. Statistics allows you to understand a subject much more deeply. [Read more…] about The Importance of Statistics
Regression analysis mathematically describes the relationship between independent variables and the dependent variable. It also allows you to predict the mean value of the dependent variable when you specify values for the independent variables. In this regression tutorial, I gather together a wide range of posts that I’ve written about regression analysis. My tutorial helps you go through the regression content in a systematic and logical order. [Read more…] about Regression Tutorial with Analysis Examples
In a previous blog post, I introduced the basic concepts of hypothesis testing and explained the need for performing these tests. In this post, I’ll build on that and compare various types of hypothesis tests that you can use with different types of data, explore some of the options, and explain how to interpret the results. Along the way, I’ll point out important planning considerations, related analyses, and pitfalls to avoid. [Read more…] about Comparing Hypothesis Tests for Continuous, Binary, and Count Data
In this blog post, I explain why you need to use statistical hypothesis testing and help you navigate the essential terminology. Hypothesis testing is a crucial procedure to perform when you want to make inferences about a population using a random sample. These inferences include estimating population properties such as the mean, differences between means, proportions, and the relationships between variables. [Read more…] about Statistical Hypothesis Testing Overview
Regression analysis mathematically describes the relationship between a set of independent variables and a dependent variable. There are numerous types of regression models that you can use. This choice often depends on the kind of data you have for the dependent variable and the type of model that provides the best fit. In this post, I cover the more common types of regression analyses and how to decide which one is right for your data. [Read more…] about Choosing the Correct Type of Regression Analysis
Interaction effects occur when the effect of one variable depends on the value of another variable. Interaction effects are common in regression analysis, ANOVA, and designed experiments. In this blog post, I explain interaction effects, how to interpret them in statistical designs, and the problems you will face if you don’t include them in your model. [Read more…] about Understanding Interaction Effects in Statistics
Use regression analysis to describe the relationships between a set of independent variables and the dependent variable. Regression analysis produces a regression equation where the coefficients represent the relationship between each independent variable and the dependent variable. You can also use the equation to make predictions.
As a statistician, I should probably tell you that I love all statistical analyses equally—like parents with their kids. But, shhh, I have secret! Regression analysis is my favorite because it provides tremendous flexibility, which makes it useful in so many different circumstances. In fact, I’ve described regression analysis as taking correlation to the next level!
In this blog post, I explain the capabilities of regression analysis, the types of relationships it can assess, how it controls the variables, and generally why I love it! You’ll learn when you should consider using regression analysis. [Read more…] about When Should I Use Regression Analysis?
Log-log plots display data in two dimensions where both axes use logarithmic scales. When one variable changes as a constant power of another, a log-log graph shows the relationship as a straight line. In this post, I’ll show you why these graphs are valuable and how to interpret them. [Read more…] about Using Log-Log Plots to Determine Whether Size Matters