Use scatterplots to show relationships between pairs of continuous variables. These graphs display symbols at the X, Y coordinates of the data points for the paired variables. Scatterplots are also known as scattergrams and scatter charts.
The pattern of dots on a scatterplot allows you to determine whether a relationship or correlation exists between two continuous variables. If a relationship exists, the scatterplot indicates its direction and whether it is a linear or curved relationship.
Fitted line plots are a special type of scatterplot that displays the data points along with a fitted line for a simple regression model. This graph allows you to evaluate how well the model fits the data.
Use scatterplots to perform the following tasks:
- Examine the relationship between two variables.
- Check for outliers and unusual observations.
- Create a time series plot with irregular time-dependent data.
- Evaluate the fit of a regression model.
At a minimum, scatterplots require two continuous variables. To learn about other graphs, read my Guide to Data Types and How to Graph Them.
During an experiment, I measured the Body Mass Index (BMI) and body fat percentage of adolescent girls. I graphed these two variables in a scatterplot to assess the relationship between them.
Scatterplots typically contain the following elements:
- X-axis representing values of a continuous variable. By custom, this is the independent variable when you can classify one of the variables as such.
- Y-axis representing values of a continuous variable. Traditionally, this is the dependent variable.
- Symbols plotted at the (X, Y) coordinates of your data. Optionally, the graph can use symbols with different colors or shapes to represent separate groups on the same chart.
- Optionally, you can overlay fit lines to determine how well a model fits the data.
For the BMI and body fat data, the scatterplot displays a moderately strong, positive relationship. As BMI increases, the body fat percentage also tends to increase. The relationship appears to curve slightly because it flattens out for higher BMI values. To model the curvature, the regression model includes a squared term. The fitted line follows the curvature of the data, indicating a good fit.
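As a rough illustration of what fitting a line with a squared term involves, the sketch below fits y = b0 + b1*x + b2*x^2 by least squares in plain Python. The BMI and body fat numbers are invented for illustration and are not the data from the study, and `solve_linear_system` and `fit_quadratic` are hypothetical helper names.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix so row operations apply to both sides.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_quadratic(xs, ys):
    """Least-squares fit of y = b0 + b1*x + b2*x**2 via the normal equations."""
    rows = [[1.0, x, x * x] for x in xs]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return solve_linear_system(xtx, xty)


# Hypothetical (BMI, body fat %) pairs that flatten out at higher BMI values.
bmi = [16.0, 18.0, 20.0, 22.0, 24.0, 26.0, 28.0, 30.0]
body_fat = [18.5, 23.0, 26.5, 29.5, 31.5, 33.5, 34.5, 35.0]

b0, b1, b2 = fit_quadratic(bmi, body_fat)
fitted = [b0 + b1 * x + b2 * x * x for x in bmi]
residuals = [y - f for y, f in zip(body_fat, fitted)]
print(f"b0={b0:.2f}, b1={b1:.2f}, b2={b2:.4f}")
```

A negative coefficient on the squared term corresponds to a curve that flattens as BMI rises, and small residuals suggest the fitted line tracks the points closely.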
Interpreting Scatterplots and Assessing Relationships between Variables
Scatterplots display the direction, strength, and linearity of the relationship between two variables.
Positive and Negative Correlation and Relationships
Values tending to rise together indicate a positive correlation. For instance, height and weight have a positive correlation.
However, if one variable increases as the other decreases, it’s a negative correlation, as shown below.
Strength of Relationships
Stronger relationships produce a tighter clustering of data points. Be aware that changes in scaling can change the apparent strength of the relationship. Correlation coefficients provide an objective assessment of strength independent of graph scaling.
In the two graphs below, the data points in the top graph cluster more tightly than the data points in the bottom graph. Consequently, the first dataset displays a stronger relationship.
Stronger relationships produce correlation coefficients closer to -1 or +1 and regression models that have higher R-squared values.
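To make the point about scaling concrete, here is a small sketch that computes Pearson's correlation coefficient in plain Python and shows that changing the units of one axis does not change it. The height and weight values are made up for illustration, and `pearson_r` is a hypothetical helper name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up paired measurements, for illustration only.
height_cm = [150.0, 155.0, 160.0, 165.0, 170.0, 175.0, 180.0, 185.0]
weight_kg = [52.0, 58.0, 57.0, 63.0, 68.0, 66.0, 75.0, 79.0]

r = pearson_r(height_cm, weight_kg)

# Changing units (cm to inches) stretches the axis but leaves r unchanged.
height_in = [h / 2.54 for h in height_cm]
r_rescaled = pearson_r(height_in, weight_kg)

# For a simple linear regression with one predictor, R-squared equals r squared.
r_squared = r ** 2
print(f"r = {r:.3f}, r after rescaling = {r_rescaled:.3f}, R-squared = {r_squared:.3f}")
```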
Linear and Curved Relationships
Determine whether your data have a linear or curved relationship. When a relationship between two variables is curved, it affects the type of correlation you can use to assess its strength and how you can model it using regression analysis.
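As one illustration of how curvature affects the choice of correlation, the sketch below compares Pearson's correlation with Spearman's rank correlation on a curved but steadily increasing relationship. Spearman's correlation is an assumed choice of alternative here, and the data and helper names (`pearson_r`, `ranks`, `spearman_rho`) are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Made-up curved relationship: y always increases with x, but not linearly.
x = [float(v) for v in range(1, 11)]
y = [math.exp(v / 2) for v in x]

r = pearson_r(x, y)
rho = spearman_rho(x, y)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```

Because the rank-based measure only asks whether the variables rise together, it reports a perfect relationship here, while Pearson's r is lower because the points do not fall on a straight line.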
When a relationship exists, you might want to model it using regression analysis. Adding a fit line to the scatterplot highlights how well the model fits your data.
Determine Whether the Relationship Changes between Groups
When your data have groups, you can determine whether the relationship between two variables differs between the groups. To make these comparisons, you’ll need a categorical variable that defines the groups. All groups must use the same X and Y measurements.
In this scatterplot, the slope of the relationship is the same for the two groups, but the output values of group B are consistently higher for any given input value.
In this scatterplot, the slope for group B is steeper than for group A. As the input value increases, the output for group B increases more quickly than the output for group A.
Use indicator variables and interaction terms in a regression model to test the statistical significance of these differences.
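A minimal sketch of such a model, assuming two groups and invented data: the indicator variable is 0 for group A and 1 for group B, and the interaction term is the input multiplied by the indicator. The code only estimates the coefficients. Judging statistical significance also requires standard errors and p-values, which statistical software reports. All names here are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(rows, ys):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data: both groups share the same input values.
x_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
group_a_y = [3.1, 3.9, 5.0, 6.1, 6.9, 8.0]
group_b_y = [7.0, 9.1, 10.9, 13.0, 15.1, 16.9]

rows, ys = [], []
for x, y in zip(x_values, group_a_y):
    rows.append([1.0, x, 0.0, 0.0])  # indicator = 0, interaction = x * 0
    ys.append(y)
for x, y in zip(x_values, group_b_y):
    rows.append([1.0, x, 1.0, x])    # indicator = 1, interaction = x * 1
    ys.append(y)

# Columns: constant, input, group indicator, input-by-group interaction.
intercept, slope, group_shift, slope_shift = fit_least_squares(rows, ys)
print(f"group A: y = {intercept:.2f} + {slope:.2f} x")
print(f"group B shifts the intercept by {group_shift:.2f} and the slope by {slope_shift:.2f}")
```

The indicator coefficient captures a constant vertical shift between the groups, and the interaction coefficient captures a difference in slope.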
Find Outliers and Unusual Observations with Scatterplots
Scatterplots can help you find multiple types of outliers.
Some outliers have extreme values. These outliers are distanced from other data points, as shown below.
Unusual observations have values that are not necessarily extreme, but they do not fit the observed relationship. In the scatterplot below, the circled point has X and Y values that are not unusual. However, the combination of the two values clearly does not fit the overall relationship.
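One way to flag this kind of observation numerically is to fit a line and look for points with unusually large residuals. The sketch below uses made-up data in which one point has ordinary X and Y values but sits far from the line. The cutoff of 2 residual standard deviations is an arbitrary assumption for illustration, not a rule from this post.

```python
import math


def fit_line(xs, ys):
    """Simple linear regression: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data: y is roughly 2x, except for the point at x = 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 3.9, 6.0, 8.1, 18.0, 12.0, 13.9, 16.1, 18.0, 19.9]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# Residual standard deviation, using n - 2 degrees of freedom.
resid_sd = math.sqrt(sum(e * e for e in residuals) / (len(xs) - 2))

# Flag points more than 2 residual standard deviations from the line
# (the threshold is an illustrative assumption).
flagged = [
    (x, y) for x, y, e in zip(xs, ys, residuals) if abs(e) / resid_sd > 2.0
]
print(flagged)
```

Neither 5 nor 18 is extreme within its own variable, yet the pair stands out because it does not follow the relationship the other points share.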
Related post: Five Ways to Find Outliers in Your Data
Trends Over Time
Typically, analysts use time series plots to display data over time. However, you can also use scatterplots for this purpose. Scatterplots are a good choice for time-related data when your observations occur at irregular intervals. When creating a scatterplot for time data, be sure to add a connecting line between the data points!
Use Scatterplots with the Appropriate Hypothesis Tests
You can use scatterplots to display the relationships between continuous variables. However, if you plan to use your sample to infer the characteristics of an entire population, be sure to perform the necessary hypothesis tests and assess statistical significance.
Graphs can be subjective because your software lets you edit their properties, such as the graph’s scaling. Altering these settings can change the appearance of scatterplots and the conclusions you draw from them. On the other hand, hypothesis tests present an objective evaluation of statistical significance. They also account for the possibility of random error explaining the observed patterns and differences.
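As a sketch of what such a test can look like, the code below runs a permutation test on a correlation: it shuffles one variable many times and counts how often chance alone produces a correlation at least as strong as the observed one. A permutation test is an assumed choice for illustration, since this post does not prescribe a specific test, and the data, helper names, and number of permutations are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the correlation between xs and ys."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffling breaks any real pairing between x and y.
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Made-up paired measurements, for illustration only.
height_cm = [150.0, 155.0, 160.0, 165.0, 170.0, 175.0, 180.0, 185.0]
weight_kg = [52.0, 58.0, 57.0, 63.0, 68.0, 66.0, 75.0, 79.0]

p_value = permutation_p_value(height_cm, weight_kg)
print(f"r = {pearson_r(height_cm, weight_kg):.3f}, permutation p-value = {p_value:.4f}")
```

Unlike the appearance of a graph, the p-value does not depend on axis scaling or other display settings.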