Use scatterplots to show relationships between pairs of continuous variables. These graphs display symbols at the X, Y coordinates of the data points for the paired variables. Scatterplots are also known as scattergrams and scatter charts.
The pattern of dots on a scatterplot allows you to determine whether a relationship or correlation exists between two continuous variables. If a relationship exists, the scatterplot indicates its direction and whether it is a linear or curved relationship.
Fitted line plots are a special type of scatterplot that displays the data points along with a fitted line for a simple regression model. This graph allows you to evaluate how well the model fits the data.
Use scatterplots to perform the following tasks:
- Examine the relationship between two variables.
- Check for outliers and unusual observations.
- Create a time series plot with irregular time-dependent data.
- Evaluate the fit of a regression model.
At a minimum, scatterplots require two continuous variables. To learn about other graphs, read my Guide to Data Types and How to Graph Them.
Example Scatterplot
During an experiment, I measured the Body Mass Index (BMI) and body fat percentage of adolescent girls. I graphed these two variables in a scatterplot to assess the relationship between them.
Scatterplots typically contain the following elements:
- X-axis representing values of a continuous variable. By custom, this is the independent variable when you can classify one of the variables as such.
- Y-axis representing values of a continuous variable. Traditionally, this is the dependent variable.
- Symbols plotted at the (X, Y) coordinates of your data. Optionally, the graph can use different colored/shaped symbols to represent separate groups on the same chart.
- Optionally, you can overlay fit lines to determine how well a model fits the data.
For the BMI and the body fat data, the scatterplot displays a moderately strong, positive relationship. As BMI increases, the body fat percentage also tends to increase. The relationship appears to curve slightly because it flattens out for higher BMI values. To model the curvature, the regression model includes a squared term. The fitted line follows the curvature of the data, indicating a good fit.
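To make the idea of a squared term concrete, here is a minimal sketch in plain Python. It fits both a straight line and a quadratic curve by least squares and compares their R-squared values. The data are made-up numbers that rise and then flatten; they are not the BMI measurements, and the helper names (solve, fit_polynomial, r_squared) are my own for illustration.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree):
    """Return least-squares coefficients [b0, b1, ...] from the normal equations."""
    k = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)] for i in range(k)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(k)]
    return solve(a, b)


def r_squared(xs, ys, coefs):
    """Proportion of the variation in ys explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    preds = [sum(c * x ** i for i, c in enumerate(coefs)) for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: Y rises with X but flattens at higher X values.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [9.1, 12.0, 15.3, 17.5, 20.2, 21.7, 23.5, 24.0, 24.9, 24.7]

linear = fit_polynomial(xs, ys, 1)     # y = b0 + b1*x
quadratic = fit_polynomial(xs, ys, 2)  # y = b0 + b1*x + b2*x^2

r2_linear = r_squared(xs, ys, linear)
r2_quadratic = r_squared(xs, ys, quadratic)
print("linear R-squared:", round(r2_linear, 4))
print("quadratic R-squared:", round(r2_quadratic, 4))
print("squared-term coefficient:", round(quadratic[2], 4))
```

With data like these, the squared-term coefficient comes out negative, which is what a curve that flattens at higher X values looks like in the model.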
Learn more about the X and Y Axis.
Interpreting Scatterplots and Assessing Relationships between Variables
Scatterplots display the direction, strength, and linearity of the relationship between two variables.
Positive and Negative Correlation and Relationships
Values tending to rise together indicate a positive correlation. For instance, height and weight have a positive correlation.
However, if one variable increases as the other decreases, it’s a negative correlation, as shown below.
Strength of Relationships
Stronger relationships produce a tighter clustering of data points. Be aware that changes in scaling can change the apparent strength of the relationship. Correlation coefficients provide an objective assessment of strength independent of graph scaling.
In the two graphs below, the data points in the top graph cluster more tightly than the data points in the bottom graph. Consequently, the first dataset displays a stronger relationship.
Stronger relationships produce correlation coefficients closer to -1 or +1 and regression models that have higher R-squared values.
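As a rough illustration, the sketch below computes Pearson's correlation coefficient for two made-up datasets that follow the same line, one tightly clustered and one widely scattered. It also rescales one dataset to show that the coefficient does not depend on the units or the graph's scaling. The data and the pearson_r helper are assumptions for this example only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: the same underlying line (y = 2x) with small or large scatter.
xs = list(range(10))
wiggle = [1, -1] * 5  # alternating offsets stand in for random scatter
tight = [2 * x + 0.5 * w for x, w in zip(xs, wiggle)]
loose = [2 * x + 5 * w for x, w in zip(xs, wiggle)]

r_tight = pearson_r(xs, tight)
r_loose = pearson_r(xs, loose)

# Changing the units of Y (here, multiplying by 100) leaves r unchanged.
r_rescaled = pearson_r(xs, [100 * y for y in loose])

print("tight cluster r:", round(r_tight, 4))
print("loose cluster r:", round(r_loose, 4))
print("loose cluster r after rescaling Y:", round(r_rescaled, 4))
```

For a simple regression with one predictor, R-squared equals the square of this coefficient, so the tighter dataset also has the higher R-squared.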
Related post: Interpreting Correlation Coefficients
Linear and Curved Relationships
Determine whether your data have a linear or curved relationship. When a relationship between two variables is curved, it affects the type of correlation you can use to assess its strength and how you can model it using regression analysis.
Adding a fit line highlights how well the model fits your data. When a relationship exists, you might want to model it using regression analysis.
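To see why curvature matters for the choice of correlation, consider the sketch below. It compares Pearson's correlation, which measures linear association, with Spearman's rank correlation, which is one common option when a curved relationship is monotonic (always rising or always falling). Spearman's correlation is my choice of example here, the data are made up, and the simple ranks helper assumes there are no tied values.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """Rank values from 1 (smallest) to n (largest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson's r computed on the ranks."""
    return pearson_r(ranks(xs), ranks(ys))


# Hypothetical curved but always-increasing relationship: y = x cubed.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [x ** 3 for x in xs]

r = pearson_r(xs, ys)
rho = spearman_rho(xs, ys)
print("Pearson r:", round(r, 4))
print("Spearman rho:", round(rho, 4))
```

Y always increases with X in these data, so the rank correlation is perfect, while Pearson's coefficient falls short of 1 because the points do not follow a straight line.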
Related post: Modeling Curvature Using Regression
Determine Whether the Relationship Changes between Groups
When your data have groups, you can determine whether the relationship between two variables differs between the groups. To make these comparisons, you’ll need a categorical variable that defines the groups. All groups must use the same X and Y measurements.
In this scatterplot, the slope of the relationship is the same for the two groups, but the output values of group B are consistently higher for any given input value.
In this scatterplot, the slope for group B is steeper than for group A. As the input value increases, the output for group B increases more quickly than the output for group A.
Use indicator variables and interaction terms in a regression model to test the statistical significance of these differences. Click the link below for details.
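Here is a small sketch of what the indicator variable and interaction term represent. In a model of the form Y = b0 + b1*X + b2*Group + b3*X*Group, where Group is 0 for group A and 1 for group B, the least-squares estimates match what you get by fitting a separate line to each group: b2 is the difference between the intercepts and b3 is the difference between the slopes. The code uses that shortcut on made-up data. It produces only the coefficient estimates, not the standard errors or p-values you need for the hypothesis test.

```python
def fit_line(xs, ys):
    """Simple linear regression. Returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: both groups are measured at the same X values.
x_vals = [1, 2, 3, 4, 5]
group_a = [3.1, 4.9, 7.2, 8.8, 11.0]    # roughly Y = 1 + 2X
group_b = [6.2, 8.8, 12.1, 15.0, 17.9]  # roughly Y = 3 + 3X

intercept_a, slope_a = fit_line(x_vals, group_a)
intercept_b, slope_b = fit_line(x_vals, group_b)

# Coefficients of the combined model with Group = 0 for A and 1 for B.
b0 = intercept_a                         # constant (group A intercept)
b1 = slope_a                             # slope for group A
indicator_coef = intercept_b - intercept_a  # b2: shift in the constant for group B
interaction_coef = slope_b - slope_a        # b3: change in the slope for group B

print("indicator coefficient:", round(indicator_coef, 4))
print("interaction coefficient:", round(interaction_coef, 4))
```

An indicator coefficient near zero suggests the lines share a constant, and an interaction coefficient near zero suggests they share a slope. Whether either difference is statistically significant is a question for the regression output.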
Related post: Comparing Regression Lines with Hypothesis Tests
Find Outliers and Unusual Observations with Scatterplots
Scatterplots can help you find multiple types of outliers.
Some outliers have extreme values. These outliers are distanced from other data points, as shown below.
Unusual observations have values that are not necessarily extreme, but they do not fit the observed relationship. In the scatterplot below, the circled point has X and Y values that are not unusual. However, the combination of the two values clearly does not fit the overall relationship.
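One way to back up the visual check is to fit a line and look for points with unusually large residuals. The sketch below is an illustration of that idea rather than a formal outlier test. The data are made up, and the cutoff of two residual standard deviations is an arbitrary assumption.

```python
from math import sqrt


def fit_line(xs, ys):
    """Simple linear regression. Returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical data: Y is close to X, except for the last point.
# The last point's X (3) and Y (8.0) are each within the range of the
# other values, but the combination does not fit the relationship.
points = [
    (1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
    (6, 6.0), (7, 6.9), (8, 8.2), (9, 8.8), (10, 10.1),
    (3, 8.0),
]
xs = [p[0] for p in points]
ys = [p[1] for p in points]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in points]

# Residual standard deviation with n - 2 degrees of freedom.
resid_sd = sqrt(sum(r ** 2 for r in residuals) / (len(points) - 2))

THRESHOLD = 2.0  # assumed cutoff, in residual standard deviations
flagged = [p for p, r in zip(points, residuals) if abs(r) > THRESHOLD * resid_sd]
print("flagged points:", flagged)
```

Keep in mind that the unusual point itself pulls the fitted line toward it, so this simple check can miss observations in small datasets. Treat it as a prompt to look at the graph, not as a replacement for it.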
Related post: Five Ways to Find Outliers in Your Data
Trends Over Time
Typically, analysts use time series plots to display data over time. However, you can also use scatterplots for this purpose. Scatterplots are a perfect choice for time-related data when your observations occur at irregular intervals. When creating a scatterplot for time data, be sure to add a connect line between the data points!
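As a sketch of the preparation step, the code below takes made-up observations recorded on irregular dates, sorts them chronologically, and converts each date to the number of days since the first observation. Using elapsed time as the X value places each point at its true position in time, and sorting first ensures the connecting line runs through the points in order.

```python
from datetime import date

# Hypothetical measurements taken at irregular intervals, in no particular order.
observations = [
    (date(2023, 3, 20), 14.2),
    (date(2023, 1, 5), 10.0),
    (date(2023, 1, 9), 10.8),
    (date(2023, 2, 14), 12.5),
]

# Sort by date so a connecting line follows the points chronologically.
observations.sort(key=lambda obs: obs[0])

start = observations[0][0]
elapsed_days = [(d - start).days for d, _ in observations]  # X values
values = [v for _, v in observations]                       # Y values

# The gaps between consecutive observations show how irregular the spacing is.
gaps = [b - a for a, b in zip(elapsed_days, elapsed_days[1:])]

print("days since first observation:", elapsed_days)
print("values:", values)
print("gaps between observations (days):", gaps)
```

You can then pass the elapsed days and values to your plotting software as the X and Y variables and ask it to connect the points.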
Use Scatterplots with the Appropriate Hypothesis Tests
You can use scatterplots to display the relationships between continuous variables. However, if you plan to use your sample to infer the characteristics of an entire population, be sure to perform the necessary hypothesis tests and assess statistical significance.
Related post: Descriptive versus Inferential Statistics
Graphs can be subjective because your software lets you edit their properties, such as the graph’s scaling. Altering these settings can change the appearance of scatterplots and the conclusions you draw from them. On the other hand, hypothesis tests present an objective evaluation of statistical significance. They also account for the possibility of random error explaining the observed patterns and differences.
Correlation and regression analysis are the primary methods for statistically assessing relationships between continuous variables.
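For example, a standard test of whether a Pearson correlation differs from zero converts the coefficient r into a t-value using t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. The sketch below applies it to made-up data with 10 observations. The critical value of 2.306 is the usual two-sided 5% cutoff from a t-table for 8 degrees of freedom, hard-coded here because the Python standard library has no t-distribution. Statistical software will report an exact p-value instead.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: a positive relationship with a lot of scatter.
xs = list(range(10))
ys = [2 * x + 5 * w for x, w in zip(xs, [1, -1] * 5)]

n = len(xs)
r = pearson_r(xs, ys)

# t-value for testing the null hypothesis that the correlation is zero.
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)

T_CRIT_DF8 = 2.306  # two-sided 5% critical value for 8 degrees of freedom (t-table)
significant = abs(t_stat) > T_CRIT_DF8

print("r:", round(r, 4))
print("t:", round(t_stat, 4))
print("significant at the 5% level:", significant)
```

The result does not change if you rescale the graph or change the units of either variable, which is the sense in which the test is more objective than a visual impression.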
K says
Awesome, that makes sense. Thanks a lot!
K says
Hi Jim,
Thanks for your simplified explanations of stats. Quick question- do we need error bars for scatter plots (Like we need them for bar graphs)?
My scatter plot has a lot of points (more than 36)
Thank you!
Jim Frost says
Hi,
Generally, you don’t need error bars for scatterplots. You use them for bar charts because the bars often represent a summary statistic, such as a mean or standard deviation. There is a margin of error or variability around that summary statistic, which the error bars capture.
Scatterplots focus on the individual values rather than a calculated statistic (e.g., a mean). Additionally, scatterplots show relationships between variables rather than presenting a summary statistic. So, there’s usually no need for error bars on a scatterplot.
Sean says
Hello Mr. Frost,
I am currently writing a report for my university course and it is about the correlation between one mineral and 75 others. I used scatterplots to visualize potential correlations but with about 28 plots the points are mainly on the x and y axis. I haven’t found a good explanation as to why and what it means online and hoped you might be able to answer my questions.
Thank you very much and have a great day
Jim Frost says
Hi Sean,
I’m not entirely sure what you mean by “mainly on the x and y axis.” Please describe in more detail what you’re seeing. That will help me answer your question. Thanks!
Shari Rossino says
My scatterplot results look like a perfect tic tac toe board. This does not seem like an appropriate response. Any thoughts Jim? I appreciate your feedback.
Jim Frost says
Hi Shari,
Do you mean the data points make a grid pattern? The appropriateness of any pattern (or lack thereof) depends on the nature of the variables. That pattern might make complete sense for a pair of variables. I can’t tell without that context. But understand the variables and see if the pattern makes sense.
Michelle Weston says
Can the data in a scatterplot be considered right/left skewed?
Jim Frost says
Hi Michelle,
When you’re looking at pairs of values as you’re doing in a scatterplot, terms like skew of distribution don’t make sense. Scatterplots highlight relationships between pairs of variables. The skew of a distribution relates to the distribution of a single variable, and you should use a histogram for that.
However, you can assess the distribution of values for individual variables in the context of a scatterplot by using a marginal plot. This type of plot simply graphs the distribution of each of the variables in a scatterplot separately in the margins, as shown in the example below.
In this graph, you can see that the distribution of the variable on the X axis (horizontal) is right skewed while the distribution for the variable on the Y axis (vertical) is fairly symmetrical. However, you only get that type of information for the individual variables in the separate histograms and not the scatterplot itself. The scatterplot indicates that there is a negative correlation between the two.