In the field of statistics, data are vital. Data are the information that you collect to learn, draw conclusions, and test hypotheses. After all, statistics is the science of learning from data. However, there are different types of variables, and they record various kinds of information. Crucially, the type of information determines what you can learn from it, and, importantly, what you cannot learn from it. Consequently, it’s essential that you understand the different types of data.
The term “data” carries strong preconceived notions with it. It almost becomes something that is separate from reality. Throughout this post, I want you to think about data as information in a study area that you are gathering to answer a question. For example:
- Do flu shots prevent the flu?
- Does exercise improve your health?
- Does a gasoline additive improve gas mileage?
When you assess any of these questions, there’s a wide array of characteristics that you can record. For example, in a study that uses human subjects, you can log numerical measurements such as height and weight. However, you can also designate properties such as gender, marital status, and health concerns. For some characteristics, you can record them in multiple ways. For instance, you can measure a subject’s body fat percentage, or you can indicate whether they are medically obese or not.
In this blog post, you’ll learn about the different types of variables, what you can learn from them, and how to graph the values using intuitive examples. I also include links to more in-depth posts where I show you how to pick the correct statistical analyses based on the types of variables that you have.
Related post: What is a Variable?
Quantitative versus Qualitative Data
The distinction between quantitative and qualitative data is the most fundamental way to divide types of data. Is the characteristic something you can objectively measure with numbers or not?
Quantitative: The information is recorded as numbers and represents an objective measurement or a count. Temperature, weight, and a count of transactions are all quantitative data. Analysts also refer to this type as numerical data.
Qualitative: The information represents characteristics that you do not measure with numbers. Instead, the observations fall within a countable number of groups. In fact, this type of variable can capture information that isn’t easily measured and can be subjective. Taste, eye color, architectural style, and marital status are all types of qualitative variables.
Within these two broad divisions, there are various subtypes.
Related posts: Qualitative vs. Quantitative Data and Levels of Measurement: Nominal, Ordinal, Interval, and Ratio Scales
Types of Quantitative Data: Continuous and Discrete
When you can represent the information you’re gathering with numbers, you are collecting quantitative data. This class encompasses two categories. To learn more, read Discrete vs. Continuous.
Continuous data
Continuous variables can take on any numeric value, and they can be meaningfully divided into smaller increments, including fractional and decimal values. There are an infinite number of possible values between any two values. Typically, you measure continuous variables on a scale. For example, when you measure height, weight, and temperature, you have continuous data.
With continuous variables, you can assess measures of central tendency and variability, such as the mean, median, distribution, range, and standard deviation. For example, the mean height in the U.S. is 5 feet 9 inches for men and 5 feet 4 inches for women.
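If you'd like to try these summaries yourself, here's a minimal sketch in Python. The heights are made-up values for illustration, not data from any study, and I'm assuming you want the sample (not population) standard deviation.

```python
import statistics

# Hypothetical heights in inches; made-up values for illustration only.
heights = [64.2, 66.5, 67.1, 68.0, 69.3, 70.4, 72.8]

# Measures of central tendency.
mean_height = statistics.mean(heights)
median_height = statistics.median(heights)

# Measures of variability.
value_range = max(heights) - min(heights)
std_dev = statistics.stdev(heights)  # sample standard deviation (n - 1)

print(f"mean={mean_height:.2f}, median={median_height:.2f}")
print(f"range={value_range:.2f}, sd={std_dev:.2f}")
```

Because the values are continuous, every one of these summaries is meaningful, including results that fall between the recorded measurements.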
Related posts: Measure of Central Tendency and Measures of Variability
How to graph continuous data
Histograms are a standard way to graph continuous variables because they show the distribution of the values. The histogram below helps you determine whether the distribution of body fat percentage values for adolescent girls is symmetric or skewed; understand the range of values; and identify where the most common values fall.
Dot plots provide the same types of information as histograms. For more information, read my Guide to Dot Plots.
Related post: Using Histograms to Understand Your Data
When you have two continuous variables, you can graph them using a scatterplot. The scatterplot shows how the body fat percentage tends to rise as BMI increases. Use correlation to assess the strength of this relationship or regression analysis to derive the equation for the line that provides the best fit for these data. For more information, read my Guide to Scatterplots.
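To make the correlation and regression ideas concrete, here's a rough sketch that calculates Pearson's correlation and a least-squares line by hand. The BMI and body fat pairs are made up for illustration and are not the data in the scatterplot. I'm also assuming a simple straight-line fit is reasonable.

```python
import math

# Hypothetical BMI and body fat percentage pairs (made-up values).
bmi = [18.5, 20.1, 22.4, 24.0, 26.3, 28.7]
body_fat = [19.0, 22.5, 24.1, 27.8, 30.2, 34.0]

n = len(bmi)
mean_x = sum(bmi) / n
mean_y = sum(body_fat) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in bmi)
syy = sum((y - mean_y) ** 2 for y in body_fat)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bmi, body_fat))

# Pearson's correlation coefficient: strength of the linear relationship.
r = sxy / math.sqrt(sxx * syy)

# Least-squares line: body_fat = intercept + slope * bmi.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"r={r:.3f}, slope={slope:.3f}, intercept={intercept:.3f}")
```

The correlation summarizes how tightly the points follow a line, while the slope and intercept give you the equation of that line. In practice, statistical software handles these calculations and also provides p-values and diagnostics.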
When you have continuous variables that are divided into groups, you can use a boxplot to display the central tendency and spread of each group. Fertilizer Type C is associated with the highest plant growth while Type B produces the greatest variability.
Please notice how with continuous variables you can assess the wide variety of properties that I illustrate above. You’ll see a contrast when we get to qualitative variables.
Related posts: Box Plot Explained with Examples and Time Series Plots
Discrete data
Discrete quantitative data are a count of the presence of a characteristic, result, item, or activity. These measures cannot be meaningfully divided into smaller increments. For example, a single household can have 1 or 2 cars, but it cannot have 1.6. The possible values that you can record for an observation are countable, with no intermediate values between them.
With discrete variables, you can calculate and assess a rate of occurrence or a summary of the count, such as the mean, sum, and standard deviation. For example, U.S. households had an average of 2.11 vehicles in 2014.
Bar charts are a standard way to graph discrete variables. Each bar represents a distinct value, and the height represents its proportion in the entire sample.
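Here's a small sketch of how you might tabulate discrete data before graphing it. The car counts are made up for illustration. The code counts each distinct value, converts the counts to proportions, and calculates the mean count.

```python
from collections import Counter

# Hypothetical number of cars per household (made-up values).
cars = [1, 2, 2, 0, 1, 3, 2, 1, 2, 1]

counts = Counter(cars)
n = len(cars)

# Proportion of the sample at each distinct value (the bar heights).
proportions = {value: counts[value] / n for value in sorted(counts)}

# A summary of the count: the mean number of cars per household.
mean_cars = sum(cars) / n

# A crude text version of a bar chart.
for value, p in proportions.items():
    print(f"{value} cars: {'#' * counts[value]} ({p:.0%})")
print(f"mean cars per household: {mean_cars}")
```

Notice that the mean here is 1.5 cars even though no single household can have 1.5 cars. The individual observations are discrete, but summaries of them don't have to be.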
See how I used a line plot to graph the count of coronavirus cases by country.
Related posts: Guide to Bar Charts and Guide to Line Charts
Qualitative Data: Categorical, Binary, and Ordinal
When you record information that categorizes your observations, you are collecting qualitative data. There are three types of qualitative variables—categorical, binary, and ordinal. With these data types, you’re often interested in the proportions of each category. Consequently, bar charts and pie charts are conventional methods for graphing qualitative variables because they are useful for displaying the relative percentage of each group out of the entire sample.
As I mentioned in the section about continuous variables, notice how we learn much less from qualitative data. I highlight this aspect in the section about binary variables. In cases where you have a choice about recording a characteristic as a continuous or qualitative variable, the best practice is to record the continuous data because you can learn so much more.
Categorical data
Categorical data have values that you can put into a countable number of distinct groups based on a characteristic. For a categorical variable, you can assign categories, but the categories have no natural order. Analysts also refer to categorical data as both attribute and nominal variables. For example, college major is a categorical variable that can have values such as psychology, political science, engineering, biology, etc. Learn more about Nominal Data: Definition & Examples.
The categorical data in the pie chart are the results of a PPG Industries study of new car colors in 2012.
Related post: Guide to Pie Charts
Binary data
Binary data can have only two values. If you can place an observation into only two categories, you have a binary variable. Statisticians also refer to binary data as both dichotomous and indicator variables. For example, pass/fail, male/female, and the presence/absence of a characteristic are all binary data.
Binary variables are helpful for calculating proportions or percentages, such as the proportion of defective products in a sample. You just take the number of faulty products and divide by the sample size.
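As a quick sketch, here's that calculation in Python. The inspection results are made up for illustration, and I'm assuming you code defective products as True.

```python
# Hypothetical inspection results: True means the product is defective.
defective = [False, False, True, False, False,
             False, True, False, False, False]

# True counts as 1 and False as 0, so sum() gives the number of defectives.
n_defective = sum(defective)
sample_size = len(defective)

proportion_defective = n_defective / sample_size
print(f"{n_defective} of {sample_size} defective = {proportion_defective:.0%}")
```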
The binary yes/no data for the pie chart are based on the continuous body fat percentage data in the histogram above. Compare how much we learn from the continuous data that the histogram displays as a distribution with the simple proportion that the binary version of the data provides in the pie chart below.
Related post: Maximizing the Value of Your Binary Data
Ordinal data
Ordinal data have at least three categories, and the categories have a natural order. Examples of ordinal variables include overall status (poor to excellent), agreement (strongly disagree to strongly agree), and rank (such as sporting teams).
Analysts often consider ordinal variables to have a combination of qualitative and quantitative properties. They frequently represent ordinal variables using numbers, such as a 5-point Likert scale that measures satisfaction. In number form, you can calculate average scores as with quantitative variables. However, the numbers have limited usefulness because the differences between ranks might not be constant. Learn more about Ordinal Data: Definition, Examples & Analysis.
For example, first, second, and third in a race are ordinal data. The difference in time between first and second place might not be the same as the difference between second and third place.
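Here's a tiny sketch of the race example. The finishing times are made up for illustration. The ranks are evenly spaced by construction, but the underlying times are not.

```python
# Hypothetical finishing times in seconds (made-up values).
finish_times = {"first": 58.2, "second": 58.5, "third": 61.9}
ranks = {"first": 1, "second": 2, "third": 3}

# Gaps in the actual measurements.
gap_1_2 = finish_times["second"] - finish_times["first"]
gap_2_3 = finish_times["third"] - finish_times["second"]

# Gaps in the ordinal ranks are always one step.
rank_gap_1_2 = ranks["second"] - ranks["first"]
rank_gap_2_3 = ranks["third"] - ranks["second"]

print(f"time gaps: {gap_1_2:.1f}s vs {gap_2_3:.1f}s")
print(f"rank gaps: {rank_gap_1_2} vs {rank_gap_2_3}")
```

The ranks tell you the order of the finishers, but they hide the fact that third place was much further behind second than second was behind first.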
The bar chart below displays the proportion of each service rating category in their natural order.
How to Choose Statistical Analyses Based on Data Types
So, you understand the different types of data, what you can learn from them, and how to graph them—how else can you use this knowledge? In statistics, the type of variable greatly determines which kinds of analyses you can perform. Read the following posts to learn how to choose a statistical analysis based on the types of variables that you have.
Choosing Hypothesis Tests for Continuous, Binary, and Count Data: Hypothesis tests use sample data to evaluate claims about an entire population. The correct test depends on your variables.
Chi-squared test of independence when you have two or more categorical variables: This hypothesis test determines whether there is a statistically significant relationship between categorical variables.
Choosing the Correct Type of Regression Analysis Based on Data Type: Regression analysis describes the relationship between a set of independent variables and a dependent variable. The choice depends on the type of data you have for the dependent variable.
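As a rough illustration of what the chi-squared test of independence works with, here's a sketch that calculates the test statistic for a made-up 2x2 table of two categorical variables. The counts are hypothetical. The code only calculates the statistic and degrees of freedom; getting the p-value requires the chi-squared distribution, which statistical software provides.

```python
# Hypothetical 2x2 table of observed counts (made-up values).
# Rows are the groups of one categorical variable; columns the other.
observed = [
    [20, 30],
    [30, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(row_totals) - 1) * (len(col_totals) - 1)
print(f"chi-squared={chi2:.2f}, df={df}")
```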
First I say my heartfelt thanks to you for doing this wonderful job of teaching statistics in easy way. Yet I never come across an another blog or web page like this for statistics concepts.
You have explained the data type in nice and understandable manner but I need some clarification in it.
I have the data of number of male flowers and number of infected male flowers (both are countable data). I hope it come under in discrete data types. My objective is to see the difference between these two variables. What statistical tools should i use?
Hi,
Thanks so much for your kind words. I’m thrilled that my website has been helpful!
There are a couple of ways you can treat your data. They are countable, and you can treat them as count data. Count data often follow the Poisson distribution, and you might use procedures with Poisson in the name, such as a 1- or 2-sample Poisson rate test or Poisson regression.
However, my sense is that you’re probably better off treating it as binary data. The male flowers are either infected or uninfected. Using this approach, you’d have one variable, Infection Status, and that variable could only have the two possible values of infected/uninfected.
You could then calculate the proportion of infected male flowers and a CI using a test like a 1-sample proportion test. Or, you could use something like binary logistic regression which models how various factors affect the probability of a flower becoming infected. That would assume you measured possible factors.
I think treating it as a single binary variable, and then finding proportions or using binary logistic regression, is the better option. Read the section in this post about binary data for more information!
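To give you a feel for the proportion approach, here's a minimal sketch. The counts are made up, and `proportion_ci` is a hypothetical helper name. I'm assuming the normal-approximation (Wald) interval, which is only one of several methods. Statistical software often uses exact or other intervals that behave better with small samples or proportions near 0 or 1.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Return (proportion, lower, upper) using the normal approximation.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    p = successes / n
    standard_error = math.sqrt(p * (1 - p) / n)
    return p, p - z * standard_error, p + z * standard_error

# Hypothetical counts: 30 infected flowers out of 200 male flowers.
p_hat, lower, upper = proportion_ci(30, 200)
print(f"proportion infected={p_hat:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```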
I hope that helps!
I am confused about which mathematical/arithmetic operations are possible with discrete data. Is only Addition and Subtraction possible? Or are multiplication and division also possible? Surely it can’t be equated with Continuous Data. But, if calculation of mean is possible with discrete data (Ur book Intro to Stat, pg 29-30), can you compute the arithmetic mean only or the geometric mean as well?
Hi Sanjay,
Basically, the same rules that apply to continuous data also apply to discrete data when you’re talking about integers (i.e., not categorical). When you have a natural zero value, you have ratio scale data and can perform multiplication and division with discrete integers. However, if there’s no natural zero, then you have interval scale data and only addition and subtraction are possible.
For more information, read my post about nominal, ordinal, interval, and ratio scales. Pay particular attention to the interval and ratio scales and apply those ideas to discrete integers even though I talk about them in the context of continuous data.
To see an example of how, say, division works with integers, imagine you have 30 kids and three rooms. You can’t have 29.5 kids or 2.5 rooms. These are truly discrete data values. Now suppose you want to evenly divide the 30 kids among the 3 rooms. That gives you 10 kids per room. The math works out. Division is ok. So is multiplication. Number of kids and rooms are both essentially ratio scale variables, but they happen to be integers.
I hope that helps!
Thanks for this post. My question is about plotting discrete data that models a normal distribution. For example, if we measured foot length of 16 year old boys, we would probably get a normal distribution and could plot this continuous data as a histogram. If we took the shoe size of the same boys and plotted it (shoe size being discrete data) we would get pretty much the same distribution as foot length. My question is (if my previous assumptions are correct) can we plot this as a histogram as shoe size is essentially representing a data bin? Or number of leaves on a sapling at two months – why do we have to separate the discrete points (number of leaves) when they could be any integer within a defined (I want to say continuous) range?
Is shoe size categorical or ordinal or something else?
Hai
can you please explain to me any types of discreet and continuous pattern of data distribution and their influence on data analysis
Thank you for your reply. So that discrete data (and the discrete variable with it) can take on negative and real values. Defining discrete data as the number of presences … is not enough right. I’m still a bit confused.
Hi, it’s discrete because it can only take on specific values in which there are no intermediate values. And it’s binary because there are only two discrete values.
Think of integers. They are whole numbers that can be both positive and negative. However, there are no values in between the integers. Hence, integers are discrete values even though they can have positive and negative values. Focus on the meaning of the word “discrete.” In this context, the relevant portion of the definition is, “individually separate and distinct.”
Your data can only have two values. There are no possible values in between in this context. Hence, discrete and binary.
Hi Jim,
First, thanks for your wonderful blog posts. They help me a lot.
However, I got a bit confused about your definition of discrete data. Discrete data is count data –> integer and non-negative values. For example, A = “The profit on a 2.5$ bet on black in roulette. Possible values: -2.5 and 2.5” —> which is type of this data
Hi,
Going strictly by the wording, only -2.5 and 2.5, it’s a discrete variable. More specifically, it’s a binary variable because it can only take two values in that setting.
Hi Jim,
I hope you’re doing well.
I want to make ordinal data for post hoc analysis about percentage increase in blood pressure change before and after treatment, but the categories are overlapping as below:
1. max% decrease to 0% (0% means before and after have same value)
2. 0% to max% increase
3. 5% increase to max% increase
4. 10% increase to max% increase
5. 15% increase to max% increase
Is it possible to group them into one ordinal variable?
If not, and I made them into nominal variable on each category, what is the suitable post hoc analysis for them?
Any advice would be greatly appreciated!
Thank you.
Best regards,
Hilman
Hi Hilman,
The ordinal categories must be mutually exclusive. If an observation falls within one ordinal category, it cannot fall within another. It’s a matter of correctly setting up those categories. Make sure they don’t overlap!
The ordinal categories belong to one variable. For example, you could have blood pressure change be your ordinal variable with the following four categories:
No increase
Small increase
Medium increase
Large increase
Each category would be associated with a range of percentage increases that make sense for the subject area. I’m not familiar with that area, so I don’t know what would constitute reasonable ranges.
But, again, you’d have the one ordinal variable with multiple, mutually exclusive categories. I showed an example with four categories, but it can be a different number.
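Here's a sketch of how you could set up non-overlapping categories in code. The function name `bp_change_category` and the cutoffs of 0, 5, and 10 percent are hypothetical placeholders that I chose only for illustration. You'd need to pick ranges that make sense for your subject area.

```python
def bp_change_category(pct_change):
    """Assign a percentage change to exactly one ordinal category.

    The cutoffs (0, 5, 10) are hypothetical placeholders, not
    recommended values for any subject area.
    """
    if pct_change <= 0:
        return "No increase"
    elif pct_change < 5:
        return "Small increase"
    elif pct_change < 10:
        return "Medium increase"
    else:
        return "Large increase"

# Hypothetical percentage changes in blood pressure (made-up values).
changes = [-3.0, 0.0, 2.5, 5.0, 9.9, 12.0]
for change in changes:
    print(f"{change:+.1f}% -> {bp_change_category(change)}")
```

Because each branch picks up where the previous one ends, every observation lands in one and only one category.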
I don’t know what analysis would be appropriate because I don’t know what you want to learn and whether you have any other data. Also, you should question whether converting your raw data into ordinal categories is the appropriate approach. Your raw data probably provides more detailed information than ordinal groups that merge multiple observations together, effectively throwing away some of the details.
Here’s a link to an article I wrote that talks about the different types of hypothesis tests for different data types.
Dear Jim,
I am a psychiatrist and like to read the relevant portions of statitics from your list What all topics are important for my profession
Thank you
Dr P K Sukumaran
Thanks for your reply, Jim. The example I was referring to was on page 56 of that document – I apologize for not making that clearer initially.
The only rationale I can think of is it’s easier to see differences in averages using the slope of a line than trying to decipher differences in bar height. But from your response, it sounds like using a line is personal preference rather than any recognized rule. I’ll stick with my science training and use bar graphs for categorical data!
Hi Arlene,
Oh, ok, I see the example now. When you mentioned nominal and ordinal data, I was thinking of a single nominal or ordinal variable. In that case, a bar chart with no lines is appropriate. However, the example displays means for continuous data that are split into groups by a nominal (categorical) variable. In that scenario, it’s common practice to connect the means or medians with lines to highlight the differences between them.
Your article takes a large amount of information and distills it into something understandable – thanks for sharing.
Coming from a science background and then learning stats, it bothers me that nominal or ordinal data graphs often have a line connecting data points. For example, breeds of dogs and time to run 30m (https://static.nsta.org/pdfs/samples/PB343Xweb.pdf) . Is there a reason why statisticians often connect data points with a line when a simple bar graph would, in my opinion, be more accurate? I’d appreciate any input you have.
Hi Arlene,
I agree with you that a bar chart is great for nominal and ordinal data, and that there should not be connecting lines. My guess is that most trained statisticians won’t do that, but I can imagine many others doing that. There’s no good reason.
I was interested in the document that you link to so I could see the example, but oddly it skips from the foreword to page 51. So, I can’t explain it!
Please I need help in basic statistics
Hi Hannah,
I highly recommend that you read my Introduction to Statistics book. It’ll help you understand statistics without all the jargon and confusion. I think it’s exactly what you need! It’s available as an ebook or in print. Click the link to read about what it covers.
i become your fan sir, this is awesome its really cleared all my doubt.
Hi, Jim
As a newbie, I found it easier to learn statistic from your excellent writing.
I’m doing reasearch using data of all banks in one country (population, not sample) from year 2000-2015.
Loan ratio is the dependent variable.
I might have issue with this variable due to different measurement. For year 2000-2005, loan ratio includes the lending for productive and consumptive activities, while since 2006 it only covers productive activities. Therefore, the figure of loan ratio drops significantly since 2006.
To handle this issue, can I add dummy variable in regression model that takes value 0 for year<= 2005, and value 1 for years after 2005?
Or should I just use data from 2006, which means fewer observation?
Fyi, there is no different measurement for all independent variables.
I look forward for your help.
Many thanks
WEN
Thanks a lot Jim. Your notes are so simple to understand statistics. In this lockdown period and after long hours of searching online I finally found your articles which are just wow.
Sincerely wish to thank you .
I really appreciate your efforts.
Topics were addressed in a brilliant manner. Thanks a lot.
Wow! Statistics made simple. I have never understood this much until now. Thank you for this write-up. Much appreciated!
Hi Funmi, you’re very welcome. I strive to present statistics in a simple manner. Consequently, comments like yours absolutely make my day! Thanks for writing!
after searching lot of blogs post i found this blogs to get best conceptual start. on every post everyone was teaching math only here i can understand concept behind that method
Jim, I really enjoy your blog, as, especially the parts about regression have been extremely valuable for me. That being said, I have to dissent here mainly about pie charts.
I would generally advise against pie charts for various reasons
– they are hogging screen real estate. Face it: a circle is the most uneconomic way of display something on a rectangular screen
– they give you a hard time distinguishing actual proportions, especially when the pie section do not differ that much
– they tend to get messy with legends, descriptions and whatever
– they are plain hell for men (mostly men) with colorsight impairment
I agree that they have some limited use but I wouldn’t use them with more than three data points.
Bar charts/column charts will generally give you a much better view on the data, proportions and all that. Look at the “New cars color” pie chart: a bar chart would allow for a much more intuitive view on the data.
You’ll find this all over the place in the internet, e.g. here: http://www.businessinsider.com/pie-charts-are-the-worst-2013-6?IR=T
I also would advise for distinguishing between bar charts (horizontal) and column charts (vertical). For categorical and ordinal data I always would use the former, as they give you more freedom (and more real estate) for descriptions on the category axis by retaining the general advantages of column charts.
And last but not least: time series data are IMO best depicted on a line chart.
Hi Werner,
Thanks for your thoughtful comment. Choosing the best graph to present information clearly can sometimes be as much art as science. The analyst’s preferences will also play a role in that choice.
Personally, I think pie charts are fine in certain cases. In particular, they are the best chart for conveying at a glance the fact that you’re looking at proportions of a whole. On a bar chart, you have to look at the axis carefully to understand this facet. Also, pie charts don’t necessarily have to take up more room than a bar chart. Although, I agree that when you have too many categories, the legend and labels can be too cluttered. Bar charts are better in those cases. I have to admit, I didn’t think of the color blindness issue. You have to weigh all of these factors!
There are defenders of pie charts as well.
I definitely agree about time series charts. At some point I’ll add it to this post!
Thanks again for the insightful comment!
Jim
Dear jim,
Your written information about types of data are very beneficial and valuable. I proud of you that you are helping us by posting such types of lectures. Welldone sir.
Sir as i commented on one of your earler post about binary data analysis. I, once again request you to please share more detailed information on catagorical regression in you own words or
Sir if you have no time then please suggest me a relevant book name and its author name please. I want to learn more detail about categorical data analysis. Thanks
Regards
Sami ullah,
Ph.D student of economics,
Pakistan.
Hi Sami, this is definitely on my list of topics to write about, but you’ll need a little patience! It’ll probably be a couple of months before I can get to it. If you need information earlier, most regression textbooks should talk about logistic regression analysis.
Hi Jim,
I really appreciate this guide! But I have a question. Elsewhere I’ve seen data types differentiated by the acronym NOIR: Nominal, Ordinal, Interval, and Ratio. In these situations “Qualitative” is replaced by “Categorical” (making two major groups Quantitative and Categorical instead of Quantitative and Qualitative), followed by two subgroups in each: Nominal and Ordinal as subgroups of Categorical; and Interval and Ratio as subgroups of Quantitative.
These differences can drive confusion on how to properly identify data types. It would be helpful to know how to appropriately combine all these terms into one cohesive Data type model. Could you offer any clarity on this?
Kind regards,
Chuck Wynn
Hi Chuck, I’m glad you found this guide to be useful. As you’ve noticed, there are different ways to classify data types. I’ve tried to include several alternative names for some of the data types. I did think of possibly including categorical as an AKA under qualitative. However, I already have a categorical group and I thought that would be confusing to have that twice! I did list “nominal” along with “attribute” as AKAs for categorical data.
The Nominal, Ordinal, Interval, and Ratio classification system was created by a psychologist and I wonder if this system is used more frequently in the field of psychology?
The difference between interval and ratio is that ratio has an absolute zero point while interval does not. While that is crucial for calculating ratios, it’s often not crucial when you’re graphing and statistically analyzing data. But, it can be an important point in terms of other types of interpretation. For instance 20 degrees Celsius is not twice 10 degrees Celsius.
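Here's a quick sketch of the Celsius example. Kelvin has an absolute zero, so converting to Kelvin shows the true ratio between the two temperatures. The function name `celsius_to_kelvin` is just my own helper for illustration.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius (interval scale) to Kelvin (ratio scale)."""
    return celsius + 273.15

# The naive ratio on the Celsius scale suggests "twice as hot."
ratio_celsius = 20 / 10

# The ratio on a scale with an absolute zero tells a different story.
ratio_kelvin = celsius_to_kelvin(20) / celsius_to_kelvin(10)

print(f"Celsius ratio: {ratio_celsius:.2f}")
print(f"Kelvin ratio:  {ratio_kelvin:.3f}")
```

On the Kelvin scale, 20 degrees Celsius is only about 3.5% warmer than 10 degrees Celsius, not 100% warmer.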
I’ve tried to combine the two systems below. Parentheses indicate the NOIR classification terminology.
Quantitative:
-Continuous data (Ratio and Interval)
-Discrete data (Ratio but not Interval. Counts do have an absolute zero.)
Qualitative (Categorical):
-Categorical (Nominal)
-Binary (Nominal)
-Ordinal
I hope this helps and thanks for the interesting question!
Jim
It is really informative in statistics.
Thank you Yeshambel!
Thnks a lot
I am honestly saying that I get a lot of concepts. ……by u
Keep on
God bless u. …
Thank you Khursheed! I’m very happy that you found it to be helpful!