The R programming language is a powerful and free statistical software tool that analysts use frequently.
R is open source software: the R community develops and maintains it, and anyone can download it for free.
Being open source provides many advantages, including the following:
- New statistical methods are quickly available because the R community is vast and active.
- The source code for each function is freely available and everybody can review it.
- Using the R programming language is free! That’s a significant advantage over relatively expensive statistical tools, such as SAS, Stata, and SPSS.
In this article, I give you a brief introduction to the strengths of the R programming language by applying basic statistical concepts to a real dataset using R functions.
If you want to follow the examples, you can copy and paste the code shown in this article into R or RStudio. All of the code is fully reproducible.
Let’s dive into it!
Example Data for the R Programming Language
In the first section of this article, we’ll load the iris dataset into R. Ronald Fisher, biologist and statistician, introduced the iris flower dataset in 1936. It contains flower measurements.
The dataset also ships with R, so you can load it without a separate download by executing the following code:
data(iris) # Loading iris flower data as example data set
Next, we can inspect the structure of the iris flower data using the head function. By default, the head function returns only the first six rows of a data set:
head(iris) # Printing first six rows of iris data set
This table displays the first six rows of our example data and indicates that our data contain five variables: “Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width”, and “Species”.
The first four variables contain numeric values, and the fifth variable groups our data into different flower species.
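One way to confirm these variable types is to ask R directly. The following is a minimal sketch using base R functions only; the object name column_classes is introduced here for illustration.

```r
data(iris)                             # Load the built-in iris data

dim(iris)                              # Number of rows and columns

column_classes <- sapply(iris, class)  # Class of each column
column_classes

levels(iris$Species)                   # Categories of the grouping variable
```

The first four entries of column_classes should read "numeric", and the last one "factor", which is how R stores a grouping variable.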
In the following sections, we’ll analyze these data, so keep on reading!
Using R Functions to Calculate Basic Descriptive Statistics for a Dataset
The following syntax illustrates how to calculate a set of descriptive statistics for all variables in a dataset.
For this task, we can apply the summary function as shown below:
summary(iris) # Return summary statistics
This table contains the minimum, 1st quartile, median, mean, 3rd quartile, and the maximum for the numeric columns in our data, and the count of each category for the non-numeric columns.
This information gives us an overview of the data distributions for our variables. However, we can analyze our data in much more detail!
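If we only need some of these statistics for a single variable, we can also compute them one at a time. The sketch below does this for Sepal.Length with base R functions; the helper name sepal_length is just an illustrative choice.

```r
data(iris)

sepal_length <- iris$Sepal.Length                   # Extract one numeric column

min(sepal_length)                                   # Minimum
quantile(sepal_length, probs = c(0.25, 0.5, 0.75))  # Quartiles and median
mean(sepal_length)                                  # Mean
max(sepal_length)                                   # Maximum

table(iris$Species)                                 # Counts per category
```

The values should match the Sepal.Length and Species columns of the summary output.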
Related posts: Measures of Central Tendency and Interpreting Percentiles and Quartiles
Calculating Descriptive Statistics by Group using R
As you have seen in the previous sections, our dataset groups the observations by three flower species: setosa, versicolor, and virginica. Therefore, it might be interesting to compare the descriptive statistics of the different flower species.
The following R code uses the aggregate and mean functions to calculate the mean by group (i.e. flower species) for the variable Sepal.Length:
aggregate(Sepal.Length ~ Species, iris, mean) # Return mean by group
The table indicates that the average sepal length of the flower species setosa is the shortest and the sepal length of the species virginica is the longest.
Note that we have calculated only the mean by group for the first variable. However, we can replace the variable name to calculate the mean by group for other variables. Additionally, we can calculate other summary statistics instead of the mean.
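As a rough illustration of both variations, the sketch below swaps in a different variable and a different statistic, and then calculates the mean of all numeric columns at once. The object names median_petal and mean_all are hypothetical.

```r
data(iris)

# Different variable and different statistic: median petal length by species
median_petal <- aggregate(Petal.Length ~ Species, data = iris, FUN = median)
median_petal

# The dot on the left-hand side stands for all remaining columns
mean_all <- aggregate(. ~ Species, data = iris, FUN = mean)
mean_all
```

Any function that returns a single number, such as sd, min, or max, can be passed as the FUN argument in the same way.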
To determine whether these mean differences are statistically significant, you can perform a one-way ANOVA.
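A minimal sketch of such a test with the base R function aov might look as follows. Note that this example does not check the ANOVA assumptions (such as roughly equal variances across groups), and the object names are chosen for illustration only.

```r
data(iris)

anova_fit <- aov(Sepal.Length ~ Species, data = iris)  # Fit one-way ANOVA
anova_table <- summary(anova_fit)                      # ANOVA table
anova_table

p_value <- anova_table[[1]][["Pr(>F)"]][1]             # p-value for Species
p_value
```

A small p-value suggests that at least one species mean differs from the others; it does not tell us which one.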
Creating a Correlation Matrix with R Programming
Understanding the relationships between variables provides additional useful information. To gain this information, we can create a correlation matrix (i.e., a table showing the correlation coefficients between multiple variables at the same time) by applying the cor function to the numeric variables of our data:
cor(iris[ , 1:4]) # Return correlation matrix
In the correlation matrix, you can see, for instance, that the correlation between Petal.Width and Petal.Length is very high, but the correlation between Sepal.Width and Sepal.Length is relatively low.
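To pick out individual coefficients, we can store the matrix and index it by variable name. The sketch below also shows the method argument of cor, which switches from the default Pearson correlation to a rank-based alternative; the object name cor_matrix is an illustrative choice.

```r
data(iris)

cor_matrix <- cor(iris[ , 1:4])             # Pearson correlation by default
round(cor_matrix, 2)                        # Easier to read with two decimals

cor_matrix["Petal.Length", "Petal.Width"]   # Strong positive correlation
cor_matrix["Sepal.Length", "Sepal.Width"]   # Weak correlation

cor(iris[ , 1:4], method = "spearman")      # Rank-based alternative
```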
Related post: Understanding Correlation Coefficients
Using the R Programming Language to Estimate a Linear Regression Model
The R programming language also provides functions to estimate statistical models. One of the most commonly used model types is linear regression. Using the lm and summary functions in R, we can estimate and evaluate these models.
The following R syntax uses the variable Sepal.Length as the dependent variable and the remaining variables in the dataset as independent variables:
summary(lm(Sepal.Length ~ ., iris)) # Results of linear regression
Besides many other metrics, the output displays regression coefficients, standard errors, t-values, and p-values. As the stars on the right side of the output indicate, all independent variables are statistically significant predictors of the dependent variable in this model.
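If we want to work with these numbers rather than just read them, we can store the model and extract its parts. The following is a sketch with hypothetical object names. Because Species is a factor with three levels, R represents it with two indicator variables, so the coefficient table has six rows including the intercept.

```r
data(iris)

model <- lm(Sepal.Length ~ ., data = iris)  # Estimate the model
model_summary <- summary(model)             # Store the summary output

coef_table <- coef(model_summary)           # Coefficient table as a matrix
coef_table

model_summary$r.squared                     # Share of variance explained
confint(model, level = 0.95)                # 95% confidence intervals
```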
Related post: Interpreting Regression Coefficients and their P-values
Generating Random Numbers with R Programming
So far, we have used R to analyze the iris flower dataset. However, the R programming language also provides powerful functions to generate random data.
Whenever random processes are involved, it is useful to set a random seed. A random seed is a number that initializes a pseudorandom number generator and allows other analysts to reproduce our “random” output. We can set a random seed in R using the set.seed function:
set.seed(101101) # Set a random seed
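To see what the seed does, consider the following small sketch. It draws five numbers twice after setting the same seed, and once more without resetting it. The object names are chosen for illustration.

```r
set.seed(101101)                    # Set the seed
first_draw <- rnorm(5)

set.seed(101101)                    # Reset to the same seed
second_draw <- rnorm(5)

identical(first_draw, second_draw)  # TRUE: same seed, same numbers

third_draw <- rnorm(5)              # Continue without resetting the seed
identical(first_draw, third_draw)   # Compare with the first draw
```

The first comparison shows why a seed makes results reproducible. The third draw continues the random sequence, so it is not expected to match the first.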
Next, we can draw random numbers from a random distribution. The rnorm function draws numbers from a normal distribution:
x_small <- rnorm(20) # Generate small sample
The previous R code has generated 20 random numbers following a normal distribution. We can visualize our randomly generated data in a histogram using the hist function:
hist(x_small) # Draw histogram of small sample
The previous figure visualizes our random data in a histogram. As you can see, our data does not look normally distributed yet.
Related posts: Using Histograms to Understand Your Data and Assessing Normality: Histograms vs. Q-Q Plots
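The reason for this is that we have drawn only 20 random numbers from the normal distribution. In such a small sample, random variation dominates the shape of the histogram. As the sample size grows, the share of values in each bin converges to its true probability (a consequence of the law of large numbers), so we need to draw a larger sample to approximate a normal distribution.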
We can do this by simply increasing the number within the rnorm function. The following R code draws 10000 random numbers from the normal distribution:
x_large <- rnorm(10000) # Generate large sample
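Before plotting, we can also compare the two samples numerically. The sketch below recreates both samples so that it runs on its own, and then prints the mean and standard deviation of each. By default, rnorm draws from a normal distribution with mean 0 and standard deviation 1.

```r
set.seed(101101)                           # Make the draws reproducible
x_small <- rnorm(20)                       # Small sample
x_large <- rnorm(10000)                    # Large sample

c(mean = mean(x_small), sd = sd(x_small))  # Estimates from the small sample
c(mean = mean(x_large), sd = sd(x_large))  # Estimates from the large sample
```

With 10000 values, the estimates should be very close to 0 and 1. With only 20 values, they can deviate noticeably.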
Let’s draw these data in a histogram:
hist(x_large, breaks = 100) # Draw histogram of large sample
After executing the code, R creates the histogram below.
As you can see, our data almost perfectly follows the normal distribution.
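One way to make this comparison explicit is to draw the theoretical density on top of the histogram. The following is a sketch that regenerates the large sample so it runs on its own. Setting freq = FALSE puts the histogram on the density scale, which is needed for the curve to be comparable.

```r
set.seed(101101)                           # Make the draws reproducible
x_large <- rnorm(10000)                    # Large sample

hist(x_large, breaks = 100, freq = FALSE)  # Histogram on density scale
curve(dnorm(x), add = TRUE, lwd = 2)       # Overlay standard normal density
```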
Summary & About the Author
In this tutorial, you have learned how to calculate basic statistics using the R programming language. In case you want to learn more about topics like this, you may check out my website, Statistics Globe, as well as the Statistics Globe YouTube channel.
My name is Joachim Schork. I’m a survey statistician and programmer, and I provide many R programming and statistics tutorials on these platforms. Thanks a lot to Jim Frost for giving me this opportunity to introduce the R programming language on his wonderful website!
Hey Jean, thanks a lot for the kind words! Glad you enjoyed reading the tutorial. Regards, Joachim
Jean Pascal Koh says
Thanks for sharing.
I use R and like reading your tutorial.
F M says
I’m trying to identify which statistical measurement can help me differentiate between two series in the following manner. The range (between the maximum and minimum values) is the same and the average is the same. But Series A values oscillated between the maximum and the minimum 50 times during sampling. Series B values went once to the maximum, then to the minimum, and then ended in the middle. I would assume the standard deviation is the same for both, but in Series A the number of times the values went up and down was much higher. What statistical measurement will allow me to compare this?
That’s kinda my point, Hemant! If one wants to learn (for whatever reason) basic to advanced statistical analysis in a relatively user-friendly way on a computer and, we hope, be an intelligent consumer of statistical analysis, one would go with the omnipresent Excel or LibreOffice Calc. If one wants to be a “serious” Data Scientist, then one would have to get to R and Python to have cultural ‘cred.
Debashis Sengupta says
Thank you, Joachim!
No doubt Excel is a very powerful tool that is developing day by day and is very user friendly at the same time. Anyone who lacks programming knowledge can use it.
R, on the other hand, is much, much more powerful than Excel due to the packages available for every type of analysis, but it lacks user friendliness. To use R, you must have programming knowledge and know which tool or package is available for a particular problem.
Thanks a lot for sharing your experience with Excel, Jefferson! I think this is a very good example of why R (or Python) should be used when doing more complex data analyses. Regards, Joachim
The UK just lost 600 million track test results, trying to use Excel. Never do anything important with Excel.
I concur with what Jim and Joachim have said about Excel.
Indeed, you are somewhat right. Not everyone needs R and/or Python. This is to the extent of the analysis you are doing and whether you are looking for a job like me in the data space. I recently used Excel for a small project where I applied a Machine Learning technique.
For my thesis project, Excel closed several times when I was analyzing my data. It was then that I learned that it has limitations. I think Excel breaks down if you are analyzing a dataset with “250 megabytes (MB) total file size”.
In my thesis research, I had a dataset with variables at 15-minute intervals from 2012 to 2016 (5 years). Unfortunately, I was not introduced to R or Python, and I did not know about them until the early days of the pandemic when I was job hunting. Then I realized how critical R and Python are these days.
Hey Da, thanks a lot for your feedback! Indeed, Excel is very powerful, and I definitely didn’t want to give the impression that Excel is a bad tool. However, as Jim said, this tutorial is meant to give an introduction to R, which can be used as a basis for more sophisticated methods. If you are mainly interested in basic statistics, Excel might definitely be a good fit for you.
This might be somewhat heretical, but if you right click the link (above) “the iris dataset” and open it in a new tab (and assuming you have Excel on your machine), you can download or open the dataset. Excel will automatically convert the csv to an .xls file, and you can apply Excel tools to the data. At least this works with current Excel. I think Excel 2016 and earlier require some added labor. I make this observation because some people find Excel meets their basic needs and hunger for data. Obviously, if one needs to learn R and/or Python, csv files will suffice.
Jim Frost says
I don’t think that’s heretical at all. It really depends on each person’s analytical needs. Excel definitely supplies a range of methods to explore your data. If you search for Excel in my site’s search box (near the top of the right margin), you’ll see that I have written a number of posts about using Excel to analyze your data. Excel definitely has a role to play!
On the other hand, R provides a massive amount of statistical analyses going well beyond Excel and it can support most every niche. When Joachim writes that one of R’s strengths is the R community’s ability to add new analyses, he’s not kidding. I used to work at a statistical software company and there was no way they could add new features as quickly as the community adds them to R! Joachim shows several of the introductory R functions, but that’s really the tip of the iceberg. A starting point for more complex analyses.
You’re absolutely correct though. Not everyone needs to use R. Excel can help many people. But R is a powerful tool for a different group of analysts. I know you weren’t suggesting otherwise, but it’s great that there are a variety of analytical tools to meet various needs!
Thanks Hemant! I agree, the R programming language is getting more and more popular 🙂
Thanks a lot Frank, glad to hear that you have liked the tutorial!
Thank you for the great feedback, glad you like the tutorial! I’m not sure whether Jim has planned to publish a similar tutorial on Python, but you may also have a look at the Python section on my website (https://statisticsglobe.com/python-programming-language). Even though R is my main language, I have published several Python tutorials recently.
Good initiative starting R programming tutorials. It is the need of the hour.
Frank CHao says
Your tutorials are so helpful! Thank you!
Debashis Sengupta says
Very effective and concise tutorial–it brought back memories from my earlier attempts to use R regularly. That said, I have started using Python on Jupyter notebook. Missing values in a real-life sales dataset are a big problem for me even before I start basic analysis. Is it possible for you to host a similar tutorial using Python/Jupyter next time?
Again, thanks so much J n’ J (Joachim and Jim).
Thanks a lot for this amazing collaboration Jim, it was a great pleasure to work with you!
Jim Frost says
Hi Joachim, it was indeed a great pleasure working with you. Thanks for creating such an informative tutorial. I’m sure many readers will find it helpful!