Introduction to Statistics Using the R Programming Language

The R programming language is a powerful and free statistical software tool that analysts use frequently.

R is open source software: the R community develops and maintains it, and anyone can download it for free.

Being open source provides many advantages, including the following:

New statistical methods are quickly available because the R community is vast and active.
The source code for each function is freely available and everybody can review it.
Using the R programming language is free! That’s a significant advantage over relatively expensive statistical tools, such as SAS, Stata, and SPSS.

In this article, I give you a brief introduction to the strengths of the R programming language by applying basic statistical concepts to a real dataset using R functions.

If you want to follow the examples, you can copy and paste the code shown in this article into R or RStudio. All code is 100% reproducible.

Let’s dive into it!

Example Data for the R Programming Language

In the first section of this article, we’ll load the iris dataset into R. Ronald Fisher, biologist and statistician, introduced the iris flower dataset in 1936. It contains sepal and petal measurements for 150 flowers from three iris species.

The iris dataset also ships with R, so you don’t need to download a file to follow along. Load it into R by executing the following code:

data(iris)     # Loading iris flower data as example data set

Next, we can take a first look at the iris flower data using the head function. By default, the head function returns only the first six rows of a data set:

head(iris)     # Printing first six rows of iris data set

This table displays the first six rows of our example data and indicates that our data contain five variables: “Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width”, and “Species”.

The first four variables contain numeric values, and the fifth variable groups our data into different flower species.
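The head function shows only a few rows. As a small complement, the following sketch uses the base R functions str, dim, and levels to confirm the column types, the size of the data, and the species labels:

```r
data(iris)               # Load the built-in iris data

str(iris)                # Compact overview: type and first values of each column
dim(iris)                # Number of rows and columns
levels(iris$Species)     # Names of the flower species (factor levels)
```

The str output should list four numeric columns and one factor column with three levels.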

In the following sections, we’ll analyze these data, so keep on reading!

Using R Functions to Calculate Basic Descriptive Statistics for a Dataset

The following syntax illustrates how to calculate a set of descriptive statistics for all variables in a dataset.

For this task, we can apply the summary function as shown below:

summary(iris)     # Return summary statistics

This table contains the minimum, 1st quartile, median, mean, 3rd quartile, and the maximum for the numeric columns in our data, and the count of each category for the non-numeric columns.

This information gives us an overview of the data distributions for our variables. However, we can analyze our data in much more detail!
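The summary function does not report a measure of spread such as the standard deviation. One possible way to add it is to apply the sd function to each numeric column with sapply. The object name iris_sd is only illustrative:

```r
data(iris)                           # Load the built-in iris data

iris_sd <- sapply(iris[, 1:4], sd)   # Standard deviation of each numeric column
round(iris_sd, 2)                    # Print rounded to two decimals
```

Replacing sd with another function, such as var or IQR, returns the corresponding statistic for each column.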

Calculating Descriptive Statistics by Group Using R

As you have seen in the previous sections, our dataset groups the observations by three flower species: setosa, versicolor, and virginica. Therefore, it might be interesting to compare the descriptive statistics of the different flower species.

The following R code uses the aggregate and mean functions to calculate the mean by group (i.e. flower species) for the variable Sepal.Length:

aggregate(Sepal.Length ~ Species, iris, mean)     # Return mean by group

The table indicates that the average sepal length of the flower species setosa is the shortest and the sepal length of the species virginica is the longest.

Note that we have calculated only the mean by group for the first variable. However, we can replace the variable name to calculate the mean by group for other variables. Additionally, we can calculate other summary statistics instead of the mean.
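As a sketch of these variations, the following code swaps in a different variable and a different statistic, and then uses the dot shorthand in the formula to summarize all numeric variables at once. The object name group_means is only illustrative:

```r
data(iris)     # Load the built-in iris data

# Median instead of mean, for a different variable
aggregate(Petal.Length ~ Species, data = iris, FUN = median)

# Mean by group for all remaining variables at once
group_means <- aggregate(. ~ Species, data = iris, FUN = mean)
group_means
```

The dot on the left side of the formula stands for every variable that is not used for grouping.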

To determine whether these mean differences are statistically significant, you can perform a one-way ANOVA.
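A one-way ANOVA can be run in base R with the aov function. The following is a minimal sketch rather than a full analysis, since it does not check the ANOVA assumptions. The object name anova_fit is only illustrative:

```r
data(iris)     # Load the built-in iris data

anova_fit <- aov(Sepal.Length ~ Species, data = iris)   # Fit one-way ANOVA
summary(anova_fit)                                      # ANOVA table with F-test

TukeyHSD(anova_fit)                                     # Pairwise comparisons between species
```

A small p-value in the ANOVA table suggests that at least one species mean differs from the others. The TukeyHSD output then indicates which pairs of species differ.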

Creating a Correlation Matrix with R Programming

Understanding the relationships between variables provides additional useful information. To gain this information, we can create a correlation matrix (i.e., a table showing the correlation coefficients between multiple variables at the same time) by applying the cor function to the numeric variables of our data:

cor(iris[ , 1:4])     # Return correlation matrix

In the correlation matrix, you can see, for instance, that the correlation between Petal.Width and Petal.Length is very high (about 0.96), whereas the correlation between Sepal.Width and Sepal.Length is weak and slightly negative (about -0.12).
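To make the matrix easier to read, and to check whether a single correlation differs significantly from zero, one option is to combine round with the base R function cor.test. This is a sketch, and the object name cor_mat is only illustrative:

```r
data(iris)     # Load the built-in iris data

cor_mat <- cor(iris[ , 1:4])     # Pearson correlation matrix of the numeric columns
round(cor_mat, 2)                # Print rounded to two decimals

# Test of a single correlation, including a confidence interval
cor.test(iris$Petal.Length, iris$Petal.Width)
```

By default, both cor and cor.test use the Pearson correlation. The method argument allows switching to "spearman" or "kendall".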

Related post: Understanding Correlation Coefficients

Using the R Programming Language to Estimate a Linear Regression Model

The R programming language also provides functions to estimate statistical models. One of the most commonly used model types is linear regression. Using the lm and summary functions in R, we can estimate and evaluate these models.

The following R syntax uses the variable Sepal.Length as the dependent variable and the remaining variables in the dataset as independent variables:

summary(lm(Sepal.Length ~ ., iris))     # Results of linear regression

Besides many other metrics, the output displays regression coefficients, standard errors, t-values, and p-values. As the stars on the right side of the output indicate, all independent variables are statistically significant predictors of the dependent variable at the 5% level.
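If you want to work with the results instead of only printing them, you can store the model in an object first. The following sketch extracts the coefficient table, the R-squared, and confidence intervals. The object names fit and coef_table are only illustrative:

```r
data(iris)     # Load the built-in iris data

fit <- lm(Sepal.Length ~ ., data = iris)     # Estimate the linear regression model

coef_table <- coef(summary(fit))             # Estimates, standard errors, t-values, p-values
coef_table

summary(fit)$r.squared                       # Share of variance explained by the model
confint(fit)                                 # 95% confidence intervals for the coefficients
```

Note that R converts the factor Species into two dummy variables, so the coefficient table contains six rows, including the intercept.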

Generating Random Numbers with R Programming

So far, we have used R to analyze the iris flower dataset. However, the R programming language also provides powerful functions to generate random data.

Whenever random processes are involved, it is useful to set a random seed. A random seed is a number that initializes a pseudorandom number generator and allows other analysts to reproduce our “random” output. We can set a random seed in R using the set.seed function:

set.seed(101101)     # Set a random seed
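As a quick check of this reproducibility, the following sketch sets the same seed twice and compares the two draws. The object names first_draw and second_draw are only illustrative:

```r
set.seed(101101)              # Set the seed
first_draw <- rnorm(5)        # Draw five random numbers

set.seed(101101)              # Reset to the same seed
second_draw <- rnorm(5)       # Draw again

identical(first_draw, second_draw)     # TRUE: both draws are exactly the same

set.seed(101101)              # Reset the seed once more before continuing with the tutorial
```

The last line restores the seed, so that the code in the rest of this section starts from the same state as in the original tutorial.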

Next, we can draw random numbers from a probability distribution. The rnorm function draws numbers from a normal distribution (by default, the standard normal distribution with mean 0 and standard deviation 1):

x_small <- rnorm(20)     # Generate small sample

The previous R code has generated 20 random numbers following a normal distribution. We can visualize our randomly generated data in a histogram using the hist function:

hist(x_small)     # Draw histogram of small sample

The previous figure visualizes our random data in a histogram. As you can see, our data do not look normally distributed yet.

The reason is that we have drawn only 20 random numbers, so sampling variability dominates the shape of the histogram. As the sample size grows, the distribution of the sample converges to the underlying normal distribution (a consequence of the law of large numbers).

We can do this by simply increasing the number within the rnorm function. The following R code draws 10000 random numbers from the normal distribution:

x_large <- rnorm(10000)     # Generate large sample

Let’s draw these data in a histogram:

hist(x_large, breaks = 100)     # Draw histogram of large sample

After executing the code, R draws the histogram of the large sample.

As you can see, our data now follow the normal distribution almost perfectly.
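One way to make this comparison explicit is to draw the histogram on the density scale and overlay the theoretical standard normal curve. The following sketch is self-contained, so it sets the seed and generates the sample again. The color and line width are arbitrary choices:

```r
set.seed(101101)                                # Set a random seed
x_large <- rnorm(10000)                         # Generate large sample

hist(x_large, breaks = 100, freq = FALSE)       # Histogram on the density scale
curve(dnorm(x), add = TRUE, col = "red", lwd = 2)     # Overlay standard normal density

mean(x_large)                                   # Should be close to 0
sd(x_large)                                     # Should be close to 1
```

The argument freq = FALSE is needed so that the bar heights are densities instead of counts, which puts the histogram and the curve on the same scale.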

Summary & About the Author

In this tutorial, you have learned how to calculate basic statistics using the R programming language. In case you want to learn more about topics like this, you may check out my website, Statistics Globe, as well as the Statistics Globe YouTube channel.

My name is Joachim Schork. I’m a survey statistician and programmer, and I provide many R programming and statistics tutorials on these platforms. Thanks a lot to Jim Frost for giving me this opportunity to introduce the R programming language on his wonderful website!

Comments

Ria says

January 4, 2024 at 7:42 am

Hi Jim, thank you so much for this tutorial. I am new to R and am trying to learn for my PhD data analysis. I keep getting an error message as I try to load the dataset;
sepal_length,sepal_width,petal_length,petal_width,species
Error: unexpected ‘,’ in “sepal_length,”

Nobody else seems to have mentioned or had this- can you kindly advise?
thank you

Joachim says

December 2, 2021 at 3:51 am

Hey Jean, thanks a lot for the kind words! Glad you enjoyed reading the tutorial. Regards, Joachim

Jean Pascal Koh says

December 1, 2021 at 11:33 pm

Thanks for sharing.
I use R, and like to read your tutorial

F M says

August 14, 2021 at 1:54 am

Hi Jim,
I’m trying to identify which statistical measurement can help me differentiate between two series in the following manner. The range (between maximum and minimum value) is the same and the average is the same. But Series A values have oscillated between the maximum and the minimum 50 times during sampling. On Series B, values went once to the maximum, then went to the minimum, and then ended in the middle. I would assume the standard deviation is the same for both, but on Series A the number of times values went up and down was much higher. What statistical measurement will allow me to compare this?
Thanks,

da says

June 25, 2021 at 1:36 pm

That’s kinda my point, Hemant! If one wants to learn (for whatever reason) basic to advanced statistical analysis in a relatively user-friendly way on a computer and, we hope, be an intelligent consumer of statistical analysis one would go with omnipresent Excel or Libre Calc. If one wants to be a “serious” Data Scientist, then one would have to get to R and Python to have cultural ‘cred.

Debashis Sengupta says

June 25, 2021 at 9:39 am

Thank you, Joachim!

Hemant says

June 25, 2021 at 4:22 am

No doubt Excel is a very powerful tool that is developing day by day, and it is very user-friendly at the same time. Anyone who lacks programming knowledge can use it.
R, on the other hand, is much more powerful than Excel due to the packages available for every type of analysis, but it lacks user-friendliness. To use R, you must have programming knowledge and know which tool or package is available for a particular problem.

Joachim says

June 25, 2021 at 4:18 am

Thanks a lot for sharing your experience with Excel, Jefferson! I think this is a very good example of why R (or Python) should be used when doing more complex data analyses. Regards, Joachim

Anurag says

June 25, 2021 at 4:09 am

The UK just lost 600 million track test results, trying to use Excel. Never do anything important with Excel.

Jefferson says

June 25, 2021 at 3:34 am

Hello DA,

I concur with what Jim and Joachim have said about Excel.

Indeed, you are somewhat right. Not everyone needs R and/or Python. This is to the extent of the analysis you are doing and whether you are looking for a job like me in the data space. I recently used Excel for a small project where I applied a Machine Learning technique.

For my thesis project, Excel closed several times when I was analyzing my data. It was then that I learned that it has limitations. I think Excel breaks down if you are analyzing a dataset with “250 megabytes (MB) total file size”.

In my thesis research, I had a dataset with variables at 15 min interval from 2012 to 2016 (5 years). Unfortunately, I was not introduced to R or Python and I did not know about them until the early days of pandemic when I was job hunting then I realized how critical R and Python are these days.

Joachim says

June 24, 2021 at 1:37 pm

Hey Da, thanks a lot for your feedback! Indeed, Excel is very powerful, and I definitely didn’t want to give the impression that Excel is a bad tool. However, as Jim said, this tutorial is meant to give an introduction to R, which can be used as a basis for more sophisticated methods. If you are mainly interested in basic statistics, Excel might definitely be a good fit for you.

da says

June 24, 2021 at 12:46 pm

This might be somewhat heretical but if you right click the link (above) “the iris dataset” and open in a new tab (and assuming you have Excel on your machine) you can download or open the dataset, and Excel will automatically convert the csv to an .xls file so that you can apply Excel tools to the data. At least this works with current Excel. I think Excel 2016 and earlier requires some added labor. I make this observation because some people find Excel meets their basic needs and hunger for data. Obviously if one needs to learn R and/or Python, csv files will suffice.

- Jim Frost says
  
  June 24, 2021 at 1:24 pm
  
  Hi Da,
  
  I don’t think that’s heretical at all. It really depends on each person’s analytical needs. Excel definitely supplies a range of methods to explore your data. If you search for Excel in my site’s search box (near the top of the right margin), you’ll see that I have written a number of posts about using Excel to analyze your data. Excel definitely has a role to play!
  
  On the other hand, R provides a massive amount of statistical analyses going well beyond Excel and it can support most every niche. When Joachim writes that one of R’s strengths is the R community’s ability to add new analyses, he’s not kidding. I used to work at a statistical software company and there was no way they could add new features as quickly as the community adds them to R! Joachim shows several of the introductory R functions, but that’s really the tip of the iceberg. A starting point for more complex analyses.
  
  You’re absolutely correct though. Not everyone needs to use R. Excel can help many people. But R is a powerful tool for a different group of analysts. I know you weren’t suggesting otherwise, but it’s great that there are a variety of analytical tools to meet various needs!
  
Joachim says

June 24, 2021 at 11:30 am

Thanks Hemant! I agree, the R programming language is getting more and more popular 🙂

Joachim says

June 24, 2021 at 11:28 am

Thanks a lot Frank, glad to hear that you have liked the tutorial!

Joachim says

June 24, 2021 at 11:19 am

Hey Debashis,

Thank you for the great feedback, glad you like the tutorial! I’m not sure whether Jim has planned to publish a similar tutorial on Python, but you may also have a look at the Python section on my website (https://statisticsglobe.com/python-programming-language). Even though R is my main language, I have published several Python tutorials recently.

Regards

Joachim

Hemant says

June 24, 2021 at 9:40 am

Good Initiative by Starting R Programming Tutorials. It is need of the hour.

Frank CHao says

June 24, 2021 at 8:53 am

Your tutorials are so helpful! Thank you!

Debashis Sengupta says

June 24, 2021 at 8:08 am

Very effective and concise tutorial–it brought back memories from my earlier attempts to use R regularly. That said, I have started using Python on Jupyter notebook. Missing values in a real-life sales dataset are a big problem for me even before I start basic analysis. Is it possible for you to host a similar tutorial using Python/Jupyter next time?

Again, thanks so much J n’ J (Joachim and Jim).

Joachim says

June 24, 2021 at 3:06 am

Thanks a lot for this amazing collaboration Jim, it was a great pleasure to work with you!

- Jim Frost says
  
  June 24, 2021 at 12:22 pm
  
  Hi Joachim, it was indeed a great pleasure working with you. Thanks for creating such an informative tutorial. I’m sure many readers will find it helpful!
  