The R programming language is a powerful, free statistical software tool that analysts use frequently.
R is open source software: the R community develops and maintains it, and anyone can download it for free.
Being open source provides many advantages, including the following:
- New statistical methods are quickly available because the R community is vast and active.
- The source code for each function is freely available and everybody can review it.
- Using the R programming language is free! That’s a significant advantage over relatively expensive statistical tools, such as SAS, Stata, and SPSS.
In this article, I give you a brief introduction to the strengths of the R programming language by applying basic statistical concepts to a real dataset using R functions.
If you want to follow the examples, you can copy and paste the code shown in this article into R or RStudio. All of the code is fully reproducible.
Let’s dive into it!
Example Data for the R Programming Language
The iris flower dataset ships with R, so there is nothing extra to download. Load it into your session by executing the following code:
data(iris) # Loading iris flower data as example data set
Next, we can inspect the structure of the iris flower data using the head function. By default, the head function returns only the first six rows of a data set:
head(iris) # Printing first six rows of iris data set
This table displays the first six rows of our example data and indicates that our data contain five variables: “Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width”, and “Species”.
The first four variables contain numeric values, and the fifth variable groups our data into different flower species.
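To double-check the variable types described above, here is a minimal sketch using the base R functions str, sapply, class, and levels. These functions are not part of the original walkthrough, so treat this as an optional extra step; the object names col_classes and species_levels are chosen only for illustration.

```r
data(iris)                                # Load the built-in iris data

str(iris)                                 # Compact overview: dimensions, types, first values

col_classes <- sapply(iris, class)        # Class of each column
col_classes

species_levels <- levels(iris$Species)    # Categories of the grouping variable
species_levels
```

The str output lists four numeric columns and one factor column, which matches the description of the table above.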
In the following sections, we’ll analyze these data, so keep on reading!
Using R Functions to Calculate Basic Descriptive Statistics for a Dataset
The following syntax illustrates how to calculate a set of descriptive statistics for all variables in a dataset.
For this task, we can apply the summary function as shown below:
summary(iris) # Return summary statistics
This table contains the minimum, 1st quartile, median, mean, 3rd quartile, and the maximum for the numeric columns in our data, and the count of each category for the non-numeric columns.
This information gives us an overview of the data distributions for our variables. However, we can analyze our data in much more detail!
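If we need a statistic that the summary function does not report, we can apply individual base R functions column by column. The sketch below assumes we are interested in the standard deviation and the 10th and 90th percentiles; that choice, and the object names, are purely illustrative.

```r
data(iris)                                 # Load the built-in iris data

num_vars <- iris[ , 1:4]                   # Keep only the four numeric columns

sd_by_var <- sapply(num_vars, sd)          # Standard deviation of each column
sd_by_var

pct_by_var <- sapply(num_vars, quantile,   # 10th and 90th percentile of each column
                     probs = c(0.1, 0.9))
pct_by_var
```

The sapply function applies the given function to every column and collects the results, so other functions such as var or range can be swapped in the same way.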
Calculating Descriptive Statistics by Group using R
As you have seen in the previous sections, our dataset groups the observations by three flower species: setosa, versicolor, and virginica. Therefore, it might be interesting to compare the descriptive statistics of the different flower species.
The following R code uses the aggregate and mean functions to calculate the mean by group (i.e. flower species) for the variable Sepal.Length:
aggregate(Sepal.Length ~ Species, iris, mean) # Return mean by group
The table indicates that the species setosa has the shortest average sepal length and the species virginica has the longest.
Note that we have calculated only the mean by group for the first variable. However, we can replace the variable name to calculate the mean by group for other variables. Additionally, we can calculate other summary statistics instead of the mean.
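As a sketch of both variations, the code below swaps in a different variable and a different statistic, and then uses the dot shorthand of the formula interface to summarize all numeric variables at once. The object names med_petal and mean_all are my own and only serve the illustration.

```r
data(iris)                                       # Load the built-in iris data

# Median instead of mean, for a different variable
med_petal <- aggregate(Petal.Length ~ Species, iris, median)
med_petal

# Mean of every remaining variable by species; the dot stands for "all other columns"
mean_all <- aggregate(. ~ Species, iris, mean)
mean_all
```

Any function that returns a single value per group (for example sd, min, or max) can take the place of mean or median here.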
To determine whether these mean differences are statistically significant, you can perform a one-way ANOVA.
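One possible way to run such a test in R is the aov function, sketched below for Sepal.Length. This is a minimal example that assumes the usual ANOVA conditions are acceptable for these data; the object names are illustrative.

```r
data(iris)                                               # Load the built-in iris data

anova_fit <- aov(Sepal.Length ~ Species, data = iris)    # Fit the one-way ANOVA
anova_tab <- summary(anova_fit)                          # ANOVA table with the F-test
anova_tab

p_value <- anova_tab[[1]][["Pr(>F)"]][1]                 # p-value for the Species effect
p_value
```

A small p-value suggests that at least one species mean differs from the others; it does not say which ones, so a follow-up comparison would be needed for that question.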
Creating a Correlation Matrix with R Programming
Understanding the relationships between variables provides additional useful information. To gain this information, we can create a correlation matrix (i.e., a table showing the correlation coefficients between multiple variables at the same time) by applying the cor function to the numeric variables of our data:
cor(iris[ , 1:4]) # Return correlation matrix
In the correlation matrix, you can see, for instance, that the correlation between Petal.Width and Petal.Length is very high, but the correlation between Sepal.Width and Sepal.Length is weak (and slightly negative).
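To make the matrix easier to read and to pull out individual coefficients, we might round it and index it by variable name. The sketch below also adds cor.test, which is not used elsewhere in this article, to check whether a single correlation differs significantly from zero; the object names are illustrative.

```r
data(iris)                                            # Load the built-in iris data

cor_mat <- cor(iris[ , 1:4])                          # Pearson correlations (the default method)
round(cor_mat, 2)                                     # Round for readability

r_petal <- cor_mat["Petal.Length", "Petal.Width"]     # Single coefficient, selected by name
r_sepal <- cor_mat["Sepal.Length", "Sepal.Width"]
c(petal = r_petal, sepal = r_sepal)

# Test the null hypothesis that the sepal correlation is zero
cor_test <- cor.test(iris$Sepal.Length, iris$Sepal.Width)
cor_test$p.value
```

Keep in mind that these coefficients pool all three species; correlations within a single species can look quite different.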
Related post: Understanding Correlation Coefficients
Using the R Programming Language to Estimate a Linear Regression Model
The R programming language also provides functions to estimate statistical models. One of the most commonly used model types is linear regression. Using the lm and summary functions in R, we can estimate and evaluate these models.
The following R syntax uses the variable Sepal.Length as the dependent variable and the remaining variables in the dataset as independent variables:
summary(lm(Sepal.Length ~ ., iris)) # Results of linear regression
Besides many other metrics, the output displays regression coefficients, standard errors, t-values, and p-values. As the stars on the right side of the output indicate, all independent variables are statistically significant predictors of the dependent variable.
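If we want to work with these numbers instead of only reading them from the console, we can store the model and extract its parts. The following is a minimal sketch; the object names fit, fit_sum, coef_tab, and r_squared are my own. Note that R turns the factor Species into two dummy variables, with setosa as the reference level.

```r
data(iris)                                   # Load the built-in iris data

fit <- lm(Sepal.Length ~ ., data = iris)     # Same model as above, stored in an object
fit_sum <- summary(fit)                      # Summary object holding the output shown above

coef_tab <- coef(fit_sum)                    # Matrix: estimate, std. error, t value, p-value
coef_tab

r_squared <- fit_sum$r.squared               # Share of variance explained by the model
r_squared
```

Because coef_tab is an ordinary matrix, a single column such as coef_tab[ , "Pr(>|t|)"] can be selected for further processing.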
Related post: Interpreting Regression Coefficients and their P-values
Generating Random Numbers with R Programming
So far, we have used R to analyze the iris flower dataset. However, the R programming language also provides powerful functions to generate random data.
Whenever random processes are involved, it is useful to set a random seed. A random seed is a number that initializes a pseudorandom number generator and allows other analysts to reproduce our “random” output. We can set a random seed in R using the set.seed function:
set.seed(101101) # Set a random seed
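To see what the seed does in practice, consider this small sketch: drawing twice after setting the same seed gives identical numbers. The vectors first_draw and second_draw are illustrative names and are not used later in the article.

```r
set.seed(101101)                        # Set the seed
first_draw <- rnorm(5)                  # Five draws from the standard normal distribution

set.seed(101101)                        # Set the same seed again
second_draw <- rnorm(5)                 # The generator restarts, so the draws repeat

identical(first_draw, second_draw)      # TRUE

set.seed(101101)                        # Reset the seed so the code that follows starts fresh
```

The final set.seed call matters if you are following along: without it, the next random draws would continue from a different generator state.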
Next, we can draw random numbers from a probability distribution. The rnorm function draws numbers from a normal distribution (by default with mean 0 and standard deviation 1):
x_small <- rnorm(20) # Generate small sample
The previous R code has generated 20 random numbers following a normal distribution. We can visualize our randomly generated data in a histogram using the hist function:
hist(x_small) # Draw histogram of small sample
The previous figure visualizes our random data in a histogram. As you can see, our data do not look normally distributed yet.
The reason is that we have drawn only 20 random numbers, so sampling variability dominates the shape of the histogram. By the law of large numbers, the empirical distribution of a larger sample approximates the underlying normal distribution more closely.
We can do this by simply increasing the number within the rnorm function. The following R code draws 10000 random numbers from the normal distribution:
x_large <- rnorm(10000) # Generate large sample
Let’s draw these data in a histogram:
hist(x_large, breaks = 100) # Draw histogram of large sample
After executing the code, R creates the histogram below.
As you can see, our data now follow the normal distribution very closely.
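We can back up this visual impression with numbers. The sketch below regenerates both samples from the same seed, compares their means and standard deviations with the theoretical values of 0 and 1, and overlays the standard normal density on the histogram. The object name sample_stats and the use of curve with dnorm are my additions for illustration.

```r
set.seed(101101)                             # Same seed as above
x_small <- rnorm(20)                         # Small sample
x_large <- rnorm(10000)                      # Large sample

# Sample mean and standard deviation; the theoretical values are 0 and 1
sample_stats <- rbind(small = c(mean = mean(x_small), sd = sd(x_small)),
                      large = c(mean = mean(x_large), sd = sd(x_large)))
sample_stats

hist(x_large, breaks = 100, freq = FALSE)    # Histogram on the density scale
curve(dnorm(x), add = TRUE, lwd = 2)         # Overlay the standard normal density curve
```

Setting freq = FALSE puts the histogram on the density scale, which is what makes the bars comparable to the dnorm curve.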
Summary & About the Author
In this tutorial, you have learned how to calculate basic statistics using the R programming language. In case you want to learn more about topics like this, you may check out my website, Statistics Globe, as well as the Statistics Globe YouTube channel.
My name is Joachim Schork. I’m a survey statistician and programmer, and I provide many R programming and statistics tutorials on these platforms. Thanks a lot to Jim Frost for giving me this opportunity to introduce the R programming language on his wonderful website!