## What is Factor Analysis?

Factor analysis uses the correlation structure amongst observed variables to model a smaller number of unobserved, latent variables known as factors. Researchers use this statistical method when subject-area knowledge suggests that latent factors cause observable variables to covary. Use factor analysis to identify the hidden variables.

Analysts often refer to the observed variables as indicators because they literally indicate information about the factor. Factor analysis treats these indicators as linear combinations of the factors in the analysis plus an error. The procedure assesses how much of the variance each factor explains within the indicators. The idea is that the latent factors create commonalities in some of the observed variables.

For example, socioeconomic status (SES) is a factor you can’t measure directly. However, you can assess occupation, income, and education levels. These variables all relate to socioeconomic status. People with a particular socioeconomic status tend to have similar values for the observable variables. If the factor (SES) has a strong relationship with these indicators, then it accounts for a large portion of the variance in the indicators.

The illustration below illustrates how the four hidden factors in blue drive the measurable values in the yellow indicator tags.

Researchers frequently use factor analysis in psychology, sociology, marketing, and machine learning.

Let’s dig deeper into the goals of factor analysis, critical methodology choices, and an example. This guide provides practical advice for performing factor analysis.

## Analysis Goals

Factor analysis simplifies a complex dataset by taking a larger number of observed variables and reducing them to a smaller set of unobserved factors. Anytime you simplify something, you’re trading off exactness with ease of understanding. Ideally, you obtain a result where the simplification helps you better understand the underlying reality of the subject area. However, this process involves several methodological and interpretative judgment calls. Indeed, while the analysis identifies factors, it’s up to the researchers to name them! Consequently, analysts debate factor analysis results more often than other statistical analyses.

While all factor analysis aims to find latent factors, researchers use it for two primary goals. They either want to explore and discover the structure within a dataset or confirm the validity of existing hypotheses and measurement instruments.

### Exploratory Factor Analysis (EFA)

Researchers use exploratory factor analysis (EFA) when they do not already have a good understanding of the factors present in a dataset. In this scenario, they use factor analysis to find the factors within a dataset containing many variables. Use this approach before forming hypotheses about the patterns in your dataset. In exploratory factor analysis, researchers are likely to use statistical output and graphs to help determine the number of factors to extract.

Exploratory factor analysis is most effective when multiple variables are related to each factor. During EFA, the researchers must decide how to conduct the analysis (e.g., number of factors, extraction method, and rotation) because there are no hypotheses or assessment instruments to guide them. Use the methodology that makes sense for your research.

For example, researchers can use EFA to create a scale, a set of questions measuring one factor. Exploratory factor analysis can find the survey items that load on certain constructs.

### Confirmatory Factor Analysis (CFA)

Confirmatory factor analysis (CFA) is a more rigid process than EFA. Using this method, the researchers seek to confirm existing hypotheses developed by themselves or others. This process aims to confirm previous ideas, research, and measurement and assessment instruments. Consequently, the nature of what they want to verify will impose constraints on the analysis.

Before the factor analysis, the researchers must state their methodology including extraction method, number of factors, and type of rotation. They base these decisions on the nature of what they’re confirming. Afterwards, the researchers will determine whether the model’s goodness-of-fit and pattern of factor loadings match those predicted by the theory or assessment instruments.

In this vein, confirmatory factor analysis can help assess construct validity. The underlying constructs are the latent factors, while the items in the assessment instrument are the indicators. Similarly, it can also evaluate the validity of measurement systems. Does the tool measure the construct it claims to measure?

For example, researchers might want to confirm factors underlying the items in a personality inventory. Matching the inventory and its theories will impose methodological choices on the researchers, such as the number of factors.

We’ll get to an example factor analysis in short order, but first, let’s cover some key concepts and methodology choices you’ll need to know for the example.

Learn more about Validity and Construct Validity.

## Factors

In this context, factors are broader concepts or constructs that researchers can’t measure directly. These deeper factors drive other observable variables. Consequently, researchers infer the properties of unobserved factors by measuring variables that correlate with the factor. In this manner, factor analysis lets researchers identify factors they can’t evaluate directly.

Psychologists frequently use factor analysis because many of their factors are inherently unobservable because they exist inside the human brain.

For example, depression is a condition inside the mind that researchers can’t directly observe. However, they can ask questions and make observations about different behaviors and attitudes. Depression is an invisible driver that affects many outcomes we can measure. Consequently, people with depression will tend to have more similar responses to those outcomes than those who are not depressed.

For similar reasons, factor analysis in psychology often identifies and evaluates other mental characteristics, such as intelligence, perseverance, and self-esteem. The researchers can see how a set of measurements load on these factors and others.

## Method of Factor Extraction

The first methodology choice for factor analysis is the mathematical approach for extracting the factors from your dataset. The most common choices are maximum likelihood (ML), principal axis factoring (PAF), and principal components analysis (PCA).

You should use either ML or PAF most of the time.

Use ML when your data follow a normal distribution. In addition to extracting factor loadings, it also can perform hypothesis tests, construct confidence intervals, and calculate goodness-of-fit statistics.

Use PAF when your data violates multivariate normality. PAF doesn’t assume that your data follow any distribution, so you could use it when they are normally distributed. However, this method can’t provide all the statistical measures as ML.

PCA is the default method for factor analysis in some statistical software packages, but it isn’t a factor extraction method. It is a data reduction technique to find components. There are technical differences, but in a nutshell, factor analysis aims to reveal latent factors while PCA is only for data reduction. While calculating the components, PCA doesn’t assess the underlying commonalities that unobserved factors cause.

PCA gained popularity because it was a faster algorithm during a time of slower, more expensive computers. If you’re using PCA for factor analysis, do some research to be sure it’s the correct method for your study. Learn more about PCA in, Principal Component Analysis Guide and Example.

There are other methods of factor extraction, but the factor analysis literature has not strongly shown that any of them are better than maximum likelihood or principal axis factoring.

## Number of Factors to Extract

You need to specify the number of factors to extract from your data except when using principal component components. The method for determining that number depends on whether you’re performing exploratory or confirmatory factor analysis.

### Exploratory Factor Analysis

In EFA, researchers must specify the number of factors to retain. The maximum number of factors you can extract equals the number of variables in your dataset. However, you typically want to reduce the number of factors as much as possible while maximizing the total amount of variance the factors explain.

That’s the notion of a parsimonious model in statistics. When adding factors, there are diminishing returns. At some point, you’ll find that an additional factor doesn’t substantially increase the explained variance. That’s when adding factors needlessly complicates the model. Go with the simplest model that explains most of the variance.

Fortunately, a simple statistical tool known as a scree plot helps you manage this tradeoff.

Use your statistical software to produce a scree plot. Then look for the bend in the data where the curve flattens. The number of points *before* the bend is often the correct number of factors to extract.

The scree plot below relates to the factor analysis example later in this post. The graph displays the Eigenvalues by the number of factors. Eigenvalues relate to the amount of explained variance.

The scree plot shows the bend in the curve occurring at factor 6. Consequently, we need to extract five factors. Those five explain most of the variance. Additional factors do not explain much more.

Some analysts and software use Eigenvalues > 1 to retain a factor. However, simulation studies have found that this tends to extract too many factors and that the scree plot method is better. (Costello & Osborne, 2005).

Of course, as you explore your data and evaluate the results, you can use theory and subject-area knowledge to adjust the number of factors. The factors and their interpretations must fit the context of your study.

### Confirmatory Factor Analysis

In CFA, researchers specify the number of factors to retain using existing theory or measurement instruments before performing the analysis. For example, if a measurement instrument purports to assess three constructs, then the factor analysis should extract three factors and see if the results match theory.

## Factor Loadings

In factor analysis, the loadings describe the relationships between the factors and the observed variables. By evaluating the factor loadings, you can understand the strength of the relationship between each variable and the factor. Additionally, you can identify the observed variables corresponding to a specific factor.

Interpret loadings like correlation coefficients. Values range from -1 to +1. The sign indicates the direction of the relations (positive or negative), while the absolute value indicates the strength. Stronger relationships have factor loadings closer to -1 and +1. Weaker relationships are close to zero.

Stronger relationships in the factor analysis context indicate that the factors explain much of the variance in the observed variables.

**Related post**: Correlation Coefficients

## Factor Rotations

In factor analysis, the initial set of loadings is only one of an infinite number of possible solutions that describe the data equally. Unfortunately, the initial answer is frequently difficult to interpret because each factor can contain middling loadings for many indicators. That makes it hard to label them. You want to say that particular variables correlate strongly with a factor while most others do not correlate at all. A sharp contrast between high and low loadings makes that easier.

Rotating the factors addresses this problem by maximizing and minimizing the entire set of factor loadings. The goal is to produce a limited number of high loadings and many low loadings for each factor.

This combination lets you identify the relatively few indicators that strongly correlate with a factor and the larger number of variables that do not correlate with it. You can more easily determine what relates to a factor and what does not. This condition is what statisticians mean by simplifying factor analysis results and making them easier to interpret.

### Graphical illustration

Let me show you how factor rotations work graphically using scatterplots.

Factor analysis starts by calculating the pattern of factor loadings. However, it picks an arbitrary set of axes by which to report them. Rotating the axes while leaving the data points unaltered keeps the original model and data pattern in place while producing more interpretable results.

To make this graphable in two dimensions, we’ll use two factors represented by the X and Y axes. On the scatterplot below, the six data points represent the observed variables, and the X and Y coordinates indicate their loadings for the two factors. Ideally, the dots fall right on an axis because that shows a high loading for that factor and a zero loading for the other.

For the initial factor analysis solution on the scatterplot, the points contain a mixture of both X and Y coordinates and aren’t close to a factor’s axis. That makes the results difficult to interpret because the variables have middling loads on all the factors. Visually, they’re not clumped near axes, making it difficult to assign the variables to one.

Rotating the axes around the scatterplot increases or decreases the X and Y values while retaining the original pattern of data points. At the blue rotation on the graph below, you maximize one factor loading while minimizing the other for all data points. The result is that the loads are high on one indicator but low on the other.

On the graph, all data points cluster close to one of the two factors on the blue rotated axes, making it easy to associate the observed variables with one factor.

### Types of Rotations

Throughout these rotations, you work with the same data points and factor analysis model. The model fits the data for the rotated loadings equally as well as the initial loadings, but they’re easier to interpret. You’re using a different coordinate system to gain a different perspective of the same pattern of points.

There are two fundamental types of rotation in factor analysis, oblique and orthogonal.

Oblique rotations allow correlation amongst the factors, while orthogonal rotations assume they are entirely uncorrelated.

Graphically, orthogonal rotations enforce a 90° separation between axes, as shown in the example above, where the rotated axes form right angles.

Oblique rotations are *not* required to have axes forming right angles, as shown below for a different dataset.

Notice how the freedom for each axis to take any orientation allows them to fit the data more closely than when enforcing the 90° constraint. Consequently, oblique rotations can produce simpler structures than orthogonal rotations in some cases. However, these results can contain correlated factors.

Oblique Rotations |
Orthogonal Rotations |

Promax | Varimax |

Oblimin | Equimax |

Direct Quartimin | Quartimax |

In practice, oblique rotations produce similar results as orthogonal rotations when the factors are uncorrelated in the real world. However, if you impose an orthogonal rotation on genuinely correlated factors, it can adversely affect the results. Despite the benefits of oblique rotations, analysts tend to use orthogonal rotations more frequently, which might be a mistake in some cases.

When choosing a rotation method in factor analysis, be sure it matches your underlying assumptions and subject-area knowledge about whether the factors are correlated.

## Factor Analysis Example

Imagine that we are human resources researchers who want to understand the underlying factors for job candidates. We measured 12 variables and perform factor analysis to identify the latent factors. Download the CSV dataset: FactorAnalysis

The first step is to determine the number of factors to extract. Earlier in this post, I displayed the scree plot, which indicated we should extract five factors. If necessary, we can perform the analysis with a different number of factors later.

For the factor analysis, we’ll assume normality and use Maximum Likelihood to extract the factors. I’d prefer to use an oblique rotation, but my software only has orthogonal rotations. So, we’ll use Varimax. Let’s perform the analysis!

### Interpreting the Results

In the bottom right of the output, we see that the five factors account for 81.8% of the variance. The %Var row along the bottom shows how much of the variance each explains. The five factors are roughly equal, explaining between 13.5% to 19% of the variance. Learn about Variance.

The Communality column displays the proportion of the variance the five factors explain for each variable. Values closer to 1 are better. The five factors explain the most variance for Resume (0.989) and the least for Appearance (0.643).

In the factor analysis output, the circled loadings show which variables have high loadings for each factor. As shown in the table below, we can assign labels encompassing the properties of the highly loading variables for each factor.

Factor |
Label |
High Loading Variables |

1 | Relevant Background | Academic record, Potential, Experience |

2 | Personal Characteristics | Confidence, Likeability, Appearance |

3 | General Work Skills | Organization, Communication |

4 | Writing Skills | Letter, Resume |

5 | Overall Fit | Company Fit, Job Fit |

In summary, these five factors explain a large proportion of the variance, and we can devise reasonable labels for each. These five latent factors drive the values of the 12 variables we measured.

## References

Hervé Abdi (2003), “Factor Rotations in Factor Analyses,” In: Lewis-Beck M., Bryman, A., Futing T. (Eds.) (2003). *Encyclopedia of Social Sciences Research Methods*. Thousand Oaks (CA): Sage.

Brown, Michael W., (2001) “An Overview of Analytic Rotation in Exploratory Factor Analysis,” *Multivariate Behavioral Research*, 36 (1), 111-150.

Costello, Anna B. and Osborne, Jason (2005) “Best practices in exploratory factor analysis: four recommendations for getting the most from your analysis,” *Practical Assessment, Research, and **Evaluation*: Vol. 10 , Article 7.

Michelle Scher says

Dear Jim

I have been saving and referring to your posts for the past few years and as a beginner researcher, I find them invaluable.

I have just started my PHD journey of developing and validating a mental health literacy measure for adolescents. Our approach is a mixed methods research design including a survey and Delphi panel with Item generation for the survey well on its way.

I am totally out of my depth here and would be grateful for your input in terms of the way forward with regards to the analysis and integration of the quantitative and qualitative data.

Your guidance and insights would be deeply appreciated.

Kind regards

Michelle

Roan says

Good day Jim, I am running in troubles in terms of the item analysis on the 5 point Likert scale that I am trying to create. The thing is, is that my CFI is around 0.9 and TLI is around 0.8

which is good but my RMSEA and SRMR has a awful result as the RMSEA is around 0.1 and SRMR is 0.2. And it is a roadblock for me, I want to ask on how I can improve my RMSEA and SRMR? so that it would reach the cut off.

I hope that his message would reach you and thank you for taking the time and reading and responding to my troubled question.

Hilary says

Good day, Sir Jim. I am currently trying to create a 5-Likert scale that tries to measure National Identity Conformity in three ways: (1) Origin – (e.g., Americans are born in/from America), (2) Culture (e.g., Americans are patriotic) and (3) Belief (e.g., Americans embrace being Americans).

In the process of establishing the scale’s validity, I was told to use Exploratory Factor Analysis, and I would like to ask what methods of extraction and rotation can be best used to ensure that the inter-item validity of my scale is good. I would also like to understand how I can avoid crossloading or limit crossloading factors.

Jim Frost says

Hi Hilary,

I discuss those issues in this post. I’d recommend PAF as the method of extraction because your data being Likert scale won’t be normally distribution. Read the Method of Factor Extraction section for more information.

As for cross-loading, the method of rotation can help with that. The choice depends largely on subject-area knowledge and what works best for your data, so I can’t provide a suggested method. Read the Factor Rotations section for more information about that. For instance, if you get cross-loadings with orthogonal rotations, using an oblique rotation might help.

If factor rotation doesn’t sufficiently reduce cross-loading, you might need to rework your questions so they’re more distinct, remove problematic items, or increase your sample size (can provide more stable factor solutions and clearer patterns of loadings). In this scenario where changing rotations doesn’t help, you’ll need to determine whether the underlying issue is with your questions or having to small of a sample size.

I hope that helps!

RAMESH SRINIVASAN says

What does negative loadings mean? How to proceed further with these loadings?

Jim Frost says

Hi Ramesh,

Loadings are like correlation coefficients and range from -1 to +1. More extreme positive and negative values indicate stronger relationships. Negative loadings indicate a negative relationship between the latent factors and observed variables. Highly negative values are as good as highly positive values. I discuss this in detail in the the Factor Loadings section of this post.

Ted says

Good day Jim,

The methodology seems loaded with opportunities for errors. So often we are being asked to translate a nebulous English word into some sort of mathematical descriptor. As an example, in the section labelled ‘Interpreting the Results’, what are we to make of the words ‘likeability’ or ‘self-confidence’ ? How can we possibly evaluate those things…and to three significant decimal places ?

You Jim, understand and use statistical methods correctly. Yet, too often people who apply statistics fail to examine the language of their initial questions and end up doing poor analysis. Worse, many don’t understand the software they use.

On a more cheery note, keep up the great work. The world needs a thousand more of you.

Jim Frost says

Hi Ted,

Thanks for the thoughtful comment. I agree with your concerns.

Ideally, all of those attributes are measured using validated measurement scales. The field of psychology is pretty good about that for terms that seem kind of squishy. For instance, they usually have thorough validation processes for personality traits, etc. However, your point is well taken, you need to be able to trust your data.

All statistical analyses depend on thorough subject-area knowledge, and that’s very true for factor analysis. You must have a solid theoretical understanding of these latent factors from extensive research before considering FA. Then FA can see if there’s evidence that they actually exist. But, I do agree with you that between the rotations and having to derive names to associate with the loadings, it can be a fairly subjective process.

Thanks so much for your kind words! I appreciate them because I do strive for accuracy.

Anju says

sir, i want to know that after successfully identifying my 3 factors with above give method now i want to regress on the data how to get single value for each factor rather than these number of values

Ismail Foda says

Hello,

Thanks for your effort on this post, it really helped me a lot. I want your recommendation for my case if you don’t mind.

I’m working on my research and I’ve 5 independent variables and 1 dependent variable, I want to use a factor analysis method in order to know which variable contributes the most in the dependent variable.

Also, what kind of data checks and preparations shall I make before starting the analysis.

Thanks in advance for your consideration.

Jim Frost says

Hi Ismail,

Based on the information you provided, I don’t believe factor analysis is the correct analysis for you.

Factor analysis is primarily used for understanding the structure of a set of variables and for reducing data dimensions by identifying underlying latent factors. It’s particularly useful when you have a large number of observed variables and believe that they are influenced by a smaller number of unobserved factors.

Instead, it sounds like you have the IVs and DV and want to understand the relationships between them. For that, I recommend multiple regression. Learn more in my post about When to Use Regression. After you settle on a model, there are several ways to Identify the Most Important Variables in the Model.

In terms of checking assumptions, familiarize yourself with the Ordinary Least Squares Regression Assumptions. Least squares regression is the most common and is a good place to start.

Best of luck with your analysis!

aftab hussain says

What would be the eign value in efa

Hendra says

Hi Jim, this is an excellent yet succinct article on the topic.

A very basic question, though: the dataset contains ordinal data. Is this ok? I’m a student in a Multivariate Statistics course, and as far as I’m aware, both PCA and common factor analysis dictate metric data. Or is it assumed that since the ordinal data has been coded into a range of 0-10, then the data is considered numeric and can be applied with PCA or CFA?

Sorry for the dumb question, and thank you.

Jim Frost says

Hi Hendra,

That’s a great question.

For the example in this post, we’re dealing with data on a 10 point scale where the differences between all points are equal. Consequently, we can treat discrete data as continuous data.

Now, to your question about ordinal data. You can use ordinal data with factor analysis however you might need to use specific methods.

For ordinal data, it’s often recommended to use polychoric correlations instead of Pearson correlations. Polychoric correlations estimate the correlation between two latent continuous variables that underlie the observed ordinal variables. This provides a more accurate correlation matrix for factor analysis of ordinal data.

I’ve also heard about categorical PCA and nonlinear Factor Analysis that use a monotonical transformation of ordinal data.

I hope that helps clarify it for you!

Brittany says

Once identifying how much each variability the factors contribute, what steps could we take from here to make predictions about variables ?

Jim Frost says

Hi Brittany,

Thanks for the great question! And thanks for you kind words in your other comment! ๐

What you can do is calculate all the factor scores for each observation. Some software will do this for you as an option. Or, you can input values into the regression equations for the factor scores that are included in the output.

Then use these scores as the independent variables in regression analysis. From there, you can use the regression model to make predictions.

Ideally, you’d evaluate the regression model before making predictions and use cross validation to be sure that the model works for observations outside the dataset you used to fit the model.

I hope that helps!

Brittany says

Wow! This was really helpful and structured very well for interpretation. Thank you!

Funsho Olukade says

I can imagine that Prof will have further explanations on this down the line at some point in future. I’m waiting… Thanks Prof Jim for your usual intuitive manner of explaining concepts. Funsho

howie says

Thanks for a very comprehensive guide. I learnt a lot.

In PCA, we usually extract the components and use it for predictive modeling.

Is this the case with Factor Analysis as well? Can we use factors as predictors?

Jim Frost says

Hi Howie,

I have not used factors as predictors, but I think it would be possible. However, PCA’s goal is to maximize data reduction. This process is particularly valuable when you have many variables, low sample size and/or collinearity between the predictors. Factor Analysis also reduces the data but that’s not its primary goal. Consequently, my sense is that PCA is better for that predictive modeling while Factor Analysis is better for when you’re trying to understand the underlying factors (which you aren’t with PCA). But, again, I haven’t tried using factors in that way nor I have compared the results to PCA. So, take that with a grain of salt!