Tidyverse solutions for Factor Analysis / Principal Component Analysis

timpe · January 25, 2018, 3:42pm

What's the best way to do FA or PCA in the Tidyverse? Right now, I switch to base R using princomp() or the pysch package for my series reduction work.

Are there any solutions that fit into the tidyverse workflow?

tbradley · January 25, 2018, 5:08pm

You can do it similar to other types of modelling methods using list-columns and purrr, broom, and dplyr functions.

library(tidyverse)
library(broom)
library(ggfortify) # for plotting pca

iris_pca <- iris %>% 
  nest() %>% 
  mutate(pca = map(data, ~prcomp(.x %>% select(-Species), center = TRUE, scale = TRUE)), 
         pca_tidy = map2(pca, data, ~broom::augment(.x, data = .y)))

That will give you this:

> iris_pca
# A tibble: 1 x 3
  data                   pca          pca_tidy               
  <list>                 <list>       <list>                 
1 <data.frame [150 x 5]> <S3: prcomp> <data.frame [150 x 10]>

From there you can calculate the variance explained by each dimension like this:

iris_pca %>%
  unnest(pca_tidy) %>% 
  summarize_at(.vars = vars(contains("PC")), .funs = funs(var)) %>% 
  gather(key = pc, value = variance) %>% 
  mutate(var_exp = variance/sum(variance))

which gives you this:

# A tibble: 4 x 3
  pc         variance var_exp
  <chr>         <dbl>   <dbl>
1 .fittedPC1   2.92   0.730  
2 .fittedPC2   0.914  0.229  
3 .fittedPC3   0.147  0.0367 
4 .fittedPC4   0.0207 0.00518

You can also plot using this same setup:

iris_pca %>% 
  mutate(
    pca_graph = map2(
      .x = pca, 
      .y = data,
      ~ autoplot(.x, loadings = TRUE, loadings.label = TRUE, 
                 data = .y, colour = "Species")
    )
  ) %>% 
  pull(pca_graph)

which gives you this:

You can extend this as far as you would like, with multiple models by grouping prior to nesting. You can also extend the graph easily just like you would for any other ggplot. To see some of the options for the autoplot pca specific graph, you can look at ?ggfortify::autoplot.prcomp. It is worth noting that the autoplot.prcomp method only takes the UK spelling of colour = .

I am not that familiar with factor analysis but I would imagine you could use this same workflow with it. The big dependency is whether broom has a tidy/augment/glance method for that model type. But even if it doesn't, you can still access the model once it is in a list column, it just may look a little more hacky.

timpe · January 25, 2018, 5:51pm

This is incredibly helpful. Thanks so much.

davidr · January 26, 2018, 3:02am

Out of curiosity, why not stick with base-R or the psych package? This setup seems better than the tidyverse because base-R and psych aren't going to mess with their stable foundation any time soon. Sticking with this route, you know the same code will work in 1 year, 2 years, or 10 years from now.

Disclaimer - the tidyverse is awesome. I update my code to use as much of the tidyverse packages as possible once I learn new tidy package functions. Yet I still use base-R and psych for PCA and EFA because both are super simple and give all the info I want/need.

tbradley · January 26, 2018, 4:30pm

I would say that one of the major benefits of using this workflow is the ability to extend it to run multiple models or sub models all at once. It also provides the models in a format that is easy to work with (especially if there are broom methods for the model type).

In addition, the method I provided did not abandon the base-R function that is used to perform the PCA, rather it just implemented it in the framework of the tidyverse. This allows for easy manipulation of model results and it stores all of the results in a convenient place, i.e., in a single data frame, rather than in several different variables that you then have to keep track of.

mara · January 29, 2018, 12:59pm

Super helpful question and answer (as always, @tbradley is a !)

@timpe, if your question has been answered, would you mind marking the solution? That way, someone in the future can come along and easily see what worked for you.

If you’re the OP (as you were in this case), there should be a little box at the bottom of replies that you can click to select that response as your “solution.”

Example from one of my questions, below:

phil · February 7, 2018, 10:14pm

How would I go about getting the loadings for a specific number of components, using this method? My "un-tidyverse" method has historically been to use psych::principal().

tbradley · June 20, 2018, 6:16pm

@phil Sorry for the super late response. I meant to respond to this when you asked but ultimately forgot.

I am not familiar with the psych::principal function and it does not look like there are broom methods for this model type. However, you can still compute the model and operate on it in the same manner shown above.

Here is a basic example with the USArrests dataset:

library(psych)
library(tidyverse)

us_arrests <- USArrests %>% 
  rownames_to_column(var = "state") %>% 
  # I prefer column names to be all lowercase so I am going to change them here
  rename_all(tolower) %>% 
  as_tibble()


us_arrests_prin <- us_arrests %>% 
  nest() %>% 
  mutate(prin = map(data, ~psych::principal(.x %>% select(-state))))


us_arrests_prin$prin[[1]]
#> Principal Components Analysis
#> Call: psych::principal(r = .x %>% select(-state))
#> Standardized loadings (pattern matrix) based upon correlation matrix
#>           PC1   h2   u2 com
#> murder   0.84 0.71 0.29   1
#> assault  0.92 0.84 0.16   1
#> urbanpop 0.44 0.19 0.81   1
#> rape     0.86 0.73 0.27   1
#> 
#>                 PC1
#> SS loadings    2.48
#> Proportion Var 0.62
#> 
#> Mean item complexity =  1
#> Test of the hypothesis that 1 component is sufficient.
#> 
#> The root mean square of the residuals (RMSR) is  0.16 
#>  with the empirical chi square  15.25  with prob <  0.00049 
#> 
#> Fit based upon off diagonal values = 0.91

Created on 2018-06-20 by the reprex package (v0.2.0).

As a note, the graphics aren't as compatible when you use base graphics (as it seems the psych package uses) since the graphs are just simply plotted when run in a mutate call rather than saved as an object like ggplot objects. But this method would still help you keep your model objects tidy with the data.

eoppe1022 · February 24, 2019, 6:31pm

Hey @tbradley, just came across the answer. For some reason, the solution won't work unless I do it within a reprex.

I keep getting this error:

Error in mutate_impl(.data, dots) : Evaluation error: 'data' is missing.

Any idea why?

tbradley · February 25, 2019, 1:20am

Which portion of the code is failing? If it is working in the reprex, my first guess would be some issue with your local session, but without knowing which piece of code that is failing, I can't give you too many specific pointers.

eoppe1022 · February 26, 2019, 4:57am

It seems to be this part. Weird!

mara · February 26, 2019, 11:45am

Can you give traceback()?

If it's working in the reprex and not in your global environment, there's a possibility of a name collision, or something else in your environment interfering.

eoppe1022 · February 27, 2019, 3:48am

Alright, I just re-tried and everything seems to work perfectly. So yeah I guess there was some conflict somewhere. So... nevermind! Thanks for the help.