How to use `map` with `cor`

I'm doing something wrong here (NOOB).

I'm trying to loop over the correlations for each Tree, as a dummy example of something else I'm working on.

Any suggestions?

library(purrr)
library(tidyr)
library(dplyr)
Orange %>% 
    group_by(Tree) %>% 
    nest() %>% 
    map(.x = data, .f = function(z){with(data = .x, expr = cor(x = age, y = circumference))})
#> Error: `.x` is not a vector (closure)

You need to add

library(dplyr)

I added dplyr, and it doesn't work.

I'd suggest trying base R in steps (using "." as an intermediate variable) instead of jumping straight to an "all at once" pipeline. The secret sauce is split(), a great function that accomplishes a lot: it breaks the data frame into a list with one data frame per Tree.

. <- Orange
. <- split(., .$Tree)
vapply(., 
       function(z) cor(z$age, z$circumference),
       numeric(1))
#>         3         1         5         2         4
#> 0.9881766 0.9854675 0.9877376 0.9873624 0.9844610

For more on the dot intermediate notation please see my note here. I know the specific problem was notional, but the method works fairly generally.

There are some things in your processing that don't really work as you probably think they do.

First of all, when you pass a data.frame (or tibble) to a map function, map enumerates the columns of that data.frame, not the rows. See the example below.

group_by and nest were a good start, but the tibbles nest produced are in the data column of the tibble output by nest, so you have to somehow turn that data column into a list of tibbles that map can process.

You can use select to pull out the data column into another tibble, then use flatten to "unwrap" that tibble and make a list out of the data column.

Once you have done that you can use map to enumerate those tibbles and pass the age and circumference columns into cor to get your correlations.

See the example below for details.

BTW the tilde, '~', in map is just a bit more compact way of passing the processing function into map.
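To make the tilde point concrete, here is a small sketch showing that an ordinary anonymous function and the two formula spellings give the same result. The `inputs` list and the variable names are made up purely for illustration; they are not part of the original question.

```r
library(purrr)

# `inputs` is a hypothetical list used only to compare the three spellings
inputs <- list(1, 2, 3)

# An ordinary anonymous function
with_function <- map_dbl(inputs, function(z) z^2)

# Formula shorthand: .x stands for the current element
with_tilde_x <- map_dbl(inputs, ~ .x^2)

# Formula shorthand: a bare . also refers to the current element
with_tilde_dot <- map_dbl(inputs, ~ .^2)

# All three produce the same numeric vector
identical(with_function, with_tilde_x)
identical(with_tilde_x, with_tilde_dot)
```

I tend to prefer `.x` over the bare `.` inside a pipeline, since `.` is also what `%>%` uses as its placeholder and the two can be confused.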


suppressPackageStartupMessages(library(tidyverse))
# map enumerates the columns of a data.frame, not the rows
df <- tribble(
  ~ x, ~ c,
  1, "a",
  2, "b"
)
# All this map function does is return a list
# of the things it has enumerated.
# BTW this technique can sometimes be handy
# for debugging what a map is doing
map(df, ~ .)
#> $x
#> [1] 1 2
#> 
#> $c
#> [1] "a" "b"


Orange %>%
  # group by tree
  group_by(Tree) %>%
  # now nest age and circumference
  nest() %>%
  # nest produces a tibble, and the data column in that
  # tibble contains tibbles of age and circumference
  # so select just the data column
  select(data) %>%
  # select in this case produces a tibble with a single column also.
  # Now flatten that list to remove the "wrapper" around it
  flatten() %>%
  # Now use map to enumerate each of the data.frames that
  # are in the flattened list.
  # Use map_dbl so you end up with a double vector instead of a list
  map_dbl(~ cor(.$age, .$circumference))
#> [1] 0.9854675 0.9873624 0.9881766 0.9844610 0.9877376

Created on 2018-03-25 by the reprex package (v0.2.0).
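A note for anyone running this on current package versions: as far as I know, since tidyr 1.0.0 nest() on a grouped data frame keeps the grouping, so select(data) carries Tree along with it and the flatten() step no longer yields just the nested tibbles. A minimal sketch that sidesteps this, assuming dplyr's pull() to extract the data list-column directly (`per_tree` is just a name picked for illustration):

```r
library(dplyr)
library(tidyr)
library(purrr)

per_tree <- Orange %>%
  # one row per tree, with age and circumference nested in `data`
  group_by(Tree) %>%
  nest() %>%
  # pull() returns the list-column itself: a plain list of tibbles
  pull(data) %>%
  # one correlation per nested tibble, returned as a double vector
  map_dbl(~ cor(.x$age, .x$circumference))

per_tree
```

The result is an unnamed vector, so the order of the trees depends on how nest() orders the groups in your tidyr version; keep the Tree column alongside the values if you need to know which is which.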

And a more compact way to do the equivalent:

suppressPackageStartupMessages(library(tidyverse))
unique(Orange$Tree) %>%
    map(~ filter(Orange, Tree == .)) %>%
    map_dbl(~ cor(.$age, .$circumference))
#> [1] 0.9854675 0.9873624 0.9881766 0.9844610 0.9877376

Created on 2018-03-25 by the reprex package (v0.2.0).


You are very close to what you want with your original code. As @danr pointed out, if you pass the tibble directly into map then it maps over the columns rather than the rows. However, if you put your map inside of a mutate call then it does what you want:

library(tidyverse)

Orange %>% 
  group_by(Tree) %>% 
  nest() %>% 
  mutate(cor = map(data, ~cor(.x$age, .x$circumference))) %>% 
  unnest(cor, .drop = TRUE)
#> # A tibble: 5 x 2
#>   Tree    cor
#>   <ord> <dbl>
#> 1 1     0.985
#> 2 2     0.987
#> 3 3     0.988
#> 4 4     0.984
#> 5 5     0.988

#' Created on 2018-03-26 by the reprex package (v0.2.0).
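If you are on a newer tidyr, the `.drop` argument of unnest() is, to my knowledge, deprecated (since tidyr 1.0.0), so the call above may warn and keep the data column. A hedged variant of the same idea uses map_dbl so the new column is already a double and no unnest is needed; `cor_by_tree` is only an illustrative name:

```r
library(dplyr)
library(tidyr)
library(purrr)

cor_by_tree <- Orange %>%
  group_by(Tree) %>%
  nest() %>%
  # map_dbl returns a double per tree, so there is nothing to unnest
  mutate(cor = map_dbl(data, ~ cor(.x$age, .x$circumference))) %>%
  # drop any grouping left over from nest(), then keep only the result columns
  ungroup() %>%
  select(Tree, cor)

cor_by_tree
```

The ungroup() call is harmless on older tidyr versions where nest() already returns an ungrouped tibble.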
