Would others find a single row extraction function useful when working with tibbles containing list columns?

mungojam · December 9, 2017, 6:09pm

I've been refactoring some old non-tidyverse code and have been introducing list columns as they make it a lot easier to work with. Thanks to the posts on here for inspiring me and these slides for teaching me how to work with them!

One annoyance I have hit is that when I want to extract a single row out of a tibble, I have to unlist any list columns in order to fetch their value. This makes sense when I am using filter, since it may return anything from zero to many rows, so I was thinking that there is space for a dedicated function that would give me a simple named list of the column items with any list columns de-listed.

Let's say I have a simple tibble, that's really like a hashmap or dictionary:

library(tidyverse)

listOfTibbles <- list(tibble(a = c(1,2,3), b = c(2, 3, 4)), tibble(), tibble())
tibbleWithListColumns <- tibble(key = c("a", "b", "c"), value = listOfTibbles)
tibbleWithListColumns
# A tibble: 3 x 2
    key            value
  <chr>           <list>
1     a <tibble [3 x 2]>
2     b <tibble [0 x 0]>
3     c <tibble [0 x 0]>

I'd want to be able to use 'single' rather than filter and have it return:

b <- tibbleWithListColumns %>% single(key == "a")
b
$key
[1] "a"

$value
# A tibble: 3 x 2
      a     b
  <dbl> <dbl>
1     1     2
2     2     3
3     3     4

whereas filter understandably returns $value as a one element list

tibbleWithListColumns %>% filter(key == "a") %>% .$value
[[1]]  # <------ A one element list
# A tibble: 3 x 2
      a     b
  <dbl> <dbl>
1     1     2
2     2     3
3     3     4

I know a lot of the time this shouldn't be needed with the use of pmap to map things by row, but in my case it would have been useful and would have prevented [[1]] everywhere. The single function would throw an error if zero or > 1 items came back, much like Single in C#.
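A minimal sketch of what such a function could look like, assuming it simply wraps dplyr's filter and then takes the first element of every column. The name single and its error message are hypothetical here; nothing like it is part of dplyr or purrr as far as this thread establishes.

```r
library(dplyr)
library(tibble)

# Hypothetical helper, not an existing tidyverse function.
# Filters .data with the given conditions and insists on exactly one row.
single <- function(.data, ...) {
  rows <- dplyr::filter(.data, ...)
  if (nrow(rows) != 1L) {
    stop("single() expected exactly 1 row but got ", nrow(rows), call. = FALSE)
  }
  # Iterating over the columns keeps their names; [[1]] unwraps a list
  # column to its element and leaves atomic columns as length-one vectors.
  lapply(rows, function(col) col[[1]])
}

listOfTibbles <- list(tibble(a = c(1, 2, 3), b = c(2, 3, 4)), tibble(), tibble())
tibbleWithListColumns <- tibble(key = c("a", "b", "c"), value = listOfTibbles)

row <- single(tibbleWithListColumns, key == "a")
row$value  # the inner tibble itself, no [[1]] needed
```

Because the check is on nrow(rows), a condition matching no rows or several rows stops with an error instead of quietly returning something of a different shape.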

tiernan · December 10, 2017, 4:48pm

I feel your pain, @mungojam, but I'm not sure this use warrants its own special function - just use filter() %>% pull():

tibbleWithListColumns <- tibble(key = c("a", "b", "c"), 
                                value = list(tibble(a = c(1,2,3), 
                                                    b = c(2, 3, 4)), 
                                             tibble(), 
                                             tibble()
                                )
)

tibbleWithListColumns %>% 
  filter(key %in% 'a') %>% 
  pull('value')
## [[1]]
## # A tibble: 3 x 2
##       a     b
##   <dbl> <dbl>
## 1     1     2
## 2     2     3
## 3     3     4

Session info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.0 (2017-04-21)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  tz       America/Los_Angeles         
##  date     2017-12-10
## Packages -----------------------------------------------------------------
##  package    * version    date       source                            
##  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.2)                    
##  backports    1.1.0      2017-05-22 CRAN (R 3.4.0)                    
##  base       * 3.4.0      2017-04-21 local                             
##  bindr        0.1        2016-11-13 CRAN (R 3.4.2)                    
##  bindrcpp     0.2        2017-06-17 CRAN (R 3.4.2)                    
##  broom        0.4.2      2017-02-13 CRAN (R 3.4.0)                    
##  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.2)                    
##  cli          1.0.0      2017-11-05 CRAN (R 3.4.2)                    
##  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.2)                    
##  compiler     3.4.0      2017-04-21 local                             
##  crayon       1.3.4      2017-10-30 Github (r-lib/crayon@b5221ab)     
##  datasets   * 3.4.0      2017-04-21 local                             
##  devtools     1.13.2     2017-06-02 CRAN (R 3.4.0)                    
##  digest       0.6.12     2017-01-27 CRAN (R 3.4.0)                    
##  dplyr      * 0.7.4      2017-09-28 CRAN (R 3.4.2)                    
##  evaluate     0.10       2016-10-11 CRAN (R 3.4.0)                    
##  forcats    * 0.2.0      2017-01-23 CRAN (R 3.4.2)                    
##  foreign      0.8-67     2016-09-13 CRAN (R 3.4.0)                    
##  ggplot2    * 2.2.1.9000 2017-12-02 Github (tidyverse/ggplot2@7b5c185)
##  glue         1.2.0.9000 2017-12-05 Github (tidyverse/glue@69bc72c)   
##  graphics   * 3.4.0      2017-04-21 local                             
##  grDevices  * 3.4.0      2017-04-21 local                             
##  grid         3.4.0      2017-04-21 local                             
##  gtable       0.2.0      2016-02-26 CRAN (R 3.4.2)                    
##  haven        1.1.0      2017-07-09 CRAN (R 3.4.2)                    
##  hms          0.3        2016-11-22 CRAN (R 3.4.2)                    
##  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.0)                    
##  httr         1.3.1      2017-08-20 CRAN (R 3.4.2)                    
##  jsonlite     1.5        2017-06-01 CRAN (R 3.4.0)                    
##  knitr        1.16       2017-05-18 CRAN (R 3.4.0)                    
##  lattice      0.20-35    2017-03-25 CRAN (R 3.4.0)                    
##  lazyeval     0.2.1      2017-10-29 CRAN (R 3.4.2)                    
##  lubridate    1.7.1      2017-11-03 CRAN (R 3.4.2)                    
##  magrittr     1.5        2014-11-22 CRAN (R 3.4.0)                    
##  memoise      1.1.0      2017-04-21 CRAN (R 3.4.0)                    
##  methods    * 3.4.0      2017-04-21 local                             
##  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.1)                    
##  modelr       0.1.1      2017-07-24 CRAN (R 3.4.2)                    
##  munsell      0.4.3      2016-02-13 CRAN (R 3.4.2)                    
##  nlme         3.1-131    2017-02-06 CRAN (R 3.4.0)                    
##  parallel     3.4.0      2017-04-21 local                             
##  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.2)                    
##  plyr         1.8.4      2016-06-08 CRAN (R 3.4.2)                    
##  psych        1.7.8      2017-09-09 CRAN (R 3.4.2)                    
##  purrr      * 0.2.4.9000 2017-12-05 Github (tidyverse/purrr@62b135a)  
##  R6           2.2.2      2017-06-17 CRAN (R 3.4.0)                    
##  Rcpp         0.12.14    2017-11-23 CRAN (R 3.4.2)                    
##  readr      * 1.1.1      2017-05-16 CRAN (R 3.4.2)                    
##  readxl       1.0.0      2017-04-18 CRAN (R 3.4.2)                    
##  reshape2     1.4.2      2016-10-22 CRAN (R 3.4.2)                    
##  rlang        0.1.4      2017-11-05 CRAN (R 3.4.2)                    
##  rmarkdown    1.8        2017-11-17 CRAN (R 3.4.2)                    
##  rprojroot    1.2        2017-01-16 CRAN (R 3.4.0)                    
##  rvest        0.3.2      2016-06-17 CRAN (R 3.4.2)                    
##  scales       0.5.0.9000 2017-12-02 Github (hadley/scales@d767915)    
##  stats      * 3.4.0      2017-04-21 local                             
##  stringi      1.1.6      2017-11-17 CRAN (R 3.4.2)                    
##  stringr    * 1.2.0      2017-02-18 CRAN (R 3.4.0)                    
##  tibble     * 1.3.4      2017-08-22 CRAN (R 3.4.2)                    
##  tidyr      * 0.7.2.9000 2017-12-05 Github (tidyverse/tidyr@efd9ea5)  
##  tidyverse  * 1.2.1      2017-11-14 CRAN (R 3.4.2)                    
##  tools        3.4.0      2017-04-21 local                             
##  utils      * 3.4.0      2017-04-21 local                             
##  withr        2.1.0.9000 2017-12-02 Github (jimhester/withr@fe81c00)  
##  xml2         1.1.1      2017-01-24 CRAN (R 3.4.2)                    
##  yaml         2.1.14     2016-11-12 CRAN (R 3.4.0)

mungojam · December 10, 2017, 4:57pm

Unfortunately it would still need to be pull('value')[[1]] rather than just pull('value') to get an unlisted value so I would still have the annoying extra [[1]].

In the case I was using it there were 5-6 columns so $ was a bit less verbose than pull.

tiernan · December 10, 2017, 5:25pm

"In the case I was using it there were 5-6 columns so $ was a bit less verbose than pull."

I'm not sure I follow that comment. Here's a suggestion for extracting tibbles from multiple list-columns using filter %>% map(flatten_df):

library(tidyverse)  

tibbleWithListColumns <- 
  tibble(key = c("a", "b", "c"), 
         value_a = list(tibble(a = c(1,2,3), 
                             b = c(2, 3, 4)), 
                      tibble(), 
                      tibble()
         )
  ) %>% 
  mutate(value_b = value_a,
         value_c = value_a)


tibbleWithListColumns
## # A tibble: 3 x 4
##   key   value_a          value_b          value_c         
##   <chr> <list>           <list>           <list>          
## 1 a     <tibble [3 x 2]> <tibble [3 x 2]> <tibble [3 x 2]>
## 2 b     <tibble [0 x 0]> <tibble [0 x 0]> <tibble [0 x 0]>
## 3 c     <tibble [0 x 0]> <tibble [0 x 0]> <tibble [0 x 0]>

tibbleWithListColumns %>% 
  filter(key %in% 'a') %>% 
  select(-key) %>% 
  map(flatten_df)
## $value_a
## # A tibble: 3 x 2
##       a     b
##   <dbl> <dbl>
## 1  1.00  2.00
## 2  2.00  3.00
## 3  3.00  4.00
## 
## $value_b
## # A tibble: 3 x 2
##       a     b
##   <dbl> <dbl>
## 1  1.00  2.00
## 2  2.00  3.00
## 3  3.00  4.00
## 
## $value_c
## # A tibble: 3 x 2
##       a     b
##   <dbl> <dbl>
## 1  1.00  2.00
## 2  2.00  3.00
## 3  3.00  4.00

Session info

devtools::session_info()
## Session info -------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.0 (2017-04-21)
##  system   x86_64, mingw32             
##  ui       RTerm                       
##  language (EN)                        
##  collate  English_United States.1252  
##  tz       America/Los_Angeles         
##  date     2017-12-10
## Packages -----------------------------------------------------------------
##  package    * version    date       source                              
##  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.2)                      
##  backports    1.1.0      2017-05-22 CRAN (R 3.4.0)                      
##  base       * 3.4.0      2017-04-21 local                               
##  bindr        0.1        2016-11-13 CRAN (R 3.4.2)                      
##  bindrcpp     0.2        2017-06-17 CRAN (R 3.4.2)                      
##  broom        0.4.3      2017-11-20 CRAN (R 3.4.3)                      
##  cellranger   1.1.0      2016-07-27 CRAN (R 3.4.2)                      
##  cli          1.0.0      2017-11-05 CRAN (R 3.4.2)                      
##  colorspace   1.3-2      2016-12-14 CRAN (R 3.4.2)                      
##  compiler     3.4.0      2017-04-21 local                               
##  crayon       1.3.4      2017-10-30 Github (r-lib/crayon@b5221ab)       
##  datasets   * 3.4.0      2017-04-21 local                               
##  devtools     1.13.2     2017-06-02 CRAN (R 3.4.0)                      
##  digest       0.6.12     2017-01-27 CRAN (R 3.4.0)                      
##  dplyr      * 0.7.4      2017-09-28 CRAN (R 3.4.2)                      
##  evaluate     0.10       2016-10-11 CRAN (R 3.4.0)                      
##  forcats    * 0.2.0      2017-01-23 CRAN (R 3.4.2)                      
##  foreign      0.8-67     2016-09-13 CRAN (R 3.4.0)                      
##  ggplot2    * 2.2.1.9000 2017-12-02 Github (tidyverse/ggplot2@7b5c185)  
##  glue         1.2.0.9000 2017-12-05 Github (tidyverse/glue@69bc72c)     
##  graphics   * 3.4.0      2017-04-21 local                               
##  grDevices  * 3.4.0      2017-04-21 local                               
##  grid         3.4.0      2017-04-21 local                               
##  gtable       0.2.0      2016-02-26 CRAN (R 3.4.2)                      
##  haven        1.1.0      2017-07-09 CRAN (R 3.4.2)                      
##  hms          0.4.0      2017-11-23 CRAN (R 3.4.3)                      
##  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.0)                      
##  httr         1.3.1      2017-08-20 CRAN (R 3.4.2)                      
##  jsonlite     1.5        2017-06-01 CRAN (R 3.4.0)                      
##  knitr        1.16       2017-05-18 CRAN (R 3.4.0)                      
##  lattice      0.20-35    2017-03-25 CRAN (R 3.4.0)                      
##  lazyeval     0.2.1      2017-10-29 CRAN (R 3.4.2)                      
##  lubridate    1.7.1      2017-11-03 CRAN (R 3.4.2)                      
##  magrittr   * 1.5        2014-11-22 CRAN (R 3.4.0)                      
##  memoise      1.1.0      2017-04-21 CRAN (R 3.4.0)                      
##  methods    * 3.4.0      2017-04-21 local                               
##  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.1)                      
##  modelr       0.1.1      2017-07-24 CRAN (R 3.4.2)                      
##  munsell      0.4.3      2016-02-13 CRAN (R 3.4.2)                      
##  nlme         3.1-131    2017-02-06 CRAN (R 3.4.0)                      
##  parallel     3.4.0      2017-04-21 local                               
##  pillar       0.0.0.9000 2017-12-10 Github (r-lib/pillar@5a082e1)       
##  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.2)                      
##  plyr         1.8.4      2016-06-08 CRAN (R 3.4.2)                      
##  psych        1.7.8      2017-09-09 CRAN (R 3.4.2)                      
##  purrr      * 0.2.4.9000 2017-12-05 Github (tidyverse/purrr@62b135a)    
##  R6           2.2.2      2017-06-17 CRAN (R 3.4.0)                      
##  Rcpp         0.12.14    2017-11-23 CRAN (R 3.4.2)                      
##  readr      * 1.1.1      2017-05-16 CRAN (R 3.4.2)                      
##  readxl       1.0.0      2017-04-18 CRAN (R 3.4.2)                      
##  reshape2     1.4.2      2016-10-22 CRAN (R 3.4.2)                      
##  rlang        0.1.4      2017-11-05 CRAN (R 3.4.2)                      
##  rmarkdown    1.8        2017-11-17 CRAN (R 3.4.2)                      
##  rprojroot    1.2        2017-01-16 CRAN (R 3.4.0)                      
##  rvest        0.3.2      2016-06-17 CRAN (R 3.4.2)                      
##  scales       0.5.0.9000 2017-12-02 Github (hadley/scales@d767915)      
##  stats      * 3.4.0      2017-04-21 local                               
##  stringi      1.1.6      2017-11-17 CRAN (R 3.4.2)                      
##  stringr    * 1.2.0      2017-02-18 CRAN (R 3.4.0)                      
##  tibble     * 1.3.4.9003 2017-12-10 Github (tidyverse/tibble@60281b3)   
##  tidyr      * 0.7.2.9000 2017-12-05 Github (tidyverse/tidyr@efd9ea5)    
##  tidyverse  * 1.2.1      2017-12-10 Github (tidyverse/tidyverse@3769ff2)
##  tools        3.4.0      2017-04-21 local                               
##  utf8         1.1.1      2017-11-29 CRAN (R 3.4.3)                      
##  utils      * 3.4.0      2017-04-21 local                               
##  withr        2.1.0.9000 2017-12-02 Github (jimhester/withr@fe81c00)    
##  xml2         1.1.1      2017-01-24 CRAN (R 3.4.2)                      
##  yaml         2.1.14     2016-11-12 CRAN (R 3.4.0)

mungojam · December 10, 2017, 5:54pm

Yeah, I think we are crossing wires a bit. So I have a tibble with say 6 columns, some of which are list columns of some kind and one is a key column. I then want to pull out a single row and be able to refer to each column in it succinctly and clearly without having to have [[1]] all the time.

You have put me on to a decent solution though, using flatten:

listOfTibbles <- list(tibble(a = c(1,2,3), b = c(2, 3, 4)), tibble(), tibble())

tibbleWithListColumns <- 
  tibble(
    key = c("a", "b", "c"), 
    value1 = listOfTibbles, 
    value2 = listOfTibbles, 
    value3 = listOfTibbles
)

result <- tibbleWithListColumns %>% 
  filter(key == "a") %>% 
  flatten

# $key
# [1] "a"
# 
# $value1
# # A tibble: 3 x 2
#       a     b
#   <dbl> <dbl>
# 1     1     2
# 2     2     3
# 3     3     4
# 
# $value2
# # A tibble: 3 x 2
#       a     b
#   <dbl> <dbl>
# 1     1     2
# 2     2     3
# 3     3     4
# 
# $value3
# # A tibble: 3 x 2
#       a     b
#   <dbl> <dbl>
# 1     1     2
# 2     2     3
# 3     3     4

I can then happily do result$value1 and get a dataframe or whatever object was in my list column without needing [[1]].

Thanks! For me, I think there's space for a function that does this in one step, but two steps is fine. Maybe a hashmap-type object would make more sense too; I imagine there are packages that provide them, but they probably aren't tidyverse-friendly.

mungojam · December 10, 2017, 6:03pm

The only additional feature that a dedicated single function would provide is that it would error if more than one row were returned, whereas flatten returns a list with all the values mingled together.

tiernan · December 10, 2017, 6:21pm

Yep - flatten is a rather blunt instrument, so I'm not surprised it doesn't work perfectly in this use case.

Glad to hear my suggestion gets you closer to a workable solution. You might also experiment with the transpose %>% map pattern. I've found it to be useful when I want to access the rows of a tibble. Good luck!
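For readers unfamiliar with that pattern, here is a rough sketch using the example tibble from earlier in the thread. It assumes purrr::transpose is available (newer purrr releases mark it as superseded), and describe_row is a made-up function for illustration only.

```r
library(purrr)
library(tibble)

tibbleWithListColumns <- tibble(
  key = c("a", "b", "c"),
  value = list(tibble(a = c(1, 2, 3), b = c(2, 3, 4)), tibble(), tibble())
)

# transpose() turns the list of columns into a list of rows;
# each row is a named list with one entry per column.
rows <- transpose(tibbleWithListColumns)

# Hypothetical function that works on a whole row at a time.
describe_row <- function(row) {
  paste0(row$key, ": ", nrow(row$value), " rows")
}

descriptions <- map_chr(rows, describe_row)
descriptions
```

Each element of rows can be handed to a function as a unit, and list-column entries arrive already unwrapped.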

alistaire · December 10, 2017, 7:31pm

purrr::pluck allows indices to be chained, so

library(tidyverse)

tibbleWithListColumns <- tibble(key = c("a", "b", "c"), 
                                value = list(tibble(a = c(1,2,3), 
                                                    b = c(2, 3, 4)), 
                                             tibble(), 
                                             tibble()))

tibbleWithListColumns %>% pluck('value', 1)
#> # A tibble: 3 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2     3
#> 3     3     4

or if you prefer,

tibbleWithListColumns %>% pluck('value', which(.$key == 'a'))

or

tibbleWithListColumns %>% filter(key == 'a') %>% pluck('value', 1)

if you like. The only downside is that pluck requires quoting for variable names. I suppose a version of pull that accepts further indices or a version of pluck that accepts raw variable names could be useful, though the semantics may get confusing.

I've almost never used either in this idiom, though; I extract nested data frames with tidyr::unnest, subsetting before or after.

mungojam · December 10, 2017, 9:17pm

Thanks for the ideas, a clever use of pluck with which.

I've found pluck useful elsewhere, but in my case I want to be able to pass the whole row on to another function which can then extract whichever values it needs so having it in a regular list as flatten gives is better than having ways to pull out the individual cells.

I've found nest and unnest very useful too, and I know for my example it could be helpful as I used data frames, but in reality some of my list columns contained other s3 classes like a forecast model.

I should probably convert the whole thing to use pmap in the end.
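As a rough idea of what the pmap version might look like, assuming the per-row work can be written as a function whose argument names match the column names (the body here is only a placeholder):

```r
library(purrr)
library(tibble)

tibbleWithListColumns <- tibble(
  key = c("a", "b", "c"),
  value = list(tibble(a = c(1, 2, 3), b = c(2, 3, 4)), tibble(), tibble())
)

# pmap_* calls the function once per row, matching columns to arguments
# by name; a list-column entry is passed as the object it holds.
row_counts <- pmap_int(
  tibbleWithListColumns,
  function(key, value) nrow(value)
)
row_counts
```

This avoids extracting a row at all, which is why it sidesteps the [[1]] problem, at the cost of restructuring the calling code around a per-row function.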

mungojam · December 11, 2017, 10:28am

Thanks. I guess transpose would only work if all the column data-types are the same?

Just discovered an irritating thing with flatten, which is that it drops classes like Date and converts the values to numeric. Maybe I just need to bite the bullet and create my single function.
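The same kind of class loss can be reproduced in base R, which may help when deciding how a custom function should extract values. This is only an analogy: it assumes flatten discards attributes in a similar way to unlist, which is not confirmed here from purrr's internals.

```r
# A one-row data frame with a Date column, base R only.
one_row <- data.frame(key = "a", when = as.Date("2017-12-11"),
                      stringsAsFactors = FALSE)

# unlist() on a list holding a Date drops the class attribute,
# leaving the underlying day count as a plain number.
flattened <- unlist(list(one_row$when))
class(flattened)

# Taking the first element of each column with [[ keeps the class.
extracted <- lapply(one_row, function(col) col[[1]])
class(extracted$when)
```

Extracting column by column with [[ is therefore a safer basis for a single-row helper than anything that flattens the whole row at once.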