Using `purrr` map functions on unnamed vectors/lists?

R-by-Ryo · December 6, 2017, 3:33pm

I'm looking at the harrypotter package, which provides one character vector for each of the 7 HP books, with one element per chapter. I want to create a data frame with columns for book, chapter, and text.

With a for loop, you can do it like this:

library(dplyr)
library(harrypotter)

titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
            "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
            "Deathly Hallows")

books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
              goblet_of_fire, order_of_the_phoenix, half_blood_prince,
              deathly_hallows)

series <- tibble()

for(i in seq_along(titles)) {
  
  clean <- tibble(chapter = seq_along(books[[i]]),
                  text = books[[i]]) %>%
    mutate(book = titles[i]) %>%
    select(book, everything())
  
  series <- rbind(series, clean)
}

Is there a way to get the above with tibble or data.frame + map_chr()?

The problem I've run into is that the character vectors and their elements are unnamed, so I don't have anything to pass as an argument to the purrr functions.

mishabalyasin · December 6, 2017, 3:55pm

I think you are most of the way there.
This is what I did:

books <- setNames(books, titles)
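res <- purrr::map(books, function(book) {
    # one row per chapter; seq_along() also behaves correctly on empty input
    tibble::tibble(text = book, chapter = seq_along(book))
}) %>%
    # .id turns the list names (the titles) into a "book" column
    dplyr::bind_rows(.id = "book")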
str(res)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	200 obs. of  3 variables:
 $ book   : chr  "Philosopher's Stone" "Philosopher's Stone" "Philosopher's Stone" "Philosopher's Stone" ...
 $ text   : chr  "THE BOY WHO LIVED  Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfe"| __truncated__ "THE VANISHING GLASS  Nearly ten years had passed since the Dursleys had woken up to find their nephew on the "| __truncated__ "THE LETTERS FROM NO ONE  The escape of the Brazilian boa constrictor earned Harry his longest-ever punishment"| __truncated__ "THE KEEPER OF THE KEYS  BOOM. They knocked again. Dudley jerked awake. \"Where's the cannon?\" he said stupid"| __truncated__ ...
 $ chapter: int  1 2 3 4 5 6 7 8 9 10 ...

Is that what you wanted to achieve?

markdly · December 6, 2017, 7:37pm

Another alternative could be to use something like map2_df:

library(tidyverse)
map2_df(titles, books, ~ tibble(book = .x, chapter = seq_along(.y), text = .y))
#> # A tibble: 200 x 3
#>                   book chapter
#>                  <chr>   <int>
#>  1 Philosopher's Stone       1
#>  2 Philosopher's Stone       2
#>  3 Philosopher's Stone       3
#>  4 Philosopher's Stone       4
#>  5 Philosopher's Stone       5
#>  6 Philosopher's Stone       6
#>  7 Philosopher's Stone       7
#>  8 Philosopher's Stone       8
#>  9 Philosopher's Stone       9
#> 10 Philosopher's Stone      10
#> # ... with 190 more rows, and 1 more variables: text <chr>

alistaire · December 6, 2017, 8:12pm

Probably the simplest option is to add the books as a list column. Once you've done that, you can easily iterate over it with map and seq_along to make another list column of chapter numbers. Since they will be the same length, you can call tidyr::unnest afterwards to expand everything out.

library(tidyverse)
library(harrypotter)

books <- tibble(title = c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
                          "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
                          "Deathly Hallows"), 
                text = list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
                            goblet_of_fire, order_of_the_phoenix, half_blood_prince,
                            deathly_hallows), 
                chapter = map(text, seq_along))

books
#> # A tibble: 7 x 3
#>                  title       text    chapter
#>                  <chr>     <list>     <list>
#> 1  Philosopher's Stone <chr [17]> <int [17]>
#> 2   Chamber of Secrets <chr [19]> <int [19]>
#> 3  Prisoner of Azkaban <chr [22]> <int [22]>
#> 4       Goblet of Fire <chr [37]> <int [37]>
#> 5 Order of the Phoenix <chr [38]> <int [38]>
#> 6    Half-Blood Prince <chr [30]> <int [30]>
#> 7      Deathly Hallows <chr [37]> <int [37]>
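
The unnest step described above is not shown in the code, so here is a minimal sketch of it. It uses small made-up vectors (`toy`, "Book A", "Book B") instead of the harrypotter data so that it runs without the package, and it assumes tidyr 1.0.0 or later, where `unnest()` takes a `cols` argument; with the tidyr version current at the time of this thread, a plain `unnest(books)` would have been the equivalent call.

```r
library(tibble)
library(purrr)
library(tidyr)

# Hypothetical stand-in for the books tibble: one character vector per book,
# one element per chapter
toy <- tibble(
  title   = c("Book A", "Book B"),
  text    = list(c("a1", "a2", "a3"), c("b1", "b2")),
  chapter = map(text, seq_along)
)

# Both list columns have matching lengths row by row, so they can be
# expanded together into one row per chapter
long <- unnest(toy, cols = c(text, chapter))
long
```

The result should have one row per chapter, with the title repeated for every chapter of the same book.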

Also, I've got to wonder about the legality of the package. gutenbergr is a good source of books in the public domain, if you need one.

R-by-Ryo · December 6, 2017, 8:36pm

thanks, i've been trying to practice using purrr functions and this is a good example!

R-by-Ryo · December 6, 2017, 8:42pm

another good solution, thanks!

and yes you're quite right... i saw a certain someone use it in a text mining tutorial so i thought i'd play around with it too. for stuff you want to share online it's probably for the best to use public domain stuff from gutenbergr as you suggested!

R-by-Ryo · December 6, 2017, 8:43pm

this works too, i didn't know you could use the .id argument in bind_rows() so thanks!

austensen · December 6, 2017, 8:59pm

purrr also has a function, map_dfr, for this common pattern of map() %>% bind_rows() and it takes the same .id argument.
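
As a rough sketch of that pattern, again with made-up data (`toy_books` is hypothetical, not part of any package): the list is named, and `.id = "book"` copies those names into a column. Note that `map_dfr()` needs dplyr installed, and that in purrr 1.0.0 and later it is marked as superseded in favour of `map()` followed by `list_rbind()`, so treat this as an illustration of the idea rather than the recommended current spelling.

```r
library(purrr)
library(tibble)

# Hypothetical named list: names play the role of the book titles
toy_books <- list(
  "Book A" = c("a1", "a2", "a3"),
  "Book B" = c("b1", "b2")
)

# Build one tibble per book and row-bind them; .id stores the list names
res <- map_dfr(
  toy_books,
  ~ tibble(chapter = seq_along(.x), text = .x),
  .id = "book"
)
res
```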

markdly · December 6, 2017, 11:39pm

Glad it helped! I think the example is a good illustration of how one variant of the purrr::map family of functions can work for this particular question.

That said, I feel the approach in @alistaire's answer is more useful in general. Getting used to working with nested tibbles / list columns that you tidyr::unnest at the end is something you can apply to a wide variety of situations...