Looking at the harrypotter package, which provides a character vector (one element per chapter) for each of the 7 HP books, I want to create a data frame with columns for Book, Chapter, and Text.
With a for loop you can do this as follows:
library(dplyr)
library(harrypotter)
titles <- c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
            "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
            "Deathly Hallows")
books <- list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
              goblet_of_fire, order_of_the_phoenix, half_blood_prince,
              deathly_hallows)
series <- tibble()
for (i in seq_along(titles)) {
  clean <- tibble(chapter = seq_along(books[[i]]),
                  text = books[[i]]) %>%
    mutate(book = titles[i]) %>%
    select(book, everything())
  series <- rbind(series, clean)
}
Is there a way to get the above with tibble or data.frame + map_chr()?
The problem I've been having in attempting to do this is that the character vectors and elements are unnamed so I don't have anything to pass as an argument into the purrr functions.
I think you are most of the way there.
This is what I did:
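books <- setNames(books, titles)
res <- purrr::map(books, function(book) {
  # one row per chapter; seq_along() also handles an empty vector correctly
  tibble::tibble(text = book, chapter = seq_along(book))
}) %>%
  # .id turns the list names (the titles) into a "book" column
  dplyr::bind_rows(.id = "book")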
str(res)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 200 obs. of 3 variables:
$ book : chr "Philosopher's Stone" "Philosopher's Stone" "Philosopher's Stone" "Philosopher's Stone" ...
$ text : chr "THE BOY WHO LIVED Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say that they were perfe"| __truncated__ "THE VANISHING GLASS Nearly ten years had passed since the Dursleys had woken up to find their nephew on the "| __truncated__ "THE LETTERS FROM NO ONE The escape of the Brazilian boa constrictor earned Harry his longest-ever punishment"| __truncated__ "THE KEEPER OF THE KEYS BOOM. They knocked again. Dudley jerked awake. \"Where's the cannon?\" he said stupid"| __truncated__ ...
$ chapter: int 1 2 3 4 5 6 7 8 9 10 ...
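Once the list is named, purrr::imap() is another way to get at the names, since it passes each element as .x and its name as .y. Here is a minimal sketch of that idea. Note that toy_books and toy_res are made-up stand-ins so the snippet runs without the harrypotter package; they are not part of the original data.

```r
library(purrr)
library(tibble)
library(dplyr)

# Hypothetical toy data standing in for the named list of book vectors
toy_books <- list(
  "Book A" = c("Chapter one text", "Chapter two text"),
  "Book B" = c("First chapter", "Second chapter", "Third chapter")
)

# imap() hands each element to the function as .x and its name as .y
toy_res <- imap(toy_books,
                ~ tibble(book = .y, chapter = seq_along(.x), text = .x)) %>%
  bind_rows()
```

With the real data you would pass the named books list instead of toy_books.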
Another alternative could be to use something like map2_df:
library(tidyverse)
map2_df(titles, books, ~ tibble(book = .x, chapter = seq_along(.y), text = .y))
#> # A tibble: 200 x 3
#> book chapter
#> <chr> <int>
#> 1 Philosopher's Stone 1
#> 2 Philosopher's Stone 2
#> 3 Philosopher's Stone 3
#> 4 Philosopher's Stone 4
#> 5 Philosopher's Stone 5
#> 6 Philosopher's Stone 6
#> 7 Philosopher's Stone 7
#> 8 Philosopher's Stone 8
#> 9 Philosopher's Stone 9
#> 10 Philosopher's Stone 10
#> # ... with 190 more rows, and 1 more variables: text <chr>
Probably the simplest option is to add the books as a list column. Once you've done that, you can easily iterate over it with map and seq_along to make another list column of chapter numbers. Since they will be the same length, you can call tidyr::unnest afterwards to expand everything out.
library(tidyverse)
library(harrypotter)
books <- tibble(title = c("Philosopher's Stone", "Chamber of Secrets", "Prisoner of Azkaban",
                          "Goblet of Fire", "Order of the Phoenix", "Half-Blood Prince",
                          "Deathly Hallows"),
                text = list(philosophers_stone, chamber_of_secrets, prisoner_of_azkaban,
                            goblet_of_fire, order_of_the_phoenix, half_blood_prince,
                            deathly_hallows),
                chapter = map(text, seq_along))
books
#> # A tibble: 7 x 3
#> title text chapter
#> <chr> <list> <list>
#> 1 Philosopher's Stone <chr [17]> <int [17]>
#> 2 Chamber of Secrets <chr [19]> <int [19]>
#> 3 Prisoner of Azkaban <chr [22]> <int [22]>
#> 4 Goblet of Fire <chr [37]> <int [37]>
#> 5 Order of the Phoenix <chr [38]> <int [38]>
#> 6 Half-Blood Prince <chr [30]> <int [30]>
#> 7 Deathly Hallows <chr [37]> <int [37]>
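The unnest step itself is not shown above, so here is a rough sketch of what it could look like. To keep it runnable without the harrypotter package it uses two made-up vectors (book_a and book_b are hypothetical stand-ins), and it assumes tidyr 1.0.0 or later, where unnest() takes a cols argument.

```r
library(tidyverse)

# Hypothetical toy vectors, one string per chapter
book_a <- c("Chapter one text", "Chapter two text")
book_b <- c("First chapter", "Second chapter", "Third chapter")

books <- tibble(title = c("Book A", "Book B"),
                text = list(book_a, book_b),
                chapter = map(text, seq_along))

# Expand both list columns in parallel, then rename and reorder to book, chapter, text
series <- books %>%
  unnest(cols = c(text, chapter)) %>%
  select(book = title, chapter, text)
```

Because text and chapter have the same length within each row, unnesting them together keeps each chapter number lined up with its text.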
Also, I've got to wonder about the legality of the package. gutenbergr is a good source of books in the public domain, if you need one.
And yes, you're quite right... I saw a certain someone use it in a text mining tutorial, so I thought I'd play around with it too. For stuff you want to share online it's probably best to use public domain material from gutenbergr, as you suggested!
Glad it helped! I think the example is good to show how one of the variants of the purrr::map family of functions can work for this particular question.
I feel the approach illustrated by @alistaire's answer is more useful in a general sense, though: getting used to working with nested tibbles / list columns that you tidyr::unnest at the end is something you can apply to a wide variety of situations...