Convert complicated string into data frame

Hi all!

I have a pretty convoluted character string that I'm trying to convert into a data frame. Here's an example of what the string looks like:

> string <- "A. MYTHOLOGICAL MOTIFS a. Creator A0. - A87. + . (2) 1, 2; (4) 4. b. Gods A101. - A493. (1) 19, 46; (2) 1, 2. B. ANIMALS a. Mythical animals B0. - B98. + . (1) 62, 63; (23) 6, 7. b. Magic animals B101. + . - B192.2. + . (1) 24; (7) 5, 33."

I want to make a row for each number that follows a number in parentheses (so, starting from the beginning, those would 1, 2, 3, 4, 5, 19, 46, etc.) (in the original dataset, these are story IDs). You can think about those hierarchically organized under three levels:

  • a number in parentheses (the collection in which a story is found)
  • a motif identified with a lowercase letter (e.g., "a. Creator A0. - A87. + ." or "b. Gods A101. - A493.")
  • a category of motifs (e.g., "A. MYTHOLOGICAL MOTIFS")

I want the fourth column to contain each story ID; the third column to contain the number in the parentheses immediately preceding that number (the collection); the second column to include the name or an identifier for the motif (the names labeled with the lowercase letters); and the fourth column to contain the class of motifs (the names labeled with the uppercase letters).

So, for the above string, the data frame would come out as follows:

> dat <- data.frame("Motif_category" = c(rep("A. MYTHOLOGICAL MOTIFS", 7), rep("B. ANIMALS", 7)),
+                  "Motif" = c(rep("a. Creator A0. - A87. + .", 3), rep("b. Gods A101. - A493.", 4),
+                              rep("a. Mythical animals B0. - B908. + .", 4), rep("b. Magic animals B101. + . - B192.2. + .", 3)),
+                  "Collection" = c(2, 2, 4, 1, 1, 2, 2, 1, 1, 23, 23, 1, 7, 7),
+                  "Story_no" = c(1, 2, 4, 19, 46, 1, 2, 62, 63, 6, 7, 24, 5, 33))
> dat 
           Motif_category                                    Motif Collection Story_no
1  A. MYTHOLOGICAL MOTIFS                a. Creator A0. - A87. + .          2        1
2  A. MYTHOLOGICAL MOTIFS                a. Creator A0. - A87. + .          2        2
3  A. MYTHOLOGICAL MOTIFS                a. Creator A0. - A87. + .          4        4
4  A. MYTHOLOGICAL MOTIFS                    b. Gods A101. - A493.          1       19
5  A. MYTHOLOGICAL MOTIFS                    b. Gods A101. - A493.          1       46
6  A. MYTHOLOGICAL MOTIFS                    b. Gods A101. - A493.          2        1
7  A. MYTHOLOGICAL MOTIFS                    b. Gods A101. - A493.          2        2
8              B. ANIMALS      a. Mythical animals B0. - B908. + .          1       62
9              B. ANIMALS      a. Mythical animals B0. - B908. + .          1       63
10             B. ANIMALS      a. Mythical animals B0. - B908. + .         23        6
11             B. ANIMALS      a. Mythical animals B0. - B908. + .         23        7
12             B. ANIMALS b. Magic animals B101. + . - B192.2. + .          1       24
13             B. ANIMALS b. Magic animals B101. + . - B192.2. + .          7        5
14             B. ANIMALS b. Magic animals B101. + . - B192.2. + .          7       33

If anyone has any thoughts, that would be greatly appreciated!

Wow, this was a fun one!

I think I lost sight of this one a bit because it seems way over convoluted. Anyway, I managed to recreate the desired output (based on the string example, which seems has a slight difference from the desired output shown)

library(tidyr)
library(dplyr)
library(purrr)
library(ggplot2)

  
string %>% 
  strsplit('[A-Z]\\.') %>% 
  unlist() %>%
  .[nchar(.) > 1] %>% 
  keep(~nchar(.) > 1) %>%
  strsplit('[a-z]\\.') %>% 
  imap(~crossing(motif_category = paste0(LETTERS[.y], '.', .x[1]),
                 motif = paste0(letters[1:length(.) - 1], '.', .x[-1]))) %>% 
  bind_rows() %>% 
  extract(motif, c('motif', 'collection'), '(^.+\\.)( \\(.+$)', ) %>% 
  separate_rows(collection, sep = '\\(') %>% 
  separate(collection, c('collection', 'story_id'), sep = '\\)', convert = TRUE) %>% 
  separate_rows(story_id, convert = TRUE) %>% 
  drop_na()
  
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 4 rows [1, 4, 7,
#> 10].
#> # A tibble: 14 x 4
#>    motif_category          motif                             collection story_id
#>    <chr>                   <chr>                                  <int>    <dbl>
#>  1 "A. MYTHOLOGICAL MOTIF~ a. Creator A0. - A87. + .                  2        1
#>  2 "A. MYTHOLOGICAL MOTIF~ a. Creator A0. - A87. + .                  2        2
#>  3 "A. MYTHOLOGICAL MOTIF~ a. Creator A0. - A87. + .                  4        4
#>  4 "A. MYTHOLOGICAL MOTIF~ b. Gods A101. - A493.                      1       19
#>  5 "A. MYTHOLOGICAL MOTIF~ b. Gods A101. - A493.                      1       46
#>  6 "A. MYTHOLOGICAL MOTIF~ b. Gods A101. - A493.                      2        1
#>  7 "A. MYTHOLOGICAL MOTIF~ b. Gods A101. - A493.                      2        2
#>  8 "B. ANIMALS "           a. Mythical animals B0. - B98. +~          1       62
#>  9 "B. ANIMALS "           a. Mythical animals B0. - B98. +~          1       63
#> 10 "B. ANIMALS "           a. Mythical animals B0. - B98. +~         23        6
#> 11 "B. ANIMALS "           a. Mythical animals B0. - B98. +~         23        7
#> 12 "B. ANIMALS "           b. Magic animals B101. + . - B19~          1       24
#> 13 "B. ANIMALS "           b. Magic animals B101. + . - B19~          7        5
#> 14 "B. ANIMALS "           b. Magic animals B101. + . - B19~          7       33
2 Likes

@jmcvw got further than I did

suppressPackageStartupMessages({
  library(dplyr)
  library(stringr)
})

# FUNCTIONS

get_colsto <- function(x) {
  str_extract_all(string,pattern_colsto)[[1]][x] %>% str_remove(.,pattern_semico)
}

make_rows <- function(x) {
  str_replace_all(x,pattern_init,"ZZ\\1") %>% str_split(.,pattern_split) -> a
  data.frame(string = a[[1]]) 
}

# SEARCH PATTERN

pattern_cat    <- "[:upper:]+\\.\\s+[A-Z]+"
pattern_init   <- "( [A-Z]\\.\\s)"
pattern_split  <- "ZZ "
(2)
#> [1] 2
pattern_id   <- '\\(\\d+\\)'
pattern_parens <- '[()]'
pattern_motif  <- '[a-z]+\\.\\s[:graph:]+\\s'#+[:graph:]+[:graph:]'
pattern_colsto <- '(\\d+)(;)'
pattern_semico <- ';'

# PREPROCESSING

long_string <- "A. MYTHOLOGICAL MOTIFS a. Creator A0. - A87. + . (2) 1, 2; (4) 4. b. Gods A101. - A493. (1) 19, 46; (2) 1, 2. B. ANIMALS a. Mythical animals B0. - B98. + . (1) 62, 63; (23) 6, 7. b. Magic animals B101. + . - B192.2. + . (1) 24; (7) 5, 33."

# DATA

my_df <- make_rows(long_string)

# MAIN

my_df %>%
  mutate(Motif_Category = str_trim(str_extract(string,pattern_cat)),
         Motif          = str_trim(str_extract(string,pattern_motif)))
#>                                                                                                                             string
#> 1                    A. MYTHOLOGICAL MOTIFS a. Creator A0. - A87. + . (2) 1, 2; (4) 4. b. Gods A101. - A493. (1) 19, 46; (2) 1, 2.
#> 2 B. ANIMALS a. Mythical animals B0. - B98. + . (1) 62, 63; (23) 6, 7. b. Magic animals B101. + . - B192.2. + . (1) 24; (7) 5, 33.
#>    Motif_Category       Motif
#> 1 A. MYTHOLOGICAL  a. Creator
#> 2      B. ANIMALS a. Mythical

Created on 2020-10-19 by the reprex package (v0.3.0.9001)

My partial solution aimed at keeping the output tidy, such that each category would have a single row with one or multiple motifs. Consider pivot_wider for @jmcvw's solution.

1 Like