Efficient method for replacing many words

As in your other recent post, this seems like a pretty good job for left_join(). The additional step here is to "tokenize" your data into a tidy text format using unnest_tokens() from the wonderful tidytext package before joining with your lookup table.

First, we take your original text and create a new tibble where each row is a single word, with an identifier for the original row (along the way, unnest_tokens() also converts each word to lower case and, by default, strips punctuation).

library(tidyverse)
library(tidytext)

dat_orig <- tibble(TEXT = c(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "Fusce nec quam ut tortor interdum pulvinar id vitae magna.",
  "Curabitur commodo consequat arcu et lacinia.",
  "Proin at diam vitae lectus dignissim auctor nec dictum lectus.",
  "Fusce venenatis eros congue velit feugiat, ac aliquam ipsum gravida."
))

recode_table <- tibble(
  ORIG = c("lorem", "ipsum", "magna", "fusce", "lectus"),
  NEW = c("APPLE", "BANANA", "CHERRY", "DAIKON", "EGGPLANT")
)

tidy_text <- dat_orig %>%
  mutate(id = row_number()) %>%
  unnest_tokens(ORIG, TEXT)
tidy_text
#> # A tibble: 44 x 2
#>       id ORIG       
#>    <int> <chr>      
#>  1     1 lorem      
#>  2     1 ipsum      
#>  3     1 dolor      
#>  4     1 sit        
#>  5     1 amet       
#>  6     1 consectetur
#>  7     1 adipiscing 
#>  8     1 elit       
#>  9     2 fusce      
#> 10     2 nec        
#> # ... with 34 more rows
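If the original capitalization matters, one option is to turn off the lower-casing and keep a separate lower-case column to join on. This is only a sketch: the to_lower argument is part of unnest_tokens(), but the column name WORD and the one-row example tibble are made up for illustration.

```r
library(dplyr)
library(tidytext)

# Illustrative one-row input (not the data from the post)
dat_small <- tibble(TEXT = "Lorem ipsum dolor sit amet.")

tokens_cased <- dat_small %>%
  mutate(id = row_number()) %>%
  # to_lower = FALSE keeps each word's original capitalization
  unnest_tokens(WORD, TEXT, to_lower = FALSE) %>%
  # lower-case copy to use as the join key against the lookup table
  mutate(ORIG = tolower(WORD))
```

Joining on ORIG then still matches a lower-case lookup table, while WORD keeps the spelling you would fall back to when no replacement is found.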

Then, left_join() with the lookup table, and if a replacement is found, use that new value, and if not, keep the original value.

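tidy_recode <- tidy_text %>%
  # name the join key explicitly so the join does not have to guess it
  left_join(recode_table, by = "ORIG") %>%
  # use the replacement where one was found, otherwise keep the original word
  mutate(NEW = if_else(is.na(NEW), ORIG, NEW)) %>%
  select(-ORIG)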
tidy_recode
#> # A tibble: 44 x 2
#>       id NEW        
#>    <int> <chr>      
#>  1     1 APPLE      
#>  2     1 BANANA     
#>  3     1 dolor      
#>  4     1 sit        
#>  5     1 amet       
#>  6     1 consectetur
#>  7     1 adipiscing 
#>  8     1 elit       
#>  9     2 DAIKON     
#> 10     2 nec        
#> # ... with 34 more rows
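To make the "replace if found, otherwise keep" logic concrete, here is a minimal base R sketch of the same lookup using match() instead of left_join(). The vectors are illustrative stand-ins, not objects from the pipeline above.

```r
# Illustrative inputs: a few tokens and a two-entry lookup
words    <- c("lorem", "ipsum", "dolor")
orig     <- c("lorem", "ipsum")
new_vals <- c("APPLE", "BANANA")

# Position of each word in the lookup, NA where there is no match
idx <- match(words, orig)

# Take the replacement where a match exists, else keep the word
recoded <- ifelse(is.na(idx), words, new_vals[idx])
recoded
#> [1] "APPLE"  "BANANA" "dolor"
```

This is the same idea as the join: match() plays the role of left_join(), and the ifelse() fallback plays the role of the if_else() step.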

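Finally, collapse the tidy text tibble back into its original shape, where each row is a sentence, and add a period! Note that the original capitalization and the commas are not restored, because unnest_tokens() dropped them.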

tidy_recode %>%
  # one group per original sentence
  group_by(id) %>%
  # glue the words back together with spaces and re-append the period
  summarise(NEW_TEXT = paste0(paste(NEW, collapse = " "), ".")) %>%
  select(NEW_TEXT)
#> # A tibble: 5 x 1
#>   NEW_TEXT                                                             
#>   <chr>                                                                
#> 1 APPLE BANANA dolor sit amet consectetur adipiscing elit.             
#> 2 DAIKON nec quam ut tortor interdum pulvinar id vitae CHERRY.         
#> 3 curabitur commodo consequat arcu et lacinia.                         
#> 4 proin at diam vitae EGGPLANT dignissim auctor nec dictum EGGPLANT.   
#> 5 DAIKON venenatis eros congue velit feugiat ac aliquam BANANA gravida.

Created on 2018-03-06 by the reprex package (v0.2.0).
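If you need to do this more than once, the three steps could be wrapped in a small helper. This is a sketch under assumptions: recode_words() is a hypothetical name, the lookup table is assumed to have columns ORIG and NEW with one row per ORIG value, and the result is lower-cased with punctuation removed, as in the steps above.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: replace words in `text` using `lookup` (columns ORIG, NEW)
recode_words <- function(text, lookup) {
  tibble(id = seq_along(text), TEXT = text) %>%
    # one row per word, lower-cased, punctuation stripped
    unnest_tokens(ORIG, TEXT) %>%
    left_join(lookup, by = "ORIG") %>%
    # fall back to the original word when no replacement exists
    mutate(NEW = coalesce(NEW, ORIG)) %>%
    group_by(id) %>%
    summarise(NEW_TEXT = paste(NEW, collapse = " ")) %>%
    pull(NEW_TEXT)
}

lookup_small <- tibble(ORIG = "lorem", NEW = "APPLE")
result <- recode_words("Lorem ipsum dolor.", lookup_small)
result
#> [1] "APPLE ipsum dolor"
```

One caveat: an input string with no words at all (for example an empty string) produces no tokens, so it would be missing from the result rather than returned as an empty string.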
