Efficient method for replacing many words

tiernan · March 4, 2018, 7:44pm

I have a vector of sentences which contains many words that need to be replaced (bolded):

TEXT
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Fusce nec quam ut tortor interdum pulvinar id vitae magna.
Curabitur commodo consequat arcu et lacinia.
Proin at diam vitae lectus dignissim auctor nec dictum lectus.
Fusce venenatis eros congue velit feugiat, ac aliquam ipsum gravida.

I also have a tibble which contains a column for the target words (ORIG) and a column for their replacements (NEW):

ORIG	NEW
lorem	APPLE
ipsum	BANANA
magna	CHERRY
fusce	DAIKON
lectus	EGGPLANT

In this example there are only five words to be replaced but my actual use case involves about 100 target words, so I'd like to find an efficient, programmatic way of returning the following result (bolding for clarity only):

TEXT	NEW TEXT
Lorem ipsum dolor sit amet, consectetur adipiscing elit.	APPLE BANANA dolor sit amet, consectetur adipiscing elit.
Fusce nec quam ut tortor interdum pulvinar id vitae magna.	DAIKON nec quam ut tortor interdum pulvinar id vitae CHERRY.
Curabitur commodo consequat arcu et lacinia.	Curabitur commodo consequat arcu et lacinia.
Proin at diam vitae lectus dignissim auctor nec dictum lectus.	Proin at diam vitae EGGPLANT dignissim auctor nec dictum EGGPLANT.
Fusce venenatis eros congue velit feugiat, ac aliquam ipsum gravida.	DAIKON venenatis eros congue velit feugiat, ac aliquam BANANA gravida.

What is an efficient way of doing this string replacement?

So far I've played around with passing a named vector to str_replace_all(), but I've been unable to overcome case sensitivity (see the reprex below). My gut tells me there's probably a way to do this using fuzzyjoin::regex_inner_join(), but I haven't been able to crack it.

Any suggestions would be appreciated!

Reprex

library(tidyverse) 

dat_orig <- tibble(TEXT = c(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "Fusce nec quam ut tortor interdum pulvinar id vitae magna.",
  "Curabitur commodo consequat arcu et lacinia.",
  "Proin at diam vitae lectus dignissim auctor nec dictum lectus.",
  "Fusce venenatis eros congue velit feugiat, ac aliquam ipsum gravida."
))

recode_table <- tibble(
  ORIG = c("lorem", "ipsum", "magna", "fusce", "lectus"),
  NEW = c("APPLE", "BANANA", "CHERRY", "DAIKON", "EGGPLANT")
)

name_tbl_vector <- function(x, name, value) {
  x %>%
    transpose() %>%
    {
      set_names(map_chr(., value), map_chr(., name))
    }
}

key <- name_tbl_vector(recode_table, name = "ORIG", value = "NEW")

# This almost works, but it fails to replace "Lorem" and other targets that have capitalized letters
dat_orig %>%
  mutate(NEW_TEXT = str_replace_all(TEXT, key)) 
## # A tibble: 5 x 2
##   TEXT                                NEW_TEXT                            
##   <chr>                               <chr>                               
## 1 Lorem ipsum dolor sit amet, consec~ Lorem BANANA dolor sit amet, consec~
## 2 Fusce nec quam ut tortor interdum ~ Fusce nec quam ut tortor interdum p~
## 3 Curabitur commodo consequat arcu e~ Curabitur commodo consequat arcu et~
## 4 Proin at diam vitae lectus digniss~ Proin at diam vitae EGGPLANT dignis~
## 5 Fusce venenatis eros congue velit ~ Fusce venenatis eros congue velit f~

sergiouribe · March 4, 2018, 8:43pm

Maybe this can help

danr · March 4, 2018, 8:47pm

Efficiency depends on what you want to do efficiently, but I think this produces the results you are looking for. It uses the tidyverse to do a lot of the heavy lifting.

Thanks for including a reprex!

suppressMessages(library(tidyverse))

dat_orig <- tibble(TEXT = c(
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
    "Fusce nec quam ut tortor interdum pulvinar id vitae magna.",
    "Curabitur commodo consequat arcu et lacinia.",
    "Proin at diam vitae lectus dignissim auctor nec dictum lectus.",
    "Fusce venenatis eros congue velit feugiat, ac aliquam ipsum gravida."
))

recode_table <- tibble(
    ORIG = c("lorem", "ipsum", "magna", "fusce", "lectus"),
    NEW = c("APPLE", "BANANA", "CHERRY", "DAIKON", "EGGPLANT")
)

# make a regex from recode_table$ORIG
use_regex <- regex(str_c(recode_table$ORIG, collapse="|"), ignore_case = TRUE)

dat_orig$TEXT %>% map_chr(~ str_replace_all(.,
        use_regex,
        function(m) {
            m1 <- str_to_lower(m[[1]])
            # find row that has match and use the NEW from that row
            filter(recode_table, ORIG == m1)$NEW
                }
))
#> [1] "APPLE BANANA dolor sit amet, consectetur adipiscing elit."             
#> [2] "DAIKON nec quam ut tortor interdum pulvinar id vitae CHERRY."          
#> [3] "Curabitur commodo consequat arcu et lacinia."                          
#> [4] "Proin at diam vitae EGGPLANT dignissim auctor nec dictum EGGPLANT."    
#> [5] "DAIKON venenatis eros congue velit feugiat, ac aliquam BANANA gravida."

Created on 2018-03-04 by the reprex package (v0.2.0).

tiernan · March 5, 2018, 8:01pm

Thanks for the suggestion, @danr.

The solution you presented returns my desired output. However, I was hoping to find an approach that can be wrapped up in a single, fairly-intuitive function. The giant regex string that use_regex will become in my real project is somewhat difficult to inspect. I'm also not confident that I could look back on this approach and remember how it all works!

Thanks again for the suggestion though!

tiernan · March 5, 2018, 8:03pm

While this is a useful trick (I use mutate_if() a lot for data cleaning), I'm only looking to replace the values of a single column, so I don't think it's a solution for the problem I described above.

mfherman · March 6, 2018, 2:39pm

As on your other recent post, this seems like a pretty good job for left_join(). The additional step here is to "tokenize" your data into a tidy text format using unnest_tokens() from the wonderful tidytext package before joining with your lookup table.

The process is that we take your original text and create a new tibble where each row is a single word with an identifier for the original row (along the way, unnest_tokens() also makes each word lower case).

library(tidyverse)
library(tidytext)

dat_orig <- tibble(TEXT = c(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "Fusce nec quam ut tortor interdum pulvinar id vitae magna.",
  "Curabitur commodo consequat arcu et lacinia.",
  "Proin at diam vitae lectus dignissim auctor nec dictum lectus.",
  "Fusce venenatis eros congue velit feugiat, ac aliquam ipsum gravida."
))

recode_table <- tibble(
  ORIG = c("lorem", "ipsum", "magna", "fusce", "lectus"),
  NEW = c("APPLE", "BANANA", "CHERRY", "DAIKON", "EGGPLANT")
)

tidy_text <- dat_orig %>%
  mutate(id = row_number()) %>%
  unnest_tokens(ORIG, TEXT)
tidy_text
#> # A tibble: 44 x 2
#>       id ORIG       
#>    <int> <chr>      
#>  1     1 lorem      
#>  2     1 ipsum      
#>  3     1 dolor      
#>  4     1 sit        
#>  5     1 amet       
#>  6     1 consectetur
#>  7     1 adipiscing 
#>  8     1 elit       
#>  9     2 fusce      
#> 10     2 nec        
#> # ... with 34 more rows

Then, left_join() with the lookup table, and if a replacement is found, use that new value, and if not, keep the original value.

tidy_recode <- tidy_text %>%
  left_join(recode_table) %>%
  mutate(NEW = if_else(is.na(NEW), ORIG, NEW)) %>%
  select(-ORIG)
#> Joining, by = "ORIG"
tidy_recode
#> # A tibble: 44 x 2
#>       id NEW        
#>    <int> <chr>      
#>  1     1 APPLE      
#>  2     1 BANANA     
#>  3     1 dolor      
#>  4     1 sit        
#>  5     1 amet       
#>  6     1 consectetur
#>  7     1 adipiscing 
#>  8     1 elit       
#>  9     2 DAIKON     
#> 10     2 nec        
#> # ... with 34 more rows

Finally, cast the tidy text tibble back into its original form where each row is a sentence and add a period!

tidy_recode %>%
  nest(NEW) %>%
  mutate(NEW_TEXT = map(data, unlist), 
         NEW_TEXT = map_chr(NEW_TEXT, paste, collapse = " "),
         NEW_TEXT = paste0(NEW_TEXT, ".")) %>%
  select(NEW_TEXT)
#> # A tibble: 5 x 1
#>   NEW_TEXT                                                             
#>   <chr>                                                                
#> 1 APPLE BANANA dolor sit amet consectetur adipiscing elit.             
#> 2 DAIKON nec quam ut tortor interdum pulvinar id vitae CHERRY.         
#> 3 curabitur commodo consequat arcu et lacinia.                         
#> 4 proin at diam vitae EGGPLANT dignissim auctor nec dictum EGGPLANT.   
#> 5 DAIKON venenatis eros congue velit feugiat ac aliquam BANANA gravida.

Created on 2018-03-06 by the reprex package (v0.2.0).

tiernan · March 6, 2018, 5:35pm

This is exactly what I was searching for: a way to use a left_join() rather than a long regex string.

I have played around with the tidytext package in other projects, but it did not occur to me to apply the "tokenization" method to my current project.

Thanks again!

dubrowg · May 18, 2018, 3:58pm

I have a similar problem that I think this helps answer. In a dataset of schools with a school code, school name district code, and district name - created by joining from two separate sources, there are some missing district names due to school name/code changes.

In SAS I'd reference a lookup format like this:

if district_name = '' then do;
district_name=(put(district_code,$districtformat));
(where districtformat is the name in the table , and I match to the district code)

But I'm migrating workflow to r, meaning learning new ways to do things that used to come easy.

Will the approach shown above accomplish the same thing? Replace the missing district names with values from a lookup table based on the district code?

tiernan · May 20, 2018, 5:08pm

Based on your description, I think dplyr::left_join() should be able to solve your problem - see the reproducible example below:

Reprex


library(tidyverse) 

districts <- tibble(
  DISTRICT_CODE = c("D001", "D002", "D003"),
  DISTRICT_NAME = c("District One", "District Two", "District Three")
)

schools <- tibble(
  SCHOOL_CODE = c("S0001", "S0002", "S0003", "S0004", "S0005"),
  SCHOOL_NAME = c("Washington", "Adams", "Jefferson", "Madison", "Obama"),
  DISTRICT_CODE = c("D001", "D002", "D003", "D001", "D002"),
  DISTRICT_NAME = c("District One", "District Two", "District Three", "District One", "District Two")
)

print(schools)
## # A tibble: 5 x 4
##   SCHOOL_CODE SCHOOL_NAME DISTRICT_CODE DISTRICT_NAME 
##   <chr>       <chr>       <chr>         <chr>         
## 1 S0001       Washington  D001          District One  
## 2 S0002       Adams       D002          District Two  
## 3 S0003       Jefferson   D003          District Three
## 4 S0004       Madison     D001          District One  
## 5 S0005       Obama       D002          District Two

set.seed(1986)

schools_incomplete <- schools %>%
  mutate(DISTRICT_NAME = map_chr(DISTRICT_NAME, ~ sample(c(.x, NA), 1, replace = TRUE)))  # replace a few DISTRICT_NAME values with NA

print(schools_incomplete)
## # A tibble: 5 x 4
##   SCHOOL_CODE SCHOOL_NAME DISTRICT_CODE DISTRICT_NAME
##   <chr>       <chr>       <chr>         <chr>        
## 1 S0001       Washington  D001          District One 
## 2 S0002       Adams       D002          District Two 
## 3 S0003       Jefferson   D003          <NA>         
## 4 S0004       Madison     D001          District One 
## 5 S0005       Obama       D002          <NA>

# Use dplyr::left_join()

schools_incomplete %>%
  left_join(districts, by = "DISTRICT_CODE", suffix = c("_MISSING", "_COMPLETE"))
## # A tibble: 5 x 5
##   SCHOOL_CODE SCHOOL_NAME DISTRICT_CODE DISTRICT_NAME_MI~ DISTRICT_NAME_C~
##   <chr>       <chr>       <chr>         <chr>             <chr>           
## 1 S0001       Washington  D001          District One      District One    
## 2 S0002       Adams       D002          District Two      District Two    
## 3 S0003       Jefferson   D003          <NA>              District Three  
## 4 S0004       Madison     D001          District One      District One    
## 5 S0005       Obama       D002          <NA>              District Two

By the way, it's a lot easier for people to help you if you give them some reproducible code to work with in addition to a written description. Check out the reprex package if you haven't already.