efficient way to do pattern match

veda · October 21, 2019, 4:36am

Hi R experts,

I want to extract sentences which contained a pre-specified key word and its context word(s) which occurred within the distance of five words from the keyword .

For instance, I have a list of keywords and its context words, like this:

list=data.frame(keyword=c('apple','orange','cat'),contextword=c('pick','peel,throw','stroke,keep,pat'))

I have a to-be-searched list, like this:

input=data.frame(x=c('I pick an apple','I did not eat an orange today yet. Did you ever throw it out.','I keep the cat. You pat the cat'))

Specifically, I want to extract sentences in "input" where keyword and context word appear less than five words apart.

So, the output should look like this:

output=data.frame(x=c('I pick an apple','','I keep the cat. you pat the cat'))

"I did not eat an orange today yet. Did you ever throw it out." did not meet the criteria because "orange" and "throw" appear in the distance larger than five words.

This could be done with loop. However, I have more than hundreds of keywords and more than hundreds thousands to-be searched sentences. loop would be too slow. Could you please share some efficient ways to do this? Thanks.

Best,
Veda

input=data.frame(x=c('I eat apple','I peel orange','I keep the cat. you pat the cat'))

Yarnabrina · October 21, 2019, 8:09am

I hope the following does what you want, though I don't really like my solution. I've also assumed that in case there are more than one keywords in an input, you'll keep the line unchanged in case there is at least one context word in less than five words distance for any one of them. For your particular case, probably you'll have to modify a little bit here and there. Another point is that you can use paste instead of str_c and lapply/mapply instead of map/map2.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(stringr)
library(tidyr)

search_list <- data.frame(keyword = c("apple", "orange", "cat"),
                          contextword = c("pick", "peel,throw", "stroke,keep,pat"),
                          stringsAsFactors = FALSE)

to_be_searched <- data.frame(inputs = c("I eat apple", "I did not eat an orange today yet. Did you ever throw it out.", "I peel orange", "I pick an apple", "I keep the cat. You pat the cat"),
                             stringsAsFactors = FALSE)

window_selection <- function(keyword_positions,
                             cleaned_line_as_vector,
                             window_size)
{
    map(.x = keyword_positions,
        .f = function(keyword_position) str_c(cleaned_line_as_vector[seq(from = max(keyword_position - window_size,
                                                                                    1),
                                                                         to = min(keyword_position + window_size,
                                                                                  length(x = cleaned_line_as_vector)))],
                                              collapse = " "))
}

results_after_search <- to_be_searched %>%
    mutate(match_position_in_search_list = map_int(.x = inputs,
                                                   .f = ~ str_which(string = .x,
                                                                    pattern = search_list$keyword)),
           matched_row = map(.x = match_position_in_search_list,
                             .f = ~ slice(.data = search_list,
                                          .x))) %>%
    unnest_wider(col = matched_row) %>%
    mutate(contextword_as_regex = str_replace_all(string = contextword,
                                                  pattern = ",",
                                                  replacement = "|"),
           cleaned_inputs = str_remove_all(string = inputs,
                                           pattern = "[:punct:]"),
           cleaned_inputs_as_vector = str_split(string = cleaned_inputs,
                                                pattern = " "),
           match_position_of_keyword = map2(.x = cleaned_inputs_as_vector,
                                            .y = keyword,
                                            .f = ~ str_which(string = .x,
                                                             pattern = .y)),
           window_of_five = map2(.x = match_position_of_keyword,
                                 .y = cleaned_inputs_as_vector,
                                 .f = ~ window_selection(keyword_positions = .x,
                                                         cleaned_line_as_vector = .y,
                                                         window_size = 5L)),
           keep_or_not = map2_lgl(.x = window_of_five,
                                  .y = contextword_as_regex,
                                  .f = ~ any(str_detect(string = .x,
                                                        pattern = .y)))) %>%
    transmute(modified_inputs = replace(x = inputs,
                                        list = !keep_or_not,
                                        values = ""))

search_list
#>   keyword     contextword
#> 1   apple            pick
#> 2  orange      peel,throw
#> 3     cat stroke,keep,pat
to_be_searched
#>                                                          inputs
#> 1                                                   I eat apple
#> 2 I did not eat an orange today yet. Did you ever throw it out.
#> 3                                                 I peel orange
#> 4                                               I pick an apple
#> 5                               I keep the cat. You pat the cat
results_after_search
#> # A tibble: 5 x 1
#>   modified_inputs                
#>   <chr>                          
#> 1 ""                             
#> 2 ""                             
#> 3 I peel orange                  
#> 4 I pick an apple                
#> 5 I keep the cat. You pat the cat

^{Created on 2019-10-21 by the reprex package (v0.3.0)}

I don't have a large dataset, so can't check how "efficient" it is compared to other solutions. I'm sure others will also post elegant solutions, so can I request you to provide a benchmark at the end comparing different solutions?

valeri · October 21, 2019, 8:36am

This is what I came up with

Interested to know also which one is faster. This one assumes that spaces are always word separators and a comma always separates keywords and/or context words. That can be generalised if needed.

library(tidyverse)
library(purrr)

list=tibble(keyword=c('apple','orange','cat'),contextword=c('pick','peel,throw','stroke,keep,pat'))

input=tibble(x=c('I pick an apple','I did not eat an orange today yet. Did you ever throw it out.','I keep the cat. You pat the cat'))

words <- list$keyword
context <- list$contextword
sentences <- input$x

word_matches <- map2(.x=words, .y = sentences, .f = function(x, y) {
    sapply(str_split(x, ",", simplify = TRUE), function(word){match(word,str_split(y, " ", simplify = TRUE))})
})

context_matches <- map2(.x=context, .y = sentences, .f = function(x, y) {
    sapply(str_split(x, ",", simplify = TRUE), function(word){match(word,str_split(y, " ", simplify = TRUE))})
})

index_condition <- map2_lgl(.x = word_matches, .y = context_matches, ~ min(abs(.x-.y), na.rm = TRUE)<5  )

output <- tibble(x=if_else(index_condition, input$x, ""))

output
#> # A tibble: 3 x 1
#>   x                              
#>   <chr>                          
#> 1 I pick an apple                
#> 2 ""                             
#> 3 I keep the cat. You pat the cat

Created on 2019-10-21 by the reprex package (v0.3.0)

andresrcs · October 21, 2019, 3:11pm

This is another option using "regular expressions"

library(stringr)

list <- data.frame(
    stringsAsFactors = FALSE,
    keyword = c('apple', 'orange', 'cat'),
    contextword = c('pick','peel,throw', 'stroke,keep,pat')
)

input <- data.frame(
    stringsAsFactors = FALSE,
    x = c('I pick an apple', 'I did not eat an orange today yet. Did you ever throw it out.',
          'I keep the cat. You pat the cat')
)

expression <- as.character(glue::glue("<<list$keyword>>([:punct:]?\\s+\\w+){0,4}[:punct:]?\\s+(<<str_replace_all(list$contextword, ',', '|')>>)|
                         (<<str_replace_all(list$contextword, ',', '|')>>)([:punct:]?\\s+\\w+){0,4}[:punct:]?\\s+<<list$keyword>>",
                         .open = "<<", .close = ">>"))

str_subset(input$x, regex(expression, comments = TRUE, ignore_case = TRUE))
#> [1] "I pick an apple"                 "I keep the cat. You pat the cat"

Most likely the regex could be simplified to make it look better

pete · October 21, 2019, 8:07pm

You can also look here for R packages that might help: https://cran.r-project.org/view=NaturalLanguageProcessing

system · November 11, 2019, 8:07pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.