I hope the following does what you want, though I'm not entirely happy with my solution. I've assumed that when an input contains more than one keyword, you keep the line unchanged if at least one context word lies within five words of any of them. For your particular case you'll probably have to adjust a few details here and there. Another point is that you can use `paste` instead of `str_c`, and `lapply`/`mapply` instead of `map`/`map2`.
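To illustrate that last remark, here is a minimal base-R sketch of the windowing step using `paste` and `lapply` (the word vector, keyword position, and window size below are made up for illustration):

```r
# Base-R sketch: paste() instead of stringr::str_c() and
# lapply() instead of purrr::map().
words <- strsplit("a b c d e f g", " ")[[1]]
keyword_positions <- c(4L)  # pretend the keyword is the 4th word
window_size <- 2L

windows <- lapply(keyword_positions, function(pos) {
  # Clamp the window to the bounds of the word vector
  idx <- seq(from = max(pos - window_size, 1L),
             to = min(pos + window_size, length(words)))
  paste(words[idx], collapse = " ")
})
windows[[1]]
#> [1] "b c d e f"
```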
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(purrr)
library(stringr)
library(tidyr)
search_list <- data.frame(keyword = c("apple", "orange", "cat"),
                          contextword = c("pick", "peel,throw", "stroke,keep,pat"),
                          stringsAsFactors = FALSE)
to_be_searched <- data.frame(inputs = c("I eat apple",
                                        "I did not eat an orange today yet. Did you ever throw it out.",
                                        "I peel orange",
                                        "I pick an apple",
                                        "I keep the cat. You pat the cat"),
                             stringsAsFactors = FALSE)

# For every keyword position, collect the words within `window_size`
# words on either side and paste them back into one string.
window_selection <- function(keyword_positions,
                             cleaned_line_as_vector,
                             window_size) {
  map(.x = keyword_positions,
      .f = function(keyword_position) {
        str_c(cleaned_line_as_vector[seq(from = max(keyword_position - window_size, 1),
                                         to = min(keyword_position + window_size,
                                                  length(x = cleaned_line_as_vector)))],
              collapse = " ")
      })
}
results_after_search <- to_be_searched %>%
  # Find which row of search_list matches each input and pull it in.
  mutate(match_position_in_search_list = map_int(.x = inputs,
                                                 .f = ~ str_which(string = .x,
                                                                  pattern = search_list$keyword)),
         matched_row = map(.x = match_position_in_search_list,
                           .f = ~ slice(.data = search_list,
                                        .x))) %>%
  unnest_wider(col = matched_row) %>%
  # Turn "peel,throw" into the regex "peel|throw", strip punctuation,
  # and split each input into a vector of words.
  mutate(contextword_as_regex = str_replace_all(string = contextword,
                                                pattern = ",",
                                                replacement = "|"),
         cleaned_inputs = str_remove_all(string = inputs,
                                         pattern = "[:punct:]"),
         cleaned_inputs_as_vector = str_split(string = cleaned_inputs,
                                              pattern = " "),
         # Locate the keyword and extract a five-word window around it.
         match_position_of_keyword = map2(.x = cleaned_inputs_as_vector,
                                          .y = keyword,
                                          .f = ~ str_which(string = .x,
                                                           pattern = .y)),
         window_of_five = map2(.x = match_position_of_keyword,
                               .y = cleaned_inputs_as_vector,
                               .f = ~ window_selection(keyword_positions = .x,
                                                       cleaned_line_as_vector = .y,
                                                       window_size = 5L)),
         # Keep the line if any window contains a context word.
         keep_or_not = map2_lgl(.x = window_of_five,
                                .y = contextword_as_regex,
                                .f = ~ any(str_detect(string = .x,
                                                      pattern = .y)))) %>%
  # Blank out the inputs that failed the context-word test.
  transmute(modified_inputs = replace(x = inputs,
                                      list = !keep_or_not,
                                      values = ""))
search_list
#>   keyword     contextword
#> 1   apple            pick
#> 2  orange      peel,throw
#> 3     cat stroke,keep,pat
to_be_searched
#>                                                          inputs
#> 1                                                    I eat apple
#> 2 I did not eat an orange today yet. Did you ever throw it out.
#> 3                                                  I peel orange
#> 4                                                I pick an apple
#> 5                                I keep the cat. You pat the cat
results_after_search
#> # A tibble: 5 x 1
#>   modified_inputs
#>   <chr>
#> 1 ""
#> 2 ""
#> 3 I peel orange
#> 4 I pick an apple
#> 5 I keep the cat. You pat the cat
Created on 2019-10-21 by the reprex package (v0.3.0)
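To show the matching step in isolation: the comma-separated context words become a regex alternation, and a line is kept when any extracted window matches it. A minimal base-R sketch (the window string below is made up; `str_replace_all`/`str_detect` behave the same way here):

```r
contextword <- "peel,throw"
window <- "not eat an orange today yet Did you ever throw it"

# "peel,throw" becomes the alternation "peel|throw"
pattern <- gsub(",", "|", contextword, fixed = TRUE)

# TRUE because "throw" occurs inside the window
grepl(pattern, window)
#> [1] TRUE
```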
I don't have a large dataset, so I can't check how efficient this is compared to other solutions. I'm sure others will post elegant solutions as well, so may I request that you add a benchmark at the end comparing the different approaches?