Compare text from a DF column to a list of keywords and subset rows that match the keywords

I have a data frame with multiple columns and about 500 rows. I want to compare the text in one of the columns to a list of keywords and then subset the rows where the text matches any of the keywords.

As an example, my DF is:

Text DF: 

SL_NO	Index_No	TC_1	TC_2	TC_3
1	    17002	    …	     …	    The trees in the plantation are bananas
2	    25003	    …	     …	    There are coconut trees 30 miles from here
3	    58016	    …	     …	    Sugarcane needs a lot of water to grow

Keywords_DF:
Sugarcane
Coconut
Bananas

I want to compare the text in TC_3 with the keywords in Keywords_DF and subset the SL_NO and Index_No of every row that matches a keyword.

Is there a simple way of going about this? I do not want to do a loop because the Text_DF will grow large over time.

Thanks

Whatever happens, you will have some form of nested looping, over TC_3 and over the keywords. One way to make the loop over the keywords efficient is to use %in%. The only catch is that you might have to explicitly account for case (upper/lowercase).

library(tidyverse)

# Note: the columns in the text below must be separated by tabs to match sep = "\t"
df <- read.table(text = "SL_NO  Index_No    TC_1    TC_2    TC_3
1   17002   …   …   The trees in the plantation are bananas
2   25003   …   …   There are coconut trees 30 miles from here
3   58016   …   …   Sugarcane needs a lot of water to grow",
header = TRUE,
row.names=NULL,
sep = "\t")

df
#>   SL_NO Index_No TC_1 TC_2                                       TC_3
#> 1     1    17002    …    …    The trees in the plantation are bananas
#> 2     2    25003    …    … There are coconut trees 30 miles from here
#> 3     3    58016    …    …     Sugarcane needs a lot of water to grow

keywords <- scan(text = "Sugarcane
Coconut
Bananas",what = "character")

keywords
#> [1] "Sugarcane" "Coconut"   "Bananas"

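# Split each sentence into words, then test whether any word is a keyword.
# %in% is case-sensitive, so only "Sugarcane" matches here.
df |>
  mutate(words_in_TC_3 = str_split(TC_3, " "),
         has_match = map_lgl(words_in_TC_3, ~ any(.x %in% keywords)))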
#>   SL_NO Index_No TC_1 TC_2                                       TC_3
#> 1     1    17002    …    …    The trees in the plantation are bananas
#> 2     2    25003    …    … There are coconut trees 30 miles from here
#> 3     3    58016    …    …     Sugarcane needs a lot of water to grow
#>                                       words_in_TC_3 has_match
#> 1     The, trees, in, the, plantation, are, bananas     FALSE
#> 2 There, are, coconut, trees, 30, miles, from, here     FALSE
#> 3     Sugarcane, needs, a, lot, of, water, to, grow      TRUE

Created on 2022-05-09 by the reprex package (v2.0.1)

Another approach would be to make the loop on TC_3 more efficient, with something like:

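# One vectorised str_detect() call per keyword (one column each),
# then flag the rows where at least one column is TRUE
map_dfc(keywords,
        ~ str_detect(df$TC_3, .x)) |>
  as.matrix() |>
  matrixStats::rowAnys()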
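
For comparison, here is a minimal base R sketch of the same idea (one vectorised call per keyword, then a row-wise OR), assuming you would rather not depend on purrr or matrixStats. It swaps in grepl() and Reduce() for str_detect() and rowAnys(), the names hits_per_keyword and has_match are only illustrative, and TC_1 and TC_2 are left out for brevity. Like the snippet above it is case-sensitive, so only the Sugarcane row is flagged.

```r
# Data from the question (TC_1 and TC_2 omitted for brevity)
df <- data.frame(
  SL_NO    = c(1L, 2L, 3L),
  Index_No = c(17002L, 25003L, 58016L),
  TC_3     = c("The trees in the plantation are bananas",
               "There are coconut trees 30 miles from here",
               "Sugarcane needs a lot of water to grow"),
  stringsAsFactors = FALSE
)
keywords <- c("Sugarcane", "Coconut", "Bananas")

# One vectorised grepl() call per keyword; each call returns a logical
# vector with one element per row of df (fixed = TRUE: literal match)
hits_per_keyword <- lapply(keywords, function(k) grepl(k, df$TC_3, fixed = TRUE))

# Element-wise OR across keywords: TRUE where at least one keyword was found
has_match <- Reduce(`|`, hits_per_keyword)

# Keep only the identifier columns of the matching rows
df[has_match, c("SL_NO", "Index_No")]
```

The loop runs once per keyword rather than once per row, so it should stay cheap as the number of rows grows, provided the keyword list stays short.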

Thank you for the response, @AlexisW. I did assume that a loop would be unavoidable, but I was hoping there would be a solution without one.

"Only thing is you might have to explicitly account for the case (upper/lowercase)."

To account for uppercase and lowercase, I will convert both TC_3 and the keywords to lowercase with tolower() before I run the comparison.
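
A minimal sketch of that plan, assuming the whole-word %in% comparison from the earlier reply and base R only. The variable names (keywords_lower, words, has_match, matched) are illustrative, and TC_1 and TC_2 are left out for brevity.

```r
# Data from the question (TC_1 and TC_2 omitted for brevity)
df <- data.frame(
  SL_NO    = c(1L, 2L, 3L),
  Index_No = c(17002L, 25003L, 58016L),
  TC_3     = c("The trees in the plantation are bananas",
               "There are coconut trees 30 miles from here",
               "Sugarcane needs a lot of water to grow"),
  stringsAsFactors = FALSE
)
keywords <- c("Sugarcane", "Coconut", "Bananas")

# Lowercase both sides so the comparison ignores case
keywords_lower <- tolower(keywords)
words <- strsplit(tolower(df$TC_3), " ", fixed = TRUE)

# For each row, TRUE if any of its words is exactly one of the keywords
has_match <- vapply(words, function(w) any(w %in% keywords_lower), logical(1))

# Keep only SL_NO and Index_No for the matching rows
matched <- df[has_match, c("SL_NO", "Index_No")]
matched
```

With the example data all three rows match. Because %in% compares whole words, a word such as "coconuts" or a keyword followed directly by punctuation ("bananas.") would not count as a match.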

This is another option:

library(dplyr)
library(stringr)

df <- data.frame(
  stringsAsFactors = FALSE,
             SL_NO = c(1L, 2L, 3L),
          Index_No = c(17002L, 25003L, 58016L),
              TC_1 = c("…", "…", "…"),
              TC_2 = c("…", "…", "…"),
              TC_3 = c("The trees in the plantation are bananas","There are coconut trees 30 miles from here",
                       "Sugarcane needs a lot of water to grow")
)

keywords <- c('Sugarcane', 'Coconut', 'Bananas')

df %>% 
    filter(str_detect(TC_3, regex(paste0(keywords, collapse = "|"), ignore_case = TRUE)))
#>   SL_NO Index_No TC_1 TC_2                                       TC_3
#> 1     1    17002    …    …    The trees in the plantation are bananas
#> 2     2    25003    …    … There are coconut trees 30 miles from here
#> 3     3    58016    …    …     Sugarcane needs a lot of water to grow

Created on 2022-05-10 by the reprex package (v2.0.1)
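
Since the question asks only for SL_NO and Index_No, the same filter can be followed by select(). The sketch below assumes that is the desired output. The fourth row (SL_NO 4, Index_No 99999, "Wheat is harvested in summer") is made up purely to show a non-matching row being dropped, and TC_1 and TC_2 are left out for brevity.

```r
library(dplyr)
library(stringr)

df <- data.frame(
  SL_NO    = c(1L, 2L, 3L, 4L),
  Index_No = c(17002L, 25003L, 58016L, 99999L),  # 99999 is a made-up value
  TC_3     = c("The trees in the plantation are bananas",
               "There are coconut trees 30 miles from here",
               "Sugarcane needs a lot of water to grow",
               "Wheat is harvested in summer"),   # hypothetical non-matching row
  stringsAsFactors = FALSE
)
keywords <- c("Sugarcane", "Coconut", "Bananas")

# Single case-insensitive pattern: "Sugarcane|Coconut|Bananas"
pattern <- regex(paste0(keywords, collapse = "|"), ignore_case = TRUE)

# Keep the matching rows, then only the two identifier columns
matched <- df %>%
  filter(str_detect(TC_3, pattern)) %>%
  select(SL_NO, Index_No)

matched
```

Note that this matches substrings, so "Coconut" would also match "coconuts". Keywords containing regex metacharacters (such as "." or "+") would need to be escaped before being pasted into the pattern.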

Thank you for this @andresrcs!!!
