R: Identifying text strings within a data frame column using AND/OR conditions

One column of my data frame has words and phrases. I am trying to create a dummy variable for those fields within this column that have specific strings of text anywhere within the cell.

For example:

Y <- "HIGH VOLUME CREATES NUISANCE TO EVERYONE."

X <-c("(hIGH,VOLUME)|(HIGH,VOLUME)|(LOW,VOICE)")

Here, a comma (,) inside the brackets in X indicates an AND condition, and | indicates OR.

I want to identify all the fields whose text satisfies these AND/OR conditions.

If a field contains both words from any of these three word pairs, I want to put "VOLUME" in an additional column.

I've tried a few things such as any(), which() and %in%, but nothing has worked so far.

Any help greatly appreciated

If I understand your question correctly, I think a good way to go would be to use regular expressions, as in this example:

library(tidyverse)

sample_df <- data.frame(stringsAsFactors = FALSE,
                        text = c("HIGH VOLUME CREATES NUISANCE TO EVERYONE", "other text"))

sample_df %>% 
    mutate(new_column = if_else(str_detect(text, regex("high.*volume|low.*voice", ignore_case = TRUE)),
                                "VOLUME", NA_character_))
#>                                       text new_column
#> 1 HIGH VOLUME CREATES NUISANCE TO EVERYONE     VOLUME
#> 2                               other text       <NA>

Created on 2019-11-19 by the reprex package (v0.3.0.9000)
If this doesn't solve your problem, then please provide a proper REPRoducible EXample (reprex) illustrating your issue.
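One caveat about the pattern above: "high.*volume" only matches when "high" comes before "volume". If the comma in the question is meant as an order-independent AND, lookaheads are one possible approach. The following is a minimal base R sketch, not the method used in the answer; the `texts` vector is made up for illustration.

```r
# Hypothetical sample texts (only the first one comes from the question)
texts <- c("HIGH VOLUME CREATES NUISANCE TO EVERYONE",
           "the volume is too high",
           "low battery",
           "other text")

# (?=.*high)(?=.*volume) requires both words, in either order.
# The | separates the OR-ed groups. Lookaheads need perl = TRUE in base R.
pattern <- "^(?=.*high)(?=.*volume)|^(?=.*low)(?=.*voice)"

flag <- ifelse(grepl(pattern, texts, ignore.case = TRUE, perl = TRUE),
               "VOLUME", NA_character_)
print(flag)
```

Note that this matches substrings, so "high" would also match "highway"; wrapping each word in \\b word boundaries is one way to tighten that if needed.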


Thanks for your quick reply.

I have a data frame in which one of the columns contains free text, e.g. (multiple rows like these):

  1. inspected system and found high volume and language not matching
  2. inspected and found language not accessible and forgot to reconfigured
  3. rear seats are not folding and stitches are not good
  4. screen losses vision

I want to add a column containing the related word.

If the text in a row contains the keywords below (not case sensitive), I want to put Language or Seat according to the following cases. A comma acts as an AND condition.
If the text matches (LANGUAGE,VOLUME)|(fold,seat)|(LANGUAGE,Forgot), then Language;
if it matches (FOLD,REAR)|(REAR,SEAT), then Seat; otherwise NA.

Well, that is still not a reproducible example, and it is not clear to me how it differs from your first example. Does this not work for you?

library(tidyverse)

sample_df <- data.frame(stringsAsFactors = FALSE,
                        text = c("inspected system and found high volume and language not matching",
                                 "inspected and found language not accessible and forgot to reconfigured",
                                 "rear seats are not folding and stitches are not good",
                                 "screen losses vision"))

sample_df %>% 
    mutate(new_column = case_when(
        str_detect(text, regex("((language|volume).*(language|volume))|
                               ((language|forgot).*(language|forgot))",
                               ignore_case = TRUE, comments = TRUE)) ~ "Language",
        str_detect(text, regex("((fold|rear).*(fold|rear))|
                               ((rear|seat).*(rear|seat))",
                               ignore_case = TRUE, comments = TRUE)) ~ "Seat",
        TRUE ~ NA_character_))
#>                                                                     text
#> 1       inspected system and found high volume and language not matching
#> 2 inspected and found language not accessible and forgot to reconfigured
#> 3                   rear seats are not folding and stitches are not good
#> 4                                                   screen losses vision
#>   new_column
#> 1   Language
#> 2   Language
#> 3       Seat
#> 4       <NA>
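A side note on the pattern above: "(language|volume).*(language|volume)" also matches a text in which one of the two words simply appears twice, so it is not a strict AND. Below is a hedged sketch of a stricter check in base R. The helper name `has_all` and the fifth sample text are assumptions added for illustration, not part of the original posts.

```r
# Hypothetical helper: TRUE where every word in `words` occurs in `text`
# (case-insensitive, substring match, any order)
has_all <- function(text, words) {
  Reduce(`&`, lapply(words, function(w) grepl(w, text, ignore.case = TRUE)))
}

text <- c("inspected system and found high volume and language not matching",
          "inspected and found language not accessible and forgot to reconfigured",
          "rear seats are not folding and stitches are not good",
          "screen losses vision",
          # made-up row: "language" twice, but no "volume" or "forgot"
          "language pack replaced with another language")

# OR between the groups, AND within each group
is_language <- has_all(text, c("language", "volume")) |
               has_all(text, c("language", "forgot"))
is_seat     <- has_all(text, c("fold", "rear")) |
               has_all(text, c("rear", "seat"))

new_column <- ifelse(is_language, "Language",
                     ifelse(is_seat, "Seat", NA_character_))
print(new_column)
```

With the regex from the answer, the fifth row would be labelled "Language"; with the per-word check it stays NA. The same vectors could be used inside mutate() and case_when() if you prefer to stay in the tidyverse.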

Hi,

Yes, thanks, it works for me, but I need to extend it a bit further.
E.g.

sample_df <- data.frame(stringsAsFactors = FALSE,
                        text = c("inspected system and found high volume and language not matching",
                                 "inspected and found language not accessible and forgot to reconfigured",
                                 "rear seats are not folding and stitches are not good",
                                 "screen losses vision"))

I am using this to extract multiple words under OR/AND conditions, but there is a whole list of words. Can I check the text directly against a bag of words? (1st question)

I am checking whether these words occur in the text, and I also want to exclude some words, e.g.
sample_df <- data.frame(stringsAsFactors = FALSE,
                        text = c("inspected system and found high volume and language not matching"))
## I use the code below to extract matching words from the text above. But if the word "high" is in the text, I don't want "Language" in the new column; I call it a negative word. (2nd question)

sample_df %>%
    mutate(new_column = case_when(
        str_detect(text, regex("((language|volume).*(language|volume))|
                               ((language|forgot).*(language|forgot))",
                               ignore_case = TRUE, comments = TRUE)) ~ "Language",
        TRUE ~ NA_character_))

Sorry, I don't understand your questions. To clarify them, please try to make a proper reproducible example, as explained in the link I gave you before.

Hi

sample_df <- data.frame(stringsAsFactors = FALSE,
                        text = c("inspected system and found high volume and language not matching",
                                 "inspected and found language not accessible and forgot to reconfigured",
                                 "rear seats are not folding and stitches are not good",
                                 "screen losses vision"))

x <- data.frame("SN" = 1:2, "Positive" = c("(LANGUAGE,volume)","(LANGUAGE,CHANG)","(fold|rear)","(rear|seat)"), "Negative" = c("High","rear"))

Simply put, I want to search the text for the positive words (from the x data frame) and flag a match only if none of the negative words also appear in the text. There are multiple positive and negative words, combined with OR and AND conditions.

E.g.

For the 1st text line:

Text = "inspected system and found high volume and language not matching"

Here the matching words are language and volume, as per the positive words in the x data frame.

But the text also contains "High", which is one of the negative words in the x data frame.

So I need the new column in sample_df to be "no match".

I am having difficulty doing this.

I can't follow the logic in your "x" data frame, so I suppose I'm lacking the context to give a meaningful solution, but I think you can accomplish what you want with a regular expression, as I did in my previous example.
Regular expressions are a little hard at first, but they are very powerful and the effort invested is going to pay off.
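For what it's worth, here is one way the "positive groups plus negative words" idea might be sketched in base R. Because the intended structure of the x data frame is unclear, the rule layout below is an assumption: the positive groups are taken from the earlier Language example, the negative word "high" from the last post, and the helper names (`has_all`, `has_any`, `match_rule`) are hypothetical.

```r
text <- c("inspected system and found high volume and language not matching",
          "inspected and found language not accessible and forgot to reconfigured",
          "rear seats are not folding and stitches are not good",
          "screen losses vision")

# Hypothetical helpers: case-insensitive substring checks, vectorised over `text`
has_all <- function(text, words) {
  Reduce(`&`, lapply(words, function(w) grepl(w, text, ignore.case = TRUE)))
}
has_any <- function(text, words) {
  Reduce(`|`, lapply(words, function(w) grepl(w, text, ignore.case = TRUE)))
}

# A rule matches when at least one positive group is fully present (AND within
# a group, OR between groups) and none of the negative words occur
match_rule <- function(text, positive, negative) {
  pos <- Reduce(`|`, lapply(positive, function(group) has_all(text, group)))
  pos & !has_any(text, negative)
}

# Assumed rule for "Language": a "bag of words" as a list of word groups
positive <- list(c("language", "volume"), c("language", "forgot"))
negative <- c("high")

new_column <- ifelse(match_rule(text, positive, negative), "Language", "no match")
print(new_column)
```

The first row contains "language" and "volume" but also the negative word "high", so it ends up as "no match", which is the behaviour described in the question. A single regular expression with a negative lookahead, such as "^(?!.*high)(?=.*language)(?=.*volume)" used with perl = TRUE, expresses the same idea for one word group.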
