Regex: Capture a single match

mclp · October 22, 2020, 3:36pm

Hello everyone.

I need to extract some keywords from multiple texts.
The thing is, I only need to extract these words once.
So for example, if my keyword is 'Jedi', and in the text I have 'Jedi' 3 times, I just want it to return the word 'Jedi' once.

What I have done so far:

library(tidyverse)

main_df <- read.table(header = TRUE, 
                      stringsAsFactors = FALSE, 
                      text="Title Text
'School Performance' 'Students A1, A6 and A7 are great. A1 could do better'
'Groceries Performance' 'Students A9, A3 are ok. A9 and A3 will for sure do better.'
'Fruit Performance' 'A5 and A7 will be great fruit pickers. Very surprised by A5 but not so much by A7'
'Jedi Performance' 'A3, A6, A5 will be great Jedis. The rest will be average Jedis' 
'Sith Performance' 'No one is very good. We should be happy. Good luck we have.'")

capture_words <- c('A1','A2','A3','A4','A5','A6','A7')

main_df %>% add_column(
  meta_data_title = str_extract_all(
    tolower(main_df$Text),
    paste(as.vector(tolower(capture_words)),
          collapse = "\\b|\\b"),
    simplify = FALSE
  ))

Output that I have now:
Title Text meta_data_title
1 School Performance Students A1, A6 and A7 are great. A1 could do better a1, a6, a7, a1
2 Groceries Performance Students A9, A3 are ok. A9 and A3 will for sure do better. a3, a3

Output that I want:
Title Text meta_data_title
1 School Performance Students A1, A6 and A7 are great. A1 could do better a1, a6, a7
2 Groceries Performance Students A9, A3 are ok. A9 and A3 will for sure do better. a3

Bonus questions:
1 - Is that "tolower" really necessary?
2 - Writing in the tidyverse style, I find it odd to have the data frame name inside the call, as in "str_extract_all(tolower(main_df$Text)". But if I write only the variable name, as in "str_extract_all(tolower(Text)", the code won't run.

Thank you and much appreciated.
Stay safe and keep on Rocking

jmcvw · October 22, 2020, 3:58pm

The main issue here is that str_extract_all extracts every occurrence of your capture words, including duplicates.
To fix that, keep only the unique values. Because str_extract_all returns a list, you have to unlist it first. (There may be a better way, but I don't know it.)

tolower is not essential, but it is often useful if there is any question about the consistency of case in the data.
\\b is not essential here either, but it could be in other cases, maybe in your larger dataset.
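
A minimal sketch of both points, using a made-up sentence and a made-up keyword vector (`text` and `words` are not from the original data). It assumes stringr is installed. Note that pasting the keywords with collapse = "\\b|\\b" leaves the first keyword without a leading \\b and the last one without a trailing \\b, so wrapping the whole alternation in a single pair of boundaries is probably closer to what was intended. Using regex(ignore_case = TRUE) is one alternative to tolower that keeps the original spelling in the result.

```r
library(stringr)

# Hypothetical sample input, not taken from main_df
text  <- "Students A1, a6 and A10 are great. A1 could do better"
words <- c("A1", "A6")

# Wrap the whole alternation in one pair of word boundaries,
# so every keyword is anchored on both sides
pattern <- paste0("\\b(?:", paste(words, collapse = "|"), ")\\b")

# ignore_case = TRUE matches regardless of case without lowercasing the text
matches <- str_extract_all(text, regex(pattern, ignore_case = TRUE))[[1]]
matches
# "A1" "a6" "A1"

# Without \\b (and without ignore_case), "A1" also matches inside "A10"
# and the lowercase "a6" is missed
no_boundary <- str_extract_all(text, paste(words, collapse = "|"))[[1]]
no_boundary
# "A1" "A1" "A1"
```

If the keywords differ only by case in the text, unique() will still treat "A1" and "a1" as different values, so lowercasing the matches afterwards may be needed.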

As for the part requiring main_df$Text, I'm not sure whether that is a bug.
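
On the main_df$Text point, one likely explanation, as far as I can tell: add_column() treats its arguments as ordinary values and does not look up names inside the data frame, whereas mutate() evaluates its arguments with the columns in scope. The sketch below uses a made-up two-row tibble (`df` and the column `n_hits` are hypothetical) to show the difference.

```r
library(tibble)
library(dplyr)
library(stringr)

# Hypothetical two-row data frame for illustration
df <- tibble(Text = c("A1 and A1", "A3 then A2"))

# mutate() evaluates its arguments inside the data frame,
# so the bare column name Text is found
with_mutate <- df %>%
  mutate(n_hits = str_count(Text, "A[1-7]"))

# add_column() does not look up names in the data frame,
# so the column has to be referenced explicitly
with_add_column <- df %>%
  add_column(n_hits = str_count(df$Text, "A[1-7]"))
```

Both calls produce the same n_hits column here, so switching from add_column() to mutate() is what lets you drop the main_df$ prefix.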

My take on this would be:

capture_words <- paste0('A', 1:7, collapse = '|')

main_df %>% 
  mutate(meta_data_title = map(Text, ~unique(unlist(str_extract_all(.x, capture_words)))))

Also, you might find it cleaner to read if you pull the string extraction out into a separate function:

pull_meta_data <- function(x) {
  unique(unlist(str_extract_all(x, capture_words)))
}

main_df %>% 
  mutate(md = map(Text, pull_meta_data))
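
The versions above give a list column. If you would rather have a single comma-separated string per row, as in the desired output, something like the following sketch might work. It uses a hypothetical two-row stand-in for main_df (`demo_df`), and the column names `md_list` and `md` are just for illustration.

```r
library(dplyr)
library(purrr)
library(stringr)
library(tibble)

# Hypothetical two-row stand-in for main_df
demo_df <- tibble(
  Title = c("School Performance", "Groceries Performance"),
  Text  = c("Students A1, A6 and A7 are great. A1 could do better",
            "Students A9, A3 are ok. A9 and A3 will for sure do better.")
)

capture_words <- paste0("A", 1:7, collapse = "|")

result <- demo_df %>%
  mutate(
    # str_extract_all() is vectorised: one character vector per row,
    # so unique() can be mapped over the list directly
    md_list = map(str_extract_all(Text, capture_words), unique),
    # collapse each vector into one comma-separated string
    md = map_chr(md_list, paste, collapse = ", ")
  )

result$md
# "A1, A6, A7" "A3"
```

Rows with no match end up with an empty string in md rather than NA.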
