Count word occurrences (per row)

Hello RStudio Community folks!

I am trying to count the number of word occurrences in a tibble (after tokenization from the tidytext::unnest_tokens() function), but can't seem to figure out how to do this with stringr::str_count():

WordOccurrenceTest <- tibble::tribble(
            ~word,                                                        ~text,
       "abnormal", "if this was not abnormal, consider changing from abnormal.",
       "abnormal", "if this was not abnormal, consider changing from abnormal."
       )

I want to count the number of times 'abnormal' (or whatever word is in the word column) occurs in text, so I thought it would be:

library(dplyr)    # mutate() and %>%
library(stringr)  # str_count()

WordOccurrenceTest %>%
  mutate(
    word_occurrence = sum(str_count(text, as.character(word)))
  )

But this gives me 4 in word_occurrence for every row, where I expected 2:

# A tibble: 2 × 3
  word     text                                                       word_occurrence
  <chr>    <chr>                                                                <int>
1 abnormal if this was not abnormal, consider changing from abnormal.               4
2 abnormal if this was not abnormal, consider changing from abnormal.               4

I can do this with base R, but it gives a warning:

WordOccurrenceTest %>% 
  mutate(
    word_occurrence = lengths(regmatches(text, gregexpr(as.character(word), text)))
  )
# A tibble: 2 × 3
  word     text                                                       word_occurrence
  <chr>    <chr>                                                                <int>
1 abnormal if this was not abnormal, consider changing from abnormal.               2
2 abnormal if this was not abnormal, consider changing from abnormal.               2
Warning message:
Problem with `mutate()` column `word_occurrence`.
ℹ `word_occurrence = lengths(regmatches(text, gregexpr(as.character(word), text)))`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
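
A note on that warning: gregexpr() is not vectorised over its pattern argument, so only the first element of word is used for every row. The counts above look right only because both rows contain the same word. A minimal base R sketch that pairs each word with its own text, assuming a literal (non-regex) match is wanted; the vectors and the name counts are illustrative, not from the original post:

```r
# Illustrative data: two different words, one per text
text <- c("if this was not abnormal, consider changing from abnormal.",
          "this is normal--do not change it.")
word <- c("abnormal", "normal")

# mapply() walks over word and text in parallel, so each call to
# gregexpr() receives exactly one pattern and one string.
counts <- mapply(
  function(p, s) lengths(regmatches(s, gregexpr(p, s, fixed = TRUE))),
  word, text,
  USE.NAMES = FALSE
)

counts
```

This avoids the warning because gregexpr() never sees a pattern of length greater than one.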

Any help on how to get the output from str_count() to produce the rowwise count of each word occurrence from the word column would be great!

Thank you so much for your time!

Try

WordOccurrenceTest %>% 
  mutate(
    word_occurrence = str_count(text, word))

Ah yes! Thank you! The sum() was unnecessary: it added the per-row counts together into one total.
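
To spell out where the 4 came from: str_count() is vectorised over both the string and the pattern, so it already returns one count per row, and sum() then collapses those counts into a single total that mutate() recycles into every row. A small sketch using the same two strings as plain vectors (per_row and total are illustrative names):

```r
library(stringr)

text <- c("if this was not abnormal, consider changing from abnormal.",
          "if this was not abnormal, consider changing from abnormal.")
word <- c("abnormal", "abnormal")

# One count per element: each text is matched against its own word
per_row <- str_count(text, word)

# sum() collapses the per-row counts into a single number
total <- sum(per_row)

per_row  # 2 2
total    # 4
```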


Just for completeness, I could also use the stringi::stri_count_fixed() function:

# data 
WordOccurrenceTest <- tibble::tribble(
            ~word,                                                        ~text,
       "abnormal", "if this was not abnormal, consider changing from abnormal.",
       "normal",                            "this is normal--do not change it."
       )
# stringi
WordOccurrenceTest %>% 
  mutate(
    word_occurrence = stringi::stri_count_fixed(str = text, 
                                                pattern = word)
  )
# A tibble: 2 × 3
#  word     text                                                       word_occurrence
#  <chr>    <chr>                                                                <int>
#1 abnormal if this was not abnormal, consider changing from abnormal.               2
#2 normal   this is normal--do not change it.                                        1

# stringr::str_count()
WordOccurrenceTest %>% 
  mutate(
    word_occurrence = stringr::str_count(text, word)
  )
# A tibble: 2 × 3
#  word     text                                                       word_occurrence
#  <chr>    <chr>                                                                <int>
#1 abnormal if this was not abnormal, consider changing from abnormal.               2
#2 normal   this is normal--do not change it.                                        1
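
One caveat that may matter depending on the data: both approaches count substrings, so "normal" would also be counted inside "abnormal". If whole-word counts are wanted, one option is to wrap the word in regex word boundaries. A hedged sketch, assuming the words contain no regex metacharacters (substring_hits and whole_word_hits are illustrative names):

```r
library(stringr)

text <- "if this was not abnormal, consider changing from abnormal."
word <- "normal"

# Plain pattern: matches "normal" inside "abnormal" as well
substring_hits <- str_count(text, word)

# \\b marks a word boundary, so only standalone "normal" is counted
whole_word_hits <- str_count(text, paste0("\\b", word, "\\b"))

substring_hits   # 2
whole_word_hits  # 0
```

Note that stri_count_fixed() treats the pattern as literal text, whereas str_count() interprets it as a regular expression by default, so the two can disagree when a word contains characters such as "." or "(".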
