Grepl one column of a data frame to a column of another summing the matches for each observation

Hi there,

I've been struggling with this one for a while.

I have a data frame containing a collection of tweets. For one of its columns, I want to count how many matches each row has against a second data frame that I'm using as a lookup.

At the moment I've written a function (below) which I then lapply over the tweets, but it's very slow as it uses a for loop.

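word_count <- function(name) {
  word_sum <- 0
  # test each lookup word against the tweet text passed in as `name`
  for (i in seq_len(nrow(lookup))) {
    value <- grepl(lookup$word[i], name)
    word_sum <- word_sum + value
  }
  # proportion of lookup words found in the tweet
  word_sum <- word_sum / nrow(lookup)
  return(word_sum)
}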

Then I run:
lapply(tweets$name, word_count)

This is particularly slow (2hrs for ~20k) and I'm sure there's a better way. I've looked into purrr::map but my brain can't quite compute. Can anyone help?

Hi Chris, if you're actually analyzing tweets, you may want to consider using the tidytext package. Here is the link to the book: https://www.tidytextmining.com/

I wrote a short blog post about it, using Twitter data, so it may be more specific to what you're trying to do: https://www.edgarsdatalab.com/2017/09/04/sentiment-analysis-using-tidytext/


Hi Edgar,

Thanks for getting back to me so quickly.

Thanks for the link, it has been a help. Whilst I can't use the method you have at the end (as I'm looking for a partial match rather than a full match), it does suggest some other avenues I'll look at.

Sounds good. By the way, a map version of your for loop would look something like this:

library(dplyr)
library(purrr)

# `name` is the single tweet string passed to word_count()
lookup$word %>%
  map_int(~grepl(.x, name)) %>%
  sum()
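For what it's worth, grepl() is already vectorised over the text it searches, so it may be possible to drop the loop over tweets entirely and iterate only over the lookup words. Below is a minimal base R sketch of that idea, assuming `tweets$name` holds the text and `lookup$word` the patterns as in the question. The toy data is invented for illustration, and `fixed = TRUE` is my assumption that the lookup words are literal strings rather than regular expressions (drop it if they are patterns).

```r
# Toy stand-ins for the data frames in the question (values are invented)
tweets <- data.frame(
  name = c("reading is fun", "my cousin is polite", "nothing here"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  word = c("read", "cousin", "polite"),
  stringsAsFactors = FALSE
)

# One grepl() call per lookup word, each run over all tweets at once.
# The result is a logical matrix: rows = tweets, columns = lookup words.
hits <- vapply(
  lookup$word,
  function(w) grepl(w, tweets$name, fixed = TRUE),
  logical(nrow(tweets))
)

# Matches per tweet, then the same proportion the original function returned
tweets$word_sum <- rowSums(hits)
tweets$word_prop <- tweets$word_sum / nrow(lookup)
```

This still does partial matching ("read" counts as a hit in "reading"), and it makes one pass per lookup word instead of one per tweet and word pair. Note that the sketch assumes more than one tweet; with a single row, vapply() returns a plain vector and rowSums() would need adjusting.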

I used the bioinformatics package Biostrings for a similar problem a while ago. Here is a short example using text lines from janeaustenr and a handful of lookup words:

library(Biostrings)
library(janeaustenr)

# create BStringSet object from book lines
tt <- austen_books()$text
subj <- BStringSet(tt)
# create another BStringSet from a curated set of words to locate
lookup <- BStringSet(c("read", "rabbit", "cousin", "polite", "civil"))
# count word instances on each line with vcountPDict, then summarize
word.counts <- colSums(vcountPDict(lookup, subj))

You'll have to install Biostrings through Bioconductor.

It takes a few seconds to run, but it only matches a small set of words. The example above also doesn't recognize word boundaries in the subj lines; you'd have to add spaces to the strings in lookup to ensure that a word like "read" didn't match "readily".
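The same word-boundary caveat applies to a grepl()-based approach. There, a regex word boundary is one alternative to padding with spaces, if whole-word matches are what's wanted. A small sketch with made-up example strings (not taken from the Austen text):

```r
# Two invented lines: one contains the word "read", the other only "readily"
txt <- c("she read the letter", "he readily agreed")

# Plain substring match: "read" also hits "readily"
partial <- grepl("read", txt, fixed = TRUE)

# \\b anchors the pattern at word boundaries, so only the whole word matches
whole <- grepl("\\bread\\b", txt, perl = TRUE)
```

Here `partial` flags both lines while `whole` flags only the first. If partial matches are the goal, as in the original question, the plain substring form is the one to keep.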

How large is your lookup object?