Grepl one column of a data frame to a column of another summing the matches for each observation


#1

Hi there,

I’ve been struggling with this one for a while.

I have a data frame which is a collection of tweets. For each tweet, I want to find the sum of the matches for one of its columns against a second data frame that I’m using as a lookup.

At the moment I’ve written a function (below) which I then lapply through but it’s very slow as it’s using a for loop.

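word_count <- function(name) {
  word_sum <- 0
  # Add 1 for every lookup word found anywhere in this tweet (partial match)
  for (i in seq_len(nrow(lookup))) {
    value <- grepl(lookup$word[i], name)
    word_sum <- word_sum + value
  }
  # Share of lookup words that matched
  word_sum <- word_sum / nrow(lookup)
  return(word_sum)
}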

Then I run:
lapply(tweets$name, word_count)

This is particularly slow (2 hrs for ~20k tweets) and I’m sure there’s a better way. I’ve looked into purrr::map but my brain can’t quite compute. Can anyone help?


#2

Hi Chris, if you’re actually analyzing tweets, you may want to consider using the tidytext package. Here is the link to the book: https://www.tidytextmining.com/

I wrote a short blog post about it, using Twitter data, so it may be more specific to what you’re trying to do: https://www.edgarsdatalab.com/2017/09/04/sentiment-analysis-using-tidytext/


#3

Hi Edgar,

Thanks for getting back to me so quickly.

Thanks for the link, it has been a help. Whilst I can’t use the method you have at the end (as I’m looking for a partial match rather than a full match), it does suggest some other avenues I’ll look at.


#4

Sounds good. Btw, a map version of your for loop would look something like this:

library(dplyr)
library(purrr)

# `name` is a single tweet string, as in word_count()
lookup$word %>%
  map_int(~grepl(.x, name)) %>%
  sum()
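
That still makes one grepl() call per word per tweet. Since grepl() is vectorised over the text it searches, a rough sketch that loops over the lookup words only and scans all tweets at once might look like the block below. The toy `tweets` and `lookup` data frames are made up for illustration, and fixed = TRUE assumes the lookup words should be treated as literal substrings rather than regular expressions.

```r
# Made-up stand-ins for the real `tweets` and `lookup` data frames.
tweets <- data.frame(
  name = c("reading about rabbits", "a polite cousin", "nothing here"),
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  word = c("read", "rabbit", "cousin", "polite"),
  stringsAsFactors = FALSE
)

# One grepl() call per lookup word, each checking every tweet at once.
# fixed = TRUE gives a literal partial match (assumption: no regex wanted).
per_word <- lapply(
  lookup$word,
  function(w) grepl(w, tweets$name, fixed = TRUE)
)

# Add the logical vectors element-wise, starting from zero for every tweet.
tweets$n_matches <- Reduce(`+`, per_word, integer(nrow(tweets)))

# Share of lookup words found in each tweet, as in word_count().
tweets$match_share <- tweets$n_matches / nrow(lookup)
```

The number of grepl() calls then depends on the size of lookup rather than on the number of tweets, which is usually where the time goes in the lapply version.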

#5

I used the bioinformatics package, Biostrings, for a similar problem a while ago. Here is a short example using text lines from janeaustenr and a handful of lookup words:

library(Biostrings)
library(janeaustenr)

# create BStringSet object from book lines
tt <- austen_books()$text
subj <- BStringSet(tt)
# create another BStringSet from a curated set of words to locate
lookup <- BStringSet(c("read", "rabbit", "cousin", "polite", "civil"))
# count word instances on each line with vcountPDict, then summarize
word.counts <- colSums(vcountPDict(lookup, subj))

You’ll have to install Biostrings through Bioconductor.

It takes a few seconds to run, but it only matches a small set of words. The example above also doesn’t recognize word boundaries in the subj lines; you’d have to add spaces to the strings in lookup to ensure that a word like “read” didn’t match “readily”.
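
To illustrate the word-boundary point outside Biostrings, here is a small sketch with base R’s grepl(). The `lines` and `words` vectors are made up for the example, and wrapping each word in \b assumes the lookup words contain no regex metacharacters.

```r
# Made-up example lines and lookup words (hypothetical names).
lines <- c("she will read it", "he agreed readily", "reading aloud")
words <- c("read", "polite")

# Plain substring match: "read" also hits "readily" and "reading".
substring_hit <- grepl(words[1], lines, fixed = TRUE)

# Wrap each word in \b (regex word boundary) so only whole words match.
patterns <- paste0("\\b", words, "\\b")
whole_word_hit <- grepl(patterns[1], lines, perl = TRUE)
```

Here substring_hit is TRUE for all three lines, while whole_word_hit is TRUE only for the first one. For a partial-match use case like the one in this thread, the substring behaviour may be exactly what is wanted.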

How large is your lookup object?