Text Mining Question -- Tokenizing Bigrams

I have a corpus named "Mow_corp_lite" with 203k elements and a size of 812.5 MB. I am trying to tokenize the corpus into bigrams and then summarize the bigrams in a wordcloud.

The script:

# Tokenizing Bigrams and Plotting Bigram Wordcloud
library(tm)
library(RWeka)
library(wordcloud)

# Bigram tokenizer built on RWeka's NGramTokenizer
bi_token <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Build the bigram document-term matrix and check its structure
Mow_bi_dtm <- DocumentTermMatrix(Mow_corp_lite, control = list(tokenize = bi_token))
str(Mow_bi_dtm)

# Convert to a dense matrix, sum the bigram frequencies, and plot the top 15
Mow_bi_dtm_m <- as.matrix(Mow_bi_dtm)
Bi_freq <- colSums(Mow_bi_dtm_m)
Bi_words <- names(Bi_freq)
wordcloud(Bi_words, Bi_freq, max.words = 15)

Something is not right. The matrix object "Mow_bi_dtm_m" is 12.4 MB, and running colSums() on it returns numeric(0).

str(Mow_bi_dtm_m) returns:

num[1:203245, 0 ] 
 - attr(*, "dimnames")=List of 2
  ..$ Docs : chr [1:203245] "1" "2" "3" "4" ...
  ..$ Terms: NULL

Is it possible the Terms: NULL result is due to my having used memCompress() on my text prior to converting it into a VCorpus?

What am I doing wrong?

Thank you!

Take a look at the quanteda package. It will do bi-grams, tri-grams, and proximity grams, and it's an easy conversion of corpus objects from tm or tidytext.
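
Here's a rough sketch of what that pipeline could look like (untested; I'm assuming quanteda's corpus() will accept your tm VCorpus directly; if not, feed it the underlying character vector):

library(quanteda)

# Convert the tm corpus (or a character vector) to a quanteda corpus
q_corp <- corpus(Mow_corp_lite)

# Tokenize, then form bigrams
toks <- tokens(q_corp, remove_punct = TRUE)
bi_toks <- tokens_ngrams(toks, n = 2)

# Document-feature matrix of bigrams, then the 15 most frequent
bi_dfm <- dfm(bi_toks)
topfeatures(bi_dfm, 15)

topfeatures() returns a named numeric vector, so it can go straight into wordcloud() the same way you're doing now.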

Hi,

Thanks very much. I will start playing with that package later today.

I'm reasonably sure the script above is not working due to a problem I created by trying to compress a corpus prior to tokenizing and converting to a matrix.

The motivation for compressing the corpus was to reduce the object size to something my machine can handle in RAM when creating the bigram matrix.

An alternative approach, I think, would be to make the dataset more sparse.
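
One concrete version of that idea (just a sketch, and it assumes the bigram DTM builds at all) would be to drop the rarest bigrams with tm's removeSparseTerms() before converting to a dense matrix:

# Keep only bigrams that appear in at least ~0.1% of documents
Mow_bi_dtm_small <- removeSparseTerms(Mow_bi_dtm, sparse = 0.999)
Mow_bi_dtm_m <- as.matrix(Mow_bi_dtm_small)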

I have the following collection of words:

str(Mow_cleanest)
 chr [1:54401] "touch" "healthy" "green" "eater" "lights" "starting" "small" ...

What method would you recommend for making this dataset a more manageable size, while preserving most of the prominent bigram relationships?

Thanks, again!

Just to confirm the obvious: you've taken out stopwords, of course.

Mow_cleanest is nowhere near as large as Moby Dick, for example, so unless you're struggling with 4 GB of RAM, memory shouldn't be a problem.

I'm not sure what package you're working in, but you might also take a look at tidytext (https://github.com/dgrtwo/tidy-text-mining) and work the examples in Chapter 4.
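
For reference, the Chapter 4 approach boils down to something like this (sketch only; text_df and its text column are stand-ins for however your raw documents are stored):

library(dplyr)
library(tidytext)

# text_df is assumed to have one row per document, with the raw text in `text`
bigram_counts <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)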

Hi,

Thank you. I'm using tm and qdap. I'm following the process laid out by Ted Kwartler in his very helpful DataCamp text mining course.

I have a vector of words that is 1.1 MB in size. When unlisted and split, it has 65k entries.

The as.matrix() function was throwing errors about not being able to handle a 4.2 GB object. I have a maximum of 7.87 GB usable.

I sorted the words by frequency in descending order to create a custom stoplist, and then removed all but the top 2000 most frequent words. I named that character vector "Mow_trimmed":

str(Mow_trimmed)
 chr [1:65966] "touch" "propelled" "effective" "healthy" "green" "eater" ...
object.size(Mow_trimmed)
1008544 bytes

Next, I convert the vector into a VectorSource object, and then a VCorpus, per:

Mow_source <- VectorSource(Mow_trimmed)  # each element of the vector becomes one document
Mow_unlist_corp <- VCorpus(Mow_source)

The resulting corpus is a collection of 66k documents sized 264 MB. That might be normal, but it seems odd to me.
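
If it helps, I can spot-check individual documents like this (standard tm accessors):

# Look at the first few documents and their contents
inspect(Mow_unlist_corp[1:3])
content(Mow_unlist_corp[[1]])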

When I bigram tokenize the corpus into a dtm, I end up with:

List of 6
 $ i       : int(0) 
 $ j       : int(0) 
 $ v       : num(0) 
 $ nrow    : int 65966
 $ ncol    : int 0
 $ dimnames:List of 2
  ..$ Docs : chr [1:65966] "1" "2" "3" "4" ...
  ..$ Terms: NULL
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

There is a flaw in how I'm executing this process. Can you help me identify what I've done wrong?

Thank you, again!!

I just went back to the tm manual; it doesn't have bi-grams, but qdap::ngrams does. It doesn't take a corpus as an argument; it wants a text object.

ngrams(text.var, grouping.var = NULL, n = 2, ...)

This is my first look at qdap and there are other functions that will take either a text.var object or a word frequency matrix, but they say so explicitly.

Try it with Mow_trimmed?
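
Something like this would be a minimal test (toy string only; I haven't dug into qdap's return structure, so I'd just str() the result and go from there):

library(qdap)

# Toy text.var; swap in Mow_trimmed (collapsed to one string if needed)
toy <- "the quick brown fox jumps over the lazy dog"
bi <- ngrams(toy, n = 2)
str(bi)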

I'll experiment with that. Thanks.

The method I've been trying to follow uses the RWeka package. Sorry for being unclear about that earlier!

# Make tokenizer function (NGramTokenizer and Weka_control come from RWeka)
library(tm)
library(RWeka)

tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Create bigram_dtm from the course's text_corp corpus
bigram_dtm <- DocumentTermMatrix(text_corp, control = list(tokenize = tokenizer))

I'm sure the method works. And maybe challenging myself to figure out what I've overlooked with that method is part of learning to troubleshoot similar hangups in the future.
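
As a sanity check, I'll probably run the same recipe on a tiny, self-contained corpus first (words borrowed from my own data) to confirm the RWeka route behaves on my machine before pointing it at the real thing:

library(tm)
library(RWeka)

# A toy corpus of two multi-word documents
toy_corp <- VCorpus(VectorSource(c("touch healthy green eater",
                                   "lights starting small touch")))

tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

toy_dtm <- DocumentTermMatrix(toy_corp, control = list(tokenize = tokenizer))
inspect(toy_dtm)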
