Text Mining Question -- Tokenizing Bigrams

I have a corpus named "Mow_corp_lite" with 203k elements and a size of 812.5 MB. I am trying to tokenize the corpus into bigrams and then summarize the bigrams in a wordcloud.

The script:

# Tokenizing Bigrams and Plotting Bigram Wordcloud
library(tm)
library(RWeka)
library(wordcloud)

# Bigram tokenizer built on RWeka's NGramTokenizer
bi_token <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Build the bigram document-term matrix and check its structure
Mow_bi_dtm <- DocumentTermMatrix(Mow_corp_lite, control = list(tokenize = bi_token))
str(Mow_bi_dtm)

# Convert to a dense matrix, sum the bigram frequencies, and plot the top 15
Mow_bi_dtm_m <- as.matrix(Mow_bi_dtm)
Bi_freq <- colSums(Mow_bi_dtm_m)
Bi_words <- names(Bi_freq)
wordcloud(Bi_words, Bi_freq, max.words = 15)

Something is not right. The matrix object "Mow_bi_dtm_m" is 12.4 MB, and running colSums() on it returns numeric(0).

str(Mow_bi_dtm_m) returns:

num[1:203245, 0 ] 
 - attr(*, "dimnames")=List of 2
  ..$ Docs : chr [1:203245] "1" "2" "3" "4" ...
  ..$ Terms: NULL

Is it possible the Terms: NULL result is due to my having used memCompress() on my text prior to converting it into a VCorpus?

What am I doing wrong?

Thank you!

Take a look at the quanteda package. It will do bi-grams, tri-grams, and proximity grams, and it's an easy conversion of corpus objects from tm or tidytext.
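
Here's a rough sketch of what that pipeline could look like (untested; I'm assuming quanteda's corpus() will accept your tm VCorpus directly; if not, feed it the underlying character vector):

library(quanteda)

# Convert the tm corpus (or a character vector) to a quanteda corpus
q_corp <- corpus(Mow_corp_lite)

# Tokenize, then form bigrams
toks <- tokens(q_corp, remove_punct = TRUE)
bi_toks <- tokens_ngrams(toks, n = 2)

# Document-feature matrix of bigrams, then the 15 most frequent
bi_dfm <- dfm(bi_toks)
topfeatures(bi_dfm, 15)

topfeatures() returns a named numeric vector, so it can go straight into wordcloud() the same way you're doing now.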

Hi,

Thanks very much. I will start playing with that package later today.

I'm reasonably sure the script above is not working due to a problem I created by trying to compress a corpus prior to tokenizing and converting to a matrix.

The motivation for compressing the corpus was to reduce the object size to something my machine can handle in RAM when creating the bigram matrix.

An alternative approach, I think, would be to make the dataset more sparse.
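
One concrete version of that idea (just a sketch, and it assumes the bigram DTM builds at all) would be to drop the rarest bigrams with tm's removeSparseTerms() before converting to a dense matrix:

# Keep only bigrams that appear in at least ~0.1% of documents
Mow_bi_dtm_small <- removeSparseTerms(Mow_bi_dtm, sparse = 0.999)
Mow_bi_dtm_m <- as.matrix(Mow_bi_dtm_small)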

I have the following collection of words:

str(Mow_cleanest)
 chr [1:54401] "touch" "healthy" "green" "eater" "lights" "starting" "small" ...

What method would you recommend for making this dataset a more manageable size, while preserving most of the prominent bigram relationships?

Thanks, again!

Just to confirm the obvious: you've taken out stopwords, of course.

Mow_cleanest is nowhere near as large as Moby Dick, for example, so unless you're struggling with 4 GB of RAM, memory shouldn't be a problem.

I'm not sure what package you're working in, but you might also take a look at tidytext (https://github.com/dgrtwo/tidy-text-mining) and work the examples in Chapter 4.
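
For reference, the Chapter 4 approach boils down to something like this (sketch only; text_df and its text column are stand-ins for however your raw documents are stored):

library(dplyr)
library(tidytext)

# text_df is assumed to have one row per document, with the raw text in `text`
bigram_counts <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)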

Hi,

Thank you. I'm using tm and qdap. I'm following the process laid out by Ted Kwartler in his very helpful DataCamp text mining course.

I have a vector of words that is 1.1 MB in size. When unlisted and split, it has 65k entries.

The as.matrix() function was throwing errors about not being able to handle a 4.2 GB object. I have a maximum of 7.87 GB usable.

I sorted the words by frequency in descending order to create a custom stoplist, and then removed all but the top 2000 most frequent words. I named that character vector "Mow_trimmed":

str(Mow_trimmed)
 chr [1:65966] "touch" "propelled" "effective" "healthy" "green" "eater" ...
object.size(Mow_trimmed)
1008544 bytes

Next, I convert the vector into a VectorSource object, and then a VCorpus, per:

Mow_source <- VectorSource(Mow_trimmed)  # each element of the vector becomes one document
Mow_unlist_corp <- VCorpus(Mow_source)

The resulting corpus is a collection of 66k documents sized 264 MB. That might be normal, but it seems odd to me.
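
If it helps, I can spot-check individual documents like this (standard tm accessors):

# Look at the first few documents and their contents
inspect(Mow_unlist_corp[1:3])
content(Mow_unlist_corp[[1]])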

When I bigram tokenize the corpus into a dtm, I end up with:

List of 6
 $ i       : int(0) 
 $ j       : int(0) 
 $ v       : num(0) 
 $ nrow    : int 65966
 $ ncol    : int 0
 $ dimnames:List of 2
  ..$ Docs : chr [1:65966] "1" "2" "3" "4" ...
  ..$ Terms: NULL
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"

There is a flaw in how I'm executing this process. Can you help me identify what I've done wrong?

Thank you, again!!

I just went back to the tm manual; it doesn't have bi-grams, but qdap::ngrams does. It doesn't take a corpus as an argument; it wants a text object.

ngrams(text.var, grouping.var = NULL, n = 2, ...)

This is my first look at qdap and there are other functions that will take either a text.var object or a word frequency matrix, but they say so explicitly.

Try it with Mow_trimmed?
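
Something like this would be a minimal test (toy string only; I haven't dug into qdap's return structure, so I'd just str() the result and go from there):

library(qdap)

# Toy text.var; swap in Mow_trimmed (collapsed to one string if needed)
toy <- "the quick brown fox jumps over the lazy dog"
bi <- ngrams(toy, n = 2)
str(bi)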

I'll experiment with that. Thanks.

The method I've been trying to follow uses the RWeka package. Sorry for being unclear about that earlier!

# Make tokenizer function (NGramTokenizer and Weka_control come from RWeka)
library(tm)
library(RWeka)

tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# Create bigram_dtm from the course's text_corp corpus
bigram_dtm <- DocumentTermMatrix(text_corp, control = list(tokenize = tokenizer))

I'm sure the method works. And maybe challenging myself to figure out what I've overlooked with that method is part of learning to troubleshoot similar hangups in the future.
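
As a sanity check, I'll probably run the same recipe on a tiny, self-contained corpus first (words borrowed from my own data) to confirm the RWeka route behaves on my machine before pointing it at the real thing:

library(tm)
library(RWeka)

# A toy corpus of two multi-word documents
toy_corp <- VCorpus(VectorSource(c("touch healthy green eater",
                                   "lights starting small touch")))

tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

toy_dtm <- DocumentTermMatrix(toy_corp, control = list(tokenize = tokenizer))
inspect(toy_dtm)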
