Hi,
Thank you. I'm using tm and qdap, following the process laid out by Ted Kwartler in his very helpful DataCamp text-mining course.
I have a vector of words sized 1.1 MB; when unlisted and split, it has about 65k entries.
as.matrix() was throwing errors about being unable to allocate a 4.2 GB object, and I have at most 7.87 GB of usable memory.
I sorted the words in descending order of frequency to create a custom stoplist, then removed all words not in the top 2,000 (by frequency). I named the resulting character vector "Mow_trimmed":
str(Mow_trimmed)
chr [1:65966] "touch" "propelled" "effective" "healthy" "green" "eater" ...
object.size(Mow_trimmed)
1008544 bytes
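For reference, the trimming step looks roughly like this (a base-R sketch; the toy `words` vector is a hypothetical stand-in for my unlisted/split word stream, and I keep the top 2 here rather than 2,000):

```r
# Toy stand-in for the unlisted/split word vector (hypothetical data)
words <- c("green", "green", "eater", "healthy", "green", "eater")

# Frequency table, sorted in descending order
freqs <- sort(table(words), decreasing = TRUE)

# Keep only occurrences of the top-N most frequent words (N = 2000 in my case)
top_n <- names(head(freqs, 2))
Mow_trimmed <- words[words %in% top_n]
```

Note that this filters the word *stream* by the top-N word *types*, so the trimmed vector keeps roughly its original length, which matches the 65,966 entries shown above.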
Next, I convert the vector into a VectorSource object and then a VCorpus:
Mow_source <- VectorSource(Mow_trimmed)
Mow_unlist_corp <- VCorpus(Mow_source)
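A small self-contained version of this step (assuming tm is installed) shows that VCorpus treats each element of the character vector as its own document:

```r
library(tm)

# Each element of the character vector becomes a separate document
src  <- VectorSource(c("touch", "propelled", "effective"))
corp <- VCorpus(src)

length(corp)        # one document per vector element: 3
content(corp[[1]])  # "touch"
```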
The resulting corpus is a collection of ~66k documents sized 264 MB. That might be normal, but it seems odd to me.
When I bigram-tokenize the corpus into a DTM, I end up with:
List of 6
$ i : int(0)
$ j : int(0)
$ v : num(0)
$ nrow : int 65966
$ ncol : int 0
$ dimnames:List of 2
..$ Docs : chr [1:65966] "1" "2" "3" "4" ...
..$ Terms: NULL
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency" "tf"
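For completeness, the tokenization is along these lines (a minimal reproduction assuming the standard NLP::ngrams bigram tokenizer, which may not be exactly what the course used, run here on a tiny three-word corpus):

```r
library(tm)
library(NLP)

# Standard bigram tokenizer built on NLP::ngrams
bigram_tok <- function(x) {
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

# One word per document, as in my corpus
corp <- VCorpus(VectorSource(c("touch", "propelled", "effective")))
dtm  <- DocumentTermMatrix(corp, control = list(tokenize = bigram_tok))

dim(dtm)  # rows = number of documents, zero term columns
```

Even this tiny example produces a DTM with zero columns, the same shape as the str() output above.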
There must be a flaw in how I'm executing this process. Can you help me identify what I've done wrong?
Thank you, again!!