I am currently working my way through the book Text Mining with R and am at the tokenizing portion. My question may appear a bit simplistic, but bear with me.

In the example below, we take a text column and tokenize it into bigrams (n-grams with n = 2). If I wanted to use these as features for a classification model, I would need to convert the tokens into a matrix of 1s and 0s, with one row per original text entry and one column per bigram: 1 where the entry contains the bigram, 0 where it does not. Does anyone know how to accomplish this?

library(dplyr)
library(tidytext)
library(janeaustenr)  # provides prideprejudice

d <- tibble(txt = prideprejudice)

d %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2)

Have you gotten to the "Converting to and from non-tidy formats" section yet?

Looks like you need to count() and then cast_dtm() or cast_dfm() depending on what you want.
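
Here is a rough sketch of what that could look like on a toy example. The `docs` tibble, the `doc_id` column, and the `present` column are made up for illustration (they are not from the book), and `cast_dtm()` needs the tm package installed. The idea is to keep a document identifier, count bigrams per document, overwrite each count with 1, and then cast to a document-term matrix.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data: one row per document, with an explicit id column
docs <- tibble(
  doc_id = c("doc1", "doc2", "doc3"),
  txt = c("it is a truth",
          "a truth universally acknowledged",
          "it is a")
)

bigram_counts <- docs %>%
  unnest_tokens(bigram, txt, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # texts shorter than two words can yield NA
  count(doc_id, bigram)

# Replace each count with 1 so the matrix records presence, not frequency
dtm <- bigram_counts %>%
  mutate(present = 1) %>%
  cast_dtm(document = doc_id, term = bigram, value = present)

# Dense 0/1 matrix: rows are documents, columns are bigrams
m <- as.matrix(dtm)
m
```

If you want counts rather than presence, pass `n` as the value instead of `present`. For a large corpus it is probably better to keep the sparse `dtm` object and not call `as.matrix()`, since the dense matrix can get very big.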


Hi @mara

I got to that section, but for some reason I could not get it to work (my fault entirely :)).
I found an excellent book here that covers it.

Thanks for your help

