unnest_tokens() with token = "ngrams" uses the tokenizers package and its tokenize_ngrams() function behind the scenes. The help page states:
These functions will strip all punctuation and normalize all whitespace to a single space character.
There is currently no way to configure that behavior, so you need another approach. The ngram package
is one option, as it allows more customization.
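To see why this matters, here is a sketch of what the default tokenizer does with the same string: because punctuation is stripped, the hyphens in "40-year-old" and the "&" in "R&D" are lost and those terms are split into separate words.

```r
library(tidytext)
library(tibble)

txt <- "A 40-year-old R&D guy"
d <- tibble(txt = txt)

# default n-gram tokenization strips punctuation, so "40-year-old"
# and "R&D" are broken apart into separate words before bigrams
# are built
unnest_tokens(d, ngram, txt, token = "ngrams", n = 2)
```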
txt <- "A 40-year-old R&D guy"
library(ngram)
ng <- ngram(txt, n = 2)
ng
#> An ngram object with 3 2-grams
print(ng, output = "full")
#> R&D guy | 1
#> NULL {1} |
#>
#> 40-year-old R&D | 1
#> guy {1} |
#>
#> A 40-year-old | 1
#> R&D {1} |
# rev() puts the n-grams back in the original word order
ng_string <- rev(get.ngrams(ng))
ng_string
#> [1] "A 40-year-old" "40-year-old R&D" "R&D guy"
Created on 2019-01-09 by the reprex package (v0.2.1)
How can you use it with unnest_tokens()?
You can provide a function as the token argument. This function must accept a character vector and return a list; see ?unnest_tokens.
Here is an example:
txt <- "A 40-year-old R&D guy"
d <- tibble::data_frame(txt = txt)
# a function that takes a string and returns a list containing its n-grams
ngram_string <- function(txt, n) list(unname(rev(ngram::get.ngrams(ngram::ngram(txt, n = n)))))
ngram_string(txt, 2)
#> [[1]]
#> [1] "A 40-year-old" "40-year-old R&D" "R&D guy"
# you need to vectorize it so it works on a character vector
ngram_string_vec <- Vectorize(ngram_string, vectorize.args = "txt", USE.NAMES = FALSE)
# it now works on a vector
ngram_string_vec(c(d$txt, d$txt), n = 2)
#> [[1]]
#> [1] "A 40-year-old" "40-year-old R&D" "R&D guy"
#>
#> [[2]]
#> [1] "A 40-year-old" "40-year-old R&D" "R&D guy"
# it is ready to be used as a tokenizing function
tidytext::unnest_tokens(d, ngram, txt, token = ngram_string_vec, n = 2, to_lower = FALSE)
#> # A tibble: 3 x 1
#> ngram
#> <chr>
#> 1 A 40-year-old
#> 2 40-year-old R&D
#> 3 R&D guy
Created on 2019-01-09 by the reprex package (v0.2.1)
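As a side note, instead of wrapping with Vectorize() you could write the vector-in/list-out contract out yourself with lapply(). This is just a sketch under the same assumptions; ngram_tokenizer is an illustrative name, not part of either package.

```r
library(ngram)

# hypothetical helper: maps each element of a character vector to its
# n-grams (in original order), returning a list as unnest_tokens expects
ngram_tokenizer <- function(x, n) {
  lapply(x, function(s) unname(rev(get.ngrams(ngram(s, n = n)))))
}

ngram_tokenizer(c("A 40-year-old R&D guy"), n = 2)
```

Vectorize() is a convenience wrapper around mapply(), so the two approaches are equivalent here; writing lapply() directly just makes the expected input and output shapes explicit.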