Making An N-Gram Tokenizer - R Views Submission

Category: Entry Point into a Topic
Repo: https://github.com/Ckrenzer/Making-An-NGram-Tokenizer (walks through the process of creating an n-gram tokenizer)


Background

I am a big fan of Julia Silge's work. Her writings and tutorials not only show you how to be an advanced user of algorithms, but also provide references for those looking to further their understanding of them. In the second chapter of the book she co-authored with Emil Hvitfeldt, Supervised Machine Learning for Text Analysis in R, Julia writes in great detail about how to create a tokenizer for text data. She explains how tokenizers work and the difficulties of tokenizing different human languages, and, importantly, she shows us how to create our own tokenizers for our own use cases. In doing so, she touches upon one of the most important heuristics of the R language, and indeed of programming more broadly: functions gain flexibility at the expense of execution speed. This is not always the case, but it holds much of the time. You see echoes of this claim in Chapter 24 of Hadley Wickham's Advanced R, "Improving Performance": the more error checking and custom features your function has, the more time it spends ensuring customizability instead of computing the return value.

The goal, of course, is to have both--we want speed and customizability. Julia does a great job explaining the trade-offs inherent in the different tokenizer options out there, and she even provides code to build our own tokenizers. The hand-written functions from the book break text into character and single-word tokens, but they do not provide a tokenizer for n-grams. In fact, I could not find anything online that writes out the code for an n-gram tokenizer--all the tutorials just showed me how to call tidytext::unnest_tokens(). I searched through the depths of the tidytext source code to find the unnest_tokens() implementation to no avail--but it's in there somewhere! After reading her chapter, I wanted to build an n-gram tokenizer from the ground up, and I wanted it to be fast. This guide examines some clever ways to assemble n-grams that are viable competitors to mainstream alternatives on CRAN.

We will start by describing how to create a tokenizer, then write several different implementations, and close by benchmarking them.

The Process

Any value of n in n-grams can be constructed from unigrams. In practical terms, this means placing each word of a document into its own element of a vector and then pasting n adjacent words together to make each n-gram. For example, with n = 2 on the sentence "My short and exquisite sentence", we should have {"My short", "short and", "and exquisite", "exquisite sentence"} after tokenizing. Think of the tokenizer as a window that slides along the text, selecting n words (here, two) at a time until every adjacent pair has been selected.
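
To make that slide concrete before writing any real implementations, here is a minimal, throwaway sketch in base R (not one of the functions benchmarked later) that pastes n adjacent words at a time:

# a throwaway illustration of the sliding window on the example sentence
words <- c("My", "short", "and", "exquisite", "sentence")
n <- 2
sapply(seq_len(length(words) - n + 1),
       function(i) paste(words[i:(i + n - 1)], collapse = " "))
#> [1] "My short"           "short and"          "and exquisite"      "exquisite sentence"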

One of the greatest bottlenecks in this approach is the step that creates the unigrams. If it can be bypassed (perhaps with a well-designed regular expression?), the implementations could be made faster still.
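
As a rough sketch of that idea (my own aside, not one of the implementations benchmarked below), a capture group inside a lookahead lets matches overlap, so bigrams fall out of a single regular expression without an explicit unigram vector:

# hypothetical regex-based bigrams: each word is captured along with the word
# that follows it via a lookahead, so the matches are allowed to overlap
library(stringr)
m <- str_match_all("My short and exquisite sentence", "(\\S+)(?=\\s+(\\S+))")[[1]]
str_c(m[, 2], m[, 3], sep = " ")
#> [1] "My short"           "short and"          "and exquisite"      "exquisite sentence"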

Packages

Let's load in all the packages we'll be using:

if(!require(pacman)) install.packages("pacman")
# General Packages
pacman::p_load(stringr, dplyr, readr, data.table)

# Packages with n-gram tokenizers
pacman::p_load(tokenizers, tidytext)

# Benchmarking packages
pacman::p_load(microbenchmark)

Tokenizers

As mentioned above, the more restricted a function's output, the faster it tends to run--the more complex the data structure returned, the longer the function takes. Often, we will want to return a data frame instead of a character vector. This will cost us dearly, but the flexibility gained from doing so is often well worth it.

Let's begin with a simple tokenizer and work our way up in complexity.

Heads & Tails

Often, we only need bigrams (n = 2). After splitting the text into unigrams, head() and tail() combine the words into bigrams quite nicely:

f1 <- function(words){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  
  # pair each word with the word that follows it
  return(str_c(head(words, -1), tail(words, -1), sep = " ", collapse = NULL))
}
  • Split the string into unigrams.

  • Paste the vectors together using head() and tail().

  • Bigrams (n = 2) only.
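
A quick check of f1() on the example sentence from earlier:

f1("My short and exquisite sentence")
#> [1] "My short"           "short and"          "and exquisite"      "exquisite sentence"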

Nested Loops

For those of us who like to think through problems by tracing out each step, loops work wonders. This is not guaranteed to be a fast implementation in R, but the approach translates nicely to other languages:

f2 <- function(words, n = 2){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  n <- n - 1
  
  vec <- character(length(words) - n)
  for(i in 1:length(vec)){
    # append the n words starting at position i
    for(j in i:(i + n)){
      vec[i] <- str_c(vec[i], words[j], sep = " ") 
    }
  }
  # drop the leading space introduced by pasting onto an empty string
  vec <- str_remove(vec, "^\\s{1}")
  return(vec)
}
  • Split the string into unigrams.

  • Use nested loops to assemble pairings of n words together.

  • Works for any value of n.
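
For instance, calling f2() with n = 3 on the example sentence should give the trigrams:

f2("My short and exquisite sentence", n = 3)
#> [1] "My short and"          "short and exquisite"   "and exquisite sentence"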

Shift Lists

The shift() function from data.table provides us with a bona fide slider as described above. We can then use apply-style loop functions to assemble the n-grams. Do you keep a shift list?
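
Before defining the function, here is what that slider looks like on the example sentence: the original word vector alongside a copy shifted forward by one position.

# the word vector plus a lead-by-one copy (trailing positions filled with NA)
shift(c("My", "short", "and", "exquisite", "sentence"), n = 0:1, type = "lead")
#> [[1]]
#> [1] "My"        "short"     "and"       "exquisite" "sentence" 
#> 
#> [[2]]
#> [1] "short"     "and"       "exquisite" "sentence"  NA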

f3 <- function(words, n = 2){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  n <- n - 1
  
  # one copy of the word vector per offset, with the trailing NAs dropped
  word_list <- lapply(shift(words, n = 0:n, type = 'lead'), na.omit)
  # truncate every copy to the length of the shortest, then paste element-wise
  mn <- min(lengths(word_list))
  grams <- do.call(paste, lapply(word_list, head, mn))
  
  return(grams)
}
  • Use data.table::shift() to create n shifted copies of the word vector.

  • Truncate each copy to the length of the shortest with head(), then assemble the n-grams with paste() and do.call().

  • Does not work for n = 1.
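
As a sanity check, f3() reproduces the same bigrams as the earlier implementations:

f3("My short and exquisite sentence", n = 2)
#> [1] "My short"           "short and"          "and exquisite"      "exquisite sentence"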

Tibbles

Of course, we might want to tokenize in a magrittr pipeline. When this is the case, character vectors usually won't suffice. Returning more complex data types allows for greater flexibility, but we pay for it at run time. To drive this point home, the three functions above are adjusted to return a tibble, and f2_tibble_custom() is a modified version of f2() that allows tokenization of multiple documents at once, as well as control over the input column of text.

f1_tibble <- function(words){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  
  return(tibble(text = str_c(head(words, -1), tail(words, -1), sep = " ", collapse = NULL)))
}
f2_tibble <- function(words, n = 2){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  n <- n - 1
  
  vec <- character(length(words) - n)
  for(i in 1:length(vec)){
    for(j in i:(i + n)){
      vec[i] <- str_c(vec[i], words[j], sep = " ") 
    }
  }
  vec <- str_remove(vec, "^\\s{1}")
  return(tibble(text = vec))
}
f3_tibble <- function(words, n = 2){
  # the unigrams
  words <- str_split(words, "\\s+")[[1]]
  n <- n - 1
  
  word_list <- lapply(shift(words, n = 0:n, type = 'lead'), na.omit)
  mn <- min(lengths(word_list))
  grams <- do.call(paste, lapply(word_list, head, mn))
  
  return(tibble(text = grams))
}
f2_tibble_custom <- function(text_df, key_column, text_column, n = 2){
  # the unigrams
  words <- str_split(text_df[[text_column]], "\\s+", simplify = FALSE)
  n <- n - 1
  
  # results will be added to this data frame
  ngram_df <- tibble(doc_id = character(0), text = character(0))
  
  # performing this operation for each 'key' in the data frame
  for(element in 1:length(words)){
    # the document identifier is our key
    key_name <- text_df[[key_column]][element]
    
    # the n-grams are added to this vector
    vec <- character(length(words[[element]]) - n)
    for(i in 1:length(vec)){
      for(j in i:(i + n)){
        vec[i] <- str_c(vec[i], words[[element]][j], sep = " ") 
      }
    }
    ngram <- tibble(doc_id = key_name, text = str_remove(vec, "^\\s{1}"))
    ngram_df <- bind_rows(ngram_df, ngram) 
  }
  
  return(ngram_df)
}
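
To illustrate f2_tibble_custom(), here is a small, made-up two-document tibble (the docs object below exists purely for demonstration):

docs <- tibble(doc_id = c("doc1", "doc2"),
               text = c("My short and exquisite sentence",
                        "a second tiny document"))
f2_tibble_custom(text_df = docs, key_column = "doc_id", text_column = "text", n = 2)
#> # A tibble: 7 x 2
#>   doc_id text              
#>   <chr>  <chr>             
#> 1 doc1   My short          
#> 2 doc1   short and         
#> 3 doc1   and exquisite     
#> 4 doc1   exquisite sentence
#> 5 doc2   a second          
#> 6 doc2   second tiny       
#> 7 doc2   tiny document     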

Benchmarking

Data

To test our functions, we need data! Let's use my favorite novel, The Count of Monte Cristo. The version used is the same as the one available through the corpus package's Project Gutenberg API, except the notes at the end of the book have been removed. The data is stored in a text file to ensure reproducibility.

The Count is the closest thing the 19th century had to the Most Interesting Man In The World; if you have time to spare, give it a read!

# The Count of Monte Cristo by Alexandre Dumas (English Translation)
cristo_text <- readr::read_lines("https://raw.githubusercontent.com/Ckrenzer/Making-An-NGram-Tokenizer/main/The%20Count%20of%20Monte%20Cristo.txt")

# We want the data represented in a single string for tokenization
cristo_text <- str_c(cristo_text, collapse = " ")

cristo <- tibble(doc_id = "cristo", text = cristo_text)

Finally, let's put our functions to the test! They will be compared, at n = 2, to each other and to established tokenizers found on CRAN (tokenizers for character vectors and tidytext for tibbles). This takes a while to run.

# These functions return character vectors
microbenchmark::microbenchmark(

  `Heads & Tails` = f1(words = cristo$text),
  `Nested Loops` = f2(words = cristo$text, n = 2),
  `Shift List` = f3(words = cristo$text, n = 2),
  
  `tokenizers` = tokenizers::tokenize_ngrams(x = cristo$text, n = 2, simplify = FALSE)[[1]],
  
  check = NULL,
  times = 25,
  unit = "s"
)

#> Unit: seconds
#>           expr       min        lq      mean    median         uq        max neval
#>  Heads & Tails 0.2889884 0.3867171 0.4575909 0.4182707  0.5038394  0.7673346    25
#>   Nested Loops 6.7113265 8.0643973 9.1981370 8.9747056 10.2201626 12.1695817    25
#>     Shift List 0.4393635 0.5288969 0.7289757 0.6653517  0.8351631  1.5823200    25
#>     tokenizers 0.3321225 0.4031210 0.4658819 0.4320680  0.5043797  0.8674276    25
# These functions return tibbles
microbenchmark::microbenchmark(
  
  `Heads & Tails` = f1_tibble(words = cristo$text),
  `Nested Loops` = f2_tibble(words = cristo$text, n = 2),
  `Shift List` = f3_tibble(words = cristo$text, n = 2),
  
  `Nested Loops Custom` = f2_tibble_custom(text_df = cristo,
                                           key_column = "doc_id",
                                           text_column = "text",
                                           n = 2),
  
  `tidytext` = tidytext::unnest_tokens(tbl = cristo,
                                       input = text,
                                       output = text,
                                       token = "ngrams",
                                       n = 2),
  
  check = NULL,
  times = 25,
  unit = "s"
)

#> Unit: seconds
#>                 expr       min        lq      mean    median        uq       max neval
#>        Heads & Tails 0.2997065 0.3412449 0.3957833 0.3633596 0.4006367 0.6814346    25
#>         Nested Loops 7.0280510 7.3828038 7.7136532 7.6348447 7.8778614 9.7611280    25
#>           Shift List 0.3780660 0.4803673 0.5579947 0.5167267 0.5703890 0.8564233    25
#>  Nested Loops Custom 7.1850311 7.5984540 7.9887214 7.9386131 8.1067919 9.5205180    25
#>             tidytext 0.4621718 0.5305908 0.5828535 0.5530582 0.6049297 0.9781302    25

Depending on the day, Heads & Tails runs faster than the equivalent code in tokenizers. We beat code written in C++ with a bit of clever writing! The tokenizers function requires only about half the memory of Heads & Tails and works with any value of n, but we did this entirely with stringr and base R!

Using a black box is much different from understanding an algorithm, but now you can at least see how tokenizers actually work!


This is a submission to the R Views Call for Documentation. For more information see rviews.rstudio.com.
