The standard keras tokenization workflow is the following:
# fit a keras tokenizer on the training text, then convert it to integer sequences
tokenizer <- text_tokenizer(num_words = num_words) %>%
  fit_text_tokenizer(df_train$text)
sequences <- texts_to_sequences(tokenizer, df_train$text)
However, the problem with that approach is that it doesn't allow one, for instance, to:
a) prune the most frequent words from the corpus (with quanteda, for instance), or
b) exclude stopwords, or
c) perform any other standard NLP preprocessing on the text.
Can anyone recommend another tokenization approach whose output is compatible with what keras models expect as input (i.e., the format coming from texts_to_sequences())?
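
To make the compatibility requirement concrete: texts_to_sequences() returns a list of integer vectors built from a word-to-index map, so any alternative needs to end up in that same shape. Below is a rough, untested sketch of the kind of pipeline I have in mind, assuming quanteda for the cleaning and keras::pad_sequences() for padding; the vocabulary-building step (topfeatures() plus a named index vector) is my own improvisation rather than an established recipe:

library(quanteda)
library(keras)

num_words <- 1000                      # same vocabulary cap as in the snippet above
texts     <- df_train$text             # the raw training documents

# 1. Tokenize and clean with quanteda: drop punctuation, lowercase, remove stopwords
toks <- tokens(texts, remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("en"))

# 2. Build a frequency-ordered word-to-index map, mimicking fit_text_tokenizer()
vocab <- names(topfeatures(dfm(toks), n = num_words))
word_index <- setNames(seq_along(vocab), vocab)

# 3. Convert each document's tokens to integer indices, dropping out-of-vocabulary
#    tokens just as texts_to_sequences() does
sequences <- lapply(as.list(toks), function(tok) {
  idx <- unname(word_index[tok])
  idx[!is.na(idx)]
})

# 4. Pad to a fixed length so the result can feed an embedding layer
x_train <- pad_sequences(sequences, maxlen = 100)

Whether this is an idiomatic way to do it, or whether some package handles the token-to-index mapping more robustly, is exactly what I'm asking about.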