What is the best tokenizer to be used for keras

keras
textmining

#1

The standard keras tokenization workflow is the following:

tokenizer <- text_tokenizer(num_words = num_words) %>% 
  fit_text_tokenizer(df_train$text)

sequences <- texts_to_sequences(tokenizer, df_train$text)

However, the problem with that approach is that it doesn't allow one, for instance, to:

a) prune the corpus (with quanteda for instance) from most frequent words or

b) exclude stopwords or

c) perform any other standard NLP actions on the set

Can anyone recommend another tokenization approach whose output is compatible with the input format keras models expect (i.e. what comes out of texts_to_sequences())?
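A rough base-R sketch of what I mean (the whitespace tokenizer and the tiny stopword list here are simplified stand-ins for what quanteda would do), ending with the same kind of integer sequences that texts_to_sequences() produces:

```r
texts <- c("the cat sat on the mat", "the dog ate my homework")
stopwords <- c("the", "on", "my")

# tokenize on whitespace and prune stopwords
tokens <- lapply(strsplit(tolower(texts), "\\s+"),
                 function(x) x[!x %in% stopwords])

# build a vocabulary, most frequent words first (index 1 = most frequent)
freq  <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- setNames(seq_along(freq), names(freq))

# map each document to a sequence of integer indices
sequences <- lapply(tokens, function(x) unname(vocab[x]))
```

Any pipeline that ends in such a list of integer vectors (plus the vocabulary for the embedding size) should be usable in place of the built-in tokenizer.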


#2

I had some very good results with the udpipe package - especially when using languages other than English (Czech, my native language, belongs to the West Slavic family).

The function is udpipe::udpipe_annotate(). I have used it as input for vocabulary based LSTM keras models, with positive results. Classification accuracy improved when I started using lemmas instead of plain tokens.
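A minimal sketch of the annotation step (the English model is just an example - Czech and many other languages can be downloaded the same way; note the model download needs a network connection):

```r
library(udpipe)

# download and load a pretrained model for the chosen language
ud_model <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(ud_model$file_model)

texts <- c("The cats were sitting on the mats.")

# annotate: tokenization, lemmatisation, POS tagging etc.
ann <- as.data.frame(udpipe_annotate(ud_model, x = texts))

# the lemma column can then replace plain tokens when building the vocabulary
ann$lemma
```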


#3

Thanks for pointing that out! Could you provide a minimal example of formatting the output of udpipe::udpipe_annotate() into the format keras expects, where each word in the sequence is referenced by its index in the dictionary?


#4

I can do that, as it is an interesting problem and I have the code.

It will not easily fit the format of a forum post, as there is a lot the "minimal" example has to do:

  • tokenize a piece of text
  • build a vocabulary
  • build a matrix input
  • build & verify the model (ok, this part is optional, but it is the most fun)

I will make it a blog post on www.jla-data.net instead; that should not be a problem :slight_smile:
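To give a taste in the meantime, the matrix-input step alone can be sketched in base R (keras provides pad_sequences() for exactly this; the sketch just shows the shape involved):

```r
# toy integer sequences, as produced by an indexing step
sequences <- list(c(2, 6, 5), c(3, 1), c(4))

# pad on the left with zeros to a common length, yielding the
# rectangular matrix that keras layers expect
maxlen <- max(lengths(sequences))
mat <- t(vapply(sequences,
                function(s) c(rep(0, maxlen - length(s)), s),
                numeric(maxlen)))
```

The rest (vocabulary, model building and verification) will go into the blog post.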


#5

I wrote a blog post covering the subject, as it took a bit more space than is available in a forum post.
You owe me a beer :beer: :wink:

It is built on a toy scenario of classifying authorship of 1000 tweets by two popular accounts (tweets are a neat subject for text classification).

It achieves ~93% accuracy for starters, which is not bad and can still be improved upon.


#6

Thanks, that's actually great! :slight_smile: Let me get back to you on this during the week, once I find the time to review it in greater detail. Haha, I'll give you a shout when I'm in Prague :wink:


#7

Thanks! It should give you a start. Do shout out if you run into trouble.


#8

Hi @jlacko - over on the quanteda GitHub page we're discussing adding a default option for converting its tokens and dfm objects directly into a keras-compatible object via quanteda's convert function: link. I believe your input could be valuable :slight_smile: