Advice on building a multi-class neural network on 50K records of short text data (up to 10 words)

Hi all,

It's a bit of a conceptual question, but I wanted to reach out to get your opinions and perhaps some good references. As the title says, I'm building a NN on top of short textual data and would like to classify it into 20 different categories. The dataset has roughly 50K observations with quite severe class imbalance; the smallest class has roughly 100 rows (I'm using the class weight option in keras to combat that a bit).
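For context, the weighting I have in mind works roughly like this (the inverse-frequency scheme and the placeholder labels below are just an illustration, not my exact code):

```r
library(keras)

# Placeholder labels: integers 0..19 for the 20 classes
y_train <- sample(0:19, 50000, replace = TRUE)

# Inverse-frequency weights so the ~100-row class isn't drowned out
freq <- table(factor(y_train, levels = 0:19))
w    <- as.numeric(sum(freq) / (length(freq) * freq))
class_weight <- as.list(setNames(w, names(freq)))  # named list "0".."19" -> weight

# Passed to fit() later on, e.g.:
# model %>% fit(x_train, y_train_onehot, class_weight = class_weight, ...)
```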

I've tried multiple different solutions already:

  1. Started with simple dense architectures on top of a prepared document-feature matrix (DFM)
  2. Then built the same model but with an embedding layer on top instead
  3. Also experimented with RNNs and LSTMs, both using an embedding layer (roughly as sketched below)
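For reference, variant 3 looks roughly like this (the vocabulary cap and layer sizes are placeholders, not my exact configuration):

```r
library(keras)

max_words <- 10000   # vocabulary cap (placeholder value)
max_len   <- 10      # texts are at most ~10 words

# Embedding layer feeding an LSTM, softmax over the 20 classes
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = max_words, output_dim = 64, input_length = max_len) %>%
  layer_lstm(units = 32, dropout = 0.2, recurrent_dropout = 0.2) %>%
  layer_dense(units = 20, activation = "softmax")

model %>% compile(
  optimizer = "adam",
  loss      = "categorical_crossentropy",
  metrics   = "accuracy"
)
```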

All of those solutions obviously perform differently, but even though the model reaches 99% accuracy on both the training and validation samples during training, the test accuracy always stagnates at around 86% for any bootstrap from that population (I've tried different resamples to verify that). I've already experimented with a lot of different settings and architectures and added regularization, but nothing really helps me break that barrier.

With all of that at hand, I'm beginning to wonder whether there's something I'm missing here, apart from the obvious answer of "get more training data". For instance:

  • What would be the recommended number of tokens to consider for such a task?

  • What should be the size of the embedding (in the context of the number of tokens)?

  • Why isn't my RNN/LSTM performing much better than a simple dense NN applied to a pruned DFM?

  • Is it worth pruning my dictionary before passing it to the tokenizer and applying an RNN/LSTM?

  • What should my batch size and number of epochs be? Currently the model's accuracy stops increasing at epoch 3 with a batch size of 32 (a snippet of what I mean by the last two points is below)
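To make those last two points concrete (num_words, maxlen and the early-stopping patience are placeholder values; texts_train, y_train and model come from the earlier setup):

```r
library(keras)

# "Pruning the dictionary": capping the tokenizer vocabulary before sequencing
tokenizer <- text_tokenizer(num_words = 5000) %>%
  fit_text_tokenizer(texts_train)

x_train <- texts_to_sequences(tokenizer, texts_train) %>%
  pad_sequences(maxlen = 10)

# Instead of guessing the number of epochs, let early stopping decide
history <- model %>% fit(
  x_train, y_train,
  batch_size = 32,
  epochs = 30,
  validation_split = 0.2,
  callbacks = list(callback_early_stopping(monitor = "val_loss", patience = 3))
)
```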

Any recommendations and advice would be really welcome! If you have any references to papers that discuss a similar topic, it would be great if you could pass them along.

In what language is your text? For many languages you get better classification results if you lemmatize your text instead of "just" tokenizing.

The R package udpipe has helped me a lot in this area of work. Besides lemmatizing, it also annotates parts of speech (verbs, proper nouns, etc.). This helps in text classification: features like the count of each part of speech per short text (think verbs per tweet) can be very predictive.
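A minimal sketch of the idea, assuming the English model and placeholder variable names (texts is the vector of short documents):

```r
library(udpipe)

m  <- udpipe_download_model(language = "english")
ud <- udpipe_load_model(m$file_model)

# One row per token, with lemma and upos (part-of-speech) columns
ann <- as.data.frame(udpipe_annotate(ud, x = texts,
                                     doc_id = as.character(seq_along(texts))))

# Lemmas can replace raw tokens before the keras tokenizer,
# and POS counts per document make handy extra features
pos_counts <- as.data.frame.matrix(table(ann$doc_id, ann$upos))
```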

It's in English, but it will also be Dutch and German in its future applications.

OK, that's actually a good idea. Would you then have different input layers in your network, e.g. (a rough sketch is below):

  1. Layer 1 - tokenized words into embeddings
  2. Layer 2 - output of the udpipe package
  3. Layer 3 - possibly additional numerical features + output of the textfeatures package

But apart from that, would you have any specific recommendations regarding the architecture or some of the parameters that I pointed out?
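Something along these lines is what I'm picturing (all shapes, the vocabulary size and the layer sizes are placeholders):

```r
library(keras)

input_tokens <- layer_input(shape = c(10), name = "tokens")        # padded word indices
input_pos    <- layer_input(shape = c(17), name = "pos_counts")    # udpipe POS counts
input_num    <- layer_input(shape = c(25), name = "numeric_feats") # textfeatures etc.

# Branch 1: embedding + LSTM over the token sequence
branch_tokens <- input_tokens %>%
  layer_embedding(input_dim = 10000, output_dim = 64, input_length = 10) %>%
  layer_lstm(units = 32)

# Merge the three inputs and classify into the 20 categories
output <- layer_concatenate(list(branch_tokens, input_pos, input_num)) %>%
  layer_dense(units = 64, activation = "relu") %>%
  layer_dense(units = 20, activation = "softmax")

model <- keras_model(
  inputs  = list(input_tokens, input_pos, input_num),
  outputs = output
)

model %>% compile(optimizer = "adam",
                  loss      = "categorical_crossentropy",
                  metrics   = "accuracy")
```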
