Hi all,
This is a bit of a conceptual question, but I wanted to get your opinion and perhaps be pointed to some good references. As the title says, I'm building a neural network (NN) on short textual data and would like to classify it into 20 different categories. The data set has roughly 50K observations with quite severe class imbalance: the smallest class has roughly 100 rows (I'm using the case weight option in keras to counter that a bit).
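To make the weighting idea concrete, here is a minimal sketch, assuming inverse-frequency ("balanced") weights of the form n_samples / (n_classes * class_count). The function name `balanced_class_weights` and the toy labels are made up for illustration and are not my actual data or pipeline.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Toy, made-up label distribution (assumption): one large and one small class.
labels = ["big"] * 900 + ["small"] * 100
weights = balanced_class_weights(labels)
print(weights)  # the rare class gets the larger weight
```

With integer class indices as keys, a dictionary like this is the kind of mapping that the `class_weight` argument of `fit` in Python Keras accepts.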
I've tried multiple different solutions already:
- Started with simple dense architectures on top of a prepared DFM matrix
- Then built the same model, but with an embedding layer added on top
- Also experimented with RNNs and LSTMs, both using an embedding layer
All of these solutions obviously perform differently, but even though the model reaches 99% accuracy on both the training and validation samples during training, the test accuracy always stagnates at around 86% for any bootstrap sample from that population (I've tried different resamples to verify that). I have already experimented with many different settings and architectures and added regularization, but nothing has really helped me break that barrier.
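Because the classes are so imbalanced, a single accuracy figure can hide how the small classes are doing. Below is a minimal sketch of how the test result could be broken down per class. The helper `per_class_recall` and the toy predictions are hypothetical, and the numbers are not my real results.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Return {class: recall}: the fraction of each true class predicted correctly."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {cls: correct[cls] / total[cls] for cls in total}


# Toy, made-up example (assumption): the classifier ignores the rare class entirely.
y_true = ["common"] * 90 + ["rare"] * 10
y_pred = ["common"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recalls = per_class_recall(y_true, y_pred)
macro_recall = sum(recalls.values()) / len(recalls)
print(accuracy, recalls, macro_recall)
```

In this toy case the overall accuracy looks high while the rare class is never predicted correctly, which is why the macro-averaged recall is much lower than the accuracy.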
With all of that in mind, I'm beginning to wonder whether there's something I'm missing here, apart from the obvious answer of "get more training data". For instance:
- What would be the recommended number of tokens to consider for such a task?
- What should be the size of the embedding (in the context of the number of tokens)?
- Why isn't my RNN/LSTM performing much better than a simple dense NN applied to a pruned DFM?
- Is it worth pruning my dictionary before passing it into the tokenizer and applying an RNN/LSTM?
- What should my batch size and number of epochs be? Currently the accuracy of the model stops increasing at epoch 3 with a batch size of 32.
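To clarify what I mean by pruning the dictionary, here is a minimal sketch that keeps only tokens appearing in at least a given number of documents. The function `prune_vocabulary`, the whitespace tokenization, the threshold of 2, and the toy documents are all assumptions for illustration, not my actual preprocessing.

```python
from collections import Counter


def prune_vocabulary(docs, min_doc_freq=2):
    """Keep tokens that occur in at least `min_doc_freq` documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each token once per document (document frequency, not term frequency).
        doc_freq.update(set(doc.lower().split()))
    return {tok for tok, df in doc_freq.items() if df >= min_doc_freq}


# Toy, made-up short texts (assumption).
docs = [
    "late delivery of parcel",
    "parcel arrived damaged",
    "delivery was late again",
]
vocab = prune_vocabulary(docs, min_doc_freq=2)
print(sorted(vocab))  # tokens appearing in at least two of the toy documents
```

The size of the pruned vocabulary, plus any reserved indices such as padding, would then determine the number of tokens and the input dimension of the embedding layer.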
Any recommendations and advice would be really welcome! If you have any references to papers that discuss a similar topic, it would be great if you could pass them along.