choice of algorithm

I am doing text classification and have 4 classes (can increase later) to which each text can be classified to.
Which algorithm works best in this case? Is there any better algorithm than RandomForest (ranger)?
What can be expected accuracy from the algorithm?

training data - 2000+

Questions about "which algorithm" to use are extremely problem specific and are very difficult to answer from a distance. Things to think about:

  • How much data do you have? How many features?
  • What is the balance of classes?
  • What is the cost of type 1 vs type 2 errors?

And that's really just the beginning.

If by "2000+" you mean there are 2,000 documents in your corpus, one of the only things that jumps out at me is that that is probably too little data for a neural network. It's difficult to say for sure without knowing precisely how you're building the model, but, for example, many simple bag of words models have well over 2,000 features. With powerful non-linear estimators like deep neural nets (or even random forest), it's very easy to overfit when you have a lot of features to work with and few examples.


This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.