Takes too long to create a random forest model for text data

I am working on text classification using the random forest algorithm. The data set has about 2000 rows and 13 columns, but the analysis uses only the one column that contains the text.

Building the model takes over 15 hours.

How can the process be modified to give faster, more efficient results?
I'm using a system with 8 GB of RAM running 64-bit Windows 10.

sample.pdf (45.1 KB)

(The PDF of the code isn't that helpful. A small reproducible example would go a long way toward helping us help you.)

How many features are generated from the one column of text?

What is the memory utilization when not run in parallel? Are you exhausting memory since the use of parallel processing has multiplied the total memory needs by 3?

I'd suggest moving away from the formula method for train (or anything else that uses random forests). If the predictors are categorical (and we can't tell from the example code in your PDF; it's hard to read), it probably takes longer to run.
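As a minimal sketch of the non-formula interface, assuming the TF-IDF matrix is available as a numeric matrix with column names: the small random matrix `x` and the labels `y` below are hypothetical stand-ins, not the poster's data, and the resampling settings are chosen only to keep the example quick.

```r
library(caret)

set.seed(1)
# Hypothetical stand-in for a TF-IDF matrix: 60 documents x 20 terms
x <- matrix(runif(60 * 20), nrow = 60,
            dimnames = list(NULL, paste0("term_", 1:20)))
# Hypothetical 4-class outcome, 15 documents per class
y <- factor(rep(c("a", "b", "c", "d"), each = 15))

# Formula interface, shown only for contrast (not run):
# fit <- train(label ~ ., data = df, method = "rf")

# Non-formula interface: pass predictors and outcome directly
fit <- train(x = x, y = y,
             method = "rf",
             trControl = trainControl(method = "cv", number = 3),
             tuneLength = 2)
```

With thousands of predictor columns, skipping the formula avoids building a model frame and design matrix from the formula before the fit starts.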

Use method = "ranger" instead of the randomForest package and turn off ranger's internal parallelism.
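A sketch of what that change might look like, using hypothetical toy data in place of the real TF-IDF matrix. `num.threads` is an argument of `ranger()` itself; here it is passed through `train`'s `...` so that ranger stays single-threaded.

```r
library(caret)

set.seed(1)
# Hypothetical stand-in data: 60 documents x 20 terms, 4 classes
x <- matrix(runif(60 * 20), nrow = 60,
            dimnames = list(NULL, paste0("term_", 1:20)))
y <- factor(rep(c("a", "b", "c", "d"), each = 15))

fit <- train(x = x, y = y,
             method = "ranger",
             num.threads = 1,   # passed on to ranger(): no internal parallelism
             trControl = trainControl(method = "cv", number = 3),
             tuneLength = 2)
```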

Finally, be aware that the call to train is fitting 91 random forest models. It's not really "a" random forest model.


The text is classified into 4 categories.
The TF-IDF matrix has 5000+ columns and about 1200 rows.
During execution, CPU usage is 75-95% and memory usage is 1700+ MB (77%).

How do I run more of the work in parallel to speed up the process?
Should I reduce the value of the tuneLength argument from 3 to something else?
Why is it not a random forest model?

If you are at 77% of memory running sequentially, I don't think that you should be running in parallel on that system.

You could use the tuneGrid argument to set mtry to the square root of the number of predictors. Perhaps take a look at the documentation.
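A rough sketch of a one-row tuning grid, assuming method = "ranger" (whose tuning parameters in caret are mtry, splitrule and min.node.size). The data are hypothetical, and the splitrule and min.node.size values are illustrative choices, not something stated in this thread.

```r
library(caret)

set.seed(1)
# Hypothetical stand-in data: 60 documents x 20 terms, 4 classes
x <- matrix(runif(60 * 20), nrow = 60,
            dimnames = list(NULL, paste0("term_", 1:20)))
y <- factor(rep(c("a", "b", "c", "d"), each = 15))

# A single candidate: mtry fixed at the square root of the predictor count
grid <- data.frame(mtry = floor(sqrt(ncol(x))),
                   splitrule = "gini",      # illustrative choice
                   min.node.size = 1)       # illustrative choice

fit <- train(x = x, y = y,
             method = "ranger",
             num.threads = 1,
             trControl = trainControl(method = "cv", number = 3),
             tuneGrid = grid)
```

With one row in the grid, train still fits the model once per resample to estimate performance, but it no longer multiplies that work by the number of candidate mtry values.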

My point was that your call to train is fitting 91 separate random forest models (as opposed to a single random forest model). Sometimes people don't know what train is doing and are frustrated by the time it takes to do what they think is a single model fit.
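One way the count could come out to 91, sketched as arithmetic. The tuneLength of 3 comes from this thread; the resampling scheme is an assumption, since the original code is only available as a PDF.

```r
n_candidates <- 3    # tuneLength = 3, as mentioned in the thread
n_resamples  <- 30   # ASSUMPTION: e.g. 10-fold CV repeated 3 times

# Every candidate is fit on every resample, plus one final fit on all the data
total_fits <- n_candidates * n_resamples + 1
total_fits
```

Under that assumption, the resampling settings in trainControl and the tuneLength value together determine the total, rather than any single line fitting 91 models.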

Thank you Max, the "ranger" method worked well for me.

I was also wondering about "that your call to train is fitting 91 separate random forest models (as opposed to a single random forest model)": which line in the code is doing that, and how can I avoid it?


Take a look at the documentation link above to see what train is used for.
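For reference, a sketch of how a single model can be fit through train, assuming no resampled performance estimate is needed: trainControl(method = "none") skips resampling and requires exactly one candidate in tuneGrid. The data and the tuning values are hypothetical.

```r
library(caret)

set.seed(1)
# Hypothetical stand-in data: 60 documents x 20 terms, 4 classes
x <- matrix(runif(60 * 20), nrow = 60,
            dimnames = list(NULL, paste0("term_", 1:20)))
y <- factor(rep(c("a", "b", "c", "d"), each = 15))

# Exactly one set of tuning values (illustrative choices)
grid <- data.frame(mtry = floor(sqrt(ncol(x))),
                   splitrule = "gini",
                   min.node.size = 1)

# No resampling: train fits one model on the full data set
fit <- train(x = x, y = y,
             method = "ranger",
             num.threads = 1,
             trControl = trainControl(method = "none"),
             tuneGrid = grid)
```

The trade-off is that no accuracy estimate is produced, so a separate held-out set would be needed to judge the model.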
