I'm tuning my xgboost model with tune from tidymodels, running 6 parallel workers via doFuture. Even so, the tuning is going to take about a week to finish, so I'm considering adding more CPU cores to speed it up. However, I noticed that each tuning process consumes about 6 GB of RAM (I guess every process owns a copy of the training data?). If I spawn one process per core on a 68-core CPU, that's about 408 GB of RAM, which is impossible for me. So how should I use more compute without the memory usage exploding? Maybe I should increase the thread count of the xgboost engine and reduce the number of tuning processes?
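For context, my setup looks roughly like this (a minimal sketch, not my actual code; train_df, the outcome y, and the grid size are placeholders):

```r
library(tidymodels)
library(doFuture)

# Register 6 parallel workers; each worker gets its own copy of the data,
# which is where the ~6 GB-per-process memory usage comes from.
registerDoFuture()
plan(multisession, workers = 6)

xgb_spec <- boost_tree(trees = tune(), learn_rate = tune(), tree_depth = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

xgb_wf <- workflow() %>%
  add_formula(y ~ .) %>%
  add_model(xgb_spec)

folds <- vfold_cv(train_df, v = 10)
res <- tune_grid(xgb_wf, resamples = folds, grid = 30)
```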
Your assessment of the memory usage is correct; with every worker, the data are replicated in memory. You can try passing the xgboost thread parameter and see if that helps (please let us know).
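For example (a sketch; the thread count is just illustrative):

```r
# Engine arguments in set_engine() are passed through to xgboost's
# training function, so nthread controls xgboost's own OpenMP threading:
boost_tree(trees = tune()) %>%
  set_engine("xgboost", nthread = 6) %>%
  set_mode("regression")
```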
Even if you had the memory, it's been my experience on large HPC systems that using more than ~50 workers is not efficient. There is a startup cost for each worker, and that can take a while. Your experience will vary depending on the model and data set.
Thanks Max. I installed OpenMP, reinstalled the xgboost package, and added nthread = 6 to the engine arguments; now I always have one worker at 600% CPU usage instead. I think this is more efficient because previously, with 6 workers in parallel, there were times when only 4 workers were at 100% and 2 were idling. I believe this is because I have 10-fold cross-validation resamples, and setting parallel_over = "everything" didn't help. Another fun fact: the worker now only consumes about 600 MB of memory.
Later I'll try to tune the model on a server with many more cores and report back. One thing that concerns me, though, is that I might need different thread settings on different machines. Is that configurable in tune? Maybe this isn't a problem for xgboost, since the manual says "Parallelization is automatically enabled if OpenMP is present", but what about other models?
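For reference, here is how I'm setting parallel_over, and one way I imagine making the thread count machine-dependent instead of hard-coded (a sketch; xgb_wf and folds stand for my workflow and resamples, and deriving the count from parallel::detectCores() is just one option):

```r
# parallel_over = "everything" parallelizes over resamples x grid combinations
# rather than resamples alone:
ctrl <- control_grid(parallel_over = "everything")
res  <- tune_grid(xgb_wf, resamples = folds, grid = 30, control = ctrl)

# Avoid hard-coding nthread per machine by deriving it at runtime,
# e.g. dividing the available cores by the number of tuning workers (here 6):
n_threads <- max(1L, parallel::detectCores() %/% 6L)
xgb_spec  <- boost_tree(trees = tune()) %>%
  set_engine("xgboost", nthread = n_threads) %>%
  set_mode("regression")
```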
So, I don't use tune to tune hyperparameters; I usually create my own data frame of parameter combinations and then use foreach to train in parallel. That said, I spend a lot of time tuning xgboost models, and here are some helpful tricks I've picked up:
- Tuning hyperparameters is what is called "embarrassingly parallel", which means it's easy and efficient to train the candidate models in parallel. In general, it's faster to train 6 single-threaded models at once than to train 1 model using 6 threads.
- There are diminishing returns when adding more threads to train a single model, and hyper-threading can actually slow things down a bit.
- You can balance memory consumption against the less efficient threading of single-model training by doing both: if you have 6 physical cores but not enough RAM to train 6 models at once, try training 3 models at a time with 2 threads each.
- I've found 5-fold CV to be sufficient; 10-fold could be slowing down your training significantly.
- Make sure you're running the most up-to-date version of xgboost available; the developers are constantly improving its performance and memory usage.
- Make sure you set single_precision_histogram = TRUE; this roughly halves memory usage by using 32-bit floats instead of 64-bit and also speeds up calculations, with very, very minimal impact on model results.
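Putting a few of these together, here's a sketch of my usual foreach approach (assumes dtrain is an xgb.DMatrix of your training data; the grid values and worker/thread split are illustrative, and single_precision_histogram may be handled differently in the newest xgboost releases):

```r
library(xgboost)
library(foreach)
library(doFuture)

registerDoFuture()
plan(multisession, workers = 3)  # 3 workers x 2 threads each = 6 cores total

# My own data frame of hyperparameter combinations:
grid <- expand.grid(max_depth = c(4, 6, 8), eta = c(0.05, 0.1))

results <- foreach(i = seq_len(nrow(grid)), .combine = rbind) %dopar% {
  params <- list(
    objective = "reg:squarederror",
    max_depth = grid$max_depth[i],
    eta = grid$eta[i],
    nthread = 2,                        # 2 threads per model, 3 models at once
    tree_method = "hist",               # single_precision_histogram needs hist
    single_precision_histogram = TRUE   # 32-bit histograms: less RAM, faster
  )
  cv <- xgb.cv(params, dtrain, nrounds = 200, nfold = 5,
               early_stopping_rounds = 20, verbose = FALSE)
  data.frame(grid[i, ], best_rmse = min(cv$evaluation_log$test_rmse_mean))
}
```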