We use caret and recipes for a diverse mix of ML modelling at our company (specialist insurance). I'm a big fan of the recipes / caret workflow and for us it's key to putting models into production.
However, we're now starting to hit fairly large memory usage at the bake stage of a new prediction. Some rough numbers:
- We run a prediction on ~1.2 million rows
- Roughly 25 features, although half of these are categorical, so the design matrix ends up with ~75 columns
- R session uses 22GB of RAM at peak
Completely get that this is a (relatively) large task and will need a decent amount of RAM. But this will become a limiting factor for us soon. Looking at the underlying code, there appear to be a few places where data frames are copied over themselves in a loop, which I'm guessing leads to a lot of the memory use.
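For scale, a quick back-of-the-envelope check on what a single copy of the baked design matrix should cost (assuming double-precision columns):

```r
# One copy of a 1.2M-row x 75-column numeric design matrix at 8 bytes per double
1.2e6 * 75 * 8 / 1024^3
#> ~0.67 GiB per copy, so a 22GB peak suggests many intermediate copies alive at once
```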
A few questions:
- Has anyone experienced the same scale of memory usage?
- Does anyone have recommendations for controlling this? My current approach will be to "chunk" the data and run the bake/predict step multiple times (rough sketch after this list)
- Any alternatives for doing this at scale?
- Has this / will this change in newer versions of the package?
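Here's a minimal sketch of the chunking idea from the second question, in case it helps the discussion. It assumes `trained_rec` is an already-prepped recipe and `fit` is a trained caret model (both names are placeholders); note that `bake()` takes `newdata` in recipes 0.1.x, which later versions rename to `new_data`.

```r
library(recipes)
library(caret)

# Bake and score new data in chunks so that only one chunk's design matrix
# is held in memory at a time (placeholder names: trained_rec, fit).
predict_in_chunks <- function(score_data, trained_rec, fit, chunk_size = 100000) {
  starts <- seq(1, nrow(score_data), by = chunk_size)
  preds <- vector("list", length(starts))

  for (i in seq_along(starts)) {
    idx <- starts[i]:min(starts[i] + chunk_size - 1, nrow(score_data))
    baked <- bake(trained_rec, newdata = score_data[idx, , drop = FALSE])
    preds[[i]] <- predict(fit, newdata = baked)
    rm(baked)
    gc()  # release the chunk's design matrix before baking the next one
  }

  # combine chunk predictions (fine for numeric output; adjust for classification)
  unlist(preds, use.names = FALSE)
}
```

The obvious trade-off is per-call overhead, so the chunk size would need tuning to stay under our RAM ceiling without making the prediction run too slow.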
For reference, I'm running R 3.4.2 and version 0.1.1 of recipes (internal policy means we don't update quickly).