Recipes package uses a large amount of memory in bake

We use caret and recipes for a diverse mix of ML modelling at our company (specialist insurance). I'm a big fan of the recipes / caret workflow and for us it's key to putting models into production.

However, we're now starting to hit fairly large memory usage at the bake stage of a new prediction. Some rough numbers:

  • We run a prediction on ~1.2 million rows
  • Roughly 25 features, although half of these are categorical, so the design matrix ends up with ~75 columns
  • R session uses 22GB of RAM at peak

I completely get that this is a relatively large task and will need a decent amount of RAM, but this will become a limiting factor for us soon. Looking at the underlying code, there look to be a few places where data frames are copied over themselves in a loop, which I'm guessing leads to a lot of memory use.

A few questions:

  1. Has anyone experienced the same scale of memory usage?
  2. Does anyone have recommendations for controlling this? My approach will be to "chunk" the data and run the bake multiple times.
  3. Any alternatives for doing this at scale?
  4. Has this / will this change in newer versions of the package?
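
As a rough sketch of the chunking idea in question 2 (not something taken from the recipes package itself): the base R helper below, under the hypothetical name `apply_in_chunks`, applies a function to a data frame in row blocks and stacks the results. It assumes every step in the prepped recipe works row by row using statistics learned at `prep()` time, so that baking in chunks gives the same result as baking everything at once. `prepped` and `new_data` in the commented usage are placeholders.

```r
# Hypothetical helper: apply `fn` to `data` in blocks of rows and stack the results.
# Intermediate copies made inside `fn` only ever cover one chunk at a time.
apply_in_chunks <- function(data, fn, chunk_size = 100000L) {
  n <- nrow(data)
  if (n == 0L) {
    return(fn(data))
  }
  starts <- seq(1L, n, by = chunk_size)
  pieces <- lapply(starts, function(start) {
    end <- min(start + chunk_size - 1L, n)
    fn(data[start:end, , drop = FALSE])
  })
  # Stacking briefly holds the pieces and the combined result together
  do.call(rbind, pieces)
}

# Assumed usage with a prepped recipe (placeholder names, arguments passed
# positionally because the name of bake()'s data argument differs by version):
# library(recipes)
# baked <- apply_in_chunks(new_data, function(chunk) bake(prepped, chunk))
```

The chunk size trades speed for peak memory. The baked output still has to fit in RAM once the pieces are combined, so if even that is too large, each piece could be scored and discarded inside `fn` instead of being returned.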

For reference, I'm running R 3.4.2 and version 0.1.1 of recipes (internal policy means we don't update quickly ☹).

Thanks,
Steve

I don't have a definite answer for you, but R 3.5 introduced performance improvements that might help if you can upgrade.

Can you give some information on the recipe? If you are just centering and scaling, we would expect different performance characteristics than with a more complex recipe that uses Isomap and similar steps.

If you have concerns about keeping your methods private, go ahead and send an email to max@rstudio.com (an example script would be best).
