Problem using recipes package with wide input data

Hi there,

I'm having trouble with the recipes::recipe() function when using a wide set of input predictor features: I get an error saying "cannot allocate vector of size XX Gb."

I've worked up a reproducible example below. Any suggestions or workarounds would be greatly appreciated!

library(AmesHousing)
library(tidyverse)
library(recipes)

## make a small tibble from the ames housing package 
ames <- 
  make_ames() %>% 
  select(Sale_Price, Longitude, Latitude) %>% 
  ## make outcome be binary indicator of sale price 
  ##  being above $150,000
  dplyr::mutate(Sale_Price = 
          factor(sign(Sale_Price>150000)) %>%
            fct_inseq()
          ) 

## make a recipe with small p/few predictors
rec <-  recipe(Sale_Price ~ ., 
          data = ames
        ) # works no problem!
rec 

Up to this point everything runs smoothly, but if I try to add many more columns to the Ames data I can't get the same script to run:

## add a wide n x p matrix (large p) to ames
p <- 500000
set.seed(32798)
big.dat <- matrix(runif(n = nrow(ames) * p), 
            nrow = nrow(ames), ncol = p) %>%
          as_tibble()

big.ames <- ames %>% 
            bind_cols(big.dat)

## make recipe with large p ames dataset
rec <-  recipe(Sale_Price ~ ., 
          data = big.ames
        ) ## this never completes!

## > Error: cannot allocate vector of size 3017.5 Gb
## > Execution halted

I'm using a machine with quite a lot of RAM, so I feel like the command must be getting hung up somewhere unnecessarily, but I'm not sure.

I'm not sure about how this works in Windows. In the Linux sphere, however, there is an environment tuning parameter that limits the amount of memory that can be accessed.

On my macOS machine:

$ /usr/bin/ulimit
unlimited

This should work on most other Unix descendants to report whether there is a resource limit. The limit can be changed with root privileges. See man ulimit.
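For anyone checking this on their own machine, here is a minimal sketch of how the limits could be inspected. It assumes a POSIX-style shell such as bash, where ulimit is a shell builtin; the variable name vmem_limit is only for illustration, and the exact set of limits listed varies by system.

```shell
# List every resource limit that applies to this shell and the
# processes it starts (an R session launched from here inherits them).
ulimit -a

# Soft limit on virtual memory per process: a number of KiB, or "unlimited".
# vmem_limit is an illustrative variable name.
vmem_limit=$(ulimit -S -v)
echo "virtual memory limit: ${vmem_limit}"
```

If the second command prints a number rather than unlimited, a single R process cannot address more than that much memory, however much RAM is installed.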

Hi, Paul.

Sorry to be obscure. Even with such a mindblowing, awesome amount of RAM, ulimit settings ration the amount that is addressable by any single process.

It's hard to say without knowing more, but because R keeps everything it works on in memory, what matters is not only the size of the object that a program is operating on but also the sizes of the interim objects that the program must create along the way.
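A rough back-of-the-envelope check makes the point. This sketch assumes the Ames data has 2,930 rows (the thread only uses nrow(ames), so that figure is my assumption) and that each runif() value is stored as an 8-byte double; the variable names are illustrative.

```shell
# Size of the random matrix itself: rows x columns x 8 bytes per double.
n=2930        # assumed number of rows in the Ames data
p=500000      # number of added columns, from the example above
bytes=$(( n * p * 8 ))
gib=$(( bytes / 1024 / 1024 / 1024 ))   # integer division, rounds down

echo "matrix size: ${bytes} bytes (about ${gib} GiB)"
```

Under those assumptions the data comes to roughly 10 to 11 GiB, so an attempted allocation of 3017.5 Gb is hundreds of times larger than the input and has to come from some interim object rather than from the data itself.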

In this case, the most likely culprit is the need to process the wide n x p data set, big.ames.

I hope this is enough to go on, whether that means gritting your teeth and going to IT if you don't have root privileges, or looking into changing ulimit to unlimited if you do.

Thank you @technocrat, but I don't think access to physical memory is the issue:

  • I'm working on a Linux cluster with >500 GB of RAM available that I've used before on tasks in R that should be more memory intensive than this
    • P.S. No sudo privileges here, but that shouldn't be necessary

I'm not even sure why it should be doing anything particularly memory intensive as recipe() should just be making a model matrix, which I can do with this object easily enough :thinking:

Thanks Richard, very much appreciated. Hope I didn't come off as pretentious talking about the size of my RAM :smile:

Yeah, I hear ya that the size of the interim objects is what really determines the RAM cost. Maybe a discussion with IT is in order...

Hey, Paul, not at all! Good luck :grin: with sysadm!