Using split() and lapply()

Hi,

I have over 6 million rows and around 30 columns in my dataset. I wrote code for a small subsample of the data that:

  1. uses split() to put every unique combination defined in the argument into a list item
  2. iterates through that list using lapply() to do the manipulation (a rough sketch of the pattern is below)
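
A minimal sketch of that pattern, assuming made-up column names and a placeholder manipulation:

```r
# Toy data standing in for the real 6-million-row dataset
df <- data.frame(
  id     = rep(1:3, each = 4),
  region = rep(c("A", "B"), times = 6),
  value  = rnorm(12)
)

# 1. split by every unique combination of the grouping columns
pieces <- split(df, list(df$id, df$region), drop = TRUE)

# 2. iterate over the list and manipulate each piece
result <- lapply(pieces, function(piece) {
  data.frame(
    id         = piece$id[1],
    region     = piece$region[1],
    mean_value = mean(piece$value)  # placeholder manipulation
  )
})

out <- do.call(rbind, result)
```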

The problem is that the split() step explodes to 90 GB of RAM and crashes my server.

What else can I use instead of split()? Do I need to move away from lapply() as well? I currently use parLapply(), which makes things much faster.
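
For reference, the parLapply() version looks roughly like this (a sketch reusing the `pieces` list from above; the cluster size is arbitrary):

```r
library(parallel)

cl <- makeCluster(4)  # number of workers is an assumption

result <- parLapply(cl, pieces, function(piece) {
  data.frame(
    id         = piece$id[1],
    region     = piece$region[1],
    mean_value = mean(piece$value)  # same placeholder manipulation
  )
})

stopCluster(cl)
out <- do.call(rbind, result)
```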

Thanks

You could also use the functions in the furrr package to take advantage of multiple cores.

Dramatically speed up your R purrr functions with the furrr package | Technical Tidbits From Spatial Analysis & Data Science (zevross.com)
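
A rough sketch of what that would look like, assuming the furrr and future packages are installed; future_map() is a parallel drop-in for purrr::map():

```r
library(furrr)
library(future)

plan(multisession, workers = 4)  # worker count is an assumption

result <- future_map(pieces, function(piece) {
  data.frame(
    id         = piece$id[1],
    region     = piece$region[1],
    mean_value = mean(piece$value)
  )
})

out <- do.call(rbind, result)
```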

Thanks,

I actually managed to get somewhere using dplyr nest() and purrr. Either way, instead of RAM usage killing the server, it now only goes up to about 35 GB.
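
Something along these lines, assuming the same made-up columns and a placeholder summary step:

```r
library(dplyr)
library(tidyr)
library(purrr)

out <- df %>%
  group_by(id, region) %>%
  nest() %>%                                                  # one nested tibble per group
  mutate(summary = map(data, ~ tibble(mean_value = mean(.x$value)))) %>%
  select(-data) %>%
  unnest(summary) %>%
  ungroup()
```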

Is there a way to reduce RAM usage after a processing step? I can see that the data I produced is around 10 GB in size, but my RAM usage is around 35 GB.

Using gc() doesn't really help. The only way I found to reduce RAM usage is to save the file on the server, restart the session, load the previous datasets, run the next step, and repeat (sketched below).
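
That workflow, roughly (file paths are placeholders):

```r
# Save the result of the current step to disk
saveRDS(out, "step1_result.rds")

# ... restart the R session here to release memory ...

# Reload the intermediate result and carry on with the next step
out <- readRDS("step1_result.rds")
```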

Thanks

I would think you could chunk your data.
To free the memory via gc(), you need to drop the named object the memory is attached to:

rm(myobj) or myobj <- NULL

Or simply reuse myobj for the next iteration of your chunked process.
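
A rough sketch of a chunked loop along those lines, dropping each chunk before calling gc() (chunk size and the processing step are placeholders):

```r
n          <- nrow(df)
chunk_size <- 1e6
starts     <- seq(1, n, by = chunk_size)

results <- vector("list", length(starts))

for (i in seq_along(starts)) {
  idx   <- starts[i]:min(starts[i] + chunk_size - 1, n)
  chunk <- df[idx, ]

  results[[i]] <- data.frame(mean_value = mean(chunk$value))  # placeholder step

  rm(chunk)  # drop the named object so gc() can actually free the memory
  gc()
}

out <- do.call(rbind, results)
```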
