I get: Error: cannot allocate vector of size 1205.6 GB. While debugging I discovered that the code gets stuck at the spread statement inside the reshaped function. I don't see how a dataset of 1.4 GB could request 1205.6 GB of memory inside the dplyr code that I wrote. Could anyone explain why this is happening and suggest a possible solution?
P.S. I submit this code to a cluster with 400 GB of RAM, so a high but more reasonable memory footprint would be fine.
Yeah, but I have never had any issues with data manipulations on the uncompressed version with the amount of RAM available to me. The unpacked version usually occupies around 50 GB of memory, for which I have enough space. Also, the function runs up to spread without any problems, so compression is not the issue here.
These are somewhat random suggestions, but could you try adding an ungroup() before calling spread(), just to ensure that we're not trying to spread a grouped data frame?
Also, count() automatically calls group_by() on the variables you pass to it, so you could drop the explicit group_by() and pass the grouping variable names directly to count(), along the lines of the sketch below.
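Something like this, as a rough sketch; var1 and var2 are placeholders since I don't know your actual column names:

```r
library(dplyr)
library(tidyr)

# Placeholder column names (var1, var2) -- substitute your own.
df <- df %>%
  count(var1, var2) %>%                      # count() groups by var1, var2 for you
  ungroup() %>%                              # drop any leftover grouping before spreading
  spread(key = var2, value = n, fill = 0)    # spread the counts into wide format
```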
I just realized that your function returns the input data frame unchanged. That's because you haven't assigned the result of the spread() operation back to df.
Since an R function returns the value of the last expression it evaluates, you could fix this by removing the return(df) statement altogether, so that the result of the spread() pipeline is returned directly. Could you try that?
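To illustrate what I mean, here is a minimal sketch of the function; again, var1 and var2 are made-up column names, since I can't see your full code:

```r
library(dplyr)
library(tidyr)

# Sketch only -- var1 and var2 stand in for your real columns.
reshaped <- function(df) {
  df %>%
    count(var1, var2) %>%
    ungroup() %>%
    spread(key = var2, value = n, fill = 0)
  # No return(df) here: R returns the value of the last expression,
  # i.e. the result of the spread() pipeline above.
}
```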