dplyr memory usage blows up

So, I have a dataset df that is 1.4 GB in size, and I am trying to reshape it using the following function:

reshaped <- function(df){
  df %>%
    select(subject_num, concept_code) %>% 
    group_by(subject_num, concept_code) %>%
    count() %>% 
    spread(concept_code, n, fill=0)
  
  return(df)
}

df = read_rds('df.RDs') %>% 
         mutate(a=paste(a, b, sep="|"))
df <- reshaped(df)
write_rds(df, 'df_reshaped.RDs')

I get: Error: cannot allocate vector of size 1205.6 GB. While debugging I found that the code gets stuck at the spread statement inside the reshaped function. I don't see how a 1.4 GB dataset could ask for 1205.6 GB of memory in the dplyr code that I wrote. Could anyone explain why this is happening and suggest a possible solution?

P.S. I run this code on a cluster with 400 GB of RAM, so high but more reasonable memory usage would be fine.

R RDS files are essentially gzip-compressed files.

If your compressed file is already 1.4 GB, then unpacking it in memory
can quickly run you into problems.
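
You can check the difference yourself with something like this (just a sketch, base R plus readr, using the path from your snippet):

file.size("df.RDs") / 1024^3              # compressed size on disk, in GB
df <- readr::read_rds("df.RDs")
format(object.size(df), units = "Gb")     # size of the unpacked object in memory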

/THX1188

Yeah, but I have never had any issues with data manipulations on the uncompressed version with the amount of RAM available to me. The unpacked version usually occupies around 50 GB or so of memory, for which I have enough space. Also, the function runs fine until spread, so compression is not the issue here.

These are kind of random suggestions, but could you try adding an ungroup() before calling spread(), just to ensure that we're not trying to spread a grouped data frame?

Also, count() automatically calls group_by(), so you could remove the latter and just pass the grouping variable names to count().
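
Something like this, just as a sketch of your pipeline with both changes:

library(dplyr)
library(tidyr)

reshaped <- function(df) {
  df %>%
    select(subject_num, concept_code) %>%
    count(subject_num, concept_code) %>%  # count() does the grouping for us
    ungroup() %>%                         # defensive: make sure nothing is still grouped
    spread(concept_code, n, fill = 0)
}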

Tried both, didn't work, unfortunately.

Is pivot_wider any better?
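
That is, replacing the spread() step with pivot_wider(), roughly like this (a sketch; values_fill plays the role of fill):

library(dplyr)
library(tidyr)

df %>%
  count(subject_num, concept_code) %>%
  pivot_wider(names_from  = concept_code,
              values_from = n,
              values_fill = 0)  # on older tidyr versions: values_fill = list(n = 0)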

I just realized that your function returns the input data frame unchanged. That's because you haven't assigned the result of the spread() operation back to df.

Since R functions return the last value that is computed, you could fix this by removing the return(df) statement altogether. Could you try that?
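
Roughly like this, i.e. your function with only the return(df) removed (dplyr and tidyr loaded as before):

reshaped <- function(df) {
  df %>%
    select(subject_num, concept_code) %>%
    group_by(subject_num, concept_code) %>%
    count() %>%
    spread(concept_code, n, fill = 0)  # the last expression is what gets returned
}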

Good catch! But I don't think this would fix anything, as the function doesn't even get to the return statement. It gets stuck at spread.

pivot_wider gave me this error:

Error: Can't assign to elements that don't exist.
x The locations 1237409, 2474817, 3712225, 4949634, 6187042, etc. don't exist.
ℹ There are only 1 element.
Backtrace:
     █
  1. ├─global::reshaped(dx)
  2. │ └─`%>%`(...)
  3. │   ├─base::withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
  4. │   └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  5. │     └─base::eval(quote(`_fseq`(`_lhs`)), env, env)
  6. │       └─`_fseq`(`_lhs`)
  7. │         └─magrittr::freduce(value, `_function_list`)
  8. │           ├─base::withVisible(function_list[[k]](value))
  9. │           └─function_list[[k]](value)
 10. │             └─tidyr::pivot_wider(...)
 11. │               └─tidyr::pivot_wider_spec(...)
 12. │                 └─vctrs::`vec_slice<-`(...)
 13. └─
In addition: Warning messages:
1: In nrow * ncol : NAs produced by integer overflow
2: In nrow * (val_id$col - 1L) : NAs produced by integer overflow
3: In val_id$row + nrow * (val_id$col - 1L) :
  NAs produced by integer overflow

Thanks. I'm curious about two things. At the point where you've created a df with count:

  1. Can you get the dimensions of that frame, with dim()?
  2. Can you tell the highest count value in there, perhaps with max()?
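
For example, something along these lines (a sketch; the last line is just a rough estimate of how many cells the wide result would need, which seems to be what the nrow * ncol overflow warnings are about):

library(dplyr)

counts <- df %>%
  count(subject_num, concept_code)

dim(counts)    # 1. dimensions of the counted frame
max(counts$n)  # 2. highest count value

# rough size of the wide result: one column per distinct concept_code
as.numeric(n_distinct(counts$subject_num)) * n_distinct(counts$concept_code)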
