Error: memory exhausted (limit reached?)


#1

Hello, I'm trying to merge two data frames:
· Activities: 1,104,807 obs. of 9 variables
· MetaCiencias: 547 obs. of 8 variables
RStudio keeps running the following command without ever finishing:
DFActCurso <- merge.data.frame(activities, metaCiencias, by.x = 'title_Activity', by.y = 'Ciencias.activity', all = TRUE) %>% unique()

I have tried deleting the duplicate rows, and I have verified that my R session is 64-bit. I also raised the memory limit with memory.limit(size = 16384).
My computer has 16 GB of RAM and runs Windows.

But the command never completes; it eventually fails with this error message:

Error: memory exhausted (limit reached?)
and then a second error follows:
Error: during wrapup

Does anyone know how to solve this?
Thank you


#2

I think that is normal behavior when joining on two text fields with merge() where one side has about 1M rows. I'm not sure what the best solution in R is, but I believe it's possible, and not too difficult. If you don't get any helpful responses here, I'm sure you can find examples of people doing what you're trying to do via a Google search.

If there isn't a way to do the join in an R session with both data frames in memory (although I suspect there is), then you can do it out of memory.


#3

Firstly, try to make your data as small as possible so as to conserve memory. If you're going to call unique(), do it before the merge rather than after. If there are columns you don't need, drop them.
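
As a minimal sketch of that advice, assuming toy stand-ins for the two data frames (the user_id, notes, and course columns are made up for illustration; only the join keys come from the question), the reduction can happen before the join instead of after it:

```r
# Toy stand-ins for the real data; non-key column names are hypothetical.
activities <- data.frame(
  title_Activity = c("a1", "a1", "a2", "a3"),
  user_id        = c(1, 1, 2, 3),
  notes          = c("x", "x", "y", "z"),  # pretend this column is not needed
  stringsAsFactors = FALSE
)
metaCiencias <- data.frame(
  Ciencias.activity = c("a1", "a2", "a4"),
  course            = c("bio", "chem", "phys"),
  stringsAsFactors = FALSE
)

# Keep only the needed columns, then deduplicate, BEFORE joining.
activities_small <- unique(activities[, c("title_Activity", "user_id")])
meta_small       <- unique(metaCiencias)

# Full outer join on the reduced inputs.
DFActCurso <- merge(activities_small, meta_small,
                    by.x = "title_Activity", by.y = "Ciencias.activity",
                    all = TRUE)
```

Calling unique() on the inputs means the join never has to materialize the duplicate rows in the first place, which is where the memory goes.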

All that done, joins can still occasionally blow up the size of your data. One thing to consider is what's happening with any NA values. If the columns you are joining on contain NAs, merge() by default treats NA as matching NA, so every NA row on one side is paired with every NA row on the other. That is a Cartesian (or "cross") join, which multiplies those rows:

merge(data.frame(x = c(1, NA, NA)), 
      data.frame(x = c(1, NA, NA, NA), 
                 y = c('a', 'b', 'c', 'd')))
#>    x y
#> 1  1 a
#> 2 NA b
#> 3 NA c
#> 4 NA d
#> 5 NA b
#> 6 NA c
#> 7 NA d

If you've got a few hundred or thousand NAs, that can lead to your data getting very big very fast. You can drop NAs in the join by setting the incomparables parameter:

merge(data.frame(x = c(1, NA, NA)), 
      data.frame(x = c(1, NA, NA, NA), 
                 y = c('a', 'b', 'c', 'd')), 
      incomparables = NA)
#>   x y
#> 1 1 a

The same size explosion can happen if the same key value appears many times in both data frames with otherwise different rows, but NAs are a common culprit.
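
One way to check for this before running the merge is to count how often each key occurs on each side and multiply the counts. The following is a rough sketch on toy data; count_keys is a hypothetical helper written for this example, and "<NA>" is an arbitrary sentinel assumed not to occur as a real key:

```r
# Toy data with repeated keys and NAs on both sides.
x <- data.frame(key = c("a", "a", "b", NA, NA), vx = 1:5,
                stringsAsFactors = FALSE)
y <- data.frame(key = c("a", "a", "a", "c", NA), vy = 1:5,
                stringsAsFactors = FALSE)

# Hypothetical helper: rows per key value, with NA counted as its own group.
count_keys <- function(k) {
  k <- as.character(k)
  k[is.na(k)] <- "<NA>"  # sentinel so NA keys show up in the table
  table(k)
}

nx <- count_keys(x$key)
ny <- count_keys(y$key)

# Each shared key contributes (count in x) * (count in y) rows to an inner join.
shared        <- intersect(names(nx), names(ny))
per_key       <- as.numeric(nx[shared]) * as.numeric(ny[shared])
names(per_key) <- shared
expected_rows <- sum(per_key)

# Largest contributors first; these are the keys worth investigating.
sort(per_key, decreasing = TRUE)
```

Only the two small count tables are built here, so the estimate is cheap even when the merge itself would not fit in memory. With all = TRUE, the unmatched rows from each side are added on top of this figure.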

If you are still running out of memory, there is no single solution. Sometimes it requires rethinking what you're trying to do. Sometimes it requires scaling to bigger hardware or SQL. Sometimes data.table can be used to maximize speed and minimize intermediate memory usage. Which is appropriate varies by context, but for your sanity, attempt the simple solutions first.
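
For the data.table route, a minimal sketch might look like the following. It assumes the data.table package is installed and uses toy tables with made-up user_id and course columns; whether it actually fits in memory on the real data depends on how many rows the join produces.

```r
library(data.table)

# Toy stand-ins; for existing data frames, setDT(df) converts in place
# without copying.
activities <- data.table(
  title_Activity = c("a1", "a2", "a3"),
  user_id        = 1:3
)
metaCiencias <- data.table(
  Ciencias.activity = c("a1", "a2", "a4"),
  course            = c("bio", "chem", "phys")
)

# Same full outer join as before; merge() dispatches to the data.table method.
DFActCurso <- merge(activities, metaCiencias,
                    by.x = "title_Activity", by.y = "Ciencias.activity",
                    all = TRUE)
```

This does not change how many rows the join yields, so duplicated or NA keys still need to be dealt with first.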