RAM issues when importing several big datasets

Hi!

I am trying to load and merge 85 datasets, each with over 1,000,000 observations. I am running into RAM issues: once I load more than 20 of these data frames I get the error "Error: cannot allocate vector of size 60.0 Mb". Is it just a matter of my computer's capacity? How could I solve this? (My R version is 64-bit.)

I have already tried memory.limit(size=3000000), but it does not seem to do the trick.

If you are hitting problems at around 20 of the 85 datasets, then marginal memory-management tricks won't be enough, but some tactics would be:

  • If you only need a few columns, try importing just those and specifying the column types
  • If you know you only need certain rows, you can filter them on import, e.g. with data.table::fread() (see the sketch below)
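
For that part, here is a minimal sketch, assuming the files are CSVs sitting in ./data with a year column, and that only a handful of columns are needed (id, year and value are placeholder names -- adjust to your real columns and types):

library(data.table)

# import only the needed columns, with explicit types, then keep rows from 2015 on
files <- list.files("./data", pattern = "\\.csv$", full.names = TRUE)

read_one <- function(f) {
  dt <- fread(f,
              select = c("id", "year", "value"),
              colClasses = list(integer = c("id", "year"), numeric = "value"))
  dt[year >= 2015]
}

merged <- rbindlist(lapply(files, read_one))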

An option more likely to scale would be to use disk space rather than RAM, e.g. with {arrow}:
Integration to Apache Arrow • Arrow R Package
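
As a rough sketch of what that looks like (assuming the files are CSVs in a ./data folder and that id, year and value are the columns you care about -- these names are placeholders):

library(arrow)
library(dplyr)

# open the whole folder as one lazy dataset -- nothing is read into RAM yet
ds <- open_dataset("./data", format = "csv")

# the filter and column selection are pushed down to the files;
# only the result of collect() lands in memory
result <- ds %>%
  filter(year >= 2015) %>%
  select(id, year, value) %>%
  collect()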

Yes, you are running out of RAM.

The memory limit already defaults to the maximum amount of RAM installed in your system, so there is no benefit in raising it.

There are several options: you can simply add more memory (physically) to the system, add swap space at the expense of performance, or use an on-disk approach, either file-based with tools like Arrow, DuckDB, etc., or with an RDBMS like MariaDB, PostgreSQL, etc. You can also try to optimize your memory allocation, or sample your data if possible and make inferences from the sample.
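
As one illustration of the file-based on-disk route, a minimal DuckDB sketch (assuming CSV files in ./data; id, year and value are placeholder column names):

library(DBI)
library(duckdb)

con <- DBI::dbConnect(duckdb::duckdb())

# DuckDB scans the files itself, so only the query result comes into R
res <- DBI::dbGetQuery(con, "
  SELECT id, year, value
  FROM read_csv_auto('./data/*.csv')
  WHERE year >= 2015
")

DBI::dbDisconnect(con, shutdown = TRUE)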

It all depends on your specific goals

I have decided to import certain columns and rows. I would like to import all the files from a folder, but importing certain columns (by naming the columns) and certain rows (based on a condition, in this case year>=2015). How could I code it?

Thanks!

Yours is actually a common use case. It happens quite often with datasets of standardized structure, like logs, that are partitioned into individual files by date, region and so on.

My preferred way to handle it is by offloading the data to a local SQLite database (there are folks who swear by DuckDB; the difference is minor).

This is the way I do it:

  • first get a list of all files in a directory; {fs} is of great help
  • create a local SQLite database by connecting to it
  • (re)create a target table inside the database
  • iterate over the vector of Excel files, reading each one in turn and inserting it into the database
  • close the connection, as polite people do

Once done, you can easily access the data with your favorite database manager (I suggest DBeaver) and/or as remote tables via {dbplyr} (see the usage sketch after the code below).

As a piece of code for your inspiration consider this; you will have to tune the DDL part a bit according to your actual data.

library(dplyr)
library(DBI)
library(RSQLite)
library(readxl)
library(fs)


ddl_registrace <- "CREATE TABLE `registrace` (
                         `some_field` REAL,
                         `some_other_field` TEXT,
                         # some ddl here as required.
                  );
                  
                  CREATE INDEX some_field_IDX ON registrace (some_field);"

# get list of all XLSX files in a directory
files <- fs::dir_info("./data", glob = "*.xlsx") 

# connect the database; file will be created if not present
con <- DBI::dbConnect(RSQLite::SQLite(), "./data/auta.sqlite") 

# out with the old...
result <- dbSendQuery(con, "drop table if exists registrace;")
dbClearResult(result) 

# ... in with the new!
result <- dbSendQuery(con, ddl_registrace)
dbClearResult(result) 

for (file in files$path) {
  
  # read the file in
  wrk_excel <- read_excel(file)
  # append data to database
  DBI::dbAppendTable(con, "registrace", wrk_excel)
  
}

DBI::dbDisconnect(con) # don't leave the connection hanging
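
And a short usage sketch of the {dbplyr} access mentioned above (some_field and some_other_field are the placeholder columns from the DDL; tune the filter to your data):

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), "./data/auta.sqlite")

# a lazy remote table; the dplyr verbs are translated to SQL and run inside SQLite
registrace <- tbl(con, "registrace")

result <- registrace %>%
  filter(some_field >= 2015) %>%        # placeholder condition
  select(some_field, some_other_field) %>%
  collect()                             # only now is the result pulled into R

DBI::dbDisconnect(con)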
