How to Read Huge Files With R

Hello,

Thank you all in advance for being so supportive as I learn this language from the ground up.

So I've got 12 CSVs, one for each of 12 months, and together they add up to over a GB, and that size is too much.
I tried using RStudio Cloud and a basic R console, as well as different code.

I could load them individually, or work with only half as much data while I have this 7-year-old Mac, but if there is a way to work with all of that data, that'd be preferred.

This is the code I use before the last command falters:

install.packages("dplyr")
install.packages("tidyverse")
library(tidyverse)
library(dplyr)
library(data.table)
setwd("file_path")
files <- list.files(pattern = ".csv")
temp <- lapply(files,fread,sep=",")
data <- rbindlist(temp)

The error I get at the end says:
"2023-02-09 02:47:11.953 R[7686:174592] IMKClient Stall detected, please Report your user scenario attaching a spindump (or sysdiagnose) that captures the problem - (imkxpc_selectedRangeWithReply:) block performed very slowly (4.37 secs).
2023-02-09 02:47:11.953 R[7686:174592] IMKClient Stall detected, please Report your user scenario attaching a spindump (or sysdiagnose) that captures the problem - (imkxpc_deadKeyStateWithReply:) block performed very slowly (4.30 secs)"

So what would you do? Just work with less data, or is there a certain package for more efficient CSV reading, or something like that?

Thanks again! -Jon

One of my macs is that vintage. I've got 8GB RAM and have no difficulty with importing objects that size. If it were only 4GB, however, I'd worry.

But step back from how and take a look at what.

f(x) = y

x is your csv files (do those represent different years or do they represent different months of the same year?), what you have.

y is some object you want to create, say a model of New England ice cream consumption based on season and some other variables. (Spoiler, highest in summer, because it's a traditional beach treat.)

f is a composed function object to get from x to y, and along the way it may be doing several things (data cleaning, variable selection, etc.) to create x_1, ..., x_n before f_n can be called to produce y.

With that in mind, f_1 is choking on bringing in all of the x csv files. Given the next step that you have in mind for x_1, do you need all of them in the same object? Does your script run through creating temp? So if, for example, there were some currency amounts that were brought in as typeof character and they needed conversion to typeof numeric, you could write a function to do that, cashout(), and

lapply(temp, cashout)

That will leave you with a dozen data.table objects that have been lightly touched. If you expect data reduction through these baby steps, you could test rbindlist() with the first pair, then the second, to see if it still crashes.
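To make that concrete, here is a minimal sketch of what a cashout()-style cleaner could look like, applied with lapply() so each element stays a data.table; the column name amount and the dollar-sign formatting are assumptions for illustration:

library(data.table)

# hypothetical cleaner: strip "$" and "," from a character column and make it numeric
cashout <- function(dt) {
  dt[, amount := as.numeric(gsub("[$,]", "", amount))]  # modifies dt by reference
  dt
}

temp <- lapply(temp, cashout)  # still a dozen lightly-touched data.table objects

Because := updates each data.table in place, the reassignment to temp is mostly for readability.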

You can use an "on-disk" approach. duckdb is a really good and popular option these days, and you can seamlessly integrate it with arrow for faster data ingestion and lighter data storage, and with dbplyr for friendly dplyr-like syntax if you are not comfortable using SQL directly.
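For example, here is a minimal sketch of that on-disk approach, assuming the 12 CSVs sit in the working directory; the database file name, the table name, and the rideable_type column are made-up placeholders:

library(DBI)
library(duckdb)
library(dplyr)  # dbplyr handles the SQL translation behind tbl()

con <- dbConnect(duckdb(), dbdir = "rides.duckdb")

# Let DuckDB read every CSV straight from disk; the data never has to fit in R's memory
dbExecute(con, "CREATE TABLE rides AS SELECT * FROM read_csv_auto('*.csv')")

# Ordinary dplyr verbs are translated to SQL and run inside DuckDB until collect()
tbl(con, "rides") |>
  group_by(rideable_type) |>   # hypothetical column name
  summarise(n = n()) |>
  collect()

dbDisconnect(con, shutdown = TRUE)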

If you have 4GB of RAM you should easily be able to work with 1GB of data if you use data.table properly.

One thing which may help is avoiding the temp stage:

my_data <- rbindlist(lapply(files, fread))

Also, try to avoid using "data" as a variable name.

Thanks a ton for this! DuckDB looks like it might be a bit above my novice level. Is that correct of me to think? But it may be what I need!

Hey, this worked! Why is it better to remove the temp stage / variable creation? Because maintaining it (and the 5.5 million rows that come with it) is just too much storage to hold while also having the final 'data' variable with its 5.5 million rows?

What's also really interesting is that the datetime variables are correctly returned as data type POSIXct. Sometimes I've loaded dates from CSV files written in the same format and they got recognized as strings, but perhaps that was because it was RStudio Cloud; this time I ran your suggestion just in my desktop IDE (R 4.2.2).

Thanks a lot!

Hey thanks! Your answer is kind of like martin.R's below in the sense that you're saying not to assign a new variable for each change but to change the variable itself, at least when working with this much data, so you aren't creating a new variable with 5 million rows at each step?

It could be in some cases, but using it with dbplyr makes it simpler, almost transparent to the user (as long as you are already used to working with dplyr).

Yes; you could always inline the code so temp is never retained and its memory is automatically released:

library(data.table)
setwd("file_path")
files <- list.files(pattern = "\\.csv$")  # match only files ending in .csv
data <- rbindlist(lapply(files, fread, sep = ","))

no temp to get rid of

fread() guesses the column type based on a sample of the rows. This will not always work correctly. You can specify the type manually to save this step. Also, it can be useful to retain dates as strings if you don't need to make any date calculations.
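As an illustration only, here is a sketch of giving fread() the types up front and keeping the dates as plain strings; the file and column names are made up:

library(data.table)

jan <- fread("january.csv",
             sep = ",",
             # made-up column names; kept as character so no type guessing (or date parsing) happens
             colClasses = list(character = c("ride_id", "started_at", "ended_at")))

If you do need real datetimes later, you can convert just those columns afterwards, for example with as.POSIXct().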

Update: though it worked in my R 4.2.2 IDE last night, it did not work on RStudio Cloud. No biggie, though! In fact, I think it's arguably better for me to get the practice on a simple platform with fewer crutches.

Posit Cloud (formerly RStudio Cloud) also has R 4.2.2, so there is no difference in that regard. The most likely cause is that free Posit Cloud accounts have a 1 GB RAM limit, which is probably much less than what you have available on your own machine. Keep in mind that with a paid account you can increase the memory limit up to 16 GB of RAM.

Having a GB of data and facing data-import issues with limited memory is a problem I faced many years ago. Although expanding the memory is the final and best solution, I want to share some other approaches as well.

  1. Loading part: load the data in chunks and store it on disk in some format faster than CSV (a database, data.table, RData, etc.).
  2. Analysis part: if memory limits make the data hard to analyze after loading, you can put it into a database and use dbplyr to analyze it in chunks, combining the database's power with R's strengths.

That's the way: R plus a powerful database (such as PostgreSQL) can handle data of any size, as long as it is not larger than your disk space.
The disadvantage is slower computation, since data has to move from disk to memory.
The advantage is persistence: you can keep intermediate analysis results and reuse them, which makes the work reproducible.
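As a rough sketch of that chunk-and-store idea (one of several ways to do it), using readr's chunked reader and an SQLite file as the on-disk store; the file name, table name, and chunk size are all made up:

library(readr)
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "rides.sqlite")

# Read 100,000 rows at a time and append each chunk to the database,
# so the whole CSV never has to sit in memory at once
read_csv_chunked(
  "big_file.csv",
  callback = SideEffectChunkCallback$new(function(chunk, pos) {
    dbWriteTable(con, "rides", chunk, append = TRUE)
  }),
  chunk_size = 100000
)

dbDisconnect(con)

From there you can point dbplyr at the rides table and pull back only summarized results into R.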

That's all. I hope it is useful for your case.

WangYong

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.