A general question about processing Big data (Size greater than available memory) in R

It's a general question.

I have to process data larger than the available memory, roughly 30-80 GB. :exploding_head: Usually the data either fails to read in or the system crashes.
Processing involves data cleaning, manipulation, and/or visualization. The work involved:

Data cleaning: replacing, removing, and editing strings.

Manipulation: creating variables from strings, which are then passed to a regression model, or basic functions such as word counts.

Visualization: plotting multiple variables in leaflet, or generating bar or pie charts.

I generally use AWS for these problems, but out of curiosity :thinking:, if someone faced this data-size problem, what method(s) could they use to tackle it on a low-end machine? (Mine is an i5 with 8 GB RAM and a typical HDD.)

Note: speed is a constraint; it should not take days to get a solution. And, obviously, the system should not crash. :grin: :grinning: :smiley:

Thanks for your time and opinions. :star_struck:

What format is your data in? If you could find a way to put the data into a database (e.g. a local SQLite database), then you can do at least some of the aggregation/filtering in SQL or using dplyr's database backend capabilities. In R you also have the RSQLite package, which allows you to create and interact with a SQLite DB, but I don't think you will be able to create the DB from R, as that would require loading the data into your session first (which is the problem in the first place).

If you have some sort of flat file like a txt or csv, there are utilities to read the data in chunks, which allow you to do some processing on a chunk, load the next chunk, etc. If you have a UNIX shell and can run bash commands, you can also split your file into chunks (see https://stackoverflow.com/questions/2016894/how-to-split-a-large-text-file-into-smaller-files-with-equal-number-of-lines) and then process them in R in a loop or using a functional programming framework like purrr.
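To make the database-backend idea concrete, here is a minimal sketch, assuming the data has already been loaded into a SQLite file; the file name `mydata.sqlite`, the table name `observations`, and the column names are invented for illustration:

```r
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)  # supplies the SQL translation behind dplyr's database backend

# Connect to an existing SQLite database on disk (hypothetical file name).
con <- dbConnect(SQLite(), "mydata.sqlite")

# tbl() creates a lazy reference to the table; nothing is loaded into memory yet.
obs <- tbl(con, "observations")

# These verbs are translated to SQL and executed inside SQLite;
# only the small aggregated result is pulled into R by collect().
category_counts <- obs %>%
  filter(!is.na(text_column)) %>%
  count(category) %>%
  collect()

dbDisconnect(con)
```

The aggregation happens inside the database, so R only ever holds the summary, not the 30-80 GB of raw rows.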

2 Likes

I think you are wishing for too much. If you need to analyze 80 gigs of data and you only have 8 gigs on your machine, it will never be fast. No matter what tools you use, it'll always involve some reading from disk, and that's slooooooow. If you can put it into some DB, like @valeri suggested, it'll help a bit, but not too much.

I would say that if you want it to be fast and you know what you want to do with it, go for AWS/GKE/Azure, rent a beefy machine, run your analysis in a couple of minutes, and shut it down. It won't be hugely expensive (definitely less expensive than buying sticks of RAM). To help with that, take a subset of the data and work out all the problems on your local machine; then all that's left is running the script on the remote machine, rather than wasting time developing a solution that may or may not work.
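As a hedged sketch of that "prototype on a subset first" workflow (the file name, column names, and model are all hypothetical), you might read only the first couple of hundred thousand rows locally, get the cleaning and modelling code right, and then run the same script unchanged on the rented machine:

```r
library(readr)

# Read only the first 200,000 rows to develop and debug the analysis locally.
sample_df <- read_csv("big_file.csv", n_max = 2e5)

# Work out the string cleaning and the model on the small sample...
sample_df$clean_text <- gsub("\\s+", " ", trimws(sample_df$text_column))
sample_df$word_count <- lengths(strsplit(sample_df$clean_text, " "))
fit <- lm(outcome ~ word_count, data = sample_df)

# ...then point the finished script at the full file on the beefy remote machine.
```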

2 Likes

I have been having some success using the disk.frame package when I work with large datasets. It creates a disk.frame object that is stored in a directory and accessed with dplyr commands. It allows you to work on 'chunks' of the data at a time.
This would probably allow you to deal with the data cleaning, and it can theoretically handle regression models for larger-than-RAM datasets, but I'm not sure about other features.
The package is under active development, so if you like it, maybe you can ask the creator to add features in the future.
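For what it's worth, a minimal sketch of how that usually looks (the file path, output directory, and column name here are invented):

```r
library(disk.frame)
library(dplyr)

# Use several background workers so chunks can be processed in parallel.
setup_disk.frame(workers = 4)

# Convert the large CSV into a chunked disk.frame stored in a directory;
# the full file never has to fit into RAM at once.
big_df <- csv_to_disk.frame("big_file.csv", outdir = "big_file.df")

# dplyr verbs run chunk by chunk, so the count() below happens per chunk;
# the per-chunk results are re-aggregated after collect().
big_df %>%
  count(category) %>%
  collect() %>%
  group_by(category) %>%
  summarise(n = sum(n))
```

The two-step aggregation (per chunk, then combined) is the usual pattern for chunk-wise frameworks like this.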

2 Likes

Also, maybe this SO question can be helpful: https://stackoverflow.com/questions/43677277/reading-csv-files-in-chunks-with-readrread-csv-chunked
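For reference, a small sketch of the chunked-reading pattern that question describes, using readr::read_csv_chunked (file and column names are made up):

```r
library(readr)
library(dplyr)

# Summarise each 100,000-row chunk as it is read; only the small
# per-chunk summaries are kept in memory, never the whole file.
chunk_summaries <- read_csv_chunked(
  "big_file.csv",
  callback = DataFrameCallback$new(function(chunk, pos) {
    chunk %>% count(category)
  }),
  chunk_size = 100000
)

# Combine the per-chunk counts into one overall count.
chunk_summaries %>%
  group_by(category) %>%
  summarise(n = sum(n))
```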

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.