Uploading a large file

Hello,

I have a file that is too large to read into R all at once, so I've been using the read_csv_chunked function to read it a bit at a time. I had been working in the desktop version of RStudio, but even then it would still be running after 3–4 days, so my advisor set me up with a Google Cloud compute instance to try to get the job done without tying up my laptop. The only problem is that the file (a CSV) is on my computer, and it's too large to upload into RStudio Cloud the usual way and read into the environment. Is there any way to read files from my computer with read_csv_chunked, or, alternatively, are there any good workarounds for this problem? Any help would be much appreciated! Thank you!

I would try the vroom package for fast reading of CSV files on the desktop.

Hey, that's a good idea I think, thank you! I've never really done that before. Is there a way to read chunks at a time, subset, then move to the next chunk?

is there a way to read chunks at a time, subset, then move to the next chunk?

Do you mean from the database to R? I am no db expert (I do have a copy of SQL for Dummies), but I don't see why not. It should be more efficient to do the data selection (i.e. chunking) and subsetting in the database and import only the exact data you want to work with. I think that should reduce memory load and speed up processing time.
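
To make the "subset in the database, then pull it in pieces" idea concrete, here is a minimal sketch using DBI with RSQLite. The in-memory database, the table name measurements, and its columns are all made up for illustration; a real setup would connect to wherever the CSV was loaded.

```r
library(DBI)

# Hypothetical in-memory database standing in for the real one.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "measurements",
             data.frame(id = 1:10, value = (1:10) * 10))

# Let the database do the subsetting (the WHERE clause),
# then fetch the result in small batches instead of all at once.
res <- dbSendQuery(con, "SELECT id, value FROM measurements WHERE value > 30")
n_rows <- 0
while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 3)      # at most 3 rows per batch
  n_rows <- n_rows + nrow(chunk)    # replace with the real per-chunk work
}
dbClearResult(res)
dbDisconnect(con)
```

The batch size of 3 is only for the toy data; with a large table you would pick something much bigger.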

There are a number of decent tutorials on the web on working with R and databases; https://www.pluralsight.com/guides/importing-data-from-relational-databases-in-r seems useful, and this one gives more detail: https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html#Introduction

I have never had to handle such large files, but I have heard great things about the arrow package regarding speed. I would advise trying it; it comes with a handy read_csv_arrow function: https://ursalabs.org/arrow-r-nightly/reference/read_delim_arrow.html
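
As a rough sketch of how arrow might be used here: rather than read_csv_arrow, this uses open_dataset, which (as far as I understand it) lets you filter before anything is pulled into memory. The small temporary file and the column names id and value are stand-ins, not the real data.

```r
library(arrow)
library(dplyr)

# Small stand-in file; the real CSV would be pointed to directly.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:100, value = as.numeric(1:100)),
          csv_path, row.names = FALSE)

# Open the CSV as a dataset without loading it, describe the subset
# with dplyr verbs, and only materialise the result with collect().
ds <- open_dataset(csv_path, format = "csv")
subset_tbl <- ds %>%
  filter(value > 90) %>%
  select(id, value) %>%
  collect()
```

Only the rows and columns that survive the filter end up as an ordinary data frame in R.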

Check out the data.table package, as it's much more robust than readr at reading and processing large data such as yours. dplyr and readr are good, but for something that large your best bet is something backed by C, which will be much faster. Your alternative is SQL or a database-linking package, as suggested earlier.
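
For what it's worth, a minimal sketch of the data.table route, assuming you only need some of the columns. The file and the column names (id, group, value) are invented for the example.

```r
library(data.table)

# Small stand-in file with hypothetical columns.
csv_path <- tempfile(fileext = ".csv")
fwrite(data.table(id = 1:100,
                  group = rep(c("a", "b"), 50),
                  value = as.numeric(1:100)),
       csv_path)

# 'select' reads only the named columns, which saves memory
# when the file is wide.
dt <- fread(csv_path, select = c("id", "value"))

# Subset with data.table syntax.
big_values <- dt[value > 90]
```

Note that fread still reads every row of the selected columns into memory, so on its own this may not be enough for a file that does not fit in RAM.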

I just saw a notice for what seems like a new package that may be worth looking into. It is called disk.frame: https://github.com/xiaodaigh/disk.frame

How big is your data?

I actually did try that on the desktop version before read_csv_chunked! Really the problem right now is that the file is on my computer and it's too large to upload into RStudio Cloud, so I'm wondering what a good way would be to get it into RStudio Cloud without uploading all of it, or something along those lines. Thank you for your response though!

Wow. So, transferring your data across your network will take time.
I suppose you can estimate how long by finding your upload speed from somewhere like https://www.speedtest.net/
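
A back-of-the-envelope version of that estimate, as a sketch. The 20 Mbit/s upload speed is a made-up number; plug in whatever the speed test reports. It also assumes 1 GB = 1e9 bytes and ignores protocol overhead.

```r
file_size_gb  <- 43   # size reported in this thread
upload_mbit_s <- 20   # hypothetical upload speed from a speed test

# Convert gigabytes to bits, then divide by bits per second.
file_bits    <- file_size_gb * 1e9 * 8
upload_hours <- file_bits / (upload_mbit_s * 1e6) / 3600

round(upload_hours, 1)   # roughly 4.8 hours at this assumed speed
```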

You could probably compress/zip your file if you were committed to sending it. It might be worth testing on some number of chunks' worth of your data (zipped and unzipped) to see if the upload times differ significantly.

Putting aside the network transfer challenge, if you wanted to try another way to access the large CSV data on your desktop, I would look at whether the mmap package, with its mmap.csv function, would help.

It is just under 43 GB

43 GB
Ouch.
Would it be possible to read the .csv file into a database and use one of the various R db connections to read in data as needed?
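
One way this could look, as a sketch only: stream the CSV into an SQLite file with read_csv_chunked so that only one chunk is in memory at a time. The table name big_table, the columns, and the tiny stand-in file are all assumptions for illustration.

```r
library(readr)
library(DBI)

# Small stand-in for the 43 GB file (hypothetical columns).
csv_path <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:1000, value = rep(c(1.5, 2.5), 500)), csv_path)

db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), db_path)

# Append each chunk to the table as it is read.
append_chunk <- function(chunk, pos) {
  dbWriteTable(con, "big_table", as.data.frame(chunk), append = TRUE)
}

read_csv_chunked(
  csv_path,
  SideEffectChunkCallback$new(append_chunk),
  chunk_size = 250,
  col_types = cols(id = col_integer(), value = col_double())
)

# Check how many rows ended up in the database.
n_loaded <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM big_table")$n
dbDisconnect(con)
```

Once the data is in the database file, later sessions can query just the rows they need instead of re-reading the CSV.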

Okay, awesome, thank you so much for these suggestions!