Advice and best practice for dealing with the 1GB limit when scraping

best-practices
recommendations

#1

I'm using RStudio Cloud to scrape a bunch of PDFs from a website.

Each PDF is reasonably large, about 20 MB, and there are hundreds of them, so I regularly need to download the scraped PDFs to my local computer and then delete them from the RStudio Cloud project to stay under the 1 GB limit.

I am following Josh's response about exporting here: Exporting datasets from RStudio Cloud. But because I need to do this so often, I'm wondering if there is a better way than running the download/delete process manually. For instance, is there a way to save to my local computer as I go?

Additionally, if I were to get a paid shinyapps.io account, would this allow me to go over the 1 GB limit, or is the 'you will not encounter these space limits' comment in the Guide just referring to the number of members, not the memory restriction?

Finally, even when the total size of the files is well under 1 GB (say 500 MB), the workspace seems to crash if I try to do the above manual download of the PDFs. But if I grab about 10 of the PDFs at a time, it seems to work. So I'm guessing that it's the size that is causing it to crash. Is there an alternative way to download and then delete, other than 'More' -> 'Export' followed by 'Delete', that might avoid this problem?


#2

The 1 GB limit refers to memory, not disk storage.

My understanding of the way the RStudio IDE performs exports/downloads is that the entire contents of the file are read into memory before the download begins. So any data loaded into the current session reduces the amount of memory available when performing the export/download.
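If the export really is bounded by memory in this way, one workaround is to export in small batches, which matches the observation that about 10 PDFs at a time works. The following is only a rough sketch of that idea, not anything built into RStudio Cloud: the function names, the archive naming, and the 200 MB default cap (roughly 10 files of 20 MB) are all assumptions for illustration. Note that writing the archives temporarily uses extra disk space.

```python
import os
import zipfile
from pathlib import Path


def plan_batches(paths, max_batch_bytes):
    """Group files, in order, into batches whose total size is at or under the cap.

    A single file larger than the cap still gets a batch of its own.
    """
    batches, current, current_bytes = [], [], 0
    for path in paths:
        size = os.path.getsize(path)
        # Start a new batch when adding this file would exceed the cap.
        if current and current_bytes + size > max_batch_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def zip_batches(pdf_dir, out_dir, max_batch_bytes=200 * 1024 * 1024):
    """Write one zip archive per batch and return the archive paths.

    The 200 MB default is an assumption, not a documented limit.
    """
    pdfs = sorted(Path(pdf_dir).glob("*.pdf"))
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for i, batch in enumerate(plan_batches(pdfs, max_batch_bytes), start=1):
        archive = out_dir / f"pdf_batch_{i:03d}.zip"
        # ZIP_STORED: PDFs are usually compressed already, so skip recompression.
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_STORED) as zf:
            for pdf in batch:
                zf.write(pdf, arcname=pdf.name)
        archives.append(archive)
    return archives
```

Each archive can then be exported on its own and deleted, together with the PDFs it contains, once the download has been checked locally.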

There is also a limit on disk storage, which is currently 3 GB. We are looking at what will be reasonable and useful as we continue to work on rstudio.cloud, so if you have thoughts on this we would love to hear them.
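As a rough guide, with PDFs of about 20 MB each, a 3 GB disk limit leaves room for roughly 150 files before something has to be exported and deleted. A scraper can check this as it goes. The sketch below is illustrative only: the function names are made up, and it assumes the limit applies to everything under the given project directory, which may not be exactly how usage is counted.

```python
import os


def dir_usage_bytes(root):
    """Total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def downloads_remaining(root, avg_file_bytes, disk_limit_bytes=3 * 1024**3):
    """Estimate how many more files of the given size fit under the disk limit.

    The 3 GB default comes from the figure quoted in this thread and may change.
    """
    free = max(disk_limit_bytes - dir_usage_bytes(root), 0)
    return free // avg_file_bytes
```

Calling `downloads_remaining(".", 20 * 1024**2)` before each download, and pausing to export once it gets close to zero, would avoid hitting the disk limit partway through a scrape.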


#3

Thank you very much for your response and the information, Josh.

Looking forward to moving more of my work over to RStudio Cloud as it continues to improve.