Reading Multiple Files in R


#1

Hello there,

I have 5.3 GB of data: 35,360 zip files, each containing one csv file, organized into 41 folders. These are log files. The file names are organized like this:

Folder 2018-10-25:

2018-10-25-00-00-0e41.csv.gz;

2018-10-25-00-00-7f7d.csv.gz;

2018-10-25-00-00-32fa.csv.gz and so on;

Folder 2018-10-26:

2018-10-26-00-00-0e41.csv.gz;

2018-10-26-00-00-7f7d.csv.gz;

2018-10-26-00-00-32fa.csv.gz and so on.

The last folder is 2018-12-04.

How can I read all those files into R as a single dataset? Any tips for working with such a large amount of data?

Kind Regards,

Luiz.


#2

5.3GB is a lot of data; you might have problems fitting it into memory. Do you have access to a database? Databases play nicely with dplyr (via DBI / dbplyr) and can do a lot of heavy lifting for you.
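A minimal sketch of what that looks like, assuming the DBI, RSQLite, dplyr and dbplyr packages are installed; any DBI-backed database (Postgres, MariaDB, ...) works the same way, and the table name `logs` and its columns are made up for illustration:

```r
library(DBI)
library(dplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# push some rows in; in your case this would be the contents of the csv files
dbWriteTable(con, "logs",
             data.frame(day = c("2018-10-25", "2018-10-25", "2018-10-26"),
                        msg = c("a", "b", "c")))

# dplyr (via dbplyr) translates verbs to SQL, so the heavy lifting
# happens inside the database, not in R's memory
per_day <- tbl(con, "logs") %>%
  count(day) %>%
  collect()  # only the small summarised result is pulled back into R

dbDisconnect(con)
```

The point is that `collect()` is called only on the summary, so the full 5.3 GB never has to fit into RAM at once.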

My approach would be along these lines:

zipfiles <- list.files(pattern = '\\.gz$', recursive = TRUE) # character vector of .gz files in the working directory and its 41 subfolders

for (i in seq_along(zipfiles)) {
  unzip(zipfiles[i], files = 'name_of_yer_file.csv', exdir = tempdir(), junkpaths = TRUE)
  # your csv file will be unzipped to tempdir()

  # somehow insert the content of the csv file into your database;
  # this will depend on its structure and your database of choice
}
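If the files are plain gzip rather than zip archives (which the `.csv.gz` extension suggests), base R can read them directly through `gzfile()`, with no unzip step at all. A self-contained sketch, assuming DBI and RSQLite are installed; the example file, the table name `logs` and its columns (`ts`, `msg`) are made up:

```r
library(DBI)

# create one tiny example file so the sketch runs on its own;
# in practice your 41 folders of .csv.gz files already exist
dir.create(file.path(tempdir(), "2018-10-25"), showWarnings = FALSE)
example <- file.path(tempdir(), "2018-10-25", "2018-10-25-00-00-0e41.csv.gz")
gz <- gzfile(example, "w")
write.csv(data.frame(ts = "2018-10-25 00:00", msg = "hello"), gz,
          row.names = FALSE)
close(gz)

db <- dbConnect(RSQLite::SQLite(), ":memory:")

files <- list.files(tempdir(), pattern = "\\.csv\\.gz$",
                    recursive = TRUE, full.names = TRUE)
for (f in files) {
  chunk <- read.csv(gzfile(f), stringsAsFactors = FALSE)  # one file at a time
  dbWriteTable(db, "logs", chunk, append = TRUE)          # append to the db
}

n <- dbGetQuery(db, "SELECT COUNT(*) AS n FROM logs")$n
dbDisconnect(db)
```

Because only one file is in R at any moment, memory use stays flat no matter how many of the 35,360 files you loop over.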

Also, if you are willing to risk the Purity of Essence of your code, consider this script.

It is written in the language of the snake people and is easily integrated with R code. I have used it with great success when parsing S3 logs; I am certain it can be adapted to other log structures with only minor hacking.


closed #3

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.