readr: read all CSV files in a ZIP archive


#1

I have downloaded multiple ZIP archives from the Census Bureau. Several of these archives contain multiple CSV files that need to be read and combined into a single data frame. Are there options to read_csv that will accomplish this?

The examples on SO and elsewhere address the situation where the ZIP archive is on a web site. This is not my problem. I am reading the ZIP archives from a local disk.
TIA


#2

It depends on the scope of "this." From the read_csv() docs, under the file argument:

Either a path to a file, a connection, or literal data (either a single string or a raw vector).
Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded.

However, if there are other files in the ZIP archive, readr won't know what to do with them (the example in the same docs shows mtcars.csv.zip, an archive containing a single file).
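If you know the name of a member file, you can still hand read_csv() a connection to just that file inside the archive, using base R's unz(). A minimal sketch, where both the archive name and the member name are placeholders:

```r
library(readr)

# Minimal sketch: read one named member of a multi-file ZIP archive.
# "census.zip" and "pums_person.csv" are placeholder names, not real files.
con <- unz("census.zip", "pums_person.csv")  # connection to one member of the archive
df  <- read_csv(con)                         # read_csv() accepts a connection
```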

To read in and join multiple CSV files, you'll have to tell readr where those files are. The answer on SO below is one approach; it doesn't use readr, but that's fine, and you could probably adapt it to use readr if you wanted to:

This thread on GitHub also has several good approaches using readr and purrr:

The above approaches pre-date the fs (short for file system) package, which can also be of great help.
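For instance, one way to combine fs, purrr, and readr, sketched here under the assumption of a local archive named acs_pums.zip (a placeholder name):

```r
library(fs)
library(purrr)
library(readr)

# Sketch: extract the archive to a temp directory, keep only the CSV
# members, and row-bind them. "acs_pums.zip" is a placeholder name.
exdir <- path(path_temp(), "pums")
unzip("acs_pums.zip", exdir = exdir)

combined <- dir_ls(exdir, glob = "*.csv") |>  # fs lists only the CSV files
  map_dfr(read_csv)                           # purrr row-binds each read
```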

Have you looked at the tidycensus package? Having worked with a lot of Census and American Community Survey (acs) data before tidycensus existed, I definitely wish that I'd been able to use it directly.


#3

Thanks for these references. I'm reading ACS PUMS data, so tidycensus doesn't help solve this problem (which is not to criticize tidycensus). Cheers, AB


#4

I believe the acs package does! I linked to that one above, as well.


#5

If you haven't already, you might also consider using microdata from IPUMS (https://www.ipums.org/) with the associated ipumsr package (https://github.com/mnpopcenter/ipumsr). IPUMS gets you formatted microdata that is quite user-friendly and ipumsr has a number of helper functions for processing the data in R.


#6

Hi,
Thanks for the suggestions. I guess I did this the hard way:

library(readr)

proc_csv <- function(inZip, varList) {  # inZip is the ZIP archive; varList is the column specification for readr
  outFile <- data.frame()

  filList <- unzip(inZip, list = TRUE)  # List the archive's contents without extracting
  for (j in seq_len(nrow(filList))) {   # Loop through the list of files
    if (grepl("\\.csv$", filList[j, 1])) {  # If a file is a CSV file, read it straight from the archive
      oFa <- read_csv(unz(inZip, filList[j, 1]), col_names = TRUE, col_types = varList)
      outFile <- rbind(outFile, oFa)    # Then add the data files together
    }
  }
  outFile <- outFile[, c("PUMA", "ST", "MIGPUMA", "MIGSP", "PWGTP")]  # Finally, select the required variables
  # In my case, the variables in the archive vary by year, so this step is necessary
  return(outFile)
}

This is not especially elegant, but it is a general solution. It would be great if there were an option in read_csv() that allowed reading all of the CSV files found in a ZIP archive.
Cheers


#7

@adambickford, you could do it with purrr::map_df(), for example.

Basically, download your zip, unzip it, peek inside and list all the file names with list.files(), and then iterate over the file names with purrr::map_df():

library(purrr)
library(readr)

url <- "url-to-your-zip"
path_zip <- "your-downloaded-zip-local-path"
path_unzip <- "path-where-to-save-unzip-files"
destfile <- "archive.zip"

# download zip
curl::curl_download(url, destfile = paste(path_zip, destfile, sep = "/"))

# unzip
unzip(paste(path_zip, destfile, sep = "/"), exdir = path_unzip)

# list all files
files <- list.files(path = path_unzip, pattern = "\\.csv$")
 
# apply map_df() to iterate read_csv over the files
data <- map_df(paste(path_unzip, files, sep = "/"),
               read_csv)  # additional params to read_csv go here