readr: read all CSV files in a ZIP archive


#1

I have downloaded multiple ZIP archives from the Census Bureau. Several of these archives contain multiple CSV files that need to be read and combined into a single data frame. Are there options to read_csv that will accomplish this?

The examples on SO and elsewhere address the situation where the ZIP archive is on a web site. This is not my problem. I am reading the ZIP archives from a local disk.
TIA


#2

It depends on the scope of "this." From the read_csv() docs, under the file argument:

Either a path to a file, a connection, or literal data (either a single string or a raw vector).
Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded.

However, if there are other files in the ZIP archive, readr won't know what to do with them (the example in the same docs shows mtcars.csv.zip, an archive containing a single file).
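If you know the name of a member file, you can still hand read_csv() a connection to just that file inside the archive, using base R's unz(). A minimal sketch, where both the archive name and the member name are placeholders:

```r
library(readr)

# Minimal sketch: read one named member of a multi-file ZIP archive.
# "census.zip" and "pums_person.csv" are placeholder names, not real files.
con <- unz("census.zip", "pums_person.csv")  # connection to one member of the archive
df  <- read_csv(con)                         # read_csv() accepts a connection
```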

To read in and join multiple CSV files, you'll have to tell readr where those files are. The answer on SO below is one approach; it doesn't use readr, but that's fine, and you could probably adapt it to use readr if you wanted to:

This thread on GitHub also has several good approaches using readr and purrr:

The above approaches pre-date the fs (short for file system) package, which can also be of great help.
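For instance, one way to combine fs, purrr, and readr, sketched here under the assumption of a local archive named acs_pums.zip (a placeholder name):

```r
library(fs)
library(purrr)
library(readr)

# Sketch: extract the archive to a temp directory, keep only the CSV
# members, and row-bind them. "acs_pums.zip" is a placeholder name.
exdir <- path(path_temp(), "pums")
unzip("acs_pums.zip", exdir = exdir)

combined <- dir_ls(exdir, glob = "*.csv") |>  # fs lists only the CSV files
  map_dfr(read_csv)                           # purrr row-binds each read
```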

Have you looked at the tidycensus package? Having worked with a lot of Census and American Community Survey (acs) data before tidycensus existed, I definitely wish that I'd been able to use it directly.


#3

Thanks for these references. I'm reading ACS PUMS data, so tidycensus doesn't help solve this problem (which is not to criticize tidycensus). Cheers, AB


#4

I believe the acs package does! I linked to that one above, as well.


#5

If you haven't already, you might also consider using microdata from IPUMS (https://www.ipums.org/) with the associated ipumsr package (https://github.com/mnpopcenter/ipumsr). IPUMS gets you formatted microdata that is quite user-friendly and ipumsr has a number of helper functions for processing the data in R.


#6

Hi,
Thanks for the suggestions. I guess I did this the hard way:

library(readr)

proc_csv <- function(inZip, varList) {  # inZip is the ZIP archive; varList is the column specification for readr
  outFile <- data.frame()

  filList <- unzip(inZip, list = TRUE)  # List the archive's contents without extracting
  for (j in seq_len(nrow(filList))) {   # Loop through the list of files
    if (grepl("\\.csv$", filList[j, 1])) {  # If a file is a CSV file, read it straight from the archive
      oFa <- read_csv(unz(inZip, filList[j, 1]), col_names = TRUE, col_types = varList)
      outFile <- rbind(outFile, oFa)    # Then add the data files together
    }
  }
  outFile <- outFile[, c("PUMA", "ST", "MIGPUMA", "MIGSP", "PWGTP")]  # Finally, select the required variables
  # In my case, the variables in the archive vary by year, so this step is necessary
  return(outFile)
}

This is not especially elegant, but it is a general solution. It would be great if there were an option in read_csv() that allowed reading all of the CSV files found in a ZIP archive.
Cheers


#7

@adambickford, you could do it with purrr::map_df(), for example.

Basically, download your zip, unzip it, peek inside and list all the file names with list.files(), and then iterate over the file names with purrr::map_df():

library(purrr)
library(readr)

url <- "url-to-your-zip"
path_zip <- "your-downloaded-zip-local-path"
path_unzip <- "path-where-to-save-unzip-files"
destfile <- "archive.zip"

# download zip
curl::curl_download(url, destfile = paste(path_zip, destfile, sep = "/"))

# unzip
unzip(paste(path_zip, destfile, sep = "/"), exdir = path_unzip)

# list all files
files <- list.files(path = path_unzip, pattern = "\\.csv$")
 
# apply map_df() to iterate read_csv over the files
data <- map_df(paste(path_unzip, files, sep = "/"),
               read_csv)  # additional params to read_csv go here