I have downloaded multiple ZIP archives from the Census Bureau. Several of these archives contain multiple CSV files that need to be read and combined into a single data frame. Are there options to read_csv that will accomplish this?
The examples on SO and elsewhere address the situation where the ZIP archive is on a web site. This is not my problem. I am reading the ZIP archives from a local disk.
TIA
It depends on the scope of "this." From the read_csv() docs, on the file argument:
Either a path to a file, a connection, or literal data (either a single string or a raw vector).
Files ending in .gz, .bz2, .xz, or .zip will be automatically uncompressed. Files starting with http://, https://, ftp://, or ftps:// will be automatically downloaded.
However, if the ZIP archive contains more than one file, readr won't know what to do with the others (the example in the same docs shows a single-file archive, mtcars.csv.zip).
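As a rough sketch of working around that, you can list the archive contents with base R's unzip(list = TRUE) and hand read_csv() a unz() connection to one named member. The helper name read_zip_member and its arguments are made up for illustration; only unzip(), unz(), and read_csv() are real functions.

```r
library(readr)

# Hypothetical helper: read one named CSV out of a local ZIP archive.
# zip_path and member are placeholders for your own archive and file name.
read_zip_member <- function(zip_path, member, ...) {
  # unzip(list = TRUE) returns a data frame (Name, Length, Date)
  # describing the archive without extracting anything.
  contents <- unzip(zip_path, list = TRUE)
  if (!member %in% contents$Name) {
    stop("'", member, "' not found in ", zip_path)
  }
  # unz() opens a connection to a single file inside the archive.
  read_csv(unz(zip_path, member), ...)
}
```

Calling unzip(zip_path, list = TRUE)$Name on its own is a quick way to see which members are available before deciding what to read.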
To read in and join multiple csv files, you'll have to tell readr where those files are. The answer on SO, below, is one approach, though it doesn't use readr (which is fine — you could also probably adapt it to use readr, if you wanted to):
This thread on GitHub also has several good approaches using readr and purrr:
The above pre-dated the fs (short for file system) package, which can also be of great help.
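For what it's worth, here is a minimal sketch of how fs, purrr, and readr might fit together for a local archive: extract to a temporary directory, find the CSVs, and row-bind them. The function name read_zip_csvs and the source_file column are my own inventions, and the sketch assumes every CSV in the archive has compatible columns.

```r
library(readr)
library(purrr)
library(fs)

# Hypothetical helper: extract a local ZIP archive and combine its CSVs.
read_zip_csvs <- function(zip_path, ...) {
  # Extract into a fresh temporary directory (removed when the R session ends).
  exdir <- file_temp("unzipped_")
  dir_create(exdir)
  unzip(zip_path, exdir = exdir)

  # Find every .csv under the extraction directory, including subfolders.
  # Note: this glob is case-sensitive, so files ending in .CSV are skipped.
  csv_files <- dir_ls(exdir, recurse = TRUE, glob = "*.csv")
  names(csv_files) <- path_file(csv_files)

  # Read each file and row-bind the results; the source_file column
  # records which member each row came from.
  map_dfr(csv_files, read_csv, ..., .id = "source_file")
}
```

Extra arguments such as col_types are passed through to read_csv(), so a call might look like read_zip_csvs("my_archive.zip", col_types = cols(.default = "c")), where the archive name is a placeholder.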
Have you looked at the tidycensus package? Having worked with a lot of Census and American Community Survey (ACS) data before tidycensus existed, I definitely wish that I'd been able to use it directly.
If you haven't already, you might also consider using microdata from IPUMS (https://www.ipums.org/) with the associated ipumsr package (https://github.com/mnpopcenter/ipumsr). IPUMS gets you formatted microdata that is quite user-friendly and ipumsr has a number of helper functions for processing the data in R.
Hi,
Thanks for the suggestions. I guess I did this the hard way:
library(readr)

proc_csv <- function(inZip, varList) {
  # inZip is the path to the ZIP archive; varList is the column specification for readr
  outFile <- data.frame()
  filList <- unzip(inZip, list = TRUE)  # List the files in the archive without extracting
  for (j in seq_len(nrow(filList))) {  # Loop through the list of files
    # If a file is a CSV file, read it straight from the archive
    if (grepl("\\.csv$", filList$Name[j], ignore.case = TRUE)) {
      oFa <- read_csv(unz(inZip, filList$Name[j]), col_names = TRUE, col_types = varList)
      outFile <- rbind(outFile, oFa)  # Then add the data files together
    }
  }
  # Finally, select the required variables.
  # In my case, the variables in the archive vary by year, so this step is necessary.
  outFile <- outFile[, c("PUMA", "ST", "MIGPUMA", "MIGSP", "PWGTP")]
  return(outFile)
}
This is not especially elegant, but it is a general solution. It would be great if there were an option in read_csv that allowed reading all of the CSV files found in a ZIP archive.
Cheers
My code is similar. I have already downloaded and unzipped the files, and then I do this:
library(readr)
library(purrr)

# list all files in data_raw whose names contain "ai"
files <- dir(path = "data_raw", pattern = "ai", full.names = TRUE)
# combine all of the data files into one table
ai_data <- files %>%
  # read in all of the CSVs in the files list
  map_dfr(read_csv)
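If you are on readr 2.0.0 or later, read_csv() also accepts a vector of paths directly and can record the source file in a column via its id argument, so the map step may not be needed. A small self-contained sketch follows; the demo directory, file names, and values are invented for illustration, and it assumes all files share the same columns.

```r
library(readr)

# Build two small demo files (made-up data) so the example runs on its own.
demo_dir <- file.path(tempdir(), "data_raw_demo")
dir.create(demo_dir, showWarnings = FALSE)
write_csv(data.frame(id = 1:2, value = c(10, 20)), file.path(demo_dir, "ai_2016.csv"))
write_csv(data.frame(id = 3:4, value = c(30, 40)), file.path(demo_dir, "ai_2017.csv"))

# List the CSV files whose names start with "ai".
files <- dir(demo_dir, pattern = "^ai.*\\.csv$", full.names = TRUE)

# Passing the whole vector reads and row-binds every file;
# id = "source_file" adds a column holding each row's file path.
ai_data <- read_csv(files, id = "source_file", show_col_types = FALSE)
```

This only works when the files have matching column names, so archives whose variables change from year to year would still need a per-file step like the ones above.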