Download multiple files using the "download.file" function while skipping broken links (with walk2)

I am writing code to download multiple PDF files from http://www.understandingwar.org/report/afghanistan-order-battle that detail the U.S. war in Afghanistan. The code below works when all of the links generated with glue exist on the website, but it breaks when some are missing.

The following code successfully downloads a single PDF file as expected.

#---- Loads Packages

library("pdftools")
library("glue")
library("tidyverse")

#---- Creates a List of All of the ORBAT PDF URLs

month <- c("January", "February", "March", "April", "May", "June", "July",
           "August", "September", "October", "November", "December")

year <- c("2013", "2014", "2015", "2016", "2017")

# Creates a String of the URL Addresses
urls <- 
  tidyr::expand_grid(month, year) %>%
  filter(month == "October" & year == "2013") %>% 
  glue_data("http://www.understandingwar.org/sites/default/files/AfghanistanOrbat_{month}{year}.pdf")

head(urls, 5)  

# Creates Names for the PDF Files 
pdf_names <- 
  tidyr::expand_grid(month, year) %>%
  filter(month == "October" & year == "2013") %>% 
  glue_data("orbat-report-{month}-{year}.pdf")

head(pdf_names, 5)

#---- Downloads the PDF Files Using purrr
walk2(urls, pdf_names, download.file, mode = "wb")

The problem is that several of the links are broken. When I try to download all of the files in the list of URLs generated with glue_data(), the code fails. Does anyone have ideas for how to skip the broken links and download only the ones that work, while still using walk2()?


purrr::safely() is one option here. If you wrap download.file() in safely(), it captures the errors instead of throwing them, so walk2() keeps iterating over the remaining URLs. In this case you don't care about what is returned, so the files that aren't found are simply skipped. If you do store the results of a call wrapped in safely(), each element is a list with result and error elements.

library("pdftools")
library("glue")
library("tidyverse")

month <- c("January", "February", "March", "April", "May", "June", "July",
           "August", "September", "October", "November", "December")

year <- c("2013", "2014", "2015", "2016", "2017")

# Creates a String of the URL Addresses
urls <- 
  tidyr::expand_grid(month, year) %>%
  glue_data("http://www.understandingwar.org/sites/default/files/AfghanistanOrbat_{month}{year}.pdf")

# Creates Names for the PDF Files 
pdf_names <- 
  tidyr::expand_grid(month, year) %>%
  glue_data("orbat-report-{month}-{year}.pdf")

safe_download <- safely(~ download.file(.x, .y, mode = "wb"))
walk2(urls, pdf_names, safe_download)

Created on 2020-01-30 by the reprex package (v0.3.0)
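If you also want to know which downloads failed, you can store the output instead of discarding it. This is just a minimal sketch that reuses the urls, pdf_names, and safe_download objects defined above; it collects the safely() results with map2() and pulls out the names of the files whose error element is non-NULL:

# Store the results: each element is a list with a `result` element
# (NULL when the download failed) and an `error` element (NULL on success)
results <- map2(urls, pdf_names, safe_download)

# File names whose download raised an error (i.e. the broken links)
failed <- pdf_names[map_lgl(results, ~ !is.null(.x$error))]
failed

Because result and error are mutually exclusive, checking error for NULL is enough to separate the successful downloads from the broken links.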


Thanks, this suggestion is perfect!!

