I have a vector of domain names, and I want to check whether they are archived by the Wayback Machine. However, the vector is fairly long and it takes a lot of time to check the domains one by one. So I wanted to use furrr and parallelize the process. However, when I use the future_map variants, I receive a Service Unavailable (HTTP 503) error.
Is there a way to solve this problem? Is there another way to speed up API access?
Here is the reprex:
library(tidyverse)
library(wayback)
library(furrr)
#> Loading required package: future
plan(multiprocess)
domain_vec <- read_csv("/Users/berkcandeniz/Desktop/bw_sample.csv") %>%
pull(domain)
#> Parsed with column specification:
#> cols(
#> domain = col_character()
#> )
archive_check <- future_map_dfr(
.x = domain_vec, # the vector of domains
.f = archive_available # the function from the wayback package
)
#> Error in ...future.f(...future.x_jj, ...): Service Unavailable (HTTP 503).
Created on 2019-04-23 by the reprex package (v0.2.1.9000)
Can you give a couple of examples of the domains you are trying to check? We don't have access to your CSV file, and for this problem I would imagine having just a couple of actual domains in your reprex should be sufficient.
But just to clarify: if you use archive_available with one of the domains directly, does everything work correctly?
Since you got a 503, it's possible that some rate-throttling is happening and you can only make n requests per second. Bob Rudis has a package that allows you to check that for a given domain.
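If rate-throttling is the cause, one option is to give up on parallelism and deliberately space out the requests instead. A minimal sketch, assuming purrr >= 0.3 (for `slowly()` and `rate_delay()`) and the `domain_vec` from your reprex; the one-second delay is an arbitrary guess, not a documented limit of the Wayback Machine:

```r
library(purrr)
library(wayback)

# Throttled version of archive_available(): waits 1 second between calls.
# The delay value is an assumption -- tune it to whatever the API tolerates.
archive_slowly <- slowly(archive_available, rate = rate_delay(1))

# Sequential, but each request is spaced out to avoid 503s.
archive_check <- map_dfr(domain_vec, archive_slowly)
```

Note that if the server is throttling you, parallel workers won't help anyway: the rate limit, not your CPU, is the bottleneck.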
Also, you can try wrapping your function in purrr::safely or similar to see whether there is a specific domain that is causing you trouble.