API (Wayback) Access with Furrr

I have a vector of domain names, and I want to check whether they are archived by the Wayback Machine. However, the vector is fairly long and checking domains one by one takes a lot of time, so I wanted to use furrr to parallelize the process. But when I use the future_map variants, I receive a Service Unavailable (HTTP 503) error.

Is there a way to solve this problem? Is there another way to speed up API access?

Here is the reprex:

library(tidyverse)
library(wayback)
library(furrr)
#> Loading required package: future

plan(multiprocess)

domain_vec <- read_csv("/Users/berkcandeniz/Desktop/bw_sample.csv") %>% 
  pull(domain)
#> Parsed with column specification:
#> cols(
#>   domain = col_character()
#> )

archive_check <- future_map_dfr(
  .x = domain_vec, # the vector of domains
  .f = archive_available # the function from the wayback package
  )
#> Error in ...future.f(...future.x_jj, ...): Service Unavailable (HTTP 503).

Created on 2019-04-23 by the reprex package (v0.2.1.9000)

Can you give a couple of examples of the services you are trying to connect to? We don't have access to your CSV file, and for this problem I would imagine that having just a couple of actual links in your reprex should be sufficient.

But just to clarify: if you use archive_available with one of the domains directly, everything works correctly?


Yes, when I use archive_available directly or with a purrr function, everything works fine. It just takes too long.
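
For reference, the sequential version I have been using looks like this (a minimal sketch of what I mean by "with a purrr function"; domain_vec is the same vector as above):

library(purrr)
library(wayback)

# one request at a time -- this works, but is slow for a long vector
archive_check <- map_dfr(domain_vec, archive_available)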

Here are some domains from the CSV file (I can't upload the CSV):

bayarearealtysearch.com
twinklingvisions.com
insdr.co
westridgeroverretreat.ca
vente-privee.ma
ironwillreachestheheaven.com
july292017.com
0895news.com
worldismyworkplace.com
kinderwagenbestellen.nl
michelleandjan.com
penndata.com
zarahome.be
katherinelebron.com
watch-21.ml
sugaraddictioncode.com
buildabearworkshop.ca
h91kw.tk
uk17i.tk

The function archive_available works like this:

library(tidyverse)
library(wayback)

domain_vec <- read_csv("/Users/berkcandeniz/Desktop/bw_sample.csv") %>% 
  pull(domain)
#> Parsed with column specification:
#> cols(
#>   domain = col_character()
#> )

archive_available(domain_vec[1])
#> # A tibble: 1 x 5
#>   url        available closet_url                timestamp           status
#>   <chr>      <lgl>     <chr>                     <dttm>              <chr> 
#> 1 bayareare… TRUE      http://web.archive.org/w… 2018-11-03 00:00:00 200

Created on 2019-04-24 by the reprex package (v0.2.1.9000)

It seems to work for me with furrr:

domain_vec <- c("bayarearealtysearch.com", "twinklingvisions.com")

library(tidyverse)
library(wayback)
library(furrr)
#> Loading required package: future

plan(multiprocess)

archive_check <- future_map_dfr(
  .x = domain_vec, # the vector of domains
  .f = archive_available # the function from the wayback package
)

archive_check
#> # A tibble: 2 x 6
#>   url     available closet_url       timestamp           status closest_url
#>   <chr>   <lgl>     <chr>            <dttm>              <chr>  <lgl>      
#> 1 bayare… TRUE      http://web.arch… 2018-11-03 00:00:00 200    NA         
#> 2 twinkl… FALSE     <NA>             NA                  404    NA

Created on 2019-04-25 by the reprex package (v0.2.1)

It's possible that, since you got a 503, there is some rate throttling happening and you can only make n requests per second. Bob Rudis has a package that lets you check that for a given domain.
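
If that's the case, one option is to slow the requests down. A minimal sketch, assuming roughly one request per second is acceptable (the delay value is a guess; adjust it to whatever the service actually allows):

# hypothetical wrapper: pause before each request to stay under the limit.
# Note: with several parallel workers the combined request rate multiplies,
# so you may need a longer delay (or a sequential plan) to truly respect it.
archive_available_slow <- function(domain, delay = 1) {
  Sys.sleep(delay)
  archive_available(domain)
}

archive_check <- future_map_dfr(
  .x = domain_vec,
  .f = archive_available_slow
)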

Also, you can try wrapping your function in purrr::safely or similar to see whether there is a specific domain that is causing you trouble.
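
Something like this (a sketch; safely returns a list with result and error components for each call):

safe_archive <- purrr::safely(archive_available)

results <- future_map(domain_vec, safe_archive)

# pick out the domains whose calls errored
failed <- domain_vec[purrr::map_lgl(results, ~ !is.null(.x$error))]
failed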

