Efficiently filling in NULLs

When I do web scraping or make many API calls, I often get throttled at some point. This doesn't break my code because I wrap my functions with purrr::safely, but eventually I still need to go back and retry the missing entries. My question is: what's the most efficient way to retry these NULLs?

For example, suppose I tried:

library(tidyverse)
library(geonames)

options(geonamesUsername="XXXX")
options(geonamesHost="api.geonames.org")

find_zip_code <- safely(GNfindNearbyPostalCodes)


zipcodes <- tibble(
  locationlatitude = c(
    43.142, 45.015,
    34.296,  40.714, 40.661
  ),
  locationlongitude = c(
    -85.049, -93.340,
    -80.113, -75.032, -74.012
  ),
  zipcode = list("29079", "55422", "48834", NULL, NULL)
)

### how can I selectively retry the NULLs?
## my usual method would be to filter for NULLs and join later
## 1) my normal method of mapping works
zipcodes %>% 
  mutate(
    zip = map2(locationlatitude,
               locationlongitude,
               ~find_zip_code(
                 lat = .x, 
                 lng = .y,
                 maxRows = 1))
  )

## 2) I could try some map_if method to target NULLs but this fails
retry_df2 <- zipcodes %>% 
  nest(coord = c(locationlatitude, locationlongitude)) %>% 
  mutate(
    zip = map_if(coord, 
                 .p = is.null(zipcode), 
                 .f =  
                   ~find_zip_code(
                      lat = .x$locationlatitude, 
                      lng = .x$locationlongitude,
                 maxRows = 1), .else = list(F)
                   )
  )
#> Error: Problem with `mutate()` input `zip`.
#> x length(.p) == length(.x) is not TRUE
#> ℹ Input `zip` is `map_if(...)`.

Created on 2020-11-22 by the reprex package (v0.3.0)

It is easy enough to filter and then rejoin, but this is clumsy, and I am wondering if there is a more elegant solution.

Not sure what is the best way to handle the problem, but I can tell you why your second method fails:

library(tidyverse)
zipcodes <- tibble(
  locationlatitude = c(
    43.142, 45.015,
    34.296,  40.714, 40.661
  ),
  locationlongitude = c(
    -85.049, -93.340,
    -80.113, -75.032, -74.012
  ),
  zipcode = list("29079", "55422", "48834", NULL, NULL)
)

is.null(zipcodes$zipcode)
#> [1] FALSE

map_lgl(zipcodes$zipcode, is.null)
#> [1] FALSE FALSE FALSE  TRUE  TRUE

Created on 2020-11-22 by the reprex package (v0.3.0)

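A single call to is.null() tests whether the whole object is NULL. It is not a vectorized test of the elements (unlike is.na() and similar functions), so on a list it returns a single FALSE and you have to map it over the elements yourself.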
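To make the difference concrete, here is a minimal sketch using only the zipcode list from the reprex. The variable names (whole_object, missing_purrr, missing_base, to_retry) are made up for illustration, and the base R vapply() line is just one possible equivalent of map_lgl().

```r
library(purrr)

zipcode <- list("29079", "55422", "48834", NULL, NULL)

# is.null() looks at the list itself, which exists, so a single FALSE comes back
whole_object <- is.null(zipcode)

# Element-wise tests: purrr and a base R equivalent
missing_purrr <- map_lgl(zipcode, is.null)
missing_base  <- vapply(zipcode, is.null, logical(1))

# Positions that still need a retry
to_retry <- which(missing_purrr)
```

The logical vector is what you would pass as a predicate or index; the single FALSE from the first call is not usable for that.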


Yes, because you are passing zipcodes$zipcode, which refers to the original data frame with 5 rows, whereas after nesting the input to map_if has only 4 rows (the two NULLs got nested together). So .p and .x have different lengths, which is exactly what the error says. But I'm not sure why you're nesting in the first place; you might want to try something like:

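zipcodes %>% 
  mutate(zip2 = pmap_chr(list(locationlatitude, locationlongitude, zipcode), ~ {
    # if_else() is vectorized and evaluates both branches, so use plain if/else
    if (!is.null(..3)) return(..3)
    res <- find_zip_code(lat = ..1, lng = ..2, maxRows = 1)$result
    # assumes the result has a postalCode column; NA if the retry failed too
    if (is.null(res)) NA_character_ else as.character(res$postalCode[1])
  }))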
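Another option, sketched here under some assumptions, is to skip the filter-and-rejoin step and write the retried values straight back into the list with a logical index. The function fake_lookup() is a hypothetical stand-in for the real API call (it fails for one row on purpose and returns a fixed, made-up value otherwise) so that the example runs offline; with the real data you would call find_zip_code() instead and extract whatever field you need from $result.

```r
library(purrr)

# Hypothetical stand-in for the throttled API call (not part of geonames)
fake_lookup <- function(lat, lng) {
  if (lat > 40.7) stop("throttled")  # simulate a request that fails again
  "11232"                            # made-up fixed answer for the sketch
}
safe_lookup <- safely(fake_lookup)

lat     <- c(43.142, 45.015, 34.296, 40.714, 40.661)
lng     <- c(-85.049, -93.340, -80.113, -75.032, -74.012)
zipcode <- list("29079", "55422", "48834", NULL, NULL)

# Which entries are still missing?
missing <- map_lgl(zipcode, is.null)

# Retry only those rows; a failed call yields NULL again via safely()
retried <- map2(lat[missing], lng[missing], ~ safe_lookup(.x, .y)$result)

# Assigning a list with `[<-` keeps NULL elements in place instead of dropping them
zipcode[missing] <- retried
```

Because failed retries stay NULL, the same three lines can be run again later until nothing is missing. Inside a tibble the idea is the same, applied to the list column.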

Thanks, this is a really helpful start! Even with this insight, I am still getting an error:

retry_df <- zipcodes %>% 
  nest(coord = c(locationlatitude, locationlongitude, zipcode)) %>% 
  mutate(
    zip = map_if(coord, 
                 .p = map_lgl(zipcodes$zipcode, is.null), 
                 .f =  
                   ~find_zip_code(
                      lat = .x$locationlatitude, 
                      lng = .x$locationlongitude,
                 maxRows = 1), .else = .x$zipcode
                   )
  )
#> Error: Problem with `mutate()` input `zip`.
#> x length(.p) == length(.x) is not TRUE
#> ℹ Input `zip` is `map_if(...)`.

Created on 2020-11-23 by the reprex package (v0.3.0)
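The error message points at a length mismatch between .p and .x. A minimal sketch of that rule, on a toy list rather than the nested data frame (the names x, p, filled and mismatch are made up for illustration): when .p is a logical vector, map_if() needs it to have the same length as the thing being mapped over.

```r
library(purrr)

x <- list("a", NULL, "c")
p <- map_lgl(x, is.null)   # one TRUE/FALSE per element of x

# Lengths match: only the NULL element is passed to .f
filled <- map_if(x, p, ~ "retried")

# Lengths differ (2 elements vs. 3 flags): map_if() refuses to run
mismatch <- try(map_if(x[1:2], p, ~ "retried"), silent = TRUE)
inherits(mismatch, "try-error")
```

Nesting changes the number of rows that map_if() iterates over, so a predicate computed on the original zipcodes$zipcode no longer lines up with the nested coord column.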