given a country, get the regions that contain it - improving my parsing json code

Hi! I've solved this problem, but just can't believe there isn't an easier way. So I figured I'd throw it up to the community!

Problem statement: This json contains a list of regions, and the regions/countries they contain, in hierarchy. I need the second entry (geographic hierarchy). I would like to come up with a list of all the regions that apply to a given country. Here's what I tried.

First attempt, to use jq (json query)... but recursion flattens the list and loses the hirearchy. Here's the code:

llibrary(jqr)
library(tidyverse)
jq_try <- url("https://unstats.un.org/SDGAPI/v1/sdg/GeoArea/Tree") %>% 
  jq(".[1] | .children")
jq_try %>% jq(".[].geoAreaName")
#> [
#>     "Africa",
#>     "Americas",
#>     "Asia",
#>     "Antarctica",
#>     "Oceania",
#>     "Europe"
#> ]
jq_try %>% jq("..|.geoAreaName?") %>% as.list %>% enframe %>% unnest(value)
#> # A tibble: 601 x 2
#>     name value                               
#>    <int> <chr>                               
#>  1     1 "\"Africa\""                        
#>  2     2 "\"Sub-Saharan Africa\""            
#>  3     3 "\"Eastern Africa\""                
#>  4     4 "\"British Indian Ocean Territory\""
#>  5     5 "null"                              
#>  6     6 "\"Burundi\""                       
#>  7     7 "null"                              
#>  8     8 "\"Comoros\""                       
#>  9     9 "null"                              
#> 10    10 "\"Djibouti\""                      
#> # ... with 591 more rows

There might be a way to optimize that jq query, but I'm more interested in hearing this community's take on the second approach, using tidyverse and network analysis! What I did was more or less manually dig into the json, renaming fields every step to prevent name collisions. I also manually reconstruct the edge network because I couldn't find any method that digests a full dataframe and infers structure from that. This actually works, and I'll convert it into a function and use this if I can't get anything better:

library(tidyverse)
library(igraph)
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:dplyr':
#> 
#>     as_data_frame, groups, union
#> The following objects are masked from 'package:purrr':
#> 
#>     compose, simplify
#> The following object is masked from 'package:tidyr':
#> 
#>     crossing
#> The following object is masked from 'package:tibble':
#> 
#>     as_data_frame
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union

countries_raw <- jsonlite::fromJSON('https://unstats.un.org/SDGAPI/v1/sdg/GeoArea/Tree') %>% 
  ## keep only continental categories:
  slice(2) %>% 
  rename(geocode1 = geoAreaCode, geoName1 = geoAreaName, geoType1 = type) %>% unnest(children, keep_empty = TRUE) %>% 
  rename(geocode2 = geoAreaCode, geoName2 = geoAreaName, geoType2 = type) %>% unnest(children, keep_empty = TRUE) %>% 
  rename(geocode3 = geoAreaCode, geoName3 = geoAreaName, geoType3 = type) %>% unnest(children, keep_empty = TRUE) %>% 
  rename(geocode4 = geoAreaCode, geoName4 = geoAreaName, geoType4 = type) %>% unnest(children, keep_empty = TRUE) %>% 
  rename(geocode5 = geoAreaCode, geoName5 = geoAreaName, geoType5 = type) %>% unnest(children, keep_empty = TRUE) %>% 
  rename(geocode6 = geoAreaCode, geoName6 = geoAreaName, geoType6 = type) %>% unnest(children, keep_empty = TRUE)

## now create a structure that can track upwards.  First, get rid of first
## level, and codes/types, then group by edgepairs

df.g <- countries_raw %>% select(-contains(c("1", "code", "type")))

df.g <- bind_rows(df.g[,1:2] %>% set_names("from", "to"), 
          df.g[,2:3] %>% set_names("from", "to"),
          df.g[,3:4] %>% set_names("from", "to"),
          df.g[,4:5] %>% set_names("from", "to"),
          df.g[,5:6] %>% set_names("from", "to")) %>% 
  na.omit()
df.g <- df.g %>% igraph::graph_from_data_frame()

# library(visNetwork)
# visNetwork::visIgraph(df.g)

vertices <- bfs(graph = df.g, root = "Nigeria",neimode = "in", unreachable = FALSE)$order %>% as.numeric %>% discard(~is.na(.))
V(graph = df.g)[vertices]$name
#> [1] "Nigeria"            "Western Africa"     "Sub-Saharan Africa"
#> [4] "Africa"

I also thought I could benefit by using squash to flatten the whole dataframe out... something like this:

library(tidyverse)
library(httr)
countries <- GET("https://unstats.un.org/SDGAPI/v1/sdg/GeoArea/Tree")
countries <- content(countries)

rlang::squash(countries[[2]]) %>% enframe %>% 
  mutate(value = purrr::map(value, as.character)) %>% unnest(value) %>% 
  mutate(region = ifelse(value == "Region", lag(value, 1), NA),
         country = ifelse(value == "Country", lag(value, 1), NA)) %>% 
  select(region, country) %>% fill(region) %>% unique %>% 
  ## grab higher levels
  mutate(region2 = ifelse(is.na(lag(country, 1)) & is.na(lag(country, 2)), lag(region, 2), NA),
         region3 = ifelse(is.na(lag(country, 1)) & is.na(lag(country, 2)), lag(region, 3), NA),
         region4 = ifelse(is.na(lag(country, 1)) & is.na(lag(country, 2)) & is.na(lag(country, 3)), lag(region, 4), NA)) %>% 
  fill(region2, region3, region4) %>%
  filter(!is.na(country)) 
#> Warning: Outer names are only allowed for unnamed scalar atomic inputs
#> # A tibble: 283 x 5
#>    region         country                        region2         region3 region4
#>    <chr>          <chr>                          <chr>           <chr>   <chr>  
#>  1 Eastern Africa British Indian Ocean Territory Sub-Saharan Af~ Africa  World ~
#>  2 Eastern Africa Burundi                        Sub-Saharan Af~ Africa  World ~
#>  3 Eastern Africa Comoros                        Sub-Saharan Af~ Africa  World ~
#>  4 Eastern Africa Djibouti                       Sub-Saharan Af~ Africa  World ~
#>  5 Eastern Africa Eritrea                        Sub-Saharan Af~ Africa  World ~
#>  6 Eastern Africa Ethiopia                       Sub-Saharan Af~ Africa  World ~
#>  7 Eastern Africa French Southern Territories    Sub-Saharan Af~ Africa  World ~
#>  8 Eastern Africa Kenya                          Sub-Saharan Af~ Africa  World ~
#>  9 Eastern Africa Madagascar                     Sub-Saharan Af~ Africa  World ~
#> 10 Eastern Africa Malawi                         Sub-Saharan Af~ Africa  World ~
#> # ... with 273 more rows

This kinda works, but the whole inferring hierarchies based on NA is super faffy. The above logic is wrong for example. I could work on it and find something that works, but it feels brittle and wrong.

What am I missing?

This is an edit, as I posted a version, which involved jqr to pick the children of the first selection as per your first example. This edit is for the whole json, and uses jsonlite only

library(tidyverse) 
u1  <- url("https://unstats.un.org/SDGAPI/v1/sdg/GeoArea/Tree")

library(jsonlite)

t1 <- tibble(fromJSON(u1,flatten=TRUE))

pdf <- t1
while (!is.logical(pull(pdf,children))) {
  pdf <- unnest(pdf,
                cols="children"
                ,names_repair = "unique")
}

pdf %>% select(contains("Name"))

nice usage of while! And I love how you accumulated the unnests inside tthat loop... I think that's kind of what my intuition was telling me to do!

That being said... this approach doesn't work because the first time that you unnest something that doesn't have children, it drops that row. You can see that your approach returns 50ish rows, rather than 270ish. We can fix that problem simply enough by modifying your code to this:

while (!is.logical(pull(pdf,children))) {
  pdf <- unnest(pdf,
                cols="children"
                ,names_repair = "unique", keep_empty = TRUE)
}

(see the keep_empty?)

So that flattens everything, but the countries are dispersed in random columns and it's not clear to me how to account for that, and then reverse it so that given the country, you get its parent regions.

pdf %>% select(contains("Name")) %>% set_names(paste0("n",1:6)) %>% 
  mutate(collated_country=coalesce(n6,n5,n4,n3)) %>% print(n=1000)

Yup! Edited your query a bit to come out w/ a tidyformat but that was it. Nice usage of coalesce and then feeding the features in backwards! Nice! Thanks for the help! In case this comes up again, here's the final script I'm using.

(oh, I also took the liberty of filtering your script down to only the continental regions)

library(tidyverse) 
library(jsonlite)

u1  <- url("https://unstats.un.org/SDGAPI/v1/sdg/GeoArea/Tree")
t1 <- tibble(fromJSON(u1,flatten=TRUE)) %>% head(1)

pdf <- t1

while (!is.logical(pull(pdf,children))) {
  pdf <- unnest(pdf,
                cols="children"
                ,names_repair = "unique", keep_empty = TRUE)
}

pdf <- pdf %>% select(contains("Name")) %>% set_names(paste0("n",1:6)) %>% 
  mutate(collated_country=coalesce(n6,n5,n4,n3)) %>% 
  pivot_longer(cols = n1:n6) %>% select(-name) %>% 
  filter(collated_country != value, !is.na(value)) %>% 
  unique %>% arrange(collated_country, value)