How to map column nested list of unknown depth to new dataframe columns?

soedr · August 2, 2018, 10:05am

Hi there, I have an unwieldy dataframe in which one column is a nested list of unknown depth (the example I've included only goes two levels deep, but the real data sometimes go further).

I'd like to be able to map the key:value pairs from all levels in the nested list into columns, where each unique key is a new column. To further complicate things, not all entries have all keys, so I need to be able to introduce NAs.

library(tidyverse)
library(jsonlite)

t <- 'id|json\n
1|{"latitude":"40.757","locationDetails":{"addressDetails":"2 sesame street","contactName":"ernie","phone":"987654321"},"longitude":"-73.924"}\n
2|{"city":"new york city","latitude":"40.756","locationDetails":{"addressDetails":"1 sesame street","contactName":"bert","phone":"123456"},"longitude":"-73.925","neighborhood":"astoria"}'

df <- read_delim(
 t,
 delim = '|'
)

df$json_list <- purrr::map(df$json, jsonlite::fromJSON, simplifyDataFrame = T)

If I do df$json_df <- purrr::map_dfr(df$json_list, magrittr::extract, c('latitude', 'longitude')), I can extract values from the first level, but that only works on the first level where data is present. If I try to include a key with NA in some places, like city, a fatal error is produced and the session is killed. Similarly, if I try to include keys from the second layer via locationDetails or locationDetails.name, a fatal error is produced and the session is killed, so I assume some form of layering of map_* is needed, but haven't been able to figure it out. Please help!

The desired format:

df_desired_format <- data.frame(
 id = c(1,2),
 city = c(NA, 'new york city'),
 latitude = c(40.757,40.756),
 addressDetails = c('2 sesame street', '1 sesame street'),
 contactName = c('ernie', 'bert'),
 phone = c(987654321, 123456),
 longitude = c(-73.924,-73.925),
 neighbordhood = c(NA, 'astoria')
)

Thanks in advance!

soedr · August 5, 2018, 11:35am

Flattening the list as follows works (sort of...), but there has to be a better way to do it that doesn't involve so much hard coding of what to extract where.

flatten_list <- function(x) {
 # get number of elements in list
 list_length <- length(x)
 if (list_length < 4) {
  list_values <- list(
   city = NA,
   neighborhood = NA,
   latitude = x$latitude,
   longitude = x$longitude,
   name = x$locationDetails$contactName,
   address = x$locationDetails$addressDetails
  )
 } else {
  list_values <- list(
   city = x$city,
   neighborhood = x$neighborhood,
   latitude = x$latitude,
   longitude = x$longitude,
   name = x$locationDetails$contactName,
   address = x$locationDetails$addressDetails
  )
 }
 return(list_values)
}

df$flat_list <- lapply(df$json_list, flatten_list)
df$out <- purrr::map_dfr(df$flat_list, magrittr::extract, c('city', 'neighborhood', 'latitude', 'longitude', 'name', 'address'))

mara · August 5, 2018, 1:26pm

This is more a possibly useful resource than it is an actual answer (sorry), but the roomba package might be helpful:

alistaire · August 5, 2018, 3:13pm

Since it's not nested that far, you can use flatten_dfc to flatten each list element to a single level and coerce it to a data frame, which unnest can expand:

library(tidyverse)

df <- read_delim(
    'id|json\n
1|{"latitude":"40.757","locationDetails":{"addressDetails":"2 sesame street","contactName":"ernie","phone":"987654321"},"longitude":"-73.924"}\n
2|{"city":"new york city","latitude":"40.756","locationDetails":{"addressDetails":"1 sesame street","contactName":"bert","phone":"123456"},"longitude":"-73.925","neighborhood":"astoria"}',
    delim = '|'
)

df$json_list <- map(df$json, jsonlite::fromJSON)
 
df %>% 
    mutate(json_list = map(json_list, flatten_dfc)) %>% 
    unnest()
#> # A tibble: 2 x 9
#>      id json  latitude addressDetails contactName phone longitude city 
#>   <dbl> <chr> <chr>    <chr>          <chr>       <chr> <chr>     <chr>
#> 1     1 "{\"… 40.757   2 sesame stre… ernie       9876… -73.924   <NA> 
#> 2     2 "{\"… 40.756   1 sesame stre… bert        1234… -73.925   new …
#> # ... with 1 more variable: neighborhood <chr>

Fix types with type_convert, if you like.

soedr · August 6, 2018, 6:56am

Thank you @alistaire, that works! Per my understanding, flatten_dfc() only removes one level, is there a function that can flatten multiple (3+) layers?

soedr · August 6, 2018, 7:38am

Roomba has potential (the way it handles the twitter data example is great), but produces an empty tibble when applied to my data.

j_list <- purrr::map(df$json, jsonlite::fromJSON)
roomba(jlist, cols = c("latitude", "addressDetails"), keep = all)

mara · August 6, 2018, 10:33am

Bummer. Yeah, the JSON to df struggle is real (and it's hard to generalize tidiers for diff. responses)

alistaire · August 7, 2018, 3:05am

unlist, but that has a tendency to lose structure you want. Otherwise there isn't a single way to simplify, as structures can vary widely. roomba is a promising potential solution to that variation, but is still pretty new and so isn't totally refined yet.