Filter out empty elements from list col

dplyr
tidyr
tibble

#1

I have converted a JSON table into a data frame with nested columns. I want to unnest one of these columns, but I am getting an error when that nested column holds an empty list (but not when that nested column holds an empty data frame).

For example, given this data, I see three kinds of rows:

test_data <- structure(list(duration = c(2624L, 3374L, 901L), precincts = list(
    structure(list(), .Names = character(0), row.names = integer(0), class = "data.frame"), 
    structure(list(smartzoneName = c("Loc 1", "Loc 2", 
    "Loc 1", "Loc 2", "Loc 1"), duration = c(264L, 
    1329L, 968L, 474L, 359L), timestamp = c(1545522428000, 1545522692000, 
    1545524021000, 1545524989000, 1545525463000), smartzoneId = c(275L, 
    251L, 275L, 251L, 275L)), class = "data.frame", row.names = c(NA, 
    5L)), list())), row.names = c(NA, -3L), class = "data.frame")

test_data
# duration  precincts
# <int>     <list>
# 2624	<data.frame [0 × 0]>			
# 3374	<data.frame [5 × 4]>			
# 901	<list [0]>

I'd like to unnest based on precinct. But the empty list rows create an error:

test_data %>%
  slice(1:2) %>%
  unnest(precincts, .id = "id")
# Runs fine, dropping first row of original data

test_data %>%
  unnest(precincts, .id = "id")
# Error: Each column must either be a list of vectors or a list of data frames [precincts]

Is there a way to convert the empty lists to empty data tables, or filter out both kinds of rows (or at least the problematic ones) before the unnest? Is this expected behavior, or is this a bug with tidyr? (using 0.8.2)


#2

Well, this reprex is specific to the test data, but may help point you in the right direction if tidier data is not possible

library(purrr)
test_data <- structure(list(duration = c(2624L, 3374L, 901L), 
                            precincts = list(structure(list(),
                                                       .Names = character(0), 
                                                       row.names = integer(0),
                                                       class = "data.frame"),
                                             structure(list(smartzoneName = c("Loc 1", "Loc 2",
                                                                              "Loc 1", "Loc 2", "Loc 1"),
                                                            duration = c(264L, 1329L, 968L, 474L, 359L),
                                                            timestamp = c(1545522428000, 1545522692000,  
                                                                          1545524021000, 1545524989000, 1545525463000),
                                                            smartzoneId = c(275L, 251L, 275L, 251L, 275L)), 
                                                       class = "data.frame", 
                                                       row.names = c(NA, 5L)), list())),
                       row.names = c(NA, -3L), 
                       class = "data.frame")
test_data %>% pluck(2)[2]
#> [[1]]
#>   smartzoneName duration    timestamp smartzoneId
#> 1         Loc 1      264 1.545522e+12         275
#> 2         Loc 2     1329 1.545523e+12         251
#> 3         Loc 1      968 1.545524e+12         275
#> 4         Loc 2      474 1.545525e+12         251
#> 5         Loc 1      359 1.545525e+12         275

Created on 2019-01-10 by the reprex package (v0.2.1)


#3

Thank you. I should clarify, my challenge is around weeding out or working around the thousands of rows I have that are causing the unnest function to fail. Is there a way to distinguish in a filter the kind of data in my third row?


#4

I think, viable solution is to drop empty results (such as your third row in test_data) and then unnest will work without a problem:

library(magrittr)
test_data <- structure(list(duration = c(2624L, 3374L, 901L), precincts = list(
  structure(list(), .Names = character(0), row.names = integer(0), class = "data.frame"), 
  structure(list(smartzoneName = c("Loc 1", "Loc 2", 
                                   "Loc 1", "Loc 2", "Loc 1"), duration = c(264L, 
                                                                            1329L, 968L, 474L, 359L), timestamp = c(1545522428000, 1545522692000, 
                                                                                                                    1545524021000, 1545524989000, 1545525463000), smartzoneId = c(275L, 
                                                                                                                                                                                  251L, 275L, 251L, 275L)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                5L)), list())), row.names = c(NA, -3L), class = "data.frame")

test_data %>%
  dplyr::filter(purrr::map_lgl(precincts, ~!rlang::is_empty(.x))) %>%
  tidyr::unnest(precincts, .id = "id")
#>   duration id smartzoneName duration1    timestamp smartzoneId
#> 1     3374  1         Loc 1       264 1.545522e+12         275
#> 2     3374  1         Loc 2      1329 1.545523e+12         251
#> 3     3374  1         Loc 1       968 1.545524e+12         275
#> 4     3374  1         Loc 2       474 1.545525e+12         251
#> 5     3374  1         Loc 1       359 1.545525e+12         275

Created on 2019-01-11 by the reprex package (v0.2.1)