Filter out empty elements from list col

jonspring · January 10, 2019, 10:51pm

I have converted a JSON table into a data frame with nested columns. I want to unnest one of these columns, but I am getting an error when that nested column holds an empty list (but not when that nested column holds an empty data frame).

For example, given this data, I see three kinds of rows:

test_data <- structure(list(duration = c(2624L, 3374L, 901L), precincts = list(
    structure(list(), .Names = character(0), row.names = integer(0), class = "data.frame"), 
    structure(list(smartzoneName = c("Loc 1", "Loc 2", 
    "Loc 1", "Loc 2", "Loc 1"), duration = c(264L, 
    1329L, 968L, 474L, 359L), timestamp = c(1545522428000, 1545522692000, 
    1545524021000, 1545524989000, 1545525463000), smartzoneId = c(275L, 
    251L, 275L, 251L, 275L)), class = "data.frame", row.names = c(NA, 
    5L)), list())), row.names = c(NA, -3L), class = "data.frame")

test_data
# duration  precincts
# <int>     <list>
# 2624	<data.frame [0 × 0]>			
# 3374	<data.frame [5 × 4]>			
# 901	<list [0]>

I'd like to unnest based on precinct. But the empty list rows create an error:

test_data %>%
  slice(1:2) %>%
  unnest(precincts, .id = "id")
# Runs fine, dropping first row of original data

test_data %>%
  unnest(precincts, .id = "id")
# Error: Each column must either be a list of vectors or a list of data frames [precincts]

Is there a way to convert the empty lists to empty data tables, or filter out both kinds of rows (or at least the problematic ones) before the unnest? Is this expected behavior, or is this a bug with tidyr? (using 0.8.2)

technocrat · January 11, 2019, 12:25am

Well, this reprex is specific to the test data, but may help point you in the right direction if tidier data is not possible

library(purrr)
test_data <- structure(list(duration = c(2624L, 3374L, 901L), 
                            precincts = list(structure(list(),
                                                       .Names = character(0), 
                                                       row.names = integer(0),
                                                       class = "data.frame"),
                                             structure(list(smartzoneName = c("Loc 1", "Loc 2",
                                                                              "Loc 1", "Loc 2", "Loc 1"),
                                                            duration = c(264L, 1329L, 968L, 474L, 359L),
                                                            timestamp = c(1545522428000, 1545522692000,  
                                                                          1545524021000, 1545524989000, 1545525463000),
                                                            smartzoneId = c(275L, 251L, 275L, 251L, 275L)), 
                                                       class = "data.frame", 
                                                       row.names = c(NA, 5L)), list())),
                       row.names = c(NA, -3L), 
                       class = "data.frame")
test_data %>% pluck(2)[2]
#> [[1]]
#>   smartzoneName duration    timestamp smartzoneId
#> 1         Loc 1      264 1.545522e+12         275
#> 2         Loc 2     1329 1.545523e+12         251
#> 3         Loc 1      968 1.545524e+12         275
#> 4         Loc 2      474 1.545525e+12         251
#> 5         Loc 1      359 1.545525e+12         275

^{Created on 2019-01-10 by the reprex package (v0.2.1)}

jonspring · January 11, 2019, 3:22pm

Thank you. I should clarify, my challenge is around weeding out or working around the thousands of rows I have that are causing the unnest function to fail. Is there a way to distinguish in a filter the kind of data in my third row?

mishabalyasin · January 11, 2019, 3:47pm

I think, viable solution is to drop empty results (such as your third row in test_data) and then unnest will work without a problem:

library(magrittr)
test_data <- structure(list(duration = c(2624L, 3374L, 901L), precincts = list(
  structure(list(), .Names = character(0), row.names = integer(0), class = "data.frame"), 
  structure(list(smartzoneName = c("Loc 1", "Loc 2", 
                                   "Loc 1", "Loc 2", "Loc 1"), duration = c(264L, 
                                                                            1329L, 968L, 474L, 359L), timestamp = c(1545522428000, 1545522692000, 
                                                                                                                    1545524021000, 1545524989000, 1545525463000), smartzoneId = c(275L, 
                                                                                                                                                                                  251L, 275L, 251L, 275L)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                                                                                5L)), list())), row.names = c(NA, -3L), class = "data.frame")

test_data %>%
  dplyr::filter(purrr::map_lgl(precincts, ~!rlang::is_empty(.x))) %>%
  tidyr::unnest(precincts, .id = "id")
#>   duration id smartzoneName duration1    timestamp smartzoneId
#> 1     3374  1         Loc 1       264 1.545522e+12         275
#> 2     3374  1         Loc 2      1329 1.545523e+12         251
#> 3     3374  1         Loc 1       968 1.545524e+12         275
#> 4     3374  1         Loc 2       474 1.545525e+12         251
#> 5     3374  1         Loc 1       359 1.545525e+12         275

^{Created on 2019-01-11 by the reprex package (v0.2.1)}

jonspring · January 18, 2019, 3:47pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.