Why nested `purrr::map()` works differently in data.table vs tibble

Dobrokhotov1989 · June 23, 2022, 3:03am

Hi there,

Introduction

I have data with the coordinates of objects on multiple images. I want to count the neighbors in a specific area near each object (e.g. in a 30 px × 30 px box to the left of each object). To do this, I apply filter() to the coordinates relative to the object of interest. Given the size of my data, this takes far too long to be practical. However, I found a few posts stating that {data.table} filters much faster, so I tried to rewrite my code using {dtplyr} instead of {dplyr}.
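
To make the task concrete, here is a minimal base-R sketch of the per-object count. The data frame `objects`, the helper `count_left_neighbors()`, and the exact box geometry (30 px wide, vertically centred on the object) are assumptions for illustration, not my real data or code.

```r
# Hypothetical toy data: one row per object, with image id and pixel coordinates.
objects <- data.frame(
  image = c(1, 1, 1, 1, 2, 2),
  x     = c(100, 80, 75, 100, 50, 45),
  y     = c(100, 105, 90, 160, 50, 52)
)

# For each object, count the other objects in the same image that fall in a
# box of width `box` immediately to its left, vertically centred on it.
count_left_neighbors <- function(df, box = 30) {
  vapply(seq_len(nrow(df)), function(i) {
    same_image <- df$image == df$image[i]
    in_x <- df$x >= df$x[i] - box & df$x < df$x[i]  # strictly to the left
    in_y <- abs(df$y - df$y[i]) <= box / 2
    sum(same_image & in_x & in_y)
  }, integer(1))
}

objects$n_left <- count_left_neighbors(objects)
objects$n_left
```

Each row triggers one pass over the whole table, which is why the cost grows quickly with the number of objects.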

The problem

The problem is that with data.table, when I nest map() calls, the inner function sees variables defined in the "inner" map() and in the global environment, but not those defined in the "outer" map(). I don't have this problem with a tibble.
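
For reference, this is the behaviour I expected from R's ordinary lexical scoping: a function created inside another function can read the outer function's arguments. A minimal base-R sketch with hypothetical toy data (no data.table or dtplyr involved):

```r
# Hypothetical toy input: a named list of integer vectors.
outer_values <- list(a = 1:3, b = 4:6)

result <- lapply(outer_values, function(my_data) {
  # The inner function is created here, so its enclosing environment
  # is this call's frame, which contains `my_data`.
  vapply(my_data, function(x) {
    # Count the elements of the outer vector that are larger than x.
    sum(my_data > x)
  }, integer(1))
})

result$a
```

Here the inner function finds `my_data` without any extra work, which is what Example 3 below also shows with a tibble.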

Example

Example 1: Error: object 'my_data' not found

suppressWarnings({
  library(tidyverse)
  library(data.table, warn.conflicts = FALSE)
  library(tidyfast)
  library(dtplyr)
})

mpg %>%
  as.data.table() %>%
  dt_nest(manufacturer) %>%
  # Outer mutate to apply the inner function to each row of the list-column
  mutate(data = map(
    .x = data,
    .f = function(my_data){
      
      # This example makes no sense, but it conveys my intention:
      # count the rows that match a filtering condition
      # that depends on the values from the given row
      my_data %>%
        mutate(the_n = map2_dbl(
          .x = cty,
          .y = hwy,
          .f = function(x, y){
            
            my_data %>%
              filter(x + 1 > 15 & y - 2 > 25) %>%
              nrow()
            
          }))
    }))  %>% 
  # Series of steps to unnest the results
  as.data.table() %>% 
  mutate(data = map(.x = data,
                    .f = ~as.data.table(.x))) %>%
  as.data.table() %>%
  dt_unnest(col = data) %>%
  as_tibble()
#> Error in filter(., x + 1 > 15 & y - 2 > 25): object 'my_data' not found

Example 2: Works fine

If I avoid accessing my_data in the inner map():

suppressWarnings({
  library(tidyverse)
  library(data.table, warn.conflicts = FALSE)
  library(tidyfast)
  library(dtplyr)
})

mpg %>%
  as.data.table() %>%
  dt_nest(manufacturer) %>%
  # Outer mutate to apply the inner function to each row of the list-column
  mutate(data = map(.x = data,
                    .f = function(my_data){
                      
                      # Here I simply calculate the average of two values.
                      # This does not require any values from outside
                      # `map2_dbl()`
                      my_data %>%
                        mutate(the_n = map2_dbl(.x = cty,
                                                .y = hwy,
                                                .f = ~mean(c(.x, .y))))
                    }))  %>% 
  # Series of steps to unnest the results
  as.data.table() %>% 
  mutate(data = map(.x = data,
                    .f = ~as.data.table(.x))) %>%
  as.data.table() %>%
  dt_unnest(col = data) %>%
  as_tibble()
#> # A tibble: 234 x 12
#>    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
#>    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#>  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
#>  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
#>  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
#>  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
#>  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
#>  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
#>  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
#>  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
#>  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
#> 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
#> # ... with 224 more rows, and 1 more variable: the_n <dbl>

Example 3: Works fine with tibble.

Same logic as in example 1, but with tibble instead of data.table

suppressWarnings({
  library(tidyverse)
  library(data.table, warn.conflicts = FALSE)
  library(tidyfast)
  library(dtplyr)
})

mpg %>%
  group_by(manufacturer) %>%
  nest() %>%
  mutate(data = map(.x = data,
                    .f = function(my_data){
                      my_data %>%
                        mutate(the_n = map2_dbl(.x = cty,
                                          .y = hwy,
                                          .f = function(x, y){
                                            
                                            my_data %>%
                                              filter(x + 1 > 15 & y - 2 > 25) %>%
                                              nrow()
                                            
                                          }))
                    })) %>%
  unnest(cols = data) %>%
  ungroup()
#> # A tibble: 234 x 12
#>    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
#>    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
#>  1 audi         a4           1.8  1999     4 auto~ f        18    29 p     comp~
#>  2 audi         a4           1.8  1999     4 manu~ f        21    29 p     comp~
#>  3 audi         a4           2    2008     4 manu~ f        20    31 p     comp~
#>  4 audi         a4           2    2008     4 auto~ f        21    30 p     comp~
#>  5 audi         a4           2.8  1999     6 auto~ f        16    26 p     comp~
#>  6 audi         a4           2.8  1999     6 manu~ f        18    26 p     comp~
#>  7 audi         a4           3.1  2008     6 auto~ f        18    27 p     comp~
#>  8 audi         a4 quattro   1.8  1999     4 manu~ 4        18    26 p     comp~
#>  9 audi         a4 quattro   1.8  1999     4 auto~ 4        16    25 p     comp~
#> 10 audi         a4 quattro   2    2008     4 manu~ 4        20    28 p     comp~
#> # ... with 224 more rows, and 1 more variable: the_n <dbl>

Example 4: Checking which variables are available in the inner `map()`

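Surprisingly, exists() reports the variable c, explicitly defined in the outer map(), as accessible, but neither my_data nor my_data2. (Caveat: exists("c") is TRUE in any R session because it also finds the base function c(), so this check alone does not prove that my variable c is visible.)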

suppressWarnings({
  library(tidyverse)
  library(data.table, warn.conflicts = FALSE)
  library(tidyfast)
  library(dtplyr)
})

# variable in the global env
a <- 1

test <- mpg %>%
  as.data.table() %>%
  dt_nest(manufacturer) %>%
  # Outer mutate to apply the inner function to each row of the list-column
  mutate(data = map(
    .x = data,
    .f = function(my_data){
      
      c <- 2
      
      my_data2 <- my_data
      
      # This example makes no sense, but it conveys my intention:
      # count the rows that match a filtering condition
      # that depends on the values from the given row
      my_data %>%
        mutate(the_n = map2_dbl(
          .x = cty,
          .y = hwy,
          .f = function(x, y){

            # Check the existence of several variables
            print(c(
                # variable `a` defined in the global env
                "a" = exists("a"),
                # variable `b` not defined
                "b" = exists("b"),
                # variable `c` defined in the outer `map()`
                "c" = exists("c"),
                # my_data, which is implicitly defined by the outer `map()`
                "my_data" = exists("my_data"),
                # my_data2, which is explicitly defined by the outer `map()`
                "my_data2" = exists("my_data2")
                ))
            
            # Return a number just to keep the code valid
            return(1)
            
          }))
    }))  %>% 
  # Series of steps to unnest the results
  as.data.table() %>% 
  mutate(data = map(.x = data,
                    .f = ~as.data.table(.x))) %>%
  as.data.table() %>%
  dt_unnest(col = data) %>%
  as_tibble()
#>        a        b        c  my_data my_data2 
#>     TRUE    FALSE     TRUE    FALSE    FALSE 
#> TRUNCATED...
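
A note on reading the `c` column in that output: a plain `exists()` call matches objects of any mode, including functions. The base-R sketch below (helper names `check_without_c()` and `check_with_c()` are hypothetical) shows how the `mode` argument can separate a numeric variable from the built-in function, assuming there is no numeric object called `c` elsewhere in the session.

```r
# No variable named `c` is defined anywhere in this function.
check_without_c <- function() {
  base::c(
    any_mode = exists("c"),                   # matches the base function c()
    numeric  = exists("c", mode = "numeric")  # skips functions
  )
}

# Here a numeric `c` is defined in the enclosing function.
check_with_c <- function() {
  c <- 2
  inner <- function() exists("c", mode = "numeric")
  inner()
}

without_c <- check_without_c()
with_c <- check_with_c()
```

Using a name that cannot clash with an existing function (for example `my_marker` instead of `c`) would be another way to make the check unambiguous.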

Question

I suspect the issue is caused by lazy evaluation, but I can't work out where it arises or how to deal with it. Any suggestions?

michaelbgarcia · June 24, 2022, 3:42pm

Honestly, I am not sure why this is the case. I know this is a contrived scenario, but if it is representative of your actual data set, I don't think you need the nesting/unnesting steps. Using your Example 3 (the tibble version) as a baseline, the solution below should cut the run time considerably for larger datasets. FYI, I didn't understand the logic behind the "the_n" variable, but the result of this solution still matches yours.

library(tidyverse)
library(data.table, warn.conflicts = FALSE)
library(dtplyr)

base = mpg %>%
  group_by(manufacturer) %>%
  nest() %>%
  mutate(data = map(.x = data,
                    .f = function(my_data){
                      my_data %>%
                        mutate(the_n = map2_dbl(.x = cty,
                                                .y = hwy,
                                                .f = function(x, y){
                                                  
                                                  my_data %>%
                                                    filter(x + 1 > 15 & y - 2 > 25) %>%
                                                    nrow()
                                                  
                                                }))
                    })) %>%
  unnest(cols = data) %>%
  ungroup()

new = mpg %>%
  lazy_dt() %>%
  group_by(manufacturer) %>%
  mutate(flag = cty + 1 > 15 & hwy - 2 > 25,
         the_n = as.double(n() * flag)) %>%
  ungroup() %>%
  select(-flag) %>%
  as_tibble()

identical(new, base)
#> [1] TRUE

Created on 2022-06-24 by the reprex package (v2.0.1)
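
As a rough illustration of why the shortcut gives the same numbers: in the nested version the filter condition uses only the scalars `x` and `y`, so it keeps either every row of the group or none, and `nrow()` is therefore either the group size or 0. A base-R sketch with hypothetical toy data (the data frame `cars_toy` is made up, not `mpg`):

```r
# Hypothetical toy data, already ordered by manufacturer.
cars_toy <- data.frame(
  manufacturer = c("a", "a", "a", "b", "b"),
  cty = c(18, 14, 21, 13, 16),
  hwy = c(29, 30, 27, 31, 28)
)

# Row-wise version: for each row, "filter" its whole group with a
# condition that is a single TRUE/FALSE, then count the surviving rows.
row_wise <- unlist(lapply(split(cars_toy, cars_toy$manufacturer), function(g) {
  vapply(seq_len(nrow(g)), function(i) {
    keep <- g$cty[i] + 1 > 15 & g$hwy[i] - 2 > 25
    nrow(g[rep(keep, nrow(g)), , drop = FALSE])
  }, integer(1))
}), use.names = FALSE)

# Vectorised version: group size multiplied by the per-row flag.
flag <- cars_toy$cty + 1 > 15 & cars_toy$hwy - 2 > 25
group_size <- ave(seq_along(flag), cars_toy$manufacturer, FUN = length)
vectorised <- group_size * flag

all(row_wise == vectorised)
```

If the real condition compares each object with the coordinates of the other rows, this shortcut no longer applies and a different formulation is needed.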
