Using list columns in data.table

I've been trying out list columns in data.table, just to see whether it's possible, purely out of curiosity.

library(tidyverse)
library(broom)
library(speedglm)
library(DescTools)
library(data.table)
library(magrittr)
library(purrr)
data(diamonds)

Let's try a very basic example, storing some plots and basic calculations in list columns the tidy way:

dmd_df <- diamonds %>%
    group_by(color) %>%
    nest() %>%
    mutate(
        model = map(data,
                    ~ glance(speedlm(carat ~ cut + depth, data = .x))),
        rsquare = map_dbl(model, "r.squared"),
        graph = map(data,
                    ~ ggplot(data = .x, aes(cut, clarity, fill = price)) +
                        geom_tile()),
        description = map(data,
                          ~ DescTools::Desc(.x$depth, plotit = FALSE))
    )

dmd_df
# A tibble: 7 x 6
  color data                  model             rsquare graph    description
  <ord> <list>                <list>              <dbl> <list>   <list>     
1 E     <tibble [9,797 x 9]>  <tibble [1 x 10]>  0.0375 <S3: gg> <S3: Desc> 
2 I     <tibble [5,422 x 9]>  <tibble [1 x 10]>  0.0317 <S3: gg> <S3: Desc> 
3 J     <tibble [2,808 x 9]>  <tibble [1 x 10]>  0.0284 <S3: gg> <S3: Desc> 
4 H     <tibble [8,304 x 9]>  <tibble [1 x 10]>  0.0416 <S3: gg> <S3: Desc> 
5 F     <tibble [9,542 x 9]>  <tibble [1 x 10]>  0.0380 <S3: gg> <S3: Desc> 
6 G     <tibble [11,292 x 9]> <tibble [1 x 10]>  0.0294 <S3: gg> <S3: Desc> 
7 D     <tibble [6,775 x 9]>  <tibble [1 x 10]>  0.0568 <S3: gg> <S3: Desc> 

It's concise and very easy to read.
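The stored objects come back out with ordinary list indexing, e.g.:

dmd_df$graph[[1]]        # draws the stored heatmap for the first group (color E)
dmd_df$description[[1]]  # prints the stored Desc() summary for that group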

The data.table way:

dmd_dt <- diamonds %>%
    setDT() %>%
    .[, {
        # split once by color and store the pieces in a list column
        splits <- split(.SD, .SD$color)
        list(name = names(splits), data = splits)
    }]



dmd_dt[, ':='(model = map(data,
                          ~ glance(speedlm(carat ~ cut + depth, data = .x))))]

dmd_dt[, ':='(
    rsquare = map_dbl(model, "r.squared"),
    graph = map(data,
                ~ ggplot(data = .x, aes(cut, clarity, fill = price)) +
                    geom_tile()),
    description = map(data,
                      ~ DescTools::Desc(.x$depth, plotit = FALSE))
)]


dmd_dt
   name         data    model    rsquare graph description
1:    D <data.table> <tbl_df> 0.05677860  <gg>      <Desc>
2:    E <data.table> <tbl_df> 0.03748835  <gg>      <Desc>
3:    F <data.table> <tbl_df> 0.03802057  <gg>      <Desc>
4:    G <data.table> <tbl_df> 0.02937746  <gg>      <Desc>
5:    H <data.table> <tbl_df> 0.04162114  <gg>      <Desc>
6:    I <data.table> <tbl_df> 0.03168139  <gg>      <Desc>
7:    J <data.table> <tbl_df> 0.02839815  <gg>      <Desc>
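Note the rows come out in a different order than the tibble version: split() orders the groups by the levels of the color factor, while nest() keeps groups in the order they first appear. A quick check:

levels(diamonds$color)
# [1] "D" "E" "F" "G" "H" "I" "J"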

Let's benchmark both methods.
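Something like this reproduces the setup (a sketch: microbenchmark() is assumed, and tidy_way()/dt_way() are hypothetical wrappers around the two pipelines above, trimmed to the model step):

library(microbenchmark)

# hypothetical wrappers around the two pipelines above
# (graph and description steps omitted for brevity)
tidy_way <- function() {
    diamonds %>%
        group_by(color) %>%
        nest() %>%
        mutate(model = map(data,
                           ~ glance(speedlm(carat ~ cut + depth, data = .x))),
               rsquare = map_dbl(model, "r.squared"))
}

dt_way <- function() {
    dt <- as.data.table(diamonds)[, {
        splits <- split(.SD, .SD$color)
        list(name = names(splits), data = splits)
    }]
    dt[, model := map(data, ~ glance(speedlm(carat ~ cut + depth, data = .x)))]
    dt[, rsquare := map_dbl(model, "r.squared")]
    dt
}

microbenchmark(data.table = dt_way(), tidyverse = tidy_way(), times = 100)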

The data.table version:

    min       lq     mean   median       uq      max neval
 101.0824 108.6998 122.3211 111.4669 114.8225 388.5292   100

The tidyverse version:


     min      lq     mean   median       uq      max neval
 93.98552 109.089 141.5854 115.5025 125.7747 1646.156   100

Does anybody have a better way to write the data.table code for this example?

library(tidyverse)
library(data.table)

glance
# Error: object 'glance' not found

speedlm
# Error: object 'speedlm' not found

I think you mean glance from the broom package, but please edit your post to show where everything's from.


There are some missing dependencies: magrittr, broom, speedglm, and ggplot2 in your first reprex, and DescTools in the second.

Also the call on the fifth line of the first example should be:

                       ~ glance(speedglm::speedlm(

In the second, the same change should be made on line 14, and line 20 should be

dmd_dt[,':='(model = purrr::map(data,

and conforming changes should be made on lines 26, 28 and 35.

You need ggplot2 to load diamonds:

library(data.table)
library(broom)
library(DescTools)
library(magrittr)
data(diamonds, package = "ggplot2")

With these corrections, I get to

> summary(rlang::last_error())
<error>
message: Columns `r.squared`, `adj.r.squared`, `statistic`, `p.value`, `df` must be 1d atomic vectors or lists
class:   `rlang_error`
fields:  `message`, `trace` and `parent`
backtrace:
█
├─...[]
└─data.table:::`[.data.table`(...)
  └─base::eval(jsub, SDenv, parent.frame())
    └─base::eval(jsub, SDenv, parent.frame())
      └─purrr::map(...)
        └─global::.f(.x[[i]], ...)
          ├─generics::glance(speedglm::speedglm(carat ~ cut + depth, data = .x))
          └─broom:::glance.speedlm(...)
            └─tibble::tibble(...)
              ├─tibble::as_tibble(lst_quos(xs, expand = TRUE))
              └─tibble:::as_tibble.list(lst_quos(xs, expand = TRUE))
                └─tibble:::list_to_tibble(x, validate)
                  └─tibble:::check_tibble(x)
                    └─tibble:::invalid_df(...)
                      └─tibble:::stopc(...)
> 

It may be possible to substitute library(tidyverse) for the separate calls to purrr, ggplot2, and magrittr, but I still don't think we have a reprex.


My version of the data.table way:

dmd_dt <- as.data.table(diamonds)[
  ,
  list(
    data  = list(.SD),
    model = list(
      lm(carat ~ cut + depth, data = .SD)
    ),
    graph = list(
      ggplot(.SD, aes(cut, clarity, fill = price)) + geom_tile()
    )
  ),
  by = color
][
  ,
  model_summary := lapply(model, summary)
][
  ,
  rsquare := vapply(model_summary, `[[`, numeric(1), "r.squared")
]

dmd_dt
#    color         data model graph model_summary    rsquare
# 1:     E <data.table>  <lm>  <gg>  <summary.lm> 0.03748835
# 2:     I <data.table>  <lm>  <gg>  <summary.lm> 0.03168139
# 3:     J <data.table>  <lm>  <gg>  <summary.lm> 0.02839815
# 4:     H <data.table>  <lm>  <gg>  <summary.lm> 0.04162114
# 5:     F <data.table>  <lm>  <gg>  <summary.lm> 0.03802057
# 6:     G <data.table>  <lm>  <gg>  <summary.lm> 0.02937746
# 7:     D <data.table>  <lm>  <gg>  <summary.lm> 0.05677860

dmd_dt[["data"]][[1]]
#       carat       cut clarity depth table price    x    y    z
#    1:  0.23     Ideal     SI2  61.5    55   326 3.95 3.98 2.43
#    2:  0.21   Premium     SI1  59.8    61   326 3.89 3.84 2.31
#    3:  0.23      Good     VS1  56.9    65   327 4.05 4.07 2.31
#    4:  0.22      Fair     VS2  65.1    61   337 3.87 3.78 2.49
#    5:  0.20   Premium     SI2  60.2    62   345 3.79 3.75 2.27
#   ---                                                         
# 9793:  0.71     Ideal     SI1  61.9    56  2756 5.71 5.73 3.54
# 9794:  0.79   Premium     SI2  61.4    58  2756 6.03 5.96 3.68
# 9795:  0.71   Premium     SI1  60.5    55  2756 5.79 5.74 3.49
# 9796:  0.70 Very Good     VS2  60.5    59  2757 5.71 5.76 3.47
# 9797:  0.70 Very Good     VS2  61.2    59  2757 5.69 5.72 3.49
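The other list columns work the same way; for example, to draw the stored plot for one color:

dmd_dt[color == "D", graph[[1]]]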

It doesn't use the speedglm package, but the R-squared values should be the same since lm() fits the same model. It also doesn't include the DescTools part, but it already shows how list columns are created and used in data.table.

Edit: added the data column. I'll try to write equivalent dplyr code for benchmarking later.


@technocrat

I'm really sorry I didn't realize it earlier.

I've updated the code. Sorry for the trouble.

@nwerth

dmd_dt
   color model graph model_summary    rsquare
1:     E  <lm>  <gg>  <summary.lm> 0.03748835
2:     I  <lm>  <gg>  <summary.lm> 0.03168139
3:     J  <lm>  <gg>  <summary.lm> 0.02839815
4:     H  <lm>  <gg>  <summary.lm> 0.04162114
5:     F  <lm>  <gg>  <summary.lm> 0.03802057
6:     G  <lm>  <gg>  <summary.lm> 0.02937746
7:     D  <lm>  <gg>  <summary.lm> 0.05677860

It's a wonderful way to write the data.table code, but your answer doesn't contain the data column; it covers the lm model only. Say down the road you want to run some other analysis. Please add at least the data column to make it complete, and benchmarks too if possible. :grinning::grinning::grinning:

I edited my previous answer to save people the trouble of combining changes.

Does the function from the DescTools package cause a bottleneck that might behave differently between dplyr and data.table? If not, do you mind if the benchmarking doesn't use it? I don't usually install packages when helping people unless necessary. My R library already has 394 packages, and it's getting hard to keep a vague idea of the important ones' features.


Note: The benchmarking below doesn't register positive execution times for ~75% of runs. If this is because I'm doing it wrong, please correct me!

tl;dr: No real difference. Toy examples are poor choices for comparing efficient packages that focus on different use cases.

Equivalent code for dplyr and data.table:

library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(microbenchmark)

data("diamonds", package = "ggplot2")
diamond_dt <- as.data.table(diamonds)

dplyr_process <- expression({
  result <- diamonds %>%
    group_by(color) %>%
    nest() %>%
    mutate(
      model = lapply(
        X       = data,
        FUN     = lm,
        formula = carat ~ cut + depth
      ),
      graph = lapply(
        X   = data,
        FUN = function(d) {
          ggplot(d, aes(cut, clarity, fill = price)) + geom_tile()
        }
      ),
      rsquare = vapply(
        X         = model,
        FUN       = function(m) summary(m)[["r.squared"]],
        FUN.VALUE = numeric(1)
      )
    )
})

dt_process <- expression({
  result <- diamond_dt[
    ,
    list(
      data  = list(.SD),
      model = list(lm(carat ~ cut + depth, data = .SD)),
      graph = list(
        ggplot(.SD, aes(cut, clarity, fill = price)) + geom_tile()
      )
    ),
    by = color
  ][
    ,
    rsquare := vapply(
      X         = model,
      FUN       = function(m) summary(m)[["r.squared"]],
      FUN.VALUE = numeric(1)
    )
  ]
})

Benchmarking the run times:

microbenchmark(
  dplyr      = dplyr_process,
  data.table = dt_process
)
# Unit: nanoseconds
#        expr min lq  mean median uq  max neval
#       dplyr   0  0  0.15      0  0    1   100
#  data.table   0  0 21.33      0  0 2117   100
# Warning message:
# In microbenchmark(dplyr = dplyr_process, data.table = dt_process) :
#   Could not measure a positive execution time for 155 evaluations.

microbenchmark(
  dplyr      = dplyr_process,
  data.table = dt_process
)
# Unit: nanoseconds
#        expr min lq  mean median uq  max neval
#       dplyr   0  0 14.23      0  0 1411   100
#  data.table   0  0  3.64      0  0  354   100
# Warning message:
# In microbenchmark(dplyr = dplyr_process, data.table = dt_process) :
#   Could not measure a positive execution time for 163 evaluations.
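One plausible explanation for the unmeasurable runs: microbenchmark() captures its arguments unevaluated, so dplyr = dplyr_process times only the lookup of the stored expression object, not the work inside it. Wrapping the expressions in eval() should force the pipelines to actually run:

microbenchmark(
  dplyr      = eval(dplyr_process),
  data.table = eval(dt_process)
)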

In both cases, the execution time for over 3/4 of the runs could not be measured. The times were too short. Even if this is more a "problem" with my laptop and operating system, I'd say nanosecond-level differences are likely unimportant.

The max times are influenced more by the other processes I have running, and by extension so are the means. The quartile measures are more reliable.

If your data is often the size of diamonds (53,940 rows), then any problems in run time are unlikely to be affected by whether you use dplyr or data.table. But we can simulate much larger datasets by making a bigger version of diamonds.

set.seed(100)
diamonds <- diamonds %>%
  lapply(FUN = rep, times = 100) %>%
  as_data_frame() %>%
  mutate_all(sample, replace = TRUE)

diamond_dt <- as.data.table(diamonds)

microbenchmark(
  dplyr      = dplyr_process,
  data.table = dt_process
)
# Unit: nanoseconds
#        expr min lq  mean median uq  max neval
#       dplyr   0  0 31.86      0  0 2116   100
#  data.table   0  0 21.34      0  0 2117   100
# Warning message:
# In microbenchmark(dplyr = dplyr_process, data.table = dt_process) :
#   Could not measure a positive execution time for 151 evaluations.

With 5,394,000 rows, the 1st, 2nd, and 3rd quartiles are still 0. Unless you're doing a lot of munging, I don't think it matters what you use. And even if you do a lot of munging, the best choice is whatever's easiest to reason about and maintain.

Unless you hit RAM limits for copying data. Then try data.table.

Session info:

sessionInfo()
# R version 3.5.1 (2018-07-02)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
# [1] LC_COLLATE=English_United States.1252 
# [2] LC_CTYPE=English_United States.1252   
# [3] LC_MONETARY=English_United States.1252
# [4] LC_NUMERIC=C                          
# [5] LC_TIME=English_United States.1252    
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods  
# [7] base     
# 
# other attached packages:
# [1] bindrcpp_0.2.2       microbenchmark_1.4-4 ggplot2_3.0.0       
# [4] tidyr_0.8.1          dplyr_0.7.6          data.table_1.11.4   
# 
# loaded via a namespace (and not attached):
#  [1] Rcpp_0.12.17     bindr_0.1.1      magrittr_1.5     tidyselect_0.2.4
#  [5] munsell_0.4.3    colorspace_1.3-2 R6_2.2.2         rlang_0.2.0     
#  [9] plyr_1.8.4       tools_3.5.1      grid_3.5.1       gtable_0.2.0    
# [13] withr_2.1.2      yaml_2.1.19      lazyeval_0.2.1   assertthat_0.2.0
# [17] tibble_1.4.2     purrr_0.2.5      glue_1.2.0       compiler_3.5.1  
# [21] pillar_1.2.2     scales_0.5.0     pkgconfig_2.0.1

Awesome post. Great insight.

I was just curious about the syntax; it turns out it does impact speed.

The best part about data.table is not just speed but memory management as well.
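A minimal sketch of that memory point, using data.table's address() helper: := adds a column in place rather than copying the table, whereas mutate() returns a new object each call.

library(data.table)

dt <- data.table(x = runif(1e6))
address(dt)        # memory address before the update

dt[, y := x * 2]   # := modifies dt by reference
address(dt)        # same address afterwards: no copy was made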

Thanks for such a detailed comparison.
