case_when() alternative for speed?

luisferlante · March 30, 2020, 7:52pm

Hi there!

I was looking for a way to aggregate my data for easy interpretation in a table format for possible reporting. The below code works fine, what I am wondering is if it is efficient. My dataset has 50k obs and 149 vars, so when I ran this code on my data to classify everything, it took 816 seconds.

Would there be a more efficient way? It seems (not sure) but for each case_when() it goes through the entire dataset each time. I was thinking that a string evaluation which handles all cases on one single pass, but I'm unsure how to go about it and if it really is more efficient. Maybe this is the time it takes simply because of the sheer number of observations.

# Sample data:
dx <-  tibble(
  `Job Operator` = c("Wayne Martin", "Emil Manuel", "Joanne Lano", "Tony Solis", "Keri Cornwell", "Keri Cornwell", "Ester Navoa", "Ester Navoa", "Emil Manuel"),
  `Job.Recog.Date` = c( "2017-04-23",  "2017-04-17",  "2018-04-04",  "2018-03-25",  "2019-03-25",  "2019-03-16",  "2017-03-17",  "2019-03-17",  "2018-04-15"),
  `Job Profit` = c(425.00, 444.41, 594.80, 649.30, 488.10, 432.20, 696.50, 378.25,  856.00)
)

# Using case_when() to classify which quarter of the year:
ptm <- proc.time()
dx %>% 
  group_by( `Job.Recog.Date`, `Job Operator`) %>% 
  mutate( Quarters = case_when(
    `Job.Recog.Date`< ymd(20170331) ~ "2017.Q1",
    `Job.Recog.Date`> ymd(20170331) & `Job.Recog.Date`< ymd(20170630)  ~ "2017.Q2",
    `Job.Recog.Date`> ymd(20170630) & `Job.Recog.Date`< ymd(20170930)  ~ "2017.Q3",
    `Job.Recog.Date`> ymd(20170930) & `Job.Recog.Date`< ymd(20171231)  ~ "2017.Q4",
    
    `Job.Recog.Date`> ymd(20170131) & `Job.Recog.Date`< ymd(20180331)  ~ "2018.Q1",
    `Job.Recog.Date`> ymd(20180331) & `Job.Recog.Date`< ymd(20180630)  ~ "2018.Q2",
    `Job.Recog.Date`> ymd(20180630) & `Job.Recog.Date`< ymd(20180930)  ~ "2018.Q3",
    `Job.Recog.Date`> ymd(20180930) & `Job.Recog.Date`< ymd(20181231)  ~ "2018.Q4",
    
    `Job.Recog.Date`> ymd(20180131) & `Job.Recog.Date`< ymd(20190331)  ~ "2019.Q1",
    `Job.Recog.Date`> ymd(20190331) & `Job.Recog.Date`< ymd(20190630)  ~ "2019.Q2",
    `Job.Recog.Date`> ymd(20190630) & `Job.Recog.Date`< ymd(20190930)  ~ "2019.Q3",
    `Job.Recog.Date`> ymd(20190930) & `Job.Recog.Date`< ymd(20191231)  ~ "2019.Q4",
    
    `Job.Recog.Date`> ymd(20190131) & `Job.Recog.Date`< ymd(20200331)  ~ "2020.Q1"

  )) 
ptm <- proc.time()

This came up because it was the first time I had to wait so much to do anything with this dataset. Maybe it is the nature of the operation, hence the post. All other manipulations were pretty much instantaneous.

Thank you for your time.
Best regards,
LF.

StatSteph · March 30, 2020, 7:57pm

The lubridate package can help you a lot with this problem. I'm not sure it's much faster but a lot less typing.

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
dx <-  tibble(
  `Job Operator` = c("Wayne Martin", "Emil Manuel", "Joanne Lano", "Tony Solis", "Keri Cornwell", "Keri Cornwell", "Ester Navoa", "Ester Navoa", "Emil Manuel"),
  `Job.Recog.Date` = c( "2017-04-23",  "2017-04-17",  "2018-04-04",  "2018-03-25",  "2019-03-25",  "2019-03-16",  "2017-03-17",  "2019-03-17",  "2018-04-15"),
  `Job Profit` = c(425.00, 444.41, 594.80, 649.30, 488.10, 432.20, 696.50, 378.25,  856.00)
)


ptm <- proc.time()
dx %>% 
  group_by( `Job.Recog.Date`, `Job Operator`) %>% 
  mutate( Quarters = case_when(
    `Job.Recog.Date`< ymd(20170331) ~ "2017.Q1",
    `Job.Recog.Date`> ymd(20170331) & `Job.Recog.Date`< ymd(20170630)  ~ "2017.Q2",
    `Job.Recog.Date`> ymd(20170630) & `Job.Recog.Date`< ymd(20170930)  ~ "2017.Q3",
    `Job.Recog.Date`> ymd(20170930) & `Job.Recog.Date`< ymd(20171231)  ~ "2017.Q4",
    
    `Job.Recog.Date`> ymd(20170131) & `Job.Recog.Date`< ymd(20180331)  ~ "2018.Q1",
    `Job.Recog.Date`> ymd(20180331) & `Job.Recog.Date`< ymd(20180630)  ~ "2018.Q2",
    `Job.Recog.Date`> ymd(20180630) & `Job.Recog.Date`< ymd(20180930)  ~ "2018.Q3",
    `Job.Recog.Date`> ymd(20180930) & `Job.Recog.Date`< ymd(20181231)  ~ "2018.Q4",
    
    `Job.Recog.Date`> ymd(20180131) & `Job.Recog.Date`< ymd(20190331)  ~ "2019.Q1",
    `Job.Recog.Date`> ymd(20190331) & `Job.Recog.Date`< ymd(20190630)  ~ "2019.Q2",
    `Job.Recog.Date`> ymd(20190630) & `Job.Recog.Date`< ymd(20190930)  ~ "2019.Q3",
    `Job.Recog.Date`> ymd(20190930) & `Job.Recog.Date`< ymd(20191231)  ~ "2019.Q4",
    
    `Job.Recog.Date`> ymd(20190131) & `Job.Recog.Date`< ymd(20200331)  ~ "2020.Q1"
    
  )) 
#> # A tibble: 9 x 4
#> # Groups:   Job.Recog.Date, Job Operator [9]
#>   `Job Operator` Job.Recog.Date `Job Profit` Quarters
#>   <chr>          <chr>                 <dbl> <chr>   
#> 1 Wayne Martin   2017-04-23             425  2017.Q2 
#> 2 Emil Manuel    2017-04-17             444. 2017.Q2 
#> 3 Joanne Lano    2018-04-04             595. 2018.Q2 
#> 4 Tony Solis     2018-03-25             649. 2018.Q1 
#> 5 Keri Cornwell  2019-03-25             488. 2019.Q1 
#> 6 Keri Cornwell  2019-03-16             432. 2019.Q1 
#> 7 Ester Navoa    2017-03-17             696. 2017.Q1 
#> 8 Ester Navoa    2019-03-17             378. 2019.Q1 
#> 9 Emil Manuel    2018-04-15             856  2018.Q2
Time1 <- ptm <- proc.time()

ptm <- proc.time()
dx %>%
  mutate(Quarters=as.character(quarter(Job.Recog.Date, with_year=TRUE)))
#> # A tibble: 9 x 4
#>   `Job Operator` Job.Recog.Date `Job Profit` Quarters
#>   <chr>          <chr>                 <dbl> <chr>   
#> 1 Wayne Martin   2017-04-23             425  2017.2  
#> 2 Emil Manuel    2017-04-17             444. 2017.2  
#> 3 Joanne Lano    2018-04-04             595. 2018.2  
#> 4 Tony Solis     2018-03-25             649. 2018.1  
#> 5 Keri Cornwell  2019-03-25             488. 2019.1  
#> 6 Keri Cornwell  2019-03-16             432. 2019.1  
#> 7 Ester Navoa    2017-03-17             696. 2017.1  
#> 8 Ester Navoa    2019-03-17             378. 2019.1  
#> 9 Emil Manuel    2018-04-15             856  2018.2

Time2 <- ptm <- proc.time()

Time1
#>    user  system elapsed 
#>    2.96    1.31    5.21
Time2
#>    user  system elapsed 
#>    2.98    1.31    5.25

^{Created on 2020-03-30 by the reprex package (v0.3.0)}

ismirsehregal · March 30, 2020, 8:07pm

In the data.table package the function fcase recently became available in the dev version.
Check point 6. here:

github.com

Rdatatable/data.table/blob/master/NEWS.md

**If you are viewing this file on CRAN, please check [latest news on GitHub](https://github.com/Rdatatable/data.table/blob/master/NEWS.md) where the formatting is also better.**

**Benchmarks are regularly updated: [here](https://h2oai.github.io/db-benchmark/)**

# data.table [v1.12.9](https://github.com/Rdatatable/data.table/milestone/19)  (in development)

## NEW FEATURES

1. `%chin%` and `chmatch(x, table)` are faster when `x` is length 1, `table` is long, and `x` occurs near the start of `table`. Thanks to Michael Chirico for the suggestion, [#4117](https://github.com/Rdatatable/data.table/pull/4117#discussion_r358378409).

2. The C function `CsubsetDT` is now exported for use by other packages, [#3751](https://github.com/Rdatatable/data.table/issues/3751). Thanks to Leonardo Silvestri for the request and the PR. This uses R's `R_RegisterCCallable` and `R_GetCCallable` mechanism, [R-exts§5.4.3](https://cran.r-project.org/doc/manuals/r-devel/R-exts.html#Linking-to-native-routines-in-other-packages) and [`?cdt`](https://rdatatable.gitlab.io/data.table/reference/cdt.html).

3. `print` method for `data.table`s gains `trunc.cols` argument (and corresponding option `datatable.print.trunc.cols`, default `FALSE`), [#1497](https://github.com/Rdatatable/data.table/issues/1497), part of [#1523](https://github.com/Rdatatable/data.table/issues/1523). This prints only as many columns as fit in the console without wrapping to new lines (e.g., the first 5 of 80 columns) and a message that states the count and names of the variables not shown. When `class=TRUE` the message also contains the classes of the variables. `data.table` has always automatically truncated _rows_ of a table for efficiency (e.g. printing 10 rows instead of 10 million); in the future, we may do the same for _columns_ (e.g., 10 columns instead of 20,000) by changing the default for this argument. Thanks to @nverno for the initial suggestion and to @TysonStanley for the PR.

4. `setnames(DT, new=new_names)` (i.e. explicitly named `new=` argument) now works as expected rather than an error message requesting that `old=` be supplied too, [#4041](https://github.com/Rdatatable/data.table/issues/4041). Thanks @Kodiologist for the suggestion.

5. `nafill` and `setnafill` gain `nan` argument to say whether `NaN` should be considered the same as `NA` for filling purposes, [#4020](https://github.com/Rdatatable/data.table/issues/4020). Prior versions had an implicit value of `nan=NaN`; the default is now `nan=NA`, i.e., `NaN` is treated as if it's missing. Thanks @AnonymousBoba for the suggestion. Also, while `nafill` still respects `getOption('datatable.verbose')`, the `verbose` argument has been removed.

6. New function `fcase(...,default)` implemented in C by Morgan Jacob, [#3823](https://github.com/Rdatatable/data.table/issues/3823), is inspired by SQL `CASE WHEN` which is a common tool in SQL for e.g. building labels or cutting age groups based on conditions. `fcase` is comparable to R function `dplyr::case_when` however it evaluates its arguments in a lazy way (i.e. only when needed) as shown below. Please see `?fcase` for more details.

This file has been truncated. show original

They usually outperform all other comparable packages. You might want to check if it's applicable to your usecase.

luisferlante · March 30, 2020, 8:17pm

Hi and thank you for your response!

The speed increased drastically!

> ptm <- proc.time()
> df_all_Qstamp <-  df_all %>% 
+   group_by( `Job.Recog.Date`, `Job Operator`) %>% 
+   mutate(Quarters = as.character(quarter(Job.Recog.Date, with_year=TRUE)))
> proc.time() - ptm   
   user  system elapsed 
   1.25    0.03    1.28

From 800ish seconds to 2 basically!

Thank you!

luisferlante · March 30, 2020, 8:17pm

Thanks for the reply,
I'll take a look!

martin.R · March 30, 2020, 10:12pm

Conversions from characters to dates tend to be quite slow, so these should be avoided if possible.

I don't think case_when() will cause any significant slowdown.

system · April 6, 2020, 10:13pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.