case_when() alternative for speed?

Hi there!

I was looking for a way to aggregate my data for easy interpretation in a table format for possible reporting. The below code works fine, what I am wondering is if it is efficient. My dataset has 50k obs and 149 vars, so when I ran this code on my data to classify everything, it took 816 seconds.

Would there be a more efficient way? It seems (not sure) but for each case_when() it goes through the entire dataset each time. I was thinking that a string evaluation which handles all cases on one single pass, but I'm unsure how to go about it and if it really is more efficient. Maybe this is the time it takes simply because of the sheer number of observations.

# Sample data:
dx <-  tibble(
  `Job Operator` = c("Wayne Martin", "Emil Manuel", "Joanne Lano", "Tony Solis", "Keri Cornwell", "Keri Cornwell", "Ester Navoa", "Ester Navoa", "Emil Manuel"),
  `Job.Recog.Date` = c( "2017-04-23",  "2017-04-17",  "2018-04-04",  "2018-03-25",  "2019-03-25",  "2019-03-16",  "2017-03-17",  "2019-03-17",  "2018-04-15"),
  `Job Profit` = c(425.00, 444.41, 594.80, 649.30, 488.10, 432.20, 696.50, 378.25,  856.00)
)

# Using case_when() to classify which quarter of the year:
ptm <- proc.time()
dx %>% 
  group_by( `Job.Recog.Date`, `Job Operator`) %>% 
  mutate( Quarters = case_when(
    `Job.Recog.Date`< ymd(20170331) ~ "2017.Q1",
    `Job.Recog.Date`> ymd(20170331) & `Job.Recog.Date`< ymd(20170630)  ~ "2017.Q2",
    `Job.Recog.Date`> ymd(20170630) & `Job.Recog.Date`< ymd(20170930)  ~ "2017.Q3",
    `Job.Recog.Date`> ymd(20170930) & `Job.Recog.Date`< ymd(20171231)  ~ "2017.Q4",
    
    `Job.Recog.Date`> ymd(20170131) & `Job.Recog.Date`< ymd(20180331)  ~ "2018.Q1",
    `Job.Recog.Date`> ymd(20180331) & `Job.Recog.Date`< ymd(20180630)  ~ "2018.Q2",
    `Job.Recog.Date`> ymd(20180630) & `Job.Recog.Date`< ymd(20180930)  ~ "2018.Q3",
    `Job.Recog.Date`> ymd(20180930) & `Job.Recog.Date`< ymd(20181231)  ~ "2018.Q4",
    
    `Job.Recog.Date`> ymd(20180131) & `Job.Recog.Date`< ymd(20190331)  ~ "2019.Q1",
    `Job.Recog.Date`> ymd(20190331) & `Job.Recog.Date`< ymd(20190630)  ~ "2019.Q2",
    `Job.Recog.Date`> ymd(20190630) & `Job.Recog.Date`< ymd(20190930)  ~ "2019.Q3",
    `Job.Recog.Date`> ymd(20190930) & `Job.Recog.Date`< ymd(20191231)  ~ "2019.Q4",
    
    `Job.Recog.Date`> ymd(20190131) & `Job.Recog.Date`< ymd(20200331)  ~ "2020.Q1"

  )) 
ptm <- proc.time()

This came up because it was the first time I had to wait so much to do anything with this dataset. Maybe it is the nature of the operation, hence the post. All other manipulations were pretty much instantaneous.

Thank you for your time.
Best regards,
LF.

The lubridate package can help you a lot with this problem. I'm not sure it's much faster but a lot less typing.

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
dx <-  tibble(
  `Job Operator` = c("Wayne Martin", "Emil Manuel", "Joanne Lano", "Tony Solis", "Keri Cornwell", "Keri Cornwell", "Ester Navoa", "Ester Navoa", "Emil Manuel"),
  `Job.Recog.Date` = c( "2017-04-23",  "2017-04-17",  "2018-04-04",  "2018-03-25",  "2019-03-25",  "2019-03-16",  "2017-03-17",  "2019-03-17",  "2018-04-15"),
  `Job Profit` = c(425.00, 444.41, 594.80, 649.30, 488.10, 432.20, 696.50, 378.25,  856.00)
)


ptm <- proc.time()
dx %>% 
  group_by( `Job.Recog.Date`, `Job Operator`) %>% 
  mutate( Quarters = case_when(
    `Job.Recog.Date`< ymd(20170331) ~ "2017.Q1",
    `Job.Recog.Date`> ymd(20170331) & `Job.Recog.Date`< ymd(20170630)  ~ "2017.Q2",
    `Job.Recog.Date`> ymd(20170630) & `Job.Recog.Date`< ymd(20170930)  ~ "2017.Q3",
    `Job.Recog.Date`> ymd(20170930) & `Job.Recog.Date`< ymd(20171231)  ~ "2017.Q4",
    
    `Job.Recog.Date`> ymd(20170131) & `Job.Recog.Date`< ymd(20180331)  ~ "2018.Q1",
    `Job.Recog.Date`> ymd(20180331) & `Job.Recog.Date`< ymd(20180630)  ~ "2018.Q2",
    `Job.Recog.Date`> ymd(20180630) & `Job.Recog.Date`< ymd(20180930)  ~ "2018.Q3",
    `Job.Recog.Date`> ymd(20180930) & `Job.Recog.Date`< ymd(20181231)  ~ "2018.Q4",
    
    `Job.Recog.Date`> ymd(20180131) & `Job.Recog.Date`< ymd(20190331)  ~ "2019.Q1",
    `Job.Recog.Date`> ymd(20190331) & `Job.Recog.Date`< ymd(20190630)  ~ "2019.Q2",
    `Job.Recog.Date`> ymd(20190630) & `Job.Recog.Date`< ymd(20190930)  ~ "2019.Q3",
    `Job.Recog.Date`> ymd(20190930) & `Job.Recog.Date`< ymd(20191231)  ~ "2019.Q4",
    
    `Job.Recog.Date`> ymd(20190131) & `Job.Recog.Date`< ymd(20200331)  ~ "2020.Q1"
    
  )) 
#> # A tibble: 9 x 4
#> # Groups:   Job.Recog.Date, Job Operator [9]
#>   `Job Operator` Job.Recog.Date `Job Profit` Quarters
#>   <chr>          <chr>                 <dbl> <chr>   
#> 1 Wayne Martin   2017-04-23             425  2017.Q2 
#> 2 Emil Manuel    2017-04-17             444. 2017.Q2 
#> 3 Joanne Lano    2018-04-04             595. 2018.Q2 
#> 4 Tony Solis     2018-03-25             649. 2018.Q1 
#> 5 Keri Cornwell  2019-03-25             488. 2019.Q1 
#> 6 Keri Cornwell  2019-03-16             432. 2019.Q1 
#> 7 Ester Navoa    2017-03-17             696. 2017.Q1 
#> 8 Ester Navoa    2019-03-17             378. 2019.Q1 
#> 9 Emil Manuel    2018-04-15             856  2018.Q2
Time1 <- ptm <- proc.time()

ptm <- proc.time()
dx %>%
  mutate(Quarters=as.character(quarter(Job.Recog.Date, with_year=TRUE)))
#> # A tibble: 9 x 4
#>   `Job Operator` Job.Recog.Date `Job Profit` Quarters
#>   <chr>          <chr>                 <dbl> <chr>   
#> 1 Wayne Martin   2017-04-23             425  2017.2  
#> 2 Emil Manuel    2017-04-17             444. 2017.2  
#> 3 Joanne Lano    2018-04-04             595. 2018.2  
#> 4 Tony Solis     2018-03-25             649. 2018.1  
#> 5 Keri Cornwell  2019-03-25             488. 2019.1  
#> 6 Keri Cornwell  2019-03-16             432. 2019.1  
#> 7 Ester Navoa    2017-03-17             696. 2017.1  
#> 8 Ester Navoa    2019-03-17             378. 2019.1  
#> 9 Emil Manuel    2018-04-15             856  2018.2

Time2 <- ptm <- proc.time()

Time1
#>    user  system elapsed 
#>    2.96    1.31    5.21
Time2
#>    user  system elapsed 
#>    2.98    1.31    5.25

Created on 2020-03-30 by the reprex package (v0.3.0)

1 Like

In the data.table package the function fcase recently became available in the dev version.
Check point 6. here:

They usually outperform all other comparable packages. You might want to check if it's applicable to your usecase.

Hi and thank you for your response!

The speed increased drastically!

> ptm <- proc.time()
> df_all_Qstamp <-  df_all %>% 
+   group_by( `Job.Recog.Date`, `Job Operator`) %>% 
+   mutate(Quarters = as.character(quarter(Job.Recog.Date, with_year=TRUE)))
> proc.time() - ptm   
   user  system elapsed 
   1.25    0.03    1.28 

From 800ish seconds to 2 basically! :star_struck:
:man_facepalming:

Thank you!

Thanks for the reply,
I'll take a look!

Conversions from characters to dates tend to be quite slow, so these should be avoided if possible.

I don't think case_when() will cause any significant slowdown.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.