Hi there!
I was looking for a way to aggregate my data for easy interpretation in a table format for possible reporting. The below code works fine, what I am wondering is if it is efficient. My dataset has 50k obs and 149 vars, so when I ran this code on my data to classify everything, it took 816 seconds.
Would there be a more efficient way? It seems (not sure) but for each case_when() it goes through the entire dataset each time. I was thinking that a string evaluation which handles all cases on one single pass, but I'm unsure how to go about it and if it really is more efficient. Maybe this is the time it takes simply because of the sheer number of observations.
# Sample data:
dx <- tibble(
`Job Operator` = c("Wayne Martin", "Emil Manuel", "Joanne Lano", "Tony Solis", "Keri Cornwell", "Keri Cornwell", "Ester Navoa", "Ester Navoa", "Emil Manuel"),
`Job.Recog.Date` = c( "2017-04-23", "2017-04-17", "2018-04-04", "2018-03-25", "2019-03-25", "2019-03-16", "2017-03-17", "2019-03-17", "2018-04-15"),
`Job Profit` = c(425.00, 444.41, 594.80, 649.30, 488.10, 432.20, 696.50, 378.25, 856.00)
)
# Using case_when() to classify which quarter of the year:
ptm <- proc.time()
dx %>%
group_by( `Job.Recog.Date`, `Job Operator`) %>%
mutate( Quarters = case_when(
`Job.Recog.Date`< ymd(20170331) ~ "2017.Q1",
`Job.Recog.Date`> ymd(20170331) & `Job.Recog.Date`< ymd(20170630) ~ "2017.Q2",
`Job.Recog.Date`> ymd(20170630) & `Job.Recog.Date`< ymd(20170930) ~ "2017.Q3",
`Job.Recog.Date`> ymd(20170930) & `Job.Recog.Date`< ymd(20171231) ~ "2017.Q4",
`Job.Recog.Date`> ymd(20170131) & `Job.Recog.Date`< ymd(20180331) ~ "2018.Q1",
`Job.Recog.Date`> ymd(20180331) & `Job.Recog.Date`< ymd(20180630) ~ "2018.Q2",
`Job.Recog.Date`> ymd(20180630) & `Job.Recog.Date`< ymd(20180930) ~ "2018.Q3",
`Job.Recog.Date`> ymd(20180930) & `Job.Recog.Date`< ymd(20181231) ~ "2018.Q4",
`Job.Recog.Date`> ymd(20180131) & `Job.Recog.Date`< ymd(20190331) ~ "2019.Q1",
`Job.Recog.Date`> ymd(20190331) & `Job.Recog.Date`< ymd(20190630) ~ "2019.Q2",
`Job.Recog.Date`> ymd(20190630) & `Job.Recog.Date`< ymd(20190930) ~ "2019.Q3",
`Job.Recog.Date`> ymd(20190930) & `Job.Recog.Date`< ymd(20191231) ~ "2019.Q4",
`Job.Recog.Date`> ymd(20190131) & `Job.Recog.Date`< ymd(20200331) ~ "2020.Q1"
))
ptm <- proc.time()
This came up because it was the first time I had to wait so much to do anything with this dataset. Maybe it is the nature of the operation, hence the post. All other manipulations were pretty much instantaneous.
Thank you for your time.
Best regards,
LF.