I assumed the dplyr::select option would be a bit slower, because R has to resolve the dplyr namespace on each call. I tried microbenchmark, but did not find the anticipated difference.
source <- data.frame(
  stringsAsFactors = FALSE,
  URN = c("aaa", "bbb", "ccc", "ddd", "eee", "fff", "ggg"),
  VIN = c("xxx", "xxx", "yyy", "yyy", "yyy", "zzz", "abc"),
  EventDate = c("2019-04-29", "2019-11-04", "2019-06-18", "2019-11-21",
                "2020-11-18", "2020-01-27", "2020-08-22"),
  Q1 = c(10, 5, 8, 10, 2, 4, 3),
  Q2 = c(1, 1, 1, 1, 2, 1, 2),
  Q3 = c(1, 4, 3, 2, 1, 2, 4),
  Q4 = c(2019, 2020, 2020, 2019, 2020, 2021, 2021),
  Sequence = c(1, 2, 1, 2, 3, 0, 0)
)
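library(dplyr)           # provides select() and re-exports %>%
library(microbenchmark)

# Compare the unqualified call with the namespace-qualified call
microbenchmark(
  bob <- source %>% select(Q1, Q4),
  bob2 <- source %>% dplyr::select(Q1, Q4),
  times = 1000000
)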
############# OUTPUT ###################
# Unit: milliseconds
#                                     expr    min     lq     mean median     uq      max neval
#         bob <- source %>% select(Q1, Q4) 1.7483 1.8171 1.926814 1.8482 1.8820 116.5913 1e+06
# bob2 <- source %>% dplyr::select(Q1, Q4) 1.7561 1.8258 1.935550 1.8573 1.8914 106.2843 1e+06
# With this many replicates the benchmark takes a long time to run. I let it run overnight.
I tried adding a few more functions (mutate, filter, arrange, and ggplot), but the difference in execution times was too small to be significant.
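One way to check whether the lookup itself costs anything measurable is to time only the name resolution, without the select() call around it. This is a rough sketch and not part of the benchmark above: `lookup` is a hypothetical name, and times = 10000 is an arbitrary choice that keeps the run short.

```r
library(dplyr)
library(microbenchmark)

# Time only the symbol resolution, not the select() call itself.
# `lookup` is a hypothetical name; times = 10000 is an arbitrary choice.
lookup <- microbenchmark(
  attached  = select,         # resolved through the search path
  qualified = dplyr::select,  # `::` is itself a function call into the namespace
  times = 10000
)
print(lookup, unit = "ns")

# Both expressions should resolve to the same function object,
# as long as no other attached package masks select().
identical(select, dplyr::select)
```

If the two lookups turn out to differ by far less than the roughly 1.8 ms median of the full select() call, that would be consistent with the difference getting lost in the noise of the benchmark above.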