This is a pretty neat package! Congrats on the love it has received so far. I think it could be pretty helpful for beginners, or for just general logging and understanding of what dplyr is doing.
Since you asked for feedback, I do have a few thoughts!
The most obvious thing to me is that the dplyr verbs get clobbered because tidylog defines a new function called filter() rather than adding a filter() method. The most immediate issue with this is that if you flip the order of the library calls, it doesn't work!
# devtools::install_github("elbersb/tidylog")
library(tidylog)
#>
#> Attaching package: 'tidylog'
#> The following object is masked from 'package:stats':
#>
#> filter
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:tidylog':
#>
#> anti_join, distinct, filter, filter_all, filter_at, filter_if,
#> full_join, group_by, group_by_all, group_by_at, group_by_if,
#> inner_join, left_join, mutate, mutate_all, mutate_at,
#> mutate_if, right_join, select, select_all, select_at,
#> select_if, semi_join, transmute, transmute_all, transmute_at,
#> transmute_if
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
mtcars %>%
as_tibble() %>%
mutate(x = 5)
#> # A tibble: 32 x 12
#> mpg cyl disp hp drat wt qsec vs am gear carb x
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 5
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 5
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 5
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 5
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 5
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 5
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 5
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 5
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 5
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 5
#> # … with 22 more rows
It might work to use an S3 method for filter() instead, something like filter.tbl_log. The API would probably look more like:
mtcars %>%
as_tbl_log() %>%
mutate(x = 5)
# or
mtcars %>%
init_logger() %>%
mutate(x = 5)
where as_tbl_log() and init_logger() would just add a tbl_log class to the existing object. That way, when the object is passed off to mutate(), the correct mutate.tbl_log method is called.
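For example (just a sketch; as_tbl_log() is the hypothetical constructor named above, not an existing function), adding the class could be as simple as:
as_tbl_log <- function(.data) {
  # prepend the tbl_log class so dplyr dispatches to the *.tbl_log methods first
  class(.data) <- c("tbl_log", class(.data))
  .data
}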
mutate.tbl_log could look like:
mutate.tbl_log <- function(.data, ...) {
# this calls the next method of dplyr::mutate(). essentially, it performs the real mutate() call
.data_new <- NextMethod()
log_mutate(old = .data, new = .data_new, "mutate")
.data_new
}
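The log_mutate() in that snippet is just a placeholder for whatever logging helper tidylog already uses internally; to make the sketch self-contained, a toy version (purely illustrative, not tidylog's actual function) could be:
# toy stand-in for the real logging helper: report variables the call added
log_mutate <- function(old, new, funname) {
  new_vars <- setdiff(names(new), names(old))
  message(funname, ": added ", length(new_vars),
          " new variable(s): ", paste(new_vars, collapse = ", "))
}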
If you haven't used S3 before, you can learn more about it here: https://adv-r.hadley.nz/s3.html
This would probably fix @dwhdai's issue with conflicted.
Regarding @jdlong's comment about working with remote tibbles, I think you could change your logger a bit to check whether .data inherits from "data.frame" (for base R data frames) or just "tbl" (as a SQL backend would, or any tibble object). You would also have to modify the way you compute n to be a bit more generic so that it works with remote backends, but I don't think that's too hard to do. Something like this (using the old API, not the potentially new S3 way!):
library(dplyr)
filter <- function(.data, ...) {
  log_filter2(.data, dplyr::filter, "filter", ...)
}

log_filter2 <- function(.data, fun, funname, ...) {
  newdata <- fun(.data, ...)
  # summarise() + pull() works for local data frames and remote backends alike,
  # whereas nrow() would return NA for a lazy tbl
  n_old <- .data %>%
    summarise(n = n()) %>%
    pull()
  n_new <- newdata %>%
    summarise(n = n()) %>%
    pull()
  n_diff <- n_old - n_new
  cat(glue::glue("{n_diff} rows removed"))
  invisible(newdata)  # return the filtered data so the pipe can continue
}
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")
mtcars2 %>%
filter(mpg < 20)
#> 14 rows removed
I don't think this is super computationally efficient, since it forces the database to run the full SQL statement just to get n (and the whole point of dbplyr is to delay it), but nevertheless, it is interesting.
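One way to soften that cost, sketched below (n_rows() is a made-up helper, not something tidylog exposes), is to branch on the class of .data, so local data frames keep using nrow() and only remote tbls pay for the count query. Continuing from the example above:
n_rows <- function(.data) {
  if (inherits(.data, "data.frame")) {
    # local data frames (including tibbles): no query needed
    nrow(.data)
  } else if (inherits(.data, "tbl")) {
    # remote tbls: ask the backend for a count
    .data %>% summarise(n = n()) %>% pull()
  } else {
    NA_integer_
  }
}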
Nice job!