new package: tidylog - feedback for basic dplyr operations

benj · January 31, 2019, 5:03am

Hi all,
I wrote a simple package that gives feedback to the user when doing basic dplyr operations. For instance, see this pipe:

library(tidyverse)
library(tidylog)
summary <- mtcars %>%
    select(mpg, cyl, hp) %>%
    filter(mpg > 15) %>%
    mutate(mpg_round = round(mpg)) %>%
    group_by(cyl, mpg_round) %>%
    tally() %>%
    filter(n >= 1)
#> select: dropped 8 variables (disp, drat, wt, qsec, vs, …) 
#> filter: removed 6 rows (19%) 
#> mutate: new variable 'mpg_round' with 15 unique values and 0% NA 
#> group_by: 17 groups [cyl, mpg_round] 
#> filter: no rows removed

I find this especially helpful for filter and the scoped variants of mutate/select (i.e. mutate_if, mutate_at etc.). For instance:

c <- select_if(mtcars, is.character)
#> select_if: dropped all variables

This might have been inadvertent. With filter I often want to know how many cases I lose -- for instance, when doing a subsample analysis.

For more information, see the Readme: https://github.com/elbersb/tidylog

I would be grateful for feedback. This is still in the early stages. Is this useful for anyone?

Best,
Ben

rensa · January 31, 2019, 5:08am

This is a really, really cool package to have on hand for debugging! Nice job, @benj! Are you tweeting about it or submitting it to RWeekly?

pete · January 31, 2019, 6:27pm

Very nice!
It would be nice for this to be a standard part of dplyr, maybe enabled with a verbose = TRUE option.

benj · January 31, 2019, 10:46pm

Hi all, glad you like it!

@rensa, I was thinking about publicizing it more widely, but I wanted to make sure that there are no major bugs first, and was hoping that by posting it here people would give it a try. If there is a problem with one of the wrapper functions, the dplyr command won't work either, of course -- although in that case it's easy to fall back to dplyr::mutate to bypass the tidylog package.

@pete, that was my first idea, but it's "un-R-like"

mara · February 1, 2019, 12:13pm

Sorry, I wouldn't have tweeted it w/out asking had I seen this part!

jdlong · February 1, 2019, 2:59pm

I think this is SUPER neat. I tested it against a remote tibble (database backed) and noticed that it does not give feedback. Any idea what it would take to make this work with remote lazy tibbles? I have not dug into the code, but I'll look at it soon. Great work!



library(tidyverse)
suppressMessages(library(tidylog))
mtcars %>% filter(mpg < 20) -> local_filter
#> filter: removed 14 rows (44%)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")

mtcars2 %>% filter(mpg < 20) %>% collect -> remote_filter

Created on 2019-02-01 by the reprex package (v0.2.1)

hoelk · February 1, 2019, 3:08pm

nice idea!

nitpick: I would consider using message() instead of cat() for that kind of output, though that's largely a matter of taste (since your package is mainly about interactive usage anyway).

I browsed your code a bit, and you could use deparse(substitute()) in log_filter() etc., so that you don't have to pass in the function AND the function name each time:

test <- function(x) deparse(substitute(x))
test(dplyr::filter)

#> "dplyr::filter"
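A minimal sketch of how the two suggestions above could combine. The helper name log_call() is hypothetical and is not part of tidylog. Note that deparse(substitute()) only sees the expression passed directly to the function it is called in, so it has to sit in the function that receives the original argument.

```r
# Hypothetical helper (not tidylog's API): takes the function once,
# recovers its name with deparse(substitute()), and reports via message().
log_call <- function(.data, fun, ...) {
  funname <- deparse(substitute(fun))  # e.g. "head" or "dplyr::filter"
  newdata <- fun(.data, ...)
  message(sprintf("%s: removed %d rows", funname, nrow(.data) - nrow(newdata)))
  newdata
}

kept <- log_call(mtcars, head, 10)
#> head: removed 22 rows

# Because the output goes through message(), callers can silence it:
quiet <- suppressMessages(log_call(mtcars, head, 10))
```

With cat() the text would go to standard output instead, and suppressMessages() would have no effect on it.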

dwhdai · February 1, 2019, 4:32pm

This is a great idea!

One thing I'm noticing in the very early stages of using this package is that, in conjunction with conflicted and dplyr, I'm getting a lot of conflicts between the tidylog functions and the dplyr functions. If it's possible to eliminate these conflicts somehow, that would be great.

Either pick the one you want with `::` 
* tidylog::select
* dplyr::select
Or declare a preference with `conflict_prefer()`
* conflict_prefer("select", "tidylog")
* conflict_prefer("select", "dplyr")

davis · February 1, 2019, 5:45pm

This is a pretty neat package! Congrats on the love it has received so far. I think it could be pretty helpful for beginners, or for just general logging and understanding of what dplyr is doing.

Since you asked for feedback, I do have a few thoughts!

The most obvious to me is that the S3 methods get clobbered by defining a new function called filter() rather than adding a filter() method. The most immediate issue with this is that if you flip the order of the library calls, it doesn't work!

# devtools::install_github("elbersb/tidylog")

library(tidylog)
#> 
#> Attaching package: 'tidylog'
#> The following object is masked from 'package:stats':
#> 
#>     filter
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:tidylog':
#> 
#>     anti_join, distinct, filter, filter_all, filter_at, filter_if,
#>     full_join, group_by, group_by_all, group_by_at, group_by_if,
#>     inner_join, left_join, mutate, mutate_all, mutate_at,
#>     mutate_if, right_join, select, select_all, select_at,
#>     select_if, semi_join, transmute, transmute_all, transmute_at,
#>     transmute_if
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

mtcars %>%
  as_tibble() %>%
  mutate(x = 5)
#> # A tibble: 32 x 12
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb     x
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     5
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     5
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     5
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     5
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     5
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     5
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     5
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     5
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     5
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     5
#> # … with 22 more rows

It might work to use an S3 method for filter instead. Something like filter.tbl_log. The API would probably look more like:

mtcars %>%
  as_tbl_log() %>%
  mutate(x = 5)

# or
mtcars %>%
  init_logger() %>%
  mutate(x = 5)

where as_tbl_log() and init_logger() would just add a tbl_log class to the existing object. That way, when it passes off to mutate(), the correct mutate.tbl_log method is called.

mutate.tbl_log could look like:

mutate.tbl_log <- function(.data, ...) {
  # this calls the next method of dplyr::mutate(). essentially, it performs the real mutate() call
  .data_new <- NextMethod()
  log_mutate(old = .data, new = .data_new, "mutate")
  .data_new
}
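To make that concrete, here is a self-contained sketch under stated assumptions: as_tbl_log() is the hypothetical constructor named above, and the logging is reduced to a single message() call instead of a real log_mutate(). The class is re-added at the end so that the next verb in a pipe still dispatches to the logging method, whatever the dplyr version does with extra classes.

```r
library(dplyr)

# Hypothetical constructor: only prepends the "tbl_log" class.
as_tbl_log <- function(x) {
  if (!inherits(x, "tbl_log")) class(x) <- c("tbl_log", class(x))
  x
}

# S3 method for dplyr's mutate() generic.
mutate.tbl_log <- function(.data, ...) {
  .data_new <- NextMethod()  # runs the real dplyr method
  new_vars <- setdiff(names(.data_new), names(.data))
  message("mutate: ", length(new_vars), " new variable(s): ",
          paste(new_vars, collapse = ", "))
  as_tbl_log(.data_new)      # keep the class for later verbs
}

logged <- mtcars %>%
  as_tbl_log() %>%
  mutate(x = 5)
#> mutate: 1 new variable(s): x
```

Inside a package the method would also need to be registered (for example with an S3method() entry in NAMESPACE); defining it in the global environment as above is enough for a script.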

To learn more about S3 if you haven't used it before, you can look here!
https://adv-r.hadley.nz/s3.html

This would probably fix @dwhdai's issue with conflicted.

Regarding @jdlong's comment about working with remote tibbles, I think you could change your logger a bit to look and see if .data inherits from "data.frame" (for base R data frames) or just "tbl" (as a sql backend would, or any tibble object). You would also have to modify the way you compute n to be a bit more generic so that it works with remote backends, but I don't think it is too hard to do. Something like this (using the old api! not the potentially new s3 way!):


library(dplyr)

filter <- function(.data, ...) {
  log_filter2(.data, dplyr::filter, "filter", ...)
}

log_filter2 <- function(.data, fun, funname, ...) {
  
  newdata <- fun(.data, ...)
  
  n_old <- .data %>%
    summarise(n = n()) %>%
    pull()
  
  n_new <- newdata %>%
    summarise(n = n()) %>%
    pull()
  
  n_diff <- n_old - n_new
  
  cat(glue::glue("{n_diff} rows removed"))
}

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)
mtcars2 <- tbl(con, "mtcars")

mtcars2 %>%
  filter(mpg < 20)
#> 14 rows removed

I don't think this is super computationally efficient, since it forces the database to run the full SQL statement just to get n (and the whole point of dbplyr is to delay it), but nevertheless, it is interesting.

Nice job!

benj · February 1, 2019, 11:37pm

Yes, this is not supported right now. The problem is that dplyr builds up the SQL statement, and when the filter function is called, there is no dataframe yet. In other words, without executing the statement, tidylog can't know how many rows you drop. Of course, it's possible for tidylog to execute the statement and find out, but that would entail a huge performance hit, especially in longer pipes... so I don't think there's a good way to deal with that.
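One possible middle ground, sketched here as an assumption rather than as anything tidylog does: count rows only when the input is already an in-memory data frame, and say so explicitly when it is not. The wrapper name filter_logged() is hypothetical.

```r
library(dplyr)

# Hypothetical wrapper: logs row counts for in-memory data only.
filter_logged <- function(.data, ...) {
  newdata <- dplyr::filter(.data, ...)
  if (is.data.frame(.data)) {
    n_removed <- nrow(.data) - nrow(newdata)
    message(sprintf("filter: removed %d rows (%.0f%%)",
                    n_removed, 100 * n_removed / nrow(.data)))
  } else {
    # Remote (lazy) tables are not data frames; counting would run the query.
    message("filter: lazy table, row counts not computed")
  }
  newdata
}

local_filter <- filter_logged(mtcars, mpg < 20)
#> filter: removed 14 rows (44%)
```

A table created with tbl(con, "mtcars") would take the second branch, so the SQL statement stays unevaluated until collect() is called.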

benj · February 1, 2019, 11:42pm

thanks! Yes, message sounds like the right function. And thanks for the tip about deparse(substitute())!

By the way, for me the package is not only about interactive use. I have a lot of long-running scripts that I run on a server, and getting this kind of feedback in the R log is really helpful to see what happened when there are problems. For instance, I no longer put stuff like this everywhere in my code to see whether a join had the intended effect:

print(nrow(d))
# do something
print(nrow(d))

benj · February 1, 2019, 11:53pm

@davis, I thought about this approach when starting the package, and defining a new S3 method would, of course, be in many ways the more elegant solution. However, I feel like this would take away a lot of the appeal of the package, because then you need to remember to call the as_tbl_log() (or similar) function on every dataframe that you work with. So tidylog could no longer be just dropped in. Right now, the only thing to remember is to load the package last, but apart from that it requires no further interaction from the user. I will keep thinking about it, but I think I'll keep it like it is.

About the problem with conflicted, someone on Github proposed this: https://github.com/elbersb/tidylog/pull/2

hoelk · February 2, 2019, 8:34am

Hmm, thinking out loud: what about modifying the pipe operator instead of the dplyr functions?
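A rough base-R sketch of what such a pipe might look like. The operator name %log>% is made up for illustration, and this handles only the simple case where the right-hand side is a call that takes the data as its first argument (no dot placeholder, no bare function names).

```r
# Hypothetical logging pipe: inserts the left-hand side as the first
# argument of the right-hand call, then reports the change in row count.
`%log>%` <- function(lhs, rhs) {
  rhs_call <- substitute(rhs)
  stopifnot(is.call(rhs_call))
  # Rebuild f(args...) as f(.lhs_value, args...)
  new_call <- as.call(c(list(rhs_call[[1]], quote(.lhs_value)),
                        as.list(rhs_call)[-1]))
  env <- new.env(parent = parent.frame())
  assign(".lhs_value", lhs, envir = env)
  result <- eval(new_call, env)
  if (is.data.frame(lhs) && is.data.frame(result)) {
    message(sprintf("%s: %d -> %d rows",
                    deparse(rhs_call[[1]]), nrow(lhs), nrow(result)))
  }
  result
}

res <- mtcars %log>% subset(mpg > 15) %log>% head(3)
#> subset: 32 -> 26 rows
#> head: 26 -> 3 rows
```

The trade-off compared with wrapping each verb is that the pipe only sees the data before and after a step, so verb-specific messages (dropped variables, new groups) would need extra dispatch on the function name.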

jonmcalder · February 2, 2019, 9:49am

As it happens, I'm curating the upcoming edition of RWeekly and tidylog did of course make it onto our radar.

By default it'll just be listed in the new packages section, but could even be highlighted (subject to our usual internal voting process).

So @benj if you are averse to the idea of any wider publicity at this stage, please just ping me sometime before Monday & I'll remove it from this week's edition for you.

FWIW I think this is a really neat package idea and well worth sharing even in its early development stages given the friendly & supportive nature of the rstats community.

If you're worried about exposing users to bugs, why not just add a tidyverse lifecycle badge to the ReadMe?

Kind regards,
Jon

benj · February 2, 2019, 10:36am

Hi Jon, it looks like a lot of people have tried the package now, and there don't seem to be any major problems. (It's not a complicated package anyway!) I fixed one bug regarding the upcoming release of dplyr, but otherwise it seems fine. So please feel free to list/feature the package.

Ben

jonmcalder · February 2, 2019, 10:37am

Cool - will do!

I look forward to trying out the package myself!

petermeissner · February 2, 2019, 12:07pm

Hey.

Does this allow for logging to file or data base?
Might be an idea to be able to switch out the log function. The use case I see is logging information on data quality along with the data transformations done ... This could potentially be very powerful.

Nice job, best, Peter.

benj · February 3, 2019, 12:08am

Hi Peter,
in my workflow, I use R CMD BATCH --vanilla rfile.R, which gives me a rfile.Rout log file that contains both the code and all the outputs (including the log).

I agree that probably the best way to achieve more flexibility here is to allow setting a custom log function. Could you maybe open a github issue for this and explain your use case a bit more?

benj · February 4, 2019, 12:19am

There is now a way to specify custom log functions. See the updated readme: https://github.com/elbersb/tidylog#turning-logging-off-registering-additional-loggers
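For readers who want the general idea without opening the readme, here is a sketch of the pattern in base R. The option name sketch.loggers and the helper emit_log() are hypothetical and chosen for this example only; the readme linked above documents the actual tidylog option.

```r
# Hypothetical dispatcher: sends each log line to every registered function.
emit_log <- function(text) {
  loggers <- getOption("sketch.loggers", default = list(message))
  for (logger in loggers) logger(text)
  invisible(NULL)
}

# A custom logger that appends to a file.
log_file <- tempfile(fileext = ".log")
to_file <- function(text) {
  cat(text, "\n", sep = "", file = log_file, append = TRUE)
}

# Log to the console and to the file at the same time.
options(sketch.loggers = list(message, to_file))
emit_log("filter: removed 6 rows (19%)")
readLines(log_file)

# An empty list turns logging off.
options(sketch.loggers = list())
```

The same hook could point at a database writer or a structured logger, which is the use case raised earlier in the thread.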

detlef · February 4, 2019, 10:24am

This is a really great idea and I agree it would be neat to have it as part of dplyr. I am very keen on test-driven analytics: I weave tests and assertions (e.g. assertr) into data wrangling code, and I want to be able to pull out documentation on what has happened to the data in preparation for analysis. I think this package would be a great asset.
Best,
Detlef