Functions that input & output tibbles AND have a class system?

jmichaelrosenberg · October 31, 2017, 6:46pm

Pardon what is a novice question in a few ways, but I'm interested in writing functions that input and output tibbles and that also (possibly) have a class system.

My use case is a package for a type of clustering, what is called in my field Latent Profile Analysis. I noticed many beginners to this analysis found the greatest challenge to be figuring out what form the output took. My (proposed) solution was to have the main function in the package take a tibble (or a data.frame) and output a modified tibble - namely, one with the classification, or the profile to which the observation is assigned, in a new column.

If this is a good idea, I'm also curious how to additionally have a class system (so generic functions like plot() would work on the output). It's not clear to me whether this is a good idea - would intermediate steps (i.e., use of filter() on the output) strip the new class (at some point)? Would it be preferable to have a function like plot_profiles() that simply works on the modified tibble?

Relatedly, I'm also considering an option that defaults to outputting a modified tibble, with the alternative being to return an object with its own class - containing the output of the function and other data, like the fitted model object - on which generic functions would work (unlike for the tibble output). Does this seem like a good idea?

So, in summary, there would be the (default) tibble output - which would be especially easy to use interactively - but also the option to output a model object of its own class, for which functions that extract information (or create other output, like plots) would be written - some of which would be generic functions and others of which would not. To make it concrete, the interface would be something like:

# by default and for interactive use
main_function(..., to_return = "tibble") 

# with a class system, for more fine-grained output from the fitted model object
x <- main_function(..., to_return = "class_name")
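
To make the proposed interface a little more concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the argument names `df` and `n_profiles`, the `profile` column, and the use of `stats::kmeans()` as a stand-in for the actual Latent Profile Analysis fit. It is not how tidyLPA is implemented.

```r
# Hypothetical stand-in for the package's main function; kmeans() replaces
# the real Latent Profile Analysis fit purely for illustration.
main_function <- function(df, n_profiles = 2,
                          to_return = c("tibble", "class_name")) {
  to_return <- match.arg(to_return)
  fit <- stats::kmeans(df, centers = n_profiles)

  # Original data plus the assigned profile in a new column
  out <- df
  out$profile <- fit$cluster

  if (to_return == "tibble") {
    return(tibble::as_tibble(out))
  }

  # Classed list that carries the fitted model alongside the data
  structure(list(data = out, fit = fit), class = "class_name")
}

# A generic such as print() can then dispatch on the classed object
print.class_name <- function(x, ...) {
  cat("Fitted profiles:", length(unique(x$data$profile)), "\n")
  invisible(x)
}

set.seed(1)
d <- iris[, 1:4]
head(main_function(d))                           # data plus a `profile` column
x <- main_function(d, to_return = "class_name")  # classed object with the fit
print(x)
```

With this shape, the tibble output stays a plain tibble (so nothing can be stripped by later steps), while methods for generics are written only for the classed object.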

While highly specific, I wonder if this question could also be relevant more widely as package developers (like those for corrr or skimr) take a "tidy" approach with the functions in their packages.

This is a bit of a brainstorming question and so I appreciate any insight that can be shared with this novice package developer. If interested, the package tidyLPA is only on GitHub here.

nick · October 31, 2017, 7:10pm

If you haven't read @hadley's OO field guide from Advanced R, I would recommend starting there for information on classes. You're likely thinking of S3 classes, which are generally the most straightforward (and what tibble uses, AFAIK). Some brief testing with dplyr shows that mutate, filter, and group_by all remove an added class, unless you do something silly like filter on nothing:

suppressPackageStartupMessages(library(dplyr))

bi <- band_instruments

class(bi)
#> [1] "tbl_df"     "tbl"        "data.frame"
class(bi) <- c("new_tibble", class(bi))
class(bi)
#> [1] "new_tibble" "tbl_df"     "tbl"        "data.frame"

bi %>% filter() %>% class()
#> [1] "new_tibble" "tbl_df"     "tbl"        "data.frame"
bi %>% filter(name == "John") %>% class()
#> [1] "tbl_df"     "tbl"        "data.frame"
bi %>% mutate(Rating = c(3,2,4)) %>% class()
#> [1] "tbl_df"     "tbl"        "data.frame"
bi %>% group_by(plays) %>% class()
#> [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"

So, if you want to perform dplyr-ish operations on your new tibble, you would need to take additional steps to maintain the class (such as overriding the standard dplyr functions).
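
As a rough illustration of what "overriding the standard dplyr functions" could look like, the sketch below defines a `filter()` method for the `new_tibble` class used above. The method name follows the usual S3 convention; the body is just one possible approach (delegate with `NextMethod()`, then put the class back) and is an assumption, not something taken from dplyr's documentation.

```r
library(dplyr)

bi <- band_instruments
class(bi) <- c("new_tibble", class(bi))

# Hypothetical method: let dplyr's own method do the filtering,
# then make sure "new_tibble" is at the front of the class vector again.
filter.new_tibble <- function(.data, ...) {
  out <- NextMethod()
  class(out) <- unique(c("new_tibble", class(out)))
  out
}

res <- filter(bi, name == "John")
class(res)
```

A similar method would be needed for every verb that should keep the class, which is the maintenance burden discussed in the replies that follow.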

jmichaelrosenberg · October 31, 2017, 7:14pm

Thanks, this was kind of my concern (the need to take additional steps to maintain the class). That seems like a treacherous road to go down (maybe unnecessarily). I was thinking of S3 classes (in terms of my proposed use in this post).

jennybryan · October 31, 2017, 9:44pm

This is a really good and timely question.

I'd refine @nick's recommendation and point you to the S3 chapter in the place where a new edition of Advanced R is developing: https://adv-r.hadley.nz/s3. You'll get a good overview of how to think about S3 subclasses and a peek at how some of that might be formalized in a package called sloop.

Subclassing tbl_df is a really important special case that lots of people are thinking about in the tidyverse (those who work at RStudio and in the broader community). Because you're right, you want to retain your class after at least a certain subset of common operations. But you also don't want to re-implement all those methods for your class!

We had exactly this problem in googledrive, with the dribble class, so you could poke around there to see one solution with current technology (i.e. no sloop or whatever). We specifically wanted the dribble class to be retained after typical dplyr manipulations, as long as the object still had certain other properties. The files dribble.R and dplyr-compat.R are the most relevant.
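
The idea of keeping the class only while the object still has certain properties can be sketched as follows. This is not googledrive's code: the class name `profiles`, the required `profile` column, and the helpers `is_valid_profiles()` and `maybe_profiles()` are all hypothetical names chosen for this example.

```r
library(dplyr)

# Hypothetical rule: a "profiles" object must still have a `profile` column.
is_valid_profiles <- function(x) {
  is.data.frame(x) && "profile" %in% names(x)
}

# Reinstate the class only if the result still qualifies; otherwise
# return the result without the subclass.
maybe_profiles <- function(x) {
  cls <- setdiff(class(x), "profiles")
  class(x) <- if (is_valid_profiles(x)) c("profiles", cls) else cls
  x
}

# Delegate to dplyr's method, then decide whether the class survives
select.profiles <- function(.data, ...) maybe_profiles(NextMethod())

p <- tibble(id = 1:3, profile = c(1L, 2L, 1L))
class(p) <- c("profiles", class(p))

class(select(p, id, profile))  # `profile` is still present
class(select(p, id))           # `profile` was dropped
```

The design choice here is that losing the class is a legitimate outcome: once the defining column is gone, the result is treated as an ordinary tibble rather than an invalid `profiles` object.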

nick · October 31, 2017, 9:57pm

Thanks for the refinement on the link -- I knew Hadley was working on new material and had also stumbled across http://adv-r.had.co.nz/S3.html, but that didn't seem quite right. The hints at the sloop package look very useful.

jennybryan · October 31, 2017, 10:00pm

This tibble issue https://github.com/tidyverse/tibble/issues/275 is about this topic, although it is currently a bit of a placeholder. There are longer discussions in related issues linked there as being superseded by #275.

davis · November 1, 2017, 8:05pm

I'm really happy to see this being addressed so that we can have a standard way of extending tibbles. My current method has been manually removing class/attributes, calling the dplyr function, and adding them back. I'm much happier with the reconstruct() function described in the Advanced R link @jennybryan provided (see the Inheritance section).

This line in particular is incredibly promising and would save a ton of headaches:

"This duplicated code could be avoided completely if arrange.data.frame(), provided by dplyr, called reconstruct() for you. And indeed, a future version of that function will."
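
For comparison, the manual approach mentioned at the start of this post (remove the class and attributes, call the dplyr function, add them back) might look roughly like the sketch below. The helper name `with_class_restored()` and the `notes` attribute are made up for this example, and the helper is not the `reconstruct()` function from the Advanced R draft.

```r
library(dplyr)

# Hypothetical helper: run a verb on the plain tibble, then restore
# the subclass and one custom attribute afterwards.
with_class_restored <- function(x, verb, ..., subclass = "new_tibble") {
  extra <- attr(x, "notes")                        # custom attribute to carry over
  plain <- x
  class(plain) <- setdiff(class(plain), subclass)  # remove the subclass
  out <- verb(plain, ...)                          # call the dplyr function
  attr(out, "notes") <- extra                      # add the attribute back
  class(out) <- c(subclass, setdiff(class(out), subclass))
  out
}

bi <- band_instruments
attr(bi, "notes") <- "example attribute"
class(bi) <- c("new_tibble", class(bi))

res <- with_class_restored(bi, arrange, desc(name))
class(res)
attr(res, "notes")
```

This works for any verb passed in, but every call has to go through the wrapper, which is the duplication a shared reconstruction hook is meant to remove.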