Tibbles, vctrs, and hegemonic code


#1

Dear Folks--
I have been using tibbles for a while now, and I have not found any situation where they are not at least as useful as data frames. Well, maybe if I happen to have an 11- to 19-row object, but that is a minor case. I have pretty much stopped using plain data frames. And I'm happy about that.

Lately I have been reading the documentation for the vctrs package, as suggested in the "OO Field Guide" chapter of Advanced R, 2nd edition. I cannot claim to have fully understood it yet, but consistency in casting and coercion seems like an unconditional advantage, the only obvious problem being that replacing vectors with vctrs throughout will break some existing code.

But it is not clear to me that vctrs are intended to be superior general replacements for vectors in the sense that tibbles are a superior replacement for data frames. Hadley implies in several places that they are useful primarily to package developers who are creating new S3 classes. On the other hand, at one point he described them as "a type system for the tidyverse," which suggests to me that vector objects created within the tidyverse, such as the vectors inside of tibbles or returned by dplyr, are going to be vctrs by default. I don't believe that this has happened yet -- the major tidyverse packages do not seem to import or depend on vctrs, unless they do so only indirectly, and few other packages depend on it -- contrast with tibbles. But it seems like that is the goal. Does anyone know if that is right?

I am writing a package that creates some pretty large S3 objects, with millions of rows and hundreds of columns, and I am trying to figure out if there is any advantage or disadvantage to changing all of those columns into vctrs by default. Here are some things I don't know. I don't know if there are performance issues with vctrs. I don't know if they are still rapidly changing, so that reliance on them will cause your code to break if you don't keep up. I don't know if existing packages that do complicated things to base R vectors (e.g. survey) will always or almost always work if handed vctrs instead. I don't feel any need for vctrs. But I didn't feel any need for tibbles either, and now I would hate to be without them.

Do folks have thoughts on this?

Oh, one last point of confusion: In AR 2nd, Hadley says that at one point he used to think prototype-based programming was a good direction for R, and now he thinks it is not. But in the vctrs documentation, he repeatedly refers to the advantages of vctrs over base R vectors as prototypes. So has he changed his mind back? Or decided they are useful only in this special case, or with this infrastructural support? I am guessing that there is some way to read these things consistently given that Hadley is working on both of them at the same time, but I don't know what it is. Can anybody else see it?


#2

After reading some of the vignettes, I think the package is only meant to provide two things useful for writing packages:

  • Basic vector functions (combining, describing, ifelse) with more predictable outputs.
  • Helper functions for defining new S3 classes wrapped around atomic vectors.

The package itself doesn't introduce new types, just the tools to make them. This can help preserve context with values. So a numeric is a numeric, but you might want to make "extensions" of numerics like Fahrenheit and Celsius. That'd let you write a method for combining them that handles the conversion.
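A minimal sketch of that idea, assuming vctrs is installed. The constructors celsius() and fahrenheit() are hypothetical names made up for illustration; they are not part of vctrs. The conversion method itself (a vec_ptype2()/vec_cast() pair) is left out here.

```r
library(vctrs)

# Hypothetical constructors for illustration; not part of vctrs.
celsius    <- function(x = double()) new_vctr(x, class = "celsius")
fahrenheit <- function(x = double()) new_vctr(x, class = "fahrenheit")

temps <- vec_c(celsius(20), celsius(25))
class(temps)      # the "celsius" class survives combining
class(temps[1])   # ... and subsetting
vec_data(temps)   # the underlying data is still a plain double vector

# With no conversion method defined, mixing the two units is an error
# rather than a silent merge of the raw numbers.
mixed <- tryCatch(
  vec_c(celsius(20), fahrenheit(68)),
  error = function(e) "cannot combine"
)
mixed
```

Until a conversion method is written, the two classes simply refuse to mix, which is the "context" the values carry around.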


The package does make some "opinionated" assumptions I don't like. Particularly that vec_size(<matrix>) is equal to nrow(<matrix>). A matrix can be used as a memory- and processing-efficient data.frame when all the values are the same type. But functions which assume all matrices are observational data sets mean there's no point to the matrix or array classes. They're now obsolete, not safe.
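To make the assumption concrete, here is a small sketch (assuming vctrs is installed) comparing base length() with vec_size() on a plain matrix:

```r
library(vctrs)

m <- matrix(1:6, nrow = 2)

length(m)    # counts every element of the matrix
nrow(m)      # counts rows
vec_size(m)  # vctrs treats rows as the "observations"
```

So any function built on vec_size() will treat a matrix as a set of rows, whatever the matrix was meant to represent.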

Really, this part of Type and Size Stability raised red flags for me:

vapply() is a type-stable version of sapply() because vec_ptype(vapply(x, fun, template)) is always vec_ptype(template).
It is size-unstable for the same reasons as sapply().

[Emphasis mine]. When vapply is not seen as the safest function in base R, the logic leading to that conclusion needs to be checked.


#3

For me, the most important thing about the package is being able to define how coercion occurs. There are plenty of good (and not obscure) examples in the conference talk.

If you think that someone might combine your vector class with some other class, base R might (silently) lead to some bad results. vctrs comes along with additional complexity but I don't know of any other way to solve the coercion issue.
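A rough illustration of the silent behaviour, assuming vctrs is installed. The "usd" and "eur" classes are made up for this sketch and have no methods defined.

```r
library(vctrs)

# Two made-up classes with no c() method.
usd <- structure(10, class = "usd")
eur <- structure(10, class = "eur")

# Base R drops both classes without any warning.
base_result <- c(usd, eur)
class(base_result)

# Base R also quietly turns numbers into strings.
c(1.5, "a")

# vec_c() refuses the same combination instead of guessing.
vctrs_failed <- tryCatch(
  {
    vec_c(1.5, "a")
    FALSE
  },
  error = function(e) TRUE
)
vctrs_failed
```

The base results are not wrong by R's rules, but nothing tells the caller that the currency information is gone.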

It also provides a nice (but complex) template for new vector types that handles a lot of things that you would have to write novel code to do.


#4

Is there a chance the boilerplate code will be simplified? I'm hesitant to use it in packages which others may have to maintain.


#5

Yeah. I'm hesitant to use it in code that I may have to maintain.

I am a tidyverse fan and a tidyverse user. The one thing about it that often troubles me is that it often provides elegant and complete solutions to problems at the cost of needing a deeper understanding of programming and computer science principles than I have -- and maybe than I need. For example, I have a great deal of sympathy for John Mount's wrapr package, which seems to me to provide about 70 percent of the functionality of tidy evaluation for about a fifth of the mental load. I think I'd follow that path myself were it not that anything connected to the tidyverse gets an automatic boost from the tidyverse's many excellent features, even when not required by those features.


#6

Actually, to understand the tidyverse you need to understand a lot as soon as you scratch the surface. What are !!, !!! and :=? Why is the code of n() just a message? Why doesn't := even exist? How does one_of get its last argument by magic? Why do we have these close equivalents to base functions, and how are they different when it seems an underscore instead of a dot is the only difference? Why does !!! work in list2 but not in list? Why does !! sometimes mean as.logical and sometimes something else? Why do some of these commands work on databases and some not (and differently on different DBMSs)? Tibble or data frame, what is the difference? Where did my rownames go? Why do some functions recognize the grouping and some don't? Why can't I copy the pipe? In which environment am I? How can these pipes present "incomplete" calls on the rhs, and how come the dot is assigned the value of the lhs, except when I use a formula notation? Then, in the formula notation, what are ., .x, ..1, etc.? Why does .z not exist?

I love the tidyverse as much as anyone here, but it takes time to feel comfortable with it, and the apparent simplicity can be a trap.

I don't think there's a rush to use vctrs before seeing it used throughout the tidyverse and beyond, especially if you're not often confronted with the coercion issues Hadley mentions in his talk.


#7

I agree with Moody's recommendation here. To generalize a bit, if you go to the pkgdown site for vctrs or the README in the GitHub repo you can see that it has an "experimental" lifecycle badge.

You can read more about our use of lifecycle badges in r-lib and tidyverse packages at https://www.tidyverse.org/lifecycle/. Regarding "experimental":

An experimental package is in the very early stages of development. The API will be changing frequently as we rapidly iterate and explore variations in search of the best fit. Experimental packages will make API breaking changes without deprecation, so you are generally best off waiting until the package is more mature before you use it.

Usually experimental packages haven't been released on CRAN (vctrs has), and, as we move forward, its lifecycle status will be updated accordingly. However, you're right, this would be very early adoption.


#8

If you are creating new S3 classes, I think you will find using vctrs much easier. There is no plan to create vctrs versions of existing base classes (e.g. factors, dates, date-times etc).


#9

Those are two independent uses of prototype. Prototype OO is a special type of object oriented programming that is unrelated to the use of prototypes in vctrs.

This is not a choice that vctrs made, but a choice that R itself made, because this is valid base R code:

df <- data.frame(x = 1:3)
df$y <- matrix(runif(6), nrow = 3)

And subsetting that data frame works, which does not happen without effort, suggesting that this is a deliberate design choice for data frames.
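A quick check of that claim, as a sketch (vctrs is assumed to be installed only for the last line):

```r
library(vctrs)

df <- data.frame(x = 1:3)
df$y <- matrix(runif(6), nrow = 3)

ncol(df)    # the whole matrix counts as a single column
dim(df$y)   # the column is still a 3 x 2 matrix

# Row subsetting carries the matching matrix rows along.
sub <- df[2:3, ]
dim(sub$y)

# The matrix fits as a column because its size equals the row count.
vec_size(df$y) == nrow(df)
```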

If f() is size-stable, it possesses the useful property that df$z <- f(df$x) will work. That's not true for vapply():

df$z <- vapply(df$x, rep, 2, FUN.VALUE = integer(2))
# Error in `$<-.data.frame`(`*tmp*`, z, value = c(1L, 1L, 2L, 2L, 3L, 3L : 
#  replacement has 2 rows, data has 3

vapply() is very close to being size-stable (as well as type-stable), but the output is transposed compared to what the principle of size-stability suggests.
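To illustrate the "transposed" point, here is a base R sketch: transposing the vapply() result by hand gives one row per input element, and the assignment then goes through.

```r
df <- data.frame(x = 1:3)

out <- vapply(df$x, rep, 2, FUN.VALUE = integer(2))
dim(out)    # one column per element of df$x

# After transposing there is one row per element, so it fits as a column.
df$z <- t(out)
dim(df$z)
df$z[2, ]   # the values produced for x = 2
```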

Yes, of course. See https://github.com/r-lib/vctrs/issues/170 and I'll also look into this in roxygen2. But I don't think the boilerplate is that big of a cost because you only need to create it once, so it's not super high on my priority list.

The goal of the tidyverse is to be simpler for users. If you want to contribute deeply to the tidyverse, you are going to have to learn a bunch of programming concepts and up your software engineering game. Unfortunately, it is by necessity hard to contribute to a system that's used by millions of people. We are trying to make it as easy as possible (by improving tooling, writing books, and running developer days), but it's never going to be as easy as using R for data analysis.


#10

Glad to hear. I'm not worried about writing the code; I'm lazy and often use sprintf() for that. I worry more about reading it, because that's repeatedly done and not automatable.

I hadn't considered this case. My stance was more about this part from the "Prototypes and sizes" vignette:

vec_size() was motivated by the need to have an invariant that describes the number of “observations” in a data structure. This is particularly important for data frames as it’s useful to have some function such that f(data.frame(x)) equals f(x).
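The invariant in that quote can be checked directly. A minimal sketch, assuming vctrs is installed:

```r
library(vctrs)

x <- 1:3

# length() changes once x is wrapped in a data frame ...
length(x)
length(data.frame(x))

# ... while vec_size() gives the same answer for both.
vec_size(x)
vec_size(data.frame(x))
```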

I'm trying to square this with the concept of tidy data. Having vec_size(array) == dim(array)[1] works when the first dimension is categorical and the others describe different types of measures (e.g., the state.x77 matrix). But some arrays use all dimensions as categories, and all the values represent a single type of measure. For example, the HairEyeColor table:

library(vctrs)
data("HairEyeColor")
str(HairEyeColor)
# 'table' num [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
# - attr(*, "dimnames")=List of 3
#  ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
#  ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
#  ..$ Sex : chr [1:2] "Male" "Female"

hec_df <- data.frame(HairEyeColor)
head(hec_df)
#    Hair   Eye  Sex Freq
# 1 Black Brown Male   32
# 2 Brown Brown Male   53
# 3   Red Brown Male   10
# 4 Blond Brown Male    3
# 5 Black  Blue Male   11
# 6 Brown  Blue Male   50

vec_size(HairEyeColor)
# [1] 4
vec_size(hec_df)
# [1] 32

#11

I don't see the problem: not every vector/matrix/array is suitable for inclusion in a column of a data frame. HairEyeColor is an alternative to a (tidy) data frame, not something that you'd put in a column.