Base R and the tidyverse

jiho · November 27, 2017, 9:08am

For general R usage, my experience teaching base R, then plyr/reshape2/etc., then the tidyverse has shown me that one can learn the tidyverse without learning base R, and that it is actually much easier (after a few hours of practice, students can make clever group_by() + summarise() calls while they don't even know what a for loop looks like or how they would replicate the behaviour of the tidyverse functions in base R).
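To make the comparison concrete, here is a minimal sketch of the same grouped mean written three ways. The built-in mtcars data set and the columns used (mpg by cyl) are assumptions picked for illustration, not taken from any course material, and the dplyr part only runs if dplyr is installed.

```r
# Mean mpg per number of cylinders, using the built-in mtcars data
# (data set and columns chosen only for illustration).

# 1. Explicit for loop in base R
groups <- sort(unique(mtcars$cyl))
means_loop <- numeric(length(groups))
for (i in seq_along(groups)) {
  means_loop[i] <- mean(mtcars$mpg[mtcars$cyl == groups[i]])
}
names(means_loop) <- groups

# 2. Base R without a loop
means_agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# 3. dplyr, only if it is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  library(dplyr)
  means_dplyr <- mtcars %>%
    group_by(cyl) %>%
    summarise(mpg = mean(mpg))
}
```

All three produce one mean per value of cyl; the difference is mostly in how much bookkeeping the reader has to follow.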

For package development, the tradeoffs mentioned sound right: using base R lets you avoid API changes and many dependencies (even if you use individual packages rather than tidyverse itself; I think I read somewhere that it is even considered bad practice to import tidyverse directly). However, my decision regarding this tradeoff seems to diverge from that of the previous contributors to the thread.

On many occasions, I've tried to use paste0() instead of str_c() to avoid depending on stringr, only to add the dependency later because I needed a more advanced function and replicating it with grepl(), sub() or the like was too cumbersome. Or I ended up being bitten by the fact that paste0() and str_c() don't handle NA in the same way. Similarly, I've tried going back to aggregate() and friends to avoid depending on dplyr for just this "one" call. And then the package grows and I end up with many aggregate() calls that I have more trouble debugging because I am less comfortable with them. In addition, it actually took me more time to code the non-tidyverse version because I don't know it as well.
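The NA difference can be shown in a few lines. This is a minimal sketch: the input vector and the "_id" suffix are made up for illustration, and the stringr call only runs if stringr is installed.

```r
# Hypothetical input with one missing value
x <- c("a", NA, "c")

# paste0() converts NA to the string "NA" before pasting
base_out <- paste0(x, "_id")

# Base R workaround: put the missing values back by hand
base_na_out <- ifelse(is.na(x), NA_character_, paste0(x, "_id"))

# str_c() propagates NA instead; only run if stringr is installed
if (requireNamespace("stringr", quietly = TRUE)) {
  tidy_out <- stringr::str_c(x, "_id")
}
```

Here base_out contains the literal string "NA_id" in second position, while base_na_out and tidy_out keep a real NA there. Silently swapping one function for the other can therefore change downstream results.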

Overall, I think it is important to use the same tools for everyday computation and package development because it lowers the barrier to package development, saves time and avoids mistakes. The dependency aspect is probably a non-issue because if you buy into the tidyverse approach, the functions of your package would probably work well within the tidyverse functions (i.e. simple functions that take vector arguments and work well with summarise() and mutate()) so that your users would probably have it installed. The API changes do occur, but then again, if writing the dplyr version takes you half the time and mental load compared to the base functions, you can tolerate having to rewrite some of them.

spiritus87 · November 27, 2017, 10:54am

My 2 cents.

I love the tidyverse for interactive analysis and reproducible reporting. The ecosystem makes it very easy to do things that are cumbersome in base R, in my opinion. The dplyr verbs, for instance, are great for REPL exploratory analysis and reasonably fast. I notice that when I do analyses in the tidyverse I am considerably more productive than in base R.

But. In my work, doing reproducible analyses is only a part of the job. A large part of the job is doing machine learning, forecasting and developing data products and dashboards for clients. Needless to say, these activities require a well-defined data and modelling pipeline that is stable, efficient and works well in a production environment; a pipeline that I always organize using R packages. And in this context, where datasets can get pretty big, efficiency and stability are paramount. Here in my opinion the tidyverse is not a good choice; I tend to rely on the magic combination:

base R + data.table

data.table is fast, extremely memory efficient, mature and self-contained, and in a production context I think it is (at the moment) a better alternative to the tidyverse.
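As a rough sketch of what that combination looks like in practice, here is a grouped mean and an in-place column update with data.table. The mtcars data and the column names are assumptions chosen for illustration, not part of any real pipeline, and the code only runs if data.table is installed.

```r
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  # Convert the built-in data frame (illustrative data only)
  dt <- as.data.table(mtcars)

  # Grouped mean using the dt[i, j, by] form; keyby sorts the groups
  means_dt <- dt[, .(mpg = mean(mpg)), keyby = cyl]

  # Add a column by reference, without copying the whole table
  dt[, mpg_centered := mpg - mean(mpg)]
}
```

The := update modifies dt in place, which is a large part of why data.table tends to use less memory on big tables.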

Riccardo.

Mark6 · November 27, 2017, 11:50am

Agree with this. I find performing analysis using tidyverse much easier than base R.

As others have said, I think "it depends" is the answer here. If you can complete your analysis using tidyverse or other functions, without knowing Base R, then not understanding Base R is not detrimental.

However, for more complex tasks or package development, or even just to be a good, well-rounded R user/programmer, understanding Base R becomes substantially more important.

alistaire · November 28, 2017, 3:22pm

I've written a response to this idea elsewhere, and won't rehash too much, but until the tidyverse is Turing-complete, this isn't an either/or: you still need some base R to make the tidyverse work, e.g. <-, c, sum, etc. If students get the impression that base R is bad, their capabilities will be stunted.

This is a good point. For local-use packages, this seems like the way to go.

You != your users. If a package is being written for a known group of users, whatever everyone is comfortable with is fine, but if a package is being written for wide distribution (i.e. via CRAN), some of your users will inevitably care about the size of their dependency graph. (To get an idea of the persona, go look at the "R in production" threads.) All the effort the package writer did not put into minimizing dependencies is then put in by these users, who spend time trying to figure out how to not use your package, or rewriting equivalent code themselves. In aggregate, that's a lot more time than it takes to write a lighter package.

To be clear, this is not to say tidyverse packages shouldn't be used as dependencies; some (glue, rlang, etc.) were explicitly designed for such a purpose. I'm just arguing not to throw in the kitchen sink unless you need the kitchen sink.

mara · November 28, 2017, 3:31pm

While looking for a free, online version of On Computable Numbers (because I love it and think everyone should read it, but can't just zap The Essential Turing over the internet), I just came across a paper, On the Turing Completeness of MS PowerPoint…I simply had to share this because I kind of can't believe it exists!

update: It turns out the video is kind of the greatest thing ever…