R vs. Python for Data Science by Norm Matloff

Peter_Griffin · June 20, 2019, 9:38pm

I saw this article recently, and it has some really interesting insights.

Here I quote some sentences from this article:

R is rapidly devolving into two mutually unintelligible dialects, ordinary R and the Tidyverse. I, as a seasoned R programmer, cannot read Tidy code, as it calls numerous Tidyverse functions that I don't know. Conversely, as one person in the Twitter discussion of this document noted (approvingly), "One can code in the Tidyverse while knowing very little R."

I've been a skeptic on Tidyverse (see matloff/TidyverseSkeptic on GitHub: An opinionated view of the Tidyverse "dialect" of the R language). For instance, I question the claim that it makes R more accessible to nonprogrammers.

I would like to hear opinions from the RStudio community.

Fer · June 21, 2019, 12:47pm

It is pretty simple:

as a seasoned R programmer, cannot read Tidy code, as it calls numerous Tidyverse functions that I don't know

He doesn't speak tidyverse...

From my point of view, the claim that they are mutually unintelligible is false. Of course you can use functions from the tidyverse and from base R simultaneously. The difference is that inside the tidyverse the functions are consistently named, while base R is a jungle. I am not particularly a fan of snake_case, but if all functions in R were snake_case, I would probably be in (my issue with snake_case is that _ translates to <- in ESS, which is my daily choice). Maybe he doesn't like snake_case? I keep struggling with base R names: I barely use Sys.time, once every few months, and I have to guess, as it could be sys.time, Sys.time, sys.Time, sysTime... etcetera, and my memory is small. Inside the Tidyverse, if it existed (I don't know whether it does), I would know beforehand that it would be sys_time. Isn't that a big step forward?
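To make the naming point concrete, here is a small base R sketch. It only checks which of the guessed spellings exists in base R; `sys_time` is a hypothetical name used for illustration, not a known tidyverse function.

```r
# Guessed spellings for "current time"; sys_time is purely hypothetical.
candidates <- c("sys.time", "Sys.time", "sys.Time", "sysTime", "sys_time")

# Check each name against the base package only (no inherited lookups).
in_base <- function(nm) exists(nm, envir = baseenv(), inherits = FALSE)
found <- vapply(candidates, in_base, logical(1))
print(found)

# Base R mixes several conventions for related functions, for example:
# Sys.time(), system.time(), sys.call(), Sys.getenv(), file.exists()
```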

This is more like a smaller version of the legendary editor wars, based on individual choices (BTW we agree emacs won, don't we?)

I have little experience with the Tidyverse, but it is growing as I learn shiny, and it is very nice. I can see myself using it on a daily basis soon. The point is that there is a change of paradigm, but it is not unintelligible. The reason, as he explicitly says, is that he does not know those functions. And we humans tend to dislike the unknown.

I am a strong supporter of what the great Rocky Balboa told the crowd after his fight in Moscow, having defeated Ivan Drago: everybody can change. But some people may not want to.

He took a very opinionated way to compare Python and R, and to highlight the differences between base R and the Tidyverse (without explicit mention of why or how), while at the end of the day the disruption between Python 2.7 and 3 is by far more challenging. In R you can live with both the Tidyverse and base R, interactively in the same session, but I don't think you can do that with the two Pythons (so the actual loser is Python).

It is funny that most of the reactions on GitHub or Twitter have focused on this dichotomy, while the post was actually R vs Python... I reckon I will file an issue on GitHub soon about his misinterpretations....

alexv · June 21, 2019, 5:18pm

I started using ess-smart-equals in ESS which is far superior:

      (use-package ess-smart-equals
        :init   (setq ess-smart-equals-extra-ops '(brace paren percent))
        :after  (:any ess-r-mode inferior-ess-r-mode ess-r-transcript-mode)
        :config (ess-smart-equals-activate))

SteveM7 · June 23, 2019, 1:31pm

I sort of don't get that. Understanding functions that one has not seen before takes little more than doing a web search on "r" plus the function name. If it's a Tidyverse function, the search results will include the reference. And it's hard to imagine someone immersing themselves in the Tidyverse without having a pretty decent foundation in base R.

I'm sure Norm Matloff's coding skills are superior to mine. But I taught myself R by simply downloading the Johns Hopkins R Programming course on Coursera. That course referred me to the RStudio swirl package, which provides interactive tutorials on base R and then follow-on tutorials on the basics of the Tidyverse, which catalyze a further exploration of both base R and the Tidyverse.

In other words, gaining facility with both base R and the Tidyverse can be pretty uncomplicated.

woodward · June 23, 2019, 11:13pm

Don't forget data.table. That's mutually unintelligible too!

nwerth · June 24, 2019, 4:28pm

First, let's keep in mind the biased sample for this discussion on the RStudio forums.

I don't think his point is that he can't or doesn't want to learn "tidyverse" functions, it's that the overall syntax is so different from base R. I assume even Norm finds functions and classes he's never seen before every time he reads somebody else's R script. But those are easy to learn because he's familiar with the syntax and probably 95% of the script.

And, honestly, this is a tradeoff with "dataset in, dataset out" functions like those from dplyr and tidyr. If you're familiar with them, then they're amazingly concise ways to write complicated or otherwise verbose logic. But they don't always "abstract the logic away" (e.g., spread). Sometimes, they just require you to memorize the complicated logic or constantly refer to the docs. We should consider whether the code's easier to read and maintain in "verbose" base R or tidyverse code. One's not automatically better.

Also, Norm didn't explicitly mention this in his article, but I think one misinterpretation of "tidy" is that all functions should be "dataset in, dataset out." If a function works on values in a vector, it should just accept vectors. Honestly, I've seen a lot of code here and on StackOverflow that uses rlang but doesn't need to. The only good reason I have to care is that these people write packages I may want to use. If they're difficult to reason with or maintain, that can hurt the end users.
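As a rough illustration of the "just accept vectors" point, here is a minimal base R sketch. The helper `rescale01` and the `scores` data are made up for this example.

```r
# A vector-in, vector-out helper: it knows nothing about data frames.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical example data.
scores <- data.frame(id = 1:4, raw = c(10, 20, 30, 50))

# Works on a bare vector...
print(rescale01(c(2, 4, 6)))

# ...and drops into a data frame workflow with no special handling.
scores$scaled <- rescale01(scores$raw)
print(scores)
```

Because the helper only sees a vector, the same function could also be called inside a data frame verb on a column, with no non-standard evaluation needed in the helper itself.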

SteveM7 · June 25, 2019, 4:15pm

Thanks for the reply nwerth. I made the point about learning the Tidyverse via web content for new R users because Norm suggested it was cryptic. I learned "base" Tidy from swirl immediately after learning base R. Then picked up more Tidy later. So I just instinctively mashed together what R and Tidy I learned. I do understand that the Tidy function set is now very dense. I just use what Tidy I know in concert with base R and learn more on the fly out of intellectual curiosity.

I most appreciate R because it minimizes the use of explicit loops, and the Tidyverse contributes to that appreciation.
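A small sketch of what avoiding explicit loops can look like in base R; the vector `x` is arbitrary example data.

```r
x <- c(1.5, 2.0, 3.5, 4.0)

# Explicit loop: preallocate, then fill element by element.
squared_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squared_loop[i] <- x[i]^2
}

# Vectorized: the arithmetic applies to the whole vector at once.
squared_vec <- x^2

# Functional style: vapply() hides the loop and checks the result type.
squared_apply <- vapply(x, function(v) v^2, numeric(1))
```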

raybuhr · June 27, 2019, 3:27pm

I loved this post. I thought it stayed very objective and fair, up until the section on tidyverse. That section is completely opinionated. I happen to agree with him.

tldr -- base R is underrated, data.table is amazing, tidyverse makes programming fun but forces you to consider package management to lock versions in order to be reproducible.

python2 vs python3 stopped being a huge hassle a couple years ago.

The vendor lock-in that tidyverse brings with RStudio is real. RStudio conference has become one of the biggest and most important conferences for R in the world, yet almost all the talks hinge around the tidyverse.

Almost anyone learning R in the past two years has "grown up" on the tidyverse and fully adopted it. It's easier to read and guess what's happening than base R because it uses common English phrases versus abbreviations of statistics terms. Unfortunately, the tidyverse is rapidly growing and changing. Tidyverse maintainers state that those packages are not stable yet and users need to expect changes to how functions work.

Here's a real world example. My team had written some ETL code to pull data from an application database and load into an analytics database. The code relied heavily on dplyr and called the distinct function. Starting with dplyr 0.5 this function added a new default argument .keep_all=FALSE which changed distinct from deduplicating all rows based on some column(s) to only returning the unique combinations of the criteria. You might think, well that change makes sense to me. Ok, but the base R function unique already supplies that functionality. It took my team an hour to figure out why our code broke, and ten minutes to make the change and redeploy. That's not so bad, right? I guess, but I'd rather have not had to waste an hour because the dplyr team broke their API for completely arbitrary (IMO) reasons.
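To show the two behaviours side by side without depending on any particular dplyr version, here is a sketch in base R only. The `orders` data is invented, and the dplyr calls named in the comments are approximate equivalents, not the code my team ran.

```r
# Hypothetical data with repeated keys.
orders <- data.frame(
  customer = c("a", "a", "b", "b"),
  amount   = c(10, 15, 20, 20)
)

# One full row per customer (first occurrence wins).
# Roughly what distinct(orders, customer, .keep_all = TRUE) returns.
first_rows <- orders[!duplicated(orders$customer), ]

# Only the unique combinations of the selected column(s).
# Roughly what distinct(orders, customer) returns from dplyr 0.5 on.
unique_keys <- unique(orders["customer"])

print(first_rows)
print(unique_keys)
```

Code that relied on the first result and silently got the second would lose every column other than the key, which is the kind of breakage described above.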

The fix my team settled on was to use packrat to pin the specific versions of the libraries installed for each project. This worked pretty well in practice, but it meant we had a lot of R packages installed multiple times on the same remote server, which introduced more work to keep server storage from running out.
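This is not packrat's API, just a minimal base R sketch of the underlying idea: compare installed versions against a pinned list before running anything. The helper `check_pinned` and the pin list are hypothetical; the pins use packages that ship with R so the example runs anywhere.

```r
# Hypothetical pin list: package name -> exact version the project expects.
# "stats" and "utils" ship with R, so they are pinned to the running R version
# purely for illustration.
r_version <- as.character(getRversion())
pinned <- c(stats = r_version, utils = r_version)

# Compare each pinned version with what is actually installed.
check_pinned <- function(pins) {
  installed <- vapply(
    names(pins),
    function(pkg) as.character(utils::packageVersion(pkg)),
    character(1)
  )
  data.frame(
    package   = names(pins),
    expected  = unname(pins),
    installed = unname(installed),
    ok        = unname(installed == pins),
    row.names = NULL
  )
}

report <- check_pinned(pinned)
print(report)
```

A check like this only detects drift; a tool such as packrat goes further by keeping a private library per project, which is where the extra disk usage comes from.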

Until very recently with R 3.6, code written in base R kept working indefinitely because the functions and APIs are stable and do not change. The reason for the change in R 3.6 is that a bug was found in the sample function used to generate random samples from a set of data. The fix required a change to the underlying algorithm, which means setting a seed no longer guarantees you get the same random sample as in R versions older than 3.6. The R core team provides an argument to use the old algorithm if you want to maintain reproducibility. That's a pretty big issue, but it's very infrequent, unlike the regular changes to tidyverse APIs.
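For reference, a short sketch of how the old sampler can be requested, assuming R 3.6.0 or later (where the `sample.kind` argument exists). The seed value is arbitrary.

```r
# Requires R >= 3.6.0, where the sample.kind argument was introduced.

# Default sampler in R >= 3.6.0 ("Rejection").
set.seed(42, sample.kind = "Rejection")
new_draw <- sample(100, 5)

# Pre-3.6 sampler ("Rounding"). R warns that it is non-uniform,
# so the warning is silenced here.
suppressWarnings(set.seed(42, sample.kind = "Rounding"))
old_draw <- sample(100, 5)

# Same seed and same sampler reproduce the same draw.
suppressWarnings(set.seed(42, sample.kind = "Rounding"))
old_draw_again <- sample(100, 5)

# Restore the current default so later code is unaffected.
RNGkind(sample.kind = "Rejection")

print(new_draw)
print(old_draw)
```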

This focus on stability has also been true for data.table package, which also benefits from no external R package dependencies and ridiculously great speed for working with large data in memory.

I'm not a tidyverse hater. I've written a lot of it. I have a purrr sticker on my laptop. That and glue are two of my favorite packages in R. I just think tidyverse introduces some important trade-offs meant to make learning programming and R easier that seem to consistently sweep past beginners. Writing R is programming, and stability and reproducibility are very important for our community.

nwerth · June 28, 2019, 2:28pm

Relevant post from Edwin Thoen (@Edwin on these boards):

These are some good points. Nobody can argue that either sub-language is bad, considering they've both been used for awesome professional analyses. I honestly suggest people look into each style's strengths for themselves. I use both daily: dplyr for cleaning data (so much easier to read after a year), and data.table for repeatedly joining multiple datasets (I'm really into lookup tables now).

It really is. The standard installation of R comes with "batteries included" for almost all statistical analysis (to swipe a slogan from Python).

But maybe nobody's proselytizing base R because there are few competitors for what it focuses on: accurate, simple, and efficient statistics. The tidyverse and data.table just support it.
