Base R and the tidyverse

tyluRp · November 20, 2017, 2:43am

"Those who do not understand base functions are doomed to replace them".

What do you guys think? Do R users need a solid understanding of base R before learning the tidyverse? What about users that want to develop R packages?

nutterb · November 20, 2017, 12:07pm

I think you present a false premise. First, replacing a base function isn't inherently bad. Second, base R functions and the tidyverse are not distinct skill sets. I'd argue there's a substantial intersection in that Venn diagram.

Regardless, the question focuses entirely on the wrong issue. As far as writing code goes, the R user needs to understand how to get the job done in a scientifically valid way. Prior to that, however, the R user needs to understand the subject matter of the data being analyzed.

The question really ought to be, "Do R users need a solid understanding of base R before learning the tidyverse in order to correctly conduct analyses for their subject matter?" The answer is a resounding no; but if (and when) they do understand base R, their toolset to tackle scientifically valid analyses will be enhanced.

The question of what toolset is more essential is a side show. There are more important things to concern ourselves with.

tyluRp · November 20, 2017, 12:53pm

Agreed. I'm wondering what's outside that intersection, if there is any.

This was something I was also thinking about. People who need to get the job done quickly versus people who want to develop R packages. Would a solid understanding of base R help someone who eventually wants to develop R packages?

This is a good point. I don't want to waste people's time by asking which toolkit is more essential. I should rephrase my question by asking if base R is important to understand for a user who wants to develop R packages. Also, I'm curious if this extra abstraction (i.e. the tidyverse making base R more readable, easier to use) comes at a cost.

mungojam · November 20, 2017, 1:00pm

Good question. It's been funny going back through my old code as I try and package it up. Back then I didn't know the tidyverse dialect (as I like to think of it), but now I tend to use it wherever possible as I find it safer and easier to understand.

I tend to push new users towards using the tidyverse as early as possible, but most are starting with the free Intro to R course on DataCamp, which covers base R.

I think people could be at a disadvantage if they didn't know base R and were then given some to work with, but the same could be said for any language that somebody hasn't learnt or any alternative style of programming.

nutterb · November 20, 2017, 1:10pm

I understood the quote wasn't yours, and I apologize if I came off as critical toward you. I should have taken steps to make that clear in my response.

This is a more interesting question, and one which I've personally been exploring for some time now. My personal feeling is that there is a moderate benefit to understanding base R when developing packages. My justification is that, in most cases, base R is faster than tidyverse equivalents. The counter-argument is, of course, that it's usually a matter of 3-5 microseconds per call. In most cases, that small a change isn't by itself enough to warrant avoiding tidyverse stuff.
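To put a rough number on the per-call difference, here is a minimal sketch using only `system.time`. The helper name `time_per_call` and the choice of a row-filtering task are assumptions made for illustration, not the benchmark behind the figures above; results will vary by machine and package version, so no expected timings are shown.

```r
# Hypothetical helper: average wall-clock time of f() in microseconds per call.
time_per_call <- function(f, reps = 1000L) {
  elapsed <- system.time(for (i in seq_len(reps)) f())[["elapsed"]]
  elapsed / reps * 1e6
}

# Base R row filter
base_us <- time_per_call(function() mtcars[mtcars$cyl == 4, ])

# dplyr is optional here; the comparison is skipped if it is not installed.
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_us <- time_per_call(function() dplyr::filter(mtcars, cyl == 4))
  print(c(base = base_us, dplyr = dplyr_us))
}
```

For serious comparisons a dedicated benchmarking package would give tighter estimates; this loop is only meant to show the order of magnitude.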

The bigger benefit I see to programming in base R is that it is extremely stable. In the past couple of years, I've coded things in packages using tidyverse tools only to later have those tools deprecated. I then had to go back and rework already functioning code to account for changes in the tidyverse API. It's a petty complaint, but almost all of my package development time comes out of my free time at home. I don't particularly want to spend it rewriting functionality that already works because something changed in dplyr. Coding in base R insulates me from some of that effect.

That being said, when I am developing new functionality, I almost always write it in the tidyverse dialect first and then translate it into base R once I have the process worked out and stable.

Over time, as tidyverse tools mature and stabilize (dplyr is on version 0.7.4; some would still consider that a development-phase version number), this will become less of an issue and I may become more comfortable using it again.

TANGENT: the shifting API thing is the key reason I'm very hesitant to pick up on tidy evaluation right now. I don't want to invest a lot of my code in it and then find the API changing again. I'll be a late adopter.

tyluRp · November 20, 2017, 1:26pm

No problem at all! I'm glad you brought up the points you made; they helped me narrow my question.

Thanks for sharing this. As a new R user, this is valuable information. I will keep it in mind when I eventually create a package of my own. The part you mention about writing code in tidyverse dialect and then translating it to base R is also helpful, as I have spent most of my time learning the tidyverse way rather than focusing on base R.

nutterb · November 21, 2017, 12:49pm

Now that I've run my mouth, I have to eat my words a little. As it turns out, with some experimentation, I'm having a really hard time coming up with a way in base R that I can make a summary table faster than with tidyverse tools. I'd be curious if anyone has a solution that can do the equivalent of the following faster using just base R.

But I think that emphasizes the point that knowing both gives you the option of using the set of tools that has the greatest benefits...curmudgeonly base fanatics like myself, notwithstanding.

quick_summary <- function(df, vars, group){
  require(magrittr)
  ncount <- function(x, na.rm = TRUE) sum(!is.na(x))
  df %>% 
    dplyr::select(dplyr::one_of(c(vars, group))) %>% 
    tidyr::gather(key = !!"variable",
                  value = !!"value",
                  dplyr::one_of(vars)) %>% 
    dplyr::group_by_at(c("variable", group)) %>% 
    dplyr::summarise_at(.vars = "value",
                        .funs = dplyr::funs(
                          n = ncount,
                          mean = mean,
                          sd = sd,
                          min = min,
                          median = median,
                          max = max
                        ), 
                        na.rm = TRUE)
}

quick_summary(mtcars, c("mpg", "hp", "wt"), c("am", "gear"))

milesmcbain · November 21, 2017, 11:06pm

I agree with the original quote. These days I always check tools:: and utils:: before authoring a new helper function. I have found myself re-implementing those more than I care to admit! They cover a wide range of R problems.
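A few examples of the kind of helper that is easy to re-implement by accident. The file name and option lists are made up for illustration; the functions themselves ship with R.

```r
# File-name handling from tools::
ext  <- tools::file_ext("report.final.csv")            # "csv"
stem <- tools::file_path_sans_ext("report.final.csv")  # "report.final"

# Merging user options over defaults with utils::modifyList
defaults <- list(sep = ",", header = TRUE)
opts <- utils::modifyList(defaults, list(header = FALSE))
str(opts)
```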

I also think @nutterb makes some solid points from a package dev perspective re: dplyr. I use it nearly daily for my analysis but I will avoid taking it as a dependency for one of my packages if at all possible. API stability is a valid concern, and I am also uneasy with the number of recursive dependencies I typically don't need that it adds to my package.

alistaire · November 22, 2017, 12:57am

I mean, it's possible, of course:

quick_summary <- function(df, vars, group){
    do.call(rbind, 
            lapply(vars, 
                   function(variable){
                       result <- aggregate(
                           as.formula(paste(variable, '~', paste(group, collapse = '+'))), 
                           df, 
                           function(value){
                               c(n = sum(!is.na(value)), 
                                 mean = mean(value, na.rm = TRUE), 
                                 sd = sd(value, na.rm = TRUE), 
                                 min = min(value, na.rm = TRUE), 
                                 median = median(value, na.rm = TRUE), 
                                 max = max(value, na.rm = TRUE))
                           })
                       cbind(variable = variable, 
                             result[-which(names(result) == variable)], 
                             result[[variable]])
                   }
            )
    )
}

quick_summary(mtcars, c("mpg", "hp", "wt"), c("am", "gear"))
#>    variable am gear  n      mean          sd    min  median     max
#> 1       mpg  0    3 15  16.10667   3.3716182 10.400  15.500  21.500
#> 2       mpg  0    4  4  21.05000   3.0697448 17.800  21.000  24.400
#> 3       mpg  1    4  8  26.27500   5.4144648 21.000  25.050  33.900
#> 4       mpg  1    5  5  21.38000   6.6589789 15.000  19.700  30.400
#> 5        hp  0    3 15 176.13333  47.6892720 97.000 180.000 245.000
#> 6        hp  0    4  4 100.75000  29.0100557 62.000 109.000 123.000
#> 7        hp  1    4  8  83.87500  24.1745882 52.000  79.500 110.000
#> 8        hp  1    5  5 195.60000 102.8338466 91.000 175.000 335.000
#> 9        wt  0    3 15   3.89260   0.8329929  2.465   3.730   5.424
#> 10       wt  0    4  4   3.30500   0.1567376  3.150   3.315   3.440
#> 11       wt  1    4  8   2.27250   0.4608145  1.615   2.260   2.875
#> 12       wt  1    5  5   2.63260   0.8189254  1.513   2.770   3.570

There are definitely some subtleties involved, though, and it's certainly way slower to write correctly than the tidyverse equivalent.

But I think Miles hits on the crux of the issue:

Making tidyverse a dependency of your package adds 47 packages via Depends/Imports, and more via Suggests. If the package extends the tidyverse framework (dplyr bindings for a database, say) or is for your own personal use, that's probably fine, as your users/you already have all those packages installed.

However, if you're writing a package you intend to be broadly used, keep in mind that you're making your package really heavy in terms of install time, space required, etc., and some users will avoid your package for that reason. On your own end, you'll see the difference if you add dplyr to your dependencies in your Travis/AppVeyor build times, which will shoot up by about half an hour.

Thus, my position is that if you intend a package for broad use and it's not inherently designed to only work with the tidyverse, it's worth it to put in the extra time to write it in base R. A different subset of base R functions like match.arg will become necessary anyway, so it's frequently not much more work to rewrite the rest. As much as I find the tidyverse indispensable for non-package code, I do my best to make the packages I work on grammar-agnostic when possible, even if it means getting comfortable with vapply.
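As a minimal sketch of what that looks like in practice, here are `match.arg` and `vapply` together. The function `summarise_column` and its `stat` argument are hypothetical names chosen for this example.

```r
# Hypothetical helper: validate a choice argument with match.arg, then dispatch.
summarise_column <- function(x, stat = c("mean", "median", "sd")) {
  stat <- match.arg(stat)  # errors early on anything outside the allowed set
  fun <- switch(stat,
                mean   = mean,
                median = stats::median,
                sd     = stats::sd)
  fun(x, na.rm = TRUE)
}

# vapply states the expected return type up front, unlike sapply.
col_means <- vapply(mtcars[c("mpg", "hp", "wt")],
                    summarise_column,
                    FUN.VALUE = numeric(1),
                    stat = "mean")
col_means
```

The `FUN.VALUE = numeric(1)` contract means a function that unexpectedly returns a different type or length fails loudly instead of silently changing the shape of the result.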

nutterb · November 22, 2017, 3:58am

Formula syntax in aggregate! I didn't think to try that. Brilliant!

jennybryan · November 23, 2017, 2:39am

There are also intermediate states. I don't think anyone would regard the tidyverse meta-package itself as a very practical dependency (except in very specific situations). I also regard dplyr more as an end-user package and would not depend on it lightly. But both tibble and purrr, for example, are being intentionally developed with these issues in mind, so the value proposition for Importing them is more clear. Now, I am often operating in an explicit tidyverse framework, so that also affects how I work.
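One further intermediate state, offered here only as a sketch and not as anything described above: treat a package as optional and fall back to base R when it is missing. The wrapper name `as_table` is hypothetical, and this assumes tibble would be listed under Suggests rather than Imports.

```r
# Hypothetical wrapper: return a tibble when tibble is installed,
# otherwise hand back the plain data frame unchanged.
as_table <- function(df) {
  if (requireNamespace("tibble", quietly = TRUE)) {
    tibble::as_tibble(df)
  } else {
    df
  }
}

out <- as_table(mtcars)
class(out)
```

Either branch returns something that inherits from `data.frame`, so downstream base R code keeps working.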

alistaire · November 23, 2017, 3:15am

Agreed, and that was presumably the point of making the tidyverse modular instead of monolithic: it's easy to import rlang, magrittr, glue, jsonlite, etc. as necessary without all of dplyr.

I do find purrr::map_df's dependency on dplyr a little weird, but I deeply love the function, and my own package has a similarly weird dependency on jsonlite that despite my efforts I can't manage to list in Imports without incurring an R CMD check warning, so I can't really talk.

jennybryan · November 23, 2017, 3:44am

Ha! I have also bumped my shins on that particular coffee table.

nutterb · November 23, 2017, 7:10pm

I find that I am often able to comfortably do everything I want without much from other packages. Until it comes to tidyr::gather. Base R just doesn't have anything that compares. Once I have already imported tidyr, there isn't much barrier to using tidyr::separate. That is pretty much the only one I can't do without. If gather ever goes away, I will likely have an identity crisis.

Moody_Mudskipper · November 23, 2017, 11:36pm

About this quick summary in base R: maybe the function fivenum would help.
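A minimal sketch of that idea, combining `fivenum` with the formula interface of `aggregate`. Note that `fivenum` returns Tukey's five numbers (minimum, lower hinge, median, upper hinge, maximum), so it does not cover the n, mean, and sd columns of the earlier summary.

```r
# Five-number summary for one column
fivenum(mtcars$mpg)

# The same summary per group; the result column `mpg` is a 5-column matrix
by_group <- aggregate(mpg ~ am + gear, data = mtcars, FUN = fivenum)
by_group
```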

desmonds22 · November 24, 2017, 3:16pm

I think it's the other way around.

Sane people don't start using R to learn how to program - they pick up R to do some work. Something like dplyr makes it far easier to get up and running. The better they learn R, the less they'll rely upon packages to do things.

DarioBoh · November 24, 2017, 5:04pm

Definitely a big tidyverse fan, but I somehow had the same thought while watching the General Data science overview seminar. In particular, they make a case for using dplyr for refactoring in cases like this:


mtcars$gear_char <- ifelse(mtcars$gear == 3,
                           "three",
                           ifelse(mtcars$gear == 4,
                                  'four',
                                  'five')
                           )

And they suggest an arguably clearer solution in dplyr:

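library(dplyr)

# mutate() returns the whole data frame, so assign it back to mtcars
# rather than to a single column
mtcars <- mtcars %>% 
  mutate(
    gear_char = case_when(gear == 3 ~ "three",
                          gear == 4 ~ "four",
                          gear == 5 ~ "five")
  )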

However, I would have found it more concise and more informative (in terms of the final data type) to use a different base R solution, like:

mtcars$gear_char <- ordered(mtcars$gear, labels = c('three', 'four', 'five'))

This is more about having some understanding of base R than a solid one, but it made me wonder whether we sometimes risk reinventing the wheel.
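A short check of the "final data type" point, as a sketch: the nested `ifelse` and the `ordered` call produce the same labels, but only the ordered factor knows that three < four < five. The variable names `gear_ifelse` and `gear_ordered` are introduced here for illustration.

```r
gear_ifelse <- ifelse(mtcars$gear == 3, "three",
                      ifelse(mtcars$gear == 4, "four", "five"))
gear_ordered <- ordered(mtcars$gear, labels = c("three", "four", "five"))

# Same labels either way
identical(gear_ifelse, as.character(gear_ordered))

# Ordered factors support comparisons against a level
sum(gear_ordered >= "four")  # cars with four or five gears
```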

alistaire · November 24, 2017, 5:49pm

For a simple case of reshaping, utils::stack is quite usable instead of gather. For a more complicated wide-to-long or long-to-wide transformation, stats::reshape exists, but despite aggregate hours of trying, I still can never figure out what it wants in what parameter.
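For the simple case, a minimal sketch of `stack` standing in for `gather` on the `mtcars` columns used earlier in the thread. The renaming and the manual repetition of the id columns are assumptions about what a typical caller wants; `stack` itself only returns the `values` and `ind` columns.

```r
vars <- c("mpg", "hp", "wt")

long <- stack(mtcars[vars])            # columns: values, ind
names(long) <- c("value", "variable")

# stack() keeps only the stacked columns, so repeat the id columns by hand.
# Columns are stacked in order, so each id vector is repeated once per variable.
long$am   <- rep(mtcars$am,   times = length(vars))
long$gear <- rep(mtcars$gear, times = length(vars))

head(long)
```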

separate is mostly just strsplit, but it requires some munging to use in the same way:

x <- data.frame(foo = paste(letters[1:5], 1:5, sep = '_'), 
                stringsAsFactors = FALSE)

x_split <- as.data.frame(do.call(rbind, strsplit(x$foo, '_')), 
                         stringsAsFactors = FALSE)
names(x_split) <- c('foo1', 'foo2')
x_split[] <- lapply(x_split, type.convert, as.is = TRUE)

x_split
#>   foo1 foo2
#> 1    a    1
#> 2    b    2
#> 3    c    3
#> 4    d    4
#> 5    e    5

unnest is a little magic, too; the equivalent base is usually not pretty.
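For the simplest case, a list column of atomic vectors, a base sketch might look like the following. The toy data frame and its column names are made up for illustration, and anything more nested than this quickly gets uglier.

```r
df <- data.frame(id = 1:3)
df$vals <- list(1:2, 3L, 4:6)   # a list column with ragged lengths

n <- lengths(df$vals)           # how many rows each id expands to
flat <- data.frame(id   = rep(df$id, times = n),
                   vals = unlist(df$vals))
flat
```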

nutterb · November 24, 2017, 9:42pm

I'm not familiar with utils::stack, so I will study up on that one. I am (painfully) familiar with reshape, and have shared your experience, which is why I am so content to import tidyr. I'm also familiar with strsplit, but once tidyr has been imported, I may as well make use of separate. I may not have been clear in my comment, but I wasn't requesting help finding solutions as much as I was commenting on the extreme utility of those two particular functions, and gather in particular.

alevy · November 26, 2017, 2:17am

For students without a background in quantitative science or programming, the tidyverse way is a much easier route to understanding and learning R.