Should I move away from do() and rowwise()?

dplyr
purrr

#1

While it is not officially declared, my instinct tells me these two functions are going to be deprecated. For example, do() was removed from the README, and Hadley implies they won’t be developed any further, bug fixes included. Am I right?

If so, do I have to start getting ready for a life without do() and rowwise() right away? Or is it too early? I’m almost confident that list-columns and purrr can take their place, but I’d like to know whether there are any cases that only do() can handle at the moment.

Thanks!


#2

do() is definitely going away in the long term, but I’m not yet sure we have comprehensive alternative solutions to all the problems that do() solves.

(Also, “going away” means that we won’t make improvements to it and we won’t mention it in documentation and tutorials, but the code will continue to exist for a number of years.)


#3

Thanks for clarification! I misunderstood the tone of “going away” :slight_smile:


#4

I have yet to figure out a word to precisely convey the sense that I no longer believe that something is the right approach and I’m actively looking for alternatives.

We are gradually moving the tidyverse to a more conservative development philosophy so it’s easier to rely on things working in the long term.


#5

Great. I really appreciate your thoughtfulness!

By the way, do you have any examples that only do() can solve at the moment?


#6

No, sorry. I haven’t used do() for several years now.


#7

Oh, I see. Thanks anyway.


#8

I think I could stop using ‘do()’ as of today, but what about ‘rowwise()’? I used it as recently as today. What would be the proper way of applying a function to every row of a tibble?


#9

@hadley I like that you are striving to create the best approach to doing analysis. Maintaining backwards compatibility is always a pain. It would help if R had a richer package versioning approach where you could make backwards incompatible changes in major updates while making it easy for people to stick to the old major version.

Packrat and checkpoint help a bit with that but aren’t part of the core.


#10

I think purrr::pmap or purrr::pwalk may give you what you are after. Or pmap_df if you want to return tibbles from your function.
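
A minimal sketch of the pmap() approach for applying a function to each row. The tibble `df` and the helper `describe_row()` are made up for illustration and are not from the thread:

```r
library(dplyr)
library(purrr)

# Hypothetical example data.
df <- tibble(name = c("a", "b", "c"), n = c(1, 2, 3), scale = c(10, 100, 1000))

# Hypothetical row-level function: it is called once per row.
describe_row <- function(name, n, scale) paste0(name, ": ", n * scale)

# pmap_chr() walks the columns in parallel and returns a character vector,
# so the result can be stored as a new column inside mutate().
result <- df %>%
  mutate(label = pmap_chr(list(name, n, scale), describe_row))

result$label
#> [1] "a: 10"   "b: 200"  "c: 3000"
```

pwalk() follows the same pattern when the function is called only for its side effects.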


#11

I’ll have to try using ‘map()’ inside ‘mutate()’ then.
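
For reference, a small sketch of what map() inside mutate() can look like when one column is a list-column. The tibble `df` and its columns are invented for the example:

```r
library(dplyr)
library(purrr)

# Hypothetical data: `values` is a list-column holding one vector per row.
df <- tibble(id = 1:3, values = list(c(1, 2, 3), c(10, 20), 5))

# The typed variants (map_dbl, map_int) return atomic vectors,
# so each call yields an ordinary column.
result <- df %>%
  mutate(total = map_dbl(values, sum),
         n     = map_int(values, length))

result$total
#> [1]  6 30  5
```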


#12

I’ll admit to using do to apply non-dplyr functions to database tables, e.g.

library(sergeant)
db <- src_drill()

db %>%
    tbl("cp.`employee.json`") %>%
    select(first_name, last_name, salary) %>%
    do(tidyr::unite(., name, first_name, last_name, sep = " "))
#> # A tibble: 1,155 x 2
#>                 name salary
#>  *             <chr>  <dbl>
#>  1      Sheri Nowmer  80000
#>  2   Derrick Whelply  40000
#>  3    Michael Spence  40000
#>  4    Maya Gutierrez  35000
#>  5   Roberta Damstra  25000
#>  6  Rebecca Kanagaki  15000
#>  7       Kim Brunner  10000
#>  8   Brenda Blumberg  17000
#>  9      Darren Stanz  50000
#> 10 Jonathan Murraiin  15000
#> # ... with 1,145 more rows

though in this (every?) case it’s really equivalent to calling collect beforehand, and is thus not really that useful. To avoid bringing the data into memory yet, you’ve got to use SQL functions:

db %>%
    tbl("cp.`employee.json`") %>%
    transmute(name = concat(first_name, " ", last_name), salary)
#> # Source:   lazy query [?? x 2]
#> # Database: DrillConnection
#>                 name salary
#>                <chr>  <dbl>
#>  1      Sheri Nowmer  80000
#>  2   Derrick Whelply  40000
#>  3    Michael Spence  40000
#>  4    Maya Gutierrez  35000
#>  5   Roberta Damstra  25000
#>  6  Rebecca Kanagaki  15000
#>  7       Kim Brunner  10000
#>  8   Brenda Blumberg  17000
#>  9      Darren Stanz  50000
#> 10 Jonathan Murraiin  15000
#> # ... with more rows

For do's list column behavior, it’s always possible to explicitly call list, e.g.

mtcars %>%
    group_by(cyl) %>%
    do(mod = lm(mpg ~ disp, .))
#> Source: local data frame [3 x 2]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 2
#>     cyl      mod
#> * <dbl>   <list>
#> 1     4 <S3: lm>
#> 2     6 <S3: lm>
#> 3     8 <S3: lm>

mtcars %>%
    group_by(cyl) %>%
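    # inside summarise(), `.` would be the whole piped data frame, not the
    # current group, so let the formula pick up the group's columns instead
    summarise(mod = list(lm(mpg ~ disp)))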
#> # A tibble: 3 x 2
#>     cyl      mod
#>   <dbl>   <list>
#> 1     4 <S3: lm>
#> 2     6 <S3: lm>
#> 3     8 <S3: lm>

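A quick way to confirm that each group really gets its own fit when using summarise() with list(). This is only a sketch: it relies on the formula variables being looked up in the current group, whereas a `.` inside summarise() would refer to the whole piped data frame. The names `models`, `slopes`, and `ref` are introduced here for illustration:

```r
library(dplyr)

models <- mtcars %>%
  group_by(cyl) %>%
  summarise(mod = list(lm(mpg ~ disp)))

# Extract the disp slope from each stored model.
slopes <- vapply(models$mod, function(m) coef(m)[["disp"]], numeric(1))

# Reference fit on the 4-cylinder subset only, done by hand.
ref <- coef(lm(mpg ~ disp, data = mtcars[mtcars$cyl == 4, ]))[["disp"]]

# Three distinct slopes, and the first one matches the manual subset fit.
length(unique(slopes)) == 3
isTRUE(all.equal(slopes[1], ref))
```
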
do's data frame behavior has been superseded by the idiom of nesting the non-grouping columns, iterating over them, and unnesting.

library(tidyverse)

mtcars %>%
    group_by(cyl) %>%
    do(model = lm(mpg ~ disp, .)) %>%
    do(broom::tidy(.$model))
#> Source: local data frame [6 x 5]
#> Groups: <by row>
#> 
#> # A tibble: 6 x 5
#>          term     estimate   std.error  statistic      p.value
#> *       <chr>        <dbl>       <dbl>      <dbl>        <dbl>
#> 1 (Intercept) 40.871955322 3.589605400 11.3861973 1.202715e-06
#> 2        disp -0.135141815 0.033171608 -4.0740206 2.782827e-03
#> 3 (Intercept) 19.081987419 2.913992892  6.5483988 1.243968e-03
#> 4        disp  0.003605119 0.015557115  0.2317344 8.259297e-01
#> 5 (Intercept) 22.032798914 3.345241115  6.5863112 2.588765e-05
#> 6        disp -0.019634095 0.009315926 -2.1075838 5.677488e-02

mtcars %>%
    as_data_frame() %>%
    nest(-cyl) %>%
    mutate(model = map(data, ~lm(mpg ~ disp, .x)),
           summary = map(model, broom::tidy)) %>%
    unnest(summary)
#> # A tibble: 6 x 6
#>     cyl        term     estimate   std.error  statistic      p.value
#>   <dbl>       <chr>        <dbl>       <dbl>      <dbl>        <dbl>
#> 1     6 (Intercept) 19.081987419 2.913992892  6.5483988 1.243968e-03
#> 2     6        disp  0.003605119 0.015557115  0.2317344 8.259297e-01
#> 3     4 (Intercept) 40.871955322 3.589605400 11.3861973 1.202715e-06
#> 4     4        disp -0.135141815 0.033171608 -4.0740206 2.782827e-03
#> 5     8 (Intercept) 22.032798914 3.345241115  6.5863112 2.588765e-05
#> 6     8        disp -0.019634095 0.009315926 -2.1075838 5.677488e-02

#13

Thanks, I totally agree with you on this point. I think do() could be useful if it supported database backends and iterated the computation group by group before pulling the data into R’s memory, but actually it doesn’t…

library(dplyr)

# dbConnect() takes `dbname`, not `path`, for the SQLite database location
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")

mtcars2 <- copy_to(con, mtcars)

mtcars2 %>%
  group_by(cyl) %>%
  do(mod = lm(mpg ~ disp, .))
#> Error: No more ticks

For your last example, I would choose split() + map() when there is only one grouping variable. (Either way, we don’t need do() here.)

library(purrr)

mtcars %>%
  split(.$cyl) %>%
  map(~lm(mpg ~ disp, data = .)) %>%
  map_dfr(broom::tidy, .id = "cyl")
#>   cyl        term     estimate   std.error  statistic      p.value
#> 1   4 (Intercept) 40.871955322 3.589605400 11.3861973 1.202715e-06
#> 2   4        disp -0.135141815 0.033171608 -4.0740206 2.782827e-03
#> 3   6 (Intercept) 19.081987419 2.913992892  6.5483988 1.243968e-03
#> 4   6        disp  0.003605119 0.015557115  0.2317344 8.259297e-01
#> 5   8 (Intercept) 22.032798914 3.345241115  6.5863112 2.588765e-05
#> 6   8        disp -0.019634095 0.009315926 -2.1075838 5.677488e-02

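With more than one grouping variable, split() can still be used by passing a list of factors, though the group labels end up fused into a single name. A sketch using base split() and purrr, with `groups` and `slopes` as illustrative names:

```r
library(purrr)

# Split on the combination of vs and am; names look like "0.0", "1.0", ...
groups <- split(mtcars, list(mtcars$vs, mtcars$am))

# Fit one model per piece and keep only the wt slope.
slopes <- map_dbl(groups, ~ coef(lm(mpg ~ wt, data = .x))[["wt"]])

names(slopes)
#> [1] "0.0" "1.0" "0.1" "1.1"
```

Recovering vs and am as separate columns then needs an extra step to split the names apart, which is part of why nest() tends to be more convenient here.
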
#14

I don’t think this is too inconvenient, but it would be great if we had a data.frame-specific version of pmap().

For example, a data.frame usually has many columns, and not all of them are involved in every computation. In that case, you need a function with ... among its arguments to ignore the irrelevant columns. I often forget this :stuck_out_tongue:

This example from ?pmap illustrates this well:

library(purrr)

## Use `...` to absorb unused components of input list .l
df <- data.frame(
  x = 1:3 + 0.1,
  y = 3:1 - 0.1,
  z = letters[1:3]
)
plus <- function(x, y) x + y
pmap(df, plus)
#> Error in .f(x = .l[[c(1L, i)]], y = .l[[c(2L, i)]], z = .l[[c(3L, i)]], : unused argument (z = .l[[c(3, i)]])

plus2 <- function(x, y, ...) x + y
pmap_dbl(df, plus2)
#> [1] 4 4 4

#15

It is supposed to retrieve the data group by group, but that code is complex and few people have used it, so it’s likely to have some bugs (although it is tested).


#16

Sorry, I misunderstood this… So, this is definitely an advantage of do() (although it may be buggy).


#17

In addition to making it easy to use multiple grouping variables, nest makes it simple to hold on to and extract the pieces of an analysis in an organized fashion, e.g.

library(tidyverse)

models <- mtcars %>% 
    as_data_frame() %>% 
    nest(-vs, -am) %>% 
    mutate(model = map(data, ~lm(mpg ~ wt + hp, .x)), 
           tidy = map(model, broom::tidy),
           glance = map(model, broom::glance),
           augment = map(model, broom::augment))

# mm, pretty
models
#> # A tibble: 4 x 7
#>      vs    am              data    model                 tidy
#>   <dbl> <dbl>            <list>   <list>               <list>
#> 1     0     1  <tibble [6 x 9]> <S3: lm> <data.frame [3 x 5]>
#> 2     1     1  <tibble [7 x 9]> <S3: lm> <data.frame [3 x 5]>
#> 3     1     0  <tibble [7 x 9]> <S3: lm> <data.frame [3 x 5]>
#> 4     0     0 <tibble [12 x 9]> <S3: lm> <data.frame [3 x 5]>
#> # ... with 2 more variables: glance <list>, augment <list>

models %>% unnest(glance, .drop = TRUE)
#> # A tibble: 4 x 13
#>      vs    am r.squared adj.r.squared    sigma statistic    p.value    df
#>   <dbl> <dbl>     <dbl>         <dbl>    <dbl>     <dbl>      <dbl> <int>
#> 1     0     1 0.9332273     0.8887122 1.337351 20.964269 0.01725434     3
#> 2     1     1 0.6299660     0.4449490 3.544570  3.404908 0.13692518     3
#> 3     1     0 0.7453295     0.6179943 1.527285  5.853286 0.06485705     3
#> 4     0     0 0.5074320     0.3979724 2.152666  4.635794 0.04131408     3
#> # ... with 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>,
#> #   deviance <dbl>, df.residual <int>

models %>% unnest(augment)
#> # A tibble: 32 x 12
#>       vs    am   mpg    wt    hp  .fitted   .se.fit     .resid      .hat
#>    <dbl> <dbl> <dbl> <dbl> <dbl>    <dbl>     <dbl>      <dbl>     <dbl>
#>  1     0     1  21.0 2.620   110 21.88541 0.7267886 -0.8854121 0.2953423
#>  2     0     1  21.0 2.875   110 20.30171 1.1632668  0.6982858 0.7566030
#>  3     0     1  26.0 2.140    91 25.04363 1.1843529  0.9563671 0.7842809
#>  4     0     1  15.8 3.170   264 17.03381 0.7592228 -1.2338072 0.3222907
#>  5     0     1  19.7 2.770   175 20.34781 0.5739990 -0.6478116 0.1842179
#>  6     0     1  15.0 3.570   335 13.88762 1.0842156  1.1123780 0.6572653
#>  7     1     1  22.8 2.320    93 25.61599 1.7058264 -2.8159879 0.2316021
#>  8     1     1  32.4 2.200    66 28.29872 1.8459116  4.1012791 0.2712031
#>  9     1     1  30.4 1.615    52 33.05121 2.3234181 -2.6512065 0.4296627
#> 10     1     1  33.9 1.835    65 30.71644 1.6666564  3.1835617 0.2210879
#> # ... with 22 more rows, and 3 more variables: .sigma <dbl>,
#> #   .cooksd <dbl>, .std.resid <dbl>

You could do the same thing with a non-data.frame list, but creating and manipulating sub-elements gets tricky, and the print method takes way too much space (and can’t be salvaged by str if there’s a model inside).


#18

Thanks! Agreed, nest() can do well and the nicer print method is a thing.

(Yet, sometimes I feel more comfortable with a list of data.frames than with a data.frame with a nested column, since list is more flexible than data.frame. I think this is rather a matter of preference.)


#19

I didn’t think of using ... with pmap, thanks! Wondering if pmap does parameter name matching now.
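
On name matching, a small sketch that compares a named and an unnamed input list. The function `minus()` and the list `args` are made up for the example:

```r
library(purrr)

minus <- function(x, y) x - y

# Named list, with the elements deliberately given in the order y, x.
args <- list(y = c(1, 2, 3), x = c(10, 20, 30))

# With names, arguments are matched by name: x - y.
by_name <- pmap_dbl(args, minus)
by_name
#> [1]  9 18 27

# Without names, arguments are matched by position: first element becomes x.
by_position <- pmap_dbl(unname(args), minus)
by_position
#> [1]  -9 -18 -27
```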


#20

I’m a big fan of do() and particularly like the progress bar. In light of this thread I decided to explore alternatives to do() using purrr::map(). However, it seems to me that the purrr::map() approach is slower than do() as illustrated by this example:

library(dplyr)
library(tidyr)
library(purrr)
library(microbenchmark)

xx <- tibble(
  x = rep(seq(1,260), 1e4),
  y = rep(letters, 1e5),
  z = rnorm(26e5)
)

mean_and_sd <- function(x) list(mean(x), sd(x))

microbenchmark(
  usedo = xx %>%
    group_by(x, y) %>%
    do(zz = mean_and_sd(.$z)),
  usemap = xx %>%
    group_by(x, y) %>%
    nest() %>%
    transmute(
      x = x,
      y = y,
      zz = map(data, ~ mean_and_sd(.$z))
    )
)
                                                          
#> Unit: milliseconds
#>    expr      min       lq     mean   median       uq      max neval
#>   usedo 305.5013 310.2084 320.7708 311.9238 314.7841 391.9066   100
#>  usemap 565.7612 572.6649 607.6548 623.6925 630.6112 815.0381   100

In my actual work I obviously call functions that are more complex than mean_and_sd() and that return lists, so the difference between the approaches becomes much clearer.

So, the question is, is there a way of using purrr::map() (or something else) in this sort of context that is as efficient as do()? I’m aware that data.table might be faster, but I’m more interested in a tidyverse method.
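
One candidate that might be worth adding to the benchmark is summarise() with an explicit list(), which builds the list column without going through nest(). The sketch below uses a much smaller made-up tibble (`small`) and makes no timing claim; it would need to be measured against do() on the real data:

```r
library(dplyr)

set.seed(1)

# Hypothetical small stand-in for xx: 4 distinct (x, y) groups.
small <- tibble(
  x = rep(1:4, 25),
  y = rep(c("a", "b"), 50),
  z = rnorm(100)
)

mean_and_sd <- function(x) list(mean(x), sd(x))

# list() wraps each group's result so it fits in a single cell.
res <- small %>%
  group_by(x, y) %>%
  summarise(zz = list(mean_and_sd(z)))

nrow(res)
#> [1] 4
```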