Dplyr: Alternatives to rowwise

hadley · May 8, 2018, 4:45pm

I don't think I can put it into words very well, but in R, I think it is really important to understand the difference between a vectorised function and a "scalarised" function. rowwise() blurs the difference with magic, which in the long run gives you a poor mental model, and I think will lead you to more problems down the line.
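
To make the distinction concrete, here is a minimal base R sketch. The function names `double_it` and `sign_label` are made up for illustration. A vectorised function takes a whole vector and returns one result per element; a "scalarised" function only makes sense for a single value and has to be applied element by element.

```r
# Vectorised: operates on the whole vector in one call.
double_it <- function(x) x * 2
doubled <- double_it(c(1, 2, 3))
doubled

# "Scalarised": if () expects a single TRUE/FALSE, so this function is
# only meaningful for one value at a time.
sign_label <- function(x) {
  if (x < 0) "negative" else "non-negative"
}

# Apply the scalar function to each element explicitly.
labels <- vapply(c(-1, 0, 2), sign_label, character(1))
labels
```

The explicit `vapply()` call is the step that rowwise() hides.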

hadley · May 8, 2018, 4:49pm

In particular, this example feels very magical to me. I don't really understand why it works.

library(tidyverse)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(m, s) {
  list(rnorm(10, m, s))
}

df %>%
  rowwise() %>%
  mutate(t = fun(v1, v2)) %>%
  mutate(s = sum(t))
#> Source: local data frame [3 x 6]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 6
#>   groupA groupB    v1    v2 t               s
#>   <chr>  <chr>  <dbl> <dbl> <list>      <dbl>
#> 1 A      C          4     1 <dbl [10]>  42.3 
#> 2 A      D          2     3 <dbl [10]>  28.0 
#> 3 A      D          1     5 <dbl [10]>  -8.12

hadley · May 8, 2018, 4:52pm

The way rowwise() affects later mutate() calls seems surprising to me in a way that group_by() does not.

I think maybe because group_by() doesn't affect the function call - it's just applied multiple times, once to each group. But rowwise() changes sum(t) to sum(t[[1]]), and then applies that to each row.
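
One way to see that rewriting is to spell the per-row step out by hand. The sketch below assumes purrr is available (it is loaded by the tidyverse) and uses `map2()` and `map_dbl()` so that `t` is visibly a list-column and the sum is visibly taken over each element of it. It illustrates the idea only; it is not a description of how rowwise() is implemented.

```r
library(dplyr)
library(purrr)
library(tibble)
set.seed(42)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A", "C", 4, 1,
  "A", "D", 2, 3,
  "A", "D", 1, 5
)

res <- df %>%
  mutate(
    # one vector of 10 draws per row, stored in a list-column
    t = map2(v1, v2, ~ rnorm(10, .x, .y)),
    # sum each list element separately: the explicit form of
    # sum(t[[1]]) applied row by row
    s = map_dbl(t, sum)
  )
res
```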

jdlong · May 8, 2018, 5:26pm

This is really helpful for me to wrap my head around some of the nuances... so what are your thoughts about using group_by( row_number() ) to get row wise behavior but with complete group_by consistency?

I have to admit when I first wrote that in my response above I kinda felt dirty. But the more I think about it the more it feels like a logically consistent way of getting a row by row operation.

jennybryan · May 8, 2018, 6:11pm

Can you show what you mean in an example? I just tried something but basically end up in the uncomfortable places explored above.

jdlong · May 8, 2018, 6:12pm

Sorry Jenny but I lost the thread of the convo. An example of what?

jennybryan · May 8, 2018, 6:16pm

The workflow that starts with df %>% group_by(row_number()) .

jdlong · May 8, 2018, 6:23pm

Certainly... Here I resurrect our earlier example (sans typo I hope):


library(tidyverse)
set.seed(42)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(x, y) {
  val <-  sum(rnorm(10,x,y))
  return(val)
}

df %>%
  group_by( r = row_number() ) %>%
  mutate( t = fun(v1, v2))
#> # A tibble: 3 x 6
#> # Groups:   r [3]
#>   groupA groupB    v1    v2     r     t
#>   <chr>  <chr>  <dbl> <dbl> <int> <dbl>
#> 1 A      C         4.    1.     1 45.5 
#> 2 A      D         2.    3.     2 15.1 
#> 3 A      D         1.    5.     3  1.10

Granted, there's an extra column in there now... but the row wise operation works in a way that feels expected (at least to me).

jennybryan · May 8, 2018, 6:33pm

OK got it. I'll add something @hadley cooked up in another channel that is quite nice. He uses list() as the first argument of pmap_*(), instead of ., to select the relevant columns of the data frame. This also means you can re-associate variable names to argument names on-the-fly, as we need to do here.

library(tidyverse)
set.seed(42)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(x, y) {
  val <-  sum(rnorm(10,x,y))
  return(val)
}

df %>% 
  mutate(t = pmap_dbl(list(x = v1, y = v2), fun))
#> # A tibble: 3 x 5
#>   groupA groupB    v1    v2     t
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 A      C          4     1 45.5 
#> 2 A      D          2     3 15.1 
#> 3 A      D          1     5  1.10

jdlong · May 8, 2018, 6:40pm

Ohhhh! now this is quite intuitive to me. When I first tried pmap this is what I expected the behavior to be. I was immediately flummoxed that I had lost names. I was a little confused by this line in the map documentation:

.l - A list of lists. The length of .l determines the number of arguments that .f will be called with. List names will be used if present.

Because the docs say "list names will be used if present", I had expected I could do exactly what you illustrate above without wrapping the input in a list().
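
A small sketch of what "list names will be used if present" means in practice, assuming purrr is installed. The helper `subtract` is hypothetical and only there for illustration. When the elements of the list are named, each one is passed to the argument of the same name; when they are unnamed, they are passed by position.

```r
library(purrr)

# Hypothetical helper, used only for illustration.
subtract <- function(x, y) x - y

# Named elements are matched to argument names, so their order in the
# list does not matter: each call is subtract(y = ..., x = ...).
by_name <- pmap_dbl(list(y = c(1, 2), x = c(10, 20)), subtract)

# Unnamed elements are matched by position: the first element becomes x.
by_position <- pmap_dbl(list(c(1, 2), c(10, 20)), subtract)

by_name
by_position
```

This is why `list(x = v1, y = v2)` in the example above can rename columns to argument names on the fly.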

This conversation is VERY helpful. Thank you.

EconomiCurtis · May 9, 2018, 8:49am

8 posts were split to a new topic: Re rowwise(), when it is useful to access the parent environment for some row-wise operation?

jdlong · May 9, 2018, 10:21am



OK, I talked myself off the ledge of using `group_by( row_number() )` by testing it. Turns out @hadley 's list naming trick is ~ 10x faster. Here's my test:

```r
library(rbenchmark)
library(tidyverse)
set.seed(42)
n <- 1e4
df <- data.frame(my_int  = sample(1:5, n, replace=TRUE), 
           my_min = sample(1:5, n, replace=TRUE), 
           range  = sample(1:5, n, replace=TRUE))

benchmark(
df %>% 
  group_by(r=row_number()) %>%
  mutate(calc = list(runif(my_int, my_min, my_min + range) )) %>%
  ungroup() %>%
  select(-r) -> 
out
)
#>                                                                                                                                   test
#> 1 out <- df %>% group_by(r = row_number()) %>% mutate(calc = list(runif(my_int, my_min, my_min + range))) %>% ungroup() %>% select(-r)
#>   replications elapsed relative user.self sys.self user.child sys.child
#> 1          100   51.51        1     51.42     0.06         NA        NA

benchmark(
df %>%
  mutate(data = pmap(list(n = my_int, min = my_min, max = my_min + range), runif)) -> out
)
#>                                                                                             test
#> 1 out <- df %>% mutate(data = pmap(list(n = my_int, min = my_min, max = my_min + range), runif))
#>   replications elapsed relative user.self sys.self user.child sys.child
#> 1          100     5.5        1       5.5        0         NA        NA
```

martin.R · May 9, 2018, 11:10am

Unfortunately, group_by() is very slow when there are a large number of groups, as there are with row_number() here.

hadley · May 9, 2018, 2:55pm

I wouldn't make decisions primarily based on performance costs since those can change over time. That said, rowwise() is fundamentally slow because it can never make use of vectorised functions: it must always be automatically vectorised by (effectively) wrapping the code in a call to map().
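
As a rough illustration of that point (a sketch, assuming dplyr and purrr are installed; the column names `total_vec` and `total_map` are made up), the same result can be computed with one vectorised call over whole columns or with one function call per row. A row-by-row approach is effectively stuck with the second form.

```r
library(dplyr)
library(purrr)

df <- tibble(v1 = c(4, 2, 1), v2 = c(1, 3, 5))

out <- df %>%
  mutate(
    # vectorised: a single call to `+` over the whole columns
    total_vec = v1 + v2,
    # row by row: one function call per row
    total_map = map2_dbl(v1, v2, ~ .x + .y)
  )
out
```

Both columns hold the same values; the difference is only in how many function calls it takes to get there, which is what grows with the number of rows.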

KenWilliams · May 10, 2018, 4:09am

Another variant that always seemed pretty natural to me is plyr::adply:

plyr::adply(df, 1, function(row) data.frame(t=fun(row$v1, row$v2)))

Might be a good one to add to the timing study or list of standard approaches.

jennybryan · May 10, 2018, 6:11am

I will light a candle with you for plyr, a package that I love(d). But it is basically deprecated now and will see no further development. It had a huge, positive influence on how I think about these sorts of tasks, but I would advise against writing new code that uses plyr.

taras · May 30, 2018, 2:10pm

Off-topic, but really, JD?

nutterb · May 30, 2018, 2:32pm

@taras has a point, the correct spelling is y'all.

taras · May 31, 2018, 1:01am

Well, the irony is that JD is the most qualified in the "y'all" spelling around here...

jdlong · May 31, 2018, 1:38pm

I can typo in multiple languages: English, Southern English, Python, R... I have no constraints.