Dplyr: Alternatives to rowwise


#24

This is really helpful for me to wrap my head around some of the nuances... so what are your thoughts about using group_by( row_number() ) to get row-wise behavior but with complete group_by consistency?

I have to admit when I first wrote that in my response above I kinda felt dirty. But the more I think about it the more it feels like a logically consistent way of getting a row by row operation.


#25

Can you show what you mean in an example? I just tried something but basically end up in the uncomfortable places explored above.


#26

Sorry Jenny but I lost the thread of the convo. An example of what?


#27

The workflow that starts with df %>% group_by(row_number()).


#28

Certainly... Here I resurrect our earlier example (sans typo I hope):


library(tidyverse)
set.seed(42)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(x, y) {
  val <-  sum(rnorm(10,x,y))
  return(val)
}

df %>%
  group_by( r = row_number() ) %>%
  mutate( t = fun(v1, v2))
#> # A tibble: 3 x 6
#> # Groups:   r [3]
#>   groupA groupB    v1    v2     r     t
#>   <chr>  <chr>  <dbl> <dbl> <int> <dbl>
#> 1 A      C         4.    1.     1 45.5 
#> 2 A      D         2.    3.     2 15.1 
#> 3 A      D         1.    5.     3  1.10

granted there's an extra column in there now... but the row wise operation works in a way that feels expected (at least to me)
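
A possible follow-up, sketched here under my own assumptions: the helper column r can be dropped again once the row-wise step is done, by ungrouping first and then deselecting it. The function span() below is a hypothetical, deterministic stand-in for fun(), chosen so the result does not depend on a seed; like fun(), it only makes sense when it sees one row at a time.

```r
library(dplyr)
library(tibble)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A", "C", 4, 1,
  "A", "D", 2, 3,
  "A", "D", 1, 5
)

# Hypothetical stand-in for fun(): seq() needs scalar inputs,
# so this would fail if it were handed whole columns.
span <- function(x, y) length(seq(x, x + y))

out <- df %>%
  group_by(r = row_number()) %>%  # one group per row
  mutate(t = span(v1, v2)) %>%    # span() is called once per group, i.e. per row
  ungroup() %>%                   # remove the grouping first ...
  select(-r)                      # ... so the helper column can be dropped
```

The ungroup() comes before select(-r) because dplyr keeps grouping columns in the result for as long as the data frame is still grouped by them.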


#29

OK got it. I'll add something @hadley cooked up in another channel that is quite nice. He uses list() as the first argument of pmap_*(), instead of ., to select the relevant columns of the data frame. This also means you can re-associate variable names to argument names on the fly, as we need to do here.

library(tidyverse)
set.seed(42)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(x, y) {
  val <-  sum(rnorm(10,x,y))
  return(val)
}

df %>% 
  mutate(t = pmap_dbl(list(x = v1, y = v2), fun))
#> # A tibble: 3 x 5
#>   groupA groupB    v1    v2     t
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 A      C          4     1 45.5 
#> 2 A      D          2     3 15.1 
#> 3 A      D          1     5  1.10

#30

Ohhhh! now this is quite intuitive to me. When I first tried pmap this is what I expected the behavior to be. I was immediately flummoxed that I had lost names. I was a little confused by this line in the map documentation:

.l - A list of lists. The length of .l determines the number of arguments that .f will be called with. List names will be used if present.

Because the docs say "list names will be used if present", I had expected I could do exactly what you illustrate above without wrapping the input in a list().

This conversation is VERY helpful. Thank you.
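
To make the "list names will be used if present" behaviour concrete, here is a small sketch (my own, not from the thread). A data frame is itself a named list of columns, so it can be handed to pmap directly, but its column names then have to match the argument names of the function. The helper add_scaled() and the toy data are made up for illustration.

```r
library(purrr)
library(tibble)

df <- tibble(v1 = c(4, 2, 1), v2 = c(1, 3, 5))

# Hypothetical helper; its argument names deliberately differ from the column names
add_scaled <- function(x, y) x + 10 * y

# Named list: names are matched to the arguments of .f, so order does not matter
by_name <- pmap_dbl(list(y = df$v2, x = df$v1), add_scaled)

# Unnamed list: elements are matched by position instead
by_position <- pmap_dbl(list(df$v1, df$v2), add_scaled)

# Passing df itself fails here: its names (v1, v2) are not arguments of add_scaled()
direct <- tryCatch(pmap_dbl(df, add_scaled), error = function(e) "error")
```

Wrapping the columns in list(x = ..., y = ...) is what renames them to fit the function, which is why the bare data frame does not work in this case.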


split this topic #31

8 posts were split to a new topic: Re rowwise(), when it is useful to access the parent environment for some row-wise operation?


Re rowwise(); best practice when it is useful to access the parent environment for some row-wise operation?
#32

~~~~ this response belongs on this thread but earlier got moved when the topic split. I'm bringing it back so that future time travelers will learn that group_by( row_number() ) is not just ugly, it's slow. ~~~~~~~~~~~~~~~~

OK, I talked myself off the ledge of using group_by( row_number() ) by testing it. Turns out @hadley's list naming trick is ~10x faster. Here's my test:

library(tidyverse)
library(rbenchmark)  # provides benchmark()
set.seed(42)
n <- 1e4
df <- data.frame(my_int  = sample(1:5, n, replace=TRUE), 
           my_min = sample(1:5, n, replace=TRUE), 
           range  = sample(1:5, n, replace=TRUE))

benchmark(
df %>% 
  group_by(r=row_number()) %>%
  mutate(calc = list(runif(my_int, my_min, my_min + range) )) %>%
  ungroup() %>%
  select(-r) -> 
out
)
#>                                                                                                                                   test
#> 1 out <- df %>% group_by(r = row_number()) %>% mutate(calc = list(runif(my_int, my_min, my_min + range))) %>% ungroup() %>% select(-r)
#>   replications elapsed relative user.self sys.self user.child sys.child
#> 1          100   51.51        1     51.42     0.06         NA        NA

benchmark(
df %>%
  mutate(data = pmap(list(n = my_int, min = my_min, max = my_min + range), runif)) -> out
)
#>                                                                                             test
#> 1 out <- df %>% mutate(data = pmap(list(n = my_int, min = my_min, max = my_min + range), runif))
#>   replications elapsed relative user.self sys.self user.child sys.child
#> 1          100     5.5        1       5.5        0         NA        NA

#33

Unfortunately group_by is very slow when there are a large number of groups as there are with row_number() here.


#34

I wouldn't make decisions primarily based on performance costs since those can change over time. That said, rowwise() is fundamentally slow because it can never make use of vectorised functions: the code must always be automatically vectorised by (effectively) wrapping it in a call to map.
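
As a rough illustration of the vectorisation point (my own sketch, with made-up toy data): when the operation is already vectorised, a plain mutate() computes the whole column in a single call, whereas a map-style wrapper calls the function once per row and arrives at the same values.

```r
library(dplyr)
library(purrr)
library(tibble)

df <- tibble(v1 = c(4, 2, 1), v2 = c(1, 3, 5))

out <- df %>%
  mutate(
    vectorised = v1 + v2,                                # one call on whole columns
    mapped     = map2_dbl(v1, v2, function(x, y) x + y)  # one call per row
  )
```

Both columns hold the same numbers; the difference is only in how many function calls it took to get them.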


#35

Another variant that always seemed pretty natural to me is plyr::adply:

plyr::adply(df, 1, function(row) data.frame(t=fun(row$v1, row$v2)))

Might be a good one to add to the timing study or list of standard approaches.


#36

I will light a candle :candle: with you for plyr, a package that I love(d). But it is basically deprecated now and will see no further development. It had a huge, positive influence on how I think about these sorts of tasks, but I would advise against writing new code that uses plyr.


#37

Off-topic, but really, JD?


#38

@taras has a point, the correct spelling is y'all. :smiley:


#39

Well, the irony is that JD is the most qualified in the "y'all" spelling around here...


#40

I can typo in multiple languages: English, Southern English, Python, R... I have no constraints.


#41

Sorry to jump in on a long-dead thread, but this is clearly an important topic: several SO questions addressing it are highly "liked" (e.g. this SO question), and hadley seems to be "questioning" the best approach in this GH issue.
I've read through several threads and it seems it's preferred not to use rowwise, but pmap instead.

As solely an end-user, this approach is not nearly as intuitive as the rowwise approach (to me at least) for a few reasons.

  1. The pmap help and examples are not very informative (lumped into purrr::map et al) (as pointed out below as well).
  2. pmap_... doesn't quite work the same as other map_ syntax (my attempts below)
  3. I don't understand why there can't be a mutate_rowwise-type option, so the underlying data grouping is not altered?

I'm sure much of this is just my limited understanding of the internals, but for some reason I've managed to find my way around the rest of the purrr framework much more easily than pmap.

library(tidyverse)
mtcars %>% as_tibble() %>% 
  mutate(new_mean_var = mean(c(vs, am, gear, carb)),
         new_mean_pmap = pmap_dbl(.l = list(vs, am, gear, carb), mean), # NO
         new_mean_pmap_attempt2 = pmap_dbl(.l = list(vs, am, gear, carb), ~mean(c(vs, am, gear, carb))), # NO
         new_mean_pmap_attempt3 = pmap_dbl(.l = list(vs, am, gear, carb), function(x,y,z, zz) mean(c(x,y,z, zz))))  # YES

#42

You are right, pmap is the most confusing mapping operator to me as well. I've fixed your examples to show how you can still use pmap:

library(tidyverse)
mtcars %>% as_tibble() %>% 
  mutate(new_mean_var = mean(c(vs, am, gear, carb)),
         #new_mean_pmap = pmap_dbl(.l = list(vs, am, gear, carb), mean), # NO
         new_mean_pmap_attempt2 = pmap_dbl(.l = list(vs, am, gear, carb), ~mean(c(...))),
         new_mean_pmap_attempt3 = pmap_dbl(.l = list(vs, am, gear, carb), function(...) mean(c(...)))) %>%
  select(starts_with("new_mean"))
#> # A tibble: 32 x 3
#>    new_mean_var new_mean_pmap_attempt2 new_mean_pmap_attempt3
#>           <dbl>                  <dbl>                  <dbl>
#>  1         1.84                   2.25                   2.25
#>  2         1.84                   2.25                   2.25
#>  3         1.84                   1.75                   1.75
#>  4         1.84                   1.25                   1.25
#>  5         1.84                   1.25                   1.25
#>  6         1.84                   1.25                   1.25
#>  7         1.84                   1.75                   1.75
#>  8         1.84                   1.75                   1.75
#>  9         1.84                   1.75                   1.75
#> 10         1.84                   2.25                   2.25
#> # … with 22 more rows

Created on 2019-01-07 by the reprex package (v0.2.1)

BTW, your new_mean_var is not correct, as you can see: it is the same overall mean (1.84) on every row rather than a per-row mean.
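
For what it's worth, here is a sketch of why the bare-mean attempt misbehaves (my reading, with made-up toy data). pmap passes each list element as a separate argument, so mean() receives only the first value as x while the others are matched to its remaining arguments. Collecting the arguments with c(...) fixes that, and for a plain row mean base R's rowMeans() is a vectorised alternative.

```r
library(dplyr)
library(purrr)
library(tibble)

df <- tibble(x = c(1, 2), y = c(3, 4), z = c(5, 9))

out <- df %>%
  mutate(
    # c(x, y, z) concatenates whole columns: a single overall mean, recycled to every row
    whole_column = mean(c(x, y, z)),
    # one call per row; c(...) gathers that row's values into a single vector
    per_row_pmap = pmap_dbl(list(x, y, z), function(...) mean(c(...))),
    # vectorised: bind the columns into a matrix and average across each row
    per_row_base = rowMeans(cbind(x, y, z))
  )
```

The two per-row columns agree with each other, while whole_column repeats one value, which is the same pattern as in the reprex output above.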


#43

yeah I'm pretty partial to rowwise myself. I had written up the pmap solution for The R Cookbook 2nd Edition and technical reviewers just hated it. Found it really hard to grok. So I'm rolling back to rowwise.

@romain and @davis have been doing some interesting work with Rap:

I've not taken time to work with Rap, but it looks like a promising alternative to pmap for rowwise operations.