Dplyr: Alternatives to rowwise

jdlong · May 5, 2018, 2:14pm

I was just surprised to stumble upon a GitHub comment where @hadley says not to use rowwise... um, that's surprising. What's the alternative for applying a non-vectorized function to each row?

Source: data.frame with 'rowwise()' grouping claims to have no 'groups()' · Issue #3144 · tidyverse/dplyr · GitHub

JohnMount · May 5, 2018, 2:58pm

One could try wrapping the non-vectorized function to be vectorized using one of base::Vectorize() or base::mapply().
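A minimal sketch of what that wrapping might look like, assuming the `df` and `fun` from the reprex later in this thread (with `v1` used as the mean, as suggested further down). `vfun` is just an illustrative name, not something from the original posts.

```r
library(dplyr)

df <- tibble::tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A", "C", 4, 1,
  "A", "D", 2, 3,
  "A", "D", 1, 5
)

# Not vectorized: it collapses its random draws to a single number.
fun <- function(v1, v2) {
  sum(rnorm(10, v1, v2))
}

# Vectorize() returns a wrapper that calls fun() once per (v1, v2) pair.
vfun <- Vectorize(fun)

set.seed(1234)
with_vectorize <- df %>% mutate(t = vfun(v1, v2))

# mapply() does the same element-by-element calls without a wrapper.
set.seed(1234)
with_mapply <- df %>% mutate(t = mapply(fun, v1, v2))
```

With the same seed, both approaches should give the same `t` column, since each calls `fun()` once per row in row order.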

I've been collecting some notes on under-appreciated R functions here: http://www.win-vector.com/blog/2018/04/neglected-r-super-functions/

mara · May 5, 2018, 3:44pm

Not a direct answer to your question, but I think Jenny covers the general thinking on this, which is to tilt towards column-wise thinking, and provide tools to essentially avoid having to work row-wise in most cases…
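As a rough illustration of what "column-wise thinking" can mean in practice (the toy data and column names below are made up for this sketch), many row-by-row computations have a vectorized counterpart that works on whole columns:

```r
library(dplyr)

df <- tibble::tibble(v1 = c(4, 2, 1), v2 = c(1, 3, 5))

# Row by row: max() and sum() are evaluated once per row.
by_row <- df %>%
  rowwise() %>%
  mutate(biggest = max(v1, v2), total = sum(v1, v2)) %>%
  ungroup()

# Column-wise: pmax() and `+` operate on entire columns at once.
by_col <- df %>%
  mutate(biggest = pmax(v1, v2), total = v1 + v2)
```

Both give the same values here. This only helps when a vectorized equivalent exists, which is not the case for every function.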

Of note, Hadley saying it's not being actively developed isn't to say it's being deprecated, and the handling of groups, which is half of that question (way to just include hadley's response sans click-less context, JD **), is very much being actively worked on.

Question one-boxed (Discourse is weird)

github.com/tidyverse/dplyr

data.frame with 'rowwise()' grouping claims to have no 'groups()'

opened 02:02AM - 18 Oct 17 UTC

closed 07:22PM - 02 Nov 17 UTC

coolbutuseless

After processing data grouped by 'rowwise()', the output of 'groups()' claims there are NULL groups - which is obviously not the case.

Expected behaviour: A data.frame with rowwise grouping outputs "<by row>" (or something similar) in response to the "groups()" command.

```
> suppressPackageStartupMessages({
+ library(dplyr)
+ })
>
> packageVersion('dplyr')
[1] ‘0.7.4’
>
> mtcars <- mtcars %>% rowwise()
>
> mtcars %>% groups()
NULL
>
> # an operation that is group sensitive will give grouped results,
> # even though 'groups()' claims there are no groups. The
> # output of the following indicates it is "Groups: <by row>"
> # even if groups() command doesn't
> mtcars %>% filter(row_number() == 1)
Source: local data frame [32 x 11]
Groups: <by row>

# A tibble: 32 x 11
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21.0     6 160.0   110  3.90 2.620 16.46     0     1     4     4
 2  21.0     6 160.0   110  3.90 2.875 17.02     0     1     4     4
 3  22.8     4 108.0    93  3.85 2.320 18.61     1     1     4     1
 4  21.4     6 258.0   110  3.08 3.215 19.44     1     0     3     1
 5  18.7     8 360.0   175  3.15 3.440 17.02     0     0     3     2
 6  18.1     6 225.0   105  2.76 3.460 20.22     1     0     3     1
 7  14.3     8 360.0   245  3.21 3.570 15.84     0     0     3     4
 8  24.4     4 146.7    62  3.69 3.190 20.00     1     0     4     2
 9  22.8     4 140.8    95  3.92 3.150 22.90     1     0     4     2
10  19.2     6 167.6   123  3.92 3.440 18.30     1     0     4     4
# ... with 22 more rows
>
```

Jenny Bryan's webinar/materials on row-oriented workflows in R with the tidyverse:

** Said with loving kindness, because I know JD totally included the source below.

cderv · May 6, 2018, 10:07am

Using a combination of nested data frames (tidyr::nest()) and the purrr::pmap() family on list-columns is an option for row-wise operations.
I think the RStudio webinar by Jenny mentions this.
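A small sketch of that nest-then-map pattern, assuming the `df` from the reprex later in the thread. The summary columns (`n_rows`, `v_total`, `label`) are invented here purely for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- tibble::tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A", "C", 4, 1,
  "A", "D", 2, 3,
  "A", "D", 1, 5
)

# nest() packs each group's rows into one cell of a list-column named `data`.
nested <- df %>%
  group_by(groupA, groupB) %>%
  nest() %>%
  ungroup()

# map_*() and pmap_*() then iterate over the list-column, one element per row.
result <- nested %>%
  mutate(
    n_rows  = map_int(data, nrow),
    v_total = map_dbl(data, ~ sum(.x$v1 + .x$v2)),
    label   = pmap_chr(
      list(groupA, groupB, data),
      function(a, b, d) paste0(a, b, ":", nrow(d))
    )
  )
```

Each row of `nested` holds a whole sub-table, so "row by row" here means "group by group".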

RuReady · May 7, 2018, 4:21am

Winston Chang had an interesting post on this topic https://rpubs.com/wch/200398

jennybryan · May 7, 2018, 4:47am

The example in Winston's RPub is also one of the central examples in the "Row-oriented workflows" webinar materials:

https://rstd.io/row-work

Updated version of the timing study (spoiler: purrr::pmap() does very well):

github.com

jennybc/row-oriented-workflows/blob/master/iterate-over-rows.md

Turn data frame into a list, one component per row
================
Jenny Bryan, updating work of Winston Chang
2018-09-05

Update of <https://rpubs.com/wch/200398>.

  - Added some methods, removed some methods.
  - Run every combination of problem size & method multiple times.
  - Explore different number of rows and columns, with mixed col types.

<!-- end list -->

``` r
library(scales)
library(tidyverse)
```

    ## ── Attaching packages ──────────────────────────────────── tidyverse 1.2.1 ──

jdlong · May 8, 2018, 12:09am

So I keep messing with pmap and I can't get a row-wise workflow to work with pmap to save my life. I'm not grokking something. Can one of you kind folks please help me turn the rowwise() pipe sequence below into a pmap() pipe sequence?

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(v1, v2) {
  val <-  sum(rnorm(10,v2,v2))
  return(val)
}
fun(2,3)

df %>%
  rowwise() %>%
  mutate( t = fun(v1, v2))

In this reprex, the gist is that fun is not vectorized, so it needs inputs fed to it row by row. Anything I try with pmap not only fails to work but often insults the marital status of my parents, which becomes quite tiring after a while.

Thanks for the hand holding, y'all.

jennybryan · May 8, 2018, 12:45am

pmap() provides the entire tuple (row in the data frame case) to the function you're mapping.

So, if the function only uses a subset of the inputs it will see, you have to address that. I assume you're seeing a lot of:

Error in mutate_impl(.data, dots) : 
  Evaluation error: unused arguments (groupA = .l[[c(1, i)]], groupB = .l[[c(2, i)]])

There are a couple of options. First, you can include ... as an argument to mop up any arguments fun doesn't use. Second, you can use the ..i pronouns to map your input to the function args by position. This generally terrifies me and I would not recommend it. However it does work.

library(tidyverse)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(v1, v2) {
  val <-  sum(rnorm(10,v2,v2))
  return(val)
}
fun(2,3)
#> [1] 28.57652

set.seed(1234)
df %>%
  rowwise() %>%
  mutate( t = fun(v1, v2))
#> Source: local data frame [3 x 5]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 5
#>   groupA groupB    v1    v2     t
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 A      C          4     1  6.17
#> 2 A      D          2     3 26.5 
#> 3 A      D          1     5 30.6

## absorb unused arguments with `...`
fun2 <- function(v1, v2, ...) {
  val <-  sum(rnorm(10,v2,v2))
  return(val)
}

set.seed(1234)
df %>% 
  mutate( t = pmap_dbl(., fun2))
#> # A tibble: 3 x 5
#>   groupA groupB    v1    v2     t
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 A      C          4     1  6.17
#> 2 A      D          2     3 26.5 
#> 3 A      D          1     5 30.6
  

set.seed(1234)
df %>% 
  mutate( t = pmap_dbl(., ~ fun(v1 = ..3, v2 = ..4)))
#> # A tibble: 3 x 5
#>   groupA groupB    v1    v2     t
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 A      C          4     1  6.17
#> 2 A      D          2     3 26.5 
#> 3 A      D          1     5 30.6

Created on 2018-05-07 by the reprex package (v0.2.0).

jennybryan · May 8, 2018, 12:49am

I think in this specific case map2_*() is nicer.

library(tidyverse)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(v1, v2) {
  val <-  sum(rnorm(10,v2,v2))
  return(val)
}

set.seed(1234)
df %>% 
  mutate( t = map2_dbl(v1, v2, fun))
#> # A tibble: 3 x 5
#>   groupA groupB    v1    v2     t
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 A      C          4     1  6.17
#> 2 A      D          2     3 26.5 
#> 3 A      D          1     5 30.6

Created on 2018-05-07 by the reprex package (v0.2.0).

I think pmap() really comes up when there are more arguments.
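For instance, here is a sketch with three arguments. The function `draw_total` and the `params` table are hypothetical; the point is only that when every column name matches an argument name, pmap_dbl() can take the data frame directly.

```r
library(dplyr)
library(purrr)

# Hypothetical three-argument function; it returns a single number per call.
draw_total <- function(n, mean, sd) {
  sum(rnorm(n, mean, sd))
}

# Every column name matches an argument of draw_total().
params <- tibble::tibble(
  n    = c(5, 10, 20),
  mean = c(0, 1, 2),
  sd   = c(1, 1, 3)
)

set.seed(1234)
out <- params %>%
  mutate(t = pmap_dbl(., draw_total))

# The same three calls written out by hand, for comparison.
set.seed(1234)
manual <- c(draw_total(5, 0, 1), draw_total(10, 1, 1), draw_total(20, 2, 3))
```

Because the arguments are matched by name, the column order in `params` should not matter.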

markdly · May 8, 2018, 4:04am

Nice discussion! Minor comment: I think there might be a typo in the example function fun, as v1 is declared but never used.

# existing
fun <- function(v1, v2) {
  val <-  sum(rnorm(10, v2, v2))
  return(val)
}

# change
fun <- function(v1, v2) {
  val <-  sum(rnorm(10, v1, v2))
  return(val)
}

(Thought I'd note it in case it detracts from the clarity of the examples for other readers)

jdlong · May 8, 2018, 9:57am

Good catch. That was my mistake. Remember kids, don't drink and code.

jdlong · May 8, 2018, 10:34am

Jenny, as always, this is insanely helpful. Thank you. It's hard for me to express how helpful it is that I can wander in here frustrated and alone, throw a reprex against the wall, yell a little, and then have all y'all guide me down the narrow path.

My learning objective here is to not just grok this for myself, but to grok it well enough to teach others. Here's my observation (keep in mind that while I've been using R for many years I just started using dplyr ~ 6 months ago):

  - The ability to operate row by row is important because sometimes there's logic that's hard to vectorize. This is even more common for beginners.
  - Row by row on data frame objects is very intuitive because that's how many formulas work in Excel, which many beginners are used to.
  - rowwise() is super intuitive to a beginner. Conceptually it feels just like a group_by, but its group is an individual row. This makes using it and explaining it super easy because of the analogy to group_by.
  - All of the pmap family of functions require learning new concepts that are not needed with rowwise, for example ..i or ... in a function. Those two ideas will likely be totally new to a beginner.

So candidly I'm sitting here with a book manuscript that has a very clean and easy to understand 3 paragraph chunk that explains rowwise in a way that any beginner can understand. And I'm going to delete it and replace it with "doing operations row by row is really hard. here's a couple of messy hacks that are hard to understand but may magically do what you want. Or they may not. But you'll probably get some weird error messages. I know I did!"

Meh. Deprecating rowwise feels like a real step backwards for the tidy ecosystem and tidy workflow to me.

mara · May 8, 2018, 10:51am

It's not being deprecated.

[Emphasis added ⇩]

Honestly, rowwise() has never felt intuitive to me – I'm not sure why.

In the event that you're not being sarcastic here, I'd probably at the very least mention that these patterns offer advantages elsewhere in terms of reusability — I'd try to explain that more eloquently here were it not for the fact that Jenny has already done so in her slides/talk, and I don't want to mess with a good thing.

So will Git, but I think it's worth learning! Seriously though, rowwise is not (repeat not) being deprecated, so do with it what you will! I look forward to reading what you write — one of my favourite things about the R community is that so many people take the time to write out how they approach problems/tasks, and different takes seem to just click for whatever reason (for me, bits and pieces from a number of sources comprise my hodgepodge mental models).

tbradley · May 8, 2018, 10:57am

I agree with Mara that it has never been super clear to me when to use rowwise() and when I shouldn't. Normally, if something keeps failing without it, I will add rowwise() to see if it works.

However, since I switched to using the map_* family of functions things are much more consistent. I disagree with your statement that all of the pmap functions require learning new concepts, because they are purposefully designed to have complementary syntaxes. So once you have grasped one, you can likely pick up the others more easily.

I will say that using pmap for row-wise operations is the most confusing to me (of the map functions), because a lot of the time when I need to do something by row, I do not need every column, only a few. This is why map_* and map2_*, as in Jenny's second example, are clearer in the beginning, in my opinion.

Also worth noting, pmap can take a list of arguments if you need to access more than 2 columns but not all of your columns.
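A sketch of that idea, assuming a variant of the thread's `df` with an extra, made-up column `n` and a hypothetical three-argument `fun3`. Only the needed columns are handed to pmap_dbl(), so the function does not need `...` to swallow the rest.

```r
library(dplyr)
library(purrr)

df <- tibble::tribble(
  ~groupA, ~groupB, ~n, ~v1, ~v2,
  "A", "C", 10, 4, 1,
  "A", "D", 10, 2, 3,
  "A", "D", 10, 1, 5
)

# Hypothetical function that needs three of the five columns.
fun3 <- function(n, v1, v2) {
  sum(rnorm(n, v1, v2))
}

# Option 1: build the list by hand; arguments are matched by position.
set.seed(1234)
picked <- df %>%
  mutate(t = pmap_dbl(list(n, v1, v2), fun3))

# Option 2: select the columns; arguments are matched by column name.
set.seed(1234)
selected <- df %>%
  mutate(t = pmap_dbl(select(., n, v1, v2), fun3))
```

Both options call `fun3()` once per row with the same inputs, so with the same seed they should agree.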

tbradley · May 8, 2018, 10:59am

Also, I may be out of the norm here, but grasping how the mutate + map combo worked with list-columns, especially nested tibbles, really helped me see how it worked row by row

jdlong · May 8, 2018, 11:01am

Sorry for not using my sarcastic font.

Thanks for hanging in with me as I learn out loud, y'all. I'm venting frustration in both learning and teaching as I go through this so please take me with a grain of salt. My frustrations are real in this moment but I fully expect my views to change as I muddle through this.

mara · May 8, 2018, 11:04am

I know…I should've said: In the event that you're not being sarcastic and/or anyone stumbles upon this in the future and I'd like two mediocrely informed cents on this, please.

Or sarcastices…

Due to my high base-rate usage of sarcasm (yes, I assume everyone parses me with a Bayesian approach), I'm more of a sinceroid girl myself.

jdlong · May 8, 2018, 11:50am

Good point. I think the map family of functions are TOTALLY internally consistent. My argument is that there's a barrier to entry that requires learning a few new concepts. Learners should totally learn these, but they provide friction. I am of the strong opinion that when learning row wise operations the map family of functions have a lot of cognitive friction.

Unless I'm missing something, rowwise is sort of syntactic sugar around this little jewel:

df %>%
  group_by( row_number() ) %>%
  mutate( t = fun(v1, v2))

So one of y'all should talk me off the ledge of teaching that! Because I find it incredibly easy to grok and use.
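One way to check that intuition on the thread's own example is to run both pipelines with the same seed and compare. This is only a sketch: it shows the two give the same numbers for this reprex (using the corrected `fun`), not that they are implemented the same way inside dplyr. The helper column name `row` is made up here.

```r
library(dplyr)

df <- tibble::tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A", "C", 4, 1,
  "A", "D", 2, 3,
  "A", "D", 1, 5
)

fun <- function(v1, v2) {
  sum(rnorm(10, v1, v2))
}

set.seed(1234)
via_rowwise <- df %>%
  rowwise() %>%
  mutate(t = fun(v1, v2)) %>%
  ungroup()

# One group per row; naming the grouping column keeps the output tidy.
set.seed(1234)
via_group <- df %>%
  group_by(row = row_number()) %>%
  mutate(t = fun(v1, v2)) %>%
  ungroup()
```

One visible difference is that the group_by() version leaves a helper column (`row`) behind, which you may want to drop afterwards.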

And, once again, thank you all for some back and forth on these concepts. This is super helpful for me to get lots of perspectives and I'm really glad there's a forum in which we can hash this out.

jennybryan · May 8, 2018, 3:44pm

Re: rowwise() and its "least favorite child" status.... I confess that I also have never used it. I went straight from my dearly beloved plyr to purrr-based approaches, without stopping in the middle. (I also never got used to group_by() + do().) But if you think about the effect that rowwise() has, I can understand why an implementer doesn't love it. It means the rowwise()'d data frame now has to carry around some property that changes how subsequent calls should operate. It is much nicer to encourage a workflow that has less "special case" going on.

The comparison to group_by() is apt and also gets at this. Grouping and ungrouping is another rich source of puzzles for people, so it's nice to avoid it when it's not truly necessary.
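A small sketch of that "carried-around property", using made-up toy data: the same mutate() call gives different answers depending on whether the data frame is still marked as row-wise.

```r
library(dplyr)

df <- tibble::tibble(v1 = c(4, 2, 1), v2 = c(1, 3, 5))

rw <- df %>% rowwise()

# The row-wise marker travels with the object as an extra class.
is_rowwise <- inherits(rw, "rowwise_df")

# Still row-wise: sum(v1) sees one value at a time.
still_rowwise <- rw %>%
  mutate(total = sum(v1))

# After ungroup(): sum(v1) sees the whole column again.
back_to_cols <- rw %>%
  ungroup() %>%
  mutate(total = sum(v1))
```

Here `still_rowwise$total` is just `v1` again, while `back_to_cols$total` is the column sum repeated on every row. Forgetting the ungroup() is the kind of puzzle described above.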

I think the more interesting question is why must .f in purrr::pmap() use all the named inputs? I have asked this before and, although this is not the conversation I was thinking of, it's the closest match I can find right now:

github.com/tidyverse/purrr

Anonymous functions in pmap

opened 12:03PM - 15 Jun 16 UTC

closed 10:24PM - 03 Jul 16 UTC

1danjordan

The documentation for `pmap` is bundled in with `map2` and doesn't include any `pmap` specific examples. This makes it unclear how to access lists in `.f`, and after a lot of investigation I'm still stumped.

In `map2`, the variables are simply `.x` and `.y`, and in the deprecated `map3` were `.x`, `.y` and `.z`. The below example works for the first two lists, but I have no idea how access the third list.

```
a <- list(name1 = 1, name2 = 1, name3 = 1)
b <- list(name1 = 1, name2 = 10, name3 = 100)
c <- list(name1 = 5, name2 = 50, name3 = 500)

pmap(list(a, b, c), ~ .x + .y)
$name1
[1] 2

$name2
[1] 11

$name3
[1] 101

pmap(list(a,b, c), ~ .x + .y + .z)
Error in .f(.l[[c(1L, i)]], .l[[c(2L, i)]], .l[[c(3L, i)]], ...) :
  object '.z' not found
```

Obviously this pattern could not extend to a list of length n, but I can't work out what the convention could be. If this is me missing something blatant or `pmap` is not designed to work in this way, then I apologise!

I'm not sure if that is a completely closed conversation, but you could imagine trying to address this pain point in various ways.
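For what it's worth, the three-list example from that issue can be written with the positional `..1`, `..2`, `..3` pronouns mentioned earlier in this thread, or with an ordinary named function. A minimal sketch, where the argument names `x`, `y`, `z` are arbitrary:

```r
library(purrr)

a <- list(name1 = 1, name2 = 1, name3 = 1)
b <- list(name1 = 1, name2 = 10, name3 = 100)
c <- list(name1 = 5, name2 = 50, name3 = 500)

# Formula shorthand: ..1, ..2, ..3 refer to the inputs by position.
by_position <- pmap(list(a, b, c), ~ ..1 + ..2 + ..3)

# A regular function: the unnamed outer list is matched by position.
by_function <- pmap(list(a, b, c), function(x, y, z) x + y + z)
```

Both return a list with elements 7, 61 and 601. The explicit function is arguably easier to read once there are more than two inputs.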

jennybryan · May 8, 2018, 3:55pm

BTW I noticed that too but had the good taste not to mention it and faithfully reproduced it.