Dplyr: Alternatives to rowwise


#1

I was just surprised to stumble upon a GitHub comment where @hadley says not to use rowwise... um, that's surprising. What's the alternative for applying a non-vectorized function to each row?


Source: https://github.com/tidyverse/dplyr/issues/3144


#2

One could try wrapping the non-vectorized function to be vectorized using one of base::Vectorize() or base::mapply().
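As a rough illustration of that wrapping idea, here is a minimal base-R sketch. The function `scalar_fun` and the two input vectors are made up for the example (they are not from the original question); the `if` inside it is what makes it unsafe to call on whole columns.

```r
# Hypothetical scalar-only function: `if` expects a single TRUE/FALSE,
# so it cannot be handed whole columns at once.
scalar_fun <- function(v1, v2) {
  if (v1 > v2) v1 - v2 else v1 + v2
}

v1 <- c(4, 2, 1)
v2 <- c(1, 3, 5)

# Option 1: Vectorize() returns a wrapper that loops over the inputs via mapply()
vec_fun <- Vectorize(scalar_fun)
res_vectorize <- vec_fun(v1, v2)

# Option 2: call mapply() directly, pairing up elements of v1 and v2
res_mapply <- mapply(scalar_fun, v1, v2)
```

Either result could then be used inside an ordinary mutate(), for example mutate(t = vec_fun(v1, v2)), without rowwise().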

I've been collecting some notes on under-appreciated R functions here: http://www.win-vector.com/blog/2018/04/neglected-r-super-functions/


#3

Not a direct answer to your question, but I think Jenny covers the general thinking on this, which is to tilt towards column-wise thinking, and provide tools to essentially avoid having to work row-wise in most cases…

Of note, Hadley saying it's not being actively developed isn't to say it's being deprecated, and the handling of groups, which is half of that question (way to just include hadley's response sans click-less context, JD :smirk:**), is very much being actively worked on.

Question one-boxed (Discourse is weird)

Jenny Bryan's webinar/materials on row-oriented workflows in R with the tidyverse:


https://www.rstudio.com/resources/webinars/thinking-inside-the-box-you-can-do-that-inside-a-data-frame/

** Said with loving kindness, because I know JD totally included the source below.


#4

Using a combination of nested data frames (via tidyr::nest) and the purrr::pmap family on list columns is an option for row-wise operations.
The RStudio webinar by Jenny mentions this, I think.
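A minimal sketch of what the nested approach might look like, assuming a small data frame like the one used later in this thread. The per-group summary (`total`) is a hypothetical stand-in for whatever non-vectorized work you need, and map_dbl() is used here rather than pmap() because only one list column is involved.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A", "C", 4, 1,
  "A", "D", 2, 3,
  "A", "D", 1, 5
)

# One row per group; the remaining columns are packed into a list column `data`
nested <- df %>%
  group_by(groupA, groupB) %>%
  nest() %>%
  ungroup()

# Each element of `data` is a small tibble, so the function sees one group at a time
result <- nested %>%
  mutate(total = map_dbl(data, ~ sum(.x$v1 * .x$v2)))
```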


#5

Winston Chang had an interesting post on this topic https://rpubs.com/wch/200398


#6

The example in Winston's RPub is also one of the central examples in the "Row-oriented workflows" webinar materials:

https://rstd.io/row-work

Updated version of the timing study (spoiler: purrr::pmap() does very well):


#7

So I keep messing with pmap and I can't make a row-wise workflow work with pmap to save my life. I'm not grokking something. Can one of you kind folks please help me turn the rowwise() pipe sequence below into a pmap() pipe sequence?

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(v1, v2) {
  val <-  sum(rnorm(10,v2,v2))
  return(val)
}
fun(2,3)

df %>%
  rowwise() %>%
  mutate( t = fun(v1, v2))

In this reprex, the gist is that fun is not vectorized, so it needs its inputs fed to it row by row. Anything I try with pmap not only fails to work but often insults the marital status of my parents, which becomes quite tiring after a while.

Thanks for the hand holding, y'all.


#8

pmap() provides the entire tuple (row in the data frame case) to the function you're mapping.

So, if the function only uses a subset of the inputs it will see, you have to address that. I assume you're seeing a lot of:

Error in mutate_impl(.data, dots) : 
  Evaluation error: unused arguments (groupA = .l[[c(1, i)]], groupB = .l[[c(2, i)]])

There are a couple of options. First, you can include ... as an argument to mop up any arguments fun doesn't use. Second, you can use the ..i pronouns (..1, ..2, and so on) to map your input to the function args by position. This generally terrifies me and I would not recommend it. However, it does work.

library(tidyverse)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(v1, v2) {
  val <-  sum(rnorm(10,v2,v2))
  return(val)
}
fun(2,3)
#> [1] 28.57652

set.seed(1234)
df %>%
  rowwise() %>%
  mutate( t = fun(v1, v2))
#> Source: local data frame [3 x 5]
#> Groups: <by row>
#> 
#> # A tibble: 3 x 5
#>   groupA groupB    v1    v2     t
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 A      C          4     1  6.17
#> 2 A      D          2     3 26.5 
#> 3 A      D          1     5 30.6

## absorb unused arguments with `...`
fun2 <- function(v1, v2, ...) {
  val <-  sum(rnorm(10,v2,v2))
  return(val)
}

set.seed(1234)
df %>% 
  mutate( t = pmap_dbl(., fun2))
#> # A tibble: 3 x 5
#>   groupA groupB    v1    v2     t
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 A      C          4     1  6.17
#> 2 A      D          2     3 26.5 
#> 3 A      D          1     5 30.6
  

set.seed(1234)
df %>% 
  mutate( t = pmap_dbl(., ~ fun(v1 = ..3, v2 = ..4)))
#> # A tibble: 3 x 5
#>   groupA groupB    v1    v2     t
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 A      C          4     1  6.17
#> 2 A      D          2     3 26.5 
#> 3 A      D          1     5 30.6

Created on 2018-05-07 by the reprex package (v0.2.0).


#9

I think in this specific case map2_*() is nicer.

library(tidyverse)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A","C",4, 1,
  "A","D",2, 3,
  "A","D",1, 5 
)

fun <- function(v1, v2) {
  val <-  sum(rnorm(10,v2,v2))
  return(val)
}

set.seed(1234)
df %>% 
  mutate( t = map2_dbl(v1, v2, fun))
#> # A tibble: 3 x 5
#>   groupA groupB    v1    v2     t
#>   <chr>  <chr>  <dbl> <dbl> <dbl>
#> 1 A      C          4     1  6.17
#> 2 A      D          2     3 26.5 
#> 3 A      D          1     5 30.6

Created on 2018-05-07 by the reprex package (v0.2.0).

I think pmap() really comes into its own when there are more arguments.


#10

Nice discussion! Minor comment - I think there might be a typo using the example function fun as v1 is declared but never used

# existing
fun <- function(v1, v2) {
  val <-  sum(rnorm(10, v2, v2))
  return(val)
}

# change
fun <- function(v1, v2) {
  val <-  sum(rnorm(10, v1, v2))
  return(val)
}

(Thought I'd note it in case it detracts from the clarity of the examples for other readers)


#11

Good catch. That was my mistake. Remember kids, don't drink and code.


#12

Jenny, as always, this is insanely helpful. Thank you. It's hard for me to express how helpful it is that I can wander in here frustrated and alone, throw a reprex against the wall, yell a little, and then have all y'all guide me down the narrow path.

My learning objective here is to not just grok this for myself, but to grok it well enough to teach others. Here's my observation (keep in mind that while I've been using R for many years I just started using dplyr ~ 6 months ago):

  • the ability to operate row by row is important because sometimes there's logic that's hard to vectorize. This is even more common for beginners.
  • row by row on data frame objects is very intuitive because that's how many formulas work in Excel which many beginners are used to.
  • rowwise() is super intuitive to a beginner. Conceptually it feels just like a group_by, but its group is an individual row. This makes using it and explaining it super easy because of the analogy to group_by.
  • all of the pmap family of functions require learning new concepts that are not needed with rowwise. For example, ..i or ... in a function. Those two ideas will likely be totally new to a beginner.

So candidly I'm sitting here with a book manuscript that has a very clean and easy to understand 3 paragraph chunk that explains rowwise in a way that any beginner can understand. And I'm going to delete it and replace it with "doing operations row by row is really hard. here's a couple of messy hacks that are hard to understand but may magically do what you want. Or they may not. But you'll probably get some weird error messages. I know I did!"

Meh. Deprecating rowwise feels like a real step backwards for the tidy ecosystem and tidy workflow to me.


#13

It's not being deprecated.

[Emphasis added ⇩]

Honestly, rowwise() has never felt intuitive to me – I'm not sure why.

In the event that you're not being sarcastic here, I'd probably at the very least mention that these patterns offer advantages elsewhere in terms of reusability. I'd try to explain that more eloquently here were it not for the fact that Jenny has already done so in her slides/talk, and I don't want to mess with a good thing.

So will Git, but I think it's worth learning! Seriously though, rowwise is not (repeat not) being deprecated, so do with it what you will! I look forward to reading what you write. One of my favourite things about the R community is that so many people take the time to write out how they approach problems/tasks, and different takes seem to just click for whatever reason (for me, bits and pieces from a number of sources comprise my hodgepodge mental models).


#14

I agree with Mara that it has never been super clear to me when to use rowwise() and when I shouldn't. Normally, if something keeps failing without it, I will add rowwise() to see if it works.

However, since I switched to using the map_* family of functions, things are much more consistent. I disagree with your statement that all of the pmap functions require learning new concepts, because they are purposefully designed to have complementary syntaxes. So once you have grasped one, you can likely pick up the others more easily.

I will say that using pmap for row-wise operations is the most confusing to me (of the map functions), because a lot of the time when I need to do something by row, I do not need every column, only a few. This is why map_* and map2_*, like in Jenny's second example, make it clearer in the beginning, in my opinion.

Also worth noting, pmap can take a list of arguments if you need to access more than 2 columns but not all of your columns.
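To make that last point concrete, here is a small sketch of handing pmap a list of just the columns you need. The extra column `v3` and the three-argument function `fun3` are hypothetical, added only so there are more than two inputs to pass.

```r
library(dplyr)
library(purrr)
library(tibble)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2, ~v3,
  "A", "C", 4, 1, 10,
  "A", "D", 2, 3, 20,
  "A", "D", 1, 5, 30
)

# Hypothetical scalar-only function of three arguments
fun3 <- function(v1, v2, v3) {
  if (v1 > v2) v3 else v1 + v2
}

# Only v1, v2 and v3 are passed (matched by position), so groupA and groupB
# never reach fun3 and no `...` is needed to absorb them
result <- df %>%
  mutate(t = pmap_dbl(list(v1, v2, v3), fun3))
```

Naming the list elements, as in list(v1 = v1, v2 = v2, v3 = v3), should make pmap match by argument name instead of by position.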


#15

Also, I may be out of the norm here, but grasping how the mutate + map combo worked with list-columns, especially nested tibbles, really helped me see how it worked row by row


#16

Sorry for not using my sarcastic font.

Thanks for hanging in with me as I learn out loud, y'all. I'm venting frustration in both learning and teaching as I go through this, so please take me with a grain of salt. My frustrations are real in this moment, but I fully expect my views to change as I muddle through this.


#17

I know…I should've said: In the event that you're not being sarcastic and/or anyone stumbles upon this in the future and :thinking:I'd like two mediocrely informed cents on this, please.

Or sarcastices…

Due to my high base-rate usage of sarcasm (yes, I assume everyone parses me with a Bayesian approach), I'm more of a sinceroid girl myself.


#18

Good point. I think the map family of functions are TOTALLY internally consistent. My argument is that there's a barrier to entry that requires learning a few new concepts. Learners should totally learn these, but they provide friction. I am of the strong opinion that when learning row wise operations the map family of functions have a lot of cognitive friction.

Unless I'm missing something, rowwise is sort of syntactic sugar around this little jewel:

df %>%
  group_by( row_number() ) %>%
  mutate( t = fun(v1, v2))

So one of y'all should talk me off the ledge of teaching that! Because I find it incredibly easy to grok and use.
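For anyone who wants to check the comparison, here is a small sketch that runs both versions side by side. It swaps in a deterministic, hypothetical `scalar_fun` (instead of the random `fun` above) so the two results can be compared directly, and it names the grouping column `row_id` so it can be dropped afterwards; with a bare group_by(row_number()), that helper column stays in the result under the name `row_number()` and the data frame stays grouped.

```r
library(dplyr)
library(tibble)

df <- tribble(
  ~groupA, ~groupB, ~v1, ~v2,
  "A", "C", 4, 1,
  "A", "D", 2, 3,
  "A", "D", 1, 5
)

# Hypothetical scalar-only function (the `if` breaks on whole columns)
scalar_fun <- function(v1, v2) {
  if (v1 > v2) v1 - v2 else v1 + v2
}

# Version 1: rowwise()
by_rowwise <- df %>%
  rowwise() %>%
  mutate(t = scalar_fun(v1, v2)) %>%
  ungroup()

# Version 2: one group per row, then clean up the helper column
by_group <- df %>%
  group_by(row_id = row_number()) %>%
  mutate(t = scalar_fun(v1, v2)) %>%
  ungroup() %>%
  select(-row_id)
```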

And, once again, thank you all for some back and forth on these concepts. It's super helpful for me to get lots of perspectives, and I'm really glad there's a forum in which we can hash this out.


#19

Re: rowwise() and its "least favorite child" status.... I confess that I also have never used it. I went straight from my dearly beloved plyr to purrr-based approaches, without stopping in the middle. (I also never got used to group_by() + do().) But if you think about the effect that rowwise() has, I can understand why an implementer doesn't love it. It means the rowwise()'d data frame now has to carry around some property that changes how subsequent calls should operate. It is much nicer to encourage a workflow that has less "special case" going on.

The comparison to group_by() is apt and also gets at this. Grouping and ungrouping is another rich source of puzzles for people, so it's nice to avoid it when it's not truly necessary.

I think the more interesting question is why must .f in purrr::pmap() use all the named inputs? I have asked this before and, although this is not the conversation I was thinking of, it's the closest match I can find right now:

I'm not sure if that is a completely closed conversation, but you could imagine trying to address this pain point in various ways.


#20

BTW I noticed that too but had the good taste not to mention it :stuck_out_tongue_winking_eye: and faithfully reproduced it.