A general approach to large rowwise jobs within the tidyverse

Hi all,

I would like to perform some row-wise operations without having to pivot, because of some computational limitations.

I found a way using pmap() that almost satisfies me, but I cannot figure out how to avoid writing the name of the dataset multiple times...

Here is my working example: I would like to add the mean of the numeric variables in iris to the data, so I wrote:

library(tidyverse)
iris %>%
  as_tibble() %>%
  mutate(size = pmap_dbl(iris %>% select_if(is.numeric), ~ mean(c(...))))
#> # A tibble: 150 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species  size
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa   2.55
#>  2          4.9         3            1.4         0.2 setosa   2.38
#>  3          4.7         3.2          1.3         0.2 setosa   2.35
#>  4          4.6         3.1          1.5         0.2 setosa   2.35
#>  5          5           3.6          1.4         0.2 setosa   2.55
#>  6          5.4         3.9          1.7         0.4 setosa   2.85
#>  7          4.6         3.4          1.4         0.3 setosa   2.42
#>  8          5           3.4          1.5         0.2 setosa   2.52
#>  9          4.4         2.9          1.4         0.2 setosa   2.22
#> 10          4.9         3.1          1.5         0.1 setosa   2.4 
#> # … with 140 more rows

I naively thought I could replace the second iris with "." but that does not seem to work...
Note that I do not want to have to build a list manually, because I am talking about hundreds of columns with cryptic names...
The select in the middle of the call is also not very sexy...

Any tidier ideas?

Many thanks

Hi @courtiol

Perhaps I am not entirely clear on what you are hoping to achieve, but does this code work?

iris %>%
  as_tibble() %>%
  mutate(size = rowMeans(iris %>% select_if(is.numeric)))

I would write it this way

library(tidyverse)
iris %>%
  as_tibble() %>%
  mutate(size = pmap_dbl(select_if(., is.numeric), lift_vd(mean)))
#> # A tibble: 150 x 6
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species  size
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>   <dbl>
#>  1          5.1         3.5          1.4         0.2 setosa   2.55
#>  2          4.9         3            1.4         0.2 setosa   2.38
#>  3          4.7         3.2          1.3         0.2 setosa   2.35
#>  4          4.6         3.1          1.5         0.2 setosa   2.35
#>  5          5           3.6          1.4         0.2 setosa   2.55
#>  6          5.4         3.9          1.7         0.4 setosa   2.85
#>  7          4.6         3.4          1.4         0.3 setosa   2.42
#>  8          5           3.4          1.5         0.2 setosa   2.52
#>  9          4.4         2.9          1.4         0.2 setosa   2.22
#> 10          4.9         3.1          1.5         0.1 setosa   2.4 
#> # ... with 140 more rows

Created on 2019-10-16 by the reprex package (v0.3.0)

. %>% select_if() has a special meaning.

library(tidyverse)
. %>% select_if(is.numeric)
#> Functional sequence with the following components:
#> 
#>  1. select_if(., is.numeric)
#> 
#> Use 'functions' to extract the individual functions.

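That is why I think it does not work in your pipe. You don't need the inner %>% and can call select_if(., is.numeric) directly.

lift_vd is from purrr. The lift_xy functions help you lift the domain of a function from x to y; lift_vd goes from a function that takes a single vector (v) to one that takes its values as separate arguments, i.e. dots (d). That is what you are doing with ~ mean(c(...)), I guess.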
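To make the idea of "lifting the domain" concrete, here is a minimal base R sketch of what a vector-to-dots lift does. The name lift_vd_sketch is hypothetical and only for illustration; it is not purrr's implementation, just the same idea as ~ mean(c(...)). As far as I know, the lift_*() family was later deprecated in purrr 1.0.0, so a small helper like this (or the ~ mean(c(...)) form) may be the safer choice with recent versions.

```r
# Hypothetical helper (not part of purrr): turn a function that expects one
# vector into a function that accepts the values as separate arguments.
lift_vd_sketch <- function(f) {
  function(...) f(c(...))
}

row_mean <- lift_vd_sketch(mean)

# Called the way pmap() would call it for one row: one argument per column.
row_mean(Sepal.Length = 5.1, Sepal.Width = 3.5, Petal.Length = 1.4, Petal.Width = 0.2)
```

This is equivalent to calling mean(c(5.1, 3.5, 1.4, 0.2)) on the first row of iris.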

Not sure if you find it clearer but I would write it like that.

Hi @cderv, this is exactly what I was looking for!!! Many thanks!
I still do not fully understand when and where one can use the pronoun "." and when one cannot, but I like the new syntax.
I did not know about lift_xx(), and I find the whole approach quite sensible and elegant.
I am surprised that this combination never showed up in the endless discussion threads about rowwise work within the tidyverse that I have seen so far.
I will benchmark it, but I predict it to be a general solution to any rowwise problem!
++

Hi @mattwarkentin, thanks for trying, but sorry, this does not solve my issue (iris is still written twice, and it is not a general approach). Have a look at the great post from @cderv, it nails it!
++

Thanks!

We should not forget that %>% comes from the magrittr package and that it can be used without the tidyverse, with a lot of features. The . %>% fun() form is a special feature of magrittr that creates an anonymous function.
See https://magrittr.tidyverse.org/#building-unary-functions
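A small sketch of that feature, using only magrittr. The name root_of_sum is made up for the example: a pipe that starts with "." is not evaluated, it is stored as a function that can be called later.

```r
library(magrittr)

# Starting the pipe with "." builds a function instead of computing a value.
root_of_sum <- . %>% sum() %>% sqrt()

is.function(root_of_sum)

# Calling it runs the stored pipeline: same as sqrt(sum(c(9, 16))).
root_of_sum(c(9, 16))
```

This is why a leading "." inside mutate() did not behave like the data: the whole expression became a function rather than a tibble.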

The same goes for where to put the dot:

  • iris %>% mutate(size = fun(.)) is equivalent to mutate(iris, size = fun(iris)) because you can reuse the placeholder in magrittr.
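As a quick check of that equivalence, here is a sketch using rowMeans() in place of the generic fun(). It assumes dplyr is installed; the object names piped and explicit are only for illustration.

```r
library(dplyr)

# The dot inside a nested call still refers to the left-hand side (iris),
# and iris is still passed as the first argument of mutate().
piped <- iris %>%
  mutate(size = rowMeans(select_if(., is.numeric)))

# The same call written without the pipe.
explicit <- mutate(iris, size = rowMeans(select_if(iris, is.numeric)))

identical(piped, explicit)
```

Both versions should give the same data frame; the piped one just avoids typing the dataset name a second time.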

Hope it helps you better understand the (magical) pipe!

I thought I knew most of what the magrittr package was about, but I had surely forgotten that starting with "." creates a function! Now I understand your previous statement!
Thanks again!

Benchmark-wise, lifting the domain of the mean function does imply some cost, but nothing too bad:

library(tidyverse)

big_iris <- map_dfc(1:100, ~ bind_cols(iris, iris)) 
dim(big_iris)
#> [1]  150 1000

bench::mark(
  lift = {
    big_iris %>%
      mutate(size = pmap_dbl(select_if(., is.numeric), lift_vd(mean)))
  },
  dots = {
    big_iris %>%
      mutate(size = pmap_dbl(select_if(., is.numeric), ~ mean(c(...))))
  }, min_iterations = 100
) %>%
  plot()
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.

Created on 2019-10-17 by the reprex package (v0.3.0)

The relative cost seems to stay about the same across the various dataset sizes I tried (I did not time anything huge, but still quite large).

++
