When to use c_across() instead of across()?

riinu · July 28, 2020, 10:01am

I am all over dplyr 1.0 and across(), love it!

I'm curious about c_across(). In the c_across() example provided in its Reference, across() would work equally well. So I'm wondering when to definitely use c_across() over across()?

library(tidyverse)

# c_across() example copied from https://dplyr.tidyverse.org/reference/across.html
df <- tibble(id = 1:4, w = runif(4), x = runif(4), y = runif(4), z = runif(4))
df %>%
  rowwise() %>%
  mutate(
    sum = sum(c_across(w:z)),
    sd  = sd(c_across(w:z))
  )
#> # A tibble: 4 x 7
#> # Rowwise: 
#>      id     w     x      y     z   sum    sd
#>   <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
#> 1     1 0.229 0.218 0.640  0.266  1.35 0.202
#> 2     2 0.925 0.573 0.235  0.814  2.55 0.306
#> 3     3 0.729 0.560 0.0415 0.957  2.29 0.389
#> 4     4 0.677 0.998 0.711  0.698  3.08 0.152

# same thing works with just across():
df %>%
  rowwise() %>%
  mutate(
    sum = sum(across(w:z)),
    sd  = sd(across(w:z))
  )
#> # A tibble: 4 x 7
#> # Rowwise: 
#>      id     w     x      y     z   sum    sd
#>   <int> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
#> 1     1 0.229 0.218 0.640  0.266  1.35 0.202
#> 2     2 0.925 0.573 0.235  0.814  2.55 0.306
#> 3     3 0.729 0.560 0.0415 0.957  2.29 0.389
#> 4     4 0.677 0.998 0.711  0.698  3.08 0.152

Thanks very much in advance for clarifications or further examples! Exciting times!

francisbarton · July 28, 2020, 9:41pm

I have the same question - the distinction isn't clear to me.
Initially, I thought that c_across() was going to be equivalent to using across() after calling rowwise(), but that doesn't seem to be the relevant difference.

francisbarton · July 28, 2020, 9:45pm

(off topic... but using across() feels good to me when acting on more than one variable, but it feels weird if I'm only passing one variable, say, to mutate(), in a way that mutate_at(vars(var)...) didn't. But I like the consistency of the new syntax, at least. I suppose, perhaps, I'd like it if the across() were unnecessary when only acting on a single variable?)

nirgrahamuk · July 29, 2020, 7:57am

I'm confused about c_across() as well; at the least, it's a poor example for the documentation.
One concrete difference appears when rowwise() is not used: c_across() still runs, but across() simply fails:

library(tidyverse)

df <- tibble(id = 1:4, w = runif(4), x = runif(4), y = runif(4), z = runif(4))
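# rowwise() + c_across(): per-row sum and sd
df %>%
  rowwise() %>%
  mutate(
    sum = sum(c_across(w:z)),
    sd  = sd(c_across(w:z))
  )

# no rowwise() + c_across(): runs, but uses every value in w:z at once
df %>%
  mutate(
    sum = sum(c_across(w:z)),
    sd  = sd(c_across(w:z))
  )

# rowwise() + across(): also runs
df %>%
  rowwise() %>%
  mutate(
    sum = sum(across(w:z)),
    sd  = sd(across(w:z))
  )

# no rowwise() + across(): fails with an error
df %>%
  mutate(
    sum = sum(across(w:z)),
    sd  = sd(across(w:z))
  )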

enixam · July 29, 2020, 8:07am

I figured that c_across() is simply designed to make selecting columns easier when using rowwise() to compute summary statistics across multiple columns per row, not to apply functions across multiple columns in a functional manner, which is what across() does. That said, across() can work in this unusual way inside summary functions with rowwise(). Could that be an intentional design choice, to avoid errors for people who don't know about c_across()?

siddharthprabhu · July 29, 2020, 12:51pm

I think the c_across() function's name is the source of confusion. It's not really similar to across(); in fact it's closer to select() or c() (the latter being the inspiration for the name). Its only purpose is to enable use of tidyselect syntax for selecting variables for row-wise transformations (as enixam correctly deduced).

I've provided some examples with commentary below.

library(dplyr, warn.conflicts = FALSE)

set.seed(42)

df <- tibble(id = 1:4, w = runif(4), x = runif(4), y = runif(4), z = runif(4))

# This is the desired result.
df %>%
  rowwise() %>%
  mutate(sum = sum(w, x, y, z))
#> # A tibble: 4 x 6
#> # Rowwise: 
#>      id     w     x     y     z   sum
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1 0.915 0.642 0.657 0.935  3.15
#> 2     2 0.937 0.519 0.705 0.255  2.42
#> 3     3 0.286 0.737 0.458 0.462  1.94
#> 4     4 0.830 0.135 0.719 0.940  2.62

# But what if we don't want to spell out each variable?
# Can we use tidyselect syntax?

# Try select().
df %>% 
  rowwise() %>% 
  mutate(sum = sum(select(., w:z)))
#> # A tibble: 4 x 6
#> # Rowwise: 
#>      id     w     x     y     z   sum
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1 0.915 0.642 0.657 0.935  10.1
#> 2     2 0.937 0.519 0.705 0.255  10.1
#> 3     3 0.286 0.737 0.458 0.462  10.1
#> 4     4 0.830 0.135 0.719 0.940  10.1

# Gives the wrong result: `.` is the whole data frame, so select() returns
# all rows and ignores the row-wise grouping.

# Use c_across() instead of select().
df %>% 
  rowwise() %>% 
  mutate(sum = sum(c_across(w:z)))
#> # A tibble: 4 x 6
#> # Rowwise: 
#>      id     w     x     y     z   sum
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1 0.915 0.642 0.657 0.935  3.15
#> 2     2 0.937 0.519 0.705 0.255  2.42
#> 3     3 0.286 0.737 0.458 0.462  1.94
#> 4     4 0.830 0.135 0.719 0.940  2.62

# Does using across() also work?
df %>% 
  rowwise() %>% 
  mutate(sum = sum(across(w:z)))
#> # A tibble: 4 x 6
#> # Rowwise: 
#>      id     w     x     y     z   sum
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1 0.915 0.642 0.657 0.935  3.15
#> 2     2 0.937 0.519 0.705 0.255  2.42
#> 3     3 0.286 0.737 0.458 0.462  1.94
#> 4     4 0.830 0.135 0.719 0.940  2.62

# Why?
args(c_across)
#> function (cols = everything()) 
#> NULL
args(across)
#> function (.cols = everything(), .fns = NULL, ..., .names = NULL) 
#> NULL

# Both functions take column specifications as their first argument, so here
# across() without any other arguments selects the same columns as c_across().

# across() is different only when transformations are supplied. 
# c_across() cannot do this.
df %>% 
  mutate(across(w:z, .fns = sum, .names = "{col}_sum"))
#> # A tibble: 4 x 9
#>      id     w     x     y     z w_sum x_sum y_sum z_sum
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1 0.915 0.642 0.657 0.935  2.97  2.03  2.54  2.59
#> 2     2 0.937 0.519 0.705 0.255  2.97  2.03  2.54  2.59
#> 3     3 0.286 0.737 0.458 0.462  2.97  2.03  2.54  2.59
#> 4     4 0.830 0.135 0.719 0.940  2.97  2.03  2.54  2.59

Created on 2020-07-29 by the reprex package (v0.3.0)

So in summary, think of c_across() as a selection helper for row-wise transformations. across() is a column-wise transformation function that comes with a selection helper built-in.
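
To make that summary concrete, here is a minimal sketch on a small made-up tibble (the data and the names `row_stats` and `col_means` are illustrative assumptions, not from the examples above). It uses c_across() to feed a row's values into functions that expect a single vector, and across() to apply one function to each column.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data, chosen so the results are easy to check by hand.
df <- tibble(id = 1:3, w = c(1, 4, 7), x = c(2, 5, 8), y = c(3, 6, 10))

# Row-wise: c_across() hands each row's w, x, y values to mean()/max()
# as one plain numeric vector.
row_stats <- df %>%
  rowwise() %>%
  mutate(
    row_mean = mean(c_across(w:y)),
    row_max  = max(c_across(w:y))
  ) %>%
  ungroup()

# Column-wise: across() applies mean() to each selected column in turn.
col_means <- df %>%
  summarise(across(w:y, mean))

row_stats
col_means
```

Note that mean() and max() here never see a data frame, only a vector, which is the situation c_across() seems to be intended for.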

@francisbarton You don't need across() when operating on a single variable.

francisbarton · July 29, 2020, 2:57pm

Thanks for a great answer - I think you're right about the confusion being to do with the similarity of the names.

Edit: on second thoughts, I'm still a little confused as to the point of c_across - it doesn't seem to do anything that across() doesn't? Seems superfluous. Be great to see a situation where c_across does something unique that across() can't (as @riinu said in the first place!)

On my (off-topic) point about not using across() with a single variable,

@francisbarton You don't need across() when operating on a single variable.

I think it is needed. Look and compare:

library(dplyr, warn.conflicts = FALSE)

set.seed(42)

df <- tibble(id = 1:3, w = runif(3), x = runif(3))

df %>% 
  mutate(x, ~ `*`(., w))
#> Error: Problem with `mutate()` input `..2`.
#> x Input `..2` must be a vector, not a `formula` object.
#> i Input `..2` is `~. * w`.

df %>% 
  mutate_at(vars(x), ~ `*`(., w))
#> # A tibble: 3 x 3
#>      id     w     x
#>   <int> <dbl> <dbl>
#> 1     1 0.915 0.760
#> 2     2 0.937 0.601
#> 3     3 0.286 0.149

df %>% 
  mutate(across(x, ~ `*`(., w)))
#> # A tibble: 3 x 3
#>      id     w     x
#>   <int> <dbl> <dbl>
#> 1     1 0.915 0.760
#> 2     2 0.937 0.601
#> 3     3 0.286 0.149

df %>% 
  mutate(x = `*`(x, w))
#> # A tibble: 3 x 3
#>      id     w     x
#>   <int> <dbl> <dbl>
#> 1     1 0.915 0.760
#> 2     2 0.937 0.601
#> 3     3 0.286 0.149

Created on 2020-07-29 by the reprex package (v0.3.0)

Using a bare variable without across() in the first example leads to an error. It's a very minor thing but I think it would be neat to mutate a single variable by passing a function using the formula notation, without using across(). [In the same way that you only need to use c() to construct a vector if there's more than one item, otherwise just a bare element is fine.]
The last example with '=' is fine but I like the elegance of the formula notation.

siddharthprabhu · July 29, 2020, 3:20pm

Well, if you absolutely insist on using formula notation then yes, across() is required. But as you pointed out yourself, one can simply use = instead. Some would argue that the latter is more elegant but it's a matter of taste.

francisbarton · July 29, 2020, 3:21pm

Using two of your examples (without rowwise()) and just using sum() without sd() as well, does not give the same difference/error:

library(dplyr, warn.conflicts = FALSE)

set.seed(42)

df <- tibble(id = 1:4, w = runif(4), x = runif(4), y = runif(4), z = runif(4))

df %>%
  mutate(
    sum = sum(c_across(w:z))
  )
#> # A tibble: 4 x 6
#>      id     w     x     y     z   sum
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1 0.915 0.642 0.657 0.935  10.1
#> 2     2 0.937 0.519 0.705 0.255  10.1
#> 3     3 0.286 0.737 0.458 0.462  10.1
#> 4     4 0.830 0.135 0.719 0.940  10.1

df %>%
  mutate(
    sum = sum(across(w:z))
  )
#> # A tibble: 4 x 6
#>      id     w     x     y     z   sum
#>   <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1 0.915 0.642 0.657 0.935  10.1
#> 2     2 0.937 0.519 0.705 0.255  10.1
#> 3     3 0.286 0.737 0.458 0.462  10.1
#> 4     4 0.830 0.135 0.719 0.940  10.1

Created on 2020-07-29 by the reprex package (v0.3.0)
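
A likely reading of the identical 10.1 in every row: without rowwise(), the whole data frame is treated as a single group, so the selection covers every value in the chosen columns rather than one row's worth. The sketch below checks this by counting how many values c_across() returns with and without rowwise(). The small tibble and the names `n_plain` and `n_rowwise` are illustrative assumptions.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical data: two selected columns, four rows.
df <- tibble(id = 1:4, w = 1:4, x = 5:8)

# Without rowwise(): c_across() concatenates the full w and x columns.
n_plain <- df %>%
  mutate(n = length(c_across(w:x))) %>%
  pull(n)

# With rowwise(): c_across() sees only the current row's w and x.
n_rowwise <- df %>%
  rowwise() %>%
  mutate(n = length(c_across(w:x))) %>%
  pull(n)

n_plain    # 8 values per call (2 columns x 4 rows)
n_rowwise  # 2 values per call (2 columns x 1 row)
```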

riinu · July 31, 2020, 8:07am

Thanks very much everyone, this is exactly the kind of discussion and examples I was looking for. Especially @siddharthprabhu's explanation of the arguments and why both work.

Unless anyone can convince me that there are situations where c_across() is better than across() (faster?) I'll accept @siddharthprabhu's answer with the function arguments. And I'll also forget about c_across() and just use across().

Edit: On second look, @nirgrahamuk has provided a very interesting example where c_across() works but across() doesn't. But I can't understand why that is: sum() works in both, while sd() or, e.g., mean() only work with c_across().

riinu · July 31, 2020, 8:12am

Hmm very interesting this one!
I can't understand why sum() works but other functions don't (like sd(), but I also tried mean() and that doesn't work either). And why all functions work with c_across() but only some work with across().

nirgrahamuk · July 31, 2020, 8:15am

This is complete speculation on my part, but sum() is a primitive function and the others mentioned are not, which might be relevant to the observed behaviour.

siddharthprabhu · July 31, 2020, 8:59am

There is a key difference between the way these two functions operate; sum() takes ... as arguments while sd() takes a single vector (so does mean()).

args(sum)
#> function (..., na.rm = FALSE) 
#> NULL
args(sd)
#> function (x, na.rm = FALSE) 
#> NULL

Created on 2020-07-31 by the reprex package (v0.3.0)

I think the reason why one works but not the other has to do with how across() and c_across() splice arguments. Since across() is designed for column-wise transformations, the transformed variables are returned in a list which is then spliced (ref: lines 112 to 134 in across.R). This obviously isn't required for c_across().

This can also be seen in the error message generated when using across() with sd().

library(tidyverse)

df <- tibble(id = 1:4, w = runif(4), x = runif(4), y = runif(4), z = runif(4))

df %>%
  mutate(
    sd  = sd(across(w:z))
  )
#> Error: Problem with `mutate()` input `sd`.
#> x 'list' object cannot be coerced to type 'double'
#> i Input `sd` is `sd(across(w:z))`.

Created on 2020-07-31 by the reprex package (v0.3.0)

Makes sense if across() is returning a list since sd() expects a numeric vector. I would stick to c_across() for making selections to avoid running into this type of error.

Disclaimer: I'm stretching my knowledge of the tidyverse here so I can't say for sure whether this reasoning is right. Just trying to work it out as best as I can.
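
One way to probe this without dplyr, assuming across() with no functions hands back a data frame (which is what the error message suggests): the base-R sketch below passes plain data frames to sum() and sd(). The names `one_row`, `many_rows` and `sd_many` are illustrative. It appears consistent with the thread: sum() accepts a data frame, sd() copes only when each column holds a single value (the rowwise() case), and it fails otherwise.

```r
# Stand-ins for what the functions would receive: one row (as under
# rowwise()) versus several rows (no rowwise()).
one_row   <- data.frame(w = 1, x = 3)
many_rows <- data.frame(w = c(1, 2), x = c(3, 4))

# sum() works on an all-numeric data frame and adds up every cell.
sum(many_rows)

# sd() converts its input with as.double(); that succeeds when every
# column has length one ...
sd(one_row)

# ... but not when the columns are longer, hence the "'list' object
# cannot be coerced" error. try() captures the error so the script
# keeps running.
sd_many <- try(sd(many_rows), silent = TRUE)
inherits(sd_many, "try-error")

# Flattening to a plain vector first (roughly what c_across() gives you)
# avoids the problem.
sd(unlist(many_rows))
```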
