My questions are:
- is there an equivalent to scoped verbs that either does not execute sequentially, or behaves as if it does not execute sequentially (basically a column-wise map operation that is well optimized for use with database backends)?
- is sequential execution within _all scoped operators, following the order of columns in a table, an intentional design decision, and if so, what informed this decision?
- should sequential operation be documented in scoped and in the individual verbs' man pages?
Users expect that base dplyr verbs will operate on columns in sequence. This is one of the things that makes dplyr appealing, because it allows more concise code than e.g. SQL. Take the trivial example, where I want to update both column_1 and column_2 based on the values in column_1:
> my_tbl <- tibble(column_1 = c(TRUE, FALSE, TRUE, FALSE), column_2 = c(TRUE, TRUE, FALSE, FALSE))
> my_tbl
# A tibble: 4 x 2
column_1 column_2
<lgl> <lgl>
1 TRUE TRUE
2 FALSE TRUE
3 TRUE FALSE
4 FALSE FALSE
> mutate(
+ my_tbl,
+ column_1 = if_else(column_1, !column_1, column_1),
+ column_2 = if_else(column_1, !column_2, column_2)
+ )
# A tibble: 4 x 2
column_1 column_2
<lgl> <lgl>
1 FALSE TRUE
2 FALSE TRUE
3 FALSE FALSE
4 FALSE FALSE
> mutate(
+ my_tbl,
+ column_2 = if_else(column_1, !column_2, column_2),
+ column_1 = if_else(column_1, !column_1, column_1)
+ )
# A tibble: 4 x 2
column_1 column_2
<lgl> <lgl>
1 FALSE FALSE
2 FALSE TRUE
3 FALSE TRUE
4 FALSE FALSE
So clearly the order of arguments matters in the basic verbs, operation is sequential rather than parallel, and the order in which this happens is controlled by the user by ordering the arguments to the verb.
What was NOT obvious to me is that scoped verbs also operate in sequence. Rather than having each column operation act based on the table at the time it is passed to the function, each operation on a column updates the table before the next column is operated on.
The order that this happens in for _all verbs is the order that columns happen to appear in the table. This is not documented in scoped or mutate_all. Maybe it's too obvious? I feel that it's bad to have the native order of columns in a table influencing the behavior of tidy operators acting on that table, just as it's bad to have the native order of rows in a table that is not explicitly arrange()d influencing their behavior (note that e.g. slice() is not supported by dbplyr).
> mutate_all(my_tbl, ~ if_else(column_1, !.x, .x))
# A tibble: 4 x 2
column_1 column_2
<lgl> <lgl>
1 FALSE TRUE
2 FALSE TRUE
3 FALSE FALSE
4 FALSE FALSE
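One way to picture this result is as a hand-written mutate() with one argument per column, in table order. This is only a sketch of my reading of the behavior shown above, assuming mutate_all() hands one expression per column to mutate(); it is not something the documentation states. The names scoped and expanded are just for illustration:

```r
library(dplyr)

my_tbl <- tibble(
  column_1 = c(TRUE, FALSE, TRUE, FALSE),
  column_2 = c(TRUE, TRUE, FALSE, FALSE)
)

# Scoped form, as used above.
scoped <- mutate_all(my_tbl, ~ if_else(column_1, !.x, .x))

# Hand-written expansion: one argument per column, in table order.
# column_2 is computed after column_1 has already been overwritten.
expanded <- mutate(
  my_tbl,
  column_1 = if_else(column_1, !column_1, column_1),
  column_2 = if_else(column_1, !column_2, column_2)
)

# Compare the two results.
all.equal(scoped, expanded)
```

If the two agree, the scoped call is behaving like the first of the two explicit mutate() calls shown earlier, not the second.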
I'm not sure what the tidyverse equivalent of a scoped operator that simulates simultaneous column operations is, other than something like:
> map2_dfc(set_names(colnames(my_tbl)), list(my_tbl), ~ select(mutate(.y, !!.x := if_else(column_1, !(!!as.name(.x)), !!as.name(.x))), .x))
# A tibble: 4 x 2
column_1 column_2
<lgl> <lgl>
1 FALSE FALSE
2 FALSE TRUE
3 FALSE TRUE
4 FALSE FALSE
... which isn't user-friendly or database-backend-friendly. (You can fudge it with, e.g., map2 and sdf_bind_cols(), but I don't think this is an efficient way to perform this operation.)
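A possibly simpler workaround is to copy the condition column into a helper column before the scoped call, so that every column is tested against the original values. This is a minimal sketch on a local tibble only: the helper name cond_snapshot is made up for illustration, and I have not checked how well this translates or performs on a database backend.

```r
library(dplyr)

my_tbl <- tibble(
  column_1 = c(TRUE, FALSE, TRUE, FALSE),
  column_2 = c(TRUE, TRUE, FALSE, FALSE)
)

simultaneous <- my_tbl %>%
  # Copy the condition column so later updates to column_1 cannot affect it.
  mutate(cond_snapshot = column_1) %>%
  # Apply the update to every column except the helper.
  mutate_at(vars(-cond_snapshot), ~ if_else(cond_snapshot, !.x, .x)) %>%
  # Drop the helper column again.
  select(-cond_snapshot)

simultaneous
```

This still runs sequentially under the hood; it only behaves as if the columns were updated simultaneously, because the condition no longer depends on a column that is being overwritten.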