dplyr scoped verbs: questions about sequential execution

My questions are:

  • Is there an equivalent to the scoped verbs that either does not execute sequentially, or behaves as if it does not (basically a column-wise map operation that is well optimized for use with database backends)?
  • Is sequential execution within the _all scoped verbs, following the order of columns in the table, an intentional design decision, and if so, what informed it?
  • Should sequential operation be documented in the man pages for the scoped and individual verbs?

Users expect that the basic dplyr verbs will operate on columns in sequence, each expression seeing the results of the ones before it. This is one of the things that makes dplyr appealing, because it allows more concise code than e.g. SQL. Take a trivial example, where I want to update both column_1 and column_2 based on the values in column_1:

> my_tbl <- tibble(column_1 = c(TRUE, FALSE, TRUE, FALSE), column_2 = c(TRUE, TRUE, FALSE, FALSE))
> my_tbl
# A tibble: 4 x 2
  column_1 column_2
  <lgl>    <lgl>   
1 TRUE     TRUE    
2 FALSE    TRUE    
3 TRUE     FALSE   
4 FALSE    FALSE

> mutate(
+     my_tbl,
+     column_1 = if_else(column_1, !column_1, column_1),
+     column_2 = if_else(column_1, !column_2, column_2)
+     )
# A tibble: 4 x 2
  column_1 column_2
  <lgl>    <lgl>   
1 FALSE    TRUE    
2 FALSE    TRUE    
3 FALSE    FALSE   
4 FALSE    FALSE
   
> mutate(
+     my_tbl,
+     column_2 = if_else(column_1, !column_2, column_2),
+     column_1 = if_else(column_1, !column_1, column_1)
+     )
# A tibble: 4 x 2
  column_1 column_2
  <lgl>    <lgl>   
1 FALSE    FALSE   
2 FALSE    TRUE    
3 FALSE    TRUE    
4 FALSE    FALSE   

So clearly the order of arguments matters in the basic verbs: evaluation is sequential rather than simultaneous, and the user controls the order in which it happens by ordering the arguments to the verb.

What was NOT obvious to me is that the scoped verbs also operate in sequence. Rather than each column operation acting on the table as it was when passed to the function, each operation updates the table before the next column is operated on.

The order in which this happens for the _all verbs is the order in which the columns happen to appear in the table. This is not documented in scoped or mutate_all. Maybe it's too obvious? I feel it's bad to have the native order of columns in a table influence the behavior of tidy operators acting on that table, just as it's bad to have the native order of rows in a table that has not been explicitly arrange()d influence their behavior (note that e.g. slice() is not supported by dbplyr).

> mutate_all(my_tbl, ~ if_else(column_1, !.x, .x))
# A tibble: 4 x 2
  column_1 column_2
  <lgl>    <lgl>   
1 FALSE    TRUE    
2 FALSE    TRUE    
3 FALSE    FALSE   
4 FALSE    FALSE 
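
To see the column-order dependence directly (my own check, not something from the docs): if column_2 comes first in the table, it is updated against the original column_1, and the result changes:

> mutate_all(select(my_tbl, column_2, column_1), ~ if_else(column_1, !.x, .x))
# A tibble: 4 x 2
  column_2 column_1
  <lgl>    <lgl>   
1 FALSE    FALSE   
2 TRUE     FALSE   
3 TRUE     FALSE   
4 FALSE    FALSE   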

I'm not sure what the tidyverse equivalent of a scoped operator that simulates simultaneous column operations is, other than something like:

> map2_dfc(
    set_names(colnames(my_tbl)),
    list(my_tbl),
    ~ select(mutate(.y, !!.x := if_else(column_1, !(!!as.name(.x)), !!as.name(.x))), .x)
    )

# A tibble: 4 x 2
  column_1 column_2
  <lgl>    <lgl>   
1 FALSE    FALSE   
2 FALSE    TRUE    
3 FALSE    TRUE    
4 FALSE    FALSE   

... which isn't user-friendly or database-backend-friendly. (You can fudge it with, e.g., map2() and sdf_bind_cols(), but I don't think that is an efficient way to perform this operation.)

This is by design; the scoped variants just template a set of expressions.

mtcars[1:2] %>% mutate_all(~ .x / sd(.x))

is equivalent to

mtcars[1:2] %>% mutate(mpg = mpg / sd(mpg), cyl = cyl / sd(cyl))
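
A quick check of the equivalence, if you want to verify it yourself:

identical(
  mtcars[1:2] %>% mutate_all(~ .x / sd(.x)),
  mtcars[1:2] %>% mutate(mpg = mpg / sd(mpg), cyl = cyl / sd(cyl))
)
#> [1] TRUE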

Though it is possible to refer to other columns, it is not necessarily recommended. Best practice is to use these verbs to map an independent function column by column.

Is it really best practice to use functions that behave totally differently from a map operation for mapping?

It is the responsibility of the user to make sure the mapped function does not depend on other columns. If you refer to other columns, this is no longer a simple map.

To make sure your function is independent, extract the lambda into a named function:

scale <- function(x) x / sd(x)

data %>% mutate_all(scale)

My specific use case seems to require the ability to reference the (initial) state of other columns. I'm basically feeling around in the dark for a functional analogue to SQL's UPDATE ... SET ... WHERE ... pattern, but for a tidy-but-wide dataset with many variables associated with each observation.

Is there a recommended backend-compatible equivalent to a scoped verb that operates based on the state of the table when presented to the function, rather than on a rolling basis?
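
The closest backend-friendly pattern I've found so far is to snapshot the referenced columns under temporary names, mutate against the snapshots, and drop them afterwards, so everything stays as plain verbs that dbplyr can translate. A sketch using the first example (the .orig_ name is an arbitrary choice of mine):

# freeze the initial state of column_1, compute against the snapshot, drop it
my_tbl %>%
  mutate(.orig_column_1 = column_1) %>%
  mutate(
    column_1 = if_else(.orig_column_1, !column_1, column_1),
    column_2 = if_else(.orig_column_1, !column_2, column_2)
  ) %>%
  select(-.orig_column_1)

This scales poorly when many columns are referenced, though, so it's a workaround rather than an answer.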

As an aside, sparklyr 1.0.0 doesn't support the use of named functions in the manner you demonstrated - unless I'm overlooking something!

Maybe another example would be helpful. My specific use case is overriding values in a dataset based on a set of conditions. The conditions reference the initial state of the data, so an operation that updates columns sequentially can't update the dataset correctly:

my_df <- tibble(year = c("2019", "2019", "2018"), greeting = c("hello", "goodbye", "howdy"), greeter = c("man", "woman", "prospector"))

> my_df
# A tibble: 3 x 3
  year  greeting greeter   
  <chr> <chr>    <chr>     
1 2019  hello    man       
2 2019  goodbye  woman     
3 2018  howdy    prospector

> overrides <- list(
  greeting = list(
    expr(greeting == "hello" ~ "hi"),
    expr(greeting == "goodbye" ~ "bye")
  ),
  greeter = list(
    expr(greeting == "howdy" ~ "cowboy")
  ),
  year = list(
    expr(year == "2019" & greeting == "hello" ~ "2020")
  )
)
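
(Each element of overrides is meant to be spliced into a case_when() call; for example, the greeting overrides expand to:)

expr(case_when(!!!overrides$greeting, TRUE ~ greeting))
# builds: case_when(greeting == "hello" ~ "hi", greeting == "goodbye" ~ "bye", TRUE ~ greeting)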

# this is a tidyverse approach that will give a "correct" result:

> map2_dfc(
    set_names(colnames(my_df)),
    list(my_df),
    ~ select(mutate(.y, !!.x := case_when(!!!overrides[[.x]], TRUE ~ !!as.name(.x))), .x)
    )

# A tibble: 3 x 3
  year  greeting greeter
  <chr> <chr>    <chr>  
1 2020  hi       man    
2 2019  bye      woman  
3 2018  howdy    cowboy 

For Spark, I've been employing:

my_df_spark <- sdf_copy_to(sc, my_df, spark_table_name(my_df))

map2(
  set_names(colnames(my_df_spark)),
  list(my_df_spark),
  ~ select(mutate(.y, !!.x := case_when(!!!overrides[[.x]], TRUE ~ !!as.name(.x))), .x)
  ) %>%
  sdf_bind_cols()

However, my understanding is that this solution is not well optimized for database backends.
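
One workaround that should stay backend-friendly, since everything happens in a single mutate() that dbplyr/sparklyr can translate to one SELECT, is to write each case_when() result to a fresh temporary name, so no expression can see another expression's output, and then swap the results in. A sketch of my own (the .new_ prefix is an arbitrary choice):

library(dplyr)
library(purrr)
library(rlang)

# one case_when() per overridden column, each targeting a fresh ".new_" name,
# so every condition is evaluated against the original columns
new_exprs <- imap(overrides, ~ expr(case_when(!!!.x, TRUE ~ !!sym(.y))))
names(new_exprs) <- paste0(".new_", names(overrides))

my_df %>%
  mutate(!!!new_exprs) %>%  # a single mutate() call
  # every column here has an override; otherwise keep the untouched ones too
  select(year = .new_year, greeting = .new_greeting, greeter = .new_greeter)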

I imagine that performing this kind of conditional update is a common problem, but the tidyverse doesn't seem to offer a very user-friendly abstraction for it.

I have opened a feature request: https://github.com/tidyverse/dplyr/issues/4517

