Arrange calculated columns immediately after source columns when using dplyr column-wise operations

brad.cannell · April 13, 2021, 2:39pm

I'm trying to arrange calculated columns immediately after source columns when using dplyr column-wise operations

library(dplyr)

Example data

df <- tibble(
  id         = c(1, 2, 2),
  id_row     = c(1, 1, 2),
  name_first = c("John", "Jane", "Jane"),
  city       = c("NY", "DAL", "HOU"),
  x          = c(0, 1, 0)
)

# A tibble: 3 x 5
     id id_row name_first city      x
  <dbl>  <dbl> <chr>      <chr> <dbl>
1     1      1 John       NY        0
2     2      1 Jane       DAL       1
3     2      2 Jane       HOU       0

Within ID, the values of name_first and city should be constant. The values for id_row and x need not be constant. I want to test for and inspect differing values of name_first and city within-id.

One convenient way to do this is with mutate() and across().

df %>% 
  group_by(id) %>% 
  mutate(
    across(
      .cols  = c(name_first, city),
      .fns   = ~ length(unique(.x)) == 1,
      .names = "{col}_all_match"
    )
  )

# A tibble: 3 x 7
# Groups:   id [2]
     id id_row name_first city      x name_first_all_match city_all_match
  <dbl>  <dbl> <chr>      <chr> <dbl> <lgl>                <lgl>         
1     1      1 John       NY        0 TRUE                 TRUE          
2     2      1 Jane       DAL       1 TRUE                 FALSE         
3     2      2 Jane       HOU       0 TRUE                 FALSE

The issue is that the "all_match" (calculated) columns are added to the far right of the data frame rather than immediately after their source column. This makes it difficult to visually inspect differing values for the columns of interest.

Of course, in this small data frame, we could easily rearrange the columns using select().

df %>% 
  group_by(id) %>% 
  mutate(
    across(
      .cols  = c(name_first, city),
      .fns   = ~ length(unique(.x)) == 1,
      .names = "{col}_all_match"
    )
  ) %>% 
  select(id, id_row, starts_with("name_first"), starts_with("city"), x)

# A tibble: 3 x 7
# Groups:   id [2]
     id id_row name_first name_first_all_match city  city_all_match     x
  <dbl>  <dbl> <chr>      <lgl>                <chr> <lgl>          <dbl>
1     1      1 John       TRUE                 NY    TRUE               0
2     2      1 Jane       TRUE                 DAL   FALSE              1
3     2      2 Jane       TRUE                 HOU   FALSE              0

The issue with that approach is that it quickly becomes pretty cumbersome with more columns. A more tractable approach would be to sort the names alphabetically...

df %>% 
  group_by(id) %>% 
  mutate(
    across(
      .cols  = c(name_first, city),
      .fns   = ~ length(unique(.x)) == 1,
      .names = "{col}_all_match"
    )
  ) %>% 
  select(sort(names(.)))

# A tibble: 3 x 7
# Groups:   id [2]
  city  city_all_match    id id_row name_first name_first_all_match     x
  <chr> <lgl>          <dbl>  <dbl> <chr>      <lgl>                <dbl>
1 NY    TRUE               1      1 John       TRUE                     0
2 DAL   FALSE              2      1 Jane       TRUE                     1
3 HOU   FALSE              2      2 Jane       TRUE                     0

...but in my situation I need to preserve the original column order. I'd also prefer to stick with Tidyverse solutions if possible.

Any ideas are appreciated!

Also posted on Stack Overflow

mara · April 13, 2021, 3:25pm

Have you tried using dplyr::relocate() with either the .after or .before arguments?

mara · April 13, 2021, 3:34pm

I'm actually experimenting myself right now. Let me kick this out to one of the devs to see if we can get some clarity (it's virtual workweek, so we might be slower to respond than usual).

brad.cannell · April 13, 2021, 4:46pm

Hi @mara and @Yarnabrina,

Thank you for the suggestion! Unfortunately, as you've pointed out, that seems to only allow me to relocate one column at a time. I need to relocate each column iteratively. I tried using across() inside of relocate(), but got an error.

brad.cannell · April 13, 2021, 4:49pm

Hi @mara and @Yarnabrina ,

Here is a solution I got from SO, with some minor adjustments:

Select the columns of interest

# Use select because it allows for more complex column selection when working 
# with more complex data frames.
inspect_cols <- df %>% select(name_first, city) %>% names()

Set column order

# Set column order ahead of time. This assumes that you know the names of each 
# of the columns you want to inspect
col_order <- purrr::map(
  names(df), 
  function(x) {
    if (x %in% inspect_cols) {
      c(x, paste0(x, "_all_match"))
    } else {
      x
    }
  }
) %>% 
  unlist()

Check for discrepancies

df %>% 
  group_by(id) %>% 
  mutate(
    across(
      .cols  = all_of(inspect_cols),
      .fns   = ~ length(unique(.x)) == 1,
      .names = "{col}_all_match"
    )
  ) %>% 
  dplyr::select(all_of(col_order))

# A tibble: 3 x 7
# Groups:   id [2]
     id id_row name_first name_first_all_match city  city_all_match     x
  <dbl>  <dbl> <chr>      <lgl>                <chr> <lgl>          <dbl>
1     1      1 John       TRUE                 NY    TRUE               0
2     2      1 Jane       TRUE                 DAL   FALSE              1
3     2      2 Jane       TRUE                 HOU   FALSE              0

It doesn't feel like the most satisfying solution, but it does get the job done. I'm totally open to any improvements you have to offer! Thanks again!

lionel · April 14, 2021, 9:31am

The hard part in this problem is that we want to relocate different batches of columns. Here is a (rather experimental) design to achieve this:

df <- tibble(
  id         = c(1, 2, 2),
  id_row     = c(1, 1, 2),
  name_first = c("John", "Jane", "Jane"),
  city       = c("NY", "DAL", "HOU"),
  x          = c(0, 1, 0)
)
matched_df <- df %>% 
  group_by(id) %>%
  mutate(
    across(
      .cols  = c(name_first, city),
      .fns   = ~ length(unique(.x)) == 1,
      .names = "{col}_all_match"
    )
  )

relocate_matches <- function(.data, .match, ..., .matcher = starts_with) {
  for (x in .match) {
    sel <- tidyselect::eval_select(quote(.matcher(x, ...)), .data)

    if (length(sel) > 1) {
      .data <- relocate(.data, all_of(sel), .after = all_of(sel[[1]]))
    }
  }

  .data
}

matched_df %>% relocate_matches(c("name_first", "city"))
#> # A tibble: 3 x 7
#> # Groups:   id [2]
#>      id id_row name_first name_first_all_… city  city_all_match     x
#>   <dbl>  <dbl> <chr>      <lgl>            <chr> <lgl>          <dbl>
#> 1     1      1 John       TRUE             NY    TRUE               0
#> 2     2      1 Jane       TRUE             DAL   FALSE              1

We relocate in batch by looping over the matcher arguments. The invokation is set up to require a matcher like starts_with() or contains(), but argument passing and matcher selection is straightforward:

matched_df %>%
  relocate_matches(c("name_first", "city"), ignore.case = FALSE)

matched_df %>%
  relocate_matches(c("name_(first|last)", "city"), .matcher = matches)

mara · April 14, 2021, 12:47pm

Hope lionel's solution works a bit better for you! Just for future reference, if you cross-post a question to SO (or anywhere, really), could you please also post the link here? That way no one gets their wires crossed and starts working on an answer to something that's already been solved elsewhere.

Thanks!

brad.cannell · April 14, 2021, 2:29pm

Thank you, @lionel and @mara! Yes, I will definitely post the SO link next time!

romain · April 19, 2021, 8:30am

Here is an alternative approach based on data frame columns. The idea is that the all_match() function called by across() returns a data frame with both the original column and the added one.

library(tidyverse)

df <- tibble(
  id         = c(1, 2, 2),
  id_row     = c(1, 1, 2),
  name_first = c("John", "Jane", "Jane"),
  city       = c("NY", "DAL", "HOU"),
  x          = c(0, 1, 0)
)

all_match <- function(name, x) {
  tibble("{name}" := x, "{name}_all_match" := length(unique(x)) == 1)
}
all_match("name_first", df$name_first)
#> # A tibble: 3 x 2
#>   name_first name_first_all_match
#>   <chr>      <lgl>               
#> 1 John       FALSE               
#> 2 Jane       FALSE               
#> 3 Jane       FALSE

Then we can call all_match() on the selected set of columns:

df %>% 
  group_by(id) %>% 
  mutate(
    across(c(name_first, city), ~ all_match(cur_column(), .x))
  )
#> # A tibble: 3 x 5
#> # Groups:   id [2]
#>      id id_row name_first$name… $name_first_all… city$city $city_all_match     x
#>   <dbl>  <dbl> <chr>            <lgl>            <chr>     <lgl>           <dbl>
#> 1     1      1 John             TRUE             NY        TRUE                0
#> 2     2      1 Jane             TRUE             DAL       FALSE               1
#> 3     2      2 Jane             TRUE             HOU       FALSE               0

name_first and city are then data frame columns, this structure might be of interest for its ease of access, but otherwise you can go further and tidyr::unpack() them.

df %>% 
  group_by(id) %>% 
  mutate(
    across(c(name_first, city), ~ all_match(cur_column(), .x))
  ) %>% 
  unpack(c(name_first, city))
#> # A tibble: 3 x 7
#> # Groups:   id [2]
#>      id id_row name_first name_first_all_match city  city_all_match     x
#>   <dbl>  <dbl> <chr>      <lgl>                <chr> <lgl>          <dbl>
#> 1     1      1 John       TRUE                 NY    TRUE               0
#> 2     2      1 Jane       TRUE                 DAL   FALSE              1
#> 3     2      2 Jane       TRUE                 HOU   FALSE              0

Ultimately, we can combine both the mutate() + across() step and the unpack() step into a single function so that we don't have to repeat the the selection twice:

next_all_match <- function(.data, selection) {
  .data %>% 
    mutate( 
      across({{ selection}}, ~ all_match(cur_column(), .x))
    ) %>% 
    unpack({{ selection }})
}

df %>% 
  group_by(id) %>% 
  next_all_match(c(name_first, city))
#> # A tibble: 3 x 7
#> # Groups:   id [2]
#>      id id_row name_first name_first_all_match city  city_all_match     x
#>   <dbl>  <dbl> <chr>      <lgl>                <chr> <lgl>          <dbl>
#> 1     1      1 John       TRUE                 NY    TRUE               0
#> 2     2      1 Jane       TRUE                 DAL   FALSE              1
#> 3     2      2 Jane       TRUE                 HOU   FALSE              0

^{Created on 2021-04-19 by the reprex package (v0.3.0)}

brad.cannell · April 19, 2021, 2:47pm

Hi @romain,

This is a really nice solution! Thank you!

Leighton · April 21, 2021, 6:21pm

I appreciate your answers, I also found a solution to my problem, thanks!

system · May 12, 2021, 6:21pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.