Compute pairwise column differences based on matching names

dplyr

#1

Consider the following data frame:

test <- structure(list(date = structure(c(1477872000, 1480464000, 1483142400, 
                                   1485820800, 1488240000, 1490918400, 1493510400, 1496188800, 1498780800, 
                                   1501459200, 1504137600, 1506729600, 1509408000, 1.512e+09, 1514678400, 
                                   1517356800, 1519776000, 1522454400, 1525046400, 1527724800, 1530316800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), alpha = c(283L, 
                                                             298L, 277L, 231L, 276L, 323L, 242L, 255L, 208L, 289L, 284L, 263L, 
                                                             280L, 278L, 269L, 288L, 255L, 324L, 339L, 355L, 300L), alpha_mod = c(133L, 
                                                                                                                                  139L, 106L, 85L, 132L, 141L, 89L, 110L, 80L, 142L, 174L, 159L, 
                                                                                                                                  146L, 162L, 153L, 161L, 142L, 174L, 211L, 208L, 194L), beta = c(260L, 
                                                                                                                                                                                                  278L, 249L, 242L, 301L, 349L, 249L, 309L, 256L, 280L, 326L, 276L, 
                                                                                                                                                                                                  299L, 322L, 235L, 281L, 256L, 293L, 356L, 307L, 279L), beta_mod = c(102L, 
                                                                                                                                                                                                                                                                      119L, 92L, 107L, 119L, 126L, 108L, 132L, 89L, 141L, 199L, 148L, 
                                                                                                                                                                                                                                                                      161L, 160L, 125L, 159L, 137L, 139L, 208L, 177L, 162L), gam_ma = c(208L, 
                                                                                                                                                                                                                                                                                                                                        190L, 176L, 208L, 221L, 265L, 204L, 215L, 251L, 283L, 314L, 257L, 
                                                                                                                                                                                                                                                                                                                                        250L, 290L, 240L, 290L, 275L, 295L, 292L, 316L, 324L), gam_ma_mod = c(64L, 
                                                                                                                                                                                                                                                                                                                                                                                                              67L, 62L, 86L, 78L, 76L, 67L, 67L, 90L, 128L, 155L, 106L, 125L, 
                                                                                                                                                                                                                                                                                                                                                                                                              132L, 125L, 143L, 132L, 159L, 159L, 158L, 191L)), row.names = c(NA, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -21L), class = c("tbl_df", "tbl", "data.frame"))

Created on 2018-07-12 by the reprex package (v0.2.0).

(btw, the very weird formatting of the structure object is courtesy of RStudio editor's auto-formatting - I don't know why, but it always messes with my structures).

Question
For each pair of columns x and x_mod I would like to add a third column x_unmod <- x - x_mod to test. Note that the date column doesn't have a corresponding date_mod column, thus no column date_unmod must be included. How can I do that? The most obvious solution would be to use the fact that the x columns are the even columns and the x_mod columns are the odd columns (except for the first column), but I would like to see a solution which relies only (or mostly) on the column names.


#2

I think this would do what you want:

test <- structure(list(date = structure(c(1477872000, 1480464000, 1483142400, 
                                          1485820800, 1488240000, 1490918400, 1493510400, 1496188800, 1498780800, 
                                          1501459200, 1504137600, 1506729600, 1509408000, 1.512e+09, 1514678400, 
                                          1517356800, 1519776000, 1522454400, 1525046400, 1527724800, 1530316800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), alpha = c(283L, 
                                                             298L, 277L, 231L, 276L, 323L, 242L, 255L, 208L, 289L, 284L, 263L, 
                                                             280L, 278L, 269L, 288L, 255L, 324L, 339L, 355L, 300L), alpha_mod = c(133L, 
                                                                                                                                  139L, 106L, 85L, 132L, 141L, 89L, 110L, 80L, 142L, 174L, 159L, 
                                                                                                                                  146L, 162L, 153L, 161L, 142L, 174L, 211L, 208L, 194L), beta = c(260L, 
                                                                                                                                                                                                  278L, 249L, 242L, 301L, 349L, 249L, 309L, 256L, 280L, 326L, 276L, 
                                                                                                                                                                                                  299L, 322L, 235L, 281L, 256L, 293L, 356L, 307L, 279L), beta_mod = c(102L, 
                                                                                                                                                                                                                                                                      119L, 92L, 107L, 119L, 126L, 108L, 132L, 89L, 141L, 199L, 148L, 
                                                                                                                                                                                                                                                                      161L, 160L, 125L, 159L, 137L, 139L, 208L, 177L, 162L), gam_ma = c(208L, 
                                                                                                                                                                                                                                                                                                                                        190L, 176L, 208L, 221L, 265L, 204L, 215L, 251L, 283L, 314L, 257L, 
                                                                                                                                                                                                                                                                                                                                        250L, 290L, 240L, 290L, 275L, 295L, 292L, 316L, 324L), gam_ma_mod = c(64L, 
                                                                                                                                                                                                                                                                                                                                                                                                              67L, 62L, 86L, 78L, 76L, 67L, 67L, 90L, 128L, 155L, 106L, 125L, 
                                                                                                                                                                                                                                                                                                                                                                                                              132L, 125L, 143L, 132L, 159L, 159L, 158L, 191L)), row.names = c(NA, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -21L), class = c("tbl_df", "tbl", "data.frame"))

library(tidyverse)
nms <- test %>%
  dplyr::select_at(vars(matches("_mod"))) %>%
  names() %>%
  gsub(pattern = "_mod", replacement = "", x = .)

add_unmod <- function(df, nm){
  nm_sym <- rlang::sym(nm)
  nm_unmod <- rlang::sym(paste0(nm, "_unmod"))
  nm_mod <- rlang::sym(paste0(nm, "_mod"))
  
  df %>%
    dplyr::mutate(!!nm_unmod := !!nm_sym - !!nm_mod)
}

add_unmods <- purrr::map(nms, ~purrr::partial(add_unmod, nm = .x)) %>%
  Reduce(f = purrr::compose, x = .)

test %>%
  add_unmods()
#> # A tibble: 21 x 10
#>    date                alpha alpha_mod  beta beta_mod gam_ma gam_ma_mod
#>    <dttm>              <int>     <int> <int>    <int>  <int>      <int>
#>  1 2016-10-31 00:00:00   283       133   260      102    208         64
#>  2 2016-11-30 00:00:00   298       139   278      119    190         67
#>  3 2016-12-31 00:00:00   277       106   249       92    176         62
#>  4 2017-01-31 00:00:00   231        85   242      107    208         86
#>  5 2017-02-28 00:00:00   276       132   301      119    221         78
#>  6 2017-03-31 00:00:00   323       141   349      126    265         76
#>  7 2017-04-30 00:00:00   242        89   249      108    204         67
#>  8 2017-05-31 00:00:00   255       110   309      132    215         67
#>  9 2017-06-30 00:00:00   208        80   256       89    251         90
#> 10 2017-07-31 00:00:00   289       142   280      141    283        128
#> # ... with 11 more rows, and 3 more variables: gam_ma_unmod <int>,
#> #   beta_unmod <int>, alpha_unmod <int>

Created on 2018-07-13 by the reprex package (v0.2.0).


#3

It works perfectly! However, I don't understand the code very well, so I have a few questions:

  1. I get the basic idea (instead than using a for loop, use functional programming, i.e., define a function and apply it to a list using the great purrr::map) but all the tidyeval part is quite obscure to me. Would this function be equivalent to your add_unmod?

    my_add_unmod <- function(df, colname){
    
       unmod_colname <- paste0(colname, "_unmod")
       mod_colname   <- paste0(colname, "_mod")
    
       df[, unmod_colname] <- df[, colname] - df[, mod_colname]
       df
    }
    
  2. Why wouldn't purrr::map be enough, and you have instead to define a new function with

     add_unmods <- purrr::map(nms, ~purrr::partial(add_unmod, nm = .x)) %>%
          Reduce(f = purrr::compose, x = .)
    

    ? What do purrr::partial and Reduce do? I think I know what partial does (given a function with some dummy arguments, it should define a new function by setting some of the dummy arguments to some fixed actual arguments), but I know little about Reduce (I tried ?Reduce but base R help didn't illuminate me).

  3. I would have used slightly longer variable names :stuck_out_tongue_winking_eye: it took me a bit to realize that nms stands for names and nm stands for name. It doesn't change the functionality of the code of course, but using longer variable names makes it a bit more readable.

Thanks a lot for the answer and sorry for the maybe naive questions, but my knowledge of functional programming & tidy evaluation is much more basic than yours!


#4

Great, good to know it works :slight_smile:

  1. It looks like this might work. At this point tidyeval at this level seems to me a bit more readable, but I can totally understand if it is not as clear. It took me a while to grok it, so I may use it a bit too much to show off :slight_smile: .
  2. Your understanding of purrr::partial is correct. It takes a function and a list of arguments. It returns a modified function where all of the arguments are "baked in". I use it in this example to make sure that add_unmod takes only one argument (df) and returns modified df in return.
    Reduce is a bit obscure, I agree :slight_smile: . Why can't you use purrr::map instead? Well, purrr::map will return list back, but what you want to have is a modified dataframe. For that reason we need to find a way to combine three functions prepared with purrr::partial into one function that takes one argument (df) and returns modified df with all three functions applied to it.
    This is the job of purrr::compose and Reduce. purrr::compose takes in two functions (e.g., f1 and f2) and returns another function that would first apply f1 and then f2. Reduce takes in a list and a function that will be consequently applied to elements of this list. For example, if you have a function + and list with 3 elements (1, 2 and 3), Reduce will basically return ``+(+(1, 2), 3). However, in our case it will return a function (because purrr::compose ultimately returns a function) that takes in a df and applies 3 functions to it in turn.
    It is not as complicated as I make it look, I promise :slight_smile:.
  3. It is a fairly common way to name names with nms, so singular of nms is nm, isn't it? :slight_smile: One reason to not call variable names is that there is already a function names and it makes it a bit confusing.

#5

If either of you (anyone) think the docs for a specific function are lacking (e.g. for purrr::partial(), or any of the functions listed above), please always feel free to file an issue re. submitting a PR to improve them! That's one of the things that's most difficult to see for the authors of the package (curse of expertise), but that vastly improves the usability.


#6

I've given it a little bit more thought and modified my solution to be a little more tidyversish. It's mostly for myself, but since it solves your problem all the same, I'll add it here for posterity

test <- structure(list(date = structure(c(1477872000, 1480464000, 1483142400, 
                                          1485820800, 1488240000, 1490918400, 1493510400, 1496188800, 1498780800, 
                                          1501459200, 1504137600, 1506729600, 1509408000, 1.512e+09, 1514678400, 
                                          1517356800, 1519776000, 1522454400, 1525046400, 1527724800, 1530316800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), alpha = c(283L, 
                                                             298L, 277L, 231L, 276L, 323L, 242L, 255L, 208L, 289L, 284L, 263L, 
                                                             280L, 278L, 269L, 288L, 255L, 324L, 339L, 355L, 300L), alpha_mod = c(133L, 
                                                                                                                                  139L, 106L, 85L, 132L, 141L, 89L, 110L, 80L, 142L, 174L, 159L, 
                                                                                                                                  146L, 162L, 153L, 161L, 142L, 174L, 211L, 208L, 194L), beta = c(260L, 
                                                                                                                                                                                                  278L, 249L, 242L, 301L, 349L, 249L, 309L, 256L, 280L, 326L, 276L, 
                                                                                                                                                                                                  299L, 322L, 235L, 281L, 256L, 293L, 356L, 307L, 279L), beta_mod = c(102L, 
                                                                                                                                                                                                                                                                      119L, 92L, 107L, 119L, 126L, 108L, 132L, 89L, 141L, 199L, 148L, 
                                                                                                                                                                                                                                                                      161L, 160L, 125L, 159L, 137L, 139L, 208L, 177L, 162L), gam_ma = c(208L, 
                                                                                                                                                                                                                                                                                                                                        190L, 176L, 208L, 221L, 265L, 204L, 215L, 251L, 283L, 314L, 257L, 
                                                                                                                                                                                                                                                                                                                                        250L, 290L, 240L, 290L, 275L, 295L, 292L, 316L, 324L), gam_ma_mod = c(64L, 
                                                                                                                                                                                                                                                                                                                                                                                                              67L, 62L, 86L, 78L, 76L, 67L, 67L, 90L, 128L, 155L, 106L, 125L, 
                                                                                                                                                                                                                                                                                                                                                                                                              132L, 125L, 143L, 132L, 159L, 159L, 158L, 191L)), row.names = c(NA, 
                                                                                                                                                                                                                                                                                                                                                                                                                                                                              -21L), class = c("tbl_df", "tbl", "data.frame"))


library(tidyverse)
library(rlang)
#> 
#> Attaching package: 'rlang'
#> The following objects are masked from 'package:purrr':
#> 
#>     %@%, %||%, as_function, flatten, flatten_chr, flatten_dbl,
#>     flatten_int, flatten_lgl, invoke, list_along, modify, prepend,
#>     rep_along, splice
nms <- test %>%
  dplyr::select_at(vars(matches("_mod"))) %>%
  names() %>%
  gsub(pattern = "_mod", replacement = "", x = .)

add_unmod <- function(nm){
  nm_sym <- rlang::sym(nm)
  nm_mod <- rlang::sym(paste0(nm, "_mod"))
  
  rlang::quo(!!nm_sym - !!nm_mod)
}

add_unmods <- purrr::map(nms, ~add_unmod(.x)) %>% 
  purrr::set_names(paste0(nms, "_unmod"))

test %>%
  dplyr::mutate(!!!add_unmods)
#> # A tibble: 21 x 10
#>    date                alpha alpha_mod  beta beta_mod gam_ma gam_ma_mod
#>    <dttm>              <int>     <int> <int>    <int>  <int>      <int>
#>  1 2016-10-31 00:00:00   283       133   260      102    208         64
#>  2 2016-11-30 00:00:00   298       139   278      119    190         67
#>  3 2016-12-31 00:00:00   277       106   249       92    176         62
#>  4 2017-01-31 00:00:00   231        85   242      107    208         86
#>  5 2017-02-28 00:00:00   276       132   301      119    221         78
#>  6 2017-03-31 00:00:00   323       141   349      126    265         76
#>  7 2017-04-30 00:00:00   242        89   249      108    204         67
#>  8 2017-05-31 00:00:00   255       110   309      132    215         67
#>  9 2017-06-30 00:00:00   208        80   256       89    251         90
#> 10 2017-07-31 00:00:00   289       142   280      141    283        128
#> # ... with 11 more rows, and 3 more variables: alpha_unmod <int>,
#> #   beta_unmod <int>, gam_ma_unmod <int>

Created on 2018-07-14 by the reprex package (v0.2.0).


#7
  1. ok!
  2. Ok, thanks a lot for the explanation. I need to think a bit more on whether I could pull this off without Reduce...but I see you posted another solution which doesn't use Reduce. I'll check that.
  3. I've never seen that abbreviation before, but now I know! Thanks.

#8

Interesting. Can you explain a bit the modifications you introduced? Specifically, why you moved the mutate outside of add_unmod and removed df from its arguments? What is quo doing in add_unmod? And why isn't Reduce needed anymore? Thanks,

Andrea


#9

The main idea is that it is better (I think) to construct function calls using quo instead of creating multiple functions that all take the same dataframe in. This solution is also using less moving parts (no purrr::partial, Reduce or purrr::compose). Fewer moving parts is

  1. less code,
  2. less maintenance,
  3. easier to explain,
  4. easier to reason about.

rlang::quo create a quosure which is a main working horse behind tidy evaluation framework. It allows you to build complex function calls piece by piece and then evaluate them in a right context. Specifically here, I use it to create 3 functions that you need and then use another tool from tidyeval (!!!) to splice them into dplyr::mutate call (this is why there is no need for Reduce).
I recommend reading this vignette as well as chapters on meta-programming from 2nd edition of "Advanced R" to understand in a little bit more detail what all of that means.


#10

Thanks @mishabalyasin, I'll check your references.