pairwise() function for use within dplyr::mutate() and dplyr::summarise()?

brshallo · January 31, 2021, 6:05am

I am looking for a function for doing arbitrary pairwise operations that can be applied in an analogous way to dplyr::across() (i.e. could be used in mutate() or summarise() and handle groups, etc).

Fictive examples of such a pairwise() function:

library(dplyr)

cor_p_value <- function(x, y){
  stats::cor.test(x, y)$p.value
}

ks_p_value <- function(x, y){
  stats::ks.test(x, y)$p.value
}

iris <- as_tibble(iris)

# hypothetical use within mutate()
iris %>% 
  mutate(pairwise(.col = where(is.numeric), 
                  .fns = ~ .x / .y, # could also just do `/`
                  .names = "ratio_{.col$.x}_{.col$.y}",
                  associative = FALSE)
         ) %>% 
  glimpse()
#> Rows: 150
#> Columns: 17
#> $ Sepal.Length                    <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, ...
#> $ Sepal.Width                     <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, ...
#> $ Petal.Length                    <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, ...
#> $ Petal.Width                     <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, ...
#> $ Species                         <fct> setosa, setosa, setosa, setosa, set...
#> $ ratio_Sepal.Length_Sepal.Width  <dbl> 1.457143, 1.633333, 1.468750, 1.483...
#> $ ratio_Sepal.Length_Petal.Length <dbl> 3.642857, 3.500000, 3.615385, 3.066...
#> $ ratio_Sepal.Length_Petal.Width  <dbl> 25.50000, 24.50000, 23.50000, 23.00...
#> $ ratio_Sepal.Width_Petal.Length  <dbl> 2.500000, 2.142857, 2.461538, 2.066...
#> $ ratio_Sepal.Width_Petal.Width   <dbl> 17.50000, 15.00000, 16.00000, 15.50...
#> $ ratio_Petal.Length_Petal.Width  <dbl> 7.000000, 7.000000, 6.500000, 7.500...
#> $ ratio_Sepal.Width_Sepal.Length  <dbl> 0.6862745, 0.6122449, 0.6808511, 0....
#> $ ratio_Petal.Length_Sepal.Length <dbl> 0.2745098, 0.2857143, 0.2765957, 0....
#> $ ratio_Petal.Width_Sepal.Length  <dbl> 0.03921569, 0.04081633, 0.04255319,...
#> $ ratio_Petal.Length_Sepal.Width  <dbl> 0.4000000, 0.4666667, 0.4062500, 0....
#> $ ratio_Petal.Width_Sepal.Width   <dbl> 0.05714286, 0.06666667, 0.06250000,...
#> $ ratio_Petal.Width_Petal.Length  <dbl> 0.14285714, 0.14285714, 0.15384615,...

# hypothetical use within summarise()
iris %>% 
  group_by(Species) %>% 
  summarise(pairwise(.col = where(is.numeric),
                     .fns = list(ksp = ks_p_value, corp = cor_p_value),
                     .names = "{.fn}_{.col$.x}_{.col$.y}",
                     associative = TRUE)
            ) %>% 
  glimpse()
#> Rows: 3
#> Columns: 13
#> $ Species                        <fct> setosa, versicolor, virginica
#> $ ksp_Sepal.Length_Sepal.Width   <dbl> 0, 0, 0
#> $ ksp_Sepal.Length_Petal.Length  <dbl> 0.000000e+00, 0.000000e+00, 6.951782...
#> $ ksp_Sepal.Length_Petal.Width   <dbl> 0, 0, 0
#> $ ksp_Sepal.Width_Petal.Length   <dbl> 0, 0, 0
#> $ ksp_Sepal.Width_Petal.Width    <dbl> 0, 0, 0
#> $ ksp_Petal.Length_Petal.Width   <dbl> 0, 0, 0
#> $ corp_Sepal.Length_Sepal.Width  <dbl> 6.709843e-10, 8.771860e-05, 8.434625...
#> $ corp_Sepal.Length_Petal.Length <dbl> 6.069778e-02, 2.586190e-10, 6.297786...
#> $ corp_Sepal.Length_Petal.Width  <dbl> 5.052644e-02, 4.035422e-05, 4.798149...
#> $ corp_Sepal.Width_Petal.Length  <dbl> 2.169789e-01, 2.302168e-05, 3.897704...
#> $ corp_Sepal.Width_Petal.Width   <dbl> 1.038211e-01, 1.466661e-07, 5.647610...
#> $ corp_Petal.Length_Petal.Width  <dbl> 1.863892e-02, 1.271916e-11, 2.253577...

# don't know if this is a realistic way to handle .names 
# also may make sense to have an option to pivot columns so the output ends up in a format more like `corrr`
# may be interesting to make a pwise() function (like pmap()) that handles combinations of 2, 3, ... p columns
# maybe indirectly run `corrr::colpair_map()` ...

What would such a pairwise() function (or similar) look like? (Ideally set-up in a way that is consistent with tidyverse + tidyselection etc.)

Or what are the resources I need to read in order to go about setting it up? (Eg How would I grab the .col variables selected and then set-up the respective .x and .y combinations from these? etc...)
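As a rough illustration of the last point, here is a minimal base R sketch of how the .x and .y combinations might be built once the column names are known. The column names are picked with vapply() rather than tidyselect, and the object names (cols, unordered, ordered, ratios) are only illustrative assumptions, not part of any existing API.

```r
# Numeric column names, standing in for what a tidyselect step would return.
cols <- names(iris)[vapply(iris, is.numeric, logical(1))]

# Unordered pairs: each {x, y} appears once (enough when f(x, y) == f(y, x)).
unordered <- t(utils::combn(cols, 2))

# Ordered pairs: both (x, y) and (y, x) appear; self-pairs are dropped.
grid <- expand.grid(x = cols, y = cols, stringsAsFactors = FALSE)
ordered <- grid[grid$x != grid$y, ]

# Apply a two-argument function to each unordered pair of columns.
ratios <- Map(function(x, y) iris[[x]] / iris[[y]],
              unordered[, 1], unordered[, 2])
names(ratios) <- paste("ratio", unordered[, 1], unordered[, 2], sep = "_")
```

With the four numeric iris columns this gives 6 unordered and 12 ordered pairs, which matches the column counts in the two hypothetical outputs above.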

Note on current approaches:

There are excellent tidyverse friendly packages that can be used to create the output I describe above (e.g. corrr, widyr, parts of recipes). I've written/tweeted about these and related approaches, documented here. The specific outputs pasted into the code snippets for this question were returned via the code at this gist and uses those methods.

With this topic, I am interested in an approach for pairwise operations that (in some circumstances) may be slightly easier to chain with piped dplyr verbs (though am not attached to the exact particulars of the pairwise() function I describe above).

jameslairdsmith · February 14, 2021, 2:20pm

This is a really dedicated and thoughtful idea for what is an interesting problem. And you've really gone above and beyond in cataloguing the different ways this could be approached.

I like pairwise() as a function name and I really like the way that it's used here in the same way as across() would be. My guess is that, in order to make it work inside of mutate(), summarise() etc., it would need to be part of dplyr, because that seems to be where the required plumbing is (the plumbing that makes e.g. across() do what it does).

brshallo · February 24, 2021, 7:57pm

@lionel any tips on how to go about getting a tidyselected set of columns and then applying a function to the permuted sets of those columns in a way that would facilitate above?

carlomedina · February 25, 2021, 5:41am

@brshallo I saw your message on r4ds about this and thought it's a pretty nifty use case!

This is the first time I'm going through the internals of across.R and I don't really understand them that well (I need to join an Advanced R r4ds book club...), but with pattern matching and not-so-clean code, I think I was able to prototype it within the dplyr codebase.

Using your examples above, I think this somewhat replicates the output

devtools::install_github("carlomedina/dplyr", ref="pairwise")
library(dplyr)

cor_p_value <- function(x, y){
  stats::cor.test(x, y)$p.value
}

ks_p_value <- function(x, y){
  stats::ks.test(x, y, exact = FALSE)$p.value
}

iris %>% 
  group_by(Species) %>%
  summarise(
    pairwise(
      .col = where(is.numeric),
      .fns = list(ksp = ks_p_value, corp = cor_p_value),
      .is_commutative  = TRUE
    )
  ) %>% 
  glimpse()

iris %>% 
  group_by(Species) %>%
  summarise(
    pairwise(
      .col = where(is.numeric),
      .fns = list(ksp = ks_p_value, corp = cor_p_value),
      .is_commutative  = FALSE
    )
  ) %>% 
  glimpse()

iris %>% 
  mutate(
    pairwise(
      .col = where(is.numeric), 
      .fns = ~ .x / .y, 
      .names = "ratio_{.col_x}_{.col_y}",
      .is_commutative = TRUE
    )
  ) %>% 
  glimpse()

iris %>% 
  mutate(
    pairwise(
      .col = where(is.numeric), 
      .fns = ~ .x / .y, 
      .names = "ratio_{.col_x}_{.col_y}",
      .is_commutative = FALSE
    )
  ) %>% 
  glimpse()

Note the slight change in the API: the argument is .is_commutative rather than associative.

lionel · February 25, 2021, 8:13am

I think you could implement pairwise() using tidyselect to take the selection, and dplyr::cur_data() to get the current data frame. Note that with grouped data frames it won't be as performant as across() because dplyr tries to inline across() operations when possible.
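A minimal sketch of that suggestion might look like the following. The name pairwise_sketch() and its arguments are hypothetical and are not part of dplyr or of the prototype above; the output names are hard-wired to "{x}_{y}" instead of supporting a .names glue spec. It uses dplyr::cur_data() as suggested here, which newer dplyr releases deprecate in favour of pick(), so treat this as an illustration of the idea and not a finished implementation.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): applies .fn to each pair of
# selected columns and returns a tibble, which mutate()/summarise() unpack.
pairwise_sketch <- function(.cols, .fn, .is_commutative = TRUE) {
  .fn  <- rlang::as_function(.fn)      # accept functions and ~ .x / .y lambdas
  data <- dplyr::cur_data()            # current group's data, grouping columns excluded
  vars <- names(tidyselect::eval_select(rlang::enquo(.cols), data))

  pairs <- if (.is_commutative) {
    t(utils::combn(vars, 2))           # each unordered pair once
  } else {
    grid <- expand.grid(x = vars, y = vars, stringsAsFactors = FALSE)
    as.matrix(grid[grid$x != grid$y, ])  # both orders, no self-pairs
  }

  out <- lapply(seq_len(nrow(pairs)), function(i) {
    .fn(data[[pairs[i, 1]]], data[[pairs[i, 2]]])
  })
  names(out) <- paste(pairs[, 1], pairs[, 2], sep = "_")
  as_tibble(out)
}

# Grouped summary: one correlation per unordered pair and per species.
by_species <- iris %>%
  group_by(Species) %>%
  summarise(pairwise_sketch(where(is.numeric), cor))

# Ungrouped mutate: ratios for every ordered pair of numeric columns.
with_ratios <- as_tibble(iris) %>%
  mutate(pairwise_sketch(where(is.numeric), ~ .x / .y, .is_commutative = FALSE))
```

Because the helper re-selects columns and loops over pairs once per group, it does not benefit from the inlining mentioned above for across().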

brshallo · February 25, 2021, 9:58pm

@carlomedina, just tried it out, this is awesome!! Great work!! Thanks for fixing the arg name to .is_commutative. I'll try to look through the code some next week (I also need to go through Advanced R more, so it will be a bit of a slog for me) and leave a longer reply.

brshallo · March 4, 2021, 6:06pm

@lionel, per jameslairdsmith's comment above, do you think the tidyverse team would have any interest in a pairwise()-like function? (E.g. a similar function might be pwise(), with a p argument for applying functions on permutations with choose >= 2.)

If the answer is 'maybe' @carlomedina and I had discussed opening an issue/feature request on dplyr's github pointing to his prototype.
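To make the pwise() idea a bit more concrete, here is a small base R sketch that applies a function to every set of p columns. The name pwise_sketch() and its arguments are hypothetical, it works on a plain data frame and not inside mutate() or summarise(), and it only covers unordered combinations (not permutations).

```r
# Hypothetical helper: apply .fn to every unordered set of p columns of `data`.
pwise_sketch <- function(data, p, .fn) {
  sets <- utils::combn(names(data), p, simplify = FALSE)
  out <- lapply(sets, function(set) {
    # Pass the p columns to .fn as positional arguments.
    do.call(.fn, unname(as.list(data[set])))
  })
  names(out) <- vapply(sets, paste, character(1), collapse = "_")
  out
}

# Row-wise maximum over every set of three numeric iris columns.
res <- pwise_sketch(iris[1:4], p = 3, .fn = pmax)
```

With p = 2 this reduces to the commutative pairwise case discussed earlier in the thread.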

lionel · March 5, 2021, 5:48am

To me it feels out of scope for dplyr but I don't know. cc @romain and @hadley
