Calculate the differences in the column values between rows/ observations (all combinations)

Hello,

I have a data frame as in the following example. I wish to calculate the differences in the column values between observations/ rows (all combinations).

my_df <- tibble(a=runif(5), b=runif(5), c=runif(5))

> my_df
# A tibble: 5 x 3
       a     b      c
   <dbl> <dbl>  <dbl>
1 0.0513 0.267 0.846 
2 0.614  0.683 0.937 
3 0.230  0.700 0.0651
4 0.671  0.110 0.901 
5 0.424  0.520 0.817 

I have tried the code below which gives me only the difference between subsequent rows; I want to have all combinations: row2 - row1; row3 - row1; row4 - row1, row5- row1, row3 - row2, row4 - row2, and so on...

Also, the code I wrote does not seem the best to me (!), although it outputs the result I wish, but not for all possible combinations!

my_diff <- as.data.frame(diff(as.matrix(my_df)))
> my_diff
           a           b           c
1  0.5623574  0.41522579  0.09165630
2 -0.3837289  0.01755953 -0.87209740
3  0.4407068 -0.58982681  0.83540813
4 -0.2463205  0.40943495 -0.08358985

I appreciate if someone could provide help in solving my question, if possible a using tidy verse options.

Thanks.

Hello,

Here is one of many possible solutions. This seems to be relatively easy with the combination of combn and apply.

library(tidyverse)


test <- tibble(a=runif(5), b=runif(5), c=runif(5)) %>% as.data.frame()
test
#>           a         b         c
#> 1 0.7360352 0.9480512 0.5935539
#> 2 0.7244965 0.7372150 0.5043489
#> 3 0.4677090 0.8161523 0.3011784
#> 4 0.3436353 0.6341290 0.9938101
#> 5 0.4223668 0.2589678 0.9968452


cols <- combn(names(test), 2, paste, collapse = "-")
cols <- c(cols, sub("(.)-(.)", "\\2-\\1", cols))

test[cols] <- t(apply(test, 1, function(x) {
  out <- combn(x, 2, function(x) x[1] - x[2])
  c(out, -out)
}))
  
test
#>           a         b         c         a-b        a-c        b-c         b-a
#> 1 0.7360352 0.9480512 0.5935539 -0.21201602  0.1424813  0.3544974  0.21201602
#> 2 0.7244965 0.7372150 0.5043489 -0.01271844  0.2201477  0.2328661  0.01271844
#> 3 0.4677090 0.8161523 0.3011784 -0.34844325  0.1665306  0.5149739  0.34844325
#> 4 0.3436353 0.6341290 0.9938101 -0.29049372 -0.6501748 -0.3596811  0.29049372
#> 5 0.4223668 0.2589678 0.9968452  0.16339892 -0.5744784 -0.7378773 -0.16339892
#>          c-a        c-b
#> 1 -0.1424813 -0.3544974
#> 2 -0.2201477 -0.2328661
#> 3 -0.1665306 -0.5149739
#> 4  0.6501748  0.3596811
#> 5  0.5744784  0.7378773

Created on 2021-10-23 by the reprex package (v2.0.0)

My apologies. I did not see you want row combinations and someone was nice enough to point it out for me.

I am taking your original dataframe and transposing it as you will see with test_2 from there we have 5 rows and I name them 1 to 5. And then from there I apply the same steps. Let me know if this works for you.

library(tidyverse)


test <- tibble(a=runif(5), b=runif(5), c=runif(5)) %>% as.data.frame()
test
#>            a         b          c
#> 1 0.29827961 0.2847710 0.49713224
#> 2 0.01627554 0.6714382 0.03150204
#> 3 0.77252409 0.9607009 0.37498908
#> 4 0.87203985 0.6141838 0.39510871
#> 5 0.17926325 0.5988435 0.38927007


test_2 <- test %>% t() %>% as.data.frame()

colnames(test_2) <-   paste(c(1:length(letters[1:ncol(test_2)])), sep="")


cols <- combn(names(test_2), 2, paste, collapse = "-")
cols <- c(cols, sub("(.)-(.)", "\\2-\\1", cols))

test_2[cols] <- t(apply(test_2, 1, function(x) {
  out <- combn(x, 2, function(x) x[1] - x[2])
  c(out, -out)
}))

test_2
#>           1          2         3         4         5        1-2        1-3
#> a 0.2982796 0.01627554 0.7725241 0.8720398 0.1792633  0.2820041 -0.4742445
#> b 0.2847710 0.67143820 0.9607009 0.6141838 0.5988435 -0.3866672 -0.6759299
#> c 0.4971322 0.03150204 0.3749891 0.3951087 0.3892701  0.4656302  0.1221432
#>          1-4        1-5        2-3         2-4         2-5         3-4
#> a -0.5737602  0.1190164 -0.7562485 -0.85576431 -0.16298771 -0.09951576
#> b -0.3294128 -0.3140724 -0.2892628  0.05725439  0.07259473  0.34651714
#> c  0.1020235  0.1078622 -0.3434870 -0.36360667 -0.35776803 -0.02011963
#>           3-5         4-5        2-1        3-1        4-1        5-1       3-2
#> a  0.59326084 0.692776596 -0.2820041  0.4742445  0.5737602 -0.1190164 0.7562485
#> b  0.36185749 0.015340342  0.3866672  0.6759299  0.3294128  0.3140724 0.2892628
#> c -0.01428098 0.005838646 -0.4656302 -0.1221432 -0.1020235 -0.1078622 0.3434870
#>           4-2         5-2         4-3         5-3          5-4
#> a  0.85576431  0.16298771  0.09951576 -0.59326084 -0.692776596
#> b -0.05725439 -0.07259473 -0.34651714 -0.36185749 -0.015340342
#> c  0.36360667  0.35776803  0.02011963  0.01428098 -0.005838646

Created on 2021-10-23 by the reprex package (v2.0.0)

1 Like

Thank you @GreyMerchant. I understood your approach but still must digest some parts of your code.
As I understand it, according to your suggestion, I will still need transform test_2 back to the format as in test so that I can further work on it. How could I do that?
In fact I hoped a more tidy/ straight forward approach existed, but I really appreciate your help :wink:

@GreyMerchant I applied your proposed solution and it is working for me. Transposing back was quite straight forward.
But still would appreciate suggestions for a more tidy approach in R!

Someone kindly proposed an answer I find satisfying and also eliminates duplicates.
Still wonder if there exists something even more tidy!
Thanks.

my_df <- tibble(a=runif(5), b=runif(5), c=runif(5))

# Generating the sequence to calculate the combinations
seq1 <- seq(1,nrow(my_df)) 
seq2 <- seq1

# Generating the Combinations
Combinations <- expand.grid(seq1, seq2)
# Removing the dupilicate Combinations
Combinations <- Combinations[which(Combinations$Var2 < Combinations$Var1),]

# Performing the subtraction
result <- my_df[Combinations$Var1,] - my_df[Combinations$Var2,]

Another approach.

A step by step approach:

library(tidyverse)
set.seed(1)
my_df <- tibble(a=runif(5), b=runif(5), c=runif(5)) %>%
  mutate(ID = row_number())

all_combs <- combn(1:nrow(my_df), 2)

my_df_1 <- my_df %>%
  slice(all_combs[1,])

my_df_2 <- my_df %>%
  slice(all_combs[2,])

combined <- data.frame(comparison = paste0(my_df_1$ID, "-", my_df_2$ID),
                       difference = my_df_1[,1:3] - my_df_2[,1:3])

which gives:

   comparison difference.a difference.b difference.c
1         1-2  -0.10661524  -0.04628558   0.02941782
2         1-3  -0.30734470   0.23759189  -0.48104827
3         1-4  -0.64269913   0.26927564  -0.17812914
4         1-5   0.06382673   0.83660341  -0.56386685
5         2-3  -0.20072946   0.28387748  -0.51046609
6         2-4  -0.53608389   0.31556122  -0.20754697
7         2-5   0.17044197   0.88288900  -0.59328467
8         3-4  -0.33535443   0.03168375   0.30291913
9         3-5   0.37117143   0.59901152  -0.08281857
10        4-5   0.70652586   0.56732777  -0.38573770

UPDATE from a kind user:

library(tidyverse)
set.seed(1)
my_df <- tibble(a=runif(5), b=runif(5), c=runif(5))

gives:

# A tibble: 5 x 3
      a      b     c
  <dbl>  <dbl> <dbl>
1 0.266 0.898  0.206
2 0.372 0.945  0.177
3 0.573 0.661  0.687
4 0.908 0.629  0.384
5 0.202 0.0618 0.770

And from there:

my_df %>%
  mutate(ID = row_number()) %>%
  slice(as.numeric(t(combn(1:nrow(.), 2)))) %>%
  mutate(group = rep(1:(n()/2), 2)) %>%
  group_by(group) %>%
  summarize(comparison = paste0(ID[2], "-", ID[1]),
            across(c(a, b, c), ~ .[2] - .[1])) %>%
  select(-group)

which gives:

# A tibble: 10 x 4
   comparison       a       b       c
   <chr>        <dbl>   <dbl>   <dbl>
 1 2-1         0.107   0.0463 -0.0294
 2 3-1         0.307  -0.238   0.481 
 3 4-1         0.643  -0.269   0.178 
 4 5-1        -0.0638 -0.837   0.564 
 5 3-2         0.201  -0.284   0.510 
 6 4-2         0.536  -0.316   0.208 
 7 5-2        -0.170  -0.883   0.593 
 8 4-3         0.335  -0.0317 -0.303 
 9 5-3        -0.371  -0.599   0.0828
10 5-4        -0.707  -0.567   0.386  
1 Like

This definitely seems like the cleanest way :slight_smile: I like it.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.