Substracting dfs

Hi,

I'm looking fon an elegant way how to substact values of df 'B' from values of df 'A' based on ID.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

A <- tibble(ID = c(1, 2, 3),
            value_a = c(5, 6, 10),
            value_b = c(4,1,8))

B <- tibble(ID = c(2, 3, 1),
            value_a = c(7, 8, 4),
            value_b = c(2,1,4))

I'll aprreciate your suggestions.

Many thanks,

Jakub

1 Like
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
A <- tibble(ID = c(1, 2, 3),
            value_a = c(5, 6, 10),
            value_b = c(4,1,8))

B <- tibble(ID = c(2, 3, 1),
            value_a = c(7, 8, 4),
            value_b = c(2,1,4))

B <- B |> arrange(-desc(ID))
A[2:3] - B[2:3]
#>   value_a value_b
#> 1       1       0
#> 2      -1      -1
#> 3       2       7
2 Likes

I just want to add to this a little bit. While this solution works, and is definitely an elegant solution for the data in the example, there are numerous ways in which this code could get you in trouble. The most two most obvious ones are:

  1. if dataframe A has exactly the same IDs as in B but is not actually sorted (arranged) by ID, you end up subtracting from the wrong datapoint for every row that is not in the same position in A and B

of course you can just also arrange A by ID, but...

  1. If there actually are IDs in B that are not in A, you'll either get an out of scope error (if they are of differing lengths), or just the wrong result if they still have the same length.

Starting with loading dplyr and creating the dummy data as before:

suppressPackageStartupMessages(library(dplyr))

A <- tibble(ID = c(1, 2, 3),
            value_a = c(5, 6, 10),
            value_b = c(4,1,8))

B <- tibble(ID = c(2, 3, 1),
            value_a = c(7, 8, 4),
            value_b = c(2,1,4))

Here is the approach I want to suggest as, while it is not as elegant, it is more robust to differences in the actual data.

The first step is a left_join. Matching IDs in B to their corresponding IDs in A, and creating a new dataframe that contains information from both A and B. Using the suffix argument we make sure that the values from the first dataframe (A) get marked as such, and the same for values from the B dataframe. Then we can use the mutate function to perform the actual subtraction, and then with the select function, removing the columns that contain "_df" in their column names (referring to the suffixes we gave the columns in the join process.

A %>% 
  left_join(B, by = c("ID"), suffix = c("_df_A", "_df_B")) %>% 
  mutate(value_a = value_a_df_A - value_a_df_B,
         value_b = value_b_df_A - value_b_df_B) %>% 
  select(-contains("_df"))
#> # A tibble: 3 × 3
#>      ID value_a value_b
#>   <dbl>   <dbl>   <dbl>
#> 1     1       1       0
#> 2     2      -1      -1
#> 3     3       2       7

Created on 2022-11-30 with reprex v2.0.2

3 Likes

Great points. I tend to limit my answers too much to the data provided.

A more universal approach is to convert each tibble to "long" format (pivot_longer) and then perform any joint mathematical operation where both id and attribute (column header) match.
This will always work even if there are rows that don't match.

The nature of the approach will depend on whether one needs to preserve the rows present in one tibble and not the other, or not.

After the arithmetic/mathematical operation, the resulting tibble(s) can be regenerated with pivot_wider.

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.