I just want to add to this a little bit. While this solution works, and is definitely an elegant solution for the data in the example, there are numerous ways in which this code could get you in trouble. The most two most obvious ones are:
- if dataframe
A
has exactly the same IDs as in B but is not actually sorted (arranged) by ID, you end up subtracting from the wrong datapoint for every row that is not in the same position in A and B
of course you can just also arrange A by ID, but...
- If there actually are IDs in B that are not in A, you'll either get an out of scope error (if they are of differing lengths), or just the wrong result if they still have the same length.
Starting with loading dplyr and creating the dummy data as before:
suppressPackageStartupMessages(library(dplyr))
A <- tibble(ID = c(1, 2, 3),
value_a = c(5, 6, 10),
value_b = c(4,1,8))
B <- tibble(ID = c(2, 3, 1),
value_a = c(7, 8, 4),
value_b = c(2,1,4))
Here is the approach I want to suggest as, while it is not as elegant, it is more robust to differences in the actual data.
The first step is a left_join. Matching IDs in B to their corresponding IDs in A, and creating a new dataframe that contains information from both A and B. Using the suffix argument we make sure that the values from the first dataframe (A) get marked as such, and the same for values from the B dataframe. Then we can use the mutate function to perform the actual subtraction, and then with the select function, removing the columns that contain "_df" in their column names (referring to the suffixes we gave the columns in the join process.
A %>%
left_join(B, by = c("ID"), suffix = c("_df_A", "_df_B")) %>%
mutate(value_a = value_a_df_A - value_a_df_B,
value_b = value_b_df_A - value_b_df_B) %>%
select(-contains("_df"))
#> # A tibble: 3 × 3
#> ID value_a value_b
#> <dbl> <dbl> <dbl>
#> 1 1 1 0
#> 2 2 -1 -1
#> 3 3 2 7
Created on 2022-11-30 with reprex v2.0.2