compare two columns each

mesisa · March 31, 2022, 2:17pm

Hi!
I have a dataset which consists of the columns A1, A2, A3 and B1, B2, B3. I want to compare the values between the columns A1 & B2, A2 & B2, A3 & B3. Usually this would be easy with mutate from dplyr, but as I have many columns to compare in my real dataset, I would like to avoid having to do the comparison manually. That's why I build pairs which should be compared:

library(tidyverse)

df_A = data.frame(id = c("x1", "x2", "x3"), A1 = c(1,NA,3), A2 = c(2,3,4), A3 = c(3,4,5))
df_B = data.frame(id = c("x1", "x2", "x3"), B1 = c(1,2,3), B2 = c(NA,3,5), B3 = c(3,4,5))

df = df_A %>% left_join(df_B)

ndf  = names(df)
An = ndf[startsWith(ndf,"A")]
Bn = ndf[startsWith(ndf,"B")]

shared_nums = intersect(parse_number(An), parse_number(Bn))

pairs_to_do = map(shared_nums, ~c(paste0("A",.x), paste0("B",.x)))

For the result of the comparison, I would like to indicate the type of change by using the following variables:

NA ... if there is no value in the column A and no value in the column B
0 ... value A == value B
1 ... value A was deleted in B (value A != NA, value B == NA)
2 ... value A != value B, but neither value A nor value B == NA
3 ... value was added in B (value A == NA, value B != NA)

do_pair <- function(x,y,name){
  stopifnot(length(x)==length(y))
  r <- case_when(is.na(x) & is.na(y) ~ NA,
                 !is.na(x) & !is.na(y) & x == y ~ 0, # same value
                 !is.na(x) & is.na(y) ~ 1, # deleted value in B
                 !is.na(x) & !is.na(y) & x != y ~ 2, # different values
                 is.na(x) & !is.na(y) ~ 3) # added value in B
  tibble(!!sym(name):=r)
}

I'm not sure, how to do the next step the best way. Everything I tried, gave me an error or didn't work. That's how, the output should look like, when the colums C are the mutated columns:

grafik

I hope, I was clear enough explaining my problem.

gabriel.de.wit · March 31, 2022, 11:30pm

Hi, I think this does what you need:

library(tidyverse)

df_A <- data.frame(id = c("x1", "x2", "x3"), A1 = c(1,NA,3), A2 = c(2,3,4), A3 = c(3,4,5))
df_B <- data.frame(id = c("x1", "x2", "x3"), B1 = c(1,2,3), B2 = c(NA,3,5), B3 = c(3,4,5))

df <- df_A %>% left_join(df_B, by = "id")

do_pair <- function(x, y) {
  case_when(
    is.na(x) & is.na(y) ~ NA_real_,
    !is.na(x) & !is.na(y) & x == y ~ 0, # same value
    !is.na(x) & is.na(y) ~ 1, # deleted value in B
    !is.na(x) & !is.na(y) & x != y ~ 2, # different values
    is.na(x) & !is.na(y) ~ 3 # added value in B
  ) 
}

df %>%
  pivot_longer(-id) %>%
  separate(name, into = c("letter", "number"), sep = 1) %>%
  pivot_wider(c(id, number), names_from = letter, values_from = value) %>%
  mutate(C = do_pair(A, B)) %>%
  pivot_wider(id, names_from = number, values_from = c(A, B, C), names_sep = "")

mesisa · April 1, 2022, 7:29am

Yes, that's what I was looking for. Thank you very much for your help!

system · April 8, 2022, 7:30am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.