data wrangling matching columns

Hey everyone.

I need help some data wrangling techniques.

Thanks in advance for your help.

Yoyong


I have a dataset with two columns. I want to compare the two columns to get the additional columns found in Table 2.

Matched_type: 0 = columns did not match, 1 = columns have exact match; 2 = columns have partial match
Matched_response = lists all the matched responses in one column
Matched1, Matched2,...Matchedn = separates the matched responses in different columns

Table 1

Column 1 Column 2
a a
b b;d
c f
d;e e;d
c;d d
a;b a;e;c
a;c;e f;a
b;c d
f f;g;a
g;h g;h
c;a;b;d a;d;c;e
a;f;c;d;e;g f;a;d;e;h;j

Table 2

Column 1 Column 2 Matched_type Matched_response Matched1 Matched2 Matched3 Matched4
a a 1 a a
b b;d 2 b b
c f 0
d;e e;d 1 d;e d e
c;d d 2 d;e d
a;b a;e;c 2 a a
a;c;e f;a 2 a a
b;c d 0
f f;g;a 2 f f
g;h g;h 1 g;h g h
c;a;b;d a;d;c;e 2 c;a c a
a;f;c;d;e;g f;a;d;e;h;j 2 a;f;d;e a f d e

Hi. It would be useful if you could provide a reproducible example of the code that you have tried and the table (in code).

It is probably a combination of regex and splitting the columns.

This may be more complicated than necessary.

library(tidyverse)
#> Warning: package 'tibble' was built under R version 4.1.2
DF <- read.csv("~/R/Play/Dummy.csv", sep = " ")
MyFunc <- function(V1,V2) {
  V1 <- str_split(V1,";")
  V2 <- str_split(V2,";")
  Matched_Type <- map2_dbl(V1,V2, ~case_when(
    all(.x %in% .y) & all(.y %in% .x) ~ 1,
    any(.x %in% .y)  ~ 2,
    TRUE ~ 0
  ))
  Matched_Response <- map2_chr(V1,V2, ~paste(.x[.x %in% .y], collapse = ";"))
  return(list(Type = Matched_Type, Response = Matched_Response,
              MaxMatches = (max(nchar(Matched_Response))+1)/2))
}
OUT <- MyFunc(DF$Column_1,DF$Column_2)
OUT
#> $Type
#>  [1] 1 2 0 1 2 2 2 0 2 1 2 2
#> 
#> $Response
#>  [1] "a"       "b"       ""        "d;e"     "d"       "a"       "a"      
#>  [8] ""        "f"       "g;h"     "c;a;d"   "a;f;d;e"
#> 
#> $MaxMatches
#> [1] 4
DF$Type <- OUT$Type
DF$Response <- OUT$Response
NewNames <- paste("Match",1:OUT$MaxMatches, sep = "_")
DF2 <- separate(DF,col = Response, into = NewNames, 
                remove = FALSE, fill = "right")
DF2
#>       Column_1    Column_2 Type Response Match_1 Match_2 Match_3 Match_4
#> 1            a           a    1        a       a    <NA>    <NA>    <NA>
#> 2            b         b;d    2        b       b    <NA>    <NA>    <NA>
#> 3            c           f    0                     <NA>    <NA>    <NA>
#> 4          d;e         e;d    1      d;e       d       e    <NA>    <NA>
#> 5          c;d           d    2        d       d    <NA>    <NA>    <NA>
#> 6          a;b       a;e;c    2        a       a    <NA>    <NA>    <NA>
#> 7        a;c;e         f;a    2        a       a    <NA>    <NA>    <NA>
#> 8          b;c           d    0                     <NA>    <NA>    <NA>
#> 9            f       f;g;a    2        f       f    <NA>    <NA>    <NA>
#> 10         g;h         g;h    1      g;h       g       h    <NA>    <NA>
#> 11     c;a;b;d     a;d;c;e    2    c;a;d       c       a       d    <NA>
#> 12 a;f;c;d;e;g f;a;d;e;h;j    2  a;f;d;e       a       f       d       e

Created on 2022-03-22 by the reprex package (v2.0.1)

Hey, thank you for the reply. I am getting this comment when I run the entire code (I used the location of my dataset and used similar variable names):

DF$Type <- OUT$Type
Error in $<-.data.frame(*tmp*, Type, value = numeric(0)) :
replacement has 0 rows, data has 12

Your issue may be related to having run 'the entire code' as the solution provided assumes an input csv ("~/R/Play/Dummy.csv") that may not match what you have on your machine, and will likely cause a host of errors. You would need to adapt this portion of the code so that the data.frame you want to manipulate is in your environment with the name DF, rather than the read.csv() before the rest of the code Myfunc starts.

Although, if I have misinterpreted what is happening, my apologies

I changed this part ("~/R/Play/Dummy.csv") and used the location where my dataset is stored. I used similar variables.

Therefore, if you still have errors/issues with getting the proposed solution to work with your real data, the next step would be for you to create a reprex.

Hi again.

I made the dataset. But I am still getting errors. Appreciate the help.

library(tidyverse)

DF <- data.frame(V1 = c("a", "b", "c", "d;e", "c;d", "a;b","a;c;e", "b;c", "f", "g;h", "c;a;b;d", "a;f;c;d;e;g"),
                 V2 = c("a", "b;d", "f", "d;e", "d", "a;e;c", "f;a", "d", "f;g;a", "g;h", "a;d;c;e" , "f;a;d;h;j"))



MyFunc <- function(V1,V2) {
  V1 <- str_split(V1,";")
  V2 <- str_split(V2,";")
  Matched_Type <- map2_dbl(V1,V2, ~case_when(
    all(.x %in% .y) & all(.y %in% .x) ~ 1,
    any(.x %in% .y)  ~ 2,
    TRUE ~ 0
  ))
  Matched_Response <- map2_chr(V1,V2, ~paste(.x[.x %in% .y], collapse = ";"))
  return(list(Type = Matched_Type, Response = Matched_Response,
              MaxMatches = (max(nchar(Matched_Response))+1)/2))
}
OUT <- MyFunc(DF$Column_1,DF$Column_2)
#> Warning in max(nchar(Matched_Response)): no non-missing arguments to max;
#> returning -Inf
OUT
#> $Type
#> numeric(0)
#> 
#> $Response
#> character(0)
#> 
#> $MaxMatches
#> [1] -Inf

DF$Type <- OUT$Type
#> Error in `$<-.data.frame`(`*tmp*`, Type, value = numeric(0)): replacement has 0 rows, data has 12
DF$Response <- OUT$Response
#> Error in `$<-.data.frame`(`*tmp*`, Response, value = character(0)): replacement has 0 rows, data has 12
NewNames <- paste("Match",1:OUT$MaxMatches, sep = "_")
#> Error in 1:OUT$MaxMatches: result would be too long a vector
DF2 <- separate(DF,col = Response, into = NewNames, 
                remove = FALSE, fill = "right")
#> Error: object 'Response' not found
DF2
#> Error in eval(expr, envir, enclos): object 'DF2' not found

reprex::reprex()
#> x Install the styler package in order to use `style = TRUE`.
#> ℹ Non-interactive session, setting `html_preview = FALSE`.
#> CLIPR_ALLOW has not been set, so clipr will not run interactively
#> Error in switch(where, expr = stringify_expression(x_expr), clipboard = ingest_clipboard(), : EXPR must be a length 1 vector

Created on 2022-03-24 by the reprex package (v2.0.1)

In you version of DF, the two columns are named V1 and V2

DF <- data.frame(V1 = c("a", "b", "c", "d;e", "c;d", "a;b","a;c;e", "b;c", "f", "g:h", "c;a;b;d", "a;f;c;d;e;g"),
                 V2 = c("a", "b;d", "f", "d;e", "d", "a;e;c", "f;a", "d", "f;g;a", "g;h", "a;d;c;e" , "f;a;d;h;j"))

but your call to MyFunc refers to Column_1 and Column_2

OUT <- MyFunc(DF$Column_1,DF$Column_2)

Try

OUT <- MyFunc(DF$V1,DF$V2)

Thank you. I learned a lot today.

Thanks. I learned something new today.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.