Matching similar values between data frames, difficult wrangling task

applesauce · September 5, 2019, 3:32am

In the following reprex, I'd like to create Petal.Length and Petal.Width columns in the virginica_species data frame based on the following conditions: 1) each row in virginica_species finds the most similar value in setosa_versicolor according to their Septal.Length columns that is also in the same new_col group, and then 2) I'd like to extract the corresponding Petal.Length value in setosa_versicolor to create virginica_species's column Petal.Width. 3) Then I'd like to do the same but with regards to the Sepal.Width and Petal.Width columns. The idea is, in my actual data there could be an unspecified number of these columns that I'd like to 'match' between two data sets so I can't just manually input the column names beyond, say, setting a vector equal to the column name 'inputs' and running that vector into something like mutate_at(vars()).

library(dplyr)

# creates my two data frames, with `setosa_versicolor` being the data frame I'd like to map the onto `virginica_species` data frame
my_iris <- as.data.frame(iris) %>% mutate(new_col = (row_number() - 1) %% 2)
setosa_versicolor <- filter(my_iris, Species != "virginica")
virginica_species <- my_iris %>% filter(Species == "virginica") %>% select(-c(Petal.Length, Petal.Width))

# As an example of what I'd like the output to look like, this would be the first two rows of the final `virginica_species` data frame:
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species    new_col
# 1            6.3         3.3          4.9         1.6     virginica       0   # The closest Sepal.Length value in `seotsa_versicolor` and also with new_col == 1 also happened to be 6.3, so we find that row in `seotsa_versicolor` and extract its `Petal.Length` val; closest Sepal.Width value with new_col == 1 was 3.3, so we extract its `Petal.Width`
# 2            5.8         2.7          4.1         1.6     virginica       1   # Same methodology in this and every subsequent line (note there are sometimes multiple matching values in `seotsa_versicolor` but in my actual data there aren't any matches so I don't particularly care in this repex how multiple matches are chosen between)

I gave it my best shot (many attempts not documented below) but was getting an error, wasn't sure whether group_by was achieving what I was intended, couldn't generalize the method to any number of unspecified columns, and didn't know how to extract Petal.Length or Petal.Width when I finally did determine row in setosa_versicolor with the closest value in Sepal.Length or Sepal.Width

library(DescTools)
sepal_cols <- c("Sepal.Length", "Sepal.Width")
virginica_species %>%
  group_by(new_col) %>%
  rowwise() %>%
  mutate_at(vars(sepal_cols), list(petals = ~ Closest(pull(setosa_versicolor, quo_name(quo("Sepal.Width"))), .)))

# Error: Column `Sepal.Width_petals` must be length 1 (the group size), not 3

Many thanks!

DavoWW · September 5, 2019, 1:34pm

Thanks for your question which allowed me to discover the DescTools::Closest() function!
However, I found your example very difficult to understand, but I think that the following shows how to achieve what you want using a for() loop in a user-defined function. Not dplyr, I know, but with my simpler example you might get a dplyr solution from other contributors. The for() loop can be extended to match and add multiple columns (if required).

# A simpler example perhaps
library(DescTools)

df1 <- data.frame(a_1 = c(1:5),
                  b_1 = c(5,3,9,7,11),
                  c_1 = c(0,1,0,1,0))
df1
#>   a_1 b_1 c_1
#> 1   1   5   0
#> 2   2   3   1
#> 3   3   9   0
#> 4   4   7   1
#> 5   5  11   0

df2 <- data.frame(a_2 = c(1:6),
                  b_2 = c(3.1, 5.8, 1.6, 9.7, 6.2, 12.8),
                  c_2 = c(0,1,0,1,0,1))
df2
#>   a_2  b_2 c_2
#> 1   1  3.1   0
#> 2   2  5.8   1
#> 3   3  1.6   0
#> 4   4  9.7   1
#> 5   5  6.2   0
#> 6   6 12.8   1

# Version 1
get_close <- function(xx=df1, yy=df2) {
  pos <- vector(mode = "numeric")
  for(i in 1:dim(yy)[1]) {
    pos[i] <- DescTools::Closest(xx$b_1, yy$b_2[i])
    #print(pos[i])
    yy$d_2[i] <- pos[i]
  }
  out <- yy
  return(out)
}

get_close()
#>   a_2  b_2 c_2 d_2
#> 1   1  3.1   0   3
#> 2   2  5.8   1   5
#> 3   3  1.6   0   3
#> 4   4  9.7   1   9
#> 5   5  6.2   0   7
#> 6   6 12.8   1  11


# Version 2 - only match where the c_x columns are the same.
get_close2 <- function(xx=df1, yy=df2) {
  pos <- vector(mode = "numeric")
  for(i in 1:dim(yy)[1]) {
    pos[i] <- DescTools::Closest(xx$b_1[xx$c_1 == yy$c_2[i]], yy$b_2[i])
    #print(pos[i])
    yy$d_2[i] <- pos[i]
  }
  out <- yy
  return(out)
}

get_close2()
#>   a_2  b_2 c_2 d_2
#> 1   1  3.1   0   5
#> 2   2  5.8   1   7
#> 3   3  1.6   0   5
#> 4   4  9.7   1   7
#> 5   5  6.2   0   5
#> 6   6 12.8   1   7
Created on 2019-09-05 by the reprex package (v0.3.0)

HTH

applesauce · September 8, 2019, 5:08pm

Thanks for your reply! This is definitely close to what I was looking for (and sorry about the question being confusing). On top of what you wrote, the one thing I'd like the code to do is run for a vector of columns. So rather than finding the closest match between just b_1 and b_2, also working if we added more columns and wanted to find the closest match between, say, g_1 and g_2.

I imagined purrr might be useful and your code led me to write the purrr code below. However I still don't know how to yield multiple columns (like this post) while iterating rowwise (like this post)

library(DescTools)

pmap_dbl(df2, function(b_2, c_2, ...){
  df1 %>%
    pull(a_1) %>%
    nth(which(df1$b_1 == Closest(df1$b_1[which(df1$c_1 %in% c_2)], b_2))) # finds row in df1 that 1) matches in C column and 2) is closest in value in the B column
})

joels · September 8, 2019, 6:04pm

This sounds like the kind of task where a rolling join might work, but I don't think I understand your example well enough to attempt a specific solution.

applesauce · September 8, 2019, 9:58pm

Hopefully this diagram makes it clearer. I'd like to create a_2, b_2, and c_2 columns in df2 that are composed of specific values from df1's a_2, b_2, and c_2 columns. For each df2 row, we find the row in df1 closest in terms of a_1 that also matches the must_match column. When we find this row in df1, we extract its a_2 value and insert it into df2's a_2 column. We do this for every row in df2. My purrr code below successfully accomplishes everything so far. However, I'd like to repeat this process to create b_2, c_2, and potentially more other columns; creating b_2 would mean finding the closest b_1 value in df1 among matching must_match rows, c_2 would mean finding the closest c_1 value in df1 among matching must_match rows, and so on.

library(DescTools)


df1 <- data.frame(must_match = c(0, 1, 0, 1, 0), 
                      c_2 = c(8.3, 6.2, 6.1, 4.5, 1.5), 
                      c_1 = c(4, 1, 3, 9, 5), 
                      b_2 = c(5, 3, 9, 7, 11), 
                      b_1 = c(20, 22, 24, 26, 27), 
                      a_2 = c(5.3, 5.4, 3.1, 4.2, 7.2),
                      a_1 = c(2, 7, 9, 6, 11))

df2 <- data.frame(must_match = c(0, 1, 0, 1, 0, 1),
                  a_1 = c(7, 2, 5, 11, 1, 6),
                  b_1 = c(6, 5.8, 1.6, 9.7, 6.2, 12.8),
                  c_1 = c(2.3, 5.3, 1.2, 7.3, 7.8, 8.2))

pmap_dbl(df2, function(a_1, must_match, ...){
  df1 %>%
    pull(a_2) %>%
    nth(which(df1$a_1 == Closest(df1$a_1[which(df1$must_match %in% must_match)], a_1))) # finds row in df1 that 1) matches in `must_mattch` column and 2) is closest in `a_1`
# granted this output is a vector, but converting to a column should be easy enough
})

raytong · September 8, 2019, 11:34pm

Hi @applesauce. I make some script with pmap_df see if can help.

pmap_df(df2, function(a_1, b_1, c_1, must_match) {
  x <- df1 %>%
    filter(must_match == must_match)
  
  data.frame(a_1 = a_1,
             a_2 = x$a_2[Closest(x$a_1, a_1, which = TRUE)[1]],
             b_1 = b_1,
             b_2 = x$b_2[Closest(x$b_1, b_1, which = TRUE)[1]],
             c_1 = c_1,
             c_2 = x$c_2[Closest(x$c_1, c_1, which = TRUE)[1]],
             must_match = must_match)
})

applesauce · September 9, 2019, 12:31am

Thanks @raytong! Unfortunately though I'd like the code to generalize to when there are many columns, like d_1, e_1, f_1, etc. I was wondering if there was a way of not having to call each column of interest (a_1, b_1`, ...). Also when I run your code it doesn't look like it is excluding rows in df1 that equal must_match (I guess even though there's a line filtering it away).

raytong · September 9, 2019, 3:44am

@applesauce. If you want it generalise, try this.

getClosest <- function(must_match, col, col2, value) {
  x <- df1[df1$must_match == must_match,]
  x[[col2]][Closest(x[[col]], value, which = TRUE)[1]]
}

df2 %>%
  gather(col, value, -must_match) %>%
  rowwise() %>%
  mutate(col2 = paste0(strsplit(col, "_1")[[1]][1], "_2")) %>%
  mutate(value2 = getClosest(must_match, col, col2, value)) %>%
  mutate(group = col) %>%
  group_by(group) %>%
  nest() %>%
  .$data %>%
  map(~{
    .x %>%
      rename(!!as.name(.$col[1]) := value, !!as.name(.$col2[1]) := value2) %>%
      select(-col, -col2)
  }) %>%
  bind_cols() %>%
  select(-matches("must_match\\d$"))

applesauce · September 10, 2019, 3:44am

Thanks so much @raytong! That works perfectly.

system · September 17, 2019, 3:57am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.