Cumulative AND/OR joining of data frames in a list: vectorize?

I have a problem that I've already solved using a for loop, but I'm trying to figure out if there's a tidyverse equivalent: ideally I would love to be able to do this as part of a longer dplyr pipe. Even apart from that, my curiosity is piqued because I can't figure out how to approach this.

I have data that comes from some survey responses, in which participants judged a series of sentences on a scale of 1-5. For this example, let's say I have five participants, labeled A through E. I have four sentences in my example, called s1 through s4.

I've already gone through and put different restrictions on the ratings for each sentence. So, for example, I've limited the s1 data to only people who rated it between 3 and 5. I've restricted the sentence 2 data to people who rated it either a 1 or a 2. Etc. etc.

After applying the restrictions, I end up with these four data frames:

df1 <- data.frame(person = c("A", "B", "C", "D"), rating = c(3, 3, 4, 5), sentence = "s1")
df2 <- data.frame(person = c("A", "C", "D", "B", "E"), rating = c(1, 1, 1, 2, 1), sentence = "s2")
df3 <- data.frame(person = c("A", "B"), rating = c(5, 5), sentence = "s3")
df4 <- data.frame(person = c("C", "D", "A"), rating = c(1, 2, 5), sentence = "s4")

...which, in my real code, I will need to have in a list:

dataList <- list(df1, df2, df3, df4)

I need to join these data frames together by different boolean criteria (either AND or OR), sequentially/cumulatively.

Here are some sample join criteria:

joinTypes <- c("or", "and", "or")

The join criteria apply in order. So, for example, I want to combine df1 and df2 by "or", i.e. keeping all the rows, even if a given person doesn't appear in both df's. Then, I want to keep only people who appear in [the combined df1/df2 result] AND in df3. Finally, I want to keep people who appear in [the result of the df1/df2/df3 join] OR in df4 (i.e. keep all rows).

This example has four df's (four sentences), but my real use case will have an arbitrarily large number of sentences (minimum 1), with a new join type specified for each additional sentence beyond 1 (such that the joinType vector will always have length 1 less than the number of sentences).

So, how to make this work? Here is a solution that works by initializing some starting data (the data for sentence 1), and then looping through each additional sentence and appending it to the accumulated data (reassigning dat at the end of each iteration) according to either an "and" or an "or" rule.

library(dplyr)

dat <- dataList[[1]] # initialize with the first sentence df
for(i in seq_along(dataList)[-1]){ # skips the loop entirely if there is only one sentence
  newDat <- dataList[[i]]
  joinType <- joinTypes[i-1]
  if(joinType == "or"){ # if it's an "or" join, we keep all of the rows
    outDat <- bind_rows(dat, newDat)
  }else{ # if it's an "and" join, we only keep people who appear in both data frames
    outDat <- bind_rows(dat, newDat) %>%
      group_by(person) %>%
      filter(person %in% dat$person, person %in% newDat$person) %>% # only keep the people who appear in both the new data and the previous data
      ungroup()
  }
  dat <- outDat # update dat with the new sentence data, joined by "or" or "and"
}

> dat
# A tibble: 9 x 3
  person rating sentence
  <fct>   <dbl> <fct>   
1 A           3 s1      
2 B           3 s1      
3 A           1 s2      
4 B           2 s2      
5 A           5 s3      
6 B           5 s3      
7 C           1 s4      
8 D           2 s4      
9 A           5 s4    

This gets the desired result: df1 and df2 were joined using "or", but then everyone except for A and B got removed once we joined df3 using "and". Finally, we join df4 using "or", so persons C and D are included again, but only for s4.

But my question is: is there a way to vectorize this kind of cumulative operation? I ask because 1) I prefer the tidyverse/piped workflow, personally, 2) I'm curious, since I can't figure out how to approach it, and 3) I'm working in a Shiny app, and the code above is currently in a reactive expression where it's convenient to have a single long pipeline instead of multiple intermediate objects.

Point 3 is not insurmountable; I could definitely alter the Shiny app code if this can't be done in a single pipeline. But I would love to know how if anyone does come up with a solution.

Thanks in advance! And let me know if anything is unclear or if I need to provide additional details.

I think this will do it, but let me know if I've misunderstood. In the code below, we use the intersect function to identify the values of person to keep for the "and" condition.

library(tidyverse)

dataList <- list(df1, df2, df3, df4)
joinTypes <- c("or", "and", "or")

d = dataList[[1]]
for(i in 2:length(dataList)) {
  keep = intersect(d$person, dataList[[i]]$person)
  d = bind_rows(d, dataList[[i]])
  if(joinTypes[i-1] == "and") {
    d = d[d$person %in% keep, ]
  }
}
d
#>   person rating sentence
#> 1      A      3       s1
#> 2      B      3       s1
#> 3      A      1       s2
#> 4      B      2       s2
#> 5      A      5       s3
#> 6      B      5       s3
#> 7      C      1       s4
#> 8      D      2       s4
#> 9      A      5       s4

You can put this in a function:

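conditional.stack = function(data.list, condition) {
  d = data.list[[1]]
  # seq_along(...)[-1] is empty for a one-element list, so the loop is skipped
  # (2:length(data.list) would count down as 2, 1 in that case)
  for(i in seq_along(data.list)[-1]) {
    # people present in both the accumulated data and the new data frame
    keep = intersect(d$person, data.list[[i]]$person)
    d = bind_rows(d, data.list[[i]])
    if(condition[i-1] == "and") {
      d = d[d$person %in% keep, ]
    }
  }
  return(d)
}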

conditional.stack(dataList, joinTypes)
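
If you'd rather not write the loop yourself, one possible sketch uses purrr::reduce2(), which walks through the list and the condition vector together and carries the accumulated data frame along. The helper name stack_step is made up for this illustration, and I've recreated the example data with stringsAsFactors = FALSE so that person is a character column. This is still iterative under the hood rather than truly vectorized, but it fits into a single pipeline:

```r
library(dplyr)
library(purrr)

# Example data from the question; stringsAsFactors = FALSE keeps person as character
df1 <- data.frame(person = c("A", "B", "C", "D"), rating = c(3, 3, 4, 5),
                  sentence = "s1", stringsAsFactors = FALSE)
df2 <- data.frame(person = c("A", "C", "D", "B", "E"), rating = c(1, 1, 1, 2, 1),
                  sentence = "s2", stringsAsFactors = FALSE)
df3 <- data.frame(person = c("A", "B"), rating = c(5, 5),
                  sentence = "s3", stringsAsFactors = FALSE)
df4 <- data.frame(person = c("C", "D", "A"), rating = c(1, 2, 5),
                  sentence = "s4", stringsAsFactors = FALSE)

dataList <- list(df1, df2, df3, df4)
joinTypes <- c("or", "and", "or")

# Hypothetical helper: one step of the cumulative stacking.
# acc  = everything stacked so far
# new  = the next data frame in the list
# type = "and" or "or" for this step
stack_step <- function(acc, new, type) {
  out <- bind_rows(acc, new)
  if (type == "and") {
    # keep only people present in both the accumulated and the new data
    keep <- intersect(acc$person, new$person)
    out <- out[out$person %in% keep, ]
  }
  out
}

# reduce2() expects the second vector to be one element shorter than the list
result <- reduce2(dataList, joinTypes, stack_step)
result
```

With the example data this should give the same nine rows as the loop version. In a longer pipe, the last step could be written as dataList %>% reduce2(joinTypes, stack_step).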

Let me know if there are other bells and whistles you need in order to generate the desired workflow.

Also, a note on terminology: What you're describing is, as far as I know, usually called "stacking" the data. "Joining" usually means operations that put two data frames together by matching one or more key columns between them. This usually combines columns from both data frames and potentially expands or reduces the number of rows, depending on the type of join and the key columns used for joining. For example:

library(tidyverse)

d = mtcars %>% rownames_to_column("model")
d1 = d[1:5, 1:3]
d2 = d[4:8, c(1,4,5)]

left_join(d1, d2, by="model")
#>               model  mpg cyl disp  hp
#> 1         Mazda RX4 21.0   6   NA  NA
#> 2     Mazda RX4 Wag 21.0   6   NA  NA
#> 3        Datsun 710 22.8   4   NA  NA
#> 4    Hornet 4 Drive 21.4   6  258 110
#> 5 Hornet Sportabout 18.7   8  360 175

inner_join(d1, d2, by="model")
#>               model  mpg cyl disp  hp
#> 1    Hornet 4 Drive 21.4   6  258 110
#> 2 Hornet Sportabout 18.7   8  360 175

full_join(d1, d2, by="model")
#>               model  mpg cyl  disp  hp
#> 1         Mazda RX4 21.0   6    NA  NA
#> 2     Mazda RX4 Wag 21.0   6    NA  NA
#> 3        Datsun 710 22.8   4    NA  NA
#> 4    Hornet 4 Drive 21.4   6 258.0 110
#> 5 Hornet Sportabout 18.7   8 360.0 175
#> 6           Valiant   NA  NA 225.0 105
#> 7        Duster 360   NA  NA 360.0 245
#> 8         Merc 240D   NA  NA 146.7  62

In your case, we could make one wide data frame with all the data by first renaming the rating column to the value of sentence and then successively running full_join, for which we use the reduce function below:

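map(dataList, ~{
  # rename the rating column to the sentence label (as.character in case sentence is a factor)
  names(.x)[grep("rating", names(.x))] = as.character(.x$sentence[1])
  .x %>% select(-sentence)
}) %>% 
  reduce(full_join, by = "person")
#>   person s1 s2 s3 s4
#> 1      A  3  1  5  5
#> 2      B  3  2  5 NA
#> 3      C  4  1 NA  1
#> 4      D  5  1 NA  2
#> 5      E NA  1 NA NA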
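
As an alternative sketch, the same wide layout can probably be reached by stacking everything first and then reshaping with tidyr::pivot_wider() (this assumes tidyr 1.0.0 or later, and that each person rated each sentence at most once). The example data is recreated here with stringsAsFactors = FALSE so the block runs on its own:

```r
library(dplyr)
library(tidyr)

# Example data from the question
df1 <- data.frame(person = c("A", "B", "C", "D"), rating = c(3, 3, 4, 5),
                  sentence = "s1", stringsAsFactors = FALSE)
df2 <- data.frame(person = c("A", "C", "D", "B", "E"), rating = c(1, 1, 1, 2, 1),
                  sentence = "s2", stringsAsFactors = FALSE)
df3 <- data.frame(person = c("A", "B"), rating = c(5, 5),
                  sentence = "s3", stringsAsFactors = FALSE)
df4 <- data.frame(person = c("C", "D", "A"), rating = c(1, 2, 5),
                  sentence = "s4", stringsAsFactors = FALSE)
dataList <- list(df1, df2, df3, df4)

# Stack all data frames, then spread the sentences into columns;
# person/sentence combinations that are missing are filled with NA
wide <- bind_rows(dataList) %>%
  pivot_wider(names_from = sentence, values_from = rating)
wide
```

This avoids renaming columns one data frame at a time, at the cost of relying on the sentence column to supply the new column names.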

@joels Thank you! This does what I need. Sorry for taking a while to get back to you, and thank you for such a rapid response.

Good note on the terminology, too. I knew that "join" was the wrong term (and I'm pretty familiar with what joins actually do), but I didn't know the right one--might have had more success googling if I had known what to search for!

Cheers!
