I have a problem that I've already solved using a for loop, but I'm trying to figure out if there's a tidyverse equivalent: ideally I would love to be able to do this as part of a longer dplyr pipe. Even apart from that, my curiosity is piqued because I can't figure out how to approach this.
I have data that comes from some survey responses, in which participants judged a series of sentences on a scale of 1-5. For this example, let's say I have five participants, labeled A through E. I have four sentences in my example, called s1 through s4.
I've already gone through and put different restrictions on the ratings for each sentence. So, for example, I've limited the s1 data to only people who rated it between 3 and 5, and I've restricted the s2 data to people who rated it either a 1 or a 2, and so on.
After applying the restrictions, I end up with these four data frames:
df1 <- data.frame(person = c("A", "B", "C", "D"), rating = c(3, 3, 4, 5), sentence = "s1")
df2 <- data.frame(person = c("A", "C", "D", "B", "E"), rating = c(1, 1, 1, 2, 1), sentence = "s2")
df3 <- data.frame(person = c("A", "B"), rating = c(5, 5), sentence = "s3")
df4 <- data.frame(person = c("C", "D", "A"), rating = c(1, 2, 5), sentence = "s4")
...which, in my real code, I will need to have in a list:
dataList <- list(df1, df2, df3, df4)
I need to join these data frames together by different boolean criteria (either AND or OR), sequentially/cumulatively.
Here are some sample join criteria:
joinTypes <- c("or", "and", "or")
The join criteria apply in order. So, for example, I want to combine df1 and df2 by "or", i.e. keeping all the rows, even if a given person doesn't appear in both df's. Then, I want to keep only people who appear in [the combined df1/df2 result] AND in df3. Finally, I want to keep people who appear in [the result of the df1/df2/df3 join] OR in df4 (i.e. keep all rows).
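To make the intended logic concrete, here is a minimal sketch that tracks only the person IDs (not the rating rows) through the three joins, using base R's union() and intersect(). The names people, kept and combine are just for this illustration; the IDs are copied from the four data frames above.

```r
# Person IDs that survive each sentence's restriction (copied from df1..df4)
people <- list(
  s1 = c("A", "B", "C", "D"),
  s2 = c("A", "C", "D", "B", "E"),
  s3 = c("A", "B"),
  s4 = c("C", "D", "A")
)
joinTypes <- c("or", "and", "or")

# Apply the joins in order and print who is present after each step
kept <- people[[1]]
for (i in seq_along(joinTypes)) {
  # "or" behaves like a set union, "and" like a set intersection
  combine <- if (joinTypes[i] == "or") union else intersect
  kept <- combine(kept, people[[i + 1]])
  cat(joinTypes[i], names(people)[i + 1], ":", sort(kept), "\n")
}
```

This only shows who is present after each step. It ignores which rows each person keeps (for instance, a person added by a later "or" join should only contribute rows from that later sentence), and that is the part I actually need.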
This example has four df's (four sentences), but my real use case will have an arbitrarily large number of sentences (minimum 1), with a new join type specified for each additional sentence beyond 1 (such that the joinTypes vector will always have length 1 less than the number of sentences).
So, how to make this work? Here is a solution that works by initializing some starting data (the data for sentence 1), and then looping through each additional sentence and combining it with the data accumulated so far (reassigning dat at the end of each iteration) according to either an "and" or an "or" rule.
library(dplyr)

dat <- dataList[[1]] # initialize with the first sentence df
for (i in seq_along(dataList)[-1]) { # runs zero times if there is only one sentence
  newDat <- dataList[[i]]
  joinType <- joinTypes[i - 1]
  if (joinType == "or") { # if it's an "or" join, we keep all of the rows
    outDat <- bind_rows(dat, newDat)
  } else { # if it's an "and" join, we only keep people who appear in both data frames
    outDat <- bind_rows(dat, newDat) %>%
      group_by(person) %>%
      # a person must be in the previous data AND in the new data to be kept
      filter(person %in% dat$person, person %in% newDat$person) %>%
      ungroup()
  }
  dat <- outDat # update dat with the new sentence data, joined by "or" or "and"
}
> dat
# A tibble: 9 x 3
person rating sentence
<fct> <dbl> <fct>
1 A 3 s1
2 B 3 s1
3 A 1 s2
4 B 2 s2
5 A 5 s3
6 B 5 s3
7 C 1 s4
8 D 2 s4
9 A 5 s4
This gets the desired result: df1 and df2 were joined using "or", but then everyone except for A and B got removed once we joined df3 using "and". Finally, we join df4 using "or", so persons C and D are included only for s4.
But my question is: is there a way to vectorize this kind of cumulative operation? I ask because 1) I prefer the tidyverse/piped workflow, personally, 2) I'm curious, since I can't figure out how to approach it, and 3) I'm working in a Shiny app, and the code above is currently in a reactive expression where it's convenient to have a single long pipeline instead of multiple intermediate objects.
Reason 3 is not insurmountable; I could definitely alter the Shiny app code if this can't be done in a single pipeline. But I would love to know how, if anyone does come up with a solution.
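One direction that looks promising, though I have only tried it on this toy data, is purrr::reduce2(), which folds over a list while feeding in a second vector that is one element shorter. The sketch below assumes that approach; join_step is a hypothetical helper name of my own, and its "and" branch keeps only people who are in both the accumulated data and the new data.

```r
library(dplyr)
library(purrr)

# Same example data as above
df1 <- data.frame(person = c("A", "B", "C", "D"), rating = c(3, 3, 4, 5), sentence = "s1")
df2 <- data.frame(person = c("A", "C", "D", "B", "E"), rating = c(1, 1, 1, 2, 1), sentence = "s2")
df3 <- data.frame(person = c("A", "B"), rating = c(5, 5), sentence = "s3")
df4 <- data.frame(person = c("C", "D", "A"), rating = c(1, 2, 5), sentence = "s4")
dataList <- list(df1, df2, df3, df4)
joinTypes <- c("or", "and", "or")

# Hypothetical helper: combine the accumulated data with one more sentence.
# acc  = result of all joins so far
# new  = data frame for the next sentence
# type = "or" (keep all rows) or "and" (keep only people present in both)
join_step <- function(acc, new, type) {
  out <- bind_rows(acc, new)
  if (type == "and") {
    out <- filter(out, person %in% acc$person, person %in% new$person)
  }
  out
}

# reduce2() walks dataList left to right, passing the matching join type each time
result <- dataList %>%
  reduce2(joinTypes, join_step)
```

On the example data this gives me the same nine rows as the loop. I have not checked how it behaves when there is only a single sentence and joinTypes is empty.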
Thanks in advance! And let me know if anything is unclear or if I need to provide additional details.