Thanks for your help! Here is the latest update on what I'm trying to accomplish, and what the result looks like using your example. You certainly got me closer than I was before.
# Make the data frames for the trueset and the testset (lineage, cluster)
library(dplyr, warn.conflicts = FALSE)
df_trueset <- tribble(
  ~lineage,  ~cluster,
  "blood",   43,
  "blood",   36,
  "blood",   6,
  "blood",   65,
  "blood",   73,
  "blood",   41,
  "bone",    42,
  "central", 53,
  "central", 7,
  "central", 60,
  "skin",    73,
  "soft",    60,
  "soft",    68
)
df_testset <- tribble(
  ~lineage,  ~cluster,
  "blood",   0,
  "blood",   0,
  "blood",   6,
  "blood",   65,
  "blood",   73,
  "blood",   41,
  "bone",    42,
  "central", 90,
  "central", 53,
  "skin",    1,
  "soft",    65,
  "soft",    68
)
Ultimate goal: make the following 4 columns and attach them to df_final.
(I'm using the blood lineage as an example, but I need to do this for each row.)
a_result = # of blood clusters in testset that DO match with blood clusters in trueset
b_result = # of blood clusters in testset that DO NOT match with blood clusters in trueset
c_result = # of any NON-blood clusters of testset that DO match with blood clusters in trueset
d_result = # of any NON-blood clusters of testset that DO NOT match with blood clusters in trueset (which will be the biggest number)
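To make these four definitions concrete, here is a minimal sketch of how I read them, written as a small helper. The name count_abcd is made up for illustration, and the sketch assumes that every testset row counts separately (so the two blood rows with cluster 0 count as 2). The data is the same as above, just written compactly so the block runs on its own. For blood it gives a = 4, b = 2, c = 1, d = 5.

```r
library(dplyr, warn.conflicts = FALSE)

# Same data as above, written compactly so this block is self-contained.
df_trueset <- tibble(
  lineage = c(rep("blood", 6), "bone", rep("central", 3), "skin", rep("soft", 2)),
  cluster = c(43, 36, 6, 65, 73, 41, 42, 53, 7, 60, 73, 60, 68)
)
df_testset <- tibble(
  lineage = c(rep("blood", 6), "bone", rep("central", 2), "skin", rep("soft", 2)),
  cluster = c(0, 0, 6, 65, 73, 41, 42, 90, 53, 1, 65, 68)
)

# Hypothetical helper (not from dplyr): counts testset ROWS, not distinct clusters.
count_abcd <- function(target, testset, trueset) {
  # Clusters that the trueset assigns to the target lineage.
  true_clusters <- trueset$cluster[trueset$lineage == target]
  # Is each testset row in the target lineage? Is its cluster in the target trueset?
  same_lineage <- testset$lineage == target
  in_true      <- testset$cluster %in% true_clusters
  tibble(
    lineage  = target,
    a_result = sum(same_lineage & in_true),
    b_result = sum(same_lineage & !in_true),
    c_result = sum(!same_lineage & in_true),
    d_result = sum(!same_lineage & !in_true)
  )
}

# One lineage.
blood_counts <- count_abcd("blood", df_testset, df_trueset)

# All lineages, one row each.
all_counts <- bind_rows(lapply(
  unique(df_trueset$lineage), count_abcd,
  testset = df_testset, trueset = df_trueset
))
```

Because all_counts has one row per lineage, it could in principle be attached to df_final with a left_join by "lineage", but I would prefer to build on the join-based method if that is possible.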
First, get to df_final by using your examples:
# Find entries in df_testset that are also present in df_trueset.
df_matched <- semi_join(df_testset, df_trueset, by = c("lineage", "cluster"))
# Summarise the true set.
df_trueset_summary <- df_trueset %>%
  group_by(lineage) %>%
  summarise(clusters_trueset = paste(cluster, collapse = " "),
            n_trueset = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
# Summarise the matched set.
df_matched_summary <- df_matched %>%
  group_by(lineage) %>%
  summarise(clusters_matched = paste(cluster, collapse = " "),
            n_matched = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
# Join the resulting tibbles.
df_final <- left_join(df_trueset_summary, df_matched_summary, by = "lineage")
Now inspect df_final to get the manual answers for a/b/c/d_result:
df_final
  lineage clusters_trueset n_trueset clusters_matched n_matched
  <chr>   <chr>                <int> <chr>                <int>
1 blood   43 36 6 65 73 41         6 6 65 73 41               4
2 bone    42                       1 42                       1
3 central 53 7 60                  3 53                       1
4 skin    73                       1 NA                      NA
5 soft    60 68                    2 68                       1
Again, using just blood as the example:
a_result = 4 (4 blood in testset match with trueset)
b_result = 2 (2 testset blood clusters do not match any of the blood trueset clusters)
c_result = 1 (Only 1 non-blood cluster in the testset column matches with one of the blood trueset clusters [it is soft, cluster 65])
d_result = 5 (There are 5 non-blood clusters in testset that do not match with blood clusters of trueset)
Approach
First, the n_matched column is exactly equal to what I want for a_result. Therefore, change the name of that column because it's already done:
names(df_final)[names(df_final) == "n_matched"] <- "a_result"
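One caveat with the rename: skin has no matches, so its n_matched is NA, and a_result would be NA there rather than 0. Below is a minimal sketch of the same rename with the NA turned into 0, assuming that NA should indeed count as zero matches. df_final_demo is a stand-in holding only the two relevant columns, with values copied from the printout above.

```r
library(dplyr, warn.conflicts = FALSE)

# Stand-in for the relevant columns of df_final (values from the printout above).
df_final_demo <- tibble(
  lineage   = c("blood", "bone", "central", "skin", "soft"),
  n_matched = c(4L, 1L, 1L, NA, 1L)
)

df_final_demo <- df_final_demo %>%
  rename(a_result = n_matched) %>%           # same effect as the names() assignment
  mutate(a_result = coalesce(a_result, 0L))  # assumption: no match means 0, not NA
```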
How can I continue using this method to get the b_, c_, and d_result columns?
Thanks for your help!