How can I use a for() loop to create multiple numerics based on a dataframe?

cwright1 · August 5, 2020, 4:19pm

I have the dataframe df which contains the columns 'lineage' and 'cluster' :

lineage	cluster
central_nervous_system	53
bone	42
central_nervous_system	7
skin	1
blood	43
central_nervous_system	60
blood	35
blood	6
soft_tissue	64
blood	65
blood	73
blood	41
soft_tissue	68

For each unique row in the lineage column, I want to make a numeric that contains all associated values in the cluster column. For example, clusters_blood, clusters_bone' etc..

I can do this manually, like so (using 'blood' as an example):

clusters_blood <- df[df$lineage=="blood", c("clusters")]

which returns the numeric :

clusters_blood
[1] 43 35 6 65 73 41

My question: How can I use a for() loop to automate this process? Essentially, "For each unique row of df$lineage , return a numeric (with "cluster_" pasted in front) containing all values in the 'cluster' column, respectively.

siddharthprabhu · August 5, 2020, 4:56pm

I'd advise keeping the column in the data frame and using summarise().

library(dplyr, warn.conflicts = FALSE)

df <- tribble(~ lineage, ~ cluster,
              "central_nervous_system", 53,
              "bone", 42,
              "central_nervous_system", 7,
              "skin", 1,
              "blood", 43,
              "central_nervous_system", 60,
              "blood", 35,
              "blood", 6,
              "soft_tissue", 64,
              "blood", 65,
              "blood", 73,
              "blood", 41,
              "soft_tissue", 68)

df %>% 
  group_by(lineage) %>% 
  summarise(cluster_str = paste(cluster, collapse = " "))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 5 x 2
#>   lineage                cluster_str     
#>   <chr>                  <chr>           
#> 1 blood                  43 35 6 65 73 41
#> 2 bone                   42              
#> 3 central_nervous_system 53 7 60         
#> 4 skin                   1               
#> 5 soft_tissue            64 68

^{Created on 2020-08-05 by the reprex package (v0.3.0)}

cwright1 · August 5, 2020, 6:24pm

Thanks for this helpful and interesting response! As far as keeping it in the column, this definitely works. I will keep your suggestion if I can make my next process work, as it originally did when i was putting these in numerics.

When cluster_blood was a numeric returning this :
[1] 43 35 6 65 73 41

I could reference it, answering the question "How many "blood" rows from the dataframe df2 appear in cluster_blood - where df2 was just a huge dataframe of all lineages and what clusters were detected (same as df but from a different source). So if there were rows in df2 called "blood" in the lineage column and in one of the clusters of cluster_blood, they get added up and returned. I did it with something like this :

result_blood <- nrow(df2[df2$lineage=="blood" & df2$cluster %in% cluster_blood,])

But how can I do that if I'm no longer referencing cluster_blood, but rather df$lineage=="blood" ...for all lineages?

siddharthprabhu · August 5, 2020, 6:32pm

You could count the values during the summarise() operation and then simply reference that (here I'm storing the counts in a variable called n).

library(dplyr, warn.conflicts = FALSE)

df <- tribble(~ lineage, ~ cluster,
              "central_nervous_system", 53,
              "bone", 42,
              "central_nervous_system", 7,
              "skin", 1,
              "blood", 43,
              "central_nervous_system", 60,
              "blood", 35,
              "blood", 6,
              "soft_tissue", 64,
              "blood", 65,
              "blood", 73,
              "blood", 41,
              "soft_tissue", 68)

df %>% 
  group_by(lineage) %>% 
  summarise(cluster_str = paste(cluster, collapse = " "), n = n())
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 5 x 3
#>   lineage                cluster_str          n
#>   <chr>                  <chr>            <int>
#> 1 blood                  43 35 6 65 73 41     6
#> 2 bone                   42                   1
#> 3 central_nervous_system 53 7 60              3
#> 4 skin                   1                    1
#> 5 soft_tissue            64 68                2

^{Created on 2020-08-06 by the reprex package (v0.3.0)}

cwright1 · August 5, 2020, 7:01pm

Sorry - I didn't explain clearly enough my apologies! I want to incorporate your advice of keeping things in that dataframe. Here's a fresh explanation of what I'm trying to do:

df_testset is lineages and the clusters they were put into based on a set of samples that were tested called "testset"

df_trueset is exactly the same, but is known to be correct, "trueset"

Question: How many samples of a given lineage within testset, were correctly classified into the correct clusters (of the respective lineage) based on the trueset?

#Make the 2 example dataframes

df_trueset <- tribble(~ lineage, ~ cluster,
              "central_nervous_system", 53,
              "bone", 42,
              "central_nervous_system", 7,
              "skin", 1,
              "blood", 43,
              "central_nervous_system", 60,
              "blood", 35,
              "blood", 6,
              "soft_tissue", 64,
              "blood", 65,
              "blood", 73,
              "blood", 41,
              "soft_tissue", 68)

df_testset <- tribble(~ lineage, ~ cluster,
              "central_nervous_system", 90,
              "bone", 42,
              "skin", 1,
              "blood", 0,
              "central_nervous_system", 53,
              "blood", 0,
              "blood", 6,
              "soft_tissue", 64,
              "blood", 65,
              "blood", 73,
              "blood", 41,
              "soft_tissue", 68)

So, I think my overall approach would be:

Get the clusters of every lineage based on df_trueset
Ex: the 'true' clusters for blood are 43, 35, 6, 65, 73, and 41
Find out how many entires of a given lineage of df_testset match the respective clusters.
Ex: The answer for blood is 4. That is because out of df_testset, 4 of the clusters are correct (if blood = 0, that's incorrect because 0 is not a true cluster for blood based on df_trueset.

If I use the approach like you mentioned before, then my goal dataframe should have all the clusters (like your example generates), but also another column indicating how many are correct matches for each lineage.

EDIT:
I could gather all the clusters in 2 columns and just now need to ask the question, "how many entries in cluster_testset match cluster_trueset?"

siddharthprabhu · August 5, 2020, 8:37pm

Thank you for the detailed explanation (and for providing the sample data in an easy-to-copy format). It's really quite clear now. Here's how I would approach solving the problem.

The key problem to solve really is to find out which combinations of lineage and cluster are present in both data sets. For this kind of "which records in x are also present in y" problem, dplyr's semi_join() is very handy. It returns a data frame with only the matched records.

Once we have identified them, it's just a simple matter of summarising each data frame and joining the results as shown below. I've included the cluster numbers for reference.

library(dplyr, warn.conflicts = FALSE)

df_trueset <- tribble(~ lineage, ~ cluster,
                      "central_nervous_system", 53,
                      "bone", 42,
                      "central_nervous_system", 7,
                      "skin", 1,
                      "blood", 43,
                      "central_nervous_system", 60,
                      "blood", 35,
                      "blood", 6,
                      "soft_tissue", 64,
                      "blood", 65,
                      "blood", 73,
                      "blood", 41,
                      "soft_tissue", 68)

df_testset <- tribble(~ lineage, ~ cluster,
                      "central_nervous_system", 90,
                      "bone", 42,
                      "skin", 1,
                      "blood", 0,
                      "central_nervous_system", 53,
                      "blood", 0,
                      "blood", 6,
                      "soft_tissue", 64,
                      "blood", 65,
                      "blood", 73,
                      "blood", 41,
                      "soft_tissue", 68)

# Find entries in df_testset that are also present in df_trueset.
df_matched <- semi_join(df_testset, df_trueset, by = c("lineage", "cluster"))

# Summarise the true set.
df_trueset_summary <- df_trueset %>% 
  group_by(lineage) %>% 
  summarise(clusters_trueset = paste(cluster, collapse = " "),  
            n_trueset = n())
#> `summarise()` ungrouping output (override with `.groups` argument)

# Summarise the matched set.
df_matched_summary <- df_matched %>% 
  group_by(lineage) %>% 
  summarise(clusters_matched = paste(cluster, collapse = " "),  
            n_matched = n())
#> `summarise()` ungrouping output (override with `.groups` argument)

# Join the resulting tibbles.
left_join(df_trueset_summary, df_matched_summary, by = "lineage")
#> # A tibble: 5 x 5
#>   lineage                clusters_trueset n_trueset clusters_matched n_matched
#>   <chr>                  <chr>                <int> <chr>                <int>
#> 1 blood                  43 35 6 65 73 41         6 6 65 73 41               4
#> 2 bone                   42                       1 42                       1
#> 3 central_nervous_system 53 7 60                  3 53                       1
#> 4 skin                   1                        1 1                        1
#> 5 soft_tissue            64 68                    2 64 68                    2

^{Created on 2020-08-06 by the reprex package (v0.3.0)}

Note: I've used left_join() to combine the summary data frames so that every lineage in df_trueset will be reported in the results. But if you don't want cases where no matching clusters are found in df_testset you can swap it with an inner_join() instead.

cwright1 · August 6, 2020, 7:41pm

Thanks for your help! I will give the latest update of what I'm trying to accomplish, and what the result looks like using your example. You certainly got me closer than I was before, and thanks for all your help!

# Make a Dataframe of lineage, testset, and truest
library(dplyr, warn.conflicts = FALSE)

df_trueset <- tribble(~ lineage, ~ cluster,
"blood", 43,
"blood",36,
"blood",6,
"blood",65,
"blood",73,
"blood",41,
"bone",42,
"central",53,
"central",7,
"central",60,
"skin",73,
"soft",60,
"soft",68)


df_testset <- tribble(~ lineage, ~ cluster,
"blood", 0,
"blood",0,
"blood",6,
"blood",65,
"blood",73,
"blood",41,
"bone",42,
"central",90,
"central",53,
"skin",1,
"soft",65,
"soft",68)

Ultimate goal : Make and attach to df_final the following 4 columns:
(Using blood lineage as an example, but I need to do this for each row)

a_result = # of blood clusters in testset that DO match with blood clusters in trueset
b_result = # of blood clusters in testset that DO NOT match with blood clusters in trueset
c_result = # of any NON-blood clusters of testset that DO match with blood clusters in trueset
d_result = # any NON-blood clusters that DO NOT match with blood clusters in trueset (which will be the biggest number)

First get to df_final by using your examples:

# Find entries in df_testset that are also present in df_trueset.
df_matched <- semi_join(df_testset, df_trueset, by = c("lineage", "cluster"))

# Summarise the true set.
df_trueset_summary <- df_trueset %>% 
  group_by(lineage) %>% 
  summarise(clusters_trueset = paste(cluster, collapse = " "),  
            n_trueset = n())
#> `summarise()` ungrouping output (override with `.groups` argument)

# Summarise the matched set.
df_matched_summary <- df_matched %>% 
  group_by(lineage) %>% 
  summarise(clusters_matched = paste(cluster, collapse = " "),  
            n_matched = n())
#> `summarise()` ungrouping output (override with `.groups` argument)

# Join the resulting tibbles.
df_final <- left_join(df_trueset_summary, df_matched_summary, by = "lineage")

Now inspect df_final to get the manual answers of a/b/c/d_result :

df_final
  lineage clusters_trueset n_trueset clusters_matched n_matched
  <chr>   <chr>                <int> <chr>                <int>
1 blood   43 36 6 65 73 41         6 6 65 73 41               4
2 bone    42                       1 42                       1
3 central 53 7 60                  3 53                       1
4 skin    73                       1 NA                      NA
5 soft    60 68                    2 68                       1

Again, using just blood as the example:
a_result = 4 (4 blood in testset match with trueset)
b_result = 2 (2 testset blood clusters do no match anywhere with blood trueset)
c_result = 1 (Only 1 non-blood cluster in the testset column matches with one of the blood trueset clusters [it is soft, cluster 65])
d_result = 5 (There are 5 non-blood clusters in testset that do not match with blood clusters of trueset)

Aproach
First, the n_matched column is exactly equal to what I want for a_result. Therefore, change the name of that column because it's already done:

names(df_final)[names(df_final) == "n_matched"] <- "a_result"

How can I continue using this method to get the b_, c_, and d_result columns?

Thanks for your help!

siddharthprabhu · August 7, 2020, 6:36am

b_result is straightforward; we can just compute the difference between n_trueset and a_result.

df_final %>% 
  mutate(b_result = n_trueset - a_result)
# A tibble: 5 x 6
  lineage clusters_trueset n_trueset clusters_matched a_result b_result
  <chr>   <chr>                <int> <chr>               <int>    <int>
1 blood   43 36 6 65 73 41         6 6 65 73 41              4        2
2 bone    42                       1 42                      1        0
3 central 53 7 60                  3 53                      1        2
4 skin    73                       1 NA                     NA       NA
5 soft    60 68                    2 68                      1        1

skin doesn't have a cluster 65 in df_testset. I'm presuming this is a typo and you meant soft instead.

Calculating c_result and d_result is tricky because it requires working across rows. I think you may have to go with the approach suggested by technocrat in the other thread. I'll write back here if I come up with a different solution.

cwright1 · August 7, 2020, 9:20am

Thanks again for your time and help! good catch, yes it was a typo and was soft (not skin) - I've edited the post.

Yes, unfortunately c_result and d_result are difficult because, like you mentioned, it involves all rows in a column. In the other thread, technocrat's example wasn't able to solve this either (directly, anyway. He did give me an idea I'm just trying to figure out how to implement it).

That example did work for one row, but not multiple rows without pre-defining the non-X clusters (non-blood for the 'blood' row, non-bone for the 'bone' row, etc).

siddharthprabhu · August 7, 2020, 4:09pm

Can you please provide the expected values for c_result and d_result for all 5 lineages in this data set?

cwright1 · August 7, 2020, 6:18pm

Certainly! Here is what the final result should be :

lineage	clusters_testset	clusters_trueset	a_result	b_result	c_result	d_result
blood	0 0 6 65 73 41	43 35 6 65 73 41	4	2	1	5
bone	42	42	1	0	0	11
central	90 53	53 7 60	1	1	0	10
skin	1	73	0	1	1	11
soft	65 68	60 68	1	1	0	10

A couple examples:
c_result for soft: "How many non-soft clusters in the testset column are found within the soft trueset column?" Well, the non-soft clusters in testset are 0,0,6,65,73,41,42,90,53,1. Out of those, zero match with 60 or 68.

d_result for soft: "How many non-soft clusters in testset are NOT found in the soft trueset (60, 68)?"
Blood has 6 that don't match
bone has 1 that doesn't match
central has 2 that don't match
skin has 1 that doesn't match
So the answer = 10

Thanks for helping out!

system · August 28, 2020, 6:18pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.