appending dataframes to list named by variable

evodevo · August 27, 2020, 4:27am

Hi R community,

So below I've got two bits of code for the same problem. The first is the actual code I'm working on (which obviously won't be able to run) and the second is my attempt at a reprex (though I'm new to this so forgive me if I've done this wrong).

Ok, so here's my actual code:


for (i in 1:nrow(enrichResList[[1]])) {
  #get the names of each ontology and the analysis and put it into an empty list variable 'v'
  v <- append(v, as.name(paste(names(enrichResList)[[1]], "_", enrichResList[[1]]@result[i,]$Description, sep = "")))
  #get the geneIDs for each and put them in the list variable as well
  v[[i]] <- append(v[[i]], enrichResList[[1]]@result[i,]$geneID)
  #split the geneIDs from 1 string into multiple strings
  v[[i]][[2]] <- str_split(v[[i]][[2]], "/")
  #set the names of each element in the vector to be the name of the analysis/ontology from line one
  v[[i]] <- setNames(v[[i]][[2]], as.character(v[[i]][[1]]))
}

#remove the top level from the list of lists so that it is a list of vectors with names corresponding to experiments
v <- flatten(v)

#for each of the vectors in the list, v, subset the FTD_Mouse_Ontology dataframe by the entrez IDs (individual vector elements)
#is there a way to keep these subset dataframes in a list of dataframes?

for (i in 1:length(v)) {
  #this splits the elements into separate dataframes with proper naming
  assign(names(v)[[i]], subset(FTD_Mouse_Ontology, entrez %in% v[[i]]))
}

So this works fine. Basically I've gone from an S4 object which contains a dataframe enrichResList[[1]]@result. Ultimately, I will be trying to convert all of this to a function so the [[1]] index is just a placeholder as I actually have 4 elements which I will want to process in this way.

So each row of the dataframe contains a few bits of information, and I want to extract two of the columns (one of which contains a string variable that is actually multiple variables that need to be split). This is what I'm doing in the first for loop - just extracting the data and formatting the character strings properly as separate observations/elements rather than as 1 thing.
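As a minimal sketch of that splitting step, assuming the IDs arrive as one "/"-separated string per row: base `strsplit()` can handle every row at once. The `result` data frame below is a made-up stand-in, not the real `enrichResList[[1]]@result` slot.

```r
# Hypothetical stand-in for enrichResList[[1]]@result
result <- data.frame(
  Description = c("CellCycle", "Apoptosis"),
  geneID = c("12/34/56", "78/90"),
  stringsAsFactors = FALSE
)

# strsplit() is vectorised: it returns one character vector of IDs per row
ids <- strsplit(result$geneID, "/", fixed = TRUE)

# Name each vector after its ontology term
names(ids) <- paste0("clusterGO_", result$Description)

str(ids)
```

Note that the split IDs are character strings, so they may need converting if the column they are later matched against is numeric.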

From there, I flatten the resulting list so that it is a list of vectors, each of which has a name that is relevant to what it is, e.g. v[[1]] is a vector called clusterGO_CellCycle which contains a bunch of numbers that are basically ID numbers (entrez IDs for those of you in bioinformatics). Simple enough so far. (P.S. these IDs are what were contained in the string variable that was split.)

I want to then use these IDs to subset a dataframe. It's important to iterate through each vector, subset the dataframe, and call the resulting newly subset dataframe a name that corresponds to the vector that it has come from, this is what I'm doing with the second for loop.

However, this just creates a bunch of dataframes in my environment, and because I am going to be doing this for many lists of vectors, and some of them will be quite long, I want to be able to group the dataframes together into a list according to which list of vectors they've come from. So in the example above, I would want to name the new list paste(names(enrichResList)[[1]], "_dfList", sep = "")

But as I'm sure you all know, I can't create an empty list with a name that is produced from paste(), so I would have to use assign. This would look something like:

for (i in 1:length(v)) {
  #this splits the elements into separate dataframes with proper naming
  assign(paste(names(enrichResList)[[1]], "_dfList", sep = "") ,assign(names(v)[[i]], subset(FTD_Mouse_Ontology, entrez %in% v[[i]])))
}

But this doesn't work as I would like.
So my problem is, how do I make an empty list with a name that is produced from paste() which I can assign dataframes as I iteratively create them by subsetting a larger dataframe.
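One possible way around this, sketched under the assumption that the computed name only needs to be a label and not a free-standing variable, is to keep a single container list and use the pasted string as a key. The names `all_df_lists`, `"clusterGO"` and the tiny data frame are hypothetical.

```r
# Hypothetical container holding one list of data frames per experiment
all_df_lists <- list()

# The name is computed with paste(), exactly as in the question
list_name <- paste("clusterGO", "_dfList", sep = "")

# `[[<-` accepts a computed string, so assign() is not needed
all_df_lists[[list_name]] <- list()
all_df_lists[[list_name]][["CellCycle"]] <- data.frame(entrez = c(12, 34))

names(all_df_lists)
```

Keeping everything in one container also means the lists can later be iterated over with `lapply()` rather than looked up by name with `get()`.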

Ok, so that's a lot of information sorry, so below find my attempt at a reprex:


#start with a dataframe containing some information (e.g. numeric IDs)
id <- sample(1:100, 50, replace=TRUE)
n <- sample(1:100, 50, replace = TRUE)
ratio <- rnorm(50, mean = 0.5, sd=0.1)

start_df <- data.frame(id,n,ratio)

#have some other data structure which contains named elements
experiment <- list("item1" = sample(1:100, 100), "item2" = sample(1:100, 100))

#create a list of vectors
v1 <- sample(1:100, 10, replace = TRUE)
v2 <- sample(1:100, 10, replace = TRUE)
v3 <- sample(1:100, 10, replace = TRUE)

v_list <- list("res1" = v1, "res2" = v2, "res3" = v3)

#for each vector in the list, subset the original dataframe by the elements of the vector
for (i in 1:length(v_list)) {
  assign(names(v_list)[[i]], subset(start_df, id %in% v_list[[i]]))
}

#NEXT STEP: assign the resulting dataframes to a list which is named based upon some variables
#e.g:

nameOfDfList <- paste(names(experiment)[[1]], "_dfList", sep="")

#So what I'm trying to achieve, instead of the above for loop, would be something like:
for (i in 1:length(v_list)) {
  assign(paste(names(experiment)[[1]], "_dfList", sep=""), assign(names(v_list)[[i]], subset(start_df, id %in% v_list[[i]])))
}

#however this overwrites the list of dataframes so that I end up with only the dataframe produced last from the for loop

The problem can be seen here as: how do I get res1, res2 and res3 into a list which is named based upon the named elements of v_list?
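A sketch of one way to do this with the reprex objects: the nested `assign()` overwrites the same variable on every pass, so only the last data frame survives. Building the whole list first and calling `assign()` once avoids that. The seed is only there to make the sketch repeatable.

```r
set.seed(1)  # assumption: fixed seed, only for repeatability

start_df <- data.frame(
  id = sample(1:100, 50, replace = TRUE),
  n = sample(1:100, 50, replace = TRUE),
  ratio = rnorm(50, mean = 0.5, sd = 0.1)
)
experiment <- list(item1 = sample(1:100, 100), item2 = sample(1:100, 100))
v_list <- list(
  res1 = sample(1:100, 10, replace = TRUE),
  res2 = sample(1:100, 10, replace = TRUE),
  res3 = sample(1:100, 10, replace = TRUE)
)

# lapply() keeps the names of v_list, so the result is already a named list
df_list <- lapply(v_list, function(ids) start_df[start_df$id %in% ids, ])

# A single assign() after the loop creates the variable item1_dfList
assign(paste0(names(experiment)[[1]], "_dfList"), df_list)

names(item1_dfList)
```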

I have also had a go experimenting with as.name() but this doesn't seem to work well either....

thanks for any help or advice you can give, and sorry if the first chunk of code causes any confusion!

HanOostdijk · August 27, 2020, 10:38am

Is it an option to first assign the values and in the last step the names as in :

v1 <- sample(1:100, 10, replace = TRUE)
v2 <- sample(1:100, 10, replace = TRUE)
v3 <- sample(1:100, 10, replace = TRUE)
v_list <- list("res1" = v1, "res2" = v2, "res3" = v3)

names(v_list) = c('London','Paris','Amsterdam')
str(v_list)
#> List of 3
#>  $ London   : int [1:10] 98 17 12 41 6 71 2 100 43 31
#>  $ Paris    : int [1:10] 99 21 79 91 31 37 80 29 80 42
#>  $ Amsterdam: int [1:10] 71 97 22 56 15 86 31 49 33 50

Created on 2020-08-27 by the reprex package (v0.3.0)

evodevo · August 27, 2020, 11:29pm

Thanks for getting back to me.

So, the issue isn't so much the names of the elements in v_list. These are already named. In the first for loop of the actual code this is done on the first line and the last line:

v <- append(v, as.name(paste(names(enrichResList)[[1]], "_", enrichResList[[1]]@result[i,]$Description, sep = "")))

and

v[[i]] <- setNames(v[[i]][[2]], as.character(v[[i]][[1]]))

in the reprex example, this is done when I created the list of vectors (v_list):

v_list <- list("res1" = v1, "res2" = v2, "res3" = v3)

The actual issue is, these vectors are then used to subset a separate dataframe FTD_Mouse_Ontology or start_df respectively in the above code.

This subsetting produces the same number of individual dataframes as there are vectors in the list (i.e. the number of new dataframes = length(v_list)).

My issue then is that I have n = length(v_list) dataframes in the environment, and I would instead like these in a list of dataframes (which is as of yet not created). This list is what I'm looking to name, and the name should reference other variables:

In the actual code the name will look like:

paste(names(enrichResList)[[1]], "_dfList", sep = "")

in the dummy code/reprex example:

paste(names(experiment)[[1]], "_dfList", sep="")

evodevo · August 28, 2020, 12:12am

Ok, in the end, I've decided that the effort required to automate this is more than what is reasonable for the outcome. So I think the solution is to just do this manually.

technocrat · August 28, 2020, 4:46am

suppressPackageStartupMessages({library(dplyr)
                                library(pander)
                                library(purrr)
                                library(tidyr)
                              })
f <- function(x) {
  map(1:length(mk_vs()),sub_df) %>% set_names(.,mk_names(mk_vs(),x))
}

mk_synth_df <- function() {
  set.seed(137)
  ID <- sample(1:100, 50, replace=TRUE)
  N <- sample(1:100, 50, replace = TRUE)
  ratio <- rnorm(50, mean = 0.5, sd=0.1)
  data.frame(ID,N,ratio)
}

mk_list <- function() {
  set.seed(137)
  list("item1" = sample(1:100, 100), "item2" = sample(1:100, 100))
}

mk_vs <- function() {
  set.seed(137)
  v1 <- sample(1:100, 10, replace = TRUE)
  v2 <- sample(1:100, 10, replace = TRUE)
  v3 <- sample(1:100, 10, replace = TRUE)
  list("res1" = v1, "res2" = v2, "res3" = v3)
}

mk_names <- function(x,y) {
  paste(names(x)[1:length(x)],y, sep="")
}

sub_df <- function(x) {
  drop_na(mk_synth_df()[mk_vs()[[x]],])
}

f("_dFList")
#> $res1_dFList
#>    ID  N     ratio
#> 34 47 27 0.4427454
#> 8  96 22 0.5997265
#> 38 71 44 0.6615295
#> 39 14 83 0.4865537
#> 35 15 59 0.4820364
#> 
#> $res2_dFList
#>    ID  N     ratio
#> 30 35 89 0.6159864
#> 48 38 43 0.3970454
#> 13 13 21 0.4866509
#> 43 51 14 0.4953379
#> 14 43 55 0.5479219
#> 22 48 96 0.5725216
#> 4  38 44 0.4549781
#> 6  39 25 0.7056911
#> 
#> $res3_dFList
#>      ID  N     ratio
#> 15   14 89 0.5237554
#> 48   38 43 0.3970454
#> 32   86 89 0.5995506
#> 13   13 21 0.4866509
#> 14   43 55 0.5479219
#> 35   15 59 0.4820364
#> 35.1 15 59 0.4820364

Created on 2020-08-27 by the reprex package (v0.3.0)

evodevo · August 31, 2020, 7:39am

Hi @technocrat, this looks really interesting and seems as though it might solve my problem.

I'm not very experienced with reading/working with this sort of functional programming though.

For my own learning, I'm going to try and write out a step by step explanation of what is happening here. If you have the time, it would be great if you could confirm/correct my understanding (and hopefully this will prove useful, or at worst interesting, for someone else in the future!).

From the top:
So first the function f defined at the start applies sub_df() to each element in the list of named vectors (produced here by mk_vs but equivalent to v_list in my example).

So what does sub_df do?
sub_df takes the argument x and removes any rows with missing values (x here would be a vector I suppose given that the function is applied to a list of vectors). So within the call drop_na(), you are retrieving the xth index of the list of vectors, and using this to subset a row from a dataframe which has been created (to match my reprex example) by mk_synth_df. So my understanding would be that this is where the subsetting is occurring:

mk_synth_df()[mk_vs()[[x]],] - that is, use the value of the xth element from the list of vectors as the row index from the dataframe.

This seems like it would be equivalent to:

start_df[v_list[[x]],]

Given that v_list[[x]] is actually a vector, not a double, would this then return multiple rows corresponding to all of the rows that match on values? (this is very cool)
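A small sketch may help with this question. Indexing a data frame with a numeric vector selects rows by position, one row per index, not rows whose ID column matches those values. Out-of-range positions come back as all-NA rows (which is what `drop_na()` then removes), and repeated positions get a `.1` suffix in the row names, like the `35.1` row in the output above. The toy `df` and `wanted` are made up for illustration.

```r
# Toy data: ids deliberately differ from the row positions
df <- data.frame(id = c(10, 20, 30, 40), n = c(1, 2, 3, 4))
wanted <- c(2, 2, 40)

# By position: row 2, row 2 again, then row 40, which does not exist
by_position <- df[wanted, ]

# By value: rows whose id is one of the wanted values
by_value <- df[df$id %in% wanted, ]

by_position
by_value
```

Matching by value, with `%in%` or `dplyr::filter()`, is the form that corresponds to the `subset(start_df, id %in% v_list[[i]])` call in the original question.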

Second part of the function f

So we have some list of dataframes (produced by map(), which returns a list by default) that are subset by the values from the list of vectors. This is then piped to a function set_names().

Is the . as the first argument redundant here though, given that we are piping?

Also I assume based on this that set_names takes lists as well as vectors for its first argument?
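Both questions can be checked with a short sketch, assuming purrr is installed (it provides `set_names()` and re-exports the `%>%` pipe). The two toy data frames are made up.

```r
library(purrr)  # set_names() and the re-exported %>% pipe

# A plain list of data frames, standing in for the output of map()
dfs <- list(data.frame(a = 1), data.frame(a = 2))

# With the explicit dot placeholder as first argument
with_dot <- dfs %>% set_names(., c("res1", "res2"))

# Without it: the pipe inserts the left-hand side as the first argument
without_dot <- dfs %>% set_names(c("res1", "res2"))

identical(with_dot, without_dot)
names(with_dot)
```

So the dot is optional in this position, and `set_names()` is happy to name the elements of a list.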

The second argument of set_names is a list of names produced by mk_names().

mk_names does the following:

Take two arguments (the first argument, x, is the list of vectors, and the second argument y, is equal to the input of function f)

So mk_names takes the name of each vector in the list x and pastes y onto the end of it, where y is whatever string is passed to f.

f is called on the string "_dFList", so this becomes the suffix that mk_names appends.
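The naming step can be sketched in isolation. Because `paste()` and `paste0()` are vectorised, the single suffix is recycled across every name, and `names(x)[1:length(x)]` selects all of the names, so it is the same as `names(x)`. The toy `v_list` and `suffix` below are illustrative only.

```r
# Toy list of named vectors
v_list <- list(res1 = 1:3, res2 = 4:6, res3 = 7:9)
suffix <- "_dfList"

# paste0() recycles the one suffix across all three names
new_names <- paste0(names(v_list), suffix)

# Indexing with 1:length(x) picks every name, so nothing changes
same <- identical(names(v_list)[1:length(v_list)], names(v_list))

new_names
```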

This is really beautifully done, I am yet to try adapt it to my actual dataset, but thanks very much even just for taking the time to produce this awesome solution!

technocrat · August 31, 2020, 8:42am

What creates a painful learning curve in R is often an impatience with analysis—picking apart a code block down to its smallest pieces, seeing what each does and putting them back again together. It's the same impatience that traditionally keeps first year law students up all night looking for answers instead of questions. Only when the realization hits that there is far too much material to read to find answers do they reach for the tools to construct the right questions.

You've done a great take down of the code and it illustrates a pattern that everyone should know and apply to help(). Like the code block in the example, every R function has arguments and results that are derived in just this stepwise fashion. Following a help example with this mindset makes the documentation actually useful rather than more cruel hazing.

evodevo · August 31, 2020, 11:01am

Thanks again!

As someone who is looking to improve on my understanding of functional programming, I was wondering:

When you developed this solution, did you start at the function f or at the "building blocks" (e.g. sub_df)?

technocrat · August 31, 2020, 2:53pm

I knew that there would be a function f or a chain expression, but at the start its composition was aspirational. That left discovery of the form of the return value of f, y and the objects, x_i ... x_n available. Some of those objects, in turn, required other objects to complete.

For example, mk_synth_df() exists as an abstraction of a data frame of stipulated structure with arbitrary values. In

mk_synth_df()[mk_vs()[[x]],]

the function object’s return value is directly subsetted because in R functions are first-class citizens.
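A small sketch of that indexing pattern, with a hypothetical `make_df()` standing in for `mk_synth_df()`: the call is evaluated first, and the data frame it returns is then indexed like any other value.

```r
# Hypothetical helper that returns a fresh data frame on every call
make_df <- function() data.frame(id = 1:5, n = 6:10)

# Index the returned value directly, without storing it first
first_two <- make_df()[1:2, ]

# Equivalent two-step form with an intermediate variable
tmp <- make_df()
also_first_two <- tmp[1:2, ]

identical(first_two, also_first_two)
```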

In an imperative language pcode has an analogous model of dealing with the problem abstractly. However, that aims at the narrower question of what to do next. Because of lazy evaluation the focus in a functional language is the what, not the how.

Roundabout way of saying working from both ends to the middle and the order in which f is composed does not signify.

evodevo · September 2, 2020, 3:43am

Thanks for this.

I was able to adapt this solution to my dataset! Whilst others won't be able to run this code, I thought I'd provide it for posterity and completeness. The approach was slightly different as my dataframes were actually contained as slots in S4 objects which themselves were in a list (with their names stored as the keys in the list). The naming also worked slightly differently:

library(purrr)
library(stringr) #for str_split()
library(dplyr)   #for filter() and %>%

#for loop splits the geneIDs (located at enrichRes_List[[i]]@result$geneID) into a vector of character strings
#rather than 1 long string (per ontology)
for (i in seq_along(enrichRes_List)) {
  for (k in 1:nrow(enrichRes_List[[i]]@result)) {
    enrichRes_List[[i]]@result[k,]$geneID <- str_split(enrichRes_List[[i]]@result[k,]$geneID, "/")
  }
}

#create a vector for each row of the dataframe containing geneIDs and name the vector the name of the ontology (Description)
#from enrichRes_list@result$Description to names(v)

#function produces a list of vectors with geneIDs included
mk_v <- function(x) {
          v <- vector("list", length = nrow(x@result))
          v <- map(1:length(v), ~{append(v[[.x]], x@result[.x,]$geneID)})
          v <- flatten(v)
          return(v)
}

#input is an individual S4 object, output is a list of vectors
name_v <- function(x) {
            setNames(mk_v(x), x@result$Description)
}

#if name_v() is called on an S4 object it will produce a list of vectors with gene IDs

#for mk_names - x should be a list of vectors
mk_names <- function(x) {
              names(x)[1:length(x)]
}

sub_df <- function(x,y) {
            FTD_Mouse_tidy %>% filter(entrez %in% name_v(x)[[y]])
}

#takes as its argument an enrichResult object; the list names come from the ontology Descriptions
f <- function(e) {
        map(1:length(name_v(e)), ~{sub_df(x=e, y=.x)}) %>% set_names(., mk_names(name_v(e)))
}

#final product, a list of length equal to the number of analyses.
#each list item contains dataframes for each ontology
df_list_Ont <- lapply(enrichRes_List, f)

thanks again

Edit: I just noticed the subsetting function sub_df() in my last post did not subset correctly. I modified it to use filter(), which now functions as expected:

sub_df <- function(x,y) {
            FTD_Mouse_tidy %>% filter(entrez %in% name_v(x)[[y]])
}
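For readers without the original data, the overall shape of the result can be sketched in base R. This is not the code above: plain data frames stand in for the S4 `enrichResult` objects, and `experiments`, `ontology` and `subset_by_term()` are hypothetical names.

```r
# Hypothetical stand-ins: data frames instead of S4 enrichResult objects
experiments <- list(
  clusterGO = data.frame(Description = c("CellCycle", "Apoptosis"),
                         geneID = c("12/34", "56"),
                         stringsAsFactors = FALSE),
  clusterKEGG = data.frame(Description = "Signalling",
                           geneID = "34/56",
                           stringsAsFactors = FALSE)
)

# Hypothetical stand-in for FTD_Mouse_tidy
ontology <- data.frame(entrez = c("12", "34", "56", "78"),
                       symbol = c("A", "B", "C", "D"),
                       stringsAsFactors = FALSE)

# One named list of subset data frames per result table
subset_by_term <- function(res) {
  ids <- setNames(strsplit(res$geneID, "/", fixed = TRUE), res$Description)
  lapply(ids, function(x) ontology[ontology$entrez %in% x, ])
}

# Outer list: one element per experiment; inner list: one data frame per term
df_lists <- lapply(experiments, subset_by_term)

df_lists$clusterGO$CellCycle
```

Both levels keep their names, so any data frame can be reached as `df_lists[[experiment]][[term]]` without creating variables in the environment.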
