Group_by on elements of a large list

NuKeRiSk · September 16, 2019, 1:47pm

Hi All,

I've only recently started coding and am totally stuck!

I have a large list (Large.List.Df) that consists of 50+ arrays (each with 1000+ rows and 5+ columns). These arrays are all listed in double square brackets (e.g. [[A]] ) in a drop down menu when you open the dataframe Large.List.Df

I would like to use group_by() on name.of.column in each of the 50+ arrays so that I can mutate(name.of.new.column = 1:n()). I have used this combination of group_by(name.of.column) and mutate(name.of.new.column = 1:n()) on a normal dataframe (so just one element of the large list) and it works perfectly. But, if I run:

NewDf <- Large.List.Df %>% group_by(name.of.column) %>% mutate(name.of.new.column = 1:n())

I get the following error message:

Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "list"

I hope this all makes sense. I would be very grateful for any suggestions, advice, help, etc!

Thanks!

FJCC · September 16, 2019, 2:05pm

Are you looking for something like the following? If not, please post an example of your data as a Reproducible Example.

FAQ: How to do a minimal reproducible example (reprex) for beginners

library(purrr)
#> Warning: package 'purrr' was built under R version 3.5.3
library(dplyr)

LIST <- list(A = data.frame(D = 1:6, B = rep(LETTERS[1:3], 2)),
             C = data.frame(E = 2:7, B = rep(LETTERS[1:3], 2)))
LIST
#> $A
#>   D B
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 4 A
#> 5 5 B
#> 6 6 C
#> 
#> $C
#>   E B
#> 1 2 A
#> 2 3 B
#> 3 4 C
#> 4 5 A
#> 5 6 B
#> 6 7 C

MyFunc <- function(DF) {
  DF %>% group_by(B) %>% 
    mutate(NewCol = 1:n()) %>%
    arrange(B)
}

LIST2 <- map(LIST, MyFunc)
LIST2
#> $A
#> # A tibble: 6 x 3
#> # Groups:   B [3]
#>       D B     NewCol
#>   <int> <fct>  <int>
#> 1     1 A          1
#> 2     4 A          2
#> 3     2 B          1
#> 4     5 B          2
#> 5     3 C          1
#> 6     6 C          2
#> 
#> $C
#> # A tibble: 6 x 3
#> # Groups:   B [3]
#>       E B     NewCol
#>   <int> <fct>  <int>
#> 1     2 A          1
#> 2     5 A          2
#> 3     3 B          1
#> 4     6 B          2
#> 5     4 C          1
#> 6     7 C          2

Created on 2019-09-16 by the reprex package (v0.2.1)
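A possible variation on the function above, offered as a sketch rather than a correction: `row_number()` can stand in for `1:n()`, and `ungroup()` drops the grouping once the new column exists. `MyFunc2` and `LIST3` are names made up for this illustration. The one concrete difference is that `1:n()` evaluates to `c(1, 0)` when a group has zero rows, whereas `row_number()` returns an empty vector; whether that matters depends on your data.

```r
library(dplyr)
library(purrr)

# Same toy data as in the example above
LIST <- list(A = data.frame(D = 1:6, B = rep(LETTERS[1:3], 2)),
             C = data.frame(E = 2:7, B = rep(LETTERS[1:3], 2)))

# Hypothetical variant of MyFunc: row_number() instead of 1:n(),
# and ungroup() so that later steps are not silently done per group
MyFunc2 <- function(DF) {
  DF %>%
    group_by(B) %>%
    mutate(NewCol = row_number()) %>%
    ungroup() %>%
    arrange(B)
}

# Apply the variant to every data frame in the list
LIST3 <- map(LIST, MyFunc2)
```

The values in `NewCol` are the same as before; only the `# Groups:` line disappears from the printed tibbles.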

NuKeRiSk · September 16, 2019, 2:27pm

This was exactly what I needed - thank you so very much @FJCC !!! Please could you explain what the function(DF) does? I'm a total newbie and haven't actually written any functions yet.

Many thanks,
N

FJCC · September 16, 2019, 4:51pm

MyFunc <- function(DF) {
  DF %>% group_by(B) %>% 
    mutate(NewCol = 1:n()) %>%
    arrange(B)
}

The above part of my code defines a new function that takes one argument named DF. It processes DF through the steps within the braces (grouping by B, adding NewCol with mutate(), and sorting by B) and then returns the result of that process. It would have been clearer if I had written

MyFunc <- function(DF) {
  tmp <- DF %>% group_by(B) %>% 
    mutate(NewCol = 1:n()) %>%
    arrange(B)

  return(tmp)
}
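To make the point about returning values concrete, here is a small sketch with two toy functions (hypothetical names, not part of the code above). In R, a function returns the value of the last expression it evaluates, so the explicit `return()` is optional.

```r
# Implicit return: the value of the last expression is what comes back
add_one_implicit <- function(x) {
  x + 1
}

# Explicit return: same behaviour, spelled out step by step
add_one_explicit <- function(x) {
  result <- x + 1
  return(result)
}

# Both calls give the same value
a <- add_one_implicit(2)
b <- add_one_explicit(2)
identical(a, b)
```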

Once MyFunc is defined, I can pass a data frame that has a column named B into it and get back a data frame with the additional NewCol. Below is an example of MyFunc acting on the first element of the LIST I defined in my previous post.

library(dplyr)

LIST <- list(A = data.frame(D = 1:6, B = rep(LETTERS[1:3], 2)),
             C = data.frame(E = 2:7, B = rep(LETTERS[1:3], 2)))
LIST
#> $A
#>   D B
#> 1 1 A
#> 2 2 B
#> 3 3 C
#> 4 4 A
#> 5 5 B
#> 6 6 C
#> 
#> $C
#>   E B
#> 1 2 A
#> 2 3 B
#> 3 4 C
#> 4 5 A
#> 5 6 B
#> 6 7 C

MyFunc <- function(DF) {
  DF %>% group_by(B) %>% 
    mutate(NewCol = 1:n()) %>%
    arrange(B)
}

subList <- MyFunc(LIST[[1]])
subList
#> # A tibble: 6 x 3
#> # Groups:   B [3]
#>       D B     NewCol
#>   <int> <fct>  <int>
#> 1     1 A          1
#> 2     4 A          2
#> 3     2 B          1
#> 4     5 B          2
#> 5     3 C          1
#> 6     6 C          2

Created on 2019-09-16 by the reprex package (v0.2.1)

MyFunc is no different than a standard R function like mean() that returns the average of whatever is passed to it, except that MyFunc is very simple, with no error handling or flexibility.
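As a rough idea of what "error handling and flexibility" could look like, here is a sketch of a more general version. The function name `add_group_index` and its arguments `group_col` and `new_col` are invented for this example, and it assumes dplyr 1.0.0 or later, where `across()` and `all_of()` are available.

```r
library(dplyr)

# Hypothetical generalisation of MyFunc: the grouping column and the name of
# the new column are passed in as strings. Assumes dplyr >= 1.0.0.
add_group_index <- function(DF, group_col, new_col = "NewCol") {
  # Fail early with a readable message if the input is not what we expect
  if (!is.data.frame(DF)) {
    stop("DF must be a data frame, not an object of class ", class(DF)[1])
  }
  if (!group_col %in% names(DF)) {
    stop("Column '", group_col, "' not found in DF")
  }

  DF %>%
    group_by(across(all_of(group_col))) %>%   # group by the named column
    mutate(!!new_col := row_number()) %>%     # counter that restarts per group
    ungroup() %>%
    arrange(across(all_of(group_col)))        # sort by the named column
}

# Usage on a data frame shaped like the first element of LIST
DF <- data.frame(D = 1:6, B = rep(LETTERS[1:3], 2))
res <- add_group_index(DF, "B", new_col = "idx")
```

With a function like this, the same code can be reused for lists whose grouping column is not called B, for example `map(LIST, add_group_index, group_col = "B")`.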

I coupled MyFunc with map(). map() applies the function given as its second argument to each element of the list given as its first argument, and returns the results as a list of the same length.
The call

map(LIST, MyFunc)

just acts on each element of LIST with MyFunc.
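If it helps to see what map() is doing, the sketch below shows the same operation written three ways: with map(), with base R's lapply(), and with an explicit for loop. It also shows why the original attempt failed: the list itself has class "list", which group_by() has no method for, while each element is a data frame. The names `res_map`, `res_lapply` and `res_loop` are made up for this illustration.

```r
library(dplyr)
library(purrr)

LIST <- list(A = data.frame(D = 1:6, B = rep(LETTERS[1:3], 2)),
             C = data.frame(E = 2:7, B = rep(LETTERS[1:3], 2)))

MyFunc <- function(DF) {
  DF %>% group_by(B) %>%
    mutate(NewCol = 1:n()) %>%
    arrange(B)
}

# The container is a plain list; only its elements are data frames
class(LIST)        # "list"
class(LIST[[1]])   # "data.frame"

# 1. purrr
res_map <- map(LIST, MyFunc)

# 2. base R equivalent
res_lapply <- lapply(LIST, MyFunc)

# 3. explicit loop, roughly what the two calls above do for you
res_loop <- vector("list", length(LIST))
names(res_loop) <- names(LIST)
for (i in seq_along(LIST)) {
  res_loop[[i]] <- MyFunc(LIST[[i]])
}
```

All three give a list with one processed data frame per element of LIST, under the same names.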

NuKeRiSk · September 16, 2019, 8:41pm

This is such a clear explanation - thank you so much @FJCC !
