Summarising column keeping other information

Good morning
I have a data frame similar to this one

tibble(year = c(rep(2012, 5), rep(2013, 5)),
month = c("April", "April", "June", "June", "July", "April", "April", "April", "August", "August"),
sampler_type = c("A", "A", "B", "B", "B", "A", "A", "A", "B", "B"),
city = c("city_1", "city_1","city_2","city_2","city_3","city_1","city_1","city_2","city_3","city_3"),
var_1 = c(0,0,0,0,150, 325, 523,335, 25, 36),
var_2 = c(122, 12, 654, 321, 321, 0, 0, 0, 0,0),
var_3 = c(0, 0, 0, 321, 154, 214, 214, 0, 0, 0),
var_4 = c(123, 245, 21, 0, 124, 321, 0, 231, 214, 0),
var_5 = c(213, 142, 0, 0, 132, 31, 54, 35, 54, 23))

I would like to summarise data by summing up the rows which have the same month, from the same place and same year, also keeping the info about the "sampler_type" variable.

Thanks for the help

library(dplyr)

t1 <- tibble(year = c(rep(2012, 5), rep(2013, 5)),
       month = c("April", "April", "June", "June", "July", "April", "April", "April", "August", "August"),
       sampler_type = c("A", "A", "B", "B", "B", "A", "A", "A", "B", "B"),
       city = c("city_1", "city_1","city_2","city_2","city_3","city_1","city_1","city_2","city_3","city_3"),
       var_1 = c(0,0,0,0,150, 325, 523,335, 25, 36),
       var_2 = c(122, 12, 654, 321, 321, 0, 0, 0, 0,0),
       var_3 = c(0, 0, 0, 321, 154, 214, 214, 0, 0, 0),
       var_4 = c(123, 245, 21, 0, 124, 321, 0, 231, 214, 0),
       var_5 = c(213, 142, 0, 0, 132, 31, 54, 35, 54, 23))


group_by(t1,
         year,month,sampler_type,city) %>% summarise_all(sum)

I found 'keeping the info about the sample_type' ambigious, I interpreted it to mean its just another thing to group by.

1 Like

Thanks for your answer.
I tried your code on the database I'm working on but R takes a lot of time to process the operation without giving any result. It is a table with more than 12000 columns and more than 1000 rows.
I've tried the same code on a subset of 10 rows and it worked.
I uploaded the file with the fread () function from data.table package but I'm not able to compute this operation...

Is there a way to speed up this operation?

I'd be suprised if the fastest way wasnt to query it on the database directly.
You can try dbplyr for that https://cran.r-project.org/web/packages/dbplyr/vignettes/dbplyr.html
or like you said, learn the data.table syntax and summarise with that.

As an inbetween dtplyr lets you write dplyr code and get data.table performance.

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)


t1 <- tibble(year = c(rep(2012, 5), rep(2013, 5)),
             month = c("April", "April", "June", "June", "July", "April", "April", "April", "August", "August"),
             sampler_type = c("A", "A", "B", "B", "B", "A", "A", "A", "B", "B"),
             city = c("city_1", "city_1","city_2","city_2","city_3","city_1","city_1","city_2","city_3","city_3"),
             var_1 = c(0,0,0,0,150, 325, 523,335, 25, 36),
             var_2 = c(122, 12, 654, 321, 321, 0, 0, 0, 0,0),
             var_3 = c(0, 0, 0, 321, 154, 214, 214, 0, 0, 0),
             var_4 = c(123, 245, 21, 0, 124, 321, 0, 231, 214, 0),
             var_5 = c(213, 142, 0, 0, 132, 31, 54, 35, 54, 23))

lazy_dt(t1) %>% group_by(year,month,sampler_type,city) %>% summarise_all(sum)
1 Like

I've tried the new code and R was able to compute the operation.
Thank you so much for the help.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.