dplyr - summarise with condition

john22 · April 4, 2021, 12:19pm

I have this data frame:

library(dplyr)
library(tidyr)

data <- tribble(
  ~Date, ~A1, ~A2,~B1,~B2,
  as.Date("2019-01-01"), 20, 10,20, 10,
  as.Date("2019-01-01"), 20 ,5,20,5,
  as.Date("2019-01-01"), 10, 2,10,20,
  as.Date("2019-01-01"), 20, 60,0,0,
  as.Date("2019-01-01"), 30, 4,20,5,
  as.Date("2019-02-01"), 0, 0,16,8,
  as.Date("2019-02-01"), 0, 0,0,40,
  as.Date("2019-02-01"), 0, 0,4,2,
  as.Date("2019-02-01"), 4, 8,10,6,
  as.Date("2019-02-01"), 6, 3,0,0,
  as.Date("2019-03-01"), 20, 8,23,9,
  as.Date("2019-03-01"), 60, 4,0,0,
  as.Date("2019-03-01"), 4, 2,8,3,
  as.Date("2019-03-01"), 0, 6,10,0
)

For each date I want to calculate the mean of (A1-B1) over the rows where A1>=B1, A1>0 and B1>0, and the mean of (A2-B2) over the rows where A2>=B2, A2>0 and B2>0. This is what I have so far:

data_mean = data %>%
    group_by(Date) %>%
    dplyr::summarise(
      mean_1 = mean(A1[A1>=B1 & A1>0 & B1>0] - B1[A1>=B1 & A1>0 & B1>0]),
      mean_2 = mean(A2[A2>=B2 & A2>0 & B2>0] - B2[A2>=B2 & A2>0 & B2>0]))

Is there a way to use the filter function together with the summarise function? Or is there a smarter way to write this code?
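One possible way to combine filter with summarise, sketched here under an assumption that is not stated in the thread: that the real column names follow a "letters plus pair number" pattern like A1/B1 and A2/B2. The idea is to reshape with tidyr::pivot_longer so each A/B pair becomes its own row, and then a single filter applies to every pair. The names data_long, pair, mean_diff and data_mean_long are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same data as in the question.
data <- tibble::tribble(
  ~Date, ~A1, ~A2, ~B1, ~B2,
  as.Date("2019-01-01"), 20, 10, 20, 10,
  as.Date("2019-01-01"), 20, 5, 20, 5,
  as.Date("2019-01-01"), 10, 2, 10, 20,
  as.Date("2019-01-01"), 20, 60, 0, 0,
  as.Date("2019-01-01"), 30, 4, 20, 5,
  as.Date("2019-02-01"), 0, 0, 16, 8,
  as.Date("2019-02-01"), 0, 0, 0, 40,
  as.Date("2019-02-01"), 0, 0, 4, 2,
  as.Date("2019-02-01"), 4, 8, 10, 6,
  as.Date("2019-02-01"), 6, 3, 0, 0,
  as.Date("2019-03-01"), 20, 8, 23, 9,
  as.Date("2019-03-01"), 60, 4, 0, 0,
  as.Date("2019-03-01"), 4, 2, 8, 3,
  as.Date("2019-03-01"), 0, 6, 10, 0
)

# Reshape so that each (A, B) pair is one row: columns Date, pair, A, B.
# Assumption: names are uppercase letters followed by the pair number.
data_long <- data %>%
  pivot_longer(
    cols = -Date,
    names_to = c(".value", "pair"),
    names_pattern = "([A-Z]+)(\\d+)"
  )

# The condition is now written once and applied to every pair.
data_mean_long <- data_long %>%
  filter(A >= B, A > 0, B > 0) %>%
  group_by(Date, pair) %>%
  summarise(mean_diff = mean(A - B), .groups = "drop")
```

Note that the result is in long form (one row per Date and pair), and that a Date/pair combination with no qualifying rows is dropped entirely, whereas subsetting inside summarise returns NaN for it.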

mishabalyasin · April 4, 2021, 5:54pm

But your solution does exactly what you want already?

What is missing from it that you would like to have?

john22 · April 4, 2021, 5:57pm

The solution does what I want, but it's not very efficient. The column names in my real data are very long, and the code becomes very long if I write out all the conditions with all the column names. Maybe there are more efficient ways to write this code.

mishabalyasin · April 4, 2021, 6:31pm

Would this help a bit?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

data <- tribble(
  ~Date, ~A1, ~A2,~B1,~B2,
  as.Date("2019-01-01"), 20, 10,20, 10,
  as.Date("2019-01-01"), 20 ,5,20,5,
  as.Date("2019-01-01"), 10, 2,10,20,
  as.Date("2019-01-01"), 20, 60,0,0,
  as.Date("2019-01-01"), 30, 4,20,5,
  as.Date("2019-02-01"), 0, 0,16,8,
  as.Date("2019-02-01"), 0, 0,0,40,
  as.Date("2019-02-01"), 0, 0,4,2,
  as.Date("2019-02-01"), 4, 8,10,6,
  as.Date("2019-02-01"), 6, 3,0,0,
  as.Date("2019-03-01"), 20, 8,23,9,
  as.Date("2019-03-01"), 60, 4,0,0,
  as.Date("2019-03-01"), 4, 2,8,3,
  as.Date("2019-03-01"), 0, 6,10,0
)

data_mean = data %>%
  dplyr::mutate(condition_1 = A1>=B1 & A1>0 & B1>0,
                condition_2 = A2>=B2 & A2>0 & B2>0) %>%
  group_by(Date) %>%
  dplyr::summarise(
    mean_1 = mean(A1[condition_1] - B1[condition_1]),
    mean_2 = mean(A2[condition_2] - B2[condition_2]))

Created on 2021-04-04 by the reprex package (v2.0.0)
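If the goal is mainly to avoid typing the condition and the column names more than once, another option might be to wrap the logic in a small function. This is only a sketch: mean_pos_diff is a hypothetical helper name, not something from dplyr or from the thread.

```r
library(dplyr)

# Same data as above.
data <- tibble::tribble(
  ~Date, ~A1, ~A2, ~B1, ~B2,
  as.Date("2019-01-01"), 20, 10, 20, 10,
  as.Date("2019-01-01"), 20, 5, 20, 5,
  as.Date("2019-01-01"), 10, 2, 10, 20,
  as.Date("2019-01-01"), 20, 60, 0, 0,
  as.Date("2019-01-01"), 30, 4, 20, 5,
  as.Date("2019-02-01"), 0, 0, 16, 8,
  as.Date("2019-02-01"), 0, 0, 0, 40,
  as.Date("2019-02-01"), 0, 0, 4, 2,
  as.Date("2019-02-01"), 4, 8, 10, 6,
  as.Date("2019-02-01"), 6, 3, 0, 0,
  as.Date("2019-03-01"), 20, 8, 23, 9,
  as.Date("2019-03-01"), 60, 4, 0, 0,
  as.Date("2019-03-01"), 4, 2, 8, 3,
  as.Date("2019-03-01"), 0, 6, 10, 0
)

# Hypothetical helper: mean of (a - b) over the elements where
# a >= b and both values are positive.
mean_pos_diff <- function(a, b) {
  keep <- a >= b & a > 0 & b > 0
  mean(a[keep] - b[keep])
}

# Each column name now appears exactly once per summary.
data_mean <- data %>%
  group_by(Date) %>%
  summarise(
    mean_1 = mean_pos_diff(A1, B1),
    mean_2 = mean_pos_diff(A2, B2)
  )
```

As with the versions above, a date with no qualifying rows gives NaN, because mean() of an empty vector is NaN. With this data that happens for mean_1 on 2019-02-01 and for both means on 2019-03-01.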
