Wide format - filter by date and aggregate

Jakub_Komarek · August 2, 2022, 4:52pm

Hi,

I've got data in the wide format such as:

df <- structure(list(ID = c(53155, 53163, 53198, 54771, 55069, 55107, 55166, 55301, 55468, 56324), 
                     `A_from` = c("01/10/2019",NA, "01/10/2016", NA, NA, "01/10/2020", NA, NA, "01/10/2022", NA), 
                     `A_until` = c("01/10/2021", NA, "30/06/2018", NA, NA, "01/10/2022", NA, NA, "01/10/2024", NA), 
                     `A_value` = c(35, NA, 0, NA, NA, 40, NA, NA, 10, NA),
                     `B_from` = c(NA, NA, "01/09/2018", NA, NA, NA, NA, NA, NA, "01/10/2020"), 
                     `B_until` = c(NA, NA, "30/11/2022", NA, NA, NA, NA, NA, NA, "01/10/2022"), 
                     `B_value` = c(NA,NA, 20, NA, NA, NA, NA, NA, NA, 40)), 
                      row.names = c(NA, -10L), 
                      class = c("tbl_df", "tbl", "data.frame"))

which are divided into two 'sections' 'A' and 'B'. From both sections, I would like to keep data which were at least one day 'valid' for year 2021. Then, I would like to sum A_value and B_value. What steps would you suggest? Note, I have more tahn 2 'sections' and each 'section' has more than one 'value'.

many thanks,

Jakub

Ven · August 2, 2022, 9:02pm

Do you intend to sum both A_value and B_value if either A or B contains a date in their respective columns?

Example:

      ID A_from     A_until    A_value B_from    B_until B_value   sum
   <dbl> <chr>      <chr>        <dbl> <chr>     <chr>     <dbl> <dbl>
 1 53155 01/10/2019 01/10/2021      35 NA        NA           NA    35
 2 53163 NA         NA              NA NA        NA           NA    NA
 3 53198 01/10/2016 30/06/2018       0 01/09/20~ 30/11/~      20    20
 4 54771 NA         NA              NA NA        NA           NA    NA
 5 55069 NA         NA              NA NA        NA           NA    NA
 6 55107 01/10/2020 01/10/2022      40 NA        NA           NA    40
 7 55166 NA         NA              NA NA        NA           NA    NA
 8 55301 NA         NA              NA NA        NA           NA    NA
 9 55468 01/10/2022 01/10/2014      10 NA        NA           NA    10
10 56324 NA         NA              NA 01/10/20~ 01/10/~      40    40

In any case;

library(dplyr)

df |>
  mutate(sum = rowSums(select(df[-1], ends_with("_value")), na.rm = TRUE))

# A tibble: 10 x 8
      ID A_from     A_until    A_value B_from    B_until B_value   sum
   <dbl> <chr>      <chr>        <dbl> <chr>     <chr>     <dbl> <dbl>
 1 53155 01/10/2019 01/10/2021      35 NA        NA           NA    35
 2 53163 NA         NA              NA NA        NA           NA     0
 3 53198 01/10/2016 30/06/2018       0 01/09/20~ 30/11/~      20    20
 4 54771 NA         NA              NA NA        NA           NA     0
 5 55069 NA         NA              NA NA        NA           NA     0
 6 55107 01/10/2020 01/10/2022      40 NA        NA           NA    40
 7 55166 NA         NA              NA NA        NA           NA     0
 8 55301 NA         NA              NA NA        NA           NA     0
 9 55468 01/10/2022 01/10/2014      10 NA        NA           NA    10
10 56324 NA         NA              NA 01/10/20~ 01/10/~      40    40

Here's another method in data.table

library(data.table)

dt[, sum := rowSums(.SD, na.rm = TRUE), .SDcols = c(4, 7)]

       ID     A_from    A_until A_value     B_from    B_until B_value sum
 1: 53155 01/10/2019 01/10/2021      35       <NA>       <NA>      NA  35
 2: 53163       <NA>       <NA>      NA       <NA>       <NA>      NA   0
 3: 53198 01/10/2016 30/06/2018       0 01/09/2018 30/11/2020      20  20
 4: 54771       <NA>       <NA>      NA       <NA>       <NA>      NA   0
 5: 55069       <NA>       <NA>      NA       <NA>       <NA>      NA   0
 6: 55107 01/10/2020 01/10/2022      40       <NA>       <NA>      NA  40
 7: 55166       <NA>       <NA>      NA       <NA>       <NA>      NA   0
 8: 55301       <NA>       <NA>      NA       <NA>       <NA>      NA   0
 9: 55468 01/10/2022 01/10/2014      10       <NA>       <NA>      NA  10
10: 56324       <NA>       <NA>      NA 01/10/2020 01/10/2022      40  40

Jakub_Komarek · August 3, 2022, 6:53am

Hi,

Yes, I intend to sum both A_value and B-Value if either A or B contains a date in their respective columns but also filter values for which the period x_from <-> x_until includes at least one day in 2021.

In this case, I would sum 1st and 6th column of A_value and 10th value of B_value.

Many thanks,

Jakub

Ven · August 3, 2022, 11:30pm

I'm not experienced in dplyr to provide an answer. However, in data.table it can be completed with the following.

dt[A_from < '2022-01-01' & A_until >= '2021-01-01' |
     B_from < '2022-01-01' & B_until >= '2021-01-01',
   sum := rowSums(.SD, na.rm = TRUE), .SDcols = c(4, 7)]

       ID     A_from    A_until A_value     B_from    B_until B_value sum
 1: 53155 2019-10-01 2021-10-01      35       <NA>       <NA>      NA  35
 2: 53163       <NA>       <NA>      NA       <NA>       <NA>      NA  NA
 3: 53198 2016-10-01 2018-06-30       0 2018-09-01 2020-11-30      20  NA
 4: 54771       <NA>       <NA>      NA       <NA>       <NA>      NA  NA
 5: 55069       <NA>       <NA>      NA       <NA>       <NA>      NA  NA
 6: 55107 2020-10-01 2022-10-01      40       <NA>       <NA>      NA  40
 7: 55166       <NA>       <NA>      NA       <NA>       <NA>      NA  NA
 8: 55301       <NA>       <NA>      NA       <NA>       <NA>      NA  NA
 9: 55468 2022-10-01 2024-10-01      10       <NA>       <NA>      NA  NA
10: 56324       <NA>       <NA>      NA 2020-10-01 2022-10-01      40  40

Jakub_Komarek · August 4, 2022, 3:03pm

Hi,

many thanks for your answer. I corrected the example of my data. Your solution would not work for the third columns in which A_value does not take place in 2021 (and, thus, should be omitted) but B_value takes place in 2021 (and, thus, should be included).

J

system · August 25, 2022, 3:04pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.