Add a cumulative column according to condition

Aloysio · April 18, 2020, 4:43pm

Hello everyone, thank you in advance. I have these two data frames:
# A tibble: 214 x 4
# Groups:   provincia_deteccion [16]
   fecha      provincia_deteccion     n    fa
   <dttm>     <chr>               <int> <int>
 1 2020-03-11 Sancti Spíritus         3     3
 2 2020-03-13 Villa Clara             1     1
 3 2020-03-16 La Habana               1     1
 4 2020-03-17 Camagüey                1     1
 5 2020-03-17 La Habana               1     2
 6 2020-03-18 Holguín                 1     1
 7 2020-03-18 Sancti Spíritus         1     4
 8 2020-03-18 Villa Clara             2     3
 9 2020-03-19 Granma                  1     1
10 2020-03-19 Holguín                 1     2
# ... with 204 more rows
where n is the number of reported cases and fa is the cumulative frequency of cases.

and this other one:
b
# A tibble: 24 x 4
# Groups:   provincia_deteccion [7]
   fecha               provincia_deteccion fallecidos faa_fallecidos
   <dttm>              <chr>                    <int>          <int>
 1 2020-03-12 00:00:00 Sancti Spíritus              1              1
 2 2020-03-26 00:00:00 La Habana                    1              1
 3 2020-03-28 00:00:00 Villa Clara                  1              1
 4 2020-03-29 00:00:00 Ciego de Ávila               1              1
 5 2020-03-29 00:00:00 La Habana                    1              2
 6 2020-03-30 00:00:00 La Habana                    1              3
 7 2020-04-04 00:00:00 Isla de la Juventud          1              1
 8 2020-04-04 00:00:00 Sancti Spíritus              1              2
 9 2020-04-05 00:00:00 La Habana                    1              4
10 2020-04-06 00:00:00 La Habana                    1              5

where faa_fallecidos is the cumulative frequency of deaths.

I need to add to data frame (b) a column containing (fa) from data frame (a), matched by date and province. I tried to do it with left_join(), but it did not work: when a date in (b) has no matching date in (a), nothing is added for that row. For example:

prueba <- b %>% left_join(a)

Result:
   fecha               provincia_deteccion fallecidos faa_fallecidos     n    fa
   <dttm>              <chr>                    <int>          <int> <int> <int>
 1 2020-03-18 00:00:00 Sancti Spíritus              1              1     1     4
 2 2020-03-26 00:00:00 La Habana                    1              1     2    25
 3 2020-03-28 00:00:00 Villa Clara                  1              1    NA    NA
 4 2020-03-29 00:00:00 Ciego de Ávila               1              1     4    15
 5 2020-03-29 00:00:00 La Habana                    1              2     8    42
 6 2020-03-30 00:00:00 La Habana                    1              3     8    50
 7 2020-04-04 00:00:00 Isla de la Juventud          1              1     1     1
 8 2020-04-04 00:00:00 Sancti Spíritus              1              2     1    18
 9 2020-04-05 00:00:00 La Habana                    1              4    15   109
10 2020-04-06 00:00:00 La Habana                    1              5     6   115
By that time "Villa Clara" and "Isla de la Juventud" already had many accumulated cases.
Thank you for allowing us to express ourselves in this forum.

siddharthprabhu · April 18, 2020, 6:41pm

Hi @Aloysio, welcome to RStudio Community.

I would do it like this:

library(dplyr, warn.conflicts = FALSE)

a <- tribble(~ fecha, ~ provincia_deteccion, ~ n, ~ fa, 
             "2020-03-11", "Sancti Spíritus", 3L, 3L, 
             "2020-03-13", "Villa Clara", 1L, 1L, 
             "2020-03-16", "La Habana", 1L, 1L,
             "2020-03-17", "Camagüey", 1L, 1L,
             "2020-03-17", "La Habana", 1L, 2L,
             "2020-03-18", "Holguín", 1L, 1L)

a <- mutate(a, fecha = as.POSIXct(fecha))

b <- tribble(~ fecha, ~ provincia_deteccion, ~  fallecidos, ~ faa_fallecidos, 
             "2020-03-12 00:00:00", "Sancti Spíritus", 1L, 1L, 
             "2020-03-26 00:00:00", "La Habana", 1L, 1L, 
             "2020-03-28 00:00:00", "Villa Clara", 1L, 1L,
             "2020-03-29 00:00:00", "Ciego de Ávila", 1L, 1L,
             "2020-03-29 00:00:00", "La Habana", 1L, 2L,
             "2020-03-30 00:00:00", "La Habana", 1L, 3L)

b <- mutate(b, fecha = as.POSIXct(fecha))

a %>% 
  rename(fallecidos = n, faa_fallecidos = fa) %>% 
  bind_rows(b) %>% 
  group_by(fecha, provincia_deteccion) %>% 
  summarize(fallecidos = sum(fallecidos)) %>% 
  group_by(provincia_deteccion) %>% 
  mutate(faa_fallecidos = cumsum(fallecidos)) %>% 
  ungroup()
#> # A tibble: 12 x 4
#>    fecha               provincia_deteccion fallecidos faa_fallecidos
#>    <dttm>              <chr>                    <int>          <int>
#>  1 2020-03-11 00:00:00 Sancti Spíritus              3              3
#>  2 2020-03-12 00:00:00 Sancti Spíritus              1              4
#>  3 2020-03-13 00:00:00 Villa Clara                  1              1
#>  4 2020-03-16 00:00:00 La Habana                    1              1
#>  5 2020-03-17 00:00:00 Camagüey                     1              1
#>  6 2020-03-17 00:00:00 La Habana                    1              2
#>  7 2020-03-18 00:00:00 Holguín                      1              1
#>  8 2020-03-26 00:00:00 La Habana                    1              3
#>  9 2020-03-28 00:00:00 Villa Clara                  1              2
#> 10 2020-03-29 00:00:00 Ciego de Ávila               1              1
#> 11 2020-03-29 00:00:00 La Habana                    1              4
#> 12 2020-03-30 00:00:00 La Habana                    1              5

Created on 2020-04-19 by the reprex package (v0.3.0)

Aloysio · April 18, 2020, 8:12pm

Hello @siddharthprabhu. Thank you very much for your reply. Sorry I didn't explain myself well. The problem is the following:
I have obtained (a) from:

a <- mortalidad %>%
  group_by(fecha, provincia_deteccion) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  group_by(provincia_deteccion) %>%
  mutate(fa = cumsum(n))
# A tibble: 214 x 4
# Groups:   provincia_deteccion [16]
   fecha               provincia_deteccion     n    fa
   <dttm>              <chr>               <int> <int>
 1 2020-03-11 00:00:00 Sancti Spíritus         3     3
 2 2020-03-13 00:00:00 Villa Clara             1     1
 3 2020-03-16 00:00:00 La Habana               1     1
 4 2020-03-17 00:00:00 Camagüey                1     1
 5 2020-03-17 00:00:00 La Habana               1     2
 6 2020-03-18 00:00:00 Holguín                 1     1
 7 2020-03-18 00:00:00 Sancti Spíritus         1     4
 8 2020-03-18 00:00:00 Villa Clara             2     3
 9 2020-03-19 00:00:00 Granma                  1     1
10 2020-03-19 00:00:00 Holguín                 1     2
# ... with 204 more rows

and (b):

b <- mortalidad %>%
  filter(egresado == "fallecido") %>%
  group_by(fecha_muerte, provincia_deteccion) %>%
  summarise(fallecidos = n()) %>%
  ungroup() %>%
  group_by(provincia_deteccion) %>%
  mutate(faa_fallecidos = cumsum(fallecidos))
b <- b %>% rename(fecha = fecha_muerte)
# A tibble: 24 x 4
# Groups:   provincia_deteccion [7]
   fecha               provincia_deteccion fallecidos faa_fallecidos
   <dttm>              <chr>                    <int>          <int>
 1 2020-03-18 00:00:00 Sancti Spíritus              1              1
 2 2020-03-26 00:00:00 La Habana                    1              1
 3 2020-03-28 00:00:00 Villa Clara                  1              1
 4 2020-03-29 00:00:00 Ciego de Ávila               1              1
 5 2020-03-29 00:00:00 La Habana                    1              2
 6 2020-03-30 00:00:00 La Habana                    1              3
 7 2020-04-04 00:00:00 Isla de la Juventud          1              1
 8 2020-04-04 00:00:00 Sancti Spíritus              1              2
 9 2020-04-05 00:00:00 La Habana                    1              4
10 2020-04-06 00:00:00 La Habana                    1              5
# ... with 14 more rows

In (a), n is the number of cases and fa is the cumulative frequency of cases.
In (b), fallecidos is the number of deaths and faa_fallecidos is the cumulative absolute frequency of deaths.
I need to add a column to (b) holding the cumulative absolute frequency of cases from data frame (a). That is, the 3 accumulated cases for Sancti Spíritus in (a) should appear in (b) on March 18, and so on. When the dates and provinces coincide, joining them with left_join() works without problems. When they do not coincide, the cumulative count is not added.

prueba <- b %>% left_join(a)
# A tibble: 24 x 6
# Groups:   provincia_deteccion [7]
   fecha               provincia_deteccion fallecidos faa_fallecidos     n    fa
   <dttm>              <chr>                    <int>          <int> <int> <int>
 1 2020-03-18 00:00:00 Sancti Spíritus              1              1     1     4
 2 2020-03-26 00:00:00 La Habana                    1              1     2    25
 3 2020-03-28 00:00:00 Villa Clara                  1              1    NA    NA
 4 2020-03-29 00:00:00 Ciego de Ávila               1              1     4    15
 5 2020-03-29 00:00:00 La Habana                    1              2     8    42
 6 2020-03-30 00:00:00 La Habana                    1              3     8    50
 7 2020-04-04 00:00:00 Isla de la Juventud          1              1     1     1
 8 2020-04-04 00:00:00 Sancti Spíritus              1              2     1    18
 9 2020-04-05 00:00:00 La Habana                    1              4    15   109
10 2020-04-06 00:00:00 La Habana                    1              5     6   115
# ... with 14 more rows

Villa Clara had 24 accumulated cases on March 27, but it reported no cases on the 28th, so (a) has no row for that date and the join returns NA instead of 24.

In summary, I need the cumulative frequencies of cases in order to calculate the case fatality rate.
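
For reference, the case fatality rate mentioned here can be sketched as cumulative deaths divided by cumulative cases. The snippet below is only an illustration: the data frame name combinado and the column letalidad are hypothetical, and the single row uses the Villa Clara figures quoted in this post (1 death, 24 accumulated cases).

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical one-row data frame: cumulative deaths and cumulative cases
combinado <- tibble(
  provincia_deteccion = "Villa Clara",
  faa_fallecidos = 1L,  # cumulative deaths
  fa = 24L              # cumulative cases
)

# Case fatality rate as a percentage (letalidad is an illustrative name)
resultado <- combinado %>%
  mutate(letalidad = 100 * faa_fallecidos / fa)

resultado
```

This only works once fa is available on every row of (b), which is exactly what the join needs to provide.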

siddharthprabhu · April 19, 2020, 7:27am

Ah, I think I misunderstood your question. Thank you for the more detailed explanation.

Two things may help here. Firstly, consider using a full_join() instead; that will keep observations from both a and b. This will however result in NA values where fecha and provincia_deteccion do not match. These can be handled with tidyr::fill().

I've made up some data for Villa Clara to illustrate. In future, it would be most helpful if you can post reproducible data.

library(dplyr, warn.conflicts = FALSE)
library(tidyr)

a <- tribble(~ fecha, ~ provincia_deteccion, ~ n, ~ fa, 
             "2020-03-13", "Villa Clara", 1L, 1L,
             "2020-03-18", "Villa Clara", 2L, 3L,
             "2020-03-27", "Villa Clara", 21L, 24L)

a <- mutate(a, fecha = as.POSIXct(fecha))

b <- tribble(~ fecha, ~ provincia_deteccion, ~  fallecidos, ~ faa_fallecidos, 
             "2020-03-28 00:00:00", "Villa Clara", 1L, 1L)

b <- mutate(b, fecha = as.POSIXct(fecha))

b %>% 
  full_join(a) %>% 
  arrange(provincia_deteccion, fecha) %>% 
  group_by(provincia_deteccion) %>% 
  fill(n, .direction = "down") %>% 
  fill(fa, .direction = "down")
#> Joining, by = c("fecha", "provincia_deteccion")
#> # A tibble: 4 x 6
#> # Groups:   provincia_deteccion [1]
#>   fecha               provincia_deteccion fallecidos faa_fallecidos     n    fa
#>   <dttm>              <chr>                    <int>          <int> <int> <int>
#> 1 2020-03-13 00:00:00 Villa Clara                 NA             NA     1     1
#> 2 2020-03-18 00:00:00 Villa Clara                 NA             NA     2     3
#> 3 2020-03-27 00:00:00 Villa Clara                 NA             NA    21    24
#> 4 2020-03-28 00:00:00 Villa Clara                  1              1    21    24

Created on 2020-04-19 by the reprex package (v0.3.0)
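
As a possible alternative to full_join() plus fill(), more recent dplyr releases offer rolling joins. The sketch below assumes dplyr 1.1.0 or later (which postdates this thread) and uses join_by() with closest() to pick, for each row of b, the latest row of a on or before that date within the same province. It reuses the made-up Villa Clara data, with fecha stored as Date for brevity.

```r
library(dplyr, warn.conflicts = FALSE)

a <- tribble(~ fecha, ~ provincia_deteccion, ~ n, ~ fa,
             "2020-03-13", "Villa Clara", 1L, 1L,
             "2020-03-18", "Villa Clara", 2L, 3L,
             "2020-03-27", "Villa Clara", 21L, 24L) %>%
  mutate(fecha = as.Date(fecha))

b <- tribble(~ fecha, ~ provincia_deteccion, ~ fallecidos, ~ faa_fallecidos,
             "2020-03-28", "Villa Clara", 1L, 1L) %>%
  mutate(fecha = as.Date(fecha))

# Rolling join: match on province, then take the closest earlier-or-equal
# date from a (requires dplyr >= 1.1.0)
prueba <- b %>%
  left_join(a, by = join_by(provincia_deteccion,
                            closest(x$fecha >= y$fecha)))

prueba
```

Because the date condition is an inequality, the result should keep the date columns from both inputs, so the matched date from a can be inspected next to the date of death.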

Aloysio · April 19, 2020, 8:42am

@siddharthprabhu, thank you very much for your kindness.
The solution you provided is very useful. But would it be possible, on dates when a province reports new cases but no new deaths, to keep the same cumulative frequency of deaths? Also, would it be possible for provinces that have not reported any deaths up to a given date to show 0 in the variable faa_fallecidos?
How can I send you the original data?
Thank you very much.
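
One way the two requests above might be handled, offered as a sketch rather than a definitive answer: fill faa_fallecidos downward within each province so the last cumulative death count is carried forward, then replace any remaining NA (dates before the first death) with 0. The rows are the Villa Clara values that appear in this thread, and fecha is stored as Date for brevity.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

a <- tribble(~ fecha, ~ provincia_deteccion, ~ n, ~ fa,
             "2020-03-27", "Villa Clara", 12L, 24L,
             "2020-03-29", "Villa Clara", 1L, 25L) %>%
  mutate(fecha = as.Date(fecha))

b <- tribble(~ fecha, ~ provincia_deteccion, ~ fallecidos, ~ faa_fallecidos,
             "2020-03-28", "Villa Clara", 1L, 1L) %>%
  mutate(fecha = as.Date(fecha))

resultado <- b %>%
  full_join(a, by = c("fecha", "provincia_deteccion")) %>%
  arrange(provincia_deteccion, fecha) %>%
  group_by(provincia_deteccion) %>%
  # carry the last known cumulative counts forward within each province
  fill(fa, faa_fallecidos, .direction = "down") %>%
  ungroup() %>%
  # anything still NA lies before the first reported death
  mutate(faa_fallecidos = replace_na(faa_fallecidos, 0L))

resultado
```

Note that n and fallecidos are left as NA on dates with no report; only the cumulative columns are carried forward.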

siddharthprabhu · April 19, 2020, 9:59am

Just a sample of the original data would be sufficient (maybe pick 2-3 provinces). You can generate the sample data by using the following code:

library(dplyr)

provincias <- c("Sancti Spíritus", "Villa Clara") # change if required

a %>% 
  filter(provincia_deteccion %in% provincias) %>% 
  dput()

b %>% 
  filter(provincia_deteccion %in% provincias) %>% 
  dput()

Please post the output of those 2 dput() function calls here. That will generate the code required for me to reproduce your data.
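
To illustrate why dput() output is enough to reproduce data, here is a small sketch with a toy data frame (muestra is a hypothetical name, not the real data): the text that dput() prints is R code, and evaluating it rebuilds the same object.

```r
# Toy data frame (hypothetical; stands in for the filtered a or b)
muestra <- data.frame(provincia_deteccion = "Villa Clara", fa = 24L)

# dput() prints R code that rebuilds the object; capture it as text
codigo <- capture.output(dput(muestra))

# Evaluating that text recreates the object
reconstruido <- eval(parse(text = codigo))
identical(muestra, reconstruido)
```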

Aloysio · April 19, 2020, 10:16am

A million thanks. You have truly helped me a lot.
I have a solution to the problem (thanks to you, of course).
I just added the following to what you sent me:

H[["fallecidos"]][is.na(H[["fallecidos"]])] <- 0
H <- H %>% mutate(acumulado = cumsum(fallecidos))
# Groups:   provincia_deteccion [1]
   fecha               provincia_deteccion fallecidos faa_fallecidos     n    fa acumulado
   <dttm>              <chr>                    <dbl>          <int> <int> <int>     <dbl>
 1 2020-03-13 00:00:00 Villa Clara                  0             NA     1     1         0
 2 2020-03-18 00:00:00 Villa Clara                  0             NA     2     3         0
 3 2020-03-22 00:00:00 Villa Clara                  0             NA     2     5         0
 4 2020-03-23 00:00:00 Villa Clara                  0             NA     1     6         0
 5 2020-03-25 00:00:00 Villa Clara                  0             NA     1     7         0
 6 2020-03-26 00:00:00 Villa Clara                  0             NA     5    12         0
 7 2020-03-27 00:00:00 Villa Clara                  0             NA    12    24         0
 8 2020-03-28 00:00:00 Villa Clara                  1              1    12    24         1
 9 2020-03-29 00:00:00 Villa Clara                  0             NA     1    25         1
10 2020-03-31 00:00:00 Villa Clara                  0             NA     3    28         1

And... problem solved. It may not be the best solution, but it works.
Many thanks for your time. Have a happy day.

siddharthprabhu · April 19, 2020, 10:18am

Good to hear that. If you want to impute zeros for the NA values within the pipe you can use tidyr::replace_na().
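
A minimal sketch of that suggestion, assuming H is the joined data frame from the previous post (only a few toy rows are recreated here): replace_na() sets the missing daily deaths to 0 inside mutate(), and cumsum() then builds the cumulative column in the same pipe.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Toy stand-in for the joined data frame H (hypothetical rows)
H <- tibble(
  provincia_deteccion = "Villa Clara",
  fallecidos = c(NA, NA, 1L, NA)
)

H <- H %>%
  group_by(provincia_deteccion) %>%
  mutate(fallecidos = replace_na(fallecidos, 0L),  # NA -> 0 deaths that day
         acumulado = cumsum(fallecidos)) %>%       # cumulative deaths
  ungroup()

H
```

Using 0L rather than 0 keeps fallecidos as an integer column instead of converting it to double.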

Aloysio · April 19, 2020, 10:23am

@siddharthprabhu
All that remains is to clean up the code a little.
I am satisfied, and I will take your last piece of advice.
Thank you again, and I hope you stay well in these times of Covid-19.

system · April 26, 2020, 10:36am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.