Filter out rows in dplyr adjusting dynamically to the current date (as Year Month)

I have a dataframe containing - for each monthly cohort of patients - the proportion of patients retained in the study after n months.

    df <- tibble::tribble(
              ~cohort, ~months_elapsed, ~act_patients,
            "2020-06",               1,        100,
            "2020-05",               1,        100,
            "2020-05",               2,         80,
            "2020-05",               3,         20,
            "2020-04",               1,        100,
            "2020-04",               2,         80,
            "2020-04",               3,         50,
            "2020-04",               4,         10
            )

    # A tibble: 8 x 3
      cohort  months_elapsed act_patients
      <chr>            <dbl>     <dbl>
    1 2020-06              1       100
    2 2020-05              1       100
    3 2020-05              2        80
    4 2020-05              3        20
    5 2020-04              1       100
    6 2020-04              2        80
    7 2020-04              3        50
    8 2020-04              4        10

Unfortunately, the dataframe contains inconsistent rows I want to clean out as follows:

Let's imagine we are at the end of the month of June 2020.

  • The cohort 2020-06, after the current month has elapsed, has the totality (100%) of the patients active in the study.

  • The cohort of 2020-05, had 100% of the patients after a month elapsed, 80% of the patients after the second month elapsed. But here I have an extra row, containing the % of active patients of a month that didn't elapse yet (the 3rd month). Being at the end of June 2020, only 2 months have fully elapsed (May and June)

  • Same for the cohort of 2020-04, I have an extra row (the 4th month) I want to clean out: the cohort of April 2020, at the end of June, had just 3 possible months that elapsed (April, May, June).

The final - clean - dataframe I want to have is:

df_clean <- tibble::tribble(
                ~cohort, ~months_elapsed, ~act_patients,
              "2020-06",               1,        100,
              "2020-05",               1,        100,
              "2020-05",               2,         80,
              "2020-04",               1,        100,
              "2020-04",               2,         80,
              "2020-04",               3,         50
              )

# A tibble: 6 x 3
  cohort  months_elapsed act_patients
  <chr>            <dbl>     <dbl>
1 2020-06              1       100
2 2020-05              1       100
3 2020-05              2        80
4 2020-04              1       100
5 2020-04              2        80
6 2020-04              3        50

I am looking for a possible solution in dplyr, any help is highly appreciated!!

I'm not sure if I totally understand the logic of what you are trying to achieve, but does this work for you or does the solution need to involve the number of months between the current date at the cohort start?

library(tidyverse)
df <- tribble(
  ~cohort, ~months_elapsed, ~act_patients,
  "2020-06",               1,        100,
  "2020-05",               1,        100,
  "2020-05",               2,         80,
  "2020-05",               3,         20,
  "2020-04",               1,        100,
  "2020-04",               2,         80,
  "2020-04",               3,         50,
  "2020-04",               4,         10
  )

 df %>% 
   group_by(cohort) %>% 
   filter(months_elapsed == 1 | months_elapsed != max(months_elapsed)) %>% 
   ungroup()
#> # A tibble: 6 x 3
#>   cohort  months_elapsed act_patients
#>   <chr>            <dbl>        <dbl>
#> 1 2020-06              1          100
#> 2 2020-05              1          100
#> 3 2020-05              2           80
#> 4 2020-04              1          100
#> 5 2020-04              2           80
#> 6 2020-04              3           50

Created on 2020-06-05 by the reprex package (v0.3.0)

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.