How to delete rows before certain dates

Each row in the Main_table are claims for unique Patient_ID and shows if they are Metastatic or Non_Metastatic. My goal is for each patient delete all claims of Metastatic Drug_Group before the first date of that patient Non_Metastatic claim. However, for the patient with the first claim as Non_metastatic, then, subsequent occurrence of Metastatic should not be deleted.

library(tidyverse)

Main_table <- tibble::tribble(
                ~Patient_ID,        ~Date,      ~Drug_Group,
                         1L, "17/11/2016",     "Metastatic",
                         1L, "18/11/2016",     "Metastatic",
                         1L, "19/11/2016",     "Metastatic",
                         1L, "20/11/2016", "Non_Metastatic",
                         1L, "21/11/2016", "Non_Metastatic",
                         2L, "19/01/2017", "Non_Metastatic",
                         2L, "20/01/2017",     "Metastatic",
                         2L, "21/01/2017", "Non_Metastatic",
                         2L, "22/01/2017",     "Metastatic",
                         2L, "23/01/2017",     "Metastatic"
                )

expected_output <-tibble::tribble(
  ~Patient_ID,        ~Date,      ~Drug_Group,
           1L, "20/11/2016", "Non_Metastatic",
           1L, "21/11/2016", "Non_Metastatic",
           2L, "21/01/2017", "Non_Metastatic",
           2L, "19/01/2017", "Non_Metastatic",
           2L, "20/01/2017",     "Metastatic",
           2L, "21/01/2017", "Non_Metastatic",
           2L, "22/01/2017",     "Metastatic",
           2L, "23/01/2017",     "Metastatic"
  )


I can do for each patient separately. However, I have 100,000 patients and need some clever solution.

Any help would be really appreciated!!!
Thank you

Just to be clear : does the Patient 2, 20/01/2017 should be present in your output because it is after the first Non_Metastatic on the 19/01/2017 ?

Or the rule is to keep only the Metastatic Drug Group after the last Non_Metastatic claim ?

Patient 2, 20/01/2017 should be present in the output.

The rule is to delete all Metastatic Drug Group patients before the first Non_metastatic claim. However, for patient 2 the first claim is Non_metastatic, therefore, subsequent occurrence of Metastatic should not be deleted.

Thank you for picking this up.

I have modified my output to

expected_output <-tibble::tribble(
  ~Patient_ID,        ~Date,      ~Drug_Group,
           1L, "20/11/2016", "Non_Metastatic",
           1L, "21/11/2016", "Non_Metastatic",
           2L, "21/01/2017", "Non_Metastatic",
           2L, "19/01/2017", "Non_Metastatic",
           2L, "20/01/2017",     "Metastatic",
           2L, "21/01/2017", "Non_Metastatic",
           2L, "22/01/2017",     "Metastatic",
           2L, "23/01/2017",     "Metastatic"
  )

Ok so I tried something using a function that would iterate over the drug group and return only true after the first occurrence. You can create a function that modify its behaviour by using a function inside another function. I don't know if it will be performant on a big dataset but it works

library(tidyverse)

Main_table <- tibble::tribble(
  ~Patient_ID,        ~Date,      ~Drug_Group,
  1L, "17/11/2016",     "Metastatic",
  1L, "18/11/2016",     "Metastatic",
  1L, "19/11/2016",     "Metastatic",
  1L, "20/11/2016", "Non_Metastatic",
  1L, "21/11/2016", "Non_Metastatic",
  2L, "19/01/2017", "Non_Metastatic",
  2L, "20/01/2017",     "Metastatic",
  2L, "21/01/2017", "Non_Metastatic",
  2L, "22/01/2017",     "Metastatic",
  2L, "23/01/2017",     "Metastatic"
)
# this function uses a function for which the behavior changes
# depending if we found the first Non_Metastatic
is_metastic_before <- function(drug_group){
  # by default we test. This will be set to false as soon as 
  # we find Non_Metastatic
  to_test <- TRUE
  to_keep <- function(drug) {
    # We test until we find the first Non_Metastatic
    if (to_test) {
      res <- drug == "Non_Metastatic"
      # if we found the first Non_Metastatic
      # we will always return TRUE
      if (res) {
        to_test <<- FALSE
      }
      return(res)
    } else {
      return(TRUE)
    }
  }
  # Apply the preceeding function on the Drug_Group
  map_lgl(drug_group, to_keep)
}

Main_table %>%
  group_by(Patient_ID) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(to_keep = is_metastic_before(Drug_Group)) %>%
  # keep only the TRUE value.
  filter(to_keep)
#> # A tibble: 7 x 4
#> # Groups:   Patient_ID [2]
#>   Patient_ID Date       Drug_Group     to_keep
#>        <int> <chr>      <chr>          <lgl>  
#> 1          1 20/11/2016 Non_Metastatic TRUE   
#> 2          1 21/11/2016 Non_Metastatic TRUE   
#> 3          2 19/01/2017 Non_Metastatic TRUE   
#> 4          2 20/01/2017 Metastatic     TRUE   
#> 5          2 21/01/2017 Non_Metastatic TRUE   
#> 6          2 22/01/2017 Metastatic     TRUE   
#> 7          2 23/01/2017 Metastatic     TRUE

Created on 2018-10-28 by the reprex package (v0.2.1)

3 Likes

Here's another solution. First, I create a new column with the date of the first non-metastatic claim for each patient and then filter to remove all rows that are metastatic and occur before the first non-metastatic claim. I also converted to a date column to make sure there was nothing strange going on with the character dates.

library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date

Main_table <- tibble::tribble(
  ~Patient_ID,        ~Date,      ~Drug_Group,
  1L, "17/11/2016",     "Metastatic",
  1L, "18/11/2016",     "Metastatic",
  1L, "19/11/2016",     "Metastatic",
  1L, "20/11/2016", "Non_Metastatic",
  1L, "21/11/2016", "Non_Metastatic",
  2L, "19/01/2017", "Non_Metastatic",
  2L, "20/01/2017",     "Metastatic",
  2L, "21/01/2017", "Non_Metastatic",
  2L, "22/01/2017",     "Metastatic",
  2L, "23/01/2017",     "Metastatic"
)

Main_table %>%
  group_by(Patient_ID) %>% 
  mutate(
    Date = dmy(Date),
    first_non_met_claim = min(Date[Drug_Group == "Non_Metastatic"])
    ) %>%
  ungroup() %>% 
  filter(!(Drug_Group == "Metastatic" & Date < first_non_met_claim))
#> # A tibble: 7 x 4
#>   Patient_ID Date       Drug_Group     first_non_met_claim
#>        <int> <date>     <chr>          <date>             
#> 1          1 2016-11-20 Non_Metastatic 2016-11-20         
#> 2          1 2016-11-21 Non_Metastatic 2016-11-20         
#> 3          2 2017-01-19 Non_Metastatic 2017-01-19         
#> 4          2 2017-01-20 Metastatic     2017-01-19         
#> 5          2 2017-01-21 Non_Metastatic 2017-01-19         
#> 6          2 2017-01-22 Metastatic     2017-01-19         
#> 7          2 2017-01-23 Metastatic     2017-01-19

Created on 2018-10-28 by the reprex package (v0.2.1)

6 Likes

Way simpler ! thank you for sharing ! And I think it could be more efficient on a table with more line. :+1:

Thank you for your help - your solution has solved my problem.

Thank you for your help. Also, I am still learning functions, this is a good sample to learn.