Indexing time periods and count number of events

Story:

Many times, when a patient purchases a certain Drug, we define the first purchase of that drug as the index date. We then want to find out from that index date ( you can think as time 0), look back 12 months and 12 months forward and count how many purchases he has made in that 2 time period.

Requirement:

In my sample data, I have 2 patients and also includes index date.

  1. For all patients I want to find out how many Drug purchases has been made from:
  • [-12 months to 0 months]
  • (0 months to 12 months]

For patient 1 & 2 its 3 purchases & 3 purchases respectively 12 months back from the index date. Therefore the total count is 6 purchases.

  1. For all patients I want to find out how unique category of drugs they have purchased from:
  • [-12 months to 0 months]
  • (0 months to 12 months]

For patient 1 & 2, it's 2 & 3 unique counts in Cateogry_Level_1 therefore, the total sum is 5

In my real data, I've millions of patients

I'd be grateful, If I could have:

  1. Solution
  2. Solution based on tidyverse preferred
  3. Without writing a new function (if possible)
  4. Maybe entirely new solution design to get the same result (I don't maybe it's efficient to convert all dmy format to time 0, time 1 month from index, time 2 month from index..... something like that)

Thanks

library(tidyverse)
library(lubridate)

Sample_data <- tibble::tribble(
  ~PatientCode,   ~Drugs, ~Date_of_Purchased,  ~Index_Date,       ~Category_Level_1,                    ~Category_Level_2,
            1L, "Drug_A",       "15/12/2012",  "5/03/2015", "CARDIOVASCULAR SYSTEM",           "CALCIUM CHANNEL BLOCKERS",
            1L, "Drug_A",        "3/05/2014",  "5/03/2015", "CARDIOVASCULAR SYSTEM",           "CALCIUM CHANNEL BLOCKERS",
            1L, "Drug_C",        "4/02/2015",  "5/03/2015", "CARDIOVASCULAR SYSTEM",             "LIPID MODIFYING AGENTS",
            1L, "Drug_B",        "5/03/2015",  "5/03/2015",       "DERMATOLOGICALS", "ANTIFUNGALS FOR DERMATOLOGICAL USE",
            1L, "Drug_B",        "6/04/2015",  "5/03/2015",       "DERMATOLOGICALS",         "EMOLLIENTS AND PROTECTIVES",
            1L, "Drug_D",        "7/04/2016",  "5/03/2015",        "NERVOUS SYSTEM",                        "ANESTHETICS",
            2L, "Drug_A",       "15/12/2013", "30/04/2015", "CARDIOVASCULAR SYSTEM",           "CALCIUM CHANNEL BLOCKERS",
            2L, "Drug_A",        "3/05/2014", "30/04/2015", "CARDIOVASCULAR SYSTEM",           "CALCIUM CHANNEL BLOCKERS",
            2L, "Drug_C",        "4/02/2015", "30/04/2015",        "NERVOUS SYSTEM",             "LIPID MODIFYING AGENTS",
            2L, "Drug_B",       "30/04/2015", "30/04/2015",       "DERMATOLOGICALS", "ANTIFUNGALS FOR DERMATOLOGICAL USE",
            2L, "Drug_B",        "6/04/2016", "30/04/2015",       "DERMATOLOGICALS",         "EMOLLIENTS AND PROTECTIVES",
            2L, "Drug_D",        "7/04/2016", "30/04/2015",        "NERVOUS SYSTEM",                        "ANESTHETICS"
  ) %>% 
  mutate(Date_of_Purchased = dmy(Date_of_Purchased), Index_Date = dmy(Index_Date))

Exected_Result <- tibble::tribble(
                             ~Followup_period,                 ~Category, ~Count,
                    "- 12 months to 0 months",         "No_of_Purchases",     6L,
                      "0 months to 12 months",         "No_of_Purchases",     3L,
                    "- 12 months to 0 months", "Category_Level_1_unique",     5L,
                      "0 months to 12 months", "Category_Level_1_unique",     3L,
                    "- 12 months to 0 months", "Category_Level_2_unique",     6L,
                      "0 months to 12 months", "Category_Level_2_unique",     3L
                    )

UPDATE:

I misread your requirements w.r.t. point 2 where it seems that, instead of knowing the number of unique categories across all patients, you want to know the sum of the unique categories for each. I've updated my answer to incorporate this (though I include my original answer below as well in case it may be of interest).

Here's what I did:

  1. Group by PatientCode
  2. Create a new variable called Followup_period using case_when in order to classify any particular Date_of_Purchased into either the - 12 months to 0 months or 0 months to 12 months period (alternatively NA)
  3. Group the data by PatientCode and Followup_period
  4. Summarise the data, creating three new summary variables, namely No_of_purchases, Category_Level_1_unique, and Category_Level_2_unique, using n() for the first and n_distinct() for the latter two
  5. Group only by Followup_period and deselect PatientCode from the tibble
  6. Use summarise all to create sums of all summary columns
  7. Gather the data in order to go from wide to long format
  8. Ungroup the data
  9. Filter out any NA periods if they are not of interest

Here it is in a single pipe:

Result <- Sample_data %>% 
  group_by(PatientCode) %>% 
  mutate(Followup_period = case_when(
    Date_of_Purchased <= Index_Date & Date_of_Purchased >= (Index_Date - months(12)) ~ '- 12 months to 0 months',
    Date_of_Purchased > Index_Date & Date_of_Purchased <= (Index_Date + months(12)) ~ '0 months to 12 months'
  )) %>% 
  group_by(PatientCode, Followup_period) %>% 
  summarise(No_of_purchases = n(),
            Category_Level_1_unique = n_distinct(Category_Level_1),
            Category_Level_2_unique = n_distinct(Category_Level_2)) %>% 
  group_by(Followup_period) %>% 
  select(-PatientCode) %>% 
  summarise_all(sum) %>% 
  gather(key = Vategory, value = Count, No_of_purchases:Category_Level_2_unique) %>% 
  ungroup() %>% 
  filter(!is.na(Followup_period))

ORIGINAL ANSWER

Here's what I'd do (though I get a few different totals from what you provide in the Expected_Result tibble):

  1. Group by PatientCode
  2. Create a new variable called Followup_period using case_when in order to classify any particular Date_of_Purchased into either the - 12 months to 0 months or 0 months to 12 months period (alternatively NA)
  3. Ungroup and the group again, but this time by Followup_period
  4. Summarise the data, creating three new summary variables, namely No_of_purchases, Category_Level_1_unique, and Category_Level_2_unique, using n() for the first and n_distinct() for the latter two
  5. Gather the data in order to go from wide to long format
  6. Ungroup the data
  7. (optional) filter out any NA periods if they are not of interest
Result <- Sample_data %>% 
  group_by(PatientCode) %>% 
  mutate(Followup_period = case_when(
    Date_of_Purchased <= Index_Date & Date_of_Purchased >= (Index_Date - years(1)) ~ '- 12 months to 0 months',
    Date_of_Purchased > Index_Date & Date_of_Purchased <= (Index_Date + years(1)) ~ '0 months to 12 months'
  )) %>% 
  ungroup() %>% 
  group_by(Followup_period) %>% 
  summarise(No_of_purchases = n(),
            Category_Level_1_unique = n_distinct(Category_Level_1),
            Category_Level_2_unique = n_distinct(Category_Level_2)) %>% 
  gather(key = Vategory, value = Count, No_of_purchases:Category_Level_2_unique) %>% 
  ungroup() %>% 
  filter(!is.na(Followup_period))
2 Likes

That's exactly I was after. Also the code is very easy to understand. Thanks a lot @hendrikvanb

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.