Detect a repeating pattern in a dataframe that repeats 3 time or more across two columns

Hossam · September 20, 2020, 7:22am

Thanks a lot dear. I wanted to retrieve how many IDs (patients) would be exposed to adverse events associated with all possible combinations of drugs. The problem is that I have around 1600 unique drugs , 3000 unique adverse and 42000 IDs (patients). So , the task is a little bit complicatd

nirgrahamuk · September 20, 2020, 3:16pm

I'm thinking this wouldnt work, because we would need to match ID with drugs that while they touch one AE would be present in another...

what is your distribution of AE like ?

group_by(sample_df,AE) %>%
  summarise(n=n()) %>% group_by(n) %>%
  summarise(n2=n())

Hossam · September 20, 2020, 3:34pm

Not sure i got ur point ... i am attaching the output of AE distribution , and my actual data.

AE distribution:
https://mailaub-my.sharepoint.com/:x:/g/personal/hmm59_mail_aub_edu/EaPH2udzpWZPvWn0JaEC-fYB2nR9UFFqainomEhENsGQcg?e=k0TqfD

Actual dataset:
https://mailaub-my.sharepoint.com/:x:/g/personal/hmm59_mail_aub_edu/EZ-UDFwADzFEr7jqhq1dd6ABfHTKyFQ3HCNKV6hkBJMDyA?e=zmXgX5

technocrat · September 20, 2020, 6:43pm

The underlying idea in @nirgrahamuk's snippet is to identify all observed interactions; if the objective is all possible interactions, only the union of drugs and effects is required.

Given an object with the drug-effect combinations to be examined, it is unnecessary to test against each id, only those that have more than two in a vector of drug-effect combinations.

nirgrahamuk · September 21, 2020, 2:42pm

I had another attempt at approaching this.


make_sample_data <- function(volume_param){
set.seed(37)
patients <- seq(1,volume_param)
id <- sample(patients,20*volume_param,replace= TRUE)
drug <- sample(seq(1,volume_param),20*volume_param,replace = TRUE) %>% paste0("Drug",.)
effect <- paste0("effect_",as.character(sample(seq(1,volume_param),20*volume_param, replace = TRUE)))
tibble(ID=id,D=drug,AE=effect) ->> sample_df
}

docomb <- function(x){
  if(length(x)>=2)
      combn(x,2,simplify = FALSE) %>% map(sort)
  else 
    character(0)
}
per_AE <- function(aeval){

#step 1 filter for an AE - m
step1 <- filter(sample_df,
                AE==aeval)
#step 2 get all ID's associated
step2 <- left_join(step1,sample_df,
                   by=c("ID"))

#step 3 cleanup the ids

step3 <- group_by(step2,ID) %>%
  summarise(dlist=list(unique(c(D.x,D.y))),
            .groups="drop")  
if(!shiny::isTruthy(step3$dlist)){
  return(NULL)
}

# if(aeval==3244)
#   browser()

pairs_in_ids_feat_token <- map(step3$dlist,
    docomb) %>% setNames(step3$ID)

pairs_in_ids_feat_token <- Filter((length), pairs_in_ids_feat_token)
frames <- enframe(pairs_in_ids_feat_token)
if(nrow(frames)==0) return(NULL)

pairs_in_ids_feat_token_df <- mutate(rowwise(frames),
                                               v2 = ifelse(!is.list(value),list(value),value)) %>%
  ungroup %>% select(name,v2) %>% rename(value=v2)

uniqpairs <- distinct(pairs_in_ids_feat_token_df,
                      value)
 
hit_counts <- pairs_in_ids_feat_token_df %>% group_by(value) %>% summarise(n=n(),idlist = list(name),
                                                                           .groups="drop")

filtered_hits <- filter(hit_counts,
                        n>=3) %>% mutate(AE_TOKEN = aeval)
# print(aeval)
# print(filtered_hits)
filtered_hits
}


make_sample_data(volume_param = 5000)
microbenchmark(
  r5000 =results_5000 <- map_dfr(unique(sample_df$AE),
                               per_AE)
  ,times=1L)
#Unit: seconds
 # expr      min       lq     mean   median       uq      max neval
 #r5000 55.63694 55.63694 55.63694 55.63694 55.63694 55.63694     1

Hossam · September 21, 2020, 6:51pm

Thank you so much. I am sorry I am still a beginner , so may you please tell me how to apply it on the actual dataset that i attached in my previous reply ?

Also, someone in another forum provided me with a code that really worked and produced exact counts (that I check by filtering the cases dataset, and make sure that the retrieved cases equal to the count produced by the code). Here is the code that worked with most combinations , although some combinations differed a littel bit:

sample1<-sample %>% subset(select=c(ID,D)) %>%unique() %>%
group_by(ID) %>% filter(n()>1) %>%
group_by(D)%>% filter(n()>1)
sample2=sample1 %>% rename(D2=D)
sample3=sample1 %>% left_join(sample2,by=c("ID")) %>%
filter(D!=D2) %>%
left_join(sample,by=c("ID","D"),all.x=TRUE) %>%
group_by(AE) %>%filter(n()>2)%>% group_by(D,D2) %>%
filter(n()>2) %>% group_by(D,D2,AE) %>%
mutate(Count=n()) %>%
filter (n()>2) %>%filter(AE!="NA")

sample4=sample3 %>% group_by(ID) %>% arrange(AE)%>%
mutate(lagd=lag(D),lagae=lag(AE)) %>%
filter((lagae==AE)) %>%filter(D2==lagd) %>% ungroup() %>% select(-lagae,-lagd)
sample5=sample4 %>% distinct(D,D2,AE,.keep_all = T) %>% filter(AE%in%IMEpt)
View(sample5)

stonepeople · September 23, 2020, 1:25pm

I think you should be considering how to reshape the data - though, I have to admit that I can't yet see quite how the reshaped data would look.
One solution might be to play with the widyr package, which works alongside of the tidyverse packages. Specifically it has a function - pairwise which works its way through every combination of factors, counting how many times they appear together in a dataframe.
So I re-arranged your dataset like this -
side_effects <- pairwise_count(drugs, D, AE) %>%
arrange(desc(n))
where "drugs" was the name I gave the dataframe. Other widyr functions might suit your needs even better - I just tried this one.
Jonathan

Hossam · September 23, 2020, 1:40pm

Thanks very much ... would this count the IDs where each possible drug pairs in any ID are repeated ? (i.e would this take ID into consideration while counting drug pairs) ?

Hossam · September 23, 2020, 1:47pm

Also , can we calculates the counts provided in the second image (named "pattern 3"). These are the counts of all possible events of any combination between (any two drugs within a certain ID and the AEs corresponding to that ID) ?

stonepeople · September 23, 2020, 2:42pm

I imagine you can - I just began looking at widyr: perhaps someone else on here really knows the package....The documentation is pretty thorough - see it here: https://rdrr.io/cran/widyr/

Hossam · September 23, 2020, 3:24pm

I tried , it only works within one column (i want to get the possible combinations between each pairs of drugs with AE). Also it produces counts of pairs of two , however i need count of triplets (D1 , D2 and AE).

Thanks anyway

stonepeople · September 24, 2020, 1:20pm

I think the problem is in the need to line up more than 2 coinciding drugs: a dataframe is essentially 2 dimensions. My guess is that you will need to do it in two stages.
BUT - please be aware I am not a data scientist. I am guessing using what knowledge of tidyverse I have learned along the way.
I suggest that you look for wider expertise and advice via twitter - hashtag #rstats and especially @R4DSCommunity where there are some really good data scientists and R experts ready to help. (I am still trying to find a solution myself meanwhile....)

Hossam · September 24, 2020, 3:18pm

Thanks a lot for your help ... I know it is a complicated task and I asked here and there but no one could find a clue (for calculation of the other 7 counts) till now.

system · October 15, 2020, 3:18pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.