Detect a repeating pattern in a dataframe that repeats 3 time or more across two columns

Hossam · September 19, 2020, 9:05am

I have a data set that has 3 columns , (ID , D , AE). i want to retrieve records where any two values in column D are co-occurring with any value in Column AE 3 times or more

sample=data.frame(
  ID=c(1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5),
  D=c('a','b','c','d','e','f','g','y','z','a','b','g','y','d','e','f','g','b','a','y'),
  AE=c('m','h','j','k','m','h','j','k','m','j','m','h','l','j','k','m','h','m','o','s')
      )

^{Created on 2020-09-19 by the reprex package (v0.3.0)}

so that the output will be something like that:
output=data.frame(ID=c(1,1),
D=c('a','b'),
AE='m')

nirgrahamuk · September 19, 2020, 11:51am

I don't understand how ID relates to the stated requirements ( so I ignore it)

sample_df=data.frame(
  ID=c(1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5),
  D=c('a','b','c','d','e','f','g','y','z','a','b','g','y','d','e','f','g','b','a','y'),
  AE=c('m','h','j','k','m','h','j','k','m','j','m','h','l','j','k','m','h','m','o','s')
, stringsAsFactors = TRUE)

library(tidyverse)
(ae_groups <- map(unique(sample_df$AE),
    ~filter(sample_df,AE==.)))


(ae_group_counted <- map(ae_groups,
                         ~(group_by(.,D) %>% summarise(n=n(),
                                                       AE=max(as.character(AE))))))

# from this point we want to keep pairs with n >= 3 ?


pairscnt <- function(df){
  if(length(df[["D"]]) <2)
    return(NULL)
  combs <- combn(df[["D"]],2,simplify = FALSE)
 
  map_dfr(
    combs,
    ~filter(df,
           D==.[[1]] | D==.[[2]]) %>%
      mutate(nsum=sum(n)) %>% filter(nsum>=3) 
    
  ) %>% distinct()
}



map_dfr(ae_group_counted,
    pairscnt)

# A tibble: 8 x 4
# D         n AE     nsum
# <fct> <int> <chr> <int>
# 1 a         1 m         3
# 2 b         2 m         3
# 3 e         1 m         3
# 4 f         1 m         3
# 5 z         1 m         3
# 6 b         1 h         3
# 7 g         2 h         3
# 8 f         1 h         3

Hossam · September 19, 2020, 12:33pm

Thanks a lot for taking the time to write this code. However , i wanted to retrieve each couple of values from D column which co-occured along with a value from AE column in three IDs or more.

for example a,b was repeated along with m in three IDs , so i wanted to retrieve a,b and m.

I hope this can be done in R

Thanks a lot

nirgrahamuk · September 19, 2020, 12:53pm

in your given example how many ID's to a and b co-occur with m in ?

sample_df %>% filter(AE =='m')

# ID D AE
# 1 a  m
# 2 e  m
# 2 z  m
# 3 b  m
# 4 f  m
# 5 b  m

a,m is in id 1
there is no b,m in id1 so no cooccurence in that ID, nor any other ID's between a,b - m ??
I can only get them cooccurring if I ignore ID, am I wrong about that given your example as it is ?

Hossam · September 19, 2020, 12:58pm

We are getting closer! I need a function or loop that searches for any couple of values in column D that were co-occuring with a value in column AE. so that , this function or loop would return those IDs where the values that satisfied the condition belonged to. also the couple of values in column D should belong to the same ID

in practice i don't know what are those values that satisfy the condition. it is kind of data mining!

Chuchu · September 19, 2020, 1:07pm

I think for searching specific strings you can use dplyr::filter()?

Hossam · September 19, 2020, 1:11pm

Thank you , the issue is that i don't know the strings themselves in the first place. I just put the values for illustrating how should the code work. I know it is difficult task but I would be grateful for anyone who can help me figuring it out

pieterjanvc · September 19, 2020, 1:19pm

Hi,

I think this might help:

library(dplyr)

#The data
sample=data.frame(
  ID=c(1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5),
  D=c('a','b','c','d','e','f','g','y','z','a','b','g','y','d','e','f','g','b','a','y'),
  AE=c('m','h','j','k','m','h','j','k','m','j','m','h','l','j','k','m','h','m','o','s')
)


#Generate all combinations
result = data.frame(
  D1 = sample$D[-nrow(sample)],
  D2 = sample$D[-1],
  AE = c(sample$AE[-nrow(sample)],sample$AE[-1])
)

#In order to group by, we need to sort alphabetically 
swap = result$D1 > result$D2
result = result %>% mutate(
  D1 = ifelse(swap, .$D2, .$D1),
  D2 = ifelse(swap, .$D1, .$D2)
) 

#Find the occurrence and filter the ones 3+
result = result %>% group_by(D1,D2,AE) %>%  
  summarise(n = n(), .groups = "drop") %>% 
  filter(n > 2)

> result
# A tibble: 1 x 4
  D1    D2    AE        n
  <chr> <chr> <chr> <int>
1 a     b     m         3

I don't know if I fully understood the question, but I do get the answer of 3 times m with ab
I don't really understand what your real goal with this is, and I feel somewhere upstream in your code you could have made other choices that would prevent something like this happening.

Hope this helps,
PJ

nirgrahamuk · September 19, 2020, 1:23pm

in your example there is no single ID for which both a and b coincide with AE of M....
so why would a and b appear in the output as coinciding with M in ID 1.
You have not provided an argument consistent with your data example as far as I can tell.
I suggest you alter one of either a) your explanation, or b) your example data...

Hossam · September 19, 2020, 1:34pm

Thank you so much , it seems to be working. will try it on the original dataset and get back to you.

You are right , but This is the dataset that i have , so i have to work with it.

This is my case:
there is a list of patients that has had adverse events with the drugs they have recieved when havings these AEs. The task is to find the drug combinations of any two drugs with at least three patient has had adverse events and list these combinations with frequencies.

nirgrahamuk · September 19, 2020, 2:00pm

your use case makes sense to me, but it implies that your example data isn't suitable , and certainly your desired output is incorrect, as there are no examples of any AE for which 3 patients had the same pairings of drugs
look at AE 'm'.

filter(sample_df,AE=='m')
  ID D AE
1  1 a  m
2  2 e  m
3  2 z  m
4  3 b  m
5  4 f  m
6  5 b  m

any drug, is unique to a single ID. apart from drug b which patients 3 and 5 had.
but these dont coincide with a , and so the output that would pair a with b for m is unjustified?

I could probably invent some data myself, but I don't want to waste my time if this is a misunderstanding

Hossam · September 19, 2020, 2:13pm

I think yes there is a little bit of misunderstanding. Don't filter by AE=="m". Just look at the original example. you will find that (a,b,m) are repeated at IDs (1,3,5).

Thank you so much for your time and efforts. really much appreciated

Hossam · September 19, 2020, 2:22pm

This is really genius ! Thank you so much

However some combinations weren't calculated exactly as they are present in the dataset , probably because you didn't take the ID column into account in your code.

Anyway , this is very great help and i really can't thank you enough

nirgrahamuk · September 19, 2020, 3:22pm

Here is my solution

library(tidyverse)

sample_df <- mutate(data.frame(
  ID=c(1,1,1,1,2,2,2,2,2,3,3,3,3,4,4,4,4,5,5,5),
  D=c('a','b','c','d','e','f','g','y','z','a','b','g','y','d','e','f','g','b','a','y'),
  AE=c('m','h','j','k','m','h','j','k','m','j','m','h','l','j','k','m','h','m','o','s')
  , stringsAsFactors = FALSE),
  AE = paste0(AE,"*"))



dtoks <- sample_df$D %>% unique
aetoks <- sample_df$AE %>% unique

# get possible triplets
(drugpairs <- enframe(combn(dtoks,2,simplify = FALSE)))

(triplets_0 <- expand.grid(drugpairs$name,aetoks))

(triplets <- left_join(drugpairs,
                       triplets_0,
                       by=c("name"="Var1")) %>% rename(
                         AE=Var2,
                         drugpair = value
                       ))


triplet_in_id <- function(IDnum,triplet){

  df_of_id <- sample_df %>% filter(ID==IDnum)
  dpair <- triplet$drugpair %>% unlist
  tae <- triplet$AE
  # check if both dpair are in the set
  both_drugs_in <- dpair[[1]] %in% df_of_id$D & dpair[[2]] %in% df_of_id$D
  # check if the AE of the triplet pair is present with one of the pair
  tae_df <- filter(df_of_id,AE==tae)
  if(nrow(tae_df)<1)
    return(FALSE)
  ae_with_one_of_paired_drugs <- any(pull(tae_df,D) %in% dpair)
  
  #both conditions have to be TRUE
  both_drugs_in & ae_with_one_of_paired_drugs
}

scan_id_for_all_triplets <- function(IDnum){
  map_lgl(1:nrow(triplets),
          ~triplet_in_id(IDnum,triplets[.,])) 
}


(ids_scanned_for_trips <- map_dfc(sample_df$ID %>% unique,
    ~enframe(name=NULL,
             value=.,
             scan_id_for_all_triplets(.))))
ids_scanned_for_trips$sumIDhits <- rowSums(ids_scanned_for_trips)

(collated_df <- bind_cols(triplets,ids_scanned_for_trips))

(trips_of_interest <- filter(collated_df,
                             sumIDhits>2))
# # A tibble: 1 x 9
# name drugpair       AE   `1`   `2`   `3`   `4`   `5`   sumIDhits
# <int> <list>       <fct>  <lgl> <lgl> <lgl> <lgl> <lgl>     <dbl>
#   1     1 <chr [2]> m*    TRUE  FALSE TRUE  FALSE TRUE          3
trips_of_interest %>% unnest(cols=c(drugpair))
# A tibble: 2 x 9
# name drugpair     AE      `1`   `2`   `3`   `4`   `5`   sumIDhits
# <int> <chr>      <fct>   <lgl> <lgl> <lgl> <lgl> <lgl>     <dbl>
# 1       1 a        m*    TRUE  FALSE TRUE  FALSE TRUE          3
# 2       1 b        m*    TRUE  FALSE TRUE  FALSE TRUE          3

Hossam · September 19, 2020, 3:55pm

nirgrahamuk:

(drugpairs <- enframe(combn(dtoks,2,simplify = FALSE)))

(triplets_0 <- expand.grid(drugpairs$name,aetoks))

(triplets <- left_join(drugpairs,
                       triplets_0,
                       by=c("name"="Var1")) %>% rename(
                         AE=Var2,
                         drugpair = value
                       ))


triplet_in_id <- function(IDnum,triplet){

  df_of_id <- sample_df %>% filter(ID==IDnum)
  dpair <- triplet$drugpair %>% unlist
  tae <- triplet$AE
  # check if both dpair are in the set
  both_drugs_in <- dpair[[1]] %in% df_of_id$D & dpair[[2]] %in% df_of_id$D
  # check if the AE of the triplet pair is present with one of the pair
  tae_df <- filter(df_of_id,AE==tae)
  if(nrow(tae_df)<1)
    return(FALSE)
  ae_with_one_of_paired_drugs <- any(pull(tae_df,D) %in% dpair)
  
  #both conditions have to be TRUE
  both_drugs_in & ae_with_one_of_paired_drugs
}

scan_id_for_all_triplets <- function(IDnum){
  map_lgl(1:nrow(triplets),
          ~triplet_in_id(IDnum,triplets[.,])) 
}


(ids_scanned_for_trips <- map_dfc(sample_df$ID %>% unique,
    ~enframe(name=NULL,
             value=.,
             scan_id_for_all_triplets(.))))
ids_scanned_for_trips$sumIDhits <- rowSums(ids_scanned_for_trips)

(collated_df <- bind_cols(triplets,ids_scanned_for_trips))

(trips_of_interest <- filter(collated_df,
                             sumIDhits>2))

Thanks so much , will try and update you soon

Hossam · September 19, 2020, 4:22pm

nirgrahamuk:

dtoks <- sample_df$D %>% unique
aetoks <- sample_df$AE %>% unique

# get possible triplets
(drugpairs <- enframe(combn(dtoks,2,simplify = FALSE)))

(triplets_0 <- expand.grid(drugpairs$name,aetoks))

(triplets <- left_join(drugpairs,
                       triplets_0,
                       by=c("name"="Var1")) %>% rename(
                         AE=Var2,
                         drugpair = value
                       ))


triplet_in_id <- function(IDnum,triplet){

  df_of_id <- sample_df %>% filter(ID==IDnum)
  dpair <- triplet$drugpair %>% unlist
  tae <- triplet$AE
  # check if both dpair are in the set
  both_drugs_in <- dpair[[1]] %in% df_of_id$D & dpair[[2]] %in% df_of_id$D
  # check if the AE of the triplet pair is present with one of the pair
  tae_df <- filter(df_of_id,AE==tae)
  if(nrow(tae_df)<1)
    return(FALSE)
  ae_with_one_of_paired_drugs <- any(pull(tae_df,D) %in% dpair)
  
  #both conditions have to be TRUE
  both_drugs_in & ae_with_one_of_paired_drugs
}

scan_id_for_all_triplets <- function(IDnum){
  map_lgl(1:nrow(triplets),
          ~triplet_in_id(IDnum,triplets[.,])) 
}


(ids_scanned_for_trips <- map_dfc(sample_df$ID %>% unique,
    ~enframe(name=NULL,
             value=.,
             scan_id_for_all_triplets(.))))
ids_scanned_for_trips$sumIDhits <- rowSums(ids_scanned_for_trips)

(collated_df <- bind_cols(triplets,ids_scanned_for_trips))

(trips_of_interest <- filter(collated_df,
                             sumIDhits>2))

(triplets_0 <- expand.grid(drugpairs$name,aetoks))
Error: cannot allocate vector of size 16.4 Gb

When i got to this line i got this error. My original data set has 214044 record , so trying to produce all possible triplet combination produced a vector of extremely large size that exceeds R memory capacity. Do you think there is a solution ?

Thanks so much for taking the time to help, really appreciate it

nirgrahamuk · September 19, 2020, 4:27pm

how many unique drugs do you have ?
how many unique adverse effects ?

Hossam · September 19, 2020, 4:59pm

1658 unique drug , 46322 unique IDs (patients) , and 3209 unique adverse events

nirgrahamuk · September 19, 2020, 6:18pm

Ahhh, that is a crazy amount. Haha.
Well I'd iterate over the main dataframe, only processing one AE at a time. I dont have time to code it now, but maybe later.

technocrat · September 19, 2020, 9:08pm

Beginning with a null hypothesis, H_0 that there is no association between patients, drugs and effect, it is instructive to begin with a synthetic dataset of the same order of magnitude that is designed to be random.

From there,

Subset patients with more than two effects
Within that subset find drugs that occur more than once

suppressPackageStartupMessages({library(dplyr)})
# synthetic data set
set.seed(37)
patients <- seq(1,50000)
id <- sample(patients,10000,replace= TRUE)
drug <- sample(seq(1,1200),10000,replace = TRUE) %>% paste0("Drug",.)
effect <- sample(seq(1,3500),10000, replace = TRUE)
data.frame(id,drug,effect) -> dat

# find patients with three effect records
 
dat %>% group_by(id) %>% count()  %>% filter(n > 2) %>%
  select(id) -> targets

# convert to vector to use as index

targets <- unlist(targets[[1]])

# subset on targets

dat %>% filter(id %in% targets) -> affected

# find drugs with multiple effects

affected %>% group_by(drug) %>% 
  count() %>%
  filter(n > 1) -> mults

# convert to vector to use as index

mults <- unlist(mults[[1]])

# subset on mults

affected %>% filter(drug %in% mults) %>% 
  group_by(id) %>% 
  count() -> combos

# convert to vector to use as index

combos <- unlist(combos[[1]])

# subset on targets

affected %>% 
  filter(id %in% combos) %>%
  arrange(-desc(id))
#>       id     drug effect
#> 1   1180  Drug501   1893
#> 2   1180  Drug741    171
#> 3   1180  Drug872   2963
#> 4   1767  Drug898   3015
#> 5   1767  Drug793   1563
#> 6   1767   Drug47    430
#> 7   3274  Drug130   1187
#> 8   3274  Drug397   2097
#> 9   3274  Drug816   2793
#> 10  5337  Drug596   2685
#> 11  5337 Drug1141   1014
#> 12  5337  Drug104   3126
#> 13  7412 Drug1136   3208
#> 14  7412  Drug691   3403
#> 15  7412  Drug872   2140
#> 16  9786 Drug1041   1022
#> 17  9786  Drug527    915
#> 18  9786 Drug1002   3275
#> 19 11453  Drug374   3474
#> 20 11453 Drug1010   2493
#> 21 11453   Drug67   1851
#> 22 12001  Drug188   1584
#> 23 12001  Drug311   1489
#> 24 12001  Drug886    555
#> 25 12159 Drug1193    321
#> 26 12159  Drug335   2629
#> 27 12159  Drug881   1519
#> 28 14978  Drug912   1701
#> 29 14978  Drug696   2546
#> 30 14978  Drug816   3294
#> 31 17000  Drug985   2159
#> 32 17000 Drug1044   3251
#> 33 17000 Drug1119   3201
#> 34 19515  Drug589    246
#> 35 19515  Drug671   1787
#> 36 19515 Drug1002     38
#> 37 19515   Drug96    225
#> 38 24407 Drug1154   2827
#> 39 24407  Drug886   2028
#> 40 24407  Drug260   2038
#> 41 25106 Drug1028   2799
#> 42 25106   Drug54    307
#> 43 25106  Drug793    473
#> 44 27576  Drug501   2466
#> 45 27576  Drug441   3366
#> 46 27576  Drug201   1383
#> 47 27889  Drug841    340
#> 48 27889  Drug497    202
#> 49 27889 Drug1058   2854
#> 50 29309  Drug171   1472
#> 51 29309  Drug167   1826
#> 52 29309  Drug425   1244
#> 53 31299  Drug523   1268
#> 54 31299  Drug473    360
#> 55 31299 Drug1044   2432
#> 56 35345  Drug304    920
#> 57 35345 Drug1045   1910
#> 58 35345  Drug849   2331
#> 59 36752  Drug521   2750
#> 60 36752  Drug859   1162
#> 61 36752  Drug374   2805
#> 62 37556 Drug1044   2490
#> 63 37556   Drug78   2829
#> 64 37556  Drug881   2808
#> 65 40709  Drug774   1252
#> 66 40709  Drug886   3213
#> 67 40709  Drug192    714
#> 68 46943  Drug425   3202
#> 69 46943  Drug497    824
#> 70 46943  Drug670    297
#> 71 47437  Drug780   2480
#> 72 47437  Drug566    582
#> 73 47437  Drug304    795
#> 74 48587  Drug104    739
#> 75 48587  Drug201   3261
#> 76 48587 Drug1031   1857

^{Created on 2020-09-19 by the reprex package (v0.3.0)}