Calculate the frequency of association between multiple variables

Hi, I have a df like this:

data <- structure(list(idcampione = c("124682/11", "124682/11", "124682/11", 
"124682/11", "124682/11", "124682/11", "124682/11", "119764/1", 
"119764/1", "119764/1", "119764/1", "119764/1", "123462/1", "123462/1", 
"123462/1", "123485/1", "123485/1", "123485/1", "123485/1", "123789", 
"123789", "123789", "123789", "123789", "123789"), chemioterapico = c("kanamicina", 
"streptomicina", "cloxacillina", "enrofloxacin", "flumequina", 
"sulfonamidi composti", "spectinomicina", "kanamicina", "streptomicina", 
"trimeth.+ sulfam", "flumequina", "spectinomicina", "kanamicina", 
"streptomicina", "flumequina", "kanamicina", "streptomicina", 
"flumequina", "spectinomicina", "eritromicina", "kanamicina", 
"spiramicina", "streptomicina", "tetraciclina", "trimeth.+ sulfam"
)), row.names = c(NA, -25L), class = c("tbl_df", "tbl", "data.frame"
))

I would like to calculate how often each combination of antibiotics occurs.

For example: suppose I have 3 IDs and all three share the association kanamycin + streptomycin. The result I would like is therefore kanamycin + streptomycin = 3.

This kind of cross-tabulation is meant to show which antibiotics are most frequently involved in multiresistance.

TL;DR: the question I have to answer is: which antibiotics are most frequently present in multiresistance? (So far, that's easy.)
The hard part is: are there multiple antibiotics that are frequently involved together in multiresistance?

If anyone has any ideas, I would be grateful.

Hey,

I am not 100% sure whether this is what you are looking for, but maybe you can give it a try:

library('data.table')
library('collapse')

Data_wide <- data |>  # `data` is the object defined in the question (R is case-sensitive)
  qDT() |>
  fmutate(present = 1L) |> 
  dcast.data.table(idcampione ~ chemioterapico, value.var = 'present', fill = 0L)

colSums(Data_wide[,cloxacillina:`trimeth.+ sulfam`])
#>         cloxacillina         enrofloxacin         eritromicina 
#>                    1                    1                    1 
#>           flumequina           kanamicina       spectinomicina 
#>                    4                    5                    3 
#>          spiramicina        streptomicina sulfonamidi composti 
#>                    1                    5                    1 
#>         tetraciclina     trimeth.+ sulfam 
#>                    1                    2

Created on 2022-11-08 by the reprex package (v2.0.1)

This gives you, for each antibiotic, the number of IDs in which it is present, across all 5 IDs in your dataset. Note, however, that streptomicina and kanamicina are each present for all 5 IDs, not just for 3, hence the 5 instead of a 3. I don't know whether that comes from the IDs in your sample data, or whether the part of the ID after the slash encodes different time stamps?
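To check how many IDs contain two antibiotics together, rather than each one on its own, one option is a co-occurrence matrix built with base R. This is only a minimal sketch, not the method used above: it rebuilds the sample data from the question in compact form, and the names `samples`, `incidence` and `co_occurrence` are made up for illustration.

```r
# Sample data from the question, rebuilt compactly (same 25 id/antibiotic rows)
samples <- list(
  "124682/11" = c("kanamicina", "streptomicina", "cloxacillina", "enrofloxacin",
                  "flumequina", "sulfonamidi composti", "spectinomicina"),
  "119764/1"  = c("kanamicina", "streptomicina", "trimeth.+ sulfam",
                  "flumequina", "spectinomicina"),
  "123462/1"  = c("kanamicina", "streptomicina", "flumequina"),
  "123485/1"  = c("kanamicina", "streptomicina", "flumequina", "spectinomicina"),
  "123789"    = c("eritromicina", "kanamicina", "spiramicina", "streptomicina",
                  "tetraciclina", "trimeth.+ sulfam")
)
data <- data.frame(
  idcampione     = rep(names(samples), lengths(samples)),
  chemioterapico = unlist(samples, use.names = FALSE)
)

# IDs x antibiotics table of 0/1; unique() guards against duplicated rows
incidence <- table(unique(data[c("idcampione", "chemioterapico")]))

# crossprod(X) is t(X) %*% X: cell [a, b] is the number of IDs containing both a and b
co_occurrence <- crossprod(unclass(incidence))

co_occurrence["kanamicina", "streptomicina"]
#> [1] 5
co_occurrence["flumequina", "spectinomicina"]
#> [1] 3
```

The diagonal of `co_occurrence` holds the per-antibiotic counts, so it should match the `colSums()` output above.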

Kind regards

Yes, it can be a starting point, thank you.
As for the IDs, I treat them as if they were all different, even if the number before the slash is the same.

This took quite some effort, but if you only care about the largest set of antibiotics that two IDs have in common (similar in spirit to a longest common subsequence), this should get you started:

library('data.table')
library('collapse')

### prepare the present antibiotics to be in one entry
Data_ <- data |>  # `data` is the object defined in the question (R is case-sensitive)
  # order by ID and antibiotic
  roworder(idcampione, chemioterapico) |>
  # get all present antibiotics per ID and remove whitespaces
  fgroup_by(idcampione) |>
  fsummarise(
    present_antibiotics = stringr::str_replace_all(paste(chemioterapico, collapse = "+"), pattern = "\\s",  replacement = "")
  )

### https://stat.ethz.ch/pipermail/r-help/2011-March/273052.html
# helper function to identify positions
intersect_strings <- function (x, y){
  y <- as.vector(y)
  y[match(as.vector(x), y, 0L)]
}

# execution of string finding
find_common_antibiotics <- function(string1, string2){
  paste(
    Reduce(intersect_strings, strsplit(c(string1,string2), split = "\\+")),
    collapse = '+'
    )
}

### execute common antibiotics pairwise
# https://stackoverflow.com/questions/17171148/non-redundant-version-of-expand-grid
expand.grid.unique <- function(x, y, include.equals=FALSE){
  x <- unique(x);  y <- unique(y)
  
  g <- function(i){
    z <- setdiff(y, x[seq_len(i-include.equals)])
    
    if(length(z)) cbind(x[i], z, deparse.level=0)
  }
  do.call(rbind, lapply(seq_along(x), g))
}

expand.grid.unique(
  x = Data_$present_antibiotics, y = Data_$present_antibiotics) |>
  qDT() |>
  dplyr::rowwise() |>
  dplyr::mutate(common_antibiotics = find_common_antibiotics(V1,V2)) |>
  dplyr::count(common_antibiotics)
#> # A tibble: 4 × 2
#> # Rowwise: 
#>   common_antibiotics                                     n
#>   <chr>                                              <int>
#> 1 flumequina+kanamicina+spectinomicina+streptomicina     3
#> 2 flumequina+kanamicina+streptomicina                    3
#> 3 kanamicina+streptomicina                               3
#> 4 kanamicina+streptomicina+trimeth.+sulfam               1

Created on 2022-11-08 by the reprex package (v2.0.1)

Note, however, that this does not cover all possible common combinations: it only takes the largest set of shared antibiotics for each pair of IDs and counts how often each such set occurs.
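If every combination of a given size is needed, not only the largest overlap per pair, a possible approach is to enumerate the combinations within each ID with `combn()` and tabulate them. The sketch below is an assumption-laden illustration in base R: `count_combinations()` and `samples` are hypothetical names, the data are the question's sample rebuilt compactly, and `" | "` is used as the separator so that names containing `+` (such as `trimeth.+ sulfam`) stay in one piece.

```r
# Sample data from the question, rebuilt compactly (same 25 id/antibiotic rows)
samples <- list(
  "124682/11" = c("kanamicina", "streptomicina", "cloxacillina", "enrofloxacin",
                  "flumequina", "sulfonamidi composti", "spectinomicina"),
  "119764/1"  = c("kanamicina", "streptomicina", "trimeth.+ sulfam",
                  "flumequina", "spectinomicina"),
  "123462/1"  = c("kanamicina", "streptomicina", "flumequina"),
  "123485/1"  = c("kanamicina", "streptomicina", "flumequina", "spectinomicina"),
  "123789"    = c("eritromicina", "kanamicina", "spiramicina", "streptomicina",
                  "tetraciclina", "trimeth.+ sulfam")
)
data <- data.frame(
  idcampione     = rep(names(samples), lengths(samples)),
  chemioterapico = unlist(samples, use.names = FALSE)
)

# Hypothetical helper: count in how many IDs each combination of k antibiotics occurs
count_combinations <- function(df, k) {
  per_id <- split(df$chemioterapico, df$idcampione)
  combos <- lapply(per_id, function(x) {
    x <- sort(unique(x))                       # same order in every ID, so keys match
    if (length(x) < k) return(character(0))    # this ID has too few antibiotics
    combn(x, k, FUN = paste, collapse = " | ") # one string per combination
  })
  sort(table(unlist(combos)), decreasing = TRUE)
}

pairs   <- count_combinations(data, 2)
triples <- count_combinations(data, 3)

pairs[["kanamicina | streptomicina"]]
#> [1] 5
triples[["flumequina | kanamicina | streptomicina"]]
#> [1] 4
```

`head(pairs)` or `head(triples)` then shows the most frequent combinations first. The number of combinations grows quickly with `k` and with the number of antibiotics per ID, so this is only practical for small `k`.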

Kind regards

Wow, this seems to be perfect! Thank you so much!!
