Finding closest date relative to another date in a column and subtracting

I'm running an analysis on a disease.

I have a variable

bd_extracted_rows_disease_date_of_death <- bd_extracted_rows_disease %>%
  select(starts_with("f.40000.0.0"))

and another

bd_extracted_rows_disease_date_of_in_patient_diagnosis <- bd_extracted_rows_disease %>%
  select(starts_with("f.41262.0.x"))

where x ranges from 0 to 65 inclusive (i.e. there's 66 columns of f.41262.0.x, for example f.41262.0.0, f.41262.0.1,...). These are all different dates of various diagnoses and so I would like to search for the closest date to each one in bd_extracted_rows_disease_date_of_death and then subtract like so: bd_extracted_rows_disease_date_of_death - bd_extracted_rows_disease_date_of_in_patient_diagnosis to see how long each participant survived. This works if I just take the first column, f.41262.0.0 but I can't trust that these are the correct dates for this particular disease so I need to find the closest date to the date of death and then take the difference.

Does anyone have any ideas of which functions I might use and how I might implement this? Here's the minimal working example (which works) when I use column f.41262.0.0:

bd_extracted_rows_disease <- bd %>%
  filter(f.eid %in% read.csv("/Volumes/My Passport for Mac/Research/Comparison/PLINK/PatientIDs/disease_patient_id.csv")[[1]])
bd_extracted_rows_disease_date_of_death <- bd_extracted_rows_disease %>%
  select(starts_with("f.40000.0.0"))
bd_extracted_rows_disease_date_of_first_in_patient_diagnosis <- bd_extracted_rows_disease %>%
  select(starts_with("f.41262.0.0"))
disease_survival <- bd_extracted_rows_disease_date_of_death - bd_extracted_rows_disease_date_of_first_in_patient_diagnosis

If I understand you well, this the attached codecould help.

I create a file for 2 id's (persons) with 2 possible diseases (but note that the code is not depended on these particular numbers.
The first two steps only serve to create a file simular to yours. So assume df2 is your input file.
In variable diseases I gather the names of the disease variables.
In df3 I have calculated for all diseases the difference in days with the 'dying' date (and set this difference to 99999 when there is not date for the disease).
In df4 I have calculated the disease that has a date most near the dying date and the difference in days.
I think this is what need and otherwise you will have some inspiration.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

df1 <- tibble::tribble(
  ~id,          ~f.40000.0.0, ~f.41262.0.0,  ~f.41262.0.1,
  1L, "01/01/2014", "01/01/2013" , NA, 
  2L, "01/01/2014", "01/01/2013" , "01/06/2013"
)
df1
#> # A tibble: 2 x 4
#>      id f.40000.0.0 f.41262.0.0 f.41262.0.1
#>   <int> <chr>       <chr>       <chr>      
#> 1     1 01/01/2014  01/01/2013  <NA>       
#> 2     2 01/01/2014  01/01/2013  01/06/2013
df2 =df1 %>%
  mutate(
    across(
      .cols = starts_with("f.4"),
      .fns = ~ lubridate::dmy(.x)
    )
  ) 
df2
#> # A tibble: 2 x 4
#>      id f.40000.0.0 f.41262.0.0 f.41262.0.1
#>   <int> <date>      <date>      <date>     
#> 1     1 2014-01-01  2013-01-01  NA         
#> 2     2 2014-01-01  2013-01-01  2013-06-01
diseases = setdiff(names(df1),c('id','f.40000.0.0'))
df3 = df2 %>%
  mutate(
    across(
      .cols = any_of(diseases),
      .fns = function(x,y) ifelse(is.na(x),99999,y-x),
      df2$f.40000.0.0
    ) 
  )
df3   
#> # A tibble: 2 x 4
#>      id f.40000.0.0 f.41262.0.0 f.41262.0.1
#>   <int> <date>            <dbl>       <dbl>
#> 1     1 2014-01-01          365       99999
#> 2     2 2014-01-01          365         214

min_i = function(...) {
  a = unlist(list(...))
  order(a)[1]
}
min_v = function(...) {
  a = unlist(list(...))
  a[order(a)[1]]
}

df4 = df3 %>%
  rowwise() %>%
  mutate(
    min_ix = min_i (c_across(cols = any_of(diseases))),
    min_ix = diseases[min_ix],
    min_d  = min_v (c_across(cols = any_of(diseases))),
    ) 

df4
#> # A tibble: 2 x 6
#> # Rowwise: 
#>      id f.40000.0.0 f.41262.0.0 f.41262.0.1 min_ix      min_d
#>   <int> <date>            <dbl>       <dbl> <chr>       <dbl>
#> 1     1 2014-01-01          365       99999 f.41262.0.0   365
#> 2     2 2014-01-01          365         214 f.41262.0.1   214

Created on 2020-06-15 by the reprex package (v0.3.0)

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.