column based on conditional statements from 2 dataframes based on ID

Hello,

I would like to make a new column in one dataframe, that returns a value from another dataframe if a number of conditions are met.
Both dataframes have a grouping variable 'subject'. Each subject uses several medications. The medications could have been prescribed on a planned study-related visit to the doctor or on another moment (unrelated to the study).
I would like the new column, called 'period', to return the planned visit if the medication was started on this date (including 2 days before up to 2 days after the visit). If not, I would like 'period' to return one of 3 possible strings 'before', 'after' or 'in-between'.

I managed to get the right value in 1 patient for 1 medication, but unfortunately did not get further:

# prescribed medication and starting dates in 3 subjects
a <-data.frame(stringsAsFactors=FALSE,
               subject = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
               medication = c('med1', 'med2', 'med3', 'med4', 'med5', 'med6', 'med7', 'med8', 'med9'),
               startdate = c('01-01-2016', '02-02-2016', '10-02-2016', '02-03-2016', '07-03-2015', '10-12-2015', '06-01-2018', '08-03-2018', '11-04-2018'))
a$startdate <- as.Date(a$startdate, format ="%d-%m-%Y") 
# doctor visit number and corresponding dates in the same 3 subjects
b <-data.frame(stringsAsFactors=FALSE,  
               subject = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
               visitnumber = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
               visitdate = c('02-02-2016', '03-03-2016', '05-04-2016', '04-02-2015', '06-03-2015', '07-04-2015', '06-02-2018', '08-03-2018', '10-04-2018'))
b$visitdate <- as.Date(b$visitdate, format ="%d-%m-%Y")    

# I first tried a subset of only subject 1 and 'med2' 
c <- subset(a, subject == 1)
d <- subset(b, subject ==1)
d
#>   subject visitnumber  visitdate
#> 1       1           1 2016-02-02
#> 2       1           2 2016-03-03
#> 3       1           3 2016-04-05
e <- subset(c, medication == 'med2') 
e
#>   subject medication  startdate
#> 2       1       med2 2016-02-02
#function, e$startdate for 'sd',  d$visitnumber for 'vn' and d$visitdate for'vd'
overlap <- function(sd, vd, vn){
  if (sd >= vd -2 & sd <= vd +2 ) {
    return(vn)}
  else if (sd < min(vd)) { 
    return("before")}
  else if (sd > max(vd)) {
    return("after")}
  else {
    return("in-between")}
}
test<- overlap(e$startdate, d$visitdate, d$visitnumber)
#> Warning in if (sd >= vd - 2 & sd <= vd + 2) {: the condition has length > 1 and
#> only the first element will be used
test
#> [1] 1 2 3
#I read that error can be removed by vectorizing:
f_vec <- Vectorize(overlap, vectorize.args = c("sd", "vd", "vn"))
test1 <- f_vec(e$startdate, d$visitdate, d$visitnumber)
test1
#> [1] "1"      "before" "before"

Created on 2020-05-14 by the reprex package (v0.3.0)

I only need the first output of test1 ("1" here).
Then the function needs to iterate over rows, or maybe grouped by subjects, or another conditional statement needs to be added, to have "period" in dataframe 'a' on all subjects for all prescribed medications.
Any help how to proceed is greatly appreciated!

Not sure if I've got it right but here's an attempt with just the records for subject 1. Can you check and let me know if this is what you are after?

library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

a <- data.frame(stringsAsFactors = FALSE,
                subject = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                medication = c('med1', 'med2', 'med3', 'med4', 'med5', 'med6', 'med7', 'med8', 'med9'),
                startdate = c('01-01-2016', '02-02-2016', '10-02-2016', '02-03-2016', '07-03-2015', '10-12-2015', '06-01-2018', '08-03-2018', '11-04-2018'))

a$startdate <- as.Date(a$startdate, format ="%d-%m-%Y") 

b <- data.frame(stringsAsFactors = FALSE,  
                subject = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                visitnumber = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                visitdate = c('02-02-2016', '03-03-2016', '05-04-2016', '04-02-2015', '06-03-2015', '07-04-2015', '06-02-2018', '08-03-2018', '10-04-2018'))

b$visitdate <- as.Date(b$visitdate, format ="%d-%m-%Y")

a_sub <- filter(a, subject == 1)
b_sub <- filter(b, subject == 1)

a_sub %>% 
  full_join(b_sub, by = "subject") %>% 
  mutate(period = case_when(startdate < visitdate - days(2) ~ "Before",
                            startdate > visitdate + days(2) ~ "After",
                            TRUE ~ "In-between")) %>% 
  ungroup()
#>    subject medication  startdate visitnumber  visitdate     period
#> 1        1       med1 2016-01-01           1 2016-02-02     Before
#> 2        1       med1 2016-01-01           2 2016-03-03     Before
#> 3        1       med1 2016-01-01           3 2016-04-05     Before
#> 4        1       med2 2016-02-02           1 2016-02-02 In-between
#> 5        1       med2 2016-02-02           2 2016-03-03     Before
#> 6        1       med2 2016-02-02           3 2016-04-05     Before
#> 7        1       med3 2016-02-10           1 2016-02-02      After
#> 8        1       med3 2016-02-10           2 2016-03-03     Before
#> 9        1       med3 2016-02-10           3 2016-04-05     Before
#> 10       1       med4 2016-03-02           1 2016-02-02      After
#> 11       1       med4 2016-03-02           2 2016-03-03 In-between
#> 12       1       med4 2016-03-02           3 2016-04-05     Before

Created on 2020-05-14 by the reprex package (v0.3.0)

Basically! I made column "new.period" showing what I mean.

library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

a <- data.frame(stringsAsFactors = FALSE,
                subject = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                medication = c('med1', 'med2', 'med3', 'med4', 'med5', 'med6', 'med7', 'med8', 'med9'),
                startdate = c('01-01-2016', '02-02-2016', '10-02-2016', '02-03-2016', '07-03-2015', '10-12-2015', '06-01-2018', '08-03-2018', '11-04-2018'))

a$startdate <- as.Date(a$startdate, format ="%d-%m-%Y") 

b <- data.frame(stringsAsFactors = FALSE,  
                subject = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                visitnumber = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                visitdate = c('02-02-2016', '03-03-2016', '05-04-2016', '04-02-2015', '06-03-2015', '07-04-2015', '06-02-2018', '08-03-2018', '10-04-2018'))

b$visitdate <- as.Date(b$visitdate, format ="%d-%m-%Y")

a_sub <- filter(a, subject == 1)
b_sub <- filter(b, subject == 1)

ab <- a %>% 
  full_join(b, by = "subject") %>% 
  mutate(period = case_when(startdate >= visitdate - days(2) & startdate <= visitdate + days(2) ~ as.character(visitnumber),
                            startdate < visitdate - days(2) ~ "Before",
                            startdate > visitdate + days(2) ~ "After",
                            TRUE ~ "In-between")) %>% 
  ungroup()
#'In-between' didn't work here

ab$leadm <-lead(ab$medication)
ab$leadp <-lead(ab$period)
ab <- mutate(ab, new.period = case_when(medication == leadm & period == 'After' & leadp == 'Before' ~ 'Inbetween',TRUE ~ period))
ab
#>    subject medication  startdate visitnumber  visitdate period leadm  leadp
#> 1        1       med1 2016-01-01           1 2016-02-02 Before  med1 Before
#> 2        1       med1 2016-01-01           2 2016-03-03 Before  med1 Before
#> 3        1       med1 2016-01-01           3 2016-04-05 Before  med2      1
#> 4        1       med2 2016-02-02           1 2016-02-02      1  med2 Before
#> 5        1       med2 2016-02-02           2 2016-03-03 Before  med2 Before
#> 6        1       med2 2016-02-02           3 2016-04-05 Before  med3  After
#> 7        1       med3 2016-02-10           1 2016-02-02  After  med3 Before
#> 8        1       med3 2016-02-10           2 2016-03-03 Before  med3 Before
#> 9        1       med3 2016-02-10           3 2016-04-05 Before  med4  After
#> 10       1       med4 2016-03-02           1 2016-02-02  After  med4      2
#> 11       1       med4 2016-03-02           2 2016-03-03      2  med4 Before
#> 12       1       med4 2016-03-02           3 2016-04-05 Before  med5  After
#> 13       2       med5 2015-03-07           1 2015-02-04  After  med5      2
#> 14       2       med5 2015-03-07           2 2015-03-06      2  med5 Before
#> 15       2       med5 2015-03-07           3 2015-04-07 Before  med6  After
#> 16       2       med6 2015-12-10           1 2015-02-04  After  med6  After
#> 17       2       med6 2015-12-10           2 2015-03-06  After  med6  After
#> 18       2       med6 2015-12-10           3 2015-04-07  After  med7 Before
#> 19       3       med7 2018-01-06           1 2018-02-06 Before  med7 Before
#> 20       3       med7 2018-01-06           2 2018-03-08 Before  med7 Before
#> 21       3       med7 2018-01-06           3 2018-04-10 Before  med8  After
#> 22       3       med8 2018-03-08           1 2018-02-06  After  med8      2
#> 23       3       med8 2018-03-08           2 2018-03-08      2  med8 Before
#> 24       3       med8 2018-03-08           3 2018-04-10 Before  med9  After
#> 25       3       med9 2018-04-11           1 2018-02-06  After  med9  After
#> 26       3       med9 2018-04-11           2 2018-03-08  After  med9      3
#> 27       3       med9 2018-04-11           3 2018-04-10      3  <NA>   <NA>
#>    new.period
#> 1      Before
#> 2      Before
#> 3      Before
#> 4           1
#> 5      Before
#> 6      Before
#> 7   Inbetween
#> 8      Before
#> 9      Before
#> 10      After
#> 11          2
#> 12     Before
#> 13      After
#> 14          2
#> 15     Before
#> 16      After
#> 17      After
#> 18      After
#> 19     Before
#> 20     Before
#> 21     Before
#> 22      After
#> 23          2
#> 24     Before
#> 25      After
#> 26      After
#> 27          3

Created on 2020-05-14 by the reprex package (v0.3.0)

Your answer is already really helpful. The only problem now is that medication has duplicates (I don't need (all values of) 'visitnumber' and 'visitdate' in the last table).
So for 'med1', I would only need 1 row, with in the 'new.period' column the value 'Before'. For 'med2', I only need value '1'. For 'med3', I only need 'In-between'. For 'med4', I only need '2', etc.

EDIT: as in:

   subject medication  startdate  new.period
#> 1   1    med 1     2016-01-01  Before
#> 2   1    med 2     2016-02-02      1
#> 3   1    med 3     2016-02-10  Inbetween
#> 4   1    med 4     2016-03-02      2
etc.

Would you have a solution for that?

Can you please explain the logic to determine which rows should be retained? The duplicates result from using full_join() which keeps all combinations because I couldn't see any way of associating the two tables besides subject. If there is, we could use a different join instead.

Indeed only subject occurs in both dataframes a and b, so I don't know how to use the join function otherwise.

To determine the rows to retain:
For each unique row in the combination of a$subject, a$medication and a$startdate, I want to compare the date in a$startdate with b$visitdate, if b$subject equals a$subject.
However, the following logic in b$visitdate does not apply to a single row element, but works columnwise, which I initially tried with overlap():

  • if a$startdate occurs within 2 days of b$visitdate, return the corresponding b$visitnumber .
  • if else a$startdate is earlier than any of b$visitdate, return "Before".
  • if else a$startdate is after any of b$visitdate, return "After".
  • else a$startdate occurs in-between two dates in b$visitdate, return "Inbetween". However, defining it as else statement was not specific enough, so in my second post I redefined "Inbetween" in ab$new.period using lead() and case_when().

How about this then? Since there's no join condition, I'm retaining those rows where the difference between startdate and visitdate is minimum for each medication (i.e. the closest visit and medication start date pair). It seems to match your desired output.

library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

a <- data.frame(stringsAsFactors = FALSE,
                subject = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                medication = c('med1', 'med2', 'med3', 'med4', 'med5', 'med6', 'med7', 'med8', 'med9'),
                startdate = c('01-01-2016', '02-02-2016', '10-02-2016', '02-03-2016', '07-03-2015', '10-12-2015', '06-01-2018', '08-03-2018', '11-04-2018'))

a$startdate <- as.Date(a$startdate, format ="%d-%m-%Y") 

b <- data.frame(stringsAsFactors = FALSE,  
                subject = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                visitnumber = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                visitdate = c('02-02-2016', '03-03-2016', '05-04-2016', '04-02-2015', '06-03-2015', '07-04-2015', '06-02-2018', '08-03-2018', '10-04-2018'))

b$visitdate <- as.Date(b$visitdate, format ="%d-%m-%Y")

a_sub <- filter(a, subject == 1)
b_sub <- filter(b, subject == 1)

a_sub %>% 
  left_join(b_sub, by = "subject") %>% 
  group_by(subject, medication) %>% 
  mutate(days_diff = abs(startdate - visitdate)) %>% 
  filter(days_diff == min(days_diff)) %>% 
  ungroup() %>% 
  mutate(period = case_when(startdate >= visitdate - days(2) & startdate <= visitdate + days(2) ~ as.character(visitnumber), 
                            startdate >= min(visitdate) & startdate <= max(visitdate) ~ "In-between", 
                            startdate < visitdate - days(2) ~ "Before", 
                            startdate > visitdate + days(2) ~ "After")) %>% 
  select(-c(visitnumber:days_diff))
#> # A tibble: 4 x 4
#>   subject medication startdate  period    
#>     <dbl> <chr>      <date>     <chr>     
#> 1       1 med1       2016-01-01 Before    
#> 2       1 med2       2016-02-02 1         
#> 3       1 med3       2016-02-10 In-between
#> 4       1 med4       2016-03-02 2

Created on 2020-05-15 by the reprex package (v0.3.0)

Note: Some changes will be needed to make it work with multiple subject IDs, so let me know if this is fine first.

1 Like

Yes this is great! With reordering of your code, I make a new column for all subjects, hope you had something similar in mind:

library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)

a <- data.frame(stringsAsFactors = FALSE,
                subject = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                medication = c('med1', 'med2', 'med3', 'med4', 'med5', 'med6', 'med7', 'med8', 'med9'),
                startdate = c('01-01-2016', '02-02-2016', '10-02-2016', '02-03-2016', '07-03-2015', '10-12-2015', '06-01-2018', '08-03-2018', '11-04-2018'))

a$startdate <- as.Date(a$startdate, format ="%d-%m-%Y") 

b <- data.frame(stringsAsFactors = FALSE,  
                subject = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                visitnumber = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                visitdate = c('02-02-2016', '03-03-2016', '05-04-2016', '04-02-2015', '06-03-2015', '07-04-2015', '06-02-2018', '08-03-2018', '10-04-2018'))

b$visitdate <- as.Date(b$visitdate, format ="%d-%m-%Y")

ab <-  a %>% 
  left_join(b, by = "subject") %>% 
  group_by(subject, medication) %>% 
  mutate(days_diff = abs(startdate - visitdate)) %>% 
  mutate(period = case_when(startdate >= visitdate - days(2) & startdate <= visitdate + days(2) ~ as.character(visitnumber), 
                            startdate >= min(visitdate) & startdate <= max(visitdate) ~ "In-between", 
                            startdate < visitdate - days(2) ~ "Before", 
                            startdate > visitdate + days(2) ~ "After"))   %>%
  filter(days_diff == min(days_diff)) %>% 
  select(-c(visitnumber:days_diff))
ab
#> # A tibble: 9 x 4
#> # Groups:   subject, medication [9]
#>   subject medication startdate  period    
#>     <dbl> <chr>      <date>     <chr>     
#> 1       1 med1       2016-01-01 Before    
#> 2       1 med2       2016-02-02 1         
#> 3       1 med3       2016-02-10 In-between
#> 4       1 med4       2016-03-02 2         
#> 5       2 med5       2015-03-07 2         
#> 6       2 med6       2015-12-10 After     
#> 7       3 med7       2018-01-06 Before    
#> 8       3 med8       2018-03-08 2         
#> 9       3 med9       2018-04-11 3

Created on 2020-05-15 by the reprex package (v0.3.0)

That works but I think it might be safer to regroup by subject alone before the case_when() step since we want to use min/max values pertaining to that subject alone. This is what I had in mind.

# Code blocks to create data frames a and b not shown.

a %>% 
  left_join(b, by = "subject") %>% 
  group_by(subject, medication) %>% 
  mutate(days_diff = abs(startdate - visitdate)) %>% 
  filter(days_diff == min(days_diff)) %>% 
  group_by(subject) %>% 
  mutate(period = case_when(startdate >= visitdate - days(2) & startdate <= visitdate + days(2) ~ as.character(visitnumber), 
                            startdate >= min(visitdate) & startdate <= max(visitdate) ~ "In-between", 
                            startdate < visitdate - days(2) ~ "Before", 
                            startdate > visitdate + days(2) ~ "After")) %>% 
  select(-c(visitnumber:days_diff))
#> # A tibble: 9 x 4
#> # Groups:   subject [3]
#>   subject medication startdate  period    
#>     <dbl> <chr>      <date>     <chr>     
#> 1       1 med1       2016-01-01 Before    
#> 2       1 med2       2016-02-02 1         
#> 3       1 med3       2016-02-10 In-between
#> 4       1 med4       2016-03-02 2         
#> 5       2 med5       2015-03-07 2         
#> 6       2 med6       2015-12-10 After     
#> 7       3 med7       2018-01-06 Before    
#> 8       3 med8       2018-03-08 2         
#> 9       3 med9       2018-04-11 3

Created on 2020-05-15 by the reprex package (v0.3.0)

Thank you for your help and final comment!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.