Choosing the best of 3 observations based on date and identification

I'm currently tidying a dataset as part of a clinical project I am working on. I am trying to extract the best of the measurements for each parameter (e.g. FEV1/FVC/FEF) per patient when there are duplicate dates on which their measurements were taken. I'll attach my reprex at the end to explain further.

Basically I want to make a new variable for each parameter with the best of measurements, then combine them together with the other variables to make a new, refined dataframe in order to have it matched in size to another dataframe I am using.

Was wondering if anyone would be able to help me with this, as so the only way I've found is to do it manually. This works but I'd rather not do that due to the risk of human error and it being quite time consuming (I intend to use my script on multiple other sites in this project that could have larger data and thus take more time to refine).

Here is the reprex. Thanks for any help that can be provided.

#Test data frame with some extracted values
#This is just a sample of one patient, but it shows the duplicate dates I am referring to

data <- data.frame(PATIENT_ID = c(-2147483646, -2147483646, -2147483646, -2147483646, -2147483646, -2147483646, -2147483646, -2147483646, -2147483646, -2147483646), 
                   DATE = c("2018-08-06", "2018-08-20", "2018-08-20", "2018-08-20", "2018-10-01", "2018-10-01", "2018-10-01", "2019-02-04", "2019-02-04", "2019-02-04"),
                   FEV1_L = c(3.974, 4.024, 4.053, 4.046, 4.241, 4.213, 3.944, 4.027, 3.984, 4.247),
                   FVC_L = c(5.198, 5.250, 5.043, 5.286, 5.305, 5.284, 4.922, 5.255, 5.426, 5.526),
                   FEF_LM = c(199.2, 196.8, 226.9, 199.3, 249.0, 237.7, 225.2, 223.8, 205.1, 226.9))


#Manual attempt to extract the best measurements and make them into separate variables

data.fev1 <- data[c(3, 5, 10),]
data.fvc <- data[c(4, 5, 10),]
data.fef <- data[c(3, 5, 10),]

#Combining manually extracted best of together with rest of data frame to make a new refined dataframe

data.new <- dplyr::bind_rows(data.fev1, data.fvc)
data.final <- dplyr::bind_rows(data.new, data.fef)


#Alternative that results in the same but isn't done manually?

I believe you can use this sort of approach to repace

data.fev1 <- data[c(3, 5, 10),]

with

data.fev1 <- top_n(group_by(data,PATIENT_ID),n = 3,wt = FEV1_L)

where n=3 gets the maximum 3 fev1_l values seen for the patient_id

Would it be possible to use this to also include date with patient ID?

As the same patient has multiple visits, so there are duplicate dates but I wish to select the best of 3 of each parameter for those duplicate dates (e.g. 2018-08-20).

data.fev1 <- top_n(group_by(data,PATIENT_ID,DATE),n = 3,wt = FEV1_L)

would give top 3 per patient per date.
If a patients has 3 dates, and 3 observations on each, you would get 9 measures from this.

if you have many variables to do this for(pick top 3 of), that aspect of repeating the approach for different measure variables can be generalised as well; recommend purrr packages map functions. Would that be useful, or overkill for the scenario ?

Unfortunately it isn't what I'm looking for, as it still includes duplicate dates. But thank you for the suggestion.

I tried this and it seems to have work, checking that it hasn't disregarded PATIENT_ID however. This is from the actual data.

Home.raw7.FEV1 <- setDT(Home.raw6)[, .SD[which.max(FEV1_L)], DATE] #Choosing best of FEV1 readings

your solution involving setDT is equivalent to the form:

top_n(group_by(data,PATIENT_ID,DATE),n = 1,wt = FEV1_L)

i.e. take the single top value from every patient/date combination

I'm thinking perhaps you mean to do a first pass where you take the single highest value per patient per date.
then a second pass to find the top 3 values by patient regardless of date.

If its not that then I'm stumped :laughing:

Does this work for you?

library(dplyr)

max_check <- function(x) {x == max(x)}

data.final <- data %>% 
  group_by(PATIENT_ID, DATE) %>% 
  mutate_at(
    c("FEV1_L", "FVC_L", "FEF_LM"), 
    list(max = max_check)
  ) %>% 
  ungroup() %>% 
  filter_at(
    vars(ends_with("_max")), 
    any_vars(. == TRUE)
  ) %>% 
  select_at(
    vars(!ends_with("_max"))
  )

This could be made more robust, but should be a good strting point.

2 Likes

This looks promising, but unfortunately it will not work on my real dataset for whatever reason, despite being set out the same.

I'm greeted with the following error:

Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "function"

Hi, this code, while not very elegant nor concise (still a beginner here), on the reprex does the job. Check it out:

data.final <- 
  bind_rows(
  data %>% 
    group_by(PATIENT_ID, DATE) %>% 
    mutate(maxfev = max(FEV1_L)) %>% 
    filter(FEV1_L==maxfev) %>% 
    ungroup %>% 
    top_n(n=3, maxfev)
  ,
  data %>% 
    group_by(PATIENT_ID, DATE) %>% 
    mutate(maxfef = max(FEF_LM)) %>% 
    filter(FEF_LM==maxfef) %>% 
    ungroup %>% 
    top_n(n=3, maxfef)
  ,
  data %>% 
    group_by(PATIENT_ID, DATE) %>% 
    mutate(maxfvc = max(FVC_L)) %>% 
    filter(FVC_L==maxfvc) %>% 
    ungroup %>% 
    top_n(n=3, maxfvc)
  ) %>% 
  select(-starts_with("max"))

EDIT: I'll throw a bunch of distinct in as well, to protect from 2 identical measurements at the same date

data.final <- 
  bind_rows(
    data %>% 
      group_by(PATIENT_ID, DATE) %>% 
      mutate(maxfev = max(FEV1_L)) %>% 
      filter(FEV1_L==maxfev) %>% 
      ungroup %>% 
      distinct(DATE, FEV1_L, .keep_all = T) %>% 
      top_n(n=3, maxfev)
    ,
    data %>% 
      group_by(PATIENT_ID, DATE) %>% 
      mutate(maxfef = max(FEF_LM)) %>% 
      filter(FEF_LM==maxfef) %>% 
      ungroup %>% 
      distinct(DATE, FEF_LM, .keep_all = T) %>% 
      top_n(n=3, maxfef)
    ,
    data %>% 
      group_by(PATIENT_ID, DATE) %>% 
      mutate(maxfvc = max(FVC_L)) %>% 
      filter(FVC_L==maxfvc) %>% 
      ungroup %>% 
      distinct(DATE, FVC_L, .keep_all = T) %>% 
      top_n(n=3, maxfvc)
  ) %>% 
  select(-starts_with("max"))
2 Likes

I tried this again and it has worked! Fantastic!

Thanks very much for the help.

1 Like

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.