Tidy data with row for each combination of values in multiple variables

brad.cannell · September 24, 2018, 10:10pm

I have a dataset that I am trying to tidy.

I'm using Hadley's paper -- and the corresponding R4DS chapter -- as a guide. However, I'm not seeing a one-to-one correlation between my situation and the examples in the article/chapter. So, I would appreciate a little feedback from the community.

Setup:

I have a relatively large dataset of electronic health records.
For each patient encounter, there are multiple rows. The number of rows is equal to the number of combinations of several variables that are able to take multiple values. In the example below, I'm using employee id (emp_id), medical history (med_hist), and current medications (meds).
The number of values varies from encounter to encounter.

Here is a small illustrative example:


library(tidyverse)

df <- tibble::tibble(
  encounter_id = c(rep(101, 12), rep(102, 2)),
  date_entered = as.Date(rep("2015-09-17", 14)),
  gender =  c(rep("f", 12), rep("m", 2)),
  emp_id = c(rep(119, 6), rep(430, 6), 286, 559),
  med_hist = c(rep(c(rep("diabetes", 3), rep("hypertension", 3)), 2), rep("mi", 2)),
  meds = c(rep(c("asa", "avandamet", "claritin"), 4), rep("lipitor", 2))
) %>% 
  print()

# A tibble: 14 x 6
   encounter_id date_entered gender emp_id med_hist     meds     
          <dbl> <date>       <chr>   <dbl> <chr>        <chr>    
 1          101 2015-09-17   f         119 diabetes     asa      
 2          101 2015-09-17   f         119 diabetes     avandamet
 3          101 2015-09-17   f         119 diabetes     claritin 
 4          101 2015-09-17   f         119 hypertension asa      
 5          101 2015-09-17   f         119 hypertension avandamet
 6          101 2015-09-17   f         119 hypertension claritin 
 7          101 2015-09-17   f         430 diabetes     asa      
 8          101 2015-09-17   f         430 diabetes     avandamet
 9          101 2015-09-17   f         430 diabetes     claritin 
10          101 2015-09-17   f         430 hypertension asa      
11          101 2015-09-17   f         430 hypertension avandamet
12          101 2015-09-17   f         430 hypertension claritin 
13          102 2015-09-17   m         286 mi           lipitor  
14          102 2015-09-17   m         559 mi           lipitor

In this toy example, I have 14 rows for just two hypothetical encounters. In the real data, I have some encounters with 50+ rows. This just doesn't feel efficient/tidy to me.

Having said that, I believe that in its current form:

Each variable is placed in its own column.
Each observation (defined as combination of encounter/emp_id/med_hist/meds) is its own row.
Each value is placed in its own cell.

As an extra wrinkle, I need to join this dataset with another dataset that has one or more rows per encounter. This currently results in an even larger dataset with all possible combinations of variables. Therefore, it would be nice to have a single row for each encounter.

Another option for organizing this data that occurred to me was to reshape the data to a wide format. For example, have a variable called med_hist_1, med_hist_2 ... med_hist_n (repeat for emp_id, meds, etc). That would result in each row corresponding to a single encounter. However, it would also be inefficient in the sense that it would add a lot of NA values to the data.

So, I would appreciate your thoughts about whether or not this data is already in a "tidy" format or if it should be reorganized. If so, how?

joels · September 25, 2018, 6:24am

Would it make more sense to consider the unit of observation to be an encounter_id for a given emp_id? I don't know enough about how your real data set is structured or how it will be used, but below are a couple of possibilities.

You could nest med_hist and meds into lists for each encounter:

d1 = df %>% 
  group_by(encounter_id, date_entered, gender, emp_id) %>% 
  summarise(med_hist = list(unique(med_hist)),
            meds = list(unique(meds)))

d1

  encounter_id date_entered gender emp_id med_hist  meds     
         <dbl> <date>       <chr>   <dbl> <list>    <list>   
1          101 2015-09-17   f         119 <chr [2]> <chr [3]>
2          101 2015-09-17   f         430 <chr [2]> <chr [3]>
3          102 2015-09-17   m         286 <chr [1]> <chr [1]>
4          102 2015-09-17   m         559 <chr [1]> <chr [1]>

Or, to see what's inside each list:

as.data.frame(d1)

  encounter_id date_entered gender emp_id               med_hist                     meds
1          101   2015-09-17      f    119 diabetes, hypertension asa, avandamet, claritin
2          101   2015-09-17      f    430 diabetes, hypertension asa, avandamet, claritin
3          102   2015-09-17      m    286                     mi                  lipitor
4          102   2015-09-17      m    559                     mi                  lipitor

The code above doesn't assume any necessary linkage between med_hist and meds. They're listed independently for each encounter. But maybe you want meds to go with med_hist. In that case, maybe there should be a row for each med_hist for each unique combination of encounter_id and emp_id (though that results in more than one row per encounter if a patient has more than one condition). See the example below.

Your data sample has the exact same meds for each med_hist but this data structure allows different meds for each med_hist for a given emp_id and encounter. But you probably have cases where the same drug is prescribed for more than one condition, so the same med could appear in two different med_hist rows for the same encounter_id and emp_id.

d2 = df %>% 
  group_by(encounter_id, date_entered, gender, emp_id, med_hist) %>% 
  summarise(meds = list(unique(meds)))

d2

  encounter_id date_entered gender emp_id med_hist     meds     
         <dbl> <date>       <chr>   <dbl> <chr>        <list>   
1          101 2015-09-17   f         119 diabetes     <chr [3]>
2          101 2015-09-17   f         119 hypertension <chr [3]>
3          101 2015-09-17   f         430 diabetes     <chr [3]>
4          101 2015-09-17   f         430 hypertension <chr [3]>
5          102 2015-09-17   m         286 mi           <chr [1]>
6          102 2015-09-17   m         559 mi           <chr [1]>

I haven't worked with nested data frames very much and I'm not sure if any of this gets you closer to where you want to go. It would help to have a better idea of how the data set is generated and what you're trying to learn from it.

rensa · September 25, 2018, 10:12am

I agree with @joels: nested data frames add some complexity here, but the fact is that your data is complex, and this is the most natural way to structure it. Having a long format with multiple rows for each medical history is also reasonable, but if you have multiple variables that work this way then the dataset quickly becomes difficult to parse.

If this were stored in a database, you'd likely have one table for encounters and another for medical history and medications (or maybe one for each of those), and they tables would be linked through the encounter.

Nested structures gives you a similar benefit, as they let you maintain 1:many relationships with multiple variables simultaneously. As @joels points you, the specific way you do this depends on those relationships. If it's 1:many for encounter-to-medical history and 1:1 for medical history-to-medicine, you might want to consider a nested data frame. If it's 1:many, many:1 or many:many for medical history-to-medicine, though, nested lists like @joels demonstrates might be better

EDIT: one downside of list columns or nested data structures is that they do make exporting to CSV more difficult! You might need to unnest first.

nwerth · September 25, 2018, 1:36pm

It depends on what you want to do.

@rensa is right, that this type of data would logically be stored as separate tables. This would be simple to update and simple to adapt to any analysis.

List columns do avoid the Cartesian product of history and medication. Of course, that super-long form might be useful if you're looking for medications that shouldn't be given to people with a certain history.

And, because I can't resist throwing my code into the ring, here's a version that would be useful for regression:

all_conditions <- unique(df[["med_hist"]])
all_meds       <- unique(df[["meds"]])

df[all_conditions] <- lapply(
  all_conditions,
  function(condition) df[["med_hist"]] == condition
)
df[all_meds] <- lapply(
  all_meds,
  function(med) df[["meds"]] == med
)

regression_ready <- df %>%
  group_by(encounter_id, date_entered, gender, emp_id) %>%
  summarise_at(
    .vars = c(all_conditions, all_meds),
    .funs = any
  ) %>%
  print()
# # A tibble: 4 x 11
# # Groups:   encounter_id, date_entered, gender [?]
#   encounter_id date_entered gender emp_id diabetes hypertension mi    asa   avandamet claritin lipitor
#          <dbl> <date>       <chr>   <dbl> <lgl>    <lgl>        <lgl> <lgl> <lgl>     <lgl>    <lgl>  
# 1          101 2015-09-17   f         119 TRUE     TRUE         FALSE TRUE  TRUE      TRUE     FALSE  
# 2          101 2015-09-17   f         430 TRUE     TRUE         FALSE TRUE  TRUE      TRUE     FALSE  
# 3          102 2015-09-17   m         286 FALSE    FALSE        TRUE  FALSE FALSE     FALSE    TRUE   
# 4          102 2015-09-17   m         559 FALSE    FALSE        TRUE  FALSE FALSE     FALSE    TRUE

brad.cannell · September 25, 2018, 3:38pm

Hi Joel,

Thank you for the feedback! I apologize for being a little bit vague about the origin and purpose of the data. Unfortunately, I'm a little limited in what I can say for legal/ethical/privacy reasons. I think I can say this:

We actually have two datasets: one is from a healthcare organization and one is from a state government.
We are interested in the relationship between a number of exposures in the healthcare dataset and a couple of outcomes in the government dataset.
There is no unique identifier that is common to both datasets
I'm trying to simplify/tidy each to the greatest extent possible before I attempt to link records in the datasets based on name and date of birth (both datasets have the same geographic and time restrictions).

Anyway, it's kind of a hot mess. I definitely think your idea about using nested columns is a good one. I've been playing around with them some already. Like you, I haven't worked with them much in the past, and it isn't 100% clear to me how they fit into the "tidy data" paradigm.

In any case, thanks again and I will repost if I have any breakthroughs!

brad.cannell · September 25, 2018, 3:40pm

Hi rensa,

Thank you for the feedback! I'm am going to play around with nested columns as you and Joel suggested. I'll repost if I discover anything interesting that may be of interest to the group.

Brad

brad.cannell · September 25, 2018, 3:43pm

Hi nwerth,

Thank you for the feedback! Yes, the creation of dummy variables is another twist that had occurred to me. Of course, in the real dataset this would lead to the creation of hundreds of new variables, but that may not necessarily be a problem -- we also have many thousands of encounters.

Brad