Working with Nested Data Frames

This is a problem I've solved, but not to my liking. I have a data frame with 4884 observations of 5 columns, and I am trying to use tidyr::nest and purrr::map to build a nested data frame for use in a visualization. Here is the solution I first tried:

  rx_info1 <- rx_post %>%  
    nest(-mem_id, -`Reporting Name`) %>% 
    mutate(refill_df = map(data, function(df){
      is_refill <- df %>%
        group_by(group) %>% 
        mutate(tally = n(), # for testing
               is_refill = n() > 1) %>%
        select(group, is_refill) %>%
        distinct() %>%
        filter(!is.na(group))
      left_join(df, is_refill, by = 'group')
    }))

In that solution, each of the nested data frames in the refill_df column of rx_info1 has tally == 4884, which is the number of rows in the original data frame, and is_refill == FALSE. However, this solution works:

rx_info2 <- rx_post %>%  
 nest(-mem_id, -`Reporting Name`)

temp <- map(rx_info2$data, function(df){
 is_refill <- df %>%
   group_by(group) %>% 
   mutate(tally = n(), # for testing
          is_refill = n() > 1) %>%
   select(group, is_refill) %>%
   distinct() %>%
   filter(!is.na(group))
 left_join(df, is_refill, by = 'group')
})

rx_info2$refill_df <- temp

In this case, tally is indeed the count of rows for each "group" value within each individual nested data frame, and is_refill gives me the correct boolean value.

For some reason (which hopefully some of you can relate to), I would like all of this to live within a single pipe chain. Any help with this is much appreciated.

Can you try to make your question a bit more general and provide a reprex? At the moment it depends on rx_post, for example, and it is unclear what that is and what its structure is. Also, you say that you are doing this to create a visualization, but there is no visualization code, so maybe your problem can be solved more easily without the approach you've come up with.
In the meantime, I'm not sure I understand how you even have a tally column in rx_info1, since you drop it in the line select(group, is_refill), so it won't be in the is_refill data frame that you left_join with df.

EDIT: Actually, I think you've come across a limitation in dplyr itself. You can read about it here: https://github.com/tidyverse/dplyr/issues/2080.
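
If the cause is what the linked issue describes, and n() is being misinterpreted because it appears inside an expression nested in the outer mutate(), one possible workaround is to move the per-group logic into a named function defined outside the pipe and pass that function to map(). The sketch below is only an illustration under assumptions: rx_post here is made-up toy data (the real structure isn't shown), flag_refills is a hypothetical helper name, count() stands in for the group_by() / mutate(n()) step, and the nest(data = ...) syntax assumes tidyr 1.0 or later. I haven't checked it against the dplyr version used in the question.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy stand-in for rx_post; the real columns are not shown in the thread.
rx_post <- tibble(
  mem_id = c(1, 1, 1, 2, 2),
  `Reporting Name` = c("drug_a", "drug_a", "drug_a", "drug_b", "drug_b"),
  group = c(1, 1, 2, 1, NA)
)

# Hypothetical helper, defined outside the pipe so that n() never
# appears literally inside the outer mutate() expression.
flag_refills <- function(df) {
  is_refill <- df %>%
    filter(!is.na(group)) %>%        # ignore rows with no group
    count(group, name = "tally") %>% # rows per group within this nested df
    mutate(is_refill = tally > 1)    # more than one row means a refill
  left_join(df, is_refill, by = "group")
}

# Everything stays in a single pipe chain.
rx_info <- rx_post %>%
  nest(data = -c(mem_id, `Reporting Name`)) %>%
  mutate(refill_df = map(data, flag_refills))
```

Unlike the original code, this keeps the tally column in each nested data frame, which makes it easy to check that the counts are per group rather than for the whole table.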

Yep! That's exactly the issue. Glad to know somebody tagged the issue on GitHub.