Mutate doesn't work after subsetting with slice_min

Hello! I doing a simple recode of an age categories variable, AGE, in order to get fewer categories. The command works completely fine, until I try to do it on a subset of the data. Something goes wrong after I subset the data with slice_min() to only include the first row of each YEAR. Does anyone understand why?!?

this part works on my original df, reprex

reprex <- reprex %>% 
  mutate(age_r = ifelse(test = (reprex$AGE == 2 | reprex$AGE == 3), yes = 2, 
                            ifelse(test = (reprex$AGE == 4 | reprex$AGE == 5), yes = 3, 
                                   ifelse(test = (reprex$AGE == 6 | reprex$AGE == 7), yes = 4,
                                          ifelse(test = (reprex$AGE == 8 | reprex$AGE == 9), yes = 5, 
                                                 no = reprex$AGE)))))

no longer works after using slice() to only include the first year for each ID.

reprex_2 <- reprex %>% group_by(ID) %>% slice_min(YEAR, n=1)

reprex_2 <- reprex_2 %>% 
  mutate(age_r = ifelse(test = (reprex_2$AGE == 2 | reprex_2$AGE == 3), yes = 2, 
                        ifelse(test = (reprex_2$AGE == 4 | reprex_2$AGE == 5), yes = 3, 
                               ifelse(test = (reprex_2$AGE == 6 | reprex_2$AGE == 7), yes = 4,
                                      ifelse(test = (reprex_2$AGE == 8 | reprex_2$AGE == 9), yes = 5, 
                                             no = reprex_2$AGE)))))

(Note that: I also tried the equivalent command filter(row_number() == 1L)), instead of slice_min.

code to reproduce reprex:
reprex <- structure(list(ID = c(2079682L, 8114104L, 2079682L, 8114104L, 
                      8313555L, 8325419L, 2079682L, 8114104L, 8337003L, 8325419L, 8313555L, 
                      2079682L, 8114104L, 8337003L, 8384153L, 8313555L, 30146527L, 
                      8325419L, 2079682L, 8325419L, 8313555L, 8114104L, 30356349L, 
                      8384153L, 30146527L, 2079682L, 8384153L, 30146527L, 8325419L, 
                      8114104L, 30356349L, 2079682L, 8313555L, 8325419L, 8384153L, 
                      8114104L, 30356349L, 30146527L, 2079682L, 8384153L, 8114104L, 
                      8325419L, 30356349L, 2079682L, 8114104L, 8325419L, 8926672L, 
                      8313555L, 8384153L, 30356349L, 2079682L, 8325419L, 8926672L, 
                      30356349L, 8114104L, 8384153L, 2079682L, 8114104L, 8384153L, 
                      8325419L, 8926672L, 8313555L, 30356349L, 2079682L, 8313555L, 
                      8926672L, 8114104L, 8384153L, 30356349L, 8325419L, 2079682L, 
                      8114104L, 8384153L, 30356349L, 8313555L, 8325419L, 2079682L, 
                      4596268L, 8313555L, 8325419L, 8384153L, 30356349L, 2079682L, 
                      4596268L, 8313555L, 8325419L, 8384153L, 8926672L, 30356349L, 
                      4596268L, 2079682L, 8384153L, 8325419L, 8313555L, 8926672L, 30356349L, 
                      4596268L, 8325419L, 8384153L, 30356349L), AGE = c(8L, 7L, 8L, 
                                                                        7L, 0L, 4L, 8L, 7L, 3L, 4L, 0L, 8L, 8L, 3L, 8L, 0L, 8L, 4L, 8L, 
                                                                        4L, 0L, 8L, 8L, 8L, 8L, 9L, 8L, 8L, 4L, 8L, 8L, 9L, 0L, 5L, 8L, 
                                                                        8L, 8L, 8L, 9L, 8L, 8L, 5L, 8L, 9L, 8L, 5L, 7L, 0L, 9L, 8L, 9L, 
                                                                        5L, 8L, 8L, 8L, 9L, 9L, 8L, 9L, 5L, 8L, 0L, 8L, 9L, 0L, 8L, 8L, 
                                                                        9L, 8L, 6L, 9L, 8L, 9L, 8L, 0L, 6L, 9L, 8L, 8L, 6L, 9L, 9L, 9L, 
                                                                        8L, 9L, 6L, 9L, 8L, 9L, 8L, 9L, 9L, 6L, 9L, 8L, 9L, 8L, 7L, 9L, 
                                                                        9L), YEAR = c(2004L, 2004L, 2005L, 2005L, 2005L, 2005L, 2006L, 
                                                                                      2006L, 2006L, 2006L, 2006L, 2007L, 2007L, 2007L, 2007L, 2007L, 
                                                                                      2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 2008L, 
                                                                                      2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2010L, 2010L, 2010L, 
                                                                                      2010L, 2010L, 2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 
                                                                                      2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2012L, 2013L, 2013L, 
                                                                                      2013L, 2013L, 2013L, 2013L, 2014L, 2014L, 2014L, 2014L, 2014L, 
                                                                                      2014L, 2014L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
                                                                                      2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 
                                                                                      2017L, 2017L, 2017L, 2018L, 2018L, 2018L, 2018L, 2018L, 2018L, 
                                                                                      2018L, 2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2019L, 2020L, 
                                                                                      2020L, 2020L, 2020L)), row.names = c(NA, -100L), class = "data.frame")

Try using case_when instead. It is much more readable and works as you want.

j <- reprex |> 
  mutate(age_r = case_when(AGE %in% 2:3 ~ 2,
                           AGE %in% 4:5 ~ 3,
                           AGE %in% 6:7 ~ 4,
                           AGE %in% 8:9 ~ 5,
                           .default = AGE))
                           
k <- reprex |>
  group_by(ID) |> 
  slice_min(YEAR, n = 1) |> 
  mutate(age_r = case_when(AGE %in% 2:3 ~ 2,
                           AGE %in% 4:5 ~ 3,
                           AGE %in% 6:7 ~ 4,
                           AGE %in% 8:9 ~ 5,
                           .default = AGE))

Thank you for that suggestion. Readability is so much better and will re-use case_when in the future.

It seems like the slice function makes changes to the data frame. Apparently, this has something to do with slice transforming the dataset into a tibble. When I change my data into a data frame the original issue I reported above goes away. Don't understand why some functions are picky about whether the object is a data frame or a tibble but I suppose that conversation is on another thread.

solves the initial issue

reprex_2 <- reprex %>% group_by(ID) %>% slice_min(YEAR, n=1)
reprex_2 <- as.data.frame(reprex_2)

The reason it failed is that your command

reprex_2 <- reprex %>% group_by(ID) %>% slice_min(YEAR, n=1)

leaves the data frame regex_2 grouped, so when you follow up with a mutate() command, it tries to do it by group. But... you have dereferenced all the AGE columns in your right hand side with regex_2$ so it matches 10 lines and tries to mutate them into 1 line (the groups are 1 line each). You should make a habit of using ungroup() after you're done with the grouped operations. I run into this all the time, almost nobody bothers with calling ungroup() because many times the follow-up code works anyway.

Here I use ungroup() and remove the dereferencing of the table (reprex_2$) from all the AGE constraints:

reprex_2 <- reprex %>% group_by(ID) %>% slice_min(YEAR, n=1) |> ungroup()
k <- reprex_2 %>% 
  mutate(age_r = ifelse(test = (AGE == 2 | AGE == 3), yes = 2, 
                        ifelse(test = (AGE == 4 | AGE == 5), yes = 3, 
                               ifelse(test = (AGE == 6 | AGE == 7), yes = 4,
                                      ifelse(test = (AGE == 8 | AGE == 9), yes = 5, 
                                             no = AGE)))))

Note that my example using case_when() would also fail if the AGE constraints were dereferenced with the table name reprex_2$. Never do that with dplyr code!

The reason the issue is fixed when you run as.data.frame(reprex_2) is because that transformation removes the grouping, as it re-classes the data frame to a regular data frame from a tibble which is the only kind that can hold a grouped data frame.

Thank you so much for that explanation. I have been lucky enough that "many times the follow-up code works anyway"!

I am unfamiliar with the concept of "dereferenced" but I can see from your example it may mean calling back the ungrouped variable. I've never used .default in any of my coding.

1 Like

This topic was automatically closed 42 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.