Custom Function w/ Group_by

soufin12 · February 26, 2020, 12:58pm

I have a custom function that creates new rows, where it copies the data from row one and adds rows equal to a number in a specific column. Right now, the function works well if there is only one data entry per id. What I need is for the function to work when the data has multiple rows for one id.

My data includes id, which is the person's id; Stage, which is the stage the person is in; Start/End, which are the start and end dates; MonthDiff, which is the number of months between the start and end dates; and Censor, which is equal to 0 or 1.

I need the function to be grouped by Stage and to copy rows down equal to the month diff in that stage and then restart.

What I have so far:

    df<-data.frame(id=c('A','A','A'),
               Stage=c(1,2,3),
               Start=c(as.Date('2014-01-01'),as.Date('2016-01-01'),as.Date('2019-01-01')),
               End=c(as.Date('2015-12-31'),as.Date('2018-12-31'),as.Date('2020-02-01')),
               MonthDiff=c(23,35,13),
               Censor=c(0,0,1))

    PLPP <- function(data, id, Stage, period, event) {
      stopifnot(is.matrix(data) || is.data.frame(data))
      stopifnot(c(id, period, event) %in% c(colnames(data), 1:ncol(data)))

      if (any(is.na(data[, c(id, period, event)]))) {
        stop("PLPP cannot currently handle missing data in the id, period, or event variables")
      }

      # Repeat each row index as many times as its period value
      index <- rep(1:nrow(data), data[, period])
      # Position of the last expanded row for each original row
      idmax <- cumsum(data[, period])
      # The event indicator is the reverse of the censor flag
      reve <- !data[, event]
      dat <- data[index, ]
      # Number the periods within each id
      dat[, period] <- ave(dat[, period], dat[, id], FUN = seq_along)
      dat[, event] <- 0
      dat[idmax, event] <- reve

      rownames(dat) <- NULL
      return(dat)
    }
   
    tpp<-PLPP(df,id='id',Stage = 'Stage',period = 'MonthDiff',event = 'Censor')

     test<-df%>%group_by(Stage)%>%do(tpp)

My problem with the current code is that the group_by statement isn't restarting at the new Stage.

Nate · February 26, 2020, 2:01pm

If you use split() %>% map_df(), does that give you your ideal output?

    library(dplyr)  # for %>%
    library(purrr)  # for map_df()

    df %>%
      split(.$Stage) %>%
      map_df(~ PLPP(., id = 'id', Stage = 'Stage', period = 'MonthDiff', event = 'Censor'))

I like using this pattern because I think it gives me more freedom in the types of objects I can manipulate.
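For readers who want to see the split-then-map pattern in isolation, here is a minimal sketch. It assumes a trimmed copy of the example data, and `expand_stage()` is a hypothetical stand-in for PLPP, not the function from the question: it only repeats rows and numbers the months within each piece.

```r
library(purrr)  # map_df() also needs dplyr installed for the row-binding step

df <- data.frame(
  id        = c("A", "A", "A"),
  Stage     = c(1, 2, 3),
  MonthDiff = c(23, 35, 13),
  Censor    = c(0, 0, 1)
)

# Hypothetical helper (not from the thread): repeat the rows of one Stage
# MonthDiff times and number the months from 1 within that piece.
# Simplification: assumes each piece holds a single original row.
expand_stage <- function(piece) {
  out <- piece[rep(seq_len(nrow(piece)), piece[["MonthDiff"]]), ]
  out[["Month"]] <- seq_len(nrow(out))
  rownames(out) <- NULL
  out
}

# split() makes one data frame per Stage; map_df() applies the helper to
# each piece and row-binds the results into one data frame.
expanded <- map_df(split(df, df$Stage), expand_stage)

# Rows 23 and 24 straddle the boundary between Stage 1 and Stage 2,
# so this shows the counter restarting.
expanded[c(1, 23, 24), c("Stage", "Month")]
```

Because each piece is processed on its own, anything computed inside the helper (counters, cumulative sums, flags) restarts for every Stage.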

soufin12 · February 26, 2020, 2:45pm

This worked perfectly. I was actually trying to use split(), but not with map_df(). That's just what I needed! Thanks a lot!

Nate · February 26, 2020, 3:31pm

It's def one of my favorite R tricks, glad it helped you and I hope you get some good mileage out of it!

aosmith · February 26, 2020, 3:48pm

This seems like a good place to mention uncount() from tidyr. It is a function for duplicating rows based on a variable; i.e., to "uncount" something. It isn't exactly what you want here because it looks like you are doing some additional work with the Censor column, but I've found it to be pretty useful when I'm "expanding" datasets.

Here's an example of code that takes your current dataset and expands it by adding rows based on MonthDiff.

    library(tidyr)
    uncount(df, weights = MonthDiff)
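As a rough sketch of how uncount() might be extended to cover the Censor handling as well: the version below assumes the goal is a month counter that restarts for every original row, plus an event flag that is 1 only in the last month of a row with Censor equal to 0. The column names `Month` and `Event` are made up for illustration, and the event rule is my reading of the PLPP function, not something confirmed in the thread.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id        = c("A", "A", "A"),
  Stage     = c(1, 2, 3),
  MonthDiff = c(23, 35, 13),
  Censor    = c(0, 0, 1)
)

expanded <- df %>%
  # .remove = FALSE keeps MonthDiff in the result;
  # .id adds a counter that runs from 1 within each original row
  uncount(weights = MonthDiff, .remove = FALSE, .id = "Month") %>%
  # Assumed rule: flag the last month of a stage only when it is not censored
  mutate(Event = as.integer(Month == MonthDiff & Censor == 0))

# Last month of each stage, with the resulting event flag
expanded %>% filter(Month == MonthDiff)
```

Since every row of the example data is its own Stage, the counter from `.id` restarts per Stage here without any explicit grouping. With several rows per Stage, a group_by(Stage) step would probably be needed before numbering.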

soufin12 · February 26, 2020, 6:46pm

I've never used uncount(), but thank you for bringing it to my attention. It seems like a very simple way to expand data sets, which is good!

soufin12 · February 26, 2020, 6:47pm

I'm running into this error when I run the code:

Error in PLPP(., id = "id", Stage = "Stage", period = "MonthDiff",  : 
  (list) object cannot be coerced to type 'double'

I'm wondering if the way the data is split is what is causing this? Any idea?

Nate · February 27, 2020, 12:55pm

Not sure why that error is popping out. Is it happening on the same data set you shared above?
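One possible cause, offered as a guess because the thread never confirms it: if the real data is a tibble (for example, read in with readr or produced by dplyr verbs) instead of a base data.frame, then `data[, period]` inside PLPP returns a one-column tibble where the function expects a numeric vector, and rep() cannot use a list as its `times` argument. The sketch below shows the difference on a cut-down copy of the example data.

```r
library(tibble)

df  <- data.frame(MonthDiff = c(23, 35, 13))
tbl <- as_tibble(df)

# A base data frame drops a single selected column to a plain numeric vector...
is.numeric(df[, "MonthDiff"])
# ...but a tibble keeps it as a one-column tibble, which is a list.
is.list(tbl[, "MonthDiff"])

# rep() cannot coerce that list to a numeric times argument, so the call fails.
failed <- inherits(try(rep(1:3, tbl[, "MonthDiff"]), silent = TRUE), "try-error")

# [[ ]] returns the plain vector for both classes, so this works on the tibble.
index <- rep(seq_len(nrow(tbl)), tbl[["MonthDiff"]])
length(index)
```

If that is what is happening, two things that might help are converting each piece with as.data.frame() before calling PLPP, or changing the column lookups inside PLPP from `data[, period]` to `data[[period]]`.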
