Restarting a "Counter Variable" Across Groups?

I have the following dataset that represents different students taking an exam across a period of dates - each student can either "pass" (1) or "fail" (0) the exam:

# Load the data.table package
library(data.table)

# Generate some sample data
id = sample.int(10000, 100000, replace = TRUE)
res = c(1,0)
results = sample(res, 100000, replace = TRUE)
date_exam_taken = sample(seq(as.Date('1999/01/01'), as.Date('2020/01/01'), by="day"), 100000, replace = TRUE)

# Create a data table from the sample data
my_data = data.table(id, results, date_exam_taken)
my_data <- my_data[order(id, date_exam_taken)]

# Generate some additional columns for each record
my_data$general_id = 1:nrow(my_data)
my_data$exam_number = ave(my_data$general_id, my_data$id, FUN = seq_along)
my_data$general_id = NULL

   id results date_exam_taken exam_number
1:  1       0      2002-10-06           1
2:  1       1      2003-07-21           2
3:  1       1      2003-10-15           3
4:  1       0      2005-07-21           4
5:  1       1      2014-08-22           5
6:  1       1      2015-09-11           6

Grouped over students, I want to create a variable ("consec") that sums the "consecutive number of passed exams" - and this variable should go back to 0 after the first failed exam.

, I think I can do this for the entire dataset at once:

 my_data$consec =   unlist(lapply(split(my_data$results, c(0, cumsum(abs(diff(!my_data$results == 1))))), function(x) (x[1] == 1) * seq(length(x))))

But this is counting consecutive passed exams without taking into consideration individual students - each time a new student appears, I would like the "consec" variable to restart.

I think I might have found a way to do this:

my_data$i = ave(my_data$results, my_data$id, FUN = function(x){ tmp<-cumsum(x);tmp-cummax((!x)*tmp)})

But I am not sure if this is correct

Can someone please show me how to do this?

Thanks!

I would use dplyr group_by and a mutate to solve this.
Potentially the result of that might be used to prove that other more base r implementations get the same result, butnid probably prefer to keep the more human readable code unless performance was significantly more important than maintenance

1 Like

Yes, you are on the right track to solving this problem. One way to achieve your goal is to use the rle function to calculate the consecutive runs of "pass" results for each student, and then use this information to create the "consec" variable. Here is an example of how you can do this:

Copy code

# Create a variable to store the consecutive runs of "pass" results
my_data$runs = unlist(rle(my_data$results)$lengths)[my_data$results == 1]

# Create the "consec" variable
my_data$consec = ave(my_data$runs, my_data$id, FUN = cumsum)

The rle function returns the lengths of runs of consecutive equal values in a vector. By applying the function to the results variable and extracting the lengths of the runs where the result is "pass" (i.e., where results == 1), we can get a vector of consecutive runs of "pass" results for each student.

The ave function is then used to calculate the cumulative sum of the runs variable for each student, which gives us the desired "consec" variable.

This should work as long as the data is ordered by student ID and exam date.