Selecting Random Consecutive Rows Per Group

omario · February 19, 2023, 6:36pm

I have the following dataset in R:

my_data = data.frame(student_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3), exam_no = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5), result = rnorm(15,60,10))

The data looks something like this:

> my_data
   student_id exam_no   result
1           1       1 56.60374
2           1       2 55.76655
3           1       3 53.81728
4           1       4 74.82202
5           1       5 34.91834
6           2       1 58.32422
7           2       2 60.38213
8           2       3 49.40390
9           2       4 63.85426
10          2       5 40.32912
11          3       1 69.54969
12          3       2 43.36639
13          3       3 37.97265
14          3       4 52.36436
15          3       5 61.62080

My Question: I want to randomly keep some "consecutive rows" for each student.

For example, keep exams 2-4 for student 1, keep exams 2-5 for student 2, etc.

I thought of the following way to do this:

# create a data frame that contains the max number of exams each student takes (in my problem, each student takes the same number of exams, but in the future this could be different)

library(dplyr)
counts = my_data %>% group_by(student_id) %>% summarise(counts = n())

# create variables that indicate where to start ("min") and where to end ("max") for each student
counts$min = sample(1:counts$counts, 1)
counts$max = sample(counts$min:counts$counts,1)

From here, I was then going to write a loop that would select rows between "min" and "max" index for each student (e.g. my_data[min:max]), but the results from the previous code are giving me warnings and illogical results:

Warning message:
In 1:counts$counts :
  numerical expression has 3 elements: only the first used

Warning messages:
1: In counts$min:counts$counts :
  numerical expression has 3 elements: only the first used
2: In counts$min:counts$counts :
  numerical expression has 3 elements: only the first used

# A tibble: 3 x 4
  student_id counts   min   max
       <dbl>  <int> <int> <int>
1          1      5     4     5
2          2      5     4     5
3          3      5     4     5

I am not sure how to continue this - can someone please show me how to continue?

Thanks!

FactOREO · February 19, 2023, 7:17pm

One approach would be to use split() and lapply(). Then rebind the result with data.table::rbindlist() (easy and conveniant).

get_random_cons_rows <- function(data, group) {
    # get a list of data.frames by group
    split.data.frame(
        data, group
    ) |>
    lapply(
        FUN = \(x) {
            total_rows <- nrow(x)
            start <- sample(seq.default(1,total_rows),1)
            # safety net, to avoid sample(5,1), which would draw any number between 1 and 5
            end <- ifelse(start == total_rows, start, sample(seq.default(start,total_rows),1))
            x[start:end,]
        }
    ) |>
    data.table::rbindlist()
}

get_random_cons_rows(my_data, my_data$student_id)

   student_id exam_no   result
        <num>   <num>    <num>
1:          1       1 47.13861
2:          1       2 61.90719
3:          1       3 46.36716
4:          2       4 52.59153
5:          2       5 68.71606
6:          3       1 58.97139
7:          3       2 62.85379
8:          3       3 49.66545

You can however get a random "1 consecutive row" result, which will be less likely the more rows per group you have.

To speed up things, you could use collapse::rsplit() as well as collapse::rapply2d() and collapse::unlist2d() instead of split(), lapply() and data.table::rbindlist(), but that's up to you.

scottyd22 · February 19, 2023, 10:36pm

Here's an alternative approach that does not require a loop. After seeing @FactOREO's good point about a single consecutive row, I added some logic to ensure at least two consecutive rows are returned by randomly selecting a min record from 1:max exam -1. Then, the max record is selected from min record:max exam. If they are equal, the max is updated to the min + 1. If a single consecutive row is OK, then this logic can be altered to select min from 1:max (drop the minus 1).

my_data %>%
  group_by(student_id) %>%
  mutate(max_val = max(exam_no)) %>%
  # guarantee at least 2 records
  mutate(min_record = sample(1:(unique(max_val) - 1), 1),
         max_record = sample(unique(min_record):unique(max_val), 1)
         ) %>%
  mutate(max_record = ifelse(min_record == max_record, min_record + 1, max_record)) %>%
  # only keep rows between min and max
  filter(between(exam_no, min_record, max_record)) %>%
  select(student_id, exam_no, result) %>%
  ungroup() 
#> # A tibble: 10 × 3
#>    student_id exam_no result
#>         <dbl>   <dbl>  <dbl>
#>  1          1       1   42.9
#>  2          1       2   50.1
#>  3          1       3   47.4
#>  4          1       4   56.7
#>  5          1       5   50.6
#>  6          2       2   85.4
#>  7          2       3   60.9
#>  8          3       1   44.5
#>  9          3       2   44.1
#> 10          3       3   82.8

Created on 2023-02-19 with reprex v2.0.2.9000

system · February 26, 2023, 10:37pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.