group_split(): add the first row of the next group to the previous group

Hey,
I have a dataframe (from a soccer log) that I would like to split into groups (based on sequence_id) so that I can later run computations on each group and save them separately. group_split() would be a great solution for this. My only problem is that I need the first row of the next group to be included in the previous group, because it holds information needed for the computations. The last group can either get an extra row of NAs or have no extra row.

This is a reprex of what I have so far:

library(dplyr)
data <- tibble(sequence_id = c(1, 1, 1, 1, 1, 1,
                               1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4,
                               4, 5, 5, 5, 5, 5, 6, 6),
               possession_id = c(1, 1, 1, 1, 1, 1,
                                 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
                                 3, 4, 4, 4, 4, 4, 5, 5)
               )

data_split <- data %>% 
  group_split(., sequence_id)

But I would like the output to look like this:

data_split
#> <list_of<
#>   tbl_df<
#>     sequence_id  : double
#>     possession_id: double
#>   >
#> >[6]>
#> [[1]]
#> # A tibble: 8 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           1             1
#> 2           1             1
#> 3           1             1
#> 4           1             1
#> 5           1             1
#> 6           1             1
#> 7           1             1
#> 8           2             2
#> 
#> [[2]]
#> # A tibble: 6 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           2             2
#> 2           2             2
#> 3           2             2
#> 4           2             2
#> 5           2             2
#> 6           3             2
#> 
#> [[3]]
#> # A tibble: 3 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           3             2
#> 2           3             2
#> 3           4             3
#> 
#> [[4]]
#> # A tibble: 5 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           4             3
#> 2           4             3
#> 3           4             3
#> 4           4             3
#> 5           5             4
#> 
#> [[5]]
#> # A tibble: 6 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           5             4
#> 2           5             4
#> 3           5             4
#> 4           5             4
#> 5           5             4
#> 6           6             5
#> 
#> [[6]]
#> # A tibble: 2 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           6             5
#> 2           6             5

Does anyone have any tips on how to achieve this? Would lag() help in this case?

tibble::add_row() is useful for this kind of thing. In the example below, I add an NA row to the last group, although you could easily modify it to leave that group as-is.

library(dplyr, warn.conflicts = FALSE)

data <- tibble(
  sequence_id = c(
    1, 1, 1, 1, 1, 1,
    1, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4,
    4, 5, 5, 5, 5, 5, 6, 6
  ),
  possession_id = c(
    1, 1, 1, 1, 1, 1,
    1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
    3, 4, 4, 4, 4, 4, 5, 5
  )
)

data_split <- group_split(data, sequence_id)

data_out <- vector("list", length(data_split))

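for (i in seq_along(data_split)) {
  if (i == length(data_out)) {
    # Last group: nothing follows it, so append a single all-NA row
    data_out[[i]] <- add_row(data_split[[i]])
  } else {
    # Otherwise append the first row of the next group
    data_out[[i]] <- add_row(data_split[[i]], slice(data_split[[i + 1]], 1L))
  }
}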

data_out
#> [[1]]
#> # A tibble: 8 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           1             1
#> 2           1             1
#> 3           1             1
#> 4           1             1
#> 5           1             1
#> 6           1             1
#> 7           1             1
#> 8           2             2
#> 
#> [[2]]
#> # A tibble: 6 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           2             2
#> 2           2             2
#> 3           2             2
#> 4           2             2
#> 5           2             2
#> 6           3             2
#> 
#> [[3]]
#> # A tibble: 3 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           3             2
#> 2           3             2
#> 3           4             3
#> 
#> [[4]]
#> # A tibble: 5 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           4             3
#> 2           4             3
#> 3           4             3
#> 4           4             3
#> 5           5             4
#> 
#> [[5]]
#> # A tibble: 6 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           5             4
#> 2           5             4
#> 3           5             4
#> 4           5             4
#> 5           5             4
#> 6           6             5
#> 
#> [[6]]
#> # A tibble: 3 x 2
#>   sequence_id possession_id
#>         <dbl>         <dbl>
#> 1           6             5
#> 2           6             5
#> 3          NA            NA

Created on 2021-02-24 by the reprex package (v1.0.0)
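
If you would rather leave the last group untouched (the other option mentioned in the question), here is a minimal sketch of one way to do it without the explicit for loop. It uses lapply() with bind_rows() in place of add_row(). The helper name append_next_first() and the shortened example data are made up for illustration and are not part of dplyr or of the original post.

```r
library(dplyr, warn.conflicts = FALSE)

# Shortened example data (an assumption, not the original soccer log)
data <- tibble(
  sequence_id   = c(1, 1, 1, 2, 2, 3),
  possession_id = c(1, 1, 1, 2, 2, 2)
)

data_split <- group_split(data, sequence_id)
n <- length(data_split)

# Hypothetical helper: returns group i with the first row of group i + 1
# appended; the last group is returned unchanged (no NA row).
append_next_first <- function(i) {
  if (i == n) {
    return(data_split[[i]])
  }
  bind_rows(data_split[[i]], slice(data_split[[i + 1]], 1L))
}

data_out <- lapply(seq_len(n), append_next_first)
```

On the lag() question: lag() looks backwards, so lead() is probably the closer fit. Calling something like mutate(next_possession_id = lead(possession_id)) before group_split() should carry the next row's value along as an extra column instead of an extra row. Whether that suits your case depends on what the downstream computations expect.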
