Add a row depending on a condition using tidyverse if possible


#1

I'm working on hockey analytics, specifically modeling the goals scored as a poisson distribution.
So far, I've been able to create a new column, time_diff, that finds the number of minutes passed since the last goal scored.

Now, I'd like to add in a row at the end of very game (represented by game_id) that displays the number of minutes passed in the game from the last goal scored to the end of the game.

For example, in game_id 2007020001, the last goal scored in the game was at 59.56 minutes. So, I'd like to insert a row right after this game with event_type column containing the value, game_end and column time_diff containing the value 0.44

I've seen some data.table solutions on SO, but would love to see a tidyverse solution since I'm not used to data.table

Here is subset of my data:

2007020001L, 2007020001L, 2007020002L, 2007020002L, 2007020002L, 
2007020002L, 2007020002L, 2007020003L, 2007020003L, 2007020003L, 
2007020003L, 2007020003L, 2007020004L, 2007020004L, 2007020004L
), session = c("R", "R", "R", "R", "R", "R", "R", "R", "R", "R", 
"R", "R", "R", "R", "R", "R", "R", "R"), game_seconds = c(515L, 
1815L, 2470L, 3189L, 3574L, 649L, 919L, 1373L, 1450L, 1737L, 
217L, 781L, 2971L, 3168L, 3665L, 586L, 2072L, 2358L), event_type = c("GOAL", 
"GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", 
"GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", "GOAL", 
"GOAL"), time_diff = c(8.58, 21.67, 10.92, 11.98, 6.42, 10.82, 
4.5, 7.57, 1.28, 4.78, 3.62, 9.4, 36.5, 3.28, 8.28, 9.77, 24.77, 
4.77)), row.names = c(NA, -18L), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"), vars = c("game_id", "session"), drop = TRUE, indices = list(
    0:4, 5:9, 10:14, 15:17), group_sizes = c(5L, 5L, 5L, 3L), biggest_group_size = 5L, labels = structure(list(
    game_id = 2007020001:2007020004, session = c("R", "R", "R", 
    "R")), row.names = c(NA, -4L), class = "data.frame", vars = c("game_id", 
"session"), drop = TRUE))

#2

One option is to check out add_row from the tibble package. If you calculate the time_diff ahead of time and store it in an object finalTime, you can include that object in your add_row() call. A simplified version might look something like this:

hockey <- add_row(hockey, event_type = "game end", time_diff = finalTime)

#3

I'm going to answer my own question here haha.

A great dplyr solution I found is:


df %>%
 group_by(game_id) %>%
  summarise(time_diff_sum = sum(time_diff)) %>% 
  mutate(time_diff_game_end = 60 - time_diff_sum)

#4

I see that my solution only finds the difference between the time of the last goal and the end of the game.

Does anyone have ideas on how to insert this difference into the column time_diff right after the last row of each group of game_id and put down the value game_end in event_type ?


#5

@jsonbaik - it sounds like you have multiple games in the same data object? Can you show us a minimal example, of what the output of

df %>%
 group_by(game_id) %>%
  summarise(time_diff_sum = sum(time_diff)) %>% 
  mutate(time_diff_game_end = 60 - time_diff_sum)

looks like?


#6

Sorry it took so long to respond @chris.prener

Here is the output!


structure(list(game_id = 2007020001:2007020010, time_diff_sum = c(59.57, 
28.95, 61.08, 80.01, 62.95, 57.17, 53.04, 55.01, 59.03, 44.23
), time_diff_game_end = c(0.43, 31.05, -1.08, -20.01, -2.95, 
2.83, 6.96, 4.99, 0.969999999999999, 15.77)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

#7

Would something like the following work?

df %>%
  group_by(game_id, event_type) %>%
  mutate(time_diff_sum = 60 - sum(time_diff)) %>% 
  ungroup() %>% 
  gather(time_diff, time_diff_sum, key = "key", value = "value") %>% 
  arrange(game_id) %>% 
  distinct(game_id, event_type, key, value)

#8

Looks good to me. Thanks


#9

If your question's been answered, would you mind choosing a solution? It helps other people see which questions still need help, or find solutions if they have similar problems. Here’s how to do it:

Thanks!


#10

Solution for future users:

df %>% 
    filter(event_type == event_type_fun) %>% 
    group_by(game_id, session) %>%
    summarise(event_index = last(event_index)) %>%
    ungroup() %>% 
    mutate(event_type = "game_end",
           time_diff = time_diff_game_end) %>%
    bind_rows(df %>% 
                filter(event_type == event_type_fun) %>% 
                group_by(game_id, session) %>% 
                # Time difference in minutes (seconds / 60)
                mutate(time_diff = game_seconds - lag(game_seconds, default = 0),
                       time_diff = round(time_diff / 60, 2)), .) %>% 
    arrange(game_id) %>% 
    select(game_id, session, event_type, time_diff) %>% 
    mutate(event_type = if_else(event_type == event_type_fun, 1, 0))

where event_type_fun == "GOAL" and time_diff_game_end =


df %>% 
    filter(event_type == event_type_fun) %>% 
    # Time difference in minutes (seconds / 60)
    group_by(game_id, session) %>% 
    mutate(time_diff = game_seconds - lag(game_seconds, default = 0),
           time_diff = round(time_diff / 60, 2)) %>% 
    select(game_id, session, game_seconds, event_type, time_diff) %>% 
    group_by(game_id, session) %>%
    summarise(time_diff_sum = sum(time_diff)) %>% 
    mutate(time_diff_game_end = 60 - time_diff_sum) %>% 
    pull(time_diff_game_end)

#11

Thanks for letting me know.


#12

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.