Indexing groups within a dataframe in R dplyr

I have a dataset as below:

structure(AI_decs)

                 Horse             Time RaceID dyLTO Value.LTO Draw.IV
1         Warne's Army 06/04/2021 13:00      1    56      3429    0.88
2        G For Gabrial 06/04/2021 13:00      1    57      3299    1.15
3         First Charge 06/04/2021 13:00      1    66      3429    1.06
4        Dream With Me 06/04/2021 13:00      1    62      2862    0.97
5             Qawamees 06/04/2021 13:00      1    61      4690    0.97
6          Glan Y Gors 06/04/2021 13:00      1    59      3429    1.50
7     The Dancing Poet 06/04/2021 13:00      1    42      4690    1.41
8               Finoah 06/04/2021 13:00      1    59     10260    0.97
9            Ravenscar 06/04/2021 13:30      2    58      5208    0.65
10          Arabescato 06/04/2021 13:30      2    57      2862    1.09
11        Thai Terrier 06/04/2021 13:30      2    58      7439    1.30
12   The Rutland Rebel 06/04/2021 13:30      2    55      3429    2.17
13         Red Tornado 06/04/2021 13:30      2    49      3340    0.43
14             Alfredo 06/04/2021 13:30      2    54      5208    1.30
15     Tynecastle Park 06/04/2021 13:30      2    72      7439    0.87
16           Waldkonig 06/04/2021 14:00      3    55      3493    1.35
17       Kaleidoscopic 06/04/2021 14:00      3    68      7439    1.64
18           Louganini 06/04/2021 14:00      3    75     56025    1.26

I have a list of columns with performance data values for horses in a race.
My dataset has many more rows and it contains a number of horse races on a given day.
Each horse race has a unique time and a different number of horses in each race.

Basically, I want to assign a RaceID (index number) to each individual race.
I am currently having to do this in Excel (see the RaceID column) by comparing the Time column and adding 1 to the RaceID every time a new race starts. This has to be done manually each day before I import into R.

I hope there is a way to do this in R with dplyr.
I thought that if I group_by(Time) there might be a function, a bit like n() or row_number(), that would
index the races for me.

Perhaps using case_when() and lag()/lead().
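
For example, something like this (untested, on toy data standing in for my real dataset) is roughly what I had in mind, assuming the rows are already sorted by Time:

```r
library(dplyr)

# Toy data standing in for the real dataset (rows already ordered by race time)
races <- tibble(
  Horse = c("A", "B", "C", "D", "E"),
  Time  = c("13:00", "13:00", "13:30", "14:00", "14:00")
)

# Start a new index whenever Time differs from the previous row
races %>%
  mutate(RaceID = cumsum(Time != lag(Time, default = first(Time))) + 1)
#> RaceID is 1, 1, 2, 3, 3
```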

Thanks in advance for any help.
Graham

Give this a try! :slight_smile:

library(dplyr)
library(tibble)
library(lubridate)

# I created a fragment of your data locally, with just the horses' names and a
# single RACEDATETIME column (your date and time columns mashed together)
data <- data %>% mutate(RACEDATETIME = mdy_hm(RACEDATETIME)) # mdy_hm() is from lubridate

# a is a lookup table pairing each unique RACEDATETIME with an ID
a <- data$RACEDATETIME %>%
  unique() %>%
  as.data.frame() %>%
  setNames("RACEDATETIME") %>%
  rowid_to_column(var = "ID")

# Join the IDs back onto the full data
b <- data %>% left_join(a, by = "RACEDATETIME") # I think this is what you want! :) 
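
If you'd rather skip building the lookup table, a shorter sketch of the same idea (assuming a parsed RACEDATETIME datetime column, as above) uses match() against the unique values:

```r
library(dplyr)

# Toy stand-in for the parsed data: one datetime per race
data <- tibble(
  HORSE        = c("A", "B", "C", "D"),
  RACEDATETIME = as.POSIXct(c("2021-04-06 13:00", "2021-04-06 13:00",
                              "2021-04-06 13:30", "2021-04-06 14:00"),
                            tz = "UTC")
)

# match() returns the position of each datetime within the unique datetimes,
# which is exactly a race index in order of first appearance
data %>% mutate(ID = match(RACEDATETIME, unique(RACEDATETIME)))
#> ID is 1, 1, 2, 3
```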

This may look a little funky because I, too, had to create your data by splitting things up, since it wasn't in a copy-pasteable format. If each race is at a unique datetime, you can use that datetime to group_by() and then get the group index with cur_group_id().

Note that your Time variable will have to be parsed as lubridate::dmy_hm() to create a proper datetime, but I'd already done too much futzing around to get the data into a workable format by the time I made a reprex to go back to your original string. I also kept race_id in there, just so you could see that you get the same result:

suppressPackageStartupMessages(library(tidyverse))

data <- tibble::tribble(
  ~index,               ~horse,                 ~date, ~race_id, ~dy_lto, ~value_lto, ~draw_iv,
     "1",      "Warne's Army ", "2021-04-06 13:00:00",     " 1",   " 56",   " 3429 ",   "0.88",
     "2",     "G For Gabrial ", "2021-04-06 13:00:00",     " 1",   " 57",   " 3299 ",   "1.15",
     "3",      "First Charge ", "2021-04-06 13:00:00",     " 1",   " 66",   " 3429 ",   "1.06",
     "4",     "Dream With Me ", "2021-04-06 13:00:00",     " 1",   " 62",   " 2862 ",   "0.97",
     "5",          "Qawamees ", "2021-04-06 13:00:00",     " 1",   " 61",   " 4690 ",   "0.97",
     "6",       "Glan Y Gors ", "2021-04-06 13:00:00",     " 1",   " 59",   " 3429 ",   "1.50",
     "7",  "The Dancing Poet ", "2021-04-06 13:00:00",     " 1",   " 42",   " 4690 ",   "1.41",
     "8",            "Finoah ", "2021-04-06 13:00:00",     " 1",   " 59",  " 10260 ",   "0.97",
     "9",         "Ravenscar ", "2021-04-06 13:30:00",     " 2",   " 58",   " 5208 ",   "0.65",
    "10",        "Arabescato ", "2021-04-06 13:30:00",     " 2",   " 57",   " 2862 ",   "1.09",
    "11",      "Thai Terrier ", "2021-04-06 13:30:00",     " 2",   " 58",   " 7439 ",   "1.30",
    "12", "The Rutland Rebel ", "2021-04-06 13:30:00",     " 2",   " 55",   " 3429 ",   "2.17",
    "13",       "Red Tornado ", "2021-04-06 13:30:00",     " 2",   " 49",   " 3340 ",   "0.43",
    "14",           "Alfredo ", "2021-04-06 13:30:00",     " 2",   " 54",   " 5208 ",   "1.30",
    "15",   "Tynecastle Park ", "2021-04-06 13:30:00",     " 2",   " 72",   " 7439 ",   "0.87",
    "16",         "Waldkonig ", "2021-04-06 14:00:00",     " 3",   " 55",   " 3493 ",   "1.35",
    "17",     "Kaleidoscopic ", "2021-04-06 14:00:00",     " 3",   " 68",   " 7439 ",   "1.64",
    "18",         "Louganini ", "2021-04-06 14:00:00",     " 3",   " 75",  " 56025 ",   "1.26"
  )


data %>%
  mutate(date = lubridate::ymd_hms(date)) %>%
  group_by(date) %>%
  mutate(group_id = cur_group_id())
#> # A tibble: 18 x 8
#> # Groups:   date [3]
#>    index horse     date                race_id dy_lto value_lto draw_iv group_id
#>    <chr> <chr>     <dttm>              <chr>   <chr>  <chr>     <chr>      <int>
#>  1 1     "Warne's… 2021-04-06 13:00:00 " 1"    " 56"  " 3429 "  0.88           1
#>  2 2     "G For G… 2021-04-06 13:00:00 " 1"    " 57"  " 3299 "  1.15           1
#>  3 3     "First C… 2021-04-06 13:00:00 " 1"    " 66"  " 3429 "  1.06           1
#>  4 4     "Dream W… 2021-04-06 13:00:00 " 1"    " 62"  " 2862 "  0.97           1
#>  5 5     "Qawamee… 2021-04-06 13:00:00 " 1"    " 61"  " 4690 "  0.97           1
#>  6 6     "Glan Y … 2021-04-06 13:00:00 " 1"    " 59"  " 3429 "  1.50           1
#>  7 7     "The Dan… 2021-04-06 13:00:00 " 1"    " 42"  " 4690 "  1.41           1
#>  8 8     "Finoah " 2021-04-06 13:00:00 " 1"    " 59"  " 10260 " 0.97           1
#>  9 9     "Ravensc… 2021-04-06 13:30:00 " 2"    " 58"  " 5208 "  0.65           2
#> 10 10    "Arabesc… 2021-04-06 13:30:00 " 2"    " 57"  " 2862 "  1.09           2
#> 11 11    "Thai Te… 2021-04-06 13:30:00 " 2"    " 58"  " 7439 "  1.30           2
#> 12 12    "The Rut… 2021-04-06 13:30:00 " 2"    " 55"  " 3429 "  2.17           2
#> 13 13    "Red Tor… 2021-04-06 13:30:00 " 2"    " 49"  " 3340 "  0.43           2
#> 14 14    "Alfredo… 2021-04-06 13:30:00 " 2"    " 54"  " 5208 "  1.30           2
#> 15 15    "Tynecas… 2021-04-06 13:30:00 " 2"    " 72"  " 7439 "  0.87           2
#> 16 16    "Waldkon… 2021-04-06 14:00:00 " 3"    " 55"  " 3493 "  1.35           3
#> 17 17    "Kaleido… 2021-04-06 14:00:00 " 3"    " 68"  " 7439 "  1.64           3
#> 18 18    "Lougani… 2021-04-06 14:00:00 " 3"    " 75"  " 56025 " 1.26           3

Created on 2021-04-10 by the reprex package (v1.0.0)

* You won't have that weird padded whitespace in your data; it's an artifact of my recreating your data frame with regular expressions. I didn't take the time to strip it out, since it won't be there for you.


I basically did the same thing; I just assumed it was dmy_hm(), since June 4, 2021 is in the future. :stuck_out_tongue_winking_eye:

Thanks for the help.
I managed to solve it using the following:

AI_decs %>%
  group_by(Time) %>%
  mutate(RaceN = cur_group_id())

Thanks. Because I am in the UK we use the date format day/month/year, which I understand is different in the States. cur_group_id() worked a treat!
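
For anyone landing here later, the full pipeline for the UK day/month/year format would look roughly like this (a sketch on a small stand-in for the AI_decs data frame above):

```r
library(dplyr)
library(lubridate)

# Toy stand-in for AI_decs, with Time in UK day/month/year format
AI_decs <- tibble(
  Horse = c("Warne's Army", "Ravenscar", "Waldkonig"),
  Time  = c("06/04/2021 13:00", "06/04/2021 13:30", "06/04/2021 14:00")
)

AI_decs %>%
  mutate(Time = dmy_hm(Time)) %>%   # parse day/month/year hour:minute
  group_by(Time) %>%
  mutate(RaceN = cur_group_id()) %>%
  ungroup()
#> RaceN is 1, 2, 3
```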


Glad it helped!

If your question's been answered (even if by you), would you mind choosing a solution? (See the FAQ below for how.)

Having questions checked as resolved makes it a bit easier to navigate the site visually and see which threads still need help.

Thanks
