Calculating the last date & mean score for a given condition. possible nesting issue?

ramakant · July 16, 2020, 7:53am

hello rstudio community,

first ever post so trying to provide reprex as below.

i have a huge data set (3 million+ rows) which has the following format in the tibble named "tribble_test":

dpasta(tribble_test)
tibble::tribble(
  ~title,        ~date, ~bucket, ~score,
     "a", "12-07-2020",       1,     86,
     "a", "13-07-2020",       1,     22,
     "a", "14-07-2020",       1,     24,
     "a", "15-07-2020",       4,     54,
     "b", "12-07-2020",       9,     66,
     "b", "13-07-2020",       7,     76,
     "b", "14-07-2020",      10,     43,
     "b", "15-07-2020",      10,     97,
     "c", "12-07-2020",       8,      9,
     "c", "13-07-2020",       5,     53,
     "c", "14-07-2020",       5,     45,
     "c", "15-07-2020",       5,     40,
     "d", "12-07-2020",       8,     21,
     "d", "13-07-2020",      10,     99,
     "d", "14-07-2020",       9,     91,
     "d", "15-07-2020",      10,     90
  )

For each title, I would like to analyse the following

the lowest bucket possible
the last date when the lowest bucket was detected
average score when the title was in the lowest bucket

the required outcome should appear like this:

outcome_required <- tibble::tribble(
  ~title, ~best_bucket_date, ~best_bucket, ~best_bucket_score,
     "a",       "14-Jul-20",           1,                44,
     "b",       "13-Jul-20",           7,                76,
     "c",       "15-Jul-20",           5,                46,
     "d",       "12-Jul-20",           8,                21
  )

I have tried to use slice_min() but clearly i'm missing something here. attempted code is show below:

outcome_wrong <- tribble_test %>% group_by(title) %>%
  slice_min(bucket, n = 1, with_ties = FALSE) %>%
  summarise(best_bucket_date = max(date), 
            best_bucket = bucket, 
            best_bucket_score = mean(score)) 
dpasta(outcome_wrong)
tibble::tribble(
                  ~title, ~best_bucket_date, ~best_bucket, ~best_bucket_score,
                     "a",      "12-07-2020",            1,                 86,
                     "b",      "13-07-2020",            7,                 76,
                     "c",      "13-07-2020",            5,                 53,
                     "d",      "12-07-2020",            8,                 21
                  )

clearly, slice_min() is giving me the minimum bucket value but is not performing the date or score operations correctly. My hunch is that I have to perform some sort of iterative function at row level. or possibly create some nested vector which is beyond my current comprehension. This is to be performed on 3M+ rows so I'd like to know which approach would be optimal.

Thank you for reading this far

edit: added more context

siddharthprabhu · July 16, 2020, 8:25am

Hi @ramakant, welcome to RStudio Community.

Thank you for posting a nice reprex. I think this is what you are looking for.

library(dplyr, warn.conflicts = FALSE)

data <- tribble(
  ~ title, ~ date, ~ bucket, ~ score,
  "a", "12-07-2020", 1, 86,
  "a", "13-07-2020", 1, 22,
  "a", "14-07-2020", 1, 24,
  "a", "15-07-2020", 4, 54,
  "b", "12-07-2020", 9, 66,
  "b", "13-07-2020", 7, 76,
  "b", "14-07-2020", 10, 43,
  "b", "15-07-2020", 10, 97,
  "c", "12-07-2020", 8, 9,
  "c", "13-07-2020", 5, 53,
  "c", "14-07-2020", 5, 45,
  "c", "15-07-2020", 5, 40,
  "d", "12-07-2020", 8, 21,
  "d", "13-07-2020", 10, 99,
  "d", "14-07-2020", 9, 91,
  "d", "15-07-2020", 10, 90
)

data %>% 
  group_by(title) %>% 
  filter(bucket == min(bucket)) %>% 
  mutate(avg_score = mean(score)) %>% 
  slice_max(date) %>% 
  select(-score)
#> # A tibble: 4 x 4
#> # Groups:   title [4]
#>   title date       bucket avg_score
#>   <chr> <chr>       <dbl>     <dbl>
#> 1 a     14-07-2020      1        44
#> 2 b     13-07-2020      7        76
#> 3 c     15-07-2020      5        46
#> 4 d     12-07-2020      8        21

^{Created on 2020-07-16 by the reprex package (v0.3.0)}

ramakant · July 16, 2020, 8:36am

good sir.. that totally nailed it!!
my capacity to overthink does not help at all. thank you for the solution. I shall try and replicate it with the rest of the data.

two minor queries:

why did we have to write another select command to remove score? if we grouped by title, it shouldn't have appeared in the first place, right?
everytime i'm running reprex, i'm getting this error. anything i'm doing wrong here


> outcome_wrong <- 
+   tribble_test %>% 
+   group_by(title) %>%
+   filter(bucket == min(bucket)) %>%
+   mutate(best_bucket_score = mean(score)) %>%
+   slice_max(date)%>% 
+   select(-score)%>%
+   view()
> reprex()
Rendering reprex...
Error: callr subprocess failed: <text>:18:8: unexpected ','
17: #+ reprex-body
18: clearly,
           ^
Type .Last.error.trace to see where the error occured
> reprex()
Rendering reprex...
Error: callr subprocess failed: <text>:18:8: unexpected ','
17: #+ reprex-body
18: clearly,
           ^
Type .Last.error.trace to see where the error occured

siddharthprabhu · July 16, 2020, 9:00am

group_by() doesn't modify the data, summarise() does. Since I opted for mutate() instead of summarise(), all the original variables are retained including score.

As for the error, maybe you have an extra comma somewhere before this code chunk? If you have a large script, you can also reprex a selection of code by using reprex::reprex_selection(). Maybe that will help you isolate where the error is.

ramakant · July 16, 2020, 9:15am

Thank you and thank you on both counts.

Hope to practice more of these verbs to grasp their meaning better.

as for the reprex, error was being generated inside the tribble definition for some reason. using reprex_selection for a smaller code chunk worked better


outcome_wrong <- 
  tribble_test %>% 
  group_by(title) %>%
  filter(bucket == min(bucket)) %>%
  mutate(best_bucket_score = mean(score)) %>%
  slice_max(date)%>% 
  select(-score)%>%
  view()
#> Error in tribble_test %>% group_by(title) %>% filter(bucket == min(bucket)) %>% : could not find function "%>%"

^{Created on 2020-07-16 by the reprex package (v0.3.0)}

siddharthprabhu · July 16, 2020, 9:34am

From the error, it looks like the pipe function could not be found. Try including library(dplyr) at the start of your reprex selection.

system · July 23, 2020, 11:26am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.