What is the most direct and efficient way to use results from another df on a different df?

realhiphop · January 25, 2019, 12:54pm

I'm working with baseball data. I've created a few separate df that does specific calculations that all are to get used on a master df called value_tbl that has the bulk of my mutate work. The first approach I used was a join of the calc_tbl to the players in long format. That didn't work due to the calc_BA requiring elements from the lg_avg df (Unless I just approached it wrong)

I'm looking for a better and more efficient way to go about writing my last df. Specifically:

Is there a better way to "call" the results from the calc_tbl and lg_avg?
Is there a more efficient way to write that last mutate?

Open to suggestions on the overall workflow as well.

library(tidyverse)

# Table used for league average calculations below
teams <- tibble::tribble(
  ~Team,  ~PA,  ~AB,   ~H, ~HBP, ~BB, ~HR,  ~R, ~RBI, ~SB,  ~AVG,  ~OBP,
  "Athletics", 6255, 5579, 1407,   76, 550, 227, 813,  778,  35, 0.252, 0.325,
  "Red Sox", 6302, 5623, 1509,   55, 569, 208, 876,  829, 125, 0.268, 0.339,
  "Yankees", 6271, 5515, 1374,   62, 625, 267, 851,  821,  63, 0.249, 0.329,
  "Indians", 6300, 5595, 1447,   80, 554, 216, 818,  786, 135, 0.259, 0.332,
  "Astros", 6146, 5453, 1390,   61, 565, 205, 797,  763,  71, 0.255, 0.329
)


# Table used for player calculations (main table)
players <- tibble::tribble(
  ~Name,     ~Team,  ~G, ~PA, ~AB,  ~H, ~HBP, ~BB, ~HR,  ~R, ~RBI, ~SB,  ~AVG,  ~OBP,
  "Mookie Betts", "Red Sox", 136, 614, 520, 180,    8,  81,  32, 129,   80,  30, 0.346, 0.438,
  "Mike Trout",  "Angels", 140, 608, 471, 147,   10, 122,  39, 101,   79,  24, 0.312,  0.46,
  "J.D. Martinez", "Red Sox", 150, 649, 569, 188,    4,  69,  43, 111,  130,   6,  0.33, 0.402,
  "Alex Bregman",  "Astros", 157, 705, 594, 170,   12,  96,  31, 105,  103,  10, 0.286, 0.394,
  "Jose Ramirez", "Indians", 157, 698, 578, 156,    8, 106,  39, 110,  105,  34,  0.27, 0.387
)


# Denominators needed for calculations
calc_tbl <- tibble::tribble(
  ~data_col, ~calc_denom,
  "HR",        14.3,
  "R",        19.6,
  "RBI",        17.5,
  "SB",        26.2,
  "AVG",      0.0045,
  "OBP",      0.0031
) %>% 
  spread(key = data_col, value = "calc_denom")


# Get league average of teams
lg_avg <- teams %>% 
# Divide counting stats by 10 to get the averages for 10 batters
  mutate_at(vars(PA:SB), funs(./10)) %>% 
  summarize_if(is.numeric, mean, na.rm=TRUE)

lg_avg
#> # A tibble: 1 x 11
#>      PA    AB     H   HBP    BB    HR     R   RBI    SB   AVG   OBP
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  625.  555.  143.  6.68  57.3  22.5  83.1  79.5  8.58 0.257 0.331

# Calculate Values
value_tbl <- players %>% 
  mutate(calc_R = R / calc_tbl$R,
         calc_HR = HR / calc_tbl$HR,
         calc_RBI = RBI / calc_tbl$RBI,
         calc_SB = SB / calc_tbl$SB,
         calc_BA = (((lg_avg$H * 13 ) + H)/(AB + (lg_avg$AB * 13)) - lg_avg$AVG) / calc_tbl$AVG,
         calc_Total = (calc_R + calc_HR + calc_RBI + calc_SB + calc_BA))
  
value_tbl 
#> # A tibble: 5 x 20
#>   Name    Team      G    PA    AB     H   HBP    BB    HR     R   RBI    SB
#>   <chr>   <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mookie… Red …   136   614   520   180     8    81    32   129    80    30
#> 2 Mike T… Ange…   140   608   471   147    10   122    39   101    79    24
#> 3 J.D. M… Red …   150   649   569   188     4    69    43   111   130     6
#> 4 Alex B… Astr…   157   705   594   170    12    96    31   105   103    10
#> 5 Jose R… Indi…   157   698   578   156     8   106    39   110   105    34
#> # ... with 8 more variables: AVG <dbl>, OBP <dbl>, calc_R <dbl>,
#> #   calc_HR <dbl>, calc_RBI <dbl>, calc_SB <dbl>, calc_BA <dbl>,
#> #   calc_Total <dbl>

Thanks in advance.

jim89 · January 25, 2019, 1:14pm

I suspect you could probably achieve something similar to what you're looking for with a *_join()-like approach, rather than using the $ syntax to use values from one table in another (although this may mean you need to think about your data structure a little more). I'm not sure if that is necessarily better (as it looks like your current code is working), but it might be a bit easier to read and/or debug, as the intent of your operations would be a bit clearer. For example, for the calculations that use the calc_tbl you could do something like:

players %>% 
  select(Name, Team, R, HR, RBI, SB) %>% 
  gather(key, player_value, -Name, -Team) %>% 
  left_join(gather(calc_tbl)) %>% 
  mutate(calculated = player_value / value) %>% 
  select(Name, Team, key, calculated) %>% 
  spread(key, calculated)

(Please forgive my lazy column names).

You may be able to do something similar for the calculations that use lg_avg, although given that the calculation is a little more complex it may be slightly harder. (I'm guessing that the teams values are in some way aggregates/summaries of the player values, so you might actually be able to start from just the players data and make everything you need for team-level values with judicious use of group_by() and mutate()). It also looks like you're performing calculations you don't need to, e.g. many of the fields in lg_avg do not currently get used. You may need them elsewhere, but at the moment it looks like you're doing work you don't need to be.

realhiphop · January 25, 2019, 2:02pm

Thanks. I nixed that approach since it wouldn't work with the calc_BA which uses the league averages. I probably should add that to my initial post as one of the methods I tried.

jim89 · January 25, 2019, 2:20pm

You mean the join approach?

Am I correct in thinking that team stats are derived from player stats, and league stats are derived from team stats? If that's the case you can probably calculate everything you need from just the player stats, without having to create the intermediate objects as you are currently (although they may, of course, be useful for other things beyond what you're showing here...).

realhiphop · January 25, 2019, 2:44pm

Yep, I meant with a join. You are right! Player stats aggregate out to the team stats. The one reason to keep the average df separate would be if I wanted to use 2019 projections with the 2018 averages.

system · February 15, 2019, 2:44pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.