Create New Data Frame with First Instance ONLY for Year and Team

I am stumped and trying to create a dataframe with the first instance ONLY appearing for Year and Team from my FinalGS dataframe. Screengrab of this dataframe and associated code is below.

Presently, the FinalGS dataframe lists EVERY playerID that appeared in a game for that team. NumSP is the column I need to use for y axis ggplots....how do I create code to reduce every unique playerID to just ONE row per team in the FinalGS dataframe?

For example, for the the 1997 ANA, I need just the first row returned in a dataframe (e.g. finlech01 ) and not all 11 names currently listed after (e.g. langsma01, perisma01) for that specific team and year. Wanting to replicate this for all other unique year and team IDs to then perform analysis.

library(Lahman)
library(tidyverse)
library(dplyr)
library(tidyr)
library(purrr)
library(ggrepel)
View(LahmanData)
View(Pitching)
View(Teams)
Totals = merge(Teams, Pitching, by=c("yearID","teamID"))
View(Totals)

#To see how many total pitchers had a GS on the 2011 Milwaukee Brewers, 6 total
MIL <- filter(Totals, yearID == 2011, teamID == "MIL", GS > 0)
View(MIL)
#This below lists any pitcher with a GS on a club from 1990-2021, throwing 2022 out as the season is not complete
GSPitching1 <- filter(Totals, yearID < 2022, yearID >1989)
as_tibble(GSPitching1)
View(GSPitching1)

#This below is using dplyr group by and summarize to get INDIVIDUAL arms that has a GS in TeamGSUSE dataFrame
teamGSUSE <- GSPitching1 %>% 
  select(yearID, teamID, playerID, G.x, GS, W.x, L.x, ERA.x) 
View(teamGSUSE)             
TeamGS <- GSPitching1 %>% 
  group_by(yearID, teamID) %>% 
  summarise(TGS = sum(GS))
head(TeamGS)
View(TeamGS)
head(teamGSUSE)
View(teamGSUSE)
library(utils)
##The issue here is bringing TGS from TeamGS over to teamGSUSE b/c there are differing column totals
#for total number of teams and total number of pitchers. e.g. 13 man pitching staffs for 1 team
#solved w/ join on 2 variables below
library(base)
#This below works to merge them ON TWO COLUMNS
merged <-merge(TeamGS, teamGSUSE, by = c('teamID', 'yearID'))

View(merged)

#This below creates a new DataFrame where a pitcher started a game/had a GS in that season, throws out 100% relief pitchers
GSmerged <- filter(merged, GS > 0)
View(GSmerged)

#Stumbled here and figured it out (hat tip to "ML")
# By TEAM and YEAR. Then COUNT N() 
#Pipe then summarize number starts = n (). This works below!
#n() counts up the values into a new column
newmerged <- GSmerged %>% 
  group_by(teamID, yearID, ) %>% 
  summarise(NumSP = n())
View(newmerged)

#This merged the last two DataFrames into one for analysis and ggplot use with team stats and individual GS numbers by staff

FinalGS <-merge(newmerged, GSmerged, by = c('teamID', 'yearID'))

View(FinalGS)

Can you use group_by() and slice() as in the following example?

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
DF <- data.frame(TeamID = rep(c("A", "B"), each = 4),
                 YearID = c(1997,1997,1998, 1998, 1997, 1997, 1998,1998),
                 Player = c("AA", "BB", "AA", "BB", "CC", "DD", "CC", "DD"),
                 ERA = runif(8,1,6))
DF
#>   TeamID YearID Player      ERA
#> 1      A   1997     AA 2.247019
#> 2      A   1997     BB 5.987332
#> 3      A   1998     AA 4.843289
#> 4      A   1998     BB 3.007367
#> 5      B   1997     CC 5.698770
#> 6      B   1997     DD 5.428261
#> 7      B   1998     CC 4.459919
#> 8      B   1998     DD 4.697724
DF2 <- DF |> group_by(TeamID, YearID) |> slice(1)
DF2
#> # A tibble: 4 x 4
#> # Groups:   TeamID, YearID [4]
#>   TeamID YearID Player   ERA
#>   <chr>   <dbl> <chr>  <dbl>
#> 1 A        1997 AA      2.25
#> 2 A        1998 AA      4.84
#> 3 B        1997 CC      5.70
#> 4 B        1998 CC      4.46

Created on 2022-08-25 by the reprex package (v2.0.1)

Thanks for replying. I am not familiar with the slice() function. Would that do it automatically for first unique instances by team and year? in your example, it appears you made those columns manually before splicing?

slice() takes an arbitrary row, chosen by parameter n; in this case n=1, i.e. the first row of each group

1 Like

Thank you very much for this! Slice() did the trick and I followed the code/example posted by @FJCC upthread to amend my code and get it where I needed it to be to perform analysis/viz in ggplots.

The solution of code for me ended up being:

FinalGS1 <- FinalGS|> group_by(teamID, yearID)|> slice(1)

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.