A bit of a lengthy explanation here, so please bear with me with.
I am working with two dataframes:
- qb.metrics.df
- roster.df
In the end, I would like to left_join them together.
Here is a look at qb.metrics.df
Rank PlayerPlayerId PlayerShortName PlayerAgeExact PassAttemptsPer~ PassingYards CompletionPerce~ PassingTouchdow~
<dbl> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 1 20889 K. Murray 23.3 34.6 2375 68.2 17
2 2 19801 J. Allen 24.5 36.4 2873 68.4 21
3 3 14536 R. Wilson 32 37.1 2789 69.8 28
4 4 18890 P. Mahomes 25.2 36.6 2687 66.9 25
5 5 2593 A. Rodgers 37 34.9 2578 67.8 26
6 6 4314 T. Brady 43.3 38.5 2739 66 23
7 7 18857 D. Watson 25.2 33.4 2539 68.1 18
8 8 21681 J. Herbert 22.7 37.6 2333 66.8 19
9 9 19781 L. Jackson 23.9 27.4 1762 64 14
10 10 16497 T. Bridgewater 28 33.1 2544 71.9 13
... add here is a look at roster.df
season team position depth_chart_pos~ jersey_number status full_name first_name last_name birth_date height weight
<dbl> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <date> <chr> <chr>
1 2020 ARI C C 52 Active Mason Co~ Mason Cole 1996-03-28 6-5 292
2 2020 ARI C C 53 Active Lamont G~ Lamont Gaillard 1996-02-08 6-3 305
3 2020 ARI CB NB 33 Active Byron Mu~ Byron Murphy 1998-01-18 5-11 190
4 2020 ARI CB NA 20 Active Prince A~ Prince Amukamara 1989-06-06 6-0 204
5 2020 ARI CB NA 39 Active Jace Whi~ Jace Whittaker 1995-07-16 5-11 185
6 2020 ARI CB NA 27 Injur~ Kevin Pe~ Kevin Peterson 1994-03-22 5-11 185
7 2020 ARI CB LCB 25 Active Johnatha~ Johnathan Joseph 1984-04-16 5-11 185
8 2020 ARI CB NA 23 Injur~ Robert A~ Robert Alford 1988-10-31 5-10 186
9 2020 ARI CB LCB 21 Active Patrick ~ Patrick Peterson 1990-07-11 6-1 203
10 2020 ARI CB RCB 20 Active Dre Kirk~ Dre Kirkpatr~ 1989-10-26 6-2 190
In order to user left_join, I thought it best to match up 'PlayerShortName' from qb.metrics.df and 'full_name' from roster.df.
To do so, I dropped everything before the first letter of the first name and added a period (to match p with 'PlayerShortName' in qb.metrics.df):
roster.df$full_name <- sub('(.)\\w+ (\\w+)', '\\1. \\2', roster.df$full_name)
It works fine, as you can see in the updated version of roster.df:
season team position depth_chart_pos~ jersey_number status full_name first_name last_name birth_date height weight
<dbl> <chr> <chr> <chr> <int> <chr> <chr> <chr> <chr> <date> <chr> <chr>
1 2020 ARI C C 52 Active M. Cole Mason Cole 1996-03-28 6-5 292
2 2020 ARI C C 53 Active L. Gaill~ Lamont Gaillard 1996-02-08 6-3 305
3 2020 ARI CB NB 33 Active B. Murphy Byron Murphy 1998-01-18 5-11 190
4 2020 ARI CB NA 20 Active P. Amuka~ Prince Amukamara 1989-06-06 6-0 204
5 2020 ARI CB NA 39 Active J. Whitt~ Jace Whittaker 1995-07-16 5-11 185
6 2020 ARI CB NA 27 Injur~ K. Peter~ Kevin Peterson 1994-03-22 5-11 185
7 2020 ARI CB LCB 25 Active J. Joseph Johnathan Joseph 1984-04-16 5-11 185
8 2020 ARI CB NA 23 Injur~ R. Alford Robert Alford 1988-10-31 5-10 186
9 2020 ARI CB LCB 21 Active P. Peter~ Patrick Peterson 1990-07-11 6-1 203
10 2020 ARI CB RCB 20 Active D. Kirkp~ Dre Kirkpatr~ 1989-10-26 6-2 190
However, when I do a left_merge, I get all kinds of duplicates ...
qb.metrics.df <- qb.metrics.df %>%
left_join(roster.df, by = c("PlayerShortName" = "full_name"))
Doing so results in this:
Rank PlayerPlayerId PlayerShortName PlayerAgeExact PassAttemptsPer~ PassingYards CompletionPerce~ PassingTouchdow~
<dbl> <dbl> <chr> <dbl> <chr> <chr> <chr> <chr>
1 1 20889 K. Murray 23.3 34.6 2375 68.2 17
2 1 20889 K. Murray 23.3 34.6 2375 68.2 17
3 2 19801 J. Allen 24.5 36.4 2873 68.4 21
4 2 19801 J. Allen 24.5 36.4 2873 68.4 21
5 2 19801 J. Allen 24.5 36.4 2873 68.4 21
6 2 19801 J. Allen 24.5 36.4 2873 68.4 21
7 2 19801 J. Allen 24.5 36.4 2873 68.4 21
8 2 19801 J. Allen 24.5 36.4 2873 68.4 21
9 3 14536 R. Wilson 32 37.1 2789 69.8 28
10 3 14536 R. Wilson 32 37.1 2789 69.8 28
As you can see, it creates duplicates of the information and I cannot figure out why.
Any help is greatly appreciated!
EDIT:
Here is the reproducible example for qb. metrics.df:
structure(list(Rank = c(1, 2, 3, 4, 5, 6), PlayerPlayerId = c(20889,
19801, 14536, 18890, 2593, 4314), PlayerShortName = c("K. Murray",
"J. Allen", "R. Wilson", "P. Mahomes", "A. Rodgers", "T. Brady"
), PlayerAgeExact = c(23.3, 24.5, 32, 25.2, 37, 43.3), PassAttemptsPerGame = c("34.6",
"36.4", "37.1", "36.6", "34.9", "38.5"), PassingYards = c("2375",
"2873", "2789", "2687", "2578", "2739"), CompletionPercentage = c("68.2",
"68.4", "69.8", "66.9", "67.8", "66"), PassingTouchdowns = c("17",
"21", "28", "25", "26", "23"), Interceptions = c("8", "7", "10",
"1", "3", "7"), PlayactionPassCompletionPercentage = c("71.4",
"66.4", "76.5", "71.3", "68.1", "74.7"), TruePasserRating = c("102.6",
"99.4", "119", "128.5", "130.2", "105.8"), ReceiverTargetSeparation = c("1.79",
"1.89", "2.16", "1.96", "2", "1.53"), DroppedPasses = c("9",
"19", "14", "61", "49", "29"), AccuracyRating = c("7.6", "7.5",
"7.8", "7.4", "7.6", "7.2"), RedZoneAttempts = c("34", "50",
"55", "46", "48", "64"), RedZoneCompletionPercentage = c("67.6",
"66", "72.7", "67.4", "66.7", "59.4"), InterceptablePasses = c("11",
"18", "11", "8", "8", "10"), InterceptablePassesPerGame = c("1.2",
"1.8", "1.2", "0.9", "0.9", "1")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
... and the example for roster.df:
structure(list(season = c(2020, 2020, 2020, 2020, 2020, 2020),
team = c("ARI", "ARI", "ARI", "ARI", "ARI", "ARI"), position = c("C",
"C", "CB", "CB", "CB", "CB"), depth_chart_position = c("C",
"C", "NB", NA, NA, NA), jersey_number = c(52L, 53L, 33L,
20L, 39L, 27L), status = c("Active", "Active", "Active",
"Active", "Active", "Injured Reserve"), full_name = c("Mason Cole",
"Lamont Gaillard", "Byron Murphy", "Prince Amukamara", "Jace Whittaker",
"Kevin Peterson"), first_name = c("Mason", "Lamont", "Byron",
"Prince", "Jace", "Kevin"), last_name = c("Cole", "Gaillard",
"Murphy", "Amukamara", "Whittaker", "Peterson"), birth_date = structure(c(9583,
9534, 10244, 7096, 9327, 8846), class = "Date"), height = c("6-5",
"6-3", "5-11", "6-0", "5-11", "5-11"), weight = c("292",
"305", "190", "204", "185", "185"), college = c("Michigan",
"Georgia", "Washington", "Nebraska", "Arizona", "Oklahoma State"
), high_school = c("East Lake (FL)", "Pine Forest (NC)",
"Saguaro (AZ)", "Apollo (AZ)", "Oceanside (CA)", "Wagoner (OK)"
), gsis_id = c("00-0034785", "00-0035536", "00-0035236",
"00-0027957", "00-0036043", "00-0032848"), espn_id = c(3115972L,
3128707L, 4038999L, 13975L, 3821572L, 2977742L), sportradar_id = c("53d25371-e3ce-4030-8d0a-82def5cdc600",
"0c8b0581-9ed2-488b-bcaa-ef783261dfd1", "c025b513-9431-4097-bc25-9777bf08f846",
"f1879cfa-4c07-4140-9da0-c7ebe9af2dfd", "528bf5c9-1d23-40c3-adda-df21f8f0e2ab",
"30e539c3-74dd-4a9a-9ebb-4bdd0f0d39f8"), yahoo_id = c(31067L,
32011L, 31865L, 24806L, 33089L, 29924L), rotowire_id = c(12795L,
13678L, 13560L, 7509L, 14793L, 11101L), update_dt = structure(c(1605942526.92275,
1605942526.92275, 1605942526.92275, 1605942526.92275, 1605942526.92275,
1605942526.92275), tzone = "America/New_York", class = c("POSIXct",
"POSIXt")), headshot_url = c("https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/3115972.png",
"https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/3128707.png",
"https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/4038999.png",
"https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/13975.png",
"https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/3821572.png",
"https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/2977742.png"
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))