left_join creating duplicates when merging dataframes

A bit of a lengthy explanation here, so please bear with me with.

I am working with two dataframes:

  1. qb.metrics.df
  2. roster.df

In the end, I would like to left_join them together.

Here is a look at qb.metrics.df

 Rank PlayerPlayerId PlayerShortName PlayerAgeExact PassAttemptsPer~ PassingYards CompletionPerce~ PassingTouchdow~
   <dbl>          <dbl> <chr>                    <dbl> <chr>            <chr>        <chr>            <chr>           
 1     1          20889 K. Murray                 23.3 34.6             2375         68.2             17              
 2     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 3     3          14536 R. Wilson                 32   37.1             2789         69.8             28              
 4     4          18890 P. Mahomes                25.2 36.6             2687         66.9             25              
 5     5           2593 A. Rodgers                37   34.9             2578         67.8             26              
 6     6           4314 T. Brady                  43.3 38.5             2739         66               23              
 7     7          18857 D. Watson                 25.2 33.4             2539         68.1             18              
 8     8          21681 J. Herbert                22.7 37.6             2333         66.8             19              
 9     9          19781 L. Jackson                23.9 27.4             1762         64               14              
10    10          16497 T. Bridgewater            28   33.1             2544         71.9             13              

... add here is a look at roster.df

season team  position depth_chart_pos~ jersey_number status full_name first_name last_name birth_date height weight
    <dbl> <chr> <chr>    <chr>                    <int> <chr>  <chr>     <chr>      <chr>     <date>     <chr>  <chr> 
 1   2020 ARI   C        C                           52 Active Mason Co~ Mason      Cole      1996-03-28 6-5    292   
 2   2020 ARI   C        C                           53 Active Lamont G~ Lamont     Gaillard  1996-02-08 6-3    305   
 3   2020 ARI   CB       NB                          33 Active Byron Mu~ Byron      Murphy    1998-01-18 5-11   190   
 4   2020 ARI   CB       NA                          20 Active Prince A~ Prince     Amukamara 1989-06-06 6-0    204   
 5   2020 ARI   CB       NA                          39 Active Jace Whi~ Jace       Whittaker 1995-07-16 5-11   185   
 6   2020 ARI   CB       NA                          27 Injur~ Kevin Pe~ Kevin      Peterson  1994-03-22 5-11   185   
 7   2020 ARI   CB       LCB                         25 Active Johnatha~ Johnathan  Joseph    1984-04-16 5-11   185   
 8   2020 ARI   CB       NA                          23 Injur~ Robert A~ Robert     Alford    1988-10-31 5-10   186   
 9   2020 ARI   CB       LCB                         21 Active Patrick ~ Patrick    Peterson  1990-07-11 6-1    203   
10   2020 ARI   CB       RCB                         20 Active Dre Kirk~ Dre        Kirkpatr~ 1989-10-26 6-2    190   

In order to user left_join, I thought it best to match up 'PlayerShortName' from qb.metrics.df and 'full_name' from roster.df.

To do so, I dropped everything before the first letter of the first name and added a period (to match p with 'PlayerShortName' in qb.metrics.df):

roster.df$full_name <- sub('(.)\\w+ (\\w+)', '\\1. \\2', roster.df$full_name)

It works fine, as you can see in the updated version of roster.df:

 season team  position depth_chart_pos~ jersey_number status full_name first_name last_name birth_date height weight
    <dbl> <chr> <chr>    <chr>                    <int> <chr>  <chr>     <chr>      <chr>     <date>     <chr>  <chr> 
 1   2020 ARI   C        C                           52 Active M. Cole   Mason      Cole      1996-03-28 6-5    292   
 2   2020 ARI   C        C                           53 Active L. Gaill~ Lamont     Gaillard  1996-02-08 6-3    305   
 3   2020 ARI   CB       NB                          33 Active B. Murphy Byron      Murphy    1998-01-18 5-11   190   
 4   2020 ARI   CB       NA                          20 Active P. Amuka~ Prince     Amukamara 1989-06-06 6-0    204   
 5   2020 ARI   CB       NA                          39 Active J. Whitt~ Jace       Whittaker 1995-07-16 5-11   185   
 6   2020 ARI   CB       NA                          27 Injur~ K. Peter~ Kevin      Peterson  1994-03-22 5-11   185   
 7   2020 ARI   CB       LCB                         25 Active J. Joseph Johnathan  Joseph    1984-04-16 5-11   185   
 8   2020 ARI   CB       NA                          23 Injur~ R. Alford Robert     Alford    1988-10-31 5-10   186   
 9   2020 ARI   CB       LCB                         21 Active P. Peter~ Patrick    Peterson  1990-07-11 6-1    203   
10   2020 ARI   CB       RCB                         20 Active D. Kirkp~ Dre        Kirkpatr~ 1989-10-26 6-2    190   

However, when I do a left_merge, I get all kinds of duplicates ...

qb.metrics.df <- qb.metrics.df %>%
  left_join(roster.df, by = c("PlayerShortName" = "full_name"))

Doing so results in this:

   Rank PlayerPlayerId PlayerShortName PlayerAgeExact PassAttemptsPer~ PassingYards CompletionPerce~ PassingTouchdow~
   <dbl>          <dbl> <chr>                    <dbl> <chr>            <chr>        <chr>            <chr>           
 1     1          20889 K. Murray                 23.3 34.6             2375         68.2             17              
 2     1          20889 K. Murray                 23.3 34.6             2375         68.2             17              
 3     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 4     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 5     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 6     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 7     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 8     2          19801 J. Allen                  24.5 36.4             2873         68.4             21              
 9     3          14536 R. Wilson                 32   37.1             2789         69.8             28              
10     3          14536 R. Wilson                 32   37.1             2789         69.8             28   

As you can see, it creates duplicates of the information and I cannot figure out why.

Any help is greatly appreciated!

EDIT:

Here is the reproducible example for qb. metrics.df:

structure(list(Rank = c(1, 2, 3, 4, 5, 6), PlayerPlayerId = c(20889, 
19801, 14536, 18890, 2593, 4314), PlayerShortName = c("K. Murray", 
"J. Allen", "R. Wilson", "P. Mahomes", "A. Rodgers", "T. Brady"
), PlayerAgeExact = c(23.3, 24.5, 32, 25.2, 37, 43.3), PassAttemptsPerGame = c("34.6", 
"36.4", "37.1", "36.6", "34.9", "38.5"), PassingYards = c("2375", 
"2873", "2789", "2687", "2578", "2739"), CompletionPercentage = c("68.2", 
"68.4", "69.8", "66.9", "67.8", "66"), PassingTouchdowns = c("17", 
"21", "28", "25", "26", "23"), Interceptions = c("8", "7", "10", 
"1", "3", "7"), PlayactionPassCompletionPercentage = c("71.4", 
"66.4", "76.5", "71.3", "68.1", "74.7"), TruePasserRating = c("102.6", 
"99.4", "119", "128.5", "130.2", "105.8"), ReceiverTargetSeparation = c("1.79", 
"1.89", "2.16", "1.96", "2", "1.53"), DroppedPasses = c("9", 
"19", "14", "61", "49", "29"), AccuracyRating = c("7.6", "7.5", 
"7.8", "7.4", "7.6", "7.2"), RedZoneAttempts = c("34", "50", 
"55", "46", "48", "64"), RedZoneCompletionPercentage = c("67.6", 
"66", "72.7", "67.4", "66.7", "59.4"), InterceptablePasses = c("11", 
"18", "11", "8", "8", "10"), InterceptablePassesPerGame = c("1.2", 
"1.8", "1.2", "0.9", "0.9", "1")), row.names = c(NA, -6L), class = c("tbl_df", 
"tbl", "data.frame"))

... and the example for roster.df:

structure(list(season = c(2020, 2020, 2020, 2020, 2020, 2020), 
    team = c("ARI", "ARI", "ARI", "ARI", "ARI", "ARI"), position = c("C", 
    "C", "CB", "CB", "CB", "CB"), depth_chart_position = c("C", 
    "C", "NB", NA, NA, NA), jersey_number = c(52L, 53L, 33L, 
    20L, 39L, 27L), status = c("Active", "Active", "Active", 
    "Active", "Active", "Injured Reserve"), full_name = c("Mason Cole", 
    "Lamont Gaillard", "Byron Murphy", "Prince Amukamara", "Jace Whittaker", 
    "Kevin Peterson"), first_name = c("Mason", "Lamont", "Byron", 
    "Prince", "Jace", "Kevin"), last_name = c("Cole", "Gaillard", 
    "Murphy", "Amukamara", "Whittaker", "Peterson"), birth_date = structure(c(9583, 
    9534, 10244, 7096, 9327, 8846), class = "Date"), height = c("6-5", 
    "6-3", "5-11", "6-0", "5-11", "5-11"), weight = c("292", 
    "305", "190", "204", "185", "185"), college = c("Michigan", 
    "Georgia", "Washington", "Nebraska", "Arizona", "Oklahoma State"
    ), high_school = c("East Lake (FL)", "Pine Forest (NC)", 
    "Saguaro (AZ)", "Apollo (AZ)", "Oceanside (CA)", "Wagoner (OK)"
    ), gsis_id = c("00-0034785", "00-0035536", "00-0035236", 
    "00-0027957", "00-0036043", "00-0032848"), espn_id = c(3115972L, 
    3128707L, 4038999L, 13975L, 3821572L, 2977742L), sportradar_id = c("53d25371-e3ce-4030-8d0a-82def5cdc600", 
    "0c8b0581-9ed2-488b-bcaa-ef783261dfd1", "c025b513-9431-4097-bc25-9777bf08f846", 
    "f1879cfa-4c07-4140-9da0-c7ebe9af2dfd", "528bf5c9-1d23-40c3-adda-df21f8f0e2ab", 
    "30e539c3-74dd-4a9a-9ebb-4bdd0f0d39f8"), yahoo_id = c(31067L, 
    32011L, 31865L, 24806L, 33089L, 29924L), rotowire_id = c(12795L, 
    13678L, 13560L, 7509L, 14793L, 11101L), update_dt = structure(c(1605942526.92275, 
    1605942526.92275, 1605942526.92275, 1605942526.92275, 1605942526.92275, 
    1605942526.92275), tzone = "America/New_York", class = c("POSIXct", 
    "POSIXt")), headshot_url = c("https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/3115972.png", 
    "https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/3128707.png", 
    "https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/4038999.png", 
    "https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/13975.png", 
    "https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/3821572.png", 
    "https://a.espncdn.com/combiner/i?img=/i/headshots/nfl/players/full/2977742.png"
    )), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

Your code seems fine. I'm unable to reproduce the issue with the sample data so I think the complete roster.df probably contains multiple records with the same full_name.

You could try verifying this with dplyr::count(roster.df, full_name, sort = TRUE).

left_join will result in new if, for example, roster.df has more than one row for each player. I see that roster.df has a column called season. If roster.df has multiple seasons and players can appear in more than one season, then you would get multiple rows. However, the joined data frame in your example doesn't seem to have a season column. If roster.df has a season column then the joined data frame should have that column as well.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.