Having some trouble coming up with an approach to solving this problem. Given this data set (which, in real life has lots more observations and far more variables than just these five),
# Name Team H AB BA
# 1 Mookie Betts BOS 7 20 0.350
# 2 Kevin Pillar TOR 10 27 0.370
# 3 Kevin Pillar SFG 2 12 0.167
# 4 Xander Bogaerts BOS 9 32 0.281
I'd like to come up with this:
# Name Team H AB BA
# 1 Mookie Betts BOS 7 20 0.350
# 2 Kevin Pillar --- 12 39 0.308
# 3 Xander Bogaerts BOS 9 32 0.281
Note the grouping by Name and the consolidation of the data from multiple rows into just one.
I tried grouping and summarizing without much luck. My next thought was perhaps to develop subtotals for duplicate observations and then delete the individual observations, but here again kept running into a wall. Usually, the more outlandish my solution, the more I realize I'm overlooking something obvious.
Your problem is very well stated. I would also suggest to add a quick code to construct the dataset as it would make the lives of potential helpers easier.
The fact that Kevin Pillar belongs to 2 different teams in the dataset means that you should consider dropping the Team column in the final dataset. The following code will only work with dplyr 1.0.0
library(dplyr)
df <- data.frame(
Name = c("Mookie Betts","Kevin Pillar",
"Kevin Pillar","Xander Bogaerts"),
Team = c("BOS", "TOR", "SFG", "BOS"),
H = c(7L, 10L, 2L, 9L),
AB = c(20L, 27L, 12L, 32L),
BA = c(0.35, 0.308, 0.167, 0.281)
)
summary_df <-
df %>%
group_by(Name) %>%
summarize(across(c(H, AB, BA), sum))
summary_df
# A tibble: 3 x 4
Name H AB BA
<chr> <int> <int> <dbl>
1 Kevin Pillar 12 39 0.475
2 Mookie Betts 7 20 0.35
3 Xander Bogaerts 9 32 0.281
Thank you for your comment, Ghislain. I will be sure to include code in future queries to build a sample data set.
Your approach is sound, certainly, but ultimately I need access to all the variables of the data set, so I don't really have the option of leaving out Team.
Perhaps I can take a new approach. Extract (and delete) from the original data frame the multiple rows where they exist, perform the summarization steps you describe, append a new "---" Team field for each observation, and then bind that data frame back into the original.
# A tibble: 3 x 5
Name H AB BA Team
<chr> <int> <int> <dbl> <chr>
1 Kevin Pillar 12 39 0.475 --
2 Mookie Betts 7 20 0.35 BOS
3 Xander Bogaerts 9 32 0.281 BOS
@rdr I think I found a solution that you may like because it keeps the information on teams. Do not forget to mark one of the answers as the solution for the sake of potential future readers.
library(dplyr)
library(tidyr)
df <- data.frame(
Name = c("Mookie Betts","Kevin Pillar",
"Kevin Pillar","Xander Bogaerts"),
Team = c("BOS", "TOR", "SFG", "BOS"),
H = c(7L, 10L, 2L, 9L),
AB = c(20L, 27L, 12L, 32L),
BA = c(0.35, 0.308, 0.167, 0.281)
)
df
Name Team H AB BA
1 Mookie Betts BOS 7 20 0.350
2 Kevin Pillar TOR 10 27 0.308
3 Kevin Pillar SFG 2 12 0.167
4 Xander Bogaerts BOS 9 32 0.281
df %>%
group_by(Name) %>%
summarize(
Teams = paste0(Team, collapse = ", "),
across(c(H, AB, BA), sum)
)
# A tibble: 3 x 5
Name Teams H AB BA
<chr> <chr> <int> <int> <dbl>
1 Kevin Pillar TOR, SFG 12 39 0.475
2 Mookie Betts BOS 7 20 0.35
3 Xander Bogaerts BOS 9 32 0.281