Consolidating (i.e., Subtotaling) Rows Based on an ID variable

Having some trouble coming up with an approach to solving this problem. Given this data set (which, in real life has lots more observations and far more variables than just these five),

# Name              Team  H AB    BA
# 1    Mookie Betts  BOS  7 20 0.350
# 2    Kevin Pillar  TOR 10 27 0.370
# 3    Kevin Pillar  SFG  2 12 0.167
# 4 Xander Bogaerts  BOS  9 32 0.281

I'd like to come up with this:

# Name              Team  H AB    BA
# 1    Mookie Betts  BOS  7 20 0.350
# 2    Kevin Pillar  --- 12 39 0.308
# 3 Xander Bogaerts  BOS  9 32 0.281

Note the grouping by Name and the consolidation of the data from multiple rows into just one.

I tried grouping and summarizing without much luck. My next thought was perhaps to develop subtotals for duplicate observations and then delete the individual observations, but here again kept running into a wall. Usually, the more outlandish my solution, the more I realize I'm overlooking something obvious.

Appreciate any and all guidance! Thank you,.

-- Robert

Hi Robert,

Your problem is very well stated. I would also suggest to add a quick code to construct the dataset as it would make the lives of potential helpers easier.

The fact that Kevin Pillar belongs to 2 different teams in the dataset means that you should consider dropping the Team column in the final dataset. The following code will only work with dplyr 1.0.0

library(dplyr)


df <- data.frame(
	Name = c("Mookie Betts","Kevin Pillar",
				"Kevin Pillar","Xander Bogaerts"),
	Team = c("BOS", "TOR", "SFG", "BOS"),
	H = c(7L, 10L, 2L, 9L),
	AB = c(20L, 27L, 12L, 32L),
	BA = c(0.35, 0.308, 0.167, 0.281)
)

summary_df <-
	df %>%
	group_by(Name) %>%
	summarize(across(c(H, AB, BA), sum))

summary_df 

# A tibble: 3 x 4
  Name                H    AB    BA
  <chr>           <int> <int> <dbl>
1 Kevin Pillar       12    39 0.475
2 Mookie Betts        7    20 0.35 
3 Xander Bogaerts     9    32 0.281
1 Like

Thank you for your comment, Ghislain. I will be sure to include code in future queries to build a sample data set.

Your approach is sound, certainly, but ultimately I need access to all the variables of the data set, so I don't really have the option of leaving out Team.

Perhaps I can take a new approach. Extract (and delete) from the original data frame the multiple rows where they exist, perform the summarization steps you describe, append a new "---" Team field for each observation, and then bind that data frame back into the original.

Thank you again for your comments.

-- Robert


charsum <- function(vec) {if (length(vec) < 2) {return(vec)} else {return("--")}}

(summary_df <-
  df %>%
  group_by(Name) %>%
  summarize(across(c(H, AB, BA), sum),across(Team, charsum)))
# A tibble: 3 x 5
  Name                H    AB    BA Team 
  <chr>           <int> <int> <dbl> <chr>
1 Kevin Pillar       12    39 0.475 --   
2 Mookie Betts        7    20 0.35  BOS  
3 Xander Bogaerts     9    32 0.281 BOS
1 Like

@rdr I think I found a solution that you may like because it keeps the information on teams. Do not forget to mark one of the answers as the solution for the sake of potential future readers.

library(dplyr)
library(tidyr)

df <- data.frame(
	Name = c("Mookie Betts","Kevin Pillar",
				"Kevin Pillar","Xander Bogaerts"),
	Team = c("BOS", "TOR", "SFG", "BOS"),
	H = c(7L, 10L, 2L, 9L),
	AB = c(20L, 27L, 12L, 32L),
	BA = c(0.35, 0.308, 0.167, 0.281)
)

df

             Name Team  H AB    BA
1    Mookie Betts  BOS  7 20 0.350
2    Kevin Pillar  TOR 10 27 0.308
3    Kevin Pillar  SFG  2 12 0.167
4 Xander Bogaerts  BOS  9 32 0.281

df %>%
	group_by(Name) %>%
	summarize(
		Teams = paste0(Team, collapse = ", "),
		across(c(H, AB, BA), sum)
	)

# A tibble: 3 x 5
  Name            Teams        H    AB    BA
  <chr>           <chr>    <int> <int> <dbl>
1 Kevin Pillar    TOR, SFG    12    39 0.475
2 Mookie Betts    BOS          7    20 0.35 
3 Xander Bogaerts BOS          9    32 0.281

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

Thank you, @nirgrahamuk. This is a very helpful illustration of what's possible with across(). Appreciate your reply.

-- Robert