Assistance in combining row data with different variable

Hey guys, I am very new to R and need some assistance. I have a big data set with numerous variables. I have attached an example below

Age <- c(24, 24, 24)
Name <- c('Pete', 'Pete', 'Pete')
City <- c('PHI', 'MIN', 'TOT')

dat <- data.frame(Age, Name, City)
dat

I am wondering how to combine the rows together so it looks something like this with the cities all in the same cell, and being only one observation rather than 3.

Age <- c(24)
Name <- c('Pete')
City <- c('PHI, MIN, TOT')

dat2 <- data.frame(Age, Name, City)
dat2

The data has a few cases of double ups like this, but not all of it is double ups, so I am wondering if there is possibly a script to run through the whole data set and combine it as such, rather than pulling out each case individually, running this data and then putting it back into the dataset?

Any help will be greatly appreciated!

Thanks

Hi @joshrockd258, if these are all the same Pete, then the structure of dat would seem to be preferable from a data science point of view -- is there are reason you want the structure of dat2?

I should clarify my apologies, in this case they are all the same pete, but there are other names (with last names) and cases. There are about 40 other variables that are all numerical that will be summed and added later on in the analysis, but obviously cant be done with the city data

Could you say what it is you're trying to do with the data? That would help folks understand better how to help.

So the list is of a group of athletes for particular sports, and I am developing a scouting system to determine which athletes are best for the program based on certain metrics which are the other 40 variables. It is looking at about 2 years worth of the athletes data looking at how they perform in certain metrics in their chosen sport, etc. The cities are the teams location of the player, and the numerous cities indicate that the player has moved teams at some point during the 2 years, thus creating numerous observations of the same person. The issue with this is that each case will not provide an accurate representation of how that athlete actually performed over the 2 years, just a number of fragmented representations, so when the data is analysed through R, they are often overlooked as their metrics shadow far under the bottom value, despite the indidual's values potentially being above average when combined.

I think there is no harm in reshaping your data to explore, you can always take a step back if it doesn't serve your purposes, here is how to get your desired result.

library(dplyr)

dat <- data.frame(
    stringsAsFactors = FALSE,
    Age = c(24, 24, 24),
    Name = c("Pete", "Pete", "Pete"),
    City = c("PHI", "MIN", "TOT")
)

dat %>% 
    group_by(Age, Name) %>% 
    summarise(City = paste(City, collapse = ", "))
#> # A tibble: 1 x 3
#> # Groups:   Age [1]
#>     Age Name  City         
#>   <dbl> <chr> <chr>        
#> 1    24 Pete  PHI, MIN, TOT

Created on 2020-03-09 by the reprex package (v0.3.0.9001)

1 Like

Thank you so much! Worked like a dream :grin:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.