Dplyr summarise categorial variable that appears multiple times

dplyr

#1

My dataframe looks like this:

  
head(small)

Now i want to summarize all the data for each Video.ID. The problem are columns like category, where i cannot apply sum or mean or something. I tried the following :slight_smile:

small %>%group_by(Video.ID) %>% summarise(sumr=sum(Partner.Revenue),len=mean(Video.Duration..sec.),cat=mean(Category))

mean(Category) is clearly wrong, but I do not know how to fix it. One video id only ever has one category, how to I get dplyr to just pick that in the summary ?

I know its a beginner question and most likely silly, but I just cannot find a solution. All tips are appreciated.


#2

summarise() just needs a function that will return a single value. So if you're certain that there's only one value of Category per group, you could use:

small %>%
  group_by(Video.ID) %>%
  summarise(
    sumr = sum(Partner.Revenue),
    len = mean(Video.Duration..sec.),
    cat = Category[[1]]
  )

If it's possible you might have more than one value of Category per group, then it's up to you to decide how they should be collapsed into a single value. Concatenation is one simple option, though whether it's a good idea or not depends on what you're doing next with the summarized data (displaying in a table? :grin: further analysis? :fearful:) .

small %>%
  group_by(Video.ID) %>%
  summarise(
    sumr = sum(Partner.Revenue),
    len = mean(Video.Duration..sec.),
    cat = paste(Category, collapse = ", ")
  )

(if you do need to do further analysis on multiple values of Category, summarizing them as a list column with cat = list(Category) is probably the way to go, although then you'll really want to look into purrr and this great tutorial)

P.S. There is nothing wrong with asking beginner questions! :blush: We were all beginners once, and we are all still beginners at countless things.


#3

A reprex would be really useful.


#4

Agreed! @Traumfabrik, I should add a disclaimer that since there was no reprex here, I had to assume that the code in your post was free of typos or other errors. If that's not a safe assumption, then the code in my post may just replicate those problems and therefore not work, for reasons unrelated to your question. Reprexes save us all time and confusion! :grin:


#5

Thank you all. The solution is stunningly simple, but still I did not think of it. Really, thank you all !


#6

If you're using column[1] or first(column) in summarise, it's a sign that variable should also be a grouping variable:

small %>%
  group_by(Video.ID, Category) %>%
  summarise(
    sumr = sum(Partner.Revenue),
    len = mean(Video.Duration..sec.)
  )