Dplyr summarise categorial variable that appears multiple times

Traumfabrik · May 12, 2018, 9:40pm

My dataframe looks like this:

  
head(small)

Now i want to summarize all the data for each Video.ID. The problem are columns like category, where i cannot apply sum or mean or something. I tried the following

small %>%group_by(Video.ID) %>% summarise(sumr=sum(Partner.Revenue),len=mean(Video.Duration..sec.),cat=mean(Category))

mean(Category) is clearly wrong, but I do not know how to fix it. One video id only ever has one category, how to I get dplyr to just pick that in the summary ?

I know its a beginner question and most likely silly, but I just cannot find a solution. All tips are appreciated.

jcblum · May 12, 2018, 10:52pm

summarise() just needs a function that will return a single value. So if you're certain that there's only one value of Category per group, you could use:

small %>%
  group_by(Video.ID) %>%
  summarise(
    sumr = sum(Partner.Revenue),
    len = mean(Video.Duration..sec.),
    cat = Category[[1]]
  )

If it's possible you might have more than one value of Category per group, then it's up to you to decide how they should be collapsed into a single value. Concatenation is one simple option, though whether it's a good idea or not depends on what you're doing next with the summarized data (displaying in a table? further analysis? ) .

small %>%
  group_by(Video.ID) %>%
  summarise(
    sumr = sum(Partner.Revenue),
    len = mean(Video.Duration..sec.),
    cat = paste(Category, collapse = ", ")
  )

(if you do need to do further analysis on multiple values of Category, summarizing them as a list column with cat = list(Category) is probably the way to go, although then you'll really want to look into purrr and this great tutorial)

P.S. There is nothing wrong with asking beginner questions! We were all beginners once, and we are all still beginners at countless things.

prosoitos · May 13, 2018, 1:04am

A reprex would be really useful.

FAQ: What's a reproducible example (`reprex`) and how do I create one? meta

Why reprex? Getting unstuck is hard. Your first step here is usually to create a reprex, or reproducible example. The goal of a reprex is to package your code, and information about your problem so that others can run it and feel your pain. Then, hopefully, folks can more easily provide a solution. What's in a Reproducible Example? Parts of a reproducible example: background information - Describe what you are trying to do. What have you already done? complete set up - include any library() calls and data to reproduce your issue. data for a reprex: Here's a discussion on setting up data for a reprex make it run - include the minimal code required to reproduce your error on the data…

jcblum · May 13, 2018, 2:06am

Agreed! @Traumfabrik, I should add a disclaimer that since there was no reprex here, I had to assume that the code in your post was free of typos or other errors. If that's not a safe assumption, then the code in my post may just replicate those problems and therefore not work, for reasons unrelated to your question. Reprexes save us all time and confusion!

Traumfabrik · May 13, 2018, 2:41am

Thank you all. The solution is stunningly simple, but still I did not think of it. Really, thank you all !

alistaire · May 13, 2018, 3:03am

jcblum:

So if you're certain that there's only one value of Category per group, you could use:
small %>%
  group_by(Video.ID) %>%
  summarise(
    sumr = sum(Partner.Revenue),
    len = mean(Video.Duration..sec.),
    cat = Category[[1]]
  )

If you're using column[1] or first(column) in summarise, it's a sign that variable should also be a grouping variable:

small %>%
  group_by(Video.ID, Category) %>%
  summarise(
    sumr = sum(Partner.Revenue),
    len = mean(Video.Duration..sec.)
  )