How do I include the Total and Filtered variable in same summary Q and %?

dplyr

#1

Hello,

I am struggling to create a filtered variable and a total in the same summary.
I tried a code that does not generate the right numbers, I am looking to have quantity and % in the same summary. I am looking to group by type and have the % of films above the mean and the % of total for each type. So there should be 4 columns in total.

This seems pretty straight forward but I keep getting errors so the code below is the only one that was running for me…

Thank you so much.
Christine

reprex::reprex_info()
#> Warning in as.POSIXlt.POSIXct(Sys.time()): unknown timezone 'zone/tz/2017c.
#> 1.0/zoneinfo/Europe/Paris'
#> Created by the reprex package v0.1.1.9000 on 2018-01-06

suppressWarnings(library(dplyr))
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

data.frame(stringsAsFactors=FALSE,
                        title_type = c("Feature film","Feature film", "TV Movie", 
                                       "Documentary", "Documentary", "Feature film", "Feature film",  "Feature film", "TV Movie", "TV Movie", 
                                       "Feature film", "Feature film", "TV Movie", 
                                       "Documentary", "Documentary", "Feature film", "Feature film",  "Feature film", "TV Movie", "TV Movie"),
                        imdb_rating = c(2, 3, 2, 10, 8, 3, 3, 2, 8, 5, 2, 3, 2, 10, 8, 3, 3, 
                                        2, 8, 5)) %>% 
group_by(title_type) %>% 
summarise(count =n(), count_high=length(which(imdb_rating> mean(imdb_rating)))) %>% 
mutate(ratio=count/sum(count))
#> # A tibble: 3 x 4
#>     title_type count count_high ratio
#>          <chr> <int>      <int> <dbl>
#> 1  Documentary     4          2   0.2
#> 2 Feature film    10          6   0.5
#> 3     TV Movie     6          2   0.3

#2

I’m not sure I calculating exactly the percentages that you want. I made the column names long to try to describe just what I was calculating.

Also use suppressPackageStartupMessages(library(dplyr))

There are 4 columns in the result, but one doesn’t fit on the reprex output.

To calculate the percent over all films (I think that is what you want) I added a column with fractional imdb_rating for each film before I did the group_by… that way I could just sum up the fractions in the summary

suppressPackageStartupMessages(library(dplyr))
df <- data.frame(stringsAsFactors=FALSE,
                        title_type = c("Feature film","Feature film", "TV Movie", 
                                                     "Documentary", "Documentary", "Feature film", "Feature film",  "Feature film", "TV Movie", "TV Movie", 
                                                     "Feature film", "Feature film", "TV Movie", 
                                                     "Documentary", "Documentary", "Feature film", "Feature film",  "Feature film", "TV Movie", "TV Movie"),
                        imdb_rating = c(2, 3, 2, 10, 8, 3, 3, 2, 8, 5, 2, 3, 2, 10, 8, 3, 3, 
                                                        2, 8, 5)) 


df %>% 
    mutate(pct_imdb_total = imdb_rating / sum(imdb_rating)) %>%
    group_by(title_type) %>% 
    summarise(count =n(),
                        pct_of_group_over_mean_of_group = 100 * length(which(imdb_rating> mean(imdb_rating))) / sum(count),
                        pct_of_rating_over_all_films = 100 * sum(pct_imdb_total)) 
#> # A tibble: 3 x 4
#>     title_type count pct_of_group_over_mean_of_group
#>          <chr> <int>                           <dbl>
#> 1  Documentary     4                        50.00000
#> 2 Feature film    10                        60.00000
#> 3     TV Movie     6                        33.33333
#> # ... with 1 more variables: pct_of_rating_over_all_films <dbl>

#3

Hello Dan, Thank you again for your advice!

I was looking for something even more basic, the pct_of_rating_over_all_films calculates the % of scores and pct_of_group_over_mean_of_group seems to to something simuliar.
I was trying to get the ratio eg Documentary 4/20=20% and then the % of films that were above the total mean for each category so 100% for Documentary: (10, 8, 10, 8) as mean= 4,6.
I am sorry I explained so bad.
The objective is to get the Q and the % of films for each category above the total mean.


#4

Wait I see you want the % in group above total mean not group mean… back in justs a sec

I see now… there are always so many ways to aggregate data :slight_smile: I think this gives the results you want but I think there is a better way to find the group members above the mean of the group.

suppressPackageStartupMessages(library(dplyr))
df <- data.frame(stringsAsFactors=FALSE,
                        title_type = c("Feature film","Feature film", "TV Movie",
                                                     "Documentary", "Documentary", "Feature film", "Feature film",  "Feature film", "TV Movie", "TV Movie",
                                                     "Feature film", "Feature film", "TV Movie",
                                                     "Documentary", "Documentary", "Feature film", "Feature film",  "Feature film", "TV Movie", "TV Movie"),
                        imdb_rating = c(2, 3, 2, 10, 8, 3, 3, 2, 8, 5, 2, 3, 2, 10, 8, 3, 3,
                                                        2, 8, 5))


df %>%
    group_by(title_type) %>%
    summarise(count = n(),
                        pct_of_group_over_mean_of_group = 100 * length(which(imdb_rating> mean(imdb_rating))) / count,
                        pct_all_films = 100 * count / nrow(.))
#> # A tibble: 3 x 4
#>     title_type count pct_of_group_over_mean_of_group pct_all_films
#>          <chr> <int>                           <dbl>         <dbl>
#> 1  Documentary     4                        50.00000            20
#> 2 Feature film    10                        60.00000            50
#> 3     TV Movie     6                        33.33333            30

#5

Documentary should be 100% so I am not sure why the data is not aggregating as it should…


#6

I think I’ve got it now

suppressPackageStartupMessages(library(dplyr))
df <- data.frame(stringsAsFactors=FALSE,
                        title_type = c("Feature film","Feature film", "TV Movie",
                                                     "Documentary", "Documentary", "Feature film", "Feature film",  "Feature film", "TV Movie", "TV Movie",
                                                     "Feature film", "Feature film", "TV Movie",
                                                     "Documentary", "Documentary", "Feature film", "Feature film",  "Feature film", "TV Movie", "TV Movie"),
                        imdb_rating = c(2, 3, 2, 10, 8, 3, 3, 2, 8, 5, 2, 3, 2, 10, 8, 3, 3,
                                                        2, 8, 5))


df %>%
    mutate(above_mean = imdb_rating > mean(imdb_rating)) %>%
    group_by(title_type) %>%
    summarise(count = n(),
                        pct_of_group_over_mean_of_group =  100 * sum(above_mean) / count,
                        pct_all_films = 100 * count / nrow(.))
#> # A tibble: 3 x 4
#>     title_type count pct_of_group_over_mean_of_group pct_all_films
#>          <chr> <int>                           <dbl>         <dbl>
#> 1  Documentary     4                       100.00000            20
#> 2 Feature film    10                         0.00000            50
#> 3     TV Movie     6                        66.66667            30

#7

Data never aggregates the way it should :grinning:


#8

Hahahaha I feel better.


#9

For future reference, it’s useful to post the code you’re getting errors from, too! It can help make what you’re trying to do more clear. :+1:


#10

Thank you Mara. I am not sure what more I can post, I am not getting any errors from my code it is just not doing what I want it to do.


#11

I posted a second version, above. It produced:

#> title_type count pct_of_group_over_mean_of_group pct_all_films
#>
#> 1 Documentary 4 100.00000 20
#> 2 Feature film 10 0.00000 50
#> 3 TV Movie 6 66.66667 30

Isn’t that what you wanted?

You might have missed because I posted it just before my comment about data aggregation.


#12

I was referring to right from the start:


#13

woop woop!! thank you so much Dan for your excellent help. Yes, pardon I missed it completely.


#14

No problem, glad to help anytime I can


#15

Ahhh my other attempts. Yes I can see why that is useful. I will post in future. Thank you!