Plotting bar graph for average rating of all movies in their respective genres? (ggplot - data-wrangling)

Background: A 0 denotes the movie is not of that particular genre, and a 1 denotes that the movie is included in the particular genre.

Hi,

I'm trying to find the average rating of all movies within their respective genres. I know, in principle, that I must group the ratings based on whether the movie has a 1 or a 0 in a genre column {Action, Animation, Comedy, ...}.

For example, if movie A has a rating 6.4, and a 0 in the column Action, then it will not be included in the summation of ratings for Action. If movie A has a 1 for Comedy, then its rating will be included in Comedy's rating summation.

In the end, once I have summed all of the ratings for each genre, I need to divide by the number of movies in that genre to get the average. Finally, I must produce a bar graph of those averages, presumably with ggplot2.

After that, I need to do the same thing, but only for movies released in the period [2000, 2005].

I haven't a clue how to write the code to do that, and I really need help.

library(tidyverse)     # attaches dplyr and ggplot2
library(ggplot2movies) # provides the movies data set

# keep only the 0/1 genre indicator columns of movies
df <- movies %>% select(Action, Animation, Comedy, Drama, Documentary, Romance, Short)
genre.tot <- colSums(df) # total observations for respective genre columns

If anyone could help me, thank you

Hi, and welcome!

Two preliminaries:

  1. Please see the FAQ: What's a reproducible example (`reprex`) and how do I do one? A reprex, complete with representative data, will attract more and quicker answers.

  2. Check the community homework policy, which requires some disclosure of the assignment and explains that members are here to help you get unstuck, not to "give you the answer."

Let's start by looking at the structure of the movies data set

library(ggplot2movies)
str(movies)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    58788 obs. of  24 variables:
#>  $ title      : chr  "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
#>  $ year       : int  1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
#>  $ length     : int  121 71 7 70 71 91 93 25 97 61 ...
#>  $ budget     : int  NA NA NA NA NA NA NA NA NA NA ...
#>  $ rating     : num  6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
#>  $ votes      : int  348 20 5 6 17 45 200 24 18 51 ...
#>  $ r1         : num  4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
#>  $ r2         : num  4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
#>  $ r3         : num  4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
#>  $ r4         : num  4.5 24.5 0 0 14.5 14.5 4.5 4.5 0 4.5 ...
#>  $ r5         : num  14.5 14.5 0 0 14.5 14.5 24.5 4.5 0 4.5 ...
#>  $ r6         : num  24.5 14.5 24.5 0 4.5 14.5 24.5 14.5 0 44.5 ...
#>  $ r7         : num  24.5 14.5 0 0 0 4.5 14.5 14.5 34.5 14.5 ...
#>  $ r8         : num  14.5 4.5 44.5 0 0 4.5 4.5 14.5 14.5 4.5 ...
#>  $ r9         : num  4.5 4.5 24.5 34.5 0 14.5 4.5 4.5 4.5 4.5 ...
#>  $ r10        : num  4.5 14.5 24.5 45.5 24.5 14.5 14.5 14.5 24.5 4.5 ...
#>  $ mpaa       : chr  "" "" "" "" ...
#>  $ Action     : int  0 0 0 0 0 0 1 0 0 0 ...
#>  $ Animation  : int  0 0 1 0 0 0 0 0 0 0 ...
#>  $ Comedy     : int  1 1 0 1 0 0 0 0 0 0 ...
#>  $ Drama      : int  1 0 0 0 0 1 1 0 1 0 ...
#>  $ Documentary: int  0 0 0 0 0 0 0 1 0 0 ...
#>  $ Romance    : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ Short      : int  0 0 1 0 0 0 0 1 0 0 ...

Created on 2020-03-18 by the reprex package (v0.3.0)

Ok, it's a data frame with 24 variables capturing various aspects of the 58,788 movies it describes.

What's needed? Average rating by genre. Which variable holds the rating for a movie? I'm going to call that SCORE to not spoil the fun.

Which variables indicate the genre? No spoilers here: Action, Animation, Comedy, Drama, Documentary, Romance, and Short.

Using the dplyr package's select function, you can create a skinnier data frame to work with for this problem:

# SCORE is a stand-in for the real rating column name
movies %>% select(SCORE, Action, Animation, Comedy, Drama, Documentary, Romance, Short) -> genres

Not needed strictly, but easier on the eyes.

genres <- structure(list(SCORE = c(6.4, 6, 8.2, 8.2, 3.4, 4.3), Action = c(0L, 0L, 0L, 0L, 0L, 0L), Animation = c(0L, 0L, 1L, 0L, 0L, 0L), Comedy = c(1L, 1L, 0L, 1L, 0L, 0L), Drama = c(1L, 0L, 0L, 0L, 0L, 1L), Documentary = c(0L, 0L, 0L, 0L, 0L, 0L), Romance = c(0L, 0L, 0L, 0L, 0L, 0L), Short = c(0L, 0L, 1L, 0L, 0L, 0L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))
genres
#>   SCORE Action Animation Comedy Drama Documentary Romance Short
#> 1   6.4      0         0      1     1           0       0     0
#> 2   6.0      0         0      1     0           0       0     0
#> 3   8.2      0         1      0     0           0       0     1
#> 4   8.2      0         0      1     0           0       0     0
#> 5   3.4      0         0      0     0           0       0     0
#> 6   4.3      0         0      0     1           0       0     0

Created on 2020-03-18 by the reprex package (v0.3.0)

(These are just the first few rows, of course.)

Assuming you were just interested in Comedy, how would you further reduce genres to just those films?

suppressPackageStartupMessages(library(dplyr)) 
# OMITTED genres <- structure(list ...
comedies <- genres %>% filter(Comedy == 1) %>% select(SCORE,Comedy)
comedies
#> # A tibble: 3 x 2
#>   SCORE Comedy
#>   <dbl>  <int>
#> 1   6.4      1
#> 2   6        1
#> 3   8.2      1

Created on 2020-03-18 by the reprex package (v0.3.0)
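
As a quick sanity check, the number of rows the filter keeps should match the column total of the 0/1 indicator, which is the per-genre count the original post computes with colSums. The snippet below is only a sketch on the six toy rows, trimmed to the two columns it needs; SCORE is still the placeholder name, and `n_flagged` and `n_filtered` are names made up for this illustration.

```r
library(dplyr)

# Toy `genres` data from above, trimmed to the two columns used here.
genres <- data.frame(
  SCORE  = c(6.4, 6, 8.2, 8.2, 3.4, 4.3),
  Comedy = c(1L, 1L, 0L, 1L, 0L, 0L)
)
comedies <- genres %>% filter(Comedy == 1)

# Summing a 0/1 column counts the movies flagged with that genre,
# so it should equal the number of rows the filter keeps.
n_flagged  <- sum(genres$Comedy)
n_filtered <- nrow(comedies)
stopifnot(n_flagged == n_filtered)
```

That count is the denominator in the "sum the ratings, then divide" recipe from the question.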

The function mean() will find your average SCORE, so it's back to you to fill in the blank:

mean(_____)
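
Once the single-genre case works, one possible way to cover every genre at once is to reshape the indicator columns into long form, keep the 1s, and average per genre. What follows is a rough sketch on the six toy rows rather than a finished solution. It assumes tidyr (1.0.0 or later) is installed, SCORE remains the placeholder name, the `year` values are copied from the str() output above, and `avg_by_genre` is a helper name invented for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy data: the first six rows of `movies`, with SCORE as the placeholder
# rating column and `year` taken from the str() output.
toy <- data.frame(
  year        = c(1971L, 1939L, 1941L, 1996L, 1975L, 2000L),
  SCORE       = c(6.4, 6, 8.2, 8.2, 3.4, 4.3),
  Action      = c(0L, 0L, 0L, 0L, 0L, 0L),
  Animation   = c(0L, 0L, 1L, 0L, 0L, 0L),
  Comedy      = c(1L, 1L, 0L, 1L, 0L, 0L),
  Drama       = c(1L, 0L, 0L, 0L, 0L, 1L),
  Documentary = c(0L, 0L, 0L, 0L, 0L, 0L),
  Romance     = c(0L, 0L, 0L, 0L, 0L, 0L),
  Short       = c(0L, 0L, 1L, 0L, 0L, 0L)
)

# Hypothetical helper: one row per (movie, genre) pair, keep only the
# pairs flagged 1, then average SCORE within each genre.
avg_by_genre <- function(df) {
  df %>%
    pivot_longer(Action:Short, names_to = "genre", values_to = "flag") %>%
    filter(flag == 1) %>%
    group_by(genre) %>%
    summarise(avg_score = mean(SCORE), n_movies = n())
}

all_years <- avg_by_genre(toy)

# Same calculation restricted to the closed interval [2000, 2005].
recent <- avg_by_genre(filter(toy, between(year, 2000, 2005)))

# Bar chart of the per-genre averages; geom_col() plots the values as given.
p <- ggplot(all_years, aes(x = genre, y = avg_score)) +
  geom_col() +
  labs(x = "Genre", y = "Average SCORE")
```

On these six rows, genres with no flagged movies (Action, Documentary, Romance) drop out of the summary, and `recent` holds only the single Drama from 2000. A movie flagged in several genres counts toward each of them, which matches the rule described in the question.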
