Multiple column with multiple values

Dear RStudio Community,
**Dataset brief: Data set has about 5000 rows and 10+ columns of data about Movies and TV shows. **
"Title" is unique here but many columns (Director, Cast, Country, listed in) have more than one value but I need single value per column for counting purpose w.r.t. each attribute so I an analyze data at any micro level to find answers of followings type of questions:

  • No. of movies or TV shows specific to "listed in" category?
  • How many movies/TV shows are made for each country?
  • How many movies made by each director or Actor?

My logic/ Understanding: RStudio is new tool to me but I know database and SQL. I think if I can create more tables for each combination separately using "title" which is unique column and one for each column that have multiple value (like title and director, title and country, title and listed in etc. ) so I can apply conditional functions for everything. I wonder if RStudio provides better way solution, hence writing here for better understanding and effective solution. Sub set of data is below as sample.

Welcome to RStudio Community.

Thank you for providing a description of your data, but there's nothing better than a reproducible example. The easiest way to provide one is providing the output of dput(your_data) (or dput(head(your_data, 100)) if your data is huge).

Without having access to your data, I think you could probably use dplyr::count() for all of your questions. Here's an example using some data built-in to the tidyverse.

library(tidyverse)

glimpse(diamonds)
#> Rows: 53,940
#> Columns: 10
#> $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.~
#> $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver~
#> $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,~
#> $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, ~
#> $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64~
#> $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58~
#> $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34~
#> $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.~
#> $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.~
#> $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.~

count(diamonds, cut)
#> # A tibble: 5 x 2
#>   cut           n
#>   <ord>     <int>
#> 1 Fair       1610
#> 2 Good       4906
#> 3 Very Good 12082
#> 4 Premium   13791
#> 5 Ideal     21551

count(diamonds, color)
#> # A tibble: 7 x 2
#>   color     n
#>   <ord> <int>
#> 1 D      6775
#> 2 E      9797
#> 3 F      9542
#> 4 G     11292
#> 5 H      8304
#> 6 I      5422
#> 7 J      2808

count(diamonds, cut, color)
#> # A tibble: 35 x 3
#>    cut   color     n
#>    <ord> <ord> <int>
#>  1 Fair  D       163
#>  2 Fair  E       224
#>  3 Fair  F       312
#>  4 Fair  G       314
#>  5 Fair  H       303
#>  6 Fair  I       175
#>  7 Fair  J       119
#>  8 Good  D       662
#>  9 Good  E       933
#> 10 Good  F       909
#> # ... with 25 more rows

For your columns that have lots of values in, you may want to separate them using tidyr::separate_rows():

library(tidyverse)

tibble(x = "TV, Movie, Book, Video Game")
#> # A tibble: 1 x 1
#>   x                          
#>   <chr>                      
#> 1 TV, Movie, Book, Video Game

tibble(x = "TV, Movie, Book, Video Game") %>% 
  separate_rows(x, sep = ", ")
#> # A tibble: 4 x 1
#>   x         
#>   <chr>     
#> 1 TV        
#> 2 Movie     
#> 3 Book      
#> 4 Video Game

Created on 2021-12-21 by the reprex package (v2.0.1)

Kidly find the data file here .DATA FILE

library(tidyverse)

mydata = read_csv("disney_plus_titles.csv")
#> Rows: 1450 Columns: 12
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (11): show_id, type, title, director, cast, country, date_added, rating,...
#> dbl  (1): release_year
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.

# No. of movies or TV shows specific to "listed in" category?

mydata %>% 
  separate_rows(listed_in, sep = ", ") %>% 
  count(type, listed_in, sort = T)
#> # A tibble: 71 x 3
#>    type    listed_in            n
#>    <chr>   <chr>            <int>
#>  1 Movie   Family             533
#>  2 Movie   Comedy             407
#>  3 Movie   Animation          381
#>  4 Movie   Action-Adventure   314
#>  5 Movie   Documentary        174
#>  6 TV Show Animation          161
#>  7 Movie   Fantasy            158
#>  8 Movie   Coming of Age      153
#>  9 TV Show Action-Adventure   138
#> 10 Movie   Animals & Nature   130
#> # ... with 61 more rows

# How many movies/TV shows are made for each country?

mydata %>% 
  separate_rows(country, sep = ", ") %>% 
  count(type, country, sort = T)
#> # A tibble: 64 x 3
#>    type    country            n
#>    <chr>   <chr>          <int>
#>  1 Movie   United States    923
#>  2 TV Show United States    261
#>  3 TV Show <NA>             113
#>  4 Movie   <NA>             106
#>  5 Movie   United Kingdom    76
#>  6 Movie   Canada            61
#>  7 TV Show United Kingdom    25
#>  8 Movie   Australia         20
#>  9 TV Show Canada            16
#> 10 Movie   France            13
#> # ... with 54 more rows

# write a function to make this easier?

sep_count = function(df, col){
  
  df %>% 
    separate_rows({{col}}, sep = ", ") %>% 
    mutate({{col}} := str_trim({{col}})) %>% 
    count(type, {{col}}, sort = T)
  
}

sep_count(mydata, cast)
#> # A tibble: 4,048 x 3
#>    type    cast                    n
#>    <chr>   <chr>               <int>
#>  1 Movie   <NA>                  119
#>  2 TV Show <NA>                   71
#>  3 Movie   Jim Cummings           20
#>  4 Movie   Walt Disney            20
#>  5 Movie   Larry the Cable Guy    19
#>  6 Movie   Pinto Colvig           15
#>  7 Movie   Bob Peterson           14
#>  8 Movie   Keith Ferguson         14
#>  9 Movie   Tim Allen              14
#> 10 Movie   Tony Hale              13
#> # ... with 4,038 more rows

sep_count(mydata, director)
#> # A tibble: 637 x 3
#>    type    director             n
#>    <chr>   <chr>            <int>
#>  1 TV Show <NA>               398
#>  2 Movie   <NA>                75
#>  3 Movie   Jack Hannah         17
#>  4 Movie   John Lasseter       16
#>  5 Movie   Paul Hoen           16
#>  6 Movie   Wilfred Jackson     16
#>  7 Movie   Clyde Geronimi      13
#>  8 Movie   Robert Stevenson    13
#>  9 Movie   Charles Nichols     12
#> 10 Movie   Jack Kinney         12
#> # ... with 627 more rows

Created on 2021-12-21 by the reprex package (v2.0.1)

Thank you very much @JackDavison for your quick help. Let me put it into practice for better understanding :slightly_smiling_face:

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.