Define a new column for genre of the movie

jayant · July 8, 2018, 8:14pm

Good afternoon colleagues,
I am trying to define a new column called genre that will take a value such as action, drama, sci-fi and their combinations. But as seen in the attached snippet, we have a column corresponding to each genre type, and they take values True, False. The goal is to get rid of all these columns and define a new variable called genre

One option is to use if-else() here but is there any method from tidyverse package. Advice/ feedback is greatly appreciated.

cderv · July 8, 2018, 8:39pm

There are several ways to do that using tidyverse tools. If you can provide a reprex, it will be easier to show you how.

What your new genre column will contain ? A concatenated string with all genre ? A vector of genre ? A list ?

I would suggest two approaches:

Replace in each column the TRUE values by the corresponding genre, then create a column to concatenate the strings or create the vector / list
Use a reshaping and filtering (something with gather, separate, filter, mutate) to add a column with row number, pass the column name in one column (several rows for each previous row number), keep only rows where values is true, reshape in one row per original row number creating a column genre.

jayant · July 8, 2018, 9:30pm

Thaks for the prompt response.

library(dplyr)

new = read.csv("Book1.csv", header=T, sep=",")
View(new)

I tried to use repex (for the first time) as highlighted on https://www.jessemaegan.com/post/so-you-ve-been-asked-to-make-a-reprex/
Does it help you, Sir?Kindly let me know

cderv · July 9, 2018, 5:50am

Thanks for trying.

View open the viewer in RStudio - no need to use it in a reprex.
You can use read.csv but in this case you need to provide the csv file. Without the csv file, I do not have the data at hand.
Otherwise, create a subset of your data in R code, not in a csv file. Follow advice on best pratices for non built in dataset. (dput in base R or datapasta can help you)

Best Practices: how to prepare your own data for use in a `reprex` if you can’t, or don’t know how to reproduce a problem with a built-in dataset? tidyverse

@EconomiCurtis split this out of FAQ: What's a reproducible example (`reprex`) and how do I do one?. Curious if you have anything additional to add specifically on "how to prepare your own data for use in a reprex if you can't, or don't know how to reproduce a problem with a built-in dataset." I think @jessemaegan's post is about 80% there. The piece it is missing, if your average stack overflow post is any indication, is an explanation about how to prepare your own data for use in a reprex if you can't, or don't know how to reproduce a problem with a built-in dataset. Some handy things to know for this situation: deparse() The ugly as sin, gold standard: head(my_data, 2) %>% depa…

You can also recreate your data by hand. tribble can help you.

If you have improvement in mind on the reprex guide or found unclear parts, do not hesitate to tell us!

Leon · July 9, 2018, 6:03am

Like so?

set.seed(116647)
get_bools = function(n=10){
  return( sample(x = c(TRUE,FALSE), size = n, replace = TRUE) )
}
movie_dat = tibble(genre_action      = get_bools(),
                   genre_adventure   = get_bools(),
                   genre_thriller    = get_bools(),
                   genre_scifi       = get_bools(),
                   genre_fantasy     = get_bools(),
                   genre_documentary = get_bools(),
                   genre_romance     = get_bools(),
                   genre_comedy      = get_bools())
movie_dat
# A tibble: 10 x 8
   genre_action genre_adventure genre_thriller genre_scifi genre_fantasy genre_documentary
   <lgl>        <lgl>           <lgl>          <lgl>       <lgl>         <lgl>            
 1 TRUE         TRUE            TRUE           FALSE       TRUE          TRUE             
 2 FALSE        FALSE           FALSE          FALSE       FALSE         FALSE            
 3 FALSE        TRUE            TRUE           TRUE        TRUE          TRUE             
 4 FALSE        TRUE            TRUE           FALSE       TRUE          TRUE             
 5 FALSE        TRUE            FALSE          FALSE       TRUE          TRUE             
 6 TRUE         FALSE           TRUE           FALSE       TRUE          TRUE             
 7 FALSE        TRUE            TRUE           TRUE        TRUE          TRUE             
 8 TRUE         FALSE           FALSE          FALSE       FALSE         FALSE            
 9 FALSE        FALSE           FALSE          TRUE        FALSE         TRUE             
10 FALSE        TRUE            FALSE          FALSE       TRUE          FALSE            
# ... with 2 more variables: genre_romance <lgl>, genre_comedy <lgl>

Then we can do the following:

movie_dat %>% gather(genre, bool)
# A tibble: 80 x 2
   genre        bool 
   <chr>        <lgl>
 1 genre_action TRUE 
 2 genre_action FALSE
 3 genre_action FALSE
 4 genre_action FALSE
 5 genre_action FALSE
 6 genre_action TRUE 
 7 genre_action FALSE
 8 genre_action TRUE 
 9 genre_action FALSE
10 genre_action FALSE
# ... with 70 more rows

This is called converting from wide to long data

Hope it's useful

cderv · July 9, 2018, 6:11am

This is a great example on how to create a dummy data.frame to help with the question ! Thanks!

@jayant you now still need to precise what your column genre should look like. something like comma-concatenated genre "action, drama, comedy" if these are TRUE in there respective column ?

Leon · July 9, 2018, 11:13am

Here is a slight modification of my previous post, in my view the data fits best in long format:

set.seed(116647)
get_bools = function(n=10){
  return( sample(x = c(TRUE,FALSE), size = n, replace = TRUE) )
}
movie_dat = tibble(movie_name = paste0("movie_", LETTERS[1:10]),
                   genre_action      = get_bools(),
                   genre_adventure   = get_bools(),
                   genre_thriller    = get_bools(),
                   genre_scifi       = get_bools(),
                   genre_fantasy     = get_bools(),
                   genre_documentary = get_bools(),
                   genre_romance     = get_bools(),
                   genre_comedy      = get_bools())
movie_dat %>% gather(genre, bool, -movie_name) %>% filter(bool) %>% 
  select(movie_name, genre)
# A tibble: 37 x 2
   movie_name genre          
   <chr>      <chr>          
 1 movie_A    genre_action   
 2 movie_F    genre_action   
 3 movie_H    genre_action   
 4 movie_A    genre_adventure
 5 movie_C    genre_adventure
 6 movie_D    genre_adventure
 7 movie_E    genre_adventure
 8 movie_G    genre_adventure
 9 movie_J    genre_adventure
10 movie_A    genre_thriller 
# ... with 27 more rows

jayant · July 10, 2018, 1:16am

Good afternoon!

Thanks for your advice and helpful comments. I was able to get the data in long format as shown above. I used the following code to achieve it.

selected_data <- select (movie_renamed,movie_title,action,adventure,thriller,scifi,fantasy,
                         documentary,romance,comedy,animation,family,drama,horror)
melted_data <- melt(selected_data,"movie_title")
melted_data_filtered <-  melted_data %>% filter(value=='TRUE')

And I obtain the following data.

movie_data = tibble(movie_title=c("Avatar","Spectre","Avatar"),genre=c("action","fantasy","scifi"))
tibble: 3 × 2
  movie_title   genre
        <chr>   <chr>
1      Avatar  action
2     Spectre fantasy
3      Avatar   scifi

But my goal is to avoid repetition of movie title and have a concatenated list of the genre, for example;

Avatar      scifi+action

In other words, we need to go back to wide format with two columns, movie_title and genre. Can I kindly get help here? thanks

Leon · July 10, 2018, 6:12am

You could do something along the lines of this then:

set.seed(116647)
get_bools = function(n=10){
  return( sample(x = c(TRUE,FALSE), size = n, replace = TRUE) )
}
movie_dat = tibble(genre_action      = get_bools(),
                   genre_adventure   = get_bools(),
                   genre_thriller    = get_bools(),
                   genre_scifi       = get_bools(),
                   genre_fantasy     = get_bools(),
                   genre_documentary = get_bools(),
                   genre_romance     = get_bools(),
                   genre_comedy      = get_bools())
movie_dat %>% apply(1, function(x){ return(names(x[x])) }) %>%
  lapply(paste, collapse='+') %>% unlist
 [1] "genre_action+genre_adventure+genre_thriller+genre_fantasy+genre_documentary+genre_romance"
 [2] "genre_comedy"                                                                             
 [3] "genre_adventure+genre_thriller+genre_scifi+genre_fantasy+genre_documentary"               
 [4] "genre_adventure+genre_thriller+genre_fantasy+genre_documentary+genre_comedy"              
 [5] "genre_adventure+genre_fantasy+genre_documentary"                                          
 [6] "genre_action+genre_thriller+genre_fantasy+genre_documentary+genre_comedy"                 
 [7] "genre_adventure+genre_thriller+genre_scifi+genre_fantasy+genre_documentary"               
 [8] "genre_action+genre_comedy"                                                                
 [9] "genre_scifi+genre_documentary+genre_romance"                                              
[10] "genre_adventure+genre_fantasy"

cderv · July 10, 2018, 6:16am

You can use the tidyverse to achieve what you want. It is the same as @Leon answer but only with a tidyverse pipe

set.seed(116647)
get_bools = function(n=10){
  return( sample(x = c(TRUE,FALSE), size = n, replace = TRUE) )
}
library(tidyverse)
movie_dat = tibble(movie_name = paste0("movie_", LETTERS[1:10]),
                   genre_action      = get_bools(),
                   genre_adventure   = get_bools(),
                   genre_thriller    = get_bools(),
                   genre_scifi       = get_bools(),
                   genre_fantasy     = get_bools(),
                   genre_documentary = get_bools(),
                   genre_romance     = get_bools(),
                   genre_comedy      = get_bools())
movie_dat %>% 
  gather(genre, bool, -movie_name) %>% 
  filter(bool) %>% 
  select(-bool) %>%
  separate(genre, c('to_delete', 'genre')) %>%
  select(-to_delete) %>%
  group_by(movie_name) %>%
  summarise(genre = paste(genre, collapse = "+"))
#> # A tibble: 10 x 2
#>    movie_name genre                                                
#>    <chr>      <chr>                                                
#>  1 movie_A    action+adventure+thriller+fantasy+documentary+romance
#>  2 movie_B    comedy                                               
#>  3 movie_C    adventure+thriller+scifi+fantasy+documentary         
#>  4 movie_D    adventure+thriller+fantasy+documentary+comedy        
#>  5 movie_E    adventure+fantasy+documentary                        
#>  6 movie_F    action+thriller+fantasy+documentary+comedy           
#>  7 movie_G    adventure+thriller+scifi+fantasy+documentary         
#>  8 movie_H    action+comedy                                        
#>  9 movie_I    scifi+documentary+romance                            
#> 10 movie_J    adventure+fantasy

Created on 2018-07-10 by the reprex package (v0.2.0).

jayant · July 10, 2018, 10:04pm

Thank you for your ideas, @cderv and @Leon. I was able to achieve my goal using a blend of both ideas;

melted_data <- melt(selected_data,"movie_title")
melted_data_filtered <-  melted_data %>% filter(value=='TRUE')
melted_data_filtered$value <- melted_data_filtered$variable
melted_data_filtered <- melted_data_filtered %>% select(-variable)
melted_data_filtered_sorted <- melted_data_filtered %>%   arrange(movie_title)
data_filtered <- melted_data_filtered_sorted %>% group_by(movie_title)%>%
  summarise(genre = paste(value, collapse = "+"))

I just gave a code snippet here, which is working for me. As seen, after melting I tried to use group_by to do the concatenation. thanks for the assistance.