How should I create dummy variables based on a column in a long format table(the same records appearing more than once)

Zhen · December 31, 2018, 2:45am

Hi, I have a tibble that looks like this.

       id title                                    genres         
    <dbl> <chr>                                    <chr>          
 1  19995 Avatar                                   Action         
 2  19995 Avatar                                   Adventure      
 3  19995 Avatar                                   Fantasy        
 4  19995 Avatar                                   Science Fiction
 5    285 Pirates of the Caribbean: At World's End Adventure      
 6    285 Pirates of the Caribbean: At World's End Fantasy        
 7    285 Pirates of the Caribbean: At World's End Action         
 8 206647 Spectre                                  Action         
 9 206647 Spectre                                  Adventure      
10 206647 Spectre                                  Crime

I want to create dummy variables based on genres. I tried dummy_cols from fastDummies package. However, I would get this.

       id title genres genres_Action genres_Adventure genres_Fantasy `genres_Science~ genres_Crime genres_Drama
    <dbl> <chr> <chr>          <int>            <int>          <int>            <int>        <int>        <int>
 1  19995 Avat~ Action             1                0              0                0            0            0
 2  19995 Avat~ Adven~             0                1              0                0            0            0
 3  19995 Avat~ Fanta~             0                0              1                0            0            0
 4  19995 Avat~ Scien~             0                0              0                1            0            0
 5    285 Pira~ Adven~             0                1              0                0            0            0
 6    285 Pira~ Fanta~             0                0              1                0            0            0
 7    285 Pira~ Action             1                0              0                0            0            0
 8 206647 Spec~ Action             1                0              0                0            0            0
 9 206647 Spec~ Adven~             0                1              0                0            0            0
10 206647 Spec~ Crime              0                0              0                0            1            0

As you can see, there are repeating titles. How could I merge the same title either based on id or title? Or I should have done something before I dummify the column?

I have a feeling that this is really easy to solve, but I just can't remember anything that would help. I couldn't find an answer through googling too.

Thank you.

technocrat · December 31, 2018, 4:06am

Actually, i spent the day yesterday on analogous raw data file, where the equivalent of the id field appears in multiple rows. Fortunately, like your fastdummies package, I was able to create a wide tibble of binary values.

The secret sauce is to isolate those columns (preserving the ability to cbind them back to the first three columns), transform them to a matrix, take colSums(), which will produce a named vector output of results, translate that back to a tibble, mutate the columns genres_Action to genres_Action = as.logical()` which will transform the 0 results to FALSE and the non-zero results to TRUE.

Then of course, you have to stick back the first three columns, but now you should have collapsed the rows into a single record. NB: this won't work if you have non-binary variables with conflicting values. My head still hurts too much from the exercise (I may have made an unnecessary t() to the matrix), but I plan to write this up as a more generalized snippet than my present code and post it on my blog.

But, yes! It is possible.

Zhen · December 31, 2018, 7:24pm

Hi technocrat,

Thank you for your reply. However, there are a few points I didn't get.

How do I isolate the first three columns while transforming the rest to a matrix? Data in the same matrix need to be the same data type. If I take out first three columns, and then transform them into a matrix, take colSums. It will give me a list of values. As you pointed out, they all collapsed into a single record. I don't understand how can I get the data frame I need.

> movie_with_tag %>% group_by(id) %>% dummy_cols("genres") %>% select(-genres, -id, -title) %>% as.matrix() %>% colSums()
         genres_Action       genres_Adventure         genres_Fantasy genres_Science Fiction           genres_Crime 
                  1154                    790                    424                    535                    696 
          genres_Drama        genres_Thriller       genres_Animation          genres_Family         genres_Western 
                  2297                   1274                    234                    513                     82 
         genres_Comedy         genres_Romance          genres_Horror         genres_Mystery         genres_History 
                  1722                    894                    519                    348                    197 
            genres_War           genres_Music     genres_Documentary         genres_Foreign        genres_TV Movie 
                   144                    185                    110                     34                      8

technocrat · December 31, 2018, 8:35pm

It was a sketchy outline I have you. Let me supplement that with a bit of code from some medical records I'm helping with and see if that helps you follow the logic better. The input file patients had multiple entries of diagnostic codes. As you did, I spread those into new columns, which is how I ended up in your situation. Here's what I did to get them back to a single record where each of the columns has a T/F logical value:

Example of a record with multiple entries to be reduced
> patient150580164                                # the twin duplicates
# A tibble: 4 x 24
    INC_KEY   AGE GENDER ADULT `808.0` `808.1` `808.2` `808.3` `808.4` `808.41` `808.42` `808.43` `808.49`
      <int> <int> <chr>  <lgl> <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>    <lgl>    <lgl>    <lgl>   
1 150580164    29 Female TRUE  FALSE   FALSE   TRUE    FALSE   FALSE   FALSE    FALSE    FALSE    FALSE   
2 150580164    29 Female TRUE  FALSE   FALSE   TRUE    FALSE   FALSE   FALSE    FALSE    FALSE    FALSE   
3 150580164    29 Female TRUE  TRUE    FALSE   FALSE   FALSE   FALSE   FALSE    FALSE    FALSE    FALSE   
4 150580164    29 Female TRUE  TRUE    FALSE   FALSE   FALSE   FALSE   FALSE    FALSE    FALSE    FALSE   
# ... with 11 more variables: `808.5` <lgl>, `808.51` <lgl>, `808.52` <lgl>, `808.53` <lgl>,
#   `808.59` <lgl>, `808.8` <lgl>, `808.9` <lgl>, Expired <lgl>, Intervention <lgl>, SURGERY <lgl>,
#   YEAR <dbl>
>

With that tibble as iinput

setasides150580164 <- patient150580164 %>%      # save for later
    select(INC_KEY, AGE, GENDER, ADULT, YEAR) %>% 
    distinct()

current_INC_KEY <- setasides150580164$INC_KEY   # save INC_KEY separately

patient150580164 <- patient150580164 %>%        # strip the saved data
    select(-INC_KEY, -AGE,-GENDER,-ADULT, -YEAR)

mat150580164 <- as.matrix(patient150580164)     # convert to matrix
list_of_col <- colnames(mat150580164)           # save list of column names

# sum the columns, convert from numeric to T/F, convert to matrix, transposed
patient150580164 <- t(as.matrix(as.logical(colSums(mat150580164 > 0)))) 

colnames(patient150580164) <- list_of_col       # use the save column names 

patient150580164 <- as.tibble(patient150580164) %>% # add back in saved INC_KEY
    mutate(INC_KEY = current_INC_KEY) %>%
    select(INC_KEY, everything())

# Add back in all the data set aside and make a new object with the transformed data
patient150580164 <- inner_join(patient150580164, setasides150580164, by = "INC_KEY") %>% 
    select(INC_KEY, AGE, GENDER, ADULT, Expired, Intervention, list_of_col)

patient150580164                                # show the result
# A tibble: 1 x 23
    INC_KEY   AGE GENDER ADULT Expired Intervention `808.0` `808.1` `808.2` `808.3` `808.4` `808.41`
      <int> <int> <chr>  <lgl> <lgl>   <lgl>        <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   <lgl>   
1 150580164    29 Female TRUE  FALSE   TRUE         TRUE    FALSE   TRUE    FALSE   FALSE   FALSE   
# ... with 11 more variables: `808.42` <lgl>, `808.43` <lgl>, `808.49` <lgl>, `808.5` <lgl>,
#   `808.51` <lgl>, `808.52` <lgl>, `808.53` <lgl>, `808.59` <lgl>, `808.8` <lgl>, `808.9` <lgl>,
#   SURGERY <lgl>

Zhen · January 1, 2019, 3:48pm

I see your method, thank you technocrat.

However, I have thousands of unique IDs and will need to split the data frame into many matrices first.

Thanks to your inspiration, I found another solution. In my case, the code would be like

movie_with_tag_dummy %>% select(-genres) %>% group_by(id, title) %>% summarize_all(sum)

The code itself is pretty much self explanatory. First I take out the column I do not need, then group by id and title, then sum them together.

Since they are all dummy variables that contains only 1 and 0, and each row previously only records one genre for each movie. The code will do wonderfully.

system · January 8, 2019, 3:48pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.