Code question, splitting a column with multiple values, with delimiter, into individual indicator columns

onions94 · October 23, 2019, 8:59pm

Hi I'm new to the RStudio community,

This is a really newby question. But say if I have a column X which indicates 10 categories delimited by ","
E.g.
row 1: c1,c3,c10
row 2: c2,c7,c10
row 3: c1,c2,c3

How would I go about creating 10 columns, each are indicators for which is in X
E.g. (new columns are 1 to 10, left to right)
row 1: 1,0,1,0,0,0,0,0,0,1
row 2: 0,1,0,0,0,0,1,0,0,1
row 3: 1,1,1,0,0,0,0,0,0,0

I know I can do a for loop to scan through each row and split X by delimiter ",".
Then have 10 if statements that checks for the new columns. But is there a more efficient way for this? I imagine wither bigger data sets this could get computational difficult.

Thanks for the help!

mattwarkentin · October 23, 2019, 9:44pm

Hi @onions94,

This code gets you most of the way there:

library(dplyr)
library(tidyr)
library(stringr)

data <- tibble(id = 1:3,
               x = c('c1,c3,c10',
                     'c2,c7,c10',
                     'c1,c2,c3'))

data %>% 
  separate_rows(x, sep = ',') %>% 
  mutate(x = str_replace(x, 'c', 'var'),
         y = 1) %>%
  spread(x, y, fill = 0) %>% 
  select(id, num_range('var', range = 1:10))

You end up with a dataset like this:

# A tibble: 3 x 6
     id  var1  var2  var3  var7 var10
  <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     0     1     0     1
2     2     0     1     0     1     1
3     3     1     1     1     0     0

I could add some more code that would create extra variables (e.g. var4) and fill it in with 0's, but the current code will create a variable for every observed value in x, which I think might be good enough.

Hope this is on the right track.

pieterjanvc · October 23, 2019, 9:46pm

Hi,

I just came up with a slightly different implementation at the same time, so I'll just post it

library(stringr)
myData = data.frame(x = c("c1,c3,c10", "c2,c7,c10", "c1,c2,c3"))
newData = data.frame(matrix(0, nrow = nrow(myData), ncol = 10))

for(i in 1:nrow(myData)){
  newData[i, as.integer(unlist(str_extract_all(myData$x[i], "\\d+")))] = 1
}

  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1  1  0  1  0  0  0  0  0  0   1
2  0  1  0  0  0  0  1  0  0   1
3  1  1  1  0  0  0  0  0  0   0

PJ

andresrcs · October 23, 2019, 11:40pm

Yet another way to do it

library(tidyverse)

df = data.frame(stringsAsFactors = FALSE,
                x = c("c1,c3,c10", "c2,c7,c10", "c1,c2,c3")
                )

categories <- paste0("c", 1:10) %>% set_names()

df %>% 
    bind_cols(map_dfc(categories, ~ as.numeric(str_detect(df$x, paste0(.x,"(,|$)")))))
#>           x c1 c2 c3 c4 c5 c6 c7 c8 c9 c10
#> 1 c1,c3,c10  1  0  1  0  0  0  0  0  0   1
#> 2 c2,c7,c10  0  1  0  0  0  0  1  0  0   1
#> 3  c1,c2,c3  1  1  1  0  0  0  0  0  0   0

onions94 · October 27, 2019, 1:15am

Thanks for all the replies!

@mattwarkentin's solution was the closest to what I need, since I actually don't know if it's c1,c2, ... , c10.. the number of categories were unknown.

system · November 3, 2019, 1:26am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.