Unlisting a Column in a Data Frame By Row

liberry · April 9, 2019, 9:26pm

I have a data set where each row is a transaction with a unique customer ID number. There are four separate status columns on each row. I am trying to figure out how to report out the unique status values that occur for each customer ID.

I used the unite() function to combine the four status columns into one. This returns a single long string containing all statuses present on each row. I need each separate status as a separate element in a vector so that I can get the unique values. I used strsplit to try and break up the long character string into multiples, my problem is that it's returning a list instead of a series of vectors. I tried to use unlist, but it throws an error message. It looks like the way I am trying to use it, it wants to combine the entire list into a single vector.

I want to unlist by row, then I'd have a vector of statuses for each customer number and then I could unduplicate the statuses by customer number. Reprex below.

library(tidyr)

customer <- rep(1:5, times=4)

order.num <- 1:20

stat.cols <- paste("status", 1:4, sep = "")

mydf <- data.frame(customer, order.num)

mydf[stat.cols] <- NA


mydf$status1 <- sample(paste(LETTERS[1:4], 1:4, sep = ""),20, replace=TRUE, prob=c(0.10, 0.20, 0.65, 0.05) )
mydf$status2 <- sample(paste(LETTERS[1:4], 1:4, sep = ""),20, replace=TRUE, prob=c(0.10, 0.20, 0.65, 0.05) )
mydf$status3 <- sample(paste(LETTERS[1:4], 1:4, sep = ""),20, replace=TRUE, prob=c(0.10, 0.20, 0.65, 0.05) )
mydf$status4 <- sample(paste(LETTERS[1:4], 1:4, sep = ""),20, replace=TRUE, prob=c(0.10, 0.20, 0.65, 0.05) )


mydf <- unite(mydf, col="status.all", sep=",", c(status1, status2, status3, status4))

mydf$status.all <- strsplit(mydf$status.all, ",")

technocrat · April 9, 2019, 11:06pm

Here's a down and dirty way

library(tidyverse)
customer <- rep(1:5, times=4)
order.num <- 1:20
mydf <- data.frame(customer, order.num)
mydf %>% mutate(status1 = sample(paste(LETTERS[1:4], 1:4, sep = ""),20, replace=TRUE, prob=c(0.10, 0.20, 0.65, 0.05) ))
#>    customer order.num status1
#> 1         1         1      C3
#> 2         2         2      C3
#> 3         3         3      B2
#> 4         4         4      B2
#> 5         5         5      C3
#> 6         1         6      C3
#> 7         2         7      C3
#> 8         3         8      C3
#> 9         4         9      C3
#> 10        5        10      C3
#> 11        1        11      C3
#> 12        2        12      C3
#> 13        3        13      D4
#> 14        4        14      B2
#> 15        5        15      C3
#> 16        1        16      C3
#> 17        2        17      B2
#> 18        3        18      C3
#> 19        4        19      C3
#> 20        5        20      C3

^{Created on 2019-04-09 by the reprex package (v0.2.1)}

Oh, and if you want to have this reproducible, you need to

set.seed(1234) # or any other integer

markdly · April 10, 2019, 12:15am

I've taken from the question that the ultimate end goal you are looking for is to:

To do this, I thought using gather() combined with distinct() might make the job easier:

library(tidyverse)

# Original dataset
customer <- rep(1:5, times=4)
order.num <- 1:20
stat.cols <- paste("status", 1:4, sep = "")
mydf <- data.frame(customer, order.num)
mydf[stat.cols] <- NA
set.seed(1234) # for reproducibility
mydf$status1 <- sample(paste(LETTERS[1:4], 1:4, sep = ""),20, replace=TRUE, prob=c(0.10, 0.20, 0.65, 0.05) )
mydf$status2 <- sample(paste(LETTERS[1:4], 1:4, sep = ""),20, replace=TRUE, prob=c(0.10, 0.20, 0.65, 0.05) )
mydf$status3 <- sample(paste(LETTERS[1:4], 1:4, sep = ""),20, replace=TRUE, prob=c(0.10, 0.20, 0.65, 0.05) )
mydf$status4 <- sample(paste(LETTERS[1:4], 1:4, sep = ""),20, replace=TRUE, prob=c(0.10, 0.20, 0.65, 0.05) )

# Solution
mydf <- mydf %>% 
  gather(key = "statusid", value = "status", status1:status4) %>% 
  distinct(customer, status) %>% 
  arrange(customer, status) 

mydf
#>    customer status
#> 1         1     A1
#> 2         1     B2
#> 3         1     C3
#> 4         2     A1
#> 5         2     B2
#> 6         2     C3
#> 7         3     A1
#> 8         3     B2
#> 9         3     C3
#> 10        4     A1
#> 11        4     B2
#> 12        4     C3
#> 13        4     D4
#> 14        5     A1
#> 15        5     B2
#> 16        5     C3

# Or if you prefer keeping the status values as a list column
mydf %>% nest(status)
#>   customer           data
#> 1        1     A1, B2, C3
#> 2        2     A1, B2, C3
#> 3        3     A1, B2, C3
#> 4        4 A1, B2, C3, D4
#> 5        5     A1, B2, C3

liberry · April 10, 2019, 12:26am

Thanks, that works. The gather approach should be easier for number-crunching but the list-column looks better for producing an end-user extract. I think I need to practice working with list-columns some more, that threw me for a conceptual loop.

system · April 17, 2019, 12:35am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.

If you have a query related to it or one of the replies, start a new topic and refer back with a link.