Split character column into unknown number of columns

robinsones · February 12, 2018, 7:16pm

I have a character column (not list, it's from a sql dataset) that has the structure of "List((query, count))" with differing amounts of entires. I'd like to transform this into a dataframe:

library(tibble)
#> Warning: package 'tibble' was built under R version 3.4.3
original_df <- tribble(
  ~query, ~id,
  "List((slime, 2))", 1L,
  "List((gift, 3), (bear, 1))", 2L,
  "List((sugar, 10))", 3L,
  "List((fluff, 2), (pie, 5), (plum, 8), (dark, 10), (long, 2))", 4L
)

desired_df <- tribble(
  ~query, ~count, ~id,
  "slime", 2L, 1L,
  "gift", 3L, 2L,
  "bear", 1L, 2L,
  "sugar", 10L, 3L,
  "fluff", 2L, 4L, 
  "pie", 5L, 4L, 
  "plum", 8L, 4L,
  "dark", 10L, 4L,
  "long", 2L, 4L
)

I'm struggling splitting the column and extracting the relevant pairings (query and count); if I could get it to a wide format (e.g. id, query1, count1, query2, count2, etc. with many of them having null entries for the later ones), I can convert it to the long tidy dataframe. Ideally solution will be fairly performant, as I will have to do it over tens of thousands of rows, some of which might have dozens of entries.

edgararuiz · February 12, 2018, 7:48pm

Hi @robinsones, what's the database back-end you are interacting with?

robinsones · February 12, 2018, 7:57pm

I am rather embarrassed by this - I use RJDBC. Here's the code I include at the top of each analysis (where LINK is the connection):

drv <- JDBC(
  driverClass = "com.vertica.jdbc.Driver",
  classPath = Sys.getenv("classPath")
)

conn <- dbConnect(
  drv,
  "jdbc:vertica://LINK",
  user = Sys.getenv("vertica_user"),
  password = Sys.getenv("vertica_pw")
)

And then sql chunks and dbGetQuery(conn, query) work. I tried to set up ODBC some months ago and was not able to manage.

nutterb · February 12, 2018, 7:59pm

Perhaps something along the lines of:

library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

x <- original_df %>% 
  # Remove the `List(` as the start and the `)` at the end
  mutate(query = str_replace_all(query, "(^List[(]|[)]$)", ""))

mapply(function(q, i){
         # Remove the rest of the parentheses.  
         # now we can split on `, `
         q <- str_replace_all(q, "([(]|[)])", "")
         # create a matrix of the split strings and convert to a tibble
         df <- as.tibble(trimws(str_split_fixed(q, ", ", 2)))
         names(df) <- c("query", "count")
         # add the ID and return the tibble
         df$id <- i
         df
       },
       q = strsplit(x$query, "),"),
       i = x$id,
       SIMPLIFY = FALSE) %>% 
  bind_rows()

For a point of reference, running this over 10,000 rows of your original_df (resampled) takes about 4.8 seconds.

robinsones · February 12, 2018, 8:29pm

Works great, thank you!

tbradley · February 12, 2018, 8:30pm

Here is a solution using stringr, dplyr, and purrr:

original_df <- tribble(
  ~query, ~id,
  "List((slime, 2))", 1L,
  "List((gift, 3), (bear, 1))", 2L,
  "List((sugar, 10))", 3L,
  "List((fluff, 2), (pie, 5), (plum, 8), (dark, 10), (long, 2))", 4L
)


output <- original_df %>% 
  mutate(new = str_extract(query, "(?<=List\\()(.*)(?=\\))")) %>%
  group_by(id) %>% 
  nest() %>% 
  mutate(
    temp_col = map(
      data, 
      ~ str_extract_all(.x$new, "(?<=\\()(.*?)(?=\\))") %>% 
          flatten() %>% 
          map_chr(~return(.x)) %>% 
          as_tibble()
    )
  ) %>% 
  unnest(temp_col) %>% 
  separate(value, into = c("query", "count"), sep = ", ") %>% 
  select(query, count, id) %>% 
  mutate(count = as.integer(count))

That gave me this:

# A tibble: 9 x 3
  query count    id
  <chr> <int> <int>
1 slime     2     1
2 gift      3     2
3 bear      1     2
4 sugar    10     3
5 fluff     2     4
6 pie       5     4
7 plum      8     4
8 dark     10     4
9 long      2     4

> identical(output, desired_df)
[1] TRUE

As for speed, I sampled your dataset using dplyr::sample_n 10,000 times and this method took 0.12 seconds.

edgararuiz · February 12, 2018, 8:33pm

That worked against a live DB connection?