Split/explode comma-separated string column into pairs

dplyr
sparklyr

#1

Hi, folks.

I have seen many solutions to the problem of exploding a comma-separated
string column (e.g., this nice SO answer).

However, how can I go from this data frame
df <- data.frame(x = 1, y = "A,B,C,D")
to this one:
tb <- data.frame(x = c(1,1,1), y = c("A,B", "B,C", "C,D"))?
The goal is to split the string column by pairs of consecutive letters. It is easy to solve this problem with non-vectorized R code. The main problem for me is that the table is too big to fit in memory, so a solution with sparklyr would be great.

Thanks, in advance!


#2

What do you mean in this context with unvectorized R code?

I would usually approach this via an index shift:

library(tidyverse)

solo <- stringr::str_split("A,B,C,D", pattern = ",") %>% pluck(1)
paste0(solo[-length(solo)], ",", solo[-1])
#> [1] "A,B" "B,C" "C,D"

Created on 2018-05-08 by the reprex package (v0.2.0).


#3

This is nice, thanks. However, I am having trouble when trying such a solution with sparklyr, e.g. using the spark_apply function. I am thinking that it would be enough to map each "A,B,C,D" entry into list("A,B", "B,C", "C,D"), which can be exploded with sparklyr.nested::sdf_explode.


#4

For instance, consider this function (using @Tazinho's idea):

get_pairs <- function(sequence_of_letters) {
      x <- strsplit(sequence_of_letters, ",")[[1]]
      paste0(x[-length(x)], ",", x[-1])
}

Then, tb <- mutate(df, z = sapply(y, get_pairs)) works just fine if df is a local data frame.
The column z is a list, which can be "exploded" with tidyr::unnest(tb, z).

However, if df is a tbl_spark, this code (analogous to this one)

tb <- df %>%
  spark_apply(function(d) {
    library(dplyr)
    get_pairs <- function(sequence_of_letters) {
      x <- strsplit(sequence_of_letters, ",")[[1]]
      paste0(x[-length(x)], ",", x[-1])
    }
    mutate(d, z = sapply(y, get_pairs))
  })

raises warnings and errors. The natural step after that would be sparklyr.nested::sdf_explode(tb, z).
The error messages start like:

Warning messages:
1: In if (is.na(object)) { :
  the condition has length > 1 and only the first element will be used
2: In if (is.na(object)) { :
  the condition has length > 1 and only the first element will be used
ERROR sparklyr: Worker (X) failed to complete R process

#5

The following does the trick. df_tbl is the tbl_spark version of df.

library(sparklyr)
library(sparklyr.nested)
library(dplyr)

sc <- spark_connect(master = "local")
df <- data.frame(x = 1, y = "A,B,C,D")
df_tbl <- copy_to(sc, df)

df_split <- df_tbl %>%
  mutate(y = split(y, ",")) %>%
  sdf_explode(y)

tb1 <- df_split %>%
  arrange(x) %>%
  mutate(rn = row_number())
tb2 <- df_split %>%
  arrange(x) %>%
  mutate(rn = row_number() - 1L)
tb <- sdf_bind_rows(tb1, tb2) %>%
  arrange(x, rn) %>%
  group_by(x, rn) %>%
  summarise(seq = collect_list(y)) %>%
  filter(size(seq) == 2) %>% 
  sdf_explode(seq) %>%
  group_by(x, rn) %>%
  summarize(pairs = paste(collect_list(seq), sep = ",")) %>%
  ungroup()

The functions not available in the sparklyr package (split(), collect_list() and size()) are spark functions.

Big thanks to @samuelmacedo83 who showed me this solution!