For instance, consider this function (using @Tazinho's idea):
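get_pairs <- function(sequence_of_letters) {
  # Split the comma-separated string into its individual letters
  x <- strsplit(sequence_of_letters, ",")[[1]]
  # Paste each letter to its successor, giving the adjacent pairs
  paste0(x[-length(x)], ",", x[-1])
}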
Then, tb <- mutate(df, z = sapply(y, get_pairs)) works just fine if df is a local data frame.
The column z is a list column, which can be "exploded" into one row per pair with tidyr::unnest(tb, z).
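To make the local case concrete, here is a minimal sketch. The sample data frame (columns id and y) is made up for illustration and is not the real data. It also uses lapply() rather than sapply(), because sapply() simplifies its result to a matrix whenever every row yields the same number of pairs, whereas lapply() always returns a list.

```r
library(dplyr)
library(tidyr)

get_pairs <- function(sequence_of_letters) {
  x <- strsplit(sequence_of_letters, ",")[[1]]
  paste0(x[-length(x)], ",", x[-1])
}

# Hypothetical sample data: an id plus a comma-separated sequence in y
df <- tibble(id = 1:2, y = c("a,b,c", "d,e"))

# lapply() always yields a list, so z is a list column
tb <- mutate(df, z = lapply(y, get_pairs))

# One output row per adjacent pair
out <- unnest(tb, z)
```

With this sample input, out should have three rows, with z equal to "a,b", "b,c" and "d,e".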
However, if df is a tbl_spark, this code (analogous to this one)
tb <- df %>%
  spark_apply(function(d) {
    library(dplyr)
    get_pairs <- function(sequence_of_letters) {
      x <- strsplit(sequence_of_letters, ",")[[1]]
      paste0(x[-length(x)], ",", x[-1])
    }
    mutate(d, z = sapply(y, get_pairs))
  })
raises warnings and an error, which blocks the natural next step, sparklyr.nested::sdf_explode(tb, z).
The messages start like this:
Warning messages:
1: In if (is.na(object)) { :
the condition has length > 1 and only the first element will be used
2: In if (is.na(object)) { :
the condition has length > 1 and only the first element will be used
ERROR sparklyr: Worker (X) failed to complete R process