sparklyr::spark_apply(x, nrow, group_by) takes 33 minutes to finish on Azure Databricks (DS14 v2 nodes, autoscaling between 2 and 8 workers) for an input data frame with around 100k rows and 50 columns (string and double).
I don't have a reproducible example because my data is proprietary, but my code looks like this:
dfResult <- sdf_input %>%
  sparklyr::spark_apply(nrow, group_by = 'ID')
(Grouping by ID yields 34 distinct groups.)
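Since I can't share the real data, a synthetic stand-in with the same shape might help anyone who wants to try this. The sketch below is only an approximation: the column names, the group labels, and the 24 string / 25 double split are assumptions, not my real schema.

```r
# Synthetic stand-in: 100k rows, 50 columns, 34 distinct ID values.
# All names and the string/double split are assumptions for illustration.
set.seed(42)
n_rows   <- 100000
n_groups <- 34

# Recycle the 34 labels across all rows so every group appears, then shuffle.
local_df <- data.frame(
  ID = sample(rep_len(sprintf("G%02d", seq_len(n_groups)), n_rows)),
  stringsAsFactors = FALSE
)

# 24 string columns and 25 double columns, for 50 columns including ID.
for (i in 1:24) local_df[[paste0("s", i)]] <- sample(letters, n_rows, replace = TRUE)
for (i in 1:25) local_df[[paste0("d", i)]] <- runif(n_rows)
```

With a Spark connection sc from sparklyr::spark_connect(), this could then be loaded with sparklyr::sdf_copy_to(sc, local_df, name = "sdf_input", overwrite = TRUE) and passed to the same spark_apply call.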
I understand there is serialization/deserialization overhead between the driver and the worker nodes, but 33 minutes to run nrow over 100k rows x 50 columns seems excessive.
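To gauge how much of the time is spark_apply overhead rather than the cluster itself, one comparison is to time the same per-group row count expressed as a plain dplyr count, which sparklyr translates to Spark SQL without running R on the workers. This is a rough sketch: time_group_count is a hypothetical helper name, and it assumes sdf is a Spark DataFrame with an ID column.

```r
# Hypothetical helper: time the spark_apply path against a native Spark count.
# `sdf` is assumed to be a tbl_spark with a grouping column named ID.
# Nothing touches Spark until the function is called.
time_group_count <- function(sdf) {
  list(
    # R function shipped to the workers and run once per group
    spark_apply = system.time(
      dplyr::collect(sparklyr::spark_apply(sdf, nrow, group_by = "ID"))
    ),
    # Same count translated to Spark SQL (GROUP BY ID), no R on the workers
    native = system.time(
      dplyr::collect(dplyr::count(sdf, ID))
    )
  )
}
```

Calling time_group_count(sdf_input) returns the two system.time() results side by side; a large gap between them would point at the spark_apply path rather than the data size.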