Split hundreds of millions of rows into lists to apply function

I am trying to run a Prophet model to forecast demand for each store/item pair. There are about 1.5 million pairs, and this is the function I am trying to apply to each one:

# Assumes prophet and dplyr are loaded, and that `holidays` and
# `accuracy_func` are defined in the calling environment
prophet_model <- function(df) {
  # Sort by date
  df <- df[order(df$ds), ]

  # Hold out the most recent ~6% of dates as a test set
  test  <- tail(df, ceiling(nrow(df) * 0.06))
  train <- df[!df$ds %in% test$ds, ]

  # Train model
  model_prophet <- prophet(train[, c("ds", "y")],
                           holidays = holidays,
                           daily.seasonality = FALSE)

  # Score the held-out test set
  test_forecast <- predict(model_prophet, test)
  test_forecast$ds <- as.Date(test_forecast$ds)

  # Predict the next 7 days
  dates <- data.frame(ds = seq(Sys.Date() + 1, by = "day", length.out = 7))
  forecast <- predict(model_prophet, dates)
  forecast <- forecast[, c("ds", "yhat", "yhat_upper", "yhat_lower")]
  forecast <- forecast %>%
    mutate(item = unique(factor(df$item)), store = unique(factor(df$store)))

  # Test accuracy on the held-out data
  testdata <- merge(test, test_forecast[, c("yhat", "ds")],
                    by = "ds", all.x = TRUE)
  forecast$PredictionAccuracy <- accuracy_func(testdata)
  forecast
}
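
For a single store/item pair this runs fine, for example (the ids here are placeholders):

one_pair <- df[df$store == 1 & df$item == 1, ]
fc <- prophet_model(one_pair)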

I need to split 180 million rows into lists by unique pairs of two columns, and then apply the function to each list element using parLapply(). But the R session crashes, or just keeps running, when I try to split the data frame into lists. So far I have tried split() and group_split():

data <- df %>% group_split(col1, col2)

data <- split(df, list(df$col1, df$col2))
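
A lighter-weight variant of the same idea (a sketch, untested at this scale) is to split row indices rather than data-frame copies, with drop = TRUE so that split() does not materialise every empty col1/col2 combination:

# Split row indices instead of data-frame copies
idx <- split(seq_len(nrow(df)), list(df$col1, df$col2), drop = TRUE)

# Each call then extracts only the rows it needs
result <- lapply(idx, function(i) prophet_model(df[i, ]))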

I am trying to use parLapply(), but I couldn't run it without first splitting the data frame into lists. Also, since I am working on Windows, it is difficult to load this data onto each cluster worker.

result <- parLapply(cl, data, prophet_model)
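
For reference, the full PSOCK setup looks roughly like this (on Windows the workers start with empty environments, so packages and globals such as holidays and accuracy_func have to be exported explicitly):

library(parallel)

cl <- makeCluster(detectCores() - 1)

# Workers start empty: load packages and export globals on each one
clusterEvalQ(cl, { library(prophet); library(dplyr) })
clusterExport(cl, c("prophet_model", "holidays", "accuracy_func"))

result <- parLapply(cl, data, prophet_model)
stopCluster(cl)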

I also tried applying the function directly using do(), but it showed an estimated 1,000 hours to completion:

results <- df %>% group_by(col1, col2) %>% do(prophet_model(.))

The function itself works on a small dataset: I have tried both parallel processing and do() on a few pairs, and both worked fine.
Please let me know if there is another way of splitting this large dataset, or of applying the function to it.

I don't know what you are doing, but I'll bet it is probably not required for you to actually split the data into lists. Can you give a small example of what you are trying to do and why it is only possible with lapply()? If your data fits into memory, you can also use future/furrr to run things in parallel.

If you do need to do it all at once, then I would first make sure it works on a subset of the data. However, doing anything with that much data will always be a challenge, so you might want to use a more performant language, or something like Spark via SparkR or sparklyr.
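
With sparklyr that could look roughly like this (a sketch only, assuming a working Spark installation and that prophet is installed on every worker; in practice you would read the data into Spark directly, e.g. with spark_read_csv(), rather than copy a 180-million-row data frame from R):

library(sparklyr)

sc <- spark_connect(master = "local")  # or a real cluster
sdf <- copy_to(sc, df, "demand", overwrite = TRUE)

# Run prophet_model once per (col1, col2) group on the Spark workers
result <- spark_apply(
  sdf,
  prophet_model,
  group_by = c("col1", "col2"),
  packages = c("prophet", "dplyr")
)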

Thanks for your response. I have added more details about my problem in the description above. I am trying to scale the implementation of my model, and I was splitting the data into lists so that I could apply parallel processing.

I will try future/furrr if that works. Will it work on the entire dataset directly?

Yes, the way furrr works is that it uses future to distribute the computation; you can even distribute to a remote cluster to speed things up. But if you are trying to use prophet on 180 million rows, you are going to have a bad time: prophet uses Bayesian statistics/Stan, which is quite computationally heavy, so regardless of what you do it will take a very (very) long time.
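
A minimal sketch of the furrr approach (multisession also works on Windows; note that future will ship df and the function's globals to every worker, which is itself expensive at 180 million rows, so test on a subset first):

library(future)
library(furrr)

plan(multisession, workers = availableCores() - 1)

# Split row indices per group, then fit one model per group in parallel
idx <- split(seq_len(nrow(df)), list(df$col1, df$col2), drop = TRUE)
result <- future_map(idx, ~ prophet_model(df[.x, ]),
                     .options = furrr_options(seed = TRUE))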

So either simplify your problem (e.g., take only x days/weeks/months), rent a beefy server with lots and lots of cores and RAM, or prepare to wait for days.

