I need to calculate the grouped lag of an existing column using sparklyr. My code is below:
require(sparklyr)
require(tidyverse)
sc <- spark_connect(master = "local") # connect to a local Spark instance
USD2010_tbl <- spark_read_csv(sc, "USD2010", "USD2010.csv", header = FALSE) # read the data into Spark
src_tbls(sc) #list the data in Spark
USD2010_tbl %>% rename(pair = V1, timestamp = V2, bid = V3, ask = V4) %>% # add column names
mutate(price = (ask + bid)/2) %>% # calculate price as mid point of bid and ask
group_by(pair) %>%
arrange(pair) %>%
mutate(price_lag = lag(price))
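For what it's worth, on a Spark tbl lag() is translated to a SQL window function, so the row order within each pair should be stated explicitly rather than relying on arrange(). A sketch, assuming the timestamp column gives the intended row order (order_by is dplyr's argument for this, which dbplyr translates into the window's ORDER BY):

```r
library(sparklyr)
library(dplyr)

# USD2010_tbl is the Spark tbl read above; timestamp is assumed to order
# rows within each currency pair
USD2010_tbl %>%
  rename(pair = V1, timestamp = V2, bid = V3, ask = V4) %>%
  mutate(price = (ask + bid) / 2) %>%
  group_by(pair) %>%
  mutate(price_lag = lag(price, order_by = timestamp)) # previous price per pair
```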
I got the error message:
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 15.0 failed 1 times, most recent failure: Lost task 3.0 in stage 15.0 (TID 444, localhost, executor driver): java.io.IOException: No space left on device
As the error says, you don't have enough space on your temp drive to perform the operation. Make sure to increase the space available in that folder (or whichever folder R/Spark uses for temporary files), and the code will run fine.
My data is a 10 GB CSV file. When I import it into R using data.table::fread, it crashes my computer, which has 16 GB of RAM. I am surprised that sparklyr cannot even handle data of this size.
Just to clarify: that error message is about disk space, not RAM. If your temp folder is located on a small disk without much free space, the job is going to fail. You can define the location yourself through Spark's configuration.
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 21.0 failed 1 times, most recent failure: Lost task 2.0 in stage 21.0 (TID 571, localhost, executor driver): java.io.IOException: No space left on device
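One way to relocate Spark's scratch space when opening the connection from sparklyr is a sketch along these lines (spark.local.dir is Spark's standard property for local scratch directories; the path below is a placeholder you should replace with a large local volume):

```r
library(sparklyr)

config <- spark_config()
# Point Spark's shuffle/spill scratch space at a disk with enough free room.
# "/data/spark-tmp" is a placeholder path, not a real requirement.
config$`spark.local.dir` <- "/data/spark-tmp"

sc <- spark_connect(master = "local", config = config)
```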