sparklyr operation failing: “No space left on device”

I need to calculate the lagged value of an existing column within groups (group_by) using sparklyr. My code is below:

library(sparklyr)
library(tidyverse)

sc <- spark_connect(master = "local") # connect to a local Spark instance

USD2010_tbl <- spark_read_csv(sc, "USD2010", "USD2010.csv", header = FALSE) # read the data into Spark
src_tbls(sc) # list the tables registered in Spark

USD2010_tbl %>%
  rename(pair = V1, timestamp = V2, bid = V3, ask = V4) %>% # add column names
  mutate(price = (ask + bid) / 2) %>% # price = mid point of bid and ask
  group_by(pair) %>%
  arrange(pair, timestamp) %>% # order within each pair by time so lag() is well defined
  mutate(price_lag = lag(price))

I got the error message:


Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 15.0 failed 1 times, most recent failure: Lost task 3.0 in stage 15.0 (TID 444, localhost, executor driver): java.io.IOException: No space left on device

As the error says, you don't have enough space on your tmp drive to perform the operation. Make sure to increase the space available in that folder (or whatever directory Spark is using for its temporary files, which is /tmp by default) and the code will run fine.
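If it helps, here is a quick way to see how much space is left on the default scratch location (the df call assumes a Linux/macOS shell; /tmp is Spark's default unless you override java.io.tmpdir or spark.local.dir):

# free space on the disk that holds Spark's default scratch directory
system("df -h /tmp")
# R's own session temp directory, for comparison
tempdir()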

How do I increase the space available on the tmp drive?

Buy more space :smiley:

My data is a 10 GB CSV file. When I import it into R using data.table::fread, it crashes my computer, which has 16 GB of RAM. I am surprised that sparklyr cannot even handle data of this size.

Just to clarify: that error message is about disk space, not RAM. If your temp folder is located on a small disk that doesn't have much free space, the job is going to fail. You can define the location by setting:

config = spark_config()
config$`sparklyr.shell.driver-java-options` <-  paste0("-Djava.io.tmpdir=", spark_dir)
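For completeness, a minimal end-to-end sketch (the spark_dir path is just a placeholder, and a local master is assumed); the config only takes effect when it is passed to spark_connect for a fresh connection:

spark_dir <- "/path/on/a/disk/with/plenty/of/space" # placeholder: any directory on a large disk

config <- spark_config()
config$`sparklyr.shell.driver-java-options` <- paste0("-Djava.io.tmpdir=", spark_dir)

sc <- spark_connect(master = "local", config = config) # settings are applied at connection time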

Take a look at this blog post.

Have you tried allocating more memory to Spark as described here?
https://spark.rstudio.com/guides/connections/

I followed this blog and added the code:

spark_dir = "/home/PeterGriffin/Downloads/SparkTemp"
config = spark_config()

config$`sparklyr.shell.driver-java-options` <-  paste0("-Djava.io.tmpdir=", spark_dir)
config$`sparklyr.shell.driver-memory` <- "4G"
config$`sparklyr.shell.executor-memory` <- "4G"
config$`spark.yarn.executor.memoryOverhead` <- "512"
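A quick sanity check that Spark is actually writing to the new location while the job runs (plain base R; Spark normally creates spark-*/blockmgr-* subdirectories inside its scratch directory):

system(paste("df -h", spark_dir)) # free space on the disk that holds the new temp directory
list.files(spark_dir)             # spark-*/blockmgr-* folders should appear here during a job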

but I still got the error message:

Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 21.0 failed 1 times, most recent failure: Lost task 2.0 in stage 21.0 (TID 571, localhost, executor driver): java.io.IOException: No space left on device

Is there any rule for the choice of directory?
