I need to calculate the grouped lag of an existing column using sparklyr. My code is below:
require(sparklyr)
require(tidyverse)
sc <- spark_connect(master = "local") # connect to a local Spark instance
USD2010_tbl <- spark_read_csv(sc, "USD2010", "USD2010.csv", header = FALSE) # read the data into Spark
src_tbls(sc) #list the data in Spark
USD2010_tbl %>% rename(pair = V1, timestamp = V2, bid = V3, ask = V4) %>% # add column names
mutate(price = (ask + bid)/2) %>% # calculate price as mid point of bid and ask
group_by(pair) %>%
arrange(pair) %>%
mutate(price_lag = lag(price))
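For what it's worth, on a Spark tbl lag() is translated to a SQL window function, so the row order within each pair should be stated explicitly rather than relying on arrange(). A sketch, assuming the timestamp column gives the intended row order (order_by is dplyr's argument for this, which dbplyr translates into the window's ORDER BY):

```r
library(sparklyr)
library(dplyr)

# USD2010_tbl is the Spark tbl read above; timestamp is assumed to order
# rows within each currency pair
USD2010_tbl %>%
  rename(pair = V1, timestamp = V2, bid = V3, ask = V4) %>%
  mutate(price = (ask + bid) / 2) %>%
  group_by(pair) %>%
  mutate(price_lag = lag(price, order_by = timestamp)) # previous price per pair
```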
I got the error message:
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 15.0 failed 1 times, most recent failure: Lost task 3.0 in stage 15.0 (TID 444, localhost, executor driver): java.io.IOException: No space left on device
As the error says, you don't have enough space on your temp drive to perform the operation. Make sure to increase the space available in that folder (or whichever folder R/Spark uses for temporary files), and the code will run fine.
My data is a 10 GB CSV file. When I import it into R using data.table::fread, it crashes my computer, which has 16 GB of RAM. I am surprised that sparklyr cannot even handle data of this size.
Just to clarify: that error message is about disk space, not RAM. If your temp folder is located on a small disk without much free space, the job is going to fail. You can define the location yourself through Spark's configuration.
Error: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 21.0 failed 1 times, most recent failure: Lost task 2.0 in stage 21.0 (TID 571, localhost, executor driver): java.io.IOException: No space left on device
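One way to relocate Spark's scratch space when opening the connection from sparklyr is a sketch along these lines (spark.local.dir is Spark's standard property for local scratch directories; the path below is a placeholder you should replace with a large local volume):

```r
library(sparklyr)

config <- spark_config()
# Point Spark's shuffle/spill scratch space at a disk with enough free room.
# "/data/spark-tmp" is a placeholder path, not a real requirement.
config$`spark.local.dir` <- "/data/spark-tmp"

sc <- spark_connect(master = "local", config = config)
```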