Not able to load local text file into Hive table / Spark data frame

I have a text file on my local drive and want to load it as a Spark data frame. I used sdf_copy_to(), but got the error below.

df <- fread('/home/cdsw/HIST 1.txt')


sdf_copy_to(con, df, name = "sdf")

|=================================================================| 100% 1399 MB

Engine exhausted available memory, consider a larger engine size.

Engine exited with status 137.

I also tried

spark_read_text(con, name = "Month1_IntlData", path = "/home/cdsw/SUMMARY_DETAIL_HIST 1.txt", overwrite = TRUE)

and got this error:

Error: org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://<<server_name>>/home/cdsw/SUMMARY_DETAIL_HIST 1.txt;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:360)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:348)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:348)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:623)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:603)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

Can you tell me how to copy a local text file into a Spark data frame or a Hive table?

Thanks and Regards
Sankar Narayana

It looks like you're out of memory. Are you by chance working in Docker?

Error 137 in Docker means the container was killed by the 'oom-killer' (Out of Memory). This happens when there isn't enough memory in the container for the running process.

The OOM killer is a protective process that jumps in to save the system when its available memory gets too low, killing the most resource-hungry processes to free up memory.
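If you do have access to the Docker host and want to confirm that's what happened, one quick check from R (a sketch; <container_id> is a placeholder for your container):

# prints "true" if Docker's OOM killer terminated the container
system("docker inspect -f '{{.State.OOMKilled}}' <container_id>")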

Even if not, it does expressly say that you'll need a larger engine.

Text file size is 1.9 GB.

Do you mean we need to increase the driver memory settings?

It sounds like it, though, to be honest, I'm not 100% sure. Hopefully someone with a bit more expertise will chime in.

Mara is correct: copy_to() needs additional memory on the driver machine, which you can increase to, say, 8 GB as follows:

config <- spark_config()
config["sparklyr.shell.driver-memory"] <- "8g"

# then add the config parameter to spark_connect()
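For instance, a minimal sketch of the whole flow with that setting (master = "yarn" and the file path are taken from the posts above):

library(sparklyr)
library(data.table)

config <- spark_config()
config["sparklyr.shell.driver-memory"] <- "8g"   # extra driver memory for the copy

con <- spark_connect(master = "yarn", config = config)

# read the local file into R, then copy it into Spark through the driver
df <- fread("/home/cdsw/HIST 1.txt")
sdf <- sdf_copy_to(con, df, name = "sdf", overwrite = TRUE)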

The spark_read_*() functions only support loading data from local paths when connected in master = "local" mode. If you are running against a proper Spark cluster, you need to use an HDFS path instead of a local path, for instance:

spark_read_text(con,
                name = "Month1_IntlData",
                path = "hdfs://SUMMARY_DETAIL_HIST 1.txt",
                overwrite = TRUE)
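By contrast, a connection in local mode can read the local path directly. A minimal sketch, assuming no cluster is involved (the file:// prefix just makes the local filesystem explicit):

# in master = "local" mode, spark_read_text() can read from the local filesystem
con_local <- spark_connect(master = "local")
spark_read_text(con_local,
                name = "Month1_IntlData",
                path = "file:///home/cdsw/SUMMARY_DETAIL_HIST 1.txt",
                overwrite = TRUE)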

If you are using HDFS, you can use the appropriate tools, such as running hadoop fs -ls from the terminal, to find the correct path to a file in HDFS. See the Hadoop FileSystem Shell documentation.
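For example, a sketch of copying the local file into HDFS from within R and then reading it back. The /user/cdsw destination is an assumption, so adjust it to your cluster's layout; the rename just avoids the space in the filename:

# copy the local file into HDFS; the destination path is an assumption
system("hadoop fs -mkdir -p /user/cdsw")
system("hadoop fs -put '/home/cdsw/SUMMARY_DETAIL_HIST 1.txt' /user/cdsw/SUMMARY_DETAIL_HIST_1.txt")

spark_read_text(con,
                name = "Month1_IntlData",
                path = "hdfs:///user/cdsw/SUMMARY_DETAIL_HIST_1.txt",
                overwrite = TRUE)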


My text file is stored in my project workspace, '/home/cdsw', not in an HDFS path.

I ran with driver-memory set to 8g and then 10g, but the R engine still exits after exhausting memory.

Let me know if there is any other way to do it.

conf <- spark_config()
conf["sparklyr.shell.driver-memory"] <- "10g"
con <- spark_connect(master = "yarn", config = conf)

df <- fread(files[[1]])


sdf_copy_to(con, df, name = "sdf")

|=================================================================| 100% 1999 MB

Engine exhausted available memory, consider a larger engine size.
