use sparklyr for HDFS file count

I am pulling files from an HDFS directory and writing them to a Hive table using sparklyr.

This will be done each month, but the number of CSV files involved will vary from month to month.
The files are numbered like below....

January
part0000.csv
part0001.csv
part0002.csv
....
part0453.csv

June
part0000.csv
.....
part0268.csv

I was thinking of doing an initial call to check the total number of files in the directory, then looping through with a counter to grab them.

january_count = 454
june_count = 269

However, I am running into problems figuring out how to determine the total number of files in the directory.
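
One way this might be done, as a rough sketch (the connection settings and the /user/me/data path are placeholders, and the regex assumes the part-file naming shown above):

    library(sparklyr)

    sc <- spark_connect(master = "yarn")  # placeholder connection details

    # Option A: shell out to the HDFS CLI (assumes the `hdfs` binary is on
    # the PATH of the machine running R). `-ls` prints one line per file.
    listing <- system2("hdfs", c("dfs", "-ls", "/user/me/data"), stdout = TRUE)
    file_count <- sum(grepl("part\\d+\\.csv$", listing))

    # Option B: call the Hadoop FileSystem API through the Spark connection,
    # so no local HDFS client is needed on the machine running R.
    hconf    <- invoke(spark_context(sc), "hadoopConfiguration")
    fs       <- invoke_static(sc, "org.apache.hadoop.fs.FileSystem", "get", hconf)
    dir_path <- invoke_new(sc, "org.apache.hadoop.fs.Path", "/user/me/data")
    statuses <- invoke(fs, "listStatus", dir_path)
    file_count <- length(statuses)  # one entry per item in the directory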

Recommendations?

Hi, are there data transformations in R or Spark that need to happen before the data is shipped to Hive? If not, then you should not need Spark to add the data to Hive. As long as all of the files have the same layout, once you point the data store to the HDFS folder where the files live, Hive should pick them up automatically. That is more of a Hive thing than an R/Spark/sparklyr thing.
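
For example, something along these lines issued through the sparklyr connection with DBI (the table name, columns, and location are all made up; the DDL has to match your actual file layout, and Spark needs to be running with Hive support):

    library(DBI)

    # `sc` is an open sparklyr connection. The external table just points at
    # the folder, so Hive sees every part file that lands there each month.
    dbGetQuery(sc, "
      CREATE EXTERNAL TABLE IF NOT EXISTS monthly_data (
        id INT, name STRING, amount DOUBLE
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION '/user/me/data'
    ")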

When you "upload" data to Hive from an external source such as R or Spark, under the hood the data is written to parquet files, which are then presented in the data store as a "table". Hive tables are not "physical" tables; they are all mapped logically to files in Hadoop.
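
If you do end up going through sparklyr, a rough sketch of that flow might look like this (the path, table name, and connection are placeholders); note that spark_read_csv() accepts a directory or wildcard path, so the varying monthly file count never matters:

    library(sparklyr)

    sc <- spark_connect(master = "yarn")  # placeholder connection details

    # Spark reads every matching file in one call, however many there are,
    # so no counting loop is needed.
    monthly <- spark_read_csv(
      sc,
      name   = "monthly_csv",
      path   = "hdfs:///user/me/data/part*.csv",
      header = FALSE  # set TRUE if the files carry a header row
    )

    # Persisted as parquet files under the hood, surfaced as a Hive table.
    spark_write_table(monthly, "my_hive_table", mode = "overwrite")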

Hope this helps
