I had understood that the data was for a single station, and I was showing that if that takes 0.25 sec. to do 12 months then 500 ... .
It just took me 2 sec to download the 2.6 MB example file. So, that should be 20 minutes, tops, for 500 files over any reasonably fast connection to any reasonably provisioned server.
I wrote the file back locally in 0.3 sec, implying 2.5 minutes for 500 write operations with an aggregate file size of 1.3 GB.
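The scaling above is plain multiplication, and it can be reproduced in R. This is only a sketch: the variable names are mine, and it assumes the single-file timings scale linearly to 500 files, with no parallelism or caching.

```r
# Single-file measurements quoted above, plus the file count from the thread.
n_files      <- 500
download_sec <- 2     # seconds to download one 2.6 MB file
write_sec    <- 0.3   # seconds to write one file back locally
file_mb      <- 2.6   # size of one file in MB

# Linear extrapolation (an assumption): per-file cost times number of files.
estimate <- c(
  download_min = n_files * download_sec / 60,
  write_min    = n_files * write_sec / 60,
  total_gb     = n_files * file_mb / 1000
)
print(round(estimate, 1))
```

The download estimate comes out a little under the 20-minute ceiling, and the local write at 2.5 minutes for roughly 1.3 GB in total.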
So, altogether, 20 minutes to read, 2 minutes to process, another 2.5 minutes to write locally and, what?, 5 minutes to send and write remotely, for a total of well under an hour.
That makes it seem like the time to read all 900 CSV files from folder-A, process them, and write them to folder-B is excessive by at least a factor of two, and I suspect more. If that's the time for just one file, the problem is either connectivity, server load, or the read/write/process logic in your R program.
Conventional tools like ping or traceroute can rule out connectivity issues. Benchmarking the run time of the same code against a test suite, on both the local and target servers, can rule out server load. So, let's start with the R code, to make sure that it's not the issue. There's no point in wasting time on 500 different workstations if that's not necessary.
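One way to check the R side is to time the read, process, and write phases separately on a single file, so it's clear which one dominates. The following is a minimal sketch, not your actual code: process_station() is a hypothetical stand-in for whatever your script does, and the toy CSV is generated on the spot so the example runs anywhere.

```r
# Hypothetical stand-in for the real per-file processing step.
process_station <- function(df) {
  aggregate(value ~ month, data = df, FUN = mean)
}

# Build a throwaway CSV so the sketch is self-contained.
in_file  <- tempfile(fileext = ".csv")
out_file <- tempfile(fileext = ".csv")
set.seed(1)
toy <- data.frame(month = rep(1:12, each = 1000), value = rnorm(12000))
write.csv(toy, in_file, row.names = FALSE)

# Time each phase on its own; "elapsed" is wall-clock seconds.
t_read    <- system.time(dat <- read.csv(in_file))[["elapsed"]]
t_process <- system.time(res <- process_station(dat))[["elapsed"]]
t_write   <- system.time(write.csv(res, out_file, row.names = FALSE))[["elapsed"]]

timings <- c(read = t_read, process = t_process, write = t_write)
print(timings)
```

Running the same script on the local station and on the target server, with one of the real CSV files in place of the toy data, gives a like-for-like comparison of the two machines.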
So, what workflow are you using now? For example, is the source directory for the CSV files under git or other version control? If so, the workflow could be something like:
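$ git clone your_source        # on the local station
$ git clone your_source        # on the remote server
> source("your_script")        # in R, locally: process each csv in the local repo and write back
$ git add -A                   # locally: stage the rewritten files
$ git commit -m "process csv"  # locally: record the changes
$ git push origin main         # locally: send the differences
$ git pull origin main         # on the remote server: fetch them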
This has the advantage of only transmitting the changed data.
How are you handling it now?