I have been using RStudio Server on AWS successfully for several months, and the GPU was greatly accelerating the training of my deep networks (by almost two orders of magnitude over the equivalent CPU implementation). A few weeks ago, however, performance slowed considerably: whereas before I could train a 6-million-parameter network for 5000 epochs overnight, the same training now takes weeks. I initially suspected memory pressure as I moved to larger datasets, but even accounting for that, the system runs much slower than it used to.
Recently, I restarted both the R session and the Keras backend and tried reloading a saved model. Training still slogged along at mere CPU speeds, and I noticed the following messages immediately after calling the very first Keras function in my script:
2019-05-11 00:02:47.673851: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-11 00:02:47.753743: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_UNKNOWN
2019-05-11 00:02:47.753808: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:150] kernel driver does not appear to be running on this host (ip-10-217-37-254): /proc/driver/nvidia/version does not exist
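I understand the last line to mean that the NVIDIA kernel driver is not loaded on the instance. A quick way to confirm this from a shell is something like the sketch below (the `/proc` path comes straight from the error message; the helper function name is my own):

```shell
# check_nvidia_driver: report whether the NVIDIA kernel driver has
# registered itself under /proc (the path TensorFlow complains about).
# The path argument is optional so the check is easy to script.
check_nvidia_driver() {
    if [ -e "${1:-/proc/driver/nvidia/version}" ]; then
        echo "driver loaded"
        cat "${1:-/proc/driver/nvidia/version}"
    else
        echo "driver missing"
    fi
}

check_nvidia_driver
```

If the driver were healthy, `nvidia-smi` should also list the GPU; when the kernel module is not loaded, it fails instead.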
It looks like the EC2 instance can no longer access the GPU at all, even though it clearly could before. Other than downloading all my data, terminating this instance, spinning up a new RStudio Server instance, and re-uploading everything, is there any way I can get the GPU working again? Any help is appreciated.