GPU No Longer Working in RStudio Server with Tensorflow-GPU for AWS

GPUser · May 13, 2019, 5:48pm

Hello,

I have been successfully using the RStudio Server on AWS for several months, and the GPU was greatly accelerating the training time for my deep networks (by almost 2 orders of magnitude over the CPU implementation of the same). However, a few weeks ago, the performance slowed considerably. Whereas before, I could train a 6-million-parameter network for 5000 epochs overnight, now the same training would take weeks. I had thought it might be an issue with memory overload as I moved to larger datasets, but even accounting for that, the system runs much slower than it used to.

Recently, I restarted both the R session and the Keras backend and tried reloading a saved model. Unfortunately, training was still slogging along at mere CPU speeds, but I noticed that I got the following error message right after calling the very first Keras function in my script:

2019-05-11 00:02:47.673851: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-11 00:02:47.753743: E tensorflow/stream_executor/cuda/cuda_driver.cc:397] failed call to cuInit: CUDA_ERROR_UNKNOWN
2019-05-11 00:02:47.753808: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:150] kernel driver does not appear to be running on this host (ip-10-217-37-254): /proc/driver/nvidia/version does not exist

It looks like the EC2 instance is no longer able to access the GPU at all, even though it clearly could before. Other than downloading all my data, terminating this instance, spinning up a new instance of RStudio Server, and reuploading, is there any way I can get the GPU working for me again? Any help is appreciated.

dfalbel · May 13, 2019, 5:52pm

This is probably related to the GPU drivers and CUDA version. Can you share the results of running nvidia-smi in your terminal?

GPUser · May 13, 2019, 5:54pm

I get the following error:

Error: object 'nvidia' not found

dfalbel · May 13, 2019, 5:56pm

Sorry, I meant running nvidia-smi in your system's terminal, not the R console. You should see something like this:

[screenshot of nvidia-smi output]

GPUser · May 13, 2019, 5:58pm

Oops, sorry about that. Here you go:

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

dfalbel · May 13, 2019, 6:22pm

Yes, so here is the problem. You need to make sure the NVIDIA driver is correctly installed on the machine.

Follow the tutorials linked in the "Software Requirements" section of this page so you have all the software needed to run TensorFlow on the GPU.
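
Before reinstalling anything, it can help to confirm that the kernel driver really is missing. The following is only a minimal sketch: check_nvidia_driver is a made-up helper name, and the default path is the one named in the TensorFlow log earlier in this thread. The optional argument exists only so the check can be pointed at a different file.

```shell
# Sketch: report whether the NVIDIA kernel driver looks loaded.
# check_nvidia_driver is a hypothetical helper, not part of any NVIDIA tool.
check_nvidia_driver() {
    # Default to the file the TensorFlow log complained about.
    local version_file="${1:-/proc/driver/nvidia/version}"
    if [ -r "$version_file" ]; then
        echo "driver loaded"
        return 0
    fi
    echo "driver not loaded"
    return 1
}
```

For example, `check_nvidia_driver && nvidia-smi` only calls nvidia-smi when the driver file exists. If it prints "driver not loaded", `lsmod | grep nvidia` will most likely show nothing either.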

GPUser · May 13, 2019, 6:30pm

Thanks for the link. Unfortunately, I get stuck when I try running the scripts in the tutorial:

[screenshot]

GPUser · May 13, 2019, 6:36pm

That picture has horrible resolution. Here are the steps I run:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.0.130-1_amd64.deb

sudo dpkg -i cuda-repo-ubuntu1804_10.0.130-1_amd64.deb

The first two ran fine, but I have a sudo problem with the last one:

sudo: unable to resolve host ip-10-217-37-254
[sudo] password for rstudio-user:
rstudio-user is not in the sudoers file. This incident will be reported.
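
The last line of that output means the account has no sudo rights, so the package cannot be installed from this login. As a rough sketch of how to check this up front: has_sudo_group is a hypothetical helper, and it assumes a stock Ubuntu setup where membership in the "sudo" or "admin" group is what grants access. It does not see custom entries in /etc/sudoers.

```shell
# Sketch: does a space-separated group list (as printed by `id -nG`)
# contain a group that grants sudo on a default Ubuntu install?
# has_sudo_group is a hypothetical helper name.
has_sudo_group() {
    local g
    # Intentionally unquoted so the list is split into one word per group.
    for g in $1; do
        case "$g" in
            sudo|admin) return 0 ;;
        esac
    done
    return 1
}
```

Usage would be something like `has_sudo_group "$(id -nG)" || echo "no sudo rights"`. If the check fails, someone who already has root (for example the instance's default login user over SSH) would need to run the install, or add the account with `usermod -aG sudo rstudio-user`.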

GPUser · May 13, 2019, 9:06pm

I had to run "install_keras()" in RStudio to get Keras to work again. Unfortunately, it still runs very slowly.

Now, when I first define a Keras layer, I get the following:

WARNING:tensorflow:From /home/rstudio-user/.virtualenvs/r-tensorflow/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2019-05-13 20:57:14.132375: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-13 20:57:14.136902: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300065000 Hz
2019-05-13 20:57:14.137122: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x1f477a80 executing computations on platform Host. Devices:
2019-05-13 20:57:14.137150: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>

I'm not sure whether it's even trying to use the GPU now.
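
One way to answer that from the log itself is to look for the lines that mention a GPU. This is a sketch under assumptions, not a definitive test: tf_log_gpu_status is a made-up name, the two error strings are copied from the earlier log in this thread, and the "platform CUDA" and "device:GPU:" patterns are what I would expect a GPU-enabled TensorFlow 1.x build to print when it finds a device.

```shell
# Sketch: classify TensorFlow 1.x startup log text read from stdin.
# tf_log_gpu_status is a hypothetical helper name.
tf_log_gpu_status() {
    local log
    log="$(cat)"
    if printf '%s\n' "$log" | grep -q -e 'failed call to cuInit' \
            -e 'kernel driver does not appear to be running'; then
        # Same errors as in the first post: the driver is not usable.
        echo "driver-problem"
    elif printf '%s\n' "$log" | grep -q -e 'platform CUDA' -e 'device:GPU:'; then
        # Assumed markers of a GPU being picked up.
        echo "gpu-visible"
    else
        # Only host (CPU) lines, like the XLA "platform Host" line above.
        echo "cpu-only"
    fi
}
```

Saving the console output to a file and running `tf_log_gpu_status < tf.log` on the log above would print "cpu-only", since the only platform it mentions is Host.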

GPUser · May 13, 2019, 9:31pm

I have a comment awaiting approval. In the meantime, I was able to get the NVIDIA/CUDA drivers installed.

Now when I type "nvidia-smi" into my RStudio terminal, I see this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   49C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

GPUser · May 13, 2019, 9:43pm

Okay, after reinstalling the NVIDIA/CUDA drivers, it looks like all I had to do was run

install_keras(tensorflow = "gpu")

and now training runs fast again.

Thank you for your help. I hope this helps anyone else who encounters this issue in the future.
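
For anyone checking their own setup: as far as I can tell, plain install_keras() installs the CPU-only tensorflow package, while tensorflow = "gpu" installs tensorflow-gpu, which would explain why the earlier reinstall stayed slow. Here is a small sketch for seeing which one is in the virtualenv. tf_flavor is a hypothetical helper, and it only parses package-list text, so it says nothing about whether the driver works.

```shell
# Sketch: read `pip freeze` or `pip list` output on stdin and print
# "gpu", "cpu", or "none". tf_flavor is a hypothetical helper name.
tf_flavor() {
    local pkgs
    pkgs="$(cat)"
    # Match the package name exactly, so tensorflow-estimator and
    # similar companion packages are not counted.
    if printf '%s\n' "$pkgs" | grep -Eqi '^tensorflow-gpu([ =]|$)'; then
        echo "gpu"
    elif printf '%s\n' "$pkgs" | grep -Eqi '^tensorflow([ =]|$)'; then
        echo "cpu"
    else
        echo "none"
    fi
}
```

With the virtualenv path from the warning above, usage would look like `~/.virtualenvs/r-tensorflow/bin/pip freeze | tf_flavor`. The bin/pip location is an assumption about how that virtualenv is laid out.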

GPUser · May 13, 2019, 10:55pm

So I realized that I could run the tutorials from the cmd.exe tunnel (my company has a lot of firewalls, so I have to run a tunnel in cmd.exe to access port 8787 through my browser). I ran the following commands and rebooted the EC2 instance:

# Add NVIDIA package repositories
# Add HTTPS support for apt-key
sudo apt-get install gnupg-curl
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_10.0.130-1_amd64.deb
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/7fa2af80.pub
sudo apt-get update
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1604/x86_64/nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt install ./nvidia-machine-learning-repo-ubuntu1604_1.0.0-1_amd64.deb
sudo apt-get update

# Install NVIDIA driver
# Issue with driver install requires creating /usr/lib/nvidia
sudo mkdir /usr/lib/nvidia
sudo apt-get install --no-install-recommends nvidia-410
# Reboot. Check that GPUs are visible using the command: nvidia-smi

# Install development and runtime libraries (~4GB)
sudo apt-get install --no-install-recommends \
    cuda-10-0 \
    libcudnn7=7.4.1.5-1+cuda10.0  \
    libcudnn7-dev=7.4.1.5-1+cuda10.0


# Install TensorRT. Requires that libcudnn7 is installed above.
sudo apt-get update && \
        sudo apt-get install nvinfer-runtime-trt-repo-ubuntu1604-5.0.2-ga-cuda10.0 \
        && sudo apt-get update \
        && sudo apt-get install -y --no-install-recommends libnvinfer-dev=5.0.2-1+cuda10.0

Now when I type "nvidia-smi" into my RStudio terminal, I see this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   49C    P8    28W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
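
One note on reading that output: the "CUDA Version: 10.1" in the header is the highest CUDA version the driver supports, not the toolkit that was installed, so it does not conflict with the cuda-10-0 packages above. What matters is that the driver is new enough for the toolkit. A minimal sketch of that comparison follows. version_ge is a made-up helper that relies on GNU `sort -V`, and 410.48 is the minimum Linux driver that I understand CUDA 10.0 to require.

```shell
# Sketch: succeed when version $1 is greater than or equal to version $2.
# version_ge is a hypothetical helper; it relies on GNU `sort -V`.
version_ge() {
    # The smaller of the two versions sorts first; if that is $2
    # (or both are equal), then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_ge 418.67 410.48 && echo "driver is new enough"`. To read the driver version directly instead of typing it, `nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1` prints just that field.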
