Both PyTorch and TensorFlow can utilize GPUs to train neural networks faster. To make them work, it’s important to properly set up the whole mechanics. I have done a few projects using classic machine learning methods and I also watched some tutorials and books here and there, but I never had the chance to use deep learning on a practical project. So, I’m kinda new to deep learning. Recently, I was trying to train different deep learning models for an instance segmentation task, since I had zero knowledge about how to make things work, I ended up struggling with various challenges for hours. I decided to write about the steps, tricks, and solutions for the issues I encountered. I was trying to train models on a local computer equipped with Nvidia GPUs (lambda workstation).
The Cuda toolkit is another software layer on top of the Nvidia Driver. As is mentioned on the Nvidia website, different Cuda toolkit versions are mostly forward-compatible with Cuda drivers. It means that if you already have nvidia-driver-515, which is a fairly new version, it is compatible with cuda-toolkit-11-2.
Installing Nvidia Driver
Nvidia driver is the underlying library necessary for making the operating system (In our case Ubuntu 20.04) work with the graphical processors. To install drivers you can simply run the following command in the terminal:
sudo ubuntu-drivers autoinstall
or the following command in case you need a specific version:
sudo apt install nvidia-driver-470
it is also possible to install the drivers using Ubuntu Software Center
to verify successful installation you should run the command below:
If you look at the TensorFlow installation instructions, the compatible cuda-toolkit version is mentioned in Conda install command.
To make TensorFlow work, we should install the cuda-toolkit 11.2. basically, we assume that Conda will take care of the Cuda toolkit installation, but it DOESN’T. It is necessary to install the Cuda toolkit separately on the system. We can either install the Cuda toolkit using the run file provided by Nvidia or install it using an apt command.
sudo sh cuda_11.2.0_460.27.04_linux.run
sudo apt install nvidia-cuda-toolkit-11-2
Note: it is necessary to set the Cuda toolkit path to environment variables to make TensorFlow and PyTorch able to find libraries and tools. you can add the following commands to your ~/.bashrc to pick the Cuda toolkit automatically in every new bash.
To verify the cuda toolkit installation you should run:
Setting up PyTorch(GPU):
As it is provided by the PyTorch website, it’s possible to set up different versions of PyTorch using Conda. It is supporting the Cuda toolkit 10.7, 11.3, and 11.6.
Note that any other versions of the Cuda toolkit will not work with PyTorch, so make sure the installed version of the cuda toolkit is among the versions mentioned in PyTorch installation instructions!
Setting up TensorFlow(GPU):
After setting up the Cuda driver and toolkit, we are ready to install TensorFlow using the commands below:
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0
python3 -m pip install tensorflow
# Verify install:
python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Note that before using TensorFlow, each time a new terminal is opened, it is necessary to define the Cuda toolkit path in the environment with the command below. It is also possible to add the Cuda toolkit path permanently using the commands below,
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
This way Conda will automatically load the cuda toolkit path each time activating an environment.
Setting up TensorFlow(GPU) with Docker:
One way to avoid problems with the cuda-toolkit and also a way to be able to use different versions of TensorFlow is running Tensorflow docker containers which DIRECTLYinteract with the Cuda driver. There are different containers of TensorFlow each integrated with Jupiter lab. After running the container, TensorFlow will be accessible through Jupyter lab.
after installing Docker you should install nvidia-docker2 using the following command:
sudo apt-get install -y nvidia-docker2
pull the TensorFlow(bundled with Jupyterlab) container:
docker pull tensorflow/tensorflow:latest-gpu-jupyter
then run the container and access Jupyter lab through your browser:
docker run --gpus all -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter
make sure to add the –gpus all flag to the command to give docker access to GPUs. you can then access the jupyter notebook through http://yourip:8888/ (use localhost if you are running it on a local computer). Also, you can add –restart unless-stopped flag to make the docker container run after each restart(unless stopped manually).
docker run --restart unless-stopped --gpus all -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter
To verify if GPU devices are available in TensorFlow run the following code in a jupyter notebook cell
import tensorflow as tf