Both PyTorch and TensorFlow can utilize GPUs to train neural networks faster. To make them work, it’s important to setup the whole mechanics properly. I have done a few projects using classic machine learning methods and I also had watched some tutorials and books here and there, but I never had the chance to use deep learning on a practical project. So, I’m kinda new to deep learning. Recently, I was trying to train different deep learning models for an instance segmentation task, since I had zero knowledge about how to make things work, I ended up struggling with various challenges for hours. I decided to write about the steps, tricks, and solutions for the issues I encountered. I was trying to train models on local computer equiped with Nvidia GPUs (lambda workstation).
Cuda toolkit is another software layers on top of the Nvidia Driver. As it is mentioned in Nvidia website, different Cuda toolkit versions are mostly forward compatible with Cuda drivers. It means that if you already have nvidia-driver-515, which is a fairly new version, it is compatible with cuda-toolkit-11-2.
Installing Nvidia Driver
Nvidia driver is the underlying libraries necessary for making the operating system (In our case Ubuntu 20.04) work graphical processors. To install drivers you can simply run following command in the terminal:
sudo ubuntu-drivers autoinstall
or the following command in case you need a specific version:
sudo apt install nvidia-driver-470
it is also possible to install the drivers using Ubuntu Software Center
to verify successful installation you should run the command below:
If you look at TensorFlow installation instructions, the compatible cuda-toolkit version is mentioned in Conda install command.
To make TensorFlow work, we should install cuda-toolkit 11.2. basically, we assume that Conda will take care of cudatoolkit installation, but it DOESN’T. It is necessary to install cuda toolkit separately on the system. We can either install Cuda toolkit using runfile provided by Nvidia or install it using apt command.
wget https://developer.download.nvidia.com/compute/cuda/11.2.0/local_installers/cuda_11.2.0_460.27.04_linux.run sudo sh cuda_11.2.0_460.27.04_linux.run
sudo apt install nvidia-cuda-toolkit-11-2
Note: it is necessary to set cuda toolkit path to environment variables to make tensorflow and pytorch able to find libraries and tools. you can add the following commands to your ~/.bashrc to pick the cudatoolkit automatically in every new bash.
export CUDA_HOME=/usr/local/cuda export PATH=/usr/local/cuda/bin:$PATH export CPATH=/usr/local/cuda/include:$CPATH export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH export LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64:$LD_LIBRARY_PATH
To verify cuda toolkit installation you should run:
Setting up PyTorch(GPU):
As it is provided by the PyTorch website, it’s possible to setup different versions of PyTorch using Conda. It is supporting cuda toolkit 10.7, 11.3 and 11.6.
Note that any other versions of cuda toolkit will not work with PyTorch, so make sure the installed version of cuda toolkit is among the versions mentioned in PyTorch installation instructions!
Setting up TensorFlow(GPU):
After setting up cuda driver and toolkit, we are ready to install tensorflow using commands bellow:
conda install -c conda-forge cudatoolkit=11.2 cudnn=8.1.0 export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/ python3 -m pip install tensorflow # Verify install: python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Note that before using TensorFlow, each time a new terminal is opened, it is necessary to define cuda toolkit path in the environment with the command below. It is also possible to add the cuda toolkit path permanently using the commands below,
mkdir -p $CONDA_PREFIX/etc/conda/activate.d echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CONDA_PREFIX/lib/' > $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
This way Conda will automatically load cuda toolkit path each time activating an environment.
Setting up TensorFlow(GPU) with Docker:
One way to avoid problems with cuda-toolkit and also a way to be able to use different versions of TensorFlow is running Tensorflow docker containers which DIRECTLYinteract with Cuda driver. There are different containers of tensorflow each integrated with Jupiter lab. After running the container, TensorFlow will be accessible through Jupyter lab.
after installing Docker you should install nvidia-docker2 using the following command:
sudo apt-get install -y nvidia-docker2
pull the TensorFlow(bundled with Jupyterlab) container:
docker pull tensorflow/tensorflow:latest-gpu-jupyter
then run the container and access Jupyter lab thorugh your browser:
docker run --gpus all -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter
make sure adding –gpus all flag to the command to give docker access to GPUs. you can then access to the jupyter notebook through http://yourip:8888/ (use localhost if you are running it on a local computer). Also you can add –restart unless-stopped flag to make docker container run after each restart(unless stoped manually).
docker run --restart unless-stopped --gpus all -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-jupyter
to verify if GPU devices are available in tensorflow run the following code in a jupyter notebook cell
import tensorflow as tf print(tf.config.list_physical_devices('GPU'))