Distributed Deep Learning
Deep Learning (DL) is a powerful tool transforming scientific workflows. Because DL training is computationally intensive, it is well suited to HPC systems. Effective use of UVA HPC resources can greatly accelerate your DL workflows. In this tutorial, we will discuss:
- When it is appropriate to use a GPU
- How to optimize single-GPU code
- How to convert single-GPU code to multi-GPU code in different frameworks and run it on UVA HPC (a brief preview follows this list)
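As a preview of that last point: in TensorFlow/Keras, moving from one GPU to several GPUs on a single node typically means building and compiling the model inside a `tf.distribute.MirroredStrategy` scope. The sketch below is illustrative only, not the tutorial's provided code; the model architecture and input shape are placeholder assumptions.

```python
import tensorflow as tf

# MirroredStrategy replicates the model on every GPU visible to the job
# and averages gradients across replicas after each step.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# The only structural change from single-GPU code: build and compile
# the model inside the strategy scope. (Placeholder architecture.)
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

# model.fit(x_train, y_train, ...) stays the same; each global batch is
# split evenly across the GPUs.
```

Because `model.fit` is unchanged, the single-GPU and multi-GPU versions differ only in these few setup lines.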
The outline below lists the topics we will cover and indicates where code examples are provided:
- Overview of Deep Learning
    - Example: Convolutional Neural Network
- Introduction to GPUs and GPU Computing
- TensorFlow/Keras
    - Single-GPU Code Example
- PyTorch
    - Single-GPU Code Example
- Distributed Training
    - TF Multi-GPU Code Example
    - PT Multi-GPU Code Example
    - PT Lightning Multi-GPU Code Example
- Effective Use/Remarks
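The PyTorch path under Distributed Training follows a similar pattern: one process per GPU, with the model wrapped in `DistributedDataParallel` (DDP). The skeleton below is a minimal sketch, assuming a single-node job launched with `torchrun`; the model, script name, and hyperparameters are placeholders, not the tutorial's provided code.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=<gpus_per_node> train.py
# torchrun sets LOCAL_RANK (and the other rendezvous variables)
# for each process it spawns.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Placeholder model; wrapping it in DDP synchronizes gradients
# across processes during backward().
model = torch.nn.Linear(784, 10).to(local_rank)
model = DDP(model, device_ids=[local_rank])

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# The training loop itself is unchanged; the DataLoader should use a
# torch.utils.data.DistributedSampler so each process sees a distinct shard.

dist.destroy_process_group()
```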
Prior experience with the Python programming language and some familiarity with machine learning concepts are helpful for this tutorial.
If you have not already done so, please download the example code here:
distributed_dl.zip