Multi-GPU LLM Inference
RC Workshop
Kathryn Linehan, Bruce Rushing
June 3rd, 2025

Workshop Overview
- **The first draft outline of this workshop was created by ChatGPT!**
- Introduction
- UVA HPC
- Multi-GPU Strategies
- Accelerate
- DeepSpeed
- vLLM
- Best Practices
- Wrap Up

INTRODUCTION

Terminology
- **VRAM:** Video RAM (GPU memory)
- **Context window:** the maximum number of tokens an LLM can keep "in memory"
- **Precision:** the amount of storage used per LLM parameter
  - Examples: FP32 is 4 bytes/parameter, FP16 is 2 bytes/parameter
- **Quantization:** technique to lower the amount of storage used by LLM parameters
- **Embedding dimension:** length of the embedding vector
- **QKV cache** (Query-Key-Value cache): the short-term memory of the LLM

Why Multi-GPU LLM inference?
- LLMs are getting larger and larger, as are their context windows
- More recent LLMs generally do not fit on a single GPU due to VRAM limits
  - This is the case we will focus on today

GPU Memory Requirements for Google’s Gemma 3
| Model | 4K Tokens | 8K Tokens | 32K Tokens | 128K Tokens |
|---|---|---|---|---|
| 4B @ FP16 | 11 GB | 12.4 GB | 20.8 GB | 54.4 GB |
| 12B @ FP16 | 31.9 GB | 35 GB | 53.6 GB | 128 GB |
| 27B @ FP16 | 70.3 GB | 75.8 GB | 108.8 GB | 241 GB |
| 4B @ INT4 | 2.8 GB | 3.2 GB | 5.3 GB | 14 GB |
| 12B @ INT4 | 8 GB | 8.8 GB | 13.6 GB | 32.8 GB |
| 27B @ INT4 | 17.6 GB | 19 GB | 27.4 GB | 61 GB |
Models are given by parameter count (billions) at a specific precision (16-bit floating point or 4-bit integer quantization); columns give the context window size.

UVA HPC

UVA HPC – H200 GPUs


More info: https://www.rc.virginia.edu/2025/05/new-nvidia-h200-gpu-node-added-to-afton/
UVA HPC - GPUs
| GPU | HPC System | Partition | GPU Memory |
|---|---|---|---|
| H200 | Afton and Rio | gpu* | 141GB |
| A100 | Afton | gpu | 40GB or 80GB |
| A40 | Afton | gpu | 48GB |
| A6000 | Afton | gpu | 48GB |
| V100 | Afton | gpu | 32GB |
| RTX3090 | Afton | Interactive | 24GB |
| RTX2080Ti | Afton | Interactive | 11GB |
*Only available through batch job requests (not Open OnDemand)

More info: https://www.rc.virginia.edu/userinfo/computing-environments/,
https://www.rc.virginia.edu/userinfo/hpc/#system-details
GPU access on UVA HPC
**When you request memory for HPC, that is CPU memory.**
- Open OnDemand (OOD)
  - Choose "GPU" or "Interactive" as the Rivanna/Afton Partition
  - Optional: choose GPU type and number of GPUs
  - **Cannot use H200 GPUs**
- SLURM
  - Specify the GPU partition: #SBATCH -p gpu
  - Request n GPUs: #SBATCH --gres=gpu:<optional_gpu_name>:n
    - Ex) Request 1 GPU: #SBATCH --gres=gpu:1
    - Ex) Request 4 A100s: #SBATCH --gres=gpu:a100:4
  - To request an 80GB A100, additionally use:
    - #SBATCH --constraint=a100_80gb

More info: https://www.rc.virginia.edu/userinfo/hpc/slurm/#gpu-intensive-computation
As of now, only one person can use a GPU at a time; if you request a GPU, you will receive all of its memory.
UVA HPC – NVIDIA DGX BasePOD on Rivanna/Afton
Cluster of high-performance GPUs that can be used for large deep-learning models
**18 DGX A100 nodes** with:
- 2 TB of RAM per node
- 8 A100s per node
- 80 GB of GPU memory per GPU device
**Advanced features** (compared to regular GPU nodes):
- NVLink for fast multi-GPU communication
- GPUDirect RDMA Peer Memory for fast multi-node multi-GPU communication
- GPUDirect Storage with a 200 TB IBM ESS3200 (NVMe) SpectrumScale storage array
Ideal scenarios:
- Job needs multiple GPUs on a single node or even multiple nodes
- Job (single- or multi-GPU) is I/O intensive
If you have ever used an A100 with 80 GB on our system, you were using a POD node!

More info: https://www.rc.virginia.edu/userinfo/hpc/basepod/
**When you request memory for HPC, that is CPU memory.**
POD access on UVA HPC
- Open OnDemand (OOD)
  - Choose "GPU" as the Rivanna/Afton Partition
  - Choose "NVIDIA A100" as the GPU type and fill in the number of GPUs
  - Select "Yes" for Show Additional Options and type "--constraint=gpupod" in the Slurm Option textbox
- SLURM
  - #SBATCH -p gpu
  - #SBATCH --gres=gpu:a100:n   (n = requested number of GPUs per node)
  - #SBATCH -C gpupod

More info: https://www.rc.virginia.edu/userinfo/hpc/basepod/#accessing-the-pod
As of now, only one person can use a GPU at a time; if you request a GPU, you will receive all of its memory.
POD on UVA HPC
Before running a multi-node job, please make sure the job can scale well to 8 GPUs on a single node.
Multi-node jobs on the POD should request all GPUs on the nodes, i.e. --gres=gpu:a100:8.

More info: https://www.rc.virginia.edu/userinfo/hpc/basepod/#accessing-the-pod
GPU Limit on UVA HPC
The maximum number of GPUs you can request for a UVA HPC job is 32.
The maximum number of nodes is 4.

More info: https://www.rc.virginia.edu/userinfo/hpc/#job-queues
Wait Time in the Queue
- You may not need to request A100(s)!
- Requesting A100(s) may mean you wait in the queue for a much longer time than you would for another GPU, which could give you a slower overall time (wait time + execution time) than if you had used another GPU.


Graphic source: https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis
Viewing Available GPUs
The gpu partition can be very busy!
To view information about the gpu partition, including total (T) and allocated (A) GPUs, type qlist -p gpu at the command line


GPU Dashboard in OOD JupyterLab

- This will be demoed during today’s workshop
- Includes GPU and CPU memory and utilization tracking in real time
- Helpful for GPU selection

GPU Dashboard: Memory Usage
- PyTorch
  - Correct GPU memory usage will be reported
- TensorFlow/Keras
  - By default, TF automatically allocates ALL of the GPU memory, so the GPU Dashboard may show that all (or almost all) of the GPU memory is being used
  - To track the amount of GPU memory actually used, you can add these lines to your Python script (before TensorFlow is imported):

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

More Info: https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth
Homework for Keras users: try out the GPU Dashboard and see if it reports all of the GPU memory as used
Calculating GPU Memory Requirements

GPU Memory Requirements for Google’s Gemma 3
| Model | 4K Tokens | 8K Tokens | 32K Tokens | 128K Tokens |
|---|---|---|---|---|
| 4B @ FP16 | 9.6 + 1.4 = 11 GB | 9.6 + 2.8 = 12.4 GB | 9.6 + 11.2 = 20.8 GB | 9.6 + 44.8 = 54.4 GB |
| 4B @ INT4 | 2.4 + 0.4 = 2.8 GB | 2.4 + 0.8 = 3.2 GB | 2.4 + 2.9 = 5.3 GB | 2.4 + 11.6 = 14 GB |
| 12B @ FP16 | 28.8 + 3.1 = 31.9 GB | 28.8 + 6.2 = 35 GB | 28.8 + 24.8 = 53.6 GB | 28.8 + 99.2 = 128 GB |
| 12B @ INT4 | 7.2 + 0.8 = 8 GB | 7.2 + 1.6 = 8.8 GB | 7.2 + 6.4 = 13.6 GB | 7.2 + 25.6 = 32.8 GB |
| 27B @ FP16 | 64.8 + 5.5 = 70.3 GB | 64.8 + 11 = 75.8 GB | 64.8 + 44 = 108.8 GB | 64.8 + 176 = 241 GB |
| 27B @ INT4 | 16.2 + 1.4 = 17.6 GB | 16.2 + 2.8 = 19 GB | 16.2 + 11.2 = 27.4 GB | 16.2 + 44.8 = 61 GB |
Models are given by parameter count (billions) at a specific precision (16-bit floating point or 4-bit integer quantization); columns give the context window size.
4B: 34 Layers, 2560 Embedding Dim
12B: 48 Layers, 3840 Embedding Dim
27B: 62 Layers, 5376 Embedding Dim

Each cell is a breakdown of M + N from the previous slide, where M is the memory needed for the model weights and N is the memory needed for the KV cache.
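As a companion to the table, here is a short Python sketch of the rule of thumb used in the references at the end of this deck (an approximation, not an exact accounting): model weights M ≈ parameters × bytes/parameter × 1.2 overhead, and KV cache N ≈ 2 × layers × embedding dim × context length × bytes/element. It approximately reproduces the table values; small differences come from rounding.

```python
# Sketch: estimate GPU memory (in GB) for LLM inference as M (weights) + N (KV cache).
# Rule of thumb only -- real usage also depends on batch size, framework overhead, etc.

def weights_gb(n_params_billion, bits):
    """M: memory for the model weights, with ~20% overhead."""
    return n_params_billion * 1e9 * (bits / 8) * 1.2 / 1e9

def kv_cache_gb(n_layers, embed_dim, context_len, bits, batch_size=1):
    """N: memory for the KV cache -- two tensors (K and V) per layer."""
    return 2 * batch_size * n_layers * embed_dim * context_len * (bits / 8) / 1e9

# Example: Gemma 3 27B at FP16 with a 32K-token context (62 layers, 5376 embedding dim)
m = weights_gb(27, bits=16)                 # 64.8 GB
n = kv_cache_gb(62, 5376, 32_768, bits=16)  # ~43.7 GB (the table rounds this to 44 GB)
print(f"{m:.1f} + {n:.1f} = {m + n:.1f} GB")
```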
CPU Resource Allocation for LLM Inference
- CPU memory
  - Have enough RAM to fit the LLM
  - Accelerate will offload computations to the CPU after the GPUs are full
- CPU cores
  - Use enough cores for any data preprocessing
  - Use a GPU!

Source: https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/
Check your resource usage using GPU Dashboard, seff (completed jobs), or sstat (running jobs)
It may be the case that even if CPU Efficiency is a low percentage, you need all of the requested CPU cores for a specific part of the code, e.g., data preprocessing. In this case, request the number of CPU cores that you need for the compute intensive part of the code.
MULTI-GPU STRATEGIES

Overview of Multi-GPU Parallelism Strategies
- **Data Parallelism**: data is split across GPUs
- **Model Parallelism**: the model is split across GPUs
  - Pipeline Parallelism
  - Tensor Parallelism
| Parallelism Type | What is Split? | When Used? |
|---|---|---|
| Data Parallelism | Input data | Large numbers of prompts/inputs |
| Pipeline Parallelism | Model layers | LLM exceeds single GPU memory: LLM is deep but not too wide |
| Tensor Parallelism | Inside model layers (tensors) | LLM exceeds single GPU memory: LLM layers are too large for one device |

Data Parallelism
- Each GPU contains the full model
- Data is split across GPUs and inference is performed on the GPUs in parallel
- Provides an inference speed-up
Source: https://colossalai.org/docs/concepts/paradigms_of_parallelism/#data-parallel
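A minimal sketch of this pattern with Accelerate's split_between_processes utility (the model id is a placeholder; launch with accelerate launch --num_processes N so that each GPU gets one process):

```python
# Data-parallel inference sketch: every process (one per GPU) holds a full copy of the
# model and works on its own shard of the prompts.
# Launch with, e.g.:  accelerate launch --num_processes 2 data_parallel_inference.py
import torch
from accelerate import PartialState
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-favorite/causal-lm"   # placeholder model id

state = PartialState()                 # one process per GPU
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.to(state.device)

prompts = ["Prompt 1", "Prompt 2", "Prompt 3", "Prompt 4"]

# Each process receives a different subset of the prompts.
with state.split_between_processes(prompts) as my_prompts:
    for prompt in my_prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(state.device)
        outputs = model.generate(**inputs, max_new_tokens=64)
        print(f"[rank {state.process_index}] {tokenizer.decode(outputs[0], skip_special_tokens=True)}")
```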
Pipeline Parallelism (Inter-Layer)

- Each GPU contains a different model stage (1+ layers).
- GPUs compute on batches of data in parallel but must wait for the previous stage to complete.
- Pros:
- Reduces per-GPU memory use
- Improves inference throughput
- Con:
- Adds inference latency

Source: https://colossalai.org/docs/concepts/paradigms_of_parallelism/#pipeline-parallel, https://arxiv.org/abs/1811.06965
Tensor Parallelism (Intra-Layer)
Matrix Multiplication Example

- Splits tensor computations across GPUs
- Each GPU contains part of the tensor
- GPUs compute on their part in parallel
- Good for transformer layers
- Frequent communication between GPUs
- Requires a fast interconnect (e.g., NVLink, which UVA HPC has in the BasePOD)
Source: https://colossalai.org/docs/concepts/paradigms_of_parallelism/#tensor-parallel
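As a toy illustration of the matrix multiplication example above, the sketch below splits a weight matrix column-wise, multiplies each shard separately, and concatenates the results. It runs on the CPU purely to show the math; real tensor parallelism places each shard on its own GPU and combines results over the interconnect.

```python
# Conceptual sketch of tensor (intra-layer) parallelism for Y = X @ W:
# split W's columns across workers, multiply independently, then combine.
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)        # activations (batch x hidden)
W = torch.randn(8, 6)        # weight matrix of one layer

# Split W's columns across two workers.
W0, W1 = W[:, :3], W[:, 3:]

# Each worker computes a partial result in parallel.
Y0 = X @ W0                  # would live on GPU 0
Y1 = X @ W1                  # would live on GPU 1

# Combine the shards (an all-gather in a real multi-GPU setting).
Y = torch.cat([Y0, Y1], dim=1)

print(torch.allclose(Y, X @ W))   # True: the sharded result matches the full multiply
```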
Key Tools and Frameworks
**Hugging Face Accelerate**: easy interface with transformers and PyTorch.
**DeepSpeed**: good for large-scale (1000+ GPUs), well-optimized training and inference.
**vLLM**: focused on high-speed, efficient LLM inference.
**Megatron**: low-level, open-source API for building custom frameworks.

Key Tools and Frameworks cont.
- We will focus on Accelerate in this workshop.
- RC's resources are best used with Accelerate, unless you need the inference- or training-specific memory management or speed-ups that DeepSpeed or vLLM provide.
- Accelerate will set up model parallelism for you
- Code examples will be provided
- We will present some basic information about DeepSpeed and vLLM.

More information: https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference
ACCELERATE

What is Accelerate?
Popular, easy-to-use library built to work with PyTorch models.
Very straightforward API that wraps with PyTorch and transformers code.
Can automatically allocate GPU memory usage across GPUs and nodes.
Used for both training and inference.
Integrates with PyTorch and transformers, and can interface with DeepSpeed and Megatron-LM.

Why use Accelerate?
If you need to prototype or are expecting to run models on “fewer” GPUs, Accelerate is the preferred framework.
It is mostly plug-and-play with existing code, with options already built into transformers and classes that wrap easily around torch.
You can use it both for inference and for training.
There is some fine-grained control of GPU and CPU usage, allowing a degree of customization for improved performance depending on your use case.
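As one illustration of that fine-grained control (a sketch, not the only way): Accelerate's init_empty_weights and infer_auto_device_map let you plan a placement under explicit per-device memory budgets. The model id and memory limits below are placeholders.

```python
# Sketch: plan a device map under explicit memory budgets before loading any weights.
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "your-favorite/causal-lm"   # placeholder model id

# Build an "empty" (weight-free) copy of the model just to plan the placement.
config = AutoConfig.from_pretrained(model_id)
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# Cap how much each device may hold; leftover layers spill to CPU RAM.
device_map = infer_auto_device_map(
    empty_model,
    max_memory={0: "35GiB", 1: "35GiB", "cpu": "100GiB"},  # illustrative budgets
)

# Reuse the plan when actually loading the weights.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map=device_map)
```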

Accelerate Code Demo
The notebook is available on Rivanna/Afton in the folder /project/hpc_training/multigpu_llm_inf
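The notebook is the full example; as a quick preview (not the notebook itself), a minimal big-model-inference pattern with transformers and Accelerate might look like the sketch below, where the model id is a placeholder.

```python
# Big-model-inference sketch: device_map="auto" lets Accelerate shard the layers across
# all visible GPUs (spilling to CPU RAM if needed) at load time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-favorite/causal-lm"   # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",          # Accelerate decides which layers go on which GPU
    torch_dtype=torch.float16,  # FP16 halves the memory footprint vs. FP32
)

inputs = tokenizer("Explain tensor parallelism in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```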

DeepSpeed

What is DeepSpeed?
Microsoft's open-source library focused on large-scale training and inference through efficient memory management for PyTorch.
Key components for inference are custom CUDA kernels for common LLM operations (e.g., attention and MLP blocks) and tensor parallelism for efficient memory usage and low latency.
Also has several architecture-specific and quantization-specific optimizations.
Upshot: it lets you load larger models that would not ordinarily fit within GPU memory, though your mileage may vary.
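A hedged sketch of what using DeepSpeed's inference engine with a Hugging Face model can look like; argument names have shifted across DeepSpeed releases (older docs use mp_size, newer ones tensor_parallel), and the model id is a placeholder.

```python
# DeepSpeed inference sketch: shard a Hugging Face model across GPUs and swap in
# DeepSpeed's fused kernels. Launch with, e.g.:  deepspeed --num_gpus 2 ds_inference.py
import deepspeed
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-favorite/causal-lm"   # placeholder model id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

ds_engine = deepspeed.init_inference(
    model,
    mp_size=2,                        # tensor-parallel degree (newer releases use tensor_parallel)
    dtype=torch.float16,
    replace_with_kernel_inject=True,  # use DeepSpeed's optimized attention/MLP kernels
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to("cuda")
outputs = ds_engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```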

Transformer-specific Latency Advantages

Latency reductions per transformer module

Why use DeepSpeed?
You should use DeepSpeed if you need to perform large-scale inference (hundreds of GPUs or more), need low-latency serving, or every MB of GPU memory counts.
Remark: the maximum number of GPUs you can request for a UVA HPC job is 32 (the maximum number of nodes is 4).
The inference latency reductions can matter if you only have a limited amount of time for experiments.
DeepSpeed also offers very efficient large-scale training parallelism, which we won’t discuss here.

Source: https://www.rc.virginia.edu/userinfo/hpc/#job-queues
For most users, Accelerate is probably better due to ease of use. DeepSpeed does improve inference but is mainly used for training. If you are serving an LLM, vLLM is a good choice.
vLLM

What is vLLM (Virtual Large Language Model)?
Open source from UC Berkeley.
Key innovation for inference is PagedAttention: an efficient virtual-memory/paging method for storing the KV cache, which for longer contexts can end up consuming more memory than the model weights themselves.
Also allows continuous batching: a method for handling multiple requests to LLMs that reduces idle time.
Like DeepSpeed, also supports optimized CUDA kernels for lower latency inference.
Builds on Megatron-LM's tensor-parallel approach for multi-GPU inference.
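A minimal sketch of offline batched inference with vLLM; the model id and sampling settings are placeholders.

```python
# vLLM offline batched inference sketch: tensor_parallel_size shards the model across
# GPUs; PagedAttention and continuous batching are handled internally.
from vllm import LLM, SamplingParams

llm = LLM(model="your-favorite/causal-lm", tensor_parallel_size=2)  # placeholder model id
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = [
    "Summarize the benefits of PagedAttention.",
    "What is continuous batching?",
]

outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```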

Why use vLLM?
You should use vLLM if you expect to serve lots of requests to LLMs for inference or if you have particularly long generated sequences.
Can cut memory usage in some applications by 60-80% compared to vanilla transformers, though your mileage may vary.
Compatible with most Hugging Face models.
Generally used for large-scale serving; benefits over Accelerate are likely to be less pronounced in small use-cases.

Here, "small" means roughly fewer than 1 million calls; consider both the number of calls and the length of each call.
BEST PRACTICES

- Plan first
  - Calculate how much GPU memory you will need to run inference
  - Check the gpu queue to see which GPUs are available
  - Determine how many GPUs and which types (A40, A100, etc.) you will need
  - Need fast GPU-to-GPU communication? Use the BasePOD
- If you would like to use fewer GPUs, try a quantized model, an LLM with a lower precision and/or fewer parameters, etc., and see if performance is acceptable
- Use a well-established framework such as Accelerate. If you need extra gains, you can try DeepSpeed or vLLM.

Identifying Bottlenecks
Accelerate and DeepSpeed offer methods for monitoring GPU usage.
DeepSpeed's tools are considerably more in-depth, so consider using it when you need truly fine-grained monitoring of your code's resource usage.


WRAP UP

Recap
- Modern LLMs often cannot fit on a single GPU, so multi-GPU jobs are needed for LLM inference
- UVA HPC provides a variety of GPUs as well as the NVIDIA BasePOD
- H200s were recently added
- Plan how many and which GPUs you will need before running your LLM inference job
- Use a well-established framework to implement model parallelism

References
Hugging Face: https://huggingface.co/docs
Accelerate: https://huggingface.co/docs/accelerate
DeepSpeed: https://www.deepspeed.ai/inference/
vLLM: https://docs.vllm.ai/en/latest/
GPU Memory allocation: https://ksingh7.medium.com/calculate-how-much-gpu-memory-you-need-to-serve-any-llm-67301a844f21
QKV Cache Memory allocation: https://unfoldai.com/gpu-memory-requirements-for-llms/

Need more help?
Office Hours via Zoom
Tuesdays: 3 pm - 5 pm
Thursdays: 10 am - noon
Zoom Links are available at https://www.rc.virginia.edu/support/#office-hours
Website: https://rc.virginia.edu
