
Multi-GPU LLM Inference

RC Workshop

Kathryn Linehan, Bruce Rushing

June 3rd, 2025

Workshop Overview

  • **The first draft outline of this workshop was created by ChatGPT!**
  • Introduction
  • UVA HPC
  • Multi-GPU Strategies
  • Accelerate
  • DeepSpeed
  • vLLM
  • Best Practices
  • Wrap Up

INTRODUCTION

Terminology

  • **VRAM:** Video RAM (GPU memory)
  • **Context window:** the maximum number of tokens an LLM can keep “in memory”
  • **Precision:** the amount of storage used for an LLM parameter
    • Examples: FP32 is 4 bytes/parameter, FP16 is 2 bytes/parameter
  • **Quantization:** technique to lower the amount of storage used by LLM parameters
  • **Embedding dimension:** length of the embedding vector
  • **QKV cache** (Query Key Value cache): the short-term memory of an LLM

Why Multi-GPU LLM inference?

  • LLMs are getting larger and larger, and so are their context windows
  • Recent LLMs generally do not fit on a single GPU due to VRAM limits
    • This is the case we will focus on today

GPU Memory Requirements for Google’s Gemma 3

| Model | 4K Tokens | 8K Tokens | 32K Tokens | 128K Tokens |
|---|---|---|---|---|
| 4B @ FP16 | 11 GB | 12.4 GB | 20.8 GB | 54.4 GB |
| 12B @ FP16 | 31.9 GB | 35 GB | 53.6 GB | 128 GB |
| 27B @ FP16 | 70.3 GB | 75.8 GB | 108.8 GB | 241 GB |
| 4B @ INT4 | 2.8 GB | 3.2 GB | 5.3 GB | 14 GB |
| 12B @ INT4 | 8 GB | 8.8 GB | 13.6 GB | 32.8 GB |
| 27B @ INT4 | 17.6 GB | 19 GB | 27.4 GB | 61 GB |

Models are given by parameter count (in billions) at a specific precision (i.e., FP16 or INT4 quantization) across varying context windows.

UVA HPC

UVA HPC – H200 GPUs

More info: https://www.rc.virginia.edu/2025/05/new-nvidia-h200-gpu-node-added-to-afton/

UVA HPC - GPUs

| GPU | HPC System | Partition | Memory |
|---|---|---|---|
| H200 | Afton and Rio | gpu* | 141 GB |
| A100 | Afton | gpu | 40 GB or 80 GB |
| A40 | Afton | gpu | 48 GB |
| A6000 | Afton | gpu | 48 GB |
| V100 | Afton | gpu | 32 GB |
| RTX3090 | Afton | interactive | 24 GB |
| RTX2080Ti | Afton | interactive | 11 GB |

*Only available through batch job requests (not Open OnDemand)

More info: https://www.rc.virginia.edu/userinfo/computing-environments/, https://www.rc.virginia.edu/userinfo/hpc/#system-details

GPU access on UVA HPC

**When you request memory for HPC, that is CPU memory.**

  • Open OnDemand (OOD)
    • Choose “GPU” or “Interactive” as the Rivanna/Afton Partition
    • Optional: choose the GPU type and number of GPUs
    • **Cannot use H200 GPUs**
  • SLURM
    • Specify the GPU partition: #SBATCH -p gpu
    • Request n GPUs: #SBATCH --gres=gpu:<optional_gpu_name>:n
      • Ex) Request 1 GPU: #SBATCH --gres=gpu:1
      • Ex) Request 4 A100s: #SBATCH --gres=gpu:a100:4
    • To request an 80 GB A100, additionally use: #SBATCH --constraint=a100_80gb

More info: https://www.rc.virginia.edu/userinfo/hpc/slurm/#gpu-intensive-computation


As of now: only one person can be using a GPU at a time. If you request a GPU, you will receive all of the GPU memory.

UVA HPC – NVIDIA DGX BasePOD on Rivanna/Afton

Cluster of high-performance GPUs that can be used for large deep-learning models

**18 DGX A100 nodes**, each with:

  • 2 TB of RAM per node
  • 8 A100s per node
  • 80 GB of GPU memory per GPU device

**Advanced features** (compared to regular GPU nodes):

  • NVLink for fast multi-GPU communication
  • GPUDirect RDMA Peer Memory for fast multi-node, multi-GPU communication
  • GPUDirect Storage with a 200 TB IBM ESS3200 (NVMe) SpectrumScale storage array

Ideal scenarios:

  • Job needs multiple GPUs on a single node or even multiple nodes
  • Job (single- or multi-GPU) is I/O intensive

If you have ever used an A100 with 80 GB on our system, you were using a POD node!

More info: https://www.rc.virginia.edu/userinfo/hpc/basepod/

**When you request memory for HPC, that is CPU memory.**

POD access on UVA HPC

  • Open OnDemand (OOD)
    • Choose “GPU” as the Rivanna/Afton Partition
    • Choose “NVIDIA A100” as the GPU type and fill in the number of GPUs
    • Select “Yes” for Show Additional Options and type “--constraint=gpupod” in the Slurm Option textbox
  • SLURM
    • #SBATCH -p gpu
    • #SBATCH --gres=gpu:a100:n (n: requested number of GPUs per node)
    • #SBATCH -C gpupod

More info: https://www.rc.virginia.edu/userinfo/hpc/basepod/#accessing-the-pod


As of now: only one person can be using a GPU at a time. If you request a GPU, you will receive all of the GPU memory.

POD on UVA HPC

Before running a multi-node job, please make sure the job can scale well to 8 GPUs on a single node.

Multi-node jobs on the POD should request all GPUs on the nodes, i.e. --gres=gpu:a100:8.

More info: https://www.rc.virginia.edu/userinfo/hpc/basepod/#accessing-the-pod

GPU Limit on UVA HPC

The maximum number of GPUs you can request for a UVA HPC job is 32.

The maximum number of nodes is 4.

More info: https://www.rc.virginia.edu/userinfo/hpc/#job-queues

Wait Time in the Queue

  • You may not need to request A100(s)!
  • Requesting A100(s) may mean you wait in the queue much longer than you would for another GPU type.
  • This can make your overall time (wait time + execution time) slower than if you had used another GPU.

Source: https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis

Viewing Available GPUs

The gpu partition can be very busy!

To view information about the gpu partition, including total (T) and allocated (A) GPUs, type qlist -p gpu at the command line

GPU Dashboard in OOD JupyterLab

  • This will be demoed during today’s workshop
  • Includes GPU and CPU memory and utilization tracking in real time
  • Helpful for GPU selection

GPU Dashboard: Memory Usage

PyTorch

Correct GPU memory usage will be reported

TensorFlow/ Keras

By default, TF automatically allocates ALL of the GPU memory, so the GPU Dashboard may show that all (or almost all) of the GPU memory is being used

To track the amount of GPU memory actually used, you can add these lines to your Python script (before TensorFlow initializes the GPU):

import os

os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

More Info: https://www.tensorflow.org/guide/gpu#limiting_gpu_memory_growth


Homework for Keras users: try out GPU dashboard and see if it reports all of the GPU memory as used

Calculating GPU Memory Requirements

GPU Memory Requirements for Google’s Gemma 3

| Model | 4K Tokens | 8K Tokens | 32K Tokens | 128K Tokens |
|---|---|---|---|---|
| 4B @ FP16 | 9.6 + 1.4 = 11 GB | 9.6 + 2.8 = 12.4 GB | 9.6 + 11.2 = 20.8 GB | 9.6 + 44.8 = 54.4 GB |
| 4B @ INT4 | 2.4 + 0.4 = 2.8 GB | 2.4 + 0.8 = 3.2 GB | 2.4 + 2.9 = 5.3 GB | 2.4 + 11.6 = 14 GB |
| 12B @ FP16 | 28.8 + 3.1 = 31.9 GB | 28.8 + 6.2 = 35 GB | 28.8 + 24.8 = 53.6 GB | 28.8 + 99.2 = 128 GB |
| 12B @ INT4 | 7.2 + 0.8 = 8 GB | 7.2 + 1.6 = 8.8 GB | 7.2 + 6.4 = 13.6 GB | 7.2 + 25.6 = 32.8 GB |
| 27B @ FP16 | 64.8 + 5.5 = 70.3 GB | 64.8 + 11 = 75.8 GB | 64.8 + 44 = 108.8 GB | 64.8 + 176 = 241 GB |
| 27B @ INT4 | 16.2 + 1.4 = 17.6 GB | 16.2 + 2.8 = 19 GB | 16.2 + 11.2 = 27.4 GB | 16.2 + 44.8 = 61 GB |

Models are given by parameter count (in billions) at a specific precision (i.e., FP16 or INT4 quantization) across varying context windows.

4B: 34 Layers, 2560 Embedding Dim

12B: 48 Layers, 3840 Embedding Dim

27B: 62 Layers, 5376 Embedding Dim


Each cell is a breakdown of M + N (from the previous slide): M is the memory for the model weights and N is the memory for the QKV cache.
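The breakdown above can be reproduced in a few lines of Python. The sketch below follows the two-part estimate from the cited guides: M = parameters × bytes per parameter × a ~1.2 overhead factor, and N = 2 × layers × context length × embedding dimension × bytes per value. The function names and the example model are illustrative, not part of the original workshop code.

```python
# Rough estimate of GPU memory for LLM inference, mirroring the M + N
# breakdown in the table above (sketch only; constants follow the cited guides).

def model_memory_gb(params_billions: float, bytes_per_param: float, overhead: float = 1.2) -> float:
    """M: memory to hold the weights, with a ~20% overhead allowance."""
    return params_billions * bytes_per_param * overhead

def kv_cache_gb(layers: int, context_len: int, embed_dim: int, bytes_per_value: float,
                batch_size: int = 1) -> float:
    """N: key/value cache memory = 2 (K and V) x layers x tokens x embedding dim."""
    return 2 * layers * context_len * embed_dim * bytes_per_value * batch_size / 1e9

# Example: Gemma 3 4B at FP16 (2 bytes/param, 34 layers, 2560 embedding dim), 4K context.
m = model_memory_gb(4, 2)                # ~9.6 GB for the weights
n = kv_cache_gb(34, 4 * 1024, 2560, 2)   # ~1.4 GB for the QKV cache
print(f"Total: {m + n:.1f} GB")          # ~11 GB, matching the table
```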

CPU Resource Allocation for LLM Inference

  • CPU memory
    • Have enough RAM to fit the LLM
      • Accelerate will offload computations to the CPU after GPUs are full
  • CPU cores
    • Use enough cores for any data preprocessing
    • Use a GPU!

Source: https://timdettmers.com/2018/12/16/deep-learning-hardware-guide/


Check your resource usage using GPU Dashboard, seff (completed jobs), or sstat (running jobs)

It may be the case that even if CPU Efficiency is a low percentage, you need all of the requested CPU cores for a specific part of the code, e.g., data preprocessing. In this case, request the number of CPU cores that you need for the compute intensive part of the code.

MULTI-GPU STRATEGIES

Overview of Multi-GPU Parallelism Strategies

  • Data Parallelism: data is split across GPUs
  • Model Parallelism: model is split across GPUs
    • Pipeline Parallelism
    • Tensor Parallelism

| Parallelism Type | What is Split? | When Used? |
|---|---|---|
| Data Parallelism | Input data | Long prompts/inputs |
| Pipeline Parallelism | Model layers | LLM exceeds single-GPU memory; LLM is deep but not too wide |
| Tensor Parallelism | Inside model layers (tensors) | LLM exceeds single-GPU memory; LLM layers are too large for one device |

Data Parallelism

Each GPU contains the full model

Data is split across GPUs and inference is performed on GPUs in parallel

Provides inference speed up

Source: https://colossalai.org/docs/concepts/paradigms_of_parallelism/#data-parallel
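A toy PyTorch sketch of the idea, assuming two GPUs are visible and using a small stand-in network rather than a real LLM: each GPU holds a full copy of the model, and the input batch is split between them.

```python
# Toy data parallelism for inference: one full model replica per GPU,
# with the input batch split across replicas (sketch only; assumes 2 GPUs).
import copy
import torch
import torch.nn as nn

base = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
replicas = [copy.deepcopy(base).to(f"cuda:{i}") for i in range(2)]  # full copy per GPU

batch = torch.randn(64, 1024)
shards = batch.chunk(2)  # split the data, not the model

with torch.no_grad():
    outputs = [replica(shard.to(f"cuda:{i}"))          # each GPU runs its shard
               for i, (replica, shard) in enumerate(zip(replicas, shards))]

result = torch.cat([out.cpu() for out in outputs])     # gather the results
print(result.shape)                                    # torch.Size([64, 10])
```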

Pipeline Parallelism (Inter-Layer)

  • Each GPU contains a different model stage (1+ layers).
  • GPUs compute on batches of data in parallel but must wait for the previous stage to complete.
  • Pros:
    • Reduces per-GPU memory use
    • Improves inference throughput
  • Con:
    • Adds inference latency

Source: https://colossalai.org/docs/concepts/paradigms_of_parallelism/#pipeline-parallel, https://arxiv.org/abs/1811.06965
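A minimal PyTorch sketch of the idea, assuming two GPUs and a toy two-stage model: each GPU holds a contiguous block of layers, and activations are handed from stage to stage.

```python
# Toy pipeline parallelism: consecutive layer blocks live on different GPUs,
# and activations flow from stage to stage (sketch only; assumes 2 GPUs).
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")  # first layers
stage1 = nn.Sequential(nn.Linear(1024, 10)).to("cuda:1")               # remaining layers

x = torch.randn(64, 1024)
with torch.no_grad():
    h = stage0(x.to("cuda:0"))   # stage 1 runs on GPU 0
    y = stage1(h.to("cuda:1"))   # its output is sent to GPU 1 for stage 2
print(y.shape)                   # torch.Size([64, 10])

# In a real pipeline, micro-batches are fed in back-to-back so that both
# stages stay busy instead of idling while they wait for each other.
```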

Tensor Parallelism (Intra-Layer)

Matrix Multiplication Example

  • Splits tensor computations across GPUs
    • Each GPU contains part of the tensor
    • GPUs compute on their part in parallel
  • Good for transformer layers
  • Frequent communication
    • Requires fast interconnect (e.g., NVLink, which UVA HPC has in the BasePOD)

Source: https://colossalai.org/docs/concepts/paradigms_of_parallelism/#tensor-parallel
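A toy version of the matrix-multiplication example, assuming two GPUs: the layer's weight matrix is split column-wise, each GPU computes its slice of the output, and the slices are gathered back together.

```python
# Toy tensor parallelism: one linear layer's weight matrix is split column-wise
# across two GPUs, which compute their output slices in parallel
# (sketch only; assumes 2 GPUs).
import torch

x = torch.randn(8, 1024)                 # a batch of activations
W = torch.randn(1024, 4096)              # full weight matrix of one layer

W0, W1 = W.chunk(2, dim=1)               # split the columns between the GPUs
y0 = x.to("cuda:0") @ W0.to("cuda:0")    # GPU 0 computes its half of the output
y1 = x.to("cuda:1") @ W1.to("cuda:1")    # GPU 1 computes the other half

y = torch.cat([y0.cpu(), y1.cpu()], dim=1)  # gather: the communication step
print((y - x @ W).abs().max())              # small: matches the unsplit matmul up to float error
```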

Key Tools and Frameworks

Hugging Face Accelerate: easy interface with transformers and PyTorch.

DeepSpeed: good for large-scale (thousand+ GPUs), well-optimized training and inference.

vLLM: focused on high-speed, efficient LLM inference.

Megatron: low-level, open-source API for custom development frameworks.

Key Tools and Frameworks cont.

  • We will focus on Accelerate in this workshop.
    • RC’s resources are best suited to Accelerate unless you specifically need the inference/training-centric memory management or speed-ups that DeepSpeed or vLLM provide.
    • Accelerate will set up model parallelism for you
    • Code examples will be provided
  • We will present some basic information about DeepSpeed and vLLM.

More information: https://huggingface.co/docs/accelerate/en/concept_guides/big_model_inference

ACCELERATE

What is Accelerate?

Popular, easy-to-use library built to work with PyTorch models.

Very straightforward API that wraps around PyTorch and transformers code.

Can automatically allocate GPU memory across devices and nodes.

Used for both training and inference.

Can interface with Megatron-LM and DeepSpeed.

Why use Accelerate?

If you need to prototype or are expecting to run models on “fewer” GPUs, Accelerate is the preferred framework.

It is mostly plug-and-play with existing code, with options already built into transformers and classes that easily wrap around torch.

You can use it both for inference and for training.

There is some fine-grained control of GPU and CPU usage, allowing a degree of customization for improved performance depending on your use case.
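A minimal sketch of what this looks like with transformers + Accelerate, assuming both packages are installed and you have access to the model weights (the model name here is just an example): passing device_map="auto" lets Accelerate spread the layers across all visible GPUs, spilling to CPU RAM if they fill up.

```python
# Minimal multi-GPU inference via Accelerate's big-model support
# (sketch only; model name is illustrative and requires access to the weights).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-3-12b-it"   # example model; substitute your own

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",         # Accelerate shards layers across all visible GPUs
    torch_dtype=torch.float16,
)

prompt = "Explain tensor parallelism in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```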

Accelerate Code Demo

The notebook is available on Rivanna/Afton in the folder /project/hpc_training/multigpu_llm_inf

DeepSpeed

What is DeepSpeed?

Microsoft’s open-source library focused on large-scale training and inference through efficient memory management for PyTorch.

Key components for inference are custom CUDA kernels for common LLM operations (e.g., attention and MLP blocks) and tensor parallelism for efficient memory usage and low latency.

Also has several architecture- and quantization-specific optimizations.

Upshot: allows you to load larger models that would not ordinarily fit within GPU memory, though your mileage may vary.
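As a rough illustration of the inference path, the sketch below uses deepspeed.init_inference to shard a Hugging Face model across GPUs with kernel injection. It assumes deepspeed, torch, and transformers are installed, the model name is illustrative, and the script is launched with the DeepSpeed launcher (e.g., deepspeed --num_gpus 4 script.py); check the DeepSpeed docs for the arguments supported by your version.

```python
# Sketch of DeepSpeed tensor-parallel inference (arguments may differ by version).
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-3-12b-it"   # example model; substitute your own
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model across the GPUs in the launch group and swap in DeepSpeed's
# fused CUDA kernels where the architecture is supported.
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": 4},
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)

inputs = tokenizer("What does tensor parallelism split?", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```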

Transformer-specific Latency Advantages

Transformer module latency reductions per module

Why use DeepSpeed?

You should use DeepSpeed if you need to perform large-scale inference (hundreds of GPUs or more), you need fast serving, or every MB of GPU memory counts.

Remark: the maximum number of GPUs you can request for a UVA HPC job is 32 (max number of nodes is 4).

The inference latency reductions can matter if you only have a limited amount of time for experiments.

DeepSpeed also offers very efficient large-scale training parallelism, which we won’t discuss here.

Source: https://www.rc.virginia.edu/userinfo/hpc/#job-queues


For most users, Accelerate is probably better due to ease of use. DeepSpeed does improve inference but is mainly used for training. If serving an LLM, vLLM is a good choice.

vLLM

What is vLLM (Virtual Large Language Model)?

Open-source library from UC Berkeley.

Key innovation for inference is PagedAttention: an efficient virtual memory/page method for storing the KV cache, which in longer contexts can end up committing more memory than the models themselves.

Also allows continuous batching: a method for handling multiple requests to LLMs that enables less idle time.

Like DeepSpeed, also supports optimized CUDA kernels for lower latency inference.

Built on top of Megatron and can interface with DeepSpeed.

Why use vLLM?

You should use vLLM if you expect to serve lots of requests to LLMs for inference or if you have particularly long generated sequences.

Can cut memory usage in some applications by 60-80% compared to vanilla transformers, though your mileage may vary.

Compatible with most Hugging Face models.

Generally used for large-scale serving; benefits over Accelerate are likely to be less pronounced in small use-cases.


“Small” here means fewer than roughly 1 million calls; consider both the number of calls and the length of each call.
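A minimal sketch of offline batched inference with vLLM, assuming the vllm package is installed; the model name and tensor_parallel_size are illustrative. vLLM handles PagedAttention and continuous batching internally.

```python
# Minimal offline batched inference with vLLM
# (sketch only; model name and tensor_parallel_size are illustrative).
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-12b-it",   # example model; substitute your own
    tensor_parallel_size=4,          # shard the model across 4 GPUs on one node
    dtype="float16",
)

prompts = [
    "Summarize the benefits of PagedAttention.",
    "What is continuous batching?",
]
sampling = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM schedules the prompts with continuous batching and a paged KV cache.
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)
```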

BEST PRACTICES

  • Plan first
    • Calculate how much GPU memory you will need to run inference
    • Check the gpu queue to see which GPUs are available
    • Determine how many GPUs and which types (A40, A100, etc.) you will need
  • Need fast GPU to GPU communication? Use the BasePOD
  • If you would like to use fewer GPUs, try a quantized model or an LLM with lower precision and/or fewer parameters, and check whether the performance is acceptable
  • Use a well-established framework such as Accelerate. If you need extra gains, you can try DeepSpeed or vLLM.

Identifying Bottlenecks

Accelerate and DeepSpeed offer methods for monitoring GPU usage.

DeepSpeed’s tools are considerably more in-depth, so consider using it if you need fine-grained monitoring of your code’s resource usage.
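For a quick, framework-agnostic check from inside a PyTorch script, you can also poll the CUDA memory counters directly; a minimal sketch:

```python
# Print per-GPU memory use from inside a running PyTorch script (sketch only).
import torch

for i in range(torch.cuda.device_count()):
    used = torch.cuda.memory_allocated(i) / 1e9            # tensors currently allocated
    total = torch.cuda.get_device_properties(i).total_memory / 1e9
    print(f"GPU {i}: {used:.1f} / {total:.1f} GB allocated")
```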

WRAP UP

Recap

  • Modern LLMs often cannot fit on a single GPU, so multi-GPU jobs are needed for LLM inference
  • UVA HPC provides a variety of GPUs as well as the NVIDIA BasePOD
    • H200s were recently added
  • Plan how many and which GPUs you will need before running your LLM inference job
  • Use a well-established framework to implement model parallelism

References

Hugging Face: https://huggingface.co/docs

Accelerate: https://huggingface.co/docs/accelerate

DeepSpeed: https://www.deepspeed.ai/inference/

vLLM: https://docs.vllm.ai/en/latest/

GPU Memory allocation: https://ksingh7.medium.com/calculate-how-much-gpu-memory-you-need-to-serve-any-llm-67301a844f21

QKV Cache Memory allocation: https://unfoldai.com/gpu-memory-requirements-for-llms/

Need more help?

Office Hours via Zoom

Tuesdays: 3 pm - 5 pm

Thursdays: 10 am - noon

Zoom Links are available at https://www.rc.virginia.edu/support/#office-hours

  • Website: https://rc.virginia.edu