Introduction

Terminology

VRAM: Video RAM (GPU memory).
Context Window: The maximum number of tokens an LLM can keep "in memory" at once.
Precision: The amount of storage used for each LLM parameter.*
Quantization: A technique that lowers the amount of storage used by LLM parameters.
Embedding Dimension: The length of a token's embedding vector.
QKV Cache: The Query-Key-Value cache stores the attention keys and values of previously processed tokens so they are not recomputed at every step; it serves as the short-term memory of an LLM during inference.

* For example, fp32 is 4 bytes/parameter, and fp16 is 2 bytes/parameter.
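To make the precision and quantization entries concrete, here is a minimal Python sketch that estimates the VRAM needed just to hold a model's weights. The 70B parameter count is an illustrative assumption, not a value from this workshop:

```python
# Rough VRAM estimate for model weights alone.
# (Sketch; the 70B parameter count below is illustrative.)

def weight_vram_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights, in gigabytes."""
    return num_params * bytes_per_param / 1e9

params = 70e9  # e.g., a 70-billion-parameter model

print(f"fp32: {weight_vram_gb(params, 4):.0f} GB")   # ~280 GB
print(f"fp16: {weight_vram_gb(params, 2):.0f} GB")   # ~140 GB
print(f"int4: {weight_vram_gb(params, 0.5):.0f} GB") # ~35 GB (quantized)
```

At fp16, a 70B-parameter model needs roughly 140 GB for its weights alone, more than fits on a typical 80 GB data-center GPU; quantization shrinks this footprint at some cost in accuracy.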

Why use multi-GPU LLM inference?

LLMs continue to grow, both in parameter count and in context window length. As a result, many recent models no longer fit within the VRAM of a single GPU, and inference must be distributed across multiple GPUs. This multi-GPU case is the focus of the workshop.
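The context window contributes as well: the QKV cache grows linearly with context length. A minimal sketch, assuming full multi-head attention with one key and one value vector of the full embedding dimension per layer per token (real models often use grouped-query attention, which shrinks the cache considerably); the layer count and embedding dimension are illustrative 70B-scale values:

```python
# Rough QKV-cache size for a single sequence.
# (Sketch; assumes full multi-head attention and illustrative dimensions.)

def kv_cache_gb(num_layers: int, context_len: int,
                embed_dim: int, bytes_per_value: int = 2) -> float:
    """Cached keys and values across all layers, in gigabytes."""
    # Factor of 2: one key vector and one value vector per token per layer.
    return 2 * num_layers * context_len * embed_dim * bytes_per_value / 1e9

# 80 layers, embedding dimension 8192, 32k-token context, fp16 values
print(f"{kv_cache_gb(80, 32_768, 8_192):.0f} GB")  # ~86 GB
```

A long context can therefore claim tens of gigabytes on top of the weights, which is why both model size and context window push inference beyond a single GPU.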
