# Introduction
## Terminology
| Term | Definition |
|---|---|
| VRAM | Video RAM (GPU Memory). |
| Context Window | The maximum number of tokens an LLM can keep "in memory" (attend to) at once. |
| Precision | The numeric format of an LLM parameter, which determines how many bytes it occupies.* |
| Quantization | A technique that reduces the storage used by LLM parameters by lowering their precision. |
| Embedding Dimension | The length of the embedding vector. |
| QKV Cache | The Query-Key-Value cache: attention tensors stored for previous tokens, serving as the short-term memory of an LLM. |
* For example, fp32 is 4 bytes/parameter, and fp16 is 2 bytes/parameter.
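
To make the footnote concrete, here is a minimal sketch (in Python; the function name is hypothetical, and the int8/int4 entries are added to illustrate quantization) of how precision translates into the VRAM needed just to hold a model's weights:

```python
# Minimal sketch: weight memory = parameter count x bytes per parameter.
# fp32/fp16 byte counts are from the footnote above; int8/int4 illustrate quantization.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(n_params: float, precision: str) -> float:
    """Approximate VRAM (in GB) needed just to store the model weights."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

# Example: a 7B-parameter model.
print(weight_memory_gb(7e9, "fp32"))  # 28.0 GB
print(weight_memory_gb(7e9, "fp16"))  # 14.0 GB
print(weight_memory_gb(7e9, "int4"))  # 3.5 GB
```

Quantizing from fp16 to int4, for example, cuts weight memory by roughly 4x, which is why quantization is often the first lever when a model almost fits on one GPU.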
## Why use multi-GPU LLM inference?
LLMs keep growing in parameter count, and so do their context windows. As a result, many recent LLMs no longer fit within the VRAM of a single GPU, so inference must be split across multiple GPUs. This multi-GPU case is the focus of this workshop.
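
As a back-of-the-envelope illustration of this VRAM limit (a sketch under stated assumptions: 80 GB of VRAM per GPU, roughly an A100/H100-class card, with weights split evenly across GPUs; KV-cache and activation memory are ignored, which only understates the requirement):

```python
import math

# Sketch: minimum number of GPUs whose combined VRAM can hold the weights alone.
# Assumes 80 GB VRAM per GPU and an even split of weights; ignores KV cache/activations.
def gpus_needed(n_params: float, bytes_per_param: float, vram_gb: float = 80.0) -> int:
    weights_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weights_gb / vram_gb)

# Examples at fp16 (2 bytes/parameter):
print(gpus_needed(70e9, 2))   # 2  (140 GB of weights)
print(gpus_needed(405e9, 2))  # 11 (810 GB of weights)
```

Even under these optimistic assumptions, the largest models need several GPUs for their weights alone, which is why the rest of this workshop focuses on multi-GPU inference.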