DeepSpeed

DeepSpeed is Microsoft’s open-source library for large-scale training and inference in PyTorch, with a focus on efficient memory management.

For inference, the key components are custom CUDA kernels for common LLM operations such as attention and MLP layers, and tensor parallelism, which splits a model across GPUs to reduce per-device memory use and latency.
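As a rough illustration of how these two features are typically switched on, here is a minimal sketch built around `deepspeed.init_inference`. The helper names `build_inference_kwargs` and `load_engine`, and the choice of `gpt2` as the model, are assumptions made for this example and are not part of DeepSpeed. Whether kernel injection is available depends on the model architecture and the installed DeepSpeed version.

```python
import os


def build_inference_kwargs(tp_size: int, use_kernels: bool = True) -> dict:
    """Collect keyword arguments for deepspeed.init_inference (hypothetical helper)."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    return {
        # Tensor parallelism: shard each layer's weights across tp_size GPUs.
        "tensor_parallel": {"tp_size": tp_size},
        # Kernel injection: swap supported modules for DeepSpeed's fused CUDA kernels.
        "replace_with_kernel_inject": use_kernels,
    }


def load_engine(model_name: str = "gpt2"):
    """Wrap a Hugging Face model in a DeepSpeed inference engine (hypothetical helper).

    The heavy imports live inside the function so that this module can still be
    imported on a machine without DeepSpeed or a GPU.
    """
    import deepspeed
    import torch
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
    # The `deepspeed` launcher sets WORLD_SIZE; fall back to a single GPU otherwise.
    tp_size = int(os.environ.get("WORLD_SIZE", "1"))
    return deepspeed.init_inference(
        model, dtype=torch.float16, **build_inference_kwargs(tp_size)
    )
```

A script that calls `load_engine()` would usually be started with the DeepSpeed launcher, for example `deepspeed --num_gpus 2 script.py`, so that one process runs per GPU.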

DeepSpeed also includes several architecture-specific and quantization-specific optimizations.

Upshot: DeepSpeed lets you load larger models that would not ordinarily fit in a single GPU's memory, though how much you gain depends on the model and your hardware.
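To make the memory claim concrete, the back-of-the-envelope sketch below estimates the per-GPU memory needed for model weights alone. The function `weight_memory_gib` and the 13-billion-parameter model are hypothetical and chosen only for illustration. The estimate assumes weights are sharded evenly and ignores activations, the KV cache, and framework overhead, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int, tp_size: int = 1) -> float:
    """Approximate per-GPU memory for weights only, in GiB (hypothetical helper)."""
    if tp_size < 1:
        raise ValueError("tp_size must be at least 1")
    total_bytes = num_params * bytes_per_param
    # Assumes tensor parallelism divides the weights evenly across GPUs.
    return total_bytes / tp_size / 2**30


# Hypothetical 13B-parameter model.
fp16_one_gpu = weight_memory_gib(13e9, bytes_per_param=2)              # about 24.2 GiB
fp16_two_gpus = weight_memory_gib(13e9, bytes_per_param=2, tp_size=2)  # about 12.1 GiB per GPU
int8_one_gpu = weight_memory_gib(13e9, bytes_per_param=1)              # about 12.1 GiB

print(f"fp16, 1 GPU:  {fp16_one_gpu:.1f} GiB")
print(f"fp16, 2 GPUs: {fp16_two_gpus:.1f} GiB per GPU")
print(f"int8, 1 GPU:  {int8_one_gpu:.1f} GiB")
```

Under these assumptions, the fp16 weights alone would not fit on a 16 GiB GPU, while either two-way tensor parallelism or 8-bit quantization would bring the weights under that limit.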

Last updated on Jul 8, 2025