DeepSpeed

DeepSpeed is Microsoft's open-source library for large-scale training and inference with PyTorch, built around efficient memory management.

Its key components for inference are custom CUDA kernels for common LLM operations such as attention and MLP layers, and tensor parallelism, which shards a model's weights across multiple GPUs for efficient memory usage and low latency.
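As a rough illustration, the sketch below wraps a Hugging Face text-generation pipeline with deepspeed.init_inference, turning on kernel injection and tensor parallelism. The model name, tensor-parallel degree, and script name are placeholders, and exact argument names (e.g., mp_size vs. newer tensor_parallel settings) vary across DeepSpeed versions, so treat this as a sketch rather than a drop-in recipe.

    import os
    import torch
    import deepspeed
    from transformers import pipeline

    # One process per GPU; the deepspeed launcher sets these environment variables.
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    world_size = int(os.getenv("WORLD_SIZE", "1"))

    # Any causal LM from the Hugging Face hub; gpt2 is used here only as a small example.
    generator = pipeline("text-generation", model="gpt2", device=local_rank)

    # Replace supported layers with DeepSpeed's fused CUDA kernels and shard
    # the weights across all participating GPUs (tensor parallelism).
    generator.model = deepspeed.init_inference(
        generator.model,
        mp_size=world_size,               # tensor-parallel degree
        dtype=torch.float16,              # half precision to cut memory use
        replace_with_kernel_inject=True,  # inject optimized attention/MLP kernels
    )

    print(generator("DeepSpeed is", max_new_tokens=30)[0]["generated_text"])

A two-GPU run of a script like this would then be launched with the DeepSpeed launcher, e.g. deepspeed --num_gpus 2 run_inference.py, where run_inference.py is whatever you name the file.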

DeepSpeed also includes several architecture-specific and quantization-specific optimizations.

Upshot: DeepSpeed allows you to load models that would not ordinarily fit within a single GPU's memory, though your mileage may vary.
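Roughly speaking, with a tensor-parallel degree of N each GPU only has to hold about 1/N of the model's weights (plus activations and the KV cache), which is what makes serving models too large for one GPU feasible; the actual savings depend on the model architecture and on how much of it DeepSpeed can shard.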
