Learn/Core Concept How does disaggregated serving work? Disaggregated serving separates model inference components across multiple nodes rather than bundling everything on single machines. Instead of each server handling compute, memory, and storage together, we split these functions so specialised nodes handle different parts of the inference pipeline. Dynamo demonstrates this with KV-aware routing that sends requests to nodes based on cached key-value pairs. This architecture improves resource utilisation and enables more flexible scaling patterns. OrchestrationAutoscaling |