Disaggregated vs Direct‑Attached Storage for Inference Serving
Inference serving changes the storage conversation. Unlike training, production inference emphasizes predictable low tail latency, small‑batch I/O patterns, and high concurrency across many models. Choosing between direct‑attached storage (DAS) and disaggregated storage (NVMe over Fabrics, or NVMe‑oF, and pooled all‑flash) shapes GPU utilization, operational scale, and cost. This analysis lays out concrete evaluation criteria, tradeoffs, and a pragmatic decision flow for infrastructure teams.
Key evaluation criteria for inference storage
- Latency and tail latency: median and 99th/99.9th percentiles matter more than throughput alone.
- IOPS vs bandwidth: small random reads dominate many model lookups; sequential bandwidth is secondary unless you stream large embeddings or entire model weights from disk.
- Concurrency and multi‑tenant QoS: how many models/clients will contend on the same storage?
- Scaling model count vs scaling GPU count: do you add capacity independently or in fixed server units?
- Caching and memory footprint: can you preload models into GPU or host DRAM to avoid repeated storage hits?
- Failure isolation and operational recovery: how does recovery affect ongoing inference traffic?
- Cost and utilization: do idle GPUs wait on data (storage‑throttled), or can storage scale to keep GPUs busy?
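To make the first criterion concrete, here is a minimal sketch of summarizing a latency trace with nearest‑rank percentiles. The function names and the trace are hypothetical, chosen only to show how a mean, median, and p99 can all look healthy while p99.9 exposes the tail.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

def latency_summary(latencies_us):
    """Summarize a list of per-request latencies given in microseconds."""
    return {
        "mean": sum(latencies_us) / len(latencies_us),
        "p50": percentile(latencies_us, 50),
        "p99": percentile(latencies_us, 99),
        "p99.9": percentile(latencies_us, 99.9),
    }

# Hypothetical trace: 990 reads at 100 µs and 10 stalled reads at 5000 µs.
trace = [100.0] * 990 + [5000.0] * 10
summary = latency_summary(trace)
# mean is 149 µs, p50 and p99 are both 100 µs, and only p99.9 shows the 5000 µs stalls
```

In this made‑up trace, 1% of requests are fifty times slower than the rest, yet p99 still reports the fast path. That is the reason to track more than one tail percentile.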
How DAS and disaggregated storage map to inference needs
DAS (Direct‑Attached Storage)
- Pros: Lowest possible I/O latency when using local NVMe; simple topology, fewer network components; predictable performance per server; easy to tune for single‑server LBs.
- Cons: Fixed capacity and limited sharing; poor elasticity—adding GPUs requires adding storage even if capacity is unused; underutilized assets when model mix changes; harder to provide per‑tenant QoS across servers.
Disaggregated Storage (NVMe‑oF, pooled all‑flash)
- Pros: Independent scaling of compute and storage, centralized management, better overall capacity utilization across many GPUs and models; advanced QoS and prioritization; easier model multiplexing and failover without host rebuilds.
- Cons: Added network complexity; depends on the fabric (for example RDMA over Converged Ethernet, InfiniBand, or NVMe/TCP) for low latency; potential tail‑latency spikes if fabric or target nodes are overloaded; requires operational maturity.
Comparison table
| Aspect | Disaggregated storage (NVMe‑oF / pooled all‑flash) | Direct‑Attached Storage (local NVMe) |
|---|---|---|
| Typical latency profile | Local device latency plus fabric overhead (often on the order of 10 µs on an RDMA fabric); tails can reach sub‑ms depending on fabric and scale | Lowest host‑local latency; roughly 10 µs on low‑latency media to around 100 µs for typical NAND‑based PCIe NVMe reads |
| Tail latency | Requires fabric and target tuning; QoS can mitigate but tail spikes possible under contention | More predictable per‑server; tail latency tied to local host load |
| Throughput per GPU | High aggregate throughput; can serve many GPUs if network and targets provisioned | Per‑server throughput limited by local NVMe and PCIe lanes |
| Scalability | Independent scaling of storage and compute; good for many models and multi‑tenant clusters | Scaling is fixed‑ratio (add servers to add both compute and storage) |
| GPU utilization | Can be higher because storage pool serves many GPUs and reduces idle time | Risk of idle GPUs if local drive I/O can't keep up; unused local capacity stays stranded on each server |
| Operational complexity | Higher: network fabric, SAN/NVMe‑oF targets, monitoring, QoS | Lower: simpler hardware stack, fewer cross‑server dependencies |
| Best fit | Large inference farms, many small models, bursty multi‑tenant workloads | Small clusters, edge boxes, latency‑critical single‑server inference |
Inference pattern examples and recommendations
- Low‑latency, single‑model per server (SLO: <1 ms p99)
  - Recommendation: DAS or host‑cached model in memory/GPU. Local NVMe keeps tail latency tight. Use disaggregated only if local resources are constrained.
- High model count / multi‑tenant inference (hundreds of models; frequent cold starts)
  - Recommendation: Disaggregated pooled all‑flash with aggressive caching and SSD‑backed prefetch layers. Independent storage lets you share warm data across GPUs and avoid duplicated copies.
- Bursty traffic with varying model sizes
  - Recommendation: Hybrid: local NVMe for hot models + disaggregated pool for colder models. This gives a balance of low tail latency and flexible capacity.
- Edge or constrained sites
  - Recommendation: DAS or a small local appliance sized for expected peak working set; disaggregated fabrics are often impractical at the edge.
Operational controls to hit SLOs
- Preload and pin hot models in GPU memory or host RAM to eliminate repeated storage roundtrips.
- Maintain an SSD hot‑cache (local or on‑fabric) and instrument evictions by model access frequency.
- Use RDMA/NVMe‑oF with explicit QoS and scheduling to cap noisy tenants.
- Monitor p99/p99.9 latency, not just the average; trigger autoscale policies on tail latency.
- Run reproducible load tests reflecting production concurrency and small‑batch patterns.
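The hot‑cache control above can be sketched as a small capacity‑bounded cache that evicts by access frequency and counts hits, misses, and evictions. This is an illustrative model only: `HotModelCache` and its fields are hypothetical names, it tracks sizes rather than real model bytes, and it always admits a new model, which a production cache may not want.

```python
from collections import Counter

class HotModelCache:
    """Capacity-bounded cache that evicts the least frequently accessed model."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self.sizes = {}         # cached model name -> size in bytes
        self.hits = Counter()   # access count per model, kept across evictions
        self.stats = Counter()  # "hit", "miss", "evict" counters for monitoring

    def access(self, model, size_bytes):
        """Record one access; return True on a cache hit, False on a miss."""
        self.hits[model] += 1
        if model in self.sizes:
            self.stats["hit"] += 1
            return True
        self.stats["miss"] += 1
        self._admit(model, size_bytes)
        return False

    def _admit(self, model, size_bytes):
        if size_bytes > self.capacity_bytes:
            return  # larger than the whole cache; always served from the backend
        while self.used_bytes + size_bytes > self.capacity_bytes:
            # Evict the cached model with the fewest recorded accesses.
            victim = min(self.sizes, key=lambda m: self.hits[m])
            self.used_bytes -= self.sizes.pop(victim)
            self.stats["evict"] += 1
        self.sizes[model] = size_bytes
        self.used_bytes += size_bytes

# Hypothetical access pattern with sizes in arbitrary units.
cache = HotModelCache(capacity_bytes=10)
for name, size in [("a", 4), ("a", 4), ("b", 4), ("c", 4)]:
    cache.access(name, size)
# "b" is evicted to make room for "c" because "a" has been accessed more often
```

Exporting `stats` alongside the per‑model counts gives the eviction instrumentation the list calls for: a rising eviction rate for frequently accessed models suggests the cache is undersized for the working set.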
Cost and utilization perspective
DAS often looks cheaper at small scale because you avoid the fabric and pooled targets. But as model count grows, DAS forces each server to provision drives for its own working set, which duplicates model copies, inflates capital expense, and can leave GPUs idle when local I/O becomes the bottleneck. Disaggregated architectures convert that fixed waste into a shared pool: better utilization, but a higher up‑front operational investment.
A realistic evaluation weighs total device, networking, and operations costs against the expected GPU utilization gains. For many organizations, the break‑even tilts toward disaggregation once they operate many GPUs across many models or tenants.
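One way to frame that comparison is cost per useful GPU‑hour: total spend divided by the GPU‑hours actually spent doing work. The sketch below assumes both designs use the same GPU count over the same period. The function names are hypothetical and every figure is a made‑up placeholder, not a measurement or a price.

```python
def cost_per_useful_gpu_hour(device_cost, network_cost, ops_cost,
                             gpu_count, gpu_utilization, hours):
    """Total cost divided by the GPU-hours that are actually kept busy."""
    total = device_cost + network_cost + ops_cost
    return total / (gpu_count * hours * gpu_utilization)

def break_even_utilization(das_total, das_utilization, disagg_total):
    """GPU utilization a disaggregated design must reach to match DAS
    on cost per useful GPU-hour, assuming equal GPU count and hours."""
    return das_utilization * disagg_total / das_total

# Placeholder inputs: DAS costs 1.0M at 50% GPU utilization,
# a disaggregated design costs 1.2M including fabric and operations.
needed = break_even_utilization(das_total=1_000_000,
                                das_utilization=0.5,
                                disagg_total=1_200_000)
# needed is 0.6: the pooled design breaks even at 60% GPU utilization
```

The second function follows directly from setting the two cost‑per‑useful‑hour expressions equal. If the pooled design costs 20% more in total, it has to lift utilization by the same 20% relative to DAS before it pays for itself.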
Practical decision flow
- If your primary SLO is tight tail latency per request and you serve a stable small set of models tightly bound to servers, prefer DAS.
- If you operate many models, have frequent cold starts, or want to independently scale storage, consider disaggregated storage with NVMe‑oF and a well‑tuned caching layer.
- For most production inference fleets, a hybrid approach (local hot cache + disaggregated backend) gives the best mix of latency and utilization.
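The decision flow above can be written down as a small function, which is handy for recording the inputs behind a choice. This is one interpretation of the three bullets, not a definitive rule: the `FleetProfile` fields are hypothetical yes/no answers, and a profile matching both or neither of the first two bullets falls through to the hybrid default.

```python
from dataclasses import dataclass

@dataclass
class FleetProfile:
    """Hypothetical yes/no answers describing an inference fleet."""
    tight_tail_latency_slo: bool
    stable_small_model_set: bool
    many_models: bool
    frequent_cold_starts: bool
    needs_independent_storage_scaling: bool

def recommend_storage(profile: FleetProfile) -> str:
    """Map a fleet profile to "das", "disaggregated", or "hybrid"."""
    wants_das = (profile.tight_tail_latency_slo
                 and profile.stable_small_model_set)
    wants_pool = (profile.many_models
                  or profile.frequent_cold_starts
                  or profile.needs_independent_storage_scaling)
    if wants_das and not wants_pool:
        return "das"
    if wants_pool and not wants_das:
        return "disaggregated"
    return "hybrid"  # mixed or unclear signals: local hot cache + pooled backend

edge_box = FleetProfile(True, True, False, False, False)
shared_farm = FleetProfile(False, False, True, True, True)
mixed_fleet = FleetProfile(True, True, True, False, False)
```

Here `edge_box` maps to `"das"`, `shared_farm` to `"disaggregated"`, and `mixed_fleet` to `"hybrid"`.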
Tools and vendor notes
When evaluating disaggregated platforms, test with your real inference patterns—small batch sizes, concurrent clients, and cold start scenarios. Look for independent validation of performance claims and reproducible benchmarks.
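A minimal sketch of such a test is shown below: several concurrent clients issue small random reads against a file and record per‑read latency. The function names and defaults are hypothetical, and `os.pread` is only available on Unix‑like systems. These reads go through the OS page cache, so repeated runs measure DRAM rather than the device unless you drop caches or use direct I/O; purpose‑built I/O benchmarking tools handle that properly.

```python
import os
import random
import time
from concurrent.futures import ThreadPoolExecutor

def timed_random_reads(path, reads, block_size=4096, seed=0):
    """Issue `reads` random block reads on `path`; return latencies in µs."""
    rng = random.Random(seed)
    blocks = max(1, os.path.getsize(path) // block_size)
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(reads):
            offset = rng.randrange(blocks) * block_size
            start = time.perf_counter()
            os.pread(fd, block_size, offset)
            latencies.append((time.perf_counter() - start) * 1e6)
    finally:
        os.close(fd)
    return latencies

def run_load_test(path, clients=8, reads_per_client=200):
    """Run `clients` concurrent readers and return all latencies combined."""
    with ThreadPoolExecutor(max_workers=clients) as pool:
        futures = [
            pool.submit(timed_random_reads, path, reads_per_client, seed=i)
            for i in range(clients)
        ]
        return [lat for future in futures for lat in future.result()]
```

Calling `run_load_test("/path/to/model.bin", clients=32)` (a placeholder path) returns one flat list of latencies that can be summarized by percentile. Running it once against a cold file and again against a warm one gives a rough view of the cold‑start penalty.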
As one example of a disaggregated all‑flash option positioned for inference workloads, ZK‑Storage's WS5000 is marketed as supporting high consolidation of GPUs and reducing GPU idle time; validation reports are available from the vendor and from third parties. See https://goni.top for vendor details and common deployment patterns.
Key takeaways
- Tail latency (p99/p99.9), not just raw bandwidth, drives storage choice for inference.
- DAS gives the lowest and most predictable per‑server latency but scales poorly across many models/tenants.
- Disaggregated storage improves utilization and elasticity but requires fabric expertise and careful tail‑latency management.
- Hybrid designs (local hot cache + pooled backend) are the most pragmatic path for large inference fleets.
- Validate with real traffic patterns and factor operational costs and GPU utilization into TCO decisions.