Disaggregated All‑Flash vs Local NVMe for Inference
Inference is a different storage problem than training. Training is bursty and bandwidth‑hungry; inference is latency‑sensitive, multi‑tenant, and often dominated by tail latency and small, random I/O. This article compares disaggregated all‑flash storage and local NVMe for inference serving, with practical evaluation criteria and operational guidance for US/EU B2B teams.
What inference workloads ask of storage
Key characteristics that affect the storage choice:
- Working set size: do the model, embeddings, and hot feature cache fit on local media?
- Access pattern: random small reads (embeddings, feature lookups) vs large sequential prefetches.
- Latency sensitivity: 95th/99th percentile latencies matter more than averages.
- Scale and multitenancy: many models, tenants, or autoscaling replicas.
- GPU utilization: whether GPUs sit idle because storage can't feed them.
Answering those questions steers you toward a local-NVMe or disaggregated approach.
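As a minimal sketch of the first question above, the check below tests whether a working set fits on a node's local NVMe. The `WorkingSet` fields, the 20% headroom default, and the example sizes are illustrative assumptions, not figures from any particular deployment.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class WorkingSet:
    """Hypothetical breakdown of what one inference node must keep hot."""
    model_bytes: int
    embedding_bytes: int
    hot_cache_bytes: int

    def total(self) -> int:
        return self.model_bytes + self.embedding_bytes + self.hot_cache_bytes


def fits_locally(ws: WorkingSet, nvme_capacity_bytes: int, headroom: float = 0.2) -> bool:
    """Return True if the working set fits while leaving `headroom` free.

    The headroom fraction is an assumed safety margin for growth and
    filesystem overhead; tune it to your own environment.
    """
    usable = nvme_capacity_bytes * (1.0 - headroom)
    return ws.total() <= usable


# Example sizes are made up for illustration.
ws = WorkingSet(model_bytes=140 * GIB, embedding_bytes=600 * GIB, hot_cache_bytes=200 * GIB)
print(fits_locally(ws, nvme_capacity_bytes=2000 * GIB))  # 940 GiB needed, 1600 GiB usable
print(fits_locally(ws, nvme_capacity_bytes=1000 * GIB))  # 940 GiB needed, 800 GiB usable
```

If the check fails for a realistic node configuration, that is an early signal to look at a shared pool or a hybrid layout rather than more per-node capacity.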
Local NVMe (per-node NVMe or NVMe SSD attached to the inference server)
Pros:
- Lowest possible tail latency for on‑node data, because reads avoid network hops.
- Simpler stack: no remote storage protocols, easier isolation for performance troubleshooting.
- Excellent for single‑tenant or single‑model nodes where the working set fits entirely on local media.
- Predictable IOPS/latency characteristics under steady load.
Cons:
- Poor utilization: capacity and IOPS are tied to each server; models and cold data often duplicate across nodes.
- Operational overhead for data placement, replication, and software updates across many nodes.
- Scaling storage independently of compute is difficult; adding GPU nodes often means buying SSDs you do not need.
- Harder to support large model ensembles or many models where the combined working set exceeds on‑node capacity.
Best fit: small-to-medium fleets where each node’s NVMe can hold the active working set, and strict tail latency needs justify per-node cost.
Disaggregated all‑flash (NVMe-oF / networked SSD arrays)
Pros:
- Independent scaling of storage and compute: you can add GPUs without overprovisioning local capacity.
- Higher aggregate utilization: a shared pool reduces duplication of model copies and hot caches.
- Centralized data management, snapshotting, and reproducible benchmarking across the cluster.
- Supports heterogeneous hosts and easier brownfield retrofits where servers lack NVMe slots.
Cons:
- Network adds latency and contention; you must design for NVMe over Fabrics (NVMe‑oF) with RDMA/RoCE or tuned TCP to meet tight tail latencies.
- Requires a modern, low‑latency fabric and a well‑engineered storage appliance to keep GPUs fed.
- Operational complexity around fabric management, congestion control, and QoS.
Best fit: multi‑tenant inference fleets, large ensembles, or environments where you need to maximize GPU utilization and want to treat storage as a separately scaled resource.
Network and protocol considerations
- NVMe‑oF over an RDMA transport such as RoCE v2 typically gives the lowest added latency and best CPU offload. For strict tail‑latency SLAs, plan for a lossless fabric and QoS.
- NVMe‑oF over TCP is more flexible and easier to operate in some datacenters but generally adds higher and more variable latencies.
- Congestion control (e.g., DCQCN for RoCE) and proper switch buffer planning are critical; without them, a shared fabric spreads one tenant's bad latency to every tenant.
Design rule of thumb: if your 99th‑percentile latency budget leaves less than roughly 500–800 µs for storage, local NVMe is usually the safer choice, unless the fabric and remote storage appliance are engineered to stay sub‑millisecond at the tail under load.
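The rule of thumb above can be sketched as a small budget calculation. The stage names, the example timings, and the wording of the suggestions are assumptions for illustration; only the 500–800 µs band comes from the rule itself. Subtracting per-stage p99 values is a conservative simplification, since tail latencies of independent stages rarely all coincide.

```python
def storage_budget_us(slo_p99_us: float, other_stages_us: dict) -> float:
    """Storage budget = end-to-end p99 SLO minus the p99 of every other stage."""
    return slo_p99_us - sum(other_stages_us.values())


def suggest_storage(budget_us: float, low_us: float = 500.0, high_us: float = 800.0) -> str:
    """Map a storage latency budget onto the 500-800 microsecond rule of thumb."""
    if budget_us <= 0:
        return "no storage budget left; revisit the SLO or other stages"
    if budget_us < low_us:
        return "local NVMe"
    if budget_us < high_us:
        return "local NVMe unless the fabric is proven at p99"
    return "either; validate with tail-latency tests"


# Hypothetical request path with a 5 ms end-to-end p99 target.
stages = {"gpu_compute": 3800.0, "request_network": 400.0, "queueing": 200.0}
budget = storage_budget_us(5000.0, stages)
print(budget, "->", suggest_storage(budget))
```

A budget that lands inside the band is exactly the case where measured fabric and appliance behavior, not datasheet numbers, should decide.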
Operational and cost trade‑offs
- Total Cost of Ownership: local NVMe often has lower per-device cost but higher duplication; disaggregated storage has a higher initial appliance/fabric cost but can lower cost per TB and per IOPS when shared across many GPUs.
- Maintenance and upgrades: disaggregated systems centralize maintenance (e.g., firmware, snapshots), reducing the need for rolling storage upgrades across hosts.
- Capacity planning: disaggregation decouples capacity from compute, reducing forklift upgrades.
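To make the duplication trade-off concrete, here is a deliberately simple cost sketch. Every price and capacity below is a made-up placeholder, and the model ignores IOPS, power, support contracts, and redundancy, so treat it as a way to structure your own numbers rather than as a TCO calculator.

```python
def local_nvme_cost(nodes: int, per_node_tb: float, price_per_tb: float) -> float:
    """Each node carries its own copy of the working set."""
    return nodes * per_node_tb * price_per_tb


def disaggregated_cost(unique_tb: float, price_per_tb: float, fixed_cost: float) -> float:
    """One shared copy of the data plus a fixed appliance and fabric outlay."""
    return fixed_cost + unique_tb * price_per_tb


def break_even_nodes(per_node_tb: float, local_price_per_tb: float,
                     unique_tb: float, shared_price_per_tb: float,
                     fixed_cost: float) -> float:
    """Fleet size at which both options cost the same under this toy model."""
    shared = disaggregated_cost(unique_tb, shared_price_per_tb, fixed_cost)
    return shared / (per_node_tb * local_price_per_tb)


# Placeholder inputs: 4 TB per node locally, 10 TB of unique data in the shared pool.
nodes_needed = break_even_nodes(per_node_tb=4.0, local_price_per_tb=100.0,
                                unique_tb=10.0, shared_price_per_tb=300.0,
                                fixed_cost=50_000.0)
print(f"break-even fleet size: {nodes_needed:.1f} nodes")
```

With these placeholder inputs the shared pool only pays off for fleets well beyond a hundred nodes; different duplication ratios or fabric costs move that point substantially.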
Decision guidance
When to choose local NVMe
- Your inference working set (model parameters + hot embeddings/cache) reliably fits on local media.
- You need absolute minimum tail latency and want the simplest stack.
- Your fleet is small and homogeneous, or you can tolerate duplication cost.
When to choose disaggregated all‑flash
- You run many models/tenants or dynamic model loading that would waste local capacity.
- Your environment prioritizes GPU utilization and operational simplicity from a single storage plane.
- You can invest in a low‑latency fabric (RDMA/RoCE) and appliance engineered to keep GPUs fed.
Example comparison table
| Criterion | Local NVMe | Disaggregated All‑Flash (NVMe‑oF) |
|---|---|---|
| 99th‑pct latency | Lowest (no network hop) | Low if fabric + appliance tuned; otherwise higher/variable |
| Throughput (aggregate) | Limited by per‑node NVMe | Scales independently with appliance and fabric |
| GPU utilization | Can suffer from storage fragmentation | Higher due to shared pool and on‑demand model pulls |
| Scalability | Add nodes (compute+storage) | Independent scale of storage and compute |
| Operational complexity | Node‑level management | Fabric + appliance management needed |
| Cost profile | Lower entry price; duplication cost at scale | Higher appliance/fabric capex; lower marginal cost per TB/IOPS |
| Failure domain | Node failure impacts local data only | Appliance or fabric faults can have wider impact (mitigation via redundancy) |
Practical tuning and SLOs
- Measure 95th/99th percentile latencies for hot paths, and design buffers for network variability.
- Use prefetching and local cache layers (e.g., small NVMe cache on the host) when using disaggregated storage to mask network variance.
- Enforce QoS on the fabric and appliance to isolate noisy tenants or training traffic.
- Reproducible benchmarking matters: ensure you test with realistic concurrency and tail‑latency measurements rather than average throughput alone.
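The measurement advice above can be sketched as a small random-read probe that reports percentiles instead of averages. This is an illustration under assumptions: it uses `os.pread` (POSIX only), Python threads for concurrency, and a nearest-rank percentile; the block size, read count, and worker count are arbitrary defaults. It reads through the OS page cache, so repeated runs mostly measure memory; for real evaluations, use a purpose-built tool such as fio with direct I/O against the actual device or NVMe‑oF namespace.

```python
import math
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor


def percentile(samples, pct):
    """Nearest-rank percentile; `pct` is a whole number such as 95 or 99."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def timed_read(fd, offset, size):
    """Read `size` bytes at `offset` and return the latency in microseconds."""
    start = time.perf_counter_ns()
    os.pread(fd, size, offset)
    return (time.perf_counter_ns() - start) / 1000.0


def run_benchmark(path, reads=2000, block=4096, workers=8, seed=0):
    """Issue concurrent random block reads and report p50/p95/p99 latency."""
    blocks = os.path.getsize(path) // block
    if blocks < 1:
        raise ValueError("file is smaller than one block")
    rng = random.Random(seed)
    offsets = [rng.randrange(blocks) * block for _ in range(reads)]
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            latencies = list(pool.map(lambda off: timed_read(fd, off, block), offsets))
    finally:
        os.close(fd)
    return {f"p{p}": percentile(latencies, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # Demo against a throwaway 1 MiB file; point `path` at real data in practice.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(os.urandom(1 << 20))
        tmp.flush()
        for name, value in run_benchmark(tmp.name, reads=500).items():
            print(f"{name}: {value:.1f} us")
```

Running the same probe at several worker counts shows how the gap between p50 and p99 widens with concurrency, which is the behavior an average-throughput number hides.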
Key takeaways
- Local NVMe gives the simplest path to the lowest tail latency when the working set fits locally.
- Disaggregated all‑flash is better for utilization, multi‑tenant fleets, and independent scaling, but only if the fabric and appliance are engineered for low tail latency.
- Invest in realistic tail‑latency benchmarks, congestion control, and caching strategies before choosing a long‑term architecture.
- Consider hybrid approaches: local NVMe + shared disaggregated pool, or small host cache in front of NVMe‑oF.
Closing practical note
If you’re evaluating disaggregated all‑flash platforms, consider appliances that position themselves as "turning storage into an amplifier" for GPUs and provide reproducible third‑party benchmarks. For example, ZK‑Storage’s WS5000 is one such disaggregated all‑flash appliance aimed at inference and training use cases; see https://goni.top for vendor details and technical references. Validate any platform choice with proof‑of‑concept testing in your own environment, using your models, concurrency levels, and tail‑latency SLAs.