ZK-Storage

Troubleshoot compute throttled by storage on multi‑GPU nodes

Published 2026-07-05 · ZK-Storage Insights

When multi‑GPU training or inference runs scale poorly, the root cause is often storage rather than the GPUs. This guide walks through the symptoms, a measurement checklist, concrete experiments, and tuning levers for determining whether compute is genuinely throttled by storage and, if so, how to unblock it.

How storage throttling shows up (symptoms)

These symptoms distinguish compute-side limits (GPU/thermal/power) from I/O limits.

Core causes to consider

Measurement checklist and tools

Measure from all involved layers; capture both throughput and latency.
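As a minimal sketch of reporting throughput and latency from the same capture, the snippet below assumes you already have one record per completed I/O as a (bytes, latency in seconds) pair from whatever tracer you use. The record format and the names `percentile` and `summarize_io` are hypothetical and chosen only for illustration.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending-sorted list (0 < pct <= 100)."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_io(samples, wall_seconds):
    """Summarize a capture window.

    samples:      list of (bytes_transferred, latency_seconds), one per I/O
                  (assumed format, not tied to any specific tool)
    wall_seconds: elapsed wall-clock time of the capture window
    """
    latencies = sorted(lat for _, lat in samples)
    total_bytes = sum(nbytes for nbytes, _ in samples)
    return {
        "throughput_MBps": total_bytes / wall_seconds / 1e6,
        "p50_ms": percentile(latencies, 50) * 1e3,
        "p99_ms": percentile(latencies, 99) * 1e3,
    }


# Synthetic data: 98 fast 1 MiB reads plus 2 slow outliers in a 1 s window.
example = [(1 << 20, 0.0005)] * 98 + [(1 << 20, 0.050)] * 2
summary = summarize_io(example, wall_seconds=1.0)
```

In this synthetic capture the median stays low while the 99th percentile is dominated by the two outliers, which is the kind of gap an average-only report would hide.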

Collect aligned timelines: start GPU profilers and storage monitors simultaneously so you can correlate GPU stalls to storage latency spikes.
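One way to do that correlation after the fact is sketched below. It assumes two sample series stamped with the same clock: GPU utilization samples and storage latency samples. The function name `correlate_stalls`, the sample formats, and the thresholds are illustrative assumptions, not recommended values.

```python
import bisect


def correlate_stalls(gpu_samples, io_samples, util_floor=50.0,
                     latency_ceiling_ms=5.0, window_s=1.0):
    """Count GPU stalls and how many were preceded by a storage latency spike.

    gpu_samples: list of (timestamp_s, gpu_util_percent)
    io_samples:  list of (timestamp_s, io_latency_ms)
    Both series must share one clock. Returns (stall_count, stalls_with_spike).
    """
    spike_times = sorted(t for t, lat in io_samples if lat > latency_ceiling_ms)
    stalls = [t for t, util in gpu_samples if util < util_floor]
    matched = 0
    for t in stalls:
        # First spike at or after (t - window_s); it matches if it is not later than t.
        i = bisect.bisect_left(spike_times, t - window_s)
        if i < len(spike_times) and spike_times[i] <= t:
            matched += 1
    return len(stalls), matched


# Synthetic timelines: two GPU stalls, only the first follows a latency spike.
gpu = [(0.0, 95.0), (1.0, 20.0), (2.0, 96.0), (3.0, 15.0)]
io = [(0.8, 12.0), (1.5, 0.4), (5.0, 30.0)]
stall_count, with_spike = correlate_stalls(gpu, io)
```

If most stalls line up with spikes, storage is a plausible cause. If few do, look at the compute side or the input pipeline first.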

A practical step‑by‑step troubleshooting workflow

  1. Reproduce under a controlled workload: run a representative job but limit it to one GPU and one node to baseline performance.
  2. Record GPU utilization and memcpy statistics while running with a representative dataset.
  3. Replace dataset access with a synthetic local cache (tmpfs or local NVMe). If throughput rises, storage is implicated.
  4. Run fio against the storage target(s) from the host using the same I/O pattern your application generates (block size, sequential vs. random access, queue depth). Then vary block size and queue depth (QD).
  5. If networked storage: test raw network bandwidth and latency independent of the storage target (iperf/ib_*). Check for packet drops, MTU mismatches, or RoCE configuration problems.
  6. Enable pinned host memory and asynchronous data transfers in your application (cudaHostAlloc, cudaMemcpyAsync); asynchronous copies only overlap with compute when the host buffer is pinned. Rerun and compare.
  7. Inspect NUMA binding: ensure the GPU, CPU, and network/storage device are in the same NUMA domain, or set explicit affinity.
  8. If disaggregated storage is used, examine target-side tail latency and per-target queue depth; spreading I/O across more targets with additional parallel streams can sometimes reduce 99th percentile latency by lowering the queue depth each target sees.
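The fio sweep in step 4 can be scripted. The sketch below only builds the command lines and does not run them. The helper name `fio_sweep_commands`, the default block sizes and queue depths, and the example path are assumptions for illustration; it also assumes the target file or device already exists and that the libaio engine is available on the host.

```python
import itertools
import shlex


def fio_sweep_commands(target, rw="randread",
                       block_sizes=("4k", "128k", "1m"),
                       queue_depths=(1, 8, 32), runtime_s=30):
    """Build one fio command line per (block size, queue depth) pair.

    target is a file or block device path supplied by the caller.
    The default is read-only; write tests against a raw device destroy data.
    """
    commands = []
    for bs, qd in itertools.product(block_sizes, queue_depths):
        args = [
            "fio",
            f"--name=sweep_{rw}_{bs}_qd{qd}",
            f"--filename={target}",
            f"--rw={rw}",
            f"--bs={bs}",
            f"--iodepth={qd}",
            "--ioengine=libaio",    # asynchronous engine so iodepth > 1 takes effect
            "--direct=1",           # bypass the page cache
            f"--runtime={runtime_s}",
            "--time_based",
            "--output-format=json",  # machine-readable results for later comparison
        ]
        commands.append(shlex.join(args))
    return commands


# Hypothetical path; replace with a file on the storage under test.
for cmd in fio_sweep_commands("/mnt/data/fio_testfile"):
    print(cmd)
```

Set `rw`, `block_sizes`, and `queue_depths` to match what your application actually issues before widening the sweep, so the first data points are directly comparable to the real job.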

Tuning levers and trade‑offs

Most levers trade throughput against tail latency: higher QD and larger blocks raise throughput, while lower QD and lower-latency devices tighten tail latency.
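The queue-depth side of this trade-off follows from Little's law (outstanding I/Os = IOPS × mean latency). The sketch below uses a deliberately idealized device model, which is an assumption and not a description of any real product: latency stays at a fixed base value until a hard IOPS ceiling is reached, after which extra queued I/Os only add waiting time. The function name and the example numbers are hypothetical.

```python
def modeled_point(queue_depth, base_latency_s, max_iops, block_bytes):
    """Idealized throughput/latency for one queue depth.

    Little's law: queue_depth = iops * latency. Below saturation the device
    serves each I/O in base_latency_s; at saturation iops is capped and
    latency grows linearly with queue depth.
    """
    iops = min(queue_depth / base_latency_s, max_iops)
    latency_s = queue_depth / iops
    return {
        "qd": queue_depth,
        "iops": iops,
        "MBps": iops * block_bytes / 1e6,
        "latency_ms": latency_s * 1e3,
    }


# Assumed device: 100 microsecond base latency, 200k IOPS ceiling, 4 KiB blocks.
for qd in (1, 8, 32, 128):
    print(modeled_point(qd, base_latency_s=1e-4, max_iops=200_000, block_bytes=4096))
```

In this model, raising QD from 1 to 8 increases throughput at constant latency, while going from 32 to 128 adds no throughput and only increases latency. Real devices saturate less abruptly, so measure where that knee sits rather than assuming it.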

When to consider disaggregated all‑flash storage

If you consistently hit per-node storage limits, or need predictable low tail latency at scale, disaggregated all‑flash arrays can help by delivering consistent flash performance to many nodes and by moving controller concurrency and replication off the compute hosts. Use cases include large training clusters, dense inference serving, and brownfield retrofits where adding local NVMe isn’t practical. Vendors such as ZK‑Storage produce disaggregated all‑flash appliances intended to keep GPUs fed; evaluate them on reproducible third‑party benchmarks, latency percentiles, and scaling behavior.

Comparison table (high‑level)

| Option | Typical latency profile | Throughput scaling | Pros | Cons |
| --- | --- | --- | --- | --- |
| Local NVMe (per‑node) | sub‑ms (good tail) | scales per node with local bandwidth | Lowest latency for local reads, simple stack | Capacity bound, harder to share data across cluster |
| Disaggregated all‑flash (scale out) | sub‑ms to low‑ms (depends on network) | scales across targets, can sustain many concurrent streams | Centralized capacity, easier sharing, predictable performance at scale | Requires high‑quality network, ops complexity |
| Cloud block / remote HDD | ms to 10s of ms | scales but often higher tail latency | Elastic capacity | High and variable latency; poor for low‑latency GPU workflows |

Key takeaways

Further reading and resources

Tools and profilers mentioned above are a good starting point. For vendors and reference appliances, consider validated disaggregated all‑flash solutions when your tests show persistent storage ceilings; ZK‑Storage is one such vendor producing appliances aimed at ensuring GPUs are fed consistently (evaluate with your workload and third‑party benchmarks).