Troubleshoot compute throttled by storage on multi‑GPU nodes

Published 2026-07-05 · ZK-Storage Insights

When multi‑GPU training or inference runs scale poorly, the root cause is often storage rather than the GPUs. This guide walks through the symptoms, a measurement checklist, concrete experiments, and tuning levers so you can determine whether compute is genuinely throttled by storage and how to unblock it.

How storage throttling shows up (symptoms)

Low GPU SM or memory‑copy utilization despite jobs queued across GPUs.
High CPU iowait, low PCIe Rx/Tx throughput, or frequent data-loading stalls reported by GPU profilers.
Long end‑to‑end batch times that improve dramatically when data is cached locally.
Storage-side indicators: sustained high IOPS/queue depth, rising request latency (tail latency), and saturated network links for NVMe-oF/RDMA.

Taken together, these symptoms point to an I/O limit rather than a compute-side limit (GPU, thermal, or power throttling).

Core causes to consider

I/O pattern mismatch: many small random reads (training with small shard sizes) lead to high latency per request.
Insufficient request concurrency: low application queue depth or synchronous I/O calls.
Inefficient host‑GPU data path: lack of pinned host memory, synchronous cudaMemcpy, or missing GPUDirect integration.
Network/storage bottlenecks: NIC saturation, RDMA misconfiguration (RoCE flow control such as PFC/ECN, MTU), switch congestion, or CPU bottlenecks handling I/O.
Topology/NUMA problems: storage devices or NICs attached to a different CPU socket (NUMA node) than the GPU they feed.
Device limits: shared protocol or controller limits on concurrent streams (e.g., NVMe controller queue limits).
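
To make the topology check concrete, here is a minimal sketch that reads the NUMA node Linux reports for each PCI device from sysfs (`/sys/bus/pci/devices/<address>/numa_node`). The function names and the PCI addresses in the example are hypothetical; look up real addresses with lspci or nvidia-smi. A value of -1 in sysfs means the platform did not report a node, which the sketch treats as unknown.

```python
from pathlib import Path


def pci_numa_node(bdf, sysfs_root="/sys"):
    """Return the NUMA node of a PCI device, or None if unknown or unreadable."""
    path = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    # The kernel reports -1 when no NUMA affinity is known.
    return node if node >= 0 else None


def check_numa_alignment(devices, sysfs_root="/sys"):
    """Map each label to its NUMA node and report whether all devices share one.

    devices: dict of label -> PCI address (domain:bus:device.function).
    Returns (nodes, same_node); same_node is False if any node is unknown.
    """
    nodes = {label: pci_numa_node(bdf, sysfs_root) for label, bdf in devices.items()}
    values = list(nodes.values())
    same_node = bool(values) and None not in values and len(set(values)) == 1
    return nodes, same_node


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    example = {"gpu0": "0000:3b:00.0", "nvme0": "0000:5e:00.0"}
    nodes, same_node = check_numa_alignment(example)
    print(nodes, "same NUMA node" if same_node else "not confirmed on one NUMA node")
```

If the devices land on different nodes, every read crosses the inter-socket link before it reaches the GPU, which is worth ruling out before tuning anything else.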

Measurement checklist and tools

Measure from all involved layers; capture both throughput and latency.

GPU layer: nvidia-smi (utilization, memory usage), NVIDIA Nsight Systems / Metrics (SM util, memory throughput, memcpy throughput, PCIe counters).
Host OS: iostat, sar, vmstat, perf, blktrace, ioping (latency), fio (microbench), dstat.
Network: ethtool (link speed, offloads), iperf/ib_read_bw/ib_write_bw (throughput), ibv_devinfo (device and port state), perfquery (InfiniBand port counters), /proc/interrupts (IRQ distribution).
Storage target: controller counters, NVMe SMART, target telemetry (queue depth, active commands, latency percentiles).

Collect aligned timelines: start GPU profilers and storage monitors simultaneously so you can correlate GPU stalls to storage latency spikes.
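
One way to do that correlation, sketched under simple assumptions: both monitors emit (timestamp in seconds, value) samples on a shared clock, the samples are averaged into fixed-width time buckets, and a Pearson coefficient is computed over the buckets both series cover. The function names and the sample values are illustrative, not output from any real tool.

```python
from collections import defaultdict
from math import sqrt


def bucket_mean(samples, width_s=1.0):
    """Average (timestamp_s, value) samples into fixed-width time buckets."""
    acc = defaultdict(lambda: [0.0, 0])
    for t, value in samples:
        bucket = int(t // width_s)
        acc[bucket][0] += value
        acc[bucket][1] += 1
    return {bucket: total / count for bucket, (total, count) in acc.items()}


def pearson(xs, ys):
    """Pearson correlation, or None if it is undefined for the inputs."""
    n = len(xs)
    if n < 2:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return None
    return cov / sqrt(vx * vy)


def correlate_timelines(gpu_util, storage_latency, width_s=1.0):
    """Correlate GPU utilization with storage latency over shared time buckets."""
    gpu = bucket_mean(gpu_util, width_s)
    sto = bucket_mean(storage_latency, width_s)
    common = sorted(gpu.keys() & sto.keys())
    return pearson([gpu[b] for b in common], [sto[b] for b in common])


if __name__ == "__main__":
    # Illustrative samples: utilization dips in the same second latency spikes.
    gpu_util = [(0.1, 90), (1.2, 85), (2.3, 30), (3.1, 88)]
    latency_ms = [(0.5, 0.4), (1.5, 0.5), (2.5, 6.0), (3.5, 0.4)]
    print(correlate_timelines(gpu_util, latency_ms))
```

A strongly negative coefficient (utilization falls when latency rises) is consistent with storage stalls; a value near zero suggests looking elsewhere. It is a screening signal, not proof, so confirm against the profiler timeline.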

A practical step‑by‑step troubleshooting workflow

Reproduce under a controlled workload: run a representative job but limit it to one GPU and one node to baseline performance.
Record GPU util + memcpy statistics while running with a representative dataset.
Replace dataset access with a synthetic local cache (tmpfs or local NVMe). If throughput rises, storage is implicated.
Run fio on the storage target(s) from the host with the same I/O pattern (block size, seq/rnd, QD) your application shows. Vary block size and queue depth.
If storage is networked, test raw network bandwidth and latency independently of the storage target (iperf/ib_*). Check for packet drops, MTU mismatches, or RoCE configuration problems.
Enable pinned host memory and asynchronous data transfers in your application (cudaHostAlloc, cudaMemcpyAsync); cudaMemcpyAsync only overlaps with compute when the host buffer is pinned. Rerun and compare.
Inspect NUMA binding: ensure the GPU, CPU, and network/storage device are on the same NUMA domain or use explicit affinity.
If disaggregated storage is used, examine target-side tail latency and per-target queue depth; sometimes adding more parallel streams reduces 99th percentile latency.
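
The fio step in the workflow above lends itself to a small sweep. The sketch below only builds the command lines for a grid of block sizes and queue depths so they can be reviewed before anything runs; the helper name, the default grid, and the target path are assumptions. It uses read-only workloads with direct I/O through the libaio engine and assumes the target is an existing file or block device.

```python
import itertools
import shlex


def fio_sweep(target, block_sizes=("4k", "128k", "1m"), queue_depths=(1, 8, 32),
              rw="randread", runtime_s=30):
    """Build one read-only fio command per (block size, queue depth) pair."""
    commands = []
    for bs, qd in itertools.product(block_sizes, queue_depths):
        commands.append([
            "fio",
            f"--name={rw}-bs{bs}-qd{qd}",
            f"--filename={target}",
            f"--rw={rw}",
            f"--bs={bs}",
            f"--iodepth={qd}",
            "--ioengine=libaio",     # asynchronous engine so iodepth takes effect
            "--direct=1",            # bypass the page cache
            "--readonly",            # safety check: refuse to issue writes
            "--time_based",
            f"--runtime={runtime_s}",
            "--output-format=json",  # latency percentiles are in the JSON output
        ])
    return commands


if __name__ == "__main__":
    # Hypothetical target path; point it at a file or device on the storage under test.
    for cmd in fio_sweep("/mnt/dataset/fio-testfile"):
        print(shlex.join(cmd))
```

Run each command from the same host and NUMA node as the training job, then compare throughput and latency percentiles across the grid to see where the storage path stops scaling.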

Tuning levers and trade‑offs

Increase request size: larger sequential reads yield higher throughput but increase read amplification and memory pressure.
Increase application I/O concurrency (queue depth or number of worker threads): this can raise total throughput but can also raise tail latency and load on controllers.
Use pinned (page‑locked) host memory to enable asynchronous DMA transfers, or GPUDirect Storage/cuFile to bypass the host bounce buffer: both reduce latency and CPU overhead but can complicate memory management.
Adjust kernel I/O settings (I/O scheduler, readahead, nr_requests) and NVMe queue size. Changes can help throughput but risk hurting latency for mixed workloads.
Move hot shards to local NVMe or a faster tier: reduces network dependency but uses local capacity and complicates scheduling.
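
Before changing any kernel I/O setting, it helps to record the current values. The sketch below reads three block-queue settings that Linux exposes under `/sys/block/<device>/queue/` (scheduler, read_ahead_kb, nr_requests). The helper names and the device name in the example are assumptions; the sketch only reads and never writes.

```python
import re
from pathlib import Path

QUEUE_SETTINGS = ("scheduler", "read_ahead_kb", "nr_requests")


def read_queue_settings(device, sysfs_root="/sys"):
    """Return the raw sysfs queue settings for a block device (None if unreadable)."""
    queue_dir = Path(sysfs_root) / "block" / device / "queue"
    settings = {}
    for name in QUEUE_SETTINGS:
        try:
            settings[name] = (queue_dir / name).read_text().strip()
        except OSError:
            settings[name] = None
    return settings


def active_scheduler(raw):
    """Extract the active scheduler, which sysfs marks with square brackets."""
    if raw is None:
        return None
    match = re.search(r"\[([^\]]+)\]", raw)
    return match.group(1) if match else raw


if __name__ == "__main__":
    # Hypothetical device name; list /sys/block to find the real ones.
    current = read_queue_settings("nvme0n1")
    print(current, "active scheduler:", active_scheduler(current["scheduler"]))
```

Saving this snapshot alongside each benchmark run makes it possible to attribute a throughput or latency change to a specific setting.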

Most of these levers trade throughput against tail latency: higher QD and larger blocks favor throughput, while lower QD and lower-latency devices favor tail latency.
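
The arithmetic behind that trade-off follows from Little's law: sustained IOPS cannot exceed the number of requests in flight divided by the average latency per request. The sketch below turns that into an upper-bound estimate. The numbers in the example are illustrative assumptions, and real devices fall short of the bound because latency usually rises as queue depth grows.

```python
import math

MIB = 2 ** 20


def iops_ceiling(queue_depth, latency_s):
    """Little's law upper bound: requests in flight / average latency per request."""
    return queue_depth / latency_s


def throughput_ceiling_mib_s(queue_depth, block_bytes, latency_s):
    """Upper bound on throughput in MiB/s for a given queue depth and block size."""
    return iops_ceiling(queue_depth, latency_s) * block_bytes / MIB


def queue_depth_needed(target_mib_s, block_bytes, latency_s):
    """Smallest queue depth whose ceiling reaches the target throughput."""
    exact = target_mib_s * MIB * latency_s / block_bytes
    # The small epsilon absorbs floating-point noise on exact integer results.
    return max(1, math.ceil(exact - 1e-9))


if __name__ == "__main__":
    # Illustrative: one 4 KiB read in flight at 200 microseconds average latency.
    print(throughput_ceiling_mib_s(1, 4096, 200e-6))
    # Illustrative: queue depth needed for 2000 MiB/s with 128 KiB reads at 0.5 ms.
    print(queue_depth_needed(2000, 128 * 1024, 0.0005))
```

With a single synchronous 4 KiB read at 200 microseconds, the ceiling is about 19.5 MiB/s no matter how fast the device is, which is why synchronous small-block loaders starve GPUs.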

When to consider disaggregated all‑flash storage

If you constantly hit per-node storage limits, or want predictable low tail latency at scale, disaggregated all‑flash arrays can help by delivering consistent NAND performance and by offloading controller concurrency and replication. Use cases include large training clusters, dense inference serving, and brownfield retrofits where adding local NVMe isn’t practical. Vendors such as ZK‑Storage produce disaggregated all‑flash appliances intended to keep GPUs fed; evaluate them on reproducible third‑party benchmarks, latency percentiles, and scaling behavior.

Comparison table (high‑level)

Option	Typical latency profile	Throughput scaling	Pros	Cons
Local NVMe (per‑node)	sub‑ms (good tail)	scales per node with local bandwidth	Lowest latency for local reads, simple stack	Capacity bound, harder to share data across cluster
Disaggregated all‑flash (scale out)	sub‑ms to low‑ms (depends on network)	scales across targets, can sustain many concurrent streams	Centralized capacity, easier sharing, predictable performance at scale	Requires high‑quality network, ops complexity
Cloud block / remote HDD	ms to tens of ms	scales but often higher tail latency	Elastic capacity	High and variable latency; poor for low‑latency GPU workflows

Key takeaways

Always correlate GPU profiler timelines with storage metrics before concluding the GPUs are at fault.
Reproduce with a local cache and synthetic fio workloads to isolate storage vs compute.
Tune request size, concurrency, pinned memory, and NUMA binding; be mindful of throughput vs tail‑latency trade‑offs.
If node‑local fixes aren’t sufficient at scale, evaluate disaggregated all‑flash platforms and validate with reproducible benchmarks.