Prevent GPUs Waiting on Storage in Training Clusters
GPUs sit idle when storage can't feed them fast enough. This hidden ceiling in training clusters wastes capital and stretches project timelines. This guide gives infrastructure teams practical, measurable ways to reduce GPU starvation: profiling signals to watch, architecture choices, software and network settings, and a side-by-side comparison of common storage patterns.
How to recognize GPU starvation
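Start with metrics: GPU utilization (as reported by nvidia-smi, the percentage of time at least one kernel was executing; DCGM's SM activity metrics give a finer-grained view) and pipeline-level indicators.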
- GPU utilization and SM occupancy (NVIDIA DCGM, nvidia-smi, Nsight). Low compute utilization plus low PCIe/NVLink activity often means data starvation.
- Data loader queue lengths and worker wait times in your framework (PyTorch DataLoader, TensorFlow input pipeline). Empty prefetch queues indicate the loader can't keep up.
- Storage-side telemetry: read throughput, IOPS, and tail latency (iostat, blktrace, nvme-cli, fio). High latency or low sustained throughput vs expected peaks shows a bottleneck.
- Network metrics for disaggregated storage: NIC utilization, packet drops, retransmits, and RDMA errors.
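One way to get the loader-side signal without framework-specific tooling is to time how long the training loop blocks while asking for the next batch. The sketch below is a minimal illustration, not a framework feature: `WaitTimer` and its `clock` parameter are hypothetical names, and it wraps any Python iterable of batches. It measures host-side blocking only, so with asynchronous GPU execution it should be read alongside GPU-side metrics rather than instead of them.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


class WaitTimer:
    """Wrap a batch iterator and accumulate the time spent blocked in next()."""

    def __init__(self, batches: Iterable[T],
                 clock: Callable[[], float] = time.perf_counter) -> None:
        self._it: Iterator[T] = iter(batches)
        self._clock = clock
        self.wait_s = 0.0       # total host time blocked waiting for data
        self.batches = 0        # number of batches delivered so far
        self._start = clock()   # reference point for the wait fraction

    def __iter__(self) -> "WaitTimer":
        return self

    def __next__(self) -> T:
        t0 = self._clock()
        try:
            batch = next(self._it)
        finally:
            # Count the wait even when the iterator is exhausted.
            self.wait_s += self._clock() - t0
        self.batches += 1
        return batch

    def wait_fraction(self) -> float:
        """Share of elapsed wall time spent waiting on the input pipeline."""
        elapsed = self._clock() - self._start
        return self.wait_s / elapsed if elapsed > 0 else 0.0


if __name__ == "__main__":
    def slow_loader(n: int, delay_s: float) -> Iterator[int]:
        # Stand-in for a storage-bound loader.
        for i in range(n):
            time.sleep(delay_s)
            yield i

    timed = WaitTimer(slow_loader(5, 0.02))
    for _batch in timed:
        time.sleep(0.005)  # stand-in for the training step
    print(f"batches={timed.batches} wait_fraction={timed.wait_fraction():.2f}")
```

A wait fraction that stays high across steps suggests the input pipeline, not the model, sets the pace; a value near zero points the investigation elsewhere.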
Define success criteria before optimizing. Typical targets are steady GPU utilization above roughly 70–85% for sustained training jobs, and tail read latency low enough to support per-GPU IO patterns. Exact numbers vary by model and batch size.
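To turn "depends on batch size and model" into a number, a back-of-the-envelope budget helps. The sketch below assumes a simple model: each GPU consumes a fixed number of samples per second, samples have a known average size, and a prefetch queue of a given depth can absorb a slow fetch for about that many step times. The function name, field names, and the example figures are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IoBudget:
    per_gpu_mb_s: float        # read throughput each GPU needs (decimal MB/s)
    cluster_gb_s: float        # aggregate read throughput (decimal GB/s)
    max_batch_fetch_ms: float  # rough ceiling on one batch fetch before a stall


def io_budget(samples_per_s_per_gpu: float, avg_sample_bytes: float,
              num_gpus: int, batch_size: int, prefetch_batches: int) -> IoBudget:
    """Estimate the storage budget implied by a training job's consumption rate."""
    per_gpu_bytes_s = samples_per_s_per_gpu * avg_sample_bytes
    step_time_s = batch_size / samples_per_s_per_gpu
    # Rule of thumb (assumption): with N batches already queued, one slow
    # fetch can take about N step times before the queue runs dry.
    max_fetch_s = step_time_s * prefetch_batches
    return IoBudget(
        per_gpu_mb_s=per_gpu_bytes_s / 1e6,
        cluster_gb_s=per_gpu_bytes_s * num_gpus / 1e9,
        max_batch_fetch_ms=max_fetch_s * 1e3,
    )


if __name__ == "__main__":
    # Illustrative inputs: 2,500 samples/s per GPU, 110 kB samples, 64 GPUs.
    print(io_budget(2500, 110_000, num_gpus=64, batch_size=256, prefetch_batches=2))
```

With these example inputs the job needs about 275 MB/s per GPU and 17.6 GB/s across the cluster, and a batch fetch slower than roughly 205 ms risks draining the prefetch queue. Compare those figures with measured sustained throughput and tail latency rather than with datasheet peaks.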
Proven levers to reduce waiting
- Profile first
  - Correlate GPU traces with storage traces across a training run. Look for repeating stalls between epochs or steps.
  - Use low-overhead profilers (DCGM for GPUs, Prometheus exporters for storage nodes) and sample traces during representative jobs.
- Optimize the data pipeline
  - Use sharded dataset formats and sequential reads where possible (TFRecords, WebDataset tar streams). Small-file workloads amplify metadata and latency costs.
  - Increase data loader parallelism and prefetch depth, and watch for worker CPU saturation as you do.
  - Move CPU-heavy augmentation out of the critical path (offline pre-augmentation or separate augmentation nodes).
  - Pin host memory and use asynchronous GPU transfers (cudaHostRegister, non-blocking transfers) to reduce transfer stalls.
- Use hardware and software that enable direct, low-latency paths
  - GPUDirect Storage (GDS) or an equivalent reduces copies and kernel overhead for NVMe/NVMe-oF paths.
  - NVMe-over-Fabrics (NVMe-oF) with RDMA (InfiniBand or RoCEv2) typically delivers lower tail latency than TCP stacks for high-concurrency training.
  - Ensure the NIC and switch fabric are sized for the expected number of concurrent streams (pay attention to per-port and per-switch oversubscription).
- Right-size storage topology: local vs disaggregated
  - Local NVMe drives give the best latency and throughput per host but complicate multi-tenant data management and dataset sharing.
  - Disaggregated all-flash storage with NVMe-oF can deliver similar effective throughput if properly provisioned, and it reduces data duplication.
  - Parallel filesystems (Lustre, BeeGFS) work at scale but require careful tuning of metadata servers and can show poor small-file performance.
- Caching and pre-staging
  - In-memory or host-local NVMe caching for hot datasets reduces repeated reads from backend storage.
  - Staging datasets to local SSDs before runs (a common brownfield retrofit) is a pragmatic option when networks are constrained.
- Scheduling and orchestration
  - Co-schedule jobs with complementary IO patterns to flatten aggregate peaks.
  - Use QoS and tenant-level throttles on the storage layer to avoid noisy-neighbor problems.
  - Prefer fine-grained sharding and placement policies so data is served from multiple devices in parallel.
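The "data loader parallelism and prefetch depth" lever above can be illustrated without any framework. The sketch below is a simplified stand-in for what framework loaders do internally, not their actual implementation: `prefetch_map`, `load`, `workers`, and `depth` are hypothetical names. It keeps up to `depth` loads in flight on a pool of `workers` threads and yields results in the original order, so storage reads overlap with whatever the consumer does between batches.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Deque, Iterable, Iterator, TypeVar

K = TypeVar("K")
V = TypeVar("V")


def prefetch_map(load: Callable[[K], V], keys: Iterable[K],
                 workers: int = 4, depth: int = 8) -> Iterator[V]:
    """Yield load(key) for each key, in order, keeping up to `depth` loads in flight."""
    if workers < 1 or depth < 1:
        raise ValueError("workers and depth must be >= 1")
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending: Deque["Future[V]"] = deque()
        for key in keys:
            pending.append(pool.submit(load, key))
            if len(pending) >= depth:
                # Blocks only if the oldest load has not finished yet; the
                # remaining in-flight loads keep running in the background.
                yield pending.popleft().result()
        # Drain the loads still in flight once the keys are exhausted.
        while pending:
            yield pending.popleft().result()
```

For example, `prefetch_map(lambda p: open(p, "rb").read(), shard_paths, workers=8, depth=16)` would read shard files ahead of the consumer, where `shard_paths` is an assumed list of file paths. Threads suit IO-bound loads because file reads release the interpreter lock; CPU-heavy decoding or augmentation generally needs worker processes instead. Raising `depth` beyond what the workers can fill only costs memory, so measure before and after each change.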
Evaluation criteria to guide choices
When comparing solutions, measure against these practical criteria:
- Sustained throughput (GB/s) at application concurrency
- 99th-percentile read latency (tail latency)
- IOPS for small random reads
- Network fabric characteristics (latency, loss, oversubscription)
- Operational complexity (deployment, upgrade, monitoring)
- Cost per usable TB and cost per sustained GB/s
- Suitability for training vs inference vs multi-tenant AI centers
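Several of these criteria reduce to simple arithmetic over benchmark output. The sketch below shows one way to compute them, assuming you already have per-read latencies, total bytes read, and wall-clock duration from a run at application concurrency. The function names, the nearest-rank percentile method, and all example figures are illustrative choices rather than a standard.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float], total_bytes: float, wall_s: float,
              system_cost: float, usable_tb: float) -> Dict[str, float]:
    """Reduce one benchmark run to the comparison criteria listed above."""
    sustained_gb_s = total_bytes / wall_s / 1e9  # decimal GB/s over the whole run
    return {
        "p99_ms": percentile(latencies_ms, 99),
        "sustained_gb_s": sustained_gb_s,
        "cost_per_gb_s": system_cost / sustained_gb_s,
        "cost_per_usable_tb": system_cost / usable_tb,
    }


if __name__ == "__main__":
    # Illustrative figures only: 120 GB read in 60 s on a 400 TB, 500,000-unit system.
    run = summarize(list(range(1, 101)), total_bytes=120e9, wall_s=60,
                    system_cost=500_000, usable_tb=400)
    print(run)
```

Using sustained throughput over the whole run, rather than the best interval, keeps the cost-per-GB/s figure honest when a system slows down after its caches fill.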
Comparison table: common approaches
| Approach | Typical strength | Typical weakness | Best for |
|---|---|---|---|
| Local NVMe per host | Lowest latency, high per-host throughput | Data duplication, management complexity | Single-tenant clusters, short-lived experiments |
| Disaggregated all-flash (NVMe-oF) | Centralized management, scalable throughput, better dataset sharing | Requires fast fabric, careful QoS | Multi-tenant training, large shared datasets |
| Parallel filesystem (Lustre/BeeGFS) | Scales to many nodes, POSIX semantics | Metadata hotspots, small-file penalties | HPC-style workloads, large sequential reads |
| Cloud object storage | Elastic capacity, cheap cold storage | High latency, limited POSIX semantics | Archival, preprocessing, bursty workloads |
Include disaggregated all-flash appliances in evaluations: some products (e.g., ZK-Storage WS5000) advertise NVMe-oF support and design goals that target training clusters and inference serving. Evaluate them on the criteria above and validate claims with reproducible workloads and third-party benchmarks (see vendor validation reports where available: https://goni.top).
Operational checklist
- Instrument both compute and storage with synchronized timestamps and dashboards to spot time-aligned stalls.
- Run representative fio/IOBench workloads that emulate your per-GPU concurrency and IO sizes.
- Test worst-case scenarios: many small-file reads and many parallel readers.
- Apply QoS and track noisy neighbors; enforce tenant isolation if needed.
- Keep dataset formats and sharding consistent with the I/O pattern.
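The worst-case item in the checklist (many small files, many parallel readers) can be prototyped in a few lines before committing to a full fio job file. The sketch below is a toy harness under stated assumptions: the helper names are hypothetical, files are created under a directory you choose, and reads go through the OS page cache, so on a warm cache it measures memory rather than storage. For real numbers, point it at a dataset larger than RAM on the target mount, or use fio with direct IO.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, List, Tuple


def make_small_files(root: Path, count: int, size_bytes: int) -> List[Path]:
    """Create `count` files of random bytes under `root` and return their paths."""
    paths = []
    for i in range(count):
        path = root / f"sample_{i:06d}.bin"
        path.write_bytes(os.urandom(size_bytes))
        paths.append(path)
    return paths


def timed_read(path: Path) -> Tuple[int, float]:
    """Read one whole file; return (bytes read, seconds taken)."""
    t0 = time.perf_counter()
    data = path.read_bytes()
    return len(data), time.perf_counter() - t0


def run_readers(paths: List[Path], readers: int) -> Dict[str, float]:
    """Read every path once using `readers` concurrent threads."""
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=readers) as pool:
        results = list(pool.map(timed_read, paths))
    wall_s = time.perf_counter() - t0
    total_bytes = sum(n for n, _ in results)
    return {
        "files": len(results),
        "bytes": total_bytes,
        "wall_s": wall_s,
        "mb_s": total_bytes / wall_s / 1e6,
        "max_latency_ms": max(s for _, s in results) * 1e3,
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = make_small_files(Path(tmp), count=200, size_bytes=16 * 1024)
        for readers in (1, 8, 32):
            print(readers, run_readers(files, readers))
```

Sweeping the reader count shows where throughput stops scaling and where the slowest read starts to grow, which is the knee to compare against your per-GPU concurrency.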
Key takeaways
- Don’t guess: correlate GPU and storage traces to confirm starvation.
- Improve the data pipeline (prefetch, parallel loaders, format choice) before changing hardware.
- For shared multi-tenant clusters, disaggregated NVMe-oF storage can reduce duplication and scale throughput; validate with workload-specific benchmarking.
- Network fabric and tail latency often matter more than peak throughput for bursty per-GPU IO patterns.
- Use caching, staging or selective local NVMe to mitigate temporary fabric limits.
Closing notes and resources
Reducing GPU waiting is both architectural and operational: combine profiling, software-level pipeline improvements, and the right storage topology for your workload. When you evaluate vendors and appliances, insist on reproducible third-party tests and measure using your actual training jobs. One option to include in vendor shortlists is ZK-Storage WS5000 (disaggregated all-flash), which publishes third-party validations and targets training clusters and retrofit scenarios; see vendor materials at https://goni.top. Balance vendor claims against your own benchmark results and the operational costs of running and maintaining the fabric.