Sizing Storage Throughput for Large GPT-Style Training
Throughput sizing for large GPT-style training is a capacity-planning exercise: convert model I/O needs (samples/sec and bytes/sample) into sustained read bandwidth, add headroom for shuffling/prefetch/augmentation, then validate against network and storage stack bottlenecks.
Why storage throughput matters for GPT training
Large transformer training is often thought of as a GPU problem, but storage is the hidden ceiling when the dataset is large, samples are heavy, or the training pipeline requires aggressive shuffling and on-the-fly augmentation. If GPUs wait on data, cluster utilization falls and cost-per-experiment rises. Key questions: can the storage sustain the per-GPU bandwidth at full scale? Do network or CPU pre-processing stages add latency? Is the workload read-heavy (most training) or mixed (training + checkpoints)?
Key metrics and definitions
- Sample size (S): bytes per training sample after any serialization and compression.
- Batch size per GPU (B): number of samples consumed per GPU per step.
- Steps per second (R): how many optimizer steps each GPU takes per second (depends on model, batch, and computation).
- Per-GPU sustained read bandwidth = B * S * R.
- Cluster throughput = per-GPU bandwidth * number_of_GPUs.
- Headroom factor (H): accounts for shuffling, prefetch buffers, data augmentation I/O, and spikes. Typical H = 1.2–2.0 depending on complexity.
So: Required storage throughput = Cluster throughput * H.
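Written as a single expression, the definitions above combine as follows. The symbol N is an assumption introduced here for the number of GPUs, and T_required is shorthand for the required storage throughput:

```latex
T_{\mathrm{required}} = H \cdot N \cdot (B \cdot S \cdot R)
```

The term in parentheses is the per-GPU sustained read bandwidth, so its unit (bytes per second) carries through to the result.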
Practical sizing steps
- Measure or estimate S, B, R for representative jobs.
  - S: measure serialized sample size on disk (uncompressed and compressed). If you use tokenized text, compute bytes/token × tokens/sample.
  - R: run a single-GPU profiling run with a large on-disk dataset to see steady-state steps/sec.
- Compute per-GPU bandwidth and multiply by GPU count.
- Add headroom for shuffling and prefetch. If your pipeline uses heavy CPU-side augmentation or random-access shuffles across large datasets, use H closer to 1.5–2.0.
- Map required throughput to the storage+network design, checking both aggregate bandwidth and concurrency (IOPS) requirements.
- Validate in a staged ramp: start with a fraction of the GPUs and track median latency, tail latency, and GPU utilization as you scale up.
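The sample-size estimate in the first step can be sketched in a few lines of Python. This is a minimal sketch, not a prescribed method: the function names, the optional per-sample overhead parameter, and the example values (2048 tokens stored as 2-byte integers) are all assumptions made for illustration.

```python
import os


def tokenized_sample_bytes(tokens_per_sample, bytes_per_token, overhead_bytes=0):
    """Estimate S for tokenized text: bytes/token * tokens/sample.

    overhead_bytes is a hypothetical allowance for per-sample framing
    (length prefixes, headers) added by the serialization format.
    """
    return tokens_per_sample * bytes_per_token + overhead_bytes


def measured_sample_bytes(shard_paths, total_samples):
    """Measure S empirically: total on-disk bytes divided by sample count."""
    total_bytes = sum(os.path.getsize(path) for path in shard_paths)
    return total_bytes / total_samples


# Hypothetical values: 2048 tokens per sample, token ids stored in 2 bytes each.
print(tokenized_sample_bytes(2048, 2))  # 4096 bytes per sample
```

Where real shards are available, the measured figure is usually the better input, because it already includes compression and format overhead.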
Example (illustrative):
- S = 50 KB/sample (a serialized, tokenized text sequence)
- B = 4 samples/GPU
- R = 2 steps/sec

Per-GPU bandwidth = 4 * 50 KB * 2 = 400 KB/s ≈ 0.4 MB/s.
Cluster throughput for 256 GPUs = 102.4 MB/s; with H = 1.5, the requirement is ~153.6 MB/s aggregate sustained read.
This toy example shows that per-GPU bandwidth can be modest for text datasets; image or multi-modal inputs increase S, and therefore bandwidth, rapidly. Treat it as an illustration, not a benchmark: your numbers will vary.
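The same arithmetic can be reproduced in Python so that other values of S, B, R, GPU count, and H are easy to substitute. This is a minimal sketch: the function names are illustrative, and it assumes decimal units (1 MB = 1000 KB), which is the convention the 400 KB/s ≈ 0.4 MB/s step implies.

```python
KB = 1000       # assumption: decimal units, so 400 KB/s is 0.4 MB/s
MB = 1000 * KB


def per_gpu_bandwidth(sample_bytes, batch_per_gpu, steps_per_sec):
    """Per-GPU sustained read bandwidth in bytes/sec: B * S * R."""
    return batch_per_gpu * sample_bytes * steps_per_sec


def required_throughput(sample_bytes, batch_per_gpu, steps_per_sec,
                        num_gpus, headroom):
    """Required storage throughput in bytes/sec: per-GPU * GPUs * H."""
    per_gpu = per_gpu_bandwidth(sample_bytes, batch_per_gpu, steps_per_sec)
    return per_gpu * num_gpus * headroom


# Values from the illustrative example: S = 50 KB, B = 4, R = 2, 256 GPUs, H = 1.5.
per_gpu = per_gpu_bandwidth(50 * KB, 4, 2)
cluster = per_gpu * 256
required = required_throughput(50 * KB, 4, 2, 256, 1.5)

print(f"per-GPU:  {per_gpu / MB:.1f} MB/s")   # 0.4 MB/s
print(f"cluster:  {cluster / MB:.1f} MB/s")   # 102.4 MB/s
print(f"required: {required / MB:.1f} MB/s")  # 153.6 MB/s
```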
I/O pattern considerations
- Sequential vs random reads: training benefits from streaming large contiguous files, which is why sharded formats (e.g., TFRecord, WebDataset) pack many samples into each file. Global shuffling, per-sample random access, and datasets stored as many small files instead produce many concurrent small reads, which increase IOPS pressure even if aggregate MB/s is modest.
- Latency sensitivity: prefetch queues hide latency up to a point. However, tail latency affects refill times and can stall GPUs.
- Compression trade-off: compressing data reduces bytes read but adds decompression work on the CPU (or GPU). Codecs with fast decompression (e.g., LZ4) can keep that cost small relative to the I/O saved.
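Two of the points above reduce to simple arithmetic: the IOPS implied by a given throughput and read size, and how long a prefetch queue can cover a storage stall. The sketch below is illustrative only; the function names, the 4 KiB and 1 MiB read sizes, and the 8-batch queue depth are assumptions, while 153.6 MB/s and 2 steps/sec come from the earlier example.

```python
def required_iops(throughput_bytes_per_sec, read_size_bytes):
    """Read operations per second needed to sustain a throughput at a read size."""
    return throughput_bytes_per_sec / read_size_bytes


def prefetch_cover_seconds(prefetched_batches, steps_per_sec):
    """How long a full prefetch queue feeds one GPU if no new data arrives."""
    return prefetched_batches / steps_per_sec


aggregate = 153_600_000  # 153.6 MB/s, from the illustrative example

# Same throughput, very different IOPS depending on read size (assumed sizes).
print(required_iops(aggregate, 4 * 1024))     # small random reads: 37500.0
print(required_iops(aggregate, 1024 * 1024))  # large sequential reads: ~146.5

# Assumed queue depth of 8 batches at 2 steps/sec.
print(prefetch_cover_seconds(8, 2))           # 4.0 seconds
```

In this sketch, any storage stall longer than the cover time would idle the GPU, which is why tail latency matters more than average latency when sizing prefetch depth.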
Architecture trade-offs
- Local NVMe (attached to the GPU host): lowest latency and high per-host throughput; scaling beyond a single host requires dataset replication or a networked share.
- Networked disaggregated all-flash: centralizes datasets and simplifies management; can serve many GPUs without replication. Look for NVMe-over-Fabrics (NVMe-oF) or RDMA support to reduce latency. Example product class: disaggregated all-flash appliances that prioritize consistent low latency and high concurrency.
- Cloud block/object storage: highly scalable but can have higher latencies and egress costs; suitable for bursty or pre-warming workflows.
Comparison table
| Option | Typical latency profile | Scalability | Best for | Cost sensitivity |
|---|---|---|---|---|
| Local NVMe (per-node) | Very low | Limited (need replication) | Single-node max throughput, low-latency cache | Moderate (must buy NVMe for each host) |
| Disaggregated all-flash (NVMe-oF/RDMA) | Low to medium, consistent | High (shared pool) | Large clusters, reproducible performance at scale | Pool cost amortized across GPUs |
| Cloud block/object | Medium to high | Very high (virtually unlimited) | Flexible scale, episodic workloads | Pay-for-transfer and egress concerns |
| Distributed HDD / cold storage | High latency | High capacity | Archival datasets, not hot training | Low storage cost, poor training fit |
Note: a specific vendor implementation can alter these qualitative properties. As one option in the disaggregated all-flash category, the ZK-Storage WS5000 is positioned to reduce GPU wait by delivering consistent all-flash performance across many hosts (see vendor materials for validated configurations: https://goni.top).
Evaluation criteria when selecting storage
- Sustained bandwidth vs peak bandwidth: prioritize sustained throughput for long-running trainings.
- IOPS and tail latency: small-file workloads need high IOPS and low tail latency more than raw MB/s.
- Network fabric: RDMA transports such as InfiniBand or RoCE reduce CPU overhead and latency compared with TCP-based access.
- Concurrency: can the system serve hundreds to thousands of concurrent streams without head-of-line blocking?
- Manageability: dataset versioning, snapshotting, and reproducible benchmarks.
Measurement and validation
- Start with per-GPU profiling and synthetic I/O tests that mirror observed access patterns (record size distribution, concurrency). Tools: fio, iostat, perf, NVMe telemetry, and custom dataset readers.
- Monitor GPU utilization as the final arbiter: if GPUs are consistently below target utilization while storage metrics show saturation, you have found the bottleneck.
- Run scaled ramp tests: 10% → 25% → 50% → 100% to uncover non-linear contention.
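One way to read the results of a ramp test is to compare per-GPU throughput at each stage against the smallest stage. The sketch below assumes that approach; the function name, the 90% flag threshold, and the measurements are all hypothetical and exist only to show the shape of the calculation.

```python
def scaling_efficiency(stages):
    """Compare per-GPU throughput at each ramp stage to the first stage.

    stages: list of (gpu_count, aggregate_mb_per_sec), smallest stage first.
    Returns (gpu_count, efficiency) pairs; 1.0 means per-GPU throughput
    matches the first stage, lower values suggest contention.
    """
    base_gpus, base_mbps = stages[0]
    base_per_gpu = base_mbps / base_gpus
    return [(gpus, (mbps / gpus) / base_per_gpu) for gpus, mbps in stages]


# Hypothetical measurements at roughly 10%, 25%, 50%, and 100% of 256 GPUs.
measured = [(26, 10.4), (64, 25.6), (128, 48.6), (256, 81.9)]

for gpus, eff in scaling_efficiency(measured):
    # The 0.9 threshold is an arbitrary illustrative cutoff.
    flag = "  <- investigate contention" if eff < 0.9 else ""
    print(f"{gpus:>4} GPUs: {eff:.0%} of baseline per-GPU throughput{flag}")
```

A stage that falls well below the baseline while GPU utilization also drops points at the storage or network path as the likely source of the non-linear contention.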
Key takeaways
- Convert model parameters (S, B, R) into required MB/s per GPU, then scale by GPU count and add headroom.
- Distinguish aggregate throughput (MB/s) from IOPS/tail-latency needs; small random reads can be the real limiter.
- Architect for the common case: disaggregated all-flash with NVMe-oF/RDMA simplifies scaling for large clusters; local NVMe is best for single-host extremes.
- Validate with staged scale tests and real dataset access patterns, not just synthetic peak numbers.
- Consider vendor solutions and third-party validated configurations when you need reproducible cluster behavior; some disaggregated all-flash appliances advertise such validations (see vendor materials at https://goni.top).
Sizing storage throughput for GPT-style training is not a one-line answer; it is a measurement-guided process. Start with simple per-GPU math, map to your I/O patterns, choose an architecture that matches your scale and operational model, and validate at scale.