Benchmarking Methodology for Reproducible GPU Storage Performance
GPU-accelerated workloads shift bottlenecks: top-tier GPUs can idle while storage struggles to keep them fed. Reproducible benchmarking of storage performance for GPU workloads requires more than a single fio run — it requires controlled experiments, consistent datasets, and full-stack observability. This guide gives a practical, repeatable methodology tuned for clusters used for training, inference serving, checkpointing, and mixed AI workloads.
Scope and objectives
Define what you measure and why. Typical objectives include:
- Maximize sustained throughput for large-batch training.
- Minimize tail latency for inference serving.
- Measure checkpoint/restore throughput and impact on training iterations.
- Quantify GPU utilization improvements when storage is no longer the bottleneck.
Document expected system roles (training nodes, inference nodes), network topology (Ethernet vs. RDMA), and storage architecture (local NVMe, disaggregated NVMe-oF, all-flash arrays). For disaggregated all-flash platforms, include network fabric and controller firmware versions in the test matrix — these change results materially.
Note: solutions such as the ZK-Storage WS5000 position themselves as disaggregated all-flash platforms designed to improve GPU utilization. Fold any such claims, including externally validated ones, into a reproducible test plan and verify them independently rather than treating them as authoritative.
Variables you must control and record
- Hardware: GPU model and firmware, CPU model and frequency scaling governor, NIC model and firmware, storage controller/SSD model and firmware.
- Topology: PCIe lanes, NUMA affinity, NVMe-oF target mapping, switch ports, RDMA vs TCP transport.
- Software stack: OS kernel version, filesystem and mount options, NVMe driver versions, container runtime, ML framework versions and deterministic seeds.
- Dataset state: cold vs warm cache, compressed vs raw, dataset layout (many small files vs large shard files), prefetching enabled/disabled.
- Concurrency: number of concurrent readers/writers, process vs thread model, batch sizes, data loader workers.
- Test duration and warm-up period: ensure steady-state measurements — short spikes mislead.
Capture these as machine-readable metadata alongside results (YAML/JSON) so runs are comparable across time or between teams.
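A minimal sketch of such a metadata record, assuming a Linux host: the field names (`run_id`, `dataset_state`, `transport`) and the helper names are illustrative choices, not a standard schema. Tools that are not installed simply yield `null` fields instead of failing the run.

```python
import json
import os
import platform
import subprocess
from datetime import datetime, timezone


def run_or_none(cmd):
    """Return stripped stdout of cmd, or None if the tool is missing or fails."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.SubprocessError):
        return None


def read_or_none(path):
    """Return the stripped contents of a small text file, or None if unreadable."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return None


def collect_metadata(run_id, extra=None):
    """Gather a minimal environment record; extend with site-specific fields."""
    meta = {
        "run_id": run_id,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "kernel": platform.release(),
        "cpu_count": os.cpu_count(),
        # Standard Linux sysfs path for the cpu0 frequency governor.
        "cpu_governor": read_or_none("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"),
        # None on hosts without an NVIDIA driver installed.
        "gpus": run_or_none(["nvidia-smi", "--query-gpu=name,driver_version,vbios_version",
                             "--format=csv,noheader"]),
        "fio_version": run_or_none(["fio", "--version"]),
    }
    meta.update(extra or {})  # test-specific variables supplied by the caller
    return meta


if __name__ == "__main__":
    record = collect_metadata("example-run-001", {"dataset_state": "cold", "transport": "rdma"})
    print(json.dumps(record, indent=2, sort_keys=True))
```

Writing this record next to every result file keeps the run self-describing even when results are copied between teams.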
Measurement types and metrics
Collect both system-level and workload-level metrics, sampled at consistent intervals (1 s or finer for GPU and system counters). Derive tail latency from per-request samples or histograms rather than from interval averages, which hide outliers:
- I/O: throughput (MB/s), IOPS (read/write), average and tail latency (p50/p95/p99, and p99.9 if possible), queue depth distribution.
- Storage internals: SSD latency breakdown (read vs write vs GC), temperature, SMART indicators, host queue depth.
- Network: link saturation, retransmits, RDMA metrics (queue pairs, latency).
- Compute: GPU utilization (SM/util), GPU memory bandwidth, PCIe utilization, CPU utilization, context-switch rate.
- Application: samples/sec (training), latency percentiles (inference), time-to-checkpoint, time-to-restore.
Synchronize clocks across nodes (NTP/chrony) and always preserve raw time series and logs (Prometheus TSDB, Elastic, or raw CSVs) so post hoc analysis and p99 calculations are reproducible.
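As an illustration of why raw samples matter, the sketch below recomputes latency percentiles from preserved per-request latencies using the nearest-rank definition. The function names and the microsecond unit are assumptions for the example; other percentile definitions (for instance, interpolating ones) give slightly different values, so record which one was used.

```python
import math


def percentile(sorted_samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not sorted_samples:
        raise ValueError("no samples")
    # round() guards against float noise such as 999.0000000000001 before ceil().
    rank = max(1, math.ceil(round(p * len(sorted_samples) / 100.0, 9)))
    return sorted_samples[rank - 1]


def latency_summary(latencies_us):
    """Summarize raw per-request latencies (assumed to be in microseconds)."""
    ordered = sorted(latencies_us)
    levels = [("p50", 50), ("p95", 95), ("p99", 99), ("p99.9", 99.9)]
    return {name: percentile(ordered, p) for name, p in levels}


if __name__ == "__main__":
    print(latency_summary(range(1, 1001)))
```

Percentiles from separate runs cannot be averaged into a combined percentile; pool the raw samples (or mergeable histograms) and recompute instead.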
Test types and tooling
Use multiple complementary test types to triangulate behavior:
- Microbenchmarks: fio/ior for fine-grained control of block sizes, queue depth, sync/async, and read/write mixes. Use realistic block sizes (128 KB to 4 MB) for large sequential loads and smaller ones (4 KB to 64 KB) for small-file and metadata-heavy cases.
- Application-level runs: run real training epochs or inference workloads (PyTorch/TensorFlow) with representative batch sizes and data loader settings. Instrument the application to record step times and IO wait times.
- Trace-driven replay: capture workload IO traces during a representative run and replay them (e.g., using bpftrace-based capture + custom replayers) to assess storage under identical IO patterns.
- End-to-end scenarios: multi-job, multi-tenant mixes that reflect production concurrency and backpressure.
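For the application-level runs, one way to record step times and I/O wait is to time how long each iteration blocks on the data source versus how long the step itself takes. This is a framework-agnostic sketch: `timed_steps`, `train_step`, and the record keys are hypothetical names, and `batches` stands in for any iterable such as a data loader.

```python
import time


def timed_steps(batches, train_step):
    """Iterate batches, recording time spent waiting for data and time spent in the step."""
    records = []
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # blocks while the input pipeline fetches data
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)
        t2 = time.perf_counter()
        records.append({"data_wait_s": t1 - t0, "compute_s": t2 - t1})
    return records


def io_wait_fraction(records):
    """Fraction of total wall time spent waiting on the data source."""
    wait = sum(r["data_wait_s"] for r in records)
    total = wait + sum(r["compute_s"] for r in records)
    return wait / total if total > 0 else 0.0


if __name__ == "__main__":
    def slow_batches():
        for i in range(5):
            time.sleep(0.01)  # stand-in for storage latency
            yield i

    recs = timed_steps(slow_batches(), lambda b: time.sleep(0.005))
    print(f"I/O wait fraction: {io_wait_fraction(recs):.2f}")
```

On GPUs, kernel launches are usually asynchronous, so the step time is only meaningful if the step synchronizes with the device before returning (in PyTorch, for example, by calling `torch.cuda.synchronize()`); otherwise compute time leaks into the next iteration's data wait.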
Combine standard tools (fio, ior, MLPerf reference or scaled-down workloads, nvidia-smi, nvstat/nvme-cli, perf, iostat) with observability stacks (Prometheus + Grafana). Store raw results, tooling versions, and exact commands.
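To keep exact commands alongside results, a microbenchmark can be built and summarized programmatically. The sketch below assembles an fio command line and extracts a few fields from fio's JSON output. The default parameter values are placeholders, and the JSON field names (`bw_bytes`, `iops`, `clat_ns.percentile`) are assumed from fio 3.x output; check them against the fio version in use.

```python
import json
import shlex


def fio_command(target, rw="read", bs="1M", iodepth=32, numjobs=4,
                runtime_s=300, ramp_s=30, size="10G"):
    """Build an fio argv list for a time-based, direct-I/O run with JSON output."""
    return [
        "fio", "--name=gpu-storage-bench", f"--filename={target}",
        f"--rw={rw}", f"--bs={bs}", f"--iodepth={iodepth}", f"--numjobs={numjobs}",
        "--ioengine=libaio", "--direct=1", f"--size={size}",
        # ramp_time excludes the warm-up period from the reported statistics.
        "--time_based", f"--runtime={runtime_s}", f"--ramp_time={ramp_s}",
        "--group_reporting", "--output-format=json",
    ]


def summarize_fio(json_text, direction="read"):
    """Pull throughput, IOPS, and p99 completion latency from fio JSON output."""
    job = json.loads(json_text)["jobs"][0][direction]
    pct = job["clat_ns"]["percentile"]
    return {
        "bw_MBps": job["bw_bytes"] / 1e6,
        "iops": job["iops"],
        "p99_ms": pct["99.000000"] / 1e6,
    }


if __name__ == "__main__":
    # The target path is a placeholder; log the exact command with the results.
    print(shlex.join(fio_command("/mnt/bench/testfile", bs="128k")))
```

Passing the returned list to `subprocess.run(..., capture_output=True, text=True)` and feeding its stdout to `summarize_fio` gives one summary record per run; the full JSON should still be archived untouched.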
Reproducibility practices
- Version control all benchmark scripts, config files, and dataset manifests.
- Use immutable images (OS/container) and pin package versions.
- Automate environment setup and teardown with IaC tooling (Ansible/Terraform/k3s manifests) so teams can re-run identical stacks.
- Run each test multiple times (recommended minimum: 5 runs) and report median plus confidence intervals; analyze variance and provide p95/p99 statistics for latency.
- Seed randomness in data samplers and ML frameworks to reduce run-to-run variance.
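One possible way to report a median with a confidence interval across repeated runs is a percentile bootstrap, sketched below. The function name, the default of 10,000 resamples, and the fixed seed are choices made for this example; with only five runs the interval is coarse and should be read as a rough indication of spread.

```python
import random
import statistics


def median_with_ci(values, confidence=0.95, resamples=10_000, seed=0):
    """Return (median, low, high) using a percentile bootstrap over per-run results."""
    values = list(values)
    rng = random.Random(seed)  # fixed seed so the analysis itself is reproducible
    n = len(values)
    medians = sorted(
        statistics.median(rng.choices(values, k=n)) for _ in range(resamples)
    )
    alpha = (1.0 - confidence) / 2.0
    low = medians[int(alpha * resamples)]
    high = medians[min(resamples - 1, int((1.0 - alpha) * resamples))]
    return statistics.median(values), low, high


if __name__ == "__main__":
    throughput_gbps = [10.1, 9.8, 10.4, 10.0, 9.9]  # made-up per-run values
    med, low, high = median_with_ci(throughput_gbps)
    print(f"median={med} 95% CI=[{low}, {high}]")
```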
Interpreting results and trade-offs
- Throughput vs latency: optimizing for peak throughput (large sequential reads) can worsen tail latency seen by inference. Decide which metric maps to business requirements.
- Warm vs cold cache: show both. Cold-start behaviors matter for bursty training and checkpoint restores.
- Single-job vs multi-tenant: scale-out characteristics often determine real-world GPU utilization more than single-node peak numbers.
- Cost to performance: include capacity, IOPS per dollar, and network cost (RDMA NICs, switches) in evaluation.
Comparison of benchmarking methodologies
| Methodology | Strengths | Weaknesses | When to use |
|---|---|---|---|
| Microbenchmark (fio/ior) | Precise control of I/O parameters; isolates storage behavior | Doesn’t capture application-level patterns or prefetching | Initial characterization, vendor claims verification |
| Application-level (actual training/inference) | Realistic, shows end-to-end GPU impact | Harder to control every variable; longer runs | Final validation and SLA alignment |
| Trace-driven replay | Reproduces real I/O patterns deterministically | Requires good trace capture and faithful replay | Performance regression testing, capacity planning |
| End-to-end multi-tenant mixes | Reveals contention and tail effects | Complex to orchestrate and interpret | Production readiness and SLO verification |
Reporting and documentation
Provide a reproducible report bundle that includes:
- Raw metric time series and summarized tables (CSV/JSON).
- Exact commands, scripts, and container/OS images used.
- Hardware and firmware inventory.
- Statistical analysis showing variance, confidence intervals, and p99 behavior.
- A short interpretation of what the data implies for GPU utilization and risk.
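A bundle is easier to trust when its contents can be verified later. As a sketch, the snippet below builds a manifest mapping each file in a bundle directory to its SHA-256 digest; the function name and the idea of storing the manifest as JSON next to the bundle are assumptions for illustration.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large raw time-series files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(bundle_dir):
    """Map each file's path (relative to bundle_dir) to its SHA-256 digest."""
    root = Path(bundle_dir)
    return {
        path.relative_to(root).as_posix(): sha256_file(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


if __name__ == "__main__":
    # Demo on a throwaway directory; point build_manifest at a real bundle instead.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "results.csv").write_bytes(b"run,throughput\n1,10.1\n")
        print(json.dumps(build_manifest(tmp), indent=2))
```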
Key takeaways
- Control and record every environmental variable (hardware, topology, software) and keep that metadata with results.
- Combine microbenchmarks with application-level runs and trace-driven replays to avoid false confidence.
- Measure tail latency and GPU utilization — average throughput alone is insufficient.
- Automate, version-control, and run multiple trials to quantify variance and ensure reproducibility.
Reproducible benchmarking is an organizational practice as much as it is a technical one. Establishing this discipline will let teams accurately determine whether storage — local or disaggregated — is the hidden ceiling on GPU utilization and will guide cost-effective remediation.
Further reading and tools: fio, ior, MLPerf (for reference workloads), Prometheus/Grafana for observability, and orchestration tooling that keeps environments identical across runs. For teams evaluating disaggregated all-flash options, include independently validated third-party benchmark artifacts as part of your reproducibility bundle and treat vendor claims as hypotheses to be verified under your workloads.