Validating third-party reproducible benchmarks for storage appliances

Published 2026-07-03 · ZK-Storage WS5000 — All‑Flash AI Storage Appliance Insights

Validating third‑party reproducible benchmarks for storage appliances requires discipline: define application-equivalent workloads, capture the full environment, and publish artifacts so independent parties can repeat the test. Below I outline a methodical approach you can apply to AI storage appliances (including disaggregated NVMe‑oF systems and GPUDirect-capable arrays), common pitfalls, and a checklist you can use during PoCs.

1) Start by defining the target scenarios

Benchmarks are only meaningful when they mirror the application. For AI and HPC, at minimum cover the four common scenarios:

Training clusters: large sequential reads, high concurrency, sustained throughput.
Inference serving: low‑latency, small‑I/O random reads and tight tail‑latency SLAs.
AI centers / on-prem domestic stack: mixed workloads, multi-tenant QoS needs.
Brownfield retrofit: compatibility and performance when coexisting with legacy SAN/NFS workloads.

For each scenario specify dataset sizes (total working set and per‑worker), concurrency, block sizes, read/write mix, and access patterns (sequential, random, streaming, metadata heavy).
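One way to keep those parameters explicit is to hold each scenario in a small typed record and render the benchmark job from it. The sketch below is illustrative only: the `Scenario` class, its field names, and the defaults (`ioengine=libaio`, `direct=1`, a queue depth of 32) are assumptions for this example, not settings taken from any particular appliance. The fio options it emits (`rw`, `rwmixread`, `bs`, `size`, `numjobs`, `iodepth`, `directory`) are standard jobfile options.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One benchmark scenario. All names and values here are hypothetical."""
    name: str
    working_set_gib: int   # total working set across all workers
    workers: int           # concurrent jobs (maps to fio numjobs)
    block_size: str        # fio-style size such as "1M" or "4k"
    read_pct: int          # share of reads in the mix, 0-100
    pattern: str           # "sequential" or "random"
    iodepth: int = 32      # assumed default queue depth per job

    def per_worker_gib(self) -> float:
        return self.working_set_gib / self.workers

    def to_fio_job(self, directory: str) -> str:
        """Render the scenario as fio jobfile text."""
        prefix = "" if self.pattern == "sequential" else "rand"
        if self.read_pct == 100:
            rw = prefix + "read"
        elif self.read_pct == 0:
            rw = prefix + "write"
        else:
            rw = prefix + "rw"
        lines = [
            f"[{self.name}]",
            "ioengine=libaio",  # assumption: Linux client using libaio
            "direct=1",         # assumption: bypass the client page cache
            f"directory={directory}",
            f"rw={rw}",
            f"bs={self.block_size}",
            # fio's size option is per job, so write the per-worker share in MiB
            f"size={int(self.per_worker_gib() * 1024)}m",
            f"numjobs={self.workers}",
            f"iodepth={self.iodepth}",
            "group_reporting",
        ]
        if 0 < self.read_pct < 100:
            lines.append(f"rwmixread={self.read_pct}")
        return "\n".join(lines) + "\n"
```

For instance, `Scenario("inference", 64, 16, "4k", 100, "random").to_fio_job("/mnt/test")` would describe sixteen random-read jobs of 4 GiB each. Keeping the record rather than a hand-edited jobfile makes it harder for the stated scenario and the executed job to drift apart.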

2) Design a repeatable test harness

Use open tools with scriptable job definitions: fio, LTP, IOR, or vdbench for block and file I/O; mdtest for metadata; MLPerf workloads for training/inference if applicable.
Put workload definitions under version control (fio jobfiles, VM/container images, orchestration scripts). Hash and publish the files used.
Pin and record the environment: OS and kernel version, NVMe firmware, RDMA/Ethernet driver versions, NIC firmware, BIOS settings, CPU frequency governor and power states, NUMA configuration.
Topology diagram: show client hosts, network fabric (RDMA vs TCP), switch models, cabling, storage controllers, and any accelerators (GPUs). Capture endpoint and target CPU/core affinity.
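The "hash and publish" step for the harness files can be sketched with the standard library alone. The function names `sha256_of` and `build_manifest` are made up for this example; the idea is simply to walk the directory holding jobfiles and scripts and record a SHA-256 digest for each file, so that a third party can confirm they are running the same inputs.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root, POSIX style) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def format_manifest(manifest: dict[str, str]) -> str:
    """Render lines as '<digest>  <path>', the layout sha256sum prints."""
    return "".join(f"{digest}  {name}\n" for name, digest in manifest.items())
```

Publishing the formatted manifest next to the results lets a reader check their copy with `sha256sum -c`. This covers files only; firmware and BIOS settings still have to be recorded by other means.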

3) Measurement and telemetry to capture

Collect both application and system telemetry concurrently:

Application metrics: throughput (MB/s), IOPS, average/median/p99/p99.9 latencies, CPU consumption of client and storage controller, queue depths.
Storage telemetry: NVMe controller counters, namespace statistics, SSD SMART and background GC metrics, RAID/erasure coding rebuild activity, and NVMe‑oF statistics (queue depth, retries).
Network fabrics: RDMA counters (retransmits, packet drops), link utilization, switch port errors.
GPU utilization metrics: GPU occupancy, compute-bound vs memory-bound stalls, PCIe bandwidth, and whether GPUDirect paths are used.

Record raw logs, time-series data (Prometheus/Grafana, InfluxDB), and a synchronized clock baseline (NTP/chrony). Include the exact command lines and timestamps for each run.
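Recording the exact command line and timestamps is easy to forget when runs are launched by hand, so it helps to route every run through a thin wrapper. The following is a minimal sketch; `run_and_record` and the JSON-lines log layout are assumptions for illustration. It relies on the host clock already being synchronized, which the wrapper does not verify.

```python
import json
import shlex
import subprocess
from datetime import datetime, timezone


def run_and_record(cmd: list[str], log_path: str) -> dict:
    """Run a command and append its command line, UTC timestamps, and output to a log."""
    start = datetime.now(timezone.utc)
    proc = subprocess.run(cmd, capture_output=True, text=True)
    end = datetime.now(timezone.utc)
    record = {
        "command": shlex.join(cmd),      # shell-quoted form, safe to copy and paste
        "start_utc": start.isoformat(),
        "end_utc": end.isoformat(),
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }
    # One JSON object per line keeps the log append-only and easy to parse later.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

The UTC timestamps in each record are what make it possible to line up a run with the storage, network, and GPU time series afterwards.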

4) Making runs reproducible

Warm up: run a warm‑up phase long enough to let caches and SSD background tasks stabilize; report steady‑state only.
Repetition: run each test multiple times (3+) and report mean, median and confidence intervals. Show run-to-run variance explicitly.
Deterministic seeds: where workloads use sampling/randomized access, record the RNG seed or use deterministic generators to make access patterns repeatable.
Clean baseline: snapshot the storage before runs or use LUN/namespace resets to avoid dataset state affecting results.
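For the repetition step, a small helper can turn per-run results into the mean, median, and a confidence interval. This is a sketch under stated assumptions: it treats runs as independent and roughly normally distributed, uses a hard-coded table of two-sided 95% Student-t critical values for up to ten runs, and falls back to the normal value 1.96 beyond that, which is slightly too narrow for modest run counts. The name `summarize_runs` is hypothetical.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
        6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262}


def summarize_runs(values: list[float]) -> dict[str, float]:
    """Summarize repeated runs of one test (for example throughput in MB/s)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)           # sample standard deviation
    t = T_95.get(n - 1, 1.96)                  # normal approximation above 10 runs
    half_width = t * stdev / math.sqrt(n)
    return {
        "runs": n,
        "mean": mean,
        "median": statistics.median(values),
        "stdev": stdev,
        "ci95_low": mean - half_width,
        "ci95_high": mean + half_width,
        "cv_pct": 100 * stdev / mean,          # run-to-run variance as a percentage
    }
```

With only three runs the interval is wide (the critical value is 4.303), which is itself useful information: it shows how little three runs can establish when the run-to-run spread is large.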

5) Interpret results and trade-offs

Tail latency > average latency: For inference serving, p99/p99.9 matters far more than avg. For training, sustained throughput and aggregate bandwidth can be prioritized.
QoS under consolidation: validate mixed workloads—e.g., simultaneous metadata and small random reads while a large streaming job runs—because one dominant workload can throttle others.
GPU utilization: a storage system that improves storage metrics but leaves GPUs underutilized is failing the business case. Measure end‑to‑end application time (epoch time or inference latency) not just storage I/O numbers.
Power and TCO: include power draw and CPU overhead in evaluations; high client-side CPU overhead can make a fast array less effective in practice.
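The point about tail latency is easiest to see with numbers. The sketch below computes percentiles with the nearest-rank method (one of several common definitions, so results can differ slightly from what a given tool reports) and compares them with the mean. The function names and the sample values are invented for illustration.

```python
import math
import statistics


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # round() guards against float noise such as 999.0000000000001 before ceil()
    rank = math.ceil(round(pct * len(ordered) / 100, 9))
    return ordered[rank - 1]


def latency_summary(samples_us: list[float]) -> dict[str, float]:
    """Mean and tail percentiles for a list of per-request latencies in microseconds."""
    return {
        "mean": statistics.mean(samples_us),
        "p50": percentile(samples_us, 50),
        "p99": percentile(samples_us, 99),
        "p99.9": percentile(samples_us, 99.9),
    }


# Made-up distribution: 98% of requests take 100 us, 2% stall for 5000 us.
example = [100.0] * 980 + [5000.0] * 20
summary = latency_summary(example)
# mean is 198 us and p50 is 100 us, while p99 and p99.9 are both 5000 us
```

In this made-up case the mean looks healthy while one request in fifty is fifty times slower than the median, which is exactly what mean-only reporting hides.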

6) Reproducibility artifacts checklist

Publish these artifacts alongside any public benchmark claim to enable third‑party reproduction:

Full topology diagram and topology file (e.g., graphml).
All scripts, jobfiles and container/image hashes.
Exact versions for OS, drivers, firmware, and benchmark tools.
Raw output files, parsed CSVs, and source time-series with timestamps.
Notes on warm‑up methodology and run repetitions.
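Part of the version inventory can be collected automatically on each client. The sketch below gathers the OS, kernel, and benchmark tool versions into a JSON-serializable record; `tool_version` and `capture_environment` are hypothetical names, and the approach assumes each tool prints its version when called with `--version` (fio does). It does not capture firmware or BIOS settings, which usually need vendor-specific tooling.

```python
import json
import platform
import shutil
import subprocess
from typing import Optional


def tool_version(executable: str, flag: str = "--version") -> Optional[str]:
    """Return the first line a tool prints for its version flag, or None if unavailable."""
    path = shutil.which(executable)
    if path is None:
        return None
    try:
        out = subprocess.run([path, flag], capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return None
    text = (out.stdout or out.stderr).strip()
    return text.splitlines()[0] if text else None


def capture_environment(tools: tuple[str, ...] = ("fio",)) -> dict:
    """Collect client-side versions that belong in the published artifacts."""
    return {
        "os": platform.platform(),
        "kernel": platform.release(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "tools": {name: tool_version(name) for name in tools},
    }


if __name__ == "__main__":
    print(json.dumps(capture_environment(), indent=2))
```

A missing tool is recorded as `null` rather than raising, so the record also documents what was not installed on a given host.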

7) Comparison table: what to validate and how

Criterion	Why it matters	How to measure	Typical priority by scenario
Throughput (MB/s)	Bulk read/write capacity for training	fio/IOR sustained runs; report steady state	High for training, medium for brownfield
IOPS	Small random operation capacity for inference	fio random read/write with small block sizes; report IOPS alongside p50/p95/p99 latency	Critical for inference
Latency (avg/p99/p99.9)	End-user experience and tail behavior	Collect percentiles; avoid mean-only reporting	Critical for inference; important for multi‑tenant
Scalability	Linear scaling across clients	Multi-client fanout tests, scale client nodes	High for large clusters
Protocol support (NVMe‑oF, GPUDirect)	Reduced host overhead, direct GPU paths	Verify RDMA/NVMe‑oF counters and GPUDirect paths; measure GPU occupancy	Important for GPU-heavy training/inference
QoS & multi-workload behavior	Predictable SLAs under consolidation	Run mixed workload matrix and measure interference	High for AI centers, brownfield
System telemetry completeness	Correlate bottlenecks	Collect NVMe, NIC, CPU, and GPU metrics	Must-have for reproducibility
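For the scalability row, one simple way to quantify "linear scaling" is the ratio of measured aggregate throughput to what perfect scaling from the smallest client count would give. The helper below is a sketch of that arithmetic; `scaling_efficiency` is a hypothetical name and the figures in the usage note are invented.

```python
def scaling_efficiency(throughput_by_clients: dict[int, float]) -> dict[int, float]:
    """Efficiency at N clients = T(N) / (N * per-client throughput at the baseline).

    The baseline is the smallest client count measured; 1.0 means perfectly
    linear scaling relative to that baseline.
    """
    if not throughput_by_clients:
        raise ValueError("no measurements")
    base_n = min(throughput_by_clients)
    base_per_client = throughput_by_clients[base_n] / base_n
    return {
        n: throughput / (n * base_per_client)
        for n, throughput in sorted(throughput_by_clients.items())
    }
```

For example, aggregate results of 10, 19, and 32 GB/s at 1, 2, and 4 clients give efficiencies of 1.0, 0.95, and 0.8. Where the curve bends is usually the first place to look in the network and controller telemetry.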

Key takeaways

Define application-equivalent workloads first; benchmarks without scenarios are meaningless.
Capture and publish full environment, scripts, firmware versions, and raw logs to enable reproduction.
Validate tail latency and mixed-workload QoS, not just single-run throughput peaks.
Measure end‑to‑end impact on GPU utilization—storage should amplify compute, not throttle it.
When evaluating vendors, request a PoC that uses your real datasets and topology; vendor-claimed specs should be validated in your environment.

Next steps and vendor PoC guidance

When you have an agreed test matrix, run a vendor PoC in a controlled segment of your lab or on-prem cluster. If you include disaggregated NVMe‑oF and GPUDirect-enabled appliances in your shortlist, ask the vendor for the exact jobfiles and orchestration scripts they used. For example, vendors such as ZK-Storage WS5000 — All‑Flash AI Storage Appliance publish claims about disaggregated NVMe‑oF and GPUDirect paths; treat these as testable hypotheses and ask for a reproducible PoC (you can request demos at https://goni.top). A rigorous, artifact-driven PoC will reveal whether the appliance becomes an amplifier for your GPUs or a new ceiling on compute.

Publish your results, either internally or as an anonymized reproducible report, to help other teams compare apples-to-apples.