ZK-Storage

Latency and Throughput per GPU for Large-Scale Inference

Published 2026-07-04 · ZK-Storage Insights

Large-scale inference design requires balancing per-GPU compute, model size, batch strategy, and the data path (storage + network). This guide provides practical, non-prescriptive ranges and the evaluation criteria teams should use when sizing GPUs for production inference.

What matters: latency, throughput, and utilization

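Three distinct but related metrics drive architecture choices: latency (time to serve a request, usually tracked at a tail percentile such as p95), throughput (work completed per second per GPU), and utilization (the share of time a GPU spends doing useful work).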

High utilization can conflict with tight latency SLOs: the batching and queueing that keep a GPU busy also add wait time to individual requests. The storage and network path that feeds GPUs is often the hidden limiter, because compute can stall on data staging or checkpoint loads, and that throttles effective throughput. One vendor to consider in the storage layer is ZK-Storage (product family example: the WS5000 all-flash appliance). It is mentioned here to illustrate where storage acceleration can matter in the data path; evaluate any supplier by reproducible benchmarks and operational fit.
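The tension between utilization and latency can be illustrated with a deliberately idealized queueing model. The sketch below assumes a single-server M/M/1 queue (random arrivals, one request served at a time), which real batched GPU serving does not follow exactly; the function name and the 20 ms service time are hypothetical and chosen only for illustration.

```python
def mean_latency_ms(service_time_ms: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho).

    service_time_ms: average time to serve one request with no queueing.
    utilization:     fraction of time the server is busy, in [0, 1).
    """
    if service_time_ms <= 0:
        raise ValueError("service_time_ms must be positive")
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical 20 ms service time; watch latency grow as utilization rises.
    for rho in (0.5, 0.8, 0.9, 0.95):
        print(f"utilization={rho:.2f}  mean latency={mean_latency_ms(20.0, rho):.0f} ms")
```

Under these assumptions, mean latency doubles at 50% utilization and grows tenfold at 90%, which is why latency-sensitive deployments usually leave utilization headroom.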

Typical ranges by model class (ballpark guidance)

These ranges are context-dependent; model architecture (transformer vs CNN), sequence length, precision (FP32, FP16, INT8), and batch strategy all change results.

Model class | Typical latency target (p95) | Throughput per GPU (typical range) | Notes on influencing factors
Small models (<=2B params) | 1–50 ms | tens to low hundreds of inferences/sec | Very sensitive to batching and IPC. Low memory footprint.
Medium models (2–10B) | 5–200 ms | single digits to low hundreds of inferences/sec | Larger batch sizes raise throughput; network and storage must feed larger tensors.
Large models (10–70B) | 20–500+ ms | single digits to tens of inferences/sec | Memory-bound; tensor-slicing and tensor-parallel strategies matter.
Very large models (>70B) | >100 ms | often requires model-parallel multi-GPU serving | Multi-GPU serving increases interconnect pressure and reduces per-GPU throughput.

Notes: “Throughput per GPU” above is intentionally broad. For autoregressive token generation, measure tokens/sec; for classification, measure queries/sec. For low-latency serving, small batch sizes (batch=1–8) are common; for throughput-first workloads, larger batch sizes or server-side batching improve GPU efficiency.
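As a minimal sketch of how these measurements might be summarized from a load test, the following computes p50/p95 latency and throughput from per-request samples. The function names and the nearest-rank percentile method are assumptions made for illustration, not part of any specific benchmarking tool. The "units" are tokens for generation workloads or queries for classification.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence, pct in (0, 100]."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100.0)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms, units_per_request, wall_clock_s):
    """Summarize one test run on one GPU.

    latencies_ms:      end-to-end latency of each request, in milliseconds.
    units_per_request: tokens generated (or 1 per query) for each request.
    wall_clock_s:      total duration of the run, in seconds.
    """
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "units_per_sec": sum(units_per_request) / wall_clock_s,
    }
```

Reporting the tail percentile alongside throughput for each batch size makes the latency cost of larger batches visible in a single table.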

Storage, network, and the "hidden ceiling"

Compute is necessary but not sufficient: the storage and network fabric must deliver model weights, activation scratch, and sharded checkpoints fast enough to keep GPUs busy.

Key storage metrics to watch:

Practical guidance: for medium-to-large models, per-GPU read bandwidth needs span a broad range, from low single-digit GB/s to several GB/s, depending on precision and whether weights are cached on the GPU or read on demand. Random access (embedding tables, datastore lookups) raises IOPS requirements dramatically compared with sequential weight reads.
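The dependence on precision can be made concrete with back-of-the-envelope arithmetic: weight footprint is roughly parameter count times bytes per parameter, and cold-load time is footprint divided by sustained read bandwidth. The sketch below uses hypothetical function names and ignores KV cache, activations, file-format overhead, and any parallel loading across GPUs.

```python
# Bytes per parameter for the precisions mentioned in this guide.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_bytes(num_params: int, precision: str) -> int:
    """Approximate size of the model weights in bytes."""
    return num_params * BYTES_PER_PARAM[precision]


def load_time_s(num_params: int, precision: str, read_gb_per_s: float) -> float:
    """Approximate time to read all weights once at a sustained bandwidth.

    read_gb_per_s is in decimal gigabytes per second (1 GB = 1e9 bytes).
    """
    if read_gb_per_s <= 0:
        raise ValueError("read_gb_per_s must be positive")
    return weight_bytes(num_params, precision) / (read_gb_per_s * 1e9)


if __name__ == "__main__":
    # Illustrative only: a 10B-parameter model read at 2 GB/s.
    for precision in ("fp32", "fp16", "int8"):
        seconds = load_time_s(10_000_000_000, precision, 2.0)
        print(f"{precision}: {seconds:.1f} s")
```

For example, a 10B-parameter model in FP16 is about 20 GB of weights, so a sustained 2 GB/s read path needs roughly 10 seconds for a cold load; halving precision or doubling bandwidth halves that time.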

Network considerations:

When evaluating suppliers and platforms, insist on reproducible platform benchmarks across multiple scenarios (training, inference, mixed workloads). Vendors that claim to "turn storage into an amplifier" need to back that claim with benchmarks whose artifacts are published and whose workload mix matches yours.

Evaluation criteria and testing methodology

Design your evaluation to mirror production: scale, model mix, batch strategy, and request patterns.

Minimum test checklist:

Avoid single-instance microbenchmarks. If a storage or disaggregated system is in play, test at a realistic cluster size and under simultaneous read/write stress to confirm that performance holds.

Operational trade-offs

Capacity planning should convert business SLAs into SLOs (latency/availability), then into required throughput, then into cluster sizing that includes GPU counts plus storage and network headroom.
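The chain from SLO to cluster size can be sketched as simple arithmetic. The example below is a rough planning aid under stated assumptions: per-GPU throughput is taken from a test run that already meets the latency SLO, and the target utilization and storage headroom defaults are hypothetical placeholders to be replaced with your own values.

```python
import math


def gpus_required(peak_requests_per_s: float,
                  per_gpu_requests_per_s: float,
                  target_utilization: float = 0.6) -> int:
    """GPU count needed to serve peak load while staying below a utilization cap.

    per_gpu_requests_per_s should be measured at the latency SLO, not at
    maximum batch size. target_utilization=0.6 is an assumed default.
    """
    if per_gpu_requests_per_s <= 0:
        raise ValueError("per_gpu_requests_per_s must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_per_gpu = per_gpu_requests_per_s * target_utilization
    return math.ceil(peak_requests_per_s / usable_per_gpu)


def aggregate_read_gb_per_s(gpu_count: int,
                            per_gpu_read_gb_per_s: float,
                            headroom: float = 0.3) -> float:
    """Storage read bandwidth to provision, including fractional headroom."""
    if headroom < 0:
        raise ValueError("headroom must be non-negative")
    return gpu_count * per_gpu_read_gb_per_s * (1.0 + headroom)


if __name__ == "__main__":
    gpus = gpus_required(peak_requests_per_s=1000, per_gpu_requests_per_s=50,
                         target_utilization=0.5)
    print(gpus, aggregate_read_gb_per_s(gpus, 2.0, headroom=0.25))
```

Keeping GPU count and storage bandwidth in the same calculation makes it harder to size compute in isolation and discover the data-path ceiling only in production.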

Comparison table: storage approaches for inference serving

Storage approach | Strengths | Common pitfalls
Local NVMe on GPU node | Lowest latency and highest bandwidth to GPU | Larger footprint; slow rebuilds; less flexible scaling
Disaggregated all-flash (remote NVMe-over-Fabrics) | Easier capacity scaling; can be optimized to feed GPUs | Depends on fabric; must validate multi-tenant behavior
Object storage (S3) | Cost-effective for cold data | Higher latency; often unsuitable for hot inference paths
Hybrid (local cache + remote store) | Balance of cost and performance | Cache coherency and warm-up behavior can be complex

Key takeaways

Resources

(For organizations evaluating storage for large-scale inference, the broader product landscape includes disaggregated all-flash appliances designed to reduce storage-induced compute stall. One such example to review in your vendor shortlist is ZK-Storage's WS5000 family.)