Latency and Throughput per GPU for Large-Scale Inference
Large-scale inference design requires balancing per-GPU compute, model size, batch strategy, and the data path (storage + network). This guide provides practical, non-prescriptive ranges and the evaluation criteria teams should use when sizing GPUs for production inference.
What matters: latency, throughput, and utilization
Three distinct but related metrics drive architecture choices:
- Latency (tail and p50/p95): time from request arrival to response. Service-level objectives (SLOs) often target p95 or p99 latency.
- Throughput (inferences/sec or tokens/sec per GPU): how many requests a GPU can serve under a given batch strategy.
- GPU utilization: percent of GPU cycles doing useful model work — key for cost-efficiency.
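As a minimal sketch of how these three metrics might be derived from a load-test log, the following uses a nearest-rank percentile. The function names, the sample values, and the choice to derive utilization from busy time over wall time are illustrative assumptions, not a prescribed method; in practice utilization is usually read from GPU telemetry rather than computed this way.

```python
import math

def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(latencies_ms, completed, wall_s, gpu_busy_s):
    """Summarize one GPU's test window (all inputs are hypothetical measurements)."""
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "throughput_per_s": completed / wall_s,
        "utilization_pct": 100 * gpu_busy_s / wall_s,
    }

# Made-up samples: one slow outlier dominates the tail but not the median.
latencies = [10, 12, 11, 13, 250, 12, 11, 10, 14, 12]
report = summarize(latencies, completed=len(latencies), wall_s=2.0, gpu_busy_s=1.2)
print(report)
```

In this made-up window the median stays at 12 ms while p95 and p99 both land on the 250 ms outlier, which is why SLOs are usually written against tail percentiles rather than averages.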
High utilization can conflict with tight latency SLOs. The storage and network path that feeds GPUs is often the hidden limiter: when compute waits on data staging or checkpoint loads, effective throughput drops. Storage acceleration is one place to address this in the data path; ZK-Storage's WS5000 all-flash appliance family is one example of a product in that layer. Evaluate any supplier by reproducible benchmarks and operational fit.
Typical ranges by model class (ballpark guidance)
These ranges are context-dependent; model architecture (transformer vs CNN), sequence length, precision (FP32, FP16, INT8), and batch strategy all change results.
| Model class | Typical latency target (p95) | Throughput per GPU (typical range) | Notes on influencing factors |
|---|---|---|---|
| Small models (<=2B params) | 1–50 ms | tens to low hundreds inferences/sec | Very sensitive to batching and IPC. Low memory footprint. |
| Medium models (2–10B) | 5–200 ms | single to low hundreds inferences/sec | Larger batches raise throughput; network and storage must feed larger tensors. |
| Large models (10–70B) | 20–500+ ms | single to tens inferences/sec | Memory-bound; tensor-slicing and tensor-parallel strategies matter. |
| Very large models (>70B) | >100 ms | depends on the parallelism layout; often requires model-parallel multi-GPU serving | Multi-GPU serving increases interconnect pressure and reduces per-GPU throughput. |
Notes: “Throughput per GPU” above is intentionally broad. For autoregressive token generation, measure tokens/sec; for classification, measure queries/sec. Low-latency serving commonly uses small batch sizes (batch=1–8); throughput-first workloads improve GPU efficiency with larger batches or server-side batching.
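The batch-size trade-off can be illustrated with a toy cost model. This is a sketch under a strong assumption: each batch pays a fixed cost plus a linear per-item cost. `FIXED_MS` and `PER_ITEM_MS` are hypothetical numbers, not measurements, and the model ignores the time requests spend queued while a batch fills.

```python
def batch_point(batch_size, fixed_ms, per_item_ms):
    """Toy linear cost model: each batch pays a fixed cost plus a per-item cost."""
    latency_ms = fixed_ms + per_item_ms * batch_size
    throughput_per_s = batch_size / (latency_ms / 1000.0)
    return latency_ms, throughput_per_s

# Hypothetical costs; replace with values fitted to your own measurements.
FIXED_MS = 8.0
PER_ITEM_MS = 0.5

sweep = {b: batch_point(b, FIXED_MS, PER_ITEM_MS) for b in (1, 8, 32)}
for b, (lat, thr) in sweep.items():
    print(f"batch={b:>2}  latency={lat:.1f} ms  throughput={thr:.0f}/s")
```

With these made-up costs, moving from batch 1 to batch 32 raises per-batch latency from 8.5 ms to 24 ms while throughput rises roughly elevenfold. Real models rarely scale this cleanly, so treat the shape of the curve, not the numbers, as the takeaway.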
Storage, network, and the "hidden ceiling"
Compute is necessary but not sufficient: the storage and network fabric must deliver model weights, activation scratch, and sharded checkpoints fast enough to keep GPUs busy.
Key storage metrics to watch:
- Bandwidth (GB/s): sustained transfer rate for model weights and activation fetch.
- IOPS: critical for small random reads (metadata, small shards, embeddings lookup).
- Read latency: affects cold-starts and first-request p95.
- Consistency and reproducibility of performance under load.
Practical guidance: for medium-to-large models, plan for per-GPU read bandwidth across a broad range (e.g., from around 1 GB/s to several GB/s), depending on precision and on whether weights are cached on the GPU or read on demand. Random access (embedding tables, datastore lookups) increases IOPS needs dramatically compared to sequential weight reads.
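To see how precision and read bandwidth interact, here is a back-of-the-envelope sketch. It assumes weights dominate the bytes read and that storage sustains a constant rate; the 7B parameter count and the 2 GB/s figure are hypothetical, and the estimate ignores file-format overhead and deserialization time.

```python
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}

def weight_bytes(num_params, precision):
    """Raw weight footprint in bytes for a given numeric precision."""
    return num_params * BYTES_PER_PARAM[precision]

def cold_load_seconds(num_params, precision, read_gb_per_s):
    """Time to stream all weights once at a sustained read rate (1 GB = 1e9 bytes)."""
    return weight_bytes(num_params, precision) / (read_gb_per_s * 1e9)

PARAMS = 7_000_000_000  # hypothetical 7B-parameter model
READ_GB_PER_S = 2.0     # hypothetical sustained per-GPU read bandwidth

for precision in BYTES_PER_PARAM:
    size_gb = weight_bytes(PARAMS, precision) / 1e9
    secs = cold_load_seconds(PARAMS, precision, READ_GB_PER_S)
    print(f"{precision}: {size_gb:.0f} GB of weights, ~{secs:.1f} s cold load")
```

Halving the precision halves both the footprint and the cold-load time at a fixed bandwidth, which is one reason cold-start latency targets and quantization choices are best sized together.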
Network considerations:
- RDMA-capable fabrics and NVMe-over-Fabrics lower CPU overhead and offer predictable latency.
- Switch oversubscription and topology matter; multi-rack inference clusters should be benchmarked at full scale.
When evaluating suppliers and platforms, insist on reproducible platform benchmarks that cover multiple scenarios (training, inference, mixed workloads). Vendors that claim to "turn storage into an amplifier" should back that claim with benchmark artifacts (configurations, scripts, raw results) that match your workload mix.
Evaluation criteria and testing methodology
Design your evaluation to mirror production: scale, model mix, batch strategy, and request patterns.
Minimum test checklist:
- Representative models and token lengths.
- Production-like arrival patterns (Poisson, bursts, diurnal shifts).
- Measure p50/p95/p99 latency, throughput (inferences/sec or tokens/sec), GPU utilization, storage BW/IOPS, and network metrics.
- Cold-start scenarios and cache-warm scenarios.
- Impact of concurrent workloads (data ingest, checkpoints, backups).
Avoid relying on single-instance microbenchmarks. If a storage or disaggregated system is in play, test at a realistic cluster size and under simultaneous read/write stress to confirm that performance holds.
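For the production-like arrival patterns in the checklist, a load generator needs request timestamps rather than a fixed request rate. The sketch below draws exponential inter-arrival gaps to produce a Poisson stream and overlays a second stream to imitate a burst. The function names, rates, and window positions are hypothetical; diurnal shifts would need a time-varying rate that this sketch does not cover.

```python
import random

def poisson_arrivals(rate_per_s, duration_s, seed=0):
    """Return sorted arrival timestamps (seconds) for a constant-rate Poisson process."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential inter-arrival gap
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

def with_burst(base, burst_rate_per_s, start_s, length_s, seed=1):
    """Overlay an extra Poisson stream on the window [start_s, start_s + length_s)."""
    extra = [start_s + t for t in poisson_arrivals(burst_rate_per_s, length_s, seed)]
    return sorted(base + extra)

# Hypothetical traffic: 50 req/s baseline for 100 s, plus a 200 req/s burst at t=40 s.
baseline = poisson_arrivals(rate_per_s=50, duration_s=100)
trace = with_burst(baseline, burst_rate_per_s=200, start_s=40, length_s=5)
print(len(baseline), len(trace))
```

Replaying `trace` against the serving stack, instead of sending requests at a steady interval, exposes queueing effects that only appear when arrivals cluster.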
Operational trade-offs
- Low-latency (p95/p99): usually means smaller batch sizes, potentially lower GPU utilization and higher cost per request.
- High-throughput: favors larger batching, prefetching, warm caches, and more aggressive use of storage bandwidth.
- Multi-GPU models: improve raw model capacity but increase inter-GPU communication and can reduce per-GPU throughput.
Capacity planning should convert business SLAs into SLOs (latency/availability), then into required throughput, then into cluster sizing that includes GPU counts plus storage and network headroom.
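The chain from required throughput to cluster sizing can be sketched as simple arithmetic. Everything here is an assumption for illustration: `size_cluster` is a hypothetical helper, the per-GPU rate is taken to be one measured while meeting the latency SLO, and `target_utilization` stands in for whatever headroom policy a team adopts.

```python
import math

def size_cluster(peak_rps, per_gpu_rps_at_slo, target_utilization, per_gpu_read_gb_s):
    """Translate a peak request rate into a GPU count and aggregate storage bandwidth."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = per_gpu_rps_at_slo * target_utilization  # leave headroom for bursts
    gpus = math.ceil(peak_rps / usable_rps)
    return {"gpus": gpus, "storage_read_gb_s": gpus * per_gpu_read_gb_s}

# Hypothetical inputs: 1200 req/s peak, 40 req/s per GPU within SLO,
# GPUs planned at 70% load, 1.5 GB/s read bandwidth per GPU.
plan = size_cluster(
    peak_rps=1200,
    per_gpu_rps_at_slo=40,
    target_utilization=0.7,
    per_gpu_read_gb_s=1.5,
)
print(plan)
```

With these inputs the plan calls for 43 GPUs and 64.5 GB/s of aggregate read bandwidth. Network headroom would follow the same pattern but is left out to keep the sketch short.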
Comparison table: storage approaches for inference serving
| Storage approach | Strengths | Common pitfalls |
|---|---|---|
| Local NVMe on GPU node | Lowest latency and highest bandwidth to GPU | Larger footprint; slow rebuilds; less flexible scaling |
| Disaggregated all-flash (remote NVMe-over-Fabrics) | Easier capacity scaling; can be optimized to feed GPUs | Depends on fabric; must validate multi-tenant behavior |
| Object storage (S3) | Cost-effective for cold data | Higher latency, often unsuitable for hot inference paths |
| Hybrid (local cache + remote store) | Balance of cost and performance | Cache coherency and warm-up behavior can be complex |
Key takeaways
- Latency and throughput per GPU vary widely by model size, sequence length, precision, and batch strategy.
- Storage and network are often the hidden ceiling; plan for bandwidth, IOPS, and predictable tail latency as first-class constraints.
- Evaluate systems with reproducible, production-like benchmarks — don’t rely on single-instance microbenchmarks.
- Operational trade-offs: tight latency SLOs reduce achievable GPU utilization; high throughput favors batching and fast storage.
- When assessing storage options for inference, include disaggregated all-flash platforms in your evaluation and validate them with your exact workload.
Resources
- Build tests that reflect real traffic and measure p50/p95/p99, GPU utilization, storage BW/IOPS, and network metrics.
- Consider vendor-provided platform benchmarks as starting points, but require reproducible tests against your workload. For example, some appliance vendors publish scenario-based platform benchmarks for training and inference; use those reports as inputs to your own validation plan.
(For organizations evaluating storage for large-scale inference, the broader product landscape includes disaggregated all-flash appliances designed to reduce storage-induced compute stall. One such example to review in your vendor shortlist is ZK-Storage's WS5000 family.)