Performance · Jun 9, 2026 · 11 min read

300+ tokens/sec from open models on B200s

Raw GPU FLOPs don't translate to fast inference on their own. The gap between a naive serving setup and a tuned one is often 3–4×. Here are the techniques that get tier-1 open-weight models past 300 tokens per second on dedicated hardware.

Daniel Okafor
Inference Performance Lead

Throughput is a scheduling problem

Generating text token by token is memory-bandwidth bound, not compute bound. Each decode step has to read the model weights and the entire KV cache from HBM, while doing comparatively little arithmetic per byte read. The art of fast inference is keeping the GPU's memory pipeline saturated, which is fundamentally a scheduling and batching problem, not a matter of buying more FLOPs.
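To make the bandwidth argument concrete, here is a back-of-the-envelope sketch. It assumes each decode step is limited purely by HBM reads: the weights are read once per step and shared by the whole batch, while every sequence adds its own KV cache. The function name and all three numbers are illustrative assumptions, not measurements of any particular model or GPU.

```python
def decode_tokens_per_sec(weight_bytes, kv_bytes_per_seq, batch_size, hbm_bytes_per_sec):
    """Upper bound on decode throughput when each step is limited by HBM reads.

    One decode step reads the weights once (shared by the whole batch) plus
    every sequence's KV cache, and produces one token per sequence.
    """
    bytes_per_step = weight_bytes + batch_size * kv_bytes_per_seq
    steps_per_sec = hbm_bytes_per_sec / bytes_per_step
    return steps_per_sec * batch_size  # aggregate tokens/sec across the batch


# Illustrative numbers only (assumptions, not measurements):
WEIGHTS = 70e9      # bytes of weights read per decode step
KV_PER_SEQ = 1e9    # bytes of KV cache read per sequence per step
BANDWIDTH = 8e12    # bytes/sec of HBM bandwidth

for batch in (1, 8, 64):
    total = decode_tokens_per_sec(WEIGHTS, KV_PER_SEQ, batch, BANDWIDTH)
    print(f"batch={batch:3d}  aggregate={total:8.1f} tok/s  per-seq={total / batch:6.1f} tok/s")
```

Under these assumptions, aggregate throughput climbs with batch size because the weight read is amortized across more sequences, while per-sequence speed falls as KV reads grow. That trade-off is what the scheduler has to manage.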

Continuous batching

Static batching waits for a full batch of requests, runs them in lockstep, and idles whenever any sequence finishes early. Continuous batching instead admits and evicts sequences at the token level: the moment one request completes, a waiting request takes its slot in the very next forward pass. On bursty production traffic this alone can double effective throughput, because the GPU never waits on the slowest sequence in a batch.
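The difference can be seen in a toy simulation that counts forward passes. This is a minimal sketch, not a production scheduler: it assumes all requests are already queued, that one forward pass emits exactly one token per running sequence, and the request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Forward passes needed when each batch runs until its longest sequence ends."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Forward passes needed when a finished sequence's slot is refilled immediately."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots before the next pass.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass emits one token per running sequence;
        # sequences that just produced their last token leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


# Hypothetical output lengths (in tokens) for eight requests.
lengths = [100, 10, 10, 10, 100, 10, 10, 10]
print(static_batching_steps(lengths, batch_size=4))      # 200
print(continuous_batching_steps(lengths, batch_size=4))  # 110
```

With two long requests mixed into short ones, the static scheduler pays for the longest sequence in every batch, while the continuous scheduler overlaps the two long requests and backfills the short ones around them.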

Speculative decoding

A small, fast "draft" model proposes several tokens ahead; the large target model then verifies them in a single forward pass. When the draft is right, which it often is for predictable spans, you get multiple tokens for the cost of one verification step. For models like DeepSeek-V4, well-tuned speculation yields a meaningful latency reduction without changing the output: under greedy decoding the tokens are identical to standard decoding, because the first rejected guess and everything after it are discarded and replaced with the target model's own token. With sampling, the standard acceptance rule preserves the target model's output distribution.
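The propose-then-verify loop can be sketched with toy stand-ins for the two models. Everything here is hypothetical: `target_next` and `draft_next` are simple deterministic functions, not real models, and only the greedy case is shown. A real system would score all draft positions in one batched forward pass of the target; the inner loop below stands in for that single pass.

```python
def target_next(context):
    """Toy stand-in for the large model: greedy next token given the context."""
    return (context[-1] * 3 + len(context)) % 11


def draft_next(context):
    """Toy stand-in for the draft model: agrees with the target most of the time."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 11


def greedy_decode(prompt, n_tokens):
    """Baseline: one target pass per generated token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, n_tokens, k=4):
    """Greedy speculative decoding; returns (tokens, number of target passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1. The draft proposes k tokens autoregressively.
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))
        # 2. The target checks every proposed position (one pass in a real system).
        target_passes += 1
        for i in range(k):
            expected = target_next(out + proposal[:i])
            if proposal[i] != expected:
                # 3. First mismatch: keep the target's token, discard the rest.
                out.extend(proposal[:i] + [expected])
                break
        else:
            # Every guess accepted; the same pass also yields one extra token.
            out.extend(proposal + [target_next(out + proposal)])
    return out[len(prompt):len(prompt) + n_tokens], target_passes


prompt = [1, 2, 3]
tokens, passes = speculative_decode(prompt, n_tokens=20)
print(tokens == greedy_decode(prompt, 20))  # True
print(passes)                               # fewer than 20 target passes
```

Every token that ends up in the output is one the target model would have chosen for that context, which is why the result matches plain greedy decoding. The speedup comes entirely from how many draft tokens are accepted per verification pass.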

KV-cache management

The KV cache grows with every token and quickly dominates memory. Paged attention treats it like virtual memory, allocating fixed-size blocks on demand instead of contiguous per-sequence buffers. That all but eliminates fragmentation (the only waste left is the unfilled tail of each sequence's last block), lets you pack far more concurrent sequences into the same HBM, and is a big part of how a single node serves high concurrency without falling over.

Paged KV blocks remove fragmentation and raise the concurrency ceiling.
Quantized KV (where quality allows) further increases the number of sequences in flight.
Prefix caching reuses the KV for shared system prompts across requests.
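A minimal sketch of the block-table idea behind paged KV allocation follows. The class and method names are hypothetical and it tracks only block ownership, not the actual key/value tensors; real implementations also handle block sharing for prefix caching, which is omitted here.

```python
class PagedKVAllocator:
    """Maps each sequence to a list of fixed-size physical blocks, allocated on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; returns False if the pool is exhausted."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        if length == len(table) * self.block_size:  # last block is full (or none yet)
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1
        return True

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):
    alloc.append_token("a")  # 40 tokens need 3 blocks
for _ in range(5):
    alloc.append_token("b")  # 5 tokens need 1 block
print(len(alloc.block_tables["a"]), len(alloc.block_tables["b"]), len(alloc.free_blocks))  # 3 1 4
alloc.free("a")
print(len(alloc.free_blocks))  # 7
```

Because a sequence only ever holds the blocks it has actually filled, memory that a contiguous scheme would reserve up front for the maximum length stays available for other sequences.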

Custom kernels

Off-the-shelf kernels leave performance on the table for specific model shapes. Fused attention and MoE-routing kernels tuned for the B200's memory hierarchy cut kernel-launch overhead and keep tensor cores fed. For mixture-of-experts models like DeepSeek-V4 and GLM-5.2, efficient expert routing is the single biggest lever — a poorly scheduled MoE layer can halve your throughput.
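The bookkeeping that an MoE routing kernel has to do can be sketched in plain Python. This is an illustration of the scheduling problem only, not a GPU kernel: the function names and router scores are hypothetical, and it assumes simple top-k routing where each token is sent to its k highest-scoring experts.

```python
from collections import defaultdict


def route_top_k(router_scores, k):
    """Group token indices by expert so each expert runs once over its tokens.

    router_scores[t][e] is the router's score for token t and expert e.
    Returns {expert_id: [token indices]}.
    """
    groups = defaultdict(list)
    for token, scores in enumerate(router_scores):
        top = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]
        for expert in top:
            groups[expert].append(token)
    return dict(groups)


def load_imbalance(groups, num_experts):
    """Busiest expert's load relative to a perfectly even split (1.0 is ideal)."""
    loads = [len(groups.get(e, [])) for e in range(num_experts)]
    return max(loads) / (sum(loads) / num_experts)


# Hypothetical router scores for 4 tokens over 4 experts.
scores = [
    [0.60, 0.30, 0.08, 0.02],
    [0.50, 0.10, 0.35, 0.05],
    [0.45, 0.40, 0.10, 0.05],
    [0.10, 0.20, 0.30, 0.40],
]
groups = route_top_k(scores, k=2)
print(groups)                     # {0: [0, 1, 2], 1: [0, 2], 2: [1, 3], 3: [3]}
print(load_imbalance(groups, 4))  # 1.5
```

If experts run in parallel, the layer finishes only when the busiest expert does, so an imbalance of 1.5 here suggests the step takes roughly 1.5 times as long as an even split would. That is the kind of scheduling cost a well-tuned routing kernel aims to hide.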

Why dedicated hardware matters

Every technique above assumes you control the whole node. On shared infrastructure, your continuous-batching scheduler is fighting other tenants for the same SMs and HBM bandwidth, and your tail latency reflects their traffic, not yours. Single-tenant GPUs are what make these numbers reproducible in production rather than just in a benchmark — throughput stays flat as concurrency climbs because the hardware is yours alone.

Putting it together

Continuous batching keeps the pipeline full, speculative decoding multiplies tokens per pass, paged attention raises concurrency, and custom kernels cut the per-step overhead — all on isolated hardware. Stack them, and a single 8×B200 node serves tier-1 open models at 300+ tokens/sec while holding sub-150ms time-to-first-token. None of it requires compromising on privacy.
