AI Model Report

Infrastructure · MAY 12, 2026

vLLM v0.20.2 ships Model Runner V2: up to 56% higher throughput on GB200

The May 2026 stable release of vLLM bundles GPU-native Triton kernels, a new async-scheduling stack, FP8 inference, and continuous batching as the default.

By Aiko Tanaka · Inference & serving · May 12, 2026

vLLM v0.20.2, the current stable release as of May 2026, ships Model Runner V2 (MRV2) as the headline change. On NVIDIA's GB200 hardware, MRV2 delivers up to 56% higher throughput than the v0.19 line, by way of two engineering decisions: GPU-native Triton kernels for the hot path, and an async-scheduling reorganization that overlaps work that previously serialized.

The throughput gain gets the attention, but the release changes more of the serving stack:

### Model Runner V2

MRV2 is available in v0.20.0 and later. The "+56%" figure was measured on GB200; results vary by hardware. The gain comes from two changes:

  • GPU-native Triton kernels for hot-path operations, replacing CUDA paths that had been the canonical configuration since v0.16.
  • Async scheduling that decouples token generation from the orchestration loop. Where the prior runner serialized request preparation, KV cache management, and forward-pass execution, MRV2 overlaps them at a much finer granularity.
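To see why overlapping helps, consider a deliberately simplified two-stage model: CPU-side preparation followed by a GPU forward pass. This is a toy sketch, not vLLM's scheduler; the function names and the millisecond costs are illustrative assumptions, and MRV2 is described as overlapping three phases at a finer granularity than this.

```python
def serialized_time(prep_ms: float, forward_ms: float, steps: int) -> float:
    """Total time when every step waits for preparation, then runs the forward pass."""
    return steps * (prep_ms + forward_ms)


def overlapped_time(prep_ms: float, forward_ms: float, steps: int) -> float:
    """Total time when preparation for step i+1 runs during the forward pass of step i."""
    if steps == 0:
        return 0.0
    # The first preparation and the last forward pass cannot be hidden.
    # Each boundary between steps costs whichever stage is slower.
    return prep_ms + (steps - 1) * max(prep_ms, forward_ms) + forward_ms


if __name__ == "__main__":
    # Assumed costs for illustration only: 2 ms preparation, 8 ms forward pass.
    print(serialized_time(2.0, 8.0, 100))
    print(overlapped_time(2.0, 8.0, 100))
```

With these assumed costs, 100 steps take 1000 ms serialized and 802 ms overlapped. The gain in this model depends on how much of the preparation cost can be hidden behind the forward pass, which is one reason measured deltas differ across workloads.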

### FP8 inference

v0.20.2 also ships FP8 inference as a stable feature on H100 and Blackwell GPUs. The throughput improvement is significant on both, and the feature is now enabled with a single flag rather than the multi-flag configuration that the v0.19 series required.

### Continuous batching, by default

Continuous batching, which dynamically groups incoming requests to maximize GPU utilization, is now the default behavior. Prior releases shipped it as an opt-in flag. The change matters because most operators who should have enabled it had not; the v0.20.2 default makes the better behavior the path of least resistance.
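The effect of continuous batching can be illustrated with a toy step counter. This is a minimal sketch under simplifying assumptions (one token per request per decode step, a fixed number of slots, no prefill cost), not vLLM's scheduler; both function names are hypothetical.

```python
from collections import deque
from typing import List


def static_batch_steps(lengths: List[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batch_steps(lengths: List[int], batch_size: int) -> int:
    """Decode steps when a finished request's slot is refilled on the next step."""
    waiting = deque(lengths)
    running: List[int] = []  # tokens still to generate for each active request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step: every active request emits one token,
        # and requests with nothing left to generate leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    output_lengths = [1, 4, 1, 4]  # illustrative output lengths in tokens
    print(static_batch_steps(output_lengths, batch_size=2))
    print(continuous_batch_steps(output_lengths, batch_size=2))
```

For these four requests and two slots, the static schedule needs 8 steps and the continuous one needs 6, because short requests no longer hold a slot idle while a long neighbor finishes.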

### The benchmarks-versus-reproduction gap

The 56% figure is vLLM's published number on GB200 with MRV2. Operator-reproducible numbers vary by:

  • Model architecture. Dense models see different deltas than MoE models, which see different deltas than retrieval-augmented configurations.
  • Batch size. The delta narrows substantially at batch=1.
  • Sequence length. Long-context workloads benefit more from the async-scheduling change.
  • Prefix-caching configuration. vLLM v0.20.2's prefix-caching path has changed slightly; previously hand-tuned prefix configurations may need to be retuned.
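One way to organize a reproduction run is to sweep the factors above and record the relative delta per configuration. The sketch below assumes the operator supplies two measurement callables that return tokens per second; `Measure` and `sweep_deltas` are hypothetical names, and nothing here calls vLLM itself.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical signature: (batch_size, sequence_length) -> tokens per second.
Measure = Callable[[int, int], float]


def sweep_deltas(
    baseline: Measure,
    candidate: Measure,
    batch_sizes: Iterable[int],
    seq_lens: Iterable[int],
) -> Dict[Tuple[int, int], float]:
    """Return the candidate's throughput delta over the baseline, in percent, per configuration."""
    deltas: Dict[Tuple[int, int], float] = {}
    for batch, seq in product(list(batch_sizes), list(seq_lens)):
        base_tps = baseline(batch, seq)
        cand_tps = candidate(batch, seq)
        deltas[(batch, seq)] = (cand_tps - base_tps) / base_tps * 100.0
    return deltas
```

In practice, `baseline` and `candidate` would wrap whatever load generator the operator already uses against the old and new deployments. Keeping the grid explicit makes it easy to see where the delta narrows, for example at batch=1.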

### Competitive picture: vLLM vs. SGLang

SGLang reports 29% higher throughput than vLLM on H100s in some benchmarks (16,200 vs. 12,500 tok/s) and gains of up to 6.4x on prefix-heavy workloads. SGLang's February 2026 release also unlocked 25x inference performance gains, specifically on NVIDIA GB300 NVL72 systems. Operators serving prefix-heavy workloads should still benchmark both engines on their representative traffic, because the right choice is workload-dependent.
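The quoted percentage can be checked against the two throughput figures. A short sketch, where `relative_gain_pct` is a helper name introduced here for illustration:

```python
def relative_gain_pct(candidate_tps: float, baseline_tps: float) -> float:
    """Percentage by which candidate throughput exceeds baseline throughput."""
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0


# Figures quoted above: 16,200 tok/s (SGLang) vs. 12,500 tok/s (vLLM) on H100.
gain = relative_gain_pct(16_200, 12_500)
print(f"{gain:.1f}%")  # 29.6%
```

The two figures work out to about 29.6%, which matches the reported 29% if the number was truncated rather than rounded.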

For inference-stack operators running a Blackwell-class fleet, planning the v0.20.2 migration is next month's homework. The win is real, but the configuration changes are non-trivial: set aside a calendar week for the retune.
