Staff · 2 pieces on file

Aiko Tanaka

Inference & serving

Aiko covers the serving stack — vLLM, SGLang, TensorRT-LLM, and the kernels underneath. Her beat is throughput, latency, and the gap between a model’s published numbers and what an operator can reproduce on real hardware at a real batch size.

Beats: inference

All pieces by Aiko

Infrastructure · JULY 7, 2026

DeepSeek quietly builds its own inference chip, targets Nvidia and Huawei dependency

Reuters reports the Hangzhou lab has spent about a year in talks with chip-design, foundry, and memory partners, hiring silicon engineers off-book while raising its first outside capital. Nvidia slipped 1.6% in premarket.

Verdict An in-house inference part is the logical next move for the lab whose whole reputation is cost-per-token, but export controls on foundries and HBM make this a multi-year bet, not a 2026 story.

Infrastructure · MAY 12, 2026

vLLM v0.20.2 ships Model Runner V2: up to 56% higher throughput on GB200

The May 2026 stable release of vLLM bundles a new GPU-native, Triton-kernel async-scheduling stack and FP8 inference, and makes continuous batching the default.

Verdict The most consequential vLLM update in the past six months. If you're serving on Blackwell-300-class hardware, you should be planning a v0.20.2 migration this quarter.
