Inference Optimization
Serving stack engineering: vLLM, SGLang, TGI, TensorRT-LLM, KV-cache, speculative decoding, prefill/decode disaggregation, FP4/FP8.
5 stories
Other · The piece highlights tokenization drift: small formatting changes that alter token IDs and can quietly change model behavior. It’s a useful reminder for prompt and pipeline stability, but the topic is fairly basic and…
MarkTechPost·May 3·Score 3.6
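
A quick way to see the failure mode the piece describes, using the Hugging Face transformers GPT-2 tokenizer purely as an example (the article does not name a library):

```python
# Tokenization drift in miniature: visually similar prompts encode to
# different token ID sequences. GPT-2's tokenizer is used only as an
# example; the article does not single out a specific model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

variants = [
    "Answer: yes",
    "Answer: yes ",   # trailing space
    "Answer:  yes",   # double space
    "Answer:\nyes",   # newline instead of space
]

for text in variants:
    print(repr(text), "->", tok.encode(text))
# Each variant yields different IDs, so logprobs, cached prefixes, and
# few-shot formatting can all silently diverge.
```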

Infra · Speculative decoding is extended to RL training rollouts, preserving output distributions while speeding up sampling. The result matters for agentic systems because rollout throughput is often a bottleneck in…
TLDR AI Feed·May 1·Score 10.0
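
For background, the distribution-preserving mechanism is the standard speculative-sampling acceptance rule; here is a toy numpy sketch of that rule on a three-token vocabulary, not the cited training code:

```python
# Toy version of the speculative-sampling acceptance rule: a draft
# distribution q proposes a token, the target p accepts it with
# probability min(1, p/q), and rejections are resampled from the
# normalized residual max(p - q, 0). Accepted outputs follow p exactly.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q):
    x = rng.choice(len(q), p=q)               # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):  # target verification
        return x
    residual = np.maximum(p - q, 0.0)
    return rng.choice(len(p), p=residual / residual.sum())

p = np.array([0.7, 0.2, 0.1])  # target distribution
q = np.array([0.4, 0.4, 0.2])  # cheaper draft distribution
samples = [speculative_step(p, q) for _ in range(100_000)]
print(np.bincount(samples, minlength=3) / len(samples))  # ~[0.7, 0.2, 0.1]
```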

Infra · KV cache locality emerges as a major serving lever: the same model and hardware can deliver very different latency and throughput depending on request routing. The piece is useful for teams running long-context or…
TLDR AI Feed·May 1·Score 8.5
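
One way such routing can be sketched is a prefix-affinity policy; the replica names and bookkeeping below are illustrative assumptions, not the article's system:

```python
# Sketch of prefix-affinity routing: send each request to the replica
# whose recent prompts share the longest token prefix, so more of the
# KV cache can be reused. Replica names and the greedy policy are
# illustrative assumptions, not the article's design.

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixAffinityRouter:
    def __init__(self, replicas, keep=32):
        self.keep = keep
        self.recent = {r: [] for r in replicas}  # recent prompts per replica

    def route(self, prompt_tokens):
        best, best_score = None, -1
        for replica, prompts in self.recent.items():
            score = max((common_prefix_len(prompt_tokens, p) for p in prompts),
                        default=0)
            if score > best_score:
                best, best_score = replica, score
        self.recent[best] = (self.recent[best] + [prompt_tokens])[-self.keep:]
        return best

router = PrefixAffinityRouter(["gpu-0", "gpu-1"])
print(router.route([1, 2, 3, 4]))  # gpu-0 (no history yet; first replica wins ties)
print(router.route([1, 2, 3, 9]))  # gpu-0 again: a 3-token shared prefix beats a cold replica
```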

Agentic · This post argues for separating CPU-side orchestration from GPU inference in LLM serving, using a model gateway architecture to manage routing, lifecycle, and compatibility across backends. It is most useful for teams…
TLDR AI Feed·May 1·Score 6.3
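
A minimal sketch of that split, with hypothetical endpoints and a made-up model-to-backend table (the request shapes follow vLLM's OpenAI-compatible API and TGI's native generate endpoint):

```python
# Minimal sketch of a CPU-side model gateway: routing, model lookup, and
# backend compatibility live here, while GPU servers only run inference.
# URLs and the model-to-backend table are made-up examples; the request
# shapes follow vLLM's OpenAI-compatible API and TGI's native /generate.
import requests

BACKENDS = {
    "llama-3-8b": ("http://vllm-pool:8000", "openai"),
    "falcon-40b": ("http://tgi-pool:8080", "tgi"),
}

def generate(model: str, prompt: str, max_tokens: int = 128) -> str:
    base, flavor = BACKENDS[model]
    if flavor == "openai":
        r = requests.post(f"{base}/v1/completions", json={
            "model": model, "prompt": prompt, "max_tokens": max_tokens,
        })
        return r.json()["choices"][0]["text"]
    r = requests.post(f"{base}/generate", json={
        "inputs": prompt, "parameters": {"max_new_tokens": max_tokens},
    })
    return r.json()["generated_text"]
```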

Agentic AI · NVIDIA’s Nemotron 3 Nano Omni is now available through Together AI, bringing a multimodal open model aimed at agentic workflows. The main value is early access to a cross-modal model, though the announcement is light on…
Together.ai·Apr 28·Score 5.5