Inference Optimization
Serving stack engineering: vLLM, SGLang, TGI, TensorRT-LLM, KV-cache, speculative decoding, prefill/decode disaggregation, FP4/FP8.
5 stories
Other · The piece highlights tokenization drift: small formatting changes that alter token IDs and can quietly change model behavior. It’s a useful reminder for prompt and pipeline stability, but the topic is fairly basic and…
MarkTechPost·May 3·Score 3.6
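
A quick way to see the failure mode the piece describes, using the Hugging Face transformers GPT-2 tokenizer purely as an example (the article does not name a library):

```python
# Tokenization drift in miniature: visually similar prompts encode to
# different token ID sequences. GPT-2's tokenizer is used only as an
# example; the article does not single out a specific model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

variants = [
    "Answer: yes",
    "Answer: yes ",   # trailing space
    "Answer:  yes",   # double space
    "Answer:\nyes",   # newline instead of space
]

for text in variants:
    print(repr(text), "->", tok.encode(text))
# Each variant yields different IDs, so logprobs, cached prefixes, and
# few-shot formatting can all silently diverge.
```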

Infra · Speculative decoding is extended to RL training rollouts, preserving output distributions while speeding up sampling. The result matters for agentic systems because rollout throughput is often a bottleneck in…
TLDR AI Feed·May 1·Score 10.0
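
For background, the distribution-preserving mechanism is the standard speculative-sampling acceptance rule; here is a toy numpy sketch of that rule on a three-token vocabulary, not the cited training code:

```python
# Toy version of the speculative-sampling acceptance rule: a draft
# distribution q proposes a token, the target p accepts it with
# probability min(1, p/q), and rejections are resampled from the
# normalized residual max(p - q, 0). Accepted outputs follow p exactly.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p, q):
    x = rng.choice(len(q), p=q)               # draft proposal
    if rng.random() < min(1.0, p[x] / q[x]):  # target verification
        return x
    residual = np.maximum(p - q, 0.0)
    return rng.choice(len(p), p=residual / residual.sum())

p = np.array([0.7, 0.2, 0.1])  # target distribution
q = np.array([0.4, 0.4, 0.2])  # cheaper draft distribution
samples = [speculative_step(p, q) for _ in range(100_000)]
print(np.bincount(samples, minlength=3) / len(samples))  # ~[0.7, 0.2, 0.1]
```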

Infra · KV cache locality emerges as a major serving lever: the same model and hardware can deliver very different latency and throughput depending on request routing. The piece is useful for teams running long-context or…
TLDR AI Feed·May 1·Score 8.5
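
One way such routing can be sketched is a prefix-affinity policy; the replica names and bookkeeping below are illustrative assumptions, not the article's system:

```python
# Sketch of prefix-affinity routing: send each request to the replica
# whose recent prompts share the longest token prefix, so more of the
# KV cache can be reused. Replica names and the greedy policy are
# illustrative assumptions, not the article's design.

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixAffinityRouter:
    def __init__(self, replicas, keep=32):
        self.keep = keep
        self.recent = {r: [] for r in replicas}  # recent prompts per replica

    def route(self, prompt_tokens):
        best, best_score = None, -1
        for replica, prompts in self.recent.items():
            score = max((common_prefix_len(prompt_tokens, p) for p in prompts),
                        default=0)
            if score > best_score:
                best, best_score = replica, score
        self.recent[best] = (self.recent[best] + [prompt_tokens])[-self.keep:]
        return best

router = PrefixAffinityRouter(["gpu-0", "gpu-1"])
print(router.route([1, 2, 3, 4]))  # gpu-0 (no history yet; first replica wins ties)
print(router.route([1, 2, 3, 9]))  # gpu-0 again: a 3-token shared prefix beats a cold replica
```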

Agentic · This post argues for separating CPU-side orchestration from GPU inference in LLM serving, using a model gateway architecture to manage routing, lifecycle, and compatibility across backends. It is most useful for teams…
TLDR AI Feed·May 1·Score 6.3
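
A minimal sketch of that split, with hypothetical endpoints and a made-up model-to-backend table (the request shapes follow vLLM's OpenAI-compatible API and TGI's native generate endpoint):

```python
# Minimal sketch of a CPU-side model gateway: routing, model lookup, and
# backend compatibility live here, while GPU servers only run inference.
# URLs and the model-to-backend table are made-up examples; the request
# shapes follow vLLM's OpenAI-compatible API and TGI's native /generate.
import requests

BACKENDS = {
    "llama-3-8b": ("http://vllm-pool:8000", "openai"),
    "falcon-40b": ("http://tgi-pool:8080", "tgi"),
}

def generate(model: str, prompt: str, max_tokens: int = 128) -> str:
    base, flavor = BACKENDS[model]
    if flavor == "openai":
        r = requests.post(f"{base}/v1/completions", json={
            "model": model, "prompt": prompt, "max_tokens": max_tokens,
        })
        return r.json()["choices"][0]["text"]
    r = requests.post(f"{base}/generate", json={
        "inputs": prompt, "parameters": {"max_new_tokens": max_tokens},
    })
    return r.json()["generated_text"]
```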

Agentic AI · NVIDIA’s Nemotron 3 Nano Omni is now available through Together AI, bringing a multimodal open model aimed at agentic workflows. The main value is early access to a cross-modal model, though the announcement is light on…
Together.ai·Apr 28·Score 5.5