From 266fcc79ad97e101ec720f2b89ea6bdc7354dff9 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Sat, 27 Jun 2026 22:09:05 +0000
Subject: [PATCH] docs(agents): fix A/B-bench gotcha - env-toggle != stock for compiled-in wins

The DGX re-run showed toggling LLAMA_KV_PAGED on/off on the patched binary
does NOT reproduce stock: the dominant SSM decode fusions are compiled in,
not runtime-gated, so the toggle measures only the (here ~neutral) paged-KV
part. True stock needs a separately-built unpatched binary at the same pin.
Correct the methodology skill's per-lever discipline + apples-to-apples
rule accordingly.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .agents/vllm-parity-methodology.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/.agents/vllm-parity-methodology.md b/.agents/vllm-parity-methodology.md
index 0ebc7f140..5114f0712 100644
--- a/.agents/vllm-parity-methodology.md
+++ b/.agents/vllm-parity-methodology.md
@@ -30,10 +30,16 @@ backend README.
    keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.
 
 4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
-   same-session A/B bench (patched-off vs patched-on, identical harness = an exact
-   measure). Bank only what lifts AND gates. **Record every rejected or flat lever
-   with the reason** - over time this is the most valuable part: it stops the next
-   person re-running dead ends.
+   same-harness A/B bench. Use a runtime env-toggle (flag off vs on) ONLY for
+   levers that are actually runtime-gated; a lever **compiled into** the binary
+   (e.g. the SSM decode fusions here) is NOT isolated by a runtime flag, so measure
+   it build-vs-build. The full-patchset "stock" baseline likewise needs a
+   **separately-built unpatched binary at the same pin** - toggling the runtime
+   flag on the patched binary does not reproduce stock (it measures only the gated
+   part; here that was ~neutral, which is exactly how this gotcha hides). Bank only
+   what lifts AND gates. **Record every rejected or flat lever with the reason** -
+   over time this is the most valuable part: it stops the next person re-running
+   dead ends.
 
 5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
    lever measured, not assumed). What remains is physical - the memory-bandwidth
@@ -43,7 +49,9 @@ backend README.
 ## Hard rules learned
 
 - **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
-  (`llama-batched-bench`) is exact - lead with it. Cross-engine "% of vLLM"
+  (`llama-batched-bench`) is exact - lead with it. But "stock" must be a
+  separately-built unpatched binary at the SAME pin, NOT the patched binary with
+  the runtime flag off (compiled-in wins survive the toggle). Cross-engine "% of vLLM"
   (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
   and config (context length alone shifted the MoE figure 76% <-> 86%).
 - **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%
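
Reviewer note: the arithmetic behind the corrected rule can be sketched as below. This is a minimal illustration, not part of the patch. The function name `lever_breakdown` and the throughput figures are hypothetical, chosen only to show why a flag toggle on the patched binary can read as flat while a build-vs-build comparison does not.

```python
def lever_breakdown(stock_tps, patched_flag_off_tps, patched_flag_on_tps):
    """Split a total speedup into compiled-in and runtime-gated parts.

    All three inputs are throughputs (tokens/s) from the SAME harness:
      stock_tps            - separately-built unpatched binary at the same pin
      patched_flag_off_tps - patched binary, runtime flag off
      patched_flag_on_tps  - patched binary, runtime flag on
    """
    for name, value in (
        ("stock_tps", stock_tps),
        ("patched_flag_off_tps", patched_flag_off_tps),
        ("patched_flag_on_tps", patched_flag_on_tps),
    ):
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")

    return {
        # What a build-vs-build comparison isolates (flag off in both cases).
        "compiled_in": patched_flag_off_tps / stock_tps - 1.0,
        # What the env-toggle on the patched binary isolates.
        "runtime_gated": patched_flag_on_tps / patched_flag_off_tps - 1.0,
        # The full patchset against true stock.
        "total": patched_flag_on_tps / stock_tps - 1.0,
    }


if __name__ == "__main__":
    # Hypothetical numbers, NOT measurements from the DGX run.
    result = lever_breakdown(
        stock_tps=100.0,
        patched_flag_off_tps=120.0,
        patched_flag_on_tps=120.0,
    )
    for part, gain in result.items():
        print(f"{part}: {gain:+.1%}")
```

With these made-up inputs the toggle-only comparison (`runtime_gated`) reports no change, while `compiled_in` and `total` both report a 20% gain. That is the shape of the gotcha the commit describes: the toggle measures only the gated part.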