From 266fcc79ad97e101ec720f2b89ea6bdc7354dff9 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sat, 27 Jun 2026 22:09:05 +0000
Subject: [PATCH] docs(agents): fix A/B-bench gotcha - env-toggle != stock for
 compiled-in wins

The DGX re-run showed that toggling LLAMA_KV_PAGED on/off on the patched binary
does NOT reproduce stock: the dominant SSM decode fusions are compiled in, not
runtime-gated, so the toggle measures only the paged-KV part (here ~neutral).
A true stock baseline needs a separately built unpatched binary at the same pin.
Correct the methodology skill's per-lever discipline and apples-to-apples rule
accordingly.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
 .agents/vllm-parity-methodology.md | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)
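
Reviewer note (not part of the commit): the rule this patch documents can be
sketched as a tiny decision helper. This is only an illustration under stated
assumptions: the `Lever` type, `baseline_for`, `speedup`, and the lever names
in the usage note are hypothetical and do not exist in the repository or in
llama.cpp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lever:
    """Hypothetical description of one optimisation lever."""
    name: str
    runtime_gated: bool  # True only if an env flag really switches the code path


def baseline_for(lever: Lever) -> str:
    """Pick the A/B baseline that actually isolates this lever."""
    if lever.runtime_gated:
        # A flag-off run of the same binary is a valid baseline.
        return "same binary, flag off"
    # Compiled-in changes survive the toggle, so compare build-vs-build.
    return "separately built unpatched binary at the same pin"


def speedup(baseline_tps: float, patched_tps: float) -> float:
    """Relative throughput change of patched over baseline (0.0 means flat)."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return patched_tps / baseline_tps - 1.0
```

For example, `baseline_for(Lever("ssm-decode-fusion", runtime_gated=False))`
selects the build-vs-build comparison, while a lever that really is gated by an
env flag keeps the cheaper same-binary toggle.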

diff --git a/.agents/vllm-parity-methodology.md b/.agents/vllm-parity-methodology.md
index 0ebc7f140..5114f0712 100644
--- a/.agents/vllm-parity-methodology.md
+++ b/.agents/vllm-parity-methodology.md
@@ -30,10 +30,16 @@ backend README.
    keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.
 
 4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
-   same-session A/B bench (patched-off vs patched-on, identical harness = an exact
-   measure). Bank only what lifts AND gates. **Record every rejected or flat lever
-   with the reason** - over time this is the most valuable part: it stops the next
-   person re-running dead ends.
+   same-harness A/B bench. Use a runtime env-toggle (flag off vs on) ONLY for
+   levers that are actually runtime-gated; a lever **compiled into** the binary
+   (e.g. the SSM decode fusions here) is NOT isolated by a runtime flag, so measure
+   it build-vs-build. The full-patchset "stock" baseline likewise needs a
+   **separately-built unpatched binary at the same pin** - toggling the runtime
+   flag on the patched binary does not reproduce stock (it measures only the gated
+   part; here that was ~neutral, which is exactly how this gotcha hides). Bank only
+   what lifts AND gates. **Record every rejected or flat lever with the reason** -
+   over time this is the most valuable part: it stops the next person re-running
+   dead ends.
 
 5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
    lever measured, not assumed). What remains is physical - the memory-bandwidth
@@ -43,7 +49,9 @@ backend README.
 ## Hard rules learned
 
 - **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
-  (`llama-batched-bench`) is exact - lead with it. Cross-engine "% of vLLM"
+  (`llama-batched-bench`) is exact - lead with it. But "stock" must be a
+  separately-built unpatched binary at the SAME pin, NOT the patched binary with
+  the runtime flag off (compiled-in wins survive the toggle). Cross-engine "% of vLLM"
   (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
   and config (context length alone shifted the MoE figure 76% <-> 86%).
 - **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%