docs(agents): fix A/B-bench gotcha - env-toggle != stock for compiled-in wins

The DGX re-run showed toggling LLAMA_KV_PAGED on/off on the patched binary does NOT reproduce stock: the dominant SSM decode fusions are compiled in, not runtime-gated, so the toggle measures only the (here ~neutral) paged-KV part. True stock needs a separately-built unpatched binary at the same pin. Correct the methodology skill's per-lever discipline + apples-to-apples rule accordingly.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 10:27:30 -04:00 · 2026-06-27 22:09:05 +00:00
parent 3466094c68
commit 266fcc79ad
1 changed files with 13 additions and 5 deletions
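
The gotcha in the commit message can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the commit: the function and field names are hypothetical, and the throughput figures are invented purely for illustration (they are not measurements from the DGX run). It assumes three tokens-per-second readings from the same harness: a separately-built unpatched binary at the same pin, the patched binary with `LLAMA_KV_PAGED` off, and the patched binary with it on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    """Percentage changes between three same-harness measurements."""
    compiled_in_pct: float    # stock build -> patched build, flag off
    runtime_gated_pct: float  # patched build, flag off -> flag on
    total_pct: float          # stock build -> patched build, flag on


def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate over baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate / baseline - 1.0) * 100.0


def attribute(stock_tps: float,
              patched_off_tps: float,
              patched_on_tps: float) -> Attribution:
    """Split the total gain into its compiled-in and runtime-gated parts.

    stock_tps must come from a separately-built unpatched binary;
    the patched binary with the flag off is NOT a substitute for it.
    """
    return Attribution(
        compiled_in_pct=pct_change(stock_tps, patched_off_tps),
        runtime_gated_pct=pct_change(patched_off_tps, patched_on_tps),
        total_pct=pct_change(stock_tps, patched_on_tps),
    )


if __name__ == "__main__":
    # Hypothetical tokens/s values, chosen only to illustrate the shape
    # of the problem: a large compiled-in gain, a near-neutral toggle.
    result = attribute(stock_tps=100.0,
                       patched_off_tps=120.0,
                       patched_on_tps=121.0)
    print(f"compiled-in (build-vs-build): {result.compiled_in_pct:+.1f}%")
    print(f"runtime-gated (env toggle):   {result.runtime_gated_pct:+.1f}%")
    print(f"full patchset vs stock:       {result.total_pct:+.1f}%")
```

With only the two patched-binary readings, the env-toggle comparison would report the runtime-gated figure alone and the compiled-in part would go unmeasured, which is why the third, build-vs-build reading is required.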
--- a/.agents/vllm-parity-methodology.md
+++ b/.agents/vllm-parity-methodology.md
@@ -30,10 +30,16 @@ backend README.
   keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.

 4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
-   same-session A/B bench (patched-off vs patched-on, identical harness = an exact
-   measure). Bank only what lifts AND gates. **Record every rejected or flat lever
-   with the reason** - over time this is the most valuable part: it stops the next
-   person re-running dead ends.
+   same-harness A/B bench. Use a runtime env-toggle (flag off vs on) ONLY for
+   levers that are actually runtime-gated; a lever **compiled into** the binary
+   (e.g. the SSM decode fusions here) is NOT isolated by a runtime flag, so measure
+   it build-vs-build. The full-patchset "stock" baseline likewise needs a
+   **separately-built unpatched binary at the same pin** - toggling the runtime
+   flag on the patched binary does not reproduce stock (it measures only the gated
+   part; here that was ~neutral, which is exactly how this gotcha hides). Bank only
+   what lifts AND gates. **Record every rejected or flat lever with the reason** -
+   over time this is the most valuable part: it stops the next person re-running
+   dead ends.

 5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
   lever measured, not assumed). What remains is physical - the memory-bandwidth
@@ -43,7 +49,9 @@ backend README.
 ## Hard rules learned

 - **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
-  (`llama-batched-bench`) is exact - lead with it. Cross-engine "% of vLLM"
+  (`llama-batched-bench`) is exact - lead with it. But "stock" must be a
+  separately-built unpatched binary at the SAME pin, NOT the patched binary with
+  the runtime flag off (compiled-in wins survive the toggle). Cross-engine "% of vLLM"
  (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
  and config (context length alone shifted the MoE figure 76% <-> 86%).
 - **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%