docs(agents): fix A/B-bench gotcha - env-toggle != stock for compiled-in wins

The DGX re-run showed toggling LLAMA_KV_PAGED on/off on the patched binary does
NOT reproduce stock: the dominant SSM decode fusions are compiled in, not
runtime-gated, so the toggle measures only the (here ~neutral) paged-KV part.
True stock needs a separately-built unpatched binary at the same pin. Correct the
methodology skill's per-lever discipline + apples-to-apples rule accordingly.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-27 22:09:05 +00:00
parent 3466094c68
commit 266fcc79ad
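
The rule stated in the commit message (a runtime toggle only isolates a runtime-gated lever; a compiled-in lever needs build-vs-build) can be sketched as a small planning helper. This is a minimal illustration, not code from the repository: the `Lever`, `Run`, and `ab_plan` names, the binary paths, and the `"0"`/`"1"` values for the env var are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Lever:
    """A candidate optimization (hypothetical type, for illustration only)."""
    name: str
    # Env var that fully gates the lever at run time; None if it is compiled in.
    runtime_env: Optional[str] = None


@dataclass(frozen=True)
class Run:
    """One side of an A/B bench: which binary to run and with which env."""
    binary: str
    env: Dict[str, str]


def ab_plan(lever: Lever, without_bin: str, with_bin: str) -> Dict[str, Run]:
    """Pick the A/B pair that actually isolates `lever`.

    without_bin: binary built WITHOUT the lever, at the same pin
    with_bin:    binary built WITH the lever
    """
    if lever.runtime_env is not None:
        # Runtime-gated: same binary on both sides, flag off vs on.
        # The "0"/"1" values are an assumption about how the flag is read.
        return {
            "A": Run(with_bin, {lever.runtime_env: "0"}),
            "B": Run(with_bin, {lever.runtime_env: "1"}),
        }
    # Compiled in: no runtime flag removes it, so compare build vs build.
    return {"A": Run(without_bin, {}), "B": Run(with_bin, {})}


if __name__ == "__main__":
    paged = Lever("paged-kv", runtime_env="LLAMA_KV_PAGED")
    fusion = Lever("ssm-decode-fusion")  # compiled in, per the commit message
    for lever in (paged, fusion):
        print(lever.name, ab_plan(lever, "build-stock/bench", "build-patched/bench"))
```

The same reasoning applies to the full-patchset baseline: side "A" has to be the separately built unpatched binary, because any plan that reuses the patched binary on both sides leaves the compiled-in wins in the baseline.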

@@ -30,10 +30,16 @@ backend README.
keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.
4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
-same-session A/B bench (patched-off vs patched-on, identical harness = an exact
-measure). Bank only what lifts AND gates. **Record every rejected or flat lever
-with the reason** - over time this is the most valuable part: it stops the next
-person re-running dead ends.
+same-harness A/B bench. Use a runtime env-toggle (flag off vs on) ONLY for
+levers that are actually runtime-gated; a lever **compiled into** the binary
+(e.g. the SSM decode fusions here) is NOT isolated by a runtime flag, so measure
+it build-vs-build. The full-patchset "stock" baseline likewise needs a
+**separately-built unpatched binary at the same pin** - toggling the runtime
+flag on the patched binary does not reproduce stock (it measures only the gated
+part; here that was ~neutral, which is exactly how this gotcha hides). Bank only
+what lifts AND gates. **Record every rejected or flat lever with the reason** -
+over time this is the most valuable part: it stops the next person re-running
+dead ends.
5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
lever measured, not assumed). What remains is physical - the memory-bandwidth
@@ -43,7 +49,9 @@ backend README.
## Hard rules learned
- **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
-(`llama-batched-bench`) is exact - lead with it. Cross-engine "% of vLLM"
+(`llama-batched-bench`) is exact - lead with it. But "stock" must be a
+separately-built unpatched binary at the SAME pin, NOT the patched binary with
+the runtime flag off (compiled-in wins survive the toggle). Cross-engine "% of vLLM"
(batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
and config (context length alone shifted the MoE figure 76% <-> 86%).
- **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%