docs(paged): record P2 MoE-region NO-GO (kill-gate flat + seam-shape gap)

P2 (expert-major fused routed-FFN region executor, LLAMA_MOE_REGION_EXECUTOR, default-off) is recorded as NO-GO on two independent signals; nothing built beyond the P0 kill-gate, nothing landed, fork localai-paged HEAD untouched at 653bb2f3d (LocalAI series stays at 46 patches, 0001-0055).

(1) Primary GO metric flat: n=257 MOE_SWIGLU_DOWN region 1022.15 us vs grouped-MMQ control 1021.61 us = -0.05% (needed >5% faster); n=128 -0.34%; MUL_MAT_ID_RAGGED_MOE +0.48%/+0.28% (region never engages). All inside the 5-sample spread - reproduces the six prior one-boundary transplants (phases 113/114/122/123/125/127). A compact expert-major layout + single sort, both GEMMs still ragged grouped-MMQ, does not move the ragged-tile tax; that needs P3 Marlin persistent-CTA, not a P2 layout swap.

(2) Decisive structural blocker: q36-35b-a3b-nvfp4 ships separate ffn_gate_exps/ffn_up_exps (+ per-tensor .scale) with ggml_swiglu_split, not the merged gate_up->VIEW->VIEW->SWIGLU->down shape the whole-pattern matcher requires; the matcher, region executor, and pre-existing POC/fused-quant all engage 0x on q36 in prefill and decode.

KL delta 0.000000 is vacuous (0 engagement). Default md5 canonical both models (MoE 8cb0ce23, dense 5951a5b4); test-backend-ops all green both arms.

Prerequisite handoff (gates P2 and P3): rebuild the seam for q36's separate/scaled/swiglu-split FFN shape before any MoE-region lever can engage, then re-evaluate a fused two-GEMM region (not a layout swap).

Topic branch p2-moe-region retained on the fork for forensics at 2d87564dd (base 653bb2f3d), not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 20:37:03 -04:00 · 2026-07-02 15:43:57 +00:00
parent ccf75d1dcd
commit 586639d016
2 changed files with 164 additions and 0 deletions
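
Reviewer note: the kill-gate arithmetic quoted in the commit message can be reproduced with a few lines of Python. This is a minimal sketch, not code from the repository. The function names `speedup_pct` and `kill_gate` are hypothetical, the sign convention (negative means the region arm is slower) is inferred from the "-0.05% (marginally slower)" wording, and the n=128 pair is assumed to be listed control-first like the n=257 pair.

```python
# Hypothetical helper names; only the timings and the 5% threshold come from the commit.
GO_THRESHOLD_PCT = 5.0  # kill-gate: region must be more than 5% faster than control


def speedup_pct(control_us: float, region_us: float) -> float:
    """Percent by which the region arm beats control (negative = region is slower)."""
    return (control_us - region_us) / control_us * 100.0


def kill_gate(control_us: float, region_us: float,
              threshold_pct: float = GO_THRESHOLD_PCT) -> str:
    """Return 'GO' only when the region arm clears the threshold."""
    return "GO" if speedup_pct(control_us, region_us) > threshold_pct else "NO-GO"


# MOE_SWIGLU_DOWN medians (us) as recorded; n=128 ordering is an assumption.
cases = {
    "n=257": (1021.61, 1022.15),
    "n=128": (804.87, 807.63),
}

for label, (control_us, region_us) in cases.items():
    delta = speedup_pct(control_us, region_us)
    print(f"{label}: {delta:+.2f}% -> {kill_gate(control_us, region_us)}")
```

Under these assumptions both rows round to the recorded -0.05% and -0.34%, far from the +5% the gate required.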
--- a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
@@ -367,6 +367,102 @@ patches apply and stage tree `6cf1523047` byte-for-byte == fork HEAD tree.
  loop; keep every pool alloc shape-stable across replays (keyed on n_tokens/n_experts,
  never on data-dependent routing counts) or it forces re-capture.

+#### P2 RESULT (NO-GO, recorded 2026-07-02, `LLAMA_MOE_REGION_EXECUTOR`, default-off)
+
+The layout-only expert-major region executor was implemented, correctness-proven
+on the synthetic sentinel, and A/B'd against the grouped-MMQ control at the P0
+kill-gate. **Verdict: NO-GO on two independent signals; nothing built beyond P0,
+nothing landed.** The topic branch `p2-moe-region` is retained on the DGX fork for
+forensics at `2d87564ddfa26f6c275dad0e1f0e3d8d5413e337` (base `localai-paged`
+`653bb2f3d`, NOT pushed); the fork `localai-paged` HEAD is **untouched at
+`653bb2f3d`** and the LocalAI series stays at 46 patches (`0001-0055`). This
+records P2-at-this-granularity as a confirmed floor.
+
+- **(1) Primary GO metric FLAT (the kill-gate's stated criterion).** The kill-gate
+  required the n=257 (batched large-M) `MOE_SWIGLU_DOWN` rows to improve **> 5%**
+  over the grouped-MMQ control. Measured (region arm vs grouped-MMQ control, 5x
+  medians): control **1021.61 us**, region **1022.15 us** => **-0.05%**
+  (marginally slower). n=128: 804.87 vs 807.63 = -0.34%. `MUL_MAT_ID_RAGGED_MOE`
+  (lone MUL_MAT_ID, region never engages there): n=257 +0.48%, n=128 +0.28% (pure
+  noise, confirms no perturbation of the standalone grouped MMQ). All four deltas
+  sit inside the 5-sample spread => sentinel flat. **This reproduces the six prior
+  one-boundary MoE transplants (phases 113/114/122/123/125/127) - the null
+  hypothesis the scope said P2 had to beat.** A compact expert-major layout + a
+  single route-sort, with both GEMMs still ragged grouped-MMQ, does not move the
+  sentinel; the ragged-tile tiling (the actual +56.5 bucket-2 tax) is *unchanged*
+  by a layout swap. Closing bucket 2 needs P3's Marlin persistent-CTA aggregation,
+  not a P2 layout change.
+  - *Methodology caveat on the sentinel (reported as-is, it is the requested
+    metric):* `test-backend-ops` `eval_perf` duplicates only the down/out node
+    ~n_runs (~1000) times per timed iteration, so the single region invocation is
+    ~1/n_runs of the signal => the perf sentinel is structurally under-sensitive to
+    the region change. The flat verdict is corroborated by signal (2). (The n=257
+    `MOE_SWIGLU_DOWN` case was added to both `make_test_cases_eval` and
+    `make_test_cases_perf`; the eval list already had n=128.)
+- **(2) DECISIVE STRUCTURAL BLOCKER: the seam does not match q36's decision
+  graph.** `q36-35b-a3b-nvfp4.gguf` ships **separate** `ffn_gate_exps` +
+  `ffn_up_exps` (+ per-tensor `.scale`/`.input_scale`), **NOT** a merged
+  `ffn_gate_up_exps` (verified by GGUF tensor-name scan). `llama-graph.cpp`
+  `build_moe_ffn` therefore takes the separate-gate/up branch =>
+  `ffn_moe_gate_scaled` + `ffn_moe_up_scaled` + `ggml_swiglu_split`. The
+  whole-pattern matcher `ggml_cuda_moe_whole_pattern_detect_early` requires the
+  merged `gate_up(MUL_MAT_ID) -> VIEW -> VIEW -> SWIGLU -> down` shape, which is
+  **absent** on q36. Result: `LLAMA_MOE_WHOLE_PATTERN_EARLY_TRACE` fires **0x** on
+  q36 (prefill AND decode); the region executor engages 0x; the pre-existing
+  POC/fused-quant (`LLAMA_MOE_ROUTED_FFN_POC=1 +FUSED_QUANT=1`) also engages 0x.
+  The region only engages on the synthetic merged-shape test sentinel (7
+  engagements/pass, `MOE_SWIGLU_DOWN` 8/8 nmse-correct). **Even a positive sentinel
+  could not have translated to q36 without first extending the matcher + POC to the
+  separate/scaled/swiglu-split shape.**
+- **KL gate: in-band but VACUOUS.** control KLD 0.136563 / same-top-p 83.725%;
+  region KLD 0.136563 / same-top-p 83.725% => delta **0.000000**, byte-identical.
+  In-band (delta < 0.01, top-p >= 84 baseline) but only because the region engages
+  0x on q36 - it is not a KL-neutrality claim for the executor (that is the separate
+  8/8 NVFP4 nmse sentinel).
+- **S_PP @512 (npp512 ntg4 npl32, 5x):** control 2320.62 t/s (stdev 0.23%), region
+  2316.70 t/s (stdev 0.24%) => -0.17% (flat; region == control at 0 engagement;
+  code-present, no regression). **Capture stability:** region S_PP stdev 0.24%
+  across 5 iters = no CUDA-graph re-capture thrash (pool allocs keyed on
+  n_tokens/n_experts held shape-stable).
+- **All correctness gates GREEN, both arms** (default AND
+  `LLAMA_MOE_REGION_EXECUTOR=1`): `test-backend-ops` MUL_MAT 1146/1146, MUL_MAT_ID
+  806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 8/8, MUL_MAT_ID_RAGGED_MOE 6/6,
+  BF16_STREAM_SEGMENT 4/4. Default md5 canonical both models (MoE `8cb0ce23`, dense
+  `5951a5b4`); env-on also canonical (greedy prompt is small-M => region bails).
+  Region correctness where it *does* engage is proven by the 8/8 NVFP4 nmse match
+  incl. n=257 (ne_get_rows=2056).
+- **Implementation (correct, committed on `p2-moe-region`, NOT pushed, ~407 LOC / 6
+  files).** `moe-ffn.cu` `ggml_cuda_moe_region_executor`: one route-sort (ids_meta,
+  cur framing); gate_up grouped NVFP4 MMQ writes a **compact expert-major buffer**
+  via iota `ids_dst` (the token-order `[2*n_ff, n_used, n_tokens]` intermediate
+  never materialised); new `moe_swiglu_nvfp4_quant_compact_kernel` reads the compact
+  buffer by route-slot (no ids_src1 gather); down MMQ unpermutes to token order.
+  Strict all-consumers guard `ggml_cuda_moe_region_consumers_ok` bails if any node
+  outside the 5-node region reads gate_up/views/glu (covers shared-expert aliasing).
+  `LLAMA_MOE_REGION_TRACE`.
+- **Honest delta vs expectation.** The scope's P2 line targeted ~40 of the +56.5
+  bucket-2 prefill tax + the ~11 ms decode MoE residual. **Delivered: 0** (region
+  flat on its sentinel and 0-engagement on the decision model). The compact
+  expert-major layout is the wrong lever at this granularity: it swaps *where* the
+  intermediate lives without changing the ragged-tile GEMM tiling that owns the
+  cost.
+- **Prerequisite handoff (gates P2 AND P3).** Before ANY MoE-region lever can
+  engage on q36, the seam - the whole-pattern matcher, the POC/fused-quant, AND the
+  region executor - must first be **rebuilt for q36's separate
+  `ffn_gate_exps`/`ffn_up_exps` + per-tensor `.scale` + `ggml_swiglu_split` FFN
+  shape**. The current seam only matches a merged shape q36 does not emit. The
+  correct next action is a re-scope of the seam to the separate/scaled shape as the
+  gating prerequisite, then re-evaluate whether a *fused two-GEMM* region (not a
+  layout swap) beats the sentinel - the scope's own null hypothesis holds that the
+  win exists only as the complete fused kernel that never materialises the
+  intermediates.
+- **Artifacts (DGX `~/bench/p2_moe_region/`):** `focused_20260702_172644/` (perf
+  sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`;
+  `killgate_20260702_171826/` (engagement proof: `engage_moe.log`=0,
+  `engage_dense.log`=0); `build_20260702_145928/` (build logs). Environment:
+  `LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, `nsys --cuda-graph-trace=node`, GPU lock
+  held.
+
 ### P3: Marlin-class large-M GEMM retry, ON TOP of P1+P2 (the forensics-informed retry)

 - **Goal:** land the W4A16 Marlin-shape GEMM (FP4->bf16 in-register dequant + bf16
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -2406,3 +2406,71 @@ at pin `0ed235ea` applied all patches and staged tree `6cf1523047` byte-for-byte
 == fork HEAD tree. Nothing pushed. Artifacts:
 `~/bench/p1_bf16_stream/killgate_20260702_135544` and `.../verify_20260702_161229`
 on the DGX; fork topic branch `p1-bf16-stream` retained for forensics.
+
+## P2 expert-major fused MoE region - NO-GO (recorded 2026-07-02)
+
+Second phase of the `EXECUTION_REARCH_SCOPE.md` additive program. The P0
+kill-gate for `LLAMA_MOE_REGION_EXECUTOR` (default-off) returned **NO-GO on two
+independent signals**, so per the phased contract nothing was built beyond P0 and
+nothing landed. See the "P2 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for
+the full record; summary and provenance:
+
+- **Verdict: NO-GO / DO-NOT-SHIP.** The expected-recovery line (~40 of the +56.5
+  bucket-2 prefill tax + ~11 ms decode residual) was **not** delivered - the
+  layout-only expert-major region is flat on its own sentinel and engages 0x on the
+  decision model.
+- **(1) Primary GO metric flat.** Kill-gate needed the n=257 batched-large-M
+  `MOE_SWIGLU_DOWN` rows to beat the grouped-MMQ control by > 5%. Measured (5x
+  medians): control 1021.61 us, region 1022.15 us => **-0.05%** (marginally
+  slower); n=128 -0.34%; `MUL_MAT_ID_RAGGED_MOE` (region never engages) n=257
+  +0.48% / n=128 +0.28% (noise). All four inside the 5-sample spread. This
+  reproduces the six prior one-boundary transplants (phases 113/114/122/123/125/
+  127) - the null hypothesis P2 had to beat. A compact expert-major *layout* + a
+  single sort, with both GEMMs still ragged grouped-MMQ, does not change the
+  ragged-tile tiling that owns the +56.5 tax; that needs P3's Marlin
+  persistent-CTA, not a P2 layout swap. (Sentinel caveat: `eval_perf` duplicates
+  only the down node ~n_runs times, so the region invocation is ~1/n_runs of the
+  signal => under-sensitive; reported as the requested metric, corroborated by
+  signal 2.)
+- **(2) Decisive structural blocker (prerequisite gap).** `q36-35b-a3b-nvfp4.gguf`
+  ships **separate** `ffn_gate_exps` + `ffn_up_exps` (+ per-tensor
+  `.scale`/`.input_scale`), NOT a merged `ffn_gate_up_exps` (GGUF tensor-name scan).
+  `llama-graph.cpp` `build_moe_ffn` takes the separate-gate/up + `ggml_swiglu_split`
+  branch, so the whole-pattern matcher's merged
+  `gate_up(MUL_MAT_ID)->VIEW->VIEW->SWIGLU->down` shape is **absent**. The matcher,
+  the region executor, AND the pre-existing POC/fused-quant all engage **0x** on
+  q36 in prefill and decode. The region only engages on the synthetic merged-shape
+  test sentinel. Even a positive sentinel could not translate to q36 without first
+  rebuilding the seam for the separate/scaled/swiglu-split shape.
+- **KL: vacuously identical.** control and region KLD both 0.136563, same-top-p
+  both 83.725% => delta 0.000000 (byte-identical only because the region engages 0x
+  on q36; not an executor KL-neutrality claim).
+- **S_PP @512 (5x):** control 2320.62 vs region 2316.70 t/s = -0.17% (flat,
+  region == control at 0 engagement; stdev 0.24% => capture-stable, no re-capture
+  thrash).
+- **Correctness GREEN, both arms** (default AND env-on): MUL_MAT 1146/1146,
+  MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 8/8,
+  MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4. Default md5 canonical both
+  models (MoE `8cb0ce23`, dense `5951a5b4`); env-on canonical (small-M bails).
+- **Prerequisite handoff (gates P2 AND P3).** Before any MoE-region lever can
+  engage on q36, re-scope and rebuild the seam (whole-pattern matcher +
+  POC/fused-quant + region executor) for q36's separate `ffn_gate_exps`/
+  `ffn_up_exps` + per-tensor `.scale` + `ggml_swiglu_split` FFN shape. Then
+  re-evaluate a *fused two-GEMM* region (not a layout swap), per the scope's null
+  hypothesis that the win exists only as the complete fused kernel that never
+  materialises the intermediates.
+
+Implementation (correct, committed, NOT pushed, ~407 LOC / 6 files):
+`moe-ffn.cu` `ggml_cuda_moe_region_executor` (one route-sort ids_meta; gate_up
+grouped NVFP4 MMQ writes a compact expert-major buffer via iota ids_dst, token-order
+intermediate never materialised; `moe_swiglu_nvfp4_quant_compact_kernel` reads by
+route-slot; down MMQ unpermutes) + strict all-consumers guard
+`ggml_cuda_moe_region_consumers_ok` + `LLAMA_MOE_REGION_TRACE`.
+
+Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; LocalAI series stays at 46
+patches (`0001-0055`). Topic branch `mudler/llama.cpp:p2-moe-region` retained for
+forensics at `2d87564ddfa26f6c275dad0e1f0e3d8d5413e337` (base `653bb2f3d`, NOT
+pushed). Artifacts on the DGX: `~/bench/p2_moe_region/focused_20260702_172644/`
+(sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`,
+`.../killgate_20260702_171826/` (engagement proof, 0x on both models),
+`.../build_20260702_145928/` (build logs).
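
Reviewer note: the structural blocker in signal (2) comes down to which expert-FFN tensor names a model ships. The sketch below illustrates that check on a plain list of names. It is an assumption-laden illustration, not the scan actually used: `moe_ffn_shape` and `matcher_can_engage` are hypothetical helpers, the `blk.N.<tensor>.<suffix>` name layout in the example lists is assumed, and no GGUF file is read.

```python
def moe_ffn_shape(tensor_names):
    """Classify the routed-FFN layout from tensor names.

    Returns 'merged' (ffn_gate_up_exps present), 'separate'
    (ffn_gate_exps and ffn_up_exps present), 'mixed', or 'none'.
    """
    # Split dotted names into components so 'ffn_gate_exps' never
    # matches inside 'ffn_gate_up_exps' by accident.
    parts = {p for name in tensor_names for p in name.split(".")}
    merged = "ffn_gate_up_exps" in parts
    separate = {"ffn_gate_exps", "ffn_up_exps"} <= parts
    if merged and separate:
        return "mixed"
    if merged:
        return "merged"
    if separate:
        return "separate"
    return "none"


def matcher_can_engage(shape):
    """Per the record above, the whole-pattern matcher needs the merged shape."""
    return shape == "merged"


# Illustrative name lists (assumed layout, not dumped from a real file).
q36_like = [
    "blk.0.ffn_gate_exps.weight", "blk.0.ffn_gate_exps.scale",
    "blk.0.ffn_up_exps.weight", "blk.0.ffn_up_exps.scale",
    "blk.0.ffn_down_exps.weight",
]
sentinel_like = ["blk.0.ffn_gate_up_exps.weight", "blk.0.ffn_down_exps.weight"]

for label, names in (("q36-like", q36_like), ("sentinel-like", sentinel_like)):
    shape = moe_ffn_shape(names)
    print(f"{label}: shape={shape}, matcher engages={matcher_can_engage(shape)}")
```

A check of this kind, run before the kill-gate, would flag a separate-shape model as unable to engage the current seam regardless of how the perf sentinel moves.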