From 586639d016b3775ac212b16ff7544dc0c3e9119b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto
Date: Thu, 2 Jul 2026 15:43:57 +0000
Subject: [PATCH] docs(paged): record P2 MoE-region NO-GO (kill-gate flat + seam-shape gap)

P2 (expert-major fused routed-FFN region executor,
LLAMA_MOE_REGION_EXECUTOR, default-off) is recorded as NO-GO on two
independent signals; nothing built beyond the P0 kill-gate, nothing
landed, fork localai-paged HEAD untouched at 653bb2f3d (LocalAI series
stays at 46 patches, 0001-0055).

(1) Primary GO metric flat: n=257 MOE_SWIGLU_DOWN region 1022.15 us vs
grouped-MMQ control 1021.61 us = -0.05% (needed >5% faster); n=128
-0.34%; MUL_MAT_ID_RAGGED_MOE +0.48%/+0.28% (region never engages). All
inside the 5-sample spread - reproduces the six prior one-boundary
transplants (phases 113/114/122/123/125/127). A compact expert-major
layout + single sort, both GEMMs still ragged grouped-MMQ, does not move
the ragged-tile tax; that needs P3 Marlin persistent-CTA, not a P2 layout
swap.

(2) Decisive structural blocker: q36-35b-a3b-nvfp4 ships separate
ffn_gate_exps/ffn_up_exps (+ per-tensor .scale) with ggml_swiglu_split,
not the merged gate_up->VIEW->VIEW->SWIGLU->down shape the whole-pattern
matcher requires; the matcher, region executor, and pre-existing
POC/fused-quant all engage 0x on q36 in prefill and decode. KL delta
0.000000 is vacuous (0 engagement). Default md5 canonical both models
(MoE 8cb0ce23, dense 5951a5b4); test-backend-ops all green both arms.

Prerequisite handoff (gates P2 and P3): rebuild the seam for q36's
separate/scaled/swiglu-split FFN shape before any MoE-region lever can
engage, then re-evaluate a fused two-GEMM region (not a layout swap).
Topic branch p2-moe-region retained on the fork for forensics at
2d87564dd (base 653bb2f3d), not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto
---
 .../docs/EXECUTION_REARCH_SCOPE.md            | 96 +++++++++++++++++++
 .../docs/PARITY_HANDOFF.md                    | 68 +++++++++++++
 2 files changed, 164 insertions(+)

diff --git a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
index cfdd36e45..7fa4547b7 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
@@ -367,6 +367,102 @@ patches apply and stage tree `6cf1523047` byte-for-byte == fork HEAD tree.
   loop; keep every pool alloc shape-stable across replays (keyed on
   n_tokens/n_experts, never on data-dependent routing counts) or it forces
   re-capture.
+#### P2 RESULT (NO-GO, recorded 2026-07-02, `LLAMA_MOE_REGION_EXECUTOR`, default-off)
+
+The layout-only expert-major region executor was implemented, correctness-proven
+on the synthetic sentinel, and A/B'd against the grouped-MMQ control at the P0
+kill-gate. **Verdict: NO-GO on two independent signals; nothing built beyond P0,
+nothing landed.** The topic branch `p2-moe-region` is retained on the DGX fork for
+forensics at `2d87564ddfa26f6c275dad0e1f0e3d8d5413e337` (base `localai-paged`
+`653bb2f3d`, NOT pushed); the fork `localai-paged` HEAD is **untouched at
+`653bb2f3d`** and the LocalAI series stays at 46 patches (`0001-0055`). This
+records P2-at-this-granularity as a confirmed floor.
+
+- **(1) Primary GO metric FLAT (the kill-gate's stated criterion).** The kill-gate
+  required the n=257 (batched large-M) `MOE_SWIGLU_DOWN` rows to improve **> 5%**
+  over the grouped-MMQ control. Measured (region arm vs grouped-MMQ control, 5x
+  medians): control **1021.61 us**, region **1022.15 us** => **-0.05%**
+  (marginally slower). n=128: 804.87 vs 807.63 = -0.34%.
+  `MUL_MAT_ID_RAGGED_MOE` (lone MUL_MAT_ID, region never engages there): n=257 +0.48%,
+  n=128 +0.28% (pure noise, confirms no perturbation of the standalone grouped MMQ).
+  All four deltas sit inside the 5-sample spread => sentinel flat. **This reproduces
+  the six prior one-boundary MoE transplants (phases 113/114/122/123/125/127) - the
+  null hypothesis the scope said P2 had to beat.** A compact expert-major layout + a
+  single route-sort, with both GEMMs still ragged grouped-MMQ, does not move the
+  sentinel; the ragged-tile tiling (the actual +56.5 bucket-2 tax) is *unchanged*
+  by a layout swap. Closing bucket 2 needs P3's Marlin persistent-CTA aggregation,
+  not a P2 layout change.
+  - *Methodology caveat on the sentinel (reported as-is, it is the requested
+    metric):* `test-backend-ops` `eval_perf` duplicates only the down/out node
+    ~n_runs (~1000) times per timed iteration, so the single region invocation is
+    ~1/n_runs of the signal => the perf sentinel is structurally under-sensitive to
+    the region change. The flat verdict is corroborated by signal (2). (The n=257
+    `MOE_SWIGLU_DOWN` case was added to both `make_test_cases_eval` and
+    `make_test_cases_perf`; the eval list already had n=128.)
+- **(2) DECISIVE STRUCTURAL BLOCKER: the seam does not match q36's decision
+  graph.** `q36-35b-a3b-nvfp4.gguf` ships **separate** `ffn_gate_exps` +
+  `ffn_up_exps` (+ per-tensor `.scale`/`.input_scale`), **NOT** a merged
+  `ffn_gate_up_exps` (verified by GGUF tensor-name scan). `llama-graph.cpp`
+  `build_moe_ffn` therefore takes the separate-gate/up branch =>
+  `ffn_moe_gate_scaled` + `ffn_moe_up_scaled` + `ggml_swiglu_split`. The
+  whole-pattern matcher `ggml_cuda_moe_whole_pattern_detect_early` requires the
+  merged `gate_up(MUL_MAT_ID) -> VIEW -> VIEW -> SWIGLU -> down` shape, which is
+  **absent** on q36.
+  Result: `LLAMA_MOE_WHOLE_PATTERN_EARLY_TRACE` fires **0x** on q36 (prefill AND decode);
+  the region executor engages 0x; the pre-existing POC/fused-quant
+  (`LLAMA_MOE_ROUTED_FFN_POC=1 +FUSED_QUANT=1`) also engages 0x. The region only engages
+  on the synthetic merged-shape test sentinel (7 engagements/pass, `MOE_SWIGLU_DOWN` 8/8
+  nmse-correct). **Even a positive sentinel could not have translated to q36 without
+  first extending the matcher + POC to the separate/scaled/swiglu-split shape.**
+- **KL gate: in-band but VACUOUS.** control KLD 0.136563 / same-top-p 83.725%;
+  region KLD 0.136563 / same-top-p 83.725% => delta **0.000000**, byte-identical.
+  In-band (delta < 0.01, top-p >= 84 baseline) but only because the region engages
+  0x on q36 - it is not a KL-neutrality claim for the executor (that is the separate
+  8/8 NVFP4 nmse sentinel).
+- **S_PP @512 (npp512 ntg4 npl32, 5x):** control 2320.62 t/s (stdev 0.23%), region
+  2316.70 t/s (stdev 0.24%) => -0.17% (flat; region == control at 0 engagement;
+  code-present, no regression). **Capture stability:** region S_PP stdev 0.24%
+  across 5 iters = no CUDA-graph re-capture thrash (pool allocs keyed on
+  n_tokens/n_experts held shape-stable).
+- **All correctness gates GREEN, both arms** (default AND
+  `LLAMA_MOE_REGION_EXECUTOR=1`): `test-backend-ops` MUL_MAT 1146/1146, MUL_MAT_ID
+  806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 8/8, MUL_MAT_ID_RAGGED_MOE 6/6,
+  BF16_STREAM_SEGMENT 4/4. Default md5 canonical both models (MoE `8cb0ce23`, dense
+  `5951a5b4`); env-on also canonical (greedy prompt is small-M => region bails).
+  Region correctness where it *does* engage is proven by the 8/8 NVFP4 nmse match
+  incl. n=257 (ne_get_rows=2056).
+- **Implementation (correct, committed on `p2-moe-region`, NOT pushed, ~407 LOC / 6
+  files).** `moe-ffn.cu` `ggml_cuda_moe_region_executor`: one route-sort (ids_meta,
+  cur framing); gate_up grouped NVFP4 MMQ writes a **compact expert-major buffer**
+  via iota `ids_dst` (the token-order `[2*n_ff, n_used, n_tokens]` intermediate
+  never materialised); new `moe_swiglu_nvfp4_quant_compact_kernel` reads the compact
+  buffer by route-slot (no ids_src1 gather); down MMQ unpermutes to token order.
+  Strict all-consumers guard `ggml_cuda_moe_region_consumers_ok` bails if any node
+  outside the 5-node region reads gate_up/views/glu (covers shared-expert aliasing).
+  `LLAMA_MOE_REGION_TRACE`.
+- **Honest delta vs expectation.** The scope's P2 line targeted ~40 of the +56.5
+  bucket-2 prefill tax + the ~11 ms decode MoE residual. **Delivered: 0** (region
+  flat on its sentinel and 0-engagement on the decision model). The compact
+  expert-major layout is the wrong lever at this granularity: it swaps *where* the
+  intermediate lives without changing the ragged-tile GEMM tiling that owns the
+  cost.
+- **Prerequisite handoff (gates P2 AND P3).** Before ANY MoE-region lever can
+  engage on q36, the seam - the whole-pattern matcher, the POC/fused-quant, AND the
+  region executor - must first be **rebuilt for q36's separate
+  `ffn_gate_exps`/`ffn_up_exps` + per-tensor `.scale` + `ggml_swiglu_split` FFN
+  shape**. The current seam only matches a merged shape q36 does not emit. The
+  correct next action is a re-scope of the seam to the separate/scaled shape as the
+  gating prerequisite, then re-evaluate whether a *fused two-GEMM* region (not a
+  layout swap) beats the sentinel - the scope's own null hypothesis holds that the
+  win exists only as the complete fused kernel that never materialises the
+  intermediates.
+- **Artifacts (DGX `~/bench/p2_moe_region/`):** `focused_20260702_172644/` (perf
+  sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`;
+  `killgate_20260702_171826/` (engagement proof: `engage_moe.log`=0,
+  `engage_dense.log`=0); `build_20260702_145928/` (build logs). Environment:
+  `LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, `nsys --cuda-graph-trace=node`, GPU lock
+  held.
+
 ### P3: Marlin-class large-M GEMM retry, ON TOP of P1+P2 (the forensics-informed retry)
 
 - **Goal:** land the W4A16 Marlin-shape GEMM (FP4->bf16 in-register dequant + bf16
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
index fb7608c90..59abbc7d0 100644
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -2406,3 +2406,71 @@ at pin `0ed235ea` applied all patches and staged tree `6cf1523047` byte-for-byte
 == fork HEAD tree. Nothing pushed. Artifacts:
 `~/bench/p1_bf16_stream/killgate_20260702_135544` and `.../verify_20260702_161229`
 on the DGX; fork topic branch `p1-bf16-stream` retained for forensics.
+
+## P2 expert-major fused MoE region - NO-GO (recorded 2026-07-02)
+
+Second phase of the `EXECUTION_REARCH_SCOPE.md` additive program. The P0
+kill-gate for `LLAMA_MOE_REGION_EXECUTOR` (default-off) returned **NO-GO on two
+independent signals**, so per the phased contract nothing was built beyond P0 and
+nothing landed. See the "P2 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for
+the full record; summary and provenance:
+
+- **Verdict: NO-GO / DO-NOT-SHIP.** The expected-recovery line (~40 of the +56.5
+  bucket-2 prefill tax + ~11 ms decode residual) was **not** delivered - the
+  layout-only expert-major region is flat on its own sentinel and engages 0x on the
+  decision model.
+- **(1) Primary GO metric flat.** Kill-gate needed the n=257 batched-large-M
+  `MOE_SWIGLU_DOWN` rows to beat the grouped-MMQ control by > 5%. Measured (5x
+  medians): control 1021.61 us, region 1022.15 us => **-0.05%** (marginally
+  slower); n=128 -0.34%; `MUL_MAT_ID_RAGGED_MOE` (region never engages) n=257
+  +0.48% / n=128 +0.28% (noise). All four inside the 5-sample spread. This
+  reproduces the six prior one-boundary transplants (phases 113/114/122/123/125/
+  127) - the null hypothesis P2 had to beat. A compact expert-major *layout* + a
+  single sort, with both GEMMs still ragged grouped-MMQ, does not change the
+  ragged-tile tiling that owns the +56.5 tax; that needs P3's Marlin
+  persistent-CTA, not a P2 layout swap. (Sentinel caveat: `eval_perf` duplicates
+  only the down node ~n_runs times, so the region invocation is ~1/n_runs of the
+  signal => under-sensitive; reported as the requested metric, corroborated by
+  signal 2.)
+- **(2) Decisive structural blocker (prerequisite gap).** `q36-35b-a3b-nvfp4.gguf`
+  ships **separate** `ffn_gate_exps` + `ffn_up_exps` (+ per-tensor
+  `.scale`/`.input_scale`), NOT a merged `ffn_gate_up_exps` (GGUF tensor-name scan).
+  `llama-graph.cpp` `build_moe_ffn` takes the separate-gate/up + `ggml_swiglu_split`
+  branch, so the whole-pattern matcher's merged
+  `gate_up(MUL_MAT_ID)->VIEW->VIEW->SWIGLU->down` shape is **absent**. The matcher,
+  the region executor, AND the pre-existing POC/fused-quant all engage **0x** on
+  q36 in prefill and decode. The region only engages on the synthetic merged-shape
+  test sentinel. Even a positive sentinel could not translate to q36 without first
+  rebuilding the seam for the separate/scaled/swiglu-split shape.
+- **KL: vacuously identical.** control and region KLD both 0.136563, same-top-p
+  both 83.725% => delta 0.000000 (byte-identical only because the region engages 0x
+  on q36; not an executor KL-neutrality claim).
+- **S_PP @512 (5x):** control 2320.62 vs region 2316.70 t/s = -0.17% (flat,
+  region == control at 0 engagement; stdev 0.24% => capture-stable, no re-capture
+  thrash).
+- **Correctness GREEN, both arms** (default AND env-on): MUL_MAT 1146/1146,
+  MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 8/8,
+  MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4. Default md5 canonical both
+  models (MoE `8cb0ce23`, dense `5951a5b4`); env-on canonical (small-M bails).
+- **Prerequisite handoff (gates P2 AND P3).** Before any MoE-region lever can
+  engage on q36, re-scope and rebuild the seam (whole-pattern matcher +
+  POC/fused-quant + region executor) for q36's separate `ffn_gate_exps`/
+  `ffn_up_exps` + per-tensor `.scale` + `ggml_swiglu_split` FFN shape. Then
+  re-evaluate a *fused two-GEMM* region (not a layout swap), per the scope's null
+  hypothesis that the win exists only as the complete fused kernel that never
+  materialises the intermediates.
+
+Implementation (correct, committed, NOT pushed, ~407 LOC / 6 files):
+`moe-ffn.cu` `ggml_cuda_moe_region_executor` (one route-sort ids_meta; gate_up
+grouped NVFP4 MMQ writes a compact expert-major buffer via iota ids_dst, token-order
+intermediate never materialised; `moe_swiglu_nvfp4_quant_compact_kernel` reads by
+route-slot; down MMQ unpermutes) + strict all-consumers guard
+`ggml_cuda_moe_region_consumers_ok` + `LLAMA_MOE_REGION_TRACE`.
+
+Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; LocalAI series stays at 46
+patches (`0001-0055`). Topic branch `mudler/llama.cpp:p2-moe-region` retained for
+forensics at `2d87564ddfa26f6c275dad0e1f0e3d8d5413e337` (base `653bb2f3d`, NOT
+pushed). Artifacts on the DGX: `~/bench/p2_moe_region/focused_20260702_172644/`
+(sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`,
+`.../killgate_20260702_171826/` (engagement proof, 0x on both models),
+`.../build_20260702_145928/` (build logs).
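
Reviewer note (illustrative, not part of the patch): the two NO-GO signals recorded above can be re-derived with a few lines of Python. This is a minimal sketch under stated assumptions. The function names (`pct_gain`, `classify_ffn_shape`, `kill_gate`) and the full tensor names such as `blk.0.ffn_gate_exps.weight` are hypothetical; only the base names `ffn_gate_exps`, `ffn_up_exps`, `ffn_gate_up_exps`, the 5% threshold, and the measured medians come from the record. The sign convention follows the record: positive means the region arm is better.

```python
GO_THRESHOLD_PCT = 5.0  # kill-gate criterion from the record: region must be > 5% better


def pct_gain(control, region, higher_is_better=False):
    """Signed percent improvement of the region arm over the control arm.

    Latencies (us) improve when they shrink; throughputs (t/s) improve when
    they grow, so the numerator flips with `higher_is_better`.
    """
    if higher_is_better:
        return (region - control) / control * 100.0
    return (control - region) / control * 100.0


def classify_ffn_shape(tensor_names):
    """Classify the expert FFN layout from tensor names (hypothetical helper).

    'merged'   : a fused ffn_gate_up_exps tensor is present
    'separate' : ffn_gate_exps and ffn_up_exps are both present
    'unknown'  : neither pattern is found
    """
    names = list(tensor_names)
    if any("ffn_gate_up_exps" in n for n in names):
        return "merged"
    has_gate = any("ffn_gate_exps" in n for n in names)
    has_up = any("ffn_up_exps" in n for n in names)
    return "separate" if has_gate and has_up else "unknown"


def kill_gate(control_us, region_us, tensor_names):
    """GO only if the sentinel clears the threshold AND the seam shape matches."""
    fast_enough = pct_gain(control_us, region_us) > GO_THRESHOLD_PCT
    seam_matches = classify_ffn_shape(tensor_names) == "merged"
    return "GO" if fast_enough and seam_matches else "NO-GO"


if __name__ == "__main__":
    # Medians quoted in the record.
    print(f"n=257 MOE_SWIGLU_DOWN: {pct_gain(1021.61, 1022.15):+.2f}%")  # -0.05%
    print(f"n=128 MOE_SWIGLU_DOWN: {pct_gain(804.87, 807.63):+.2f}%")    # -0.34%
    print(f"S_PP @512: {pct_gain(2320.62, 2316.70, higher_is_better=True):+.2f}%")  # -0.17%

    # Hypothetical tensor names in the separate gate/up layout.
    q36_like = [
        "blk.0.ffn_gate_exps.weight",
        "blk.0.ffn_up_exps.weight",
        "blk.0.ffn_down_exps.weight",
    ]
    print(classify_ffn_shape(q36_like))            # separate
    print(kill_gate(1021.61, 1022.15, q36_like))   # NO-GO
```

Either signal alone yields NO-GO in this sketch: the sentinel delta sits far below the 5% threshold, and a separate gate/up layout never matches the merged shape the matcher looks for.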