mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-02 20:37:03 -04:00
docs(paged): record P2 MoE-region NO-GO (kill-gate flat + seam-shape gap)
P2 (expert-major fused routed-FFN region executor, LLAMA_MOE_REGION_EXECUTOR, default-off) is recorded as NO-GO on two independent signals; nothing built beyond the P0 kill-gate, nothing landed, fork localai-paged HEAD untouched at 653bb2f3d (LocalAI series stays at 46 patches, 0001-0055). (1) Primary GO metric flat: n=257 MOE_SWIGLU_DOWN region 1022.15 us vs grouped-MMQ control 1021.61 us = -0.05% (needed >5% faster); n=128 -0.34%; MUL_MAT_ID_RAGGED_MOE +0.48%/+0.28% (region never engages). All inside the 5-sample spread - reproduces the six prior one-boundary transplants (phases 113/114/122/123/125/127). A compact expert-major layout + single sort, both GEMMs still ragged grouped-MMQ, does not move the ragged-tile tax; that needs P3 Marlin persistent-CTA, not a P2 layout swap. (2) Decisive structural blocker: q36-35b-a3b-nvfp4 ships separate ffn_gate_exps/ffn_up_exps (+ per-tensor .scale) with ggml_swiglu_split, not the merged gate_up->VIEW->VIEW->SWIGLU->down shape the whole-pattern matcher requires; the matcher, region executor, and pre-existing POC/fused-quant all engage 0x on q36 in prefill and decode. KL delta 0.000000 is vacuous (0 engagement). Default md5 canonical both models (MoE 8cb0ce23, dense 5951a5b4); test-backend-ops all green both arms. Prerequisite handoff (gates P2 and P3): rebuild the seam for q36's separate/scaled/swiglu-split FFN shape before any MoE-region lever can engage, then re-evaluate a fused two-GEMM region (not a layout swap). Topic branch p2-moe-region retained on the fork for forensics at 2d87564dd (base 653bb2f3d), not pushed. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
@@ -367,6 +367,102 @@ patches apply and stage tree `6cf1523047` byte-for-byte == fork HEAD tree.
|
||||
loop; keep every pool alloc shape-stable across replays (keyed on n_tokens/n_experts,
|
||||
never on data-dependent routing counts) or it forces re-capture.
|
||||
|
||||
#### P2 RESULT (NO-GO, recorded 2026-07-02, `LLAMA_MOE_REGION_EXECUTOR`, default-off)
|
||||
|
||||
The layout-only expert-major region executor was implemented, correctness-proven
|
||||
on the synthetic sentinel, and A/B'd against the grouped-MMQ control at the P0
|
||||
kill-gate. **Verdict: NO-GO on two independent signals; nothing built beyond P0,
|
||||
nothing landed.** The topic branch `p2-moe-region` is retained on the DGX fork for
|
||||
forensics at `2d87564ddfa26f6c275dad0e1f0e3d8d5413e337` (base `localai-paged`
|
||||
`653bb2f3d`, NOT pushed); the fork `localai-paged` HEAD is **untouched at
|
||||
`653bb2f3d`** and the LocalAI series stays at 46 patches (`0001-0055`). This
|
||||
records P2-at-this-granularity as a confirmed floor.
|
||||
|
||||
- **(1) Primary GO metric FLAT (the kill-gate's stated criterion).** The kill-gate
|
||||
required the n=257 (batched large-M) `MOE_SWIGLU_DOWN` rows to improve **> 5%**
|
||||
over the grouped-MMQ control. Measured (region arm vs grouped-MMQ control, 5x
|
||||
medians): control **1021.61 us**, region **1022.15 us** => **-0.05%**
|
||||
(marginally slower). n=128: 804.87 vs 807.63 = -0.34%. `MUL_MAT_ID_RAGGED_MOE`
|
||||
(lone MUL_MAT_ID, region never engages there): n=257 +0.48%, n=128 +0.28% (pure
|
||||
noise, confirms no perturbation of the standalone grouped MMQ). All four deltas
|
||||
sit inside the 5-sample spread => sentinel flat. **This reproduces the six prior
|
||||
one-boundary MoE transplants (phases 113/114/122/123/125/127) - the null
|
||||
hypothesis the scope said P2 had to beat.** A compact expert-major layout + a
|
||||
single route-sort, with both GEMMs still ragged grouped-MMQ, does not move the
|
||||
sentinel; the ragged-tile tiling (the actual +56.5 bucket-2 tax) is *unchanged*
|
||||
by a layout swap. Closing bucket 2 needs P3's Marlin persistent-CTA aggregation,
|
||||
not a P2 layout change.
|
||||
- *Methodology caveat on the sentinel (reported as-is, it is the requested
|
||||
metric):* `test-backend-ops` `eval_perf` duplicates only the down/out node
|
||||
~n_runs (~1000) times per timed iteration, so the single region invocation is
|
||||
~1/n_runs of the signal => the perf sentinel is structurally under-sensitive to
|
||||
the region change. The flat verdict is corroborated by signal (2). (The n=257
|
||||
`MOE_SWIGLU_DOWN` case was added to both `make_test_cases_eval` and
|
||||
`make_test_cases_perf`; the eval list already had n=128.)
|
||||
- **(2) DECISIVE STRUCTURAL BLOCKER: the seam does not match q36's decision
|
||||
graph.** `q36-35b-a3b-nvfp4.gguf` ships **separate** `ffn_gate_exps` +
|
||||
`ffn_up_exps` (+ per-tensor `.scale`/`.input_scale`), **NOT** a merged
|
||||
`ffn_gate_up_exps` (verified by GGUF tensor-name scan). `llama-graph.cpp`
|
||||
`build_moe_ffn` therefore takes the separate-gate/up branch =>
|
||||
`ffn_moe_gate_scaled` + `ffn_moe_up_scaled` + `ggml_swiglu_split`. The
|
||||
whole-pattern matcher `ggml_cuda_moe_whole_pattern_detect_early` requires the
|
||||
merged `gate_up(MUL_MAT_ID) -> VIEW -> VIEW -> SWIGLU -> down` shape, which is
|
||||
**absent** on q36. Result: `LLAMA_MOE_WHOLE_PATTERN_EARLY_TRACE` fires **0x** on
|
||||
q36 (prefill AND decode); the region executor engages 0x; the pre-existing
|
||||
POC/fused-quant (`LLAMA_MOE_ROUTED_FFN_POC=1 +FUSED_QUANT=1`) also engages 0x.
|
||||
The region only engages on the synthetic merged-shape test sentinel (7
|
||||
engagements/pass, `MOE_SWIGLU_DOWN` 8/8 nmse-correct). **Even a positive sentinel
|
||||
could not have translated to q36 without first extending the matcher + POC to the
|
||||
separate/scaled/swiglu-split shape.**
|
||||
- **KL gate: in-band but VACUOUS.** control KLD 0.136563 / same-top-p 83.725%;
|
||||
region KLD 0.136563 / same-top-p 83.725% => delta **0.000000**, byte-identical.
|
||||
In-band (delta < 0.01, top-p >= 84 baseline) but only because the region engages
|
||||
0x on q36 - it is not a KL-neutrality claim for the executor (that is the separate
|
||||
8/8 NVFP4 nmse sentinel).
|
||||
- **S_PP @512 (npp512 ntg4 npl32, 5x):** control 2320.62 t/s (stdev 0.23%), region
|
||||
2316.70 t/s (stdev 0.24%) => -0.17% (flat; region == control at 0 engagement;
|
||||
code-present, no regression). **Capture stability:** region S_PP stdev 0.24%
|
||||
across 5 iters = no CUDA-graph re-capture thrash (pool allocs keyed on
|
||||
n_tokens/n_experts held shape-stable).
|
||||
- **All correctness gates GREEN, both arms** (default AND
|
||||
`LLAMA_MOE_REGION_EXECUTOR=1`): `test-backend-ops` MUL_MAT 1146/1146, MUL_MAT_ID
|
||||
806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 8/8, MUL_MAT_ID_RAGGED_MOE 6/6,
|
||||
BF16_STREAM_SEGMENT 4/4. Default md5 canonical both models (MoE `8cb0ce23`, dense
|
||||
`5951a5b4`); env-on also canonical (greedy prompt is small-M => region bails).
|
||||
Region correctness where it *does* engage is proven by the 8/8 NVFP4 nmse match
|
||||
incl. n=257 (ne_get_rows=2056).
|
||||
- **Implementation (correct, committed on `p2-moe-region`, NOT pushed, ~407 LOC / 6
|
||||
files).** `moe-ffn.cu` `ggml_cuda_moe_region_executor`: one route-sort (ids_meta,
|
||||
cur framing); gate_up grouped NVFP4 MMQ writes a **compact expert-major buffer**
|
||||
via iota `ids_dst` (the token-order `[2*n_ff, n_used, n_tokens]` intermediate
|
||||
never materialised); new `moe_swiglu_nvfp4_quant_compact_kernel` reads the compact
|
||||
buffer by route-slot (no ids_src1 gather); down MMQ unpermutes to token order.
|
||||
Strict all-consumers guard `ggml_cuda_moe_region_consumers_ok` bails if any node
|
||||
outside the 5-node region reads gate_up/views/glu (covers shared-expert aliasing).
|
||||
`LLAMA_MOE_REGION_TRACE`.
|
||||
- **Honest delta vs expectation.** The scope's P2 line targeted ~40 of the +56.5
|
||||
bucket-2 prefill tax + the ~11 ms decode MoE residual. **Delivered: 0** (region
|
||||
flat on its sentinel and 0-engagement on the decision model). The compact
|
||||
expert-major layout is the wrong lever at this granularity: it swaps *where* the
|
||||
intermediate lives without changing the ragged-tile GEMM tiling that owns the
|
||||
cost.
|
||||
- **Prerequisite handoff (gates P2 AND P3).** Before ANY MoE-region lever can
|
||||
engage on q36, the seam - the whole-pattern matcher, the POC/fused-quant, AND the
|
||||
region executor - must first be **rebuilt for q36's separate
|
||||
`ffn_gate_exps`/`ffn_up_exps` + per-tensor `.scale` + `ggml_swiglu_split` FFN
|
||||
shape**. The current seam only matches a merged shape q36 does not emit. The
|
||||
correct next action is a re-scope of the seam to the separate/scaled shape as the
|
||||
gating prerequisite, then re-evaluate whether a *fused two-GEMM* region (not a
|
||||
layout swap) beats the sentinel - the scope's own null hypothesis holds that the
|
||||
win exists only as the complete fused kernel that never materialises the
|
||||
intermediates.
|
||||
- **Artifacts (DGX `~/bench/p2_moe_region/`):** `focused_20260702_172644/` (perf
|
||||
sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`;
|
||||
`killgate_20260702_171826/` (engagement proof: `engage_moe.log`=0,
|
||||
`engage_dense.log`=0); `build_20260702_145928/` (build logs). Environment:
|
||||
`LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, `nsys --cuda-graph-trace=node`, GPU lock
|
||||
held.
|
||||
|
||||
### P3: Marlin-class large-M GEMM retry, ON TOP of P1+P2 (the forensics-informed retry)
|
||||
|
||||
- **Goal:** land the W4A16 Marlin-shape GEMM (FP4->bf16 in-register dequant + bf16
|
||||
|
||||
@@ -2406,3 +2406,71 @@ at pin `0ed235ea` applied all patches and staged tree `6cf1523047` byte-for-byte
|
||||
== fork HEAD tree. Nothing pushed. Artifacts:
|
||||
`~/bench/p1_bf16_stream/killgate_20260702_135544` and `.../verify_20260702_161229`
|
||||
on the DGX; fork topic branch `p1-bf16-stream` retained for forensics.
|
||||
|
||||
## P2 expert-major fused MoE region - NO-GO (recorded 2026-07-02)
|
||||
|
||||
Second phase of the `EXECUTION_REARCH_SCOPE.md` additive program. The P0
|
||||
kill-gate for `LLAMA_MOE_REGION_EXECUTOR` (default-off) returned **NO-GO on two
|
||||
independent signals**, so per the phased contract nothing was built beyond P0 and
|
||||
nothing landed. See the "P2 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for
|
||||
the full record; summary and provenance:
|
||||
|
||||
- **Verdict: NO-GO / DO-NOT-SHIP.** The expected-recovery line (~40 of the +56.5
|
||||
bucket-2 prefill tax + ~11 ms decode residual) was **not** delivered - the
|
||||
layout-only expert-major region is flat on its own sentinel and engages 0x on the
|
||||
decision model.
|
||||
- **(1) Primary GO metric flat.** Kill-gate needed the n=257 batched-large-M
|
||||
`MOE_SWIGLU_DOWN` rows to beat the grouped-MMQ control by > 5%. Measured (5x
|
||||
medians): control 1021.61 us, region 1022.15 us => **-0.05%** (marginally
|
||||
slower); n=128 -0.34%; `MUL_MAT_ID_RAGGED_MOE` (region never engages) n=257
|
||||
+0.48% / n=128 +0.28% (noise). All four inside the 5-sample spread. This
|
||||
reproduces the six prior one-boundary transplants (phases 113/114/122/123/125/
|
||||
127) - the null hypothesis P2 had to beat. A compact expert-major *layout* + a
|
||||
single sort, with both GEMMs still ragged grouped-MMQ, does not change the
|
||||
ragged-tile tiling that owns the +56.5 tax; that needs P3's Marlin
|
||||
persistent-CTA, not a P2 layout swap. (Sentinel caveat: `eval_perf` duplicates
|
||||
only the down node ~n_runs times, so the region invocation is ~1/n_runs of the
|
||||
signal => under-sensitive; reported as the requested metric, corroborated by
|
||||
signal 2.)
|
||||
- **(2) Decisive structural blocker (prerequisite gap).** `q36-35b-a3b-nvfp4.gguf`
|
||||
ships **separate** `ffn_gate_exps` + `ffn_up_exps` (+ per-tensor
|
||||
`.scale`/`.input_scale`), NOT a merged `ffn_gate_up_exps` (GGUF tensor-name scan).
|
||||
`llama-graph.cpp` `build_moe_ffn` takes the separate-gate/up + `ggml_swiglu_split`
|
||||
branch, so the whole-pattern matcher's merged
|
||||
`gate_up(MUL_MAT_ID)->VIEW->VIEW->SWIGLU->down` shape is **absent**. The matcher,
|
||||
the region executor, AND the pre-existing POC/fused-quant all engage **0x** on
|
||||
q36 in prefill and decode. The region only engages on the synthetic merged-shape
|
||||
test sentinel. Even a positive sentinel could not translate to q36 without first
|
||||
rebuilding the seam for the separate/scaled/swiglu-split shape.
|
||||
- **KL: vacuously identical.** control and region KLD both 0.136563, same-top-p
|
||||
both 83.725% => delta 0.000000 (byte-identical only because the region engages 0x
|
||||
on q36; not an executor KL-neutrality claim).
|
||||
- **S_PP @512 (5x):** control 2320.62 vs region 2316.70 t/s = -0.17% (flat,
|
||||
region == control at 0 engagement; stdev 0.24% => capture-stable, no re-capture
|
||||
thrash).
|
||||
- **Correctness GREEN, both arms** (default AND env-on): MUL_MAT 1146/1146,
|
||||
MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 8/8,
|
||||
MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4. Default md5 canonical both
|
||||
models (MoE `8cb0ce23`, dense `5951a5b4`); env-on canonical (small-M bails).
|
||||
- **Prerequisite handoff (gates P2 AND P3).** Before any MoE-region lever can
|
||||
engage on q36, re-scope and rebuild the seam (whole-pattern matcher +
|
||||
POC/fused-quant + region executor) for q36's separate `ffn_gate_exps`/
|
||||
`ffn_up_exps` + per-tensor `.scale` + `ggml_swiglu_split` FFN shape. Then
|
||||
re-evaluate a *fused two-GEMM* region (not a layout swap), per the scope's null
|
||||
hypothesis that the win exists only as the complete fused kernel that never
|
||||
materialises the intermediates.
|
||||
|
||||
Implementation (correct, committed, NOT pushed, ~407 LOC / 6 files):
|
||||
`moe-ffn.cu` `ggml_cuda_moe_region_executor` (one route-sort ids_meta; gate_up
|
||||
grouped NVFP4 MMQ writes a compact expert-major buffer via iota ids_dst, token-order
|
||||
intermediate never materialised; `moe_swiglu_nvfp4_quant_compact_kernel` reads by
|
||||
route-slot; down MMQ unpermutes) + strict all-consumers guard
|
||||
`ggml_cuda_moe_region_consumers_ok` + `LLAMA_MOE_REGION_TRACE`.
|
||||
|
||||
Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; LocalAI series stays at 46
|
||||
patches (`0001-0055`). Topic branch `mudler/llama.cpp:p2-moe-region` retained for
|
||||
forensics at `2d87564ddfa26f6c275dad0e1f0e3d8d5413e337` (base `653bb2f3d`, NOT
|
||||
pushed). Artifacts on the DGX: `~/bench/p2_moe_region/focused_20260702_172644/`
|
||||
(sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`,
|
||||
`.../killgate_20260702_171826/` (engagement proof, 0x on both models),
|
||||
`.../build_20260702_145928/` (build logs).
|
||||
|
||||
Reference in New Issue
Block a user