mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 12:57:02 -04:00
P2 (expert-major fused routed-FFN region executor, LLAMA_MOE_REGION_EXECUTOR, default-off) is recorded as NO-GO on two independent signals; nothing built beyond the P0 kill-gate, nothing landed, fork localai-paged HEAD untouched at 653bb2f3d (LocalAI series stays at 46 patches, 0001-0055).

(1) Primary GO metric flat: n=257 MOE_SWIGLU_DOWN region 1022.15 us vs grouped-MMQ control 1021.61 us = -0.05% (needed >5% faster); n=128 -0.34%; MUL_MAT_ID_RAGGED_MOE +0.48%/+0.28% (region never engages). All inside the 5-sample spread - reproduces the six prior one-boundary transplants (phases 113/114/122/123/125/127). A compact expert-major layout + single sort, both GEMMs still ragged grouped-MMQ, does not move the ragged-tile tax; that needs P3 Marlin persistent-CTA, not a P2 layout swap.

(2) Decisive structural blocker: q36-35b-a3b-nvfp4 ships separate ffn_gate_exps/ffn_up_exps (+ per-tensor .scale) with ggml_swiglu_split, not the merged gate_up->VIEW->VIEW->SWIGLU->down shape the whole-pattern matcher requires; the matcher, region executor, and pre-existing POC/fused-quant all engage 0x on q36 in prefill and decode.

KL delta 0.000000 is vacuous (0 engagement). Default md5 canonical both models (MoE 8cb0ce23, dense 5951a5b4); test-backend-ops all green both arms.

Prerequisite handoff (gates P2 and P3): rebuild the seam for q36's separate/scaled/swiglu-split FFN shape before any MoE-region lever can engage, then re-evaluate a fused two-GEMM region (not a layout swap).

Topic branch p2-moe-region retained on the fork for forensics at 2d87564dd (base 653bb2f3d), not pushed.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
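
For readers checking the arithmetic behind signal (1), the following is a minimal sketch of how the quoted delta and the GO gate could be computed. The function names and the sign convention (positive means the region is faster than the control) are assumptions made for illustration; they are not taken from the benchmark harness, and only the two timings and the 5% threshold come from the message above.

```python
# Hypothetical helpers; names and sign convention are assumptions, not the
# actual harness. Positive result = candidate faster than control.

GO_THRESHOLD_PCT = 5.0  # from the message: region needed to be >5% faster


def speedup_pct(control_us: float, candidate_us: float) -> float:
    """Percent by which the candidate beats the control (negative = slower)."""
    return (control_us - candidate_us) / control_us * 100.0


def is_go(control_us: float, candidate_us: float,
          threshold_pct: float = GO_THRESHOLD_PCT) -> bool:
    """GO only if the candidate is faster by more than the threshold."""
    return speedup_pct(control_us, candidate_us) > threshold_pct


# n=257 MOE_SWIGLU_DOWN: grouped-MMQ control vs region executor
delta = speedup_pct(1021.61, 1022.15)
print(f"{delta:+.2f}%")            # -0.05%
print(is_go(1021.61, 1022.15))     # False
```

Under this convention the n=257 result is a 0.05% slowdown, far from the >5% gate. A fuller check would also compare the delta against the 5-sample spread, which is not reproduced here because the per-sample timings are not given.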