mirror of
https://github.com/mudler/LocalAI.git
synced 2026-07-03 04:46:54 -04:00
P3 (the last big prefill lever) is a decisive NO-GO. The direct-A W4A16 Marlin path was re-created per the section-3 contract, engaged behind LLAMA_W4A16_DIRECT_A, and A/B'd against the FP4-MMQ default: -46.9/-48.0/-49.1% at M=512/1024/2048 (MoE q36-35b-a3b, 3-iter medians).

The forensics retry is REFUTED: the integration tax it blamed was genuinely removed (act-quant 18.92 -> ~0 us/tok; host expert-sort, src1-gather, and separate cast eliminated) and direct-A still lost. nsys graph-node decomposition: the mature bf16 grouped-W4A16 GEMM costs 323.90 us/tok, 1.97x the FP4-MMQ int8 GEMM (164.6 us/tok), which is what bf16 at half the int8/FP4 tensor-core peak on sm_121 predicts. Bucket 2 (GEMM tiling, +56.5) is now a CONFIRMED FP4-MMQ-optimal floor on GB10, joining bucket 1 (GDN scan, P5-confirmed).

Novel sub-finding: fusing the A-gather in-kernel is a NET pessimization vs a separate bf16 pre-cast (+128 added > ~63 tax removed), a GB10-specific inversion of the no-round-trips heuristic.

KL is in-band and better than control (KLD 0.130260 / same-top-p 85.172%); default md5s are green on both models; engagement is proven (7680 env-on, 0 default).

Nothing built beyond P0, nothing landed; fork localai-paged HEAD untouched at 653bb2f3d, series stays at 46 patches; topic branch p3-w4a16-direct retained on the DGX fork at 8eef7ba43 (NOT pushed).

Because P3 is the last major lever, this also writes the program-level conclusion into EXECUTION_REARCH_SCOPE.md section 4a (dated) and corrects the pre-execution projection to measured reality: six phases gated, exactly one landed (P1 +2% MoE prefill, bucket-3 projection boundary); P2/P3/P4/P5 rejected, P6 blocked-on-infra. Prefill closes to ~50-51% of vLLM (not ~55-65%), serving-agg stays ~60.7% (not ~80%), decode-GPU-steady stays ~86% (not ~95%), TTFT stays ~3.4x. The reason is that the two largest prefill buckets (1+2 = +115.7 of the 198.9 gap) are confirmed silicon/bandwidth floors that lift only on datacenter Blackwell.
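As a cross-check of the arithmetic above, here is a minimal sketch that recomputes the derived figures from the numbers quoted in this message. The constant and function names are hypothetical, the unit of the +128/~63 pair is assumed to match the other figures, and nothing here is a new measurement.

```python
# Recompute the derived figures quoted in this commit message.
# All inputs are copied from the message; names are illustrative only.

W4A16_BF16_GEMM_US_PER_TOK = 323.90  # mature bf16 grouped-W4A16 GEMM
FP4_MMQ_GEMM_US_PER_TOK = 164.6      # FP4-MMQ int8 GEMM (default path)
BUCKET_2 = 56.5                      # GEMM tiling bucket
BUCKETS_1_PLUS_2 = 115.7             # GDN scan + GEMM tiling
TOTAL_PREFILL_GAP = 198.9            # full prefill gap
FUSED_GATHER_COST = 128.0            # cost added by the in-kernel A-gather
TAX_REMOVED = 63.0                   # approximate ("~63") integration tax removed


def summarize():
    """Return the ratios and differences the message states or implies."""
    return {
        # bf16 GEMM cost relative to the FP4-MMQ int8 GEMM
        "gemm_ratio": round(W4A16_BF16_GEMM_US_PER_TOK / FP4_MMQ_GEMM_US_PER_TOK, 2),
        # bucket 1 alone, implied by the 1+2 total minus bucket 2
        "bucket_1": round(BUCKETS_1_PLUS_2 - BUCKET_2, 1),
        # share of the prefill gap held by the two confirmed floors
        "floor_share_pct": round(100 * BUCKETS_1_PLUS_2 / TOTAL_PREFILL_GAP, 1),
        # net effect of fusing the A-gather (positive means slower)
        "fused_gather_net": FUSED_GATHER_COST - TAX_REMOVED,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Under these inputs the GEMM ratio rounds to the quoted 1.97x, the two floor buckets cover a little over 58% of the gap, and the fused gather is a net loss of roughly 65.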
This confirms and strengthens the standing conclusion, reached by exhaustion, that GB10 throughput-parity is unreachable; the paged fork's precision parity + memory advantage stand. Default path untouched; canonical md5s green.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>