Ettore Di Giacinto a1a3b99960 docs(paged): record P3 W4A16 direct-A NO-GO + write program-level prefill conclusion
P3 (the last big prefill lever) is a decisive NO-GO. The direct-A W4A16 Marlin
path was re-created per the section-3 contract, engaged behind
LLAMA_W4A16_DIRECT_A, and A/B'd against the FP4-MMQ default: -46.9/-48.0/-49.1%
at M=512/1024/2048 (MoE q36-35b-a3b, 3-iter medians). The forensics retry is
REFUTED - the integration tax it blamed was genuinely removed (act-quant
18.92 -> ~0 us/tok; host expert-sort + src1-gather + separate cast eliminated)
and direct-A still lost. nsys graph-node decomposition: the mature bf16
grouped-W4A16 GEMM costs 323.90 us/tok, 1.97x the FP4-MMQ int8 GEMM (164.6
us/tok), which matches bf16 running at half the int8/FP4 tensor-core peak on
sm_121. Bucket 2 (GEMM tiling,
+56.5) is now a CONFIRMED FP4-MMQ-optimal floor on GB10, joining bucket 1 (GDN
scan, P5-confirmed). Novel sub-finding: fusing the A-gather in-kernel is a NET
pessimization vs a separate bf16 pre-cast (+128 > ~63 tax removed), a
GB10-specific inversion of the no-round-trips heuristic. KL in-band and better
than control (KLD 0.130260 / same-top-p 85.172%); default md5s green both models;
engagement proven (7680 env-on, 0 default). Nothing built beyond P0, nothing
landed; fork localai-paged HEAD untouched at 653bb2f3d, series stays 46 patches;
topic branch p3-w4a16-direct retained on the DGX fork at 8eef7ba43 (NOT pushed).
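
The two headline comparisons in this paragraph can be re-derived from the quoted
figures alone. The sketch below is only a sanity check of that arithmetic, not
part of the measurement tooling; the constant and function names are
hypothetical, and it assumes "+128" and "~63" are expressed in the same unit as
each other.

```python
# Figures copied from the commit message; names are illustrative only.
W4A16_BF16_GEMM_US_PER_TOK = 323.90   # mature bf16 grouped-W4A16 GEMM
FP4_MMQ_INT8_GEMM_US_PER_TOK = 164.6  # FP4-MMQ int8 GEMM (default path)
FUSED_GATHER_COST = 128.0             # "+128": cost added by in-kernel A-gather
INTEGRATION_TAX_REMOVED = 63.0        # "~63": tax removed by going direct-A


def gemm_slowdown() -> float:
    """Ratio of the bf16 W4A16 GEMM cost to the FP4-MMQ int8 GEMM cost."""
    return W4A16_BF16_GEMM_US_PER_TOK / FP4_MMQ_INT8_GEMM_US_PER_TOK


def net_gather_penalty() -> float:
    """Net cost of fusing the A-gather: cost added minus tax removed.

    Assumes both quoted numbers share a unit; a positive result means the
    fused path is a net pessimization.
    """
    return FUSED_GATHER_COST - INTEGRATION_TAX_REMOVED


if __name__ == "__main__":
    print(f"GEMM slowdown: {gemm_slowdown():.2f}x")
    print(f"net gather penalty: +{net_gather_penalty():.0f}")
```

A ratio close to 2 is what a bf16 path at half the int8/FP4 tensor-core peak
would be expected to show, and the positive net penalty is why the fused gather
loses to a separate bf16 pre-cast.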

Because P3 is the last major lever, this also writes the program-level
conclusion into EXECUTION_REARCH_SCOPE.md section 4a (dated) and corrects the
pre-execution projection to measured reality: six phases gated, exactly one
landed (P1 +2% MoE prefill, bucket-3 projection boundary); P2/P3/P4/P5 rejected,
P6 blocked-on-infra. Prefill closes to ~50-51% of vLLM (not ~55-65%),
serving-agg stays ~60.7% (not ~80%), decode-GPU-steady stays ~86% (not ~95%),
TTFT stays ~3.4x - because the two largest prefill buckets (1+2 = +115.7 of the
198.9 gap) are confirmed silicon/bandwidth floors that lift only on datacenter
Blackwell. This confirms and strengthens, by exhaustion of the available
levers, the standing conclusion that GB10 throughput parity is unreachable; the
paged fork's precision parity + memory advantage stand. Default path untouched;
canonical md5s green.
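
As a rough cross-check of the bucket accounting above, the share of the prefill
gap attributed to the two confirmed floors follows directly from the quoted
totals. This is a sketch under the assumption that the bucket figures and the
198.9 gap share one unit; the bucket-1 value is derived here by subtraction and
is not quoted in the message, and all names are hypothetical.

```python
# Figures copied from the commit message; names are illustrative only.
TOTAL_PREFILL_GAP = 198.9    # full prefill gap vs vLLM
BUCKETS_1_AND_2 = 115.7      # GDN scan + GEMM tiling, confirmed floors
BUCKET_2_GEMM_TILING = 56.5  # quoted as "+56.5"


def floor_share() -> float:
    """Fraction of the prefill gap covered by the two confirmed floors."""
    return BUCKETS_1_AND_2 / TOTAL_PREFILL_GAP


def bucket_1_derived() -> float:
    """Bucket 1 (GDN scan), derived by subtraction rather than quoted."""
    return BUCKETS_1_AND_2 - BUCKET_2_GEMM_TILING


def remaining_gap() -> float:
    """Part of the gap not attributed to buckets 1 and 2."""
    return TOTAL_PREFILL_GAP - BUCKETS_1_AND_2


if __name__ == "__main__":
    print(f"floor share: {floor_share():.1%}")
    print(f"bucket 1 (derived): {bucket_1_derived():.1f}")
    print(f"remaining gap: {remaining_gap():.1f}")
```

Under these assumptions, a bit more than half of the gap sits in buckets that
the message classifies as silicon/bandwidth floors, which is consistent with
prefill closing well short of the pre-execution projection.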

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-02 22:31:14 +00:00