docs(paged): record P1 bf16-stream landing (GO)

P1 of the EXECUTION_REARCH_SCOPE additive program landed: LLAMA_BF16_STREAM
(default-off) bf16-resident residual-segment executor for the q36 MoE model's
projection boundaries.

- EXECUTION_REARCH_SCOPE.md: dated "P1 RESULT" subsection (P0 kill-gate GO,
  full build-out deltas, KL, correctness gates, honest magnitude, provenance).
- PARITY_HANDOFF.md: chronology note (verdict, engagement, prefill/KL numbers,
  fork commits, deferred-not-failed measurements).

Key reframe recorded: q36 GDN/attention projections are BF16 weights (not
NVFP4), so bf16-stream is a MoE-model prefill lever; the dense model quantizes
those projections to NVFP4 and engages nothing (stays bit-identical).

Prefill MoE @512 +1.99% (reproducible, at noise floor), KL delta -0.00052
(KL-improving), all md5 + test-backend-ops gates green.

Fork HEAD 653bb2f3d, tree 6cf1523047.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-07-03 04:46:54 -04:00 · 2026-07-02 14:34:26 +00:00
parent 500d653bfa
commit ccf75d1dcd
2 changed files with 132 additions and 0 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md
@@ -248,6 +248,81 @@ started P2/P4/P5/P6.
  types (avoids `ggml.h`, 5 patches). Rides upstream fusion machinery (`ggml_can_fuse`,
  discussion #17621) by adding new clauses, not editing upstream's.

+#### P1 RESULT (landed 2026-07-02, `LLAMA_BF16_STREAM`, default-off)
+
+The bf16-resident residual-segment executor landed as three fork commits on
+`mudler/llama.cpp:localai-paged` (new HEAD `653bb2f3d`, tree `6cf1523047`, base
+`1edddc8fe`): `1271488fc` (segment executor + `norm-bf16.{cu,cuh}` + the
+re-introduced `LLAMA_BF16_CUBLAS_F32_OUT` plank), `91373e1b9` (bf16 residual-add +
+rope op-variants), `653bb2f3d` (test sentinel). LocalAI series regenerated
+additively as `0053-0055` (46 patches total); kill-gate at pin `0ed235ea`: all
+patches apply and stage tree `6cf1523047` byte-for-byte == fork HEAD tree.
+
+- **Mechanism as-shipped (Option A, as scoped).** One additive clause in
+  `ggml_cuda_try_fuse` detects a residual-stream norm-producer (plain
+  `{RMS_NORM,MUL}` attn/GDN input norm, or the 0044 `{SILU,RMS_NORM,MUL,MUL}`
+  ssm_out gated-output norm) whose f32-output consumers are ALL large-M (M>=128)
+  cuBLAS-bf16 projections, runs the norm into a bf16 pool buffer via
+  `norm-bf16.cu` (bit-faithful to the f32 kernels up to the `__float2bfloat16`
+  store), executes the owned span inline through a bf16 view, then skips it. A
+  strict all-consumers-are-ours guard keeps the f32 norm un-materialised and
+  bails to the stock f32 path on small-M / decode / MMQ / native-FP4 /
+  multi-consumer. The `LLAMA_BF16_CUBLAS_F32_OUT` plank lets owned projections
+  write f32 directly from bf16 compute (F32_OUT else-branch byte-identical to the
+  original cuBLAS path). No upstream fuse clause edited; exactly 6 files, cmake
+  untouched (`.cu` globbed).
+- **KEY REFRAME (why a first guard engaged 0).** q36 GDN/attention projections
+  (attn_qkv/gate, ssm_alpha/beta/out) are **BF16 weights, NOT NVFP4**; only the
+  MoE experts (`ffn_*_exps`) are NVFP4. The convert tax therefore lives at the
+  BF16 cuBLAS projection boundary (`op_mul_mat_cublas` src0==BF16 converts f32
+  src1->bf16), not on the FP4-MMQ path (which pays act_quant, not convert). The
+  dense model quantizes its attn/GDN projections to NVFP4, so it **engages
+  nothing** and stays bit-identical. **bf16-stream is a MoE-model prefill lever.**
+- **P0 kill-gate (`~/bench/p1_bf16_stream/killgate_20260702_135544`): GO.** One
+  segment (960 gate_norm->ssm_out engagements/prefill). `convert_unary<float,bf16>`
+  fell 6840->5880 = exactly -960 (163.19->130.73 ms, -19.9%; share 2.27%->1.83%)
+  = 100% within-owned-segment drop (the kill-gate's stated criterion), no
+  boundary convert added. KL: control and bf16 arms **byte-identical** (KLD
+  0.136563 both, same-top-p 83.725% both) => KLD delta 0.000 < 0.01. Prefill S_PP
+  +0.53% (2323.24 vs 2310.94 t/s), inside the 3-sigma noise gate. Default md5
+  GREEN both models. (The total convert bucket only moved 4.83%->4.40% because
+  the minimal segment owns 1 of ~5 BF16 cuBLAS GEMMs per GDN layer; the >50% GO
+  is the within-segment 100%.)
+- **P1 full build-out: 2240 segments/prefill** (2.33x P0's 960) = 960
+  gate_norm->ssm_out (0044, single-consumer) + 1280 multi-consumer plain
+  rms_norm -> {attn q/k/v, GDN in_proj} BF16 projections. Prefill A/B (5 iters,
+  clean, captured before external contention): MoE @512 B=32 **+1.99%**
+  (2361.67 vs 2315.52 t/s; all 5 bf16 samples above all 5 ctrl; reproduced +1.89%),
+  @2048 B=8 +0.95%; dense @512 -0.09% / @2048 -0.10% (no-op). Recovered ~8.44
+  us/tok @512 (wall 431.87->423.43), ~4.02 @2048. Both MoE deltas sit at the
+  max(2%, 3-sigma) floor => classified neutral, but consistent and reproducible
+  positive shifts; no prefill regression => not a NO-GO. Decode S_TG neutral
+  (M<128 bails).
+- **KL gate GREEN (both models).** MoE bf16 KLD 0.136042 vs control 0.136563 =>
+  delta **-0.00052** (bf16 slightly better: F32_OUT keeps the full f32 GEMM
+  result instead of the old bf16 round-trip), inside the +0.01 band; same-top-p
+  84.461% vs 83.725% (>= 84% baseline). Dense: 0 engagements => bit-identical
+  (KLD delta 0, same-top-p 100%).
+- **All correctness gates GREEN.** Default md5 canonical both models
+  (MoE `8cb0ce23`, dense `5951a5b4`); env-on md5 canonical both (small-M bails);
+  `test-backend-ops` MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET
+  46/46, MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4
+  (default AND opt-in). Files: binbcast.cu +10, ggml-cuda.cu +297, norm-bf16.cu
+  +483, norm-bf16.cuh +37, rope.cu +31, test-backend-ops.cpp +79.
+- **Honest magnitude / what remains.** The +1.9-2.0% @512 win is real,
+  reproducible, KL-benign (in fact KL-improving), and safe, but modest:
+  bf16-stream targets only prefill bucket 3 (the ~4.8%-of-wall convert/glue tax)
+  and owns the projection-boundary portion of it (~40% end-to-end), not the
+  GDN-scan (bucket 1) or GEMM-tiling (bucket 2) buckets. Read the "expected
+  recovery: ~45 us/tok" line above as an upper bound on the whole bucket-3+4
+  region; this landing captures the bucket-3 projection boundary only. The next
+  P1 increment on the table = extend the multi-consumer executor to own the
+  bf16->f32 dst direction plus the remaining attn_norm-fed projection src1
+  converts (~4 more converts/layer). Deferred (blocked only by an external
+  imatrix job contending the GPU, not a failed gate): the nsys graph-node bucket
+  table, decode S_TG @npl128, and the Phase130 serving A/B need a clean idle GB10
+  re-run; the scope deems throughput-neutral serving acceptable on GB10.
+
 ### P2: expert-major fused routed-FFN region executor (grow the merged MoE seam into the real thing)

 - **Goal:** drive both MoE GEMMs expert-major so the gate_up output never lands in
--- a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
@@ -2349,3 +2349,60 @@ Fork branch `mudler/llama.cpp:localai-paged` re-mirrored on top of

 New fork HEAD `1edddc8fe`, tree `097c862c`. The rejected/neutral levers of
 the 110-140 campaign are recorded above and in the per-phase bench artifacts.
+
+## P1 bf16-native execution pass - LANDED (2026-07-02)
+
+First phase of the `EXECUTION_REARCH_SCOPE.md` additive program to land.
+`LLAMA_BF16_STREAM` (default-off) runs a bf16-resident residual-segment
+executor for the q36 MoE decision model's projection boundaries, deleting the
+per-op `f32->bf16` convert the stock cuBLAS-bf16 path pays at the projection
+`src1`. See the "P1 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for the
+full record; summary and provenance:
+
+- **Verdict: GO / SHIP.** P0 kill-gate GO; P1 build-out and independent
+  verification passed all correctness gates; prefill positive and reproducible;
+  KL-improving.
+- **Key reframe:** q36 GDN/attention projections (attn_qkv/gate,
+  ssm_alpha/beta/out) are **BF16 weights, not NVFP4** - only the MoE experts
+  (`ffn_*_exps`) are NVFP4. The convert tax lives at the BF16 cuBLAS projection
+  boundary (`op_mul_mat_cublas` src0==BF16), so bf16-stream is a **MoE-model
+  lever**; the dense model quantizes those projections to NVFP4 and engages
+  nothing (stays bit-identical).
+- **Engagement:** P0 = 960 gate_norm->ssm_out segments/prefill; full build-out =
+  2240 (960 single-consumer 0044 ssm_out + 1280 multi-consumer plain-rms_norm ->
+  {attn q/k/v, GDN in_proj}).
+- **Prefill (MoE @512 B=32):** +1.99% (2361.67 vs 2315.52 t/s, all 5 bf16 > all 5
+  ctrl; reproduced +1.89%); @2048 +0.95%; dense no-op (-0.09%). Recovered ~8.44
+  us/tok @512. At the noise floor -> classified neutral but reproducible; no
+  regression.
+- **KL (MoE):** bf16 KLD 0.136042 vs control 0.136563 => delta -0.00052 (bf16
+  slightly better via the `LLAMA_BF16_CUBLAS_F32_OUT` plank keeping the full f32
+  GEMM result); same-top-p 84.461% vs 83.725% (>= 84% baseline). Dense: 0
+  engagements => bit-identical.
+- **Correctness:** default md5 canonical both models (MoE `8cb0ce23`, dense
+  `5951a5b4`) present-but-off and env-on (small-M bails); `test-backend-ops`
+  MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN
+  7/7, MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4.
+- **Honest scope:** targets prefill bucket 3 (the ~4.8%-of-wall convert/glue
+  tax) only, and owns the projection-boundary portion of it (~40% end-to-end) -
+  not the GDN-scan (bucket 1, P5) or GEMM-tiling (bucket 2, P2/P3) buckets. Well
+  below the scope's optimistic ~45 us/tok target by construction. Next increment
+  = own the bf16->f32 dst direction + the remaining attn_norm-fed projection
+  src1 converts.
+- **Deferred (blocked by an external imatrix job contending the GB10, NOT a
+  failed gate):** the nsys graph-node bucket table, decode S_TG @npl128, and the
+  Phase130 serving A/B need a clean idle-GPU re-run.
+
+Fork branch `mudler/llama.cpp:localai-paged` fast-forwarded on top of
+`1edddc8fe` (LocalAI series `0001-0052`) with three P1 commits:
+
+- `1271488fc` feat(paged): P1 bf16-stream residual-segment executor +
+  norm-bf16 kernels (+ the re-introduced `LLAMA_BF16_CUBLAS_F32_OUT` plank)
+- `91373e1b9` feat(paged): P1 bf16-stream bf16 residual-add + rope op-variants
+- `653bb2f3d` test(paged): P1 bf16-stream BF16_STREAM_SEGMENT sentinel
+
+New fork HEAD `653bb2f3d`, tree `6cf1523047`. LocalAI series regenerated
+additively as `0053-0055` (46 patches total, `0001-0052` untouched); kill-gate
+at pin `0ed235ea` applied all patches and staged tree `6cf1523047` byte-for-byte
+== fork HEAD tree. Nothing pushed. Artifacts:
+`~/bench/p1_bf16_stream/killgate_20260702_135544` and `.../verify_20260702_161229`
+on the DGX; fork topic branch `p1-bf16-stream` retained for forensics.
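
The headline deltas in the record above follow from the raw figures it quotes. A minimal sketch of that arithmetic, assuming only the numbers stated in the P1 RESULT subsection; the helper names `pct_delta` and `us_per_token` are illustrative and are not part of the fork or its bench tooling:

```python
# Re-derive the deltas quoted in the P1 RESULT record from its raw numbers.
# Inputs are copied from the record; helper names are hypothetical.


def pct_delta(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


def us_per_token(tokens_per_s: float) -> float:
    """Wall time per token, in microseconds, for a given throughput."""
    return 1e6 / tokens_per_s


# MoE prefill A/B @512 B=32 (tokens/s): bf16-stream arm vs control arm.
bf16_tps, ctrl_tps = 2361.67, 2315.52
prefill_gain_pct = pct_delta(bf16_tps, ctrl_tps)
recovered_us = us_per_token(ctrl_tps) - us_per_token(bf16_tps)

# P0 kill-gate: convert_unary<float,bf16> call count and time (ms).
convert_call_delta = 5880 - 6840
convert_time_pct = pct_delta(130.73, 163.19)

# KL gate (MoE): bf16 arm minus control arm; negative means bf16 is closer.
kld_delta = 0.136042 - 0.136563

# Full build-out engagement: single-consumer plus multi-consumer segments.
segments = 960 + 1280
growth_vs_p0 = segments / 960

print(f"prefill gain   : {prefill_gain_pct:+.2f}%")
print(f"recovered      : {recovered_us:.2f} us/tok")
print(f"convert calls  : {convert_call_delta:+d}")
print(f"convert time   : {convert_time_pct:+.1f}%")
print(f"KLD delta      : {kld_delta:+.5f}")
print(f"segments       : {segments} ({growth_vs_p0:.2f}x P0)")
```

The us/tok figure is simply the difference of the reciprocal throughputs, which is why a roughly 2% throughput gain at about 2300 t/s maps to only about 8 us/tok of the bucket the scope document budgets.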