docs(paged): record P1 bf16-stream landing (GO)

P1 of the EXECUTION_REARCH_SCOPE additive program landed: LLAMA_BF16_STREAM
(default-off) bf16-resident residual-segment executor for the q36 MoE model's
projection boundaries.

- EXECUTION_REARCH_SCOPE.md: dated "P1 RESULT" subsection (P0 kill-gate GO,
  full build-out deltas, KL, correctness gates, honest magnitude, provenance).
- PARITY_HANDOFF.md: chronology note (verdict, engagement, prefill/KL numbers,
  fork commits, deferred-not-failed measurements).

Key reframe recorded: q36 GDN/attention projections are BF16 weights (not
NVFP4), so bf16-stream is a MoE-model prefill lever; the dense model quantizes
those projections to NVFP4 and engages nothing (stays bit-identical). Prefill
MoE @512 +1.99% (reproducible, at noise floor), KL delta -0.00052 (KL-improving),
all md5 + test-backend-ops gates green. Fork HEAD 653bb2f3d, tree 6cf1523047.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
This commit is contained in:
Ettore Di Giacinto
2026-07-02 14:34:26 +00:00
parent 500d653bfa
commit ccf75d1dcd
2 changed files with 132 additions and 0 deletions

View File

@@ -248,6 +248,81 @@ started P2/P4/P5/P6.
types (avoids `ggml.h`, 5 patches). Rides upstream fusion machinery (`ggml_can_fuse`,
discussion #17621) by adding new clauses, not editing upstream's.
#### P1 RESULT (landed 2026-07-02, `LLAMA_BF16_STREAM`, default-off)
The bf16-resident residual-segment executor landed as three fork commits on
`mudler/llama.cpp:localai-paged` (new HEAD `653bb2f3d`, tree `6cf1523047`, base
`1edddc8fe`): `1271488fc` (segment executor + `norm-bf16.{cu,cuh}` + the
re-introduced `LLAMA_BF16_CUBLAS_F32_OUT` plank), `91373e1b9` (bf16 residual-add
+ rope op-variants), `653bb2f3d` (test sentinel). LocalAI series regenerated
additively as `0053-0055` (46 patches total); kill-gate at pin `0ed235ea`: all
patches apply and stage tree `6cf1523047` byte-for-byte == fork HEAD tree.
- **Mechanism as-shipped (Option A, as scoped).** One additive clause in
`ggml_cuda_try_fuse` detects a residual-stream norm-producer (plain
`{RMS_NORM,MUL}` attn/GDN input norm, or the 0044 `{SILU,RMS_NORM,MUL,MUL}`
ssm_out gated-output norm) whose f32-output consumers are ALL large-M (M>=128)
cuBLAS-bf16 projections, runs the norm into a bf16 pool buffer via
`norm-bf16.cu` (bit-faithful to the f32 kernels up to the `__float2bfloat16`
store), executes the owned span inline through a bf16 view, then skips it. A
strict all-consumers-are-ours guard keeps the f32 norm un-materialised and
bails to the stock f32 path on small-M / decode / MMQ / native-FP4 /
multi-consumer. The `LLAMA_BF16_CUBLAS_F32_OUT` plank lets owned projections
write f32 directly from bf16 compute (F32_OUT else-branch byte-identical to the
original cuBLAS path). No upstream fuse clause edited; exactly 6 files, cmake
untouched (`.cu` globbed).
- **KEY REFRAME (why a first guard engaged 0).** q36 GDN/attention projections
(attn_qkv/gate, ssm_alpha/beta/out) are **BF16 weights, NOT NVFP4**; only the
MoE experts (`ffn_*_exps`) are NVFP4. The convert tax therefore lives at the
BF16 cuBLAS projection boundary (`op_mul_mat_cublas` src0==BF16 converts f32
src1->bf16), not on the FP4-MMQ path (which pays act_quant, not convert). The
dense model quantizes its attn/GDN projections to NVFP4, so it **engages
nothing** and stays bit-identical. **bf16-stream is a MoE-model prefill lever.**
- **P0 kill-gate (`~/bench/p1_bf16_stream/killgate_20260702_135544`): GO.** One
segment (960 gate_norm->ssm_out engagements/prefill). `convert_unary<float,bf16>`
fell 6840->5880 = exactly -960 (163.19->130.73 ms, -19.9%; share 2.27%->1.83%)
= 100% within-owned-segment drop (the kill-gate's stated criterion), no
boundary convert added. KL: control and bf16 arms **byte-identical** (KLD
0.136563 both, same-top-p 83.725% both) => KLD delta 0.000 < 0.01. Prefill S_PP
+0.53% (2323.24 vs 2310.94 t/s), inside the 3-sigma noise gate. Default md5
GREEN both models. (The total convert bucket only moved 4.83%->4.40% because
the minimal segment owns 1 of ~5 BF16 cuBLAS GEMMs per GDN layer; the >50% GO
is the within-segment 100%.)
- **P1 full build-out: 2240 segments/prefill** (2.33x P0's 960) = 960
gate_norm->ssm_out (0044, single-consumer) + 1280 multi-consumer plain
rms_norm -> {attn q/k/v, GDN in_proj} BF16 projections. Prefill A/B (5 iters,
clean, captured before external contention): MoE @512 B=32 **+1.99%**
(2361.67 vs 2315.52 t/s; all 5 bf16 samples above all 5 ctrl; reproduced +1.89%),
@2048 B=8 +0.95%; dense @512 -0.09% / @2048 -0.10% (no-op). Recovered ~8.44
us/tok @512 (wall 431.87->423.43), ~4.02 @2048. Both MoE deltas sit at the
max(2%, 3-sigma) floor => classified neutral, but consistent and reproducible
positive shifts; no prefill regression => not a NO-GO. Decode S_TG neutral
(M<128 bails).
- **KL gate GREEN (both models).** MoE bf16 KLD 0.136042 vs control 0.136563 =>
delta **-0.00052** (bf16 slightly better: F32_OUT keeps the full f32 GEMM
result instead of the old bf16 round-trip), inside the +0.01 band; same-top-p
84.461% vs 83.725% (>= 84% baseline). Dense: 0 engagements => bit-identical
(KLD delta 0, same-top-p 100%).
- **All correctness gates GREEN.** Default md5 canonical both models
(MoE `8cb0ce23`, dense `5951a5b4`); env-on md5 canonical both (small-M bails);
`test-backend-ops` MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET
46/46, MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4
(default AND opt-in). Files: binbcast.cu +10, ggml-cuda.cu +297, norm-bf16.cu
+483, norm-bf16.cuh +37, rope.cu +31, test-backend-ops.cpp +79.
- **Honest magnitude / what remains.** The +1.9-2.0% @512 win is real,
reproducible, KL-benign (in fact KL-improving), and safe, but modest:
bf16-stream targets only prefill bucket 3 (the ~4.8%-of-wall convert/glue tax)
and owns the projection-boundary portion of it (~40% end-to-end), not the
GDN-scan (bucket 1) or GEMM-tiling (bucket 2) buckets. Read the "expected
recovery: ~45 us/tok" line above as an upper bound on the whole bucket-3+4
region; this landing captures the bucket-3 projection boundary only. The next
P1 increment on the table = extend the multi-consumer executor to own the
bf16->f32 dst direction plus the remaining attn_norm-fed projection src1
converts (~4 more converts/layer). Deferred (blocked only by an external
imatrix job contending the GPU, not a failed gate): the nsys graph-node bucket
table, decode S_TG @npl128, and the Phase130 serving A/B need a clean idle GB10
re-run; the scope deems throughput-neutral serving acceptable on GB10.
### P2: expert-major fused routed-FFN region executor (grow the merged MoE seam into the real thing)
- **Goal:** drive both MoE GEMMs expert-major so the gate_up output never lands in

View File

@@ -2349,3 +2349,60 @@ Fork branch `mudler/llama.cpp:localai-paged` re-mirrored on top of
New fork HEAD `1edddc8fe`, tree `097c862c`. The rejected/neutral levers of
the 110-140 campaign are recorded above and in the per-phase bench artifacts.
## P1 bf16-native execution pass - LANDED (2026-07-02)
First phase of the `EXECUTION_REARCH_SCOPE.md` additive program to land.
`LLAMA_BF16_STREAM` (default-off) runs a bf16-resident residual-segment
executor for the q36 MoE decision model's projection boundaries, deleting the
per-op `f32->bf16` convert the stock cuBLAS-bf16 path pays at the projection
`src1`. See the "P1 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for the
full record; summary and provenance:
- **Verdict: GO / SHIP.** P0 kill-gate GO, P1 build-out and independent verify
all correctness gates green, prefill positive-and-reproducible, KL-improving.
- **Key reframe:** q36 GDN/attention projections (attn_qkv/gate,
ssm_alpha/beta/out) are **BF16 weights, not NVFP4** - only the MoE experts
(`ffn_*_exps`) are NVFP4. The convert tax lives at the BF16 cuBLAS projection
boundary (`op_mul_mat_cublas` src0==BF16), so bf16-stream is a **MoE-model
lever**; the dense model quantizes those projections to NVFP4 and engages
nothing (stays bit-identical).
- **Engagement:** P0 = 960 gate_norm->ssm_out segments/prefill; full build-out =
2240 (960 single-consumer 0044 ssm_out + 1280 multi-consumer plain-rms_norm ->
{attn q/k/v, GDN in_proj}).
- **Prefill (MoE @512 B=32):** +1.99% (2361.67 vs 2315.52 t/s, all 5 bf16 > all 5
ctrl; reproduced +1.89%); @2048 +0.95%; dense no-op (-0.09%). Recovered ~8.44
us/tok @512. At the noise floor -> classified neutral but reproducible; no
regression.
- **KL (MoE):** bf16 KLD 0.136042 vs control 0.136563 => delta -0.00052 (bf16
slightly better via the `LLAMA_BF16_CUBLAS_F32_OUT` plank keeping the full f32
GEMM result); same-top-p 84.461% vs 83.725% (>= 84% baseline). Dense: 0
engagements => bit-identical.
- **Correctness:** default md5 canonical both models (MoE `8cb0ce23`, dense
`5951a5b4`) present-but-off and env-on (small-M bails); `test-backend-ops`
MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN
7/7, MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4.
- **Honest scope:** targets prefill bucket 3 (the ~4.8%-of-wall convert/glue
tax) only, and owns the projection-boundary portion of it (~40% end-to-end) -
not the GDN-scan (bucket 1, P5) or GEMM-tiling (bucket 2, P2/P3) buckets. Well
below the scope's optimistic ~45 us/tok target by construction. Next increment
= own the bf16->f32 dst direction + the remaining attn_norm-fed projection
src1 converts.
- **Deferred (blocked by an external imatrix job contending the GB10, NOT a
failed gate):** the nsys graph-node bucket table, decode S_TG @npl128, and the
Phase130 serving A/B need a clean idle-GPU re-run.
Fork branch `mudler/llama.cpp:localai-paged` fast-forwarded on top of
`1edddc8fe` (LocalAI series `0001-0052`) with three P1 commits:
- `1271488fc` feat(paged): P1 bf16-stream residual-segment executor +
norm-bf16 kernels (+ the re-introduced `LLAMA_BF16_CUBLAS_F32_OUT` plank)
- `91373e1b9` feat(paged): P1 bf16-stream bf16 residual-add + rope op-variants
- `653bb2f3d` test(paged): P1 bf16-stream BF16_STREAM_SEGMENT sentinel
New fork HEAD `653bb2f3d`, tree `6cf1523047`. LocalAI series regenerated
additively as `0053-0055` (46 patches total, `0001-0052` untouched); kill-gate
at pin `0ed235ea` applied all patches and staged tree `6cf1523047` byte-for-byte
== fork HEAD tree. Nothing pushed. Artifacts:
`~/bench/p1_bf16_stream/killgate_20260702_135544` and `.../verify_20260702_161229`
on the DGX; fork topic branch `p1-bf16-stream` retained for forensics.