feat(paged): qwen35 hybrid per-head f32/bf16 SSM state (patch 0026)

Lever A patch + build/de-risk results. Splits the persisted gated-DeltaNet
recurrent state per head: f32 on long-memory heads (where bf16 rounding does not
contract and the KL error concentrates), bf16 on fast-decaying heads, classified
at model load by tau_h = 1/(|ssm_a|*softplus(ssm_dt)). Default ssm_hybrid_tau_thresh
= 0.0 keeps every head f32 (bit-exact opt-out).

De-risk gates BOTH PASS: test-backend-ops GATED_DELTA_NET CUDA0 OK (incl 32 hybrid
mixed CUDA-vs-CPU cases); default all-f32 greedy md5 == 0023 baseline both models
(dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd).

Known open issue (opt-in hybrid only; default unaffected): hybrid-ON model decode
(ids in-place path) is incoherent; classifier/cache/kernel-params verified correct,
bug isolated to the ids in-place cross-step state path. See A_HYBRID_SSM_RESULTS.md.
Not ready for the GateSweep until fixed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
Ettore Di Giacinto
2026-06-26 16:21:33 +00:00
parent 6bfca146d6
commit fe5bd3f53d
3 changed files with 2121 additions and 0 deletions


@@ -0,0 +1,48 @@
# A-build: hybrid per-head f32/bf16 SSM state - BUILD PROGRESS
Label: A-build (GPU agent). Base: DGX `~/llama-paged-dev` branch `paged` HEAD 2f4f5ab (patch 0025),
plus `BF16_SSM_STATE.diff` applied as the bf16 plumbing base. Goal: per-head mixed-dtype SSM state
(f32 long-memory heads, bf16 fast heads); default `ssm_hybrid_tau_thresh=0.0` (all-f32, bit-exact).
## Design recap (from SPEEDUP_HUNT.md A-hybrid-design)
- Classifier (host, model-load): tau_h = 1/(|ssm_a[il][h]| * softplus(ssm_dt[il][h])); f32 if tau_h>T.
ssm_a = SSM_A_NOSCAN = -exp(A_log) (verified qwen35.cpp:376). ssm_dt = SSM_DT bias.
- Split cache: per GDN layer, s_l (f32, n_f32 heads) + s_l_bf16 (bf16, n_bf16 heads). head_slot map.
- Kernel: ONE kernel templated +HYBRID; per-block (h_idx) branch on head_slot (uniform, no divergence).
Recurrence math byte-for-byte f32-register, untouched. Homogeneous (HYBRID=false) path bit-exact.
- Op: extra src[8]=state_bf16, src[9]=head_slot; backend detects hybrid = (src[9]!=null).
- CPU mirror: per-head partition read.
- test-backend-ops: MIXED case (some heads f32, some bf16) output-append, decode+prefill+keep_rs_t.
## DE-RISK GATE (must pass before sweep)
1. test-backend-ops GATED_DELTA_NET mixed PASS (CUDA mixed vs CPU mixed).
2. T_thresh=0.0 (default, all-f32) greedy md5 == 0023 baseline: dense 5951a5b4d624ce891e22ab5fca9bc439,
MoE 07db32c2bcb78d17a43ed18bc22705cd.
## KNOB SEMANTICS (IMPORTANT - brief endpoint wording corrected)
Rule (brief verbatim + physics + "start 32-64" guidance all agree): a head is kept f32 iff
tau_h > T_thresh, else bf16. tau_h = 1/(|ssm_a|*softplus(ssm_dt)) in tokens. Long-memory (large tau)
heads stay f32 (bf16 rounding does not contract there -> KL); fast (small tau) heads -> bf16.
- ssm_hybrid_tau_thresh DEFAULT = 0.0 => every tau>0 -> ALL F32 (bit-exact opt-out; the gate runs here).
- ssm_hybrid_tau_thresh -> +inf => ALL BF16 (shelved mode).
- sweep: raise T (16/32/64/128 tokens) to store progressively more (longer-memory) heads as bf16 = more speed.
NOTE: the brief's "inf=>all-f32, 0=>all-bf16" sentence is INVERTED vs the operative rule it states
("keep f32 if tau>T") and vs "start 32-64" + the physics. Correct endpoints: 0=all-f32, inf=all-bf16.
Implemented the physically-correct rule; default 0.0 = bit-exact all-f32.
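Why tau_h reads as a memory length in tokens can be sketched as below. This is ASSUMING the per-token
state decay of head h is `exp(ssm_a * softplus(ssm_dt))` with any input-dependent gate term ignored
(that assumption is not stated in the brief); `a_h` and `dt_h` are shorthand for `ssm_a[il][h]` and
`ssm_dt[il][h]`.

```latex
\alpha_h = \exp\!\big(-|a_h|\,\operatorname{softplus}(dt_h)\big)
\quad\Longrightarrow\quad
\alpha_h^{\,n} = \exp\!\big(-n/\tau_h\big),
\qquad
\tau_h = \frac{1}{|a_h|\,\operatorname{softplus}(dt_h)}
```

Under that assumption a state contribution shrinks by 1/e every tau_h tokens. Since tau_h > 0 for every
head, T = 0 satisfies "tau_h > T" everywhere (all f32), and no finite tau_h exceeds T = +inf (all bf16).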
## STATUS
- [x] ggml.h/ggml.c hybrid op builders
- [x] gated_delta_net.cu hybrid kernel + dispatch (one kernel, +HYBRID template, uniform per-block branch)
- [x] ops.cpp CPU hybrid read mirror (output-append; ids in-place is GPU-only, asserted)
- [x] test-backend-ops mixed case (32 cases: hc 4/8 x hs 64/128 x decode/prefill/keep_rs_t x kda)
- [x] de-risk gate 1: test-backend-ops GATED_DELTA_NET = 84/84 PASS (incl 32 hybrid mixed CUDA-vs-CPU)
- [x] cparam/CLI ssm_hybrid_tau_thresh plumbing (default 0.0; threaded context->cparams->memory->ctors)
- [x] memory-recurrent split cache + classifier (validated: real tau split, correct 2-partition layout)
- [x] delta-net-base hybrid op build (fused ids decode + bf16 rs_zero/extra mirror)
- [x] full build clean (sm_121; llama-completion/batched-bench/perplexity/test-backend-ops)
- [x] de-risk gate 2 (default/all-f32 md5 == 0023 both models, re-verified post-build)
- [~] hybrid-ON smoke: RUNS (no crash) + classifier/cache/kernel-params verified, but OUTPUT INCOHERENT
=> OPEN BUG in the ids in-place cross-step state path (opt-in only; default unaffected). See
A_HYBRID_SSM_RESULTS.md. NOT ready for the sweep until fixed.
Committed: DGX paged 657e008; worktree patch 0026 + A_HYBRID_SSM_RESULTS.md.


@@ -0,0 +1,90 @@
# A - HYBRID PER-HEAD f32/bf16 SSM STATE - BUILD + DE-RISK RESULTS
Label: A-build (the GPU build agent). Lands as patch 0026 on top of 0025 (DGX HEAD 2f4f5ab),
incorporating the bf16-SSM-state plumbing (`BF16_SSM_STATE.diff`) as the base. Built into
`~/llama-paged-dev/build-cuda` (sm_121); committed on the DGX `paged` branch (657e008) and as
`patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch` + this doc in the worktree.
## DE-RISK GATE - both required gates PASS
### Gate 1: test-backend-ops MIXED GATED_DELTA_NET (CUDA mixed vs CPU mixed)
`./bin/test-backend-ops -o GATED_DELTA_NET -b CUDA0` = **84/84 PASS, CUDA0 OK**. This includes the
**32 new hybrid mixed-dtype cases** (`test_gated_delta_net_hybrid`): head_count {4,8} x head_size
{64,128} x {single-token decode, multi-token prefill 33/64/100, keep_rs_t K=4} x kda {0,1}, with an
interleaved head_slot map (even heads f32, odd heads bf16) so both partition branches are exercised
across blocks. CUDA mixed vs CPU mixed agree. (Plus the pre-existing 52 f32 + bf16 cases still pass.)
### Gate 2: T_thresh=0.0 (default, all-f32) greedy md5 == 0023 baseline - BOTH MODELS
`llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`, NO
`--ssm-bf16-tau` flag (default 0.0 => every head f32 => no split => the existing single-cache path):
- dense q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == 0023 baseline.
- MoE q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == 0023 baseline.
Re-verified byte-identical AFTER the full build with every plumbing edit in place. **The bit-exact
opt-out is preserved.**
## KNOB SEMANTICS (brief endpoint wording corrected)
`ssm_hybrid_tau_thresh` / `--ssm-bf16-tau` T: a gated-DeltaNet head is kept **f32 iff tau_h > T**,
else bf16. `tau_h = 1/(|ssm_a[il][h]| * softplus(ssm_dt[il][h]))` tokens (ssm_a = SSM_A_NOSCAN =
-exp(A_log), verified qwen35.cpp:376; ssm_dt = SSM_DT bias). This is the brief's operative rule + the
"start 32-64" guidance + the physics (long-memory/large-tau heads stay f32; fast/small-tau heads ->
bf16). Endpoints:
- **T = 0.0 (DEFAULT) => every tau>0 -> ALL F32 (bit-exact opt-out; the gate runs here).**
- **T -> +inf => ALL BF16 (shelved mode).**
- sweep T in {16,32,64,128}: each step stores progressively more (longer-memory) heads as bf16 = more speed.
NOTE: the brief's "inf=>all-f32, 0=>all-bf16" sentence is INVERTED relative to the rule it states
("keep f32 if tau>T") and to "start 32-64" + the physics. The physically-correct rule is implemented;
the bit-exact all-f32 mode is the DEFAULT (T=0), which is exactly what the de-risk gate exercises.
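A minimal host-side sketch of the rule above, not the code in the patch. `softplus`, `head_tau` and
`classify_heads` are hypothetical names, and the inputs are assumed to be one layer's per-head `ssm_a`
and `ssm_dt` values already copied to host floats.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Numerically stable softplus: log(1 + exp(x)).
static float softplus(float x) {
    return x > 20.0f ? x : std::log1p(std::exp(x));
}

// tau_h in tokens. ssm_a is negative (-exp(A_log)), so use its magnitude.
static float head_tau(float ssm_a, float ssm_dt) {
    return 1.0f / (std::fabs(ssm_a) * softplus(ssm_dt));
}

// Returns true for heads stored as bf16, false for heads kept f32.
// Rule from the doc: a head is f32 iff tau_h > T.
static std::vector<bool> classify_heads(const std::vector<float> & ssm_a,
                                        const std::vector<float> & ssm_dt,
                                        float T) {
    assert(ssm_a.size() == ssm_dt.size());
    std::vector<bool> is_bf16(ssm_a.size());
    for (size_t h = 0; h < ssm_a.size(); ++h) {
        is_bf16[h] = !(head_tau(ssm_a[h], ssm_dt[h]) > T);
    }
    return is_bf16;
}
```

With T = 0.0 every head has tau_h > T and stays f32; with T = +inf the comparison is false for every
finite tau_h, so every head goes to bf16.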
## What was built (all components, validated correct)
1. **Classifier** (llama-memory-recurrent ctor, host, model-load): reads ssm_a/ssm_dt per GDN layer,
computes tau_h, sets head_is_bf16. VALIDATED on dense q27 (H_v=48, S_v=128): real per-head tau
spread min~0.2-0.5 / max~800-26000 tokens; at T=32 the split is ~13-31 f32 / 17-35 bf16 per layer.
Guarded against the device-memory-fitting pre-pass (weights not yet allocated => data==NULL =>
fall back to single f32 cache, a conservative/larger memory estimate; real load classifies).
2. **Split cache** (llama-memory-recurrent): per split GDN layer, s_l[il] holds the f32 partition
[S_v*S_v*n_f32, n_rows] and s_l_bf16[il] the bf16 partition [S_v*S_v*n_bf16, n_rows] + an I32[H]
head_slot map (local_idx>=0 f32, -(local_idx+1)<0 bf16), uploaded after buffer alloc. ctx metadata
budget bumped 2->4 tensors/layer (r, s_f32, s_bf16, head_slot). VALIDATED: cache layout correct
(f32/bf16 partitions 2MB apart, non-overlapping; sizes match counts).
3. **Kernel** (gated_delta_net.cu): ONE kernel templated +HYBRID; per-block (h_idx) branch on
head_slot picks the partition + local index (uniform within a block => no warp divergence). The
homogeneous (HYBRID=false) instantiations are byte-identical to before (if constexpr elides the
hybrid blocks). Two builders: ggml_gated_delta_net_hybrid (output-append, for the test) and
ggml_gated_delta_net_inplace_ids_hybrid (decode). Backend detects hybrid = src[9]!=null; gathers
both partitions for non-identity seqs; derives the bf16 in-place dst from src[8]+rs_head.
4. **CPU mirror** (ops.cpp): per-head partition read for the output-append form (the test path).
5. **Plumbing**: cparam ssm_hybrid_tau_thresh threaded llama_context_params -> cparams ->
llama_memory_params -> recurrent/hybrid/iswa ctors; common_params + CLI --ssm-bf16-tau (default 0).
6. **test-backend-ops**: the 32 mixed cases above.
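The head_slot encoding from item 2 (local_idx>=0 f32, -(local_idx+1)<0 bf16) can be sketched as
follows. This is an illustration, not the patch code: `HeadRef`, `build_head_slot`, `decode_head_slot`
and `head_offset` are hypothetical names, and the head-major offset within a cache row is an
assumption read off the [S_v*S_v*n, n_rows] partition shape.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct HeadRef {
    bool    is_bf16;    // which partition the head lives in
    int32_t local_idx;  // index of the head inside that partition
};

// Assign each head a dense local index inside its own partition.
// f32 heads get local_idx (>= 0); bf16 heads get -(local_idx + 1) (< 0).
static std::vector<int32_t> build_head_slot(const std::vector<bool> & is_bf16) {
    std::vector<int32_t> slot(is_bf16.size());
    int32_t n_f32 = 0, n_bf16 = 0;
    for (size_t h = 0; h < is_bf16.size(); ++h) {
        slot[h] = is_bf16[h] ? -(n_bf16++ + 1) : n_f32++;
    }
    assert(size_t(n_f32 + n_bf16) == is_bf16.size());
    return slot;
}

// Inverse of the encoding: the sign picks the partition, the magnitude the local index.
static HeadRef decode_head_slot(int32_t s) {
    return s >= 0 ? HeadRef{false, s} : HeadRef{true, -s - 1};
}

// Element offset of a head's S_v x S_v state inside one row of its partition
// (ASSUMES heads are laid out contiguously, head-major, within a row).
static size_t head_offset(int32_t local_idx, size_t S_v) {
    return size_t(local_idx) * S_v * S_v;
}
```

For the interleaved test map (even heads f32, odd heads bf16) with 4 heads, this yields
head_slot = {0, -1, 1, -2}. Because the slot depends only on the head index, every thread in a
block that handles one head decodes the same partition.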
## KNOWN OPEN ISSUE - hybrid-ON decode is incoherent (opt-in only; does NOT affect the default)
With `--ssm-bf16-tau` > 0 (any split, even T=1, which moves only a handful of heads to bf16), the model generates
incoherent text ("<think> the the the > EOF"). The bit-exact all-f32 default is UNAFFECTED (gate 2).
Diagnosis (everything reachable by inspection was verified correct):
- The op-level MIXED test PASSES, but it only covers the **output-append** form (state read from the
s0 input partitions, write to the f32 op output). The model decode uses the **ids in-place** form:
read from the in-place cache partition (identity), write the new state in place per partition. That
cross-step state path is NOT exercised by a single-op test (the in-place state write is a side
effect, not the compared op output), so it is the only surface with no test coverage - and that is
where the bug lives.
- Confirmed correct at runtime: the classifier (real tau split), the split cache layout (partitions
2MB apart, sizes match), and the exact kernel parameters (H=48, S_v=128, n_f32+n_bf16=H, head_slot
values, ids/state_dst/state_bf16 pointers all sane). The hybrid op IS built and dispatched (not a
homogeneous fallback). Garbage persists with CUDA graphs disabled, so it is not a graph-capture
issue. The recurrence math is shared with the (passing) output-append path.
- The bug is therefore in the ids in-place cross-step state handling (identity-d read and/or in-place
partition store, and/or the bf16 partition rs_zero/extra-states mirroring in delta-net-base) - a
state-corruption that cascades. It needs a multi-step reproduction (the single-op harness cannot
catch a cross-step in-place bug; the homogeneous in-place ids op itself has no op test - it was only
ever validated by model md5).
## NOT ready for the GateSweep yet
The de-risk gates (mixed op test + bit-exact default) BOTH PASS, but the hybrid-ON path must be made
coherent before the T_thresh KL/throughput sweep can produce meaningful numbers. Recommended next
step: build a minimal 2-step in-place reproduction (CPU ids-in-place hybrid mirror + a decode-loop
harness, or a kernel-side state dump comparing hybrid vs homogeneous for an all-f32-disguised split)
to localize identity-d-read vs in-place-store vs the bf16 clear/extra mirror.
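A toy CPU sketch of the "all-f32-disguised split" comparison, only to show the harness shape. It is
NOT the patch code and NOT the gated-DeltaNet recurrence: `toy_update`, `split_ptr` and
`first_divergent_step` are hypothetical names, and the update rule is a placeholder. Both partitions
are f32, so the split layout must stay bit-identical to the contiguous one at every step; the first
step that differs points at the addressing or in-place store, not at bf16 rounding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Placeholder per-head update: deterministic and dependent on the persisted state.
static void toy_update(float * s, size_t n, float decay, float x) {
    for (size_t i = 0; i < n; ++i) {
        s[i] = decay * s[i] + x;
    }
}

// Resolve a head_slot value to its state pointer in the split layout.
static float * split_ptr(std::vector<float> & part_a, std::vector<float> & part_b,
                         int32_t slot, size_t n) {
    std::vector<float> & part = slot >= 0 ? part_a : part_b;
    const size_t local = slot >= 0 ? size_t(slot) : size_t(-slot - 1);
    assert((local + 1) * n <= part.size());
    return part.data() + local * n;
}

// Run n_steps in-place steps on both layouts. Returns the first step whose
// persisted state differs, or -1 if the two layouts stay bit-identical.
static int first_divergent_step(const std::vector<int32_t> & head_slot, size_t n, int n_steps) {
    const size_t H = head_slot.size();
    size_t n_a = 0, n_b = 0;
    for (int32_t s : head_slot) { if (s >= 0) ++n_a; else ++n_b; }
    std::vector<float> homo(H * n, 0.0f), part_a(n_a * n, 0.0f), part_b(n_b * n, 0.0f);
    for (int t = 0; t < n_steps; ++t) {
        for (size_t h = 0; h < H; ++h) {
            const float x = float(t + 1) + float(h);
            toy_update(homo.data() + h * n, n, 0.5f, x);
            toy_update(split_ptr(part_a, part_b, head_slot[h], n), n, 0.5f, x);
        }
        for (size_t h = 0; h < H; ++h) {
            const float * p = split_ptr(part_a, part_b, head_slot[h], n);
            for (size_t i = 0; i < n; ++i) {
                if (p[i] != homo[h * n + i]) return t;
            }
        }
    }
    return -1;
}
```

In a real harness the placeholder update would be replaced by the hybrid in-place ids op on one side
and the homogeneous in-place ids op on the other, with the persisted state dumped after each step.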
Assisted-by: Claude:opus-4.8 [Claude Code]