mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 09:57:14 -04:00
feat(paged): qwen35 hybrid per-head f32/bf16 SSM state (patch 0026)
Lever A patch + build/de-risk results. Splits the persisted gated-DeltaNet recurrent state per head: f32 on long-memory heads (where bf16 rounding does not contract and the KL error concentrates), bf16 on fast-decaying heads, classified at model load by tau_h = 1/(|ssm_a|*softplus(ssm_dt)). Default ssm_hybrid_tau_thresh = 0.0 keeps every head f32 (bit-exact opt-out).

De-risk gates BOTH PASS: test-backend-ops GATED_DELTA_NET CUDA0 OK (incl 32 hybrid mixed CUDA-vs-CPU cases); default all-f32 greedy md5 == 0023 baseline both models (dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd).

Known open issue (opt-in hybrid only; default unaffected): hybrid-ON model decode (ids in-place path) is incoherent; classifier/cache/kernel-params verified correct, bug isolated to the ids in-place cross-step state path. See A_HYBRID_SSM_RESULTS.md. Not ready for the GateSweep until fixed.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
48
backend/cpp/llama-cpp/patches/paged/A_HYBRID_PROGRESS.md
Normal file
@@ -0,0 +1,48 @@
# A-build: hybrid per-head f32/bf16 SSM state - BUILD PROGRESS

Label: A-build (GPU agent). Base: DGX `~/llama-paged-dev` branch `paged` HEAD 2f4f5ab (patch 0025),
plus `BF16_SSM_STATE.diff` applied as the bf16 plumbing base. Goal: per-head mixed-dtype SSM state
(f32 long-memory heads, bf16 fast heads); default `ssm_hybrid_tau_thresh=0.0` (all-f32, bit-exact).

## Design recap (from SPEEDUP_HUNT.md A-hybrid-design)
- Classifier (host, model-load): tau_h = 1/(|ssm_a[il][h]| * softplus(ssm_dt[il][h])); f32 if tau_h>T.
  ssm_a = SSM_A_NOSCAN = -exp(A_log) (verified qwen35.cpp:376). ssm_dt = SSM_DT bias.
- Split cache: per GDN layer, s_l (f32, n_f32 heads) + s_l_bf16 (bf16, n_bf16 heads). head_slot map.
- Kernel: ONE kernel templated +HYBRID; per-block (h_idx) branch on head_slot (uniform, no divergence).
  Recurrence math byte-for-byte f32-register, untouched. Homogeneous (HYBRID=false) path bit-exact.
- Op: extra src[8]=state_bf16, src[9]=head_slot; backend detects hybrid = (src[9]!=null).
- CPU mirror: per-head partition read.
- test-backend-ops: MIXED case (some heads f32, some bf16) output-append, decode+prefill+keep_rs_t.

## DE-RISK GATE (must pass before sweep)
1. test-backend-ops GATED_DELTA_NET mixed PASS (CUDA mixed vs CPU mixed).
2. T_thresh=0.0 (default, all-f32) greedy md5 == 0023 baseline: dense 5951a5b4d624ce891e22ab5fca9bc439,
   MoE 07db32c2bcb78d17a43ed18bc22705cd.

## KNOB SEMANTICS (IMPORTANT - brief endpoint wording corrected)
Rule (brief verbatim + physics + "start 32-64" guidance all agree): a head is kept f32 iff
tau_h > T_thresh, else bf16. tau_h = 1/(|ssm_a|*softplus(ssm_dt)) in tokens. Long-memory (large tau)
heads stay f32 (bf16 rounding does not contract there -> KL); fast (small tau) heads -> bf16.
- ssm_hybrid_tau_thresh DEFAULT = 0.0 => every tau>0 -> ALL F32 (bit-exact opt-out; the gate runs here).
- ssm_hybrid_tau_thresh -> +inf => ALL BF16 (shelved mode).
- sweep: raise T (16/32/64/128 tokens) to move progressively more (longer-memory) heads to bf16 = more speed.
NOTE: the brief's "inf=>all-f32, 0=>all-bf16" sentence is INVERTED vs the operative rule it states
("keep f32 if tau>T") and vs "start 32-64" + the physics. Correct endpoints: 0=all-f32, inf=all-bf16.
Implemented the physically-correct rule; default 0.0 = bit-exact all-f32.
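
The rule above can be sketched on the host as follows. This is a minimal illustration, assuming the per-head `ssm_a` and `ssm_dt` values are already available as plain doubles; `softplus`, `head_tau` and `classify_heads` are hypothetical names chosen for this sketch and are not taken from the patch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// softplus(x) = log(1 + exp(x)), written so it stays finite for large |x|.
static double softplus(double x) {
    return x > 0.0 ? x + std::log1p(std::exp(-x)) : std::log1p(std::exp(x));
}

// Memory horizon of one head in tokens: tau_h = 1 / (|ssm_a| * softplus(ssm_dt)).
static double head_tau(double ssm_a, double ssm_dt) {
    return 1.0 / (std::fabs(ssm_a) * softplus(ssm_dt));
}

// Returns true for heads stored as bf16 and false for heads kept f32.
// Rule from the text: a head stays f32 iff tau_h > t_thresh.
// (Hypothetical helper: the real classifier reads model tensors per GDN layer.)
static std::vector<bool> classify_heads(const std::vector<double> & ssm_a,
                                        const std::vector<double> & ssm_dt,
                                        double t_thresh) {
    assert(ssm_a.size() == ssm_dt.size());
    std::vector<bool> head_is_bf16(ssm_a.size());
    for (std::size_t h = 0; h < ssm_a.size(); ++h) {
        head_is_bf16[h] = !(head_tau(ssm_a[h], ssm_dt[h]) > t_thresh);
    }
    return head_is_bf16;
}
```

With `t_thresh = 0.0` every head with a positive tau stays f32, and with `t_thresh = +inf` the comparison is never true, so every head becomes bf16. That matches the corrected endpoints stated above.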

## STATUS
- [x] ggml.h/ggml.c hybrid op builders
- [x] gated_delta_net.cu hybrid kernel + dispatch (one kernel, +HYBRID template, uniform per-block branch)
- [x] ops.cpp CPU hybrid read mirror (output-append; ids in-place is GPU-only, asserted)
- [x] test-backend-ops mixed case (32 cases: hc 4/8 x hs 64/128 x decode/prefill/keep_rs_t x kda)
- [x] de-risk gate 1: test-backend-ops GATED_DELTA_NET = 84/84 PASS (incl 32 hybrid mixed CUDA-vs-CPU)
- [x] cparam/CLI ssm_hybrid_tau_thresh plumbing (default 0.0; threaded context->cparams->memory->ctors)
- [x] memory-recurrent split cache + classifier (validated: real tau split, correct 2-partition layout)
- [x] delta-net-base hybrid op build (fused ids decode + bf16 rs_zero/extra mirror)
- [x] full build clean (sm_121; llama-completion/batched-bench/perplexity/test-backend-ops)
- [x] de-risk gate 2 (default/all-f32 md5 == 0023 both models, re-verified post-build)
- [~] hybrid-ON smoke: RUNS (no crash) + classifier/cache/kernel-params verified, but OUTPUT INCOHERENT
  => OPEN BUG in the ids in-place cross-step state path (opt-in only; default unaffected). See
  A_HYBRID_SSM_RESULTS.md. NOT ready for the sweep until fixed.

Committed: DGX paged 657e008; worktree patch 0026 + A_HYBRID_SSM_RESULTS.md.
90
backend/cpp/llama-cpp/patches/paged/A_HYBRID_SSM_RESULTS.md
Normal file
@@ -0,0 +1,90 @@
# A - HYBRID PER-HEAD f32/bf16 SSM STATE - BUILD + DE-RISK RESULTS

Label: A-build (the GPU build agent). Lands as patch 0026 on top of 0025 (DGX HEAD 2f4f5ab),
incorporating the bf16-SSM-state plumbing (`BF16_SSM_STATE.diff`) as the base. Built into
`~/llama-paged-dev/build-cuda` (sm_121); committed on the DGX `paged` branch (657e008) and as
`patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch` + this doc in the worktree.

## DE-RISK GATE - both required gates PASS

### Gate 1: test-backend-ops MIXED GATED_DELTA_NET (CUDA mixed vs CPU mixed)
`./bin/test-backend-ops -o GATED_DELTA_NET -b CUDA0` = **84/84 PASS, CUDA0 OK**. This includes the
**32 new hybrid mixed-dtype cases** (`test_gated_delta_net_hybrid`): head_count {4,8} x head_size
{64,128} x {single-token decode, multi-token prefill 33/64/100, keep_rs_t K=4} x kda {0,1}, with an
interleaved head_slot map (even heads f32, odd heads bf16) so both partition branches are exercised
across blocks. CUDA mixed vs CPU mixed agree. (Plus the pre-existing 52 f32 + bf16 cases still pass.)

### Gate 2: T_thresh=0.0 (default, all-f32) greedy md5 == 0023 baseline - BOTH MODELS
`llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`, NO
`--ssm-bf16-tau` flag (default 0.0 => every head f32 => no split => the existing single-cache path):
- dense q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == 0023 baseline.
- MoE q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == 0023 baseline.
Re-verified byte-identical AFTER the full build with every plumbing edit in place. **The bit-exact
opt-out is preserved.**

## KNOB SEMANTICS (brief endpoint wording corrected)
`ssm_hybrid_tau_thresh` / `--ssm-bf16-tau` T: a gated-DeltaNet head is kept **f32 iff tau_h > T**,
else bf16. `tau_h = 1/(|ssm_a[il][h]| * softplus(ssm_dt[il][h]))` tokens (ssm_a = SSM_A_NOSCAN =
-exp(A_log), verified qwen35.cpp:376; ssm_dt = SSM_DT bias). This is the brief's operative rule + the
"start 32-64" guidance + the physics (long-memory/large-tau heads stay f32; fast/small-tau heads ->
bf16). Endpoints:
- **T = 0.0 (DEFAULT) => every tau>0 -> ALL F32 (bit-exact opt-out; the gate runs here).**
- **T -> +inf => ALL BF16 (shelved mode).**
- sweep T in {16,32,64,128}: each step moves progressively more (longer-memory) heads to bf16 = more speed.

NOTE: the brief's "inf=>all-f32, 0=>all-bf16" sentence is INVERTED relative to the rule it states
("keep f32 if tau>T") and to "start 32-64" + the physics. The physically-correct rule is implemented;
the bit-exact all-f32 mode is the DEFAULT (T=0), which is exactly what the de-risk gate exercises.
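
To illustrate why large-tau heads are the ones kept f32, the sketch below compares a bf16-stored state against an f32-stored one on a toy scalar recurrence `s <- a*s + 1` with `a = exp(-1/tau)`. This is an assumption-laden stand-in and NOT the gated-DeltaNet update; `round_to_bf16` and `bf16_state_rel_err` are hypothetical helpers written for this illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Round an f32 value to the nearest bf16 (ties to even) and widen it back to f32.
// Valid for finite inputs; NaN/inf handling is omitted in this sketch.
static float round_to_bf16(float v) {
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    bits += 0x7FFFu + ((bits >> 16) & 1u); // add half an ulp, biased towards even
    bits &= 0xFFFF0000u;                   // keep the top 16 bits (the bf16 payload)
    std::memcpy(&v, &bits, sizeof(v));
    return v;
}

// Relative error of a bf16-stored state vs an f32-stored state after n_steps of the
// toy recurrence s <- a*s + 1, a = exp(-1/tau). The arithmetic is f32 in both cases;
// only the value persisted between steps differs.
static float bf16_state_rel_err(float tau, int n_steps) {
    assert(tau > 0.0f && n_steps > 0);
    const float a = std::exp(-1.0f / tau);
    float s_f32  = 0.0f;
    float s_bf16 = 0.0f;
    for (int t = 0; t < n_steps; ++t) {
        s_f32  = a * s_f32 + 1.0f;
        s_bf16 = round_to_bf16(a * s_bf16 + 1.0f);
    }
    return std::fabs(s_bf16 - s_f32) / std::fabs(s_f32);
}
```

In this toy, a long-memory head (for example `tau = 1000`) ends up with a much larger relative error than a fast head (`tau = 2`): once the per-step increment falls below half a bf16 ulp the stored state stops moving, while a fast head forgets each rounding error within a few tokens. The real kernel may behave differently in detail; the sketch only shows the direction of the effect.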

## What was built (all components, validated correct)
1. **Classifier** (llama-memory-recurrent ctor, host, model-load): reads ssm_a/ssm_dt per GDN layer,
   computes tau_h, sets head_is_bf16. VALIDATED on dense q27 (H_v=48, S_v=128): real per-head tau
   spread min~0.2-0.5 / max~800-26000 tokens; at T=32 the split is ~13-31 f32 / 17-35 bf16 per layer.
   Guarded against the device-memory-fitting pre-pass (weights not yet allocated => data==NULL =>
   fall back to single f32 cache, a conservative/larger memory estimate; real load classifies).
2. **Split cache** (llama-memory-recurrent): per split GDN layer, s_l[il] holds the f32 partition
   [S_v*S_v*n_f32, n_rows] and s_l_bf16[il] the bf16 partition [S_v*S_v*n_bf16, n_rows] + an I32[H]
   head_slot map (local_idx>=0 f32, -(local_idx+1)<0 bf16), uploaded after buffer alloc. ctx metadata
   budget bumped 2->4 tensors/layer (r, s_f32, s_bf16, head_slot). VALIDATED: cache layout correct
   (f32/bf16 partitions 2MB apart, non-overlapping; sizes match counts).
3. **Kernel** (gated_delta_net.cu): ONE kernel templated +HYBRID; per-block (h_idx) branch on
   head_slot picks the partition + local index (uniform within a block => no warp divergence). The
   homogeneous (HYBRID=false) instantiations are byte-identical to before (if constexpr elides the
   hybrid blocks). Two builders: ggml_gated_delta_net_hybrid (output-append, for the test) and
   ggml_gated_delta_net_inplace_ids_hybrid (decode). Backend detects hybrid = src[9]!=null; gathers
   both partitions for non-identity seqs; derives the bf16 in-place dst from src[8]+rs_head.
4. **CPU mirror** (ops.cpp): per-head partition read for the output-append form (the test path).
5. **Plumbing**: cparam ssm_hybrid_tau_thresh threaded llama_context_params -> cparams ->
   llama_memory_params -> recurrent/hybrid/iswa ctors; common_params + CLI --ssm-bf16-tau (default 0).
6. **test-backend-ops**: the 32 mixed cases above.
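
The head_slot encoding described in item 2 (`local_idx >= 0` for f32, `-(local_idx+1) < 0` for bf16) can be sketched as below. This is a host-side illustration under the stated encoding only; `head_split`, `build_head_slot`, `slot_ref` and `decode_slot` are hypothetical names, not identifiers from the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Result of splitting H heads into an f32 and a bf16 partition.
struct head_split {
    std::vector<int32_t> head_slot; // one entry per head, encoded as described above
    int32_t n_f32  = 0;
    int32_t n_bf16 = 0;
};

// Assigns each head the next free local index inside its partition.
static head_split build_head_slot(const std::vector<bool> & head_is_bf16) {
    head_split out;
    out.head_slot.reserve(head_is_bf16.size());
    for (bool is_bf16 : head_is_bf16) {
        if (is_bf16) {
            out.head_slot.push_back(-(out.n_bf16 + 1)); // bf16 local 0, 1, ... -> -1, -2, ...
            ++out.n_bf16;
        } else {
            out.head_slot.push_back(out.n_f32);         // f32 local 0, 1, ... -> 0, 1, ...
            ++out.n_f32;
        }
    }
    assert(out.n_f32 + out.n_bf16 == (int32_t) head_is_bf16.size()); // n_f32 + n_bf16 = H
    return out;
}

// Which partition a head lives in, and its index inside that partition.
struct slot_ref {
    bool    is_bf16;
    int32_t local_idx;
};

// Inverse of the encoding: what a kernel block would do once per head.
static slot_ref decode_slot(int32_t slot) {
    return slot >= 0 ? slot_ref{false, slot} : slot_ref{true, -slot - 1};
}
```

For the interleaved test map (even heads f32, odd heads bf16) with four heads, this yields `head_slot = {0, -1, 1, -2}`. Because the decode depends only on the head index, every thread in a block that handles one head takes the same branch, which is the no-divergence property noted in item 3.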

## KNOWN OPEN ISSUE - hybrid-ON decode is incoherent (opt-in only; does NOT affect the default)
With `--ssm-bf16-tau` > 0 (any split, even T=1 = a handful of bf16 heads), the model generates
incoherent text ("<think> the the the > EOF"). The bit-exact all-f32 default is UNAFFECTED (gate 2).

Diagnosis (everything reachable by inspection was verified correct):
- The op-level MIXED test PASSES, but it only covers the **output-append** form (state read from the
  s0 input partitions, write to the f32 op output). The model decode uses the **ids in-place** form:
  read from the in-place cache partition (identity), write the new state in place per partition. That
  cross-step state path is NOT exercised by a single-op test (the in-place state write is a side
  effect, not the compared op output), so it is the only un-netted surface - and that is where the bug
  lives.
- Confirmed correct at runtime: the classifier (real tau split), the split cache layout (partitions
  2MB apart, sizes match), and the exact kernel parameters (H=48, S_v=128, n_f32+n_bf16=H, head_slot
  values, ids/state_dst/state_bf16 pointers all sane). The hybrid op IS built and dispatched (not a
  homogeneous fallback). Garbage persists with CUDA graphs disabled, so it is not a graph-capture
  issue. The recurrence math is shared with the (passing) output-append path.
- The bug is therefore in the ids in-place cross-step state handling (identity-d read and/or in-place
  partition store, and/or the bf16 partition rs_zero/extra-states mirroring in delta-net-base) - a
  state corruption that cascades. It needs a multi-step reproduction (the single-op harness cannot
  catch a cross-step in-place bug; the homogeneous in-place ids op itself has no op test - it was only
  ever validated by model md5).

## NOT ready for the GateSweep yet
The de-risk gates (mixed op test + bit-exact default) BOTH PASS, but the hybrid-ON path must be made
coherent before the T_thresh KL/throughput sweep can produce meaningful numbers. Recommended next
step: build a minimal 2-step in-place reproduction (CPU ids-in-place hybrid mirror + a decode-loop
harness, or a kernel-side state dump comparing hybrid vs homogeneous for an all-f32-disguised split)
to localize identity-d-read vs in-place-store vs the bf16 clear/extra mirror.
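
One possible shape for such a decode-loop harness is sketched below on the CPU. Everything here is an assumption made for illustration: the per-head update is a toy decaying recurrence rather than the gated-DeltaNet math, the split cache holds f32 only (an all-f32-disguised split with reversed slot order), and `inject_store_bug` is a deliberately planted fault, not a claim about where the real bug is.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy stand-in for one decode step on one head: s <- decay * s + x, in place.
static void step_head(float * s, int n, float decay, float x) {
    for (int i = 0; i < n; ++i) {
        s[i] = decay * s[i] + x;
    }
}

// Runs n_steps in-place steps on a homogeneous cache and on a split cache whose heads
// are all f32 but stored in reversed slot order. Returns the first step (0-based) at
// which the two caches differ bitwise, or -1 if they never do.
// inject_store_bug makes the split path ignore head_slot when storing.
static int first_divergent_step(int n_head, int n, int n_steps, bool inject_store_bug) {
    assert(n_head > 0 && n > 0);
    std::vector<float>   ref((std::size_t) n_head * n, 0.0f);   // head h at offset h*n
    std::vector<float>   split((std::size_t) n_head * n, 0.0f); // f32 partition of the split
    std::vector<int32_t> head_slot(n_head);
    for (int h = 0; h < n_head; ++h) {
        head_slot[h] = n_head - 1 - h;
    }
    for (int t = 0; t < n_steps; ++t) {
        const float x = 1.0f + (float) t;
        for (int h = 0; h < n_head; ++h) {
            const float decay = std::ldexp(1.0f, -(h + 1)); // 1/2, 1/4, ...: exact in f32
            const int   slot  = inject_store_bug ? h : head_slot[h];
            step_head(ref.data()   + (std::size_t) h    * n, n, decay, x);
            step_head(split.data() + (std::size_t) slot * n, n, decay, x);
        }
        for (int h = 0; h < n_head; ++h) {
            const float * a = ref.data()   + (std::size_t) h            * n;
            const float * b = split.data() + (std::size_t) head_slot[h] * n;
            if (std::memcmp(a, b, n * sizeof(float)) != 0) {
                return t;
            }
        }
    }
    return -1;
}
```

With four heads, the correct path never diverges, while the planted store fault is invisible after step 0 (every head starts from zero state and receives the same input) and first shows at step 1. That is the property a single-op test cannot check and a multi-step harness can.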

Assisted-by: Claude:opus-4.8 [Claude Code]