Conclude the MoE-parity hunt. The two remaining sub-levers in the
20.3-vs-13.8 ms projection bucket are both bit-changing or at the BW floor:
- convert-glue (3.24 ms/step, measured: 1.73 input f32->bf16 + 1.52 output
bf16->f32): NOT bit-exact eliminable. ggml-cuda.cu:1663-1690 rounds the f32
GEMM accumulator to bf16 (CUDA_R_16BF dst) then widens to f32; that
bf16-rounded value is load-bearing for the shipped md5. Removing the
round-trip (f32-direct output, bf16 residual stream, or NVFP4 weights) all
rebaseline md5. A precision boundary, like lever 4.
- bf16 projection GEMM (17.27 ms/step): BW-bound at the LPDDR5x floor
(~4.7 GB/step at 273 GB/s; M=128 -> 128 FLOP/byte vs >900 ridge). nvjet
already TMA-streams the weights; cutlass reads the same bytes. No kernel
lever; only fewer bytes (quantize) helps - rejected on quality.
Corrects the body premise that vLLM runs these projections as NVFP4-Marlin:
vLLM runs the same nvidia-modelopt checkpoint that keeps them BF16, so the
projection bucket is a matched-precision gap, not a quant gap.
Realistic bit-exact MoE ceiling ~86-88% of vLLM; shipped lever 1 (86.3%) is
at it. No one-more-lever for MoE. Only clean win left is DENSE (+0.41% lever 5),
gated behind resolving the paged-MoE baseline md5 drift.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Re-quantizing the MoE GGUF's bf16 GDN/attn projections to NVFP4 (the lever-4
scope hypothesis) fails the KL gate on every axis vs the shipping NVFP4 baseline:
PPL +6.51% (FULL) / +6.15% (CONS) against a <1% gate, mean KLD-to-f16 0.164/0.172
vs baseline 0.137, top-1 argmax agreement down ~2.2-2.6 points. Both projq
variants rejected; in_proj_ba being kept bf16 (CONS) recovered almost nothing, so
the damage is in the bulk attn/GDN projections.
Root cause: the bf16 projections are a deliberate modelopt precision choice, not a
provenance accident. vLLM runs the same modelopt checkpoint, so it keeps these
projections bf16 too - the baseline GGUF already matches vLLM. The ~20.3ms
projection-GEMM bucket is the price of high-precision projections that vLLM also
pays; it is not the llama-vs-vLLM lever it appeared to be. The speed win is only
purchasable with a 6% PPL regression. MoE stays at 86.3% of vLLM @ npl128.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Scopes lever 4 (read-only, no GPU) on top of the flat levers 2+3. Root cause: the
MoE GGUF (nvidia modelopt, 241 NVFP4 tensors) quantized only the experts and left the
GDN/attn linear projections in BF16, while the dense GGUF (unsloth, 304 NVFP4 tensors)
already has them NVFP4 (proven: dense ssm_out runs FP4 MMQ; dense decode at 96.6% of
vLLM). Lever 4 = re-quantize the MoE GGUF's bf16 GDN/attn projections to NVFP4, the same
move vLLM makes on the identical weights - the +6.5ms projections bucket, the largest
single banked MoE gain available.
Path: offline re-quantize to a new GGUF variant (expanded --tensor-type); zero kernel
code - the loader sidecar-scale path + tuned mul_mat_q<NVFP4> are already in tree and
proven by the dense GGUF. Bit-changing => KL-gate, not md5. KL expected to pass (per-step
non-accumulating weight quant, unlike the failed bf16-state; experts already W4A4-clean);
lm_head is the one risky tensor (gate on argmax-agreement). Expected ~+4-6.5ms => MoE
86.3% -> ~88-91% of vLLM. Recommend a separate OPT-IN gallery variant (preserve the
bit-exact default; promote to default only if the KL gate is clean).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Rename the two base NVFP4 entries to a consistent -paged suffix
(qwen3.6-27b-nvfp4 -> qwen3.6-27b-nvfp4-paged, qwen3.6-35b-a3b-nvfp4 ->
qwen3.6-35b-a3b-nvfp4-paged) so all four base/MTP paged entries share the
naming convention. Update the two matching examples in the backend plan doc.
Add qwopus3.6-27b-v2-mtp-nvfp4-paged and qwopus3.6-27b-coder-mtp-nvfp4-paged:
verbatim copies of the stock qwopus NVFP4-MTP entries (same GGUF uri/sha256,
sampling, template, tags, function block) rewired onto the LocalAI
paged-attention stack (backend llama-cpp-localai-paged; f16, flash_attention,
131072 context, 99 gpu_layers, batch 512; paged_kv + max_batch_tokens:512 +
kv_unified:false + parallel:128). The stock entries are left untouched.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fuse the residual k_get_rows_float in the gated-DeltaNet decode path (the biggest
single kernel vLLM lacks per MOE_GAP_VS_VLLM.md, ~5.2 ms/step MoE). 0019 fused the
SSM-state gather, 0021 fused the conv compute but kept a build_rs gather for the
conv taps; nsys located that conv-state tap gather (n_embd_r=24576 floats x 128
seqs, ~720 x ~115 us per 24-step window) as the last k_get_rows in the GDN path.
New op ggml_ssm_conv_update_inplace_ids reads each sequence's prior conv taps from
cache[ids[s]] in-kernel (identity in place from the write slot, non-identity via a
disjoint scratch), mirroring the 0019 in-place + ids fusion. Bit-exact: read VALUES
unchanged, only the read path changes. Helps both dense and MoE (shared GDN conv).
GATE test-backend-ops (CUDA0 2/2): SSM_CONV_UPDATE_IDS, SSM_CONV_UPDATE, SSM_CONV,
GATED_DELTA_NET, GET_ROWS all PASS. GATE greedy md5 (-temp 0 -seed 1 -n 48)
BYTE-IDENTICAL both models: q36-27b-nvfp4 5951a5b4..., q36-35b-a3b-nvfp4 07db32c2...
nsys: k_get_rows<float,float> 10174 -> 9454 instances, 186.3 -> 102.8 ms (720 conv
gathers eliminated, replaced by a ~1.1 us no-op gather).
Built and gated on the DGX llama tree (branch paged, commit 944636c, f32 default).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Cross-agent synthesis on top of the both-engine nsys decomposition (3b5957157):
settle the user's "can we do what vLLM does on MoE?" question with the three
converging investigations (groundtruth measurement + vllm-marlin source-read +
marlin-port feasibility).
Verdict: vLLM's ~15% MoE-decode lead is NOT the Marlin GEMM (that bucket is a
-1.7 ms llama WIN: native FP4-MMA W4A4 47.3 vs Marlin W4A16 50.0 at the ragged
tiny-M decode shape, both at the LPDDR5x BW floor). The gap is bf16
dense-projection bandwidth (+6.5), recurrence state-gather plumbing (+6.6, led
by k_get_rows 5.2), graph/stream-overlap overhead (~+7), W4A4 act-quant tax
(+3.3), and router/glue (+5.4).
A W4A16/Marlin grouped MoE GEMM is REJECTED (default and opt-in): it would
regress the 27% GEMM bucket to half-rate bf16 MMA, re-enter the GB10 occupancy
wall the dense scaffold already STOPPED at, and its entire intrinsic upside is
the ~2% act-quant tax - smaller than the bit-exact +1.9% the 0025 re-graph
already banked, and closeable bit-exactly by fusing the act-quant.
Recommended build (none a new MoE GEMM): (1) fuse the k_get_rows SSM-state
gather (bit-exact, ~+5, biggest single-kernel win); (2) extend CUDA-graph
coverage + stream overlap (bit-exact, ~+7); (3) fuse the W4A4 act-quant into
RMSNorm/SiLU (bit-exact, +3.3); (4) NVFP4-quantize the still-bf16 GDN/attn
projections + lm_head (bit-changing, +6.5, the same NVFP4-dense-quant move vLLM
makes). Bit-exact levers alone reach ~94% of vLLM; with the projection quant
~96-97%, parity-or-better physically in reach since both heaviest kernels
(SSM core, MoE GEMM) are already llama wins.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ground-truth side-by-side per-kernel ms/step of the MoE decode gap on DGX GB10.
llama (752 t/s, step 169.8ms) vs vLLM graphs-on (901-equiv, step 142.0ms): 27.8ms gap.
Headline: the grouped MoE-expert GEMM is a llama WIN - native FP4-MMA W4A4 47.3ms
vs vLLM Marlin W4A16 50.0ms at the tiny-M decode shape. A Marlin-style W4A16 MoE
GEMM would be slower; it is not the lever (extends the w4a16-marlin DENSE verdict).
The 15% lives elsewhere: bf16 projections + convert glue (+6.5ms), recurrence
state-gather plumbing (+6.6ms, led by k_get_rows 5.2ms), graph coverage + stream
overlap (~+7ms), W4A4 act-quant tax (+3.3ms), router/glue (+5.4ms).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Patch 0026 added the hybrid per-head bf16 SSM-state opt-in as the
ssm_hybrid_tau_thresh cparam + the --ssm-bf16-tau CLI flag (default 0 =
bit-exact f32). Expose it per-model via the LocalAI gallery/model YAML
`options:` list, mirroring the paged_kv / max_batch_tokens setenv hooks.
- grpc-server.cpp: new `ssm_bf16_tau` (alias `ssm_hybrid_tau`) option ->
setenv(LLAMA_SSM_BF16_TAU) when the value parses to a positive float. It
does NOT reference the paged-only common_params field, so the turboquant
fork (which lacks patch 0026) stays byte-clean.
- patch 0026 (common.cpp common_context_params_to_llama): getenv fallback
feeds cparams.ssm_hybrid_tau_thresh from LLAMA_SSM_BF16_TAU only when the
--ssm-bf16-tau CLI flag is unset (0). Absent/non-positive env => untouched,
so stock stays bit-exact; the CLI flag takes precedence when set.
- docs: backend/index.yaml note, docs backends.md, gallery header NOTE
(referencing A_HYBRID_SSM_RESULTS.md; the 2 NVFP4 entries stay bit-exact).
Byte-safe when unset: with no ssm_bf16_tau option the env is never touched
and the default f32 bit-exact recurrence is preserved. Verified the parse +
consume code paths with a standalone compile-and-run (option string ->
LLAMA_SSM_BF16_TAU -> tau, plus 0 / garbage / CLI-precedence / unset cases).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
B-3 (the 0017-deferred mmq_y-down warp-remap of the NVFP4 grouped FP4-MMA
mul_mat_q) was built bit-exact on the clean 0025 base and measured: the
grouped GEMM kernel itself runs -1.3% (occupancy did rise via the nwarps=4
warp-remap / 128 threads-per-CTA), but end-to-end MoE decode is FLAT
(npl128 +0.4%, npl32 +0.3%, within noise) because the stream-k fixup grows
+42% (mmq_y=64 doubles the row-tiles) and the step is SSM/BW-bound. md5 PASS
both models, test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806 PASS.
No patch 0028; DGX dev tree reverted to pristine 0025.
Assessment: the bit-exact MoE GEMM/launch track is exhausted (B-1 re-graph
banked ~82->85%; B-2 and B-3 are 0). Honest bit-exact MoE ceiling = ~85% of
vLLM @npl128. The residual is the structural Marlin-NvFp4 grouped-GEMM gap
that no bit-exact lever closes. Recommend shipping the ~85% bit-exact default
and exposing the held 0026 bf16-SSM as a default-off opt-in (it reaches ~95%
but is non-bit-exact and fails the MoE KL gate).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of
quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode
step). Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted -
it does not lift.
Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for
quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns
is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is
flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128
~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel
caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as
patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this
256-tiny-expert BW-bound model.
Bit-exactness proven: md5 identical at block 64/128/256 for both models (the
per-thread quant body is untouched; thread->output map is invariant to
blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref,
MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.
MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the
remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM).
No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Regenerate patch 0026 with the hybrid-decode carry fix and record the
KL/throughput gate-sweep results.
Fix: clear(data=true) zeroes the whole recurrent buffer including the head_slot
maps, which were uploaded only once at construction; after the post-warmup
reset every head read head_slot==0 (f32-local-0), collapsing the split and
producing incoherent decode. Persist head_slot_host and re-upload via
upload_head_slots() after every buffer clear. Hybrid decode is now coherent and
the cross-op state carry is byte-exact (write==read, both partitions).
Gate result: de-risk PASS (test-backend-ops 84/84; T=0 md5 == 0023 baseline,
both models). Ship gate FAILS - no T_thresh meets MeanKLD<1e-3 AND
same-top-p>=99.5% with a meaningful speedup. The premise that the bf16 error
concentrates in long-memory heads is refuted: KL scales with the bf16 head
count and saturates ~0.06/~91% (MoE saturates at the minimal split). The carry
is byte-exact, so this is genuine bf16 sensitivity, not a bug. The byte-saving
lever is real (dense +12.4%, MoE +11.5% decode @npl128 at T=128) but cannot
meet the strict KL bar. Shipped default-off (f32, bit-exact opt-out); hybrid is
opt-in only and not recommended in the gallery config. Full tables in
A_HYBRID_SSM_RESULTS.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Lever A patch + build/de-risk results. Splits the persisted gated-DeltaNet
recurrent state per head: f32 on long-memory heads (where bf16 rounding does not
contract and the KL error concentrates), bf16 on fast-decaying heads, classified
at model load by tau_h = 1/(|ssm_a|*softplus(ssm_dt)). Default ssm_hybrid_tau_thresh
= 0.0 keeps every head f32 (bit-exact opt-out).
De-risk gates BOTH PASS: test-backend-ops GATED_DELTA_NET CUDA0 OK (incl 32 hybrid
mixed CUDA-vs-CPU cases); default all-f32 greedy md5 == 0023 baseline both models
(dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd).
Known open issue (opt-in hybrid only; default unaffected): hybrid-ON model decode
(ids in-place path) is incoherent; classifier/cache/kernel-params verified correct,
bug isolated to the ids in-place cross-step state path. See A_HYBRID_SSM_RESULTS.md.
Not ready for the GateSweep until fixed.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
Append lever C (structural dense residual: lm_head + scheduling) findings
and the master RANK + PLAN section to SPEEDUP_HUNT.md. Per-lever scorecard
(gain x tractability x gate), ranked build order, the concrete A build plan
for the hybrid per-head f32/bf16 SSM state cache, and the ordered B/C/D queue
with each one's build trigger.
Verdict: ship the MoE re-graph (patch 0025, measured +1.9-4.4%, both gates
PASSED) now; build A as the lead (only lever ABOVE vLLM on dense, KL-gated,
~430-454 t/s = 103-108% of vLLM); bank B-2/B-3 on MoE; C last (<1% bit-exact,
dead-end); D opt-in-only and dense-only behind the same KL gate bf16-SSM failed.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the
paged patch-stack (0001-0024) to the new tip: the stack was rebased onto
9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated
bit-exact before re-exporting the LocalAI .patch files.
Re-exporting each shipped patch from its rebased commit and diffing body-to-body
against the committed files identifies exactly 4 that changed and no longer
git-apply to 9d5d882d:
- 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to
the refactored update_slots() lambda (continue->return, batch.n_tokens->
batch.size()); identical env-guarded logic.
- 0013 static prefill budget: budget var-block / while-gate / admission-break
re-expressed against the refactored loop (add_ok=false idiom).
- 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream
inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The
inserted lines are unchanged. (This one rebased cleanly via 3-way but its
committed .patch no longer applies with plain git apply, so it is caught by the
per-patch apply-check, not by the rebase conflict count.)
- 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch =
batch.size() + add_ok=false against the refactored loop.
All four are byte-faithful format-patch exports of the gate-green rebased commits.
Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green
tree byte-for-byte across every code file.
The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only
(hunk bodies byte-identical, only @@ line-numbers shifted) and still apply
cleanly, so they are left unchanged. The remaining patches are identical.
Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121):
- test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT
1146/1146, MUL_MAT_ID 806/806 all OK.
- greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4
5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4
07db32c2bcb78d17a43ed18bc22705cd, both == baseline.
- decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s
(ref 745.7, +0.7%), both within noise.
Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Sync to master (12 commits) + the llama.cpp pin bump 8be759e6 -> 9d5d882d.
Conflicts resolved:
- Makefile .NOTPARALLEL: union (keep both backends/llama-cpp-localai-paged and
master's backends/privacy-filter-darwin).
- gallery/index.yaml: our 2 base NVFP4 entries (qwen3.6-27b-nvfp4, qwen3.6-35b-a3b-nvfp4)
for the paged backend prepended to master's full list; master keeps its own
*-nvfp4-mtp variants (distinct entries). Go build + YAML validated; the 8 duplicate
gallery names are pre-existing in master, not introduced here.
The patchset still needs re-verification against the new tip (pin-sync, next step).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
New backend = stock llama-cpp grpc-server + the paged patchset (forces LLAMA_PAGED=on),
shipped as its own meta-backend (mirrors turboquant, simpler: no fork pin, no
grpc-server patching - the paged runtime hooks already exist in grpc-server.cpp).
Stock llama-cpp untouched (LLAMA_PAGED?=on retained; the de-risk flip deferred for
sign-off). Gallery: qwen3.6-27b-nvfp4 (dense) + qwen3.6-35b-a3b-nvfp4 (MoE) with the
benchmark run config (paged_kv, max_batch_tokens, parallel, flash_attention, f16),
mudler/ GGUF uris (sha256 TODO until publish). Importer dropdown entry + tests.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C
Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill
burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill
draws from a depleted/fragmented pool and its throughput collapses (the benchmark's
"restart per npl" crutch). Decode is unaffected. The fix changes only host-side
block accounting and placement, never KV values or compute, and is gated behind
LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior).
Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every
block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for
the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly.
Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the
free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving
content-cache hashes.
Fix-3 release on slot completion: server_slot::release() issues prompt_clear()
under the paged engine so a finished-idle slot returns its blocks promptly.
Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023):
- Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM
(5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops
unaffected (no ggml op touched).
- Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores
ascending popleft order. UNIT PASS.
- Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998;
64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527).
- Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses
488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to
532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96
(Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both
arms (bit-exact serving).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM,
B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph):
- A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than
vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is
bounded on this bandwidth-bound a3b model. Keep down_proj quantize
retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a
later shared GDN+MoE decode-graph project.
- B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact
ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair
vs vLLM. Dead end. Rank last.
- C: paged-pool burst-degradation BUG (Part 2) is a true correctness
defect (prefill collapses 507->65 t/s after a burst, restart cures it):
reclamation gap on partial seq_rm + free-queue fragmentation. Plus the
static decode-first budget (Part 1) explains 903s/213s burst TTFT and
the chunked-interleave fix.
- D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's
bimodality was the shelved BF16 SSM path. Closed.
First build target: the paged-pool burst-degradation bug fix (Fix-1
truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot-
completion). Small, localized, default-off byte-identical, crisp repro
(npl64 burst then npl8: prefill within 10% of fresh + num_free restored).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesize the GPU kernel-budget probe with the read-only glue source
map. Add (4) the implementation cost - llama has no model-compute-dtype
knob, the residual stream is F32 by construction (ggml_mul_mat hardcodes
F32 output), so f16 glue is not a flag but an opt-in multi-file change
(norm.cu f16 kernels + f16 residual stream). Add the final verdict:
precision is not the dominant cause of the 8% residual (83% of the step
is already f32/W4A4-matched), f16 recovers only 40-60% of the gap and is
non-bit-exact, so do not build it as the default; ship the 95%-bit-exact
f32 plateau and target the structural cublas/graph-launch ~3-4% instead.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Empirical probe on q36-27b-nvfp4 @npl128 (build f7409c2, patch 0023):
- attention KV cache default is ALREADY f16 (K/V f16) -> --cache-type f16 is a
no-op; q8_0 within noise -> KV dtype is not a decode lever
- nsys node-trace decode budget: f32-glue (norms/elementwise/activations/attn,
excl. SSM recurrence + NVFP4 GEMM) = 28.7 ms = 8.4% of step (40.9 ms = 12%
incl. the non-FP4 cublas GEMM)
- f16 realistically recovers ~11-16 ms of the ~27 ms/step gap = ~40-60% of the
8.2% residual -> ~95-96% parity, not a full close; non-bit-exact opt-in only
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Earlier text claimed bf16 = vLLM's own precision; that was a refuted byte-gate
draft re-surfacing. The settled finding (BITEXACT_VS_VLLM.md, proven 3 ways) is
that vLLM keeps the gated-DeltaNet TEMPORAL state in f32 (only its conv state is
bf16). So bf16 temporal is BELOW vLLM's recurrent precision, not a match; and at
equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship:
the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0
on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides).
- final_benchmark.csv: clean 8-column plot-ready schema
(model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb),
16 rows (2 models x 2 engines x npl 8/32/64/128).
- QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the
internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on
both engines, no stale-baseline carry-over).
- decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels.
Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode
(ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM
bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's
flat ~107 GB pool).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ranked pick-up points after the 95%-bit-exact plateau: hybrid-precision SSM state
(per-head f32/bf16 split - the bf16 error is concentrated in long-memory heads, so
a split could capture most of the +25-31% while passing the f32 KL gate), dense
CUDA-graph instability, the rms_norm->fp4 fold (flat-risk), datacenter Blackwell
sm_100 (no LPDDR5x floor), adaptive prefill budget, MoE-specific recurrence tuning.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
De-risk passed (test-backend-ops 52/52 bf16, f32 default byte-identical to 0023),
and the throughput lever is real (recurrence -49%/call, dense ~490 t/s = 125% of
vLLM clean). But bf16-vs-f32 KLD is 0.06-0.17 at >=1024 ctx (threshold 1e-3) with
~90% top-token agreement: intrinsic bf16 error over gated-DeltaNet long-memory
heads, not a bug. That is exactly vLLM's own bf16 GDN precision. Shelved; ship the
95% bit-exact f32 plateau (0018-0023). bf16 work backed up on DGX (BF16_SSM_STATE.diff).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backends): darwin/Metal build for the privacy-filter backend (timeboxed try)
The privacy-filter.cpp engine is already Metal-capable on Apple Silicon: it pulls
ggml and never forces GGML_METAL=OFF, and ggml defaults Metal ON on Apple, so a
plain Darwin build is Metal-enabled. grpc++/protobuf resolve from Homebrew via
find_package(... CONFIG). It just had no darwin build path - the existing
package.sh and run.sh are Linux-only and there was no make target / workflow step.
Adds the bespoke darwin path, modeled on the ds4 one:
- scripts/build/privacy-filter-darwin.sh: native make grpc-server, otool -L dylib
bundling, create-oci-image (no Linux package.sh).
- Makefile: backends/privacy-filter-darwin target (+ .NOTPARALLEL).
- .github/workflows/backend_build_darwin.yml: gated build step for privacy-filter.
- scripts/changed-backends.js: inferBackendPathDarwin special-case -> backend/cpp.
- .github/backend-matrix.yml: includeDarwin entry (lang go, like ds4/llama-cpp).
- backend/index.yaml: metal: capability + metal-privacy-filter(-development) entries.
- backend/cpp/privacy-filter/run.sh: DYLD_LIBRARY_PATH branch on Darwin.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(privacy-filter): macOS proto include + bundle ggml dylibs
Validated natively on an M4 (the build/package/load chain now works with Metal):
- CMakeLists.txt: hw_grpc_proto compiles the generated proto/grpc sources but
only linked the binary dir, so on macOS it could not find protobuf's headers
(runtime_version.h) - Homebrew puts them under /opt/homebrew, not /usr/include.
Link protobuf::libprotobuf + gRPC::grpc++ so their include dirs propagate. No-op
on Linux (apt headers are already on the default search path).
- privacy-filter-darwin.sh: bundle the ggml shared libs the binary @rpath-links
(libggml{,-base,-cpu,-blas,-metal}); the otool -L walk only catches on-disk
absolute deps and missed them. Resolved at runtime by run.sh's DYLD_LIBRARY_PATH.
M4 check: arm64 grpc-server links @rpath/libggml-metal.0.dylib; with the 15 ggml
dylibs + grpc/protobuf bundled, it loads clean (no dyld errors) and prints usage.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(backends): quote $CURDIR in run.sh so backends work in paths with spaces
The backend launcher scripts derive their own directory with
CURDIR=$(dirname "$(realpath $0)") and then referenced it unquoted as
$CURDIR (e.g. [ -f $CURDIR/lib/ld.so ], export LD_LIBRARY_PATH=$CURDIR/lib:...,
exec $CURDIR/<binary> "$@"). When a backend is installed under a path that
contains a space - notably macOS's ~/Library/Application Support/... - bash
word-splits the unquoted $CURDIR, so the test builtin fails with
"binary operator expected" and exec tries to run ".../Library/Application",
yielding "No such file or directory". The backend never starts, surfacing as
a gRPC "service not ready" error and an HTTP 500. Quote $CURDIR (and the
realpath "$0") in every affected run.sh; no logic changes.
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The standalone quantize fold is empirically flat (Lever-2 precedent) with the
worst gain/plumbing ratio; no bit-exact lever remains. Dense 371.81 t/s @npl128
= 95.0% of vLLM 391, recurrence past vLLM at the LPDDR5x DRAM floor, all
byte-identical to llama f32. Only bf16 state (shelved) goes further.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Resolve pkg/xsysinfo/gpu.go: keep master's NVIDIAComputeCapability +
parseComputeCap (the #10485 multi-GPU work); re-express our IsNVIDIABlackwell
as a thin wrapper over NVIDIAComputeCapability instead of a duplicate
nvidia-smi probe.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror patch 0023 + results into the paged series. Bit-exact MoE decode/prefill
lever: ggml mul_mat_id re-quantizes each token's activation once per expert for
the broadcast up/gate proj (ne11==1); quantize_mmq_nvfp4 has no cross-thread
reduction, so the gathered blocks are byte-identical across experts. The lever
quantizes the ne12 unique tokens once and gathers the block_fp4_mmq rows into the
expert-gathered layout with a coalesced uint4 copy (144 B = 9 uint4); the GEMM is
untouched and down_proj keeps the stock path.
Measured (DGX GB10, on top of patch 0022, q36-35b-a3b-nvfp4): decode S_TG npl128
745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%, prefill T_PP -4%; dense q36-27b-nvfp4
byte-flat. nsys: quantize_mmq_nvfp4 868 -> 457 ms, gather +32 ms (net -379 ms).
Bit-exact: q36-27b 5951a5b4..., q36-35b-a3b 07db32c2... (on == off == 0022);
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default;
GGML_CUDA_MOE_QUANT_DEDUP=0 restores stock.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
kernel, carried as paged patch 0022. After the f32 verdict (vLLM carries the
gated-DeltaNet temporal state in float32 and moves the same ~805 MB/call as llama;
the gap was pure DRAM bandwidth efficiency on equal bytes - llama 73.4% vs vLLM
82.4% of the 273 GB/s GB10 peak), the lever is a latency-coverage retune that keeps
the per-column f32 reduction/FMA order byte-identical (md5-gateable). The
bf16-state plan stays shelved.
Column folding: each warp owns COLS_PER_WARP columns of the 128x128 recurrent state
instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ...
within a per-block column tile; grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The
per-lane strided row sharding and the warp_reduce butterfly are unchanged, so only
the (warp,block)->column assignment differs and the result is bit-identical;
per-warp memory-level parallelism rises ~COLS_PER_WARP-fold, covering more DRAM
latency on this bandwidth-bound kernel. Default tile is the measured GB10 winner
(NUM_WARPS=16, COLS_PER_WARP=8), env-selectable via GDN_NW / GDN_CPW.
GB10: gated_delta_net decode 4.02 -> 3.49 ms/call, 73.4% -> 84.6% of peak (above
vLLM's 82.4%; 102.6% of vLLM recurrence BW). decode S_TG t/s: dense 27b npl128
335.9 -> 373.2 (+11.1%), MoE 35b-a3b npl128 688.4 -> 745.7 (+8.3%). Greedy md5
byte-identical to the 0021 baseline on both q36-27b-nvfp4 and q36-35b-a3b-nvfp4;
test-backend-ops -o GATED_DELTA_NET 36/36 PASS. Bench/method in
OCCUPANCY_RETUNE_RESULTS.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate
design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet
recurrence is already single-pass at the f32 byte floor), the decode conv path
was the only remaining bit-exact lever.
New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated
by a non-null src[3]). On the single-token decode path it replaces the four-op
conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu
+ ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per
(channel, sequence), assembles the width-K window in registers from the K-1 cached
taps plus the current qkv_mixed token, computes the depthwise conv with the SAME
ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv
output, and writes the 1-token-shifted ring state back IN PLACE into the conv
cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018
in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and
write target (the cache view) are disjoint buffers, so it is race-free by
construction with no ids/identity logic.
- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel,
src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring;
op_params[0]=fuse_silu)
- ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel +
ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv
- ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels)
+ branch in ggml_compute_forward_ssm_conv
- delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs
conv-tap gather; fuses conv+silu+shifted write-back)
- qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path
(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep
the original chain
- tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference
test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90.
Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1
(0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5
ac163882... both BYTE-IDENTICAL.
decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session:
dense q36-27b-nvfp4 : npl 32 199.76 -> 202.99 (+1.6%)
npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391)
MoE q36-35b-a3b : npl 32 421.72 -> 432.39 (+2.5%)
npl 128 689.74 -> 713.54 (+3.5%)
Lift holds in eager too (dense npl128 333.62 -> 342.97). Step -11.9 ms/step
(dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the
decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152);
conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state
conv-cache plumbing.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesize the cross-engine bit-exactness and f32-preserving-parity study.
Resolve the contradiction between sub-agents (one f32, two bf16) by reading
every link of vLLM's state-dtype chain on live source:
- config.json text_config.mamba_ssm_dtype = "float32" (both served models)
- cache.py default mamba_ssm_cache_dtype = "auto"; bench passes no override
- vllm.py __post_init__ -> try_verify_and_update_config (config finalize)
- Qwen3_5ForConditionalGenerationConfig override copies "float32" into
mamba_ssm_cache_dtype before state-dtype resolution
- mamba_utils._mamba_state_dtype -> temporal = torch.float32 (conv = bf16)
- qwen_gdn_linear_attn allocates the temporal cache at f32
Verdicts: B1 TRUE (sub-claim 'more efficient than vLLM' refuted); B2 REFUTED
(equal f32 bytes both sides, ~10pct efficiency gap not 2x width); B3 REFUTED
(vLLM hits throughput with f32 state; a bit-exact occupancy/coalescing retune
of gated_delta_net_cuda 74->81pct peak is the f32-preserving parity lever);
B4 CONFIRMED (bit-exact-vs-vLLM impossible: A1 FP4 GEMM 8/4/16-bit operand
gap + A2 recurrence g.Sigma vs Sigma.g reassociation on different reduction
trees, plus general FP non-associativity). bf16 temporal state degrades BELOW
vLLM's f32 recurrent precision -> an over-clock, not a parity requirement.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesizes the bf16 SSM recurrent-state-cache plan into a build-agent brief:
ordered file-by-file edit list (kernel/op dtype-generic first, then cparams
default flip, gRPC/YAML, back-compat), the KL<1e-3 + PPL-delta + coherence +
long-context-drift acceptance gate that REPLACES the bit-exact md5 gate (bf16 is
intentionally non-bit-exact, equal precision to vLLM), bench targets (recurrence
3.98->2-3 ms/call, step 384->289-339 ms, 360-443 tok/s dense) + nsys check, the
default-bf16/f32-opt-out semantics + state-file back-compat, the risk register,
and the single biggest risk (silent corruption on the prefill/keep_rs_t/gather
paths) with the de-risk-first test-backend-ops step. Conv state stays f32 in v1.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesis of the byte-gate workflow (ncu-byte-gate measurement +
vllm-fused-recurrence-study + llama-fused-recurrence-design + conv-fusion-design).
Verdict closes all five decision points:
(1) Byte ratio: llama re-stream ~1.0x (cap <=1.33x); recurrence at 74% GB10 peak,
MORE BW-efficient than vLLM packed_decode at 41%. The 2x DRAM gap is 100%
f32-vs-bf16 state-cache width, not extra passes.
(2) Fused single-pass recurrence: NO-BUILD - already one R + one W of f32 state,
gate ops touch tiny q/k/g/beta not the 805 MB state -> recovers ~0 bytes.
(3) Conv-state in-place fusion: GO - bit-exact, no-regret, +12-14 ms/step (~+3%),
eliminates concat_cont + cpy_scalar + folds silu.
(4) bf16 SSM state: BUILD (KL<1e-3 gated product call) - only lever on the dominant
50% recurrence term, +45-95 ms/step -> step 289-339 ms = parity-to-ahead of vLLM.
Bit-exact parity unreachable on this term (f32 bytes irreducible); bf16 = equal
precision to vLLM, which is itself bf16.
(5) Build order: conv fusion next (no-regret, bit-exact), then bf16 state (highest
value, gated). Confirming measurements stated per step.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Decisive measurement (ncu-byte-gate agent, DGX GB10). ncu HW DRAM counters were
blocked (ERR_NVGPUCTRPERM, root-only NVreg param; no passwordless sudo), so the
byte ratio was settled via CUPTI kernel timing + exact byte geometry: bytes moved
<= peak_BW x duration caps the re-stream factor.
llama gated_delta_net_cuda decode (B=128, f32 state): 3.98 ms/call, 805 MB R+W,
202 GB/s = 74% of GB10 peak. vLLM fused_recurrent_packed_decode (B=128, bf16 state):
3.62 ms/call, 402 MB R+W, 111 GB/s = 41% peak. Both single-pass (load-once/store-once,
verified in source). llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x needs
>peak BW = impossible).
VERDICT: NO-BUILD the fused single-pass recurrence - the kernel is already single-pass,
coalesced, and MORE bandwidth-efficient than vLLM's triton kernel; the gate ops touch
the tiny q/k/g/beta projections, not the 805 MB state, so fusion recovers ~0 state bytes.
The entire 2x DRAM gap vs vLLM is f32 (llama) vs bf16 (vLLM) state-cache width. BUILD
bf16 SSM state instead: halves 805->413 MB, ~45-95 ms/step, step 384 -> 289-339 ms =
parity-to-ahead of vLLM 327 (non-bit-exact vs f32 but equal to vLLM's own bf16 precision).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Final synthesis of the critical-path gap analysis: the decode step is
99.94% GPU-busy single-stream (idle 0.225ms = 0.06%), so the 14% gap to
vLLM is kernel GPU-time dominated by the bandwidth-bound gated_delta_net
recurrence (196.37ms = 51.6%), not launch bubbles. Claims A/B/C all
REFUTED as worded; the single residual is the unmeasured DRAM byte ratio
of llama's recurrence vs vLLM's fused kernel. Ranked plan: single-pass
fused GDN recurrence (gap-closer, gate on ncu byte-ratio test) + conv-state
concat fusion (no-regret +2-3%, bit-exact); gate-fold alone tops out at
~89% of vLLM; bf16 state is the only floor-mover but breaks bit-exactness.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fresh nsys --cuda-graph-trace=node capture of one steady decode step on
q36-27b-nvfp4 dense at npl128 (clean Lever-1 build-cuda-base). The decode step
is a single CUDA graph; node-level expansion shows it is 99.94% GPU-busy on a
single stream with 0.225 ms/step inter-kernel idle (0.06%, zero gaps >5us).
This refutes the "~60% idle bubbles / 57 ms = 100% bubble" hypothesis and
confirms the cudagraph-coverage source verdict. Real decode mix: gated_delta_net
196 ms = 51.6% of the step (4.08 ms/call x48; the prior 1.47 ms/call "near-vLLM"
was a prefill-contaminated eager average), FP4 GEMM+quantize 29%, gating glue
(Lever 3 target) only 3.35%, gdn_gather 0.06 ms. By roofline-decode's own sizing
test (idle < 57 ms => gap is elsewhere) the 14% gap to vLLM lives in kernel
GPU-time, dominated by the bandwidth-bound GDN recurrence, not in bubbles; Lever
3 fusion is resized to ~3% and reframed as byte-reduction, not bubble removal.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Determine whether the ggml CUDA graph covers the gated-DeltaNet serial chain
at batch=128. It does: nothing in the GDN region forces graph-disable
(check_compability lists only split-buffers and large-batch MUL_MAT_ID), and
the recurrent head is constant for a steady 128-seq batch so the inplace_ids
state_dst offset + rs_head op_param + SSM input shapes are stable across steps.
The fused op does no host-sync / capture-time cudaMalloc. The only re-warm is
the per-256-token full-attention block-table cadence (not a GDN op). The
~40% util is bandwidth-roofline (SSM state traffic 66% of step bytes), not
launch-gap idle - so no GDN graph-safe lever; the only non-covered idle is the
~0.4% between-step host cgraph rebuild.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Read-only source comparison of the gated-DeltaNet decode region. vLLM folds
conv-silu, q/k l2norm, scale, softplus+A_log gate, sigmoid-beta, the delta-rule
recurrence and the SSM state write-back into ONE Triton kernel
(fused_recurrent_gated_delta_rule_packed_decode), with the output gate fused into
a gated rms_norm, and captures the whole decode forward in a full CUDA graph
(GDNAttentionMetadata UNIFORM_BATCH, decode-only full cudagraph). llama runs the
same region as ~8 separate host-launched, serially-dependent ggml nodes. That
launch/bubble delta - not GEMM throughput - is the candidate 62%-vs-40% busy gap.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS
Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on
x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set
that ggml's backend registry selects at runtime by probing host CPU
features. One build instead of four, broader microarch coverage (adds
alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the
shell-side /proc/cpuinfo probing in run.sh goes away.
Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
ggml/llama become shared objects. SHARED_LIBS is now a make variable
(default OFF) so the override survives the recursive sub-make into the
VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch
backends are runtime-dlopened, not link deps, so they only compile via
ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would
otherwise become a DSO referencing hidden-visibility symbols in the
static libprotobuf.a, which fails to link ("hidden symbol ... is
referenced by DSO"). Keeping it static links gRPC/protobuf into the
executable while only ggml/llama stay shared, so no PIC or base-image
change is required.
- package.sh bundles the libggml-*.so set into package/lib; ggml finds
them by scanning the bundled ld.so directory (/proc/self/exe), which
run.sh launches from.
Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are
unchanged; the same recipe applies but is out of scope here.
Validated with a full docker build plus a live inference smoke test:
the model loads, ggml selects the AVX512_BF16 variant on a Zen-class
host, and tokens generate correctly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant
- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build
(only hipblas keeps the fallback build). ggml's arm64 variant table
(armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant
copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so
the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS
make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL
flags and --target ggml through, then collects the .so set. run.sh and
package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin
build, which emits dylibs rather than .so).
ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply.
Scope still excludes the darwin packaging wiring (separate change).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging
- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme
is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with
gcc-14 (installed in the compile step). The host only selects a variant it
actually supports at runtime, but every variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of
the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds
ggml-metal). The per-microarch libggml-cpu-*.dylib are placed in the package
root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir
scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib)
ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a
.so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/
llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends
go in the package root for ggml's executable-directory scan, .dylib core libs go
in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the
variants.
Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model
loads and generates correct tokens.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback
The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so the
GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the whole CPU
microarch variant matrix on top of their already-huge GPU backend - blowing the
build time (the sycl job was only 59% done after 2h11m) - and the arm64 l4t build
failed at `apt-get install gcc-14` (exit 100) on the Jetson base.
Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in
.github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build gets a
single fallback CPU grpc-server, since the accelerator does the compute. This also
confines the arm64 gcc-14 step (needed for the armv9.2 SME variants) to the CPU
build, away from the GPU base images.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all
arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only);
only GPU images ship fallback-only. Fix the stale comment to match.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
both engines pinned the largest llama-specific overage to the gated-DeltaNet
output projection (ssm_out).
The GDN op left its output in SSM layout and the graph reshaped it to 3D
[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
ssm_out weight read across the 128 sequences. vLLM packs the same projection into
one M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.
The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
routes to the MMQ M=128 tensor-core GEMM. The result is then already 2D, so the
redundant post-matmul reshape_2d is dropped. Same contiguous data, just a 2D vs
3D view: bit-identical. Gated to the gated-DeltaNet path (qwen35 / qwen35moe /
qwen3next); other archs untouched.
Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
test-backend-ops MUL_MAT and MUL_MAT_ID OK.
decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
dense q36-27b: 170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
MoE q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).
nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) at a LOWER
per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
vs 2.77 ms/call for the old GEMV.
Mirrors DGX dev-tree commit df1cc97.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Cross-check the adversarial validation against the profiler ground-truth and
finalize DECODE_PARITY_EXPLORE.md. The post-SSM 254->391 decode gap is one
llama-specific defect: the gated-DeltaNet output projection (ssm_out) runs as
an FP4 GEMV (mul_mat_vec_q, 132 ms/step = 26% of decode) at batch 128 instead
of a tensor-core MMQ GEMM. Mechanism confirmed at source: final_output is 3D
[6144,1,n_seqs] so src1->ne[1]=1 trips the MMVQ dispatch (<=8), with the 128
sequences in ne[2]. vLLM packs the same projection into a cutlass M=128 GEMM.
GDN recurrence is only +11%/call (not the lever); P2a optimized the wrong FP4
kernel (the 17% MMQ, not the 26% MMVQ); CUDA graphs, host loop, and DRAM bytes
are all ruled out. Decode parity is reachable in software (not a hardware
floor): identical bytes/floor, vLLM hits 62% util vs llama 40% on the same
GB10. Highest-value next step (~free, bit-exact): collapse final_output to 2D
before ssm_out so M=128 routes to MMQ. Ranked levers + cumulative ceilings
toward 391 documented.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fresh post-SSM nsys of llama (build-cuda-base, patch 0019) AND vLLM 0.23.0 at
npl128 decode. Reproduces the 391 reference (vLLM 394 t/s eager / 420 graphs,
graphs +6% only) and confirms llama 245 t/s. Both ~98% GPU-busy; the gap is
GPU kernel-time, not idle/host/graphs. GDN compute comparable (llama 4.03 vs
vLLM 3.62 ms/call, +11%). bytes/step: llama not higher (131 vs 85 MB memcpy;
SSM-fix 18GB/step DtoD removal confirmed in-trace). Single biggest llama-specific
overage = FP4 matmul path 236 vs 117 ms/step (+119 ms = 64% of the gap),
dominated by mul_mat_vec_q (FP4 GEMV at batch 128, 132 ms/step, 26%, one per
GDN layer). Track B optimized the wrong FP4 kernel (mul_mat_q, not the GEMV).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>