B-3 (the 0017-deferred mmq_y-down warp-remap of the NVFP4 grouped FP4-MMA
mul_mat_q) was built bit-exact on the clean 0025 base and measured: the
grouped GEMM kernel itself runs -1.3% (occupancy did rise via the nwarps=4
warp-remap / 128 threads-per-CTA), but end-to-end MoE decode is FLAT
(npl128 +0.4%, npl32 +0.3%, within noise) because the stream-k fixup grows
+42% (mmq_y=64 doubles the row-tiles) and the step is SSM/BW-bound. md5 PASS
both models, test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806 PASS.
No patch 0028; DGX dev tree reverted to pristine 0025.
Assessment: the bit-exact MoE GEMM/launch track is exhausted (B-1 re-graph
banked ~82->85%; B-2 and B-3 are 0). Honest bit-exact MoE ceiling = ~85% of
vLLM @npl128. The residual is the structural Marlin-NvFp4 grouped-GEMM gap
that no bit-exact lever closes. Recommend shipping the ~85% bit-exact default
and exposing the held 0026 bf16-SSM as a default-off opt-in (it reaches ~95%
but is non-bit-exact and fails the MoE KL gate).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
B-2 / M1 (SPEEDUP_HUNT rank #2): bit-exact block/grid/occupancy retune of
quantize_mmq_nvfp4 (the MoE down_proj activation-quant, ~2% of the MoE decode
step). Built+measured on a clean 0025 base (DGX GB10 sm_121), then reverted -
it does not lift.
Finding: the existing blockDim.x=128 is ALREADY the kernel-level optimum for
quantize_mmq_nvfp4 on GB10. nsys (8193 invocations): block=128 total 117.4M ns
is the fastest; 64 +8.7%, 192 +9.9%, 256 +6.9%. End-to-end MoE decode_agg is
flat within 0.4% noise across all block sizes {32..256} (npl32 ~438, npl128
~751 t/s). The act-quant is ~2% of a BW-bound step, so even a perfect kernel
caps the win at ~2%, and 128 is already optimal => measured 0%. Same outcome as
patch 0015 (M-tile) and 0017 (MINBLOCKS): no occupancy headroom on this
256-tiny-expert BW-bound model.
Bit-exactness proven: md5 identical at block 64/128/256 for both models (the
per-thread quant body is untouched; thread->output map is invariant to
blockDim.x). Gate at default: dense 5951a5b4 == ref, MoE 07db32c2 == ref,
MUL_MAT 1146/1146, MUL_MAT_ID 806/806 PASS.
MoE stays ~85% of vLLM @npl128 / ~87% @npl32 - still well below vLLM, so the
remaining MoE lever is B-3 (mmq_y-down warp-remap on the grouped FP4 GEMM).
No patch 0027; dev tree reverted to pristine 0025. Full data in B_MOE_RESULTS.md.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Regenerate patch 0026 with the hybrid-decode carry fix and record the
KL/throughput gate-sweep results.
Fix: clear(data=true) zeroes the whole recurrent buffer including the head_slot
maps, which were uploaded only once at construction; after the post-warmup
reset every head read head_slot==0 (f32-local-0), collapsing the split and
producing incoherent decode. Persist head_slot_host and re-upload via
upload_head_slots() after every buffer clear. Hybrid decode is now coherent and
the cross-op state carry is byte-exact (write==read, both partitions).
Gate result: de-risk PASS (test-backend-ops 84/84; T=0 md5 == 0023 baseline,
both models). Ship gate FAILS - no T_thresh meets MeanKLD<1e-3 AND
same-top-p>=99.5% with a meaningful speedup. The premise that the bf16 error
concentrates in long-memory heads is refuted: KL scales with the bf16 head
count and saturates ~0.06/~91% (MoE saturates at the minimal split). The carry
is byte-exact, so this is genuine bf16 sensitivity, not a bug. The byte-saving
lever is real (dense +12.4%, MoE +11.5% decode @npl128 at T=128) but cannot
meet the strict KL bar. Shipped default-off (f32, bit-exact opt-out); hybrid is
opt-in only and not recommended in the gallery config. Full tables in
A_HYBRID_SSM_RESULTS.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Lever A patch + build/de-risk results. Splits the persisted gated-DeltaNet
recurrent state per head: f32 on long-memory heads (where bf16 rounding does not
contract and the KL error concentrates), bf16 on fast-decaying heads, classified
at model load by tau_h = 1/(|ssm_a|*softplus(ssm_dt)). Default ssm_hybrid_tau_thresh
= 0.0 keeps every head f32 (bit-exact opt-out).
De-risk gates BOTH PASS: test-backend-ops GATED_DELTA_NET CUDA0 OK (incl 32 hybrid
mixed CUDA-vs-CPU cases); default all-f32 greedy md5 == 0023 baseline both models
(dense 5951a5b4d624ce891e22ab5fca9bc439, MoE 07db32c2bcb78d17a43ed18bc22705cd).
Known open issue (opt-in hybrid only; default unaffected): hybrid-ON model decode
(ids in-place path) is incoherent; classifier/cache/kernel-params verified correct,
bug isolated to the ids in-place cross-step state path. See A_HYBRID_SSM_RESULTS.md.
Not ready for the GateSweep until fixed.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
Append lever C (structural dense residual: lm_head + scheduling) findings
and the master RANK + PLAN section to SPEEDUP_HUNT.md. Per-lever scorecard
(gain x tractability x gate), ranked build order, the concrete A build plan
for the hybrid per-head f32/bf16 SSM state cache, and the ordered B/C/D queue
with each one's build trigger.
Verdict: ship the MoE re-graph (patch 0025, measured +1.9-4.4%, both gates
PASSED) now; build A as the lead (only lever ABOVE vLLM on dense, KL-gated,
~430-454 t/s = 103-108% of vLLM); bank B-2/B-3 on MoE; C last (<1% bit-exact,
dead-end); D opt-in-only and dense-only behind the same KL gate bf16-SSM failed.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The worktree merge bumped LLAMA_VERSION 8be759e6 -> 9d5d882d. This re-syncs the
paged patch-stack (0001-0024) to the new tip: the stack was rebased onto
9d5d882d on the DGX dev tree, rebuilt clean (CUDA sm_121), and re-validated
bit-exact before re-exporting the LocalAI .patch files.
Re-exporting each shipped patch from its rebased commit and diffing body-to-body
against the committed files identifies exactly 4 that changed and no longer
git-apply to 9d5d882d:
- 0008 cross-request prefix share: re-anchored the [paged 0008] commit block to
the refactored update_slots() lambda (continue->return, batch.n_tokens->
batch.size()); identical env-guarded logic.
- 0013 static prefill budget: budget var-block / while-gate / admission-break
re-expressed against the refactored loop (add_ok=false idiom).
- 0015 expert-density MoE token-tile auto-select: pure context re-anchor; upstream
inserted a test_mul_mat_id case at the hunk anchor in test-backend-ops.cpp. The
inserted lines are unchanged. (This one rebased cleanly via 3-way but its
committed .patch no longer applies with plain git apply, so it is caught by the
per-patch apply-check, not by the rebase conflict count.)
- 0016 dynamic decode-first budget: dynamic budget block + n_decode_in_batch =
batch.size() + add_ok=false against the refactored loop.
All four are byte-faithful format-patch exports of the gate-green rebased commits.
Applying the full corrected series to a fresh 9d5d882d reproduces the gate-green
tree byte-for-byte across every code file.
The other 7 touched patches (0009/0017/0018/0019/0020/0021/0024) are LINENUM-only
(hunk bodies byte-identical, only @@ line-numbers shifted) and still apply
cleanly, so they are left unchanged. The remaining patches are identical.
Validation on the rebased build (NVFP4 Qwen3.6, GB10 sm_121):
- test-backend-ops CUDA0: GATED_DELTA_NET 36/36, SSM_CONV 45/45, MUL_MAT
1146/1146, MUL_MAT_ID 806/806 all OK.
- greedy md5 (-fa on -n 48 --temp 0 --seed 1): dense q36-27b-nvfp4
5951a5b4d624ce891e22ab5fca9bc439 and MoE q36-35b-a3b-nvfp4
07db32c2bcb78d17a43ed18bc22705cd, both == baseline.
- decode S_TG @npl128: dense 366.41 t/s (ref 373.2, -1.8%), MoE 751.11 t/s
(ref 745.7, +0.7%), both within noise.
Details in backend/cpp/llama-cpp/patches/paged/PIN_SYNC_9d5d882d.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Sync to master (12 commits) + the llama.cpp pin bump 8be759e6 -> 9d5d882d.
Conflicts resolved:
- Makefile .NOTPARALLEL: union (keep both backends/llama-cpp-localai-paged and
master's backends/privacy-filter-darwin).
- gallery/index.yaml: our 2 base NVFP4 entries (qwen3.6-27b-nvfp4, qwen3.6-35b-a3b-nvfp4)
for the paged backend prepended to master's full list; master keeps its own
*-nvfp4-mtp variants (distinct entries). Go build + YAML validated; the 8 duplicate
gallery names are pre-existing in master, not introduced here.
The patchset still needs re-verification against the new tip (pin-sync, next step).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
New backend = stock llama-cpp grpc-server + the paged patchset (forces LLAMA_PAGED=on),
shipped as its own meta-backend (mirrors turboquant, simpler: no fork pin, no
grpc-server patching - the paged runtime hooks already exist in grpc-server.cpp).
Stock llama-cpp untouched (LLAMA_PAGED?=on retained; the de-risk flip deferred for
sign-off). Gallery: qwen3.6-27b-nvfp4 (dense) + qwen3.6-35b-a3b-nvfp4 (MoE) with the
benchmark run config (paged_kv, max_batch_tokens, parallel, flash_attention, f16),
mudler/ GGUF uris (sha256 TODO until publish). Importer dropdown entry + tests.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C
Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill
burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill
draws from a depleted/fragmented pool and its throughput collapses (the benchmark's
"restart per npl" crutch). Decode is unaffected. The fix changes only host-side
block accounting and placement, never KV values or compute, and is gated behind
LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior).
Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every
block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for
the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly.
Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the
free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving
content-cache hashes.
Fix-3 release on slot completion: server_slot::release() issues prompt_clear()
under the paged engine so a finished-idle slot returns its blocks promptly.
Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023):
- Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM
(5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops
unaffected (no ggml op touched).
- Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores
ascending popleft order. UNIT PASS.
- Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998;
64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527).
- Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses
488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to
532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96
(Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both
arms (bit-exact serving).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesis of the four read-only/GPU investigations (A MoE grouped-GEMM,
B cublas lm_head, C TTFT/paged-pool burst, D dense CUDA-graph):
- A: llama already has the sorted-grouped-FP4-MMA GEMM (higher tier than
vLLM's GB10 W4A16 Marlin fallback); standalone bit-exact kernel win is
bounded on this bandwidth-bound a3b model. Keep down_proj quantize
retune (M1) as a cheap bank-shot; fold the decode-graph (M2) into a
later shared GDN+MoE decode-graph project.
- B: lm_head is BF16 (not FP4), nvjet already ~72% of peak HBM; bit-exact
ceiling <1%, the only big win (NVFP4 head) is non-bit-exact and unfair
vs vLLM. Dead end. Rank last.
- C: paged-pool burst-degradation BUG (Part 2) is a true correctness
defect (prefill collapses 507->65 t/s after a burst, restart cures it):
reclamation gap on partial seq_rm + free-queue fragmentation. Plus the
static decode-first budget (Part 1) explains 903s/213s burst TTFT and
the chunked-interleave fix.
- D: f32 dense CUDA-graph is STABLE (<1%, no bimodality); the brief's
bimodality was the shelved BF16 SSM path. Closed.
First build target: the paged-pool burst-degradation bug fix (Fix-1
truncate-on-partial-seq_rm + Fix-2 defrag-on-empty + Fix-3 release-on-slot-
completion). Small, localized, default-off byte-identical, crisp repro
(npl64 burst then npl8: prefill within 10% of fresh + num_free restored).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesize the GPU kernel-budget probe with the read-only glue source
map. Add (4) the implementation cost - llama has no model-compute-dtype
knob, the residual stream is F32 by construction (ggml_mul_mat hardcodes
F32 output), so f16 glue is not a flag but an opt-in multi-file change
(norm.cu f16 kernels + f16 residual stream). Add the final verdict:
precision is not the dominant cause of the 8% residual (83% of the step
is already f32/W4A4-matched), f16 recovers only 40-60% of the gap and is
non-bit-exact, so do not build it as the default; ship the 95%-bit-exact
f32 plateau and target the structural cublas/graph-launch ~3-4% instead.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Empirical probe on q36-27b-nvfp4 @npl128 (build f7409c2, patch 0023):
- attention KV cache default is ALREADY f16 (K/V f16) -> --cache-type f16 is a
no-op; q8_0 within noise -> KV dtype is not a decode lever
- nsys node-trace decode budget: f32-glue (norms/elementwise/activations/attn,
excl. SSM recurrence + NVFP4 GEMM) = 28.7 ms = 8.4% of step (40.9 ms = 12%
incl. the non-FP4 cublas GEMM)
- f16 realistically recovers ~11-16 ms of the ~27 ms/step gap = ~40-60% of the
8.2% residual -> ~95-96% parity, not a full close; non-bit-exact opt-in only
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Earlier text claimed bf16 = vLLM's own precision; that was a refuted byte-gate
draft re-surfacing. The settled finding (BITEXACT_VS_VLLM.md, proven 3 ways) is
that vLLM keeps the gated-DeltaNet TEMPORAL state in f32 (only its conv state is
bf16). So bf16 temporal is BELOW vLLM's recurrent precision, not a match; and at
equal f32 precision llama's recurrence already beats vLLM (84.6% vs 82.4% peak).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Public deliverable for the patch-0018..0023 f32 bit-exact paged-attention ship:
the apples-to-apples NVFP4 decode benchmark (llama.cpp paged 0023 vs vLLM 0.23.0
on GB10 / DGX Spark, matched weights, CUDA graphs ON both sides).
- final_benchmark.csv: clean 8-column plot-ready schema
(model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb),
16 rows (2 models x 2 engines x npl 8/32/64/128).
- QWEN36_NVFP4_BENCH.md: embed the two decode-vs-npl plots; add the
internal-consistency note (decode_agg vs perseq*npl is TTFT-governed, holds on
both engines, no stale-baseline carry-over).
- decode-vs-npl PNGs (one per model), llama vs vLLM, per-point llama-%-of-vLLM labels.
Headline (measured, nothing pre-assumed): dense llama 90-117% of vLLM decode
(ahead at npl8), MoE 77-83%, at higher precision (f32 GDN state + q8 act vs vLLM
bf16 GDN + w4a4) and 1.5-3x lower unified memory (on-demand paged KV vs vLLM's
flat ~107 GB pool).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ranked pick-up points after the 95%-bit-exact plateau: hybrid-precision SSM state
(per-head f32/bf16 split - the bf16 error is concentrated in long-memory heads, so
a split could capture most of the +25-31% while passing the f32 KL gate), dense
CUDA-graph instability, the rms_norm->fp4 fold (flat-risk), datacenter Blackwell
sm_100 (no LPDDR5x floor), adaptive prefill budget, MoE-specific recurrence tuning.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
De-risk passed (test-backend-ops 52/52 bf16, f32 default byte-identical to 0023),
and the throughput lever is real (recurrence -49%/call, dense ~490 t/s = 125% of
vLLM clean). But bf16-vs-f32 KLD is 0.06-0.17 at >=1024 ctx (threshold 1e-3) with
~90% top-token agreement: intrinsic bf16 error over gated-DeltaNet long-memory
heads, not a bug. That is exactly vLLM's own bf16 GDN precision. Shelved; ship the
95% bit-exact f32 plateau (0018-0023). bf16 work backed up on DGX (BF16_SSM_STATE.diff).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(backends): darwin/Metal build for the privacy-filter backend (timeboxed try)
The privacy-filter.cpp engine is already Metal-capable on Apple Silicon: it pulls
ggml and never forces GGML_METAL=OFF, and ggml defaults Metal ON on Apple, so a
plain Darwin build is Metal-enabled. grpc++/protobuf resolve from Homebrew via
find_package(... CONFIG). It just had no darwin build path - the existing
package.sh and run.sh are Linux-only and there was no make target / workflow step.
Adds the bespoke darwin path, modeled on the ds4 one:
- scripts/build/privacy-filter-darwin.sh: native make grpc-server, otool -L dylib
bundling, create-oci-image (no Linux package.sh).
- Makefile: backends/privacy-filter-darwin target (+ .NOTPARALLEL).
- .github/workflows/backend_build_darwin.yml: gated build step for privacy-filter.
- scripts/changed-backends.js: inferBackendPathDarwin special-case -> backend/cpp.
- .github/backend-matrix.yml: includeDarwin entry (lang go, like ds4/llama-cpp).
- backend/index.yaml: metal: capability + metal-privacy-filter(-development) entries.
- backend/cpp/privacy-filter/run.sh: DYLD_LIBRARY_PATH branch on Darwin.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
* fix(privacy-filter): macOS proto include + bundle ggml dylibs
Validated natively on an M4 (the build/package/load chain now works with Metal):
- CMakeLists.txt: hw_grpc_proto compiles the generated proto/grpc sources but
only linked the binary dir, so on macOS it could not find protobuf's headers
(runtime_version.h) - Homebrew puts them under /opt/homebrew, not /usr/include.
Link protobuf::libprotobuf + gRPC::grpc++ so their include dirs propagate. No-op
on Linux (apt headers are already on the default search path).
- privacy-filter-darwin.sh: bundle the ggml shared libs the binary @rpath-links
(libggml{,-base,-cpu,-blas,-metal}); the otool -L walk only catches on-disk
absolute deps and missed them. Resolved at runtime by run.sh's DYLD_LIBRARY_PATH.
M4 check: arm64 grpc-server links @rpath/libggml-metal.0.dylib; with the 15 ggml
dylibs + grpc/protobuf bundled, it loads clean (no dyld errors) and prints usage.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
fix(backends): quote $CURDIR in run.sh so backends work in paths with spaces
The backend launcher scripts derive their own directory with
CURDIR=$(dirname "$(realpath $0)") and then referenced it unquoted as
$CURDIR (e.g. [ -f $CURDIR/lib/ld.so ], export LD_LIBRARY_PATH=$CURDIR/lib:...,
exec $CURDIR/<binary> "$@"). When a backend is installed under a path that
contains a space - notably macOS's ~/Library/Application Support/... - bash
word-splits the unquoted $CURDIR, so the test builtin fails with
"binary operator expected" and exec tries to run ".../Library/Application",
yielding "No such file or directory". The backend never starts, surfacing as
a gRPC "service not ready" error and an HTTP 500. Quote $CURDIR (and the
realpath "$0") in every affected run.sh; no logic changes.
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The standalone quantize fold is empirically flat (Lever-2 precedent) with the
worst gain/plumbing ratio; no bit-exact lever remains. Dense 371.81 t/s @npl128
= 95.0% of vLLM 391, recurrence past vLLM at the LPDDR5x DRAM floor, all
byte-identical to llama f32. Only bf16 state (shelved) goes further.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Resolve pkg/xsysinfo/gpu.go: keep master's NVIDIAComputeCapability +
parseComputeCap (the #10485 multi-GPU work); re-express our IsNVIDIABlackwell
as a thin wrapper over NVIDIAComputeCapability instead of a duplicate
nvidia-smi probe.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror patch 0023 + results into the paged series. Bit-exact MoE decode/prefill
lever: ggml mul_mat_id re-quantizes each token's activation once per expert for
the broadcast up/gate proj (ne11==1); quantize_mmq_nvfp4 has no cross-thread
reduction, so the gathered blocks are byte-identical across experts. The lever
quantizes the ne12 unique tokens once and gathers the block_fp4_mmq rows into the
expert-gathered layout with a coalesced uint4 copy (144 B = 9 uint4); the GEMM is
untouched and down_proj keeps the stock path.
Measured (DGX GB10, on top of patch 0022, q36-35b-a3b-nvfp4): decode S_TG npl128
745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%, prefill T_PP -4%; dense q36-27b-nvfp4
byte-flat. nsys: quantize_mmq_nvfp4 868 -> 457 ms, gather +32 ms (net -379 ms).
Bit-exact: q36-27b 5951a5b4..., q36-35b-a3b 07db32c2... (on == off == 0022);
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default;
GGML_CUDA_MOE_QUANT_DEDUP=0 restores stock.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
kernel, carried as paged patch 0022. After the f32 verdict (vLLM carries the
gated-DeltaNet temporal state in float32 and moves the same ~805 MB/call as llama;
the gap was pure DRAM bandwidth efficiency on equal bytes - llama 73.4% vs vLLM
82.4% of the 273 GB/s GB10 peak), the lever is a latency-coverage retune that keeps
the per-column f32 reduction/FMA order byte-identical (md5-gateable). The
bf16-state plan stays shelved.
Column folding: each warp owns COLS_PER_WARP columns of the 128x128 recurrent state
instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ...
within a per-block column tile; grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The
per-lane strided row sharding and the warp_reduce butterfly are unchanged, so only
the (warp,block)->column assignment differs and the result is bit-identical;
per-warp memory-level parallelism rises ~COLS_PER_WARP-fold, covering more DRAM
latency on this bandwidth-bound kernel. Default tile is the measured GB10 winner
(NUM_WARPS=16, COLS_PER_WARP=8), env-selectable via GDN_NW / GDN_CPW.
GB10: gated_delta_net decode 4.02 -> 3.49 ms/call, 73.4% -> 84.6% of peak (above
vLLM's 82.4%; 102.6% of vLLM recurrence BW). decode S_TG t/s: dense 27b npl128
335.9 -> 373.2 (+11.1%), MoE 35b-a3b npl128 688.4 -> 745.7 (+8.3%). Greedy md5
byte-identical to the 0021 baseline on both q36-27b-nvfp4 and q36-35b-a3b-nvfp4;
test-backend-ops -o GATED_DELTA_NET 36/36 PASS. Bench/method in
OCCUPANCY_RETUNE_RESULTS.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate
design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet
recurrence is already single-pass at the f32 byte floor), the decode conv path
was the only remaining bit-exact lever.
New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated
by a non-null src[3]). On the single-token decode path it replaces the four-op
conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu
+ ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per
(channel, sequence), assembles the width-K window in registers from the K-1 cached
taps plus the current qkv_mixed token, computes the depthwise conv with the SAME
ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv
output, and writes the 1-token-shifted ring state back IN PLACE into the conv
cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018
in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and
write target (the cache view) are disjoint buffers, so it is race-free by
construction with no ids/identity logic.
- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel,
src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring;
op_params[0]=fuse_silu)
- ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel +
ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv
- ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels)
+ branch in ggml_compute_forward_ssm_conv
- delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs
conv-tap gather; fuses conv+silu+shifted write-back)
- qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path
(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep
the original chain
- tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference
test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90.
Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1
(0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5
ac163882... both BYTE-IDENTICAL.
decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session:
dense q36-27b-nvfp4 : npl 32 199.76 -> 202.99 (+1.6%)
npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391)
MoE q36-35b-a3b : npl 32 421.72 -> 432.39 (+2.5%)
npl 128 689.74 -> 713.54 (+3.5%)
Lift holds in eager too (dense npl128 333.62 -> 342.97). Step -11.9 ms/step
(dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the
decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152);
conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state
conv-cache plumbing.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesize the cross-engine bit-exactness and f32-preserving-parity study.
Resolve the contradiction between sub-agents (one f32, two bf16) by reading
every link of vLLM's state-dtype chain on live source:
- config.json text_config.mamba_ssm_dtype = "float32" (both served models)
- cache.py default mamba_ssm_cache_dtype = "auto"; bench passes no override
- vllm.py __post_init__ -> try_verify_and_update_config (config finalize)
- Qwen3_5ForConditionalGenerationConfig override copies "float32" into
mamba_ssm_cache_dtype before state-dtype resolution
- mamba_utils._mamba_state_dtype -> temporal = torch.float32 (conv = bf16)
- qwen_gdn_linear_attn allocates the temporal cache at f32
Verdicts: B1 TRUE (sub-claim 'more efficient than vLLM' refuted); B2 REFUTED
(equal f32 bytes both sides, ~10pct efficiency gap not 2x width); B3 REFUTED
(vLLM hits throughput with f32 state; a bit-exact occupancy/coalescing retune
of gated_delta_net_cuda 74->81pct peak is the f32-preserving parity lever);
B4 CONFIRMED (bit-exact-vs-vLLM impossible: A1 FP4 GEMM 8/4/16-bit operand
gap + A2 recurrence g.Sigma vs Sigma.g reassociation on different reduction
trees, plus general FP non-associativity). bf16 temporal state degrades BELOW
vLLM's f32 recurrent precision -> an over-clock, not a parity requirement.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesizes the bf16 SSM recurrent-state-cache plan into a build-agent brief:
ordered file-by-file edit list (kernel/op dtype-generic first, then cparams
default flip, gRPC/YAML, back-compat), the KL<1e-3 + PPL-delta + coherence +
long-context-drift acceptance gate that REPLACES the bit-exact md5 gate (bf16 is
intentionally non-bit-exact, equal precision to vLLM), bench targets (recurrence
3.98->2-3 ms/call, step 384->289-339 ms, 360-443 tok/s dense) + nsys check, the
default-bf16/f32-opt-out semantics + state-file back-compat, the risk register,
and the single biggest risk (silent corruption on the prefill/keep_rs_t/gather
paths) with the de-risk-first test-backend-ops step. Conv state stays f32 in v1.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Synthesis of the byte-gate workflow (ncu-byte-gate measurement +
vllm-fused-recurrence-study + llama-fused-recurrence-design + conv-fusion-design).
Verdict closes all five decision points:
(1) Byte ratio: llama re-stream ~1.0x (cap <=1.33x); recurrence at 74% GB10 peak,
MORE BW-efficient than vLLM packed_decode at 41%. The 2x DRAM gap is 100%
f32-vs-bf16 state-cache width, not extra passes.
(2) Fused single-pass recurrence: NO-BUILD - already one R + one W of f32 state,
gate ops touch tiny q/k/g/beta not the 805 MB state -> recovers ~0 bytes.
(3) Conv-state in-place fusion: GO - bit-exact, no-regret, +12-14 ms/step (~+3%),
eliminates concat_cont + cpy_scalar + folds silu.
(4) bf16 SSM state: BUILD (KL<1e-3 gated product call) - only lever on the dominant
50% recurrence term, +45-95 ms/step -> step 289-339 ms = parity-to-ahead of vLLM.
Bit-exact parity unreachable on this term (f32 bytes irreducible); bf16 = equal
precision to vLLM, which is itself bf16.
(5) Build order: conv fusion next (no-regret, bit-exact), then bf16 state (highest
value, gated). Confirming measurements stated per step.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Decisive measurement (ncu-byte-gate agent, DGX GB10). ncu HW DRAM counters were
blocked (ERR_NVGPUCTRPERM, root-only NVreg param; no passwordless sudo), so the
byte ratio was settled via CUPTI kernel timing + exact byte geometry: bytes moved
<= peak_BW x duration caps the re-stream factor.
llama gated_delta_net_cuda decode (B=128, f32 state): 3.98 ms/call, 805 MB R+W,
202 GB/s = 74% of GB10 peak. vLLM fused_recurrent_packed_decode (B=128, bf16 state):
3.62 ms/call, 402 MB R+W, 111 GB/s = 41% peak. Both single-pass (load-once/store-once,
verified in source). llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x needs
>peak BW = impossible).
VERDICT: NO-BUILD the fused single-pass recurrence - the kernel is already single-pass,
coalesced, and MORE bandwidth-efficient than vLLM's triton kernel; the gate ops touch
the tiny q/k/g/beta projections, not the 805 MB state, so fusion recovers ~0 state bytes.
The entire 2x DRAM gap vs vLLM is f32 (llama) vs bf16 (vLLM) state-cache width. BUILD
bf16 SSM state instead: halves 805->413 MB, ~45-95 ms/step, step 384 -> 289-339 ms =
parity-to-ahead of vLLM 327 (non-bit-exact vs f32 but equal to vLLM's own bf16 precision).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Final synthesis of the critical-path gap analysis: the decode step is
99.94% GPU-busy single-stream (idle 0.225ms = 0.06%), so the 14% gap to
vLLM is kernel GPU-time dominated by the bandwidth-bound gated_delta_net
recurrence (196.37ms = 51.6%), not launch bubbles. Claims A/B/C all
REFUTED as worded; the single residual is the unmeasured DRAM byte ratio
of llama's recurrence vs vLLM's fused kernel. Ranked plan: single-pass
fused GDN recurrence (gap-closer, gate on ncu byte-ratio test) + conv-state
concat fusion (no-regret +2-3%, bit-exact); gate-fold alone tops out at
~89% of vLLM; bf16 state is the only floor-mover but breaks bit-exactness.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fresh nsys --cuda-graph-trace=node capture of one steady decode step on
q36-27b-nvfp4 dense at npl128 (clean Lever-1 build-cuda-base). The decode step
is a single CUDA graph; node-level expansion shows it is 99.94% GPU-busy on a
single stream with 0.225 ms/step inter-kernel idle (0.06%, zero gaps >5us).
This refutes the "~60% idle bubbles / 57 ms = 100% bubble" hypothesis and
confirms the cudagraph-coverage source verdict. Real decode mix: gated_delta_net
196 ms = 51.6% of the step (4.08 ms/call x48; the prior 1.47 ms/call "near-vLLM"
was a prefill-contaminated eager average), FP4 GEMM+quantize 29%, gating glue
(Lever 3 target) only 3.35%, gdn_gather 0.06 ms. By roofline-decode's own sizing
test (idle < 57 ms => gap is elsewhere) the 14% gap to vLLM lives in kernel
GPU-time, dominated by the bandwidth-bound GDN recurrence, not in bubbles; Lever
3 fusion is resized to ~3% and reframed as byte-reduction, not bubble removal.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Determine whether the ggml CUDA graph covers the gated-DeltaNet serial chain
at batch=128. It does: nothing in the GDN region forces graph-disable
(check_compability lists only split-buffers and large-batch MUL_MAT_ID), and
the recurrent head is constant for a steady 128-seq batch so the inplace_ids
state_dst offset + rs_head op_param + SSM input shapes are stable across steps.
The fused op does no host-sync / capture-time cudaMalloc. The only re-warm is
the per-256-token full-attention block-table cadence (not a GDN op). The
~40% util is bandwidth-roofline (SSM state traffic 66% of step bytes), not
launch-gap idle - so no GDN graph-safe lever; the only non-covered idle is the
~0.4% between-step host cgraph rebuild.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Read-only source comparison of the gated-DeltaNet decode region. vLLM folds
conv-silu, q/k l2norm, scale, softplus+A_log gate, sigmoid-beta, the delta-rule
recurrence and the SSM state write-back into ONE Triton kernel
(fused_recurrent_gated_delta_rule_packed_decode), with the output gate fused into
a gated rms_norm, and captures the whole decode forward in a full CUDA graph
(GDNAttentionMetadata UNIFORM_BATCH, decode-only full cudagraph). llama runs the
same region as ~8 separate host-launched, serially-dependent ggml nodes. That
launch/bubble delta - not GEMM throughput - is the candidate 62%-vs-40% busy gap.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): single x86 CPU build via ggml CPU_ALL_VARIANTS
Replace the per-microarch avx/avx2/avx512/fallback multi-binary build on
x86 with a single grpc-server plus the dlopen-able libggml-cpu-*.so set
that ggml's backend registry selects at runtime by probing host CPU
features. One build instead of four, broader microarch coverage (adds
alderlake AVX-VNNI, zen4 AVX512-BF16, sapphirerapids AMX), and the
shell-side /proc/cpuinfo probing in run.sh goes away.
Build/link notes:
- CPU_ALL_VARIANTS requires GGML_BACKEND_DL + BUILD_SHARED_LIBS=ON, so
ggml/llama become shared objects. SHARED_LIBS is now a make variable
(default OFF) so the override survives the recursive sub-make into the
VARIANT build dir instead of being re-clobbered by the base flags.
- The cpu-all target also builds "--target ggml": the per-microarch
backends are runtime-dlopened, not link deps, so they only compile via
ggml's add_dependencies().
- hw_grpc_proto is pinned STATIC. Under BUILD_SHARED_LIBS=ON it would
otherwise become a DSO referencing hidden-visibility symbols in the
static libprotobuf.a, which fails to link ("hidden symbol ... is
referenced by DSO"). Keeping it static links gRPC/protobuf into the
executable while only ggml/llama stay shared, so no PIC or base-image
change is required.
- package.sh bundles the libggml-*.so set into package/lib; ggml finds
them by scanning the bundled ld.so directory (/proc/self/exe), which
run.sh launches from.
Scope: x86 only. arm64/darwin keep the single fallback build. The
ik-llama-cpp / turboquant forks and the other ggml C++ backends are
unchanged; the same recipe applies but is out of scope here.
Validated with a full docker build plus a live inference smoke test:
the model loads, ggml selects the AVX512_BF16 variant on a Zen-class
host, and tokens generate correctly.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(llama-cpp,turboquant): extend CPU_ALL_VARIANTS to arm64 + turboquant
- llama-cpp: x86 AND arm64 now use the single llama-cpp-cpu-all build
(only hipblas keeps the fallback build). ggml's arm64 variant table
(armv8.x / armv9.x, plus apple_m* on darwin) is selected at runtime.
- turboquant: same recipe via a turboquant-cpu-all target. turboquant
copies backend/cpp/llama-cpp's CMakeLists.txt + Makefile per flavor, so
the hw_grpc_proto STATIC fix and the SHARED_LIBS / EXTRA_CMAKE_ARGS
make-vars are inherited; the target just passes SHARED_LIBS=ON, the DL
flags and --target ggml through, then collects the .so set. run.sh and
package.sh updated to ship/select turboquant-cpu-all.
- Makefile lib-collection find now also matches *.dylib (for the darwin
build, which emits dylibs rather than .so).
ik-llama-cpp is intentionally left unchanged: its pinned ggml has no
CPU_ALL_VARIANTS support and its IQK kernels require AVX2, so the
per-microarch dynamic backend set does not apply.
Scope still excludes the darwin packaging wiring (separate change).
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* feat(llama-cpp,turboquant): arm64 gcc-14 for SME variants + darwin cpu-all packaging
- arm64: ggml CPU_ALL_VARIANTS builds armv9.2 SME variants whose -march=...+sme
is rejected by the Ubuntu 24.04 default gcc-13. Build the arm64 variants with
gcc-14 (installed in the compile step). The host only selects a variant it
actually supports at runtime, but every variant must still compile.
- darwin: scripts/build/llama-cpp-darwin.sh builds llama-cpp-cpu-all instead of
the fallback binary, keeps Metal (GGML_METAL stays ON; --target ggml also builds
ggml-metal). The per-microarch libggml-cpu-*.dylib are placed in the package
root next to the binary (darwin has no bundled ld.so, so ggml's executable-dir
scan looks there), while the other shared dylibs go in lib/ for DYLD_LIBRARY_PATH.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp-darwin): distribute ggml backends by suffix (.so root, .dylib lib)
ggml emits its loadable backends (per-microarch CPU variants, metal, blas) with a
.so suffix even on darwin, while the core libraries (ggml-base/ggml/llama/
llama-common/mtmd) use .dylib. Split the distribution by suffix: .so DL backends
go in the package root for ggml's executable-directory scan, .dylib core libs go
in lib/ for DYLD_LIBRARY_PATH. The previous .dylib name-pattern matched none of the
variants.
Verified on an M4: ggml loads the apple_m4 CPU variant (SME=1) and Metal, model
loads and generates correct tokens.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* fix(llama-cpp,turboquant): only CPU_ALL_VARIANTS for pure-CPU builds, GPU uses fallback
The previous gate sent every non-hipblas build through llama-cpp-cpu-all, so the
GPU image builds (cublas, sycl_f16/f32, vulkan, nvidia l4t) compiled the whole CPU
microarch variant matrix on top of their already-huge GPU backend - blowing the
build time (the sycl job was only 59% done after 2h11m) - and the arm64 l4t build
failed at `apt-get install gcc-14` (exit 100) on the Jetson base.
Gate on an empty BUILD_TYPE instead: only the pure CPU image (build-type: '' in
.github/backend-matrix.yml) builds the CPU_ALL_VARIANTS set; every GPU build gets a
single fallback CPU grpc-server, since the accelerator does the compute. This also
confines the arm64 gcc-14 step (needed for the armv9.2 SME variants) to the CPU
build, away from the GPU base images.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
* docs(llama-cpp): correct run.sh comment for arm64/darwin cpu-all
arm64 and darwin CPU images now also ship llama-cpp-cpu-all (not fallback-only);
only GPU images ship fallback-only. Fix the stale comment to match.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
both engines pinned the largest llama-specific overage to the gated-DeltaNet
output projection (ssm_out).
The GDN op left its output in SSM layout and the graph reshaped it to 3D
[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
ssm_out weight read across the 128 sequences. vLLM packs the same projection into
one M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.
The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
routes to the MMQ M=128 tensor-core GEMM. The result is then already 2D, so the
redundant post-matmul reshape_2d is dropped. Same contiguous data, just a 2D vs
3D view: bit-identical. Gated to the gated-DeltaNet path (qwen35 / qwen35moe /
qwen3next); other archs untouched.
Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
test-backend-ops MUL_MAT and MUL_MAT_ID OK.
decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
dense q36-27b: 170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
MoE q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).
nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) at a LOWER
per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
vs 2.77 ms/call for the old GEMV.
Mirrors DGX dev-tree commit df1cc97.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Cross-check the adversarial validation against the profiler ground-truth and
finalize DECODE_PARITY_EXPLORE.md. The post-SSM 254->391 decode gap is one
llama-specific defect: the gated-DeltaNet output projection (ssm_out) runs as
an FP4 GEMV (mul_mat_vec_q, 132 ms/step = 26% of decode) at batch 128 instead
of a tensor-core MMQ GEMM. Mechanism confirmed at source: final_output is 3D
[6144,1,n_seqs] so src1->ne[1]=1 trips the MMVQ dispatch (<=8), with the 128
sequences in ne[2]. vLLM packs the same projection into a cutlass M=128 GEMM.
GDN recurrence is only +11%/call (not the lever); P2a optimized the wrong FP4
kernel (the 17% MMQ, not the 26% MMVQ); CUDA graphs, host loop, and DRAM bytes
are all ruled out. Decode parity is reachable in software (not a hardware
floor): identical bytes/floor, vLLM hits 62% util vs llama 40% on the same
GB10. Highest-value next step (~free, bit-exact): collapse final_output to 2D
before ssm_out so M=128 routes to MMQ. Ranked levers + cumulative ceilings
toward 391 documented.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Fresh post-SSM nsys of llama (build-cuda-base, patch 0019) AND vLLM 0.23.0 at
npl128 decode. Reproduces the 391 reference (vLLM 394 t/s eager / 420 graphs,
graphs +6% only) and confirms llama 245 t/s. Both ~98% GPU-busy; the gap is
GPU kernel-time, not idle/host/graphs. GDN compute comparable (llama 4.03 vs
vLLM 3.62 ms/call, +11%). bytes/step: llama not higher (131 vs 85 MB memcpy;
SSM-fix 18GB/step DtoD removal confirmed in-trace). Single biggest llama-specific
overage = FP4 matmul path 236 vs 117 ms/step (+119 ms = 64% of the gap),
dominated by mul_mat_vec_q (FP4 GEMV at batch 128, 132 ms/step, 26%, one per
GDN layer). Track B optimized the wrong FP4 kernel (mul_mat_q, not the GEMV).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): add main-model cpu_moe/n_cpu_moe options
Mirror the existing draft_cpu_moe/draft_n_cpu_moe siblings for the main
model, matching upstream --cpu-moe / --n-cpu-moe (common/arg.cpp). Lets
users keep MoE expert weights on CPU to manage VRAM on large MoE models.
Closes part of #10483
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* feat(llama-cpp): forward unknown '-' options to upstream arg parser
Any options: entry starting with '-' is collected and passed verbatim to
llama.cpp's own common_params_parse (LLAMA_EXAMPLE_SERVER) at the end of
params_parse, so every upstream llama-server flag works without a new
hand-wired branch. Passthrough runs last and wins on overlap; n_parallel is
snapshotted to survive parser_init's SERVER reset, and help/usage/completion
flags are skipped to avoid exiting the backend.
Closes#10483
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* docs(llama-cpp): document cpu_moe/n_cpu_moe and option passthrough
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
* fix(llama-cpp): terminate tensor/kv override vectors after passthrough
The tensor_buft_overrides padding and the kv/draft override terminators
ran before the generic option passthrough, so a passthrough flag
(--cpu-moe, --override-tensor, --override-kv, ...) appended a real entry
after the null sentinel - tripping the model loader's
back().pattern == nullptr assertion (crash) or being silently dropped.
Move all three termination/padding blocks to the end of params_parse,
after both the named-option loop and common_params_parse have pushed
their real entries. Also widen the exit()-flag skip list so --version,
--license, --list-devices and --cache-list cannot terminate the backend.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---------
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Co-authored-by: Ettore Di Giacinto <mudler@localai.io>
Mirror of the llama-paged-dev patch 0019 engine change plus the measured
results. Step 2 of the SSM decode work: after Step 1 (in-place state write-back,
patch 0018) the largest non-GEMM decode bucket was the recurrent-state get_rows
gather (18.8 percent of decode GPU time). This removes that materialization,
mirroring ggml_ssm_scan's ids source: ggml_gated_delta_net_inplace_ids reads each
sequence's prior state directly from cache[ids[seq]] (src[5] = full cache,
src[7] = ids), so combined with Step 1's in-place write the op reads AND writes
the cache directly with no state materialization at all.
Race-free by construction: identity sequences (ids[seq] == rs_head + seq, the
whole AR decode path) read s0 in place from the destination slot; non-identity
sequences (reorder / rs_zero, e.g. multi-new-seq prefill) read from a disjoint
scratch a small gather kernel populates first. ids stays a device pointer.
Bit-identical to the get_rows path. Gated to qwen35 + qwen35moe; qwen3next,
kimi-linear, the non-fused and rollback paths are unchanged.
Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off):
q36-27b-nvfp4 dense: npl32 137.64 -> 170.68 (+24.0 percent),
npl128 186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391).
q36-35b-a3b-nvfp4 MoE: npl32 299.68 -> 366.69 (+22.4 percent),
npl128 409.30 -> 553.63 (+35.3 percent).
Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build
(dense + MoE). nsys k_get_rows_float bucket 18.8 -> 0.7 percent. The residual
decode gap to vLLM is now the FP4 GEMM (~48 percent of decode). See
SSM_DECODE_FIX_RESULTS.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Mirror of the llama-paged-dev patch 0018 engine change plus the measured
results. Per SSM layer per step decode no longer D2D-copies the full ~225 MB
recurrent state into the cache: the fused gated_delta_net op writes the final
state in place at the active sequences cache slot (new
ggml_gated_delta_net_inplace, src[6] = state_dst), mirroring vLLM
fused_recurrent_gated_delta_rule. SSM math unchanged (bit-identical greedy).
Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on):
q36-27b-nvfp4 dense: npl32 113.74 -> 136.39 (+19.9 percent),
npl128 146.23 -> 180.53 (+23.5 percent, = predicted copy-removal ceiling).
q36-35b-a3b-nvfp4 MoE: npl128 313.36 -> 372.62 (+18.9 percent).
nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share
(391 @128) 37.4 -> 46.2 percent. See SSM_DECODE_FIX_RESULTS.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Append the four-point synthesis to A2_CUDAGRAPH_DECODE.md: measured
CUDA-graph lever size (<1%, not the guessed 10-20%), the corrected
'eager' premise (default paged decode already captures), the unchanged
37-38% of vLLM at npl128, and the honest verdict that A.2 closes none of
the 2.6x gap because paged attention touches ~0.4% of decode on this
hybrid-SSM model. Residual lever is the qwen35 gated-DeltaNet SSM path
(state D2D copy + get_rows gather), orthogonal to paged attention.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Phase 1 ruled out CUDA graphs as the paged-decode lever (GPU 99.4% busy,
decode_agg flat graphs on-vs-off) and attributed the 2.6x gap to vLLM to the
per-step GPU kernel work (FP4 GEMM + attention at batch 128). Phase 2 decomposed
that kernel work directly on the Phase-1 nsys reps and corrects the attribution.
Findings (q36-27b-nvfp4 = gguf arch qwen35, a 48:16 hybrid gated-DeltaNet
linear-attention + full-attention model; DGX GB10 sm_121, fusion off):
- Graphs re-confirmed not the lever: fresh paged graphs-ON 146.03 vs OFF 144.90
t/s (+0.78%, noise); the captured rep is 99.5% busy with the same ~3267ms
memcpy (graphs capture memcpy nodes too).
- The 99.4% busy is real but ~19% of it is D2D memcpy, not compute: an
overlap-correct interval-union sweep gives kernels-only 80.2% busy, the gap
filled by 1584 D2D copies/run (~80/step, ~230MB each = the gated-DeltaNet
recurrent state). Phase 1's cuda_gpu_trace lumped this into compute.
- Decode GPU-time decomposition (% of kernel+memcpy busy): gated_delta_net 23.4%,
get_rows 21.9%, D2D state copy 18.9%, FP4 GEMV 15.5%, FP4 GEMM 10.4%,
full attention 0.4%. Grouped: SSM/gated-DeltaNet machinery ~67%, FP4 matmul
~28%, full attention (all paged-attn optimizes) ~0.4%.
Verdict: not graphs, not the host loop, not primarily FP4 GEMM, not attention.
Paged attention touches ~0.4% of decode on this model, so no paged/graph/
block-table change can move decode_agg. The lever is the ggml qwen35
gated-DeltaNet decode: kill the per-layer recurrent-state D2D copy and fuse the
get_rows gather into the recurrence (vLLM's fused_recurrent_gated_delta_rule
keeps state in place). Ceiling: -copy ~146->180; -copy-and-gather ~146->247 t/s.
No code patch (the lever is an SSM-path rewrite, orthogonal to paged attention);
patches/paged/0018 stays free.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Phase 1 measures the CUDA-graph lever on the paged decode (q36-27b-nvfp4
dense, GB10 sm_121, fusion off). The 4-cell decode_agg {stock,paged} x
{graphs on,off} is flat within ~1%: the graphs-on win is +0.13% at npl128
and +1.1% at npl32 (both within run noise). The default paged decode is not
eager: it captures and replays graphs with a 256-token reset cadence
identical to stock non-paged (block-table ne0 = GGML_PAD(n_gather,256) only
steps at 256-token boundaries); only the gather fallback grows n_gather every
step and runs pure eager. 'graphs reused=0' was a uid fast-path false negative
(llama rebuilds the cgraph each step, so the reuse log never fires while the
graph still replays via the instance path).
nsys (reliable eager trace, plus the captured trace re-run with
--cuda-graph-trace=node to defeat nsys omitting graph-internal kernels, an
artifact that otherwise reads 0.3% busy) shows the steady decode is 99.4-99.5%
GPU-busy. Idle is ~0.6% of the step: 0.37% within-step launch gaps (the only
thing graphs remove, cut to 0.11% when captured) plus a 0.24% between-step
host gap (~2ms per step). Throughput is identical on/off.
Verdict: CUDA-graphing the paged decode is not a throughput lever; the decode
is GPU-compute-bound and the 2.6x gap to vLLM (148 vs 391) is in the per-step
GPU kernel work (FP4 GEMM + attention at batch 128), not launch overhead or
the host loop.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Append section 9 (skeptical staff-CUDA-engineer review) to FP4_GEMM_SCOPE_B.md,
stress-testing the dense/MoE parity verdict against the committed grounding.
Key findings:
- Not the W4A16 wall: the npl-sweep (dense 99/56/46/41% of vLLM at npl 8/32/64/128)
shows llama's FP4-MMA kernel HITS the weight-read floor at M=8 and FALLS OFF it as
M grows, while vLLM HOLDS it. Working-path tune, dual existence proof (M=8 + vLLM
M=128), not a greenfield build. Same binding constraint as W4A16 though (hide
LPDDR5x latency at the larger tile on an occupancy-dominated part).
- The dense gap is ~82-87% GEMM, ~13-18% non-GEMM (467 ms total = 383-405 GEMM +
62-84 non-GEMM). B alone caps ~80%; track A is what tips dense over the parity line.
- Sharpest omission: vLLM's M=128 floor is reached via cutlass TMA + deep pipeline -
the technique the doc forbids on GB10. TMA != manual cp.async (lower occupancy cost);
it must be an in-scope P2 fallback, not categorically banned.
- Honest landing: dense ~80-90% (parity the optimistic tail, contingent on B+A+floor),
MoE ~55-65% (parity not reachable from B). Low-regret: even a tripped P2 kill-gate
lands B+A ~89%, doubling today's 41%.
- Sequencing fix: land A first (defines B's interface + baseline + kill-gate), then
run B's P2 against the post-A number.
Verdict: DENSE conditional GO (scope as GEMM-gap-closing, not true parity; A-first,
gate at P2, add TMA); MoE NO-GO for parity from B (do the cheap mmq_x-down win as a
1.7-1.85x, not parity).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>