Record the shared-A/Ai GB10 cost model, the GO decision for one default-off f32 Ai prototype, and the Phase 13 implementation plan.
Assisted-by: Codex:gpt-5
Record the Phase 11 default-off QS-early GDN experiment, its canonical md5 gates, the same-session GB10 A/B regression, and the rejected diff artifact.
Assisted-by: Codex:gpt-5
Record the default-off C32 slab experiment, its md5 gates, the dense tail-row fix, and the performance regression that rejects the source patch.
Assisted-by: Codex:gpt-5
Record the Phase 10 current-M5 prefill baseline and the source inspection finding that C32 M5 needs a real U-staging implementation rather than a launcher-only shortcut.
Assisted-by: Codex:gpt-5
Record the Phase 9 MTP smoke gate, mirror the fork patch that disables backend sampling for MTP drafts, and scope the follow-up C32 slab GDN prefill phase.
Assisted-by: Codex:gpt-5
Mark the Phase 6 serving classifier complete, preserve the old parity final as historical, and scope Phase 7 source candidates with explicit md5 and op gates.
Assisted-by: Codex:gpt-5
Record both-engine serving nsys buckets, rejected sampler short-circuit, and rejected GDN/MMQ env grids for the GB10 parity work.
Assisted-by: Codex:gpt-5
Record the Phase 5 Wq shared-memory padding experiment, its gates, sub-threshold benchmark gain, and the decision to ship no 0051 patch.
Assisted-by: Codex:gpt-5
Mirror fork commit d9b9be0be as patch 0050 and record the Phase 4 W4A16 shared-memory padding gates, benchmarks, and mirror verification.
Assisted-by: Codex:gpt-5
Record the Phase 3 scale-broadcast experiment, its md5 and MUL_MAT_ID gates, the prefill regression, and the decision to ship no 0050 patch.
Assisted-by: Codex:gpt-5
Mirror fork commit 7dfa0e175 as patch 0049 and record the Phase 2 GB10 W4A16 shape sweep, md5 gates, MUL_MAT_ID checks, and mirror verification.
Assisted-by: Codex:gpt-5
Mirror the fork-first W4A16 packed tile metadata commit into the LocalAI paged patch series, record the Phase 1 benchmark result, and keep the implementation plan checked off.
Assisted-by: Codex:gpt-5
Record the clean forced W4A16 baseline, default comparison, selected metadata target, and completed plan checkpoint for the GB10 parity reopen.
Assisted-by: Codex:gpt-5
Record the clean DGX build retry, binary provenance, canonical greedy md5 gates, and completed plan steps for the GB10 parity reopen.
Assisted-by: Codex:gpt-5
Add the read-only DGX artifact review for the Phase 0 parity reopen, including supported paged measurements and missing vLLM difference-method evidence.
Assisted-by: Codex:gpt-5
Add the clean llama.cpp fork state, base merge point, patch count, and tree-match result for the Phase 0 parity reopen workflow.
Assisted-by: Codex:gpt-5
Add a phased follow-up spec for challenging the GB10 vLLM-parity closure, including provenance gates, W4A16/GDN/MoE workstreams, and subagent ownership boundaries.
Assisted-by: Codex:gpt-5
Reconcile the paged backend pin prose with the current Makefile pin, mark the 0044 patch tracking note as resolved, and add DGX Docker worker idleness to the benchmark preflight.
Assisted-by: Codex:gpt-5
The fork mudler/llama.cpp branch localai-paged is the canonical source of
truth for all paged-backend kernel/patch work. Always update it FIRST: commit
the change on the fork branch and push it, then regenerate the LocalAI patch
series (backend/cpp/llama-cpp-localai-paged/patches/paged/) from the fork via
git format-patch so the series is a 1:1 drift-free mirror of the branch. Never
edit the LocalAI patch files directly, and never add a patch with no
corresponding fork-branch commit. The series is a derivative; the fork is the
source. The fork branch is also where the build and the per-path bit-exact md5
gate actually run, so it is the only place a change is truly validated.
Codified in two places:
- .agents/llama-cpp-localai-paged-backend.md: new "Fork-first workflow
(MANDATORY)" section at the top of the patch/pin-sync material, plus the
"Encapsulating your work" bullet now points at it.
- backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md: strengthened the
hard-gate (section 2.5) into "Fork-first is MANDATORY", and corrected a stale
numbering example (fork 51168c5ee "patch 0044" maps to worktree 0044, not the
f32-only M5 which is worktree 0047).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The fork mudler/llama.cpp branch localai-paged is the canonical source of
truth for the paged-backend patch series. This file is the git format-patch
of fork commit 51168c5ee ("feat(paged): fused gated RMSNorm + SiLU gate-mul
CUDA op (patch 0044)"), verified byte-identical to that commit's format-patch
output. The full on-disk series applies clean in numeric order on the pinned
base and the resulting tree is byte-identical to the fork commit tree (tree
hash a73d759350277532a14e853e1fe78f08bbb74ce8), so the LocalAI series is a
drift-free 1:1 mirror of the fork branch.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ground-check follow-up to 2431090ff. Two factual corrections:
- Section 7 worktree line had the ahead/behind counts swapped ("25 ahead,
197 behind"); the branch is actually ~199 ahead / 25 behind origin/master.
- Discrepancy item 5 flagged only the MoE CDEF PAGED_GATE_MD5 (0921716...);
the dense run is symmetric (COMBINED_DEFINITIVE.txt records ecfe924d... for
dense, which likewise differs from the canonical dense gate 5951a5b4). Both
CDEF values come from combined_definitive.sh's own gate command, not the
canonical bit-exact gate in section 3.3, so neither is sanctioned and both
must be KL-validated.
Everything else in the handoff verified accurate: fork branch localai-paged
HEAD 51168c5ee (patch 0044) on dgx:~/llama-paged-fork, dev-tree HEAD a7d439e,
all md5/KL numbers, the 86%/1078/924 decode record, bench env, and all
referenced file/artifact paths.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Adds docs/PARITY_HANDOFF.md: the operational how-to for an agent with zero
context picking up the GB10 vLLM-parity work. Complements VLLM_PARITY_FINAL.md
(the why/record) with TL;DR state, the hard gates (per-path bit-exact md5,
KL-gate, no LLAMA_MAX_BATCH_TOKENS, fork-is-canonical), a copy-pasteable
operational quickstart (ssh/lock/build/bench + the --cuda-graph-trace=node
decode-profiling rule that caused 4 wrong analyses), the complete tested-and-
rejected lever map, methodology lessons, the three forward directions, and a
key file/artifact index with the open discrepancies to reconcile.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The decode-serving section characterized the high-N gap as "BW-floored, vLLM
pays equally / 56-68%". A clean uncontended graph-node-traced profile
(dgx ~/highN_prof2 + ~/highN_vllm, 2026-06-30) shows that was a profiling
artifact: decode runs as a replayed CUDA graph, and nsys without
--cuda-graph-trace=node collapses each replay into one opaque launch, so every
prior decode decomposition (159 us/tok, "host-bound", "5.4x more efficient")
was wrong. Corrected via --cuda-graph-trace=node + the ntg=64-minus-ntg=16
difference method.
Real picture (paged npl=256): 99% GPU-busy (idle 1.4%), NOT host-bound. GDN
recurrent scan 553 us/tok (51%, linear in batch, dominant), NVFP4 expert GEMM
254 (23%), bf16 proj 73 (7%), elementwise 57, SSM conv 31. Gap reconciled:
vLLM-server 1177 -> vLLM true GPU-steady 1078 (chunked-prefill overlap inflates
its window ~8pt) -> llama GPU-steady 924 (= 86% of 1078) -> llama-server 718
(61%, the ~17pt S3-recoverable serving graph-reuse overhead). So vs vLLM's true
GPU-steady decode we are ~86%, not 56%. GDN is a shared BW floor where paged
leads (83% vs 79% of 273 GB/s peak; both 1.17-1.18x for 2x batch).
The residual ~14pt is vLLM's mature fused kernels (Marlin MoE +11ms, Triton
elementwise +10ms); both ggml fusions rejected: act-quant-into-MMQ -79.4%
(ggml MMQ re-quantizes y per row-tile x stream-k split, no single-pass tiling),
norm+quant+silu infeasible via ggml_cuda_can_fuse. Added rejected levers:
Q8_0/FP8 projection (regime error, closes <=6%; vLLM FP8-proj confirmed from
hf_quant_config.json MIXED_PRECISION), the two decode fusions; refined BV-block
GDN occupancy to -1.04% (wave-hidden).
Revised verdict: PREFILL genuinely capped (36-43%, not graph-replayed so real);
DECODE-SERVING near-parity ~86% of vLLM true GPU-steady (headline 56% was a
measurement/operating-point artifact). GB10-vs-datacenter framing kept.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add docs/VLLM_PARITY_FINAL.md: the standing, never-re-litigate record of the
exhaustive GB10 (sm_121) vLLM-parity investigation for the Qwen3.6 NVFP4 hybrid
models. Captures the definitive same-session both-engine benchmark (prefill
S_PP, decode/serving per-seq + aggregate, TTFT, PEAK_GB, paged-as-%-of-vLLM for
both the MoE 35B-A3B and dense 27B models), the complete lever map (every
prefill-GEMM, prefill-GDN, decode and serving/engine attempt with its verdict
and key number), the structural floors (LPDDR5x bandwidth, FP4-MMQ optimality,
GDN O(C^2) intra-chunk + serial recurrence, vLLM's HBM-tuned FLA/Marlin), the
shipped bit-exact wins, and the parity verdict: parity is a hardware ceiling on
GB10, not missing optimizations; the path to parity is datacenter Blackwell.
Every number cites its artifact (dgx:~/bench/COMBINED_DEFINITIVE.txt, the
marlin_gate / gdn_p1_ab A/Bs, PREFILL_GEMM_RESULTS, VLLM_PARITY_LEVER_MAP,
DECODE_SERVING_SCOPE, the patch headers); figures not pinned to an artifact are
marked estimated. Add a section-9 summary + link in the backend README.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The 0044/0045 patches were exported from the old bf16/hybrid dev tree and no
longer apply on the f32-only series (0026 ssm_bf16_tau is dropped), so the
build broke at `git apply`. Re-sync the vendored series to the now
feature-complete fork branch mudler/llama.cpp:localai-paged, which is the
canonical source (pin 0ed235ea + the paged patch commits in order).
- git rm the dev-tree-based 0044 (GDN M5, bf16-machinery base) and 0045
(Marlin W4A16 offline-repack, not part of the fork branch).
- Add the fork branch's newest commit (2c32ab8b7, "GDN M5 tensor-core
chunked-scan prefill, f32-only re-port") as 0047, generated with a single
git format-patch off that branch. It sequences after 0046 (its parent on
the branch) and recovers the prefill win 0044 encoded (+3.5% S_PP @npp512,
+17.7% @npp2048), bit-exact per-path (test-backend-ops GATED_DELTA_NET
46/46 default and force-M5; greedy md5 default-on == M5-forced == canonical).
- Track patch 0046 (dense-prefill geometry gate), which was on disk but never
committed, so the series is complete in git.
- README: patch-table header 0001-0046 -> 0001-0047, replace the 0044 row with
the f32-only 0047 row, fix the dangling 0044 prose references, note the
bf16 M6/M7/M8 variants are not part of this f32-only series, and add a
maintenance bullet that the series is now generated from the fork branch so
there is no more patch-export drift.
Verified: on a pristine llama.cpp at pin 0ed235ea the full series 0001-0043,
0046, 0047 applies clean in sorted order with the Makefile's exact
`git apply --verbose` method (37/37 OK), and the resulting tree is
byte-identical to the fork branch tip 2c32ab8b7.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
nsys cross-engine decomposition: the MoE prefill 64% gap vs vLLM is engine plumbing, not the kernel (GPU 97% busy, 443 vs 197 us/tok). Three buckets: per-expert W4A4 M-fragmentation (58%), GDN scan (24%), f32<->bf16 casts (15%). Offline-repack (0045) and verbatim vLLM-marlin port both trail FP4-MMQ via wrapper overhead, kept default-off as recorded negatives.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Land the tensor-core forms of the chunked gated-DeltaNet prefill scan (0031)
as a single GDN_TC-selected build and ship the M5 variant (full TC form-T
solve + state-update mma) default-ON when LLAMA_KV_PAGED is set.
The dispatch defaults GDN_TC=5 and GDN_CHUNK_MIN=64 under paged KV (both
env-overridable; OFF/INT_MAX when not paged, so stock/non-paged stays
regression-free). GDN_CHUNK_MIN is the per-call engage threshold and stays > 1
so decode (1 tok/call) keeps the sequential recurrence; 64 was tuned from a
{1,32,64,128,256} sweep (32/64/128 all win on prefill, 256 barely fires because
the MoE-prefill per-call count is < 256, 1 collapses decode S_TG ~25%).
Measured GB10, q36-35b-a3b-nvfp4, LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1,
llama-batched-bench -ngl 99 -fa on -ntg 4 -npl 32:
-npp 512 S_PP 2208.96 -> 2286.5 t/s (+3.5%, mean of 3 interleaved A/B)
-npp 2048 S_PP 2021.5 -> 2379.8 t/s (+17.7%)
Decode S_TG unchanged (~399 vs ~397 t/s, within noise).
Bit-exactness (per-path greedy md5, n=48 --temp 0 --seed 1, paged): default-on
== M5-forced == canonical on the gate prompt - MoE 8cb0ce23, dense 5951a5b4.
test-backend-ops GATED_DELTA_NET 94/94 vs CPU with M5 forced (incl. multi-chunk
up to n_tokens=256). On a long MoE prompt the default (M5 fires at >=64 tokens)
and the sequential path agree word-for-word until one benign greedy token-flip;
dense is byte-identical. The chunked scan is a NEW per-path result (different FP
reduction order), NMSE-validated benign.
CUDA-only, gencode arch=compute_121a,code=sm_121a (GB10 / sm_121a). README
sections 3 (0044 row, 0031 superseded note) and 5 (dev-notes verdict) updated.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Lever map records the full prefill/decode gap decomposition vs vLLM, the ranked levers, and the rejected dead ends. GDN build plan is the per-product mma mapping + A-inverse + occupancy design.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
0042 fuses the pre-norm residual add into RMSNorm (+0.5% prefill, bit-exact). 0043 makes the full-step MoE decode CUDA graph default-on (+2-4% decode, bit-exact; removes ~18x per-step host kernel re-issue, A/B-confirmed). 0034 (native FP4-MMA W4A4) and 0035 (Marlin-style W4A16 grouped MoE GEMM) are correct + bit-exact but regress vs the int8 FP4-MMQ in-backend on GB10 (bf16 MMA is ~half the int8 rate); shipped default-off as validated mechanisms and recorded negatives per the parity methodology.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Patch 0041 (LLAMA_PAGED_DECODE_STABLE) was made default-on-when-paged, but a
measured end-to-end A/B proved that is a serving mistake. S3 defers prefill
admission on the period-8 cadence, which delays prompt admission: 2.5x worse
TTFT (60s vs 24s at N=256) and 20-29% lower end-to-end throughput, with no
end-to-end win at any concurrency. Its apparent decode_agg gain was a metric
artifact (faster per-step decode bought by starving prefill).
Flip the s3_enabled default so an unset LLAMA_PAGED_DECODE_STABLE means OFF; the
mechanism stays available as an explicit opt-in (LLAMA_PAGED_DECODE_STABLE=1) for
decode-dominated, low-arrival traffic where TTFT is not a concern. The default now
prefers prompt prefill admission for good TTFT. S1 (patch 0040) keeps shipping
default-on; only S3's default changes.
Re-exports patch 0041 (change folded into its source commit) and updates the
README 0041 row plus the decode-serving narrative to record the A/B finding.
Greedy md5 gate unchanged (single-sequence llama-completion path, not
update_slots): paged MoE 8cb0ce23, dense 5951a5b4.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push
serving graph reuse from ~72% toward ~100%) was implemented in an isolated
worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10.
Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving
throughput at every concurrency and does not close the vLLM gap.
Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step
(n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so
n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a
release()-side guard keeps a finished slot warm under padding. Each dummy is its
own sequence (private recurrent state, per-stream paged attention, logits
discarded), so it cannot perturb a real stream.
Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The
literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA
decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF
diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed:
ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly
(MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing.
Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16
/ PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered
aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD
(reuse 38%) 558.0.
Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound
- baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys
~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding
adds dummy-row compute proportional to pad_width - real_load, catastrophic at low
load; (3) in continuous serving padding cannot hold a constant width (perpetual
prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink
padding prevents is itself a throughput win in a compute-bound regime. The
residual burst gap is GPU-compute, which a host-side reuse lever cannot close.
Patch series unchanged: this rejected lever is NOT added to patches/paged/.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
FIX A (patch 0031 compose break): the chunked GDN prefill patch carried
'#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but
those were introduced by the dropped bf16-tau patch 0026, so on the
bf16-tau-free 0001-0030 base only '#include <cstdlib>' is present and 'git
apply' failed. The same 0026 drop also shifted 0031's later hunks off their
context (the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template
params, and the GDN_LAUNCH_ARGS list). Regenerated 0031 against a fresh
pin(0ed235ea) + 0001-0030 tree: the chunked kernel now SELF-PROVIDES the
cuda_bf16.h / type_traits includes (adds them, plus the climits it needs for
INT_MAX) and the dispatch guard is the 2-param 'if constexpr (!KDA &&
!keep_rs_t)' form. Behaviour is unchanged: 0031 stays opt-in, default OFF
(GDN_CHUNK_MIN), a recorded negative. The full 0001-0042 series now applies
clean on 0ed235ea ('git apply --check' green for every patch).
FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted OFF.
Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set), still
overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side change in
update_slots(); re-exported from the dev tree, README 0041 row updated to match.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add the two decode-serving graph-reuse levers (validated on GB10) that close the
host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in real
continuous serving while tying it in static batched-bench).
- 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode
llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml
graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape
can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5
byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%.
- 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode
steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse.
Default-off policy on top of 0016; bit-exact (per-stream independent).
S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph
reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq
median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%);
S3 is the multiplier. S2 (double-buffer set_inputs) dropped: Phase-0 put
set_inputs at ~0.05 ms/step, so it has nothing to recover. README patch table +
DECODE_SERVING_SCOPE.md updated with results and the padded/fixed-slot follow-up.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight
GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet)
tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the
validated, bit-exact-gated mechanism and records the honest GB10 result: it is a
regression, so it ships default-off (== stock), mirroring the patch-0017
default-off discipline.
Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense
M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that
dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays
FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda
gains the NVFP4 case.
Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID
806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4
large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for
short prefill (5951a5b4, decode untouched) and for a >threshold prefill that
exercises the bf16 path (5f3967df - no greedy argmax flips).
Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99
-> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42
(-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here:
FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step
weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate,
only partially amortizing as M grows. Default-off keeps stock S_PP (966.98).
Phase 2 (MoE grouped large-M) not implemented: it inherits the same
bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS
would regress for the same reason; a real prefill GEMM win needs option (b), a
native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Scopes the follow-up recorded by patch 0031 + README section 5: replace the
serial per-thread reductions of the chunked gated-DeltaNet prefill scan with
mma.sync tensor-core matmuls and lift the 1-block/SM occupancy ceiling, the
path that would beat the tuned sequential scan and close the GDN prefill
bucket toward vLLM's ~2.5x-cheaper chunked scan.
Confirmed (not assumed) the GB10/sm_121a tensor-core reality: consumer
Blackwell (SM12x) has NO wgmma (Hopper-only) and NO tcgen05/TMEM (sm_100a
data-center only); the usable path is the extended mma.sync family. So the
kernel is a warp-synchronous mma.sync + cp.async design (reusing ggml's
mma.cuh tiles), not a wgmma/TMA/tcgen05 design - patch 0031's 'mma/wgmma'
shorthand reads as mma only on this part.
Design: register-resident state frees the 64KB that forced C=16, admitting
C=64 under the 99KB shared opt-in; tf32 inputs / f32 accumulate with a 3xtf32
precision ladder; decays/gamma/beta stay f32 outside the mma to preserve the
bounded de-gating; A-inverse via blocked forward substitution (FLA UT
transform) with mma off-diagonal coupling. Mechanism: chunking cuts state-BW
~Cx, mma absorbs the O(C^2) intra-chunk flops the serial 0031 could not.
Honest: multi-week, high risk, no vendor kernel to route to on sm_121; gains
beat the sequential scan and close most of the bucket but not full sm_100-class
parity. KL-gate binding (NMSE likely fails at reduced precision). Phased:
re-profile -> two-product PoC -> full intra-chunk + C=64 + reg-state ->
occupancy/cp.async; opt-in default-OFF until A/B-proven.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Add DECODE_SERVING_SCOPE.md: the decode KERNEL is at parity in static
batched-bench (~6.1 tok/s/seq ~ vLLM ~5.9 at npl128) but continuous serving
through llama-server update_slots() drops to ~3.7 (-39%) while vLLM sustains
~5.9. Scope shows the gap is the scheduler/host loop, not the kernel.
Root-cause hypothesis from source: continuous batching's batch-shape + seq-set
churn breaks BOTH graph-reuse layers every step - llama-context can_reuse/
allow_reuse (n_tokens + seq-set must match) and the CUDA ggml_cuda_graph
update_required memcmp (ne/nb/data ptrs) - so the GPU idles while the host
rebuilds + re-captures the graph and runs un-graphed set_inputs. vLLM avoids
this with padded/bucketed decode shapes + piecewise CUDA graphs. Documents that
the shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill
freezing + burst collapse, NOT decode-step graph reuse, which is why the serving
gap survives them; notes the README s.5 'lever 2 graph coverage FLAT' verdict was
static-regime and is reopened here for serving only.
Ranks host-side, bit-exact-safe levers: S1 bucketed/padded decode-step shape for
graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable
scheduling (extend 0016). Specifies a Phase-0 profile to confirm host-bound
before any build, reusing the in-tree [L5INSTR] hostproc/set_inputs/
get_block_table timers, the 'graphs reused' perf counter, LLAMA_GRAPH_REUSE_DISABLE
and nsys GPU-busy%, with vLLM ground-truthed at the same concurrency. No kernel
code; no GPU run in this pass.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Adds patch 0031 to the paged llama.cpp series: an FLA-style chunked
parallel-scan prefill kernel for gated DeltaNet (the upstream
gated_delta_net.cu "Add chunked kernel for even faster pre-fill" TODO).
Scope: non-KDA scalar gate, f32 state, final-state-only, homogeneous.
Bit-exact-benign (NEW per-path): test-backend-ops GATED_DELTA_NET 91/91 within
the 1e-7 NMSE gate vs the CPU reference (patch adds 8 S_v=128 prefill cases:
exact-multiple / tail / multi-seq / GQA / permuted); numpy prototype confirms
f32 chunked-vs-sequential NMSE ~1e-13.
OPT-IN, default OFF: GB10's 99KB dynamic-smem opt-in forces C=16 (the 128x128
f32 state is 64KB of the all-shared layout), pinning the kernel to 1 block/SM
with serial dk-reductions. Measured ~761 t/s chunked vs ~971 t/s sequential
(~22%% slower) on q36-27b-nvfp4 prefill, so it defaults OFF (enable with
GDN_CHUNK_MIN=<n>); the backend default is regression-free. Beating the
84.7%-of-peak sequential scan needs tensor-core matmuls / register-resident
state with larger chunks (recorded in README section 5).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Design + plan for the #1 prefill lever: NVFP4 weight GEMM at large M, where
MMQ (decode/M<=128-tuned, 1 CTA/SM, 128-col tile cap) is ~3.4x slower than
vLLM's marlin/cutlass large-M path (~51% of the prefill gap).
Recommends (a) dequant->bf16 cuBLAS routed by an M-threshold (dense first,
MoE grouped-cuBLAS second); rejects (b) a from-scratch Marlin/FP4 kernel as a
multi-week project. Key enabling finding: NVFP4->bf16 dequant kernels already
exist, and NVFP4 is currently force-excluded from the tensor-core cuBLAS path
(falls to f32 Sgemm) - relaxing that one guard is the pivot. Honest: bf16-cuBLAS
banks ~60-75% of the GEMM gap, not full 68us/tok parity (bf16 TC peak ~half FP4).
Design only - no kernel, no GPU run.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:claude-opus-4-8 [Claude Code]
The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is
removed from the llama-cpp-localai-paged patch series. Clean re-measurement after
the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache)
landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16
(tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs
780.0 t/s. The mode engages but adds zero speed because it is subsumed by the
fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau
was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and
extra CUDA template-instantiation compile cost with no offsetting benefit.
Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only
mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn,
which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025,
0028-0030) applies clean with git apply --check against the pin
0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob
(patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are
already absent).
Removed:
- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared
grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no
longer reads)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows
(README + final_benchmark.csv), the ssm_bf16_tau option text in backend
index.yaml, the gallery NOTE block, and the docs/features/backends.md mention.
The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat)
in the backend README section 5, the paged-backend agent guide, and the
vLLM-parity methodology, so it is not re-tried.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>