Source-only analysis of what it would take to give the gated-DeltaNet decode
fusions (0018 in-place state write-back, 0019 fused recurrent-state gather,
0021 ssm_conv_update_inplace, 0028 conv-tap gather fusion) native kernels on
the non-CUDA compute backends, so the patch-series decode win extends past
CUDA-family hardware.
Key findings:
- The base GGML_OP_GATED_DELTA_NET and GGML_OP_SSM_CONV kernels ALREADY exist
upstream on Metal, Vulkan AND SYCL (the README's no-Vulkan-kernel line is
stale). The Qwen3.6 hybrids run on all three today via the non-fused path;
Layer-2 is the decode SPEEDUP, not enabling the model to run.
- Per backend the new work is only the FUSION plumbing: redirect the GDN state
write (in-place), add the ids read, write one new conv-update kernel + its
ids variant, two tiny gather kernels, plus supports_op + op-handler + (Vulkan)
pipeline/push-constant/descriptor wiring. Builders, CPU refs, model graph and
test-backend-ops cases are shared and already done.
- Bit-exactness is feasible per backend by construction (the fusions redirect
addresses, not the f32 reduction order); test-backend-ops (backendX-vs-CPU)
is the gate.
- The 0030 name allow-list should become capability-driven (make supports_op
authoritative for the discriminated src slots).
- Ranked: ops-first PR, then Metal (highest value/effort, fixed simdgroup =
simplest bit-exactness), then SYCL (near-verbatim CUDA mirror, cheapest to
author), then Vulkan (widest hardware reach but the shader-gen + variant
matrix + subgroup variance make it the capstone).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Section 4(c): real Apple M4/Metal numbers (Qwen3-8B Q4_K_M, stock vs patched) -
patchset is neutral-to-slightly-negative on Metal (the in-kernel block-table read
is CUDA-only; NVFP4/GDN-fusions inert), so prefer stock llama-cpp on Apple Silicon.
Vulkan: same picture, worse (no upstream GDN op). Section 6: cross-backend learnings
+ upstream candidates (the GDN decode-plumbing fusions are the portable, bit-exact,
CPU-mirrored win worth upstreaming).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Move ALL paged-attention content out of the stock backend/cpp/llama-cpp
backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is
pure upstream llama.cpp and the paged backend owns and applies its own vendored
patch series.
- Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/
(kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen,
its own 0001-0002 patches, dense-era design docs, tests). Zero references
repo-wide.
- Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README
+ 3 operational docs, plus the kernel/ scaffold patch and the top-level paged
README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock
backend keeps no patches/ dir; it had no non-paged base patches.
- Purify the stock backend: remove the LLAMA_PAGED make variable, the
patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh;
remove the paged-series handling from prepare.sh. The stock llama.cpp target
now only clones the pin and applies its own (currently empty) base patches/
series. The runtime paged option hooks in the shared grpc-server.cpp are
untouched (inert without the patches).
- The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto
each freshly cloned tree via strict git apply (apply-paged-patches), after the
copied stock infra clones the pin and applies base patches.
- Repoint every reference to the old patches/paged path: the upstream canary
workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs,
backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and
the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on
build-toggle from comments.
Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to
a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed
canary apply script resolves and applies the series end to end.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Advance the paged-attention backend's owned llama.cpp pin by 23 upstream
commits. The shipped source-only patch series (0001-0030, 28 patches) applies
strict-clean (git apply, exit 0) on a fresh c299a92c checkout with no re-export
needed, and the bit-exact gate is GREEN on every path on GB10 (CUDA sm_121):
- md5 greedy decode (-ngl 99 -fa on -n 48 --temp 0 --seed 1): dense
non-paged/paged 5951a5b4, MoE non-paged 07db32c2, MoE paged 8cb0ce23; all
match the established baselines.
- test-backend-ops CUDA0: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16,
SSM_CONV_UPDATE_IDS 16/16, GATED_DELTA_NET 84/84, MUL_MAT 1146/1146,
MUL_MAT_ID 806/806; all OK.
The 23-commit upstream jump did not change our decode output. The .patch files
are kept byte-identical (they already apply strict-clean at the new pin); only
the pin, the PIN_SYNC evidence doc, and the canary/gallery doc references change.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The paged backend (backend/cpp/llama-cpp-localai-paged) pins its own verified
llama.cpp tip and is excluded from the nightly auto-bumper so a naive bump can
never silently break the shipped build. That exclusion also removed the early
warning of upstream drift. This restores the signal without touching the pin.
Add .github/workflows/llama-cpp-paged-canary.yml (weekly + workflow_dispatch):
- apply-check job (ubuntu-latest, toolchain-free): resolve the latest
ggml-org/llama.cpp master tip, shallow-checkout it, and apply the full paged
series 0001-0030 in order with the build's own git-apply method via the new
shared helper .github/scripts/paged-canary-apply.sh. Red on any apply break.
- compile job (needs apply-check): on the exact tip it validated, build the
paged backend (cublas) inside the same base-grpc-cuda-12 toolchain and the
same `make grpc-server` target the shipped build uses, so a red means upstream
drift, not toolchain noise. nvcc compiles the kernels with no GPU present.
Red here = run a PIN_SYNC (rebase + bit-exact gate + re-export), then bump the
paged Makefile pin. The canary is signal-only: it opens no PR and never moves
the pin, so the shipped build and the dep-bump PRs stay green regardless. It is
fully separate from bump_deps.
The lone pre-existing quirk in the series (patch 0019 carries a stray modify
hunk against the dev-only doc SSM_DECODE_FIX_RESULTS.md, absent from any clean
upstream checkout; git apply is atomic so it rejects the whole patch and
cascades to 0021/0022/0026/0028) is handled path-scoped: the helper excludes
only that dev-doc and still applies 0019's real code hunks atomically, mirroring
prepare.sh's tolerance, so the quirk never false-positives the canary but a
genuine code break in 0019 still turns it red.
Point the existing pin comments in backend/cpp/llama-cpp-localai-paged/Makefile
and .github/workflows/bump_deps.yaml at this canary as the drift signal, and
document it in the PIN_SYNC doc: canary red -> do a pin-sync.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
The llama-cpp-localai-paged backend reused backend/cpp/llama-cpp's LLAMA_VERSION,
which .github/workflows/bump_deps.yaml auto-bumps nightly to the latest
ggml-org/llama.cpp master tip. The stock backend is patch-free so that bump is
safe, but the paged backend applies a vendored patch series
(backend/cpp/llama-cpp/patches/paged/) hand-verified bit-exact against ONE
specific tip. A naive bump moves the tip out from under the patches and breaks
'git apply' at build time - a dep-bump PR would go red (or, worse, the break
surfaces later in a release build).
Mirror the turboquant precedent: give the paged wrapper its OWN LLAMA_VERSION
pin (the verified 9d5d882d) and force it into every copied build via
LLAMA_VERSION=$(LLAMA_VERSION), so the nightly stock bump no longer drags the
paged build to an unverified tip. Unlike turboquant (whose fork branch carries
the patches and is safe to auto-bump), the paged series is vendored, so it gets
NO bump_deps.yaml entry: it is advanced only by the manual PIN_SYNC process.
Add cross-referencing comments in both Makefiles and bump_deps.yaml.
Also add PIN_BUMP_APPLY_CHECK.md: an apply-feasibility report for the latest tip
(c299a92c, 23 commits ahead). The full series applies CLEAN under 'git apply'
with only benign line offsets and zero conflicts; the lone failure (0019) is a
pre-existing stray dev-doc hunk, identical on the current pin, not a bump
regression.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
New backend = stock llama-cpp grpc-server + the paged patchset (forces LLAMA_PAGED=on),
shipped as its own meta-backend (mirrors turboquant, simpler: no fork pin, no
grpc-server patching - the paged runtime hooks already exist in grpc-server.cpp).
Stock llama-cpp untouched (LLAMA_PAGED?=on retained; the de-risk flip deferred for
sign-off). Gallery: qwen3.6-27b-nvfp4 (dense) + qwen3.6-35b-a3b-nvfp4 (MoE) with the
benchmark run config (paged_kv, max_batch_tokens, parallel, flash_attention, f16),
mudler/ GGUF uris (sha256 TODO until publish). Importer dropdown entry + tests.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>