chore(paged): keep patches/ patch-only; README to backend root, docs to docs/

The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv, dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv. Restore the invariant that patches/ holds only the .patch series.

Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md, final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)

Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)

Repoint every reference to the moved files: README internal links (docs/ + the .github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md, .github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml, the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml, docs/content/features/backends.md, gallery/index.yaml.

The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged) is unchanged and still resolves to the 28 patches.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-27 18:06:58 -04:00 · 2026-06-27 13:20:05 +00:00
parent db14006fcd
commit 08b754f910
21 changed files with 41 additions and 235 deletions
--- a/backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md
@@ -0,0 +1,514 @@
+# Plan: ship the paged llama.cpp as its OWN backend + NVFP4 Qwen3.6 gallery items
+
+Scoping deliverable only. NOTHING is changed by this document. It is grounded in the
+actual repo structure (read 2026-06-26 in worktree feat+paged-attention), not assumptions.
+
+SHIPPED REALITY (update 2026-06-27): the backend ships CUDA-only. The matrix rows and
+the index.yaml meta-backend keep ONLY the CUDA/cublas variants (cuda-12, cuda-13, and
+the nvidia-l4t arm64 cuda-12/cuda-13 Jetson rows). The cpu / vulkan / sycl / hipblas /
+metal-darwin variants discussed below as optional/phase-2 were NOT shipped (and the
+darwin row was removed): off-CUDA the patchset's wins gate off, so it is neutral-to-
+negative there and non-CUDA users should use the stock llama-cpp backend (README 4c).
+
+================================================================================
+0. GROUND TRUTH (what the repo actually does today)
+================================================================================
+
+The paged patchset is ALREADY integrated into the stock llama-cpp backend in this
+worktree. Two mechanisms, both already present:
+
+  (a) BUILD: backend/cpp/llama-cpp/Makefile has `LLAMA_PAGED?=on`. The `llama.cpp:`
+      target git-applies patches/0*.patch (base series) then, when LLAMA_PAGED != off,
+      patches/paged/0*.patch (the 0018-0023 paged series + the earlier 0001-0017).
+      prepare.sh has a fallback `patch`-based apply guarded by a sentinel
+      (llama.cpp/src/paged-kv-manager.cpp). So a stock `make backends/llama-cpp` TODAY
+      already ships the paged engine compiled in.
+
+  (b) RUNTIME GATING: backend/cpp/llama-cpp/grpc-server.cpp ALREADY carries the option
+      hooks (lines ~752-842). They only call setenv() before context init:
+        - option `kv_paged` / `paged_kv` / `paged_attention`  -> setenv LLAMA_KV_PAGED=1
+        - option `kv_paged_debug` / `paged_kv_debug`          -> setenv LLAMA_KV_PAGED_DEBUG=1
+        - option `max_prefill_tokens` / `mpt` / `prefill_budget` -> setenv LLAMA_PREFILL_BUDGET
+        - option `max_batch_tokens` / `mbt`                   -> setenv LLAMA_MAX_BATCH_TOKENS
+        - option `prefill_cap`                                -> setenv LLAMA_PREFILL_CAP
+      Against UNPATCHED llama.cpp these setenv() calls are inert (nothing reads the env),
+      so grpc-server.cpp is byte-safe to share between a clean build and a paged build.
+      The paged engine itself lives entirely inside the patched llama.cpp lib
+      (paged-kv-manager.cpp etc.), NOT in grpc-server.cpp.
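The option-to-environment mapping listed under (b) can be read as a small lookup table. The JavaScript below is only an illustrative model of that table, not the C++ in grpc-server.cpp. The helper name `optionsToEnv` is hypothetical, and so is the rule that a boolean option sets its variable only when the value is empty or `true`; the plan does not say how other values are handled.

```javascript
// Illustrative model of the option -> env mapping listed in (b).
// NOT the grpc-server.cpp implementation; optionsToEnv is a hypothetical helper.
const FLAG_OPTIONS = new Map([
  ["kv_paged", "LLAMA_KV_PAGED"],
  ["paged_kv", "LLAMA_KV_PAGED"],
  ["paged_attention", "LLAMA_KV_PAGED"],
  ["kv_paged_debug", "LLAMA_KV_PAGED_DEBUG"],
  ["paged_kv_debug", "LLAMA_KV_PAGED_DEBUG"],
]);
const VALUE_OPTIONS = new Map([
  ["max_prefill_tokens", "LLAMA_PREFILL_BUDGET"],
  ["mpt", "LLAMA_PREFILL_BUDGET"],
  ["prefill_budget", "LLAMA_PREFILL_BUDGET"],
  ["max_batch_tokens", "LLAMA_MAX_BATCH_TOKENS"],
  ["mbt", "LLAMA_MAX_BATCH_TOKENS"],
  ["prefill_cap", "LLAMA_PREFILL_CAP"],
]);

// Takes "key:value" option strings and returns the env vars they would set.
function optionsToEnv(options) {
  const env = {};
  for (const opt of options) {
    const idx = opt.indexOf(":");
    const key = idx === -1 ? opt : opt.slice(0, idx);
    const value = idx === -1 ? "" : opt.slice(idx + 1);
    if (FLAG_OPTIONS.has(key)) {
      // Assumption: only an empty or "true" value switches the flag on.
      if (value === "" || value === "true") env[FLAG_OPTIONS.get(key)] = "1";
    } else if (VALUE_OPTIONS.has(key)) {
      env[VALUE_OPTIONS.get(key)] = value;
    }
    // Any other option (e.g. use_jinja) is not part of the paged hooks.
  }
  return env;
}

console.log(optionsToEnv(["paged_kv:true", "max_batch_tokens:512", "use_jinja:true"]));
```

Against an unpatched llama.cpp the resulting variables are simply never read, which is why the hooks can stay in the shared grpc-server.cpp.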
+
+Conclusion: "stock llama-cpp + paged patchset, runtime-gated" is the CURRENT state of
+ONE backend. The task is to SPLIT that into two backends:
+  - llama-cpp  = clean upstream llama.cpp (de-risked: a dep-bump can never break on a
+                 paged hook), grpc-server.cpp keeps the dormant hooks.
+  - <newname>  = stock grpc-server.cpp + paged patch series applied + paged on.
+
+The turboquant backend is the EXACT precedent for "a llama.cpp variant that reuses the
+backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile + its own Dockerfile +
+its own matrix rows". Copy turboquant's shape, with two simplifications (see section 1).
+
+CPU_ALL_VARIANTS reuse: backend/cpp/llama-cpp/Makefile already has `llama-cpp-cpu-all`
+(one grpc-server + dlopen libggml-cpu-*.so via -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS,
+SHARED_LIBS=ON make-var). turboquant mirrors it with `turboquant-cpu-all`. The new backend
+gets the same single-build CPU target for free by reusing the same Makefile machinery.
+
+--------------------------------------------------------------------------------
+RECOMMENDED BACKEND NAME: `llama-cpp-paged`  (see section 4 for the full rationale)
+--------------------------------------------------------------------------------
+Everywhere below, NAME = llama-cpp-paged, DOCKERFILE = Dockerfile.llama-cpp-paged,
+SRC DIR = backend/cpp/llama-cpp-paged/, MAKE VAR = BACKEND_LLAMA_CPP_PAGED.
+DO NOT use the dotted working name `localai-llama.cpp`: a dot in Dockerfile.<suffix> and
+in the tag-suffix is unprecedented (every sibling is hyphenated: llama-cpp, ik-llama-cpp,
+turboquant, ds4) and complicates the changed-backends.js endsWith() suffix matching.
+
+================================================================================
+1. NEW BACKEND - file by file
+================================================================================
+
+--------------------------------------------------------------------------------
+1.1 backend/cpp/llama-cpp/Makefile  (the ONE necessary touch to stock)
+--------------------------------------------------------------------------------
+Change exactly one default so the STOCK image ships clean against upstream:
+
+    -LLAMA_PAGED?=on
+    +LLAMA_PAGED?=off
+
+Why: this is the entire point of the split - stock llama-cpp must build clean so an
+upstream LLAMA_VERSION bump can never fail on a paged hook. The runtime hooks in
+grpc-server.cpp stay (inert). The new backend forces LLAMA_PAGED=on explicitly (1.2), so
+it does not depend on this default. NOTE this DOES change stock's shipped artifact (it
+currently ships paged-compiled-in-but-gated); that is intended de-risking, call it out in
+the PR. If the team prefers stock literally untouched, the alternative is to leave
+`?=on` and accept that stock keeps carrying the patch series - but then "clean stock" is
+not achieved. Recommendation: flip to off.
+
+(No other change to backend/cpp/llama-cpp/ - grpc-server.cpp, CMakeLists.txt, prepare.sh,
+patches/, patches/paged/ are all reused as-is by the new backend.)
+
+--------------------------------------------------------------------------------
+1.2 backend/cpp/llama-cpp-paged/Makefile  (NEW - thin wrapper, model on turboquant)
+--------------------------------------------------------------------------------
+Mirror backend/cpp/turboquant/Makefile, but SIMPLER (two things turboquant needs that we
+do NOT):
+  - turboquant overrides LLAMA_REPO/LLAMA_VERSION to a fork. We use the SAME upstream pin
+    as stock (it lives in backend/cpp/llama-cpp/Makefile, already auto-bumped). So we do
+    NOT set LLAMA_VERSION here -> no bump_deps.yaml entry needed (big simplification vs
+    turboquant). We only force LLAMA_PAGED=on.
+  - turboquant runs patch-grpc-server.sh (augments the KV-cache type allow-list) and
+    apply-patches.sh (fork catch-up). We need NEITHER: grpc-server.cpp already has the
+    paged hooks, and the paged patch series is applied by the copied llama-cpp Makefile's
+    own `llama.cpp:` target when LLAMA_PAGED=on.
+
+Shape (one flavor shown; replicate the turboquant flavor set: avx/avx2/avx512/fallback/
+cpu-all/grpc/rpc-server):
+
+    LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
+
+    define paged-build   # $(1)=flavor $(2)=cmake flags $(3)=target
+      rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
+      cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
+      $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build purge
+      # clone upstream + apply base AND paged patch series (LLAMA_PAGED=on forces it)
+      LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build llama.cpp
+      CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_PAGED=on \
+        $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build grpc-server
+      cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build/grpc-server llama-cpp-paged-$(1)
+    endef
+
+    llama-cpp-paged-cpu-all:
+      # identical to turboquant-cpu-all: SHARED_LIBS=ON + GGML_BACKEND_DL + CPU_ALL_VARIANTS
+      # + --target ggml; then collect ggml-shared-libs/ for package.sh to bundle.
+      ... LLAMA_PAGED=on SHARED_LIBS=ON \
+          EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" \
+          TARGET="--target grpc-server --target ggml" ...
+
+    package: ; bash package.sh
+    purge:   ; rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-*-build; rm -rf llama-cpp-paged-* package
+    clean: purge
+
+Binaries are named llama-cpp-paged-{cpu-all,fallback,grpc,rpc-server,...} so run.sh and
+package.sh glob them.
+
+--------------------------------------------------------------------------------
+1.3 backend/cpp/llama-cpp-paged/run.sh  (NEW - copy turboquant/run.sh, rename binaries)
+--------------------------------------------------------------------------------
+s/turboquant/llama-cpp-paged/g. Prefers llama-cpp-paged-cpu-all if present, falls back to
+llama-cpp-paged-fallback; llama-cpp-paged-grpc when LLAMACPP_GRPC_SERVERS set; Darwin
+DYLD_LIBRARY_PATH branch; lib/ld.so launch. Keep verbatim otherwise.
+
+--------------------------------------------------------------------------------
+1.4 backend/cpp/llama-cpp-paged/package.sh  (NEW - copy turboquant/package.sh, rename)
+--------------------------------------------------------------------------------
+s/turboquant/llama-cpp-paged/g. Copies llama-cpp-paged-* into package/, bundles
+ggml-shared-libs/*.so* into package/lib (the CPU_ALL_VARIANTS dlopen set), copies run.sh,
+and the per-arch libc/ld.so set (unchanged).
+
+--------------------------------------------------------------------------------
+1.5 backend/Dockerfile.llama-cpp-paged  (NEW - copy Dockerfile.turboquant, swap paths)
+--------------------------------------------------------------------------------
+Identical 3-stage structure (builder-fromsource / builder-prebuilt / FROM scratch). Edits:
+  - bind/run .docker/llama-cpp-paged-compile.sh (new, 1.6) instead of turboquant-compile.sh
+  - ccache id: id=llama-cpp-paged-ccache-${TARGETARCH}-${BUILD_TYPE}
+    (OPTIONAL OPTIMIZATION: set id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE} to SHARE
+     stock llama-cpp's ccache - the paged TUs are mostly byte-identical to stock, so a warm
+     stock cache would give the paged build near-free object reuse. Trade-off: a regression
+     in one could surface as a cold miss in the other. Recommend sharing; revisit if noisy.)
+  - both `make -BC /LocalAI/backend/cpp/llama-cpp-paged package`
+  - final COPY --from=builder /LocalAI/backend/cpp/llama-cpp-paged/package/. ./
+
+--------------------------------------------------------------------------------
+1.6 .docker/llama-cpp-paged-compile.sh  (NEW - copy llama-cpp-compile.sh, swap make targets)
+--------------------------------------------------------------------------------
+Identical to .docker/llama-cpp-compile.sh except `cd .../llama-cpp-paged` and call
+`make llama-cpp-paged-cpu-all` (BUILD_TYPE empty / CPU) or `make llama-cpp-paged-fallback`
+(GPU), then `make llama-cpp-paged-grpc` + `make llama-cpp-paged-rpc-server`. Keep the
+arm64 gcc-14 apt step (CPU_ALL_VARIANTS armv9.2 SME needs gcc-14). ccache export unchanged.
+
+--------------------------------------------------------------------------------
+1.7 Makefile (top-level) - 6 edits, mirror the turboquant lines
+--------------------------------------------------------------------------------
+  a) .NOTPARALLEL (line 2): append `backends/llama-cpp-paged`
+  b) Backend def (after BACKEND_TURBOQUANT, line ~1172):
+       # llama-cpp-paged = stock llama.cpp grpc-server + LocalAI paged-attention patch
+       # series (LLAMA_PAGED=on). Reuses backend/cpp/llama-cpp sources via a thin wrapper.
+       BACKEND_LLAMA_CPP_PAGED = llama-cpp-paged|llama-cpp-paged|.|false|false
+     (lang field `llama-cpp-paged` -> Dockerfile.llama-cpp-paged, matching the
+      llama-cpp / ik-llama-cpp / turboquant convention where lang==backend name.)
+  c) generate-docker-build-target eval (after BACKEND_TURBOQUANT, line ~1273):
+       $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_PAGED)))
+  d) docker-build-backends (line ~1337): append docker-build-llama-cpp-paged
+  e) test-extra-backend-llama-cpp-paged target (mirror test-extra-backend-turboquant,
+     line ~673): BACKEND_IMAGE=local-ai-backend:llama-cpp-paged $(MAKE) test-extra-backend
+  f) (optional) backends/llama-cpp-paged-darwin target if shipping metal (mirror
+     backends/llama-cpp-darwin at line 1124; see 1.11).
+
+--------------------------------------------------------------------------------
+1.8 .github/backend-matrix.yml - add rows (mirror every llama-cpp row, swap names)
+--------------------------------------------------------------------------------
+For EACH variant you choose to ship (see phased recommendation in section 4), add a row
+copied from the corresponding llama-cpp row with:
+  - backend: "llama-cpp-paged"
+  - dockerfile: "./backend/Dockerfile.llama-cpp-paged"
+  - tag-suffix: swap `-llama-cpp` -> `-llama-cpp-paged`
+    (e.g. -cpu-llama-cpp -> -cpu-llama-cpp-paged;
+           -gpu-nvidia-cuda-12-llama-cpp -> -gpu-nvidia-cuda-12-llama-cpp-paged; etc.)
+  - builder-base-image: UNCHANGED - reuse the same base-grpc-* tags as llama-cpp
+    (this backend compiles the same gRPC + same toolchain; no new base-images.yml variant
+     is needed, so NO base-images bootstrap step). This is the cheap-variant payoff.
+  - CPU: TWO per-arch rows (amd64 ubuntu-latest + arm64 ubuntu-24.04-arm) sharing
+    tag-suffix '-cpu-llama-cpp-paged' so changed-backends.js emits a merge-matrix entry and
+    backend-merge-jobs assembles the manifest list. Same per-arch native + manifest-merge
+    pattern as -cpu-llama-cpp.
+  - Darwin (if shipping): add to includeDarwin:
+      - backend: "llama-cpp-paged"
+        tag-suffix: "-metal-darwin-arm64-llama-cpp-paged"
+        lang: "go"
+    (omit build-type, exactly like the llama-cpp darwin row at line 4908.)
+
+  REMINDER: the CI path filter only builds a backend on a PR when a file under its dir
+  changes. The PR that adds this backend touches backend/cpp/llama-cpp-paged/* so it self-
+  triggers. But also add the cross-trigger in 1.9 so future edits to backend/cpp/llama-cpp/
+  (the shared source) retrigger this backend too.
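The row-copy rule in 1.8 (swap backend, dockerfile and tag-suffix, leave everything else alone) can be sketched as a pure function. This is a minimal sketch: `toPagedRow` is a hypothetical helper, and the sample row only carries the fields named in this section, with a placeholder builder-base-image value rather than a real tag.

```javascript
// Hypothetical helper: derive a llama-cpp-paged matrix row from a llama-cpp row.
function toPagedRow(row) {
  const suffix = row["tag-suffix"];
  if (!suffix.endsWith("-llama-cpp")) {
    throw new Error(`not a llama-cpp row: ${suffix}`);
  }
  return {
    ...row, // builder-base-image and all other fields are reused unchanged
    backend: "llama-cpp-paged",
    dockerfile: "./backend/Dockerfile.llama-cpp-paged",
    "tag-suffix": `${suffix}-paged`,
  };
}

// Sample input; the builder-base-image value is a placeholder, not a real tag.
const stockRow = {
  backend: "llama-cpp",
  dockerfile: "./backend/Dockerfile.llama-cpp",
  "tag-suffix": "-gpu-nvidia-cuda-12-llama-cpp",
  "builder-base-image": "<same base-grpc-* tag as the llama-cpp row>",
};
console.log(toPagedRow(stockRow));
```

Throwing on an unexpected suffix keeps a copy-paste from a non-llama-cpp row from silently producing a tag that no index.yaml entry matches.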
+
+--------------------------------------------------------------------------------
+1.9 scripts/changed-backends.js - three edits (mirror turboquant exactly)
+--------------------------------------------------------------------------------
+  a) inferBackendPath(): add BEFORE the generic `endsWith("llama-cpp")` branch (line 56),
+     next to the turboquant branch (line 45):
+       if (item.dockerfile.endsWith("llama-cpp-paged")) {
+         // reuses backend/cpp/llama-cpp sources via a thin wrapper Makefile
+         return `backend/cpp/llama-cpp-paged/`;
+       }
+     ORDER MATTERS: "Dockerfile.llama-cpp-paged".endsWith("llama-cpp") is false today, but
+     keep the specific branch first regardless (defensive, and returns the right path).
+  b) inferBackendPathDarwin(): add a case (next to the llama-cpp one at line 66):
+       if (item.backend === "llama-cpp-paged") { return `backend/cpp/llama-cpp-paged/`; }
+  c) Per-backend cross-trigger (line 274-278, mirror the turboquant block):
+       if (backend === "llama-cpp-paged" && !changed) {
+         changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
+       }
+  Verify: node -e "... e.dockerfile.endsWith('llama-cpp-paged') ..." per adding-backends.md.
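The branch ordering and the cross-trigger from 1.9 can be exercised in isolation. The code below is a simplified stand-in, not the real scripts/changed-backends.js: the return value of the generic `llama-cpp` branch and the `isChanged` wrapper are assumptions made for illustration.

```javascript
// Simplified stand-in for the 1.9 branches; the real script has more cases.
function inferBackendPath(item) {
  // Specific branch first, as recommended in 1.9 a).
  if (item.dockerfile.endsWith("llama-cpp-paged")) {
    return "backend/cpp/llama-cpp-paged/";
  }
  // Assumed shape of the existing generic branch.
  if (item.dockerfile.endsWith("llama-cpp")) {
    return "backend/cpp/llama-cpp/";
  }
  return null;
}

// Hypothetical wrapper around the cross-trigger from 1.9 c).
function isChanged(backend, backendPath, changedFiles) {
  let changed = changedFiles.some((file) => file.startsWith(backendPath));
  if (backend === "llama-cpp-paged" && !changed) {
    // Edits to the shared sources must rebuild the paged backend too.
    changed = changedFiles.some((file) => file.startsWith("backend/cpp/llama-cpp/"));
  }
  return changed;
}

const item = { backend: "llama-cpp-paged", dockerfile: "./backend/Dockerfile.llama-cpp-paged" };
const path = inferBackendPath(item);
console.log(path, isChanged(item.backend, path, ["backend/cpp/llama-cpp/grpc-server.cpp"]));
```

The trailing slash in `backend/cpp/llama-cpp/` matters: without it, a change under backend/cpp/llama-cpp-paged/ would also match the stock prefix.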
+
+--------------------------------------------------------------------------------
+1.10 backend/index.yaml - meta + image entries (META-BACKEND - capabilities map, NO uri)
+--------------------------------------------------------------------------------
+GOTCHA (project_backend_meta_gotcha): a backend that ships per-platform images MUST be a
+meta backend = an anchor with a `capabilities:` map and NO top-level `uri:`; the concrete
+per-platform entries carry the uri. Copy the *llamacpp anchor (lines 3-31).
+
+  Step a - meta anchor in `## metas` (after *turboquant, ~line 74):
+    - &llamacpppaged
+      name: "llama-cpp-paged"
+      alias: "llama-cpp-paged"
+      license: mit
+      icon: <same as llama-cpp>
+      description: |
+        LocalAI's paged-attention llama.cpp: on-demand paged KV cache + decode-first
+        prefill budget. Stock llama.cpp grpc-server + the LocalAI paged patch series.
+        Tuned for NVFP4 dense/MoE on Blackwell/GB10. Reuses the llama-cpp gRPC server.
+      urls: [ https://github.com/ggerganov/llama.cpp ]
+      tags: [ text-to-text, LLM, CPU, GPU, CUDA, Metal, paged-attention, nvfp4 ]
+      capabilities:
+        default: "cpu-llama-cpp-paged"
+        nvidia: "cuda12-llama-cpp-paged"
+        nvidia-cuda-12: "cuda12-llama-cpp-paged"
+        nvidia-cuda-13: "cuda13-llama-cpp-paged"
+        nvidia-l4t: "nvidia-l4t-arm64-llama-cpp-paged"
+        nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-paged"
+        nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-paged"
+        metal: "metal-llama-cpp-paged"
+        # add amd/intel/vulkan keys ONLY for variants you actually build (section 4)
+
+  Step b - a `-development` meta (mirror llama-cpp-development, line 1611) with the same
+    capabilities map pointing at the `*-development` image names.
+
+  Step c - concrete image entries at end of file (mirror the llama-cpp block lines
+    2106-2200), one latest + one development per variant, each as:
+      - !!merge <<: *llamacpppaged
+        name: "cpu-llama-cpp-paged"
+        uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-paged"
+        mirrors: [ localai/localai-backends:latest-cpu-llama-cpp-paged ]
+      - !!merge <<: *llamacpppaged
+        name: "cpu-llama-cpp-paged-development"
+        uri: "quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp-paged"
+        mirrors: [ localai/localai-backends:master-cpu-llama-cpp-paged ]
+      ...repeat for cuda12 / cuda13 / l4t / metal etc.
+  The `latest-` / `master-` uri prefix + tag-suffix MUST match the matrix tag-suffix exactly.
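Because the uri tag has to be exactly the `latest-` / `master-` prefix followed by the matrix tag-suffix, generating both entries from one suffix avoids drift. A minimal sketch, assuming a hypothetical `imageEntries` helper; the registry and mirror names are the ones used in Step c.

```javascript
const REGISTRY = "quay.io/go-skynet/local-ai-backends";
const MIRROR = "localai/localai-backends";

// Hypothetical helper: build the latest + development image entries for one
// variant. tagSuffix is the matrix tag-suffix, including its leading "-".
function imageEntries(name, tagSuffix) {
  const entry = (entryName, prefix) => ({
    name: entryName,
    uri: `${REGISTRY}:${prefix}${tagSuffix}`,
    mirrors: [`${MIRROR}:${prefix}${tagSuffix}`],
  });
  return [entry(name, "latest"), entry(`${name}-development`, "master")];
}

console.log(imageEntries("cpu-llama-cpp-paged", "-cpu-llama-cpp-paged"));
```

The name is passed separately because, for the CUDA variants, the index.yaml name (`cuda12-llama-cpp-paged`) is not derivable from the tag-suffix (`-gpu-nvidia-cuda-12-llama-cpp-paged`).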
+
+--------------------------------------------------------------------------------
+1.11 Darwin (only if shipping metal; the NVFP4 target is CUDA, so metal is optional/phase 2)
+--------------------------------------------------------------------------------
+If metal is shipped, also:
+  - scripts/build/llama-cpp-paged-darwin.sh (copy scripts/build/llama-cpp-darwin.sh; it
+    drives the 3 CMake variants + otool dylib bundling). Ensure it forces LLAMA_PAGED=on.
+  - Makefile `backends/llama-cpp-paged-darwin` target (mirror backends/llama-cpp-darwin).
+  - backend_build_darwin.yml: add the llama-cpp-paged branch (mirror the llama-cpp-specific
+    step that calls `make backends/llama-cpp-darwin`).
+  - index.yaml metal-llama-cpp-paged / -development image entries (already in 1.10).
+  - C++ proto gotcha already handled (reuses llama-cpp CMakeLists.txt with hw_grpc_proto
+    linking protobuf/grpc++), so no Homebrew-include failure.
+
+--------------------------------------------------------------------------------
+1.12 Importer / /backends/known dropdown  (drop-in, NOT a new importer)
+--------------------------------------------------------------------------------
+This backend consumes GGUF exactly like llama-cpp -> extend the EXISTING importer, do not
+add a new one (per adding-backends.md rule 2). Edit core/gallery/importers/llama-cpp.go:
+  - AdditionalBackends() (line 37): append
+      {Name: "llama-cpp-paged", Modality: "text",
+       Description: "Paged-attention llama.cpp (on-demand paged KV + decode-first budget)"}
+  - Import() backend allow-list (line 133): add "llama-cpp-paged" to the switch case so a
+      preferences.backend == "llama-cpp-paged" is honored:
+        case "ik-llama-cpp", "turboquant", "llama-cpp-paged": backend = b
+  - core/gallery/importers/importers_test.go: add a table case asserting the preference
+    override emits backend: llama-cpp-paged (Ginkgo/Gomega; reuse an existing public GGUF
+    HF fixture). Run `go test ./core/gallery/importers/...`.
+
+--------------------------------------------------------------------------------
+1.13 Docs
+--------------------------------------------------------------------------------
+  - docs/content/features/backends.md: add llama-cpp-paged to the text-to-text/LLM list,
+    one line noting paged KV + NVFP4 Blackwell tuning. (Not an in-house from-scratch engine
+    -> it is a llama.cpp variant -> do NOT add to the README maintained-engines table.)
+
+--------------------------------------------------------------------------------
+1.14 Does grpc-server.cpp need the paged hooks?  YES - already present, reused unchanged.
+--------------------------------------------------------------------------------
+The hooks (kv_paged / max_batch_tokens / prefill_budget / prefill_cap) are already in the
+SHARED backend/cpp/llama-cpp/grpc-server.cpp. The paged backend reuses that file verbatim
+(via the Makefile copy). No patch-grpc-server.sh step is needed (unlike turboquant). The
+hooks are what translate the gallery `options:` (section 2) into the LLAMA_KV_PAGED /
+LLAMA_MAX_BATCH_TOKENS env that the paged llama.cpp lib reads.
+
+================================================================================
+2. GALLERY ITEMS - NVFP4 Qwen3.6 dense + MoE
+================================================================================
+
+Add two entries to gallery/index.yaml. Schema (verified against existing GGUF items and
+the LocalAI config structs): backend selection via `overrides.backend`; runtime knobs via
+either typed config fields (context_size/f16/flash_attention/gpu_layers/batch) or the
+`options:` string list (key:value, parsed by grpc-server.cpp set_option).
+
+--------------------------------------------------------------------------------
+2.1 Benchmark llama-server flags -> LocalAI model-config mapping
+--------------------------------------------------------------------------------
+  -c 131072                  -> context_size: 131072            (LLMConfig.ContextSize, yaml context_size)
+  -fa on                     -> flash_attention: "on"           (LLMConfig.FlashAttention, yaml flash_attention; string)
+  -ngl 99                    -> gpu_layers: 99                  (LLMConfig.NGPULayers, yaml gpu_layers; or omit -> DefaultNGPULayers offloads all)
+  -b 2048                    -> batch: 2048                     (schema.PredictionOptions.Batch, yaml batch)  [see caveat]
+  --parallel 128             -> options: ["parallel:128"]       (grpc-server.cpp:629; alias n_parallel)
+  LLAMA_KV_PAGED=1           -> options: ["paged_kv:true"]      (grpc-server.cpp:778)
+  LLAMA_MAX_BATCH_TOKENS=512 -> options: ["max_batch_tokens:512"] (grpc-server.cpp:821; alias mbt)
+  f16 KV                     -> f16: true                       (LLMConfig.F16, yaml f16)
+  (recommended for paged)    -> options: ["kv_unified:false"]   (grpc-server.cpp:746 - the per-slot paged
+                                  capacity/memory benefit only materializes with a per-sequence cache;
+                                  the patch comment explicitly recommends pairing paged with kv_unified:false)
+
+  CAVEAT (-ub 512): LocalAI sets params.n_ubatch = params.n_batch = request->nbatch()
+  (grpc-server.cpp:528,532). There is NO separate config field for n_ubatch, so the
+  benchmark's `-b 2048 -ub 512` split is NOT exactly reproducible. Options:
+    (i)  set batch: 512 -> n_batch=n_ubatch=512 (matches -ub; the decode-first
+         max_batch_tokens=512 budget is the dominant prefill lever anyway, and the
+         benchmark states decode throughput is budget-independent), OR
+    (ii) set batch: 2048 -> n_ubatch also 2048 (bigger physical batch, more KV scratch).
+  RECOMMEND (i) batch: 512 for the shipped gallery config (closest to the measured run +
+  lighter memory). Flag separately: a tiny grpc-server.cpp option `n_ubatch`/`ubatch` could
+  be added later to honor -b/-ub independently (not required to ship).
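The 2.1 mapping, including choice (i) for the `-ub` caveat, can be written down as one function. This is a sketch only: `benchmarkToOverrides` and its parameter names are hypothetical, and only the emitted keys and option strings are taken from the table above. `use_jinja:true` is left out because it does not come from a benchmark flag.

```javascript
// Hypothetical helper: turn the benchmark's llama-server settings into the
// `overrides` shape described in 2.1.
function benchmarkToOverrides({ ctx, ngl, ub, parallel, maxBatchTokens }) {
  return {
    f16: true,
    flash_attention: "on",
    context_size: ctx,
    gpu_layers: ngl,
    // Choice (i): a single batch value feeds both n_batch and n_ubatch,
    // so the benchmark's -ub is kept and its larger -b is dropped.
    batch: ub,
    options: [
      "paged_kv:true",
      `max_batch_tokens:${maxBatchTokens}`,
      "kv_unified:false",
      `parallel:${parallel}`,
    ],
  };
}

const overrides = benchmarkToOverrides({
  ctx: 131072,
  ngl: 99,
  ub: 512,
  parallel: 128,
  maxBatchTokens: 512,
});
console.log(JSON.stringify(overrides, null, 2));
```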
+
+--------------------------------------------------------------------------------
+2.2 gallery/index.yaml entry - DENSE  q36-27b-nvfp4
+--------------------------------------------------------------------------------
+- name: "qwen3.6-27b-nvfp4-paged"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF      # placeholder, section 3
+  description: |
+    Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF. Configured for LocalAI's
+    paged-attention llama.cpp backend: on-demand paged KV + decode-first prefill budget.
+    Benchmarked on GB10/DGX Spark at 90-117% of vLLM dense decode at 1.5-3x lower memory.
+  license: "apache-2.0"                                         # confirm vs Qwen license
+  tags: [ llm, gguf, nvfp4, reasoning ]
+  icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
+  overrides:
+    backend: llama-cpp-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 131072
+    gpu_layers: 99
+    batch: 512                       # see -ub caveat 2.1; matches the 512 ubatch floor
+    known_usecases: [ chat ]
+    options:
+      - use_jinja:true
+      - paged_kv:true                # LLAMA_KV_PAGED=1
+      - max_batch_tokens:512         # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget)
+      - kv_unified:false             # enables the per-slot paged capacity/memory benefit
+      - parallel:128                 # --parallel 128 serving slots
+    parameters:
+      model: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
+      sha256: <FILL after publish>
+      uri: https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF/resolve/main/q36-27b-nvfp4.gguf
+
+--------------------------------------------------------------------------------
+2.3 gallery/index.yaml entry - MoE  q36-35b-a3b-nvfp4
+--------------------------------------------------------------------------------
+Same shape; the MoE is lighter on memory (~3B active). parallel:128 + budget 256 was the
+MoE decode-throughput sweet spot in the sweep, but 512 is fine as a default; if optimizing
+purely for saturated MoE decode use max_batch_tokens:256.
+- name: "qwen3.6-35b-a3b-nvfp4-paged"
+  urls: [ https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF ]
+  ...
+  overrides:
+    backend: llama-cpp-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 131072
+    batch: 512
+    options:
+      - use_jinja:true
+      - paged_kv:true
+      - max_batch_tokens:512          # or 256 for max saturated MoE decode (sweep winner)
+      - kv_unified:false
+      - parallel:128
+    parameters:
+      model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
+  files:
+    - filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
+      sha256: <FILL after publish>
+      uri: https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF/resolve/main/q36-35b-a3b-nvfp4.gguf
+
+Note: these are the BENCHMARK serving configs. For an interactive single-user default you
+may want a second lighter gallery variant (context_size 16384, parallel 4, drop the budget)
+- optional, not required to ship the benchmark reproduction.
+
+================================================================================
+3. GGUF PUBLISHING (so the gallery uri: resolves)
+================================================================================
+
+The two GGUFs already exist on the DGX dev box (final_benchmark.csv references
+q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; README.md "Models" + "Benchmarks"
+document provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
+tensors from nvidia modelopt weights). To publish:
+
+  1. HF repos (suggest two, under the org that owns the gallery-referenced weights):
+       <ORG>/Qwen3.6-27B-NVFP4-GGUF      (single q36-27b-nvfp4.gguf)
+       <ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF  (single q36-35b-a3b-nvfp4.gguf)
+     ORG = localai-org (brand) or mudler (personal); pick per ownership of the conversions.
+  2. Upload each .gguf; compute sha256 (sha256sum) and paste into the gallery `files:` sha256
+     (LocalAI verifies it on download). Without sha256 the entry still works but loses the
+     integrity check - fill it.
+  3. Model card metadata: base_model Qwen/Qwen3.6-*, library_name gguf, quantization NVFP4,
+     pipeline_tag text-generation, license (confirm Qwen3.6 license terms - apache-2.0 vs
+     Qwen community license), a note that it REQUIRES the llama-cpp-paged backend (NVFP4 +
+     paged), and the GB10 benchmark table (link README.md "Benchmarks" numbers).
+  4. NVFP4 requires a llama.cpp new enough to read the NVFP4 GGUF type. Confirm the pinned
+     LLAMA_VERSION in backend/cpp/llama-cpp/Makefile supports NVFP4 tensor types (the dev
+     tree that produced the GGUFs did). If the current pin predates NVFP4 GGUF support, the
+     backend pin must be bumped OR the paged patch series must carry the NVFP4 reader. THIS
+     IS A GATING CHECK before the gallery items are usable - verify on a GPU box.
+  5. Provenance/licensing: the dense conversion derives from unsloth; the MoE from nvidia
+     modelopt weights. Ensure redistribution of the converted GGUFs is permitted and
+     attribute upstream in the card.
+
+================================================================================
+4. OPEN DECISIONS / BLOCKERS / BUILD COST
+================================================================================
+
+BACKEND NAME - RECOMMEND `llama-cpp-paged`.
+  - llama-cpp-paged (RECOMMENDED): descriptive (it IS the paged variant), hyphenated like
+    every sibling (llama-cpp/ik-llama-cpp/turboquant/ds4), collision-free in the
+    changed-backends.js endsWith() suffix scheme, self-documenting in the /backends/known
+    importer dropdown. Reads correctly next to "turboquant" and "ik-llama-cpp".
+  - localai-llama-cpp (branding alternative, ACCEPTABLE): keeps the LocalAI brand without a
+    dot; hyphenated and safe. Use this if marketing wants "LocalAI's own llama.cpp" framing.
+    Slightly less self-explanatory about WHAT differs (paged) in the dropdown.
+  - localai-llama.cpp (the working name; NOT RECOMMENDED): the dot makes
+    Dockerfile.localai-llama.cpp and tag-suffix -cpu-localai-llama.cpp the only dotted ones
+    in the repo, and ".cpp" looks like a file extension to the suffix matcher. Avoid.
+
+BLOCKERS / GATING CHECKS (cannot be closed read-only, no GPU here):
+  1. NVFP4 GGUF read support in the pinned LLAMA_VERSION (section 3.4). Must verify on GPU.
+     If unsupported, bump the pin (which also affects stock llama-cpp) or carry the reader.
+  2. The two GGUFs are not yet on HF (section 3). Gallery uri + sha256 are placeholders
+     until upload. Blocks gallery validation only, not the backend build.
+  3. -ub vs -b split (section 2.1) is not exactly reproducible without a tiny grpc-server
+     option; shipped config uses batch:512. Minor, not a blocker.
+  4. Flipping stock LLAMA_PAGED?=off changes stock's shipped artifact (de-risking, intended)
+     - get explicit sign-off since it alters a heavily-used backend's build.
+
+PLATFORM SHIP MATRIX (RECOMMENDED PHASING - the variant is cheap because it reuses the same
+base-grpc-* prebuilt bases and the same compile machinery, so each row is just CI minutes):
+  Phase 1 (the benchmark target - GB10/Blackwell is CUDA):
+    - cuda12 amd64, cuda13 amd64, cuda13 arm64 (sbsa), l4t-cuda-12 arm64  (NVFP4/paged win)
+    - cpu-all amd64 + cpu-all arm64 (the single CPU_ALL_VARIANTS build; baseline coverage)
+  Phase 2 (parity with stock llama-cpp coverage, only if demand):
+    - metal-darwin-arm64 (1.11), vulkan amd64/arm64, rocm amd64, intel sycl f16/f32
+  Defer rocm/sycl/vulkan/metal unless asked - the paged + NVFP4 story is GPU/CUDA-centric
+  and these add CI cost without a clear consumer.
+
+BUILD-COST ESTIMATE PER PLATFORM (with warm base-grpc-* base + ccache; the paged TUs are
+~byte-identical to stock so a SHARED ccache id makes most objects free):
+  - CPU_ALL_VARIANTS (per arch): ~15-30 min warm / ~35-50 min cold. arm64 adds a gcc-14
+    apt step. Two arches + a merge job.
+  - CUDA (per arch): ~25-45 min warm / ~45-75 min cold (nvcc dominates; ccache helps less
+    across CUDA arch flag changes). amd64 cuda12 + cuda13, arm64 cuda13 + l4t = 4 jobs.
+  - Metal/Darwin (if Phase 2): native macos-14 runner, ~20-35 min with the ccache cache.
+  - No base-images.yml change and no bootstrap dispatch (reuses existing base-grpc-* tags),
+    so the only new CI cost is the per-row build minutes above. PR builds read cache, don't
+    write; first master build per row pays the cold cost once, then warm.
+
+VERIFICATION (post-implementation, needs a GPU box - out of scope here):
+  - `make backends/llama-cpp-paged` builds + installs locally (from-source path).
+  - Confirm stock `make backends/llama-cpp` now builds clean (no paged-kv-manager.cpp in the
+    checkout) - proves the split.
+  - Load a published NVFP4 GGUF via the gallery entry, hit /v1/chat/completions, confirm the
+    server log shows LLAMA_KV_PAGED engaged (LLAMA_KV_PAGED_DEBUG trace) and the configured
+    max_batch_tokens/parallel took effect.
+  - go test ./core/gallery/importers/... green (importer drop-in case).
+  - node scripts/changed-backends.js dry-run: editing backend/cpp/llama-cpp/* retriggers
+    llama-cpp-paged (cross-trigger), editing backend/cpp/llama-cpp-paged/* triggers it too.
+
+================================================================================
+END OF PLAN
+================================================================================
--- a/backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md
@@ -0,0 +1,75 @@
+# Paged bit-exactness gate - per path (canonical references)
+
+## TL;DR
+
+The greedy decode of the **paged** path does not byte-match the **non-paged**
+path for the MoE model. This is a **benign FP-accumulation-order difference of
+the paged attention reduction**, KL-validated against the f16 reference. It is
+**not a bug**. The bit-exactness gate is therefore **per path**:
+
+| path | model | canonical md5 |
+|------|-------|---------------|
+| non-paged | MoE q36-35b-a3b-nvfp4   | `07db32c2bcb78d17a43ed18bc22705cd` |
+| paged     | MoE q36-35b-a3b-nvfp4   | `8cb0ce23777bf55f92f63d0292c756b0` |
+| non-paged | dense q36-27b-nvfp4     | `5951a5b4d624ce891e22ab5fca9bc439` |
+| paged     | dense q36-27b-nvfp4     | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) |
+
+Gate command (chat-template / conversation path):
+```
+llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
+                 -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
+# paged dense: prefix with  LLAMA_KV_PAGED=1
+# paged MoE:   prefix with  LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
+```
+Note: use the default chat-template path (do **not** pass `-no-cnv`; raw
+completion lands in a different md5 namespace).
+
+**Future paged-MoE regressions compare to the PAGED reference `8cb0ce23`, not to
+the non-paged `07db32c2`.** Dense is bit-exact across paths, so dense uses the
+single reference `5951a5b4`.
+
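A minimal sketch of the per-path gate as a lookup, assuming the decode output has already been captured as bytes. The `CANONICAL_MD5` table copies the references above; `gate_passes` and the `(path, model)` key shape are hypothetical names for illustration, not part of llama.cpp or LocalAI.

```python
import hashlib

# Canonical references copied from the table above, keyed by (path, model).
CANONICAL_MD5 = {
    ("non-paged", "moe"): "07db32c2bcb78d17a43ed18bc22705cd",
    ("paged", "moe"): "8cb0ce23777bf55f92f63d0292c756b0",
    ("non-paged", "dense"): "5951a5b4d624ce891e22ab5fca9bc439",
    ("paged", "dense"): "5951a5b4d624ce891e22ab5fca9bc439",
}


def gate_passes(output: bytes, path: str, model: str, references=CANONICAL_MD5) -> bool:
    """Hypothetical helper: compare the md5 of a captured decode to its per-path reference."""
    return hashlib.md5(output).hexdigest() == references[(path, model)]
```

A paged MoE run would be checked with `gate_passes(out, "paged", "moe")`; checking the same bytes against the `("non-paged", "moe")` entry is expected to fail by design.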
+## Why dense is bit-exact but MoE is not
+
+Dense paged decode reproduces the non-paged reduction order exactly, so dense
+greedy md5 is identical across paths. The MoE path runs additional kernels (the
+NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs
+between the paged and non-paged attention layouts. Over a long greedy decode this
+flips a small number of near-tied argmaxes, changing the byte stream. The same
+divergence is present on the 0028 baseline, with `LLAMA_MOE_FORCE_GRAPHS` on or
+off, and with the patch-0029 block-table cache on or off - it is a property of
+the paged attention path, not of any one lever.
+
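The mechanism above (same terms, different accumulation order, a flipped near-tied argmax) can be reproduced in a few lines. This is an illustrative sketch only: it uses Python doubles rather than the f32/NVFP4 arithmetic of the real kernels, and `sum_in_order` and `greedy_pick` are hypothetical helpers.

```python
def sum_in_order(values, order):
    """Accumulate values left to right in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


def greedy_pick(logits):
    """Index of the largest logit; ties resolve to the lowest index."""
    return max(range(len(logits)), key=lambda i: logits[i])


terms = [0.1, 0.2, 0.3]
forward = sum_in_order(terms, [0, 1, 2])  # (0.1 + 0.2) + 0.3
reverse = sum_in_order(terms, [2, 1, 0])  # (0.3 + 0.2) + 0.1

rival = forward  # a second logit that is tied with the forward-order sum
pick_forward = greedy_pick([forward, rival])  # tie, lowest index wins
pick_reverse = greedy_pick([reverse, rival])  # reverse-order sum is one ulp lower
```

The two sums differ by a single unit in the last place, yet the greedy pick changes. Neither order is more correct than the other, which is why the gate compares each path only against its own reference.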
+## KL evidence that the paged path is sound (the load-bearing check)
+
+`llama-perplexity --kl-divergence` on `q36-35b-a3b-nvfp4.gguf`, 16 chunks,
+`-c 512 -ngl 99 --seed 1`, base logits from the f16 reference
+(`darwin_36b_opus/f16.gguf`, PPL 7.3734):
+
+| comparison | PPL(Q) | KL divergence | Same top p | Cor |
+|------------|-------:|--------------:|-----------:|----:|
+| f16 reference | 7.3734 | - | - | - |
+| **non-paged** vs f16 | 7.3896 | 0.136597 +/- 0.003157 | 84.314% | 97.68% |
+| **paged** vs f16     | 7.4009 | 0.136000 +/- 0.003285 | 84.828% | 97.58% |
+| paged vs non-paged (direct) | 7.4009 (base 7.3818) | 0.050011 +/- 0.001653 | 89.044% | 99.04% |
+
+Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%.
+
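For readers unfamiliar with the columns, here is a hedged sketch of what "KL divergence" and "Same top p" measure for a single token position. It is not the `llama-perplexity` implementation; the function names and the toy distributions are assumptions made for illustration.

```python
import math


def kl_divergence(base, other):
    """KL(base || other) in nats; assumes both are distributions with other[i] > 0."""
    return sum(p * math.log(p / q) for p, q in zip(base, other) if p > 0.0)


def argmax(dist):
    return max(range(len(dist)), key=lambda i: dist[i])


def same_top(base, other):
    """True when both distributions rank the same token first."""
    return argmax(base) == argmax(other)


# Toy near-tie: the two distributions are almost identical but disagree on the top token.
near_tie_a = [0.41, 0.40, 0.19]
near_tie_b = [0.40, 0.41, 0.19]
```

In this toy case the divergence is tiny while the top token still differs, which is the pattern the direct paged-vs-non-paged row shows at scale.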
+### Verdict: BENIGN
+
+- **Paged does not diverge from the f16 ground truth more than non-paged does.**
+  KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) =
+  7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29
+  error bars). A real paged-MoE correctness bug would push paged measurably
+  *further* from f16; it does not (it is marginally closer).
+- **Paged and non-paged cluster together.** They agree with each other (KLD 0.050,
+  89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, ~84% same-top-p),
+  with essentially zero probability bias. That is the signature of two equivalent
+  FP-reorderings of the same quantized model, both equally approximating the f16
+  ground truth - not a quality regression.
+- The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that
+  heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model
+  logit near-ties are abundant, so a different-but-equivalent reduction order
+  flips ~11% of argmaxes with no quality cost (proven by the equal KLD-to-f16 and
+  zero Delta-p bias).
+
+Therefore the canonical gate is per path, and `8cb0ce23` is the validated paged
+reference for the MoE deployment path.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
@@ -0,0 +1,86 @@
+# llama.cpp patch series — paged attention (vLLM-parity engine)
+
+A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
+in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
+(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
+`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us
+**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
+patch needs fixing, and the failure points at exactly which step the upstream change touched.
+
+## Base
+
+- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
+  the pin = re-run the regen workflow below and fix only the patches that no longer apply.
+
+## The series (phases → patches)
+
+| # | Patch | What | Verifies |
+|---|-------|------|----------|
+| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately |
+| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
+| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
+| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
+| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
+| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
+
+Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off
+(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so a partial series never changes stock behavior.
+
+## Regen workflow (the anti-drift recipe)
+
+```sh
+# 1. check out the exact pin into a dev tree
+git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
+git checkout <LLAMA_VERSION from ../Makefile>
+git checkout -b paged
+
+# 2. apply the current series (each becomes a commit), or develop the next patch
+git am /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/0*.patch      # or `git apply` + commit per patch
+
+# 3. iterate a phase as ONE commit, then export the whole series 1:1
+git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/ --zero-commit -N
+
+# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
+```
+
+## Build integration
+
+The series is owned by this backend (`backend/cpp/llama-cpp-localai-paged`), not by the stock
+`llama-cpp` backend, which is pure upstream. `../Makefile` (the paged wrapper) clones the pinned
+`llama.cpp` via the copied stock build infra, then applies this series onto the cloned tree with the
+same strict `git apply` the stock build uses for base patches:
+```
+for p in $(PAGED_PATCHES_DIR)/0*.patch; do git apply --verbose "$p" || exit 1; done
+```
+All variants (avx/avx2/avx512/cuda/…) clone + apply into their own build copy, so the series ships
+everywhere without ever touching the stock `llama-cpp` source tree.
+
+## Status
+
+- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`.
+- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
+  **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
+- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
+  (see `../README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
+  subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
+  `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
+  `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
+  **9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
+  compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
+  - **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
+    flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
+    scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
+    the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order ->
+    bit-identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what
+    makes paged placement token-identical under flash-attn.**
+- 0004–0006 follow.
+
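The key correctness finding for 0003 comes down to an ordering rule, sketched below. This is not the patch's `get_gather_idxs`; the function name, the dict-based cell map, and the cell numbers are hypothetical and only illustrate "emit cells sorted by token position".

```python
def gather_idxs_sorted_by_pos(cell_pos):
    """cell_pos maps physical cell index -> token position for one sequence.

    Returns the physical cell indices ordered by ascending token position.
    """
    return sorted(cell_pos, key=lambda cell: cell_pos[cell])


# Scattered block placement: the first tokens landed in a later block than the rest.
cell_pos = {48: 0, 49: 1, 16: 2, 17: 3, 32: 4, 33: 5}

idxs = gather_idxs_sorted_by_pos(cell_pos)
positions_in_gather_order = [cell_pos[c] for c in idxs]
positions_in_physical_order = [cell_pos[c] for c in sorted(cell_pos)]
```

Reading in physical-array order would visit positions out of sequence, while the gather order visits them ascending, which is the order a contiguous stock cache would present to the reduction.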
+### Honest parity note (important)
+
+This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
+by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
+(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
+touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
+gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
+`../README.md`). So full vLLM parity = this series **AND** the
+kernel; neither alone suffices.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
@@ -0,0 +1,101 @@
+# Pin-sync: paged patch-stack -> llama.cpp c299a92c
+
+Status: COMPLETE. The shipped source-only paged patch series (`0001`-`0030`,
+28 `.patch` files) was advanced from llama.cpp `9d5d882d` to `c299a92c`
+("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
+GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
+path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
+upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
+
+## Upstream jump
+
+- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
+  ("model : Add label for LFM2.5-230M (#25008)")
+- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
+  ("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
+- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
+
+## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
+
+Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
+**zero patch changes**. The already-shipped source-only series (the result of the
+`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
+`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
+`git apply`** (the `apply-paged-patches` step in
+`backend/cpp/llama-cpp-localai-paged/Makefile`:
+`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
+28 patches reported "Applied patch ... cleanly", the sentinel
+`src/paged-kv-manager.cpp` was created, and there are **zero** stray
+`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
+intact). `git apply` tolerates `@@` line-number offsets, which absorbed the
+upstream drift; no hunk context broke.
+
+Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
+patch tarball used for the verification has
+`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
+
+## Clean build
+
+Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
+28 patches applied as working-tree changes, then:
+
+```
+cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
+  -DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
+  -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
+cmake --build build-cuda --target llama-completion test-backend-ops -j20
+```
+
+Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
+`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
+
+## GATE: ALL GREEN
+
+Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
+`9d5d882d` build too):
+```
+llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
+                 -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
+# paged dense: prefix  LLAMA_KV_PAGED=1
+# paged MoE:   prefix  LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
+```
+
+(a) greedy md5 - all four paths PASS:
+| path | model | md5 @ c299a92c | baseline | verdict |
+|------|-------|----------------|----------|---------|
+| non-paged | dense `q36-27b-nvfp4`   | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
+| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
+| paged     | dense `q36-27b-nvfp4`   | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
+| paged     | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
+
+(b) `test-backend-ops` (Backend CUDA0) - all PASS:
+| op | result |
+|----|--------|
+| SSM_CONV            | 45/45 OK |
+| SSM_CONV_UPDATE     | 16/16 OK |
+| SSM_CONV_UPDATE_IDS | 16/16 OK |
+| GATED_DELTA_NET     | 84/84 OK |
+| MUL_MAT             | 1146/1146 OK |
+| MUL_MAT_ID          | 806/806 OK |
+
+(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
+series now carries patches `0026`/`0028`'s added per-head/gather test cases; all
+pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
+
+Bit-exactness preserved across the 23-commit upstream jump.
+
+## Canary
+
+`.github/workflows/llama-cpp-paged-canary.yml` and
+`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
+series is source-only and applies strict-clean with no `--exclude`, the canary's
+`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
+the shipped series) and may be removed on a future canary touch; left in place
+here to keep the pin-bump diff minimal.
+
+## Source of truth
+
+The shipped `.patch` files under `backend/cpp/llama-cpp-localai-paged/patches/paged/` are the
+source of truth and are unchanged by this bump. The DGX dev tree
+(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
+the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.
--- a/backend/cpp/llama-cpp-localai-paged/docs/UPSTREAM_LAYER2_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/UPSTREAM_LAYER2_SCOPE.md
@@ -0,0 +1,337 @@
+# Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL
+
+Source-only analysis (no GPU, no build) of what it would take to give the
+gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute
+backends, so the patch-series decode win extends past CUDA-family hardware.
+
+In our changeset (patches 0018-0030) these fusions ship with CUDA native kernels
+and CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan /
+SYCL (a CPU-fallback fused op would regress via the device round-trip, and a
+backend that ran the plain op on the discriminated node would silently
+miscompute). "Layer 2" is the upstream work that adds the missing native kernels.
+
+This doc was written against the ggml backend trees in
+`backend/cpp/llama-cpp-paged-dev` (upstream base #24732, one commit OLDER than the
+series pin `c299a92c` #25045, with only the two paged-KV patches applied - neither
+touches GDN/SSM). So every "kernel already exists" statement below is a
+conservative lower bound: the pin has at least these kernels.
+
+--------------------------------------------------------------------------------
+## 0. Headline finding (correct a stale assumption first)
+
+The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel
+upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there."
+**That is now stale.** All three backends already carry the BASE compute ops:
+
+| op                     | Metal                              | Vulkan                                   | SYCL                            |
+|------------------------|------------------------------------|------------------------------------------|---------------------------------|
+| GGML_OP_GATED_DELTA_NET| `kernel_gated_delta_net_impl` (f32, NSG 1/2/4) | `gated_delta_net.comp` (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | `gated_delta_net.cpp` (`launch_gated_delta_net<KDA,keep_rs>`) |
+| GGML_OP_SSM_CONV       | `kernel_ssm_conv_f32_f32` (+ `_4`, + batched) | `ssm_conv.comp` (+ APPLY_BIAS, APPLY_SILU specialization consts) | `ssm_conv.cpp` (`kernel_ssm_conv`) |
+| GGML_OP_SSM_SCAN       | yes                                | `ssm_scan.comp` (mamba2)                 | `ssm_scan.cpp` (mamba2)         |
+
+Verified: Vulkan `gated_delta_net.comp` was last touched at the upstream base
+commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on
+Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today
+(via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is
+the decode SPEEDUP from the fusions, NOT enabling the model to run at all.
+
+Consequence: the GDN-compute op being "partly there" is true on every backend,
+not just Metal. What is still missing per backend is only the FUSION plumbing
+(in-place write-back target, the ids gather read, and the conv-update kernel) -
+a materially smaller scope than "port GDN from scratch."
+
+--------------------------------------------------------------------------------
+## 1. Per-op semantics (the four fusions to port)
+
+All four reuse an existing GGML_OP enum with extra `src[]` slots as a
+discriminator; none adds a new enum value. f32 throughout. The arithmetic core
+is IDENTICAL to the upstream non-fused op; only the read source and/or the write
+target are redirected. That single fact drives the whole bit-exactness story
+(section 3).
+
+### OP A - `ggml_gated_delta_net_inplace` (patch 0018)
+- Enum `GGML_OP_GATED_DELTA_NET`, discriminated by a non-null `src[6]` =
+  `state_dst` (a contiguous `[S_v*S_v*H, n_seqs]` view into the recurrent-state
+  cache at `kv_head`). K == 1 only.
+- Semantics: run the standard GDN recurrence, but write the FINAL recurrent state
+  directly into `state_dst` instead of appending it to the op output. The op
+  output then carries only the attention scores. Removes the per-layer per-step
+  ~full-state D2D copy-back (the 0018 win).
+- Race (in-place read == write): each (seq, head) block owns a disjoint cache
+  slot. The kernel loads the whole prior state `s0` into per-thread registers
+  (`s_shard` on CUDA, `ls[NSG]` on Metal, the column shard on Vulkan/SYCL)
+  BEFORE the ring write, so reading and writing the same slot is safe.
+
+### OP B - `ggml_gated_delta_net_inplace_ids` (patch 0019)
+- Adds `src[5]` = FULL state cache `[S_v,S_v,H,n_rs_slots]`, `src[7]` = `ids`
+  (I32, per-seq source slot == the recurrent-state `s_copy`), `op_param[1]` =
+  `rs_head` (destination base slot). Still has the OP-A `src[6]` in-place target.
+- Semantics: read each sequence's prior state directly from `cache[ids[seq]]`
+  (mirrors `ggml_ssm_scan`'s ids source), eliminating the `ggml_get_rows`
+  materialization. Combined with OP A the op now reads AND writes the cache in
+  place.
+- Race: identity sequences (`ids[s] == rs_head + s`, the steady AR-decode case)
+  read s0 in place from the destination slot (safe via the register snapshot
+  above). Non-identity sequences (reorder / rs_zero remap) are first copied by a
+  TINY separate gather kernel (`gdn_gather_nonident`, one block/seq) into a
+  DISJOINT scratch that the recurrence then reads, so the recurrence never reads
+  a slot another block is writing. Value-preserving memcpy -> bit-identical to
+  the get_rows path.
+
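The OP A/B race rules can be modelled on plain lists. The sketch below assumes a toy recurrence (`step`) in place of the real GDN math, and all function names are hypothetical; it only shows that "scratch copy for non-identity sources, snapshot for identity sources" gives the same result as materializing every prior state first.

```python
def step(state, x):
    """Stand-in recurrence (hypothetical): decay the prior state and add the input."""
    return [0.5 * v + x for v in state]


def reference_get_rows(cache, ids, rs_head, xs):
    """Non-fused path: materialize every prior state first, then update."""
    gathered = [list(cache[src]) for src in ids]
    out = [list(slot) for slot in cache]
    for s, prior in enumerate(gathered):
        out[rs_head + s] = step(prior, xs[s])
    return out


def fused_inplace_ids(cache, ids, rs_head, xs):
    """Fused path: read from cache[ids[s]] and write to cache[rs_head + s] in place."""
    # Gather pass: copy only non-identity sources into a disjoint scratch,
    # before any destination slot is overwritten.
    scratch = {s: list(cache[src]) for s, src in enumerate(ids) if src != rs_head + s}
    for s in range(len(ids)):
        # Identity sequences snapshot their own slot before the write.
        prior = scratch[s] if s in scratch else list(cache[rs_head + s])
        cache[rs_head + s] = step(prior, xs[s])
    return cache
```

With `ids = [1, 0]` and `rs_head = 0` each sequence reads the slot the other one writes, so the scratch copy is what keeps the fused result equal to the reference.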
+### OP C - `ggml_ssm_conv_update_inplace` (patch 0021)
+- Enum `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]` =
+  `conv_state_dst` (`[(K-1)*channels, n_seqs]` in-place ring view).
+  `src[0]` = conv_states `[K-1, channels, n_seqs]`, `src[1]` = conv_kernel
+  `[K, channels]`, `src[2]` = x_cur `[channels, 1, n_seqs]`. `op_param[0]` =
+  fuse_silu.
+- Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the
+  width-K conv window in registers from the K-1 cached taps + the current token,
+  compute the depthwise conv with the SAME ascending-tap FMA order as plain
+  `ssm_conv` (`tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` to match plain conv's
+  `sumf += b` with b==0), optionally fold SiLU, write the conv output
+  `[channels,1,n_seqs]`, and write the 1-token-shifted ring state back in place.
+  Replaces the decode conv chain (transpose + concat + conv + silu + ring
+  cpy).
+- Race: read source (gathered taps) and write target (cache view) are disjoint
+  buffers -> race-free by construction, no ids/identity logic.
+
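A scalar reference model of the OP C semantics for one (channel, sequence) pair, written as a hedged sketch. It follows the steps listed above (window assembly, ascending-tap accumulation, the `+0.0`, optional SiLU, shifted ring write) but uses Python doubles and hypothetical names; it is not the CPU reference kernel from the patch.

```python
import math


def silu(v):
    return v / (1.0 + math.exp(-v))


def conv_update_inplace(taps, weights, x_cur, fuse_silu=False):
    """One decode step for a single (channel, sequence).

    taps:    the K-1 cached values, oldest first; updated in place.
    weights: the K conv-kernel weights for this channel.
    """
    window = list(taps) + [x_cur]  # local copy BEFORE the ring write
    acc = window[0] * weights[0]
    for tap, w in zip(window[1:], weights[1:]):  # ascending tap order
        acc += tap * w
    acc += 0.0  # mirrors plain conv's `sumf += b` with b == 0
    taps[:] = window[1:]  # 1-token-shifted ring state
    return silu(acc) if fuse_silu else acc
```

For example, with `taps = [1.0, 2.0, 3.0]`, weights `[0.5, 0.25, 0.125, 1.0]` and `x_cur = 4.0`, the call returns the conv output and leaves `taps` holding the three most recent values.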
+### OP D - `ggml_ssm_conv_update_inplace_ids` (patch 0028)
+- Same enum, discriminated by a non-null `src[4]` = `ids`; `src[0]` becomes the
+  FULL conv cache `[K-1, channels, n_cells]`; `op_param[1]` = rs_head.
+- Semantics: gather-free conv-update - read each sequence's prior taps from
+  `cache[ids[s]]` in-kernel (no get_rows). Identity reads in place from
+  `conv_state_dst`; non-identity gathered into a disjoint scratch first by a tiny
+  `ssm_conv_gather_nonident` kernel. The window is copied to a local array
+  BEFORE the (possibly aliasing) ring write so the identity read==write slot is
+  correct. Bit-identical to get_rows + OP C.
+
+### Net new kernels vs reuse, per op
+- OP A: NOT a new compute kernel - a write-target redirection of the EXISTING
+  GDN kernel + 1 buffer binding + a supports_op/op-handler branch.
+- OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) +
+  1 ids binding + rs_head param + 1 tiny gather kernel.
+- OP C: a GENUINELY NEW kernel on each backend. The existing `ssm_conv` computes
+  a windowed reduction over a PRE-concatenated input; it does not assemble the
+  window from cached taps + the current token, fold silu, or write the shifted
+  ring state. This is the largest net-new piece.
+- OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1
+  tiny conv gather kernel.
+
+The `ggml.h` / `ggml.c` builders, the CPU reference kernels, the model-graph
+emission (`delta-net-base.cpp`, qwen35*), and the `test-backend-ops` cases are
+SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend
+work is the kernel(s) + the backend wiring.
+
+--------------------------------------------------------------------------------
+## 2. Per-backend: authoring model, effort, gotchas, wiring
+
+### 2.1 Metal (MSL)
+
+Authoring model: `.metal` MSL source (`ggml-metal.metal`), function-constant
+specialization (e.g. `FC_GATED_DELTA_NET`), kernels templated on `NSG`; host
+glue split across `ggml-metal-ops.cpp` (`ggml_metal_op_*` encode), the pipeline
+lookup in `ggml-metal-device.cpp`/`.m`, the kargs struct in `ggml-metal-impl.h`,
+and `supports_op` in `ggml-metal-device.m`. Threadgroup model; Apple GPU
+simdgroup width is a FIXED 32, `simd_sum` for the per-column reduce.
+
+Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist
+and are ergonomic to extend. OP A is a write-base redirect of the existing
+`kernel_gated_delta_net_impl` (its tail already does
+`dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j]` after
+loading `ls[]` into registers - just point `dst_state` at the `state_dst` buffer
+and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu
+ssm_conv variant today - only plain + `_4` + batched - so the silu-fold and ring
+write are both new). Host glue spans 3-4 files.
+
+Gotchas:
+- In-place race: the existing kernel ALREADY snapshots the state column into
+  `ls[NSG]` registers before writing, so OP A/B are safe with no barrier; OP C/D
+  must mirror the `float window[K]` local-copy-before-write that CPU/CUDA use.
+- Discriminated SSM_CONV: `supports_op` for `GGML_OP_SSM_CONV` currently returns
+  `has_simdgroup_reduction` with NO check of `src[3]`/`src[4]`; GDN returns
+  `has_simdgroup_reduction && src[2]->ne[0] % 32 == 0` with NO check of
+  `src[6]`/`src[7]`. Both must be tightened (accept the discriminated variant
+  only once the kernel exists) AND `ggml_metal_op_ssm_conv` /
+  `ggml_metal_op_gated_delta_net` must branch on the extra src to pick the kernel.
+- Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three -
+  the fused variant only redirects addresses, so it is bit-identical to Metal's
+  own non-fused path by construction (the conv per-channel FMA needs the exact
+  ascending order + the `+0.0f`).
+- The kargs struct grows by the `state_dst` / `ids` / `rs_head` fields; a new
+  pipeline name (or a function-constant branch) distinguishes the variants.
+
+### 2.2 Vulkan (GLSL .comp -> SPIR-V)
+
+Authoring model: GLSL `.comp` in `vulkan-shaders/`, compiled at build time by
+`vulkan-shaders-gen` into embedded SPIR-V byte arrays (`gated_delta_net_f32_data`
+etc.); pipeline creation in `ggml-vulkan.cpp` declares the binding count +
+push-constant size; a push-constant struct per op; host dispatch `ggml_vk_*`
+binds subbuffers; `supports_op` in the device support function. Subgroup size
+VARIES by vendor (NVIDIA 32, AMD 64, Intel 8/16/32).
+
+Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as
+Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a
+new `ggml_vk_create_pipeline` registration with an explicit binding count and
+push-constant size, a new/extended push-constant struct (add `rs_head`), and
+GROWING the descriptor binding set from the current 7 (`src[0..5]` + dst) to 8-9
+(`state_dst`, `ids`). The GDN host dispatch hardcodes a 6-src bind loop and the
+pipeline is created with `"main", 7, ...` - both must change.
+
+Gotchas:
+- Subgroup variance interacts with the EXISTING variant matrix: the GDN comp
+  already ships shmem / cluster / nocluster variants keyed on subgroup size and
+  relies on `S_V % COLS_PER_WG == 0`. The OP-A/B read/write redirect must be
+  applied across ALL of those variants, and re-validated per vendor.
+- In-place race: GLSL must read the full column shard into local registers before
+  the ring write (same pattern); confirm the SPIR-V memory model is not relied on
+  for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)).
+  OP C/D need the explicit window-to-local copy.
+- Discriminated SSM_CONV: `supports_op` returns `op->src[0]->type == F32` with NO
+  discriminator check; GDN loops `src[0..5]` F32 with NO `src[6]`/`src[7]` check.
+  Both must be tightened. This is the backend where the 0030 hazard is most
+  concrete (a present plain-conv kernel + a permissive supports_op = silent
+  miscompute) - Vulkan is the exact case 0030 was written for.
+- conv-update is per-channel (one invocation per channel) so it is
+  subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden.
+- Vulkan's `ssm_conv.comp` ALREADY has APPLY_SILU + APPLY_BIAS specialization
+  constants, so the silu-fold half of OP C is partly precedented here (unlike
+  Metal); the ring write-back + tap-window assembly are still new.
+
+### 2.3 SYCL (single-source DPC++)
+
+Authoring model: plain C++ `.cpp`/`.hpp` per op (`gated_delta_net.cpp`,
+`ssm_conv.cpp`); a SYCL `queue.parallel_for` over an `nd_range` with
+`reqd_sub_group_size(WARP_SIZE)`; sub-group reductions (`warp_reduce_sum`);
+`supports_op` in `ggml-sycl.cpp`. NO separate shader-compile step (single
+source).
+
+Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are
+near-VERBATIM mirrors of the CUDA ones (`launch_gated_delta_net<KDA,keep_rs>`,
+`s_shard`, `curr_state`, `state = dst + attn_score_elems`, `warp_reduce_sum`) -
+a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port
+almost line-for-line: add the `state_dst` param, the `ids`/`rs_head` params, the
+read-base select, the two tiny gather kernels, and the new conv-update kernel.
+No pipeline/push-constant/binding bookkeeping.
+
+Gotchas:
+- In-place race: the `s_shard[]` / window arrays are per-work-item private, so
+  the register-snapshot-before-write pattern carries over directly. Safe.
+- Discriminated SSM_CONV: `supports_op` checks `src[0]`/`src[1]` F32 with NO
+  discriminator check; GDN returns a BARE `true` (the MOST permissive, so the
+  hazard is worst here). Both must be tightened, and `ggml_sycl_op_ssm_conv` /
+  `ggml_sycl_op_gated_delta_net` must branch on the extra src.
+- Bit-exactness: `WARP_SIZE` is compile-fixed (Intel sub-group 8/16/32), same
+  situation as CUDA; the fused variant matches SYCL's own non-fused path by
+  construction. conv-update is per-channel -> subgroup-agnostic.
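
The register-snapshot-before-write pattern referred to in the gotchas can be sketched on the host as below. This is a minimal illustration only: `gdn_column_in_place`, the shard length `S_V = 4`, and the simplified recurrence are hypothetical stand-ins, not the SYCL or CUDA kernel code. It shows the two properties that matter: the slot is selected through `ids`, and every read is taken from a private snapshot before the in-place write-back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t S_V = 4;  // assumed shard length, illustrative only

// Hypothetical single-column GDN-style step that updates its cache slot in place.
// cache holds n_slots * S_V floats; ids[seq] picks the slot that is read AND written.
// Returns the column's output value.
inline float gdn_column_in_place(std::vector<float> & cache,
                                 const std::vector<std::int32_t> & ids,
                                 std::size_t seq, const float * k, const float * q,
                                 float v, float g, float beta) {
    assert(seq < ids.size());
    const std::size_t slot = static_cast<std::size_t>(ids[seq]);
    assert((slot + 1) * S_V <= cache.size());
    float * state = cache.data() + slot * S_V;  // read base == write base

    float s_shard[S_V];  // private snapshot, taken before any write
    for (std::size_t i = 0; i < S_V; ++i) s_shard[i] = g * state[i];

    float kv = 0.0f;
    for (std::size_t i = 0; i < S_V; ++i) kv += s_shard[i] * k[i];
    const float delta = (v - kv) * beta;

    float out = 0.0f;
    for (std::size_t i = 0; i < S_V; ++i) {
        s_shard[i] += k[i] * delta;
        out += s_shard[i] * q[i];
    }
    for (std::size_t i = 0; i < S_V; ++i) state[i] = s_shard[i];  // write-back last
    return out;
}
```

Because the write-back is the final loop, no input the column still needs can be clobbered, and slots not selected by `ids` are never touched.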
+
+### 2.4 Common wiring (all three) + the 0030 emission-gate change
+
+Per backend, four wiring touch-points beyond the kernel body:
+1. `supports_op`: tighten the `GGML_OP_SSM_CONV` and `GGML_OP_GATED_DELTA_NET`
+   entries so the discriminated/extra-src node is reported supported ONLY when
+   the new kernel handles it (and rejected otherwise, instead of today's
+   silently-true-for-the-plain-kernel).
+2. op handler: branch on `src[3]`/`src[4]` (conv) and `src[6]`/`src[7]` (GDN) to
+   dispatch the fused kernel.
+3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor
+   bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions).
+4. The patch-0030 gate in `src/llama-context.cpp`.
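
Touch-points 1 and 2 can be sketched with simplified stand-in types. Everything here (`Node`, `Tensor`, `supports_op_tightened`, `select_kernel`) is hypothetical and is not the ggml API; only the `src[3]`/`src[4]` and `src[6]`/`src[7]` slot numbers come from the text above.

```cpp
#include <array>
#include <cassert>

// Hypothetical, simplified stand-ins for ggml types; not the real ggml API.
enum class Type { F32, F16 };
enum class Op { SSM_CONV, GATED_DELTA_NET };
struct Tensor { Type type; };
struct Node {
    Op op;
    std::array<const Tensor *, 8> src{};  // unused slots stay nullptr
};

// True when the node carries the extra (discriminated) src slots.
inline bool has_extra_src(const Node & n) {
    return n.op == Op::SSM_CONV ? (n.src[3] || n.src[4])
                                : (n.src[6] || n.src[7]);
}

// Permissive shape: only src[0] is checked, so a discriminated node is accepted.
inline bool supports_op_permissive(const Node & n) {
    return n.src[0] && n.src[0]->type == Type::F32;
}

// Tightened shape: a discriminated node is supported only if a fused kernel exists.
inline bool supports_op_tightened(const Node & n, bool has_fused_kernel) {
    if (!supports_op_permissive(n)) return false;
    return !has_extra_src(n) || has_fused_kernel;
}

enum class Kernel { Plain, Fused };

// Op-handler branch: the fused kernel is chosen only for discriminated nodes.
inline Kernel select_kernel(const Node & n) {
    assert(supports_op_permissive(n));  // handler runs only on accepted nodes
    return has_extra_src(n) ? Kernel::Fused : Kernel::Plain;
}
```

With the tightened check, a backend that has only the plain kernel rejects the discriminated node, so the scheduler can fall back instead of running the wrong kernel.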
+
+The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg
+name is not `"CUDA"`/`"ROCm"`/`"MUSA"` forces `fused_gdn_ar = fused_gdn_ch =
+auto_fgdn = false`. As each backend gains kernels this must become capability-
+driven, in one of two ways:
+- minimal: add the backend's reg name (e.g. `"Metal"`) to the allow-list once its
+  kernels + tightened supports_op ship; OR
+- clean (recommended upstream form): DELETE the name allow-list and make
+  `supports_op` authoritative - have the `auto_fgdn` resolution probe
+  `ggml_backend_dev_supports_op` on a representative node that carries the
+  discriminated `src[]` slots. Then routing falls out of the normal scheduler
+  fallback and no backend name is ever hard-coded. This also fixes 0030's stated
+  weakness: the upstream `auto_fgdn` check inspects only GATED_DELTA_NET nodes
+  and covers the discriminated SSM_CONV only incidentally.
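
The difference between the two gate forms can be sketched as follows. `Device`, `ProbeNode`, `fused_ok_by_name` and `fused_ok_by_probe` are hypothetical names for illustration; the real gate lives in `src/llama-context.cpp`, and the real probe would go through `ggml_backend_dev_supports_op` rather than a `std::function`.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for a compute device and a representative fused node.
struct ProbeNode { bool has_discriminated_src; };
struct Device {
    std::string reg_name;
    bool is_cpu;
    std::function<bool(const ProbeNode &)> supports_op;
};

// Allow-list form: the fused path survives only if every non-CPU device is named.
inline bool fused_ok_by_name(const std::vector<Device> & devs) {
    for (const Device & d : devs) {
        if (d.is_cpu) continue;
        if (d.reg_name != "CUDA" && d.reg_name != "ROCm" && d.reg_name != "MUSA") return false;
    }
    return true;
}

// Capability form: ask each non-CPU device about a node carrying the extra src slots.
inline bool fused_ok_by_probe(const std::vector<Device> & devs) {
    const ProbeNode probe{true};
    for (const Device & d : devs) {
        if (d.is_cpu) continue;
        assert(static_cast<bool>(d.supports_op));  // callback must be set
        if (!d.supports_op(probe)) return false;
    }
    return true;
}
```

In the capability form, a newly ported backend enables the fused path simply by answering the probe truthfully, which is only safe once its `supports_op` has been tightened.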
+
+--------------------------------------------------------------------------------
+## 3. Bit-exactness per backend (the md5 gate question)
+
+Feasible on ALL THREE, and not actually constraining, because of how the gate is
+scoped:
+
+- The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has
+  its own f32 reduction order (Metal `simd_sum`, Vulkan subgroup reduce, SYCL
+  `warp_reduce_sum`) that differs from CUDA's and from CPU's. There is no
+  cross-backend md5 and none is expected.
+- The relevant per-backend invariant is: the FUSED variant must equal that
+  backend's OWN non-fused path. The fusions change only the read source
+  (gather -> indexed read; the gather is a value-preserving memcpy) and the write
+  target (appended output -> in-place cache slot). They do NOT touch the
+  per-column FMA/reduce order. So the fused op is bit-identical to the
+  non-fused op on the same backend BY CONSTRUCTION.
+- Two arithmetic details each port MUST preserve exactly: (a) the conv
+  ascending-tap order plus the `+0.0f` that matches plain `ssm_conv`'s
+  `sumf += b` with b==0; (b) the existing GDN per-column subgroup reduce (do not
+  re-order it). Get those right and `test-backend-ops` (backendX-vs-CPU, already
+  registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS /
+  GATED_DELTA_NET) is the per-backend gate.
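
The "by construction" argument can be illustrated on the host with a toy reduce. This is a sketch under stated assumptions: `nonfused`, `fused`, and the two reduce helpers are hypothetical, and a plain sequential loop stands in for a backend's subgroup reduce. The gather is a value-preserving copy, so changing only the read source leaves the result bit-identical, while changing the summation order can change it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Fixed ascending-order f32 reduce; stands in for a backend's own reduce order.
inline float reduce_ascending(const float * x, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += x[i];
    return acc;
}

// Same values, opposite order: may round differently.
inline float reduce_descending(const float * x, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = n; i-- > 0;) acc += x[i];
    return acc;
}

// Non-fused shape: gather the row with a value-preserving copy, then reduce.
inline float nonfused(const std::vector<float> & cache, std::size_t row, std::size_t n) {
    assert((row + 1) * n <= cache.size());
    std::vector<float> gathered(n);
    std::memcpy(gathered.data(), cache.data() + row * n, n * sizeof(float));
    return reduce_ascending(gathered.data(), n);
}

// Fused shape: read the same row through an index, with the same reduce order.
inline float fused(const std::vector<float> & cache, std::size_t row, std::size_t n) {
    assert((row + 1) * n <= cache.size());
    return reduce_ascending(cache.data() + row * n, n);
}

// Raw bit pattern of a float, for bit-exact comparison.
inline std::uint32_t bits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}
```

Comparing `bits(fused(...))` with `bits(nonfused(...))` is the host-side analogue of the per-backend invariant; comparing the ascending and descending reduces shows why the reduce order itself must not be touched.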
+
+--------------------------------------------------------------------------------
+## 4. Upstream path and ranked recommendation
+
+### Ops-first, then one PR per backend (NOT one big PR)
+
+Recommended sequence:
+
+1. PR #1 - OPS (already essentially done, upstreamable as-is): the `ggml.h`/
+   `ggml.c` builders, the CPU reference kernels, the CUDA kernels, the
+   `test-backend-ops` cases, and the capability-driven gate (the clean
+   `supports_op`-authoritative version of 0030). This is independently mergeable
+   and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed
+   that way).
+2. PR #2 - Metal kernels + wiring.
+3. PR #3 - SYCL kernels + wiring.
+4. PR #4 - Vulkan kernels + wiring.
+
+Do NOT bundle the backends: each needs its own hardware to validate
+`test-backend-ops`, reviewers are backend-specialized, and a regression in one
+must not block the others.
+
+### Value x effort ranking (which backend first)
+
+| backend | user base / value          | author effort | bit-exact difficulty | net rank |
+|---------|----------------------------|---------------|----------------------|----------|
+| Metal   | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32-wide simdgroup) | **1st** |
+| SYCL    | LOW-MED (Intel GPU)        | LOWEST (near-verbatim CUDA mirror) | LOW   | **2nd** |
+| Vulkan  | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | **3rd** |
+
+Recommendation: **Metal first.** It banks the biggest user-facing decode win at
+medium effort, the base GDN + conv kernels already exist, and Apple's fixed
+simdgroup width makes bit-exactness the simplest. **SYCL second** as a cheap,
+nearly mechanical follow-on (the port is a line-for-line CUDA mirror, so it is
+low-cost insurance even though the Intel-GPU audience is smaller). **Vulkan last**
+as the high-effort / high-breadth capstone - it reaches the widest hardware
+(AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the
+existing variant matrix, the subgroup-width variance, and the per-vendor
+validation burden mean it is best attempted once the pattern is proven on
+Metal + SYCL.
+
+A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR
+(both are register-snapshot ports with no shader-gen step) and treat Vulkan as a
+separate later effort.
+
+--------------------------------------------------------------------------------
+## 5. Summary
+
+- GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL
+  (the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all
+  three today via the non-fused path; Layer-2 is about the decode SPEEDUP.
+- Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids
+  read (OP B) to the existing GDN kernel, write ONE new conv-update kernel
+  (OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten
+  supports_op + the op-handler branch + (Vulkan) the pipeline/push-constant/
+  descriptor wiring. The builders, CPU refs, model graph and tests are shared and
+  already done.
+- Bit-exactness is feasible everywhere and per-backend by construction (the
+  fusions redirect addresses, not the f32 reduction order); `test-backend-ops`
+  (backendX-vs-CPU) is the gate.
+- Sequence: ops-first PR (incl. the capability-driven replacement for 0030's
+  name allow-list), then Metal, then SYCL, then Vulkan.
--- a/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
+++ b/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
@@ -0,0 +1,17 @@
+model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
+q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
+q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
+q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
+q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
+q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
+q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
+q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
+q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
+q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
+q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
+q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
+q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
+q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
+q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
+q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
+q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64
--- a/backend/cpp/llama-cpp-localai-paged/docs/paged-burst-bench.cpp
+++ b/backend/cpp/llama-cpp-localai-paged/docs/paged-burst-bench.cpp
@@ -0,0 +1,217 @@
+// Paged-pool burst-degradation repro (patch 0024). DEV SCAFFOLDING ONLY.
+//
+// Reproduces, at the libllama level, the two host-side defects behind the
+// "later lower-npl prefill collapses, decode fine, restart cures it" benchmark
+// signature:
+//
+//   * RECLAMATION GAP (Fix-1): a partial tail seq_rm(seq, p0>0, -1) - exactly
+//     what llama-server issues on every reused slot - frees the kv-cache CELLS
+//     but the paged manager keeps owning the trailing BLOCKS. The manager's
+//     free pool silently shrinks. Test A measures the reclaimed-block delta.
+//
+//   * FRAGMENTATION / NO COMPACTION (Fix-2): a high-fan-out burst that allocates
+//     many sequences and frees them in a scrambled order leaves the free queue a
+//     scrambled permutation of physical block ids. A later low-npl prefill then
+//     pops physically scattered blocks, so its KV scatter-write + in-kernel
+//     paged-attention gather lose locality and prefill throughput collapses;
+//     decode (single-token append) barely notices. Test B times an npl8 prefill
+//     on a FRESH pool vs an npl8 prefill AFTER a scrambling burst+drain.
+//
+// PASS (post-fix): Test A reclaims ceil(PP/bs) - ceil(KEEP/bs) trailing blocks on
+// the partial seq_rm (0 pre-fix); Test B's post-burst npl8 prefill_tps is within
+// ~10% of the fresh npl8 and num_free returns to the pristine value after the drain.
+//
+// Run with LLAMA_KV_PAGED=1. Env: BURST_NSLOT(64) NPL(8) PP(512) KEEP(256)
+// GEN(4) PAGED_NGL(99). All sequences use distinct content so nothing is shared.
+
+#include "llama.h"
+#include "paged-prefix-api.h"
+
+#include <chrono>
+#include <clocale>
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <vector>
+
+static int env_i(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }
+
+using clk = std::chrono::steady_clock;
+static double secs(clk::time_point a, clk::time_point b) {
+    return std::chrono::duration<double>(b - a).count();
+}
+
+struct Ctx { llama_context * ctx; llama_memory_t mem; llama_batch batch; int n_vocab; };
+
+// Deterministic, content-distinct token for (seq, pos): keeps every sequence's
+// blocks unique so no cross-request prefix sharing masks the accounting.
+static llama_token tok_of(int seq, int pos, int n_vocab) {
+    return (llama_token) (((seq * 1000003 + pos * 131 + 7) % (n_vocab - 200)) + 100);
+}
+
+// Prefill n tokens of seq at [pos0, pos0+n) in one ubatch (n <= n_batch).
+// Returns wall seconds (sync'd).
+static double prefill(Ctx & C, int seq, int pos0, int n) {
+    clk::time_point t0 = clk::now();
+    C.batch.n_tokens = 0;
+    for (int j = 0; j < n; ++j) {
+        int i = C.batch.n_tokens;
+        C.batch.token[i]    = tok_of(seq, pos0 + j, C.n_vocab);
+        C.batch.pos[i]      = pos0 + j;
+        C.batch.n_seq_id[i] = 1;
+        C.batch.seq_id[i][0]= seq;
+        C.batch.logits[i]   = (j + 1 == n) ? 1 : 0;
+        C.batch.n_tokens++;
+    }
+    if (llama_decode(C.ctx, C.batch)) { fprintf(stderr, "prefill decode failed seq=%d\n", seq); return -1; }
+    llama_synchronize(C.ctx);
+    return secs(t0, clk::now());
+}
+
+// One decode step (single token) for seq at pos.
+static void decode1(Ctx & C, int seq, int pos) {
+    C.batch.n_tokens = 1;
+    C.batch.token[0] = tok_of(seq, pos, C.n_vocab);
+    C.batch.pos[0]   = pos; C.batch.n_seq_id[0] = 1; C.batch.seq_id[0][0] = seq; C.batch.logits[0] = 1;
+    if (llama_decode(C.ctx, C.batch)) fprintf(stderr, "decode1 failed seq=%d\n", seq);
+}
+
+int main(int argc, char ** argv) {
+    std::setlocale(LC_NUMERIC, "C");
+    const char * model_path = nullptr;
+    for (int i = 1; i < argc; ++i) if (!strcmp(argv[i], "-m") && i + 1 < argc) model_path = argv[++i];
+    if (!model_path) { fprintf(stderr, "usage: %s -m model.gguf\n", argv[0]); return 2; }
+
+    const int NSLOT = env_i("BURST_NSLOT", 64);
+    const int NPL   = env_i("NPL", 8);
+    const int PP    = env_i("PP", 512);
+    const int KEEP  = env_i("KEEP", 256);
+    const int GEN   = env_i("GEN", 4);
+    const int ngl   = env_i("PAGED_NGL", 99);
+    const bool paged = getenv("LLAMA_KV_PAGED") != nullptr;
+
+    ggml_backend_load_all();
+    llama_model_params mp = llama_model_default_params();
+    mp.n_gpu_layers = ngl;
+    llama_model * model = llama_model_load_from_file(model_path, mp);
+    if (!model) { fprintf(stderr, "model load failed\n"); return 1; }
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+    const int n_vocab = llama_vocab_n_tokens(vocab);
+
+    // Pool sized for the burst plus headroom so the burst fits but a later npl
+    // run draws from whatever the burst's churn left behind.
+    const long cells = (long) (NSLOT + NPL + 4) * (PP + GEN + 16);
+    llama_context_params cp = llama_context_default_params();
+    cp.n_ctx     = (uint32_t) cells;
+    cp.n_batch   = (uint32_t) (PP + 16);
+    cp.n_ubatch  = (uint32_t) (PP + 16);
+    cp.n_seq_max = NSLOT + NPL + 2;
+    cp.kv_unified = true;     // one unified stream-0 pool -> num_free(ctx) is the whole pool
+    cp.no_perf   = true;
+    llama_context * ctx = llama_init_from_model(model, cp);
+    if (!ctx) { fprintf(stderr, "ctx init failed (cells=%ld)\n", cells); return 1; }
+
+    Ctx C; C.ctx = ctx; C.mem = llama_get_memory(ctx); C.n_vocab = n_vocab;
+    C.batch = llama_batch_init(cp.n_batch, 0, 1);
+
+    printf("== paged-burst-bench == paged=%d NSLOT=%d NPL=%d PP=%d KEEP=%d GEN=%d n_ctx=%ld\n",
+           paged, NSLOT, NPL, PP, KEEP, GEN, cells);
+
+    llama_memory_clear(C.mem, true);
+    // NOTE: may read 0 here if no paged manager is registered yet (see Test C).
+    const long F_start = paged_prefix_api::num_free_global();
+
+    // ---- Test A: Fix-1 reclamation gap on a partial tail seq_rm --------------
+    {
+        if (prefill(C, 0, 0, PP) < 0) return 1;
+        const long f_after_prefill = paged_prefix_api::num_free_global();
+        llama_memory_seq_rm(C.mem, 0, KEEP, -1);          // partial tail removal
+        const long f_after_rm = paged_prefix_api::num_free_global();
+        llama_memory_seq_rm(C.mem, 0, -1, -1);            // full free -> pristine
+        const long f_after_full = paged_prefix_api::num_free_global();
+        const long bs = 16;  // assumed paged block size; must match the manager's
+        const long expect = (PP + bs - 1)/bs - (KEEP + bs - 1)/bs; // trailing blocks
+        printf("[TEST-A Fix-1] start=%ld afterPrefill=%ld afterPartialRm=%ld reclaimed=%ld "
+               "(expect %ld post-fix, 0 pre-fix)  afterFullFree=%ld\n",
+               F_start, f_after_prefill, f_after_rm, f_after_rm - f_after_prefill, expect, f_after_full);
+    }
+
+    // ---- Test B: fragmentation -> npl prefill collapse -----------------------
+    // Fresh npl prefill baseline on a pristine pool.
+    llama_memory_clear(C.mem, true);
+    double tps_fresh;
+    {
+        clk::time_point t0 = clk::now();
+        long ntok = 0;
+        for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
+        tps_fresh = ntok / secs(t0, clk::now());
+        for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
+    }
+    const long F_pristine = paged_prefix_api::num_free_global();
+
+    // High-fan-out burst: allocate NSLOT sequences, each prefilled + a few decode
+    // steps (mixed alloc), then drain them in a scrambled order (odd ids first,
+    // then even, each truncated before the full free) so the free queue becomes a
+    // scrambled permutation - the fragmentation the bug never compacts.
+    for (int s = 0; s < NSLOT; ++s) {
+        if (prefill(C, NPL + s, 0, PP) < 0) return 1;
+        for (int g = 0; g < GEN; ++g) decode1(C, NPL + s, PP + g);
+    }
+    const long F_during_burst = paged_prefix_api::num_free_global();
+    // Drain: partial tail seq_rm (the reused-slot pattern) then full free, in a
+    // scrambled slot order to scramble the physical free order.
+    for (int parity = 1; parity >= 0; --parity)
+        for (int s = 0; s < NSLOT; ++s) if ((s & 1) == parity) {
+            llama_memory_seq_rm(C.mem, NPL + s, KEEP, -1);   // partial (Fix-1 path)
+            llama_memory_seq_rm(C.mem, NPL + s, -1, -1);     // full free
+        }
+    const long F_after_drain = paged_prefix_api::num_free_global();
+
+    // Post-burst npl prefill: pops from the (pre-fix scrambled / post-fix
+    // defragged) free queue.
+    double tps_post;
+    {
+        clk::time_point t0 = clk::now();
+        long ntok = 0;
+        for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
+        tps_post = ntok / secs(t0, clk::now());
+        for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
+    }
+
+    const double ratio = tps_fresh > 0 ? tps_post / tps_fresh : 0;
+    printf("[TEST-B frag] num_free: start=%ld pristine=%ld duringBurst=%ld afterDrain=%ld "
+           "(afterDrain==pristine? %s)\n",
+           F_start, F_pristine, F_during_burst, F_after_drain,
+           F_after_drain == F_pristine ? "YES" : "NO");
+    printf("[TEST-B frag] prefill_tps fresh=%.1f post-burst=%.1f  ratio=%.3f "
+           "(PASS if >=0.90)\n", tps_fresh, tps_post, ratio);
+
+    // ---- Test C: idle-slot retention leak -> reclaim (the Fix-3 scenario) -----
+    // Burst NSLOT sequences and leave them IDLE (stock llama-server keeps an idle
+    // slot's KV; the blocks are stranded). F_idle shows the depleted pool a later
+    // low-npl run would see. Then full-seq_rm each (exactly what Fix-3's
+    // prompt_clear() issues at slot.release): F_reclaimed must return to pristine.
+    llama_memory_clear(C.mem, true);
+    // Touch the pool once so the manager exists, then read the full-pool size
+    // (num_free is 0 while no manager is registered).
+    if (prefill(C, 0, 0, 16) < 0) return 1;
+    llama_memory_seq_rm(C.mem, 0, -1, -1);
+    const long F_pre_c = paged_prefix_api::num_free_global();
+    for (int s = 0; s < NSLOT; ++s) { if (prefill(C, NPL + s, 0, PP) < 0) return 1; }
+    const long F_idle = paged_prefix_api::num_free_global();
+    for (int s = 0; s < NSLOT; ++s) llama_memory_seq_rm(C.mem, NPL + s, -1, -1); // Fix-3 release
+    const long F_reclaimed = paged_prefix_api::num_free_global();
+    printf("[TEST-C idle] pristine=%ld idle_after_burst=%ld (leaked=%ld) reclaimed=%ld "
+           "(returns_to_fresh? %s)\n",
+           F_pre_c, F_idle, F_pre_c - F_idle, F_reclaimed,
+           F_reclaimed == F_pre_c ? "YES" : "NO");
+
+    printf("RESULT paged=%d frag_fix2_ratio=%.3f drain_numfree_returns=%s idle_reclaim_returns=%s\n",
+           paged, ratio,
+           F_after_drain == F_pristine ? "YES" : "NO",
+           F_reclaimed == F_pre_c ? "YES" : "NO");
+
+    llama_batch_free(C.batch);
+    llama_free(ctx);
+    llama_model_free(model);
+    return 0;
+}
--- a/backend/cpp/llama-cpp-localai-paged/docs/paged-reclaim-unit.cpp
+++ b/backend/cpp/llama-cpp-localai-paged/docs/paged-reclaim-unit.cpp
@@ -0,0 +1,59 @@
+// Host-side unit test for the paged-pool burst-reclaim fix (patch 0024).
+// Compiles paged-kv-manager.cpp directly; no ggml / llama / GPU dependency.
+//
+//   Fix-1  PagedKVManager::truncate(seq, n_keep) reclaims the trailing blocks
+//          beyond ceil(n_keep/bs) (ref-counted), so a partial tail seq_rm no
+//          longer strands blocks whose cells were cleared.
+//   Fix-2  defrag_free_pool() relinks the free queue into ascending block-id
+//          order once the pool is fully idle, undoing a burst's scrambled frees
+//          so a later prefill pops physically contiguous blocks again.
+
+#include "paged-kv-manager.h"
+#include <cstdio>
+
+using paged::PagedKVManager;
+
+int main() {
+    int rc = 0;
+
+    // ---- Fix-1: truncate reclaims the trailing block suffix -----------------
+    {
+        PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/true);
+        const size_t f0 = m.num_free_blocks();   // 63 (block 0 reserved as null)
+        m.allocate(0, 512);                       // ceil(512/16)=32 blocks
+        const size_t f1 = m.num_free_blocks();    // 31
+        m.truncate(0, 256);                       // keep ceil(256/16)=16, free 16
+        const size_t f2 = m.num_free_blocks();    // 47
+        printf("[unit Fix-1] free=%zu alloc512=%zu truncate256=%zu reclaimed=%zu (expect 16)\n",
+               f0, f1, f2, f2 - f1);
+        if (f2 - f1 != 16) rc = 1;
+        m.truncate(0, 16);                        // keep 1 block, free 15 more
+        const size_t f3 = m.num_free_blocks();    // 62
+        printf("[unit Fix-1] truncate16=%zu (expect %zu)\n", f3, f0 - 1);
+        if (f3 != f0 - 1) rc = 1;
+        m.free(0);
+        if (m.num_free_blocks() != f0) { printf("[unit Fix-1] free mismatch\n"); rc = 1; }
+    }
+
+    // ---- Fix-2: defrag restores ascending popleft order ---------------------
+    {
+        PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/false);
+        for (int s = 0; s < 8; ++s) m.allocate(s, 16);          // pop blocks 1..8
+        const int scrambled[8] = {3, 7, 1, 5, 0, 6, 2, 4};      // free out of order
+        for (int i = 0; i < 8; ++i) m.free(scrambled[i]);
+        m.defrag_free_pool();                                    // all idle -> compact
+        m.allocate(100, 16 * 3);                                 // pop 3 blocks
+        const auto bt = m.block_table(100);
+        bool asc = true;
+        printf("[unit Fix-2] post-defrag block_table:");
+        for (size_t i = 0; i < bt.size(); ++i) {
+            printf(" %d", bt[i]);
+            if (i && bt[i] < bt[i - 1]) asc = false;
+        }
+        printf("  ascending=%s (expect YES)\n", asc ? "YES" : "NO");
+        if (!asc) rc = 1;
+    }
+
+    printf("UNIT %s\n", rc == 0 ? "PASS" : "FAIL");
+    return rc;
+}
--- a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png
+++ b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png
--- a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png
+++ b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png