chore(paged): keep patches/ patch-only; README to backend root, docs to docs/

The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv, dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv. Restore the invariant that patches/ holds only the .patch series.

Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md, final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)

Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)

Repoint every reference to the moved files: README internal links (docs/ + the .github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md, .github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml, the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml, docs/content/features/backends.md, gallery/index.yaml.

The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged) is unchanged and still resolves to the 28 patches.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
docs(agents): add paged-backend maintenance + vLLM-parity methodology skills
2026-06-27 09:57:14 -04:00 · 2026-06-27 13:20:05 +00:00 · 2026-06-27 12:58:01 +00:00 · 2026-06-27 12:29:15 +00:00 · 2026-06-27 12:18:11 +00:00 · 2026-06-27 12:11:24 +00:00
74 changed files with 13188 additions and 30 deletions
--- a/.agents/llama-cpp-localai-paged-backend.md
+++ b/.agents/llama-cpp-localai-paged-backend.md
@@ -0,0 +1,96 @@
+# llama-cpp-localai-paged Backend (paged attention + Blackwell NVFP4 decode)
+
+`llama-cpp-localai-paged` is LocalAI's **CUDA-only** paged-attention variant of the
+llama.cpp backend. It targets high-concurrency decode for the Qwen3.6 hybrid
+gated-DeltaNet (SSM) models on Blackwell (GB10 / DGX Spark). It reuses the stock
+`llama-cpp` backend's sources and applies a vendored patch series on top at build
+time. It is **not** a fork: a source-only `*.patch` stack plus one canonical doc.
+
+**Canonical reference:** `backend/cpp/llama-cpp-localai-paged/README.md`
+(architecture, the patch series 0001-0030, benchmarks, dev notes, generality,
+pin/canary policy). Read it for any technical detail; this guide is the maintenance
+how-to.
+
+## Where things live
+
+- `backend/cpp/llama-cpp-localai-paged/Makefile` - the thin wrapper. It copies the
+  stock `backend/cpp/llama-cpp/` build infra into a build dir, clones llama.cpp at
+  this backend's **own** pin (`LLAMA_VERSION`), applies the paged series via the
+  `apply-paged-patches` define (strict `git apply`), then builds `grpc-server`.
+- `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch`
+  series (0001-0030), nothing else.
+- `backend/cpp/llama-cpp-localai-paged/README.md` - the canonical doc. The
+  operational docs (`PIN_SYNC_*.md`, `PAGED_BITEXACT_NOTE.md`,
+  `UPSTREAM_LAYER2_SCOPE.md`) and dev artifacts live in
+  `backend/cpp/llama-cpp-localai-paged/docs/`.
+- `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh`
+  - the CUDA build entry points.
+- `backend/cpp/llama-cpp/` - the **stock** backend, pure upstream. It carries no
+  paged patches.
+
+## Invariants (do not break these)
+
+- **Stock stays pure.** The paged patches live ONLY in this backend. Never add a
+  `patches/paged/` dir or `LLAMA_PAGED` logic to `backend/cpp/llama-cpp/`.
+- **CUDA-only.** Ship cublas/cuda targets only. Off-CUDA the fusions are gated off
+  (patch 0030) and NVFP4 falls back to dequant, so the backend is
+  neutral-to-slightly-negative there - non-CUDA users use the stock `llama-cpp`. Do not add
+  cpu/vulkan/sycl/metal rows for this backend in `.github/backend-matrix.yml`.
+  (Those builds also fail to link `grpc-server` on darwin/arm64 against upstream
+  `stream_*` server symbols - another reason it is CUDA-only.)
+- **Source-only patches.** A `.patch` may touch only llama.cpp source - never a
+  dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A
+  stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.)
+- **Bit-exact by default.** Every shipped patch keeps output byte-identical to the
+  f32 baseline. The one opt-in precision trade (`ssm_bf16_tau`, patch 0026)
+  defaults off; never put it in a recommended/gallery config.
+
+## Maintaining the pin against new llama.cpp
+
+The pin (`LLAMA_VERSION` in the wrapper Makefile) is advanced ONLY by the manual
+pin-sync. It is deliberately **excluded from the nightly auto-bumper**
+(`bump_deps.yaml`): a naive bump would shift the tree out from under the patches
+and break `git apply` at build time.
+
+1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml`
+   runs weekly: it applies + builds the series against the latest upstream tip and
+   goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync.
+2. **The pin-sync** (recorded in `docs/PIN_SYNC_*.md`): rebase the series onto the new
+   tip (resolve conflicts; re-export **source-only** with a pathspec like
+   `-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box,
+   pass the bit-exact gate on **every** path + `test-backend-ops`, then bump
+   `LLAMA_VERSION`. The 9d5d882d -> c299a92c bump (23 upstream commits) needed zero
+   patch changes; bumps are usually offset-tolerant (git apply absorbs offsets).
+
+## The bit-exact gate (run for every change)
+
+- greedy md5: `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1 </dev/null | md5sum`,
+  paged paths prefixed `LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged
+  MoE). Must match the recorded baseline. Redirect stdin from `/dev/null` or
+  `llama-completion` hangs in conversation mode.
+- `test-backend-ops` (CUDA0 vs CPU oracle) for every touched op (`SSM_CONV*`,
+  `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
+- **The gate is per-path.** The paged-MoE md5 differs from the non-paged md5 - a
+  benign, KL-validated FP-accumulation-order difference (see `docs/PAGED_BITEXACT_NOTE.md`).
+  Compare a paged-MoE change to the **paged** reference, not the non-paged one.
+
+## Encapsulating your work
+
+- When you change a patch, regenerate the `.patch` (source-only) and keep the dev
+  tree and this worktree byte-identical. Commit both with sign-off.
+- New optimization -> next patch number (gaps 0005/0027 are intentional). Update
+  the README's patch table and dev notes - keep the README the single doc; do not
+  scatter `*_RESULTS.md` files.
+- Record rejected/flat levers in the README too (they stop the next person from
+  re-running dead ends).
+
+## Follow-ups (Metal / SYCL / Vulkan)
+
+The decode fusions are implemented for **CUDA + CPU only**. The base
+gated-DeltaNet + SSM_CONV ops already exist upstream on Metal, SYCL, and Vulkan,
+so the models **run** there via the non-fused path - what is missing is the
+fusion speedup. Porting it (strictly mirroring the CUDA kernels, since we have no
+Metal/SYCL/Vulkan hardware to test on here) is scoped in `docs/UPSTREAM_LAYER2_SCOPE.md`
+(recommended order: Metal, then SYCL, then Vulkan; ops-first upstream PR, then one
+PR per backend, each gated by `test-backend-ops` on the target hardware). The
+methodology for that work is in [.agents/vllm-parity-methodology.md](vllm-parity-methodology.md).
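The source-only invariant in the guide above can be checked mechanically before a regenerated patch is committed. The following is a minimal sketch, not something this commit ships: the function name `patch_is_source_only` is hypothetical, and the allow-list of prefixes is an assumption copied from the pathspec example in the pin-sync step. It is a heuristic that reads the `+++ b/<path>` header lines of a git-format patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): succeed only if every file a patch
# touches sits under an allowed llama.cpp source prefix.
# ASSUMPTION: the prefix list mirrors the pathspec example from the pin-sync step.
ALLOWED_PREFIXES='src/ ggml/ common/ include/ tools/ tests/ cmake/'

patch_is_source_only() {
  local patch="$1" path prefix ok
  # "+++ b/<path>" names the post-image of each file in a git-format patch.
  # Deleted files ("+++ /dev/null") do not match and are skipped.
  while IFS= read -r path; do
    ok=0
    for prefix in $ALLOWED_PREFIXES; do
      case "$path" in
        "$prefix"*) ok=1; break ;;
      esac
    done
    if [ "$ok" -eq 0 ]; then
      echo "not source-only: $path" >&2
      return 1
    fi
  done < <(sed -n 's|^+++ b/||p' "$patch")
  return 0
}
```

A possible use is a loop over `patches/paged/0*.patch` that stops at the first patch for which the function fails; strict `git apply` on a clean checkout remains the authoritative check.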
--- a/.agents/vllm-parity-methodology.md
+++ b/.agents/vllm-parity-methodology.md
@@ -0,0 +1,87 @@
+# Methodology: Closing the vLLM Decode-Throughput Gap in llama.cpp
+
+This is the playbook that took the paged backend
+([.agents/llama-cpp-localai-paged-backend.md](llama-cpp-localai-paged-backend.md))
+from ~38% of vLLM decode to **parity-to-ahead on dense** (and a proven, honest
+ceiling on MoE) on GB10. Use it for any "make llama.cpp match or beat engine X on
+accelerator Y" effort. The *levers* are model- and hardware-specific; the
+*discipline* below is not. The worked example, with all numbers, is the paged
+backend README.
+
+## The core loop
+
+1. **Establish a bit-exact baseline and gate FIRST.** Record the greedy md5 (per
+   path) and an f32 reference. Every optimization must stay byte-identical to it -
+   or ship as an explicit, default-off precision opt-in. This is what lets you
+   optimize aggressively without silently regressing quality. Gate two ways:
+   greedy md5, and `test-backend-ops` against the CPU oracle.
+
+2. **Profile - do not assume.** nsys the steady-state decode step, broken down per
+   *kernel* AND per *memcpy*. Find the dominant cost. "It's the GEMM" was wrong
+   here: on hybrid gated-DeltaNet models the bottleneck was the recurrent-state
+   **plumbing** (state memcpy + gathers, ~67% of the step), not the weight GEMM.
+   Also sanity-check GPU-busy %: an early "low utilization" reading was a profiling
+   window artifact (decode was 96-99% GPU-busy), not real idle.
+
+3. **Ground-truth BOTH engines.** Decompose *your* decode step AND the
+   competitor's, side by side, per bucket, and compute the per-bucket delta. This
+   tells you WHERE the gap actually is - not where you would guess. It overturned
+   premises here: e.g. vLLM does NOT run the GDN/attn projections as NVFP4 (it
+   keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.
+
+4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
+   same-session A/B bench (patched-off vs patched-on, identical harness = an exact
+   measure). Bank only what lifts AND gates. **Record every rejected or flat lever
+   with the reason** - over time this is the most valuable part: it stops the next
+   person re-running dead ends.
+
+5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
+   lever measured, not assumed). What remains is physical - the memory-bandwidth
+   floor, the irreducible serial-SSM host loop (sampling can't start until logits
+   land). Name it; do not claim more than you measured.
+
+## Hard rules learned
+
+- **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
+  (`llama-batched-bench`) is exact - lead with it. Cross-engine "% of vLLM"
+  (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
+  and config (context length alone shifted the MoE figure 76% <-> 86%).
+- **The win may be a precision trade, not a free lever.** bf16 SSM state was +12%
+  but failed the f32 KL gate (vLLM keeps f32 too), so it ships default-off opt-in -
+  never in a recommended config.
+- **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the
+  critical path benches FLAT (the freed time becomes idle). Quantizing the bf16
+  projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason.
+  Always measure before believing; a plausible mechanism is not a result.
+- **The gate can be per-path.** Paged vs non-paged attention legitimately produces
+  different (equivalent) FP-reduction orders; validate the difference is benign
+  (KLD to f32) and then gate each path against its own reference.
+
+## Orchestration (multi-agent)
+
+- **One GPU profiler/bencher at a time** (the GPU-contention rule). Parallel
+  design/analysis/read agents are fine; concurrent GPU benches pollute each other's
+  numbers.
+- **Adversarial verify.** Before banking a finding, spawn skeptics prompted to
+  *refute* it; majority-refute kills it. Prevents plausible-but-wrong results.
+- **Anti-punt.** Use foreground, blocking ssh loops with short benches and a
+  progress-file checkpoint. Agents that background work and "wait for the monitor
+  event" stall - forbid that pattern.
+- **GPU coexistence.** On a shared host, stop the user's deployments for a clean
+  benchmark window (with their OK) and ALWAYS restore them (wrap the bench so a
+  failure cannot strand them).
+
+## What generalizes (and what doesn't)
+
+The *speedups* may be hardware-specific (here: CUDA/Blackwell - the SSM fusions,
+NVFP4 FP4-MMA, the occupancy tune), which is why other accelerators did not
+benefit. But the *findings* often generalize and are worth upstreaming: the
+"decode is plumbing-bound, not GEMM-bound" insight and the bit-exact, CPU-mirrored
+fusion ops help any backend running these models. Separate "ship our tuned backend"
+from "upstream the portable op" - they are different deliverables.
+
+## The closing record
+
+Write up the result HONESTLY: the shipped wins, the rejected levers (with reasons),
+the structural ceiling, and the cross-backend / cross-quant generality. Negative
+results are as valuable as wins. The paged backend README is the template.
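The per-path gate described in the methodology above (each path compared against its own recorded md5) can be sketched as a small wrapper. This is an illustration under stated assumptions: `gate_path` and the `<label>.md5` reference-file layout are hypothetical names, not taken from the repo. The wrapper redirects stdin from `/dev/null`, as the backend guide requires for `llama-completion`.

```shell
#!/usr/bin/env bash
# Hypothetical gate helper (name and file layout are assumptions).
# Usage: gate_path <path-label> <reference-dir> <command> [args...]
# Runs the command with stdin from /dev/null, hashes its stdout, and compares
# the hash with the md5 recorded in <reference-dir>/<path-label>.md5.
gate_path() {
  local label="$1" refdir="$2"
  shift 2
  local want got
  want="$(cat "$refdir/$label.md5")" || return 2
  got="$("$@" </dev/null | md5sum | cut -d' ' -f1)"
  if [ "$got" = "$want" ]; then
    echo "gate[$label]: OK"
  else
    echo "gate[$label]: MISMATCH (want $want, got $got)" >&2
    return 1
  fi
}
```

For a paged path the command could be prefixed with `env`, for example `gate_path paged refs env LLAMA_KV_PAGED=1 llama-completion ...`, so that the paged run is only ever compared with the paged reference.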
--- a/.docker/llama-cpp-localai-paged-compile.sh
+++ b/.docker/llama-cpp-localai-paged-compile.sh
@@ -0,0 +1,39 @@
+#!/usr/bin/env bash
+# Shared compile logic for backend/Dockerfile.llama-cpp-localai-paged.
+# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
+
+set -euxo pipefail
+
+export CCACHE_DIR=/root/.ccache
+ccache --max-size=5G || true
+ccache -z || true
+
+export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
+
+if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
+  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
+  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
+  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
+  rm -rf /LocalAI/backend/cpp/llama-cpp-localai-paged-*-build
+fi
+
+cd /LocalAI/backend/cpp/llama-cpp-localai-paged
+
+if [ -z "${BUILD_TYPE:-}" ]; then
+  # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
+  # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
+  if [ "${TARGETARCH:-}" = "arm64" ]; then
+    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
+    export CC=gcc-14 CXX=g++-14
+  fi
+  make llama-cpp-localai-paged-cpu-all
+else
+  # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
+  # does the compute. Keeps the GPU compile from also building the CPU variant matrix and
+  # avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
+  make llama-cpp-localai-paged-fallback
+fi
+make llama-cpp-localai-paged-grpc
+make llama-cpp-localai-paged-rpc-server
+
+ccache -s || true
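The `CUDA_ARCH_ESC` substitution in the script above backslash-escapes every semicolon in the architecture list. A minimal sketch of that one step in isolation follows; the function name `escape_cmake_list` is hypothetical, and the reason for the escaping (presumably so a multi-arch list survives later shell or make expansion as one argument) is an assumption, since the script does not state it.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the same parameter expansion the script uses.
# escape_cmake_list <value>: replace each ';' with '\;'.
escape_cmake_list() {
  local v="$1"
  printf '%s\n' "${v//;/\\;}"
}

escaped="$(escape_cmake_list '80;90')"
echo "-DCMAKE_CUDA_ARCHITECTURES=${escaped}"
```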
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -4881,6 +4881,67 @@ include:
    dockerfile: "./backend/Dockerfile.golang"
    context: "./"
    ubuntu-version: '2404'
+  # llama-cpp-localai-paged: the LocalAI paged-attention llama.cpp variant. Each
+  # row mirrors the corresponding llama-cpp row with backend/dockerfile/tag-suffix
+  # swapped; builder-base-image is left UNCHANGED so these reuse the same
+  # base-grpc-* prebuilt bases (same gRPC + same toolchain), needing no new
+  # base-images.yml variant.
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "8"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-12-llama-cpp-localai-paged'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
+    runs-on: 'bigger-runner'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "llama-cpp-localai-paged"
+    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp-localai-paged'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
+    runs-on: 'bigger-runner'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "llama-cpp-localai-paged"
+    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-arm64'
+    base-image: "ubuntu:24.04"
+    runs-on: 'ubuntu-24.04-arm'
+    ubuntu-version: '2404'
+    backend: "llama-cpp-localai-paged"
+    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
+    context: "./"
+  - build-type: 'cublas'
+    cuda-major-version: "12"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-arm64-llama-cpp-localai-paged'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-l4t-cuda-12-arm64'
+    base-image: "nvcr.io/nvidia/l4t-jetpack:r36.4.0"
+    runs-on: 'ubuntu-24.04-arm'
+    backend: "llama-cpp-localai-paged"
+    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
+    context: "./"
+    ubuntu-version: '2204'

 # Darwin matrix (consumed by backend-jobs-darwin).
 includeDarwin:
--- a/.github/scripts/paged-canary-apply.sh
+++ b/.github/scripts/paged-canary-apply.sh
@@ -0,0 +1,77 @@
+#!/usr/bin/env bash
+#
+# paged-canary-apply.sh - apply the vendored paged-attention patch series
+# (backend/cpp/llama-cpp-localai-paged/patches/paged/0001-0030) to a llama.cpp checkout, the
+# same way the build does, but tolerating the ONE known-benign pre-existing
+# quirk in the series. Used by the early-warning canary
+# (.github/workflows/llama-cpp-paged-canary.yml) so it only goes red on a REAL
+# upstream break, never on that quirk.
+#
+# Usage: paged-canary-apply.sh <llama.cpp-checkout-dir> <patches-dir>
+#   <patches-dir> is normally backend/cpp/llama-cpp-localai-paged/patches (it holds the
+#   top-level base series 0*.patch, currently empty, and the paged/ subseries).
+#
+# Exit 0  = the whole series applied -> patches still fit upstream.
+# Exit !=0 = a patch failed to apply  = the red signal: an upstream change moved
+#            the tree out from under the patches, so it is time to run a PIN_SYNC.
+#
+# Apply method MIRRORS backend/cpp/llama-cpp/Makefile's `llama.cpp` target:
+# plain `git apply --verbose`, which natively tolerates @@ line-number offsets
+# but NOT context-line changes. Matching the build's method is the point - the
+# canary's apply result is exactly what the real build's apply would do.
+#
+# The ONLY tolerance, and it is path-scoped (not a blanket `|| true`): patch
+# 0019 carries a stray *modify* hunk against the dev-only doc
+# SSM_DECODE_FIX_RESULTS.md, a file that exists only on the DGX dev tree and is
+# absent from any clean upstream checkout. `git apply` is atomic, so that single
+# missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028
+# build on 0019's code, the rejection cascades to them too. This is a
+# PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
+# upstream break (see backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
+# and backend/cpp/llama-cpp-localai-paged/README.md). We exclude ONLY that dev-doc path and still
+# apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
+# still fails the canary. prepare.sh tolerates the same hunk via
+# `patch ... || true`; this mirrors that tolerance precisely.
+
+set -euo pipefail
+
+CHECKOUT="${1:?usage: paged-canary-apply.sh <llama.cpp-checkout> <patches-dir>}"
+PATCHES="${2:?usage: paged-canary-apply.sh <llama.cpp-checkout> <patches-dir>}"
+
+# The lone tolerated dev-doc, and the only patch allowed to carry it.
+DEVDOC_GLOB='*SSM_DECODE_FIX_RESULTS.md'
+DEVDOC_PATCH='0019-qwen35-ssm-decode-fused-gather.patch'
+
+# Resolve to absolute paths so the apply works after we cd into the checkout.
+PATCHES="$(cd "$PATCHES" && pwd)"
+cd "$CHECKOUT"
+
+shopt -s nullglob
+
+apply_one() {
+  local p="$1"; shift
+  echo "paged-canary: applying $(basename "$p")"
+  if ! git apply --verbose "$@" "$p"; then
+    echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")"
+    echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md), do NOT bump the pin blindly"
+    exit 1
+  fi
+}
+
+# Base series first (parity with the build: patches/0*.patch before
+# patches/paged/0*.patch). Currently empty; nullglob makes this a no-op.
+for p in "$PATCHES"/0*.patch; do
+  apply_one "$p"
+done
+
+# Paged series, in order.
+for p in "$PATCHES"/paged/0*.patch; do
+  if [ "$(basename "$p")" = "$DEVDOC_PATCH" ]; then
+    # Apply 0019's real code hunks; exclude ONLY the benign dev-doc hunk.
+    apply_one "$p" --exclude="$DEVDOC_GLOB"
+  else
+    apply_one "$p"
+  fi
+done
+
+echo "paged-canary: the full paged patch series applied cleanly to the upstream tip"
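The header comment of the script above relies on two `git apply` behaviours: the apply is atomic, and `--exclude` skips only the matching path. The following self-contained sketch demonstrates both in throwaway temp repos. All file names (`code.c`, `DEV_NOTES.md`) are made up for the demo and stand in for a real source file and the dev-only doc.

```shell
#!/usr/bin/env bash
# Demo only: temp repos and file names are invented for illustration.
work="$(mktemp -d)"

# "Dev tree": has a source file AND a dev-only doc; the diff touches both.
git init -q "$work/dev"
printf 'int x = 1;\n' > "$work/dev/code.c"
printf 'notes v1\n'   > "$work/dev/DEV_NOTES.md"
git -C "$work/dev" add code.c DEV_NOTES.md
git -C "$work/dev" -c user.name=demo -c user.email=demo@example.invalid \
  commit -q -m base
printf 'int x = 2;\n' > "$work/dev/code.c"
printf 'notes v2\n'   > "$work/dev/DEV_NOTES.md"
git -C "$work/dev" diff > "$work/mixed.patch"

# "Clean checkout": only the source file exists, the dev doc is absent.
git init -q "$work/clean"
printf 'int x = 1;\n' > "$work/clean/code.c"

# Strict apply: the hunk against the missing doc rejects the whole patch.
if git -C "$work/clean" apply "$work/mixed.patch" 2>/dev/null; then
  strict=applied
else
  strict=rejected
fi
after_strict="$(cat "$work/clean/code.c")"

# Path-scoped apply: skip only the dev doc, still apply the code hunk.
if git -C "$work/clean" apply --exclude='*DEV_NOTES.md' "$work/mixed.patch"; then
  scoped=applied
else
  scoped=rejected
fi
after_scoped="$(cat "$work/clean/code.c")"

echo "strict=$strict ($after_strict) scoped=$scoped ($after_scoped)"
```

Because the exclusion is scoped to one path, a real break in the code hunk would still fail the second apply, which is the property the canary depends on.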
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -169,14 +169,14 @@ jobs:
      # invalidates cleanly; restore-keys fall back to the latest entry for the
      # same pin so unchanged TUs stay warm even when the cache is fresh.
      - name: Compute llama.cpp version
-        if: inputs.backend == 'llama-cpp'
+        if: inputs.backend == 'llama-cpp' || inputs.backend == 'llama-cpp-localai-paged'
        id: llama-version
        run: |
          version=$(grep '^LLAMA_VERSION' backend/cpp/llama-cpp/Makefile | head -1 | cut -d= -f2 | cut -d'?' -f1 | tr -d ' ')
          echo "version=${version}" >> "$GITHUB_OUTPUT"

      - name: Restore ccache
-        if: inputs.backend == 'llama-cpp'
+        if: inputs.backend == 'llama-cpp' || inputs.backend == 'llama-cpp-localai-paged'
        id: ccache-cache
        uses: actions/cache/restore@v4
        with:
@@ -186,7 +186,7 @@ jobs:
            ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}-

      - name: Configure ccache
-        if: inputs.backend == 'llama-cpp'
+        if: inputs.backend == 'llama-cpp' || inputs.backend == 'llama-cpp-localai-paged'
        run: |
          mkdir -p "$HOME/Library/Caches/ccache"
          ccache -M 2G
@@ -251,9 +251,14 @@ jobs:
          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend

      - name: ccache stats
-        if: inputs.backend == 'llama-cpp'
+        if: inputs.backend == 'llama-cpp' || inputs.backend == 'llama-cpp-localai-paged'
        run: ccache -s

+      # Only stock llama-cpp persists the ccache: both backends share the same
+      # ccache-llama-<arch>-<version>-<run_id> key, so the paged job restores from
+      # the shared prefix (warm) but must NOT also save under the identical key in
+      # the same run (it would collide). The shared upstream TUs stay warm via the
+      # stock save; the paged-only patched TUs are a small recompile.
      - name: Save ccache
        if: inputs.backend == 'llama-cpp' && github.event_name != 'pull_request'
        uses: actions/cache/save@v4
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -9,6 +9,23 @@ jobs:
    strategy:
      fail-fast: false
      matrix:
+        # NOTE: there is intentionally NO entry for the llama-cpp-localai-paged
+        # backend. It carries a vendored paged-attention patch series
+        # (backend/cpp/llama-cpp-localai-paged/patches/paged/) hand-verified bit-exact against
+        # ONE specific llama.cpp tip; a naive nightly bump would move the tip out
+        # from under the patches and break `git apply` at build time. Its pin is
+        # therefore decoupled (its own LLAMA_VERSION in
+        # backend/cpp/llama-cpp-localai-paged/Makefile) and advanced ONLY by the
+        # manual PIN_SYNC process. Do not add it here. (turboquant CAN be
+        # auto-bumped below because its fork branch carries the patches.)
+        #
+        # Excluding it from the auto-bumper removed the early warning of upstream
+        # drift; that signal is restored separately by the dedicated canary
+        # .github/workflows/llama-cpp-paged-canary.yml, which weekly applies +
+        # compiles the paged series against the latest llama.cpp tip and goes red
+        # when upstream breaks it (prompting a PIN_SYNC). The canary is
+        # signal-only - it never opens a bump PR and never moves the pin - so
+        # this dep-bump workflow and its PRs stay green regardless.
        include:
          - repository: "ggml-org/llama.cpp"
            variable: "LLAMA_VERSION"
--- a/.github/workflows/llama-cpp-paged-canary.yml
+++ b/.github/workflows/llama-cpp-paged-canary.yml
@@ -0,0 +1,178 @@
+name: 'llama.cpp paged patches: upstream canary'
+
+# EARLY-WARNING CANARY for the vendored paged-attention patch series
+# (backend/cpp/llama-cpp-localai-paged/patches/paged/0001-0030).
+#
+# WHY THIS EXISTS
+# The paged backend (backend/cpp/llama-cpp-localai-paged) pins its OWN verified
+# llama.cpp tip (LLAMA_VERSION in backend/cpp/llama-cpp-localai-paged/Makefile)
+# and is intentionally EXCLUDED from the nightly auto-bumper
+# (.github/workflows/bump_deps.yaml), so a naive upstream bump can never silently
+# break the shipped build. The cost of that safety: nobody finds out when
+# upstream DRIFTS past the patches. This canary restores that signal WITHOUT
+# touching the shipped pin - weekly it tries the patch series + a real compile
+# against the LATEST llama.cpp master tip and goes red the moment upstream breaks
+# the patches.
+#
+# RED HERE means: time to run a PIN_SYNC (rebase the patches onto the new tip,
+# pass the bit-exact gate on the GPU, re-export the .patch files, THEN advance
+# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See
+# backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md.
+#
+# SIGNAL-ONLY: this workflow moves no pinned version, ships nothing, and is fully
+# decoupled from bump_deps - so the main dep-bump PR stays green regardless. A
+# green run means "the paged series still applies and compiles on upstream HEAD";
+# a red run means "upstream moved - schedule a pin-sync".
+
+on:
+  schedule:
+    # Weekly (Mondays 06:00 UTC), mirroring the weekly DEPS_REFRESH cadence.
+    # Offset from bump_deps' nightly 20:00 run so the two never pile up.
+    - cron: '0 6 * * 1'
+  workflow_dispatch:
+
+permissions:
+  contents: read
+
+concurrency:
+  group: llama-cpp-paged-canary
+  cancel-in-progress: false
+
+env:
+  # Upstream source of truth - the same repo/branch bump_deps tracks for the
+  # stock llama-cpp pin.
+  LLAMA_UPSTREAM: 'https://github.com/ggml-org/llama.cpp'
+
+jobs:
+  apply-check:
+    # Cheap, fast, toolchain-free early warning: does the series still APPLY to
+    # the latest upstream tip? A patch no longer applying is by far the most
+    # common way upstream breaks a vendored series, so this runs first, is
+    # reliable on a free runner, and feeds the resolved tip to the compile job.
+    if: github.repository == 'mudler/LocalAI'
+    runs-on: ubuntu-latest
+    timeout-minutes: 20
+    outputs:
+      tip: ${{ steps.resolve.outputs.tip }}
+    steps:
+      - name: Checkout LocalAI
+        uses: actions/checkout@v7
+
+      - name: Resolve latest llama.cpp master tip
+        id: resolve
+        run: |
+          tip="$(git ls-remote "$LLAMA_UPSTREAM" refs/heads/master | cut -f1)"
+          if [ -z "$tip" ]; then
+            echo "::error::could not resolve llama.cpp master tip from $LLAMA_UPSTREAM"
+            exit 1
+          fi
+          pin="$(grep -m1 'LLAMA_VERSION?=' backend/cpp/llama-cpp-localai-paged/Makefile | cut -d= -f2)"
+          echo "latest llama.cpp master tip: $tip"
+          echo "shipped paged pin:           $pin"
+          echo "tip=$tip" >> "$GITHUB_OUTPUT"
+          {
+            echo "## llama.cpp paged canary"
+            echo ""
+            echo "- upstream master tip: \`$tip\`"
+            echo "- shipped paged pin:   \`$pin\`"
+          } >> "$GITHUB_STEP_SUMMARY"
+
+      - name: Checkout llama.cpp at latest tip (shallow)
+        run: |
+          mkdir -p /tmp/llama.cpp
+          cd /tmp/llama.cpp
+          git init -q
+          git remote add origin "$LLAMA_UPSTREAM"
+          git fetch -q --depth 1 origin "${{ steps.resolve.outputs.tip }}"
+          git checkout -q FETCH_HEAD
+          git log --oneline -1
+
+      - name: Apply paged patch series (build's git-apply method)
+        run: |
+          bash .github/scripts/paged-canary-apply.sh \
+            /tmp/llama.cpp \
+            "$PWD/backend/cpp/llama-cpp-localai-paged/patches"
+          echo "- apply: full paged series applies to the upstream tip :white_check_mark:" >> "$GITHUB_STEP_SUMMARY"
+
+  compile:
+    # Proves the patches still COMPILE against the latest tip, using the SAME
+    # toolchain + build target the shipped paged backend uses (the
+    # base-grpc-cuda-12 builder base + the Makefile `grpc-server` cublas target),
+    # so a failure means upstream drift, not toolchain noise. CUDA is compiled
+    # (nvcc; no GPU required) because most of the paged series is CUDA kernels.
+    # Runs only if the apply check passed, on the exact tip it validated.
+    #
+    # If a full CUDA compile on the hosted runner ever proves too heavy/flaky,
+    # switch `runs-on` to 'bigger-runner' (the runner class the real paged CUDA
+    # build uses), or drop to a CPU build (BUILD_TYPE='') which still compiles
+    # all host + CPU paged code, leaving CUDA-kernel coverage to the apply check
+    # plus the manual PIN_SYNC GPU gate.
+    needs: apply-check
+    if: github.repository == 'mudler/LocalAI'
+    runs-on: ubuntu-latest
+    timeout-minutes: 180
+    steps:
+      - name: Checkout LocalAI
+        uses: actions/checkout@v7
+
+      - name: Free disk space
+        uses: ./.github/actions/free-disk-space
+        with:
+          mode: hosted
+
+      - name: Login to Quay.io
+        uses: docker/login-action@v4
+        with:
+          registry: quay.io
+          username: ${{ secrets.LOCALAI_REGISTRY_USERNAME }}
+          password: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }}
+
+      - name: Compile paged backend against latest tip (cublas)
+        env:
+          TIP: ${{ needs.apply-check.outputs.tip }}
+          BUILDER_BASE_IMAGE: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64'
+        run: |
+          docker run --rm \
+            -v "$PWD":/LocalAI -w /LocalAI \
+            -e TIP -e LLAMA_UPSTREAM \
+            "$BUILDER_BASE_IMAGE" bash -euxo pipefail -c '
+              # Mirror the Dockerfile: gRPC lives at /opt/grpc in the base image;
+              # copy it to the prefix CMake find_package expects.
+              cp -a /opt/grpc/. /usr/local/
+
+              # Pre-populate the llama.cpp checkout at the latest tip with the
+              # paged series applied via the tolerant canary apply. Because
+              # backend/cpp/llama-cpp/llama.cpp now exists, the stock Makefile's
+              # llama.cpp target (clone + base-patch apply) is skipped and the
+              # now patch-free prepare.sh only copies the grpc-server sources -
+              # so we drive the REAL grpc-server build path on top of our paged
+              # apply. The stock llama-cpp backend no longer carries the paged
+              # series (it lives in backend/cpp/llama-cpp-localai-paged/patches/
+              # paged); we build it here in the stock dir only because that is
+              # where the shared build infra (Makefile / grpc-server.cpp /
+              # CMakeLists.txt / prepare.sh) lives.
+              cd backend/cpp/llama-cpp/
+              mkdir -p llama.cpp
+              cd llama.cpp
+              git init -q
+              git remote add origin "$LLAMA_UPSTREAM"
+              git fetch -q --depth 1 origin "$TIP"
+              git checkout -q FETCH_HEAD
+              cd /LocalAI
+              bash .github/scripts/paged-canary-apply.sh \
+                backend/cpp/llama-cpp/llama.cpp \
+                "$PWD/backend/cpp/llama-cpp-localai-paged/patches"
+
+              # Cheapest real CUDA build that proves the patches compile: one
+              # CUDA arch, cublas. CMAKE_ARGS is passed via the environment (not
+              # as a make arg) so the Makefile += flags are still appended,
+              # exactly like .docker/llama-cpp-localai-paged-compile.sh. The paged
+              # series is already applied to the checkout above, so the stock
+              # build just compiles the patched tree.
+              cd backend/cpp/llama-cpp/
+              BUILD_TYPE=cublas \
+              CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=80" \
+              make grpc-server
+              test -x grpc-server
+            '
+          echo "- compile: paged series builds (cublas) against the upstream tip :white_check_mark:" >> "$GITHUB_STEP_SUMMARY"
--- a/.gitignore
+++ b/.gitignore
@@ -9,6 +9,15 @@ prepare-sources
 /backend/cpp/llama-cpp/llama.cpp
 /backend/cpp/llama-*
 !backend/cpp/llama-cpp
+# llama-cpp-localai-paged is a tracked source dir (a thin wrapper Makefile over
+# backend/cpp/llama-cpp). Re-include it like llama-cpp above; its sibling
+# *-build dirs are still ignored by the /backend/cpp/llama-* rule, and its
+# in-dir build artifacts (binaries, package output, collected ggml .so set) are
+# re-ignored just below.
+!backend/cpp/llama-cpp-localai-paged
+/backend/cpp/llama-cpp-localai-paged/llama-cpp-localai-paged-*
+/backend/cpp/llama-cpp-localai-paged/package
+/backend/cpp/llama-cpp-localai-paged/ggml-shared-libs
 /backends
 /backend-images
 /result.yaml
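The re-include / re-ignore ordering in the `.gitignore` hunk above can be checked locally with `git check-ignore`. What follows is a minimal sketch and not part of the commit: the `is_ignored` helper name and the scratch repository are assumptions for illustration, and only the ignore rules themselves are copied from the hunk.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): report whether git would ignore a
# path under the rules from the hunk. Runs in a scratch repository so the real
# checkout is never touched.
set -u

scratch="$(mktemp -d)"
git -C "$scratch" init -q

# The rules from the hunk, in the same order. Order matters: the last matching
# pattern wins, so the re-ignore rules have to follow the negation.
cat > "$scratch/.gitignore" <<'EOF'
/backend/cpp/llama-cpp/llama.cpp
/backend/cpp/llama-*
!backend/cpp/llama-cpp
!backend/cpp/llama-cpp-localai-paged
/backend/cpp/llama-cpp-localai-paged/llama-cpp-localai-paged-*
/backend/cpp/llama-cpp-localai-paged/package
/backend/cpp/llama-cpp-localai-paged/ggml-shared-libs
EOF

# Prints "ignored" or "not ignored"; `git check-ignore -q` exits 0 when the
# path is ignored. Paths do not need to exist on disk.
is_ignored() {
    if git -C "$scratch" check-ignore -q "$1"; then
        echo "ignored"
    else
        echo "not ignored"
    fi
}

is_ignored backend/cpp/llama-cpp-localai-paged/Makefile
is_ignored backend/cpp/llama-cpp-localai-paged-avx2-build/grpc-server
is_ignored backend/cpp/llama-cpp-localai-paged/package/lib/libggml.so
```

The intent stated in the hunk's comment is that the wrapper's own sources stay tracked, while the sibling `*-build` directories and the in-dir build artifacts remain ignored.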
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -23,6 +23,8 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]
 | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) |
 | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions |
 | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing |
+| [.agents/llama-cpp-localai-paged-backend.md](.agents/llama-cpp-localai-paged-backend.md) | Working on the CUDA-only paged-attention llama.cpp variant (Qwen3.6 hybrid-SSM / Blackwell NVFP4 decode) - patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, CUDA-only invariants, stock-stays-pure, Metal/SYCL/Vulkan follow-up scope |
+| [.agents/vllm-parity-methodology.md](.agents/vllm-parity-methodology.md) | The methodology for closing the vLLM decode-throughput gap in llama.cpp - bit-exact gating, profile-don't-assume, both-engine ground-truth, per-lever A/B discipline, recording rejected levers, multi-agent GPU orchestration |
 | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks |
 | [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling |
 | [.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix |
@@ -37,6 +39,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]

 - **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
 - **Logging**: Use `github.com/mudler/xlog` (same API as slog)
+- **Paged llama.cpp backend**: `llama-cpp-localai-paged` is a CUDA-only variant that owns its own patch series + its own pinned llama.cpp (manual pin-sync, weekly canary); the stock `llama-cpp` backend stays patch-free. Read [.agents/llama-cpp-localai-paged-backend.md](.agents/llama-cpp-localai-paged-backend.md) before touching either, and [.agents/vllm-parity-methodology.md](.agents/vllm-parity-methodology.md) for the decode-parity methodology behind it.
 - **Go style**: Prefer `any` over `interface{}`
 - **Comments**: Explain *why*, not *what*
 - **Docs**: Update `docs/content/` when adding features or changing config
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin backends/llama-cpp-localai-paged

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -671,6 +671,15 @@ test-extra-backend-llama-cpp: docker-build-llama-cpp
 test-extra-backend-ik-llama-cpp: docker-build-ik-llama-cpp
 	BACKEND_IMAGE=local-ai-backend:ik-llama-cpp $(MAKE) test-extra-backend

+## llama-cpp-localai-paged: the LocalAI paged-attention llama.cpp variant. Same
+## GGUF surface as stock llama-cpp (the paged engine is runtime-gated by the
+## LLAMA_KV_PAGED env var that the grpc-server option hooks set), so the standard
+## llama-cpp capability set is what we exercise here.
+test-extra-backend-llama-cpp-localai-paged: docker-build-llama-cpp-localai-paged
+	BACKEND_IMAGE=local-ai-backend:llama-cpp-localai-paged \
+	BACKEND_TEST_CAPS=health,load,predict,stream,logprobs,logit_bias \
+	$(MAKE) test-extra-backend
+
 ## turboquant: exercises the llama.cpp-fork backend with the fork's
 ## *TurboQuant-specific* KV-cache types (turbo3 for both K and V). turbo3
 ## is what makes this backend distinct from stock llama-cpp — picking q8_0
@@ -1181,6 +1190,10 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false
 # turboquant is a llama.cpp fork with TurboQuant KV-cache quantization.
 # Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile.
 BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
+# llama-cpp-localai-paged = stock llama.cpp grpc-server + the LocalAI paged-attention
+# patch series (vendored in this wrapper backend). Reuses backend/cpp/llama-cpp sources via a thin
+# wrapper Makefile (own manually synced upstream pin, not the auto-bumped stock pin; no fork, no patch-grpc-server).
+BACKEND_LLAMA_CPP_LOCALAI_PAGED = llama-cpp-localai-paged|llama-cpp-localai-paged|.|false|false
 # ds4 is antirez/ds4, a DeepSeek V4 Flash-specific inference engine.
 # Single-model; hardware-only validation lives at tests/e2e-backends/
 # (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
@@ -1282,6 +1295,7 @@ endef
 $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
+$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_LOCALAI_PAGED)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PRIVACY_FILTER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
@@ -1345,7 +1359,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SUPERTONIC)))
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-llama-cpp-localai-paged docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter

 ########################################################
 ### Mock Backend for E2E Tests
--- a/backend/Dockerfile.llama-cpp-localai-paged
+++ b/backend/Dockerfile.llama-cpp-localai-paged
@@ -0,0 +1,163 @@
+ARG BASE_IMAGE=ubuntu:24.04
+# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses even
+# when no prebuilt base is supplied. The builder-prebuilt stage is only
+# entered when BUILDER_TARGET=builder-prebuilt, so a "wrong" fallback
+# image here is harmless: BuildKit prunes the unreferenced builder.
+ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
+# BUILDER_TARGET selects which builder stage the final scratch image copies
+# package output from. Declared at global scope (before any FROM) so it's
+# usable in `FROM ${BUILDER_TARGET}` below. Default keeps local
+# `make backends/llama-cpp-localai-paged` on the from-source path.
+ARG BUILDER_TARGET=builder-fromsource
+ARG APT_MIRROR=""
+ARG APT_PORTS_MIRROR=""
+
+
+# ============================================================================
+# Stage: builder-fromsource — self-contained build path.
+# Runs .docker/install-base-deps.sh (apt deps + cmake + protoc + gRPC +
+# conditional CUDA/ROCm/Vulkan), copies /opt/grpc to /usr/local, then
+# compiles the variant. Used when BUILDER_TARGET=builder-fromsource (the
+# default; local `make backends/llama-cpp-localai-paged`).
+#
+# The install script is the same one that backend/Dockerfile.base-grpc-builder
+# runs, so the result is bit-equivalent to the prebuilt-base path
+# (builder-prebuilt below).
+# ============================================================================
+FROM ${BASE_IMAGE} AS builder-fromsource
+ARG BUILD_TYPE
+ARG CUDA_MAJOR_VERSION
+ARG CUDA_MINOR_VERSION
+ARG CMAKE_FROM_SOURCE=false
+# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues
+ARG CMAKE_VERSION=3.31.10
+ARG GRPC_VERSION=v1.65.0
+ARG GRPC_MAKEFLAGS="-j4 -Otarget"
+ARG SKIP_DRIVERS=false
+ARG TARGETARCH
+ARG TARGETVARIANT
+ARG GO_VERSION=1.25.4
+ARG UBUNTU_VERSION=2404
+ARG APT_MIRROR
+ARG APT_PORTS_MIRROR
+ARG AMDGPU_TARGETS=""
+ARG BACKEND=rerankers
+# CUDA target archs, e.g. --build-arg CUDA_DOCKER_ARCH='75;86;89;120'
+ARG CUDA_DOCKER_ARCH
+ARG CMAKE_ARGS
+
+ENV BUILD_TYPE=${BUILD_TYPE} \
+    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
+    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
+    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
+    CMAKE_VERSION=${CMAKE_VERSION} \
+    GRPC_VERSION=${GRPC_VERSION} \
+    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
+    SKIP_DRIVERS=${SKIP_DRIVERS} \
+    TARGETARCH=${TARGETARCH} \
+    UBUNTU_VERSION=${UBUNTU_VERSION} \
+    APT_MIRROR=${APT_MIRROR} \
+    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
+    AMDGPU_TARGETS=${AMDGPU_TARGETS} \
+    CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH} \
+    CMAKE_ARGS=${CMAKE_ARGS} \
+    DEBIAN_FRONTEND=noninteractive
+
+# CUDA on PATH (no-op when CUDA isn't installed)
+ENV PATH=/usr/local/cuda/bin:${PATH}
+# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed)
+ENV PATH=/opt/rocm/bin:${PATH}
+
+WORKDIR /build
+
+# Install everything via the shared script — the same one that
+# backend/Dockerfile.base-grpc-builder runs, so the prebuilt CI base and
+# this from-source path are bit-equivalent.
+RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
+    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
+    bash /usr/local/sbin/install-base-deps
+
+# Mirror builder-prebuilt: copy gRPC from /opt/grpc to /usr/local so
+# CMake's find_package finds it at the canonical prefix the Makefile expects.
+RUN cp -a /opt/grpc/. /usr/local/
+
+COPY . /LocalAI
+
+# BuildKit cache mount for ccache. See Dockerfile.llama-cpp (commit 9228e5b4)
+# for rationale. llama-cpp-localai-paged is the SAME upstream llama.cpp with
+# the LocalAI paged patch series applied; it reuses backend/cpp/llama-cpp
+# source via a thin wrapper Makefile, so MOST TUs are content-identical to the
+# stock llama-cpp build. Sharing a cache id with llama-cpp could give
+# cross-variant hits — but for now keep them separate (mirroring turboquant) so
+# a regression in one doesn't poison the other. Revisit sharing after measuring
+# the actual hit rate.
+#
+# The compile body is shared with builder-prebuilt via .docker/llama-cpp-localai-paged-compile.sh.
+RUN --mount=type=bind,source=.docker/llama-cpp-localai-paged-compile.sh,target=/usr/local/sbin/compile.sh \
+    --mount=type=cache,target=/root/.ccache,id=llama-cpp-localai-paged-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
+    bash /usr/local/sbin/compile.sh
+
+
+# Copy libraries using a script to handle architecture differences
+RUN make -BC /LocalAI/backend/cpp/llama-cpp-localai-paged package
+
+
+# ============================================================================
+# Stage: builder-prebuilt — uses the pre-built base from
+# quay.io/go-skynet/ci-cache:base-grpc-* (built by .github/workflows/base-images.yml).
+# That image already has gRPC at /opt/grpc + apt deps + CUDA/ROCm/Vulkan
+# pre-installed, so we just copy gRPC to /usr/local and compile. Used when
+# BUILDER_TARGET=builder-prebuilt (CI when the matrix entry sets
+# builder-base-image). llama-cpp-localai-paged reuses the SAME base-grpc-* tags
+# as the stock llama-cpp backend (same gRPC + same toolchain), so no new
+# base-images.yml variant is required.
+# ============================================================================
+FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
+
+ARG BUILD_TYPE
+ENV BUILD_TYPE=${BUILD_TYPE}
+ARG CUDA_DOCKER_ARCH
+ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
+ARG CMAKE_ARGS
+ENV CMAKE_ARGS=${CMAKE_ARGS}
+# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile
+# (which the llama-cpp-localai-paged Makefile reuses via a sibling build dir) errors out
+# when the var is empty on a hipblas build, and the prebuilt path is what CI exercises most
+# of the time. The builder-fromsource stage above already does this; mirror it here.
+ARG AMDGPU_TARGETS
+ENV AMDGPU_TARGETS=${AMDGPU_TARGETS}
+ARG TARGETARCH
+ARG TARGETVARIANT
+
+# The base-grpc-* image installs gRPC to /opt/grpc but doesn't copy it to
+# /usr/local. Mirror what the from-source path does so the compile step
+# can find gRPC at the canonical prefix the Makefile expects.
+RUN cp -a /opt/grpc/. /usr/local/
+
+COPY . /LocalAI
+
+RUN --mount=type=bind,source=.docker/llama-cpp-localai-paged-compile.sh,target=/usr/local/sbin/compile.sh \
+    --mount=type=cache,target=/root/.ccache,id=llama-cpp-localai-paged-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
+    bash /usr/local/sbin/compile.sh
+
+RUN make -BC /LocalAI/backend/cpp/llama-cpp-localai-paged package
+
+
+# ============================================================================
+# Final stage — copies package output from one of the two builders.
+# BUILDER_TARGET selects which one. BuildKit prunes the unreferenced builder.
+#
+# BuildKit doesn't support variable expansion in `COPY --from=` directly,
+# so we resolve the ARG by aliasing the chosen builder to a fixed stage
+# name via `FROM ${BUILDER_TARGET} AS builder` and then COPY --from=builder.
+# BUILDER_TARGET itself is declared as a global ARG at the top of this
+# file (required for use in FROM), so it can be referenced in the FROM
+# directive below without being re-declared in any stage.
+# ============================================================================
+FROM ${BUILDER_TARGET} AS builder
+
+FROM scratch
+
+
+# Copy all available binaries (the build process only creates the appropriate ones for the target architecture)
+COPY --from=builder /LocalAI/backend/cpp/llama-cpp-localai-paged/package/. ./
--- a/backend/cpp/llama-cpp-localai-paged/Makefile
+++ b/backend/cpp/llama-cpp-localai-paged/Makefile
@@ -0,0 +1,146 @@
+
+# llama-cpp-localai-paged is LocalAI's paged-attention llama.cpp variant. It
+# builds upstream llama.cpp with the LocalAI paged-attention patch series
+# (patches/paged/, vendored in THIS backend) applied on top. It reuses
+# backend/cpp/llama-cpp's grpc-server.cpp / CMakeLists.txt / prepare.sh / Makefile
+# sources verbatim via a thin wrapper - the stock llama-cpp backend is pure
+# upstream and carries NONE of the paged patches; this backend OWNS them.
+#
+# Pin handling (mirrors the turboquant wrapper, the precedent this is modelled
+# on): the paged patch series is hand-verified bit-exact against ONE specific
+# llama.cpp tip and re-exported by the manual PIN_SYNC process
+# (docs/PIN_SYNC_*.md). A naive pin bump would move the tip out from
+# under the patches and break `git apply` at build time, so this backend OWNS
+# its pin (LLAMA_VERSION below) instead of inheriting the auto-bumped stock pin
+# from backend/cpp/llama-cpp/Makefile. The override is forced into every copied
+# build via `LLAMA_VERSION=$(LLAMA_VERSION)`. There is deliberately NO
+# bump_deps.yaml entry for it: it is advanced ONLY by PIN_SYNC, never nightly.
+# (turboquant CAN auto-bump because its fork branch carries the patches; the
+# paged series is vendored as .patch files here, so it cannot.)
+#
+#   - NO patch-grpc-server.sh and NO apply-patches.sh: the shared grpc-server.cpp
+#     already carries the (runtime-gated) paged option hooks, and the paged patch
+#     series (patches/paged/) is applied by THIS Makefile's own apply step onto
+#     the freshly cloned tree, using the same strict `git apply` method the stock
+#     build uses for base patches. The stock llama-cpp Makefile applies only its
+#     own (currently empty) base patches/ series, never the paged one.
+
+# Manually pin-synced llama.cpp tip the paged patch series is verified against.
+# Decoupled from the auto-bumped stock pin in backend/cpp/llama-cpp/Makefile so
+# the nightly llama.cpp bump cannot silently break the vendored paged patches.
+# Advance ONLY via the PIN_SYNC process (rebase patches + bit-exact gate +
+# re-export), then update this value. See:
+#   backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_*.md
+#
+# This pin = the manual, verified sync. The signal telling you WHEN to do the
+# next sync is the early-warning canary
+# (.github/workflows/llama-cpp-paged-canary.yml): weekly it applies + compiles
+# this patch series against the latest upstream llama.cpp tip and goes red the
+# moment upstream drifts past the patches. Canary red -> run a PIN_SYNC, then
+# bump this value. The canary never touches this pin; it is signal-only.
+LLAMA_VERSION?=c299a92c38b6de6a1139617652b66081828648db
+
+CMAKE_ARGS?=
+BUILD_TYPE?=
+NATIVE?=false
+ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh
+TARGET?=--target grpc-server
+JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1)
+ARCH?=$(shell uname -m)
+
+CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
+LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
+# OUR vendored paged-attention patch series. Owned by this backend; the stock
+# llama-cpp backend no longer carries it. Applied onto each freshly cloned
+# llama.cpp tree by apply-paged-patches below (strict git apply).
+PAGED_PATCHES_DIR := $(CURRENT_MAKEFILE_DIR)/patches/paged
+
+GREEN := \033[0;32m
+RESET := \033[0m
+
+# Apply OUR vendored paged-attention patch series (patches/paged/0*.patch) onto a
+# freshly cloned llama.cpp tree ($(1)) using the SAME strict git-apply method the
+# stock build uses for its base patches (backend/cpp/llama-cpp/Makefile `llama.cpp`
+# target). Strict: any patch that no longer applies aborts the build (exit 1) -
+# that is the signal to run a PIN_SYNC, never to bump the pin blindly. The series
+# is owned by THIS backend, not by the now-pure stock llama-cpp backend.
+define apply-paged-patches
+	cd $(1) && \
+	for p in $(PAGED_PATCHES_DIR)/0*.patch; do \
+		[ -e "$$p" ] || continue; \
+		echo "applying llama.cpp PAGED patch: $$p"; \
+		git apply --verbose "$$p" || { echo "paged patch failed: $$p"; exit 1; }; \
+	done
+endef
+
+# Each flavor target:
+#   1. copies backend/cpp/llama-cpp/ (grpc-server.cpp + prepare.sh +
+#      CMakeLists.txt + Makefile) into a sibling
+#      llama-cpp-localai-paged-<flavor>-build directory;
+#   2. clones OUR pinned upstream llama.cpp into that copy via the copy's own
+#      `llama.cpp` target (which applies the stock base patches/ series, normally
+#      empty), then applies THIS backend's paged patch series (patches/paged/)
+#      onto the cloned tree with strict `git apply` (apply-paged-patches);
+#   3. runs the copy's `grpc-server` target and copies the produced binary up as
+#      llama-cpp-localai-paged-<flavor>.
+# We clone+patch only the *copy*, never the original under backend/cpp/llama-cpp/,
+# so the stock llama-cpp build stays untouched and patch-free.
+define paged-build
+	rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build
+	cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build
+	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build purge
+	$(info $(GREEN)I llama-cpp-localai-paged build info:$(1)$(RESET))
+	LLAMA_VERSION=$(LLAMA_VERSION) $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build llama.cpp
+	$(call apply-paged-patches,$(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build/llama.cpp)
+	CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_VERSION=$(LLAMA_VERSION) \
+	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build grpc-server
+	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build/grpc-server llama-cpp-localai-paged-$(1)
+endef
+
+llama-cpp-localai-paged-avx2:
+	$(call paged-build,avx2,-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
+
+llama-cpp-localai-paged-avx512:
+	$(call paged-build,avx512,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server)
+
+llama-cpp-localai-paged-avx:
+	$(call paged-build,avx,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
+
+llama-cpp-localai-paged-fallback:
+	$(call paged-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server)
+
+# Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all).
+# Reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and
+# Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same
+# overrides through to the copied build: SHARED_LIBS=ON, the DL flags, and
+# --target ggml (which pulls in the per-microarch libggml-cpu-*.so via ggml's
+# add_dependencies). The .so set is collected for package.sh to bundle into
+# package/lib.
+llama-cpp-localai-paged-cpu-all:
+	rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build
+	cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build
+	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build purge
+	$(info $(GREEN)I llama-cpp-localai-paged build info:cpu-all-variants$(RESET))
+	LLAMA_VERSION=$(LLAMA_VERSION) $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build llama.cpp
+	$(call apply-paged-patches,$(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build/llama.cpp)
+	SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" LLAMA_VERSION=$(LLAMA_VERSION) \
+	$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build grpc-server
+	cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build/grpc-server llama-cpp-localai-paged-cpu-all
+	rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs
+	find $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \;
+	@echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/
+
+llama-cpp-localai-paged-grpc:
+	$(call paged-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target rpc-server)
+
+llama-cpp-localai-paged-rpc-server: llama-cpp-localai-paged-grpc
+	cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-grpc-build/llama.cpp/build/bin/rpc-server llama-cpp-localai-paged-rpc-server
+
+package:
+	bash package.sh
+
+purge:
+	rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-*-build
+	rm -rf llama-cpp-localai-paged-* package
+
+clean: purge
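The strict apply step that `apply-paged-patches` defines above can also be exercised outside of make. The following is a minimal shell sketch, assuming a standalone function named `apply_series` (hypothetical, not in the commit) that takes the target tree and the patch directory as arguments. It mirrors the loop's behaviour of stopping at the first patch that `git apply` rejects.

```bash
#!/usr/bin/env bash
# Hypothetical standalone version of the Makefile's strict apply loop.
# Usage: apply_series <llama.cpp tree> <directory holding 0*.patch>
apply_series() {
    local tree="$1" patch_dir p
    # Resolve to an absolute path: `git -C` changes directory before the
    # patch argument is read, so a relative patch path would not resolve.
    patch_dir="$(cd "$2" && pwd)" || return 1
    for p in "$patch_dir"/0*.patch; do
        # An unmatched glob stays literal; skip it instead of failing.
        [ -e "$p" ] || continue
        echo "applying patch: $p"
        if ! git -C "$tree" apply --verbose "$p"; then
            # Strict: the first patch that no longer applies ends the run.
            echo "patch failed: $p" >&2
            return 1
        fi
    done
    return 0
}
```

Called as `apply_series path/to/llama.cpp path/to/patches/paged`, a non-zero return plays the role of the build abort. Per the Makefile comments, that is the cue to run a PIN_SYNC rather than to bump the pin.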
--- a/backend/cpp/llama-cpp-localai-paged/README.md
+++ b/backend/cpp/llama-cpp-localai-paged/README.md
@@ -0,0 +1,366 @@
+# LocalAI paged-attention llama.cpp patch series
+
+This backend vendors the patch series (in `patches/paged/`) that turns stock
+llama.cpp into LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The
+patches are applied on top of a pinned upstream llama.cpp at build time; nothing
+here is a fork - it is a source-only `*.patch` stack plus this canonical doc.
+
+> One-file rule: this README is the canonical reference for the patch series. The
+> only other docs are operational, kept in `docs/`, and linked below:
+> - [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
+> - [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
+> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.
+
+---
+
+## 1. What it is
+
+`llama-cpp-localai-paged` is the LocalAI paged-attention llama.cpp backend: a
+vendored patch series over upstream llama.cpp that adds
+
+- a **paged KV cache** (vLLM-style block manager: on-demand fixed-size blocks,
+  free pool, ref-counted blocks) with a **block-table flash-attention** read so
+  the attention kernels index physical cells instead of a contiguous buffer;
+- **cross-request prefix sharing** - concurrent requests that share a long
+  prefix physically reuse one committed copy of the prefix blocks and prefill
+  only their divergent suffix;
+- a **decode-first prefill scheduler** - a dynamic per-step prefill-token budget
+  decoupled from `n_batch`, so a long prefill never freezes co-batched decode;
+- **GB10 / Blackwell NVFP4 decode optimizations** for the Qwen3.6 hybrid
+  gated-DeltaNet (SSM) models, where the recurrent-state plumbing - not the FP4
+  GEMM - dominates the decode step.
+
+It is **pinned to llama.cpp `c299a92c`** ("binaries : Improve rpc-server and
+export-graph-ops names", #25045) and advanced only by a manual, bit-exact-gated
+[pin-sync process](docs/PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
+(see section 7).
+
+The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is
+enabled per-model at runtime via the gallery `options:` knobs (`paged_kv:true`,
+`max_batch_tokens:`, `kv_unified:false`, ...). Against unpatched llama.cpp the
+runtime hooks are inert, so a single `grpc-server.cpp` is shared between the
+clean and the paged build.
+
+---
+
+## 2. Architecture
+
+The decode step on these models breaks into three cost centers; the patch series
+attacks each one.
+
+**Paged KV manager + block-table flash-attn.** A host-side `PagedKVManager`
+(`FreeBlockQueue` / `BlockPool` / chained-hash content cache) hands out
+fixed-size KV blocks on demand and reclaims them per-sequence (ref-counted, with
+copy-on-write for shared prefixes). The attention path reads through a **block
+table** - an `I32 [n_view, n_stream]` position-ordered physical-cell index passed
+as `src[5]` of `ggml_flash_attn_ext` - so the CUDA fattn vec/tile kernels and the
+CPU reference map logical KV index `j` to physical cell `block_table[seq*ne11+j]`
+and read K/V in place. Token-position ordering keeps the flash-attn online-softmax
+reduction order identical to stock. A null block table is the stock contiguous
+read, byte-identical.
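+
+A minimal Python sketch of that index mapping, assuming a flat table with `ne11`
+entries per stream as the formula above suggests. `paged_read`, the toy cell
+contents and the sizes are illustrative assumptions, not the kernel code.
+
+```python
+# Toy model of the block-table read: logical KV index j of stream `seq`
+# resolves to physical cell block_table[seq * ne11 + j].
+# All names and sizes here are illustrative, not taken from the patches.
+
+def paged_read(cells, block_table, seq, ne11, n_kv):
+    """Return the first n_kv K/V cells of `seq` in logical (position) order."""
+    if block_table is None:
+        # Null block table: the stock contiguous read.
+        base = seq * ne11
+        return cells[base:base + n_kv]
+    return [cells[block_table[seq * ne11 + j]] for j in range(n_kv)]
+
+
+ne11 = 4                                   # logical KV slots per stream
+contiguous = ["s0p0", "s0p1", "s0p2", "s0p3",
+              "s1p0", "s1p1", "s1p2", "s1p3"]
+
+# Scatter the same cells to permuted physical positions.
+perm = [5, 2, 7, 0, 3, 6, 1, 4]            # logical slot -> physical cell
+scattered = [None] * len(contiguous)
+for logical, physical in enumerate(perm):
+    scattered[physical] = contiguous[logical]
+block_table = perm                         # position-ordered, per stream
+
+stock = paged_read(contiguous, None, 1, ne11, 3)
+paged = paged_read(scattered, block_table, 1, ne11, 3)
+print(stock == paged)
+```
+
+Because the table is position-ordered, the paged read yields the cells in the
+same order as the contiguous read, which is the property the reduction-order
+argument above relies on.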
+
+**The gated-DeltaNet (GDN / SSM) decode path.** The Qwen3.6 hybrid models are 48
+gated-DeltaNet (linear-attention / SSM) layers + 16 full-attention layers. On
+GB10 the recurrent-state plumbing, not the weight GEMM, is the dominant decode
+cost. The series fuses that plumbing to mirror vLLM's
+`fused_recurrent_gated_delta_rule`: the recurrent state is read from and written
+to its cache slot in place (no copy-back, no `get_rows` materialization), the
+conv state is updated in place, the output projection is reshaped to route to the
+tensor-core MMQ GEMM, and the recurrence kernel is occupancy-retuned - all
+bit-exact (md5-gateable) against the f32 baseline.
+
+**NVFP4 native FP4-MMA on Blackwell.** The NVFP4 dense/expert weight GEMM uses
+Blackwell's native FP4-MMA. The series removes a redundant activation-requantize
+in the MoE broadcast projections (bit-exact byte copy of identical blocks) and
+keeps CUDA graphs on for the grouped-MMQ MoE decode step. These are the only
+NVFP4-specific optimizations; on non-Blackwell hardware the FP4 path falls back
+to dequant.
+
+**The prefill/decode scheduler.** `update_slots()` already emits one unified
+mixed prefill+decode batch per step. The scheduler patches change only the *count*
+of prefill tokens admitted per step: decode tokens are claimed first
+(decode-first), then a dynamic budget `max(n_ubatch, T - D)` (where `D` is the
+live decode load and `T` is `LLAMA_MAX_BATCH_TOKENS`) admits prefill, auto-
+shrinking as decode load rises. Pure scheduler policy, byte-identical when off,
+orthogonal to the paged allocator.
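+
+The budget rule can be sketched in a few lines of Python. The formula is the
+one stated above; the concrete numbers and the way the budget is consumed
+(`plan_step`, a simple `min` against the pending prefill) are assumptions for
+illustration, not the `update_slots()` code.
+
+```python
+def prefill_budget(n_ubatch, max_batch_tokens, n_decode):
+    """Prefill tokens admitted this step: max(n_ubatch, T - D)."""
+    return max(n_ubatch, max_batch_tokens - n_decode)
+
+
+def plan_step(n_ubatch, max_batch_tokens, n_decode, prefill_pending):
+    """Decode tokens are claimed first; prefill gets what the budget allows."""
+    budget = prefill_budget(n_ubatch, max_batch_tokens, n_decode)
+    return n_decode, min(prefill_pending, budget)
+
+
+# Hypothetical settings: n_ubatch = 512, T = 2048, 10000 prefill tokens queued.
+for n_decode in (0, 128, 1800):
+    print(n_decode, plan_step(512, 2048, n_decode, 10000))
+```
+
+With these numbers the admitted prefill shrinks from 2048 to 1920 as decode
+load rises to 128, and bottoms out at the `n_ubatch` floor of 512 once `T - D`
+drops below it.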
+
+---
+
+## 3. Patch series (0001-0030)
+
+28 patches (0005 and 0027 are intentionally unused). "Bit-exact" = greedy md5 /
+`test-backend-ops` byte-identical to the relevant baseline; the gate methodology
+is in section 5.
+
+### Paged-KV core (0001-0012)
+
+| # | What it does | Bit-exact |
+|---|---|---|
+| 0001 | Vendor the host-side paged KV block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache). Pure C++17, nothing uses it yet. | n/a (no behavior) |
+| 0002 | Place each sequence at permuted, non-contiguous block positions in `find_slot` (proves attention is invariant to physical KV placement). | yes (token-identical) |
+| 0003 | Gather K/V/mask down to each stream's non-empty cells before `build_attn_mha`, position-sorted so the FA reduction order matches stock. | yes |
+| 0004 | Drive paged placement through the vendored manager: blocks popped on demand, returned on seq end. Core kv-cache struct untouched. | yes (stock path byte-identical) |
+| 0006 | Host-side cross-request prefix caching: hash prefix blocks, reuse matching physical blocks (ref-count++), COW-privatise before a divergent write. | yes (default off) |
+| 0007 | Wire the prefix cache into the engine so a new sequence physically shares cached prefix blocks and skips recomputing the shared prefix. | yes (verified byte-identical) |
+| 0008 | Wire cross-request prefix share into the llama-server continuous-batch loop so concurrent shared-prefix requests prefill only the suffix (36x fewer prefill tokens at K=32). | within CUDA batch-shape non-determinism band |
+| 0009 | Replace the per-step gather with an **in-kernel paged read** (block table as `src[5]`); the K/V `get_rows` is gone. Decode step at batch32 691->696ms (was 1279ms gathered). | yes on CPU/batch1; GPU batch>1 within vec-vs-mma band |
+| 0010 | Graft the block-table read into the tile kernel; add a dispatch guard so a present block table routes ONLY to vec/tile (never the mma/wmma kernels that ignore it). | yes (CPU byte-identical; vec route) |
+| 0011 | Route the GQA-grouped F16 decode to the **tile kernel** (native head-group reuse) by default; vec for everything else. Paged decode to within 1.8% of stock. | vs stock-mma: different-kernel rounding; bit-exact vs vec |
+| 0012 | Defensive `GGML_ASSERT(n_view % 64 == 0)` so a future pad/tile change can't silently reintroduce a past-end KV leak on the tile route. | yes (additive assert) |
+
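+The block manager, prefix cache and COW rows above (0001, 0004, 0006) can be
+pictured with a toy ref-counted pool. This is a sketch under stated assumptions:
+`ToyBlockPool`, the block size of 4 and the SHA-256 chained key are illustrative
+choices, not the vendored `PagedKVManager` API.
+
+```python
+import hashlib
+
+BLOCK_SIZE = 4  # tokens per block (illustrative)
+
+
+class ToyBlockPool:
+    def __init__(self, n_blocks):
+        self.free = list(range(n_blocks))  # free pool
+        self.refs = {}                     # block id -> ref count
+        self.by_hash = {}                  # chained prefix hash -> block id
+        self.hash_of = {}                  # block id -> its chained hash
+
+    def _alloc(self):
+        block = self.free.pop(0)           # blocks are popped on demand
+        self.refs[block] = 1
+        return block
+
+    def admit(self, tokens):
+        """Return (block ids, number of tokens served from cached blocks)."""
+        blocks, reused, parent, sharing = [], 0, b"", True
+        for i in range(0, len(tokens), BLOCK_SIZE):
+            chunk = tokens[i:i + BLOCK_SIZE]
+            full = len(chunk) == BLOCK_SIZE
+            # Chained hash: a block's key covers its whole prefix.
+            parent = hashlib.sha256(parent + repr(chunk).encode()).digest()
+            hit = self.by_hash.get(parent) if (full and sharing) else None
+            if hit is not None:
+                self.refs[hit] += 1        # share the committed block
+                blocks.append(hit)
+                reused += BLOCK_SIZE
+                continue
+            sharing = False                # diverged: private blocks from here
+            block = self._alloc()
+            if full and parent not in self.by_hash:
+                self.by_hash[parent] = block
+                self.hash_of[block] = parent
+            blocks.append(block)
+        return blocks, reused
+
+    def privatise(self, blocks, idx):
+        """Copy-on-write: swap a shared block for a private one before a write."""
+        old = blocks[idx]
+        if self.refs[old] > 1:
+            self.refs[old] -= 1
+            blocks[idx] = self._alloc()
+        return blocks[idx]
+
+    def release(self, blocks):
+        for block in blocks:
+            self.refs[block] -= 1
+            if self.refs[block] == 0:      # last owner: back to the free pool
+                del self.refs[block]
+                self.by_hash.pop(self.hash_of.pop(block, None), None)
+                self.free.append(block)
+
+
+pool = ToyBlockPool(n_blocks=8)
+blocks_a, reused_a = pool.admit(list(range(10)))             # 2 full blocks + tail
+blocks_b, reused_b = pool.admit(list(range(8)) + [99, 100])  # same 8-token prefix
+print(blocks_a, reused_a, blocks_b, reused_b)
+```
+
+The second request reuses the two committed prefix blocks and allocates only
+its divergent tail, which is the "prefill only the suffix" behaviour that 0007
+and 0008 wire into the engine and the server loop.
+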
+### Decode-first scheduler (0013, 0016)
+
+| # | What it does | Bit-exact |
+|---|---|---|
+| 0013 | `LLAMA_PREFILL_BUDGET`: a static per-step prefill-token budget decoupled from `n_batch` (vLLM `--max-num-batched-tokens` analogue). Flattens the decode ITL spike a long prefill inflicts (8.5x smaller worst freeze). | yes (off/short = byte-identical; == `-b` chunking) |
+| 0016 | Supersede 0013 with a **dynamic decode-first** budget: `max(n_ubatch, T-D)`, auto-shrinking as decode load `D` rises. Policy-only inside `update_slots()`, zero libllama changes. | yes (default-off byte-identical) |
+
+(0014/0015 are the MoE token-tile levers: 0014 adds `LLAMA_MOE_MMQ_X` (opt-in
+high-batch decode micro-opt, +4.8% on Qwen3-Coder-30B), 0015 makes it a
+default-on, density-aware auto-select that is prefill-safe by construction. Both
+bit-exact. 0017 is the dense FP4-GEMM occupancy-tune track: bit-exact gate green,
+but every cheap occupancy lever regressed on GB10, so nothing is enabled - it
+ships as the parity gate + default-off instrumentation only.)
+
+### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)
+
+These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact.
+
+| # | What it does | Effect (dense q36-27b / MoE q36-35b-a3b @npl128) |
+|---|---|---|
+| 0018 | **In-place SSM state write-back** - the recurrence writes its final state directly into the cache slot, removing the ~225MB/copy D2D memcpy (18.9% of decode time). | dense +23.5% / MoE +18.9% |
+| 0019 | **Fused recurrent-state gather** - the op reads each sequence's prior state directly from `cache[ids[seq]]` (no `get_rows` materialization); race-free in-place + ids read. | dense +37.8% / MoE +35.3% |
+| 0020 | **o_proj MMVQ->MMQ reshape** - collapse the GDN output to 2D so the output projection routes to the M=128 tensor-core MMQ GEMM (was a batch<=8 MMVQ GEMV). The single biggest decode-parity lever. | dense +31.7% (->85.9% of vLLM) / MoE +23.3% |
+| 0021 | **Conv-state in-place fusion** - one `ggml_ssm_conv_update_inplace` op replaces the 4-op conv chain (transpose+concat+conv+silu+ring-cpy), writing the shifted ring state in place. | dense +3.2% / MoE +3.5% |
+| 0022 | **GDN recurrence occupancy/coalescing retune** - column-folding (NUM_WARPS/COLS_PER_WARP) raises memory-level parallelism on the bandwidth-bound B=128 recurrence kernel; per-column f32 FMA order unchanged. 73.4%->84.6% of GB10 peak BW. | dense +11.1% / MoE +8.3% |
+| 0028 | **Recurrent conv-tap gather fusion** - the last `k_get_rows` in the GDN decode path (the conv-state tap gather) becomes an indexed in-kernel read. | dense ~377 t/s / MoE ~784 t/s |
+
+### MoE NVFP4 quant (0023, 0025)
+
+| # | What it does | Bit-exact |
+|---|---|---|
+| 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) |
+| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) |
+
+### Pool reclaim, block-table cache, backend gate, opt-in bf16-SSM
+
+| # | What it does | Bit-exact |
+|---|---|---|
+| 0024 | **Paged-pool burst-reclaim** - truncate trailing blocks on partial-tail `seq_rm`, defrag the free queue when idle, release blocks on slot completion. Fixes the long-server burst-degradation bug (post-burst prefill collapse 488->44 t/s, restored to 532). Host-side accounting only. | yes |
+| 0029 | **Block-table within-step host cache** - the block table is fixed for the whole step; cache it on first build and memcpy it for the other full-attention layers (get_block_table -87%/-91%). | yes, per path (paged-MoE ref `8cb0ce23`) |
+| 0030 | **Fused-op backend gate** - the fused GDN / discriminated SSM_CONV ops are CUDA-family + CPU only; force them off on any non-CUDA compute backend so a Vulkan/SYCL/Metal build can't silently run the wrong plain-conv kernel. | yes on CUDA (byte-identical pre-0030); safety gate elsewhere |
+| 0026 | **Hybrid per-head bf16 SSM state (opt-in)** - `--ssm-bf16-tau` / option `ssm_bf16_tau`: fast-decaying GDN heads (memory length below the tau threshold) persist state as bf16, halving that head's decode byte stream (~+12% decode). | default tau=0 = f32 = **bit-exact**; the bf16 mode is **NOT** bit-exact (~91% same-top-p) |
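+
+The within-step cache of 0029 amounts to memoising the block table per decode
+step. A hedged Python sketch: `StepBlockTableCache`, the step counter used as
+the invalidation key and the builder callback are assumptions for illustration,
+not the patch's data structures.
+
+```python
+# Toy within-step cache: build the block table once per decode step and hand
+# each further full-attention layer a copy. Names are illustrative only.
+
+class StepBlockTableCache:
+    def __init__(self, build_table):
+        self.build_table = build_table   # expensive builder (hypothetical)
+        self.step = None                 # step the cached table belongs to
+        self.table = None
+        self.builds = 0                  # how often the builder actually ran
+
+    def get(self, step):
+        if step != self.step:            # first layer of a new step: rebuild
+            self.table = self.build_table()
+            self.step = step
+            self.builds += 1
+        return list(self.table)          # later layers: plain copy
+
+
+N_FULL_ATTN_LAYERS = 16                  # full-attention layer count from section 2
+cache = StepBlockTableCache(lambda: [5, 2, 7, 0])
+tables = [cache.get(step=0) for _ in range(N_FULL_ATTN_LAYERS)]
+print(cache.builds)
+```
+
+All 16 layers see the same table while the builder runs once per step; the
+real patch does the equivalent with a host-side memcpy.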
+
+---
+
+## 4. Benchmarks
+
+Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
+**Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
+S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
+serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png),
+[`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data
+[`final_benchmark.csv`](docs/final_benchmark.csv).
+
+### (a) + (b) Patched vs stock vs vLLM
+
+The **stock** and **patched** columns are the same binary, env-toggled, on the
+**same harness** (`llama-batched-bench`) - so "x over stock" is an exact
+apples-to-apples measure of the patch series' contribution. The **vLLM** column
+is a **different harness** (vLLM server + client continuous batching), so the
+cross-engine "% of vLLM" is **indicative, not apples-to-apples**.
+
+**Dense Qwen3.6-27B-NVFP4** (t/s):
+
+| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
+|----:|------:|--------:|-----:|------------------:|---------------------:|
+| 8   |  65.7 |   84.0 |  71.1 | 118% | 1.28x |
+| 32  | 113.7 |  204.0 | 207.6 |  98% | 1.79x |
+| 64  | 134.3 |  294.9 | 309.7 |  95% | 2.20x |
+| 128 | 143.5 |  371.2 | 422.4 |  88% | 2.59x |
+
+**MoE Qwen3.6-35B-A3B-NVFP4** (t/s):
+
+| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
+|----:|------:|--------:|------:|-----------------:|---------------------:|
+| 8   | 181.4 |  227.4 |  315.1 | 72% | 1.25x |
+| 32  | 260.8 |  455.7 |  681.9 | 67% | 1.75x |
+| 64  | 306.8 |  612.3 |  765.5 | 80% | 2.00x |
+| 128 | 331.3 |  772.6 | 1011.7 | 76% | 2.33x |
+
+**Caveat on the vLLM column.** Besides the different harness, the vLLM MoE
+@npl128 number here (1011.7 t/s at 128/128) runs *hotter* than the 901 t/s
+reference config (512/256), so the MoE "% of vLLM" reads **76% here vs ~86% at
+the ground-truth config**. Memory: llama.cpp uses **1.5-3x less** memory than
+vLLM.
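+
+The derived columns and the caveat are plain ratios of the table values. A
+short Python check, using only numbers quoted above; that the ~86% figure comes
+from dividing the patched 128/128 result by the 901 t/s reference is an
+assumption about how it was derived.
+
+```python
+def pct_of(patched, reference):
+    """Patched throughput as a rounded percentage of a reference engine."""
+    return round(100 * patched / reference)
+
+
+def speedup(patched, stock):
+    """Patched-over-stock factor, rounded to two decimals."""
+    return round(patched / stock, 2)
+
+
+# MoE Qwen3.6-35B-A3B-NVFP4 at npl 128 (t/s), from the table and the caveat.
+stock, patched = 331.3, 772.6
+vllm_128_128, vllm_512_256 = 1011.7, 901.0
+
+print(pct_of(patched, vllm_128_128))   # against the hotter 128/128 vLLM run
+print(pct_of(patched, vllm_512_256))   # against the 901 t/s reference config
+print(speedup(patched, stock))
+```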
+
+**Takeaway.** The patch series gives up to **2.59x (dense) / 2.33x (MoE)** over
+stock on the same harness. Dense is **parity-to-ahead of vLLM** up to npl 64
+(118% at npl 8, 95-98% at npl 32-64) and drops to 88% at npl 128; MoE trails at
+67-80% - the remaining gap is structural (see section 5).
+
+### (c) Apple Silicon (M4, 16GB Metal) - does the patchset help here?
+
+Short answer: **no - the wins are CUDA/Blackwell-specific.** Two facts first: the
+24GB NVFP4 GGUF doesn't fit a 16GB M4 (it would page from SSD), and on Metal
+`supports_op` **excludes NVFP4** from `MUL_MAT`/`MUL_MAT_ID`/`GET_ROWS` (FP4 matmuls
+fall back to CPU - no Apple FP4-MMA). So NVFP4 Qwen3.6 is not a Mac fit; a
+Metal-native Q4_K quant is.
+
+Measured **stock vs patched** (same pin `c299a92c`, both built `-DGGML_METAL=ON`;
+the 28-patch series **compiles clean on Metal** - the CUDA code is `#if`-guarded),
+on **Qwen3-8B Q4_K_M** (a dense GQA model that fits 16GB and exercises the *live*
+Metal features; no Qwen3.6 hybrid GGUF fits 16GB, and the GDN fusions gate off on
+Metal anyway), `llama-bench` pp512/tg128 t/s:
+
+| config | pp512 | tg128 |
+|---|---:|---:|
+| stock | 226.7 | 20.4 |
+| patched, paged **off** | 226.7 | 20.3 (= stock) |
+| patched, paged **on** | 222.6 | 19.8 (~0.97x) |
+
+Concurrency (`batched-bench`) scales identically to stock (S_TG ~20 -> ~137 at
+npl32, from llama.cpp's existing batching). **Verdict: neutral-to-slightly-negative
+on Metal.** Patched-paged-off equals stock; turning paged on is ~0-3% slower
+decode / ~2-8% slower prefill, because the in-kernel block-table flash-attn read
+that *recovers* the gather cost is CUDA-only (`fattn-*.cuh`) - on Metal the paged
+path falls back to a host-side gather, pure overhead over stock's contiguous read.
+Everything Blackwell-specific (NVFP4, GDN fusions via 0030, occupancy) is inert.
+So **on Apple Silicon, prefer the stock `llama-cpp` backend.**
+
+**Vulkan / SYCL** (source analysis): the gated-DeltaNet and SSM_CONV ops DO have
+upstream kernels on Vulkan and SYCL (as on Metal), so the Qwen3.6 hybrids RUN on
+all three via the non-fused path. The patchset's fusions are gated off there
+(0030), so the outcome is the same neutral-to-slightly-negative as Metal - not
+"won't run". This backend therefore ships **CUDA-only** (where the fusions are
+live + verified); non-CUDA users should use the stock `llama-cpp` backend. See
+[`UPSTREAM_LAYER2_SCOPE.md`](docs/UPSTREAM_LAYER2_SCOPE.md) for what native non-CUDA
+fused kernels would take.
+
+---
+
+## 5. Dev notes - what we learned
+
+**Bit-exact methodology.** Every bit-exact patch is gated two ways: (1) a greedy
+md5 gate - `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France
+is" -n 48 --temp 0 --seed 1 | md5sum`, paged paths prefixed with
+`LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged MoE), on the default
+chat-template path; and (2) `test-backend-ops` (CUDA0 vs CPU oracle) for every
+touched op (`SSM_CONV*`, `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
+
+**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md)).
+Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5
+(`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this
+is a benign FP-accumulation-order difference of the paged attention reduction,
+**KL-validated** against the f16 reference: KLD(paged||f16) 0.13600 <=
+KLD(nonpaged||f16) 0.13660, PPL within +/-0.29, ~zero probability bias - two
+equivalent FP-reorderings of the same quantized model, not a regression. Future
+paged-MoE regressions therefore compare to `8cb0ce23`, not `07db32c2`.
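+
+A per-path gate of this kind can be sketched as a lookup keyed by path. The
+md5 prefixes are the ones quoted above; `gate`, the table layout and comparing
+by 8-character prefix are illustrative assumptions, not the project's actual
+gate script.
+
+```python
+import hashlib
+
+# Reference md5 prefixes quoted above, keyed by (model kind, paged?).
+REFERENCE_MD5 = {
+    ("dense", False): "5951a5b4",
+    ("dense", True): "5951a5b4",   # dense is bit-exact across both paths
+    ("moe", False): "07db32c2",
+    ("moe", True): "8cb0ce23",     # paged MoE has its own reference
+}
+
+
+def gate(kind, paged, output, references=REFERENCE_MD5):
+    """True if the greedy output's md5 starts with that path's reference."""
+    digest = hashlib.md5(output).hexdigest()
+    return digest.startswith(references[(kind, paged)])
+
+
+# Usage: feed the captured bytes of the greedy completion for one path.
+print(gate("moe", True, b"placeholder output, not a real completion"))
+```
+
+Keying the reference by path keeps a paged-MoE run from being compared against
+`07db32c2` and flagged as a false regression.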
+
+**MoE-parity conclusion** (the residual gap is structural). The two heaviest MoE
+decode kernels - the GDN-SSM recurrence and the NVFP4-expert GEMM - are llama
+**wins** after this series (the recurrence runs at 102.6% of vLLM's bandwidth;
+the GEMM ties vLLM at the LPDDR5x BW floor). The residual gap is **bf16-projection
+bandwidth + the host scheduling loop**, both at the LPDDR5x floor - not a kernel
+llama is losing. The MoE GEMM kernel is *not* where the gap lives.
+
+**Rejected / flat levers** (recorded so they are not re-tried):
+
+- **Lever 2 - graph/stream coverage: FLAT.** Bit-exact graph coverage was
+  exhausted by 0025; more graph/stream overlap is a no-op or small regression on
+  this model.
+- **Lever 3 - act-quant fusion: FLAT.** The W4A4 act-quant tax is removable only
+  by W4A16 (a precision change, rejected) or a structural kernel rewrite; no
+  further bit-exact lever clears it. 0023 already banks the de-dup.
+- **Lever 4 - quantize the bf16 GDN/attn projections to NVFP4: REJECTED (KL-gate
+  fail).** Quantizing the projections to NVFP4 costs ~+6% PPL; vLLM deliberately
+  keeps the same bf16 projections. No-ship.
+- **W4A16-Marlin MoE GEMM: REJECTED.** It would buy a precision upgrade nobody
+  needs at the cost of a ~5% slower kernel; both kernels are already at the BW
+  floor. (The "the win was NVFP4-dense-quant, not the Marlin kernel" dense verdict
+  carries over to MoE.)
+
+**Opt-in bf16-SSM fast mode** (patch 0026, `ssm_bf16_tau`). The design premise -
+that bf16 KL error concentrates in long-memory heads and can be removed by
+keeping them f32 - is **empirically refuted**: the error scales with the bf16
+head *count* and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any useful
+byte saving, and the carry is byte-exact (genuine bf16 rounding, not a bug). The
+byte-saving (and ~+12% decode) is real but cannot meet a strict KL bar, so it
+ships **default-off (f32, bit-exact)** and opt-in only. Do not put a hybrid tau
+in a recommended/gallery config.
+
+---
+
+## 6. Architecture and quant generality
+
+(From the arch-generality and quant-generality audits.)
+
+- **15 of 16 optimizations are quant-AGNOSTIC.** Only **0023** (NVFP4
+  activation-quantize de-dup) is NVFP4-specific. The SSM/paged/MMQ optimizations
+  help **any quant** of these models (the GDN recurrence, conv, gather and
+  o_proj-MMQ levers operate on the f32 recurrent state and the routing layout,
+  not on the weight dtype).
+- **Arch-safe to build everywhere.** NVFP4 use is Blackwell-gated and falls back
+  to dequant on other hardware; the GB10-tuned occupancy params (0022) are
+  perf-only and env-selectable (`GDN_NW` / `GDN_CPW`), so they never change
+  correctness on other GPUs. Patch 0030 makes the fused-op emission CUDA-family +
+  CPU only, so a non-CUDA paged build routes to the safe upstream non-fused path.
+
+- **What generalizes beyond this backend (upstream candidates).** The *speedups*
+  are CUDA/Blackwell-specific (which is why Metal/Vulkan don't benefit - section
+  4c), but several *findings and ops* are portable and worth upstreaming:
+  - The headline is hardware-independent: on hybrid gated-DeltaNet models, decode
+    is bottlenecked by the recurrent-state **plumbing** (memcpy + gathers, ~67% of
+    the step), not the weight GEMM. The fusions for it (in-place state 0018, gather
+    0019/0028, conv 0021) are bit-exact and already have CPU reference kernels, so
+    they would speed up Qwen3.6 / Qwen3-Next / any hybrid-SSM decode on **every**
+    backend once the ggml ops gain the respective (Metal/Vulkan) kernels - the
+    highest-value upstream contribution.
+  - The o_proj GEMV->MMQ reshape (0020) is a model-graph fix (batch the projection
+    to hit the GEMM path) - arch-agnostic in principle, trivial to upstream.
+  - The paged KV + cross-request prefix sharing + decode-first scheduler align with
+    llama.cpp's own in-progress KV / chunked-prefill work and could inform it.
+  - The per-path bit-exact md5 gate + the weekly upstream-drift canary is a reusable
+    maintenance pattern for any vendored-patch backend.
+
+---
+
+## 7. Pin + maintenance policy
+
+- **Pinned to llama.cpp `c299a92c`.** The pin is advanced **only** by the manual
+  [`PIN_SYNC`](docs/PIN_SYNC_c299a92c.md) process: rebase the source-only patch series
+  onto the new tip, rebuild on GPU, and pass the bit-exact gate on every path
+  (dense + MoE, paged + non-paged) plus `test-backend-ops`. The `9d5d882d ->
+  c299a92c` jump (23 upstream commits) needed zero patch changes and did not
+  change decode output.
+- **Decoupled from the nightly auto-bumper.** There is deliberately **no**
+  `bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could
+  silently shift the tree out from under the patches.
+- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../.github/workflows/llama-cpp-paged-canary.yml)
+  (via [`.github/scripts/paged-canary-apply.sh`](../../../.github/scripts/paged-canary-apply.sh))
+  tries the patch series against the latest upstream tip with the build's own
+  strict `git apply`. **Red = upstream drifted past the series -> run a
+  PIN_SYNC** (do not bump the pin blindly). The canary references
+  [`PIN_SYNC_c299a92c.md`](docs/PIN_SYNC_c299a92c.md).
+
+---
+
+## 8. Models
+
+> **Build coverage: CUDA-only.** This backend ships only the CUDA/cublas build
+> targets (cuda-12, cuda-13, and the nvidia-l4t arm64 cuda-12/cuda-13 Jetson
+> rows). There are no cpu / vulkan / sycl / hipblas / metal-darwin builds: the
+> patchset's wins are CUDA/Blackwell-specific (section 4c), so off-CUDA the
+> backend is neutral-to-negative and non-CUDA users should run the stock
+> `llama-cpp` backend instead. The `backend/index.yaml` meta-backend resolves
+> `default`/`nvidia` to a CUDA variant accordingly.
+
+The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery:
+
+| Gallery entry | Weights (HuggingFace) | Notes |
+|---|---|---|
+| `qwen3.6-27b-nvfp4-paged` | [`mudler/Qwen3.6-27B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF) | Dense, native Blackwell NVFP4 (FP4-MMA). |
+| `qwen3.6-35b-a3b-nvfp4-paged` | [`mudler/Qwen3.6-35B-A3B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF) | MoE (256 experts, top-8), `file_type MOSTLY_NVFP4`. |
+
+Both gallery entries set `backend: llama-cpp-localai-paged` and the paged serving config
+(`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`,
+`flash_attention:on`, `context_size`). They intentionally stay bit-exact (no
+`ssm_bf16_tau`). The full backend-split + gallery plan is in
+[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md).
--- a/backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md
@@ -0,0 +1,514 @@
+# Plan: ship the paged llama.cpp as its OWN backend + NVFP4 Qwen3.6 gallery items
+
+Scoping deliverable only. NOTHING is changed by this document. It is grounded in the
+actual repo structure (read 2026-06-26 in worktree feat+paged-attention), not assumptions.
+
+SHIPPED REALITY (update 2026-06-27): the backend ships CUDA-only. The matrix rows and
+the index.yaml meta-backend keep ONLY the CUDA/cublas variants (cuda-12, cuda-13, and
+the nvidia-l4t arm64 cuda-12/cuda-13 Jetson rows). The cpu / vulkan / sycl / hipblas /
+metal-darwin variants discussed below as optional/phase-2 were NOT shipped (and the
+darwin row was removed): off-CUDA the patchset's wins gate off, so it is neutral-to-
+negative there and non-CUDA users should use the stock llama-cpp backend (README 4c).
+
+================================================================================
+0. GROUND TRUTH (what the repo actually does today)
+================================================================================
+
+The paged patchset is ALREADY integrated into the stock llama-cpp backend in this
+worktree. Two mechanisms, both already present:
+
+  (a) BUILD: backend/cpp/llama-cpp/Makefile has `LLAMA_PAGED?=on`. The `llama.cpp:`
+      target git-applies patches/0*.patch (base series) then, when LLAMA_PAGED != off,
+      patches/paged/0*.patch (the 0018-0023 paged series + the earlier 0001-0017).
+      prepare.sh has a fallback `patch`-based apply guarded by a sentinel
+      (llama.cpp/src/paged-kv-manager.cpp). So a stock `make backends/llama-cpp` TODAY
+      already ships the paged engine compiled in.
+
+  (b) RUNTIME GATING: backend/cpp/llama-cpp/grpc-server.cpp ALREADY carries the option
+      hooks (lines ~752-842). They only call setenv() before context init:
+        - option `kv_paged` / `paged_kv` / `paged_attention`  -> setenv LLAMA_KV_PAGED=1
+        - option `kv_paged_debug` / `paged_kv_debug`          -> setenv LLAMA_KV_PAGED_DEBUG=1
+        - option `max_prefill_tokens` / `mpt` / `prefill_budget` -> setenv LLAMA_PREFILL_BUDGET
+        - option `max_batch_tokens` / `mbt`                   -> setenv LLAMA_MAX_BATCH_TOKENS
+        - option `prefill_cap`                                -> setenv LLAMA_PREFILL_CAP
+      Against UNPATCHED llama.cpp these setenv() calls are inert (nothing reads the env),
+      so grpc-server.cpp is byte-safe to share between a clean build and a paged build.
+      The paged engine itself lives entirely inside the patched llama.cpp lib
+      (paged-kv-manager.cpp etc.), NOT in grpc-server.cpp.
+
+Conclusion: "stock llama-cpp + paged patchset, runtime-gated" is the CURRENT state of
+ONE backend. The task is to SPLIT that into two backends:
+  - llama-cpp  = clean upstream llama.cpp (de-risked: a dep-bump can never break on a
+                 paged hook), grpc-server.cpp keeps the dormant hooks.
+  - <newname>  = stock grpc-server.cpp + paged patch series applied + paged on.
+
+The turboquant backend is the EXACT precedent for "a llama.cpp variant that reuses the
+backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile + its own
+Dockerfile + its own matrix rows". Copy turboquant's shape, with two simplifications (see
+section 1).
+
+CPU_ALL_VARIANTS reuse: backend/cpp/llama-cpp/Makefile already has `llama-cpp-cpu-all`
+(one grpc-server + dlopen libggml-cpu-*.so via -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS,
+SHARED_LIBS=ON make-var). turboquant mirrors it with `turboquant-cpu-all`. The new backend
+gets the same single-build CPU target for free by reusing the same Makefile machinery.
+
+--------------------------------------------------------------------------------
+RECOMMENDED BACKEND NAME: `llama-cpp-paged`  (see section 4 for the full rationale)
+--------------------------------------------------------------------------------
+Everywhere below, NAME = llama-cpp-paged, DOCKERFILE = Dockerfile.llama-cpp-paged,
+SRC DIR = backend/cpp/llama-cpp-paged/, MAKE VAR = BACKEND_LLAMA_CPP_PAGED.
+DO NOT use the dotted working name `localai-llama.cpp`: a dot in Dockerfile.<suffix> and
+in the tag-suffix is unprecedented (every sibling is hyphenated: llama-cpp, ik-llama-cpp,
+turboquant, ds4) and complicates the changed-backends.js endsWith() suffix matching.
+
+================================================================================
+1. NEW BACKEND - file by file
+================================================================================
+
+--------------------------------------------------------------------------------
+1.1 backend/cpp/llama-cpp/Makefile  (the ONE necessary touch to stock)
+--------------------------------------------------------------------------------
+Change exactly one default so the STOCK image ships clean against upstream:
+
+    -LLAMA_PAGED?=on
+    +LLAMA_PAGED?=off
+
+Why: this is the entire point of the split - stock llama-cpp must build clean so an
+upstream LLAMA_VERSION bump can never fail on a paged hook. The runtime hooks in
+grpc-server.cpp stay (inert). The new backend forces LLAMA_PAGED=on explicitly (1.2), so
+it does not depend on this default. NOTE this DOES change stock's shipped artifact (it
+currently ships paged-compiled-in-but-gated); that is intended de-risking, call it out in
+the PR. If the team prefers stock literally untouched, the alternative is to leave
+`?=on` and accept that stock keeps carrying the patch series - but then "clean stock" is
+not achieved. Recommendation: flip to off.
+
+(No other change to backend/cpp/llama-cpp/ - grpc-server.cpp, CMakeLists.txt, prepare.sh,
+patches/, patches/paged/ are all reused as-is by the new backend.)
+
+--------------------------------------------------------------------------------
+1.2 backend/cpp/llama-cpp-paged/Makefile  (NEW - thin wrapper, model on turboquant)
+--------------------------------------------------------------------------------
+Mirror backend/cpp/turboquant/Makefile, but SIMPLER (two things turboquant needs that we
+do NOT):
+  - turboquant overrides LLAMA_REPO/LLAMA_VERSION to a fork. We use the SAME upstream pin
+    as stock (it lives in backend/cpp/llama-cpp/Makefile, already auto-bumped). So we do
+    NOT set LLAMA_VERSION here -> no bump_deps.yaml entry needed (big simplification vs
+    turboquant). We only force LLAMA_PAGED=on.
+  - turboquant runs patch-grpc-server.sh (augments the KV-cache type allow-list) and
+    apply-patches.sh (fork catch-up). We need NEITHER: grpc-server.cpp already has the
+    paged hooks, and the paged patch series is applied by the copied llama-cpp Makefile's
+    own `llama.cpp:` target when LLAMA_PAGED=on.
+
+Shape (one flavor shown; replicate the turboquant flavor set: avx/avx2/avx512/fallback/
+cpu-all/grpc/rpc-server):
+
+    LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
+
+    define paged-build   # $(1)=flavor $(2)=cmake flags $(3)=target
+      rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
+      cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
+      $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build purge
+      # clone upstream + apply base AND paged patch series (LLAMA_PAGED=on forces it)
+      LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build llama.cpp
+      CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_PAGED=on \
+        $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build grpc-server
+      cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build/grpc-server llama-cpp-paged-$(1)
+    endef
+
+    llama-cpp-paged-cpu-all:
+      # identical to turboquant-cpu-all: SHARED_LIBS=ON + GGML_BACKEND_DL + CPU_ALL_VARIANTS
+      # + --target ggml; then collect ggml-shared-libs/ for package.sh to bundle.
+      ... LLAMA_PAGED=on SHARED_LIBS=ON \
+          EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" \
+          TARGET="--target grpc-server --target ggml" ...
+
+    package: ; bash package.sh
+    purge:   ; rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-*-build; rm -rf llama-cpp-paged-* package
+    clean: purge
+
+Binaries are named llama-cpp-paged-{cpu-all,fallback,grpc,rpc-server,...} so run.sh and
+package.sh glob them.
+
+--------------------------------------------------------------------------------
+1.3 backend/cpp/llama-cpp-paged/run.sh  (NEW - copy turboquant/run.sh, rename binaries)
+--------------------------------------------------------------------------------
+s/turboquant/llama-cpp-paged/g. Prefers llama-cpp-paged-cpu-all if present, falls back to
+llama-cpp-paged-fallback; llama-cpp-paged-grpc when LLAMACPP_GRPC_SERVERS set; Darwin
+DYLD_LIBRARY_PATH branch; lib/ld.so launch. Keep verbatim otherwise.
+
+--------------------------------------------------------------------------------
+1.4 backend/cpp/llama-cpp-paged/package.sh  (NEW - copy turboquant/package.sh, rename)
+--------------------------------------------------------------------------------
+s/turboquant/llama-cpp-paged/g. Copies llama-cpp-paged-* into package/, bundles
+ggml-shared-libs/*.so* into package/lib (the CPU_ALL_VARIANTS dlopen set), copies run.sh,
+and the per-arch libc/ld.so set (unchanged).
+
+--------------------------------------------------------------------------------
+1.5 backend/Dockerfile.llama-cpp-paged  (NEW - copy Dockerfile.turboquant, swap paths)
+--------------------------------------------------------------------------------
+Identical 3-stage structure (builder-fromsource / builder-prebuilt / FROM scratch). Edits:
+  - bind/run .docker/llama-cpp-paged-compile.sh (new, 1.6) instead of turboquant-compile.sh
+  - ccache id: id=llama-cpp-paged-ccache-${TARGETARCH}-${BUILD_TYPE}
+    (OPTIONAL OPTIMIZATION: set id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE} to SHARE
+     stock llama-cpp's ccache - the paged TUs are mostly byte-identical to stock, so a warm
+     stock cache would give the paged build near-free object reuse. Trade-off: a regression
+     in one could surface as a cold miss in the other. Recommend sharing; revisit if noisy.)
+  - both builder stages run `make -BC /LocalAI/backend/cpp/llama-cpp-paged package`
+  - final COPY --from=builder /LocalAI/backend/cpp/llama-cpp-paged/package/. ./
+
+--------------------------------------------------------------------------------
+1.6 .docker/llama-cpp-paged-compile.sh  (NEW - copy llama-cpp-compile.sh, swap make targets)
+--------------------------------------------------------------------------------
+Identical to .docker/llama-cpp-compile.sh except `cd .../llama-cpp-paged` and call
+`make llama-cpp-paged-cpu-all` (BUILD_TYPE empty / CPU) or `make llama-cpp-paged-fallback`
+(GPU), then `make llama-cpp-paged-grpc` + `make llama-cpp-paged-rpc-server`. Keep the
+arm64 gcc-14 apt step (CPU_ALL_VARIANTS armv9.2 SME needs gcc-14). ccache export unchanged.
+
+--------------------------------------------------------------------------------
+1.7 Makefile (top-level) - 6 edits, mirror the turboquant lines
+--------------------------------------------------------------------------------
+  a) .NOTPARALLEL (line 2): append `backends/llama-cpp-paged`
+  b) Backend def (after BACKEND_TURBOQUANT, line ~1172):
+       # llama-cpp-paged = stock llama.cpp grpc-server + LocalAI paged-attention patch
+       # series (LLAMA_PAGED=on). Reuses backend/cpp/llama-cpp sources via a thin wrapper.
+       BACKEND_LLAMA_CPP_PAGED = llama-cpp-paged|llama-cpp-paged|.|false|false
+     (lang field `llama-cpp-paged` -> Dockerfile.llama-cpp-paged, matching the
+      llama-cpp / ik-llama-cpp / turboquant convention where lang==backend name.)
+  c) generate-docker-build-target eval (after BACKEND_TURBOQUANT, line ~1273):
+       $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_PAGED)))
+  d) docker-build-backends (line ~1337): append docker-build-llama-cpp-paged
+  e) test-extra-backend-llama-cpp-paged target (mirror test-extra-backend-turboquant,
+     line ~673): BACKEND_IMAGE=local-ai-backend:llama-cpp-paged $(MAKE) test-extra-backend
+  f) (optional) backends/llama-cpp-paged-darwin target if shipping metal (mirror
+     backends/llama-cpp-darwin at line 1124; see 1.11).
+
+--------------------------------------------------------------------------------
+1.8 .github/backend-matrix.yml - add rows (mirror every llama-cpp row, swap names)
+--------------------------------------------------------------------------------
+For EACH variant you choose to ship (see phased recommendation in section 4), add a row
+copied from the corresponding llama-cpp row with:
+  - backend: "llama-cpp-paged"
+  - dockerfile: "./backend/Dockerfile.llama-cpp-paged"
+  - tag-suffix: swap `-llama-cpp` -> `-llama-cpp-paged`
+    (e.g. -cpu-llama-cpp -> -cpu-llama-cpp-paged;
+           -gpu-nvidia-cuda-12-llama-cpp -> -gpu-nvidia-cuda-12-llama-cpp-paged; etc.)
+  - builder-base-image: UNCHANGED - reuse the same base-grpc-* tags as llama-cpp
+    (this backend compiles the same gRPC + same toolchain; no new base-images.yml variant
+     is needed, so NO base-images bootstrap step). This is the cheap-variant payoff.
+  - CPU: TWO per-arch rows (amd64 ubuntu-latest + arm64 ubuntu-24.04-arm) sharing
+    tag-suffix '-cpu-llama-cpp-paged' so changed-backends.js emits a merge-matrix entry and
+    backend-merge-jobs assembles the manifest list. Same per-arch native + manifest-merge
+    pattern as -cpu-llama-cpp.
+  - Darwin (if shipping): add to includeDarwin:
+      - backend: "llama-cpp-paged"
+        tag-suffix: "-metal-darwin-arm64-llama-cpp-paged"
+        lang: "go"
+    (omit build-type, exactly like the llama-cpp darwin row at line 4908.)
+
+  REMINDER: the CI path filter only builds a backend on a PR when a file under its dir
+  changes. The PR that adds this backend touches backend/cpp/llama-cpp-paged/* so it self-
+  triggers. But also add the cross-trigger in 1.9 so future edits to backend/cpp/llama-cpp/
+  (the shared source) retrigger this backend too.
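+
+  SKETCH (hedged): the two per-arch CPU rows described above might look like the fragment
+  below. Only backend, dockerfile, tag-suffix and builder-base-image are named in this plan;
+  the `runs-on` key name and the <...> placeholders are assumptions, so copy the real keys
+  and values from the existing -cpu-llama-cpp rows rather than from this sketch.
+
+```yaml
+# Hypothetical rows: every field not shown must be copied from the matching -cpu-llama-cpp row.
+- backend: "llama-cpp-paged"
+  dockerfile: "./backend/Dockerfile.llama-cpp-paged"
+  tag-suffix: "-cpu-llama-cpp-paged"          # shared by both arches -> merge-matrix entry
+  runs-on: "ubuntu-latest"                    # assumed key name; amd64 runner
+  builder-base-image: "<same base-grpc-* tag as the amd64 -cpu-llama-cpp row>"
+- backend: "llama-cpp-paged"
+  dockerfile: "./backend/Dockerfile.llama-cpp-paged"
+  tag-suffix: "-cpu-llama-cpp-paged"
+  runs-on: "ubuntu-24.04-arm"                 # assumed key name; arm64 runner
+  builder-base-image: "<same base-grpc-* tag as the arm64 -cpu-llama-cpp row>"
+```
+
+  Keeping the tag-suffix identical across the two rows is the part that matters: it is what
+  lets the manifest-merge step treat them as one multi-arch image.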
+
+--------------------------------------------------------------------------------
+1.9 scripts/changed-backends.js - three edits (mirror turboquant exactly)
+--------------------------------------------------------------------------------
+  a) inferBackendPath(): add BEFORE the generic `endsWith("llama-cpp")` branch (line 56),
+     next to the turboquant branch (line 45):
+       if (item.dockerfile.endsWith("llama-cpp-paged")) {
+         // reuses backend/cpp/llama-cpp sources via a thin wrapper Makefile
+         return `backend/cpp/llama-cpp-paged/`;
+       }
+     ORDER: "Dockerfile.llama-cpp-paged".endsWith("llama-cpp") is false, so the generic branch
+     cannot shadow this one today. Keep the specific branch first anyway (defensive).
+  b) inferBackendPathDarwin(): add a case (next to the llama-cpp one at line 66):
+       if (item.backend === "llama-cpp-paged") { return `backend/cpp/llama-cpp-paged/`; }
+  c) Per-backend cross-trigger (line 274-278, mirror the turboquant block):
+       if (backend === "llama-cpp-paged" && !changed) {
+         changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
+       }
+  Verify: node -e "... e.dockerfile.endsWith('llama-cpp-paged') ..." per adding-backends.md.
+
+--------------------------------------------------------------------------------
+1.10 backend/index.yaml - meta + image entries (META-BACKEND - capabilities map, NO uri)
+--------------------------------------------------------------------------------
+GOTCHA (project_backend_meta_gotcha): a backend that ships per-platform images MUST be a
+meta backend = an anchor with a `capabilities:` map and NO top-level `uri:`; the concrete
+per-platform entries carry the uri. Copy the *llamacpp anchor (lines 3-31).
+
+  Step a - meta anchor in `## metas` (after *turboquant, ~line 74):
+    - &llamacpppaged
+      name: "llama-cpp-paged"
+      alias: "llama-cpp-paged"
+      license: mit
+      icon: <same as llama-cpp>
+      description: |
+        LocalAI's paged-attention llama.cpp: on-demand paged KV cache + decode-first
+        prefill budget. Stock llama.cpp grpc-server + the LocalAI paged patch series.
+        Tuned for NVFP4 dense/MoE on Blackwell/GB10. Reuses the llama-cpp gRPC server.
+      urls: [ https://github.com/ggerganov/llama.cpp ]
+      tags: [ text-to-text, LLM, CPU, GPU, CUDA, Metal, paged-attention, nvfp4 ]
+      capabilities:
+        default: "cpu-llama-cpp-paged"
+        nvidia: "cuda12-llama-cpp-paged"
+        nvidia-cuda-12: "cuda12-llama-cpp-paged"
+        nvidia-cuda-13: "cuda13-llama-cpp-paged"
+        nvidia-l4t: "nvidia-l4t-arm64-llama-cpp-paged"
+        nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-paged"
+        nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-paged"
+        metal: "metal-llama-cpp-paged"
+        # add amd/intel/vulkan keys ONLY for variants you actually build (section 4)
+
+  Step b - a `-development` meta (mirror llama-cpp-development, line 1611) with the same
+    capabilities map pointing at the `*-development` image names.
+
+  Step c - concrete image entries at end of file (mirror the llama-cpp block lines
+    2106-2200), one latest + one development per variant, each as:
+      - !!merge <<: *llamacpppaged
+        name: "cpu-llama-cpp-paged"
+        uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-paged"
+        mirrors: [ localai/localai-backends:latest-cpu-llama-cpp-paged ]
+      - !!merge <<: *llamacpppaged
+        name: "cpu-llama-cpp-paged-development"
+        uri: "quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp-paged"
+        mirrors: [ localai/localai-backends:master-cpu-llama-cpp-paged ]
+      ...repeat for cuda12 / cuda13 / l4t / metal etc.
+  The `latest-` / `master-` uri prefix + tag-suffix MUST match the matrix tag-suffix exactly.
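+
+  SKETCH (hedged): Steps b and c, written out for the development meta and one CUDA variant.
+  The image names follow the naming already used above; whether the real llama-cpp-development
+  meta is expressed as a merge of the base anchor is an assumption, so check line 1611 first.
+  The cuda12 uri assumes the matrix tag-suffix -gpu-nvidia-cuda-12-llama-cpp-paged from 1.8.
+
+```yaml
+# Step b (sketch): -development meta; the merge-from-anchor shape is assumed, not verified.
+- !!merge <<: *llamacpppaged
+  name: "llama-cpp-paged-development"
+  capabilities:
+    default: "cpu-llama-cpp-paged-development"
+    nvidia: "cuda12-llama-cpp-paged-development"
+    nvidia-cuda-12: "cuda12-llama-cpp-paged-development"
+    nvidia-cuda-13: "cuda13-llama-cpp-paged-development"
+# Step c (sketch): concrete cuda12 entries, latest + development.
+- !!merge <<: *llamacpppaged
+  name: "cuda12-llama-cpp-paged"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-llama-cpp-paged"
+  mirrors: [ localai/localai-backends:latest-gpu-nvidia-cuda-12-llama-cpp-paged ]
+- !!merge <<: *llamacpppaged
+  name: "cuda12-llama-cpp-paged-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-llama-cpp-paged"
+  mirrors: [ localai/localai-backends:master-gpu-nvidia-cuda-12-llama-cpp-paged ]
+```
+
+  Every name referenced in a capabilities map must exist as a concrete entry, so add the
+  meta keys and the image entries for a variant in the same change.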
+
+--------------------------------------------------------------------------------
+1.11 Darwin (only if shipping metal; the NVFP4 target is CUDA, so metal is optional/phase 2)
+--------------------------------------------------------------------------------
+If metal is shipped, also:
+  - scripts/build/llama-cpp-paged-darwin.sh (copy scripts/build/llama-cpp-darwin.sh; it
+    drives the 3 CMake variants + otool dylib bundling). Ensure it forces LLAMA_PAGED=on.
+  - Makefile `backends/llama-cpp-paged-darwin` target (mirror backends/llama-cpp-darwin).
+  - backend_build_darwin.yml: add the llama-cpp-paged branch (mirror the llama-cpp-specific
+    step that calls `make backends/llama-cpp-darwin`).
+  - index.yaml metal-llama-cpp-paged / -development image entries (already in 1.10).
+  - C++ proto gotcha already handled (reuses llama-cpp CMakeLists.txt with hw_grpc_proto
+    linking protobuf/grpc++), so no Homebrew-include failure.
+
+--------------------------------------------------------------------------------
+1.12 Importer / /backends/known dropdown  (drop-in, NOT a new importer)
+--------------------------------------------------------------------------------
+This backend consumes GGUF exactly like llama-cpp -> extend the EXISTING importer, do not
+add a new one (per adding-backends.md rule 2). Edit core/gallery/importers/llama-cpp.go:
+  - AdditionalBackends() (line 37): append
+      {Name: "llama-cpp-paged", Modality: "text",
+       Description: "Paged-attention llama.cpp (on-demand paged KV + decode-first budget)"}
+  - Import() backend allow-list (line 133): add "llama-cpp-paged" to the switch case so a
+      preferences.backend == "llama-cpp-paged" is honored:
+        case "ik-llama-cpp", "turboquant", "llama-cpp-paged": backend = b
+  - core/gallery/importers/importers_test.go: add a table case asserting the preference
+    override emits backend: llama-cpp-paged (Ginkgo/Gomega; reuse an existing public GGUF
+    HF fixture). Run `go test ./core/gallery/importers/...`.
+
+--------------------------------------------------------------------------------
+1.13 Docs
+--------------------------------------------------------------------------------
+  - docs/content/features/backends.md: add llama-cpp-paged to the text-to-text/LLM list,
+    one line noting paged KV + NVFP4 Blackwell tuning. (It is a llama.cpp variant, not an
+    in-house from-scratch engine, so do NOT add it to the README maintained-engines table.)
+
+--------------------------------------------------------------------------------
+1.14 Does grpc-server.cpp need the paged hooks?  YES - already present, reused unchanged.
+--------------------------------------------------------------------------------
+The hooks (kv_paged / max_batch_tokens / prefill_budget / prefill_cap) are already in the
+SHARED backend/cpp/llama-cpp/grpc-server.cpp. The paged backend reuses that file verbatim
+(via the Makefile copy). No patch-grpc-server.sh step is needed (unlike turboquant). The
+hooks are what translate the gallery `options:` (section 2) into the LLAMA_KV_PAGED /
+LLAMA_MAX_BATCH_TOKENS env that the paged llama.cpp lib reads.
+
+================================================================================
+2. GALLERY ITEMS - NVFP4 Qwen3.6 dense + MoE
+================================================================================
+
+Add two entries to gallery/index.yaml. Schema (verified against existing GGUF items and
+the LocalAI config structs): backend selection via `overrides.backend`; runtime knobs via
+either typed config fields (context_size/f16/flash_attention/gpu_layers/batch) or the
+`options:` string list (key:value, parsed by grpc-server.cpp set_option).
+
+--------------------------------------------------------------------------------
+2.1 Benchmark llama-server flags -> LocalAI model-config mapping
+--------------------------------------------------------------------------------
+  -c 131072                  -> context_size: 131072            (LLMConfig.ContextSize, yaml context_size)
+  -fa on                     -> flash_attention: "on"           (LLMConfig.FlashAttention, yaml flash_attention; string)
+  -ngl 99                    -> gpu_layers: 99                  (LLMConfig.NGPULayers, yaml gpu_layers; or omit -> DefaultNGPULayers offloads all)
+  -b 2048                    -> batch: 2048                     (schema.PredictionOptions.Batch, yaml batch)  [see caveat]
+  --parallel 128             -> options: ["parallel:128"]       (grpc-server.cpp:629; alias n_parallel)
+  LLAMA_KV_PAGED=1           -> options: ["paged_kv:true"]      (grpc-server.cpp:778)
+  LLAMA_MAX_BATCH_TOKENS=512 -> options: ["max_batch_tokens:512"] (grpc-server.cpp:821; alias mbt)
+  f16 KV                     -> f16: true                       (LLMConfig.F16, yaml f16)
+  (recommended for paged)    -> options: ["kv_unified:false"]   (grpc-server.cpp:746 - the per-slot paged
+                                  capacity/memory benefit only materializes with a per-sequence cache;
+                                  the patch comment explicitly recommends pairing paged with kv_unified:false)
+
+  CAVEAT (-ub 512): LocalAI sets params.n_ubatch = params.n_batch = request->nbatch()
+  (grpc-server.cpp:528,532). There is NO separate config field for n_ubatch, so the
+  benchmark's `-b 2048 -ub 512` split is NOT exactly reproducible. Options:
+    (i)  set batch: 512 -> n_batch=n_ubatch=512 (matches -ub; the decode-first
+         max_batch_tokens=512 budget is the dominant prefill lever anyway, and the
+         benchmark states decode throughput is budget-independent), OR
+    (ii) set batch: 2048 -> n_ubatch also 2048 (bigger physical batch, more KV scratch).
+  RECOMMEND (i) batch: 512 for the shipped gallery config (closest to the measured run +
+  lighter memory). Flag separately: a tiny grpc-server.cpp option `n_ubatch`/`ubatch` could
+  be added later to honor -b/-ub independently (not required to ship).
+
+--------------------------------------------------------------------------------
+2.2 gallery/index.yaml entry - DENSE  q36-27b-nvfp4
+--------------------------------------------------------------------------------
+- name: "qwen3.6-27b-nvfp4-paged"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF      # placeholder, section 3
+  description: |
+    Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF. Configured for LocalAI's
+    paged-attention llama.cpp backend: on-demand paged KV + decode-first prefill budget.
+    Benchmarked on GB10/DGX Spark at 90-117% of vLLM dense decode at 1.5-3x lower memory.
+  license: "apache-2.0"                                         # confirm vs Qwen license
+  tags: [ llm, gguf, nvfp4, reasoning ]
+  icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
+  overrides:
+    backend: llama-cpp-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 131072
+    gpu_layers: 99
+    batch: 512                       # see -ub caveat 2.1; matches the 512 ubatch floor
+    known_usecases: [ chat ]
+    options:
+      - use_jinja:true
+      - paged_kv:true                # LLAMA_KV_PAGED=1
+      - max_batch_tokens:512         # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget)
+      - kv_unified:false             # enables the per-slot paged capacity/memory benefit
+      - parallel:128                 # --parallel 128 serving slots
+    parameters:
+      model: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
+      sha256: <FILL after publish>
+      uri: https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF/resolve/main/q36-27b-nvfp4.gguf
+
+--------------------------------------------------------------------------------
+2.3 gallery/index.yaml entry - MoE  q36-35b-a3b-nvfp4
+--------------------------------------------------------------------------------
+Same shape; the MoE is lighter on memory (~3B active). parallel:128 + budget 256 was the
+MoE decode-throughput sweet spot in the sweep, but 512 is fine as a default; if optimizing
+purely for saturated MoE decode use max_batch_tokens:256.
+- name: "qwen3.6-35b-a3b-nvfp4-paged"
+  urls: [ https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF ]
+  ...
+  overrides:
+    backend: llama-cpp-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 131072
+    batch: 512
+    options:
+      - use_jinja:true
+      - paged_kv:true
+      - max_batch_tokens:512          # or 256 for max saturated MoE decode (sweep winner)
+      - kv_unified:false
+      - parallel:128
+    parameters:
+      model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
+  files:
+    - filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
+      sha256: <FILL after publish>
+      uri: https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF/resolve/main/q36-35b-a3b-nvfp4.gguf
+
+Note: these are the BENCHMARK serving configs. For an interactive single-user default you
+may want a second lighter gallery variant (context_size 16384, parallel 4, drop the budget)
+- optional, not required to ship the benchmark reproduction.
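+
+A minimal sketch of that lighter variant, assuming it is derived from the dense entry in 2.2
+with only the three changes named above (context_size 16384, parallel 4, no max_batch_tokens).
+The entry name is hypothetical, and the elided fields (`...`) are meant to be copied from 2.2
+unchanged:
+
+```yaml
+# Hypothetical interactive variant; the name is illustrative, not a published gallery entry.
+- name: "qwen3.6-27b-nvfp4-paged-interactive"
+  ...
+  overrides:
+    backend: llama-cpp-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 16384              # down from 131072
+    gpu_layers: 99
+    batch: 512
+    options:
+      - use_jinja:true
+      - paged_kv:true
+      - kv_unified:false
+      - parallel:4                   # down from 128; max_batch_tokens omitted (budget dropped)
+    parameters:
+      model: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
+```
+
+Because it points at the same GGUF file, such a variant would add no extra download; it only
+changes the serving knobs.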
+
+================================================================================
+3. GGUF PUBLISHING (so the gallery uri: resolves)
+================================================================================
+
+The two GGUFs already exist on the DGX dev box (final_benchmark.csv references
+q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; README.md "Models" + "Benchmarks"
+document provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
+tensors from nvidia modelopt weights). To publish:
+
+  1. HF repos (suggest two, under the org that owns the gallery-referenced weights):
+       <ORG>/Qwen3.6-27B-NVFP4-GGUF      (single q36-27b-nvfp4.gguf)
+       <ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF  (single q36-35b-a3b-nvfp4.gguf)
+     ORG = localai-org (brand) or mudler (personal); pick per ownership of the conversions.
+  2. Upload each .gguf; compute sha256 (sha256sum) and paste into the gallery `files:` sha256
+     (LocalAI verifies it on download). Without sha256 the entry still works but loses the
+     integrity check - fill it.
+  3. Model card metadata: base_model Qwen/Qwen3.6-*, library_name gguf, quantization NVFP4,
+     pipeline_tag text-generation, license (confirm Qwen3.6 license terms - apache-2.0 vs
+     Qwen community license), a note that it REQUIRES the llama-cpp-paged backend (NVFP4 +
+     paged), and the GB10 benchmark table (link README.md "Benchmarks" numbers).
+  4. NVFP4 requires a llama.cpp new enough to read the NVFP4 GGUF type. Confirm the pinned
+     LLAMA_VERSION in backend/cpp/llama-cpp/Makefile supports NVFP4 tensor types (the dev
+     tree that produced the GGUFs did). If the current pin predates NVFP4 GGUF support, the
+     backend pin must be bumped OR the paged patch series must carry the NVFP4 reader. THIS
+     IS A GATING CHECK before the gallery items are usable - verify on a GPU box.
+  5. Provenance/licensing: the dense conversion derives from unsloth; the MoE from nvidia
+     modelopt weights. Ensure redistribution of the converted GGUFs is permitted and
+     attribute upstream in the card.
+
+================================================================================
+4. OPEN DECISIONS / BLOCKERS / BUILD COST
+================================================================================
+
+BACKEND NAME - RECOMMEND `llama-cpp-paged`.
+  - llama-cpp-paged (RECOMMENDED): descriptive (it IS the paged variant), hyphenated like
+    every sibling (llama-cpp/ik-llama-cpp/turboquant/ds4), collision-free in the
+    changed-backends.js endsWith() suffix scheme, self-documenting in the /backends/known
+    importer dropdown. Reads correctly next to "turboquant" and "ik-llama-cpp".
+  - localai-llama-cpp (branding alternative, ACCEPTABLE): keeps the LocalAI brand without a
+    dot; hyphenated and safe. Use this if marketing wants "LocalAI's own llama.cpp" framing.
+    Slightly less self-explanatory about WHAT differs (paged) in the dropdown.
+  - localai-llama.cpp (the working name; NOT RECOMMENDED): the dot makes
+    Dockerfile.localai-llama.cpp and tag-suffix -cpu-localai-llama.cpp the only dotted ones in
+    the repo, and ".cpp" looks like a file extension to the suffix matcher. Avoid.
+
+BLOCKERS / GATING CHECKS (cannot be closed read-only, no GPU here):
+  1. NVFP4 GGUF read support in the pinned LLAMA_VERSION (section 3.4). Must verify on GPU.
+     If unsupported, bump the pin (which also affects stock llama-cpp) or carry the reader.
+  2. The two GGUFs are not yet on HF (section 3). Gallery uri + sha256 are placeholders
+     until upload. Blocks gallery validation only, not the backend build.
+  3. -ub vs -b split (section 2.1) is not exactly reproducible without a tiny grpc-server
+     option; shipped config uses batch:512. Minor, not a blocker.
+  4. Flipping stock LLAMA_PAGED?=off changes stock's shipped artifact (de-risking, intended)
+     - get explicit sign-off since it alters a heavily-used backend's build.
+
+PLATFORM SHIP MATRIX (RECOMMENDED PHASING - the variant is cheap because it reuses the same
+base-grpc-* prebuilt bases and the same compile machinery, so each row is just CI minutes):
+  Phase 1 (the benchmark target - GB10/Blackwell is CUDA):
+    - cuda12 amd64, cuda13 amd64, cuda13 arm64 (sbsa), l4t-cuda-12 arm64  (NVFP4/paged win)
+    - cpu-all amd64 + cpu-all arm64 (the single CPU_ALL_VARIANTS build; baseline coverage)
+  Phase 2 (parity with stock llama-cpp coverage, only if demand):
+    - metal-darwin-arm64 (1.11), vulkan amd64/arm64, rocm amd64, intel sycl f16/f32
+  Defer rocm/sycl/vulkan/metal unless asked - the paged + NVFP4 story is GPU/CUDA-centric
+  and these add CI cost without a clear consumer.
+
+BUILD-COST ESTIMATE PER PLATFORM (with warm base-grpc-* base + ccache; the paged TUs are
+~byte-identical to stock so a SHARED ccache id makes most objects free):
+  - CPU_ALL_VARIANTS (per arch): ~15-30 min warm / ~35-50 min cold. arm64 adds a gcc-14
+    apt step. Two arches + a merge job.
+  - CUDA (per arch): ~25-45 min warm / ~45-75 min cold (nvcc dominates; ccache helps less
+    across CUDA arch flag changes). amd64 cuda12 + cuda13, arm64 cuda13 + l4t = 4 jobs.
+  - Metal/Darwin (if Phase 2): native macos-14 runner, ~20-35 min with a warm ccache.
+  - No base-images.yml change and no bootstrap dispatch (reuses existing base-grpc-* tags),
+    so the only new CI cost is the per-row build minutes above. PR builds read cache, don't
+    write; first master build per row pays the cold cost once, then warm.
+
+VERIFICATION (post-implementation, needs a GPU box - out of scope here):
+  - `make backends/llama-cpp-paged` builds + installs locally (from-source path).
+  - Confirm stock `make backends/llama-cpp` now builds clean (no paged-kv-manager.cpp in the
+    checkout) - proves the split.
+  - Load a published NVFP4 GGUF via the gallery entry, hit /v1/chat/completions, confirm the
+    server log shows LLAMA_KV_PAGED engaged (LLAMA_KV_PAGED_DEBUG trace) and the configured
+    max_batch_tokens/parallel took effect.
+  - go test ./core/gallery/importers/... green (importer drop-in case).
+  - node scripts/changed-backends.js dry-run: editing backend/cpp/llama-cpp/* retriggers
+    llama-cpp-paged (cross-trigger), editing backend/cpp/llama-cpp-paged/* triggers it too.
+
+================================================================================
+END OF PLAN
+================================================================================
--- a/backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md
@@ -0,0 +1,75 @@
+# Paged bit-exactness gate - per path (canonical references)
+
+## TL;DR
+
+The greedy decode of the **paged** path does not byte-match the **non-paged**
+path for the MoE model. This is a **benign FP-accumulation-order difference of
+the paged attention reduction**, KL-validated against the f16 reference. It is
+**not a bug**. The bit-exactness gate is therefore **per path**:
+
+| path | model | canonical md5 |
+|------|-------|---------------|
+| non-paged | MoE q36-35b-a3b-nvfp4   | `07db32c2bcb78d17a43ed18bc22705cd` |
+| paged     | MoE q36-35b-a3b-nvfp4   | `8cb0ce23777bf55f92f63d0292c756b0` |
+| non-paged | dense q36-27b-nvfp4     | `5951a5b4d624ce891e22ab5fca9bc439` |
+| paged     | dense q36-27b-nvfp4     | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) |
+
+Gate command (chat-template / conversation path):
+```
+llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
+                 -n 48 --temp 0 --seed 1
+# paged: prefix with  LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
+```
+Note: use the default chat-template path (do **not** pass `-no-cnv`; raw
+completion lands in a different md5 namespace).
+
+**Future paged-MoE regressions compare to the PAGED reference `8cb0ce23`, not to
+the non-paged `07db32c2`.** Dense is bit-exact across paths, so dense uses the
+single reference `5951a5b4`.
+
+## Why dense is bit-exact but MoE is not
+
+Dense paged decode reproduces the non-paged reduction order exactly, so dense
+greedy md5 is identical across paths. The MoE path runs additional kernels (the
+NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs
+between the paged and non-paged attention layouts. Over a long greedy decode this
+flips a small number of near-tied argmaxes, changing the byte stream. The same
+divergence is present on the 0028 baseline, with `LLAMA_MOE_FORCE_GRAPHS` on or
+off, and with the patch-0029 block-table cache on or off - it is a property of
+the paged attention path, not of any one lever.
+
+## KL evidence that the paged path is sound (the load-bearing check)
+
+`llama-perplexity --kl-divergence` on `q36-35b-a3b-nvfp4.gguf`, 16 chunks,
+`-c 512 -ngl 99 --seed 1`, base logits from the f16 reference
+(`darwin_36b_opus/f16.gguf`, PPL 7.3734):
+
+| comparison | PPL(Q) | KL divergence | Same top p | Cor |
+|------------|-------:|--------------:|-----------:|----:|
+| f16 reference | 7.3734 | - | - | - |
+| **non-paged** vs f16 | 7.3896 | 0.136597 +/- 0.003157 | 84.314% | 97.68% |
+| **paged** vs f16     | 7.4009 | 0.136000 +/- 0.003285 | 84.828% | 97.58% |
+| paged vs non-paged (direct) | 7.4009 (base 7.3818) | 0.050011 +/- 0.001653 | 89.044% | 99.04% |
+
+Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%.
+
+### Verdict: BENIGN
+
+- **Paged does not diverge from the f16 ground truth more than non-paged does.**
+  KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) =
+  7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29
+  error bars). A real paged-MoE correctness bug would push paged measurably
+  *further* from f16; it does not (it is marginally closer).
+- **Paged and non-paged cluster together.** They agree with each other (KLD 0.050,
+  89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, ~84% same-top-p),
+  with essentially zero probability bias. That is the signature of two equivalent
+  FP-reorderings of the same quantized model, both equally approximating the f16
+  ground truth - not a quality regression.
+- The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that
+  heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model
+  logit near-ties are abundant, so a different-but-equivalent reduction order
+  flips ~11% of argmaxes with no quality cost (proven by the equal KLD-to-f16 and
+  zero Delta-p bias).
+
+Therefore the canonical gate is per path, and `8cb0ce23` is the validated paged
+reference for the MoE deployment path.
--- a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md
@@ -0,0 +1,86 @@
+# llama.cpp patch series — paged attention (vLLM-parity engine)
+
+A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
+in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
+(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
+`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us
+**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
+patch needs fixing, and the failure points at exactly which step the upstream change touched.
+
+## Base
+
+- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
+  the pin = re-run the regen workflow below and fix only the patches that no longer apply.
+
+## The series (phases → patches)
+
+| # | Patch | What | Verifies |
+|---|-------|------|----------|
+| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately |
+| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
+| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
+| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
+| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
+| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
+
+Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off
+(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so a partial series never changes stock behavior.
+
+## Regen workflow (the anti-drift recipe)
+
+```sh
+# 1. check out the exact pin into a dev tree
+git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
+git checkout <LLAMA_VERSION from ../Makefile>
+git checkout -b paged
+
+# 2. apply the current series (each becomes a commit), or develop the next patch
+git am /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/0*.patch     # same glob as the build; or `git apply` + commit per patch
+
+# 3. iterate a phase as ONE commit, then export the whole series 1:1
+git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/ --zero-commit -N
+
+# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
+```
+
+## Build integration
+
+The series is owned by this backend (`backend/cpp/llama-cpp-localai-paged`), not by the stock
+`llama-cpp` backend, which is pure upstream. `../Makefile` (the paged wrapper) clones the pinned
+`llama.cpp` via the copied stock build infra, then applies this series onto the cloned tree with the
+same strict `git apply` the stock build uses for base patches:
+```
+for p in $(PAGED_PATCHES_DIR)/0*.patch; do git apply --verbose "$p" || exit 1; done
+```
+All variants (avx/avx2/avx512/cuda/…) clone + apply into their own build copy, so the series ships
+everywhere without ever touching the stock `llama-cpp` source tree.
+
+## Status
+
+- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`.
+- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
+  **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
+- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
+  (see `../README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
+  subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
+  `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
+  `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
+  **9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
+  compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
+  - **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
+    flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
+    scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
+    the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit-
+    identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what
+    makes paged placement token-identical under flash-attn.**
+- 0004–0006 follow.
+
+### Honest parity note (important)
+
+This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
+by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
+(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
+touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
+gap additionally needs the tcgen05/CUTLASS grouped-GEMM kernel (deferred, upstream-grade, no shortcut;
+see `../README.md`). Full vLLM parity therefore requires this series **AND** the kernel; neither
+alone suffices.
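Editor's sketch for the "Key correctness finding" in the Status section above: the requirement that gather indices be sorted by token position follows from floating-point addition not being associative. This is a minimal Python illustration under simplifying assumptions, not the patch's C++. `gather_idxs_sorted` and the `(cell, pos)` pair layout are hypothetical stand-ins for `get_gather_idxs` and the real cache metadata, and a plain running sum stands in for the flash-attn online softmax.

```python
# Illustrative only: hypothetical stand-in for the series' get_gather_idxs.
def gather_idxs_sorted(cells):
    """cells: list of (physical_cell_index, token_position) for one sequence.
    Returns the physical indices ordered by token position."""
    return [cell for cell, _pos in sorted(cells, key=lambda cp: cp[1])]


def reduce_in_order(values, order):
    """Left-to-right accumulation, like a kernel walking cells in `order`."""
    acc = 0.0
    for i in order:
        acc += values[i]
    return acc


# Scattered placement: token positions 0, 1, 2 live in physical cells 2, 0, 1.
cells = [(0, 1), (1, 2), (2, 0)]
values = {2: 0.1, 0: 0.2, 1: 0.3}  # per-cell contribution, keyed by physical cell

physical_order = sorted(values)             # walk the physical array as laid out
position_order = gather_idxs_sorted(cells)  # walk cells in token-position order

# A contiguous (stock) layout reduces in token-position order.
stock = (0.1 + 0.2) + 0.3

same_as_stock = reduce_in_order(values, position_order) == stock
diverges = reduce_in_order(values, physical_order) != stock
```

Only the position-sorted walk reproduces the stock accumulation order; the physical-order walk adds the same three numbers but rounds differently, which is the kind of divergence the finding describes.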
--- a/backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
@@ -0,0 +1,101 @@
+# Pin-sync: paged patch-stack -> llama.cpp c299a92c
+
+Status: COMPLETE. The shipped source-only paged patch series (`0001`-`0030`,
+28 `.patch` files) was advanced from llama.cpp `9d5d882d` to `c299a92c`
+("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
+GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
+path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
+upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
+
+## Upstream jump
+
+- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
+  ("model : Add label for LFM2.5-230M (#25008)")
+- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
+  ("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
+- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
+
+## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
+
+Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
+**zero patch changes**. The already-shipped source-only series (the result of the
+`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
+`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
+`git apply`** (the `apply-paged-patches` step in
+`backend/cpp/llama-cpp-localai-paged/Makefile`:
+`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
+28 patches reported "Applied patch ... cleanly", the sentinel
+`src/paged-kv-manager.cpp` was created, and there are **zero** stray
+`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
+intact). `git apply` tolerates `@@` line-number offsets, which absorbed the
+upstream drift; no hunk context broke.
+
+Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
+patch tarball used for the verification has
+`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
+
+## Clean build
+
+Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
+28 patches applied as working-tree changes, then:
+
+```
+cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
+  -DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
+  -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
+cmake --build build-cuda --target llama-completion test-backend-ops -j20
+```
+
+Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
+`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
+
+## GATE: ALL GREEN
+
+Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
+`9d5d882d` build too):
+```
+llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
+                 -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
+# paged dense: prefix the command with  LLAMA_KV_PAGED=1
+# paged MoE:   prefix the command with  LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
+```
+
+(a) greedy md5 - all four paths PASS:
+| path | model | md5 @ c299a92c | baseline | verdict |
+|------|-------|----------------|----------|---------|
+| non-paged | dense `q36-27b-nvfp4`   | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
+| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
+| paged     | dense `q36-27b-nvfp4`   | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
+| paged     | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
+
+(b) `test-backend-ops` (Backend CUDA0) - all PASS:
+| op | result |
+|----|--------|
+| SSM_CONV            | 45/45 OK |
+| SSM_CONV_UPDATE     | 16/16 OK |
+| SSM_CONV_UPDATE_IDS | 16/16 OK |
+| GATED_DELTA_NET     | 84/84 OK |
+| MUL_MAT             | 1146/1146 OK |
+| MUL_MAT_ID          | 806/806 OK |
+
+(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
+series now carries the per-head/gather test cases added by patches `0026`/`0028`;
+all pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
+
+Bit-exactness preserved across the 23-commit upstream jump.
+
+## Canary
+
+`.github/workflows/llama-cpp-paged-canary.yml` and
+`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
+series is source-only and applies strict-clean with no `--exclude`, the canary's
+`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
+the shipped series) and may be removed on a future canary touch; left in place
+here to keep the pin-bump diff minimal.
+
+## Source of truth
+
+The shipped `.patch` files under `backend/cpp/llama-cpp-localai-paged/patches/paged/` are the
+source of truth and are unchanged by this bump. The DGX dev tree
+(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
+the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.
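Editor's sketch of the md5 gate described in the pin-sync note above: each path's greedy output is hashed and compared with its recorded baseline. The digests are copied from the gate table; the function names (`md5_hex`, `gate`) and the `(path, model)` key layout are hypothetical, and the real gate is the `llama-completion ... | md5sum` command shown in that note, not this script.

```python
import hashlib

# Baseline digests copied from the gate table in the pin-sync note.
BASELINE_MD5 = {
    ("non-paged", "dense"): "5951a5b4d624ce891e22ab5fca9bc439",
    ("non-paged", "moe"):   "07db32c2bcb78d17a43ed18bc22705cd",
    ("paged", "dense"):     "5951a5b4d624ce891e22ab5fca9bc439",
    ("paged", "moe"):       "8cb0ce23777bf55f92f63d0292c756b0",
}


def md5_hex(output: bytes) -> str:
    """Digest of the raw completion bytes, as `md5sum` would report it."""
    return hashlib.md5(output).hexdigest()


def gate(observed: dict) -> dict:
    """observed maps (path, model) -> md5 hex digest.
    Returns a verdict per baseline entry; a missing path counts as FAIL."""
    return {
        key: "PASS" if observed.get(key) == want else "FAIL"
        for key, want in BASELINE_MD5.items()
    }


def all_green(observed: dict) -> bool:
    return all(v == "PASS" for v in gate(observed).values())
```

Treating a missing path as a failure keeps the gate strict: a run that silently skips the paged MoE path cannot report green.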
--- a/backend/cpp/llama-cpp-localai-paged/docs/UPSTREAM_LAYER2_SCOPE.md
+++ b/backend/cpp/llama-cpp-localai-paged/docs/UPSTREAM_LAYER2_SCOPE.md
@@ -0,0 +1,337 @@
+# Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL
+
+Source-only analysis (no GPU, no build) of what it would take to give the
+gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute
+backends, so the patch-series decode win extends past CUDA-family hardware.
+
+In our changeset (patches 0018-0030) these fusions ship with CUDA native kernels
+and CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan /
+SYCL (a CPU-fallback fused op would regress via the device round-trip, and a
+backend that ran the plain op on the discriminated node would silently
+miscompute). "Layer 2" is the upstream work that adds the missing native kernels.
+
+This doc was written against the ggml backend trees in
+`backend/cpp/llama-cpp-paged-dev` (upstream base #24732, one commit OLDER than the
+series pin `c299a92c` #25045, with only the two paged-KV patches applied - neither
+touches GDN/SSM). So every "kernel already exists" statement below is a
+conservative lower bound: the pin has at least these kernels.
+
+--------------------------------------------------------------------------------
+## 0. Headline finding (correct a stale assumption first)
+
+The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel
+upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there."
+**That is now stale.** All three backends already carry the BASE compute ops:
+
+| op                     | Metal                              | Vulkan                                   | SYCL                            |
+|------------------------|------------------------------------|------------------------------------------|---------------------------------|
+| GGML_OP_GATED_DELTA_NET| `kernel_gated_delta_net_impl` (f32, NSG 1/2/4) | `gated_delta_net.comp` (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | `gated_delta_net.cpp` (`launch_gated_delta_net<KDA,keep_rs>`) |
+| GGML_OP_SSM_CONV       | `kernel_ssm_conv_f32_f32` (+ `_4`, + batched) | `ssm_conv.comp` (+ APPLY_BIAS, APPLY_SILU specialization consts) | `ssm_conv.cpp` (`kernel_ssm_conv`) |
+| GGML_OP_SSM_SCAN       | yes                                | `ssm_scan.comp` (mamba2)                 | `ssm_scan.cpp` (mamba2)         |
+
+Verified: Vulkan `gated_delta_net.comp` was last touched at the upstream base
+commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on
+Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today
+(via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is
+the decode SPEEDUP from the fusions, NOT enabling the model to run at all.
+
+Consequence: the GDN-compute op being "partly there" is true on every backend,
+not just Metal. What is still missing per backend is only the FUSION plumbing
+(in-place write-back target, the ids gather read, and the conv-update kernel) -
+a materially smaller scope than "port GDN from scratch."
+
+--------------------------------------------------------------------------------
+## 1. Per-op semantics (the four fusions to port)
+
+All four reuse an existing GGML_OP enum with extra `src[]` slots as a
+discriminator; none adds a new enum value. f32 throughout. The arithmetic core
+is IDENTICAL to the upstream non-fused op; only the read source and/or the write
+target are redirected. That single fact drives the whole bit-exactness story
+(section 3).
+
+### OP A - `ggml_gated_delta_net_inplace` (patch 0018)
+- Enum `GGML_OP_GATED_DELTA_NET`, discriminated by a non-null `src[6]` =
+  `state_dst` (a contiguous `[S_v*S_v*H, n_seqs]` view into the recurrent-state
+  cache at `kv_head`). K == 1 only.
+- Semantics: run the standard GDN recurrence, but write the FINAL recurrent state
+  directly into `state_dst` instead of appending it to the op output. The op
+  output then carries only the attention scores. Removes the per-layer per-step
+  ~full-state D2D copy-back (the 0018 win).
+- Race (in-place read == write): each (seq, head) block owns a disjoint cache
+  slot. The kernel loads the whole prior state `s0` into per-thread registers
+  (`s_shard` on CUDA, `ls[NSG]` on Metal, the column shard on Vulkan/SYCL)
+  BEFORE the ring write, so reading and writing the same slot is safe.
+
+### OP B - `ggml_gated_delta_net_inplace_ids` (patch 0019)
+- Adds `src[5]` = FULL state cache `[S_v,S_v,H,n_rs_slots]`, `src[7]` = `ids`
+  (I32, per-seq source slot == the recurrent-state `s_copy`), `op_param[1]` =
+  `rs_head` (destination base slot). Still has the OP-A `src[6]` in-place target.
+- Semantics: read each sequence's prior state directly from `cache[ids[seq]]`
+  (mirrors `ggml_ssm_scan`'s ids source), eliminating the `ggml_get_rows`
+  materialization. Combined with OP A the op now reads AND writes the cache in
+  place.
+- Race: identity sequences (`ids[s] == rs_head + s`, the steady AR-decode case)
+  read s0 in place from the destination slot (safe via the register snapshot
+  above). Non-identity sequences (reorder / rs_zero remap) are first copied by a
+  TINY separate gather kernel (`gdn_gather_nonident`, one block/seq) into a
+  DISJOINT scratch that the recurrence then reads, so the recurrence never reads
+  a slot another block is writing. Value-preserving memcpy -> bit-identical to
+  the get_rows path.
+
+### OP C - `ggml_ssm_conv_update_inplace` (patch 0021)
+- Enum `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]` =
+  `conv_state_dst` (`[(K-1)*channels, n_seqs]` in-place ring view).
+  `src[0]` = conv_states `[K-1, channels, n_seqs]`, `src[1]` = conv_kernel
+  `[K, channels]`, `src[2]` = x_cur `[channels, 1, n_seqs]`. `op_param[0]` =
+  fuse_silu.
+- Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the
+  width-K conv window in registers from the K-1 cached taps + the current token,
+  compute the depthwise conv with the SAME ascending-tap FMA order as plain
+  `ssm_conv` (`tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` to match plain conv's
+  `sumf += b` with b==0), optionally fold SiLU, write the conv output
+  `[channels,1,n_seqs]`, and write the 1-token-shifted ring state back in place.
+  Replaces the 4-op decode conv chain (transpose + concat + conv + silu + ring
+  cpy).
+- Race: read source (gathered taps) and write target (cache view) are disjoint
+  buffers -> race-free by construction, no ids/identity logic.
+
+### OP D - `ggml_ssm_conv_update_inplace_ids` (patch 0028)
+- Same enum, discriminated by a non-null `src[4]` = `ids`; `src[0]` becomes the
+  FULL conv cache `[K-1, channels, n_cells]`; `op_param[1]` = rs_head.
+- Semantics: gather-free conv-update - read each sequence's prior taps from
+  `cache[ids[s]]` in-kernel (no get_rows). Identity reads in place from
+  `conv_state_dst`; non-identity gathered into a disjoint scratch first by a tiny
+  `ssm_conv_gather_nonident` kernel. The window is copied to a local array
+  BEFORE the (possibly aliasing) ring write so the identity read==write slot is
+  correct. Bit-identical to get_rows + OP C.
+
+### Net new kernels vs reuse, per op
+- OP A: NOT a new compute kernel - a write-target redirection of the EXISTING
+  GDN kernel + 1 buffer binding + a supports_op/op-handler branch.
+- OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) +
+  1 ids binding + rs_head param + 1 tiny gather kernel.
+- OP C: a GENUINELY NEW kernel on each backend. The existing `ssm_conv` computes
+  a windowed reduction over a PRE-concatenated input; it does not assemble the
+  window from cached taps + the current token, fold silu, or write the shifted
+  ring state. This is the largest net-new piece.
+- OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1
+  tiny conv gather kernel.
+
+The `ggml.h` / `ggml.c` builders, the CPU reference kernels, the model-graph
+emission (`delta-net-base.cpp`, qwen35*), and the `test-backend-ops` cases are
+SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend
+work is the kernel(s) + the backend wiring.
+
+--------------------------------------------------------------------------------
+## 2. Per-backend: authoring model, effort, gotchas, wiring
+
+### 2.1 Metal (MSL)
+
+Authoring model: `.metal` MSL source (`ggml-metal.metal`), function-constant
+specialization (e.g. `FC_GATED_DELTA_NET`), kernels templated on `NSG`; host
+glue split across `ggml-metal-ops.cpp` (`ggml_metal_op_*` encode), the pipeline
+lookup in `ggml-metal-device.cpp`/`.m`, the kargs struct in `ggml-metal-impl.h`,
+and `supports_op` in `ggml-metal-device.m`. Threadgroup model; Apple GPU
+simdgroup width is a FIXED 32, `simd_sum` for the per-column reduce.
+
+Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist
+and are ergonomic to extend. OP A is a write-base redirect of the existing
+`kernel_gated_delta_net_impl` (its tail already does
+`dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j]` after
+loading `ls[]` into registers - just point `dst_state` at the `state_dst` buffer
+and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu
+ssm_conv variant today - only plain + `_4` + batched - so the silu-fold and ring
+write are both new). Host glue spans 3-4 files.
+
+Gotchas:
+- In-place race: the existing kernel ALREADY snapshots the state column into
+  `ls[NSG]` registers before writing, so OP A/B are safe with no barrier; OP C/D
+  must mirror the `float window[K]` local-copy-before-write that CPU/CUDA use.
+- Discriminated SSM_CONV: `supports_op` for `GGML_OP_SSM_CONV` currently returns
+  `has_simdgroup_reduction` with NO check of `src[3]`/`src[4]`; GDN returns
+  `has_simdgroup_reduction && src[2]->ne[0] % 32 == 0` with NO check of
+  `src[6]`/`src[7]`. Both must be tightened (accept the discriminated variant
+  only once the kernel exists) AND `ggml_metal_op_ssm_conv` /
+  `ggml_metal_op_gated_delta_net` must branch on the extra src to pick the kernel.
+- Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three -
+  the fused variant only redirects addresses, so it is bit-identical to Metal's
+  own non-fused path by construction (the conv per-channel FMA needs the exact
+  ascending order + the `+0.0f`).
+- The kargs struct grows by the `state_dst` / `ids` / `rs_head` fields; a new
+  pipeline name (or a function-constant branch) distinguishes the variants.
+
+### 2.2 Vulkan (GLSL .comp -> SPIR-V)
+
+Authoring model: GLSL `.comp` in `vulkan-shaders/`, compiled at build time by
+`vulkan-shaders-gen` into embedded SPIR-V byte arrays (`gated_delta_net_f32_data`
+etc.); pipeline creation in `ggml-vulkan.cpp` declares the binding count +
+push-constant size; a push-constant struct per op; host dispatch `ggml_vk_*`
+binds subbuffers; `supports_op` in the device support function. Subgroup size
+VARIES by vendor (NVIDIA 32, AMD 64, Intel 8/16/32).
+
+Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as
+Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a
+new `ggml_vk_create_pipeline` registration with an explicit binding count and
+push-constant size, a new/extended push-constant struct (add `rs_head`), and
+GROWING the descriptor binding set from the current 7 (`src[0..5]` + dst) to 8-9
+(`state_dst`, `ids`). The GDN host dispatch hardcodes a 6-src bind loop and the
+pipeline is created with `"main", 7, ...` - both must change.
+
+Gotchas:
+- Subgroup variance interacts with the EXISTING variant matrix: the GDN comp
+  already ships shmem / cluster / nocluster variants keyed on subgroup size and
+  relies on `S_V % COLS_PER_WG == 0`. The OP-A/B read/write redirect must be
+  applied across ALL of those variants, and re-validated per vendor.
+- In-place race: GLSL must read the full column shard into local registers before
+  the ring write (same pattern); confirm the SPIR-V memory model is not relied on
+  for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)).
+  OP C/D need the explicit window-to-local copy.
+- Discriminated SSM_CONV: `supports_op` returns `op->src[0]->type == F32` with NO
+  discriminator check; GDN loops `src[0..5]` F32 with NO `src[6]`/`src[7]` check.
+  Both must be tightened. This is the backend where the 0030 hazard is most
+  concrete (a present plain-conv kernel + a permissive supports_op = silent
+  miscompute) - Vulkan is the exact case 0030 was written for.
+- conv-update is per-channel (one invocation per channel) so it is
+  subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden.
+- Vulkan's `ssm_conv.comp` ALREADY has APPLY_SILU + APPLY_BIAS specialization
+  constants, so the silu-fold half of OP C is partly precedented here (unlike
+  Metal); the ring write-back + tap-window assembly are still new.
+
+### 2.3 SYCL (single-source DPC++)
+
+Authoring model: plain C++ `.cpp`/`.hpp` per op (`gated_delta_net.cpp`,
+`ssm_conv.cpp`); a SYCL `queue.parallel_for` over an `nd_range` with
+`reqd_sub_group_size(WARP_SIZE)`; sub-group reductions (`warp_reduce_sum`);
+`supports_op` in `ggml-sycl.cpp`. NO separate shader-compile step (single
+source).
+
+Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are
+near-VERBATIM mirrors of the CUDA ones (`launch_gated_delta_net<KDA,keep_rs>`,
+`s_shard`, `curr_state`, `state = dst + attn_score_elems`, `warp_reduce_sum`) -
+a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port
+almost line-for-line: add the `state_dst` param, the `ids`/`rs_head` params, the
+read-base select, the two tiny gather kernels, and the new conv-update kernel.
+No pipeline/push-constant/binding bookkeeping.
+
+Gotchas:
+- In-place race: the `s_shard[]` / window arrays are per-work-item private, so
+  the register-snapshot-before-write pattern carries over directly. Safe.
+- Discriminated SSM_CONV: `supports_op` checks `src[0]`/`src[1]` F32 with NO
+  discriminator check; GDN returns a BARE `true` (the MOST permissive, so the
+  hazard is worst here). Both must be tightened, and `ggml_sycl_op_ssm_conv` /
+  `ggml_sycl_op_gated_delta_net` must branch on the extra src.
+- Bit-exactness: `WARP_SIZE` is compile-fixed (Intel sub-group 8/16/32), same
+  situation as CUDA; the fused variant matches SYCL's own non-fused path by
+  construction. conv-update is per-channel -> subgroup-agnostic.
+
+### 2.4 Common wiring (all three) + the 0030 emission-gate change
+
+Per backend, four wiring touch-points beyond the kernel body:
+1. `supports_op`: tighten the `GGML_OP_SSM_CONV` and `GGML_OP_GATED_DELTA_NET`
+   entries so the discriminated/extra-src node is reported supported ONLY when
+   the new kernel handles it (and rejected otherwise, instead of today's
+   silently-true-for-the-plain-kernel).
+2. op handler: branch on `src[3]`/`src[4]` (conv) and `src[6]`/`src[7]` (GDN) to
+   dispatch the fused kernel.
+3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor
+   bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions).
+4. The patch-0030 gate in `src/llama-context.cpp`.
+
+The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg
+name is not `"CUDA"`/`"ROCm"`/`"MUSA"` forces `fused_gdn_ar = fused_gdn_ch =
+auto_fgdn = false`. As each backend gains kernels this must become capability-
+driven, in one of two ways:
+- minimal: add the backend's reg name (e.g. `"Metal"`) to the allow-list once its
+  kernels + tightened supports_op ship; OR
+- clean (recommended upstream form): DELETE the name allow-list and make
+  `supports_op` authoritative - have the `auto_fgdn` resolution probe
+  `ggml_backend_dev_supports_op` on a representative node that carries the
+  discriminated `src[]` slots. Then routing falls out of the normal scheduler
+  fallback and no backend name is ever hard-coded. This also fixes 0030's stated
+  weakness that the upstream `auto_fgdn` check inspects only GATED_DELTA_NET
+  nodes and covers the discriminated SSM_CONV only incidentally.
+
+--------------------------------------------------------------------------------
+## 3. Bit-exactness per backend (the md5 gate question)
+
+Feasible on ALL THREE, and not actually constraining, because of how the gate is
+scoped:
+
+- The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has
+  its own f32 reduction order (Metal `simd_sum`, Vulkan subgroup reduce, SYCL
+  `warp_reduce_sum`) that differs from CUDA's and from CPU's. There is no
+  cross-backend md5 and none is expected.
+- The relevant per-backend invariant is: the FUSED variant must equal that
+  backend's OWN non-fused path. The fusions change only the read source
+  (gather -> indexed read; the gather is a value-preserving memcpy) and the write
+  target (appended output -> in-place cache slot). They do NOT touch the
+  per-column FMA/reduce order. So the fused op is bit-identical to the
+  non-fused op on the same backend BY CONSTRUCTION.
+- Two arithmetic details each port MUST preserve exactly: (a) the conv
+  ascending-tap order plus the `+0.0f` that matches plain `ssm_conv`'s
+  `sumf += b` with b==0; (b) the existing GDN per-column subgroup reduce (do not
+  re-order it). Get those right and `test-backend-ops` (backendX-vs-CPU, already
+  registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS /
+  GATED_DELTA_NET) is the per-backend gate.
+
+--------------------------------------------------------------------------------
+## 4. Upstream path and ranked recommendation
+
+### Ops-first, then one PR per backend (NOT one big PR)
+
+Recommended sequence:
+
+1. PR #1 - OPS (already essentially done, upstreamable as-is): the `ggml.h`/
+   `ggml.c` builders, the CPU reference kernels, the CUDA kernels, the
+   `test-backend-ops` cases, and the capability-driven gate (the clean
+   `supports_op`-authoritative version of 0030). This is independently mergeable
+   and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed
+   that way).
+2. PR #2 - Metal kernels + wiring.
+3. PR #3 - SYCL kernels + wiring.
+4. PR #4 - Vulkan kernels + wiring.
+
+Do NOT bundle the backends: each needs its own hardware to validate
+`test-backend-ops`, reviewers are backend-specialized, and a regression in one
+must not block the others.
+
+### Value x effort ranking (which backend first)
+
+| backend | user base / value          | author effort | bit-exact difficulty | net rank |
+|---------|----------------------------|---------------|----------------------|----------|
+| Metal   | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32 simdgroup) | **1st** |
+| SYCL    | LOW-MED (Intel GPU)        | LOWEST (near-verbatim CUDA mirror) | LOW   | **2nd** |
+| Vulkan  | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | **3rd** |
+
+Recommendation: **Metal first.** It banks the biggest user-facing decode win at
+medium effort, the base GDN + conv kernels already exist, and Apple's fixed
+simdgroup width makes bit-exactness the simplest. **SYCL second** as a cheap,
+largely mechanical follow-on (the port is a near line-for-line CUDA mirror, so it is
+low-cost insurance even though the Intel-GPU audience is smaller). **Vulkan last**
+as the high-effort / high-breadth capstone - it reaches the widest hardware
+(AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the
+existing variant matrix, the subgroup-width variance, and the per-vendor
+validation burden mean it is best tackled once the pattern is proven on
+Metal + SYCL.
+
+A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR
+(both are register-snapshot ports with no shader-gen step) and treat Vulkan as a
+separate later effort.
+
+--------------------------------------------------------------------------------
+## 5. Summary
+
+- GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL
+  (the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all
+  three today via the non-fused path; Layer-2 is about the decode SPEEDUP.
+- Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids
+  read (OP B) to the existing GDN kernel, write ONE new conv-update kernel
+  (OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten
+  supports_op + the op-handler branch + (Vulkan) the pipeline/push-constant/
+  descriptor wiring. The builders, CPU refs, model graph and tests are shared and
+  already done.
+- Bit-exactness is feasible everywhere and per-backend by construction (the
+  fusions redirect addresses, not the f32 reduction order); `test-backend-ops`
+  (backendX-vs-CPU) is the gate.
+- Sequence: ops-first PR (incl. the capability-driven replacement for 0030's
+  name allow-list), then Metal, then SYCL, then Vulkan.
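Editor's sketch of the OP C semantics from section 1 of the scope doc above (the fused conv-update). It is a single-sequence Python model under stated assumptions, not any backend's kernel: the function names and the list-of-lists layout are hypothetical, and Python floats stand in for f32. It shows the three properties the doc calls out: the window is copied locally before the ring write, the taps are accumulated in ascending order followed by `+ 0.0`, and the fused result equals the unfused chain because both share the same arithmetic core.

```python
import math


def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))


def dot_ascending(window, weights):
    """tap0*w0 + ... + xc*w_{K-1}, then + 0.0 (plain conv's `sumf += b`, b == 0)."""
    acc = window[0] * weights[0]
    for k in range(1, len(window)):
        acc += window[k] * weights[k]
    return acc + 0.0


def conv_update_inplace(taps, weights, x_cur, fuse_silu=False):
    """Fused decode step. taps[c] holds the K-1 cached taps of channel c and is
    overwritten in place with the 1-token-shifted ring state."""
    out = []
    for c in range(len(taps)):
        window = taps[c] + [x_cur[c]]   # local copy made BEFORE the ring write
        y = dot_ascending(window, weights[c])
        out.append(silu(y) if fuse_silu else y)
        taps[c][:] = window[1:]         # shifted state written back in place
    return out


def conv_chain_reference(taps, weights, x_cur, fuse_silu=False):
    """Unfused reference: concat, conv, optional SiLU, separate state copy."""
    windows = [t + [x] for t, x in zip(taps, x_cur)]
    conv = [dot_ascending(w, k) for w, k in zip(windows, weights)]
    out = [silu(y) for y in conv] if fuse_silu else conv
    new_state = [w[1:] for w in windows]
    return out, new_state
```

For example, with cached taps `[1.0, 2.0, 3.0]`, weights `[0.5, 0.25, 0.125, 2.0]` and current value `4.0`, both paths produce `9.375` and leave the state as `[2.0, 3.0, 4.0]`; the fused version simply avoids materializing the intermediate tensors.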
--- a/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
+++ b/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
@@ -0,0 +1,17 @@
+model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
+q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
+q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
+q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
+q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
+q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
+q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
+q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
+q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
+q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
+q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
+q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
+q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
+q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
+q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
+q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
+q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64
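Editor's sketch for reading the benchmark CSV above: a small Python helper that pairs each llama row with the vLLM row for the same model and `npl` and reports the vLLM/llama ratio for one column. The helper name `vllm_over_llama` is hypothetical, and the two embedded rows are copied from the file purely so the snippet is self-contained; in practice the whole CSV text would be passed in.

```python
import csv
import io

# Two rows copied from final_benchmark.csv so the example runs standalone.
SAMPLE = """model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
"""


def vllm_over_llama(csv_text: str, column: str) -> dict:
    """Return {(model, npl): vllm_value / llama_value} for `column`.
    Pairs without both engines are skipped."""
    by_key = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_key[(row["model"], int(row["npl"]), row["engine"])] = float(row[column])
    ratios = {}
    for (model, npl, engine), value in by_key.items():
        if engine == "vllm" and (model, npl, "llama") in by_key:
            ratios[(model, npl)] = value / by_key[(model, npl, "llama")]
    return ratios


prefill_ratio = vllm_over_llama(SAMPLE, "prefill_tps")
decode_ratio = vllm_over_llama(SAMPLE, "decode_agg_tps")
```

A ratio above 1 means vLLM is ahead on that column and below 1 means llama is ahead; for the sample pair the prefill ratio is above 1 while the aggregate-decode ratio is below 1.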
--- a/backend/cpp/llama-cpp-localai-paged/docs/paged-burst-bench.cpp
+++ b/backend/cpp/llama-cpp-localai-paged/docs/paged-burst-bench.cpp
@@ -0,0 +1,217 @@
+// Paged-pool burst-degradation repro (patch 0024). DEV SCAFFOLDING ONLY.
+//
+// Reproduces, at the libllama level, the two host-side defects behind the
+// "later lower-npl prefill collapses, decode fine, restart cures it" benchmark
+// signature:
+//
+//   * RECLAMATION GAP (Fix-1): a partial tail seq_rm(seq, p0>0, -1) - exactly
+//     what llama-server issues on every reused slot - frees the kv-cache CELLS
+//     but the paged manager keeps owning the trailing BLOCKS. The manager's
+//     free pool silently shrinks. Test A measures the reclaimed-block delta.
+//
+//   * FRAGMENTATION / NO COMPACTION (Fix-2): a high-fan-out burst that allocates
+//     many sequences and frees them in a scrambled order leaves the free queue a
+//     scrambled permutation of physical block ids. A later low-npl prefill then
+//     pops physically scattered blocks, so its KV scatter-write + in-kernel
+//     paged-attention gather lose locality and prefill throughput collapses;
+//     decode (single-token append) barely notices. Test B times an npl8 prefill
+//     on a FRESH pool vs an npl8 prefill AFTER a scrambling burst+drain.
+//
+// PASS (post-fix): Test A reclaims ceil((PP-KEEP)/bs) trailing blocks on the
+// partial seq_rm (0 pre-fix); Test B's post-burst npl8 prefill_tps is within ~10%
+// of the fresh npl8 and num_free returns to the pristine value after the drain.
+//
+// Run with LLAMA_KV_PAGED=1. Env: BURST_NSLOT(64) NPL(8) PP(512) KEEP(256)
+// GEN(4) PAGED_NGL(99). All sequences use distinct content so nothing is shared.
+
+#include "llama.h"
+#include "paged-prefix-api.h"
+
+#include <chrono>
+#include <clocale>
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <vector>
+
+static int env_i(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }
+
+using clk = std::chrono::steady_clock;
+static double secs(clk::time_point a, clk::time_point b) {
+    return std::chrono::duration<double>(b - a).count();
+}
+
+struct Ctx { llama_context * ctx; llama_memory_t mem; llama_batch batch; int n_vocab; };
+
+// Deterministic, content-distinct token for (seq, pos): keeps every sequence's
+// blocks unique so no cross-request prefix sharing masks the accounting.
+static llama_token tok_of(int seq, int pos, int n_vocab) {
+    return (llama_token) (((seq * 1000003 + pos * 131 + 7) % (n_vocab - 200)) + 100);
+}
+
+// Prefill n tokens of seq at [pos0, pos0+n) in one ubatch (n <= n_batch).
+// Returns wall seconds (sync'd).
+static double prefill(Ctx & C, int seq, int pos0, int n) {
+    clk::time_point t0 = clk::now();
+    C.batch.n_tokens = 0;
+    for (int j = 0; j < n; ++j) {
+        int i = C.batch.n_tokens;
+        C.batch.token[i]    = tok_of(seq, pos0 + j, C.n_vocab);
+        C.batch.pos[i]      = pos0 + j;
+        C.batch.n_seq_id[i] = 1;
+        C.batch.seq_id[i][0]= seq;
+        C.batch.logits[i]   = (j + 1 == n) ? 1 : 0;
+        C.batch.n_tokens++;
+    }
+    if (llama_decode(C.ctx, C.batch)) { fprintf(stderr, "prefill decode failed seq=%d\n", seq); return -1; }
+    llama_synchronize(C.ctx);
+    return secs(t0, clk::now());
+}
+
+// One decode step (single token) for seq at pos.
+static void decode1(Ctx & C, int seq, int pos) {
+    C.batch.n_tokens = 1;
+    C.batch.token[0] = tok_of(seq, pos, C.n_vocab);
+    C.batch.pos[0]   = pos; C.batch.n_seq_id[0] = 1; C.batch.seq_id[0][0] = seq; C.batch.logits[0] = 1;
+    if (llama_decode(C.ctx, C.batch)) fprintf(stderr, "decode1 failed seq=%d\n", seq);
+}
+
+int main(int argc, char ** argv) {
+    std::setlocale(LC_NUMERIC, "C");
+    const char * model_path = nullptr;
+    for (int i = 1; i < argc; ++i) if (!strcmp(argv[i], "-m") && i + 1 < argc) model_path = argv[++i];
+    if (!model_path) { fprintf(stderr, "usage: %s -m model.gguf\n", argv[0]); return 2; }
+
+    const int NSLOT = env_i("BURST_NSLOT", 64);
+    const int NPL   = env_i("NPL", 8);
+    const int PP    = env_i("PP", 512);
+    const int KEEP  = env_i("KEEP", 256);
+    const int GEN   = env_i("GEN", 4);
+    const int ngl   = env_i("PAGED_NGL", 99);
+    const bool paged = getenv("LLAMA_KV_PAGED") != nullptr;
+
+    ggml_backend_load_all();
+    llama_model_params mp = llama_model_default_params();
+    mp.n_gpu_layers = ngl;
+    llama_model * model = llama_model_load_from_file(model_path, mp);
+    if (!model) { fprintf(stderr, "model load failed\n"); return 1; }
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+    const int n_vocab = llama_vocab_n_tokens(vocab);
+
+    // Pool sized for the burst plus headroom so the burst fits but a later npl
+    // run draws from whatever the burst's churn left behind.
+    const long cells = (long) (NSLOT + NPL + 4) * (PP + GEN + 16);
+    llama_context_params cp = llama_context_default_params();
+    cp.n_ctx     = (uint32_t) cells;
+    cp.n_batch   = (uint32_t) (PP + 16);
+    cp.n_ubatch  = (uint32_t) (PP + 16);
+    cp.n_seq_max = NSLOT + NPL + 2;
+    cp.kv_unified = true;     // one unified stream-0 pool -> num_free(ctx) is the whole pool
+    cp.no_perf   = true;
+    llama_context * ctx = llama_init_from_model(model, cp);
+    if (!ctx) { fprintf(stderr, "ctx init failed (cells=%ld)\n", cells); return 1; }
+
+    Ctx C; C.ctx = ctx; C.mem = llama_get_memory(ctx); C.n_vocab = n_vocab;
+    C.batch = llama_batch_init(cp.n_batch, 0, 1);
+
+    printf("== paged-burst-bench == paged=%d NSLOT=%d NPL=%d PP=%d KEEP=%d GEN=%d n_ctx=%ld\n",
+           paged, NSLOT, NPL, PP, KEEP, GEN, cells);
+
+    llama_memory_clear(C.mem, true);
+    const long F_start = paged_prefix_api::num_free_global();
+
+    // ---- Test A: Fix-1 reclamation gap on a partial tail seq_rm --------------
+    {
+        prefill(C, 0, 0, PP);
+        const long f_after_prefill = paged_prefix_api::num_free_global();
+        llama_memory_seq_rm(C.mem, 0, KEEP, -1);          // partial tail removal
+        const long f_after_rm = paged_prefix_api::num_free_global();
+        llama_memory_seq_rm(C.mem, 0, -1, -1);            // full free -> pristine
+        const long f_after_full = paged_prefix_api::num_free_global();
+        const long bs = 16;
+        const long expect = (PP + bs - 1)/bs - (KEEP + bs - 1)/bs; // trailing blocks
+        printf("[TEST-A Fix-1] start=%ld afterPrefill=%ld afterPartialRm=%ld reclaimed=%ld "
+               "(expect %ld post-fix, 0 pre-fix)  afterFullFree=%ld\n",
+               F_start, f_after_prefill, f_after_rm, f_after_rm - f_after_prefill, expect, f_after_full);
+    }
+
+    // ---- Test B: fragmentation -> npl prefill collapse -----------------------
+    // Fresh npl prefill baseline on a pristine pool.
+    llama_memory_clear(C.mem, true);
+    double tps_fresh;
+    {
+        clk::time_point t0 = clk::now();
+        long ntok = 0;
+        for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
+        tps_fresh = ntok / secs(t0, clk::now());
+        for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
+    }
+    const long F_pristine = paged_prefix_api::num_free_global();
+
+    // High-fan-out burst: allocate NSLOT sequences, each prefilled + a few decode
+    // steps (mixed alloc), then drain them in a scrambled order (odd ids first,
+    // then even, each truncated before the full free) so the free queue becomes a
+    // scrambled permutation - the fragmentation the bug never compacts.
+    for (int s = 0; s < NSLOT; ++s) {
+        if (prefill(C, NPL + s, 0, PP) < 0) return 1;
+        for (int g = 0; g < GEN; ++g) decode1(C, NPL + s, PP + g);
+    }
+    const long F_during_burst = paged_prefix_api::num_free_global();
+    // Drain: partial tail seq_rm (the reused-slot pattern) then full free, in a
+    // scrambled slot order to scramble the physical free order.
+    for (int parity = 1; parity >= 0; --parity)
+        for (int s = 0; s < NSLOT; ++s) if ((s & 1) == parity) {
+            llama_memory_seq_rm(C.mem, NPL + s, KEEP, -1);   // partial (Fix-1 path)
+            llama_memory_seq_rm(C.mem, NPL + s, -1, -1);     // full free
+        }
+    const long F_after_drain = paged_prefix_api::num_free_global();
+
+    // Post-burst npl prefill: pops from the (pre-fix scrambled / post-fix
+    // defragged) free queue.
+    double tps_post;
+    {
+        clk::time_point t0 = clk::now();
+        long ntok = 0;
+        for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
+        tps_post = ntok / secs(t0, clk::now());
+        for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
+    }
+
+    const double ratio = tps_fresh > 0 ? tps_post / tps_fresh : 0;
+    printf("[TEST-B frag] num_free: start=%ld pristine=%ld duringBurst=%ld afterDrain=%ld "
+           "(afterDrain==pristine? %s)\n",
+           F_start, F_pristine, F_during_burst, F_after_drain,
+           F_after_drain == F_pristine ? "YES" : "NO");
+    printf("[TEST-B frag] prefill_tps fresh=%.1f post-burst=%.1f  ratio=%.3f "
+           "(PASS if >=0.90)\n", tps_fresh, tps_post, ratio);
+
+    // ---- Test C: idle-slot retention leak -> reclaim (the Fix-3 scenario) -----
+    // Burst NSLOT sequences and leave them IDLE (stock llama-server keeps an idle
+    // slot's KV; the blocks are stranded). F_idle shows the depleted pool a later
+    // low-npl run would see. Then full-seq_rm each (exactly what Fix-3's
+    // prompt_clear() issues at slot.release): F_reclaimed must return to pristine.
+    llama_memory_clear(C.mem, true);
+    // Touch the pool once so the manager exists, then read the full-pool size
+    // (num_free is 0 while no manager is registered).
+    if (prefill(C, 0, 0, 16) < 0) return 1;
+    llama_memory_seq_rm(C.mem, 0, -1, -1);
+    const long F_pre_c = paged_prefix_api::num_free_global();
+    for (int s = 0; s < NSLOT; ++s) { if (prefill(C, NPL + s, 0, PP) < 0) return 1; }
+    const long F_idle = paged_prefix_api::num_free_global();
+    for (int s = 0; s < NSLOT; ++s) llama_memory_seq_rm(C.mem, NPL + s, -1, -1); // Fix-3 release
+    const long F_reclaimed = paged_prefix_api::num_free_global();
+    printf("[TEST-C idle] pristine=%ld idle_after_burst=%ld (leaked=%ld) reclaimed=%ld "
+           "(returns_to_fresh? %s)\n",
+           F_pre_c, F_idle, F_pre_c - F_idle, F_reclaimed,
+           F_reclaimed == F_pre_c ? "YES" : "NO");
+
+    printf("RESULT paged=%d frag_fix2_ratio=%.3f drain_numfree_returns=%s idle_reclaim_returns=%s\n",
+           paged, ratio,
+           F_after_drain == F_pristine ? "YES" : "NO",
+           F_reclaimed == F_pre_c ? "YES" : "NO");
+
+    llama_batch_free(C.batch);
+    llama_free(ctx);
+    llama_model_free(model);
+    return 0;
+}
--- a/backend/cpp/llama-cpp-localai-paged/docs/paged-reclaim-unit.cpp
+++ b/backend/cpp/llama-cpp-localai-paged/docs/paged-reclaim-unit.cpp
@@ -0,0 +1,59 @@
+// Host-side unit test for the paged-pool burst-reclaim fix (patch 0024).
+// Compiles paged-kv-manager.cpp directly; no ggml / llama / GPU dependency.
+//
+//   Fix-1  PagedKVManager::truncate(seq, n_keep) reclaims the trailing blocks
+//          beyond ceil(n_keep/bs) (ref-counted), so a partial tail seq_rm no
+//          longer strands blocks whose cells were cleared.
+//   Fix-2  defrag_free_pool() relinks the free queue into ascending block-id
+//          order once the pool is fully idle, undoing a burst's scrambled frees
+//          so a later prefill pops physically contiguous blocks again.
+
+#include "paged-kv-manager.h"
+#include <cstdio>
+
+using paged::PagedKVManager;
+
+int main() {
+    int rc = 0;
+
+    // ---- Fix-1: truncate reclaims the trailing block suffix -----------------
+    {
+        PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/true);
+        const size_t f0 = m.num_free_blocks();   // 63 (block 0 reserved as null)
+        m.allocate(0, 512);                       // ceil(512/16)=32 blocks
+        const size_t f1 = m.num_free_blocks();    // 31
+        m.truncate(0, 256);                       // keep ceil(256/16)=16, free 16
+        const size_t f2 = m.num_free_blocks();    // 47
+        printf("[unit Fix-1] free=%zu alloc512=%zu truncate256=%zu reclaimed=%zu (expect 16)\n",
+               f0, f1, f2, f2 - f1);
+        if (f2 - f1 != 16) rc = 1;
+        m.truncate(0, 16);                        // keep 1 block, free 15 more
+        const size_t f3 = m.num_free_blocks();    // 62
+        printf("[unit Fix-1] truncate16=%zu (expect %zu)\n", f3, f0 - 1);
+        if (f3 != f0 - 1) rc = 1;
+        m.free(0);
+        if (m.num_free_blocks() != f0) { printf("[unit Fix-1] free mismatch\n"); rc = 1; }
+    }
+
+    // ---- Fix-2: defrag restores ascending popleft order ---------------------
+    {
+        PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/false);
+        for (int s = 0; s < 8; ++s) m.allocate(s, 16);          // pop blocks 1..8
+        const int scrambled[8] = {3, 7, 1, 5, 0, 6, 2, 4};      // free out of order
+        for (int i = 0; i < 8; ++i) m.free(scrambled[i]);
+        m.defrag_free_pool();                                    // all idle -> compact
+        m.allocate(100, 16 * 3);                                 // pop 3 blocks
+        const auto bt = m.block_table(100);
+        bool asc = true;
+        printf("[unit Fix-2] post-defrag block_table:");
+        for (size_t i = 0; i < bt.size(); ++i) {
+            printf(" %d", bt[i]);
+            if (i && bt[i] < bt[i - 1]) asc = false;
+        }
+        printf("  ascending=%s (expect YES)\n", asc ? "YES" : "NO");
+        if (!asc) rc = 1;
+    }
+
+    printf("UNIT %s\n", rc == 0 ? "PASS" : "FAIL");
+    return rc;
+}
--- a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png
+++ b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png
--- a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png
+++ b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png
--- a/backend/cpp/llama-cpp-localai-paged/package.sh
+++ b/backend/cpp/llama-cpp-localai-paged/package.sh
@@ -0,0 +1,66 @@
+#!/bin/bash
+
+# Script to copy the appropriate libraries based on architecture
+# This script is used in the final stage of the Dockerfile
+
+set -e
+
+CURDIR=$(dirname "$(realpath "$0")")
+REPO_ROOT="${CURDIR}/../../.."
+
+# Create lib directory
+mkdir -p $CURDIR/package/lib
+
+cp -avrf $CURDIR/llama-cpp-localai-paged-* $CURDIR/package/
+cp -rfv $CURDIR/run.sh $CURDIR/package/
+
+# Bundle the ggml shared backends from the CPU_ALL_VARIANTS build into package/lib. ggml
+# discovers the per-microarch libggml-cpu-*.so by scanning the executable directory, which
+# (via the bundled lib/ld.so that run.sh launches through) resolves to lib/. See the
+# matching comment in backend/cpp/llama-cpp/package.sh. No-op on the fallback/ROCm builds.
+if [ -d "$CURDIR/ggml-shared-libs" ]; then
+    echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..."
+    cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/
+fi
+
+# Detect architecture and copy appropriate libraries
+if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then
+    # x86_64 architecture
+    echo "Detected x86_64 architecture, copying x86_64 libraries..."
+    cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then
+    # ARM64 architecture
+    echo "Detected ARM64 architecture, copying ARM64 libraries..."
+    cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so
+    cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6
+    cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2
+    cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1
+    cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0
+else
+    echo "Error: Could not detect architecture"
+    exit 1
+fi
+
+# Package GPU libraries based on BUILD_TYPE
+GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh"
+if [ -f "$GPU_LIB_SCRIPT" ]; then
+    echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..."
+    source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib"
+    package_gpu_libs
+fi
+
+echo "Packaging completed successfully"
+ls -liah $CURDIR/package/
+ls -liah $CURDIR/package/lib/
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0001-vendor-paged-kv-manager.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0001-vendor-paged-kv-manager.patch
@@ -0,0 +1,447 @@
+From bef64835d444a44ed8391bc395cdab38164229d5 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Fri, 19 Jun 2026 22:54:49 +0000
+Subject: [PATCH] vendor paged kv manager
+
+vLLM-parity host-side KV block manager (FreeBlockQueue, BlockPool,
+PagedKVManager, chained-hash prefix cache). Pure C++17, no behavior change -
+nothing uses it yet; wired in by later patches in the series.
+---
+ src/CMakeLists.txt       |   1 +
+ src/paged-kv-manager.cpp | 296 +++++++++++++++++++++++++++++++++++++++
+ src/paged-kv-manager.h   | 108 ++++++++++++++
+ 3 files changed, 405 insertions(+)
+ create mode 100644 src/paged-kv-manager.cpp
+ create mode 100644 src/paged-kv-manager.h
+
+diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
+index d15ccfd99..a030940b8 100644
+--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
+@@ -24,6 +24,7 @@ add_library(llama
+             llama-io.cpp
+             llama-kv-cache.cpp
+             llama-kv-cache-iswa.cpp
+            paged-kv-manager.cpp
+             llama-kv-cache-dsa.cpp
+             llama-memory.cpp
+             llama-memory-hybrid.cpp
+diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
+new file mode 100644
+index 000000000..ca0dcd83a
+--- /dev/null
+++ b/src/paged-kv-manager.cpp
+@@ -0,0 +1,296 @@
+#include "paged-kv-manager.h"
+#include <cassert>
+#include <stdexcept>
+
+namespace paged {
+
+// ---------------------------------------------------------------------------
+// FreeBlockQueue  (port of kv_cache_utils.py FreeKVCacheBlockQueue)
+// ---------------------------------------------------------------------------
+
+FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
+    num_free_blocks = blocks.size();
+    for (size_t i = 0; i < blocks.size(); ++i) {
+        if (i > 0)                  blocks[i]->prev_free = blocks[i - 1];
+        if (i + 1 < blocks.size())  blocks[i]->next_free = blocks[i + 1];
+    }
+    if (!blocks.empty()) {
+        fake_head.next_free = blocks.front();
+        blocks.front()->prev_free = &fake_head;
+        fake_tail.prev_free = blocks.back();
+        blocks.back()->next_free = &fake_tail;
+    } else {
+        fake_head.next_free = &fake_tail;
+        fake_tail.prev_free = &fake_head;
+    }
+}
+
+KVCacheBlock* FreeBlockQueue::popleft() {
+    KVCacheBlock* first = fake_head.next_free;
+    if (first == &fake_tail || first == nullptr) {
+        assert(num_free_blocks == 0);
+        throw std::runtime_error("No free blocks available");
+    }
+    fake_head.next_free = first->next_free;
+    first->next_free->prev_free = &fake_head;
+    first->prev_free = first->next_free = nullptr;
+    num_free_blocks--;
+    return first;
+}
+
+std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
+    std::vector<KVCacheBlock*> ret;
+    if (n == 0) return ret;
+    assert(num_free_blocks >= n);
+    num_free_blocks -= n;
+    KVCacheBlock* curr = fake_head.next_free;
+    ret.reserve(n);
+    for (size_t i = 0; i < n; ++i) {
+        assert(curr != nullptr);
+        ret.push_back(curr);
+        KVCacheBlock* last = curr;
+        curr = curr->next_free;
+        last->prev_free = last->next_free = nullptr;
+    }
+    if (curr != nullptr) {
+        fake_head.next_free = curr;
+        curr->prev_free = &fake_head;
+    }
+    return ret;
+}
+
+void FreeBlockQueue::remove(KVCacheBlock* block) {
+    if (!block->prev_free || !block->next_free)
+        throw std::runtime_error("remove() called on an invalid block");
+    block->prev_free->next_free = block->next_free;
+    block->next_free->prev_free = block->prev_free;
+    block->prev_free = block->next_free = nullptr;
+    num_free_blocks--;
+}
+
+void FreeBlockQueue::append(KVCacheBlock* block) {
+    KVCacheBlock* last = fake_tail.prev_free;
+    last->next_free = block;
+    block->prev_free = last;
+    block->next_free = &fake_tail;
+    fake_tail.prev_free = block;
+    num_free_blocks++;
+}
+
+void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
+    if (blocks.empty()) return;
+    KVCacheBlock* last = fake_tail.prev_free;
+    for (KVCacheBlock* b : blocks) {
+        b->prev_free = last;
+        last->next_free = b;
+        last = b;
+    }
+    last->next_free = &fake_tail;
+    fake_tail.prev_free = last;
+    num_free_blocks += blocks.size();
+}
+
+void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
+    if (blocks.empty()) return;
+    KVCacheBlock* first = fake_head.next_free;
+    KVCacheBlock* prev = &fake_head;
+    for (KVCacheBlock* b : blocks) {
+        b->prev_free = prev;
+        prev->next_free = b;
+        prev = b;
+    }
+    prev->next_free = first;
+    first->prev_free = prev;
+    num_free_blocks += blocks.size();
+}
+
+std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
+    std::vector<KVCacheBlock*> ret;
+    const KVCacheBlock* curr = fake_head.next_free;
+    while (curr && curr->next_free != nullptr) {
+        ret.push_back(const_cast<KVCacheBlock*>(curr));
+        curr = curr->next_free;
+    }
+    return ret;
+}
+
+// ---------------------------------------------------------------------------
+// BlockPool  (port of block_pool.py)
+// ---------------------------------------------------------------------------
+
+static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
+    std::vector<KVCacheBlock*> p;
+    p.reserve(v.size());
+    for (auto& b : v) p.push_back(&b);
+    return p;
+}
+
+static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
+    std::vector<KVCacheBlock> v;
+    v.reserve(num_blocks);
+    for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
+    return v;
+}
+
+BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
+    : enable_caching_(enable_caching),
+      blocks_(make_block_vec(num_blocks)),
+      ptrs_(make_ptrs(blocks_)),
+      free_queue_(ptrs_) {
+    // vLLM reserves block_id 0 as the null block (never cached).
+    null_block = free_queue_.popleft();
+    null_block->is_null = true;
+}
+
+bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
+    if (!block->has_hash) return false;
+    auto it = cached_block_hash_to_block_.find(block->block_hash);
+    if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
+    cached_block_hash_to_block_.erase(it);
+    block->reset_hash();
+    return true;
+}
+
+std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
+    if (n > get_num_free_blocks())
+        throw std::runtime_error("Cannot get free blocks from pool");
+    auto ret = free_queue_.popleft_n(n);
+    for (KVCacheBlock* b : ret) {
+        if (enable_caching_) maybe_evict_cached_block(b);
+        assert(b->ref_cnt == 0);
+        b->ref_cnt += 1;
+    }
+    return ret;
+}
+
+KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
+    auto it = cached_block_hash_to_block_.find(block_hash);
+    return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
+}
+
+void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
+    for (KVCacheBlock* b : blocks) {
+        // ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
+        if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
+        b->ref_cnt += 1;
+    }
+}
+
+void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
+    std::vector<KVCacheBlock*> without_hash, with_hash;
+    for (KVCacheBlock* b : ordered_blocks) {
+        if (b->is_null) continue;
+        b->ref_cnt -= 1;
+        if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
+    }
+    free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
+    free_queue_.append_n(with_hash);     // hashed: kept warm (tail)
+}
+
+void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+                                  size_t num_cached_blocks, size_t num_full_blocks,
+                                  const std::vector<uint64_t>& block_hashes) {
+    for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
+        KVCacheBlock* blk = req_blocks[i];
+        if (blk->has_hash) continue;
+        blk->has_hash = true;
+        blk->block_hash = block_hashes[i];
+        cached_block_hash_to_block_[blk->block_hash] = blk;
+    }
+}
+
+// ---------------------------------------------------------------------------
+// PagedKVManager  (port of SingleTypeKVCacheManager / FullAttentionManager)
+// ---------------------------------------------------------------------------
+
+static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
+
+PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
+    : block_size_(block_size), pool_(num_blocks, enable_caching) {}
+
+bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
+    auto& req = req_to_blocks_[seq_id];
+    size_t need = cdiv(total_tokens, block_size_);
+    if (need <= req.size()) return true;
+    size_t add = need - req.size();
+    if (add > pool_.get_num_free_blocks()) return false; // OOM
+    auto nb = pool_.get_new_blocks(add);
+    req.insert(req.end(), nb.begin(), nb.end());
+    return true;
+}
+
+std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
+    std::vector<int32_t> bt;
+    auto it = req_to_blocks_.find(seq_id);
+    if (it == req_to_blocks_.end()) return bt;
+    bt.reserve(it->second.size());
+    for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
+    return bt;
+}
+
+int64_t PagedKVManager::slot(int seq_id, int pos) const {
+    const auto& req = req_to_blocks_.at(seq_id);
+    int32_t phys = req[pos / block_size_]->block_id;
+    return (int64_t)phys * block_size_ + (pos % block_size_);
+}
+
+std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
+    std::vector<int64_t> sm;
+    sm.reserve(positions.size());
+    for (int p : positions) sm.push_back(slot(seq_id, p));
+    return sm;
+}
+
+void PagedKVManager::free(int seq_id) {
+    auto it = req_to_blocks_.find(seq_id);
+    if (it == req_to_blocks_.end()) return;
+    // Free in reverse so the tail of the block chain is evicted first (vLLM order).
+    std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
+    pool_.free_blocks(ordered);
+    req_to_blocks_.erase(it);
+}
+
+// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
+// hash into the seed so each block hash transitively encodes its whole prefix
+// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
+uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
+    uint64_t h = 1469598103934665603ull ^ parent_hash;
+    for (int t : token_ids) {
+        h ^= (uint64_t)(uint32_t)t;
+        h *= 1099511628211ull;
+    }
+    if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
+    return h;
+}
+
+std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
+    std::vector<uint64_t> hashes;
+    uint64_t parent = 0; // NONE_HASH analogue
+    size_t n_full = token_ids.size() / block_size_;
+    for (size_t i = 0; i < n_full; ++i) {
+        std::vector<int> blk(token_ids.begin() + i * block_size_,
+                             token_ids.begin() + (i + 1) * block_size_);
+        parent = hash_block(parent, blk);
+        hashes.push_back(parent);
+    }
+    return hashes;
+}
+
+size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
+    std::vector<KVCacheBlock*> hits;
+    for (uint64_t bh : block_hashes) {        // stop at first miss (prefix property)
+        KVCacheBlock* cb = pool_.get_cached_block(bh);
+        if (!cb) break;
+        hits.push_back(cb);
+    }
+    pool_.touch(hits);                        // ++ref_cnt, pull from free list
+    return hits.size() * (size_t)block_size_;
+}
+
+void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
+    auto& req = req_to_blocks_[seq_id];
+    size_t n_full = num_tokens / block_size_;
+    pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
+}
+
+} // namespace paged
+diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
+new file mode 100644
+index 000000000..740280a7f
+--- /dev/null
+++ b/src/paged-kv-manager.h
+@@ -0,0 +1,108 @@
+#pragma once
+// Paged KV cache block manager for llama.cpp (CPU-first prototype).
+//
+// Host-side block management is a faithful port of vLLM V1:
+//   vllm/v1/core/kv_cache_utils.py            (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
+//   vllm/v1/core/block_pool.py                (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
+//   vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
+//
+// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
+// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
+// dependency so it can be unit-tested in isolation.
+
+#include <cstdint>
+#include <vector>
+#include <unordered_map>
+#include <map>
+
+namespace paged {
+
+// vLLM KVCacheBlock (kv_cache_utils.py).
+struct KVCacheBlock {
+    int32_t  block_id   = 0;
+    int      ref_cnt    = 0;
+    bool     has_hash   = false;   // vLLM: _block_hash is set only when full+cached
+    uint64_t block_hash = 0;
+    bool     is_null    = false;
+    KVCacheBlock* prev_free = nullptr;
+    KVCacheBlock* next_free = nullptr;
+
+    explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
+    void reset_hash() { has_hash = false; block_hash = 0; }
+};
+
+// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
+// O(1) middle removal is required so touch() can pull a warm cached block out of the
+// free list when a later request hits its prefix.
+class FreeBlockQueue {
+public:
+    size_t num_free_blocks = 0;
+
+    explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
+    KVCacheBlock* popleft();
+    std::vector<KVCacheBlock*> popleft_n(size_t n);
+    void remove(KVCacheBlock* block);
+    void append(KVCacheBlock* block);
+    void append_n(const std::vector<KVCacheBlock*>& blocks);
+    void prepend_n(const std::vector<KVCacheBlock*>& blocks);
+    std::vector<KVCacheBlock*> get_all_free_blocks() const;
+
+private:
+    KVCacheBlock fake_head{-1};
+    KVCacheBlock fake_tail{-1};
+};
+
+// vLLM BlockPool (block_pool.py).
+class BlockPool {
+public:
+    KVCacheBlock* null_block = nullptr;
+
+    BlockPool(int32_t num_blocks, bool enable_caching);
+    std::vector<KVCacheBlock*> get_new_blocks(size_t n);
+    KVCacheBlock* get_cached_block(uint64_t block_hash);
+    void touch(const std::vector<KVCacheBlock*>& blocks);
+    void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
+    void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+                           size_t num_cached_blocks, size_t num_full_blocks,
+                           const std::vector<uint64_t>& block_hashes);
+    size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
+
+private:
+    bool maybe_evict_cached_block(KVCacheBlock* block);
+
+    bool enable_caching_;
+    std::vector<KVCacheBlock> blocks_;     // owns all block descriptors
+    std::vector<KVCacheBlock*> ptrs_;
+    FreeBlockQueue free_queue_;
+    // vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
+    // prototype keeps the last writer (single KV-cache group is sufficient for the wins).
+    std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
+};
+
+// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
+// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
+class PagedKVManager {
+public:
+    PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
+
+    // Grow seq_id to cover total_tokens slots. Returns false on OOM (fewer free blocks than needed).
+    bool allocate(int seq_id, size_t total_tokens);
+    std::vector<int32_t> block_table(int seq_id) const;
+    int64_t slot(int seq_id, int pos) const;
+    std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
+    void free(int seq_id);
+    int block_size() const { return block_size_; }
+
+    // Prefix caching (win 3).
+    static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
+    std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
+    size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
+    void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
+
+protected:
+    int block_size_;
+    BlockPool pool_;
+    std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
+};
+
+} // namespace paged
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,75 @@
+From 5c9c709e6c6b07e0399b75fd4e46e752d418a9a8 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Fri, 19 Jun 2026 23:04:17 +0000
+Subject: [PATCH] paged kv block placement (env LLAMA_KV_PAGED)
+
+Place each sequence's tokens at permuted, non-contiguous fixed-size block
+positions in find_slot, proving attention is invariant to physical KV placement
+(token-identical greedy generation). Default off; single-sequence scope; falls
+back to the normal allocator. The paged-placement substrate for the gather-read.
+---
+ src/llama-kv-cache.cpp | 41 +++++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 41 insertions(+)
+
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 2802103bd..999e2ae61 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -11,6 +11,8 @@
+ #include <cstring>
+ #include <limits>
+ #include <map>
+#include <numeric>
+#include <cstdlib>
+ #include <stdexcept>
+ 
+ static bool ggml_is_power_of_2(int n) {
+@@ -1020,6 +1022,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
+             return { };
+         }
+ 
+        // [paged, experimental] Place this sequence's tokens at permuted,
+        // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
+        // This validates that attention is invariant to physical KV placement -
+        // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
+        // Single-sequence scope (uses get_used() as the logical base); falls back
+        // to the normal allocator if the permuted cells aren't available.
+        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+        if (paged_mode) {
+            const uint32_t bs   = 16;                 // block size (tokens/block)
+            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
+            if (nblk >= 2) {
+                // stride coprime to nblk => block-index permutation is a bijection
+                uint32_t k = 1;
+                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
+                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
+                }
+                const uint32_t base = cells.get_used();
+                bool ok = true;
+                for (uint32_t i = 0; i < n_tokens; ++i) {
+                    const uint32_t L    = base + i;
+                    const uint32_t b    = L / bs;
+                    const uint32_t off  = L % bs;
+                    if (b >= nblk) { ok = false; break; }
+                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
+                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
+                    res.idxs[s].push_back(phys);
+                }
+                if (ok && res.idxs[s].size() == n_tokens) {
+                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
+                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+                    }
+                    continue; // paged placement succeeded for this sequence
+                }
+                res.idxs[s].clear(); // fall back to the normal allocator
+            }
+        }
+
+         uint32_t n_tested = 0;
+ 
+         // for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
+-- 
+2.43.0
+
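Reviewer note on patch 0002: the block permutation in find_slot can be read in isolation as the sketch below. This is an illustration only, assuming the same constants as the patch (stride search starting at `(nblk / 2) | 1`, block-granular mapping); the helper names `pick_stride` and `logical_to_physical` are hypothetical and do not exist in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>

// Hypothetical helper (not in the patch): choose an odd stride near nblk/2
// that is coprime to nblk, so b -> (b * k) % nblk is a bijection on block
// indices. Falls back to 1, which is the identity permutation.
uint32_t pick_stride(uint32_t nblk) {
    for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
        if (std::gcd(cand, nblk) == 1u) {
            return cand;
        }
    }
    return 1;
}

// Hypothetical helper (not in the patch): map logical token position L to a
// physical cell index. The block index is permuted; the offset inside the
// block is preserved.
uint32_t logical_to_physical(uint32_t L, uint32_t bs, uint32_t nblk, uint32_t k) {
    assert(bs > 0 && nblk > 0);
    const uint32_t b   = L / bs;  // logical block
    const uint32_t off = L % bs;  // offset within the block
    return ((b * k) % nblk) * bs + off;
}
```

With `nblk = 8` and `bs = 16` the search picks stride 5, so logical blocks 0, 1, 2, 3 land on physical blocks 0, 5, 2, 7. The sketch omits the patch's bounds and `is_empty` checks and its fallback to the stock allocator.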
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,369 @@
+From c1de00f4cc1eb0dd25993880bb4c8562be1937d4 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 10:24:22 +0200
+Subject: [PATCH] paged gather-read (env LLAMA_KV_PAGED) - patch 0003
+
+Gather K, V and the kq_mask down to each sequence stream's non-empty cells
+before build_attn_mha. Position-sorted per stream so the flash-attn online
+softmax reduction order matches stock byte-for-byte. Multi-stream: one index
+column per stream over k->ne[3], padded to the max non-empty count with a
+masked (empty) cell. Gated behind LLAMA_KV_PAGED; no-op when unset.
+---
+ src/CMakeLists.txt     |   1 +
+ src/llama-graph.cpp    |   9 ++-
+ src/llama-kv-cache.cpp |  74 ++++++++++++++++++++++++
+ src/llama-kv-cache.h   |  11 ++++
+ src/paged-attn.cpp     | 128 +++++++++++++++++++++++++++++++++++++++++
+ src/paged-attn.h       |  40 +++++++++++++
+ 6 files changed, 262 insertions(+), 1 deletion(-)
+ create mode 100644 src/paged-attn.cpp
+ create mode 100644 src/paged-attn.h
+
+diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
+index a030940..58083b3 100644
+--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
+@@ -25,6 +25,7 @@ add_library(llama
+             llama-kv-cache.cpp
+             llama-kv-cache-iswa.cpp
+             paged-kv-manager.cpp
+            paged-attn.cpp
+             llama-kv-cache-dsa.cpp
+             llama-memory.cpp
+             llama-memory-hybrid.cpp
+diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
+index 68c9e60..b59d2a5 100644
+--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
+@@ -6,6 +6,8 @@
+ #include "llama-cparams.h"
+ 
+ #include "llama-kv-cache.h"
+
+#include "paged-attn.h"
+ #include "llama-kv-cache-iswa.h"
+ #include "llama-kv-cache-dsa.h"
+ #include "llama-memory-hybrid.h"
+@@ -2356,7 +2358,12 @@ ggml_tensor * llm_graph_context::build_attn(
+     ggml_tensor * k = mctx_cur->get_k(ctx0, il);
+     ggml_tensor * v = mctx_cur->get_v(ctx0, il);
+ 
+-    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
+    // [paged 0003] gather K, V and the mask to the sequence's used cells only
+    //   (no-op unless env LLAMA_KV_PAGED is set).
+    ggml_tensor * kq_mask_g = kq_mask;
+    paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+
+    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
+     cb(cur, "kqv_out", il);
+ 
+     if (inp->self_v_rot) {
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 999e2ae..30d02d7 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -1,4 +1,6 @@
+ #include "llama-kv-cache.h"
+#include <vector>
+#include <utility>
+ 
+ #include "llama-impl.h"
+ #include "llama-io.h"
+@@ -1329,6 +1331,70 @@ ggml_tensor * llama_kv_cache::get_v(ggml_context * ctx, int32_t il, uint32_t n_k
+             ggml_row_size(v->type, kv_size*n_embd_v_gqa)*sinfo.s0);
+ }
+ 
+// [paged 0003] gather-read: enumerate the non-empty cells in [0, n_kv) for the
+// single stream addressed by sinfo. With paged placement (patch 0002) these are
+// the sequence's scattered block cells; gathering K/V/mask by this index list
+// compacts the attention read while preserving every unmasked (token,cell) pair.
+uint32_t llama_kv_cache::get_n_gather(uint32_t n_kv, const slot_info & sinfo) const {
+    // Multi-stream: the gathered K/V/mask tensors are rectangular [.., n_gather,
+    // n_stream], so n_gather is the MAX non-empty count across the batch streams.
+    // Streams with fewer cells are padded (see get_gather_idxs) with a masked
+    // (empty) cell index, which contributes exp(-inf)=0 and is thus a no-op.
+    // K is laid out over physical streams [s0, s1]; index v_cells the same way.
+    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+    uint32_t mx = 0;
+    for (uint32_t j = 0; j < ns; ++j) {
+        const auto & cells = v_cells[sinfo.s0 + j];
+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+        uint32_t cnt = 0;
+        for (uint32_t i = 0; i < n; ++i) {
+            if (!cells.is_empty(i)) {
+                ++cnt;
+            }
+        }
+        mx = std::max(mx, cnt);
+    }
+    return mx;
+}
+
+void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const {
+    const uint32_t ns       = sinfo.s1 - sinfo.s0 + 1;
+    const uint32_t n_gather = get_n_gather(n_kv, sinfo);
+    // dst is [n_gather, n_stream] (ne0 = n_gather): column s at dst[s*n_gather..].
+    for (uint32_t j = 0; j < ns; ++j) {
+        const auto & cells = v_cells[sinfo.s0 + j];
+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+        // Collect the non-empty cells, then order them by token POSITION (not by
+        // physical cell index). The attention reduction (flash-attn online
+        // softmax, and the non-flash soft_max) runs over cells in array order and
+        // is order-sensitive in floating point. Stock (contiguous) placement
+        // happens to store cells in position order, so emitting the gathered
+        // indices in position order reproduces stock's exact reduction order -
+        // making the paged read bit-identical, not merely math-equivalent.
+        std::vector<std::pair<llama_pos, int32_t>> pc;
+        pc.reserve(n);
+        int32_t pad = -1;
+        for (uint32_t i = 0; i < n; ++i) {
+            if (!cells.is_empty(i)) {
+                pc.emplace_back(cells.pos_get(i), (int32_t) i);
+            } else if (pad < 0) {
+                pad = (int32_t) i; // first empty cell: its mask is -inf -> safe pad
+            }
+        }
+        std::sort(pc.begin(), pc.end());
+        int32_t * col = dst + (size_t) j * n_gather;
+        for (size_t k = 0; k < pc.size(); ++k) {
+            col[k] = pc[k].second;
+        }
+        // Pad the tail to n_gather with a masked (empty) cell so the rectangular
+        // gather drops to zero contribution for streams shorter than the max.
+        const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
+        for (uint32_t k = (uint32_t) pc.size(); k < n_gather; ++k) {
+            col[k] = padv;
+        }
+    }
+}
+
+ ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
+     GGML_UNUSED(sinfo);
+ 
+@@ -2620,6 +2686,14 @@ ggml_tensor * llama_kv_cache_context::get_v(ggml_context * ctx, int32_t il) cons
+     return kv->get_v(ctx, il, n_kv, sinfos[i_cur]);
+ }
+ 
+uint32_t llama_kv_cache_context::get_n_gather() const {
+    return kv->get_n_gather(n_kv, sinfos[i_cur]);
+}
+
+void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
+    kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
+}
+
+ ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
+     return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
+ }
+diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
+index 3d68f98..494c0fb 100644
+--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
+@@ -171,6 +171,12 @@ public:
+     ggml_tensor * get_k(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
+     ggml_tensor * get_v(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
+ 
+    // [paged 0003] count / list the non-empty cells in [0, n_kv) per stream of
+    //   sinfo (position-sorted, padded across streams). Used by paged-attn
+    //   gather-read. get_n_gather returns the max count across streams.
+    uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
+    void     get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
+
+     // store k_cur and v_cur in the cache based on the provided head location
+     ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
+     ggml_tensor * cpy_v(ggml_context * ctx, ggml_tensor * v_cur, ggml_tensor * v_idxs, int32_t il, const slot_info & sinfo) const;
+@@ -368,6 +374,11 @@ public:
+     ggml_tensor * get_k(ggml_context * ctx, int32_t il) const;
+     ggml_tensor * get_v(ggml_context * ctx, int32_t il) const;
+ 
+    // [paged 0003] gather-read helpers (delegate to the kv cache for the
+    //   current ubatch's stream).
+    uint32_t get_n_gather() const;
+    void     get_gather_idxs(int32_t * dst) const;
+
+     // store k_cur and v_cur in the cache based on the provided head location
+     // note: the heads in k_cur and v_cur should be laid out contiguously in memory
+     //   - k_cur  [n_embd_head_k, n_head_k, n_tokens]
+diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
+new file mode 100644
+index 0000000..ade75e8
+--- /dev/null
+++ b/src/paged-attn.cpp
+@@ -0,0 +1,128 @@
+#include "paged-attn.h"
+
+#include "llama-graph.h"
+#include "llama-kv-cache.h"
+
+#include "ggml.h"
+#include "ggml-backend.h"
+
+#include <cstdlib>
+#include <cstdio>
+
+namespace paged_attn {
+
+bool active() {
+    static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+    return a;
+}
+
+static bool debug() {
+    static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
+    return d;
+}
+
+namespace {
+
+// Graph input that, at set_input time, fills an I32 [n_gather, n_stream] tensor
+// with each stream's non-empty cell indices (position-sorted, padded with a
+// masked/empty cell) by delegating to the kv-cache context. Private to this
+// unit; default can_reuse()==false keeps the graph from being reused across
+// decodes (n_gather grows every step).
+class input_gather_idxs : public llm_graph_input_i {
+public:
+    input_gather_idxs(const llama_kv_cache_context * mctx, ggml_tensor * idxs)
+        : mctx(mctx), idxs(idxs) {}
+
+    void set_input(const llama_ubatch * ubatch) override {
+        GGML_UNUSED(ubatch);
+        GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
+        mctx->get_gather_idxs((int32_t *) idxs->data);
+    }
+
+    const llama_kv_cache_context * mctx;
+    ggml_tensor * idxs;
+};
+
+} // namespace
+
+void gather(ggml_context * ctx0,
+            llm_graph_result * res,
+            const llama_kv_cache_context * mctx,
+            ggml_tensor ** k,
+            ggml_tensor ** v,
+            ggml_tensor ** kq_mask) {
+    if (!active()) {
+        return;
+    }
+
+    ggml_tensor * K = *k;
+    ggml_tensor * V = *v;
+    ggml_tensor * M = *kq_mask;
+
+    // Number of streams (sequences) in the unified batch. K is laid out
+    // [d, h, n_kv, n_stream] and the mask is [n_kv, n_tps, 1, n_stream]; the
+    // gather is per-stream (one index column per stream), so a single
+    // ggml_get_rows over the stream axis handles 1..N streams uniformly.
+    const int64_t n_stream = K->ne[3];
+    GGML_ASSERT(M->ne[3] == n_stream);
+
+    const int64_t n_gather = (int64_t) mctx->get_n_gather();
+    if (n_gather <= 0) {
+        // Worst-case graph reserve (empty cache) or nothing placed yet: leave
+        // the full [0, n_kv) read untouched so buffer sizing stays worst-case.
+        return;
+    }
+
+    if (debug()) {
+        static int64_t once = 0;
+        if (once++ < 2) {
+            fprintf(stderr, "[paged-attn] gather n_stream=%lld n_kv=%lld n_gather=%lld\n",
+                    (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
+        }
+    }
+
+    // Per-stream index tensor [n_gather, n_stream], filled at set_input from
+    // each stream's non-empty cells. ggml_get_rows broadcasts along ne[1]==
+    // n_stream, so column s gathers from stream s of the source.
+    ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_gather, n_stream);
+    ggml_set_input(idx);
+    res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, idx)));
+
+    // --- gather K: collapse (head_dim, n_head) so cells become the row axis ---
+    {
+        ggml_tensor * t = ggml_cont(ctx0, K);                                          // [d, h, n_kv, ns]
+        t = ggml_reshape_3d(ctx0, t, K->ne[0]*K->ne[1], K->ne[2], n_stream);           // [d*h, n_kv, ns]
+        t = ggml_get_rows(ctx0, t, idx);                                               // [d*h, n_gather, ns]
+        *k = ggml_reshape_4d(ctx0, t, K->ne[0], K->ne[1], n_gather, n_stream);         // [d, h, n_gather, ns]
+    }
+
+    // --- gather V ---
+    // Normalize to a non-transposed [d, h, n_kv, ns] view first, so the gathered
+    // result is contiguous and build_attn_mha sees a consistent v_trans==false.
+    {
+        const bool v_trans = V->nb[1] > V->nb[2];
+        ggml_tensor * vsrc = v_trans
+            ? ggml_permute(ctx0, V, 2, 1, 0, 3)   // [n_kv, h, d, ns] -> [d, h, n_kv, ns]
+            : V;                                  // already [d, h, n_kv, ns]
+        ggml_tensor * t = ggml_cont(ctx0, vsrc);                                       // [d, h, n_kv, ns]
+        t = ggml_reshape_3d(ctx0, t, vsrc->ne[0]*vsrc->ne[1], vsrc->ne[2], n_stream);  // [d*h, n_kv, ns]
+        t = ggml_get_rows(ctx0, t, idx);                                               // [d*h, n_gather, ns]
+        *v = ggml_reshape_4d(ctx0, t, vsrc->ne[0], vsrc->ne[1], n_gather, n_stream);   // [d, h, n_gather, ns]
+    }
+
+    // --- gather mask (cells are ne0): transpose so cells become the row axis,
+    //     gather per stream, transpose back ---
+    {
+        ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream);      // [n_kv, n_tps, ns]
+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));                                  // [n_tps, n_kv, ns]
+        m = ggml_get_rows(ctx0, m, idx);                                               // [n_tps, n_gather, ns] (F32)
+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));                                  // [n_gather, n_tps, ns]
+        m = ggml_reshape_4d(ctx0, m, n_gather, M->ne[1], 1, n_stream);
+        if (M->type != m->type) {
+            m = ggml_cast(ctx0, m, M->type);   // flash-attn requires an F16 mask
+        }
+        *kq_mask = m;
+    }
+}
+
+} // namespace paged_attn
+diff --git a/src/paged-attn.h b/src/paged-attn.h
+new file mode 100644
+index 0000000..c5b7bd7
+--- /dev/null
+++ b/src/paged-attn.h
+@@ -0,0 +1,40 @@
+#pragma once
+// Paged attention gather-read (patch 0003, experimental).
+//
+// Companion to the paged block placement in llama_kv_cache::find_slot (patch
+// 0002). Patch 0002 places a sequence's tokens at permuted, non-contiguous
+// fixed-size block cells, but attention still reads the whole [0, n_kv) window
+// (empty cells masked to -inf). This unit compacts that read: it gathers K, V
+// and the kq_mask down to ONLY the sequence's used (non-empty) cells before
+// build_attn_mha.
+//
+// Correctness: dropping already-masked empty cells removes only exp(-inf)=0
+// terms, and cells are gathered in position order to match stock's reduction
+// order, so greedy output is identical. Gated by LLAMA_KV_PAGED; no-op if unset.
+//
+// All logic lives here to keep the core files additive: build_attn gets one
+// call, llama_kv_cache_context gets two thin accessors, CMake gets one line.
+
+#include <cstdint>
+
+struct ggml_context;
+struct ggml_tensor;
+class  llm_graph_result;
+class  llama_kv_cache_context;
+
+namespace paged_attn {
+
+// true iff env LLAMA_KV_PAGED is set (evaluated once).
+bool active();
+
+// Gather K, V and the kq_mask down to the current sequence's non-empty cells.
+// No-op (returns immediately) unless active(). On return *k, *v and *kq_mask
+// point at the compacted tensors; pass them straight to build_attn_mha.
+void gather(ggml_context * ctx0,
+            llm_graph_result * res,
+            const llama_kv_cache_context * mctx,
+            ggml_tensor ** k,
+            ggml_tensor ** v,
+            ggml_tensor ** kq_mask);
+
+} // namespace paged_attn
+-- 
+2.43.0
+
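Reviewer note on patch 0003: the per-stream index column built by get_gather_idxs can be sketched without the llama.cpp types. The version below is a simplification under stated assumptions: a cell is represented only by its token position, a negative position marks an empty cell, and `gather_column` is a hypothetical name that is not part of the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one stream's column of get_gather_idxs.
// pos[i] is the token position stored in cell i; pos[i] < 0 means "empty"
// (an assumption of this sketch, not the llama.cpp representation).
std::vector<int32_t> gather_column(const std::vector<int32_t> & pos, size_t n_gather) {
    std::vector<std::pair<int32_t, int32_t>> pc; // (position, cell index)
    int32_t pad = -1;                            // first empty cell, if any
    for (size_t i = 0; i < pos.size(); ++i) {
        if (pos[i] >= 0) {
            pc.emplace_back(pos[i], (int32_t) i);
        } else if (pad < 0) {
            pad = (int32_t) i;
        }
    }
    // Order by token position, not by physical cell index.
    std::sort(pc.begin(), pc.end());
    assert(n_gather >= pc.size());

    std::vector<int32_t> col;
    col.reserve(n_gather);
    for (const auto & p : pc) {
        col.push_back(p.second);
    }
    // Pad the tail with an empty (masked) cell; same fallback as the patch.
    const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
    while (col.size() < n_gather) {
        col.push_back(padv);
    }
    return col;
}
```

For cells holding positions `{2, empty, 0, 1}` and `n_gather = 4`, the column is `{2, 3, 0, 1}`: cells in position order, followed by the empty cell 1 as padding.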
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,298 @@
+From 7c294973de28d1ac991505638d726acfb371d541 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 10:50:35 +0200
+Subject: [PATCH] paged on-demand block allocation (env LLAMA_KV_PAGED) - patch
+ 0004
+
+Drive the paged placement in find_slot through the vendored PagedKVManager
+(patch 0001) instead of a fixed full-pool permutation. Blocks are popped from a
+free pool on demand as the sequence crosses block boundaries (peak << full
+reservation) and returned on sequence end (seq_rm full removal / clear). One
+manager per (kv-cache, stream); all state lives in the new src/paged-alloc unit,
+so the core kv-cache struct is untouched - find_slot/clear/seq_rm gain only a
+gated call. Default off; stock path byte-identical.
+---
+ src/CMakeLists.txt     |   1 +
+ src/llama-kv-cache.cpp |  69 +++++++++++++++++----------
+ src/paged-alloc.cpp    | 106 +++++++++++++++++++++++++++++++++++++++++
+ src/paged-alloc.h      |  39 +++++++++++++++
+ 4 files changed, 190 insertions(+), 25 deletions(-)
+ create mode 100644 src/paged-alloc.cpp
+ create mode 100644 src/paged-alloc.h
+
+diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
+index 58083b3..4d9d7d1 100644
+--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
+@@ -26,6 +26,7 @@ add_library(llama
+             llama-kv-cache-iswa.cpp
+             paged-kv-manager.cpp
+             paged-attn.cpp
+            paged-alloc.cpp
+             llama-kv-cache-dsa.cpp
+             llama-memory.cpp
+             llama-memory-hybrid.cpp
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 30d02d7..1125d9a 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -1,4 +1,5 @@
+ #include "llama-kv-cache.h"
+#include "paged-alloc.h"
+ #include <vector>
+ #include <utility>
+ 
+@@ -381,6 +382,11 @@ llama_kv_cache::llama_kv_cache(
+ }
+ 
+ void llama_kv_cache::clear(bool data) {
+    // [paged 0004] return all on-demand blocks to the pool on cache clear.
+    if (paged_alloc::active()) {
+        paged_alloc::release_all(this);
+    }
+
+     for (uint32_t s = 0; s < n_stream; ++s) {
+         v_cells[s].reset();
+         v_heads[s] = 0;
+@@ -409,6 +415,16 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
+         p1 = std::numeric_limits<llama_pos>::max();
+     }
+ 
+    // [paged 0004] free a stream's on-demand blocks when its whole sequence is
+    // removed (sequence end), so they return to the pool for reuse.
+    if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
+        if (seq_id >= 0) {
+            paged_alloc::release(this, (int) seq_to_stream[seq_id]);
+        } else {
+            paged_alloc::release_all(this);
+        }
+    }
+
+     if (seq_id >= 0) {
+         auto & cells = v_cells[seq_to_stream[seq_id]];
+         auto & head  = v_heads[seq_to_stream[seq_id]];
+@@ -1030,36 +1046,39 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
+         // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
+         // Single-sequence scope (uses get_used() as the logical base); falls back
+         // to the normal allocator if the permuted cells aren't available.
+-        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+-        if (paged_mode) {
+        // [paged 0004] On-demand block allocation. Patch 0002 proved attention is
+        // invariant to physical KV placement; here that placement is driven by
+        // the vendored PagedKVManager (patch 0001): blocks are popped from a free
+        // pool only as the sequence crosses block boundaries (peak << full
+        // reservation) and returned on sequence end. Enabled via LLAMA_KV_PAGED;
+        // falls back to the normal allocator on pool exhaustion or any conflict.
+        if (paged_alloc::active()) {
+             const uint32_t bs   = 16;                 // block size (tokens/block)
+-            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
+            const uint32_t nblk = cells.size() / bs;  // this stream's block budget
+             if (nblk >= 2) {
+-                // stride coprime to nblk => block-index permutation is a bijection
+-                uint32_t k = 1;
+-                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
+-                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
+-                }
+                 const uint32_t base = cells.get_used();
+-                bool ok = true;
+-                for (uint32_t i = 0; i < n_tokens; ++i) {
+-                    const uint32_t L    = base + i;
+-                    const uint32_t b    = L / bs;
+-                    const uint32_t off  = L % bs;
+-                    if (b >= nblk) { ok = false; break; }
+-                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
+-                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
+-                    res.idxs[s].push_back(phys);
+-                }
+-                if (ok && res.idxs[s].size() == n_tokens) {
+-                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+-                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
+-                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+-                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+                const int      strm = (int) seq_to_stream[seq_id];
+                std::vector<uint32_t> placed;
+                if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
+                    bool ok = (placed.size() == n_tokens);
+                    for (uint32_t i = 0; ok && i < n_tokens; ++i) {
+                        if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
+                            ok = false;
+                        }
+                    }
+                    if (ok) {
+                        for (uint32_t phys : placed) {
+                            res.idxs[s].push_back(phys);
+                        }
+                        if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+                            fprintf(stderr, "[paged] stream %d placed %u tok at cells:", strm, n_tokens);
+                            for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+                            fprintf(stderr, " (nblk=%u base=%u)\n", nblk, base);
+                        }
+                        continue; // on-demand paged placement succeeded
+                     }
+-                    continue; // paged placement succeeded for this sequence
+                    res.idxs[s].clear(); // fall back to the normal allocator
+                 }
+-                res.idxs[s].clear(); // fall back to the normal allocator
+             }
+         }
+ 
+diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
+new file mode 100644
+index 0000000..1d13f9c
+--- /dev/null
+++ b/src/paged-alloc.cpp
+@@ -0,0 +1,106 @@
+#include "paged-alloc.h"
+#include "paged-kv-manager.h"
+
+#include <cstdlib>
+#include <cstdio>
+#include <map>
+#include <memory>
+#include <utility>
+
+namespace paged_alloc {
+
+bool active() {
+    static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+    return a;
+}
+
+static bool debug() {
+    static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
+    return d;
+}
+
+namespace {
+
+using key_t = std::pair<const void *, int>;
+
+// One PagedKVManager per (kv-cache, stream): each stream owns a separate
+// physical pool of cells.size() cells, so a manager's block ids map directly to
+// cell ranges within that stream's pool. The internal request id is always 0.
+std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
+
+paged::PagedKVManager * get_mgr(const void * cache, int stream,
+                                uint32_t pool_blocks, uint32_t block_size) {
+    const key_t k{cache, stream};
+    auto it = g_managers.find(k);
+    if (it == g_managers.end()) {
+        // enable_caching=false: prefix caching is a later patch; 0004 exercises
+        // only on-demand allocate / free.
+        auto mgr = std::make_unique<paged::PagedKVManager>(
+            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
+        it = g_managers.emplace(k, std::move(mgr)).first;
+    }
+    return it->second.get();
+}
+
+} // namespace
+
+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+           uint32_t block_size, uint32_t pool_blocks,
+           std::vector<uint32_t> & out) {
+    if (n_tokens == 0) {
+        return true;
+    }
+
+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+
+    const size_t before = mgr->block_table(0).size();
+
+    // Grow the request to cover the highest logical position. The manager pops
+    // free blocks only for the boundaries actually crossed - that is the on-
+    // demand behavior; an already-covered range adds nothing.
+    if (!mgr->allocate(0, (size_t) base + n_tokens)) {
+        return false; // pool exhausted -> caller falls back to the stock path
+    }
+
+    out.reserve(out.size() + n_tokens);
+    for (uint32_t i = 0; i < n_tokens; ++i) {
+        const int64_t s = mgr->slot(0, (int) (base + i));
+        out.push_back((uint32_t) s);
+    }
+
+    if (debug()) {
+        const size_t after = mgr->block_table(0).size();
+        if (after != before) {
+            fprintf(stderr,
+                    "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
+                    "(budget=%u; base=%u +%u tok)\n",
+                    cache, stream, before, after, pool_blocks, base, n_tokens);
+        }
+    }
+
+    return true;
+}
+
+void release(const void * cache, int stream) {
+    auto it = g_managers.find({cache, stream});
+    if (it == g_managers.end()) {
+        return;
+    }
+    it->second->free(0);
+    g_managers.erase(it);
+    if (debug()) {
+        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
+    }
+}
+
+void release_all(const void * cache) {
+    for (auto it = g_managers.begin(); it != g_managers.end(); ) {
+        if (it->first.first == cache) {
+            it = g_managers.erase(it);
+        } else {
+            ++it;
+        }
+    }
+}
+
+} // namespace paged_alloc
+diff --git a/src/paged-alloc.h b/src/paged-alloc.h
+new file mode 100644
+index 0000000..bf66665
+--- /dev/null
+++ b/src/paged-alloc.h
+@@ -0,0 +1,39 @@
+#pragma once
+// On-demand paged KV block allocation (patch 0004, experimental).
+//
+// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
+// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
+// sequence's logical positions onto a fixed full-pool permutation, blocks are
+// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
+// and returned to the pool on sequence end. This is where the paged memory-
+// capacity benefit begins: a short sequence holds only a few blocks, not the
+// whole reserved window.
+//
+// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
+// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
+// struct stays untouched - find_slot only gains a gated call.
+
+#include <cstdint>
+#include <vector>
+
+namespace paged_alloc {
+
+// true iff env LLAMA_KV_PAGED is set (evaluated once).
+bool active();
+
+// Place n_tokens logical positions [base, base+n_tokens) of one stream on
+// demand, appending their physical cell indices to `out`. pool_blocks =
+// cells.size()/block_size is this stream's block budget. Returns false (leaving
+// `out` unchanged) on pool exhaustion, so the caller falls back to the stock
+// allocator. The caller still validates each returned cell is empty.
+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+           uint32_t block_size, uint32_t pool_blocks,
+           std::vector<uint32_t> & out);
+
+// Return a stream's blocks to the pool (sequence end).
+void release(const void * cache, int stream);
+
+// Return every stream's blocks for a kv-cache (clear() / teardown).
+void release_all(const void * cache);
+
+} // namespace paged_alloc
+-- 
+2.43.0
+
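Note on the on-demand placement described in paged-alloc.h above: the following is a minimal sketch of the idea, not the vendored paged::PagedKVManager. ToyPagedPool and all of its members are hypothetical names. The mapping cell = block_id * block_size + offset is an assumption drawn from the comment that block ids map directly to cell ranges within a stream's pool.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy pool for one sequence; illustrative only.
struct ToyPagedPool {
    uint32_t             block_size;
    std::vector<int32_t> free_blocks; // stack of free physical block ids
    std::vector<int32_t> table;       // logical block index -> physical block id

    ToyPagedPool(uint32_t n_blocks, uint32_t bs) : block_size(bs) {
        // Push in reverse so block 0 is popped first.
        for (int32_t b = (int32_t) n_blocks - 1; b >= 0; --b) {
            free_blocks.push_back(b);
        }
    }

    // Grow the table to cover n_tokens logical positions. A block is popped
    // only for each block boundary actually crossed; a covered range adds none.
    bool allocate(size_t n_tokens) {
        const size_t need = (n_tokens + block_size - 1) / block_size;
        if (need > table.size() && need - table.size() > free_blocks.size()) {
            return false; // pool exhausted: caller would fall back to stock
        }
        while (table.size() < need) {
            table.push_back(free_blocks.back());
            free_blocks.pop_back();
        }
        return true;
    }

    // Physical cell for logical position pos, or -1 if pos is not covered.
    int64_t slot(uint32_t pos) const {
        const size_t bi = pos / block_size;
        if (bi >= table.size()) {
            return -1;
        }
        return (int64_t) table[bi] * block_size + pos % block_size;
    }

    // Sequence end: return every held block to the free stack.
    void release() {
        for (auto it = table.rbegin(); it != table.rend(); ++it) {
            free_blocks.push_back(*it);
        }
        table.clear();
    }
};
```

With a 4-block pool and block size 16, a 5-token sequence holds one block and the other three stay free, which is the memory-capacity benefit the header comment refers to.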
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,143 @@
+From 141029beec609e87f24f6f6bba3ec842d7037862 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 12:13:44 +0200
+Subject: [PATCH] paged cross-request prefix caching (env LLAMA_KV_PAGED) -
+ patch 0006
+
+Add host-side cross-request prefix sharing to the vendored PagedKVManager
+(patches 0001-0004): on placement, hash a new sequence's prefix blocks, reuse the
+matching cached physical blocks (ref_cnt++) for the shared prefix and allocate
+fresh blocks only for the divergent suffix. A shared block is freed only at
+ref 0; copy-on-write privatises a still-shared (ref>1) block before a divergent
+write so co-owners stay byte-correct. All logic lives in the vendored
+src/paged-kv-manager unit (place_with_prefix / cow_block / ref-counting); the
+core kv-cache files are untouched. Default off; gated behind LLAMA_KV_PAGED.
+
+Wiring the physical-cell reuse into find_slot so the engine itself skips
+recompute needs core seq-membership changes and is left to a later patch.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/paged-kv-manager.cpp | 65 ++++++++++++++++++++++++++++++++++++++++
+ src/paged-kv-manager.h   | 23 ++++++++++++++
+ 2 files changed, 88 insertions(+)
+
+diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
+index ca0dcd8..4c6ee4c 100644
+--- a/src/paged-kv-manager.cpp
+++ b/src/paged-kv-manager.cpp
+@@ -293,4 +293,69 @@ void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block
+     pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
+ }
+ 
+// ---------------------------------------------------------------------------
+// Cross-request prefix caching + copy-on-write  (patch 0006)
+// ---------------------------------------------------------------------------
+
+size_t PagedKVManager::place_with_prefix(int seq_id, const std::vector<int>& token_ids) {
+    auto& req = req_to_blocks_[seq_id];
+
+    // Longest cached prefix: hash the full blocks and stop at the first miss.
+    // A block hash transitively encodes its whole prefix (FNV chaining), so the
+    // first miss bounds the reusable prefix (vLLM find_longest_cache_hit).
+    const std::vector<uint64_t> hashes = compute_block_hashes(token_ids);
+    std::vector<KVCacheBlock*> hits;
+    for (uint64_t bh : hashes) {
+        KVCacheBlock* cb = pool_.get_cached_block(bh);
+        if (!cb) break;
+        hits.push_back(cb);
+    }
+
+    // Reuse: ++ref_cnt (pulling warm blocks back out of the free list) then
+    // splice the shared physical blocks into this sequence's block table.
+    pool_.touch(hits);
+    req.insert(req.end(), hits.begin(), hits.end());
+
+    // Allocate fresh blocks only for the divergent suffix.
+    const size_t need = cdiv(token_ids.size(), block_size_);
+    if (need > req.size()) {
+        const size_t add = need - req.size();
+        if (add > pool_.get_num_free_blocks()) {
+            // OOM: roll the sequence back (un-touch the shared prefix so no ref
+            // leaks) and report no placement; the caller falls back to stock.
+            std::vector<KVCacheBlock*> ordered(req.rbegin(), req.rend());
+            pool_.free_blocks(ordered);
+            req.clear();
+            return 0;
+        }
+        auto nb = pool_.get_new_blocks(add);
+        req.insert(req.end(), nb.begin(), nb.end());
+    }
+    return hits.size();
+}
+
+std::pair<int32_t, int32_t> PagedKVManager::cow_block(int seq_id, size_t bi) {
+    auto& req = req_to_blocks_.at(seq_id);
+    KVCacheBlock* old = req.at(bi);
+    if (old->ref_cnt <= 1) {
+        return { old->block_id, old->block_id }; // already private - no copy
+    }
+    // Private copy for this sequence. get_new_blocks sets the fresh block's
+    // ref_cnt to 1; free_blocks decrements the shared block, which stays >0 so
+    // it is NOT returned to the pool and the other owners are left untouched.
+    KVCacheBlock* fresh = pool_.get_new_blocks(1).front();
+    pool_.free_blocks({ old });
+    req[bi] = fresh;
+    return { old->block_id, fresh->block_id };
+}
+
+int PagedKVManager::block_ref_cnt_at(int seq_id, size_t bi) const {
+    return req_to_blocks_.at(seq_id).at(bi)->ref_cnt;
+}
+
+size_t PagedKVManager::num_blocks(int seq_id) const {
+    auto it = req_to_blocks_.find(seq_id);
+    return it == req_to_blocks_.end() ? 0 : it->second.size();
+}
+
+ } // namespace paged
+diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
+index 740280a..34decbc 100644
+--- a/src/paged-kv-manager.h
+++ b/src/paged-kv-manager.h
+@@ -14,6 +14,7 @@
+ #include <vector>
+ #include <unordered_map>
+ #include <map>
+#include <utility>
+ 
+ namespace paged {
+ 
+@@ -99,6 +100,28 @@ public:
+     size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
+     void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
+ 
+    // Cross-request prefix caching + copy-on-write (patch 0006).
+    //
+    // Splice the longest cached prefix of token_ids into seq_id (reuse the
+    // shared physical blocks, ref_cnt++ so a block frees only at ref 0) and
+    // allocate fresh blocks only for the divergent suffix. Returns the number of
+    // shared (reused) blocks; the caller skips recomputing those tokens. On pool
+    // exhaustion the sequence is rolled back (no ref leak) and 0 is returned.
+    size_t place_with_prefix(int seq_id, const std::vector<int>& token_ids);
+
+    // Copy-on-write the block at logical index bi of seq_id. If that block is
+    // shared (ref_cnt>1), allocate a fresh private block, drop this seq's ref on
+    // the shared one (other owners keep it, content untouched) and install the
+    // fresh block at bi. Returns {old_block_id, new_block_id}; new==old when the
+    // block was already private (ref_cnt<=1) and no copy is needed. The caller
+    // copies the physical cell contents old_block_id -> new_block_id.
+    std::pair<int32_t, int32_t> cow_block(int seq_id, size_t bi);
+
+    // Introspection for the prefix-share gate (debug/tests).
+    int    block_ref_cnt_at(int seq_id, size_t bi) const;
+    size_t num_blocks(int seq_id) const;
+    size_t num_free_blocks() const { return pool_.get_num_free_blocks(); }
+
+ protected:
+     int block_size_;
+     BlockPool pool_;
+-- 
+2.43.0
+
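Note on the longest-cached-prefix lookup in place_with_prefix above: the sketch below illustrates why the first hash miss bounds the reusable prefix when block hashes are chained. It is an assumption-laden toy, not the vendored code. toy_block_hashes and ToyPrefixCache are hypothetical names, and the exact hashing scheme of compute_block_hashes is not shown in this patch, so the FNV-1a chaining here is only one plausible reading of the "FNV chaining" comment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hash every FULL block of tokens. The running FNV-1a state is carried across
// blocks, so block k's hash depends on all tokens in blocks 0..k.
std::vector<uint64_t> toy_block_hashes(const std::vector<int> & tokens, size_t block_size) {
    std::vector<uint64_t> out;
    uint64_t h = 0xcbf29ce484222325ULL; // FNV-1a 64-bit offset basis
    const size_t n_full = tokens.size() / block_size;
    for (size_t b = 0; b < n_full; ++b) {
        for (size_t i = 0; i < block_size; ++i) {
            const uint32_t t = (uint32_t) tokens[b * block_size + i];
            for (int byte = 0; byte < 4; ++byte) {
                h ^= (t >> (8 * byte)) & 0xffu;
                h *= 0x100000001b3ULL; // FNV-1a 64-bit prime
            }
        }
        out.push_back(h);
    }
    return out;
}

// Hypothetical content cache: block hash -> physical block id, plus ref counts.
struct ToyPrefixCache {
    std::unordered_map<uint64_t, int32_t> hash_to_block;
    std::unordered_map<int32_t, int>      ref_cnt;

    // Publish an owner's computed blocks (the owner holds one ref on each).
    void publish(const std::vector<uint64_t> & hashes, const std::vector<int32_t> & blocks) {
        for (size_t i = 0; i < hashes.size() && i < blocks.size(); ++i) {
            hash_to_block[hashes[i]] = blocks[i];
            ref_cnt[blocks[i]]       = 1;
        }
    }

    // Longest cached prefix: stop at the first miss and ++ref each hit.
    std::vector<int32_t> share(const std::vector<uint64_t> & hashes) {
        std::vector<int32_t> hits;
        for (uint64_t h : hashes) {
            auto it = hash_to_block.find(h);
            if (it == hash_to_block.end()) {
                break;
            }
            ++ref_cnt[it->second];
            hits.push_back(it->second);
        }
        return hits;
    }
};
```

In this toy, a sequence that matches the owner for 32 tokens at block size 16 shares two blocks, while one that diverges at token 20 shares only the first. A block would go back to the pool only once its count drops to 0, mirroring the ref-counting rule stated in the commit message.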
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,531 @@
+From da20c1c0571e84bc76202d915d4bb82892a3392b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 12:46:28 +0200
+Subject: [PATCH] paged engine prefix recompute-skip (env LLAMA_KV_PAGED) -
+ patch 0007
+
+Wire the host-side cross-request prefix cache (patch 0006) into the engine so a
+new sequence physically SHARES the cached prefix blocks and skips recomputing the
+shared prefix. This is the actual compute win: 0006 only proved the host-side
+machinery and realised reuse via the stock seq_cp; it did not yet deliver the
+skip from the paged path itself.
+
+Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
+
+  * paged-alloc reworked from a per-stream, request-0, destroyed-on-free manager
+    into ONE persistent caching PagedKVManager per (kv-cache, stream) whose
+    requests are keyed by the real llama_seq_id. free(seq) now releases exactly
+    one sequence, so ref-counted shared blocks survive while another sharer holds
+    them. New seams: share_prefix (place_with_prefix -> shared prefix tokens),
+    slot, commit (publish a sequence into the content cache), ref-counted release,
+    plus ref/num-free introspection.
+
+  * Two gated llama_kv_cache methods (the core seq-membership handling 0007 needs):
+    paged_prefix_share() reuses the longest cached content prefix for a sequence
+    and marks the shared physical cells as belonging to it (cells.seq_add) so the
+    engine's attention mask includes the already-computed prefix KV; the caller
+    then decodes ONLY the divergent suffix. paged_prefix_commit() publishes a
+    sequence's full blocks for later reuse.
+
+  * find_slot's paged branch anchors placement on each sequence's own logical base
+    (ubatch.pos) and keys the manager request by seq_id, so an independently-freed
+    sequence and a shared prefix coexist in one unified pool. seq_rm/clear free
+    per-sequence (ref-counted) instead of nuking the whole stream.
+
+  * paged-prefix-api: a thin gated shim so a caller holding only the public
+    llama.h can reach the seam and the introspection without the internal headers.
+
+Core existing-file touch: src/llama-kv-cache.{cpp,h}, +71 -3. Everything else is
+additive vendored units. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): a
+sequence B sharing A's prefix decodes greedy tokens byte-identical to B from
+scratch with the prefill computing ONLY the suffix (32 prefix tokens skipped) at
+a block boundary AND mid-block; the shared block carries ref_cnt 2 while both
+hold it, drops to 1 when one sharer is removed (survivor intact, re-shareable, no
+use-after-free) and returns to the pool only when all sharers are freed. The
+0004 serving gate (unified and non-unified) stays byte-identical stock vs paged.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/CMakeLists.txt       |   1 +
+ src/llama-kv-cache.cpp   |  66 +++++++++++++++++++++++--
+ src/llama-kv-cache.h     |   8 +++
+ src/paged-alloc.cpp      | 104 ++++++++++++++++++++++++++++++---------
+ src/paged-alloc.h        |  69 +++++++++++++++++++-------
+ src/paged-prefix-api.cpp |  48 ++++++++++++++++++
+ src/paged-prefix-api.h   |  27 ++++++++++
+ 7 files changed, 280 insertions(+), 43 deletions(-)
+ create mode 100644 src/paged-prefix-api.cpp
+ create mode 100644 src/paged-prefix-api.h
+
+diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
+index 4d9d7d1..432f42d 100644
+--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
+@@ -27,6 +27,7 @@ add_library(llama
+             paged-kv-manager.cpp
+             paged-attn.cpp
+             paged-alloc.cpp
+            paged-prefix-api.cpp
+             llama-kv-cache-dsa.cpp
+             llama-memory.cpp
+             llama-memory-hybrid.cpp
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 1125d9a..7510ff9 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -419,7 +419,7 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
+     // removed (sequence end), so they return to the pool for reuse.
+     if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
+         if (seq_id >= 0) {
+-            paged_alloc::release(this, (int) seq_to_stream[seq_id]);
+            paged_alloc::release(this, (int) seq_to_stream[seq_id], (int) seq_id);
+         } else {
+             paged_alloc::release_all(this);
+         }
+@@ -1056,10 +1056,15 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
+             const uint32_t bs   = 16;                 // block size (tokens/block)
+             const uint32_t nblk = cells.size() / bs;  // this stream's block budget
+             if (nblk >= 2) {
+-                const uint32_t base = cells.get_used();
+                // [paged 0007] Anchor placement on this sequence's own logical
+                // base position (ubatch.pos), not the shared used-count, and key
+                // the manager request by the real seq_id. slot(seq,pos) is then
+                // stable per sequence, so an independently-freed (ref-counted)
+                // sequence and a shared prefix can coexist in one unified pool.
+                const uint32_t base = (uint32_t) ubatch.pos[s*n_tokens];
+                 const int      strm = (int) seq_to_stream[seq_id];
+                 std::vector<uint32_t> placed;
+-                if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
+                if (paged_alloc::place(this, strm, (int) seq_id, base, n_tokens, bs, nblk, placed)) {
+                     bool ok = (placed.size() == n_tokens);
+                     for (uint32_t i = 0; ok && i < n_tokens; ++i) {
+                         if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
+@@ -1165,6 +1170,61 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
+     return res;
+ }
+ 
+// [paged 0007] Cross-request prefix recompute-skip.
+//
+// Reuse a cached content prefix for seq_id: share_prefix() splices the longest
+// matching cached physical blocks into seq_id (ref_cnt++) and reserves fresh
+// blocks for the divergent suffix. We then mark the shared physical cells as
+// belonging to seq_id - those cells already hold the owner's computed KV at the
+// matching logical positions, so the caller decodes ONLY the suffix and the
+// prefix is never recomputed. Returns the number of shared prefix tokens.
+// Gated behind LLAMA_KV_PAGED; a no-op (returns 0) otherwise.
+int32_t llama_kv_cache::paged_prefix_share(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
+    if (!paged_alloc::active() || tokens.empty()) {
+        return 0;
+    }
+    const uint32_t bs   = 16;
+    const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
+    auto & cells = v_cells[strm];
+    const uint32_t nblk = cells.size() / bs;
+    if (nblk < 2) {
+        return 0;
+    }
+
+    std::vector<int> toks(tokens.begin(), tokens.end());
+    const size_t kshare = paged_alloc::share_prefix(this, (int) strm, (int) seq_id, toks, bs, nblk);
+
+    for (size_t p = 0; p < kshare; ++p) {
+        const int64_t cell = paged_alloc::slot(this, (int) strm, (int) seq_id, (int) p);
+        if (cell < 0 || (uint32_t) cell >= cells.size() ||
+            cells.is_empty((uint32_t) cell) ||
+            cells.pos_get((uint32_t) cell) != (llama_pos) p) {
+            // Owner cell missing / repurposed: cannot safely share. Roll the
+            // sequence back so the caller recomputes the whole prompt.
+            paged_alloc::release(this, (int) strm, (int) seq_id);
+            return 0;
+        }
+        if (!cells.seq_has((uint32_t) cell, seq_id)) {
+            cells.seq_add((uint32_t) cell, seq_id);
+        }
+    }
+    return (int32_t) kshare;
+}
+
+// [paged 0007] Publish a sequence's full blocks into the content cache so a
+// later paged_prefix_share() can reuse them. Call after the sequence KV is
+// computed (its prefill decode has run).
+void llama_kv_cache::paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
+    if (!paged_alloc::active() || tokens.empty()) {
+        return;
+    }
+    const uint32_t bs   = 16;
+    const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
+    const uint32_t nblk = v_cells[strm].size() / bs;
+    std::vector<int> toks(tokens.begin(), tokens.end());
+    paged_alloc::commit(this, (int) strm, (int) seq_id, toks, bs, nblk);
+}
+
+ void llama_kv_cache::apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch) {
+     // TODO: refactor [TAG_KV_CACHE_SHARE_CELLS]
+     if (other) {
+diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
+index 494c0fb..f374ac6 100644
+--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
+@@ -199,6 +199,14 @@ public:
+     // emplace the ubatch context into slot: [sinfo.idxs[0...ubatch.n_tokens - 1]]
+     void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
+ 
+    // [paged 0007] Cross-request prefix recompute-skip (experimental, gated by
+    // env LLAMA_KV_PAGED). paged_prefix_share() reuses a cached content prefix
+    // for seq_id and returns the number of shared prefix tokens (the caller
+    // decodes only the suffix); paged_prefix_commit() publishes a sequence into
+    // the content cache for later reuse. No-ops when LLAMA_KV_PAGED is unset.
+    int32_t paged_prefix_share (llama_seq_id seq_id, const std::vector<llama_token> & tokens);
+    void    paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens);
+
+     //
+     // input API
+     //
+diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
+index 1d13f9c..c1027fb 100644
+--- a/src/paged-alloc.cpp
+++ b/src/paged-alloc.cpp
+@@ -23,9 +23,13 @@ namespace {
+ 
+ using key_t = std::pair<const void *, int>;
+ 
+-// One PagedKVManager per (kv-cache, stream): each stream owns a separate
+-// physical pool of cells.size() cells, so a manager's block ids map directly to
+-// cell ranges within that stream's pool. The internal request id is always 0.
+// One persistent PagedKVManager per (kv-cache, stream): each stream owns a
+// separate physical pool of cells.size() cells, so a manager's block ids map
+// directly to cell ranges within that stream's pool. Requests inside a manager
+// are keyed by the real llama_seq_id (NOT a fixed 0), so free(seq) releases one
+// sequence and shared blocks survive at ref>0 - this is what makes ref-counted
+// cross-request prefix sharing (0007) possible. Caching is enabled so commit()
+// can publish blocks and share_prefix() can hit them.
+ std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
+ 
+ paged::PagedKVManager * get_mgr(const void * cache, int stream,
+@@ -33,18 +37,21 @@ paged::PagedKVManager * get_mgr(const void * cache, int stream,
+     const key_t k{cache, stream};
+     auto it = g_managers.find(k);
+     if (it == g_managers.end()) {
+-        // enable_caching=false: prefix caching is a later patch; 0004 exercises
+-        // only on-demand allocate / free.
+         auto mgr = std::make_unique<paged::PagedKVManager>(
+-            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
+            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/true);
+         it = g_managers.emplace(k, std::move(mgr)).first;
+     }
+     return it->second.get();
+ }
+ 
+paged::PagedKVManager * find_mgr(const void * cache, int stream) {
+    auto it = g_managers.find({cache, stream});
+    return it == g_managers.end() ? nullptr : it->second.get();
+}
+
+ } // namespace
+ 
+-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
+            uint32_t block_size, uint32_t pool_blocks,
+            std::vector<uint32_t> & out) {
+     if (n_tokens == 0) {
+@@ -53,43 +60,79 @@ bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+ 
+     paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+ 
+-    const size_t before = mgr->block_table(0).size();
+    const size_t before = mgr->block_table(seq).size();
+ 
+-    // Grow the request to cover the highest logical position. The manager pops
+-    // free blocks only for the boundaries actually crossed - that is the on-
+-    // demand behavior; an already-covered range adds nothing.
+-    if (!mgr->allocate(0, (size_t) base + n_tokens)) {
+    // Grow this sequence's request to cover its highest logical position. The
+    // manager pops free blocks only for boundaries actually crossed; if
+    // share_prefix() already reserved these blocks, this is a no-op.
+    if (!mgr->allocate(seq, (size_t) base + n_tokens)) {
+         return false; // pool exhausted -> caller falls back to the stock path
+     }
+ 
+     out.reserve(out.size() + n_tokens);
+     for (uint32_t i = 0; i < n_tokens; ++i) {
+-        const int64_t s = mgr->slot(0, (int) (base + i));
+        const int64_t s = mgr->slot(seq, (int) (base + i));
+         out.push_back((uint32_t) s);
+     }
+ 
+     if (debug()) {
+-        const size_t after = mgr->block_table(0).size();
+        const size_t after = mgr->block_table(seq).size();
+         if (after != before) {
+             fprintf(stderr,
+-                    "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
+                    "[paged-alloc] cache=%p stream=%d seq=%d grew %zu->%zu blocks "
+                     "(budget=%u; base=%u +%u tok)\n",
+-                    cache, stream, before, after, pool_blocks, base, n_tokens);
+                    cache, stream, seq, before, after, pool_blocks, base, n_tokens);
+         }
+     }
+ 
+     return true;
+ }
+ 
+-void release(const void * cache, int stream) {
+-    auto it = g_managers.find({cache, stream});
+-    if (it == g_managers.end()) {
+size_t share_prefix(const void * cache, int stream, int seq,
+                    const std::vector<int> & tokens,
+                    uint32_t block_size, uint32_t pool_blocks) {
+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+    const size_t shared_blocks = mgr->place_with_prefix(seq, tokens);
+    const size_t shared_tokens = shared_blocks * (size_t) block_size;
+    if (debug() && shared_blocks > 0) {
+        fprintf(stderr,
+                "[paged-alloc] cache=%p stream=%d seq=%d shares %zu prefix blocks "
+                "(%zu tokens) - prefix NOT recomputed\n",
+                cache, stream, seq, shared_blocks, shared_tokens);
+    }
+    return shared_tokens;
+}
+
+int64_t slot(const void * cache, int stream, int seq, int pos) {
+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
+    if (!mgr) {
+        return -1;
+    }
+    if ((size_t) (pos / mgr->block_size()) >= mgr->num_blocks(seq)) {
+        return -1;
+    }
+    return mgr->slot(seq, pos);
+}
+
+void commit(const void * cache, int stream, int seq,
+            const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks) {
+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+    mgr->cache_blocks(seq, mgr->compute_block_hashes(tokens), tokens.size());
+    if (debug()) {
+        fprintf(stderr, "[paged-alloc] cache=%p stream=%d seq=%d committed %zu tokens\n",
+                cache, stream, seq, tokens.size());
+    }
+}
+
+void release(const void * cache, int stream, int seq) {
+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
+    if (!mgr) {
+         return;
+     }
+-    it->second->free(0);
+-    g_managers.erase(it);
+    mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
+     if (debug()) {
+-        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
+        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
+                cache, stream, seq, mgr->num_free_blocks());
+     }
+ }
+ 
+@@ -103,4 +146,21 @@ void release_all(const void * cache) {
+     }
+ }
+ 
+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size) {
+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
+    if (!mgr) {
+        return -1;
+    }
+    const size_t bi = (size_t) pos / block_size;
+    if (bi >= mgr->num_blocks(seq)) {
+        return -1;
+    }
+    return mgr->block_ref_cnt_at(seq, bi);
+}
+
+size_t num_free(const void * cache, int stream) {
+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
+    return mgr ? mgr->num_free_blocks() : 0;
+}
+
+ } // namespace paged_alloc
+diff --git a/src/paged-alloc.h b/src/paged-alloc.h
+index bf66665..88dedef 100644
+--- a/src/paged-alloc.h
+++ b/src/paged-alloc.h
+@@ -1,17 +1,27 @@
+ #pragma once
+-// On-demand paged KV block allocation (patch 0004, experimental).
+// On-demand paged KV block allocation + cross-request prefix reuse
+// (patches 0004 + 0007, experimental).
+ //
+-// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
+-// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
+-// sequence's logical positions onto a fixed full-pool permutation, blocks are
+-// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
+-// and returned to the pool on sequence end. This is where the paged memory-
+-// capacity benefit begins: a short sequence holds only a few blocks, not the
+-// whole reserved window.
+// Backs the paged placement in llama_kv_cache::find_slot with the vendored
+// host-side PagedKVManager (patch 0001). Two responsibilities:
+ //
+-// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
+-// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
+-// struct stays untouched - find_slot only gains a gated call.
+//   * On-demand allocation (0004): a sequence's logical positions are mapped to
+//     physical cells block-by-block, popped from a free pool only as the
+//     sequence grows and returned on sequence end.
+//
+//   * Cross-request prefix reuse (0007): before a new sequence's suffix is
+//     decoded, share_prefix() reuses the cached physical blocks of a matching
+//     content prefix (ref_cnt++), so the engine shares the already-computed KV
+//     cells and the caller decodes ONLY the divergent suffix - the prefix is not
+//     recomputed. commit() publishes a sequence's full blocks into the content
+//     cache so later sequences can hit them. Freeing is ref-counted: a shared
+//     block returns to the pool only when every sharer has been released.
+//
+// One persistent PagedKVManager per (kv-cache, stream); requests inside it are
+// keyed by the real llama_seq_id, so free(seq) releases exactly one sequence and
+// shared blocks survive at ref>0. All state lives in this unit (a static
+// registry), so the core kv-cache struct stays untouched - find_slot gains only
+// gated calls. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
+ 
+ #include <cstdint>
+ #include <vector>
+@@ -21,19 +31,42 @@ namespace paged_alloc {
+ // true iff env LLAMA_KV_PAGED is set (evaluated once).
+ bool active();
+ 
+-// Place n_tokens logical positions [base, base+n_tokens) of one stream on
+-// demand, appending their physical cell indices to `out`. pool_blocks =
+-// cells.size()/block_size is this stream's block budget. Returns false (leaving
+// Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
+// on demand, appending their physical cell indices to `out`. pool_blocks =
+// cells.size()/block_size is the stream's block budget. Returns false (leaving
+ // `out` unchanged) on pool exhaustion, so the caller falls back to the stock
+ // allocator. The caller still validates each returned cell is empty.
+-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
+            uint32_t block_size, uint32_t pool_blocks,
+            std::vector<uint32_t> & out);
+ 
+-// Return a stream's blocks to the pool (sequence end).
+-void release(const void * cache, int stream);
+// [0007] Reuse the longest cached content prefix of `tokens` for (cache,stream,
+// seq): splice the shared physical blocks into seq (ref_cnt++) and reserve fresh
+// blocks for the divergent suffix. Returns the number of shared PREFIX TOKENS
+// (block-aligned); the caller marks those cells for seq and decodes only the
+// suffix. 0 if nothing matched or on pool exhaustion (sequence rolled back).
+size_t share_prefix(const void * cache, int stream, int seq,
+                    const std::vector<int> & tokens,
+                    uint32_t block_size, uint32_t pool_blocks);
+
+// [0007] Physical cell backing logical position `pos` of (cache,stream,seq), or -1
+// if no manager exists or `pos` is past seq's blocks. Maps a prefix position to its cell.
+int64_t slot(const void * cache, int stream, int seq, int pos);
+ 
+-// Return every stream's blocks for a kv-cache (clear() / teardown).
+// [0007] Publish seq's full (block-aligned) blocks into the content cache so a
+// later share_prefix() can reuse them. Call after the sequence's KV is computed.
+void commit(const void * cache, int stream, int seq,
+            const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
+
+// Return one sequence's blocks to the pool (ref-counted; sequence end).
+void release(const void * cache, int stream, int seq);
+
+// Drop every manager for a kv-cache (clear() / teardown).
+ void release_all(const void * cache);
+ 
+// Introspection for the prefix-share gate (debug/tests). ref_cnt_at returns the
+// ref count of the block backing logical position `pos`, or -1 if unknown.
+int    ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
+size_t num_free(const void * cache, int stream);
+
+ } // namespace paged_alloc
+diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
+new file mode 100644
+index 0000000..8573cd2
+--- /dev/null
+++ b/src/paged-prefix-api.cpp
+@@ -0,0 +1,48 @@
+#include "paged-prefix-api.h"
+#include "paged-alloc.h"
+#include "llama-kv-cache.h"
+
+#include <vector>
+
+namespace paged_prefix_api {
+
+static llama_kv_cache * kv_of(llama_context * ctx) {
+    // The driver targets a plain unified KV-cache model; dynamic_cast yields null
+    // for wrapped caches (iSWA / hybrid), where cross-request cell sharing does
+    // not apply, so the shim degrades to a safe no-op.
+    return dynamic_cast<llama_kv_cache *>(llama_get_memory(ctx));
+}
+
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
+    llama_kv_cache * kv = kv_of(ctx);
+    if (!kv || n <= 0) {
+        return 0;
+    }
+    return kv->paged_prefix_share(seq, std::vector<llama_token>(tokens, tokens + n));
+}
+
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
+    llama_kv_cache * kv = kv_of(ctx);
+    if (!kv || n <= 0) {
+        return;
+    }
+    kv->paged_prefix_commit(seq, std::vector<llama_token>(tokens, tokens + n));
+}
+
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos) {
+    llama_kv_cache * kv = kv_of(ctx);
+    if (!kv) {
+        return -1;
+    }
+    return paged_alloc::ref_cnt_at((const void *) kv, /*stream=*/0, (int) seq, pos, /*block_size=*/16);
+}
+
+long num_free(llama_context * ctx) {
+    llama_kv_cache * kv = kv_of(ctx);
+    if (!kv) {
+        return 0;
+    }
+    return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
+}
+
+} // namespace paged_prefix_api
+diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
+new file mode 100644
+index 0000000..78a3864
+--- /dev/null
+++ b/src/paged-prefix-api.h
+@@ -0,0 +1,27 @@
+#pragma once
+// Thin test/diagnostic shim over the paged cross-request prefix engine seam
+// (patch 0007). Lets a driver that only includes the public llama.h reach the
+// gated llama_kv_cache::paged_prefix_* methods and the paged-alloc introspection
+// without pulling in the internal kv-cache headers. Unless LLAMA_KV_PAGED is set,
+// all entry points are no-ops (return 0; ref_at: -1). Experimental; not a stable API.
+
+#include "llama.h"
+
+namespace paged_prefix_api {
+
+// Reuse the longest cached content prefix of [tokens, tokens+n) for `seq` and
+// return the number of shared prefix tokens (the caller decodes only the
+// suffix). 0 if nothing was shared.
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+
+// Publish `seq`'s full blocks into the content cache (call after its KV is computed).
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+
+// Ref count of the paged block backing logical position `pos` of `seq` (unified
+// stream 0), or -1 if unknown.
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
+
+// Number of free blocks in the unified stream-0 pool, or 0 if no manager.
+long num_free(llama_context * ctx);
+
+} // namespace paged_prefix_api
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,130 @@
+From 240758ef7e144619c750aaf1d3339051ecc29098 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 17:02:22 +0200
+Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
+ - patch 0008
+
+Wire the paged cross-request prefix recompute-skip (patch 0007's engine seam,
+paged_prefix_api::share/commit) into the llama-server continuous-batching loop
+(update_slots) so CONCURRENT requests that share a long prefix physically reuse
+one committed copy of the prefix blocks and prefill only their divergent suffix.
+Patch 0007 proved the engine seam correct via a standalone driver, but the server
+never called it: two concurrent shared-prefix requests each recomputed the full
+prefix. The server's native prompt cache only reuses a slot's OWN prior prompt
+(longest-common-prefix vs slot.prompt.tokens) - it does not share across distinct
+concurrent slots. 0008 adds that cross-slot share.
+
+Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
+
+  * In update_slots prompt-processing, after the native n_past is computed and
+    only for a FRESH slot (n_past < one block, i.e. the native cache did not
+    already cover the prefix), call paged_prefix_api::share() to splice the
+    longest committed cross-request prefix into this sequence (ref_cnt++ on the
+    shared physical blocks) and advance n_past past it, so the batch fill computes
+    ONLY the suffix. The slot's own divergent tail cells are removed first so the
+    shared cells own [n_past, kshare) without colliding (the native path removes
+    these later anyway). The n_past < block gate guarantees any block-aligned
+    share the engine returns is strictly larger than n_past and therefore always
+    adopted, so the engine's reservation always matches the suffix-only batch and
+    never leaves stale blocks (which otherwise fragment the paged pool).
+
+  * When a slot finishes prefill (SLOT_STATE_DONE_PROMPT -> GENERATING, i.e. its
+    prefix KV has just been computed), call paged_prefix_api::commit() to publish
+    its prefix so concurrent/later sharers can reuse it.
+
+The share() / commit() entry points are forward-declared (defined in libllama,
+src/paged-prefix-api.cpp) to avoid pulling internal kv-cache headers into the
+server translation unit.
+
+Verified in the server (32B NVFP4, CUDA, --kv-unified): with a live sequence
+holding the prefix, K=16/32 concurrent shared-prefix requests prefill only their
+~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens;
+K=16 23.9s -> 1.5s, K=32 57.9s -> 2.3s), the engine logs "shares ... prefix
+blocks - NOT recomputed" with ref_cnt>1, and greedy output stays within the
+documented CUDA batch-shape non-determinism band (stock native prompt-caching
+shows the same magnitude). Cross-request sharing requires the unified KV cache.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ tools/server/server-context.cpp | 50 +++++++++++++++++++++++++++++++++
+ 1 file changed, 50 insertions(+)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index 39b7eb2..b5f9d37 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -16,6 +16,16 @@
+ #include "mtmd.h"
+ #include "mtmd-helper.h"
+ 
+// [paged 0008] Cross-request prefix recompute-skip shim. share()/commit() are
+// defined in libllama (src/paged-prefix-api.cpp, patch 0007) and are no-ops
+// unless env LLAMA_KV_PAGED is set. Declared here so the paged cross-slot prefix
+// cache wires into update_slots() without pulling in internal kv-cache headers.
+// Fully gated; stock (paged off) is byte-identical.
+namespace paged_prefix_api {
+    int32_t share (llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+    void    commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+}
+
+ #include <algorithm>
+ #include <cstddef>
+ #include <cinttypes>
+@@ -3335,6 +3345,37 @@ private:
+                             }
+                         }
+ 
+                        // [paged 0008] Cross-request prefix recompute-skip. The native prompt cache
+                        // above only reuses THIS slot's own prior prompt; when the paged KV
+                        // engine is active, also reuse a committed CROSS-slot prefix so
+                        // concurrent requests sharing a long prefix skip recompute. Gated on
+                        // LLAMA_KV_PAGED (paged_kv_share static); stock stays byte-identical.
+                        static const bool paged_kv_share = getenv("LLAMA_KV_PAGED") != nullptr;
+                        // Only attempt the cross-request share on a FRESH slot (the native
+                        // cache above did not already cover the prefix). With n_past < a
+                        // block, any block-aligned share the engine returns is strictly
+                        // larger than n_past and is therefore always adopted below - so the
+                        // engine's full-prompt reservation always matches the suffix-only
+                        // submission and never leaves stale blocks (which fragmented the
+                        // paged pool and crashed the server under high fan-out otherwise).
+                        if (paged_kv_share && n_past < 16 && slot.task->params.cache_prompt && !input_tokens.has_mtmd) {
+                            const llama_tokens ptoks = input_tokens.get_text_tokens();
+                            // Drop this slot's own cells beyond the natively-cached prefix before
+                            // splicing the shared physical prefix in, so the shared cells can own
+                            // [n_past, kshare) without colliding (the native path removes exactly
+                            // these later; a no-op for a fresh slot).
+                            common_context_seq_rm(ctx_tgt, slot.id, n_past, -1);
+                            const int32_t kshare = paged_prefix_api::share(ctx_tgt, slot.id, ptoks.data(), (int) ptoks.size());
+                            if (kshare > n_past) {
+                                slot.prompt.tokens.keep_first(n_past);
+                                for (int i = n_past; i < kshare; ++i) {
+                                    slot.prompt.tokens.push_back(ptoks[i]);
+                                }
+                                n_past = kshare;
+                                SLT_INF(slot, "paged: reusing %d cross-request shared prefix tokens - not recomputed\n", n_past);
+                            }
+                        }
+
+                         // [TAG_PROMPT_LOGITS]
+                         if (n_past == slot.task->n_tokens() && n_past > 0) {
+                             SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
+@@ -3741,6 +3782,15 @@ private:
+                 // prompt evaluated for next-token prediction
+                 slot.state = SLOT_STATE_GENERATING;
+ 
+                // [paged 0008] Publish this slot's computed prefix so concurrent/later
+                // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
+                // for [0, n_tokens) has just run, so the prefix KV is computed.
+                static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
+                if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
+                    const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
+                    paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
+                }
+
+                 if (slot.can_speculate()) {
+                     common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
+                 }
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch
@@ -0,0 +1,609 @@
+From 59490d82e4d0d4ad05ffb5ca3cccc668f4a75281 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 20:03:17 +0200
+Subject: [PATCH] paged in-kernel decode read (env LLAMA_KV_PAGED) - patch 0009
+
+Replace the per-layer per-step gather (patch 0003: ggml_get_rows of K/V into a
+contiguous buffer) with an in-kernel paged read on the decode step. build_attn
+passes the UNMODIFIED physical K/V views plus a block table (src[5] of
+ggml_flash_attn_ext: an I32 [n_view, n_stream] position-ordered physical-cell
+index, padded to FATTN_KQ_STRIDE). The CUDA fattn vec kernel and the CPU
+reference map logical KV index j -> physical cell block_table[seq*ne11+j] and
+read K_base+cell*nb11 / V_base+cell*nb21 in place, so the get_rows of K and V
+(the bulk of the gather) is gone. The mask stays a small compacted [n_view]
+causal mask in the same position order; KV_max / parallel_blocks / stream_k
+split-K are unchanged. The decode shape is forced onto the vec kernel (the only
+one wired for the block table); a nullptr block table => the stock contiguous
+read, byte-identical.
+
+Token-POSITION ordering keeps the flash-attn reduction order identical to stock,
+so CPU-paged logits == CPU-stock bit-for-bit (verified: 4-stream FA greedy, 64
+tokens). On GPU paged(vec) == stock(vec) at batch 1; at batch>1 it stays within
+the documented vec-vs-mma non-determinism band. Decode step at batch 32 / 1024
+ctx on GB10 (Qwen3-32B NVFP4): paged-gather 1279 ms -> in-kernel 696 ms (-46%),
+recovering the gather regression to within ~8% of stock (647 ms). Gated behind
+LLAMA_KV_PAGED; no-op (stock byte-identical) when unset.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/include/ggml.h                  |   6 ++
+ ggml/src/ggml-cpu/ops.cpp            |  10 ++-
+ ggml/src/ggml-cuda/fattn-common.cuh  |   8 +-
+ ggml/src/ggml-cuda/fattn-mma-f16.cuh |   4 +-
+ ggml/src/ggml-cuda/fattn-tile.cuh    |   4 +-
+ ggml/src/ggml-cuda/fattn-vec.cuh     |  25 +++++--
+ ggml/src/ggml-cuda/fattn-wmma-f16.cu |   4 +-
+ ggml/src/ggml-cuda/fattn.cu          |   9 +++
+ ggml/src/ggml.c                      |  14 ++++
+ src/llama-graph.cpp                  |  23 ++++--
+ src/llama-graph.h                    |   3 +-
+ src/llama-kv-cache.cpp               |  31 ++++++++
+ src/llama-kv-cache.h                 |   4 +
+ src/paged-attn.cpp                   | 107 +++++++++++++++++++++++++++
+ src/paged-attn.h                     |  18 +++++
+ 15 files changed, 248 insertions(+), 22 deletions(-)
+
+diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
+index d6807b6..823f5a9 100644
+--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
+@@ -2427,6 +2427,12 @@ extern "C" {
+             struct ggml_tensor * a,
+             struct ggml_tensor * sinks);
+ 
+    // [paged] optional block table in src[5]: I32 [n_kv_logical, n_stream]; maps each
+    // logical KV index to the physical cell within K/V. nullptr => stock contiguous read.
+    GGML_API void ggml_flash_attn_ext_set_block_table(
+            struct ggml_tensor * a,
+            struct ggml_tensor * block_table);
+
+     // TODO: needs to be adapted to ggml_flash_attn_ext
+     GGML_API struct ggml_tensor * ggml_flash_attn_back(
+            struct ggml_context * ctx,
+diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
+index 74611dc..63c07a2 100644
+--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
+@@ -8330,6 +8330,8 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
+     const ggml_tensor * v     = dst->src[2];
+     const ggml_tensor * mask  = dst->src[3];
+     const ggml_tensor * sinks = dst->src[4];
+    const ggml_tensor * block_table = dst->src[5]; // [paged] logical->physical cell map (src[5])
+    const int32_t     * bt    = block_table ? (const int32_t *) block_table->data : nullptr;
+ 
+     GGML_TENSOR_LOCALS(int64_t, neq, q,   ne)
+     GGML_TENSOR_LOCALS(size_t,  nbq, q,   nb)
+@@ -8449,7 +8451,9 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
+ 
+             float s; // KQ value
+ 
+-            const char * k_data = (const char *) k->data + ( ic*nbk1 + ik2*nbk2 + ik3*nbk3);
+            // [paged] map the logical KV index ic to its physical cell via the block table.
+            const int64_t ic_phys = bt ? (int64_t) bt[ik3*nek1 + ic] : ic;
+            const char * k_data = (const char *) k->data + ( ic_phys*nbk1 + ik2*nbk2 + ik3*nbk3);
+             kq_vec_dot(DK, &s, 0, k_data, 0, Q_q, 0, 1);
+ 
+             s = s*scale; // scale KQ value
+@@ -8465,7 +8469,7 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
+             float ms = 1.0f; // upon new higher max val, scale VKQ and KQ sum with this value
+             float vs = 1.0f; // post-softmax KQ value, expf(s - M)
+ 
+-            const char * v_data = ((const char *) v->data + (ic*nbv1 + iv2*nbv2 + iv3*nbv3));
+            const char * v_data = ((const char *) v->data + (ic_phys*nbv1 + iv2*nbv2 + iv3*nbv3));
+ 
+             if (v->type == GGML_TYPE_F16) {
+                 if (s > M) {
+@@ -9021,7 +9025,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
+         const int64_t dr = (nr + nchunk - 1) / nchunk;
+ 
+         static constexpr int64_t Q_TILE_SZ  = ggml_fa_tile_config::Q;
+-        bool use_tiled = !use_ref &&
+        bool use_tiled = !use_ref && dst->src[5] == nullptr && // [paged] one_chunk honors the block table
+                                (q->type == GGML_TYPE_F32 &&
+                                 kv_is_f32_or_f16 &&
+                                 k->type == v->type &&
+diff --git a/ggml/src/ggml-cuda/fattn-common.cuh b/ggml/src/ggml-cuda/fattn-common.cuh
+index 8dfa51a..3c6ddd5 100644
+--- a/ggml/src/ggml-cuda/fattn-common.cuh
+++ b/ggml/src/ggml-cuda/fattn-common.cuh
+@@ -39,7 +39,8 @@ typedef void (* fattn_kernel_t)(
+                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
+                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+-                            const int32_t nb31, const int32_t nb32, const int64_t nb33);
+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
+        const int  * __restrict__ block_table);
+ 
+ typedef float (*vec_dot_KQ_t)(
+     const char * __restrict__ K_c, const void * __restrict__ Q_v, const int * __restrict__ Q_q8 , const void * __restrict__ Q_ds);
+@@ -981,6 +982,8 @@ void launch_fattn(
+ 
+     const ggml_tensor * mask  = dst->src[3];
+     const ggml_tensor * sinks = dst->src[4];
+    const ggml_tensor * block_table = dst->src[5]; // [paged] optional logical->physical map
+    const int * bt_ptr = block_table ? (const int *) block_table->data : nullptr;
+ 
+     ggml_tensor * KQV = dst;
+ 
+@@ -1217,7 +1220,8 @@ void launch_fattn(
+         K->ne[0], K->ne[1], K->ne[2], K->ne[3], nb11, nb12, nb13,
+         nb21, nb22, nb23,
+         mask ? mask->ne[1] : 0, mask ? mask->ne[2] : 0, mask ? mask->ne[3] : 0,
+-        mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0
+        mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0,
+        bt_ptr
+     );
+     CUDA_CHECK(cudaGetLastError());
+ 
+diff --git a/ggml/src/ggml-cuda/fattn-mma-f16.cuh b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+index 83478a0..0a92cd6 100644
+--- a/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+++ b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+@@ -1723,7 +1723,9 @@ static __global__ void flash_attn_ext_f16(
+                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
+                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
+        const int  * __restrict__ block_table) {
+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
+     ggml_cuda_pdl_sync(); // TODO optimize placement
+ #if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE))
+     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
+diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
+index 0a09981..0ff14e6 100644
+--- a/ggml/src/ggml-cuda/fattn-tile.cuh
+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
+@@ -808,7 +808,9 @@ static __global__ void flash_attn_tile(
+                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
+                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
+        const int  * __restrict__ block_table) {
+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
+ #ifdef FLASH_ATTN_AVAILABLE
+     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
+     const char * GGML_CUDA_RESTRICT K        = K_ptr;
+diff --git a/ggml/src/ggml-cuda/fattn-vec.cuh b/ggml/src/ggml-cuda/fattn-vec.cuh
+index 69dd936..a09e2fb 100644
+--- a/ggml/src/ggml-cuda/fattn-vec.cuh
+++ b/ggml/src/ggml-cuda/fattn-vec.cuh
+@@ -39,7 +39,8 @@ static __global__ void flash_attn_ext_vec(
+                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
+                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
+        const int  * __restrict__ block_table) {
+     ggml_cuda_pdl_lc();
+ #ifdef FLASH_ATTN_AVAILABLE
+     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
+@@ -61,7 +62,7 @@ static __global__ void flash_attn_ext_vec(
+                   nb11, nb12, nb13,
+                   nb21, nb22, nb23,
+                   ne31, ne32, ne33,
+-                  nb31, nb32, nb33);
+                  nb31, nb32, nb33, block_table);
+         NO_DEVICE_CODE;
+         return;
+     }
+@@ -110,6 +111,14 @@ static __global__ void flash_attn_ext_vec(
+     K += nb13*sequence + nb12*(head / gqa_ratio);
+     V += nb23*sequence + nb22*(head / gqa_ratio);
+ 
+    // [paged] in-kernel block-table read: logical KV index j -> physical cell
+    // block_table[sequence*ne11 + j]; read K0 + cell*nb11 / V0 + cell*nb21. The
+    // mask/KV_max stay logical (the table is in token-position order). nullptr =>
+    // the stock contiguous read below.
+    const char * GGML_CUDA_RESTRICT K0 = K;
+    const char * GGML_CUDA_RESTRICT V0 = V;
+    const int  * GGML_CUDA_RESTRICT bt = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
+
+     const half * maskh  = (const half  *) (mask + nb33*(sequence % ne33) + nb31*ic0);
+ 
+     const float slope = get_alibi_slope(max_bias, head, n_head_log2, m0, m1);
+@@ -267,10 +276,11 @@ static __global__ void flash_attn_ext_vec(
+ #pragma unroll
+         for (int i_KQ_0 = 0; i_KQ_0 < nthreads_KQ; ++i_KQ_0) {
+             const int i_KQ = threadIdx.y*WARP_SIZE + (nthreads_KQ == WARP_SIZE ? 0 : (threadIdx.x & ~(nthreads_KQ-1))) + i_KQ_0;
+            const char * GGML_CUDA_RESTRICT K_blk = bt ? (K0 + (int64_t) bt[k_VKQ_0 + i_KQ]*nb11) : (K + i_KQ*nb11);
+ 
+ #pragma unroll
+             for (int j = 0; j < ncols; ++j) {
+-                float sum = vec_dot_KQ(K + i_KQ*nb11, Q_reg[j], Q_i32[j], Q_ds[j]);
+                float sum = vec_dot_KQ(K_blk, Q_reg[j], Q_i32[j], Q_ds[j]);
+                 sum = warp_reduce_sum<nthreads_KQ>(sum);
+ 
+                 if (use_logit_softcap) {
+@@ -324,6 +334,7 @@ static __global__ void flash_attn_ext_vec(
+ #pragma unroll
+         for (int k0 = 0; k0 < WARP_SIZE; k0 += V_cols_per_iter) {
+             const int k = threadIdx.y*WARP_SIZE + k0 + (nthreads_V == WARP_SIZE ? 0 : threadIdx.x / nthreads_V);
+            const char * GGML_CUDA_RESTRICT V_blk = bt ? (V0 + (int64_t) bt[k_VKQ_0 + k]*nb21) : (V + k*nb21);
+ 
+ #ifdef V_DOT2_F32_F16_AVAILABLE
+             half2 KQ_k[ncols];
+@@ -336,14 +347,14 @@ static __global__ void flash_attn_ext_vec(
+                 half2 tmp[V_rows_per_thread/2];
+                 if constexpr (type_V == GGML_TYPE_BF16) {
+                     float2 tmp_f[V_rows_per_thread/2];
+-                    dequantize_V(V + k*nb21, tmp_f,
+                    dequantize_V(V_blk, tmp_f,
+                         2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
+ #pragma unroll
+                     for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
+                         tmp[i_VKQ_1] = __float22half2_rn(tmp_f[i_VKQ_1]);
+                     }
+                 } else {
+-                    dequantize_V(V + k*nb21, tmp,
+                    dequantize_V(V_blk, tmp,
+                         2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
+                 }
+ #pragma unroll
+@@ -363,7 +374,7 @@ static __global__ void flash_attn_ext_vec(
+ #pragma unroll
+             for (int i_VKQ_0 = 0; i_VKQ_0 < D/2; i_VKQ_0 += nthreads_V*V_rows_per_thread/2) {
+                 float2 tmp[V_rows_per_thread/2];
+-                dequantize_V(V + k*nb21, tmp,
+                dequantize_V(V_blk, tmp,
+                     2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
+ #pragma unroll
+                 for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
+@@ -522,7 +533,7 @@ static __global__ void flash_attn_ext_vec(
+               nb11, nb12, nb13,
+               nb21, nb22, nb23,
+               ne31, ne32, ne33,
+-              nb31, nb32, nb33);
+              nb31, nb32, nb33, block_table);
+     NO_DEVICE_CODE;
+ #endif // FLASH_ATTN_AVAILABLE
+ }
+diff --git a/ggml/src/ggml-cuda/fattn-wmma-f16.cu b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
+index 6850716..5357849 100644
+--- a/ggml/src/ggml-cuda/fattn-wmma-f16.cu
+++ b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
+@@ -44,7 +44,9 @@ static __global__ void flash_attn_ext_f16(
+                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
+                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
+        const int  * __restrict__ block_table) {
+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
+ #if defined(FLASH_ATTN_AVAILABLE) && (defined(GGML_HIP_ROCWMMA_FATTN) && defined(GGML_USE_WMMA_FATTN))
+     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
+     const char * GGML_CUDA_RESTRICT K        = K_ptr;
+diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
+index d6c501b..e3771ee 100644
+--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
+@@ -574,6 +574,15 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
+ 
+ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+     ggml_cuda_set_device(ctx.device);
+
+    // [paged] the block table (src[5]) is only honored by the vec kernel's
+    // in-kernel read; force it. build_attn only sets it for a vec-supported
+    // 1-token-per-stream decode shape.
+    if (dst->src[5] != nullptr) {
+        ggml_cuda_flash_attn_ext_vec(ctx, dst);
+        return;
+    }
+
+     switch (ggml_cuda_get_best_fattn_kernel(ggml_cuda_get_device(), dst)) {
+         case BEST_FATTN_KERNEL_NONE:
+             GGML_ABORT("fatal error");
+diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
+index b43016c..adbe52b 100644
+--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
+@@ -5442,6 +5442,20 @@ void ggml_flash_attn_ext_add_sinks(
+     a->src[4] = sinks;
+ }
+ 
+void ggml_flash_attn_ext_set_block_table(
+        struct ggml_tensor * a,
+        struct ggml_tensor * block_table) {
+    if (!block_table) {
+        a->src[5] = NULL;
+        return;
+    }
+
+    GGML_ASSERT(a->op == GGML_OP_FLASH_ATTN_EXT);
+    GGML_ASSERT(block_table->type == GGML_TYPE_I32);
+
+    a->src[5] = block_table;
+}
+
+ // ggml_flash_attn_back
+ 
+ struct ggml_tensor * ggml_flash_attn_back(
+diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
+index b59d2a5..abdb48d 100644
+--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
+@@ -2074,7 +2074,8 @@ ggml_tensor * llm_graph_context::build_attn_mha(
+          ggml_tensor * sinks,
+          ggml_tensor * v_mla,
+                float   kq_scale,
+-                 int   il) const {
+                 int   il,
+         ggml_tensor * block_table) const {
+     const bool v_trans = v->nb[1] > v->nb[2];
+ 
+     // split the batch into streams if needed
+@@ -2109,6 +2110,9 @@ ggml_tensor * llm_graph_context::build_attn_mha(
+                                   hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
+         cb(cur, LLAMA_TENSOR_NAME_FATTN, il);
+ 
+        if (block_table) {
+            ggml_flash_attn_ext_set_block_table(cur, block_table);
+        }
+         ggml_flash_attn_ext_add_sinks(cur, sinks);
+         ggml_flash_attn_ext_set_prec (cur, GGML_PREC_F32);
+ 
+@@ -2358,12 +2362,19 @@ ggml_tensor * llm_graph_context::build_attn(
+     ggml_tensor * k = mctx_cur->get_k(ctx0, il);
+     ggml_tensor * v = mctx_cur->get_v(ctx0, il);
+ 
+-    // [paged 0003] gather K, V and the mask to the sequence's used cells only
+-    //   (no-op unless env LLAMA_KV_PAGED is set).
+-    ggml_tensor * kq_mask_g = kq_mask;
+-    paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+    // [paged] decode read: when paging is active and this is a 1-token-per-stream
+    //   decode step, present K/V as n_gather views + a block table so the fattn
+    //   kernel reads the sequence's cells in-kernel (no get_rows of K/V). Else
+    //   fall back to the gather-read (prefill, transposed V, or env off). All a
+    //   no-op unless env LLAMA_KV_PAGED is set => stock byte-identical.
+    ggml_tensor * kq_mask_g   = kq_mask;
+    ggml_tensor * block_table = nullptr;
+    const bool is_decode = (q_cur->ne[2] == k->ne[3]); // 1 query token per stream
+    if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, &k, &v, &kq_mask_g, &block_table))) {
+        paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+    }
+ 
+-    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
+    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il, block_table);
+     cb(cur, "kqv_out", il);
+ 
+     if (inp->self_v_rot) {
+diff --git a/src/llama-graph.h b/src/llama-graph.h
+index 5e8a658..c95ae49 100644
+--- a/src/llama-graph.h
+++ b/src/llama-graph.h
+@@ -969,7 +969,8 @@ struct llm_graph_context {
+             ggml_tensor * sinks,   // [n_head_q]
+             ggml_tensor * v_mla,   // [n_embd_head_v_mla, n_embd_head_v, n_head_v]
+                   float   kq_scale,
+-                    int   il) const;
+                    int   il,
+            ggml_tensor * block_table = nullptr) const; // [paged] optional src[5] block table
+ 
+     llm_graph_input_attn_no_cache * build_attn_inp_no_cache() const;
+ 
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 7510ff9..0351f86 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -1474,6 +1474,33 @@ void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_in
+     }
+ }
+ 
+void llama_kv_cache::get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const {
+    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+    for (uint32_t j = 0; j < ns; ++j) {
+        const auto & cells = v_cells[sinfo.s0 + j];
+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+        std::vector<std::pair<llama_pos, int32_t>> pc;
+        pc.reserve(n);
+        int32_t pad = -1;
+        for (uint32_t i = 0; i < n; ++i) {
+            if (!cells.is_empty(i)) {
+                pc.emplace_back(cells.pos_get(i), (int32_t) i);
+            } else if (pad < 0) {
+                pad = (int32_t) i;
+            }
+        }
+        std::sort(pc.begin(), pc.end());
+        int32_t * col = dst + (size_t) j * n_blk;
+        for (size_t k = 0; k < pc.size(); ++k) {
+            col[k] = pc[k].second;
+        }
+        const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
+        for (uint32_t k = (uint32_t) pc.size(); k < n_blk; ++k) {
+            col[k] = padv;
+        }
+    }
+}
+
+ ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
+     GGML_UNUSED(sinfo);
+ 
+@@ -2773,6 +2800,10 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
+     kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
+ }
+ 
+void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const {
+    kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]);
+}
+
+ ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
+     return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
+ }
+diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
+index f374ac6..e9980b6 100644
+--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
+@@ -176,6 +176,9 @@ public:
+     //   gather-read. get_n_gather returns the max count across streams.
+     uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
+     void     get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
+    // [paged inc1] block table [n_blk, n_stream] (position order, padded to n_blk
+    //   per column with a masked empty cell) for the in-kernel paged read.
+    void     get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const;
+ 
+     // store k_cur and v_cur in the cache based on the provided head location
+     ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
+@@ -386,6 +389,7 @@ public:
+     //   current ubatch's stream).
+     uint32_t get_n_gather() const;
+     void     get_gather_idxs(int32_t * dst) const;
+    void     get_block_table(int32_t * dst, uint32_t n_blk) const;
+ 
+     // store k_cur and v_cur in the cache based on the provided head location
+     // note: the heads in k_cur and v_cur should be laid out contiguously in memory
+diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
+index ade75e8..8eebeaa 100644
+--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
+@@ -43,6 +43,25 @@ public:
+     ggml_tensor * idxs;
+ };
+ 
+// Block table filler for the in-kernel paged read: fills an I32 [n_blk, n_stream]
+// tensor with each stream's position-ordered cells, padded to n_blk (per column)
+// with a masked empty cell, by delegating to the kv-cache context.
+class input_block_table : public llm_graph_input_i {
+public:
+    input_block_table(const llama_kv_cache_context * mctx, ggml_tensor * idxs, uint32_t n_blk)
+        : mctx(mctx), idxs(idxs), n_blk(n_blk) {}
+
+    void set_input(const llama_ubatch * ubatch) override {
+        GGML_UNUSED(ubatch);
+        GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
+        mctx->get_block_table((int32_t *) idxs->data, n_blk);
+    }
+
+    const llama_kv_cache_context * mctx;
+    ggml_tensor * idxs;
+    uint32_t n_blk;
+};
+
+ } // namespace
+ 
+ void gather(ggml_context * ctx0,
+@@ -125,4 +144,92 @@ void gather(ggml_context * ctx0,
+     }
+ }
+ 
+bool in_kernel_decode(ggml_context * ctx0,
+                      llm_graph_result * res,
+                      const llama_kv_cache_context * mctx,
+                      ggml_tensor ** k,
+                      ggml_tensor ** v,
+                      ggml_tensor ** kq_mask,
+                      ggml_tensor ** block_table) {
+    if (!active()) {
+        return false;
+    }
+    // Bench escape hatch: LLAMA_KV_PAGED_GATHER=1 forces the old gather-read decode
+    // path (for a same-build BEFORE/AFTER decode-step comparison). Dev-only.
+    static const bool force_gather = (std::getenv("LLAMA_KV_PAGED_GATHER") != nullptr);
+    if (force_gather) {
+        return false;
+    }
+
+    ggml_tensor * K = *k;
+    ggml_tensor * V = *v;
+    ggml_tensor * M = *kq_mask;
+
+    const int64_t n_stream = K->ne[3];
+    GGML_ASSERT(M->ne[3] == n_stream);
+
+    const int64_t n_gather = (int64_t) mctx->get_n_gather();
+    if (n_gather <= 0) {
+        // Worst-case reserve / nothing placed yet: keep the dense [0,n_kv) read.
+        return false;
+    }
+
+    // The in-kernel read addresses V along its d-major (non-transposed) axis. If
+    // the cache stores V transposed, fall back to gather() (which normalizes it).
+    if (V->nb[1] > V->nb[2]) {
+        return false;
+    }
+
+    if (debug()) {
+        static int64_t once = 0;
+        if (once++ < 2) {
+            fprintf(stderr, "[paged-attn] in-kernel decode n_stream=%lld n_kv=%lld n_gather=%lld\n",
+                    (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
+        }
+    }
+
+    // Block table [n_view, n_stream]: column s holds stream s's non-empty cells
+    // in token-POSITION order (identical to the gather index, so the reduction
+    // order matches stock bit-for-bit), padded with a masked empty cell. Filled
+    // at set_input from the kv-cache (get_block_table), exactly like the gather.
+    // Pad the logical length to FATTN_KQ_STRIDE (256) so the CUDA fattn vec kernel
+    // reads fixed 128-wide KV blocks without overrun and the KV_max mask scan
+    // engages; padded entries point at a masked empty cell (0 contribution). Stays
+    // <= n_kv since n_kv is itself padded to 256 and n_gather <= n_kv.
+    int64_t n_view = GGML_PAD(n_gather, 256);
+    if (n_view > K->ne[2]) {
+        n_view = K->ne[2];
+    }
+
+    ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
+    ggml_set_input(idx);
+    res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
+
+    // Present K and V as [d, h, n_view, ns] VIEWS of the full physical window:
+    // identical per-cell (nb1,nb2) and per-stream (nb3) strides, only the cell
+    // dim shrinks to n_view. NOT materialized - the kernel reads in place.
+    *k = ggml_view_4d(ctx0, K, K->ne[0], K->ne[1], n_view, n_stream,
+                      K->nb[1], K->nb[2], K->nb[3], 0);
+    *v = ggml_view_4d(ctx0, V, V->ne[0], V->ne[1], n_view, n_stream,
+                      V->nb[1], V->nb[2], V->nb[3], 0);
+
+    // Compact the mask to [n_view, n_tps, 1, ns] in the same position order so
+    // the kernel's logical mask index aligns with the block table. Cheap: the
+    // mask is ~(d*h) smaller than K/V, which is why only its get_rows remains.
+    {
+        ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream);
+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
+        m = ggml_get_rows(ctx0, m, idx);
+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
+        m = ggml_reshape_4d(ctx0, m, n_view, M->ne[1], 1, n_stream);
+        if (M->type != m->type) {
+            m = ggml_cast(ctx0, m, M->type);
+        }
+        *kq_mask = m;
+    }
+
+    *block_table = idx;
+    return true;
+}
+
+ } // namespace paged_attn
+diff --git a/src/paged-attn.h b/src/paged-attn.h
+index c5b7bd7..23e2184 100644
+--- a/src/paged-attn.h
+++ b/src/paged-attn.h
+@@ -37,4 +37,22 @@ void gather(ggml_context * ctx0,
+             ggml_tensor ** v,
+             ggml_tensor ** kq_mask);
+ 
+// [paged inc1] In-kernel paged decode read. Instead of materializing the
+// sequence's cells (gather()), present K and V as n_gather-length VIEWS of the
+// full physical window and return the position-ordered physical-cell index list
+// as a block table (src[5] of ggml_flash_attn_ext). The fattn kernel/op then
+// reads K_base + block_table[j]*nb in-kernel, removing the get_rows of K and V
+// (the bulk of the gather). On return (true): *k,*v point at the views, *kq_mask
+// at the compacted mask, *block_table at the I32 [n_view, n_stream] index (n_view = n_gather padded up to 256, capped at n_kv).
+// Returns false (leaving *k,*v,*kq_mask untouched) when the in-kernel path does
+// not apply - env off, LLAMA_KV_PAGED_GATHER set, nothing placed, or a transposed
+// V cache - so the caller keeps the dense gather()/contiguous read.
+bool in_kernel_decode(ggml_context * ctx0,
+                      llm_graph_result * res,
+                      const llama_kv_cache_context * mctx,
+                      ggml_tensor ** k,
+                      ggml_tensor ** v,
+                      ggml_tensor ** kq_mask,
+                      ggml_tensor ** block_table);
+
+ } // namespace paged_attn
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,269 @@
+From 9ac56933abd5de4a1f349c811c2d74aab09f7ab1 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 22:36:09 +0200
+Subject: [PATCH] paged tile in-kernel decode read + dispatch guard (env
+ LLAMA_KV_PAGED) - patch 0010
+
+Increment 2 (robustness, ~0 headline ms): make the paged in-kernel decode read
+safe against silent mis-routing, and plumb the same read into the tile kernel
+for the increment-3 GQA head-group work.
+
+fattn-tile.cuh: graft the patch-0009 phys(j) block-table read (mirror of
+fattn-vec.cuh). Both flash_attn_tile_load_tile overloads, flash_attn_tile_iter_KQ
+(K) and flash_attn_tile_iter (V) take an optional per-sequence block table; a row
+i is read from base + block_table[row_base + i]*stride instead of base + i*stride.
+The table defaults to nullptr (default args + a null bt_seq when src[5] is unset),
+so every existing non-paged caller is byte-identical to stock. The mask / KV_max
+stay logical (token-position order), as in vec.
+
+fattn.cu: DISPATCH GUARD. When the block table (src[5]) is present, route ONLY to
+the vec or tile kernel and never fall through to the best-kernel switch. The
+mma/wmma kernels GGML_UNUSED the table and would silently read the wrong
+(contiguous physical) cells; the guard makes that unreachable. The vec dispatcher
+GGML_ABORTs for an unsupported D/type rather than mis-reading. Default route is vec
+(the inc-1 byte-validated path). LLAMA_KV_PAGED_DISPATCH_LOG=1 prints the routed
+kernel once.
+
+Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B, build-cpu) PASS. GPU
+vec-paged == stock at -s 1 PASS. Dispatch confirmed VEC for the real decode shape:
+Qwen3-0.6B Q ne=[128,1,16,1] and Qwen3-32B NVFP4 Q ne=[128,1,64,N] both route to
+vec, matching the nsys profile (flash_attn_ext_vec).
+
+The tile graft is plumbed for increment-3 GQA head-group reuse but is EXPERIMENTAL
+and NOT yet byte-validated (LLAMA_KV_PAGED_TILE=1). A tile-vs-tile gate shows
+tile-paged diverging from tile-stock at the first cross-tile KV depth: the
+GQA-grouped (ncols2>1) tile path reads a full nbatch_fa-row tile with
+oob_check=false while the compacted paged mask is not padded to cover the tile, so
+past-end rows leak. vec bounds its KV walk by KV_max and is unaffected. Bounding
+the tile path is increment-3 work; the default vec route and all stock paths are
+untouched.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/fattn-tile.cuh | 45 ++++++++++++++++++++-----------
+ ggml/src/ggml-cuda/fattn.cu       | 38 +++++++++++++++++++++++---
+ 2 files changed, 64 insertions(+), 19 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
+index 0ff14e6..bb84d61 100644
+--- a/ggml/src/ggml-cuda/fattn-tile.cuh
+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
+@@ -373,7 +373,8 @@ static constexpr __device__ int ggml_cuda_fattn_tile_get_nbatch_K(const int DKQ,
+ // TODO: deduplicate with mma-f16
+ template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
+ static __device__ __forceinline__ void flash_attn_tile_load_tile(
+-        const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
+        const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
+        const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
+     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
+     constexpr int cpy_ne = cpy_nb / 4;
+ 
+@@ -402,9 +403,11 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
+                     const int j = j0*cpy_ne + (stride_j == warp_size ? threadIdx.x : threadIdx.x % stride_j)*cpy_ne;
+ 
+                     const __align__(16) half2 zero[cpy_ne] = {{0.0f, 0.0f}};
+                    // [paged] remap the row through the block table (nullptr => stock contiguous read).
+                    const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
+                     ggml_cuda_memcpy_1<cpy_nb>(
+                         tile_KV + i*(J/2 + J_padding) + j,
+-                        !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
+                        !oob_check || i < i_sup ? KV_row + j : zero);
+                 }
+             }
+         }
+@@ -423,7 +426,8 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
+ 
+ template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
+ static __device__ __forceinline__ void flash_attn_tile_load_tile(
+-        const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
+        const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
+        const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
+     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
+     constexpr int cpy_ne = cpy_nb / 4;
+ 
+@@ -453,8 +457,10 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
+ 
+                     const half2 zero[cpy_ne/2] = {{0.0f, 0.0f}};
+                     __align__(16) half2 tmp_h2[cpy_ne/2];
+                    // [paged] remap the row through the block table (nullptr => stock contiguous read).
+                    const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
+                     ggml_cuda_memcpy_1<sizeof(tmp_h2)>(
+-                        tmp_h2, !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
+                        tmp_h2, !oob_check || i < i_sup ? KV_row + j : zero);
+ 
+                     __align__(16) float2 tmp_f2[cpy_ne/2];
+ #pragma unroll
+@@ -487,6 +493,7 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
+         const int k_VKQ_0,
+         const int k_VKQ_sup,
+         const int k_KQ_0,
+        const int * const __restrict__ block_table,
+         float * KQ_acc) {
+     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
+     constexpr int cpy_ne = cpy_nb / 4;
+@@ -495,8 +502,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
+     constexpr int cpw   = ncols > nwarps ? ncols/nwarps : 1; // Q columns per warp
+     constexpr int np    = nwarps > ncols ? nwarps/ncols : 1; // number of parallel warps per Q column
+ 
+    // [paged] when block_table is set K_h2 is the un-offset base; the table supplies the row.
+    const half2 * const K_base = block_table ? (K_h2 + k_KQ_0/2) : (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2);
+     flash_attn_tile_load_tile<warp_size, nwarps, nbatch_fa, nbatch_K, cpy_ne, oob_check>
+-        (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2, KV_tmp, stride_K2, k_VKQ_sup);
+        (K_base, KV_tmp, stride_K2, k_VKQ_sup, block_table, k_VKQ_0);
+     __syncthreads();
+ 
+ #ifdef FAST_FP16_AVAILABLE
+@@ -572,7 +581,8 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
+         T_acc * const VKQ,
+         const int k_VKQ_0,
+         const int k_VKQ_max,
+-        const int col_Q_0) {
+        const int col_Q_0,
+        const int * const __restrict__ block_table) {
+     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
+     constexpr int cpy_ne = cpy_nb / 4;
+ 
+@@ -605,12 +615,12 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
+ #pragma unroll
+     for (int k_KQ_0 = 0; k_KQ_0 < DKQ - nbatch_K_last; k_KQ_0 += nbatch_K) {
+         flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>(
+-            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
+            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
+     }
+     if (nbatch_K_last > 0) {
+         constexpr int k_KQ_0 = DKQ - nbatch_K_last;
+         flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K_last, use_logit_softcap, oob_check>(
+-            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
+            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
+     }
+ 
+     // Apply logit softcap + mask, update KQ_max:
+@@ -715,8 +725,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
+     static_assert(nbatch_V % np == 0, "bad nbatch_V");
+ #pragma unroll
+     for (int k0 = 0; k0 < nbatch_fa; k0 += nbatch_V) {
+        // [paged] when block_table is set V_h2 is the un-offset base; the table supplies the row.
+        const half2 * const V_base = block_table ? V_h2 : (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2);
+         flash_attn_tile_load_tile<warp_size, nwarps, nbatch_V, DV, 0, oob_check>
+-            (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2, KV_tmp, stride_V2, k_VKQ_sup - k0);
+            (V_base, KV_tmp, stride_V2, k_VKQ_sup - k0, block_table, k_VKQ_0 + k0);
+         __syncthreads();
+ 
+ #ifdef FAST_FP16_AVAILABLE
+@@ -810,7 +822,6 @@ static __global__ void flash_attn_tile(
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+                             const int32_t nb31, const int32_t nb32, const int64_t nb33,
+         const int  * __restrict__ block_table) {
+-    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
+ #ifdef FLASH_ATTN_AVAILABLE
+     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
+     const char * GGML_CUDA_RESTRICT K        = K_ptr;
+@@ -837,7 +848,7 @@ static __global__ void flash_attn_tile(
+                   nb11, nb12, nb13,
+                   nb21, nb22, nb23,
+                   ne31, ne32, ne33,
+-                  nb31, nb32, nb33);
+                  nb31, nb32, nb33, block_table);
+         NO_DEVICE_CODE;
+         return;
+     }
+@@ -861,6 +872,10 @@ static __global__ void flash_attn_tile(
+     const half2 * K_h2 = (const half2 *) (K + nb13*sequence + nb12*(head0 / gqa_ratio));
+     const half2 * V_h2 = (const half2 *) (V + nb23*sequence + nb22*(head0 / gqa_ratio)); // K and V have same shape
+ 
+    // [paged] per-sequence logical->physical block table in token-position order
+    // (mask/KV_max stay logical); nullptr => the stock contiguous read.
+    const int * const __restrict__ bt_seq = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
+
+     const half * maskh = mask ? (const half *) (mask + nb33*(sequence % ne33)) : nullptr;
+ 
+     const int stride_K2   = nb11 / sizeof(half2);
+@@ -963,14 +978,14 @@ static __global__ void flash_attn_tile(
+             constexpr bool oob_check = false;
+             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
+                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
+-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
+             k_VKQ_0 += gridDim.y*nbatch_fa;
+         }
+         if (k_VKQ_0 < k_VKQ_max) {
+             constexpr bool oob_check = true;
+             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
+                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
+-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
+         }
+     } else {
+         // Branch without out-of-bounds checks.
+@@ -978,7 +993,7 @@ static __global__ void flash_attn_tile(
+             constexpr bool oob_check = false;
+             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
+                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
+-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
+         }
+     }
+ 
+@@ -1144,7 +1159,7 @@ static __global__ void flash_attn_tile(
+               nb11, nb12, nb13,
+               nb21, nb22, nb23,
+               ne31, ne32, ne33,
+-              nb31, nb32, nb33);
+              nb31, nb32, nb33, block_table);
+     NO_DEVICE_CODE;
+ #endif // FLASH_ATTN_AVAILABLE
+ }
+diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
+index e3771ee..afcafa2 100644
+--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
+@@ -575,11 +575,41 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
+ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+     ggml_cuda_set_device(ctx.device);
+ 
+-    // [paged] the block table (src[5]) is only honored by the vec kernel's
+-    // in-kernel read; force it. build_attn only sets it for a vec-supported
+-    // 1-token-per-stream decode shape.
+    // [paged] DISPATCH GUARD. The block table (src[5]) is read in-kernel ONLY by
+    // the vec and tile kernels; the mma/wmma kernels GGML_UNUSED it and would
+    // silently read the wrong (contiguous physical) cells. So when a block table
+    // is present we route here and NEVER fall through to the best-kernel switch
+    // below - no decode shape can silently reach an mma/wmma misread. build_attn
+    // only sets src[5] for the 1-token-per-stream decode shape; the vec
+    // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
+    // and any shape that should not be paged must take the host-side gather path
+    // (LLAMA_KV_PAGED_GATHER=1) instead.
+    //
+    // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
+    // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
+    // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
+    // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
+    // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
+    // with oob_check=false while the compacted paged mask is not padded to cover
+    // it, so it diverges from stock. Not for production paged decode until
+    // increment-3 bounds that path; the default vec route is unaffected.
+     if (dst->src[5] != nullptr) {
+-        ggml_cuda_flash_attn_ext_vec(ctx, dst);
+        static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
+        if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
+            static bool logged = false;
+            if (!logged) {
+                logged = true;
+                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
+                    paged_tile ? "TILE(experimental)" : "VEC",
+                    (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
+                    (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
+            }
+        }
+        if (paged_tile) {
+            ggml_cuda_flash_attn_ext_tile(ctx, dst);
+        } else {
+            ggml_cuda_flash_attn_ext_vec(ctx, dst);
+        }
+         return;
+     }
+ 
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
@@ -0,0 +1,147 @@
+From d5ca5cd756e42214d0003bca815ca91943679b0d Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 00:18:35 +0200
+Subject: [PATCH] paged decode: route GQA-grouped tile kernel by default (F16,
+ gqa>=2) - patch 0011
+
+Increment 3 (the attention lever). In fattn.cu's paged dispatch guard, route the
+in-kernel decode to the tile kernel for the common grouped-query F16 case, and
+keep the inc-1 vec kernel for everything else.
+
+The tile kernel carries native GQA head-group reuse: its ncols2 axis groups the
+q-heads that share one kv-head, so each K/V row is loaded once for the whole
+group instead of once per q-head. vec re-streams each kv-head's K/V once per
+q-head (8x for Qwen3-32B's n_head 64 / n_head_kv 8) and runs at 168 regs ->
+3 blocks/SM = 25% occupancy on GB10; tile is 108-128 regs with native grouping.
+The inc-2 phys(j) block-table read was already plumbed into tile (patch 0010);
+this patch makes it the default for {F16 K and V, gqa_ratio >= 2}.
+
+Routing guard (why conditional): the tile kernel has no K/V type template - it
+loads half2 - so a non-F16 cache (BF16 / quantized) would be converted by
+launch_fattn to a contiguous F16 copy, which breaks the in-kernel block-table
+read (the table indexes the original paged layout, not the copy). So tile is
+correct only for an F16 cache; non-F16 caches and the non-grouped gqa==1 shape
+fall back to the inc-1 vec path, exactly as before this change. The head-group
+reuse also only helps at gqa_ratio >= 2. LLAMA_KV_PAGED_VEC=1 forces vec for A/B.
+Note: paged decode is currently exercised with an F16 cache only; quantized +
+paged is a separate pre-existing limitation, independent of this change
+(verified: stock + q8_0 cache works, but paged + q8_0 aborts both before and
+after this patch, since both route the non-F16 cache to vec).
+
+Measured GB10 (sm_121, 48 SM), Qwen3-32B NVFP4 dense, F16 cache, gqa 8, batch 32,
+1024 ctx, llama-batched-bench npp=1024 ntg=128 npl=32, GGML_CUDA_DISABLE_GRAPHS=1,
+same build, env-toggled:
+  STOCK (mma)            174.8 ms/step  183.1 t/s
+  PAGED-VEC  (inc-1)     186.3 ms/step  171.8 t/s   (+6.6% vs stock)
+  PAGED-TILE (inc-3)     177.9 ms/step  179.8 t/s   (+1.8% vs stock)
+GQA grouping recovers 8.4 ms/step (-4.5%) over the inc-1 vec default and brings
+paged decode to within 1.8% of stock. The win grows with context (npl=8, tile vs
+vec decode step): 1024 -2.3%, 4096 -3.3%, 8192 and 16384 wider, as attention
+takes a larger share of the step.
+
+Why not the split-K tune: the vec decode grid is already block-saturated
+(1 x parallel_blocks 3 x 2048 = 6144 blocks ~ 43 waves over 144 resident on 48
+SM), so raising parallel_blocks / KV_max adds no SM fill. The under-saturation is
+intra-SM (occupancy + the 8x KV re-streaming), which GQA grouping attacks
+directly; more split-K does not.
+
+Correctness (greedy, GGML_CUDA_DISABLE_GRAPHS=1):
+  - CPU plumbing gate (Qwen3-0.6B, build-cpu, paged-on vs off): BYTE-IDENTICAL.
+  - GPU 0.6B gqa=2, 8 seq x 48 tok: tile is token-identical to the inc-1 vec path
+    in 7/8 sequences; the 8th diverges at token 5, within the same kernel-noise
+    band where vec also drifts from stock. Stock uses the mma kernel for this
+    multi-stream GQA shape, so a different kernel = different rounding =
+    autoregressive token drift; vec and tile agree with each other while both
+    differ from stock (both pick 15678 where stock picks 38835), confirming the
+    drift is kernel choice, not a paging error.
+  - GPU 32B gqa=8, 4 seq x 40 tok: tile tracks stock at least as well as vec
+    (seq3: tile == stock == 624 at the token where vec picked 13).
+
+Stock is byte-identical: the dispatch guard only diverts when the block table
+(src[5]) is set; the non-paged best-kernel switch is untouched. The ncols2>1 tile
+path reads the last nbatch_fa tile with oob_check=false and relies on the mask
+-inf padding - the same pattern stock uses for ncols2>1 - and the compacted paged
+mask is gathered to the n_view (GGML_PAD 256) width so it carries that padding.
+
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+Assisted-by: Claude:opus-4.8 [Claude Code]
+---
+ ggml/src/ggml-cuda/fattn.cu | 51 ++++++++++++++++++++++++++-----------
+ 1 file changed, 36 insertions(+), 15 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
+index afcafa2..6b15810 100644
+--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
+@@ -580,32 +580,53 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
+     // silently read the wrong (contiguous physical) cells. So when a block table
+     // is present we route here and NEVER fall through to the best-kernel switch
+     // below - no decode shape can silently reach an mma/wmma misread. build_attn
+-    // only sets src[5] for the 1-token-per-stream decode shape; the vec
+    // only sets src[5] for the 1-token-per-stream decode shape; the vec/tile
+     // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
+     // and any shape that should not be paged must take the host-side gather path
+     // (LLAMA_KV_PAGED_GATHER=1) instead.
+     //
+-    // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
+-    // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
+-    // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
+-    // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
+-    // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
+-    // with oob_check=false while the compacted paged mask is not padded to cover
+-    // it, so it diverges from stock. Not for production paged decode until
+-    // increment-3 bounds that path; the default vec route is unaffected.
+    // Default route = the GQA-grouped TILE kernel (inc-3) WHEN it is both correct
+    // and a win, else the inc-1 vec path. Tile groups the q-heads that share one
+    // kv-head (ncols2), loading each K/V row once for the whole group instead of
+    // once per q-head, and runs at higher occupancy than vec (108-128 regs vs 168).
+    // Two constraints make this conditional: (1) the tile kernel has no K/V type
+    // template - it loads half2 - so a non-F16 cache (BF16/quantized) would be
+    // converted by launch_fattn to a contiguous F16 copy, which breaks the
+    // in-kernel block-table read (the table indexes the original paged layout, not
+    // the copy); vec instead reads the original cache with in-kernel dequant, so it
+    // is the only correct paged path for non-F16 caches. (2) the head-group reuse
+    // only helps when gqa_ratio>=2. So route to tile only for {F16 K and V,
+    // gqa_ratio>=2}; everything else stays on vec, matching stock (which also sends
+    // quantized-cache decode to the vector kernel). Measured on GB10 (Qwen3-32B
+    // nvfp4, F16 cache, gqa 8, batch 32, 1024 ctx): tile 177.9 ms/step vs vec 186.3
+    // vs stock 174.8 - GQA grouping recovers ~4.5% over the inc-1 vec default and
+    // brings paged decode to within ~1.8% of stock. Validated token-coherent with vec:
+    // 0.6B 8-seq 7/8 identical (8th within the kernel-noise band where vec also
+    // drifts from stock), 32B gqa=8 tile tracks stock at least as well as vec, CPU
+    // plumbing gate byte-identical. The ncols2>1 tile path reads the last nbatch_fa
+    // tile with oob_check=false relying on mask -inf padding (the SAME pattern stock
+    // uses for ncols2>1); the compacted paged mask is gathered to the n_view
+    // (GGML_PAD 256) width so it carries that padding. LLAMA_KV_PAGED_VEC=1 forces
+    // the inc-1 vec path for A/B.
+     if (dst->src[5] != nullptr) {
+-        static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
+        const ggml_tensor * Qp = dst->src[0];
+        const ggml_tensor * Kp = dst->src[1];
+        const ggml_tensor * Vp = dst->src[2];
+        const bool kv_f16    = Kp->type == GGML_TYPE_F16 && Vp->type == GGML_TYPE_F16;
+        const int64_t gqa_ratio = Kp->ne[2] > 0 ? Qp->ne[2] / Kp->ne[2] : 1;
+        const bool force_vec = getenv("LLAMA_KV_PAGED_VEC") != nullptr;
+        const bool use_tile  = !force_vec && kv_f16 && gqa_ratio >= 2;
+         if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
+             static bool logged = false;
+             if (!logged) {
+                 logged = true;
+-                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
+-                    paged_tile ? "TILE(experimental)" : "VEC",
+-                    (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
+-                    (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
+                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld] gqa=%ld kv_f16=%d)\n",
+                    use_tile ? "TILE(gqa)" : "VEC",
+                    (long) Qp->ne[0], (long) Qp->ne[1], (long) Qp->ne[2], (long) Qp->ne[3],
+                    (long) gqa_ratio, (int) kv_f16);
+             }
+         }
+-        if (paged_tile) {
+        if (use_tile) {
+             ggml_cuda_flash_attn_ext_tile(ctx, dst);
+         } else {
+             ggml_cuda_flash_attn_ext_vec(ctx, dst);
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0012-paged-mask-pad-invariant-assert.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0012-paged-mask-pad-invariant-assert.patch
@@ -0,0 +1,50 @@
+From 6e3e976e2b11adb05519f31dd5aad0c204678f5c Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 11:12:05 +0200
+Subject: [PATCH] feat(paged): assert mask-pad invariant for the paged tile
+ route (patch 0012)
+
+The now-default paged decode route (GQA-grouped fattn-tile kernel) does not
+leak past-end KV rows only because the compacted mask/block-table length is
+padded to a whole number of flash-attn KV tiles: n_view = GGML_PAD(n_gather,
+256), and the tile (nbatch_fa = 64 for head_dim 128) divides 256, so no tile
+reads past n_view and rows past n_gather are -inf. That invariant was implicit.
+
+Add a defensive GGML_ASSERT(n_view % 64 == 0) right after the pad/clamp so a
+future change to the pad (e.g. < 256) or the tile (> 256) that broke the
+whole-tile property cannot silently reintroduce the leak. Additive only, no
+behaviour change.
+
+Verified: build-cpu compiles, and the paged CPU byte gate (LLAMA_KV_PAGED off
+vs on, Qwen3-0.6B-Q8_0, greedy, -ngl 0) stays byte-identical while the assert
+stays silent (n_view remains a whole number of tiles across all decode steps).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/paged-attn.cpp | 9 +++++++++
+ 1 file changed, 9 insertions(+)
+
+diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
+index 8eebeaa..fed8ca9 100644
+--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
+@@ -201,6 +201,15 @@ bool in_kernel_decode(ggml_context * ctx0,
+         n_view = K->ne[2];
+     }
+ 
+    // The flash-attn KV tile is 64 rows wide (nbatch_fa for head_dim 128). n_view must be
+    // a whole number of such tiles so the in-kernel decode never reads past the gathered
+    // rows: the trailing pad cells [n_gather, n_view) are all -inf, so any tile straddling
+    // the boundary still contributes zero. This holds today only because the pad (256) is a
+    // multiple of the tile; a future pad < 256 (or nbatch_fa > 256) that broke it would
+    // silently reintroduce a past-end KV leak, so assert it rather than trust it.
+    // pad must be a multiple of the flash-attn KV tile so the last tile is fully inside the -inf pad
+    GGML_ASSERT(n_view % 64 == 0);
+
+     ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
+     ggml_set_input(idx);
+     res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
@@ -0,0 +1,136 @@
+From 6d3743105c1bbfbf9cd16c0c0ba39bfaac74216e Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 11:52:45 +0200
+Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
+ 0013)
+
+llama-server already co-batches decode with chunked prefill: update_slots()
+appends every generating slot's sampled token first, then fills the rest of the
+n_batch budget with prompt tokens, deferring the overflow to the next step. But
+the prefill chunk size is hard-wired to n_batch (default 2048): one slot's
+~2048-token prefill chunk lands in a single compute-heavy step, and every decode
+co-batched into that step sees a multi-second inter-token-latency (ITL) spike.
+Lowering n_batch shrinks the chunk but also caps decode-concurrency width and
+prefill throughput, because they are coupled.
+
+Add LLAMA_PREFILL_BUDGET: a per-step prefill-token budget decoupled from n_batch
+(the analogue of vLLM's --max-num-batched-tokens / long_prefill_token_threshold).
+The prompt-fill loop and the outer slot loop now also stop once this many prompt
+tokens have been added in the current update_slots() step, so a long prefill is
+split across more steps that each still advance in-flight decode. Default (env
+unset or <= 0) = disabled, so stock behaviour is byte-identical. Orthogonal to
+LLAMA_KV_PAGED: this is a pure scheduler knob and works with paged off.
+
+Measured on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8 steady decode
+streams with one 6000-token prefill injected mid-stream; same binary, only
+LLAMA_PREFILL_BUDGET differs:
+
+  metric                            stock(off)  budget=256     budget=512
+  worst decode freeze (ms)                3380    482 (7.0x)     778 (4.3x)
+  median decode ITL in window (ms)        2264    411 (5.5x)     689
+  decode_stall (ms)                       3285    387 (8.5x)     684 (4.8x)
+  decode steps during prefill               38    201 (5.3x)     108
+  injected-req TTFT (ms)                  8493  10172 (+20%)    8432 (~0%)
+  steady-state baseline ITL (ms)            94     95             94
+
+This is a LATENCY/fairness lever, not an aggregate-throughput lever: it flattens
+the decode ITL spike a long prefill inflicts on co-batched decoders (8.5x smaller
+decode stall, 7.0x smaller worst freeze, and 5.3x more decode steps during the
+prefill at budget=256), in exchange for a +20% TTFT rise on the long request (the
+classic chunked-prefill trade-off; budget=512 buys a 4.8x smaller stall at ~no
+TTFT cost). Steady aggregate decode is unchanged: it is bandwidth/weight-capped
+on GB10 (the NVFP4 weight-read floor), which the scheduler cannot lift.
+
+Correctness (same model, greedy temp 0, fa on):
+- budget unset or >= n_batch: byte-identical to stock (the added break never
+  fires before the existing n_batch break; the off-path is a no-op by
+  construction).
+- short prompt (<= budget): byte-identical to stock.
+- the knob is exactly equivalent to stock's native -b chunking: budget=512 ==
+  stock -b512 and budget=256 == stock -b256, both BYTE-IDENTICAL, while keeping
+  n_batch=2048 for decode width.
+- on a prompt larger than the budget the chunked greedy output diverges from the
+  single n_batch chunk only by intrinsic flash-attn chunk-size FP grouping: PURE
+  stock -b256 diverges from stock -b2048 the same way with the patch inactive,
+  and the output stays coherent and answers correctly.
+
+Productisation (LocalAI): surface as a model options knob (max_prefill_tokens /
+mpt) parsed in grpc-server.cpp, default 0 = disabled, per CHUNKED_PREFILL_PLAN
+Phase B; the vendored update_slots() hunk here is that plan's scheduler patch and
+stays disjoint from the paged allocation hunks.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ tools/server/server-context.cpp | 34 ++++++++++++++++++++++++++++++++-
+ 1 file changed, 33 insertions(+), 1 deletion(-)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index b5f9d37..afcdebe 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -3043,6 +3043,29 @@ private:
+         int32_t n_batch  = llama_n_batch(ctx_tgt);
+         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
+ 
+        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
+        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
+        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
+        // sampled decode tokens of every generating slot are appended FIRST, then prompt
+        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
+        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
+        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
+        // tokens added per step independently of n_batch, splitting a long prefill across
+        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
+        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
+        // (this is a pure scheduler knob; works with paged off).
+        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+        {
+            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
+            if (env_pb) {
+                const int v = atoi(env_pb);
+                if (v > 0) {
+                    n_prefill_budget = std::min(n_batch, std::max(1, v));
+                }
+            }
+        }
+        int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
+
+         auto & alora_scale       = batch.alora_scale;
+         auto & alora_disabled_id = batch.alora_disabled_id;
+ 
+@@ -3487,7 +3510,10 @@ private:
+                     const auto last_user_pos = spans.last_user_message_pos();
+ 
+                     // add prompt tokens for processing in the current batch
+-                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch) {
+                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
+                    // prompt is split across more steps and leaves batch room for co-batched decode
+                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
+                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
+                         // get next token to process
+                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
+                         if (cur_tok == LLAMA_TOKEN_NULL) {
+@@ -3512,6 +3538,7 @@ private:
+                         slot.prompt.tokens.push_back(cur_tok);
+ 
+                         slot.n_prompt_tokens_processed++;
+                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
+ 
+                         // stop the prompt batch exactly before a user message
+                         if (spans.is_user_start(slot.prompt.n_tokens())) {
+@@ -3597,6 +3624,11 @@ private:
+                 if (!slot_batched) {
+                     slot_batched = &slot;
+                 }
+                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
+                // leaving the remaining batch capacity for co-batched decode of other slots
+                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+                    add_ok = false;
+                }
+             });
+         }
+     }
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
@@ -0,0 +1,140 @@
+From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 15:47:06 +0200
+Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)
+
+On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
+sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
+mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
+originally reported npl128 throughput cliff does NOT reproduce on this build.
+llama-batched-bench decode (S_TG t/s) is monotonic across batch:
+
+  npl        1     8    32    64   128   256
+  S_TG     85   282   629   935  1295  1779   (stock, mxfp4 MoE, -fa on)
+
+There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
+at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.
+
+What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
+token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
+column upper bound = token count, up to 128) in one column-tile. At MoE decode
+the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
+ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
+col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
+time and burns throughput on the padding columns while the larger y-tile lowers
+occupancy. Stock picks the LARGEST tile (128), the inverse of vLLM's small
+per-expert BLOCK_SIZE_M, where the SMALLEST tile that still covers the density
+would raise fill + occupancy at no extra weight read (at tokens/expert <= mmq_x
+there is exactly one non-empty col-tile per expert; the empty tiles are skipped
+by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+
+Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
+(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
+selection, and therefore every kernel launched, is byte-identical to stock. The
+cap only ever lowers the loop's upper bound and still selects from the same
+granularity- and shared-memory-validated mmq_x set stock already uses for
+smaller batches, so no new kernel configuration is exercised.
+
+Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
+only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):
+
+  npl     stock S_TG   cap64 S_TG    d%     stock S_PP   cap64 S_PP
+   64        936          938      +0.1       2924         2883
+  128       1295         1357      +4.8       3075         3038
+  256       1784         1825      +2.3       3085         3046
+
+  (reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)
+
+cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
+npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
+cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
+tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
+re-reads), so 64 is the recommended value and the only one that helps net.
+
+Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
+throughput unlock (llama-server continuous batching already scales). It is a
+modest high-effective-batch DECODE micro-optimization that matches vLLM's
+smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
+durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
+ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
+patches/paged/MOE_GROUPED_GEMM_SCOPE.md.
+
+Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
+stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
+prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
+npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
+ 1 file changed, 36 insertions(+), 1 deletion(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index edf546d..cff608e 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -6,6 +6,7 @@
+ 
+ #include <climits>
+ #include <cstdint>
+#include <cstdlib>
+ 
+ using namespace ggml_cuda_mma;
+ 
+@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     }
+ }
+ 
+// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+static inline int ggml_cuda_moe_mmq_x_cap() {
+    static const int cap = []() -> int {
+        const char * s = getenv("LLAMA_MOE_MMQ_X");
+        return s ? atoi(s) : 0;
+    }();
+    return cap;
+}
+
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int    id     = ggml_cuda_get_device();
+@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+     const int mmq_y = get_mmq_y_host(cc);
+ 
+    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+    // per-expert density raises tile fill + occupancy with no extra weight reads (at
+    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+    // off the ids path the cap never applies.
+    int mmq_x_lim = mmq_x_max;
+    if (args.expert_bounds != nullptr) {
+        const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+        if (moe_cap > 0) {
+            const int cap = moe_cap < 8 ? 8 : moe_cap;
+            mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+        }
+    }
+
+     int mmq_x_best  = 0;
+     int ntiles_x_best = INT_MAX;
+ 
+-    for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
+    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
+         const int granularity = mmq_get_granularity_host(mmq_x, cc);
+ 
+         if (mmq_x % granularity != 0 || mmq_get_nbytes_shared<type>(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
@@ -0,0 +1,238 @@
+From 5349f8231b1e11214f5e8a668129397fb6e2f9ac Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 21:03:00 +0200
+Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
+ (patch 0015)
+
+The durable follow-up to patch 0014's blunt LLAMA_MOE_MMQ_X global cap (which the
+0014 doc itself scoped): replace the manual env cap with a host-side, default-on
+auto-select inside mul_mat_q_case that picks a small token-tile (mmq_x) for the
+MUL_MAT_ID grouped FP4-MMA GEMM only when the per-expert token density is low
+(decode), and keeps the large 128-wide tile when density is high (prefill). No new
+kernel: the selection only lowers the loop's upper bound to an already-compiled,
+granularity- and shared-memory-validated mmq_x.
+
+Density is estimated host-side from the args the ids path already passes:
+  ne_get_rows = ncols_dst   = ne12 * n_expert_used   (token-expert assignments)
+  n_experts   = nchannels_x = ne02
+  density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
+Cap to the small tile (default 64) only when density <= density_max. Unlike 0014's
+global cap, the high-density prefill ubatch stays on the big tile, so S_PP does not
+regress by construction.
+
+density_max default = 8 (not tile/4 = 16). The cap must fire for decode but not for
+a prefill ubatch, and each has per-expert density n_tokens*n_used/n_experts. At the
+standard n_ubatch=512, n_used=8: prefill density = 4096/n_experts (32 at 128 experts,
+16 at 256), decode at npl<=128 is <= 1024/n_experts (8 at 128, 4 at 256). Default 8
+is >= every such decode density and < every prefill density for n_experts in
+[128,511], so it caps decode and leaves prefill on the big tile. tile/4 (=16) equalled
+the 256-expert prefill density and cost it ~2% S_PP, the regression 8 avoids.
+
+Measured on GB10 (sm_121), Qwen3.6-35B-A3B NVFP4 (256 experts, top-8, GDN linear
+attention), llama-batched-bench -fa on -npp 128 -ntg 128, default-on vs stock
+(LLAMA_MOE_AUTO_TILE=0), median of 5 reps:
+
+  npl   S_TG stock  S_TG 0015   dTG%    S_PP stock  S_PP 0015   dPP%
+    8      183.59     183.18  -0.22%       1489.2     1500.1  +0.73%
+   32      264.02     263.44  -0.22%       2034.5     2033.5  -0.05%
+   64      311.76     310.41  -0.43%       2028.3     2027.6  -0.03%
+  128      336.10     337.32  +0.36%       2025.0     2027.7  +0.13%
+
+Honest read: on THIS model the decode effect is within run-to-run noise (neutral)
+and prefill is neutral. Qwen3.6-35B-A3B decode is bound by the GDN/SSM recurrence and
+256 tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
+lever (worth +4.8% @npl128 on Qwen3-Coder-30B, 128 larger experts, patch 0014
+cap64) does not move it. An npl128 tile sweep on this model confirms 64 is the only
+useful width (TILE8 -6.3%, TILE16 -3.2%, TILE32 -0.2%, TILE64 +0.7%, TILE96 -0.8%):
+smaller tiles lose to grid/scheduling overhead and the FP4-MMA minimum width.
+
+Value banked default-on: (1) removes 0014's ~1.3% prefill cost by construction
+(density-gated, not global); (2) auto-selects the small tile for col-tile-bound MoE
+decode, reproducing 0014 cap64's tile=64 at npl128 by construction, so it preserves
+the +4.8% on Qwen3-Coder-30B without the prefill cost; (3) prefill-safe and decode-
+neutral on the SSM model, harmless where it does not help. Conservative by design:
+at npl256 the qwen3coder decode density (16) equals the 256-expert prefill density
+(16), indistinguishable to a pure-density gate, so density_max=8 forgoes 0014's
+2.3% @npl256 to keep 256-expert prefill safe; an ne12-aware refinement is future
+work.
+
+LLAMA_MOE_MMQ_X (patch 0014) is KEPT as a manual override that, when > 0, forces the
+old blunt global cap and bypasses the auto-select (explicit A/B knob). The auto-
+select is the default; LLAMA_MOE_AUTO_TILE=0 restores exact stock mmq_x selection.
+LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX tune the small tile / threshold.
+
+Correctness: extends tests/test-backend-ops test_mul_mat_id with a ragged small-M
+NVFP4/MXFP4 MoE decode-density gate (128 experts, top-8, m=768, k=2048, n in
+{16,33,64,128,130,200,256,512} spanning the cap boundary and ragged token counts).
+All 16 shapes pass CUDA-vs-CPU oracle on GB10 both default-on and with
+LLAMA_MOE_AUTO_TILE=0; full MUL_MAT_ID suite 2/2 backends OK. Off the ids path
+nothing changes (non-MoE mul_mat byte-identical to stock).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cuh | 100 ++++++++++++++++++++++++++++++-------
+ tests/test-backend-ops.cpp |  16 ++++++
+ 2 files changed, 99 insertions(+), 17 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index cff608e..9718b12 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -4053,10 +4053,11 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     }
+ }
+ 
+-// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+-// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+-// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+-// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+// [paged patch 0014] MoE token-tile (mmq_x) MANUAL cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (fall through to the patch-0015 auto-select).
+// When > 0 it forces a blunt GLOBAL cap on the per-expert column-tile width for the MUL_MAT_ID
+// grouped-GEMM path (decode AND prefill), overriding the density-aware auto-select below. Kept
+// as an explicit override / A-B knob; the default path is now the auto-select.
+ static inline int ggml_cuda_moe_mmq_x_cap() {
+     static const int cap = []() -> int {
+         const char * s = getenv("LLAMA_MOE_MMQ_X");
+@@ -4065,6 +4066,43 @@ static inline int ggml_cuda_moe_mmq_x_cap() {
+     return cap;
+ }
+ 
+// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select knobs (DEFAULT-ON).
+// LLAMA_MOE_AUTO_TILE=0 disables the auto-select => exact stock mmq_x selection.
+static inline bool ggml_cuda_moe_auto_tile_enabled() {
+    static const bool en = []() -> bool {
+        const char * s = getenv("LLAMA_MOE_AUTO_TILE");
+        return !(s && atoi(s) == 0);
+    }();
+    return en;
+}
+// The small high-occupancy token-tile chosen for low-density (decode) MoE matmuls. Default 64:
+// the measured GB10 sweet spot (full per-expert fill with >=4x routing-imbalance headroom).
+static inline int ggml_cuda_moe_decode_tile() {
+    static const int t = []() -> int {
+        const char * s = getenv("LLAMA_MOE_DECODE_TILE");
+        const int v = s ? atoi(s) : 0;
+        return v >= 8 ? v : 64;
+    }();
+    return t;
+}
+// Per-expert token-density ceiling under which the small tile is selected. Default 8: the cap must
+// fire for decode but NOT for a prefill ubatch, and the per-expert density of each is
+// n_tokens*n_used/n_experts. For the standard n_ubatch=512, n_used=8 the prefill density is
+// 4096/n_experts (= 32 at 128 experts, 16 at 256 experts); decode at npl<=128 is <=1024/n_experts
+// (= 8 at 128 experts, 4 at 256). Default 8 sits strictly between the two for every n_experts in
+// [128,511], so it caps decode and leaves the prefill ubatch on the big 128 tile - whereas the old
+// tile/4 (=16) equalled the 256-expert prefill density and cratered its S_PP by ~2% (measured on
+// Qwen3.6-35B-A3B NVFP4). 8 also keeps >=8x fill headroom at tile 64 so an imbalanced expert
+// segment never splits into an extra col-tile.
+static inline int ggml_cuda_moe_density_max() {
+    static const int d = []() -> int {
+        const char * s = getenv("LLAMA_MOE_DENSITY_MAX");
+        const int v = s ? atoi(s) : 0;
+        return v > 0 ? v : 8;
+    }();
+    return d;
+}
+
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int    id     = ggml_cuda_get_device();
+@@ -4076,25 +4114,53 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+     const int mmq_y = get_mmq_y_host(cc);
+ 
+-    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+-    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+-    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+-    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+-    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+-    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+-    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+-    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+-    // per-expert density raises tile fill + occupancy with no extra weight reads (at
+-    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+-    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+-    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+-    // off the ids path the cap never applies.
+    // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
+    // sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, up to 128)
+    // in a single column-tile, i.e. it MAXIMIZES the tile (128 on Blackwell) for the aggregate
+    // batch. But the tile is then applied PER EXPERT, and at MoE decode the per-expert token
+    // density is tiny (top-k of many experts), so each expert's single 128-wide col-tile is mostly
+    // empty: the MMA accumulator tile is mmq_x-wide at compile time and burns throughput on the
+    // padding columns while the larger y-tile lowers occupancy. vLLM's fused-MoE does the opposite
+    // (a small per-expert BLOCK_SIZE_M). We reproduce that here, host-side only, by picking a
+    // SMALLER mmq_x when - and only when - the per-expert density is low:
+    //
+    //   ne_get_rows  = args.ncols_dst    = ne12 * n_expert_used  (total token-expert assignments)
+    //   n_experts    = args.nchannels_x  = ne02
+    //   n_active_est = min(n_experts, ne_get_rows)               (upper bound on active experts)
+    //   density      = ceil(ne_get_rows / n_active_est)          (avg tokens per active expert)
+    //
+    // Cap to the small tile (default 64) only when density <= density_max (default 8). 8 sits below
+    // every prefill-ubatch density and above every decode density for n_experts in [128,511] at the
+    // standard n_ubatch=512 (prefill 4096/n_experts, decode <=1024/n_experts), with >=8x fill headroom
+    // so a capped expert segment never splits a col-tile. Decode (per-expert density 4 at 256 experts,
+    // 8 at 128 experts @npl128) gets the fuller high-occupancy tile; the prefill ubatch (density 16 at
+    // 256 / 32 at 128 experts) stays ABOVE the threshold and keeps the big
+    // 128 compute tile - so unlike the blunt global cap (LLAMA_MOE_MMQ_X / patch 0014) this is
+    // prefill-safe by construction. The selection only ever picks an already-compiled, granularity-
+    // and shared-memory-validated mmq_x that the loop below would consider for a smaller batch; no
+    // new kernel. Off the ids path (expert_bounds == nullptr) nothing changes => non-MoE mul_mat
+    // and the gated f16/bf16 host-loop fallback stay byte-identical to stock.
+    //   - LLAMA_MOE_MMQ_X=<n>   : manual blunt global cap, overrides the auto-select (patch 0014).
+    //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
+    //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
+     int mmq_x_lim = mmq_x_max;
+     if (args.expert_bounds != nullptr) {
+         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+         if (moe_cap > 0) {
+             const int cap = moe_cap < 8 ? 8 : moe_cap;
+             mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+        } else if (ggml_cuda_moe_auto_tile_enabled()) {
+            const int64_t ne_get_rows = args.ncols_dst;
+            const int64_t n_experts   = args.nchannels_x;
+            if (ne_get_rows > 0 && n_experts > 0) {
+                const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
+                const int64_t density  = (ne_get_rows + n_active - 1) / n_active;
+                const int     tile     = ggml_cuda_moe_decode_tile();
+                if (density <= (int64_t) ggml_cuda_moe_density_max() && tile < mmq_x_max) {
+                    mmq_x_lim = tile;
+                }
+            }
+         }
+     }
+ 
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index c83e91f..62a0989 100644
+--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
+@@ -8603,6 +8603,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+     test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
+     test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
+ 
+    // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
+    // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
+    // m=768, k=2048) so this exercises the exact grouped FP4-MMA mmq kernel the model runs.
+    // Per-expert token density = n*n_used/n_mats = n/16; cover the decode band (density 1/4/8/16
+    // at n 16/64/128/256), ragged token counts (n 33/130/200: experts with 0/1/2 tokens, n not a
+    // multiple of the tile) where the tiny-M col-tiles change geometry and any masking can leak,
+    // and a prefill-density shape (n 512 => density 32) the auto-select must leave on the large
+    // 128 tile. n>=128 is exactly where stock picks mmq_x=128 and the auto-select picks 64, so the
+    // op-test (CPU oracle vs CUDA, deterministic) is the bit-exact regression gate for P1: it must
+    // pass with the auto-select on (default) and with LLAMA_MOE_AUTO_TILE=0 (stock selection).
+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+        for (int n : {16, 33, 64, 128, 130, 200, 256, 512}) {
+            test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 128, 8, false, 768, n, 2048));
+        }
+    }
+
+     for (ggml_type type_a : all_types) {
+         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
+     }
+-- 
+2.43.0
+
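Reviewer sketch for the patch 0015 auto-select branch above: a minimal, stand-alone mirror of the density computation, assuming the tile and threshold are passed in as plain arguments rather than read through `ggml_cuda_moe_decode_tile()` / `ggml_cuda_moe_density_max()`. The function name `pick_mmq_x_lim` and the example values (tile 64, threshold 8) are illustrative assumptions; the real defaults are not shown in this excerpt.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-alone mirror of the auto-select branch in mul_mat_q_case.
// decode_tile and density_max are plain parameters here (assumption); the patch
// reads them from LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX helpers.
int pick_mmq_x_lim(int64_t ne_get_rows, int64_t n_experts,
                   int mmq_x_max, int decode_tile, int density_max) {
    int mmq_x_lim = mmq_x_max;  // stock selection unless the density test passes
    if (ne_get_rows > 0 && n_experts > 0) {
        // at most one active expert per gathered row
        const int64_t n_active = std::min(ne_get_rows, n_experts);
        // ceil(ne_get_rows / n_active): tokens per active expert
        const int64_t density  = (ne_get_rows + n_active - 1) / n_active;
        if (density <= (int64_t) density_max && decode_tile < mmq_x_max) {
            mmq_x_lim = decode_tile;  // low density: use the smaller token tile
        }
    }
    return mmq_x_lim;
}
```

With these assumed values, 1024 gathered rows over 128 experts give density 8 and select the 64 tile, while 4096 rows give density 32 and keep the 128 tile, which matches the decode vs prefill split described in the patch comment.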
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
@@ -0,0 +1,191 @@
+From 02fa0473a9324b7e12f9b203d221cc4ac80cfd33 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Wed, 24 Jun 2026 10:11:48 +0200
+Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
+ 0016, continuous-batch P1)
+
+Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
+decode-first token budget: the P1 of the token-granular continuous-batch
+scheduler. POLICY change only inside update_slots(): no new slot states, no
+batch-formation rewrite, zero libllama changes. llama-server already emits one
+unified mixed prefill+decode batch per step (Phase 1 appends every ready decode
+token unconditionally; Phase 2 fills prefill into the same batch). 0016 only
+changes the COUNT of prefill tokens admitted per step.
+
+The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
+== D (the live decode load) is known there. Instead of 0013's constant
+LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets one
+long prompt monopolise the step), compute a dynamic budget:
+
+  T  = clamp(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_ubatch, n_batch)
+  prefill_budget_step  = max(n_ubatch, T - D)   (leftover after decode,
+       auto-shrinks as decode load rises so the step never inflates past T)
+  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch,
+       pinned to n_batch when T == n_batch (LLAMA_PREFILL_CAP overrides)
+
+Phase 2's inner prompt-fill loop and outer admission break are bounded by
+prefill_budget_step (across slots) and a new per-slot slot_prompt_added
+counter; the n_batch hard ceiling stays as the compute bound. Decode is
+structurally claimed first and never capped (Phase 1), so the decode-first
+guarantee is free.
+
+DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
+to stock. The degenerate T == n_batch case is byte-identical to stock/0013 (the
+determinism oracle). The legacy LLAMA_PREFILL_BUDGET path is preserved exactly
+(honoured only when LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly
+subsumed. Orthogonal to LLAMA_KV_PAGED: pure scheduler policy, identical
+decisions paged on or off.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ tools/server/server-context.cpp | 107 +++++++++++++++++++++++++-------
+ 1 file changed, 85 insertions(+), 22 deletions(-)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index afcdebe..b8b8f00 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -3043,24 +3043,78 @@ private:
+         int32_t n_batch  = llama_n_batch(ctx_tgt);
+         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
+ 
+-        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
+-        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
+-        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
+-        // sampled decode tokens of every generating slot are appended FIRST, then prompt
+-        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
+-        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
+-        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
+-        // tokens added per step independently of n_batch, splitting a long prefill across
+-        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
+-        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
+-        // (this is a pure scheduler knob; works with paged off).
+-        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+        // PAGED serving lever (patch 0016, supersedes 0013): dynamic decode-first
+        // per-step prefill-token budget (continuous-batch scheduler P1). llama-server
+        // already builds ONE mixed batch per update_slots() step: Phase 1 (just above)
+        // appended every generating slot's sampled token UNCONDITIONALLY, so at this point
+        // batch.n_tokens == D is the live decode load; Phase 2 (below) fills the remaining
+        // batch capacity with prompt tokens. Patch 0013 capped Phase 2 with a STATIC
+        // constant (LLAMA_PREFILL_BUDGET) that ignores D, needs per-workload tuning, and
+        // lets one long prompt monopolise the step.
+        //
+        // This computes a DYNAMIC budget instead, the vLLM v1 token-budget analogue:
+        // a single total per-step token budget T, decode claims its D tokens first
+        // (already in the batch), and prefill gets the leftover T - D distributed across
+        // waiting prompts with a per-slot chunk cap. As decode load D rises the prefill
+        // leftover auto-shrinks, so the step never inflates past T at any concurrency:
+        // the budget self-tunes across the npl range and across dense vs MoE without a
+        // hand-picked constant (the 161/333 tok/s GB10 decode ceiling is held tuning-free
+        // instead of via 0013's hand-tuned 256). Decode is structurally claimed first and
+        // never capped (Phase 1), so the decode-first guarantee is free here.
+        //
+        //   LLAMA_MAX_BATCH_TOKENS (T)  total per-step token budget (decode + prefill),
+        //                               default n_batch, clamped to [n_ubatch, n_batch] so
+        //                               the compute loop stays a single llama_decode and
+        //                               prefill keeps an n_ubatch floor of progress.
+        //   LLAMA_PREFILL_CAP           per-slot max prompt tokens per step (the
+        //                               long_prefill_token_threshold analogue), default
+        //                               min(T, ceil(0.04*n_ctx)) floored at n_ubatch, so
+        //                               one long prompt cannot eat the whole leftover.
+        //   LLAMA_PREFILL_BUDGET        legacy static cap (patch 0013); honoured ONLY when
+        //                               LLAMA_MAX_BATCH_TOKENS is unset, for back-compat.
+        //
+        // DEFAULT-OFF BYTE-IDENTICAL: with all three knobs unset, and in the degenerate
+        // T == n_batch case, behaviour is byte-identical to stock. At T == n_batch the
+        // dynamic leftover max(n_ubatch, n_batch - D) and the n_batch per-slot cap both
+        // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
+        // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
+        // scheduler policy, identical decisions with paged on or off.
+        const int32_t n_decode_in_batch = batch.size();    // D: Phase 1 appended D decode tokens above
+        int32_t prefill_budget_step  = 0; // 0 = disabled (stock n_batch-only chunking)
+        int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
+         {
+-            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
+-            if (env_pb) {
+            int32_t mbt = 0;
+            if (const char * env_mbt = getenv("LLAMA_MAX_BATCH_TOKENS")) {
+                mbt = atoi(env_mbt);
+            }
+            if (mbt > 0) {
+                // dynamic decode-first budget (P1): T clamped to [n_ubatch, n_batch]
+                int32_t T = std::min(n_batch, mbt);
+                T = std::max(T, n_ubatch);
+                // leftover after decode, floored at n_ubatch so prefill never fully starves
+                prefill_budget_step = std::max(n_ubatch, T - n_decode_in_batch);
+                // per-slot prompt-chunk cap (long_prefill_token_threshold analogue)
+                int32_t cap = 0;
+                if (const char * env_cap = getenv("LLAMA_PREFILL_CAP")) {
+                    cap = atoi(env_cap);
+                }
+                if (cap <= 0) {
+                    const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
+                    cap = std::min(T, std::max(n_ubatch, pct4));
+                }
+                cap = std::min(n_batch, std::max(n_ubatch, cap));
+                // at T == n_batch the leftover and cap both reach the n_batch ceiling
+                // together; pin the cap to n_batch so this case stays byte-identical
+                if (T >= n_batch) {
+                    cap = n_batch;
+                }
+                prefill_cap_per_slot = cap;
+            } else if (const char * env_pb = getenv("LLAMA_PREFILL_BUDGET")) {
+                // legacy static budget (patch 0013), kept for back-compat when the
+                // dynamic knob is unset: a constant per-step prefill cap, no per-slot cap
+                 const int v = atoi(env_pb);
+                 if (v > 0) {
+-                    n_prefill_budget = std::min(n_batch, std::max(1, v));
+                    prefill_budget_step = std::min(n_batch, std::max(1, v));
+                 }
+             }
+         }
+@@ -3509,11 +3563,18 @@ private:
+                     const auto & spans = slot.task->params.message_spans;
+                     const auto last_user_pos = spans.last_user_message_pos();
+ 
+                    // (patch 0016) per-slot prompt tokens added this step, for the per-slot
+                    // chunk cap (resets each slot); n_batch stays the hard compute ceiling
+                    int32_t slot_prompt_added = 0;
+
+                     // add prompt tokens for processing in the current batch
+-                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
+-                    // prompt is split across more steps and leaves batch room for co-batched decode
+                    // (patch 0016) also stop once (a) the dynamic per-step prefill budget
+                    // (the T - D leftover) is spent across all slots, or (b) this slot's
+                    // per-slot chunk cap is hit, so a long prompt is split across more steps
+                    // and leaves batch room for co-batched decode of the other slots
+                     while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
+-                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
+                           (prefill_budget_step  == 0 || n_prompt_budgeted < prefill_budget_step) &&
+                           (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
+                         // get next token to process
+                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
+                         if (cur_tok == LLAMA_TOKEN_NULL) {
+@@ -3538,7 +3599,8 @@ private:
+                         slot.prompt.tokens.push_back(cur_tok);
+ 
+                         slot.n_prompt_tokens_processed++;
+-                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
+                        n_prompt_budgeted++;  // (patch 0016) toward the dynamic per-step prefill budget
+                        slot_prompt_added++;  // (patch 0016) toward this slot's per-step chunk cap
+ 
+                         // stop the prompt batch exactly before a user message
+                         if (spans.is_user_start(slot.prompt.n_tokens())) {
+@@ -3624,9 +3686,10 @@ private:
+                 if (!slot_batched) {
+                     slot_batched = &slot;
+                 }
+-                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
+-                // leaving the remaining batch capacity for co-batched decode of other slots
+-                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+                // (patch 0016) stop admitting prompts once the dynamic per-step prefill
+                // budget (the T - D leftover) is spent, leaving the remaining batch
+                // capacity for co-batched decode of the other slots
+                if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
+                     add_ok = false;
+                 }
+             });
+-- 
+2.43.0
+
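Reviewer sketch for the patch 0016 budget block above: a minimal, side-effect-free version of the same arithmetic, assuming the three environment values have already been parsed to integers (`<= 0` meaning unset). The struct `PrefillLimits` and the function `compute_prefill_limits` are hypothetical names introduced only for illustration; the patch computes these values inline in `update_slots()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical container for the two per-step limits; 0 means "disabled".
struct PrefillLimits {
    int32_t budget_step;   // prompt tokens admitted per step, across slots
    int32_t cap_per_slot;  // prompt tokens admitted per step, per slot
};

// mbt, cap_env and legacy_pb stand in for atoi() of LLAMA_MAX_BATCH_TOKENS,
// LLAMA_PREFILL_CAP and LLAMA_PREFILL_BUDGET (assumption: <= 0 means unset).
PrefillLimits compute_prefill_limits(int32_t n_batch, int32_t n_ubatch, int32_t n_ctx,
                                     int32_t n_decode, int32_t mbt,
                                     int32_t cap_env, int32_t legacy_pb) {
    PrefillLimits out{0, 0};
    if (mbt > 0) {
        // T clamped to [n_ubatch, n_batch]
        const int32_t T = std::max(std::min(n_batch, mbt), n_ubatch);
        // leftover after decode, floored at n_ubatch
        out.budget_step = std::max(n_ubatch, T - n_decode);
        int32_t cap = cap_env;
        if (cap <= 0) {
            const int32_t pct4 = (n_ctx + 24) / 25;  // ceil(0.04 * n_ctx)
            cap = std::min(T, std::max(n_ubatch, pct4));
        }
        cap = std::min(n_batch, std::max(n_ubatch, cap));
        if (T >= n_batch) {
            cap = n_batch;  // degenerate case: pinned so no new bound fires
        }
        out.cap_per_slot = cap;
    } else if (legacy_pb > 0) {
        // legacy static budget (patch 0013): no per-slot cap
        out.budget_step = std::min(n_batch, std::max(1, legacy_pb));
    }
    return out;
}
```

For example, with n_batch 2048, n_ubatch 512, n_ctx 16384, 128 decode tokens and a budget of 1024, this yields a step budget of 896 and a per-slot cap of 656. Raising the decode load to 900 drops the step budget to the n_ubatch floor of 512.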
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0017-fp4-gemm-decode-tile-tune.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0017-fp4-gemm-decode-tile-tune.patch
@@ -0,0 +1,245 @@
+From 089f78d2a2c04465a566d499dbe0a67c008435a8 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Wed, 24 Jun 2026 19:56:05 +0200
+Subject: [PATCH] feat(paged): FP4 decode GEMM track-B P0 gate + default-off
+ occupancy instrumentation (patch 0017)
+
+Track B targets the dense NVFP4 weight GEMM (~59% of the GB10 decode step). This lands the P0
+bit-exact parity gate and the P1 occupancy levers (default-off / byte-identical) and records the
+honest P1 result: the cheap host/occupancy tuning does NOT lift decode_agg on GB10 (sm_121) - the
+kill-gate tripped - so nothing is enabled by default.
+
+P0 gate (tests/test-backend-ops.cpp): NVFP4/MXFP4 dense decode-shape MUL_MAT cases at the weight-
+row tiling boundary (m in {2048,1600,2050} = exact + ragged vs mmq_y 64/128, n in {32,128} = decode
+M, k=2048), so the bit-exact CPU-vs-CUDA oracle covers the mmq_y / min-blocks paths. Green at
+default and with every lever on: MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 fail.
+
+P1 levers (ggml/src/ggml-cuda/mmq.cuh), all default-off => default build byte-identical to stock:
+  - GGML_CUDA_FP4_MMQ_Y (default 128): type-aware get_mmq_y_host/device plumbing for an NVFP4
+    weight-row tile override. mmq_y is rigidly nwarps*tile_C::I (=8*16=128, the mmq.cuh
+    static_assert), so mmq_y<128 also needs nwarps-down (a warp-remap through the shared
+    vec_dot/loader), left as the P2 kernel change; the host/device plumbing is in place and inert.
+  - GGML_CUDA_FP4_MINBLOCKS (default 1): NVFP4-only __launch_bounds__ min-resident-CTAs lever
+    (register-cap the FP4-MMA kernel so >1 CTA co-resides) - the bounded occupancy probe.
+  - GGML_CUDA_FP4_DENSE_MMQ_X (env, default off): dense col-tile re-read occupancy diagnostic.
+
+Measured GB10 (llama-batched-bench -fa on -npp 128 -ntg 128 -npl 32,128), decode_agg (S_TG):
+  DENSE q36-27b-nvfp4 @npl128: P0 149.5 -> MINBLOCKS=2 147.9 (-1.1%) -> DENSE_MMQ_X=64 144.3
+    (-3.5%) -> =32 141.7 (-5.2%). Every occupancy probe regresses.
+  MoE q36-35b-a3b-nvfp4 @npl128: stock 336.3, MINBLOCKS=2 337.7 (+0.4%, noise), TILE16 324.0
+    (-3.7%), TILE8 316.6 (-5.9%). mmq_x-down regresses (reproduces patch 0015; GDN/BW-bound).
+
+nsys (kill-gate evidence): the decode FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s -> 3.025s
+(avg 608us -> 661us, +8.7% slower) under MINBLOCKS=2 - register-capping spills, so occupancy did
+not usefully rise. Verdict: the dense M=128 tile is already weight-read/one-read-optimal at
+mmq_x=128, NOT occupancy-starved via the cheap levers; the only untested lever is the structural
+mmq_y-down (nwarps=4 warp-remap), deferred to P2. Bit-exact gate holds throughout.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cuh | 85 ++++++++++++++++++++++++++++++++++----
+ tests/test-backend-ops.cpp | 16 +++++++
+ 2 files changed, 92 insertions(+), 9 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index 9718b12..b53e38a 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -140,7 +140,24 @@ static constexpr __device__ int get_mmq_x_max_device() {
+ #endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
+ }
+ 
+-static int get_mmq_y_host(const int cc) {
+// [paged patch 0017 / track B] Dense NVFP4 decode mmq_y (weight-row tile) override.
+// mmq_y tiles the N (weight-row) dimension of the FP4-MMA weight GEMM. Lowering it raises the
+// number of resident CTAs (smaller per-CTA shared footprint + smaller per-thread accumulator) to
+// hide LPDDR5x weight-load latency at the M=128 decode tile, WITHOUT re-reading weights: every
+// weight row lives in exactly one row-tile, so total weight traffic is unchanged (bandwidth-
+// neutral) - the dense-decode occupancy lever from FP4_GEMM_SCOPE_B.md s3/s4.1. mmq_y is a PURE
+// N-row tiling knob: the per-output reduction over K is identical for any mmq_y, so the result
+// stays BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4 decode shapes). Default 128 == exact
+// stock behaviour (a default build is byte-identical to stock); build -DGGML_CUDA_FP4_MMQ_Y=64
+// (or 96) to enable the tune. Applies ONLY to NVFP4 on Blackwell; every other type/arch untouched.
+#ifndef GGML_CUDA_FP4_MMQ_Y
+#define GGML_CUDA_FP4_MMQ_Y 128
+#endif
+
+static int get_mmq_y_host(const int cc, const ggml_type type = GGML_TYPE_COUNT) {
+    if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)) {
+        return GGML_CUDA_FP4_MMQ_Y;
+    }
+     return GGML_CUDA_CC_IS_AMD(cc) ? (GGML_CUDA_CC_IS_RDNA1(cc) ? 64 : 128) :
+         ((GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ? 128 : 64);
+ }
+@@ -154,7 +171,13 @@ if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) {
+     return MMQ_ITER_K;
+ }
+ 
+template <ggml_type type = GGML_TYPE_COUNT>
+ static constexpr __device__ int get_mmq_y_device() {
+#if defined(BLACKWELL_MMA_AVAILABLE)
+    if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MMQ_Y != 128) {
+        return GGML_CUDA_FP4_MMQ_Y;
+    }
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
+ #if defined(GGML_USE_HIP)
+ #if defined(RDNA1)
+     return 64;
+@@ -170,6 +193,28 @@ static constexpr __device__ int get_mmq_y_device() {
+ #endif // defined(GGML_USE_HIP)
+ }
+ 
+// [paged patch 0017 / track B] Dense NVFP4 decode occupancy lever: min resident CTAs per SM.
+// The FP4-MMA mul_mat_q is REGISTER-bound to 1 CTA/SM (__launch_bounds__(256,1) => ~255 regs/thread
+// => one resident block, the under-occupancy that strands the kernel at ~3% of FP4 peak at M=128).
+// Raising the __launch_bounds__ min-blocks operand register-caps the compiler so N CTAs co-reside,
+// hiding LPDDR5x weight-load latency by CTA-parallelism (the scope s4.1 occupancy goal) WITHOUT a
+// structural mmq_y/nwarps change and WITHOUT extra weight reads (each weight tile still read once).
+// Register allocation cannot change results => BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4).
+// Default 1 == exact stock behaviour (byte-identical); build -DGGML_CUDA_FP4_MINBLOCKS=2 to enable.
+// Applies ONLY to NVFP4 on Blackwell; every other type/arch keeps the stock min-blocks.
+#ifndef GGML_CUDA_FP4_MINBLOCKS
+#define GGML_CUDA_FP4_MINBLOCKS 1
+#endif
+template <ggml_type type = GGML_TYPE_COUNT>
+static constexpr __device__ int mmq_get_min_blocks_device(const int stock) {
+#if defined(BLACKWELL_MMA_AVAILABLE)
+    if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MINBLOCKS != 1) {
+        return GGML_CUDA_FP4_MINBLOCKS;
+    }
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
+    return stock;
+}
+
+ // Decouple shared memory tile sizes from WARP_SIZE to allow for different warp sizes.
+ // The K dimension of the tiles has either,
+ // 1*MMQ_TILE_NE_K==32 (always for TILE_Y_K) or 2*MMQ_TILE_NE_K==64 (typically for TILE_X_K),
+@@ -3454,7 +3499,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
+     constexpr int              warp_size  = ggml_cuda_get_physical_warp_size();
+     constexpr int              nwarps     = mmq_get_nwarps_device();
+     constexpr int              qk         = ggml_cuda_type_traits<type>::qk;
+-    constexpr int              mmq_y      = get_mmq_y_device();
+    constexpr int              mmq_y      = get_mmq_y_device<type>();
+     constexpr load_tiles_mmq_t load_tiles = mmq_type_traits<mmq_x, mmq_y, need_check, type>::load_tiles;
+ 
+     extern __shared__ int data_mul_mat_q[];
+@@ -3531,13 +3576,13 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
+ template <ggml_type type, int mmq_x, bool need_check>
+ #if defined(GGML_USE_HIP)
+ #if defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
+-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
+ #endif // defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
+ #else
+ #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
+-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 1)
+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(1))
+ #else
+-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
+ #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
+ #endif // defined(GGML_USE_HIP)
+ static __global__ void mul_mat_q(
+@@ -3558,7 +3603,7 @@ static __global__ void mul_mat_q(
+     constexpr int warp_size = ggml_cuda_get_physical_warp_size();
+ 
+     constexpr int qk    = ggml_cuda_type_traits<type>::qk;
+-    constexpr int mmq_y = get_mmq_y_device();
+    constexpr int mmq_y = get_mmq_y_device<type>();
+ 
+     const uint32_t nty = (nrows_x + mmq_y - 1) / mmq_y; // Number of tiles y
+ 
+@@ -3790,7 +3835,7 @@ static __global__ void mul_mat_q_stream_k_fixup(
+         float * __restrict__ tmp_last_tile, const uint3 blocks_per_ne00, const int nrows_x, const int ncols_dst,
+         const int stride_col_dst, const uint3 nchannels_y, const int stride_channel_dst, const uint3 nsamples_y,
+         const int stride_sample_dst, const uint3 ntx) {
+-    constexpr int mmq_y           = get_mmq_y_device();
+    constexpr int mmq_y           = get_mmq_y_device<type>();
+     constexpr int qk              = ggml_cuda_type_traits<type>::qk;
+     constexpr int ITER_K          = get_iter_k(type);
+     constexpr int blocks_per_iter = ITER_K / qk;
+@@ -3947,7 +3992,7 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     const int nsm = ggml_cuda_info().devices[id].nsm;
+     const int warp_size = ggml_cuda_info().devices[id].warp_size;
+     const int nwarps = mmq_get_nwarps_host(cc, warp_size);
+-    const int mmq_y = get_mmq_y_host(cc);
+    const int mmq_y = get_mmq_y_host(cc, type);
+ 
+     const dim3 block_dims(warp_size, nwarps, 1);
+ 
+@@ -4103,6 +4148,21 @@ static inline int ggml_cuda_moe_density_max() {
+     return d;
+ }
+ 
+// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off).
+// GGML_CUDA_FP4_DENSE_MMQ_X=<n> caps the dense (non-MoE) NVFP4 col-tile to <n>, splitting the M=128
+// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost
+// in the BW-bound regime) but multiplies resident CTAs. This is the scope s4.1 A/B probe: if
+// decode_agg RISES with cap=64 despite the 2x weight read, occupancy is badly broken (the kernel is
+// compute/occupancy-bound, so mmq_y-down / min-blocks has large upside); if it FALLS, the tile is
+// already bandwidth-saturated and the occupancy ceiling is lower. Unset/<=0 => stock selection.
+static inline int ggml_cuda_fp4_dense_mmq_x_cap() {
+    static const int c = []() -> int {
+        const char * s = getenv("GGML_CUDA_FP4_DENSE_MMQ_X");
+        return s ? atoi(s) : 0;
+    }();
+    return c;
+}
+
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int    id     = ggml_cuda_get_device();
+@@ -4112,7 +4172,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int nwarps    = mmq_get_nwarps_host(cc, warp_size);
+ 
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+-    const int mmq_y = get_mmq_y_host(cc);
+    const int mmq_y = get_mmq_y_host(cc, type);
+ 
+     // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
+     // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
+@@ -4145,6 +4205,13 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
+     //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
+     int mmq_x_lim = mmq_x_max;
+    if (args.expert_bounds == nullptr && type == GGML_TYPE_NVFP4) {
+        // dense NVFP4 decode mmq_x re-read occupancy diagnostic (see ggml_cuda_fp4_dense_mmq_x_cap).
+        const int cap = ggml_cuda_fp4_dense_mmq_x_cap();
+        if (cap > 0 && cap < mmq_x_max) {
+            mmq_x_lim = cap < 8 ? 8 : cap;
+        }
+    }
+     if (args.expert_bounds != nullptr) {
+         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+         if (moe_cap > 0) {
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index f219309..291c275 100644
+--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
+@@ -8591,6 +8591,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+         }
+     }
+ 
+    // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate.
+    // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the
+    // NVFP4 decode path to raise resident-CTA occupancy. mmq_y is a pure N-row tiling knob, so a
+    // smaller mmq_y must stay BIT-EXACT (identical per-output reduction over K) - this gate proves
+    // it. m = weight rows (N, tiled by mmq_y): 2048 (exact at mmq_y 64 & 128), 1600 (ragged vs 128),
+    // 2050 (ragged vs both 64 & 128 -> exercises the need_check last-row-tile at both). n = decode
+    // token count M = 32 and 128 (the scope decode shapes, tiled by mmq_x). k = 2048 hidden. Must
+    // pass with the default build (mmq_y=128) AND a mmq_y=64 build, CUDA-vs-CPU oracle, bit-exact.
+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+        for (int64_t m : {2048, 1600, 2050}) {
+            for (int64_t n : {32, 128}) {
+                test_cases.emplace_back(new test_mul_mat(type_a, GGML_TYPE_F32, m, n, 2048, {1, 1}, {1, 1}));
+            }
+        }
+    }
+
+     for (ggml_type type_a : all_types) {
+         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
+     }
+-- 
+2.43.0
+
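Reviewer sketch for the patch 0017 gate shapes above: a small helper that reproduces the row-tile count (the same ceil-division as `nty` in `mul_mat_q`) and reports whether the last row tile is partial. `RowTiling` and `row_tiling` are hypothetical names used only to check the "exact vs ragged" claims for m in {2048, 1600, 2050}.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: how many weight-row tiles a GEMM with nrows_x rows
// needs at a given mmq_y, and whether the last tile is only partly filled.
struct RowTiling {
    uint32_t n_tiles;  // ceil(nrows_x / mmq_y)
    bool     ragged;   // true when nrows_x is not a multiple of mmq_y
};

RowTiling row_tiling(int nrows_x, int mmq_y) {
    const uint32_t nty = (uint32_t) ((nrows_x + mmq_y - 1) / mmq_y);
    return { nty, nrows_x % mmq_y != 0 };
}
```

Under this check, 2048 rows tile exactly at both 64 and 128, 1600 rows are ragged only at 128, and 2050 rows are ragged at both, which is the coverage the test comment describes.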
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch
@@ -0,0 +1,349 @@
+From 17f16e8f6d8dbc689d5151c44759792d683c957b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 00:44:13 +0200
+Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state
+ write-back (patch 0018)
+
+Decode on the Qwen3.6 hybrid-SSM models (arch qwen35, 48 gated-DeltaNet :
+16 full-attention layers) was dominated by recurrent-state plumbing, not the
+FP4 GEMM. Per SSM layer per step the fused gated_delta_net op wrote its new
+recurrent state into graph scratch, then a separate ggml_cpy persisted it into
+the recurrent-state cache. nsys attributed 18.9% of decode GPU time to that
+~225 MB/copy D2D memcpy (1584 ops, 356 GB over the A2 decompose window).
+
+This mirrors vLLM's fused_recurrent_gated_delta_rule (state kept in place):
+ggml_gated_delta_net_inplace writes the final recurrent state directly into the
+active sequences' contiguous cache slot (at kv_head), removing the copy-back. The
+op output then carries only the attention scores; the SSM arithmetic is
+unchanged (bit-identical greedy output vs the copy-back baseline).
+
+- new op builder ggml_gated_delta_net_inplace (src[6] = state_dst cache view)
+- CUDA + CPU honor src[6]; final-state (K==1, keep_rs off) write redirected there
+- delta-net-base build_recurrent_attn uses it on the fused decode/prefill path,
+  dropping the ggml_cpy; rollback (n_rs_seq>0) path unchanged
+
+Measured (q36-27b-nvfp4, decode_agg S_TG, npp128 ntg128, -fa on, paged on):
+  npl 32 : 113.74 -> 136.39 t/s (+19.9%)
+  npl 128: 146.23 -> 180.53 t/s (+23.5%, = predicted copy-removal ceiling)
+MoE q36-35b-a3b-nvfp4: npl128 313.36 -> 372.62 t/s (+18.9%).
+nsys D2D memcpy bucket 18.9% -> 0.23% (356 -> 2.93 GB). vLLM share
+(391 @128) 37.4% -> 46.2%. get_rows state gather (now 18.8%) is the
+next lever.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/include/ggml.h                   | 14 ++++++
+ ggml/src/ggml-cpu/ops.cpp             | 13 ++++-
+ ggml/src/ggml-cuda/gated_delta_net.cu | 39 ++++++++++-----
+ ggml/src/ggml.c                       | 68 +++++++++++++++++++++++++++
+ src/models/delta-net-base.cpp         | 30 ++++++++++++
+ 5 files changed, 152 insertions(+), 12 deletions(-)
+
+diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
+index 823f5a9..4e7ab32 100644
+--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
+@@ -2579,6 +2579,20 @@ extern "C" {
+             struct ggml_tensor  * state,
+             int64_t               K);
+ 
+    // same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+    // in place into state_dst (a view into the recurrent-state cache) instead of being appended to
+    // the op output, eliminating the per-step state copy-back during decode. state_dst must be a
+    // contiguous [S_v*S_v*H, n_seqs] view (per-seq stride == dense state size).
+    GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * q,
+            struct ggml_tensor  * k,
+            struct ggml_tensor  * v,
+            struct ggml_tensor  * g,
+            struct ggml_tensor  * beta,
+            struct ggml_tensor  * state,
+            struct ggml_tensor  * state_dst);
+
+     // custom operators
+ 
+     typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
+diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
+index 63c07a2..9457add 100644
+--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
+@@ -10600,6 +10600,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+     ggml_tensor * src_g     = dst->src[3];
+     ggml_tensor * src_beta  = dst->src[4];
+     ggml_tensor * src_state = dst->src[5];
+    ggml_tensor * src_state_dst = dst->src[6]; // optional in-place final-state write-back target
+ 
+     const int64_t S_v      = src_v->ne[0];
+     const int64_t H        = src_v->ne[1];
+@@ -10660,6 +10661,16 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+ 
+     const float scale = 1.0f / sqrtf((float) S_v);
+ 
+    // when src_state_dst is provided (in-place decode write-back) the final state is written
+    // directly into the persistent cache view, removing the separate state copy-back node.
+    float * inplace_state_base = nullptr;
+    if (src_state_dst != nullptr) {
+        GGML_ASSERT(K == 1);
+        GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+        GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+        inplace_state_base = (float *) src_state_dst->data;
+    }
+
+     for (int64_t ir = ir0; ir < ir1; ++ir) {
+         const int64_t iv1 = ir % H; // head_index
+         const int64_t iv3 = ir / H; // sequence
+@@ -10674,7 +10685,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+         // For K>1, work in scratch and copy out per-token when the slot is in range.
+         float * s_out = (K > 1)
+             ? state_work
+-            : state_out_base + (iv3 * H + iv1) * S_v * S_v;
+            : (inplace_state_base ? inplace_state_base : state_out_base) + (iv3 * H + iv1) * S_v * S_v;
+ 
+         // copy input state into the working buffer and operate in-place
+         // state layout [S_v, S_v, H, n_seqs]: seq iv3 starts at iv3 * state_seq_stride.
+diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
+index a547360..61a2b91 100644
+--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
+@@ -25,7 +25,8 @@ gated_delta_net_cuda(const float * q,
+                                      const uint3   neqk1_magic,
+                                      const uint3   rq3_magic,
+                                      float         scale,
+-                                     int           K) {
+                                     int           K,
+                                     float *       state_dst) {
+     const uint32_t h_idx    = blockIdx.x;
+     const uint32_t sequence = blockIdx.y;
+     // each warp owns one column, using warp-level primitives to reduce across rows
+@@ -37,7 +38,10 @@ gated_delta_net_cuda(const float * q,
+ 
+     const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
+     float *       attn_data        = dst;
+-    float *       state            = dst + attn_score_elems;
+    // when state_dst is provided (in-place decode write-back) the final recurrent state is written
+    // directly into the persistent cache view instead of being appended to the op output; this
+    // eliminates the per-layer per-step D2D state copy-back. Only used when keep_rs_t == false.
+    float *       state            = (state_dst != nullptr) ? state_dst : (dst + attn_score_elems);
+ 
+     // input state holds s0 only: [S_v, S_v, H, n_seqs] — seq stride is D = H * S_v * S_v.
+     // output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before.
+@@ -171,7 +175,7 @@ template <bool KDA, bool keep_rs_t>
+ static void launch_gated_delta_net(
+         const float * q_d, const float * k_d, const float * v_d,
+         const float * g_d, const float * b_d, const float * s_d,
+-        float * dst_d,
+        float * dst_d, float * state_dst_d,
+         int64_t S_v,   int64_t H, int64_t n_tokens, int64_t n_seqs,
+         int64_t sq1,   int64_t sq2, int64_t sq3,
+         int64_t sv1,   int64_t sv2, int64_t sv3,
+@@ -195,26 +199,26 @@ static void launch_gated_delta_net(
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         case 32:
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         case 64: {
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         }
+         case 128: {
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         }
+         default:
+@@ -230,6 +234,7 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     ggml_tensor * src_g     = dst->src[3];
+     ggml_tensor * src_beta  = dst->src[4];
+     ggml_tensor * src_state = dst->src[5];
+    ggml_tensor * src_state_dst = dst->src[6]; // optional in-place state write-back target
+ 
+     GGML_TENSOR_LOCALS(int64_t, neq, src_q, ne);
+     GGML_TENSOR_LOCALS(size_t , nbq, src_q, nb);
+@@ -260,6 +265,15 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     const float * s_d   = (const float *) src_state->data;
+     float *       dst_d = (float *) dst->data;
+ 
+    float * state_dst_d = nullptr;
+    if (src_state_dst != nullptr) {
+        // in-place final-state cache view: per-seq stride must be the dense state size D = S_v*S_v*H
+        GGML_ASSERT(src_state_dst->type == GGML_TYPE_F32);
+        GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+        GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+        state_dst_d = (float *) src_state_dst->data;
+    }
+
+     GGML_ASSERT(ggml_is_contiguous_rows(src_q));
+     GGML_ASSERT(ggml_is_contiguous_rows(src_k));
+     GGML_ASSERT(ggml_is_contiguous_rows(src_v));
+@@ -288,23 +302,26 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     const int K = ggml_get_op_params_i32(dst, 0);
+     const bool keep_rs = K > 1;
+ 
+    // in-place write-back is only valid for the single-snapshot (final-state) case
+    GGML_ASSERT(state_dst_d == nullptr || !keep_rs);
+
+     if (kda) {
+         if (keep_rs) {
+-            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         } else {
+-            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         }
+     } else {
+         if (keep_rs) {
+-            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         } else {
+-            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         }
+diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
+index adbe52b..b8d34bf 100644
+--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
+@@ -6285,6 +6285,74 @@ struct ggml_tensor * ggml_gated_delta_net(
+     return result;
+ }
+ 
+// ggml_gated_delta_net_inplace
+//
+// Same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+// in place into `state_dst` (a view into the persistent recurrent-state cache) instead of being
+// appended to the op output. This removes the per-layer per-step D2D state copy-back during decode.
+// The op output holds ONLY the attention scores; the state region is still allocated (unused) so
+// the attention-output view layout is identical to ggml_gated_delta_net.
+struct ggml_tensor * ggml_gated_delta_net_inplace(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * q,
+        struct ggml_tensor  * k,
+        struct ggml_tensor  * v,
+        struct ggml_tensor  * g,
+        struct ggml_tensor  * beta,
+        struct ggml_tensor  * state,
+        struct ggml_tensor  * state_dst) {
+    GGML_ASSERT(ggml_is_contiguous_rows(q));
+    GGML_ASSERT(ggml_is_contiguous_rows(k));
+    GGML_ASSERT(ggml_is_contiguous_rows(v));
+    GGML_ASSERT(ggml_is_contiguous(g));
+    GGML_ASSERT(ggml_is_contiguous(beta));
+    GGML_ASSERT(ggml_is_contiguous(state));
+
+    GGML_ASSERT(q->type == GGML_TYPE_F32);
+    GGML_ASSERT(k->type == GGML_TYPE_F32);
+    GGML_ASSERT(v->type == GGML_TYPE_F32);
+    GGML_ASSERT(g->type == GGML_TYPE_F32);
+    GGML_ASSERT(beta->type == GGML_TYPE_F32);
+    GGML_ASSERT(state->type == GGML_TYPE_F32);
+    GGML_ASSERT(state_dst != NULL);
+    GGML_ASSERT(state_dst->type == GGML_TYPE_F32);
+
+    const int64_t S_v      = v->ne[0];
+    const int64_t H        = v->ne[1];
+    const int64_t n_tokens = v->ne[2];
+    const int64_t n_seqs   = v->ne[3];
+
+    GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
+    GGML_ASSERT(beta->ne[0] == 1);
+
+    GGML_ASSERT(state->ne[0] == S_v);
+    GGML_ASSERT(state->ne[1] == S_v);
+    GGML_ASSERT(state->ne[2] == H);
+    GGML_ASSERT(state->ne[3] == n_seqs);
+
+    // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
+    GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
+    GGML_ASSERT(state_dst->ne[1] >= n_seqs);
+    GGML_ASSERT(state_dst->nb[0] == sizeof(float));
+
+    const int64_t state_rows = S_v * n_seqs; // K == 1
+    const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
+    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+    ggml_set_op_params_i32(result, 0, 1); // K == 1
+
+    result->op     = GGML_OP_GATED_DELTA_NET;
+    result->src[0] = q;
+    result->src[1] = k;
+    result->src[2] = v;
+    result->src[3] = g;
+    result->src[4] = beta;
+    result->src[5] = state;
+    result->src[6] = state_dst;
+
+    return result;
+}
+
+ ////////////////////////////////////////////////////////////////////////////////
+ 
+ struct ggml_hash_set ggml_hash_set_new(size_t size) {
+diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
+index ad9ce77..26a718b 100644
+--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
+@@ -546,6 +546,36 @@ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
+     const bool keep = cparams.n_rs_seq > 0;
+ 
+     if (!keep) {
+        const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
+
+        if (fused) {
+            // In-place state write-back: the fused gated-DeltaNet op writes the new recurrent state
+            // directly into the persistent cache slot for the active sequences (a contiguous block
+            // at kv_head), eliminating the per-layer per-step ~full-state D2D copy-back that
+            // dominated decode. The op output then carries only the attention scores.
+            ggml_tensor * state_dst = ggml_view_2d(ctx0, ssm_states_all, hparams.n_embd_s(), n_seqs,
+                    ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
+
+            ggml_tensor * result = ggml_gated_delta_net_inplace(ctx0, q, k, v, g, b, s, state_dst);
+            if (n_seq_tokens == 1) {
+                cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
+            } else {
+                cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
+            }
+
+            ggml_tensor * output = ggml_view_4d(ctx0, result,
+                    S_v, H_v, n_seq_tokens, n_seqs,
+                    ggml_row_size(result->type, S_v),
+                    ggml_row_size(result->type, S_v * H_v),
+                    ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
+            cb(output, "attn_output", il);
+
+            // the state write is a side effect of the op; pull the op into the graph via the output
+            ggml_build_forward_expand(gf, output);
+
+            return output;
+        }
+
+         auto attn_out = build_delta_net(q, k, v, g, b, s, il);
+         ggml_tensor * output    = attn_out.first;
+         ggml_tensor * new_state = attn_out.second;
+-- 
+2.43.0
+
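Reviewer note on patch 0018: the CPU hunk above redirects the final-state write by changing a single pointer selection. The sketch below restates that selection as a standalone function so the three cases (K > 1 scratch, K == 1 op-output tail, K == 1 in-place cache view) can be checked in isolation. The names `select_state_out` and `demo_select_state_out` and the sizes used are hypothetical and not part of the patch; only the pointer arithmetic follows the hunk.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the patch: mirrors how the CPU op picks the
// buffer that receives the final recurrent state for one (sequence, head) pair.
static float * select_state_out(int64_t K, float * state_work, float * state_out_base,
                                float * inplace_state_base,
                                int64_t seq, int64_t head, int64_t H, int64_t S_v) {
    if (K > 1) {
        return state_work; // snapshot path: work in scratch, copy out per token
    }
    // K == 1: prefer the cache view (src[6]) when present, else the op-output tail
    float * base = inplace_state_base ? inplace_state_base : state_out_base;
    return base + (seq * H + head) * S_v * S_v;
}

// Self-check with made-up sizes (assumption: dense per-sequence stride D = S_v*S_v*H).
static bool demo_select_state_out() {
    const int64_t S_v = 4, H = 2, n_seqs = 3;
    const int64_t D = S_v * S_v * H;
    std::vector<float> op_tail(D * n_seqs), cache_view(D * n_seqs), work(S_v * S_v);

    const int64_t off = (2 * H + 1) * S_v * S_v; // sequence 2, head 1
    assert(off < D * n_seqs);

    bool ok = true;
    // in-place target provided: the write lands in the cache view
    ok = ok && select_state_out(1, work.data(), op_tail.data(), cache_view.data(), 2, 1, H, S_v)
                   == cache_view.data() + off;
    // no in-place target: the write lands in the op-output state region, as before the patch
    ok = ok && select_state_out(1, work.data(), op_tail.data(), nullptr, 2, 1, H, S_v)
                   == op_tail.data() + off;
    // K > 1: scratch is used regardless of the other bases
    ok = ok && select_state_out(4, work.data(), op_tail.data(), nullptr, 2, 1, H, S_v)
                   == work.data();
    return ok;
}
```

Both K == 1 bases share the same per-(sequence, head) offset, which is why the backends assert that `state_dst` has a per-sequence stride equal to the dense state size.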
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
@@ -0,0 +1,583 @@
+From 46d7dd80bbce7f3c1dbf9363d6527c8c9b687a6b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 01:45:02 +0200
+Subject: [PATCH] feat(paged): qwen35 SSM decode fused recurrent-state gather
+ (patch 0019)
+
+Step 2 of the SSM decode-throughput work. After Step 1 (in-place state
+write-back, patch 0018) the largest non-GEMM decode bucket was the recurrent-
+state get_rows gather (18.8% of decode GPU time): build_rs materialized each
+sequence's prior state into a contiguous scratch via ggml_get_rows before the
+gated-DeltaNet op read it.
+
+This eliminates that materialization, mirroring ggml_ssm_scan's ids source.
+ggml_gated_delta_net_inplace_ids takes the FULL recurrent-state cache plus the
+s_copy ids (src[5] = full cache, src[7] = ids, op_param[1] = rs_head) and reads
+each sequence's prior state directly from cache[ids[seq]]. Combined with Step 1's
+in-place write the op now reads AND writes the cache directly: no recurrent-state
+materialization at all. build_recurrent_attn feeds the full cache + ids through
+the build_rs get_state_rows lambda exactly like mamba-base, keeping the rs_zero
+clear and the extra-states copy around the op.
+
+Race-free by construction on CUDA. In-place write plus an ids read of the same
+cache is only safe when read slot == write slot; s_copy is identity
+(rs_head + s) for stable continuing sequences (the whole AR decode path) but can
+remap on reorder or rs_zero (e.g. multiple new sequences in one prefill ubatch).
+The recurrence kernel handles both per (seq, head) block on device: identity
+sequences read s0 in place from the destination slot (the kernel loads all of s0
+into registers before writing, so reading and writing the same slot is safe),
+and non-identity sequences read from a disjoint scratch that a small gather
+kernel copies from cache[ids[seq]] first, so the recurrence never reads a slot
+another block writes. The CPU op mirrors this (host identity check + a serial
+gather in the dispatcher). ids stays a device pointer (read only in-kernel; it is
+device-resident at op-execute time). Bit-identical to the get_rows path in every
+case.
+
+- new builder ggml_gated_delta_net_inplace_ids; CUDA gather kernel
+  (gdn_gather_nonident) + per-block read-base select in gated_delta_net_cuda;
+  CPU identity guard + serial gather fallback in the dispatcher
+- delta-net-base build_recurrent_attn gains a gather-free overload; qwen35 and
+  qwen35moe drop the pre-gather. qwen3next, kimi-linear, the non-fused path and
+  the rollback (n_rs_seq > 0) path are unchanged.
+
+Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off):
+  dense q36-27b-nvfp4 : npl 32  137.64 -> 170.68 (+24.0 percent)
+                        npl 128 186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391)
+  MoE   q36-35b-a3b-nvfp4: npl 32  299.68 -> 366.69 (+22.4 percent)
+                           npl 128 409.30 -> 553.63 (+35.3 percent)
+Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build
+(dense model text md5 match, MoE byte-identical, step2 run1 == run2). nsys
+k_get_rows_float bucket 18.8 -> 0.7 percent; the new gdn_gather_nonident kernel
+is 1.7 percent (no-op at decode, median 1.2 us). The residual decode gap to vLLM
+is now the FP4 GEMM (~48 percent of decode), a separate kernel track.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/include/ggml.h                   | 17 ++++++
+ ggml/src/ggml-cpu/ops.cpp             | 49 ++++++++++++++-
+ ggml/src/ggml-cuda/gated_delta_net.cu | 85 ++++++++++++++++++++++----
+ ggml/src/ggml.c                       | 76 +++++++++++++++++++++++
+ src/models/delta-net-base.cpp         | 63 ++++++++++++++++++++
+ src/models/models.h                   | 13 ++++
+ src/models/qwen35.cpp                 |  6 +-
+ src/models/qwen35moe.cpp              |  6 +-
+ 8 files changed, 292 insertions(+), 23 deletions(-)
+
+diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
+index 4e7ab32..951dd21 100644
+--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
+@@ -2593,6 +2593,23 @@ extern "C" {
+             struct ggml_tensor  * state,
+             struct ggml_tensor  * state_dst);
+ 
+    // Step 2: same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read
+    // directly from the full state cache via per-sequence indices (ids == s_copy), mirroring
+    // ggml_ssm_scan, instead of from a materialized ggml_get_rows gather. `state` is the FULL cache
+    // [S_v, S_v, H, n_rs_slots]; `ids` are the per-seq source slots; `rs_head` is the destination
+    // base slot. Eliminates the recurrent-state gather on the decode path.
+    GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * q,
+            struct ggml_tensor  * k,
+            struct ggml_tensor  * v,
+            struct ggml_tensor  * g,
+            struct ggml_tensor  * beta,
+            struct ggml_tensor  * state,
+            struct ggml_tensor  * state_dst,
+            struct ggml_tensor  * ids,
+            int                   rs_head);
+
+     // custom operators
+ 
+     typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
+diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
+index 9457add..b6a1976 100644
+--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
+@@ -10633,7 +10633,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+     const int64_t K = ggml_get_op_params_i32(dst, 0);
+     GGML_ASSERT(K >= 1);
+     // per-seq stride in floats (seq s starts at state + s * seq_stride)
+-    const int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
+    int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
+ 
+     const int64_t per_thread = S_v + (K > 1 ? S_v * S_v : 0);
+     const int ith = params->ith;
+@@ -10654,6 +10654,26 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+ 
+     const float * state_in_base = (const float *)src_state->data;
+ 
+    // Step 2: fused recurrent-state gather (ids == s_copy in src[7]). Read the prior state directly
+    // from the full cache at cache[ids[seq]] instead of from a materialized gather. For the identity
+    // decode case the prior state is the in-place destination block [rs_head, rs_head+n_seqs);
+    // otherwise the dispatcher has gathered cache[ids[seq]] into the (unused) output-state scratch
+    // region. Bit-identical to the get_rows path.
+    ggml_tensor * src_ids = dst->src[7];
+    if (src_ids != nullptr) {
+        const int64_t   D       = S_v * S_v * H;
+        const int32_t   rs_head = ggml_get_op_params_i32(dst, 1);
+        const int32_t * ids     = (const int32_t *) src_ids->data;
+        bool identity = true;
+        for (int64_t s = 0; s < n_seqs; ++s) {
+            if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
+        }
+        state_seq_stride = D;
+        state_in_base = identity
+            ? (const float *) src_state->data + (int64_t) rs_head * D
+            : (const float *) state_out_base; // gathered by the dispatcher (non-identity)
+    }
+
+   //const int64_t rq1 = nev1 / neq1;
+   //const int64_t rk1 = nev1 / nek1;
+     const int64_t rq3 = nev3 / neq3;
+@@ -10777,6 +10797,33 @@ static void ggml_compute_forward_gated_delta_net_f32(
+ 
+     if (ith == 0) {
+       ggml_threadpool_chunk_set(params->threadpool, nth);
+
+      // Step 2: non-identity ids fallback -- serially gather each sequence's prior state from
+      // cache[ids[seq]] into the (otherwise unused) output-state scratch region before the parallel
+      // recurrence, so the in-place write never aliases another sequence's read.
+      ggml_tensor * src_ids = dst->src[7];
+      if (src_ids != nullptr) {
+          const ggml_tensor * src_state = dst->src[5];
+          const int64_t S_v      = V->ne[0];
+          const int64_t H        = V->ne[1];
+          const int64_t n_tokens = V->ne[2];
+          const int64_t n_seqs   = V->ne[3];
+          const int64_t D        = S_v * S_v * H;
+          const int32_t   rs_head = ggml_get_op_params_i32(dst, 1);
+          const int32_t * ids     = (const int32_t *) src_ids->data;
+          bool identity = true;
+          for (int64_t s = 0; s < n_seqs; ++s) {
+              if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
+          }
+          if (!identity) {
+              const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
+              const float * cache   = (const float *) src_state->data;
+              float *       scratch = (float *) dst->data + attn_score_elems;
+              for (int64_t s = 0; s < n_seqs; ++s) {
+                  memcpy(scratch + s * D, cache + (int64_t) ids[s] * D, D * sizeof(float));
+              }
+          }
+      }
+     }
+ 
+     ggml_barrier(params->threadpool);
+diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
+index 61a2b91..86d5e2a 100644
+--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
+@@ -1,6 +1,34 @@
+ #include "gated_delta_net.cuh"
+ #include "ggml-cuda/common.cuh"
+ 
+// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
+// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+// destination slot by the recurrence kernel and are skipped here. One block per sequence.
+__global__ void gdn_gather_nonident_kernel(const float * cache, const int32_t * ids, int rs_head,
+                                           float * scratch, int64_t D, int n_seqs) {
+    const int s = blockIdx.x;
+    if (s >= n_seqs) {
+        return;
+    }
+    const int r = ids[s];
+    if (r == rs_head + s) {
+        return; // identity: prior state already lives in the in-place destination slot
+    }
+    const float * src = cache   + (int64_t) r * D;
+    float *       dst = scratch + (int64_t) s * D;
+    for (int64_t i = threadIdx.x; i < D; i += blockDim.x) {
+        dst[i] = src[i];
+    }
+}
+
+static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * ids, int rs_head,
+                                          float * scratch, int64_t D, int64_t n_seqs, cudaStream_t stream) {
+    if (n_seqs <= 0) {
+        return;
+    }
+    gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
+}
+
+ template <int S_v, bool KDA, bool keep_rs_t>
+ __global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
+ gated_delta_net_cuda(const float * q,
+@@ -26,7 +54,9 @@ gated_delta_net_cuda(const float * q,
+                                      const uint3   rq3_magic,
+                                      float         scale,
+                                      int           K,
+-                                     float *       state_dst) {
+                                     float *       state_dst,
+                                     const int32_t * ids,
+                                     int           rs_head) {
+     const uint32_t h_idx    = blockIdx.x;
+     const uint32_t sequence = blockIdx.y;
+     // each warp owns one column, using warp-level primitives to reduce across rows
+@@ -48,7 +78,15 @@ gated_delta_net_cuda(const float * q,
+     const int64_t state_in_offset      = sequence * H * S_v * S_v + h_idx * S_v * S_v;
+     const int64_t state_out_offset     = (sequence * H + h_idx) * S_v * S_v;
+     state += state_out_offset;
+-    curr_state += state_in_offset + col * S_v;
+    // Step 2: select the prior-state read base per sequence. For the ids variant, identity
+    // sequences (ids[seq] == rs_head + seq) read s0 directly from the in-place destination slot
+    // state_dst (no materialization); non-identity sequences read from the pre-gathered scratch
+    // (curr_state). state_in_offset == state_out_offset, so both bases use the same per-(seq,head)
+    // offset. The whole s0 is loaded into registers before the new state is written, so reading and
+    // writing the same slot per block (identity) is race-free.
+    const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
+        ? state_dst : curr_state;
+    read_state += state_in_offset + col * S_v;
+     attn_data += (sequence * n_tokens * H + h_idx) * S_v;
+ 
+     constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
+@@ -61,7 +99,7 @@ gated_delta_net_cuda(const float * q,
+ #pragma unroll
+     for (int r = 0; r < rows_per_lane; r++) {
+         const int i = r * warp_size + lane;
+-        s_shard[r]  = curr_state[i];
+        s_shard[r]  = read_state[i];
+     }
+ 
+     for (int t = 0; t < n_tokens; t++) {
+@@ -176,6 +214,7 @@ static void launch_gated_delta_net(
+         const float * q_d, const float * k_d, const float * v_d,
+         const float * g_d, const float * b_d, const float * s_d,
+         float * dst_d, float * state_dst_d,
+        const int32_t * ids_d, int rs_head,
+         int64_t S_v,   int64_t H, int64_t n_tokens, int64_t n_seqs,
+         int64_t sq1,   int64_t sq2, int64_t sq3,
+         int64_t sv1,   int64_t sv2, int64_t sv3,
+@@ -199,26 +238,26 @@ static void launch_gated_delta_net(
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+             break;
+         case 32:
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+             break;
+         case 64: {
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+             break;
+         }
+         case 128: {
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+             break;
+         }
+         default:
+@@ -262,7 +301,6 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     const float * g_d = (const float *) src_g->data;
+     const float * b_d = (const float *) src_beta->data;
+ 
+-    const float * s_d   = (const float *) src_state->data;
+     float *       dst_d = (float *) dst->data;
+ 
+     float * state_dst_d = nullptr;
+@@ -274,6 +312,29 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+         state_dst_d = (float *) src_state_dst->data;
+     }
+ 
+    // Step 2: fused recurrent-state gather (src[7] = ids == s_copy). Read the prior state directly
+    // from the full cache via ids instead of from a materialized ggml_get_rows gather. The recurrence
+    // kernel reads identity sequences (ids[seq] == rs_head + seq) in place from state_dst (no
+    // materialization at all); any non-identity sequence (reorder / rs_zero remap) is gathered here
+    // into a disjoint scratch that the kernel reads instead. The gather writes a disjoint buffer and
+    // the recurrence never reads a slot another block writes, so it is race-free and bit-identical to
+    // the get_rows path. ids stays a DEVICE pointer (dereferenced only inside the kernels).
+    ggml_tensor * src_ids = dst->src[7];
+    const float *   s_d     = (const float *) src_state->data;
+    const int32_t * ids_d   = nullptr;
+    int             rs_head = 0;
+    ggml_cuda_pool_alloc<float> ids_state_scratch(ctx.pool());
+    if (src_ids != nullptr) {
+        GGML_ASSERT(state_dst_d != nullptr);
+        GGML_ASSERT(src_ids->type == GGML_TYPE_I32);
+        rs_head = ggml_get_op_params_i32(dst, 1);
+        ids_d   = (const int32_t *) src_ids->data;
+        const int64_t D = S_v * S_v * H;
+        float * scratch = ids_state_scratch.alloc((size_t) D * n_seqs);
+        ggml_cuda_gdn_gather_nonident(s_d, ids_d, rs_head, scratch, D, n_seqs, ctx.stream());
+        s_d = scratch;
+    }
+
+     GGML_ASSERT(ggml_is_contiguous_rows(src_q));
+     GGML_ASSERT(ggml_is_contiguous_rows(src_k));
+     GGML_ASSERT(ggml_is_contiguous_rows(src_v));
+@@ -307,21 +368,21 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+ 
+     if (kda) {
+         if (keep_rs) {
+-            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         } else {
+-            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         }
+     } else {
+         if (keep_rs) {
+-            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         } else {
+-            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         }
+diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
+index b8d34bf..1762037 100644
+--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
+@@ -6353,6 +6353,82 @@ struct ggml_tensor * ggml_gated_delta_net_inplace(
+     return result;
+ }
+ 
+// ggml_gated_delta_net_inplace_ids
+//
+// Same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read directly
+// from the FULL state cache `state` ([S_v, S_v, H, n_rs_slots]) at cache[ids[seq]] (mirroring
+// ggml_ssm_scan's ids source) instead of from a materialized ggml_get_rows gather. `rs_head` is the
+// destination base slot, used by the backend to detect the common identity case (ids[s] == rs_head
+// + s), where the prior state already lives in the in-place destination slots.
+struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * q,
+        struct ggml_tensor  * k,
+        struct ggml_tensor  * v,
+        struct ggml_tensor  * g,
+        struct ggml_tensor  * beta,
+        struct ggml_tensor  * state,
+        struct ggml_tensor  * state_dst,
+        struct ggml_tensor  * ids,
+        int                   rs_head) {
+    GGML_ASSERT(ggml_is_contiguous_rows(q));
+    GGML_ASSERT(ggml_is_contiguous_rows(k));
+    GGML_ASSERT(ggml_is_contiguous_rows(v));
+    GGML_ASSERT(ggml_is_contiguous(g));
+    GGML_ASSERT(ggml_is_contiguous(beta));
+    GGML_ASSERT(ggml_is_contiguous(state));
+
+    GGML_ASSERT(q->type    == GGML_TYPE_F32);
+    GGML_ASSERT(k->type    == GGML_TYPE_F32);
+    GGML_ASSERT(v->type    == GGML_TYPE_F32);
+    GGML_ASSERT(g->type    == GGML_TYPE_F32);
+    GGML_ASSERT(beta->type == GGML_TYPE_F32);
+    GGML_ASSERT(state->type == GGML_TYPE_F32);
+    GGML_ASSERT(state_dst != NULL && state_dst->type == GGML_TYPE_F32);
+    GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32);
+
+    const int64_t S_v      = v->ne[0];
+    const int64_t H        = v->ne[1];
+    const int64_t n_tokens = v->ne[2];
+    const int64_t n_seqs   = v->ne[3];
+
+    GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
+    GGML_ASSERT(beta->ne[0] == 1);
+
+    // state is the FULL recurrent-state cache: [S_v, S_v, H, n_rs_slots], n_rs_slots >= n_seqs
+    GGML_ASSERT(state->ne[0] == S_v);
+    GGML_ASSERT(state->ne[1] == S_v);
+    GGML_ASSERT(state->ne[2] == H);
+    GGML_ASSERT(state->ne[3] >= n_seqs);
+
+    // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
+    GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
+    GGML_ASSERT(state_dst->ne[1] >= n_seqs);
+    GGML_ASSERT(state_dst->nb[0] == sizeof(float));
+
+    // ids: per-seq source slot into the full cache (s_copy_main)
+    GGML_ASSERT(ids->ne[0] >= n_seqs);
+
+    const int64_t state_rows = S_v * n_seqs; // K == 1
+    const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
+    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+    ggml_set_op_params_i32(result, 0, 1);       // K == 1
+    ggml_set_op_params_i32(result, 1, rs_head); // destination base slot (for the ids identity check)
+
+    result->op     = GGML_OP_GATED_DELTA_NET;
+    result->src[0] = q;
+    result->src[1] = k;
+    result->src[2] = v;
+    result->src[3] = g;
+    result->src[4] = beta;
+    result->src[5] = state;     // FULL cache (read via ids)
+    result->src[6] = state_dst; // in-place final-state write-back target
+    result->src[7] = ids;       // per-seq source slots (s_copy)
+
+    return result;
+}
+
+ ////////////////////////////////////////////////////////////////////////////////
+ 
+ struct ggml_hash_set ggml_hash_set_new(size_t size) {
+diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
+index 26a718b..194e611 100644
+--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
+@@ -524,6 +524,69 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
+     return conv_input;
+ }
+ 
+// Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
+// gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
+// ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
+// and rollback paths fall back to materializing the prior state and delegating below.
+ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
+        llm_graph_input_rs * inp,
+        ggml_tensor *        ssm_states_all,
+        ggml_tensor *        q,
+        ggml_tensor *        k,
+        ggml_tensor *        v,
+        ggml_tensor *        g,
+        ggml_tensor *        b,
+        int                  il) {
+    const auto * mctx_cur = inp->mctx;
+    const auto   kv_head  = mctx_cur->get_head();
+
+    const int64_t S_v          = v->ne[0];
+    const int64_t H_v          = v->ne[1];
+    const int64_t n_seqs       = v->ne[3];
+    const int64_t n_seq_tokens = q->ne[2];
+
+    const bool keep  = cparams.n_rs_seq > 0;
+    const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
+
+    if (!keep && fused) {
+        // build_rs feeds the FULL state cache + the s_copy ids into the op (via the get_state_rows
+        // lambda, exactly like mamba-base's ggml_ssm_scan) and still performs the rs_zero clear and
+        // the extra-states copy around it. The op reads curr_state from cache[ids[seq]] and writes
+        // the final state in place at kv_head; no recurrent-state materialization at all.
+        auto get_state_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * {
+            ggml_tensor * cache4d = ggml_reshape_4d(ctx, states, S_v, S_v, H_v, states->ne[1]);
+            ggml_tensor * state_dst = ggml_view_2d(ctx, ssm_states_all, hparams.n_embd_s(), n_seqs,
+                    ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
+            return ggml_gated_delta_net_inplace_ids(ctx, q, k, v, g, b, cache4d, state_dst, ids, (int) kv_head);
+        };
+
+        ggml_tensor * result = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs, get_state_op);
+        if (n_seq_tokens == 1) {
+            cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
+        } else {
+            cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
+        }
+
+        ggml_tensor * output = ggml_view_4d(ctx0, result,
+                S_v, H_v, n_seq_tokens, n_seqs,
+                ggml_row_size(result->type, S_v),
+                ggml_row_size(result->type, S_v * H_v),
+                ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
+        cb(output, "attn_output", il);
+
+        // the state write is a side effect of the op; pull the op into the graph via the output
+        ggml_build_forward_expand(gf, output);
+
+        return output;
+    }
+
+    // non-fused / rollback: materialize the prior state via gather and delegate to the
+    // state-taking overload (its fused !keep branch performs the Step-1 in-place write).
+    ggml_tensor * s = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+    s = ggml_reshape_4d(ctx0, s, S_v, S_v, H_v, n_seqs);
+    return build_recurrent_attn(inp, ssm_states_all, q, k, v, g, b, s, il);
+}
+
+ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
+         llm_graph_input_rs * inp,
+         ggml_tensor *        ssm_states_all,
+diff --git a/src/models/models.h b/src/models/models.h
+index 2ac8415..98b89e9 100644
+--- a/src/models/models.h
+++ b/src/models/models.h
+@@ -88,6 +88,19 @@ struct llm_build_delta_net_base : public llm_graph_context {
+             ggml_tensor *        b,
+             ggml_tensor *        s,
+             int                  il);
+
+    // Step 2: gather-free variant. Reads the prior recurrent state directly from the full cache via
+    // the s_copy ids (no ggml_get_rows materialization) on the fused decode/prefill path, and
+    // delegates to the state-taking overload for the non-fused and rollback paths.
+    ggml_tensor * build_recurrent_attn(
+            llm_graph_input_rs * inp,
+            ggml_tensor *        ssm_states_all,
+            ggml_tensor *        q,
+            ggml_tensor *        k,
+            ggml_tensor *        v,
+            ggml_tensor *        g,
+            ggml_tensor *        b,
+            int                  il);
+ };
+ 
+ struct llm_build_rwkv6_base : public llm_graph_context {
+diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
+index 6783d98..0be3247 100644
+--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
+@@ -385,10 +385,6 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
+ 
+     ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ 
+-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
+-    cb(state, "state_predelta", il);
+-
+     ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+     cb(conv_output_proper, "conv_output_raw", il);
+ 
+@@ -445,7 +441,7 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
+     cb(k_conv, "k_conv_predelta", il);
+     cb(v_conv, "v_conv_predelta", il);
+ 
+-    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
+    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
+ 
+     // z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
+     ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
+diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
+index eb5e9a4..2995f04 100644
+--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
+@@ -409,10 +409,6 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
+ 
+     ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ 
+-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
+-    cb(state, "state_predelta", il);
+-
+     ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+     cb(conv_output_proper, "conv_output_raw", il);
+ 
+@@ -469,7 +465,7 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
+     cb(k_conv, "k_conv_predelta", il);
+     cb(v_conv, "v_conv_predelta", il);
+ 
+-    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
+    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
+ 
+     // z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
+     ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
@@ -0,0 +1,140 @@
+From df1cc97b68df048834ab735c944b71c3a2e8737e Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 12:40:49 +0200
+Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape
+ (patch 0020)
+
+Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
+models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
+(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
+both engines pinned the largest llama-specific overage to the gated-DeltaNet
+OUTPUT projection (ssm_out).
+
+The GDN op left its output in SSM layout and the graph reshaped it to 3D
+[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
+src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
+sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
+ssm_out weight read across the 128 sequences (one 5120x128 grid, 48 calls/step,
+the 40%-vs-62% GPU-utilization gap). vLLM packs the same projection into one
+M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.
+
+The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
+(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
+routes to the MMQ M=128 tensor-core GEMM (which amortizes the weight read across
+all 128 tokens). The result is then already 2D, so the redundant post-matmul
+reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical.
+Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs
+untouched.
+
+Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
+q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
+test-backend-ops MUL_MAT and MUL_MAT_ID OK.
+
+decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
+  dense q36-27b:    170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
+  MoE   q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
+Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).
+
+nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
+to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) with a LOWER
+per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
+vs 2.77 ms/call for the old GEMV.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/models/qwen35.cpp       | 13 ++++---
+ src/models/qwen35moe.cpp    | 13 ++++---
+ src/models/qwen3next.cpp    | 13 ++++---
+ 3 files changed, 21 insertions(+), 18 deletions(-)
+
+diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
+index 0be3247..0874c43 100644
+--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
+@@ -449,17 +449,18 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
+     // Apply gated normalization: self.norm(core_attn_out, z)
+     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
+ 
+-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+    // data, just a 2D vs 3D view, so the result is bit-identical.
+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     cb(final_output, "final_output", il);
+ 
+-    // Output projection
+    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+     cb(cur, "linear_attn_out", il);
+ 
+-    // Reshape back to original dimensions
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+-
+     return cur;
+ }
+ 
+diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
+index 2995f04..1f6f643 100644
+--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
+@@ -473,17 +473,18 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
+     // Apply gated normalization: self.norm(core_attn_out, z)
+     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
+ 
+-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+    // data, just a 2D vs 3D view, so the result is bit-identical.
+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     cb(final_output, "final_output", il);
+ 
+-    // Output projection
+    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
+     cb(cur, "linear_attn_out", il);
+ 
+-    // Reshape back to original dimensions
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+-
+     return cur;
+ }
+ 
+diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
+index 97200a4..bfdf026 100644
+--- a/src/models/qwen3next.cpp
+++ b/src/models/qwen3next.cpp
+@@ -519,17 +519,18 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
+     // Apply gated normalization: self.norm(core_attn_out, z)
+     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
+ 
+-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+    // data, just a 2D vs 3D view, so the result is bit-identical.
+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
+     cb(final_output, "final_output", il);
+ 
+-    // Output projection
+    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
+     cur = build_lora_mm(model.layers[il].ssm_out, final_output);
+     cb(cur, "linear_attn_out", il);
+ 
+-    // Reshape back to original dimensions
+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
+-
+     return cur;
+ }
+ 
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch
@@ -0,0 +1,655 @@
+From 58426b58aaf5431a59d499d513b2fe2d6ab990d8 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 18:55:54 +0200
+Subject: [PATCH] feat(paged): qwen35 decode conv-state in-place fusion (patch
+ 0021)
+
+The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate
+design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet
+recurrence is already single-pass at the f32 byte floor), the decode conv path
+was the only remaining bit-exact lever.
+
+New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated
+by a non-null src[3]). On the single-token decode path it replaces the four-op
+conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu +
+ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per
+(channel, sequence), assembles the width-K window in registers from the K-1 cached
+taps plus the current qkv_mixed token, computes the depthwise conv with the SAME
+ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv
+output, and writes the 1-token-shifted ring state back IN PLACE into the conv
+cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018
+in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and
+write target (the cache view) are disjoint buffers, so it is race-free by
+construction with no ids/identity logic.
+
+- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel,
+  src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring;
+  op_params[0]=fuse_silu)
+- ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel +
+  ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv
+- ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels)
+  + branch in ggml_compute_forward_ssm_conv
+- delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs
+  conv-tap gather; fuses conv+silu+shifted write-back)
+- qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path
+  (n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep
+  the original chain
+- tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference
+
+test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90.
+
+Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1
+(0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5
+ac163882... both BYTE-IDENTICAL.
+
+decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session:
+  dense q36-27b-nvfp4 : npl 32  199.76 -> 202.99 (+1.6%)
+                        npl 128 336.35 -> 347.14 (+3.2%, 86.0% -> 88.8% of vLLM 391)
+  MoE   q36-35b-a3b   : npl 32  421.72 -> 432.39 (+2.5%)
+                        npl 128 689.74 -> 713.54 (+3.5%)
+Lift holds in eager mode too (dense npl128 333.62 -> 342.97 t/s). Step time drops by 11.9 ms
+(dense npl128: 380.6 -> 368.7 ms/step). nsys eager decode: concat_cont (1152 calls) and the
+decode cpy_scalar are gone; ssm_conv_f32 at decode is replaced by ssm_conv_update (1152 calls);
+conv path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, and de-risks the bf16-state
+conv-cache plumbing.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/include/ggml.h            |  16 +++++
+ ggml/src/ggml-cpu/ops.cpp      |  73 ++++++++++++++++++++-
+ ggml/src/ggml-cuda/ssm-conv.cu | 112 +++++++++++++++++++++++++++++++++
+ ggml/src/ggml.c                |  54 ++++++++++++++++
+ src/models/delta-net-base.cpp  |  51 +++++++++++++++
+ src/models/models.h            |  14 +++++
+ src/models/qwen35.cpp          |  23 +++++--
+ src/models/qwen35moe.cpp       |  23 +++++--
+ src/models/qwen3next.cpp       |  29 ++++++---
+ tests/test-backend-ops.cpp     |  47 ++++++++++++++
+ 10 files changed, 420 insertions(+), 22 deletions(-)
+
+diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
+index 951dd21..76fa401 100644
+--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
+@@ -2447,6 +2447,22 @@ extern "C" {
+             struct ggml_tensor  * sx,
+             struct ggml_tensor  * c);
+ 
+    // Fused decode-time depthwise causal conv1d update (mirrors vLLM causal_conv1d_update). Assembles
+    // the width-K conv window in registers from the cached K-1 taps (`conv_states`, [K-1, channels,
+    // n_seqs]) plus the single current token (`x_cur`, [channels, 1, n_seqs]), computes the depthwise
+    // conv with the SAME ascending-tap FMA order as ggml_ssm_conv, optionally folds SiLU, and writes
+    // the 1-token-shifted ring state back IN PLACE into `conv_state_dst` (a [(K-1)*channels, n_seqs]
+    // view into the conv-state cache). This eliminates the concat + transpose + scalar copy-back +
+    // separate silu of the decode conv path. Output: [channels, 1, n_seqs]. Reuses GGML_OP_SSM_CONV;
+    // detected by the backends via a non-null src[3]. n_seq_tokens must be 1 (single-token decode).
+    GGML_API struct ggml_tensor * ggml_ssm_conv_update_inplace(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * conv_states,
+            struct ggml_tensor  * conv_kernel,
+            struct ggml_tensor  * x_cur,
+            struct ggml_tensor  * conv_state_dst,
+            bool                  fuse_silu);
+
+     GGML_API struct ggml_tensor * ggml_ssm_scan(
+             struct ggml_context * ctx,
+             struct ggml_tensor  * s,
+diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
+index b6a1976..f9cd850 100644
+--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
+@@ -9463,13 +9463,84 @@ static void ggml_compute_forward_ssm_conv_f32(
+     }
+ }
+ 
+// Fused decode-time depthwise causal conv1d update (mirror of the CUDA ssm_conv_update_f32). Reads the
+// K-1 cached taps (src[0]) and the single new token (src[2]), computes the depthwise conv with the same
+// ascending-tap FMA order as ggml_compute_forward_ssm_conv_f32, optionally folds silu, writes the conv
+// output to dst, and writes the 1-token-shifted ring state back in place into src[3]. Threads split
+// over channels.
+static void ggml_compute_forward_ssm_conv_update_f32(
+        const ggml_compute_params * params,
+        ggml_tensor * dst) {
+    const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs]
+    const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
+    const ggml_tensor * x_cur       = dst->src[2]; // [channels, 1, n_seqs]
+    ggml_tensor       * cdst        = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
+
+    const int ith = params->ith;
+    const int nth = params->nth;
+
+    const int64_t d_conv   = conv_kernel->ne[0];
+    const int64_t channels = conv_kernel->ne[1];
+    const int64_t n_seqs   = conv_states->ne[2];
+    const bool    apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
+
+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+    GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
+
+    const int64_t states_seq_stride = conv_states->nb[2] / sizeof(float);
+    const int64_t states_ch_stride  = conv_states->nb[1] / sizeof(float);
+    const int64_t w_stride          = conv_kernel->nb[1] / sizeof(float);
+    const int64_t x_seq_stride      = x_cur->nb[2] / sizeof(float);
+    const int64_t dst_seq_stride    = dst->nb[2] / sizeof(float);
+    const int64_t cdst_seq_stride   = cdst->nb[1] / sizeof(float);
+
+    const float * states_base = (const float *) conv_states->data;
+    const float * w_base      = (const float *) conv_kernel->data;
+    const float * x_base      = (const float *) x_cur->data;
+    float *       cdst_base   = (float *) cdst->data;
+    float *       dst_base    = (float *) dst->data;
+
+    const int64_t dc = (channels + nth - 1) / nth;
+    const int64_t c0 = dc * ith;
+    const int64_t c1 = MIN(c0 + dc, channels);
+
+    for (int64_t s = 0; s < n_seqs; ++s) {
+        for (int64_t c = c0; c < c1; ++c) {
+            const float * states_c = states_base + s * states_seq_stride + c * states_ch_stride;
+            const float * w_c      = w_base + c * w_stride;
+            const float   xc       = x_base[s * x_seq_stride + c];
+
+            // ascending-tap FMA: tap0*w0 + ... + tap_{K-2}*w_{K-2} + xc*w_{K-1} (matches ssm_conv)
+            float sumf = 0.0f;
+            for (int64_t j = 0; j < d_conv - 1; ++j) {
+                sumf += states_c[j] * w_c[j];
+            }
+            sumf += xc * w_c[d_conv - 1];
+            sumf += 0.0f; // matches ssm_conv `sumf += b` with b == 0
+
+            dst_base[s * dst_seq_stride + c] = apply_silu ? (sumf / (1.0f + expf(-sumf))) : sumf;
+
+            // 1-token-shifted ring write-back: [tap1 .. tap_{K-2}, xc]
+            float * out_state = cdst_base + s * cdst_seq_stride + c * (d_conv - 1);
+            for (int64_t j = 0; j < d_conv - 2; ++j) {
+                out_state[j] = states_c[j + 1];
+            }
+            out_state[d_conv - 2] = xc;
+        }
+    }
+}
+
+ void ggml_compute_forward_ssm_conv(
+         const ggml_compute_params * params,
+         ggml_tensor * dst) {
+     switch (dst->src[0]->type) {
+         case GGML_TYPE_F32:
+             {
+-                ggml_compute_forward_ssm_conv_f32(params, dst);
+                if (dst->src[3] != nullptr) {
+                    ggml_compute_forward_ssm_conv_update_f32(params, dst);
+                } else {
+                    ggml_compute_forward_ssm_conv_f32(params, dst);
+                }
+             } break;
+         default:
+             {
+diff --git a/ggml/src/ggml-cuda/ssm-conv.cu b/ggml/src/ggml-cuda/ssm-conv.cu
+index 1463169..e1af1cd 100644
+--- a/ggml/src/ggml-cuda/ssm-conv.cu
+++ b/ggml/src/ggml-cuda/ssm-conv.cu
+@@ -123,6 +123,109 @@ static __global__ void ssm_conv_long_token_f32(const float * __restrict__ src0,
+     }
+ }
+ 
+// Fused decode-time depthwise causal conv1d update (one new token). Each thread owns one channel of
+// one sequence: it assembles the width-d_conv window from the K-1 cached taps (conv_states) plus the
+// current token (x_cur), computes the depthwise conv with the SAME ascending-tap FMA order as
+// ssm_conv_f32 at i==0, optionally folds silu, writes the conv output, and writes the 1-token-shifted
+// ring state back in place into conv_state_dst. Bit-identical to ssm_conv(concat) + silu + copy-back.
+template <bool apply_silu, int d_conv>
+static __global__ void ssm_conv_update_f32(const float * __restrict__ conv_states,
+                                           const float * __restrict__ conv_kernel,
+                                           const float * __restrict__ x_cur,
+                                           float       * __restrict__ conv_state_dst,
+                                           float       * __restrict__ dst,
+                                           const int channels,
+                                           const int states_seq_stride,
+                                           const int w_stride,
+                                           const int x_seq_stride,
+                                           const int dst_seq_stride,
+                                           const int cdst_seq_stride) {
+    const int c = blockIdx.x * blockDim.x + threadIdx.x; // channel
+    const int s = blockIdx.y;                            // sequence
+    if (c >= channels) {
+        return;
+    }
+
+    const float * states_c = conv_states + (int64_t) s * states_seq_stride + (int64_t) c * (d_conv - 1);
+    const float * w_c       = conv_kernel + (int64_t) c * w_stride;
+    const float   xc        = x_cur[(int64_t) s * x_seq_stride + c];
+
+    // window = [tap0 .. tap_{K-2}, current-token], same ordering as the concat(conv_states, x) window
+    float window[d_conv];
+#pragma unroll
+    for (int j = 0; j < d_conv - 1; j++) {
+        window[j] = states_c[j];
+    }
+    window[d_conv - 1] = xc;
+
+    float sumf = 0.0f;
+#pragma unroll
+    for (int j = 0; j < d_conv; j++) {
+        sumf += window[j] * w_c[j];
+    }
+    sumf += 0.0f; // matches ssm_conv_f32 `sumf += b` with b == 0 (qwen35 conv1d has no bias)
+    dst[(int64_t) s * dst_seq_stride + c] = apply_silu ? ggml_cuda_op_silu_single(sumf) : sumf;
+
+    // 1-token-shifted ring write-back: drop the oldest tap, append the current token
+    float * out_state = conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1);
+#pragma unroll
+    for (int j = 0; j < d_conv - 1; j++) {
+        out_state[j] = window[j + 1];
+    }
+}
+
+static void ggml_cuda_op_ssm_conv_update(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+    const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs]
+    const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
+    const ggml_tensor * x_cur       = dst->src[2]; // [channels, 1, n_seqs]
+    const ggml_tensor * cdst        = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
+
+    const int64_t d_conv   = conv_kernel->ne[0];
+    const int64_t channels = conv_kernel->ne[1];
+    const int64_t n_seqs   = conv_states->ne[2];
+    const bool    apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
+
+    GGML_ASSERT(conv_states->type == GGML_TYPE_F32 && conv_kernel->type == GGML_TYPE_F32);
+    GGML_ASSERT(x_cur->type == GGML_TYPE_F32 && cdst->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+    GGML_ASSERT(conv_states->nb[1] == (size_t) (d_conv - 1) * sizeof(float));
+    GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
+    GGML_ASSERT(dst->ne[0] == channels && dst->ne[1] == 1 && dst->ne[2] == n_seqs);
+
+    const float * states_d = (const float *) conv_states->data;
+    const float * w_d      = (const float *) conv_kernel->data;
+    const float * x_d      = (const float *) x_cur->data;
+    float *       cdst_d   = (float *) cdst->data;
+    float *       dst_d    = (float *) dst->data;
+    cudaStream_t  stream   = ctx.stream();
+
+    const int states_seq_stride = (int) (conv_states->nb[2] / sizeof(float));
+    const int w_stride          = (int) (conv_kernel->nb[1] / sizeof(float));
+    const int x_seq_stride      = (int) (x_cur->nb[2] / sizeof(float));
+    const int dst_seq_stride    = (int) (dst->nb[2] / sizeof(float));
+    const int cdst_seq_stride   = (int) (cdst->nb[1] / sizeof(float));
+
+    const int threads = 128;
+    const dim3 blocks((channels + threads - 1) / threads, (unsigned) n_seqs, 1);
+
+    auto launch = [&](auto NC) {
+        constexpr int kNC = decltype(NC)::value;
+        if (apply_silu) {
+            ssm_conv_update_f32<true, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d,
+                (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
+        } else {
+            ssm_conv_update_f32<false, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d,
+                (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
+        }
+    };
+
+    switch (d_conv) {
+        case 3: launch(std::integral_constant<int, 3>{}); break;
+        case 4: launch(std::integral_constant<int, 4>{}); break;
+        default: GGML_ABORT("ssm_conv_update only supports d_conv 3 or 4");
+    }
+}
+
+ template <bool apply_silu>
+ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1,
+                               const int src0_nb2, const int src1_nb1, float * dst, const int dst_nb0, const int dst_nb1,
+@@ -158,6 +261,15 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const floa
+ }
+ 
+ void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * bias_add_node, ggml_tensor * silu_dst) {
+    // Fused decode conv-update-in-place variant (ggml_ssm_conv_update_inplace): discriminated by a
+    // non-null src[3] (the in-place ring write-back target). It folds the concat/transpose/copy-back/
+    // silu of the decode conv path into a single kernel.
+    if (dst->src[3] != nullptr) {
+        GGML_ASSERT(bias_add_node == nullptr && silu_dst == nullptr);
+        ggml_cuda_op_ssm_conv_update(ctx, dst);
+        return;
+    }
+
+     const struct ggml_tensor * src0 = dst->src[0];  // conv_x
+     const struct ggml_tensor * src1 = dst->src[1];  // conv1d.weight
+     const bool fuse_bias = bias_add_node != nullptr;
+diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
+index 1762037..b777748 100644
+--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
+@@ -5555,6 +5555,60 @@ struct ggml_tensor * ggml_ssm_conv(
+     return result;
+ }
+ 
+// ggml_ssm_conv_update_inplace
+//
+// Fused decode-time depthwise causal conv1d update. Reuses GGML_OP_SSM_CONV but is discriminated by a
+// non-null src[3]. The op reads each channel's K-1 cached taps from `conv_states` and the single new
+// token from `x_cur`, computes the depthwise conv (ascending-tap FMA, bit-identical to ggml_ssm_conv),
+// optionally folds SiLU, writes the conv output to dst ([channels, 1, n_seqs]) and writes the
+// 1-token-shifted ring state back in place into `conv_state_dst` (the active sequences' conv-cache
+// slot). op_params[0] carries the fuse_silu flag. Mirrors the 0018/0019 in-place state pattern.
+struct ggml_tensor * ggml_ssm_conv_update_inplace(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * conv_states,
+        struct ggml_tensor  * conv_kernel,
+        struct ggml_tensor  * x_cur,
+        struct ggml_tensor  * conv_state_dst,
+        bool                  fuse_silu) {
+    GGML_ASSERT(ggml_is_3d(conv_states));
+    GGML_ASSERT(ggml_is_matrix(conv_kernel));
+    GGML_ASSERT(ggml_is_3d(x_cur));
+
+    const int64_t d_conv   = conv_kernel->ne[0];
+    const int64_t channels = conv_kernel->ne[1];
+    const int64_t n_seqs   = conv_states->ne[2];
+
+    GGML_ASSERT(conv_states->type    == GGML_TYPE_F32);
+    GGML_ASSERT(conv_kernel->type    == GGML_TYPE_F32);
+    GGML_ASSERT(x_cur->type          == GGML_TYPE_F32);
+    GGML_ASSERT(conv_state_dst != NULL && conv_state_dst->type == GGML_TYPE_F32);
+
+    // conv_states: [K-1, channels, n_seqs], contiguous taps per channel
+    GGML_ASSERT(conv_states->ne[0] == d_conv - 1);
+    GGML_ASSERT(conv_states->ne[1] == channels);
+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+    // x_cur: single decode token per sequence
+    GGML_ASSERT(x_cur->ne[0] == channels);
+    GGML_ASSERT(x_cur->ne[1] == 1);
+    GGML_ASSERT(x_cur->ne[2] == n_seqs);
+    // conv_state_dst: [(K-1)*channels, n_seqs] in-place ring write target
+    GGML_ASSERT(conv_state_dst->ne[0] == (d_conv - 1) * channels);
+    GGML_ASSERT(conv_state_dst->ne[1] >= n_seqs);
+    GGML_ASSERT(conv_state_dst->nb[0] == sizeof(float));
+
+    struct ggml_tensor * result = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
+
+    ggml_set_op_params_i32(result, 0, fuse_silu ? 1 : 0);
+
+    result->op     = GGML_OP_SSM_CONV;
+    result->src[0] = conv_states;
+    result->src[1] = conv_kernel;
+    result->src[2] = x_cur;
+    result->src[3] = conv_state_dst;
+
+    return result;
+}
+
+ // ggml_ssm_scan
+ 
+ struct ggml_tensor * ggml_ssm_scan(
+diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
+index 194e611..0eee804 100644
+--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
+@@ -524,6 +524,57 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
+     return conv_input;
+ }
+ 
+// Fused decode conv path (patch 0021). Reads the active sequences' prior conv-state taps (the same
+// cheap build_rs gather as build_conv_state), then fuses the depthwise conv + silu + the 1-token-
+// shifted ring write-back into a single ggml_ssm_conv_update_inplace op. This removes the concat
+// (concat_cont), the transpose materialization, the scalar copy-back (cpy_scalar) and the separate
+// silu of the decode conv path. The op reads from the (disjoint) materialized taps and writes the
+// new ring state in place into the cache slot at kv_head -- exactly the slot the baseline ggml_cpy
+// wrote -- so it is bit-identical to build_conv_state + ggml_ssm_conv + ggml_silu.
+ggml_tensor * llm_build_delta_net_base::build_conv_state_fused(
+        llm_graph_input_rs * inp,
+        ggml_tensor *        conv_states_all,
+        ggml_tensor *        qkv_mixed,
+        ggml_tensor *        conv_kernel,
+        int64_t              conv_kernel_size,
+        int64_t              conv_channels,
+        int                  il) {
+    const auto * mctx_cur = inp->mctx;
+    const auto   kv_head  = mctx_cur->get_head();
+
+    const int64_t n_seqs       = ubatch.n_seqs;
+    const int64_t n_seq_tokens = ubatch.n_seq_tokens;
+
+    GGML_ASSERT(n_seq_tokens == 1);        // single-token decode only
+    GGML_ASSERT(cparams.n_rs_seq == 0);    // no rollback splits on this path
+
+    // Prior conv-state taps for the active sequences: [K-1, conv_channels, n_seqs]. Same get_rows
+    // gather as the baseline build_conv_state read (tiny; not one of the eliminated buckets).
+    ggml_tensor * conv_states = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs);
+    conv_states = ggml_reshape_3d(ctx0, conv_states, conv_kernel_size - 1, conv_channels, n_seqs);
+    cb(conv_states, "conv_states_reshaped", il);
+
+    // Current token, native (non-transposed) qkv_mixed: [conv_channels, 1, n_seqs].
+    ggml_tensor * x_cur = ggml_reshape_3d(ctx0, qkv_mixed, conv_channels, n_seq_tokens, n_seqs);
+
+    // In-place ring write-back target = the active sequences' conv-cache slot at kv_head, exactly the
+    // destination the baseline ggml_cpy wrote to (s_slot == 0).
+    const int64_t row_count = (conv_kernel_size - 1) * conv_channels;
+    const size_t  row_size  = ggml_row_size(conv_states_all->type, row_count);
+    ggml_tensor * conv_state_dst =
+        ggml_view_2d(ctx0, conv_states_all, row_count, n_seqs, conv_states_all->nb[1], kv_head * row_size);
+    cb(conv_state_dst, "conv_state_update", il);
+
+    ggml_tensor * conv_output =
+        ggml_ssm_conv_update_inplace(ctx0, conv_states, conv_kernel, x_cur, conv_state_dst, /*fuse_silu=*/true);
+    cb(conv_output, "conv_output_silu", il);
+
+    // the ring write is a side effect of the op; pull the op into the graph via the output
+    ggml_build_forward_expand(gf, conv_output);
+
+    return conv_output; // [conv_channels, 1, n_seqs], already silu'd
+}
+
+ // Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
+ // gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
+ // ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
+diff --git a/src/models/models.h b/src/models/models.h
+index 98b89e9..da0dd86 100644
+--- a/src/models/models.h
+++ b/src/models/models.h
+@@ -76,6 +76,20 @@ struct llm_build_delta_net_base : public llm_graph_context {
+             int64_t              conv_channels,
+             int                  il);
+ 
+    // Fused decode-time conv path (patch 0021). Replaces the concat + transpose + ssm_conv + silu +
+    // copy-back chain with a single ggml_ssm_conv_update_inplace op that reads the cached K-1 taps and
+    // the current token, computes the depthwise conv, folds silu, and writes the 1-token-shifted ring
+    // state back in place. Decode-only (n_seq_tokens == 1, n_rs_seq == 0). Returns the silu'd conv
+    // output: (conv_channels, 1, n_seqs). Bit-identical to the build_conv_state + ggml_ssm_conv chain.
+    ggml_tensor * build_conv_state_fused(
+            llm_graph_input_rs * inp,
+            ggml_tensor *        conv_states_all,
+            ggml_tensor *        qkv_mixed,
+            ggml_tensor *        conv_kernel,
+            int64_t              conv_kernel_size,
+            int64_t              conv_channels,
+            int                  il);
+
+     // run delta-net attention and write the new recurrent state(s) back to ssm_states_all
+     // s: (head_v_dim, head_v_dim, num_v_heads, n_seqs); returns output: (head_v_dim, num_v_heads, n_seq_tokens, n_seqs)
+     ggml_tensor * build_recurrent_attn(
+diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
+index 0874c43..b6dcc5f 100644
+--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
+@@ -383,15 +383,26 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
+     const int64_t conv_kernel_size = conv_kernel->ne[0];
+     const int64_t conv_channels    = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
+ 
+-    ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+    // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
+    // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
+    // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
+    const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
+
+    ggml_tensor * conv_qkv_mix;
+    if (conv_decode_fused) {
+        conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
+                conv_kernel_size, conv_channels, il);
+    } else {
+        ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ 
+-    ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+-    cb(conv_output_proper, "conv_output_raw", il);
+        ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+        cb(conv_output_proper, "conv_output_raw", il);
+ 
+-    ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+-    cb(conv_output_silu, "conv_output_silu", il);
+        ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+        cb(conv_output_silu, "conv_output_silu", il);
+ 
+-    ggml_tensor * conv_qkv_mix = conv_output_silu;
+        conv_qkv_mix = conv_output_silu;
+    }
+ 
+     // Calculate the total conv dimension
+     int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
+diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
+index 1f6f643..c7c7c44 100644
+--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
+@@ -407,15 +407,26 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
+     const int64_t conv_kernel_size = conv_kernel->ne[0];
+     const int64_t conv_channels    = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
+ 
+-    ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+    // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
+    // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
+    // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
+    const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
+
+    ggml_tensor * conv_qkv_mix;
+    if (conv_decode_fused) {
+        conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
+                conv_kernel_size, conv_channels, il);
+    } else {
+        ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ 
+-    ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+-    cb(conv_output_proper, "conv_output_raw", il);
+        ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+        cb(conv_output_proper, "conv_output_raw", il);
+ 
+-    ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+-    cb(conv_output_silu, "conv_output_silu", il);
+        ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+        cb(conv_output_silu, "conv_output_silu", il);
+ 
+-    ggml_tensor * conv_qkv_mix = conv_output_silu;
+        conv_qkv_mix = conv_output_silu;
+    }
+ 
+     // Calculate the total conv dimension
+     int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
+diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
+index bfdf026..92749d1 100644
+--- a/src/models/qwen3next.cpp
+++ b/src/models/qwen3next.cpp
+@@ -434,19 +434,30 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
+     const int64_t conv_kernel_size = conv_kernel->ne[0];
+     const int64_t conv_channels    = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
+ 
+-    ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+    // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
+    // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
+    // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
+    const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
+
+    ggml_tensor * conv_qkv_mix;
+    if (conv_decode_fused) {
+        conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
+                conv_kernel_size, conv_channels, il);
+    } else {
+        ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ 
+-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
+-    cb(state, "state_predelta", il);
+        ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+        cb(conv_output_proper, "conv_output_raw", il);
+ 
+-    ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+-    cb(conv_output_proper, "conv_output_raw", il);
+        ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+        cb(conv_output_silu, "conv_output_silu", il);
+ 
+-    ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+-    cb(conv_output_silu, "conv_output_silu", il);
+        conv_qkv_mix = conv_output_silu;
+    }
+ 
+-    ggml_tensor * conv_qkv_mix = conv_output_silu;
+    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
+    cb(state, "state_predelta", il);
+ 
+     // Calculate the total conv dimension
+     int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index 291c275..c7348d6 100644
+--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
+@@ -3748,6 +3748,43 @@ struct test_ssm_conv_bias_silu : public test_case {
+     }
+ };
+ 
+// GGML_OP_SSM_CONV fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021).
+// Validates the conv + silu output (dst) against the CPU reference across backends. The 1-token-
+// shifted ring write-back to conv_state_dst is a side effect (validated end-to-end by the greedy
+// md5 gate); here it just exercises the in-place write target as an op src.
+struct test_ssm_conv_update : public test_case {
+    const int64_t d_conv;
+    const int64_t channels;
+    const int64_t n_seqs;
+
+    std::string op_desc(ggml_tensor * t) override {
+        GGML_UNUSED(t);
+        return "SSM_CONV_UPDATE";
+    }
+
+    std::string vars() override {
+        return VARS_TO_STR3(d_conv, channels, n_seqs);
+    }
+
+    test_ssm_conv_update(int64_t d_conv = 4, int64_t channels = 256, int64_t n_seqs = 4)
+        : d_conv(d_conv), channels(channels), n_seqs(n_seqs) {}
+
+    ggml_tensor * build_graph(ggml_context * ctx) override {
+        ggml_tensor * conv_states    = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_conv - 1, channels, n_seqs);
+        ggml_tensor * conv_kernel    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, channels);
+        ggml_tensor * x_cur          = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
+        ggml_tensor * conv_state_dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (d_conv - 1) * channels, n_seqs);
+        ggml_set_name(conv_states, "conv_states");
+        ggml_set_name(conv_kernel, "conv_kernel");
+        ggml_set_name(x_cur, "x_cur");
+        ggml_set_name(conv_state_dst, "conv_state_dst");
+
+        ggml_tensor * out = ggml_ssm_conv_update_inplace(ctx, conv_states, conv_kernel, x_cur, conv_state_dst, true);
+        ggml_set_name(out, "out");
+        return out;
+    }
+};
+
+ // GGML_OP_SSM_SCAN
+ struct test_ssm_scan : public test_case {
+     const ggml_type type;
+@@ -8355,6 +8392,16 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+         }
+     }
+ 
+    // fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021). channels must be
+    // a multiple of 128 for the CUDA SSM_CONV supports_op gate.
+    for (int64_t d_conv : {3, 4}) {
+        for (int64_t channels : {256, 3328}) {
+            for (int64_t n_seqs : {1, 4, 32, 128}) {
+                test_cases.emplace_back(new test_ssm_conv_update(d_conv, channels, n_seqs));
+            }
+        }
+    }
+
+     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1, 1024, 1, 32, 4)); // Mamba-1
+     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 128, 64, 16, 2, 32, 4)); // Mamba-2
+     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 256, 64,  8, 2, 32, 4)); // Falcon-H1
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch
@@ -0,0 +1,403 @@
+From 8a3229f41d5b712e87901796dfae3faee1f2f07d Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 20:32:55 +0200
+Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet decode
+ occupancy/coalescing retune (patch 0022)
+
+Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
+kernel. The f32 verdict: vLLM carries the gated-DeltaNet temporal state in float32
+and moves the same ~805 MB/call as llama, so the gap was pure DRAM bandwidth
+efficiency on equal bytes (llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak).
+The lever is therefore a latency-coverage retune that keeps the per-column f32
+reduction/FMA order byte-identical (md5-gateable). The bf16-state plan stays shelved.
+
+Column folding: two new template params NUM_WARPS (default 4) and COLS_PER_WARP
+(default 1). Each warp now owns COLS_PER_WARP columns of the 128x128 recurrent
+state instead of 1, looping the existing per-column body over col, col+NUM_WARPS,
+... within a per-block column tile of NUM_WARPS*COLS_PER_WARP columns;
+grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The S_v rows of every column stay sharded
+across the lanes by the same strided i = r*warp_size + lane mapping, and every
+column's per-lane FMA accumulation and warp_reduce_sum butterfly are byte-for-byte
+unchanged; only the (warp,block)->column assignment and visit order differ, which a
+column's value provably does not depend on (columns are fully independent). This
+raises per-warp memory-level parallelism ~COLS_PER_WARP-fold (independent
+state-load bursts before any reduction + interleaved butterfly reductions hiding
+each other's shfl latency), covering more DRAM latency on this bandwidth-bound
+kernel. Every global access stays identically coalesced, so it is a scheduling /
+latency-coverage win, not a coalescing change. The forbidden float4 state load
+(which would repartition a lane to 4 contiguous rows and change the reduction
+grouping) is NOT done, so the md5 stays invariant. The S_v=128 tile is
+env-selectable (GDN_NW / GDN_CPW) for one-build re-tuning; the runtime default is the
+measured GB10 winner (NUM_WARPS=16, COLS_PER_WARP=8), not the 4x1 template default.
+
+GB10 (CUDA 13, sm_121, nsys CUPTI timing; HW counters are permission-blocked):
+gated_delta_net B=128 decode call (805.3 MB f32 R+W) 4.02 -> 3.49 ms/call,
+200.3 -> 230.9 GB/s = 73.4% -> 84.6% of 273 GB/s peak (now above vLLM's 82.4%;
+102.6% of vLLM's recurrence bandwidth). decode S_TG t/s (npp128 ntg128, -fa on):
+dense 27b npl128 335.9 -> 373.2 (+11.1%), npl32 199.2 -> 207.6 (+4.2%); MoE
+35b-a3b npl128 688.4 -> 745.7 (+8.3%), npl32 420.6 -> 440.0 (+4.6%). Prefill
+unchanged.
+
+Bit-exact: greedy --temp 0 --seed 1 md5 byte-identical to the 0021 baseline on
+both q36-27b-nvfp4 and q36-35b-a3b-nvfp4 (winner 16x8 and 4x1 control);
+test-backend-ops -o GATED_DELTA_NET 36/36 PASS.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/gated_delta_net.cu | 236 +++++++++++++++++---------
+ 1 file changed, 157 insertions(+), 79 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
+index 86d5e2a..d071d5a 100644
+--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
+@@ -1,6 +1,8 @@
+ #include "gated_delta_net.cuh"
+ #include "ggml-cuda/common.cuh"
+ 
+#include <cstdlib>
+
+ // Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
+ // disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+ // destination slot by the recurrence kernel and are skipped here. One block per sequence.
+@@ -29,8 +31,22 @@ static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * i
+     gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
+ }
+ 
+-template <int S_v, bool KDA, bool keep_rs_t>
+-__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
+// Occupancy/coalescing retune (patch 0022). Each warp owns COLS_PER_WARP columns of the recurrent
+// state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a
+// per-block column tile of size NUM_WARPS*COLS_PER_WARP. The S_v rows of every column stay sharded
+// across the lanes by the SAME strided mapping i = r*warp_size + lane, and every column's per-lane
+// FMA accumulation and warp_reduce_sum<warp_size> butterfly are byte-for-byte unchanged. Only the
+// (warp,block)->column assignment and the order a warp visits its columns differ, and a column's
+// f32 value provably does not depend on either (columns are fully independent: column c reads only
+// its own S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). So the result
+// and the stored final state are bit-identical to the COLS_PER_WARP==1 baseline (md5-gateable),
+// while per-warp memory-level parallelism rises ~COLS_PER_WARP-fold (COLS_PER_WARP independent
+// state-load bursts issued before any reduction, and the independent butterfly reductions interleave
+// to hide each other's shfl latency) which covers more DRAM latency on this bandwidth-bound kernel.
+// Every individual global access stays IDENTICALLY coalesced (32 consecutive lanes -> one 128B
+// sector), so this is a latency-coverage / scheduling win, not a coalescing change.
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS = 4, int COLS_PER_WARP = 1, int MIN_BLOCKS = 2>
+__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * NUM_WARPS, MIN_BLOCKS)
+ gated_delta_net_cuda(const float * q,
+                                      const float * k,
+                                      const float * v,
+@@ -59,9 +75,9 @@ gated_delta_net_cuda(const float * q,
+                                      int           rs_head) {
+     const uint32_t h_idx    = blockIdx.x;
+     const uint32_t sequence = blockIdx.y;
+-    // each warp owns one column, using warp-level primitives to reduce across rows
+    // each warp owns COLS_PER_WARP columns, using warp-level primitives to reduce across rows.
+     const int      lane     = threadIdx.x;
+-    const int      col      = blockIdx.z * blockDim.y + threadIdx.y;
+    const int      col_base = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y;
+ 
+     const uint32_t iq1 = fastmodulo(h_idx, neqk1_magic);
+     const uint32_t iq3 = fastdiv(sequence, rq3_magic);
+@@ -86,20 +102,25 @@ gated_delta_net_cuda(const float * q,
+     // writing the same slot per block (identity) is race-free.
+     const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
+         ? state_dst : curr_state;
+-    read_state += state_in_offset + col * S_v;
+    read_state += state_in_offset;
+     attn_data += (sequence * n_tokens * H + h_idx) * S_v;
+ 
+     constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
+     static_assert(S_v % warp_size == 0, "S_v must be a multiple of warp_size");
+     constexpr int rows_per_lane = (S_v + warp_size - 1) / warp_size;
+-    float         s_shard[rows_per_lane];
+-    // state is stored transposed: M[col][i] = S[i][col], row col is contiguous
+    // per-column register shard of the recurrent state; state is stored transposed: M[col][i] = S[i][col].
+    float         s_shard[COLS_PER_WARP][rows_per_lane];
+ 
+     ggml_cuda_pdl_sync();
+ #pragma unroll
+-    for (int r = 0; r < rows_per_lane; r++) {
+-        const int i = r * warp_size + lane;
+-        s_shard[r]  = read_state[i];
+    for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+        const int     col = col_base + cc * NUM_WARPS;
+        const float * rs  = read_state + col * S_v;
+#pragma unroll
+        for (int r = 0; r < rows_per_lane; r++) {
+            const int i   = r * warp_size + lane;
+            s_shard[cc][r] = rs[i];
+        }
+     }
+ 
+     for (int t = 0; t < n_tokens; t++) {
+@@ -113,7 +134,7 @@ gated_delta_net_cuda(const float * q,
+ 
+         const float beta_val = *beta_t;
+ 
+-        // Cache k and q in registers
+        // Cache k and q in registers (shared across the COLS_PER_WARP columns of this warp).
+         float k_reg[rows_per_lane];
+         float q_reg[rows_per_lane];
+ #pragma unroll
+@@ -126,59 +147,69 @@ gated_delta_net_cuda(const float * q,
+         if constexpr (!KDA) {
+             const float g_val = expf(*g_t);
+ 
+-            // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
+-            float kv_shard = 0.0f;
+ #pragma unroll
+-            for (int r = 0; r < rows_per_lane; r++) {
+-                kv_shard += s_shard[r] * k_reg[r];
+-            }
+-            float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+            for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+                const int col = col_base + cc * NUM_WARPS;
+ 
+-            // delta[col] = (v[col] - g * kv[col]) * beta
+-            float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
+                // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
+                float kv_shard = 0.0f;
+#pragma unroll
+                for (int r = 0; r < rows_per_lane; r++) {
+                    kv_shard += s_shard[cc][r] * k_reg[r];
+                }
+                float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+ 
+-            // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
+-            // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+-            float attn_partial = 0.0f;
+                // delta[col] = (v[col] - g * kv[col]) * beta
+                float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
+
+                // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
+                // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+                float attn_partial = 0.0f;
+ #pragma unroll
+-            for (int r = 0; r < rows_per_lane; r++) {
+-                s_shard[r]  = g_val * s_shard[r] + k_reg[r] * delta_col;
+-                attn_partial += s_shard[r] * q_reg[r];
+-            }
+                for (int r = 0; r < rows_per_lane; r++) {
+                    s_shard[cc][r]  = g_val * s_shard[cc][r] + k_reg[r] * delta_col;
+                    attn_partial += s_shard[cc][r] * q_reg[r];
+                }
+ 
+-            float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+                float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+ 
+-            if (lane == 0) {
+-                attn_data[col] = attn_col * scale;
+                if (lane == 0) {
+                    attn_data[col] = attn_col * scale;
+                }
+             }
+         } else {
+-            // kv[col] = sum_i g[i] * S[i][col] * k[i]
+-            float kv_shard = 0.0f;
+ #pragma unroll
+-            for (int r = 0; r < rows_per_lane; r++) {
+-                const int i = r * warp_size + lane;
+-                kv_shard += expf(g_t[i]) * s_shard[r] * k_reg[r];
+-            }
+            for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+                const int col = col_base + cc * NUM_WARPS;
+
+                // kv[col] = sum_i g[i] * S[i][col] * k[i]
+                float kv_shard = 0.0f;
+#pragma unroll
+                for (int r = 0; r < rows_per_lane; r++) {
+                    const int i = r * warp_size + lane;
+                    kv_shard += expf(g_t[i]) * s_shard[cc][r] * k_reg[r];
+                }
+ 
+-            float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+                float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+ 
+-            // delta[col] = (v[col] - kv[col]) * beta
+-            float delta_col = (v_t[col] - kv_col) * beta_val;
+                // delta[col] = (v[col] - kv[col]) * beta
+                float delta_col = (v_t[col] - kv_col) * beta_val;
+ 
+-            // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
+-            // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+-            float attn_partial = 0.0f;
+                // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
+                // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+                float attn_partial = 0.0f;
+ #pragma unroll
+-            for (int r = 0; r < rows_per_lane; r++) {
+-                const int i = r * warp_size + lane;
+-                s_shard[r]  = expf(g_t[i]) * s_shard[r] + k_reg[r] * delta_col;
+-                attn_partial += s_shard[r] * q_reg[r];
+-            }
+                for (int r = 0; r < rows_per_lane; r++) {
+                    const int i = r * warp_size + lane;
+                    s_shard[cc][r]  = expf(g_t[i]) * s_shard[cc][r] + k_reg[r] * delta_col;
+                    attn_partial += s_shard[cc][r] * q_reg[r];
+                }
+ 
+-            float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+                float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+ 
+-            if (lane == 0) {
+-                attn_data[col] = attn_col * scale;
+                if (lane == 0) {
+                    attn_data[col] = attn_col * scale;
+                }
+             }
+         }
+ 
+@@ -190,11 +221,15 @@ gated_delta_net_cuda(const float * q,
+             const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
+             const int target_slot = (int) n_tokens - 1 - t;
+             if (target_slot >= 0 && target_slot < K) {
+-                float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
+ #pragma unroll
+-                for (int r = 0; r < rows_per_lane; r++) {
+-                    const int i = r * warp_size + lane;
+-                    curr_state[col * S_v + i] = s_shard[r];
+                for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+                    const int col = col_base + cc * NUM_WARPS;
+                    float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
+#pragma unroll
+                    for (int r = 0; r < rows_per_lane; r++) {
+                        const int i = r * warp_size + lane;
+                        curr_state[col * S_v + i] = s_shard[cc][r];
+                    }
+                 }
+             }
+         }
+@@ -202,13 +237,48 @@ gated_delta_net_cuda(const float * q,
+ 
+     if constexpr (!keep_rs_t) {
+ #pragma unroll
+-        for (int r = 0; r < rows_per_lane; r++) {
+-            const int i          = r * warp_size + lane;
+-            state[col * S_v + i] = s_shard[r];
+        for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+            const int col = col_base + cc * NUM_WARPS;
+#pragma unroll
+            for (int r = 0; r < rows_per_lane; r++) {
+                const int i          = r * warp_size + lane;
+                state[col * S_v + i] = s_shard[cc][r];
+            }
+         }
+     }
+ }
+ 
+// Default column-folding tile for the S_v==128 decode/prefill path (the GDN head dim of this model).
+// Measured winner of the bit-exact occupancy sweep (patch 0022). Override at runtime for the sweep
+// via GDN_NW / GDN_CPW; all selectable variants are bit-identical, only %peak differs.
+#ifndef GDN_DEFAULT_NW
+#define GDN_DEFAULT_NW 16
+#endif
+#ifndef GDN_DEFAULT_CPW
+#define GDN_DEFAULT_CPW 8
+#endif
+
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS, int COLS_PER_WARP, int MIN_BLOCKS>
+static void launch_gdn_variant(
+        const float * q_d, const float * k_d, const float * v_d,
+        const float * g_d, const float * b_d, const float * s_d,
+        float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head,
+        int64_t H, int64_t n_tokens, int64_t n_seqs,
+        int64_t sq1, int64_t sq2, int64_t sq3,
+        int64_t sv1, int64_t sv2, int64_t sv3,
+        int64_t sb1, int64_t sb2, int64_t sb3,
+        const uint3 neqk1_magic, const uint3 rq3_magic,
+        float scale, int K, int warp_size, cudaStream_t stream) {
+    static_assert(S_v % (NUM_WARPS * COLS_PER_WARP) == 0, "NUM_WARPS*COLS_PER_WARP must divide S_v");
+    dim3 grid_dims(H, n_seqs, S_v / (NUM_WARPS * COLS_PER_WARP));
+    dim3 block_dims(warp_size <= S_v ? warp_size : S_v, NUM_WARPS, 1);
+    const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
+    ggml_cuda_kernel_launch(gated_delta_net_cuda<S_v, KDA, keep_rs_t, NUM_WARPS, COLS_PER_WARP, MIN_BLOCKS>, launch_params,
+        q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+        n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+        sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+}
+
+ template <bool KDA, bool keep_rs_t>
+ static void launch_gated_delta_net(
+         const float * q_d, const float * k_d, const float * v_d,
+@@ -223,47 +293,55 @@ static void launch_gated_delta_net(
+         float scale, int K, cudaStream_t stream) {
+     //TODO: Add chunked kernel for even faster pre-fill
+     const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
+-    const int num_warps = 4;
+-    dim3      grid_dims(H, n_seqs, (S_v + num_warps - 1) / num_warps);
+-    dim3      block_dims(warp_size <= S_v ? warp_size : S_v, num_warps, 1);
+ 
+     const uint3 neqk1_magic = init_fastdiv_values(neqk1);
+     const uint3 rq3_magic   = init_fastdiv_values(rq3);
+ 
+-    int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+#define GDN_LAUNCH_ARGS \
+        q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
+        H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
+        neqk1_magic, rq3_magic, scale, K, warp_size, stream
+ 
+-    const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
+     switch (S_v) {
+         case 16:
+-            ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
+-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+            launch_gdn_variant<16, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
+             break;
+         case 32:
+-            ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
+-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+            launch_gdn_variant<32, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
+             break;
+-        case 64: {
+-            ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
+-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+        case 64:
+            launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
+             break;
+-        }
+         case 128: {
+-            ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
+-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+            // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP columns per warp
+            // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is
+            // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every
+            // selectable {num_warps, cols} is bit-identical, so the sweep cannot change the md5).
+            static const int gdn_nw  = []{ const char * e = getenv("GDN_NW");  return e ? atoi(e) : GDN_DEFAULT_NW;  }();
+            static const int gdn_cpw = []{ const char * e = getenv("GDN_CPW"); return e ? atoi(e) : GDN_DEFAULT_CPW; }();
+            // NUM_WARPS in {4,8,16} x COLS_PER_WARP ladder (all <=512 threads/block, no 1024-thread
+            // .minnctapersm warnings). Measured GB10 %peak: (4,1)=73 baseline ... (16,4)=82 ...
+            // (16,8)=84.7 winner ~ tied with (8,8)/(8,16)/(32,4); the plateau is just above vLLM (82.4).
+            if      (gdn_nw == 4  && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 4,  1, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 4  && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 4,  2, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 4  && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 4,  4, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 8  && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 8,  1, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 8  && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 8,  2, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 8  && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 8,  4, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 8  && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 8,  8, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 16 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 16, 1, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 16 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 16, 2, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 16 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 16, 4, 2>(GDN_LAUNCH_ARGS);
+            else if (gdn_nw == 16 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 16, 8, 2>(GDN_LAUNCH_ARGS);
+            else                                   launch_gdn_variant<128, KDA, keep_rs_t, GDN_DEFAULT_NW, GDN_DEFAULT_CPW, 2>(GDN_LAUNCH_ARGS);
+             break;
+         }
+         default:
+             GGML_ABORT("fatal error");
+             break;
+     }
+
+#undef GDN_LAUNCH_ARGS
+ }
+ 
+ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+-- 
+2.43.0
+
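Reviewer sketch for patch 0022 (not part of the patch series). The kernel folds `COLS_PER_WARP` columns into each warp with `col = col_base + cc * NUM_WARPS`, and the launcher sizes `grid_dims.z` as `S_v / (NUM_WARPS * COLS_PER_WARP)`. The host-side model below checks that this tiling visits every state column exactly once. The definition of `col_base` is not visible in this hunk, so the form used here (`block_z * tile + warp`) is an assumption, and the function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side model of the 0022 column tiling. Returns, for each of
// the S_v state columns, how many (block, warp, cc) triples map onto it.
std::vector<int> gdn_column_hits(int S_v, int num_warps, int cols_per_warp) {
    const int tile = num_warps * cols_per_warp;
    // Mirrors the static_assert in launch_gdn_variant.
    assert(S_v % tile == 0);

    std::vector<int> hits(S_v, 0);
    for (int bz = 0; bz < S_v / tile; ++bz) {            // grid_dims.z
        for (int warp = 0; warp < num_warps; ++warp) {   // block_dims.y
            // ASSUMPTION: col_base is defined outside the visible hunk; this
            // form is only one layout consistent with the grid dimensions.
            const int col_base = bz * tile + warp;
            for (int cc = 0; cc < cols_per_warp; ++cc) {
                const int col = col_base + cc * num_warps; // as in the kernel
                hits[col] += 1;
            }
        }
    }
    return hits;
}

// True when every column is owned by exactly one (block, warp, cc) triple.
bool covers_each_column_once(int S_v, int num_warps, int cols_per_warp) {
    for (int h : gdn_column_hits(S_v, num_warps, cols_per_warp)) {
        if (h != 1) {
            return false;
        }
    }
    return true;
}
```

Under that assumption, the default `(16, 8)` tile and the `(4, 1)` baseline both partition the 128 columns without overlap, which is consistent with the patch's claim that the variants differ only in scheduling.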
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch
@@ -0,0 +1,144 @@
+From f7409c2de2868a6a048d3c333329468b4cc9e483 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 23:47:25 +0200
+Subject: [PATCH] feat(paged): qwen35moe NVFP4 activation-quantize de-dup
+ (patch 0023)
+
+Bit-exact decode/prefill lever for the MoE (qwen3.5moe) NVFP4 path. ggml's
+mul_mat_id quantizes the EXPERT-GATHERED activation rows (ne11_flat =
+ne12*n_expert_used). For the broadcast up/gate projections (ne11 == 1) every
+expert of a token receives the SAME token activation, so the stock path
+re-quantizes each token n_expert_used times. quantize_mmq_nvfp4 produces each
+block as a pure per-thread function of its 16 consecutive inputs (no cross-thread
+reduction), so the gathered blocks are byte-identical across the experts.
+
+Lever: when ne11 == 1, quantize the ne12 UNIQUE token activations once, then
+gather the resulting block_fp4_mmq rows into the expert-gathered layout keyed by
+ids_src1 with a coalesced uint4 copy (block_fp4_mmq == 9 uint4 == 144 B). Pure
+byte copy of identical blocks, so the gathered buffer is byte-for-byte identical
+to re-quantizing each gathered row; the GEMM is untouched. down_proj
+(ne11 == n_expert_used, distinct per expert) keeps the stock path.
+
+Measured GB10 (sm_121a), on top of HEAD 8a3229f (patch 0022), q36-35b-a3b-nvfp4:
+- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), new
+  gather_mmq_fp4 +32 ms; net -379 ms of decode GPU-time.
+- S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%; prefill T_PP -4%.
+- Dense q36-27b-nvfp4 byte-flat (no mul_mat_id): 373.24 t/s unchanged.
+
+Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022):
+  q36-27b-nvfp4     5951a5b4d624ce891e22ab5fca9bc439 (unchanged)
+  q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (de-dup on == off)
+  test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805.
+
+On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores the stock re-quantize path.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cu       | 21 +++++++++++++++++--
+ ggml/src/ggml-cuda/quantize.cu  | 37 +++++++++++++++++++++++++++++++++
+ ggml/src/ggml-cuda/quantize.cuh |  4 ++++
+ 3 files changed, 60 insertions(+), 2 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
+index e1add5e..9933fa6 100644
+--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
+@@ -1,3 +1,4 @@
+#include <cstdlib>
+ #include "common.cuh"
+ #include "mmq.cuh"
+ #include "quantize.cuh"
+@@ -197,8 +198,24 @@ void ggml_cuda_mul_mat_q(
+         const int64_t s13 = src1->nb[3] / ts_src1;
+ 
+         if (use_native_fp4) {
+-            quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
+-                                    ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+            // 0023: de-dup the broadcast (up/gate) quantize. ne11==1 means src1 is shared
+            // across experts, so quantize the ne12 unique tokens once and gather the blocks.
+            static const bool moe_quant_dedup = []{
+                const char * e = getenv("GGML_CUDA_MOE_QUANT_DEDUP");
+                return e ? atoi(e) != 0 : true;  // 0023: on by default; GGML_CUDA_MOE_QUANT_DEDUP=0 disables
+            }();
+            if (moe_quant_dedup && ne11 == 1) {
+                const size_t nbytes_unique = ne12*ne10_padded * sizeof(block_q8_1)/QK8_1 +
+                    get_mmq_x_max_host(cc)*sizeof(block_q8_1_mmq);
+                ggml_cuda_pool_alloc<char> src1_unique(ctx.pool(), nbytes_unique);
+                quantize_mmq_fp4_cuda(src1_d, nullptr, src1_unique.get(), src0->type, ne10, s12, 0, 0,
+                                        ne10_padded, ne12, 1, 1, stream);
+                gather_mmq_fp4_cuda(src1_unique.get(), ids_src1.get(), src1_q8_1.get(),
+                                    ne11_flat, ne12, ne10_padded, stream);
+            } else {
+                quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
+                                        ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+            }
+         } else {
+             quantize_mmq_q8_1_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
+                                    ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+diff --git a/ggml/src/ggml-cuda/quantize.cu b/ggml/src/ggml-cuda/quantize.cu
+index 39a500a..a7fd86f 100644
+--- a/ggml/src/ggml-cuda/quantize.cu
+++ b/ggml/src/ggml-cuda/quantize.cu
+@@ -419,6 +419,43 @@ void quantize_mmq_q8_1_cuda(
+     }
+ }
+ 
+// MoE NVFP4 quantize de-dup (0023): for the broadcast (up/gate) expert matmuls every
+// gathered row references one of ne12 unique token activations, so the stock path
+// re-quantizes each token n_expert_used times. Quantize the unique tokens once, then copy
+// the resulting block_fp4_mmq rows into the expert-gathered layout keyed by ids. This is a
+// pure byte copy of identical blocks => the gathered buffer is byte-identical to stock.
+static __global__ void gather_mmq_fp4(
+        const uint4 * __restrict__ unique, const int32_t * __restrict__ ids,
+        uint4 * __restrict__ gathered, const int ne11_flat, const int ne12_unique,
+        const int64_t total_words) {
+    constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4)); // 9 uint4 per 144B block
+    const int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
+    if (t >= total_words) {
+        return;
+    }
+    const int     w   = (int) (t % W);
+    const int64_t ib  = t / W;                 // destination block index = kb*ne11_flat + j
+    const int     j   = (int) (ib % ne11_flat);
+    const int     kb  = (int) (ib / ne11_flat);
+    const int     src = ids[j];
+    const int64_t ib_u = (int64_t) kb * ne12_unique + src;
+    gathered[t] = unique[ib_u * W + w];
+}
+
+void gather_mmq_fp4_cuda(
+        const void * unique, const int32_t * ids, void * gathered,
+        int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded, cudaStream_t stream) {
+    const int     blocks_per_col = (int) ((ne0_padded + QK_K - 1) / QK_K);
+    constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4));
+    const int64_t total_words = ne11_flat * (int64_t) blocks_per_col * W;
+    const int     bs = 256;
+    const dim3    block_size(bs, 1, 1);
+    const dim3    num_blocks((unsigned) ((total_words + bs - 1) / bs), 1, 1);
+    gather_mmq_fp4<<<num_blocks, block_size, 0, stream>>>(
+        (const uint4 *) unique, ids, (uint4 *) gathered,
+        (int) ne11_flat, (int) ne12_unique, total_words);
+}
+
+ void quantize_mmq_fp4_cuda(
+         const float * x, const int32_t * ids, void * vy, const ggml_type type_src0,
+         const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
+diff --git a/ggml/src/ggml-cuda/quantize.cuh b/ggml/src/ggml-cuda/quantize.cuh
+index 768a3ae..7f64069 100644
+--- a/ggml/src/ggml-cuda/quantize.cuh
+++ b/ggml/src/ggml-cuda/quantize.cuh
+@@ -26,6 +26,10 @@ void quantize_mmq_q8_1_cuda(
+         ggml_type type_src0, int64_t ne00, int64_t s01, int64_t s02, int64_t s03,
+         int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3, cudaStream_t stream);
+ 
+void gather_mmq_fp4_cuda(const void * unique, const int32_t * ids, void * gathered,
+                         int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded,
+                         cudaStream_t stream);
+
+ void quantize_mmq_fp4_cuda(const float *   x,
+                              const int32_t * ids,
+                              void *          vy,
+-- 
+2.43.0
+
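Reviewer sketch for patch 0023 (not part of the patch series). The de-dup relies on one property: if the quantizer is a pure function of each block's inputs, quantizing the unique tokens once and copying blocks by `ids` yields the same buffer as quantizing every gathered row. The CPU model below reuses the index arithmetic from `gather_mmq_fp4` (`kb * ne11_flat + j` for the destination, `kb * ne12_unique + ids[j]` for the source). The `Block` type and `toy_quantize` are hypothetical stand-ins, not the real `block_fp4_mmq` or `quantize_mmq_nvfp4`, and the row layout is simplified.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

using Block = std::vector<int8_t>; // stand-in for one quantized block

// Toy quantizer: a pure function of its blk inputs (the property 0023 needs).
static Block toy_quantize(const float * x, int blk) {
    Block b(blk);
    for (int i = 0; i < blk; ++i) {
        b[i] = (int8_t) std::lrint(x[i]);
    }
    return b;
}

// Stock path: quantize every expert-gathered row, even when rows repeat.
std::vector<Block> quantize_stock(const std::vector<float> & x,
                                  const std::vector<int32_t> & ids,
                                  int ne10, int blk) {
    assert(ne10 % blk == 0);
    const int n_kb      = ne10 / blk;
    const int ne11_flat = (int) ids.size();
    std::vector<Block> out(n_kb * ne11_flat);
    for (int kb = 0; kb < n_kb; ++kb) {
        for (int j = 0; j < ne11_flat; ++j) {
            out[kb * ne11_flat + j] = toy_quantize(x.data() + ids[j] * ne10 + kb * blk, blk);
        }
    }
    return out;
}

// De-dup path: quantize each unique token once, then gather blocks by ids.
std::vector<Block> quantize_dedup(const std::vector<float> & x,
                                  const std::vector<int32_t> & ids,
                                  int ne12_unique, int ne10, int blk) {
    assert(ne10 % blk == 0);
    const int n_kb      = ne10 / blk;
    const int ne11_flat = (int) ids.size();
    std::vector<Block> unique(n_kb * ne12_unique);
    for (int kb = 0; kb < n_kb; ++kb) {
        for (int tok = 0; tok < ne12_unique; ++tok) {
            unique[kb * ne12_unique + tok] = toy_quantize(x.data() + tok * ne10 + kb * blk, blk);
        }
    }
    std::vector<Block> gathered(n_kb * ne11_flat);
    for (int kb = 0; kb < n_kb; ++kb) {
        for (int j = 0; j < ne11_flat; ++j) {
            gathered[kb * ne11_flat + j] = unique[kb * ne12_unique + ids[j]]; // same indexing as gather_mmq_fp4
        }
    }
    return gathered;
}
```

With three tokens routed to six expert slots, the stock path calls the quantizer twice as often as the de-dup path, and the two buffers still compare equal; the real saving scales with `n_expert_used`.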
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0024-paged-pool-burst-reclaim.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0024-paged-pool-burst-reclaim.patch
@@ -0,0 +1,357 @@
+From a8a9d129ae2226a08a12c30ece697865c0fc85c4 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Fri, 26 Jun 2026 12:41:49 +0200
+Subject: [PATCH] feat(paged): paged-pool burst-reclaim (truncate + defrag +
+ slot release) (patch 0024)
+
+Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C
+Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill
+burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill
+draws from a depleted/fragmented pool and its throughput collapses (the benchmark's
+"restart per npl" crutch). Decode is unaffected. The fix changes only host-side
+block accounting and placement, never KV values or compute, and is gated behind
+LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior).
+
+Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every
+block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for
+the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly.
+Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the
+free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving
+content-cache hashes.
+Fix-3 release on slot completion: server_slot::release() issues prompt_clear()
+under the paged engine so a finished-idle slot returns its blocks promptly.
+
+Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023):
+- Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM
+  (5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops
+  unaffected (no ggml op touched).
+- Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores
+  ascending popleft order. UNIT PASS.
+- Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998;
+  64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527).
+- Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses
+  488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to
+  532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96
+  (Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both
+  arms (bit-exact serving).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/llama-kv-cache.cpp          | 13 ++++++++++
+ src/paged-alloc.cpp             | 31 +++++++++++++++++++++++
+ src/paged-alloc.h               | 18 +++++++++++++
+ src/paged-kv-manager.cpp        | 45 +++++++++++++++++++++++++++++++++
+ src/paged-kv-manager.h          | 24 ++++++++++++++++++
+ src/paged-prefix-api.cpp        |  8 ++++++
+ src/paged-prefix-api.h          |  6 +++++
+ tools/server/server-context.cpp | 17 +++++++++++++
+ 8 files changed, 162 insertions(+)
+
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 0351f86..21b8f1e 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -425,6 +425,19 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
+         }
+     }
+ 
+    // [paged 0024 Fix-1] Reclaim trailing blocks on a partial TAIL truncation
+    // (p1 == MAX, p0 > 0). llama-server issues seq_rm(slot, n_past, -1) on every
+    // reused slot and before a cross-request prefix splice; the kv-cache frees the
+    // cells [p0, end) but, without this, the paged manager keeps owning those
+    // blocks - the reclamation gap that leaks and fragments the pool across a
+    // burst. truncate() frees the blocks beyond ceil(p0/bs) so the manager's
+    // accounting tracks the kv-cache exactly. Gated so LLAMA_PAGED_NO_RECLAIM
+    // restores the pre-fix behavior for A/B.
+    if (paged_alloc::active() && paged_alloc::reclaim_active() && seq_id >= 0 &&
+        p0 > 0 && p1 == std::numeric_limits<llama_pos>::max()) {
+        paged_alloc::truncate(this, (int) seq_to_stream[seq_id], (int) seq_id, (uint32_t) p0);
+    }
+
+     if (seq_id >= 0) {
+         auto & cells = v_cells[seq_to_stream[seq_id]];
+         auto & head  = v_heads[seq_to_stream[seq_id]];
+diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
+index c1027fb..ba98dd5 100644
+--- a/src/paged-alloc.cpp
+++ b/src/paged-alloc.cpp
+@@ -14,6 +14,11 @@ bool active() {
+     return a;
+ }
+ 
+bool reclaim_active() {
+    static const bool off = (std::getenv("LLAMA_PAGED_NO_RECLAIM") != nullptr);
+    return !off;
+}
+
+ static bool debug() {
+     static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
+     return d;
+@@ -124,12 +129,28 @@ void commit(const void * cache, int stream, int seq,
+     }
+ }
+ 
+void truncate(const void * cache, int stream, int seq, uint32_t n_keep) {
+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
+    if (!mgr) {
+        return;
+    }
+    mgr->truncate(seq, (size_t) n_keep);     // Fix-1: reclaim trailing blocks
+    mgr->defrag_free_pool();                 // Fix-2: compact iff the pool is fully idle
+    if (debug()) {
+        fprintf(stderr, "[paged-alloc] truncate cache=%p stream=%d seq=%d keep<=%u (free=%zu)\n",
+                cache, stream, seq, n_keep, mgr->num_free_blocks());
+    }
+}
+
+ void release(const void * cache, int stream, int seq) {
+     paged::PagedKVManager * mgr = find_mgr(cache, stream);
+     if (!mgr) {
+         return;
+     }
+     mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
+    if (reclaim_active()) {
+        mgr->defrag_free_pool();             // Fix-2: compact iff the pool is fully idle
+    }
+     if (debug()) {
+         fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
+                 cache, stream, seq, mgr->num_free_blocks());
+@@ -163,4 +184,14 @@ size_t num_free(const void * cache, int stream) {
+     return mgr ? mgr->num_free_blocks() : 0;
+ }
+ 
+size_t num_free_global() {
+    size_t total = 0;
+    for (auto & kv : g_managers) total += kv.second->num_free_blocks();
+    return total;
+}
+
+size_t num_managers() {
+    return g_managers.size();
+}
+
+ } // namespace paged_alloc
+diff --git a/src/paged-alloc.h b/src/paged-alloc.h
+index 88dedef..bfaf45b 100644
+--- a/src/paged-alloc.h
+++ b/src/paged-alloc.h
+@@ -31,6 +31,12 @@ namespace paged_alloc {
+ // true iff env LLAMA_KV_PAGED is set (evaluated once).
+ bool active();
+ 
+// [paged 0024] The burst-reclaim fix (truncate + defrag-on-empty + slot release)
+// is on by default whenever the paged engine is active. Setting
+// LLAMA_PAGED_NO_RECLAIM to any value (including 0) restores the pre-fix behavior
+// (no trailing-block reclaim, no compaction) for A/B measurement. Evaluated once.
+bool reclaim_active();
+
+ // Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
+ // on demand, appending their physical cell indices to `out`. pool_blocks =
+ // cells.size()/block_size is the stream's block budget. Returns false (leaving
+@@ -58,6 +64,12 @@ int64_t slot(const void * cache, int stream, int seq, int pos);
+ void commit(const void * cache, int stream, int seq,
+             const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
+ 
+// [paged 0024 Fix-1] Reclaim the trailing blocks of (cache,stream,seq) beyond
+// logical position n_keep (ref-counted), mirroring a partial kv-cache seq_rm
+// [n_keep, end). If that leaves the stream's pool fully idle, its free queue is
+// relinked into ascending block-id order (Fix-2). No-op if no manager exists.
+void truncate(const void * cache, int stream, int seq, uint32_t n_keep);
+
+ // Return one sequence's blocks to the pool (ref-counted; sequence end).
+ void release(const void * cache, int stream, int seq);
+ 
+@@ -69,4 +81,10 @@ void release_all(const void * cache);
+ int    ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
+ size_t num_free(const void * cache, int stream);
+ 
+// [paged 0024] Total free blocks summed across every live manager (all caches /
+// streams). Wrapper-agnostic, so it reports the real pool for hybrid / iSWA
+// models whose outer memory is not a llama_kv_cache. Diagnostics only.
+size_t num_free_global();
+size_t num_managers();
+
+ } // namespace paged_alloc
+diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
+index 4c6ee4c..738b332 100644
+--- a/src/paged-kv-manager.cpp
+++ b/src/paged-kv-manager.cpp
+@@ -104,6 +104,22 @@ void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
+     num_free_blocks += blocks.size();
+ }
+ 
+void FreeBlockQueue::rebuild(const std::vector<KVCacheBlock*>& blocks) {
+    // Relink the intrusive list using THIS queue's stable fake head/tail nodes.
+    num_free_blocks = blocks.size();
+    for (size_t i = 0; i < blocks.size(); ++i) {
+        blocks[i]->prev_free = (i == 0)                  ? &fake_head : blocks[i - 1];
+        blocks[i]->next_free = (i + 1 < blocks.size())   ? blocks[i + 1] : &fake_tail;
+    }
+    if (!blocks.empty()) {
+        fake_head.next_free = blocks.front();
+        fake_tail.prev_free = blocks.back();
+    } else {
+        fake_head.next_free = &fake_tail;
+        fake_tail.prev_free = &fake_head;
+    }
+}
+
+ std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
+     std::vector<KVCacheBlock*> ret;
+     const KVCacheBlock* curr = fake_head.next_free;
+@@ -199,6 +215,20 @@ void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+     }
+ }
+ 
+void BlockPool::defrag_free_queue() {
+    // Pool is fully idle: every non-null block is free (ref_cnt 0). Rebuild the
+    // free list in ascending block_id order so popleft hands out physically
+    // contiguous blocks again. Hashes / the content-cache map are left intact so
+    // a warm committed prefix stays re-hittable.
+    std::vector<KVCacheBlock*> ordered;
+    ordered.reserve(ptrs_.size());
+    for (KVCacheBlock* b : ptrs_) {
+        if (b->is_null) continue;
+        ordered.push_back(b);
+    }
+    free_queue_.rebuild(ordered);
+}
+
+ // ---------------------------------------------------------------------------
+ // PagedKVManager  (port of SingleTypeKVCacheManager / FullAttentionManager)
+ // ---------------------------------------------------------------------------
+@@ -250,6 +280,21 @@ void PagedKVManager::free(int seq_id) {
+     req_to_blocks_.erase(it);
+ }
+ 
+void PagedKVManager::truncate(int seq_id, size_t n_keep) {
+    auto it = req_to_blocks_.find(seq_id);
+    if (it == req_to_blocks_.end()) return;
+    auto & blocks = it->second;
+    const size_t keep = cdiv(n_keep, block_size_); // blocks covering [0, n_keep)
+    if (keep >= blocks.size()) return;             // nothing trailing to reclaim
+    // Free the trailing blocks [keep, end) tail-first (vLLM eviction order). Their
+    // cells were just cleared by the partial seq_rm, so they are safe to reuse.
+    std::vector<KVCacheBlock*> ordered(blocks.rbegin(),
+                                       blocks.rbegin() + (blocks.size() - keep));
+    pool_.free_blocks(ordered);
+    blocks.resize(keep);
+    if (blocks.empty()) req_to_blocks_.erase(it);
+}
+
+ // FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
+ // hash into the seed so each block hash transitively encodes its whole prefix
+ // (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
+diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
+index 34decbc..e410d58 100644
+--- a/src/paged-kv-manager.h
+++ b/src/paged-kv-manager.h
+@@ -47,6 +47,11 @@ public:
+     void append_n(const std::vector<KVCacheBlock*>& blocks);
+     void prepend_n(const std::vector<KVCacheBlock*>& blocks);
+     std::vector<KVCacheBlock*> get_all_free_blocks() const;
+    // [paged 0024 Fix-2] Relink the intrusive free list to the given order using
+    // THIS queue's fake head/tail (the nodes' addresses are stable; a temporary
+    // FreeBlockQueue would leave dangling fake-node pointers). Used to restore a
+    // pristine, contiguous popleft order after a fragmenting burst drains.
+    void rebuild(const std::vector<KVCacheBlock*>& blocks);
+ 
+ private:
+     KVCacheBlock fake_head{-1};
+@@ -67,6 +72,14 @@ public:
+                            size_t num_cached_blocks, size_t num_full_blocks,
+                            const std::vector<uint64_t>& block_hashes);
+     size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
+    // [paged 0024 Fix-2] Total non-null blocks, and whether the pool is fully
+    // idle (every non-null block back in the free queue). defrag_free_queue()
+    // relinks the free queue into pristine ascending-block-id order; only valid
+    // when all_free() so no live request's block table is disturbed. Block hashes
+    // are preserved, so a warm committed prefix stays re-hittable.
+    size_t total_blocks() const { return blocks_.size(); }
+    bool   all_free()    const { return free_queue_.num_free_blocks + 1 == blocks_.size(); }
+    void   defrag_free_queue();
+ 
+ private:
+     bool maybe_evict_cached_block(KVCacheBlock* block);
+@@ -94,6 +107,17 @@ public:
+     void free(int seq_id);
+     int block_size() const { return block_size_; }
+ 
+    // [paged 0024 Fix-1] Reclaim the trailing blocks of seq_id beyond logical
+    // position n_keep: free every block at index >= ceil(n_keep/bs) (ref-counted,
+    // mirroring vLLM's free of a truncated block suffix). Called on a partial tail
+    // seq_rm [n_keep, end) so the manager's block accounting tracks the kv-cache
+    // exactly instead of stranding the blocks whose cells were just cleared.
+    void truncate(int seq_id, size_t n_keep);
+
+    // [paged 0024 Fix-2] When no live request holds a block, relink the free
+    // queue into pristine contiguous order (undo a burst's scrambled free order).
+    void defrag_free_pool() { if (pool_.all_free()) pool_.defrag_free_queue(); }
+
+     // Prefix caching (win 3).
+     static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
+     std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
+diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
+index 8573cd2..209cee8 100644
+--- a/src/paged-prefix-api.cpp
+++ b/src/paged-prefix-api.cpp
+@@ -45,4 +45,12 @@ long num_free(llama_context * ctx) {
+     return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
+ }
+ 
+long num_free_global() {
+    return (long) paged_alloc::num_free_global();
+}
+
+long num_managers() {
+    return (long) paged_alloc::num_managers();
+}
+
+ } // namespace paged_prefix_api
+diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
+index 78a3864..8dd817e 100644
+--- a/src/paged-prefix-api.h
+++ b/src/paged-prefix-api.h
+@@ -24,4 +24,10 @@ int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
+ // Number of free blocks in the unified stream-0 pool, or 0 if no manager.
+ long num_free(llama_context * ctx);
+ 
+// [paged 0024] Total free blocks across every live paged manager (all caches /
+// streams). Wrapper-agnostic, so it reports the real pool for hybrid / iSWA
+// models whose outer memory is not a llama_kv_cache. Diagnostics only.
+long num_free_global();
+long num_managers();
+
+ } // namespace paged_prefix_api
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index f7a114c..8c19cfb 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -411,6 +411,23 @@ struct server_slot {
+ 
+             reset();
+ 
+            // [paged 0024 Fix-3] Return this finished slot's paged blocks to the
+            // pool promptly. Stock llama-server keeps an idle slot's KV for its own
+            // next-prompt cache, but under the paged engine that strands blocks in
+            // idle slots after a high-fan-out burst, so a later low-npl run sees a
+            // depleted, fragmented pool and its prefill collapses. prompt_clear()
+            // issues a full seq_rm (clearing the cells AND, via the paged hook,
+            // releasing + defragging the blocks) and clears the slot-local prompt
+            // cache so the next reuse recomputes from a pristine pool; cross-request
+            // reuse still works through the committed paged content cache. Gated on
+            // LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM opts out for A/B); stock
+            // (paged off) is byte-identical.
+            static const bool paged_release_on_idle =
+                getenv("LLAMA_KV_PAGED") != nullptr && getenv("LLAMA_PAGED_NO_RECLAIM") == nullptr;
+            if (paged_release_on_idle && prompt.n_tokens() > 0) {
+                prompt_clear(false);
+            }
+
+             callback_on_release(id);
+         }
+     }
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0025-qwen35moe-nvfp4-moe-decode-regraph.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0025-qwen35moe-nvfp4-moe-decode-regraph.patch
@@ -0,0 +1,56 @@
+From 2f4f5ab7c9050f890ee1137ef9c8ee09dfcd9ae7 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Fri, 26 Jun 2026 16:52:21 +0200
+Subject: [PATCH] feat(paged): qwen35moe NVFP4 MoE-decode re-graph
+ (should_use_mmq graph-safe id-path) (patch 0025)
+
+The MUL_MAT_ID CUDA-graph guard (ggml-cuda.cu [TAG_MUL_MAT_ID_CUDA_GRAPHS]) disables CUDA graphs for
+the whole decode step whenever a MUL_MAT_ID node has ne[2] > mmvq_mmid_max (8 for NVFP4 on sm_121),
+because the per-expert host-loop fallback synchronizes the stream. But on Blackwell NVFP4 the path
+actually taken is should_use_mmq()==true -> the grouped stream-k mul_mat_q id-branch, which launches
+on one stream with NO host sync (no cudaStreamSynchronize/Memcpy in mmq.cu/mmid.cu). The disable is
+therefore conservative; graphs are safe for the grouped path.
+
+Env-gated (LLAMA_MOE_FORCE_GRAPHS, default-off = byte-identical to stock): when set and the node
+takes the grouped MMQ path, keep CUDA graphs on for the MoE decode step.
+
+Measured (DGX GB10 sm_121, q36-35b-a3b-nvfp4, llama-batched-bench -fa on -npp128 -ntg128, decode_agg):
+  npl 8   226.0 -> 226.4  +0.2% (noise; ne2<=8 already on the MMVQ-graphed path)
+  npl 32  433.8 -> 452.7  +4.4%
+  npl 64  589.0 -> 605.9  +2.9%
+  npl 128 743.1 -> 757.1  +1.9%
+
+Bit-exact (graph replay re-issues identical kernels): test-backend-ops MUL_MAT_ID 806/806 CUDA0 OK;
+parallel-greedy np16 (ne2=16>8) generated content byte-identical ON==OFF.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/ggml-cuda.cu | 12 +++++++++++-
+ 1 file changed, 11 insertions(+), 1 deletion(-)
+
+diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
+index cca7059..254d2e0 100644
+--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
+@@ -3275,7 +3275,17 @@ static bool ggml_cuda_graph_check_compability(ggml_cgraph * cgraph) {
+         if (node->op == GGML_OP_MUL_MAT_ID) {
+             const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+             const int mmvq_mmid_max = get_mmvq_mmid_max_batch(node->src[0]->type, cc);
+-            if (!ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max) {
+            bool mmid_needs_sync = !ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max;
+            // PROBE (bit-exact, env LLAMA_MOE_FORCE_GRAPHS): the grouped stream-k MMQ id-path is
+            // launched on-stream with no host sync (only the per-expert host-loop fallback syncs);
+            // when should_use_mmq() is true (Blackwell NVFP4 grouped path) the op is graph-safe
+            // even for ne[2] > mmvq_mmid_max, so graphs need not be disabled for the whole step.
+            if (mmid_needs_sync && ggml_is_quantized(node->src[0]->type) &&
+                getenv("LLAMA_MOE_FORCE_GRAPHS") != nullptr &&
+                ggml_cuda_should_use_mmq(node->src[0]->type, cc, node->src[1]->ne[2], node->src[0]->ne[2])) {
+                mmid_needs_sync = false;
+            }
+            if (mmid_needs_sync) {
+                 // under these conditions, the mul_mat_id operation will need to synchronize the stream, so we cannot use CUDA graphs
+                 // TODO: figure out a way to enable for larger batch sizes, without hurting performance
+                 // ref: https://github.com/ggml-org/llama.cpp/pull/18958
+--
+2.43.0
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch
@@ -0,0 +1,578 @@
+From fafe8785c8595f53a51efec20cf84f9146437e0c Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Fri, 26 Jun 2026 22:58:47 +0200
+Subject: [PATCH] feat(paged): qwen35 recurrent-state gather fusion (patch
+ 0028)
+
+The MoE-gap ground truth found k_get_rows_float to be the single largest decode
+kernel with no vLLM equivalent (~5.2 ms/step MoE; also dense): vLLM updates its
+gated-DeltaNet recurrent state in place, while llama ran a separate ggml_get_rows
+gather. Patch 0019 fused the SSM-state gather; patch 0021 fused the conv compute
+but kept a build_rs gather for the conv taps. This closes that residual.
+
+nsys located the residual k_get_rows as the conv-state tap gather in
+build_conv_state_fused: a 24576-float (= n_embd_r = (d_conv-1)*(d_inner +
+2*n_group*d_state)) row x 128 sequences, once per GDN layer per decode step
+(~720 big ~115 us gathers / 24-step window). The SSM-state gather is already
+fused by 0019, so this conv gather is the last k_get_rows in the GDN decode path.
+
+New op ggml_ssm_conv_update_inplace_ids (reuses GGML_OP_SSM_CONV, discriminated
+by a non-null src[4] = ids) takes the FULL conv cache + the s_copy ids and reads
+each active sequence's prior taps directly from cache[ids[s]] in the kernel (no
+ggml_get_rows). Identity sequences (ids[s] == rs_head + s, the AR-decode path)
+read in place from the conv_state_dst write slot (the whole window is loaded into
+registers before the ring write-back, so read==write is race-free); non-identity
+sequences (reorder / rs_zero) are gathered into a disjoint scratch by a small
+ssm_conv_gather_nonident_kernel first. Mirrors the 0019 in-place + ids gather
+fusion. The read VALUES are unchanged; only the read path (gather -> indexed
+in-kernel read) changes, so it is bit-identical to the build_rs gather + 0021 op.
+
+build_conv_state_fused now feeds the full cache + ids through the build_rs
+get_state_rows lambda (rs_zero clear + extra-states copy still run around it).
+Helps BOTH dense and MoE (shared GDN conv path).
+
+GATE test-backend-ops (CUDA0 vs CPU, 2/2 backends): SSM_CONV_UPDATE_IDS OK (new),
+SSM_CONV_UPDATE OK, SSM_CONV OK, GATED_DELTA_NET OK, GET_ROWS OK.
+
+GATE greedy md5 (--temp 0 --seed 1 -n 48) BYTE-IDENTICAL both models:
+q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4
+07db32c2bcb78d17a43ed18bc22705cd (== baseline).
+
+nsys: k_get_rows_float float,float 10174 -> 9454 instances (720 fewer = 30 GDN
+layers x 24 steps), 186.3 -> 102.8 ms; the 720 ~115 us conv gathers replaced by a
+720 x ~1.1 us no-op ssm_conv_gather_nonident (all identity at steady decode).
+MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/include/ggml.h            |  20 ++++
+ ggml/src/ggml-cpu/ops.cpp      |  90 +++++++++++++++++-
+ ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++-
+ ggml/src/ggml.c                |  62 +++++++++++++
+ src/models/delta-net-base.cpp  |  26 ++++--
+ tests/test-backend-ops.cpp     |  69 ++++++++++++++
+ 6 files changed, 411 insertions(+), 11 deletions(-)
+
+diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
+index 2a5cbce..5fa220a 100644
+--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
+@@ -2463,6 +2463,26 @@ extern "C" {
+             struct ggml_tensor  * conv_state_dst,
+             bool                  fuse_silu);
+ 
+    // Gather-free variant of ggml_ssm_conv_update_inplace (patch 0028). Instead of a pre-gathered
+    // per-sequence tap scratch, it takes the FULL conv-state cache (`conv_states` = [K-1, channels,
+    // n_cells]) plus the per-sequence `ids` ([n_seqs], I32, = the recurrent-state s_copy) and reads
+    // each active sequence's prior taps directly from cache[ids[s]] inside the kernel -- no
+    // ggml_get_rows materialization (mirrors ggml_gated_delta_net_inplace_ids). Identity sequences
+    // (ids[s] == rs_head + s) are read in place from `conv_state_dst` (the write slot); any
+    // non-identity sequence (reorder / rs_zero remap) is gathered into a disjoint scratch by the
+    // backend first, so the read never aliases another sequence's in-place ring write -> race-free
+    // and bit-identical to the get_rows + ggml_ssm_conv_update_inplace path. op_params[0]=fuse_silu,
+    // op_params[1]=rs_head. Reuses GGML_OP_SSM_CONV, discriminated by a non-null src[4].
+    GGML_API struct ggml_tensor * ggml_ssm_conv_update_inplace_ids(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * conv_states,
+            struct ggml_tensor  * conv_kernel,
+            struct ggml_tensor  * x_cur,
+            struct ggml_tensor  * conv_state_dst,
+            struct ggml_tensor  * ids,
+            int                   rs_head,
+            bool                  fuse_silu);
+
+     GGML_API struct ggml_tensor * ggml_ssm_scan(
+             struct ggml_context * ctx,
+             struct ggml_tensor  * s,
+diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
+index 07ab9e5..515aae4 100644
+--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
+@@ -9580,6 +9580,90 @@ static void ggml_compute_forward_ssm_conv_update_f32(
+     }
+ }
+ 
+// Patch 0028: CPU reference for ggml_ssm_conv_update_inplace_ids (mirror of the CUDA
+// ssm_conv_update_ids_f32). Reads each active sequence's prior K-1 taps directly from the FULL conv
+// cache (src[0]) via ids (src[4]) -- identity sequences (ids[s] == rs_head + s) read in place from the
+// destination slot src[3], non-identity from cache[ids[s]] -- computes the depthwise conv with the
+// same ascending-tap FMA order, optionally folds silu, writes the conv output to dst, and writes the
+// 1-token-shifted ring state back in place into src[3]. The window is copied to a local before the
+// write so the identity (read == write slot) case is correct. Threads split over channels.
+static void ggml_compute_forward_ssm_conv_update_ids_f32(
+        const ggml_compute_params * params,
+        ggml_tensor * dst) {
+    const ggml_tensor * conv_states = dst->src[0]; // FULL cache [K-1, channels, n_cells]
+    const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
+    const ggml_tensor * x_cur       = dst->src[2]; // [channels, 1, n_seqs]
+    ggml_tensor       * cdst        = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
+    const ggml_tensor * ids         = dst->src[4]; // [n_seqs] I32 slot indices (s_copy)
+
+    const int ith = params->ith;
+    const int nth = params->nth;
+
+    const int64_t d_conv   = conv_kernel->ne[0];
+    const int64_t channels = conv_kernel->ne[1];
+    const int64_t n_seqs   = x_cur->ne[2];
+    const bool    apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
+    const int32_t rs_head    = ggml_get_op_params_i32(dst, 1);
+
+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+    GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
+    GGML_ASSERT(ids->type == GGML_TYPE_I32);
+    GGML_ASSERT(d_conv <= 8);
+
+    const int64_t cache_row_stride = conv_states->nb[2] / sizeof(float); // (K-1)*channels
+    const int64_t w_stride         = conv_kernel->nb[1] / sizeof(float);
+    const int64_t x_seq_stride     = x_cur->nb[2] / sizeof(float);
+    const int64_t dst_seq_stride   = dst->nb[2] / sizeof(float);
+    const int64_t cdst_seq_stride  = cdst->nb[1] / sizeof(float);
+
+    const float * cache_base = (const float *) conv_states->data;
+    const float * w_base     = (const float *) conv_kernel->data;
+    const float * x_base     = (const float *) x_cur->data;
+    float *       cdst_base  = (float *) cdst->data;
+    float *       dst_base   = (float *) dst->data;
+    const int32_t * ids_base = (const int32_t *) ids->data;
+
+    const int64_t dc = (channels + nth - 1) / nth;
+    const int64_t c0 = dc * ith;
+    const int64_t c1 = MIN(c0 + dc, channels);
+
+    for (int64_t s = 0; s < n_seqs; ++s) {
+        const int32_t r     = ids_base[s];
+        const bool    ident = (r == rs_head + (int32_t) s);
+        // identity reads the K-1 taps in place from the destination slot; non-identity from cache[r].
+        const float * states_seq = ident
+            ? (cdst_base  + s * cdst_seq_stride)
+            : (cache_base + (int64_t) r * cache_row_stride);
+        for (int64_t c = c0; c < c1; ++c) {
+            const float * states_c = states_seq + c * (d_conv - 1);
+            const float * w_c      = w_base + c * w_stride;
+            const float   xc       = x_base[s * x_seq_stride + c];
+
+            // window = [tap0 .. tap_{K-2}, xc], copied to a local before the (possibly aliasing) write
+            float window[8];
+            for (int64_t j = 0; j < d_conv - 1; ++j) {
+                window[j] = states_c[j];
+            }
+            window[d_conv - 1] = xc;
+
+            // ascending-tap FMA: tap0*w0 + ... + tap_{K-2}*w_{K-2} + xc*w_{K-1} (matches ssm_conv)
+            float sumf = 0.0f;
+            for (int64_t j = 0; j < d_conv; ++j) {
+                sumf += window[j] * w_c[j];
+            }
+            sumf += 0.0f; // matches ssm_conv `sumf += b` with b == 0
+
+            dst_base[s * dst_seq_stride + c] = apply_silu ? (sumf / (1.0f + expf(-sumf))) : sumf;
+
+            // 1-token-shifted ring write-back: [tap1 .. tap_{K-2}, xc]
+            float * out_state = cdst_base + s * cdst_seq_stride + c * (d_conv - 1);
+            for (int64_t j = 0; j < d_conv - 1; ++j) {
+                out_state[j] = window[j + 1];
+            }
+        }
+    }
+}
+
+ void ggml_compute_forward_ssm_conv(
+         const ggml_compute_params * params,
+         ggml_tensor * dst) {
+@@ -9587,7 +9671,11 @@ void ggml_compute_forward_ssm_conv(
+         case GGML_TYPE_F32:
+             {
+                 if (dst->src[3] != nullptr) {
+-                    ggml_compute_forward_ssm_conv_update_f32(params, dst);
+                    if (dst->src[4] != nullptr) {
+                        ggml_compute_forward_ssm_conv_update_ids_f32(params, dst);
+                    } else {
+                        ggml_compute_forward_ssm_conv_update_f32(params, dst);
+                    }
+                 } else {
+                     ggml_compute_forward_ssm_conv_f32(params, dst);
+                 }
+diff --git a/ggml/src/ggml-cuda/ssm-conv.cu b/ggml/src/ggml-cuda/ssm-conv.cu
+index e1af1cd..28b3cce 100644
+--- a/ggml/src/ggml-cuda/ssm-conv.cu
+++ b/ggml/src/ggml-cuda/ssm-conv.cu
+@@ -226,6 +226,153 @@ static void ggml_cuda_op_ssm_conv_update(ggml_backend_cuda_context & ctx, ggml_t
+     }
+ }
+ 
+// Patch 0028: gather only the NON-identity sequences' prior conv taps from the FULL conv cache into a
+// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+// destination slot by the update kernel and are skipped here. One block per sequence. Mirrors
+// gdn_gather_nonident_kernel (the 0019 recurrent-state gather fusion).
+static __global__ void ssm_conv_gather_nonident_kernel(const float * __restrict__ cache,
+                                                       const int32_t * __restrict__ ids, int rs_head,
+                                                       float * __restrict__ scratch, int row_stride, int n_seqs) {
+    const int s = blockIdx.x;
+    if (s >= n_seqs) {
+        return;
+    }
+    const int r = ids[s];
+    if (r == rs_head + s) {
+        return; // identity: prior taps already live in the in-place destination slot
+    }
+    const float * src = cache   + (int64_t) r * row_stride;
+    float *       dst = scratch + (int64_t) s * row_stride;
+    for (int i = threadIdx.x; i < row_stride; i += blockDim.x) {
+        dst[i] = src[i];
+    }
+}
+
+// Patch 0028: gather-free fused conv update. Per (channel, sequence), read the K-1 prior taps from the
+// active sequence's cache slot via ids -- identity (ids[s] == rs_head + s) reads in place from
+// conv_state_dst (the same slot it writes; the whole window is loaded into registers before any write,
+// so it is race-free), non-identity reads the pre-gathered disjoint scratch -- then computes the
+// depthwise conv with the SAME ascending-tap FMA order as ssm_conv_update_f32, folds silu, writes the
+// conv output, and writes the 1-token-shifted ring state back in place. Bit-identical to the get_rows +
+// ssm_conv_update_f32 path: the read VALUES are the same; only the read POINTER changes.
+template <bool apply_silu, int d_conv>
+static __global__ void ssm_conv_update_ids_f32(const float * __restrict__ nonident_scratch,
+                                               const float * __restrict__ conv_kernel,
+                                               const float * __restrict__ x_cur,
+                                               float       * __restrict__ conv_state_dst,
+                                               float       * __restrict__ dst,
+                                               const int32_t * __restrict__ ids,
+                                               const int   rs_head,
+                                               const int   channels,
+                                               const int   scratch_seq_stride,
+                                               const int   w_stride,
+                                               const int   x_seq_stride,
+                                               const int   dst_seq_stride,
+                                               const int   cdst_seq_stride) {
+    const int c = blockIdx.x * blockDim.x + threadIdx.x; // channel
+    const int s = blockIdx.y;                            // sequence
+    if (c >= channels) {
+        return;
+    }
+
+    const bool ident = (ids[s] == rs_head + s);
+    const float * states_c = ident
+        ? conv_state_dst   + (int64_t) s * cdst_seq_stride    + (int64_t) c * (d_conv - 1)
+        : nonident_scratch + (int64_t) s * scratch_seq_stride + (int64_t) c * (d_conv - 1);
+    const float * w_c = conv_kernel + (int64_t) c * w_stride;
+    const float   xc  = x_cur[(int64_t) s * x_seq_stride + c];
+
+    // window = [tap0 .. tap_{K-2}, current-token], same ordering as ssm_conv_update_f32
+    float window[d_conv];
+#pragma unroll
+    for (int j = 0; j < d_conv - 1; j++) {
+        window[j] = states_c[j];
+    }
+    window[d_conv - 1] = xc;
+
+    float sumf = 0.0f;
+#pragma unroll
+    for (int j = 0; j < d_conv; j++) {
+        sumf += window[j] * w_c[j];
+    }
+    sumf += 0.0f; // matches ssm_conv_f32 `sumf += b` with b == 0 (qwen35 conv1d has no bias)
+    dst[(int64_t) s * dst_seq_stride + c] = apply_silu ? ggml_cuda_op_silu_single(sumf) : sumf;
+
+    // 1-token-shifted ring write-back: drop the oldest tap, append the current token
+    float * out_state = conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1);
+#pragma unroll
+    for (int j = 0; j < d_conv - 1; j++) {
+        out_state[j] = window[j + 1];
+    }
+}
+
+static void ggml_cuda_op_ssm_conv_update_ids(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+    const ggml_tensor * conv_states = dst->src[0]; // FULL cache [K-1, channels, n_cells]
+    const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
+    const ggml_tensor * x_cur       = dst->src[2]; // [channels, 1, n_seqs]
+    const ggml_tensor * cdst        = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
+    const ggml_tensor * ids         = dst->src[4]; // [n_seqs] I32 slot indices (s_copy)
+
+    const int64_t d_conv   = conv_kernel->ne[0];
+    const int64_t channels = conv_kernel->ne[1];
+    const int64_t n_seqs   = x_cur->ne[2];
+    const bool    apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
+    const int     rs_head    = ggml_get_op_params_i32(dst, 1);
+
+    GGML_ASSERT(conv_states->type == GGML_TYPE_F32 && conv_kernel->type == GGML_TYPE_F32);
+    GGML_ASSERT(x_cur->type == GGML_TYPE_F32 && cdst->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
+    GGML_ASSERT(ids->type == GGML_TYPE_I32);
+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+    GGML_ASSERT(conv_states->nb[1] == (size_t) (d_conv - 1) * sizeof(float));
+    GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
+    GGML_ASSERT(dst->ne[0] == channels && dst->ne[1] == 1 && dst->ne[2] == n_seqs);
+
+    const float *   cache_d = (const float *) conv_states->data;
+    const float *   w_d     = (const float *) conv_kernel->data;
+    const float *   x_d     = (const float *) x_cur->data;
+    float *         cdst_d  = (float *) cdst->data;
+    float *         dst_d   = (float *) dst->data;
+    const int32_t * ids_d   = (const int32_t *) ids->data;
+    cudaStream_t    stream  = ctx.stream();
+
+    // n_embd_r = (K-1)*channels: the per-cell row stride of the full conv cache.
+    const int cache_row_stride = (int) (conv_states->nb[2] / sizeof(float));
+    const int w_stride         = (int) (conv_kernel->nb[1] / sizeof(float));
+    const int x_seq_stride     = (int) (x_cur->nb[2] / sizeof(float));
+    const int dst_seq_stride   = (int) (dst->nb[2] / sizeof(float));
+    const int cdst_seq_stride  = (int) (cdst->nb[1] / sizeof(float));
+
+    // Gather only the non-identity sequences' prior taps into a disjoint scratch (identity sequences
+    // read in place from cdst). The scratch is written here and only read by the update kernel, so the
+    // update kernel never reads a slot another block writes -> race-free. No-op at steady AR decode.
+    ggml_cuda_pool_alloc<float> nonident_scratch(ctx.pool());
+    float * scratch = nonident_scratch.alloc((size_t) cache_row_stride * n_seqs);
+    if (n_seqs > 0) {
+        ssm_conv_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(
+            cache_d, ids_d, rs_head, scratch, cache_row_stride, (int) n_seqs);
+    }
+
+    const int threads = 128;
+    const dim3 blocks((channels + threads - 1) / threads, (unsigned) n_seqs, 1);
+
+    auto launch = [&](auto NC) {
+        constexpr int kNC = decltype(NC)::value;
+        if (apply_silu) {
+            ssm_conv_update_ids_f32<true, kNC><<<blocks, threads, 0, stream>>>(scratch, w_d, x_d, cdst_d, dst_d,
+                ids_d, rs_head, (int) channels, cache_row_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
+        } else {
+            ssm_conv_update_ids_f32<false, kNC><<<blocks, threads, 0, stream>>>(scratch, w_d, x_d, cdst_d, dst_d,
+                ids_d, rs_head, (int) channels, cache_row_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
+        }
+    };
+
+    switch (d_conv) {
+        case 3: launch(std::integral_constant<int, 3>{}); break;
+        case 4: launch(std::integral_constant<int, 4>{}); break;
+        default: GGML_ABORT("ssm_conv_update_ids only supports d_conv 3 or 4");
+    }
+}
+
+ template <bool apply_silu>
+ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1,
+                               const int src0_nb2, const int src1_nb1, float * dst, const int dst_nb0, const int dst_nb1,
+@@ -266,7 +413,13 @@ void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, g
+     // silu of the decode conv path into a single kernel.
+     if (dst->src[3] != nullptr) {
+         GGML_ASSERT(bias_add_node == nullptr && silu_dst == nullptr);
+-        ggml_cuda_op_ssm_conv_update(ctx, dst);
+        // Patch 0028: a non-null src[4] (ids) selects the gather-free variant that reads each
+        // sequence's prior taps directly from the full cache via ids (no get_rows materialization).
+        if (dst->src[4] != nullptr) {
+            ggml_cuda_op_ssm_conv_update_ids(ctx, dst);
+        } else {
+            ggml_cuda_op_ssm_conv_update(ctx, dst);
+        }
+         return;
+     }
+ 
+diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
+index 16b180f..dcc09bd 100644
+--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
+@@ -5606,6 +5606,68 @@ struct ggml_tensor * ggml_ssm_conv_update_inplace(
+     return result;
+ }
+ 
+// ggml_ssm_conv_update_inplace_ids
+//
+// Gather-free variant of ggml_ssm_conv_update_inplace (patch 0028). Instead of a pre-gathered
+// per-sequence tap scratch, it takes the FULL conv-state cache (`conv_states` = [K-1, channels,
+// n_cells]) plus the per-sequence `ids` (the recurrent-state s_copy) and reads each active sequence's
+// prior taps directly from cache[ids[s]] inside the kernel (no ggml_get_rows). Identity sequences
+// (ids[s] == rs_head + s) read in place from the `conv_state_dst` write slot; non-identity sequences
+// are gathered into a disjoint scratch by the backend first. Bit-identical to the get_rows +
+// ggml_ssm_conv_update_inplace path. Reuses GGML_OP_SSM_CONV, discriminated by a non-null src[4].
+// op_params[1] carries rs_head. Mirrors the 0019 ggml_gated_delta_net_inplace_ids gather fusion.
+struct ggml_tensor * ggml_ssm_conv_update_inplace_ids(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * conv_states,
+        struct ggml_tensor  * conv_kernel,
+        struct ggml_tensor  * x_cur,
+        struct ggml_tensor  * conv_state_dst,
+        struct ggml_tensor  * ids,
+        int                   rs_head,
+        bool                  fuse_silu) {
+    GGML_ASSERT(ggml_is_3d(conv_states));
+    GGML_ASSERT(ggml_is_matrix(conv_kernel));
+    GGML_ASSERT(ggml_is_3d(x_cur));
+    GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32);
+
+    const int64_t d_conv   = conv_kernel->ne[0];
+    const int64_t channels = conv_kernel->ne[1];
+    const int64_t n_seqs   = x_cur->ne[2];
+
+    GGML_ASSERT(conv_states->type    == GGML_TYPE_F32);
+    GGML_ASSERT(conv_kernel->type    == GGML_TYPE_F32);
+    GGML_ASSERT(x_cur->type          == GGML_TYPE_F32);
+    GGML_ASSERT(conv_state_dst != NULL && conv_state_dst->type == GGML_TYPE_F32);
+
+    // conv_states: FULL cache [K-1, channels, n_cells], contiguous taps per channel
+    GGML_ASSERT(conv_states->ne[0] == d_conv - 1);
+    GGML_ASSERT(conv_states->ne[1] == channels);
+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+    // x_cur: single decode token per sequence
+    GGML_ASSERT(x_cur->ne[0] == channels);
+    GGML_ASSERT(x_cur->ne[1] == 1);
+    // ids: one slot index per active sequence
+    GGML_ASSERT(ids->ne[0] == n_seqs);
+    // conv_state_dst: [(K-1)*channels, n_seqs] in-place ring write target
+    GGML_ASSERT(conv_state_dst->ne[0] == (d_conv - 1) * channels);
+    GGML_ASSERT(conv_state_dst->ne[1] >= n_seqs);
+    GGML_ASSERT(conv_state_dst->nb[0] == sizeof(float));
+
+    struct ggml_tensor * result = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
+
+    ggml_set_op_params_i32(result, 0, fuse_silu ? 1 : 0);
+    ggml_set_op_params_i32(result, 1, rs_head);
+
+    result->op     = GGML_OP_SSM_CONV;
+    result->src[0] = conv_states;
+    result->src[1] = conv_kernel;
+    result->src[2] = x_cur;
+    result->src[3] = conv_state_dst;
+    result->src[4] = ids;
+
+    return result;
+}
+
+ // ggml_ssm_scan
+ 
+ struct ggml_tensor * ggml_ssm_scan(
+diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
+index 58f3d0c..962f5eb 100644
+--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
+@@ -548,25 +548,33 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state_fused(
+     GGML_ASSERT(n_seq_tokens == 1);        // single-token decode only
+     GGML_ASSERT(cparams.n_rs_seq == 0);    // no rollback splits on this path
+ 
+-    // Prior conv-state taps for the active sequences: [K-1, conv_channels, n_seqs]. Same get_rows
+-    // gather as the baseline build_conv_state read (tiny; not one of the eliminated buckets).
+-    ggml_tensor * conv_states = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs);
+-    conv_states = ggml_reshape_3d(ctx0, conv_states, conv_kernel_size - 1, conv_channels, n_seqs);
+-    cb(conv_states, "conv_states_reshaped", il);
+-
+     // Current token, native (non-transposed) qkv_mixed: [conv_channels, 1, n_seqs].
+     ggml_tensor * x_cur = ggml_reshape_3d(ctx0, qkv_mixed, conv_channels, n_seq_tokens, n_seqs);
+ 
+     // In-place ring write-back target = the active sequences' conv-cache slot at kv_head, exactly the
+     // destination the baseline ggml_cpy wrote to (s_slot == 0).
+-    const int64_t row_count = (conv_kernel_size - 1) * conv_channels;
+    const int64_t row_count = (conv_kernel_size - 1) * conv_channels; // = n_embd_r
+     const size_t  row_size  = ggml_row_size(conv_states_all->type, row_count);
+     ggml_tensor * conv_state_dst =
+         ggml_view_2d(ctx0, conv_states_all, row_count, n_seqs, conv_states_all->nb[1], kv_head * row_size);
+     cb(conv_state_dst, "conv_state_update", il);
+ 
+-    ggml_tensor * conv_output =
+-        ggml_ssm_conv_update_inplace(ctx0, conv_states, conv_kernel, x_cur, conv_state_dst, /*fuse_silu=*/true);
+    // Patch 0028: fuse the residual conv-state tap gather (the k_get_rows that build_conv_state's
+    // build_rs left firing; roughly the largest single residual decode kernel, see MOE_GAP_VS_VLLM.md).
+    // Exactly like the 0019 SSM-state gather fusion, build_rs feeds the FULL conv cache + the s_copy
+    // ids into the op (via the get_state_rows lambda) and still performs the rs_zero clear and the
+    // extra-states copy around it; the op reads each active sequence's prior taps directly from
+    // cache[ids[s]] (identity sequences read in place from conv_state_dst), so the separate
+    // ggml_get_rows materialization is eliminated. The read VALUES are unchanged, only the read path
+    // (gather -> indexed in-kernel read) changes, so it is bit-identical to the build_rs gather.
+    auto get_conv_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * {
+        // states = full conv-state cache reshaped 2d [n_embd_r, n_cells]
+        ggml_tensor * cache3d = ggml_reshape_3d(ctx, states, conv_kernel_size - 1, conv_channels, states->ne[1]);
+        return ggml_ssm_conv_update_inplace_ids(ctx, cache3d, conv_kernel, x_cur, conv_state_dst,
+                ids, (int) kv_head, /*fuse_silu=*/true);
+    };
+
+    ggml_tensor * conv_output = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs, get_conv_op);
+     cb(conv_output, "conv_output_silu", il);
+ 
+     // the ring write is a side effect of the op; pull the op into the graph via the output
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index b5e3048..302975f 100644
+--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
+@@ -3793,6 +3793,65 @@ struct test_ssm_conv_update : public test_case {
+     }
+ };
+ 
+// GGML_OP_SSM_CONV gather-free fused decode conv-update via ids (ggml_ssm_conv_update_inplace_ids,
+// patch 0028). conv_states is the FULL cache; ids (a shuffled permutation of [0,n_seqs), rs_head=0)
+// selects each sequence's slot, exercising BOTH the identity in-place read (ids[s]==s) and the
+// non-identity cache read. Validates the conv + silu output (dst) against the CPU reference.
+struct test_ssm_conv_update_ids : public test_case {
+    const int64_t d_conv;
+    const int64_t channels;
+    const int64_t n_seqs;
+
+    std::string op_desc(ggml_tensor * t) override {
+        GGML_UNUSED(t);
+        return "SSM_CONV_UPDATE_IDS";
+    }
+
+    std::string vars() override {
+        return VARS_TO_STR3(d_conv, channels, n_seqs);
+    }
+
+    test_ssm_conv_update_ids(int64_t d_conv = 4, int64_t channels = 256, int64_t n_seqs = 4)
+        : d_conv(d_conv), channels(channels), n_seqs(n_seqs) {}
+
+    ggml_tensor * build_graph(ggml_context * ctx) override {
+        ggml_tensor * conv_states    = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_conv - 1, channels, n_seqs);
+        ggml_tensor * conv_kernel    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, channels);
+        ggml_tensor * x_cur          = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
+        ggml_tensor * conv_state_dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (d_conv - 1) * channels, n_seqs);
+        ggml_tensor * ids            = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_seqs);
+        ggml_set_name(conv_states, "conv_states");
+        ggml_set_name(conv_kernel, "conv_kernel");
+        ggml_set_name(x_cur, "x_cur");
+        ggml_set_name(conv_state_dst, "conv_state_dst");
+        ggml_set_name(ids, "ids");
+
+        ggml_tensor * out = ggml_ssm_conv_update_inplace_ids(ctx, conv_states, conv_kernel, x_cur,
+                conv_state_dst, ids, /*rs_head=*/0, /*fuse_silu=*/true);
+        ggml_set_name(out, "out");
+        return out;
+    }
+
+    void initialize_tensors(ggml_context * ctx) override {
+        std::random_device rd;
+        std::default_random_engine rng(rd());
+        for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+            if (t->type == GGML_TYPE_I32) {
+                // ids: shuffled permutation of [0, n_seqs) into the full cache (rs_head == 0), so some
+                // sequences are identity (ids[s] == s, in-place read) and some are not (scratch read).
+                std::vector<int32_t> data(t->ne[0]);
+                for (int i = 0; i < t->ne[0]; i++) {
+                    data[i] = i;
+                }
+                std::shuffle(data.begin(), data.end(), rng);
+                ggml_backend_tensor_set(t, data.data(), 0, t->ne[0] * sizeof(int32_t));
+            } else {
+                init_tensor_uniform(t);
+            }
+        }
+    }
+};
+
+ // GGML_OP_SSM_SCAN
+ struct test_ssm_scan : public test_case {
+     const ggml_type type;
+@@ -8504,6 +8563,16 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+         }
+     }
+ 
+    // gather-free fused decode conv-update via ids (ggml_ssm_conv_update_inplace_ids, patch 0028).
+    // channels must be a multiple of 128 for the CUDA SSM_CONV supports_op gate.
+    for (int64_t d_conv : {3, 4}) {
+        for (int64_t channels : {256, 3328}) {
+            for (int64_t n_seqs : {1, 4, 32, 128}) {
+                test_cases.emplace_back(new test_ssm_conv_update_ids(d_conv, channels, n_seqs));
+            }
+        }
+    }
+
+     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1, 1024, 1, 32, 4)); // Mamba-1
+     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 128, 64, 16, 2, 32, 4)); // Mamba-2
+     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 256, 64,  8, 2, 32, 4)); // Falcon-H1
+-- 
+2.43.0
+
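To make the read path of patch 0028 concrete, here is a minimal CPU-side sketch of a gather-free conv update that indexes the full cache through `ids`. It is not the patch's kernel: the function name `conv_update_ids`, the flat `std::vector` layout, and the plain shift-left window update (standing in for whatever ring layout the real kernel uses) are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference, NOT the ggml implementation. Assumed flat layouts:
//   cache  : [n_cells][channels][K-1]  (taps contiguous per channel)
//   kernel : [channels][K]
//   x      : [n_seqs][channels]        (one decode token per sequence)
//   ids    : [n_seqs]                  (cache cell holding each sequence's prior taps)
//   dst    : [n_seqs][channels][K-1]   (receives the advanced tap window)
// Returns out[n_seqs][channels].
std::vector<float> conv_update_ids(
        const std::vector<float>   & cache,
        const std::vector<float>   & kernel,
        const std::vector<float>   & x,
        const std::vector<int32_t> & ids,
        std::vector<float>         & dst,
        int K, int channels, bool fuse_silu) {
    assert(K >= 2);
    const int taps   = K - 1;
    const int n_seqs = (int) ids.size();
    assert(dst.size() >= (std::size_t) n_seqs * channels * taps);

    std::vector<float> out((std::size_t) n_seqs * channels);
    for (int s = 0; s < n_seqs; ++s) {
        for (int c = 0; c < channels; ++c) {
            // Indexed read straight from the full cache: no gathered copy is built first.
            const float * prev = &cache[((std::size_t) ids[s] * channels + c) * taps];
            const float * w    = &kernel[(std::size_t) c * K];
            const float   xc   = x[(std::size_t) s * channels + c];

            float acc = xc * w[taps];
            for (int k = 0; k < taps; ++k) {
                acc += prev[k] * w[k];
            }

            // Advance the window: drop the oldest tap, append the current token.
            float * next = &dst[((std::size_t) s * channels + c) * taps];
            for (int k = 0; k + 1 < taps; ++k) {
                next[k] = prev[k + 1];
            }
            next[taps - 1] = xc;

            out[(std::size_t) s * channels + c] = fuse_silu ? acc / (1.0f + std::exp(-acc)) : acc;
        }
    }
    return out;
}
```

The sketch keeps `cache` and `dst` as separate buffers, so it sidesteps the in-place aliasing that the patch handles for identity sequences.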
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0029-qwen35-blocktable-within-step-cache.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0029-qwen35-blocktable-within-step-cache.patch
@@ -0,0 +1,176 @@
+From e2acb3bca4d12ecef4964a214d397fc91ecfcebc Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Sat, 27 Jun 2026 03:45:19 +0200
+Subject: [PATCH] feat(paged): block-table within-step host cache (patch 0029)
+
+Lever 5 (host pipeline). get_block_table() is called once per full-attention
+layer per decode step, but the KV cell layout (and therefore the block table)
+is fixed for the whole step: it only changes in apply() when the ubatch's slots
+are committed. The old path recomputed the full table on every layer.
+
+This caches the table the first time it is built in a step and reuses the bytes
+(memcpy) for every subsequent full-attention layer, invalidating the cache in
+apply(). The reused bytes are identical to a fresh compute, so the change is
+bit-exact. Toggle off with LLAMA_PAGED_NO_BT_CACHE=1.
+
+Measured host-side get_block_table time (llama-batched-bench, npp128 ntg128
+npl128, cache OFF -> ON):
+- MoE  q36-35b-a3b-nvfp4: 112.94 -> 14.82 ms  (-87%)
+- dense q36-27b-nvfp4   : 193.78 -> 16.90 ms  (-91%)
+
+Throughput: dense is partly host-bound and gains 2.7% TG (364.8 -> 374.7 t/s,
+~95.8% of the vLLM 391 t/s reference @npl128). MoE decode is compute-bound
+(FP4 GEMM dominates), so the saved host time is off the critical path and TG
+is flat (752.2 -> 757.0 t/s). The cache is therefore a pure pipeline cleanup,
+not a numeric change.
+
+Bit-exact, per path (llama-completion --temp 0 --seed 1, 48 tok):
+- non-paged MoE   = 07db32c2bcb78d17a43ed18bc22705cd  (unchanged baseline)
+- paged MoE       = 8cb0ce23777bf55f92f63d0292c756b0  (paged baseline)
+- paged MoE cache OFF == cache ON (both 8cb0ce23)
+- dense non-paged == dense paged = 5951a5b4d624ce891e22ab5fca9bc439
+
+The paged-MoE md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a
+benign FP-accumulation-order difference of the paged attention reduction, not a
+bug: KL-divergence vs the f16 reference (16 chunks, c512) gives KLD(paged||f16)
+= 0.13600 <= KLD(nonpaged||f16) = 0.13660 and PPL(paged) = 7.4009 ~
+PPL(nonpaged) = 7.3896 (within +/- 0.29). See PAGED_BITEXACT_NOTE.md and
+LEVER5_HOSTPIPE_RESULTS.md.
+
+Includes the [L5INSTR] host-timing instrumentation used to measure the lever.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/llama-context.cpp  |  7 +++++++
+ src/llama-kv-cache.cpp | 28 +++++++++++++++++++++++++++-
+ src/llama-kv-cache.h   |  9 +++++++++
+ src/paged-attn.cpp     |  9 +++++++++
+ 4 files changed, 52 insertions(+), 1 deletion(-)
+
+diff --git a/src/llama-context.cpp b/src/llama-context.cpp
+index 5c90c48..ad7939e 100644
+--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
+@@ -1306,7 +1306,11 @@ bool llama_context::set_adapter_cvec(
+     return res;
+ }
+ 
+extern "C" void l5_add_setinp(double ns);
+extern "C" void l5_add_hostproc(double ns);
+static inline double l5c_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; }
+ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, llm_graph_type gtype, llama_memory_context_i * mctx, ggml_status & ret) {
+    double _l5_t0=l5c_now_ns();
+     if (mctx && !mctx->apply()) {
+         LLAMA_LOG_ERROR("%s: failed to apply memory context\n", __func__);
+         ret = GGML_STATUS_FAILED;
+@@ -1361,11 +1365,14 @@ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, ll
+         //const auto t_start_us = ggml_time_us();
+ 
+         // FIXME this call causes a crash if any model inputs were not used in the graph and were therefore not allocated
+        double _l5_si=l5c_now_ns();
+         res->set_inputs(&ubatch);
+        l5_add_setinp(l5c_now_ns()-_l5_si);
+ 
+         //LLAMA_LOG_INFO("graph set inputs time: %.3f ms\n", (ggml_time_us() - t_start_us)/1000.0);
+     }
+ 
+    l5_add_hostproc(l5c_now_ns()-_l5_t0);
+     const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
+     if (status != GGML_STATUS_SUCCESS) {
+         LLAMA_LOG_ERROR("%s: failed to compute graph, compute status: %d\n", __func__, status);
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 21b8f1e..17aaf40 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -2772,6 +2772,9 @@ bool llama_kv_cache_context::apply() {
+     kv->apply_ubatch(sinfos[i_cur], ubatches[i_cur]);
+     n_kv = kv->get_n_kv(sinfos[i_cur]);
+ 
+    // the cells for this ubatch just changed -> drop the cached block table
+    bt_cache_valid = false;
+
+     return true;
+ }
+ 
+@@ -2814,7 +2817,30 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
+ }
+ 
+ void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const {
+-    kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]);
+    const auto & sinfo = sinfos[i_cur];
+    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+    const size_t total = (size_t) ns * n_blk;
+
+    // within-step reuse: all full-attention layers of a step request the same
+    // table (same i_cur/n_blk, cells fixed since apply()). The bytes are
+    // identical to a fresh compute, so this is bit-exact.
+    static const bool nocache = (getenv("LLAMA_PAGED_NO_BT_CACHE") != nullptr);
+    if (nocache) {
+        kv->get_block_table(dst, n_blk, n_kv, sinfo);
+        return;
+    }
+
+    if (bt_cache_valid && bt_cache_n_blk == n_blk && bt_cache.size() == total) {
+        memcpy(dst, bt_cache.data(), total * sizeof(int32_t));
+        return;
+    }
+
+    kv->get_block_table(dst, n_blk, n_kv, sinfo);
+
+    bt_cache.resize(total);
+    memcpy(bt_cache.data(), dst, total * sizeof(int32_t));
+    bt_cache_n_blk = n_blk;
+    bt_cache_valid = true;
+ }
+ 
+ ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
+diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
+index e9980b6..b03de78 100644
+--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
+@@ -451,4 +451,13 @@ private:
+     // a heuristic, to avoid attending the full cache if it is not yet utilized
+     // as the cache gets filled, the benefit from this heuristic disappears
+     int32_t n_kv;
+
+    // [paged L5] within-step block-table cache. get_block_table() is called once
+    // per full-attention layer per decode step, but the cell layout (and hence
+    // the table) is identical across all layers of a step. Compute it on the
+    // first call and reuse the bytes for the rest; invalidated in apply() when
+    // the ubatch's slots are committed (the only host-side mutation per step).
+    mutable std::vector<int32_t> bt_cache;
+    mutable uint32_t bt_cache_n_blk = 0;
+    mutable bool     bt_cache_valid = false;
+ };
+diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
+index fed8ca9..ebd92be 100644
+--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
+@@ -8,6 +8,13 @@
+ 
+ #include <cstdlib>
+ #include <cstdio>
+#include <ctime>
+namespace { static inline double l5_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; } }
+double g_l5_t_gbt=0, g_l5_t_setinp=0, g_l5_t_hostproc=0; long g_l5_n_gbt=0, g_l5_n_setinp=0, g_l5_n_hostproc=0;
+extern "C" void l5_add_setinp(double ns){ g_l5_t_setinp+=ns; g_l5_n_setinp++; }
+extern "C" void l5_add_hostproc(double ns){ g_l5_t_hostproc+=ns; g_l5_n_hostproc++; }
+namespace { struct L5Printer { ~L5Printer(){ fprintf(stderr,"[L5INSTR] get_block_table n=%ld sum=%.2fms mean=%.4fms | set_inputs n=%ld sum=%.2fms mean=%.4fms | hostproc n=%ld sum=%.2fms mean=%.4fms\n", g_l5_n_gbt, g_l5_t_gbt/1e6, g_l5_n_gbt? g_l5_t_gbt/1e6/g_l5_n_gbt:0.0, g_l5_n_setinp, g_l5_t_setinp/1e6, g_l5_n_setinp? g_l5_t_setinp/1e6/g_l5_n_setinp:0.0, g_l5_n_hostproc, g_l5_t_hostproc/1e6, g_l5_n_hostproc? g_l5_t_hostproc/1e6/g_l5_n_hostproc:0.0 ); } } g_l5_printer; }
+
+ 
+ namespace paged_attn {
+ 
+@@ -54,7 +61,9 @@ public:
+     void set_input(const llama_ubatch * ubatch) override {
+         GGML_UNUSED(ubatch);
+         GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
+        double _t=l5_now_ns();
+         mctx->get_block_table((int32_t *) idxs->data, n_blk);
+        g_l5_t_gbt += l5_now_ns()-_t; g_l5_n_gbt++;
+     }
+ 
+     const llama_kv_cache_context * mctx;
+-- 
+2.43.0
+
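The within-step reuse in patch 0029 can be sketched in isolation as a small memoizing wrapper. This is a simplified stand-in, not the llama.cpp class: `BlockTableCache`, `invalidate()`, and the `Compute` callback are hypothetical names, and the callback plays the role of the expensive `kv->get_block_table(...)` call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <utility>
#include <vector>

// Illustrative only: none of these names exist in llama.cpp.
class BlockTableCache {
public:
    using Compute = std::function<void(int32_t * dst, uint32_t n_blk)>;

    BlockTableCache(uint32_t n_streams, Compute compute)
        : n_streams_(n_streams), compute_(std::move(compute)) {}

    // Call whenever the cell layout changes (apply() in the patch).
    void invalidate() { valid_ = false; }

    // Fill dst with n_streams * n_blk entries; recompute only on a miss.
    void get(int32_t * dst, uint32_t n_blk) {
        assert(dst != nullptr);
        const std::size_t total = (std::size_t) n_streams_ * n_blk;

        if (valid_ && cached_n_blk_ == n_blk && cache_.size() == total) {
            // Hit: the stored bytes are exactly what a fresh compute would produce.
            std::memcpy(dst, cache_.data(), total * sizeof(int32_t));
            return;
        }

        // Miss: compute once, then keep a copy for the remaining layers of the step.
        compute_(dst, n_blk);
        cache_.assign(dst, dst + total);
        cached_n_blk_ = n_blk;
        valid_        = true;
    }

private:
    uint32_t             n_streams_;
    Compute              compute_;
    std::vector<int32_t> cache_;
    uint32_t             cached_n_blk_ = 0;
    bool                 valid_        = false;
};
```

Keying the hit test on `n_blk` and the stored size, as the patch does, means a request with a different table shape falls through to a recompute instead of returning stale bytes.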
--- a/backend/cpp/llama-cpp-localai-paged/patches/paged/0030-fused-op-backend-gate.patch
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0030-fused-op-backend-gate.patch
@@ -0,0 +1,106 @@
+From a095f4ebeefafd16dd54c514eb86148fa46daef3 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Sat, 27 Jun 2026 07:30:43 +0000
+Subject: [PATCH] feat(paged): backend-gate fused GDN/discriminated SSM_CONV
+ emission (patch 0030)
+
+Closes the latent silent-miscompute hazard (audit RISKY-1). The fused/in-place
+Gated Delta Net op (0018/0019/0026: ggml_gated_delta_net_inplace[_ids][_hybrid])
+and the discriminated SSM_CONV decode op (0021/0028: ggml_ssm_conv_update_inplace
+[_ids], which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET with extra src
+slots - a non-null src[3]/src[4] ring/ids discriminator) are emitted DEFAULT-ON
+(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) but are implemented for the
+CUDA-family TU (CUDA / HIP "ROCm" / "MUSA", hipified ggml-cuda) and the CPU
+reference ONLY.
+
+The hazard: a compute backend that supports PLAIN GGML_OP_SSM_CONV but ignores
+the src[3]/src[4] discriminator (Vulkan/SYCL/Metal) reports supports_op==true for
+the node and the scheduler assigns the discriminated conv to it; it then runs the
+wrong plain conv => SILENT corruption (not a crash). The upstream auto_fgdn
+device-mismatch resolution only inspects GATED_DELTA_NET nodes, so the
+discriminated-SSM_CONV safety was only incidentally covered (it happened to share
+backend coverage with the GDN op); the hazard becomes live the moment a non-CUDA
+paged build of a gated-DeltaNet model exists.
+
+FIX: gate the fused-op emission on the active compute backend type. Before the
+auto_fgdn resolution in llama_context::sched_reserve(), if any non-CPU compute
+backend is not CUDA-family (reg name != "CUDA"/"ROCm"/"MUSA"), force
+fused_gdn_ar = fused_gdn_ch = auto_fgdn = false. Every emission site keys off
+these flags (conv_decode_fused = ... && fused_gdn_ar; fused = ... fused_gdn_ar/ch),
+so disabling them routes the graph to the upstream non-fused path: a PLAIN
+ggml_ssm_conv (no discriminator) + ggml_silu, which every backend handles
+correctly. This makes the discriminated-op safety explicit and decoupled from the
+GDN-op device-mismatch heuristic.
+
+INVARIANT (CUDA byte-identical): on a CUDA backend the reg name is "CUDA", so
+fgdn_backend_ok stays true, the flags are left untouched, and the emitted decode
+graph is unchanged - byte-identical to pre-0030. The fix only changes behavior on
+backends that are neither CUDA-family (CUDA/ROCm/MUSA) nor CPU.
+
+GATE compile: CPU-only build (GGML_CUDA=OFF) of the full series (pin 9d5d882d +
+0001-0029 + this) links libllama.so and test-backend-ops with 0 errors; the
+edited llama-context.cpp compiles clean (uses only already-included <cstring> +
+backend-reg API already used in this TU). test-backend-ops correctness for
+SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET is a
+CUDA0-vs-CPU comparison (CPU-only run skips CPU-vs-CPU); the test cases are
+registered and exercised on the CUDA DGX run.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/llama-context.cpp | 39 +++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 39 insertions(+)
+
+diff --git a/src/llama-context.cpp b/src/llama-context.cpp
+index ad7939e..c408eef 100644
+--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
+@@ -521,6 +521,45 @@ void llama_context::sched_reserve() {
+         cparams.auto_fa = false;
+     }
+ 
+    // RISKY-1 guard: the fused/in-place Gated Delta Net op and the discriminated
+    // SSM_CONV (which reuse GGML_OP_GATED_DELTA_NET / GGML_OP_SSM_CONV with extra
+    // src slots - a non-null src[3]/src[4] ring/ids discriminator) are only
+    // implemented for the CUDA-family backends (CUDA / HIP "ROCm" / "MUSA" - all
+    // built from the hipified ggml-cuda TU) and the CPU reference. Any other
+    // compute backend (Vulkan/SYCL/Metal/...) that supports *plain* SSM_CONV but
+    // ignores the discriminator src would silently run the WRONG conv. The
+    // upstream auto_fgdn device-mismatch check below only inspects
+    // GATED_DELTA_NET nodes, so couple the discriminated-SSM_CONV safety
+    // explicitly to the backend type here: keep the fused path enabled only when
+    // every non-CPU compute backend is CUDA-family. On CUDA this leaves the flags
+    // untouched, so the emitted decode graph is byte-identical.
+    if (cparams.fused_gdn_ar || cparams.fused_gdn_ch) {
+        bool fgdn_backend_ok = true;
+        for (auto & backend : backends) {
+            ggml_backend_dev_t dev = ggml_backend_get_device(backend.get());
+            if (!dev || ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) {
+                // CPU reference handles the fused/discriminated ops
+                continue;
+            }
+            ggml_backend_reg_t reg  = ggml_backend_dev_backend_reg(dev);
+            const char *       name = reg ? ggml_backend_reg_name(reg) : "";
+            // GGML_CUDA_NAME is "CUDA" / "ROCm" (HIP) / "MUSA"; all three are the
+            // same ggml-cuda TU that carries the discriminated-op handling.
+            if (strcmp(name, "CUDA") != 0 && strcmp(name, "ROCm") != 0 && strcmp(name, "MUSA") != 0) {
+                fgdn_backend_ok = false;
+                break;
+            }
+        }
+
+        if (!fgdn_backend_ok) {
+            cparams.fused_gdn_ar = false;
+            cparams.fused_gdn_ch = false;
+            cparams.auto_fgdn    = false;
+            LLAMA_LOG_INFO("%s: fused Gated Delta Net / discriminated SSM_CONV disabled "
+                    "(a non-CPU compute backend is not CUDA/ROCm/MUSA)\n", __func__);
+        }
+    }
+
+     if (cparams.auto_fgdn) {
+         LLAMA_LOG_INFO("%s: resolving fused Gated Delta Net support:\n", __func__);
+ 
+-- 
+2.43.0
+
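The gating rule of patch 0030 reduces to a predicate over the active backends. The sketch below restates it without the ggml backend-registry calls: `BackendInfo`, `is_cuda_family`, and `fused_ops_allowed` are hypothetical names, and the `"Vulkan"` string in the usage note is only an example of a registry name that is not CUDA-family.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for what the patch reads from each backend's device and registry.
struct BackendInfo {
    bool        is_cpu;    // device type is CPU
    std::string reg_name;  // backend registry name, e.g. "CUDA"
};

// The three names the patch treats as the same ggml-cuda translation unit.
inline bool is_cuda_family(const std::string & name) {
    return name == "CUDA" || name == "ROCm" || name == "MUSA";
}

// Fused / discriminated ops stay enabled only if every non-CPU backend is CUDA-family.
inline bool fused_ops_allowed(const std::vector<BackendInfo> & backends) {
    for (const BackendInfo & b : backends) {
        if (b.is_cpu) {
            continue;  // the CPU reference implements the fused ops
        }
        if (!is_cuda_family(b.reg_name)) {
            return false;
        }
    }
    return true;
}
```

With a CPU plus `"CUDA"` pair the predicate holds and the flags would be left untouched; with a CPU plus `"Vulkan"` pair it fails, which is the case where the patch forces `fused_gdn_ar`, `fused_gdn_ch` and `auto_fgdn` to false.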
--- a/backend/cpp/llama-cpp-localai-paged/run.sh
+++ b/backend/cpp/llama-cpp-localai-paged/run.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+set -ex
+
+# Get the absolute current dir where the script is located
+CURDIR=$(dirname "$(realpath "$0")")
+
+cd /
+
+echo "CPU info:"
+grep -e "model\sname" /proc/cpuinfo | head -1
+grep -e "flags" /proc/cpuinfo | head -1
+
+BINARY=llama-cpp-localai-paged-fallback
+
+# x86/arm64 ship a single llama-cpp-localai-paged-cpu-all built with ggml
+# CPU_ALL_VARIANTS: ggml's backend registry dlopens the best libggml-cpu-*.so for
+# this host, so no shell-side probing. ROCm ships only the fallback, so fall back
+# to it when cpu-all is absent.
+if [ -e "$CURDIR/llama-cpp-localai-paged-cpu-all" ]; then
+	BINARY=llama-cpp-localai-paged-cpu-all
+fi
+
+if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then
+	if [ -e "$CURDIR/llama-cpp-localai-paged-grpc" ]; then
+		BINARY=llama-cpp-localai-paged-grpc
+	fi
+fi
+
+# Extend ld library path with the dir where this script is located/lib
+if [ "$(uname)" == "Darwin" ]; then
+	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
+else
+	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
+	# Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files)
+	if [ -d "$CURDIR/lib/rocblas/library" ]; then
+		export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library
+	fi
+fi
+
+# If there is a lib/ld.so, use it
+if [ -f "$CURDIR/lib/ld.so" ]; then
+	echo "Using lib/ld.so"
+	echo "Using binary: $BINARY"
+	exec "$CURDIR/lib/ld.so" "$CURDIR/$BINARY" "$@"
+fi
+
+echo "Using binary: $BINARY"
+exec "$CURDIR/$BINARY" "$@"
+
+# We should never reach this point, however just in case we do, run fallback
+exec "$CURDIR/llama-cpp-localai-paged-fallback" "$@"
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,4 +1,9 @@

+# This pin is auto-bumped nightly by .github/workflows/bump_deps.yaml (the stock
+# llama-cpp backend is patch-free, so a naive bump is safe). The paged backend
+# (backend/cpp/llama-cpp-localai-paged) does NOT inherit this pin: it owns its
+# own LLAMA_VERSION because its vendored patch series would break on a naive
+# bump and is advanced only by the manual PIN_SYNC process.
 LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp

@@ -169,7 +174,12 @@ llama.cpp:
 	git remote add origin $(LLAMA_REPO)  && \
 	git fetch --all --tags && \
 	git checkout -b build $(LLAMA_VERSION) && \
-	git submodule update --init --recursive --depth 1 --single-branch
+	git submodule update --init --recursive --depth 1 --single-branch && \
+	for p in $(CURRENT_MAKEFILE_DIR)patches/0*.patch; do \
+		[ -e "$$p" ] || continue; \
+		echo "applying llama.cpp patch: $$p"; \
+		git apply --verbose "$$p" || { echo "patch failed: $$p"; exit 1; }; \
+	done

 llama.cpp/tools/grpc-server: llama.cpp
 	mkdir -p llama.cpp/tools/grpc-server
--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -750,6 +750,118 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.kv_unified = false;
            }
+        // --- paged KV cache (experimental, off by default) ---
+        // Enables the on-demand paged KV-cache engine (vendored PagedKVManager
+        // + paged placement/gather/alloc seams). The engine is gated inside
+        // llama.cpp by the LLAMA_KV_PAGED env var, evaluated once at first use;
+        // here we expose it as a per-server model option instead of forcing the
+        // operator to export a process-wide env. When enabled we set the env
+        // BEFORE the model/context is created (later in this handler), so the
+        // engine latches on. When the option is absent we touch nothing, so an
+        // externally exported LLAMA_KV_PAGED still works as an escape hatch.
+        // Note: the engine's env check is process-wide and latches on first
+        // use, so enabling it for one model enables it for the worker process;
+        // LocalAI runs one model per llama.cpp worker, so this maps cleanly to
+        // per-server configuration. `kv_paged_debug` turns on the per-slot
+        // [paged-alloc]/free trace (LLAMA_KV_PAGED_DEBUG).
+        //
+        // The continuous-batching serving loop (update_slots) drives paged KV
+        // transparently through the existing kv-cache seams: each slot's
+        // sequence allocates paged blocks on arrival (find_slot placement) and
+        // returns them on slot release (the seq_rm free seam). This is
+        // token-identical to stock under both the unified and per-sequence
+        // caches. The per-slot allocate/free capacity benefit, however, only
+        // materialises with a per-sequence cache, since paged block ownership
+        // is keyed by stream and the unified cache collapses every slot onto a
+        // single stream. Operators who want that benefit should pair this with
+        // `kv_unified:false`; we do NOT flip kv_unified here, to keep the
+        // default serving behaviour (and the idle-slot prompt cache) unchanged.
+        } else if (!strcmp(optname, "kv_paged") || !strcmp(optname, "paged_kv") || !strcmp(optname, "paged_attention")) {
+            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
+                setenv("LLAMA_KV_PAGED", "1", 1);
+            }
+        } else if (!strcmp(optname, "kv_paged_debug") || !strcmp(optname, "paged_kv_debug")) {
+            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
+                setenv("LLAMA_KV_PAGED_DEBUG", "1", 1);
+            }
+        // --- chunked-prefill QoS budget (experimental, off by default) ---
+        // Caps the number of prompt tokens any single slot may prefill per
+        // update_slots iteration, so a large prompt cannot monopolise the batch
+        // and freeze the in-flight decoders. The serving loop reads this budget
+        // from the LLAMA_PREFILL_BUDGET env var (set BEFORE context init, like
+        // kv_paged above) and splits oversized prompts across iterations,
+        // interleaving decode steps for the other slots. A 6k-token prefill that
+        // stalled 8 decoders ~3.4s drops to ~780ms at budget=512 (4.8x stall
+        // cut) with zero TTFT cost and no steady-state regression. Unset or a
+        // non-positive value leaves the env untouched, so the stock unbounded
+        // prefill behaviour is preserved (an externally exported
+        // LLAMA_PREFILL_BUDGET still works as an escape hatch).
+        } else if (!strcmp(optname, "max_prefill_tokens") || !strcmp(optname, "mpt") || !strcmp(optname, "prefill_budget")) {
+            if (optval != NULL) {
+                try {
+                    int budget = std::stoi(optval_str);
+                    if (budget > 0) {
+                        setenv("LLAMA_PREFILL_BUDGET", std::to_string(budget).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the budget unset (stock behaviour)
+                }
+            }
+        // --- dynamic decode-first prefill budget (patch 0016, continuous-batch P1) ---
+        // Supersedes max_prefill_tokens (the static patch-0013 cap) with the dynamic
+        // T - D budget read by update_slots(): a single total per-step token budget T
+        // (max_batch_tokens / mbt, the vLLM max_num_batched_tokens analogue) of which
+        // decode claims its live load D first and prefill gets the leftover, plus an
+        // optional per-slot prompt-chunk cap (prefill_cap, the long_prefill_token_
+        // threshold analogue). Both are set BEFORE context init, like kv_paged /
+        // max_prefill_tokens above. Unset leaves the env untouched, so the engine stays
+        // byte-identical to stock (an externally exported LLAMA_MAX_BATCH_TOKENS /
+        // LLAMA_PREFILL_CAP still works as an escape hatch). When max_batch_tokens is set
+        // it takes precedence over max_prefill_tokens: the engine honours the legacy
+        // LLAMA_PREFILL_BUDGET only when the dynamic knob is unset.
+        } else if (!strcmp(optname, "max_batch_tokens") || !strcmp(optname, "mbt")) {
+            if (optval != NULL) {
+                try {
+                    int mbt = std::stoi(optval_str);
+                    if (mbt > 0) {
+                        setenv("LLAMA_MAX_BATCH_TOKENS", std::to_string(mbt).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the budget unset (stock behaviour)
+                }
+            }
+        } else if (!strcmp(optname, "prefill_cap")) {
+            if (optval != NULL) {
+                try {
+                    int cap = std::stoi(optval_str);
+                    if (cap > 0) {
+                        setenv("LLAMA_PREFILL_CAP", std::to_string(cap).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the per-slot cap unset (engine default)
+                }
+            }
+        // --- hybrid per-head bf16 SSM-state precision (patch 0026, qwen3.5 gated-DeltaNet decode) ---
+        // Opt-in reduced-precision fast mode for the recurrent SSM state: a gated-DeltaNet head whose
+        // memory length tau_h = 1/(|ssm_a|*softplus(ssm_dt)) tokens exceeds this threshold stays f32;
+        // faster-decaying heads persist their state as bf16, halving that head's dominant recurrence
+        // byte stream on decode. The value is the tau threshold in tokens (e.g. 32 / 64); 0 keeps every
+        // head f32 (the bit-exact default). Set BEFORE context init via LLAMA_SSM_BF16_TAU, consumed in
+        // common_context_params_to_llama (patch 0026) only when the --ssm-bf16-tau CLI flag is unset.
+        // Unset / non-positive => env untouched, so stock stays byte-identical and bit-exact (an
+        // externally exported LLAMA_SSM_BF16_TAU still works as an escape hatch). NOTE: this mode is
+        // NOT bit-exact (~91% same-top-p ceiling); see backend/cpp/llama-cpp-localai-paged/README.md (Dev notes).
+        } else if (!strcmp(optname, "ssm_bf16_tau") || !strcmp(optname, "ssm_hybrid_tau")) {
+            if (optval != NULL) {
+                try {
+                    float tau = std::stof(optval_str);
+                    if (tau > 0.0f) {
+                        setenv("LLAMA_SSM_BF16_TAU", std::to_string(tau).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the threshold unset (bit-exact f32 default)
+                }
+            }
        } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
            if (optval != NULL) {
                try {
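The decode-first split described in the patch-0016 comment above (a total per-step budget T, of which decode claims its live load D and prefill gets the leftover) can be illustrated with a minimal sketch. This is not the code in `update_slots()`: `StepBudget`, `split_step_budget` and `prefill_chunk_for_slot` are hypothetical names, and the sketch assumes that each live decoder consumes exactly one token per step and that prefill gets nothing when D >= T.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical illustration only; the real logic lives in the patched engine.
struct StepBudget {
    int decode_tokens;   // D: tokens claimed by decode this step
    int prefill_tokens;  // leftover available to prefill: max(0, T - D)
};

// Split a positive per-step token budget T (max_batch_tokens) decode-first.
// Assumption: one token per live decoding slot.
inline StepBudget split_step_budget(int max_batch_tokens, int live_decoders) {
    assert(max_batch_tokens > 0);  // the option is only exported when > 0
    const int d = std::max(0, live_decoders);
    const int leftover = std::max(0, max_batch_tokens - d);
    return StepBudget{d, leftover};
}

// Size one slot's prompt chunk from the leftover, honouring an optional
// per-slot cap (the prefill_cap analogue). A cap <= 0 means "no cap".
inline int prefill_chunk_for_slot(int remaining_prompt, int leftover, int prefill_cap) {
    int chunk = std::min(remaining_prompt, leftover);
    if (prefill_cap > 0) {
        chunk = std::min(chunk, prefill_cap);
    }
    return std::max(0, chunk);
}
```

Under these assumptions, a 6000-token prompt arriving while 8 slots decode at T = 512 would be fed in chunks of at most 504 tokens, or 128 if a `prefill_cap` of 128 were set.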
--- a/backend/cpp/llama-cpp/prepare.sh
+++ b/backend/cpp/llama-cpp/prepare.sh
@@ -2,12 +2,18 @@

 ## Patches

-## Apply patches from the `patches` directory
+## Apply the base `patches/` series (top-level *.patch only; *.md/dirs skipped).
+## The stock llama-cpp backend is patch-free by default, so this normally does
+## nothing. The Makefile `llama.cpp` target already `git apply`s any base patch
+## at checkout, so each apply here is `-N` (skip already-applied): re-applying a
+## git-format patch with `patch` would fuzzily duplicate hunks. This block only
+## does real work if prepare.sh is run against an unpatched checkout.
 if [ -d "patches" ]; then
-    for patch in $(ls patches); do
+    for patch in patches/*.patch; do
+        [ -e "$patch" ] || continue
        echo "Applying patch $patch"
-        patch -d llama.cpp/ -p1 < patches/$patch
-    done 
+        patch -d llama.cpp/ -p1 -N -r - < "$patch" || true
+    done
 fi

 set -e
--- a/backend/index.yaml
+++ b/backend/index.yaml
@@ -72,6 +72,43 @@
    nvidia-cuda-12: "cuda12-turboquant"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant"
+- &llamacpplocalaipaged
+  name: "llama-cpp-localai-paged"
+  alias: "llama-cpp-localai-paged"
+  license: mit
+  icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
+  description: |
+    LocalAI's paged-attention llama.cpp variant: on-demand paged KV cache plus a
+    decode-first prefill budget. Built from the same upstream llama.cpp pin and
+    the same gRPC server sources as the stock llama-cpp backend, with the LocalAI
+    paged patch series (vendored in this backend) applied. Tuned for NVFP4 dense /
+    MoE on Blackwell / GB10. The paged engine is gated at runtime by the
+    paged_kv / max_batch_tokens model options. Qwen3.5 gated-DeltaNet models can
+    additionally opt into the reduced-precision hybrid SSM-state fast mode with
+    the ssm_bf16_tau:<tokens> option (default off = bit-exact f32; non-bit-exact
+    when enabled).
+  urls:
+    - https://github.com/ggerganov/llama.cpp
+  tags:
+    - text-to-text
+    - LLM
+    - GPU
+    - CUDA
+    - paged-attention
+    - nvfp4
+  # CUDA-only: the paged patchset's wins (GDN fusions, NVFP4 FP4-MMA) are
+  # CUDA/Blackwell-specific; off-CUDA they gate off and the backend is
+  # neutral-to-negative, so non-CUDA users should use the stock llama-cpp
+  # backend. default points at cuda12 (mirrors faster-qwen3-tts) so the gallery
+  # entries always resolve to a CUDA variant.
+  capabilities:
+    default: "cuda12-llama-cpp-localai-paged"
+    nvidia: "cuda12-llama-cpp-localai-paged"
+    nvidia-l4t: "nvidia-l4t-arm64-llama-cpp-localai-paged"
+    nvidia-cuda-13: "cuda13-llama-cpp-localai-paged"
+    nvidia-cuda-12: "cuda12-llama-cpp-localai-paged"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-localai-paged"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged"
 - &ds4
  name: "ds4"
  alias: "ds4"
@@ -1639,6 +1676,16 @@
    nvidia-cuda-12: "cuda12-turboquant-development"
    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant-development"
    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant-development"
+- !!merge <<: *llamacpplocalaipaged
+  name: "llama-cpp-localai-paged-development"
+  capabilities:
+    default: "cuda12-llama-cpp-localai-paged-development"
+    nvidia: "cuda12-llama-cpp-localai-paged-development"
+    nvidia-l4t: "nvidia-l4t-arm64-llama-cpp-localai-paged-development"
+    nvidia-cuda-13: "cuda13-llama-cpp-localai-paged-development"
+    nvidia-cuda-12: "cuda12-llama-cpp-localai-paged-development"
+    nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-localai-paged-development"
+    nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged-development"
 - !!merge <<: *ds4
  name: "ds4-development"
  capabilities:
@@ -2307,6 +2354,47 @@
  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant"
  mirrors:
    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant
+## llama-cpp-localai-paged (CUDA-only; see backend/cpp/llama-cpp-localai-paged/README.md section 4c)
+- !!merge <<: *llamacpplocalaipaged
+  name: "cuda12-llama-cpp-localai-paged"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-12-llama-cpp-localai-paged"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-12-llama-cpp-localai-paged
+- !!merge <<: *llamacpplocalaipaged
+  name: "cuda12-llama-cpp-localai-paged-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-12-llama-cpp-localai-paged"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-12-llama-cpp-localai-paged
+- !!merge <<: *llamacpplocalaipaged
+  name: "cuda13-llama-cpp-localai-paged"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-llama-cpp-localai-paged"
+  mirrors:
+    - localai/localai-backends:latest-gpu-nvidia-cuda-13-llama-cpp-localai-paged
+- !!merge <<: *llamacpplocalaipaged
+  name: "cuda13-llama-cpp-localai-paged-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-llama-cpp-localai-paged"
+  mirrors:
+    - localai/localai-backends:master-gpu-nvidia-cuda-13-llama-cpp-localai-paged
+- !!merge <<: *llamacpplocalaipaged
+  name: "nvidia-l4t-arm64-llama-cpp-localai-paged"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-arm64-llama-cpp-localai-paged"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-arm64-llama-cpp-localai-paged
+- !!merge <<: *llamacpplocalaipaged
+  name: "nvidia-l4t-arm64-llama-cpp-localai-paged-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-arm64-llama-cpp-localai-paged"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-arm64-llama-cpp-localai-paged
+- !!merge <<: *llamacpplocalaipaged
+  name: "cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged"
+  uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged"
+  mirrors:
+    - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged
+- !!merge <<: *llamacpplocalaipaged
+  name: "cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged-development"
+  uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged"
+  mirrors:
+    - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged
 ## ds4
 - !!merge <<: *ds4
  name: "cpu-ds4"
--- a/backend/python/fish-speech/requirements.txt
+++ b/backend/python/fish-speech/requirements.txt
@@ -7,3 +7,7 @@ setuptools
 six
 scipy
 numpy
+# fish-speech is installed editable with --no-build-isolation, so the build
+# backends of its transitive deps must already be in the venv. One of them
+# builds a Rust extension and needs setuptools-rust present at metadata time.
+setuptools-rust
--- a/backend/python/llama-cpp-quantization/install.sh
+++ b/backend/python/llama-cpp-quantization/install.sh
@@ -11,14 +11,31 @@ fi
 EXTRA_PIP_INSTALL_FLAGS+=" --upgrade "
 installRequirements

-# Fetch convert_hf_to_gguf.py from llama.cpp
+# Fetch convert_hf_to_gguf.py from llama.cpp.
+# Upstream split the model-specific logic out of the single file into a
+# sibling `conversion/` package (convert_hf_to_gguf.py now does
+# `from conversion import ...`), so a single-file download no longer runs —
+# it fails with `ModuleNotFoundError: No module named 'conversion'`. We clone
+# the repo and copy both the script and the package; Python puts the script's
+# own directory on sys.path[0], so the package resolves when placed beside it.
 LLAMA_CPP_CONVERT_VERSION="${LLAMA_CPP_CONVERT_VERSION:-master}"
+LLAMA_CPP_SRC="${EDIR}/llama.cpp"
 CONVERT_SCRIPT="${EDIR}/convert_hf_to_gguf.py"
-if [ ! -f "${CONVERT_SCRIPT}" ]; then
-    echo "Downloading convert_hf_to_gguf.py from llama.cpp (${LLAMA_CPP_CONVERT_VERSION})..."
-    curl -L --fail --retry 3 \
-        "https://raw.githubusercontent.com/ggml-org/llama.cpp/${LLAMA_CPP_CONVERT_VERSION}/convert_hf_to_gguf.py" \
-        -o "${CONVERT_SCRIPT}" || echo "Warning: Failed to download convert_hf_to_gguf.py."
+
+cloneLlamaCpp() {
+    if [ ! -d "${LLAMA_CPP_SRC}/.git" ]; then
+        git clone --depth 1 --branch "${LLAMA_CPP_CONVERT_VERSION}" \
+            https://github.com/ggml-org/llama.cpp.git "${LLAMA_CPP_SRC}" 2>/dev/null || \
+        git clone --depth 1 https://github.com/ggml-org/llama.cpp.git "${LLAMA_CPP_SRC}"
+    fi
+}
+
+if [ ! -f "${CONVERT_SCRIPT}" ] || [ ! -d "${EDIR}/conversion" ]; then
+    echo "Fetching convert_hf_to_gguf.py + conversion/ from llama.cpp (${LLAMA_CPP_CONVERT_VERSION})..."
+    cloneLlamaCpp
+    cp "${LLAMA_CPP_SRC}/convert_hf_to_gguf.py" "${CONVERT_SCRIPT}"
+    rm -rf "${EDIR}/conversion"
+    cp -r "${LLAMA_CPP_SRC}/conversion" "${EDIR}/conversion"
 fi

 # Install gguf package from the same llama.cpp commit to keep them in sync
@@ -41,12 +58,7 @@ QUANTIZE_BIN="${EDIR}/llama-quantize"
 if [ ! -x "${QUANTIZE_BIN}" ] && ! command -v llama-quantize &>/dev/null; then
    if command -v cmake &>/dev/null; then
        echo "Building llama-quantize from llama.cpp (${LLAMA_CPP_CONVERT_VERSION})..."
-        LLAMA_CPP_SRC="${EDIR}/llama.cpp"
-        if [ ! -d "${LLAMA_CPP_SRC}" ]; then
-            git clone --depth 1 --branch "${LLAMA_CPP_CONVERT_VERSION}" \
-                https://github.com/ggml-org/llama.cpp.git "${LLAMA_CPP_SRC}" 2>/dev/null || \
-            git clone --depth 1 https://github.com/ggml-org/llama.cpp.git "${LLAMA_CPP_SRC}"
-        fi
+        cloneLlamaCpp  # reuses the clone fetched for convert_hf_to_gguf.py
        cmake -B "${LLAMA_CPP_SRC}/build" -S "${LLAMA_CPP_SRC}" -DGGML_NATIVE=OFF -DBUILD_SHARED_LIBS=OFF
        cmake --build "${LLAMA_CPP_SRC}/build" --target llama-quantize -j"$(nproc 2>/dev/null || echo 2)"
        cp "${LLAMA_CPP_SRC}/build/bin/llama-quantize" "${QUANTIZE_BIN}"
--- a/backend/python/sglang/install.sh
+++ b/backend/python/sglang/install.sh
@@ -85,9 +85,15 @@ if [ "x${BUILD_TYPE}" == "x" ] || [ "x${FROM_SOURCE:-}" == "xtrue" ]; then
    # The resulting binary still requires an AVX-512 capable CPU at runtime,
    # same constraint sglang upstream documents in docker/xeon.Dockerfile.

+    # Pin the source build to the same release the GPU path floors on
+    # (0.5.11, see requirements-cublas12-after.txt). An unpinned master clone
+    # pulls in newer CPU kernels (e.g. mamba/fla.cpp) that fail to compile
+    # (constexpr non-constant + kineto_LIBRARY-NOTFOUND). Bump deliberately.
+    SGLANG_VERSION="${SGLANG_VERSION:-v0.5.11}"
    _sgl_src=$(mktemp -d)
    trap 'rm -rf "${_sgl_src}"' EXIT
-    git clone --depth 1 https://github.com/sgl-project/sglang "${_sgl_src}/sglang"
+    git clone --depth 1 --branch "${SGLANG_VERSION}" \
+        https://github.com/sgl-project/sglang "${_sgl_src}/sglang"

    # Patch -march=native → -march=sapphirerapids in the CPU kernel CMakeLists
    sed -i 's/-march=native/-march=sapphirerapids/g' \
--- a/backend/rust/kokoros/src/service.rs
+++ b/backend/rust/kokoros/src/service.rs
@@ -570,6 +570,43 @@ impl Backend for KokorosService {
    ) -> Result<Response<backend::Result>, Status> {
        Err(Status::unimplemented("Not supported"))
    }
+
+    async fn sound_detection(
+        &self,
+        _: Request<backend::SoundDetectionRequest>,
+    ) -> Result<Response<backend::SoundDetectionResponse>, Status> {
+        Err(Status::unimplemented("Not supported"))
+    }
+
+    async fn depth(
+        &self,
+        _: Request<backend::DepthRequest>,
+    ) -> Result<Response<backend::DepthResponse>, Status> {
+        Err(Status::unimplemented("Not supported"))
+    }
+
+    async fn token_classify(
+        &self,
+        _: Request<backend::TokenClassifyRequest>,
+    ) -> Result<Response<backend::TokenClassifyResponse>, Status> {
+        Err(Status::unimplemented("Not supported"))
+    }
+
+    async fn score(
+        &self,
+        _: Request<backend::ScoreRequest>,
+    ) -> Result<Response<backend::ScoreResponse>, Status> {
+        Err(Status::unimplemented("Not supported"))
+    }
+
+    type ForwardStream = ReceiverStream<Result<backend::ForwardReply, Status>>;
+
+    async fn forward(
+        &self,
+        _: Request<tonic::Streaming<backend::ForwardRequest>>,
+    ) -> Result<Response<Self::ForwardStream>, Status> {
+        Err(Status::unimplemented("Not supported"))
+    }
 }

 #[cfg(test)]
--- a/core/backend/hardware_defaults.go
+++ b/core/backend/hardware_defaults.go
@@ -0,0 +1,43 @@
+package backend
+
+// Hardware-specific backend defaults.
+//
+// This file centralizes tuning that depends on the *detected hardware* rather
+// than on the model config. The model config (explicit `batch:`, `context_size:`
+// …) always takes precedence; these helpers only fill values the user left
+// unset, so behavior is unchanged unless the matching hardware is present.
+//
+// Placement note: this runs in the process that builds the gRPC ModelOptions
+// sent to every backend (including the C++ llama.cpp grpc-server), so it is the
+// one common point that covers all backends. For distributed setups where the
+// backend runs on a different host than the orchestrator, worker-side detection
+// (e.g. the C++ backend reading cudaGetDeviceProperties) would be more precise;
+// this single-host default is the pragmatic common case.
+
+import (
+	"github.com/mudler/LocalAI/pkg/xsysinfo"
+	"github.com/mudler/xlog"
+)
+
+// BlackwellBatchSize is the physical batch (n_batch/n_ubatch) default on NVIDIA
+// Blackwell consumer GPUs (sm_120/121, incl. GB10 / DGX Spark). A larger
+// physical batch materially lifts MoE prefill throughput there (per-expert GEMM
+// tiles fill better): measured on a GB10 with Qwen3-30B-A3B, it raises the
+// prefill ceiling by roughly 10-15% and saturates around 2048. Only applied
+// when the model config does not set an explicit `batch:`.
+const BlackwellBatchSize = 2048
+
+// detectBlackwellGPU is a seam over xsysinfo.IsNVIDIABlackwell so tests can
+// force the hardware branch deterministically.
+var detectBlackwellGPU = xsysinfo.IsNVIDIABlackwell
+
+// hardwareDefaultBatchSize returns the physical-batch default for the detected
+// hardware, falling back to the given value when no hardware-specific tuning
+// applies. Used by EffectiveBatchSize only when the config leaves batch unset.
+func hardwareDefaultBatchSize(fallback int) int {
+	if detectBlackwellGPU() {
+		xlog.Debug("Blackwell GPU detected; defaulting physical batch higher for MoE prefill", "batch", BlackwellBatchSize)
+		return BlackwellBatchSize
+	}
+	return fallback
+}
--- a/core/backend/hardware_defaults_internal_test.go
+++ b/core/backend/hardware_defaults_internal_test.go
@@ -0,0 +1,50 @@
+package backend
+
+import (
+	"github.com/mudler/LocalAI/core/config"
+	. "github.com/onsi/ginkgo/v2"
+	. "github.com/onsi/gomega"
+)
+
+var _ = Describe("hardware-specific defaults", func() {
+	var origDetect func() bool
+
+	BeforeEach(func() {
+		origDetect = detectBlackwellGPU
+	})
+	AfterEach(func() {
+		detectBlackwellGPU = origDetect
+	})
+
+	Describe("hardwareDefaultBatchSize", func() {
+		It("returns the fallback when not Blackwell", func() {
+			detectBlackwellGPU = func() bool { return false }
+			Expect(hardwareDefaultBatchSize(512)).To(Equal(512))
+		})
+
+		It("returns BlackwellBatchSize on Blackwell", func() {
+			detectBlackwellGPU = func() bool { return true }
+			Expect(hardwareDefaultBatchSize(512)).To(Equal(BlackwellBatchSize))
+		})
+	})
+
+	Describe("EffectiveBatchSize on Blackwell", func() {
+		threads := 1
+		ctx := 4096
+
+		It("defaults an unset batch to 2048 on Blackwell", func() {
+			detectBlackwellGPU = func() bool { return true }
+			cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
+			opts := grpcModelOpts(cfg, "/tmp/models")
+			Expect(opts.NBatch).To(BeEquivalentTo(BlackwellBatchSize))
+		})
+
+		It("keeps an explicit batch over the Blackwell default", func() {
+			detectBlackwellGPU = func() bool { return true }
+			cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
+			cfg.Batch = 256
+			opts := grpcModelOpts(cfg, "/tmp/models")
+			Expect(opts.NBatch).To(BeEquivalentTo(256))
+		})
+	})
+})
--- a/core/backend/options.go
+++ b/core/backend/options.go
@@ -191,7 +191,10 @@ func EffectiveBatchSize(c config.ModelConfig) int {
 	if ctx := EffectiveContextSize(c); singlePass && ctx > DefaultBatchSize {
 		return ctx
 	}
-	return DefaultBatchSize
+	// Hardware-tuned default when the config leaves batch unset (e.g. a larger
+	// physical batch lifts MoE prefill on Blackwell). Explicit `batch:` (handled
+	// above) always overrides this. See hardware_defaults.go.
+	return hardwareDefaultBatchSize(DefaultBatchSize)
 }

 func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions {
--- a/core/backend/options_internal_test.go
+++ b/core/backend/options_internal_test.go
@@ -103,6 +103,18 @@ var _ = Describe("grpcModelOpts NBatch", func() {
 	threads := 1
 	ctx := 4096

+	// Pin the hardware seam off so these baseline expectations are
+	// deterministic regardless of the host GPU. Blackwell behavior is covered
+	// in hardware_defaults_internal_test.go.
+	var origDetect func() bool
+	BeforeEach(func() {
+		origDetect = detectBlackwellGPU
+		detectBlackwellGPU = func() bool { return false }
+	})
+	AfterEach(func() {
+		detectBlackwellGPU = origDetect
+	})
+
 	It("defaults to 512 for an ordinary model", func() {
 		cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}}
 		opts := grpcModelOpts(cfg, "/tmp/models")
--- a/core/gallery/importers/importers_test.go
+++ b/core/gallery/importers/importers_test.go
@@ -154,6 +154,19 @@ var _ = Describe("DiscoverModelConfig", func() {
 			Expect(err).ToNot(HaveOccurred())
 			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: mlx-vlm"))
 		})
+
+		It("should use llama-cpp-localai-paged backend when specified as a drop-in", func() {
+			// The paged variant is a curated AdditionalBackends() drop-in: the
+			// llama-cpp pipeline matches (the .gguf URI), and the backend
+			// preference is honoured in the emitted YAML.
+			uri := "https://example.com/my-model.gguf"
+			preferences := json.RawMessage(`{"backend": "llama-cpp-localai-paged"}`)
+
+			modelConfig, err := importers.DiscoverModelConfig(uri, preferences)
+
+			Expect(err).ToNot(HaveOccurred())
+			Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: llama-cpp-localai-paged"))
+		})
 	})

 	Context("with HuggingFace URI formats", func() {
@@ -288,7 +301,7 @@ var _ = Describe("DiscoverModelConfig", func() {
 				names = append(names, e.Name)
 				modalities = append(modalities, e.Modality)
 			}
-			Expect(names).To(ContainElements("ik-llama-cpp", "turboquant"))
+			Expect(names).To(ContainElements("ik-llama-cpp", "turboquant", "llama-cpp-localai-paged"))
 			for _, m := range modalities {
 				Expect(m).To(Equal("text"))
 			}
--- a/core/gallery/importers/llama-cpp.go
+++ b/core/gallery/importers/llama-cpp.go
@@ -37,6 +37,7 @@ func (i *LlamaCPPImporter) AdditionalBackends() []KnownBackendEntry {
 	return []KnownBackendEntry{
 		{Name: "ik-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with ik-quants"},
 		{Name: "turboquant", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with TurboQuant optimizations"},
+		{Name: "llama-cpp-localai-paged", Modality: "text", Description: "Paged-attention llama.cpp (on-demand paged KV + decode-first prefill budget), tuned for NVFP4 on Blackwell/GB10"},
 	}
 }

@@ -130,7 +131,7 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error)
 	backend := "llama-cpp"
 	if b, ok := preferencesMap["backend"].(string); ok {
 		switch b {
-		case "ik-llama-cpp", "turboquant":
+		case "ik-llama-cpp", "turboquant", "llama-cpp-localai-paged":
 			backend = b
 		}
 	}
--- a/core/gallery/importers/llama-cpp_test.go
+++ b/core/gallery/importers/llama-cpp_test.go
@@ -375,7 +375,7 @@ var _ = Describe("LlamaCPPImporter", func() {
 	})

 	Context("AdditionalBackends", func() {
-		It("advertises ik-llama-cpp and turboquant as drop-in replacements", func() {
+		It("advertises ik-llama-cpp, turboquant and llama-cpp-localai-paged as drop-in replacements", func() {
 			entries := importer.AdditionalBackends()

 			names := make([]string, 0, len(entries))
@@ -384,7 +384,7 @@ var _ = Describe("LlamaCPPImporter", func() {
 				names = append(names, e.Name)
 				byName[e.Name] = e
 			}
-			Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant"))
+			Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant", "llama-cpp-localai-paged"))

 			ik := byName["ik-llama-cpp"]
 			Expect(ik.Modality).To(Equal("text"))
@@ -393,6 +393,10 @@ var _ = Describe("LlamaCPPImporter", func() {
 			tq := byName["turboquant"]
 			Expect(tq.Modality).To(Equal("text"))
 			Expect(tq.Description).NotTo(BeEmpty())
+
+			paged := byName["llama-cpp-localai-paged"]
+			Expect(paged.Modality).To(Equal("text"))
+			Expect(paged.Description).NotTo(BeEmpty())
 		})
 	})
 })
--- a/docs/content/features/backends.md
+++ b/docs/content/features/backends.md
@@ -125,6 +125,7 @@ For getting started, see the available backends in LocalAI here: https://github.
 LocalAI supports various types of backends:

 - **LLM Backends**: For running language models (e.g., llama.cpp, vLLM, SGLang, transformers, MLX)
+  - **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant (on-demand paged KV cache plus a decode-first prefill budget), tuned for NVFP4 dense/MoE on Blackwell/GB10. It uses the same upstream llama.cpp pin as the stock `llama-cpp` backend and reuses its gRPC server; the paged engine is enabled per model via the `paged_kv` / `max_batch_tokens` options. For Qwen3.5 gated-DeltaNet (hybrid SSM) models you can additionally set `options: [ssm_bf16_tau:<tokens>]` to enable the reduced-precision hybrid SSM-state fast mode: fast-decaying recurrent heads (memory length tau below the threshold, e.g. `32` / `64`) persist their state as bf16, halving those heads' decode byte stream. The default (`0`, off) keeps every head f32 and is bit-exact; when enabled, the mode is **not** bit-exact (~91% same-top-p ceiling; see `backend/cpp/llama-cpp-localai-paged/README.md` for the quality/throughput profile).
 - **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo)
 - **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS)
 - **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step)
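The per-head rule behind `ssm_bf16_tau` (a head's memory length is tau_h = 1/(|ssm_a| * softplus(ssm_dt)) tokens; heads above the threshold stay f32, faster-decaying heads persist as bf16, and 0 keeps every head f32) can be sketched as follows. This is an illustration under stated assumptions, not the patch-0026 implementation: `SsmStatePrecision`, `head_memory_length` and `pick_state_precision` are hypothetical names, `ssm_a` and `ssm_dt` are treated as one scalar per head, and a head whose tau equals the threshold exactly is assumed to go to bf16.

```cpp
#include <cmath>
#include <limits>

// Hypothetical names; the real selection happens inside the patched engine.
enum class SsmStatePrecision { F32, BF16 };

// Numerically stable softplus: log(1 + exp(x)).
inline float softplus(float x) {
    return x > 20.0f ? x : std::log1p(std::exp(x));
}

// Memory length in tokens: tau_h = 1 / (|ssm_a| * softplus(ssm_dt)).
// A head that does not decay at all is treated as having infinite memory.
inline float head_memory_length(float ssm_a, float ssm_dt) {
    const float decay = std::fabs(ssm_a) * softplus(ssm_dt);
    if (decay == 0.0f) {
        return std::numeric_limits<float>::infinity();
    }
    return 1.0f / decay;
}

// Threshold <= 0 keeps every head f32 (the bit-exact default). Otherwise
// long-memory heads (tau_h > threshold) stay f32 and the rest use bf16.
inline SsmStatePrecision pick_state_precision(float ssm_a, float ssm_dt, float tau_threshold) {
    if (!(tau_threshold > 0.0f)) {
        return SsmStatePrecision::F32;
    }
    return head_memory_length(ssm_a, ssm_dt) > tau_threshold
               ? SsmStatePrecision::F32
               : SsmStatePrecision::BF16;
}
```

With a threshold of 32, a head with `ssm_a = 1`, `ssm_dt = 0` (tau of about 1.44 tokens) would be stored as bf16, while a head with `ssm_a = 0.01` (tau of about 144 tokens) would stay f32.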
--- a/gallery/index.yaml
+++ b/gallery/index.yaml
@@ -1,4 +1,274 @@
 ---
+# =============================================================================
+# NVFP4 Qwen3.6 (dense + MoE) for the LocalAI paged-attention llama.cpp backend.
+# These reproduce the GB10 / DGX Spark benchmark serving config (see
+# backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md section 2).
+#
+# PUBLISHED: the dense + MoE base NVFP4 GGUFs are live at huggingface.co/mudler/
+# Qwen3.6-27B-NVFP4-GGUF and .../Qwen3.6-35B-A3B-NVFP4-GGUF (file_type MOSTLY_NVFP4);
+# the sha256 values below were verified against the Hub LFS hashes and the URIs resolve (HTTP 200).
+# Converted from the unsloth/nvidia NVFP4 sources via llama.cpp --outtype auto.
+#
+# NOTE(NVFP4 read): the paged backend (pinned llama.cpp c299a92c) reads NVFP4 GGUF
+# (the GB10 benchmark + the pin-sync md5 gate both ran NVFP4 GGUFs). These gallery
+# GGUFs were re-quantized with a newer convert (origin/master) preserving the same
+# MOSTLY_NVFP4 format; a load check on the paged backend GPU build is the final gate.
+#
+# NOTE(ssm_bf16_tau): Qwen3.5 gated-DeltaNet (hybrid SSM) models can opt into the
+# reduced-precision hybrid SSM-state fast mode by adding `ssm_bf16_tau:<tokens>`
+# (e.g. 32 / 64) to a model's `options:` list - fast-decaying recurrent heads then
+# persist their state as bf16 (LLAMA_SSM_BF16_TAU), halving that head's decode byte
+# stream. Default off (0) = every head f32 = bit-exact; when enabled the mode is NOT
+# bit-exact (~91% same-top-p, beats vLLM dense) - see
+# backend/cpp/llama-cpp-localai-paged/README.md for the quality profile.
+# The two NVFP4 entries below intentionally stay bit-exact (no ssm_bf16_tau).
+# =============================================================================
+- name: "qwen3.6-27b-nvfp4-paged"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF
+  description: |
+    Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower; the throughput figures below are GB10 / DGX Spark (consumer Blackwell).
+
+    Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF. Configured for LocalAI's
+    paged-attention llama.cpp backend (llama-cpp-localai-paged): on-demand paged KV cache
+    plus a decode-first prefill budget. Benchmarked on GB10 / DGX Spark (consumer Blackwell)
+    at 90-117% of vLLM dense decode throughput with 1.5-3x lower memory (GB10-specific figures).
+
+    Requires a llama.cpp new enough to read the NVFP4 GGUF tensor type (the paged backend's
+    upstream pin) - verify on a GPU box before relying on this entry.
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - nvfp4
+    - blackwell
+    - reasoning
+  icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
+  overrides:
+    backend: llama-cpp-localai-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 131072
+    gpu_layers: 99
+    batch: 512
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+      - paged_kv:true              # LLAMA_KV_PAGED=1
+      - max_batch_tokens:512       # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget)
+      - kv_unified:false           # per-slot paged capacity/memory benefit needs a per-sequence cache
+      - parallel:128               # 128 serving slots
+    parameters:
+      model: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
+      sha256: 2fdd857b13cbaa37b913d9566bf0a69443dcdb702e95694ca8d75236710575d4
+      uri: https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF/resolve/main/q36-27b-nvfp4.gguf
+- name: "qwen3.6-35b-a3b-nvfp4-paged"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF
+  description: |
+    Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower; the serving config below reproduces the GB10 / DGX Spark (consumer Blackwell) benchmark setup.
+
+    Qwen3.6-35B-A3B MoE (~3B active), native Blackwell NVFP4 (FP4-MMA) GGUF. Configured for
+    LocalAI's paged-attention llama.cpp backend (llama-cpp-localai-paged): on-demand paged
+    KV cache plus a decode-first prefill budget. Lighter on memory than the dense 27B thanks
+    to the sparse MoE activation.
+
+    Requires a llama.cpp new enough to read the NVFP4 GGUF tensor type (the paged backend's
+    upstream pin) - verify on a GPU box before relying on this entry.
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - nvfp4
+    - blackwell
+    - moe
+    - reasoning
+  icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
+  overrides:
+    backend: llama-cpp-localai-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 131072
+    gpu_layers: 99
+    batch: 512
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+      - paged_kv:true              # LLAMA_KV_PAGED=1
+      - max_batch_tokens:512       # decode-first budget; set 256 for max saturated MoE decode (sweep winner)
+      - kv_unified:false           # per-slot paged capacity/memory benefit needs a per-sequence cache
+      - parallel:128               # 128 serving slots
+    parameters:
+      model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
+      sha256: 1690d0424e232527b8bb135a38033e4699ad11817677eebacd40349020faea52
+      uri: https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF/resolve/main/q36-35b-a3b-nvfp4.gguf
+- name: "qwen3.6-27b-nvfp4-mtp-paged"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/michaelw9999/Qwen3.6-27B-NVFP4-MTP-GGUF
+  description: |
+    Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower.
+
+    Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF with a built-in MTP
+    (multi-token-prediction / speculative) draft head, configured for LocalAI's
+    paged-attention llama.cpp backend (llama-cpp-localai-paged): on-demand paged KV
+    cache plus a decode-first prefill budget. The MTP draft head accelerates decode
+    via self-speculation; ships with the recommended Qwen3.6 sampling defaults.
+
+    Requires a llama.cpp new enough to read the NVFP4 GGUF tensor type (the paged
+    backend's upstream pin) - verify on a GPU box before relying on this entry.
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - nvfp4
+    - blackwell
+    - mtp
+    - reasoning
+  icon: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3.6/Figures/qwen3.6_27b_score.png
+  overrides:
+    backend: llama-cpp-localai-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 131072
+    gpu_layers: 99
+    batch: 512
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+      - paged_kv:true              # LLAMA_KV_PAGED=1
+      - max_batch_tokens:512       # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget)
+      - kv_unified:false           # per-slot paged capacity/memory benefit needs a per-sequence cache
+      - parallel:128               # 128 serving slots
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/Qwen3.6-27B-NVFP4-MTP-GGUF/Qwen3.6-27B-NVFP4-MTP-GGUF.gguf
+      presence_penalty: 1.5
+      repeat_penalty: 1
+      temperature: 0.7
+      top_k: 20
+      top_p: 0.8
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen3.6-27B-NVFP4-MTP-GGUF/Qwen3.6-27B-NVFP4-MTP-GGUF.gguf
+      sha256: d088e57e8c35ff62c2a420cb888dad3fd53c8db3ed9ead4286bd383224f81b50
+      uri: https://huggingface.co/michaelw9999/Qwen3.6-27B-NVFP4-MTP-GGUF/resolve/main/Qwen3.6-27B-NVFP4-MTP-GGUF.gguf
+- name: "qwen3.6-35b-a3b-nvfp4-mtp-paged"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/michaelw9999/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF
+  description: |
+    Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower; the throughput figures below are GB10 / DGX Spark (consumer Blackwell).
+
+    Qwen3.6-35B-A3B MoE (~3B active), native Blackwell NVFP4 (FP4-MMA) GGUF with a
+    built-in MTP (multi-token-prediction / speculative) draft head, configured for
+    LocalAI's paged-attention llama.cpp backend (llama-cpp-localai-paged): on-demand
+    paged KV cache plus a decode-first prefill budget. The MTP draft head accelerates
+    decode via self-speculation; ships with the recommended Qwen3.6 sampling defaults.
+
+    Requires a llama.cpp new enough to read the NVFP4 GGUF tensor type (the paged
+    backend's upstream pin) - verify on a GPU box before relying on this entry.
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - nvfp4
+    - blackwell
+    - moe
+    - mtp
+    - reasoning
+  icon: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3.6/Figures/qwen3.6_35b_a3b_score.png
+  overrides:
+    backend: llama-cpp-localai-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 131072
+    gpu_layers: 99
+    batch: 512
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+      - paged_kv:true              # LLAMA_KV_PAGED=1
+      - max_batch_tokens:512       # decode-first budget; set 256 for max saturated MoE decode (sweep winner)
+      - kv_unified:false           # per-slot paged capacity/memory benefit needs a per-sequence cache
+      - parallel:128               # 128 serving slots
+    parameters:
+      min_p: 0
+      model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF/Qwen3.6-35B-A3B-NVFP4-MTP-TURBO.gguf
+      presence_penalty: 1.5
+      repeat_penalty: 1
+      temperature: 0.7
+      top_k: 20
+      top_p: 0.8
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF/Qwen3.6-35B-A3B-NVFP4-MTP-TURBO.gguf
+      sha256: f3d2fdc74e3ef19925ccbf794b04d7f6f11fb12eba7722b7749219d0cc5c36ed
+      uri: https://huggingface.co/michaelw9999/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-NVFP4-MTP-TURBO.gguf
+- name: "qwen-agentworld-35b-a3b"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/unsloth/Qwen-AgentWorld-35B-A3B-GGUF
+  description: |
+    # Qwen-AgentWorld-35B-A3B
+
+    📑 Technical Report |
+    📖 Blog |
+    🤗 Hugging Face |
+    🤖 ModelScope |
+    💻 GitHub |
+    🖥️ Demo
+
+    > [!Note]
+    > This repository contains the model weights and configuration files for **Qwen-AgentWorld-35B-A3B**, a native language world model trained for agentic environment simulation.
+    >
+    > These artifacts are compatible with Hugging Face Transformers, vLLM, SGLang, etc.
+
+    **Qwen-AgentWorld** is the first language world model to cover seven agent interaction domains within a single model. It simulates agentic environments via long chain-of-thought reasoning, predicting the next environment state given an agent's action and interaction history. Trained through a three-stage pipeline — CPT injects environment knowledge, SFT activates next-state-prediction reasoning, RL sharpens simulation fidelity — Qwen-AgentWorld is a **native world model**: environment modeling is the training objective from the CPT stage onward, not a post-hoc add-on.
+
+    ## Highlights
+
+    ...
+  license: "apache-2.0"
+  tags:
+    - llm
+    - gguf
+    - qwen
+  icon: https://qianwen-res.oss-accelerate-overseas.aliyuncs.com/Qwen-AgentWorld/logo.png
+  overrides:
+    backend: llama-cpp
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+    parameters:
+      model: llama-cpp/models/Qwen-AgentWorld-35B-A3B-GGUF/Qwen-AgentWorld-35B-A3B-UD-Q4_K_M.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwen-AgentWorld-35B-A3B-GGUF/Qwen-AgentWorld-35B-A3B-UD-Q4_K_M.gguf
+      sha256: e7a8eafdd8013443b6bcc4b6fb47b2d2025f772d359650b9ceb7d75971e22cad
+      uri: https://huggingface.co/unsloth/Qwen-AgentWorld-35B-A3B-GGUF/resolve/main/Qwen-AgentWorld-35B-A3B-UD-Q4_K_M.gguf
 - name: "ornith-1.0-9b"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
@@ -606,6 +876,81 @@
    - filename: llama-cpp/models/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF/Qwopus3.6-27B-Coder-MTP-NVFP4-TURBO.gguf
      sha256: 1c163f0e1f29485d432b466b9e5e0593ea9b10c5a62cf3eb71b77fcfe41db46c
      uri: https://huggingface.co/michaelw9999/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF/resolve/main/Qwopus3.6-27B-Coder-MTP-NVFP4-TURBO.gguf
+- name: "qwopus3.6-27b-v2-mtp-nvfp4-paged"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF
+  description: "Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower; the throughput figures below are GB10 / DGX Spark (consumer Blackwell).\n\n\U0001FA90 Qwopus3.6-27B-v2-MTP\nMTP Release\n\nMulti-Token Prediction reasoning model fine-tuned from Qwen3.6-27B\n\n\U0001F9EC Trace Inversion & Negentropy\n\U0001F9E0 27B Parameters\n⚡ Speculative Decoding\n\U0001F6E0️ Coding / DevOps / Math\n\n\U0001F4A1 What is Qwopus3.6-27B-v2-MTP?\n\U0001FA90 Qwopus3.6-27B-v2-MTP is a speed-oriented reasoning release built on top of Qwen3.6-27B. It keeps the Qwopus line's focus on reconstructed reasoning traces, coding discipline, DevOps procedures, and mathematical derivations, while adding Multi-Token Prediction for faster generation. The goal is simple: preserve the depth and structure of a 27B reasoning model while making real interactive use noticeably faster.\n\n⚡ MTP Decoding\nAuxiliary future-token prediction improves throughput on long reasoning, code, math, and strict-format prompts.\n\U0001F9E9 Structured Reasoning\nInherits the Qwopus training recipe built around reconstructed step-by-step reasoning trajectories.\n\U0001F9EA GB10 Tested\nValidated on a 30-question local benchmark across Logic, Coding, DevOps, Math, and Edge tasks.\n\U0001F680 Practical Speed\nDesigned for workflows where strong answers matter, but waiting several extra minutes per task does not.\n\n...\n\n\nLocalAI paged-attention backend variant (llama-cpp-localai-paged): on-demand paged KV cache plus a decode-first prefill budget.\n"
+  tags:
+    - llm
+    - gguf
+    - nvfp4
+    - blackwell
+  overrides:
+    backend: llama-cpp-localai-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 131072
+    gpu_layers: 99
+    batch: 512
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+      - paged_kv:true              # LLAMA_KV_PAGED=1
+      - max_batch_tokens:512       # decode-first QoS budget (27B dense)
+      - kv_unified:false           # per-slot paged capacity/memory benefit needs a per-sequence cache
+      - parallel:128               # 128 serving slots
+    parameters:
+      model: llama-cpp/models/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF.gguf
+      sha256: 2a0a36fd10374c2a85356121c7c315bda725c7eaca0b3ae14838567629c6924a
+      uri: https://huggingface.co/michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF/resolve/main/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF.gguf
+- name: "qwopus3.6-27b-coder-mtp-nvfp4-paged"
+  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
+  urls:
+    - https://huggingface.co/michaelw9999/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF
+  description: "Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower; the throughput figures below are GB10 / DGX Spark (consumer Blackwell).\n\n\U0001FA90 Qwopus-3.6-27B-Coder\nCoder SFT Release\n\nAgentic Coding & Tool-Use Reasoning Model Fine-Tuned on Qwopus3.6-27B-v2\n\n\U0001F9EC Trace Inversion & Negentropy\n\U0001F9E0 27B Dense Model\n⚡ Agentic Coding\n\U0001F6E0️ Tool Calling & Agent\n\U0001F3C6 SWE-bench Verified: 67.0% (off-thinking)\n\n\U0001F4A1 What is Qwopus-3.6-27B-Coder?\n\U0001FA90 Qwopus-3.6-27B-Coder is a reasoning-enhanced agentic coding model built on top of Qwopus3.6-27B-v2. It inherits the powerful reasoning foundation of the v2 base, which achieved 87.43% MMLU-Pro (300ex) and 75.25% SWE-bench Verified, and further specializes it for agentic code generation, structured tool calling, debugging, and instruction-following in developer workflows. The model is designed to excel at repository-level coding tasks, multi-turn tool orchestration, and complex logical reasoning under realistic agent environments.\n\n\U0001F9E9 Agentic Coding\nOptimized for repository-level coding, debugging, patch generation, and structured multi-step development workflows.\n\n\U0001F6E0️ Tool Calling\nLearns from real agent trajectories with tool definitions, tool calls, and environment feedback for robust multi-turn execution.\n\n...\n\n\nLocalAI paged-attention backend variant (llama-cpp-localai-paged): on-demand paged KV cache plus a decode-first prefill budget.\n"
+  tags:
+    - llm
+    - gguf
+    - nvfp4
+    - blackwell
+  icon: https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/sGQKmrMc6L6guMoaB5_Y2.png
+  overrides:
+    backend: llama-cpp-localai-paged
+    f16: true
+    flash_attention: "on"
+    context_size: 131072
+    gpu_layers: 99
+    batch: 512
+    function:
+      automatic_tool_parsing_fallback: true
+      grammar:
+        disable: true
+    known_usecases:
+      - chat
+    options:
+      - use_jinja:true
+      - paged_kv:true              # LLAMA_KV_PAGED=1
+      - max_batch_tokens:512       # decode-first QoS budget (27B dense)
+      - kv_unified:false           # per-slot paged capacity/memory benefit needs a per-sequence cache
+      - parallel:128               # 128 serving slots
+    parameters:
+      model: llama-cpp/models/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF/Qwopus3.6-27B-Coder-MTP-NVFP4-TURBO.gguf
+    template:
+      use_tokenizer_template: true
+  files:
+    - filename: llama-cpp/models/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF/Qwopus3.6-27B-Coder-MTP-NVFP4-TURBO.gguf
+      sha256: 1c163f0e1f29485d432b466b9e5e0593ea9b10c5a62cf3eb71b77fcfe41db46c
+      uri: https://huggingface.co/michaelw9999/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF/resolve/main/Qwopus3.6-27B-Coder-MTP-NVFP4-TURBO.gguf
 - name: "qwen3.6-27b-nvfp4-mtp"
  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
  urls:
--- a/pkg/xsysinfo/gpu.go
+++ b/pkg/xsysinfo/gpu.go
@@ -440,6 +440,20 @@ func parseComputeCap(cc string) (int, int) {
 	return maj, min
 }

+// IsNVIDIABlackwell reports whether an NVIDIA Blackwell-class consumer GPU is
+// present, i.e. compute capability 12.x (sm_120 RTX 50-series, sm_121 GB10 /
+// DGX Spark). Cached via NVIDIAComputeCapability.
+//
+// Note: datacenter Blackwell (B100/B200/GB200, sm_100 / cc 10.0) reports a
+// different compute capability and is intentionally NOT matched here: this
+// targets the sm_12x family where we measured the larger-physical-batch MoE
+// prefill win. Returns false when nvidia-smi is unavailable or reports no 12.x
+// device.
+func IsNVIDIABlackwell() bool {
+	maj, _ := parseComputeCap(NVIDIAComputeCapability())
+	return maj == 12
+}
+
 // getNVIDIAGPUMemory queries NVIDIA GPUs using nvidia-smi
 func getNVIDIAGPUMemory() []GPUMemoryInfo {
 	// Check if nvidia-smi is available
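
Review note on the `IsNVIDIABlackwell` hunk above: the doc comment describes a match on the 12.x family only, with datacenter Blackwell (cc 10.0) excluded. A minimal, self-contained sketch of that check follows. `parseCC` and `isConsumerBlackwell` are hypothetical stand-ins written for illustration; they are not the real `parseComputeCap` / `NVIDIAComputeCapability` helpers, whose exact parsing and caching behaviour is not shown in this diff.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCC is a hypothetical stand-in for parseComputeCap: it splits a
// "major.minor" string and returns (0, 0) for anything it cannot parse,
// including the empty string (assumed here to mean "no nvidia-smi").
func parseCC(cc string) (int, int) {
	parts := strings.SplitN(strings.TrimSpace(cc), ".", 2)
	if len(parts) != 2 {
		return 0, 0
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0
	}
	return major, minor
}

// isConsumerBlackwell reports whether cc belongs to the 12.x family
// (sm_120 / sm_121). It deliberately uses an exact major match, so both
// cc 10.0 (datacenter Blackwell) and any later major version are excluded.
func isConsumerBlackwell(cc string) bool {
	major, _ := parseCC(cc)
	return major == 12
}

func main() {
	for _, cc := range []string{"12.0", "12.1", "10.0", "8.9", ""} {
		fmt.Printf("%q -> %v\n", cc, isConsumerBlackwell(cc))
	}
}
```

An exact major-version match keeps the check aligned with the hardware the measurement was taken on; a `>=` comparison would silently opt in any future architecture with a higher major number.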
--- a/scripts/build/golang-darwin.sh
+++ b/scripts/build/golang-darwin.sh
@@ -17,9 +17,15 @@ rm -rf "${BACKEND_DIR}"/build-*
 # run.sh's final `exec $CURDIR/<binary>` is the contract for what gets launched;
 # the binary is not always named after the backend (e.g. parakeet-cpp launches
 # parakeet-cpp-grpc), so derive it from run.sh and fall back to ${BACKEND}.
+#
+# Only scan the `exec` line(s): many run.sh scripts select a runtime CPU variant
+# via unquoted `LIBRARY=$CURDIR/libgo<x>-avx512.so` lines, and a whole-file grep
+# would pick the last of those (avx512, which Darwin never builds) instead of
+# the binary, failing the check below for whisper/sam3-cpp/vibevoice-cpp/...
+# Also tolerate the exec being quoted (`exec "$CURDIR"/<binary>`).
 RUN_BINARY=""
 if [ -f "${BACKEND_DIR}/run.sh" ]; then
-        RUN_BINARY=$(grep -oE '\$CURDIR/[A-Za-z0-9._-]+' "${BACKEND_DIR}/run.sh" | grep -v 'ld\.so' | tail -1 | sed 's|\$CURDIR/||')
+        RUN_BINARY=$(grep -E '^[[:space:]]*exec[[:space:]]' "${BACKEND_DIR}/run.sh" | grep -oE '"?\$CURDIR"?/[A-Za-z0-9._-]+' | grep -v 'ld\.so' | tail -1 | sed -E 's|"?\$CURDIR"?/||')
 fi
 RUN_BINARY="${RUN_BINARY:-${BACKEND}}"

--- a/scripts/changed-backends.js
+++ b/scripts/changed-backends.js
@@ -47,6 +47,15 @@ function inferBackendPath(item) {
    // via a thin wrapper Makefile. Changes to either dir should retrigger it.
    return `backend/cpp/turboquant/`;
  }
+  // llama-cpp-localai-paged is the LocalAI paged-attention llama.cpp variant: the
+  // SAME upstream pin as stock llama-cpp plus the paged patch series, reusing
+  // backend/cpp/llama-cpp sources via a thin wrapper Makefile. Keep this branch
+  // BEFORE the generic `endsWith("llama-cpp")` branch below: although
+  // "Dockerfile.llama-cpp-localai-paged".endsWith("llama-cpp") is already false,
+  // the specific branch documents the mapping and is robust to future renames.
+  if (item.dockerfile.endsWith("llama-cpp-localai-paged")) {
+    return `backend/cpp/llama-cpp-localai-paged/`;
+  }
  if (item.dockerfile.endsWith("privacy-filter")) {
    return `backend/cpp/privacy-filter/`;
  }
@@ -66,6 +75,13 @@ function inferBackendPathDarwin(item) {
  if (item.backend === "llama-cpp") {
    return `backend/cpp/llama-cpp/`;
  }
+  // llama-cpp-localai-paged on Darwin (the -metal-darwin-arm64-llama-cpp-localai-paged
+  // includeDarwin row) builds from the C++ sources under
+  // backend/cpp/llama-cpp-localai-paged, like stock llama-cpp. The matrix entry
+  // carries lang=go for runner/toolchain selection, but the source is C++.
+  if (item.backend === "llama-cpp-localai-paged") {
+    return `backend/cpp/llama-cpp-localai-paged/`;
+  }
  // ds4 is C++ too (built via `make backends/ds4-darwin`); the matrix entry
  // carries lang=go for runner/toolchain selection, but the source is C++.
  if (item.backend === "ds4") {
@@ -281,6 +297,11 @@ function emitFilteredMatrix(changedFiles) {
    if (backend === "turboquant" && !changed) {
      changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
    }
+    // llama-cpp-localai-paged reuses backend/cpp/llama-cpp sources via a thin
+    // wrapper; changes to either directory should retrigger its pipeline.
+    if (backend === "llama-cpp-localai-paged" && !changed) {
+      changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
+    }
    fs.appendFileSync(process.env.GITHUB_OUTPUT, `${backend}=${changed ? 'true' : 'false'}\n`);
  }
 }