refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

Move ALL paged-attention content out of the stock backend/cpp/llama-cpp backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is pure upstream llama.cpp and the paged backend owns and applies its own vendored patch series.

- Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/ (kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen, its own 0001-0002 patches, dense-era design docs, tests). Zero references repo-wide.
- Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README + 3 operational docs, plus the kernel/ scaffold patch and the top-level paged README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock backend keeps no patches/ dir; it had no non-paged base patches.
- Purify the stock backend: remove the LLAMA_PAGED make variable, the patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh; remove the paged-series handling from prepare.sh. The stock llama.cpp target now only clones the pin and applies its own (currently empty) base patches/ series. The runtime paged option hooks in the shared grpc-server.cpp are untouched (inert without the patches).
- The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto each freshly cloned tree via strict git apply (apply-paged-patches), after the copied stock infra clones the pin and applies base patches.
- Repoint every reference to the old patches/paged path: the upstream canary workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs, backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on build-toggle from comments.

Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed canary apply script resolves and applies the series end to end.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
2026-06-28 02:17:00 -04:00 · 2026-06-27 11:01:22 +00:00
parent fb2dc33d52
commit 78fac9a28f
87 changed files with 109 additions and 3997 deletions
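
The apply step described in the commit message (each `0*.patch` applied in series order with strict `git apply`, stopping at the first failure) can be sketched as a standalone shell function. This is a minimal illustration modelled on the loop visible in the stock Makefile hunk below. It is not the paged backend's actual `apply-paged-patches` recipe, which does not appear in this excerpt; the function name `apply_patch_series` and its two arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of the repository): apply every 0*.patch in a
# directory, in glob (lexical) order, onto a git checkout.
# Usage: apply_patch_series <checkout-dir> <patches-dir>
apply_patch_series() {
    tree=$1
    # Resolve to an absolute path: `git -C` changes directory first, so a
    # relative patch path would otherwise be looked up inside the checkout.
    patches=$(cd "$2" && pwd) || return 1
    for p in "$patches"/0*.patch; do
        # An unmatched glob stays literal; skip it so an empty series is a no-op.
        [ -e "$p" ] || continue
        echo "applying patch: $p"
        # git apply is atomic per patch: if any hunk fails, nothing is applied.
        git -C "$tree" apply --verbose "$p" || { echo "patch failed: $p" >&2; return 1; }
    done
    return 0
}
```

Invoked as, for example, `apply_patch_series ./llama.cpp ./patches/paged`, the function returns non-zero on the first patch that does not apply, which is what lets a build recipe abort instead of continuing with a half-patched tree.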
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -6,14 +6,6 @@
 # bump and is advanced only by the manual PIN_SYNC process.
 LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
-# LLAMA_PAGED controls whether the vendored paged-attention patch series
-# (patches/paged/) is applied on top of the pinned llama.cpp. Default on; set
-# LLAMA_PAGED=off to build a clean-against-upstream backend (e.g. to unblock a
-# dep-bump if an upstream change breaks a paged hook - the paged carry is then
-# fixed independently). Runtime behaviour stays gated by the LLAMA_KV_PAGED env
-# regardless, so an LLAMA_PAGED=on build is byte-identical to stock until that
-# env is set.
-LLAMA_PAGED?=on

 CMAKE_ARGS?=
 BUILD_TYPE?=
@@ -187,23 +179,14 @@ llama.cpp:
 		[ -e "$$p" ] || continue; \
 		echo "applying llama.cpp patch: $$p"; \
 		git apply --verbose "$$p" || { echo "patch failed: $$p"; exit 1; }; \
-	done && \
-	if [ "$(LLAMA_PAGED)" = "off" ]; then \
-		echo "LLAMA_PAGED=off: skipping paged-attention patch series"; \
-	else \
-		for p in $(CURRENT_MAKEFILE_DIR)patches/paged/0*.patch; do \
-			[ -e "$$p" ] || continue; \
-			echo "applying llama.cpp PAGED patch: $$p"; \
-			git apply --verbose "$$p" || { echo "paged patch failed: $$p"; exit 1; }; \
-		done; \
-	fi
+	done

 llama.cpp/tools/grpc-server: llama.cpp
 	mkdir -p llama.cpp/tools/grpc-server
-	LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
+	bash prepare.sh

 rebuild:
-	LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
+	bash prepare.sh
 	rm -rf grpc-server
 	$(MAKE) grpc-server

--- a/backend/cpp/llama-cpp/paged/.gitignore
+++ b/backend/cpp/llama-cpp/paged/.gitignore
@@ -1,7 +0,0 @@
-tests/test_free_block_queue
-tests/test_block_pool
-tests/test_paged_kv_manager
-tests/test_prefix_cache
-tests/test_ggml_paged_rw
-tests/test_ggml_paged_attn
-paged-bench
--- a/backend/cpp/llama-cpp/paged/BLACKWELL_KERNEL_GAPS.md
+++ b/backend/cpp/llama-cpp/paged/BLACKWELL_KERNEL_GAPS.md
@@ -1,105 +0,0 @@
-# Blackwell (GB10 / sm_121) kernel gaps — measured + the corrected strategy
-
-Supersedes the "greenfield tcgen05 FP4 grouped GEMM" framing in `FP4_GROUPED_MOE_KERNEL.md`. Research +
-profiling reframed the problem: the kernels we need **already exist in ggml**; they're just **untuned for
-Blackwell**. And the parity target is far lower than the headline vLLM number implied.
-
-## 1. The parity target was wrong — it's ~3,300 t/s single-stream, not 24,444
-
-vLLM's dense "24,444 t/s" is **aggregate concurrent-batch** throughput, not single-sequence. The GB10
-compute roofline caps **single-stream** Qwen3-32B prefill at **~3,300 t/s (BF16/INT8 ceiling)** / **~6,600
-(FP4 ceiling)**. So: don't chase 24,444 with one kernel. Aggregate parity = (a kernel at the ceiling) +
-(batched-prefill scheduling). The *kernel* job is to reach ~3,300 (matches vLLM, which on GB10 also runs at
-the BF16 ceiling) or ~6,600 (beats it, via FP4).
-
-## 2. GB10 per-precision DENSE peaks (measured, not spec)
-
-| precision | dense peak | vs BF16 |
-|---|---|---|
-| BF16 / FP16 | ~213 TFLOP/s | 1.0× |
-| INT8 | ~215 TOPS | **1.0×** |
-| FP4 (MXFP4/NVFP4) | ~427–500 TFLOP/s | **2.0×** |
-
-Memory: ~273 GB/s LPDDR5X (the bottleneck for *decode*; prefill is compute-bound). **Critical:** GB10 is
-**1:1:2** (BF16:INT8:FP4), NOT datacenter Blackwell's 1:2:4 — **INT8 gives ZERO speedup over BF16 here.** So
-int8-MMQ has no precision advantage; only FP4 does. (NVIDIA spec sheets still claim 1:2:4 — contradicted by
-direct GB10 measurement; on-the-record discrepancy.)
-
-## 3. Measured gaps (nsys, GB10)
-
-| path | kernel | % of prefill | achieved | % of ceiling |
-|---|---|---|---|---|
-| **Dense** Q4_K_M | `mul_mat_q<Q4_K/Q6_K>` (int8 MMQ) | 80% | ~46 TFLOP/s | **~21% of 215** |
-| **MoE** MXFP4 | `mul_mat_q<MXFP4>` (FP4 MMA) | 37% | ~22 TFLOP/s | **~4–5% of 500** (or ~10% of BF16) |
-
-Both kernels are **engaged correctly but untuned for Blackwell** — llama.cpp's MMQ was "tuned primarily for
-RTX 3000/4000" (Ampere/Ada). The headroom (4–5×) is recoverable; it's not an architectural ceiling.
-
-## 4. ggml's current quantized-matmul paths (what exists)
-
-- **MMQ** (int8): quantizes activations to Q8_1, int8 `mma.sync`/`dp4a`. Prefill path. **Untuned for sm_12x.**
-- **FP4 MMA** (#17906, merged): native MXFP4/NVFP4 `m16n8k64` block-scaled FP4 mma for cc≥12.0. Works on GB10
-  for MoE (we measured 3441 t/s MXFP4 prefill) — but underutilized (~5% of FP4 peak). On **sm_121** it's hit
-  by build-flag (`120f`) + nvcc `-O3` miscompile (#18331) + capability-gating issues.
-- **dequant→cuBLAS-FP16**: unfused fallback (materializes FP16 weights, round-trips memory). Not a fused
-  Marlin. (Our `GGML_CUDA_FORCE_CUBLAS` no-op = this didn't even engage for Q4_K.)
-- **NO fused Marlin-style W4A16 kernel** (dequant 4-bit→BF16 in-shared-mem → BF16 tensor cores). Real gap.
-
-## 5. Strategy — match vs beat (this replaces the tcgen05-greenfield plan)
-
-**To MATCH vLLM (~3,300 single-stream): FP4 is NOT required.** Because INT8 == BF16 on GB10, a tuned MMQ and
-a BF16 Marlin kernel share the *same* ceiling — and vLLM hits parity via W4A16 Marlin (BF16), since its FP4
-is also broken on sm_121.
-
-Ranked, by effort:
-1. **Probe: tune the existing int8 MMQ for Blackwell** (dense). Cheapest. We're at 21% of the ceiling —
-   recover via tile sizes, async copy (`cp.async`), double-buffered shared-mem pipeline, occupancy. Caveat:
-   the `nwarps*tile_C::I==mmq_y` static_assert (found earlier) couples the constants; and the Q8_1
-   activation-quant overhead caps pure-MMQ tuning. Bounded upside, but a fast experiment.
-2. **Build a Marlin-style W4A16 BF16 GEMM** (dense) — the robust path to ~3,300 (4.3× over today's 765).
-   Dequant 4-bit→BF16 in shared memory, MMA on BF16 tensor cores, `cp.async` multi-buffer, offline weight
-   reshuffle. Mirrors vLLM's actual GB10 path; keeps activations BF16 (better quality than int8 MMQ); fills a
-   genuine ggml gap. **This is the recommended kernel to MATCH.**
-
-**To BEAT vLLM (~6,600, 2×): fix — don't rewrite — the FP4 path on sm_121.**
-3. **Get the existing FP4 MMA (#17906/#20644) fully working + tuned on sm_121.** It already works on sm_120
-   (RTX 5090: +43–68% prefill) and on GB10 for MoE. The blockers are the `120f` arch flag, the `-O3`
-   miscompile (#18331), capability gating — **build/compiler fixes, not a new kernel.** Then tune the FP4 MMQ
-   (it's at ~5% of FP4 peak). This is where upstream momentum already is, and the only route past vLLM.
-
-**Dropped:** the from-scratch tcgen05/CUTLASS grouped GEMM (the old scaffold). It aimed past the matchable
-ceiling, duplicates work the FP4-MMA path already does, and FP4 on sm_121 is a *fix* problem not a *write*
-problem. The `fp4-grouped-moe.cu` scaffold/hook stays as a useful dispatch seam, but the kernel behind it
-should be one of (1)/(2)/(3), not a greenfield CUTLASS collective.
-
-## 6. Cheap experiment — RESULT: MXFP4 dense = free 1.44×, but not parity (kernel still untuned)
-
-Requantized Qwen3-32B dense → MXFP4 (forced attn+ffn to mxfp4 via `--tensor-type`, `--allow-requantize`,
-speed-only test) and benched prefill:
-
-| quant | kernel | pp512 | pp2048 | vs Q4_K |
-|---|---|---|---|---|
-| Q4_K_M | int8-MMQ | 765 | 763 | 1.0× |
-| **MXFP4** | **FP4-MMA** | **1099** | **1153** | **1.44×** |
-
-**Findings:**
-- **MXFP4 dense is a real, free 1.44× over Q4_K** — just a requantize, the existing FP4-MMA path engages for
-  dense weights on GB10. Worth shipping as a **Blackwell dense-quant recommendation** in the gallery (no kernel).
-- **But it is NOT parity.** 1153 t/s = **~17% of the FP4 ceiling (~6,600)** / ~35% of the BF16 ceiling. So the
-  **FP4-MMA kernel is itself untuned** (consistent with the MoE measurement, ~5% of FP4 peak). MXFP4 moves dense
-  from the int8 path (765) onto the FP4 path (1153), but the FP4 kernel leaves ~4–6× on the table.
-- **So the kernel work is confirmed and now precise: tune the FP4-MMA kernel** (it's the highest-value, since it
-  serves both dense-MXFP4 and MoE, and FP4 is the only path that can *beat* vLLM). Strategy item (3) — fix +
-  tune the existing FP4-MMA on sm_121 — is the priority; a Marlin-style W4A16 BF16 kernel (2) is the alternative
-  to *match* on the BF16 ceiling if FP4 tuning stalls.
-
-Conclusion: the cheap test did NOT collapse the kernel problem (the kernels are untuned, not just the quant), but
-it (a) gives a free 1.44× to ship now, and (b) sharpens the target to **tuning the FP4-MMA kernel**.
-
-## Sources
-GB10 peaks (measured): forums.developer.nvidia.com/t/351993, /360142, /373618. Marlin: github.com/IST-DASLab/marlin,
-arxiv 2408.11743, developers.redhat.com Marlin/Machete. MMQ untuned: llama.cpp docs/build.md, discussions/16578,
-DandinPower/llama.cpp_bench. FP4 landing/sm121: llama.cpp PR #17906/#20644, issues #19662/#18331. Roofline:
-vllm.ai/blog/2026-06-01-vllm-dgx-spark, lmsys.org DGX Spark.
-
-> **Correction (measured):** the earlier `GGML_CUDA_FORCE_CUBLAS` env test was a no-op because it's a *compile-time* `#ifdef`, not a runtime flag — cuBLAS never engaged. A real rebuild with `-DGGML_CUDA_FORCE_CUBLAS=ON` shows cuBLAS is **slower** than MMQ for dense Q4 (pp2048 690 vs 750) and runs an **Ampere `cutlass_80_tensorop` FP16 kernel** — cuBLAS-13.0 has no sm_121-tuned GEMM and falls back to sm_80. So *both* MMQ and cuBLAS sit at ~46 TFLOP/s (~21% of the 213 BF16 peak); there is **no library shortcut** to the ceiling on GB10 — a hand-tuned sm_120a kernel (Marlin-style) is required.
--- a/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
@@ -1,334 +0,0 @@
-# Chunked prefill + n_batch/n_ubatch decouple — implementation plan
-
-Scope: LocalAI's llama.cpp backend (`backend/cpp/llama-cpp/`). Companion to
-`PHASED_VLLM_PARITY_PLAN.md` Phase 3. This document is the concrete, file-cited
-plan for what the brief called "chunked prefill".
-
-Line numbers below are from two trees:
-- LocalAI: `backend/cpp/llama-cpp/grpc-server.cpp`, `core/backend/options.go`,
-  `backend/backend.proto`, `core/backend/hardware_defaults.go` — exact.
-- Vendored upstream scheduler: `llama.cpp/tools/server/server-context.cpp`. The
-  build copies `llama.cpp/tools/server/*` into `tools/grpc-server/` (`prepare.sh`
-  lines 15-17) and only overrides `grpc-server.cpp` + `CMakeLists.txt`. So
-  `update_slots()` is **inherited upstream code, not LocalAI code**. Line numbers
-  cited for it are from a same-era checkout (`d12cc3d`, 2026-04-09); the pin is
-  `f3e1828` (Makefile line 2). The structure is identical; exact lines may drift
-  a few rows at the pin — match on the quoted comment strings, not the integers.
-
---
-
-## TL;DR — the headline finding
-
-**Chunked prefill with prefill/decode interleaving is ALREADY implemented** in the
-llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
-this version. `update_slots()` in `server-context.cpp`:
-
-1. **Adds ongoing decode tokens first** — "first, add sampled tokens from any
-   ongoing sequences" (≈ line 2088). Every `SLOT_STATE_GENERATING` slot gets its
-   one sampled token into the shared `llama_batch` before any prefill is added.
-2. **Then fills the remaining `n_batch` budget with prompt (prefill) tokens** —
-   "next, batch any pending prompts without exceeding n_batch" (≈ line 2166),
-   gated by `params_base.cont_batching` (LocalAI sets `cont_batching = true` by
-   default, `grpc-server.cpp:547`). The per-slot prefill fill loop
-   (≈ line 2552) is `while (slot.prompt.n_tokens() < slot.task->n_tokens() &&
-   batch.n_tokens < n_batch)` — i.e. it caps each slot's prefill contribution to
-   the **remaining** budget and defers the rest to the next iteration.
-3. **Decodes the combined batch in one pass** (≈ line 2728-2741): decode tokens
-   and prefill-chunk tokens go through the **same `llama_decode`**, which then
-   splits internally into `n_ubatch` physical sub-batches.
-
-This is exactly the behavior the abandoned-looking draft **upstream PR #10718**
-("server : chunked prefill support") asked for — "the first task is no longer
-blocked by the second long prompt processing task." That PR is still marked OPEN
-but its goal was absorbed into the natural evolution of `update_slots()`; we do
-**not** need to port it. A long prefill no longer stalls the decode batch: decode
-slots are serviced first every iteration, prefill consumes only the leftover
-budget.
-
-**Therefore: do not re-implement chunked prefill.** The real LocalAI gap is
-narrow and is the rest of this plan:
-
-- **Phase A (the actual gap): the `n_batch`/`n_ubatch` decouple.** LocalAI ties
-  the scheduler token budget (`n_batch`) to the physical forward width
-  (`n_ubatch`) at `grpc-server.cpp:515` + `:519`. This forces
-  `n_batch == n_ubatch`, so the logical scheduling window can never be wider than
-  one physical ubatch. You cannot keep `n_ubatch` at the Blackwell GEMM sweet
-  spot (2048) while widening `n_batch` so concurrent prefills + decodes co-batch
-  into a larger logical window. There is no first-class `batch:`/`ubatch:` split
-  on the Go side, and there is only a one-directional `ubatch` override on the C++
-  side (you can shrink ubatch below the coupled value, never grow n_batch above
-  it).
-- **Phase B (optional policy lever): a decode-headroom prefill cap.** Upstream
-  caps prefill at the full `n_batch` shared with decode. Under heavy mixed load
-  one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter
-  to the decoders sharing that forward. vLLM exposes
-  `long_prefill_token_threshold` / `max_num_partial_prefills` for this. A
-  LocalAI-specific per-iteration prefill cap (a patch to vendored `update_slots`)
-  bounds that jitter. This is genuinely not in upstream and is the only place a
-  scheduler-policy change is warranted.
-
---
-
-## 1. Current behavior — precise citations
-
-### 1.1 The scheduler is upstream, inherited verbatim
-- `prepare.sh:15-17` copies all of `llama.cpp/tools/server/*` into the
-  `grpc-server` build dir; `grpc-server.cpp` (LocalAI) replaces only the HTTP/gRPC
-  service + `params_parse` + `parse_options`. `update_slots()`, the slot state
-  machine, and the batch builder are **upstream `server-context.cpp`**, untouched
-  by LocalAI today.
-- Slot states: `server-context.cpp:36-42` —
-  `SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT /
-  GENERATING`.
-
-### 1.2 Decode-first, then prefill-fill, one shared batch
-- `common_batch_clear(batch)` (≈ 2078) — one batch per `update_slots` iteration.
-- Decode phase (≈ 2088-2156): for each `SLOT_STATE_GENERATING` slot,
-  `common_batch_add(batch, slot.sampled, …, /*logits=*/true)` adds exactly one
-  token. Decode is guaranteed a seat before prefill runs.
-- Budget fetch (≈ 2158-2160): `n_batch = llama_n_batch(ctx)`,
-  `n_ubatch = llama_n_ubatch(ctx)`.
-- Prefill phase (≈ 2166): `if (params_base.cont_batching || batch.n_tokens == 0)`
-  → with cont_batching ON, prefill is added to the **same** batch as decode.
-- Per-slot prefill fill (≈ 2552-2597):
-  `while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch)`
-  — adds prompt tokens until the slot is done **or** the shared budget is hit.
-  Whatever does not fit stays for the next iteration (the slot remains
-  `SLOT_STATE_PROCESSING_PROMPT`).
-- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed
-  it flips to `SLOT_STATE_DONE_PROMPT`, sets `batch.logits[last] = true`, inits
-  the sampler. Next iteration it becomes `GENERATING`.
-- Budget break (≈ 2693-2695): `if (batch.n_tokens >= n_batch) break;`.
-- Decode (≈ 2728-2741): loops `batch_view` slices of `min(n_batch, remaining)` and
-  calls `llama_decode`; the physical `n_ubatch` split happens inside
-  `llama_decode`.
-
-### 1.3 The chunking is gated by `can_split()`
-- `server-context.cpp:225-231`: `can_split()` returns true unless the task needs
-  embeddings with non-LAST pooling. So **completion/generation tasks always
-  chunk-and-interleave**; only embeddings/rerank force the whole prompt into one
-  ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch
-  size" — this is exactly why LocalAI bumped `n_ubatch` for rerank, see below).
-
-### 1.4 LocalAI ties n_batch to n_ubatch (the gap)
-- `grpc-server.cpp:515` — `params.n_batch  = request->nbatch();`
-- `grpc-server.cpp:519` — `params.n_ubatch = request->nbatch();` with the comment
-  that this fixes reranking being capped at the 512 default `n_ubatch`.
-- `grpc-server.cpp:781-784` — the **only** decouple knob today: an `n_ubatch` /
-  `ubatch` option that overrides `n_ubatch` alone (added for embeddings/rerank).
-  There is **no** `batch` / `n_batch` option parse, so `n_batch` cannot be raised
-  above the coupled value from a model config. Confirmed: `grep '"n_batch"|"batch"'`
-  in `grpc-server.cpp` returns nothing.
-- Options arrive via `request->options(i)` parsed as `optname:optval`
-  (`grpc-server.cpp:584-585`); these come from `ModelOptions.Options` ⟵
-  `c.Options` (`core/backend/options.go:221`).
-
-### 1.5 Go side sends a single batch number
-- `backend/backend.proto:341` — `int32 NBatch = 4;` is the only batch field; there
-  is **no** `NUBatch`.
-- `core/backend/options.go:108-129` `EffectiveBatchSize`: returns `c.Batch` if set,
-  else context size for single-pass (score/embed/rerank), else
-  `hardwareDefaultBatchSize(512)`.
-- `core/backend/options.go:228` — `NBatch: int32(b)` (single value to the
-  backend; becomes both `n_batch` and `n_ubatch` via 1.4).
-- `core/backend/hardware_defaults.go:28,37-40` — `BlackwellBatchSize = 2048`;
-  on Blackwell an unset batch defaults to 2048, so today
-  `n_batch == n_ubatch == 2048` there.
-
---
-
-## 2. Why the decouple matters for serving (not just rerank)
-
-Invariant: `n_ubatch <= n_batch`. `n_ubatch` is the physical forward-pass GEMM
-width (compute efficiency; GB10 sweet spot ≈ 2048). `n_batch` is the per-iteration
-**scheduler token budget** — the logical window shared by decode + prefill chunks,
-analogous to vLLM's `max_num_batched_tokens`.
-
-With `n_batch == n_ubatch` (today), the scheduling window cannot exceed one
-physical ubatch. Consequences:
-- Under concurrency, the combined (decode + multiple prefill chunks) logical batch
-  is capped at the physical ubatch, so aggregate prefill cannot grow past one
-  ubatch worth of tokens per iteration even when more slots have prompts queued.
-- A user who shrinks `batch:` for memory also shrinks the physical ubatch,
-  degrading prefill GEMM efficiency — and vice versa.
-
-Decoupling lets us hold `n_ubatch = 2048` (efficient GEMM) while setting a larger
-`n_batch` (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
-logical window, lifting aggregate prefill under mixed load — `llama_decode` still
-tiles the physical work at 2048.
-
---
-
-## 3. Phased implementation
-
-### Phase 0 — Verification harness (do first; TDD red)
-Bite-sized, no code change to the scheduler.
-- **0.1 Token-identical greedy under mixed load.** Script: start the backend with
-  `n_parallel >= 4`, greedy sampling (temp 0, fixed seed). Fire (a) several short
-  decode streams and (b) one ~8k-token prompt concurrently (the exact repro from
-  PR #10718's body works). Capture each stream's full token id sequence. Re-run
-  with the prefill request absent. **Assert the short streams' token ids are
-  byte-identical** in both runs — proves interleaving does not perturb decode
-  numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo
-  spec under the backend e2e suite.
-- **0.2 Mixed-workload throughput baseline.** Use `llama-batched-bench` (built from
-  the same tree) or a small driver hitting `/v1/chat/completions`: measure
-  aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams,
-  under the mixed workload. Record numbers for the current `n_batch==n_ubatch`
-  config. This is the before of Phase A/B.
-
-Expected result of Phase 0: 0.1 already passes (interleave is correct today);
-0.2 gives the baseline the decouple must beat.
-
-### Phase A — Decouple n_batch from n_ubatch
-Goal: let model config set the physical ubatch independently of the logical batch,
-defaulting to today's behavior (no regression).
-
-- **A.1 C++: accept a `batch`/`n_batch` option (and keep `ubatch`).**
-  In `grpc-server.cpp`, after the existing `ubatch` branch (`:781-784`), add a
-  sibling branch:
-  ```cpp
-  } else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
-      if (optval != NULL) {
-          try { params.n_batch = std::stoi(optval_str); } catch (...) {}
-      }
-  ```
-  This is the missing direction (raise `n_batch` above the coupled value). Order
-  matters: both `:515/:519` run first (coupling as default), then option parsing
-  overrides either independently. Add a clamp note: if a user sets
-  `n_ubatch > n_batch`, llama.cpp will clamp/upbatch; log a warning. Keep the
-  `:519` aliasing for backward compat (rerank still works with no options).
-
-- **A.2 Proto: add an explicit physical ubatch field.**
-  `backend/backend.proto:341` add `int32 NUBatch = <next free tag>;` (do not reuse
-  4). Regenerate with `make protogen-go` + the C++ proto build.
-
-- **A.3 C++: honor `NUBatch` when present.**
-  In `grpc-server.cpp` `params_parse`, after `:519`, add:
-  ```cpp
-  if (request->nubatch() > 0) {
-      params.n_ubatch = request->nubatch();
-  }
-  ```
-  so an explicit physical ubatch wins over the `n_batch` alias, with the `ubatch`
-  string-option as a third path for users who only edit `options:`.
-
-- **A.4 Go: config surface + plumbing.**
-  - Add `UBatch *int` (yaml `ubatch`) to the llama config struct alongside `Batch`
-    (search `core/config` for the `Batch` field; mirror it).
-  - In `core/backend/options.go`: add `EffectiveUBatchSize(c)` mirroring
-    `EffectiveBatchSize` (return `c.UBatch` if set, else
-    `min(EffectiveBatchSize(c), BlackwellBatchSize-or-512)` so the physical ubatch
-    stays at the hardware sweet spot while `n_batch` may be larger). Set
-    `NUBatch: int32(EffectiveUBatchSize(c))` next to `NBatch:` (`:228`).
-  - Keep the default such that when neither is set, `NUBatch == NBatch` ⇒
-    byte-identical to today.
-
-- **A.5 Serving default (the lever).**
-  In `hardware_defaults.go`, introduce `BlackwellLogicalBatch = 4096` (or a
-  measured value) and let `EffectiveBatchSize` return it for **multi-slot serving**
-  configs (when `n_parallel > 1` and the model is a completion model), while
-  `EffectiveUBatchSize` stays at `BlackwellBatchSize = 2048`. Gate behind the same
-  Blackwell detection already used at `:37-40`. Single-stream/embedding/rerank
-  paths keep `n_batch == n_ubatch`. This is the only behavioral change shipped by
-  Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
-
-- **A.6 Tests.** Extend `hardware_defaults_internal_test.go` with
-  `EffectiveUBatchSize` cases; add a `grpcModelOpts` test asserting
-  `NUBatch <= NBatch` and that unset config yields `NUBatch == NBatch`. Re-run
-  0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or
-  neutral ITL) at `n_batch=4096, n_ubatch=2048`.
-
-### Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
-Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
-one change that touches the inherited scheduler, so it lives as a patch in
-`backend/cpp/llama-cpp/patches/` (applied by `prepare.sh:6-11` / Makefile
-`:141-145`), never as an edit to a checked-in upstream file.
-
-Policy (pseudocode; insert into `update_slots()` prefill fill loop, the
-`while (… && batch.n_tokens < n_batch)` at ≈ `server-context.cpp:2552`):
-
-```
-# token budget for THIS iteration, decode already seated:
-n_decode_in_batch = batch.n_tokens            # set after the decode phase
-prefill_budget    = n_batch                    # default == today
-
-if serving_mode and n_decode_in_batch > 0:
-    # leave room so decoders are not starved/jittered by one giant prefill chunk
-    # max_prefill_per_iter defaults to n_ubatch (one physical tile) when decode active
-    prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
-
-# fill loop guard becomes:
-while slot.prompt.n_tokens() < slot.task->n_tokens()
-      and batch.n_tokens < prefill_budget:
-      ...
-```
-
-- `max_prefill_per_iter` is a new `common_params` field surfaced as an
-  `options:` knob (`max_prefill_tokens` / `mpt`) parsed in `grpc-server.cpp`
-  exactly like A.1, default `0` = disabled = today's behavior.
-- Semantics mirror vLLM `long_prefill_token_threshold`: cap the prefill share so
-  ongoing decodes keep a steady cadence; the remaining prompt rides the next
-  iteration (already supported by the state machine — slot stays
-  `PROCESSING_PROMPT`).
-- **Correctness:** unchanged KV/position path — chunk boundaries already advance
-  `slot.prompt.tokens.pos_next()` per added token (≈ 2570) and the slot resumes
-  from `slot.prompt.n_tokens()` next iteration. Capping the budget only changes
-  *how many* tokens are added this iteration, not *which* positions, so 0.1 must
-  remain token-identical.
-
-### Phase C — Docs + defaults rollout
-- Document `batch` / `ubatch` (and `max_prefill_tokens` if B ships) in
-  `docs/content/` model-config reference, with the serving recipe
-  (`n_parallel>1`, `n_batch=4096`, `ubatch=2048`).
-- Note the orthogonality to paged KV (below) in
-  `PHASED_VLLM_PARITY_PLAN.md` Phase 3.
-
---
-
-## 4. Risk / correctness
-
-- **KV-cache & positions across chunks:** already handled upstream. Each prefill
-  token added advances `pos_next()` (≈ 2570) and is pushed to `slot.prompt.tokens`
-  (≈ 2573); the next iteration resumes from `slot.prompt.n_tokens()`. Chunk
-  boundaries are transparent to the KV cache because positions are absolute, not
-  per-chunk. Phase A changes only budgets, not positions; Phase B changes only the
-  per-iteration count. The 0.1 token-identical test is the guardrail.
-- **Unified KV cache (LocalAI default, `n_parallel` slots share one cache):**
-  unaffected — co-batching prefill+decode across slots is what the unified cache is
-  for; positions are per-`seq_id` (`{ slot.id }` in `common_batch_add`).
-- **`n_ubatch > n_batch`:** invalid; A.4 clamps `EffectiveUBatchSize <=
-  EffectiveBatchSize` and A.1 logs a warning if options violate it.
-- **Embeddings / rerank:** must keep `n_ubatch >= prompt length` (single pass,
-  `can_split()==false`). The existing `:519` alias + `EffectiveBatchSize`
-  context-sizing for single-pass usecases (`options.go:119-124`) must be preserved
-  — do not let the serving `BlackwellLogicalBatch` default leak into single-pass
-  configs (A.5 gates on completion + `n_parallel>1`).
-- **Turboquant fork:** the fork lacks some `common_params` fields (see
-  `LOCALAI_LEGACY_LLAMA_CPP_SPEC` precedent at `grpc-server.cpp:755`). `n_batch` /
-  `n_ubatch` are ancient fields and safe; if Phase B adds `max_prefill_per_iter`,
-  guard the new field behind a `#ifndef` like the checkpoint block does.
-
-## 5. Orthogonality to paged KV (Phase 2)
-
-Keep them independent. Paged KV (the `-kvp` / block-manager effort, draft #22569,
-and `paged/`) changes **where** KV blocks live (allocation/utilization). Chunked
-prefill / this decouple changes **how many tokens per iteration** the scheduler
-batches (the `n_batch` budget and decode/prefill interleave). They compose: paged
-KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
-scheduling window to feed those slots; neither touches the other's data structures.
-The only contact point is `update_slots()` — if both ship a vendored patch to it,
-land them as separate, ordered patches in `patches/` and keep the hunks disjoint
-(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
-budget).
-
---
-
-## 6. Bottom line
-
-- Chunked prefill + decode interleave: **already present and correct** on the
-  pinned llama.cpp — verify (Phase 0.1), do not rebuild.
-- Real work: the **n_batch/n_ubatch decouple** (Phase A) — small, additive,
-  default-preserving — plus an **optional decode-headroom prefill cap** (Phase B)
-  if measurements show ITL jitter. Both are LocalAI-side: A in `grpc-server.cpp`
-  + proto + `options.go`; B as a vendored `patches/` hunk.
--- a/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
+++ b/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
@@ -1,215 +0,0 @@
-# llama.cpp multi-user decode overhead on DGX Spark (GB10, sm_121)
-
-Investigation of the Qwen3-32B concurrent-decode throughput gap (llama.cpp ~547 t/s
-vs vLLM ~667 t/s) on the GB10 box, build `~/llama.cpp-pr24423/build` (Release,
-sm_121, `LLAMA_MAX_SEQ=256`, flash-attn on), model
-`~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
-
-## TL;DR (the result overturns the brief's premise)
-
-On **this** build the prime suspect is wrong and the host-overhead premise does not
-hold:
-
-1. **CUDA graphs are NOT disabled at high concurrency.** At npl=128, 94 of 98
-   decode `graph_compute` calls **replay a captured CUDA graph** (0 resets, stable
-   key, no property churn post-warmup). The keyed-warmup gate works.
-2. **There is no ~170ms/step host hotspot here.** The GPU is **~96% active during
-   decode with graphs ON and ~96% active with graphs OFF**. Decode at npl=128 is
-   **GPU-compute-bound**, not host-bound.
-3. The brief's "20% GPU util / 66ms GPU / 170ms host per step" was measured on a
-   different/earlier build (mainline without these graph fixes). It is not
-   reproducible on `llama.cpp-pr24423`.
-4. Because the GPU is the bottleneck, re-enabling graphs cannot lift the number:
-   the clean A/B shows graphs ON vs OFF = **+1.5% at npl=128** (and +2.9% at
-   npl=32 - the benefit shrinks as concurrency rises and the GPU saturates).
-5. The real gap to vLLM is the **quantized decode GEMM kernel**: `mul_mat_q`
-   (Q4_K + Q6_K) is ~68% of decode GPU time and runs ~2.1x above the GB10
-   memory-bandwidth floor. Closing the gap requires Marlin/Machete-style int4
-   GEMM kernels, not host-side work. This is a kernel project (the direction the
-   prior session's uncommitted `marlin-w4a16.cu` / `fp4-grouped-moe.cu` already
-   started, though those target w4a16/GPTQ-int4, not the K-quants this GGUF uses).
-
-## 1. Why CUDA graphs are (not) disabled - exact code + measurement
-
-### The gate (code)
-
-PR24423 refactored the CUDA-graph path into a keyed, warmup-based scheme in
-`~/llama.cpp-pr24423/ggml/src/ggml-cuda/ggml-cuda.cu`:
-
- `ggml_cuda_graph_get_key(cgraph)` (~L3343) keys the cached CUDA graph by
-  `cgraph->nodes[0]` (first-node pointer).
- `ggml_cuda_graph_check_compability(cgraph)` (~L3301) disables graphs only for:
-  - **split buffers** (`ggml_backend_buft_is_cuda_split`), and
-  - **`GGML_OP_MUL_MAT_ID`** when `src0` is non-quantized **or**
-    `ne[2] > get_mmvq_mmid_max(...)` (MoE expert routing needs a stream sync).
-  Qwen3-32B is **dense** -> no `MUL_MAT_ID` -> this condition never fires.
- `ggml_backend_cuda_graph_compute` (~L4514) warmup gate: a graph is used only
-  after **2 consecutive calls with no property change** (`warmup_complete`); any
-  property change resets warmup. `ggml_cuda_graph_update_required` (~L3347)
-  detects change by `memcmp` of the full `ggml_tensor` struct + per-src
-  data-ptr/ne/nb, with a fast path when `cgraph->uid` is unchanged.
-
-### Why it stays enabled across decode steps
-
-The graph stays stable because llama.cpp's host-side graph reuse holds during
-decode, so node pointers/props (and `cgraph->uid`) do not churn:
-
- `llama_kv_cache::get_n_kv` (`src/llama-kv-cache.cpp` L1223-1233) **pads n_kv to
-  a multiple of 256** ("so that the graph remains constant across batches and can
-  be reused"). For ntg<=256 within the first KV block, n_kv is constant.
- `can_reuse_kq_mask` (`src/llama-graph.cpp` L43) keeps the KQ-mask dims stable:
-  `ne=[n_kv, n_tokens/n_stream, 1, n_stream]` = `[256,1,1,128]` every decode step
-  at npl=128.
- `can_reuse` (`src/llama-context.cpp` L1283) therefore returns true, so the
-  scheduler is **not** reset/re-split. `graph->uid` is only reassigned inside
-  `ggml_backend_sched_split_graph` (`ggml/src/ggml-backend.cpp` L1033, L1485),
-  which is skipped on the reuse path -> stable uid -> CUDA graph replays.
-
-### Measurement (instrumented build, npl=128, ntg=96)
-
-Env-gated counters added to `ggml_backend_cuda_graph_compute` /
-`ggml_cuda_graph_update_required` (since `GGML_LOG_DEBUG` is compiled out in
-Release / NDEBUG). End-of-run summary:
-
-```
-[GTRACE-SUMMARY] calls=98 notenab=0 warming=3 warmdone=1 RESET=0 USED=94 incompat=0 distinct_keys=1
-```
-
-94/98 decode `graph_compute` calls **replayed** a captured CUDA graph; **0**
-warmup resets; a **single** distinct graph key for the whole decode; no node
-property churn after warmup. Graphs are fully engaged at npl=128.
-
-(The instrumentation was reverted afterwards; the checkout is back to its
-pre-task state and the `.so` rebuilt clean.)
-
-## 2. The per-step CPU "hotspot" - there isn't one on this build
-
-GPU utilization during npl=128 decode (ntg=256):
-
- **Graphs ON** - `nvidia-smi` sampled every 0.7s through the decode phase:
-  steady **96% GPU util**, SM clock **2184 MHz** (not throttled), 45-47 W.
- **Graphs OFF** (`GGML_CUDA_DISABLE_GRAPHS=1`) - nsys CUDA trace, 8s window:
-  total GPU kernel time = `3,983,292,128 ns / 0.516` (the Q4_K `mul_mat_q` time divided
-  by its 51.6% share) = **~7.72s of the 8s window = ~96% GPU-active**. Even with every kernel
-  launched individually from the host, the GPU is still ~96% busy, with essentially **no host gaps**.
-
-Per-step wall = 60.6s / 256 steps = **~237 ms/step**, and the sum of one decode
-graph's kernel times (nsys, graphs-on capture) is ~244 ms -> GPU kernel time per
-step ~= wall time per step. The host work between steps is in the low single-digit
-ms (the ~4% idle), consistent with graphs ON giving only +1.5% at npl=128.
-
-This directly contradicts the brief's 66ms-GPU / 170ms-host split, which must have
-come from a pre-graphs build.
-
-### Per-step GPU breakdown (nsys, npl=128 decode, graphs off, 8s window)
-
-| Kernel | % GPU time | ~ms/step |
-|--------|-----------:|---------:|
-| `mul_mat_q` Q4_K (type 12) | 51.6 | ~118 |
-| `flash_attn_ext_f16` | 19.3 | ~44 |
-| `mul_mat_q` Q6_K (type 14) | 16.2 | ~37 |
-| `unary_gated` silu | 4.1 | ~9 |
-| mmq stream-k fixup + quantize_q8_1 | ~5 | ~12 |
-| rms_norm / rope / set_rows / add | ~4 | ~10 |
-
-Quantized matmul = **~68%** of decode GPU time (~155 ms/step). Attention ~19%.
-
-`perf` could not profile the host (kernel `perf_event_paranoid=4`), but it is moot:
-the host is ~4% of the wall, so there is no ~170ms host hotspot to chase.
-
-## 3. Fix attempt + measured result
-
-### The requested fix (re-enable graphs / pad the decode batch) is a no-op here
-
-Graphs are already enabled and the batch is already stable (n_kv padded to 256,
-kq_mask dims constant). The clean cold A/B (cooldowns between every run):
-
-| npl | graphs ON (t/s) | graphs OFF (t/s) | delta |
-|----:|----------------:|-----------------:|------:|
-| 32  | 242.60 | 235.75 | +2.9% |
-| 64  | 398.59 | 389.06 | +2.5% |
-| 128 | 543.95 | 535.71 | +1.5% |
-
-Baseline (separate cold runs, original non-instrumented build):
-npl=32 243.9, npl=64 397.1, **npl=128 544.95** (matches the ~546 baseline).
-
-Graphs help, but the benefit **monotonically shrinks** as concurrency rises and
-the GPU saturates. At npl=128 there is only ~1.5% of host launch overhead left to
-remove, and GPU util is ~96% in both columns. **You cannot lift npl=128 decode
-toward 667 by working on graphs/host overhead - the GPU is the bottleneck.**
-
-### Where the time actually goes, and the real lever
-
- vLLM 667 t/s at this concurrency = **192 ms/step**; llama.cpp 547 = **237
-  ms/step**. The ~45 ms/step gap maps almost entirely onto the quantized matmul.
- GB10 memory-bandwidth floor for a 32B Q4_K_M (~19.8 GB of weights, read once
-  per step and shared across the 128 sequences) at ~273 GB/s is **~72 ms/step**.
-  llama.cpp's `mul_mat_q` spends ~155 ms/step on matmul = **~2.1x the bandwidth
-  floor**. vLLM's Marlin/Machete int4 GEMMs run much closer to the floor; that
-  efficiency difference is the ~547 -> 667 gap.
- The Q6_K matmul (`mul_mat_q` type 14) also shows pathological tail latency
-  (median 0.89 ms, max 5.5 ms) - the MMQ kernel is not well-tuned for the skinny
-  n=128 decode shape.
-
-**The lever to beat 547 is a faster quantized decode GEMM**, i.e. a Marlin-style
-int4 kernel for the decode shapes. This is exactly the direction of the prior
-session's uncommitted `ggml/src/ggml-cuda/marlin-w4a16.cu` and
-`fp4-grouped-moe.cu` (already wired via
-`if (!split && ggml_cuda_w4a16_mul_mat(...)) return;` in `ggml_cuda_mul_mat`).
-Note those target **w4a16 / GPTQ-int4**, while this GGUF is **K-quant (Q4_K/Q6_K)**,
-so they are inert for this model - a Marlin path for K-quants (or shipping the
-model in a Marlin-friendly int4 format) would be required. That is a multi-day
-kernel effort, out of scope for this session, but it is the only lever that can
-move the number.
-
-### Why the "bump LLAMA_MAX_SEQ to 1024 -> 377" data point is consistent
-
-`llama_batch_allocr` keeps `seq_cpl` as an `LLAMA_MAX_SEQ x LLAMA_MAX_SEQ` table
-(`src/llama-batch.cpp`), so per-batch seq bookkeeping scales ~O(MAX_SEQ^2). At
-MAX_SEQ=1024 that host cost becomes large enough (~70 ms/step) to dominate and
-drop decode to 377. At MAX_SEQ=256 the same term is ~4.4 ms/step (the ~1.5% that
-graphs reclaim); lowering to 128 would save ~3 ms/step (~1%). So MAX_SEQ tuning
-confirms the host term is real but tiny at 256 - not a path to 667.
-
-## How this would land in LocalAI
-
- **No host/graph patch is warranted** for this build: graphs already engage and
-  the decode is GPU-bound. A "pad the decode batch / force graph capture" patch
-  would change nothing measurable at high concurrency.
- The actionable upstream/vendored work is a **Marlin-style int4 decode GEMM**
-  (extend the prior `marlin-w4a16.cu` to cover K-quants, or quantize the served
-  model into a Marlin-friendly int4 layout). That is where the ~547 -> 667+ lives.
- If a small host win is still wanted, keep `LLAMA_MAX_SEQ` no larger than the max
-  concurrency actually used (the per-batch `seq_cpl` table is O(MAX_SEQ^2)).
-
-## Reproduction
-
-```
-# baseline / A/B (cold, 30s cooldowns)
-llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -npp 16 -ntg 128 -npl 32,64,128 \
-  -ngl 99 -b 2048 -ub 2048 -fa on            # graphs on
-GGML_CUDA_DISABLE_GRAPHS=1 ...same...        # graphs off
-
-# GPU util (graphs on): sample nvidia-smi during decode -> ~96%, 2184 MHz
-# GPU active (graphs off): nsys profile -t cuda --delay=6 --duration=8 ...
-#   nsys stats --report cuda_gpu_kern_sum  -> sum/0.516 ~= 7.72s of 8s = ~96%
-```
-
-## UPDATE: NVFP4 closes most of the decode gap (no Marlin-for-K-quants needed)
-
-The diagnosis above said the lever is "a more bandwidth-efficient int4 decode GEMM"
-and feared a multi-day Marlin-for-K-quants kernel. But the FP4-MMA path is already
-that kernel. Measured (npl=128, cold A/B, npp=16 ntg=128):
-
-| quant | decode S_TG (t/s) | vs Q4_K | vs vLLM 667 |
-|---|---|---|---|
-| Q4_K_M | 547 (548/546) | - | 82% |
-| **NVFP4** | **619 (617/622)** | **+13%** | **93%** |
-
-NVFP4's `mul_mat_q<NVFP4>` runs closer to the GB10 bandwidth floor at the thin n=128
-decode shape than Q4_K's int8-MMQ (which ran ~2.1x above it). So shipping the model
-as NVFP4 closes the decode gap from ~22% to ~7% AND wins prefill (1209 vs Q4 767 /
-vLLM 800). Net on GB10: llama.cpp+NVFP4 is ahead on prefill (1.5x) and within ~7% on
-decode. The remaining ~7% would be incremental FP4-MMA decode-kernel tuning, NOT a
-from-scratch Marlin kernel - a much smaller, optional effort. NVFP4 is the answer to
-both the prefill and the decode gap.
--- a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
@@ -1,253 +0,0 @@
-# Closing the vLLM Gap on Blackwell (GB10 / DGX Spark) — Living Plan & Results
-
-Target hardware: NVIDIA **GB10** (Grace-Blackwell, `sm_121a`, 119 GiB unified LPDDR5X), `dgx.casa`.
-Model under test: **Qwen3-Coder-30B-A3B-Instruct** (MoE, 128 experts, top-8, ~3B active).
-Engines: llama.cpp (CUDA, `~/llama.cpp-pr24423`, build `7a6ddc5`, `CMAKE_CUDA_ARCHITECTURES=121`) vs vLLM 0.23.0 (`~/vllm-bench`, torch 2.11.0+cu130).
-
-> This is a working document. Each phase appends measured numbers, what was learned, and what's next.
-> Methodology: `llama-bench` (single-stream pp/tg, built-in reps) and `llama-batched-bench` (`-npl` sweep,
-> decode-phase aggregate `S_TG`, prefill aggregate `S_PP`); vLLM via `~/bench/vllm_conc.py` (decode-phase
-> aggregate matched to `S_TG`). Same model/prompt/seed. Precision matched where possible.
-
---
-
-## Baseline results (established)
-
-### Single-stream (B=1), matched ~8-bit
-| Engine / precision | prefill pp512 (t/s) | decode tg128 (t/s) |
-|---|---|---|
-| llama.cpp **Q8_0** | 2215 ± 15 | **54.8 / 62.2** * |
-| llama.cpp **F16** | 700 ± 24 | 32.9 ± 0.05 |
-| vLLM **FP8** | 9155 ± 308 | 52.45 ± 0.05 |
-
-\* two sessions; ~55 right after worker-stop (clocks settling), ~62 steady state. Both ≥ vLLM → **single-stream parity holds**.
-
-### Concurrency sweep (decode-phase aggregate `S_TG`, prefill aggregate)
-| B | llama Q8 prefill | vLLM FP8 prefill | llama Q8 decode | vLLM FP8 decode |
-|---|---|---|---|---|
-| 1 | 1080 | 9644 | 60.1 | 48.0 |
-| 8 | 2189 | 33373 | 160.8 | 312.4 |
-| 32 | 2198 | 99398 | 357.1 | 1171 |
-| 64 | 2194 | 151990 | 519.2 | 2064 |
-
-llama F16 prefill also flat: B=1 452 → B=8 723 → B=32 778. **Prefill flat at both precisions = kernel-throughput ceiling.**
-
-### Our paged patch (LLAMA_KV_PAGED) — concurrency effect: NONE
-Same Q8 binary, paged branch confirmed firing (137 placements at B=8), throughput identical within noise:
-| | B=1 | B=8 | B=32 |
-|---|---|---|---|
-| stock decode | 61.2 | 171.7 | 377.0 |
-| paged decode | 62.7 | 170.8 | 376.8 |
-
-The patch is a placement-only correctness prototype; it does not implement concurrency mechanics. It is single-stream-neutral and concurrency-neutral.
-
---
-
-## Root-cause diagnosis (nsys + code audit)
-
- **74.5% of GPU compute = `mul_mat_q`** (Q8_0 int8 MMQ GEMM, the MoE experts). The only CUTLASS kernel seen is `cutlass_80_tensorop` = **Ampere (sm_80)**, not Blackwell.
- ggml-cuda has **NO FP8 path** (no e4m3/e5m2 GEMM, no cuBLASLt FP8). Q8_0 runs the **Ampere-class int8 `mma.sync s8.s8.s32`** even on GB10 (`mma.cuh:924`, dispatched unconditionally `mmq.cu:307`).
- ggml-cuda **DOES** have a **native Blackwell FP4 path** (MXFP4 + NVFP4, `mma...kind::mxf4...e2m1`, `mma.cuh:1126`, gated `BLACKWELL_MMA_AVAILABLE`). Merged via #17906/#20644/#21074.
- **No fused MoE grouped GEMM**, no tcgen05/wgmma (warp-level `mma.sync` only).
- **Small per-expert GEMMs**: 512-tok ubatch → ~32 tok/expert (128 exp, top-8) → thin GEMMs, memory-bound, can't fill tensor-core tiles. vLLM processes 8192 tok/step → ~512 tok/expert → compute-bound + FP8.
- **The 45–69× gap is partly apples-to-oranges**: we compared llama Q8 (Ampere int8) vs vLLM FP8 (Blackwell). Upstream/NVIDIA benches put the *real* FP4-vs-FP8 prefill gap at **~25–50% long-context**, not 45–69×.
-
-Key upstream refs: discussion #22042 (FP8 design: `ggml_mul_mat_ext` + scale tensors), #17906 (native MXFP4), #18250 (NVFP4-MoE closed not-planned).
-
---
-
-## The levers (cheap → expensive) — execution log
-
-### Lever 1 — NVFP4/MXFP4 model (use existing Blackwell FP4 path) + ubatch bump
-Status: **IN PROGRESS** — single-stream done, concurrency next.
-Quant: `llama-quantize F16 -> MXFP4_MOE` (type 38), 15.9 GiB / 4.47 BPW. (No NVFP4 in llama-quantize; MXFP4_MOE puts experts in MXFP4 = Blackwell FP4 MMA.)
-
-Single-stream (llama-bench), MXFP4 vs Q8 vs vLLM-FP8:
-| metric | llama Q8 | **llama MXFP4** | vLLM FP8 |
-|---|---|---|---|
-| prefill pp512 (ub512) | 2215 | **3061 ± 22** | 9155 |
-| prefill pp2048 (ub512) | ~2200 | 3137 ± 7 | — |
-| prefill pp2048 (**ub2048**) | — | **3441 ± 14** | — |
-| decode tg128 | 62.2 | **86.4 ± 0.3** | 52.45 |
-
-Findings:
- **MXFP4 decode 86.4 beats vLLM FP8 52.45 by 1.65×** (4-bit = less memory traffic; decode is memory-bound). llama wins decode outright.
- MXFP4 prefill +38% over Q8; **ub2048 lifts prefill +10%** (3137→3441). Single-stream prefill gap to vLLM: 4.1× (Q8) → **2.7× (MXFP4)**.
- Caveat: MXFP4 is 4-bit vs vLLM FP8 8-bit — not precision-matched. Fair match = vLLM NVFP4 (4-bit); pending.
-Concurrency (decode-phase aggregate `S_TG`, ub2048), MXFP4 vs Q8 vs vLLM-FP8:
-| B | Q8 dec | **MXFP4 dec** | vLLM dec | Q8 pp | **MXFP4 pp** | vLLM pp |
-|---|---|---|---|---|---|---|
-| 1 | 60.1 | **83.4** | 48.0 | 1080 | 1625 | 9644 |
-| 8 | 160.8 | **267.4** | 312.4 | 2189 | 3634 | 33373 |
-| 32 | 357.1 | **551.2** | 1171 | 2198 | 3651 | 99398 |
-| 64 | 519.2 | **770.2** | 2064 | 2194 | 3648 | 151990 |
-
-**Lever-1 verdict:** MXFP4 is a large, free win: decode +39–66% over Q8 (+39% at B=1, +48–66% at B≥8) and prefill plateau +66% (2200→3650). MXFP4 decode **wins at B=1 and is within ~14% at B=8** vs vLLM; it only falls clearly behind at high concurrency. **Prefill still plateaus (~3650)**: the MoE prefill GEMM doesn't scale with batch (no fused grouped GEMM; ubatch-limited). That plateau is the real remaining structural gap → Levers 2–3. Quality caveat unchanged (MXFP4 4-bit vs vLLM FP8 8-bit; quality not yet evaluated).
-
-### Lever 2 — `n_ubatch` / `n_batch` tuning (standalone)
-Status: **DONE + SHIPPED (auto-default implemented)**
-MXFP4 pp4096 vs ubatch: ub512=2994, **ub2048=3316**, ub4096=2820 (noisy), ub8192=3180.
-**Verdict:** prefill saturates at ub=2048; larger ubatch gives nothing. The ~3300–3650 ceiling is the **MoE GEMM kernel**, not batch size. → No more free config wins; the rest is kernel work (Levers 3–5).
-**Implemented:** `core/backend/hardware_defaults.go` — `EffectiveBatchSize` now defaults the physical batch
-(n_batch→n_ubatch alias) to **2048 on Blackwell** (`xsysinfo.IsNVIDIABlackwell`, cc≥12 / sm_120/121) when the
-config leaves `batch:` unset; explicit `batch:` always wins. Detection is a shared Go helper; placed at the
-common ModelOptions builder so it covers the C++ llama.cpp backend too. Tests: `hardware_defaults_internal_test.go`.
-
-### Lever 1b — Standard Q4 vs MXFP4 (what's actually MXFP4-specific)
-**Q4_K_M** (17.3 GiB) vs **MXFP4** (15.9 GiB), ub2048:
-| metric | Q4_K_M | MXFP4 | Q8 |
-|---|---|---|---|
-| decode tg128 | **93.5** | 86.4 | 62.2 |
-| prefill pp512 | 2164 | **3061** | 2215 |
-| prefill pp2048 | 2953 | **3441** | ~2200 |
-**Verdict:** the **decode win is just "4-bit"** — plain Q4_K_M matches/beats MXFP4 on decode (both memory-bound).
-MXFP4's *only* real edge is **prefill (+41% over Q4_K_M)** via Blackwell FP4 tensor cores. So for shipping,
-**"4-bit quant + ubatch=2048" captures most of the win portably**; MXFP4 is a Blackwell-only prefill extra.
-
-### Lever 3 — Fused FP4/FP8 MoE grouped GEMM (+ activation-quant fusion)
-Status: **DESIGNED + PROFILED, not built** (multi-week kernel R&D). The single biggest remaining prefill win.
-
-**Decisive measurements:**
- Prefill does NOT scale with bigger single prompts (attention O(N²) confounds): MXFP4 pp2048=3295, pp8192=1524,
-  pp16384=2051. So the plateau is not a batch-size fix.
- Real gap is batched many-sequence prefill: B=32 llama 3651 vs vLLM 99398 = **27×**. llama.cpp MoE prefill runs
-  at only **~22 effective TFLOP/s** on the GB10 — far below the GPU. Large headroom.
- **nsys (MXFP4 pp2048):** `mul_mat_q<type39>` (MoE FP4 GEMM) = **37.2%**, `quantize_mmq_mxfp4` (act-quant) = 8.0%,
-  `mul_mat_q<type8>` (dense/attn, still Q8) = 10.1%, flash_attn = 8.8%. The native FP4 MMA *is* engaged — the
-  inefficiency is the **per-expert thin-tile MMQ scheduler** + **un-fused activation quant**.
-
-**Target (precise):** the ~45% in `mmq.cu`'s grouped MoE path (`ggml_cuda_mul_mat_q` + `ids`, `mmid.cu`). Replace
-the per-expert thin-tile scheduler with a CUTLASS-style grouped GEMM (full tiles regardless of tokens/expert) and
-fuse `quantize_mmq_mxfp4` into the permute/gather. Dense Q8 matmuls (10%) are the separate Lever-4 (FP8) target.
-Problem (measured): the prefill ceiling is the MoE expert GEMM. Today `ggml_cuda_mul_mat_q` with `ids`
-(`mmq.cu:127`) launches one grouped MMQ over a 3D grid (z = expert), but each expert's tile is thin
-(~tokens/expert columns) so int8/FP4 tensor cores run underfilled; throughput is memory-bound on weight
-streaming and flat vs batch.
-Approach:
- Replace the per-expert thin-tile scheduler with a **CUTLASS-style grouped GEMM** that concatenates all
-  experts' token-blocks into one problem with per-group offsets, so tiles are always full (m16n8k64 FP4 /
-  m16n8k32 FP8) regardless of per-expert token count. Mirrors vLLM's `fused_moe` + cutlass grouped GEMM.
- **Fuse activation quantization into the permute/gather** (the `quantize_mmq_q8_1`/FP4 quantize currently a
-  separate 3.3% kernel) so the routed activations are quantized as they're scattered into expert order.
- Files: new kernel under `ggml/src/ggml-cuda/` (e.g. `moe-grouped-gemm.cu`) + dispatch hook in
-  `ggml_cuda_mul_mat_id` (`ggml-cuda.cu:2622`); reuse `mmid.cu` routing/`expert_bounds`.
- Effort: high (2–4 wks expert CUDA). Risk: numerics + sm_121 tile tuning. Expected payoff: the bulk of the
-  prefill gap (vLLM's MoE prefill advantage is mostly this). Upstream: #18250 (NVFP4-MoE) was closed
-  not-planned, so this would be a LocalAI patch or a fresh upstream proposal.
-
-### Lever 4 — FP8 (e4m3) GEMM for dense layers
-Status: **DESIGNED, not built** (blocked on a core ggml API change).
-Problem: ggml-cuda has no FP8 matmul (only int8/FP4). vLLM runs qkv/o_proj/lm_head in FP8 on Blackwell
-tensor cores. Our dense layers run int8-MMQ or f16-cuBLAS.
-Approach (two options):
- (a) **cuBLASLt FP8**: route dense `mul_mat` through `cublasLtMatmul` with `CUDA_R_8F_E4M3` A/B and FP32
-  compute + scale pointers. Lowest kernel effort; gets library-tuned Blackwell FP8 immediately. Needs the
-  scale-tensor plumbing below.
- (b) **Hand-written sm_121 `mma.sync ...e4m3.e4m3.f32`** kernels in `mma.cuh`/`mmf.cu`. More control, more work.
- Prerequisite (both): the **`ggml_mul_mat_ext` / scale-tensor API** from upstream discussion #22042 —
-  per-tensor FP8 scales don't fit the block-scaled quant struct; `MUL_MAT`/`MUL_MAT_ID` must accept optional
-  scale tensors. This is a cross-cutting ggml change (graph + ops + all backends' fallbacks).
- Effort: high (API change is the hard part; cuBLASLt path is then moderate). Payoff: closes dense-layer
-  prefill/compute gap; complements Lever 3. Note: for *this* MoE model the experts dominate, so Lever 3 > 4.
-
-### Lever 5 — tcgen05 / wgmma-class kernels for large-prefill tiles
-Status: **DESIGNED, not built** (very high effort; last increment).
-Problem: ggml's tensor-core path is warp-level `mma.sync` only (no `wgmma`/`tcgen05`). Blackwell's
-tensor-memory `tcgen05` MMA (what CUTLASS uses) extracts substantially more throughput at large prefill tiles.
-Approach: introduce warpgroup/tcgen05 GEMM main-loops for the FP4/FP8 paths (effectively adopting CUTLASS
-3.x collective mainloops for sm_120/121), used when tile size is large enough (prefill). Decode (thin) keeps
-`mma.sync`.
- Effort: very high (CUTLASS-class engineering). Payoff: the final slice of large-prefill throughput; only
-  worth it after Levers 3–4 land. Realistically: depend on/upstream CUTLASS kernels rather than hand-roll.
-
---
-
-## Paged attention — complete implementation (after kernels are fair)
-The placement prototype is insufficient (measured: zero concurrency benefit). A real implementation must close
-all four gaps below. The CPU foundation is already built & verified (`PagedKVManager` P0–P3, `README.md`); the
-in-model parts are unbuilt. **Build order and concrete design:**
-
-1. **On-demand block allocation from a shared pool** (capacity win — more concurrent seqs before OOM).
-   - Replace `find_slot`'s ring-buffer (`llama-kv-cache.cpp:818`) with `PagedKVManager` block allocation; the
-     KV tensor becomes a shared block pool `[n_embd, block_size*num_blocks]`, sequences draw blocks on demand
-     (already prototyped on CPU: `paged_kv_manager.{h,cpp}`, `test_ggml_paged_rw.cpp`).
-   - The win to measure is max concurrent sequences before OOM (not yet benchmarked; it requires this step).
-2. **Gather-read** so each seq attends only its own blocks (`get_k`/`get_v` `:1145/1165` → `ggml_get_rows`
-   gather into scratch, then existing attention). Numerically proven on CPU (`test_ggml_paged_attn.cpp`,
-   7.5e-08 vs reference). Needs `build_attn_paged` branch in `llama-graph.cpp` + Gate 0 in a real model.
-3. **Continuous batching / scheduler** (no head-of-line blocking on mixed-length traffic). New scheduler in
-   the server slot path; admit/evict at block granularity; the dimension where paging beats llama.cpp's
-   current static batching. This is where the *real* concurrency win lives (vs our synthetic uniform test).
-4. **Automatic prefix sharing** (block-hash dedup; `PagedKVManager::{compute_block_hashes,get_computed_blocks}`
-   already implemented & tested). Cross-tenant shared system prompts reuse physical blocks.
-
-Status: design in `2026-06-19-paged-attention-llamacpp-design.md`; CPU P0–P3 done; in-model #1–#4 unbuilt.
-**Then** measure concurrency in paging's real scenarios — **memory-pressured (max seqs before OOM)** and
-**mixed-length continuous batching** — on the MXFP4 (fair-quant) footing, not the uniform/over-provisioned
-test that (correctly) showed no benefit.
-
-> Reality check from this session's data: paged attention is a **capacity + scheduling** win, not a per-token
-> speed win. On GB10 with 119 GB unified memory and uniform requests we are not memory-bound at B≤64, so the
-> placement prototype showed nothing. Paging's value appears under memory pressure (many/long sequences) and
-> bursty mixed-length traffic. The per-token throughput gap is a **kernel** problem (Levers 1–3), separate
-> from paging.
-
---
-
-## Implementation plan A — Lever 3: FP4 MoE GEMM to vLLM parity
-
-Goal: lift batched MoE prefill from ~3.65k t/s (B=32) toward vLLM's ~99k. Root cause (profiled):
-`mul_mat_q<MXFP4>` runs at ~22 effective TFLOP/s — warp-level `mma.sync`, not Blackwell tcgen05.
-Cheap knobs are exhausted (ubatch saturates at 2048; `GGML_CUDA_FORCE_CUBLAS` is a no-op 3419↔3423;
-tile width already full at mmq_x=128). So parity needs kernel work, done iteratively on the DGX
-(`~/llama.cpp-pr24423`, editable + rebuildable; diffs captured as `patches/`).
-
-Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build --target llama-bench` →
-`llama-bench` MXFP4 pp/concurrency → record):
-1. **Cheap kernel tweaks (low confidence, fast).** nwarps (occupancy), `mmq_y` tile, stream-k on/off,
-   FP4 load-tile path. Measure each. Likely small (<1.3x) — these don't change the warp-MMA ceiling.
-   - **Result (nwarps):** DEAD END. `nwarps` is locked by `static_assert(nwarps*tile_C::I == mmq_y)`
-     (mmq.cuh:3234) → nwarps=8 for mmq_y=128. Can't raise occupancy without co-scaling mmq_y to 256
-     (nwarps=16), which exceeds Blackwell shared-memory limits. The MMQ constants are tightly coupled
-     and not freely tunable. This confirms parity needs the kernel rewrite (phase 3), not knobs.
-2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel +
-   a global round-trip. Tractable, ~1.1x.
-   - **Result:** NOT AVAILABLE as a cheap patch. `quantize_mmq_fp4_cuda` (mmq.cu:200) *already* takes
-     `ids_src1` — the gather is already fused into the quant. The only remaining fusion is quantize-on-load
-     *inside* the GEMM hot loop (intricate, ~8% ceiling, risky). ORippler's #24481 fuses the decode (MMVQ)
-     post-scale and intends a "BS>1" (prefill) follow-up — unwritten. Marginal; skip.
-
-**Upstream survey (2026-06):** there is NO tcgen05/CUTLASS grouped-GEMM MoE kernel in ggml — not merged,
-not in-flight, not a draft (Discussion #18369 is talk, no PR; #18250 closed not-planned). CUTLASS is not a
-dependency (the profile's `cutlass_80_tensorop` is cuBLAS-internal). No fork has a portable MoE kernel
-(croll83/llama.cpp-dgx is GatedDeltaNet-focused). Maintainer signal (woachk on #17906): "the path forward
-is to wait for cuTile C++." So **nothing to cherry-pick; phase 3 is genuinely from-scratch.**
-3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a
-   CUTLASS 3.x collective-mainloop grouped GEMM (sm_120a, `e2m1` block-scaled, tcgen05 tensor-memory MMA),
-   one problem over all experts with per-group offsets, fused act-quant. This is what vLLM/FlashInfer use.
-   Multi-week; the honest path to parity. Prefer **upstream ggml** (issue drafted) over a private patch.
-4. **Full-model low precision.** Quantize dense layers (qkv/o_proj/lm_head, the 10% Q8) to FP4/FP8 too so
-   the whole prefill runs on FP4 tensor cores, not int8-MMQ.
-Exit per phase: measured t/s recorded here; stop a phase when it's a dead end (recorded as such).
-Matching vLLM realistically requires phase 3; phases 1–2 are the warm-up + de-risking.
-
-## Implementation plan B — Complete paged attention (the pivot)
-
-CPU foundation done (P0–P3, `README.md`): vLLM-parity block manager + ggml write/gather + attention
-numerics + placement Gate 0 (token-identical in-model). Remaining = make it deliver the multi-tenant wins.
-Phases:
-1. **On-demand shared-block pool** — replace `find_slot` ring buffer (`llama-kv-cache.cpp:818`) with
-   `PagedKVManager` block allocation; KV tensor = `[n_embd, block_size*num_blocks]` shared pool. Win:
-   fit more concurrent seqs before OOM. Test: max concurrent seqs at fixed budget vs contiguous.
-2. **Gather-read** (`get_k/get_v` `:1145/1165` → `ggml_get_rows` into scratch) + `build_attn_paged` branch
-   in `llama-graph.cpp`. Numerically proven on CPU (7.5e-08). Gate 0: token-identical multi-seq.
-3. **Continuous batching / scheduler** — admit/evict at block granularity in the server slot path. The
-   real concurrency win on mixed-length traffic (where the placement prototype showed nothing).
-4. **Automatic prefix sharing** — block-hash dedup (`PagedKVManager::{compute_block_hashes,get_computed_blocks}`
-   already implemented + tested). Cross-tenant shared system prompts reuse physical blocks.
-Then benchmark in paging's real regimes — **memory-pressured** + **mixed-length continuous batching** — on
-the MXFP4 (fair-quant) footing. Note: GB10's 119 GB unified memory means win-1 needs genuine pressure
-(long/many seqs) to show; the win is capacity + scheduling, not per-token speed.
-
-## Honest scope note
-Levers 3–5 and the complete paged implementation are each substantial (weeks of expert CUDA/systems work). This doc tracks what is **measured** vs **designed** vs **not-yet-built**, and never claims a number that wasn't run on the box.
--- a/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md
+++ b/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md
@@ -1,59 +0,0 @@
-# FP4 grouped-GEMM MoE kernel (Lever 3) — scaffold + implementation plan
-
-The one piece of work that actually closes the vLLM gap on Blackwell (GB10/sm_121). Prefill and decode are both
-bottlenecked by the same kernel: `mul_mat_q<MXFP4>` (warp-level `mma.sync` grouped MMQ, ~22 TFLOP/s) is
-**37%** of prefill and **54.6%** of decode-at-B=64 GPU time (`BENCHMARKS.md`). Paged attention can't touch
-it (proven). The fix is a CUTLASS-3.x collective-mainloop grouped GEMM with block-scaled `e2m1` operands via
-tcgen05 tensor-memory MMA — what vLLM/FlashInfer/TRT-LLM use.
-
-## Scaffold (DONE — builds clean, default byte-identical)
-
-Lives in the DGX checkout `~/llama.cpp-pr24423/ggml/src/ggml-cuda/` (to be rebased onto the pin as a patch /
-upstreamed). Captured diff: `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`.
-
- `fp4-grouped-moe.{cuh,cu}` — entry `ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst) -> bool`
-  (true = handled, false = fall back to MMQ). Gated behind env `GGML_CUDA_FP4_GROUPED`. Currently always
-  returns false → **default build unchanged**.
- Hook in `ggml_cuda_mul_mat_id` (the MoE dispatch), before the `ggml_cuda_mul_mat_q(...ids...)` call:
-  `if (ggml_cuda_fp4_grouped_moe(...)) return;`. The new file is picked up by the existing `file(GLOB "*.cu")`
-  (re-run the cmake configure step after adding it, since GLOB is evaluated at configure time).
-
-This is the integration seam. The kernel fills the stub.
-
-## Implementation phases (each: build on GB10 → numerical parity vs `mul_mat_q<MXFP4>` → bench)
-
-1. **Reference grouped GEMM (correctness first, slow OK).** Per-expert problem sizes + offsets from `ids`;
-   dequant `e2m1`+scales → BF16; loop CUTLASS (or cuBLAS) per group. Gate: output matches MMQ within floating-point tolerance
-   on a 2-expert toy + the real model (token-identical greedy). Establishes the harness + the data plumbing.
-2. **CUTLASS GemmGrouped, sm_120a, BF16 operands.** Replace the loop with one `cutlass::gemm::device::
-   GemmGrouped` launch over all experts (per-group offsets). Measures the grouping win alone.
-3. **Block-scaled FP4 operands (the real lever).** `e2m1` A/B with `e8m0`(MX)/`e4m3`(NV) block scales via the
-   Blackwell scaled-MMA collective (tcgen05 tensor-memory). This is where the TFLOP/s jumps. Needs CUTLASS
-   3.x + sm_120a; verify the block-scale layout matches ggml's MXFP4/NVFP4 packing.
-4. **Fuse activation quant** (the F32→FP4 of src1) into the gather/permute prologue.
-5. **Enable by default** on sm_120/121 when parity holds + faster; keep the env as an escape hatch.
-
-## Dependencies / decisions
-
-- **CUTLASS is not currently a ggml dependency** (the profile's `cutlass_80_tensorop` is cuBLAS-internal).
-  Adding it = submodule/fetch + include dir, gated to CUDA sm_120+. Float the approach with ggml maintainers
-  early (Discussion #18369 is the home; JohannesGaessler asked to discuss arch before big kernel work).
-- Target sm_120a/121a (consumer Blackwell). Datacenter Blackwell (sm_100) is a separate tile config.
-- Risk: needs ncu-driven iteration on the GB10; this is multi-week, expert-CUDA. No upstream base to fork
-  (exhaustive search confirmed). Net-new value upstream.
-
-## DENSE scope — RESOLVED (TODO #28, benchmarked): dense needs an FP4 GEMM too
-
-Benchmarked Qwen3-32B dense, vLLM W4A16 vs llama.cpp Q4_K_M (`BENCHMARKS.md`). **Dense prefill is 7.6–32×
-behind** (llama int8-MMQ plateaus ~765 t/s; vLLM FP4 scales to 24.4k); decode ~parity at B=1, 2.2× at B=64.
-So the kernel track is **two kernels, not one**:
-
-- **(a) Dense FP4 GEMM** — a plain non-grouped CUTLASS/tcgen05 block-scaled FP4 GEMM. **Simpler than grouped;
-  land this FIRST** — it's the easier first kernel, benefits every dense model, and de-risks the FP4 collective
-  before the grouped variant. Hook: the non-MoE `ggml_cuda_mul_mat_q` (no `ids`) path.
-- **(b) MoE grouped FP4 GEMM** — the scaffold above (`ggml_cuda_fp4_grouped_moe`), per-expert offsets.
-
-Both share the same block-scaled `e2m1` collective; (a) is (b) with one group. Suggested order: build (a),
-prove the FP4 collective + parity harness, then generalize to (b). (Aside: full W4A4 NVFP4 doesn't run on
-GB10 today — FlashInfer ships no FP4 cubins for sm_121, so the dense `mm_fp4` kernel hangs/returns zeros; the
-W4A16 Marlin path is the fast, correct one and is the fair comparison. See `BENCHMARKS.md` for the root cause.)
--- a/backend/cpp/llama-cpp/paged/MXFP4_QUALITY.md
+++ b/backend/cpp/llama-cpp/paged/MXFP4_QUALITY.md
@@ -1,140 +0,0 @@
-# MXFP4-dense vs Q4_K_M quality check (Qwen3, GB10 / DGX Spark)
-
-## Question
-
-MXFP4-quantized **dense** Qwen3-32B is measurably faster on GB10 (Blackwell) than
-Q4_K_M: ~1.58x concurrent prefill, ~1.2x decode, for free (just a requantize that
-routes onto the FP4-MMA kernel). Before LocalAI recommends MXFP4-dense as a Blackwell
-default, we must confirm its **quality is acceptable versus Q4_K** (Q4_K is normally the
-stronger 4-bit format).
-
-Critical caveat going in: the pre-existing `~/bench/q3-32b-mxfp4-dense.gguf` was built
-with `--allow-requantize`, so it was suspected to be **double-quantized** (Q4_K_M ->
-MXFP4), which would unfairly penalize MXFP4. The goal here was a *fair* answer.
-
-## Verdict
-
-**Do NOT recommend MXFP4-dense as a quality-equivalent replacement for Q4_K on
-Blackwell.** A clean apples-to-apples test (same BF16 source, both 4-bit, no imatrix)
-shows MXFP4-dense carries a **large** quality penalty that Q4_K does not:
-
-- Q4_K_M costs **+2.6%** perplexity vs the BF16 baseline.
-- MXFP4-dense costs **+30.8%** perplexity vs the BF16 baseline (i.e. **+27.5% worse
-  than Q4_K**).
-
-The double-quant suspicion was correct but it was **not** the main culprit: even a clean
-MXFP4-from-BF16 is dramatically worse than Q4_K. The ~1.58x prefill / ~1.2x decode
-speedup is real, but it is not free on quality. MXFP4-dense output is still coherent (not
-gibberish), so it is usable where raw throughput dominates and a quality hit is
-acceptable, but it must not be presented as a drop-in, quality-neutral Q4_K replacement.
-
-## Evidence
-
-### 1. Provenance of the existing 32B MXFP4 (it is double-quant)
-
-`~/dense_mxfp4.sh` (mtime matches the `q3-32b-mxfp4-dense.gguf` mtime, Jun 20 09:47)
-created it:
-
-```
-SRC=$HOME/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf      # <-- source is Q4_K_M, not F16/BF16
-OUT=$HOME/bench/q3-32b-mxfp4-dense.gguf
-$QB --allow-requantize --tensor-type "attn=mxfp4" --tensor-type "ffn=mxfp4" \
-    "$SRC" "$OUT" MXFP4_MOE
-```
-
-Confirmed **double-quantized** (Q4_K_M -> MXFP4). Any PPL measured on this file
-overstates MXFP4's true penalty, so the 32B number below is a loose upper bound, not the
-fair answer.
-
-### 2. 32B quick read (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99)
-
-`llama-perplexity`, PR build `~/llama.cpp-pr24423/build` (sm_121):
-
-| 32B model | PPL | vs Q4_K |
-|---|---|---|
-| Qwen3-32B-Q4_K_M | **7.3865** +/- 0.177 | - |
-| q3-32b-mxfp4-dense (double-quant) | **8.4638** +/- 0.206 | +14.6% |
-
-MXFP4 is much worse than Q4_K here, **and** it is double-quant, so the quick read is
-unfair -> escalated to a clean small-model comparison.
-
-### 3. Fair comparison: clean small dense model (Qwen3-4B BF16)
-
-The MXFP4-vs-Q4_K delta is a *format* property and roughly model-size-independent, so a
-small model gives a fast, clean answer. Downloaded `Qwen3-4B-BF16.gguf` (unsloth, ~7.7
-GiB) and quantized it **from that same BF16 source** to both formats with the identical
-recipe used for the 32B (no `--allow-requantize` needed, no imatrix on either side):
-
-```
-llama-quantize  q3-4b-bf16.gguf  q3-4b-q4km.gguf   Q4_K_M
-llama-quantize --tensor-type attn=mxfp4 --tensor-type ffn=mxfp4 \
-               q3-4b-bf16.gguf  q3-4b-mxfp4.gguf  MXFP4_MOE
-```
-
-Perplexity (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99):
-
-| Qwen3-4B | size | PPL | vs BF16 | vs Q4_K |
-|---|---|---|---|---|
-| BF16 (baseline) | 7672 MiB | **13.3188** +/- 0.416 | - | - |
-| Q4_K_M | 2497 MiB | **13.6605** +/- 0.426 | **+2.57%** | - |
-| MXFP4 (clean) | 2236 MiB (4.66 BPW) | **17.4183** +/- 0.561 | **+30.78%** | **+27.5%** |
-
-This is the apples-to-apples quality answer: **clean MXFP4-from-BF16 is ~12x more lossy
-than Q4_K relative to the BF16 baseline** (30.8% vs 2.6%). Notably the clean-4B MXFP4-vs-
-Q4_K gap (+27.5%) is *wider* than the 32B double-quant gap (+14.6%), consistent with
-smaller models being more quantization-sensitive - the double-quant did not invent the
-problem, it is intrinsic to the format as quantized by `llama-quantize`.
-
-### 4. Coherence spot-check (32B, llama-simple, n=60)
-
-MXFP4-dense 32B is fully coherent, not degraded gibberish:
-
-- "The capital of France is" -> MXFP4: "...Paris, is located near the Seine River..."
-  (correct); Q4_K similar.
-- "Q: What is 17 multiplied by 23? A:" -> MXFP4 reasons via the distributive property
-  (sound); Q4_K answers 391 directly (correct).
-- "def fibonacci(n):" -> both emit valid Python.
-
-So the quality cost shows up as measurably higher perplexity (and would surface on harder
-/ longer tasks), not as obviously broken text at short generation lengths.
-
-## Why
-
-`MXFP4_MOE` is a 4-bit float format (E2M1 values, shared E8M0 scale per block of 32,
-round-to-nearest) designed for MoE expert tensors (gpt-oss et al.) with a coarse
-per-block scale. Q4_K uses 6-bit superblock scales plus per-sub-block mins - materially
-better for dense attention/FFN weights. Forcing MXFP4 onto dense layers to reach the FP4
-kernel trades ~1.58x prefill for a large accuracy loss. The FP4-MMA speed path is real,
-but the weights it accepts (MXFP4 here) are lossy for dense.
-
-## Caveat, stated precisely
-
-This measures **llama.cpp's `llama-quantize` MXFP4** (OCP MX FP4, RTN, **no imatrix**)
-against **llama.cpp's Q4_K_M** (k-quant superblocks, also no imatrix here). It is a fair
-format-vs-format comparison of exactly what LocalAI would ship if it routed a requantize
-through this path. It does **not** claim FP4 is fundamentally unviable on Blackwell:
-
-- An imatrix-aware MXFP4, or a better FP4 format with two-level scaling
-  (**NVFP4** - there are already `q3-32b-nvfp4` / `q3-32b-nvfp4a16` dirs on the box),
-  may close much of this gap and is the more promising Blackwell FP4 path to evaluate.
-- The result is for Qwen3 dense; other families may differ in magnitude but the
-  format-level disadvantage of plain MXFP4 RTN vs Q4_K is expected to hold.
-
-## Recommendation
-
-- **Do not** ship a blanket "use MXFP4-dense on Blackwell" recommendation as a Q4_K
-  quality equivalent. The ~1.58x prefill / ~1.2x decode win comes with a real ~30% PPL
-  inflation (vs ~2.6% for Q4_K). Q4_K_M stays the right dense default on Blackwell.
-- If exposing MXFP4-dense at all, gate it as an explicit **throughput-over-quality**
-  option with the perplexity caveat surfaced, not a default.
-- MXFP4/FP4 remains correct where the model is trained for it (MoE / gpt-oss-style).
-  Pursue **NVFP4** (and/or imatrix-aware FP4) as the quality-competitive Blackwell FP4
-  format before making any FP4-dense recommendation.
-
-## Reproduction (DGX Spark, GB10, build `~/llama.cpp-pr24423/build`, sm_121)
-
-- Dataset: `~/wikitext-2-raw/wiki.test.raw` (wikitext-2-raw-v1 test).
-- 32B: `~/ppl32b.sh` -> `~/ppl32b.out`; coherence `~/coh32b.sh` -> `~/coh32b.out`.
-- Clean 4B: `~/fair4b.sh` -> `~/fair4b.out` (quantize + 3x perplexity).
-- All runs `-ngl 99`, `--chunks 50`, `-c 512`. GB10 thermal-throttles but PPL is a
-  correctness metric, so thermal state does not affect these numbers.
--- a/backend/cpp/llama-cpp/paged/Makefile
+++ b/backend/cpp/llama-cpp/paged/Makefile
@@ -1,41 +0,0 @@
-CXX ?= g++
-CXXFLAGS ?= -std=c++17 -O2 -Wall -Wextra -I.
-
-TESTS = test_free_block_queue test_block_pool test_paged_kv_manager test_prefix_cache
-BINS  = $(addprefix tests/,$(TESTS))
-
-all: $(BINS)
-
-tests/%: tests/%.cpp paged_kv_manager.cpp paged_kv_manager.h
-	$(CXX) $(CXXFLAGS) -o $@ $< paged_kv_manager.cpp
-
-check: all
-	@for t in $(BINS); do echo "== $$t =="; ./$$t || exit 1; done
-
-paged-bench: paged-bench.cpp paged_kv_manager.cpp paged_kv_manager.h
-	$(CXX) $(CXXFLAGS) -o $@ paged-bench.cpp paged_kv_manager.cpp
-
-bench: paged-bench
-	./paged-bench
-
-# --- Optional ggml integration test (Phase 1: paged write/gather mechanism) ---
-# Requires a built ggml. Override these to point at your checkout / build:
-#   make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build>
-GGML_SRC   ?= ../../llama-cpp-fallback-build/llama.cpp/ggml
-GGML_BUILD ?= /tmp/ggml-build
-GGML_LIBDIR = $(GGML_BUILD)/src
-
-GGML_TESTS = test_ggml_paged_rw test_ggml_paged_attn
-GGML_BINS  = $(addprefix tests/,$(GGML_TESTS))
-
-tests/test_ggml_%: tests/test_ggml_%.cpp paged_kv_manager.cpp paged_kv_manager.h
-	$(CXX) $(CXXFLAGS) -I$(GGML_SRC)/include -o $@ $< paged_kv_manager.cpp \
-		-L$(GGML_LIBDIR) -lggml -lggml-base -lggml-cpu -Wl,-rpath,$(GGML_LIBDIR)
-
-ggml-check: $(GGML_BINS)
-	@for t in $(GGML_BINS); do echo "== $$t =="; ./$$t || exit 1; done
-
-clean:
-	rm -f $(BINS) $(GGML_BINS) paged-bench
-
-.PHONY: all check ggml-check clean
--- a/backend/cpp/llama-cpp/paged/NVFP4_TEST.md
+++ b/backend/cpp/llama-cpp/paged/NVFP4_TEST.md
@@ -1,114 +0,0 @@
-# NVFP4-dense on DGX Spark (GB10, sm_121): is it the quality-preserving FP4 win MXFP4 wasn't?
-
-Test rig: DGX Spark GB10 (sm_121), `~/llama.cpp-pr24423/build` (PR #24423, FP4 MMA + NVFP4
-kernel), wikitext-2-raw, clean BF16 source `q3-4b-bf16.gguf` (the same source used for the
-established MXFP4 / Q4_K fair test). NVFP4 and all comparison quants were produced clean from
-BF16, no imatrix.
-
-## Verdict (short)
-
-YES on all the load-bearing questions, with one honest caveat:
-
-1. llama.cpp CAN produce an NVFP4 GGUF.
-2. NVFP4 quality is Q4_K-class, NOT MXFP4-class: +7.4% PPL vs BF16 (MXFP4 was +30.8%). It is
-   slightly behind Q4_K (+4.8% relative) but in the same ballpark, not on the quality cliff.
-3. NVFP4 routes onto the FP4 MMA kernel and gets the FP4 prefill speedup: ~1.29x Q4_K on the
-   4B, tracking MXFP4 to within 5% (MXFP4 hit 1.58x on the 32B; NVFP4 should track it there too).
-4. Output is coherent.
-
-Bottom line: NVFP4-dense IS the quality-preserving FP4 win MXFP4 wasn't. It delivers
-essentially the full FP4 prefill speedup at roughly Q4_K quality, where MXFP4 paid a 27% quality
-tax for the same speed. LocalAI can support/recommend NVFP4-dense on Blackwell for prefill-bound
-workloads, with the caveat that it is marginally (~5%) behind Q4_K on perplexity; an imatrix-guided
-NVFP4 quant would likely close most of that remaining gap.
-
-## 1. Feasibility: can llama-quantize produce an NVFP4 GGUF? YES
-
-- The type exists with a full quantize path, not just a kernel:
-  - `GGML_TYPE_NVFP4 = 40` (`ggml.h`), `GGML_FTYPE_MOSTLY_NVFP4 = 26`
-  - `quantize_nvfp4` / `quantize_row_nvfp4_ref` / `dequantize_row_nvfp4` registered in `ggml.c`
-  - type_name is `"nvfp4"`, block `QK_NVFP4` (per-16 FP8/E4M3 block scale + global scale)
-- NVFP4 is NOT a top-level `llama-quantize` ftype (no `NVFP4` entry in the allowed-types list,
-  no reference in `tools/quantize/quantize.cpp` or `src/llama-quant.cpp`), BUT
-  `--tensor-type name=nvfp4` resolves it: `parse_ggml_type` matches the arg against
-  `ggml_type_name(...)`, which returns `"nvfp4"`. This is the exact same mechanism that produced
-  MXFP4-dense.
-- Recipe used (mirrors the MXFP4-dense GGUF byte-for-byte in structure: token_embd Q8_0, all
-  norms F32, all 2D attn+ffn weights to FP4):
-
-  ```
-  llama-quantize --tensor-type "attn=nvfp4" --tensor-type "ffn=nvfp4" \
-                 q3-4b-bf16.gguf q3-4b-nvfp4.gguf Q8_0
-  ```
-
-  Result: `q3-4b-nvfp4.gguf`, 2343.93 MiB, 4.89 BPW, ~5 s. (MXFP4-dense was 2350 MiB; same shape.)
-  Every `blk.N.attn_*` and `blk.N.ffn_*` reported `converting to nvfp4`; token_embd Q8_0; norms F32.
-
-The on-box `~/bench/q3-32b-nvfp4*` dirs are vLLM HF safetensors (already 4-bit), not GGUF, and
-do not feed llama.cpp - confirmed and irrelevant.
-
-## 2. Quality (decisive): NVFP4 is Q4_K-class, not MXFP4-class
-
-`llama-perplexity -f wiki.test.raw --chunks 50 -c 512 -ngl 99`, all clean from the same BF16 4B:
-
-| Quant   | PPL    | vs BF16  | vs Q4_K  |
-|---------|--------|----------|----------|
-| BF16    | 13.32  | -        | -        |
-| Q4_K_M  | 13.66  | +2.6%    | -        |
-| NVFP4   | 14.31  | +7.4%    | +4.8%    |
-| MXFP4   | 17.42  | +30.8%   | +27.6%   |
-
-(NVFP4 measured this run: Final PPL = 14.3097 +/- 0.4457.)
-
-NVFP4 lands much closer to Q4_K (gap 0.65 PPL) than to MXFP4 (gap 3.11 PPL). MXFP4's finer
-sibling delivers: the two-level scaling (per-16 FP8 block scale + global scale) recovers almost
-all of the quality MXFP4's coarse per-32 E8M0 scale threw away. It is not quite Q4_K, but it is
-firmly in the "acceptable 4-bit" regime, not the lossy one.
-
-## 3. Speed: NVFP4 routes onto the FP4 MMA kernel
-
-No clean BF16 32B was on the box (only the vLLM NVFP4 safetensors and the Q4_K/MXFP4 32B GGUFs),
-so per the brief this is the 4B speed signal - a 3-way cold A/B on the SAME 4B model, 45 s
-cooldowns between runs (`-npp 512 -ntg 128 -npl 8,32,64 -b 2048 -ub 2048 -ngl 99`):
-
-Prefill S_PP (t/s):
-
-| B   | Q4_K   | NVFP4  | MXFP4  | NVFP4 / Q4_K | NVFP4 / MXFP4 |
-|-----|--------|--------|--------|--------------|---------------|
-| 8   | 4862   | 6313   | 6602   | 1.30x        | 0.96x         |
-| 32  | 5020   | 6497   | 6836   | 1.29x        | 0.95x         |
-| 64  | 5031   | 6490   | 6831   | 1.29x        | 0.95x         |
-
-- NVFP4 prefill is within ~5% of MXFP4 at every batch size -> both land on the same FP4 MMA
-  kernel. NVFP4 does NOT fall back to a slow path.
-- NVFP4 beats Q4_K's int8-MMQ prefill by ~1.29x on the 4B. The established 32B figures were
-  Q4_K S_PP ~767 and MXFP4 ~1209 (1.58x); since NVFP4 tracks MXFP4 to within 5%, NVFP4 on the
-  32B should likewise approach ~1.5x. (The 4B shows a smaller multiplier than the 32B because a
-  smaller model spends proportionally less time in the matmul the FP4 kernel accelerates.)
-- Token-gen (S_TG) is comparable across all three (memory-bound), as expected.
-
-## 4. Coherence
-
-`llama-simple` (llama-cli hangs - avoided), NVFP4 4B:
-- "The capital of France is" -> "...Paris. ...Germany is in Berlin. ...Italy is in Rome.
-  ...Spain is in Madrid. ...Netherlands is in Amsterdam." (all correct)
-- "Q: What is 17 plus 25? A:" -> "42." (correct)
-
-Coherent and factually accurate.
-
-## Recommendation for LocalAI on Blackwell
-
-Support and recommend NVFP4-dense as the FP4 prefill option on Blackwell (sm_120/121), produced
-via `--tensor-type attn=nvfp4 --tensor-type ffn=nvfp4` over a BF16 source (token_embd Q8_0,
-norms F32). It gives ~the full FP4 prefill speedup (FP4 MMA kernel, ~1.3x Q4_K on 4B and
-expected ~1.5x on larger models) at roughly Q4_K quality (+7.4% PPL vs BF16). This is the win
-MXFP4 failed to deliver: MXFP4 paid a +30.8% quality tax for the same speed and was rejected.
-
-Caveats / follow-ups:
-- NVFP4 is still ~4.8% behind Q4_K on PPL. For quality-first deployments where the prefill win
-  does not matter, Q4_K_M remains the better pick.
-- These NVFP4/Q4_K numbers are clean (no imatrix). An imatrix-guided NVFP4 quant is the obvious
-  next step and would likely close most of the remaining gap to Q4_K - worth measuring before a
-  blanket recommendation.
-- A direct 32B NVFP4-vs-Q4_K speed run (needs a clean BF16 32B GGUF, not on the box) would
-  confirm the projected ~1.5x; the 4B signal plus the MXFP4-tracking already make this very likely.
--- a/backend/cpp/llama-cpp/paged/PAGED_KV_HIGH_CONCURRENCY.md
+++ b/backend/cpp/llama-cpp/paged/PAGED_KV_HIGH_CONCURRENCY.md
@@ -1,115 +0,0 @@
-# Paged KV at high concurrency on a single GB10 - the datacenter-scale test
-
-Closes the open question left by `PR22569_EVAL.md`: that eval could not test the
-"paged KV unlocks thousands of sequences" thesis because **both** KV paths hit the
-`LLAMA_MAX_SEQ=256` compile cap, and the 32B-dense model it used is compute-bound
-(plateaus by npl=128 for an unrelated reason). This run removes both confounders:
-**recompiled `LLAMA_MAX_SEQ=2048`** and used a **bandwidth-bound model (Qwen3-1.7B-Q8_0)**
-where decode aggregate is free to keep climbing with concurrency.
-
-Hardware: NVIDIA GB10 (sm_121, 119 GiB unified LPDDR5X, ~273 GB/s). Build:
-`~/llama.cpp-pr22569` (PR #22569 paged path + the reshape fix), `LLAMA_MAX_SEQ=2048`,
-sm_121 Release. Contiguous = `llama-batched-bench` (unified KV) `S_TG`. Paged =
-`llama-paged -kvp --fit off` `aggregate tps`. `npp=16, ntg/n_predict=128, b=ub=2048,
-ngl 99`. Cold runs, 12 s cooldowns.
-
-## TL;DR for the decision
-
-**On a single GB10, paged KV does NOT deliver a throughput or concurrency win - the
-aggregate-decode ceiling is set by the hardware, not the KV layout, and contiguous KV
-already reaches it.** Measured across two model regimes and concurrency up to 2048
-sequences:
-
-- Aggregate decode **plateaus** once the GPU saturates - for both KV layouts:
-  - 32B-dense (compute-bound): ~540 t/s, flat from npl=128 (prior eval).
-  - 1.7B (bandwidth-bound): ~3,200-3,700 t/s, flat from npl=512 (this run).
-- Paged and contiguous land at the **same ceiling**; PR #22569's paged op was 12-13%
-  *slower* than the mature contiguous flash-attention path at equal concurrency on 32B.
-- Pushing concurrency past the plateau is **actively harmful to UX**: per-sequence
-  throughput collapses (23 -> 1.9 tok/s) and TTFT explodes (0.6 s -> 4.3 s avg, **64 s
-  max**) while aggregate stays flat.
-
-**vLLM's ~24k aggregate headline is unreachable on a single GB10 with these models
-regardless of KV layout** - it needs aggregate memory bandwidth / compute that one GB10
-does not have (i.e. many GPUs). Paged KV is a **memory-capacity / anti-fragmentation /
-prefix-sharing** feature, not a single-node throughput-ceiling feature. The static
-single-model benchmark deliberately does not create the memory-pressure regime where
-paging pays off, which is exactly why no win appears.
-
-## The numbers
-
-### Aggregate decode vs concurrency, Qwen3-1.7B-Q8_0 (bandwidth-bound), `LLAMA_MAX_SEQ=2048`
-
-| npl | contiguous `S_TG` (t/s) | paged `aggregate tps` (t/s) | paged per-seq tps | paged TTFT avg / max |
-|----:|------------------------:|----------------------------:|------------------:|---------------------:|
-| 128 | 2,643 | 2,887 | 23-25 | - |
-| 256 | 2,925 | - | - | - |
-| 512 | 3,215 | 3,637 | 7.2-7.8 | 0.57 s / 0.90 s |
-| 1024 | 3,118 | 3,695 | 3.7-4.2 | 1.17 s / 2.37 s |
-| 2048 | (not run) | 3,608 | 1.9-14.6 | 4.28 s / **63.8 s** |
-
-Both paths flatten by npl~512. 8x more concurrency (128->1024) buys contiguous only
-**+18%** and paged **+28%**, then both stop. (The two tools meter slightly differently -
-`llama-paged` aggregate vs `batched-bench` decode-only `S_TG` - so the small paged-vs-
-contiguous offset is not a real paged advantage; the prior apples-to-apples 32B eval had
-paged 12-13% *behind*.)
-
-### Why it plateaus (the hardware ceiling, not the KV layout)
-
-Decode is memory-bandwidth-bound: each step reads the model weights once and shares that
-read across the whole batch. Once concurrency is high enough that the shared weight-read
-is amortized, the per-step cost is dominated by KV reads + attention + host work, none of
-which paging makes cheaper. The GB10's ~273 GB/s sets the floor; at the plateau the GPU
-is ~saturated. Adding sequences past that point cannot raise aggregate - it only divides
-the same throughput across more users (per-seq tps falls, TTFT rises). The 32B-dense case
-plateaus even earlier (npl=128) because it saturates on **compute** (weight matmuls), not
-bandwidth - the kernel decomposition is in `VLLM_DECOMPOSITION.md`.
-
-## What paged KV is actually for (the honest, deliverable value)
-
-Paging never helps a static, uniform-length, single-model benchmark on a GPU with memory
-to spare - there is no fragmentation and no over-reservation to reclaim. Its real wins,
-which require the regime this hardware+benchmark does not exercise, are:
-
-1. **Concurrent-tenant capacity under memory pressure.** Block KV fits more *diverse*
-   in-flight sequences (variable, dynamically arriving/leaving contexts) without the
-   contiguous path's per-slot reservation/fragmentation. Pays off when KV memory, not
-   compute/bandwidth, is the binding constraint - i.e. at multi-GPU datacenter scale or
-   with very long/variable contexts.
-2. **Cross-request prefix sharing.** A chained-hash block cache shares identical system
-   prompts / RAG preambles across requests (vLLM's `block_pool.py` + block-hash map). A
-   real token-budget win for shared-prefix workloads; PR #22569 defers this to a
-   non-existent Phase 2 (our from-scratch P0 has the machinery).
-
-These are measured as **max concurrent distinct tenants** and **KV memory saved**, not as
-aggregate tok/s on one model. They do not move the single-GB10 throughput ceiling.
-
-## Recommendation
-
-- **Do not pitch paged KV as a single-GB10 throughput lever** - it is measured flat to
-  the contiguous ceiling (and PR #22569 is slower). Doing so would not survive a
-  benchmark.
-- **The single-GB10 throughput story is already strong without paging:** llama.cpp is
-  ahead of vLLM single-stream (MXFP4 1153 > 800) and at ~70-81% of vLLM aggregate at
-  npl<=128 with a near-identical batching multiplier (`VLLM_DECOMPOSITION.md`). Ship the
-  MXFP4/NVFP4-dense prefill win (`NVFP4_TEST.md`) - that is the cheap, real, defensible
-  Blackwell number.
-- **If datacenter-scale (thousands of concurrent tenants) is the genuine target,** the
-  lever is **multiple GPUs** plus paged KV's **capacity + prefix-sharing** features -
-  framed and measured as concurrent-tenant capacity and KV memory saved, on a
-  variable-context / shared-prefix workload. A single GB10 cannot produce the ~24k
-  aggregate regardless of KV layout; that is a fleet-level result.
-
-## Reproduction (DGX, `~/llama.cpp-pr22569`, `LLAMA_MAX_SEQ=2048`)
-
-```sh
-M=~/bench/draft17/Qwen3-1.7B-Q8_0.gguf
-# contiguous
-for NPL in 128 256 512 1024; do
-  ./build/bin/llama-batched-bench -m $M -npp 16 -ntg 128 -npl $NPL -ngl 99 \
-    -b 2048 -ub 2048 -fa on -c $((NPL*160)); done
-# paged
-for NPL in 512 1024 2048; do
-  ./build/bin/llama-paged -m $M -kvp --fit off -ngpub 32768 -ncpub 128 \
-    -np $NPL -ns $NPL -n 128 -b 2048 -ub 2048 -ngl 99; done
-```
--- a/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md
+++ b/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md
@@ -1,170 +0,0 @@
-# Paged KV: target-readiness (correctness, dynamic benchmark, 2xH200 projection)
-
-Target hardware: **~2x H200** (281 GB HBM3e total, ~4.8 TB/s per GPU). The GB10 box is
-the *test* rig, not the target - and several earlier "no win" findings are GB10-specific
-artifacts (low bandwidth caps throughput before KV memory ever binds). This document
-delivers the three things needed to push paged KV toward the real target:
-
-1. **Correctness** of the paged path - verified (and a blocking bug found + fixed).
-2. **A dynamic-load benchmark** that actually exercises where paging wins (`paged-loadgen.cpp`).
-3. **A projection** of the paged-KV payoff on 2x H200, grounded in measured GB10 numbers.
-
----
-
-## 1. Correctness: PASS (after fixing the auto-fit OOM)
-
-`test-paged-kv-e2e` checks the paged decode path against the contiguous reference
-(greedy argmax + top-5 set overlap >= 4). On the box it was previously **unverified** -
-it aborted at context creation. Root cause found:
-
-- `common_fit_paged_kv_blocks` (`common/common.cpp:1144`) **unconditionally overrides**
-  `n_gpu_blocks` from `ggml_backend_dev_memory`, which **over-reports free VRAM on the
-  GB10 integrated/unified device** (it sized **~245 GB of KV on a 119 GB box** ->
-  `cudaMalloc` OOM -> `GGML_ASSERT` abort in `llama-kv-cache-paged.cpp:74`). The test's
-  explicit `n_gpu_blocks=64` was being clobbered because `params.fit_params` defaults on.
-
-**Fix (item-1 patch, applied on the box):**
-
-```diff
---- a/tests/test-paged-kv-e2e.cpp
-+++ b/tests/test-paged-kv-e2e.cpp
-@@ run_paged()
-     params.kv_paged      = true;
-+    params.fit_params    = false;  // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
-     params.n_gpu_blocks  = 64;
-```
-
-**Result (Qwen3-0.6B-Q8_0, GB10):**
-
-```
-test-paged-kv-e2e: top-5 argmax match: ref=3743 paged=3743
-test-paged-kv-e2e: top-5 set overlap: 5/5 (require >= 4)
-test-paged-kv-e2e: PASSED
-```
-
-The paged op is **numerically greedy-equivalent to the contiguous path**. The reshape
-bug from `PR22569_EVAL.md` (decoupled head_dim) is already applied in the checkout.
-
-**Target-readiness caveat (the durable fix, not just the test):** the auto-fit itself is
-brittle and must be hardened before it runs on a real serving box - even though
-`ggml_backend_dev_memory` reports correctly on a discrete H200, the function should still
-(a) early-return when `!params.fit_params`, (b) **clamp** the computed `n_gpu_blocks` so
-`n_gpu_blocks * block_bytes <= free_vram - margin` using the *actual* KV element size, and
-(c) not override an explicitly-set value. One-screen change in `common_fit_paged_kv_blocks`.
-
----
-
-## 2. Dynamic-load benchmark - `paged-loadgen.cpp`
-
-**Why the existing tools show no paged win:** `llama-batched-bench` and the stock
-`examples/paged/paged.cpp` both run **fixed-length, all-arrive-at-once, single-prompt**
-load. That has no over-reservation and no fragmentation, so contiguous KV is already
-memory-optimal and paging has nothing to reclaim (`PAGED_KV_HIGH_CONCURRENCY.md`). The
-paged win only exists under **variable lengths + continuous arrival + shared prefixes** -
-the real serving regime. No tool in the tree creates it.
-
-`paged-loadgen.cpp` (committed here) does, via the confirmed `llama_paged_scheduler_*`
-API:
-
-- **shared system prefix** (`LG_PREFIX` tokens) prepended to every request -> exercises
-  cross-request prefix sharing,
-- **variable prompt length** (`LG_SUFMIN..LG_SUFMAX` unique suffix),
-- **bimodal generation length** (`LG_GENLONG` for `LG_LONGPCT`% of requests, else
-  `LG_GENSHORT`) - the over-reservation driver,
-- **continuous arrival**: keeps `LG_INFLIGHT` requests live, admitting a new one each time
-  one finishes.
-
-It reports the load-bearing number for the buy decision - the **capacity ratio**:
-
-```
-paged peak KV      = sum over live seqs of ceil(used/block)*block * kv_bytes_per_token
-contiguous reserve = peak_inflight * max_ctx * kv_bytes_per_token   (worst-case per slot)
-CAPACITY RATIO     = contiguous_reserve / paged_peak   (+ prefix sharing on top)
-```
-
-`kv_bytes_per_token = 2 * n_layer * n_head_kv * head_dim * sizeof(f16)` - confirmed against
-`llama-kv-cache-paged.cpp` (e.g. Qwen3-32B: 2*64*8*128*2 = **256 KiB/token**).
-
-**How to run (on the target):** drop into PR #22569's `examples/paged/`, add to its
-CMakeLists next to `llama-paged`, build, then e.g.
-`LG_INFLIGHT=2048 LG_LONGPCT=15 paged-loadgen -m <model> -kvp --fit off -ngpub <N> -ncpub <M> -ngl 99`.
-Sweep `LG_INFLIGHT` to the throughput plateau and read the capacity ratio at that point.
-It is written to run on the target (2x H200) where the regime exists; on GB10 it runs but
-the ratio is uninteresting because throughput plateaus before memory binds (see below).
-
----
-
-## 3. Projection to 2x H200 (grounded in measured GB10 numbers)
-
-### Measured on GB10 (this work)
-
-| model | decode plateau (aggregate) | plateau concurrency | bound by |
-|---|---|---|---|
-| Qwen3-32B-Q4_K_M (dense) | ~540 t/s | npl ~128 | compute |
-| Qwen3-1.7B-Q8_0 | ~3,200 t/s | npl ~512 | bandwidth |
-
-### Hardware ratios (per GPU, then 2x TP at ~85% scaling)
-
-| | GB10 | H200 | per-GPU x | 2x H200 (TP) x |
-|---|---|---|---|---|
-| mem bandwidth | 273 GB/s | ~4.8 TB/s | 17.6 | ~30 |
-| BF16 compute | ~213 TFLOP | ~989 TFLOP | 4.6 | ~8 |
-| HBM | 119 GB | 141 GB | 1.18 | 2.4 (281 GB) |
-
-Decode is bandwidth-bound, so **both the aggregate ceiling and the concurrency at which it
-is reached scale with bandwidth (~30x on 2x H200)**:
-
-- **32B-dense aggregate decode ceiling:** 540 x 30 ~= **16,000 t/s**, reached at
-  ~128 x 30 ~= **3,800 concurrent sequences**.
-
-### Why paged KV becomes the binding lever on 2x H200 (and didn't on GB10)
-
-To reach that ~16k t/s ceiling you must hold **~3,800 sequences** of KV. The memory math:
-
-- 32B weights (FP8) ~= 32 GB, sharded over 2 GPUs -> ~250 GB HBM free for KV.
-- 32B KV = 256 KiB/token. At an avg held context of 2,000 tokens, **per seq = 512 MiB**.
-- Contiguous unified KV (reserve for the live set) fits ~250 GB / 512 MiB ~= **~490
-  sequences** - **8x short of the 3,800 needed to reach the throughput ceiling.**
-
-So on 2x H200 **KV memory is the binding constraint at the throughput-optimal concurrency**,
-and contiguous KV strands most of the bandwidth (you'd run at a fraction of 16k t/s). This
-is the gap paged KV closes. On GB10 it never appeared because GB10's 30x-lower bandwidth
-caps decode at npl ~128, whose KV fits in memory trivially - the constraint order is
-inverted on the real target.
-
-### Magnitude of the paged win
-
-Paging recovers concurrency two ways, both multiplicative on achievable throughput:
-
-1. **No over-reservation.** Contiguous must back `max_ctx` per slot; paging uses
-   `ceil(actual/block)`. For a realistic bimodal workload (most generations short, ~15%
-   long, prompts ~512) the average held context is several-fold below `max_ctx` ->
-   `paged-loadgen` capacity ratio typically **~4-10x** (it measures the exact number for
-   your workload's length distribution).
-2. **Cross-request prefix sharing** of shared system prompts / RAG preambles - additional,
-   workload-dependent (chained-hash block cache; vLLM's `block_pool.py`).
-
-Net: on 2x H200, paged KV is plausibly the difference between serving **~500 and ~3,800**
-concurrent 32B sequences in HBM, i.e. between a fraction of and ~all of the **~16k t/s**
-decode ceiling. **That is the datacenter payoff, and it is real on the target even though
-GB10 cannot exhibit it.**
-
-### Honest caveats for the buy case
-
-- These are **projections** from GB10 + spec ratios; the capacity multiplier depends on the
-  workload's context-length distribution (more variable -> bigger paged win) and TP
-  efficiency. `paged-loadgen` measures it directly once you have target-GPU time.
-- The **paged op itself still needs work**: PR #22569's `ggml_paged_attn` was 12-13%
-  *slower* than the mature contiguous flash-attention path at equal concurrency
-  (`PR22569_EVAL.md`), lacks prefix sharing (deferred to a non-existent Phase 2), and has
-  the fit-robustness bug above. Adopting paged KV for the target means either hardening
-  #22569 or finishing the from-scratch P4 - the capacity win above assumes a *correct,
-  competitive* op, which is the remaining engineering.
- Prefill on either KV layout is compute-capped, not a paged concern.
-
-**Bottom line for the decision:** paged KV **is** the right lever for the 2x H200 target -
-the GB10 "no win" result is a bandwidth artifact, not a verdict. The paged path is now
-**correctness-verified**, the **benchmark to size the win exists**, and the projection
-says the payoff is **~5-10x concurrent-tenant capacity -> several-fold higher aggregate
-decode** on the target. The remaining work is hardening/finishing the paged op, not
-proving the thesis.
--- a/backend/cpp/llama-cpp/paged/PHASED_VLLM_PARITY_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/PHASED_VLLM_PARITY_PLAN.md
@@ -1,55 +0,0 @@
-# Making llama.cpp/LocalAI a viable vLLM alternative — phased plan
-
-Goal: close the practical gap to vLLM for both single-user *speed* and multi-user *throughput*, while keeping
-quality (no lossy quant). Grounded in measured benchmarks + research (`BENCHMARKS.md`, `BLACKWELL_KERNEL_GAPS.md`,
-`VLLM_THROUGHPUT_GAP.md`). The gap is NOT one thing — each phase targets a distinct, independent lever.
-
-## Where vLLM actually leads (measured, GB10 / Qwen3-32B)
-
-- **Single-user decode:** ~parity (10.2 vs 11.7) — bandwidth-bound. vLLM's edge is **spec-dec** (lossless).
-- **Multi-user decode:** gap grows to ~2.2× at B=64 (kernel + scheduler).
-- **Prefill aggregate:** llama plateaus ~765, vLLM scales to 24k — **paged KV + chunked prefill + kernel**.
-- Note: on GB10 vLLM's FP4 trump card is *broken* (falls back to Marlin); llama.cpp runs reliably — a real
-  viability point. vLLM is structurally ahead mainly via **paged KV, chunked prefill, cross-request prefix cache**.
-
-## Phases
-
-### Phase 1 — Hardware-tuned config (PR #10411) — DONE
-Folded into the hardware-defaults path (`core/config/hardware_defaults.go`):
-- Blackwell physical batch (n_ubatch) = 2048.
-- **VRAM-scaled `n_parallel` default** (>=32GiB→8, >=8→4, >=4→2): turns on concurrency + continuous batching,
-  which the backend leaves OFF at its `n_parallel=1` default. Unified KV → slots share the budget (no extra
-  KV memory). Single-host (local GPU) + distributed router (per node). Already-good defaults confirmed:
-  flash-attn=auto, context=4096.
-
-### Phase 2 — Paged / block KV cache  ← biggest structural multi-user lever
-vLLM's PagedAttention lifts KV utilization ~20-38% → ~96%. llama.cpp's own A10G data (draft PR #22569):
-contiguous OOMs at 26 seqs / 496 t/s → paged 247 seqs / 1256 t/s (**~9.5× concurrency, 2.5× aggregate**).
-- Build on / complete **upstream draft PR #22569** (`-kvp`, block manager + paged-attn ggml op, FCFS scheduler)
-  rather than the from-scratch series we prototyped (`paged/`). Our CPU-verified block manager + gather-read
-  design informs the review/port; the upstream momentum is the place to land it.
-- Phase 2b: cross-request prefix sharing (block-hash dedup) — our `PagedKVManager` already implements it.
-
-### Phase 3 — Prefill amortization (chunked prefill + n_batch/n_ubatch split)
-llama aggregate prefill plateaus because (a) one prompt saturates compute, (b) the per-forward GEMM M-dim is
-capped at `n_ubatch`=512, (c) no scheduler chunked prefill (draft #10718 abandoned).
-- Split logical `n_batch` from physical `n_ubatch` (LocalAI ties them today) so concurrent prefills batch into
-  a larger logical batch while keeping ubatch at the Blackwell sweet spot (2048).
-- Chunked prefill + prefill/decode co-batching in the server slot scheduler.
-
-### Phase 4 — Batched-GEMM kernel tuning (the decode 2.2× + prefill height)
-Per `BLACKWELL_KERNEL_GAPS.md`: dense int8-MMQ at ~21% of ceiling, MoE FP4-MMA at ~5%. Both untuned for
-Blackwell. To MATCH: tune MMQ or a Marlin-style W4A16 BF16 GEMM (FP4 not required — GB10 is INT8==BF16). To
-BEAT (2×): fix+tune the existing FP4-MMA on sm_121 (build-flag/`-O3`-miscompile, not greenfield).
-
-### Phase 5 — Backend GPU sampling
-CPU per-sequence sampling caps GPU util ~60% beyond n_parallel ~8-16 (upstream PR #17004). Track/adopt.
-
-### Cross-cutting — Speculative decoding (single-user speed, quality-preserving)
-Dense ≥14B: lossless ~1.8-3×. llama.cpp has `-md`/`--spec-draft-*`. Wire a draft-model field in the model
-config + ship Qwen3 target+draft (1.7B) pairs in the gallery. NOT for MoE-A3B (nothing to amortize).
-
-## Sequencing rationale
-Phase 1 (config) ships now — biggest immediate multi-user win for zero kernel work (concurrency was OFF).
-Phase 2 (paged KV) is the highest-leverage structural build and has upstream momentum. Phases 3-4 are deeper
-(scheduler + kernel). Spec-dec is independent and can land any time for single-user speed.
--- a/backend/cpp/llama-cpp/paged/PR17004_EVAL.md
+++ b/backend/cpp/llama-cpp/paged/PR17004_EVAL.md
@@ -1,90 +0,0 @@
-# PR #17004 (backend / GPU sampling) evaluation on DGX Spark (GB10, sm_121)
-
-Date: 2026-06-21. Hardware: NVIDIA GB10 (DGX Spark, sm_121), CUDA 13.0, cmake 3.28.
-Model: `Qwen3-32B-Q4_K_M.gguf`. LocalAI pin: `LLAMA_VERSION=f3e182816421c648188b5eab269853bf1531d950` (2026-06-17).
-
-## TL;DR (clean negative)
-
-1. **PR #17004 is MERGED and is ALREADY present in our pinned llama.cpp `f3e1828`.** There is nothing to apply / cherry-pick / patch. The `-bs/--backend-sampling` CLI arg, the `llama_set_sampler` / `llama_get_sampled_*` API, and the GPU argsort/top-k/cumsum/softmax kernels are all in the pin.
-2. **The prescribed benchmark cannot test the fix.** `llama-batched-bench` does ZERO sampling - it feeds random tokens (`std::rand() % n_vocab`). Its ~540 t/s plateau is therefore **not** sampling-bound, and enabling backend sampling does nothing to it. The valid tool is `llama-batched` (examples/batched), which the PR updated to drive per-sequence sampler chains and which actually exercises `-bs`.
-3. **In a controlled real-sampling A/B (same `llama-batched` harness, CPU vs GPU sampler), GPU sampling gave only +25% at np=32, +3% at np=64, and CRASHED (`GGML_ASSERT(obj_new)`, graph-context alloc) at np=128 and np=256** - exactly the multi-user regime the investigation cares about.
-4. **nsys at np=64: GPU kernel profile and GPU-busy time are essentially identical with and without the fix** (CPU 392.5 t/s / GPU 404.2 t/s; total GPU kernel+memop time ~4.05 s in both). Sampling kernels do not even appear among the top GPU contributors. GPU utilization did **not** rise.
-5. **Conclusion: PR #17004, in the state shipped by our pin, does NOT break the ~540 plateau and does not move decode aggregate toward the ~2700 GPU-bound ceiling or past vLLM's 667.** It is modest at low parallelism and unusable (crash) at the high parallelism in question. The PR's own guidance ("recommended `--parallel 1`", "will take time to mature") matches what we measured.
-
-## 1. What PR #17004 does + state
-
-- Title: "sampling : add support for backend sampling". **State: MERGED** into `master` (PR head branch `gpu-sampling`). 44 files, +4133/-296.
-- `libllama`: new `llama_context_params.samplers` / `n_samplers`, `llama_set_sampler`, `llama_get_sampled_*`, `llama_sampler_seq_config`, updated `llama_sampler_i`. Sampler chain can now run inside the compute graph on the backend (GPU) instead of on the CPU after `llama_decode`.
-- CUDA: optimized/new `argsort`, `top-k`, `cumsum`, `softmax` kernels; CMake option `-DGGML_CUDA_CUB_3DOT2=ON` (builds a CCCL v3.2 prerelease for faster top-k).
-- Tools: new `-bs, --backend-sampling` arg in `common/arg.cpp` (line 1921); server (`server-context.cpp`) per-slot wiring; `examples/batched/batched.cpp` updated.
-- Supported backend samplers: `top-k`, `top-p`, `min-p`, `temp` (+ dist). **Limitations (from the PR): not compatible with grammar sampling; single output per sequence per batch; no save/load of sampling state; recommended only with `--parallel 1` and CUB_3DOT2.** Open follow-ups: #18547 (avoid graph reallocations), #18550 (skip inactive samplers in parallel decode).
-- It DOES target the CPU-side per-sequence sampling stall we hypothesised - the mechanism is correct. Maturity is the problem.
-
-Note: the GitHub API reports `mergedAt: 2026-01-04`, but the PR contains June 2026 upstream-merge commits and the feature is verified present in our 2026-06-17 pin, so treat the date field as a metadata quirk. What matters: the code is in `f3e1828`.
-
-## 2/3. Apply + build
-
-No apply needed (already in pin). Built from a clean `git worktree` at `f3e1828` (`~/llama-pr17004`), to avoid disturbing the existing diffusion build:
-
-```
-cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
-  -DCMAKE_CUDA_ARCHITECTURES=121 -DLLAMA_MAX_SEQ=256 \
-  -DGGML_CUDA_CUB_3DOT2=ON -DLLAMA_CURL=OFF
-cmake --build build --target llama-batched llama-batched-bench -j20
-```
-
-**Build: SUCCESS** (CUB_3DOT2=ON FetchContent fetched and compiled despite flaky net; sm_121; LLAMA_MAX_SEQ=256). `-bs/--backend-sampling` confirmed present in `llama-batched --help`.
-
-## 4. Decode aggregate: fix vs baseline vs vLLM
-
-### 4a. `llama-batched-bench` (NO sampling - reconfirms the plateau, unaffected by the fix)
-`-npp 16 -ntg 128 -npl 32,64,128,256 -c 40960 -b 2048 -ub 2048`
-
-| npl | S_TG t/s |
-|-----|----------|
-| 32  | 241.8 |
-| 64  | 395.1 |
-| 128 | 542.6 |
-| 256 | 567.2 |
-
-Reproduces the ~540 plateau. Because this tool never samples, `-bs` is irrelevant here - the plateau is decode/host-overhead-bound, not sampling-bound.
-
-### 4b. `llama-batched` real-sampling A/B (CPU sampler vs `-bs` GPU sampler, identical harness)
-`-kvu -n 128 -np {32,64,128,256} -c 40960 --seed 1` (samplers: top-k 40 / top-p 0.95 / temp 0.8)
-
-| np  | CPU sampling t/s | GPU `-bs` sampling t/s | delta |
-|-----|------------------|------------------------|-------|
-| 32  | 174.1 | 217.5 | +25% |
-| 64  | 390.5 | 403.4 | +3.3% |
-| 128 | 497.9 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
-| 256 | 396.7 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
-
-(`llama-batched` absolute t/s is lower than `batched-bench` because it does real sampling plus per-token detokenize/string/stream work; the A/B *within* this harness isolates the sampler cost.)
-
-**Does the fix break the plateau? No.** GPU sampling helps only at low parallelism and the gain shrinks as np rises (+25% -> +3%), then the path crashes at np>=128 - i.e. it fails in exactly the multi-user regime where the plateau matters. It does not approach the ~2700 ceiling and does not pass vLLM's 667. The CPU-sampling curve itself peaks at np=128 (498) and *drops* at np=256 (397), confirming CPU sampling is a scaling wall - but PR #17004 as shipped does not lift it because the GPU path is unstable there.
-
-## 5. GPU-utilization mechanism (nsys, np=64, the highest np where `-bs` survives)
-
-`nsys profile -t cuda ... -n 96 -np 64`
-
-| mode | decode t/s | total GPU kernel+memop time | top GPU contributors |
-|------|-----------|------------------------------|----------------------|
-| CPU sampling | 392.5 | ~4.07 s | mul_mat_q (55%+17%), flash_attn (5.7%), mul_mat_vec (2%) |
-| GPU `-bs`    | 404.2 | ~4.04 s | identical set; sampling kernels not in top contributors |
-
-GPU-busy time and the kernel mix are **essentially unchanged** between modes. The argsort/top-k/cumsum/softmax sampling kernels are negligible in the timeline; the only visible difference is H2D memcpy *instances* rising 1,495 -> 7,076 (pinned-memory sampler transfers) at ~unchanged total memcpy time. **GPU utilization did not rise.** This directly refutes the idea that, at this workload, the GPU idle is dominated by CPU sampler arithmetic - moving the sampler onto the GPU barely changed throughput (+3%) and did not raise GPU occupancy. The ~80% idle measured elsewhere is dominated by something other than the sampler math (host-side batch construction / synchronization / detokenize), which PR #17004 does not address.
-
-(np=256 nsys "with fix" could not be captured: `-bs` aborts there. Fixing the crash needs the unmerged follow-ups #18547/#18550, not in our pin.)
-
-## LocalAI adoption path
-
-**The code arrives transparently with a version bump; enabling it is not transparent.**
-
-- `backend/cpp/llama-cpp/prepare.sh` copies all of upstream `llama.cpp/tools/server/*` (including the #17004-modified `server-context.cpp` / `server-task.cpp` / `server-common.cpp`) into `tools/grpc-server/`, and `grpc-server.cpp` `#include`s them. So once `LLAMA_VERSION` points at a commit containing #17004 (our pin `f3e1828` already does), the backend-sampling machinery compiles into `grpc-server` automatically. **No vendored patch in `patches/` is required for the code.**
-- The vendored `server-context.cpp` already does the per-slot wiring (around line 1615): `backend_sampling &= task.params.sampling.backend_sampling`, also disabled for speculative decode and for pre-sampling logits (`n_probs>0`), then `llama_set_sampler(ctx_tgt, slot.id, common_sampler_get(slot.smpl))`.
-- **But it is OFF unless `task.params.sampling.backend_sampling == true`.** LocalAI's `grpc-server` builds `params` itself from the gRPC request and never sets this flag (and does not pass the upstream `--backend-sampling` CLI arg). So as-is, LocalAI compiles the feature but never uses it. **A small grpc-server change is needed**: read a LocalAI model option / env and set `params.sampling.backend_sampling = true` (global or per-request).
-- For performant CUDA top-k, add `-DGGML_CUDA_CUB_3DOT2=ON` to the llama-cpp CUDA `CMAKE_ARGS` in the Makefile (optional; a non-CUB fallback exists).
-- **Caveats that blunt the benefit for LocalAI specifically:** grammar-constrained requests (JSON-schema / tool calls - a large share of LocalAI traffic), `logprobs`/`n_probs>0`, and speculative decoding all fall back to CPU sampling by the gating above; and the GPU path crashes at np>=128 in this pin. So even after wiring the flag, the multi-user throughput case would not benefit (and would crash) until the follow-up PRs (#18547/#18550) land and stabilise high-parallelism backend sampling.
-
-### Recommendation
-Do **not** adopt PR #17004 as the multi-user throughput fix yet. It is already in the tree but is immature at the parallelism that matters (crashes at np>=128, modest gains below). The measured bottleneck at this workload is not the sampler arithmetic (nsys shows GPU-busy unchanged when sampling moves to GPU). Re-evaluate after #18547/#18550 merge into a future pin; revisit the host-side decode/batch-construction overhead as the more likely real lever.
--- a/backend/cpp/llama-cpp/paged/PR22569_EVAL.md
+++ b/backend/cpp/llama-cpp/paged/PR22569_EVAL.md
@@ -1,136 +0,0 @@
-# Evaluation: llama.cpp PR #22569 (paged KV cache, `-kvp`) on DGX Spark (GB10, sm_121)
-
-Question: is upstream draft PR #22569 the right base to give LocalAI vLLM-class
-high-concurrency GPU throughput, or should we finish our own from-scratch P4
-(`backend/cpp/llama-cpp/paged/`)?
-
-Date: 2026-06-21. Hardware: NVIDIA GB10 (compute 12.1 / sm_121), 122502 MiB unified
-memory, CUDA 13.0, gcc 13.3. Models: `Qwen3-32B-Q4_K_M.gguf` (18.4 GB, 64 layers,
-n_head 64 / n_head_kv 8 / head_dim 128 / n_embd 5120) and `Qwen3-0.6B-Q8_0.gguf` for
-the correctness gate.
-
-## TL;DR verdict: DO NOT adopt #22569. Finish our own P4.
-
-On GB10 with a 32B dense model, PR #22569 delivers **no throughput win and no concurrency
-win** - it is ~12% *slower* than the existing contiguous path and hits the *same*
-256-sequence ceiling. The "scale to thousands of sequences like vLLM" premise does not
-hold for this PR or this hardware/model. On top of that it is broken out of the box,
-wired to the wrong integration surface, and a contested draft.
-
-## 1. Builds? Correct?
-
-- **Builds: YES.** Cloned `matiaslin/llama.cpp@paged_attention` (PR #22569, single commit
-  `0b0f7bd...`, base = current master). Clean CUDA build for sm_121
-  (`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`).
-  `llama-paged`, `llama-batched-bench`, `test-paged-kv`, `test-paged-kv-e2e` all link.
-  It is self-contained (ships its own CPU+CUDA `ggml_paged_attn` op) and does **not**
-  depend on the competing CUDA PR #17579 (ericcurtin, `--pagedattention`).
-
-- **Runs out of the box: NO.** `llama-paged -kvp` on Qwen3-32B *and* Qwen3-0.6B crashes
-  at context creation:
-  `build_attn(llm_graph_input_attn_kv_paged*) -> ggml_reshape_2d ->`
-  `GGML_ASSERT(ggml_nelements(a) == ne0*ne1)` (src/llama-graph.cpp:2556). Same crash with
-  `--fit off` (so it is the real graph, not just the memory probe).
-  **Root cause:** the paged path hardcodes `ggml_reshape_2d(cur, hparams.n_embd, ...)`,
-  wrong for any model where `n_head*head_dim != n_embd`. Qwen3 decouples head_dim:
-  32B = 64*128 = **8192** vs n_embd 5120; 0.6B = 16*128 = **2048** vs 1024. The PR's
-  "qwen3 verified" claim does **not** hold against current Qwen3 GGUFs. Fix is ~1 line
-  (use the real attention width `cur->ne[0]*cur->ne[1]`); applied for the rest of the eval.
-
-- **`fit_params` (`-ngpub` auto-sizing) also crashed on GB10** in the same reshape path
-  during the device-memory probe (before the fix). After the reshape fix, paged
-  auto-fit works (sized 96624 GPU blocks on the 0.6B from 85 GiB free).
-
-- **Correctness after the reshape fix:** paged decode runs and produces **coherent**
-  output on Qwen3-32B (sensible mercury / miso-soup / Starry-Night answers across 128 and
-  256 concurrent sequences), indicating the `ggml_paged_attn` op is functionally roughly
-  correct. PR's own greedy/top-K equivalence test (`test-paged-kv-e2e`, top-K argmax +
-  top-5 overlap >= 4 + first-4-token greedy match vs non-paged) on Qwen3-0.6B did
-  **not** reach a PASS/FAIL verdict on GB10: its paged auto-fit grabs ~88 GiB
-  (96531 blocks) and the run then stalls at cache init (a third GB10 fit-robustness
-  issue, distinct from the reshape bug). So the formal greedy-equivalence gate is
-  **unverified on this box**, but the qualitative evidence (coherent multi-sequence 32B
-  output with explicit small `-ngpub`) indicates the fixed op is roughly correct. This
-  does not change the verdict, which is decided by throughput below.
-
-## 2. Throughput: paged vs contiguous on GB10 (Qwen3-32B-Q4_K_M)
-
-Contiguous = `llama-batched-bench` (unified KV, continuous batching), S_TG decode tok/s.
-Paged = `llama-paged -kvp --fit off` (its scheduler-driven continuous-batching loop),
-`aggregate tps`. Both `npp~16, ntg/n_predict=128, n_batch=n_ubatch=2048, -ngl 99`.
-
-| npl  | contiguous (S_TG t/s) | paged `-kvp` (agg t/s) | outcome |
-|------|----------------------|------------------------|---------|
-| 128  | **537** (S 553)      | **477**                | both run; paged ~12% slower |
-| 256  | **541** (S 550)      | **471**                | both run; paged ~13% slower; neither gains over 128 |
-| 512  | FAIL                 | FAIL                   | **both** die: `n_seq_max must be <= 256` |
-| 1024 | FAIL                 | FAIL                   | **both** die: `n_seq_max must be <= 256` |
-
-### The decisive facts
-
-1. **PR #22569 does NOT lift the 256-sequence ceiling.** Both contiguous and paged fail
-   identically at npl 512/1024 with `n_seq_max must be <= 256` (llama.cpp's compile-time
-   `LLAMA_MAX_SEQ`). It is **not** an OOM - GB10 has 119 GiB and at npl=256 contiguous KV
-   is only 16 GiB. Paging gives **zero** concurrency headroom over contiguous here. The
-   "paged unlocks thousands of seqs" premise is false for this PR.
-
-2. **Paged is slower, not faster.** The fresh `ggml_paged_attn` op (477/471 t/s) loses to
-   the mature CUDA flash-attention contiguous path (537/541 t/s) by ~12-13% at equal
-   concurrency. The PR's A10G "2.5x" came entirely from contiguous OOMing at 26 seqs on a
-   24 GiB card; that lever does not exist on GB10's 119 GiB.
-
-3. **The 32B dense model is compute-bound and plateaus by npl=128 on GB10.** Aggregate is
-   flat from 128->256 (contiguous 537->541; paged 477->471). Doubling concurrency buys
-   nothing because the GPU is already saturated on the 32B weight matmuls. Even if we
-   recompiled with a larger `LLAMA_MAX_SEQ`, aggregate would not climb - so vLLM-class
-   ~24k aggregate is **unreachable for 32B-dense on a single GB10 regardless of KV
-   layout**. The throughput gap to vLLM at this model/hardware is a compute/bandwidth
-   problem, not a KV-fragmentation problem.
-
-## 3. Verdict and reasoning: finish our own P4
-
-**Do not adopt #22569 as the base.** Reasons:
-
-- **No win on target hardware.** Even fully completed, on GB10 + 32B it is slower than
-  what we already have and capped at the same 256 seqs. There is no throughput or
-  concurrency dividend to harvest here.
-- **Wrong integration surface.** Paged is driven only by a brand-new parallel C API
-  (`llama_paged_scheduler_init/add_request/prepare_batch/get_batch_info/update/...`) and a
-  bespoke `examples/paged` loop. `-kvp`/`--kv-paged` is gated to `LLAMA_EXAMPLE_PAGED`
-  only - it is NOT wired into `llama-server`/`batched-bench`/`parallel`, i.e. NOT the path
-  LocalAI's grpc-server derives from. Adopting it means rewriting LocalAI's serving loop
-  around the new scheduler API.
-- **Broken / restricted.** Crashes out of the box on all current Qwen3 (and any
-  decoupled-head-dim model); fit_params crashed; Phase-1 restrictions enforced at context
-  creation: single CUDA device, full offload only, `n_batch == n_ubatch`, no SWA
-  (gemma3/llama4/etc. unsupported), no CoW / prefix-caching, no
-  `seq_cp`/`seq_keep`/`seq_div`/`seq_add`, no state save/load.
-- **Contested draft.** Unmerged; the author is openly asking maintainers whether the C
-  API is even the right design; maintainers are skeptical of paged for single-node use.
-
-**What P4 should actually target (re-scoped by this data).** The aggregate-throughput
-gap to vLLM on a compute-bound dense model on one GB10 is not addressable by paged KV.
-The durable, real LocalAI wins from paging are the ones our from-scratch P0 already
-implements the machinery for and that #22569 explicitly omits:
-- **on-demand KV sizing** (fit more *diverse* concurrent tenants without per-seq
-  over-reservation), and
-- **automatic cross-tenant prefix sharing** (chained-hash block cache - shared system
-  prompts / RAG preambles), which #22569 defers to a non-existent Phase 2.
-
-Finish our own P4 (CPU gather-read + a CUDA gather-read) against these capacity/
-prefix-sharing objectives - measured as max concurrent *distinct* tenants and KV memory
-saved, not single-model aggregate tok/s. To chase raw aggregate, the levers are lifting
-`LLAMA_MAX_SEQ` and smaller/MoE models in memory-bandwidth-bound regimes - orthogonal to
-paged attention. The ~1-line reshape fix found here (and the GB10 fit_params crash) are
-worth upstreaming to #22569 regardless, but the PR is not our base.
-
-### Reproduction (DGX, `~/llama.cpp-pr22569`)
-```sh
-export PATH=/usr/local/cuda/bin:$PATH
-# contiguous
-./build/bin/llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -npp 16 -ntg 128 \
-  -npl 128 -c 20480 -b 2048 -ub 2048        # 256/512/1024 -> n_seq_max must be <= 256
-# paged (needs the src/llama-graph.cpp:2556 reshape fix: hparams.n_embd -> cur->ne[0]*cur->ne[1])
-./build/bin/llama-paged -m Qwen3-32B-Q4_K_M.gguf -kvp --fit off -ngpub 2048 -ncpub 128 \
-  -np 128 -ns 128 -n 128 -b 2048 -ub 2048 -ngl 99   # 512/1024 -> n_seq_max must be <= 256
-```
--- a/backend/cpp/llama-cpp/paged/README.md
+++ b/backend/cpp/llama-cpp/paged/README.md
@@ -1,95 +0,0 @@
-# Paged Attention for llama.cpp (vLLM-parity), CPU-first
-
-A from-scratch port of vLLM V1's paged KV-cache model into the llama.cpp / ggml
-world, built CPU-first and verified incrementally. The host-side block manager is
-a faithful port of vLLM; the compute stays in ggml (no new op — the read path
-gathers blocks with `ggml_get_rows` and feeds the existing attention ops).
-
-Design: `docs/superpowers/specs/2026-06-19-paged-attention-llamacpp-design.md`
-Plan:   `docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md`
-
-## Status
-
-| Phase | What | State |
-|------|------|-------|
-| P0 | vLLM-parity host block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache) | ✅ verified — `make check`, 4/4 suites |
-| P1 | ggml paged write/gather mechanism (`set_rows` by slot_mapping → `get_rows` gather) | ✅ verified — `make ggml-check`, non-contiguous blocks `[2,1,5]` round-trip + isolation |
-| P2 (core) | attention over gathered paged KV matches independent host reference | ✅ verified — max abs err **7.5e-08** |
-| P3 (partial) | capacity & prefix-sharing wins | ✅ measured — `make bench`: **9.2×** more concurrent seqs, **11.3×** less KV memory |
-| **P3 (in-model placement)** | **paged, non-contiguous block KV placement in the real model** | ✅ **Gate 0 PASSED** — Qwen3-0.6B token-identical (`patches/0001-paged-kv-block-placement.patch`) |
-| P4 (in-model compute) | gather-read (`build_attn_paged`, read only a seq's blocks) + win-2 throughput + multi-seq | ⛔ remaining |
-
-The design's central risk — *does paged (non-contiguous) KV produce correct attention?* —
-is **retired at two levels**: (1) at the ggml-op level (P2, 7.5e-08 vs reference) and
-(2) **in a real model** (P3): with KV physically scattered across permuted, non-contiguous
-blocks (cells `0-15, 144-159, 32-47, …`), Qwen3-0.6B greedy generation is **token-for-token
-identical** to the contiguous cache. Reproduce:
-
-```sh
-# from backend/cpp/llama-cpp-fallback-build/llama.cpp (patch applied, CPU build)
-B=build-cpu/bin/llama-simple; M=<Qwen3-0.6B.Q4_K_M.gguf>; P="...long prompt..."
-"$B" -m "$M" -n 40 "$P"                         > base.txt
-LLAMA_KV_PAGED=1 "$B" -m "$M" -n 40 "$P"        > paged.txt
-diff base.txt paged.txt && echo TOKEN-IDENTICAL
-# LLAMA_KV_PAGED_DEBUG=1 prints the permuted physical cells per step
-```
-
-This proves the **storage/placement** layer of paged attention in-model. What remains (P4)
-is the **compute** optimization that yields the throughput win: a gather-read that attends
-only a sequence's own blocks (instead of scanning `[0,n_kv)` with a mask), plus the
-multi-sequence driver to measure tok/s vs concurrency. The patch is single-sequence scope.
-
-## Build & test
-
-```sh
-make check                     # P0 host-manager unit suites (pure C++, no deps)
-make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build>   # P1/P2 ggml tests
-make bench                     # P3 capacity + prefix-sharing numbers
-```
-
-`ggml-check` needs a built ggml. To build one CPU-only from a llama.cpp checkout:
-`cmake -S <llama.cpp>/ggml -B /tmp/ggml-build -DGGML_CUDA=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build /tmp/ggml-build -j`
-(if it complains about a missing `ggml.pc.in`, add a minimal pkg-config stub).
-
-## Files
-
-- `paged_kv_manager.{h,cpp}` — the vLLM-parity block manager (no ggml/llama dep).
-- `tests/test_free_block_queue.cpp` — intrusive LRU free list.
-- `tests/test_block_pool.cpp` — alloc/touch/free/evict/cache.
-- `tests/test_paged_kv_manager.cpp` — allocate/block_table/slot_mapping/free.
-- `tests/test_prefix_cache.cpp` — chained block hashing + first-miss cache hit.
-- `tests/test_ggml_paged_rw.cpp` — paged write/gather through real ggml ops.
-- `tests/test_ggml_paged_attn.cpp` — attention over paged KV vs host reference.
-- `paged-bench.cpp` — capacity (win 1) + prefix-sharing (win 3) measurements.
-
-## Remaining work — integration map (for the next session)
-
-Target: a paged read path active behind a flag, producing **token-identical** greedy
-output vs the contiguous cache on a real model (Gate 0), then `paged-bench` win 2.
-
-Exact seams in the vendored llama.cpp (`backend/cpp/llama-cpp-fallback-build/llama.cpp`,
-the pinned build fetches `LLAMA_VERSION=f3e182816421…`):
-
-1. **Memory type** — `src/llama-model.cpp:2070` `create_memory()` constructs `llama_kv_cache`.
-   Add a paged variant (or a flag on the existing cache) implementing `llama_memory_i`
-   (`src/llama-memory.h`), backed by `PagedKVManager`.
-2. **Allocation** — `src/llama-kv-cache.cpp:818` `find_slot()` produces `slot_info.idxs`.
-   Replace the ring-buffer scan with block-aligned allocation from `PagedKVManager`.
-3. **Read path** — `src/llama-kv-cache.cpp:1145/1165` `get_k`/`get_v` return a contiguous
-   `[0,n_kv)` view. For paged, gather the sequence's blocks (`ggml_get_rows`) into scratch.
-   The new branch lives alongside `build_attn` in `src/llama-graph.cpp` (`build_attn_mha`).
-4. **Mask** — `src/llama-graph.cpp` `build_attn_inp_kq_mask` sizes the mask to the gathered
-   length per sequence.
-5. **Gate 0 driver** — `build-cpu/bin/llama-simple` (greedy argmax) on
-   `Qwen3-0.6B.Q4_K_M.gguf`; assert paged output == contiguous output token-for-token.
-
-### Honest caveats (from the maintainer discussion + reading `find_slot`)
-
-- llama.cpp's **unified cache already shares one KV pool** across sequences and already
-  tolerates non-contiguous slots. So win-1 vs *unified* is smaller than vs per-seq
-  reservation (stream mode). The durable LocalAI wins are **on-demand sizing** and
-  **automatic cross-tenant prefix sharing** (P0 implements the block-hash machinery).
-- vLLM's classic `paged_attention_v1/v2` CUDA kernel is **deprecated**; the live path is
-  FlashAttention/FlashInfer over a block table. The port targets that pattern, not the
-  old kernel. Upstream draft PRs #22569 (new `ggml_paged_attn` op) and #17579 (CUDA) are
-  unmerged; maintainers are skeptical for single-user use.
--- a/backend/cpp/llama-cpp/paged/UPSTREAM_GGML_ISSUE.md
+++ b/backend/cpp/llama-cpp/paged/UPSTREAM_GGML_ISSUE.md
@@ -1,78 +0,0 @@
-# Upstream ggml issue draft: MXFP4 MoE prefill underutilizes Blackwell (GB10) — ~22 TFLOP/s, ~27× behind vLLM
-
-**Title:** CUDA: MXFP4 MoE prefill runs the Ampere-class warp `mma.sync`, far below Blackwell FP4 peak (GB10 / sm_121)
-
-## Summary
-
-On a GB10 (DGX Spark, sm_121), MXFP4 MoE prefill for Qwen3-Coder-30B-A3B is bottlenecked by
-`mul_mat_q<MXFP4>` (the per-expert grouped MMQ), which runs at only **~22 effective TFLOP/s** — a small
-fraction of the GPU's FP4 capability. Batched prefill plateaus at ~3.65k tok/s (B=32) vs vLLM FP8 ~99k
-on the same box (~27×). The native FP4 block-scaled `mma.sync` path (PR #17906 et al.) *is* engaged — the
-limit is that it's a warp-level MMA kernel, not a tcgen05/CUTLASS-class grouped GEMM.
-
-## Hardware / build
-
-- NVIDIA GB10, compute capability 12.1, 119 GiB unified LPDDR5X.
-- llama.cpp built `-DCMAKE_CUDA_ARCHITECTURES=121` (sm_121a/compute_121a confirmed in cubins).
-- Model: Qwen3-Coder-30B-A3B-Instruct, `MXFP4_MOE` (15.9 GiB, 4.47 BPW).
-
-## Measurements
-
-Single-stream (`llama-bench`, ub2048):
-
-| metric | Q8_0 | MXFP4 | vLLM FP8 |
-|---|---|---|---|
-| prefill pp2048 | ~2200 | 3441 | — |
-| decode tg128 | 62 | 86 | 52 |
-
-Batched (decode-phase aggregate `S_TG`; prefill aggregate `S_PP`):
-
-| B | llama MXFP4 prefill | vLLM FP8 prefill | llama MXFP4 decode | vLLM FP8 decode |
-|---|---|---|---|---|
-| 1 | 1625 | 9644 | 83 | 48 |
-| 8 | 3634 | 33373 | 267 | 312 |
-| 32 | 3651 | 99398 | 551 | 1171 |
-| 64 | 3648 | 151990 | 770 | 2064 |
-
-Decode is competitive (we win at B=1). **Prefill plateaus and is the gap.**
-
-## Profiling (nsys, MXFP4 pp2048 kernel time)
-
-| kernel | % |
-|---|---|
-| `mul_mat_q<(ggml_type)39>` (MXFP4 MoE GEMM) | **37.2** |
-| `mul_mat_q<(ggml_type)8>` (dense/attn, still Q8) | 10.1 |
-| `flash_attn_ext_f16` | 8.8 |
-| `quantize_mmq_mxfp4` (activation quant) | 8.0 |
-
-Only cutlass kernel present is `cutlass_80_tensorop` (Ampere). No tcgen05 / wgmma anywhere.
-
-## What we ruled out (so it's the kernel, not config)
-
-- **ubatch**: saturates at 2048 (pp4096: ub512 2994 → ub2048 3316 → ub8192 3180).
-- **tile width**: `mmq_x` already selects the full 128-wide tile at ub2048 (~128 tokens/expert).
-- **cuBLAS fallback**: `GGML_CUDA_FORCE_CUBLAS` is a no-op (3419 ↔ 3423 t/s) — dequant→cuBLAS-FP16 neither
-  helps nor hurts, i.e. the FP4 MMQ kernel isn't worse than FP16 cuBLAS, both hit a common ceiling.
-- prefill does **not** scale with bigger single prompts (attention O(N²) confounds): pp2048 3295, pp8192
-  1524, pp16384 2051 — so it's the many-sequence batched MoE GEMM, not batch size.
-
-## Proposal
-
-A tcgen05 / CUTLASS-3.x grouped-GEMM path for FP4 (MXFP4 + NVFP4) MoE on sm_120/121:
- One grouped GEMM over all experts with per-group token offsets (full tiles regardless of tokens/expert),
-  vs today's per-expert MMQ scheduler.
- Block-scaled `e2m1` operands via tcgen05 tensor-memory MMA (`mma.sync.aligned.kind::mxf4…` is the
-  warp-level form; the collective-mainloop/tcgen05 form is what extracts Blackwell throughput at prefill
-  tile sizes).
- Fuse activation quantization (`quantize_mmq_mxfp4`, ~8%) into the permute/gather.
- Optionally extend to dense layers (qkv/o_proj/lm_head) so full-model prefill is FP4/FP8.
-
-This mirrors what vLLM/FlashInfer/TensorRT-LLM do for Blackwell MoE. Happy to test iterations on the GB10.
-
-## Repro
-
-```sh
-llama-quantize qwen3coder-f16.gguf qwen3coder-mxfp4.gguf MXFP4_MOE
-llama-bench -m qwen3coder-mxfp4.gguf -ngl 99 -p 2048 -n 0 -ub 2048
-llama-batched-bench -m qwen3coder-mxfp4.gguf -ngl 99 -c 45056 -b 2048 -ub 2048 -npp 512 -ntg 128 -npl 1,8,32,64
-```
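
The "per-group token offsets" idea from the Proposal above can be sketched on the host side as follows. This is a minimal illustration only: `group_offsets`, `Tile`, and `build_tile_list` are hypothetical names, not llama.cpp or CUTLASS APIs, and the sketch assumes activations have already been sorted by expert. It shows how one flat tile list over all experts can be derived from the per-expert token counts, so a single launch can walk every expert's rows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Exclusive prefix sum over per-expert token counts.
// offsets[e] is the first row of expert e in the expert-sorted activation
// buffer; offsets.back() is the total number of routed rows.
std::vector<std::size_t> group_offsets(const std::vector<std::size_t>& tokens_per_expert) {
    std::vector<std::size_t> offsets(tokens_per_expert.size() + 1, 0);
    for (std::size_t e = 0; e < tokens_per_expert.size(); ++e) {
        offsets[e + 1] = offsets[e] + tokens_per_expert[e];
    }
    return offsets;
}

// One unit of work for a grouped GEMM: which expert's weights to use and
// which rows of the sorted activation buffer to multiply.
struct Tile {
    std::size_t expert;
    std::size_t row_begin;
    std::size_t rows;  // <= tile_m; smaller only at the tail of an expert
};

// Flatten all experts into one tile list (hypothetical scheduler input).
// Experts with zero routed tokens contribute no tiles.
std::vector<Tile> build_tile_list(const std::vector<std::size_t>& tokens_per_expert,
                                  std::size_t tile_m) {
    assert(tile_m > 0);
    const std::vector<std::size_t> offsets = group_offsets(tokens_per_expert);
    std::vector<Tile> tiles;
    for (std::size_t e = 0; e < tokens_per_expert.size(); ++e) {
        for (std::size_t r = offsets[e]; r < offsets[e + 1]; r += tile_m) {
            const std::size_t rows = std::min(tile_m, offsets[e + 1] - r);
            tiles.push_back({e, r, rows});
        }
    }
    return tiles;
}
```

A device kernel would index this list by block ID instead of launching once per expert; how tail tiles are packed or padded is a design choice this sketch does not settle.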
--- a/backend/cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md
+++ b/backend/cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md
@@ -1,83 +0,0 @@
-# What makes vLLM fast on GB10 — kernel vs scheduler (code-grounded, measured)
-
-Decisive analysis (vLLM v0.23.0, torch 2.11+cu130, sm_121, model `RedHatAI/Qwen3-32B-NVFP4A16`, source at tag
-`v0.23.0`). **Answer: it's the scheduler, not the kernel.** This closes the kernel track and opens the
-scheduler track.
-
-## The decomposition (measured on the DGX, prefix-cache OFF, unique prompts)
-
-| | vLLM W4A16 Marlin | llama.cpp | verdict |
-|---|---|---|---|
-| **single-stream prefill** | ~800 t/s (~52 TFLOPS) | 718 MMQ / **1153 MXFP4** | **tied; llama.cpp MXFP4 wins** |
-| decode batch-1 | 11.8 t/s | ~similar | bandwidth-bound (≈190/273 GB/s); no kernel helps |
-| **aggregate decode** | 328 (N32) / 569 (N64) / **667 (N128)** | the gap | **~56× multiplier = scheduler** |
-
-vLLM's single-stream Marlin is **not** at the roofline — it's in the same ~4×-under regime as MMQ. The 24k
-headline is entirely the aggregate decode multiplier.
-
-## The kernel vLLM actually runs on sm_121 (W4A16, forced)
-
-Dispatch (vLLM v0.23.0): `compressed_tensors.py:704` (NVFP4 + no input-quant → `W4A4Fp4(use_a16=True)`) →
-`compressed_tensors_w4a4_nvfp4.py:28` → `kernels/linear/__init__.py:894` (`if use_a16: force_kernel =
-MarlinNvFp4LinearKernel`, **unconditional, no cc gate**) → `nvfp4/marlin.py` → `marlin_utils_fp4.py:182`
-`ops.marlin_gemm(b_q_type=float4_e2m1f)`, activations FP16/BF16. csrc: `csrc/quantization/marlin/marlin.cu`
-+ `marlin_template.h` + `marlin.cuh`.
-
-Techniques = **exactly the playbook we proved loses on GB10**: XOR shared swizzle (`marlin_template.h:722
-^ (row%8)`), 4-stage cp.async pipeline (`marlin.cu:396 stages=4`, `cp_async_wait<stages-2>`), ldmatrix+mma,
-FP16/BF16 acts. Native FP4 (`FlashInferB12xNvFp4LinearKernel`) needs `Sm120BlockScaledDenseGemm` cubins absent
-on GB10 → W4A4 hangs → forced W4A16 Marlin fallback. **Nothing to port; vLLM's kernel is occupancy-blocked too.**
-
-## The scheduler (the real multiplier) — what llama.cpp lacks
-
- **Paged KV cache** (`vllm/v1/core/kv_cache_manager.py`, `block_pool.py`): block KV, no fragmentation → very
-  high concurrent batch. **llama.cpp: NO** (contiguous per-slot KV → fragmentation caps real concurrency).
- **Chunked prefill** (`config/scheduler.py:84 enable_chunked_prefill=True`, default ON): interleaves prefill
-  chunks with decode so decode batches stay full. **llama.cpp: NO** (a long prefill stalls the decode batch).
- **Continuous batching** (`v1/core/sched/scheduler.py`): per-step admit/evict. **llama.cpp: YES** (`n_parallel`,
-  rudimentary — we enabled VRAM-scaled slots in #10411).
-
-## Sizing the scheduler gap — MEASURED (llama.cpp aggregate, the surprise)
-
-`llama-batched-bench` Qwen3-32B-Q4_K_M, npp=128 ntg=128, npl scaling (DGX):
-
-| npl | S_PP (agg prefill) | **S_TG (agg decode)** | vLLM decode | llama % of vLLM |
-|---|---|---|---|---|
-| 1 | 628 | 10.2 | 11.8 | 86% |
-| 8 | 773 | 59.8 | - | - |
-| 32 | 763 | **235** | **328** | **72%** |
-| 64 | 761 | **391** | **569** | **69%** |
-| 128 | 762 | **540** | **667** | **81%** |
-
-**The "30x gap" headline is wrong for realistic concurrency.** llama.cpp's continuous batching already
-captures **~70-81% of vLLM's aggregate decode** at npl<=128, with a near-identical multiplier (10.2 -> 540 =
-**53x**, vs vLLM's 56x). And it is still climbing at 128 (391 -> 540 from npl=64; sub-linear but not plateaued). Combined with llama.cpp being
-*ahead* single-stream (MXFP4 1153 > vLLM 800), **llama.cpp is already broadly competitive with vLLM on GB10 at
-self-hosted concurrency.**
-
-Two real findings remain:
-1. **Aggregate prefill is flat ~760** regardless of npl - but that is the **GB10 compute roofline** (vLLM single-
-   stream is ~800; neither can prefill faster aggregate, it is compute-bound). So prefill is **not a throughput
-   gap**; chunked prefill is a **latency/TTFT** win (stop a long prefill stalling the decode batch), not a
-   throughput one.
-2. **vLLM's ~24k headline lives at thousands-of-sequences concurrency**, which **paged KV** unlocks (block KV,
-   no fragmentation). llama.cpp's contiguous KV caps how far npl can scale before memory/fragmentation bite. So
-   paged KV is the **high-concurrency (datacenter) lever**, not a moderate-concurrency one.
-
-## Recommendation
-
-**Pivot to the scheduler; treat the GEMM kernel as good-enough / roofline-blocked on GB10.**
-Now that the gap is measured, ROI-ordered:
-1. **Ship the MXFP4-dense win** — 1153 t/s single-stream beats vLLM's 800; a Blackwell dense-quant
-   recommendation (requantize, no kernel work). Already documented in `BLACKWELL_KERNEL_GAPS.md` §6. Cheapest.
-2. **Chunked prefill** — the tractable scheduler win: interleave prefill chunks with decode so a long prompt
-   doesn't stall the decode batch. Payoff is **latency/TTFT under mixed load** (and steadier decode batches),
-   not aggregate prefill throughput (that's GB10-compute-capped at ~760-800 for both engines). A grpc-server
-   scheduler change; no KV-layout rewrite.
-3. **Paged KV** — the **high-concurrency (thousands-of-seqs) lever** that unlocks vLLM's 24k regime. Heavy
-   (block KV manager; contested upstream PR #22569 / vendored `patches/`). Worth it only if datacenter-scale
-   concurrency is a target; at self-hosted concurrency (npl<=128) llama.cpp is already ~70-81% of vLLM.
-
-**Reframed expectation:** llama.cpp on GB10 is NOT 30x behind vLLM. It is ahead single-stream (MXFP4) and
-~70-81% of vLLM aggregate at npl<=128. The genuine differentiator vLLM still has is **scaling to very high
-concurrency via paged KV**. Kernel tracks (W4A16 178 t/s; FP4-MMA) stay **banked** - not the lever.
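
To make the "block KV, no fragmentation" point in the document above concrete, here is a minimal host-side sketch of a block pool with per-sequence block tables. It is an illustration under stated assumptions, not vLLM's `block_pool.py` nor the vendored paged patch series: `BlockPool`, `reserve`, `release`, and `block_count` are hypothetical names, and prefix sharing, eviction, and device memory are deliberately omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Fixed-size KV blocks handed out from a free list. A sequence owns a list
// of block IDs (its block table) instead of one contiguous KV region, so
// any free block can serve any sequence.
class BlockPool {
public:
    BlockPool(std::size_t n_blocks, std::size_t block_tokens)
        : block_tokens_(block_tokens) {
        assert(block_tokens > 0);
        for (std::size_t i = n_blocks; i > 0; --i) free_.push_back(i - 1);
    }

    // Grow sequence `seq` so it can hold `n_tokens` tokens in total.
    // Returns false and allocates no blocks if the pool cannot cover it.
    bool reserve(int seq, std::size_t n_tokens) {
        std::vector<std::size_t>& table = tables_[seq];
        const std::size_t need = (n_tokens + block_tokens_ - 1) / block_tokens_;
        if (need <= table.size()) return true;
        if (need - table.size() > free_.size()) return false;
        while (table.size() < need) {
            table.push_back(free_.back());
            free_.pop_back();
        }
        return true;
    }

    // Return all blocks of a finished sequence to the free list.
    void release(int seq) {
        auto it = tables_.find(seq);
        if (it == tables_.end()) return;
        for (std::size_t b : it->second) free_.push_back(b);
        tables_.erase(it);
    }

    std::size_t free_blocks() const { return free_.size(); }

    std::size_t block_count(int seq) const {
        auto it = tables_.find(seq);
        return it == tables_.end() ? 0 : it->second.size();
    }

private:
    std::size_t block_tokens_;
    std::vector<std::size_t> free_;
    std::unordered_map<int, std::vector<std::size_t>> tables_;
};
```

Because a released sequence's blocks are immediately reusable by any other sequence, the pool's capacity is bounded by total blocks rather than by the largest contiguous hole, which is the property the document credits for high concurrency.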
--- a/backend/cpp/llama-cpp/paged/VLLM_THROUGHPUT_GAP.md
+++ b/backend/cpp/llama-cpp/paged/VLLM_THROUGHPUT_GAP.md
@@ -1,59 +0,0 @@
-# Where vLLM beats llama.cpp on a DGX Spark (GB10), and how to close it — keeping quality
-
-The question: "vLLM is faster at the end — what do we improve, while keeping good quality?" Answer: the
-gap is **three independent things**, and the biggest *per-user, quality-preserving* one is **speculative
-decoding**, which llama.cpp already supports.
-
-## Decomposition (measured + researched)
-
-| vLLM advantage | helps single user? | llama.cpp answer | quality cost | status |
-|---|---|---|---|---|
-| **Per-user decode speed** | **yes** | **speculative decoding** (Qwen3 draft / EAGLE3) | **none** (target-verified, lossless) | mature in llama.cpp; **the main lever** |
-| Prefill / TTFT | no (it's first-token latency) | tune FP4-MMA / Marlin W4A16 kernel | none | hard; `BLACKWELL_KERNEL_GAPS.md` |
-| Aggregate throughput @ concurrency | no (per-user = 0) | continuous batching (paged engine) | none | also kernel-bound |
-
-Key measured fact: **single-user decode is already at parity** (Qwen3-32B: llama 10.2 vs vLLM 11.7 t/s) —
-both hit GB10's ~273 GB/s bandwidth wall (~15 t/s ceiling) **without** spec-dec. So vLLM's real per-user
-speed edge is spec-dec, not architecture.
-
-## Why spec-dec is THE lever here (and quality-safe)
-
- **Lossless:** the 32B target verifies every drafted token (accept/reject) — output distribution is
-  identical to no-drafting. So you keep **Q4_K_M quality** (no lossy MXFP4 needed) *and* get speed.
- **GB10 is best-case for it:** decode is bandwidth-bound (one ~17 GB weight-read per token) with huge idle
-  compute. Spec-dec verifies K drafted tokens in **one** weight-read → converts the loop to compute-bound,
-  where GB10 has headroom. Realized speedup ≈ mean accepted length.
- **Measured (others, same model class):** llama.cpp Qwen2.5-32B dense + 0.5B draft = **2.9×** (13→38 t/s);
-  vLLM EAGLE3 on Qwen3-32B = ~1.8–2.5× general, up to ~3× code/structured. **Competitive.**
- **Regime caveat:** spec-dec gives **~nothing for MoE-A3B** models (only ~3B active → not bandwidth-bound,
-  nothing to amortize). It shines for **dense** 27–32B — the opposite regime. So this lever is *dense-model*
-  specific.
-
-## Qwen3-32B specifics
-
- **No native MTP head** (MTP is a Qwen3-*Next*/MoE feature). Options: a **same-family draft**
-  (Qwen3-0.6B or **1.7B** — same tokenizer, llama.cpp vocab check passes) or an external **EAGLE3 head**
-  (RedHatAI/AngelSlim Qwen3-32B-eagle3, accept length 2.15–2.49).
- Draft pick: **lean toward Qwen3-1.7B** (0.6B had ~60% lower acceptance in AWS's test; on a bandwidth-bound box the
-  32B weight-read dwarfs the draft cost, so maximize acceptance). `--spec-draft-n-max 5–8`.
-
-## Recommended LocalAI actions (quality-preserving, ranked)
-
-1. **Make speculative decoding easy/recommended for dense ≥14B models on Blackwell** — a draft-model field in
-   the model config (`-md` / `--spec-draft-*`), with a suggested Qwen3-1.7B draft for the Qwen3 family. This
-   is the biggest per-user speed win, lossless, available **now** (no kernel). Gallery: ship target+draft pairs.
-2. Kernel work (FP4-MMA tuning / Marlin W4A16) — improves **prefill/TTFT**, separate metric.
-3. Continuous batching (paged engine) — **aggregate** concurrency only; per-user = 0.
-
-## Honesty / status
-
-The research conclusion is solid (sources below). **Our own empirical spec-dec run on the DGX is pending** —
-the box rebooted mid-session and `llama-cli` now hangs at 0% GPU (while `llama-bench` works), plus the network
-is dropping ssh mid-command. Drafts (Qwen3-0.6B/1.7B Q8) are downloaded and the spec-dec flags are confirmed;
-re-run `llama-cli -m Qwen3-32B-Q4_K_M -md Qwen3-1.7B-Q8_0 -ngl 99 -ngld 99 --spec-draft-n-max 8` when the box
-is stable to confirm the ~2× locally. The conclusion does not depend on it (it's measured-reproducible by
-others on this exact model class), but we should bank our own number.
-
-Sources: llama.cpp Discussion #10466 (Qwen2.5-32B+0.5B = 2.9×), #16578 (DGX Spark), DandinPower/llama.cpp_bench
-(32B = 10.7 t/s, bandwidth-bound); vLLM MTP docs + Red Hat EAGLE3 article (lossless, up to 2.5×); AWS spec-dec
-blog (Qwen3-32B+1.7B up to 3×, 0.6B ~60% lower accept); RedHatAI/AngelSlim Qwen3-32B-eagle3 heads.
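
The two pieces of arithmetic behind the spec-dec argument above (a bandwidth-bound decode ceiling, and "speedup ≈ mean accepted length") can be sketched as below. Both functions are hypothetical helpers. The second uses the common simplifying assumption that each drafted token is accepted independently with the same probability `alpha`; real acceptance is neither independent nor constant, and draft-model cost is ignored.

```cpp
#include <cassert>
#include <cmath>

// Upper bound on batch-1 decode speed when every generated token requires
// one full read of the weights: tokens/s <= bandwidth / bytes per token.
double decode_ceiling_tps(double bandwidth_gb_s, double weights_gb) {
    assert(bandwidth_gb_s > 0.0 && weights_gb > 0.0);
    return bandwidth_gb_s / weights_gb;
}

// Expected tokens emitted per target-model pass with k drafted tokens,
// assuming i.i.d. acceptance with probability alpha:
//   1 + alpha + alpha^2 + ... + alpha^k = (1 - alpha^(k+1)) / (1 - alpha).
// The leading 1 is the token the target model always contributes itself.
double expected_tokens_per_pass(double alpha, int k) {
    assert(alpha >= 0.0 && alpha <= 1.0 && k >= 0);
    if (alpha == 1.0) return k + 1.0;
    return (1.0 - std::pow(alpha, k + 1)) / (1.0 - alpha);
}
```

With the document's figures (about 273 GB/s and a roughly 17 GB weight read per token), `decode_ceiling_tps(273.0, 17.0)` gives about 16 tokens/s, in line with the quoted ceiling; multiplying that by `expected_tokens_per_pass` gives an optimistic bound on spec-dec throughput under the stated assumptions.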
--- a/backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md
@@ -1,176 +0,0 @@
-# W4A16 Marlin-style GEMM for ggml-cuda on Blackwell (sm_120/121) — implementation plan
-
-> **STOPPED (2026-06-21): the kernel is NOT the lever — validated by a code-grounded vLLM analysis.**
-> Measured on the DGX: vLLM's single-stream W4A16 prefill on GB10 = **~800 t/s (~52 TFLOPS), statistically TIED
-> with llama.cpp MMQ (718/47)** — and vLLM uses the *exact* XOR-swizzle + 4-stage cp.async Marlin we proved
-> collapses GB10 occupancy (vLLM even warns at load that Marlin "may degrade performance for compute-heavy
-> workloads"). There is no kernel trick to port. Moreover llama.cpp's **MXFP4 path (1153 t/s) already BEATS
-> vLLM single-stream (800)** — vLLM has no FP4 cubins on sm_121 and falls back to slower W4A16 Marlin, so
-> llama.cpp is *ahead* on the kernel. **vLLM's entire 24k headline is the aggregate decode multiplier (~56×)
-> from paged KV + chunked prefill + continuous batching — a SCHEDULER win.** llama.cpp lacks paged KV +
-> chunked prefill. **Effort pivots to the scheduler** (see the paged-attention work). This kernel work is
-> banked + resumable (178 t/s, P0/P1/P2/P3/P3b committed) but is not the throughput lever on GB10. Detail:
-> `VLLM_DECOMPOSITION.md`.
-
-The committed multi-week kernel. Goal: get 4-bit-weight dense matmul to the GB10 **BF16 ceiling (~213
-TFLOP/s ≈ ~3,300 t/s prefill on Qwen3-32B)**, ~4.3× over today's 765. This is the *match-vLLM* path; vLLM's
-own GB10 dense throughput runs on W4A16 Marlin (its FP4 path is broken on sm_121).
-
-## Why a custom kernel (validated, not assumed)
-
-On GB10 (sm_121), measured: **both** llama-MMQ (int8, Ampere-tuned) **and** cuBLAS-FP16 sit at ~46 TFLOP/s
-(~21% of peak). cuBLAS falls back to an Ampere `cutlass_80_tensorop` kernel (CUDA-13 has no sm_121 GEMM for
-these shapes); rebuilt with `-DGGML_CUDA_FORCE_CUBLAS=ON` it's *slower* than MMQ (690 vs 750). **No library
-path reaches the ceiling on consumer Blackwell** — a hand-tuned sm_120a kernel is required. `mmapeak` measures
-the 213 BF16 peak as reachable, and vLLM's Marlin hits it, so the ceiling is real; the work is reaching it.
-
-## What Marlin does (the design we mirror)
-
-Weights stored 4-bit, **dequantized in-register/shared-mem** in-flight; GEMM math on **FP16/BF16 tensor
-cores** (`mma.sync m16n8k16`). Speed comes from: `cp.async` global→shared with a **multi-stage double-buffered
-pipeline**, **offline weight reshuffle** into the MMA-friendly layout, activations kept resident in registers,
-and **Stream-K** partitioning. Sources: IST-DASLab/marlin, arXiv 2408.11743, vLLM machete (Hopper successor).
-
-## Phases (each ends with: numerical parity vs MMQ + a prefill benchmark)
-
-### P0 — Harness + baseline — DONE
- **Correctness gate (GREEN):** `test-backend-ops test -o MUL_MAT -b CUDA0` → **1103/1103 passed** (CUDA vs CPU
-  reference, covers Q4_0/Q4_K at the real FFN shapes m=4096,k=14336,n=1..512). This is *the* parity check the
-  W4A16 kernel must keep green at every phase — it tests the CUDA MUL_MAT path the kernel will hook. The
-  `not supported` lines are `type_b=f16` combos (irrelevant; prefill uses f32 activations).
- **Perf baseline:** `llama-bench` dense Q4_K prefill = **~750 t/s (pp512 718 / pp2048 750) ≈ 46 TFLOP/s ≈ 21%
-  of the 213 BF16 ceiling**. The kernel must beat this toward ~3,300. (`test-backend-ops perf -o MUL_MAT` gives
-  per-shape GFLOPS too; build it once with the harness.)
- **Op-level baseline (the canonical kernel target), `test-backend-ops perf -o MUL_MAT`, m=4096 k=14336 (FFN):**
-  | n (tokens) | q4_0 | q4_K | regime |
-  |---|---|---|---|
-  | 1 | 817 GFLOPS | 761 GFLOPS | decode / mat-vec (memory-bound) |
-  | 8 | 5.77 TFLOPS | 4.11 TFLOPS | small-batch |
-  | **512** | **49.5 TFLOPS** | **47.1 TFLOPS** | **prefill GEMM — ~22% of the 213 ceiling** |
-
-  So the prefill GEMM target: lift q4_K n=512 from **47 → toward ~213 TFLOPS** (~4.5×). This per-shape number
-  is cleaner than end-to-end for kernel iteration.
- **Harness script:** `~/p0harness.sh` on the DGX (build test-backend-ops + correctness + perf). Reusable each
-  phase: `test-backend-ops test -o MUL_MAT -b CUDA0` must stay 1103/1103; the q4_K n=512 perf must climb from 47.
- test-backend-ops needed `-DLLAMA_BUILD_TESTS=ON`; now built in `~/llama.cpp-pr24423/build`.
-
-### P1 — Dispatch seam (no behavior change) — DONE
- `marlin-w4a16.{cuh,cu}` + a gated hook in `ggml_cuda_mul_mat` (dense, non-ids path), behind
-  `GGML_CUDA_W4A16` + sm_120/121 (`cc >= GGML_CUDA_CC_BLACKWELL`) + type∈{Q4_0,Q4_K} + f32 activations.
-  Returns false → falls back to MMQ. Source + apply instructions: `kernel/w4a16/` (`HOOK.md`).
- **Verified on GB10:** clean build; `test-backend-ops MUL_MAT` = **1103/1103** (byte-identical default);
-  `llama-bench` dense Q4 pp512 unchanged (717.77 default / 718.26 with flag); `GGML_CUDA_W4A16=1` reaches the
-  seam (stderr `[w4a16] ... P1 seam - using MMQ`) and falls back. The empty frame P2/P3 fills.
-
-### P2 — Correctness-first kernel (slow OK) — DONE
- **Kernel:** `marlin-w4a16.cu` replaces the P1 TODO with a real W4A16 GEMM. In-kernel dequant Q4→BF16 into
-  shared mem, `mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32` via ggml's `mma.cuh` tile abstractions
-  (`tile<16,8,nv_bfloat162>` A, `tile<8,8,nv_bfloat162>` B, `tile<16,8,float>` C), F32 accumulate, F32 write.
-  One warp per 16(M)x8(N) output tile, K looped in steps of 16. Both src0 (weights, row m) and src1 (acts,
-  row n) are row-major `[row][k]`, so A and B load symmetrically via `load_generic`; the mma does the dot over k.
- **Types handled:** Q4_0 and Q4_K. Q4_0 dequant `w=d*(q-8)` inline; Q4_K via the superblock decode mirrored
-  from `convert.cu` (`get_scale_min_k4`, 8x32 sub-blocks, `d*q-m`).
- **Shape classes handled:** contiguous 2D GEMM (the prefill path), `ne2==ne3==1`, f32 activations, K%16==0
-  (always true: Q4_0 K%32, Q4_K K%256). **Falls back to MMQ (returns false)** for batched (bs!=[1,1]),
-  broadcast (nr!=[1,1]), permuted / non-contiguous (per!=[0,1,2,3]), and any non-f32 activation (e.g. f16) -
-  keeps the gate green. M / N boundaries are zero-padded in-kernel (handles M not %16, N not %8).
- **Parity (the gate):** `GGML_CUDA_W4A16=1 test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103 passed**
-  (the Q4_0/Q4_K f32 contiguous shapes run the kernel and match the CPU reference; batched/permuted/f16 fall
-  back). Default (flag-unset) build still **1103/1103** (byte-identical, seam returns false).
- **Model sanity / P2 perf:** `GGML_CUDA_W4A16=1 llama-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -p 512 -n 16
-  -ub 2048` runs clean: **pp512 = 31.75 t/s**, tg16 = 6.28 t/s. Slow as expected (naive 1-warp/tile, weights
-  re-dequantized per n-tile, no pipeline) - this is the correctness checkpoint; P3 brings the speedup. The real
-  Q4_K model matmul path engages the kernel without error.
-
-### P3 — The Marlin pipeline (the speedup) — STEP 1 + SKEW-PAD/TILING LANDED; PREPACK + PIPELINE + STREAM-K DEFERRED
-Goal: `cp.async` double/triple-buffered global->shared; offline weight reshuffle (a one-time repack of the Q4
-tensor into the mma+pipeline layout); register-resident activation tiles; Stream-K split for the prefill M.
-Target: >=150 TFLOP/s (>=~2,300 t/s), then ~213. **MMQ baseline to beat: 47.1 TFLOPS (q4_K n=512) / pp512 718.**
-
-**Kernel structure now (committed P3b):** block-tiled multi-warp GEMM with a CONFLICT-FREE shared feed via skew
-padding. `blockDim=(32, WM*WN)` so `threadIdx.x` is the warp lane (required by `mma.cuh` get_i/get_j) and
-`threadIdx.y` is the warp index; the original 1-warp P2 launch put 128 threads on `threadIdx.x` and exploded
-`get_j` into an out-of-bounds shared read (found via compute-sanitizer). `WM*WN` warps compute a
-`BM(=WM*FM*16) x BN(=WN*FN*8)` output tile; each warp owns an `FM x FN` grid of m16n8k16 mma fragments
-accumulated in F32. Per k-step (16-deep): all warps cooperatively dequant the `BM x 16` Q4 weight strip + load
-the `BN x 16` f32->bf16 activation strip into shared, one `__syncthreads`, then `ldmatrix.x4` (A) / `ldmatrix.x2`
-(B) fragments + `FM*FN` mmas. The shared rows hold 8 bf162 of data but are stored at a PADDED stride of 12 bf162
-(`W4A16_SPAD`): ldmatrix's per-lane address is `row*stride`, and the natural stride 8 (a divisor of the
-32-bank / 128-byte cycle) puts rows r and r+4 on the same banks, a 2-way bank conflict in every 8-row ldmatrix phase; skewing to 12 (a multiple of
-4 bf162, so each row still starts on ldmatrix's required 16-byte boundary) makes `{r*12 mod 32}` hit 8 distinct bank-quads for r in 0..7, so both
-halves of ldmatrix are conflict-free at only +50% on the small staged tile (~12 KB at the shipping tile).
-Shipping config `WM=4,WN=4,FM=2,FN=4` -> `BM=128, BN=128`, 16 warps, 8 m16n8 C-tiles per warp (keeping
-register pressure low is what lets BN grow without an occupancy cliff). M/N tails zero-padded in-kernel; still
-gated to contiguous 2D Q4_0/Q4_K f32 prefill, else falls back to MMQ.
-
-**Per-step results (q4_K n=512 via `test-backend-ops perf`; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M):**
-
-| step | q4_K n=512 | q4_0 n=512 | pp512 | pp2048 | vs MMQ 47 / 718 | notes |
-|---|---|---|---|---|---|---|
-| P2 (1 warp/tile) | ~2 TFLOPS | - | 31.75 | - | 0.04x | correctness checkpoint |
-| Step 1: block tiling (load_generic, BM64/4w) | 6.63 (cold) | 7.53 | 119 | 123 | 0.14x | original committed kernel |
-| P3b-1: skew-pad ldmatrix + BM128/8w | 8.50 (cold) | 10.56 | 148.5 | 153.9 | 0.18x | +28% q4_K, +40% q4_0 over step 1 |
-| **P3b-2: + BN128/16w (current)** | **9.92 (cold)** | **11.68** | **177.6** | **185.0** | **0.21x** | +17% q4_K, +20% pp512 over P3b-1 (+49% pp512 over step 1) |
-
-Parity gate **1103/1103** at every step, flag set and unset (byte-identical when unset). All P3b numbers above
-are from thermally-bracketed cold A/B sessions (the committed kernel measured immediately before AND after each candidate,
-near-identical both times -> the deltas are real, not thermal). P3b-1 cold A/B (q4_K/q4_0 TFLOPS): 6.63/7.53 vs 8.52/10.49. P3b-2 cold
-A/B (q4_0/q4_K TFLOPS): BN64/8w 10.56/8.50 then 10.51/8.45 (bracket) vs BN128/16w 11.68/9.92.
-
-**What landed / what was tried (honest):**
- **P3b - LANDED (committed).** Two combined changes lift the prior committed kernel: (1) **skew-pad
-  conflict-free ldmatrix** (shared row stride 8->12 bf162; makes `ldmatrix.x4`/`.x2` bank-conflict-free at near
-  zero occupancy cost) and (2) **bigger tile / more warps** (`BM=128, BN=64`, 8 warps). Cold A/B: q4_K
-  6.63->8.52 (+28%), q4_0 7.53->10.49 (+40%), pp512 119->148.5 (+25%). **Still ~5.5x under MMQ (47) per-op and
-  ~4.8x under pp512 718 - does NOT beat MMQ.** This is forward progress, not the finish line.
- **The XOR-swizzle-FIRST plan was tested and is WRONG for this GPU - documented so it is not re-tried.** A
-  wide-row (BK=64, 128-byte rows) XOR swizzle `seg ^ (row&7)` IS conflict-free, but the 16 KB shared it needs
-  collapsed occupancy and dropped q4_K n=512 to **2.84 TFLOPS** (worse than the unswizzled 6.63) - the same
-  occupancy cliff P3 hit with a 32 KB pipeline. The conflict-free feed must be bought WITHOUT widening shared:
-  skew padding (above) does exactly that (6 KB), which is why it is the committed form. Lesson: on GB10 occupancy
-  dominates bank-conflict latency; never trade occupancy for a conflict-free layout.
- **Conflict-free feed alone did NOT beat the unswizzled kernel - the limiter moved.** At the SAME BM64/4w tile,
-  skew-pad ldmatrix (6.70) ~= load_generic (6.63): removing bank conflicts bought ~nothing. The win came only
-  when the tile grew (BM128/8w). A 5-config tile sweep then split the two quant types:
-  - **q4_0 SCALES with warps/tiles** (7.7 -> 10.5 -> **15.8 TFLOPS at BM128/16w**): feed/global-traffic bound,
-    helped by cutting redundant activation re-reads (more BM = fewer M-blocks each re-reading the act column).
-  - **q4_K is largely DEQUANT-COMPUTE bound** (the BM64/16w tile gives q4_0=15.8 but q4_K=6.8 - they diverge
-    hard). This **refines P3's "within 12%" finding**: that held only in the low-throughput memory-bound regime;
-    once the feed is unblocked, q4_K's per-element 6-bit superblock decode (`get_scale_min_k4` + superblock
-    indexing, redone every k-step AND re-done by every N-block) becomes the wall. BM256 regressed both (too few
-    blocks / register pressure).
- **Growing BN partly relieves the q4_K dequant wall (P3b-2).** Because every N-block re-decodes the same
-  weight strip, halving the N-block count (BN 64->128) halves that redundant q4_K decode - but only when BN is
-  spread across MORE WARPS (16w, 8 C-tiles/warp), not more fragments-per-warp: the FN=8 / FM=4 variants (16
-  C-tiles/warp) regressed to ~6.6 on register pressure, while WM=4,WN=4,FM=2,FN=4 (16w, 8 tiles/warp) lifted
-  q4_K 8.5->9.9 and q4_0 10.6->11.7 cold. BN=256 was no better and costs more shared. **BN128/16w is the
-  shipping tile.**
- **Next blocker (the remaining q4_K unlock) = offline prepack.** BN growth only divides the redundant decode by
-  the N-block count; it cannot remove the per-k-step decode itself. The full fix is the **one-time offline
-  repack** - decode the Q4 tensor ONCE into a cached device buffer keyed off the tensor data pointer, in a layout
-  with the scale/min pre-applied (store reshuffled 4-bit + per-subblock bf16 d,m, ~1.25x the q4 size, NOT a full
-  bf16 blow-up which would be ~4x), so the in-kernel path becomes a cheap `q*d - m` with coalesced loads. Then
-  `cp.async` multi-stage (sized to NOT widen shared past the occupancy cliff) and **Stream-K** over M. These
-  remain the multi-week core; **prepack is the highest-value next step for q4_K specifically** (it should let
-  q4_K join q4_0 on the feed-bound scaling curve instead of plateauing at ~10).
- **Methodology note (unchanged):** the box thermally throttles under sustained perf+bench runs (identical code
-  ~8.8 cold vs ~6.6 hot earlier), so only same-session A/Bs are trustworthy. The P3b deltas above were taken in
-  one bracketed cold session for exactly this reason.
-
-### P4 — Tune
- Tile (mmq_x/y analogues), warps, pipeline depth, occupancy. We have nsys (throughput) but **not ncu** on the
-  DGX — tuning is empirical (sweep configs, measure t/s). Note ncu would need sudo/driver perms we lack.
-
-### P5 — Enable
- Default on for sm_120/121 + Q4_0/Q4_K dense when parity holds + faster; keep the flag as an escape hatch.
-  Ship as a LocalAI llama.cpp patch (the patches/ series) and/or upstream (ggml has no Marlin-equivalent —
-  issue #1519 — so it's net-new upstream value; float it with maintainers first).
-
-## Risks / notes
- **Multi-week, expert-CUDA, DGX-only** (GB10 is the only sm_121). The session's network flakiness +
-  `llama-cli` hang make `llama-bench`/`test-backend-ops` the reliable verification tools (both work).
- Quantization correctness: Q4_K's superblock structure (256-elem, 6-bit scales) is more complex to dequant
-  in-kernel than Q4_0; consider landing Q4_0 first, then Q4_K.
- **Beat-path follow-on:** the FP4-MMA path (`mul_mat_q<MXFP4>`, ~5% of FP4 peak), if tuned/fixed on sm_121, is projected to reach
-  ~6,600 t/s (2× the BF16 ceiling). Separate track; this W4A16 kernel is the match-path foundation.
- Reuse ggml's `mma.cuh` tile abstractions (MMQ already uses them) rather than raw PTX where possible.
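
The skew-padding argument and the tile arithmetic from the P3 section of the plan above can be checked with a small host-side model. This is a simplified sketch, assuming 32 banks of 4 bytes, one bf162 per bank word, and that an ldmatrix phase reads 8 rows of 16 bytes each; the function names are hypothetical and nothing here is taken from `marlin-w4a16.cu`.

```cpp
#include <array>
#include <cassert>

constexpr int kBanks = 32;    // shared-memory banks, 4 bytes wide each
constexpr int kRowWords = 4;  // one ldmatrix row = 16 bytes = 4 bank words

// Worst-case number of rows landing on the same bank when 8 consecutive
// rows are read with the given row stride (in 32-bit words, i.e. bf162).
// 1 means conflict-free; 2 means a 2-way bank conflict.
int max_rows_per_bank(int stride_words) {
    assert(stride_words > 0);
    std::array<int, kBanks> hits{};
    for (int r = 0; r < 8; ++r) {
        for (int w = 0; w < kRowWords; ++w) {
            ++hits[(r * stride_words + w) % kBanks];
        }
    }
    int worst = 0;
    for (int h : hits) worst = h > worst ? h : worst;
    return worst;
}

// Output tile dimensions from the warp grid (WM x WN) and the per-warp
// fragment grid (FM x FN) of m16n8 fragments, as defined in the plan.
constexpr int block_m(int wm, int fm) { return wm * fm * 16; }
constexpr int block_n(int wn, int fn) { return wn * fn * 8; }

// Shared-memory bytes for one staged k-step: bm weight rows plus bn
// activation rows, each stored at `stride_words` 4-byte words.
constexpr int staged_bytes(int bm, int bn, int stride_words) {
    return (bm + bn) * stride_words * 4;
}
```

Under this model the natural stride of 8 yields `max_rows_per_bank(8) == 2`, while the padded stride of 12 yields 1, and the shipping `WM=4,WN=4,FM=2,FN=4` tile stages 12288 bytes, matching the ~12 KB quoted in the plan.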
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/HOOK.md
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/HOOK.md
@@ -1,31 +0,0 @@
-# W4A16 seam — how to apply to a llama.cpp / ggml-cuda checkout
-
-Two source files + two one-line edits to `ggml/src/ggml-cuda/ggml-cuda.cu`. The build picks up the
-new `.cu` via the existing `file(GLOB)` after a `cmake -S . -B build` reconfigure (no CMakeLists edit).
-
-## Files (copy into `ggml/src/ggml-cuda/`)
- `marlin-w4a16.cuh`
- `marlin-w4a16.cu`
-
-## Edit `ggml/src/ggml-cuda/ggml-cuda.cu`
-
-1. **Include** — after the existing `#include "ggml-cuda/fp4-grouped-moe.cuh"` (sibling-header style):
-   ```cpp
-   #include "ggml-cuda/marlin-w4a16.cuh"
-   ```
-
-2. **Dispatch hook** — immediately before the dense dispatch chain, i.e. before
-   `if (!split && use_mul_mat_vec_f) {` in `ggml_cuda_mul_mat(...)` (after `const int cc = ...`):
-   ```cpp
-   if (!split && ggml_cuda_w4a16_mul_mat(ctx, src0, src1, dst)) { return; }
-   ```
-
-## Verify (P1 acceptance — met)
- `cmake --build build --target test-backend-ops llama-bench` → builds clean.
- `test-backend-ops test -o MUL_MAT -b CUDA0` → **1103/1103** (byte-identical default).
- `llama-bench` dense Q4 pp512 → unchanged (~718, MMQ).
- `GGML_CUDA_W4A16=1 llama-bench` → unchanged + stderr `[w4a16] ... P1 seam - using MMQ` (seam reached,
-  gating passes on sm_121, falls back).
-
-The kernel body (P2 correctness → P3 Marlin pipeline) replaces the `TODO(P2/P3)` block in `marlin-w4a16.cu`
-and returns `true` once parity holds.
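
The "return false to fall back to MMQ" contract described in the hook notes above can be sketched as a standalone predicate. This is an illustration only: `MulMatDesc`, `WeightType`, `ActType`, and `w4a16_accepts` are hypothetical stand-ins rather than ggml types, the compute-capability encoding (`major*10 + minor`) is an assumption made for the sketch, and treating any set value of `GGML_CUDA_W4A16` as "enabled" is also an assumption.

```cpp
#include <cassert>
#include <cstdlib>

enum class WeightType { Q4_0, Q4_K, Q8_0, F16 };
enum class ActType { F32, F16 };

// Minimal description of one mul_mat call (hypothetical; the real hook
// inspects ggml tensors instead).
struct MulMatDesc {
    WeightType weight;
    ActType act;
    int cc;              // assumed encoding: major*10 + minor, e.g. 121
    bool contiguous_2d;  // no batch dims, no broadcast, no permutation
};

// Opt-in flag: assumed here to be "set to anything" in the environment.
bool w4a16_enabled() { return std::getenv("GGML_CUDA_W4A16") != nullptr; }

// True only for shapes the kernel is known to handle correctly. Every
// other case returns false so the caller continues to the default path.
bool w4a16_accepts(const MulMatDesc& d, bool flag_set) {
    if (!flag_set) return false;                 // default build is untouched
    if (d.cc < 120) return false;                // Blackwell-class only
    if (d.weight != WeightType::Q4_0 && d.weight != WeightType::Q4_K) return false;
    if (d.act != ActType::F32) return false;     // f16 activations fall back
    if (!d.contiguous_2d) return false;          // batched / permuted fall back
    return true;
}
```

A call site would use it as `if (w4a16_accepts(desc, w4a16_enabled())) { /* run kernel */ }`. Keeping the predicate conservative is what lets the parity gate stay green while coverage grows shape by shape.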
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/SUBAGENT_BRIEFS.md
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/SUBAGENT_BRIEFS.md
@@ -1,66 +0,0 @@
-# W4A16 kernel - subagent dispatch briefs (P3, P4, P5)
-
-**Dispatch strategy.** Each phase = one fresh **Opus-4.8** subagent handed a complete zero-context brief.
-Phases are **sequential** (P3 needs P2's correct kernel; P4 needs P3's pipeline; P5 needs P4's tuned kernel),
-so dispatch phase N+1 only after phase N's commit lands, and before dispatching, splice phase N's *actual*
-deliverable (final kernel shape, configs, fallback set) into the next brief. P2's brief (already dispatched)
-is the template; reuse the COMMON section below verbatim in every dispatch.
-
---
-
-## COMMON (paste into every phase brief)
-
- **Kernel dev is on the remote DGX** (GB10, sm_121): `ssh -o ConnectTimeout=25 -o ServerAliveInterval=10 -o ServerAliveCountMax=10 dgx.casa '<cmd>'`. Network is FLAKY (re-poll on drop; nohup jobs survive). `llama-cli` HANGS - never use it. Only `llama-bench` + `test-backend-ops` work.
- Checkout `~/llama.cpp-pr24423`, build `~/llama.cpp-pr24423/build` (sm_121, `-DLLAMA_BUILD_TESTS=ON`). Kernel file `ggml/src/ggml-cuda/marlin-w4a16.cu`. Build auto-GLOBs it; no CMakeLists edits. Hook already in `ggml-cuda.cu`, gated behind env `GGML_CUDA_W4A16`.
- Dense test model: `~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
- **Builds run detached + poll** (never blocking foreground): write a `~/pN.sh` that builds `--target test-backend-ops llama-bench`, echoes `RC=$?`, runs the gate, echoes `PN_DONE`; `nohup` it; poll `for i in $(seq 1 90); do grep -q PN_DONE ~/pN.out && break; sleep 20; done; tail ~/pN.out`.
- **GPU hygiene:** check `docker ps | grep local-ai` + `nvidia-smi`; `docker stop` a running localai worker if present (authorized); never pkill native procs; never start model servers.
- **Parity gate (must stay green every step):** `GGML_CUDA_W4A16=1 CUDA_VISIBLE_DEVICES=0 ./build/bin/test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103**; and flag-unset stays 1103/1103 (byte-identical). A wrong result is worse than a fallback - return false for any shape you can't do correctly.
- **Perf measurement:** `test-backend-ops perf -o MUL_MAT -b CUDA0` (per-shape GFLOPS; the canonical target is q4_K m=4096 k=14336 **n=512**, baseline **47.1 TFLOPS**, ceiling ~213) + `llama-bench -m <model> -ngl 99 -p 512,2048 -n 0 -ub 2048` (baseline pp512 ~718).
- **LocalAI repo (commit here; you do NOT inherit cwd - `cd` explicitly):** `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`. Plan: `backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md`. Source mirror: `backend/cpp/llama-cpp/paged/kernel/w4a16/`. After a phase passes: fetch the final `marlin-w4a16.cu` from the DGX (`ssh ... 'cat ...'`), overwrite the mirror, update the plan (mark the phase DONE with numbers), `git commit -s` (DCO sign-off; user is Ettore Di Giacinto <mudler@localai.io>). **No `Co-Authored-By`. No em-dashes anywhere. Trailer `Assisted-by: Claude:opus-4.8 [Claude Code]`. Do NOT push.**
- Final message = the result (gate ?/1103, the perf delta, blockers + resolutions, commit hash). A precise partial result beats a vague success claim.
-
---
-
-## P3 brief - the Marlin pipeline (the speedup)
-
-**Goal.** Take P2's correct-but-slow kernel (~2 TFLOPS) past the MMQ baseline (47.1 TFLOPS) and toward ~150+ TFLOPS (then ~213) on the q4_K n=512 prefill GEMM, **without ever breaking parity**. This is the Marlin design: the math is the same BF16 mma; the speed comes from feeding the tensor cores without stalling.
-
-**Implement, incrementally (re-run the parity gate after each):**
-1. **`cp.async` multi-stage pipeline** - double/triple-buffer global->shared loads of both the Q4 weight tiles and the activation tiles so dequant+mma on stage k overlaps the load of stage k+1. (Study `mma.cuh` + how `mmq.cu`/`mmf.cu` stage shared memory; ggml already uses `cp.async`/`__pipeline_*`.)
-2. **Offline weight reshuffle** - repack the Q4 weights once into the mma+pipeline-friendly layout (Marlin's interleave) so loads are coalesced and the mma fragment maps directly. Do this as a load-time transform of src0 (a new prepacked buffer keyed off the tensor) - NOT per-call. Document where the repack lives + its memory cost.
-3. **Register-resident activation tiles + Stream-K** split of the M dimension across blocks for the prefill (large-M) case so all SMs stay busy.
-
-**Acceptance.** Parity gate stays **1103/1103** at every commit; `test-backend-ops perf` q4_K n=512 climbs materially above 47 TFLOPS (target >=150) and `llama-bench` pp512 climbs above ~718. Report the TFLOPS + t/s after each of the 3 steps so the contribution of each is visible. If a step regresses parity, revert it and report why.
-
-**Reference.** IST-DASLab/marlin (github), arXiv 2408.11743, vLLM machete. Mirror `mmf.cu`'s BF16 GEMM structure; Marlin = that + Q4 dequant-on-load + the pipeline/reshuffle.
-
-**Splice before dispatch:** P2's final kernel structure (tile sizes, which types/shapes it handles vs falls back, helper functions it defined).
-
---
-
-## P4 brief - tune to the ceiling
-
-**Goal.** Drive the P3 kernel as close to the ~213 TFLOPS ceiling as empirical tuning allows. **No `ncu` on this box** (no driver perms) - tune by throughput: `test-backend-ops perf` + `llama-bench` + `nsys` (throughput only).
-
-**Do.** Parametrize the kernel (template params / constants) over: tile M/N/K, warps per block, pipeline depth (stages), and occupancy (regs, shared-mem budget). Sweep systematically (a script that rebuilds + benches each config, logs q4_K n=512 TFLOPS + pp512/pp2048 t/s), pick the best, hard-set it (with a short comment on the sweep). Check both prefill shapes (n=512 and n=2048) and confirm decode (n=1) didn't regress (it should still route to mat-vec, not this kernel - verify the gating).
-
-**Acceptance.** Best config maximizes q4_K n=512 TFLOPS (stretch ~150-213) with parity **1103/1103** intact; the sweep table (config -> TFLOPS/t-s) is recorded in the plan's P4 section. Report the chosen config + the final pp512/pp2048 t/s vs the 718/750 baseline and vs vLLM's ~3300 single-stream target.
-
-**Splice before dispatch:** P3's pipeline structure + the perf it reached + which knobs are already fixed vs free.
-
---
-
-## P5 brief - enable + package + (maybe) upstream
-
-**Goal.** Make W4A16 the default dense-Q4 path on Blackwell and ship it through LocalAI.
-
-**Do.**
-1. **Flip the gate:** default-ON for sm_120/121 + Q4_0/Q4_K dense when faster, keep an opt-out env (e.g. `GGML_CUDA_W4A16=0`) as an escape hatch. The existing return-false-on-unhandled-shape path is the correctness safety net; keep it. Verify the default (no env) build now runs W4A16 for dense Q4, gate green, faster than the old MMQ baseline.
-2. **Package as a LocalAI llama.cpp patch:** produce `backend/cpp/llama-cpp/paged/patches/kernel/0002-w4a16-marlin.patch` (the new files + the `ggml-cuda.cu` hook + the gate flip) that applies cleanly to the pinned llama.cpp, mirroring the existing `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`. Confirm LocalAI's `make backends/llama-cpp` build path can consume it (read `.agents/llama-cpp-backend.md` + the build memory: `make -C backend/cpp/llama-cpp clean` before rebuilds).
-3. **Docs:** update `BLACKWELL_KERNEL_GAPS.md` + the plan with the shipped result; add a short note to the LocalAI docs if there's a Blackwell/performance page.
-4. **Upstream decision (do NOT open without surfacing first):** ggml has no Marlin-equivalent (issue #1519) so this is net-new upstream value. Draft (do not submit) an upstream PR description + note the sm_121 build-flag caveats; report it for the user to decide.
-
-**Acceptance.** Default Blackwell build uses W4A16 for dense Q4, parity 1103/1103, measurably faster than MMQ; the patch applies + the LocalAI llama-cpp backend builds with it (verify or, if the full backend build is too heavy, document the exact build command + that the patch applies cleanly). Report the end-to-end LocalAI dense-Q4 prefill number vs the start-of-project 765 t/s.
-
-**Splice before dispatch:** P4's final kernel + config + the measured ceiling reached; the exact enable condition decided.
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cu
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cu
@@ -1,258 +0,0 @@
-#include "marlin-w4a16.cuh"
-#include "mma.cuh"
-
-#include <cstdio>
-#include <cstdlib>
-#include <cuda_bf16.h>
-
-// W4A16 Marlin-style GEMM.
-//
-// In-kernel dequantize Q4 weights -> BF16, multiply against BF16-converted F32
-// activations using mma.sync m16n8k16 BF16 tensor-core ops, accumulate in F32,
-// write F32 output. Handles only the contiguous 2D GEMM (prefill) case for
-// Q4_0 / Q4_K; everything else returns false and falls back to MMQ.
-//
-// ggml MUL_MAT convention: dst[m,n] = sum_k src0[k,m] * src1[k,n].
-//   src0 (weights): ne0=K (contiguous), ne1=M  -> row m is K contiguous quants.
-//   src1 (acts,f32): ne0=K (contiguous), ne1=N -> row n is K contiguous floats.
-//   dst  (f32):      ne0=M (contiguous), ne1=N -> element (m,n) at m + n*M.
-// Both operands are row-major [row][k]; m16n8k16 computes C[m,n] += sum_k A[m,k]*B[n,k].
-//
-// Thread layout: blockDim = (32, WM*WN). threadIdx.x is the warp lane (0..31,
-// required by mma.cuh get_i/get_j), threadIdx.y is the warp index.
-//
-// P3b step 1 - conflict-free shared layout via SKEW PADDING:
-//  - WM*WN warps compute a BM(=WM*FM*16) x BN(=WN*FN*8) output tile; each warp
-//    owns an FM x FN grid of m16n8k16 mma fragments accumulated in F32.
-//  - Per 16-deep k-step the warps cooperatively dequant the BM x 16 Q4 weight
-//    strip + load the BN x 16 f32->bf16 activation strip into shared, then feed
-//    the tensor cores with ldmatrix.x4 (A) / ldmatrix.x2 (B).
-//  - The shared rows are PADDED to SPAD(=12) bf162 instead of the natural 8.
-//    ldmatrix's per-lane address is row*stride; with the natural stride 8 (a
-//    divisor of the 32-bank / 128-byte cycle) rows 0,4,8,12 collide -> 2-way
-//    bank conflict on every fragment load (this is why P3 measured a plain
-//    ldmatrix swap as neutral). Skewing the stride to 12 (4-byte aligned, so
-//    ldmatrix's 16-byte alignment holds) makes {r*12 mod 32} hit 8 distinct
-//    bank-quads for r in 0..7, so both halves of ldmatrix.x4 and ldmatrix.x2 are
-//    conflict-free. The pad costs only +50% on the small (~4 KB) staged tile, so
-//    unlike a 128-byte-row XOR swizzle it does NOT collapse occupancy on GB10
-//    (a wide-row swizzle pushed shared to 16 KB and dropped this to ~2.8 TFLOPS).
-//
-// Dead-ends already proven (do not re-try): a double-buffered KSTAGE=64 cp.async
-// pipeline collapsed occupancy (32 KB shared -> 2.7 TFLOPS); a plain ldmatrix on
-// the UNpadded layout was neutral (bank conflicts); a wide-row (BK=64) XOR swizzle
-// was conflict-free but occupancy-starved (16 KB shared -> 2.8 TFLOPS). Skew
-// padding gets the conflict-free feed at near-zero occupancy cost.
-
-using namespace ggml_cuda_mma;
-
-typedef tile<16, 8, nv_bfloat162> tile_A; // 16(M) x 16(K)
-typedef tile< 8, 8, nv_bfloat162> tile_B; //  8(N) x 16(K)
-typedef tile<16, 8, float>        tile_C; // 16(M) x  8(N)
-
-// bf162 columns actually live per shared row (16 k-values = 8 bf162) ...
-#define W4A16_KP   8
-// ... padded to this stride to bank-skew the ldmatrix row addresses.
-#define W4A16_SPAD 12
-
-static bool w4a16_enabled() {
-    static const bool en = (std::getenv("GGML_CUDA_W4A16") != nullptr);
-    return en;
-}
-
-// 6-bit packed scale/min decode for Q4_K (mirrors convert.cu get_scale_min_k4).
-static __device__ __forceinline__ void w4a16_scale_min_k4(int j, const uint8_t * q, uint8_t & d, uint8_t & m) {
-    if (j < 4) {
-        d = q[j] & 63; m = q[j + 4] & 63;
-    } else {
-        d = (q[j+4] & 0xF) | ((q[j-4] >> 6) << 4);
-        m = (q[j+4] >>  4) | ((q[j-0] >> 6) << 4);
-    }
-}
-
-// Dequantize a single Q4_0 weight at column k of a row.
-static __device__ __forceinline__ float w4a16_dq_q4_0(const char * row, int k) {
-    const block_q4_0 * blk = (const block_q4_0 *) row + (k / QK4_0);
-    const int j = k % QK4_0;
-    const float d = __half2float(blk->d);
-    const int q = (j < QK4_0/2) ? (blk->qs[j] & 0xF) : (blk->qs[j - QK4_0/2] >> 4);
-    return (q - 8) * d;
-}
-
-// Dequantize a single Q4_K weight at column k of a row.
-static __device__ __forceinline__ float w4a16_dq_q4_K(const char * row, int k) {
-    const block_q4_K * blk = (const block_q4_K *) row + (k / QK_K);
-    const int e = k % QK_K;
-    const int il     = e / 64;        // 0..3
-    const int within = e % 64;
-    const int half   = within / 32;   // 0..1
-    const int pos    = within % 32;
-    const int ir     = pos / 4;       // 0..7
-    const int l      = pos % 4;       // 0..3
-    const int is     = 2*il + half;
-    const float dall = __low2half (blk->dm);
-    const float dmin = __high2half(blk->dm);
-    uint8_t sc, mn;
-    w4a16_scale_min_k4(is, blk->scales, sc, mn);
-    const float d = dall * sc;
-    const float m = dmin * mn;
-    const uint8_t qb = blk->qs[32*il + 4*ir + l];
-    const int q = (half == 0) ? (qb & 0xF) : (qb >> 4);
-    return d * q - m;
-}
-
-template <bool IS_Q4_K, int WM, int WN, int FM, int FN>
-static __global__ void __launch_bounds__(WM*WN*32, 1)
-w4a16_gemm_kernel(
-        const char * __restrict__ src0,
-        const char * __restrict__ src1,
-        float      * __restrict__ dst,
-        const int M, const int N, const int K,
-        const int64_t nb01, const int64_t nb11, const int64_t dst_ne0) {
-    constexpr int KP   = W4A16_KP;      // 8 bf162 = 16 k per row
-    constexpr int SPAD = W4A16_SPAD;    // padded row stride (bank skew)
-    constexpr int BM  = WM*FM*16;
-    constexpr int BN  = WN*FN*8;
-    constexpr int NTH = WM*WN*32;
-
-    const int m0 = blockIdx.x * BM;
-    const int n0 = blockIdx.y * BN;
-
-    const int warp_id = threadIdx.y;        // 0 .. WM*WN-1
-    const int warp_n  = warp_id % WN;
-    const int warp_m  = warp_id / WN;
-    const int tid     = threadIdx.y*32 + threadIdx.x;
-
-    __shared__ nv_bfloat162 sW[BM*SPAD]; // [m][kpair], padded row stride SPAD
-    __shared__ nv_bfloat162 sB[BN*SPAD]; // [n][kpair], padded row stride SPAD
-
-    tile_C C[FM][FN]; // zero-initialized accumulators
-
-    for (int k0 = 0; k0 < K; k0 += 16) {
-        // Dequantize the BM x 16 weight strip once; reused across the block's BN span.
-        #pragma unroll
-        for (int idx = tid; idx < BM*KP; idx += NTH) {
-            const int m  = idx / KP;
-            const int kk = idx % KP;
-            const int k  = k0 + 2*kk;
-            float w0 = 0.0f, w1 = 0.0f;
-            if (m0 + m < M) {
-                const char * row = src0 + (int64_t)(m0 + m) * nb01;
-                if (IS_Q4_K) { w0 = w4a16_dq_q4_K(row, k); w1 = w4a16_dq_q4_K(row, k + 1); }
-                else         { w0 = w4a16_dq_q4_0(row, k); w1 = w4a16_dq_q4_0(row, k + 1); }
-            }
-            sW[m*SPAD + kk] = __floats2bfloat162_rn(w0, w1);
-        }
-        // Load the BN x 16 activation strip (f32 -> bf16).
-        #pragma unroll
-        for (int idx = tid; idx < BN*KP; idx += NTH) {
-            const int n  = idx / KP;
-            const int kk = idx % KP;
-            const int k  = k0 + 2*kk;
-            float a0 = 0.0f, a1 = 0.0f;
-            if (n0 + n < N) {
-                const float * arow = (const float *)(src1 + (int64_t)(n0 + n) * nb11);
-                a0 = arow[k]; a1 = arow[k + 1];
-            }
-            sB[n*SPAD + kk] = __floats2bfloat162_rn(a0, a1);
-        }
-        __syncthreads();
-
-        tile_A Af[FM];
-        tile_B Bf[FN];
-        #pragma unroll
-        for (int fm = 0; fm < FM; ++fm) {
-            const int mrow = (warp_m*FM + fm) * 16;
-            load_ldmatrix(Af[fm], sW + mrow*SPAD, SPAD);
-        }
-        #pragma unroll
-        for (int fn = 0; fn < FN; ++fn) {
-            const int ncol = (warp_n*FN + fn) * 8;
-            load_ldmatrix(Bf[fn], sB + ncol*SPAD, SPAD);
-        }
-        #pragma unroll
-        for (int fm = 0; fm < FM; ++fm) {
-            #pragma unroll
-            for (int fn = 0; fn < FN; ++fn) {
-                mma(C[fm][fn], Af[fm], Bf[fn]);
-            }
-        }
-        __syncthreads();
-    }
-
-    #pragma unroll
-    for (int fm = 0; fm < FM; ++fm) {
-        #pragma unroll
-        for (int fn = 0; fn < FN; ++fn) {
-            const int mbase = m0 + (warp_m*FM + fm) * 16;
-            const int nbase = n0 + (warp_n*FN + fn) * 8;
-            #pragma unroll
-            for (int l = 0; l < tile_C::ne; ++l) {
-                const int m = mbase + tile_C::get_i(l);
-                const int n = nbase + tile_C::get_j(l);
-                if (m < M && n < N) {
-                    dst[(int64_t)n * dst_ne0 + m] = C[fm][fn].x[l];
-                }
-            }
-        }
-    }
-}
-
-bool ggml_cuda_w4a16_mul_mat(
-        ggml_backend_cuda_context & ctx,
-        const ggml_tensor * src0,
-        const ggml_tensor * src1,
-        ggml_tensor       * dst) {
-    if (!w4a16_enabled()) {
-        return false;
-    }
-    if (src0->type != GGML_TYPE_Q4_0 && src0->type != GGML_TYPE_Q4_K) {
-        return false;
-    }
-    if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
-        return false;
-    }
-    const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
-    if (!GGML_CUDA_CC_IS_NVIDIA(cc) || cc < GGML_CUDA_CC_BLACKWELL) {
-        return false; // consumer Blackwell (sm_120/121) only
-    }
-
-    if (src0->ne[2] != 1 || src0->ne[3] != 1 ||
-        src1->ne[2] != 1 || src1->ne[3] != 1 ||
-        dst->ne[2]  != 1 || dst->ne[3]  != 1) {
-        return false;
-    }
-    if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(src1) || !ggml_is_contiguous(dst)) {
-        return false;
-    }
-
-    const int64_t K = src0->ne[0];
-    const int64_t M = src0->ne[1];
-    const int64_t N = src1->ne[1];
-    if (src1->ne[0] != K || dst->ne[0] != M || dst->ne[1] != N) {
-        return false;
-    }
-    if (K % 16 != 0) {
-        return false;
-    }
-
-    cudaStream_t stream = ctx.stream();
-
-    // Block tile config: WM*WN warps compute BM(=WM*FM*16) x BN(=WN*FN*8).
-    constexpr int WM = 4, WN = 4, FM = 2, FN = 4; // BM=128, BN=128, 16 warps
-    constexpr int BM = WM*FM*16;
-    constexpr int BN = WN*FN*8;
-    const dim3 grid((unsigned)((M + BM - 1) / BM), (unsigned)((N + BN - 1) / BN), 1);
-    const dim3 block(32, WM*WN, 1);
-
-    if (src0->type == GGML_TYPE_Q4_K) {
-        w4a16_gemm_kernel<true, WM, WN, FM, FN><<<grid, block, 0, stream>>>(
-            (const char *) src0->data, (const char *) src1->data, (float *) dst->data,
-            (int) M, (int) N, (int) K, src0->nb[1], src1->nb[1], dst->ne[0]);
-    } else {
-        w4a16_gemm_kernel<false, WM, WN, FM, FN><<<grid, block, 0, stream>>>(
-            (const char *) src0->data, (const char *) src1->data, (float *) dst->data,
-            (int) M, (int) N, (int) K, src0->nb[1], src1->nb[1], dst->ne[0]);
-    }
-    return true;
-}
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cuh
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cuh
@@ -1,14 +0,0 @@
-#pragma once
-
-#include "common.cuh"
-
-// W4A16 Marlin-style BF16 GEMM for NVIDIA Blackwell consumer GPUs (sm_120/121).
-// Dense (non-MoE) 4-bit-weight matmul run on BF16 tensor cores, the path that
-// reaches the GB10 BF16 ceiling where MMQ (int8, Ampere-tuned) and cuBLAS (sm_80
-// fallback) both plateau at ~22% of it. Returns true if it handled the op; false
-// to fall back to MMQ. Gated behind GGML_CUDA_W4A16 until correct + faster.
-bool ggml_cuda_w4a16_mul_mat(
-        ggml_backend_cuda_context & ctx,
-        const ggml_tensor * src0,   // 4-bit weights (Q4_0/Q4_K)
-        const ggml_tensor * src1,   // F32 activations
-        ggml_tensor       * dst);   // F32 output
--- a/backend/cpp/llama-cpp/paged/paged-bench.cpp
+++ b/backend/cpp/llama-cpp/paged/paged-bench.cpp
@@ -1,129 +0,0 @@
-// paged-bench: quantify the multi-tenant wins of paged KV allocation that are
-// properties of the host-side block model (vLLM-parity), independent of the
-// in-model compute path.
-//
-//   Win 1 (capacity):       on-demand block allocation vs contiguous per-seq
-//                           reservation, under a fixed KV block budget.
-//   Win 3 (prefix sharing): automatic cross-tenant prefix dedup via block
-//                           hashing.
-//
-// Win 2 (throughput) is intentionally NOT here: it requires the paged read
-// path wired into llama-graph.cpp (Gate 0). Measuring it at this layer would
-// be dishonest, so it is reported as pending.
-
-#include "paged_kv_manager.h"
-
-#include <cstdio>
-#include <vector>
-#include <numeric>
-
-using namespace paged;
-
-// A deterministic LCG so sequence lengths vary without Math.random-style nondeterminism.
-struct Lcg {
-    uint64_t s;
-    explicit Lcg(uint64_t seed) : s(seed) {}
-    uint32_t next() { s = s * 6364136223846793005ULL + 1442695040888963407ULL; return (uint32_t)(s >> 33); }
-    int range(int lo, int hi) { return lo + (int)(next() % (uint32_t)(hi - lo + 1)); }
-};
-
-static size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
-
-int main() {
-    const int block_size = 16;
-    const int n_ctx      = 2048;   // max context a sequence could use
-    const int num_blocks = 512;    // fixed KV budget: 512 blocks * 16 = 8192 cells
-
-    printf("paged-bench  (block_size=%d, n_ctx=%d, budget=%d blocks = %d cells)\n\n",
-           block_size, n_ctx, num_blocks, num_blocks * block_size);
-
-    // ---------------------------------------------------------------------
-    // WIN 1: concurrency capacity. Sequences have realistic, VARYING lengths
-    // (most short, a few long) - the regime where reserving n_ctx per seq
-    // wastes the most. Count how many fit under the same block budget.
-    // ---------------------------------------------------------------------
-    {
-        Lcg rng(12345);
-        const int blocks_per_ctx = (int) cdiv(n_ctx, block_size); // contiguous reserves this per seq
-
-        // Contiguous (stream-style) reservation: every seq reserves n_ctx worth.
-        int contiguous_fit = num_blocks / blocks_per_ctx;
-
-        // Paged on-demand: draw real lengths until the pool is exhausted.
-        PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
-        int paged_fit = 0;
-        long total_tokens = 0;
-        for (int seq = 0; ; ++seq) {
-            // 80% short (8-128 tok), 20% long (up to n_ctx)
-            int len = (rng.range(0, 99) < 80) ? rng.range(8, 128) : rng.range(128, n_ctx);
-            if (!m.allocate(seq, (size_t) len)) break;
-            paged_fit++;
-            total_tokens += len;
-        }
-
-        printf("WIN 1  concurrency capacity @ %d-block budget\n", num_blocks);
-        printf("  contiguous (reserve n_ctx/seq): %d sequences\n", contiguous_fit);
-        printf("  paged (on-demand blocks):       %d sequences  (avg %ld tok/seq)\n",
-               paged_fit, paged_fit ? total_tokens / paged_fit : 0);
-        printf("  --> paged fits %.1fx more concurrent sequences\n\n",
-               contiguous_fit ? (double) paged_fit / contiguous_fit : 0.0);
-    }
-
-    // ---------------------------------------------------------------------
-    // WIN 3: cross-tenant prefix sharing. N tenants share a long system
-    // prompt / RAG context, then diverge. Compare physical blocks consumed
-    // with prefix caching on vs off.
-    // ---------------------------------------------------------------------
-    {
-        const int n_tenants    = 32;
-        const int shared_len   = 1024;  // shared system prompt (64 blocks)
-        const int distinct_len = 64;    // per-tenant suffix (4 blocks)
-
-        // Shared prefix token ids (identical across tenants -> identical block hashes).
-        std::vector<int> shared(shared_len);
-        for (int i = 0; i < shared_len; ++i) shared[i] = 1000 + i;
-
-        // --- prefix caching OFF: every tenant pays for the whole prefix ---
-        long blocks_off = 0;
-        {
-            PagedKVManager m(num_blocks * 8, block_size, /*enable_caching=*/false);
-            for (int t = 0; t < n_tenants; ++t) {
-                m.allocate(t, (size_t) (shared_len + distinct_len));
-                blocks_off += m.block_table(t).size();
-            }
-        }
-
-        // --- prefix caching ON: shared blocks are deduped to one physical copy ---
-        long blocks_on = 0;
-        {
-            PagedKVManager m(num_blocks * 8, block_size, /*enable_caching=*/true);
-            // tenant 0 fills + caches the shared prefix
-            auto h = m.compute_block_hashes(shared);
-            m.allocate(0, (size_t) (shared_len + distinct_len));
-            m.cache_blocks(0, h, (size_t) shared_len);
-            long physical = m.block_table(0).size();
-            // tenants 1..N-1 hit the cached prefix; only their distinct suffix is new
-            for (int t = 1; t < n_tenants; ++t) {
-                size_t cached_tokens = m.get_computed_blocks(h); // shared blocks reused
-                size_t new_tokens = (shared_len - cached_tokens) + distinct_len;
-                m.allocate(t, (size_t) (shared_len + distinct_len));
-                // physically new blocks = only what wasn't already resident
-                physical += (long) cdiv(new_tokens, block_size);
-            }
-            blocks_on = physical;
-        }
-
-        printf("WIN 3  cross-tenant prefix sharing (%d tenants, %d-tok shared prefix)\n",
-               n_tenants, shared_len);
-        printf("  prefix-cache OFF: %ld physical blocks\n", blocks_off);
-        printf("  prefix-cache ON:  %ld physical blocks\n", blocks_on);
-        printf("  --> %.1fx less KV memory for the shared workload\n\n",
-               blocks_on ? (double) blocks_off / blocks_on : 0.0);
-    }
-
-    printf("WIN 2  aggregate throughput under load: PENDING\n");
-    printf("  Requires the paged gather-read path wired into llama-graph.cpp\n");
-    printf("  (Gate 0) to measure tok/s vs concurrency. Not measurable at the\n");
-    printf("  allocation layer; not reported here to avoid overclaiming.\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/paged-loadgen.cpp
+++ b/backend/cpp/llama-cpp/paged/paged-loadgen.cpp
@@ -1,169 +0,0 @@
-// paged-loadgen: a dynamic-load benchmark for paged KV that actually exercises the
-// regime where paging wins - variable prompt lengths, variable generation lengths,
-// staggered (continuous) arrival, and a shared system prefix. The stock
-// examples/paged/paged.cpp adds all requests up front with a fixed n_predict from a
-// 20-prompt pool, so it never creates KV-memory pressure or fragmentation and
-// therefore never shows a paged advantage (see PAGED_KV_HIGH_CONCURRENCY.md).
-//
-// Build: drop into PR #22569's examples/paged/ and add to its CMakeLists.txt next to
-// llama-paged (it uses the same llama_paged_scheduler_* API). Run on the TARGET GPU
-// (e.g. 2xH200) where bandwidth lets decode scale to thousands of sequences and KV
-// memory becomes the binding constraint - that is where paged KV pays off and where
-// this harness produces a meaningful number. On a low-bandwidth box (GB10) throughput
-// plateaus long before memory binds, so the win is not observable there regardless.
-//
-// Metrics reported:
-//   - goodput (decode tokens/s aggregate) under the dynamic load
-//   - peak concurrent in-flight sequences actually sustained
-//   - paged peak KV bytes used  vs  the contiguous reservation a unified cache needs
-//     (n_seq_peak * max_ctx), i.e. the capacity ratio = the headroom paging unlocks
-//
-// The capacity ratio is the load-bearing number for the buy decision: it is how many
-// more concurrent tenants a fixed HBM budget serves with paging than without.
-
-#include "common.h"
-#include "llama.h"
-
-#include <cmath>
-#include <cstdio>
-#include <cstring>
-#include <random>
-#include <string>
-#include <vector>
-
-// ---- workload knobs (env-overridable so the harness is sweepable without rebuilds) ----
-static int env_int(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }
-
-struct workload_cfg {
-    int    total_requests  = env_int("LG_TOTAL",    2000); // total requests to serve
-    int    target_inflight = env_int("LG_INFLIGHT",  256); // continuous-batching concurrency target
-    int    prefix_tokens   = env_int("LG_PREFIX",    512); // shared system-prompt prefix (prefix-cache target)
-    int    suffix_min      = env_int("LG_SUFMIN",     16); // per-request unique prompt suffix range
-    int    suffix_max      = env_int("LG_SUFMAX",    768);
-    int    gen_short       = env_int("LG_GENSHORT",   32); // bimodal generation: most short...
-    int    gen_long        = env_int("LG_GENLONG",  1024); // ...some long (the over-reservation driver)
-    int    gen_long_pct    = env_int("LG_LONGPCT",    15); // % of requests that are long
-    int    block_size      = env_int("LG_BLOCK",      16); // must match -kvbls
-    unsigned seed          = (unsigned) env_int("LG_SEED", 1234);
-};
-
-// Per-request plan drawn from the workload distribution.
-struct req_plan { int prompt_len; int gen_len; };
-
-int main(int argc, char ** argv) {
-    common_params params;
-    params.n_predict = -1; // per-request, controlled by the plan below
-    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_PAGED)) {
-        fprintf(stderr, "usage: %s -m <model> -kvp --fit off -ngpub N -ncpub M -ngl 99\n", argv[0]);
-        return 1;
-    }
-    params.kv_paged = true;
-
-    common_init_result init = common_init_from_params(params);
-    llama_model *   model = init.model.get();
-    llama_context * ctx   = init.context.get();
-    if (!model || !ctx) { fprintf(stderr, "load failed\n"); return 1; }
-    const llama_vocab * vocab = llama_model_get_vocab(model);
-
-    workload_cfg cfg;
-    std::mt19937 rng(cfg.seed);
-    std::uniform_int_distribution<int> suf(cfg.suffix_min, cfg.suffix_max);
-    std::uniform_int_distribution<int> pct(1, 100);
-
-    // KV bytes/token = 2(K,V) * n_layers * n_head_kv * head_dim * sizeof(f16). Confirmed
-    // against llama-kv-cache-paged.cpp (block_bytes formula). Used for the capacity ratio.
-    const int n_layers   = llama_model_n_layer(model);
-    const int n_head_kv  = llama_model_n_head_kv(model);
-    const int head_dim   = llama_model_n_embd(model) / llama_model_n_head(model);
-    const size_t kv_bytes_per_token = (size_t)2 * n_layers * n_head_kv * head_dim * sizeof(uint16_t);
-
-    // A long shared system prefix that every request reuses (the prefix-cache target).
-    std::vector<llama_token> prefix = common_tokenize(ctx, std::string(cfg.prefix_tokens, 'x'), true);
-
-    // Pre-draw all request plans so paged peak usage and the contiguous reservation are
-    // computed from the SAME workload.
-    std::vector<req_plan> plans(cfg.total_requests);
-    int max_ctx = 0;
-    for (auto & p : plans) {
-        p.prompt_len = cfg.prefix_tokens + suf(rng);
-        p.gen_len    = (pct(rng) <= cfg.gen_long_pct) ? cfg.gen_long : cfg.gen_short;
-        max_ctx      = std::max(max_ctx, p.prompt_len + p.gen_len);
-    }
-
-    llama_paged_scheduler * sched = llama_paged_scheduler_init(ctx);
-    if (!sched) { fprintf(stderr, "scheduler init failed\n"); return 1; }
-
-    // ---- continuous-arrival loop: keep ~target_inflight requests live at all times ----
-    int    next_req = 0, done = 0, inflight = 0, peak_inflight = 0;
-    long   total_decoded = 0;
-    size_t peak_kv_bytes_paged = 0;   // sum over live seqs of ceil(used/block)*block*kv_bytes
-    size_t live_used_tokens = 0;      // running sum of actual KV tokens held by live seqs
-
-    auto admit = [&](int rid) {
-        const req_plan & p = plans[rid];
-        std::vector<llama_token> toks = prefix; // shared prefix...
-        std::vector<llama_token> suff = common_tokenize(ctx, std::string(p.prompt_len - cfg.prefix_tokens, 'y'), false);
-        toks.insert(toks.end(), suff.begin(), suff.end()); // ...+ unique suffix
-        if (llama_paged_scheduler_add_request(sched, toks.data(), toks.size(), rid)) {
-            inflight++; peak_inflight = std::max(peak_inflight, inflight);
-            live_used_tokens += p.prompt_len;
-        }
-    };
-
-    const int64_t t0 = ggml_time_us();
-    for (int i = 0; i < cfg.target_inflight && next_req < cfg.total_requests; ++i) admit(next_req++);
-
-    llama_batch batch = {};
-    std::vector<llama_token> sampled; std::vector<int8_t> stop_flags;
-
-    while (done < cfg.total_requests) {
-        if (!llama_paged_scheduler_prepare_batch(sched, &batch)) break;
-        const llama_paged_batch_info * info = llama_paged_scheduler_get_batch_info(sched);
-        sampled.assign(info->n_seq, 0); stop_flags.assign(info->n_seq, 0);
-
-        // (decode is done inside the scheduler/update path in PR #22569; greedy here)
-        for (int i = 0; i < info->n_seq; ++i) {
-            const int rid = info->seq_ids[i];
-            llama_paged_seq_state st{};
-            llama_paged_scheduler_get_seq_state(sched, rid, &st);
-            // greedy argmax from the i-th row of logits
-            const float * lg = llama_get_logits_ith(ctx, i);
-            int best = 0; float bv = lg[0];
-            for (int t = 1; t < llama_vocab_n_tokens(vocab); ++t) if (lg[t] > bv) { bv = lg[t]; best = t; }
-            sampled[i] = best;
-            const bool stop = llama_vocab_is_eog(vocab, best) || st.n_decoded + 1 >= plans[rid].gen_len;
-            stop_flags[i] = stop ? 1 : 0;
-            if (!stop) { total_decoded++; live_used_tokens++; }
-            if (stop) {
-                done++; inflight--;
-                live_used_tokens -= (plans[rid].prompt_len + st.n_decoded);
-                if (next_req < cfg.total_requests) admit(next_req++); // continuous arrival
-            }
-        }
-        // paged peak KV: blocks are allocated per live seq = ceil(used/block); approximate
-        // current paged footprint from live_used_tokens rounded up per the block size.
-        const size_t paged_now = (size_t)std::ceil((double)live_used_tokens / cfg.block_size)
-                                 * cfg.block_size * kv_bytes_per_token;
-        peak_kv_bytes_paged = std::max(peak_kv_bytes_paged, paged_now);
-
-        llama_paged_scheduler_update(sched, &batch, sampled.data(), stop_flags.data());
-    }
-    const double secs = (ggml_time_us() - t0) / 1e6;
-
-    // Contiguous unified-KV reservation needed to serve the SAME peak concurrency without
-    // mid-generation eviction: every live slot must be backed for the worst-case context.
-    const size_t contig_reserve = (size_t)peak_inflight * max_ctx * kv_bytes_per_token;
-
-    printf("\n==== paged-loadgen ====\n");
-    printf("requests served      : %d  (target inflight %d, peak inflight %d)\n", done, cfg.target_inflight, peak_inflight);
-    printf("goodput (decode)     : %.1f tok/s   (%ld tokens / %.2f s)\n", total_decoded / secs, total_decoded, secs);
-    printf("kv bytes / token     : %zu (n_layer=%d n_head_kv=%d head_dim=%d f16)\n", kv_bytes_per_token, n_layers, n_head_kv, head_dim);
-    printf("paged peak KV        : %.2f GiB (allocated on demand)\n", peak_kv_bytes_paged / 1073741824.0);
-    printf("contiguous reserve   : %.2f GiB (peak_inflight * max_ctx %d)\n", contig_reserve / 1073741824.0, max_ctx);
-    printf("CAPACITY RATIO       : %.2fx  <- tenants-per-HBM paging unlocks\n",
-           peak_kv_bytes_paged ? (double)contig_reserve / peak_kv_bytes_paged : 0.0);
-    printf("  (plus cross-request prefix sharing of the %d-token shared prefix, not counted above)\n", cfg.prefix_tokens);
-
-    llama_paged_scheduler_free(sched);
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/paged_kv_manager.cpp
+++ b/backend/cpp/llama-cpp/paged/paged_kv_manager.cpp
@@ -1,296 +0,0 @@
-#include "paged_kv_manager.h"
-#include <cassert>
-#include <stdexcept>
-
-namespace paged {
-
-// ---------------------------------------------------------------------------
-// FreeBlockQueue  (port of kv_cache_utils.py FreeKVCacheBlockQueue)
-// ---------------------------------------------------------------------------
-
-FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
-    num_free_blocks = blocks.size();
-    for (size_t i = 0; i < blocks.size(); ++i) {
-        if (i > 0)                  blocks[i]->prev_free = blocks[i - 1];
-        if (i + 1 < blocks.size())  blocks[i]->next_free = blocks[i + 1];
-    }
-    if (!blocks.empty()) {
-        fake_head.next_free = blocks.front();
-        blocks.front()->prev_free = &fake_head;
-        fake_tail.prev_free = blocks.back();
-        blocks.back()->next_free = &fake_tail;
-    } else {
-        fake_head.next_free = &fake_tail;
-        fake_tail.prev_free = &fake_head;
-    }
-}
-
-KVCacheBlock* FreeBlockQueue::popleft() {
-    KVCacheBlock* first = fake_head.next_free;
-    if (first == &fake_tail || first == nullptr) {
-        assert(num_free_blocks == 0);
-        throw std::runtime_error("No free blocks available");
-    }
-    fake_head.next_free = first->next_free;
-    first->next_free->prev_free = &fake_head;
-    first->prev_free = first->next_free = nullptr;
-    num_free_blocks--;
-    return first;
-}
-
-std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
-    std::vector<KVCacheBlock*> ret;
-    if (n == 0) return ret;
-    assert(num_free_blocks >= n);
-    num_free_blocks -= n;
-    KVCacheBlock* curr = fake_head.next_free;
-    ret.reserve(n);
-    for (size_t i = 0; i < n; ++i) {
-        assert(curr != nullptr);
-        ret.push_back(curr);
-        KVCacheBlock* last = curr;
-        curr = curr->next_free;
-        last->prev_free = last->next_free = nullptr;
-    }
-    if (curr != nullptr) {
-        fake_head.next_free = curr;
-        curr->prev_free = &fake_head;
-    }
-    return ret;
-}
-
-void FreeBlockQueue::remove(KVCacheBlock* block) {
-    if (!block->prev_free || !block->next_free)
-        throw std::runtime_error("remove() called on an invalid block");
-    block->prev_free->next_free = block->next_free;
-    block->next_free->prev_free = block->prev_free;
-    block->prev_free = block->next_free = nullptr;
-    num_free_blocks--;
-}
-
-void FreeBlockQueue::append(KVCacheBlock* block) {
-    KVCacheBlock* last = fake_tail.prev_free;
-    last->next_free = block;
-    block->prev_free = last;
-    block->next_free = &fake_tail;
-    fake_tail.prev_free = block;
-    num_free_blocks++;
-}
-
-void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
-    if (blocks.empty()) return;
-    KVCacheBlock* last = fake_tail.prev_free;
-    for (KVCacheBlock* b : blocks) {
-        b->prev_free = last;
-        last->next_free = b;
-        last = b;
-    }
-    last->next_free = &fake_tail;
-    fake_tail.prev_free = last;
-    num_free_blocks += blocks.size();
-}
-
-void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
-    if (blocks.empty()) return;
-    KVCacheBlock* first = fake_head.next_free;
-    KVCacheBlock* prev = &fake_head;
-    for (KVCacheBlock* b : blocks) {
-        b->prev_free = prev;
-        prev->next_free = b;
-        prev = b;
-    }
-    prev->next_free = first;
-    first->prev_free = prev;
-    num_free_blocks += blocks.size();
-}
-
-std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
-    std::vector<KVCacheBlock*> ret;
-    const KVCacheBlock* curr = fake_head.next_free;
-    while (curr && curr->next_free != nullptr) {
-        ret.push_back(const_cast<KVCacheBlock*>(curr));
-        curr = curr->next_free;
-    }
-    return ret;
-}
-
-// ---------------------------------------------------------------------------
-// BlockPool  (port of block_pool.py)
-// ---------------------------------------------------------------------------
-
-static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
-    std::vector<KVCacheBlock*> p;
-    p.reserve(v.size());
-    for (auto& b : v) p.push_back(&b);
-    return p;
-}
-
-static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
-    std::vector<KVCacheBlock> v;
-    v.reserve(num_blocks);
-    for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
-    return v;
-}
-
-BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
-    : enable_caching_(enable_caching),
-      blocks_(make_block_vec(num_blocks)),
-      ptrs_(make_ptrs(blocks_)),
-      free_queue_(ptrs_) {
-    // vLLM reserves block_id 0 as the null block (never cached).
-    null_block = free_queue_.popleft();
-    null_block->is_null = true;
-}
-
-bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
-    if (!block->has_hash) return false;
-    auto it = cached_block_hash_to_block_.find(block->block_hash);
-    if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
-    cached_block_hash_to_block_.erase(it);
-    block->reset_hash();
-    return true;
-}
-
-std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
-    if (n > get_num_free_blocks())
-        throw std::runtime_error("Cannot get free blocks from pool");
-    auto ret = free_queue_.popleft_n(n);
-    for (KVCacheBlock* b : ret) {
-        if (enable_caching_) maybe_evict_cached_block(b);
-        assert(b->ref_cnt == 0);
-        b->ref_cnt += 1;
-    }
-    return ret;
-}
-
-KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
-    auto it = cached_block_hash_to_block_.find(block_hash);
-    return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
-}
-
-void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
-    for (KVCacheBlock* b : blocks) {
-        // ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
-        if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
-        b->ref_cnt += 1;
-    }
-}
-
-void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
-    std::vector<KVCacheBlock*> without_hash, with_hash;
-    for (KVCacheBlock* b : ordered_blocks) {
-        if (b->is_null) continue;
-        b->ref_cnt -= 1;
-        if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
-    }
-    free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
-    free_queue_.append_n(with_hash);     // hashed: kept warm (tail)
-}
-
-void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
-                                  size_t num_cached_blocks, size_t num_full_blocks,
-                                  const std::vector<uint64_t>& block_hashes) {
-    for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
-        KVCacheBlock* blk = req_blocks[i];
-        if (blk->has_hash) continue;
-        blk->has_hash = true;
-        blk->block_hash = block_hashes[i];
-        cached_block_hash_to_block_[blk->block_hash] = blk;
-    }
-}
-
-// ---------------------------------------------------------------------------
-// PagedKVManager  (port of SingleTypeKVCacheManager / FullAttentionManager)
-// ---------------------------------------------------------------------------
-
-static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
-
-PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
-    : block_size_(block_size), pool_(num_blocks, enable_caching) {}
-
-bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
-    auto& req = req_to_blocks_[seq_id];
-    size_t need = cdiv(total_tokens, block_size_);
-    if (need <= req.size()) return true;
-    size_t add = need - req.size();
-    if (add > pool_.get_num_free_blocks()) return false; // OOM
-    auto nb = pool_.get_new_blocks(add);
-    req.insert(req.end(), nb.begin(), nb.end());
-    return true;
-}
-
-std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
-    std::vector<int32_t> bt;
-    auto it = req_to_blocks_.find(seq_id);
-    if (it == req_to_blocks_.end()) return bt;
-    bt.reserve(it->second.size());
-    for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
-    return bt;
-}
-
-int64_t PagedKVManager::slot(int seq_id, int pos) const {
-    const auto& req = req_to_blocks_.at(seq_id);
-    int32_t phys = req[pos / block_size_]->block_id;
-    return (int64_t)phys * block_size_ + (pos % block_size_);
-}
-
-std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
-    std::vector<int64_t> sm;
-    sm.reserve(positions.size());
-    for (int p : positions) sm.push_back(slot(seq_id, p));
-    return sm;
-}
-
-void PagedKVManager::free(int seq_id) {
-    auto it = req_to_blocks_.find(seq_id);
-    if (it == req_to_blocks_.end()) return;
-    // Free in reverse so the tail of the block chain is evicted first (vLLM order).
-    std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
-    pool_.free_blocks(ordered);
-    req_to_blocks_.erase(it);
-}
-
-// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
-// hash into the seed so each block hash transitively encodes its whole prefix
-// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
-uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
-    uint64_t h = 1469598103934665603ull ^ parent_hash;
-    for (int t : token_ids) {
-        h ^= (uint64_t)(uint32_t)t;
-        h *= 1099511628211ull;
-    }
-    if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
-    return h;
-}
-
-std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
-    std::vector<uint64_t> hashes;
-    uint64_t parent = 0; // NONE_HASH analogue
-    size_t n_full = token_ids.size() / block_size_;
-    for (size_t i = 0; i < n_full; ++i) {
-        std::vector<int> blk(token_ids.begin() + i * block_size_,
-                             token_ids.begin() + (i + 1) * block_size_);
-        parent = hash_block(parent, blk);
-        hashes.push_back(parent);
-    }
-    return hashes;
-}
-
-size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
-    std::vector<KVCacheBlock*> hits;
-    for (uint64_t bh : block_hashes) {        // stop at first miss (prefix property)
-        KVCacheBlock* cb = pool_.get_cached_block(bh);
-        if (!cb) break;
-        hits.push_back(cb);
-    }
-    pool_.touch(hits);                        // ++ref_cnt, pull from free list
-    return hits.size() * (size_t)block_size_;
-}
-
-void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
-    auto& req = req_to_blocks_[seq_id];
-    size_t n_full = num_tokens / block_size_;
-    pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
-}
-
-} // namespace paged
--- a/backend/cpp/llama-cpp/paged/paged_kv_manager.h
+++ b/backend/cpp/llama-cpp/paged/paged_kv_manager.h
@@ -1,108 +0,0 @@
-#pragma once
-// Paged KV cache block manager for llama.cpp (CPU-first prototype).
-//
-// Host-side block management is a faithful port of vLLM V1:
-//   vllm/v1/core/kv_cache_utils.py            (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
-//   vllm/v1/core/block_pool.py                (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
-//   vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
-//
-// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
-// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
-// dependency so it can be unit-tested in isolation.
-
-#include <cstdint>
-#include <vector>
-#include <unordered_map>
-#include <map>
-
-namespace paged {
-
-// vLLM KVCacheBlock (kv_cache_utils.py).
-struct KVCacheBlock {
-    int32_t  block_id   = 0;
-    int      ref_cnt    = 0;
-    bool     has_hash   = false;   // vLLM: _block_hash is set only when full+cached
-    uint64_t block_hash = 0;
-    bool     is_null    = false;
-    KVCacheBlock* prev_free = nullptr;
-    KVCacheBlock* next_free = nullptr;
-
-    explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
-    void reset_hash() { has_hash = false; block_hash = 0; }
-};
-
-// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
-// O(1) middle removal is required so touch() can pull a warm cached block out of the
-// free list when a later request hits its prefix.
-class FreeBlockQueue {
-public:
-    size_t num_free_blocks = 0;
-
-    explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
-    KVCacheBlock* popleft();
-    std::vector<KVCacheBlock*> popleft_n(size_t n);
-    void remove(KVCacheBlock* block);
-    void append(KVCacheBlock* block);
-    void append_n(const std::vector<KVCacheBlock*>& blocks);
-    void prepend_n(const std::vector<KVCacheBlock*>& blocks);
-    std::vector<KVCacheBlock*> get_all_free_blocks() const;
-
-private:
-    KVCacheBlock fake_head{-1};
-    KVCacheBlock fake_tail{-1};
-};
-
-// vLLM BlockPool (block_pool.py).
-class BlockPool {
-public:
-    KVCacheBlock* null_block = nullptr;
-
-    BlockPool(int32_t num_blocks, bool enable_caching);
-    std::vector<KVCacheBlock*> get_new_blocks(size_t n);
-    KVCacheBlock* get_cached_block(uint64_t block_hash);
-    void touch(const std::vector<KVCacheBlock*>& blocks);
-    void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
-    void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
-                           size_t num_cached_blocks, size_t num_full_blocks,
-                           const std::vector<uint64_t>& block_hashes);
-    size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
-
-private:
-    bool maybe_evict_cached_block(KVCacheBlock* block);
-
-    bool enable_caching_;
-    std::vector<KVCacheBlock> blocks_;     // owns all block descriptors
-    std::vector<KVCacheBlock*> ptrs_;
-    FreeBlockQueue free_queue_;
-    // vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
-    // prototype keeps the last writer (single KV-cache group is sufficient for the wins).
-    std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
-};
-
-// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
-// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
-class PagedKVManager {
-public:
-    PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
-
-    // Grow seq_id to cover total_tokens slots. Returns false on OOM (free queue empty).
-    bool allocate(int seq_id, size_t total_tokens);
-    std::vector<int32_t> block_table(int seq_id) const;
-    int64_t slot(int seq_id, int pos) const;
-    std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
-    void free(int seq_id);
-    int block_size() const { return block_size_; }
-
-    // Prefix caching (win 3).
-    static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
-    std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
-    size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
-    void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
-
-protected:
-    int block_size_;
-    BlockPool pool_;
-    std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
-};
-
-} // namespace paged
--- a/backend/cpp/llama-cpp/paged/patches/0001-paged-kv-block-placement.patch
+++ b/backend/cpp/llama-cpp/paged/patches/0001-paged-kv-block-placement.patch
@@ -1,59 +0,0 @@
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index a49a055a6..d95102bbd 100644
--- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -11,6 +11,8 @@
- #include <cstring>
- #include <limits>
- #include <map>
-+#include <numeric>
-+#include <cstdlib>
- #include <stdexcept>
- 
- static bool ggml_is_power_of_2(int n) {
-@@ -931,6 +933,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
-             return { };
-         }
- 
-+        // [paged, experimental] Place this sequence's tokens at permuted,
-+        // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
-+        // This validates that attention is invariant to physical KV placement -
-+        // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
-+        // Single-sequence scope (uses get_used() as the logical base); falls back
-+        // to the normal allocator if the permuted cells aren't available.
-+        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
-+        if (paged_mode) {
-+            const uint32_t bs   = 16;                 // block size (tokens/block)
-+            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
-+            if (nblk >= 2) {
-+                // stride coprime to nblk => block-index permutation is a bijection
-+                uint32_t k = 1;
-+                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
-+                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
-+                }
-+                const uint32_t base = cells.get_used();
-+                bool ok = true;
-+                for (uint32_t i = 0; i < n_tokens; ++i) {
-+                    const uint32_t L    = base + i;
-+                    const uint32_t b    = L / bs;
-+                    const uint32_t off  = L % bs;
-+                    if (b >= nblk) { ok = false; break; }
-+                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
-+                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
-+                    res.idxs[s].push_back(phys);
-+                }
-+                if (ok && res.idxs[s].size() == n_tokens) {
-+                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
-+                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
-+                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
-+                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
-+                    }
-+                    continue; // paged placement succeeded for this sequence
-+                }
-+                res.idxs[s].clear(); // fall back to the normal allocator
-+            }
-+        }
-+
-         uint32_t n_tested = 0;
- 
-         // for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
--- a/backend/cpp/llama-cpp/paged/patches/0002-paged-e2e-disable-broken-autofit.patch
+++ b/backend/cpp/llama-cpp/paged/patches/0002-paged-e2e-disable-broken-autofit.patch
@@ -1,12 +0,0 @@
-diff --git a/tests/test-paged-kv-e2e.cpp b/tests/test-paged-kv-e2e.cpp
-index 5a352e3..06ead50 100644
--- a/tests/test-paged-kv-e2e.cpp
-+++ b/tests/test-paged-kv-e2e.cpp
-@@ -115,6 +115,7 @@ static path_result run_paged(const std::string & model_path) {
-     params.sampling.temp = 0.0f;  // greedy
-     params.warmup        = false;
-     params.kv_paged      = true;
-+    params.fit_params    = false;  // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
-     params.n_gpu_blocks  = 64;
-     params.n_cpu_blocks  = 16;
-     params.n_sequences   = 1;
--- a/backend/cpp/llama-cpp/paged/tests/test_block_pool.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_block_pool.cpp
@@ -1,42 +0,0 @@
-#include "../paged_kv_manager.h"
-#include <cassert>
-#include <cstdio>
-using namespace paged;
-
-int main() {
-    BlockPool pool(/*num_blocks=*/8, /*enable_caching=*/true);
-    // block 0 is reserved as null_block (vLLM pops one at init)
-    assert(pool.null_block != nullptr && pool.null_block->block_id == 0);
-    assert(pool.get_num_free_blocks() == 7);
-
-    // get_new_blocks sets ref_cnt=1 and removes from free list
-    auto b = pool.get_new_blocks(2);
-    assert(b.size() == 2 && b[0]->ref_cnt == 1 && b[1]->ref_cnt == 1);
-    assert(pool.get_num_free_blocks() == 5);
-
-    // cache two full blocks with chained hashes, then look them up
-    std::vector<uint64_t> hashes = {1111, 2222};
-    pool.cache_full_blocks(b, /*num_cached=*/0, /*num_full=*/2, hashes);
-    assert(b[0]->has_hash && b[0]->block_hash == 1111);
-    assert(pool.get_cached_block(1111) == b[0]);
-    assert(pool.get_cached_block(2222) == b[1]);
-    assert(pool.get_cached_block(9999) == nullptr);
-
-    // free: hashed blocks go to tail (kept warm), so they remain queryable.
-    pool.free_blocks(b);
-    assert(b[0]->ref_cnt == 0);
-    assert(pool.get_num_free_blocks() == 7);
-    assert(pool.get_cached_block(1111) == b[0]); // still cached/warm
-
-    // touch a warm cached block: pulls it out of free list, ++ref_cnt
-    pool.touch({b[0]});
-    assert(b[0]->ref_cnt == 1);
-    assert(pool.get_num_free_blocks() == 6);
-
-    // exhausting the pool then allocating evicts a warm cached hash
-    auto rest = pool.get_new_blocks(pool.get_num_free_blocks());
-    (void) rest;
-    assert(pool.get_cached_block(2222) == nullptr); // evicted on reuse
-    printf("test_block_pool: OK\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/tests/test_free_block_queue.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_free_block_queue.cpp
@@ -1,44 +0,0 @@
-#include "../paged_kv_manager.h"
-#include <cassert>
-#include <cstdio>
-#include <vector>
-
-using namespace paged;
-
-static std::vector<KVCacheBlock> make_blocks(int n) {
-    std::vector<KVCacheBlock> v;
-    v.reserve(n);
-    for (int i = 0; i < n; ++i) v.push_back(KVCacheBlock{i});
-    return v;
-}
-
-int main() {
-    // ordered 0..9 at init; popleft yields ascending block_ids
-    auto blocks = make_blocks(10);
-    std::vector<KVCacheBlock*> ptrs;
-    for (auto& b : blocks) ptrs.push_back(&b);
-    FreeBlockQueue q(ptrs);
-    assert(q.num_free_blocks == 10);
-
-    KVCacheBlock* b0 = q.popleft();
-    assert(b0->block_id == 0);
-    assert(q.num_free_blocks == 9);
-
-    auto two = q.popleft_n(2);            // {1,2}
-    assert(two.size() == 2 && two[0]->block_id == 1 && two[1]->block_id == 2);
-    assert(q.num_free_blocks == 7);
-
-    // O(1) middle removal: remove block 5 (currently free), count drops
-    q.remove(ptrs[5]);
-    assert(q.num_free_blocks == 6);       // free: 3,4,6,7,8,9
-
-    // append puts a block at the tail; it comes back out only after the rest
-    q.append(b0);                          // free order now: 3,4,6,7,8,9,0
-    assert(q.num_free_blocks == 7);
-    auto all = q.get_all_free_blocks();
-    assert(all.front()->block_id == 3);
-    assert(all.back()->block_id == 0);
-
-    printf("test_free_block_queue: OK\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_attn.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_attn.cpp
@@ -1,133 +0,0 @@
-// Phase 2 (core numeric de-risk): attention over GATHERED paged KV must equal
-// an independent host-computed reference.
-//
-// This answers the central risk in the design: feeding gather-to-scratch KV
-// (a sequence whose blocks are non-contiguous in the shared pool) into ggml's
-// standard attention ops (mul_mat -> soft_max_ext -> mul_mat) produces correct
-// attention. If this holds, the paged read path is numerically sound; the
-// remaining work is wiring it into llama-graph.cpp (Gate 0 in a real model).
-
-#include "../paged_kv_manager.h"
-
-#include "ggml.h"
-#include "ggml-cpu.h"
-#include "ggml-alloc.h"
-#include "ggml-backend.h"
-
-#include <cassert>
-#include <cstdio>
-#include <cmath>
-#include <vector>
-
-using namespace paged;
-
-int main() {
-    const int d          = 8;     // head dim
-    const int n_kv       = 48;    // 3 blocks worth of KV tokens
-    const int n_q        = 4;     // query tokens
-    const int block_size = 16;
-    const int num_blocks = 8;
-    const int total_slots = block_size * num_blocks;
-    const float scale = 1.0f / std::sqrt((float) d);
-
-    // Non-contiguous physical layout for the KV sequence (blocks [2,1,5]).
-    PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
-    assert(m.allocate(0, 2 * block_size));
-    assert(m.allocate(1, 2 * block_size));
-    m.free(0);
-    assert(m.allocate(2, n_kv));
-    std::vector<int> positions(n_kv);
-    for (int i = 0; i < n_kv; ++i) positions[i] = i;
-    auto slots64 = m.slot_mapping(2, positions);
-    std::vector<int32_t> slots32(slots64.begin(), slots64.end());
-
-    // Deterministic K, V, Q in logical [d, n] layout (column-major: col = token).
-    std::vector<float> K(d * n_kv), V(d * n_kv), Q(d * n_q);
-    for (int t = 0; t < n_kv; ++t)
-        for (int e = 0; e < d; ++e) {
-            K[t * d + e] = std::sin(0.1f * t + 0.3f * e);
-            V[t * d + e] = std::cos(0.2f * t - 0.1f * e);
-        }
-    for (int q = 0; q < n_q; ++q)
-        for (int e = 0; e < d; ++e) Q[q * d + e] = std::sin(0.05f * q + 0.7f * e);
-
-    // ---- Independent host reference attention -------------------------------
-    std::vector<float> ref(d * n_q, 0.0f);
-    for (int q = 0; q < n_q; ++q) {
-        std::vector<float> score(n_kv);
-        float mx = -1e30f;
-        for (int t = 0; t < n_kv; ++t) {
-            float dot = 0.0f;
-            for (int e = 0; e < d; ++e) dot += K[t * d + e] * Q[q * d + e];
-            score[t] = dot * scale;
-            mx = std::fmax(mx, score[t]);
-        }
-        float sum = 0.0f;
-        for (int t = 0; t < n_kv; ++t) { score[t] = std::exp(score[t] - mx); sum += score[t]; }
-        for (int t = 0; t < n_kv; ++t) {
-            float p = score[t] / sum;
-            for (int e = 0; e < d; ++e) ref[q * d + e] += p * V[t * d + e];
-        }
-    }
-
-    // ---- ggml paged path ----------------------------------------------------
-    ggml_backend_t backend = ggml_backend_cpu_init();
-    struct ggml_init_params dp = { ggml_tensor_overhead() * 16, NULL, true };
-    struct ggml_context * ctx_data = ggml_init(dp);
-
-    struct ggml_tensor * poolK = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, total_slots);
-    struct ggml_tensor * poolV = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, total_slots);
-    struct ggml_tensor * kSrc  = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_kv);
-    struct ggml_tensor * vSrc  = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_kv);
-    struct ggml_tensor * qT    = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_q);
-    struct ggml_tensor * wIdx  = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, n_kv);
-    struct ggml_tensor * gIdx  = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I32, n_kv);
-
-    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_data, backend);
-    std::vector<float> zeros(d * total_slots, 0.0f);
-    ggml_backend_tensor_set(poolK, zeros.data(), 0, ggml_nbytes(poolK));
-    ggml_backend_tensor_set(poolV, zeros.data(), 0, ggml_nbytes(poolV));
-    ggml_backend_tensor_set(kSrc, K.data(), 0, ggml_nbytes(kSrc));
-    ggml_backend_tensor_set(vSrc, V.data(), 0, ggml_nbytes(vSrc));
-    ggml_backend_tensor_set(qT,   Q.data(), 0, ggml_nbytes(qT));
-    ggml_backend_tensor_set(wIdx, slots64.data(), 0, ggml_nbytes(wIdx));
-    ggml_backend_tensor_set(gIdx, slots32.data(), 0, ggml_nbytes(gIdx));
-
-    struct ggml_init_params cp = { ggml_tensor_overhead() * 64 + ggml_graph_overhead(), NULL, true };
-    struct ggml_context * ctx = ggml_init(cp);
-
-    struct ggml_tensor * wroteK = ggml_set_rows(ctx, poolK, kSrc, wIdx);
-    struct ggml_tensor * wroteV = ggml_set_rows(ctx, poolV, vSrc, wIdx);
-    struct ggml_tensor * gK = ggml_get_rows(ctx, wroteK, gIdx);          // [d, n_kv]
-    struct ggml_tensor * gV = ggml_get_rows(ctx, wroteV, gIdx);          // [d, n_kv]
-
-    struct ggml_tensor * kq    = ggml_mul_mat(ctx, gK, qT);              // [n_kv, n_q]
-    struct ggml_tensor * probs = ggml_soft_max_ext(ctx, kq, NULL, scale, 0.0f);
-    struct ggml_tensor * vT    = ggml_cont(ctx, ggml_transpose(ctx, gV)); // [n_kv, d]
-    struct ggml_tensor * out   = ggml_mul_mat(ctx, vT, probs);           // [d, n_q]
-    ggml_set_output(out);
-
-    struct ggml_cgraph * gf = ggml_new_graph(ctx);
-    ggml_build_forward_expand(gf, out);
-    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
-    assert(ggml_gallocr_alloc_graph(galloc, gf));
-    assert(ggml_backend_graph_compute(backend, gf) == GGML_STATUS_SUCCESS);
-
-    std::vector<float> got(d * n_q);
-    ggml_backend_tensor_get(out, got.data(), 0, ggml_nbytes(out));
-
-    // ---- compare ------------------------------------------------------------
-    double max_err = 0.0;
-    for (int i = 0; i < d * n_q; ++i) max_err = std::fmax(max_err, std::fabs(got[i] - ref[i]));
-    printf("paged attention max abs err vs host reference: %.3e\n", max_err);
-    assert(max_err < 1e-4 && "paged-gathered attention must match host reference");
-
-    ggml_gallocr_free(galloc);
-    ggml_free(ctx);
-    ggml_free(ctx_data);
-    ggml_backend_buffer_free(buf);
-    ggml_backend_free(backend);
-
-    printf("test_ggml_paged_attn: OK (attention over non-contiguous paged KV matches reference)\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_rw.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_rw.cpp
@@ -1,142 +0,0 @@
-// Phase 1 integration test: prove the paged KV write+read MECHANISM at the
-// ggml-op level, driven by PagedKVManager.
-//
-//   write:  ggml_set_rows(pool, k_src, slot_mapping)   // scatter by slot
-//   read:   ggml_get_rows(pool, gather_idx)            // gather seq's slots
-//
-// The decisive property: a sequence's physical blocks are NON-CONTIGUOUS and
-// OUT-OF-ORDER (forced via allocate/free/reallocate), yet gather(write(x)) == x,
-// and a second sequence written into disjoint blocks does not contaminate it.
-// This is exactly how a paged read path feeds contiguous scratch to attention.
-
-#include "../paged_kv_manager.h"
-
-#include "ggml.h"
-#include "ggml-cpu.h"
-#include "ggml-alloc.h"
-#include "ggml-backend.h"
-
-#include <cassert>
-#include <cstdio>
-#include <cmath>
-#include <vector>
-
-using namespace paged;
-
-int main() {
-    const int n_embd      = 8;
-    const int block_size  = 16;
-    const int num_blocks  = 8;                       // block 0 reserved as null
-    const int total_slots = block_size * num_blocks; // 128
-
-    // --- Force a non-contiguous, out-of-order block layout for seqC ----------
-    PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
-    assert(m.allocate(/*seqA=*/0, 2 * block_size)); // blocks {1,2}
-    assert(m.allocate(/*seqB=*/1, 2 * block_size)); // blocks {3,4}
-    m.free(0);                                       // returns {1,2} to free list
-    assert(m.allocate(/*seqC=*/2, 3 * block_size));  // reuses freed blocks, reordered
-
-    auto btC = m.block_table(2);
-    auto btB = m.block_table(1);
-    printf("seqC block_table = [");
-    for (size_t i = 0; i < btC.size(); ++i) printf("%s%d", i ? "," : "", btC[i]);
-    printf("]\n");
-    assert(btC.size() == 3);
-    // sanity: seqC and seqB occupy disjoint physical blocks
-    for (int cb : btC) for (int bb : btB) assert(cb != bb);
-
-    const int n_tokens = 3 * block_size; // 48 tokens for seqC
-
-    // slot_mapping for seqC positions 0..n_tokens-1
-    std::vector<int> positions(n_tokens);
-    for (int i = 0; i < n_tokens; ++i) positions[i] = i;
-    std::vector<int64_t> slots64 = m.slot_mapping(2, positions); // I64 for set_rows
-    std::vector<int32_t> slots32(slots64.begin(), slots64.end()); // I32 for get_rows
-
-    // seqB occupies different blocks; write a sentinel there to prove isolation.
-    std::vector<int> posB(2 * block_size);
-    for (size_t i = 0; i < posB.size(); ++i) posB[i] = (int) i;
-    std::vector<int64_t> slotsB64 = m.slot_mapping(1, posB);
-
-    // --- ggml backend + persistent (statically allocated) tensors ------------
-    ggml_backend_t backend = ggml_backend_cpu_init();
-    assert(backend);
-
-    struct ggml_init_params dp = { /*mem_size=*/ ggml_tensor_overhead() * 16,
-                                   /*mem_buffer=*/ NULL, /*no_alloc=*/ true };
-    struct ggml_context * ctx_data = ggml_init(dp);
-
-    // The shared paged KV pool: one flat block pool, exactly like a paged layer.
-    struct ggml_tensor * pool    = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, total_slots);
-    struct ggml_tensor * k_src   = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, n_tokens);
-    struct ggml_tensor * w_idx   = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, n_tokens);
-    struct ggml_tensor * g_idx   = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I32, n_tokens);
-    struct ggml_tensor * kB_src  = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, (int) posB.size());
-    struct ggml_tensor * wB_idx  = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, (int) posB.size());
-
-    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_data, backend);
-    assert(buf);
-
-    // pool starts zeroed
-    std::vector<float> zeros(n_embd * total_slots, 0.0f);
-    ggml_backend_tensor_set(pool, zeros.data(), 0, ggml_nbytes(pool));
-
-    // token t carries the value (float) t in every embedding lane -> easy to verify
-    std::vector<float> ksrc(n_embd * n_tokens);
-    for (int t = 0; t < n_tokens; ++t)
-        for (int e = 0; e < n_embd; ++e) ksrc[t * n_embd + e] = (float) t;
-    ggml_backend_tensor_set(k_src, ksrc.data(), 0, ggml_nbytes(k_src));
-    ggml_backend_tensor_set(w_idx, slots64.data(), 0, ggml_nbytes(w_idx));
-    ggml_backend_tensor_set(g_idx, slots32.data(), 0, ggml_nbytes(g_idx));
-
-    // seqB sentinel = 999 everywhere
-    std::vector<float> kBsrc(n_embd * posB.size(), 999.0f);
-    ggml_backend_tensor_set(kB_src, kBsrc.data(), 0, ggml_nbytes(kB_src));
-    ggml_backend_tensor_set(wB_idx, slotsB64.data(), 0, ggml_nbytes(wB_idx));
-
-    // --- compute graph: write seqB, write seqC, then gather seqC -------------
-    struct ggml_init_params cp = { /*mem_size=*/ ggml_tensor_overhead() * 32 + ggml_graph_overhead(),
-                                   /*mem_buffer=*/ NULL, /*no_alloc=*/ true };
-    struct ggml_context * ctx = ggml_init(cp);
-
-    struct ggml_tensor * wroteB = ggml_set_rows(ctx, pool,   kB_src, wB_idx); // view(pool)
-    struct ggml_tensor * wroteC = ggml_set_rows(ctx, wroteB, k_src,  w_idx);  // chain so order is fixed
-    struct ggml_tensor * gathered = ggml_get_rows(ctx, wroteC, g_idx);
-    ggml_set_output(gathered);
-
-    struct ggml_cgraph * gf = ggml_new_graph(ctx);
-    ggml_build_forward_expand(gf, gathered);
-
-    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
-    assert(ggml_gallocr_alloc_graph(galloc, gf));
-
-    assert(ggml_backend_graph_compute(backend, gf) == GGML_STATUS_SUCCESS);
-
-    // --- verify gather(write(x)) == x for the non-contiguous sequence --------
-    std::vector<float> out(n_embd * n_tokens);
-    ggml_backend_tensor_get(gathered, out.data(), 0, ggml_nbytes(gathered));
-
-    int mism = 0;
-    for (int t = 0; t < n_tokens; ++t)
-        for (int e = 0; e < n_embd; ++e)
-            if (std::fabs(out[t * n_embd + e] - (float) t) > 1e-6f) mism++;
-    assert(mism == 0 && "gathered paged KV must equal source (round-trip)");
-
-    // --- verify isolation: read seqC slots directly from pool, unaffected by seqB
-    std::vector<float> pool_host(n_embd * total_slots);
-    ggml_backend_tensor_get(pool, pool_host.data(), 0, ggml_nbytes(pool));
-    for (int t = 0; t < n_tokens; ++t) {
-        int slot = (int) slots64[t];
-        for (int e = 0; e < n_embd; ++e)
-            assert(std::fabs(pool_host[slot * n_embd + e] - (float) t) < 1e-6f);
-    }
-
-    ggml_gallocr_free(galloc);
-    ggml_free(ctx);
-    ggml_free(ctx_data);
-    ggml_backend_buffer_free(buf);
-    ggml_backend_free(backend);
-
-    printf("test_ggml_paged_rw: OK (non-contiguous paged write/gather round-trip)\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/tests/test_paged_kv_manager.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_paged_kv_manager.cpp
@@ -1,32 +0,0 @@
-#include "../paged_kv_manager.h"
-#include <cassert>
-#include <cstdio>
-using namespace paged;
-
-int main() {
-    PagedKVManager m(/*num_blocks=*/8, /*block_size=*/16, /*enable_caching=*/false);
-    // 20 tokens -> ceil(20/16)=2 blocks
-    assert(m.allocate(/*seq=*/0, 20));
-    auto bt = m.block_table(0);
-    assert(bt.size() == 2);
-
-    // slot arithmetic: pos 0 -> block bt[0]*16 + 0 ; pos 17 -> bt[1]*16 + 1
-    assert(m.slot(0, 0)  == (int64_t)bt[0] * 16 + 0);
-    assert(m.slot(0, 17) == (int64_t)bt[1] * 16 + 1);
-
-    auto sm = m.slot_mapping(0, {0, 16, 17});
-    assert(sm.size() == 3 && sm[1] == (int64_t)bt[1] * 16 + 0);
-
-    // growing the same seq reuses existing blocks, adds only new ones
-    assert(m.allocate(0, 40)); // ceil(40/16)=3 -> +1 block
-    assert(m.block_table(0).size() == 3);
-
-    // OOM: blocks left = 8 - 1(null) - 3 = 4 blocks; ask for 5 blocks
-    assert(m.allocate(1, 5 * 16) == false);
-
-    // free returns blocks to the pool for reuse
-    m.free(0);
-    assert(m.allocate(1, 5 * 16)); // now fits
-    printf("test_paged_kv_manager: OK\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/tests/test_prefix_cache.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_prefix_cache.cpp
@@ -1,35 +0,0 @@
-#include "../paged_kv_manager.h"
-#include <cassert>
-#include <cstdio>
-#include <vector>
-using namespace paged;
-
-int main() {
-    PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*enable_caching=*/true);
-
-    // shared prefix of 32 tokens (2 full blocks) + distinct suffix
-    std::vector<int> shared(32);
-    for (int i = 0; i < 32; ++i) shared[i] = 100 + i;
-
-    // chained hashing is deterministic and prefix-sensitive
-    auto h = m.compute_block_hashes(shared);
-    assert(h.size() == 2);
-    auto h2 = m.compute_block_hashes(shared);
-    assert(h == h2);                          // deterministic
-    std::vector<int> other = shared; other[0] = 999;
-    assert(m.compute_block_hashes(other)[0] != h[0]); // sensitive to content
-
-    // seq 0: cold, no cache hit yet
-    assert(m.get_computed_blocks(h) == 0);
-    assert(m.allocate(0, 32));
-    m.cache_blocks(0, h, 32);
-
-    // seq 1: warm — the 2 shared blocks are a cache hit (32 tokens)
-    assert(m.get_computed_blocks(h) == 32);
-
-    // first-miss stop: a chain that diverges after block 1 hits only 1 block
-    auto hmix = h; hmix[1] = 0xDEADBEEF;
-    assert(m.get_computed_blocks(hmix) == 16);
-    printf("test_prefix_cache: OK\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/patches/BENCHMARKS.md
+++ b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
@@ -1,106 +0,0 @@
-# Paged-attention / parity benchmarks (GB10 / DGX Spark)
-
-Goal of the series: vLLM parity. This records the measured gap so the parity claim is data-backed, not asserted.
-
-**Setup:** GB10 (sm_121, 119 GiB unified). Model Qwen3-Coder-30B-A3B. llama.cpp = pinned base + this series
-(MXFP4_MOE, `-fa 1 -b 2048 -ub 2048`, `llama-batched-bench`, PP=512 TG=128). vLLM = 0.23.0 FP8 (recorded
-prior run, same box/model). S_PP / S_TG are aggregate prefill / decode tok/s across B streams.
-
-## Fresh llama.cpp (this series, MXFP4) vs vLLM (FP8)
-
-| B | llama S_PP | vLLM S_PP | PP gap | llama S_TG | vLLM S_TG | TG gap |
-|---|-----------|-----------|--------|-----------|-----------|--------|
-| 1 | 1565 | 9644 | 6.2× | **83** | 48 | **llama wins** |
-| 8 | 3648 | 33373 | 9.1× | 126 | 312 | 2.5× |
-| 32 | 2074 | 99398 | 48× | 319 | 1171 | 3.7× |
-| 64 | 3643 | 151990 | 42× | 771 | 2064 | 2.7× |
-
-## Verdict — two distinct gaps, only one is the engine's
-
-1. **Prefill (S_PP): 6–48× behind, and it does NOT scale with B** (plateaus ~3.6k). This is the **FP4 MoE
-   GEMM kernel** (`mul_mat_q<MXFP4>` ~22 TFLOP/s), confirmed earlier. **Paged attention cannot close this** —
-   it's per-token compute. Needs the tcgen05/CUTLASS grouped-GEMM (Lever 3, multi-week, no upstream base).
-2. **Decode at concurrency (S_TG): 2.5–3.7× behind for B≥8** (we *win* at B=1). This gap IS partly the
-   engine's domain — vLLM's block-paged KV + continuous batching pack more concurrent decode work per step.
-   **This is what patches 0003–0006 target.** The win here is realistic; the prefill win is not (kernel).
-
-## CORRECTION — decode-phase profile (B=64, decode-dominated nsys)
-
-The "decode gap is engine-addressable" read above was **wrong**. Profiling a decode-dominated B=64 run:
-
-| kernel | % GPU time |
-|---|---|
-| `mul_mat_q<MXFP4>` (MoE GEMM) | **54.6** |
-| `flash_attn_ext` (attention) | 19.8 |
-| `mul_mat_q<Q8>` (dense) | 10.9 |
-| KV writes / quant / norms / rest | ~15 |
-
-**Decode at concurrency is ALSO dominated by the FP4 MoE GEMM (54.6%)** — the same Lever-3 kernel as prefill.
-Attention (the only thing paging optimizes) is ~20%, and the gather-read reclaims only the *masked-cell*
-fraction of that. So **the paged series (0003–0006) cannot close the vLLM gap in either phase** — both are
-MoE-kernel-bound. vLLM's concurrency advantage is its MoE/attention *kernels*, not (mainly) its KV management.
-
-### What the paged series IS still good for (just not throughput parity)
-
-- **Capacity**: block-granular + on-demand allocation → fit more/longer concurrent sequences in fixed VRAM.
-- **Prefix sharing**: cross-request block dedup → lower TTFT + memory on shared system prompts / RAG.
-
-These are real wins on *memory-pressured* and *shared-prefix* workloads — but they are not tok/s parity, and
-batched-bench (fresh, non-fragmented, no shared prefix) won't show them.
-
-## DENSE model parity (Qwen3-32B) — does the kernel gap exist for dense too? YES.
-
-The MoE work above is about the grouped MoE GEMM. Dense models use a different (non-grouped) matmul path,
-so we benchmarked a dense 32B head-to-head.
-
-**Headline comparison — vLLM NVFP4 W4A16 vs llama.cpp Q4_K_M.** This is the *correct apples-to-apples on
-DGX Spark*: both are **4-bit weights / 16-bit activations** (same quant class). vLLM = `Qwen3-32B-NVFP4A16`
-(FlashInfer Marlin W4A16 kernel); llama.cpp = `Qwen3-32B-Q4_K_M` (int8-MMQ compute). The only difference is
-the compute kernel — which is exactly what we're measuring. (Full **W4A4** NVFP4 does not run on GB10 today;
-root cause below — and it would *not* be a fair comparison even if it did, since Q4_K_M is also weight-only-4-bit.)
-
-| B | llama Q4_K_M PP | vLLM W4A16 PP | PP gap | llama decode | vLLM decode | TG gap |
-|---|---|---|---|---|---|---|
-| 1 | 708 | 5367 | 7.6× | 10.2 | 11.7 | ~parity |
-| 8 | 761 | 14941 | 20× | 58 | 92 | 1.6× |
-| 32 | 763 | 21952 | 29× | 205 | 330 | 1.6× |
-| 64 | 765 | 24444 | 32× | 253 | 569 | 2.2× |
-
-**Findings:**
-1. **Dense prefill has the SAME (larger) kernel gap.** llama dense prefill plateaus at ~765 t/s regardless of
-   B; vLLM scales to 24.4k (32×). Both read 4-bit weights — the gap is the compute kernel: vLLM's FP4 Marlin
-   tensor-core GEMM vs llama's int8-MMQ. (Note: on consumer Blackwell, W4A16 Marlin is also reported *faster*
-   than the experimental W4A4 path, so W4A16 isn't a handicapped stand-in — it's the fast path.)
-2. **Decode is ~parity at B=1** (10.2 vs 11.7 — both weight-bandwidth-bound reading 4-bit weights), and the
-   gap grows with batch (compute starts to matter → the kernel gap reappears: 2.2× at B=64).
-3. **Scope decision (the reason for this benchmark): the Lever-3 kernel track must also deliver a NON-grouped
-   block-scaled FP4 GEMM for dense**, not only the MoE grouped GEMM. The dense GEMM is the simpler of the two
-   (a plain CUTLASS dense GEMM), so it's a good first kernel to land — and it benefits every dense model.
-   - **No cheap lever:** `GGML_CUDA_FORCE_CUBLAS` is a **no-op for dense too** (Q4_K pp512: 720.8 vs 721.8) —
-     dequant→cuBLAS-BF16 doesn't engage / isn't faster than int8-MMQ on GB10. With ubatch (saturates) and
-     nwarps (static_assert) already ruled out for MoE, **every config/flag lever is now exhausted** for both
-     model classes. Parity is strictly the FP4 tensor-core kernel.
-4. **Why full W4A4 NVFP4 hangs on GB10 (root cause, researched).** This is a *known consumer-Blackwell
-   limitation, not a misconfiguration*. **FlashInfer ships no FP4 cubins for sm_120/sm_121** — its precompiled
-   kernels are all datacenter `Sm100a/Sm103a` (B200/B300). So on GB10 the dense `mm_fp4` W4A4 GEMM has no
-   working kernel: the optimized path is gated off for sm_121 (heuristic checks `minor==0`; 12.1 fails), the
-   CUTLASS dense FP4 fallback is documented to silently return **all-zeros**, and TRT-LLM errors at capability
-   120. Our exact symptom — loads weights, then stalls at the first profiling forward pass with
-   `enable_flashinfer_autotune=True` at 0–3% GPU — is the **FlashInfer FP4 autotuner/JIT spinning on an arch
-   with no FP4 cubins** (matches vllm #30163/#26381, flashinfer #2577/#3294). The "NVFP4 on DGX Spark" story
-   everyone cites is about *quantization + memory footprint + W4A16/MoE*, **not dense W4A4 inference**, which
-   isn't validated on sm_121 yet (where people patched it working, it was slower than W4A16 anyway).
-   **Therefore W4A16 vs Q4_K_M above is the right, reproducible apples-to-apples** for DGX Spark today.
-   Optional W4A4 retry (verify output isn't zeros first): `VLLM_SKIP_FLASHINFER_AUTOTUNE=1` +
-   `VLLM_NVFP4_GEMM_BACKEND=cutlass` + `--enforce-eager`, or NVIDIA's `vllm/vllm-openai:cu130-nightly` container.
-
-## So, honestly, where parity stands
-
-- **Decode single-stream: already at/above parity** (B=1: 83 vs 48).
-- **Decode concurrency: a real 2.5–3.7× gap at B≥8, but not engine-addressable.** Per the CORRECTION above, it is
-  MoE-kernel-bound, so the paged series (0004 on-demand pool + 0005 continuous batching) cannot close it.
-- **Prefill: kernel-bound, not engine-bound.** No amount of paging reaches vLLM here; that's a separate track.
-
-**Series status when measured:** 0001 (vendor) + 0002 (placement, token-identical) done; 0003 (gather-read)
-turn-key-planned, not yet implemented. These numbers are the *baseline* for B≥8 decode; re-run this table
-after 0004/0005 to measure whether the concurrency gap changes.
--- a/backend/cpp/llama-cpp/patches/README.md
+++ b/backend/cpp/llama-cpp/patches/README.md
@@ -1,82 +0,0 @@
-# llama.cpp patch series — paged attention (vLLM-parity engine)
-
-A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
-in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
-(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
-`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us
-**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
-patch needs fixing, and the failure points at exactly which step the upstream change touched.
-
-## Base
-
-- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
-  the pin = re-run the regen workflow below and fix only the patches that no longer apply.
-
-## The series (phases → patches)
-
-| # | Patch | What | Verifies |
-|---|-------|------|----------|
-| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately under `../paged/` |
-| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
-| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
-| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
-| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
-| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
-
-Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off
-(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so a partial series never changes stock behavior.
-
-## Regen workflow (the anti-drift recipe)
-
-```sh
-# 1. check out the exact pin into a dev tree
-git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
-git checkout <LLAMA_VERSION from ../Makefile>
-git checkout -b paged
-
-# 2. apply the current series (each becomes a commit), or develop the next patch
-git am /path/to/backend/cpp/llama-cpp/patches/00*.patch     # or `git apply` + commit per patch
-
-# 3. iterate a phase as ONE commit, then export the whole series 1:1
-git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp/patches/ --zero-commit -N
-
-# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
-```
-
-## Build integration
-
-`../Makefile`'s `llama.cpp:` target runs, after `git checkout -b build $(LLAMA_VERSION)`:
-```
-for p in $(CURRENT_MAKEFILE_DIR)/patches/0*.patch; do git apply --verbose "$p"; done
-```
-All variants (avx/avx2/avx512/cuda/…) copy the patched `llama.cpp/` tree, so the series ships everywhere.
-
-## Status
-
-- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`.
-- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
-  **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
-- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
-  (see `paged/README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
-  subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
-  `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
-  `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
-  **9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
-  compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
-  - **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
-    flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
-    scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
-    the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit-
-    identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what
-    makes paged placement token-identical under flash-attn.**
-- 0004–0006 follow.
-
-### Honest parity note (important)
-
-This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
-by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
-(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
-touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
-gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
-`../paged/UPSTREAM_GGML_ISSUE.md` and `DGX_BLACKWELL_PLAN.md`). So full vLLM parity = this series **AND** the
-kernel; neither alone suffices.
--- a/backend/cpp/llama-cpp/patches/kernel/0001-fp4-grouped-moe-scaffold.patch
+++ b/backend/cpp/llama-cpp/patches/kernel/0001-fp4-grouped-moe-scaffold.patch
@@ -1,91 +0,0 @@
-diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cu b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
-new file mode 100644
-index 0000000..5f5a782
---- /dev/null
-+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
-@@ -0,0 +1,46 @@
-+#include "fp4-grouped-moe.cuh"
-+
-+#include <cstdlib>
-+#include <cstdio>
-+
-+// SCAFFOLD for the FP4 grouped-GEMM MoE kernel (Lever 3).
-+//
-+// Why: on GB10 (sm_121) the MoE matmul runs mul_mat_q<MXFP4> - a warp-level mma.sync grouped MMQ -
-+// at ~22 effective TFLOP/s, ~27x behind vLLM prefill, and it also dominates decode at concurrency
-+// (54.6% of GPU time at B=64). It is the single bottleneck to vLLM parity in BOTH phases; paged
-+// attention cannot touch it (proven by profiling). The fix is a CUTLASS-3.x collective-mainloop
-+// grouped GEMM over all experts, block-scaled e2m1 operands via tcgen05 tensor-memory MMA.
-+//
-+// This file is the integration seam. It is currently a no-op that always falls back to MMQ, so the
-+// default build is byte-identical. The kernel is filled in over the phases in the design doc.
-+
-+static bool fp4_grouped_enabled() {
-+    static const bool en = (std::getenv("GGML_CUDA_FP4_GROUPED") != nullptr);
-+    return en;
-+}
-+
-+bool ggml_cuda_fp4_grouped_moe(
-+        ggml_backend_cuda_context & ctx,
-+        const ggml_tensor * src0,
-+        const ggml_tensor * src1,
-+        const ggml_tensor * ids,
-+        ggml_tensor       * dst) {
-+    GGML_UNUSED(ctx); GGML_UNUSED(src1); GGML_UNUSED(ids); GGML_UNUSED(dst);
-+
-+    if (!fp4_grouped_enabled()) {
-+        return false; // default: existing MMQ path
-+    }
-+    if (src0->type != GGML_TYPE_MXFP4 && src0->type != GGML_TYPE_NVFP4) {
-+        return false;
-+    }
-+
-+    // TODO(kernel - see kernel design doc): CUTLASS 3.x GemmGrouped, sm_120a, block-scaled e2m1,
-+    // tcgen05 MMA; per-expert problem offsets from `ids`; fused activation quant; numerical parity
-+    // vs mul_mat_q<MXFP4> before enabling by default.
-+    static bool warned = false;
-+    if (!warned) {
-+        warned = true;
-+        fprintf(stderr, "[fp4-grouped] GGML_CUDA_FP4_GROUPED set, kernel not yet implemented - using MMQ\n");
-+    }
-+    return false; // scaffold: fall back until the kernel lands
-+}
-diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cuh b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
-new file mode 100644
-index 0000000..29e1b5a
---- /dev/null
-+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
-@@ -0,0 +1,13 @@
-+#pragma once
-+
-+#include "common.cuh"
-+
-+// Entry point for the tcgen05/CUTLASS block-scaled FP4 (MXFP4/NVFP4) grouped-GEMM MoE kernel for
-+// Blackwell consumer GPUs (sm_120/121). Returns true if it handled the op; false to fall back to
-+// the existing warp-mma MMQ path. Gated behind GGML_CUDA_FP4_GROUPED until correct + faster.
-+bool ggml_cuda_fp4_grouped_moe(
-+        ggml_backend_cuda_context & ctx,
-+        const ggml_tensor * src0,   // expert weights, MXFP4/NVFP4 [n_embd, n_ff, n_expert]
-+        const ggml_tensor * src1,   // activations, F32 [n_embd, n_tokens, ...]
-+        const ggml_tensor * ids,    // expert routing, I32
-+        ggml_tensor       * dst);   // F32 output
-diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
-index 8ea462a..104d131 100644
---- a/ggml/src/ggml-cuda/ggml-cuda.cu
-+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
-@@ -30,6 +30,7 @@
- #include "ggml-cuda/im2col.cuh"
- #include "ggml-cuda/mmf.cuh"
- #include "ggml-cuda/mmq.cuh"
-+#include "ggml-cuda/fp4-grouped-moe.cuh"
- #include "ggml-cuda/mmvf.cuh"
- #include "ggml-cuda/mmvq.cuh"
- #include "ggml-cuda/norm.cuh"
-@@ -2701,6 +2702,7 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
-         }
- 
-         if (ggml_cuda_should_use_mmq(src0->type, cc, ne12, /*n_experts=*/ne02)) {
-+            if (ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst)) { return; }
-             ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
-             return;
-         }
--- a/backend/cpp/llama-cpp/patches/paged/0001-vendor-paged-kv-manager.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0001-vendor-paged-kv-manager.patch
@@ -1,447 +0,0 @@
-From bef64835d444a44ed8391bc395cdab38164229d5 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Fri, 19 Jun 2026 22:54:49 +0000
-Subject: [PATCH] vendor paged kv manager
-
-vLLM-parity host-side KV block manager (FreeBlockQueue, BlockPool,
-PagedKVManager, chained-hash prefix cache). Pure C++17, no behavior change -
-nothing uses it yet; wired in by later patches in the series.
----
- src/CMakeLists.txt       |   1 +
- src/paged-kv-manager.cpp | 296 +++++++++++++++++++++++++++++++++++++++
- src/paged-kv-manager.h   | 108 ++++++++++++++
- 3 files changed, 405 insertions(+)
- create mode 100644 src/paged-kv-manager.cpp
- create mode 100644 src/paged-kv-manager.h
-
-diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
-index d15ccfd99..a030940b8 100644
---- a/src/CMakeLists.txt
-+++ b/src/CMakeLists.txt
-@@ -24,6 +24,7 @@ add_library(llama
-             llama-io.cpp
-             llama-kv-cache.cpp
-             llama-kv-cache-iswa.cpp
-+            paged-kv-manager.cpp
-             llama-kv-cache-dsa.cpp
-             llama-memory.cpp
-             llama-memory-hybrid.cpp
-diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
-new file mode 100644
-index 000000000..ca0dcd83a
---- /dev/null
-+++ b/src/paged-kv-manager.cpp
-@@ -0,0 +1,296 @@
-+#include "paged-kv-manager.h"
-+#include <cassert>
-+#include <stdexcept>
-+
-+namespace paged {
-+
-+// ---------------------------------------------------------------------------
-+// FreeBlockQueue  (port of kv_cache_utils.py FreeKVCacheBlockQueue)
-+// ---------------------------------------------------------------------------
-+
-+FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
-+    num_free_blocks = blocks.size();
-+    for (size_t i = 0; i < blocks.size(); ++i) {
-+        if (i > 0)                  blocks[i]->prev_free = blocks[i - 1];
-+        if (i + 1 < blocks.size())  blocks[i]->next_free = blocks[i + 1];
-+    }
-+    if (!blocks.empty()) {
-+        fake_head.next_free = blocks.front();
-+        blocks.front()->prev_free = &fake_head;
-+        fake_tail.prev_free = blocks.back();
-+        blocks.back()->next_free = &fake_tail;
-+    } else {
-+        fake_head.next_free = &fake_tail;
-+        fake_tail.prev_free = &fake_head;
-+    }
-+}
-+
-+KVCacheBlock* FreeBlockQueue::popleft() {
-+    KVCacheBlock* first = fake_head.next_free;
-+    if (first == &fake_tail || first == nullptr) {
-+        assert(num_free_blocks == 0);
-+        throw std::runtime_error("No free blocks available");
-+    }
-+    fake_head.next_free = first->next_free;
-+    first->next_free->prev_free = &fake_head;
-+    first->prev_free = first->next_free = nullptr;
-+    num_free_blocks--;
-+    return first;
-+}
-+
-+std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
-+    std::vector<KVCacheBlock*> ret;
-+    if (n == 0) return ret;
-+    assert(num_free_blocks >= n);
-+    num_free_blocks -= n;
-+    KVCacheBlock* curr = fake_head.next_free;
-+    ret.reserve(n);
-+    for (size_t i = 0; i < n; ++i) {
-+        assert(curr != nullptr);
-+        ret.push_back(curr);
-+        KVCacheBlock* last = curr;
-+        curr = curr->next_free;
-+        last->prev_free = last->next_free = nullptr;
-+    }
-+    if (curr != nullptr) {
-+        fake_head.next_free = curr;
-+        curr->prev_free = &fake_head;
-+    }
-+    return ret;
-+}
-+
-+void FreeBlockQueue::remove(KVCacheBlock* block) {
-+    if (!block->prev_free || !block->next_free)
-+        throw std::runtime_error("remove() called on an invalid block");
-+    block->prev_free->next_free = block->next_free;
-+    block->next_free->prev_free = block->prev_free;
-+    block->prev_free = block->next_free = nullptr;
-+    num_free_blocks--;
-+}
-+
-+void FreeBlockQueue::append(KVCacheBlock* block) {
-+    KVCacheBlock* last = fake_tail.prev_free;
-+    last->next_free = block;
-+    block->prev_free = last;
-+    block->next_free = &fake_tail;
-+    fake_tail.prev_free = block;
-+    num_free_blocks++;
-+}
-+
-+void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
-+    if (blocks.empty()) return;
-+    KVCacheBlock* last = fake_tail.prev_free;
-+    for (KVCacheBlock* b : blocks) {
-+        b->prev_free = last;
-+        last->next_free = b;
-+        last = b;
-+    }
-+    last->next_free = &fake_tail;
-+    fake_tail.prev_free = last;
-+    num_free_blocks += blocks.size();
-+}
-+
-+void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
-+    if (blocks.empty()) return;
-+    KVCacheBlock* first = fake_head.next_free;
-+    KVCacheBlock* prev = &fake_head;
-+    for (KVCacheBlock* b : blocks) {
-+        b->prev_free = prev;
-+        prev->next_free = b;
-+        prev = b;
-+    }
-+    prev->next_free = first;
-+    first->prev_free = prev;
-+    num_free_blocks += blocks.size();
-+}
-+
-+std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
-+    std::vector<KVCacheBlock*> ret;
-+    const KVCacheBlock* curr = fake_head.next_free;
-+    while (curr && curr->next_free != nullptr) {
-+        ret.push_back(const_cast<KVCacheBlock*>(curr));
-+        curr = curr->next_free;
-+    }
-+    return ret;
-+}
-+
-+// ---------------------------------------------------------------------------
-+// BlockPool  (port of block_pool.py)
-+// ---------------------------------------------------------------------------
-+
-+static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
-+    std::vector<KVCacheBlock*> p;
-+    p.reserve(v.size());
-+    for (auto& b : v) p.push_back(&b);
-+    return p;
-+}
-+
-+static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
-+    std::vector<KVCacheBlock> v;
-+    v.reserve(num_blocks);
-+    for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
-+    return v;
-+}
-+
-+BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
-+    : enable_caching_(enable_caching),
-+      blocks_(make_block_vec(num_blocks)),
-+      ptrs_(make_ptrs(blocks_)),
-+      free_queue_(ptrs_) {
-+    // vLLM reserves block_id 0 as the null block (never cached).
-+    null_block = free_queue_.popleft();
-+    null_block->is_null = true;
-+}
-+
-+bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
-+    if (!block->has_hash) return false;
-+    auto it = cached_block_hash_to_block_.find(block->block_hash);
-+    if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
-+    cached_block_hash_to_block_.erase(it);
-+    block->reset_hash();
-+    return true;
-+}
-+
-+std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
-+    if (n > get_num_free_blocks())
-+        throw std::runtime_error("Cannot get free blocks from pool");
-+    auto ret = free_queue_.popleft_n(n);
-+    for (KVCacheBlock* b : ret) {
-+        if (enable_caching_) maybe_evict_cached_block(b);
-+        assert(b->ref_cnt == 0);
-+        b->ref_cnt += 1;
-+    }
-+    return ret;
-+}
-+
-+KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
-+    auto it = cached_block_hash_to_block_.find(block_hash);
-+    return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
-+}
-+
-+void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
-+    for (KVCacheBlock* b : blocks) {
-+        // ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
-+        if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
-+        b->ref_cnt += 1;
-+    }
-+}
-+
-+void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
-+    std::vector<KVCacheBlock*> without_hash, with_hash;
-+    for (KVCacheBlock* b : ordered_blocks) {
-+        if (b->is_null) continue;
-+        b->ref_cnt -= 1;
-+        if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
-+    }
-+    free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
-+    free_queue_.append_n(with_hash);     // hashed: kept warm (tail)
-+}
-+
-+void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
-+                                  size_t num_cached_blocks, size_t num_full_blocks,
-+                                  const std::vector<uint64_t>& block_hashes) {
-+    for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
-+        KVCacheBlock* blk = req_blocks[i];
-+        if (blk->has_hash) continue;
-+        blk->has_hash = true;
-+        blk->block_hash = block_hashes[i];
-+        cached_block_hash_to_block_[blk->block_hash] = blk;
-+    }
-+}
-+
-+// ---------------------------------------------------------------------------
-+// PagedKVManager  (port of SingleTypeKVCacheManager / FullAttentionManager)
-+// ---------------------------------------------------------------------------
-+
-+static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
-+
-+PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
-+    : block_size_(block_size), pool_(num_blocks, enable_caching) {}
-+
-+bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
-+    auto& req = req_to_blocks_[seq_id];
-+    size_t need = cdiv(total_tokens, block_size_);
-+    if (need <= req.size()) return true;
-+    size_t add = need - req.size();
-+    if (add > pool_.get_num_free_blocks()) return false; // OOM
-+    auto nb = pool_.get_new_blocks(add);
-+    req.insert(req.end(), nb.begin(), nb.end());
-+    return true;
-+}
-+
-+std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
-+    std::vector<int32_t> bt;
-+    auto it = req_to_blocks_.find(seq_id);
-+    if (it == req_to_blocks_.end()) return bt;
-+    bt.reserve(it->second.size());
-+    for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
-+    return bt;
-+}
-+
-+int64_t PagedKVManager::slot(int seq_id, int pos) const {
-+    const auto& req = req_to_blocks_.at(seq_id);
-+    int32_t phys = req[pos / block_size_]->block_id;
-+    return (int64_t)phys * block_size_ + (pos % block_size_);
-+}
-+
-+std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
-+    std::vector<int64_t> sm;
-+    sm.reserve(positions.size());
-+    for (int p : positions) sm.push_back(slot(seq_id, p));
-+    return sm;
-+}
-+
-+void PagedKVManager::free(int seq_id) {
-+    auto it = req_to_blocks_.find(seq_id);
-+    if (it == req_to_blocks_.end()) return;
-+    // Free in reverse so the tail of the block chain is evicted first (vLLM order).
-+    std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
-+    pool_.free_blocks(ordered);
-+    req_to_blocks_.erase(it);
-+}
-+
-+// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
-+// hash into the seed so each block hash transitively encodes its whole prefix
-+// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
-+uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
-+    uint64_t h = 1469598103934665603ull ^ parent_hash;
-+    for (int t : token_ids) {
-+        h ^= (uint64_t)(uint32_t)t;
-+        h *= 1099511628211ull;
-+    }
-+    if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
-+    return h;
-+}
-+
-+std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
-+    std::vector<uint64_t> hashes;
-+    uint64_t parent = 0; // NONE_HASH analogue
-+    size_t n_full = token_ids.size() / block_size_;
-+    for (size_t i = 0; i < n_full; ++i) {
-+        std::vector<int> blk(token_ids.begin() + i * block_size_,
-+                             token_ids.begin() + (i + 1) * block_size_);
-+        parent = hash_block(parent, blk);
-+        hashes.push_back(parent);
-+    }
-+    return hashes;
-+}
-+
-+size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
-+    std::vector<KVCacheBlock*> hits;
-+    for (uint64_t bh : block_hashes) {        // stop at first miss (prefix property)
-+        KVCacheBlock* cb = pool_.get_cached_block(bh);
-+        if (!cb) break;
-+        hits.push_back(cb);
-+    }
-+    pool_.touch(hits);                        // ++ref_cnt, pull from free list
-+    return hits.size() * (size_t)block_size_;
-+}
-+
-+void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
-+    auto& req = req_to_blocks_[seq_id];
-+    size_t n_full = num_tokens / block_size_;
-+    pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
-+}
-+
-+} // namespace paged
-diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
-new file mode 100644
-index 000000000..740280a7f
---- /dev/null
-+++ b/src/paged-kv-manager.h
-@@ -0,0 +1,108 @@
-+#pragma once
-+// Paged KV cache block manager for llama.cpp (CPU-first prototype).
-+//
-+// Host-side block management is a faithful port of vLLM V1:
-+//   vllm/v1/core/kv_cache_utils.py            (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
-+//   vllm/v1/core/block_pool.py                (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
-+//   vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
-+//
-+// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
-+// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
-+// dependency so it can be unit-tested in isolation.
-+
-+#include <cstdint>
-+#include <vector>
-+#include <unordered_map>
-+#include <map>
-+
-+namespace paged {
-+
-+// vLLM KVCacheBlock (kv_cache_utils.py).
-+struct KVCacheBlock {
-+    int32_t  block_id   = 0;
-+    int      ref_cnt    = 0;
-+    bool     has_hash   = false;   // vLLM: _block_hash is set only when full+cached
-+    uint64_t block_hash = 0;
-+    bool     is_null    = false;
-+    KVCacheBlock* prev_free = nullptr;
-+    KVCacheBlock* next_free = nullptr;
-+
-+    explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
-+    void reset_hash() { has_hash = false; block_hash = 0; }
-+};
-+
-+// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
-+// O(1) middle removal is required so touch() can pull a warm cached block out of the
-+// free list when a later request hits its prefix.
-+class FreeBlockQueue {
-+public:
-+    size_t num_free_blocks = 0;
-+
-+    explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
-+    KVCacheBlock* popleft();
-+    std::vector<KVCacheBlock*> popleft_n(size_t n);
-+    void remove(KVCacheBlock* block);
-+    void append(KVCacheBlock* block);
-+    void append_n(const std::vector<KVCacheBlock*>& blocks);
-+    void prepend_n(const std::vector<KVCacheBlock*>& blocks);
-+    std::vector<KVCacheBlock*> get_all_free_blocks() const;
-+
-+private:
-+    KVCacheBlock fake_head{-1};
-+    KVCacheBlock fake_tail{-1};
-+};
-+
-+// vLLM BlockPool (block_pool.py).
-+class BlockPool {
-+public:
-+    KVCacheBlock* null_block = nullptr;
-+
-+    BlockPool(int32_t num_blocks, bool enable_caching);
-+    std::vector<KVCacheBlock*> get_new_blocks(size_t n);
-+    KVCacheBlock* get_cached_block(uint64_t block_hash);
-+    void touch(const std::vector<KVCacheBlock*>& blocks);
-+    void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
-+    void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
-+                           size_t num_cached_blocks, size_t num_full_blocks,
-+                           const std::vector<uint64_t>& block_hashes);
-+    size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
-+
-+private:
-+    bool maybe_evict_cached_block(KVCacheBlock* block);
-+
-+    bool enable_caching_;
-+    std::vector<KVCacheBlock> blocks_;     // owns all block descriptors
-+    std::vector<KVCacheBlock*> ptrs_;
-+    FreeBlockQueue free_queue_;
-+    // vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
-+    // prototype keeps the last writer (single KV-cache group is sufficient for the wins).
-+    std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
-+};
-+
-+// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
-+// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
-+class PagedKVManager {
-+public:
-+    PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
-+
-+    // Grow seq_id to cover total_tokens slots. Returns false on OOM (free queue empty).
-+    bool allocate(int seq_id, size_t total_tokens);
-+    std::vector<int32_t> block_table(int seq_id) const;
-+    int64_t slot(int seq_id, int pos) const;
-+    std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
-+    void free(int seq_id);
-+    int block_size() const { return block_size_; }
-+
-+    // Prefix caching (win 3).
-+    static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
-+    std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
-+    size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
-+    void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
-+
-+protected:
-+    int block_size_;
-+    BlockPool pool_;
-+    std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
-+};
-+
-+} // namespace paged
--- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch
@@ -1,75 +0,0 @@
-From 5c9c709e6c6b07e0399b75fd4e46e752d418a9a8 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Fri, 19 Jun 2026 23:04:17 +0000
-Subject: [PATCH] paged kv block placement (env LLAMA_KV_PAGED)
-
-Place each sequence's tokens at permuted, non-contiguous fixed-size block
-positions in find_slot, proving attention is invariant to physical KV placement
-(token-identical greedy generation). Default off; single-sequence scope; falls
-back to the normal allocator. The paged-placement substrate for the gather-read.
----
- src/llama-kv-cache.cpp | 41 +++++++++++++++++++++++++++++++++++++++++
- 1 file changed, 41 insertions(+)
-
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 2802103bd..999e2ae61 100644
---- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -11,6 +11,8 @@
- #include <cstring>
- #include <limits>
- #include <map>
-+#include <numeric>
-+#include <cstdlib>
- #include <stdexcept>
- 
- static bool ggml_is_power_of_2(int n) {
-@@ -1020,6 +1022,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
-             return { };
-         }
- 
-+        // [paged, experimental] Place this sequence's tokens at permuted,
-+        // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
-+        // This validates that attention is invariant to physical KV placement -
-+        // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
-+        // Single-sequence scope (uses get_used() as the logical base); falls back
-+        // to the normal allocator if the permuted cells aren't available.
-+        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
-+        if (paged_mode) {
-+            const uint32_t bs   = 16;                 // block size (tokens/block)
-+            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
-+            if (nblk >= 2) {
-+                // stride coprime to nblk => block-index permutation is a bijection
-+                uint32_t k = 1;
-+                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
-+                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
-+                }
-+                const uint32_t base = cells.get_used();
-+                bool ok = true;
-+                for (uint32_t i = 0; i < n_tokens; ++i) {
-+                    const uint32_t L    = base + i;
-+                    const uint32_t b    = L / bs;
-+                    const uint32_t off  = L % bs;
-+                    if (b >= nblk) { ok = false; break; }
-+                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
-+                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
-+                    res.idxs[s].push_back(phys);
-+                }
-+                if (ok && res.idxs[s].size() == n_tokens) {
-+                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
-+                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
-+                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
-+                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
-+                    }
-+                    continue; // paged placement succeeded for this sequence
-+                }
-+                res.idxs[s].clear(); // fall back to the normal allocator
-+            }
-+        }
-+
-         uint32_t n_tested = 0;
- 
-         // for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
--- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch
@@ -1,369 +0,0 @@
-From c1de00f4cc1eb0dd25993880bb4c8562be1937d4 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 10:24:22 +0200
-Subject: [PATCH] paged gather-read (env LLAMA_KV_PAGED) - patch 0003
-
-Gather K, V and the kq_mask down to each sequence stream's non-empty cells
-before build_attn_mha. Position-sorted per stream so the flash-attn online
-softmax reduction order matches stock byte-for-byte. Multi-stream: one index
-column per stream over k->ne[3], padded to the max non-empty count with a
-masked (empty) cell. Gated behind LLAMA_KV_PAGED; no-op when unset.
----
- src/CMakeLists.txt     |   1 +
- src/llama-graph.cpp    |   9 ++-
- src/llama-kv-cache.cpp |  74 ++++++++++++++++++++++++
- src/llama-kv-cache.h   |  11 ++++
- src/paged-attn.cpp     | 128 +++++++++++++++++++++++++++++++++++++++++
- src/paged-attn.h       |  40 +++++++++++++
- 6 files changed, 262 insertions(+), 1 deletion(-)
- create mode 100644 src/paged-attn.cpp
- create mode 100644 src/paged-attn.h
-
-diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
-index a030940..58083b3 100644
---- a/src/CMakeLists.txt
-+++ b/src/CMakeLists.txt
-@@ -25,6 +25,7 @@ add_library(llama
-             llama-kv-cache.cpp
-             llama-kv-cache-iswa.cpp
-             paged-kv-manager.cpp
-+            paged-attn.cpp
-             llama-kv-cache-dsa.cpp
-             llama-memory.cpp
-             llama-memory-hybrid.cpp
-diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
-index 68c9e60..b59d2a5 100644
---- a/src/llama-graph.cpp
-+++ b/src/llama-graph.cpp
-@@ -6,6 +6,8 @@
- #include "llama-cparams.h"
- 
- #include "llama-kv-cache.h"
-+
-+#include "paged-attn.h"
- #include "llama-kv-cache-iswa.h"
- #include "llama-kv-cache-dsa.h"
- #include "llama-memory-hybrid.h"
-@@ -2356,7 +2358,12 @@ ggml_tensor * llm_graph_context::build_attn(
-     ggml_tensor * k = mctx_cur->get_k(ctx0, il);
-     ggml_tensor * v = mctx_cur->get_v(ctx0, il);
- 
--    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
-+    // [paged 0003] gather K, V and the mask to the sequence's used cells only
-+    //   (no-op unless env LLAMA_KV_PAGED is set).
-+    ggml_tensor * kq_mask_g = kq_mask;
-+    paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
-+
-+    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
-     cb(cur, "kqv_out", il);
- 
-     if (inp->self_v_rot) {
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 999e2ae..30d02d7 100644
---- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -1,4 +1,6 @@
- #include "llama-kv-cache.h"
-+#include <vector>
-+#include <utility>
- 
- #include "llama-impl.h"
- #include "llama-io.h"
-@@ -1329,6 +1331,70 @@ ggml_tensor * llama_kv_cache::get_v(ggml_context * ctx, int32_t il, uint32_t n_k
-             ggml_row_size(v->type, kv_size*n_embd_v_gqa)*sinfo.s0);
- }
- 
-+// [paged 0003] gather-read: enumerate the non-empty cells in [0, n_kv) for the
-+// single stream addressed by sinfo. With paged placement (patch 0002) these are
-+// the sequence's scattered block cells; gathering K/V/mask by this index list
-+// compacts the attention read while preserving every unmasked (token,cell) pair.
-+uint32_t llama_kv_cache::get_n_gather(uint32_t n_kv, const slot_info & sinfo) const {
-+    // Multi-stream: the gathered K/V/mask tensors are rectangular [.., n_gather,
-+    // n_stream], so n_gather is the MAX non-empty count across the batch streams.
-+    // Streams with fewer cells are padded (see get_gather_idxs) with a masked
-+    // (empty) cell index, which contributes exp(-inf)=0 and is thus a no-op.
-+    // K is laid out over physical streams [s0, s1]; index v_cells the same way.
-+    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
-+    uint32_t mx = 0;
-+    for (uint32_t j = 0; j < ns; ++j) {
-+        const auto & cells = v_cells[sinfo.s0 + j];
-+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
-+        uint32_t cnt = 0;
-+        for (uint32_t i = 0; i < n; ++i) {
-+            if (!cells.is_empty(i)) {
-+                ++cnt;
-+            }
-+        }
-+        mx = std::max(mx, cnt);
-+    }
-+    return mx;
-+}
-+
-+void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const {
-+    const uint32_t ns       = sinfo.s1 - sinfo.s0 + 1;
-+    const uint32_t n_gather = get_n_gather(n_kv, sinfo);
-+    // dst is [n_gather, n_stream] (ne0 = n_gather): column s at dst[s*n_gather..].
-+    for (uint32_t j = 0; j < ns; ++j) {
-+        const auto & cells = v_cells[sinfo.s0 + j];
-+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
-+        // Collect the non-empty cells, then order them by token POSITION (not by
-+        // physical cell index). The attention reduction (flash-attn online
-+        // softmax, and the non-flash soft_max) runs over cells in array order and
-+        // is order-sensitive in floating point. Stock (contiguous) placement
-+        // happens to store cells in position order, so emitting the gathered
-+        // indices in position order reproduces stock's exact reduction order -
-+        // making the paged read bit-identical, not merely math-equivalent.
-+        std::vector<std::pair<llama_pos, int32_t>> pc;
-+        pc.reserve(n);
-+        int32_t pad = -1;
-+        for (uint32_t i = 0; i < n; ++i) {
-+            if (!cells.is_empty(i)) {
-+                pc.emplace_back(cells.pos_get(i), (int32_t) i);
-+            } else if (pad < 0) {
-+                pad = (int32_t) i; // first empty cell: its mask is -inf -> safe pad
-+            }
-+        }
-+        std::sort(pc.begin(), pc.end());
-+        int32_t * col = dst + (size_t) j * n_gather;
-+        for (size_t k = 0; k < pc.size(); ++k) {
-+            col[k] = pc[k].second;
-+        }
-+        // Pad the tail to n_gather with a masked (empty) cell so the rectangular
-+        // gather drops to zero contribution for streams shorter than the max.
-+        const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
-+        for (uint32_t k = (uint32_t) pc.size(); k < n_gather; ++k) {
-+            col[k] = padv;
-+        }
-+    }
-+}
-+
- ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
-     GGML_UNUSED(sinfo);
- 
-@@ -2620,6 +2686,14 @@ ggml_tensor * llama_kv_cache_context::get_v(ggml_context * ctx, int32_t il) cons
-     return kv->get_v(ctx, il, n_kv, sinfos[i_cur]);
- }
- 
-+uint32_t llama_kv_cache_context::get_n_gather() const {
-+    return kv->get_n_gather(n_kv, sinfos[i_cur]);
-+}
-+
-+void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
-+    kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
-+}
-+
- ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
-     return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
- }
-diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
-index 3d68f98..494c0fb 100644
---- a/src/llama-kv-cache.h
-+++ b/src/llama-kv-cache.h
-@@ -171,6 +171,12 @@ public:
-     ggml_tensor * get_k(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
-     ggml_tensor * get_v(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
- 
-+    // [paged 0003] count / list the non-empty cells in [0, n_kv) per stream of
-+    //   sinfo (position-sorted, padded across streams). Used by paged-attn
-+    //   gather-read. get_n_gather returns the max count across streams.
-+    uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
-+    void     get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
-+
-     // store k_cur and v_cur in the cache based on the provided head location
-     ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
-     ggml_tensor * cpy_v(ggml_context * ctx, ggml_tensor * v_cur, ggml_tensor * v_idxs, int32_t il, const slot_info & sinfo) const;
-@@ -368,6 +374,11 @@ public:
-     ggml_tensor * get_k(ggml_context * ctx, int32_t il) const;
-     ggml_tensor * get_v(ggml_context * ctx, int32_t il) const;
- 
-+    // [paged 0003] gather-read helpers (delegate to the kv cache for the
-+    //   current ubatch's stream).
-+    uint32_t get_n_gather() const;
-+    void     get_gather_idxs(int32_t * dst) const;
-+
-     // store k_cur and v_cur in the cache based on the provided head location
-     // note: the heads in k_cur and v_cur should be laid out contiguously in memory
-     //   - k_cur  [n_embd_head_k, n_head_k, n_tokens]
-diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
-new file mode 100644
-index 0000000..ade75e8
---- /dev/null
-+++ b/src/paged-attn.cpp
-@@ -0,0 +1,128 @@
-+#include "paged-attn.h"
-+
-+#include "llama-graph.h"
-+#include "llama-kv-cache.h"
-+
-+#include "ggml.h"
-+#include "ggml-backend.h"
-+
-+#include <cstdlib>
-+#include <cstdio>
-+
-+namespace paged_attn {
-+
-+bool active() {
-+    static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
-+    return a;
-+}
-+
-+static bool debug() {
-+    static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
-+    return d;
-+}
-+
-+namespace {
-+
-+// Graph input that, at set_input time, fills an I32 [n_gather, n_stream] tensor
-+// with each stream's non-empty cell indices (position-sorted, padded with a
-+// masked/empty cell) by delegating to the kv-cache context. Private to this
-+// unit; default can_reuse()==false keeps the graph from being reused across
-+// decodes (n_gather grows every step).
-+class input_gather_idxs : public llm_graph_input_i {
-+public:
-+    input_gather_idxs(const llama_kv_cache_context * mctx, ggml_tensor * idxs)
-+        : mctx(mctx), idxs(idxs) {}
-+
-+    void set_input(const llama_ubatch * ubatch) override {
-+        GGML_UNUSED(ubatch);
-+        GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
-+        mctx->get_gather_idxs((int32_t *) idxs->data);
-+    }
-+
-+    const llama_kv_cache_context * mctx;
-+    ggml_tensor * idxs;
-+};
-+
-+} // namespace
-+
-+void gather(ggml_context * ctx0,
-+            llm_graph_result * res,
-+            const llama_kv_cache_context * mctx,
-+            ggml_tensor ** k,
-+            ggml_tensor ** v,
-+            ggml_tensor ** kq_mask) {
-+    if (!active()) {
-+        return;
-+    }
-+
-+    ggml_tensor * K = *k;
-+    ggml_tensor * V = *v;
-+    ggml_tensor * M = *kq_mask;
-+
-+    // Number of streams (sequences) in the unified batch. K is laid out
-+    // [d, h, n_kv, n_stream] and the mask is [n_kv, n_tps, 1, n_stream]; the
-+    // gather is per-stream (one index column per stream), so a single
-+    // ggml_get_rows over the stream axis handles 1..N streams uniformly.
-+    const int64_t n_stream = K->ne[3];
-+    GGML_ASSERT(M->ne[3] == n_stream);
-+
-+    const int64_t n_gather = (int64_t) mctx->get_n_gather();
-+    if (n_gather <= 0) {
-+        // Worst-case graph reserve (empty cache) or nothing placed yet: leave
-+        // the full [0, n_kv) read untouched so buffer sizing stays worst-case.
-+        return;
-+    }
-+
-+    if (debug()) {
-+        static int64_t once = 0;
-+        if (once++ < 2) {
-+            fprintf(stderr, "[paged-attn] gather n_stream=%lld n_kv=%lld n_gather=%lld\n",
-+                    (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
-+        }
-+    }
-+
-+    // Per-stream index tensor [n_gather, n_stream], filled at set_input from
-+    // each stream's non-empty cells. ggml_get_rows broadcasts along ne[1]==
-+    // n_stream, so column s gathers from stream s of the source.
-+    ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_gather, n_stream);
-+    ggml_set_input(idx);
-+    res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, idx)));
-+
-+    // --- gather K: collapse (head_dim, n_head) so cells become the row axis ---
-+    {
-+        ggml_tensor * t = ggml_cont(ctx0, K);                                          // [d, h, n_kv, ns]
-+        t = ggml_reshape_3d(ctx0, t, K->ne[0]*K->ne[1], K->ne[2], n_stream);           // [d*h, n_kv, ns]
-+        t = ggml_get_rows(ctx0, t, idx);                                               // [d*h, n_gather, ns]
-+        *k = ggml_reshape_4d(ctx0, t, K->ne[0], K->ne[1], n_gather, n_stream);         // [d, h, n_gather, ns]
-+    }
-+
-+    // --- gather V ---
-+    // Normalize to a non-transposed [d, h, n_kv, ns] view first, so the gathered
-+    // result is contiguous and build_attn_mha sees a consistent v_trans==false.
-+    {
-+        const bool v_trans = V->nb[1] > V->nb[2];
-+        ggml_tensor * vsrc = v_trans
-+            ? ggml_permute(ctx0, V, 2, 1, 0, 3)   // [n_kv, h, d, ns] -> [d, h, n_kv, ns]
-+            : V;                                  // already [d, h, n_kv, ns]
-+        ggml_tensor * t = ggml_cont(ctx0, vsrc);                                       // [d, h, n_kv, ns]
-+        t = ggml_reshape_3d(ctx0, t, vsrc->ne[0]*vsrc->ne[1], vsrc->ne[2], n_stream);  // [d*h, n_kv, ns]
-+        t = ggml_get_rows(ctx0, t, idx);                                               // [d*h, n_gather, ns]
-+        *v = ggml_reshape_4d(ctx0, t, vsrc->ne[0], vsrc->ne[1], n_gather, n_stream);   // [d, h, n_gather, ns]
-+    }
-+
-+    // --- gather mask (cells are ne0): transpose so cells become the row axis,
-+    //     gather per stream, transpose back ---
-+    {
-+        ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream);      // [n_kv, n_tps, ns]
-+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));                                  // [n_tps, n_kv, ns]
-+        m = ggml_get_rows(ctx0, m, idx);                                               // [n_tps, n_gather, ns] (F32)
-+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));                                  // [n_gather, n_tps, ns]
-+        m = ggml_reshape_4d(ctx0, m, n_gather, M->ne[1], 1, n_stream);
-+        if (M->type != m->type) {
-+            m = ggml_cast(ctx0, m, M->type);   // flash-attn requires an F16 mask
-+        }
-+        *kq_mask = m;
-+    }
-+}
-+
-+} // namespace paged_attn
-diff --git a/src/paged-attn.h b/src/paged-attn.h
-new file mode 100644
-index 0000000..c5b7bd7
---- /dev/null
-+++ b/src/paged-attn.h
-@@ -0,0 +1,40 @@
-+#pragma once
-+// Paged attention gather-read (patch 0003, experimental).
-+//
-+// Companion to the paged block placement in llama_kv_cache::find_slot (patch
-+// 0002). Patch 0002 places a sequence's tokens at permuted, non-contiguous
-+// fixed-size block cells, but attention still reads the whole [0, n_kv) window
-+// (empty cells masked to -inf). This unit compacts that read: it gathers K, V
-+// and the kq_mask down to ONLY the sequence's used (non-empty) cells before
-+// build_attn_mha.
-+//
-+// Correctness: attention is permutation-invariant over the KV set, and dropping
-+// already-masked empty cells removes only exp(-inf)=0 terms - so greedy output
-+// is identical to stock. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
-+//
-+// All logic lives here to keep the core files additive: build_attn gets one
-+// call, llama_kv_cache_context gets two thin accessors, CMake gets one line.
-+
-+#include <cstdint>
-+
-+struct ggml_context;
-+struct ggml_tensor;
-+class  llm_graph_result;
-+class  llama_kv_cache_context;
-+
-+namespace paged_attn {
-+
-+// true iff env LLAMA_KV_PAGED is set (evaluated once).
-+bool active();
-+
-+// Gather K, V and the kq_mask down to the current sequence's non-empty cells.
-+// No-op (returns immediately) unless active(). On return *k, *v and *kq_mask
-+// point at the compacted tensors; pass them straight to build_attn_mha.
-+void gather(ggml_context * ctx0,
-+            llm_graph_result * res,
-+            const llama_kv_cache_context * mctx,
-+            ggml_tensor ** k,
-+            ggml_tensor ** v,
-+            ggml_tensor ** kq_mask);
-+
-+} // namespace paged_attn
--- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch
@@ -1,298 +0,0 @@
-From 7c294973de28d1ac991505638d726acfb371d541 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 10:50:35 +0200
-Subject: [PATCH] paged on-demand block allocation (env LLAMA_KV_PAGED) - patch
- 0004
-
-Drive the paged placement in find_slot through the vendored PagedKVManager
-(patch 0001) instead of a fixed full-pool permutation. Blocks are popped from a
-free pool on demand as the sequence crosses block boundaries (peak << full
-reservation) and returned on sequence end (seq_rm full removal / clear). One
-manager per (kv-cache, stream); all state lives in the new src/paged-alloc unit,
-so the core kv-cache struct is untouched - find_slot/clear/seq_rm gain only a
-gated call. Default off; stock path byte-identical.
----
- src/CMakeLists.txt     |   1 +
- src/llama-kv-cache.cpp |  69 +++++++++++++++++----------
- src/paged-alloc.cpp    | 106 +++++++++++++++++++++++++++++++++++++++++
- src/paged-alloc.h      |  39 +++++++++++++++
- 4 files changed, 190 insertions(+), 25 deletions(-)
- create mode 100644 src/paged-alloc.cpp
- create mode 100644 src/paged-alloc.h
-
-diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
-index 58083b3..4d9d7d1 100644
--- a/src/CMakeLists.txt
-+++ b/src/CMakeLists.txt
-@@ -26,6 +26,7 @@ add_library(llama
-             llama-kv-cache-iswa.cpp
-             paged-kv-manager.cpp
-             paged-attn.cpp
-+            paged-alloc.cpp
-             llama-kv-cache-dsa.cpp
-             llama-memory.cpp
-             llama-memory-hybrid.cpp
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 30d02d7..1125d9a 100644
--- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -1,4 +1,5 @@
- #include "llama-kv-cache.h"
-+#include "paged-alloc.h"
- #include <vector>
- #include <utility>
- 
-@@ -381,6 +382,11 @@ llama_kv_cache::llama_kv_cache(
- }
- 
- void llama_kv_cache::clear(bool data) {
-+    // [paged 0004] return all on-demand blocks to the pool on cache clear.
-+    if (paged_alloc::active()) {
-+        paged_alloc::release_all(this);
-+    }
-+
-     for (uint32_t s = 0; s < n_stream; ++s) {
-         v_cells[s].reset();
-         v_heads[s] = 0;
-@@ -409,6 +415,16 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
-         p1 = std::numeric_limits<llama_pos>::max();
-     }
- 
-+    // [paged 0004] free a stream's on-demand blocks when its whole sequence is
-+    // removed (sequence end), so they return to the pool for reuse.
-+    if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
-+        if (seq_id >= 0) {
-+            paged_alloc::release(this, (int) seq_to_stream[seq_id]);
-+        } else {
-+            paged_alloc::release_all(this);
-+        }
-+    }
-+
-     if (seq_id >= 0) {
-         auto & cells = v_cells[seq_to_stream[seq_id]];
-         auto & head  = v_heads[seq_to_stream[seq_id]];
-@@ -1030,36 +1046,39 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
-         // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
-         // Single-sequence scope (uses get_used() as the logical base); falls back
-         // to the normal allocator if the permuted cells aren't available.
-        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
-        if (paged_mode) {
-+        // [paged 0004] On-demand block allocation. Patch 0002 proved attention is
-+        // invariant to physical KV placement; here that placement is driven by
-+        // the vendored PagedKVManager (patch 0001): blocks are popped from a free
-+        // pool only as the sequence crosses block boundaries (peak << full
-+        // reservation) and returned on sequence end. Enabled via LLAMA_KV_PAGED;
-+        // falls back to the normal allocator on pool exhaustion or any conflict.
-+        if (paged_alloc::active()) {
-             const uint32_t bs   = 16;                 // block size (tokens/block)
-            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
-+            const uint32_t nblk = cells.size() / bs;  // this stream's block budget
-             if (nblk >= 2) {
-                // stride coprime to nblk => block-index permutation is a bijection
-                uint32_t k = 1;
-                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
-                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
-                }
-                 const uint32_t base = cells.get_used();
-                bool ok = true;
-                for (uint32_t i = 0; i < n_tokens; ++i) {
-                    const uint32_t L    = base + i;
-                    const uint32_t b    = L / bs;
-                    const uint32_t off  = L % bs;
-                    if (b >= nblk) { ok = false; break; }
-                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
-                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
-                    res.idxs[s].push_back(phys);
-                }
-                if (ok && res.idxs[s].size() == n_tokens) {
-                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
-                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
-                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
-                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
-+                const int      strm = (int) seq_to_stream[seq_id];
-+                std::vector<uint32_t> placed;
-+                if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
-+                    bool ok = (placed.size() == n_tokens);
-+                    for (uint32_t i = 0; ok && i < n_tokens; ++i) {
-+                        if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
-+                            ok = false;
-+                        }
-+                    }
-+                    if (ok) {
-+                        for (uint32_t phys : placed) {
-+                            res.idxs[s].push_back(phys);
-+                        }
-+                        if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
-+                            fprintf(stderr, "[paged] stream %d placed %u tok at cells:", strm, n_tokens);
-+                            for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
-+                            fprintf(stderr, " (nblk=%u base=%u)\n", nblk, base);
-+                        }
-+                        continue; // on-demand paged placement succeeded
-                     }
-                    continue; // paged placement succeeded for this sequence
-+                    res.idxs[s].clear(); // fall back to the normal allocator
-                 }
-                res.idxs[s].clear(); // fall back to the normal allocator
-             }
-         }
- 
-diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
-new file mode 100644
-index 0000000..1d13f9c
--- /dev/null
-+++ b/src/paged-alloc.cpp
-@@ -0,0 +1,106 @@
-+#include "paged-alloc.h"
-+#include "paged-kv-manager.h"
-+
-+#include <cstdlib>
-+#include <cstdio>
-+#include <map>
-+#include <memory>
-+#include <utility>
-+
-+namespace paged_alloc {
-+
-+bool active() {
-+    static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
-+    return a;
-+}
-+
-+static bool debug() {
-+    static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
-+    return d;
-+}
-+
-+namespace {
-+
-+using key_t = std::pair<const void *, int>;
-+
-+// One PagedKVManager per (kv-cache, stream): each stream owns a separate
-+// physical pool of cells.size() cells, so a manager's block ids map directly to
-+// cell ranges within that stream's pool. The internal request id is always 0.
-+std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
-+
-+paged::PagedKVManager * get_mgr(const void * cache, int stream,
-+                                uint32_t pool_blocks, uint32_t block_size) {
-+    const key_t k{cache, stream};
-+    auto it = g_managers.find(k);
-+    if (it == g_managers.end()) {
-+        // enable_caching=false: prefix caching is a later patch; 0004 exercises
-+        // only on-demand allocate / free.
-+        auto mgr = std::make_unique<paged::PagedKVManager>(
-+            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
-+        it = g_managers.emplace(k, std::move(mgr)).first;
-+    }
-+    return it->second.get();
-+}
-+
-+} // namespace
-+
-+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
-+           uint32_t block_size, uint32_t pool_blocks,
-+           std::vector<uint32_t> & out) {
-+    if (n_tokens == 0) {
-+        return true;
-+    }
-+
-+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
-+
-+    const size_t before = mgr->block_table(0).size();
-+
-+    // Grow the request to cover the highest logical position. The manager pops
-+    // free blocks only for the boundaries actually crossed - that is the on-
-+    // demand behavior; an already-covered range adds nothing.
-+    if (!mgr->allocate(0, (size_t) base + n_tokens)) {
-+        return false; // pool exhausted -> caller falls back to the stock path
-+    }
-+
-+    out.reserve(out.size() + n_tokens);
-+    for (uint32_t i = 0; i < n_tokens; ++i) {
-+        const int64_t s = mgr->slot(0, (int) (base + i));
-+        out.push_back((uint32_t) s);
-+    }
-+
-+    if (debug()) {
-+        const size_t after = mgr->block_table(0).size();
-+        if (after != before) {
-+            fprintf(stderr,
-+                    "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
-+                    "(budget=%u; base=%u +%u tok)\n",
-+                    cache, stream, before, after, pool_blocks, base, n_tokens);
-+        }
-+    }
-+
-+    return true;
-+}
-+
-+void release(const void * cache, int stream) {
-+    auto it = g_managers.find({cache, stream});
-+    if (it == g_managers.end()) {
-+        return;
-+    }
-+    it->second->free(0);
-+    g_managers.erase(it);
-+    if (debug()) {
-+        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
-+    }
-+}
-+
-+void release_all(const void * cache) {
-+    for (auto it = g_managers.begin(); it != g_managers.end(); ) {
-+        if (it->first.first == cache) {
-+            it = g_managers.erase(it);
-+        } else {
-+            ++it;
-+        }
-+    }
-+}
-+
-+} // namespace paged_alloc
-diff --git a/src/paged-alloc.h b/src/paged-alloc.h
-new file mode 100644
-index 0000000..bf66665
--- /dev/null
-+++ b/src/paged-alloc.h
-@@ -0,0 +1,39 @@
-+#pragma once
-+// On-demand paged KV block allocation (patch 0004, experimental).
-+//
-+// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
-+// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
-+// sequence's logical positions onto a fixed full-pool permutation, blocks are
-+// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
-+// and returned to the pool on sequence end. This is where the paged memory-
-+// capacity benefit begins: a short sequence holds only a few blocks, not the
-+// whole reserved window.
-+//
-+// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
-+// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
-+// struct stays untouched - find_slot only gains a gated call.
-+
-+#include <cstdint>
-+#include <vector>
-+
-+namespace paged_alloc {
-+
-+// true iff env LLAMA_KV_PAGED is set (evaluated once).
-+bool active();
-+
-+// Place n_tokens logical positions [base, base+n_tokens) of one stream on
-+// demand, appending their physical cell indices to `out`. pool_blocks =
-+// cells.size()/block_size is this stream's block budget. Returns false (leaving
-+// `out` unchanged) on pool exhaustion, so the caller falls back to the stock
-+// allocator. The caller still validates each returned cell is empty.
-+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
-+           uint32_t block_size, uint32_t pool_blocks,
-+           std::vector<uint32_t> & out);
-+
-+// Return a stream's blocks to the pool (sequence end).
-+void release(const void * cache, int stream);
-+
-+// Return every stream's blocks for a kv-cache (clear() / teardown).
-+void release_all(const void * cache);
-+
-+} // namespace paged_alloc
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch
@@ -1,143 +0,0 @@
-From 141029beec609e87f24f6f6bba3ec842d7037862 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 12:13:44 +0200
-Subject: [PATCH] paged cross-request prefix caching (env LLAMA_KV_PAGED) -
- patch 0006
-
-Add host-side cross-request prefix sharing to the vendored PagedKVManager
-(patches 0001-0004): on placement, hash a new sequence's prefix blocks, reuse the
-matching cached physical blocks (ref_cnt++) for the shared prefix and allocate
-fresh blocks only for the divergent suffix. A shared block is freed only at
-ref 0; copy-on-write privatises a still-shared (ref>1) block before a divergent
-write so co-owners stay byte-correct. All logic lives in the vendored
-src/paged-kv-manager unit (place_with_prefix / cow_block / ref-counting); the
-core kv-cache files are untouched. Default off; gated behind LLAMA_KV_PAGED.
-
-Wiring the physical-cell reuse into find_slot so the engine itself skips
-recompute needs core seq-membership changes and is left to a later patch.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- src/paged-kv-manager.cpp | 65 ++++++++++++++++++++++++++++++++++++++++
- src/paged-kv-manager.h   | 23 ++++++++++++++
- 2 files changed, 88 insertions(+)
-
-diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
-index ca0dcd8..4c6ee4c 100644
--- a/src/paged-kv-manager.cpp
-+++ b/src/paged-kv-manager.cpp
-@@ -293,4 +293,69 @@ void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block
-     pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
- }
- 
-+// ---------------------------------------------------------------------------
-+// Cross-request prefix caching + copy-on-write  (patch 0006)
-+// ---------------------------------------------------------------------------
-+
-+size_t PagedKVManager::place_with_prefix(int seq_id, const std::vector<int>& token_ids) {
-+    auto& req = req_to_blocks_[seq_id];
-+
-+    // Longest cached prefix: hash the full blocks and stop at the first miss.
-+    // A block hash transitively encodes its whole prefix (FNV chaining), so the
-+    // first miss bounds the reusable prefix (vLLM find_longest_cache_hit).
-+    const std::vector<uint64_t> hashes = compute_block_hashes(token_ids);
-+    std::vector<KVCacheBlock*> hits;
-+    for (uint64_t bh : hashes) {
-+        KVCacheBlock* cb = pool_.get_cached_block(bh);
-+        if (!cb) break;
-+        hits.push_back(cb);
-+    }
-+
-+    // Reuse: ++ref_cnt (pulling warm blocks back out of the free list) then
-+    // splice the shared physical blocks into this sequence's block table.
-+    pool_.touch(hits);
-+    req.insert(req.end(), hits.begin(), hits.end());
-+
-+    // Allocate fresh blocks only for the divergent suffix.
-+    const size_t need = cdiv(token_ids.size(), block_size_);
-+    if (need > req.size()) {
-+        const size_t add = need - req.size();
-+        if (add > pool_.get_num_free_blocks()) {
-+            // OOM: roll the sequence back (un-touch the shared prefix so no ref
-+            // leaks) and report no placement; the caller falls back to stock.
-+            std::vector<KVCacheBlock*> ordered(req.rbegin(), req.rend());
-+            pool_.free_blocks(ordered);
-+            req.clear();
-+            return 0;
-+        }
-+        auto nb = pool_.get_new_blocks(add);
-+        req.insert(req.end(), nb.begin(), nb.end());
-+    }
-+    return hits.size();
-+}
-+
-+std::pair<int32_t, int32_t> PagedKVManager::cow_block(int seq_id, size_t bi) {
-+    auto& req = req_to_blocks_.at(seq_id);
-+    KVCacheBlock* old = req.at(bi);
-+    if (old->ref_cnt <= 1) {
-+        return { old->block_id, old->block_id }; // already private - no copy
-+    }
-+    // Private copy for this sequence. get_new_blocks sets the fresh block's
-+    // ref_cnt to 1; free_blocks decrements the shared block, which stays >0 so
-+    // it is NOT returned to the pool and the other owners are left untouched.
-+    KVCacheBlock* fresh = pool_.get_new_blocks(1).front();
-+    pool_.free_blocks({ old });
-+    req[bi] = fresh;
-+    return { old->block_id, fresh->block_id };
-+}
-+
-+int PagedKVManager::block_ref_cnt_at(int seq_id, size_t bi) const {
-+    return req_to_blocks_.at(seq_id).at(bi)->ref_cnt;
-+}
-+
-+size_t PagedKVManager::num_blocks(int seq_id) const {
-+    auto it = req_to_blocks_.find(seq_id);
-+    return it == req_to_blocks_.end() ? 0 : it->second.size();
-+}
-+
- } // namespace paged
-diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
-index 740280a..34decbc 100644
--- a/src/paged-kv-manager.h
-+++ b/src/paged-kv-manager.h
-@@ -14,6 +14,7 @@
- #include <vector>
- #include <unordered_map>
- #include <map>
-+#include <utility>
- 
- namespace paged {
- 
-@@ -99,6 +100,28 @@ public:
-     size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
-     void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
- 
-+    // Cross-request prefix caching + copy-on-write (patch 0006).
-+    //
-+    // Splice the longest cached prefix of token_ids into seq_id (reuse the
-+    // shared physical blocks, ref_cnt++ so a block frees only at ref 0) and
-+    // allocate fresh blocks only for the divergent suffix. Returns the number of
-+    // shared (reused) blocks; the caller skips recomputing those tokens. On pool
-+    // exhaustion the sequence is rolled back (no ref leak) and 0 is returned.
-+    size_t place_with_prefix(int seq_id, const std::vector<int>& token_ids);
-+
-+    // Copy-on-write the block at logical index bi of seq_id. If that block is
-+    // shared (ref_cnt>1), allocate a fresh private block, drop this seq's ref on
-+    // the shared one (other owners keep it, content untouched) and install the
-+    // fresh block at bi. Returns {old_block_id, new_block_id}; new==old when the
-+    // block was already private (ref_cnt<=1) and no copy is needed. The caller
-+    // copies the physical cell contents old_block_id -> new_block_id.
-+    std::pair<int32_t, int32_t> cow_block(int seq_id, size_t bi);
-+
-+    // Introspection for the prefix-share gate (debug/tests).
-+    int    block_ref_cnt_at(int seq_id, size_t bi) const;
-+    size_t num_blocks(int seq_id) const;
-+    size_t num_free_blocks() const { return pool_.get_num_free_blocks(); }
-+
- protected:
-     int block_size_;
-     BlockPool pool_;
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch
@@ -1,531 +0,0 @@
-From da20c1c0571e84bc76202d915d4bb82892a3392b Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 12:46:28 +0200
-Subject: [PATCH] paged engine prefix recompute-skip (env LLAMA_KV_PAGED) -
- patch 0007
-
-Wire the host-side cross-request prefix cache (patch 0006) into the engine so a
-new sequence physically SHARES the cached prefix blocks and skips recomputing the
-shared prefix - the actual compute win that 0006 (which only proved the host-side
-machinery + realised reuse via the stock seq_cp) did not yet deliver from the
-paged path itself.
-
-Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
-
-  * paged-alloc reworked from a per-stream, request-0, destroyed-on-free manager
-    into ONE persistent caching PagedKVManager per (kv-cache, stream) whose
-    requests are keyed by the real llama_seq_id. free(seq) now releases exactly
-    one sequence, so ref-counted shared blocks survive while another sharer holds
-    them. New seams: share_prefix (place_with_prefix -> shared prefix tokens),
-    slot, commit (publish a sequence into the content cache), ref-counted release,
-    plus ref/num-free introspection.
-
-  * Two gated llama_kv_cache methods (the core seq-membership handling 0007 needs):
-    paged_prefix_share() reuses the longest cached content prefix for a sequence
-    and marks the shared physical cells as belonging to it (cells.seq_add) so the
-    engine's attention mask includes the already-computed prefix KV; the caller
-    then decodes ONLY the divergent suffix. paged_prefix_commit() publishes a
-    sequence's full blocks for later reuse.
-
-  * find_slot's paged branch anchors placement on each sequence's own logical base
-    (ubatch.pos) and keys the manager request by seq_id, so an independently-freed
-    sequence and a shared prefix coexist in one unified pool. seq_rm/clear free
-    per-sequence (ref-counted) instead of nuking the whole stream.
-
-  * paged-prefix-api: a thin gated shim so a caller holding only the public
-    llama.h can reach the seam and the introspection without the internal headers.
-
-Core existing-file touch: src/llama-kv-cache.{cpp,h}, +71 -3. Everything else is
-additive vendored units. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): a
-sequence B sharing A's prefix decodes greedy tokens byte-identical to B from
-scratch with the prefill computing ONLY the suffix (32 prefix tokens skipped) at
-a block boundary AND mid-block; the shared block carries ref_cnt 2 while both
-hold it, drops to 1 when one sharer is removed (survivor intact, re-shareable, no
-use-after-free) and returns to the pool only when all sharers are freed. The
-0004 serving gate (unified and non-unified) stays byte-identical stock vs paged.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- src/CMakeLists.txt       |   1 +
- src/llama-kv-cache.cpp   |  66 +++++++++++++++++++++++--
- src/llama-kv-cache.h     |   8 +++
- src/paged-alloc.cpp      | 104 ++++++++++++++++++++++++++++++---------
- src/paged-alloc.h        |  69 +++++++++++++++++++-------
- src/paged-prefix-api.cpp |  48 ++++++++++++++++++
- src/paged-prefix-api.h   |  27 ++++++++++
- 7 files changed, 280 insertions(+), 43 deletions(-)
- create mode 100644 src/paged-prefix-api.cpp
- create mode 100644 src/paged-prefix-api.h
-
-diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
-index 4d9d7d1..432f42d 100644
--- a/src/CMakeLists.txt
-+++ b/src/CMakeLists.txt
-@@ -27,6 +27,7 @@ add_library(llama
-             paged-kv-manager.cpp
-             paged-attn.cpp
-             paged-alloc.cpp
-+            paged-prefix-api.cpp
-             llama-kv-cache-dsa.cpp
-             llama-memory.cpp
-             llama-memory-hybrid.cpp
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 1125d9a..7510ff9 100644
--- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -419,7 +419,7 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
-     // removed (sequence end), so they return to the pool for reuse.
-     if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
-         if (seq_id >= 0) {
-            paged_alloc::release(this, (int) seq_to_stream[seq_id]);
-+            paged_alloc::release(this, (int) seq_to_stream[seq_id], (int) seq_id);
-         } else {
-             paged_alloc::release_all(this);
-         }
-@@ -1056,10 +1056,15 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
-             const uint32_t bs   = 16;                 // block size (tokens/block)
-             const uint32_t nblk = cells.size() / bs;  // this stream's block budget
-             if (nblk >= 2) {
-                const uint32_t base = cells.get_used();
-+                // [paged 0007] Anchor placement on this sequence's own logical
-+                // base position (ubatch.pos), not the shared used-count, and key
-+                // the manager request by the real seq_id. slot(seq,pos) is then
-+                // stable per sequence, so an independently-freed (ref-counted)
-+                // sequence and a shared prefix can coexist in one unified pool.
-+                const uint32_t base = (uint32_t) ubatch.pos[s*n_tokens];
-                 const int      strm = (int) seq_to_stream[seq_id];
-                 std::vector<uint32_t> placed;
-                if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
-+                if (paged_alloc::place(this, strm, (int) seq_id, base, n_tokens, bs, nblk, placed)) {
-                     bool ok = (placed.size() == n_tokens);
-                     for (uint32_t i = 0; ok && i < n_tokens; ++i) {
-                         if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
-@@ -1165,6 +1170,61 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
-     return res;
- }
- 
-+// [paged 0007] Cross-request prefix recompute-skip.
-+//
-+// Reuse a cached content prefix for seq_id: share_prefix() splices the longest
-+// matching cached physical blocks into seq_id (ref_cnt++) and reserves fresh
-+// blocks for the divergent suffix. We then mark the shared physical cells as
-+// belonging to seq_id - those cells already hold the owner's computed KV at the
-+// matching logical positions, so the caller decodes ONLY the suffix and the
-+// prefix is never recomputed. Returns the number of shared prefix tokens.
-+// Gated behind LLAMA_KV_PAGED; a no-op (returns 0) otherwise.
-+int32_t llama_kv_cache::paged_prefix_share(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
-+    if (!paged_alloc::active() || tokens.empty()) {
-+        return 0;
-+    }
-+    const uint32_t bs   = 16;
-+    const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
-+    auto & cells = v_cells[strm];
-+    const uint32_t nblk = cells.size() / bs;
-+    if (nblk < 2) {
-+        return 0;
-+    }
-+
-+    std::vector<int> toks(tokens.begin(), tokens.end());
-+    const size_t kshare = paged_alloc::share_prefix(this, (int) strm, (int) seq_id, toks, bs, nblk);
-+
-+    for (size_t p = 0; p < kshare; ++p) {
-+        const int64_t cell = paged_alloc::slot(this, (int) strm, (int) seq_id, (int) p);
-+        if (cell < 0 || (uint32_t) cell >= cells.size() ||
-+            cells.is_empty((uint32_t) cell) ||
-+            cells.pos_get((uint32_t) cell) != (llama_pos) p) {
-+            // Owner cell missing / repurposed: cannot safely share. Roll the
-+            // sequence back so the caller recomputes the whole prompt.
-+            paged_alloc::release(this, (int) strm, (int) seq_id);
-+            return 0;
-+        }
-+        if (!cells.seq_has((uint32_t) cell, seq_id)) {
-+            cells.seq_add((uint32_t) cell, seq_id);
-+        }
-+    }
-+    return (int32_t) kshare;
-+}
-+
-+// [paged 0007] Publish a sequence's full blocks into the content cache so a
-+// later paged_prefix_share() can reuse them. Call after the sequence KV is
-+// computed (its prefill decode has run).
-+void llama_kv_cache::paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
-+    if (!paged_alloc::active() || tokens.empty()) {
-+        return;
-+    }
-+    const uint32_t bs   = 16;
-+    const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
-+    const uint32_t nblk = v_cells[strm].size() / bs;
-+    std::vector<int> toks(tokens.begin(), tokens.end());
-+    paged_alloc::commit(this, (int) strm, (int) seq_id, toks, bs, nblk);
-+}
-+
- void llama_kv_cache::apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch) {
-     // TODO: refactor [TAG_KV_CACHE_SHARE_CELLS]
-     if (other) {
-diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
-index 494c0fb..f374ac6 100644
--- a/src/llama-kv-cache.h
-+++ b/src/llama-kv-cache.h
-@@ -199,6 +199,14 @@ public:
-     // emplace the ubatch context into slot: [sinfo.idxs[0...ubatch.n_tokens - 1]]
-     void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
- 
-+    // [paged 0007] Cross-request prefix recompute-skip (experimental, gated by
-+    // env LLAMA_KV_PAGED). paged_prefix_share() reuses a cached content prefix
-+    // for seq_id and returns the number of shared prefix tokens (the caller
-+    // decodes only the suffix); paged_prefix_commit() publishes a sequence into
-+    // the content cache for later reuse. No-ops when LLAMA_KV_PAGED is unset.
-+    int32_t paged_prefix_share (llama_seq_id seq_id, const std::vector<llama_token> & tokens);
-+    void    paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens);
-+
-     //
-     // input API
-     //
-diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
-index 1d13f9c..c1027fb 100644
--- a/src/paged-alloc.cpp
-+++ b/src/paged-alloc.cpp
-@@ -23,9 +23,13 @@ namespace {
- 
- using key_t = std::pair<const void *, int>;
- 
-// One PagedKVManager per (kv-cache, stream): each stream owns a separate
-// physical pool of cells.size() cells, so a manager's block ids map directly to
-// cell ranges within that stream's pool. The internal request id is always 0.
-+// One persistent PagedKVManager per (kv-cache, stream): each stream owns a
-+// separate physical pool of cells.size() cells, so a manager's block ids map
-+// directly to cell ranges within that stream's pool. Requests inside a manager
-+// are keyed by the real llama_seq_id (NOT a fixed 0), so free(seq) releases one
-+// sequence and shared blocks survive at ref>0 - this is what makes ref-counted
-+// cross-request prefix sharing (0007) possible. Caching is enabled so commit()
-+// can publish blocks and share_prefix() can hit them.
- std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
- 
- paged::PagedKVManager * get_mgr(const void * cache, int stream,
-@@ -33,18 +37,21 @@ paged::PagedKVManager * get_mgr(const void * cache, int stream,
-     const key_t k{cache, stream};
-     auto it = g_managers.find(k);
-     if (it == g_managers.end()) {
-        // enable_caching=false: prefix caching is a later patch; 0004 exercises
-        // only on-demand allocate / free.
-         auto mgr = std::make_unique<paged::PagedKVManager>(
-            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
-+            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/true);
-         it = g_managers.emplace(k, std::move(mgr)).first;
-     }
-     return it->second.get();
- }
- 
-+paged::PagedKVManager * find_mgr(const void * cache, int stream) {
-+    auto it = g_managers.find({cache, stream});
-+    return it == g_managers.end() ? nullptr : it->second.get();
-+}
-+
- } // namespace
- 
-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
-+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
-            uint32_t block_size, uint32_t pool_blocks,
-            std::vector<uint32_t> & out) {
-     if (n_tokens == 0) {
-@@ -53,43 +60,79 @@ bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
- 
-     paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
- 
-    const size_t before = mgr->block_table(0).size();
-+    const size_t before = mgr->block_table(seq).size();
- 
-    // Grow the request to cover the highest logical position. The manager pops
-    // free blocks only for the boundaries actually crossed - that is the on-
-    // demand behavior; an already-covered range adds nothing.
-    if (!mgr->allocate(0, (size_t) base + n_tokens)) {
-+    // Grow this sequence's request to cover its highest logical position. The
-+    // manager pops free blocks only for boundaries actually crossed; if
-+    // share_prefix() already reserved these blocks, this is a no-op.
-+    if (!mgr->allocate(seq, (size_t) base + n_tokens)) {
-         return false; // pool exhausted -> caller falls back to the stock path
-     }
- 
-     out.reserve(out.size() + n_tokens);
-     for (uint32_t i = 0; i < n_tokens; ++i) {
-        const int64_t s = mgr->slot(0, (int) (base + i));
-+        const int64_t s = mgr->slot(seq, (int) (base + i));
-         out.push_back((uint32_t) s);
-     }
- 
-     if (debug()) {
-        const size_t after = mgr->block_table(0).size();
-+        const size_t after = mgr->block_table(seq).size();
-         if (after != before) {
-             fprintf(stderr,
-                    "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
-+                    "[paged-alloc] cache=%p stream=%d seq=%d grew %zu->%zu blocks "
-                     "(budget=%u; base=%u +%u tok)\n",
-                    cache, stream, before, after, pool_blocks, base, n_tokens);
-+                    cache, stream, seq, before, after, pool_blocks, base, n_tokens);
-         }
-     }
- 
-     return true;
- }
- 
-void release(const void * cache, int stream) {
-    auto it = g_managers.find({cache, stream});
-    if (it == g_managers.end()) {
-+size_t share_prefix(const void * cache, int stream, int seq,
-+                    const std::vector<int> & tokens,
-+                    uint32_t block_size, uint32_t pool_blocks) {
-+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
-+    const size_t shared_blocks = mgr->place_with_prefix(seq, tokens);
-+    const size_t shared_tokens = shared_blocks * (size_t) block_size;
-+    if (debug() && shared_blocks > 0) {
-+        fprintf(stderr,
-+                "[paged-alloc] cache=%p stream=%d seq=%d shares %zu prefix blocks "
-+                "(%zu tokens) - prefix NOT recomputed\n",
-+                cache, stream, seq, shared_blocks, shared_tokens);
-+    }
-+    return shared_tokens;
-+}
-+
-+int64_t slot(const void * cache, int stream, int seq, int pos) {
-+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
-+    if (!mgr) {
-+        return -1;
-+    }
-+    if ((size_t) (pos / mgr->block_size()) >= mgr->num_blocks(seq)) {
-+        return -1;
-+    }
-+    return mgr->slot(seq, pos);
-+}
-+
-+void commit(const void * cache, int stream, int seq,
-+            const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks) {
-+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
-+    mgr->cache_blocks(seq, mgr->compute_block_hashes(tokens), tokens.size());
-+    if (debug()) {
-+        fprintf(stderr, "[paged-alloc] cache=%p stream=%d seq=%d committed %zu tokens\n",
-+                cache, stream, seq, tokens.size());
-+    }
-+}
-+
-+void release(const void * cache, int stream, int seq) {
-+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
-+    if (!mgr) {
-         return;
-     }
-    it->second->free(0);
-    g_managers.erase(it);
-+    mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
-     if (debug()) {
-        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
-+        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
-+                cache, stream, seq, mgr->num_free_blocks());
-     }
- }
- 
-@@ -103,4 +146,21 @@ void release_all(const void * cache) {
-     }
- }
- 
-+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size) {
-+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
-+    if (!mgr) {
-+        return -1;
-+    }
-+    const size_t bi = (size_t) pos / block_size;
-+    if (bi >= mgr->num_blocks(seq)) {
-+        return -1;
-+    }
-+    return mgr->block_ref_cnt_at(seq, bi);
-+}
-+
-+size_t num_free(const void * cache, int stream) {
-+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
-+    return mgr ? mgr->num_free_blocks() : 0;
-+}
-+
- } // namespace paged_alloc
-diff --git a/src/paged-alloc.h b/src/paged-alloc.h
-index bf66665..88dedef 100644
--- a/src/paged-alloc.h
-+++ b/src/paged-alloc.h
-@@ -1,17 +1,27 @@
- #pragma once
-// On-demand paged KV block allocation (patch 0004, experimental).
-+// On-demand paged KV block allocation + cross-request prefix reuse
-+// (patches 0004 + 0007, experimental).
- //
-// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
-// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
-// sequence's logical positions onto a fixed full-pool permutation, blocks are
-// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
-// and returned to the pool on sequence end. This is where the paged memory-
-// capacity benefit begins: a short sequence holds only a few blocks, not the
-// whole reserved window.
-+// Backs the paged placement in llama_kv_cache::find_slot with the vendored
-+// host-side PagedKVManager (patch 0001). Two responsibilities:
- //
-// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
-// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
-// struct stays untouched - find_slot only gains a gated call.
-+//   * On-demand allocation (0004): a sequence's logical positions are mapped to
-+//     physical cells block-by-block, popped from a free pool only as the
-+//     sequence grows and returned on sequence end.
-+//
-+//   * Cross-request prefix reuse (0007): before a new sequence's suffix is
-+//     decoded, share_prefix() reuses the cached physical blocks of a matching
-+//     content prefix (ref_cnt++), so the engine shares the already-computed KV
-+//     cells and the caller decodes ONLY the divergent suffix - the prefix is not
-+//     recomputed. commit() publishes a sequence's full blocks into the content
-+//     cache so later sequences can hit them. Freeing is ref-counted: a shared
-+//     block returns to the pool only when every sharer has been released.
-+//
-+// One persistent PagedKVManager per (kv-cache, stream); requests inside it are
-+// keyed by the real llama_seq_id, so free(seq) releases exactly one sequence and
-+// shared blocks survive at ref>0. All state lives in this unit (a static
-+// registry), so the core kv-cache struct stays untouched - find_slot gains only
-+// gated calls. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
- 
- #include <cstdint>
- #include <vector>
-@@ -21,19 +31,42 @@ namespace paged_alloc {
- // true iff env LLAMA_KV_PAGED is set (evaluated once).
- bool active();
- 
-// Place n_tokens logical positions [base, base+n_tokens) of one stream on
-// demand, appending their physical cell indices to `out`. pool_blocks =
-// cells.size()/block_size is this stream's block budget. Returns false (leaving
-+// Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
-+// on demand, appending their physical cell indices to `out`. pool_blocks =
-+// cells.size()/block_size is the stream's block budget. Returns false (leaving
- // `out` unchanged) on pool exhaustion, so the caller falls back to the stock
- // allocator. The caller still validates each returned cell is empty.
-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
-+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
-            uint32_t block_size, uint32_t pool_blocks,
-            std::vector<uint32_t> & out);
- 
-// Return a stream's blocks to the pool (sequence end).
-void release(const void * cache, int stream);
-+// [0007] Reuse the longest cached content prefix of `tokens` for (cache,stream,
-+// seq): splice the shared physical blocks into seq (ref_cnt++) and reserve fresh
-+// blocks for the divergent suffix. Returns the number of shared PREFIX TOKENS
-+// (block-aligned); the caller marks those cells for seq and decodes only the
-+// suffix. 0 if nothing matched or on pool exhaustion (sequence rolled back).
-+size_t share_prefix(const void * cache, int stream, int seq,
-+                    const std::vector<int> & tokens,
-+                    uint32_t block_size, uint32_t pool_blocks);
-+
-+// [0007] Physical cell backing logical position `pos` of (cache,stream,seq), or
-+// -1 if seq is unknown. Used to map a shared prefix position to its cell.
-+int64_t slot(const void * cache, int stream, int seq, int pos);
- 
-// Return every stream's blocks for a kv-cache (clear() / teardown).
-+// [0007] Publish seq's full (block-aligned) blocks into the content cache so a
-+// later share_prefix() can reuse them. Call after the sequence's KV is computed.
-+void commit(const void * cache, int stream, int seq,
-+            const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
-+
-+// Return one sequence's blocks to the pool (ref-counted; sequence end).
-+void release(const void * cache, int stream, int seq);
-+
-+// Drop every manager for a kv-cache (clear() / teardown).
- void release_all(const void * cache);
- 
-+// Introspection for the prefix-share gate (debug/tests). ref_cnt_at returns the
-+// ref count of the block backing logical position `pos`, or -1 if unknown.
-+int    ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
-+size_t num_free(const void * cache, int stream);
-+
- } // namespace paged_alloc
-diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
-new file mode 100644
-index 0000000..8573cd2
--- /dev/null
-+++ b/src/paged-prefix-api.cpp
-@@ -0,0 +1,48 @@
-+#include "paged-prefix-api.h"
-+#include "paged-alloc.h"
-+#include "llama-kv-cache.h"
-+
-+#include <vector>
-+
-+namespace paged_prefix_api {
-+
-+static llama_kv_cache * kv_of(llama_context * ctx) {
-+    // The driver targets a plain unified KV-cache model; dynamic_cast yields null
-+    // for wrapped caches (iSWA / hybrid), where cross-request cell sharing does
-+    // not apply, so the shim degrades to a safe no-op.
-+    return dynamic_cast<llama_kv_cache *>(llama_get_memory(ctx));
-+}
-+
-+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
-+    llama_kv_cache * kv = kv_of(ctx);
-+    if (!kv || n <= 0) {
-+        return 0;
-+    }
-+    return kv->paged_prefix_share(seq, std::vector<llama_token>(tokens, tokens + n));
-+}
-+
-+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
-+    llama_kv_cache * kv = kv_of(ctx);
-+    if (!kv || n <= 0) {
-+        return;
-+    }
-+    kv->paged_prefix_commit(seq, std::vector<llama_token>(tokens, tokens + n));
-+}
-+
-+int ref_at(llama_context * ctx, llama_seq_id seq, int pos) {
-+    llama_kv_cache * kv = kv_of(ctx);
-+    if (!kv) {
-+        return -1;
-+    }
-+    return paged_alloc::ref_cnt_at((const void *) kv, /*stream=*/0, (int) seq, pos, /*block_size=*/16);
-+}
-+
-+long num_free(llama_context * ctx) {
-+    llama_kv_cache * kv = kv_of(ctx);
-+    if (!kv) {
-+        return 0;
-+    }
-+    return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
-+}
-+
-+} // namespace paged_prefix_api
-diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
-new file mode 100644
-index 0000000..78a3864
--- /dev/null
-+++ b/src/paged-prefix-api.h
-@@ -0,0 +1,27 @@
-+#pragma once
-+// Thin test/diagnostic shim over the paged cross-request prefix engine seam
-+// (patch 0007). Lets a driver that only includes the public llama.h reach the
-+// gated llama_kv_cache::paged_prefix_* methods and the paged-alloc introspection
-+// without pulling in the internal kv-cache headers. All entry points are no-ops
-+// (return 0) unless env LLAMA_KV_PAGED is set. Experimental; not a stable API.
-+
-+#include "llama.h"
-+
-+namespace paged_prefix_api {
-+
-+// Reuse the longest cached content prefix of [tokens, tokens+n) for `seq` and
-+// return the number of shared prefix tokens (the caller decodes only the
-+// suffix). 0 if nothing was shared.
-+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
-+
-+// Publish `seq`'s full blocks into the content cache (call after its KV is computed).
-+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
-+
-+// Ref count of the paged block backing logical position `pos` of `seq` (unified
-+// stream 0), or -1 if unknown.
-+int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
-+
-+// Number of free blocks in the unified stream-0 pool, or 0 if no manager.
-+long num_free(llama_context * ctx);
-+
-+} // namespace paged_prefix_api
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
@@ -1,130 +0,0 @@
-From 240758ef7e144619c750aaf1d3339051ecc29098 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 17:02:22 +0200
-Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
- - patch 0008
-
-Wire the paged cross-request prefix recompute-skip (patch 0007's engine seam,
-paged_prefix_api::share/commit) into the llama-server continuous-batching loop
-(update_slots) so CONCURRENT requests that share a long prefix physically reuse
-one committed copy of the prefix blocks and prefill only their divergent suffix.
-Patch 0007 proved the engine seam correct via a standalone driver, but the server
-never called it: two concurrent shared-prefix requests each recomputed the full
-prefix. The server's native prompt cache only reuses a slot's OWN prior prompt
-(longest-common-prefix vs slot.prompt.tokens) - it does not share across distinct
-concurrent slots. 0008 adds that cross-slot share.
-
-Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
-
-  * In update_slots prompt-processing, after the native n_past is computed and
-    only for a FRESH slot (n_past < one block, i.e. the native cache did not
-    already cover the prefix), call paged_prefix_api::share() to splice the
-    longest committed cross-request prefix into this sequence (ref_cnt++ on the
-    shared physical blocks) and advance n_past past it, so the batch fill computes
-    ONLY the suffix. The slot's own divergent tail cells are removed first so the
-    shared cells own [n_past, kshare) without colliding (the native path removes
-    these later anyway). The n_past < block gate guarantees any block-aligned
-    share the engine returns is strictly larger than n_past and therefore always
-    adopted, so the engine's reservation always matches the suffix-only batch and
-    never leaves stale blocks (which otherwise fragment the paged pool).
-
-  * When a slot finishes prefill (SLOT_STATE_DONE_PROMPT -> GENERATING, the prefix
-    KV just computed), call paged_prefix_api::commit() to publish its prefix so
-    concurrent/later sharers can reuse it.
-
-The share() / commit() entry points are forward-declared (defined in libllama,
-src/paged-prefix-api.cpp) to avoid pulling internal kv-cache headers into the
-server translation unit.
-
-Verified in the server (32B NVFP4, CUDA, --kv-unified): with a live sequence
-holding the prefix, K=16/32 concurrent shared-prefix requests prefill only their
-~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens;
-K=16 23.9s -> 1.5s, K=32 57.9s -> 2.3s), the engine logs "shares ... prefix
-blocks - NOT recomputed" with ref_cnt>1, and greedy output stays within the
-documented CUDA batch-shape non-determinism band (stock native prompt-caching
-shows the same magnitude). Cross-request sharing requires the unified KV cache.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- tools/server/server-context.cpp | 50 +++++++++++++++++++++++++++++++++
- 1 file changed, 50 insertions(+)
-
-diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 39b7eb2..b5f9d37 100644
--- a/tools/server/server-context.cpp
-+++ b/tools/server/server-context.cpp
-@@ -16,6 +16,16 @@
- #include "mtmd.h"
- #include "mtmd-helper.h"
- 
-+// [paged 0008] Cross-request prefix recompute-skip shim. share()/commit() are
-+// defined in libllama (src/paged-prefix-api.cpp, patch 0007) and are no-ops
-+// unless env LLAMA_KV_PAGED is set. Declared here so the paged cross-slot prefix
-+// cache wires into update_slots() without pulling in internal kv-cache headers.
-+// Fully gated; stock (paged off) is byte-identical.
-+namespace paged_prefix_api {
-+    int32_t share (llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
-+    void    commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
-+}
-+
- #include <algorithm>
- #include <cstddef>
- #include <cinttypes>
-@@ -3335,6 +3345,37 @@ private:
-                             }
-                         }
- 
-+                        // [paged 0008] Cross-request prefix recompute-skip. The native prompt cache
-+                        // above only reuses THIS slot's own prior prompt; when the paged KV
-+                        // engine is active, also reuse a committed CROSS-slot prefix so
-+                        // concurrent requests sharing a long prefix skip recompute. Gated on
-+                        // LLAMA_KV_PAGED (paged_kv_share static); stock stays byte-identical.
-+                        static const bool paged_kv_share = getenv("LLAMA_KV_PAGED") != nullptr;
-+                        // Only attempt the cross-request share on a FRESH slot (the native
-+                        // cache above did not already cover the prefix). With n_past < a
-+                        // block, any block-aligned share the engine returns is strictly
-+                        // larger than n_past and is therefore always adopted below - so the
-+                        // engine's full-prompt reservation always matches the suffix-only
-+                        // submission and never leaves stale blocks (which fragmented the
-+                        // paged pool and crashed the server under high fan-out otherwise).
-+                        if (paged_kv_share && n_past < 16 && slot.task->params.cache_prompt && !input_tokens.has_mtmd) {
-+                            const llama_tokens ptoks = input_tokens.get_text_tokens();
-+                            // Drop this slot's own cells beyond the natively-cached prefix before
-+                            // splicing the shared physical prefix in, so the shared cells can own
-+                            // [n_past, kshare) without colliding (the native path removes exactly
-+                            // these later; a no-op for a fresh slot).
-+                            common_context_seq_rm(ctx_tgt, slot.id, n_past, -1);
-+                            const int32_t kshare = paged_prefix_api::share(ctx_tgt, slot.id, ptoks.data(), (int) ptoks.size());
-+                            if (kshare > n_past) {
-+                                slot.prompt.tokens.keep_first(n_past);
-+                                for (int i = n_past; i < kshare; ++i) {
-+                                    slot.prompt.tokens.push_back(ptoks[i]);
-+                                }
-+                                n_past = kshare;
-+                                SLT_INF(slot, "paged: reusing %d cross-request shared prefix tokens - not recomputed\n", n_past);
-+                            }
-+                        }
-+
-                         // [TAG_PROMPT_LOGITS]
-                         if (n_past == slot.task->n_tokens() && n_past > 0) {
-                             SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
-@@ -3741,6 +3782,15 @@ private:
-                 // prompt evaluated for next-token prediction
-                 slot.state = SLOT_STATE_GENERATING;
- 
-+                // [paged 0008] Publish this slot's computed prefix so concurrent/later
-+                // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
-+                // for [0, n_tokens) has just run, so the prefix KV is computed.
-+                static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
-+                if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
-+                    const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
-+                    paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
-+                }
-+
-                 if (slot.can_speculate()) {
-                     common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
-                 }
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch
@@ -1,609 +0,0 @@
-From 59490d82e4d0d4ad05ffb5ca3cccc668f4a75281 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 20:03:17 +0200
-Subject: [PATCH] paged in-kernel decode read (env LLAMA_KV_PAGED) - patch 0009
-
-Replace the per-layer per-step gather (patch 0003: ggml_get_rows of K/V into a
-contiguous buffer) with an in-kernel paged read on the decode step. build_attn
-passes the UNMODIFIED physical K/V views plus a block table (src[5] of
-ggml_flash_attn_ext: an I32 [n_view, n_stream] position-ordered physical-cell
-index, padded to FATTN_KQ_STRIDE). The CUDA fattn vec kernel and the CPU
-reference map logical KV index j -> physical cell block_table[seq*ne11+j] and
-read K_base+cell*nb11 / V_base+cell*nb21 in place, so the get_rows of K and V
-(the bulk of the gather) is gone. The mask stays a small compacted [n_view]
-causal mask in the same position order; KV_max / parallel_blocks / stream_k
-split-K are unchanged. The decode shape is forced onto the vec kernel (the only
-one wired for the block table); a nullptr block table => the stock contiguous
-read, byte-identical.
-
-Token-POSITION ordering keeps the flash-attn reduction order identical to stock,
-so CPU-paged logits == CPU-stock bit-for-bit (verified: 4-stream FA greedy, 64
-tokens). On GPU paged(vec) == stock(vec) at batch 1; at batch>1 it stays within
-the documented vec-vs-mma non-determinism band. Decode step at batch 32 / 1024
-ctx on GB10 (Qwen3-32B NVFP4): paged-gather 1279 ms -> in-kernel 696 ms (-46%),
-recovering the gather regression to stock parity (647 ms). Gated behind
-LLAMA_KV_PAGED; no-op (stock byte-identical) when unset.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/include/ggml.h                  |   6 ++
- ggml/src/ggml-cpu/ops.cpp            |  10 ++-
- ggml/src/ggml-cuda/fattn-common.cuh  |   8 +-
- ggml/src/ggml-cuda/fattn-mma-f16.cuh |   4 +-
- ggml/src/ggml-cuda/fattn-tile.cuh    |   4 +-
- ggml/src/ggml-cuda/fattn-vec.cuh     |  25 +++++--
- ggml/src/ggml-cuda/fattn-wmma-f16.cu |   4 +-
- ggml/src/ggml-cuda/fattn.cu          |   9 +++
- ggml/src/ggml.c                      |  14 ++++
- src/llama-graph.cpp                  |  23 ++++--
- src/llama-graph.h                    |   3 +-
- src/llama-kv-cache.cpp               |  31 ++++++++
- src/llama-kv-cache.h                 |   4 +
- src/paged-attn.cpp                   | 107 +++++++++++++++++++++++++++
- src/paged-attn.h                     |  18 +++++
- 15 files changed, 248 insertions(+), 22 deletions(-)
-
-diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
-index d6807b6..823f5a9 100644
--- a/ggml/include/ggml.h
-+++ b/ggml/include/ggml.h
-@@ -2427,6 +2427,12 @@ extern "C" {
-             struct ggml_tensor * a,
-             struct ggml_tensor * sinks);
- 
-+    // [paged] optional block table in src[5]: I32 [n_kv_logical, n_stream]; maps each
-+    // logical KV index to the physical cell within K/V. nullptr => stock contiguous read.
-+    GGML_API void ggml_flash_attn_ext_set_block_table(
-+            struct ggml_tensor * a,
-+            struct ggml_tensor * block_table);
-+
-     // TODO: needs to be adapted to ggml_flash_attn_ext
-     GGML_API struct ggml_tensor * ggml_flash_attn_back(
-            struct ggml_context * ctx,
-diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
-index 74611dc..63c07a2 100644
--- a/ggml/src/ggml-cpu/ops.cpp
-+++ b/ggml/src/ggml-cpu/ops.cpp
-@@ -8330,6 +8330,8 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
-     const ggml_tensor * v     = dst->src[2];
-     const ggml_tensor * mask  = dst->src[3];
-     const ggml_tensor * sinks = dst->src[4];
-+    const ggml_tensor * block_table = dst->src[5]; // [paged] logical->physical cell map (src[5])
-+    const int32_t     * bt    = block_table ? (const int32_t *) block_table->data : nullptr;
- 
-     GGML_TENSOR_LOCALS(int64_t, neq, q,   ne)
-     GGML_TENSOR_LOCALS(size_t,  nbq, q,   nb)
-@@ -8449,7 +8451,9 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
- 
-             float s; // KQ value
- 
-            const char * k_data = (const char *) k->data + ( ic*nbk1 + ik2*nbk2 + ik3*nbk3);
-+            // [paged] map the logical KV index ic to its physical cell via the block table.
-+            const int64_t ic_phys = bt ? (int64_t) bt[ik3*nek1 + ic] : ic;
-+            const char * k_data = (const char *) k->data + ( ic_phys*nbk1 + ik2*nbk2 + ik3*nbk3);
-             kq_vec_dot(DK, &s, 0, k_data, 0, Q_q, 0, 1);
- 
-             s = s*scale; // scale KQ value
-@@ -8465,7 +8469,7 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
-             float ms = 1.0f; // upon new higher max val, scale VKQ and KQ sum with this value
-             float vs = 1.0f; // post-softmax KQ value, expf(s - M)
- 
-            const char * v_data = ((const char *) v->data + (ic*nbv1 + iv2*nbv2 + iv3*nbv3));
-+            const char * v_data = ((const char *) v->data + (ic_phys*nbv1 + iv2*nbv2 + iv3*nbv3));
- 
-             if (v->type == GGML_TYPE_F16) {
-                 if (s > M) {
-@@ -9021,7 +9025,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
-         const int64_t dr = (nr + nchunk - 1) / nchunk;
- 
-         static constexpr int64_t Q_TILE_SZ  = ggml_fa_tile_config::Q;
-        bool use_tiled = !use_ref &&
-+        bool use_tiled = !use_ref && dst->src[5] == nullptr && // [paged] one_chunk honors the block table
-                                (q->type == GGML_TYPE_F32 &&
-                                 kv_is_f32_or_f16 &&
-                                 k->type == v->type &&
-diff --git a/ggml/src/ggml-cuda/fattn-common.cuh b/ggml/src/ggml-cuda/fattn-common.cuh
-index 8dfa51a..3c6ddd5 100644
--- a/ggml/src/ggml-cuda/fattn-common.cuh
-+++ b/ggml/src/ggml-cuda/fattn-common.cuh
-@@ -39,7 +39,8 @@ typedef void (* fattn_kernel_t)(
-                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
-                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
-                            const int32_t nb31, const int32_t nb32, const int64_t nb33);
-+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
-+        const int  * __restrict__ block_table);
- 
- typedef float (*vec_dot_KQ_t)(
-     const char * __restrict__ K_c, const void * __restrict__ Q_v, const int * __restrict__ Q_q8 , const void * __restrict__ Q_ds);
-@@ -981,6 +982,8 @@ void launch_fattn(
- 
-     const ggml_tensor * mask  = dst->src[3];
-     const ggml_tensor * sinks = dst->src[4];
-+    const ggml_tensor * block_table = dst->src[5]; // [paged] optional logical->physical map
-+    const int * bt_ptr = block_table ? (const int *) block_table->data : nullptr;
- 
-     ggml_tensor * KQV = dst;
- 
-@@ -1217,7 +1220,8 @@ void launch_fattn(
-         K->ne[0], K->ne[1], K->ne[2], K->ne[3], nb11, nb12, nb13,
-         nb21, nb22, nb23,
-         mask ? mask->ne[1] : 0, mask ? mask->ne[2] : 0, mask ? mask->ne[3] : 0,
-        mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0
-+        mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0,
-+        bt_ptr
-     );
-     CUDA_CHECK(cudaGetLastError());
- 
-diff --git a/ggml/src/ggml-cuda/fattn-mma-f16.cuh b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
-index 83478a0..0a92cd6 100644
--- a/ggml/src/ggml-cuda/fattn-mma-f16.cuh
-+++ b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
-@@ -1723,7 +1723,9 @@ static __global__ void flash_attn_ext_f16(
-                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
-                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
-+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
-+        const int  * __restrict__ block_table) {
-+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
-     ggml_cuda_pdl_sync(); // TODO optimize placement
- #if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE))
-     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
-diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
-index 0a09981..0ff14e6 100644
--- a/ggml/src/ggml-cuda/fattn-tile.cuh
-+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
-@@ -808,7 +808,9 @@ static __global__ void flash_attn_tile(
-                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
-                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
-+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
-+        const int  * __restrict__ block_table) {
-+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
- #ifdef FLASH_ATTN_AVAILABLE
-     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
-     const char * GGML_CUDA_RESTRICT K        = K_ptr;
-diff --git a/ggml/src/ggml-cuda/fattn-vec.cuh b/ggml/src/ggml-cuda/fattn-vec.cuh
-index 69dd936..a09e2fb 100644
--- a/ggml/src/ggml-cuda/fattn-vec.cuh
-+++ b/ggml/src/ggml-cuda/fattn-vec.cuh
-@@ -39,7 +39,8 @@ static __global__ void flash_attn_ext_vec(
-                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
-                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
-+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
-+        const int  * __restrict__ block_table) {
-     ggml_cuda_pdl_lc();
- #ifdef FLASH_ATTN_AVAILABLE
-     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
-@@ -61,7 +62,7 @@ static __global__ void flash_attn_ext_vec(
-                   nb11, nb12, nb13,
-                   nb21, nb22, nb23,
-                   ne31, ne32, ne33,
-                  nb31, nb32, nb33);
-+                  nb31, nb32, nb33, block_table);
-         NO_DEVICE_CODE;
-         return;
-     }
-@@ -110,6 +111,14 @@ static __global__ void flash_attn_ext_vec(
-     K += nb13*sequence + nb12*(head / gqa_ratio);
-     V += nb23*sequence + nb22*(head / gqa_ratio);
- 
-+    // [paged] in-kernel block-table read: logical KV index j -> physical cell
-+    // block_table[sequence*ne11 + j]; read K0 + cell*nb11 / V0 + cell*nb21. The
-+    // mask/KV_max stay logical (the table is in token-position order). nullptr =>
-+    // the stock contiguous read below.
-+    const char * GGML_CUDA_RESTRICT K0 = K;
-+    const char * GGML_CUDA_RESTRICT V0 = V;
-+    const int  * GGML_CUDA_RESTRICT bt = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
-+
-     const half * maskh  = (const half  *) (mask + nb33*(sequence % ne33) + nb31*ic0);
- 
-     const float slope = get_alibi_slope(max_bias, head, n_head_log2, m0, m1);
-@@ -267,10 +276,11 @@ static __global__ void flash_attn_ext_vec(
- #pragma unroll
-         for (int i_KQ_0 = 0; i_KQ_0 < nthreads_KQ; ++i_KQ_0) {
-             const int i_KQ = threadIdx.y*WARP_SIZE + (nthreads_KQ == WARP_SIZE ? 0 : (threadIdx.x & ~(nthreads_KQ-1))) + i_KQ_0;
-+            const char * GGML_CUDA_RESTRICT K_blk = bt ? (K0 + (int64_t) bt[k_VKQ_0 + i_KQ]*nb11) : (K + i_KQ*nb11);
- 
- #pragma unroll
-             for (int j = 0; j < ncols; ++j) {
-                float sum = vec_dot_KQ(K + i_KQ*nb11, Q_reg[j], Q_i32[j], Q_ds[j]);
-+                float sum = vec_dot_KQ(K_blk, Q_reg[j], Q_i32[j], Q_ds[j]);
-                 sum = warp_reduce_sum<nthreads_KQ>(sum);
- 
-                 if (use_logit_softcap) {
-@@ -324,6 +334,7 @@ static __global__ void flash_attn_ext_vec(
- #pragma unroll
-         for (int k0 = 0; k0 < WARP_SIZE; k0 += V_cols_per_iter) {
-             const int k = threadIdx.y*WARP_SIZE + k0 + (nthreads_V == WARP_SIZE ? 0 : threadIdx.x / nthreads_V);
-+            const char * GGML_CUDA_RESTRICT V_blk = bt ? (V0 + (int64_t) bt[k_VKQ_0 + k]*nb21) : (V + k*nb21);
- 
- #ifdef V_DOT2_F32_F16_AVAILABLE
-             half2 KQ_k[ncols];
-@@ -336,14 +347,14 @@ static __global__ void flash_attn_ext_vec(
-                 half2 tmp[V_rows_per_thread/2];
-                 if constexpr (type_V == GGML_TYPE_BF16) {
-                     float2 tmp_f[V_rows_per_thread/2];
-                    dequantize_V(V + k*nb21, tmp_f,
-+                    dequantize_V(V_blk, tmp_f,
-                         2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
- #pragma unroll
-                     for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
-                         tmp[i_VKQ_1] = __float22half2_rn(tmp_f[i_VKQ_1]);
-                     }
-                 } else {
-                    dequantize_V(V + k*nb21, tmp,
-+                    dequantize_V(V_blk, tmp,
-                         2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
-                 }
- #pragma unroll
-@@ -363,7 +374,7 @@ static __global__ void flash_attn_ext_vec(
- #pragma unroll
-             for (int i_VKQ_0 = 0; i_VKQ_0 < D/2; i_VKQ_0 += nthreads_V*V_rows_per_thread/2) {
-                 float2 tmp[V_rows_per_thread/2];
-                dequantize_V(V + k*nb21, tmp,
-+                dequantize_V(V_blk, tmp,
-                     2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
- #pragma unroll
-                 for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
-@@ -522,7 +533,7 @@ static __global__ void flash_attn_ext_vec(
-               nb11, nb12, nb13,
-               nb21, nb22, nb23,
-               ne31, ne32, ne33,
-              nb31, nb32, nb33);
-+              nb31, nb32, nb33, block_table);
-     NO_DEVICE_CODE;
- #endif // FLASH_ATTN_AVAILABLE
- }
-diff --git a/ggml/src/ggml-cuda/fattn-wmma-f16.cu b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
-index 6850716..5357849 100644
--- a/ggml/src/ggml-cuda/fattn-wmma-f16.cu
-+++ b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
-@@ -44,7 +44,9 @@ static __global__ void flash_attn_ext_f16(
-                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
-                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
-+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
-+        const int  * __restrict__ block_table) {
-+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
- #if defined(FLASH_ATTN_AVAILABLE) && (defined(GGML_HIP_ROCWMMA_FATTN) && defined(GGML_USE_WMMA_FATTN))
-     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
-     const char * GGML_CUDA_RESTRICT K        = K_ptr;
-diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
-index d6c501b..e3771ee 100644
--- a/ggml/src/ggml-cuda/fattn.cu
-+++ b/ggml/src/ggml-cuda/fattn.cu
-@@ -574,6 +574,15 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
- 
- void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
-     ggml_cuda_set_device(ctx.device);
-+
-+    // [paged] the block table (src[5]) is only honored by the vec kernel's
-+    // in-kernel read; force it. build_attn only sets it for a vec-supported
-+    // 1-token-per-stream decode shape.
-+    if (dst->src[5] != nullptr) {
-+        ggml_cuda_flash_attn_ext_vec(ctx, dst);
-+        return;
-+    }
-+
-     switch (ggml_cuda_get_best_fattn_kernel(ggml_cuda_get_device(), dst)) {
-         case BEST_FATTN_KERNEL_NONE:
-             GGML_ABORT("fatal error");
-diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
-index b43016c..adbe52b 100644
--- a/ggml/src/ggml.c
-+++ b/ggml/src/ggml.c
-@@ -5442,6 +5442,20 @@ void ggml_flash_attn_ext_add_sinks(
-     a->src[4] = sinks;
- }
- 
-+void ggml_flash_attn_ext_set_block_table(
-+        struct ggml_tensor * a,
-+        struct ggml_tensor * block_table) {
-+    if (!block_table) {
-+        a->src[5] = NULL;
-+        return;
-+    }
-+
-+    GGML_ASSERT(a->op == GGML_OP_FLASH_ATTN_EXT);
-+    GGML_ASSERT(block_table->type == GGML_TYPE_I32);
-+
-+    a->src[5] = block_table;
-+}
-+
- // ggml_flash_attn_back
- 
- struct ggml_tensor * ggml_flash_attn_back(
-diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
-index b59d2a5..abdb48d 100644
--- a/src/llama-graph.cpp
-+++ b/src/llama-graph.cpp
-@@ -2074,7 +2074,8 @@ ggml_tensor * llm_graph_context::build_attn_mha(
-          ggml_tensor * sinks,
-          ggml_tensor * v_mla,
-                float   kq_scale,
-                 int   il) const {
-+                 int   il,
-+         ggml_tensor * block_table) const {
-     const bool v_trans = v->nb[1] > v->nb[2];
- 
-     // split the batch into streams if needed
-@@ -2109,6 +2110,9 @@ ggml_tensor * llm_graph_context::build_attn_mha(
-                                   hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
-         cb(cur, LLAMA_TENSOR_NAME_FATTN, il);
- 
-+        if (block_table) {
-+            ggml_flash_attn_ext_set_block_table(cur, block_table);
-+        }
-         ggml_flash_attn_ext_add_sinks(cur, sinks);
-         ggml_flash_attn_ext_set_prec (cur, GGML_PREC_F32);
- 
-@@ -2358,12 +2362,19 @@ ggml_tensor * llm_graph_context::build_attn(
-     ggml_tensor * k = mctx_cur->get_k(ctx0, il);
-     ggml_tensor * v = mctx_cur->get_v(ctx0, il);
- 
-    // [paged 0003] gather K, V and the mask to the sequence's used cells only
-    //   (no-op unless env LLAMA_KV_PAGED is set).
-    ggml_tensor * kq_mask_g = kq_mask;
-    paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
-+    // [paged] decode read: when paging is active and this is a 1-token-per-stream
-+    //   decode step, present K/V as n_gather views + a block table so the fattn
-+    //   kernel reads the sequence's cells in-kernel (no get_rows of K/V). Else
-+    //   fall back to the gather-read (prefill, transposed V, or env off). All a
-+    //   no-op unless env LLAMA_KV_PAGED is set => stock byte-identical.
-+    ggml_tensor * kq_mask_g   = kq_mask;
-+    ggml_tensor * block_table = nullptr;
-+    const bool is_decode = (q_cur->ne[2] == k->ne[3]); // 1 query token per stream
-+    if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, &k, &v, &kq_mask_g, &block_table))) {
-+        paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
-+    }
- 
-    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
-+    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il, block_table);
-     cb(cur, "kqv_out", il);
- 
-     if (inp->self_v_rot) {
-diff --git a/src/llama-graph.h b/src/llama-graph.h
-index 5e8a658..c95ae49 100644
--- a/src/llama-graph.h
-+++ b/src/llama-graph.h
-@@ -969,7 +969,8 @@ struct llm_graph_context {
-             ggml_tensor * sinks,   // [n_head_q]
-             ggml_tensor * v_mla,   // [n_embd_head_v_mla, n_embd_head_v, n_head_v]
-                   float   kq_scale,
-                    int   il) const;
-+                    int   il,
-+            ggml_tensor * block_table = nullptr) const; // [paged] optional src[5] block table
- 
-     llm_graph_input_attn_no_cache * build_attn_inp_no_cache() const;
- 
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 7510ff9..0351f86 100644
--- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -1474,6 +1474,33 @@ void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_in
-     }
- }
- 
-+void llama_kv_cache::get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const {
-+    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
-+    for (uint32_t j = 0; j < ns; ++j) {
-+        const auto & cells = v_cells[sinfo.s0 + j];
-+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
-+        std::vector<std::pair<llama_pos, int32_t>> pc;
-+        pc.reserve(n);
-+        int32_t pad = -1;
-+        for (uint32_t i = 0; i < n; ++i) {
-+            if (!cells.is_empty(i)) {
-+                pc.emplace_back(cells.pos_get(i), (int32_t) i);
-+            } else if (pad < 0) {
-+                pad = (int32_t) i;
-+            }
-+        }
-+        std::sort(pc.begin(), pc.end());
-+        int32_t * col = dst + (size_t) j * n_blk;
-+        for (size_t k = 0; k < pc.size(); ++k) {
-+            col[k] = pc[k].second;
-+        }
-+        const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
-+        for (uint32_t k = (uint32_t) pc.size(); k < n_blk; ++k) {
-+            col[k] = padv;
-+        }
-+    }
-+}
-+
- ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
-     GGML_UNUSED(sinfo);
- 
-@@ -2773,6 +2800,10 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
-     kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
- }
- 
-+void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const {
-+    kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]);
-+}
-+
- ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
-     return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
- }
-diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
-index f374ac6..e9980b6 100644
--- a/src/llama-kv-cache.h
-+++ b/src/llama-kv-cache.h
-@@ -176,6 +176,9 @@ public:
-     //   gather-read. get_n_gather returns the max count across streams.
-     uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
-     void     get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
-+    // [paged inc1] block table [n_blk, n_stream] (position order, padded to n_blk
-+    //   per column with a masked empty cell) for the in-kernel paged read.
-+    void     get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const;
- 
-     // store k_cur and v_cur in the cache based on the provided head location
-     ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
-@@ -386,6 +389,7 @@ public:
-     //   current ubatch's stream).
-     uint32_t get_n_gather() const;
-     void     get_gather_idxs(int32_t * dst) const;
-+    void     get_block_table(int32_t * dst, uint32_t n_blk) const;
- 
-     // store k_cur and v_cur in the cache based on the provided head location
-     // note: the heads in k_cur and v_cur should be laid out contiguously in memory
-diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
-index ade75e8..8eebeaa 100644
--- a/src/paged-attn.cpp
-+++ b/src/paged-attn.cpp
-@@ -43,6 +43,25 @@ public:
-     ggml_tensor * idxs;
- };
- 
-+// Block table filler for the in-kernel paged read: fills an I32 [n_blk, n_stream]
-+// tensor with each stream's position-ordered cells, padded to n_blk (per column)
-+// with a masked empty cell, by delegating to the kv-cache context.
-+class input_block_table : public llm_graph_input_i {
-+public:
-+    input_block_table(const llama_kv_cache_context * mctx, ggml_tensor * idxs, uint32_t n_blk)
-+        : mctx(mctx), idxs(idxs), n_blk(n_blk) {}
-+
-+    void set_input(const llama_ubatch * ubatch) override {
-+        GGML_UNUSED(ubatch);
-+        GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
-+        mctx->get_block_table((int32_t *) idxs->data, n_blk);
-+    }
-+
-+    const llama_kv_cache_context * mctx;
-+    ggml_tensor * idxs;
-+    uint32_t n_blk;
-+};
-+
- } // namespace
- 
- void gather(ggml_context * ctx0,
-@@ -125,4 +144,92 @@ void gather(ggml_context * ctx0,
-     }
- }
- 
-+bool in_kernel_decode(ggml_context * ctx0,
-+                      llm_graph_result * res,
-+                      const llama_kv_cache_context * mctx,
-+                      ggml_tensor ** k,
-+                      ggml_tensor ** v,
-+                      ggml_tensor ** kq_mask,
-+                      ggml_tensor ** block_table) {
-+    if (!active()) {
-+        return false;
-+    }
-+    // Bench escape hatch: LLAMA_KV_PAGED_GATHER=1 forces the old gather-read decode
-+    // path (for a same-build BEFORE/AFTER decode-step comparison). Dev-only.
-+    static const bool force_gather = (std::getenv("LLAMA_KV_PAGED_GATHER") != nullptr);
-+    if (force_gather) {
-+        return false;
-+    }
-+
-+    ggml_tensor * K = *k;
-+    ggml_tensor * V = *v;
-+    ggml_tensor * M = *kq_mask;
-+
-+    const int64_t n_stream = K->ne[3];
-+    GGML_ASSERT(M->ne[3] == n_stream);
-+
-+    const int64_t n_gather = (int64_t) mctx->get_n_gather();
-+    if (n_gather <= 0) {
-+        // Worst-case reserve / nothing placed yet: keep the dense [0,n_kv) read.
-+        return false;
-+    }
-+
-+    // The in-kernel read addresses V along its d-major (non-transposed) axis. If
-+    // the cache stores V transposed, fall back to gather() (which normalizes it).
-+    if (V->nb[1] > V->nb[2]) {
-+        return false;
-+    }
-+
-+    if (debug()) {
-+        static int64_t once = 0;
-+        if (once++ < 2) {
-+            fprintf(stderr, "[paged-attn] in-kernel decode n_stream=%lld n_kv=%lld n_gather=%lld\n",
-+                    (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
-+        }
-+    }
-+
-+    // Block table [n_gather, n_stream]: column s holds stream s's non-empty cells
-+    // in token-POSITION order (identical to the gather index, so the reduction
-+    // order matches stock bit-for-bit), padded with a masked empty cell. Filled
-+    // at set_input from the kv-cache (get_gather_idxs), exactly like the gather.
-+    // Pad the logical length to FATTN_KQ_STRIDE (256) so the CUDA fattn vec kernel
-+    // reads fixed 128-wide KV blocks without overrun and the KV_max mask scan
-+    // engages; padded entries point at a masked empty cell (0 contribution). Stays
-+    // <= n_kv since n_kv is itself padded to 256 and n_gather <= n_kv.
-+    int64_t n_view = GGML_PAD(n_gather, 256);
-+    if (n_view > K->ne[2]) {
-+        n_view = K->ne[2];
-+    }
-+
-+    ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
-+    ggml_set_input(idx);
-+    res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
-+
-+    // Present K and V as [d, h, n_view, ns] VIEWS of the full physical window:
-+    // identical per-cell (nb1,nb2) and per-stream (nb3) strides, only the cell
-+    // dim shrinks to n_view. NOT materialized - the kernel reads in place.
-+    *k = ggml_view_4d(ctx0, K, K->ne[0], K->ne[1], n_view, n_stream,
-+                      K->nb[1], K->nb[2], K->nb[3], 0);
-+    *v = ggml_view_4d(ctx0, V, V->ne[0], V->ne[1], n_view, n_stream,
-+                      V->nb[1], V->nb[2], V->nb[3], 0);
-+
-+    // Compact the mask to [n_gather, n_tps, 1, ns] in the same position order so
-+    // the kernel's logical mask index aligns with the block table. Cheap: the
-+    // mask is ~(d*h) smaller than K/V, which is why only its get_rows remains.
-+    {
-+        ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream);
-+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
-+        m = ggml_get_rows(ctx0, m, idx);
-+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
-+        m = ggml_reshape_4d(ctx0, m, n_view, M->ne[1], 1, n_stream);
-+        if (M->type != m->type) {
-+            m = ggml_cast(ctx0, m, M->type);
-+        }
-+        *kq_mask = m;
-+    }
-+
-+    *block_table = idx;
-+    return true;
-+}
-+
- } // namespace paged_attn
-diff --git a/src/paged-attn.h b/src/paged-attn.h
-index c5b7bd7..23e2184 100644
--- a/src/paged-attn.h
-+++ b/src/paged-attn.h
-@@ -37,4 +37,22 @@ void gather(ggml_context * ctx0,
-             ggml_tensor ** v,
-             ggml_tensor ** kq_mask);
- 
-+// [paged inc1] In-kernel paged decode read. Instead of materializing the
-+// sequence's cells (gather()), present K and V as n_gather-length VIEWS of the
-+// full physical window and return the position-ordered physical-cell index list
-+// as a block table (src[5] of ggml_flash_attn_ext). The fattn kernel/op then
-+// reads K_base + block_table[j]*nb in-kernel, removing the get_rows of K and V
-+// (the bulk of the gather). On return (true): *k,*v point at the views, *kq_mask
-+// at the compacted mask, *block_table at the I32 [n_gather, n_stream] index.
-+// Returns false (leaving *k,*v,*kq_mask untouched) when the in-kernel path does
-+// not apply - env off, nothing placed, or a transposed V cache - so the caller
-+// keeps the dense gather()/contiguous read.
-+bool in_kernel_decode(ggml_context * ctx0,
-+                      llm_graph_result * res,
-+                      const llama_kv_cache_context * mctx,
-+                      ggml_tensor ** k,
-+                      ggml_tensor ** v,
-+                      ggml_tensor ** kq_mask,
-+                      ggml_tensor ** block_table);
-+
- } // namespace paged_attn
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch
@@ -1,269 +0,0 @@
-From 9ac56933abd5de4a1f349c811c2d74aab09f7ab1 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 22:36:09 +0200
-Subject: [PATCH] paged tile in-kernel decode read + dispatch guard (env
- LLAMA_KV_PAGED) - patch 0010
-
-Increment 2 (robustness, ~0 headline ms): make the paged in-kernel decode read
-safe against silent mis-routing, and plumb the same read into the tile kernel
-for the increment-3 GQA head-group work.
-
-fattn-tile.cuh: graft the patch-0009 phys(j) block-table read (mirror of
-fattn-vec.cuh). Both flash_attn_tile_load_tile overloads, flash_attn_tile_iter_KQ
-(K) and flash_attn_tile_iter (V) take an optional per-sequence block table; a row
-i is read from base + block_table[row_base + i]*stride instead of base + i*stride.
-The table defaults to nullptr (default args + a null bt_seq when src[5] is unset),
-so every existing non-paged caller is byte-identical to stock. The mask / KV_max
-stay logical (token-position order), as in vec.
-
-fattn.cu: DISPATCH GUARD. When the block table (src[5]) is present, route ONLY to
-the vec or tile kernel and never fall through to the best-kernel switch. The
-mma/wmma kernels GGML_UNUSED the table and would silently read the wrong
-(contiguous physical) cells; the guard makes that unreachable. The vec dispatcher
-GGML_ABORTs for an unsupported D/type rather than mis-reading. Default route is vec
-(the inc-1 byte-validated path). LLAMA_KV_PAGED_DISPATCH_LOG=1 prints the routed
-kernel once.
-
-Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B, build-cpu) PASS. GPU
-vec-paged == stock at -s 1 PASS. Dispatch confirmed VEC for the real decode shape:
-Qwen3-0.6B Q ne=[128,1,16,1] and Qwen3-32B NVFP4 Q ne=[128,1,64,N] both route to
-vec, matching the nsys profile (flash_attn_ext_vec).
-
-The tile graft is plumbed for increment-3 GQA head-group reuse but is EXPERIMENTAL
-and NOT yet byte-validated (LLAMA_KV_PAGED_TILE=1). A tile-vs-tile gate shows
-tile-paged diverging from tile-stock at the first cross-tile KV depth: the
-GQA-grouped (ncols2>1) tile path reads a full nbatch_fa-row tile with
-oob_check=false while the compacted paged mask is not padded to cover the tile, so
-past-end rows leak. vec bounds its KV walk by KV_max and is unaffected. Bounding
-the tile path is increment-3 work; the default vec route and all stock paths are
-untouched.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/src/ggml-cuda/fattn-tile.cuh | 45 ++++++++++++++++++++-----------
- ggml/src/ggml-cuda/fattn.cu       | 38 +++++++++++++++++++++++---
- 2 files changed, 64 insertions(+), 19 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
-index 0ff14e6..bb84d61 100644
--- a/ggml/src/ggml-cuda/fattn-tile.cuh
-+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
-@@ -373,7 +373,8 @@ static constexpr __device__ int ggml_cuda_fattn_tile_get_nbatch_K(const int DKQ,
- // TODO: deduplicate with mma-f16
- template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
- static __device__ __forceinline__ void flash_attn_tile_load_tile(
-        const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
-+        const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
-+        const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
-     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
-     constexpr int cpy_ne = cpy_nb / 4;
- 
-@@ -402,9 +403,11 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
-                     const int j = j0*cpy_ne + (stride_j == warp_size ? threadIdx.x : threadIdx.x % stride_j)*cpy_ne;
- 
-                     const __align__(16) half2 zero[cpy_ne] = {{0.0f, 0.0f}};
-+                    // [paged] remap the row through the block table (nullptr => stock contiguous read).
-+                    const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
-                     ggml_cuda_memcpy_1<cpy_nb>(
-                         tile_KV + i*(J/2 + J_padding) + j,
-                        !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
-+                        !oob_check || i < i_sup ? KV_row + j : zero);
-                 }
-             }
-         }
-@@ -423,7 +426,8 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
- 
- template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
- static __device__ __forceinline__ void flash_attn_tile_load_tile(
-        const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
-+        const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
-+        const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
-     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
-     constexpr int cpy_ne = cpy_nb / 4;
- 
-@@ -453,8 +457,10 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
- 
-                     const half2 zero[cpy_ne/2] = {{0.0f, 0.0f}};
-                     __align__(16) half2 tmp_h2[cpy_ne/2];
-+                    // [paged] remap the row through the block table (nullptr => stock contiguous read).
-+                    const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
-                     ggml_cuda_memcpy_1<sizeof(tmp_h2)>(
-                        tmp_h2, !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
-+                        tmp_h2, !oob_check || i < i_sup ? KV_row + j : zero);
- 
-                     __align__(16) float2 tmp_f2[cpy_ne/2];
- #pragma unroll
-@@ -487,6 +493,7 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
-         const int k_VKQ_0,
-         const int k_VKQ_sup,
-         const int k_KQ_0,
-+        const int * const __restrict__ block_table,
-         float * KQ_acc) {
-     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
-     constexpr int cpy_ne = cpy_nb / 4;
-@@ -495,8 +502,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
-     constexpr int cpw   = ncols > nwarps ? ncols/nwarps : 1; // Q columns per warp
-     constexpr int np    = nwarps > ncols ? nwarps/ncols : 1; // number of parallel warps per Q column
- 
-+    // [paged] when block_table is set K_h2 is the un-offset base; the table supplies the row.
-+    const half2 * const K_base = block_table ? (K_h2 + k_KQ_0/2) : (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2);
-     flash_attn_tile_load_tile<warp_size, nwarps, nbatch_fa, nbatch_K, cpy_ne, oob_check>
-        (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2, KV_tmp, stride_K2, k_VKQ_sup);
-+        (K_base, KV_tmp, stride_K2, k_VKQ_sup, block_table, k_VKQ_0);
-     __syncthreads();
- 
- #ifdef FAST_FP16_AVAILABLE
-@@ -572,7 +581,8 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
-         T_acc * const VKQ,
-         const int k_VKQ_0,
-         const int k_VKQ_max,
-        const int col_Q_0) {
-+        const int col_Q_0,
-+        const int * const __restrict__ block_table) {
-     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
-     constexpr int cpy_ne = cpy_nb / 4;
- 
-@@ -605,12 +615,12 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
- #pragma unroll
-     for (int k_KQ_0 = 0; k_KQ_0 < DKQ - nbatch_K_last; k_KQ_0 += nbatch_K) {
-         flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>(
-            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
-+            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
-     }
-     if (nbatch_K_last > 0) {
-         constexpr int k_KQ_0 = DKQ - nbatch_K_last;
-         flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K_last, use_logit_softcap, oob_check>(
-            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
-+            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
-     }
- 
-     // Apply logit softcap + mask, update KQ_max:
-@@ -715,8 +725,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
-     static_assert(nbatch_V % np == 0, "bad nbatch_V");
- #pragma unroll
-     for (int k0 = 0; k0 < nbatch_fa; k0 += nbatch_V) {
-+        // [paged] when block_table is set V_h2 is the un-offset base; the table supplies the row.
-+        const half2 * const V_base = block_table ? V_h2 : (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2);
-         flash_attn_tile_load_tile<warp_size, nwarps, nbatch_V, DV, 0, oob_check>
-            (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2, KV_tmp, stride_V2, k_VKQ_sup - k0);
-+            (V_base, KV_tmp, stride_V2, k_VKQ_sup - k0, block_table, k_VKQ_0 + k0);
-         __syncthreads();
- 
- #ifdef FAST_FP16_AVAILABLE
-@@ -810,7 +822,6 @@ static __global__ void flash_attn_tile(
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
-                             const int32_t nb31, const int32_t nb32, const int64_t nb33,
-         const int  * __restrict__ block_table) {
-    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
- #ifdef FLASH_ATTN_AVAILABLE
-     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
-     const char * GGML_CUDA_RESTRICT K        = K_ptr;
-@@ -837,7 +848,7 @@ static __global__ void flash_attn_tile(
-                   nb11, nb12, nb13,
-                   nb21, nb22, nb23,
-                   ne31, ne32, ne33,
-                  nb31, nb32, nb33);
-+                  nb31, nb32, nb33, block_table);
-         NO_DEVICE_CODE;
-         return;
-     }
-@@ -861,6 +872,10 @@ static __global__ void flash_attn_tile(
-     const half2 * K_h2 = (const half2 *) (K + nb13*sequence + nb12*(head0 / gqa_ratio));
-     const half2 * V_h2 = (const half2 *) (V + nb23*sequence + nb22*(head0 / gqa_ratio)); // K and V have same shape
- 
-+    // [paged] per-sequence logical->physical block table in token-position order
-+    // (mask/KV_max stay logical); nullptr => the stock contiguous read.
-+    const int * const __restrict__ bt_seq = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
-+
-     const half * maskh = mask ? (const half *) (mask + nb33*(sequence % ne33)) : nullptr;
- 
-     const int stride_K2   = nb11 / sizeof(half2);
-@@ -963,14 +978,14 @@ static __global__ void flash_attn_tile(
-             constexpr bool oob_check = false;
-             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
-                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
-+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
-             k_VKQ_0 += gridDim.y*nbatch_fa;
-         }
-         if (k_VKQ_0 < k_VKQ_max) {
-             constexpr bool oob_check = true;
-             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
-                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
-+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
-         }
-     } else {
-         // Branch without out-of-bounds checks.
-@@ -978,7 +993,7 @@ static __global__ void flash_attn_tile(
-             constexpr bool oob_check = false;
-             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
-                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
-+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
-         }
-     }
- 
-@@ -1144,7 +1159,7 @@ static __global__ void flash_attn_tile(
-               nb11, nb12, nb13,
-               nb21, nb22, nb23,
-               ne31, ne32, ne33,
-              nb31, nb32, nb33);
-+              nb31, nb32, nb33, block_table);
-     NO_DEVICE_CODE;
- #endif // FLASH_ATTN_AVAILABLE
- }
-diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
-index e3771ee..afcafa2 100644
--- a/ggml/src/ggml-cuda/fattn.cu
-+++ b/ggml/src/ggml-cuda/fattn.cu
-@@ -575,11 +575,41 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
- void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
-     ggml_cuda_set_device(ctx.device);
- 
-    // [paged] the block table (src[5]) is only honored by the vec kernel's
-    // in-kernel read; force it. build_attn only sets it for a vec-supported
-    // 1-token-per-stream decode shape.
-+    // [paged] DISPATCH GUARD. The block table (src[5]) is read in-kernel ONLY by
-+    // the vec and tile kernels; the mma/wmma kernels GGML_UNUSED it and would
-+    // silently read the wrong (contiguous physical) cells. So when a block table
-+    // is present we route here and NEVER fall through to the best-kernel switch
-+    // below - no decode shape can silently reach an mma/wmma misread. build_attn
-+    // only sets src[5] for the 1-token-per-stream decode shape; the vec
-+    // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
-+    // and any shape that should not be paged must take the host-side gather path
-+    // (LLAMA_KV_PAGED_GATHER=1) instead.
-+    //
-+    // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
-+    // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
-+    // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
-+    // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
-+    // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
-+    // with oob_check=false while the compacted paged mask is not padded to cover
-+    // it, so it diverges from stock. Not for production paged decode until
-+    // increment-3 bounds that path; the default vec route is unaffected.
-     if (dst->src[5] != nullptr) {
--        ggml_cuda_flash_attn_ext_vec(ctx, dst);
-+        static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
-+        if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
-+            static bool logged = false;
-+            if (!logged) {
-+                logged = true;
-+                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
-+                    paged_tile ? "TILE(experimental)" : "VEC",
-+                    (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
-+                    (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
-+            }
-+        }
-+        if (paged_tile) {
-+            ggml_cuda_flash_attn_ext_tile(ctx, dst);
-+        } else {
-+            ggml_cuda_flash_attn_ext_vec(ctx, dst);
-+        }
-         return;
-     }
- 
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
@@ -1,147 +0,0 @@
-From d5ca5cd756e42214d0003bca815ca91943679b0d Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Tue, 23 Jun 2026 00:18:35 +0200
-Subject: [PATCH] paged decode: route GQA-grouped tile kernel by default (F16,
- gqa>=2) - patch 0011
-
-Increment 3 (the attention lever). In fattn.cu's paged dispatch guard, route the
-in-kernel decode to the tile kernel for the common grouped-query F16 case, and
-keep the inc-1 vec kernel for everything else.
-
-The tile kernel carries native GQA head-group reuse: its ncols2 axis groups the
-q-heads that share one kv-head, so each K/V row is loaded once for the whole
-group instead of once per q-head. vec re-streams each kv-head's K/V once per
-q-head (8x for Qwen3-32B's n_head 64 / n_head_kv 8) and runs at 168 regs ->
-3 blocks/SM = 25% occupancy on GB10; tile is 108-128 regs with native grouping.
-The inc-2 phys(j) block-table read was already plumbed into tile (patch 0010);
-this patch makes it the default for {F16 K and V, gqa_ratio >= 2}.
-
-Routing guard (why conditional): the tile kernel has no K/V type template - it
-loads half2 - so a non-F16 cache (BF16 / quantized) would be converted by
-launch_fattn to a contiguous F16 copy, which breaks the in-kernel block-table
-read (the table indexes the original paged layout, not the copy). So tile is
-correct only for an F16 cache; non-F16 caches and the non-grouped gqa==1 shape
-fall back to the inc-1 vec path, exactly as before this change. The head-group
-reuse also only helps at gqa_ratio >= 2. LLAMA_KV_PAGED_VEC=1 forces vec for A/B.
-Note: paged decode is currently exercised with an F16 cache only; quantized +
-paged is a separate pre-existing limitation, independent of this change
-(verified: stock + q8_0 cache works, but paged + q8_0 aborts both before and
-after this patch, since both route the non-F16 cache to vec).
-
-Measured GB10 (sm_121, 48 SM), Qwen3-32B NVFP4 dense, F16 cache, gqa 8, batch 32,
-1024 ctx, llama-batched-bench npp=1024 ntg=128 npl=32, GGML_CUDA_DISABLE_GRAPHS=1,
-same build, env-toggled:
-  STOCK (mma)            174.8 ms/step  183.1 t/s
-  PAGED-VEC  (inc-1)     186.3 ms/step  171.8 t/s   (+6.6% vs stock)
-  PAGED-TILE (inc-3)     177.9 ms/step  179.8 t/s   (+1.8% vs stock)
-GQA grouping recovers 8.4 ms/step (-4.5%) over the inc-1 vec default and brings
-paged decode to within 1.8% of stock. The win grows with context (npl=8, tile vs
-vec decode step): 1024 -2.3%, 4096 -3.3%, 8192 and 16384 wider, as attention
-takes a larger share of the step.
-
-Why not the split-K tune: the vec decode grid is already block-saturated
-(1 x parallel_blocks 3 x 2048 = 6144 blocks ~ 43 waves over 144 resident on 48
-SM), so raising parallel_blocks / KV_max adds no SM fill. The under-saturation is
-intra-SM (occupancy + the 8x KV re-streaming), which GQA grouping attacks
-directly; more split-K does not.
-
-Correctness (greedy, GGML_CUDA_DISABLE_GRAPHS=1):
-  - CPU plumbing gate (Qwen3-0.6B, build-cpu, paged-on vs off): BYTE-IDENTICAL.
-  - GPU 0.6B gqa=2, 8 seq x 48 tok: tile is token-identical to the inc-1 vec path
-    in 7/8 sequences; the 8th diverges at token 5, within the same kernel-noise
-    band where vec also drifts from stock. Stock uses the mma kernel for this
-    multi-stream GQA shape, so a different kernel = different rounding =
-    autoregressive token drift; vec and tile agree with each other while both
-    differ from stock (both pick 15678 where stock picks 38835), confirming the
-    drift is kernel choice, not a paging error.
-  - GPU 32B gqa=8, 4 seq x 40 tok: tile tracks stock at least as well as vec
-    (seq3: tile == stock == 624 at the token where vec picked 13).
-
-Stock is byte-identical: the dispatch guard only diverts when the block table
-(src[5]) is set; the non-paged best-kernel switch is untouched. The ncols2>1 tile
-path reads the last nbatch_fa tile with oob_check=false and relies on the mask
-inf padding - the same pattern stock uses for ncols2>1 - and the compacted paged
-mask is gathered to the n_view (GGML_PAD 256) width so it carries that padding.
-
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
-Assisted-by: Claude:opus-4.8 [Claude Code]
---
- ggml/src/ggml-cuda/fattn.cu | 51 ++++++++++++++++++++++++++-----------
- 1 file changed, 36 insertions(+), 15 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
-index afcafa2..6b15810 100644
---- a/ggml/src/ggml-cuda/fattn.cu
-+++ b/ggml/src/ggml-cuda/fattn.cu
-@@ -580,32 +580,53 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
-     // silently read the wrong (contiguous physical) cells. So when a block table
-     // is present we route here and NEVER fall through to the best-kernel switch
-     // below - no decode shape can silently reach an mma/wmma misread. build_attn
--    // only sets src[5] for the 1-token-per-stream decode shape; the vec
-+    // only sets src[5] for the 1-token-per-stream decode shape; the vec/tile
-     // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
-     // and any shape that should not be paged must take the host-side gather path
-     // (LLAMA_KV_PAGED_GATHER=1) instead.
-     //
--    // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
--    // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
--    // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
--    // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
--    // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
--    // with oob_check=false while the compacted paged mask is not padded to cover
--    // it, so it diverges from stock. Not for production paged decode until
--    // increment-3 bounds that path; the default vec route is unaffected.
-+    // Default route = the GQA-grouped TILE kernel (inc-3) WHEN it is both correct
-+    // and a win, else the inc-1 vec path. Tile groups the q-heads that share one
-+    // kv-head (ncols2), loading each K/V row once for the whole group instead of
-+    // once per q-head, and runs at higher occupancy than vec (108-128 regs vs 168).
-+    // Two constraints make this conditional: (1) the tile kernel has no K/V type
-+    // template - it loads half2 - so a non-F16 cache (BF16/quantized) would be
-+    // converted by launch_fattn to a contiguous F16 copy, which breaks the
-+    // in-kernel block-table read (the table indexes the original paged layout, not
-+    // the copy); vec instead reads the original cache with in-kernel dequant, so it
-+    // is the only correct paged path for non-F16 caches. (2) the head-group reuse
-+    // only helps when gqa_ratio>=2. So route to tile only for {F16 K and V,
-+    // gqa_ratio>=2}; everything else stays on vec, matching stock (which also sends
-+    // quantized-cache decode to the vector kernel). Measured on GB10 (Qwen3-32B
-+    // nvfp4, F16 cache, gqa 8, batch 32, 1024 ctx): tile 177.9 ms/step vs vec 186.3
-+    // vs stock 174.8 - GQA grouping recovers ~4.5% over the inc-1 vec default and
-+    // brings paged decode to within ~1.8% of stock. Validated token-coherent with vec:
-+    // 0.6B 8-seq 7/8 identical (8th within the kernel-noise band where vec also
-+    // drifts from stock), 32B gqa=8 tile tracks stock at least as well as vec, CPU
-+    // plumbing gate byte-identical. The ncols2>1 tile path reads the last nbatch_fa
-+    // tile with oob_check=false relying on mask -inf padding (the SAME pattern stock
-+    // uses for ncols2>1); the compacted paged mask is gathered to the n_view
-+    // (GGML_PAD 256) width so it carries that padding. LLAMA_KV_PAGED_VEC=1 forces
-+    // the inc-1 vec path for A/B.
-     if (dst->src[5] != nullptr) {
--        static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
-+        const ggml_tensor * Qp = dst->src[0];
-+        const ggml_tensor * Kp = dst->src[1];
-+        const ggml_tensor * Vp = dst->src[2];
-+        const bool kv_f16    = Kp->type == GGML_TYPE_F16 && Vp->type == GGML_TYPE_F16;
-+        const int64_t gqa_ratio = Kp->ne[2] > 0 ? Qp->ne[2] / Kp->ne[2] : 1;
-+        const bool force_vec = getenv("LLAMA_KV_PAGED_VEC") != nullptr;
-+        const bool use_tile  = !force_vec && kv_f16 && gqa_ratio >= 2;
-         if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
-             static bool logged = false;
-             if (!logged) {
-                 logged = true;
--                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
--                    paged_tile ? "TILE(experimental)" : "VEC",
--                    (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
--                    (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
-+                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld] gqa=%ld kv_f16=%d)\n",
-+                    use_tile ? "TILE(gqa)" : "VEC",
-+                    (long) Qp->ne[0], (long) Qp->ne[1], (long) Qp->ne[2], (long) Qp->ne[3],
-+                    (long) gqa_ratio, (int) kv_f16);
-             }
-         }
--        if (paged_tile) {
-+        if (use_tile) {
-             ggml_cuda_flash_attn_ext_tile(ctx, dst);
-         } else {
-             ggml_cuda_flash_attn_ext_vec(ctx, dst);
-- 
-2.43.0
-
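The routing rule that patch 0011 describes can be restated as a small host-side sketch. It is not the vendored code: `CacheType`, `PagedKernel`, `PagedDecodeShape` and `choose_paged_kernel` are hypothetical names standing in for the ggml tensor fields and the two dispatchers, and `force_vec` stands in for the LLAMA_KV_PAGED_VEC check.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the ggml types the dispatch guard inspects.
enum class CacheType { F16, BF16, Q8_0 };
enum class PagedKernel { Vec, Tile };

struct PagedDecodeShape {
    CacheType k_type;     // type of the K cache (src[1] in the patch)
    CacheType v_type;     // type of the V cache (src[2] in the patch)
    int64_t   n_head;     // number of query heads   (Q->ne[2])
    int64_t   n_head_kv;  // number of K/V heads     (K->ne[2])
};

// Tile is chosen only when it is both correct (F16 K and V, so no contiguous
// F16 copy invalidates the block table) and useful (at least two q-heads
// share a kv-head). Everything else stays on the vec kernel.
inline PagedKernel choose_paged_kernel(const PagedDecodeShape & s, bool force_vec) {
    const bool    kv_f16    = s.k_type == CacheType::F16 && s.v_type == CacheType::F16;
    const int64_t gqa_ratio = s.n_head_kv > 0 ? s.n_head / s.n_head_kv : 1;
    return (!force_vec && kv_f16 && gqa_ratio >= 2) ? PagedKernel::Tile
                                                    : PagedKernel::Vec;
}
```

With the Qwen3-32B shape quoted in the patch (64 query heads, 8 kv heads, F16 cache) the sketch picks the tile kernel; a quantized cache, a gqa ratio of 1, or the forced-vec override all fall back to vec.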
--- a/backend/cpp/llama-cpp/patches/paged/0012-paged-mask-pad-invariant-assert.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0012-paged-mask-pad-invariant-assert.patch
@@ -1,50 +0,0 @@
-From 6e3e976e2b11adb05519f31dd5aad0c204678f5c Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Tue, 23 Jun 2026 11:12:05 +0200
-Subject: [PATCH] feat(paged): assert mask-pad invariant for the paged tile
- route (patch 0012)
-
-The now-default paged decode route (GQA-grouped fattn-tile kernel) does not
-leak past-end KV rows only because the compacted mask/block-table length is
-padded to a whole number of flash-attn KV tiles: n_view = GGML_PAD(n_gather,
-256), and the tile (nbatch_fa = 64 for head_dim 128) divides 256, so the last
-tile sits entirely inside the -inf pad window. That invariant was implicit.
-
-Add a defensive GGML_ASSERT(n_view % 64 == 0) right after the pad/clamp so a
-future change to the pad (e.g. < 256) or the tile (> 256) that broke the
-whole-tile property cannot silently reintroduce the leak. Additive only, no
-behaviour change.
-
-Verified: build-cpu compiles, and the paged CPU byte gate (LLAMA_KV_PAGED off
-vs on, Qwen3-0.6B-Q8_0, greedy, -ngl 0) stays byte-identical while the assert
-stays silent (n_view remains a whole number of tiles across all decode steps).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- src/paged-attn.cpp | 9 +++++++++
- 1 file changed, 9 insertions(+)
-
-diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
-index 8eebeaa..fed8ca9 100644
---- a/src/paged-attn.cpp
-+++ b/src/paged-attn.cpp
-@@ -201,6 +201,15 @@ bool in_kernel_decode(ggml_context * ctx0,
-         n_view = K->ne[2];
-     }
- 
-+    // The flash-attn KV tile is 64 rows wide (nbatch_fa for head_dim 128). n_view must be
-+    // a whole number of such tiles so the in-kernel decode never reads past the gathered
-+    // rows: the trailing pad cells [n_gather, n_view) are all -inf, so any tile straddling
-+    // the boundary still contributes zero. This holds today only because the pad (256) is a
-+    // multiple of the tile; a future pad < 256 (or nbatch_fa > 256) that broke it would
-+    // silently reintroduce a past-end KV leak, so assert it rather than trust it.
-+    // pad must be a multiple of the flash-attn KV tile so the last tile is fully inside the -inf pad
-+    GGML_ASSERT(n_view % 64 == 0);
-+
-     ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
-     ggml_set_input(idx);
-     res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
-- 
-2.43.0
-
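The invariant that patch 0012 asserts is plain arithmetic: a length rounded up to a multiple of 256 is also a multiple of 64, because 64 divides 256. A minimal sketch follows, assuming the pad rounds up to the next multiple as the patch description states. `pad_to`, `padded_view_len`, `kPad` and `kTile` are illustrative names, and the clamp to `K->ne[2]` that the patch applies before the assert is left out.

```cpp
#include <cassert>
#include <cstdint>

constexpr int64_t kPad  = 256; // mask / block-table pad quoted in the patch
constexpr int64_t kTile = 64;  // nbatch_fa for head_dim 128, per the patch

// Round n up to the next multiple of pad.
constexpr int64_t pad_to(int64_t n, int64_t pad) {
    return (n + pad - 1) / pad * pad;
}

// The whole-tile property holds only while the tile divides the pad.
static_assert(kPad % kTile == 0, "pad must be a multiple of the KV tile");

// Padded view length for n_gather gathered rows, with the same runtime check
// the patch adds so a future pad/tile change cannot silently break it.
inline int64_t padded_view_len(int64_t n_gather) {
    const int64_t n_view = pad_to(n_gather, kPad);
    assert(n_view % kTile == 0);
    return n_view;
}
```

For example, 257 gathered rows pad to 512, which is exactly eight 64-row tiles.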
--- a/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
@@ -1,136 +0,0 @@
-From 6d3743105c1bbfbf9cd16c0c0ba39bfaac74216e Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Tue, 23 Jun 2026 11:52:45 +0200
-Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
- 0013)
-
-llama-server already co-batches decode with chunked prefill: update_slots()
-appends every generating slot's sampled token first, then fills the rest of the
-n_batch budget with prompt tokens, deferring the overflow to the next step. But
-the prefill chunk size is hard-wired to n_batch (default 2048): one slot's
-~2048-token prefill chunk lands in a single compute-heavy step, and every decode
-co-batched into that step sees a multi-second inter-token-latency (ITL) spike.
-Lowering n_batch shrinks the chunk but also caps decode-concurrency width and
-prefill throughput, because they are coupled.
-
-Add LLAMA_PREFILL_BUDGET: a per-step prefill-token budget decoupled from n_batch
-(the analogue of vLLM's --max-num-batched-tokens / long_prefill_token_threshold).
-The prompt-fill loop and the outer slot loop now also stop once this many prompt
-tokens have been added in the current update_slots() step, so a long prefill is
-split across more steps that each still advance in-flight decode. Default (env
-unset or <= 0) = disabled, so stock behaviour is byte-identical. Orthogonal to
-LLAMA_KV_PAGED: this is a pure scheduler knob and works with paged off.
-
-Measured on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8 steady decode
-streams with one 6000-token prefill injected mid-stream; same binary, only
-LLAMA_PREFILL_BUDGET differs:
-
-  metric                        stock(off)  budget=256   budget=512
-  worst decode freeze (ms)         3380      482 (7.0x)   778 (4.3x)
-  median decode ITL in window      2264      411 (5.5x)   689
-  decode_stall (ms)                3285      387 (8.5x)   684 (4.8x)
-  decode steps during prefill        38      201 (5.3x)   108
-  injected-req TTFT (ms)           8493     10172 (+20%)  8432 (~0%)
-  steady-state baseline ITL          94        95          94
-
-This is a LATENCY/fairness lever, not an aggregate-throughput lever: it flattens
-the decode ITL spike a long prefill inflicts on co-batched decoders (8.5x smaller
-decode stall and 5.3x more decode progress during the prefill at budget=256), in
-exchange for a modest TTFT rise on the long request (the classic chunked-prefill
-trade-off; budget=512 cuts the stall 4.8x at ~no TTFT cost). Steady aggregate decode is
-unchanged: it is bandwidth/weight-capped on GB10 (the NVFP4 weight-read floor),
-which the scheduler cannot lift.
-
-Correctness (same model, greedy temp 0, fa on):
-- budget unset or >= n_batch: byte-identical to stock (the added break never
-  fires before the existing n_batch break; the off-path is a no-op by
-  construction).
-- short prompt (<= budget): byte-identical to stock.
-- the knob is exactly equivalent to stock's native -b chunking: budget=512 ==
-  stock -b512 and budget=256 == stock -b256, both BYTE-IDENTICAL, while keeping
-  n_batch=2048 for decode width.
-- on a prompt larger than the budget the chunked greedy output diverges from the
-  single n_batch chunk only by intrinsic flash-attn chunk-size FP grouping: PURE
-  stock -b256 diverges from stock -b2048 the same way with the patch inactive,
-  and the output stays coherent and answers correctly.
-
-Productisation (LocalAI): surface as a model options knob (max_prefill_tokens /
-mpt) parsed in grpc-server.cpp, default 0 = disabled, per CHUNKED_PREFILL_PLAN
-Phase B; the vendored update_slots() hunk here is that plan's scheduler patch and
-stays disjoint from the paged allocation hunks.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- tools/server/server-context.cpp | 34 ++++++++++++++++++++++++++++++++-
- 1 file changed, 33 insertions(+), 1 deletion(-)
-
-diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index b5f9d37..afcdebe 100644
---- a/tools/server/server-context.cpp
-+++ b/tools/server/server-context.cpp
-@@ -3043,6 +3043,29 @@ private:
-         int32_t n_batch  = llama_n_batch(ctx_tgt);
-         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
- 
-+        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
-+        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
-+        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
-+        // sampled decode tokens of every generating slot are appended FIRST, then prompt
-+        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
-+        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
-+        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
-+        // tokens added per step independently of n_batch, splitting a long prefill across
-+        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
-+        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
-+        // (this is a pure scheduler knob; works with paged off).
-+        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
-+        {
-+            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
-+            if (env_pb) {
-+                const int v = atoi(env_pb);
-+                if (v > 0) {
-+                    n_prefill_budget = std::min(n_batch, std::max(1, v));
-+                }
-+            }
-+        }
-+        int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
-+
-         auto & alora_scale       = batch.alora_scale;
-         auto & alora_disabled_id = batch.alora_disabled_id;
- 
-@@ -3487,7 +3510,10 @@ private:
-                     const auto last_user_pos = spans.last_user_message_pos();
- 
-                     // add prompt tokens for processing in the current batch
--                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch) {
-+                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
-+                    // prompt is split across more steps and leaves batch room for co-batched decode
-+                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
-+                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
-                         // get next token to process
-                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
-                         if (cur_tok == LLAMA_TOKEN_NULL) {
-@@ -3512,6 +3538,7 @@ private:
-                         slot.prompt.tokens.push_back(cur_tok);
- 
-                         slot.n_prompt_tokens_processed++;
-+                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
- 
-                         // stop the prompt batch exactly before a user message
-                         if (spans.is_user_start(slot.prompt.n_tokens())) {
-@@ -3597,6 +3624,11 @@ private:
-                 if (!slot_batched) {
-                     slot_batched = &slot;
-                 }
-+                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
-+                // leaving the remaining batch capacity for co-batched decode of other slots
-+                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
-+                    add_ok = false;
-+                }
-             });
-         }
-     }
-- 
-2.43.0
-
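How the budget in patch 0013 reshapes a long prefill can be seen in a toy model of the prompt-fill loop. This is a sketch under simplifying assumptions, not the server code: there is a single prefilling slot, a fixed number of decode tokens is appended first in every step, and ubatch splitting and the user-message stop are ignored. `chunk_prefill` and its parameters are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns how many prompt tokens each step ingests for one prompt of n_prompt
// tokens. budget_env <= 0 means "disabled", i.e. only n_batch limits a step.
inline std::vector<int> chunk_prefill(int n_prompt, int n_batch, int n_decode, int budget_env) {
    std::vector<int> steps;
    if (n_decode >= n_batch) {
        return steps; // toy-model guard: no batch room is left for prompt tokens
    }
    // Same clamp as the patch: a positive budget never exceeds n_batch.
    const int budget = budget_env > 0 ? std::min(n_batch, budget_env) : 0;
    int done = 0;
    while (done < n_prompt) {
        int batch_size = n_decode; // decode tokens are appended first
        int added      = 0;        // prompt tokens added in this step
        while (done < n_prompt && batch_size < n_batch &&
               (budget == 0 || added < budget)) {
            ++done;
            ++batch_size;
            ++added;
        }
        steps.push_back(added);
    }
    return steps;
}
```

In this model a 6000-token prompt with n_batch=2048 and 8 decode streams takes 3 steps with the budget off and 24 steps at budget=256, so each step carries far less prefill work alongside the decode tokens. The measured step counts in the patch differ because the real scheduler has more moving parts than this model.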
--- a/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
@@ -1,140 +0,0 @@
-From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Tue, 23 Jun 2026 15:47:06 +0200
-Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)
-
-On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
-sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
-mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
-originally reported npl128 throughput cliff does NOT reproduce on this build.
-llama-batched-bench decode (S_TG t/s) is monotonic across batch:
-
-  npl        1     8    32    64   128   256
-  S_TG     85   282   629   935  1295  1779   (stock, mxfp4 MoE, -fa on)
-
-There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
-at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.
-
-What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
-token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
-column upper bound = token count, up to 128) in one column-tile. At MoE decode
-the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
-ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
-col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
-time and burns throughput on the padding columns while the larger y-tile lowers
-occupancy. Stock picks the LARGEST tile (128) where the SMALLEST tile that still
-covers the density would raise fill + occupancy at no extra weight read (at
-tokens/expert <= mmq_x there is exactly one non-empty col-tile per expert; the
-emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k
-kernel) - the inverse of vLLM's small per-expert BLOCK_SIZE_M.
-
-Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
-(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
-selection, and therefore every kernel launched, is byte-identical to stock. The
-cap only ever lowers the loop's upper bound and still selects from the same
-granularity- and shared-memory-validated mmq_x set stock already uses for
-smaller batches, so no new kernel configuration is exercised.
-
-Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
-only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):
-
-  npl     stock S_TG   cap64 S_TG    d%     stock S_PP   cap64 S_PP
-   64        936          938      +0.1       2924         2883
-  128       1295         1357      +4.8       3075         3038
-  256       1784         1825      +2.3       3085         3046
-
-  (reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)
-
-cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
-npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
-cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
-tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
-re-reads), so 64 is the recommended value and the only one that helps net.
-
-Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
-throughput unlock (llama-server continuous batching already scales). It is a
-modest high-effective-batch DECODE micro-optimization that matches vLLM's
-smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
-durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
-ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
-patches/paged/MOE_GROUPED_GEMM_SCOPE.md.
-
-Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
-stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
-prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
-npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
- 1 file changed, 36 insertions(+), 1 deletion(-)
-
-diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
-index edf546d..cff608e 100644
---- a/ggml/src/ggml-cuda/mmq.cuh
-+++ b/ggml/src/ggml-cuda/mmq.cuh
-@@ -6,6 +6,7 @@
- 
- #include <climits>
- #include <cstdint>
-+#include <cstdlib>
- 
- using namespace ggml_cuda_mma;
- 
-@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
-     }
- }
- 
-+// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
-+// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
-+// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
-+// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
-+static inline int ggml_cuda_moe_mmq_x_cap() {
-+    static const int cap = []() -> int {
-+        const char * s = getenv("LLAMA_MOE_MMQ_X");
-+        return s ? atoi(s) : 0;
-+    }();
-+    return cap;
-+}
-+
- template <ggml_type type>
- void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
-     const int    id     = ggml_cuda_get_device();
-@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
-     const int mmq_x_max = get_mmq_x_max_host(cc);
-     const int mmq_y = get_mmq_y_host(cc);
- 
-+    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
-+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
-+    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
-+    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
-+    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
-+    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
-+    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
-+    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
-+    // per-expert density raises tile fill + occupancy with no extra weight reads (at
-+    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
-+    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
-+    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
-+    // off the ids path the cap never applies.
-+    int mmq_x_lim = mmq_x_max;
-+    if (args.expert_bounds != nullptr) {
-+        const int moe_cap = ggml_cuda_moe_mmq_x_cap();
-+        if (moe_cap > 0) {
-+            const int cap = moe_cap < 8 ? 8 : moe_cap;
-+            mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
-+        }
-+    }
-+
-     int mmq_x_best  = 0;
-     int ntiles_x_best = INT_MAX;
- 
--    for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
-+    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
-         const int granularity = mmq_get_granularity_host(mmq_x, cc);
- 
-         if (mmq_x % granularity != 0 || mmq_get_nbytes_shared<type>(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
-- 
-2.43.0
-
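The cap logic of patch 0014 and the tile-fill estimate behind it can be sketched as two small functions. Both names are hypothetical. `moe_mmq_x_limit` mirrors the upper bound the patch computes for the mmq_x search loop, with `env_cap` standing in for the parsed LLAMA_MOE_MMQ_X value. `tile_fill` is only the back-of-envelope density estimate from the commit message, assuming tokens spread evenly over experts.

```cpp
#include <cassert>

// Upper bound for the mmq_x search loop. The cap applies only on the
// MUL_MAT_ID (expert ids) path and only when env_cap > 0; it is floored at 8
// (the loop's smallest tile) and never exceeds the device maximum.
inline int moe_mmq_x_limit(int mmq_x_max, bool is_moe_ids_path, int env_cap) {
    if (!is_moe_ids_path || env_cap <= 0) {
        return mmq_x_max; // disabled: stock selection
    }
    const int cap = env_cap < 8 ? 8 : env_cap;
    return cap < mmq_x_max ? cap : mmq_x_max;
}

// Fraction of one mmq_x-wide column tile that holds real tokens, assuming
// n_tokens * top_k assignments are spread evenly over n_experts.
inline double tile_fill(int n_tokens, int top_k, int n_experts, int mmq_x) {
    const double tokens_per_expert = double(n_tokens) * top_k / n_experts;
    return tokens_per_expert / mmq_x;
}
```

For the top-8 of 128 case at 128 tokens, the estimate is 8 tokens per expert: a 128-wide tile is 6.25% full, and capping to 64 doubles that to 12.5%.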
--- a/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
@@ -1,238 +0,0 @@
-From 5349f8231b1e11214f5e8a668129397fb6e2f9ac Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Tue, 23 Jun 2026 21:03:00 +0200
-Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
- (patch 0015)
-
-The durable follow-up to patch 0014's blunt LLAMA_MOE_MMQ_X global cap (which the
-0014 doc itself scoped): replace the manual env cap with a host-side, default-on
-auto-select inside mul_mat_q_case that picks a small token-tile (mmq_x) for the
-MUL_MAT_ID grouped FP4-MMA GEMM only when the per-expert token density is low
-(decode), and keeps the large 128-wide tile when density is high (prefill). No new
-kernel: the selection only lowers the loop's upper bound to an already-compiled,
-granularity- and shared-memory-validated mmq_x.
-
-Density is estimated host-side from the args the ids path already passes:
-  ne_get_rows = ncols_dst   = ne12 * n_expert_used   (token-expert assignments)
-  n_experts   = nchannels_x = ne02
-  density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
-Cap to the small tile (default 64) only when density <= density_max. Unlike 0014's
-global cap, the high-density prefill ubatch stays on the big tile, so S_PP does not
-regress by construction.
-
-density_max default = 8 (not tile/4 = 16). The cap must fire for decode but not for
-a prefill ubatch, and each has per-expert density n_tokens*n_used/n_experts. At the
-standard n_ubatch=512, n_used=8: prefill density = 4096/n_experts (32 at 128 experts,
-16 at 256), decode at npl<=128 is <= 1024/n_experts (8 at 128, 4 at 256). Default 8
-sits strictly between for every n_experts in [128,511], so it caps decode and leaves
-prefill on the big tile. tile/4 (=16) equalled the 256-expert prefill density and
-regressed its S_PP by ~2%, the regression this threshold exists to avoid.
-
-Measured on GB10 (sm_121), Qwen3.6-35B-A3B NVFP4 (256 experts, top-8, GDN linear
-attention), llama-batched-bench -fa on -npp 128 -ntg 128, default-on vs stock
-(LLAMA_MOE_AUTO_TILE=0), median of 5 reps:
-
-  npl   S_TG stock  S_TG 0015   dTG%    S_PP stock  S_PP 0015   dPP%
-    8      183.59     183.18  -0.22%       1489.2     1500.1  +0.73%
-   32      264.02     263.44  -0.22%       2034.5     2033.5  -0.05%
-   64      311.76     310.41  -0.43%       2028.3     2027.6  -0.03%
-  128      336.10     337.32  +0.36%       2025.0     2027.7  +0.13%
-
-Honest read: on THIS model the decode effect is within run-to-run noise (neutral)
-and prefill is neutral. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
-256 tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
-lever (worth +4.8% @npl128 on Qwen3-Coder-30B, 128 larger experts, patch 0014
-cap64) does not move it. A npl128 tile sweep on this model confirms 64 is the only
-useful width (TILE8 -6.3%, TILE16 -3.2%, TILE32 -0.2%, TILE64 +0.7%, TILE96 -0.8%):
-smaller tiles lose to grid/scheduling overhead and the FP4-MMA minimum width.
-
-Value banked default-on: (1) removes 0014's ~1.3% prefill cost by construction
-(density-gated, not global); (2) auto-selects the small tile for col-tile-bound MoE
-decode, reproducing 0014 cap64's tile=64 at npl128 by construction, so it preserves
-the +4.8% on Qwen3-Coder-30B without the prefill cost; (3) prefill-safe and decode-
-neutral on the SSM model, harmless where it does not help. Conservative by design:
-at npl256 the qwen3coder decode density (16) equals the 256-expert prefill density
-(16), indistinguishable to a pure-density gate, so density_max=8 forgoes 0014's
-+2.3% @npl256 to keep 256-expert prefill safe; an ne12-aware refinement is future
-work.
-
-LLAMA_MOE_MMQ_X (patch 0014) is KEPT as a manual override that, when > 0, forces the
-old blunt global cap and bypasses the auto-select (explicit A/B knob). The auto-
-select is the default; LLAMA_MOE_AUTO_TILE=0 restores exact stock mmq_x selection.
-LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX tune the small tile / threshold.
-
-Correctness: extends tests/test-backend-ops test_mul_mat_id with a ragged small-M
-NVFP4/MXFP4 MoE decode-density gate (128 experts, top-8, m=768, k=2048, n in
-{16,33,64,128,130,200,256,512} spanning the cap boundary and ragged token counts).
-All 16 shapes pass CUDA-vs-CPU oracle on GB10 both default-on and with
-LLAMA_MOE_AUTO_TILE=0; full MUL_MAT_ID suite 2/2 backends OK. Off the ids path
-nothing changes (non-MoE mul_mat byte-identical to stock).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/src/ggml-cuda/mmq.cuh | 100 ++++++++++++++++++++++++++++++-------
- tests/test-backend-ops.cpp |  16 ++++++
- 2 files changed, 99 insertions(+), 17 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
-index cff608e..9718b12 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
-+++ b/ggml/src/ggml-cuda/mmq.cuh
-@@ -4053,10 +4053,11 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
-     }
- }
- 
-// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
-// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
-// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
-// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
-+// [paged patch 0014] MoE token-tile (mmq_x) MANUAL cap, read once from env LLAMA_MOE_MMQ_X.
-+// Returns 0 when unset / non-positive => disabled (fall through to the patch-0015 auto-select).
-+// When > 0 it forces a blunt GLOBAL cap on the per-expert column-tile width for the MUL_MAT_ID
-+// grouped-GEMM path (decode AND prefill), overriding the density-aware auto-select below. Kept
-+// as an explicit override / A-B knob; the default path is now the auto-select.
- static inline int ggml_cuda_moe_mmq_x_cap() {
-     static const int cap = []() -> int {
-         const char * s = getenv("LLAMA_MOE_MMQ_X");
-@@ -4065,6 +4066,43 @@ static inline int ggml_cuda_moe_mmq_x_cap() {
-     return cap;
- }
- 
-+// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select knobs (DEFAULT-ON).
-+// LLAMA_MOE_AUTO_TILE=0 disables the auto-select => exact stock mmq_x selection.
-+static inline bool ggml_cuda_moe_auto_tile_enabled() {
-+    static const bool en = []() -> bool {
-+        const char * s = getenv("LLAMA_MOE_AUTO_TILE");
-+        return !(s && atoi(s) == 0);
-+    }();
-+    return en;
-+}
-+// The small high-occupancy token-tile chosen for low-density (decode) MoE matmuls. Default 64:
-+// the measured GB10 sweet spot (full per-expert fill with >=4x routing-imbalance headroom).
-+static inline int ggml_cuda_moe_decode_tile() {
-+    static const int t = []() -> int {
-+        const char * s = getenv("LLAMA_MOE_DECODE_TILE");
-+        const int v = s ? atoi(s) : 0;
-+        return v >= 8 ? v : 64;
-+    }();
-+    return t;
-+}
-+// Per-expert token-density ceiling under which the small tile is selected. Default 8: the cap must
-+// fire for decode but NOT for a prefill ubatch, and the per-expert density of each is
-+// n_tokens*n_used/n_experts. For the standard n_ubatch=512, n_used=8 the prefill density is
-+// 4096/n_experts (= 32 at 128 experts, 16 at 256 experts); decode at npl<=128 is <=1024/n_experts
-+// (= 8 at 128 experts, 4 at 256). Default 8 sits strictly between the two for every n_experts in
-+// [128,511], so it caps decode and leaves the prefill ubatch on the big 128 tile - whereas the old
-+// tile/4 (=16) equalled the 256-expert prefill density and cratered its S_PP by ~2% (measured on
-+// Qwen3.6-35B-A3B NVFP4). 8 also keeps >=8x fill headroom at tile 64 so an imbalanced expert
-+// segment never splits into an extra col-tile.
-+static inline int ggml_cuda_moe_density_max() {
-+    static const int d = []() -> int {
-+        const char * s = getenv("LLAMA_MOE_DENSITY_MAX");
-+        const int v = s ? atoi(s) : 0;
-+        return v > 0 ? v : 8;
-+    }();
-+    return d;
-+}
-+
- template <ggml_type type>
- void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
-     const int    id     = ggml_cuda_get_device();
-@@ -4076,25 +4114,53 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
-     const int mmq_x_max = get_mmq_x_max_host(cc);
-     const int mmq_y = get_mmq_y_host(cc);
- 
-    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
-    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
-    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
-    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
-    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
-    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
-    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
-    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
-    // per-expert density raises tile fill + occupancy with no extra weight reads (at
-    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
-    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
-    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
-    // off the ids path the cap never applies.
-+    // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
-+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
-+    // sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, up to 128)
-+    // in a single column-tile, i.e. it MAXIMIZES the tile (128 on Blackwell) for the aggregate
-+    // batch. But the tile is then applied PER EXPERT, and at MoE decode the per-expert token
-+    // density is tiny (top-k of many experts), so each expert's single 128-wide col-tile is mostly
-+    // empty: the MMA accumulator tile is mmq_x-wide at compile time and burns throughput on the
-+    // padding columns while the larger y-tile lowers occupancy. vLLM's fused-MoE does the opposite
-+    // (a small per-expert BLOCK_SIZE_M). We reproduce that here, host-side only, by picking a
-+    // SMALLER mmq_x when - and only when - the per-expert density is low:
-+    //
-+    //   ne_get_rows  = args.ncols_dst    = ne12 * n_expert_used  (total token-expert assignments)
-+    //   n_experts    = args.nchannels_x  = ne02
-+    //   n_active_est = min(n_experts, ne_get_rows)               (upper bound on active experts)
-+    //   density      = ceil(ne_get_rows / n_active_est)          (avg tokens per active expert)
-+    //
-+    // Cap to the small tile (default 64) only when density <= density_max (default 8). 8 sits below
-+    // every prefill-ubatch density and above every decode density for n_experts in [128,511] at the
-+    // standard n_ubatch=512 (prefill 4096/n_experts, decode <=1024/n_experts), with >=8x fill headroom
-+    // so a capped expert segment never splits a col-tile. Decode (per-expert density 4 at 256 experts,
-+    // 8 at 128 experts @npl128) gets the fuller high-occupancy tile; the prefill ubatch (density 16 at
-+    // 256 / 32 at 128 experts) stays ABOVE the threshold and keeps the big
-+    // 128 compute tile - so unlike the blunt global cap (LLAMA_MOE_MMQ_X / patch 0014) this is
-+    // prefill-safe by construction. The selection only ever picks an already-compiled, granularity-
-+    // and shared-memory-validated mmq_x that the loop below would consider for a smaller batch; no
-+    // new kernel. Off the ids path (expert_bounds == nullptr) nothing changes => non-MoE mul_mat
-+    // and the gated f16/bf16 host-loop fallback stay byte-identical to stock.
-+    //   - LLAMA_MOE_MMQ_X=<n>   : manual blunt global cap, overrides the auto-select (patch 0014).
-+    //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
-+    //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
-     int mmq_x_lim = mmq_x_max;
-     if (args.expert_bounds != nullptr) {
-         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
-         if (moe_cap > 0) {
-             const int cap = moe_cap < 8 ? 8 : moe_cap;
-             mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
-+        } else if (ggml_cuda_moe_auto_tile_enabled()) {
-+            const int64_t ne_get_rows = args.ncols_dst;
-+            const int64_t n_experts   = args.nchannels_x;
-+            if (ne_get_rows > 0 && n_experts > 0) {
-+                const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
-+                const int64_t density  = (ne_get_rows + n_active - 1) / n_active;
-+                const int     tile     = ggml_cuda_moe_decode_tile();
-+                if (density <= (int64_t) ggml_cuda_moe_density_max() && tile < mmq_x_max) {
-+                    mmq_x_lim = tile;
-+                }
-+            }
-         }
-     }
- 
-diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index c83e91f..62a0989 100644
--- a/tests/test-backend-ops.cpp
-+++ b/tests/test-backend-ops.cpp
-@@ -8603,6 +8603,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
-     test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
-     test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
- 
-+    // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
-+    // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
-+    // m=768, k=2048) so this exercises the exact grouped FP4-MMA mmq kernel the model runs.
-+    // Per-expert token density = n*n_used/n_mats = n/16; cover the decode band (density 1/4/8/16
-+    // at n 16/64/128/256), ragged token counts (n 33/130/200: experts with 0/1/2 tokens, n not a
-+    // multiple of the tile) where the tiny-M col-tiles change geometry and any masking can leak,
-+    // and a prefill-density shape (n 512 => density 32) the auto-select must leave on the large
-+    // 128 tile. n>=128 is exactly where stock picks mmq_x=128 and the auto-select picks 64, so the
-+    // op-test (CPU oracle vs CUDA, deterministic) is the bit-exact regression gate for P1: it must
-+    // pass with the auto-select on (default) and with LLAMA_MOE_AUTO_TILE=0 (stock selection).
-+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
-+        for (int n : {16, 33, 64, 128, 130, 200, 256, 512}) {
-+            test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 128, 8, false, 768, n, 2048));
-+        }
-+    }
-+
-     for (ggml_type type_a : all_types) {
-         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
-     }
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
@@ -1,191 +0,0 @@
-From 02fa0473a9324b7e12f9b203d221cc4ac80cfd33 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Wed, 24 Jun 2026 10:11:48 +0200
-Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
- 0016, continuous-batch P1)
-
-Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
-decode-first token budget: the P1 of the token-granular continuous-batch
-scheduler. POLICY change only inside update_slots(): no new slot states, no
-batch-formation rewrite, zero libllama changes. llama-server already emits one
-unified mixed prefill+decode batch per step (Phase 1 appends every ready decode
-token unconditionally; Phase 2 fills prefill into the same batch). 0016 only
-changes the COUNT of prefill tokens admitted per step.
-
-The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
-== D (the live decode load) is known there. Instead of 0013's constant
-LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets one
-long prompt monopolise the step), compute a dynamic budget:
-
-  T  = clamp(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_ubatch, n_batch)
-  prefill_budget_step  = max(n_ubatch, T - D)   (leftover after decode,
-       auto-shrinks as decode load rises so the step never inflates past T)
-  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch,
-       pinned to n_batch when T == n_batch (LLAMA_PREFILL_CAP overrides)
-
-Phase 2's inner prompt-fill loop and outer admission break are bounded by
-prefill_budget_step (across slots) and a new per-slot slot_prompt_added
-counter; the n_batch hard ceiling stays as the compute bound. Decode is
-structurally claimed first and never capped (Phase 1), so the decode-first
-guarantee is free.
-
-DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
-to stock. The degenerate T == n_batch case is byte-identical to stock/0013 (the
-determinism oracle). The legacy LLAMA_PREFILL_BUDGET path is preserved exactly
-(honoured only when LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly
-subsumed. Orthogonal to LLAMA_KV_PAGED: pure scheduler policy, identical
-decisions paged on or off.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- tools/server/server-context.cpp | 107 +++++++++++++++++++++++++-------
- 1 file changed, 85 insertions(+), 22 deletions(-)
-
-diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index afcdebe..b8b8f00 100644
--- a/tools/server/server-context.cpp
-+++ b/tools/server/server-context.cpp
-@@ -3043,24 +3043,78 @@ private:
-         int32_t n_batch  = llama_n_batch(ctx_tgt);
-         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
- 
-        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
-        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
-        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
-        // sampled decode tokens of every generating slot are appended FIRST, then prompt
-        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
-        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
-        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
-        // tokens added per step independently of n_batch, splitting a long prefill across
-        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
-        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
-        // (this is a pure scheduler knob; works with paged off).
-        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
-+        // PAGED serving lever (patch 0016, supersedes 0013): dynamic decode-first
-+        // per-step prefill-token budget (continuous-batch scheduler P1). llama-server
-+        // already builds ONE mixed batch per update_slots() step: Phase 1 (just above)
-+        // appended every generating slot's sampled token UNCONDITIONALLY, so at this point
-+        // batch.n_tokens == D is the live decode load; Phase 2 (below) fills the remaining
-+        // batch capacity with prompt tokens. Patch 0013 capped Phase 2 with a STATIC
-+        // constant (LLAMA_PREFILL_BUDGET) that ignores D, needs per-workload tuning, and
-+        // lets one long prompt monopolise the step.
-+        //
-+        // This computes a DYNAMIC budget instead, the vLLM v1 token-budget analogue:
-+        // a single total per-step token budget T, decode claims its D tokens first
-+        // (already in the batch), and prefill gets the leftover T - D distributed across
-+        // waiting prompts with a per-slot chunk cap. As decode load D rises the prefill
-+        // leftover auto-shrinks, so the step never inflates past T at any concurrency:
-+        // the budget self-tunes across the npl range and across dense vs MoE without a
-+        // hand-picked constant (the 161/333 tok/s GB10 decode ceiling is held tuning-free
-+        // instead of via 0013's hand-tuned 256). Decode is structurally claimed first and
-+        // never capped (Phase 1), so the decode-first guarantee is free here.
-+        //
-+        //   LLAMA_MAX_BATCH_TOKENS (T)  total per-step token budget (decode + prefill),
-+        //                               default n_batch, clamped to [n_ubatch, n_batch] so
-+        //                               the compute loop stays a single llama_decode and
-+        //                               prefill keeps an n_ubatch floor of progress.
-+        //   LLAMA_PREFILL_CAP           per-slot max prompt tokens per step (the
-+        //                               long_prefill_token_threshold analogue), default
-+        //                               min(T, ceil(0.04*n_ctx)) floored at n_ubatch, so
-+        //                               one long prompt cannot eat the whole leftover.
-+        //   LLAMA_PREFILL_BUDGET        legacy static cap (patch 0013); honoured ONLY when
-+        //                               LLAMA_MAX_BATCH_TOKENS is unset, for back-compat.
-+        //
-+        // DEFAULT-OFF BYTE-IDENTICAL: with all three knobs unset, and in the degenerate
-+        // T == n_batch case, behaviour is byte-identical to stock. At T == n_batch the
-+        // dynamic leftover max(n_ubatch, n_batch - D) and the n_batch per-slot cap both
-+        // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
-+        // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
-+        // scheduler policy, identical decisions with paged on or off.
-+        const int32_t n_decode_in_batch = batch.size();    // D: Phase 1 appended D decode tokens above
-+        int32_t prefill_budget_step  = 0; // 0 = disabled (stock n_batch-only chunking)
-+        int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
-         {
-            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
-            if (env_pb) {
-+            int32_t mbt = 0;
-+            if (const char * env_mbt = getenv("LLAMA_MAX_BATCH_TOKENS")) {
-+                mbt = atoi(env_mbt);
-+            }
-+            if (mbt > 0) {
-+                // dynamic decode-first budget (P1): T clamped to [n_ubatch, n_batch]
-+                int32_t T = std::min(n_batch, mbt);
-+                T = std::max(T, n_ubatch);
-+                // leftover after decode, floored at n_ubatch so prefill never fully starves
-+                prefill_budget_step = std::max(n_ubatch, T - n_decode_in_batch);
-+                // per-slot prompt-chunk cap (long_prefill_token_threshold analogue)
-+                int32_t cap = 0;
-+                if (const char * env_cap = getenv("LLAMA_PREFILL_CAP")) {
-+                    cap = atoi(env_cap);
-+                }
-+                if (cap <= 0) {
-+                    const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
-+                    cap = std::min(T, std::max(n_ubatch, pct4));
-+                }
-+                cap = std::min(n_batch, std::max(n_ubatch, cap));
-+                // at T == n_batch the leftover and cap both reach the n_batch ceiling
-+                // together; pin the cap to n_batch so this case stays byte-identical
-+                if (T >= n_batch) {
-+                    cap = n_batch;
-+                }
-+                prefill_cap_per_slot = cap;
-+            } else if (const char * env_pb = getenv("LLAMA_PREFILL_BUDGET")) {
-+                // legacy static budget (patch 0013), kept for back-compat when the
-+                // dynamic knob is unset: a constant per-step prefill cap, no per-slot cap
-                 const int v = atoi(env_pb);
-                 if (v > 0) {
-                    n_prefill_budget = std::min(n_batch, std::max(1, v));
-+                    prefill_budget_step = std::min(n_batch, std::max(1, v));
-                 }
-             }
-         }
-@@ -3509,11 +3563,18 @@ private:
-                     const auto & spans = slot.task->params.message_spans;
-                     const auto last_user_pos = spans.last_user_message_pos();
- 
-+                    // (patch 0016) per-slot prompt tokens added this step, for the per-slot
-+                    // chunk cap (resets each slot); n_batch stays the hard compute ceiling
-+                    int32_t slot_prompt_added = 0;
-+
-                     // add prompt tokens for processing in the current batch
-                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
-                    // prompt is split across more steps and leaves batch room for co-batched decode
-+                    // (patch 0016) also stop once (a) the dynamic per-step prefill budget
-+                    // (the T - D leftover) is spent across all slots, or (b) this slot's
-+                    // per-slot chunk cap is hit, so a long prompt is split across more steps
-+                    // and leaves batch room for co-batched decode of the other slots
-                     while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
-                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
-+                           (prefill_budget_step  == 0 || n_prompt_budgeted < prefill_budget_step) &&
-+                           (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
-                         // get next token to process
-                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
-                         if (cur_tok == LLAMA_TOKEN_NULL) {
-@@ -3538,7 +3599,8 @@ private:
-                         slot.prompt.tokens.push_back(cur_tok);
- 
-                         slot.n_prompt_tokens_processed++;
-                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
-+                        n_prompt_budgeted++;  // (patch 0016) toward the dynamic per-step prefill budget
-+                        slot_prompt_added++;  // (patch 0016) toward this slot's per-step chunk cap
- 
-                         // stop the prompt batch exactly before a user message
-                         if (spans.is_user_start(slot.prompt.n_tokens())) {
-@@ -3624,9 +3686,10 @@ private:
-                 if (!slot_batched) {
-                     slot_batched = &slot;
-                 }
-                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
-                // leaving the remaining batch capacity for co-batched decode of other slots
-                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
-+                // (patch 0016) stop admitting prompts once the dynamic per-step prefill
-+                // budget (the T - D leftover) is spent, leaving the remaining batch
-+                // capacity for co-batched decode of the other slots
-+                if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
-                     add_ok = false;
-                 }
-             });
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch
@@ -1,245 +0,0 @@
-From 089f78d2a2c04465a566d499dbe0a67c008435a8 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Wed, 24 Jun 2026 19:56:05 +0200
-Subject: [PATCH] feat(paged): FP4 decode GEMM track-B P0 gate + default-off
- occupancy instrumentation (patch 0017)
-
-Track B targets the dense NVFP4 weight GEMM (~59% of the GB10 decode step). This lands the P0
-bit-exact parity gate and the P1 occupancy levers (default-off / byte-identical) and records the
-honest P1 result: the cheap host/occupancy tuning does NOT lift decode_agg on GB10 (sm_121) - the
-kill-gate tripped - so nothing is enabled by default.
-
-P0 gate (tests/test-backend-ops.cpp): NVFP4/MXFP4 dense decode-shape MUL_MAT cases at the weight-
-row tiling boundary (m in {2048,1600,2050} = exact + ragged vs mmq_y 64/128, n in {32,128} = decode
-M, k=2048), so the bit-exact CPU-vs-CUDA oracle covers the mmq_y / min-blocks paths. Green at
-default and with every lever on: MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 fail.
-
-P1 levers (ggml/src/ggml-cuda/mmq.cuh), all default-off => default build byte-identical to stock:
-  - GGML_CUDA_FP4_MMQ_Y (default 128): type-aware get_mmq_y_host/device plumbing for an NVFP4
-    weight-row tile override. mmq_y is rigidly nwarps*tile_C::I (=8*16=128, the mmq.cuh static_
-    assert), so mmq_y<128 also needs nwarps-down (a warp-remap through the shared vec_dot/loader),
-    left as the P2 kernel change; the host/device plumbing is in place and inert.
-  - GGML_CUDA_FP4_MINBLOCKS (default 1): NVFP4-only __launch_bounds__ min-resident-CTAs lever
-    (register-cap the FP4-MMA kernel so >1 CTA co-resides) - the bounded occupancy probe.
-  - GGML_CUDA_FP4_DENSE_MMQ_X (env, default off): dense col-tile re-read occupancy diagnostic.
-
-Measured GB10 (llama-batched-bench -fa on -npp 128 -ntg 128 -npl 32,128), decode_agg (S_TG):
-  DENSE q36-27b-nvfp4 @npl128: P0 149.5 -> MINBLOCKS=2 147.9 (-1.1%) -> DENSE_MMQ_X=64 144.3
-    (-3.5%) -> =32 141.7 (-5.2%). Every occupancy probe regresses.
-  MoE q36-35b-a3b-nvfp4 @npl128: stock 336.3, MINBLOCKS=2 337.7 (+0.4%, noise), TILE16 324.0
-    (-3.7%), TILE8 316.6 (-5.9%). mmq_x-down regresses (reproduces patch 0015; GDN/BW-bound).
-
-nsys (kill-gate evidence): the decode FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s -> 3.025s
-(avg 608us -> 661us, +8.7% slower) under MINBLOCKS=2 - register-capping spills, so occupancy did
-not usefully rise. Verdict: the dense M=128 tile is already weight-read/one-read-optimal at
-mmq_x=128, NOT occupancy-starved via the cheap levers; the only untested lever is the structural
-mmq_y-down (nwarps=4 warp-remap), deferred to P2. Bit-exact gate holds throughout.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/src/ggml-cuda/mmq.cuh | 85 ++++++++++++++++++++++++++++++++++----
- tests/test-backend-ops.cpp | 16 +++++++
- 2 files changed, 92 insertions(+), 9 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
-index 9718b12..b53e38a 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
-+++ b/ggml/src/ggml-cuda/mmq.cuh
-@@ -140,7 +140,24 @@ static constexpr __device__ int get_mmq_x_max_device() {
- #endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
- }
- 
-static int get_mmq_y_host(const int cc) {
-+// [paged patch 0017 / track B] Dense NVFP4 decode mmq_y (weight-row tile) override.
-+// mmq_y tiles the N (weight-row) dimension of the FP4-MMA weight GEMM. Lowering it raises the
-+// number of resident CTAs (smaller per-CTA shared footprint + smaller per-thread accumulator) to
-+// hide LPDDR5x weight-load latency at the M=128 decode tile, WITHOUT re-reading weights: every
-+// weight row lives in exactly one row-tile, so total weight traffic is unchanged (bandwidth-
-+// neutral) - the dense-decode occupancy lever from FP4_GEMM_SCOPE_B.md s3/s4.1. mmq_y is a PURE
-+// N-row tiling knob: the per-output reduction over K is identical for any mmq_y, so the result
-+// stays BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4 decode shapes). Default 128 == exact
-+// stock behaviour (a default build is byte-identical to stock); build -DGGML_CUDA_FP4_MMQ_Y=64
-+// (or 96) to enable the tune. Applies ONLY to NVFP4 on Blackwell; every other type/arch untouched.
-+#ifndef GGML_CUDA_FP4_MMQ_Y
-+#define GGML_CUDA_FP4_MMQ_Y 128
-+#endif
-+
-+static int get_mmq_y_host(const int cc, const ggml_type type = GGML_TYPE_COUNT) {
-+    if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)) {
-+        return GGML_CUDA_FP4_MMQ_Y;
-+    }
-     return GGML_CUDA_CC_IS_AMD(cc) ? (GGML_CUDA_CC_IS_RDNA1(cc) ? 64 : 128) :
-         ((GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ? 128 : 64);
- }
-@@ -154,7 +171,13 @@ if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) {
-     return MMQ_ITER_K;
- }
- 
-+template <ggml_type type = GGML_TYPE_COUNT>
- static constexpr __device__ int get_mmq_y_device() {
-+#if defined(BLACKWELL_MMA_AVAILABLE)
-+    if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MMQ_Y != 128) {
-+        return GGML_CUDA_FP4_MMQ_Y;
-+    }
-+#endif // defined(BLACKWELL_MMA_AVAILABLE)
- #if defined(GGML_USE_HIP)
- #if defined(RDNA1)
-     return 64;
-@@ -170,6 +193,28 @@ static constexpr __device__ int get_mmq_y_device() {
- #endif // defined(GGML_USE_HIP)
- }
- 
-+// [paged patch 0017 / track B] Dense NVFP4 decode occupancy lever: min resident CTAs per SM.
-+// The FP4-MMA mul_mat_q is REGISTER-bound to 1 CTA/SM (__launch_bounds__(256,1) => ~255 regs/thread
-+// => one resident block, the under-occupancy that strands the kernel at ~3% of FP4 peak at M=128).
-+// Raising the __launch_bounds__ min-blocks operand register-caps the compiler so N CTAs co-reside,
-+// hiding LPDDR5x weight-load latency by CTA-parallelism (the scope s4.1 occupancy goal) WITHOUT a
-+// structural mmq_y/nwarps change and WITHOUT extra weight reads (each weight tile still read once).
-+// Register allocation cannot change results => BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4).
-+// Default 1 == exact stock behaviour (byte-identical); build -DGGML_CUDA_FP4_MINBLOCKS=2 to enable.
-+// Applies ONLY to NVFP4 on Blackwell; every other type/arch keeps the stock min-blocks.
-+#ifndef GGML_CUDA_FP4_MINBLOCKS
-+#define GGML_CUDA_FP4_MINBLOCKS 1
-+#endif
-+template <ggml_type type = GGML_TYPE_COUNT>
-+static constexpr __device__ int mmq_get_min_blocks_device(const int stock) {
-+#if defined(BLACKWELL_MMA_AVAILABLE)
-+    if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MINBLOCKS != 1) {
-+        return GGML_CUDA_FP4_MINBLOCKS;
-+    }
-+#endif // defined(BLACKWELL_MMA_AVAILABLE)
-+    return stock;
-+}
-+
- // Decouple shared memory tile sizes from WARP_SIZE to allow for different warp sizes.
- // The K dimension of the tiles has either,
- // 1*MMQ_TILE_NE_K==32 (always for TILE_Y_K) or 2*MMQ_TILE_NE_K==64 (typically for TILE_X_K),
-@@ -3454,7 +3499,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
-     constexpr int              warp_size  = ggml_cuda_get_physical_warp_size();
-     constexpr int              nwarps     = mmq_get_nwarps_device();
-     constexpr int              qk         = ggml_cuda_type_traits<type>::qk;
--    constexpr int              mmq_y      = get_mmq_y_device();
-+    constexpr int              mmq_y      = get_mmq_y_device<type>();
-     constexpr load_tiles_mmq_t load_tiles = mmq_type_traits<mmq_x, mmq_y, need_check, type>::load_tiles;
- 
-     extern __shared__ int data_mul_mat_q[];
-@@ -3531,13 +3576,13 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
- template <ggml_type type, int mmq_x, bool need_check>
- #if defined(GGML_USE_HIP)
- #if defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
--    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
-+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
- #endif // defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
- #else
- #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
--    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 1)
-+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(1))
- #else
--    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
-+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
- #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
- #endif // defined(GGML_USE_HIP)
- static __global__ void mul_mat_q(
-@@ -3558,7 +3603,7 @@ static __global__ void mul_mat_q(
-     constexpr int warp_size = ggml_cuda_get_physical_warp_size();
- 
-     constexpr int qk    = ggml_cuda_type_traits<type>::qk;
--    constexpr int mmq_y = get_mmq_y_device();
-+    constexpr int mmq_y = get_mmq_y_device<type>();
- 
-     const uint32_t nty = (nrows_x + mmq_y - 1) / mmq_y; // Number of tiles y
- 
-@@ -3790,7 +3835,7 @@ static __global__ void mul_mat_q_stream_k_fixup(
-         float * __restrict__ tmp_last_tile, const uint3 blocks_per_ne00, const int nrows_x, const int ncols_dst,
-         const int stride_col_dst, const uint3 nchannels_y, const int stride_channel_dst, const uint3 nsamples_y,
-         const int stride_sample_dst, const uint3 ntx) {
--    constexpr int mmq_y           = get_mmq_y_device();
-+    constexpr int mmq_y           = get_mmq_y_device<type>();
-     constexpr int qk              = ggml_cuda_type_traits<type>::qk;
-     constexpr int ITER_K          = get_iter_k(type);
-     constexpr int blocks_per_iter = ITER_K / qk;
-@@ -3947,7 +3992,7 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
-     const int nsm = ggml_cuda_info().devices[id].nsm;
-     const int warp_size = ggml_cuda_info().devices[id].warp_size;
-     const int nwarps = mmq_get_nwarps_host(cc, warp_size);
--    const int mmq_y = get_mmq_y_host(cc);
-+    const int mmq_y = get_mmq_y_host(cc, type);
- 
-     const dim3 block_dims(warp_size, nwarps, 1);
- 
-@@ -4103,6 +4148,21 @@ static inline int ggml_cuda_moe_density_max() {
-     return d;
- }
- 
-+// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off).
-+// GGML_CUDA_FP4_DENSE_MMQ_X=<n> caps the dense (non-MoE) NVFP4 col-tile to <n>, splitting the M=128
-+// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost
-+// in the BW-bound regime) but multiplies resident CTAs. This is the scope s4.1 A/B probe: if
-+// decode_agg RISES with cap=64 despite the 2x weight read, occupancy is badly broken (the kernel is
-+// compute/occupancy-bound, so mmq_y-down / min-blocks has large upside); if it FALLS, the tile is
-+// already bandwidth-saturated and the occupancy ceiling is lower. Unset/<=0 => stock selection.
-+static inline int ggml_cuda_fp4_dense_mmq_x_cap() {
-+    static const int c = []() -> int {
-+        const char * s = getenv("GGML_CUDA_FP4_DENSE_MMQ_X");
-+        return s ? atoi(s) : 0;
-+    }();
-+    return c;
-+}
-+
- template <ggml_type type>
- void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
-     const int    id     = ggml_cuda_get_device();
-@@ -4112,7 +4172,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
-     const int nwarps    = mmq_get_nwarps_host(cc, warp_size);
- 
-     const int mmq_x_max = get_mmq_x_max_host(cc);
--    const int mmq_y = get_mmq_y_host(cc);
-+    const int mmq_y = get_mmq_y_host(cc, type);
- 
-     // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
-     // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
-@@ -4145,6 +4205,13 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
-     //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
-     //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
-     int mmq_x_lim = mmq_x_max;
-+    if (args.expert_bounds == nullptr && type == GGML_TYPE_NVFP4) {
-+        // dense NVFP4 decode mmq_x re-read occupancy diagnostic (see ggml_cuda_fp4_dense_mmq_x_cap).
-+        const int cap = ggml_cuda_fp4_dense_mmq_x_cap();
-+        if (cap > 0 && cap < mmq_x_max) {
-+            mmq_x_lim = cap < 8 ? 8 : cap;
-+        }
-+    }
-     if (args.expert_bounds != nullptr) {
-         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
-         if (moe_cap > 0) {
-diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index f219309..291c275 100644
---- a/tests/test-backend-ops.cpp
-+++ b/tests/test-backend-ops.cpp
-@@ -8591,6 +8591,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
-         }
-     }
- 
-+    // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate.
-+    // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the
-+    // NVFP4 decode path to raise resident-CTA occupancy. mmq_y is a pure N-row tiling knob, so a
-+    // smaller mmq_y must stay BIT-EXACT (identical per-output reduction over K) - this gate proves
-+    // it. m = weight rows (N, tiled by mmq_y): 2048 (exact at mmq_y 64 & 128), 1600 (ragged vs 128),
-+    // 2050 (ragged vs both 64 & 128 -> exercises the need_check last-row-tile at both). n = decode
-+    // token count M = 32 and 128 (the scope decode shapes, tiled by mmq_x). k = 2048 hidden. Must
-+    // pass with the default build (mmq_y=128) AND a mmq_y=64 build, CUDA-vs-CPU oracle, bit-exact.
-+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
-+        for (int64_t m : {2048, 1600, 2050}) {
-+            for (int64_t n : {32, 128}) {
-+                test_cases.emplace_back(new test_mul_mat(type_a, GGML_TYPE_F32, m, n, 2048, {1, 1}, {1, 1}));
-+            }
-+        }
-+    }
-+
-     for (ggml_type type_a : all_types) {
-         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
-     }
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch
@@ -1,349 +0,0 @@
-From 17f16e8f6d8dbc689d5151c44759792d683c957b Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 00:44:13 +0200
-Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state
- write-back (patch 0018)
-
-Decode on the Qwen3.6 hybrid-SSM models (arch qwen35, 48 gated-DeltaNet :
-16 full-attention layers) was dominated by recurrent-state plumbing, not the
-FP4 GEMM. Per SSM layer per step the fused gated_delta_net op wrote its new
-recurrent state into graph scratch, then a separate ggml_cpy persisted it into
-the recurrent-state cache. nsys attributed 18.9% of decode GPU time to that
-~225 MB/copy D2D memcpy (1584 ops, 356 GB over the A2 decompose window).
-
-This mirrors vLLM fused_recurrent_gated_delta_rule (state kept in place):
-ggml_gated_delta_net_inplace writes the final recurrent state directly into the
-active sequences' contiguous cache slot (at kv_head), removing the copy-back. The
-op output then carries only the attention scores; the SSM arithmetic is
-unchanged (bit-identical greedy output vs the copy-back baseline).
-
-- new op builder ggml_gated_delta_net_inplace (src[6] = state_dst cache view)
-- CUDA + CPU honor src[6]; final-state (K==1, keep_rs off) write redirected there
-- delta-net-base build_recurrent_attn uses it on the fused decode/prefill path,
-  dropping the ggml_cpy; rollback (n_rs_seq>0) path unchanged
-
-Measured (q36-27b-nvfp4, decode_agg S_TG, npp128 ntg128, -fa on, paged on):
-  npl 32 : 113.74 -> 136.39 t/s (+19.9 percent)
-  npl 128: 146.23 -> 180.53 t/s (+23.5 percent, = predicted copy-removal ceiling)
-MoE q36-35b-a3b-nvfp4: npl128 313.36 -> 372.62 t/s (+18.9 percent).
-nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share
-(391 @128) 37.4 -> 46.2 percent. get_rows state gather (now 18.8 percent) is the
-next lever.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- ggml/include/ggml.h                   | 14 ++++++
- ggml/src/ggml-cpu/ops.cpp             | 13 ++++-
- ggml/src/ggml-cuda/gated_delta_net.cu | 39 ++++++++++-----
- ggml/src/ggml.c                       | 68 +++++++++++++++++++++++++++
- src/models/delta-net-base.cpp         | 30 ++++++++++++
- 5 files changed, 152 insertions(+), 12 deletions(-)
-
-diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
-index 823f5a9..4e7ab32 100644
---- a/ggml/include/ggml.h
-+++ b/ggml/include/ggml.h
-@@ -2579,6 +2579,20 @@ extern "C" {
-             struct ggml_tensor  * state,
-             int64_t               K);
- 
-+    // same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
-+    // in place into state_dst (a view into the recurrent-state cache) instead of being appended to
-+    // the op output, eliminating the per-step state copy-back during decode. state_dst must be a
-+    // contiguous [S_v*S_v*H, n_seqs] view (per-seq stride == dense state size).
-+    GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace(
-+            struct ggml_context * ctx,
-+            struct ggml_tensor  * q,
-+            struct ggml_tensor  * k,
-+            struct ggml_tensor  * v,
-+            struct ggml_tensor  * g,
-+            struct ggml_tensor  * beta,
-+            struct ggml_tensor  * state,
-+            struct ggml_tensor  * state_dst);
-+
-     // custom operators
- 
-     typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
-diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
-index 63c07a2..9457add 100644
---- a/ggml/src/ggml-cpu/ops.cpp
-+++ b/ggml/src/ggml-cpu/ops.cpp
-@@ -10600,6 +10600,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
-     ggml_tensor * src_g     = dst->src[3];
-     ggml_tensor * src_beta  = dst->src[4];
-     ggml_tensor * src_state = dst->src[5];
-+    ggml_tensor * src_state_dst = dst->src[6]; // optional in-place final-state write-back target
- 
-     const int64_t S_v      = src_v->ne[0];
-     const int64_t H        = src_v->ne[1];
-@@ -10660,6 +10661,16 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
- 
-     const float scale = 1.0f / sqrtf((float) S_v);
- 
-+    // when src_state_dst is provided (in-place decode write-back) the final state is written
-+    // directly into the persistent cache view, removing the separate state copy-back node.
-+    float * inplace_state_base = nullptr;
-+    if (src_state_dst != nullptr) {
-+        GGML_ASSERT(K == 1);
-+        GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
-+        GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
-+        inplace_state_base = (float *) src_state_dst->data;
-+    }
-+
-     for (int64_t ir = ir0; ir < ir1; ++ir) {
-         const int64_t iv1 = ir % H; // head_index
-         const int64_t iv3 = ir / H; // sequence
-@@ -10674,7 +10685,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
-         // For K>1, work in scratch and copy out per-token when the slot is in range.
-         float * s_out = (K > 1)
-             ? state_work
--            : state_out_base + (iv3 * H + iv1) * S_v * S_v;
-+            : (inplace_state_base ? inplace_state_base : state_out_base) + (iv3 * H + iv1) * S_v * S_v;
- 
-         // copy input state into the working buffer and operate in-place
-         // state layout [S_v, S_v, H, n_seqs]: seq iv3 starts at iv3 * state_seq_stride.
-diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
-index a547360..61a2b91 100644
---- a/ggml/src/ggml-cuda/gated_delta_net.cu
-+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
-@@ -25,7 +25,8 @@ gated_delta_net_cuda(const float * q,
-                                      const uint3   neqk1_magic,
-                                      const uint3   rq3_magic,
-                                      float         scale,
--                                     int           K) {
-+                                     int           K,
-+                                     float *       state_dst) {
-     const uint32_t h_idx    = blockIdx.x;
-     const uint32_t sequence = blockIdx.y;
-     // each warp owns one column, using warp-level primitives to reduce across rows
-@@ -37,7 +38,10 @@ gated_delta_net_cuda(const float * q,
- 
-     const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
-     float *       attn_data        = dst;
--    float *       state            = dst + attn_score_elems;
-+    // when state_dst is provided (in-place decode write-back) the final recurrent state is written
-+    // directly into the persistent cache view instead of being appended to the op output; this
-+    // eliminates the per-layer per-step D2D state copy-back. Only used when keep_rs_t == false.
-+    float *       state            = (state_dst != nullptr) ? state_dst : (dst + attn_score_elems);
- 
-     // input state holds s0 only: [S_v, S_v, H, n_seqs] — seq stride is D = H * S_v * S_v.
-     // output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before.
-@@ -171,7 +175,7 @@ template <bool KDA, bool keep_rs_t>
- static void launch_gated_delta_net(
-         const float * q_d, const float * k_d, const float * v_d,
-         const float * g_d, const float * b_d, const float * s_d,
--        float * dst_d,
-+        float * dst_d, float * state_dst_d,
-         int64_t S_v,   int64_t H, int64_t n_tokens, int64_t n_seqs,
-         int64_t sq1,   int64_t sq2, int64_t sq3,
-         int64_t sv1,   int64_t sv2, int64_t sv3,
-@@ -195,26 +199,26 @@ static void launch_gated_delta_net(
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
--                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-             break;
-         case 32:
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
--                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-             break;
-         case 64: {
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
--                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-             break;
-         }
-         case 128: {
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
--                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-             break;
-         }
-         default:
-@@ -230,6 +234,7 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
-     ggml_tensor * src_g     = dst->src[3];
-     ggml_tensor * src_beta  = dst->src[4];
-     ggml_tensor * src_state = dst->src[5];
-+    ggml_tensor * src_state_dst = dst->src[6]; // optional in-place state write-back target
- 
-     GGML_TENSOR_LOCALS(int64_t, neq, src_q, ne);
-     GGML_TENSOR_LOCALS(size_t , nbq, src_q, nb);
-@@ -260,6 +265,15 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
-     const float * s_d   = (const float *) src_state->data;
-     float *       dst_d = (float *) dst->data;
- 
-+    float * state_dst_d = nullptr;
-+    if (src_state_dst != nullptr) {
-+        // in-place final-state cache view: per-seq stride must be the dense state size D = S_v*S_v*H
-+        GGML_ASSERT(src_state_dst->type == GGML_TYPE_F32);
-+        GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
-+        GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
-+        state_dst_d = (float *) src_state_dst->data;
-+    }
-+
-     GGML_ASSERT(ggml_is_contiguous_rows(src_q));
-     GGML_ASSERT(ggml_is_contiguous_rows(src_k));
-     GGML_ASSERT(ggml_is_contiguous_rows(src_v));
-@@ -288,23 +302,26 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
-     const int K = ggml_get_op_params_i32(dst, 0);
-     const bool keep_rs = K > 1;
- 
-+    // in-place write-back is only valid for the single-snapshot (final-state) case
-+    GGML_ASSERT(state_dst_d == nullptr || !keep_rs);
-+
-     if (kda) {
-         if (keep_rs) {
--            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
-+            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         } else {
--            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
-+            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         }
-     } else {
-         if (keep_rs) {
--            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
-+            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         } else {
--            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
-+            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         }
-diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
-index adbe52b..b8d34bf 100644
---- a/ggml/src/ggml.c
-+++ b/ggml/src/ggml.c
-@@ -6285,6 +6285,74 @@ struct ggml_tensor * ggml_gated_delta_net(
-     return result;
- }
- 
-+// ggml_gated_delta_net_inplace
-+//
-+// Same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
-+// in place into `state_dst` (a view into the persistent recurrent-state cache) instead of being
-+// appended to the op output. This removes the per-layer per-step D2D state copy-back during decode.
-+// The op output holds ONLY the attention scores; the state region is still allocated (unused) so
-+// the attention-output view layout is identical to ggml_gated_delta_net.
-+struct ggml_tensor * ggml_gated_delta_net_inplace(
-+        struct ggml_context * ctx,
-+        struct ggml_tensor  * q,
-+        struct ggml_tensor  * k,
-+        struct ggml_tensor  * v,
-+        struct ggml_tensor  * g,
-+        struct ggml_tensor  * beta,
-+        struct ggml_tensor  * state,
-+        struct ggml_tensor  * state_dst) {
-+    GGML_ASSERT(ggml_is_contiguous_rows(q));
-+    GGML_ASSERT(ggml_is_contiguous_rows(k));
-+    GGML_ASSERT(ggml_is_contiguous_rows(v));
-+    GGML_ASSERT(ggml_is_contiguous(g));
-+    GGML_ASSERT(ggml_is_contiguous(beta));
-+    GGML_ASSERT(ggml_is_contiguous(state));
-+
-+    GGML_ASSERT(q->type == GGML_TYPE_F32);
-+    GGML_ASSERT(k->type == GGML_TYPE_F32);
-+    GGML_ASSERT(v->type == GGML_TYPE_F32);
-+    GGML_ASSERT(g->type == GGML_TYPE_F32);
-+    GGML_ASSERT(beta->type == GGML_TYPE_F32);
-+    GGML_ASSERT(state->type == GGML_TYPE_F32);
-+    GGML_ASSERT(state_dst != NULL);
-+    GGML_ASSERT(state_dst->type == GGML_TYPE_F32);
-+
-+    const int64_t S_v      = v->ne[0];
-+    const int64_t H        = v->ne[1];
-+    const int64_t n_tokens = v->ne[2];
-+    const int64_t n_seqs   = v->ne[3];
-+
-+    GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
-+    GGML_ASSERT(beta->ne[0] == 1);
-+
-+    GGML_ASSERT(state->ne[0] == S_v);
-+    GGML_ASSERT(state->ne[1] == S_v);
-+    GGML_ASSERT(state->ne[2] == H);
-+    GGML_ASSERT(state->ne[3] == n_seqs);
-+
-+    // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
-+    GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
-+    GGML_ASSERT(state_dst->ne[1] >= n_seqs);
-+    GGML_ASSERT(state_dst->nb[0] == sizeof(float));
-+
-+    const int64_t state_rows = S_v * n_seqs; // K == 1
-+    const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
-+    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
-+
-+    ggml_set_op_params_i32(result, 0, 1); // K == 1
-+
-+    result->op     = GGML_OP_GATED_DELTA_NET;
-+    result->src[0] = q;
-+    result->src[1] = k;
-+    result->src[2] = v;
-+    result->src[3] = g;
-+    result->src[4] = beta;
-+    result->src[5] = state;
-+    result->src[6] = state_dst;
-+
-+    return result;
-+}
-+
- ////////////////////////////////////////////////////////////////////////////////
- 
- struct ggml_hash_set ggml_hash_set_new(size_t size) {
-diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
-index ad9ce77..26a718b 100644
---- a/src/models/delta-net-base.cpp
-+++ b/src/models/delta-net-base.cpp
-@@ -546,6 +546,36 @@ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
-     const bool keep = cparams.n_rs_seq > 0;
- 
-     if (!keep) {
-+        const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
-+
-+        if (fused) {
-+            // In-place state write-back: the fused gated-DeltaNet op writes the new recurrent state
-+            // directly into the persistent cache slot for the active sequences (a contiguous block
-+            // at kv_head), eliminating the per-layer per-step ~full-state D2D copy-back that
-+            // dominated decode. The op output then carries only the attention scores.
-+            ggml_tensor * state_dst = ggml_view_2d(ctx0, ssm_states_all, hparams.n_embd_s(), n_seqs,
-+                    ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
-+
-+            ggml_tensor * result = ggml_gated_delta_net_inplace(ctx0, q, k, v, g, b, s, state_dst);
-+            if (n_seq_tokens == 1) {
-+                cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
-+            } else {
-+                cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
-+            }
-+
-+            ggml_tensor * output = ggml_view_4d(ctx0, result,
-+                    S_v, H_v, n_seq_tokens, n_seqs,
-+                    ggml_row_size(result->type, S_v),
-+                    ggml_row_size(result->type, S_v * H_v),
-+                    ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
-+            cb(output, "attn_output", il);
-+
-+            // the state write is a side effect of the op; pull the op into the graph via the output
-+            ggml_build_forward_expand(gf, output);
-+
-+            return output;
-+        }
-+
-         auto attn_out = build_delta_net(q, k, v, g, b, s, il);
-         ggml_tensor * output    = attn_out.first;
-         ggml_tensor * new_state = attn_out.second;
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
@@ -1,583 +0,0 @@
-From 46d7dd80bbce7f3c1dbf9363d6527c8c9b687a6b Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 01:45:02 +0200
-Subject: [PATCH] feat(paged): qwen35 SSM decode fused recurrent-state gather
- (patch 0019)
-
-Step 2 of the SSM decode-throughput work. After Step 1 (in-place state
-write-back, patch 0018) the largest non-GEMM decode bucket was the recurrent-
-state get_rows gather (18.8% of decode GPU time): build_rs materialized each
-sequence's prior state into a contiguous scratch via ggml_get_rows before the
-gated-DeltaNet op read it.
-
-This eliminates that materialization, mirroring ggml_ssm_scan's ids source.
-ggml_gated_delta_net_inplace_ids takes the FULL recurrent-state cache plus the
-s_copy ids (src[5] = full cache, src[7] = ids, op_param[1] = rs_head) and reads
-each sequence's prior state directly from cache[ids[seq]]. Combined with Step 1's
-in-place write the op now reads AND writes the cache directly: no recurrent-state
-materialization at all. build_recurrent_attn feeds the full cache + ids through
-the build_rs get_state_rows lambda exactly like mamba-base, keeping the rs_zero
-clear and the extra-states copy around the op.
-
-Race-free by construction on CUDA. In-place write plus an ids read of the same
-cache is only safe when read slot == write slot; s_copy is identity
-(rs_head + s) for stable continuing sequences (the whole AR decode path) but can
-remap on reorder or rs_zero (e.g. multiple new sequences in one prefill ubatch).
-The recurrence kernel handles both per (seq, head) block on device: identity
-sequences read s0 in place from the destination slot (the kernel loads all of s0
-into registers before writing, so reading and writing the same slot is safe),
-and non-identity sequences read from a disjoint scratch that a small gather
-kernel copies from cache[ids[seq]] first, so the recurrence never reads a slot
-another block writes. The CPU op mirrors this (host identity check + a serial
-gather in the dispatcher). ids stays a device pointer (read only in-kernel; it is
-device-resident at op-execute time). Bit-identical to the get_rows path in every
-case.
-
-- new builder ggml_gated_delta_net_inplace_ids; CUDA gather kernel
-  (gdn_gather_nonident) + per-block read-base select in gated_delta_net_cuda;
-  CPU identity guard + serial gather fallback in the dispatcher
-- delta-net-base build_recurrent_attn gains a gather-free overload; qwen35 and
-  qwen35moe drop the pre-gather. qwen3next, kimi-linear, the non-fused path and
-  the rollback (n_rs_seq > 0) path are unchanged.
-
-Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off):
-  dense q36-27b-nvfp4 : npl 32  137.64 -> 170.68 (+24.0 percent)
-                        npl 128 186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391)
-  MoE   q36-35b-a3b-nvfp4: npl 32  299.68 -> 366.69 (+22.4 percent)
-                           npl 128 409.30 -> 553.63 (+35.3 percent)
-Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build
-(dense model text md5 match, MoE byte-identical, step2 run1 == run2). nsys
-k_get_rows_float bucket 18.8 -> 0.7 percent; the new gdn_gather_nonident kernel
-is 1.7 percent (no-op at decode, median 1.2 us). The residual decode gap to vLLM
-is now the FP4 GEMM (~48 percent of decode), a separate kernel track.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- ggml/include/ggml.h                   | 17 ++++++
- ggml/src/ggml-cpu/ops.cpp             | 49 ++++++++++++++-
- ggml/src/ggml-cuda/gated_delta_net.cu | 85 ++++++++++++++++++++++----
- ggml/src/ggml.c                       | 76 +++++++++++++++++++++++
- src/models/delta-net-base.cpp         | 63 ++++++++++++++++++++
- src/models/models.h                   | 13 ++++
- src/models/qwen35.cpp                 |  6 +-
- src/models/qwen35moe.cpp              |  6 +-
- 8 files changed, 292 insertions(+), 23 deletions(-)
-
-diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
-index 4e7ab32..951dd21 100644
---- a/ggml/include/ggml.h
-+++ b/ggml/include/ggml.h
-@@ -2593,6 +2593,23 @@ extern "C" {
-             struct ggml_tensor  * state,
-             struct ggml_tensor  * state_dst);
- 
-+    // Step 2: same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read
-+    // directly from the full state cache via per-sequence indices (ids == s_copy), mirroring
-+    // ggml_ssm_scan, instead of from a materialized ggml_get_rows gather. `state` is the FULL cache
-+    // [S_v, S_v, H, n_rs_slots]; `ids` are the per-seq source slots; `rs_head` is the destination
-+    // base slot. Eliminates the recurrent-state gather on the decode path.
-+    GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
-+            struct ggml_context * ctx,
-+            struct ggml_tensor  * q,
-+            struct ggml_tensor  * k,
-+            struct ggml_tensor  * v,
-+            struct ggml_tensor  * g,
-+            struct ggml_tensor  * beta,
-+            struct ggml_tensor  * state,
-+            struct ggml_tensor  * state_dst,
-+            struct ggml_tensor  * ids,
-+            int                   rs_head);
-+
-     // custom operators
- 
-     typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
-diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
-index 9457add..b6a1976 100644
---- a/ggml/src/ggml-cpu/ops.cpp
-+++ b/ggml/src/ggml-cpu/ops.cpp
-@@ -10633,7 +10633,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
-     const int64_t K = ggml_get_op_params_i32(dst, 0);
-     GGML_ASSERT(K >= 1);
-     // per-seq stride in floats (seq s starts at state + s * seq_stride)
--    const int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
-+    int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
- 
-     const int64_t per_thread = S_v + (K > 1 ? S_v * S_v : 0);
-     const int ith = params->ith;
-@@ -10654,6 +10654,26 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
- 
-     const float * state_in_base = (const float *)src_state->data;
- 
-+    // Step 2: fused recurrent-state gather (ids == s_copy in src[7]). Read the prior state directly
-+    // from the full cache at cache[ids[seq]] instead of from a materialized gather. For the identity
-+    // decode case the prior state is the in-place destination block [rs_head, rs_head+n_seqs);
-+    // otherwise the dispatcher has gathered cache[ids[seq]] into the (unused) output-state scratch
-+    // region. Bit-identical to the get_rows path.
-+    ggml_tensor * src_ids = dst->src[7];
-+    if (src_ids != nullptr) {
-+        const int64_t   D       = S_v * S_v * H;
-+        const int32_t   rs_head = ggml_get_op_params_i32(dst, 1);
-+        const int32_t * ids     = (const int32_t *) src_ids->data;
-+        bool identity = true;
-+        for (int64_t s = 0; s < n_seqs; ++s) {
-+            if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
-+        }
-+        state_seq_stride = D;
-+        state_in_base = identity
-+            ? (const float *) src_state->data + (int64_t) rs_head * D
-+            : (const float *) state_out_base; // gathered by the dispatcher (non-identity)
-+    }
-+
-   //const int64_t rq1 = nev1 / neq1;
-   //const int64_t rk1 = nev1 / nek1;
-     const int64_t rq3 = nev3 / neq3;
-@@ -10777,6 +10797,33 @@ static void ggml_compute_forward_gated_delta_net_f32(
- 
-     if (ith == 0) {
-       ggml_threadpool_chunk_set(params->threadpool, nth);
-+
-+      // Step 2: non-identity ids fallback -- serially gather each sequence's prior state from
-+      // cache[ids[seq]] into the (otherwise unused) output-state scratch region before the parallel
-+      // recurrence, so the in-place write never aliases another sequence's read.
-+      ggml_tensor * src_ids = dst->src[7];
-+      if (src_ids != nullptr) {
-+          const ggml_tensor * src_state = dst->src[5];
-+          const int64_t S_v      = V->ne[0];
-+          const int64_t H        = V->ne[1];
-+          const int64_t n_tokens = V->ne[2];
-+          const int64_t n_seqs   = V->ne[3];
-+          const int64_t D        = S_v * S_v * H;
-+          const int32_t   rs_head = ggml_get_op_params_i32(dst, 1);
-+          const int32_t * ids     = (const int32_t *) src_ids->data;
-+          bool identity = true;
-+          for (int64_t s = 0; s < n_seqs; ++s) {
-+              if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
-+          }
-+          if (!identity) {
-+              const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
-+              const float * cache   = (const float *) src_state->data;
-+              float *       scratch = (float *) dst->data + attn_score_elems;
-+              for (int64_t s = 0; s < n_seqs; ++s) {
-+                  memcpy(scratch + s * D, cache + (int64_t) ids[s] * D, D * sizeof(float));
-+              }
-+          }
-+      }
-     }
- 
-     ggml_barrier(params->threadpool);
-diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
-index 61a2b91..86d5e2a 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
-+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
-@@ -1,6 +1,34 @@
- #include "gated_delta_net.cuh"
- #include "ggml-cuda/common.cuh"
- 
-+// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
-+// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
-+// destination slot by the recurrence kernel and are skipped here. One block per sequence.
-+__global__ void gdn_gather_nonident_kernel(const float * cache, const int32_t * ids, int rs_head,
-+                                           float * scratch, int64_t D, int n_seqs) {
-+    const int s = blockIdx.x;
-+    if (s >= n_seqs) {
-+        return;
-+    }
-+    const int r = ids[s];
-+    if (r == rs_head + s) {
-+        return; // identity: prior state already lives in the in-place destination slot
-+    }
-+    const float * src = cache   + (int64_t) r * D;
-+    float *       dst = scratch + (int64_t) s * D;
-+    for (int64_t i = threadIdx.x; i < D; i += blockDim.x) {
-+        dst[i] = src[i];
-+    }
-+}
-+
-+static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * ids, int rs_head,
-+                                          float * scratch, int64_t D, int64_t n_seqs, cudaStream_t stream) {
-+    if (n_seqs <= 0) {
-+        return;
-+    }
-+    gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
-+}
-+
- template <int S_v, bool KDA, bool keep_rs_t>
- __global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
- gated_delta_net_cuda(const float * q,
-@@ -26,7 +54,9 @@ gated_delta_net_cuda(const float * q,
-                                      const uint3   rq3_magic,
-                                      float         scale,
-                                      int           K,
-                                     float *       state_dst) {
-+                                     float *       state_dst,
-+                                     const int32_t * ids,
-+                                     int           rs_head) {
-     const uint32_t h_idx    = blockIdx.x;
-     const uint32_t sequence = blockIdx.y;
-     // each warp owns one column, using warp-level primitives to reduce across rows
-@@ -48,7 +78,15 @@ gated_delta_net_cuda(const float * q,
-     const int64_t state_in_offset      = sequence * H * S_v * S_v + h_idx * S_v * S_v;
-     const int64_t state_out_offset     = (sequence * H + h_idx) * S_v * S_v;
-     state += state_out_offset;
-    curr_state += state_in_offset + col * S_v;
-+    // Step 2: select the prior-state read base per sequence. For the ids variant, identity
-+    // sequences (ids[seq] == rs_head + seq) read s0 directly from the in-place destination slot
-+    // state_dst (no materialization); non-identity sequences read from the pre-gathered scratch
-+    // (curr_state). state_in_offset == state_out_offset, so both bases use the same per-(seq,head)
-+    // offset. The whole s0 is loaded into registers before the new state is written, so reading and
-+    // writing the same slot per block (identity) is race-free.
-+    const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
-+        ? state_dst : curr_state;
-+    read_state += state_in_offset + col * S_v;
-     attn_data += (sequence * n_tokens * H + h_idx) * S_v;
- 
-     constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
-@@ -61,7 +99,7 @@ gated_delta_net_cuda(const float * q,
- #pragma unroll
-     for (int r = 0; r < rows_per_lane; r++) {
-         const int i = r * warp_size + lane;
-        s_shard[r]  = curr_state[i];
-+        s_shard[r]  = read_state[i];
-     }
- 
-     for (int t = 0; t < n_tokens; t++) {
-@@ -176,6 +214,7 @@ static void launch_gated_delta_net(
-         const float * q_d, const float * k_d, const float * v_d,
-         const float * g_d, const float * b_d, const float * s_d,
-         float * dst_d, float * state_dst_d,
-+        const int32_t * ids_d, int rs_head,
-         int64_t S_v,   int64_t H, int64_t n_tokens, int64_t n_seqs,
-         int64_t sq1,   int64_t sq2, int64_t sq3,
-         int64_t sv1,   int64_t sv2, int64_t sv3,
-@@ -199,26 +238,26 @@ static void launch_gated_delta_net(
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-             break;
-         case 32:
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-             break;
-         case 64: {
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-             break;
-         }
-         case 128: {
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-             break;
-         }
-         default:
-@@ -262,7 +301,6 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
-     const float * g_d = (const float *) src_g->data;
-     const float * b_d = (const float *) src_beta->data;
- 
-    const float * s_d   = (const float *) src_state->data;
-     float *       dst_d = (float *) dst->data;
- 
-     float * state_dst_d = nullptr;
-@@ -274,6 +312,29 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
-         state_dst_d = (float *) src_state_dst->data;
-     }
- 
-+    // Step 2: fused recurrent-state gather (src[7] = ids == s_copy). Read the prior state directly
-+    // from the full cache via ids instead of from a materialized ggml_get_rows gather. The recurrence
-+    // kernel reads identity sequences (ids[seq] == rs_head + seq) in place from state_dst (no
-+    // materialization at all); any non-identity sequence (reorder / rs_zero remap) is gathered here
-+    // into a disjoint scratch that the kernel reads instead. The gather writes a disjoint buffer and
-+    // the recurrence never reads a slot another block writes, so it is race-free and bit-identical to
-+    // the get_rows path. ids stays a DEVICE pointer (dereferenced only inside the kernels).
-+    ggml_tensor * src_ids = dst->src[7];
-+    const float *   s_d     = (const float *) src_state->data;
-+    const int32_t * ids_d   = nullptr;
-+    int             rs_head = 0;
-+    ggml_cuda_pool_alloc<float> ids_state_scratch(ctx.pool());
-+    if (src_ids != nullptr) {
-+        GGML_ASSERT(state_dst_d != nullptr);
-+        GGML_ASSERT(src_ids->type == GGML_TYPE_I32);
-+        rs_head = ggml_get_op_params_i32(dst, 1);
-+        ids_d   = (const int32_t *) src_ids->data;
-+        const int64_t D = S_v * S_v * H;
-+        float * scratch = ids_state_scratch.alloc((size_t) D * n_seqs);
-+        ggml_cuda_gdn_gather_nonident(s_d, ids_d, rs_head, scratch, D, n_seqs, ctx.stream());
-+        s_d = scratch;
-+    }
-+
-     GGML_ASSERT(ggml_is_contiguous_rows(src_q));
-     GGML_ASSERT(ggml_is_contiguous_rows(src_k));
-     GGML_ASSERT(ggml_is_contiguous_rows(src_v));
-@@ -307,21 +368,21 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
- 
-     if (kda) {
-         if (keep_rs) {
-            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-+            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         } else {
-            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-+            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         }
-     } else {
-         if (keep_rs) {
-            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-+            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         } else {
-            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-+            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         }
-diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
-index b8d34bf..1762037 100644
--- a/ggml/src/ggml.c
-+++ b/ggml/src/ggml.c
-@@ -6353,6 +6353,82 @@ struct ggml_tensor * ggml_gated_delta_net_inplace(
-     return result;
- }
- 
-+// ggml_gated_delta_net_inplace_ids
-+//
-+// Same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read directly
-+// from the FULL state cache `state` ([S_v, S_v, H, n_rs_slots]) at cache[ids[seq]] (mirroring
-+// ggml_ssm_scan's ids source) instead of from a materialized ggml_get_rows gather. `rs_head` is the
-+// destination base slot, used by the backend to detect the common identity case (ids[s] == rs_head
-+// + s), where the prior state already lives in the in-place destination slots.
-+struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
-+        struct ggml_context * ctx,
-+        struct ggml_tensor  * q,
-+        struct ggml_tensor  * k,
-+        struct ggml_tensor  * v,
-+        struct ggml_tensor  * g,
-+        struct ggml_tensor  * beta,
-+        struct ggml_tensor  * state,
-+        struct ggml_tensor  * state_dst,
-+        struct ggml_tensor  * ids,
-+        int                   rs_head) {
-+    GGML_ASSERT(ggml_is_contiguous_rows(q));
-+    GGML_ASSERT(ggml_is_contiguous_rows(k));
-+    GGML_ASSERT(ggml_is_contiguous_rows(v));
-+    GGML_ASSERT(ggml_is_contiguous(g));
-+    GGML_ASSERT(ggml_is_contiguous(beta));
-+    GGML_ASSERT(ggml_is_contiguous(state));
-+
-+    GGML_ASSERT(q->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(k->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(v->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(g->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(beta->type == GGML_TYPE_F32);
-+    GGML_ASSERT(state->type == GGML_TYPE_F32);
-+    GGML_ASSERT(state_dst != NULL && state_dst->type == GGML_TYPE_F32);
-+    GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32);
-+
-+    const int64_t S_v      = v->ne[0];
-+    const int64_t H        = v->ne[1];
-+    const int64_t n_tokens = v->ne[2];
-+    const int64_t n_seqs   = v->ne[3];
-+
-+    GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
-+    GGML_ASSERT(beta->ne[0] == 1);
-+
-+    // state is the FULL recurrent-state cache: [S_v, S_v, H, n_rs_slots], n_rs_slots >= n_seqs
-+    GGML_ASSERT(state->ne[0] == S_v);
-+    GGML_ASSERT(state->ne[1] == S_v);
-+    GGML_ASSERT(state->ne[2] == H);
-+    GGML_ASSERT(state->ne[3] >= n_seqs);
-+
-+    // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
-+    GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
-+    GGML_ASSERT(state_dst->ne[1] >= n_seqs);
-+    GGML_ASSERT(state_dst->nb[0] == sizeof(float));
-+
-+    // ids: per-seq source slot into the full cache (s_copy_main)
-+    GGML_ASSERT(ids->ne[0] >= n_seqs);
-+
-+    const int64_t state_rows = S_v * n_seqs; // K == 1
-+    const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
-+    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
-+
-+    ggml_set_op_params_i32(result, 0, 1);       // K == 1
-+    ggml_set_op_params_i32(result, 1, rs_head); // destination base slot (for the ids identity check)
-+
-+    result->op     = GGML_OP_GATED_DELTA_NET;
-+    result->src[0] = q;
-+    result->src[1] = k;
-+    result->src[2] = v;
-+    result->src[3] = g;
-+    result->src[4] = beta;
-+    result->src[5] = state;     // FULL cache (read via ids)
-+    result->src[6] = state_dst; // in-place final-state write-back target
-+    result->src[7] = ids;       // per-seq source slots (s_copy)
-+
-+    return result;
-+}
-+
- ////////////////////////////////////////////////////////////////////////////////
- 
- struct ggml_hash_set ggml_hash_set_new(size_t size) {
-diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
-index 26a718b..194e611 100644
--- a/src/models/delta-net-base.cpp
-+++ b/src/models/delta-net-base.cpp
-@@ -524,6 +524,69 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
-     return conv_input;
- }
- 
-+// Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
-+// gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
-+// ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
-+// and rollback paths fall back to materializing the prior state and delegating below.
-+ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
-+        llm_graph_input_rs * inp,
-+        ggml_tensor *        ssm_states_all,
-+        ggml_tensor *        q,
-+        ggml_tensor *        k,
-+        ggml_tensor *        v,
-+        ggml_tensor *        g,
-+        ggml_tensor *        b,
-+        int                  il) {
-+    const auto * mctx_cur = inp->mctx;
-+    const auto   kv_head  = mctx_cur->get_head();
-+
-+    const int64_t S_v          = v->ne[0];
-+    const int64_t H_v          = v->ne[1];
-+    const int64_t n_seqs       = v->ne[3];
-+    const int64_t n_seq_tokens = q->ne[2];
-+
-+    const bool keep  = cparams.n_rs_seq > 0;
-+    const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
-+
-+    if (!keep && fused) {
-+        // build_rs feeds the FULL state cache + the s_copy ids into the op (via the get_state_rows
-+        // lambda, exactly like mamba-base's ggml_ssm_scan) and still performs the rs_zero clear and
-+        // the extra-states copy around it. The op reads curr_state from cache[ids[seq]] and writes
-+        // the final state in place at kv_head; no recurrent-state materialization at all.
-+        auto get_state_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * {
-+            ggml_tensor * cache4d = ggml_reshape_4d(ctx, states, S_v, S_v, H_v, states->ne[1]);
-+            ggml_tensor * state_dst = ggml_view_2d(ctx, ssm_states_all, hparams.n_embd_s(), n_seqs,
-+                    ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
-+            return ggml_gated_delta_net_inplace_ids(ctx, q, k, v, g, b, cache4d, state_dst, ids, (int) kv_head);
-+        };
-+
-+        ggml_tensor * result = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs, get_state_op);
-+        if (n_seq_tokens == 1) {
-+            cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
-+        } else {
-+            cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
-+        }
-+
-+        ggml_tensor * output = ggml_view_4d(ctx0, result,
-+                S_v, H_v, n_seq_tokens, n_seqs,
-+                ggml_row_size(result->type, S_v),
-+                ggml_row_size(result->type, S_v * H_v),
-+                ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
-+        cb(output, "attn_output", il);
-+
-+        // the state write is a side effect of the op; pull the op into the graph via the output
-+        ggml_build_forward_expand(gf, output);
-+
-+        return output;
-+    }
-+
-+    // non-fused / rollback: materialize the prior state via gather and delegate to the
-+    // state-taking overload (its fused !keep branch performs the Step-1 in-place write).
-+    ggml_tensor * s = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
-+    s = ggml_reshape_4d(ctx0, s, S_v, S_v, H_v, n_seqs);
-+    return build_recurrent_attn(inp, ssm_states_all, q, k, v, g, b, s, il);
-+}
-+
- ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
-         llm_graph_input_rs * inp,
-         ggml_tensor *        ssm_states_all,
-diff --git a/src/models/models.h b/src/models/models.h
-index 2ac8415..98b89e9 100644
--- a/src/models/models.h
-+++ b/src/models/models.h
-@@ -88,6 +88,19 @@ struct llm_build_delta_net_base : public llm_graph_context {
-             ggml_tensor *        b,
-             ggml_tensor *        s,
-             int                  il);
-+
-+    // Step 2: gather-free variant. Reads the prior recurrent state directly from the full cache via
-+    // the s_copy ids (no ggml_get_rows materialization) on the fused decode/prefill path, and
-+    // delegates to the state-taking overload for the non-fused and rollback paths.
-+    ggml_tensor * build_recurrent_attn(
-+            llm_graph_input_rs * inp,
-+            ggml_tensor *        ssm_states_all,
-+            ggml_tensor *        q,
-+            ggml_tensor *        k,
-+            ggml_tensor *        v,
-+            ggml_tensor *        g,
-+            ggml_tensor *        b,
-+            int                  il);
- };
- 
- struct llm_build_rwkv6_base : public llm_graph_context {
-diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
-index 6783d98..0be3247 100644
--- a/src/models/qwen35.cpp
-+++ b/src/models/qwen35.cpp
-@@ -385,10 +385,6 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
- 
-     ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- 
-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
-    cb(state, "state_predelta", il);
-
-     ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-     cb(conv_output_proper, "conv_output_raw", il);
- 
-@@ -445,7 +441,7 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
-     cb(k_conv, "k_conv_predelta", il);
-     cb(v_conv, "v_conv_predelta", il);
- 
-    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
-+    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
- 
-     // z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
-     ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
-diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
-index eb5e9a4..2995f04 100644
--- a/src/models/qwen35moe.cpp
-+++ b/src/models/qwen35moe.cpp
-@@ -409,10 +409,6 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
- 
-     ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- 
-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
-    cb(state, "state_predelta", il);
-
-     ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-     cb(conv_output_proper, "conv_output_raw", il);
- 
-@@ -469,7 +465,7 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
-     cb(k_conv, "k_conv_predelta", il);
-     cb(v_conv, "v_conv_predelta", il);
- 
-    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
-+    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
- 
-     // z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
-     ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
@@ -1,140 +0,0 @@
-From df1cc97b68df048834ab735c944b71c3a2e8737e Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 12:40:49 +0200
-Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape
- (patch 0020)
-
-Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
-models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
-(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
-both engines pinned the largest llama-specific overage to the gated-DeltaNet
-OUTPUT projection (ssm_out).
-
-The GDN op left its output in SSM layout and the graph reshaped it to 3D
-[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
-src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
-sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
-ssm_out weight read across the 128 sequences (one 5120x128 grid, 48 calls/step,
-the 40%-vs-62% GPU-utilization gap). vLLM packs the same projection into one
-M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.
-
-The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
-(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
-routes to the MMQ M=128 tensor-core GEMM (which amortizes the weight read across
-all 128 tokens). The result is then already 2D, so the redundant post-matmul
-reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical.
-Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs
-untouched.
-
-Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
-q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
-test-backend-ops MUL_MAT and MUL_MAT_ID OK.
-
-decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
-  dense q36-27b:    170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
-  MoE   q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
-Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).
-
-nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
-to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) with a LOWER
-per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
-vs 2.77 ms/call for the old GEMV.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- src/models/qwen35.cpp       | 13 ++++---
- src/models/qwen35moe.cpp    | 13 ++++---
- src/models/qwen3next.cpp    | 13 ++++---
- 3 files changed, 21 insertions(+), 18 deletions(-)
-
-diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
-index 0be3247..0874c43 100644
--- a/src/models/qwen35.cpp
-+++ b/src/models/qwen35.cpp
-@@ -449,17 +449,18 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
-     // Apply gated normalization: self.norm(core_attn_out, z)
-     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- 
-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
-+    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
-+    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
-+    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
-+    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
-+    // data, just a 2D vs 3D view, so the result is bit-identical.
-+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-     cb(final_output, "final_output", il);
- 
-    // Output projection
-+    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
-     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
-     cb(cur, "linear_attn_out", il);
- 
-    // Reshape back to original dimensions
-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
-     return cur;
- }
- 
-diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
-index 2995f04..1f6f643 100644
--- a/src/models/qwen35moe.cpp
-+++ b/src/models/qwen35moe.cpp
-@@ -473,17 +473,18 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
-     // Apply gated normalization: self.norm(core_attn_out, z)
-     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- 
-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
-+    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
-+    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
-+    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
-+    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
-+    // data, just a 2D vs 3D view, so the result is bit-identical.
-+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-     cb(final_output, "final_output", il);
- 
-    // Output projection
-+    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
-     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
-     cb(cur, "linear_attn_out", il);
- 
-    // Reshape back to original dimensions
-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
-     return cur;
- }
- 
-diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
-index 97200a4..bfdf026 100644
--- a/src/models/qwen3next.cpp
-+++ b/src/models/qwen3next.cpp
-@@ -519,17 +519,18 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
-     // Apply gated normalization: self.norm(core_attn_out, z)
-     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- 
-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
-+    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
-+    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
-+    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
-+    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
-+    // data, just a 2D vs 3D view, so the result is bit-identical.
-+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-     cb(final_output, "final_output", il);
- 
-    // Output projection
-+    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
-     cur = build_lora_mm(model.layers[il].ssm_out, final_output);
-     cb(cur, "linear_attn_out", il);
- 
-    // Reshape back to original dimensions
-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
-     return cur;
- }
- 
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch
@@ -1,655 +0,0 @@
-From 58426b58aaf5431a59d499d513b2fe2d6ab990d8 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 18:55:54 +0200
-Subject: [PATCH] feat(paged): qwen35 decode conv-state in-place fusion (patch
- 0021)
-
-The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate
-design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet
-recurrence is already single-pass at the f32 byte floor), the decode conv path
-was the only remaining bit-exact lever.
-
-New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated
-by a non-null src[3]). On the single-token decode path it replaces the four-op
-conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu
-+ ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per
-(channel, sequence), assembles the width-K window in registers from the K-1 cached
-taps plus the current qkv_mixed token, computes the depthwise conv with the SAME
-ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv
-output, and writes the 1-token-shifted ring state back IN PLACE into the conv
-cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018
-in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and
-write target (the cache view) are disjoint buffers, so it is race-free by
-construction with no ids/identity logic.
-
- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel,
-  src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring;
-  op_params[0]=fuse_silu)
- ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel +
-  ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv
- ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels)
-  + branch in ggml_compute_forward_ssm_conv
- delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs
-  conv-tap gather; fuses conv+silu+shifted write-back)
- qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path
-  (n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep
-  the original chain
- tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference
-
-test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90.
-
-Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1
-(0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5
-ac163882... both BYTE-IDENTICAL.
-
-decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session:
-  dense q36-27b-nvfp4 : npl 32  199.76 -> 202.99 (+1.6%)
-                        npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391)
-  MoE   q36-35b-a3b   : npl 32  421.72 -> 432.39 (+2.5%)
-                        npl 128 689.74 -> 713.54 (+3.5%)
-Lift holds in eager too (dense npl128 333.62 -> 342.97). Step -11.9 ms/step
-(dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the
-decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152);
-conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state
-conv-cache plumbing.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/include/ggml.h            |  16 +++++
- ggml/src/ggml-cpu/ops.cpp      |  73 ++++++++++++++++++++-
- ggml/src/ggml-cuda/ssm-conv.cu | 112 +++++++++++++++++++++++++++++++++
- ggml/src/ggml.c                |  54 ++++++++++++++++
- src/models/delta-net-base.cpp  |  51 +++++++++++++++
- src/models/models.h            |  14 +++++
- src/models/qwen35.cpp          |  23 +++++--
- src/models/qwen35moe.cpp       |  23 +++++--
- src/models/qwen3next.cpp       |  29 ++++++---
- tests/test-backend-ops.cpp     |  47 ++++++++++++++
- 10 files changed, 420 insertions(+), 22 deletions(-)
-
-diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
-index 951dd21..76fa401 100644
--- a/ggml/include/ggml.h
-+++ b/ggml/include/ggml.h
-@@ -2447,6 +2447,22 @@ extern "C" {
-             struct ggml_tensor  * sx,
-             struct ggml_tensor  * c);
- 
-+    // Fused decode-time depthwise causal conv1d update (mirrors vLLM causal_conv1d_update). Assembles
-+    // the width-K conv window in registers from the cached K-1 taps (`conv_states`, [K-1, channels,
-+    // n_seqs]) plus the single current token (`x_cur`, [channels, 1, n_seqs]), computes the depthwise
-+    // conv with the SAME ascending-tap FMA order as ggml_ssm_conv, optionally folds SiLU, and writes
-+    // the 1-token-shifted ring state back IN PLACE into `conv_state_dst` (a [(K-1)*channels, n_seqs]
-+    // view into the conv-state cache). This eliminates the concat + transpose + scalar copy-back +
-+    // separate silu of the decode conv path. Output: [channels, 1, n_seqs]. Reuses GGML_OP_SSM_CONV;
-+    // detected by the backends via a non-null src[3]. n_seq_tokens must be 1 (single-token decode).
-+    GGML_API struct ggml_tensor * ggml_ssm_conv_update_inplace(
-+            struct ggml_context * ctx,
-+            struct ggml_tensor  * conv_states,
-+            struct ggml_tensor  * conv_kernel,
-+            struct ggml_tensor  * x_cur,
-+            struct ggml_tensor  * conv_state_dst,
-+            bool                  fuse_silu);
-+
-     GGML_API struct ggml_tensor * ggml_ssm_scan(
-             struct ggml_context * ctx,
-             struct ggml_tensor  * s,
-diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
-index b6a1976..f9cd850 100644
--- a/ggml/src/ggml-cpu/ops.cpp
-+++ b/ggml/src/ggml-cpu/ops.cpp
-@@ -9463,13 +9463,84 @@ static void ggml_compute_forward_ssm_conv_f32(
-     }
- }
- 
-+// Fused decode-time depthwise causal conv1d update (mirror of the CUDA ssm_conv_update_f32). Reads the
-+// K-1 cached taps (src[0]) and the single new token (src[2]), computes the depthwise conv with the same
-+// ascending-tap FMA order as ggml_compute_forward_ssm_conv_f32, optionally folds silu, writes the conv
-+// output to dst, and writes the 1-token-shifted ring state back in place into src[3]. Threads split
-+// over channels.
-+static void ggml_compute_forward_ssm_conv_update_f32(
-+        const ggml_compute_params * params,
-+        ggml_tensor * dst) {
-+    const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs]
-+    const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
-+    const ggml_tensor * x_cur       = dst->src[2]; // [channels, 1, n_seqs]
-+    ggml_tensor       * cdst        = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
-+
-+    const int ith = params->ith;
-+    const int nth = params->nth;
-+
-+    const int64_t d_conv   = conv_kernel->ne[0];
-+    const int64_t channels = conv_kernel->ne[1];
-+    const int64_t n_seqs   = conv_states->ne[2];
-+    const bool    apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
-+
-+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
-+    GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
-+
-+    const int64_t states_seq_stride = conv_states->nb[2] / sizeof(float);
-+    const int64_t states_ch_stride  = conv_states->nb[1] / sizeof(float);
-+    const int64_t w_stride          = conv_kernel->nb[1] / sizeof(float);
-+    const int64_t x_seq_stride      = x_cur->nb[2] / sizeof(float);
-+    const int64_t dst_seq_stride    = dst->nb[2] / sizeof(float);
-+    const int64_t cdst_seq_stride   = cdst->nb[1] / sizeof(float);
-+
-+    const float * states_base = (const float *) conv_states->data;
-+    const float * w_base      = (const float *) conv_kernel->data;
-+    const float * x_base      = (const float *) x_cur->data;
-+    float *       cdst_base   = (float *) cdst->data;
-+    float *       dst_base    = (float *) dst->data;
-+
-+    const int64_t dc = (channels + nth - 1) / nth;
-+    const int64_t c0 = dc * ith;
-+    const int64_t c1 = MIN(c0 + dc, channels);
-+
-+    for (int64_t s = 0; s < n_seqs; ++s) {
-+        for (int64_t c = c0; c < c1; ++c) {
-+            const float * states_c = states_base + s * states_seq_stride + c * states_ch_stride;
-+            const float * w_c      = w_base + c * w_stride;
-+            const float   xc       = x_base[s * x_seq_stride + c];
-+
-+            // ascending-tap FMA: tap0*w0 + ... + tap_{K-2}*w_{K-2} + xc*w_{K-1} (matches ssm_conv)
-+            float sumf = 0.0f;
-+            for (int64_t j = 0; j < d_conv - 1; ++j) {
-+                sumf += states_c[j] * w_c[j];
-+            }
-+            sumf += xc * w_c[d_conv - 1];
-+            sumf += 0.0f; // matches ssm_conv `sumf += b` with b == 0
-+
-+            dst_base[s * dst_seq_stride + c] = apply_silu ? (sumf / (1.0f + expf(-sumf))) : sumf;
-+
-+            // 1-token-shifted ring write-back: [tap1 .. tap_{K-2}, xc]
-+            float * out_state = cdst_base + s * cdst_seq_stride + c * (d_conv - 1);
-+            for (int64_t j = 0; j < d_conv - 2; ++j) {
-+                out_state[j] = states_c[j + 1];
-+            }
-+            out_state[d_conv - 2] = xc;
-+        }
-+    }
-+}
-+
- void ggml_compute_forward_ssm_conv(
-         const ggml_compute_params * params,
-         ggml_tensor * dst) {
-     switch (dst->src[0]->type) {
-         case GGML_TYPE_F32:
-             {
-                ggml_compute_forward_ssm_conv_f32(params, dst);
-+                if (dst->src[3] != nullptr) {
-+                    ggml_compute_forward_ssm_conv_update_f32(params, dst);
-+                } else {
-+                    ggml_compute_forward_ssm_conv_f32(params, dst);
-+                }
-             } break;
-         default:
-             {
-diff --git a/ggml/src/ggml-cuda/ssm-conv.cu b/ggml/src/ggml-cuda/ssm-conv.cu
-index 1463169..e1af1cd 100644
--- a/ggml/src/ggml-cuda/ssm-conv.cu
-+++ b/ggml/src/ggml-cuda/ssm-conv.cu
-@@ -123,6 +123,109 @@ static __global__ void ssm_conv_long_token_f32(const float * __restrict__ src0,
-     }
- }
- 
-+// Fused decode-time depthwise causal conv1d update (one new token). Each thread owns one channel of
-+// one sequence: it assembles the width-d_conv window from the K-1 cached taps (conv_states) plus the
-+// current token (x_cur), computes the depthwise conv with the SAME ascending-tap FMA order as
-+// ssm_conv_f32 at i==0, optionally folds silu, writes the conv output, and writes the 1-token-shifted
-+// ring state back in place into conv_state_dst. Bit-identical to ssm_conv(concat) + silu + copy-back.
-+template <bool apply_silu, int d_conv>
-+static __global__ void ssm_conv_update_f32(const float * __restrict__ conv_states,
-+                                           const float * __restrict__ conv_kernel,
-+                                           const float * __restrict__ x_cur,
-+                                           float       * __restrict__ conv_state_dst,
-+                                           float       * __restrict__ dst,
-+                                           const int channels,
-+                                           const int states_seq_stride,
-+                                           const int w_stride,
-+                                           const int x_seq_stride,
-+                                           const int dst_seq_stride,
-+                                           const int cdst_seq_stride) {
-+    const int c = blockIdx.x * blockDim.x + threadIdx.x; // channel
-+    const int s = blockIdx.y;                            // sequence
-+    if (c >= channels) {
-+        return;
-+    }
-+
-+    const float * states_c = conv_states + (int64_t) s * states_seq_stride + (int64_t) c * (d_conv - 1);
-+    const float * w_c       = conv_kernel + (int64_t) c * w_stride;
-+    const float   xc        = x_cur[(int64_t) s * x_seq_stride + c];
-+
-+    // window = [tap0 .. tap_{K-2}, current-token], same ordering as the concat(conv_states, x) window
-+    float window[d_conv];
-+#pragma unroll
-+    for (int j = 0; j < d_conv - 1; j++) {
-+        window[j] = states_c[j];
-+    }
-+    window[d_conv - 1] = xc;
-+
-+    float sumf = 0.0f;
-+#pragma unroll
-+    for (int j = 0; j < d_conv; j++) {
-+        sumf += window[j] * w_c[j];
-+    }
-+    sumf += 0.0f; // matches ssm_conv_f32 `sumf += b` with b == 0 (qwen35 conv1d has no bias)
-+    dst[(int64_t) s * dst_seq_stride + c] = apply_silu ? ggml_cuda_op_silu_single(sumf) : sumf;
-+
-+    // 1-token-shifted ring write-back: drop the oldest tap, append the current token
-+    float * out_state = conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1);
-+#pragma unroll
-+    for (int j = 0; j < d_conv - 1; j++) {
-+        out_state[j] = window[j + 1];
-+    }
-+}
-+
-+static void ggml_cuda_op_ssm_conv_update(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
-+    const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs]
-+    const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
-+    const ggml_tensor * x_cur       = dst->src[2]; // [channels, 1, n_seqs]
-+    const ggml_tensor * cdst        = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
-+
-+    const int64_t d_conv   = conv_kernel->ne[0];
-+    const int64_t channels = conv_kernel->ne[1];
-+    const int64_t n_seqs   = conv_states->ne[2];
-+    const bool    apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
-+
-+    GGML_ASSERT(conv_states->type == GGML_TYPE_F32 && conv_kernel->type == GGML_TYPE_F32);
-+    GGML_ASSERT(x_cur->type == GGML_TYPE_F32 && cdst->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
-+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
-+    GGML_ASSERT(conv_states->nb[1] == (size_t) (d_conv - 1) * sizeof(float));
-+    GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
-+    GGML_ASSERT(dst->ne[0] == channels && dst->ne[1] == 1 && dst->ne[2] == n_seqs);
-+
-+    const float * states_d = (const float *) conv_states->data;
-+    const float * w_d      = (const float *) conv_kernel->data;
-+    const float * x_d      = (const float *) x_cur->data;
-+    float *       cdst_d   = (float *) cdst->data;
-+    float *       dst_d    = (float *) dst->data;
-+    cudaStream_t  stream   = ctx.stream();
-+
-+    const int states_seq_stride = (int) (conv_states->nb[2] / sizeof(float));
-+    const int w_stride          = (int) (conv_kernel->nb[1] / sizeof(float));
-+    const int x_seq_stride      = (int) (x_cur->nb[2] / sizeof(float));
-+    const int dst_seq_stride    = (int) (dst->nb[2] / sizeof(float));
-+    const int cdst_seq_stride   = (int) (cdst->nb[1] / sizeof(float));
-+
-+    const int threads = 128;
-+    const dim3 blocks((channels + threads - 1) / threads, (unsigned) n_seqs, 1);
-+
-+    auto launch = [&](auto NC) {
-+        constexpr int kNC = decltype(NC)::value;
-+        if (apply_silu) {
-+            ssm_conv_update_f32<true, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d,
-+                (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
-+        } else {
-+            ssm_conv_update_f32<false, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d,
-+                (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
-+        }
-+    };
-+
-+    switch (d_conv) {
-+        case 3: launch(std::integral_constant<int, 3>{}); break;
-+        case 4: launch(std::integral_constant<int, 4>{}); break;
-+        default: GGML_ABORT("ssm_conv_update only supports d_conv 3 or 4");
-+    }
-+}
-+
- template <bool apply_silu>
- static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1,
-                               const int src0_nb2, const int src1_nb1, float * dst, const int dst_nb0, const int dst_nb1,
-@@ -158,6 +261,15 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const floa
- }
- 
- void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * bias_add_node, ggml_tensor * silu_dst) {
-+    // Fused decode conv-update-in-place variant (ggml_ssm_conv_update_inplace): discriminated by a
-+    // non-null src[3] (the in-place ring write-back target). It folds the concat/transpose/copy-back/
-+    // silu of the decode conv path into a single kernel.
-+    if (dst->src[3] != nullptr) {
-+        GGML_ASSERT(bias_add_node == nullptr && silu_dst == nullptr);
-+        ggml_cuda_op_ssm_conv_update(ctx, dst);
-+        return;
-+    }
-+
-     const struct ggml_tensor * src0 = dst->src[0];  // conv_x
-     const struct ggml_tensor * src1 = dst->src[1];  // conv1d.weight
-     const bool fuse_bias = bias_add_node != nullptr;
-diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
-index 1762037..b777748 100644
--- a/ggml/src/ggml.c
-+++ b/ggml/src/ggml.c
-@@ -5555,6 +5555,60 @@ struct ggml_tensor * ggml_ssm_conv(
-     return result;
- }
- 
-+// ggml_ssm_conv_update_inplace
-+//
-+// Fused decode-time depthwise causal conv1d update. Reuses GGML_OP_SSM_CONV but is discriminated by a
-+// non-null src[3]. The op reads each channel's K-1 cached taps from `conv_states` and the single new
-+// token from `x_cur`, computes the depthwise conv (ascending-tap FMA, bit-identical to ggml_ssm_conv),
-+// optionally folds SiLU, writes the conv output to dst ([channels, 1, n_seqs]) and writes the
-+// 1-token-shifted ring state back in place into `conv_state_dst` (the active sequences' conv-cache
-+// slot). op_params[0] carries the fuse_silu flag. Mirrors the 0018/0019 in-place state pattern.
-+struct ggml_tensor * ggml_ssm_conv_update_inplace(
-+        struct ggml_context * ctx,
-+        struct ggml_tensor  * conv_states,
-+        struct ggml_tensor  * conv_kernel,
-+        struct ggml_tensor  * x_cur,
-+        struct ggml_tensor  * conv_state_dst,
-+        bool                  fuse_silu) {
-+    GGML_ASSERT(ggml_is_3d(conv_states));
-+    GGML_ASSERT(ggml_is_matrix(conv_kernel));
-+    GGML_ASSERT(ggml_is_3d(x_cur));
-+
-+    const int64_t d_conv   = conv_kernel->ne[0];
-+    const int64_t channels = conv_kernel->ne[1];
-+    const int64_t n_seqs   = conv_states->ne[2];
-+
-+    GGML_ASSERT(conv_states->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(conv_kernel->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(x_cur->type          == GGML_TYPE_F32);
-+    GGML_ASSERT(conv_state_dst != NULL && conv_state_dst->type == GGML_TYPE_F32);
-+
-+    // conv_states: [K-1, channels, n_seqs], contiguous taps per channel
-+    GGML_ASSERT(conv_states->ne[0] == d_conv - 1);
-+    GGML_ASSERT(conv_states->ne[1] == channels);
-+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
-+    // x_cur: single decode token per sequence
-+    GGML_ASSERT(x_cur->ne[0] == channels);
-+    GGML_ASSERT(x_cur->ne[1] == 1);
-+    GGML_ASSERT(x_cur->ne[2] == n_seqs);
-+    // conv_state_dst: [(K-1)*channels, n_seqs] in-place ring write target
-+    GGML_ASSERT(conv_state_dst->ne[0] == (d_conv - 1) * channels);
-+    GGML_ASSERT(conv_state_dst->ne[1] >= n_seqs);
-+    GGML_ASSERT(conv_state_dst->nb[0] == sizeof(float));
-+
-+    struct ggml_tensor * result = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
-+
-+    ggml_set_op_params_i32(result, 0, fuse_silu ? 1 : 0);
-+
-+    result->op     = GGML_OP_SSM_CONV;
-+    result->src[0] = conv_states;
-+    result->src[1] = conv_kernel;
-+    result->src[2] = x_cur;
-+    result->src[3] = conv_state_dst;
-+
-+    return result;
-+}
-+
- // ggml_ssm_scan
- 
- struct ggml_tensor * ggml_ssm_scan(
-diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
-index 194e611..0eee804 100644
--- a/src/models/delta-net-base.cpp
-+++ b/src/models/delta-net-base.cpp
-@@ -524,6 +524,57 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
-     return conv_input;
- }
- 
-+// Fused decode conv path (patch 0021). Reads the active sequences' prior conv-state taps (the same
-+// cheap build_rs gather as build_conv_state), then fuses the depthwise conv + silu + the 1-token-
-+// shifted ring write-back into a single ggml_ssm_conv_update_inplace op. This removes the concat
-+// (concat_cont), the transpose materialization, the scalar copy-back (cpy_scalar) and the separate
-+// silu of the decode conv path. The op reads from the (disjoint) materialized taps and writes the
-+// new ring state in place into the cache slot at kv_head -- exactly the slot the baseline ggml_cpy
-+// wrote -- so it is bit-identical to build_conv_state + ggml_ssm_conv + ggml_silu.
-+ggml_tensor * llm_build_delta_net_base::build_conv_state_fused(
-+        llm_graph_input_rs * inp,
-+        ggml_tensor *        conv_states_all,
-+        ggml_tensor *        qkv_mixed,
-+        ggml_tensor *        conv_kernel,
-+        int64_t              conv_kernel_size,
-+        int64_t              conv_channels,
-+        int                  il) {
-+    const auto * mctx_cur = inp->mctx;
-+    const auto   kv_head  = mctx_cur->get_head();
-+
-+    const int64_t n_seqs       = ubatch.n_seqs;
-+    const int64_t n_seq_tokens = ubatch.n_seq_tokens;
-+
-+    GGML_ASSERT(n_seq_tokens == 1);        // single-token decode only
-+    GGML_ASSERT(cparams.n_rs_seq == 0);    // no rollback splits on this path
-+
-+    // Prior conv-state taps for the active sequences: [K-1, conv_channels, n_seqs]. Same get_rows
-+    // gather as the baseline build_conv_state read (tiny; not one of the eliminated buckets).
-+    ggml_tensor * conv_states = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs);
-+    conv_states = ggml_reshape_3d(ctx0, conv_states, conv_kernel_size - 1, conv_channels, n_seqs);
-+    cb(conv_states, "conv_states_reshaped", il);
-+
-+    // Current token, native (non-transposed) qkv_mixed: [conv_channels, 1, n_seqs].
-+    ggml_tensor * x_cur = ggml_reshape_3d(ctx0, qkv_mixed, conv_channels, n_seq_tokens, n_seqs);
-+
-+    // In-place ring write-back target = the active sequences' conv-cache slot at kv_head, exactly the
-+    // destination the baseline ggml_cpy wrote to (s_slot == 0).
-+    const int64_t row_count = (conv_kernel_size - 1) * conv_channels;
-+    const size_t  row_size  = ggml_row_size(conv_states_all->type, row_count);
-+    ggml_tensor * conv_state_dst =
-+        ggml_view_2d(ctx0, conv_states_all, row_count, n_seqs, conv_states_all->nb[1], kv_head * row_size);
-+    cb(conv_state_dst, "conv_state_update", il);
-+
-+    ggml_tensor * conv_output =
-+        ggml_ssm_conv_update_inplace(ctx0, conv_states, conv_kernel, x_cur, conv_state_dst, /*fuse_silu=*/true);
-+    cb(conv_output, "conv_output_silu", il);
-+
-+    // the ring write is a side effect of the op; pull the op into the graph via the output
-+    ggml_build_forward_expand(gf, conv_output);
-+
-+    return conv_output; // [conv_channels, 1, n_seqs], already silu'd
-+}
-+
- // Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
- // gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
- // ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
-diff --git a/src/models/models.h b/src/models/models.h
-index 98b89e9..da0dd86 100644
--- a/src/models/models.h
-+++ b/src/models/models.h
-@@ -76,6 +76,20 @@ struct llm_build_delta_net_base : public llm_graph_context {
-             int64_t              conv_channels,
-             int                  il);
- 
-+    // Fused decode-time conv path (patch 0021). Replaces the concat + transpose + ssm_conv + silu +
-+    // copy-back chain with a single ggml_ssm_conv_update_inplace op that reads the cached K-1 taps and
-+    // the current token, computes the depthwise conv, folds silu, and writes the 1-token-shifted ring
-+    // state back in place. Decode-only (n_seq_tokens == 1, n_rs_seq == 0). Returns the silu'd conv
-+    // output: (conv_channels, 1, n_seqs). Bit-identical to the build_conv_state + ggml_ssm_conv chain.
-+    ggml_tensor * build_conv_state_fused(
-+            llm_graph_input_rs * inp,
-+            ggml_tensor *        conv_states_all,
-+            ggml_tensor *        qkv_mixed,
-+            ggml_tensor *        conv_kernel,
-+            int64_t              conv_kernel_size,
-+            int64_t              conv_channels,
-+            int                  il);
-+
-     // run delta-net attention and write the new recurrent state(s) back to ssm_states_all
-     // s: (head_v_dim, head_v_dim, num_v_heads, n_seqs); returns output: (head_v_dim, num_v_heads, n_seq_tokens, n_seqs)
-     ggml_tensor * build_recurrent_attn(
-diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
-index 0874c43..b6dcc5f 100644
--- a/src/models/qwen35.cpp
-+++ b/src/models/qwen35.cpp
-@@ -383,15 +383,26 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
-     const int64_t conv_kernel_size = conv_kernel->ne[0];
-     const int64_t conv_channels    = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- 
-    ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
-+    // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
-+    // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
-+    // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
-+    const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
-+
-+    ggml_tensor * conv_qkv_mix;
-+    if (conv_decode_fused) {
-+        conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
-+                conv_kernel_size, conv_channels, il);
-+    } else {
-+        ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- 
-    ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-    cb(conv_output_proper, "conv_output_raw", il);
-+        ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-+        cb(conv_output_proper, "conv_output_raw", il);
- 
-    ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-    cb(conv_output_silu, "conv_output_silu", il);
-+        ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-+        cb(conv_output_silu, "conv_output_silu", il);
- 
-    ggml_tensor * conv_qkv_mix = conv_output_silu;
-+        conv_qkv_mix = conv_output_silu;
-+    }
- 
-     // Calculate the total conv dimension
-     int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
-diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
-index 1f6f643..c7c7c44 100644
--- a/src/models/qwen35moe.cpp
-+++ b/src/models/qwen35moe.cpp
-@@ -407,15 +407,26 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
-     const int64_t conv_kernel_size = conv_kernel->ne[0];
-     const int64_t conv_channels    = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- 
-    ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
-+    // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
-+    // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
-+    // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
-+    const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
-+
-+    ggml_tensor * conv_qkv_mix;
-+    if (conv_decode_fused) {
-+        conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
-+                conv_kernel_size, conv_channels, il);
-+    } else {
-+        ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- 
-    ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-    cb(conv_output_proper, "conv_output_raw", il);
-+        ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-+        cb(conv_output_proper, "conv_output_raw", il);
- 
-    ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-    cb(conv_output_silu, "conv_output_silu", il);
-+        ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-+        cb(conv_output_silu, "conv_output_silu", il);
- 
-    ggml_tensor * conv_qkv_mix = conv_output_silu;
-+        conv_qkv_mix = conv_output_silu;
-+    }
- 
-     // Calculate the total conv dimension
-     int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
-diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
-index bfdf026..92749d1 100644
--- a/src/models/qwen3next.cpp
-+++ b/src/models/qwen3next.cpp
-@@ -434,19 +434,30 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
-     const int64_t conv_kernel_size = conv_kernel->ne[0];
-     const int64_t conv_channels    = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- 
-    ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
-+    // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
-+    // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
-+    // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
-+    const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
-+
-+    ggml_tensor * conv_qkv_mix;
-+    if (conv_decode_fused) {
-+        conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
-+                conv_kernel_size, conv_channels, il);
-+    } else {
-+        ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- 
-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
-    cb(state, "state_predelta", il);
-+        ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-+        cb(conv_output_proper, "conv_output_raw", il);
- 
-    ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-    cb(conv_output_proper, "conv_output_raw", il);
-+        ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-+        cb(conv_output_silu, "conv_output_silu", il);
- 
-    ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-    cb(conv_output_silu, "conv_output_silu", il);
-+        conv_qkv_mix = conv_output_silu;
-+    }
- 
-    ggml_tensor * conv_qkv_mix = conv_output_silu;
-+    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
-+    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
-+    cb(state, "state_predelta", il);
- 
-     // Calculate the total conv dimension
-     int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
-diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index 291c275..c7348d6 100644
--- a/tests/test-backend-ops.cpp
-+++ b/tests/test-backend-ops.cpp
-@@ -3748,6 +3748,43 @@ struct test_ssm_conv_bias_silu : public test_case {
-     }
- };
- 
-+// GGML_OP_SSM_CONV fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021).
-+// Validates the conv + silu output (dst) against the CPU reference across backends. The 1-token-
-+// shifted ring write-back to conv_state_dst is a side effect (validated end-to-end by the greedy
-+// md5 gate); here it just exercises the in-place write target as an op src.
-+struct test_ssm_conv_update : public test_case {
-+    const int64_t d_conv;
-+    const int64_t channels;
-+    const int64_t n_seqs;
-+
-+    std::string op_desc(ggml_tensor * t) override {
-+        GGML_UNUSED(t);
-+        return "SSM_CONV_UPDATE";
-+    }
-+
-+    std::string vars() override {
-+        return VARS_TO_STR3(d_conv, channels, n_seqs);
-+    }
-+
-+    test_ssm_conv_update(int64_t d_conv = 4, int64_t channels = 256, int64_t n_seqs = 4)
-+        : d_conv(d_conv), channels(channels), n_seqs(n_seqs) {}
-+
-+    ggml_tensor * build_graph(ggml_context * ctx) override {
-+        ggml_tensor * conv_states    = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_conv - 1, channels, n_seqs);
-+        ggml_tensor * conv_kernel    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, channels);
-+        ggml_tensor * x_cur          = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
-+        ggml_tensor * conv_state_dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (d_conv - 1) * channels, n_seqs);
-+        ggml_set_name(conv_states, "conv_states");
-+        ggml_set_name(conv_kernel, "conv_kernel");
-+        ggml_set_name(x_cur, "x_cur");
-+        ggml_set_name(conv_state_dst, "conv_state_dst");
-+
-+        ggml_tensor * out = ggml_ssm_conv_update_inplace(ctx, conv_states, conv_kernel, x_cur, conv_state_dst, true);
-+        ggml_set_name(out, "out");
-+        return out;
-+    }
-+};
-+
- // GGML_OP_SSM_SCAN
- struct test_ssm_scan : public test_case {
-     const ggml_type type;
-@@ -8355,6 +8392,16 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
-         }
-     }
- 
-+    // fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021). channels must be
-+    // a multiple of 128 for the CUDA SSM_CONV supports_op gate.
-+    for (int64_t d_conv : {3, 4}) {
-+        for (int64_t channels : {256, 3328}) {
-+            for (int64_t n_seqs : {1, 4, 32, 128}) {
-+                test_cases.emplace_back(new test_ssm_conv_update(d_conv, channels, n_seqs));
-+            }
-+        }
-+    }
-+
-     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1, 1024, 1, 32, 4)); // Mamba-1
-     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 128, 64, 16, 2, 32, 4)); // Mamba-2
-     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 256, 64,  8, 2, 32, 4)); // Falcon-H1
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch
@@ -1,403 +0,0 @@
-From 8a3229f41d5b712e87901796dfae3faee1f2f07d Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 20:32:55 +0200
-Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet decode
- occupancy/coalescing retune (patch 0022)
-
-Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
-kernel. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in
-float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth
-efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak),
-the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA
-order byte-identical (md5-gateable). The bf16-state plan stays shelved.
-
-Column folding: two new template params NUM_WARPS (default 4) and COLS_PER_WARP
-(default 1). Each warp now owns COLS_PER_WARP columns of the 128x128 recurrent
-state instead of 1, looping the existing per-column body over col, col+NUM_WARPS,
-... within a per-block column tile of NUM_WARPS*COLS_PER_WARP columns;
-grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The S_v rows of every column stay sharded
-across the lanes by the same strided i = r*warp_size + lane mapping, and every
-column's per-lane FMA accumulation and warp_reduce_sum butterfly are byte-for-byte
-unchanged; only the (warp,block)->column assignment and visit order differ, which a
-column's value provably does not depend on (columns are fully independent). This
-raises per-warp memory-level parallelism ~COLS_PER_WARP-fold (independent
-state-load bursts before any reduction + interleaved butterfly reductions hiding
-each other's shfl latency), covering more DRAM latency on this bandwidth-bound
-kernel. Every global access stays identically coalesced, so it is a scheduling /
-latency-coverage win, not a coalescing change. The forbidden float4 state load
-(which would repartition a lane to 4 contiguous rows and change the reduction
-grouping) is NOT done, so the md5 stays invariant. The S_v=128 tile is
-env-selectable (GDN_NW / GDN_CPW) for one-build re-tuning; default is the measured
-GB10 winner (16, 8).
-
-GB10 (CUDA 13, sm_121, nsys CUPTI timing - HW counters perm-blocked):
-gated_delta_net B=128 decode call (805.3 MB f32 R+W) 4.02 -> 3.49 ms/call,
-200.3 -> 230.9 GB/s = 73.4% -> 84.6% of 273 GB/s peak (now above vLLM's 82.4%;
-102.6% of vLLM's recurrence bandwidth). decode S_TG t/s (npp128 ntg128, -fa on):
-dense 27b npl128 335.9 -> 373.2 (+11.1%), npl32 199.2 -> 207.6 (+4.2%); MoE
-35b-a3b npl128 688.4 -> 745.7 (+8.3%), npl32 420.6 -> 440.0 (+4.6%). Prefill
-unchanged.
-
-Bit-exact: greedy --temp 0 --seed 1 md5 byte-identical to the 0021 baseline on
-both q36-27b-nvfp4 and q36-35b-a3b-nvfp4 (winner 16x8 and 4x1 control);
-test-backend-ops -o GATED_DELTA_NET 36/36 PASS.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/src/ggml-cuda/gated_delta_net.cu | 236 +++++++++++++++++---------
- 1 file changed, 157 insertions(+), 79 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
-index 86d5e2a..d071d5a 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
-+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
-@@ -1,6 +1,8 @@
- #include "gated_delta_net.cuh"
- #include "ggml-cuda/common.cuh"
- 
-+#include <cstdlib>
-+
- // Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
- // disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
- // destination slot by the recurrence kernel and are skipped here. One block per sequence.
-@@ -29,8 +31,22 @@ static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * i
-     gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
- }
- 
-template <int S_v, bool KDA, bool keep_rs_t>
-__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
-+// Occupancy/coalescing retune (patch 0022). Each warp owns COLS_PER_WARP columns of the recurrent
-+// state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a
-+// per-block column tile of size NUM_WARPS*COLS_PER_WARP. The S_v rows of every column stay sharded
-+// across the lanes by the SAME strided mapping i = r*warp_size + lane, and every column's per-lane
-+// FMA accumulation and warp_reduce_sum<warp_size> butterfly are byte-for-byte unchanged. Only the
-+// (warp,block)->column assignment and the order a warp visits its columns differ, and a column's
-+// f32 value provably does not depend on either (columns are fully independent: column c reads only
-+// its own S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). So the result
-+// and the stored final state are bit-identical to the COLS_PER_WARP==1 baseline (md5-gateable),
-+// while per-warp memory-level parallelism rises ~COLS_PER_WARP-fold (COLS_PER_WARP independent
-+// state-load bursts issued before any reduction, and the independent butterfly reductions interleave
-+// to hide each other's shfl latency) which covers more DRAM latency on this bandwidth-bound kernel.
-+// Every individual global access stays IDENTICALLY coalesced (32 consecutive lanes -> one 128B
-+// sector), so this is a latency-coverage / scheduling win, not a coalescing change.
-+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS = 4, int COLS_PER_WARP = 1, int MIN_BLOCKS = 2>
-+__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * NUM_WARPS, MIN_BLOCKS)
- gated_delta_net_cuda(const float * q,
-                                      const float * k,
-                                      const float * v,
-@@ -59,9 +75,9 @@ gated_delta_net_cuda(const float * q,
-                                      int           rs_head) {
-     const uint32_t h_idx    = blockIdx.x;
-     const uint32_t sequence = blockIdx.y;
-    // each warp owns one column, using warp-level primitives to reduce across rows
-+    // each warp owns COLS_PER_WARP columns, using warp-level primitives to reduce across rows.
-     const int      lane     = threadIdx.x;
-    const int      col      = blockIdx.z * blockDim.y + threadIdx.y;
-+    const int      col_base = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y;
- 
-     const uint32_t iq1 = fastmodulo(h_idx, neqk1_magic);
-     const uint32_t iq3 = fastdiv(sequence, rq3_magic);
-@@ -86,20 +102,25 @@ gated_delta_net_cuda(const float * q,
-     // writing the same slot per block (identity) is race-free.
-     const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
-         ? state_dst : curr_state;
-    read_state += state_in_offset + col * S_v;
-+    read_state += state_in_offset;
-     attn_data += (sequence * n_tokens * H + h_idx) * S_v;
- 
-     constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
-     static_assert(S_v % warp_size == 0, "S_v must be a multiple of warp_size");
-     constexpr int rows_per_lane = (S_v + warp_size - 1) / warp_size;
-    float         s_shard[rows_per_lane];
-    // state is stored transposed: M[col][i] = S[i][col], row col is contiguous
-+    // per-column register shard of the recurrent state; state is stored transposed: M[col][i] = S[i][col].
-+    float         s_shard[COLS_PER_WARP][rows_per_lane];
- 
-     ggml_cuda_pdl_sync();
- #pragma unroll
-    for (int r = 0; r < rows_per_lane; r++) {
-        const int i = r * warp_size + lane;
-        s_shard[r]  = read_state[i];
-+    for (int cc = 0; cc < COLS_PER_WARP; cc++) {
-+        const int     col = col_base + cc * NUM_WARPS;
-+        const float * rs  = read_state + col * S_v;
-+#pragma unroll
-+        for (int r = 0; r < rows_per_lane; r++) {
-+            const int i   = r * warp_size + lane;
-+            s_shard[cc][r] = rs[i];
-+        }
-     }
- 
-     for (int t = 0; t < n_tokens; t++) {
-@@ -113,7 +134,7 @@ gated_delta_net_cuda(const float * q,
- 
-         const float beta_val = *beta_t;
- 
-        // Cache k and q in registers
-+        // Cache k and q in registers (shared across the COLS_PER_WARP columns of this warp).
-         float k_reg[rows_per_lane];
-         float q_reg[rows_per_lane];
- #pragma unroll
-@@ -126,59 +147,69 @@ gated_delta_net_cuda(const float * q,
-         if constexpr (!KDA) {
-             const float g_val = expf(*g_t);
- 
-            // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
-            float kv_shard = 0.0f;
- #pragma unroll
-            for (int r = 0; r < rows_per_lane; r++) {
-                kv_shard += s_shard[r] * k_reg[r];
-            }
-            float kv_col = warp_reduce_sum<warp_size>(kv_shard);
-+            for (int cc = 0; cc < COLS_PER_WARP; cc++) {
-+                const int col = col_base + cc * NUM_WARPS;
- 
-            // delta[col] = (v[col] - g * kv[col]) * beta
-            float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
-+                // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
-+                float kv_shard = 0.0f;
-+#pragma unroll
-+                for (int r = 0; r < rows_per_lane; r++) {
-+                    kv_shard += s_shard[cc][r] * k_reg[r];
-+                }
-+                float kv_col = warp_reduce_sum<warp_size>(kv_shard);
- 
-            // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
-            // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
-            float attn_partial = 0.0f;
-+                // delta[col] = (v[col] - g * kv[col]) * beta
-+                float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
-+
-+                // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
-+                // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
-+                float attn_partial = 0.0f;
- #pragma unroll
-            for (int r = 0; r < rows_per_lane; r++) {
-                s_shard[r]  = g_val * s_shard[r] + k_reg[r] * delta_col;
-                attn_partial += s_shard[r] * q_reg[r];
-            }
-+                for (int r = 0; r < rows_per_lane; r++) {
-+                    s_shard[cc][r]  = g_val * s_shard[cc][r] + k_reg[r] * delta_col;
-+                    attn_partial += s_shard[cc][r] * q_reg[r];
-+                }
- 
-            float attn_col = warp_reduce_sum<warp_size>(attn_partial);
-+                float attn_col = warp_reduce_sum<warp_size>(attn_partial);
- 
-            if (lane == 0) {
-                attn_data[col] = attn_col * scale;
-+                if (lane == 0) {
-+                    attn_data[col] = attn_col * scale;
-+                }
-             }
-         } else {
-            // kv[col] = sum_i g[i] * S[i][col] * k[i]
-            float kv_shard = 0.0f;
- #pragma unroll
-            for (int r = 0; r < rows_per_lane; r++) {
-                const int i = r * warp_size + lane;
-                kv_shard += expf(g_t[i]) * s_shard[r] * k_reg[r];
-            }
-+            for (int cc = 0; cc < COLS_PER_WARP; cc++) {
-+                const int col = col_base + cc * NUM_WARPS;
-+
-+                // kv[col] = sum_i g[i] * S[i][col] * k[i]
-+                float kv_shard = 0.0f;
-+#pragma unroll
-+                for (int r = 0; r < rows_per_lane; r++) {
-+                    const int i = r * warp_size + lane;
-+                    kv_shard += expf(g_t[i]) * s_shard[cc][r] * k_reg[r];
-+                }
- 
-            float kv_col = warp_reduce_sum<warp_size>(kv_shard);
-+                float kv_col = warp_reduce_sum<warp_size>(kv_shard);
- 
-            // delta[col] = (v[col] - kv[col]) * beta
-            float delta_col = (v_t[col] - kv_col) * beta_val;
-+                // delta[col] = (v[col] - kv[col]) * beta
-+                float delta_col = (v_t[col] - kv_col) * beta_val;
- 
-            // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
-            // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
-            float attn_partial = 0.0f;
-+                // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
-+                // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
-+                float attn_partial = 0.0f;
- #pragma unroll
-            for (int r = 0; r < rows_per_lane; r++) {
-                const int i = r * warp_size + lane;
-                s_shard[r]  = expf(g_t[i]) * s_shard[r] + k_reg[r] * delta_col;
-                attn_partial += s_shard[r] * q_reg[r];
-            }
-+                for (int r = 0; r < rows_per_lane; r++) {
-+                    const int i = r * warp_size + lane;
-+                    s_shard[cc][r]  = expf(g_t[i]) * s_shard[cc][r] + k_reg[r] * delta_col;
-+                    attn_partial += s_shard[cc][r] * q_reg[r];
-+                }
- 
-            float attn_col = warp_reduce_sum<warp_size>(attn_partial);
-+                float attn_col = warp_reduce_sum<warp_size>(attn_partial);
- 
-            if (lane == 0) {
-                attn_data[col] = attn_col * scale;
-+                if (lane == 0) {
-+                    attn_data[col] = attn_col * scale;
-+                }
-             }
-         }
- 
-@@ -190,11 +221,15 @@ gated_delta_net_cuda(const float * q,
-             const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
-             const int target_slot = (int) n_tokens - 1 - t;
-             if (target_slot >= 0 && target_slot < K) {
-                float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
- #pragma unroll
-                for (int r = 0; r < rows_per_lane; r++) {
-                    const int i = r * warp_size + lane;
-                    curr_state[col * S_v + i] = s_shard[r];
-+                for (int cc = 0; cc < COLS_PER_WARP; cc++) {
-+                    const int col = col_base + cc * NUM_WARPS;
-+                    float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
-+#pragma unroll
-+                    for (int r = 0; r < rows_per_lane; r++) {
-+                        const int i = r * warp_size + lane;
-+                        curr_state[col * S_v + i] = s_shard[cc][r];
-+                    }
-                 }
-             }
-         }
-@@ -202,13 +237,48 @@ gated_delta_net_cuda(const float * q,
- 
-     if constexpr (!keep_rs_t) {
- #pragma unroll
-        for (int r = 0; r < rows_per_lane; r++) {
-            const int i          = r * warp_size + lane;
-            state[col * S_v + i] = s_shard[r];
-+        for (int cc = 0; cc < COLS_PER_WARP; cc++) {
-+            const int col = col_base + cc * NUM_WARPS;
-+#pragma unroll
-+            for (int r = 0; r < rows_per_lane; r++) {
-+                const int i          = r * warp_size + lane;
-+                state[col * S_v + i] = s_shard[cc][r];
-+            }
-         }
-     }
- }
- 
-+// Default column-folding tile for the S_v==128 decode/prefill path (the GDN head dim of this model).
-+// Measured winner of the bit-exact occupancy sweep (patch 0022). Override at runtime for the sweep
-+// via GDN_NW / GDN_CPW; all selectable variants are bit-identical, only %peak differs.
-+#ifndef GDN_DEFAULT_NW
-+#define GDN_DEFAULT_NW 16
-+#endif
-+#ifndef GDN_DEFAULT_CPW
-+#define GDN_DEFAULT_CPW 8
-+#endif
-+
-+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS, int COLS_PER_WARP, int MIN_BLOCKS>
-+static void launch_gdn_variant(
-+        const float * q_d, const float * k_d, const float * v_d,
-+        const float * g_d, const float * b_d, const float * s_d,
-+        float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head,
-+        int64_t H, int64_t n_tokens, int64_t n_seqs,
-+        int64_t sq1, int64_t sq2, int64_t sq3,
-+        int64_t sv1, int64_t sv2, int64_t sv3,
-+        int64_t sb1, int64_t sb2, int64_t sb3,
-+        const uint3 neqk1_magic, const uint3 rq3_magic,
-+        float scale, int K, int warp_size, cudaStream_t stream) {
-+    static_assert(S_v % (NUM_WARPS * COLS_PER_WARP) == 0, "NUM_WARPS*COLS_PER_WARP must divide S_v");
-+    dim3 grid_dims(H, n_seqs, S_v / (NUM_WARPS * COLS_PER_WARP));
-+    dim3 block_dims(warp_size <= S_v ? warp_size : S_v, NUM_WARPS, 1);
-+    const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
-+    ggml_cuda_kernel_launch(gated_delta_net_cuda<S_v, KDA, keep_rs_t, NUM_WARPS, COLS_PER_WARP, MIN_BLOCKS>, launch_params,
-+        q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-+        n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-+        sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-+}
-+
- template <bool KDA, bool keep_rs_t>
- static void launch_gated_delta_net(
-         const float * q_d, const float * k_d, const float * v_d,
-@@ -223,47 +293,55 @@ static void launch_gated_delta_net(
-         float scale, int K, cudaStream_t stream) {
-     //TODO: Add chunked kernel for even faster pre-fill
-     const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
-    const int num_warps = 4;
-    dim3      grid_dims(H, n_seqs, (S_v + num_warps - 1) / num_warps);
-    dim3      block_dims(warp_size <= S_v ? warp_size : S_v, num_warps, 1);
- 
-     const uint3 neqk1_magic = init_fastdiv_values(neqk1);
-     const uint3 rq3_magic   = init_fastdiv_values(rq3);
- 
-    int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
-+#define GDN_LAUNCH_ARGS \
-+        q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
-+        H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
-+        neqk1_magic, rq3_magic, scale, K, warp_size, stream
- 
-    const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
-     switch (S_v) {
-         case 16:
-            ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-+            launch_gdn_variant<16, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
-             break;
-         case 32:
-            ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-+            launch_gdn_variant<32, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
-             break;
-        case 64: {
-            ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-+        case 64:
-+            launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
-             break;
-        }
-         case 128: {
-            ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-+            // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP columns per warp
-+            // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is
-+            // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every
-+            // selectable {num_warps, cols} is bit-identical, so the sweep cannot change the md5).
-+            static const int gdn_nw  = []{ const char * e = getenv("GDN_NW");  return e ? atoi(e) : GDN_DEFAULT_NW;  }();
-+            static const int gdn_cpw = []{ const char * e = getenv("GDN_CPW"); return e ? atoi(e) : GDN_DEFAULT_CPW; }();
-+            // NUM_WARPS in {4,8,16} x COLS_PER_WARP ladder (all <=512 threads/block, no 1024-thread
-+            // .minnctapersm warnings). Measured GB10 %peak: (4,1)=73 baseline ... (16,4)=82 ...
-+            // (16,8)=84.7 winner ~ tied with (8,8)/(8,16)/(32,4); the plateau is just above vLLM (82.4).
-+            if      (gdn_nw == 4  && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 4,  1, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 4  && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 4,  2, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 4  && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 4,  4, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 8  && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 8,  1, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 8  && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 8,  2, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 8  && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 8,  4, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 8  && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 8,  8, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 16 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 16, 1, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 16 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 16, 2, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 16 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 16, 4, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 16 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 16, 8, 2>(GDN_LAUNCH_ARGS);
-+            else                                   launch_gdn_variant<128, KDA, keep_rs_t, GDN_DEFAULT_NW, GDN_DEFAULT_CPW, 2>(GDN_LAUNCH_ARGS);
-             break;
-         }
-         default:
-             GGML_ABORT("fatal error");
-             break;
-     }
-+
-+#undef GDN_LAUNCH_ARGS
- }
- 
- void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch
@@ -1,144 +0,0 @@
-From f7409c2de2868a6a048d3c333329468b4cc9e483 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 23:47:25 +0200
-Subject: [PATCH] feat(paged): qwen35moe NVFP4 activation-quantize de-dup
- (patch 0023)
-
-Bit-exact decode/prefill lever for the MoE (qwen3.5moe) NVFP4 path. ggml's
-mul_mat_id quantizes the EXPERT-GATHERED activation rows (ne11_flat =
-ne12*n_expert_used). For the broadcast up/gate projections (ne11 == 1) every
-expert of a token receives the SAME token activation, so the stock path
-re-quantizes each token n_expert_used times. quantize_mmq_nvfp4 produces each
-block as a pure per-thread function of its 16 consecutive inputs (no cross-thread
-reduction), so the gathered blocks are byte-identical across the experts.
-
-Lever: when ne11 == 1, quantize the ne12 UNIQUE token activations once, then
-gather the resulting block_fp4_mmq rows into the expert-gathered layout keyed by
-ids_src1 with a coalesced uint4 copy (block_fp4_mmq == 9 uint4 == 144 B). Pure
-byte copy of identical blocks, so the gathered buffer is byte-for-byte identical
-to re-quantizing each gathered row; the GEMM is untouched. down_proj
-(ne11 == n_expert_used, distinct per expert) keeps the stock path.
-
-Measured GB10 (sm_121a), on top of HEAD 8a3229f (patch 0022), q36-35b-a3b-nvfp4:
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), new
-  gather_mmq_fp4 +32 ms; net -379 ms of decode GPU-time.
- S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%; prefill T_PP -4%.
- Dense q36-27b-nvfp4 byte-flat (no mul_mat_id): 373.24 t/s unchanged.
-
-Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022):
-  q36-27b-nvfp4     5951a5b4d624ce891e22ab5fca9bc439 (unchanged)
-  q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (de-dup on == off)
-  test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805.
-
-On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores the stock re-quantize path.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/src/ggml-cuda/mmq.cu       | 21 +++++++++++++++++--
- ggml/src/ggml-cuda/quantize.cu  | 37 +++++++++++++++++++++++++++++++++
- ggml/src/ggml-cuda/quantize.cuh |  4 ++++
- 3 files changed, 60 insertions(+), 2 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
-index e1add5e..9933fa6 100644
--- a/ggml/src/ggml-cuda/mmq.cu
-+++ b/ggml/src/ggml-cuda/mmq.cu
-@@ -1,3 +1,4 @@
-+#include <cstdlib>
- #include "common.cuh"
- #include "mmq.cuh"
- #include "quantize.cuh"
-@@ -197,8 +198,24 @@ void ggml_cuda_mul_mat_q(
-         const int64_t s13 = src1->nb[3] / ts_src1;
- 
-         if (use_native_fp4) {
-            quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
-                                    ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
-+            // 0023: de-dup the broadcast (up/gate) quantize. ne11==1 means src1 is shared
-+            // across experts, so quantize the ne12 unique tokens once and gather the blocks.
-+            static const bool moe_quant_dedup = []{
-+                const char * e = getenv("GGML_CUDA_MOE_QUANT_DEDUP");
-+                return e ? atoi(e) != 0 : true;  // 0023: on by default; GGML_CUDA_MOE_QUANT_DEDUP=0 disables
-+            }();
-+            if (moe_quant_dedup && ne11 == 1) {
-+                const size_t nbytes_unique = ne12*ne10_padded * sizeof(block_q8_1)/QK8_1 +
-+                    get_mmq_x_max_host(cc)*sizeof(block_q8_1_mmq);
-+                ggml_cuda_pool_alloc<char> src1_unique(ctx.pool(), nbytes_unique);
-+                quantize_mmq_fp4_cuda(src1_d, nullptr, src1_unique.get(), src0->type, ne10, s12, 0, 0,
-+                                        ne10_padded, ne12, 1, 1, stream);
-+                gather_mmq_fp4_cuda(src1_unique.get(), ids_src1.get(), src1_q8_1.get(),
-+                                    ne11_flat, ne12, ne10_padded, stream);
-+            } else {
-+                quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
-+                                        ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
-+            }
-         } else {
-             quantize_mmq_q8_1_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
-                                    ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
-diff --git a/ggml/src/ggml-cuda/quantize.cu b/ggml/src/ggml-cuda/quantize.cu
-index 39a500a..a7fd86f 100644
--- a/ggml/src/ggml-cuda/quantize.cu
-+++ b/ggml/src/ggml-cuda/quantize.cu
-@@ -419,6 +419,43 @@ void quantize_mmq_q8_1_cuda(
-     }
- }
- 
-+// MoE NVFP4 quantize de-dup (0023): for the broadcast (up/gate) expert matmuls every
-+// gathered row references one of ne12 unique token activations, so the stock path
-+// re-quantizes each token n_expert_used times. Quantize the unique tokens once, then copy
-+// the resulting block_fp4_mmq rows into the expert-gathered layout keyed by ids. This is a
-+// pure byte copy of identical blocks => the gathered buffer is byte-identical to stock.
-+static __global__ void gather_mmq_fp4(
-+        const uint4 * __restrict__ unique, const int32_t * __restrict__ ids,
-+        uint4 * __restrict__ gathered, const int ne11_flat, const int ne12_unique,
-+        const int64_t total_words) {
-+    constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4)); // 9 uint4 per 144B block
-+    const int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
-+    if (t >= total_words) {
-+        return;
-+    }
-+    const int     w   = (int) (t % W);
-+    const int64_t ib  = t / W;                 // destination block index = kb*ne11_flat + j
-+    const int     j   = (int) (ib % ne11_flat);
-+    const int     kb  = (int) (ib / ne11_flat);
-+    const int     src = ids[j];
-+    const int64_t ib_u = (int64_t) kb * ne12_unique + src;
-+    gathered[t] = unique[ib_u * W + w];
-+}
-+
-+void gather_mmq_fp4_cuda(
-+        const void * unique, const int32_t * ids, void * gathered,
-+        int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded, cudaStream_t stream) {
-+    const int     blocks_per_col = (int) ((ne0_padded + QK_K - 1) / QK_K);
-+    constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4));
-+    const int64_t total_words = ne11_flat * (int64_t) blocks_per_col * W;
-+    const int     bs = 256;
-+    const dim3    block_size(bs, 1, 1);
-+    const dim3    num_blocks((unsigned) ((total_words + bs - 1) / bs), 1, 1);
-+    gather_mmq_fp4<<<num_blocks, block_size, 0, stream>>>(
-+        (const uint4 *) unique, ids, (uint4 *) gathered,
-+        (int) ne11_flat, (int) ne12_unique, total_words);
-+}
-+
- void quantize_mmq_fp4_cuda(
-         const float * x, const int32_t * ids, void * vy, const ggml_type type_src0,
-         const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
-diff --git a/ggml/src/ggml-cuda/quantize.cuh b/ggml/src/ggml-cuda/quantize.cuh
-index 768a3ae..7f64069 100644
---- a/ggml/src/ggml-cuda/quantize.cuh
-+++ b/ggml/src/ggml-cuda/quantize.cuh
-@@ -26,6 +26,10 @@ void quantize_mmq_q8_1_cuda(
-         ggml_type type_src0, int64_t ne00, int64_t s01, int64_t s02, int64_t s03,
-         int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3, cudaStream_t stream);
- 
-+void gather_mmq_fp4_cuda(const void * unique, const int32_t * ids, void * gathered,
-+                         int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded,
-+                         cudaStream_t stream);
-+
- void quantize_mmq_fp4_cuda(const float *   x,
-                              const int32_t * ids,
-                              void *          vy,
--- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0024-paged-pool-burst-reclaim.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0024-paged-pool-burst-reclaim.patch
@@ -1,357 +0,0 @@
-From a8a9d129ae2226a08a12c30ece697865c0fc85c4 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Fri, 26 Jun 2026 12:41:49 +0200
-Subject: [PATCH] feat(paged): paged-pool burst-reclaim (truncate + defrag +
- slot release) (patch 0024)
-
-Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C
-Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill
-burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill
-draws from a depleted/fragmented pool and its throughput collapses (the benchmark's
-"restart per npl" crutch). Decode is unaffected. The fix changes only host-side
-block accounting and placement, never KV values or compute, and is gated behind
-LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior).
-
-Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every
-block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for
-the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly.
-Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the
-free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving
-content-cache hashes.
-Fix-3 release on slot completion: server_slot::release() issues prompt_clear()
-under the paged engine so a finished-idle slot returns its blocks promptly.
-
-Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023):
- Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM
-  (5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops
-  unaffected (no ggml op touched).
- Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores
-  ascending popleft order. UNIT PASS.
- Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998;
-  64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527).
- Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses
-  488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to
-  532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96
-  (Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both
-  arms (bit-exact serving).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- src/llama-kv-cache.cpp          | 13 ++++++++++
- src/paged-alloc.cpp             | 31 +++++++++++++++++++++++
- src/paged-alloc.h               | 18 +++++++++++++
- src/paged-kv-manager.cpp        | 45 +++++++++++++++++++++++++++++++++
- src/paged-kv-manager.h          | 24 ++++++++++++++++++
- src/paged-prefix-api.cpp        |  8 ++++++
- src/paged-prefix-api.h          |  6 +++++
- tools/server/server-context.cpp | 17 +++++++++++++
- 8 files changed, 162 insertions(+)
-
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 0351f86..21b8f1e 100644
---- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -425,6 +425,19 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
-         }
-     }
- 
-+    // [paged 0024 Fix-1] Reclaim trailing blocks on a partial TAIL truncation
-+    // (p1 == MAX, p0 > 0). llama-server issues seq_rm(slot, n_past, -1) on every
-+    // reused slot and before a cross-request prefix splice; the kv-cache frees the
-+    // cells [p0, end) but, without this, the paged manager keeps owning those
-+    // blocks - the reclamation gap that leaks and fragments the pool across a
-+    // burst. truncate() frees the blocks beyond ceil(p0/bs) so the manager's
-+    // accounting tracks the kv-cache exactly. Gated so LLAMA_PAGED_NO_RECLAIM
-+    // restores the pre-fix behavior for A/B.
-+    if (paged_alloc::active() && paged_alloc::reclaim_active() && seq_id >= 0 &&
-+        p0 > 0 && p1 == std::numeric_limits<llama_pos>::max()) {
-+        paged_alloc::truncate(this, (int) seq_to_stream[seq_id], (int) seq_id, (uint32_t) p0);
-+    }
-+
-     if (seq_id >= 0) {
-         auto & cells = v_cells[seq_to_stream[seq_id]];
-         auto & head  = v_heads[seq_to_stream[seq_id]];
-diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
-index c1027fb..ba98dd5 100644
---- a/src/paged-alloc.cpp
-+++ b/src/paged-alloc.cpp
-@@ -14,6 +14,11 @@ bool active() {
-     return a;
- }
- 
-+bool reclaim_active() {
-+    static const bool off = (std::getenv("LLAMA_PAGED_NO_RECLAIM") != nullptr);
-+    return !off;
-+}
-+
- static bool debug() {
-     static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
-     return d;
-@@ -124,12 +129,28 @@ void commit(const void * cache, int stream, int seq,
-     }
- }
- 
-+void truncate(const void * cache, int stream, int seq, uint32_t n_keep) {
-+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
-+    if (!mgr) {
-+        return;
-+    }
-+    mgr->truncate(seq, (size_t) n_keep);     // Fix-1: reclaim trailing blocks
-+    mgr->defrag_free_pool();                 // Fix-2: compact iff the pool emptied
-+    if (debug()) {
-+        fprintf(stderr, "[paged-alloc] truncate cache=%p stream=%d seq=%d keep<=%u (free=%zu)\n",
-+                cache, stream, seq, n_keep, mgr->num_free_blocks());
-+    }
-+}
-+
- void release(const void * cache, int stream, int seq) {
-     paged::PagedKVManager * mgr = find_mgr(cache, stream);
-     if (!mgr) {
-         return;
-     }
-     mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
-+    if (reclaim_active()) {
-+        mgr->defrag_free_pool();             // Fix-2: compact iff the pool emptied
-+    }
-     if (debug()) {
-         fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
-                 cache, stream, seq, mgr->num_free_blocks());
-@@ -163,4 +184,14 @@ size_t num_free(const void * cache, int stream) {
-     return mgr ? mgr->num_free_blocks() : 0;
- }
- 
-+size_t num_free_global() {
-+    size_t total = 0;
-+    for (auto & kv : g_managers) total += kv.second->num_free_blocks();
-+    return total;
-+}
-+
-+size_t num_managers() {
-+    return g_managers.size();
-+}
-+
- } // namespace paged_alloc
-diff --git a/src/paged-alloc.h b/src/paged-alloc.h
-index 88dedef..bfaf45b 100644
---- a/src/paged-alloc.h
-+++ b/src/paged-alloc.h
-@@ -31,6 +31,12 @@ namespace paged_alloc {
- // true iff env LLAMA_KV_PAGED is set (evaluated once).
- bool active();
- 
-+// [paged 0024] The burst-reclaim fix (truncate + defrag-on-empty + slot release)
-+// is on by default whenever the paged engine is active. LLAMA_PAGED_NO_RECLAIM=1
-+// restores the pre-fix behavior (no trailing-block reclaim, no compaction) for
-+// A/B measurement. Evaluated once.
-+bool reclaim_active();
-+
- // Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
- // on demand, appending their physical cell indices to `out`. pool_blocks =
- // cells.size()/block_size is the stream's block budget. Returns false (leaving
-@@ -58,6 +64,12 @@ int64_t slot(const void * cache, int stream, int seq, int pos);
- void commit(const void * cache, int stream, int seq,
-             const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
- 
-+// [paged 0024 Fix-1] Reclaim the trailing blocks of (cache,stream,seq) beyond
-+// logical position n_keep (ref-counted), mirroring a partial kv-cache seq_rm
-+// [n_keep, end). When the stream's pool empties as a result, its free queue is
-+// defragged to pristine contiguous order (Fix-2). No-op if no manager exists.
-+void truncate(const void * cache, int stream, int seq, uint32_t n_keep);
-+
- // Return one sequence's blocks to the pool (ref-counted; sequence end).
- void release(const void * cache, int stream, int seq);
- 
-@@ -69,4 +81,10 @@ void release_all(const void * cache);
- int    ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
- size_t num_free(const void * cache, int stream);
- 
-+// [paged 0024] Total free blocks summed across every live manager (all caches /
-+// streams). Wrapper-agnostic, so it reports the real pool for hybrid / iSWA
-+// models whose outer memory is not a llama_kv_cache. Diagnostics only.
-+size_t num_free_global();
-+size_t num_managers();
-+
- } // namespace paged_alloc
-diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
-index 4c6ee4c..738b332 100644
---- a/src/paged-kv-manager.cpp
-+++ b/src/paged-kv-manager.cpp
-@@ -104,6 +104,22 @@ void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
-     num_free_blocks += blocks.size();
- }
- 
-+void FreeBlockQueue::rebuild(const std::vector<KVCacheBlock*>& blocks) {
-+    // Relink the intrusive list using THIS queue's stable fake head/tail nodes.
-+    num_free_blocks = blocks.size();
-+    for (size_t i = 0; i < blocks.size(); ++i) {
-+        blocks[i]->prev_free = (i == 0)                  ? &fake_head : blocks[i - 1];
-+        blocks[i]->next_free = (i + 1 < blocks.size())   ? blocks[i + 1] : &fake_tail;
-+    }
-+    if (!blocks.empty()) {
-+        fake_head.next_free = blocks.front();
-+        fake_tail.prev_free = blocks.back();
-+    } else {
-+        fake_head.next_free = &fake_tail;
-+        fake_tail.prev_free = &fake_head;
-+    }
-+}
-+
- std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
-     std::vector<KVCacheBlock*> ret;
-     const KVCacheBlock* curr = fake_head.next_free;
-@@ -199,6 +215,20 @@ void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
-     }
- }
- 
-+void BlockPool::defrag_free_queue() {
-+    // Pool is fully idle: every non-null block is free (ref_cnt 0). Rebuild the
-+    // free list in ascending block_id order so popleft hands out physically
-+    // contiguous blocks again. Hashes / the content-cache map are left intact so
-+    // a warm committed prefix stays re-hittable.
-+    std::vector<KVCacheBlock*> ordered;
-+    ordered.reserve(ptrs_.size());
-+    for (KVCacheBlock* b : ptrs_) {
-+        if (b->is_null) continue;
-+        ordered.push_back(b);
-+    }
-+    free_queue_.rebuild(ordered);
-+}
-+
- // ---------------------------------------------------------------------------
- // PagedKVManager  (port of SingleTypeKVCacheManager / FullAttentionManager)
- // ---------------------------------------------------------------------------
-@@ -250,6 +280,21 @@ void PagedKVManager::free(int seq_id) {
-     req_to_blocks_.erase(it);
- }
- 
-+void PagedKVManager::truncate(int seq_id, size_t n_keep) {
-+    auto it = req_to_blocks_.find(seq_id);
-+    if (it == req_to_blocks_.end()) return;
-+    auto & blocks = it->second;
-+    const size_t keep = cdiv(n_keep, block_size_); // blocks covering [0, n_keep)
-+    if (keep >= blocks.size()) return;             // nothing trailing to reclaim
-+    // Free the trailing blocks [keep, end) tail-first (vLLM eviction order). Their
-+    // cells were just cleared by the partial seq_rm, so they are safe to reuse.
-+    std::vector<KVCacheBlock*> ordered(blocks.rbegin(),
-+                                       blocks.rbegin() + (blocks.size() - keep));
-+    pool_.free_blocks(ordered);
-+    blocks.resize(keep);
-+    if (blocks.empty()) req_to_blocks_.erase(it);
-+}
-+
- // FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
- // hash into the seed so each block hash transitively encodes its whole prefix
- // (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
-diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
-index 34decbc..e410d58 100644
---- a/src/paged-kv-manager.h
-+++ b/src/paged-kv-manager.h
-@@ -47,6 +47,11 @@ public:
-     void append_n(const std::vector<KVCacheBlock*>& blocks);
-     void prepend_n(const std::vector<KVCacheBlock*>& blocks);
-     std::vector<KVCacheBlock*> get_all_free_blocks() const;
-+    // [paged 0024 Fix-2] Relink the intrusive free list to the given order using
-+    // THIS queue's fake head/tail (the nodes' addresses are stable; a temporary
-+    // FreeBlockQueue would leave dangling fake-node pointers). Used to restore a
-+    // pristine, contiguous popleft order after a fragmenting burst drains.
-+    void rebuild(const std::vector<KVCacheBlock*>& blocks);
- 
- private:
-     KVCacheBlock fake_head{-1};
-@@ -67,6 +72,14 @@ public:
-                            size_t num_cached_blocks, size_t num_full_blocks,
-                            const std::vector<uint64_t>& block_hashes);
-     size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
-+    // [paged 0024 Fix-2] Total non-null blocks, and whether the pool is fully
-+    // idle (every non-null block back in the free queue). defrag_free_queue()
-+    // relinks the free queue into pristine ascending-block-id order; only valid
-+    // when all_free() so no live request's block table is disturbed. Block hashes
-+    // are preserved, so a warm committed prefix stays re-hittable.
-+    size_t total_blocks() const { return blocks_.size(); }
-+    bool   all_free()    const { return free_queue_.num_free_blocks + 1 == blocks_.size(); }
-+    void   defrag_free_queue();
- 
- private:
-     bool maybe_evict_cached_block(KVCacheBlock* block);
-@@ -94,6 +107,17 @@ public:
-     void free(int seq_id);
-     int block_size() const { return block_size_; }
- 
-+    // [paged 0024 Fix-1] Reclaim the trailing blocks of seq_id beyond logical
-+    // position n_keep: free every block at index >= ceil(n_keep/bs) (ref-counted,
-+    // mirroring vLLM's free of a truncated block suffix). Called on a partial tail
-+    // seq_rm [n_keep, end) so the manager's block accounting tracks the kv-cache
-+    // exactly instead of stranding the blocks whose cells were just cleared.
-+    void truncate(int seq_id, size_t n_keep);
-+
-+    // [paged 0024 Fix-2] When no live request holds a block, relink the free
-+    // queue into pristine contiguous order (undo a burst's scrambled free order).
-+    void defrag_free_pool() { if (pool_.all_free()) pool_.defrag_free_queue(); }
-+
-     // Prefix caching (win 3).
-     static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
-     std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
-diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
-index 8573cd2..209cee8 100644
---- a/src/paged-prefix-api.cpp
-+++ b/src/paged-prefix-api.cpp
-@@ -45,4 +45,12 @@ long num_free(llama_context * ctx) {
-     return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
- }
- 
-+long num_free_global() {
-+    return (long) paged_alloc::num_free_global();
-+}
-+
-+long num_managers() {
-+    return (long) paged_alloc::num_managers();
-+}
-+
- } // namespace paged_prefix_api
-diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
-index 78a3864..8dd817e 100644
---- a/src/paged-prefix-api.h
-+++ b/src/paged-prefix-api.h
-@@ -24,4 +24,10 @@ int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
- // Number of free blocks in the unified stream-0 pool, or 0 if no manager.
- long num_free(llama_context * ctx);
- 
-+// [paged 0024] Total free blocks across every live paged manager (all caches /
-+// streams). Wrapper-agnostic, so it reports the real pool for hybrid / iSWA
-+// models whose outer memory is not a llama_kv_cache. Diagnostics only.
-+long num_free_global();
-+long num_managers();
-+
- } // namespace paged_prefix_api
-diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index f7a114c..8c19cfb 100644
---- a/tools/server/server-context.cpp
-+++ b/tools/server/server-context.cpp
-@@ -411,6 +411,23 @@ struct server_slot {
- 
-             reset();
- 
-+            // [paged 0024 Fix-3] Return this finished slot's paged blocks to the
-+            // pool promptly. Stock llama-server keeps an idle slot's KV for its own
-+            // next-prompt cache, but under the paged engine that strands blocks in
-+            // idle slots after a high-fan-out burst, so a later low-npl run sees a
-+            // depleted, fragmented pool and its prefill collapses. prompt_clear()
-+            // issues a full seq_rm (clearing the cells AND, via the paged hook,
-+            // releasing + defragging the blocks) and clears the slot-local prompt
-+            // cache so the next reuse recomputes from a pristine pool; cross-request
-+            // reuse still works through the committed paged content cache. Gated on
-+            // LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM opts out for A/B); stock
-+            // (paged off) is byte-identical.
-+            static const bool paged_release_on_idle =
-+                getenv("LLAMA_KV_PAGED") != nullptr && getenv("LLAMA_PAGED_NO_RECLAIM") == nullptr;
-+            if (paged_release_on_idle && prompt.n_tokens() > 0) {
-+                prompt_clear(false);
-+            }
-+
-             callback_on_release(id);
-         }
-     }
--- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0025-qwen35moe-nvfp4-moe-decode-regraph.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0025-qwen35moe-nvfp4-moe-decode-regraph.patch
@@ -1,56 +0,0 @@
-From 2f4f5ab7c9050f890ee1137ef9c8ee09dfcd9ae7 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Fri, 26 Jun 2026 16:52:21 +0200
-Subject: [PATCH] feat(paged): qwen35moe NVFP4 MoE-decode re-graph
- (should_use_mmq graph-safe id-path) (patch 0025)
-
-The MUL_MAT_ID CUDA-graph guard (ggml-cuda.cu [TAG_MUL_MAT_ID_CUDA_GRAPHS]) disables CUDA graphs for
-the whole decode step whenever a MUL_MAT_ID node has ne[2] > mmvq_mmid_max (8 for NVFP4 on sm_121),
-because the per-expert host-loop fallback synchronizes the stream. But on Blackwell NVFP4 the path
-actually taken is should_use_mmq()==true -> the grouped stream-k mul_mat_q id-branch, which launches
-on one stream with NO host sync (no cudaStreamSynchronize/Memcpy in mmq.cu/mmid.cu). The disable is
-therefore conservative; graphs are safe for the grouped path.
-
-Env-gated (LLAMA_MOE_FORCE_GRAPHS, default-off = byte-identical to stock): when set and the node
-takes the grouped MMQ path, keep CUDA graphs on for the MoE decode step.
-
-Measured (DGX GB10 sm_121, q36-35b-a3b-nvfp4, llama-batched-bench -fa on -npp128 -ntg128, decode_agg):
-  npl 8   226.0 -> 226.4  +0.2% (noise; ne2<=8 already on the MMVQ-graphed path)
-  npl 32  433.8 -> 452.7  +4.4%
-  npl 64  589.0 -> 605.9  +2.9%
-  npl 128 743.1 -> 757.1  +1.9%
-
-Bit-exact (graph replay re-issues identical kernels): test-backend-ops MUL_MAT_ID 806/806 CUDA0 OK;
-parallel-greedy np16 (ne2=16>8) generated content byte-identical ON==OFF.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- ggml/src/ggml-cuda/ggml-cuda.cu | 12 +++++++++++-
- 1 file changed, 11 insertions(+), 1 deletion(-)
-
-diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
-index cca7059..254d2e0 100644
---- a/ggml/src/ggml-cuda/ggml-cuda.cu
-+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
-@@ -3275,7 +3275,17 @@ static bool ggml_cuda_graph_check_compability(ggml_cgraph * cgraph) {
-         if (node->op == GGML_OP_MUL_MAT_ID) {
-             const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
-             const int mmvq_mmid_max = get_mmvq_mmid_max_batch(node->src[0]->type, cc);
--            if (!ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max) {
-+            bool mmid_needs_sync = !ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max;
-+            // PROBE (bit-exact, env LLAMA_MOE_FORCE_GRAPHS): the grouped stream-k MMQ id-path is
-+            // launched on-stream with no host sync (only the per-expert host-loop fallback syncs);
-+            // when should_use_mmq() is true (Blackwell NVFP4 grouped path) the op is graph-safe
-+            // even for ne[2] > mmvq_mmid_max, so graphs need not be disabled for the whole step.
-+            if (mmid_needs_sync && ggml_is_quantized(node->src[0]->type) &&
-+                getenv("LLAMA_MOE_FORCE_GRAPHS") != nullptr &&
-+                ggml_cuda_should_use_mmq(node->src[0]->type, cc, node->src[1]->ne[2], node->src[0]->ne[2])) {
-+                mmid_needs_sync = false;
-+            }
-+            if (mmid_needs_sync) {
-                 // under these conditions, the mul_mat_id operation will need to synchronize the stream, so we cannot use CUDA graphs
-                 // TODO: figure out a way to enable for larger batch sizes, without hurting performance
-                 // ref: https://github.com/ggml-org/llama.cpp/pull/18958
--- 
-2.43.0
--- a/backend/cpp/llama-cpp/patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
--- a/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch
@@ -1,578 +0,0 @@
-From fafe8785c8595f53a51efec20cf84f9146437e0c Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Fri, 26 Jun 2026 22:58:47 +0200
-Subject: [PATCH] feat(paged): qwen35 recurrent-state gather fusion (patch
- 0028)
-
-The MoE-gap groundtruth found k_get_rows_float to be the single biggest decode
-kernel vLLM has no equivalent of (~5.2 ms/step MoE; also dense): vLLM updates its
-gated-DeltaNet recurrent state in place, while llama ran a separate ggml_get_rows
-gather. Patch 0019 fused the SSM-state gather; patch 0021 fused the conv compute
-but kept a build_rs gather for the conv taps. This closes that residual.
-
-nsys located the residual k_get_rows as the conv-state tap gather in
-build_conv_state_fused: a 24576-float (= n_embd_r = (d_conv-1)*(d_inner +
-2*n_group*d_state)) row x 128 sequences, once per GDN layer per decode step
-(~720 big ~115 us gathers / 24-step window). The SSM-state gather is already
-fused by 0019, so this conv gather is the last k_get_rows in the GDN decode path.
-
-New op ggml_ssm_conv_update_inplace_ids (reuses GGML_OP_SSM_CONV, discriminated
-by a non-null src[4] = ids) takes the FULL conv cache + the s_copy ids and reads
-each active sequence's prior taps directly from cache[ids[s]] in the kernel (no
-ggml_get_rows). Identity sequences (ids[s] == rs_head + s, the AR-decode path)
-read in place from the conv_state_dst write slot (the whole window is loaded into
-registers before the ring write-back, so read==write is race-free); non-identity
-sequences (reorder / rs_zero) are gathered into a disjoint scratch by a small
-ssm_conv_gather_nonident_kernel first. Mirrors the 0019 in-place + ids gather
-fusion. The read VALUES are unchanged; only the read path (gather -> indexed
-in-kernel read) changes, so it is bit-identical to the build_rs gather + 0021 op.
-
-build_conv_state_fused now feeds the full cache + ids through the build_rs
-get_state_rows lambda (rs_zero clear + extra-states copy still run around it).
-Helps BOTH dense and MoE (shared GDN conv path).
-
-GATE test-backend-ops (CUDA0 vs CPU, 2/2 backends): SSM_CONV_UPDATE_IDS OK (new),
-SSM_CONV_UPDATE OK, SSM_CONV OK, GATED_DELTA_NET OK, GET_ROWS OK.
-
-GATE greedy md5 (--temp 0 --seed 1 -n 48) BYTE-IDENTICAL both models:
-q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4
-07db32c2bcb78d17a43ed18bc22705cd (== baseline).
-
-nsys: k_get_rows_float float,float 10174 -> 9454 instances (720 fewer = 30 GDN
-layers x 24 steps), 186.3 -> 102.8 ms; the 720 ~115 us conv gathers replaced by a
-720 x ~1.1 us no-op ssm_conv_gather_nonident (all identity at steady decode).
-MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- ggml/include/ggml.h            |  20 ++++
- ggml/src/ggml-cpu/ops.cpp      |  90 +++++++++++++++++-
- ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++-
- ggml/src/ggml.c                |  62 +++++++++++++
- src/models/delta-net-base.cpp  |  26 ++++--
- tests/test-backend-ops.cpp     |  69 ++++++++++++++
- 6 files changed, 411 insertions(+), 11 deletions(-)
-
-diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
-index 2a5cbce..5fa220a 100644
---- a/ggml/include/ggml.h
-+++ b/ggml/include/ggml.h
-@@ -2463,6 +2463,26 @@ extern "C" {
-             struct ggml_tensor  * conv_state_dst,
-             bool                  fuse_silu);
- 
-+    // Gather-free variant of ggml_ssm_conv_update_inplace (patch 0028). Instead of a pre-gathered
-+    // per-sequence tap scratch, it takes the FULL conv-state cache (`conv_states` = [K-1, channels,
-+    // n_cells]) plus the per-sequence `ids` ([n_seqs], I32, = the recurrent-state s_copy) and reads
-+    // each active sequence's prior taps directly from cache[ids[s]] inside the kernel -- no
-+    // ggml_get_rows materialization (mirrors ggml_gated_delta_net_inplace_ids). Identity sequences
-+    // (ids[s] == rs_head + s) are read in place from `conv_state_dst` (the write slot); any
-+    // non-identity sequence (reorder / rs_zero remap) is gathered into a disjoint scratch by the
-+    // backend first, so the read never aliases another sequence's in-place ring write -> race-free
-+    // and bit-identical to the get_rows + ggml_ssm_conv_update_inplace path. op_params[0]=fuse_silu,
-+    // op_params[1]=rs_head. Reuses GGML_OP_SSM_CONV, discriminated by a non-null src[4].
-+    GGML_API struct ggml_tensor * ggml_ssm_conv_update_inplace_ids(
-+            struct ggml_context * ctx,
-+            struct ggml_tensor  * conv_states,
-+            struct ggml_tensor  * conv_kernel,
-+            struct ggml_tensor  * x_cur,
-+            struct ggml_tensor  * conv_state_dst,
-+            struct ggml_tensor  * ids,
-+            int                   rs_head,
-+            bool                  fuse_silu);
-+
-     GGML_API struct ggml_tensor * ggml_ssm_scan(
-             struct ggml_context * ctx,
-             struct ggml_tensor  * s,
-diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
-index 07ab9e5..515aae4 100644
---- a/ggml/src/ggml-cpu/ops.cpp
-+++ b/ggml/src/ggml-cpu/ops.cpp
-@@ -9580,6 +9580,90 @@ static void ggml_compute_forward_ssm_conv_update_f32(
-     }
- }
- 
-+// Patch 0028: CPU reference for ggml_ssm_conv_update_inplace_ids (mirror of the CUDA
-+// ssm_conv_update_ids_f32). Reads each active sequence's prior K-1 taps directly from the FULL conv
-+// cache (src[0]) via ids (src[4]) -- identity sequences (ids[s] == rs_head + s) read in place from the
-+// destination slot src[3], non-identity from cache[ids[s]] -- computes the depthwise conv with the
-+// same ascending-tap FMA order, optionally folds silu, writes the conv output to dst, and writes the
-+// 1-token-shifted ring state back in place into src[3]. The window is copied to a local before the
-+// write so the identity (read == write slot) case is correct. Threads split over channels.
-+static void ggml_compute_forward_ssm_conv_update_ids_f32(
-+        const ggml_compute_params * params,
-+        ggml_tensor * dst) {
-+    const ggml_tensor * conv_states = dst->src[0]; // FULL cache [K-1, channels, n_cells]
-+    const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
-+    const ggml_tensor * x_cur       = dst->src[2]; // [channels, 1, n_seqs]
-+    ggml_tensor       * cdst        = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
-+    const ggml_tensor * ids         = dst->src[4]; // [n_seqs] I32 slot indices (s_copy)
-+
-+    const int ith = params->ith;
-+    const int nth = params->nth;
-+
-+    const int64_t d_conv   = conv_kernel->ne[0];
-+    const int64_t channels = conv_kernel->ne[1];
-+    const int64_t n_seqs   = x_cur->ne[2];
-+    const bool    apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
-+    const int32_t rs_head    = ggml_get_op_params_i32(dst, 1);
-+
-+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
-+    GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
-+    GGML_ASSERT(ids->type == GGML_TYPE_I32);
-+    GGML_ASSERT(d_conv <= 8);
-+
-+    const int64_t cache_row_stride = conv_states->nb[2] / sizeof(float); // (K-1)*channels
-+    const int64_t w_stride         = conv_kernel->nb[1] / sizeof(float);
-+    const int64_t x_seq_stride     = x_cur->nb[2] / sizeof(float);
-+    const int64_t dst_seq_stride   = dst->nb[2] / sizeof(float);
-+    const int64_t cdst_seq_stride  = cdst->nb[1] / sizeof(float);
-+
-+    const float * cache_base = (const float *) conv_states->data;
-+    const float * w_base     = (const float *) conv_kernel->data;
-+    const float * x_base     = (const float *) x_cur->data;
-+    float *       cdst_base  = (float *) cdst->data;
-+    float *       dst_base   = (float *) dst->data;
-+    const int32_t * ids_base = (const int32_t *) ids->data;
-+
-+    const int64_t dc = (channels + nth - 1) / nth;
-+    const int64_t c0 = dc * ith;
-+    const int64_t c1 = MIN(c0 + dc, channels);
-+
-+    for (int64_t s = 0; s < n_seqs; ++s) {
-+        const int32_t r     = ids_base[s];
-+        const bool    ident = (r == rs_head + (int32_t) s);
-+        // identity reads the K-1 taps in place from the destination slot; non-identity from cache[r].
-+        const float * states_seq = ident
-+            ? (cdst_base  + s * cdst_seq_stride)
-+            : (cache_base + (int64_t) r * cache_row_stride);
-+        for (int64_t c = c0; c < c1; ++c) {
-+            const float * states_c = states_seq + c * (d_conv - 1);
-+            const float * w_c      = w_base + c * w_stride;
-+            const float   xc       = x_base[s * x_seq_stride + c];
-+
-+            // window = [tap0 .. tap_{K-2}, xc], copied to a local before the (possibly aliasing) write
-+            float window[8];
-+            for (int64_t j = 0; j < d_conv - 1; ++j) {
-+                window[j] = states_c[j];
-+            }
-+            window[d_conv - 1] = xc;
-+
-+            // ascending-tap FMA: tap0*w0 + ... + tap_{K-2}*w_{K-2} + xc*w_{K-1} (matches ssm_conv)
-+            float sumf = 0.0f;
-+            for (int64_t j = 0; j < d_conv; ++j) {
-+                sumf += window[j] * w_c[j];
-+            }
-+            sumf += 0.0f; // matches ssm_conv `sumf += b` with b == 0
-+
-+            dst_base[s * dst_seq_stride + c] = apply_silu ? (sumf / (1.0f + expf(-sumf))) : sumf;
-+
-+            // 1-token-shifted ring write-back: [tap1 .. tap_{K-2}, xc]
-+            float * out_state = cdst_base + s * cdst_seq_stride + c * (d_conv - 1);
-+            for (int64_t j = 0; j < d_conv - 1; ++j) {
-+                out_state[j] = window[j + 1];
-+            }
-+        }
-+    }
-+}
-+
- void ggml_compute_forward_ssm_conv(
-         const ggml_compute_params * params,
-         ggml_tensor * dst) {
-@@ -9587,7 +9671,11 @@ void ggml_compute_forward_ssm_conv(
-         case GGML_TYPE_F32:
-             {
-                 if (dst->src[3] != nullptr) {
-                    ggml_compute_forward_ssm_conv_update_f32(params, dst);
-+                    if (dst->src[4] != nullptr) {
-+                        ggml_compute_forward_ssm_conv_update_ids_f32(params, dst);
-+                    } else {
-+                        ggml_compute_forward_ssm_conv_update_f32(params, dst);
-+                    }
-                 } else {
-                     ggml_compute_forward_ssm_conv_f32(params, dst);
-                 }
-diff --git a/ggml/src/ggml-cuda/ssm-conv.cu b/ggml/src/ggml-cuda/ssm-conv.cu
-index e1af1cd..28b3cce 100644
--- a/ggml/src/ggml-cuda/ssm-conv.cu
-+++ b/ggml/src/ggml-cuda/ssm-conv.cu
-@@ -226,6 +226,153 @@ static void ggml_cuda_op_ssm_conv_update(ggml_backend_cuda_context & ctx, ggml_t
-     }
- }
- 
-+// Patch 0028: gather only the NON-identity sequences' prior conv taps from the FULL conv cache into a
-+// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
-+// destination slot by the update kernel and are skipped here. One block per sequence. Mirrors
-+// gdn_gather_nonident_kernel (the 0019 recurrent-state gather fusion).
-+static __global__ void ssm_conv_gather_nonident_kernel(const float * __restrict__ cache,
-+                                                       const int32_t * __restrict__ ids, int rs_head,
-+                                                       float * __restrict__ scratch, int row_stride, int n_seqs) {
-+    const int s = blockIdx.x;
-+    if (s >= n_seqs) {
-+        return;
-+    }
-+    const int r = ids[s];
-+    if (r == rs_head + s) {
-+        return; // identity: prior taps already live in the in-place destination slot
-+    }
-+    const float * src = cache   + (int64_t) r * row_stride;
-+    float *       dst = scratch + (int64_t) s * row_stride;
-+    for (int i = threadIdx.x; i < row_stride; i += blockDim.x) {
-+        dst[i] = src[i];
-+    }
-+}
-+
-+// Patch 0028: gather-free fused conv update. Per (channel, sequence), read the K-1 prior taps from the
-+// active sequence's cache slot via ids -- identity (ids[s] == rs_head + s) reads in place from
-+// conv_state_dst (the same slot it writes; the whole window is loaded into registers before any write,
-+// so it is race-free), non-identity reads the pre-gathered disjoint scratch -- then computes the
-+// depthwise conv with the SAME ascending-tap FMA order as ssm_conv_update_f32, folds silu, writes the
-+// conv output, and writes the 1-token-shifted ring state back in place. Bit-identical to the get_rows +
-+// ssm_conv_update_f32 path: the read VALUES are the same; only the read POINTER changes.
-+template <bool apply_silu, int d_conv>
-+static __global__ void ssm_conv_update_ids_f32(const float * __restrict__ nonident_scratch,
-+                                               const float * __restrict__ conv_kernel,
-+                                               const float * __restrict__ x_cur,
-+                                               float       * __restrict__ conv_state_dst,
-+                                               float       * __restrict__ dst,
-+                                               const int32_t * __restrict__ ids,
-+                                               const int   rs_head,
-+                                               const int   channels,
-+                                               const int   scratch_seq_stride,
-+                                               const int   w_stride,
-+                                               const int   x_seq_stride,
-+                                               const int   dst_seq_stride,
-+                                               const int   cdst_seq_stride) {
-+    const int c = blockIdx.x * blockDim.x + threadIdx.x; // channel
-+    const int s = blockIdx.y;                            // sequence
-+    if (c >= channels) {
-+        return;
-+    }
-+
-+    const bool ident = (ids[s] == rs_head + s);
-+    const float * states_c = ident
-+        ? conv_state_dst   + (int64_t) s * cdst_seq_stride    + (int64_t) c * (d_conv - 1)
-+        : nonident_scratch + (int64_t) s * scratch_seq_stride + (int64_t) c * (d_conv - 1);
-+    const float * w_c = conv_kernel + (int64_t) c * w_stride;
-+    const float   xc  = x_cur[(int64_t) s * x_seq_stride + c];
-+
-+    // window = [tap0 .. tap_{K-2}, current-token], same ordering as ssm_conv_update_f32
-+    float window[d_conv];
-+#pragma unroll
-+    for (int j = 0; j < d_conv - 1; j++) {
-+        window[j] = states_c[j];
-+    }
-+    window[d_conv - 1] = xc;
-+
-+    float sumf = 0.0f;
-+#pragma unroll
-+    for (int j = 0; j < d_conv; j++) {
-+        sumf += window[j] * w_c[j];
-+    }
-+    sumf += 0.0f; // matches ssm_conv_f32 `sumf += b` with b == 0 (qwen35 conv1d has no bias)
-+    dst[(int64_t) s * dst_seq_stride + c] = apply_silu ? ggml_cuda_op_silu_single(sumf) : sumf;
-+
-+    // 1-token-shifted ring write-back: drop the oldest tap, append the current token
-+    float * out_state = conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1);
-+#pragma unroll
-+    for (int j = 0; j < d_conv - 1; j++) {
-+        out_state[j] = window[j + 1];
-+    }
-+}
-+
-+static void ggml_cuda_op_ssm_conv_update_ids(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
-+    const ggml_tensor * conv_states = dst->src[0]; // FULL cache [K-1, channels, n_cells]
-+    const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
-+    const ggml_tensor * x_cur       = dst->src[2]; // [channels, 1, n_seqs]
-+    const ggml_tensor * cdst        = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
-+    const ggml_tensor * ids         = dst->src[4]; // [n_seqs] I32 slot indices (s_copy)
-+
-+    const int64_t d_conv   = conv_kernel->ne[0];
-+    const int64_t channels = conv_kernel->ne[1];
-+    const int64_t n_seqs   = x_cur->ne[2];
-+    const bool    apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
-+    const int     rs_head    = ggml_get_op_params_i32(dst, 1);
-+
-+    GGML_ASSERT(conv_states->type == GGML_TYPE_F32 && conv_kernel->type == GGML_TYPE_F32);
-+    GGML_ASSERT(x_cur->type == GGML_TYPE_F32 && cdst->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
-+    GGML_ASSERT(ids->type == GGML_TYPE_I32);
-+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
-+    GGML_ASSERT(conv_states->nb[1] == (size_t) (d_conv - 1) * sizeof(float));
-+    GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
-+    GGML_ASSERT(dst->ne[0] == channels && dst->ne[1] == 1 && dst->ne[2] == n_seqs);
-+
-+    const float *   cache_d = (const float *) conv_states->data;
-+    const float *   w_d     = (const float *) conv_kernel->data;
-+    const float *   x_d     = (const float *) x_cur->data;
-+    float *         cdst_d  = (float *) cdst->data;
-+    float *         dst_d   = (float *) dst->data;
-+    const int32_t * ids_d   = (const int32_t *) ids->data;
-+    cudaStream_t    stream  = ctx.stream();
-+
-+    // n_embd_r = (K-1)*channels: the per-cell row stride of the full conv cache.
-+    const int cache_row_stride = (int) (conv_states->nb[2] / sizeof(float));
-+    const int w_stride         = (int) (conv_kernel->nb[1] / sizeof(float));
-+    const int x_seq_stride     = (int) (x_cur->nb[2] / sizeof(float));
-+    const int dst_seq_stride   = (int) (dst->nb[2] / sizeof(float));
-+    const int cdst_seq_stride  = (int) (cdst->nb[1] / sizeof(float));
-+
-+    // Gather only the non-identity sequences' prior taps into a disjoint scratch (identity sequences
-+    // read in place from cdst). The scratch is written here and read-only by the update kernel, so the
-+    // update kernel never reads a slot another block writes -> race-free. No-op at steady AR decode.
-+    ggml_cuda_pool_alloc<float> nonident_scratch(ctx.pool());
-+    float * scratch = nonident_scratch.alloc((size_t) cache_row_stride * n_seqs);
-+    if (n_seqs > 0) {
-+        ssm_conv_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(
-+            cache_d, ids_d, rs_head, scratch, cache_row_stride, (int) n_seqs);
-+    }
-+
-+    const int threads = 128;
-+    const dim3 blocks((channels + threads - 1) / threads, (unsigned) n_seqs, 1);
-+
-+    auto launch = [&](auto NC) {
-+        constexpr int kNC = decltype(NC)::value;
-+        if (apply_silu) {
-+            ssm_conv_update_ids_f32<true, kNC><<<blocks, threads, 0, stream>>>(scratch, w_d, x_d, cdst_d, dst_d,
-+                ids_d, rs_head, (int) channels, cache_row_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
-+        } else {
-+            ssm_conv_update_ids_f32<false, kNC><<<blocks, threads, 0, stream>>>(scratch, w_d, x_d, cdst_d, dst_d,
-+                ids_d, rs_head, (int) channels, cache_row_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
-+        }
-+    };
-+
-+    switch (d_conv) {
-+        case 3: launch(std::integral_constant<int, 3>{}); break;
-+        case 4: launch(std::integral_constant<int, 4>{}); break;
-+        default: GGML_ABORT("ssm_conv_update_ids only supports d_conv 3 or 4");
-+    }
-+}
-+
- template <bool apply_silu>
- static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1,
-                               const int src0_nb2, const int src1_nb1, float * dst, const int dst_nb0, const int dst_nb1,
-@@ -266,7 +413,13 @@ void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, g
-     // silu of the decode conv path into a single kernel.
-     if (dst->src[3] != nullptr) {
-         GGML_ASSERT(bias_add_node == nullptr && silu_dst == nullptr);
-        ggml_cuda_op_ssm_conv_update(ctx, dst);
-+        // Patch 0028: a non-null src[4] (ids) selects the gather-free variant that reads each
-+        // sequence's prior taps directly from the full cache via ids (no get_rows materialization).
-+        if (dst->src[4] != nullptr) {
-+            ggml_cuda_op_ssm_conv_update_ids(ctx, dst);
-+        } else {
-+            ggml_cuda_op_ssm_conv_update(ctx, dst);
-+        }
-         return;
-     }
- 
-diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
-index 16b180f..dcc09bd 100644
--- a/ggml/src/ggml.c
-+++ b/ggml/src/ggml.c
-@@ -5606,6 +5606,68 @@ struct ggml_tensor * ggml_ssm_conv_update_inplace(
-     return result;
- }
- 
-+// ggml_ssm_conv_update_inplace_ids
-+//
-+// Gather-free variant of ggml_ssm_conv_update_inplace (patch 0028). Instead of a pre-gathered
-+// per-sequence tap scratch, it takes the FULL conv-state cache (`conv_states` = [K-1, channels,
-+// n_cells]) plus the per-sequence `ids` (the recurrent-state s_copy) and reads each active sequence's
-+// prior taps directly from cache[ids[s]] inside the kernel (no ggml_get_rows). Identity sequences
-+// (ids[s] == rs_head + s) read in place from the `conv_state_dst` write slot; non-identity sequences
-+// are gathered into a disjoint scratch by the backend first. Bit-identical to the get_rows +
-+// ggml_ssm_conv_update_inplace path. Reuses GGML_OP_SSM_CONV, discriminated by a non-null src[4].
-+// op_params[1] carries rs_head. Mirrors the 0019 ggml_gated_delta_net_inplace_ids gather fusion.
-+struct ggml_tensor * ggml_ssm_conv_update_inplace_ids(
-+        struct ggml_context * ctx,
-+        struct ggml_tensor  * conv_states,
-+        struct ggml_tensor  * conv_kernel,
-+        struct ggml_tensor  * x_cur,
-+        struct ggml_tensor  * conv_state_dst,
-+        struct ggml_tensor  * ids,
-+        int                   rs_head,
-+        bool                  fuse_silu) {
-+    GGML_ASSERT(ggml_is_3d(conv_states));
-+    GGML_ASSERT(ggml_is_matrix(conv_kernel));
-+    GGML_ASSERT(ggml_is_3d(x_cur));
-+    GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32);
-+
-+    const int64_t d_conv   = conv_kernel->ne[0];
-+    const int64_t channels = conv_kernel->ne[1];
-+    const int64_t n_seqs   = x_cur->ne[2];
-+
-+    GGML_ASSERT(conv_states->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(conv_kernel->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(x_cur->type          == GGML_TYPE_F32);
-+    GGML_ASSERT(conv_state_dst != NULL && conv_state_dst->type == GGML_TYPE_F32);
-+
-+    // conv_states: FULL cache [K-1, channels, n_cells], contiguous taps per channel
-+    GGML_ASSERT(conv_states->ne[0] == d_conv - 1);
-+    GGML_ASSERT(conv_states->ne[1] == channels);
-+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
-+    // x_cur: single decode token per sequence
-+    GGML_ASSERT(x_cur->ne[0] == channels);
-+    GGML_ASSERT(x_cur->ne[1] == 1);
-+    // ids: one slot index per active sequence
-+    GGML_ASSERT(ids->ne[0] == n_seqs);
-+    // conv_state_dst: [(K-1)*channels, n_seqs] in-place ring write target
-+    GGML_ASSERT(conv_state_dst->ne[0] == (d_conv - 1) * channels);
-+    GGML_ASSERT(conv_state_dst->ne[1] >= n_seqs);
-+    GGML_ASSERT(conv_state_dst->nb[0] == sizeof(float));
-+
-+    struct ggml_tensor * result = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
-+
-+    ggml_set_op_params_i32(result, 0, fuse_silu ? 1 : 0);
-+    ggml_set_op_params_i32(result, 1, rs_head);
-+
-+    result->op     = GGML_OP_SSM_CONV;
-+    result->src[0] = conv_states;
-+    result->src[1] = conv_kernel;
-+    result->src[2] = x_cur;
-+    result->src[3] = conv_state_dst;
-+    result->src[4] = ids;
-+
-+    return result;
-+}
-+
- // ggml_ssm_scan
- 
- struct ggml_tensor * ggml_ssm_scan(
-diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
-index 58f3d0c..962f5eb 100644
--- a/src/models/delta-net-base.cpp
-+++ b/src/models/delta-net-base.cpp
-@@ -548,25 +548,33 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state_fused(
-     GGML_ASSERT(n_seq_tokens == 1);        // single-token decode only
-     GGML_ASSERT(cparams.n_rs_seq == 0);    // no rollback splits on this path
- 
-    // Prior conv-state taps for the active sequences: [K-1, conv_channels, n_seqs]. Same get_rows
-    // gather as the baseline build_conv_state read (tiny; not one of the eliminated buckets).
-    ggml_tensor * conv_states = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs);
-    conv_states = ggml_reshape_3d(ctx0, conv_states, conv_kernel_size - 1, conv_channels, n_seqs);
-    cb(conv_states, "conv_states_reshaped", il);
-
-     // Current token, native (non-transposed) qkv_mixed: [conv_channels, 1, n_seqs].
-     ggml_tensor * x_cur = ggml_reshape_3d(ctx0, qkv_mixed, conv_channels, n_seq_tokens, n_seqs);
- 
-     // In-place ring write-back target = the active sequences' conv-cache slot at kv_head, exactly the
-     // destination the baseline ggml_cpy wrote to (s_slot == 0).
-    const int64_t row_count = (conv_kernel_size - 1) * conv_channels;
-+    const int64_t row_count = (conv_kernel_size - 1) * conv_channels; // = n_embd_r
-     const size_t  row_size  = ggml_row_size(conv_states_all->type, row_count);
-     ggml_tensor * conv_state_dst =
-         ggml_view_2d(ctx0, conv_states_all, row_count, n_seqs, conv_states_all->nb[1], kv_head * row_size);
-     cb(conv_state_dst, "conv_state_update", il);
- 
-    ggml_tensor * conv_output =
-        ggml_ssm_conv_update_inplace(ctx0, conv_states, conv_kernel, x_cur, conv_state_dst, /*fuse_silu=*/true);
-+    // Patch 0028: fuse the residual conv-state tap gather (the k_get_rows that build_conv_state's
-+    // build_rs left firing -- roughly the largest single residual decode kernel, see MOE_GAP_VS_VLLM.md).
-+    // Exactly like the 0019 SSM-state gather fusion, build_rs feeds the FULL conv cache + the s_copy
-+    // ids into the op (via the get_state_rows lambda) and still performs the rs_zero clear and the
-+    // extra-states copy around it; the op reads each active sequence's prior taps directly from
-+    // cache[ids[s]] (identity sequences read in place from conv_state_dst), so the separate
-+    // ggml_get_rows materialization is eliminated. The read VALUES are unchanged, only the read path
-+    // (gather -> indexed in-kernel read) changes, so it is bit-identical to the build_rs gather.
-+    auto get_conv_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * {
-+        // states = full conv-state cache reshaped 2d [n_embd_r, n_cells]
-+        ggml_tensor * cache3d = ggml_reshape_3d(ctx, states, conv_kernel_size - 1, conv_channels, states->ne[1]);
-+        return ggml_ssm_conv_update_inplace_ids(ctx, cache3d, conv_kernel, x_cur, conv_state_dst,
-+                ids, (int) kv_head, /*fuse_silu=*/true);
-+    };
-+
-+    ggml_tensor * conv_output = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs, get_conv_op);
-     cb(conv_output, "conv_output_silu", il);
- 
-     // the ring write is a side effect of the op; pull the op into the graph via the output
-diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index b5e3048..302975f 100644
--- a/tests/test-backend-ops.cpp
-+++ b/tests/test-backend-ops.cpp
-@@ -3793,6 +3793,65 @@ struct test_ssm_conv_update : public test_case {
-     }
- };
- 
-+// GGML_OP_SSM_CONV gather-free fused decode conv-update via ids (ggml_ssm_conv_update_inplace_ids,
-+// patch 0028). conv_states is the FULL cache; ids (a shuffled permutation of [0,n_seqs), rs_head=0)
-+// selects each sequence's slot, exercising BOTH the identity in-place read (ids[s]==s) and the
-+// non-identity cache read. Validates the conv + silu output (dst) against the CPU reference.
-+struct test_ssm_conv_update_ids : public test_case {
-+    const int64_t d_conv;
-+    const int64_t channels;
-+    const int64_t n_seqs;
-+
-+    std::string op_desc(ggml_tensor * t) override {
-+        GGML_UNUSED(t);
-+        return "SSM_CONV_UPDATE_IDS";
-+    }
-+
-+    std::string vars() override {
-+        return VARS_TO_STR3(d_conv, channels, n_seqs);
-+    }
-+
-+    test_ssm_conv_update_ids(int64_t d_conv = 4, int64_t channels = 256, int64_t n_seqs = 4)
-+        : d_conv(d_conv), channels(channels), n_seqs(n_seqs) {}
-+
-+    ggml_tensor * build_graph(ggml_context * ctx) override {
-+        ggml_tensor * conv_states    = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_conv - 1, channels, n_seqs);
-+        ggml_tensor * conv_kernel    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, channels);
-+        ggml_tensor * x_cur          = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
-+        ggml_tensor * conv_state_dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (d_conv - 1) * channels, n_seqs);
-+        ggml_tensor * ids            = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_seqs);
-+        ggml_set_name(conv_states, "conv_states");
-+        ggml_set_name(conv_kernel, "conv_kernel");
-+        ggml_set_name(x_cur, "x_cur");
-+        ggml_set_name(conv_state_dst, "conv_state_dst");
-+        ggml_set_name(ids, "ids");
-+
-+        ggml_tensor * out = ggml_ssm_conv_update_inplace_ids(ctx, conv_states, conv_kernel, x_cur,
-+                conv_state_dst, ids, /*rs_head=*/0, /*fuse_silu=*/true);
-+        ggml_set_name(out, "out");
-+        return out;
-+    }
-+
-+    void initialize_tensors(ggml_context * ctx) override {
-+        std::random_device rd;
-+        std::default_random_engine rng(rd());
-+        for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
-+            if (t->type == GGML_TYPE_I32) {
-+                // ids: shuffled permutation of [0, n_seqs) into the full cache (rs_head == 0), so some
-+                // sequences are identity (ids[s] == s, in-place read) and some are not (scratch read).
-+                std::vector<int32_t> data(t->ne[0]);
-+                for (int i = 0; i < t->ne[0]; i++) {
-+                    data[i] = i;
-+                }
-+                std::shuffle(data.begin(), data.end(), rng);
-+                ggml_backend_tensor_set(t, data.data(), 0, t->ne[0] * sizeof(int32_t));
-+            } else {
-+                init_tensor_uniform(t);
-+            }
-+        }
-+    }
-+};
-+
- // GGML_OP_SSM_SCAN
- struct test_ssm_scan : public test_case {
-     const ggml_type type;
-@@ -8504,6 +8563,16 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
-         }
-     }
- 
-+    // gather-free fused decode conv-update via ids (ggml_ssm_conv_update_inplace_ids, patch 0028).
-+    // channels must be a multiple of 128 for the CUDA SSM_CONV supports_op gate.
-+    for (int64_t d_conv : {3, 4}) {
-+        for (int64_t channels : {256, 3328}) {
-+            for (int64_t n_seqs : {1, 4, 32, 128}) {
-+                test_cases.emplace_back(new test_ssm_conv_update_ids(d_conv, channels, n_seqs));
-+            }
-+        }
-+    }
-+
-     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1, 1024, 1, 32, 4)); // Mamba-1
-     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 128, 64, 16, 2, 32, 4)); // Mamba-2
-     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 256, 64,  8, 2, 32, 4)); // Falcon-H1
-- 
-2.43.0
-
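The per-channel update that patch 0028 performs (load the window, accumulate in ascending-tap order, write back the state shifted by one token) can be sketched on the host as below. This is a minimal illustration, not the patch's code: `conv_update_channel` is a hypothetical helper, it handles one channel of one sequence, and it assumes `state.size() == w.size() - 1`.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the patch): one decode-step conv update for a
// single channel. `state` holds the K-1 prior taps, oldest first; `w` holds the
// K kernel weights. Assumes state.size() == w.size() - 1.
inline float conv_update_channel(std::vector<float> & state,
                                 const std::vector<float> & w,
                                 float xc,
                                 bool apply_silu) {
    const std::size_t k = w.size(); // d_conv

    // Copy the whole window before any write, so reading and writing the same
    // slot (the "identity" case) never observes a half-updated state.
    std::vector<float> window(state.begin(), state.end());
    window.push_back(xc); // [tap0 .. tap_{K-2}, xc]

    // Ascending-tap accumulation: tap0*w0 + ... + xc*w_{K-1}.
    float sumf = 0.0f;
    for (std::size_t j = 0; j < k; ++j) {
        sumf += window[j] * w[j];
    }

    // Ring write-back: drop the oldest tap, append the current token.
    for (std::size_t j = 0; j + 1 < k; ++j) {
        state[j] = window[j + 1];
    }

    return apply_silu ? sumf / (1.0f + std::exp(-sumf)) : sumf;
}
```

Copying the window first is what makes the in-place (identity) case safe in this sketch: the state is only overwritten after every tap has been read.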
--- a/backend/cpp/llama-cpp/patches/paged/0029-qwen35-blocktable-within-step-cache.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0029-qwen35-blocktable-within-step-cache.patch
@@ -1,176 +0,0 @@
-From e2acb3bca4d12ecef4964a214d397fc91ecfcebc Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Sat, 27 Jun 2026 03:45:19 +0200
-Subject: [PATCH] feat(paged): block-table within-step host cache (patch 0029)
-
-Lever 5 (host pipeline). get_block_table() is called once per full-attention
-layer per decode step, but the KV cell layout (and therefore the block table)
-is fixed for the whole step: it only changes in apply() when the ubatch's slots
-are committed. The old path recomputed the full table on every layer.
-
-This caches the table the first time it is built in a step and reuses the bytes
-(memcpy) for every subsequent full-attention layer, invalidating the cache in
-apply(). The reused bytes are identical to a fresh compute, so the change is
-bit-exact. Toggle off with LLAMA_PAGED_NO_BT_CACHE=1.
-
-Measured host-side get_block_table time (llama-batched-bench, npp128 ntg128
-npl128, cache OFF -> ON):
- MoE  q36-35b-a3b-nvfp4: 112.94 -> 14.82 ms  (-87%)
- dense q36-27b-nvfp4   : 193.78 -> 16.90 ms  (-91%)
-
-Throughput: dense is partly host-bound and gains (TG 364.8 -> 374.7 t/s,
-+2.7%, ~95.8% of the vLLM 391 t/s reference @npl128). MoE decode is compute-
-bound (FP4 GEMM dominates) so the saved host time is off the critical path and
-TG is flat (752.2 -> 757.0 t/s). The cache is therefore a pure pipeline cleanup,
-not a numeric change.
-
-Bit-exact, per path (llama-completion --temp 0 --seed 1, 48 tok):
- non-paged MoE   = 07db32c2bcb78d17a43ed18bc22705cd  (unchanged baseline)
- paged MoE       = 8cb0ce23777bf55f92f63d0292c756b0  (paged baseline)
- paged MoE cache OFF == cache ON (both 8cb0ce23)
- dense non-paged == dense paged = 5951a5b4d624ce891e22ab5fca9bc439
-
-The paged-MoE md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a
-benign FP-accumulation-order difference of the paged attention reduction, not a
-bug: KL-divergence vs the f16 reference (16 chunks, c512) gives KLD(paged||f16)
-= 0.13600 <= KLD(nonpaged||f16) = 0.13660 and PPL(paged) = 7.4009 ~
-PPL(nonpaged) = 7.3896 (within +/- 0.29). See PAGED_BITEXACT_NOTE.md and
-LEVER5_HOSTPIPE_RESULTS.md.
-
-Includes the [L5INSTR] host-timing instrumentation used to measure the lever.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- src/llama-context.cpp  |  7 +++++++
- src/llama-kv-cache.cpp | 28 +++++++++++++++++++++++++++-
- src/llama-kv-cache.h   |  9 +++++++++
- src/paged-attn.cpp     |  9 +++++++++
- 4 files changed, 52 insertions(+), 1 deletion(-)
-
-diff --git a/src/llama-context.cpp b/src/llama-context.cpp
-index 5c90c48..ad7939e 100644
--- a/src/llama-context.cpp
-+++ b/src/llama-context.cpp
-@@ -1306,7 +1306,11 @@ bool llama_context::set_adapter_cvec(
-     return res;
- }
- 
-+extern "C" void l5_add_setinp(double ns);
-+extern "C" void l5_add_hostproc(double ns);
-+static inline double l5c_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; }
- llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, llm_graph_type gtype, llama_memory_context_i * mctx, ggml_status & ret) {
-+    double _l5_t0=l5c_now_ns();
-     if (mctx && !mctx->apply()) {
-         LLAMA_LOG_ERROR("%s: failed to apply memory context\n", __func__);
-         ret = GGML_STATUS_FAILED;
-@@ -1361,11 +1365,14 @@ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, ll
-         //const auto t_start_us = ggml_time_us();
- 
-         // FIXME this call causes a crash if any model inputs were not used in the graph and were therefore not allocated
-+        double _l5_si=l5c_now_ns();
-         res->set_inputs(&ubatch);
-+        l5_add_setinp(l5c_now_ns()-_l5_si);
- 
-         //LLAMA_LOG_INFO("graph set inputs time: %.3f ms\n", (ggml_time_us() - t_start_us)/1000.0);
-     }
- 
-+    l5_add_hostproc(l5c_now_ns()-_l5_t0);
-     const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
-     if (status != GGML_STATUS_SUCCESS) {
-         LLAMA_LOG_ERROR("%s: failed to compute graph, compute status: %d\n", __func__, status);
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 21b8f1e..17aaf40 100644
--- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -2772,6 +2772,9 @@ bool llama_kv_cache_context::apply() {
-     kv->apply_ubatch(sinfos[i_cur], ubatches[i_cur]);
-     n_kv = kv->get_n_kv(sinfos[i_cur]);
- 
-+    // the cells for this ubatch just changed -> drop the cached block table
-+    bt_cache_valid = false;
-+
-     return true;
- }
- 
-@@ -2814,7 +2817,30 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
- }
- 
- void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const {
-    kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]);
-+    const auto & sinfo = sinfos[i_cur];
-+    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
-+    const size_t total = (size_t) ns * n_blk;
-+
-+    // within-step reuse: all full-attention layers of a step request the same
-+    // table (same i_cur/n_blk, cells fixed since apply()). The bytes are
-+    // identical to a fresh compute, so this is bit-exact.
-+    static const bool nocache = (getenv("LLAMA_PAGED_NO_BT_CACHE") != nullptr);
-+    if (nocache) {
-+        kv->get_block_table(dst, n_blk, n_kv, sinfo);
-+        return;
-+    }
-+
-+    if (bt_cache_valid && bt_cache_n_blk == n_blk && bt_cache.size() == total) {
-+        memcpy(dst, bt_cache.data(), total * sizeof(int32_t));
-+        return;
-+    }
-+
-+    kv->get_block_table(dst, n_blk, n_kv, sinfo);
-+
-+    bt_cache.resize(total);
-+    memcpy(bt_cache.data(), dst, total * sizeof(int32_t));
-+    bt_cache_n_blk = n_blk;
-+    bt_cache_valid = true;
- }
- 
- ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
-diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
-index e9980b6..b03de78 100644
--- a/src/llama-kv-cache.h
-+++ b/src/llama-kv-cache.h
-@@ -451,4 +451,13 @@ private:
-     // a heuristic, to avoid attending the full cache if it is not yet utilized
-     // as the cache gets filled, the benefit from this heuristic disappears
-     int32_t n_kv;
-+
-+    // [paged L5] within-step block-table cache. get_block_table() is called once
-+    // per full-attention layer per decode step, but the cell layout (and hence
-+    // the table) is identical across all layers of a step. Compute it on the
-+    // first call and reuse the bytes for the rest; invalidated in apply() when
-+    // the ubatch's slots are committed (the only host-side mutation per step).
-+    mutable std::vector<int32_t> bt_cache;
-+    mutable uint32_t bt_cache_n_blk = 0;
-+    mutable bool     bt_cache_valid = false;
- };
-diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
-index fed8ca9..ebd92be 100644
--- a/src/paged-attn.cpp
-+++ b/src/paged-attn.cpp
-@@ -8,6 +8,13 @@
- 
- #include <cstdlib>
- #include <cstdio>
-+#include <ctime>
-+namespace { static inline double l5_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; } }
-+double g_l5_t_gbt=0, g_l5_t_setinp=0, g_l5_t_hostproc=0; long g_l5_n_gbt=0, g_l5_n_setinp=0, g_l5_n_hostproc=0;
-+extern "C" void l5_add_setinp(double ns){ g_l5_t_setinp+=ns; g_l5_n_setinp++; }
-+extern "C" void l5_add_hostproc(double ns){ g_l5_t_hostproc+=ns; g_l5_n_hostproc++; }
-+namespace { struct L5Printer { ~L5Printer(){ fprintf(stderr,"[L5INSTR] get_block_table n=%ld sum=%.2fms mean=%.4fms | set_inputs n=%ld sum=%.2fms mean=%.4fms | hostproc n=%ld sum=%.2fms mean=%.4fms\n", g_l5_n_gbt, g_l5_t_gbt/1e6, g_l5_n_gbt? g_l5_t_gbt/1e6/g_l5_n_gbt:0.0, g_l5_n_setinp, g_l5_t_setinp/1e6, g_l5_n_setinp? g_l5_t_setinp/1e6/g_l5_n_setinp:0.0, g_l5_n_hostproc, g_l5_t_hostproc/1e6, g_l5_n_hostproc? g_l5_t_hostproc/1e6/g_l5_n_hostproc:0.0 ); } } g_l5_printer; }
-+
- 
- namespace paged_attn {
- 
-@@ -54,7 +61,9 @@ public:
-     void set_input(const llama_ubatch * ubatch) override {
-         GGML_UNUSED(ubatch);
-         GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
-+        double _t=l5_now_ns();
-         mctx->get_block_table((int32_t *) idxs->data, n_blk);
-+        g_l5_t_gbt += l5_now_ns()-_t; g_l5_n_gbt++;
-     }
- 
-     const llama_kv_cache_context * mctx;
-- 
-2.43.0
-
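The within-step caching idea of patch 0029 (build the table once, memcpy it for later layers, invalidate when the next ubatch is applied) can be sketched in isolation as below. This is an assumption-laden illustration: `BlockTableCache`, `compute`, and `n_computes` are hypothetical names, and the validity check keys only on the total element count, whereas the patch also compares `n_blk`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

// Hypothetical stand-in for the patched KV-cache context; names are illustrative.
struct BlockTableCache {
    // The expensive table build (kv->get_block_table in the patch).
    std::function<void(int32_t *, std::size_t)> compute;
    std::vector<int32_t> cached;
    bool valid      = false;
    int  n_computes = 0; // counts real builds, for demonstration only

    // Called once per full-attention layer within a decode step.
    void get(int32_t * dst, std::size_t total) {
        if (valid && cached.size() == total) {
            // Reuse the bytes built earlier in this step.
            std::memcpy(dst, cached.data(), total * sizeof(int32_t));
            return;
        }
        compute(dst, total);
        ++n_computes;
        cached.assign(dst, dst + total);
        valid = true;
    }

    // Called when a new ubatch's slots are committed (apply() in the patch).
    void invalidate() { valid = false; }
};
```

Because the reused bytes are a copy of what a fresh build produced for the same cell layout, a later layer in the same step sees the same table either way; only the invalidation point has to be right.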
--- a/backend/cpp/llama-cpp/patches/paged/0030-fused-op-backend-gate.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0030-fused-op-backend-gate.patch
@@ -1,106 +0,0 @@
-From a095f4ebeefafd16dd54c514eb86148fa46daef3 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Sat, 27 Jun 2026 07:30:43 +0000
-Subject: [PATCH] feat(paged): backend-gate fused GDN/discriminated SSM_CONV
- emission (patch 0030)
-
-Closes the latent silent-miscompute hazard (audit RISKY-1). The fused/in-place
-Gated Delta Net op (0018/0019/0026: ggml_gated_delta_net_inplace[_ids][_hybrid])
-and the discriminated SSM_CONV decode op (0021/0028: ggml_ssm_conv_update_inplace
-[_ids], which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET with extra src
-slots - a non-null src[3]/src[4] ring/ids discriminator) are emitted DEFAULT-ON
-(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) but are implemented for the
-CUDA-family TU (CUDA / HIP "ROCm" / "MUSA", hipified ggml-cuda) and the CPU
-reference ONLY.
-
-The hazard: a compute backend that supports PLAIN GGML_OP_SSM_CONV but ignores
-the src[3]/src[4] discriminator (Vulkan/SYCL/Metal) reports supports_op==true for
-the node and the scheduler assigns the discriminated conv to it; it then runs the
-wrong plain conv => SILENT corruption (not a crash). The upstream auto_fgdn
-device-mismatch resolution only inspects GATED_DELTA_NET nodes, so the
-discriminated-SSM_CONV safety was only incidentally covered (it happened to share
-backend coverage with the GDN op); it becomes live the moment a non-CUDA paged
-build of a gated-DeltaNet model exists.
-
-FIX: gate the fused-op emission on the active compute backend type. Before the
-auto_fgdn resolution in llama_context::sched_reserve(), if any non-CPU compute
-backend is not CUDA-family (reg name != "CUDA"/"ROCm"/"MUSA"), force
-fused_gdn_ar = fused_gdn_ch = auto_fgdn = false. Every emission site keys off
-these flags (conv_decode_fused = ... && fused_gdn_ar; fused = ... fused_gdn_ar/ch),
-so disabling them routes the graph to the upstream non-fused path: a PLAIN
-ggml_ssm_conv (no discriminator) + ggml_silu, which every backend handles
-correctly. This makes the discriminated-op safety explicit and decoupled from the
-GDN-op device-mismatch heuristic.
-
-INVARIANT (CUDA byte-identical): on a CUDA backend the reg name is "CUDA", so
-fgdn_backend_ok stays true, the flags are left untouched, and the emitted decode
-graph is unchanged - byte-identical to pre-0030. The fix only changes behavior on
-non-CUDA/non-CPU backends.
-
-GATE compile: CPU-only build (GGML_CUDA=OFF) of the full series (pin 9d5d882d +
-0001-0029 + this) links libllama.so and test-backend-ops with 0 errors; the
-edited llama-context.cpp compiles clean (uses only already-included <cstring> +
-backend-reg API already used in this TU). test-backend-ops correctness for
-SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET is a
-CUDA0-vs-CPU comparison (CPU-only run skips CPU-vs-CPU); the test cases are
-registered and exercised on the CUDA DGX run.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- src/llama-context.cpp | 39 +++++++++++++++++++++++++++++++++++++++
- 1 file changed, 39 insertions(+)
-
-diff --git a/src/llama-context.cpp b/src/llama-context.cpp
-index ad7939e..c408eef 100644
--- a/src/llama-context.cpp
-+++ b/src/llama-context.cpp
-@@ -521,6 +521,45 @@ void llama_context::sched_reserve() {
-         cparams.auto_fa = false;
-     }
- 
-+    // RISKY-1 guard: the fused/in-place Gated Delta Net op and the discriminated
-+    // SSM_CONV (which reuse GGML_OP_GATED_DELTA_NET / GGML_OP_SSM_CONV with extra
-+    // src slots - a non-null src[3]/src[4] ring/ids discriminator) are only
-+    // implemented for the CUDA-family backends (CUDA / HIP "ROCm" / "MUSA" - all
-+    // built from the hipified ggml-cuda TU) and the CPU reference. Any other
-+    // compute backend (Vulkan/SYCL/Metal/...) that supports *plain* SSM_CONV but
-+    // ignores the discriminator src would silently run the WRONG conv. The
-+    // upstream auto_fgdn device-mismatch check below only inspects
-+    // GATED_DELTA_NET nodes, so couple the discriminated-SSM_CONV safety
-+    // explicitly to the backend type here: keep the fused path enabled only when
-+    // every non-CPU compute backend is CUDA-family. On CUDA this leaves the flags
-+    // untouched, so the emitted decode graph is byte-identical.
-+    if (cparams.fused_gdn_ar || cparams.fused_gdn_ch) {
-+        bool fgdn_backend_ok = true;
-+        for (auto & backend : backends) {
-+            ggml_backend_dev_t dev = ggml_backend_get_device(backend.get());
-+            if (!dev || ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) {
-+                // CPU reference handles the fused/discriminated ops
-+                continue;
-+            }
-+            ggml_backend_reg_t reg  = ggml_backend_dev_backend_reg(dev);
-+            const char *       name = reg ? ggml_backend_reg_name(reg) : "";
-+            // GGML_CUDA_NAME is "CUDA" / "ROCm" (HIP) / "MUSA"; all three are the
-+            // same ggml-cuda TU that carries the discriminated-op handling.
-+            if (strcmp(name, "CUDA") != 0 && strcmp(name, "ROCm") != 0 && strcmp(name, "MUSA") != 0) {
-+                fgdn_backend_ok = false;
-+                break;
-+            }
-+        }
-+
-+        if (!fgdn_backend_ok) {
-+            cparams.fused_gdn_ar = false;
-+            cparams.fused_gdn_ch = false;
-+            cparams.auto_fgdn    = false;
-+            LLAMA_LOG_INFO("%s: fused Gated Delta Net / discriminated SSM_CONV disabled "
-+                    "(compute backend is not CUDA/HIP/MUSA/CPU)\n", __func__);
-+        }
-+    }
-+
-     if (cparams.auto_fgdn) {
-         LLAMA_LOG_INFO("%s: resolving fused Gated Delta Net support:\n", __func__);
- 
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md
+++ b/backend/cpp/llama-cpp/patches/paged/LOCALAI_LLAMACPP_BACKEND_PLAN.md
@@ -1,507 +0,0 @@
-# Plan: ship the paged llama.cpp as its OWN backend + NVFP4 Qwen3.6 gallery items
-
-Scoping deliverable only. NOTHING is changed by this document. It is grounded in the
-actual repo structure (read 2026-06-26 in worktree feat+paged-attention), not assumptions.
-
-================================================================================
-0. GROUND TRUTH (what the repo actually does today)
-================================================================================
-
-The paged patchset is ALREADY integrated into the stock llama-cpp backend in this
-worktree. Two mechanisms, both already present:
-
-  (a) BUILD: backend/cpp/llama-cpp/Makefile has `LLAMA_PAGED?=on`. The `llama.cpp:`
-      target git-applies patches/0*.patch (base series) then, when LLAMA_PAGED != off,
-      patches/paged/0*.patch (the 0018-0023 paged series + the earlier 0001-0017).
-      prepare.sh has a fallback `patch`-based apply guarded by a sentinel
-      (llama.cpp/src/paged-kv-manager.cpp). So a stock `make backends/llama-cpp` TODAY
-      already ships the paged engine compiled in.
-
-  (b) RUNTIME GATING: backend/cpp/llama-cpp/grpc-server.cpp ALREADY carries the option
-      hooks (lines ~752-842). They only call setenv() before context init:
-        - option `kv_paged` / `paged_kv` / `paged_attention`  -> setenv LLAMA_KV_PAGED=1
-        - option `kv_paged_debug` / `paged_kv_debug`          -> setenv LLAMA_KV_PAGED_DEBUG=1
-        - option `max_prefill_tokens` / `mpt` / `prefill_budget` -> setenv LLAMA_PREFILL_BUDGET
-        - option `max_batch_tokens` / `mbt`                   -> setenv LLAMA_MAX_BATCH_TOKENS
-        - option `prefill_cap`                                -> setenv LLAMA_PREFILL_CAP
-      Against UNPATCHED llama.cpp these setenv() calls are inert (nothing reads the env),
-      so grpc-server.cpp is byte-safe to share between a clean build and a paged build.
-      The paged engine itself lives entirely inside the patched llama.cpp lib
-      (paged-kv-manager.cpp etc.), NOT in grpc-server.cpp.
-
-Conclusion: "stock llama-cpp + paged patchset, runtime-gated" is the CURRENT state of
-ONE backend. The task is to SPLIT that into two backends:
-  - llama-cpp  = clean upstream llama.cpp (de-risked: a dep-bump can never break on a
-                 paged hook), grpc-server.cpp keeps the dormant hooks.
-  - <newname>  = stock grpc-server.cpp + paged patch series applied + paged on.
-
-The turboquant backend is the EXACT precedent for "a llama.cpp variant that reuses the
-backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile + its own Dockerfile
-+ its own matrix rows". Copy turboquant's shape, with two simplifications (see section 1).
-
-CPU_ALL_VARIANTS reuse: backend/cpp/llama-cpp/Makefile already has `llama-cpp-cpu-all`
-(one grpc-server + dlopen libggml-cpu-*.so via -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS,
-SHARED_LIBS=ON make-var). turboquant mirrors it with `turboquant-cpu-all`. The new backend
-gets the same single-build CPU target for free by reusing the same Makefile machinery.
-
--------------------------------------------------------------------------------
-RECOMMENDED BACKEND NAME: `llama-cpp-paged`  (see section 4 for the full rationale)
--------------------------------------------------------------------------------
-Everywhere below, NAME = llama-cpp-paged, DOCKERFILE = Dockerfile.llama-cpp-paged,
-SRC DIR = backend/cpp/llama-cpp-paged/, MAKE VAR = BACKEND_LLAMA_CPP_PAGED.
-DO NOT use the dotted working name `localai-llama.cpp`: a dot in Dockerfile.<suffix> and
-in the tag-suffix is unprecedented (every sibling is hyphenated: llama-cpp, ik-llama-cpp,
-turboquant, ds4) and complicates the changed-backends.js endsWith() suffix matching.
-
-================================================================================
-1. NEW BACKEND - file by file
-================================================================================
-
--------------------------------------------------------------------------------
-1.1 backend/cpp/llama-cpp/Makefile  (the ONE necessary touch to stock)
--------------------------------------------------------------------------------
-Change exactly one default so the STOCK image ships clean against upstream:
-
-    -LLAMA_PAGED?=on
-    +LLAMA_PAGED?=off
-
-Why: this is the entire point of the split - stock llama-cpp must build clean so an
-upstream LLAMA_VERSION bump can never fail on a paged hook. The runtime hooks in
-grpc-server.cpp stay (inert). The new backend forces LLAMA_PAGED=on explicitly (1.2), so
-it does not depend on this default. NOTE this DOES change stock's shipped artifact (it
-currently ships paged-compiled-in-but-gated); that is intended de-risking, call it out in
-the PR. If the team prefers stock literally untouched, the alternative is to leave
-`?=on` and accept that stock keeps carrying the patch series - but then "clean stock" is
-not achieved. Recommendation: flip to off.
-
-(No other change to backend/cpp/llama-cpp/ - grpc-server.cpp, CMakeLists.txt, prepare.sh,
-patches/, patches/paged/ are all reused as-is by the new backend.)
-
--------------------------------------------------------------------------------
-1.2 backend/cpp/llama-cpp-paged/Makefile  (NEW - thin wrapper, model on turboquant)
--------------------------------------------------------------------------------
-Mirror backend/cpp/turboquant/Makefile, but SIMPLER (two things turboquant needs that we
-do NOT):
-  - turboquant overrides LLAMA_REPO/LLAMA_VERSION to a fork. We use the SAME upstream pin
-    as stock (it lives in backend/cpp/llama-cpp/Makefile, already auto-bumped). So we do
-    NOT set LLAMA_VERSION here -> no bump_deps.yaml entry needed (big simplification vs
-    turboquant). We only force LLAMA_PAGED=on.
-  - turboquant runs patch-grpc-server.sh (augments the KV-cache type allow-list) and
-    apply-patches.sh (fork catch-up). We need NEITHER: grpc-server.cpp already has the
-    paged hooks, and the paged patch series is applied by the copied llama-cpp Makefile's
-    own `llama.cpp:` target when LLAMA_PAGED=on.
-
-Shape (one flavor shown; replicate the turboquant flavor set: avx/avx2/avx512/fallback/
-cpu-all/grpc/rpc-server):
-
-    LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
-
-    define paged-build   # $(1)=flavor $(2)=cmake flags $(3)=target
-      rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
-      cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
-      $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build purge
-      # clone upstream + apply base AND paged patch series (LLAMA_PAGED=on forces it)
-      LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build llama.cpp
-      CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_PAGED=on \
-        $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build grpc-server
-      cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build/grpc-server llama-cpp-paged-$(1)
-    endef
-
-    llama-cpp-paged-cpu-all:
-      # identical to turboquant-cpu-all: SHARED_LIBS=ON + GGML_BACKEND_DL + CPU_ALL_VARIANTS
-      # + --target ggml; then collect ggml-shared-libs/ for package.sh to bundle.
-      ... LLAMA_PAGED=on SHARED_LIBS=ON \
-          EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" \
-          TARGET="--target grpc-server --target ggml" ...
-
-    package: ; bash package.sh
-    purge:   ; rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-*-build; rm -rf llama-cpp-paged-* package
-    clean: purge
-
-Binaries are named llama-cpp-paged-{cpu-all,fallback,grpc,rpc-server,...} so run.sh and
-package.sh glob them.
-
--------------------------------------------------------------------------------
-1.3 backend/cpp/llama-cpp-paged/run.sh  (NEW - copy turboquant/run.sh, rename binaries)
--------------------------------------------------------------------------------
-s/turboquant/llama-cpp-paged/g. Prefers llama-cpp-paged-cpu-all if present, falls back to
-llama-cpp-paged-fallback; llama-cpp-paged-grpc when LLAMACPP_GRPC_SERVERS set; Darwin
-DYLD_LIBRARY_PATH branch; lib/ld.so launch. Keep verbatim otherwise.
-
--------------------------------------------------------------------------------
-1.4 backend/cpp/llama-cpp-paged/package.sh  (NEW - copy turboquant/package.sh, rename)
--------------------------------------------------------------------------------
-s/turboquant/llama-cpp-paged/g. Copies llama-cpp-paged-* into package/, bundles
-ggml-shared-libs/*.so* into package/lib (the CPU_ALL_VARIANTS dlopen set), copies run.sh,
-and the per-arch libc/ld.so set (unchanged).
-
--------------------------------------------------------------------------------
-1.5 backend/Dockerfile.llama-cpp-paged  (NEW - copy Dockerfile.turboquant, swap paths)
--------------------------------------------------------------------------------
-Identical 3-stage structure (builder-fromsource / builder-prebuilt / FROM scratch). Edits:
-  - bind/run .docker/llama-cpp-paged-compile.sh (new, 1.6) instead of turboquant-compile.sh
-  - ccache id: id=llama-cpp-paged-ccache-${TARGETARCH}-${BUILD_TYPE}
-    (OPTIONAL OPTIMIZATION: set id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE} to SHARE
-     stock llama-cpp's ccache - the paged TUs are mostly byte-identical to stock, so a warm
-     stock cache would give the paged build near-free object reuse. Trade-off: a regression
-     in one could surface as a cold miss in the other. Recommend sharing; revisit if noisy.)
-  - both `make -BC /LocalAI/backend/cpp/llama-cpp-paged package`
-  - final COPY --from=builder /LocalAI/backend/cpp/llama-cpp-paged/package/. ./
-
--------------------------------------------------------------------------------
-1.6 .docker/llama-cpp-paged-compile.sh  (NEW - copy llama-cpp-compile.sh, swap make targets)
--------------------------------------------------------------------------------
-Identical to .docker/llama-cpp-compile.sh except `cd .../llama-cpp-paged` and call
-`make llama-cpp-paged-cpu-all` (BUILD_TYPE empty / CPU) or `make llama-cpp-paged-fallback`
-(GPU), then `make llama-cpp-paged-grpc` + `make llama-cpp-paged-rpc-server`. Keep the
-arm64 gcc-14 apt step (CPU_ALL_VARIANTS armv9.2 SME needs gcc-14). ccache export unchanged.
-
--------------------------------------------------------------------------------
-1.7 Makefile (top-level) - 6 edits, mirror the turboquant lines
--------------------------------------------------------------------------------
-  a) .NOTPARALLEL (line 2): append `backends/llama-cpp-paged`
-  b) Backend def (after BACKEND_TURBOQUANT, line ~1172):
-       # llama-cpp-paged = stock llama.cpp grpc-server + LocalAI paged-attention patch
-       # series (LLAMA_PAGED=on). Reuses backend/cpp/llama-cpp sources via a thin wrapper.
-       BACKEND_LLAMA_CPP_PAGED = llama-cpp-paged|llama-cpp-paged|.|false|false
-     (lang field `llama-cpp-paged` -> Dockerfile.llama-cpp-paged, matching the
-      llama-cpp / ik-llama-cpp / turboquant convention where lang==backend name.)
-  c) generate-docker-build-target eval (after BACKEND_TURBOQUANT, line ~1273):
-       $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_PAGED)))
-  d) docker-build-backends (line ~1337): append docker-build-llama-cpp-paged
-  e) test-extra-backend-llama-cpp-paged target (mirror test-extra-backend-turboquant,
-     line ~673): BACKEND_IMAGE=local-ai-backend:llama-cpp-paged $(MAKE) test-extra-backend
-  f) (optional) backends/llama-cpp-paged-darwin target if shipping metal (mirror
-     backends/llama-cpp-darwin at line 1124; see 1.11).
-
--------------------------------------------------------------------------------
-1.8 .github/backend-matrix.yml - add rows (mirror every llama-cpp row, swap names)
--------------------------------------------------------------------------------
-For EACH variant you choose to ship (see phased recommendation in section 4), add a row
-copied from the corresponding llama-cpp row with:
-  - backend: "llama-cpp-paged"
-  - dockerfile: "./backend/Dockerfile.llama-cpp-paged"
-  - tag-suffix: swap `-llama-cpp` -> `-llama-cpp-paged`
-    (e.g. -cpu-llama-cpp -> -cpu-llama-cpp-paged;
-           -gpu-nvidia-cuda-12-llama-cpp -> -gpu-nvidia-cuda-12-llama-cpp-paged; etc.)
-  - builder-base-image: UNCHANGED - reuse the same base-grpc-* tags as llama-cpp
-    (this backend compiles the same gRPC + same toolchain; no new base-images.yml variant
-     is needed, so NO base-images bootstrap step). This is the cheap-variant payoff.
-  - CPU: TWO per-arch rows (amd64 ubuntu-latest + arm64 ubuntu-24.04-arm) sharing
-    tag-suffix '-cpu-llama-cpp-paged' so changed-backends.js emits a merge-matrix entry and
-    backend-merge-jobs assembles the manifest list. Same per-arch native + manifest-merge
-    pattern as -cpu-llama-cpp.
-  - Darwin (if shipping): add to includeDarwin:
-      - backend: "llama-cpp-paged"
-        tag-suffix: "-metal-darwin-arm64-llama-cpp-paged"
-        lang: "go"
-    (omit build-type, exactly like the llama-cpp darwin row at line 4908.)
-
-  REMINDER: the CI path filter only builds a backend on a PR when a file under its dir
-  changes. The PR that adds this backend touches backend/cpp/llama-cpp-paged/* so it self-
-  triggers. But also add the cross-trigger in 1.9 so future edits to backend/cpp/llama-cpp/
-  (the shared source) retrigger this backend too.
-
--------------------------------------------------------------------------------
-1.9 scripts/changed-backends.js - three edits (mirror turboquant exactly)
--------------------------------------------------------------------------------
-  a) inferBackendPath(): add BEFORE the generic `endsWith("llama-cpp")` branch (line 56),
-     next to the turboquant branch (line 45):
-       if (item.dockerfile.endsWith("llama-cpp-paged")) {
-         // reuses backend/cpp/llama-cpp sources via a thin wrapper Makefile
-         return `backend/cpp/llama-cpp-paged/`;
-       }
-     ORDER MATTERS: "Dockerfile.llama-cpp-paged".endsWith("llama-cpp") is false today, but
-     keep the specific branch first regardless (defensive, and returns the right path).
-  b) inferBackendPathDarwin(): add a case (next to the llama-cpp one at line 66):
-       if (item.backend === "llama-cpp-paged") { return `backend/cpp/llama-cpp-paged/`; }
-  c) Per-backend cross-trigger (lines 274-278, mirror the turboquant block):
-       if (backend === "llama-cpp-paged" && !changed) {
-         changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
-       }
-  Verify: node -e "... e.dockerfile.endsWith('llama-cpp-paged') ..." per adding-backends.md.
-
--------------------------------------------------------------------------------
-1.10 backend/index.yaml - meta + image entries (META-BACKEND - capabilities map, NO uri)
--------------------------------------------------------------------------------
-GOTCHA (project_backend_meta_gotcha): a backend that ships per-platform images MUST be a
-meta backend = an anchor with a `capabilities:` map and NO top-level `uri:`; the concrete
-per-platform entries carry the uri. Copy the *llamacpp anchor (lines 3-31).
-
-  Step a - meta anchor in `## metas` (after *turboquant, ~line 74):
-    - &llamacpppaged
-      name: "llama-cpp-paged"
-      alias: "llama-cpp-paged"
-      license: mit
-      icon: <same as llama-cpp>
-      description: |
-        LocalAI's paged-attention llama.cpp: on-demand paged KV cache + decode-first
-        prefill budget. Stock llama.cpp grpc-server + the LocalAI paged patch series.
-        Tuned for NVFP4 dense/MoE on Blackwell/GB10. Reuses the llama-cpp gRPC server.
-      urls: [ https://github.com/ggerganov/llama.cpp ]
-      tags: [ text-to-text, LLM, CPU, GPU, CUDA, Metal, paged-attention, nvfp4 ]
-      capabilities:
-        default: "cpu-llama-cpp-paged"
-        nvidia: "cuda12-llama-cpp-paged"
-        nvidia-cuda-12: "cuda12-llama-cpp-paged"
-        nvidia-cuda-13: "cuda13-llama-cpp-paged"
-        nvidia-l4t: "nvidia-l4t-arm64-llama-cpp-paged"
-        nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-paged"
-        nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-paged"
-        metal: "metal-llama-cpp-paged"
-        # add amd/intel/vulkan keys ONLY for variants you actually build (section 4)
-
-  Step b - a `-development` meta (mirror llama-cpp-development, line 1611) with the same
-    capabilities map pointing at the `*-development` image names.
-
-  Step c - concrete image entries at end of file (mirror the llama-cpp block lines
-    2106-2200), one latest + one development per variant, each as:
-      - !!merge <<: *llamacpppaged
-        name: "cpu-llama-cpp-paged"
-        uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-paged"
-        mirrors: [ localai/localai-backends:latest-cpu-llama-cpp-paged ]
-      - !!merge <<: *llamacpppaged
-        name: "cpu-llama-cpp-paged-development"
-        uri: "quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp-paged"
-        mirrors: [ localai/localai-backends:master-cpu-llama-cpp-paged ]
-      ...repeat for cuda12 / cuda13 / l4t / metal etc.
-  The `latest-` / `master-` uri prefix + tag-suffix MUST match the matrix tag-suffix exactly.
-
--------------------------------------------------------------------------------
-1.11 Darwin (only if shipping metal; the NVFP4 target is CUDA, so metal is optional/phase 2)
--------------------------------------------------------------------------------
-If metal is shipped, also:
-  - scripts/build/llama-cpp-paged-darwin.sh (copy scripts/build/llama-cpp-darwin.sh; it
-    drives the 3 CMake variants + otool dylib bundling). Ensure it forces LLAMA_PAGED=on.
-  - Makefile `backends/llama-cpp-paged-darwin` target (mirror backends/llama-cpp-darwin).
-  - backend_build_darwin.yml: add the llama-cpp-paged branch (mirror the llama-cpp-specific
-    step that calls `make backends/llama-cpp-darwin`).
-  - index.yaml metal-llama-cpp-paged / -development image entries (already in 1.10).
-  - C++ proto gotcha already handled (reuses llama-cpp CMakeLists.txt with hw_grpc_proto
-    linking protobuf/grpc++), so no Homebrew-include failure.
-
--------------------------------------------------------------------------------
-1.12 Importer / /backends/known dropdown  (drop-in, NOT a new importer)
--------------------------------------------------------------------------------
-This backend consumes GGUF exactly like llama-cpp -> extend the EXISTING importer, do not
-add a new one (per adding-backends.md rule 2). Edit core/gallery/importers/llama-cpp.go:
-  - AdditionalBackends() (line 37): append
-      {Name: "llama-cpp-paged", Modality: "text",
-       Description: "Paged-attention llama.cpp (on-demand paged KV + decode-first budget)"}
-  - Import() backend allow-list (line 133): add "llama-cpp-paged" to the switch case so a
-      preferences.backend == "llama-cpp-paged" is honored:
-        case "ik-llama-cpp", "turboquant", "llama-cpp-paged": backend = b
-  - core/gallery/importers/importers_test.go: add a table case asserting the preference
-    override emits backend: llama-cpp-paged (Ginkgo/Gomega; reuse an existing public GGUF
-    HF fixture). Run `go test ./core/gallery/importers/...`.
-
--------------------------------------------------------------------------------
-1.13 Docs
--------------------------------------------------------------------------------
-  - docs/content/features/backends.md: add llama-cpp-paged to the text-to-text/LLM list,
-    one line noting paged KV + NVFP4 Blackwell tuning. (Not an in-house from-scratch engine
-    -> it is a llama.cpp variant -> do NOT add to the README maintained-engines table.)
-
--------------------------------------------------------------------------------
-1.14 Does grpc-server.cpp need the paged hooks?  YES - already present, reused unchanged.
--------------------------------------------------------------------------------
-The hooks (kv_paged / max_batch_tokens / prefill_budget / prefill_cap) are already in the
-SHARED backend/cpp/llama-cpp/grpc-server.cpp. The paged backend reuses that file verbatim
-(via the Makefile copy). No patch-grpc-server.sh step is needed (unlike turboquant). The
-hooks are what translate the gallery `options:` (section 2) into the LLAMA_KV_PAGED /
-LLAMA_MAX_BATCH_TOKENS environment variables that the paged llama.cpp lib reads.
-
-================================================================================
-2. GALLERY ITEMS - NVFP4 Qwen3.6 dense + MoE
-================================================================================
-
-Add two entries to gallery/index.yaml. Schema (verified against existing GGUF items and
-the LocalAI config structs): backend selection via `overrides.backend`; runtime knobs via
-either typed config fields (context_size/f16/flash_attention/gpu_layers/batch) or the
-`options:` string list (key:value, parsed by grpc-server.cpp set_option).
-
--------------------------------------------------------------------------------
-2.1 Benchmark llama-server flags -> LocalAI model-config mapping
--------------------------------------------------------------------------------
-  -c 131072                  -> context_size: 131072            (LLMConfig.ContextSize, yaml context_size)
-  -fa on                     -> flash_attention: "on"           (LLMConfig.FlashAttention, yaml flash_attention; string)
-  -ngl 99                    -> gpu_layers: 99                  (LLMConfig.NGPULayers, yaml gpu_layers; or omit -> DefaultNGPULayers offloads all)
-  -b 2048                    -> batch: 2048                     (schema.PredictionOptions.Batch, yaml batch)  [see caveat]
-  --parallel 128             -> options: ["parallel:128"]       (grpc-server.cpp:629; alias n_parallel)
-  LLAMA_KV_PAGED=1           -> options: ["paged_kv:true"]      (grpc-server.cpp:778)
-  LLAMA_MAX_BATCH_TOKENS=512 -> options: ["max_batch_tokens:512"] (grpc-server.cpp:821; alias mbt)
-  f16 KV                     -> f16: true                       (LLMConfig.F16, yaml f16)
-  (recommended for paged)    -> options: ["kv_unified:false"]   (grpc-server.cpp:746 - the per-slot paged
-                                  capacity/memory benefit only materializes with a per-sequence cache;
-                                  the patch comment explicitly recommends pairing paged with kv_unified:false)
-
-  CAVEAT (-ub 512): LocalAI sets params.n_ubatch = params.n_batch = request->nbatch()
-  (grpc-server.cpp:528,532). There is NO separate config field for n_ubatch, so the
-  benchmark's `-b 2048 -ub 512` split is NOT exactly reproducible. Options:
-    (i)  set batch: 512 -> n_batch=n_ubatch=512 (matches -ub; the decode-first
-         max_batch_tokens=512 budget is the dominant prefill lever anyway, and the
-         benchmark states decode throughput is budget-independent), OR
-    (ii) set batch: 2048 -> n_ubatch also 2048 (bigger physical batch, more KV scratch).
-  RECOMMEND (i) batch: 512 for the shipped gallery config (closest to the measured run +
-  lighter memory). Flag separately: a tiny grpc-server.cpp option `n_ubatch`/`ubatch` could
-  be added later to honor -b/-ub independently (not required to ship).
-
--------------------------------------------------------------------------------
-2.2 gallery/index.yaml entry - DENSE  q36-27b-nvfp4
--------------------------------------------------------------------------------
- name: "qwen3.6-27b-nvfp4-paged"
-  url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
-  urls:
-    - https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF      # placeholder, section 3
-  description: |
-    Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF. Configured for LocalAI's
-    paged-attention llama.cpp backend: on-demand paged KV + decode-first prefill budget.
-    Benchmarked on GB10/DGX Spark at 90-117% of vLLM dense decode at 1.5-3x lower memory.
-  license: "apache-2.0"                                         # confirm vs Qwen license
-  tags: [ llm, gguf, nvfp4, reasoning ]
-  icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
-  overrides:
-    backend: llama-cpp-paged
-    f16: true
-    flash_attention: "on"
-    context_size: 131072
-    gpu_layers: 99
-    batch: 512                       # see -ub caveat 2.1; matches the 512 ubatch floor
-    known_usecases: [ chat ]
-    options:
-      - use_jinja:true
-      - paged_kv:true                # LLAMA_KV_PAGED=1
-      - max_batch_tokens:512         # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget)
-      - kv_unified:false             # enables the per-slot paged capacity/memory benefit
-      - parallel:128                 # --parallel 128 serving slots
-    parameters:
-      model: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
-    template:
-      use_tokenizer_template: true
-  files:
-    - filename: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
-      sha256: <FILL after publish>
-      uri: https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF/resolve/main/q36-27b-nvfp4.gguf
-
--------------------------------------------------------------------------------
-2.3 gallery/index.yaml entry - MoE  q36-35b-a3b-nvfp4
--------------------------------------------------------------------------------
-Same shape; the MoE is lighter on memory (~3B active). parallel:128 + budget 256 was the
-MoE decode-throughput sweet spot in the sweep, but 512 is fine as a default; if optimizing
-purely for saturated MoE decode use max_batch_tokens:256.
- name: "qwen3.6-35b-a3b-nvfp4-paged"
-  urls: [ https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF ]
-  ...
-  overrides:
-    backend: llama-cpp-paged
-    f16: true
-    flash_attention: "on"
-    context_size: 131072
-    batch: 512
-    options:
-      - use_jinja:true
-      - paged_kv:true
-      - max_batch_tokens:512          # or 256 for max saturated MoE decode (sweep winner)
-      - kv_unified:false
-      - parallel:128
-    parameters:
-      model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
-  files:
-    - filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
-      sha256: <FILL after publish>
-      uri: https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF/resolve/main/q36-35b-a3b-nvfp4.gguf
-
-Note: these are the BENCHMARK serving configs. For an interactive single-user default you
-may want a second, lighter gallery variant (context_size 16384, parallel 4, drop the budget);
-this is optional and not required to ship the benchmark reproduction.
-
-================================================================================
-3. GGUF PUBLISHING (so the gallery uri: resolves)
-================================================================================
-
-The two GGUFs already exist on the DGX dev box (final_benchmark.csv references
-q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; README.md "Models" + "Benchmarks"
-document provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
-tensors from nvidia modelopt weights). To publish:
-
-  1. HF repos (suggest two, under the org that owns the gallery-referenced weights):
-       <ORG>/Qwen3.6-27B-NVFP4-GGUF      (single q36-27b-nvfp4.gguf)
-       <ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF  (single q36-35b-a3b-nvfp4.gguf)
-     ORG = localai-org (brand) or mudler (personal); pick per ownership of the conversions.
-  2. Upload each .gguf; compute sha256 (sha256sum) and paste into the gallery `files:` sha256
-     (LocalAI verifies it on download). Without sha256 the entry still works but loses the
-     integrity check - fill it.
-  3. Model card metadata: base_model Qwen/Qwen3.6-*, library_name gguf, quantization NVFP4,
-     pipeline_tag text-generation, license (confirm Qwen3.6 license terms - apache-2.0 vs
-     Qwen community license), a note that it REQUIRES the llama-cpp-paged backend (NVFP4 +
-     paged), and the GB10 benchmark table (link README.md "Benchmarks" numbers).
-  4. NVFP4 requires a llama.cpp new enough to read the NVFP4 GGUF type. Confirm the pinned
-     LLAMA_VERSION in backend/cpp/llama-cpp/Makefile supports NVFP4 tensor types (the dev
-     tree that produced the GGUFs did). If the current pin predates NVFP4 GGUF support, the
-     backend pin must be bumped OR the paged patch series must carry the NVFP4 reader. THIS
-     IS A GATING CHECK before the gallery items are usable - verify on a GPU box.
-  5. Provenance/licensing: the dense conversion derives from unsloth; the MoE from nvidia
-     modelopt weights. Ensure redistribution of the converted GGUFs is permitted and
-     attribute upstream in the card.
-
-================================================================================
-4. OPEN DECISIONS / BLOCKERS / BUILD COST
-================================================================================
-
-BACKEND NAME - RECOMMEND `llama-cpp-paged`.
-  - llama-cpp-paged (RECOMMENDED): descriptive (it IS the paged variant), hyphenated like
-    every sibling (llama-cpp/ik-llama-cpp/turboquant/ds4), collision-free in the
-    changed-backends.js endsWith() suffix scheme, self-documenting in the /backends/known
-    importer dropdown. Reads correctly next to "turboquant" and "ik-llama-cpp".
-  - localai-llama-cpp (branding alternative, ACCEPTABLE): keeps the LocalAI brand without a
-    dot; hyphenated and safe. Use this if marketing wants "LocalAI's own llama.cpp" framing.
-    Slightly less self-explanatory about WHAT differs (paged) in the dropdown.
-  - localai-llama.cpp (the working name; NOT RECOMMENDED): the dot makes
-    Dockerfile.localai-llama.cpp and tag-suffix -cpu-localai-llama.cpp the only dotted
-    ones in the repo, and ".cpp" looks like a file extension to the suffix matcher. Avoid.
-
-BLOCKERS / GATING CHECKS (cannot be closed read-only, no GPU here):
-  1. NVFP4 GGUF read support in the pinned LLAMA_VERSION (section 3.4). Must verify on GPU.
-     If unsupported, bump the pin (which also affects stock llama-cpp) or carry the reader.
-  2. The two GGUFs are not yet on HF (section 3). Gallery uri + sha256 are placeholders
-     until upload. Blocks gallery validation only, not the backend build.
-  3. -ub vs -b split (section 2.1) is not exactly reproducible without a tiny grpc-server
-     option; shipped config uses batch:512. Minor, not a blocker.
-  4. Flipping stock LLAMA_PAGED?=off changes stock's shipped artifact (de-risking, intended)
-     - get explicit sign-off since it alters a heavily-used backend's build.
-
-PLATFORM SHIP MATRIX (RECOMMENDED PHASING - the variant is cheap because it reuses the same
-base-grpc-* prebuilt bases and the same compile machinery, so each row is just CI minutes):
-  Phase 1 (the benchmark target - GB10/Blackwell is CUDA):
-    - cuda12 amd64, cuda13 amd64, cuda13 arm64 (sbsa), l4t-cuda-12 arm64  (NVFP4/paged win)
-    - cpu-all amd64 + cpu-all arm64 (the single CPU_ALL_VARIANTS build; baseline coverage)
-  Phase 2 (parity with stock llama-cpp coverage, only if demand):
-    - metal-darwin-arm64 (1.11), vulkan amd64/arm64, rocm amd64, intel sycl f16/f32
-  Defer rocm/sycl/vulkan/metal unless asked - the paged + NVFP4 story is GPU/CUDA-centric
-  and these add CI cost without a clear consumer.
-
-BUILD-COST ESTIMATE PER PLATFORM (with warm base-grpc-* base + ccache; the paged TUs are
-~byte-identical to stock so a SHARED ccache id makes most objects free):
-  - CPU_ALL_VARIANTS (per arch): ~15-30 min warm / ~35-50 min cold. arm64 adds a gcc-14
-    apt step. Two arches + a merge job.
-  - CUDA (per arch): ~25-45 min warm / ~45-75 min cold (nvcc dominates; ccache helps less
-    across CUDA arch flag changes). amd64 cuda12 + cuda13, arm64 cuda13 + l4t = 4 jobs.
-  - Metal/Darwin (if Phase 2): native macos-14 runner, ~20-35 min with the ccache cache.
-  - No base-images.yml change and no bootstrap dispatch (reuses existing base-grpc-* tags),
-    so the only new CI cost is the per-row build minutes above. PR builds read cache, don't
-    write; first master build per row pays the cold cost once, then warm.
-
-VERIFICATION (post-implementation, needs a GPU box - out of scope here):
-  - `make backends/llama-cpp-paged` builds + installs locally (from-source path).
-  - Confirm stock `make backends/llama-cpp` now builds clean (no paged-kv-manager.cpp in the
-    checkout) - proves the split.
-  - Load a published NVFP4 GGUF via the gallery entry, hit /v1/chat/completions, confirm the
-    server log shows LLAMA_KV_PAGED engaged (LLAMA_KV_PAGED_DEBUG trace) and the configured
-    max_batch_tokens/parallel took effect.
-  - go test ./core/gallery/importers/... green (importer drop-in case).
-  - node scripts/changed-backends.js dry-run: editing backend/cpp/llama-cpp/* retriggers
-    llama-cpp-paged (cross-trigger), editing backend/cpp/llama-cpp-paged/* triggers it too.
-
-================================================================================
-END OF PLAN
-================================================================================
--- a/backend/cpp/llama-cpp/patches/paged/PAGED_BITEXACT_NOTE.md
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_BITEXACT_NOTE.md
@@ -1,75 +0,0 @@
-# Paged bit-exactness gate - per path (canonical references)
-
-## TL;DR
-
-The greedy decode of the **paged** path does not byte-match the **non-paged**
-path for the MoE model. This is a **benign FP-accumulation-order difference of
-the paged attention reduction**, KL-validated against the f16 reference. It is
-**not a bug**. The bit-exactness gate is therefore **per path**:
-
-| path | model | canonical md5 |
-|------|-------|---------------|
-| non-paged | MoE q36-35b-a3b-nvfp4   | `07db32c2bcb78d17a43ed18bc22705cd` |
-| paged     | MoE q36-35b-a3b-nvfp4   | `8cb0ce23777bf55f92f63d0292c756b0` |
-| non-paged | dense q36-27b-nvfp4     | `5951a5b4d624ce891e22ab5fca9bc439` |
-| paged     | dense q36-27b-nvfp4     | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) |
-
-Gate command (chat-template / conversation path):
-```
-llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
-                 -n 48 --temp 0 --seed 1
-# paged: prefix with  LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
-```
-Note: use the default chat-template path (do **not** pass `-no-cnv`; raw
-completion lands in a different md5 namespace).
-
-**Future paged-MoE regressions compare to the PAGED reference `8cb0ce23`, not to
-the non-paged `07db32c2`.** Dense is bit-exact across paths, so dense uses the
-single reference `5951a5b4`.
-
-## Why dense is bit-exact but MoE is not
-
-Dense paged decode reproduces the non-paged reduction order exactly, so dense
-greedy md5 is identical across paths. The MoE path runs additional kernels (the
-NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs
-between the paged and non-paged attention layouts. Over a long greedy decode this
-flips a small number of near-tied argmaxes, changing the byte stream. The same
-divergence is present on the 0028 baseline, with `LLAMA_MOE_FORCE_GRAPHS` on or
-off, and with the patch-0029 block-table cache on or off - it is a property of
-the paged attention path, not of any one lever.
-
-## KL evidence that the paged path is sound (the load-bearing check)
-
-`llama-perplexity --kl-divergence` on `q36-35b-a3b-nvfp4.gguf`, 16 chunks,
-`-c 512 -ngl 99 --seed 1`, base logits from the f16 reference
-(`darwin_36b_opus/f16.gguf`, PPL 7.3734):
-
-| comparison | PPL(Q) | KL divergence | Same top p | Cor |
-|------------|-------:|--------------:|-----------:|----:|
-| f16 reference | 7.3734 | - | - | - |
-| **non-paged** vs f16 | 7.3896 | 0.136597 +/- 0.003157 | 84.314% | 97.68% |
-| **paged** vs f16     | 7.4009 | 0.136000 +/- 0.003285 | 84.828% | 97.58% |
-| paged vs non-paged (direct) | 7.4009 (base 7.3818) | 0.050011 +/- 0.001653 | 89.044% | 99.04% |
-
-Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%.
-
-### Verdict: BENIGN
-
-- **Paged does not diverge from the f16 ground truth more than non-paged does.**
-  KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) =
-  7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29
-  error bars). A real paged-MoE correctness bug would push paged measurably
-  *further* from f16; it does not (it is marginally closer).
-- **Paged and non-paged cluster together.** They agree with each other (KLD 0.050,
-  89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, ~84% same-top-p),
-  with essentially zero probability bias. That is the signature of two equivalent
-  FP-reorderings of the same quantized model, both equally approximating the f16
-  ground truth - not a quality regression.
-- The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that
-  heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model
-  logit near-ties are abundant, so a different-but-equivalent reduction order
-  flips ~11% of argmaxes with no quality cost (proven by the equal KLD-to-f16 and
-  zero Delta-p bias).
-
-Therefore the canonical gate is per path, and `8cb0ce23` is the validated paged
-reference for the MoE deployment path.
--- a/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
+++ b/backend/cpp/llama-cpp/patches/paged/PIN_SYNC_c299a92c.md
@@ -1,100 +0,0 @@
-# Pin-sync: paged patch-stack -> llama.cpp c299a92c
-
-Status: COMPLETE. The shipped source-only paged patch series (`0001`-`0030`,
-28 `.patch` files) was advanced from llama.cpp `9d5d882d` to `c299a92c`
-("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
-GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
-path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
-upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
-
-## Upstream jump
-
-- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
-  ("model : Add label for LFM2.5-230M (#25008)")
-- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
-  ("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
-- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
-
-## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
-
-Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
-**zero patch changes**. The already-shipped source-only series (the result of the
-`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
-`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
-`git apply`** (the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
-`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
-28 patches reported "Applied patch ... cleanly", the sentinel
-`src/paged-kv-manager.cpp` was created, and there are **zero** stray
-`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
-intact). `git apply` tolerates `@@` line-number offsets, which absorbed the
-upstream drift; no hunk context broke.
-
-Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
-patch tarball used for the verification has
-`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
-
-## Clean build
-
-Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
-28 patches applied as working-tree changes, then:
-
-```
-cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
-  -DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
-  -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
-cmake --build build-cuda --target llama-completion test-backend-ops -j20
-```
-
-Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
-`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
-
-## GATE: ALL GREEN
-
-Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
-`9d5d882d` build too):
-```
-llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
-                 -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
-# paged dense: prefix  LLAMA_KV_PAGED=1
-# paged MoE:   prefix  LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
-```
-
-(a) greedy md5 - all four paths PASS:
-| path | model | md5 @ c299a92c | baseline | verdict |
-|------|-------|----------------|----------|---------|
-| non-paged | dense `q36-27b-nvfp4`   | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
-| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
-| paged     | dense `q36-27b-nvfp4`   | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
-| paged     | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
-
-(b) `test-backend-ops` (Backend CUDA0) - all PASS:
-| op | result |
-|----|--------|
-| SSM_CONV            | 45/45 OK |
-| SSM_CONV_UPDATE     | 16/16 OK |
-| SSM_CONV_UPDATE_IDS | 16/16 OK |
-| GATED_DELTA_NET     | 84/84 OK |
-| MUL_MAT             | 1146/1146 OK |
-| MUL_MAT_ID          | 806/806 OK |
-
-(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
-series now carries the per-head/gather test cases added by patches `0026`/`0028`;
-all pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
-
-Bit-exactness preserved across the 23-commit upstream jump.
-
-## Canary
-
-`.github/workflows/llama-cpp-paged-canary.yml` and
-`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
-series is source-only and applies strict-clean with no `--exclude`, the canary's
-`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
-the shipped series) and may be removed on a future canary touch; left in place
-here to keep the pin-bump diff minimal.
-
-## Source of truth
-
-The shipped `.patch` files under `backend/cpp/llama-cpp/patches/paged/` are the
-source of truth and are unchanged by this bump. The DGX dev tree
-(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
-the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.
--- a/backend/cpp/llama-cpp/patches/paged/README.md
+++ b/backend/cpp/llama-cpp/patches/paged/README.md
@@ -1,317 +0,0 @@
-# LocalAI paged-attention llama.cpp patch series
-
-This directory holds the vendored patch series that turns stock llama.cpp into
-LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The patches are
-applied on top of a pinned upstream llama.cpp at build time; nothing here is a
-fork - it is a source-only `*.patch` stack plus this single canonical doc.
-
-> One-file rule: this README is the canonical reference for the patch series. The
-> only other docs kept in this directory are operational and linked below:
-> - [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
-> - [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
-> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.
-
---
-
-## 1. What it is
-
-`llama-cpp-localai-paged` is the LocalAI paged-attention llama.cpp backend: a
-vendored patch series over upstream llama.cpp that adds
-
-- a **paged KV cache** (vLLM-style block manager: on-demand fixed-size blocks,
-  free pool, ref-counted blocks) with a **block-table flash-attention** read so
-  the attention kernels index physical cells instead of a contiguous buffer;
-- **cross-request prefix sharing** - concurrent requests that share a long
-  prefix physically reuse one committed copy of the prefix blocks and prefill
-  only their divergent suffix;
-- a **decode-first prefill scheduler** - a dynamic per-step prefill-token budget
-  decoupled from `n_batch`, so a long prefill never freezes co-batched decode;
-- **GB10 / Blackwell NVFP4 decode optimizations** for the Qwen3.6 hybrid
-  gated-DeltaNet (SSM) models, where the recurrent-state plumbing - not the FP4
-  GEMM - dominates the decode step.
-
-It is **pinned to llama.cpp `c299a92c`** ("binaries : Improve rpc-server and
-export-graph-ops names", #25045) and advanced only by a manual, bit-exact-gated
-[pin-sync process](PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
-(see section 7).
-
-The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is
-enabled per-model at runtime via the gallery `options:` knobs (`paged_kv:true`,
-`max_batch_tokens:`, `kv_unified:false`, ...). Against unpatched llama.cpp the
-runtime hooks are inert, so a single `grpc-server.cpp` is shared between the
-clean and the paged build.
-
---
-
-## 2. Architecture
-
-The decode step on these models breaks into three cost centers; the patch series
-attacks each one.
-
-**Paged KV manager + block-table flash-attn.** A host-side `PagedKVManager`
-(`FreeBlockQueue` / `BlockPool` / chained-hash content cache) hands out
-fixed-size KV blocks on demand and reclaims them per-sequence (ref-counted, with
-copy-on-write for shared prefixes). The attention path reads through a **block
-table** - an `I32 [n_view, n_stream]` position-ordered physical-cell index passed
-as `src[5]` of `ggml_flash_attn_ext` - so the CUDA fattn vec/tile kernels and the
-CPU reference map logical KV index `j` to physical cell `block_table[seq*ne11+j]`
-and read K/V in place. Token-position ordering keeps the flash-attn online-softmax
-reduction order identical to stock. With a null block table the kernels take the
-stock contiguous read path, byte-identical to upstream.
-
-**The gated-DeltaNet (GDN / SSM) decode path.** The Qwen3.6 hybrid models are 48
-gated-DeltaNet (linear-attention / SSM) layers + 16 full-attention layers. On
-GB10 the recurrent-state plumbing, not the weight GEMM, is the dominant decode
-cost. The series fuses that plumbing to mirror vLLM's
-`fused_recurrent_gated_delta_rule`: the recurrent state is read from and written
-to its cache slot in place (no copy-back, no `get_rows` materialization), the
-conv state is updated in place, the output projection is reshaped to route to the
-tensor-core MMQ GEMM, and the recurrence kernel is occupancy-retuned - all
-bit-exact (md5-gateable) against the f32 baseline.
-
-**NVFP4 native FP4-MMA on Blackwell.** The NVFP4 dense/expert weight GEMM uses
-Blackwell's native FP4-MMA. The series removes a redundant activation-requantize
-in the MoE broadcast projections (bit-exact byte copy of identical blocks) and
-keeps CUDA graphs on for the grouped-MMQ MoE decode step. These are the only
-NVFP4-specific optimizations; on non-Blackwell hardware the FP4 path falls back
-to dequant.
-
-**The prefill/decode scheduler.** `update_slots()` already emits one unified
-mixed prefill+decode batch per step. The scheduler patches change only the *count*
-of prefill tokens admitted per step: decode tokens are claimed first
-(decode-first), then a dynamic budget `max(n_ubatch, T - D)` (where `D` is the
-live decode load and `T` is `LLAMA_MAX_BATCH_TOKENS`) admits prefill, auto-
-shrinking as decode load rises. Pure scheduler policy, byte-identical when off,
-orthogonal to the paged allocator.
-
---
-
-## 3. Patch series (0001-0030)
-
-28 patches (0005 and 0027 are intentionally unused). "Bit-exact" = greedy md5 /
-`test-backend-ops` byte-identical to the relevant baseline; the gate methodology
-is in section 5.
-
-### Paged-KV core (0001-0012)
-
-| # | What it does | Bit-exact |
-|---|---|---|
-| 0001 | Vendor the host-side paged KV block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache). Pure C++17, nothing uses it yet. | n/a (no behavior) |
-| 0002 | Place each sequence at permuted, non-contiguous block positions in `find_slot` (proves attention is invariant to physical KV placement). | yes (token-identical) |
-| 0003 | Gather K/V/mask down to each stream's non-empty cells before `build_attn_mha`, position-sorted so the FA reduction order matches stock. | yes |
-| 0004 | Drive paged placement through the vendored manager: blocks popped on demand, returned on seq end. Core kv-cache struct untouched. | yes (stock path byte-identical) |
-| 0006 | Host-side cross-request prefix caching: hash prefix blocks, reuse matching physical blocks (ref-count++), COW-privatise before a divergent write. | yes (default off) |
-| 0007 | Wire the prefix cache into the engine so a new sequence physically shares cached prefix blocks and skips recomputing the shared prefix. | yes (verified byte-identical) |
-| 0008 | Wire cross-request prefix share into the llama-server continuous-batch loop so concurrent shared-prefix requests prefill only the suffix (36x fewer prefill tokens at K=32). | within CUDA batch-shape non-determinism band |
-| 0009 | Replace the per-step gather with an **in-kernel paged read** (block table as `src[5]`); the K/V `get_rows` is gone. Decode step at batch32 691->696ms (was 1279ms gathered). | yes on CPU/batch1; GPU batch>1 within vec-vs-mma band |
-| 0010 | Graft the block-table read into the tile kernel; add a dispatch guard so a present block table routes ONLY to vec/tile (never the mma/wmma kernels that ignore it). | yes (CPU byte-identical; vec route) |
-| 0011 | Route the GQA-grouped F16 decode to the **tile kernel** (native head-group reuse) by default; vec for everything else. Paged decode to within 1.8% of stock. | vs stock-mma: different-kernel rounding; bit-exact vs vec |
-| 0012 | Defensive `GGML_ASSERT(n_view % 64 == 0)` so a future pad/tile change can't silently reintroduce a past-end KV leak on the tile route. | yes (additive assert) |
-
-### Decode-first scheduler (0013, 0016)
-
-| # | What it does | Bit-exact |
-|---|---|---|
-| 0013 | `LLAMA_PREFILL_BUDGET`: a static per-step prefill-token budget decoupled from `n_batch` (vLLM `--max-num-batched-tokens` analogue). Flattens the decode ITL spike a long prefill inflicts (8.5x smaller worst freeze). | yes (off/short = byte-identical; == `-b` chunking) |
-| 0016 | Supersede 0013 with a **dynamic decode-first** budget: `max(n_ubatch, T-D)`, auto-shrinking as decode load `D` rises. Policy-only inside `update_slots()`, zero libllama changes. | yes (default-off byte-identical) |
-
-(0014/0015 are the MoE token-tile levers: 0014 adds `LLAMA_MOE_MMQ_X` (opt-in
-high-batch decode micro-opt, +4.8% on Qwen3-Coder-30B), 0015 makes it a
-default-on, density-aware auto-select that is prefill-safe by construction. Both
-bit-exact. 0017 is the dense FP4-GEMM occupancy-tune track: bit-exact gate green,
-but every cheap occupancy lever regressed on GB10, so nothing is enabled - it
-ships as the parity gate + default-off instrumentation only.)
-
-### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)
-
-These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact.
-
-| # | What it does | Effect (dense q36-27b / MoE q36-35b-a3b @npl128) |
-|---|---|---|
-| 0018 | **In-place SSM state write-back** - the recurrence writes its final state directly into the cache slot, removing the ~225MB/copy D2D memcpy (18.9% of decode time). | dense +23.5% / MoE +18.9% |
-| 0019 | **Fused recurrent-state gather** - the op reads each sequence's prior state directly from `cache[ids[seq]]` (no `get_rows` materialization); race-free in-place + ids read. | dense +37.8% / MoE +35.3% |
-| 0020 | **o_proj MMVQ->MMQ reshape** - collapse the GDN output to 2D so the output projection routes to the M=128 tensor-core MMQ GEMM (was a batch<=8 MMVQ GEMV). The single biggest decode-parity lever. | dense +31.7% (->85.9% of vLLM) / MoE +23.3% |
-| 0021 | **Conv-state in-place fusion** - one `ggml_ssm_conv_update_inplace` op replaces the 4-op conv chain (transpose+concat+conv+silu+ring-cpy), writing the shifted ring state in place. | dense +3.2% / MoE +3.5% |
-| 0022 | **GDN recurrence occupancy/coalescing retune** - column-folding (NUM_WARPS/COLS_PER_WARP) raises memory-level parallelism on the bandwidth-bound B=128 recurrence kernel; per-column f32 FMA order unchanged. 73.4%->84.6% of GB10 peak BW. | dense +11.1% / MoE +8.3% |
-| 0028 | **Recurrent conv-tap gather fusion** - the last `k_get_rows` in the GDN decode path (the conv-state tap gather) becomes an indexed in-kernel read. | dense ~377 t/s / MoE ~784 t/s |
-
-### MoE NVFP4 quant (0023, 0025)
-
-| # | What it does | Bit-exact |
-|---|---|---|
-| 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) |
-| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) |
-
-### Pool reclaim, block-table cache, backend gate, opt-in bf16-SSM
-
-| # | What it does | Bit-exact |
-|---|---|---|
-| 0024 | **Paged-pool burst-reclaim** - truncate trailing blocks on partial-tail `seq_rm`, defrag the free queue when idle, release blocks on slot completion. Fixes the long-server burst-degradation bug (post-burst prefill collapse 488->44 t/s, restored to 532). Host-side accounting only. | yes |
-| 0029 | **Block-table within-step host cache** - the block table is fixed for the whole step; cache it on first build and memcpy it for the other full-attention layers (get_block_table -87%/-91%). | yes, per path (paged-MoE ref `8cb0ce23`) |
-| 0030 | **Fused-op backend gate** - the fused GDN / discriminated SSM_CONV ops are CUDA-family + CPU only; force them off on any non-CUDA compute backend so a Vulkan/SYCL/Metal build can't silently run the wrong plain-conv kernel. | yes on CUDA (byte-identical pre-0030); safety gate elsewhere |
-| 0026 | **Hybrid per-head bf16 SSM state (opt-in)** - `--ssm-bf16-tau` / option `ssm_bf16_tau`: fast-decaying GDN heads (memory length below the tau threshold) persist state as bf16, halving that head's decode byte stream (~+12% decode). | default tau=0 = f32 = **bit-exact**; the bf16 mode is **NOT** bit-exact (~91% same-top-p) |
-
---
-
-## 4. Benchmarks
-
-Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
-**Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
-S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
-serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](qwen36_dense_decode_vs_npl.png),
-[`qwen36_moe_decode_vs_npl.png`](qwen36_moe_decode_vs_npl.png); raw data
-[`final_benchmark.csv`](final_benchmark.csv).
-
-### (a) + (b) Patched vs stock vs vLLM
-
-The **stock** and **patched** columns are the same binary, env-toggled, on the
-**same harness** (`llama-batched-bench`) - so "x over stock" is an exact
-apples-to-apples measure of the patch series' contribution. The **vLLM** column
-is a **different harness** (vLLM server + client continuous batching), so the
-cross-engine "% of vLLM" is **indicative, not apples-to-apples**.
-
-**Dense Qwen3.6-27B-NVFP4** (t/s):
-
-| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
-|----:|------:|--------:|-----:|------------------:|---------------------:|
-| 8   |  65.7 |   84.0 |  71.1 | 118% | 1.28x |
-| 32  | 113.7 |  204.0 | 207.6 |  98% | 1.79x |
-| 64  | 134.3 |  294.9 | 309.7 |  95% | 2.20x |
-| 128 | 143.5 |  371.2 | 422.4 |  88% | 2.59x |
-
-**MoE Qwen3.6-35B-A3B-NVFP4** (t/s):
-
-| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
-|----:|------:|--------:|------:|-----------------:|---------------------:|
-| 8   | 181.4 |  227.4 |  315.1 | 72% | 1.25x |
-| 32  | 260.8 |  455.7 |  681.9 | 67% | 1.75x |
-| 64  | 306.8 |  612.3 |  765.5 | 80% | 2.00x |
-| 128 | 331.3 |  772.6 | 1011.7 | 76% | 2.33x |
-
-**Caveat on the vLLM column.** Besides the different harness, the vLLM MoE
-@npl128 number here (1011.7 at 128/128) runs *hotter* than the 901 t/s reference
-config (512/256), so the MoE "% of vLLM" reads **76% here vs ~86% at the
-ground-truth config**. Memory: llama uses **1.5-3x less** memory than vLLM.
-
-**Takeaway.** The patch series gives up to **2.59x (dense) / 2.33x (MoE)** over
-stock on the same harness. Dense is **ahead of vLLM at npl 8 and at 88-98% of it
-for npl 32-128**; MoE trails at 67-80% - the gap is structural (see section 5).
-
-### (c) Apple M4 (16GB) - for curiosity only
-
-No t/s: the 24GB NVFP4 GGUF did not finish downloading and would not fit in 16GB
-RAM (= SSD paging). Architectural findings:
-
-- Metal `supports_op` **excludes NVFP4** from `MUL_MAT` / `MUL_MAT_ID` /
-  `GET_ROWS`, so the FP4 matmuls **fall back to CPU** - there is no Apple
-  FP4-MMA.
-- `GATED_DELTA_NET` and `SSM_CONV` / `SSM_SCAN` **do** have Metal kernels.
-
-Verdict: NVFP4 Qwen3.6 needs **Blackwell FP4-MMA + >24GB RAM**; a 16GB M4 is not
-a fit. A Metal-native Q4_K Qwen3.6 would be a different artifact.
-
---
-
-## 5. Dev notes - what we learned
-
-**Bit-exact methodology.** Every bit-exact patch is gated two ways: (1) a greedy
-md5 gate - `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France
-is" -n 48 --temp 0 --seed 1 | md5sum`, paged paths prefixed with
-`LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged MoE), on the default
-chat-template path; and (2) `test-backend-ops` (CUDA0 vs CPU oracle) for every
-touched op (`SSM_CONV*`, `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
-
-**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md)).
-Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5
-(`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this
-is a benign FP-accumulation-order difference of the paged attention reduction,
-**KL-validated** against the f16 reference: KLD(paged||f16) 0.13600 <=
-KLD(nonpaged||f16) 0.13660, PPL within +/-0.29, ~zero probability bias - two
-equivalent FP-reorderings of the same quantized model, not a regression. Future
-paged-MoE regressions therefore compare to `8cb0ce23`, not `07db32c2`.
-
-**MoE-parity conclusion** (the residual gap is structural). The two heaviest MoE
-decode kernels - the GDN-SSM recurrence and the NVFP4-expert GEMM - are llama
-**wins** after this series (the recurrence runs at 102.6% of vLLM's bandwidth;
-the GEMM ties vLLM at the LPDDR5x BW floor). The residual gap is **bf16-projection
-bandwidth + the host scheduling loop**, both at the LPDDR5x floor - not a kernel
-llama is losing. The MoE GEMM kernel is *not* where the gap lives.
-
-**Rejected / flat levers** (recorded so they are not re-tried):
-
-- **Lever 2 - graph/stream coverage: FLAT.** Bit-exact graph coverage was
-  exhausted by 0025; more graph/stream overlap is a no-op or small regression on
-  this model.
-- **Lever 3 - act-quant fusion: FLAT.** The W4A4 act-quant tax is removable only
-  by W4A16 (a precision change, rejected) or a structural kernel rewrite; no
-  further bit-exact lever clears it. 0023 already banks the de-dup.
-- **Lever 4 - quantizing the bf16 GDN/attn projections to NVFP4: REJECTED (KL-gate fail).**
-  Quantizing the projections to NVFP4 costs ~+6% PPL; vLLM deliberately keeps the
-  same bf16 projections. No-ship.
-- **W4A16-Marlin MoE GEMM: REJECTED.** It would buy a precision upgrade nobody
-  needs at the cost of a ~5% slower kernel; both kernels are already at the BW floor.
-  (The dense verdict, "the win was NVFP4-dense-quant, not the Marlin kernel",
-  carries over to MoE.)
-
-**Opt-in bf16-SSM fast mode** (patch 0026, `ssm_bf16_tau`). The design premise -
-that bf16 KL error concentrates in long-memory heads and can be removed by
-keeping them f32 - is **empirically refuted**: the error scales with the bf16
-head *count* and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any useful
-byte saving, and the carry is byte-exact (genuine bf16 rounding, not a bug). The
-byte-saving (and ~+12% decode) is real but cannot meet a strict KL bar, so it
-ships **default-off (f32, bit-exact)** and opt-in only. Do not put a hybrid tau
-in a recommended/gallery config.
-
---
-
-## 6. Architecture and quant generality
-
-(From the arch-generality and quant-generality audits.)
-
- **15 of 16 optimizations are quant-AGNOSTIC.** Only **0023** (NVFP4
-  activation-quantize de-dup) is NVFP4-specific. The SSM/paged/MMQ optimizations
-  help **any quant** of these models (the GDN recurrence, conv, gather and
-  o_proj-MMQ levers operate on the f32 recurrent state and the routing layout,
-  not on the weight dtype).
- **Arch-safe to build everywhere.** NVFP4 use is Blackwell-gated and falls back
-  to dequant on other hardware; the GB10-tuned occupancy params (0022) are
-  perf-only and env-selectable (`GDN_NW` / `GDN_CPW`), so they never change
-  correctness on other GPUs. Patch 0030 makes the fused-op emission CUDA-family +
-  CPU only, so a non-CUDA paged build routes to the safe upstream non-fused path.
-
---
-
-## 7. Pin + maintenance policy
-
- **Pinned to llama.cpp `c299a92c`.** The pin is advanced **only** by the manual
-  [`PIN_SYNC`](PIN_SYNC_c299a92c.md) process: rebase the source-only patch series
-  onto the new tip, rebuild on GPU, and pass the bit-exact gate on every path
-  (dense + MoE, paged + non-paged) plus `test-backend-ops`. The `9d5d882d ->
-  c299a92c` jump (23 upstream commits) needed zero patch changes and did not
-  change decode output.
- **Decoupled from the nightly auto-bumper.** There is deliberately **no**
-  `bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could
-  silently shift the tree out from under the patches.
- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../../../.github/workflows/llama-cpp-paged-canary.yml)
-  (via [`.github/scripts/paged-canary-apply.sh`](../../../../../.github/scripts/paged-canary-apply.sh))
-  tries the patch series against the latest upstream tip with the build's own
-  strict `git apply`. **Red = upstream drifted past the series -> run a
-  PIN_SYNC** (do not bump the pin blindly). The canary references
-  [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md).
-
---
-
-## 8. Models
-
-The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery:
-
-| Gallery entry | Weights (HuggingFace) | Notes |
-|---|---|---|
-| `qwen3.6-27b-nvfp4-paged` | [`mudler/Qwen3.6-27B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF) | Dense, native Blackwell NVFP4 (FP4-MMA). |
-| `qwen3.6-35b-a3b-nvfp4-paged` | [`mudler/Qwen3.6-35B-A3B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF) | MoE (256 experts, top-8), `file_type MOSTLY_NVFP4`. |
-
-Both gallery entries set `backend: llama-cpp-localai-paged` and the paged serving config
-(`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`,
-`flash_attention:on`, `context_size`). They intentionally stay bit-exact (no
-`ssm_bf16_tau`). The full backend-split + gallery plan is in
-[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md).
--- a/backend/cpp/llama-cpp/patches/paged/final_benchmark.csv
+++ b/backend/cpp/llama-cpp/patches/paged/final_benchmark.csv
@@ -1,17 +0,0 @@
-model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
-q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
-q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
-q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
-q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
-q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
-q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
-q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
-q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
-q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
-q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
-q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
-q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
-q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
-q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
-q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
-q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64
--- a/backend/cpp/llama-cpp/patches/paged/paged-burst-bench.cpp
+++ b/backend/cpp/llama-cpp/patches/paged/paged-burst-bench.cpp
@@ -1,217 +0,0 @@
-// Paged-pool burst-degradation repro (patch 0024). DEV SCAFFOLDING ONLY.
-//
-// Reproduces, at the libllama level, the two host-side defects behind the
-// "later lower-npl prefill collapses, decode fine, restart cures it" benchmark
-// signature:
-//
-//   * RECLAMATION GAP (Fix-1): a partial tail seq_rm(seq, p0>0, -1) - exactly
-//     what llama-server issues on every reused slot - frees the kv-cache CELLS
-//     but the paged manager keeps owning the trailing BLOCKS. The manager's
-//     free pool silently shrinks. Test A measures the reclaimed-block delta.
-//
-//   * FRAGMENTATION / NO COMPACTION (Fix-2): a high-fan-out burst that allocates
-//     many sequences and frees them in a scrambled order leaves the free queue a
-//     scrambled permutation of physical block ids. A later low-npl prefill then
-//     pops physically scattered blocks, so its KV scatter-write + in-kernel
-//     paged-attention gather lose locality and prefill throughput collapses;
-//     decode (single-token append) barely notices. Test B times an npl8 prefill
-//     on a FRESH pool vs an npl8 prefill AFTER a scrambling burst+drain.
-//
-// PASS (post-fix): Test A reclaims ceil(PP/bs) - ceil(KEEP/bs) trailing blocks on
-// the partial seq_rm (0 pre-fix); Test B's post-burst npl8 prefill_tps is within
-// ~10% of the fresh npl8 and num_free returns to the pristine value after the drain.
-//
-// Run with LLAMA_KV_PAGED=1. Env: BURST_NSLOT(64) NPL(8) PP(512) KEEP(256)
-// GEN(4) PAGED_NGL(99). All sequences use distinct content so nothing is shared.
-
-#include "llama.h"
-#include "paged-prefix-api.h"
-
-#include <chrono>
-#include <clocale>
-#include <cstdio>
-#include <cstdlib>
-#include <cstring>
-#include <vector>
-
-static int env_i(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }
-
-using clk = std::chrono::steady_clock;
-static double secs(clk::time_point a, clk::time_point b) {
-    return std::chrono::duration<double>(b - a).count();
-}
-
-struct Ctx { llama_context * ctx; llama_memory_t mem; llama_batch batch; int n_vocab; };
-
-// Deterministic, content-distinct token for (seq, pos): keeps every sequence's
-// blocks unique so no cross-request prefix sharing masks the accounting.
-static llama_token tok_of(int seq, int pos, int n_vocab) {
-    return (llama_token) (((seq * 1000003 + pos * 131 + 7) % (n_vocab - 200)) + 100);
-}
-
-// Prefill n tokens of seq at [pos0, pos0+n) in one ubatch (n <= n_batch).
-// Returns wall seconds (sync'd).
-static double prefill(Ctx & C, int seq, int pos0, int n) {
-    clk::time_point t0 = clk::now();
-    C.batch.n_tokens = 0;
-    for (int j = 0; j < n; ++j) {
-        int i = C.batch.n_tokens;
-        C.batch.token[i]    = tok_of(seq, pos0 + j, C.n_vocab);
-        C.batch.pos[i]      = pos0 + j;
-        C.batch.n_seq_id[i] = 1;
-        C.batch.seq_id[i][0]= seq;
-        C.batch.logits[i]   = (j + 1 == n) ? 1 : 0;
-        C.batch.n_tokens++;
-    }
-    if (llama_decode(C.ctx, C.batch)) { fprintf(stderr, "prefill decode failed seq=%d\n", seq); return -1; }
-    llama_synchronize(C.ctx);
-    return secs(t0, clk::now());
-}
-
-// One decode step (single token) for seq at pos.
-static void decode1(Ctx & C, int seq, int pos) {
-    C.batch.n_tokens = 1;
-    C.batch.token[0] = tok_of(seq, pos, C.n_vocab);
-    C.batch.pos[0]   = pos; C.batch.n_seq_id[0] = 1; C.batch.seq_id[0][0] = seq; C.batch.logits[0] = 1;
-    if (llama_decode(C.ctx, C.batch)) fprintf(stderr, "decode1 failed seq=%d\n", seq);
-}
-
-int main(int argc, char ** argv) {
-    std::setlocale(LC_NUMERIC, "C");
-    const char * model_path = nullptr;
-    for (int i = 1; i < argc; ++i) if (!strcmp(argv[i], "-m") && i + 1 < argc) model_path = argv[++i];
-    if (!model_path) { fprintf(stderr, "usage: %s -m model.gguf\n", argv[0]); return 2; }
-
-    const int NSLOT = env_i("BURST_NSLOT", 64);
-    const int NPL   = env_i("NPL", 8);
-    const int PP    = env_i("PP", 512);
-    const int KEEP  = env_i("KEEP", 256);
-    const int GEN   = env_i("GEN", 4);
-    const int ngl   = env_i("PAGED_NGL", 99);
-    const bool paged = getenv("LLAMA_KV_PAGED") != nullptr;
-
-    ggml_backend_load_all();
-    llama_model_params mp = llama_model_default_params();
-    mp.n_gpu_layers = ngl;
-    llama_model * model = llama_model_load_from_file(model_path, mp);
-    if (!model) { fprintf(stderr, "model load failed\n"); return 1; }
-    const llama_vocab * vocab = llama_model_get_vocab(model);
-    const int n_vocab = llama_vocab_n_tokens(vocab);
-
-    // Pool sized for the burst plus headroom so the burst fits but a later npl
-    // run draws from whatever the burst's churn left behind.
-    const long cells = (long) (NSLOT + NPL + 4) * (PP + GEN + 16);
-    llama_context_params cp = llama_context_default_params();
-    cp.n_ctx     = (uint32_t) cells;
-    cp.n_batch   = (uint32_t) (PP + 16);
-    cp.n_ubatch  = (uint32_t) (PP + 16);
-    cp.n_seq_max = NSLOT + NPL + 2;
-    cp.kv_unified = true;     // one unified stream-0 pool -> num_free(ctx) is the whole pool
-    cp.no_perf   = true;
-    llama_context * ctx = llama_init_from_model(model, cp);
-    if (!ctx) { fprintf(stderr, "ctx init failed (cells=%ld)\n", cells); return 1; }
-
-    Ctx C; C.ctx = ctx; C.mem = llama_get_memory(ctx); C.n_vocab = n_vocab;
-    C.batch = llama_batch_init(cp.n_batch, 0, 1);
-
-    printf("== paged-burst-bench == paged=%d NSLOT=%d NPL=%d PP=%d KEEP=%d GEN=%d n_ctx=%ld\n",
-           paged, NSLOT, NPL, PP, KEEP, GEN, cells);
-
-    llama_memory_clear(C.mem, true);
-    const long F_start = paged_prefix_api::num_free_global();
-
-    // ---- Test A: Fix-1 reclamation gap on a partial tail seq_rm --------------
-    {
-        prefill(C, 0, 0, PP);
-        const long f_after_prefill = paged_prefix_api::num_free_global();
-        llama_memory_seq_rm(C.mem, 0, KEEP, -1);          // partial tail removal
-        const long f_after_rm = paged_prefix_api::num_free_global();
-        llama_memory_seq_rm(C.mem, 0, -1, -1);            // full free -> pristine
-        const long f_after_full = paged_prefix_api::num_free_global();
-        const long bs = 16;
-        const long expect = (PP + bs - 1)/bs - (KEEP + bs - 1)/bs; // trailing blocks
-        printf("[TEST-A Fix-1] start=%ld afterPrefill=%ld afterPartialRm=%ld reclaimed=%ld "
-               "(expect %ld post-fix, 0 pre-fix)  afterFullFree=%ld\n",
-               F_start, f_after_prefill, f_after_rm, f_after_rm - f_after_prefill, expect, f_after_full);
-    }
-
-    // ---- Test B: fragmentation -> npl prefill collapse -----------------------
-    // Fresh npl prefill baseline on a pristine pool.
-    llama_memory_clear(C.mem, true);
-    double tps_fresh;
-    {
-        clk::time_point t0 = clk::now();
-        long ntok = 0;
-        for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
-        tps_fresh = ntok / secs(t0, clk::now());
-        for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
-    }
-    const long F_pristine = paged_prefix_api::num_free_global();
-
-    // High-fan-out burst: allocate NSLOT sequences, each prefilled + a few decode
-    // steps (mixed alloc), then drain them in a scrambled order (odd ids first,
-    // then even, each truncated before the full free) so the free queue becomes a
-    // scrambled permutation - the fragmentation the bug never compacts.
-    for (int s = 0; s < NSLOT; ++s) {
-        if (prefill(C, NPL + s, 0, PP) < 0) return 1;
-        for (int g = 0; g < GEN; ++g) decode1(C, NPL + s, PP + g);
-    }
-    const long F_during_burst = paged_prefix_api::num_free_global();
-    // Drain: partial tail seq_rm (the reused-slot pattern) then full free, in a
-    // scrambled slot order to scramble the physical free order.
-    for (int parity = 1; parity >= 0; --parity)
-        for (int s = 0; s < NSLOT; ++s) if ((s & 1) == parity) {
-            llama_memory_seq_rm(C.mem, NPL + s, KEEP, -1);   // partial (Fix-1 path)
-            llama_memory_seq_rm(C.mem, NPL + s, -1, -1);     // full free
-        }
-    const long F_after_drain = paged_prefix_api::num_free_global();
-
-    // Post-burst npl prefill: pops from the (pre-fix scrambled / post-fix
-    // defragged) free queue.
-    double tps_post;
-    {
-        clk::time_point t0 = clk::now();
-        long ntok = 0;
-        for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
-        tps_post = ntok / secs(t0, clk::now());
-        for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
-    }
-
-    const double ratio = tps_fresh > 0 ? tps_post / tps_fresh : 0;
-    printf("[TEST-B frag] num_free: start=%ld pristine=%ld duringBurst=%ld afterDrain=%ld "
-           "(afterDrain==pristine? %s)\n",
-           F_start, F_pristine, F_during_burst, F_after_drain,
-           F_after_drain == F_pristine ? "YES" : "NO");
-    printf("[TEST-B frag] prefill_tps fresh=%.1f post-burst=%.1f  ratio=%.3f "
-           "(PASS if >=0.90)\n", tps_fresh, tps_post, ratio);
-
-    // ---- Test C: idle-slot retention leak -> reclaim (the Fix-3 scenario) -----
-    // Burst NSLOT sequences and leave them IDLE (stock llama-server keeps an idle
-    // slot's KV; the blocks are stranded). F_idle shows the depleted pool a later
-    // low-npl run would see. Then full-seq_rm each (exactly what Fix-3's
-    // prompt_clear() issues at slot.release): F_reclaimed must return to pristine.
-    llama_memory_clear(C.mem, true);
-    // Touch the pool once so the manager exists, then read the full-pool size
-    // (num_free is 0 while no manager is registered).
-    if (prefill(C, 0, 0, 16) < 0) return 1;
-    llama_memory_seq_rm(C.mem, 0, -1, -1);
-    const long F_pre_c = paged_prefix_api::num_free_global();
-    for (int s = 0; s < NSLOT; ++s) { if (prefill(C, NPL + s, 0, PP) < 0) return 1; }
-    const long F_idle = paged_prefix_api::num_free_global();
-    for (int s = 0; s < NSLOT; ++s) llama_memory_seq_rm(C.mem, NPL + s, -1, -1); // Fix-3 release
-    const long F_reclaimed = paged_prefix_api::num_free_global();
-    printf("[TEST-C idle] pristine=%ld idle_after_burst=%ld (leaked=%ld) reclaimed=%ld "
-           "(returns_to_fresh? %s)\n",
-           F_pre_c, F_idle, F_pre_c - F_idle, F_reclaimed,
-           F_reclaimed == F_pre_c ? "YES" : "NO");
-
-    printf("RESULT paged=%d frag_fix2_ratio=%.3f drain_numfree_returns=%s idle_reclaim_returns=%s\n",
-           paged, ratio,
-           F_after_drain == F_pristine ? "YES" : "NO",
-           F_reclaimed == F_pre_c ? "YES" : "NO");
-
-    llama_batch_free(C.batch);
-    llama_free(ctx);
-    llama_model_free(model);
-    return 0;
-}
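The block arithmetic behind Test A's `expect` value can be checked on its own. The following is a minimal sketch, assuming the block size of 16 that the bench hard-codes; the helper names `blocks_for`, `expected_reclaimed`, and `naive_reclaimed` are hypothetical and do not exist in the deleted file. It shows that the difference of two ceilings (what the bench computes) is not the same as the ceiling of the difference once `KEEP` stops being a multiple of the block size.

```cpp
// Hypothetical helpers: only the arithmetic mirrors the bench's `expect` line.

// Number of fixed-size blocks needed to hold n_tokens cells (ceiling division).
constexpr long blocks_for(long n_tokens, long block_size) {
    return (n_tokens + block_size - 1) / block_size;
}

// Trailing blocks that a partial tail removal can hand back: everything
// allocated for pp tokens minus the blocks still needed for the kept prefix.
constexpr long expected_reclaimed(long pp, long keep, long block_size) {
    return blocks_for(pp, block_size) - blocks_for(keep, block_size);
}

// The tempting shortcut: ceiling of the removed token count. It overcounts
// whenever the kept prefix ends in a partially filled block.
constexpr long naive_reclaimed(long pp, long keep, long block_size) {
    return blocks_for(pp - keep, block_size);
}

// Bench defaults (PP=512, KEEP=256, bs=16): both formulas agree.
static_assert(expected_reclaimed(512, 256, 16) == 16, "aligned KEEP");
static_assert(naive_reclaimed(512, 256, 16) == 16, "aligned KEEP");

// Unaligned KEEP=250: the kept prefix still occupies 16 blocks, so only 16
// can be freed, while the shortcut claims 17.
static_assert(expected_reclaimed(512, 250, 16) == 16, "unaligned KEEP");
static_assert(naive_reclaimed(512, 250, 16) == 17, "shortcut overcounts");
```

With the default environment values the two formulas coincide, which is why the distinction only matters when `KEEP` is overridden to an unaligned value.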
--- a/backend/cpp/llama-cpp/patches/paged/paged-reclaim-unit.cpp
+++ b/backend/cpp/llama-cpp/patches/paged/paged-reclaim-unit.cpp
@@ -1,59 +0,0 @@
-// Host-side unit test for the paged-pool burst-reclaim fix (patch 0024).
-// Compiles paged-kv-manager.cpp directly; no ggml / llama / GPU dependency.
-//
-//   Fix-1  PagedKVManager::truncate(seq, n_keep) reclaims the trailing blocks
-//          beyond ceil(n_keep/bs) (ref-counted), so a partial tail seq_rm no
-//          longer strands blocks whose cells were cleared.
-//   Fix-2  defrag_free_pool() relinks the free queue into ascending block-id
-//          order once the pool is fully idle, undoing a burst's scrambled frees
-//          so a later prefill pops physically contiguous blocks again.
-
-#include "paged-kv-manager.h"
-#include <cstdio>
-
-using paged::PagedKVManager;
-
-int main() {
-    int rc = 0;
-
-    // ---- Fix-1: truncate reclaims the trailing block suffix -----------------
-    {
-        PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/true);
-        const size_t f0 = m.num_free_blocks();   // 63 (block 0 reserved as null)
-        m.allocate(0, 512);                       // ceil(512/16)=32 blocks
-        const size_t f1 = m.num_free_blocks();    // 31
-        m.truncate(0, 256);                       // keep ceil(256/16)=16, free 16
-        const size_t f2 = m.num_free_blocks();    // 47
-        printf("[unit Fix-1] free=%zu alloc512=%zu truncate256=%zu reclaimed=%zu (expect 16)\n",
-               f0, f1, f2, f2 - f1);
-        if (f2 - f1 != 16) rc = 1;
-        m.truncate(0, 16);                        // keep 1 block, free 15 more
-        const size_t f3 = m.num_free_blocks();    // 62
-        printf("[unit Fix-1] truncate16=%zu (expect %zu)\n", f3, f0 - 1);
-        if (f3 != f0 - 1) rc = 1;
-        m.free(0);
-        if (m.num_free_blocks() != f0) { printf("[unit Fix-1] free mismatch\n"); rc = 1; }
-    }
-
-    // ---- Fix-2: defrag restores ascending popleft order ---------------------
-    {
-        PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/false);
-        for (int s = 0; s < 8; ++s) m.allocate(s, 16);          // pop blocks 1..8
-        const int scrambled[8] = {3, 7, 1, 5, 0, 6, 2, 4};      // free out of order
-        for (int i = 0; i < 8; ++i) m.free(scrambled[i]);
-        m.defrag_free_pool();                                    // all idle -> compact
-        m.allocate(100, 16 * 3);                                 // pop 3 blocks
-        const auto bt = m.block_table(100);
-        bool asc = true;
-        printf("[unit Fix-2] post-defrag block_table:");
-        for (size_t i = 0; i < bt.size(); ++i) {
-            printf(" %d", bt[i]);
-            if (i && bt[i] < bt[i - 1]) asc = false;
-        }
-        printf("  ascending=%s (expect YES)\n", asc ? "YES" : "NO");
-        if (!asc) rc = 1;
-    }
-
-    printf("UNIT %s\n", rc == 0 ? "PASS" : "FAIL");
-    return rc;
-}
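The Fix-2 idea (re-sorting the free queue once the pool is idle) can be illustrated with a toy model. This is a sketch under stated assumptions, not the real `paged::PagedKVManager`: `ToyFreeQueue` and `pop_after_scrambled_free` are hypothetical names, the pool is deliberately tiny (ids 1..8, with id 0 treated as the reserved null block) so that every block is in use before the frees, and freed ids are assumed to be appended to the tail of the queue.

```cpp
#include <algorithm>
#include <deque>
#include <vector>

// Toy stand-in for a block free queue (hypothetical, for illustration only).
struct ToyFreeQueue {
    std::deque<int> free_ids;

    // Ids 1..num_blocks-1 start free in ascending order; id 0 is reserved.
    explicit ToyFreeQueue(int num_blocks) {
        for (int id = 1; id < num_blocks; ++id) free_ids.push_back(id);
    }
    // Take the block at the head of the queue.
    int pop() { int id = free_ids.front(); free_ids.pop_front(); return id; }
    // Return a block to the tail of the queue (assumed policy).
    void release(int id) { free_ids.push_back(id); }
    // Restore ascending id order; only meaningful when nothing is in flight.
    void defrag() { std::sort(free_ids.begin(), free_ids.end()); }
};

// Allocate all 8 blocks, free them in a scrambled order, optionally defrag,
// then report the next three ids an allocation would receive.
inline std::vector<int> pop_after_scrambled_free(bool defrag) {
    ToyFreeQueue q(9);
    std::vector<int> held;
    for (int i = 0; i < 8; ++i) held.push_back(q.pop());   // holds ids 1..8
    const int order[8] = {3, 7, 1, 5, 0, 6, 2, 4};         // same scramble as the unit test
    for (int i : order) q.release(held[i]);                // queue: 4 8 2 6 1 7 3 5
    if (defrag) q.defrag();
    const int a = q.pop();
    const int b = q.pop();
    const int c = q.pop();
    return {a, b, c};
}
```

Without the defrag step the next allocation in this toy receives ids 4, 8, 2; with it, the allocation receives 1, 2, 3, which is the ascending order the unit test's Fix-2 section asserts on.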
--- a/backend/cpp/llama-cpp/patches/paged/qwen36_dense_decode_vs_npl.png
+++ b/backend/cpp/llama-cpp/patches/paged/qwen36_dense_decode_vs_npl.png
--- a/backend/cpp/llama-cpp/patches/paged/qwen36_moe_decode_vs_npl.png
+++ b/backend/cpp/llama-cpp/patches/paged/qwen36_moe_decode_vs_npl.png
--- a/backend/cpp/llama-cpp/prepare.sh
+++ b/backend/cpp/llama-cpp/prepare.sh
@@ -2,30 +2,18 @@

 ## Patches

-## Apply patches: the base `patches/` series, then the gated `patches/paged/`
-## series (default on; LLAMA_PAGED=off skips it). Only *.patch files are applied
-## (docs/dirs like patches/paged/ and *.md are skipped). The Makefile `llama.cpp`
-## target already `git apply`s these at checkout, so each apply is guarded by a
-## sentinel and skipped when already present - re-applying git-format patches with
-## `patch` fuzzily duplicates hunks (redefinition errors). This block only does
-## real work if prepare.sh is run against an unpatched checkout.
+## Apply the base `patches/` series (top-level *.patch only; *.md/dirs skipped).
+## The stock llama-cpp backend is patch-free by default, so this normally does
+## nothing. The Makefile `llama.cpp` target already `git apply`s any base patch
+## at checkout, so each `patch` call below passes `-N` (skip patches that look
+## already applied): re-applying a git-format patch with `patch` would fuzzily
+## duplicate hunks. This block only does real work on an unpatched checkout.
 if [ -d "patches" ]; then
    for patch in patches/*.patch; do
        [ -e "$patch" ] || continue
        echo "Applying patch $patch"
        patch -d llama.cpp/ -p1 -N -r - < "$patch" || true
    done
-    if [ "${LLAMA_PAGED:-on}" != "off" ] && [ -d "patches/paged" ]; then
-        if [ -f llama.cpp/src/paged-kv-manager.cpp ]; then
-            echo "paged-attention patch series already applied (sentinel present) - skipping re-apply"
-        else
-            for patch in patches/paged/*.patch; do
-                [ -e "$patch" ] || continue
-                echo "Applying paged patch $patch"
-                patch -d llama.cpp/ -p1 -N -r - < "$patch" || true
-            done
-        fi
-    fi
 fi

 set -e