feat(backends): make PreferDevelopmentBackends install the development image as primary

When LOCALAI_PREFER_DEV_BACKENDS is set, install the -development image as the primary backend URI (keeping the released image reachable as the first fallback), instead of only reaching development as a download fallback when the released image is missing. This lets an operator force backends built from the development branch, e.g. to pick up a fix already on master before a release.

Threads PreferDevelopmentBackends through SystemState so InstallBackend can see it, and reuses the same development-URI convention as the existing failure-path fallback (released tag -> branch tag + dev suffix). The unexported developmentURI helper is covered by a Ginkgo spec.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
feat(backends): darwin/Metal build for the privacy-filter backend (#10513)
2026-06-26 01:16:58 -04:00 · 2026-06-25 23:37:19 +00:00 · 2026-06-26 01:18:41 +02:00 · 2026-06-26 01:15:40 +02:00 · 2026-06-26 01:08:09 +02:00 · 2026-06-26 01:02:48 +02:00
150 changed files with 412 additions and 19972 deletions
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
@@ -4963,6 +4963,12 @@ includeDarwin:
    tag-suffix: "-metal-darwin-arm64-sam3-cpp"
    build-type: "metal"
    lang: "go"
+  # privacy-filter (PII/NER) is a C++/ggml backend built by a bespoke darwin
+  # script (make backends/privacy-filter-darwin); ggml defaults Metal ON on Apple
+  # so the build is Metal-enabled. lang=go drives runner/toolchain selection only.
+  - backend: "privacy-filter"
+    tag-suffix: "-metal-darwin-arm64-privacy-filter"
+    lang: "go"
  # LocalVQE has no Metal path; on Apple Silicon it builds CPU-only (GGML_METAL
  # OFF) but is still a native arm64 image. Uses the darwin/metal build profile.
  - backend: "localvqe"
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -228,8 +228,17 @@ jobs:
        run: |
          make backends/ds4-darwin

+      # privacy-filter is a C++/ggml backend like ds4 - a single grpc-server with
+      # otool dylib bundling - so it gets its own bespoke darwin script rather than
+      # the generic build-darwin-go-backend path.
+      - name: Build privacy-filter backend (Darwin Metal)
+        if: inputs.backend == 'privacy-filter'
+        run: |
+          make protogen-go
+          make backends/privacy-filter-darwin
+
      - name: Build ${{ inputs.backend }}-darwin
-        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4'
+        if: inputs.backend != 'llama-cpp' && inputs.backend != 'ds4' && inputs.backend != 'privacy-filter'
        run: |
          make protogen-go
          BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -1129,6 +1129,10 @@ backends/ds4-darwin: build
 	bash ./scripts/build/ds4-darwin.sh
 	./local-ai backends install "ocifile://$(abspath ./backend-images/ds4.tar)"

+backends/privacy-filter-darwin: build
+	bash ./scripts/build/privacy-filter-darwin.sh
+	./local-ai backends install "ocifile://$(abspath ./backend-images/privacy-filter.tar)"
+
 build-darwin-python-backend: build
 	bash ./scripts/build/python-darwin.sh

--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=d5507e33ae7ee2b7b41475f08044d3bde3b839ee
+IK_LLAMA_VERSION?=b84902d2ad27c34f989f23947200c4b91b1568fd
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/ik-llama-cpp/run.sh
+++ b/backend/cpp/ik-llama-cpp/run.sh
@@ -2,7 +2,7 @@
 set -ex

 # Get the absolute current dir where the script is located
-CURDIR=$(dirname "$(realpath $0)")
+CURDIR=$(dirname "$(realpath "$0")")

 cd /

@@ -13,28 +13,28 @@ grep -e "flags" /proc/cpuinfo | head -1
 # ik_llama.cpp requires AVX2 — default to avx2 binary
 BINARY=ik-llama-cpp-avx2

-if [ -e $CURDIR/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
+if [ -e "$CURDIR"/ik-llama-cpp-fallback ] && ! grep -q -e "\savx2\s" /proc/cpuinfo ; then
 	echo "CPU:    AVX2   NOT found, using fallback"
 	BINARY=ik-llama-cpp-fallback
 fi

 # Extend ld library path with the dir where this script is located/lib
 if [ "$(uname)" == "Darwin" ]; then
-	export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH
-	#export DYLD_FALLBACK_LIBRARY_PATH=$CURDIR/lib:$DYLD_FALLBACK_LIBRARY_PATH
+	export DYLD_LIBRARY_PATH="$CURDIR"/lib:$DYLD_LIBRARY_PATH
+	#export DYLD_FALLBACK_LIBRARY_PATH="$CURDIR"/lib:$DYLD_FALLBACK_LIBRARY_PATH
 else
-	export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH
+	export LD_LIBRARY_PATH="$CURDIR"/lib:$LD_LIBRARY_PATH
 fi

 # If there is a lib/ld.so, use it
-if [ -f $CURDIR/lib/ld.so ]; then
+if [ -f "$CURDIR"/lib/ld.so ]; then
 	echo "Using lib/ld.so"
 	echo "Using binary: $BINARY"
-	exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@"
+	exec "$CURDIR"/lib/ld.so "$CURDIR"/$BINARY "$@"
 fi

 echo "Using binary: $BINARY"
-exec $CURDIR/$BINARY "$@"
+exec "$CURDIR"/$BINARY "$@"

 # We should never reach this point, however just in case we do, run fallback
-exec $CURDIR/ik-llama-cpp-fallback "$@"
+exec "$CURDIR"/ik-llama-cpp-fallback "$@"
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,14 +1,6 @@

-LLAMA_VERSION?=8be759e6f70d629638a7eb70db3824cbdcea370b
+LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
-# LLAMA_PAGED controls whether the vendored paged-attention patch series
-# (patches/paged/) is applied on top of the pinned llama.cpp. Default on; set
-# LLAMA_PAGED=off to build a clean-against-upstream backend (e.g. to unblock a
-# dep-bump if an upstream change breaks a paged hook - the paged carry is then
-# fixed independently). Runtime behaviour stays gated by the LLAMA_KV_PAGED env
-# regardless, so an LLAMA_PAGED=on build is byte-identical to stock until that
-# env is set.
-LLAMA_PAGED?=on

 CMAKE_ARGS?=
 BUILD_TYPE?=
@@ -177,28 +169,14 @@ llama.cpp:
 	git remote add origin $(LLAMA_REPO)  && \
 	git fetch --all --tags && \
 	git checkout -b build $(LLAMA_VERSION) && \
-	git submodule update --init --recursive --depth 1 --single-branch && \
-	for p in $(CURRENT_MAKEFILE_DIR)patches/0*.patch; do \
-		[ -e "$$p" ] || continue; \
-		echo "applying llama.cpp patch: $$p"; \
-		git apply --verbose "$$p" || { echo "patch failed: $$p"; exit 1; }; \
-	done && \
-	if [ "$(LLAMA_PAGED)" = "off" ]; then \
-		echo "LLAMA_PAGED=off: skipping paged-attention patch series"; \
-	else \
-		for p in $(CURRENT_MAKEFILE_DIR)patches/paged/0*.patch; do \
-			[ -e "$$p" ] || continue; \
-			echo "applying llama.cpp PAGED patch: $$p"; \
-			git apply --verbose "$$p" || { echo "paged patch failed: $$p"; exit 1; }; \
-		done; \
-	fi
+	git submodule update --init --recursive --depth 1 --single-branch

 llama.cpp/tools/grpc-server: llama.cpp
 	mkdir -p llama.cpp/tools/grpc-server
-	LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
+	bash prepare.sh

 rebuild:
-	LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
+	bash prepare.sh
 	rm -rf grpc-server
 	$(MAKE) grpc-server

--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -749,97 +749,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.kv_unified = false;
            }
-        // --- paged KV cache (experimental, off by default) ---
-        // Enables the on-demand paged KV-cache engine (vendored PagedKVManager
-        // + paged placement/gather/alloc seams). The engine is gated inside
-        // llama.cpp by the LLAMA_KV_PAGED env var, evaluated once at first use;
-        // here we expose it as a per-server model option instead of forcing the
-        // operator to export a process-wide env. When enabled we set the env
-        // BEFORE the model/context is created (later in this handler), so the
-        // engine latches on. When the option is absent we touch nothing, so an
-        // externally exported LLAMA_KV_PAGED still works as an escape hatch.
-        // Note: the engine's env check is process-wide and latches on first
-        // use, so enabling it for one model enables it for the worker process;
-        // LocalAI runs one model per llama.cpp worker, so this maps cleanly to
-        // per-server configuration. `kv_paged_debug` turns on the per-slot
-        // [paged-alloc]/free trace (LLAMA_KV_PAGED_DEBUG).
-        //
-        // The continuous-batching serving loop (update_slots) drives paged KV
-        // transparently through the existing kv-cache seams: each slot's
-        // sequence allocates paged blocks on arrival (find_slot placement) and
-        // returns them on slot release (the seq_rm free seam). This is
-        // token-identical to stock under both the unified and per-sequence
-        // caches. The per-slot allocate/free capacity benefit, however, only
-        // materialises with a per-sequence cache, since paged block ownership
-        // is keyed by stream and the unified cache collapses every slot onto a
-        // single stream. Operators who want that benefit should pair this with
-        // `kv_unified:false`; we do NOT flip kv_unified here, to keep the
-        // default serving behaviour (and the idle-slot prompt cache) unchanged.
-        } else if (!strcmp(optname, "kv_paged") || !strcmp(optname, "paged_kv") || !strcmp(optname, "paged_attention")) {
-            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
-                setenv("LLAMA_KV_PAGED", "1", 1);
-            }
-        } else if (!strcmp(optname, "kv_paged_debug") || !strcmp(optname, "paged_kv_debug")) {
-            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
-                setenv("LLAMA_KV_PAGED_DEBUG", "1", 1);
-            }
-        // --- chunked-prefill QoS budget (experimental, off by default) ---
-        // Caps the number of prompt tokens any single slot may prefill per
-        // update_slots iteration, so a large prompt cannot monopolise the batch
-        // and freeze the in-flight decoders. The serving loop reads this budget
-        // from the LLAMA_PREFILL_BUDGET env var (set BEFORE context init, like
-        // kv_paged above) and splits oversized prompts across iterations,
-        // interleaving decode steps for the other slots. A 6k-token prefill that
-        // stalled 8 decoders ~3.4s drops to ~780ms at budget=512 (4.8x stall
-        // cut) with zero TTFT cost and no steady-state regression. Unset or a
-        // non-positive value leaves the env untouched, so the stock unbounded
-        // prefill behaviour is preserved (an externally exported
-        // LLAMA_PREFILL_BUDGET still works as an escape hatch).
-        } else if (!strcmp(optname, "max_prefill_tokens") || !strcmp(optname, "mpt") || !strcmp(optname, "prefill_budget")) {
-            if (optval != NULL) {
-                try {
-                    int budget = std::stoi(optval_str);
-                    if (budget > 0) {
-                        setenv("LLAMA_PREFILL_BUDGET", std::to_string(budget).c_str(), 1);
-                    }
-                } catch (const std::exception& e) {
-                    // If conversion fails, leave the budget unset (stock behaviour)
-                }
-            }
-        // --- dynamic decode-first prefill budget (patch 0016, continuous-batch P1) ---
-        // Supersedes max_prefill_tokens (the static patch-0013 cap) with the dynamic
-        // T - D budget read by update_slots(): a single total per-step token budget T
-        // (max_batch_tokens / mbt, the vLLM max_num_batched_tokens analogue) of which
-        // decode claims its live load D first and prefill gets the leftover, plus an
-        // optional per-slot prompt-chunk cap (prefill_cap, the long_prefill_token_
-        // threshold analogue). Both are set BEFORE context init, like kv_paged /
-        // max_prefill_tokens above. Unset leaves the env untouched, so the engine stays
-        // byte-identical to stock (an externally exported LLAMA_MAX_BATCH_TOKENS /
-        // LLAMA_PREFILL_CAP still works as an escape hatch). When max_batch_tokens is set
-        // it takes precedence over max_prefill_tokens: the engine honours the legacy
-        // LLAMA_PREFILL_BUDGET only when the dynamic knob is unset.
-        } else if (!strcmp(optname, "max_batch_tokens") || !strcmp(optname, "mbt")) {
-            if (optval != NULL) {
-                try {
-                    int mbt = std::stoi(optval_str);
-                    if (mbt > 0) {
-                        setenv("LLAMA_MAX_BATCH_TOKENS", std::to_string(mbt).c_str(), 1);
-                    }
-                } catch (const std::exception& e) {
-                    // If conversion fails, leave the budget unset (stock behaviour)
-                }
-            }
-        } else if (!strcmp(optname, "prefill_cap")) {
-            if (optval != NULL) {
-                try {
-                    int cap = std::stoi(optval_str);
-                    if (cap > 0) {
-                        setenv("LLAMA_PREFILL_CAP", std::to_string(cap).c_str(), 1);
-                    }
-                } catch (const std::exception& e) {
-                    // If conversion fails, leave the per-slot cap unset (engine default)
-                }
-            }
        } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
            if (optval != NULL) {
                try {
--- a/backend/cpp/llama-cpp/paged/.gitignore
+++ b/backend/cpp/llama-cpp/paged/.gitignore
@@ -1,7 +0,0 @@
-tests/test_free_block_queue
-tests/test_block_pool
-tests/test_paged_kv_manager
-tests/test_prefix_cache
-tests/test_ggml_paged_rw
-tests/test_ggml_paged_attn
-paged-bench
--- a/backend/cpp/llama-cpp/paged/BLACKWELL_KERNEL_GAPS.md
+++ b/backend/cpp/llama-cpp/paged/BLACKWELL_KERNEL_GAPS.md
@@ -1,105 +0,0 @@
-# Blackwell (GB10 / sm_121) kernel gaps — measured + the corrected strategy
-
-Supersedes the "greenfield tcgen05 FP4 grouped GEMM" framing in `FP4_GROUPED_MOE_KERNEL.md`. Research +
-profiling reframed the problem: the kernels we need **already exist in ggml**; they're just **untuned for
-Blackwell**. And the parity target is far lower than the headline vLLM number implied.
-
-## 1. The parity target was wrong — it's ~3,300 t/s single-stream, not 24,444
-
-vLLM's dense "24,444 t/s" is **aggregate concurrent-batch** throughput, not single-sequence. The GB10
-compute roofline caps **single-stream** Qwen3-32B prefill at **~3,300 t/s (BF16/INT8 ceiling)** / **~6,600
-(FP4 ceiling)**. So: don't chase 24,444 with one kernel. Aggregate parity = (a kernel at the ceiling) +
-(batched-prefill scheduling). The *kernel* job is to reach ~3,300 (matches vLLM, which on GB10 also runs at
-the BF16 ceiling) or ~6,600 (beats it, via FP4).
-
-## 2. GB10 per-precision DENSE peaks (measured, not spec)
-
-| precision | dense peak | vs BF16 |
-|---|---|---|
-| BF16 / FP16 | ~213 TFLOP/s | 1.0× |
-| INT8 | ~215 TOPS | **1.0×** |
-| FP4 (MXFP4/NVFP4) | ~427–500 TFLOP/s | **2.0×** |
-
-Memory: ~273 GB/s LPDDR5X (the bottleneck for *decode*; prefill is compute-bound). **Critical:** GB10 is
-**1:1:2** (BF16:INT8:FP4), NOT datacenter Blackwell's 1:2:4 — **INT8 gives ZERO speedup over BF16 here.** So
-int8-MMQ has no precision advantage; only FP4 does. (NVIDIA spec sheets still claim 1:2:4 — contradicted by
-direct GB10 measurement; on-the-record discrepancy.)
-
-## 3. Measured gaps (nsys, GB10)
-
-| path | kernel | % of prefill | achieved | % of ceiling |
-|---|---|---|---|---|
-| **Dense** Q4_K_M | `mul_mat_q<Q4_K/Q6_K>` (int8 MMQ) | 80% | ~46 TFLOP/s | **~21% of 215** |
-| **MoE** MXFP4 | `mul_mat_q<MXFP4>` (FP4 MMA) | 37% | ~22 TFLOP/s | **~4–5% of 500** (or ~10% of BF16) |
-
-Both kernels are **engaged correctly but untuned for Blackwell** — llama.cpp's MMQ was "tuned primarily for
-RTX 3000/4000" (Ampere/Ada). The headroom (4–5×) is recoverable; it's not an architectural ceiling.
-
-## 4. ggml's current quantized-matmul paths (what exists)
-
-- **MMQ** (int8): quantizes activations to Q8_1, int8 `mma.sync`/`dp4a`. Prefill path. **Untuned for sm_12x.**
-- **FP4 MMA** (#17906, merged): native MXFP4/NVFP4 `m16n8k64` block-scaled FP4 mma for cc≥12.0. Works on GB10
-  for MoE (we measured 3441 t/s MXFP4 prefill) — but underutilized (~5% of FP4 peak). On **sm_121** it's hit
-  by build-flag (`120f`) + nvcc `-O3` miscompile (#18331) + capability-gating issues.
-- **dequant→cuBLAS-FP16**: unfused fallback (materializes FP16 weights, round-trips memory). Not a fused
-  Marlin. (Our `GGML_CUDA_FORCE_CUBLAS` no-op = this didn't even engage for Q4_K.)
-- **NO fused Marlin-style W4A16 kernel** (dequant 4-bit→BF16 in-shared-mem → BF16 tensor cores). Real gap.
-
-## 5. Strategy — match vs beat (this replaces the tcgen05-greenfield plan)
-
-**To MATCH vLLM (~3,300 single-stream): FP4 is NOT required.** Because INT8 == BF16 on GB10, a tuned MMQ and
-a BF16 Marlin kernel share the *same* ceiling — and vLLM hits parity via W4A16 Marlin (BF16), since its FP4
-is also broken on sm_121.
-
-Ranked, by effort:
-1. **Probe: tune the existing int8 MMQ for Blackwell** (dense). Cheapest. We're at 21% of the ceiling —
-   recover via tile sizes, async copy (`cp.async`), double-buffered shared-mem pipeline, occupancy. Caveat:
-   the `nwarps*tile_C::I==mmq_y` static_assert (found earlier) couples the constants; and the Q8_1
-   activation-quant overhead caps pure-MMQ tuning. Bounded upside, but a fast experiment.
-2. **Build a Marlin-style W4A16 BF16 GEMM** (dense) — the robust path to ~3,300 (4.3× over today's 765).
-   Dequant 4-bit→BF16 in shared memory, MMA on BF16 tensor cores, `cp.async` multi-buffer, offline weight
-   reshuffle. Mirrors vLLM's actual GB10 path; keeps activations BF16 (better quality than int8 MMQ); fills a
-   genuine ggml gap. **This is the recommended kernel to MATCH.**
-
-**To BEAT vLLM (~6,600, 2×): fix — don't rewrite — the FP4 path on sm_121.**
-3. **Get the existing FP4 MMA (#17906/#20644) fully working + tuned on sm_121.** It already works on sm_120
-   (RTX 5090: +43–68% prefill) and on GB10 for MoE. The blockers are the `120f` arch flag, the `-O3`
-   miscompile (#18331), capability gating — **build/compiler fixes, not a new kernel.** Then tune the FP4 MMQ
-   (it's at ~5% of FP4 peak). This is where upstream momentum already is, and the only route past vLLM.
-
-**Dropped:** the from-scratch tcgen05/CUTLASS grouped GEMM (the old scaffold). It aimed past the matchable
-ceiling, duplicates work the FP4-MMA path already does, and FP4 on sm_121 is a *fix* problem not a *write*
-problem. The `fp4-grouped-moe.cu` scaffold/hook stays as a useful dispatch seam, but the kernel behind it
-should be one of (1)/(2)/(3), not a greenfield CUTLASS collective.
-
-## 6. Cheap experiment — RESULT: MXFP4 dense = free 1.44×, but not parity (kernel still untuned)
-
-Requantized Qwen3-32B dense → MXFP4 (forced attn+ffn to mxfp4 via `--tensor-type`, `--allow-requantize`,
-speed-only test) and benched prefill:
-
-| quant | kernel | pp512 | pp2048 | vs Q4_K |
-|---|---|---|---|---|
-| Q4_K_M | int8-MMQ | 765 | 763 | 1.0× |
-| **MXFP4** | **FP4-MMA** | **1099** | **1153** | **1.44×** |
-
-**Findings:**
-- **MXFP4 dense is a real, free 1.44× over Q4_K** — just a requantize, the existing FP4-MMA path engages for
-  dense weights on GB10. Worth shipping as a **Blackwell dense-quant recommendation** in the gallery (no kernel).
-- **But it is NOT parity.** 1153 t/s = **~17% of the FP4 ceiling (~6,600)** / ~35% of the BF16 ceiling. So the
-  **FP4-MMA kernel is itself untuned** (consistent with the MoE measurement, ~5% of FP4 peak). MXFP4 moves dense
-  from the int8 path (765) onto the FP4 path (1153), but the FP4 kernel leaves ~4–6× on the table.
-- **So the kernel work is confirmed and now precise: tune the FP4-MMA kernel** (it's the highest-value, since it
-  serves both dense-MXFP4 and MoE, and FP4 is the only path that can *beat* vLLM). Strategy item (3) — fix +
-  tune the existing FP4-MMA on sm_121 — is the priority; a Marlin-style W4A16 BF16 kernel (2) is the alternative
-  to *match* on the BF16 ceiling if FP4 tuning stalls.
-
-Conclusion: the cheap test did NOT collapse the kernel problem (the kernels are untuned, not just the quant), but
-it (a) gives a free 1.44× to ship now, and (b) sharpens the target to **tuning the FP4-MMA kernel**.
-
-## Sources
-GB10 peaks (measured): forums.developer.nvidia.com/t/351993, /360142, /373618. Marlin: github.com/IST-DASLab/marlin,
-arxiv 2408.11743, developers.redhat.com Marlin/Machete. MMQ untuned: llama.cpp docs/build.md, discussions/16578,
-DandinPower/llama.cpp_bench. FP4 landing/sm121: llama.cpp PR #17906/#20644, issues #19662/#18331. Roofline:
-vllm.ai/blog/2026-06-01-vllm-dgx-spark, lmsys.org DGX Spark.
-
-> **Correction (measured):** the earlier `GGML_CUDA_FORCE_CUBLAS` env test was a no-op because it's a *compile-time* `#ifdef`, not a runtime flag — cuBLAS never engaged. A real rebuild with `-DGGML_CUDA_FORCE_CUBLAS=ON` shows cuBLAS is **slower** than MMQ for dense Q4 (pp2048 690 vs 750) and runs an **Ampere `cutlass_80_tensorop` FP16 kernel** — cuBLAS-13.0 has no sm_121-tuned GEMM and falls back to sm_80. So *both* MMQ and cuBLAS sit at ~46 TFLOP/s (~21% of the 213 BF16 peak); there is **no library shortcut** to the ceiling on GB10 — a hand-tuned sm_120a kernel (Marlin-style) is required.
--- a/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
@@ -1,334 +0,0 @@
-# Chunked prefill + n_batch/n_ubatch decouple — implementation plan
-
-Scope: LocalAI's llama.cpp backend (`backend/cpp/llama-cpp/`). Companion to
-`PHASED_VLLM_PARITY_PLAN.md` Phase 3. This document is the concrete, file-cited
-plan for what the brief called "chunked prefill".
-
-Line numbers below are from two trees:
-- LocalAI: `backend/cpp/llama-cpp/grpc-server.cpp`, `core/backend/options.go`,
-  `backend/backend.proto`, `core/backend/hardware_defaults.go` — exact.
-- Vendored upstream scheduler: `llama.cpp/tools/server/server-context.cpp`. The
-  build copies `llama.cpp/tools/server/*` into `tools/grpc-server/` (`prepare.sh`
-  lines 15-17) and only overrides `grpc-server.cpp` + `CMakeLists.txt`. So
-  `update_slots()` is **inherited upstream code, not LocalAI code**. Line numbers
-  cited for it are from a same-era checkout (`d12cc3d`, 2026-04-09); the pin is
-  `f3e1828` (Makefile line 2). The structure is identical; exact lines may drift
-  a few rows at the pin — match on the quoted comment strings, not the integers.
-
----
-
-## TL;DR — the headline finding
-
-**Chunked prefill with prefill/decode interleaving is ALREADY implemented** in the
-llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
-this version. `update_slots()` in `server-context.cpp`:
-
-1. **Adds ongoing decode tokens first** — "first, add sampled tokens from any
-   ongoing sequences" (≈ line 2088). Every `SLOT_STATE_GENERATING` slot gets its
-   one sampled token into the shared `llama_batch` before any prefill is added.
-2. **Then fills the remaining `n_batch` budget with prompt (prefill) tokens** —
-   "next, batch any pending prompts without exceeding n_batch" (≈ line 2166),
-   gated by `params_base.cont_batching` (LocalAI sets `cont_batching = true` by
-   default, `grpc-server.cpp:547`). The per-slot prefill fill loop
-   (≈ line 2552) is `while (slot.prompt.n_tokens() < slot.task->n_tokens() &&
-   batch.n_tokens < n_batch)` — i.e. it caps each slot's prefill contribution to
-   the **remaining** budget and defers the rest to the next iteration.
-3. **Decodes the combined batch in one pass** (≈ line 2728-2741): decode tokens
-   and prefill-chunk tokens go through the **same `llama_decode`**, which then
-   splits internally into `n_ubatch` physical sub-batches.
-
-This is exactly the behavior the abandoned-looking draft **upstream PR #10718**
-("server : chunked prefill support") asked for — "the first task is no longer
-blocked by the second long prompt processing task." That PR is still marked OPEN
-but its goal was absorbed into the natural evolution of `update_slots()`; we do
-**not** need to port it. A long prefill no longer stalls the decode batch: decode
-slots are serviced first every iteration, prefill consumes only the leftover
-budget.
-
-**Therefore: do not re-implement chunked prefill.** The real LocalAI gap is
-narrow and is the rest of this plan:
-
-- **Phase A (the actual gap): the `n_batch`/`n_ubatch` decouple.** LocalAI ties
-  the scheduler token budget (`n_batch`) to the physical forward width
-  (`n_ubatch`) at `grpc-server.cpp:515` + `:519`. This forces
-  `n_batch == n_ubatch`, so the logical scheduling window can never be wider than
-  one physical ubatch. You cannot keep `n_ubatch` at the Blackwell GEMM sweet
-  spot (2048) while widening `n_batch` so concurrent prefills + decodes co-batch
-  into a larger logical window. There is no first-class `batch:`/`ubatch:` split
-  on the Go side, and there is only a one-directional `ubatch` override on the C++
-  side (you can shrink ubatch below the coupled value, never grow n_batch above
-  it).
-- **Phase B (optional policy lever): a decode-headroom prefill cap.** Upstream
-  caps prefill at the full `n_batch` shared with decode. Under heavy mixed load
-  one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter
-  to the decoders sharing that forward. vLLM exposes
-  `long_prefill_token_threshold` / `max_num_partial_prefills` for this. A
-  LocalAI-specific per-iteration prefill cap (a patch to vendored `update_slots`)
-  bounds that jitter. This is genuinely not in upstream and is the only place a
-  scheduler-policy change is warranted.
-
----
-
-## 1. Current behavior — precise citations
-
-### 1.1 The scheduler is upstream, inherited verbatim
-- `prepare.sh:15-17` copies all of `llama.cpp/tools/server/*` into the
-  `grpc-server` build dir; `grpc-server.cpp` (LocalAI) replaces only the HTTP/gRPC
-  service + `params_parse` + `parse_options`. `update_slots()`, the slot state
-  machine, and the batch builder are **upstream `server-context.cpp`**, untouched
-  by LocalAI today.
-- Slot states: `server-context.cpp:36-42` —
-  `SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT /
-  GENERATING`.
-
-### 1.2 Decode-first, then prefill-fill, one shared batch
-- `common_batch_clear(batch)` (≈ 2078) — one batch per `update_slots` iteration.
-- Decode phase (≈ 2088-2156): for each `SLOT_STATE_GENERATING` slot,
-  `common_batch_add(batch, slot.sampled, …, /*logits=*/true)` adds exactly one
-  token. Decode is guaranteed a seat before prefill runs.
-- Budget fetch (≈ 2158-2160): `n_batch = llama_n_batch(ctx)`,
-  `n_ubatch = llama_n_ubatch(ctx)`.
-- Prefill phase (≈ 2166): `if (params_base.cont_batching || batch.n_tokens == 0)`
-  → with cont_batching ON, prefill is added to the **same** batch as decode.
-- Per-slot prefill fill (≈ 2552-2597):
-  `while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch)`
-  — adds prompt tokens until the slot is done **or** the shared budget is hit.
-  Whatever does not fit stays for the next iteration (the slot remains
-  `SLOT_STATE_PROCESSING_PROMPT`).
-- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed
-  it flips to `SLOT_STATE_DONE_PROMPT`, sets `batch.logits[last] = true`, inits
-  the sampler. Next iteration it becomes `GENERATING`.
-- Budget break (≈ 2693-2695): `if (batch.n_tokens >= n_batch) break;`.
-- Decode (≈ 2728-2741): loops `batch_view` slices of `min(n_batch, remaining)` and
-  calls `llama_decode`; the physical `n_ubatch` split happens inside
-  `llama_decode`.
-
-### 1.3 The chunking is gated by `can_split()`
-- `server-context.cpp:225-231`: `can_split()` returns true unless the task needs
-  embeddings with non-LAST pooling. So **completion/generation tasks always
-  chunk-and-interleave**; only embeddings/rerank force the whole prompt into one
-  ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch
-  size" — this is exactly why LocalAI bumped `n_ubatch` for rerank, see below).
-
-### 1.4 LocalAI ties n_batch to n_ubatch (the gap)
-- `grpc-server.cpp:515` — `params.n_batch  = request->nbatch();`
-- `grpc-server.cpp:519` — `params.n_ubatch = request->nbatch();` with the comment
-  that this fixes reranking being capped at the 512 default `n_ubatch`.
-- `grpc-server.cpp:781-784` — the **only** decouple knob today: an `n_ubatch` /
-  `ubatch` option that overrides `n_ubatch` alone (added for embeddings/rerank).
-  There is **no** `batch` / `n_batch` option parse, so `n_batch` cannot be raised
-  above the coupled value from a model config. Confirmed: `grep '"n_batch"|"batch"'`
-  in `grpc-server.cpp` returns nothing.
-- Options arrive via `request->options(i)` parsed as `optname:optval`
-  (`grpc-server.cpp:584-585`); these come from `ModelOptions.Options` ⟵
-  `c.Options` (`core/backend/options.go:221`).
-
-### 1.5 Go side sends a single batch number
- `backend/backend.proto:341` — `int32 NBatch = 4;` is the only batch field; there
-  is **no** `NUBatch`.
- `core/backend/options.go:108-129` `EffectiveBatchSize`: returns `c.Batch` if set,
-  else context size for single-pass (score/embed/rerank), else
-  `hardwareDefaultBatchSize(512)`.
- `core/backend/options.go:228` — `NBatch: int32(b)` (single value to the
-  backend; becomes both `n_batch` and `n_ubatch` via 1.4).
- `core/backend/hardware_defaults.go:28,37-40` — `BlackwellBatchSize = 2048`;
-  on Blackwell an unset batch defaults to 2048, so today
-  `n_batch == n_ubatch == 2048` there.
-
---
-
-## 2. Why the decouple matters for serving (not just rerank)
-
-Invariant: `n_ubatch <= n_batch`. `n_ubatch` is the physical forward-pass GEMM
-width (compute efficiency; GB10 sweet spot ≈ 2048). `n_batch` is the per-iteration
-**scheduler token budget** — the logical window shared by decode + prefill chunks,
-analogous to vLLM's `max_num_batched_tokens`.
-
-With `n_batch == n_ubatch` (today), the scheduling window cannot exceed one
-physical ubatch. Consequences:
- Under concurrency, the combined (decode + multiple prefill chunks) logical batch
-  is capped at the physical ubatch, so aggregate prefill cannot grow past one
-  ubatch worth of tokens per iteration even when more slots have prompts queued.
- A user who shrinks `batch:` for memory also shrinks the physical ubatch,
-  degrading prefill GEMM efficiency — and vice versa.
-
-Decoupling lets us hold `n_ubatch = 2048` (efficient GEMM) while setting a larger
-`n_batch` (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
-logical window, lifting aggregate prefill under mixed load — `llama_decode` still
-tiles the physical work at 2048.
-
---
-
-## 3. Phased implementation
-
-### Phase 0 — Verification harness (do first; TDD red)
-Bite-sized, no code change to the scheduler.
- **0.1 Token-identical greedy under mixed load.** Script: start the backend with
-  `n_parallel >= 4`, greedy sampling (temp 0, fixed seed). Fire (a) several short
-  decode streams and (b) one ~8k-token prompt concurrently (the exact repro from
-  PR #10718's body works). Capture each stream's full token id sequence. Re-run
-  with the prefill request absent. **Assert the short streams' token ids are
-  byte-identical** in both runs — proves interleaving does not perturb decode
-  numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo
-  spec under the backend e2e suite.
- **0.2 Mixed-workload throughput baseline.** Use `llama-batched-bench` (built from
-  the same tree) or a small driver hitting `/v1/chat/completions`: measure
-  aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams,
-  under the mixed workload. Record numbers for the current `n_batch==n_ubatch`
-  config. These numbers are the baseline that Phases A and B are compared against.
-
-Expected result of Phase 0: 0.1 already passes (interleave is correct today);
-0.2 gives the baseline the decouple must beat.
-
-### Phase A — Decouple n_batch from n_ubatch
-Goal: let model config set the physical ubatch independently of the logical batch,
-defaulting to today's behavior (no regression).
-
- **A.1 C++: accept a `batch`/`n_batch` option (and keep `ubatch`).**
-  In `grpc-server.cpp`, after the existing `ubatch` branch (`:781-784`), add a
-  sibling branch:
-  ```cpp
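-  } else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
-      // Logical batch override; sibling of the existing `ubatch` branch.
-      if (optval != NULL) {
-          // A malformed value is ignored, leaving the coupled default in place.
-          try { params.n_batch = std::stoi(optval_str); } catch (...) {}
-      }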
-  ```
-  This is the missing direction (raise `n_batch` above the coupled value). Order
-  matters: both `:515/:519` run first (coupling as default), then option parsing
-  overrides either independently. Add a clamp note: if a user sets
-  `n_ubatch > n_batch`, llama.cpp will clamp/upbatch; log a warning. Keep the
-  `:519` aliasing for backward compat (rerank still works with no options).
-
- **A.2 Proto: add an explicit physical ubatch field.**
-  `backend/backend.proto:341` add `int32 NUBatch = <next free tag>;` (do not reuse
-  4). Regenerate with `make protogen-go` + the C++ proto build.
-
- **A.3 C++: honor `NUBatch` when present.**
-  In `grpc-server.cpp` `params_parse`, after `:519`, add:
-  ```cpp
-  if (request->nubatch() > 0) {
-      params.n_ubatch = request->nubatch();
-  }
-  ```
-  so an explicit physical ubatch wins over the `n_batch` alias, with the `ubatch`
-  string-option as a third path for users who only edit `options:`.
-
- **A.4 Go: config surface + plumbing.**
-  - Add `UBatch *int` (yaml `ubatch`) to the llama config struct alongside `Batch`
-    (search `core/config` for the `Batch` field; mirror it).
-  - In `core/backend/options.go`: add `EffectiveUBatchSize(c)` mirroring
-    `EffectiveBatchSize` (return `c.UBatch` if set, else
-    `min(EffectiveBatchSize(c), BlackwellBatchSize-or-512)` so the physical ubatch
-    stays at the hardware sweet spot while `n_batch` may be larger). Set
-    `NUBatch: int32(EffectiveUBatchSize(c))` next to `NBatch:` (`:228`).
-  - Keep the default such that when neither is set, `NUBatch == NBatch` ⇒
-    byte-identical to today.
-
- **A.5 Serving default (the lever).**
-  In `hardware_defaults.go`, introduce `BlackwellLogicalBatch = 4096` (or a
-  measured value) and let `EffectiveBatchSize` return it for **multi-slot serving**
-  configs (when `n_parallel > 1` and the model is a completion model), while
-  `EffectiveUBatchSize` stays at `BlackwellBatchSize = 2048`. Gate behind the same
-  Blackwell detection already used at `:37-40`. Single-stream/embedding/rerank
-  paths keep `n_batch == n_ubatch`. This is the only behavioral change shipped by
-  Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
-
- **A.6 Tests.** Extend `hardware_defaults_internal_test.go` with
-  `EffectiveUBatchSize` cases; add a `grpcModelOpts` test asserting
-  `NUBatch <= NBatch` and that unset config yields `NUBatch == NBatch`. Re-run
-  0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or
-  neutral ITL) at `n_batch=4096, n_ubatch=2048`.
-
-### Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
-Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
-one change that touches the inherited scheduler, so it lives as a patch in
-`backend/cpp/llama-cpp/patches/` (applied by `prepare.sh:6-11` / Makefile
-`:141-145`), never as an edit to a checked-in upstream file.
-
-Policy (pseudocode; insert into `update_slots()` prefill fill loop, the
-`while (… && batch.n_tokens < n_batch)` at ≈ `server-context.cpp:2552`):
-
-```
-# token budget for THIS iteration, decode already seated:
-n_decode_in_batch = batch.n_tokens            # set after the decode phase
-prefill_budget    = n_batch                    # default == today
-
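-# max_prefill_per_iter == 0 leaves the cap disabled (today's behavior)
-if serving_mode and n_decode_in_batch > 0 and max_prefill_per_iter > 0: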
-    # leave room so decoders are not starved/jittered by one giant prefill chunk
-    # max_prefill_per_iter defaults to n_ubatch (one physical tile) when decode active
-    prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
-
-# fill loop guard becomes:
-while slot.prompt.n_tokens() < slot.task->n_tokens()
-      and batch.n_tokens < prefill_budget:
-      ...
-```
-
- `max_prefill_per_iter` is a new `common_params` field surfaced as an
-  `options:` knob (`max_prefill_tokens` / `mpt`) parsed in `grpc-server.cpp`
-  exactly like A.1, default `0` = disabled = today's behavior.
- Semantics mirror vLLM `long_prefill_token_threshold`: cap the prefill share so
-  ongoing decodes keep a steady cadence; the remaining prompt rides the next
-  iteration (already supported by the state machine — slot stays
-  `PROCESSING_PROMPT`).
- **Correctness:** unchanged KV/position path — chunk boundaries already advance
-  `slot.prompt.tokens.pos_next()` per added token (≈ 2570) and the slot resumes
-  from `slot.prompt.n_tokens()` next iteration. Capping the budget only changes
-  *how many* tokens are added this iteration, not *which* positions, so 0.1 must
-  remain token-identical.
-
-### Phase C — Docs + defaults rollout
- Document `batch` / `ubatch` (and `max_prefill_tokens` if B ships) in
-  `docs/content/` model-config reference, with the serving recipe
-  (`n_parallel>1`, `n_batch=4096`, `ubatch=2048`).
- Note the orthogonality to paged KV (below) in
-  `PHASED_VLLM_PARITY_PLAN.md` Phase 3.
-
---
-
-## 4. Risk / correctness
-
- **KV-cache & positions across chunks:** already handled upstream. Each prefill
-  token added advances `pos_next()` (≈ 2570) and is pushed to `slot.prompt.tokens`
-  (≈ 2573); the next iteration resumes from `slot.prompt.n_tokens()`. Chunk
-  boundaries are transparent to the KV cache because positions are absolute, not
-  per-chunk. Phase A changes only budgets, not positions; Phase B changes only the
-  per-iteration count. The 0.1 token-identical test is the guardrail.
- **Unified KV cache (LocalAI default, `n_parallel` slots share one cache):**
-  unaffected — co-batching prefill+decode across slots is what the unified cache is
-  for; positions are per-`seq_id` (`{ slot.id }` in `common_batch_add`).
- **`n_ubatch > n_batch`:** invalid; A.4 clamps `EffectiveUBatchSize <=
-  EffectiveBatchSize` and A.1 logs a warning if options violate it.
- **Embeddings / rerank:** must keep `n_ubatch >= prompt length` (single pass,
-  `can_split()==false`). The existing `:519` alias + `EffectiveBatchSize`
-  context-sizing for single-pass usecases (`options.go:119-124`) must be preserved
-  — do not let the serving `BlackwellLogicalBatch` default leak into single-pass
-  configs (A.5 gates on completion + `n_parallel>1`).
- **Turboquant fork:** the fork lacks some `common_params` fields (see
-  `LOCALAI_LEGACY_LLAMA_CPP_SPEC` precedent at `grpc-server.cpp:755`). `n_batch` /
-  `n_ubatch` are ancient fields and safe; if Phase B adds `max_prefill_per_iter`,
-  guard the new field behind a `#ifndef` like the checkpoint block does.
-
-## 5. Orthogonality to paged KV (Phase 2)
-
-Keep them independent. Paged KV (the `-kvp` / block-manager effort, draft #22569,
-and `paged/`) changes **where** KV blocks live (allocation/utilization). Chunked
-prefill / this decouple changes **how many tokens per iteration** the scheduler
-batches (the `n_batch` budget and decode/prefill interleave). They compose: paged
-KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
-scheduling window to feed those slots; neither touches the other's data structures.
-The only contact point is `update_slots()` — if both ship a vendored patch to it,
-land them as separate, ordered patches in `patches/` and keep the hunks disjoint
-(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
-budget).
-
---
-
-## 6. Bottom line
-
- Chunked prefill + decode interleave: **already present and correct** on the
-  pinned llama.cpp — verify (Phase 0.1), do not rebuild.
- Real work: the **n_batch/n_ubatch decouple** (Phase A) — small, additive,
-  default-preserving — plus an **optional decode-headroom prefill cap** (Phase B)
-  if measurements show ITL jitter. Both are LocalAI-side: A in `grpc-server.cpp`
-  + proto + `options.go`; B as a vendored `patches/` hunk.
--- a/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
+++ b/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
@@ -1,215 +0,0 @@
-# llama.cpp multi-user decode overhead on DGX Spark (GB10, sm_121)
-
-Investigation of the Qwen3-32B concurrent-decode throughput gap (llama.cpp ~547 t/s
-vs vLLM ~667 t/s) on the GB10 box, build `~/llama.cpp-pr24423/build` (Release,
-sm_121, `LLAMA_MAX_SEQ=256`, flash-attn on), model
-`~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
-
-## TL;DR (the result overturns the brief's premise)
-
-On **this** build the prime suspect is wrong and the host-overhead premise does not
-hold:
-
-1. **CUDA graphs are NOT disabled at high concurrency.** At npl=128, 94 of 98
-   decode `graph_compute` calls **replay a captured CUDA graph** (0 resets, stable
-   key, no property churn post-warmup). The keyed-warmup gate works.
-2. **There is no ~170ms/step host hotspot here.** The GPU is **~96% active during
-   decode with graphs ON and ~96% active with graphs OFF**. Decode at npl=128 is
-   **GPU-compute-bound**, not host-bound.
-3. The brief's "20% GPU util / 66ms GPU / 170ms host per step" was measured on a
-   different/earlier build (mainline without these graph fixes). It is not
-   reproducible on `llama.cpp-pr24423`.
-4. Because the GPU is the bottleneck, re-enabling graphs cannot lift the number:
-   the clean A/B shows graphs ON vs OFF = **+1.5% at npl=128** (and +2.9% at
-   npl=32 - the benefit shrinks as concurrency rises and the GPU saturates).
-5. The real gap to vLLM is the **quantized decode GEMM kernel**: `mul_mat_q`
-   (Q4_K + Q6_K) is ~68% of decode GPU time and runs ~2.1x above the GB10
-   memory-bandwidth floor. Closing the gap requires Marlin/Machete-style int4
-   GEMM kernels, not host-side work. This is a kernel project (the direction the
-   prior session's uncommitted `marlin-w4a16.cu` / `fp4-grouped-moe.cu` already
-   started, though those target w4a16/GPTQ-int4, not the K-quants this GGUF uses).
-
-## 1. Why CUDA graphs are (not) disabled - exact code + measurement
-
-### The gate (code)
-
-PR24423 refactored the CUDA-graph path into a keyed, warmup-based scheme in
-`~/llama.cpp-pr24423/ggml/src/ggml-cuda/ggml-cuda.cu`:
-
- `ggml_cuda_graph_get_key(cgraph)` (~L3343) keys the cached CUDA graph by
-  `cgraph->nodes[0]` (first-node pointer).
- `ggml_cuda_graph_check_compability(cgraph)` (~L3301) disables graphs only for:
-  - **split buffers** (`ggml_backend_buft_is_cuda_split`), and
-  - **`GGML_OP_MUL_MAT_ID`** when `src0` is non-quantized **or**
-    `ne[2] > get_mmvq_mmid_max(...)` (MoE expert routing needs a stream sync).
-  Qwen3-32B is **dense** -> no `MUL_MAT_ID` -> this condition never fires.
- `ggml_backend_cuda_graph_compute` (~L4514) warmup gate: a graph is used only
-  after **2 consecutive calls with no property change** (`warmup_complete`); any
-  property change resets warmup. `ggml_cuda_graph_update_required` (~L3347)
-  detects change by `memcmp` of the full `ggml_tensor` struct + per-src
-  data-ptr/ne/nb, with a fast path when `cgraph->uid` is unchanged.
-
-### Why it stays enabled across decode steps
-
-The graph stays stable because llama.cpp's host-side graph reuse holds during
-decode, so node pointers/props (and `cgraph->uid`) do not churn:
-
- `llama_kv_cache::get_n_kv` (`src/llama-kv-cache.cpp` L1223-1233) **pads n_kv to
-  a multiple of 256** ("so that the graph remains constant across batches and can
-  be reused"). For ntg<=256 within the first KV block, n_kv is constant.
- `can_reuse_kq_mask` (`src/llama-graph.cpp` L43) keeps the KQ-mask dims stable:
-  `ne=[n_kv, n_tokens/n_stream, 1, n_stream]` = `[256,1,1,128]` every decode step
-  at npl=128.
- `can_reuse` (`src/llama-context.cpp` L1283) therefore returns true, so the
-  scheduler is **not** reset/re-split. `graph->uid` is only reassigned inside
-  `ggml_backend_sched_split_graph` (`ggml/src/ggml-backend.cpp` L1033, L1485),
-  which is skipped on the reuse path -> stable uid -> CUDA graph replays.
-
-### Measurement (instrumented build, npl=128, ntg=96)
-
-Env-gated counters added to `ggml_backend_cuda_graph_compute` /
-`ggml_cuda_graph_update_required` (since `GGML_LOG_DEBUG` is compiled out in
-Release / NDEBUG). End-of-run summary:
-
-```
-[GTRACE-SUMMARY] calls=98 notenab=0 warming=3 warmdone=1 RESET=0 USED=94 incompat=0 distinct_keys=1
-```
-
-94/98 decode `graph_compute` calls **replayed** a captured CUDA graph; **0**
-warmup resets; a **single** distinct graph key for the whole decode; no node
-property churn after warmup. Graphs are fully engaged at npl=128.
-
-(The instrumentation was reverted afterwards; the checkout is back to its
-pre-task state and the `.so` rebuilt clean.)
-
-## 2. The per-step CPU "hotspot" - there isn't one on this build
-
-GPU utilization during npl=128 decode (ntg=256):
-
- **Graphs ON** - `nvidia-smi` sampled every 0.7s through the decode phase:
-  steady **96% GPU util**, SM clock **2184 MHz** (not throttled), 45-47 W.
- **Graphs OFF** (`GGML_CUDA_DISABLE_GRAPHS=1`) - nsys CUDA trace, 8s window:
-  total GPU kernel time = `3,983,292,128 ns / 0.516` = **~7.72s of the 8s
-  window = ~96% GPU-active**. Even with every kernel launched individually from
-  the host, the GPU is still ~96% busy. There are essentially **no host gaps**.
-
-Per-step wall = 60.6s / 256 steps = **~237 ms/step**, and the sum of one decode
-graph's kernel times (nsys, graphs-on capture) is ~244 ms -> GPU kernel time per
-step ~= wall time per step. The host work between steps is in the low single-digit
-ms (the ~4% idle), consistent with graphs ON giving only +1.5% at npl=128.
-
-This directly contradicts the brief's 66ms-GPU / 170ms-host split, which must have
-come from a pre-graphs build.
-
-### Per-step GPU breakdown (nsys, npl=128 decode, graphs off, 8s window)
-
-| Kernel | % GPU time | ~ms/step |
-|--------|-----------:|---------:|
-| `mul_mat_q` Q4_K (type 12) | 51.6 | ~118 |
-| `flash_attn_ext_f16` | 19.3 | ~44 |
-| `mul_mat_q` Q6_K (type 14) | 16.2 | ~37 |
-| `unary_gated` silu | 4.1 | ~9 |
-| mmq stream-k fixup + quantize_q8_1 | ~5 | ~12 |
-| rms_norm / rope / set_rows / add | ~4 | ~10 |
-
-Quantized matmul = **~68%** of decode GPU time (~155 ms/step). Attention ~19%.
-
-`perf` could not profile the host (kernel `perf_event_paranoid=4`), but it is moot:
-the host is ~4% of the wall, so there is no ~170ms host hotspot to chase.
-
-## 3. Fix attempt + measured result
-
-### The requested fix (re-enable graphs / pad the decode batch) is a no-op here
-
-Graphs are already enabled and the batch is already stable (n_kv padded to 256,
-kq_mask dims constant). The clean cold A/B (cooldowns between every run):
-
-| npl | graphs ON (t/s) | graphs OFF (t/s) | delta |
-|----:|----------------:|-----------------:|------:|
-| 32  | 242.60 | 235.75 | +2.9% |
-| 64  | 398.59 | 389.06 | +2.5% |
-| 128 | 543.95 | 535.71 | +1.5% |
-
-Baseline (separate cold runs, original non-instrumented build):
-npl=32 243.9, npl=64 397.1, **npl=128 544.95** (matches the ~546 baseline).
-
-Graphs help, but the benefit **monotonically shrinks** as concurrency rises and
-the GPU saturates. At npl=128 there is only ~1.5% of host launch overhead left to
-remove, and GPU util is ~96% in both columns. **You cannot lift npl=128 decode
-toward 667 by working on graphs/host overhead - the GPU is the bottleneck.**
-
-### Where the number actually is, and the real lever
-
- vLLM 667 t/s at this concurrency = **192 ms/step**; llama.cpp 547 = **237
-  ms/step**. The ~45 ms/step gap maps almost entirely onto the quantized matmul.
- GB10 memory-bandwidth floor for a 32B Q4_K_M (~19.8 GB of weights, read once
-  per step and shared across the 128 sequences) at ~273 GB/s is **~72 ms/step**.
-  llama.cpp's `mul_mat_q` spends ~155 ms/step on matmul = **~2.1x the bandwidth
-  floor**. vLLM's Marlin/Machete int4 GEMMs run much closer to the floor; that
-  efficiency difference is the ~547 -> 667 gap.
- The Q6_K matmul (`mul_mat_q` type 14) also shows pathological tail latency
-  (median 0.89 ms, max 5.5 ms) - the MMQ kernel is not well-tuned for the skinny
-  n=128 decode shape.
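-
-As a rough check on the figures in this list (a sketch that assumes decimal GB,
-one full read of the weights per step, 128 tokens generated per step, and the
-~155 ms/step of quantized matmul time from the nsys breakdown):
-
-$$
-\begin{aligned}
-t_{\text{floor}} &\approx \frac{19.8\ \text{GB}}{273\ \text{GB/s}} \approx 72.5\ \text{ms/step} \\
-\frac{t_{\text{mul\_mat\_q}}}{t_{\text{floor}}} &\approx \frac{155\ \text{ms}}{72.5\ \text{ms}} \approx 2.1 \\
-t_{\text{vLLM}} &\approx \frac{128\ \text{tokens/step}}{667\ \text{t/s}} \approx 192\ \text{ms/step}
-\end{aligned}
-$$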
-
-**The lever to beat 547 is a faster quantized decode GEMM**, i.e. a Marlin-style
-int4 kernel for the decode shapes. This is exactly the direction of the prior
-session's uncommitted `ggml/src/ggml-cuda/marlin-w4a16.cu` and
-`fp4-grouped-moe.cu` (already wired via
-`if (!split && ggml_cuda_w4a16_mul_mat(...)) return;` in `ggml_cuda_mul_mat`).
-Note those target **w4a16 / GPTQ-int4**, while this GGUF is **K-quant (Q4_K/Q6_K)**,
-so they are inert for this model - a Marlin path for K-quants (or shipping the
-model in a Marlin-friendly int4 format) would be required. That is a multi-day
-kernel effort, out of scope for this session, but it is the only lever that can
-move the number.
-
-### Why the "bump LLAMA_MAX_SEQ to 1024 -> 377" data point is consistent
-
-`llama_batch_allocr` keeps `seq_cpl` as an `LLAMA_MAX_SEQ x LLAMA_MAX_SEQ` table
-(`src/llama-batch.cpp`), so per-batch seq bookkeeping scales ~O(MAX_SEQ^2). At
-MAX_SEQ=1024 that host cost becomes large enough (~70 ms/step) to dominate and
-drop decode to 377. At MAX_SEQ=256 the same term is ~4.4 ms/step (the ~1.5% that
-graphs reclaim); lowering to 128 would save ~3 ms/step (~1%). So MAX_SEQ tuning
-confirms the host term is real but tiny at 256 - not a path to 667.
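-
-The MAX_SEQ figures can be reproduced with a simple scaling estimate. This is
-only a sketch: it assumes the host term is purely quadratic in `LLAMA_MAX_SEQ`
-and is anchored at the ~70 ms/step data point for 1024.
-
-$$
-\begin{aligned}
-t_{\text{host}}(N) &\approx 70\ \text{ms} \cdot \left(\frac{N}{1024}\right)^{2} \\
-t_{\text{host}}(256) &\approx \frac{70}{16} \approx 4.4\ \text{ms/step} \\
-t_{\text{host}}(128) &\approx \frac{70}{64} \approx 1.1\ \text{ms/step} \\
-t_{\text{host}}(256) - t_{\text{host}}(128) &\approx 3.3\ \text{ms/step}
-\end{aligned}
-$$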
-
-## How this would land in LocalAI
-
- **No host/graph patch is warranted** for this build: graphs already engage and
-  the decode is GPU-bound. A "pad the decode batch / force graph capture" patch
-  would change nothing measurable at high concurrency.
- The actionable upstream/vendored work is a **Marlin-style int4 decode GEMM**
-  (extend the prior `marlin-w4a16.cu` to cover K-quants, or quantize the served
-  model into a Marlin-friendly int4 layout). That is where the ~547 -> 667+ lives.
- If a small host win is still wanted, keep `LLAMA_MAX_SEQ` no larger than the max
-  concurrency actually used (the per-batch `seq_cpl` table is O(MAX_SEQ^2)).
-
-## Reproduction
-
-```
-# baseline / A/B (cold, 30s cooldowns)
-llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -npp 16 -ntg 128 -npl 32,64,128 \
-  -ngl 99 -b 2048 -ub 2048 -fa on            # graphs on
-GGML_CUDA_DISABLE_GRAPHS=1 ...same...        # graphs off
-
-# GPU util (graphs on): sample nvidia-smi during decode -> ~96%, 2184 MHz
-# GPU active (graphs off): nsys profile -t cuda --delay=6 --duration=8 ...
-#   nsys stats --report cuda_gpu_kern_sum  -> sum/0.516 ~= 7.72s of 8s = ~96%
-```
-
-## UPDATE: NVFP4 closes most of the decode gap (no Marlin-for-K-quants needed)
-
-The diagnosis above said the lever is "a more bandwidth-efficient int4 decode GEMM"
-and feared a multi-day Marlin-for-K-quants kernel. But the FP4-MMA path is already
-that kernel. Measured (npl=128, cold A/B, npp=16 ntg=128):
-
-| quant | decode S_TG (t/s) | vs Q4_K | vs vLLM 667 |
-|---|---|---|---|
-| Q4_K_M | 547 (548/546) | - | 82% |
-| **NVFP4** | **619 (617/622)** | **+13%** | **93%** |
-
-NVFP4's `mul_mat_q<NVFP4>` runs closer to the GB10 bandwidth floor at the thin n=128
-decode shape than Q4_K's int8-MMQ (which ran ~2.1x above it). So shipping the model
-as NVFP4 closes the decode gap from ~22% to ~7% AND wins prefill (1209 vs Q4 767 /
-vLLM 800). Net on GB10: llama.cpp+NVFP4 is ahead on prefill (1.5x) and within ~7% on
-decode. The remaining ~7% would be incremental FP4-MMA decode-kernel tuning, NOT a
-from-scratch Marlin kernel - a much smaller, optional effort. NVFP4 is the answer to
-both the prefill and the decode gap.
--- a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
@@ -1,253 +0,0 @@
-# Closing the vLLM Gap on Blackwell (GB10 / DGX Spark) — Living Plan & Results
-
-Target hardware: NVIDIA **GB10** (Grace-Blackwell, `sm_121a`, 119 GiB unified LPDDR5X), `dgx.casa`.
-Model under test: **Qwen3-Coder-30B-A3B-Instruct** (MoE, 128 experts, top-8, ~3B active).
-Engines: llama.cpp (CUDA, `~/llama.cpp-pr24423`, build `7a6ddc5`, `CMAKE_CUDA_ARCHITECTURES=121`) vs vLLM 0.23.0 (`~/vllm-bench`, torch 2.11.0+cu130).
-
-> This is a working document. Each phase appends measured numbers, what was learned, and what's next.
-> Methodology: `llama-bench` (single-stream pp/tg, built-in reps) and `llama-batched-bench` (`-npl` sweep,
-> decode-phase aggregate `S_TG`, prefill aggregate `S_PP`); vLLM via `~/bench/vllm_conc.py` (decode-phase
-> aggregate matched to `S_TG`). Same model/prompt/seed. Precision matched where possible.
-
---
-
-## Baseline results (established)
-
-### Single-stream (B=1), matched ~8-bit
-| Engine / precision | prefill pp512 (t/s) | decode tg128 (t/s) |
-|---|---|---|
-| llama.cpp **Q8_0** | 2215 ± 15 | **54.8 / 62.2** * |
-| llama.cpp **F16** | 700 ± 24 | 32.9 ± 0.05 |
-| vLLM **FP8** | 9155 ± 308 | 52.45 ± 0.05 |
-
-\* two sessions; ~55 right after worker-stop (clocks settling), ~62 steady state. Both ≥ vLLM → **single-stream parity holds**.
-
-### Concurrency sweep (decode-phase aggregate `S_TG`, prefill aggregate)
-| B | llama Q8 prefill | vLLM FP8 prefill | llama Q8 decode | vLLM FP8 decode |
-|---|---|---|---|---|
-| 1 | 1080 | 9644 | 60.1 | 48.0 |
-| 8 | 2189 | 33373 | 160.8 | 312.4 |
-| 32 | 2198 | 99398 | 357.1 | 1171 |
-| 64 | 2194 | 151990 | 519.2 | 2064 |
-
-llama F16 prefill also flat: B=1 452 → B=8 723 → B=32 778. **Prefill flat at both precisions = kernel-throughput ceiling.**
-
-### Our paged patch (LLAMA_KV_PAGED) — concurrency effect: NONE
-Same Q8 binary, paged branch confirmed firing (137 placements at B=8), throughput identical within noise:
-| | B=1 | B=8 | B=32 |
-|---|---|---|---|
-| stock decode | 61.2 | 171.7 | 377.0 |
-| paged decode | 62.7 | 170.8 | 376.8 |
-
-The patch is a placement-only correctness prototype; it does not implement concurrency mechanics. It is single-stream-neutral and concurrency-neutral.
-
---
-
-## Root-cause diagnosis (nsys + code audit)
-
- **74.5% of GPU compute = `mul_mat_q`** (Q8_0 int8 MMQ GEMM, the MoE experts). Only cutlass kernel seen is `cutlass_80_tensorop` = **Ampere (sm_80)**, not Blackwell.
- ggml-cuda has **NO FP8 path** (no e4m3/e5m2 GEMM, no cuBLASLt FP8). Q8_0 runs the **Ampere-class int8 `mma.sync s8.s8.s32`** even on GB10 (`mma.cuh:924`, dispatched unconditionally `mmq.cu:307`).
- ggml-cuda **DOES** have a **native Blackwell FP4 path** (MXFP4 + NVFP4, `mma...kind::mxf4...e2m1`, `mma.cuh:1126`, gated `BLACKWELL_MMA_AVAILABLE`). Merged via #17906/#20644/#21074.
- **No fused MoE grouped GEMM**, no tcgen05/wgmma (warp-level `mma.sync` only).
- **Small per-expert GEMMs**: 512-tok ubatch → ~32 tok/expert (128 exp, top-8) → thin GEMMs, memory-bound, can't fill tensor-core tiles. vLLM processes 8192 tok/step → ~512 tok/expert → compute-bound + FP8.
- **The 45–69× gap is partly apples-to-oranges**: we compared llama Q8 (Ampere int8) vs vLLM FP8 (Blackwell). Upstream/NVIDIA benches put the *real* FP4-vs-FP8 prefill gap at **~25–50% long-context**, not 45–69×.
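-
-The per-expert token counts in the small-GEMM bullet above follow from a simple
-average. This is a sketch that assumes routing is spread uniformly over the 128
-experts, which real routing only approximates:
-
-$$
-\begin{aligned}
-\text{tokens per expert} &\approx \frac{n_{\text{tokens}} \cdot k_{\text{top}}}{n_{\text{experts}}} \\
-\text{llama.cpp (512-token ubatch)} &: \frac{512 \cdot 8}{128} = 32 \\
-\text{vLLM (8192 tokens/step)} &: \frac{8192 \cdot 8}{128} = 512
-\end{aligned}
-$$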
-
-Key upstream refs: discussion #22042 (FP8 design: `ggml_mul_mat_ext` + scale tensors), #17906 (native MXFP4), #18250 (NVFP4-MoE closed not-planned).
-
---
-
-## The levers (cheap → expensive) — execution log
-
-### Lever 1 — NVFP4/MXFP4 model (use existing Blackwell FP4 path) + ubatch bump
-Status: **IN PROGRESS** — single-stream done, concurrency next.
-Quant: `llama-quantize F16 -> MXFP4_MOE` (type 38), 15.9 GiB / 4.47 BPW. (No NVFP4 in llama-quantize; MXFP4_MOE puts experts in MXFP4 = Blackwell FP4 MMA.)
-
-Single-stream (llama-bench), MXFP4 vs Q8 vs vLLM-FP8:
-| metric | llama Q8 | **llama MXFP4** | vLLM FP8 |
-|---|---|---|---|
-| prefill pp512 (ub512) | 2215 | **3061 ± 22** | 9155 |
-| prefill pp2048 (ub512) | ~2200 | 3137 ± 7 | — |
-| prefill pp2048 (**ub2048**) | — | **3441 ± 14** | — |
-| decode tg128 | 62.2 | **86.4 ± 0.3** | 52.45 |
-
-Findings:
- **MXFP4 decode 86.4 beats vLLM FP8 52.45 by 1.65×** (4-bit = less memory traffic; decode is memory-bound). llama wins decode outright.
- MXFP4 prefill +38% over Q8; **ub2048 lifts prefill +10%** (3137→3441). Single-stream prefill gap to vLLM: 4.1× (Q8) → **2.7× (MXFP4)**.
- Caveat: MXFP4 is 4-bit vs vLLM FP8 8-bit — not precision-matched. Fair match = vLLM NVFP4 (4-bit); pending.
-
-Concurrency (decode-phase aggregate `S_TG`, ub2048), MXFP4 vs Q8 vs vLLM-FP8:
-| B | Q8 dec | **MXFP4 dec** | vLLM dec | Q8 pp | **MXFP4 pp** | vLLM pp |
-|---|---|---|---|---|---|---|
-| 1 | 60.1 | **83.4** | 48.0 | 1080 | 1625 | 9644 |
-| 8 | 160.8 | **267.4** | 312.4 | 2189 | 3634 | 33373 |
-| 32 | 357.1 | **551.2** | 1171 | 2198 | 3651 | 99398 |
-| 64 | 519.2 | **770.2** | 2064 | 2194 | 3648 | 151990 |
-
-**Lever-1 verdict:** MXFP4 is a large, free win: decode is +39–66% over Q8 across the sweep (+39% at B=1, +66% at B=8) and the prefill plateau rises +66% (2200→3650). MXFP4 decode **wins at B=1 and is within ~15% at B=8** vs vLLM; it only falls well behind at high concurrency. **Prefill still plateaus (~3650)**: the MoE prefill GEMM doesn't scale with batch (no fused grouped GEMM; ubatch-limited). That plateau is the real remaining structural gap → Levers 2–3. Quality caveat unchanged (MXFP4 4-bit vs vLLM FP8 8-bit; quality not yet evaluated).
-
-### Lever 2 — `n_ubatch` / `n_batch` tuning (standalone)
-Status: **DONE + SHIPPED (auto-default implemented)**
-MXFP4 pp4096 vs ubatch: ub512=2994, **ub2048=3316**, ub4096=2820(noisy), ub8192=3180.
-**Verdict:** prefill saturates at ub=2048; larger ubatch gives nothing. The ~3300–3650 ceiling is the **MoE GEMM kernel**, not batch size. → No more free config wins; the rest is kernel work (Levers 3–5).
-**Implemented:** `core/backend/hardware_defaults.go` — `EffectiveBatchSize` now defaults the physical batch
-(n_batch→n_ubatch alias) to **2048 on Blackwell** (`xsysinfo.IsNVIDIABlackwell`, cc≥12 / sm_120/121) when the
-config leaves `batch:` unset; explicit `batch:` always wins. Detection is a shared Go helper; placed at the
-common ModelOptions builder so it covers the C++ llama.cpp backend too. Tests: `hardware_defaults_internal_test.go`.
-
-### Lever 1b — Standard Q4 vs MXFP4 (what's actually MXFP4-specific)
-**Q4_K_M** (17.3 GiB) vs **MXFP4** (15.9 GiB), ub2048:
-| metric | Q4_K_M | MXFP4 | Q8 |
-|---|---|---|---|
-| decode tg128 | **93.5** | 86.4 | 62.2 |
-| prefill pp512 | 2164 | **3061** | 2215 |
-| prefill pp2048 | 2953 | **3441** | ~2200 |
-
-**Verdict:** the **decode win is just "4-bit"**: plain Q4_K_M matches/beats MXFP4 on decode (both memory-bound).
-MXFP4's *only* real edge is **prefill (+41% over Q4_K_M)** via Blackwell FP4 tensor cores. So for shipping,
-**"4-bit quant + ubatch=2048" captures most of the win portably**; MXFP4 is a Blackwell-only prefill extra.
-
-### Lever 3 — Fused FP4/FP8 MoE grouped GEMM (+ activation-quant fusion)
-Status: **DESIGNED + PROFILED, not built** (multi-week kernel R&D). The single biggest remaining prefill win.
-
-**Decisive measurements:**
- Prefill does NOT scale with bigger single prompts (attention O(N²) confounds): MXFP4 pp2048=3295, pp8192=1524,
-  pp16384=2051. So the plateau is not a batch-size fix.
- Real gap is batched many-sequence prefill: B=32 llama 3651 vs vLLM 99398 = **27×**. llama.cpp MoE prefill runs
-  at only **~22 effective TFLOP/s** on the GB10, far below the GPU's peak throughput, so there is large headroom.
- **nsys (MXFP4 pp2048):** `mul_mat_q<type39>` (MoE FP4 GEMM) = **37.2%**, `quantize_mmq_mxfp4` (act-quant) = 8.0%,
-  `mul_mat_q<type8>` (dense/attn, still Q8) = 10.1%, flash_attn = 8.8%. The native FP4 MMA *is* engaged — the
-  inefficiency is the **per-expert thin-tile MMQ scheduler** + **un-fused activation quant**.
-
-**Target (precise):** the ~45% in `mmq.cu`'s grouped MoE path (`ggml_cuda_mul_mat_q` + `ids`, `mmid.cu`). Replace
-the per-expert thin-tile scheduler with a CUTLASS-style grouped GEMM (full tiles regardless of tokens/expert) and
-fuse `quantize_mmq_mxfp4` into the permute/gather. Dense Q8 matmuls (10%) are the separate Lever-4 (FP8) target.
-Problem (measured): the prefill ceiling is the MoE expert GEMM. Today `ggml_cuda_mul_mat_q` with `ids`
-(`mmq.cu:127`) launches one grouped MMQ over a 3D grid (z = expert), but each expert's tile is thin
-(~tokens/expert columns) so int8/FP4 tensor cores run underfilled; throughput is memory-bound on weight
-streaming and flat vs batch.
-Approach:
- Replace the per-expert thin-tile scheduler with a **CUTLASS-style grouped GEMM** that concatenates all
-  experts' token-blocks into one problem with per-group offsets, so tiles are always full (m16n8k64 FP4 /
-  m16n8k32 FP8) regardless of per-expert token count. Mirrors vLLM's `fused_moe` + cutlass grouped GEMM.
- **Fuse activation quantization into the permute/gather** (the `quantize_mmq_q8_1`/FP4 quantize is currently a
-  separate 3.3% kernel) so the routed activations are quantized as they're scattered into expert order.
- Files: new kernel under `ggml/src/ggml-cuda/` (e.g. `moe-grouped-gemm.cu`) + dispatch hook in
-  `ggml_cuda_mul_mat_id` (`ggml-cuda.cu:2622`); reuse `mmid.cu` routing/`expert_bounds`.
- Effort: high (2–4 wks expert CUDA). Risk: numerics + sm_121 tile tuning. Expected payoff: the bulk of the
-  prefill gap (vLLM's MoE prefill advantage is mostly this). Upstream: #18250 (NVFP4-MoE) was closed
-  not-planned, so this would be a LocalAI patch or a fresh upstream proposal.
-
-### Lever 4 — FP8 (e4m3) GEMM for dense layers
-Status: **DESIGNED, not built** (blocked on a core ggml API change).
-Problem: ggml-cuda has no FP8 matmul (only int8/FP4). vLLM runs qkv/o_proj/lm_head in FP8 on Blackwell
-tensor cores. Our dense layers run int8-MMQ or f16-cuBLAS.
-Approach (two options):
-- (a) **cuBLASLt FP8**: route dense `mul_mat` through `cublasLtMatmul` with `CUDA_R_8F_E4M3` A/B and FP32
-  compute + scale pointers. Lowest kernel effort; gets library-tuned Blackwell FP8 immediately. Needs the
-  scale-tensor plumbing below.
-- (b) **Hand-written sm_121 `mma.sync ...e4m3.e4m3.f32`** kernels in `mma.cuh`/`mmf.cu`. More control, more work.
-- Prerequisite (both): the **`ggml_mul_mat_ext` / scale-tensor API** from upstream discussion #22042 —
-  per-tensor FP8 scales don't fit the block-scaled quant struct; `MUL_MAT`/`MUL_MAT_ID` must accept optional
-  scale tensors. This is a cross-cutting ggml change (graph + ops + all backends' fallbacks).
-- Effort: high (API change is the hard part; cuBLASLt path is then moderate). Payoff: closes dense-layer
-  prefill/compute gap; complements Lever 3. Note: for *this* MoE model the experts dominate, so Lever 3 > 4.
-
-### Lever 5 — tcgen05 / wgmma-class kernels for large-prefill tiles
-Status: **DESIGNED, not built** (very high effort; last increment).
-Problem: ggml's tensor-core path is warp-level `mma.sync` only (no `wgmma`/`tcgen05`). Blackwell's
-tensor-memory `tcgen05` MMA (what CUTLASS uses) extracts substantially more throughput at large prefill tiles.
-Approach: introduce warpgroup/tcgen05 GEMM main-loops for the FP4/FP8 paths (effectively adopting CUTLASS
-3.x collective mainloops for sm_120/121), used when tile size is large enough (prefill). Decode (thin) keeps
-`mma.sync`.
-- Effort: very high (CUTLASS-class engineering). Payoff: the final slice of large-prefill throughput; only
-  worth it after Levers 3–4 land. Realistically: depend on/upstream CUTLASS kernels rather than hand-roll.
-
----
-
-## Paged attention — complete implementation (after kernels are fair)
-The placement prototype is insufficient (measured: zero concurrency benefit). A real implementation must close all
-four gaps below. The CPU foundation is already built & verified (`PagedKVManager` P0–P3, `README.md`); the in-model parts
-are unbuilt. **Build order and concrete design:**
-
-1. **On-demand block allocation from a shared pool** (capacity win — more concurrent seqs before OOM).
-   - Replace `find_slot`'s ring-buffer (`llama-kv-cache.cpp:818`) with `PagedKVManager` block allocation; the
-     KV tensor becomes a shared block pool `[n_embd, block_size*num_blocks]`, sequences draw blocks on demand
-     (already prototyped on CPU: `paged_kv_manager.{h,cpp}`, `test_ggml_paged_rw.cpp`).
-   - Win measured where it counts: max concurrent sequences before OOM (not yet benchmarked — needs this).
-2. **Gather-read** so each seq attends only its own blocks (`get_k`/`get_v` `:1145/1165` → `ggml_get_rows`
-   gather into scratch, then existing attention). Numerically proven on CPU (`test_ggml_paged_attn.cpp`,
-   7.5e-08 vs reference). Needs `build_attn_paged` branch in `llama-graph.cpp` + Gate 0 in a real model.
-3. **Continuous batching / scheduler** (no head-of-line blocking on mixed-length traffic). New scheduler in
-   the server slot path; admit/evict at block granularity; the dimension where paging beats llama.cpp's
-   current static batching. This is where the *real* concurrency win lives (vs our synthetic uniform test).
-4. **Automatic prefix sharing** (block-hash dedup; `PagedKVManager::{compute_block_hashes,get_computed_blocks}`
-   already implemented & tested). Cross-tenant shared system prompts reuse physical blocks.
-
-Status: design in `2026-06-19-paged-attention-llamacpp-design.md`; CPU P0–P3 done; in-model #1–#4 unbuilt.
-**Then** measure concurrency in paging's real scenarios — **memory-pressured (max seqs before OOM)** and
-**mixed-length continuous batching** — on the MXFP4 (fair-quant) footing, not the uniform/over-provisioned
-test that (correctly) showed no benefit.
-
-> Reality check from this session's data: paged attention is a **capacity + scheduling** win, not a per-token
-> speed win. On GB10 with 119 GB unified memory and uniform requests we are not memory-bound at B≤64, so the
-> placement prototype showed nothing. Paging's value appears under memory pressure (many/long sequences) and
-> bursty mixed-length traffic. The per-token throughput gap is a **kernel** problem (Levers 1–3), separate
-> from paging.
-
----
-
-## Implementation plan A — Lever 3: FP4 MoE GEMM to vLLM parity
-
-Goal: lift batched MoE prefill from ~3.65k t/s (B=32) toward vLLM's ~99k. Root cause (profiled):
-`mul_mat_q<MXFP4>` runs at ~22 effective TFLOP/s — warp-level `mma.sync`, not Blackwell tcgen05.
-Cheap knobs are exhausted (ubatch saturates at 2048; `GGML_CUDA_FORCE_CUBLAS` is a no-op 3419↔3423;
-tile width already full at mmq_x=128). So parity needs kernel work, done iteratively on the DGX
-(`~/llama.cpp-pr24423`, editable + rebuildable; diffs captured as `patches/`).
-
-Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build --target llama-bench` →
-`llama-bench` MXFP4 pp/concurrency → record):
-1. **Cheap kernel tweaks (low confidence, fast).** nwarps (occupancy), `mmq_y` tile, stream-k on/off,
-   FP4 load-tile path. Measure each. Likely small (<1.3x) — these don't change the warp-MMA ceiling.
-   - **Result (nwarps):** DEAD END. `nwarps` is locked by `static_assert(nwarps*tile_C::I == mmq_y)`
-     (mmq.cuh:3234) → nwarps=8 for mmq_y=128. Can't raise occupancy without co-scaling mmq_y to 256
-     (nwarps=16), which blows Blackwell shared-memory limits. The MMQ constants are tightly coupled;
-     it is not freely tunable. Confirms parity needs the kernel rewrite (phase 3), not knobs.
-2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel +
-   a global round-trip. Tractable, ~1.1x.
-   - **Result:** NOT AVAILABLE as a cheap patch. `quantize_mmq_fp4_cuda` (mmq.cu:200) *already* takes
-     `ids_src1` — the gather is already fused into the quant. The only remaining fusion is quantize-on-load
-     *inside* the GEMM hot loop (intricate, ~8% ceiling, risky). ORippler's #24481 fuses the decode (MMVQ)
-     post-scale and intends a "BS>1" (prefill) follow-up — unwritten. Marginal; skip.
-
-**Upstream survey (2026-06):** there is NO tcgen05/CUTLASS grouped-GEMM MoE kernel in ggml — not merged,
-not in-flight, not a draft (Discussion #18369 is talk, no PR; #18250 closed not-planned). CUTLASS is not a
-dependency (the profile's `cutlass_80_tensorop` is cuBLAS-internal). No fork has a portable MoE kernel
-(croll83/llama.cpp-dgx is GatedDeltaNet-focused). Maintainer signal (woachk on #17906): "the path forward
-is to wait for cuTile C++." So **nothing to cherry-pick; phase 3 is genuinely from-scratch.**
-3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a
-   CUTLASS 3.x collective-mainloop grouped GEMM (sm_120a, `e2m1` block-scaled, tcgen05 tensor-memory MMA),
-   one problem over all experts with per-group offsets, fused act-quant. This is what vLLM/FlashInfer use.
-   Multi-week; the honest path to parity. Prefer **upstream ggml** (issue drafted) over a private patch.
-4. **Full-model low precision.** Quantize dense layers (qkv/o_proj/lm_head, the 10% Q8) to FP4/FP8 too so
-   the whole prefill runs on FP4 tensor cores, not int8-MMQ.
-Exit per phase: measured t/s recorded here; stop a phase when it's a dead end (recorded as such).
-Matching vLLM realistically requires phase 3; phases 1–2 are the warm-up + de-risking.
-
-## Implementation plan B — Complete paged attention (the pivot)
-
-CPU foundation done (P0–P3, `README.md`): vLLM-parity block manager + ggml write/gather + attention
-numerics + placement Gate 0 (token-identical in-model). Remaining = make it deliver the multi-tenant wins.
-Phases:
-1. **On-demand shared-block pool** — replace `find_slot` ring buffer (`llama-kv-cache.cpp:818`) with
-   `PagedKVManager` block allocation; KV tensor = `[n_embd, block_size*num_blocks]` shared pool. Win:
-   fit more concurrent seqs before OOM. Test: max concurrent seqs at fixed budget vs contiguous.
-2. **Gather-read** (`get_k/get_v` `:1145/1165` → `ggml_get_rows` into scratch) + `build_attn_paged` branch
-   in `llama-graph.cpp`. Numerically proven on CPU (7.5e-08). Gate 0: token-identical multi-seq.
-3. **Continuous batching / scheduler** — admit/evict at block granularity in the server slot path. The
-   real concurrency win on mixed-length traffic (where the placement prototype showed nothing).
-4. **Automatic prefix sharing** — block-hash dedup (`PagedKVManager::{compute_block_hashes,get_computed_blocks}`
-   already implemented + tested). Cross-tenant shared system prompts reuse physical blocks.
-Then benchmark in paging's real regimes — **memory-pressured** + **mixed-length continuous batching** — on
-the MXFP4 (fair-quant) footing. Note: GB10's 119 GB unified memory means win-1 needs genuine pressure
-(long/many seqs) to show; the win is capacity + scheduling, not per-token speed.
-
-## Honest scope note
-Levers 3–5 and the complete paged implementation are each substantial (weeks of expert CUDA/systems work). This doc tracks what is **measured** vs **designed** vs **not-yet-built**, and never claims a number that wasn't run on the box.
--- a/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md
+++ b/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md
@@ -1,59 +0,0 @@
-# FP4 grouped-GEMM MoE kernel (Lever 3) — scaffold + implementation plan
-
-The one piece of work that actually closes the vLLM gap on Blackwell (GB10/sm_121). Prefill and decode are both
-bottlenecked by the same kernel: `mul_mat_q<MXFP4>` (warp-level `mma.sync` grouped MMQ, ~22 TFLOP/s) is
-**37%** of prefill and **54.6%** of decode-at-B=64 GPU time (`BENCHMARKS.md`). Paged attention can't touch
-it (proven). The fix is a CUTLASS-3.x collective-mainloop grouped GEMM with block-scaled `e2m1` operands via
-tcgen05 tensor-memory MMA — what vLLM/FlashInfer/TRT-LLM use.
-
-## Scaffold (DONE — builds clean, default byte-identical)
-
-Lives in the DGX checkout `~/llama.cpp-pr24423/ggml/src/ggml-cuda/` (to be rebased onto the pin as a patch /
-upstreamed). Captured diff: `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`.
-
-- `fp4-grouped-moe.{cuh,cu}` — entry `ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst) -> bool`
-  (true = handled, false = fall back to MMQ). Gated behind env `GGML_CUDA_FP4_GROUPED`. Currently always
-  returns false → **default build unchanged**.
-- Hook in `ggml_cuda_mul_mat_id` (the MoE dispatch), before the `ggml_cuda_mul_mat_q(...ids...)` call:
-  `if (ggml_cuda_fp4_grouped_moe(...)) return;`. Builds via the `file(GLOB "*.cu")` (re-run cmake configure
-  after adding the file — GLOB is configure-time).
-
-This is the integration seam. The kernel fills the stub.
-
-## Implementation phases (each: build on GB10 → numerical parity vs `mul_mat_q<MXFP4>` → bench)
-
-1. **Reference grouped GEMM (correctness first, slow OK).** Per-expert problem sizes + offsets from `ids`;
-   dequant `e2m1`+scales → BF16; loop CUTLASS (or cuBLAS) per group. Gate: output matches MMQ within fp tol
-   on a 2-expert toy + the real model (token-identical greedy). Establishes the harness + the data plumbing.
-2. **CUTLASS GemmGrouped, sm_120a, BF16 operands.** Replace the loop with one `cutlass::gemm::device::
-   GemmGrouped` launch over all experts (per-group offsets). Measures the grouping win alone.
-3. **Block-scaled FP4 operands (the real lever).** `e2m1` A/B with `e8m0`(MX)/`e4m3`(NV) block scales via the
-   Blackwell scaled-MMA collective (tcgen05 tensor-memory). This is where the TFLOP/s jumps. Needs CUTLASS
-   3.x + sm_120a; verify the block-scale layout matches ggml's MXFP4/NVFP4 packing.
-4. **Fuse activation quant** (the F32→FP4 of src1) into the gather/permute prologue.
-5. **Enable by default** on sm_120/121 when parity holds + faster; keep the env as an escape hatch.
-
-## Dependencies / decisions
-
-- **CUTLASS is not currently a ggml dependency** (the profile's `cutlass_80_tensorop` is cuBLAS-internal).
-  Adding it = submodule/fetch + include dir, gated to CUDA sm_120+. Float the approach with ggml maintainers
-  early (Discussion #18369 is the home; JohannesGaessler asked to discuss arch before big kernel work).
-- Target sm_120a/121a (consumer Blackwell). Datacenter Blackwell (sm_100) is a separate tile config.
-- Risk: needs ncu-driven iteration on the GB10; this is multi-week, expert-CUDA. No upstream base to fork
-  (exhaustive search confirmed). Net-new value upstream.
-
-## DENSE scope — RESOLVED (TODO #28, benchmarked): dense needs an FP4 GEMM too
-
-Benchmarked Qwen3-32B dense, vLLM W4A16 vs llama.cpp Q4_K_M (`BENCHMARKS.md`). **Dense prefill is 7.6–32×
-behind** (llama int8-MMQ plateaus ~765 t/s; vLLM FP4 scales to 24.4k); decode ~parity at B=1, 2.2× at B=64.
-So the kernel track is **two kernels, not one**:
-
-- **(a) Dense FP4 GEMM** — a plain non-grouped CUTLASS/tcgen05 block-scaled FP4 GEMM. **Simpler than grouped;
-  land this FIRST** — it's the easier first kernel, benefits every dense model, and de-risks the FP4 collective
-  before the grouped variant. Hook: the non-MoE `ggml_cuda_mul_mat_q` (no `ids`) path.
-- **(b) MoE grouped FP4 GEMM** — the scaffold above (`ggml_cuda_fp4_grouped_moe`), per-expert offsets.
-
-Both share the same block-scaled `e2m1` collective; (a) is (b) with one group. Suggested order: build (a),
-prove the FP4 collective + parity harness, then generalize to (b). (Aside: full W4A4 NVFP4 doesn't run on
-GB10 today — FlashInfer ships no FP4 cubins for sm_121, so the dense `mm_fp4` kernel hangs/returns zeros; the
-W4A16 Marlin path is the fast, correct one and is the fair comparison. See `BENCHMARKS.md` for the root cause.)
--- a/backend/cpp/llama-cpp/paged/MXFP4_QUALITY.md
+++ b/backend/cpp/llama-cpp/paged/MXFP4_QUALITY.md
@@ -1,140 +0,0 @@
-# MXFP4-dense vs Q4_K_M quality check (Qwen3, GB10 / DGX Spark)
-
-## Question
-
-MXFP4-quantized **dense** Qwen3-32B is measurably faster on GB10 (Blackwell) than
-Q4_K_M: ~1.58x concurrent prefill, ~1.2x decode, for free (just a requantize that
-routes onto the FP4-MMA kernel). Before LocalAI recommends MXFP4-dense as a Blackwell
-default, we must confirm its **quality is acceptable versus Q4_K** (Q4_K is normally the
-stronger 4-bit format).
-
-Critical caveat going in: the pre-existing `~/bench/q3-32b-mxfp4-dense.gguf` was built
-with `--allow-requantize`, so it was suspected to be **double-quantized** (Q4_K_M ->
-MXFP4), which would unfairly penalize MXFP4. The goal here was a *fair* answer.
-
-## Verdict
-
-**Do NOT recommend MXFP4-dense as a quality-equivalent replacement for Q4_K on
-Blackwell.** A clean apples-to-apples test (same BF16 source, both 4-bit, no imatrix)
-shows MXFP4-dense carries a **large** quality penalty that Q4_K does not:
-
-- Q4_K_M costs **+2.6%** perplexity vs the BF16 baseline.
-- MXFP4-dense costs **+30.8%** perplexity vs the BF16 baseline (i.e. **+27.5% worse
-  than Q4_K**).
-
-The double-quant suspicion was correct but it was **not** the main culprit: even a clean
-MXFP4-from-BF16 is dramatically worse than Q4_K. The ~1.58x prefill / ~1.2x decode
-speedup is real, but it is not free on quality. MXFP4-dense output is still coherent (not
-gibberish), so it is usable where raw throughput dominates and a quality hit is
-acceptable, but it must not be presented as a drop-in, quality-neutral Q4_K replacement.
-
-## Evidence
-
-### 1. Provenance of the existing 32B MXFP4 (it is double-quant)
-
-`~/dense_mxfp4.sh` (mtime matches the `q3-32b-mxfp4-dense.gguf` mtime, Jun 20 09:47)
-created it:
-
-```
-SRC=$HOME/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf      # <-- source is Q4_K_M, not F16/BF16
-OUT=$HOME/bench/q3-32b-mxfp4-dense.gguf
-$QB --allow-requantize --tensor-type "attn=mxfp4" --tensor-type "ffn=mxfp4" \
-    "$SRC" "$OUT" MXFP4_MOE
-```
-
-Confirmed **double-quantized** (Q4_K_M -> MXFP4). Any PPL measured on this file
-overstates MXFP4's true penalty, so the 32B number below is a loose upper bound, not the
-fair answer.
-
-### 2. 32B quick read (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99)
-
-`llama-perplexity`, PR build `~/llama.cpp-pr24423/build` (sm_121):
-
-| 32B model | PPL | vs Q4_K |
-|---|---|---|
-| Qwen3-32B-Q4_K_M | **7.3865** +/- 0.177 | - |
-| q3-32b-mxfp4-dense (double-quant) | **8.4638** +/- 0.206 | +14.6% |
-
-MXFP4 is much worse than Q4_K here, **but** the file is double-quantized, so this quick read is
-unfair to MXFP4. The check was therefore escalated to a clean small-model comparison.
-
-### 3. Fair comparison: clean small dense model (Qwen3-4B BF16)
-
-The MXFP4-vs-Q4_K delta is a *format* property and roughly model-size-independent, so a
-small model gives a fast, clean answer. Downloaded `Qwen3-4B-BF16.gguf` (unsloth, ~7.7
-GiB) and quantized it **from that same BF16 source** to both formats with the identical
-recipe used for the 32B (no `--allow-requantize` needed, no imatrix on either side):
-
-```
-llama-quantize  q3-4b-bf16.gguf  q3-4b-q4km.gguf   Q4_K_M
-llama-quantize --tensor-type attn=mxfp4 --tensor-type ffn=mxfp4 \
-               q3-4b-bf16.gguf  q3-4b-mxfp4.gguf  MXFP4_MOE
-```
-
-Perplexity (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99):
-
-| Qwen3-4B | size | PPL | vs BF16 | vs Q4_K |
-|---|---|---|---|---|
-| BF16 (baseline) | 7672 MiB | **13.3188** +/- 0.416 | - | - |
-| Q4_K_M | 2497 MiB | **13.6605** +/- 0.426 | **+2.57%** | - |
-| MXFP4 (clean) | 2236 MiB (4.66 BPW) | **17.4183** +/- 0.561 | **+30.78%** | **+27.5%** |
-
-This is the apples-to-apples quality answer: **clean MXFP4-from-BF16 is ~12x more lossy
-than Q4_K relative to the BF16 baseline** (30.8% vs 2.6%). Notably the clean-4B MXFP4-vs-
-Q4_K gap (+27.5%) is *wider* than the 32B double-quant gap (+14.6%), consistent with
-smaller models being more quantization-sensitive - the double-quant did not invent the
-problem, it is intrinsic to the format as quantized by `llama-quantize`.
-
-### 4. Coherence spot-check (32B, llama-simple, n=60)
-
-MXFP4-dense 32B is fully coherent, not degraded gibberish:
-
-- "The capital of France is" -> MXFP4: "...Paris, is located near the Seine River..."
-  (correct); Q4_K similar.
-- "Q: What is 17 multiplied by 23? A:" -> MXFP4 reasons via the distributive property
-  (sound); Q4_K answers 391 directly (correct).
-- "def fibonacci(n):" -> both emit valid Python.
-
-So the quality cost shows up as measurably higher perplexity (and would surface on harder
-/ longer tasks), not as obviously broken text at short generation lengths.
-
-## Why
-
-`MXFP4_MOE` is a 4-bit float format (E2M1 values, shared E8M0 scale per block of 32,
-round-to-nearest) designed for MoE expert tensors (gpt-oss et al.) with a coarse
-per-block scale. Q4_K uses 6-bit superblock scales plus per-sub-block mins - materially
-better for dense attention/FFN weights. Forcing MXFP4 onto dense layers to reach the FP4
-kernel trades ~1.58x prefill for a large accuracy loss. The FP4-MMA speed path is real,
-but the weights it accepts (MXFP4 here) are lossy for dense.
-
-## Caveat, stated precisely
-
-This measures **llama.cpp's `llama-quantize` MXFP4** (OCP MX FP4, RTN, **no imatrix**)
-against **llama.cpp's Q4_K_M** (k-quant superblocks, also no imatrix here). It is a fair
-format-vs-format comparison of exactly what LocalAI would ship if it routed a requantize
-through this path. It does **not** claim FP4 is fundamentally unviable on Blackwell:
-
-- An imatrix-aware MXFP4, or a better FP4 format with two-level scaling
-  (**NVFP4** - there are already `q3-32b-nvfp4` / `q3-32b-nvfp4a16` dirs on the box),
-  may close much of this gap and is the more promising Blackwell FP4 path to evaluate.
-- The result is for Qwen3 dense; other families may differ in magnitude but the
-  format-level disadvantage of plain MXFP4 RTN vs Q4_K is expected to hold.
-
-## Recommendation
-
-- **Do not** ship a blanket "use MXFP4-dense on Blackwell" recommendation as a Q4_K
-  quality equivalent. The ~1.58x prefill / ~1.2x decode win comes with a real ~30% PPL
-  inflation (vs ~2.6% for Q4_K). Q4_K_M stays the right dense default on Blackwell.
-- If exposing MXFP4-dense at all, gate it as an explicit **throughput-over-quality**
-  option with the perplexity caveat surfaced, not a default.
-- MXFP4/FP4 remains correct where the model is trained for it (MoE / gpt-oss-style).
-  Pursue **NVFP4** (and/or imatrix-aware FP4) as the quality-competitive Blackwell FP4
-  format before making any FP4-dense recommendation.
-
-## Reproduction (DGX Spark, GB10, build `~/llama.cpp-pr24423/build`, sm_121)
-
-- Dataset: `~/wikitext-2-raw/wiki.test.raw` (wikitext-2-raw-v1 test).
-- 32B: `~/ppl32b.sh` -> `~/ppl32b.out`; coherence `~/coh32b.sh` -> `~/coh32b.out`.
-- Clean 4B: `~/fair4b.sh` -> `~/fair4b.out` (quantize + 3x perplexity).
-- All runs `-ngl 99`, `--chunks 50`, `-c 512`. GB10 thermal-throttles but PPL is a
-  correctness metric, so thermal state does not affect these numbers.
--- a/backend/cpp/llama-cpp/paged/Makefile
+++ b/backend/cpp/llama-cpp/paged/Makefile
@@ -1,41 +0,0 @@
-CXX ?= g++
-CXXFLAGS ?= -std=c++17 -O2 -Wall -Wextra -I.
-
-TESTS = test_free_block_queue test_block_pool test_paged_kv_manager test_prefix_cache
-BINS  = $(addprefix tests/,$(TESTS))
-
-all: $(BINS)
-
-tests/%: tests/%.cpp paged_kv_manager.cpp paged_kv_manager.h
-	$(CXX) $(CXXFLAGS) -o $@ $< paged_kv_manager.cpp
-
-check: all
-	@for t in $(BINS); do echo "== $$t =="; ./$$t || exit 1; done
-
-paged-bench: paged-bench.cpp paged_kv_manager.cpp paged_kv_manager.h
-	$(CXX) $(CXXFLAGS) -o $@ paged-bench.cpp paged_kv_manager.cpp
-
-bench: paged-bench
-	./paged-bench
-
-# --- Optional ggml integration test (Phase 1: paged write/gather mechanism) ---
-# Requires a built ggml. Override these to point at your checkout / build:
-#   make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build>
-GGML_SRC   ?= ../../llama-cpp-fallback-build/llama.cpp/ggml
-GGML_BUILD ?= /tmp/ggml-build
-GGML_LIBDIR = $(GGML_BUILD)/src
-
-GGML_TESTS = test_ggml_paged_rw test_ggml_paged_attn
-GGML_BINS  = $(addprefix tests/,$(GGML_TESTS))
-
-tests/test_ggml_%: tests/test_ggml_%.cpp paged_kv_manager.cpp paged_kv_manager.h
-	$(CXX) $(CXXFLAGS) -I$(GGML_SRC)/include -o $@ $< paged_kv_manager.cpp \
-		-L$(GGML_LIBDIR) -lggml -lggml-base -lggml-cpu -Wl,-rpath,$(GGML_LIBDIR)
-
-ggml-check: $(GGML_BINS)
-	@for t in $(GGML_BINS); do echo "== $$t =="; ./$$t || exit 1; done
-
-clean:
-	rm -f $(BINS) $(GGML_BINS) paged-bench
-
-.PHONY: all check ggml-check clean
--- a/backend/cpp/llama-cpp/paged/NVFP4_TEST.md
+++ b/backend/cpp/llama-cpp/paged/NVFP4_TEST.md
@@ -1,114 +0,0 @@
-# NVFP4-dense on DGX Spark (GB10, sm_121): is it the quality-preserving FP4 win MXFP4 wasn't?
-
-Test rig: DGX Spark GB10 (sm_121), `~/llama.cpp-pr24423/build` (PR #24423, FP4 MMA + NVFP4
-kernel), wikitext-2-raw, clean BF16 source `q3-4b-bf16.gguf` (the same source used for the
-established MXFP4 / Q4_K fair test). NVFP4 and all comparison quants were produced clean from
-BF16, no imatrix.
-
-## Verdict (short)
-
-YES on all the load-bearing questions, with one honest caveat:
-
-1. llama.cpp CAN produce an NVFP4 GGUF.
-2. NVFP4 quality is Q4_K-class, NOT MXFP4-class: +7.4% PPL vs BF16 (MXFP4 was +30.8%). It is
-   slightly behind Q4_K (+4.8% relative) but in the same ballpark, not on the quality cliff.
-3. NVFP4 routes onto the FP4 MMA kernel and gets the FP4 prefill speedup: ~1.29x Q4_K on the
-   4B, tracking MXFP4 to within 5% (MXFP4 hit 1.58x on the 32B; NVFP4 should track it there too).
-4. Output is coherent.
-
-Bottom line: NVFP4-dense IS the quality-preserving FP4 win MXFP4 wasn't. It delivers
-essentially the full FP4 prefill speedup at roughly Q4_K quality, where MXFP4 paid a 27% quality
-tax for the same speed. LocalAI can support/recommend NVFP4-dense on Blackwell for prefill-bound
-workloads, with the caveat that it is marginally (~5%) behind Q4_K on perplexity; an imatrix-guided
-NVFP4 quant would likely close most of that remaining gap.
-
-## 1. Feasibility: can llama-quantize produce an NVFP4 GGUF? YES
-
-- The type exists with a full quantize path, not just a kernel:
-  - `GGML_TYPE_NVFP4 = 40` (`ggml.h`), `GGML_FTYPE_MOSTLY_NVFP4 = 26`
-  - `quantize_nvfp4` / `quantize_row_nvfp4_ref` / `dequantize_row_nvfp4` registered in `ggml.c`
-  - type_name is `"nvfp4"`, block `QK_NVFP4` (per-16 FP8/E4M3 block scale + global scale)
-- NVFP4 is NOT a top-level `llama-quantize` ftype (no `NVFP4` entry in the allowed-types list,
-  no reference in `tools/quantize/quantize.cpp` or `src/llama-quant.cpp`), BUT
-  `--tensor-type name=nvfp4` resolves it: `parse_ggml_type` matches the arg against
-  `ggml_type_name(...)`, which returns `"nvfp4"`. This is the exact same mechanism that produced
-  MXFP4-dense.
-- Recipe used (mirrors the MXFP4-dense GGUF byte-for-byte in structure: token_embd Q8_0, all
-  norms F32, all 2D attn+ffn weights to FP4):
-
-  ```
-  llama-quantize --tensor-type "attn=nvfp4" --tensor-type "ffn=nvfp4" \
-                 q3-4b-bf16.gguf q3-4b-nvfp4.gguf Q8_0
-  ```
-
-  Result: `q3-4b-nvfp4.gguf`, 2343.93 MiB, 4.89 BPW, ~5 s. (MXFP4-dense was 2350 MiB; same shape.)
-  Every `blk.N.attn_*` and `blk.N.ffn_*` reported `converting to nvfp4`; token_embd Q8_0; norms F32.
-
-The on-box `~/bench/q3-32b-nvfp4*` dirs are vLLM HF safetensors (already 4-bit), not GGUF, and
-do not feed llama.cpp - confirmed and irrelevant.
-
-## 2. Quality (decisive): NVFP4 is Q4_K-class, not MXFP4-class
-
-`llama-perplexity -f wiki.test.raw --chunks 50 -c 512 -ngl 99`, all clean from the same BF16 4B:
-
-| Quant   | PPL    | vs BF16  | vs Q4_K  |
-|---------|--------|----------|----------|
-| BF16    | 13.32  | -        | -        |
-| Q4_K_M  | 13.66  | +2.6%    | -        |
-| NVFP4   | 14.31  | +7.4%    | +4.8%    |
-| MXFP4   | 17.42  | +30.8%   | +27.5%   |
-
-(NVFP4 measured this run: Final PPL = 14.3097 +/- 0.4457.)
-
-NVFP4 lands much closer to Q4_K (gap 0.65 PPL) than to MXFP4 (gap 3.11 PPL). MXFP4's finer
-sibling delivers: the two-level scaling (per-16 FP8 block scale + global scale) recovers almost
-all of the quality MXFP4's coarse per-32 E8M0 scale threw away. It is not quite Q4_K, but it is
-firmly in the "acceptable 4-bit" regime, not the lossy one.
-
-## 3. Speed: NVFP4 routes onto the FP4 MMA kernel
-
-No clean BF16 32B was on the box (only the vLLM NVFP4 safetensors and the Q4_K/MXFP4 32B GGUFs),
-so per the brief this is the 4B speed signal - a 3-way cold A/B on the SAME 4B model, 45 s
-cooldowns between runs (`-npp 512 -ntg 128 -npl 8,32,64 -b 2048 -ub 2048 -ngl 99`):
-
-Prefill S_PP (t/s):
-
-| B   | Q4_K   | NVFP4  | MXFP4  | NVFP4 / Q4_K | NVFP4 / MXFP4 |
-|-----|--------|--------|--------|--------------|---------------|
-| 8   | 4862   | 6313   | 6602   | 1.30x        | 0.96x         |
-| 32  | 5020   | 6497   | 6836   | 1.29x        | 0.95x         |
-| 64  | 5031   | 6490   | 6831   | 1.29x        | 0.95x         |
-
-- NVFP4 prefill is within ~5% of MXFP4 at every batch size -> both land on the same FP4 MMA
-  kernel. NVFP4 does NOT fall back to a slow path.
-- NVFP4 beats Q4_K's int8-MMQ prefill by ~1.29x on the 4B. The established 32B figures were
-  Q4_K S_PP ~767 and MXFP4 ~1209 (1.58x); since NVFP4 tracks MXFP4 to within 5%, NVFP4 on the
-  32B should likewise approach ~1.5x. (The 4B shows a smaller multiplier than the 32B because a
-  smaller model spends proportionally less time in the matmul the FP4 kernel accelerates.)
-- Token-gen (S_TG) is comparable across all three (memory-bound), as expected.
-
-## 4. Coherence
-
-`llama-simple` (llama-cli hangs - avoided), NVFP4 4B:
-- "The capital of France is" -> "...Paris. ...Germany is in Berlin. ...Italy is in Rome.
-  ...Spain is in Madrid. ...Netherlands is in Amsterdam." (all correct)
-- "Q: What is 17 plus 25? A:" -> "42." (correct)
-
-Coherent and factually accurate.
-
-## Recommendation for LocalAI on Blackwell
-
-Support and recommend NVFP4-dense as the FP4 prefill option on Blackwell (sm_120/121), produced
-via `--tensor-type attn=nvfp4 --tensor-type ffn=nvfp4` over a BF16 source (token_embd Q8_0,
-norms F32). It gives ~the full FP4 prefill speedup (FP4 MMA kernel, ~1.3x Q4_K on 4B and
-expected ~1.5x on larger models) at roughly Q4_K quality (+7.4% PPL vs BF16). This is the win
-MXFP4 failed to deliver: MXFP4 paid a +30.8% quality tax for the same speed and was rejected.
-
-Caveats / follow-ups:
-- NVFP4 is still ~4.8% behind Q4_K on PPL. For quality-first deployments where the prefill win
-  does not matter, Q4_K_M remains the better pick.
-- These NVFP4/Q4_K numbers are clean (no imatrix). An imatrix-guided NVFP4 quant is the obvious
-  next step and would likely close most of the remaining gap to Q4_K - worth measuring before a
-  blanket recommendation.
-- A direct 32B NVFP4-vs-Q4_K speed run (needs a clean BF16 32B GGUF, not on the box) would
-  confirm the projected ~1.5x; the 4B signal plus the MXFP4-tracking already make this very likely.
--- a/backend/cpp/llama-cpp/paged/PAGED_KV_HIGH_CONCURRENCY.md
+++ b/backend/cpp/llama-cpp/paged/PAGED_KV_HIGH_CONCURRENCY.md
@@ -1,115 +0,0 @@
-# Paged KV at high concurrency on a single GB10 - the datacenter-scale test
-
-Closes the open question left by `PR22569_EVAL.md`: that eval could not test the
-"paged KV unlocks thousands of sequences" thesis because **both** KV paths hit the
-`LLAMA_MAX_SEQ=256` compile cap, and the 32B-dense model it used is compute-bound
-(plateaus by npl=128 for an unrelated reason). This run removes both confounders:
-**recompiled `LLAMA_MAX_SEQ=2048`** and used a **bandwidth-bound model (Qwen3-1.7B-Q8_0)**
-where decode aggregate is free to keep climbing with concurrency.
-
-Hardware: NVIDIA GB10 (sm_121, 119 GiB unified LPDDR5X, ~273 GB/s). Build:
-`~/llama.cpp-pr22569` (PR #22569 paged path + the reshape fix), `LLAMA_MAX_SEQ=2048`,
-sm_121 Release. Contiguous = `llama-batched-bench` (unified KV) `S_TG`. Paged =
-`llama-paged -kvp --fit off` `aggregate tps`. `npp=16, ntg/n_predict=128, b=ub=2048,
-ngl 99`. Cold runs, 12 s cooldowns.
-
-## TL;DR for the decision
-
-**On a single GB10, paged KV does NOT deliver a throughput or concurrency win - the
-aggregate-decode ceiling is set by the hardware, not the KV layout, and contiguous KV
-already reaches it.** Measured across two model regimes and concurrency up to 2048
-sequences:
-
-- Aggregate decode **plateaus** once the GPU saturates - for both KV layouts:
-  - 32B-dense (compute-bound): ~540 t/s, flat from npl=128 (prior eval).
-  - 1.7B (bandwidth-bound): ~3,200-3,700 t/s, flat from npl=512 (this run).
-- Paged and contiguous land at the **same ceiling**; PR #22569's paged op was 12-13%
-  *slower* than the mature contiguous flash-attention path at equal concurrency on 32B.
-- Pushing concurrency past the plateau is **actively harmful to UX**: per-sequence
-  throughput collapses (23 -> 1.9 tok/s) and TTFT explodes (0.6 s -> 4.3 s avg, **64 s
-  max**) while aggregate stays flat.
-
-**vLLM's ~24k aggregate headline is unreachable on a single GB10 with these models
-regardless of KV layout** - it needs aggregate memory bandwidth / compute that one GB10
-does not have (i.e. many GPUs). Paged KV is a **memory-capacity / anti-fragmentation /
-prefix-sharing** feature, not a single-node throughput-ceiling feature. The static
-single-model benchmark deliberately does not create the memory-pressure regime where
-paging pays off, which is exactly why no win appears.
-
-## The numbers
-
-### Aggregate decode vs concurrency, Qwen3-1.7B-Q8_0 (bandwidth-bound), `LLAMA_MAX_SEQ=2048`
-
-| npl | contiguous `S_TG` (t/s) | paged `aggregate tps` (t/s) | paged per-seq tps | paged TTFT avg / max |
-|----:|------------------------:|----------------------------:|------------------:|---------------------:|
-| 128 | 2,643 | 2,887 | 23-25 | - |
-| 256 | 2,925 | - | - | - |
-| 512 | 3,215 | 3,637 | 7.2-7.8 | 0.57 s / 0.90 s |
-| 1024 | 3,118 | 3,695 | 3.7-4.2 | 1.17 s / 2.37 s |
-| 2048 | (not run) | 3,608 | 1.9-14.6 | 4.28 s / **63.8 s** |
-
-Both paths flatten by npl~512. 8x more concurrency (128->1024) buys contiguous only
-**+18%** and paged **+28%**, then both stop. (The two tools meter slightly differently -
-`llama-paged` aggregate vs `batched-bench` decode-only `S_TG` - so the small paged-vs-
-contiguous offset is not a real paged advantage; the prior apples-to-apples 32B eval had
-paged 12-13% *behind*.)
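
The scaling figures quoted above can be re-derived from the table. The following is a minimal sketch, not part of the original benchmark tooling: the dictionaries simply copy the table's aggregate numbers, and `gain_pct` is a hypothetical helper name.

```python
# Aggregate decode throughput (t/s) copied from the table above.
contiguous = {128: 2643, 256: 2925, 512: 3215, 1024: 3118}
paged = {128: 2887, 512: 3637, 1024: 3695, 2048: 3608}


def gain_pct(series, lo, hi):
    """Percent change in aggregate throughput between two concurrency levels."""
    return (series[hi] / series[lo] - 1.0) * 100.0


contig_gain = gain_pct(contiguous, 128, 1024)
paged_gain = gain_pct(paged, 128, 1024)
paged_tail = gain_pct(paged, 1024, 2048)  # past the plateau

print(f"contiguous 128->1024: {contig_gain:+.1f}%")
print(f"paged      128->1024: {paged_gain:+.1f}%")
print(f"paged     1024->2048: {paged_tail:+.1f}%")
```

The last line makes the plateau explicit: doubling paged concurrency from 1024 to 2048 changes aggregate throughput by only a few percent, and in the negative direction.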
-
-### Why it plateaus (the hardware ceiling, not the KV layout)
-
-Decode is memory-bandwidth-bound: each step reads the model weights once and shares that
-read across the whole batch. Once concurrency is high enough that the shared weight-read
-is amortized, the per-step cost is dominated by KV reads + attention + host work, none of
-which paging makes cheaper. The GB10's ~273 GB/s caps how fast those reads can go; at the
-plateau the GPU is ~saturated. Adding sequences past that point cannot raise aggregate - it only divides
-the same throughput across more users (per-seq tps falls, TTFT rises). The 32B-dense case
-plateaus even earlier (npl=128) because it saturates on **compute** (weight matmuls), not
-bandwidth - the kernel decomposition is in `VLLM_DECOMPOSITION.md`.
-
-## What paged KV is actually for (the honest, deliverable value)
-
-Paging never helps a static, uniform-length, single-model benchmark on a GPU with memory
-to spare - there is no fragmentation and no over-reservation to reclaim. Its real wins,
-which require the regime this hardware+benchmark does not exercise, are:
-
-1. **Concurrent-tenant capacity under memory pressure.** Block KV fits more *diverse*
-   in-flight sequences (variable, dynamically arriving/leaving contexts) without the
-   contiguous path's per-slot reservation/fragmentation. Pays off when KV memory, not
-   compute/bandwidth, is the binding constraint - i.e. at multi-GPU datacenter scale or
-   with very long/variable contexts.
-2. **Cross-request prefix sharing.** A chained-hash block cache shares identical system
-   prompts / RAG preambles across requests (vLLM's `block_pool.py` + block-hash map). A
-   real token-budget win for shared-prefix workloads; PR #22569 defers this to a
-   non-existent Phase 2 (our from-scratch P0 has the machinery).
-
-These are measured as **max concurrent distinct tenants** and **KV memory saved**, not as
-aggregate tok/s on one model. They do not move the single-GB10 throughput ceiling.
-
-## Recommendation
-
-- **Do not pitch paged KV as a single-GB10 throughput lever** - it is measured flat to
-  the contiguous ceiling (and PR #22569 is slower). Doing so would not survive a
-  benchmark.
-- **The single-GB10 throughput story is already strong without paging:** llama.cpp is
-  ahead of vLLM single-stream (MXFP4 1153 > 800) and at ~70-81% of vLLM aggregate at
-  npl<=128 with a near-identical batching multiplier (`VLLM_DECOMPOSITION.md`). Ship the
-  MXFP4/NVFP4-dense prefill win (`NVFP4_TEST.md`) - that is the cheap, real, defensible
-  Blackwell number.
-- **If datacenter-scale (thousands of concurrent tenants) is the genuine target,** the
-  lever is **multiple GPUs** plus paged KV's **capacity + prefix-sharing** features -
-  framed and measured as concurrent-tenant capacity and KV memory saved, on a
-  variable-context / shared-prefix workload. A single GB10 cannot produce the ~24k
-  aggregate regardless of KV layout; that is a fleet-level result.
-
-## Reproduction (DGX, `~/llama.cpp-pr22569`, `LLAMA_MAX_SEQ=2048`)
-
-```sh
-M=~/bench/draft17/Qwen3-1.7B-Q8_0.gguf
-# contiguous
-for NPL in 128 256 512 1024; do
-  ./build/bin/llama-batched-bench -m $M -npp 16 -ntg 128 -npl $NPL -ngl 99 \
-    -b 2048 -ub 2048 -fa on -c $((NPL*160)); done
-# paged
-for NPL in 512 1024 2048; do
-  ./build/bin/llama-paged -m $M -kvp --fit off -ngpub 32768 -ncpub 128 \
-    -np $NPL -ns $NPL -n 128 -b 2048 -ub 2048 -ngl 99; done
-```
--- a/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md
+++ b/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md
@@ -1,170 +0,0 @@
-# Paged KV: target-readiness (correctness, dynamic benchmark, 2xH200 projection)
-
-Target hardware: **~2x H200** (282 GB HBM3e total, ~4.8 TB/s per GPU). The GB10 box is
-the *test* rig, not the target - and several earlier "no win" findings are GB10-specific
-artifacts (low bandwidth caps throughput before KV memory ever binds). This document
-delivers the three things needed to push paged KV toward the real target:
-
-1. **Correctness** of the paged path - verified (and a blocking bug found + fixed).
-2. **A dynamic-load benchmark** that actually exercises where paging wins (`paged-loadgen.cpp`).
-3. **A projection** of the paged-KV payoff on 2x H200, grounded in measured GB10 numbers.
-
---
-
-## 1. Correctness: PASS (after fixing the auto-fit OOM)
-
-`test-paged-kv-e2e` checks the paged decode path against the contiguous reference
-(greedy argmax + top-5 set overlap >= 4). On the box it was previously **unverified** -
-it aborted at context creation. Root cause found:
-
-- `common_fit_paged_kv_blocks` (`common/common.cpp:1144`) **unconditionally overrides**
-  `n_gpu_blocks` from `ggml_backend_dev_memory`, which **over-reports free VRAM on the
-  GB10 integrated/unified device** (it sized **~245 GB of KV on a 119 GB box** ->
-  `cudaMalloc` OOM -> `GGML_ASSERT` abort in `llama-kv-cache-paged.cpp:74`). The test's
-  explicit `n_gpu_blocks=64` was being clobbered because `params.fit_params` defaults on.
-
-**Fix (item-1 patch, applied on the box):**
-
-```diff
--- a/tests/test-paged-kv-e2e.cpp
-+++ b/tests/test-paged-kv-e2e.cpp
-@@ run_paged()
-     params.kv_paged      = true;
-+    params.fit_params    = false;  // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
-     params.n_gpu_blocks  = 64;
-```
-
-**Result (Qwen3-0.6B-Q8_0, GB10):**
-
-```
-test-paged-kv-e2e: top-5 argmax match: ref=3743 paged=3743
-test-paged-kv-e2e: top-5 set overlap: 5/5 (require >= 4)
-test-paged-kv-e2e: PASSED
-```
-
-The paged op is **numerically greedy-equivalent to the contiguous path**. The fix for the
-reshape bug from `PR22569_EVAL.md` (decoupled head_dim) is already applied in the checkout.
-
-**Target-readiness caveat (the durable fix, not just the test):** the auto-fit itself is
-brittle and must be hardened before it runs on a real serving box - even though
-`ggml_backend_dev_memory` reports correctly on a discrete H200, the function should still
-(a) early-return when `!params.fit_params`, (b) **clamp** the computed `n_gpu_blocks` so
-`n_gpu_blocks * block_bytes <= free_vram - margin` using the *actual* KV element size, and
-(c) not override an explicitly-set value. One-screen change in `common_fit_paged_kv_blocks`.
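
The three hardening steps can be sketched as follows. This is an illustrative outline only, written in Python rather than the C++ of `common_fit_paged_kv_blocks`; the function name, parameter names, and the "0 means unset" convention are assumptions made for the example, not the real signature.

```python
def fit_paged_kv_blocks(fit_params, explicit_blocks, free_vram_bytes,
                        margin_bytes, block_bytes):
    """Return the number of GPU KV blocks to allocate (hypothetical helper).

    explicit_blocks == 0 is treated as "not set by the caller" (an assumption).
    block_bytes must be computed from the actual KV element size.
    """
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")

    # (a) auto-fit disabled: leave the caller's value untouched.
    if not fit_params:
        return explicit_blocks

    # (c) never override a value the caller set explicitly.
    if explicit_blocks > 0:
        return explicit_blocks

    # (b) clamp so that n_gpu_blocks * block_bytes <= free_vram - margin.
    budget = max(0, free_vram_bytes - margin_bytes)
    return budget // block_bytes
```

With this ordering the test's explicit `n_gpu_blocks=64` survives whether or not auto-fit is enabled, and an over-reported free-memory figure can at worst size the cache up to the reported budget minus the margin.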
-
---
-
-## 2. Dynamic-load benchmark - `paged-loadgen.cpp`
-
-**Why the existing tools show no paged win:** `llama-batched-bench` and the stock
-`examples/paged/paged.cpp` both run **fixed-length, all-arrive-at-once, single-prompt**
-load. That has no over-reservation and no fragmentation, so contiguous KV is already
-memory-optimal and paging has nothing to reclaim (`PAGED_KV_HIGH_CONCURRENCY.md`). The
-paged win only exists under **variable lengths + continuous arrival + shared prefixes** -
-the real serving regime. No tool in the tree creates it.
-
-`paged-loadgen.cpp` (committed here) does, via the confirmed `llama_paged_scheduler_*`
-API:
-
-- **shared system prefix** (`LG_PREFIX` tokens) prepended to every request -> exercises
-  cross-request prefix sharing,
-- **variable prompt length** (`LG_SUFMIN..LG_SUFMAX` unique suffix),
-- **bimodal generation length** (`LG_GENLONG` for `LG_LONGPCT`% of requests, else
-  `LG_GENSHORT`) - the over-reservation driver,
-- **continuous arrival**: keeps `LG_INFLIGHT` requests live, admitting a new one each time
-  one finishes.
-
-It reports the load-bearing number for the buy decision - the **capacity ratio**:
-
-```
-paged peak KV      = sum over live seqs of ceil(used/block)*block * kv_bytes_per_token
-contiguous reserve = peak_inflight * max_ctx * kv_bytes_per_token   (worst-case per slot)
-CAPACITY RATIO     = contiguous_reserve / paged_peak   (+ prefix sharing on top)
-```
-
-`kv_bytes_per_token = 2 * n_layer * n_head_kv * head_dim * sizeof(f16)` - confirmed against
-`llama-kv-cache-paged.cpp` (e.g. Qwen3-32B: 2*64*8*128*2 = **256 KiB/token**).
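
As a worked instance of the two formulas above, the sketch below computes `kv_bytes_per_token` for the Qwen3-32B shape quoted in the text and then a capacity ratio for a made-up snapshot of live sequences. The block size of 16 tokens, `max_ctx` of 4096, and the 85/15 length mix are illustrative assumptions, not values taken from `paged-loadgen.cpp`.

```python
import math


def kv_bytes_per_token(n_layer, n_head_kv, head_dim, elem_bytes=2):
    """K and V (factor 2) for every layer and KV head, f16 by default."""
    return 2 * n_layer * n_head_kv * head_dim * elem_bytes


def capacity_ratio(used_tokens, max_ctx, block_tokens, bytes_per_token):
    """contiguous_reserve / paged_peak for one snapshot of live sequences."""
    paged_tokens = sum(math.ceil(u / block_tokens) * block_tokens
                       for u in used_tokens)
    paged_peak = paged_tokens * bytes_per_token
    contiguous_reserve = len(used_tokens) * max_ctx * bytes_per_token
    return contiguous_reserve / paged_peak


bpt = kv_bytes_per_token(n_layer=64, n_head_kv=8, head_dim=128)  # Qwen3-32B
print(bpt // 1024, "KiB/token")

# Hypothetical bimodal snapshot: 85 short sequences and 15 long ones.
live = [600] * 85 + [3000] * 15
ratio = capacity_ratio(live, max_ctx=4096, block_tokens=16, bytes_per_token=bpt)
print(f"capacity ratio: {ratio:.2f}x")
```

Note that `bytes_per_token` cancels out of the ratio; it matters only when converting either side into absolute memory.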
-
-**How to run (on the target):** drop into PR #22569's `examples/paged/`, add to its
-CMakeLists next to `llama-paged`, build, then e.g.
-`LG_INFLIGHT=2048 LG_LONGPCT=15 paged-loadgen -m <model> -kvp --fit off -ngpub <N> -ncpub <M> -ngl 99`.
-Sweep `LG_INFLIGHT` to the throughput plateau and read the capacity ratio at that point.
-It is written to run on the target (2x H200) where the regime exists; on GB10 it runs but
-the ratio is uninteresting because throughput plateaus before memory binds (see below).
-
---
-
-## 3. Projection to 2x H200 (grounded in measured GB10 numbers)
-
-### Measured on GB10 (this work)
-
-| model | decode plateau (aggregate) | plateau concurrency | bound by |
-|---|---|---|---|
-| Qwen3-32B-Q4_K_M (dense) | ~540 t/s | npl ~128 | compute |
-| Qwen3-1.7B-Q8_0 | ~3,200 t/s | npl ~512 | bandwidth |
-
-### Hardware ratios (per GPU, then 2x TP at ~85% scaling)
-
-| | GB10 | H200 | per-GPU x | 2x H200 (TP) x |
-|---|---|---|---|---|
-| mem bandwidth | 273 GB/s | ~4.8 TB/s | 17.6 | ~30 |
-| BF16 compute | ~213 TFLOP | ~989 TFLOP | 4.6 | ~8 |
-| memory capacity | 119 GB | 141 GB | 1.18 | 2.4 (282 GB) |
-
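-Treating decode as bandwidth-bound, **both the aggregate ceiling and the concurrency at which it
-is reached scale with bandwidth (~30x on 2x H200)**. The 32B-dense row above is compute-bound on
-GB10, so if that regime persists on H200 the ~8x compute ratio is the tighter bound: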
-
-- **32B-dense aggregate decode ceiling:** 540 x 30 ~= **16,000 t/s**, reached at
-  ~128 x 30 ~= **3,800 concurrent sequences**.
-
-### Why paged KV becomes the binding lever on 2x H200 (and didn't on GB10)
-
-To reach that ~16k t/s ceiling you must hold **~3,800 sequences** of KV. The memory math:
-
-- 32B weights (FP8) ~= 32 GB, sharded over 2 GPUs -> ~250 GB HBM free for KV.
-- 32B KV = 256 KiB/token. At an avg held context of 2,000 tokens, **per seq = 512 MiB**.
-- Contiguous unified KV (reserve for the live set) fits ~250 GB / 512 MiB ~= **~490
-  sequences** - **8x short of the 3,800 needed to reach the throughput ceiling.**
-
-So on 2x H200 **KV memory is the binding constraint at the throughput-optimal concurrency**,
-and contiguous KV strands most of the bandwidth (you'd run at a fraction of 16k t/s). This
-is the gap paged KV closes. On GB10 it never appeared because GB10's 30x-lower bandwidth
-caps decode at npl ~128, whose KV fits in memory trivially - the constraint order is
-inverted on the real target.
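
The projection arithmetic in this section can be reproduced in a few lines. This is a sketch that takes the bandwidth-scaling assumption at face value and uses only numbers quoted above; the variable names are illustrative. Because it computes exactly rather than with rounded intermediates, the results land near, not exactly on, the rounded figures in the text.

```python
KIB = 1024

# Spec and measurement inputs quoted in this document.
GB10_BW = 273e9          # bytes/s
H200_BW = 4.8e12         # bytes/s per GPU
TP_SCALING = 0.85        # assumed 2-GPU tensor-parallel efficiency
GB10_PLATEAU_TPS = 540   # 32B-dense aggregate decode on GB10
GB10_PLATEAU_NPL = 128   # concurrency at which it is reached

bw_ratio = H200_BW / GB10_BW * 2 * TP_SCALING
ceiling_tps = GB10_PLATEAU_TPS * bw_ratio
seqs_needed = GB10_PLATEAU_NPL * bw_ratio

# KV memory per sequence at the assumed average held context.
kv_per_token = 256 * KIB
avg_ctx_tokens = 2000
kv_per_seq = kv_per_token * avg_ctx_tokens

free_hbm = 250e9         # bytes left for KV after weights
seqs_fit = free_hbm / kv_per_seq
shortfall = seqs_needed / seqs_fit

print(f"bandwidth ratio (2x H200, TP): {bw_ratio:.1f}x")
print(f"projected decode ceiling:      {ceiling_tps:,.0f} t/s")
print(f"sequences needed:              {seqs_needed:,.0f}")
print(f"sequences that fit contiguous: {seqs_fit:,.0f}")
print(f"shortfall:                     {shortfall:.1f}x")
```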
-
-### Magnitude of the paged win
-
-Paging recovers concurrency two ways, both multiplicative on achievable throughput:
-
-1. **No over-reservation.** Contiguous must back `max_ctx` per slot; paging uses
-   `ceil(actual/block)`. For a realistic bimodal workload (most generations short, ~15%
-   long, prompts ~512) the average held context is several-fold below `max_ctx` ->
-   `paged-loadgen` capacity ratio typically **~4-10x** (it measures the exact number for
-   your workload's length distribution).
-2. **Cross-request prefix sharing** of shared system prompts / RAG preambles - additional,
-   workload-dependent (chained-hash block cache; vLLM's `block_pool.py`).
-
-Net: on 2x H200, paged KV is plausibly the difference between serving **~500 and ~3,800**
-concurrent 32B sequences in HBM, i.e. between a fraction of and ~all of the **~16k t/s**
-decode ceiling. **That is the datacenter payoff, and it is real on the target even though
-GB10 cannot exhibit it.**
-
-### Honest caveats for the buy case
-
-- These are **projections** from GB10 + spec ratios; the capacity multiplier depends on the
-  workload's context-length distribution (more variable -> bigger paged win) and TP
-  efficiency. `paged-loadgen` measures it directly once you have target-GPU time.
-- The **paged op itself still needs work**: PR #22569's `ggml_paged_attn` was 12-13%
-  *slower* than the mature contiguous flash-attention path at equal concurrency
-  (`PR22569_EVAL.md`), lacks prefix sharing (deferred to a non-existent Phase 2), and has
-  the fit-robustness bug above. Adopting paged KV for the target means either hardening
-  #22569 or finishing the from-scratch P4 - the capacity win above assumes a *correct,
-  competitive* op, which is the remaining engineering.
-- Prefill on either KV layout is compute-capped, not a paged concern.
-
-**Bottom line for the decision:** paged KV **is** the right lever for the 2x H200 target -
-the GB10 "no win" result is a bandwidth artifact, not a verdict. The paged path is now
-**correctness-verified**, the **benchmark to size the win exists**, and the projection
-says the payoff is **~5-10x concurrent-tenant capacity -> several-fold higher aggregate
-decode** on the target. The remaining work is hardening/finishing the paged op, not
-proving the thesis.
--- a/backend/cpp/llama-cpp/paged/PHASED_VLLM_PARITY_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/PHASED_VLLM_PARITY_PLAN.md
@@ -1,55 +0,0 @@
-# Making llama.cpp/LocalAI a viable vLLM alternative — phased plan
-
-Goal: close the practical gap to vLLM for both single-user *speed* and multi-user *throughput*, while keeping
-quality (no lossy quant). Grounded in measured benchmarks + research (`BENCHMARKS.md`, `BLACKWELL_KERNEL_GAPS.md`,
-`VLLM_THROUGHPUT_GAP.md`). The gap is NOT one thing — each phase targets a distinct, independent lever.
-
-## Where vLLM actually leads (measured, GB10 / Qwen3-32B)
-
-- **Single-user decode:** ~parity (10.2 vs 11.7) - bandwidth-bound. vLLM's edge is **spec-dec** (lossless).
-- **Multi-user decode:** gap grows to ~2.2× at B=64 (kernel + scheduler).
-- **Prefill aggregate:** llama plateaus ~765, vLLM scales to 24k - **paged KV + chunked prefill + kernel**.
-- Note: on GB10 vLLM's FP4 trump card is *broken* (falls back to Marlin); llama.cpp runs reliably - a real
-  viability point. vLLM is structurally ahead mainly via **paged KV, chunked prefill, cross-request prefix cache**.
-
-## Phases
-
-### Phase 1 — Hardware-tuned config (PR #10411) — DONE
-Folded into the hardware-defaults path (`core/config/hardware_defaults.go`):
-- Blackwell physical batch (n_ubatch) = 2048.
-- **VRAM-scaled `n_parallel` default** (>=32GiB→8, >=8→4, >=4→2): turns on concurrency + continuous batching,
-  which the backend leaves OFF at its `n_parallel=1` default. Unified KV → slots share the budget (no extra
-  KV memory). Single-host (local GPU) + distributed router (per node). Already-good defaults confirmed:
-  flash-attn=auto, context=4096.
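
The VRAM thresholds above amount to a small lookup. The sketch below is written in Python for illustration only (the real logic lives in Go in `core/config/hardware_defaults.go`); the function name is hypothetical, and returning 1 below 4 GiB is an assumption based on the backend's stated `n_parallel=1` default.

```python
GIB = 1024 ** 3


def default_n_parallel(vram_bytes):
    """Map detected VRAM to a default slot count (thresholds from the text)."""
    if vram_bytes >= 32 * GIB:
        return 8
    if vram_bytes >= 8 * GIB:
        return 4
    if vram_bytes >= 4 * GIB:
        return 2
    return 1  # assumed: fall back to the backend default


for gib in (2, 4, 8, 24, 32, 119):
    print(f"{gib:>3} GiB -> n_parallel={default_n_parallel(gib * GIB)}")
```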
-
-### Phase 2 — Paged / block KV cache  ← biggest structural multi-user lever
-vLLM's PagedAttention lifts KV utilization ~20-38% → ~96%. llama.cpp's own A10G data (draft PR #22569):
-contiguous OOMs at 26 seqs / 496 t/s → paged 247 seqs / 1256 t/s (**~9.5× concurrency, 2.5× aggregate**).
-- Build on / complete **upstream draft PR #22569** (`-kvp`, block manager + paged-attn ggml op, FCFS scheduler)
-  rather than the from-scratch series we prototyped (`paged/`). Our CPU-verified block manager + gather-read
-  design informs the review/port; the upstream momentum is the place to land it.
-- Phase 2b: cross-request prefix sharing (block-hash dedup) - our `PagedKVManager` already implements it.
-
-### Phase 3 — Prefill amortization (chunked prefill + n_batch/n_ubatch split)
-llama aggregate prefill plateaus because (a) one prompt saturates compute, (b) the per-forward GEMM M-dim is
-capped at `n_ubatch`=512, (c) no scheduler chunked prefill (draft #10718 abandoned).
-- Split logical `n_batch` from physical `n_ubatch` (LocalAI ties them today) so concurrent prefills batch into
-  a larger logical batch while keeping ubatch at the Blackwell sweet spot (2048).
-- Chunked prefill + prefill/decode co-batching in the server slot scheduler.
-
-### Phase 4 — Batched-GEMM kernel tuning (the decode 2.2× + prefill height)
-Per `BLACKWELL_KERNEL_GAPS.md`: dense int8-MMQ at ~21% of ceiling, MoE FP4-MMA at ~5%. Both untuned for
-Blackwell. To MATCH: tune MMQ or a Marlin-style W4A16 BF16 GEMM (FP4 not required — GB10 is INT8==BF16). To
-BEAT (2×): fix+tune the existing FP4-MMA on sm_121 (build-flag/`-O3`-miscompile, not greenfield).
-
-### Phase 5 — Backend GPU sampling
-CPU per-sequence sampling caps GPU util ~60% beyond n_parallel ~8-16 (upstream PR #17004). Track/adopt.
-
-### Cross-cutting — Speculative decoding (single-user speed, quality-preserving)
-Dense ≥14B: lossless ~1.8-3×. llama.cpp has `-md`/`--spec-draft-*`. Wire a draft-model field in the model
-config + ship Qwen3 target+draft (1.7B) pairs in the gallery. NOT for MoE-A3B (nothing to amortize).
-
-## Sequencing rationale
-Phase 1 (config) ships now — biggest immediate multi-user win for zero kernel work (concurrency was OFF).
-Phase 2 (paged KV) is the highest-leverage structural build and has upstream momentum. Phases 3-4 are deeper
-(scheduler + kernel). Spec-dec is independent and can land any time for single-user speed.
--- a/backend/cpp/llama-cpp/paged/PR17004_EVAL.md
+++ b/backend/cpp/llama-cpp/paged/PR17004_EVAL.md
@@ -1,90 +0,0 @@
-# PR #17004 (backend / GPU sampling) evaluation on DGX Spark (GB10, sm_121)
-
-Date: 2026-06-21. Hardware: NVIDIA GB10 (sm_121), CUDA 13.0, cmake 3.28.
-Model: `Qwen3-32B-Q4_K_M.gguf`. LocalAI pin: `LLAMA_VERSION=f3e182816421c648188b5eab269853bf1531d950` (2026-06-17).
-
-## TL;DR (clean negative)
-
-1. **PR #17004 is MERGED and is ALREADY present in our pinned llama.cpp `f3e1828`.** There is nothing to apply / cherry-pick / patch. The `-bs/--backend-sampling` CLI arg, the `llama_set_sampler` / `llama_get_sampled_*` API, and the GPU argsort/top-k/cumsum/softmax kernels are all in the pin.
-2. **The prescribed benchmark cannot test the fix.** `llama-batched-bench` does ZERO sampling - it feeds random tokens (`std::rand() % n_vocab`). Its ~540 t/s plateau is therefore **not** sampling-bound, and enabling backend sampling does nothing to it. The valid tool is `llama-batched` (examples/batched), which the PR updated to drive per-sequence sampler chains and which actually exercises `-bs`.
-3. **In a controlled real-sampling A/B (same `llama-batched` harness, CPU vs GPU sampler), GPU sampling gave only +25% at np=32, +3% at np=64, and CRASHED (`GGML_ASSERT(obj_new)`, graph-context alloc) at np=128 and np=256** - exactly the multi-user regime the investigation cares about.
-4. **nsys at np=64: GPU kernel profile and GPU-busy time are essentially identical with and without the fix** (CPU 392.5 t/s / GPU 404.2 t/s; total GPU kernel+memop time ~4.05 s in both). Sampling kernels do not even appear among the top GPU contributors. GPU utilization did **not** rise.
-5. **Conclusion: PR #17004, in the state shipped by our pin, does NOT break the ~540 plateau and does not move decode aggregate toward the ~2700 GPU-bound ceiling or past vLLM's 667.** It is modest at low parallelism and unusable (crash) at the high parallelism in question. The PR's own guidance ("recommended `--parallel 1`", "will take time to mature") matches what we measured.
-
-## 1. What PR #17004 does + state
-
-- Title: "sampling : add support for backend sampling". **State: MERGED** into `master` (PR head branch `gpu-sampling`). 44 files, +4133/-296.
-- `libllama`: new `llama_context_params.samplers` / `n_samplers`, `llama_set_sampler`, `llama_get_sampled_*`, `llama_sampler_seq_config`, updated `llama_sampler_i`. Sampler chain can now run inside the compute graph on the backend (GPU) instead of on the CPU after `llama_decode`.
-- CUDA: optimized/new `argsort`, `top-k`, `cumsum`, `softmax` kernels; CMake option `-DGGML_CUDA_CUB_3DOT2=ON` (builds a CCCL v3.2 prerelease for faster top-k).
-- Tools: new `-bs, --backend-sampling` arg in `common/arg.cpp` (line 1921); server (`server-context.cpp`) per-slot wiring; `examples/batched/batched.cpp` updated.
-- Supported backend samplers: `top-k`, `top-p`, `min-p`, `temp` (+ dist). **Limitations (from the PR): not compatible with grammar sampling; single output per sequence per batch; no save/load of sampling state; recommended only with `--parallel 1` and CUB_3DOT2.** Open follow-ups: #18547 (avoid graph reallocations), #18550 (skip inactive samplers in parallel decode).
-- It DOES target the CPU-side per-sequence sampling stall we hypothesised - the mechanism is correct. Maturity is the problem.
-
-Note: the GitHub API reports `mergedAt: 2026-01-04`, but the PR contains June 2026 upstream-merge commits and the feature is verified present in our 2026-06-17 pin, so treat the date field as a metadata quirk. What matters: the code is in `f3e1828`.
-
-## 2/3. Apply + build
-
-No apply needed (already in pin). Built from a clean `git worktree` at `f3e1828` (`~/llama-pr17004`), to avoid disturbing the existing diffusion build:
-
-```
-cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
-  -DCMAKE_CUDA_ARCHITECTURES=121 -DLLAMA_MAX_SEQ=256 \
-  -DGGML_CUDA_CUB_3DOT2=ON -DLLAMA_CURL=OFF
-cmake --build build --target llama-batched llama-batched-bench -j20
-```
-
-**Build: SUCCESS** (CUB_3DOT2=ON FetchContent fetched and compiled despite flaky net; sm_121; LLAMA_MAX_SEQ=256). `-bs/--backend-sampling` confirmed present in `llama-batched --help`.
-
-## 4. Decode aggregate: fix vs baseline vs vLLM
-
-### 4a. `llama-batched-bench` (NO sampling - reconfirms the plateau, unaffected by the fix)
-`-npp 16 -ntg 128 -npl 32,64,128,256 -c 40960 -b 2048 -ub 2048`
-
-| npl | S_TG t/s |
-|-----|----------|
-| 32  | 241.8 |
-| 64  | 395.1 |
-| 128 | 542.6 |
-| 256 | 567.2 |
-
-Reproduces the ~540 plateau. Because this tool never samples, `-bs` is irrelevant here - the plateau is decode/host-overhead-bound, not sampling-bound.
-
-### 4b. `llama-batched` real-sampling A/B (CPU sampler vs `-bs` GPU sampler, identical harness)
-`-kvu -n 128 -np {32,64,128,256} -c 40960 --seed 1` (samplers: top-k 40 / top-p 0.95 / temp 0.8)
-
-| np  | CPU sampling t/s | GPU `-bs` sampling t/s | delta |
-|-----|------------------|------------------------|-------|
-| 32  | 174.1 | 217.5 | +25% |
-| 64  | 390.5 | 403.4 | +3.3% |
-| 128 | 497.9 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
-| 256 | 396.7 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
-
-(`llama-batched` absolute t/s is lower than `batched-bench` because it does real sampling plus per-token detokenize/string/stream work; the A/B *within* this harness isolates the sampler cost.)
-
-**Does the fix break the plateau? No.** GPU sampling helps only at low parallelism and the gain shrinks as np rises (+25% -> +3%), then the path crashes at np>=128 - i.e. it fails in exactly the multi-user regime where the plateau matters. It does not approach the ~2700 ceiling and does not pass vLLM's 667. The CPU-sampling curve itself peaks at np=128 (498) and *drops* at np=256 (397), confirming CPU sampling is a scaling wall - but PR #17004 as shipped does not lift it because the GPU path is unstable there.
-
-## 5. GPU-utilization mechanism (nsys, np=64, the highest np where `-bs` survives)
-
-`nsys profile -t cuda ... -n 96 -np 64`
-
-| mode | decode t/s | total GPU kernel+memop time | top GPU contributors |
-|------|-----------|------------------------------|----------------------|
-| CPU sampling | 392.5 | ~4.07 s | mul_mat_q (55%+17%), flash_attn (5.7%), mul_mat_vec (2%) |
-| GPU `-bs`    | 404.2 | ~4.04 s | identical set; sampling kernels not in top contributors |
-
-GPU-busy time and the kernel mix are **essentially unchanged** between modes. The argsort/top-k/cumsum/softmax sampling kernels are negligible in the timeline; the only visible difference is H2D memcpy *instances* rising 1,495 -> 7,076 (pinned-memory sampler transfers) at ~unchanged total memcpy time. **GPU utilization did not rise.** This directly refutes the idea that, at this workload, the GPU idle is dominated by CPU sampler arithmetic - moving the sampler onto the GPU barely changed throughput (+3%) and did not raise GPU occupancy. The ~80% idle measured elsewhere is dominated by something other than the sampler math (host-side batch construction / synchronization / detokenize), which PR #17004 does not address.
-
-(np=256 nsys "with fix" could not be captured: `-bs` aborts there. Fixing the crash needs the unmerged follow-ups #18547/#18550, not in our pin.)
-
-## LocalAI adoption path
-
-**The code arrives transparently with a version bump; enabling it is not transparent.**
-
-- `backend/cpp/llama-cpp/prepare.sh` copies all of upstream `llama.cpp/tools/server/*` (including the #17004-modified `server-context.cpp` / `server-task.cpp` / `server-common.cpp`) into `tools/grpc-server/`, and `grpc-server.cpp` `#include`s them. So once `LLAMA_VERSION` points at a commit containing #17004 (our pin `f3e1828` already does), the backend-sampling machinery compiles into `grpc-server` automatically. **No vendored patch in `patches/` is required for the code.**
-- The vendored `server-context.cpp` already does the per-slot wiring (around line 1615): `backend_sampling &= task.params.sampling.backend_sampling`, also disabled for speculative decode and for pre-sampling logits (`n_probs>0`), then `llama_set_sampler(ctx_tgt, slot.id, common_sampler_get(slot.smpl))`.
-- **But it is OFF unless `task.params.sampling.backend_sampling == true`.** LocalAI's `grpc-server` builds `params` itself from the gRPC request and never sets this flag (and does not pass the upstream `--backend-sampling` CLI arg). So as-is, LocalAI compiles the feature but never uses it. **A small grpc-server change is needed**: read a LocalAI model option / env and set `params.sampling.backend_sampling = true` (global or per-request).
-- For performant CUDA top-k, add `-DGGML_CUDA_CUB_3DOT2=ON` to the llama-cpp CUDA `CMAKE_ARGS` in the Makefile (optional; a non-CUB fallback exists).
-- **Caveats that blunt the benefit for LocalAI specifically:** grammar-constrained requests (JSON-schema / tool calls - a large share of LocalAI traffic), `logprobs`/`n_probs>0`, and speculative decoding all fall back to CPU sampling by the gating above; and the GPU path crashes at np>=128 in this pin. So even after wiring the flag, the multi-user throughput case would not benefit (and would crash) until the follow-up PRs (#18547/#18550) land and stabilise high-parallelism backend sampling.
-
-### Recommendation
-Do **not** adopt PR #17004 as the multi-user throughput fix yet. It is already in the tree but is immature at the parallelism that matters (crashes at np>=128, modest gains below). The measured bottleneck at this workload is not the sampler arithmetic (nsys shows GPU-busy unchanged when sampling moves to GPU). Re-evaluate after #18547/#18550 merge into a future pin; revisit the host-side decode/batch-construction overhead as the more likely real lever.
--- a/backend/cpp/llama-cpp/paged/PR22569_EVAL.md
+++ b/backend/cpp/llama-cpp/paged/PR22569_EVAL.md
@@ -1,136 +0,0 @@
-# Evaluation: llama.cpp PR #22569 (paged KV cache, `-kvp`) on DGX Spark (GB10, sm_121)
-
-Question: is upstream draft PR #22569 the right base to give LocalAI vLLM-class
-high-concurrency GPU throughput, or should we finish our own from-scratch P4
-(`backend/cpp/llama-cpp/paged/`)?
-
-Date: 2026-06-21. Hardware: NVIDIA GB10 (compute 12.1 / sm_121), 122502 MiB unified
-memory, CUDA 13.0, gcc 13.3. Models: `Qwen3-32B-Q4_K_M.gguf` (18.4 GB, 64 layers,
-n_head 64 / n_head_kv 8 / head_dim 128 / n_embd 5120) and `Qwen3-0.6B-Q8_0.gguf` for
-the correctness gate.
-
-## TL;DR verdict: DO NOT adopt #22569. Finish our own P4.
-
-On GB10 with a 32B dense model, PR #22569 delivers **no throughput win and no concurrency
-win** - it is ~12% *slower* than the existing contiguous path and hits the *same*
-256-sequence ceiling. The "scale to thousands of sequences like vLLM" premise does not
-hold for this PR or this hardware/model. On top of that it is broken out of the box,
-wired to the wrong integration surface, and a contested draft.
-
-## 1. Builds? Correct?
-
-- **Builds: YES.** Cloned `matiaslin/llama.cpp@paged_attention` (PR #22569, single commit
-  `0b0f7bd...`, base = current master). Clean CUDA build for sm_121
-  (`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`).
-  `llama-paged`, `llama-batched-bench`, `test-paged-kv`, `test-paged-kv-e2e` all link.
-  It is self-contained (ships its own CPU+CUDA `ggml_paged_attn` op) and does **not**
-  depend on the competing CUDA PR #17579 (ericcurtin, `--pagedattention`).
-
-- **Runs out of the box: NO.** `llama-paged -kvp` on Qwen3-32B *and* Qwen3-0.6B crashes
-  at context creation:
-  `build_attn(llm_graph_input_attn_kv_paged*) -> ggml_reshape_2d ->`
-  `GGML_ASSERT(ggml_nelements(a) == ne0*ne1)` (src/llama-graph.cpp:2556). Same crash with
-  `--fit off` (so it is the real graph, not just the memory probe).
-  **Root cause:** the paged path hardcodes `ggml_reshape_2d(cur, hparams.n_embd, ...)`,
-  wrong for any model where `n_head*head_dim != n_embd`. Qwen3 decouples head_dim:
-  32B = 64*128 = **8192** vs n_embd 5120; 0.6B = 16*128 = **2048** vs 1024. The PR's
-  "qwen3 verified" claim does **not** hold against current Qwen3 GGUFs. Fix is ~1 line
-  (use the real attention width `cur->ne[0]*cur->ne[1]`); applied for the rest of the eval.
-
-- **`fit_params` (`-ngpub` auto-sizing) also crashed on GB10** in the same reshape path
-  during the device-memory probe (before the fix). After the reshape fix, paged
-  auto-fit works (sized 96624 GPU blocks on the 0.6B from 85 GiB free).
-
-- **Correctness after the reshape fix:** paged decode runs and produces **coherent**
-  output on Qwen3-32B (sensible mercury / miso-soup / Starry-Night answers across 128 and
-  256 concurrent sequences), indicating the `ggml_paged_attn` op is functionally roughly
-  correct. PR's own greedy/top-K equivalence test (`test-paged-kv-e2e`, top-K argmax +
-  top-5 overlap >= 4 + first-4-token greedy match vs non-paged) on Qwen3-0.6B did
-  **not** reach a PASS/FAIL verdict on GB10: its paged auto-fit grabs ~88 GiB
-  (96531 blocks) and the run then stalls at cache init (a third GB10 fit-robustness
-  issue, distinct from the reshape bug). So the formal greedy-equivalence gate is
-  **unverified on this box**, but the qualitative evidence (coherent multi-sequence 32B
-  output with explicit small `-ngpub`) indicates the fixed op is roughly correct. This
-  does not change the verdict, which is decided by throughput below.
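
The reshape root cause described above comes down to one inequality. The sketch below checks it for the two model shapes quoted in this section; the dictionary layout and helper name are illustrative, and the shape values are the ones stated in the text.

```python
# Shapes as quoted in this evaluation.
MODELS = {
    "Qwen3-32B": {"n_head": 64, "head_dim": 128, "n_embd": 5120},
    "Qwen3-0.6B": {"n_head": 16, "head_dim": 128, "n_embd": 1024},
}


def attn_width(shape):
    """Real attention output width: one head_dim slice per query head."""
    return shape["n_head"] * shape["head_dim"]


for name, shape in MODELS.items():
    width = attn_width(shape)
    # A reshape to n_embd columns only preserves the element count when the
    # attention width equals n_embd; otherwise the assert fires.
    status = "ok" if width == shape["n_embd"] else "MISMATCH"
    print(f"{name}: attn width {width} vs n_embd {shape['n_embd']} -> {status}")
```

Both shapes report a mismatch, which is why the hardcoded `hparams.n_embd` reshape trips the element-count assert on each model.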
-
-## 2. Throughput: paged vs contiguous on GB10 (Qwen3-32B-Q4_K_M)
-
-Contiguous = `llama-batched-bench` (unified KV, continuous batching), S_TG decode tok/s.
-Paged = `llama-paged -kvp --fit off` (its scheduler-driven continuous-batching loop),
-`aggregate tps`. Both `npp~16, ntg/n_predict=128, n_batch=n_ubatch=2048, -ngl 99`.
-
-| npl  | contiguous (S_TG t/s) | paged `-kvp` (agg t/s) | outcome |
-|------|----------------------|------------------------|---------|
-| 128  | **537** (S 553)      | **477**                | both run; paged ~12% slower |
-| 256  | **541** (S 550)      | **471**                | both run; paged ~13% slower; neither gains over 128 |
-| 512  | FAIL                 | FAIL                   | **both** die: `n_seq_max must be <= 256` |
-| 1024 | FAIL                 | FAIL                   | **both** die: `n_seq_max must be <= 256` |
-
-### The decisive facts
-
-1. **PR #22569 does NOT lift the 256-sequence ceiling.** Both contiguous and paged fail
-   identically at npl 512/1024 with `n_seq_max must be <= 256` (llama.cpp's compile-time
-   `LLAMA_MAX_SEQ`). It is **not** an OOM - GB10 has 119 GiB and at npl=256 contiguous KV
-   is only 16 GiB. Paging gives **zero** concurrency headroom over contiguous here. The
-   "paged unlocks thousands of seqs" premise is false for this PR.
-
-2. **Paged is slower, not faster.** The fresh `ggml_paged_attn` op (477/471 t/s) loses to
-   the mature CUDA flash-attention contiguous path (537/541 t/s) by ~12-13% at equal
-   concurrency. The PR's A10G "2.5x" came entirely from contiguous OOMing at 26 seqs on a
-   24 GiB card; that lever does not exist on GB10's 119 GiB.
-
-3. **The 32B dense model is compute-bound and plateaus by npl=128 on GB10.** Aggregate is
-   flat from 128->256 (contiguous 537->541; paged 477->471). Doubling concurrency buys
-   nothing because the GPU is already saturated on the 32B weight matmuls. Even if we
-   recompiled with a larger `LLAMA_MAX_SEQ`, aggregate would not climb - so vLLM-class
-   ~24k aggregate is **unreachable for 32B-dense on a single GB10 regardless of KV
-   layout**. The throughput gap to vLLM at this model/hardware is a compute/bandwidth
-   problem, not a KV-fragmentation problem.
-
-## 3. Verdict and reasoning: finish our own P4
-
-**Do not adopt #22569 as the base.** Reasons:
-
- **No win on target hardware.** Even fully completed, on GB10 + 32B it is slower than
-  what we already have and capped at the same 256 seqs. There is no throughput or
-  concurrency dividend to harvest here.
- **Wrong integration surface.** Paged is driven only by a brand-new parallel C API
-  (`llama_paged_scheduler_init/add_request/prepare_batch/get_batch_info/update/...`) and a
-  bespoke `examples/paged` loop. `-kvp`/`--kv-paged` is gated to `LLAMA_EXAMPLE_PAGED`
-  only - it is NOT wired into `llama-server`/`batched-bench`/`parallel`, i.e. NOT the path
-  LocalAI's grpc-server derives from. Adopting it means rewriting LocalAI's serving loop
-  around the new scheduler API.
- **Broken / restricted.** Crashes out of the box on all current Qwen3 (and any
-  decoupled-head-dim model); fit_params crashed; Phase-1 restrictions enforced at context
-  creation: single CUDA device, full offload only, `n_batch == n_ubatch`, no SWA
-  (gemma3/llama4/etc. unsupported), no CoW / prefix-caching, no
-  `seq_cp`/`seq_keep`/`seq_div`/`seq_add`, no state save/load.
- **Contested draft.** Unmerged; the author is openly asking maintainers whether the C
-  API is even the right design; maintainers are skeptical of paged for single-node use.
-
-**What P4 should actually target (re-scoped by this data).** The aggregate-throughput
-gap to vLLM on a compute-bound dense model on one GB10 is not addressable by paged KV.
-The durable, real LocalAI wins from paging are the ones our from-scratch P0 already
-implements the machinery for and that #22569 explicitly omits:
- **on-demand KV sizing** (fit more *diverse* concurrent tenants without per-seq
-  over-reservation), and
- **automatic cross-tenant prefix sharing** (chained-hash block cache - shared system
-  prompts / RAG preambles), which #22569 defers to a non-existent Phase 2.
-
-Finish our own P4 (CPU gather-read + a CUDA gather-read) against these capacity/
-prefix-sharing objectives - measured as max concurrent *distinct* tenants and KV memory
-saved, not single-model aggregate tok/s. To chase raw aggregate, the levers are lifting
-`LLAMA_MAX_SEQ` and smaller/MoE models in memory-bandwidth-bound regimes - orthogonal to
-paged attention. The ~1-line reshape fix found here (and the GB10 fit_params crash) are
-worth upstreaming to #22569 regardless, but the PR is not our base.
-
-### Reproduction (DGX, `~/llama.cpp-pr22569`)
-```sh
-export PATH=/usr/local/cuda/bin:$PATH
-# contiguous
-./build/bin/llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -npp 16 -ntg 128 \
-  -npl 128 -c 20480 -b 2048 -ub 2048        # 256/512/1024 -> n_seq_max must be <= 256
-# paged (needs the src/llama-graph.cpp:2556 reshape fix: hparams.n_embd -> cur->ne[0]*cur->ne[1])
-./build/bin/llama-paged -m Qwen3-32B-Q4_K_M.gguf -kvp --fit off -ngpub 2048 -ncpub 128 \
-  -np 128 -ns 128 -n 128 -b 2048 -ub 2048 -ngl 99   # 512/1024 -> n_seq_max must be <= 256
-```
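Reviewer note on the removed evaluation notes: a minimal sketch that recomputes the paged-vs-contiguous gap from the four decode figures in the section 2 table. Only those four numbers come from the notes; the helper names are hypothetical and used for illustration only.

```python
# Decode throughput in tok/s, copied from the section 2 table of the removed notes.
# Keys are npl (number of parallel sequences).
RESULTS = {
    128: {"contiguous": 537.0, "paged": 477.0},
    256: {"contiguous": 541.0, "paged": 471.0},
}


def paged_slowdown_pct(contiguous: float, paged: float) -> float:
    """Percent by which paged decode trails contiguous decode."""
    return (contiguous - paged) / contiguous * 100.0


def scaling_ratio(results: dict, kind: str, lo: int = 128, hi: int = 256) -> float:
    """Throughput ratio when concurrency goes from `lo` to `hi` sequences."""
    return results[hi][kind] / results[lo][kind]


if __name__ == "__main__":
    for npl, row in RESULTS.items():
        gap = paged_slowdown_pct(row["contiguous"], row["paged"])
        print(f"npl={npl}: paged trails contiguous by {gap:.1f}%")
    for kind in ("contiguous", "paged"):
        print(f"{kind}: 128 -> 256 scaling = {scaling_ratio(RESULTS, kind):.3f}x")
```

By this arithmetic the gap is roughly 11% at npl=128 and 13% at npl=256, in line with the "~12-13%" quoted in the notes, and both paths scale by about 1.0x when concurrency doubles, which is the plateau the notes describe.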
--- a/backend/cpp/llama-cpp/paged/README.md
+++ b/backend/cpp/llama-cpp/paged/README.md
@@ -1,95 +0,0 @@
-# Paged Attention for llama.cpp (vLLM-parity), CPU-first
-
-A from-scratch port of vLLM V1's paged KV-cache model into the llama.cpp / ggml
-world, built CPU-first and verified incrementally. The host-side block manager is
-a faithful port of vLLM; the compute stays in ggml (no new op — the read path
-gathers blocks with `ggml_get_rows` and feeds the existing attention ops).
-
-Design: `docs/superpowers/specs/2026-06-19-paged-attention-llamacpp-design.md`
-Plan:   `docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md`
-
-## Status
-
-| Phase | What | State |
-|------|------|-------|
-| P0 | vLLM-parity host block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache) | ✅ verified — `make check`, 4/4 suites |
-| P1 | ggml paged write/gather mechanism (`set_rows` by slot_mapping → `get_rows` gather) | ✅ verified — `make ggml-check`, non-contiguous blocks `[2,1,5]` round-trip + isolation |
-| P2 (core) | attention over gathered paged KV matches independent host reference | ✅ verified — max abs err **7.5e-08** |
-| P3 (partial) | capacity & prefix-sharing wins | ✅ measured — `make bench`: **9.2×** more concurrent seqs, **11.3×** less KV memory |
-| **P3 (in-model placement)** | **paged, non-contiguous block KV placement in the real model** | ✅ **Gate 0 PASSED** — Qwen3-0.6B token-identical (`patches/0001-paged-kv-block-placement.patch`) |
-| P4 (in-model compute) | gather-read (`build_attn_paged`, read only a seq's blocks) + win-2 throughput + multi-seq | ⛔ remaining |
-
-The design's central risk — *does paged (non-contiguous) KV produce correct attention?* —
-is **retired at two levels**: (1) at the ggml-op level (P2, 7.5e-08 vs reference) and
-(2) **in a real model** (P3): with KV physically scattered across permuted, non-contiguous
-blocks (cells `0-15, 144-159, 32-47, …`), Qwen3-0.6B greedy generation is **token-for-token
-identical** to the contiguous cache. Reproduce:
-
-```sh
-# from backend/cpp/llama-cpp-fallback-build/llama.cpp (patch applied, CPU build)
-B=build-cpu/bin/llama-simple; M=<Qwen3-0.6B.Q4_K_M.gguf>; P="...long prompt..."
-"$B" -m "$M" -n 40 "$P"                         > base.txt
-LLAMA_KV_PAGED=1 "$B" -m "$M" -n 40 "$P"        > paged.txt
-diff base.txt paged.txt && echo TOKEN-IDENTICAL
-# LLAMA_KV_PAGED_DEBUG=1 prints the permuted physical cells per step
-```
-
-This proves the **storage/placement** layer of paged attention in-model. What remains (P4)
-is the **compute** optimization that yields the throughput win: a gather-read that attends
-only a sequence's own blocks (instead of scanning `[0,n_kv)` with a mask), plus the
-multi-sequence driver to measure tok/s vs concurrency. The patch is single-sequence scope.
-
-## Build & test
-
-```sh
-make check                     # P0 host-manager unit suites (pure C++, no deps)
-make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build>   # P1/P2 ggml tests
-make bench                     # P3 capacity + prefix-sharing numbers
-```
-
-`ggml-check` needs a built ggml. To build one CPU-only from a llama.cpp checkout:
-`cmake -S <llama.cpp>/ggml -B /tmp/ggml-build -DGGML_CUDA=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build /tmp/ggml-build -j`
-(if it complains about a missing `ggml.pc.in`, add a minimal pkg-config stub).
-
-## Files
-
- `paged_kv_manager.{h,cpp}` — the vLLM-parity block manager (no ggml/llama dep).
- `tests/test_free_block_queue.cpp` — intrusive LRU free list.
- `tests/test_block_pool.cpp` — alloc/touch/free/evict/cache.
- `tests/test_paged_kv_manager.cpp` — allocate/block_table/slot_mapping/free.
- `tests/test_prefix_cache.cpp` — chained block hashing + first-miss cache hit.
- `tests/test_ggml_paged_rw.cpp` — paged write/gather through real ggml ops.
- `tests/test_ggml_paged_attn.cpp` — attention over paged KV vs host reference.
- `paged-bench.cpp` — capacity (win 1) + prefix-sharing (win 3) measurements.
-
-## Remaining work — integration map (for the next session)
-
-Target: a paged read path active behind a flag, producing **token-identical** greedy
-output vs the contiguous cache on a real model (Gate 0), then `paged-bench` win 2.
-
-Exact seams in the vendored llama.cpp (`backend/cpp/llama-cpp-fallback-build/llama.cpp`,
-the pinned build fetches `LLAMA_VERSION=f3e182816421…`):
-
-1. **Memory type** — `src/llama-model.cpp:2070` `create_memory()` constructs `llama_kv_cache`.
-   Add a paged variant (or a flag on the existing cache) implementing `llama_memory_i`
-   (`src/llama-memory.h`), backed by `PagedKVManager`.
-2. **Allocation** — `src/llama-kv-cache.cpp:818` `find_slot()` produces `slot_info.idxs`.
-   Replace the ring-buffer scan with block-aligned allocation from `PagedKVManager`.
-3. **Read path** — `src/llama-kv-cache.cpp:1145/1165` `get_k`/`get_v` return a contiguous
-   `[0,n_kv)` view. For paged, gather the sequence's blocks (`ggml_get_rows`) into scratch.
-   The new branch lives alongside `build_attn` in `src/llama-graph.cpp` (`build_attn_mha`).
-4. **Mask** — `src/llama-graph.cpp` `build_attn_inp_kq_mask` sizes the mask to the gathered
-   length per sequence.
-5. **Gate 0 driver** — `build-cpu/bin/llama-simple` (greedy argmax) on
-   `Qwen3-0.6B.Q4_K_M.gguf`; assert paged output == contiguous output token-for-token.
-
-### Honest caveats (from the maintainer discussion + reading `find_slot`)
-
- llama.cpp's **unified cache already shares one KV pool** across sequences and already
-  tolerates non-contiguous slots. So win-1 vs *unified* is smaller than vs per-seq
-  reservation (stream mode). The durable LocalAI wins are **on-demand sizing** and
-  **automatic cross-tenant prefix sharing** (P0 implements the block-hash machinery).
- vLLM's classic `paged_attention_v1/v2` CUDA kernel is **deprecated**; the live path is
-  FlashAttention/FlashInfer over a block table. The port targets that pattern, not the
-  old kernel. Upstream draft PRs #22569 (new `ggml_paged_attn` op) and #17579 (CUDA) are
-  unmerged; maintainers are skeptical for single-user use.
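Reviewer note on the removed README: the "chained-hash block cache" idea behind cross-tenant prefix sharing can be sketched as follows. This is an illustrative Python model, not the C++ `PagedKVManager` from the deleted files and not vLLM's actual hashing scheme. The 16-token block size is an assumption (suggested by the `0-15, 144-159` cell ranges in the README), and `block_hashes` / `PrefixCache` are hypothetical names.

```python
import hashlib
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per KV block (assumed for illustration)


def block_hashes(tokens: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash every full block, chaining each hash on the previous block's hash.

    Because of the chaining, two sequences share hash i only if they share
    the entire prefix up to and including block i.
    """
    hashes: List[str] = []
    parent = ""
    for i in range(len(tokens) // block_size):
        chunk = tokens[i * block_size:(i + 1) * block_size]
        payload = parent + "|" + ",".join(map(str, chunk))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class PrefixCache:
    """Maps chained block hashes to physical block ids (no eviction here)."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, int] = {}
        self._next_block = 0

    def _new_block(self) -> int:
        block = self._next_block
        self._next_block += 1
        return block

    def allocate(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (block_table, n_reused_blocks) for one sequence."""
        table: List[int] = []
        reused = 0
        for h in block_hashes(tokens):
            if h in self._by_hash:
                # Hit: the whole prefix up to this block is already cached.
                table.append(self._by_hash[h])
                reused += 1
            else:
                # Miss: chaining guarantees every later block misses too.
                self._by_hash[h] = self._new_block()
                table.append(self._by_hash[h])
        if len(tokens) % BLOCK_SIZE:
            # A partial tail block is never shared.
            table.append(self._new_block())
        return table, reused


if __name__ == "__main__":
    system_prompt = list(range(32))  # two full blocks shared by both tenants
    cache = PrefixCache()
    print(cache.allocate(system_prompt + [100, 101, 102]))
    print(cache.allocate(system_prompt + [200] * 20))
```

In this toy run the second tenant reuses the two system-prompt blocks and only allocates new blocks for its own suffix, which is the memory saving the README's prefix-sharing measurement refers to.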
--- a/backend/cpp/llama-cpp/paged/UPSTREAM_GGML_ISSUE.md
+++ b/backend/cpp/llama-cpp/paged/UPSTREAM_GGML_ISSUE.md
@@ -1,78 +0,0 @@
-# Upstream ggml issue draft: MXFP4 MoE prefill underutilizes Blackwell (GB10) — ~22 TFLOP/s, ~27× behind vLLM
-
-**Title:** CUDA: MXFP4 MoE prefill runs the Ampere-class warp `mma.sync`, far below Blackwell FP4 peak (GB10 / sm_121)
-
-## Summary
-
-On a GB10 (DGX Spark, sm_121), MXFP4 MoE prefill for Qwen3-Coder-30B-A3B is bottlenecked by
-`mul_mat_q<MXFP4>` (the per-expert grouped MMQ), which runs at only **~22 effective TFLOP/s** — a small
-fraction of the GPU's FP4 capability. Batched prefill plateaus at ~3.65k tok/s (B=32) vs vLLM FP8 ~99k
-on the same box (~27×). The native FP4 block-scaled `mma.sync` path (PR #17906 et al.) *is* engaged — the
-limit is that it's a warp-level MMA kernel, not a tcgen05/CUTLASS-class grouped GEMM.
-
-## Hardware / build
-
- NVIDIA GB10, compute capability 12.1, 119 GiB unified LPDDR5X.
- llama.cpp built `-DCMAKE_CUDA_ARCHITECTURES=121` (sm_121a/compute_121a confirmed in cubins).
- Model: Qwen3-Coder-30B-A3B-Instruct, `MXFP4_MOE` (15.9 GiB, 4.47 BPW).
-
-## Measurements
-
-Single-stream (`llama-bench`, ub2048):
-
-| metric | Q8_0 | MXFP4 | vLLM FP8 |
-|---|---|---|---|
-| prefill pp2048 | ~2200 | 3441 | — |
-| decode tg128 | 62 | 86 | 52 |
-
-Batched (decode-phase aggregate `S_TG`; prefill aggregate `S_PP`):
-
-| B | llama MXFP4 prefill | vLLM FP8 prefill | llama MXFP4 decode | vLLM FP8 decode |
-|---|---|---|---|---|
-| 1 | 1625 | 9644 | 83 | 48 |
-| 8 | 3634 | 33373 | 267 | 312 |
-| 32 | 3651 | 99398 | 551 | 1171 |
-| 64 | 3648 | 151990 | 770 | 2064 |
-
-Decode is competitive (we win at B=1). **Prefill plateaus and is the gap.**
-
-## Profiling (nsys, MXFP4 pp2048 kernel time)
-
-| kernel | % |
-|---|---|
-| `mul_mat_q<(ggml_type)39>` (MXFP4 MoE GEMM) | **37.2** |
-| `mul_mat_q<(ggml_type)8>` (dense/attn, still Q8) | 10.1 |
-| `flash_attn_ext_f16` | 8.8 |
-| `quantize_mmq_mxfp4` (activation quant) | 8.0 |
-
-The only CUTLASS kernel present is `cutlass_80_tensorop` (Ampere); no tcgen05 / wgmma kernels appear anywhere in the trace.
-
-## What we ruled out (so it's the kernel, not config)
-
- **ubatch**: saturates at 2048 (pp4096: ub512 2994 → ub2048 3316 → ub8192 3180).
- **tile width**: `mmq_x` already selects the full 128-wide tile at ub2048 (~128 tokens/expert).
- **cuBLAS fallback**: `GGML_CUDA_FORCE_CUBLAS` is a no-op (3419 ↔ 3423 t/s) — dequant→cuBLAS-FP16 neither
-  helps nor hurts, i.e. the FP4 MMQ kernel isn't worse than FP16 cuBLAS, both hit a common ceiling.
- prefill does **not** scale with bigger single prompts (attention O(N²) confounds): pp2048 3295, pp8192
-  1524, pp16384 2051 — so it's the many-sequence batched MoE GEMM, not batch size.
-
-## Proposal
-
-A tcgen05 / CUTLASS-3.x grouped-GEMM path for FP4 (MXFP4 + NVFP4) MoE on sm_120/121:
- One grouped GEMM over all experts with per-group token offsets (full tiles regardless of tokens/expert),
-  vs today's per-expert MMQ scheduler.
- Block-scaled `e2m1` operands via tcgen05 tensor-memory MMA (`mma.sync.aligned.kind::mxf4…` is the
-  warp-level form; the collective-mainloop/tcgen05 form is what extracts Blackwell throughput at prefill
-  tile sizes).
- Fuse activation quantization (`quantize_mmq_mxfp4`, ~8%) into the permute/gather.
- Optionally extend to dense layers (qkv/o_proj/lm_head) so full-model prefill is FP4/FP8.
-
-This mirrors what vLLM/FlashInfer/TensorRT-LLM do for Blackwell MoE. Happy to test iterations on the GB10.
-
-## Repro
-
-```sh
-llama-quantize qwen3coder-f16.gguf qwen3coder-mxfp4.gguf MXFP4_MOE
-llama-bench -m qwen3coder-mxfp4.gguf -ngl 99 -p 2048 -n 0 -ub 2048
-llama-batched-bench -m qwen3coder-mxfp4.gguf -ngl 99 -c 45056 -b 2048 -ub 2048 -npp 512 -ntg 128 -npl 1,8,32,64
-```
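Reviewer note on the removed issue draft: the proposal's "one grouped GEMM over all experts with per-group token offsets" relies on a simple host-side bookkeeping step, sketched below. This is an assumption-laden illustration, not ggml or CUTLASS code. `group_offsets` is a hypothetical helper, and each input entry stands for one routed (token, expert) pair.

```python
from typing import List, Tuple


def group_offsets(expert_ids: List[int], n_experts: int) -> Tuple[List[int], List[int]]:
    """Build the per-expert offsets a grouped GEMM would consume.

    Returns (offsets, order):
      offsets has n_experts + 1 entries (exclusive prefix sum of per-expert
        row counts), so expert e owns rows offsets[e]:offsets[e + 1];
      order lists the input row indices sorted by expert, stable within
        each expert, i.e. the gather permutation.
    """
    counts = [0] * n_experts
    for e in expert_ids:
        counts[e] += 1

    offsets = [0] * (n_experts + 1)
    for e in range(n_experts):
        offsets[e + 1] = offsets[e] + counts[e]

    cursor = offsets[:-1].copy()  # next free row for each expert
    order = [0] * len(expert_ids)
    for row, e in enumerate(expert_ids):
        order[cursor[e]] = row
        cursor[e] += 1
    return offsets, order


if __name__ == "__main__":
    offsets, order = group_offsets([2, 0, 2, 1, 0, 2], n_experts=4)
    print("offsets:", offsets)
    print("order:  ", order)
```

With the rows permuted this way, a single launch can walk all experts using `offsets`, and an expert with zero rows (expert 3 in the demo) simply contributes an empty range instead of a separate kernel launch.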
--- a/backend/cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md
+++ b/backend/cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md
@@ -1,83 +0,0 @@
-# What makes vLLM fast on GB10 — kernel vs scheduler (code-grounded, measured)
-
-Decisive analysis (vLLM v0.23.0, torch 2.11+cu130, sm_121, model `RedHatAI/Qwen3-32B-NVFP4A16`, source at tag
-`v0.23.0`). **Answer: it's the scheduler, not the kernel.** This closes the kernel track and opens the
-scheduler track.
-
-## The decomposition (measured on the DGX, prefix-cache OFF, unique prompts)
-
-| | vLLM W4A16 Marlin | llama.cpp | verdict |
-|---|---|---|---|
-| **single-stream prefill** | ~800 t/s (~52 TFLOPS) | 718 MMQ / **1153 MXFP4** | **tied; llama.cpp MXFP4 wins** |
-| decode batch-1 | 11.8 t/s | ~similar | bandwidth-bound (≈190/273 GB/s); no kernel helps |
-| **aggregate decode** | 328 (N32) / 569 (N64) / **667 (N128)** | the gap | **~56× multiplier = scheduler** |
-
-vLLM's single-stream Marlin is **not** at the roofline — it's in the same ~4×-under regime as MMQ. The 24k
-headline is entirely the aggregate decode multiplier.
-
-## The kernel vLLM actually runs on sm_121 (W4A16, forced)
-
-Dispatch (vLLM v0.23.0): `compressed_tensors.py:704` (NVFP4 + no input-quant → `W4A4Fp4(use_a16=True)`) →
-`compressed_tensors_w4a4_nvfp4.py:28` → `kernels/linear/__init__.py:894` (`if use_a16: force_kernel =
-MarlinNvFp4LinearKernel`, **unconditional, no cc gate**) → `nvfp4/marlin.py` → `marlin_utils_fp4.py:182`
-`ops.marlin_gemm(b_q_type=float4_e2m1f)`, activations FP16/BF16. csrc: `csrc/quantization/marlin/marlin.cu`
-+ `marlin_template.h` + `marlin.cuh`.
-
-Techniques = **exactly the playbook we proved loses on GB10**: XOR shared swizzle (`marlin_template.h:722
-^ (row%8)`), 4-stage cp.async pipeline (`marlin.cu:396 stages=4`, `cp_async_wait<stages-2>`), ldmatrix+mma,
-FP16/BF16 acts. Native FP4 (`FlashInferB12xNvFp4LinearKernel`) needs `Sm120BlockScaledDenseGemm` cubins absent
-on GB10 → W4A4 hangs → forced W4A16 Marlin fallback. **Nothing to port; vLLM's kernel is occupancy-blocked too.**
-
-## The scheduler (the real multiplier) — what llama.cpp lacks
-
- **Paged KV cache** (`vllm/v1/core/kv_cache_manager.py`, `block_pool.py`): block KV, no fragmentation → very
-  high concurrent batch. **llama.cpp: NO** (contiguous per-slot KV → fragmentation caps real concurrency).
- **Chunked prefill** (`config/scheduler.py:84 enable_chunked_prefill=True`, default ON): interleaves prefill
-  chunks with decode so decode batches stay full. **llama.cpp: NO** (a long prefill stalls the decode batch).
- **Continuous batching** (`v1/core/sched/scheduler.py`): per-step admit/evict. **llama.cpp: YES** (`n_parallel`,
-  rudimentary — we enabled VRAM-scaled slots in #10411).
-
-## Sizing the scheduler gap — MEASURED (llama.cpp aggregate, the surprise)
-
-`llama-batched-bench` Qwen3-32B-Q4_K_M, npp=128 ntg=128, npl scaling (DGX):
-
-| npl | S_PP (agg prefill) | **S_TG (agg decode)** | vLLM decode | llama % of vLLM |
-|---|---|---|---|---|
-| 1 | 628 | 10.2 | 11.8 | 86% |
-| 8 | 773 | 59.8 | - | - |
-| 32 | 763 | **235** | **328** | **72%** |
-| 64 | 761 | **391** | **569** | **69%** |
-| 128 | 762 | **540** | **667** | **81%** |
-
-**The "30x gap" headline is wrong for realistic concurrency.** llama.cpp's continuous batching already
-captures **~70-81% of vLLM's aggregate decode** at npl<=128, with a near-identical multiplier (10.2 -> 540 =
-**53x**, vs vLLM's 56x). And it is still climbing at 128, though sublinearly (not yet plateaued). Combined with llama.cpp being
-*ahead* single-stream (MXFP4 1153 > vLLM 800), **llama.cpp is already broadly competitive with vLLM on GB10 at
-self-hosted concurrency.**
-
-Two real findings remain:
-1. **Aggregate prefill is flat ~760** regardless of npl - but that is the **GB10 compute roofline** (vLLM single-
-   stream is ~800; neither can prefill faster aggregate, it is compute-bound). So prefill is **not a throughput
-   gap**; chunked prefill is a **latency/TTFT** win (stop a long prefill stalling the decode batch), not a
-   throughput one.
-2. **vLLM's ~24k headline lives at thousands-of-sequences concurrency**, which **paged KV** unlocks (block KV,
-   no fragmentation). llama.cpp's contiguous KV caps how far npl can scale before memory/fragmentation bite. So
-   paged KV is the **high-concurrency (datacenter) lever**, not a moderate-concurrency one.
-
-## Recommendation
-
-**Pivot to the scheduler; treat the GEMM kernel as good-enough / roofline-blocked on GB10.**
-Now that the gap is measured, ROI-ordered:
-1. **Ship the MXFP4-dense win** — 1153 t/s single-stream beats vLLM's 800; a Blackwell dense-quant
-   recommendation (requantize, no kernel work). Already documented in `BLACKWELL_KERNEL_GAPS.md` §6. Cheapest.
-2. **Chunked prefill** — the tractable scheduler win: interleave prefill chunks with decode so a long prompt
-   doesn't stall the decode batch. Payoff is **latency/TTFT under mixed load** (and steadier decode batches),
-   not aggregate prefill throughput (that's GB10-compute-capped at ~760-800 for both engines). A grpc-server
-   scheduler change; no KV-layout rewrite.
-3. **Paged KV** — the **high-concurrency (thousands-of-seqs) lever** that unlocks vLLM's 24k regime. Heavy
-   (block KV manager; contested upstream PR #22569 / vendored `patches/`). Worth it only if datacenter-scale
-   concurrency is a target; at self-hosted concurrency (npl<=128) llama.cpp is already ~70-81% of vLLM.
-
-**Reframed expectation:** llama.cpp on GB10 is NOT 30x behind vLLM. It is ahead single-stream (MXFP4) and
-~70-81% of vLLM aggregate at npl<=128. The genuine differentiator vLLM still has is **scaling to very high
-concurrency via paged KV**. Kernel tracks (W4A16 178 t/s; FP4-MMA) stay **banked** - not the lever.
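Reviewer note on the removed decomposition: the percentages and multipliers quoted in the "Sizing the scheduler gap" table can be re-derived with a few lines of arithmetic. The throughput figures below are copied from that table; the function names are hypothetical.

```python
# Aggregate decode throughput in tok/s, keyed by npl, from the removed table.
LLAMA_S_TG = {1: 10.2, 32: 235.0, 64: 391.0, 128: 540.0}
VLLM_DECODE = {1: 11.8, 32: 328.0, 64: 569.0, 128: 667.0}


def pct_of_vllm(npl: int) -> float:
    """llama.cpp aggregate decode as a percentage of vLLM's at the same npl."""
    return 100.0 * LLAMA_S_TG[npl] / VLLM_DECODE[npl]


def multiplier(rates: dict, lo: int = 1, hi: int = 128) -> float:
    """Aggregate throughput gain from `lo` to `hi` concurrent sequences."""
    return rates[hi] / rates[lo]


if __name__ == "__main__":
    for npl in sorted(LLAMA_S_TG):
        print(f"npl={npl:>3}: llama.cpp is {pct_of_vllm(npl):.0f}% of vLLM")
    print(f"llama.cpp 1 -> 128: {multiplier(LLAMA_S_TG):.1f}x")
    print(f"vLLM      1 -> 128: {multiplier(VLLM_DECODE):.1f}x")
    print(f"llama.cpp 32 -> 64:  {multiplier(LLAMA_S_TG, 32, 64):.2f}x")
    print(f"llama.cpp 64 -> 128: {multiplier(LLAMA_S_TG, 64, 128):.2f}x")
```

The last two lines show the per-doubling gain for llama.cpp (about 1.66x, then about 1.38x), so throughput is still rising at npl=128 but each doubling buys less than the previous one.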
--- a/backend/cpp/llama-cpp/paged/VLLM_THROUGHPUT_GAP.md
+++ b/backend/cpp/llama-cpp/paged/VLLM_THROUGHPUT_GAP.md
@@ -1,59 +0,0 @@
-# Where vLLM beats llama.cpp on a DGX Spark (GB10), and how to close it — keeping quality
-
-The question: "vLLM is faster at the end — what do we improve, while keeping good quality?" Answer: the
-gap is **three independent things**, and the biggest *per-user, quality-preserving* one is **speculative
-decoding**, which llama.cpp already supports.
-
-## Decomposition (measured + researched)
-
-| vLLM advantage | helps single user? | llama.cpp answer | quality cost | status |
-|---|---|---|---|---|
-| **Per-user decode speed** | **yes** | **speculative decoding** (Qwen3 draft / EAGLE3) | **none** (target-verified, lossless) | mature in llama.cpp; **the main lever** |
-| Prefill / TTFT | no (it's first-token latency) | tune FP4-MMA / Marlin W4A16 kernel | none | hard; `BLACKWELL_KERNEL_GAPS.md` |
-| Aggregate throughput @ concurrency | no (per-user = 0) | continuous batching (paged engine) | none | also kernel-bound |
-
-Key measured fact: **single-user decode is already at parity** (Qwen3-32B: llama 10.2 vs vLLM 11.7 t/s) —
-both hit GB10's ~273 GB/s bandwidth wall (~15 t/s ceiling) **without** spec-dec. So vLLM's real per-user
-speed edge is spec-dec, not architecture.
-
-## Why spec-dec is THE lever here (and quality-safe)
-
- **Lossless:** the 32B target verifies every drafted token (accept/reject) — output distribution is
-  identical to no-drafting. So you keep **Q4_K_M quality** (no lossy MXFP4 needed) *and* get speed.
- **GB10 is best-case for it:** decode is bandwidth-bound (one ~17 GB weight-read per token) with huge idle
-  compute. Spec-dec verifies K drafted tokens in **one** weight-read → converts the loop to compute-bound,
-  where GB10 has headroom. Realized speedup ≈ mean accepted length.
- **Measured (others, same model class):** llama.cpp Qwen2.5-32B dense + 0.5B draft = **2.9×** (13→38 t/s);
-  vLLM EAGLE3 on Qwen3-32B = ~1.8–2.5× general, up to ~3× code/structured. **Competitive.**
- **Regime caveat:** spec-dec gives **~nothing for MoE-A3B** models (only ~3B active → not bandwidth-bound,
-  nothing to amortize). It shines for **dense** 27–32B — the opposite regime. So this lever is *dense-model*
-  specific.
-
-## Qwen3-32B specifics
-
- **No native MTP head** (MTP is a Qwen3-*Next*/MoE feature). Options: a **same-family draft**
-  (Qwen3-0.6B or **1.7B** — same tokenizer, llama.cpp vocab check passes) or an external **EAGLE3 head**
-  (RedHatAI/AngelSlim Qwen3-32B-eagle3, accept length 2.15–2.49).
- Draft pick: **lean toward Qwen3-1.7B** (0.6B had ~60% lower acceptance in AWS's test; on a bandwidth-bound box the
-  32B weight-read dwarfs the draft cost, so maximize acceptance). `--spec-draft-n-max 5–8`.
-
-## Recommended LocalAI actions (quality-preserving, ranked)
-
-1. **Make speculative decoding easy/recommended for dense ≥14B models on Blackwell** — a draft-model field in
-   the model config (`-md` / `--spec-draft-*`), with a suggested Qwen3-1.7B draft for the Qwen3 family. This
-   is the biggest per-user speed win, lossless, available **now** (no kernel). Gallery: ship target+draft pairs.
-2. Kernel work (FP4-MMA tuning / Marlin W4A16) — improves **prefill/TTFT**, separate metric.
-3. Continuous batching (paged engine) — **aggregate** concurrency only; per-user = 0.
-
-## Honesty / status
-
-The research conclusion is solid (sources below). **Our own empirical spec-dec run on the DGX is pending** —
-the box rebooted mid-session and `llama-cli` now hangs at 0% GPU (while `llama-bench` works), plus the network
-is dropping ssh mid-command. Drafts (Qwen3-0.6B/1.7B Q8) are downloaded and the spec-dec flags are confirmed;
-re-run `llama-cli -m Qwen3-32B-Q4_K_M -md Qwen3-1.7B-Q8_0 -ngl 99 -ngld 99 --spec-draft-n-max 8` when the box
-is stable to confirm the ~2× locally. The conclusion does not depend on it (it has been measured and reproduced by
-others on this exact model class), but we should bank our own number.
-
-Sources: llama.cpp Discussion #10466 (Qwen2.5-32B+0.5B = 2.9×), #16578 (DGX Spark), DandinPower/llama.cpp_bench
-(32B = 10.7 t/s, bandwidth-bound); vLLM MTP docs + Red Hat EAGLE3 article (lossless, up to 2.5×); AWS spec-dec
-blog (Qwen3-32B+1.7B up to 3×, 0.6B ~60% lower accept); RedHatAI/AngelSlim Qwen3-32B-eagle3 heads.
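Reviewer note on the removed throughput-gap notes: the bandwidth-wall argument and the "speedup ≈ mean accepted length" claim can be made concrete with a back-of-the-envelope model. The 273 GB/s and ~17 GB figures come from the notes. The per-token acceptance rate of 0.7 is purely hypothetical, and the closed form assumes each drafted token is accepted independently with the same probability, which real workloads only approximate. Draft-model cost is ignored.

```python
def decode_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode when every token rereads all weights."""
    return bandwidth_gb_s / weights_gb


def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k drafted tokens.

    Assumes i.i.d. acceptance with probability `accept_rate`; the target
    always contributes one token, so the result lies in [1, k + 1].
    """
    if accept_rate >= 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


if __name__ == "__main__":
    ceiling = decode_ceiling_tps(273.0, 17.0)
    print(f"bandwidth-bound ceiling: {ceiling:.1f} t/s")
    for k in (5, 8):
        gain = expected_tokens_per_step(0.7, k)  # 0.7 is a made-up acceptance rate
        print(f"k={k}: ~{gain:.2f} tokens per weight read")
```

The crude ceiling comes out near 16 t/s, in the same range as the ~15 t/s quoted in the notes. With the hypothetical 0.7 acceptance rate, one weight read yields close to 3 tokens at k=5, and the expected gain levels off as k grows, which is why pushing the draft length well past 5-8 buys little.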
--- a/backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md
@@ -1,176 +0,0 @@
-# W4A16 Marlin-style GEMM for ggml-cuda on Blackwell (sm_120/121) — implementation plan
-
-> **STOPPED (2026-06-21): the kernel is NOT the lever — validated by a code-grounded vLLM analysis.**
-> Measured on the DGX: vLLM's single-stream W4A16 prefill on GB10 = **~800 t/s (~52 TFLOPS), statistically TIED
-> with llama.cpp MMQ (718/47)** — and vLLM uses the *exact* XOR-swizzle + 4-stage cp.async Marlin we proved
-> collapses GB10 occupancy (vLLM even warns at load that Marlin "may degrade performance for compute-heavy
-> workloads"). There is no kernel trick to port. Moreover llama.cpp's **MXFP4 path (1153 t/s) already BEATS
-> vLLM single-stream (800)** — vLLM has no FP4 cubins on sm_121 and falls back to slower W4A16 Marlin, so
-> llama.cpp is *ahead* on the kernel. **vLLM's entire 24k headline is the aggregate decode multiplier (~56×)
-> from paged KV + chunked prefill + continuous batching — a SCHEDULER win.** llama.cpp lacks paged KV +
-> chunked prefill. **Effort pivots to the scheduler** (see the paged-attention work). This kernel work is
-> banked + resumable (178 t/s, P0/P1/P2/P3/P3b committed) but is not the throughput lever on GB10. Detail:
-> `VLLM_DECOMPOSITION.md`.
-
-The committed multi-week kernel. Goal: get 4-bit-weight dense matmul to the GB10 **BF16 ceiling (~213
-TFLOP/s ≈ ~3,300 t/s prefill on Qwen3-32B)**, ~4.3× over today's 765. This is the *match-vLLM* path; vLLM's
-own GB10 dense throughput runs on W4A16 Marlin (its FP4 path is broken on sm_121).
-
-## Why a custom kernel (validated, not assumed)
-
-On GB10 (sm_121), measured: **both** llama-MMQ (int8, Ampere-tuned) **and** cuBLAS-FP16 sit at ~46 TFLOP/s
-(~21% of peak). cuBLAS falls back to an Ampere `cutlass_80_tensorop` kernel (CUDA-13 has no sm_121 GEMM for
-these shapes); rebuilt with `-DGGML_CUDA_FORCE_CUBLAS=ON` it's *slower* than MMQ (690 vs 750). **No library
-path reaches the ceiling on consumer Blackwell** — a hand-tuned sm_120a kernel is required. `mmapeak` measures
-the 213 BF16 peak as reachable, and vLLM's Marlin hits it, so the ceiling is real; the work is reaching it.
-
-## What Marlin does (the design we mirror)
-
-Weights stored 4-bit, **dequantized in-register/shared-mem** in-flight; GEMM math on **FP16/BF16 tensor
-cores** (`mma.sync m16n8k16`). Speed comes from: `cp.async` global→shared with a **multi-stage double-buffered
-pipeline**, **offline weight reshuffle** into the MMA-friendly layout, activations kept resident in registers,
-and **Stream-K** partitioning. Sources: IST-DASLab/marlin, arXiv 2408.11743, vLLM machete (Hopper successor).
-
-## Phases (each ends with: numerical parity vs MMQ + a prefill benchmark)
-
-### P0 — Harness + baseline — DONE
- **Correctness gate (GREEN):** `test-backend-ops test -o MUL_MAT -b CUDA0` → **1103/1103 passed** (CUDA vs CPU
-  reference, covers Q4_0/Q4_K at the real FFN shapes m=4096,k=14336,n=1..512). This is *the* parity check the
-  W4A16 kernel must keep green at every phase — it tests the CUDA MUL_MAT path the kernel will hook. The
-  `not supported` lines are `type_b=f16` combos (irrelevant; prefill uses f32 activations).
- **Perf baseline:** `llama-bench` dense Q4_K prefill = **~750 t/s (pp512 718 / pp2048 750) ≈ 46 TFLOP/s ≈ 21%
-  of the 213 BF16 ceiling**. The kernel must beat this toward ~3,300. (`test-backend-ops perf -o MUL_MAT` gives
-  per-shape GFLOPS too; build it once with the harness.)
- **Op-level baseline (the canonical kernel target), `test-backend-ops perf -o MUL_MAT`, m=4096 k=14336 (FFN):**
-  | n (tokens) | q4_0 | q4_K | regime |
-  |---|---|---|---|
-  | 1 | 817 GFLOPS | 761 GFLOPS | decode / mat-vec (memory-bound) |
-  | 8 | 5.77 TFLOPS | 4.11 TFLOPS | small-batch |
-  | **512** | **49.5 TFLOPS** | **47.1 TFLOPS** | **prefill GEMM — ~22% of the 213 ceiling** |
-
-  So the prefill GEMM target: lift q4_K n=512 from **47 → toward ~213 TFLOPS** (~4.5×). This per-shape number
-  is cleaner than end-to-end for kernel iteration.
- **Harness script:** `~/p0harness.sh` on the DGX (build test-backend-ops + correctness + perf). Reusable each
-  phase: `test-backend-ops test -o MUL_MAT -b CUDA0` must stay 1103/1103; the q4_K n=512 perf must climb from 47.
- test-backend-ops needed `-DLLAMA_BUILD_TESTS=ON`; now built in `~/llama.cpp-pr24423/build`.
-
-### P1 — Dispatch seam (no behavior change) — DONE
- `marlin-w4a16.{cuh,cu}` + a gated hook in `ggml_cuda_mul_mat` (dense, non-ids path), behind
-  `GGML_CUDA_W4A16` + sm_120/121 (`cc >= GGML_CUDA_CC_BLACKWELL`) + type∈{Q4_0,Q4_K} + f32 activations.
-  Returns false → falls back to MMQ. Source + apply instructions: `kernel/w4a16/` (`HOOK.md`).
- **Verified on GB10:** clean build; `test-backend-ops MUL_MAT` = **1103/1103** (byte-identical default);
-  `llama-bench` dense Q4 pp512 unchanged (717.77 default / 718.26 with flag); `GGML_CUDA_W4A16=1` reaches the
-  seam (stderr `[w4a16] ... P1 seam - using MMQ`) and falls back. The empty frame P2/P3 fills.
-
-### P2 — Correctness-first kernel (slow OK) — DONE
- **Kernel:** `marlin-w4a16.cu` replaces the P1 TODO with a real W4A16 GEMM. In-kernel dequant Q4→BF16 into
-  shared mem, `mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32` via ggml's `mma.cuh` tile abstractions
-  (`tile<16,8,nv_bfloat162>` A, `tile<8,8,nv_bfloat162>` B, `tile<16,8,float>` C), F32 accumulate, F32 write.
-  One warp per 16(M)x8(N) output tile, K looped in steps of 16. Both src0 (weights, row m) and src1 (acts,
-  row n) are row-major `[row][k]`, so A and B load symmetrically via `load_generic`; the mma does the dot over k.
- **Types handled:** Q4_0 and Q4_K. Q4_0 dequant `w=d*(q-8)` inline; Q4_K via the superblock decode mirrored
-  from `convert.cu` (`get_scale_min_k4`, 8x32 sub-blocks, `d*q-m`).
- **Shape classes handled:** contiguous 2D GEMM (the prefill path), `ne2==ne3==1`, f32 activations, K%16==0
-  (always true: Q4_0 K%32, Q4_K K%256). **Falls back to MMQ (returns false)** for batched (bs!=[1,1]),
-  broadcast (nr!=[1,1]), permuted / non-contiguous (per!=[0,1,2,3]), and any non-f32 activation (e.g. f16) -
-  keeps the gate green. M / N boundaries are zero-padded in-kernel (handles M not %16, N not %8).
- **Parity (the gate):** `GGML_CUDA_W4A16=1 test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103 passed**
-  (the Q4_0/Q4_K f32 contiguous shapes run the kernel and match the CPU reference; batched/permuted/f16 fall
-  back). Default (flag-unset) build still **1103/1103** (byte-identical, seam returns false).
- **Model sanity / P2 perf:** `GGML_CUDA_W4A16=1 llama-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -p 512 -n 16
-  -ub 2048` runs clean: **pp512 = 31.75 t/s**, tg16 = 6.28 t/s. Slow as expected (naive 1-warp/tile, weights
-  re-dequantized per n-tile, no pipeline) - this is the correctness checkpoint; P3 brings the speedup. The real
-  Q4_K model matmul path engages the kernel without error.
-
-### P3 — The Marlin pipeline (the speedup) — STEP 1 + SKEW-PAD/TILING LANDED; PREPACK + PIPELINE + STREAM-K DEFERRED
-Goal: `cp.async` double/triple-buffered global->shared; offline weight reshuffle (a one-time repack of the Q4
-tensor into the mma+pipeline layout); register-resident activation tiles; Stream-K split for the prefill M.
-Target: >=150 TFLOP/s (>=~2,300 t/s), then ~213. **MMQ baseline to beat: 47.1 TFLOPS (q4_K n=512) / pp512 718.**
-
-**Kernel structure now (committed P3b):** block-tiled multi-warp GEMM with a CONFLICT-FREE shared feed via skew
-padding. `blockDim=(32, WM*WN)` so `threadIdx.x` is the warp lane (required by `mma.cuh` get_i/get_j) and
-`threadIdx.y` is the warp index; the original 1-warp P2 launch put 128 threads on `threadIdx.x` and exploded
-`get_j` into an out-of-bounds shared read (found via compute-sanitizer). `WM*WN` warps compute a
-`BM(=WM*FM*16) x BN(=WN*FN*8)` output tile; each warp owns an `FM x FN` grid of m16n8k16 mma fragments
-accumulated in F32. Per k-step (16-deep): all warps cooperatively dequant the `BM x 16` Q4 weight strip + load
-the `BN x 16` f32->bf16 activation strip into shared, one `__syncthreads`, then `ldmatrix.x4` (A) / `ldmatrix.x2`
-(B) fragments + `FM*FN` mmas. The shared rows hold 8 bf162 of data but are stored at a PADDED stride of 12 bf162
-(`W4A16_SPAD`): ldmatrix's per-lane address is `row*stride`, and the natural stride 8 (a divisor of the
-32-bank / 128-byte cycle) collides rows 0,4,8,12 into a 2-way bank conflict; skewing to 12 bf162 (48 bytes, a
-multiple of 16, so ldmatrix's 16-byte row alignment holds) makes `{r*12 mod 32}` hit 8 distinct bank-quads for
-r in 0..7, so both halves of ldmatrix are conflict-free at only +50% on the small staged tile (~12 KB at the shipping tile).
-Shipping config `WM=4,WN=4,FM=2,FN=4` -> `BM=128, BN=128`, 16 warps, 8 m16n8 C-tiles per warp (keeping
-register pressure low is what lets BN grow without an occupancy cliff). M/N tails zero-padded in-kernel; still
-gated to contiguous 2D Q4_0/Q4_K f32 prefill, else falls back to MMQ.
-
-**Per-step results (q4_K n=512 via `test-backend-ops perf`; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M):**
-
-| step | q4_K n=512 | q4_0 n=512 | pp512 | pp2048 | vs MMQ 47 / 718 | notes |
-|---|---|---|---|---|---|---|
-| P2 (1 warp/tile) | ~2 TFLOPS | - | 31.75 | - | 0.04x | correctness checkpoint |
-| Step 1: block tiling (load_generic, BM64/4w) | 6.63 (cold) | 7.53 | 119 | 123 | 0.14x | original committed kernel |
-| P3b-1: skew-pad ldmatrix + BM128/8w | 8.50 (cold) | 10.56 | 148.5 | 153.9 | 0.18x | +28% q4_K, +40% q4_0 over step 1 |
-| **P3b-2: + BN128/16w (current)** | **9.92 (cold)** | **11.68** | **177.6** | **185.0** | **0.21x** | +17% q4_K, +20% pp512 over P3b-1 (+49% pp512 over step 1) |
-
-Parity gate **1103/1103** at every step, flag set and unset (byte-identical when unset). All P3b numbers above
-are from thermally-bracketed cold A/B sessions (committed measured immediately before AND after each candidate,
-identical both times -> the deltas are real, not thermal). P3b-1 cold A/B (q4_K/q4_0): 6.63/7.53 vs 8.52/10.49.
-P3b-2 cold A/B (q4_0/q4_K): BN64/8w 10.56/8.50 then 10.51/8.45 (bracket) vs BN128/16w 11.68/9.92.
-
-**What landed / what was tried (honest):**
- **P3b - LANDED (committed).** Two combined changes lift the prior committed kernel: (1) **skew-pad
-  conflict-free ldmatrix** (shared row stride 8->12 bf162; makes `ldmatrix.x4`/`.x2` bank-conflict-free at near
-  zero occupancy cost) and (2) **bigger tile / more warps** (`BM=128, BN=64`, 8 warps). Cold A/B: q4_K
-  6.63->8.52 (+28%), q4_0 7.53->10.49 (+40%), pp512 119->148.5 (+25%). **Still ~5.5x under MMQ (47) per-op and
-  ~4.8x under pp512 718 - does NOT beat MMQ.** This is forward progress, not the finish line.
- **The XOR-swizzle-FIRST plan was tested and is WRONG for this GPU - documented so it is not re-tried.** A
-  wide-row (BK=64, 128-byte rows) XOR swizzle `seg ^ (row&7)` IS conflict-free, but the 16 KB shared it needs
-  collapsed occupancy and dropped q4_K n=512 to **2.84 TFLOPS** (worse than the unswizzled 6.63) - the same
-  occupancy cliff P3 hit with a 32 KB pipeline. The conflict-free feed must be bought WITHOUT widening shared:
-  skew padding (above) does exactly that (6 KB), which is why it is the committed form. Lesson: on GB10 occupancy
-  dominates bank-conflict latency; never trade occupancy for a conflict-free layout.
- **Conflict-free feed alone did NOT beat the unswizzled kernel - the limiter moved.** At the SAME BM64/4w tile,
-  skew-pad ldmatrix (6.70) ~= load_generic (6.63): removing bank conflicts bought ~nothing. The win came only
-  when the tile grew (BM128/8w). A 5-config tile sweep then split the two quant types:
-  - **q4_0 SCALES with warps/tiles** (7.7 -> 10.5 -> **15.8 TFLOPS at BM128/16w**): feed/global-traffic bound,
-    helped by cutting redundant activation re-reads (more BM = fewer M-blocks each re-reading the act column).
-  - **q4_K is largely DEQUANT-COMPUTE bound** (the BM64/16w tile gives q4_0=15.8 but q4_K=6.8 - they diverge
-    hard). This **refines P3's "within 12%" finding**: that held only in the low-throughput memory-bound regime;
-    once the feed is unblocked, q4_K's per-element 6-bit superblock decode (`get_scale_min_k4` + superblock
-    indexing, redone every k-step AND re-done by every N-block) becomes the wall. BM256 regressed both (too few
-    blocks / register pressure).
- **Growing BN partly relieves the q4_K dequant wall (P3b-2).** Because every N-block re-decodes the same
-  weight strip, halving the N-block count (BN 64->128) halves that redundant q4_K decode - but only when BN is
-  spread across MORE WARPS (16w, 8 C-tiles/warp), not more fragments-per-warp: the FN=8 / FM=4 variants (16
-  C-tiles/warp) regressed to ~6.6 on register pressure, while WM=4,WN=4,FM=2,FN=4 (16w, 8 tiles/warp) lifted
-  q4_K 8.5->9.9 and q4_0 10.6->11.7 cold. BN=256 was no better and costs more shared. **BN128/16w is the
-  shipping tile.**
- **Next blocker (the remaining q4_K unlock) = offline prepack.** BN growth only divides the redundant decode by
-  the N-block count; it cannot remove the per-k-step decode itself. The full fix is the **one-time offline
-  repack** - decode the Q4 tensor ONCE into a cached device buffer keyed off the tensor data pointer, in a layout
-  with the scale/min pre-applied (store reshuffled 4-bit + per-subblock bf16 d,m, ~1.25x the q4 size, NOT a full
-  bf16 blow-up which would be ~4x), so the in-kernel path becomes a cheap `q*d - m` with coalesced loads. Then
-  `cp.async` multi-stage (sized to NOT widen shared past the occupancy cliff) and **Stream-K** over M. These
-  remain the multi-week core; **prepack is the highest-value next step for q4_K specifically** (it should let
-  q4_K join q4_0 on the feed-bound scaling curve instead of plateauing at ~10).
- **Methodology note (unchanged):** the box thermally throttles under sustained perf+bench runs (identical code
-  ~8.8 cold vs ~6.6 hot earlier), so only same-session A/Bs are trustworthy. The P3b deltas above were taken in
-  one bracketed cold session for exactly this reason.
-
-### P4 — Tune
- Tile (mmq_x/y analogues), warps, pipeline depth, occupancy. We have nsys (throughput) but **not ncu** on the
-  DGX — tuning is empirical (sweep configs, measure t/s). Note ncu would need sudo/driver perms we lack.
-
-### P5 — Enable
- Default on for sm_120/121 + Q4_0/Q4_K dense when parity holds + faster; keep the flag as an escape hatch.
-  Ship as a LocalAI llama.cpp patch (the patches/ series) and/or upstream (ggml has no Marlin-equivalent —
-  issue #1519 — so it's net-new upstream value; float it with maintainers first).
-
-## Risks / notes
- **Multi-week, expert-CUDA, DGX-only** (GB10 is the only sm_121). The session's network flakiness +
-  `llama-cli` hang make `llama-bench`/`test-backend-ops` the reliable verification tools (both work).
- Quantization correctness: Q4_K's superblock structure (256-elem, 6-bit scales) is more complex to dequant
-  in-kernel than Q4_0; consider landing Q4_0 first, then Q4_K.
- **Beat-path follow-on:** the FP4-MMA path (`mul_mat_q<MXFP4>`, ~5% of FP4 peak) tuned/fixed on sm_121 reaches
-  ~6,600 (2× BF16). Separate track; this W4A16 kernel is the match-path foundation.
- Reuse ggml's `mma.cuh` tile abstractions (MMQ already uses them) rather than raw PTX where possible.
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/HOOK.md
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/HOOK.md
@@ -1,31 +0,0 @@
-# W4A16 seam — how to apply to a llama.cpp / ggml-cuda checkout
-
-Two source files + two one-line edits to `ggml/src/ggml-cuda/ggml-cuda.cu`. The build picks up the
-new `.cu` via the existing `file(GLOB)` after a `cmake -S . -B build` reconfigure (no CMakeLists edit).
-
-## Files (copy into `ggml/src/ggml-cuda/`)
- `marlin-w4a16.cuh`
- `marlin-w4a16.cu`
-
-## Edit `ggml/src/ggml-cuda/ggml-cuda.cu`
-
-1. **Include** — after the existing `#include "ggml-cuda/fp4-grouped-moe.cuh"` (sibling-header style):
-   ```cpp
-   #include "ggml-cuda/marlin-w4a16.cuh"
-   ```
-
-2. **Dispatch hook** — immediately before the dense dispatch chain, i.e. before
-   `if (!split && use_mul_mat_vec_f) {` in `ggml_cuda_mul_mat(...)` (after `const int cc = ...`):
-   ```cpp
-   if (!split && ggml_cuda_w4a16_mul_mat(ctx, src0, src1, dst)) { return; }
-   ```
-
-## Verify (P1 acceptance — met)
- `cmake --build build --target test-backend-ops llama-bench` → builds clean.
- `test-backend-ops test -o MUL_MAT -b CUDA0` → **1103/1103** (byte-identical default).
- `llama-bench` dense Q4 pp512 → unchanged (~718, MMQ).
- `GGML_CUDA_W4A16=1 llama-bench` → unchanged + stderr `[w4a16] ... P1 seam - using MMQ` (seam reached,
-  gating passes on sm_121, falls back).
-
-The kernel body (P2 correctness → P3 Marlin pipeline) replaces the `TODO(P2/P3)` block in `marlin-w4a16.cu`
-and returns `true` once parity holds.
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/SUBAGENT_BRIEFS.md
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/SUBAGENT_BRIEFS.md
@@ -1,66 +0,0 @@
-# W4A16 kernel - subagent dispatch briefs (P3, P4, P5)
-
-**Dispatch strategy.** Each phase = one fresh **Opus-4.8** subagent handed a complete zero-context brief.
-Phases are **sequential** (P3 needs P2's correct kernel; P4 needs P3's pipeline; P5 needs P4's tuned kernel),
-so dispatch phase N+1 only after phase N's commit lands, and before dispatching, splice phase N's *actual*
-deliverable (final kernel shape, configs, fallback set) into the next brief. P2's brief (already dispatched)
-is the template; reuse the COMMON section below verbatim in every dispatch.
-
---
-
-## COMMON (paste into every phase brief)
-
- **Kernel dev is on the remote DGX** (GB10, sm_121): `ssh -o ConnectTimeout=25 -o ServerAliveInterval=10 -o ServerAliveCountMax=10 dgx.casa '<cmd>'`. Network is FLAKY (re-poll on drop; nohup jobs survive). `llama-cli` HANGS - never use it. Only `llama-bench` + `test-backend-ops` work.
- Checkout `~/llama.cpp-pr24423`, build `~/llama.cpp-pr24423/build` (sm_121, `-DLLAMA_BUILD_TESTS=ON`). Kernel file `ggml/src/ggml-cuda/marlin-w4a16.cu`. Build auto-GLOBs it; no CMakeLists edits. Hook already in `ggml-cuda.cu`, gated behind env `GGML_CUDA_W4A16`.
- Dense test model: `~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
- **Builds run detached + poll** (never blocking foreground): write a `~/pN.sh` that builds `--target test-backend-ops llama-bench`, echoes `RC=$?`, runs the gate, echoes `PN_DONE`; `nohup` it; poll `for i in $(seq 1 90); do grep -q PN_DONE ~/pN.out && break; sleep 20; done; tail ~/pN.out`.
- **GPU hygiene:** check `docker ps | grep local-ai` + `nvidia-smi`; `docker stop` a running localai worker if present (authorized); never pkill native procs; never start model servers.
- **Parity gate (must stay green every step):** `GGML_CUDA_W4A16=1 CUDA_VISIBLE_DEVICES=0 ./build/bin/test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103**; and flag-unset stays 1103/1103 (byte-identical). A wrong result is worse than a fallback - return false for any shape you can't do correctly.
- **Perf measurement:** `test-backend-ops perf -o MUL_MAT -b CUDA0` (per-shape GFLOPS; the canonical target is q4_K m=4096 k=14336 **n=512**, baseline **47.1 TFLOPS**, ceiling ~213) + `llama-bench -m <model> -ngl 99 -p 512,2048 -n 0 -ub 2048` (baseline pp512 ~718).
- **LocalAI repo (commit here; you do NOT inherit cwd - `cd` explicitly):** `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`. Plan: `backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md`. Source mirror: `backend/cpp/llama-cpp/paged/kernel/w4a16/`. After a phase passes: fetch the final `marlin-w4a16.cu` from the DGX (`ssh ... 'cat ...'`), overwrite the mirror, update the plan (mark the phase DONE with numbers), `git commit -s` (DCO sign-off; user is Ettore Di Giacinto <mudler@localai.io>). **No `Co-Authored-By`. No em-dashes anywhere. Trailer `Assisted-by: Claude:opus-4.8 [Claude Code]`. Do NOT push.**
- Final message = the result (gate ?/1103, the perf delta, blockers + resolutions, commit hash). A precise partial result beats a vague success claim.
-
---
-
-## P3 brief - the Marlin pipeline (the speedup)
-
-**Goal.** Take P2's correct-but-slow kernel past the MMQ baseline (~47 TFLOPS) toward ~150+ TFLOPS (then ~213) on the q4_K n=512 prefill GEMM, **without ever breaking parity**. This is the Marlin design: the math is the same BF16 mma; the speed comes from feeding the tensor cores without stalling.
-
-**Implement, incrementally (re-run the parity gate after each):**
-1. **`cp.async` multi-stage pipeline** - double/triple-buffer global->shared loads of both the Q4 weight tiles and the activation tiles so dequant+mma on stage k overlaps the load of stage k+1. (Study `mma.cuh` + how `mmq.cu`/`mmf.cu` stage shared memory; ggml already uses `cp.async`/`__pipeline_*`.)
-2. **Offline weight reshuffle** - repack the Q4 weights once into the mma+pipeline-friendly layout (Marlin's interleave) so loads are coalesced and the mma fragment maps directly. Do this as a load-time transform of src0 (a new prepacked buffer keyed off the tensor) - NOT per-call. Document where the repack lives + its memory cost.
-3. **Register-resident activation tiles + Stream-K** split of the M dimension across blocks for the prefill (large-M) case so all SMs stay busy.
-
-**Acceptance.** Parity gate stays **1103/1103** at every commit; `test-backend-ops perf` q4_K n=512 climbs materially above 47 TFLOPS (target >=150) and `llama-bench` pp512 climbs above ~718. Report the TFLOPS + t/s after each of the 3 steps so the contribution of each is visible. If a step regresses parity, revert it and report why.
-
-**Reference.** IST-DASLab/marlin (github), arXiv 2408.11743, vLLM machete. Mirror `mmf.cu`'s BF16 GEMM structure; Marlin = that + Q4 dequant-on-load + the pipeline/reshuffle.
-
-**Splice before dispatch:** P2's final kernel structure (tile sizes, which types/shapes it handles vs falls back, helper functions it defined).
-
---
-
-## P4 brief - tune to the ceiling
-
-**Goal.** Drive the P3 kernel as close to the ~213 TFLOPS ceiling as empirical tuning allows. **No `ncu` on this box** (no driver perms) - tune by throughput: `test-backend-ops perf` + `llama-bench` + `nsys` (throughput only).
-
-**Do.** Parametrize the kernel (template params / constants) over: tile M/N/K, warps per block, pipeline depth (stages), and occupancy (regs, shared-mem budget). Sweep systematically (a script that rebuilds + benches each config, logs q4_K n=512 TFLOPS + pp512/pp2048 t/s), pick the best, hard-set it (with a short comment on the sweep). Check both prefill shapes (n=512 and n=2048) and confirm decode (n=1) didn't regress (it should still route to mat-vec, not this kernel - verify the gating).
-
-**Acceptance.** Best config maximizes q4_K n=512 TFLOPS (stretch ~150-213) with parity **1103/1103** intact; the sweep table (config -> TFLOPS/t-s) is recorded in the plan's P4 section. Report the chosen config + the final pp512/pp2048 t/s vs the 718/750 baseline and vs vLLM's ~3300 single-stream target.
-
-**Splice before dispatch:** P3's pipeline structure + the perf it reached + which knobs are already fixed vs free.
-
---
-
-## P5 brief - enable + package + (maybe) upstream
-
-**Goal.** Make W4A16 the default dense-Q4 path on Blackwell and ship it through LocalAI.
-
-**Do.**
-1. **Flip the gate:** default-ON for sm_120/121 + Q4_0/Q4_K dense when faster, keep an opt-out env (e.g. `GGML_CUDA_W4A16=0`) as an escape hatch. The existing return-false-on-unhandled-shape path is the correctness safety net; keep it. Verify the default (no env) build now runs W4A16 for dense Q4, gate green, faster than the old MMQ baseline.
-2. **Package as a LocalAI llama.cpp patch:** produce `backend/cpp/llama-cpp/paged/patches/kernel/0002-w4a16-marlin.patch` (the new files + the `ggml-cuda.cu` hook + the gate flip) that applies cleanly to the pinned llama.cpp, mirroring the existing `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`. Confirm LocalAI's `make backends/llama-cpp` build path can consume it (read `.agents/llama-cpp-backend.md` + the build memory: `make -C backend/cpp/llama-cpp clean` before rebuilds).
-3. **Docs:** update `BLACKWELL_KERNEL_GAPS.md` + the plan with the shipped result; add a short note to the LocalAI docs if there's a Blackwell/performance page.
-4. **Upstream decision (do NOT open without surfacing first):** ggml has no Marlin-equivalent (issue #1519) so this is net-new upstream value. Draft (do not submit) an upstream PR description + note the sm_121 build-flag caveats; report it for the user to decide.
-
-**Acceptance.** Default Blackwell build uses W4A16 for dense Q4, parity 1103/1103, measurably faster than MMQ; the patch applies + the LocalAI llama-cpp backend builds with it (verify or, if the full backend build is too heavy, document the exact build command + that the patch applies cleanly). Report the end-to-end LocalAI dense-Q4 prefill number vs the start-of-project 765 t/s.
-
-**Splice before dispatch:** P4's final kernel + config + the measured ceiling reached; the exact enable condition decided.
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cu
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cu
@@ -1,258 +0,0 @@
-#include "marlin-w4a16.cuh"
-#include "mma.cuh"
-
-#include <cstdio>
-#include <cstdlib>
-#include <cuda_bf16.h>
-
-// W4A16 Marlin-style GEMM.
-//
-// In-kernel dequantize Q4 weights -> BF16, multiply against BF16-converted F32
-// activations using mma.sync m16n8k16 BF16 tensor-core ops, accumulate in F32,
-// write F32 output. Handles only the contiguous 2D GEMM (prefill) case for
-// Q4_0 / Q4_K; everything else returns false and falls back to MMQ.
-//
-// ggml MUL_MAT convention: dst[m,n] = sum_k src0[k,m] * src1[k,n].
-//   src0 (weights): ne0=K (contiguous), ne1=M  -> row m is K contiguous quants.
-//   src1 (acts,f32): ne0=K (contiguous), ne1=N -> row n is K contiguous floats.
-//   dst  (f32):      ne0=M (contiguous), ne1=N -> element (m,n) at m + n*M.
-// Both operands are row-major [row][k]; m16n8k16 computes C[m,n] += sum_k A[m,k]*B[n,k].
-//
-// Thread layout: blockDim = (32, WM*WN). threadIdx.x is the warp lane (0..31,
-// required by mma.cuh get_i/get_j), threadIdx.y is the warp index.
-//
-// P3b step 1 - conflict-free shared layout via SKEW PADDING:
-//  - WM*WN warps compute a BM(=WM*FM*16) x BN(=WN*FN*8) output tile; each warp
-//    owns an FM x FN grid of m16n8k16 mma fragments accumulated in F32.
-//  - Per 16-deep k-step the warps cooperatively dequant the BM x 16 Q4 weight
-//    strip + load the BN x 16 f32->bf16 activation strip into shared, then feed
-//    the tensor cores with ldmatrix.x4 (A) / ldmatrix.x2 (B).
-//  - The shared rows are PADDED to SPAD(=12) bf162 instead of the natural 8.
-//    ldmatrix's per-lane address is row*stride; with the natural stride 8 (a
-//    divisor of the 32-bank / 128-byte cycle) rows 0,4,8,12 collide -> 2-way
-//    bank conflict on every fragment load (this is why P3 measured a plain
-//    ldmatrix swap as neutral). Skewing the stride to 12 bf162 (48 bytes, so
-//    ldmatrix's 16-byte alignment holds) makes {r*12 mod 32} hit 8 distinct
-//    bank-quads for r in 0..7, so both halves of ldmatrix.x4 and ldmatrix.x2 are
-//    conflict-free. The pad costs only +50% on the small (~4 KB) staged tile, so
-//    unlike a 128-byte-row XOR swizzle it does NOT collapse occupancy on GB10
-//    (a wide-row swizzle pushed shared to 16 KB and dropped this to ~2.8 TFLOPS).
-//
-// Dead-ends already proven (do not re-try): a double-buffered KSTAGE=64 cp.async
-// pipeline collapsed occupancy (32 KB shared -> 2.7 TFLOPS); a plain ldmatrix on
-// the UNpadded layout was neutral (bank conflicts); a wide-row (BK=64) XOR swizzle
-// was conflict-free but occupancy-starved (16 KB shared -> 2.8 TFLOPS). Skew
-// padding gets the conflict-free feed at near-zero occupancy cost.
-
-using namespace ggml_cuda_mma;
-
-typedef tile<16, 8, nv_bfloat162> tile_A; // 16(M) x 16(K)
-typedef tile< 8, 8, nv_bfloat162> tile_B; //  8(N) x 16(K)
-typedef tile<16, 8, float>        tile_C; // 16(M) x  8(N)
-
-// bf162 columns actually live per shared row (16 k-values = 8 bf162) ...
-#define W4A16_KP   8
-// ... padded to this stride to bank-skew the ldmatrix row addresses.
-#define W4A16_SPAD 12
-
-static bool w4a16_enabled() {
-    static const bool en = (std::getenv("GGML_CUDA_W4A16") != nullptr);
-    return en;
-}
-
-// 6-bit packed scale/min decode for Q4_K (mirrors convert.cu get_scale_min_k4).
-static __device__ __forceinline__ void w4a16_scale_min_k4(int j, const uint8_t * q, uint8_t & d, uint8_t & m) {
-    if (j < 4) {
-        d = q[j] & 63; m = q[j + 4] & 63;
-    } else {
-        d = (q[j+4] & 0xF) | ((q[j-4] >> 6) << 4);
-        m = (q[j+4] >>  4) | ((q[j-0] >> 6) << 4);
-    }
-}
-
-// Dequantize a single Q4_0 weight at column k of a row.
-static __device__ __forceinline__ float w4a16_dq_q4_0(const char * row, int k) {
-    const block_q4_0 * blk = (const block_q4_0 *) row + (k / QK4_0);
-    const int j = k % QK4_0;
-    const float d = __half2float(blk->d);
-    const int q = (j < QK4_0/2) ? (blk->qs[j] & 0xF) : (blk->qs[j - QK4_0/2] >> 4);
-    return (q - 8) * d;
-}
-
-// Dequantize a single Q4_K weight at column k of a row.
-static __device__ __forceinline__ float w4a16_dq_q4_K(const char * row, int k) {
-    const block_q4_K * blk = (const block_q4_K *) row + (k / QK_K);
-    const int e = k % QK_K;
-    const int il     = e / 64;        // 0..3
-    const int within = e % 64;
-    const int half   = within / 32;   // 0..1
-    const int pos    = within % 32;
-    const int ir     = pos / 4;       // 0..7
-    const int l      = pos % 4;       // 0..3
-    const int is     = 2*il + half;
-    const float dall = __low2half (blk->dm);
-    const float dmin = __high2half(blk->dm);
-    uint8_t sc, mn;
-    w4a16_scale_min_k4(is, blk->scales, sc, mn);
-    const float d = dall * sc;
-    const float m = dmin * mn;
-    const uint8_t qb = blk->qs[32*il + 4*ir + l];
-    const int q = (half == 0) ? (qb & 0xF) : (qb >> 4);
-    return d * q - m;
-}
-
-template <bool IS_Q4_K, int WM, int WN, int FM, int FN>
-static __global__ void __launch_bounds__(WM*WN*32, 1)
-w4a16_gemm_kernel(
-        const char * __restrict__ src0,
-        const char * __restrict__ src1,
-        float      * __restrict__ dst,
-        const int M, const int N, const int K,
-        const int64_t nb01, const int64_t nb11, const int64_t dst_ne0) {
-    constexpr int KP   = W4A16_KP;      // 8 bf162 = 16 k per row
-    constexpr int SPAD = W4A16_SPAD;    // padded row stride (bank skew)
-    constexpr int BM  = WM*FM*16;
-    constexpr int BN  = WN*FN*8;
-    constexpr int NTH = WM*WN*32;
-
-    const int m0 = blockIdx.x * BM;
-    const int n0 = blockIdx.y * BN;
-
-    const int warp_id = threadIdx.y;        // 0 .. WM*WN-1
-    const int warp_n  = warp_id % WN;
-    const int warp_m  = warp_id / WN;
-    const int tid     = threadIdx.y*32 + threadIdx.x;
-
-    __shared__ nv_bfloat162 sW[BM*SPAD]; // [m][kpair], padded row stride SPAD
-    __shared__ nv_bfloat162 sB[BN*SPAD]; // [n][kpair], padded row stride SPAD
-
-    tile_C C[FM][FN]; // zero-initialized accumulators
-
-    for (int k0 = 0; k0 < K; k0 += 16) {
-        // Dequantize the BM x 16 weight strip once; reused across the block's BN span.
-        #pragma unroll
-        for (int idx = tid; idx < BM*KP; idx += NTH) {
-            const int m  = idx / KP;
-            const int kk = idx % KP;
-            const int k  = k0 + 2*kk;
-            float w0 = 0.0f, w1 = 0.0f;
-            if (m0 + m < M) {
-                const char * row = src0 + (int64_t)(m0 + m) * nb01;
-                if (IS_Q4_K) { w0 = w4a16_dq_q4_K(row, k); w1 = w4a16_dq_q4_K(row, k + 1); }
-                else         { w0 = w4a16_dq_q4_0(row, k); w1 = w4a16_dq_q4_0(row, k + 1); }
-            }
-            sW[m*SPAD + kk] = __floats2bfloat162_rn(w0, w1);
-        }
-        // Load the BN x 16 activation strip (f32 -> bf16).
-        #pragma unroll
-        for (int idx = tid; idx < BN*KP; idx += NTH) {
-            const int n  = idx / KP;
-            const int kk = idx % KP;
-            const int k  = k0 + 2*kk;
-            float a0 = 0.0f, a1 = 0.0f;
-            if (n0 + n < N) {
-                const float * arow = (const float *)(src1 + (int64_t)(n0 + n) * nb11);
-                a0 = arow[k]; a1 = arow[k + 1];
-            }
-            sB[n*SPAD + kk] = __floats2bfloat162_rn(a0, a1);
-        }
-        __syncthreads();
-
-        tile_A Af[FM];
-        tile_B Bf[FN];
-        #pragma unroll
-        for (int fm = 0; fm < FM; ++fm) {
-            const int mrow = (warp_m*FM + fm) * 16;
-            load_ldmatrix(Af[fm], sW + mrow*SPAD, SPAD);
-        }
-        #pragma unroll
-        for (int fn = 0; fn < FN; ++fn) {
-            const int ncol = (warp_n*FN + fn) * 8;
-            load_ldmatrix(Bf[fn], sB + ncol*SPAD, SPAD);
-        }
-        #pragma unroll
-        for (int fm = 0; fm < FM; ++fm) {
-            #pragma unroll
-            for (int fn = 0; fn < FN; ++fn) {
-                mma(C[fm][fn], Af[fm], Bf[fn]);
-            }
-        }
-        __syncthreads();
-    }
-
-    #pragma unroll
-    for (int fm = 0; fm < FM; ++fm) {
-        #pragma unroll
-        for (int fn = 0; fn < FN; ++fn) {
-            const int mbase = m0 + (warp_m*FM + fm) * 16;
-            const int nbase = n0 + (warp_n*FN + fn) * 8;
-            #pragma unroll
-            for (int l = 0; l < tile_C::ne; ++l) {
-                const int m = mbase + tile_C::get_i(l);
-                const int n = nbase + tile_C::get_j(l);
-                if (m < M && n < N) {
-                    dst[(int64_t)n * dst_ne0 + m] = C[fm][fn].x[l];
-                }
-            }
-        }
-    }
-}
-
-bool ggml_cuda_w4a16_mul_mat(
-        ggml_backend_cuda_context & ctx,
-        const ggml_tensor * src0,
-        const ggml_tensor * src1,
-        ggml_tensor       * dst) {
-    if (!w4a16_enabled()) {
-        return false;
-    }
-    if (src0->type != GGML_TYPE_Q4_0 && src0->type != GGML_TYPE_Q4_K) {
-        return false;
-    }
-    if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
-        return false;
-    }
-    const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
-    if (!GGML_CUDA_CC_IS_NVIDIA(cc) || cc < GGML_CUDA_CC_BLACKWELL) {
-        return false; // consumer Blackwell (sm_120/121) only
-    }
-
-    if (src0->ne[2] != 1 || src0->ne[3] != 1 ||
-        src1->ne[2] != 1 || src1->ne[3] != 1 ||
-        dst->ne[2]  != 1 || dst->ne[3]  != 1) {
-        return false;
-    }
-    if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(src1) || !ggml_is_contiguous(dst)) {
-        return false;
-    }
-
-    const int64_t K = src0->ne[0];
-    const int64_t M = src0->ne[1];
-    const int64_t N = src1->ne[1];
-    if (src1->ne[0] != K || dst->ne[0] != M || dst->ne[1] != N) {
-        return false;
-    }
-    if (K % 16 != 0) {
-        return false;
-    }
-
-    cudaStream_t stream = ctx.stream();
-
-    // Block tile config: WM*WN warps compute BM(=WM*FM*16) x BN(=WN*FN*8).
-    constexpr int WM = 4, WN = 4, FM = 2, FN = 4; // BM=128, BN=128, 16 warps
-    constexpr int BM = WM*FM*16;
-    constexpr int BN = WN*FN*8;
-    const dim3 grid((unsigned)((M + BM - 1) / BM), (unsigned)((N + BN - 1) / BN), 1);
-    const dim3 block(32, WM*WN, 1);
-
-    if (src0->type == GGML_TYPE_Q4_K) {
-        w4a16_gemm_kernel<true, WM, WN, FM, FN><<<grid, block, 0, stream>>>(
-            (const char *) src0->data, (const char *) src1->data, (float *) dst->data,
-            (int) M, (int) N, (int) K, src0->nb[1], src1->nb[1], dst->ne[0]);
-    } else {
-        w4a16_gemm_kernel<false, WM, WN, FM, FN><<<grid, block, 0, stream>>>(
-            (const char *) src0->data, (const char *) src1->data, (float *) dst->data,
-            (int) M, (int) N, (int) K, src0->nb[1], src1->nb[1], dst->ne[0]);
-    }
-    return true;
-}
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cuh
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cuh
@@ -1,14 +0,0 @@
-#pragma once
-
-#include "common.cuh"
-
-// W4A16 Marlin-style BF16 GEMM for NVIDIA Blackwell consumer GPUs (sm_120/121).
-// Dense (non-MoE) 4-bit-weight matmul run on BF16 tensor cores, the path that
-// reaches the GB10 BF16 ceiling where MMQ (int8, Ampere-tuned) and cuBLAS (sm_80
-// fallback) both plateau at ~22% of it. Returns true if it handled the op; false
-// to fall back to MMQ. Gated behind GGML_CUDA_W4A16 until correct + faster.
-bool ggml_cuda_w4a16_mul_mat(
-        ggml_backend_cuda_context & ctx,
-        const ggml_tensor * src0,   // 4-bit weights (Q4_0/Q4_K)
-        const ggml_tensor * src1,   // F32 activations
-        ggml_tensor       * dst);   // F32 output
--- a/backend/cpp/llama-cpp/paged/paged-bench.cpp
+++ b/backend/cpp/llama-cpp/paged/paged-bench.cpp
@@ -1,129 +0,0 @@
-// paged-bench: quantify the multi-tenant wins of paged KV allocation that are
-// properties of the host-side block model (vLLM-parity), independent of the
-// in-model compute path.
-//
-//   Win 1 (capacity):       on-demand block allocation vs contiguous per-seq
-//                           reservation, under a fixed KV block budget.
-//   Win 3 (prefix sharing): automatic cross-tenant prefix dedup via block
-//                           hashing.
-//
-// Win 2 (throughput) is intentionally NOT here: it requires the paged read
-// path wired into llama-graph.cpp (Gate 0). Measuring it at this layer would
-// be dishonest, so it is reported as pending.
-
-#include "paged_kv_manager.h"
-
-#include <cstdio>
-#include <vector>
-#include <numeric>
-
-using namespace paged;
-
-// A deterministic LCG so sequence lengths vary without Math.random-style nondeterminism.
-struct Lcg {
-    uint64_t s;
-    explicit Lcg(uint64_t seed) : s(seed) {}
-    uint32_t next() { s = s * 6364136223846793005ULL + 1442695040888963407ULL; return (uint32_t)(s >> 33); }
-    int range(int lo, int hi) { return lo + (int)(next() % (uint32_t)(hi - lo + 1)); }
-};
-
-static size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
-
-int main() {
-    const int block_size = 16;
-    const int n_ctx      = 2048;   // max context a sequence could use
-    const int num_blocks = 512;    // fixed KV budget: 512 blocks * 16 = 8192 cells
-
-    printf("paged-bench  (block_size=%d, n_ctx=%d, budget=%d blocks = %d cells)\n\n",
-           block_size, n_ctx, num_blocks, num_blocks * block_size);
-
-    // ---------------------------------------------------------------------
-    // WIN 1: concurrency capacity. Sequences have realistic, VARYING lengths
-    // (most short, a few long) - the regime where reserving n_ctx per seq
-    // wastes the most. Count how many fit under the same block budget.
-    // ---------------------------------------------------------------------
-    {
-        Lcg rng(12345);
-        const int blocks_per_ctx = (int) cdiv(n_ctx, block_size); // contiguous reserves this per seq
-
-        // Contiguous (stream-style) reservation: every seq reserves n_ctx worth.
-        int contiguous_fit = num_blocks / blocks_per_ctx;
-
-        // Paged on-demand: draw real lengths until the pool is exhausted.
-        PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
-        int paged_fit = 0;
-        long total_tokens = 0;
-        for (int seq = 0; ; ++seq) {
-            // 80% short (8-128 tok), 20% long (up to n_ctx)
-            int len = (rng.range(0, 99) < 80) ? rng.range(8, 128) : rng.range(128, n_ctx);
-            if (!m.allocate(seq, (size_t) len)) break;
-            paged_fit++;
-            total_tokens += len;
-        }
-
-        printf("WIN 1  concurrency capacity @ %d-block budget\n", num_blocks);
-        printf("  contiguous (reserve n_ctx/seq): %d sequences\n", contiguous_fit);
-        printf("  paged (on-demand blocks):       %d sequences  (avg %ld tok/seq)\n",
-               paged_fit, paged_fit ? total_tokens / paged_fit : 0);
-        printf("  --> paged fits %.1fx more concurrent sequences\n\n",
-               contiguous_fit ? (double) paged_fit / contiguous_fit : 0.0);
-    }
-
-    // ---------------------------------------------------------------------
-    // WIN 3: cross-tenant prefix sharing. N tenants share a long system
-    // prompt / RAG context, then diverge. Compare physical blocks consumed
-    // with prefix caching on vs off.
-    // ---------------------------------------------------------------------
-    {
-        const int n_tenants    = 32;
-        const int shared_len   = 1024;  // shared system prompt (64 blocks)
-        const int distinct_len = 64;    // per-tenant suffix (4 blocks)
-
-        // Shared prefix token ids (identical across tenants -> identical block hashes).
-        std::vector<int> shared(shared_len);
-        for (int i = 0; i < shared_len; ++i) shared[i] = 1000 + i;
-
-        // --- prefix caching OFF: every tenant pays for the whole prefix ---
-        long blocks_off = 0;
-        {
-            PagedKVManager m(num_blocks * 8, block_size, /*enable_caching=*/false);
-            for (int t = 0; t < n_tenants; ++t) {
-                m.allocate(t, (size_t) (shared_len + distinct_len));
-                blocks_off += m.block_table(t).size();
-            }
-        }
-
-        // --- prefix caching ON: shared blocks are deduped to one physical copy ---
-        long blocks_on = 0;
-        {
-            PagedKVManager m(num_blocks * 8, block_size, /*enable_caching=*/true);
-            // tenant 0 fills + caches the shared prefix
-            auto h = m.compute_block_hashes(shared);
-            m.allocate(0, (size_t) (shared_len + distinct_len));
-            m.cache_blocks(0, h, (size_t) shared_len);
-            long physical = m.block_table(0).size();
-            // tenants 1..N-1 hit the cached prefix; only their distinct suffix is new
-            for (int t = 1; t < n_tenants; ++t) {
-                size_t cached_tokens = m.get_computed_blocks(h); // shared blocks reused
-                size_t new_tokens = (shared_len - cached_tokens) + distinct_len;
-                m.allocate(t, (size_t) (shared_len + distinct_len));
-                // modeled count: allocate() above takes fresh blocks for the whole sequence; count only the non-resident ones
-                physical += (long) cdiv(new_tokens, block_size);
-            }
-            blocks_on = physical;
-        }
-
-        printf("WIN 3  cross-tenant prefix sharing (%d tenants, %d-tok shared prefix)\n",
-               n_tenants, shared_len);
-        printf("  prefix-cache OFF: %ld physical blocks\n", blocks_off);
-        printf("  prefix-cache ON:  %ld physical blocks\n", blocks_on);
-        printf("  --> %.1fx less KV memory for the shared workload\n\n",
-               blocks_on ? (double) blocks_off / blocks_on : 0.0);
-    }
-
-    printf("WIN 2  aggregate throughput under load: PENDING\n");
-    printf("  Requires the paged gather-read path wired into llama-graph.cpp\n");
-    printf("  (Gate 0) to measure tok/s vs concurrency. Not measurable at the\n");
-    printf("  allocation layer; not reported here to avoid overclaiming.\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/paged-loadgen.cpp
+++ b/backend/cpp/llama-cpp/paged/paged-loadgen.cpp
@@ -1,169 +0,0 @@
-// paged-loadgen: a dynamic-load benchmark for paged KV that actually exercises the
-// regime where paging wins - variable prompt lengths, variable generation lengths,
-// staggered (continuous) arrival, and a shared system prefix. The stock
-// examples/paged/paged.cpp adds all requests up front with a fixed n_predict from a
-// 20-prompt pool, so it never creates KV-memory pressure or fragmentation and
-// therefore never shows a paged advantage (see PAGED_KV_HIGH_CONCURRENCY.md).
-//
-// Build: drop into PR #22569's examples/paged/ and add to its CMakeLists.txt next to
-// llama-paged (it uses the same llama_paged_scheduler_* API). Run on the TARGET GPU
-// (e.g. 2xH200) where bandwidth lets decode scale to thousands of sequences and KV
-// memory becomes the binding constraint - that is where paged KV pays off and where
-// this harness produces a meaningful number. On a low-bandwidth box (GB10) throughput
-// plateaus long before memory binds, so the win is not observable there regardless.
-//
-// Metrics reported:
-//   - goodput (decode tokens/s aggregate) under the dynamic load
-//   - peak concurrent in-flight sequences actually sustained
-//   - paged peak KV bytes used  vs  the contiguous reservation a unified cache needs
-//     (n_seq_peak * max_ctx), i.e. the capacity ratio = the headroom paging unlocks
-//
-// The capacity ratio is the load-bearing number for the buy decision: it is how many
-// more concurrent tenants a fixed HBM budget serves with paging than without.
-
-#include "common.h"
-#include "llama.h"
-
-#include <cmath>
-#include <cstdio>
-#include <cstdlib>
-#include <random>
-#include <string>
-#include <vector>
-
-// ---- workload knobs (env-overridable so the harness is sweepable without rebuilds) ----
-static int env_int(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }
-
-struct workload_cfg {
-    int    total_requests  = env_int("LG_TOTAL",    2000); // total requests to serve
-    int    target_inflight = env_int("LG_INFLIGHT",  256); // continuous-batching concurrency target
-    int    prefix_tokens   = env_int("LG_PREFIX",    512); // shared system-prompt prefix (prefix-cache target)
-    int    suffix_min      = env_int("LG_SUFMIN",     16); // per-request unique prompt suffix range
-    int    suffix_max      = env_int("LG_SUFMAX",    768);
-    int    gen_short       = env_int("LG_GENSHORT",   32); // bimodal generation: most short...
-    int    gen_long        = env_int("LG_GENLONG",  1024); // ...some long (the over-reservation driver)
-    int    gen_long_pct    = env_int("LG_LONGPCT",    15); // % of requests that are long
-    int    block_size      = env_int("LG_BLOCK",      16); // must match -kvbls
-    unsigned seed          = (unsigned) env_int("LG_SEED", 1234);
-};
-
-// Per-request plan drawn from the workload distribution.
-struct req_plan { int prompt_len; int gen_len; };
-
-int main(int argc, char ** argv) {
-    common_params params;
-    params.n_predict = -1; // per-request, controlled by the plan below
-    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_PAGED)) {
-        fprintf(stderr, "usage: %s -m <model> -kvp --fit off -ngpub N -ncpub M -ngl 99\n", argv[0]);
-        return 1;
-    }
-    params.kv_paged = true;
-
-    common_init_result init = common_init_from_params(params);
-    llama_model *   model = init.model.get();
-    llama_context * ctx   = init.context.get();
-    if (!model || !ctx) { fprintf(stderr, "load failed\n"); return 1; }
-    const llama_vocab * vocab = llama_model_get_vocab(model);
-
-    workload_cfg cfg;
-    std::mt19937 rng(cfg.seed);
-    std::uniform_int_distribution<int> suf(cfg.suffix_min, cfg.suffix_max);
-    std::uniform_int_distribution<int> pct(1, 100);
-
-    // KV bytes/token = 2(K,V) * n_layers * n_head_kv * head_dim * sizeof(f16). Confirmed
-    // against llama-kv-cache-paged.cpp (block_bytes formula). Used for the capacity ratio.
-    const int n_layers   = llama_model_n_layer(model);
-    const int n_head_kv  = llama_model_n_head_kv(model);
-    const int head_dim   = llama_model_n_embd(model) / llama_model_n_head(model);
-    const size_t kv_bytes_per_token = (size_t)2 * n_layers * n_head_kv * head_dim * sizeof(uint16_t);
-
-    // A long shared system prefix that every request reuses (the prefix-cache target).
-    std::vector<llama_token> prefix = common_tokenize(ctx, std::string(cfg.prefix_tokens, 'x'), true);
-
-    // Pre-draw all request plans so paged peak usage and the contiguous reservation are
-    // computed from the SAME workload.
-    std::vector<req_plan> plans(cfg.total_requests);
-    int max_ctx = 0;
-    for (auto & p : plans) {
-        p.prompt_len = cfg.prefix_tokens + suf(rng);
-        p.gen_len    = (pct(rng) <= cfg.gen_long_pct) ? cfg.gen_long : cfg.gen_short;
-        max_ctx      = std::max(max_ctx, p.prompt_len + p.gen_len);
-    }
-
-    llama_paged_scheduler * sched = llama_paged_scheduler_init(ctx);
-    if (!sched) { fprintf(stderr, "scheduler init failed\n"); return 1; }
-
-    // ---- continuous-arrival loop: keep ~target_inflight requests live at all times ----
-    int    next_req = 0, done = 0, inflight = 0, peak_inflight = 0;
-    long   total_decoded = 0;
-    size_t peak_kv_bytes_paged = 0;   // sum over live seqs of ceil(used/block)*block*kv_bytes
-    size_t live_used_tokens = 0;      // running sum of actual KV tokens held by live seqs
-
-    auto admit = [&](int rid) {
-        const req_plan & p = plans[rid];
-        std::vector<llama_token> toks = prefix; // shared prefix...
-        std::vector<llama_token> suff = common_tokenize(ctx, std::string(p.prompt_len - cfg.prefix_tokens, 'y'), false);
-        toks.insert(toks.end(), suff.begin(), suff.end()); // ...+ unique suffix
-        if (llama_paged_scheduler_add_request(sched, toks.data(), toks.size(), rid)) {
-            inflight++; peak_inflight = std::max(peak_inflight, inflight);
-            live_used_tokens += p.prompt_len;
-        }
-    };
-
-    const int64_t t0 = ggml_time_us();
-    for (int i = 0; i < cfg.target_inflight && next_req < cfg.total_requests; ++i) admit(next_req++);
-
-    llama_batch batch = {};
-    std::vector<llama_token> sampled; std::vector<int8_t> stop_flags;
-
-    while (done < cfg.total_requests) {
-        if (!llama_paged_scheduler_prepare_batch(sched, &batch)) break;
-        const llama_paged_batch_info * info = llama_paged_scheduler_get_batch_info(sched);
-        sampled.assign(info->n_seq, 0); stop_flags.assign(info->n_seq, 0);
-
-        // (decode is done inside the scheduler/update path in PR #22569; greedy here)
-        for (int i = 0; i < info->n_seq; ++i) {
-            const int rid = info->seq_ids[i];
-            llama_paged_seq_state st{};
-            llama_paged_scheduler_get_seq_state(sched, rid, &st);
-            // greedy argmax from the i-th row of logits
-            const float * lg = llama_get_logits_ith(ctx, i);
-            int best = 0; float bv = lg[0];
-            for (int t = 1; t < llama_vocab_n_tokens(vocab); ++t) if (lg[t] > bv) { bv = lg[t]; best = t; }
-            sampled[i] = best;
-            const bool stop = llama_vocab_is_eog(vocab, best) || st.n_decoded + 1 >= plans[rid].gen_len;
-            stop_flags[i] = stop ? 1 : 0;
-            if (!stop) { total_decoded++; live_used_tokens++; }
-            if (stop) {
-                done++; inflight--;
-                live_used_tokens -= (plans[rid].prompt_len + st.n_decoded);
-                if (next_req < cfg.total_requests) admit(next_req++); // continuous arrival
-            }
-        }
-        // paged peak KV: real allocation is ceil(used/block) blocks per live seq; this approximates
-        // it by rounding the aggregate live_used_tokens up to one block, a slight underestimate.
-        const size_t paged_now = (size_t)std::ceil((double)live_used_tokens / cfg.block_size)
-                                 * cfg.block_size * kv_bytes_per_token;
-        peak_kv_bytes_paged = std::max(peak_kv_bytes_paged, paged_now);
-
-        llama_paged_scheduler_update(sched, &batch, sampled.data(), stop_flags.data());
-    }
-    const double secs = (ggml_time_us() - t0) / 1e6;
-
-    // Contiguous unified-KV reservation needed to serve the SAME peak concurrency without
-    // mid-generation eviction: every live slot must be backed for the worst-case context.
-    const size_t contig_reserve = (size_t)peak_inflight * max_ctx * kv_bytes_per_token;
-
-    printf("\n==== paged-loadgen ====\n");
-    printf("requests served      : %d  (target inflight %d, peak inflight %d)\n", done, cfg.target_inflight, peak_inflight);
-    printf("goodput (decode)     : %.1f tok/s   (%ld tokens / %.2f s)\n", total_decoded / secs, total_decoded, secs);
-    printf("kv bytes / token     : %zu (n_layer=%d n_head_kv=%d head_dim=%d f16)\n", kv_bytes_per_token, n_layers, n_head_kv, head_dim);
-    printf("paged peak KV        : %.2f GiB (allocated on demand)\n", peak_kv_bytes_paged / 1073741824.0);
-    printf("contiguous reserve   : %.2f GiB (peak_inflight * max_ctx %d)\n", contig_reserve / 1073741824.0, max_ctx);
-    printf("CAPACITY RATIO       : %.2fx  <- tenants-per-HBM paging unlocks\n",
-           peak_kv_bytes_paged ? (double)contig_reserve / peak_kv_bytes_paged : 0.0);
-    printf("  (plus cross-request prefix sharing of the %d-token shared prefix, not counted above)\n", cfg.prefix_tokens);
-
-    llama_paged_scheduler_free(sched);
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/paged_kv_manager.cpp
+++ b/backend/cpp/llama-cpp/paged/paged_kv_manager.cpp
@@ -1,296 +0,0 @@
-#include "paged_kv_manager.h"
-#include <cassert>
-#include <stdexcept>
-
-namespace paged {
-
-// ---------------------------------------------------------------------------
-// FreeBlockQueue  (port of kv_cache_utils.py FreeKVCacheBlockQueue)
-// ---------------------------------------------------------------------------
-
-FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
-    num_free_blocks = blocks.size();
-    for (size_t i = 0; i < blocks.size(); ++i) {
-        if (i > 0)                  blocks[i]->prev_free = blocks[i - 1];
-        if (i + 1 < blocks.size())  blocks[i]->next_free = blocks[i + 1];
-    }
-    if (!blocks.empty()) {
-        fake_head.next_free = blocks.front();
-        blocks.front()->prev_free = &fake_head;
-        fake_tail.prev_free = blocks.back();
-        blocks.back()->next_free = &fake_tail;
-    } else {
-        fake_head.next_free = &fake_tail;
-        fake_tail.prev_free = &fake_head;
-    }
-}
-
-KVCacheBlock* FreeBlockQueue::popleft() {
-    KVCacheBlock* first = fake_head.next_free;
-    if (first == &fake_tail || first == nullptr) {
-        assert(num_free_blocks == 0);
-        throw std::runtime_error("No free blocks available");
-    }
-    fake_head.next_free = first->next_free;
-    first->next_free->prev_free = &fake_head;
-    first->prev_free = first->next_free = nullptr;
-    num_free_blocks--;
-    return first;
-}
-
-std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
-    std::vector<KVCacheBlock*> ret;
-    if (n == 0) return ret;
-    assert(num_free_blocks >= n);
-    num_free_blocks -= n;
-    KVCacheBlock* curr = fake_head.next_free;
-    ret.reserve(n);
-    for (size_t i = 0; i < n; ++i) {
-        assert(curr != nullptr);
-        ret.push_back(curr);
-        KVCacheBlock* last = curr;
-        curr = curr->next_free;
-        last->prev_free = last->next_free = nullptr;
-    }
-    if (curr != nullptr) {
-        fake_head.next_free = curr;
-        curr->prev_free = &fake_head;
-    }
-    return ret;
-}
-
-void FreeBlockQueue::remove(KVCacheBlock* block) {
-    if (!block->prev_free || !block->next_free)
-        throw std::runtime_error("remove() called on an invalid block");
-    block->prev_free->next_free = block->next_free;
-    block->next_free->prev_free = block->prev_free;
-    block->prev_free = block->next_free = nullptr;
-    num_free_blocks--;
-}
-
-void FreeBlockQueue::append(KVCacheBlock* block) {
-    KVCacheBlock* last = fake_tail.prev_free;
-    last->next_free = block;
-    block->prev_free = last;
-    block->next_free = &fake_tail;
-    fake_tail.prev_free = block;
-    num_free_blocks++;
-}
-
-void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
-    if (blocks.empty()) return;
-    KVCacheBlock* last = fake_tail.prev_free;
-    for (KVCacheBlock* b : blocks) {
-        b->prev_free = last;
-        last->next_free = b;
-        last = b;
-    }
-    last->next_free = &fake_tail;
-    fake_tail.prev_free = last;
-    num_free_blocks += blocks.size();
-}
-
-void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
-    if (blocks.empty()) return;
-    KVCacheBlock* first = fake_head.next_free;
-    KVCacheBlock* prev = &fake_head;
-    for (KVCacheBlock* b : blocks) {
-        b->prev_free = prev;
-        prev->next_free = b;
-        prev = b;
-    }
-    prev->next_free = first;
-    first->prev_free = prev;
-    num_free_blocks += blocks.size();
-}
-
-std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
-    std::vector<KVCacheBlock*> ret;
-    const KVCacheBlock* curr = fake_head.next_free;
-    while (curr && curr->next_free != nullptr) {
-        ret.push_back(const_cast<KVCacheBlock*>(curr));
-        curr = curr->next_free;
-    }
-    return ret;
-}
-
-// ---------------------------------------------------------------------------
-// BlockPool  (port of block_pool.py)
-// ---------------------------------------------------------------------------
-
-static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
-    std::vector<KVCacheBlock*> p;
-    p.reserve(v.size());
-    for (auto& b : v) p.push_back(&b);
-    return p;
-}
-
-static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
-    std::vector<KVCacheBlock> v;
-    v.reserve(num_blocks);
-    for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
-    return v;
-}
-
-BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
-    : enable_caching_(enable_caching),
-      blocks_(make_block_vec(num_blocks)),
-      ptrs_(make_ptrs(blocks_)),
-      free_queue_(ptrs_) {
-    // vLLM reserves block_id 0 as the null block (never cached).
-    null_block = free_queue_.popleft();
-    null_block->is_null = true;
-}
-
-bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
-    if (!block->has_hash) return false;
-    auto it = cached_block_hash_to_block_.find(block->block_hash);
-    if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
-    cached_block_hash_to_block_.erase(it);
-    block->reset_hash();
-    return true;
-}
-
-std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
-    if (n > get_num_free_blocks())
-        throw std::runtime_error("Cannot get free blocks from pool");
-    auto ret = free_queue_.popleft_n(n);
-    for (KVCacheBlock* b : ret) {
-        if (enable_caching_) maybe_evict_cached_block(b);
-        assert(b->ref_cnt == 0);
-        b->ref_cnt += 1;
-    }
-    return ret;
-}
-
-KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
-    auto it = cached_block_hash_to_block_.find(block_hash);
-    return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
-}
-
-void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
-    for (KVCacheBlock* b : blocks) {
-        // ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
-        if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
-        b->ref_cnt += 1;
-    }
-}
-
-void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
-    std::vector<KVCacheBlock*> without_hash, with_hash;
-    for (KVCacheBlock* b : ordered_blocks) {
-        if (b->is_null) continue;
-        b->ref_cnt -= 1;
-        if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
-    }
-    free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
-    free_queue_.append_n(with_hash);     // hashed: kept warm (tail)
-}
-
-void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
-                                  size_t num_cached_blocks, size_t num_full_blocks,
-                                  const std::vector<uint64_t>& block_hashes) {
-    for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
-        KVCacheBlock* blk = req_blocks[i];
-        if (blk->has_hash) continue;
-        blk->has_hash = true;
-        blk->block_hash = block_hashes[i];
-        cached_block_hash_to_block_[blk->block_hash] = blk;
-    }
-}
-
-// ---------------------------------------------------------------------------
-// PagedKVManager  (port of SingleTypeKVCacheManager / FullAttentionManager)
-// ---------------------------------------------------------------------------
-
-static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
-
-PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
-    : block_size_(block_size), pool_(num_blocks, enable_caching) {}
-
-bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
-    auto& req = req_to_blocks_[seq_id];
-    size_t need = cdiv(total_tokens, block_size_);
-    if (need <= req.size()) return true;
-    size_t add = need - req.size();
-    if (add > pool_.get_num_free_blocks()) return false; // OOM
-    auto nb = pool_.get_new_blocks(add);
-    req.insert(req.end(), nb.begin(), nb.end());
-    return true;
-}
-
-std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
-    std::vector<int32_t> bt;
-    auto it = req_to_blocks_.find(seq_id);
-    if (it == req_to_blocks_.end()) return bt;
-    bt.reserve(it->second.size());
-    for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
-    return bt;
-}
-
-int64_t PagedKVManager::slot(int seq_id, int pos) const {
-    const auto& req = req_to_blocks_.at(seq_id);
-    int32_t phys = req[pos / block_size_]->block_id;
-    return (int64_t)phys * block_size_ + (pos % block_size_);
-}
-
-std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
-    std::vector<int64_t> sm;
-    sm.reserve(positions.size());
-    for (int p : positions) sm.push_back(slot(seq_id, p));
-    return sm;
-}
-
-void PagedKVManager::free(int seq_id) {
-    auto it = req_to_blocks_.find(seq_id);
-    if (it == req_to_blocks_.end()) return;
-    // Free in reverse so the tail of the block chain is evicted first (vLLM order).
-    std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
-    pool_.free_blocks(ordered);
-    req_to_blocks_.erase(it);
-}
-
-// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
-// hash into the seed so each block hash transitively encodes its whole prefix
-// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
-uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
-    uint64_t h = 1469598103934665603ull ^ parent_hash;
-    for (int t : token_ids) {
-        h ^= (uint64_t)(uint32_t)t;
-        h *= 1099511628211ull;
-    }
-    if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
-    return h;
-}
-
-std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
-    std::vector<uint64_t> hashes;
-    uint64_t parent = 0; // NONE_HASH analogue
-    size_t n_full = token_ids.size() / block_size_;
-    for (size_t i = 0; i < n_full; ++i) {
-        std::vector<int> blk(token_ids.begin() + i * block_size_,
-                             token_ids.begin() + (i + 1) * block_size_);
-        parent = hash_block(parent, blk);
-        hashes.push_back(parent);
-    }
-    return hashes;
-}
-
-size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
-    std::vector<KVCacheBlock*> hits;
-    for (uint64_t bh : block_hashes) {        // stop at first miss (prefix property)
-        KVCacheBlock* cb = pool_.get_cached_block(bh);
-        if (!cb) break;
-        hits.push_back(cb);
-    }
-    pool_.touch(hits);                        // ++ref_cnt, pull from free list
-    return hits.size() * (size_t)block_size_;
-}
-
-void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
-    auto& req = req_to_blocks_[seq_id];
-    size_t n_full = num_tokens / block_size_;
-    pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
-}
-
-} // namespace paged
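Reviewer note on the removed prefix-hash logic: the sketch below restates the chained FNV-1a scheme from `hash_block` / `compute_block_hashes` as a standalone unit, to show why a lookup can stop at the first miss. The helper names `chain_hashes` and `shared_prefix_blocks` are hypothetical and do not appear in the removed file; the constants are copied from the diff above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Standalone copy of the chained FNV-1a block hash shown in the diff above.
static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
    uint64_t h = 1469598103934665603ull ^ parent_hash;  // fold the parent into the seed
    for (int t : token_ids) {
        h ^= (uint64_t)(uint32_t)t;
        h *= 1099511628211ull;
    }
    if (h == 0) h = 0x9e3779b97f4a7c15ull;               // keep 0 free as a sentinel
    return h;
}

// Hypothetical helper: hash every FULL block, chaining each hash into the next.
static std::vector<uint64_t> chain_hashes(const std::vector<int>& tokens, size_t block_size) {
    assert(block_size > 0);
    std::vector<uint64_t> out;
    uint64_t parent = 0;
    for (size_t i = 0; i + block_size <= tokens.size(); i += block_size) {
        const auto first = tokens.begin() + static_cast<std::ptrdiff_t>(i);
        std::vector<int> blk(first, first + static_cast<std::ptrdiff_t>(block_size));
        parent = hash_block(parent, blk);
        out.push_back(parent);
    }
    return out;
}

// Hypothetical helper: number of leading blocks whose chained hashes agree.
static size_t shared_prefix_blocks(const std::vector<uint64_t>& a,
                                   const std::vector<uint64_t>& b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}
```

Because each hash folds in its parent, changing one token in block 1 changes the hashes of block 1 and every later block, while block 0 still matches. That is the property `get_computed_blocks` relies on when it breaks out at the first uncached hash.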
--- a/backend/cpp/llama-cpp/paged/paged_kv_manager.h
+++ b/backend/cpp/llama-cpp/paged/paged_kv_manager.h
@@ -1,108 +0,0 @@
-#pragma once
-// Paged KV cache block manager for llama.cpp (CPU-first prototype).
-//
-// Host-side block management is a faithful port of vLLM V1:
-//   vllm/v1/core/kv_cache_utils.py            (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
-//   vllm/v1/core/block_pool.py                (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
-//   vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
-//
-// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
-// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
-// dependency so it can be unit-tested in isolation.
-
-#include <cstdint>
-#include <vector>
-#include <unordered_map>
-#include <map>
-
-namespace paged {
-
-// vLLM KVCacheBlock (kv_cache_utils.py).
-struct KVCacheBlock {
-    int32_t  block_id   = 0;
-    int      ref_cnt    = 0;
-    bool     has_hash   = false;   // vLLM: _block_hash is set only when full+cached
-    uint64_t block_hash = 0;
-    bool     is_null    = false;
-    KVCacheBlock* prev_free = nullptr;
-    KVCacheBlock* next_free = nullptr;
-
-    explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
-    void reset_hash() { has_hash = false; block_hash = 0; }
-};
-
-// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
-// O(1) middle removal is required so touch() can pull a warm cached block out of the
-// free list when a later request hits its prefix.
-class FreeBlockQueue {
-public:
-    size_t num_free_blocks = 0;
-
-    explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
-    KVCacheBlock* popleft();
-    std::vector<KVCacheBlock*> popleft_n(size_t n);
-    void remove(KVCacheBlock* block);
-    void append(KVCacheBlock* block);
-    void append_n(const std::vector<KVCacheBlock*>& blocks);
-    void prepend_n(const std::vector<KVCacheBlock*>& blocks);
-    std::vector<KVCacheBlock*> get_all_free_blocks() const;
-
-private:
-    KVCacheBlock fake_head{-1};
-    KVCacheBlock fake_tail{-1};
-};
-
-// vLLM BlockPool (block_pool.py).
-class BlockPool {
-public:
-    KVCacheBlock* null_block = nullptr;
-
-    BlockPool(int32_t num_blocks, bool enable_caching);
-    std::vector<KVCacheBlock*> get_new_blocks(size_t n);
-    KVCacheBlock* get_cached_block(uint64_t block_hash);
-    void touch(const std::vector<KVCacheBlock*>& blocks);
-    void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
-    void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
-                           size_t num_cached_blocks, size_t num_full_blocks,
-                           const std::vector<uint64_t>& block_hashes);
-    size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
-
-private:
-    bool maybe_evict_cached_block(KVCacheBlock* block);
-
-    bool enable_caching_;
-    std::vector<KVCacheBlock> blocks_;     // owns all block descriptors
-    std::vector<KVCacheBlock*> ptrs_;
-    FreeBlockQueue free_queue_;
-    // vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
-    // prototype keeps the last writer (single KV-cache group is sufficient for the wins).
-    std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
-};
-
-// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
-// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
-class PagedKVManager {
-public:
-    PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
-
-    // Grow seq_id to cover total_tokens slots. Returns false on OOM (fewer free blocks than needed).
-    bool allocate(int seq_id, size_t total_tokens);
-    std::vector<int32_t> block_table(int seq_id) const;
-    int64_t slot(int seq_id, int pos) const;
-    std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
-    void free(int seq_id);
-    int block_size() const { return block_size_; }
-
-    // Prefix caching (win 3).
-    static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
-    std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
-    size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
-    void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
-
-protected:
-    int block_size_;
-    BlockPool pool_;
-    std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
-};
-
-} // namespace paged
--- a/backend/cpp/llama-cpp/paged/patches/0001-paged-kv-block-placement.patch
+++ b/backend/cpp/llama-cpp/paged/patches/0001-paged-kv-block-placement.patch
@@ -1,59 +0,0 @@
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index a49a055a6..d95102bbd 100644
--- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -11,6 +11,8 @@
- #include <cstring>
- #include <limits>
- #include <map>
-+#include <numeric>
-+#include <cstdlib>
- #include <stdexcept>
- 
- static bool ggml_is_power_of_2(int n) {
-@@ -931,6 +933,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
-             return { };
-         }
- 
-+        // [paged, experimental] Place this sequence's tokens at permuted,
-+        // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
-+        // This validates that attention is invariant to physical KV placement -
-+        // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
-+        // Single-sequence scope (uses get_used() as the logical base); falls back
-+        // to the normal allocator if the permuted cells aren't available.
-+        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
-+        if (paged_mode) {
-+            const uint32_t bs   = 16;                 // block size (tokens/block)
-+            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
-+            if (nblk >= 2) {
-+                // stride coprime to nblk => block-index permutation is a bijection
-+                uint32_t k = 1;
-+                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
-+                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
-+                }
-+                const uint32_t base = cells.get_used();
-+                bool ok = true;
-+                for (uint32_t i = 0; i < n_tokens; ++i) {
-+                    const uint32_t L    = base + i;
-+                    const uint32_t b    = L / bs;
-+                    const uint32_t off  = L % bs;
-+                    if (b >= nblk) { ok = false; break; }
-+                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
-+                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
-+                    res.idxs[s].push_back(phys);
-+                }
-+                if (ok && res.idxs[s].size() == n_tokens) {
-+                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
-+                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
-+                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
-+                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
-+                    }
-+                    continue; // paged placement succeeded for this sequence
-+                }
-+                res.idxs[s].clear(); // fall back to the normal allocator
-+            }
-+        }
-+
-         uint32_t n_tested = 0;
- 
-         // for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
--- a/backend/cpp/llama-cpp/paged/patches/0002-paged-e2e-disable-broken-autofit.patch
+++ b/backend/cpp/llama-cpp/paged/patches/0002-paged-e2e-disable-broken-autofit.patch
@@ -1,12 +0,0 @@
-diff --git a/tests/test-paged-kv-e2e.cpp b/tests/test-paged-kv-e2e.cpp
-index 5a352e3..06ead50 100644
---- a/tests/test-paged-kv-e2e.cpp
-+++ b/tests/test-paged-kv-e2e.cpp
-@@ -115,6 +115,7 @@ static path_result run_paged(const std::string & model_path) {
-     params.sampling.temp = 0.0f;  // greedy
-     params.warmup        = false;
-     params.kv_paged      = true;
-+    params.fit_params    = false;  // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
-     params.n_gpu_blocks  = 64;
-     params.n_cpu_blocks  = 16;
-     params.n_sequences   = 1;
--- a/backend/cpp/llama-cpp/paged/tests/test_block_pool.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_block_pool.cpp
@@ -1,42 +0,0 @@
-#include "../paged_kv_manager.h"
-#include <cassert>
-#include <cstdio>
-using namespace paged;
-
-int main() {
-    BlockPool pool(/*num_blocks=*/8, /*enable_caching=*/true);
-    // block 0 is reserved as null_block (vLLM pops one at init)
-    assert(pool.null_block != nullptr && pool.null_block->block_id == 0);
-    assert(pool.get_num_free_blocks() == 7);
-
-    // get_new_blocks sets ref_cnt=1 and removes from free list
-    auto b = pool.get_new_blocks(2);
-    assert(b.size() == 2 && b[0]->ref_cnt == 1 && b[1]->ref_cnt == 1);
-    assert(pool.get_num_free_blocks() == 5);
-
-    // cache two full blocks with chained hashes, then look them up
-    std::vector<uint64_t> hashes = {1111, 2222};
-    pool.cache_full_blocks(b, /*num_cached=*/0, /*num_full=*/2, hashes);
-    assert(b[0]->has_hash && b[0]->block_hash == 1111);
-    assert(pool.get_cached_block(1111) == b[0]);
-    assert(pool.get_cached_block(2222) == b[1]);
-    assert(pool.get_cached_block(9999) == nullptr);
-
-    // free: hashed blocks go to tail (kept warm), so they remain queryable.
-    pool.free_blocks(b);
-    assert(b[0]->ref_cnt == 0);
-    assert(pool.get_num_free_blocks() == 7);
-    assert(pool.get_cached_block(1111) == b[0]); // still cached/warm
-
-    // touch a warm cached block: pulls it out of free list, ++ref_cnt
-    pool.touch({b[0]});
-    assert(b[0]->ref_cnt == 1);
-    assert(pool.get_num_free_blocks() == 6);
-
-    // exhausting the pool then allocating evicts a warm cached hash
-    auto rest = pool.get_new_blocks(pool.get_num_free_blocks());
-    (void) rest;
-    assert(pool.get_cached_block(2222) == nullptr); // evicted on reuse
-    printf("test_block_pool: OK\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/tests/test_free_block_queue.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_free_block_queue.cpp
@@ -1,44 +0,0 @@
-#include "../paged_kv_manager.h"
-#include <cassert>
-#include <cstdio>
-#include <vector>
-
-using namespace paged;
-
-static std::vector<KVCacheBlock> make_blocks(int n) {
-    std::vector<KVCacheBlock> v;
-    v.reserve(n);
-    for (int i = 0; i < n; ++i) v.push_back(KVCacheBlock{i});
-    return v;
-}
-
-int main() {
-    // ordered 0..9 at init; popleft yields ascending block_ids
-    auto blocks = make_blocks(10);
-    std::vector<KVCacheBlock*> ptrs;
-    for (auto& b : blocks) ptrs.push_back(&b);
-    FreeBlockQueue q(ptrs);
-    assert(q.num_free_blocks == 10);
-
-    KVCacheBlock* b0 = q.popleft();
-    assert(b0->block_id == 0);
-    assert(q.num_free_blocks == 9);
-
-    auto two = q.popleft_n(2);            // {1,2}
-    assert(two.size() == 2 && two[0]->block_id == 1 && two[1]->block_id == 2);
-    assert(q.num_free_blocks == 7);
-
-    // O(1) middle removal: remove block 5 (currently free), count drops
-    q.remove(ptrs[5]);
-    assert(q.num_free_blocks == 6);       // free: 3,4,6,7,8,9
-
-    // append puts a block at the tail; it comes back out only after the rest
-    q.append(b0);                          // free order now: 3,4,6,7,8,9,0
-    assert(q.num_free_blocks == 7);
-    auto all = q.get_all_free_blocks();
-    assert(all.front()->block_id == 3);
-    assert(all.back()->block_id == 0);
-
-    printf("test_free_block_queue: OK\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_attn.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_attn.cpp
@@ -1,133 +0,0 @@
-// Phase 2 (core numeric de-risk): attention over GATHERED paged KV must equal
-// an independent host-computed reference.
-//
-// This answers the central risk in the design: feeding gather-to-scratch KV
-// (a sequence whose blocks are non-contiguous in the shared pool) into ggml's
-// standard attention ops (mul_mat -> soft_max_ext -> mul_mat) produces correct
-// attention. If this holds, the paged read path is numerically sound; the
-// remaining work is wiring it into llama-graph.cpp (Gate 0 in a real model).
-
-#include "../paged_kv_manager.h"
-
-#include "ggml.h"
-#include "ggml-cpu.h"
-#include "ggml-alloc.h"
-#include "ggml-backend.h"
-
-#include <cassert>
-#include <cstdio>
-#include <cmath>
-#include <vector>
-
-using namespace paged;
-
-int main() {
-    const int d          = 8;     // head dim
-    const int n_kv       = 48;    // 3 blocks worth of KV tokens
-    const int n_q        = 4;     // query tokens
-    const int block_size = 16;
-    const int num_blocks = 8;
-    const int total_slots = block_size * num_blocks;
-    const float scale = 1.0f / std::sqrt((float) d);
-
-    // Non-contiguous physical layout for the KV sequence (blocks [2,1,5]).
-    PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
-    assert(m.allocate(0, 2 * block_size));
-    assert(m.allocate(1, 2 * block_size));
-    m.free(0);
-    assert(m.allocate(2, n_kv));
-    std::vector<int> positions(n_kv);
-    for (int i = 0; i < n_kv; ++i) positions[i] = i;
-    auto slots64 = m.slot_mapping(2, positions);
-    std::vector<int32_t> slots32(slots64.begin(), slots64.end());
-
-    // Deterministic K, V, Q in logical [d, n] layout (column-major: col = token).
-    std::vector<float> K(d * n_kv), V(d * n_kv), Q(d * n_q);
-    for (int t = 0; t < n_kv; ++t)
-        for (int e = 0; e < d; ++e) {
-            K[t * d + e] = std::sin(0.1f * t + 0.3f * e);
-            V[t * d + e] = std::cos(0.2f * t - 0.1f * e);
-        }
-    for (int q = 0; q < n_q; ++q)
-        for (int e = 0; e < d; ++e) Q[q * d + e] = std::sin(0.05f * q + 0.7f * e);
-
-    // ---- Independent host reference attention -------------------------------
-    std::vector<float> ref(d * n_q, 0.0f);
-    for (int q = 0; q < n_q; ++q) {
-        std::vector<float> score(n_kv);
-        float mx = -1e30f;
-        for (int t = 0; t < n_kv; ++t) {
-            float dot = 0.0f;
-            for (int e = 0; e < d; ++e) dot += K[t * d + e] * Q[q * d + e];
-            score[t] = dot * scale;
-            mx = std::fmax(mx, score[t]);
-        }
-        float sum = 0.0f;
-        for (int t = 0; t < n_kv; ++t) { score[t] = std::exp(score[t] - mx); sum += score[t]; }
-        for (int t = 0; t < n_kv; ++t) {
-            float p = score[t] / sum;
-            for (int e = 0; e < d; ++e) ref[q * d + e] += p * V[t * d + e];
-        }
-    }
-
-    // ---- ggml paged path ----------------------------------------------------
-    ggml_backend_t backend = ggml_backend_cpu_init();
-    struct ggml_init_params dp = { ggml_tensor_overhead() * 16, NULL, true };
-    struct ggml_context * ctx_data = ggml_init(dp);
-
-    struct ggml_tensor * poolK = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, total_slots);
-    struct ggml_tensor * poolV = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, total_slots);
-    struct ggml_tensor * kSrc  = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_kv);
-    struct ggml_tensor * vSrc  = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_kv);
-    struct ggml_tensor * qT    = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_q);
-    struct ggml_tensor * wIdx  = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, n_kv);
-    struct ggml_tensor * gIdx  = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I32, n_kv);
-
-    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_data, backend);
-    std::vector<float> zeros(d * total_slots, 0.0f);
-    ggml_backend_tensor_set(poolK, zeros.data(), 0, ggml_nbytes(poolK));
-    ggml_backend_tensor_set(poolV, zeros.data(), 0, ggml_nbytes(poolV));
-    ggml_backend_tensor_set(kSrc, K.data(), 0, ggml_nbytes(kSrc));
-    ggml_backend_tensor_set(vSrc, V.data(), 0, ggml_nbytes(vSrc));
-    ggml_backend_tensor_set(qT,   Q.data(), 0, ggml_nbytes(qT));
-    ggml_backend_tensor_set(wIdx, slots64.data(), 0, ggml_nbytes(wIdx));
-    ggml_backend_tensor_set(gIdx, slots32.data(), 0, ggml_nbytes(gIdx));
-
-    struct ggml_init_params cp = { ggml_tensor_overhead() * 64 + ggml_graph_overhead(), NULL, true };
-    struct ggml_context * ctx = ggml_init(cp);
-
-    struct ggml_tensor * wroteK = ggml_set_rows(ctx, poolK, kSrc, wIdx);
-    struct ggml_tensor * wroteV = ggml_set_rows(ctx, poolV, vSrc, wIdx);
-    struct ggml_tensor * gK = ggml_get_rows(ctx, wroteK, gIdx);          // [d, n_kv]
-    struct ggml_tensor * gV = ggml_get_rows(ctx, wroteV, gIdx);          // [d, n_kv]
-
-    struct ggml_tensor * kq    = ggml_mul_mat(ctx, gK, qT);              // [n_kv, n_q]
-    struct ggml_tensor * probs = ggml_soft_max_ext(ctx, kq, NULL, scale, 0.0f);
-    struct ggml_tensor * vT    = ggml_cont(ctx, ggml_transpose(ctx, gV)); // [n_kv, d]
-    struct ggml_tensor * out   = ggml_mul_mat(ctx, vT, probs);           // [d, n_q]
-    ggml_set_output(out);
-
-    struct ggml_cgraph * gf = ggml_new_graph(ctx);
-    ggml_build_forward_expand(gf, out);
-    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
-    assert(ggml_gallocr_alloc_graph(galloc, gf));
-    assert(ggml_backend_graph_compute(backend, gf) == GGML_STATUS_SUCCESS);
-
-    std::vector<float> got(d * n_q);
-    ggml_backend_tensor_get(out, got.data(), 0, ggml_nbytes(out));
-
-    // ---- compare ------------------------------------------------------------
-    double max_err = 0.0;
-    for (int i = 0; i < d * n_q; ++i) max_err = std::fmax(max_err, std::fabs(got[i] - ref[i]));
-    printf("paged attention max abs err vs host reference: %.3e\n", max_err);
-    assert(max_err < 1e-4 && "paged-gathered attention must match host reference");
-
-    ggml_gallocr_free(galloc);
-    ggml_free(ctx);
-    ggml_free(ctx_data);
-    ggml_backend_buffer_free(buf);
-    ggml_backend_free(backend);
-
-    printf("test_ggml_paged_attn: OK (attention over non-contiguous paged KV matches reference)\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_rw.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_rw.cpp
@@ -1,142 +0,0 @@
-// Phase 1 integration test: prove the paged KV write+read MECHANISM at the
-// ggml-op level, driven by PagedKVManager.
-//
-//   write:  ggml_set_rows(pool, k_src, slot_mapping)   // scatter by slot
-//   read:   ggml_get_rows(pool, gather_idx)            // gather seq's slots
-//
-// The decisive property: a sequence's physical blocks are NON-CONTIGUOUS and
-// OUT-OF-ORDER (forced via allocate/free/reallocate), yet gather(write(x)) == x,
-// and a second sequence written into disjoint blocks does not contaminate it.
-// This is exactly how a paged read path feeds contiguous scratch to attention.
-
-#include "../paged_kv_manager.h"
-
-#include "ggml.h"
-#include "ggml-cpu.h"
-#include "ggml-alloc.h"
-#include "ggml-backend.h"
-
-#include <cassert>
-#include <cstdio>
-#include <cmath>
-#include <vector>
-
-using namespace paged;
-
-int main() {
-    const int n_embd      = 8;
-    const int block_size  = 16;
-    const int num_blocks  = 8;                       // block 0 reserved as null
-    const int total_slots = block_size * num_blocks; // 128
-
-    // --- Force a non-contiguous, out-of-order block layout for seqC ----------
-    PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
-    assert(m.allocate(/*seqA=*/0, 2 * block_size)); // blocks {1,2}
-    assert(m.allocate(/*seqB=*/1, 2 * block_size)); // blocks {3,4}
-    m.free(0);                                       // returns {1,2} to free list
-    assert(m.allocate(/*seqC=*/2, 3 * block_size));  // reuses freed blocks, reordered
-
-    auto btC = m.block_table(2);
-    auto btB = m.block_table(1);
-    printf("seqC block_table = [");
-    for (size_t i = 0; i < btC.size(); ++i) printf("%s%d", i ? "," : "", btC[i]);
-    printf("]\n");
-    assert(btC.size() == 3);
-    // sanity: seqC and seqB occupy disjoint physical blocks
-    for (int cb : btC) for (int bb : btB) assert(cb != bb);
-
-    const int n_tokens = 3 * block_size; // 48 tokens for seqC
-
-    // slot_mapping for seqC positions 0..n_tokens-1
-    std::vector<int> positions(n_tokens);
-    for (int i = 0; i < n_tokens; ++i) positions[i] = i;
-    std::vector<int64_t> slots64 = m.slot_mapping(2, positions); // I64 for set_rows
-    std::vector<int32_t> slots32(slots64.begin(), slots64.end()); // I32 for get_rows
-
-    // seqB occupies different blocks; write a sentinel there to prove isolation.
-    std::vector<int> posB(2 * block_size);
-    for (size_t i = 0; i < posB.size(); ++i) posB[i] = (int) i;
-    std::vector<int64_t> slotsB64 = m.slot_mapping(1, posB);
-
-    // --- ggml backend + persistent (statically allocated) tensors ------------
-    ggml_backend_t backend = ggml_backend_cpu_init();
-    assert(backend);
-
-    struct ggml_init_params dp = { /*mem_size=*/ ggml_tensor_overhead() * 16,
-                                   /*mem_buffer=*/ NULL, /*no_alloc=*/ true };
-    struct ggml_context * ctx_data = ggml_init(dp);
-
-    // The shared paged KV pool: one flat block pool, exactly like a paged layer.
-    struct ggml_tensor * pool    = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, total_slots);
-    struct ggml_tensor * k_src   = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, n_tokens);
-    struct ggml_tensor * w_idx   = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, n_tokens);
-    struct ggml_tensor * g_idx   = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I32, n_tokens);
-    struct ggml_tensor * kB_src  = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, (int) posB.size());
-    struct ggml_tensor * wB_idx  = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, (int) posB.size());
-
-    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_data, backend);
-    assert(buf);
-
-    // pool starts zeroed
-    std::vector<float> zeros(n_embd * total_slots, 0.0f);
-    ggml_backend_tensor_set(pool, zeros.data(), 0, ggml_nbytes(pool));
-
-    // token t carries the value (float) t in every embedding lane -> easy to verify
-    std::vector<float> ksrc(n_embd * n_tokens);
-    for (int t = 0; t < n_tokens; ++t)
-        for (int e = 0; e < n_embd; ++e) ksrc[t * n_embd + e] = (float) t;
-    ggml_backend_tensor_set(k_src, ksrc.data(), 0, ggml_nbytes(k_src));
-    ggml_backend_tensor_set(w_idx, slots64.data(), 0, ggml_nbytes(w_idx));
-    ggml_backend_tensor_set(g_idx, slots32.data(), 0, ggml_nbytes(g_idx));
-
-    // seqB sentinel = 999 everywhere
-    std::vector<float> kBsrc(n_embd * posB.size(), 999.0f);
-    ggml_backend_tensor_set(kB_src, kBsrc.data(), 0, ggml_nbytes(kB_src));
-    ggml_backend_tensor_set(wB_idx, slotsB64.data(), 0, ggml_nbytes(wB_idx));
-
-    // --- compute graph: write seqB, write seqC, then gather seqC -------------
-    struct ggml_init_params cp = { /*mem_size=*/ ggml_tensor_overhead() * 32 + ggml_graph_overhead(),
-                                   /*mem_buffer=*/ NULL, /*no_alloc=*/ true };
-    struct ggml_context * ctx = ggml_init(cp);
-
-    struct ggml_tensor * wroteB = ggml_set_rows(ctx, pool,   kB_src, wB_idx); // view(pool)
-    struct ggml_tensor * wroteC = ggml_set_rows(ctx, wroteB, k_src,  w_idx);  // chain so order is fixed
-    struct ggml_tensor * gathered = ggml_get_rows(ctx, wroteC, g_idx);
-    ggml_set_output(gathered);
-
-    struct ggml_cgraph * gf = ggml_new_graph(ctx);
-    ggml_build_forward_expand(gf, gathered);
-
-    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
-    assert(ggml_gallocr_alloc_graph(galloc, gf));
-
-    assert(ggml_backend_graph_compute(backend, gf) == GGML_STATUS_SUCCESS);
-
-    // --- verify gather(write(x)) == x for the non-contiguous sequence --------
-    std::vector<float> out(n_embd * n_tokens);
-    ggml_backend_tensor_get(gathered, out.data(), 0, ggml_nbytes(gathered));
-
-    int mism = 0;
-    for (int t = 0; t < n_tokens; ++t)
-        for (int e = 0; e < n_embd; ++e)
-            if (std::fabs(out[t * n_embd + e] - (float) t) > 1e-6f) mism++;
-    assert(mism == 0 && "gathered paged KV must equal source (round-trip)");
-
-    // --- verify isolation: read seqC slots directly from pool, unaffected by seqB
-    std::vector<float> pool_host(n_embd * total_slots);
-    ggml_backend_tensor_get(pool, pool_host.data(), 0, ggml_nbytes(pool));
-    for (int t = 0; t < n_tokens; ++t) {
-        int slot = (int) slots64[t];
-        for (int e = 0; e < n_embd; ++e)
-            assert(std::fabs(pool_host[slot * n_embd + e] - (float) t) < 1e-6f);
-    }
-
-    ggml_gallocr_free(galloc);
-    ggml_free(ctx);
-    ggml_free(ctx_data);
-    ggml_backend_buffer_free(buf);
-    ggml_backend_free(backend);
-
-    printf("test_ggml_paged_rw: OK (non-contiguous paged write/gather round-trip)\n");
-    return 0;
-}
--- a/backend/cpp/llama-cpp/paged/tests/test_paged_kv_manager.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_paged_kv_manager.cpp
@@ -1,32 +0,0 @@
-#include "../paged_kv_manager.h"
-#include <cassert>
-#include <cstdio>
-using namespace paged;
-
-int main() {
-    PagedKVManager m(/*num_blocks=*/8, /*block_size=*/16, /*enable_caching=*/false);
-    // 20 tokens -> ceil(20/16)=2 blocks
-    assert(m.allocate(/*seq=*/0, 20));
-    auto bt = m.block_table(0);
-    assert(bt.size() == 2);
-
-    // slot arithmetic: pos 0 -> block bt[0]*16 + 0 ; pos 17 -> bt[1]*16 + 1
-    assert(m.slot(0, 0)  == (int64_t)bt[0] * 16 + 0);
-    assert(m.slot(0, 17) == (int64_t)bt[1] * 16 + 1);
-
-    auto sm = m.slot_mapping(0, {0, 16, 17});
-    assert(sm.size() == 3 && sm[1] == (int64_t)bt[1] * 16 + 0);
-
-    // growing the same seq reuses existing blocks, adds only new ones
-    assert(m.allocate(0, 40)); // ceil(40/16)=3 -> +1 block
-    assert(m.block_table(0).size() == 3);
-
-    // OOM: blocks left = 8 - 1(null) - 3 = 4 blocks; ask for 5 blocks
-    assert(m.allocate(1, 5 * 16) == false);
-
-    // free returns blocks to the pool for reuse
-    m.free(0);
-    assert(m.allocate(1, 5 * 16)); // now fits
-    printf("test_paged_kv_manager: OK\n");
-    return 0;
-}
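The slot arithmetic exercised by the removed test above can be sketched on its own: a logical position picks an entry in the sequence's block table, and the flat slot is that block's base plus the offset inside the block. The functions `blocks_needed` and `slot_for` below are hypothetical stand-ins, not the `PagedKVManager` API, which is not shown in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of blocks required to hold n_tokens (ceiling division).
int blocks_needed(int n_tokens, int block_size) {
    assert(block_size > 0 && n_tokens >= 0);
    return (n_tokens + block_size - 1) / block_size;
}

// Flat slot in the shared pool for logical position pos of a sequence
// whose physical blocks are listed, in logical order, in block_table.
int64_t slot_for(const std::vector<int>& block_table, int block_size, int pos) {
    assert(block_size > 0 && pos >= 0);
    const std::size_t logical_block = static_cast<std::size_t>(pos / block_size);
    assert(logical_block < block_table.size());
    return static_cast<int64_t>(block_table[logical_block]) * block_size + pos % block_size;
}
```

For a block table `{2, 1, 5}` with 16-token blocks, position 17 falls in logical block 1 (physical block 1) at offset 1, giving slot 17, while position 0 maps to slot 32 because its physical block is 2.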
--- a/backend/cpp/llama-cpp/paged/tests/test_prefix_cache.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_prefix_cache.cpp
@@ -1,35 +0,0 @@
-#include "../paged_kv_manager.h"
-#include <cassert>
-#include <cstdio>
-#include <vector>
-using namespace paged;
-
-int main() {
-    PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*enable_caching=*/true);
-
-    // shared prefix of 32 tokens (2 full blocks) + distinct suffix
-    std::vector<int> shared(32);
-    for (int i = 0; i < 32; ++i) shared[i] = 100 + i;
-
-    // chained hashing is deterministic and prefix-sensitive
-    auto h = m.compute_block_hashes(shared);
-    assert(h.size() == 2);
-    auto h2 = m.compute_block_hashes(shared);
-    assert(h == h2);                          // deterministic
-    std::vector<int> other = shared; other[0] = 999;
-    assert(m.compute_block_hashes(other)[0] != h[0]); // sensitive to content
-
-    // seq 0: cold, no cache hit yet
-    assert(m.get_computed_blocks(h) == 0);
-    assert(m.allocate(0, 32));
-    m.cache_blocks(0, h, 32);
-
-    // seq 1: warm — the 2 shared blocks are a cache hit (32 tokens)
-    assert(m.get_computed_blocks(h) == 32);
-
-    // first-miss stop: a chain that diverges after block 1 hits only 1 block
-    auto hmix = h; hmix[1] = 0xDEADBEEF;
-    assert(m.get_computed_blocks(hmix) == 16);
-    printf("test_prefix_cache: OK\n");
-    return 0;
-}
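The properties the removed prefix-cache test checks (deterministic, prefix-sensitive block hashes and a lookup that stops at the first miss) can be illustrated with a small sketch. The FNV-1a style mixing here is an assumption chosen for brevity; the diff does not show which hash `compute_block_hashes` uses, and `chained_block_hashes` and `cached_prefix_tokens` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// One FNV-1a style mixing step (illustrative only, not the real hash).
static uint64_t mix(uint64_t h, uint64_t v) {
    h ^= v;
    h *= 1099511628211ull;
    return h;
}

// Hash each FULL block of tokens, seeding every block with the hash of the
// previous block so a change anywhere in the prefix changes later hashes.
std::vector<uint64_t> chained_block_hashes(const std::vector<int>& tokens, std::size_t block_size) {
    std::vector<uint64_t> hashes;
    if (block_size == 0) return hashes;
    uint64_t parent = 14695981039346656037ull;
    for (std::size_t start = 0; start + block_size <= tokens.size(); start += block_size) {
        uint64_t h = parent;
        for (std::size_t i = 0; i < block_size; ++i) {
            h = mix(h, static_cast<uint64_t>(tokens[start + i]));
        }
        hashes.push_back(h);
        parent = h;
    }
    return hashes;
}

// Count cached prefix tokens: walk the chain and stop at the first miss.
std::size_t cached_prefix_tokens(const std::vector<uint64_t>& hashes,
                                 const std::unordered_set<uint64_t>& cache,
                                 std::size_t block_size) {
    std::size_t n = 0;
    for (uint64_t h : hashes) {
        if (cache.count(h) == 0) break;
        n += block_size;
    }
    return n;
}
```

Because the walk stops at the first miss, a chain `{1, 2, 3}` checked against a cache holding `{1, 3}` reports one block of hits: the third block is cached but unreachable once the second one misses.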
--- a/backend/cpp/llama-cpp/patches/BENCHMARKS.md
+++ b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
@@ -1,106 +0,0 @@
-# Paged-attention / parity benchmarks (GB10 / DGX Spark)
-
-Goal of the series: vLLM parity. This records the measured gap so the parity claim is data-backed, not asserted.
-
-**Setup:** GB10 (sm_121, 119 GiB unified). Model Qwen3-Coder-30B-A3B. llama.cpp = pinned base + this series
-(MXFP4_MOE, `-fa 1 -b 2048 -ub 2048`, `llama-batched-bench`, PP=512 TG=128). vLLM = 0.23.0 FP8 (recorded
-prior run, same box/model). S_PP / S_TG are aggregate prefill / decode tok/s across B streams.
-
-## Fresh llama.cpp (this series, MXFP4) vs vLLM (FP8)
-
-| B | llama S_PP | vLLM S_PP | PP gap | llama S_TG | vLLM S_TG | TG gap |
-|---|-----------|-----------|--------|-----------|-----------|--------|
-| 1 | 1565 | 9644 | 6.2× | **83** | 48 | **llama wins** |
-| 8 | 3648 | 33373 | 9.1× | 126 | 312 | 2.5× |
-| 32 | 2074 | 99398 | 48× | 319 | 1171 | 3.7× |
-| 64 | 3643 | 151990 | 42× | 771 | 2064 | 2.7× |
-
-## Verdict — two distinct gaps, only one is the engine's
-
-1. **Prefill (S_PP): 6–48× behind, and it does NOT scale with B** (plateaus ~3.6k). This is the **FP4 MoE
-   GEMM kernel** (`mul_mat_q<MXFP4>` ~22 TFLOP/s), confirmed earlier. **Paged attention cannot close this** —
-   it's per-token compute. Needs the tcgen05/CUTLASS grouped-GEMM (Lever 3, multi-week, no upstream base).
-2. **Decode at concurrency (S_TG): 2.5–3.7× behind for B≥8** (we *win* at B=1). This gap IS partly the
-   engine's domain — vLLM's block-paged KV + continuous batching pack more concurrent decode work per step.
-   **This is what patches 0003–0006 target.** The win here is realistic; the prefill win is not (kernel).
-
-## CORRECTION — decode-phase profile (B=64, decode-dominated nsys)
-
-The "decode gap is engine-addressable" read above was **wrong**. Profiling a decode-dominated B=64 run:
-
-| kernel | % GPU time |
-|---|---|
-| `mul_mat_q<MXFP4>` (MoE GEMM) | **54.6** |
-| `flash_attn_ext` (attention) | 19.8 |
-| `mul_mat_q<Q8>` (dense) | 10.9 |
-| KV writes / quant / norms / rest | ~15 |
-
-**Decode at concurrency is ALSO dominated by the FP4 MoE GEMM (54.6%)** — the same Lever-3 kernel as prefill.
-Attention (the only thing paging optimizes) is ~20%, and the gather-read reclaims only the *masked-cell*
-fraction of that. So **the paged series (0003–0006) cannot close the vLLM gap in either phase** — both are
-MoE-kernel-bound. vLLM's concurrency advantage is its MoE/attention *kernels*, not (mainly) its KV management.
-
-### What the paged series IS still good for (just not throughput parity)
-
-- **Capacity**: block-granular + on-demand allocation → fit more/longer concurrent sequences in fixed VRAM.
-- **Prefix sharing**: cross-request block dedup → lower TTFT + memory on shared system prompts / RAG.
-
-These are real wins on *memory-pressured* and *shared-prefix* workloads — but they are not tok/s parity, and
-batched-bench (fresh, non-fragmented, no shared prefix) won't show them.
-
-## DENSE model parity (Qwen3-32B) — does the kernel gap exist for dense too? YES.
-
-The MoE work above is about the grouped MoE GEMM. Dense models use a different (non-grouped) matmul path,
-so we benchmarked a dense 32B head-to-head.
-
-**Headline comparison — vLLM NVFP4 W4A16 vs llama.cpp Q4_K_M.** This is the *correct apples-to-apples on
-DGX Spark*: both are **4-bit weights / 16-bit activations** (same quant class). vLLM = `Qwen3-32B-NVFP4A16`
-(FlashInfer Marlin W4A16 kernel); llama.cpp = `Qwen3-32B-Q4_K_M` (int8-MMQ compute). The only difference is
-the compute kernel — which is exactly what we're measuring. (Full **W4A4** NVFP4 does not run on GB10 today;
-root cause below — and it would *not* be a fair comparison even if it did, since Q4_K_M is also weight-only-4-bit.)
-
-| B | llama Q4_K_M PP | vLLM W4A16 PP | PP gap | llama decode | vLLM decode | TG gap |
-|---|---|---|---|---|---|---|
-| 1 | 708 | 5367 | 7.6× | 10.2 | 11.7 | ~parity |
-| 8 | 761 | 14941 | 20× | 58 | 92 | 1.6× |
-| 32 | 763 | 21952 | 29× | 205 | 330 | 1.6× |
-| 64 | 765 | 24444 | 32× | 253 | 569 | 2.2× |
-
-**Findings:**
-1. **Dense prefill has the SAME (larger) kernel gap.** llama dense prefill plateaus at ~765 t/s regardless of
-   B; vLLM scales to 24.4k (32×). Both read 4-bit weights — the gap is the compute kernel: vLLM's FP4 Marlin
-   tensor-core GEMM vs llama's int8-MMQ. (Note: on consumer Blackwell, W4A16 Marlin is also reported *faster*
-   than the experimental W4A4 path, so W4A16 isn't a handicapped stand-in — it's the fast path.)
-2. **Decode is ~parity at B=1** (10.2 vs 11.7 — both weight-bandwidth-bound reading 4-bit weights), and the
-   gap grows with batch (compute starts to matter → the kernel gap reappears: 2.2× at B=64).
-3. **Scope decision (the reason for this benchmark): the Lever-3 kernel track must also deliver a NON-grouped
-   block-scaled FP4 GEMM for dense**, not only the MoE grouped GEMM. The dense GEMM is the simpler of the two
-   (a plain CUTLASS dense GEMM), so it's a good first kernel to land — and it benefits every dense model.
-   - **No cheap lever:** `GGML_CUDA_FORCE_CUBLAS` is a **no-op for dense too** (Q4_K pp512: 720.8 vs 721.8) —
-     dequant→cuBLAS-BF16 doesn't engage / isn't faster than int8-MMQ on GB10. With ubatch (saturates) and
-     nwarps (static_assert) already ruled out for MoE, **every config/flag lever is now exhausted** for both
-     model classes. Parity is strictly the FP4 tensor-core kernel.
-4. **Why full W4A4 NVFP4 hangs on GB10 (root cause, researched).** This is a *known consumer-Blackwell
-   limitation, not a misconfiguration*. **FlashInfer ships no FP4 cubins for sm_120/sm_121** — its precompiled
-   kernels are all datacenter `Sm100a/Sm103a` (B200/B300). So on GB10 the dense `mm_fp4` W4A4 GEMM has no
-   working kernel: the optimized path is gated off for sm_121 (heuristic checks `minor==0`; 12.1 fails), the
-   CUTLASS dense FP4 fallback is documented to silently return **all-zeros**, and TRT-LLM errors at capability
-   120. Our exact symptom — loads weights, then stalls at the first profiling forward pass with
-   `enable_flashinfer_autotune=True` at 0–3% GPU — is the **FlashInfer FP4 autotuner/JIT spinning on an arch
-   with no FP4 cubins** (matches vllm #30163/#26381, flashinfer #2577/#3294). The "NVFP4 on DGX Spark" story
-   everyone cites is about *quantization + memory footprint + W4A16/MoE*, **not dense W4A4 inference**, which
-   isn't validated on sm_121 yet (where people patched it working, it was slower than W4A16 anyway).
-   **Therefore W4A16 vs Q4_K_M above is the right, reproducible apples-to-apples** for DGX Spark today.
-   Optional W4A4 retry (verify output isn't zeros first): `VLLM_SKIP_FLASHINFER_AUTOTUNE=1` +
-   `VLLM_NVFP4_GEMM_BACKEND=cutlass` + `--enforce-eager`, or NVIDIA's `vllm/vllm-openai:cu130-nightly` container.
-
-## So, honestly, where parity stands
-
-- **Decode single-stream: already at/above parity** (B=1: 83 vs 48).
-- **Decode concurrency: a real, engine-addressable gap** the paged series can narrow (0004 on-demand pool +
-  0005 continuous batching). Target: close the 2.5–3.7× at B≥8.
-- **Prefill: kernel-bound, not engine-bound.** No amount of paging reaches vLLM here; that's a separate track.
-
-**Series status when measured:** 0001 (vendor) + 0002 (placement, token-identical) done; 0003 (gather-read)
-turn-key-planned, not yet implemented. These numbers are the *baseline* the engine patches must improve on at
-B≥8 decode — re-run this table after 0004/0005 to show the concurrency gap closing.
--- a/backend/cpp/llama-cpp/patches/README.md
+++ b/backend/cpp/llama-cpp/patches/README.md
@@ -1,82 +0,0 @@
-# llama.cpp patch series — paged attention (vLLM-parity engine)
-
-A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
-in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
-(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
-`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us
-**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
-patch needs fixing, and the failure points at exactly which step the upstream change touched.
-
-## Base
-
-- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
-  the pin = re-run the regen workflow below and fix only the patches that no longer apply.
-
-## The series (phases → patches)
-
-| # | Patch | What | Verifies |
-|---|-------|------|----------|
-| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately under `../paged/` |
-| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
-| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
-| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
-| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
-| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
-
-Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off
-(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so partial series never changes stock behavior.
-
-## Regen workflow (the anti-drift recipe)
-
-```sh
-# 1. check out the exact pin into a dev tree
-git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
-git checkout <LLAMA_VERSION from ../Makefile>
-git checkout -b paged
-
-# 2. apply the current series (each becomes a commit), or develop the next patch
-git am /path/to/backend/cpp/llama-cpp/patches/00*.patch     # or `git apply` + commit per patch
-
-# 3. iterate a phase as ONE commit, then export the whole series 1:1
-git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp/patches/ --zero-commit -N
-
-# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
-```
-
-## Build integration
-
-`../Makefile`'s `llama.cpp:` target runs, after `git checkout -b build $(LLAMA_VERSION)`:
-```
-for p in $(CURRENT_MAKEFILE_DIR)/patches/0*.patch; do git apply --verbose "$p"; done
-```
-All variants (avx/avx2/avx512/cuda/…) copy the patched `llama.cpp/` tree, so the series ships everywhere.
-
-## Status
-
-- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`.
-- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
-  **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
-- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
-  (`ADDITIVE_DESIGN.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
-  subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
-  `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
-  `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
-  **9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
-  compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
-  - **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
-    flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
-    scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
-    the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit-
-    identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what
-    makes paged placement token-identical under flash-attn.**
-- 0004–0006 follow.
-
-### Honest parity note (important)
-
-This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
-by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
-(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
-touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
-gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
-`../paged/UPSTREAM_GGML_ISSUE.md` and `DGX_BLACKWELL_PLAN.md`). So full vLLM parity = this series **AND** the
-kernel; neither alone suffices.
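The key correctness finding in the removed README (the gather index must be emitted sorted by token position) can be sketched in isolation as follows. This is a minimal illustration, not the code from patch 0003: `gather_idxs_sorted_by_pos` and the `cell_pos` vector are hypothetical stand-ins for the cache's per-cell metadata, with `-1` assumed to mark an empty cell.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the cache's cell metadata: cell_pos[i] is the token
// position stored in physical cell i, or -1 if the cell is empty.
std::vector<int32_t> gather_idxs_sorted_by_pos(const std::vector<int32_t>& cell_pos) {
    std::vector<int32_t> idxs;
    for (int32_t i = 0; i < static_cast<int32_t>(cell_pos.size()); ++i) {
        if (cell_pos[i] >= 0) {
            idxs.push_back(i);  // keep only used cells
        }
    }
    // Order the surviving cells by token position, so an order-sensitive
    // sequential reduction visits them in the same order as a contiguous layout.
    std::stable_sort(idxs.begin(), idxs.end(),
                     [&](int32_t a, int32_t b) { return cell_pos[a] < cell_pos[b]; });
    return idxs;
}
```

With scattered placement, physical cell order and token order disagree; sorting the index list by position is what restores the token order before the reduction runs.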
--- a/backend/cpp/llama-cpp/patches/kernel/0001-fp4-grouped-moe-scaffold.patch
+++ b/backend/cpp/llama-cpp/patches/kernel/0001-fp4-grouped-moe-scaffold.patch
@@ -1,91 +0,0 @@
-diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cu b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
-new file mode 100644
-index 0000000..5f5a782
--- /dev/null
-+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
-@@ -0,0 +1,46 @@
-+#include "fp4-grouped-moe.cuh"
-+
-+#include <cstdlib>
-+#include <cstdio>
-+
-+// SCAFFOLD for the FP4 grouped-GEMM MoE kernel (Lever 3).
-+//
-+// Why: on GB10 (sm_121) the MoE matmul runs mul_mat_q<MXFP4> - a warp-level mma.sync grouped MMQ -
-+// at ~22 effective TFLOP/s, ~27x behind vLLM prefill, and it also dominates decode at concurrency
-+// (54.6% of GPU time at B=64). It is the single bottleneck to vLLM parity in BOTH phases; paged
-+// attention cannot touch it (proven by profiling). The fix is a CUTLASS-3.x collective-mainloop
-+// grouped GEMM over all experts, block-scaled e2m1 operands via tcgen05 tensor-memory MMA.
-+//
-+// This file is the integration seam. It is currently a no-op that always falls back to MMQ, so the
-+// default build is byte-identical. The kernel is filled in over the phases in the design doc.
-+
-+static bool fp4_grouped_enabled() {
-+    static const bool en = (std::getenv("GGML_CUDA_FP4_GROUPED") != nullptr);
-+    return en;
-+}
-+
-+bool ggml_cuda_fp4_grouped_moe(
-+        ggml_backend_cuda_context & ctx,
-+        const ggml_tensor * src0,
-+        const ggml_tensor * src1,
-+        const ggml_tensor * ids,
-+        ggml_tensor       * dst) {
-+    GGML_UNUSED(ctx); GGML_UNUSED(src1); GGML_UNUSED(ids); GGML_UNUSED(dst);
-+
-+    if (!fp4_grouped_enabled()) {
-+        return false; // default: existing MMQ path
-+    }
-+    if (src0->type != GGML_TYPE_MXFP4 && src0->type != GGML_TYPE_NVFP4) {
-+        return false;
-+    }
-+
-+    // TODO(kernel - see kernel design doc): CUTLASS 3.x GemmGrouped, sm_120a, block-scaled e2m1,
-+    // tcgen05 MMA; per-expert problem offsets from `ids`; fused activation quant; numerical parity
-+    // vs mul_mat_q<MXFP4> before enabling by default.
-+    static bool warned = false;
-+    if (!warned) {
-+        warned = true;
-+        fprintf(stderr, "[fp4-grouped] GGML_CUDA_FP4_GROUPED set, kernel not yet implemented - using MMQ\n");
-+    }
-+    return false; // scaffold: fall back until the kernel lands
-+}
-diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cuh b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
-new file mode 100644
-index 0000000..29e1b5a
--- /dev/null
-+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
-@@ -0,0 +1,13 @@
-+#pragma once
-+
-+#include "common.cuh"
-+
-+// Entry point for the tcgen05/CUTLASS block-scaled FP4 (MXFP4/NVFP4) grouped-GEMM MoE kernel for
-+// Blackwell consumer GPUs (sm_120/121). Returns true if it handled the op; false to fall back to
-+// the existing warp-mma MMQ path. Gated behind GGML_CUDA_FP4_GROUPED until correct + faster.
-+bool ggml_cuda_fp4_grouped_moe(
-+        ggml_backend_cuda_context & ctx,
-+        const ggml_tensor * src0,   // expert weights, MXFP4/NVFP4 [n_embd, n_ff, n_expert]
-+        const ggml_tensor * src1,   // activations, F32 [n_embd, n_tokens, ...]
-+        const ggml_tensor * ids,    // expert routing, I32
-+        ggml_tensor       * dst);   // F32 output
-diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
-index 8ea462a..104d131 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
-+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
-@@ -30,6 +30,7 @@
- #include "ggml-cuda/im2col.cuh"
- #include "ggml-cuda/mmf.cuh"
- #include "ggml-cuda/mmq.cuh"
-+#include "ggml-cuda/fp4-grouped-moe.cuh"
- #include "ggml-cuda/mmvf.cuh"
- #include "ggml-cuda/mmvq.cuh"
- #include "ggml-cuda/norm.cuh"
-@@ -2701,6 +2702,7 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
-         }
- 
-         if (ggml_cuda_should_use_mmq(src0->type, cc, ne12, /*n_experts=*/ne02)) {
-+            if (ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst)) { return; }
-             ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
-             return;
-         }
--- a/backend/cpp/llama-cpp/patches/paged/0001-vendor-paged-kv-manager.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0001-vendor-paged-kv-manager.patch
@@ -1,447 +0,0 @@
-From bef64835d444a44ed8391bc395cdab38164229d5 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Fri, 19 Jun 2026 22:54:49 +0000
-Subject: [PATCH] vendor paged kv manager
-
-vLLM-parity host-side KV block manager (FreeBlockQueue, BlockPool,
-PagedKVManager, chained-hash prefix cache). Pure C++17, no behavior change -
-nothing uses it yet; wired in by later patches in the series.
---
- src/CMakeLists.txt       |   1 +
- src/paged-kv-manager.cpp | 296 +++++++++++++++++++++++++++++++++++++++
- src/paged-kv-manager.h   | 108 ++++++++++++++
- 3 files changed, 405 insertions(+)
- create mode 100644 src/paged-kv-manager.cpp
- create mode 100644 src/paged-kv-manager.h
-
-diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
-index d15ccfd99..a030940b8 100644
--- a/src/CMakeLists.txt
-+++ b/src/CMakeLists.txt
-@@ -24,6 +24,7 @@ add_library(llama
-             llama-io.cpp
-             llama-kv-cache.cpp
-             llama-kv-cache-iswa.cpp
-+            paged-kv-manager.cpp
-             llama-kv-cache-dsa.cpp
-             llama-memory.cpp
-             llama-memory-hybrid.cpp
-diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
-new file mode 100644
-index 000000000..ca0dcd83a
--- /dev/null
-+++ b/src/paged-kv-manager.cpp
-@@ -0,0 +1,296 @@
-+#include "paged-kv-manager.h"
-+#include <cassert>
-+#include <stdexcept>
-+
-+namespace paged {
-+
-+// ---------------------------------------------------------------------------
-+// FreeBlockQueue  (port of kv_cache_utils.py FreeKVCacheBlockQueue)
-+// ---------------------------------------------------------------------------
-+
-+FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
-+    num_free_blocks = blocks.size();
-+    for (size_t i = 0; i < blocks.size(); ++i) {
-+        if (i > 0)                  blocks[i]->prev_free = blocks[i - 1];
-+        if (i + 1 < blocks.size())  blocks[i]->next_free = blocks[i + 1];
-+    }
-+    if (!blocks.empty()) {
-+        fake_head.next_free = blocks.front();
-+        blocks.front()->prev_free = &fake_head;
-+        fake_tail.prev_free = blocks.back();
-+        blocks.back()->next_free = &fake_tail;
-+    } else {
-+        fake_head.next_free = &fake_tail;
-+        fake_tail.prev_free = &fake_head;
-+    }
-+}
-+
-+KVCacheBlock* FreeBlockQueue::popleft() {
-+    KVCacheBlock* first = fake_head.next_free;
-+    if (first == &fake_tail || first == nullptr) {
-+        assert(num_free_blocks == 0);
-+        throw std::runtime_error("No free blocks available");
-+    }
-+    fake_head.next_free = first->next_free;
-+    first->next_free->prev_free = &fake_head;
-+    first->prev_free = first->next_free = nullptr;
-+    num_free_blocks--;
-+    return first;
-+}
-+
-+std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
-+    std::vector<KVCacheBlock*> ret;
-+    if (n == 0) return ret;
-+    assert(num_free_blocks >= n);
-+    num_free_blocks -= n;
-+    KVCacheBlock* curr = fake_head.next_free;
-+    ret.reserve(n);
-+    for (size_t i = 0; i < n; ++i) {
-+        assert(curr != nullptr);
-+        ret.push_back(curr);
-+        KVCacheBlock* last = curr;
-+        curr = curr->next_free;
-+        last->prev_free = last->next_free = nullptr;
-+    }
-+    if (curr != nullptr) {
-+        fake_head.next_free = curr;
-+        curr->prev_free = &fake_head;
-+    }
-+    return ret;
-+}
-+
-+void FreeBlockQueue::remove(KVCacheBlock* block) {
-+    if (!block->prev_free || !block->next_free)
-+        throw std::runtime_error("remove() called on an invalid block");
-+    block->prev_free->next_free = block->next_free;
-+    block->next_free->prev_free = block->prev_free;
-+    block->prev_free = block->next_free = nullptr;
-+    num_free_blocks--;
-+}
-+
-+void FreeBlockQueue::append(KVCacheBlock* block) {
-+    KVCacheBlock* last = fake_tail.prev_free;
-+    last->next_free = block;
-+    block->prev_free = last;
-+    block->next_free = &fake_tail;
-+    fake_tail.prev_free = block;
-+    num_free_blocks++;
-+}
-+
-+void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
-+    if (blocks.empty()) return;
-+    KVCacheBlock* last = fake_tail.prev_free;
-+    for (KVCacheBlock* b : blocks) {
-+        b->prev_free = last;
-+        last->next_free = b;
-+        last = b;
-+    }
-+    last->next_free = &fake_tail;
-+    fake_tail.prev_free = last;
-+    num_free_blocks += blocks.size();
-+}
-+
-+void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
-+    if (blocks.empty()) return;
-+    KVCacheBlock* first = fake_head.next_free;
-+    KVCacheBlock* prev = &fake_head;
-+    for (KVCacheBlock* b : blocks) {
-+        b->prev_free = prev;
-+        prev->next_free = b;
-+        prev = b;
-+    }
-+    prev->next_free = first;
-+    first->prev_free = prev;
-+    num_free_blocks += blocks.size();
-+}
-+
-+std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
-+    std::vector<KVCacheBlock*> ret;
-+    const KVCacheBlock* curr = fake_head.next_free;
-+    while (curr && curr->next_free != nullptr) {
-+        ret.push_back(const_cast<KVCacheBlock*>(curr));
-+        curr = curr->next_free;
-+    }
-+    return ret;
-+}
-+
-+// ---------------------------------------------------------------------------
-+// BlockPool  (port of block_pool.py)
-+// ---------------------------------------------------------------------------
-+
-+static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
-+    std::vector<KVCacheBlock*> p;
-+    p.reserve(v.size());
-+    for (auto& b : v) p.push_back(&b);
-+    return p;
-+}
-+
-+static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
-+    std::vector<KVCacheBlock> v;
-+    v.reserve(num_blocks);
-+    for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
-+    return v;
-+}
-+
-+BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
-+    : enable_caching_(enable_caching),
-+      blocks_(make_block_vec(num_blocks)),
-+      ptrs_(make_ptrs(blocks_)),
-+      free_queue_(ptrs_) {
-+    // vLLM reserves block_id 0 as the null block (never cached).
-+    null_block = free_queue_.popleft();
-+    null_block->is_null = true;
-+}
-+
-+bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
-+    if (!block->has_hash) return false;
-+    auto it = cached_block_hash_to_block_.find(block->block_hash);
-+    if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
-+    cached_block_hash_to_block_.erase(it);
-+    block->reset_hash();
-+    return true;
-+}
-+
-+std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
-+    if (n > get_num_free_blocks())
-+        throw std::runtime_error("Cannot get free blocks from pool");
-+    auto ret = free_queue_.popleft_n(n);
-+    for (KVCacheBlock* b : ret) {
-+        if (enable_caching_) maybe_evict_cached_block(b);
-+        assert(b->ref_cnt == 0);
-+        b->ref_cnt += 1;
-+    }
-+    return ret;
-+}
-+
-+KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
-+    auto it = cached_block_hash_to_block_.find(block_hash);
-+    return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
-+}
-+
-+void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
-+    for (KVCacheBlock* b : blocks) {
-+        // ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
-+        if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
-+        b->ref_cnt += 1;
-+    }
-+}
-+
-+void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
-+    std::vector<KVCacheBlock*> without_hash, with_hash;
-+    for (KVCacheBlock* b : ordered_blocks) {
-+        if (b->is_null) continue;
-+        b->ref_cnt -= 1;
-+        if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
-+    }
-+    free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
-+    free_queue_.append_n(with_hash);     // hashed: kept warm (tail)
-+}
-+
-+void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
-+                                  size_t num_cached_blocks, size_t num_full_blocks,
-+                                  const std::vector<uint64_t>& block_hashes) {
-+    for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
-+        KVCacheBlock* blk = req_blocks[i];
-+        if (blk->has_hash) continue;
-+        blk->has_hash = true;
-+        blk->block_hash = block_hashes[i];
-+        cached_block_hash_to_block_[blk->block_hash] = blk;
-+    }
-+}
-+
-+// ---------------------------------------------------------------------------
-+// PagedKVManager  (port of SingleTypeKVCacheManager / FullAttentionManager)
-+// ---------------------------------------------------------------------------
-+
-+static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
-+
-+PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
-+    : block_size_(block_size), pool_(num_blocks, enable_caching) {}
-+
-+bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
-+    auto& req = req_to_blocks_[seq_id];
-+    size_t need = cdiv(total_tokens, block_size_);
-+    if (need <= req.size()) return true;
-+    size_t add = need - req.size();
-+    if (add > pool_.get_num_free_blocks()) return false; // OOM
-+    auto nb = pool_.get_new_blocks(add);
-+    req.insert(req.end(), nb.begin(), nb.end());
-+    return true;
-+}
-+
-+std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
-+    std::vector<int32_t> bt;
-+    auto it = req_to_blocks_.find(seq_id);
-+    if (it == req_to_blocks_.end()) return bt;
-+    bt.reserve(it->second.size());
-+    for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
-+    return bt;
-+}
-+
-+int64_t PagedKVManager::slot(int seq_id, int pos) const {
-+    const auto& req = req_to_blocks_.at(seq_id);
-+    int32_t phys = req[pos / block_size_]->block_id;
-+    return (int64_t)phys * block_size_ + (pos % block_size_);
-+}
-+
-+std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
-+    std::vector<int64_t> sm;
-+    sm.reserve(positions.size());
-+    for (int p : positions) sm.push_back(slot(seq_id, p));
-+    return sm;
-+}
-+
-+void PagedKVManager::free(int seq_id) {
-+    auto it = req_to_blocks_.find(seq_id);
-+    if (it == req_to_blocks_.end()) return;
-+    // Free in reverse so the tail of the block chain is evicted first (vLLM order).
-+    std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
-+    pool_.free_blocks(ordered);
-+    req_to_blocks_.erase(it);
-+}
-+
-+// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
-+// hash into the seed so each block hash transitively encodes its whole prefix
-+// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
-+uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
-+    uint64_t h = 1469598103934665603ull ^ parent_hash;
-+    for (int t : token_ids) {
-+        h ^= (uint64_t)(uint32_t)t;
-+        h *= 1099511628211ull;
-+    }
-+    if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
-+    return h;
-+}
-+
-+std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
-+    std::vector<uint64_t> hashes;
-+    uint64_t parent = 0; // NONE_HASH analogue
-+    size_t n_full = token_ids.size() / block_size_;
-+    for (size_t i = 0; i < n_full; ++i) {
-+        std::vector<int> blk(token_ids.begin() + i * block_size_,
-+                             token_ids.begin() + (i + 1) * block_size_);
-+        parent = hash_block(parent, blk);
-+        hashes.push_back(parent);
-+    }
-+    return hashes;
-+}
-+
-+size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
-+    std::vector<KVCacheBlock*> hits;
-+    for (uint64_t bh : block_hashes) {        // stop at first miss (prefix property)
-+        KVCacheBlock* cb = pool_.get_cached_block(bh);
-+        if (!cb) break;
-+        hits.push_back(cb);
-+    }
-+    pool_.touch(hits);                        // ++ref_cnt, pull from free list
-+    return hits.size() * (size_t)block_size_;
-+}
-+
-+void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
-+    auto& req = req_to_blocks_[seq_id];
-+    size_t n_full = num_tokens / block_size_;
-+    pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
-+}
-+
-+} // namespace paged
-diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
-new file mode 100644
-index 000000000..740280a7f
--- /dev/null
-+++ b/src/paged-kv-manager.h
-@@ -0,0 +1,108 @@
-+#pragma once
-+// Paged KV cache block manager for llama.cpp (CPU-first prototype).
-+//
-+// Host-side block management is a faithful port of vLLM V1:
-+//   vllm/v1/core/kv_cache_utils.py            (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
-+//   vllm/v1/core/block_pool.py                (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
-+//   vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
-+//
-+// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
-+// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
-+// dependency so it can be unit-tested in isolation.
-+
-+#include <cstdint>
-+#include <vector>
-+#include <unordered_map>
-+#include <map>
-+
-+namespace paged {
-+
-+// vLLM KVCacheBlock (kv_cache_utils.py).
-+struct KVCacheBlock {
-+    int32_t  block_id   = 0;
-+    int      ref_cnt    = 0;
-+    bool     has_hash   = false;   // vLLM: _block_hash is set only when full+cached
-+    uint64_t block_hash = 0;
-+    bool     is_null    = false;
-+    KVCacheBlock* prev_free = nullptr;
-+    KVCacheBlock* next_free = nullptr;
-+
-+    explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
-+    void reset_hash() { has_hash = false; block_hash = 0; }
-+};
-+
-+// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
-+// O(1) middle removal is required so touch() can pull a warm cached block out of the
-+// free list when a later request hits its prefix.
-+class FreeBlockQueue {
-+public:
-+    size_t num_free_blocks = 0;
-+
-+    explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
-+    KVCacheBlock* popleft();
-+    std::vector<KVCacheBlock*> popleft_n(size_t n);
-+    void remove(KVCacheBlock* block);
-+    void append(KVCacheBlock* block);
-+    void append_n(const std::vector<KVCacheBlock*>& blocks);
-+    void prepend_n(const std::vector<KVCacheBlock*>& blocks);
-+    std::vector<KVCacheBlock*> get_all_free_blocks() const;
-+
-+private:
-+    KVCacheBlock fake_head{-1};
-+    KVCacheBlock fake_tail{-1};
-+};
-+
-+// vLLM BlockPool (block_pool.py).
-+class BlockPool {
-+public:
-+    KVCacheBlock* null_block = nullptr;
-+
-+    BlockPool(int32_t num_blocks, bool enable_caching);
-+    std::vector<KVCacheBlock*> get_new_blocks(size_t n);
-+    KVCacheBlock* get_cached_block(uint64_t block_hash);
-+    void touch(const std::vector<KVCacheBlock*>& blocks);
-+    void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
-+    void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
-+                           size_t num_cached_blocks, size_t num_full_blocks,
-+                           const std::vector<uint64_t>& block_hashes);
-+    size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
-+
-+private:
-+    bool maybe_evict_cached_block(KVCacheBlock* block);
-+
-+    bool enable_caching_;
-+    std::vector<KVCacheBlock> blocks_;     // owns all block descriptors
-+    std::vector<KVCacheBlock*> ptrs_;
-+    FreeBlockQueue free_queue_;
-+    // vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
-+    // prototype keeps the last writer (single KV-cache group is sufficient for the wins).
-+    std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
-+};
-+
-+// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
-+// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
-+class PagedKVManager {
-+public:
-+    PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
-+
-+    // Grow seq_id to cover total_tokens slots. Returns false on OOM (free queue empty).
-+    bool allocate(int seq_id, size_t total_tokens);
-+    std::vector<int32_t> block_table(int seq_id) const;
-+    int64_t slot(int seq_id, int pos) const;
-+    std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
-+    void free(int seq_id);
-+    int block_size() const { return block_size_; }
-+
-+    // Prefix caching (win 3).
-+    static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
-+    std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
-+    size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
-+    void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
-+
-+protected:
-+    int block_size_;
-+    BlockPool pool_;
-+    std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
-+};
-+
-+} // namespace paged
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch
@@ -1,75 +0,0 @@
-From 5c9c709e6c6b07e0399b75fd4e46e752d418a9a8 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Fri, 19 Jun 2026 23:04:17 +0000
-Subject: [PATCH] paged kv block placement (env LLAMA_KV_PAGED)
-
-Place each sequence's tokens at permuted, non-contiguous fixed-size block
-positions in find_slot, proving attention is invariant to physical KV placement
-(token-identical greedy generation). Default off; single-sequence scope; falls
-back to the normal allocator. The paged-placement substrate for the gather-read.
---
- src/llama-kv-cache.cpp | 41 +++++++++++++++++++++++++++++++++++++++++
- 1 file changed, 41 insertions(+)
-
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 2802103bd..999e2ae61 100644
--- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -11,6 +11,8 @@
- #include <cstring>
- #include <limits>
- #include <map>
-+#include <numeric>
-+#include <cstdlib>
- #include <stdexcept>
- 
- static bool ggml_is_power_of_2(int n) {
-@@ -1020,6 +1022,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
-             return { };
-         }
- 
-+        // [paged, experimental] Place this sequence's tokens at permuted,
-+        // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
-+        // This validates that attention is invariant to physical KV placement -
-+        // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
-+        // Single-sequence scope (uses get_used() as the logical base); falls back
-+        // to the normal allocator if the permuted cells aren't available.
-+        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
-+        if (paged_mode) {
-+            const uint32_t bs   = 16;                 // block size (tokens/block)
-+            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
-+            if (nblk >= 2) {
-+                // stride coprime to nblk => block-index permutation is a bijection
-+                uint32_t k = 1;
-+                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
-+                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
-+                }
-+                const uint32_t base = cells.get_used();
-+                bool ok = true;
-+                for (uint32_t i = 0; i < n_tokens; ++i) {
-+                    const uint32_t L    = base + i;
-+                    const uint32_t b    = L / bs;
-+                    const uint32_t off  = L % bs;
-+                    if (b >= nblk) { ok = false; break; }
-+                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
-+                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
-+                    res.idxs[s].push_back(phys);
-+                }
-+                if (ok && res.idxs[s].size() == n_tokens) {
-+                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
-+                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
-+                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
-+                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
-+                    }
-+                    continue; // paged placement succeeded for this sequence
-+                }
-+                res.idxs[s].clear(); // fall back to the normal allocator
-+            }
-+        }
-+
-         uint32_t n_tested = 0;
- 
-         // for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
-- 
-2.43.0
-
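The block permutation in the removed 0002 patch relies on a stride that is coprime to the block count, which makes `b -> (b * k) % nblk` a bijection. A standalone sketch of that selection and mapping follows; `pick_stride`, `phys_cell` and `is_bijection` are hypothetical helper names, and the loop mirrors the one in the patch under the assumption that `nblk >= 2`.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Pick an odd stride near nblk/2 that is coprime to nblk; fall back to 1 (identity).
uint32_t pick_stride(uint32_t nblk) {
    uint32_t k = 1;
    for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
        if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
    }
    return k;
}

// Physical cell for logical cell index L, with bs cells per block.
uint32_t phys_cell(uint32_t L, uint32_t bs, uint32_t nblk) {
    const uint32_t k = pick_stride(nblk);
    return (((L / bs) * k) % nblk) * bs + (L % bs);
}

// Check that b -> (b * k) % nblk hits every block exactly once.
bool is_bijection(uint32_t k, uint32_t nblk) {
    std::vector<bool> seen(nblk, false);
    for (uint32_t b = 0; b < nblk; ++b) {
        const uint32_t p = (b * k) % nblk;
        if (seen[p]) return false;
        seen[p] = true;
    }
    return true;
}
```

For `nblk = 16` the loop picks `k = 9`, so logical cell 17 (block 1, offset 1) maps to physical cell `9 * 16 + 1 = 145`.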
--- a/backend/cpp/llama-cpp/patches/paged/0003-gather-read-plan.md
+++ b/backend/cpp/llama-cpp/patches/paged/0003-gather-read-plan.md
@@ -1,102 +0,0 @@
-# Patch 0003 — paged gather-read: exact implementation plan
-
-**Goal:** a sequence attends only its own (compacted) cells via `ggml_get_rows`, instead of the scattered
-`[0,n_kv)` window. Token-identical (attention is permutation-invariant over the KV set). **Gated**: stock
-path stays byte-identical (no new ops unless `LLAMA_KV_PAGED`).
-
-**Base:** applies on top of 0001+0002 at the pin. Dev tree: `backend/cpp/llama-cpp-paged-dev` (branch `paged`).
-
-## Design
-
-The gather is keyed off one runtime index list (the sequence's used cells, in a fixed order), exposed as a
-graph input (mirroring `k_idxs`). In `build_attn`, gather K, V **and the kq_mask** by that same index, so all
-three stay aligned. `n_gathered` replaces `n_kv` for the attention. Only active when the cache is in paged
-mode (a new `is_paged()` flag set when `LLAMA_KV_PAGED`/find_slot used permuted placement).
-
-ggml note: `ggml_get_rows(a,b)` gathers `a`'s **ne1** by `b` (I32). Raw K is `[n_embd_k_gqa, kv_size, n_stream]`
-→ ne1 = cells → direct. The mask is `[n_kv, n_tokens, 1, n_stream]` → n_kv is **ne0**, so gather as
-`transpose → get_rows → transpose`.
-
-### KEY CORRECTIONS (found while implementing — these change the edits)
-
-1. **Gather index = ALL used (non-empty) cells in `[0,n_kv)`, NOT `sinfo.idxs`.** `sinfo.idxs` is only the
-   *current ubatch's write slots*; attention reads the *full history*. The query set per token is masked by
-   `kq_mask`, so gathering the union of all used cells + gathering the mask the same way is token-identical
-   and drops exactly the empty (already-masked) cells. So: `gather = { i in [0,n_kv) : !cells.is_empty(i) }`.
-
-2. **Static-graph size is fine because llama.cpp rebuilds the graph every ubatch.** `n_gather` (used-cell
-   count) is therefore a build-time constant for that ubatch — `build_input_gather_idxs` sizes the I32
-   tensor to `get_n_gather()` computed at build, `set_input_gather_idxs` fills the identical cell list. They
-   MUST use the same loop (`for i in [0,n_kv): if !is_empty(i) push i`) so build-order == fill-order.
-
-3. **K/V gather can live entirely in `build_attn`, no cache get_k change.** The `get_k` 4d view is contiguous
-   in `[ne0,ne1,ne2]` from cell 0 (nb2 == n_embd_head*n_head_kv*elemsz), so for **single stream (ns==1)**:
-   `reshape_3d(k, n_embd_head*n_head_kv, n_kv, 1) → get_rows(., gi) → reshape_4d(., n_embd_head, n_head_kv, n_gather, 1)`.
-   Multi-stream (ns>1) breaks contiguity (nb3 uses kv_size) → gate to ns==1 first, multi-stream follow-up.
-
-4. So the ONLY cache additions are `is_paged()`, `get_n_gather(n_kv)`, `build/set_input_gather_idxs(n_kv)`;
-   everything else (K/V/mask gather) is in `build_attn`. `set_input_kq_mask` is **unchanged** (built over
-   n_kv, then gathered). Smaller than the 7-edit estimate above.
-
-## Edits
-
-### 1. `src/llama-kv-cache.h` — declare gather infra (in `llama_kv_cache`)
-```cpp
-    bool        is_paged() const { return paged_active; }            // near get_size()
-    ggml_tensor * build_input_gather_idxs(ggml_context * ctx, const slot_info & sinfo) const;
-    void          set_input_gather_idxs (ggml_tensor * dst, const slot_info & sinfo) const;
-    uint32_t      get_n_gather(const slot_info & sinfo) const;       // == sum of used cells gathered
-```
-Add member `mutable bool paged_active = false;` and in `llama_kv_cache_context` forward the three (like
-`build_input_k_idxs`/`get_n_kv`).
-
-### 2. `src/llama-kv-cache.cpp`
-- In `find_slot`, in the paged branch (0002), set `paged_active = true;` on success.
-- `get_n_gather(sinfo)` = `sinfo.idxs[0].size()` summed over streams (the count actually placed).
-- `build_input_gather_idxs`: `ggml_new_tensor_1d(ctx, GGML_TYPE_I32, get_n_gather(sinfo)); ggml_set_input(...)`.
-- `set_input_gather_idxs`: fill `data[k++] = strm_off + sinfo.idxs[s][i]` for every placed cell (same order
-  the mask/k/v will see). This is the canonical gather order.
-
-### 3. `src/llama-graph.h` — `llm_graph_input_attn_kv`
-Add `ggml_tensor * gather_idxs = nullptr;` + `ggml_tensor * get_gather_idxs() const { return gather_idxs; }`.
-
-### 4. `src/llama-graph.cpp`
-- `llm_graph_input_attn_kv::set_input`: if `mctx->is_paged()` → `mctx->set_input_gather_idxs(gather_idxs, ...)`.
-- `build_attn_inp_kv` (creates the input): if `mctx_cur->is_paged()` → `inp->gather_idxs =
-  mctx_cur->build_input_gather_idxs(ctx0, ...)`.
-- `build_attn` (the kv overload, ~2356): after `k`,`v`,`kq_mask`:
-```cpp
-if (ggml_tensor * gi = inp->get_gather_idxs()) {
-    k = ggml_get_rows(ctx0, k, gi);                                   // [d, n_gather, ...] (reshape view ok)
-    v = v_trans ? /* gather columns */ : ggml_get_rows(ctx0, v, gi);
-    ggml_tensor * m = ggml_cont(ctx0, ggml_transpose(ctx0, kq_mask)); // [n_tokens, n_kv]
-    m = ggml_get_rows(ctx0, m, gi);                                   // [n_tokens, n_gather]
-    kq_mask = ggml_cont(ctx0, ggml_transpose(ctx0, m));              // [n_gather, n_tokens]
-}
-ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
-```
-Note: `get_k` returns the reshaped 4d view; gather must run on a cell-major shape. Simplest: add a paged
-variant `get_k(ctx,il)` that returns `ggml_get_rows` of the **raw** `layers[ikv].k` then reshapes to
-`[n_embd_head, n_head_kv, n_gather, ns]`. Do the gather in the cache, not the graph, for K/V; keep only the
-mask gather in the graph. (Cleaner — revisit during impl.)
-
-### 5. V-transposed path
-When `!flash_attn`, V is stored transposed `[kv_size, n_embd_v_gqa]`, so gathering its **rows** (ne1 = n_embd)
-won't work. Either gather columns via the same idx on the non-transposed store, or force `is_paged()` to
-require flash-attn for the first cut (`GGML_ASSERT`) and handle v_trans in a follow-up.
-
-## Verification (the gate)
-```sh
-cmake --build build-cpu --target llama-simple -j
-M=Qwen3-0.6B.Q4_K_M.gguf ; P="<the 0002 prompt>"
-build-cpu/bin/llama-simple -m $M -n 64 "$P" > a.txt                    # stock
-LLAMA_KV_PAGED=1 build-cpu/bin/llama-simple -m $M -n 64 "$P" > b.txt   # paged gather-read
-diff a.txt b.txt        # MUST be identical
-```
-Also assert (debug) that `n_gather < n_kv` on a multi-chunk sequence (proves compaction, not identity).
-Export only when identical: `git format-patch HEAD~1 -o patches/ --start-number 3 -N`.
-
-## Risks
-- Mask transpose/layout: if `b.txt` diverges, dump the gathered mask vs expected for token 0; off-by-order
-  means the `set_input_gather_idxs` order ≠ the get_k gather order; they MUST use the identical loop.
-- flash-attn vs not: do flash-attn first (simpler mask), then v_trans.
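Reviewer note on the removed design doc: its verification gate rests on the claim that dropping cells whose mask is `-inf` cannot change the attention result, provided the surviving cells keep their order. The following is a minimal scalar sketch of that claim, not code from the patch set. `toy_attend` and `toy_gather` are hypothetical names, values are scalars rather than head vectors, and at least one cell is assumed to be unmasked.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Single-query softmax attention over scalar values (hypothetical helper).
// Masked cells carry score == -inf; at least one cell must be unmasked.
float toy_attend(const std::vector<float> & score, const std::vector<float> & v) {
    assert(score.size() == v.size());
    float mx = -std::numeric_limits<float>::infinity();
    for (float s : score) {
        mx = std::max(mx, s);
    }
    float sum = 0.0f;
    float acc = 0.0f;
    for (std::size_t i = 0; i < score.size(); ++i) {
        const float w = std::exp(score[i] - mx); // exp(-inf) == 0 for masked cells
        sum += w;
        acc += w * v[i];
    }
    return acc / sum;
}

// Pick the elements of src listed in idx, in idx order (hypothetical helper).
std::vector<float> toy_gather(const std::vector<float> & src, const std::vector<int> & idx) {
    std::vector<float> out;
    out.reserve(idx.size());
    for (int i : idx) {
        out.push_back(src[(std::size_t) i]);
    }
    return out;
}
```

Calling `toy_attend` on the full arrays and on the arrays gathered down to the unmasked indices (in ascending order) should give the same value, which is the property the `diff a.txt b.txt` gate checks end to end.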
--- a/backend/cpp/llama-cpp/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch
@@ -1,369 +0,0 @@
-From c1de00f4cc1eb0dd25993880bb4c8562be1937d4 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 10:24:22 +0200
-Subject: [PATCH] paged gather-read (env LLAMA_KV_PAGED) - patch 0003
-
-Gather K, V and the kq_mask down to each sequence stream's non-empty cells
-before build_attn_mha. Position-sorted per stream so the flash-attn online
-softmax reduction order matches stock byte-for-byte. Multi-stream: one index
-column per stream over k->ne[3], padded to the max non-empty count with a
-masked (empty) cell. Gated behind LLAMA_KV_PAGED; no-op when unset.
----
- src/CMakeLists.txt     |   1 +
- src/llama-graph.cpp    |   9 ++-
- src/llama-kv-cache.cpp |  74 ++++++++++++++++++++++++
- src/llama-kv-cache.h   |  11 ++++
- src/paged-attn.cpp     | 128 +++++++++++++++++++++++++++++++++++++++++
- src/paged-attn.h       |  40 +++++++++++++
- 6 files changed, 262 insertions(+), 1 deletion(-)
- create mode 100644 src/paged-attn.cpp
- create mode 100644 src/paged-attn.h
-
-diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
-index a030940..58083b3 100644
---- a/src/CMakeLists.txt
-+++ b/src/CMakeLists.txt
-@@ -25,6 +25,7 @@ add_library(llama
-             llama-kv-cache.cpp
-             llama-kv-cache-iswa.cpp
-             paged-kv-manager.cpp
-+            paged-attn.cpp
-             llama-kv-cache-dsa.cpp
-             llama-memory.cpp
-             llama-memory-hybrid.cpp
-diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
-index 68c9e60..b59d2a5 100644
---- a/src/llama-graph.cpp
-+++ b/src/llama-graph.cpp
-@@ -6,6 +6,8 @@
- #include "llama-cparams.h"
- 
- #include "llama-kv-cache.h"
-+
-+#include "paged-attn.h"
- #include "llama-kv-cache-iswa.h"
- #include "llama-kv-cache-dsa.h"
- #include "llama-memory-hybrid.h"
-@@ -2356,7 +2358,12 @@ ggml_tensor * llm_graph_context::build_attn(
-     ggml_tensor * k = mctx_cur->get_k(ctx0, il);
-     ggml_tensor * v = mctx_cur->get_v(ctx0, il);
- 
--    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
-+    // [paged 0003] gather K, V and the mask to the sequence's used cells only
-+    //   (no-op unless env LLAMA_KV_PAGED is set).
-+    ggml_tensor * kq_mask_g = kq_mask;
-+    paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
-+
-+    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
-     cb(cur, "kqv_out", il);
- 
-     if (inp->self_v_rot) {
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 999e2ae..30d02d7 100644
---- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -1,4 +1,6 @@
- #include "llama-kv-cache.h"
-+#include <vector>
-+#include <utility>
- 
- #include "llama-impl.h"
- #include "llama-io.h"
-@@ -1329,6 +1331,70 @@ ggml_tensor * llama_kv_cache::get_v(ggml_context * ctx, int32_t il, uint32_t n_k
-             ggml_row_size(v->type, kv_size*n_embd_v_gqa)*sinfo.s0);
- }
- 
-+// [paged 0003] gather-read: enumerate the non-empty cells in [0, n_kv) for the
-+// single stream addressed by sinfo. With paged placement (patch 0002) these are
-+// the sequence's scattered block cells; gathering K/V/mask by this index list
-+// compacts the attention read while preserving every unmasked (token,cell) pair.
-+uint32_t llama_kv_cache::get_n_gather(uint32_t n_kv, const slot_info & sinfo) const {
-+    // Multi-stream: the gathered K/V/mask tensors are rectangular [.., n_gather,
-+    // n_stream], so n_gather is the MAX non-empty count across the batch streams.
-+    // Streams with fewer cells are padded (see get_gather_idxs) with a masked
-+    // (empty) cell index, which contributes exp(-inf)=0 and is thus a no-op.
-+    // K is laid out over physical streams [s0, s1]; index v_cells the same way.
-+    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
-+    uint32_t mx = 0;
-+    for (uint32_t j = 0; j < ns; ++j) {
-+        const auto & cells = v_cells[sinfo.s0 + j];
-+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
-+        uint32_t cnt = 0;
-+        for (uint32_t i = 0; i < n; ++i) {
-+            if (!cells.is_empty(i)) {
-+                ++cnt;
-+            }
-+        }
-+        mx = std::max(mx, cnt);
-+    }
-+    return mx;
-+}
-+
-+void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const {
-+    const uint32_t ns       = sinfo.s1 - sinfo.s0 + 1;
-+    const uint32_t n_gather = get_n_gather(n_kv, sinfo);
-+    // dst is [n_gather, n_stream] (ne0 = n_gather): column s at dst[s*n_gather..].
-+    for (uint32_t j = 0; j < ns; ++j) {
-+        const auto & cells = v_cells[sinfo.s0 + j];
-+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
-+        // Collect the non-empty cells, then order them by token POSITION (not by
-+        // physical cell index). The attention reduction (flash-attn online
-+        // softmax, and the non-flash soft_max) runs over cells in array order and
-+        // is order-sensitive in floating point. Stock (contiguous) placement
-+        // happens to store cells in position order, so emitting the gathered
-+        // indices in position order reproduces stock's exact reduction order -
-+        // making the paged read bit-identical, not merely math-equivalent.
-+        std::vector<std::pair<llama_pos, int32_t>> pc;
-+        pc.reserve(n);
-+        int32_t pad = -1;
-+        for (uint32_t i = 0; i < n; ++i) {
-+            if (!cells.is_empty(i)) {
-+                pc.emplace_back(cells.pos_get(i), (int32_t) i);
-+            } else if (pad < 0) {
-+                pad = (int32_t) i; // first empty cell: its mask is -inf -> safe pad
-+            }
-+        }
-+        std::sort(pc.begin(), pc.end());
-+        int32_t * col = dst + (size_t) j * n_gather;
-+        for (size_t k = 0; k < pc.size(); ++k) {
-+            col[k] = pc[k].second;
-+        }
-+        // Pad the tail to n_gather with a masked (empty) cell so the rectangular
-+        // gather drops to zero contribution for streams shorter than the max.
-+        const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
-+        for (uint32_t k = (uint32_t) pc.size(); k < n_gather; ++k) {
-+            col[k] = padv;
-+        }
-+    }
-+}
-+
- ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
-     GGML_UNUSED(sinfo);
- 
-@@ -2620,6 +2686,14 @@ ggml_tensor * llama_kv_cache_context::get_v(ggml_context * ctx, int32_t il) cons
-     return kv->get_v(ctx, il, n_kv, sinfos[i_cur]);
- }
- 
-+uint32_t llama_kv_cache_context::get_n_gather() const {
-+    return kv->get_n_gather(n_kv, sinfos[i_cur]);
-+}
-+
-+void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
-+    kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
-+}
-+
- ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
-     return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
- }
-diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
-index 3d68f98..494c0fb 100644
---- a/src/llama-kv-cache.h
-+++ b/src/llama-kv-cache.h
-@@ -171,6 +171,12 @@ public:
-     ggml_tensor * get_k(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
-     ggml_tensor * get_v(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
- 
-+    // [paged 0003] count / list the non-empty cells in [0, n_kv) per stream of
-+    //   sinfo (position-sorted, padded across streams). Used by paged-attn
-+    //   gather-read. get_n_gather returns the max count across streams.
-+    uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
-+    void     get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
-+
-     // store k_cur and v_cur in the cache based on the provided head location
-     ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
-     ggml_tensor * cpy_v(ggml_context * ctx, ggml_tensor * v_cur, ggml_tensor * v_idxs, int32_t il, const slot_info & sinfo) const;
-@@ -368,6 +374,11 @@ public:
-     ggml_tensor * get_k(ggml_context * ctx, int32_t il) const;
-     ggml_tensor * get_v(ggml_context * ctx, int32_t il) const;
- 
-+    // [paged 0003] gather-read helpers (delegate to the kv cache for the
-+    //   current ubatch's stream).
-+    uint32_t get_n_gather() const;
-+    void     get_gather_idxs(int32_t * dst) const;
-+
-     // store k_cur and v_cur in the cache based on the provided head location
-     // note: the heads in k_cur and v_cur should be laid out contiguously in memory
-     //   - k_cur  [n_embd_head_k, n_head_k, n_tokens]
-diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
-new file mode 100644
-index 0000000..ade75e8
---- /dev/null
-+++ b/src/paged-attn.cpp
-@@ -0,0 +1,128 @@
-+#include "paged-attn.h"
-+
-+#include "llama-graph.h"
-+#include "llama-kv-cache.h"
-+
-+#include "ggml.h"
-+#include "ggml-backend.h"
-+
-+#include <cstdlib>
-+#include <cstdio>
-+
-+namespace paged_attn {
-+
-+bool active() {
-+    static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
-+    return a;
-+}
-+
-+static bool debug() {
-+    static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
-+    return d;
-+}
-+
-+namespace {
-+
-+// Graph input that, at set_input time, fills an I32 [n_gather, n_stream] tensor
-+// with each stream's non-empty cell indices (position-sorted, padded with a
-+// masked/empty cell) by delegating to the kv-cache context. Private to this
-+// unit; default can_reuse()==false keeps the graph from being reused across
-+// decodes (n_gather grows every step).
-+class input_gather_idxs : public llm_graph_input_i {
-+public:
-+    input_gather_idxs(const llama_kv_cache_context * mctx, ggml_tensor * idxs)
-+        : mctx(mctx), idxs(idxs) {}
-+
-+    void set_input(const llama_ubatch * ubatch) override {
-+        GGML_UNUSED(ubatch);
-+        GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
-+        mctx->get_gather_idxs((int32_t *) idxs->data);
-+    }
-+
-+    const llama_kv_cache_context * mctx;
-+    ggml_tensor * idxs;
-+};
-+
-+} // namespace
-+
-+void gather(ggml_context * ctx0,
-+            llm_graph_result * res,
-+            const llama_kv_cache_context * mctx,
-+            ggml_tensor ** k,
-+            ggml_tensor ** v,
-+            ggml_tensor ** kq_mask) {
-+    if (!active()) {
-+        return;
-+    }
-+
-+    ggml_tensor * K = *k;
-+    ggml_tensor * V = *v;
-+    ggml_tensor * M = *kq_mask;
-+
-+    // Number of streams (sequences) in the unified batch. K is laid out
-+    // [d, h, n_kv, n_stream] and the mask is [n_kv, n_tps, 1, n_stream]; the
-+    // gather is per-stream (one index column per stream), so a single
-+    // ggml_get_rows over the stream axis handles 1..N streams uniformly.
-+    const int64_t n_stream = K->ne[3];
-+    GGML_ASSERT(M->ne[3] == n_stream);
-+
-+    const int64_t n_gather = (int64_t) mctx->get_n_gather();
-+    if (n_gather <= 0) {
-+        // Worst-case graph reserve (empty cache) or nothing placed yet: leave
-+        // the full [0, n_kv) read untouched so buffer sizing stays worst-case.
-+        return;
-+    }
-+
-+    if (debug()) {
-+        static int64_t once = 0;
-+        if (once++ < 2) {
-+            fprintf(stderr, "[paged-attn] gather n_stream=%lld n_kv=%lld n_gather=%lld\n",
-+                    (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
-+        }
-+    }
-+
-+    // Per-stream index tensor [n_gather, n_stream], filled at set_input from
-+    // each stream's non-empty cells. ggml_get_rows broadcasts along ne[1]==
-+    // n_stream, so column s gathers from stream s of the source.
-+    ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_gather, n_stream);
-+    ggml_set_input(idx);
-+    res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, idx)));
-+
-+    // --- gather K: collapse (head_dim, n_head) so cells become the row axis ---
-+    {
-+        ggml_tensor * t = ggml_cont(ctx0, K);                                          // [d, h, n_kv, ns]
-+        t = ggml_reshape_3d(ctx0, t, K->ne[0]*K->ne[1], K->ne[2], n_stream);           // [d*h, n_kv, ns]
-+        t = ggml_get_rows(ctx0, t, idx);                                               // [d*h, n_gather, ns]
-+        *k = ggml_reshape_4d(ctx0, t, K->ne[0], K->ne[1], n_gather, n_stream);         // [d, h, n_gather, ns]
-+    }
-+
-+    // --- gather V ---
-+    // Normalize to a non-transposed [d, h, n_kv, ns] view first, so the gathered
-+    // result is contiguous and build_attn_mha sees a consistent v_trans==false.
-+    {
-+        const bool v_trans = V->nb[1] > V->nb[2];
-+        ggml_tensor * vsrc = v_trans
-+            ? ggml_permute(ctx0, V, 2, 1, 0, 3)   // [n_kv, h, d, ns] -> [d, h, n_kv, ns]
-+            : V;                                  // already [d, h, n_kv, ns]
-+        ggml_tensor * t = ggml_cont(ctx0, vsrc);                                       // [d, h, n_kv, ns]
-+        t = ggml_reshape_3d(ctx0, t, vsrc->ne[0]*vsrc->ne[1], vsrc->ne[2], n_stream);  // [d*h, n_kv, ns]
-+        t = ggml_get_rows(ctx0, t, idx);                                               // [d*h, n_gather, ns]
-+        *v = ggml_reshape_4d(ctx0, t, vsrc->ne[0], vsrc->ne[1], n_gather, n_stream);   // [d, h, n_gather, ns]
-+    }
-+
-+    // --- gather mask (cells are ne0): transpose so cells become the row axis,
-+    //     gather per stream, transpose back ---
-+    {
-+        ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream);      // [n_kv, n_tps, ns]
-+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));                                  // [n_tps, n_kv, ns]
-+        m = ggml_get_rows(ctx0, m, idx);                                               // [n_tps, n_gather, ns] (F32)
-+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));                                  // [n_gather, n_tps, ns]
-+        m = ggml_reshape_4d(ctx0, m, n_gather, M->ne[1], 1, n_stream);
-+        if (M->type != m->type) {
-+            m = ggml_cast(ctx0, m, M->type);   // flash-attn requires an F16 mask
-+        }
-+        *kq_mask = m;
-+    }
-+}
-+
-+} // namespace paged_attn
-diff --git a/src/paged-attn.h b/src/paged-attn.h
-new file mode 100644
-index 0000000..c5b7bd7
---- /dev/null
-+++ b/src/paged-attn.h
-@@ -0,0 +1,40 @@
-+#pragma once
-+// Paged attention gather-read (patch 0003, experimental).
-+//
-+// Companion to the paged block placement in llama_kv_cache::find_slot (patch
-+// 0002). Patch 0002 places a sequence's tokens at permuted, non-contiguous
-+// fixed-size block cells, but attention still reads the whole [0, n_kv) window
-+// (empty cells masked to -inf). This unit compacts that read: it gathers K, V
-+// and the kq_mask down to ONLY the sequence's used (non-empty) cells before
-+// build_attn_mha.
-+//
-+// Correctness: attention is permutation-invariant over the KV set, and dropping
-+// already-masked empty cells removes only exp(-inf)=0 terms - so greedy output
-+// is identical to stock. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
-+//
-+// All logic lives here to keep the core files additive: build_attn gets one
-+// call, llama_kv_cache_context gets two thin accessors, CMake gets one line.
-+
-+#include <cstdint>
-+
-+struct ggml_context;
-+struct ggml_tensor;
-+class  llm_graph_result;
-+class  llama_kv_cache_context;
-+
-+namespace paged_attn {
-+
-+// true iff env LLAMA_KV_PAGED is set (evaluated once).
-+bool active();
-+
-+// Gather K, V and the kq_mask down to the current sequence's non-empty cells.
-+// No-op (returns immediately) unless active(). On return *k, *v and *kq_mask
-+// point at the compacted tensors; pass them straight to build_attn_mha.
-+void gather(ggml_context * ctx0,
-+            llm_graph_result * res,
-+            const llama_kv_cache_context * mctx,
-+            ggml_tensor ** k,
-+            ggml_tensor ** v,
-+            ggml_tensor ** kq_mask);
-+
-+} // namespace paged_attn
--- 
-2.43.0
-
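Reviewer note on the removed patch 0003: the core of `get_gather_idxs` is building one index column per stream, with used cells ordered by token position and the tail padded with a masked (empty) cell. The sketch below restates that step on a toy cell array so it can be read in isolation. It is an illustration under assumptions, not the patch's code: `toy_cells`, `toy_count_used` and `toy_gather_column` are hypothetical names, and a negative entry stands in for an empty cell.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one stream's cells: cells[i] is the token position
// stored in physical cell i, or a negative value if the cell is empty.
using toy_cells = std::vector<int32_t>;

// Number of non-empty cells in [0, n_kv).
uint32_t toy_count_used(const toy_cells & cells, uint32_t n_kv) {
    const uint32_t n = std::min<uint32_t>(n_kv, (uint32_t) cells.size());
    uint32_t cnt = 0;
    for (uint32_t i = 0; i < n; ++i) {
        if (cells[i] >= 0) {
            ++cnt;
        }
    }
    return cnt;
}

// One index column of length n_gather: used cells sorted by position, then
// padded with the first empty cell (whose mask would be -inf).
std::vector<int32_t> toy_gather_column(const toy_cells & cells, uint32_t n_kv, uint32_t n_gather) {
    assert(n_gather >= toy_count_used(cells, n_kv));
    const uint32_t n = std::min<uint32_t>(n_kv, (uint32_t) cells.size());
    std::vector<std::pair<int32_t, int32_t>> pc; // (position, physical cell)
    int32_t pad = -1;
    for (uint32_t i = 0; i < n; ++i) {
        if (cells[i] >= 0) {
            pc.emplace_back(cells[i], (int32_t) i);
        } else if (pad < 0) {
            pad = (int32_t) i;
        }
    }
    std::sort(pc.begin(), pc.end());
    std::vector<int32_t> col;
    col.reserve(n_gather);
    for (const auto & p : pc) {
        col.push_back(p.second);
    }
    // Same fallback order as the patch: first empty cell, else last used cell, else 0.
    const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
    while (col.size() < n_gather) {
        col.push_back(padv);
    }
    return col;
}
```

For example, with cells holding positions `{2, empty, 0, 1, empty, empty}` and `n_gather = 4`, the used cells come out in position order (physical cells 2, 3, 0) followed by one pad entry pointing at the first empty cell (1).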
--- a/backend/cpp/llama-cpp/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch
@@ -1,298 +0,0 @@
-From 7c294973de28d1ac991505638d726acfb371d541 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 10:50:35 +0200
-Subject: [PATCH] paged on-demand block allocation (env LLAMA_KV_PAGED) - patch
- 0004
-
-Drive the paged placement in find_slot through the vendored PagedKVManager
-(patch 0001) instead of a fixed full-pool permutation. Blocks are popped from a
-free pool on demand as the sequence crosses block boundaries (peak << full
-reservation) and returned on sequence end (seq_rm full removal / clear). One
-manager per (kv-cache, stream); all state lives in the new src/paged-alloc unit,
-so the core kv-cache struct is untouched - find_slot/clear/seq_rm gain only a
-gated call. Default off; stock path byte-identical.
----
- src/CMakeLists.txt     |   1 +
- src/llama-kv-cache.cpp |  69 +++++++++++++++++----------
- src/paged-alloc.cpp    | 106 +++++++++++++++++++++++++++++++++++++++++
- src/paged-alloc.h      |  39 +++++++++++++++
- 4 files changed, 190 insertions(+), 25 deletions(-)
- create mode 100644 src/paged-alloc.cpp
- create mode 100644 src/paged-alloc.h
-
-diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
-index 58083b3..4d9d7d1 100644
---- a/src/CMakeLists.txt
-+++ b/src/CMakeLists.txt
-@@ -26,6 +26,7 @@ add_library(llama
-             llama-kv-cache-iswa.cpp
-             paged-kv-manager.cpp
-             paged-attn.cpp
-+            paged-alloc.cpp
-             llama-kv-cache-dsa.cpp
-             llama-memory.cpp
-             llama-memory-hybrid.cpp
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 30d02d7..1125d9a 100644
---- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -1,4 +1,5 @@
- #include "llama-kv-cache.h"
-+#include "paged-alloc.h"
- #include <vector>
- #include <utility>
- 
-@@ -381,6 +382,11 @@ llama_kv_cache::llama_kv_cache(
- }
- 
- void llama_kv_cache::clear(bool data) {
-+    // [paged 0004] return all on-demand blocks to the pool on cache clear.
-+    if (paged_alloc::active()) {
-+        paged_alloc::release_all(this);
-+    }
-+
-     for (uint32_t s = 0; s < n_stream; ++s) {
-         v_cells[s].reset();
-         v_heads[s] = 0;
-@@ -409,6 +415,16 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
-         p1 = std::numeric_limits<llama_pos>::max();
-     }
- 
-+    // [paged 0004] free a stream's on-demand blocks when its whole sequence is
-+    // removed (sequence end), so they return to the pool for reuse.
-+    if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
-+        if (seq_id >= 0) {
-+            paged_alloc::release(this, (int) seq_to_stream[seq_id]);
-+        } else {
-+            paged_alloc::release_all(this);
-+        }
-+    }
-+
-     if (seq_id >= 0) {
-         auto & cells = v_cells[seq_to_stream[seq_id]];
-         auto & head  = v_heads[seq_to_stream[seq_id]];
-@@ -1030,36 +1046,39 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
-         // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
-         // Single-sequence scope (uses get_used() as the logical base); falls back
-         // to the normal allocator if the permuted cells aren't available.
--        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
--        if (paged_mode) {
-+        // [paged 0004] On-demand block allocation. Patch 0002 proved attention is
-+        // invariant to physical KV placement; here that placement is driven by
-+        // the vendored PagedKVManager (patch 0001): blocks are popped from a free
-+        // pool only as the sequence crosses block boundaries (peak << full
-+        // reservation) and returned on sequence end. Enabled via LLAMA_KV_PAGED;
-+        // falls back to the normal allocator on pool exhaustion or any conflict.
-+        if (paged_alloc::active()) {
-             const uint32_t bs   = 16;                 // block size (tokens/block)
--            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
-+            const uint32_t nblk = cells.size() / bs;  // this stream's block budget
-             if (nblk >= 2) {
--                // stride coprime to nblk => block-index permutation is a bijection
--                uint32_t k = 1;
--                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
--                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
--                }
-                 const uint32_t base = cells.get_used();
--                bool ok = true;
--                for (uint32_t i = 0; i < n_tokens; ++i) {
--                    const uint32_t L    = base + i;
--                    const uint32_t b    = L / bs;
--                    const uint32_t off  = L % bs;
--                    if (b >= nblk) { ok = false; break; }
--                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
--                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
--                    res.idxs[s].push_back(phys);
--                }
--                if (ok && res.idxs[s].size() == n_tokens) {
--                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
--                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
--                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
--                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
-+                const int      strm = (int) seq_to_stream[seq_id];
-+                std::vector<uint32_t> placed;
-+                if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
-+                    bool ok = (placed.size() == n_tokens);
-+                    for (uint32_t i = 0; ok && i < n_tokens; ++i) {
-+                        if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
-+                            ok = false;
-+                        }
-+                    }
-+                    if (ok) {
-+                        for (uint32_t phys : placed) {
-+                            res.idxs[s].push_back(phys);
-+                        }
-+                        if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
-+                            fprintf(stderr, "[paged] stream %d placed %u tok at cells:", strm, n_tokens);
-+                            for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
-+                            fprintf(stderr, " (nblk=%u base=%u)\n", nblk, base);
-+                        }
-+                        continue; // on-demand paged placement succeeded
-                     }
--                    continue; // paged placement succeeded for this sequence
-+                    res.idxs[s].clear(); // fall back to the normal allocator
-                 }
--                res.idxs[s].clear(); // fall back to the normal allocator
-             }
-         }
- 
-diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
-new file mode 100644
-index 0000000..1d13f9c
---- /dev/null
-+++ b/src/paged-alloc.cpp
-@@ -0,0 +1,106 @@
-+#include "paged-alloc.h"
-+#include "paged-kv-manager.h"
-+
-+#include <cstdlib>
-+#include <cstdio>
-+#include <map>
-+#include <memory>
-+#include <utility>
-+
-+namespace paged_alloc {
-+
-+bool active() {
-+    static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
-+    return a;
-+}
-+
-+static bool debug() {
-+    static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
-+    return d;
-+}
-+
-+namespace {
-+
-+using key_t = std::pair<const void *, int>;
-+
-+// One PagedKVManager per (kv-cache, stream): each stream owns a separate
-+// physical pool of cells.size() cells, so a manager's block ids map directly to
-+// cell ranges within that stream's pool. The internal request id is always 0.
-+std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
-+
-+paged::PagedKVManager * get_mgr(const void * cache, int stream,
-+                                uint32_t pool_blocks, uint32_t block_size) {
-+    const key_t k{cache, stream};
-+    auto it = g_managers.find(k);
-+    if (it == g_managers.end()) {
-+        // enable_caching=false: prefix caching is a later patch; 0004 exercises
-+        // only on-demand allocate / free.
-+        auto mgr = std::make_unique<paged::PagedKVManager>(
-+            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
-+        it = g_managers.emplace(k, std::move(mgr)).first;
-+    }
-+    return it->second.get();
-+}
-+
-+} // namespace
-+
-+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
-+           uint32_t block_size, uint32_t pool_blocks,
-+           std::vector<uint32_t> & out) {
-+    if (n_tokens == 0) {
-+        return true;
-+    }
-+
-+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
-+
-+    const size_t before = mgr->block_table(0).size();
-+
-+    // Grow the request to cover the highest logical position. The manager pops
-+    // free blocks only for the boundaries actually crossed - that is the on-
-+    // demand behavior; an already-covered range adds nothing.
-+    if (!mgr->allocate(0, (size_t) base + n_tokens)) {
-+        return false; // pool exhausted -> caller falls back to the stock path
-+    }
-+
-+    out.reserve(out.size() + n_tokens);
-+    for (uint32_t i = 0; i < n_tokens; ++i) {
-+        const int64_t s = mgr->slot(0, (int) (base + i));
-+        out.push_back((uint32_t) s);
-+    }
-+
-+    if (debug()) {
-+        const size_t after = mgr->block_table(0).size();
-+        if (after != before) {
-+            fprintf(stderr,
-+                    "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
-+                    "(budget=%u; base=%u +%u tok)\n",
-+                    cache, stream, before, after, pool_blocks, base, n_tokens);
-+        }
-+    }
-+
-+    return true;
-+}
-+
-+void release(const void * cache, int stream) {
-+    auto it = g_managers.find({cache, stream});
-+    if (it == g_managers.end()) {
-+        return;
-+    }
-+    it->second->free(0);
-+    g_managers.erase(it);
-+    if (debug()) {
-+        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
-+    }
-+}
-+
-+void release_all(const void * cache) {
-+    for (auto it = g_managers.begin(); it != g_managers.end(); ) {
-+        if (it->first.first == cache) {
-+            it = g_managers.erase(it);
-+        } else {
-+            ++it;
-+        }
-+    }
-+}
-+
-+} // namespace paged_alloc
-diff --git a/src/paged-alloc.h b/src/paged-alloc.h
-new file mode 100644
-index 0000000..bf66665
---- /dev/null
-+++ b/src/paged-alloc.h
-@@ -0,0 +1,39 @@
-+#pragma once
-+// On-demand paged KV block allocation (patch 0004, experimental).
-+//
-+// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
-+// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
-+// sequence's logical positions onto a fixed full-pool permutation, blocks are
-+// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
-+// and returned to the pool on sequence end. This is where the paged memory-
-+// capacity benefit begins: a short sequence holds only a few blocks, not the
-+// whole reserved window.
-+//
-+// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
-+// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
-+// struct stays untouched - find_slot only gains a gated call.
-+
-+#include <cstdint>
-+#include <vector>
-+
-+namespace paged_alloc {
-+
-+// true iff env LLAMA_KV_PAGED is set (evaluated once).
-+bool active();
-+
-+// Place n_tokens logical positions [base, base+n_tokens) of one stream on
-+// demand, appending their physical cell indices to `out`. pool_blocks =
-+// cells.size()/block_size is this stream's block budget. Returns false (leaving
-+// `out` unchanged) on pool exhaustion, so the caller falls back to the stock
-+// allocator. The caller still validates each returned cell is empty.
-+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
-+           uint32_t block_size, uint32_t pool_blocks,
-+           std::vector<uint32_t> & out);
-+
-+// Return a stream's blocks to the pool (sequence end).
-+void release(const void * cache, int stream);
-+
-+// Return every stream's blocks for a kv-cache (clear() / teardown).
-+void release_all(const void * cache);
-+
-+} // namespace paged_alloc
--- 
-2.43.0
-
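Reviewer note on the removed patch 0004: its `place` helper relies on an allocator that pops a block from a free pool only when a sequence crosses a block boundary, then maps each logical position to `block * block_size + offset`. The toy below sketches that behavior so the idea can be checked without the vendored manager. It is a hypothetical stand-in, not the `PagedKVManager` API: `toy_block_pool` and all of its members are invented for illustration, and it tracks a single sequence only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical single-sequence block allocator (not the vendored manager).
struct toy_block_pool {
    uint32_t              block_size;
    std::vector<uint32_t> free_blocks; // physical block ids, popped from the back
    std::vector<uint32_t> table;       // logical block index -> physical block id

    toy_block_pool(uint32_t n_blocks, uint32_t bs) : block_size(bs) {
        assert(bs > 0);
        for (uint32_t b = n_blocks; b > 0; --b) {
            free_blocks.push_back(b - 1); // back() ends up being block 0
        }
    }

    // Grow the table to cover positions [0, n_tokens). Blocks are taken only
    // for boundaries not yet covered. Returns false and changes nothing if
    // the pool cannot satisfy the request.
    bool allocate(uint32_t n_tokens) {
        const uint32_t need = (n_tokens + block_size - 1) / block_size;
        if (need > table.size() && need - table.size() > free_blocks.size()) {
            return false;
        }
        while (table.size() < need) {
            table.push_back(free_blocks.back());
            free_blocks.pop_back();
        }
        return true;
    }

    // Physical cell for a logical position that allocate() already covers.
    uint32_t slot(uint32_t pos) const {
        assert(pos / block_size < table.size());
        return table[pos / block_size] * block_size + pos % block_size;
    }

    // Return every held block to the pool (sequence end).
    void release() {
        while (!table.empty()) {
            free_blocks.push_back(table.back());
            table.pop_back();
        }
    }
};
```

With 4 blocks of 16 tokens, covering 10 tokens takes one block, growing to 17 tokens takes a second, and a request for 65 tokens fails without touching the table, which mirrors the "fall back to the normal allocator" path in the removed `find_slot` hunk.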
--- a/backend/cpp/llama-cpp/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch
@@ -1,143 +0,0 @@
-From 141029beec609e87f24f6f6bba3ec842d7037862 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 12:13:44 +0200
-Subject: [PATCH] paged cross-request prefix caching (env LLAMA_KV_PAGED) -
- patch 0006
-
-Add host-side cross-request prefix sharing to the vendored PagedKVManager
-(patches 0001-0004): on placement, hash a new sequence's prefix blocks, reuse the
-matching cached physical blocks (ref_cnt++) for the shared prefix and allocate
-fresh blocks only for the divergent suffix. A shared block is freed only at
-ref 0; copy-on-write privatises a still-shared (ref>1) block before a divergent
-write so co-owners stay byte-correct. All logic lives in the vendored
-src/paged-kv-manager unit (place_with_prefix / cow_block / ref-counting); the
-core kv-cache files are untouched. Default off; gated behind LLAMA_KV_PAGED.
-
-Wiring the physical-cell reuse into find_slot so the engine itself skips
-recompute needs core seq-membership changes and is left to a later patch.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- src/paged-kv-manager.cpp | 65 ++++++++++++++++++++++++++++++++++++++++
- src/paged-kv-manager.h   | 23 ++++++++++++++
- 2 files changed, 88 insertions(+)
-
-diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
-index ca0dcd8..4c6ee4c 100644
---- a/src/paged-kv-manager.cpp
-+++ b/src/paged-kv-manager.cpp
-@@ -293,4 +293,69 @@ void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block
-     pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
- }
- 
-+// ---------------------------------------------------------------------------
-+// Cross-request prefix caching + copy-on-write  (patch 0006)
-+// ---------------------------------------------------------------------------
-+
-+size_t PagedKVManager::place_with_prefix(int seq_id, const std::vector<int>& token_ids) {
-+    auto& req = req_to_blocks_[seq_id];
-+
-+    // Longest cached prefix: hash the full blocks and stop at the first miss.
-+    // A block hash transitively encodes its whole prefix (FNV chaining), so the
-+    // first miss bounds the reusable prefix (vLLM find_longest_cache_hit).
-+    const std::vector<uint64_t> hashes = compute_block_hashes(token_ids);
-+    std::vector<KVCacheBlock*> hits;
-+    for (uint64_t bh : hashes) {
-+        KVCacheBlock* cb = pool_.get_cached_block(bh);
-+        if (!cb) break;
-+        hits.push_back(cb);
-+    }
-+
-+    // Reuse: ++ref_cnt (pulling warm blocks back out of the free list) then
-+    // splice the shared physical blocks into this sequence's block table.
-+    pool_.touch(hits);
-+    req.insert(req.end(), hits.begin(), hits.end());
-+
-+    // Allocate fresh blocks only for the divergent suffix.
-+    const size_t need = cdiv(token_ids.size(), block_size_);
-+    if (need > req.size()) {
-+        const size_t add = need - req.size();
-+        if (add > pool_.get_num_free_blocks()) {
-+            // OOM: roll the sequence back (un-touch the shared prefix so no ref
-+            // leaks) and report no placement; the caller falls back to stock.
-+            std::vector<KVCacheBlock*> ordered(req.rbegin(), req.rend());
-+            pool_.free_blocks(ordered);
-+            req.clear();
-+            return 0;
-+        }
-+        auto nb = pool_.get_new_blocks(add);
-+        req.insert(req.end(), nb.begin(), nb.end());
-+    }
-+    return hits.size();
-+}
-+
-+std::pair<int32_t, int32_t> PagedKVManager::cow_block(int seq_id, size_t bi) {
-+    auto& req = req_to_blocks_.at(seq_id);
-+    KVCacheBlock* old = req.at(bi);
-+    if (old->ref_cnt <= 1) {
-+        return { old->block_id, old->block_id }; // already private - no copy
-+    }
-+    // Private copy for this sequence. get_new_blocks sets the fresh block's
-+    // ref_cnt to 1; free_blocks decrements the shared block, which stays >0 so
-+    // it is NOT returned to the pool and the other owners are left untouched.
-+    KVCacheBlock* fresh = pool_.get_new_blocks(1).front();
-+    pool_.free_blocks({ old });
-+    req[bi] = fresh;
-+    return { old->block_id, fresh->block_id };
-+}
-+
-+int PagedKVManager::block_ref_cnt_at(int seq_id, size_t bi) const {
-+    return req_to_blocks_.at(seq_id).at(bi)->ref_cnt;
-+}
-+
-+size_t PagedKVManager::num_blocks(int seq_id) const {
-+    auto it = req_to_blocks_.find(seq_id);
-+    return it == req_to_blocks_.end() ? 0 : it->second.size();
-+}
-+
- } // namespace paged
-diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
-index 740280a..34decbc 100644
---- a/src/paged-kv-manager.h
-+++ b/src/paged-kv-manager.h
-@@ -14,6 +14,7 @@
- #include <vector>
- #include <unordered_map>
- #include <map>
-+#include <utility>
- 
- namespace paged {
- 
-@@ -99,6 +100,28 @@ public:
-     size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
-     void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
- 
-+    // Cross-request prefix caching + copy-on-write (patch 0006).
-+    //
-+    // Splice the longest cached prefix of token_ids into seq_id (reuse the
-+    // shared physical blocks, ref_cnt++ so a block frees only at ref 0) and
-+    // allocate fresh blocks only for the divergent suffix. Returns the number of
-+    // shared (reused) blocks; the caller skips recomputing those tokens. On pool
-+    // exhaustion the sequence is rolled back (no ref leak) and 0 is returned.
-+    size_t place_with_prefix(int seq_id, const std::vector<int>& token_ids);
-+
-+    // Copy-on-write the block at logical index bi of seq_id. If that block is
-+    // shared (ref_cnt>1), allocate a fresh private block, drop this seq's ref on
-+    // the shared one (other owners keep it, content untouched) and install the
-+    // fresh block at bi. Returns {old_block_id, new_block_id}; new==old when the
-+    // block was already private (ref_cnt<=1) and no copy is needed. The caller
-+    // copies the physical cell contents old_block_id -> new_block_id.
-+    std::pair<int32_t, int32_t> cow_block(int seq_id, size_t bi);
-+
-+    // Introspection for the prefix-share gate (debug/tests).
-+    int    block_ref_cnt_at(int seq_id, size_t bi) const;
-+    size_t num_blocks(int seq_id) const;
-+    size_t num_free_blocks() const { return pool_.get_num_free_blocks(); }
-+
- protected:
-     int block_size_;
-     BlockPool pool_;
--- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch
@@ -1,531 +0,0 @@
-From da20c1c0571e84bc76202d915d4bb82892a3392b Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 12:46:28 +0200
-Subject: [PATCH] paged engine prefix recompute-skip (env LLAMA_KV_PAGED) -
- patch 0007
-
-Wire the host-side cross-request prefix cache (patch 0006) into the engine so a
-new sequence physically SHARES the cached prefix blocks and skips recomputing the
-shared prefix - the actual compute win that 0006 (which only proved the host-side
-machinery + realised reuse via the stock seq_cp) did not yet deliver from the
-paged path itself.
-
-Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
-
-  * paged-alloc reworked from a per-stream, request-0, destroyed-on-free manager
-    into ONE persistent caching PagedKVManager per (kv-cache, stream) whose
-    requests are keyed by the real llama_seq_id. free(seq) now releases exactly
-    one sequence, so ref-counted shared blocks survive while another sharer holds
-    them. New seams: share_prefix (place_with_prefix -> shared prefix tokens),
-    slot, commit (publish a sequence into the content cache), ref-counted release,
-    plus ref/num-free introspection.
-
-  * Two gated llama_kv_cache methods (the core seq-membership handling 0007 needs):
-    paged_prefix_share() reuses the longest cached content prefix for a sequence
-    and marks the shared physical cells as belonging to it (cells.seq_add) so the
-    engine's attention mask includes the already-computed prefix KV; the caller
-    then decodes ONLY the divergent suffix. paged_prefix_commit() publishes a
-    sequence's full blocks for later reuse.
-
-  * find_slot's paged branch anchors placement on each sequence's own logical base
-    (ubatch.pos) and keys the manager request by seq_id, so an independently-freed
-    sequence and a shared prefix coexist in one unified pool. seq_rm/clear free
-    per-sequence (ref-counted) instead of nuking the whole stream.
-
-  * paged-prefix-api: a thin gated shim so a caller holding only the public
-    llama.h can reach the seam and the introspection without the internal headers.
-
-Core existing-file touch: src/llama-kv-cache.{cpp,h}, +71 -3. Everything else is
-additive vendored units. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): a
-sequence B sharing A's prefix decodes greedy tokens byte-identical to B from
-scratch with the prefill computing ONLY the suffix (32 prefix tokens skipped) at
-a block boundary AND mid-block; the shared block carries ref_cnt 2 while both
-hold it, drops to 1 when one sharer is removed (survivor intact, re-shareable, no
-use-after-free) and returns to the pool only when all sharers are freed. The
-0004 serving gate (unified and non-unified) stays byte-identical stock vs paged.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- src/CMakeLists.txt       |   1 +
- src/llama-kv-cache.cpp   |  66 +++++++++++++++++++++++--
- src/llama-kv-cache.h     |   8 +++
- src/paged-alloc.cpp      | 104 ++++++++++++++++++++++++++++++---------
- src/paged-alloc.h        |  69 +++++++++++++++++++-------
- src/paged-prefix-api.cpp |  48 ++++++++++++++++++
- src/paged-prefix-api.h   |  27 ++++++++++
- 7 files changed, 280 insertions(+), 43 deletions(-)
- create mode 100644 src/paged-prefix-api.cpp
- create mode 100644 src/paged-prefix-api.h
-
-diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
-index 4d9d7d1..432f42d 100644
---- a/src/CMakeLists.txt
-+++ b/src/CMakeLists.txt
-@@ -27,6 +27,7 @@ add_library(llama
-             paged-kv-manager.cpp
-             paged-attn.cpp
-             paged-alloc.cpp
-+            paged-prefix-api.cpp
-             llama-kv-cache-dsa.cpp
-             llama-memory.cpp
-             llama-memory-hybrid.cpp
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 1125d9a..7510ff9 100644
---- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -419,7 +419,7 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
-     // removed (sequence end), so they return to the pool for reuse.
-     if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
-         if (seq_id >= 0) {
--            paged_alloc::release(this, (int) seq_to_stream[seq_id]);
-+            paged_alloc::release(this, (int) seq_to_stream[seq_id], (int) seq_id);
-         } else {
-             paged_alloc::release_all(this);
-         }
-@@ -1056,10 +1056,15 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
-             const uint32_t bs   = 16;                 // block size (tokens/block)
-             const uint32_t nblk = cells.size() / bs;  // this stream's block budget
-             if (nblk >= 2) {
--                const uint32_t base = cells.get_used();
-+                // [paged 0007] Anchor placement on this sequence's own logical
-+                // base position (ubatch.pos), not the shared used-count, and key
-+                // the manager request by the real seq_id. slot(seq,pos) is then
-+                // stable per sequence, so an independently-freed (ref-counted)
-+                // sequence and a shared prefix can coexist in one unified pool.
-+                const uint32_t base = (uint32_t) ubatch.pos[s*n_tokens];
-                 const int      strm = (int) seq_to_stream[seq_id];
-                 std::vector<uint32_t> placed;
--                if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
-+                if (paged_alloc::place(this, strm, (int) seq_id, base, n_tokens, bs, nblk, placed)) {
-                     bool ok = (placed.size() == n_tokens);
-                     for (uint32_t i = 0; ok && i < n_tokens; ++i) {
-                         if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
-@@ -1165,6 +1170,61 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
-     return res;
- }
- 
-+// [paged 0007] Cross-request prefix recompute-skip.
-+//
-+// Reuse a cached content prefix for seq_id: share_prefix() splices the longest
-+// matching cached physical blocks into seq_id (ref_cnt++) and reserves fresh
-+// blocks for the divergent suffix. We then mark the shared physical cells as
-+// belonging to seq_id - those cells already hold the owner's computed KV at the
-+// matching logical positions, so the caller decodes ONLY the suffix and the
-+// prefix is never recomputed. Returns the number of shared prefix tokens.
-+// Gated behind LLAMA_KV_PAGED; a no-op (returns 0) otherwise.
-+int32_t llama_kv_cache::paged_prefix_share(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
-+    if (!paged_alloc::active() || tokens.empty()) {
-+        return 0;
-+    }
-+    const uint32_t bs   = 16;
-+    const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
-+    auto & cells = v_cells[strm];
-+    const uint32_t nblk = cells.size() / bs;
-+    if (nblk < 2) {
-+        return 0;
-+    }
-+
-+    std::vector<int> toks(tokens.begin(), tokens.end());
-+    const size_t kshare = paged_alloc::share_prefix(this, (int) strm, (int) seq_id, toks, bs, nblk);
-+
-+    for (size_t p = 0; p < kshare; ++p) {
-+        const int64_t cell = paged_alloc::slot(this, (int) strm, (int) seq_id, (int) p);
-+        if (cell < 0 || (uint32_t) cell >= cells.size() ||
-+            cells.is_empty((uint32_t) cell) ||
-+            cells.pos_get((uint32_t) cell) != (llama_pos) p) {
-+            // Owner cell missing / repurposed: cannot safely share. Roll the
-+            // sequence back so the caller recomputes the whole prompt.
-+            paged_alloc::release(this, (int) strm, (int) seq_id);
-+            return 0;
-+        }
-+        if (!cells.seq_has((uint32_t) cell, seq_id)) {
-+            cells.seq_add((uint32_t) cell, seq_id);
-+        }
-+    }
-+    return (int32_t) kshare;
-+}
-+
-+// [paged 0007] Publish a sequence's full blocks into the content cache so a
-+// later paged_prefix_share() can reuse them. Call after the sequence KV is
-+// computed (its prefill decode has run).
-+void llama_kv_cache::paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
-+    if (!paged_alloc::active() || tokens.empty()) {
-+        return;
-+    }
-+    const uint32_t bs   = 16;
-+    const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
-+    const uint32_t nblk = v_cells[strm].size() / bs;
-+    std::vector<int> toks(tokens.begin(), tokens.end());
-+    paged_alloc::commit(this, (int) strm, (int) seq_id, toks, bs, nblk);
-+}
-+
- void llama_kv_cache::apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch) {
-     // TODO: refactor [TAG_KV_CACHE_SHARE_CELLS]
-     if (other) {
-diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
-index 494c0fb..f374ac6 100644
---- a/src/llama-kv-cache.h
-+++ b/src/llama-kv-cache.h
-@@ -199,6 +199,14 @@ public:
-     // emplace the ubatch context into slot: [sinfo.idxs[0...ubatch.n_tokens - 1]]
-     void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
- 
-+    // [paged 0007] Cross-request prefix recompute-skip (experimental, gated by
-+    // env LLAMA_KV_PAGED). paged_prefix_share() reuses a cached content prefix
-+    // for seq_id and returns the number of shared prefix tokens (the caller
-+    // decodes only the suffix); paged_prefix_commit() publishes a sequence into
-+    // the content cache for later reuse. No-ops when LLAMA_KV_PAGED is unset.
-+    int32_t paged_prefix_share (llama_seq_id seq_id, const std::vector<llama_token> & tokens);
-+    void    paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens);
-+
-     //
-     // input API
-     //
-diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
-index 1d13f9c..c1027fb 100644
---- a/src/paged-alloc.cpp
-+++ b/src/paged-alloc.cpp
-@@ -23,9 +23,13 @@ namespace {
- 
- using key_t = std::pair<const void *, int>;
- 
--// One PagedKVManager per (kv-cache, stream): each stream owns a separate
--// physical pool of cells.size() cells, so a manager's block ids map directly to
--// cell ranges within that stream's pool. The internal request id is always 0.
-+// One persistent PagedKVManager per (kv-cache, stream): each stream owns a
-+// separate physical pool of cells.size() cells, so a manager's block ids map
-+// directly to cell ranges within that stream's pool. Requests inside a manager
-+// are keyed by the real llama_seq_id (NOT a fixed 0), so free(seq) releases one
-+// sequence and shared blocks survive at ref>0 - this is what makes ref-counted
-+// cross-request prefix sharing (0007) possible. Caching is enabled so commit()
-+// can publish blocks and share_prefix() can hit them.
- std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
- 
- paged::PagedKVManager * get_mgr(const void * cache, int stream,
-@@ -33,18 +37,21 @@ paged::PagedKVManager * get_mgr(const void * cache, int stream,
-     const key_t k{cache, stream};
-     auto it = g_managers.find(k);
-     if (it == g_managers.end()) {
--        // enable_caching=false: prefix caching is a later patch; 0004 exercises
--        // only on-demand allocate / free.
-         auto mgr = std::make_unique<paged::PagedKVManager>(
--            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
-+            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/true);
-         it = g_managers.emplace(k, std::move(mgr)).first;
-     }
-     return it->second.get();
- }
- 
-+paged::PagedKVManager * find_mgr(const void * cache, int stream) {
-+    auto it = g_managers.find({cache, stream});
-+    return it == g_managers.end() ? nullptr : it->second.get();
-+}
-+
- } // namespace
- 
--bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
-+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
-            uint32_t block_size, uint32_t pool_blocks,
-            std::vector<uint32_t> & out) {
-     if (n_tokens == 0) {
-@@ -53,43 +60,79 @@ bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
- 
-     paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
- 
--    const size_t before = mgr->block_table(0).size();
-+    const size_t before = mgr->block_table(seq).size();
- 
--    // Grow the request to cover the highest logical position. The manager pops
--    // free blocks only for the boundaries actually crossed - that is the on-
--    // demand behavior; an already-covered range adds nothing.
--    if (!mgr->allocate(0, (size_t) base + n_tokens)) {
-+    // Grow this sequence's request to cover its highest logical position. The
-+    // manager pops free blocks only for boundaries actually crossed; if
-+    // share_prefix() already reserved these blocks, this is a no-op.
-+    if (!mgr->allocate(seq, (size_t) base + n_tokens)) {
-         return false; // pool exhausted -> caller falls back to the stock path
-     }
- 
-     out.reserve(out.size() + n_tokens);
-     for (uint32_t i = 0; i < n_tokens; ++i) {
--        const int64_t s = mgr->slot(0, (int) (base + i));
-+        const int64_t s = mgr->slot(seq, (int) (base + i));
-         out.push_back((uint32_t) s);
-     }
- 
-     if (debug()) {
--        const size_t after = mgr->block_table(0).size();
-+        const size_t after = mgr->block_table(seq).size();
-         if (after != before) {
-             fprintf(stderr,
--                    "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
-+                    "[paged-alloc] cache=%p stream=%d seq=%d grew %zu->%zu blocks "
-                     "(budget=%u; base=%u +%u tok)\n",
--                    cache, stream, before, after, pool_blocks, base, n_tokens);
-+                    cache, stream, seq, before, after, pool_blocks, base, n_tokens);
-         }
-     }
- 
-     return true;
- }
- 
--void release(const void * cache, int stream) {
--    auto it = g_managers.find({cache, stream});
--    if (it == g_managers.end()) {
-+size_t share_prefix(const void * cache, int stream, int seq,
-+                    const std::vector<int> & tokens,
-+                    uint32_t block_size, uint32_t pool_blocks) {
-+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
-+    const size_t shared_blocks = mgr->place_with_prefix(seq, tokens);
-+    const size_t shared_tokens = shared_blocks * (size_t) block_size;
-+    if (debug() && shared_blocks > 0) {
-+        fprintf(stderr,
-+                "[paged-alloc] cache=%p stream=%d seq=%d shares %zu prefix blocks "
-+                "(%zu tokens) - prefix NOT recomputed\n",
-+                cache, stream, seq, shared_blocks, shared_tokens);
-+    }
-+    return shared_tokens;
-+}
-+
-+int64_t slot(const void * cache, int stream, int seq, int pos) {
-+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
-+    if (!mgr) {
-+        return -1;
-+    }
-+    if ((size_t) (pos / mgr->block_size()) >= mgr->num_blocks(seq)) {
-+        return -1;
-+    }
-+    return mgr->slot(seq, pos);
-+}
-+
-+void commit(const void * cache, int stream, int seq,
-+            const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks) {
-+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
-+    mgr->cache_blocks(seq, mgr->compute_block_hashes(tokens), tokens.size());
-+    if (debug()) {
-+        fprintf(stderr, "[paged-alloc] cache=%p stream=%d seq=%d committed %zu tokens\n",
-+                cache, stream, seq, tokens.size());
-+    }
-+}
-+
-+void release(const void * cache, int stream, int seq) {
-+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
-+    if (!mgr) {
-         return;
-     }
--    it->second->free(0);
--    g_managers.erase(it);
-+    mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
-     if (debug()) {
--        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
-+        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
-+                cache, stream, seq, mgr->num_free_blocks());
-     }
- }
- 
-@@ -103,4 +146,21 @@ void release_all(const void * cache) {
-     }
- }
- 
-+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size) {
-+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
-+    if (!mgr) {
-+        return -1;
-+    }
-+    const size_t bi = (size_t) pos / block_size;
-+    if (bi >= mgr->num_blocks(seq)) {
-+        return -1;
-+    }
-+    return mgr->block_ref_cnt_at(seq, bi);
-+}
-+
-+size_t num_free(const void * cache, int stream) {
-+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
-+    return mgr ? mgr->num_free_blocks() : 0;
-+}
-+
- } // namespace paged_alloc
-diff --git a/src/paged-alloc.h b/src/paged-alloc.h
-index bf66665..88dedef 100644
---- a/src/paged-alloc.h
-+++ b/src/paged-alloc.h
-@@ -1,17 +1,27 @@
- #pragma once
--// On-demand paged KV block allocation (patch 0004, experimental).
-+// On-demand paged KV block allocation + cross-request prefix reuse
-+// (patches 0004 + 0007, experimental).
- //
--// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
--// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
--// sequence's logical positions onto a fixed full-pool permutation, blocks are
--// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
--// and returned to the pool on sequence end. This is where the paged memory-
--// capacity benefit begins: a short sequence holds only a few blocks, not the
--// whole reserved window.
-+// Backs the paged placement in llama_kv_cache::find_slot with the vendored
-+// host-side PagedKVManager (patch 0001). Two responsibilities:
- //
--// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
--// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
--// struct stays untouched - find_slot only gains a gated call.
-+//   * On-demand allocation (0004): a sequence's logical positions are mapped to
-+//     physical cells block-by-block, popped from a free pool only as the
-+//     sequence grows and returned on sequence end.
-+//
-+//   * Cross-request prefix reuse (0007): before a new sequence's suffix is
-+//     decoded, share_prefix() reuses the cached physical blocks of a matching
-+//     content prefix (ref_cnt++), so the engine shares the already-computed KV
-+//     cells and the caller decodes ONLY the divergent suffix - the prefix is not
-+//     recomputed. commit() publishes a sequence's full blocks into the content
-+//     cache so later sequences can hit them. Freeing is ref-counted: a shared
-+//     block returns to the pool only when every sharer has been released.
-+//
-+// One persistent PagedKVManager per (kv-cache, stream); requests inside it are
-+// keyed by the real llama_seq_id, so free(seq) releases exactly one sequence and
-+// shared blocks survive at ref>0. All state lives in this unit (a static
-+// registry), so the core kv-cache struct stays untouched - find_slot gains only
-+// gated calls. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
- 
- #include <cstdint>
- #include <vector>
-@@ -21,19 +31,42 @@ namespace paged_alloc {
- // true iff env LLAMA_KV_PAGED is set (evaluated once).
- bool active();
- 
--// Place n_tokens logical positions [base, base+n_tokens) of one stream on
--// demand, appending their physical cell indices to `out`. pool_blocks =
--// cells.size()/block_size is this stream's block budget. Returns false (leaving
-+// Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
-+// on demand, appending their physical cell indices to `out`. pool_blocks =
-+// cells.size()/block_size is the stream's block budget. Returns false (leaving
- // `out` unchanged) on pool exhaustion, so the caller falls back to the stock
- // allocator. The caller still validates each returned cell is empty.
--bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
-+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
-            uint32_t block_size, uint32_t pool_blocks,
-            std::vector<uint32_t> & out);
- 
--// Return a stream's blocks to the pool (sequence end).
--void release(const void * cache, int stream);
-+// [0007] Reuse the longest cached content prefix of `tokens` for (cache,stream,
-+// seq): splice the shared physical blocks into seq (ref_cnt++) and reserve fresh
-+// blocks for the divergent suffix. Returns the number of shared PREFIX TOKENS
-+// (block-aligned); the caller marks those cells for seq and decodes only the
-+// suffix. 0 if nothing matched or on pool exhaustion (sequence rolled back).
-+size_t share_prefix(const void * cache, int stream, int seq,
-+                    const std::vector<int> & tokens,
-+                    uint32_t block_size, uint32_t pool_blocks);
-+
-+// [0007] Physical cell backing logical position `pos` of (cache,stream,seq), or
-+// -1 if seq is unknown. Used to map a shared prefix position to its cell.
-+int64_t slot(const void * cache, int stream, int seq, int pos);
- 
--// Return every stream's blocks for a kv-cache (clear() / teardown).
-+// [0007] Publish seq's full (block-aligned) blocks into the content cache so a
-+// later share_prefix() can reuse them. Call after the sequence's KV is computed.
-+void commit(const void * cache, int stream, int seq,
-+            const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
-+
-+// Return one sequence's blocks to the pool (ref-counted; sequence end).
-+void release(const void * cache, int stream, int seq);
-+
-+// Drop every manager for a kv-cache (clear() / teardown).
- void release_all(const void * cache);
- 
-+// Introspection for the prefix-share gate (debug/tests). ref_cnt_at returns the
-+// ref count of the block backing logical position `pos`, or -1 if unknown.
-+int    ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
-+size_t num_free(const void * cache, int stream);
-+
- } // namespace paged_alloc
-diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
-new file mode 100644
-index 0000000..8573cd2
---- /dev/null
-+++ b/src/paged-prefix-api.cpp
-@@ -0,0 +1,48 @@
-+#include "paged-prefix-api.h"
-+#include "paged-alloc.h"
-+#include "llama-kv-cache.h"
-+
-+#include <vector>
-+
-+namespace paged_prefix_api {
-+
-+static llama_kv_cache * kv_of(llama_context * ctx) {
-+    // The driver targets a plain unified KV-cache model; dynamic_cast yields null
-+    // for wrapped caches (iSWA / hybrid), where cross-request cell sharing does
-+    // not apply, so the shim degrades to a safe no-op.
-+    return dynamic_cast<llama_kv_cache *>(llama_get_memory(ctx));
-+}
-+
-+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
-+    llama_kv_cache * kv = kv_of(ctx);
-+    if (!kv || n <= 0) {
-+        return 0;
-+    }
-+    return kv->paged_prefix_share(seq, std::vector<llama_token>(tokens, tokens + n));
-+}
-+
-+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
-+    llama_kv_cache * kv = kv_of(ctx);
-+    if (!kv || n <= 0) {
-+        return;
-+    }
-+    kv->paged_prefix_commit(seq, std::vector<llama_token>(tokens, tokens + n));
-+}
-+
-+int ref_at(llama_context * ctx, llama_seq_id seq, int pos) {
-+    llama_kv_cache * kv = kv_of(ctx);
-+    if (!kv) {
-+        return -1;
-+    }
-+    return paged_alloc::ref_cnt_at((const void *) kv, /*stream=*/0, (int) seq, pos, /*block_size=*/16);
-+}
-+
-+long num_free(llama_context * ctx) {
-+    llama_kv_cache * kv = kv_of(ctx);
-+    if (!kv) {
-+        return 0;
-+    }
-+    return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
-+}
-+
-+} // namespace paged_prefix_api
-diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
-new file mode 100644
-index 0000000..78a3864
---- /dev/null
-+++ b/src/paged-prefix-api.h
-@@ -0,0 +1,27 @@
-+#pragma once
-+// Thin test/diagnostic shim over the paged cross-request prefix engine seam
-+// (patch 0007). Lets a driver that only includes the public llama.h reach the
-+// gated llama_kv_cache::paged_prefix_* methods and the paged-alloc introspection
-+// without pulling in the internal kv-cache headers. All entry points are no-ops
-+// (return 0) unless env LLAMA_KV_PAGED is set. Experimental; not a stable API.
-+
-+#include "llama.h"
-+
-+namespace paged_prefix_api {
-+
-+// Reuse the longest cached content prefix of [tokens, tokens+n) for `seq` and
-+// return the number of shared prefix tokens (the caller decodes only the
-+// suffix). 0 if nothing was shared.
-+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
-+
-+// Publish `seq`'s full blocks into the content cache (call after its KV is computed).
-+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
-+
-+// Ref count of the paged block backing logical position `pos` of `seq` (unified
-+// stream 0), or -1 if unknown.
-+int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
-+
-+// Number of free blocks in the unified stream-0 pool, or 0 if no manager.
-+long num_free(llama_context * ctx);
-+
-+} // namespace paged_prefix_api
--- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
@@ -1,130 +0,0 @@
-From 088d58f3a0160cbc706226ac2e77ecfeae4c164a Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 17:02:22 +0200
-Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
- - patch 0008
-
-Wire the paged cross-request prefix recompute-skip (patch 0007's engine seam,
-paged_prefix_api::share/commit) into the llama-server continuous-batching loop
-(update_slots) so CONCURRENT requests that share a long prefix physically reuse
-one committed copy of the prefix blocks and prefill only their divergent suffix.
-Patch 0007 proved the engine seam correct via a standalone driver, but the server
-never called it: two concurrent shared-prefix requests each recomputed the full
-prefix. The server's native prompt cache only reuses a slot's OWN prior prompt
-(longest-common-prefix vs slot.prompt.tokens) - it does not share across distinct
-concurrent slots. 0008 adds that cross-slot share.
-
-Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
-
-  * In update_slots prompt-processing, after the native n_past is computed and
-    only for a FRESH slot (n_past < one block, i.e. the native cache did not
-    already cover the prefix), call paged_prefix_api::share() to splice the
-    longest committed cross-request prefix into this sequence (ref_cnt++ on the
-    shared physical blocks) and advance n_past past it, so the batch fill computes
-    ONLY the suffix. The slot's own divergent tail cells are removed first so the
-    shared cells own [n_past, kshare) without colliding (the native path removes
-    these later anyway). The n_past < block gate guarantees any block-aligned
-    share the engine returns is strictly larger than n_past and therefore always
-    adopted, so the engine's reservation always matches the suffix-only batch and
-    never leaves stale blocks (which otherwise fragment the paged pool).
-
-  * When a slot finishes prefill (SLOT_STATE_DONE_PROMPT -> GENERATING, the prefix
-    KV just computed), call paged_prefix_api::commit() to publish its prefix so
-    concurrent/later sharers can reuse it.
-
-The share() / commit() entry points are forward-declared (defined in libllama,
-src/paged-prefix-api.cpp) to avoid pulling internal kv-cache headers into the
-server translation unit.
-
-Verified in the server (32B NVFP4, CUDA, --kv-unified): with a live sequence
-holding the prefix, K=16/32 concurrent shared-prefix requests prefill only their
-~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens;
-K=16 23.9s -> 1.5s, K=32 57.9s -> 2.3s), the engine logs "shares ... prefix
-blocks - NOT recomputed" with ref_cnt>1, and greedy output stays within the
-documented CUDA batch-shape non-determinism band (stock native prompt-caching
-shows the same magnitude). Cross-request sharing requires the unified KV cache.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- tools/server/server-context.cpp | 50 +++++++++++++++++++++++++++++++++
- 1 file changed, 50 insertions(+)
-
-diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index da6a475..04c6361 100644
---- a/tools/server/server-context.cpp
-+++ b/tools/server/server-context.cpp
-@@ -15,6 +15,16 @@
- #include "mtmd.h"
- #include "mtmd-helper.h"
- 
-+// [paged 0008] Cross-request prefix recompute-skip shim. share()/commit() are
-+// defined in libllama (src/paged-prefix-api.cpp, patch 0007) and are no-ops
-+// unless env LLAMA_KV_PAGED is set. Declared here so the paged cross-slot prefix
-+// cache wires into update_slots() without pulling in internal kv-cache headers.
-+// Fully gated; stock (paged off) is byte-identical.
-+namespace paged_prefix_api {
-+    int32_t share (llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
-+    void    commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
-+}
-+
- #include <algorithm>
- #include <cstddef>
- #include <cinttypes>
-@@ -3007,6 +3017,37 @@ private:
-                             }
-                         }
- 
-+                        // [paged 0008] Cross-request prefix recompute-skip. The native prompt cache
-+                        // above only reuses THIS slot's own prior prompt; when the paged KV
-+                        // engine is active, also reuse a committed CROSS-slot prefix so
-+                        // concurrent requests sharing a long prefix skip recompute. Gated on
-+                        // LLAMA_KV_PAGED (paged_kv_share static); stock stays byte-identical.
-+                        static const bool paged_kv_share = getenv("LLAMA_KV_PAGED") != nullptr;
-+                        // Only attempt the cross-request share on a FRESH slot (the native
-+                        // cache above did not already cover the prefix). With n_past < a
-+                        // block, any block-aligned share the engine returns is strictly
-+                        // larger than n_past and is therefore always adopted below - so the
-+                        // engine's full-prompt reservation always matches the suffix-only
-+                        // submission and never leaves stale blocks (which fragmented the
-+                        // paged pool and crashed the server under high fan-out otherwise).
-+                        if (paged_kv_share && n_past < 16 && slot.task->params.cache_prompt && !input_tokens.has_mtmd) {
-+                            const llama_tokens ptoks = input_tokens.get_text_tokens();
-+                            // Drop this slot's own cells beyond the natively-cached prefix before
-+                            // splicing the shared physical prefix in, so the shared cells can own
-+                            // [n_past, kshare) without colliding (the native path removes exactly
-+                            // these later; a no-op for a fresh slot).
-+                            common_context_seq_rm(ctx_tgt, slot.id, n_past, -1);
-+                            const int32_t kshare = paged_prefix_api::share(ctx_tgt, slot.id, ptoks.data(), (int) ptoks.size());
-+                            if (kshare > n_past) {
-+                                slot.prompt.tokens.keep_first(n_past);
-+                                for (int i = n_past; i < kshare; ++i) {
-+                                    slot.prompt.tokens.push_back(ptoks[i]);
-+                                }
-+                                n_past = kshare;
-+                                SLT_INF(slot, "paged: reusing %d cross-request shared prefix tokens - not recomputed\n", n_past);
-+                            }
-+                        }
-+
-                         // [TAG_PROMPT_LOGITS]
-                         if (n_past == slot.task->n_tokens() && n_past > 0) {
-                             SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
-@@ -3427,6 +3468,15 @@ private:
-                     // prompt evaluated for next-token prediction
-                     slot.state = SLOT_STATE_GENERATING;
- 
-+                    // [paged 0008] Publish this slot's computed prefix so concurrent/later
-+                    // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
-+                    // for [0, n_tokens) has just run, so the prefix KV is computed.
-+                    static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
-+                    if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
-+                        const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
-+                        paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
-+                    }
-+
-                     if (slot.can_speculate()) {
-                         common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
-                     }
--- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch
@@ -1,609 +0,0 @@
-From 59490d82e4d0d4ad05ffb5ca3cccc668f4a75281 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 20:03:17 +0200
-Subject: [PATCH] paged in-kernel decode read (env LLAMA_KV_PAGED) - patch 0009
-
-Replace the per-layer per-step gather (patch 0003: ggml_get_rows of K/V into a
-contiguous buffer) with an in-kernel paged read on the decode step. build_attn
-passes the UNMODIFIED physical K/V views plus a block table (src[5] of
-ggml_flash_attn_ext: an I32 [n_view, n_stream] position-ordered physical-cell
-index, padded to FATTN_KQ_STRIDE). The CUDA fattn vec kernel and the CPU
-reference map logical KV index j -> physical cell block_table[seq*ne11+j] and
-read K_base+cell*nb11 / V_base+cell*nb21 in place, so the get_rows of K and V
-(the bulk of the gather) is gone. The mask stays a small compacted [n_view]
-causal mask in the same position order; KV_max / parallel_blocks / stream_k
-split-K are unchanged. The decode shape is forced onto the vec kernel (the only
-one wired for the block table); a nullptr block table => the stock contiguous
-read, byte-identical.
-
-Token-POSITION ordering keeps the flash-attn reduction order identical to stock,
-so CPU-paged logits == CPU-stock bit-for-bit (verified: 4-stream FA greedy, 64
-tokens). On GPU paged(vec) == stock(vec) at batch 1; at batch>1 it stays within
-the documented vec-vs-mma non-determinism band. Decode step at batch 32 / 1024
-ctx on GB10 (Qwen3-32B NVFP4): paged-gather 1279 ms -> in-kernel 696 ms (-46%),
-recovering the gather regression to stock parity (647 ms). Gated behind
-LLAMA_KV_PAGED; no-op (stock byte-identical) when unset.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- ggml/include/ggml.h                  |   6 ++
- ggml/src/ggml-cpu/ops.cpp            |  10 ++-
- ggml/src/ggml-cuda/fattn-common.cuh  |   8 +-
- ggml/src/ggml-cuda/fattn-mma-f16.cuh |   4 +-
- ggml/src/ggml-cuda/fattn-tile.cuh    |   4 +-
- ggml/src/ggml-cuda/fattn-vec.cuh     |  25 +++++--
- ggml/src/ggml-cuda/fattn-wmma-f16.cu |   4 +-
- ggml/src/ggml-cuda/fattn.cu          |   9 +++
- ggml/src/ggml.c                      |  14 ++++
- src/llama-graph.cpp                  |  23 ++++--
- src/llama-graph.h                    |   3 +-
- src/llama-kv-cache.cpp               |  31 ++++++++
- src/llama-kv-cache.h                 |   4 +
- src/paged-attn.cpp                   | 107 +++++++++++++++++++++++++++
- src/paged-attn.h                     |  18 +++++
- 15 files changed, 248 insertions(+), 22 deletions(-)
-
-diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
-index d6807b6..823f5a9 100644
---- a/ggml/include/ggml.h
-+++ b/ggml/include/ggml.h
-@@ -2427,6 +2427,12 @@ extern "C" {
-             struct ggml_tensor * a,
-             struct ggml_tensor * sinks);
- 
-+    // [paged] optional block table in src[5]: I32 [n_kv_logical, n_stream]; maps each
-+    // logical KV index to the physical cell within K/V. nullptr => stock contiguous read.
-+    GGML_API void ggml_flash_attn_ext_set_block_table(
-+            struct ggml_tensor * a,
-+            struct ggml_tensor * block_table);
-+
-     // TODO: needs to be adapted to ggml_flash_attn_ext
-     GGML_API struct ggml_tensor * ggml_flash_attn_back(
-            struct ggml_context * ctx,
-diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
-index 74611dc..63c07a2 100644
---- a/ggml/src/ggml-cpu/ops.cpp
-+++ b/ggml/src/ggml-cpu/ops.cpp
-@@ -8330,6 +8330,8 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
-     const ggml_tensor * v     = dst->src[2];
-     const ggml_tensor * mask  = dst->src[3];
-     const ggml_tensor * sinks = dst->src[4];
-+    const ggml_tensor * block_table = dst->src[5]; // [paged] logical->physical cell map (src[5])
-+    const int32_t     * bt    = block_table ? (const int32_t *) block_table->data : nullptr;
- 
-     GGML_TENSOR_LOCALS(int64_t, neq, q,   ne)
-     GGML_TENSOR_LOCALS(size_t,  nbq, q,   nb)
-@@ -8449,7 +8451,9 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
- 
-             float s; // KQ value
- 
--            const char * k_data = (const char *) k->data + ( ic*nbk1 + ik2*nbk2 + ik3*nbk3);
-+            // [paged] map the logical KV index ic to its physical cell via the block table.
-+            const int64_t ic_phys = bt ? (int64_t) bt[ik3*nek1 + ic] : ic;
-+            const char * k_data = (const char *) k->data + ( ic_phys*nbk1 + ik2*nbk2 + ik3*nbk3);
-             kq_vec_dot(DK, &s, 0, k_data, 0, Q_q, 0, 1);
- 
-             s = s*scale; // scale KQ value
-@@ -8465,7 +8469,7 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
-             float ms = 1.0f; // upon new higher max val, scale VKQ and KQ sum with this value
-             float vs = 1.0f; // post-softmax KQ value, expf(s - M)
- 
--            const char * v_data = ((const char *) v->data + (ic*nbv1 + iv2*nbv2 + iv3*nbv3));
-+            const char * v_data = ((const char *) v->data + (ic_phys*nbv1 + iv2*nbv2 + iv3*nbv3));
- 
-             if (v->type == GGML_TYPE_F16) {
-                 if (s > M) {
-@@ -9021,7 +9025,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
-         const int64_t dr = (nr + nchunk - 1) / nchunk;
- 
-         static constexpr int64_t Q_TILE_SZ  = ggml_fa_tile_config::Q;
--        bool use_tiled = !use_ref &&
-+        bool use_tiled = !use_ref && dst->src[5] == nullptr && // [paged] one_chunk honors the block table
-                                (q->type == GGML_TYPE_F32 &&
-                                 kv_is_f32_or_f16 &&
-                                 k->type == v->type &&
-diff --git a/ggml/src/ggml-cuda/fattn-common.cuh b/ggml/src/ggml-cuda/fattn-common.cuh
-index 8dfa51a..3c6ddd5 100644
---- a/ggml/src/ggml-cuda/fattn-common.cuh
-+++ b/ggml/src/ggml-cuda/fattn-common.cuh
-@@ -39,7 +39,8 @@ typedef void (* fattn_kernel_t)(
-                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
-                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
--                            const int32_t nb31, const int32_t nb32, const int64_t nb33);
-+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
-+        const int  * __restrict__ block_table);
- 
- typedef float (*vec_dot_KQ_t)(
-     const char * __restrict__ K_c, const void * __restrict__ Q_v, const int * __restrict__ Q_q8 , const void * __restrict__ Q_ds);
-@@ -981,6 +982,8 @@ void launch_fattn(
- 
-     const ggml_tensor * mask  = dst->src[3];
-     const ggml_tensor * sinks = dst->src[4];
-+    const ggml_tensor * block_table = dst->src[5]; // [paged] optional logical->physical map
-+    const int * bt_ptr = block_table ? (const int *) block_table->data : nullptr;
- 
-     ggml_tensor * KQV = dst;
- 
-@@ -1217,7 +1220,8 @@ void launch_fattn(
-         K->ne[0], K->ne[1], K->ne[2], K->ne[3], nb11, nb12, nb13,
-         nb21, nb22, nb23,
-         mask ? mask->ne[1] : 0, mask ? mask->ne[2] : 0, mask ? mask->ne[3] : 0,
--        mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0
-+        mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0,
-+        bt_ptr
-     );
-     CUDA_CHECK(cudaGetLastError());
- 
-diff --git a/ggml/src/ggml-cuda/fattn-mma-f16.cuh b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
-index 83478a0..0a92cd6 100644
---- a/ggml/src/ggml-cuda/fattn-mma-f16.cuh
-+++ b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
-@@ -1723,7 +1723,9 @@ static __global__ void flash_attn_ext_f16(
-                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
-                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
--                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
-+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
-+        const int  * __restrict__ block_table) {
-+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
-     ggml_cuda_pdl_sync(); // TODO optimize placement
- #if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE))
-     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
-diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
-index 0a09981..0ff14e6 100644
---- a/ggml/src/ggml-cuda/fattn-tile.cuh
-+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
-@@ -808,7 +808,9 @@ static __global__ void flash_attn_tile(
-                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
-                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
--                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
-+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
-+        const int  * __restrict__ block_table) {
-+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
- #ifdef FLASH_ATTN_AVAILABLE
-     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
-     const char * GGML_CUDA_RESTRICT K        = K_ptr;
-diff --git a/ggml/src/ggml-cuda/fattn-vec.cuh b/ggml/src/ggml-cuda/fattn-vec.cuh
-index 69dd936..a09e2fb 100644
---- a/ggml/src/ggml-cuda/fattn-vec.cuh
-+++ b/ggml/src/ggml-cuda/fattn-vec.cuh
-@@ -39,7 +39,8 @@ static __global__ void flash_attn_ext_vec(
-                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
-                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
--                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
-+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
-+        const int  * __restrict__ block_table) {
-     ggml_cuda_pdl_lc();
- #ifdef FLASH_ATTN_AVAILABLE
-     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
-@@ -61,7 +62,7 @@ static __global__ void flash_attn_ext_vec(
-                   nb11, nb12, nb13,
-                   nb21, nb22, nb23,
-                   ne31, ne32, ne33,
--                  nb31, nb32, nb33);
-+                  nb31, nb32, nb33, block_table);
-         NO_DEVICE_CODE;
-         return;
-     }
-@@ -110,6 +111,14 @@ static __global__ void flash_attn_ext_vec(
-     K += nb13*sequence + nb12*(head / gqa_ratio);
-     V += nb23*sequence + nb22*(head / gqa_ratio);
- 
-+    // [paged] in-kernel block-table read: logical KV index j -> physical cell
-+    // block_table[sequence*ne11 + j]; read K0 + cell*nb11 / V0 + cell*nb21. The
-+    // mask/KV_max stay logical (the table is in token-position order). nullptr =>
-+    // the stock contiguous read below.
-+    const char * GGML_CUDA_RESTRICT K0 = K;
-+    const char * GGML_CUDA_RESTRICT V0 = V;
-+    const int  * GGML_CUDA_RESTRICT bt = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
-+
-     const half * maskh  = (const half  *) (mask + nb33*(sequence % ne33) + nb31*ic0);
- 
-     const float slope = get_alibi_slope(max_bias, head, n_head_log2, m0, m1);
-@@ -267,10 +276,11 @@ static __global__ void flash_attn_ext_vec(
- #pragma unroll
-         for (int i_KQ_0 = 0; i_KQ_0 < nthreads_KQ; ++i_KQ_0) {
-             const int i_KQ = threadIdx.y*WARP_SIZE + (nthreads_KQ == WARP_SIZE ? 0 : (threadIdx.x & ~(nthreads_KQ-1))) + i_KQ_0;
-+            const char * GGML_CUDA_RESTRICT K_blk = bt ? (K0 + (int64_t) bt[k_VKQ_0 + i_KQ]*nb11) : (K + i_KQ*nb11);
- 
- #pragma unroll
-             for (int j = 0; j < ncols; ++j) {
--                float sum = vec_dot_KQ(K + i_KQ*nb11, Q_reg[j], Q_i32[j], Q_ds[j]);
-+                float sum = vec_dot_KQ(K_blk, Q_reg[j], Q_i32[j], Q_ds[j]);
-                 sum = warp_reduce_sum<nthreads_KQ>(sum);
- 
-                 if (use_logit_softcap) {
-@@ -324,6 +334,7 @@ static __global__ void flash_attn_ext_vec(
- #pragma unroll
-         for (int k0 = 0; k0 < WARP_SIZE; k0 += V_cols_per_iter) {
-             const int k = threadIdx.y*WARP_SIZE + k0 + (nthreads_V == WARP_SIZE ? 0 : threadIdx.x / nthreads_V);
-+            const char * GGML_CUDA_RESTRICT V_blk = bt ? (V0 + (int64_t) bt[k_VKQ_0 + k]*nb21) : (V + k*nb21);
- 
- #ifdef V_DOT2_F32_F16_AVAILABLE
-             half2 KQ_k[ncols];
-@@ -336,14 +347,14 @@ static __global__ void flash_attn_ext_vec(
-                 half2 tmp[V_rows_per_thread/2];
-                 if constexpr (type_V == GGML_TYPE_BF16) {
-                     float2 tmp_f[V_rows_per_thread/2];
--                    dequantize_V(V + k*nb21, tmp_f,
-+                    dequantize_V(V_blk, tmp_f,
-                         2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
- #pragma unroll
-                     for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
-                         tmp[i_VKQ_1] = __float22half2_rn(tmp_f[i_VKQ_1]);
-                     }
-                 } else {
--                    dequantize_V(V + k*nb21, tmp,
-+                    dequantize_V(V_blk, tmp,
-                         2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
-                 }
- #pragma unroll
-@@ -363,7 +374,7 @@ static __global__ void flash_attn_ext_vec(
- #pragma unroll
-             for (int i_VKQ_0 = 0; i_VKQ_0 < D/2; i_VKQ_0 += nthreads_V*V_rows_per_thread/2) {
-                 float2 tmp[V_rows_per_thread/2];
--                dequantize_V(V + k*nb21, tmp,
-+                dequantize_V(V_blk, tmp,
-                     2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
- #pragma unroll
-                 for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
-@@ -522,7 +533,7 @@ static __global__ void flash_attn_ext_vec(
-               nb11, nb12, nb13,
-               nb21, nb22, nb23,
-               ne31, ne32, ne33,
--              nb31, nb32, nb33);
-+              nb31, nb32, nb33, block_table);
-     NO_DEVICE_CODE;
- #endif // FLASH_ATTN_AVAILABLE
- }
-diff --git a/ggml/src/ggml-cuda/fattn-wmma-f16.cu b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
-index 6850716..5357849 100644
---- a/ggml/src/ggml-cuda/fattn-wmma-f16.cu
-+++ b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
-@@ -44,7 +44,9 @@ static __global__ void flash_attn_ext_f16(
-                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
-                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
--                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
-+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
-+        const int  * __restrict__ block_table) {
-+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
- #if defined(FLASH_ATTN_AVAILABLE) && (defined(GGML_HIP_ROCWMMA_FATTN) && defined(GGML_USE_WMMA_FATTN))
-     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
-     const char * GGML_CUDA_RESTRICT K        = K_ptr;
-diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
-index d6c501b..e3771ee 100644
---- a/ggml/src/ggml-cuda/fattn.cu
-+++ b/ggml/src/ggml-cuda/fattn.cu
-@@ -574,6 +574,15 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
- 
- void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
-     ggml_cuda_set_device(ctx.device);
-+
-+    // [paged] the block table (src[5]) is only honored by the vec kernel's
-+    // in-kernel read; force it. build_attn only sets it for a vec-supported
-+    // 1-token-per-stream decode shape.
-+    if (dst->src[5] != nullptr) {
-+        ggml_cuda_flash_attn_ext_vec(ctx, dst);
-+        return;
-+    }
-+
-     switch (ggml_cuda_get_best_fattn_kernel(ggml_cuda_get_device(), dst)) {
-         case BEST_FATTN_KERNEL_NONE:
-             GGML_ABORT("fatal error");
-diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
-index b43016c..adbe52b 100644
---- a/ggml/src/ggml.c
-+++ b/ggml/src/ggml.c
-@@ -5442,6 +5442,20 @@ void ggml_flash_attn_ext_add_sinks(
-     a->src[4] = sinks;
- }
- 
-+void ggml_flash_attn_ext_set_block_table(
-+        struct ggml_tensor * a,
-+        struct ggml_tensor * block_table) {
-+    if (!block_table) {
-+        a->src[5] = NULL;
-+        return;
-+    }
-+
-+    GGML_ASSERT(a->op == GGML_OP_FLASH_ATTN_EXT);
-+    GGML_ASSERT(block_table->type == GGML_TYPE_I32);
-+
-+    a->src[5] = block_table;
-+}
-+
- // ggml_flash_attn_back
- 
- struct ggml_tensor * ggml_flash_attn_back(
-diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
-index b59d2a5..abdb48d 100644
---- a/src/llama-graph.cpp
-+++ b/src/llama-graph.cpp
-@@ -2074,7 +2074,8 @@ ggml_tensor * llm_graph_context::build_attn_mha(
-          ggml_tensor * sinks,
-          ggml_tensor * v_mla,
-                float   kq_scale,
--                 int   il) const {
-+                 int   il,
-+         ggml_tensor * block_table) const {
-     const bool v_trans = v->nb[1] > v->nb[2];
- 
-     // split the batch into streams if needed
-@@ -2109,6 +2110,9 @@ ggml_tensor * llm_graph_context::build_attn_mha(
-                                   hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
-         cb(cur, LLAMA_TENSOR_NAME_FATTN, il);
- 
-+        if (block_table) {
-+            ggml_flash_attn_ext_set_block_table(cur, block_table);
-+        }
-         ggml_flash_attn_ext_add_sinks(cur, sinks);
-         ggml_flash_attn_ext_set_prec (cur, GGML_PREC_F32);
- 
-@@ -2358,12 +2362,19 @@ ggml_tensor * llm_graph_context::build_attn(
-     ggml_tensor * k = mctx_cur->get_k(ctx0, il);
-     ggml_tensor * v = mctx_cur->get_v(ctx0, il);
- 
--    // [paged 0003] gather K, V and the mask to the sequence's used cells only
--    //   (no-op unless env LLAMA_KV_PAGED is set).
--    ggml_tensor * kq_mask_g = kq_mask;
--    paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
-+    // [paged] decode read: when paging is active and this is a 1-token-per-stream
-+    //   decode step, present K/V as n_gather views + a block table so the fattn
-+    //   kernel reads the sequence's cells in-kernel (no get_rows of K/V). Else
-+    //   fall back to the gather-read (prefill, transposed V, or env off). All a
-+    //   no-op unless env LLAMA_KV_PAGED is set => stock byte-identical.
-+    ggml_tensor * kq_mask_g   = kq_mask;
-+    ggml_tensor * block_table = nullptr;
-+    const bool is_decode = (q_cur->ne[2] == k->ne[3]); // 1 query token per stream
-+    if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, &k, &v, &kq_mask_g, &block_table))) {
-+        paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
-+    }
- 
--    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
-+    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il, block_table);
-     cb(cur, "kqv_out", il);
- 
-     if (inp->self_v_rot) {
-diff --git a/src/llama-graph.h b/src/llama-graph.h
-index 5e8a658..c95ae49 100644
---- a/src/llama-graph.h
-+++ b/src/llama-graph.h
-@@ -969,7 +969,8 @@ struct llm_graph_context {
-             ggml_tensor * sinks,   // [n_head_q]
-             ggml_tensor * v_mla,   // [n_embd_head_v_mla, n_embd_head_v, n_head_v]
-                   float   kq_scale,
--                    int   il) const;
-+                    int   il,
-+            ggml_tensor * block_table = nullptr) const; // [paged] optional src[5] block table
- 
-     llm_graph_input_attn_no_cache * build_attn_inp_no_cache() const;
- 
-diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
-index 7510ff9..0351f86 100644
---- a/src/llama-kv-cache.cpp
-+++ b/src/llama-kv-cache.cpp
-@@ -1474,6 +1474,33 @@ void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_in
-     }
- }
- 
-+void llama_kv_cache::get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const {
-+    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
-+    for (uint32_t j = 0; j < ns; ++j) {
-+        const auto & cells = v_cells[sinfo.s0 + j];
-+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
-+        std::vector<std::pair<llama_pos, int32_t>> pc;
-+        pc.reserve(n);
-+        int32_t pad = -1;
-+        for (uint32_t i = 0; i < n; ++i) {
-+            if (!cells.is_empty(i)) {
-+                pc.emplace_back(cells.pos_get(i), (int32_t) i);
-+            } else if (pad < 0) {
-+                pad = (int32_t) i;
-+            }
-+        }
-+        std::sort(pc.begin(), pc.end());
-+        int32_t * col = dst + (size_t) j * n_blk;
-+        for (size_t k = 0; k < pc.size(); ++k) {
-+            col[k] = pc[k].second;
-+        }
-+        const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
-+        for (uint32_t k = (uint32_t) pc.size(); k < n_blk; ++k) {
-+            col[k] = padv;
-+        }
-+    }
-+}
-+
- ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
-     GGML_UNUSED(sinfo);
- 
-@@ -2773,6 +2800,10 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
-     kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
- }
- 
-+void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const {
-+    kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]);
-+}
-+
- ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
-     return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
- }
-diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
-index f374ac6..e9980b6 100644
---- a/src/llama-kv-cache.h
-+++ b/src/llama-kv-cache.h
-@@ -176,6 +176,9 @@ public:
-     //   gather-read. get_n_gather returns the max count across streams.
-     uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
-     void     get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
-+    // [paged inc1] block table [n_blk, n_stream] (position order, padded to n_blk
-+    //   per column with a masked empty cell) for the in-kernel paged read.
-+    void     get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const;
- 
-     // store k_cur and v_cur in the cache based on the provided head location
-     ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
-@@ -386,6 +389,7 @@ public:
-     //   current ubatch's stream).
-     uint32_t get_n_gather() const;
-     void     get_gather_idxs(int32_t * dst) const;
-+    void     get_block_table(int32_t * dst, uint32_t n_blk) const;
- 
-     // store k_cur and v_cur in the cache based on the provided head location
-     // note: the heads in k_cur and v_cur should be laid out contiguously in memory
-diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
-index ade75e8..8eebeaa 100644
---- a/src/paged-attn.cpp
-+++ b/src/paged-attn.cpp
-@@ -43,6 +43,25 @@ public:
-     ggml_tensor * idxs;
- };
- 
-+// Block table filler for the in-kernel paged read: fills an I32 [n_blk, n_stream]
-+// tensor with each stream's position-ordered cells, padded to n_blk (per column)
-+// with a masked empty cell, by delegating to the kv-cache context.
-+class input_block_table : public llm_graph_input_i {
-+public:
-+    input_block_table(const llama_kv_cache_context * mctx, ggml_tensor * idxs, uint32_t n_blk)
-+        : mctx(mctx), idxs(idxs), n_blk(n_blk) {}
-+
-+    void set_input(const llama_ubatch * ubatch) override {
-+        GGML_UNUSED(ubatch);
-+        GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
-+        mctx->get_block_table((int32_t *) idxs->data, n_blk);
-+    }
-+
-+    const llama_kv_cache_context * mctx;
-+    ggml_tensor * idxs;
-+    uint32_t n_blk;
-+};
-+
- } // namespace
- 
- void gather(ggml_context * ctx0,
-@@ -125,4 +144,92 @@ void gather(ggml_context * ctx0,
-     }
- }
- 
-+bool in_kernel_decode(ggml_context * ctx0,
-+                      llm_graph_result * res,
-+                      const llama_kv_cache_context * mctx,
-+                      ggml_tensor ** k,
-+                      ggml_tensor ** v,
-+                      ggml_tensor ** kq_mask,
-+                      ggml_tensor ** block_table) {
-+    if (!active()) {
-+        return false;
-+    }
-+    // Bench escape hatch: LLAMA_KV_PAGED_GATHER=1 forces the old gather-read decode
-+    // path (for a same-build BEFORE/AFTER decode-step comparison). Dev-only.
-+    static const bool force_gather = (std::getenv("LLAMA_KV_PAGED_GATHER") != nullptr);
-+    if (force_gather) {
-+        return false;
-+    }
-+
-+    ggml_tensor * K = *k;
-+    ggml_tensor * V = *v;
-+    ggml_tensor * M = *kq_mask;
-+
-+    const int64_t n_stream = K->ne[3];
-+    GGML_ASSERT(M->ne[3] == n_stream);
-+
-+    const int64_t n_gather = (int64_t) mctx->get_n_gather();
-+    if (n_gather <= 0) {
-+        // Worst-case reserve / nothing placed yet: keep the dense [0,n_kv) read.
-+        return false;
-+    }
-+
-+    // The in-kernel read addresses V along its d-major (non-transposed) axis. If
-+    // the cache stores V transposed, fall back to gather() (which normalizes it).
-+    if (V->nb[1] > V->nb[2]) {
-+        return false;
-+    }
-+
-+    if (debug()) {
-+        static int64_t once = 0;
-+        if (once++ < 2) {
-+            fprintf(stderr, "[paged-attn] in-kernel decode n_stream=%lld n_kv=%lld n_gather=%lld\n",
-+                    (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
-+        }
-+    }
-+
-+    // Block table [n_gather, n_stream]: column s holds stream s's non-empty cells
-+    // in token-POSITION order (identical to the gather index, so the reduction
-+    // order matches stock bit-for-bit), padded with a masked empty cell. Filled
-+    // at set_input from the kv-cache (get_gather_idxs), exactly like the gather.
-+    // Pad the logical length to FATTN_KQ_STRIDE (256) so the CUDA fattn vec kernel
-+    // reads fixed 128-wide KV blocks without overrun and the KV_max mask scan
-+    // engages; padded entries point at a masked empty cell (0 contribution). Stays
-+    // <= n_kv since n_kv is itself padded to 256 and n_gather <= n_kv.
-+    int64_t n_view = GGML_PAD(n_gather, 256);
-+    if (n_view > K->ne[2]) {
-+        n_view = K->ne[2];
-+    }
-+
-+    ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
-+    ggml_set_input(idx);
-+    res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
-+
-+    // Present K and V as [d, h, n_view, ns] VIEWS of the full physical window:
-+    // identical per-cell (nb1,nb2) and per-stream (nb3) strides, only the cell
-+    // dim shrinks to n_view. NOT materialized - the kernel reads in place.
-+    *k = ggml_view_4d(ctx0, K, K->ne[0], K->ne[1], n_view, n_stream,
-+                      K->nb[1], K->nb[2], K->nb[3], 0);
-+    *v = ggml_view_4d(ctx0, V, V->ne[0], V->ne[1], n_view, n_stream,
-+                      V->nb[1], V->nb[2], V->nb[3], 0);
-+
-+    // Compact the mask to [n_gather, n_tps, 1, ns] in the same position order so
-+    // the kernel's logical mask index aligns with the block table. Cheap: the
-+    // mask is ~(d*h) smaller than K/V, which is why only its get_rows remains.
-+    {
-+        ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream);
-+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
-+        m = ggml_get_rows(ctx0, m, idx);
-+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
-+        m = ggml_reshape_4d(ctx0, m, n_view, M->ne[1], 1, n_stream);
-+        if (M->type != m->type) {
-+            m = ggml_cast(ctx0, m, M->type);
-+        }
-+        *kq_mask = m;
-+    }
-+
-+    *block_table = idx;
-+    return true;
-+}
-+
- } // namespace paged_attn
-diff --git a/src/paged-attn.h b/src/paged-attn.h
-index c5b7bd7..23e2184 100644
--- a/src/paged-attn.h
-+++ b/src/paged-attn.h
-@@ -37,4 +37,22 @@ void gather(ggml_context * ctx0,
-             ggml_tensor ** v,
-             ggml_tensor ** kq_mask);
- 
-+// [paged inc1] In-kernel paged decode read. Instead of materializing the
-+// sequence's cells (gather()), present K and V as n_gather-length VIEWS of the
-+// full physical window and return the position-ordered physical-cell index list
-+// as a block table (src[5] of ggml_flash_attn_ext). The fattn kernel/op then
-+// reads K_base + block_table[j]*nb in-kernel, removing the get_rows of K and V
-+// (the bulk of the gather). On return (true): *k,*v point at the views, *kq_mask
-+// at the compacted mask, *block_table at the I32 [n_gather, n_stream] index.
-+// Returns false (leaving *k,*v,*kq_mask untouched) when the in-kernel path does
-+// not apply - env off, nothing placed, or a transposed V cache - so the caller
-+// keeps the dense gather()/contiguous read.
-+bool in_kernel_decode(ggml_context * ctx0,
-+                      llm_graph_result * res,
-+                      const llama_kv_cache_context * mctx,
-+                      ggml_tensor ** k,
-+                      ggml_tensor ** v,
-+                      ggml_tensor ** kq_mask,
-+                      ggml_tensor ** block_table);
-+
- } // namespace paged_attn
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch
@@ -1,269 +0,0 @@
-From 9ac56933abd5de4a1f349c811c2d74aab09f7ab1 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Mon, 22 Jun 2026 22:36:09 +0200
-Subject: [PATCH] paged tile in-kernel decode read + dispatch guard (env
- LLAMA_KV_PAGED) - patch 0010
-
-Increment 2 (robustness, ~0 headline ms): make the paged in-kernel decode read
-safe against silent mis-routing, and plumb the same read into the tile kernel
-for the increment-3 GQA head-group work.
-
-fattn-tile.cuh: graft the patch-0009 phys(j) block-table read (mirror of
-fattn-vec.cuh). Both flash_attn_tile_load_tile overloads, flash_attn_tile_iter_KQ
-(K) and flash_attn_tile_iter (V) take an optional per-sequence block table; a row
-i is read from base + block_table[row_base + i]*stride instead of base + i*stride.
-The table defaults to nullptr (default args + a null bt_seq when src[5] is unset),
-so every existing non-paged caller is byte-identical to stock. The mask / KV_max
-stay logical (token-position order), as in vec.
-
-fattn.cu: DISPATCH GUARD. When the block table (src[5]) is present, route ONLY to
-the vec or tile kernel and never fall through to the best-kernel switch. The
-mma/wmma kernels GGML_UNUSED the table and would silently read the wrong
-(contiguous physical) cells; the guard makes that unreachable. The vec dispatcher
-GGML_ABORTs for an unsupported D/type rather than mis-reading. Default route is vec
-(the inc-1 byte-validated path). LLAMA_KV_PAGED_DISPATCH_LOG=1 prints the routed
-kernel once.
-
-Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B, build-cpu) PASS. GPU
-vec-paged == stock at -s 1 PASS. Dispatch confirmed VEC for the real decode shape:
-Qwen3-0.6B Q ne=[128,1,16,1] and Qwen3-32B NVFP4 Q ne=[128,1,64,N] both route to
-vec, matching the nsys profile (flash_attn_ext_vec).
-
-The tile graft is plumbed for increment-3 GQA head-group reuse but is EXPERIMENTAL
-and NOT yet byte-validated (LLAMA_KV_PAGED_TILE=1). A tile-vs-tile gate shows
-tile-paged diverging from tile-stock at the first cross-tile KV depth: the
-GQA-grouped (ncols2>1) tile path reads a full nbatch_fa-row tile with
-oob_check=false while the compacted paged mask is not padded to cover the tile, so
-past-end rows leak. vec bounds its KV walk by KV_max and is unaffected. Bounding
-the tile path is increment-3 work; the default vec route and all stock paths are
-untouched.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/src/ggml-cuda/fattn-tile.cuh | 45 ++++++++++++++++++++-----------
- ggml/src/ggml-cuda/fattn.cu       | 38 +++++++++++++++++++++++---
- 2 files changed, 64 insertions(+), 19 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
-index 0ff14e6..bb84d61 100644
--- a/ggml/src/ggml-cuda/fattn-tile.cuh
-+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
-@@ -373,7 +373,8 @@ static constexpr __device__ int ggml_cuda_fattn_tile_get_nbatch_K(const int DKQ,
- // TODO: deduplicate with mma-f16
- template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
- static __device__ __forceinline__ void flash_attn_tile_load_tile(
-        const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
-+        const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
-+        const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
-     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
-     constexpr int cpy_ne = cpy_nb / 4;
- 
-@@ -402,9 +403,11 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
-                     const int j = j0*cpy_ne + (stride_j == warp_size ? threadIdx.x : threadIdx.x % stride_j)*cpy_ne;
- 
-                     const __align__(16) half2 zero[cpy_ne] = {{0.0f, 0.0f}};
-+                    // [paged] remap the row through the block table (nullptr => stock contiguous read).
-+                    const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
-                     ggml_cuda_memcpy_1<cpy_nb>(
-                         tile_KV + i*(J/2 + J_padding) + j,
-                        !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
-+                        !oob_check || i < i_sup ? KV_row + j : zero);
-                 }
-             }
-         }
-@@ -423,7 +426,8 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
- 
- template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
- static __device__ __forceinline__ void flash_attn_tile_load_tile(
-        const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
-+        const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
-+        const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
-     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
-     constexpr int cpy_ne = cpy_nb / 4;
- 
-@@ -453,8 +457,10 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
- 
-                     const half2 zero[cpy_ne/2] = {{0.0f, 0.0f}};
-                     __align__(16) half2 tmp_h2[cpy_ne/2];
-+                    // [paged] remap the row through the block table (nullptr => stock contiguous read).
-+                    const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
-                     ggml_cuda_memcpy_1<sizeof(tmp_h2)>(
-                        tmp_h2, !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
-+                        tmp_h2, !oob_check || i < i_sup ? KV_row + j : zero);
- 
-                     __align__(16) float2 tmp_f2[cpy_ne/2];
- #pragma unroll
-@@ -487,6 +493,7 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
-         const int k_VKQ_0,
-         const int k_VKQ_sup,
-         const int k_KQ_0,
-+        const int * const __restrict__ block_table,
-         float * KQ_acc) {
-     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
-     constexpr int cpy_ne = cpy_nb / 4;
-@@ -495,8 +502,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
-     constexpr int cpw   = ncols > nwarps ? ncols/nwarps : 1; // Q columns per warp
-     constexpr int np    = nwarps > ncols ? nwarps/ncols : 1; // number of parallel warps per Q column
- 
-+    // [paged] when block_table is set K_h2 is the un-offset base; the table supplies the row.
-+    const half2 * const K_base = block_table ? (K_h2 + k_KQ_0/2) : (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2);
-     flash_attn_tile_load_tile<warp_size, nwarps, nbatch_fa, nbatch_K, cpy_ne, oob_check>
-        (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2, KV_tmp, stride_K2, k_VKQ_sup);
-+        (K_base, KV_tmp, stride_K2, k_VKQ_sup, block_table, k_VKQ_0);
-     __syncthreads();
- 
- #ifdef FAST_FP16_AVAILABLE
-@@ -572,7 +581,8 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
-         T_acc * const VKQ,
-         const int k_VKQ_0,
-         const int k_VKQ_max,
-        const int col_Q_0) {
-+        const int col_Q_0,
-+        const int * const __restrict__ block_table) {
-     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
-     constexpr int cpy_ne = cpy_nb / 4;
- 
-@@ -605,12 +615,12 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
- #pragma unroll
-     for (int k_KQ_0 = 0; k_KQ_0 < DKQ - nbatch_K_last; k_KQ_0 += nbatch_K) {
-         flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>(
-            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
-+            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
-     }
-     if (nbatch_K_last > 0) {
-         constexpr int k_KQ_0 = DKQ - nbatch_K_last;
-         flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K_last, use_logit_softcap, oob_check>(
-            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
-+            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
-     }
- 
-     // Apply logit softcap + mask, update KQ_max:
-@@ -715,8 +725,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
-     static_assert(nbatch_V % np == 0, "bad nbatch_V");
- #pragma unroll
-     for (int k0 = 0; k0 < nbatch_fa; k0 += nbatch_V) {
-+        // [paged] when block_table is set V_h2 is the un-offset base; the table supplies the row.
-+        const half2 * const V_base = block_table ? V_h2 : (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2);
-         flash_attn_tile_load_tile<warp_size, nwarps, nbatch_V, DV, 0, oob_check>
-            (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2, KV_tmp, stride_V2, k_VKQ_sup - k0);
-+            (V_base, KV_tmp, stride_V2, k_VKQ_sup - k0, block_table, k_VKQ_0 + k0);
-         __syncthreads();
- 
- #ifdef FAST_FP16_AVAILABLE
-@@ -810,7 +822,6 @@ static __global__ void flash_attn_tile(
-                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
-                             const int32_t nb31, const int32_t nb32, const int64_t nb33,
-         const int  * __restrict__ block_table) {
-    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
- #ifdef FLASH_ATTN_AVAILABLE
-     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
-     const char * GGML_CUDA_RESTRICT K        = K_ptr;
-@@ -837,7 +848,7 @@ static __global__ void flash_attn_tile(
-                   nb11, nb12, nb13,
-                   nb21, nb22, nb23,
-                   ne31, ne32, ne33,
-                  nb31, nb32, nb33);
-+                  nb31, nb32, nb33, block_table);
-         NO_DEVICE_CODE;
-         return;
-     }
-@@ -861,6 +872,10 @@ static __global__ void flash_attn_tile(
-     const half2 * K_h2 = (const half2 *) (K + nb13*sequence + nb12*(head0 / gqa_ratio));
-     const half2 * V_h2 = (const half2 *) (V + nb23*sequence + nb22*(head0 / gqa_ratio)); // K and V have same shape
- 
-+    // [paged] per-sequence logical->physical block table in token-position order
-+    // (mask/KV_max stay logical); nullptr => the stock contiguous read.
-+    const int * const __restrict__ bt_seq = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
-+
-     const half * maskh = mask ? (const half *) (mask + nb33*(sequence % ne33)) : nullptr;
- 
-     const int stride_K2   = nb11 / sizeof(half2);
-@@ -963,14 +978,14 @@ static __global__ void flash_attn_tile(
-             constexpr bool oob_check = false;
-             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
-                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
-+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
-             k_VKQ_0 += gridDim.y*nbatch_fa;
-         }
-         if (k_VKQ_0 < k_VKQ_max) {
-             constexpr bool oob_check = true;
-             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
-                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
-+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
-         }
-     } else {
-         // Branch without out-of-bounds checks.
-@@ -978,7 +993,7 @@ static __global__ void flash_attn_tile(
-             constexpr bool oob_check = false;
-             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
-                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
-+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
-         }
-     }
- 
-@@ -1144,7 +1159,7 @@ static __global__ void flash_attn_tile(
-               nb11, nb12, nb13,
-               nb21, nb22, nb23,
-               ne31, ne32, ne33,
-              nb31, nb32, nb33);
-+              nb31, nb32, nb33, block_table);
-     NO_DEVICE_CODE;
- #endif // FLASH_ATTN_AVAILABLE
- }
-diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
-index e3771ee..afcafa2 100644
--- a/ggml/src/ggml-cuda/fattn.cu
-+++ b/ggml/src/ggml-cuda/fattn.cu
-@@ -575,11 +575,41 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
- void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
-     ggml_cuda_set_device(ctx.device);
- 
-    // [paged] the block table (src[5]) is only honored by the vec kernel's
-    // in-kernel read; force it. build_attn only sets it for a vec-supported
-    // 1-token-per-stream decode shape.
-+    // [paged] DISPATCH GUARD. The block table (src[5]) is read in-kernel ONLY by
-+    // the vec and tile kernels; the mma/wmma kernels GGML_UNUSED it and would
-+    // silently read the wrong (contiguous physical) cells. So when a block table
-+    // is present we route here and NEVER fall through to the best-kernel switch
-+    // below - no decode shape can silently reach an mma/wmma misread. build_attn
-+    // only sets src[5] for the 1-token-per-stream decode shape; the vec
-+    // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
-+    // and any shape that should not be paged must take the host-side gather path
-+    // (LLAMA_KV_PAGED_GATHER=1) instead.
-+    //
-+    // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
-+    // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
-+    // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
-+    // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
-+    // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
-+    // with oob_check=false while the compacted paged mask is not padded to cover
-+    // it, so it diverges from stock. Not for production paged decode until
-+    // increment-3 bounds that path; the default vec route is unaffected.
-     if (dst->src[5] != nullptr) {
-        ggml_cuda_flash_attn_ext_vec(ctx, dst);
-+        static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
-+        if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
-+            static bool logged = false;
-+            if (!logged) {
-+                logged = true;
-+                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
-+                    paged_tile ? "TILE(experimental)" : "VEC",
-+                    (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
-+                    (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
-+            }
-+        }
-+        if (paged_tile) {
-+            ggml_cuda_flash_attn_ext_tile(ctx, dst);
-+        } else {
-+            ggml_cuda_flash_attn_ext_vec(ctx, dst);
-+        }
-         return;
-     }
- 
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
@@ -1,147 +0,0 @@
-From d5ca5cd756e42214d0003bca815ca91943679b0d Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Tue, 23 Jun 2026 00:18:35 +0200
-Subject: [PATCH] paged decode: route GQA-grouped tile kernel by default (F16,
- gqa>=2) - patch 0011
-
-Increment 3 (the attention lever). In fattn.cu's paged dispatch guard, route the
-in-kernel decode to the tile kernel for the common grouped-query F16 case, and
-keep the inc-1 vec kernel for everything else.
-
-The tile kernel carries native GQA head-group reuse: its ncols2 axis groups the
-q-heads that share one kv-head, so each K/V row is loaded once for the whole
-group instead of once per q-head. vec re-streams each kv-head's K/V once per
-q-head (8x for Qwen3-32B's n_head 64 / n_head_kv 8) and runs at 168 regs ->
-3 blocks/SM = 25% occupancy on GB10; tile is 108-128 regs with native grouping.
-The inc-2 phys(j) block-table read was already plumbed into tile (patch 0010);
-this patch makes it the default for {F16 K and V, gqa_ratio >= 2}.
-
-Routing guard (why conditional): the tile kernel has no K/V type template - it
-loads half2 - so a non-F16 cache (BF16 / quantized) would be converted by
-launch_fattn to a contiguous F16 copy, which breaks the in-kernel block-table
-read (the table indexes the original paged layout, not the copy). So tile is
-correct only for an F16 cache; non-F16 caches and the non-grouped gqa==1 shape
-fall back to the inc-1 vec path, exactly as before this change. The head-group
-reuse also only helps at gqa_ratio >= 2. LLAMA_KV_PAGED_VEC=1 forces vec for A/B.
-Note: paged decode is currently exercised with an F16 cache only; quantized +
-paged is a separate pre-existing limitation, independent of this change
-(verified: stock + q8_0 cache works, but paged + q8_0 aborts both before and
-after this patch, since both route the non-F16 cache to vec).
-
-Measured GB10 (sm_121, 48 SM), Qwen3-32B NVFP4 dense, F16 cache, gqa 8, batch 32,
-1024 ctx, llama-batched-bench npp=1024 ntg=128 npl=32, GGML_CUDA_DISABLE_GRAPHS=1,
-same build, env-toggled:
-  STOCK (mma)            174.8 ms/step  183.1 t/s
-  PAGED-VEC  (inc-1)     186.3 ms/step  171.8 t/s   (+6.6% vs stock)
-  PAGED-TILE (inc-3)     177.9 ms/step  179.8 t/s   (+1.8% vs stock)
-GQA grouping recovers 8.4 ms/step (-4.5%) over the inc-1 vec default and brings
-paged decode to within 1.8% of stock. The win grows with context (npl=8, tile vs
-vec decode step): 1024 -2.3%, 4096 -3.3%, 8192 and 16384 wider, as attention
-takes a larger share of the step.
-
-Why not the split-K tune: the vec decode grid is already block-saturated
-(1 x parallel_blocks 3 x 2048 = 6144 blocks ~ 43 waves over 144 resident on 48
-SM), so raising parallel_blocks / KV_max adds no SM fill. The under-saturation is
-intra-SM (occupancy + the 8x KV re-streaming), which GQA grouping attacks
-directly; more split-K does not.
-
-Correctness (greedy, GGML_CUDA_DISABLE_GRAPHS=1):
-  - CPU plumbing gate (Qwen3-0.6B, build-cpu, paged-on vs off): BYTE-IDENTICAL.
-  - GPU 0.6B gqa=2, 8 seq x 48 tok: tile is token-identical to the inc-1 vec path
-    in 7/8 sequences; the 8th diverges at token 5, within the same kernel-noise
-    band where vec also drifts from stock. Stock uses the mma kernel for this
-    multi-stream GQA shape, so a different kernel = different rounding =
-    autoregressive token drift; vec and tile agree with each other while both
-    differ from stock (both pick 15678 where stock picks 38835), confirming the
-    drift is kernel choice, not a paging error.
-  - GPU 32B gqa=8, 4 seq x 40 tok: tile tracks stock at least as well as vec
-    (seq3: tile == stock == 624 at the token where vec picked 13).
-
-Stock is byte-identical: the dispatch guard only diverts when the block table
-(src[5]) is set; the non-paged best-kernel switch is untouched. The ncols2>1 tile
-path reads the last nbatch_fa tile with oob_check=false and relies on the mask
-inf padding - the same pattern stock uses for ncols2>1 - and the compacted paged
-mask is gathered to the n_view (GGML_PAD 256) width so it carries that padding.
-
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
-Assisted-by: Claude:opus-4.8 [Claude Code]
---
- ggml/src/ggml-cuda/fattn.cu | 51 ++++++++++++++++++++++++++-----------
- 1 file changed, 36 insertions(+), 15 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
-index afcafa2..6b15810 100644
--- a/ggml/src/ggml-cuda/fattn.cu
-+++ b/ggml/src/ggml-cuda/fattn.cu
-@@ -580,32 +580,53 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
-     // silently read the wrong (contiguous physical) cells. So when a block table
-     // is present we route here and NEVER fall through to the best-kernel switch
-     // below - no decode shape can silently reach an mma/wmma misread. build_attn
-    // only sets src[5] for the 1-token-per-stream decode shape; the vec
-+    // only sets src[5] for the 1-token-per-stream decode shape; the vec/tile
-     // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
-     // and any shape that should not be paged must take the host-side gather path
-     // (LLAMA_KV_PAGED_GATHER=1) instead.
-     //
-    // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
-    // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
-    // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
-    // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
-    // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
-    // with oob_check=false while the compacted paged mask is not padded to cover
-    // it, so it diverges from stock. Not for production paged decode until
-    // increment-3 bounds that path; the default vec route is unaffected.
-+    // Default route = the GQA-grouped TILE kernel (inc-3) WHEN it is both correct
-+    // and a win, else the inc-1 vec path. Tile groups the q-heads that share one
-+    // kv-head (ncols2), loading each K/V row once for the whole group instead of
-+    // once per q-head, and runs at higher occupancy than vec (108-128 regs vs 168).
-+    // Two constraints make this conditional: (1) the tile kernel has no K/V type
-+    // template - it loads half2 - so a non-F16 cache (BF16/quantized) would be
-+    // converted by launch_fattn to a contiguous F16 copy, which breaks the
-+    // in-kernel block-table read (the table indexes the original paged layout, not
-+    // the copy); vec instead reads the original cache with in-kernel dequant, so it
-+    // is the only correct paged path for non-F16 caches. (2) the head-group reuse
-+    // only helps when gqa_ratio>=2. So route to tile only for {F16 K and V,
-+    // gqa_ratio>=2}; everything else stays on vec, matching stock (which also sends
-+    // quantized-cache decode to the vector kernel). Measured on GB10 (Qwen3-32B
-+    // nvfp4, F16 cache, gqa 8, batch 32, 1024 ctx): tile 177.9 ms/step vs vec 186.3
-+    // vs stock 174.8 - GQA grouping recovers ~4.5% over the inc-1 vec default and
-+    // brings paged decode to ~1.8% of stock. Validated token-coherent with vec:
-+    // 0.6B 8-seq 7/8 identical (8th within the kernel-noise band where vec also
-+    // drifts from stock), 32B gqa=8 tile tracks stock at least as well as vec, CPU
-+    // plumbing gate byte-identical. The ncols2>1 tile path reads the last nbatch_fa
-+    // tile with oob_check=false relying on mask -inf padding (the SAME pattern stock
-+    // uses for ncols2>1); the compacted paged mask is gathered to the n_view
-+    // (GGML_PAD 256) width so it carries that padding. LLAMA_KV_PAGED_VEC=1 forces
-+    // the inc-1 vec path for A/B.
-     if (dst->src[5] != nullptr) {
-        static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
-+        const ggml_tensor * Qp = dst->src[0];
-+        const ggml_tensor * Kp = dst->src[1];
-+        const ggml_tensor * Vp = dst->src[2];
-+        const bool kv_f16    = Kp->type == GGML_TYPE_F16 && Vp->type == GGML_TYPE_F16;
-+        const int64_t gqa_ratio = Kp->ne[2] > 0 ? Qp->ne[2] / Kp->ne[2] : 1;
-+        const bool force_vec = getenv("LLAMA_KV_PAGED_VEC") != nullptr;
-+        const bool use_tile  = !force_vec && kv_f16 && gqa_ratio >= 2;
-         if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
-             static bool logged = false;
-             if (!logged) {
-                 logged = true;
-                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
-                    paged_tile ? "TILE(experimental)" : "VEC",
-                    (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
-                    (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
-+                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld] gqa=%ld kv_f16=%d)\n",
-+                    use_tile ? "TILE(gqa)" : "VEC",
-+                    (long) Qp->ne[0], (long) Qp->ne[1], (long) Qp->ne[2], (long) Qp->ne[3],
-+                    (long) gqa_ratio, (int) kv_f16);
-             }
-         }
-        if (paged_tile) {
-+        if (use_tile) {
-             ggml_cuda_flash_attn_ext_tile(ctx, dst);
-         } else {
-             ggml_cuda_flash_attn_ext_vec(ctx, dst);
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0012-paged-mask-pad-invariant-assert.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0012-paged-mask-pad-invariant-assert.patch
@@ -1,50 +0,0 @@
-From 6e3e976e2b11adb05519f31dd5aad0c204678f5c Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Tue, 23 Jun 2026 11:12:05 +0200
-Subject: [PATCH] feat(paged): assert mask-pad invariant for the paged tile
- route (patch 0012)
-
-The now-default paged decode route (GQA-grouped fattn-tile kernel) does not
-leak past-end KV rows only because the compacted mask/block-table length is
-padded to a whole number of flash-attn KV tiles: n_view = GGML_PAD(n_gather,
-256), and the tile (nbatch_fa = 64 for head_dim 128) divides 256, so the last
-tile sits entirely inside the -inf pad window. That invariant was implicit.
-
-Add a defensive GGML_ASSERT(n_view % 64 == 0) right after the pad/clamp so a
-future change to the pad (e.g. < 256) or the tile (> 256) that broke the
-whole-tile property cannot silently reintroduce the leak. Additive only, no
-behaviour change.
-
-Verified: build-cpu compiles, and the paged CPU byte gate (LLAMA_KV_PAGED off
-vs on, Qwen3-0.6B-Q8_0, greedy, -ngl 0) stays byte-identical while the assert
-stays silent (n_view remains a whole number of tiles across all decode steps).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- src/paged-attn.cpp | 9 +++++++++
- 1 file changed, 9 insertions(+)
-
-diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
-index 8eebeaa..fed8ca9 100644
---- a/src/paged-attn.cpp
-+++ b/src/paged-attn.cpp
-@@ -201,6 +201,15 @@ bool in_kernel_decode(ggml_context * ctx0,
-         n_view = K->ne[2];
-     }
- 
-+    // The flash-attn KV tile is 64 rows wide (nbatch_fa for head_dim 128). n_view must be
-+    // a whole number of such tiles so the in-kernel decode never reads past the gathered
-+    // rows: the trailing pad cells [n_gather, n_view) are all -inf, so any tile straddling
-+    // the boundary still contributes zero. This holds today only because the pad (256) is a
-+    // multiple of the tile; a future pad < 256 (or nbatch_fa > 256) that broke it would
-+    // silently reintroduce a past-end KV leak, so assert it rather than trust it.
-+    // pad must be a multiple of the flash-attn KV tile so the last tile is fully inside the -inf pad
-+    GGML_ASSERT(n_view % 64 == 0);
-+
-     ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
-     ggml_set_input(idx);
-     res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
--- 
-2.43.0
-
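As a hedged illustration of the invariant that patch 0012 asserts, the sketch below rounds a gathered KV length up to a pad and checks that the result is a whole number of KV tiles. The names `pad_up`, `whole_tiles`, `kPad` and `kTileRows` are hypothetical stand-ins, not identifiers from llama.cpp; only the values 256 and 64 come from the commit message, and the clamp to `K->ne[2]` is not modelled.

```cpp
#include <cstdint>

// Assumed values, taken from the commit message: pad of 256 rows and a
// flash-attn KV tile of 64 rows (nbatch_fa for head_dim 128).
constexpr int64_t kPad      = 256;
constexpr int64_t kTileRows = 64;

// Round n up to the next multiple of pad (plain integer round-up).
constexpr int64_t pad_up(int64_t n, int64_t pad) {
    return (n + pad - 1) / pad * pad;
}

// True when the padded view length covers a whole number of KV tiles.
constexpr bool whole_tiles(int64_t n_view, int64_t tile) {
    return n_view % tile == 0;
}

// The property the patch relies on: the pad is itself a multiple of the tile.
static_assert(kPad % kTileRows == 0, "pad must be a multiple of the KV tile");
```

With the pad at 256 the check holds for any gathered length. If the pad were shrunk to 32, a length such as 330 would round to 352, which is not a multiple of 64; that is the kind of change the assert is meant to catch.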
--- a/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
@@ -1,137 +0,0 @@
-From 17d97cb74e3e8c93751afd33f5c183e57056fde9 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Tue, 23 Jun 2026 11:52:45 +0200
-Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
- 0013)
-
-llama-server already co-batches decode with chunked prefill: update_slots()
-appends every generating slot's sampled token first, then fills the rest of the
-n_batch budget with prompt tokens, deferring the overflow to the next step. But
-the prefill chunk size is hard-wired to n_batch (default 2048): one slot's
-~2048-token prefill chunk lands in a single compute-heavy step, and every decode
-co-batched into that step sees a multi-second inter-token-latency (ITL) spike.
-Lowering n_batch shrinks the chunk but also caps decode-concurrency width and
-prefill throughput, because they are coupled.
-
-Add LLAMA_PREFILL_BUDGET: a per-step prefill-token budget decoupled from n_batch
-(the analogue of vLLM's --max-num-batched-tokens / long_prefill_token_threshold).
-The prompt-fill loop and the outer slot loop now also stop once this many prompt
-tokens have been added in the current update_slots() step, so a long prefill is
-split across more steps that each still advance in-flight decode. Default (env
-unset or <= 0) = disabled, so stock behaviour is byte-identical. Orthogonal to
-LLAMA_KV_PAGED: this is a pure scheduler knob and works with paged off.
-
-Measured on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8 steady decode
-streams with one 6000-token prefill injected mid-stream; same binary, only
-LLAMA_PREFILL_BUDGET differs:
-
-  metric                        stock(off)  budget=256   budget=512
-  worst decode freeze (ms)         3380      482 (7.0x)   778 (4.3x)
-  median decode ITL in window      2264      411 (5.5x)   689
-  decode_stall (ms)                3285      387 (8.5x)   684 (4.8x)
-  decode steps during prefill        38      201 (5.3x)   108
-  injected-req TTFT (ms)           8493     10172 (+20%)  8432 (~0%)
-  steady-state baseline ITL          94        95          94
-
-This is a LATENCY/fairness lever, not an aggregate-throughput lever: it flattens
-the decode ITL spike a long prefill inflicts on co-batched decoders (8.5x smaller
-decode_stall and 5.3x more decode steps during the prefill at budget=256), in
-exchange for a modest TTFT rise on the long request (the classic chunked-prefill
-trade-off; budget=512 buys 4.8x on decode_stall at ~no TTFT cost). Steady aggregate decode is
-unchanged: it is bandwidth/weight-capped on GB10 (the NVFP4 weight-read floor),
-which the scheduler cannot lift.
-
-Correctness (same model, greedy temp 0, fa on):
-- budget unset or >= n_batch: byte-identical to stock (the added break never
-  fires before the existing n_batch break; the off-path is a no-op by
-  construction).
-- short prompt (<= budget): byte-identical to stock.
-- the knob is exactly equivalent to stock's native -b chunking: budget=512 ==
-  stock -b512 and budget=256 == stock -b256, both BYTE-IDENTICAL, while keeping
-  n_batch=2048 for decode width.
-- on a prompt larger than the budget the chunked greedy output diverges from the
-  single n_batch chunk only by intrinsic flash-attn chunk-size FP grouping: PURE
-  stock -b256 diverges from stock -b2048 the same way with the patch inactive,
-  and the output stays coherent and answers correctly.
-
-Productisation (LocalAI): surface as a model options knob (max_prefill_tokens /
-mpt) parsed in grpc-server.cpp, default 0 = disabled, per CHUNKED_PREFILL_PLAN
-Phase B; the vendored update_slots() hunk here is that plan's scheduler patch and
-stays disjoint from the paged allocation hunks.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
- 1 file changed, 34 insertions(+), 1 deletion(-)
-
-diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 04c6361..5d83b30 100644
---- a/tools/server/server-context.cpp
-+++ b/tools/server/server-context.cpp
-@@ -2723,6 +2723,29 @@ private:
-         int32_t n_batch  = llama_n_batch(ctx_tgt);
-         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
- 
-+        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
-+        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
-+        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
-+        // sampled decode tokens of every generating slot are appended FIRST, then prompt
-+        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
-+        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
-+        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
-+        // tokens added per step independently of n_batch, splitting a long prefill across
-+        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
-+        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
-+        // (this is a pure scheduler knob; works with paged off).
-+        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
-+        {
-+            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
-+            if (env_pb) {
-+                const int v = atoi(env_pb);
-+                if (v > 0) {
-+                    n_prefill_budget = std::min(n_batch, std::max(1, v));
-+                }
-+            }
-+        }
-+        int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
-+
-         float  alora_scale       = -1.0f;
-         size_t alora_disabled_id = 0;
- 
-@@ -3159,7 +3182,10 @@ private:
-                     const bool n_before_user_known = n_before_user > 0;
- 
-                     // add prompt tokens for processing in the current batch
--                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) {
-+                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
-+                    // prompt is split across more steps and leaves batch room for co-batched decode
-+                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
-+                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
-                         // get next token to process
-                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
-                         if (cur_tok == LLAMA_TOKEN_NULL) {
-@@ -3185,6 +3211,7 @@ private:
-                         slot.prompt.tokens.push_back(cur_tok);
- 
-                         slot.n_prompt_tokens_processed++;
-+                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
- 
-                         // stop the prompt batch exactly before the latest user input, so a checkpoint
-                         // can be created after the previous messages
-@@ -3293,6 +3320,12 @@ private:
-                 if (batch.n_tokens >= n_batch) {
-                     break;
-                 }
-+
-+                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
-+                // leaving the remaining batch capacity for co-batched decode of other slots
-+                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
-+                    break;
-+                }
-             }
-         }
- 
--- 
-2.43.0
-
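The effect of the per-step budget in patch 0013 can be sketched with a toy model of one prefilling slot. This is an illustration under stated assumptions, not the server code: `resolve_prefill_budget`, `prompt_tokens_this_step` and `steps_to_ingest` are hypothetical names, only one slot is prefilling, and the decode load is held constant across steps.

```cpp
#include <algorithm>
#include <cstdint>

// Effective per-step budget: a non-positive setting disables it (0),
// a positive one is clamped to n_batch, as in the patch.
int32_t resolve_prefill_budget(int32_t raw, int32_t n_batch) {
    return raw > 0 ? std::min(n_batch, raw) : 0;
}

// Prompt tokens admitted in one step for a single prefilling slot, after
// n_decode decode tokens have already been appended to the batch.
int32_t prompt_tokens_this_step(int32_t remaining, int32_t n_batch,
                                int32_t n_decode, int32_t budget) {
    const int32_t room = std::max<int32_t>(0, n_batch - n_decode);
    int32_t take = std::min(remaining, room);
    if (budget > 0) {
        take = std::min(take, budget); // budget == 0 means "n_batch only"
    }
    return take;
}

// Scheduler steps needed to ingest the whole prompt, or -1 if no prompt
// token can ever be admitted (the batch is already full of decode tokens).
int32_t steps_to_ingest(int32_t prompt_len, int32_t n_batch,
                        int32_t n_decode, int32_t budget) {
    int32_t steps = 0;
    while (prompt_len > 0) {
        const int32_t take = prompt_tokens_this_step(prompt_len, n_batch, n_decode, budget);
        if (take == 0) {
            return -1;
        }
        prompt_len -= take;
        ++steps;
    }
    return steps;
}
```

Under these assumptions a 6000-token prompt with n_batch=2048 and 8 decoders takes 3 steps with the budget disabled and 24 steps at a budget of 256. The model counts scheduler steps only; it is not meant to reproduce the measured figures in the commit message, which also depend on ubatch splitting and timing.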
--- a/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
@@ -1,140 +0,0 @@
-From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Tue, 23 Jun 2026 15:47:06 +0200
-Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)
-
-On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
-sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
-mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
-originally reported npl128 throughput cliff does NOT reproduce on this build.
-llama-batched-bench decode (S_TG t/s) is monotonic across batch:
-
-  npl        1     8    32    64   128   256
-  S_TG     85   282   629   935  1295  1779   (stock, mxfp4 MoE, -fa on)
-
-There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
-at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.
-
-What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
-token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
-column upper bound = token count, up to 128) in one column-tile. At MoE decode
-the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
-ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
-col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
-time and burns throughput on the padding columns while the larger y-tile lowers
-occupancy. Stock picks the LARGEST tile (128) where the SMALLEST tile that still
-covers the density would raise fill + occupancy at no extra weight read (at
-tokens/expert <= mmq_x there is exactly one non-empty col-tile per expert; the
-emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k
-kernel) - the inverse of vLLM's small per-expert BLOCK_SIZE_M.
-
-Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
-(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
-selection, and therefore every kernel launched, is byte-identical to stock. The
-cap only ever lowers the loop's upper bound and still selects from the same
-granularity- and shared-memory-validated mmq_x set stock already uses for
-smaller batches, so no new kernel configuration is exercised.
-
-Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
-only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):
-
-  npl     stock S_TG   cap64 S_TG    d%     stock S_PP   cap64 S_PP
-   64        936          938      +0.1       2924         2883
-  128       1295         1357      +4.8       3075         3038
-  256       1784         1825      +2.3       3085         3046
-
-  (reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)
-
-cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
-npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
-cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
-tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
-re-reads), so 64 is the recommended value and the only one that helps net.
-
-Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
-throughput unlock (llama-server continuous batching already scales). It is a
-modest high-effective-batch DECODE micro-optimization that matches vLLM's
-smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
-durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
-ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
-patches/paged/MOE_GROUPED_GEMM_SCOPE.md.
-
-Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
-stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
-prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
-npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
- 1 file changed, 36 insertions(+), 1 deletion(-)
-
-diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
-index edf546d..cff608e 100644
---- a/ggml/src/ggml-cuda/mmq.cuh
-+++ b/ggml/src/ggml-cuda/mmq.cuh
-@@ -6,6 +6,7 @@
- 
- #include <climits>
- #include <cstdint>
-+#include <cstdlib>
- 
- using namespace ggml_cuda_mma;
- 
-@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
-     }
- }
- 
-+// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
-+// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
-+// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
-+// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
-+static inline int ggml_cuda_moe_mmq_x_cap() {
-+    static const int cap = []() -> int {
-+        const char * s = getenv("LLAMA_MOE_MMQ_X");
-+        return s ? atoi(s) : 0;
-+    }();
-+    return cap;
-+}
-+
- template <ggml_type type>
- void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
-     const int    id     = ggml_cuda_get_device();
-@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
-     const int mmq_x_max = get_mmq_x_max_host(cc);
-     const int mmq_y = get_mmq_y_host(cc);
- 
-+    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
-+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
-+    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
-+    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
-+    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
-+    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
-+    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
-+    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
-+    // per-expert density raises tile fill + occupancy with no extra weight reads (at
-+    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
-+    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
-+    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
-+    // off the ids path the cap never applies.
-+    int mmq_x_lim = mmq_x_max;
-+    if (args.expert_bounds != nullptr) {
-+        const int moe_cap = ggml_cuda_moe_mmq_x_cap();
-+        if (moe_cap > 0) {
-+            const int cap = moe_cap < 8 ? 8 : moe_cap;
-+            mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
-+        }
-+    }
-+
-     int mmq_x_best  = 0;
-     int ntiles_x_best = INT_MAX;
- 
--    for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
-+    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
-         const int granularity = mmq_get_granularity_host(mmq_x, cc);
- 
-         if (mmq_x % granularity != 0 || mmq_get_nbytes_shared<type>(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
--- 
-2.43.0
-
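The clamp that patch 0014 applies to the loop bound, and the tile-fill arithmetic behind it, can be sketched as follows. This is a standalone illustration: `mmq_x_limit` and `tile_fill` are hypothetical helpers, `on_ids_path` stands in for the `expert_bounds != nullptr` check, and the fill figure is a simple ratio rather than anything measured on hardware.

```cpp
#include <algorithm>

// Upper bound for the mmq_x search loop. moe_cap <= 0 means "no cap";
// on_ids_path mirrors the expert_bounds != nullptr check in the patch.
int mmq_x_limit(int mmq_x_max, int moe_cap, bool on_ids_path) {
    if (!on_ids_path || moe_cap <= 0) {
        return mmq_x_max;
    }
    const int cap = std::max(moe_cap, 8); // the search starts at 8, so never go below it
    return std::min(cap, mmq_x_max);      // the cap can only lower the bound
}

// Fraction of one mmq_x-wide column tile occupied by a single expert's tokens,
// saturating at 1.0 once the expert needs more than one tile.
double tile_fill(int tokens_per_expert, int mmq_x) {
    return std::min(1.0, static_cast<double>(tokens_per_expert) / mmq_x);
}
```

With about 8 tokens per expert, a 128-wide tile is 6.25% filled and a 64-wide tile is 12.5% filled, which is the fill improvement the commit message describes for npl128.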
--- a/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
@@ -1,238 +0,0 @@
-From 151343bc8c7b956c99eafc855704b70d44637a3b Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Tue, 23 Jun 2026 21:03:00 +0200
-Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
- (patch 0015)
-
-The durable follow-up to patch 0014's blunt LLAMA_MOE_MMQ_X global cap (which the
-0014 doc itself scoped): replace the manual env cap with a host-side, default-on
-auto-select inside mul_mat_q_case that picks a small token-tile (mmq_x) for the
-MUL_MAT_ID grouped FP4-MMA GEMM only when the per-expert token density is low
-(decode), and keeps the large 128-wide tile when density is high (prefill). No new
-kernel: the selection only lowers the loop's upper bound to an already-compiled,
-granularity- and shared-memory-validated mmq_x.
-
-Density is estimated host-side from the args the ids path already passes:
-  ne_get_rows = ncols_dst   = ne12 * n_expert_used   (token-expert assignments)
-  n_experts   = nchannels_x = ne02
-  density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
-Cap to the small tile (default 64) only when density <= density_max. Unlike 0014's
-global cap, the high-density prefill ubatch stays on the big tile, so S_PP does not
-regress by construction.
-
-density_max default = 8 (not tile/4 = 16). The cap must fire for decode but not for
-a prefill ubatch, and each has per-expert density n_tokens*n_used/n_experts. At the
-standard n_ubatch=512, n_used=8: prefill density = 4096/n_experts (32 at 128 experts,
-16 at 256), decode at npl<=128 is <= 1024/n_experts (8 at 128, 4 at 256). Default 8
-is >= the decode density and < the prefill density for every n_experts in [128,511],
-so it caps decode and leaves prefill on the big tile. tile/4 (=16) equalled the
-256-expert prefill density and cost its S_PP ~2%, the regression this threshold exists to avoid.
-
-Measured on GB10 (sm_121), Qwen3.6-35B-A3B NVFP4 (256 experts, top-8, GDN linear
-attention), llama-batched-bench -fa on -npp 128 -ntg 128, default-on vs stock
-(LLAMA_MOE_AUTO_TILE=0), median of 5 reps:
-
-  npl   S_TG stock  S_TG 0015   dTG%    S_PP stock  S_PP 0015   dPP%
-    8      183.59     183.18  -0.22%       1489.2     1500.1  +0.73%
-   32      264.02     263.44  -0.22%       2034.5     2033.5  -0.05%
-   64      311.76     310.41  -0.43%       2028.3     2027.6  -0.03%
-  128      336.10     337.32  +0.36%       2025.0     2027.7  +0.13%
-
-Honest read: on THIS model the decode effect is within run-to-run noise (neutral)
-and prefill is neutral. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
-256 tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
-lever (worth +4.8% @npl128 on Qwen3-Coder-30B, 128 larger experts, patch 0014
-cap64) does not move it. A npl128 tile sweep on this model confirms 64 is the only
-useful width (TILE8 -6.3%, TILE16 -3.2%, TILE32 -0.2%, TILE64 +0.7%, TILE96 -0.8%):
-smaller tiles lose to grid/scheduling overhead and the FP4-MMA minimum width.
-
-Value banked default-on: (1) removes 0014's ~1.3% prefill cost by construction
-(density-gated, not global); (2) auto-selects the small tile for col-tile-bound MoE
-decode, reproducing 0014 cap64's tile=64 at npl128 by construction, so it preserves
-the +4.8% on Qwen3-Coder-30B without the prefill cost; (3) prefill-safe and decode-
-neutral on the SSM model, harmless where it does not help. Conservative by design:
-at npl256 the qwen3coder decode density (16) equals the 256-expert prefill density
-(16), indistinguishable to a pure-density gate, so density_max=8 forgoes 0014's
-+2.3% @npl256 to keep 256-expert prefill safe; an ne12-aware refinement is future
-work.
-
-LLAMA_MOE_MMQ_X (patch 0014) is KEPT as a manual override that, when > 0, forces the
-old blunt global cap and bypasses the auto-select (explicit A/B knob). The auto-
-select is the default; LLAMA_MOE_AUTO_TILE=0 restores exact stock mmq_x selection.
-LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX tune the small tile / threshold.
-
-Correctness: extends tests/test-backend-ops test_mul_mat_id with a ragged small-M
-NVFP4/MXFP4 MoE decode-density gate (128 experts, top-8, m=768, k=2048, n in
-{16,33,64,128,130,200,256,512} spanning the cap boundary and ragged token counts).
-All 16 shapes pass CUDA-vs-CPU oracle on GB10 both default-on and with
-LLAMA_MOE_AUTO_TILE=0; full MUL_MAT_ID suite 2/2 backends OK. Off the ids path
-nothing changes (non-MoE mul_mat byte-identical to stock).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- ggml/src/ggml-cuda/mmq.cuh | 100 ++++++++++++++++++++++++++++++-------
- tests/test-backend-ops.cpp |  16 ++++++
- 2 files changed, 99 insertions(+), 17 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
-index cff608e..9718b12 100644
---- a/ggml/src/ggml-cuda/mmq.cuh
-+++ b/ggml/src/ggml-cuda/mmq.cuh
-@@ -4053,10 +4053,11 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
-     }
- }
- 
--// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
--// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
--// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
--// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
-+// [paged patch 0014] MoE token-tile (mmq_x) MANUAL cap, read once from env LLAMA_MOE_MMQ_X.
-+// Returns 0 when unset / non-positive => disabled (fall through to the patch-0015 auto-select).
-+// When > 0 it forces a blunt GLOBAL cap on the per-expert column-tile width for the MUL_MAT_ID
-+// grouped-GEMM path (decode AND prefill), overriding the density-aware auto-select below. Kept
-+// as an explicit override / A-B knob; the default path is now the auto-select.
- static inline int ggml_cuda_moe_mmq_x_cap() {
-     static const int cap = []() -> int {
-         const char * s = getenv("LLAMA_MOE_MMQ_X");
-@@ -4065,6 +4066,43 @@ static inline int ggml_cuda_moe_mmq_x_cap() {
-     return cap;
- }
- 
-+// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select knobs (DEFAULT-ON).
-+// LLAMA_MOE_AUTO_TILE=0 disables the auto-select => exact stock mmq_x selection.
-+static inline bool ggml_cuda_moe_auto_tile_enabled() {
-+    static const bool en = []() -> bool {
-+        const char * s = getenv("LLAMA_MOE_AUTO_TILE");
-+        return !(s && atoi(s) == 0);
-+    }();
-+    return en;
-+}
-+// The small high-occupancy token-tile chosen for low-density (decode) MoE matmuls. Default 64:
-+// the measured GB10 sweet spot (full per-expert fill with >=4x routing-imbalance headroom).
-+static inline int ggml_cuda_moe_decode_tile() {
-+    static const int t = []() -> int {
-+        const char * s = getenv("LLAMA_MOE_DECODE_TILE");
-+        const int v = s ? atoi(s) : 0;
-+        return v >= 8 ? v : 64;
-+    }();
-+    return t;
-+}
-+// Per-expert token-density ceiling under which the small tile is selected. Default 8: the cap must
-+// fire for decode but NOT for a prefill ubatch, and the per-expert density of each is
-+// n_tokens*n_used/n_experts. For the standard n_ubatch=512, n_used=8 the prefill density is
-+// 4096/n_experts (= 32 at 128 experts, 16 at 256 experts); decode at npl<=128 is <=1024/n_experts
-+// (= 8 at 128 experts, 4 at 256). Default 8 is >= the decode density and < the prefill density
-+// for every n_experts in [128,511], so it caps decode and leaves the prefill ubatch on the big
-+// 128 tile - whereas the old tile/4 (=16) equalled the 256-expert prefill density and cost its S_PP ~2% (measured on
-+// Qwen3.6-35B-A3B NVFP4). 8 also keeps >=8x fill headroom at tile 64 so an imbalanced expert
-+// segment never splits into an extra col-tile.
-+static inline int ggml_cuda_moe_density_max() {
-+    static const int d = []() -> int {
-+        const char * s = getenv("LLAMA_MOE_DENSITY_MAX");
-+        const int v = s ? atoi(s) : 0;
-+        return v > 0 ? v : 8;
-+    }();
-+    return d;
-+}
-+
- template <ggml_type type>
- void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
-     const int    id     = ggml_cuda_get_device();
-@@ -4076,25 +4114,53 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
-     const int mmq_x_max = get_mmq_x_max_host(cc);
-     const int mmq_y = get_mmq_y_host(cc);
- 
--    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
--    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
--    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
--    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
--    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
--    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
--    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
--    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
--    // per-expert density raises tile fill + occupancy with no extra weight reads (at
--    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
--    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
--    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
--    // off the ids path the cap never applies.
-+    // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
-+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
-+    // sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, up to 128)
-+    // in a single column-tile, i.e. it MAXIMIZES the tile (128 on Blackwell) for the aggregate
-+    // batch. But the tile is then applied PER EXPERT, and at MoE decode the per-expert token
-+    // density is tiny (top-k of many experts), so each expert's single 128-wide col-tile is mostly
-+    // empty: the MMA accumulator tile is mmq_x-wide at compile time and burns throughput on the
-+    // padding columns while the larger y-tile lowers occupancy. vLLM's fused-MoE does the opposite
-+    // (a small per-expert BLOCK_SIZE_M). We reproduce that here, host-side only, by picking a
-+    // SMALLER mmq_x when - and only when - the per-expert density is low:
-+    //
-+    //   ne_get_rows  = args.ncols_dst    = ne12 * n_expert_used  (total token-expert assignments)
-+    //   n_experts    = args.nchannels_x  = ne02
-+    //   n_active_est = min(n_experts, ne_get_rows)               (upper bound on active experts)
-+    //   density      = ceil(ne_get_rows / n_active_est)          (avg tokens per active expert)
-+    //
-+    // Cap to the small tile (default 64) only when density <= density_max (default 8). 8 sits below
-+    // every prefill-ubatch density and at or above every decode density for n_experts in [128,511] at the
-+    // standard n_ubatch=512 (prefill 4096/n_experts, decode <=1024/n_experts), with >=8x fill headroom
-+    // so a capped expert segment never splits a col-tile. Decode (per-expert density 4 at 256 experts,
-+    // 8 at 128 experts @npl128) gets the fuller high-occupancy tile; the prefill ubatch (density 16 at
-+    // 256 / 32 at 128 experts) stays ABOVE the threshold and keeps the big
-+    // 128 compute tile - so unlike the blunt global cap (LLAMA_MOE_MMQ_X / patch 0014) this is
-+    // prefill-safe by construction. The selection only ever picks an already-compiled, granularity-
-+    // and shared-memory-validated mmq_x that the loop below would consider for a smaller batch; no
-+    // new kernel. Off the ids path (expert_bounds == nullptr) nothing changes => non-MoE mul_mat
-+    // and the gated f16/bf16 host-loop fallback stay byte-identical to stock.
-+    //   - LLAMA_MOE_MMQ_X=<n>   : manual blunt global cap, overrides the auto-select (patch 0014).
-+    //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
-+    //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
-     int mmq_x_lim = mmq_x_max;
-     if (args.expert_bounds != nullptr) {
-         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
-         if (moe_cap > 0) {
-             const int cap = moe_cap < 8 ? 8 : moe_cap;
-             mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
-+        } else if (ggml_cuda_moe_auto_tile_enabled()) {
-+            const int64_t ne_get_rows = args.ncols_dst;
-+            const int64_t n_experts   = args.nchannels_x;
-+            if (ne_get_rows > 0 && n_experts > 0) {
-+                const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
-+                const int64_t density  = (ne_get_rows + n_active - 1) / n_active;
-+                const int     tile     = ggml_cuda_moe_decode_tile();
-+                if (density <= (int64_t) ggml_cuda_moe_density_max() && tile < mmq_x_max) {
-+                    mmq_x_lim = tile;
-+                }
-+            }
-         }
-     }
- 
-diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index 15ae389..f219309 100644
---- a/tests/test-backend-ops.cpp
-+++ b/tests/test-backend-ops.cpp
-@@ -8575,6 +8575,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
-     // gpt-oss issue with Vulkan mmq_id
-     test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
- 
-+    // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
-+    // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
-+    // m=768, k=2048) so this exercises the exact grouped FP4-MMA mmq kernel the model runs.
-+    // Per-expert token density = n*n_used/n_mats = n/16; cover the decode band (density 1/4/8/16
-+    // at n 16/64/128/256), ragged token counts (n 33/130/200: experts with 0/1/2 tokens, n not a
-+    // multiple of the tile) where the tiny-M col-tiles change geometry and any masking can leak,
-+    // and a prefill-density shape (n 512 => density 32) the auto-select must leave on the large
-+    // 128 tile. With the default density_max=8, n=128 is the shape where stock picks mmq_x=128 and the auto-select picks 64, so the
-+    // op-test (CPU oracle vs CUDA, deterministic) is the bit-exact regression gate for P1: it must
-+    // pass with the auto-select on (default) and with LLAMA_MOE_AUTO_TILE=0 (stock selection).
-+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
-+        for (int n : {16, 33, 64, 128, 130, 200, 256, 512}) {
-+            test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 128, 8, false, 768, n, 2048));
-+        }
-+    }
-+
-     for (ggml_type type_a : all_types) {
-         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
-     }
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
@@ -1,205 +0,0 @@
-From 0a2677c6e6c608f9c0ec657faa0ff04a03370aa6 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Wed, 24 Jun 2026 07:44:25 +0000
-Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
- 0016, continuous-batch P1)
-
-Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
-decode-first token budget: the P1 of the token-granular continuous-batch
-scheduler scoped in CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. This is a POLICY
-change only inside update_slots(): no new slot states, no batch-formation
-rewrite, zero libllama changes. llama-server already emits one unified
-mixed prefill+decode batch per step (Phase 1 appends every ready decode
-token unconditionally; Phase 2 fills prefill into the same batch); 0013
-already ships that mixed ubatch. 0016 only changes the COUNT of prefill
-tokens admitted per step.
-
-The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
-== D (the live decode load) is known there. Instead of 0013's constant
-LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets
-one long prompt monopolise the step), compute a dynamic budget:
-
-  T  = min(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_batch), floored at
-       n_ubatch (the vLLM max_num_batched_tokens analogue / ITL trade knob)
-  prefill_budget_step  = max(n_ubatch, T - D)   (leftover after decode,
-       auto-shrinks as decode load rises so the step never inflates past T)
-  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch
-       (the long_prefill_token_threshold analogue: one long prompt cannot
-       eat the whole leftover; LLAMA_PREFILL_CAP overrides)
-
-Phase 2's inner prompt-fill loop and outer admission break are bounded by
-prefill_budget_step (across slots) and a new per-slot slot_prompt_added
-counter (per-slot cap), instead of the static 0013 cap; the n_batch hard
-ceiling stays as the compute bound. Decode is structurally claimed first
-and never capped (Phase 1), so the decode-first guarantee is free.
-
-Why it supersedes 0013: 0013 needs a hand-picked constant (256 for dense)
-that is net-negative at low npl and costs MoE TTFT; the T - D budget is
-self-tuning across npl 8..128 and across dense vs MoE, holding the GB10
-decode ceiling (~161 dense / ~333 MoE tok/s @npl128) WITHOUT per-workload
-tuning while collapsing burst TTFT. Steady-state decode throughput is NOT
-lifted (that is the decode-kernel ceiling, scoped as P3); the P1 win is
-TTFT + tuning-free robustness + clean supersession of 0013.
-
-DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
-to stock. The degenerate T == n_batch case is byte-identical to stock/0013
-(the determinism oracle): the leftover max(n_ubatch, n_batch - D) and the
-n_batch per-slot cap both reach the existing `batch.n_tokens < n_batch`
-ceiling at the same point, so no new bound fires. The legacy
-LLAMA_PREFILL_BUDGET path is preserved exactly (honoured only when
-LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly subsumed. Orthogonal
-to LLAMA_KV_PAGED: pure scheduler policy, identical decisions paged on/off.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- tools/server/server-context.cpp | 107 +++++++++++++++++++++++++-------
- 1 file changed, 85 insertions(+), 22 deletions(-)
-
-diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
-index 5d83b30..f7a114c 100644
--- a/tools/server/server-context.cpp
-+++ b/tools/server/server-context.cpp
-@@ -2723,24 +2723,78 @@ private:
-         int32_t n_batch  = llama_n_batch(ctx_tgt);
-         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
- 
-        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
-        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
-        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
-        // sampled decode tokens of every generating slot are appended FIRST, then prompt
-        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
-        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
-        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
-        // tokens added per step independently of n_batch, splitting a long prefill across
-        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
-        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
-        // (this is a pure scheduler knob; works with paged off).
-        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
-+        // PAGED serving lever (patch 0016, supersedes 0013): dynamic decode-first
-+        // per-step prefill-token budget (continuous-batch scheduler P1). llama-server
-+        // already builds ONE mixed batch per update_slots() step: Phase 1 (just above)
-+        // appended every generating slot's sampled token UNCONDITIONALLY, so at this point
-+        // batch.n_tokens == D is the live decode load; Phase 2 (below) fills the remaining
-+        // batch capacity with prompt tokens. Patch 0013 capped Phase 2 with a STATIC
-+        // constant (LLAMA_PREFILL_BUDGET) that ignores D, needs per-workload tuning, and
-+        // lets one long prompt monopolise the step.
-+        //
-+        // This computes a DYNAMIC budget instead, the vLLM v1 token-budget analogue:
-+        // a single total per-step token budget T, decode claims its D tokens first
-+        // (already in the batch), and prefill gets the leftover T - D distributed across
-+        // waiting prompts with a per-slot chunk cap. As decode load D rises the prefill
-+        // leftover auto-shrinks, so the step never inflates past T at any concurrency:
-+        // the budget self-tunes across the npl range and across dense vs MoE without a
-+        // hand-picked constant (the 161/333 tok/s GB10 decode ceiling is held tuning-free
-+        // instead of via 0013's hand-tuned 256). Decode is structurally claimed first and
-+        // never capped (Phase 1), so the decode-first guarantee is free here.
-+        //
-+        //   LLAMA_MAX_BATCH_TOKENS (T)  total per-step token budget (decode + prefill),
-+        //                               default n_batch, clamped to [n_ubatch, n_batch] so
-+        //                               the compute loop stays a single llama_decode and
-+        //                               prefill keeps an n_ubatch floor of progress.
-+        //   LLAMA_PREFILL_CAP           per-slot max prompt tokens per step (the
-+        //                               long_prefill_token_threshold analogue), default
-+        //                               min(T, ceil(0.04*n_ctx)) floored at n_ubatch, so
-+        //                               one long prompt cannot eat the whole leftover.
-+        //   LLAMA_PREFILL_BUDGET        legacy static cap (patch 0013); honoured ONLY when
-+        //                               LLAMA_MAX_BATCH_TOKENS is unset, for back-compat.
-+        //
-+        // DEFAULT-OFF BYTE-IDENTICAL: with all three knobs unset, and in the degenerate
-+        // T == n_batch case, behaviour is byte-identical to stock. At T == n_batch the
-+        // dynamic leftover max(n_ubatch, n_batch - D) and the n_batch per-slot cap both
-+        // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
-+        // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
-+        // scheduler policy, identical decisions with paged on or off.
-+        const int32_t n_decode_in_batch = batch.n_tokens; // D: Phase 1 appended D decode tokens above
-+        int32_t prefill_budget_step  = 0; // 0 = disabled (stock n_batch-only chunking)
-+        int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
-         {
-            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
-            if (env_pb) {
-+            int32_t mbt = 0;
-+            if (const char * env_mbt = getenv("LLAMA_MAX_BATCH_TOKENS")) {
-+                mbt = atoi(env_mbt);
-+            }
-+            if (mbt > 0) {
-+                // dynamic decode-first budget (P1): T clamped to [n_ubatch, n_batch]
-+                int32_t T = std::min(n_batch, mbt);
-+                T = std::max(T, n_ubatch);
-+                // leftover after decode, floored at n_ubatch so prefill never fully starves
-+                prefill_budget_step = std::max(n_ubatch, T - n_decode_in_batch);
-+                // per-slot prompt-chunk cap (long_prefill_token_threshold analogue)
-+                int32_t cap = 0;
-+                if (const char * env_cap = getenv("LLAMA_PREFILL_CAP")) {
-+                    cap = atoi(env_cap);
-+                }
-+                if (cap <= 0) {
-+                    const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
-+                    cap = std::min(T, std::max(n_ubatch, pct4));
-+                }
-+                cap = std::min(n_batch, std::max(n_ubatch, cap));
-+                // at T == n_batch the leftover and cap both reach the n_batch ceiling
-+                // together; pin the cap to n_batch so this case stays byte-identical
-+                if (T >= n_batch) {
-+                    cap = n_batch;
-+                }
-+                prefill_cap_per_slot = cap;
-+            } else if (const char * env_pb = getenv("LLAMA_PREFILL_BUDGET")) {
-+                // legacy static budget (patch 0013), kept for back-compat when the
-+                // dynamic knob is unset: a constant per-step prefill cap, no per-slot cap
-                 const int v = atoi(env_pb);
-                 if (v > 0) {
-                    n_prefill_budget = std::min(n_batch, std::max(1, v));
-+                    prefill_budget_step = std::min(n_batch, std::max(1, v));
-                 }
-             }
-         }
-@@ -3181,11 +3235,18 @@ private:
-                     const int32_t n_before_user = slot.task->params.n_before_user;
-                     const bool n_before_user_known = n_before_user > 0;
- 
-+                    // (patch 0016) per-slot prompt tokens added this step, for the per-slot
-+                    // chunk cap (resets each slot); n_batch stays the hard compute ceiling
-+                    int32_t slot_prompt_added = 0;
-+
-                     // add prompt tokens for processing in the current batch
-                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
-                    // prompt is split across more steps and leaves batch room for co-batched decode
-+                    // (patch 0016) also stop once (a) the dynamic per-step prefill budget
-+                    // (the T - D leftover) is spent across all slots, or (b) this slot's
-+                    // per-slot chunk cap is hit, so a long prompt is split across more steps
-+                    // and leaves batch room for co-batched decode of the other slots
-                     while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
-                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
-+                           (prefill_budget_step  == 0 || n_prompt_budgeted < prefill_budget_step) &&
-+                           (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
-                         // get next token to process
-                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
-                         if (cur_tok == LLAMA_TOKEN_NULL) {
-@@ -3211,7 +3272,8 @@ private:
-                         slot.prompt.tokens.push_back(cur_tok);
- 
-                         slot.n_prompt_tokens_processed++;
-                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
-+                        n_prompt_budgeted++;  // (patch 0016) toward the dynamic per-step prefill budget
-+                        slot_prompt_added++;  // (patch 0016) toward this slot's per-step chunk cap
- 
-                         // stop the prompt batch exactly before the latest user input, so a checkpoint
-                         // can be created after the previous messages
-@@ -3321,9 +3383,10 @@ private:
-                     break;
-                 }
- 
-                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
-                // leaving the remaining batch capacity for co-batched decode of other slots
-                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
-+                // (patch 0016) stop admitting prompts once the dynamic per-step prefill
-+                // budget (the T - D leftover) is spent, leaving the remaining batch
-+                // capacity for co-batched decode of the other slots
-+                if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
-                     break;
-                 }
-             }
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch
@@ -1,245 +0,0 @@
-From 089f78d2a2c04465a566d499dbe0a67c008435a8 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Wed, 24 Jun 2026 19:56:05 +0200
-Subject: [PATCH] feat(paged): FP4 decode GEMM track-B P0 gate + default-off
- occupancy instrumentation (patch 0017)
-
-Track B targets the dense NVFP4 weight GEMM (~59% of the GB10 decode step). This lands the P0
-bit-exact parity gate and the P1 occupancy levers (default-off / byte-identical) and records the
-honest P1 result: the cheap host/occupancy tuning does NOT lift decode_agg on GB10 (sm_121) - the
-kill-gate tripped - so nothing is enabled by default.
-
-P0 gate (tests/test-backend-ops.cpp): NVFP4/MXFP4 dense decode-shape MUL_MAT cases at the weight-
-row tiling boundary (m in {2048,1600,2050} = exact + ragged vs mmq_y 64/128, n in {32,128} = decode
-M, k=2048), so the bit-exact CPU-vs-CUDA oracle covers the mmq_y / min-blocks paths. Green at
-default and with every lever on: MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 fail.
-
-P1 levers (ggml/src/ggml-cuda/mmq.cuh), all default-off => default build byte-identical to stock:
-  - GGML_CUDA_FP4_MMQ_Y (default 128): type-aware get_mmq_y_host/device plumbing for an NVFP4
-    weight-row tile override. mmq_y is rigidly nwarps*tile_C::I (=8*16=128, the mmq.cuh static_
-    assert), so mmq_y<128 also needs nwarps-down (a warp-remap through the shared vec_dot/loader),
-    left as the P2 kernel change; the host/device plumbing is in place and inert.
-  - GGML_CUDA_FP4_MINBLOCKS (default 1): NVFP4-only __launch_bounds__ min-resident-CTAs lever
-    (register-cap the FP4-MMA kernel so >1 CTA co-resides) - the bounded occupancy probe.
-  - GGML_CUDA_FP4_DENSE_MMQ_X (env, default off): dense col-tile re-read occupancy diagnostic.
-
-Measured GB10 (llama-batched-bench -fa on -npp 128 -ntg 128 -npl 32,128), decode_agg (S_TG):
-  DENSE q36-27b-nvfp4 @npl128: P0 149.5 -> MINBLOCKS=2 147.9 (-1.1%) -> DENSE_MMQ_X=64 144.3
-    (-3.5%) -> =32 141.7 (-5.2%). Every occupancy probe regresses.
-  MoE q36-35b-a3b-nvfp4 @npl128: stock 336.3, MINBLOCKS=2 337.7 (+0.4%, noise), TILE16 324.0
-    (-3.7%), TILE8 316.6 (-5.9%). mmq_x-down regresses (reproduces patch 0015; GDN/BW-bound).
-
-nsys (kill-gate evidence): the decode FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s -> 3.025s
-(avg 608us -> 661us, +8.7% slower) under MINBLOCKS=2 - register-capping spills, so occupancy did
-not usefully rise. Verdict: the dense M=128 tile is already weight-read/one-read-optimal at
-mmq_x=128, NOT occupancy-starved via the cheap levers; the only untested lever is the structural
-mmq_y-down (nwarps=4 warp-remap), deferred to P2. Bit-exact gate holds throughout.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/src/ggml-cuda/mmq.cuh | 85 ++++++++++++++++++++++++++++++++++----
- tests/test-backend-ops.cpp | 16 +++++++
- 2 files changed, 92 insertions(+), 9 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
-index 9718b12..b53e38a 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
-+++ b/ggml/src/ggml-cuda/mmq.cuh
-@@ -140,7 +140,24 @@ static constexpr __device__ int get_mmq_x_max_device() {
- #endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
- }
- 
-static int get_mmq_y_host(const int cc) {
-+// [paged patch 0017 / track B] Dense NVFP4 decode mmq_y (weight-row tile) override.
-+// mmq_y tiles the N (weight-row) dimension of the FP4-MMA weight GEMM. Lowering it raises the
-+// number of resident CTAs (smaller per-CTA shared footprint + smaller per-thread accumulator) to
-+// hide LPDDR5x weight-load latency at the M=128 decode tile, WITHOUT re-reading weights: every
-+// weight row lives in exactly one row-tile, so total weight traffic is unchanged (bandwidth-
-+// neutral) - the dense-decode occupancy lever from FP4_GEMM_SCOPE_B.md s3/s4.1. mmq_y is a PURE
-+// N-row tiling knob: the per-output reduction over K is identical for any mmq_y, so the result
-+// stays BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4 decode shapes). Default 128 == exact
-+// stock behaviour (a default build is byte-identical to stock); build -DGGML_CUDA_FP4_MMQ_Y=64
-+// (or 96) to enable the tune. Applies ONLY to NVFP4 on Blackwell; every other type/arch untouched.
-+#ifndef GGML_CUDA_FP4_MMQ_Y
-+#define GGML_CUDA_FP4_MMQ_Y 128
-+#endif
-+
-+static int get_mmq_y_host(const int cc, const ggml_type type = GGML_TYPE_COUNT) {
-+    if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)) {
-+        return GGML_CUDA_FP4_MMQ_Y;
-+    }
-     return GGML_CUDA_CC_IS_AMD(cc) ? (GGML_CUDA_CC_IS_RDNA1(cc) ? 64 : 128) :
-         ((GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ? 128 : 64);
- }
-@@ -154,7 +171,13 @@ if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) {
-     return MMQ_ITER_K;
- }
- 
-+template <ggml_type type = GGML_TYPE_COUNT>
- static constexpr __device__ int get_mmq_y_device() {
-+#if defined(BLACKWELL_MMA_AVAILABLE)
-+    if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MMQ_Y != 128) {
-+        return GGML_CUDA_FP4_MMQ_Y;
-+    }
-+#endif // defined(BLACKWELL_MMA_AVAILABLE)
- #if defined(GGML_USE_HIP)
- #if defined(RDNA1)
-     return 64;
-@@ -170,6 +193,28 @@ static constexpr __device__ int get_mmq_y_device() {
- #endif // defined(GGML_USE_HIP)
- }
- 
-+// [paged patch 0017 / track B] Dense NVFP4 decode occupancy lever: min resident CTAs per SM.
-+// The FP4-MMA mul_mat_q is REGISTER-bound to 1 CTA/SM (__launch_bounds__(256,1) => ~255 regs/thread
-+// => one resident block, the under-occupancy that strands the kernel at ~3% of FP4 peak at M=128).
-+// Raising the __launch_bounds__ min-blocks operand register-caps the compiler so N CTAs co-reside,
-+// hiding LPDDR5x weight-load latency by CTA-parallelism (the scope s4.1 occupancy goal) WITHOUT a
-+// structural mmq_y/nwarps change and WITHOUT extra weight reads (each weight tile still read once).
-+// Register allocation cannot change results => BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4).
-+// Default 1 == exact stock behaviour (byte-identical); build -DGGML_CUDA_FP4_MINBLOCKS=2 to enable.
-+// Applies ONLY to NVFP4 on Blackwell; every other type/arch keeps the stock min-blocks.
-+#ifndef GGML_CUDA_FP4_MINBLOCKS
-+#define GGML_CUDA_FP4_MINBLOCKS 1
-+#endif
-+template <ggml_type type = GGML_TYPE_COUNT>
-+static constexpr __device__ int mmq_get_min_blocks_device(const int stock) {
-+#if defined(BLACKWELL_MMA_AVAILABLE)
-+    if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MINBLOCKS != 1) {
-+        return GGML_CUDA_FP4_MINBLOCKS;
-+    }
-+#endif // defined(BLACKWELL_MMA_AVAILABLE)
-+    return stock;
-+}
-+
- // Decouple shared memory tile sizes from WARP_SIZE to allow for different warp sizes.
- // The K dimension of the tiles has either,
- // 1*MMQ_TILE_NE_K==32 (always for TILE_Y_K) or 2*MMQ_TILE_NE_K==64 (typically for TILE_X_K),
-@@ -3454,7 +3499,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
-     constexpr int              warp_size  = ggml_cuda_get_physical_warp_size();
-     constexpr int              nwarps     = mmq_get_nwarps_device();
-     constexpr int              qk         = ggml_cuda_type_traits<type>::qk;
-    constexpr int              mmq_y      = get_mmq_y_device();
-+    constexpr int              mmq_y      = get_mmq_y_device<type>();
-     constexpr load_tiles_mmq_t load_tiles = mmq_type_traits<mmq_x, mmq_y, need_check, type>::load_tiles;
- 
-     extern __shared__ int data_mul_mat_q[];
-@@ -3531,13 +3576,13 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
- template <ggml_type type, int mmq_x, bool need_check>
- #if defined(GGML_USE_HIP)
- #if defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
-+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
- #endif // defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
- #else
- #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 1)
-+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(1))
- #else
-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
-+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
- #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
- #endif // defined(GGML_USE_HIP)
- static __global__ void mul_mat_q(
-@@ -3558,7 +3603,7 @@ static __global__ void mul_mat_q(
-     constexpr int warp_size = ggml_cuda_get_physical_warp_size();
- 
-     constexpr int qk    = ggml_cuda_type_traits<type>::qk;
-    constexpr int mmq_y = get_mmq_y_device();
-+    constexpr int mmq_y = get_mmq_y_device<type>();
- 
-     const uint32_t nty = (nrows_x + mmq_y - 1) / mmq_y; // Number of tiles y
- 
-@@ -3790,7 +3835,7 @@ static __global__ void mul_mat_q_stream_k_fixup(
-         float * __restrict__ tmp_last_tile, const uint3 blocks_per_ne00, const int nrows_x, const int ncols_dst,
-         const int stride_col_dst, const uint3 nchannels_y, const int stride_channel_dst, const uint3 nsamples_y,
-         const int stride_sample_dst, const uint3 ntx) {
-    constexpr int mmq_y           = get_mmq_y_device();
-+    constexpr int mmq_y           = get_mmq_y_device<type>();
-     constexpr int qk              = ggml_cuda_type_traits<type>::qk;
-     constexpr int ITER_K          = get_iter_k(type);
-     constexpr int blocks_per_iter = ITER_K / qk;
-@@ -3947,7 +3992,7 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
-     const int nsm = ggml_cuda_info().devices[id].nsm;
-     const int warp_size = ggml_cuda_info().devices[id].warp_size;
-     const int nwarps = mmq_get_nwarps_host(cc, warp_size);
-    const int mmq_y = get_mmq_y_host(cc);
-+    const int mmq_y = get_mmq_y_host(cc, type);
- 
-     const dim3 block_dims(warp_size, nwarps, 1);
- 
-@@ -4103,6 +4148,21 @@ static inline int ggml_cuda_moe_density_max() {
-     return d;
- }
- 
-+// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off).
-+// GGML_CUDA_FP4_DENSE_MMQ_X=<n> caps the dense (non-MoE) NVFP4 col-tile to <n>, splitting the M=128
-+// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost
-+// in the BW-bound regime) but multiplies resident CTAs. This is the scope s4.1 A/B probe: if
-+// decode_agg RISES with cap=64 despite the 2x weight read, occupancy is badly broken (the kernel is
-+// compute/occupancy-bound, so mmq_y-down / min-blocks has large upside); if it FALLS, the tile is
-+// already bandwidth-saturated and the occupancy ceiling is lower. Unset/<=0 => stock selection.
-+static inline int ggml_cuda_fp4_dense_mmq_x_cap() {
-+    static const int c = []() -> int {
-+        const char * s = getenv("GGML_CUDA_FP4_DENSE_MMQ_X");
-+        return s ? atoi(s) : 0;
-+    }();
-+    return c;
-+}
-+
- template <ggml_type type>
- void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
-     const int    id     = ggml_cuda_get_device();
-@@ -4112,7 +4172,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
-     const int nwarps    = mmq_get_nwarps_host(cc, warp_size);
- 
-     const int mmq_x_max = get_mmq_x_max_host(cc);
-    const int mmq_y = get_mmq_y_host(cc);
-+    const int mmq_y = get_mmq_y_host(cc, type);
- 
-     // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
-     // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
-@@ -4145,6 +4205,13 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
-     //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
-     //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
-     int mmq_x_lim = mmq_x_max;
-+    if (args.expert_bounds == nullptr && type == GGML_TYPE_NVFP4) {
-+        // dense NVFP4 decode mmq_x re-read occupancy diagnostic (see ggml_cuda_fp4_dense_mmq_x_cap).
-+        const int cap = ggml_cuda_fp4_dense_mmq_x_cap();
-+        if (cap > 0 && cap < mmq_x_max) {
-+            mmq_x_lim = cap < 8 ? 8 : cap;
-+        }
-+    }
-     if (args.expert_bounds != nullptr) {
-         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
-         if (moe_cap > 0) {
-diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index f219309..291c275 100644
--- a/tests/test-backend-ops.cpp
-+++ b/tests/test-backend-ops.cpp
-@@ -8591,6 +8591,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
-         }
-     }
- 
-+    // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate.
-+    // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the
-+    // NVFP4 decode path to raise resident-CTA occupancy. mmq_y is a pure N-row tiling knob, so a
-+    // smaller mmq_y must stay BIT-EXACT (identical per-output reduction over K) - this gate proves
-+    // it. m = weight rows (N, tiled by mmq_y): 2048 (exact at mmq_y 64 & 128), 1600 (ragged vs 128),
-+    // 2050 (ragged vs both 64 & 128 -> exercises the need_check last-row-tile at both). n = decode
-+    // token count M = 32 and 128 (the scope decode shapes, tiled by mmq_x). k = 2048 hidden. Must
-+    // pass with the default build (mmq_y=128) AND a mmq_y=64 build, CUDA-vs-CPU oracle, bit-exact.
-+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
-+        for (int64_t m : {2048, 1600, 2050}) {
-+            for (int64_t n : {32, 128}) {
-+                test_cases.emplace_back(new test_mul_mat(type_a, GGML_TYPE_F32, m, n, 2048, {1, 1}, {1, 1}));
-+            }
-+        }
-+    }
-+
-     for (ggml_type type_a : all_types) {
-         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
-     }
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch
@@ -1,349 +0,0 @@
-From 17f16e8f6d8dbc689d5151c44759792d683c957b Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 00:44:13 +0200
-Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state
- write-back (patch 0018)
-
-Decode on the Qwen3.6 hybrid-SSM models (arch qwen35, 48 gated-DeltaNet :
-16 full-attention layers) was dominated by recurrent-state plumbing, not the
-FP4 GEMM. Per SSM layer per step the fused gated_delta_net op wrote its new
-recurrent state into graph scratch, then a separate ggml_cpy persisted it into
-the recurrent-state cache. nsys attributed 18.9% of decode GPU time to that
-~225 MB/copy D2D memcpy (1584 ops, 356 GB over the A2 decompose window).
-
-This mirrors vLLM fused_recurrent_gated_delta_rule (state kept in place):
-ggml_gated_delta_net_inplace writes the final recurrent state directly into the
-active sequences' contiguous cache slot (at kv_head), removing the copy-back. The
-op output then carries only the attention scores; the SSM arithmetic is
-unchanged (bit-identical greedy output vs the copy-back baseline).
-
- new op builder ggml_gated_delta_net_inplace (src[6] = state_dst cache view)
- CUDA + CPU honor src[6]; final-state (K==1, keep_rs off) write redirected there
- delta-net-base build_recurrent_attn uses it on the fused decode/prefill path,
-  dropping the ggml_cpy; rollback (n_rs_seq>0) path unchanged
-
-Measured (q36-27b-nvfp4, decode_agg S_TG, npp128 ntg128, -fa on, paged on):
-  npl 32 : 113.74 -> 136.39 t/s (+19.9 percent)
-  npl 128: 146.23 -> 180.53 t/s (+23.5 percent, = predicted copy-removal ceiling)
-MoE q36-35b-a3b-nvfp4: npl128 313.36 -> 372.62 t/s (+18.9 percent).
-nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share
-(391 @128) 37.4 -> 46.2 percent. get_rows state gather (now 18.8 percent) is the
-next lever.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/include/ggml.h                   | 14 ++++++
- ggml/src/ggml-cpu/ops.cpp             | 13 ++++-
- ggml/src/ggml-cuda/gated_delta_net.cu | 39 ++++++++++-----
- ggml/src/ggml.c                       | 68 +++++++++++++++++++++++++++
- src/models/delta-net-base.cpp         | 30 ++++++++++++
- 5 files changed, 152 insertions(+), 12 deletions(-)
-
-diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
-index 823f5a9..4e7ab32 100644
--- a/ggml/include/ggml.h
-+++ b/ggml/include/ggml.h
-@@ -2579,6 +2579,20 @@ extern "C" {
-             struct ggml_tensor  * state,
-             int64_t               K);
- 
-+    // same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
-+    // in place into state_dst (a view into the recurrent-state cache) instead of being appended to
-+    // the op output, eliminating the per-step state copy-back during decode. state_dst must be a
-+    // contiguous [S_v*S_v*H, n_seqs] view (per-seq stride == dense state size).
-+    GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace(
-+            struct ggml_context * ctx,
-+            struct ggml_tensor  * q,
-+            struct ggml_tensor  * k,
-+            struct ggml_tensor  * v,
-+            struct ggml_tensor  * g,
-+            struct ggml_tensor  * beta,
-+            struct ggml_tensor  * state,
-+            struct ggml_tensor  * state_dst);
-+
-     // custom operators
- 
-     typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
-diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
-index 63c07a2..9457add 100644
---- a/ggml/src/ggml-cpu/ops.cpp
-+++ b/ggml/src/ggml-cpu/ops.cpp
-@@ -10600,6 +10600,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
-     ggml_tensor * src_g     = dst->src[3];
-     ggml_tensor * src_beta  = dst->src[4];
-     ggml_tensor * src_state = dst->src[5];
-+    ggml_tensor * src_state_dst = dst->src[6]; // optional in-place final-state write-back target
- 
-     const int64_t S_v      = src_v->ne[0];
-     const int64_t H        = src_v->ne[1];
-@@ -10660,6 +10661,16 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
- 
-     const float scale = 1.0f / sqrtf((float) S_v);
- 
-+    // when src_state_dst is provided (in-place decode write-back) the final state is written
-+    // directly into the persistent cache view, removing the separate state copy-back node.
-+    float * inplace_state_base = nullptr;
-+    if (src_state_dst != nullptr) {
-+        GGML_ASSERT(K == 1);
-+        GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
-+        GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
-+        inplace_state_base = (float *) src_state_dst->data;
-+    }
-+
-     for (int64_t ir = ir0; ir < ir1; ++ir) {
-         const int64_t iv1 = ir % H; // head_index
-         const int64_t iv3 = ir / H; // sequence
-@@ -10674,7 +10685,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
-         // For K>1, work in scratch and copy out per-token when the slot is in range.
-         float * s_out = (K > 1)
-             ? state_work
--            : state_out_base + (iv3 * H + iv1) * S_v * S_v;
-+            : (inplace_state_base ? inplace_state_base : state_out_base) + (iv3 * H + iv1) * S_v * S_v;
- 
-         // copy input state into the working buffer and operate in-place
-         // state layout [S_v, S_v, H, n_seqs]: seq iv3 starts at iv3 * state_seq_stride.
-diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
-index a547360..61a2b91 100644
---- a/ggml/src/ggml-cuda/gated_delta_net.cu
-+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
-@@ -25,7 +25,8 @@ gated_delta_net_cuda(const float * q,
-                                      const uint3   neqk1_magic,
-                                      const uint3   rq3_magic,
-                                      float         scale,
--                                     int           K) {
-+                                     int           K,
-+                                     float *       state_dst) {
-     const uint32_t h_idx    = blockIdx.x;
-     const uint32_t sequence = blockIdx.y;
-     // each warp owns one column, using warp-level primitives to reduce across rows
-@@ -37,7 +38,10 @@ gated_delta_net_cuda(const float * q,
- 
-     const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
-     float *       attn_data        = dst;
--    float *       state            = dst + attn_score_elems;
-+    // when state_dst is provided (in-place decode write-back) the final recurrent state is written
-+    // directly into the persistent cache view instead of being appended to the op output; this
-+    // eliminates the per-layer per-step D2D state copy-back. Only used when keep_rs_t == false.
-+    float *       state            = (state_dst != nullptr) ? state_dst : (dst + attn_score_elems);
- 
-     // input state holds s0 only: [S_v, S_v, H, n_seqs] — seq stride is D = H * S_v * S_v.
-     // output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before.
-@@ -171,7 +175,7 @@ template <bool KDA, bool keep_rs_t>
- static void launch_gated_delta_net(
-         const float * q_d, const float * k_d, const float * v_d,
-         const float * g_d, const float * b_d, const float * s_d,
--        float * dst_d,
-+        float * dst_d, float * state_dst_d,
-         int64_t S_v,   int64_t H, int64_t n_tokens, int64_t n_seqs,
-         int64_t sq1,   int64_t sq2, int64_t sq3,
-         int64_t sv1,   int64_t sv2, int64_t sv3,
-@@ -195,26 +199,26 @@ static void launch_gated_delta_net(
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
--                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-             break;
-         case 32:
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
--                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-             break;
-         case 64: {
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
--                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-             break;
-         }
-         case 128: {
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
--                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-             break;
-         }
-         default:
-@@ -230,6 +234,7 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
-     ggml_tensor * src_g     = dst->src[3];
-     ggml_tensor * src_beta  = dst->src[4];
-     ggml_tensor * src_state = dst->src[5];
-+    ggml_tensor * src_state_dst = dst->src[6]; // optional in-place state write-back target
- 
-     GGML_TENSOR_LOCALS(int64_t, neq, src_q, ne);
-     GGML_TENSOR_LOCALS(size_t , nbq, src_q, nb);
-@@ -260,6 +265,15 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
-     const float * s_d   = (const float *) src_state->data;
-     float *       dst_d = (float *) dst->data;
- 
-+    float * state_dst_d = nullptr;
-+    if (src_state_dst != nullptr) {
-+        // in-place final-state cache view: per-seq stride must be the dense state size D = S_v*S_v*H
-+        GGML_ASSERT(src_state_dst->type == GGML_TYPE_F32);
-+        GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
-+        GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
-+        state_dst_d = (float *) src_state_dst->data;
-+    }
-+
-     GGML_ASSERT(ggml_is_contiguous_rows(src_q));
-     GGML_ASSERT(ggml_is_contiguous_rows(src_k));
-     GGML_ASSERT(ggml_is_contiguous_rows(src_v));
-@@ -288,23 +302,26 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
-     const int K = ggml_get_op_params_i32(dst, 0);
-     const bool keep_rs = K > 1;
- 
-+    // in-place write-back is only valid for the single-snapshot (final-state) case
-+    GGML_ASSERT(state_dst_d == nullptr || !keep_rs);
-+
-     if (kda) {
-         if (keep_rs) {
--            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
-+            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         } else {
--            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
-+            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         }
-     } else {
-         if (keep_rs) {
--            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
-+            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         } else {
--            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
-+            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         }
-diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
-index adbe52b..b8d34bf 100644
---- a/ggml/src/ggml.c
-+++ b/ggml/src/ggml.c
-@@ -6285,6 +6285,74 @@ struct ggml_tensor * ggml_gated_delta_net(
-     return result;
- }
- 
-+// ggml_gated_delta_net_inplace
-+//
-+// Same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
-+// in place into `state_dst` (a view into the persistent recurrent-state cache) instead of being
-+// appended to the op output. This removes the per-layer per-step D2D state copy-back during decode.
-+// The op output holds ONLY the attention scores; the state region is still allocated (unused) so
-+// the attention-output view layout is identical to ggml_gated_delta_net.
-+struct ggml_tensor * ggml_gated_delta_net_inplace(
-+        struct ggml_context * ctx,
-+        struct ggml_tensor  * q,
-+        struct ggml_tensor  * k,
-+        struct ggml_tensor  * v,
-+        struct ggml_tensor  * g,
-+        struct ggml_tensor  * beta,
-+        struct ggml_tensor  * state,
-+        struct ggml_tensor  * state_dst) {
-+    GGML_ASSERT(ggml_is_contiguous_rows(q));
-+    GGML_ASSERT(ggml_is_contiguous_rows(k));
-+    GGML_ASSERT(ggml_is_contiguous_rows(v));
-+    GGML_ASSERT(ggml_is_contiguous(g));
-+    GGML_ASSERT(ggml_is_contiguous(beta));
-+    GGML_ASSERT(ggml_is_contiguous(state));
-+
-+    GGML_ASSERT(q->type == GGML_TYPE_F32);
-+    GGML_ASSERT(k->type == GGML_TYPE_F32);
-+    GGML_ASSERT(v->type == GGML_TYPE_F32);
-+    GGML_ASSERT(g->type == GGML_TYPE_F32);
-+    GGML_ASSERT(beta->type == GGML_TYPE_F32);
-+    GGML_ASSERT(state->type == GGML_TYPE_F32);
-+    GGML_ASSERT(state_dst != NULL);
-+    GGML_ASSERT(state_dst->type == GGML_TYPE_F32);
-+
-+    const int64_t S_v      = v->ne[0];
-+    const int64_t H        = v->ne[1];
-+    const int64_t n_tokens = v->ne[2];
-+    const int64_t n_seqs   = v->ne[3];
-+
-+    GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
-+    GGML_ASSERT(beta->ne[0] == 1);
-+
-+    GGML_ASSERT(state->ne[0] == S_v);
-+    GGML_ASSERT(state->ne[1] == S_v);
-+    GGML_ASSERT(state->ne[2] == H);
-+    GGML_ASSERT(state->ne[3] == n_seqs);
-+
-+    // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
-+    GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
-+    GGML_ASSERT(state_dst->ne[1] >= n_seqs);
-+    GGML_ASSERT(state_dst->nb[0] == sizeof(float));
-+
-+    const int64_t state_rows = S_v * n_seqs; // K == 1
-+    const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
-+    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
-+
-+    ggml_set_op_params_i32(result, 0, 1); // K == 1
-+
-+    result->op     = GGML_OP_GATED_DELTA_NET;
-+    result->src[0] = q;
-+    result->src[1] = k;
-+    result->src[2] = v;
-+    result->src[3] = g;
-+    result->src[4] = beta;
-+    result->src[5] = state;
-+    result->src[6] = state_dst;
-+
-+    return result;
-+}
-+
- ////////////////////////////////////////////////////////////////////////////////
- 
- struct ggml_hash_set ggml_hash_set_new(size_t size) {
-diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
-index ad9ce77..26a718b 100644
---- a/src/models/delta-net-base.cpp
-+++ b/src/models/delta-net-base.cpp
-@@ -546,6 +546,36 @@ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
-     const bool keep = cparams.n_rs_seq > 0;
- 
-     if (!keep) {
-+        const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
-+
-+        if (fused) {
-+            // In-place state write-back: the fused gated-DeltaNet op writes the new recurrent state
-+            // directly into the persistent cache slot for the active sequences (a contiguous block
-+            // at kv_head), eliminating the per-layer per-step ~full-state D2D copy-back that
-+            // dominated decode. The op output then carries only the attention scores.
-+            ggml_tensor * state_dst = ggml_view_2d(ctx0, ssm_states_all, hparams.n_embd_s(), n_seqs,
-+                    ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
-+
-+            ggml_tensor * result = ggml_gated_delta_net_inplace(ctx0, q, k, v, g, b, s, state_dst);
-+            if (n_seq_tokens == 1) {
-+                cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
-+            } else {
-+                cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
-+            }
-+
-+            ggml_tensor * output = ggml_view_4d(ctx0, result,
-+                    S_v, H_v, n_seq_tokens, n_seqs,
-+                    ggml_row_size(result->type, S_v),
-+                    ggml_row_size(result->type, S_v * H_v),
-+                    ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
-+            cb(output, "attn_output", il);
-+
-+            // the state write is a side effect of the op; pull the op into the graph via the output
-+            ggml_build_forward_expand(gf, output);
-+
-+            return output;
-+        }
-+
-         auto attn_out = build_delta_net(q, k, v, g, b, s, il);
-         ggml_tensor * output    = attn_out.first;
-         ggml_tensor * new_state = attn_out.second;
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
@@ -1,678 +0,0 @@
-From 46d7dd80bbce7f3c1dbf9363d6527c8c9b687a6b Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 01:45:02 +0200
-Subject: [PATCH] feat(paged): qwen35 SSM decode fused recurrent-state gather
- (patch 0019)
-
-Step 2 of the SSM decode-throughput work. After Step 1 (in-place state
-write-back, patch 0018) the largest non-GEMM decode bucket was the recurrent-
-state get_rows gather (18.8% of decode GPU time): build_rs materialized each
-sequence's prior state into a contiguous scratch via ggml_get_rows before the
-gated-DeltaNet op read it.
-
-This eliminates that materialization, mirroring ggml_ssm_scan's ids source.
-ggml_gated_delta_net_inplace_ids takes the FULL recurrent-state cache plus the
-s_copy ids (src[5] = full cache, src[7] = ids, op_param[1] = rs_head) and reads
-each sequence's prior state directly from cache[ids[seq]]. Combined with Step 1's
-in-place write the op now reads AND writes the cache directly: no recurrent-state
-materialization at all. build_recurrent_attn feeds the full cache + ids through
-the build_rs get_state_rows lambda exactly like mamba-base, keeping the rs_zero
-clear and the extra-states copy around the op.
-
-Race-free by construction on CUDA. In-place write plus an ids read of the same
-cache is only safe when read slot == write slot; s_copy is identity
-(rs_head + s) for stable continuing sequences (the whole AR decode path) but can
-remap on reorder or rs_zero (e.g. multiple new sequences in one prefill ubatch).
-The recurrence kernel handles both per (seq, head) block on device: identity
-sequences read s0 in place from the destination slot (the kernel loads all of s0
-into registers before writing, so reading and writing the same slot is safe),
-and non-identity sequences read from a disjoint scratch that a small gather
-kernel copies from cache[ids[seq]] first, so the recurrence never reads a slot
-another block writes. The CPU op mirrors this (host identity check + a serial
-gather in the dispatcher). ids stays a device pointer (read only in-kernel; it is
-device-resident at op-execute time). Bit-identical to the get_rows path in every
-case.
-
-- new builder ggml_gated_delta_net_inplace_ids; CUDA gather kernel
-  (gdn_gather_nonident) + per-block read-base select in gated_delta_net_cuda;
-  CPU identity guard + serial gather fallback in the dispatcher
-- delta-net-base build_recurrent_attn gains a gather-free overload; qwen35 and
-  qwen35moe drop the pre-gather. qwen3next, kimi-linear, the non-fused path and
-  the rollback (n_rs_seq > 0) path are unchanged.
-
-Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off):
-  dense q36-27b-nvfp4 : npl 32  137.64 -> 170.68 (+24.0 percent)
-                        npl 128 186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391)
-  MoE   q36-35b-a3b-nvfp4: npl 32  299.68 -> 366.69 (+22.4 percent)
-                           npl 128 409.30 -> 553.63 (+35.3 percent)
-Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build
-(dense model text md5 match, MoE byte-identical, step2 run1 == run2). nsys
-k_get_rows_float bucket 18.8 -> 0.7 percent; the new gdn_gather_nonident kernel
-is 1.7 percent (no-op at decode, median 1.2 us). The residual decode gap to vLLM
-is now the FP4 GEMM (~48 percent of decode), a separate kernel track.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- SSM_DECODE_FIX_RESULTS.md             | 86 +++++++++++++++++++++++++++
- ggml/include/ggml.h                   | 17 ++++++
- ggml/src/ggml-cpu/ops.cpp             | 49 ++++++++++++++-
- ggml/src/ggml-cuda/gated_delta_net.cu | 85 ++++++++++++++++++++++----
- ggml/src/ggml.c                       | 76 +++++++++++++++++++++++
- src/models/delta-net-base.cpp         | 63 ++++++++++++++++++++
- src/models/models.h                   | 13 ++++
- src/models/qwen35.cpp                 |  6 +-
- src/models/qwen35moe.cpp              |  6 +-
- 9 files changed, 378 insertions(+), 23 deletions(-)
-
-diff --git a/SSM_DECODE_FIX_RESULTS.md b/SSM_DECODE_FIX_RESULTS.md
-index 2e7c8c2..77879e4 100644
---- a/SSM_DECODE_FIX_RESULTS.md
-+++ b/SSM_DECODE_FIX_RESULTS.md
-@@ -96,3 +96,89 @@ precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLL
- after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel
- track. No paged/graph/block-table change can move decode on this model (full
- attention is 0.4% of decode).
-+
-+## STEP 2 (patch 0019): fuse the recurrent-state gather into the op
-+
-+After Step 1 the largest non-GEMM decode bucket was the recurrent-state
-+`get_rows` gather (18.8% of decode GPU time): `build_rs` materialized each
-+sequence's prior state into a contiguous scratch via `ggml_get_rows` before the
-+gated-DeltaNet op read it. Step 2 eliminates that materialization, mirroring
-+`ggml_ssm_scan`'s `ids` source.
-+
-+`ggml_gated_delta_net_inplace_ids` takes the FULL recurrent-state cache plus the
-+`s_copy` ids (`src[5]` = full cache `[S_v, S_v, H, n_rs_slots]`, `src[7]` = ids,
-+`op_param[1]` = `rs_head`) and reads each sequence's prior state directly from
-+`cache[ids[seq]]`. Combined with Step 1's in-place write the op now reads AND
-+writes the cache directly: no recurrent-state materialization at all. The
-+`build_recurrent_attn` fused path feeds the full cache and ids through the
-+`build_rs` `get_state_rows` lambda exactly like `mamba-base.cpp`, keeping the
-+`rs_zero` clear and the extra-states copy around the op.
-+
-+### Race-free by construction (CUDA)
-+
-+In-place write plus an ids read of the same cache is only safe when the read slot
-+equals the write slot. `s_copy(s) = cells[s + head].src0`, which is identity
-+(`rs_head + s`) for stable continuing sequences (the entire AR decode path) but
-+can remap on sequence reorder or `rs_zero` (e.g. multiple new sequences in one
-+prefill ubatch). The kernel handles both per (seq, head) block on device:
-+
-+- identity sequences read `s0` in place from the destination slot `state_dst`
-+  (the kernel loads all of `s0` into registers before it writes the new state,
-+  so reading and writing the same slot is race-free) -- no materialization;
-+- non-identity sequences read from a disjoint scratch that a small
-+  `gdn_gather_nonident_kernel` copies from `cache[ids[seq]]` first, so the
-+  recurrence never reads a slot another block writes.
-+
-+`ids` stays a device pointer (dereferenced only in the kernels; the input is
-+device-resident at op-execute time, so a host read segfaults). The CPU op
-+mirrors the same logic (host identity check + a serial gather in the dispatcher
-+for the non-identity case). The math is unchanged, so the result is bit-identical
-+to the `get_rows` path in every case.
-+
-+Gated to the `qwen35` / `qwen35moe` fused decode/prefill path; `qwen3next`,
-+`kimi-linear`, the non-fused path and the rollback (`n_rs_seq > 0`) path are
-+untouched (they keep the materialized-state overload).
-+
-+### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off)
-+
-+Dense `q36-27b-nvfp4`:
-+
-+| npl | Step 1 (baseline) | Step 2   | delta   | % of vLLM (391 @128) |
-+|-----|-------------------|----------|---------|----------------------|
-+| 32  | 137.64            | 170.68   | +24.0%  | -                    |
-+| 128 | 186.25            | 256.57   | +37.8%  | 47.6% -> 65.6%       |
-+
-+The npl-128 result (256.57 t/s) beats the predicted ~247 t/s Step-2 ceiling.
-+
-+MoE `q36-35b-a3b-nvfp4`:
-+
-+| npl | Step 1 (baseline) | Step 2   | delta   |
-+|-----|-------------------|----------|---------|
-+| 32  | 299.68            | 366.69   | +22.4%  |
-+| 128 | 409.30            | 553.63   | +35.3%  |
-+
-+(Step-1 baselines re-measured in the same session; the brief's reference figures
-+were 136 / 180 dense and 279 / 373 MoE.)
-+
-+### Bit-exact gate
-+
-+Greedy (`--temp 0 --seed 1`) `llama-completion` output (256 tokens, paged on,
-+fusion off) vs the Step-1 build:
-+
-+- dense `q36-27b-nvfp4`: model text byte-identical (md5 match);
-+- MoE `q36-35b-a3b-nvfp4`: byte-identical;
-+- Step-2 dense run1 == run2 (deterministic, no race).
-+
-+### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
-+
-+The recurrent-state gather bucket collapsed:
-+
-+| kernel                     | Step 1   | Step 2                                  |
-+|----------------------------|----------|-----------------------------------------|
-+| `k_get_rows_float`         | 18.8%    | 0.7% (residual: embeddings / conv-state)|
-+| `gdn_gather_nonident`      | -        | 1.7% (no-op at decode, median ~1.2 us)  |
-+| `gated_delta_net_cuda`     | 26.0%    | 22.5%                                    |
-+| FP4 GEMM family            | ~37.5%   | ~48% (now the dominant residual)        |
-+
-+The SSM state gather is effectively eliminated. The residual decode gap to vLLM
-+is now the FP4 GEMM (~48% of decode), a separate kernel track.
-diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
-index 4e7ab32..951dd21 100644
---- a/ggml/include/ggml.h
-+++ b/ggml/include/ggml.h
-@@ -2593,6 +2593,23 @@ extern "C" {
-             struct ggml_tensor  * state,
-             struct ggml_tensor  * state_dst);
- 
-+    // Step 2: same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read
-+    // directly from the full state cache via per-sequence indices (ids == s_copy), mirroring
-+    // ggml_ssm_scan, instead of from a materialized ggml_get_rows gather. `state` is the FULL cache
-+    // [S_v, S_v, H, n_rs_slots]; `ids` are the per-seq source slots; `rs_head` is the destination
-+    // base slot. Eliminates the recurrent-state gather on the decode path.
-+    GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
-+            struct ggml_context * ctx,
-+            struct ggml_tensor  * q,
-+            struct ggml_tensor  * k,
-+            struct ggml_tensor  * v,
-+            struct ggml_tensor  * g,
-+            struct ggml_tensor  * beta,
-+            struct ggml_tensor  * state,
-+            struct ggml_tensor  * state_dst,
-+            struct ggml_tensor  * ids,
-+            int                   rs_head);
-+
-     // custom operators
- 
-     typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
-diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
-index 9457add..b6a1976 100644
---- a/ggml/src/ggml-cpu/ops.cpp
-+++ b/ggml/src/ggml-cpu/ops.cpp
-@@ -10633,7 +10633,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
-     const int64_t K = ggml_get_op_params_i32(dst, 0);
-     GGML_ASSERT(K >= 1);
-     // per-seq stride in floats (seq s starts at state + s * seq_stride)
--    const int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
-+    int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
- 
-     const int64_t per_thread = S_v + (K > 1 ? S_v * S_v : 0);
-     const int ith = params->ith;
-@@ -10654,6 +10654,26 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
- 
-     const float * state_in_base = (const float *)src_state->data;
- 
-+    // Step 2: fused recurrent-state gather (ids == s_copy in src[7]). Read the prior state directly
-+    // from the full cache at cache[ids[seq]] instead of from a materialized gather. For the identity
-+    // decode case the prior state is the in-place destination block [rs_head, rs_head+n_seqs);
-+    // otherwise the dispatcher has gathered cache[ids[seq]] into the (unused) output-state scratch
-+    // region. Bit-identical to the get_rows path.
-+    ggml_tensor * src_ids = dst->src[7];
-+    if (src_ids != nullptr) {
-+        const int64_t   D       = S_v * S_v * H;
-+        const int32_t   rs_head = ggml_get_op_params_i32(dst, 1);
-+        const int32_t * ids     = (const int32_t *) src_ids->data;
-+        bool identity = true;
-+        for (int64_t s = 0; s < n_seqs; ++s) {
-+            if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
-+        }
-+        state_seq_stride = D;
-+        state_in_base = identity
-+            ? (const float *) src_state->data + (int64_t) rs_head * D
-+            : (const float *) state_out_base; // gathered by the dispatcher (non-identity)
-+    }
-+
-   //const int64_t rq1 = nev1 / neq1;
-   //const int64_t rk1 = nev1 / nek1;
-     const int64_t rq3 = nev3 / neq3;
-@@ -10777,6 +10797,33 @@ static void ggml_compute_forward_gated_delta_net_f32(
- 
-     if (ith == 0) {
-       ggml_threadpool_chunk_set(params->threadpool, nth);
-+
-+      // Step 2: non-identity ids fallback -- serially gather each sequence's prior state from
-+      // cache[ids[seq]] into the (otherwise unused) output-state scratch region before the parallel
-+      // recurrence, so the in-place write never aliases another sequence's read.
-+      ggml_tensor * src_ids = dst->src[7];
-+      if (src_ids != nullptr) {
-+          const ggml_tensor * src_state = dst->src[5];
-+          const int64_t S_v      = V->ne[0];
-+          const int64_t H        = V->ne[1];
-+          const int64_t n_tokens = V->ne[2];
-+          const int64_t n_seqs   = V->ne[3];
-+          const int64_t D        = S_v * S_v * H;
-+          const int32_t   rs_head = ggml_get_op_params_i32(dst, 1);
-+          const int32_t * ids     = (const int32_t *) src_ids->data;
-+          bool identity = true;
-+          for (int64_t s = 0; s < n_seqs; ++s) {
-+              if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
-+          }
-+          if (!identity) {
-+              const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
-+              const float * cache   = (const float *) src_state->data;
-+              float *       scratch = (float *) dst->data + attn_score_elems;
-+              for (int64_t s = 0; s < n_seqs; ++s) {
-+                  memcpy(scratch + s * D, cache + (int64_t) ids[s] * D, D * sizeof(float));
-+              }
-+          }
-+      }
-     }
- 
-     ggml_barrier(params->threadpool);
-diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
-index 61a2b91..86d5e2a 100644
---- a/ggml/src/ggml-cuda/gated_delta_net.cu
-+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
-@@ -1,6 +1,34 @@
- #include "gated_delta_net.cuh"
- #include "ggml-cuda/common.cuh"
- 
-+// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
-+// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
-+// destination slot by the recurrence kernel and are skipped here. One block per sequence.
-+__global__ void gdn_gather_nonident_kernel(const float * cache, const int32_t * ids, int rs_head,
-+                                           float * scratch, int64_t D, int n_seqs) {
-+    const int s = blockIdx.x;
-+    if (s >= n_seqs) {
-+        return;
-+    }
-+    const int r = ids[s];
-+    if (r == rs_head + s) {
-+        return; // identity: prior state already lives in the in-place destination slot
-+    }
-+    const float * src = cache   + (int64_t) r * D;
-+    float *       dst = scratch + (int64_t) s * D;
-+    for (int64_t i = threadIdx.x; i < D; i += blockDim.x) {
-+        dst[i] = src[i];
-+    }
-+}
-+
-+static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * ids, int rs_head,
-+                                          float * scratch, int64_t D, int64_t n_seqs, cudaStream_t stream) {
-+    if (n_seqs <= 0) {
-+        return;
-+    }
-+    gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
-+}
-+
- template <int S_v, bool KDA, bool keep_rs_t>
- __global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
- gated_delta_net_cuda(const float * q,
-@@ -26,7 +54,9 @@ gated_delta_net_cuda(const float * q,
-                                      const uint3   rq3_magic,
-                                      float         scale,
-                                      int           K,
--                                     float *       state_dst) {
-+                                     float *       state_dst,
-+                                     const int32_t * ids,
-+                                     int           rs_head) {
-     const uint32_t h_idx    = blockIdx.x;
-     const uint32_t sequence = blockIdx.y;
-     // each warp owns one column, using warp-level primitives to reduce across rows
-@@ -48,7 +78,15 @@ gated_delta_net_cuda(const float * q,
-     const int64_t state_in_offset      = sequence * H * S_v * S_v + h_idx * S_v * S_v;
-     const int64_t state_out_offset     = (sequence * H + h_idx) * S_v * S_v;
-     state += state_out_offset;
--    curr_state += state_in_offset + col * S_v;
-+    // Step 2: select the prior-state read base per sequence. For the ids variant, identity
-+    // sequences (ids[seq] == rs_head + seq) read s0 directly from the in-place destination slot
-+    // state_dst (no materialization); non-identity sequences read from the pre-gathered scratch
-+    // (curr_state). state_in_offset == state_out_offset, so both bases use the same per-(seq,head)
-+    // offset. The whole s0 is loaded into registers before the new state is written, so reading and
-+    // writing the same slot per block (identity) is race-free.
-+    const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
-+        ? state_dst : curr_state;
-+    read_state += state_in_offset + col * S_v;
-     attn_data += (sequence * n_tokens * H + h_idx) * S_v;
- 
-     constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
-@@ -61,7 +99,7 @@ gated_delta_net_cuda(const float * q,
- #pragma unroll
-     for (int r = 0; r < rows_per_lane; r++) {
-         const int i = r * warp_size + lane;
-        s_shard[r]  = curr_state[i];
-+        s_shard[r]  = read_state[i];
-     }
- 
-     for (int t = 0; t < n_tokens; t++) {
-@@ -176,6 +214,7 @@ static void launch_gated_delta_net(
-         const float * q_d, const float * k_d, const float * v_d,
-         const float * g_d, const float * b_d, const float * s_d,
-         float * dst_d, float * state_dst_d,
-+        const int32_t * ids_d, int rs_head,
-         int64_t S_v,   int64_t H, int64_t n_tokens, int64_t n_seqs,
-         int64_t sq1,   int64_t sq2, int64_t sq3,
-         int64_t sv1,   int64_t sv2, int64_t sv3,
-@@ -199,26 +238,26 @@ static void launch_gated_delta_net(
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-             break;
-         case 32:
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-             break;
-         case 64: {
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-             break;
-         }
-         case 128: {
-             ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
-                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
-+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-             break;
-         }
-         default:
-@@ -262,7 +301,6 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
-     const float * g_d = (const float *) src_g->data;
-     const float * b_d = (const float *) src_beta->data;
- 
-    const float * s_d   = (const float *) src_state->data;
-     float *       dst_d = (float *) dst->data;
- 
-     float * state_dst_d = nullptr;
-@@ -274,6 +312,29 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
-         state_dst_d = (float *) src_state_dst->data;
-     }
- 
-+    // Step 2: fused recurrent-state gather (src[7] = ids == s_copy). Read the prior state directly
-+    // from the full cache via ids instead of from a materialized ggml_get_rows gather. The recurrence
-+    // kernel reads identity sequences (ids[seq] == rs_head + seq) in place from state_dst (no
-+    // materialization at all); any non-identity sequence (reorder / rs_zero remap) is gathered here
-+    // into a disjoint scratch that the kernel reads instead. The gather writes a disjoint buffer and
-+    // the recurrence never reads a slot another block writes, so it is race-free and bit-identical to
-+    // the get_rows path. ids stays a DEVICE pointer (dereferenced only inside the kernels).
-+    ggml_tensor * src_ids = dst->src[7];
-+    const float *   s_d     = (const float *) src_state->data;
-+    const int32_t * ids_d   = nullptr;
-+    int             rs_head = 0;
-+    ggml_cuda_pool_alloc<float> ids_state_scratch(ctx.pool());
-+    if (src_ids != nullptr) {
-+        GGML_ASSERT(state_dst_d != nullptr);
-+        GGML_ASSERT(src_ids->type == GGML_TYPE_I32);
-+        rs_head = ggml_get_op_params_i32(dst, 1);
-+        ids_d   = (const int32_t *) src_ids->data;
-+        const int64_t D = S_v * S_v * H;
-+        float * scratch = ids_state_scratch.alloc((size_t) D * n_seqs);
-+        ggml_cuda_gdn_gather_nonident(s_d, ids_d, rs_head, scratch, D, n_seqs, ctx.stream());
-+        s_d = scratch;
-+    }
-+
-     GGML_ASSERT(ggml_is_contiguous_rows(src_q));
-     GGML_ASSERT(ggml_is_contiguous_rows(src_k));
-     GGML_ASSERT(ggml_is_contiguous_rows(src_v));
-@@ -307,21 +368,21 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
- 
-     if (kda) {
-         if (keep_rs) {
-            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-+            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         } else {
-            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-+            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         }
-     } else {
-         if (keep_rs) {
-            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-+            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         } else {
-            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
-+            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
-                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
-         }
-diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
-index b8d34bf..1762037 100644
--- a/ggml/src/ggml.c
-+++ b/ggml/src/ggml.c
-@@ -6353,6 +6353,82 @@ struct ggml_tensor * ggml_gated_delta_net_inplace(
-     return result;
- }
- 
-+// ggml_gated_delta_net_inplace_ids
-+//
-+// Same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read directly
-+// from the FULL state cache `state` ([S_v, S_v, H, n_rs_slots]) at cache[ids[seq]] (mirroring
-+// ggml_ssm_scan's ids source) instead of from a materialized ggml_get_rows gather. `rs_head` is the
-+// destination base slot, used by the backend to detect the common identity case (ids[s] == rs_head
-+// + s), where the prior state already lives in the in-place destination slots.
-+struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
-+        struct ggml_context * ctx,
-+        struct ggml_tensor  * q,
-+        struct ggml_tensor  * k,
-+        struct ggml_tensor  * v,
-+        struct ggml_tensor  * g,
-+        struct ggml_tensor  * beta,
-+        struct ggml_tensor  * state,
-+        struct ggml_tensor  * state_dst,
-+        struct ggml_tensor  * ids,
-+        int                   rs_head) {
-+    GGML_ASSERT(ggml_is_contiguous_rows(q));
-+    GGML_ASSERT(ggml_is_contiguous_rows(k));
-+    GGML_ASSERT(ggml_is_contiguous_rows(v));
-+    GGML_ASSERT(ggml_is_contiguous(g));
-+    GGML_ASSERT(ggml_is_contiguous(beta));
-+    GGML_ASSERT(ggml_is_contiguous(state));
-+
-+    GGML_ASSERT(q->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(k->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(v->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(g->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(beta->type == GGML_TYPE_F32);
-+    GGML_ASSERT(state->type == GGML_TYPE_F32);
-+    GGML_ASSERT(state_dst != NULL && state_dst->type == GGML_TYPE_F32);
-+    GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32);
-+
-+    const int64_t S_v      = v->ne[0];
-+    const int64_t H        = v->ne[1];
-+    const int64_t n_tokens = v->ne[2];
-+    const int64_t n_seqs   = v->ne[3];
-+
-+    GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
-+    GGML_ASSERT(beta->ne[0] == 1);
-+
-+    // state is the FULL recurrent-state cache: [S_v, S_v, H, n_rs_slots], n_rs_slots >= n_seqs
-+    GGML_ASSERT(state->ne[0] == S_v);
-+    GGML_ASSERT(state->ne[1] == S_v);
-+    GGML_ASSERT(state->ne[2] == H);
-+    GGML_ASSERT(state->ne[3] >= n_seqs);
-+
-+    // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
-+    GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
-+    GGML_ASSERT(state_dst->ne[1] >= n_seqs);
-+    GGML_ASSERT(state_dst->nb[0] == sizeof(float));
-+
-+    // ids: per-seq source slot into the full cache (s_copy_main)
-+    GGML_ASSERT(ids->ne[0] >= n_seqs);
-+
-+    const int64_t state_rows = S_v * n_seqs; // K == 1
-+    const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
-+    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
-+
-+    ggml_set_op_params_i32(result, 0, 1);       // K == 1
-+    ggml_set_op_params_i32(result, 1, rs_head); // destination base slot (for the ids identity check)
-+
-+    result->op     = GGML_OP_GATED_DELTA_NET;
-+    result->src[0] = q;
-+    result->src[1] = k;
-+    result->src[2] = v;
-+    result->src[3] = g;
-+    result->src[4] = beta;
-+    result->src[5] = state;     // FULL cache (read via ids)
-+    result->src[6] = state_dst; // in-place final-state write-back target
-+    result->src[7] = ids;       // per-seq source slots (s_copy)
-+
-+    return result;
-+}
-+
- ////////////////////////////////////////////////////////////////////////////////
- 
- struct ggml_hash_set ggml_hash_set_new(size_t size) {
-diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
-index 26a718b..194e611 100644
--- a/src/models/delta-net-base.cpp
-+++ b/src/models/delta-net-base.cpp
-@@ -524,6 +524,69 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
-     return conv_input;
- }
- 
-+// Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
-+// gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
-+// ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
-+// and rollback paths fall back to materializing the prior state and delegating below.
-+ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
-+        llm_graph_input_rs * inp,
-+        ggml_tensor *        ssm_states_all,
-+        ggml_tensor *        q,
-+        ggml_tensor *        k,
-+        ggml_tensor *        v,
-+        ggml_tensor *        g,
-+        ggml_tensor *        b,
-+        int                  il) {
-+    const auto * mctx_cur = inp->mctx;
-+    const auto   kv_head  = mctx_cur->get_head();
-+
-+    const int64_t S_v          = v->ne[0];
-+    const int64_t H_v          = v->ne[1];
-+    const int64_t n_seqs       = v->ne[3];
-+    const int64_t n_seq_tokens = q->ne[2];
-+
-+    const bool keep  = cparams.n_rs_seq > 0;
-+    const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
-+
-+    if (!keep && fused) {
-+        // build_rs feeds the FULL state cache + the s_copy ids into the op (via the get_state_rows
-+        // lambda, exactly like mamba-base's ggml_ssm_scan) and still performs the rs_zero clear and
-+        // the extra-states copy around it. The op reads curr_state from cache[ids[seq]] and writes
-+        // the final state in place at kv_head; no recurrent-state materialization at all.
-+        auto get_state_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * {
-+            ggml_tensor * cache4d = ggml_reshape_4d(ctx, states, S_v, S_v, H_v, states->ne[1]);
-+            ggml_tensor * state_dst = ggml_view_2d(ctx, ssm_states_all, hparams.n_embd_s(), n_seqs,
-+                    ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
-+            return ggml_gated_delta_net_inplace_ids(ctx, q, k, v, g, b, cache4d, state_dst, ids, (int) kv_head);
-+        };
-+
-+        ggml_tensor * result = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs, get_state_op);
-+        if (n_seq_tokens == 1) {
-+            cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
-+        } else {
-+            cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
-+        }
-+
-+        ggml_tensor * output = ggml_view_4d(ctx0, result,
-+                S_v, H_v, n_seq_tokens, n_seqs,
-+                ggml_row_size(result->type, S_v),
-+                ggml_row_size(result->type, S_v * H_v),
-+                ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
-+        cb(output, "attn_output", il);
-+
-+        // the state write is a side effect of the op; pull the op into the graph via the output
-+        ggml_build_forward_expand(gf, output);
-+
-+        return output;
-+    }
-+
-+    // non-fused / rollback: materialize the prior state via gather and delegate to the
-+    // state-taking overload (its fused !keep branch performs the Step-1 in-place write).
-+    ggml_tensor * s = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
-+    s = ggml_reshape_4d(ctx0, s, S_v, S_v, H_v, n_seqs);
-+    return build_recurrent_attn(inp, ssm_states_all, q, k, v, g, b, s, il);
-+}
-+
- ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
-         llm_graph_input_rs * inp,
-         ggml_tensor *        ssm_states_all,
-diff --git a/src/models/models.h b/src/models/models.h
-index 2ac8415..98b89e9 100644
--- a/src/models/models.h
-+++ b/src/models/models.h
-@@ -88,6 +88,19 @@ struct llm_build_delta_net_base : public llm_graph_context {
-             ggml_tensor *        b,
-             ggml_tensor *        s,
-             int                  il);
-+
-+    // Step 2: gather-free variant. Reads the prior recurrent state directly from the full cache via
-+    // the s_copy ids (no ggml_get_rows materialization) on the fused decode/prefill path, and
-+    // delegates to the state-taking overload for the non-fused and rollback paths.
-+    ggml_tensor * build_recurrent_attn(
-+            llm_graph_input_rs * inp,
-+            ggml_tensor *        ssm_states_all,
-+            ggml_tensor *        q,
-+            ggml_tensor *        k,
-+            ggml_tensor *        v,
-+            ggml_tensor *        g,
-+            ggml_tensor *        b,
-+            int                  il);
- };
- 
- struct llm_build_rwkv6_base : public llm_graph_context {
-diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
-index 6783d98..0be3247 100644
--- a/src/models/qwen35.cpp
-+++ b/src/models/qwen35.cpp
-@@ -385,10 +385,6 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
- 
-     ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- 
-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
-    cb(state, "state_predelta", il);
-
-     ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-     cb(conv_output_proper, "conv_output_raw", il);
- 
-@@ -445,7 +441,7 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
-     cb(k_conv, "k_conv_predelta", il);
-     cb(v_conv, "v_conv_predelta", il);
- 
-    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
-+    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
- 
-     // z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
-     ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
-diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
-index eb5e9a4..2995f04 100644
--- a/src/models/qwen35moe.cpp
-+++ b/src/models/qwen35moe.cpp
-@@ -409,10 +409,6 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
- 
-     ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- 
-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
-    cb(state, "state_predelta", il);
-
-     ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-     cb(conv_output_proper, "conv_output_raw", il);
- 
-@@ -469,7 +465,7 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
-     cb(k_conv, "k_conv_predelta", il);
-     cb(v_conv, "v_conv_predelta", il);
- 
-    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
-+    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
- 
-     // z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
-     ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch
@@ -1,225 +0,0 @@
-From df1cc97b68df048834ab735c944b71c3a2e8737e Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 12:40:49 +0200
-Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape
- (patch 0020)
-
-Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
-models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
-(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
-both engines pinned the largest llama-specific overage to the gated-DeltaNet
-OUTPUT projection (ssm_out).
-
-The GDN op left its output in SSM layout and the graph reshaped it to 3D
-[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
-src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
-sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
-ssm_out weight read across the 128 sequences (one 5120x128 grid, 48 calls/step,
-the 40%-vs-62% GPU-utilization gap). vLLM packs the same projection into one
-M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.
-
-The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
-(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
-routes to the MMQ M=128 tensor-core GEMM (which amortizes the weight read across
-all 128 tokens). The result is then already 2D, so the redundant post-matmul
-reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical.
-Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs
-untouched.
-
-Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
-q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
-test-backend-ops MUL_MAT and MUL_MAT_ID OK.
-
-decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
-  dense q36-27b:    170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
-  MoE   q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
-Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).
-
-nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
-to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) with a LOWER
-per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
-vs 2.77 ms/call for the old GEMV.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- LEVER1_OPROJ_MMQ_RESULTS.md | 77 +++++++++++++++++++++++++++++++++++++
- src/models/qwen35.cpp       | 13 ++++---
- src/models/qwen35moe.cpp    | 13 ++++---
- src/models/qwen3next.cpp    | 13 ++++---
- 4 files changed, 98 insertions(+), 18 deletions(-)
- create mode 100644 LEVER1_OPROJ_MMQ_RESULTS.md
-
-diff --git a/LEVER1_OPROJ_MMQ_RESULTS.md b/LEVER1_OPROJ_MMQ_RESULTS.md
-new file mode 100644
-index 0000000..9a5721f
--- /dev/null
-+++ b/LEVER1_OPROJ_MMQ_RESULTS.md
-@@ -0,0 +1,77 @@
-+# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
-+
-+The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
-+(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
-+bit-exact tensor reshape that re-routes the per-layer SSM output projection
-+from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
-+
-+## The mechanism (profiled, both engines)
-+
-+Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
-+The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
-+(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
-+to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
-+`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
-+128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
-+the ssm_out weight read across the 128 sequences. vLLM packs the same projection
-+into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
-+only the output projection was in 3D SSM layout.
-+
-+## The fix
-+
-+In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
-+the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
-+decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
-+MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
-+so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
-+2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
-+proven by the in-projection.
-+
-+```
-+-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
-++    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-+     ...
-+     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
-+-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-+```
-+
-+## Gates (all PASS)
-+
-+- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
-+  post-SSM baseline build:
-+  - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
-+  - MoE   q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
-+- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
-+- Coherent dense + MoE output (greedy text inspected).
-+
-+## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
-+
-+S_TG t/s (decode aggregate):
-+
-+| model            | npl | baseline | Lever 1 | delta   |
-+|------------------|-----|----------|---------|---------|
-+| dense q36-27b    |  32 |   170.52 |  200.00 | +17.3%  |
-+| dense q36-27b    | 128 |   254.92 |  335.80 | +31.7%  |
-+| MoE   q36-35b-a3b|  32 |   373.28 |  420.77 | +12.7%  |
-+| MoE   q36-35b-a3b| 128 |   560.66 |  691.24 | +23.3%  |
-+
-+Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
-+up from 65% post-SSM).
-+
-+## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
-+
-+The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
-+
-+| kernel                              | baseline           | Lever 1          |
-+|-------------------------------------|--------------------|------------------|
-+| mul_mat_vec_q<NVFP4, m=1> (o_proj)  | 132.8 ms / 48 inst | 0 ms / 0 inst    |
-+| mul_mat_q<NVFP4, m=128>             | 5463 ms / 8800 inst| 5827 ms /10000 inst|
-+
-+The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
-+(+1200 instances, +363 ms over the window), and its per-call average DROPS
-+(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
-+than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
-+~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
-+old GEMV: the amortized weight read is the win.
-+
-+Assisted-by: Claude:opus-4.8 [Claude Code]
-diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
-index 0be3247..0874c43 100644
--- a/src/models/qwen35.cpp
-+++ b/src/models/qwen35.cpp
-@@ -449,17 +449,18 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
-     // Apply gated normalization: self.norm(core_attn_out, z)
-     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- 
-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
-+    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
-+    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
-+    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
-+    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
-+    // data, just a 2D vs 3D view, so the result is bit-identical.
-+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-     cb(final_output, "final_output", il);
- 
-    // Output projection
-+    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
-     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
-     cb(cur, "linear_attn_out", il);
- 
-    // Reshape back to original dimensions
-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
-     return cur;
- }
- 
-diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
-index 2995f04..1f6f643 100644
--- a/src/models/qwen35moe.cpp
-+++ b/src/models/qwen35moe.cpp
-@@ -473,17 +473,18 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
-     // Apply gated normalization: self.norm(core_attn_out, z)
-     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- 
-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
-+    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
-+    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
-+    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
-+    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
-+    // data, just a 2D vs 3D view, so the result is bit-identical.
-+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-     cb(final_output, "final_output", il);
- 
-    // Output projection
-+    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
-     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
-     cb(cur, "linear_attn_out", il);
- 
-    // Reshape back to original dimensions
-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
-     return cur;
- }
- 
-diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
-index 97200a4..bfdf026 100644
--- a/src/models/qwen3next.cpp
-+++ b/src/models/qwen3next.cpp
-@@ -519,17 +519,18 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
-     // Apply gated normalization: self.norm(core_attn_out, z)
-     ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- 
-    // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
-    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
-+    // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
-+    // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
-+    // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
-+    // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
-+    // data, just a 2D vs 3D view, so the result is bit-identical.
-+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-     cb(final_output, "final_output", il);
- 
-    // Output projection
-+    // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
-     cur = build_lora_mm(model.layers[il].ssm_out, final_output);
-     cb(cur, "linear_attn_out", il);
- 
-    // Reshape back to original dimensions
-    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
-     return cur;
- }
- 
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch
@@ -1,769 +0,0 @@
-From 58426b58aaf5431a59d499d513b2fe2d6ab990d8 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 18:55:54 +0200
-Subject: [PATCH] feat(paged): qwen35 decode conv-state in-place fusion (patch
- 0021)
-
-The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate
-design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet
-recurrence is already single-pass at the f32 byte floor), the decode conv path
-was the only remaining bit-exact lever.
-
-New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated
-by a non-null src[3]). On the single-token decode path it replaces the four-op
-conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu
-+ ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per
-(channel, sequence), assembles the width-K window in registers from the K-1 cached
-taps plus the current qkv_mixed token, computes the depthwise conv with the SAME
-ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv
-output, and writes the 1-token-shifted ring state back IN PLACE into the conv
-cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018
-in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and
-write target (the cache view) are disjoint buffers, so it is race-free by
-construction with no ids/identity logic.
-
- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel,
-  src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring;
-  op_params[0]=fuse_silu)
- ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel +
-  ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv
- ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels)
-  + branch in ggml_compute_forward_ssm_conv
- delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs
-  conv-tap gather; fuses conv+silu+shifted write-back)
- qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path
-  (n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep
-  the original chain
- tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference
-
-test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90.
-
-Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1
-(0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5
-ac163882... both BYTE-IDENTICAL.
-
-decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session:
-  dense q36-27b-nvfp4 : npl 32  199.76 -> 202.99 (+1.6%)
-                        npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391)
-  MoE   q36-35b-a3b   : npl 32  421.72 -> 432.39 (+2.5%)
-                        npl 128 689.74 -> 713.54 (+3.5%)
-Lift holds in eager too (dense npl128 333.62 -> 342.97). Step -11.9 ms/step
-(dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the
-decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152);
-conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state
-conv-cache plumbing.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
----
- CONV_STATE_FUSION_RESULTS.md   | 106 +++++++++++++++++++++++++++++++
- ggml/include/ggml.h            |  16 +++++
- ggml/src/ggml-cpu/ops.cpp      |  73 ++++++++++++++++++++-
- ggml/src/ggml-cuda/ssm-conv.cu | 112 +++++++++++++++++++++++++++++++++
- ggml/src/ggml.c                |  54 ++++++++++++++++
- src/models/delta-net-base.cpp  |  51 +++++++++++++++
- src/models/models.h            |  14 +++++
- src/models/qwen35.cpp          |  23 +++++--
- src/models/qwen35moe.cpp       |  23 +++++--
- src/models/qwen3next.cpp       |  29 ++++++---
- tests/test-backend-ops.cpp     |  47 ++++++++++++++
- 11 files changed, 526 insertions(+), 22 deletions(-)
- create mode 100644 CONV_STATE_FUSION_RESULTS.md
-
-diff --git a/CONV_STATE_FUSION_RESULTS.md b/CONV_STATE_FUSION_RESULTS.md
-new file mode 100644
-index 0000000..f59b6e5
---- /dev/null
-+++ b/CONV_STATE_FUSION_RESULTS.md
-@@ -0,0 +1,106 @@
-+# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact)
-+
-+The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3).
-+After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at
-+the f32 byte floor), the conv path was the only remaining bit-exact decode lever.
-+
-+## What changed
-+
-+A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a
-+non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain:
-+
-+    qkv_mixed transpose -> ggml_concat (build width-K window)   [concat_cont 8.14 ms/step]
-+    -> ggml_ssm_conv (depthwise conv)                           [ssm_conv_f32 ~8.6 ms/step]
-+    -> ggml_silu                                                [folded into ssm_conv on CUDA]
-+    -> ggml_cpy of the shifted ring state into the conv cache   [cpy_scalar 5.76 ms/step]
-+
-+with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from
-+the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME
-+ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes
-+the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot
-+the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is
-+vLLM's `causal_conv1d_update`.
-+
-+Files:
-+- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace`
-+  (src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs],
-+  src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu).
-+- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32<apply_silu,d_conv>` (one thread per
-+  (channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of
-+  `ggml_cuda_op_ssm_conv`.
-+- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over
-+  channels) + branch in `ggml_compute_forward_ssm_conv`.
-+- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs
-+  conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write
-+  target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed.
-+- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path
-+  (`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/
-+  rollback keep the existing concat+ssm_conv+silu+cpy chain.
-+- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output
-+  vs the CPU reference across backends.
-+
-+## Gate: test-backend-ops (CUDA0 vs CPU reference)
-+
-+- SSM_CONV: 45/45 OK (unchanged path intact)
-+- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
-+- SSM_CONV_BIAS_SILU: 90/90 OK
-+
-+## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on)
-+
-+Byte-identical to the clean Lever-1 (0019/0020) baseline, both models:
-+
-+| model              | baseline md5                     | fused md5                        | result          |
-+|--------------------|----------------------------------|----------------------------------|-----------------|
-+| q36-27b-nvfp4      | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL  |
-+| q36-35b-a3b-nvfp4  | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL  |
-+
-+## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after
-+
-+Dense q36-27b-nvfp4:
-+
-+| mode      | npl | baseline | fused  | delta   |
-+|-----------|-----|----------|--------|---------|
-+| CUDA-graph| 32  | 199.76   | 202.99 | +1.6%   |
-+| CUDA-graph| 128 | 336.35   | 347.14 | +3.2%   |
-+| eager     | 32  | 196.07   | 197.61 | +0.8%   |
-+| eager     | 128 | 333.62   | 342.97 | +2.8%   |
-+
-+MoE q36-35b-a3b-nvfp4:
-+
-+| mode      | npl | baseline | fused  | delta   |
-+|-----------|-----|----------|--------|---------|
-+| CUDA-graph| 32  | 421.72   | 432.39 | +2.5%   |
-+| CUDA-graph| 128 | 689.74   | 713.54 | +3.5%   |
-+| eager     | 32  | 421.05   | 432.46 | +2.7%   |
-+| eager     | 128 | 689.15   | 713.87 | +3.6%   |
-+
-+Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at
-+**88.8% of vLLM 391** (up from 86.0%). The lift holds in BOTH graph and eager modes.
-+
-+## Step time + nsys kernel delta
-+
-+Per-step decode time (dense npl128, T_TG / ntg=128):
-+- baseline 48.711 s / 128 = 380.6 ms/step
-+- fused    47.197 s / 128 = 368.7 ms/step  -> **-11.9 ms/step** (matches the predicted +12-14 ms)
-+- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step)
-+
-+nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels:
-+
-+| kernel              | baseline calls | fused calls | per-step (eager) |
-+|---------------------|----------------|-------------|------------------|
-+| concat_cont (decode)| 1152           | 0 (GONE)    | 7.95 -> 0 ms     |
-+| cpy_scalar (decode) | 1152 of 3648   | 0 (GONE)    | 4.29 -> 0 ms     |
-+| ssm_conv_f32 (decode)| 1152 of 2736  | 0 (prefill-only) | 8.65 -> 0 ms |
-+| ssm_conv_update     | 0              | 1152        | 0 -> 7.56 ms     |
-+
-+Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont
-+and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel.
-+prefill keeps the original chain (concat_non_cont 1584, ssm_conv_f32 1584 unchanged).
-+
-+## Verdict
-+
-+Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM
-+391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free;
-+de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches.
-+
-+Assisted-by: Claude:opus-4.8 [Claude Code]
-diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
-index 951dd21..76fa401 100644
---- a/ggml/include/ggml.h
-+++ b/ggml/include/ggml.h
-@@ -2447,6 +2447,22 @@ extern "C" {
-             struct ggml_tensor  * sx,
-             struct ggml_tensor  * c);
- 
-+    // Fused decode-time depthwise causal conv1d update (mirrors vLLM causal_conv1d_update). Assembles
-+    // the width-K conv window in registers from the cached K-1 taps (`conv_states`, [K-1, channels,
-+    // n_seqs]) plus the single current token (`x_cur`, [channels, 1, n_seqs]), computes the depthwise
-+    // conv with the SAME ascending-tap FMA order as ggml_ssm_conv, optionally folds SiLU, and writes
-+    // the 1-token-shifted ring state back IN PLACE into `conv_state_dst` (a [(K-1)*channels, n_seqs]
-+    // view into the conv-state cache). This eliminates the concat + transpose + scalar copy-back +
-+    // separate silu of the decode conv path. Output: [channels, 1, n_seqs]. Reuses GGML_OP_SSM_CONV;
-+    // detected by the backends via a non-null src[3]. n_seq_tokens must be 1 (single-token decode).
-+    GGML_API struct ggml_tensor * ggml_ssm_conv_update_inplace(
-+            struct ggml_context * ctx,
-+            struct ggml_tensor  * conv_states,
-+            struct ggml_tensor  * conv_kernel,
-+            struct ggml_tensor  * x_cur,
-+            struct ggml_tensor  * conv_state_dst,
-+            bool                  fuse_silu);
-+
-     GGML_API struct ggml_tensor * ggml_ssm_scan(
-             struct ggml_context * ctx,
-             struct ggml_tensor  * s,
-diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
-index b6a1976..f9cd850 100644
---- a/ggml/src/ggml-cpu/ops.cpp
-+++ b/ggml/src/ggml-cpu/ops.cpp
-@@ -9463,13 +9463,84 @@ static void ggml_compute_forward_ssm_conv_f32(
-     }
- }
- 
-+// Fused decode-time depthwise causal conv1d update (mirror of the CUDA ssm_conv_update_f32). Reads the
-+// K-1 cached taps (src[0]) and the single new token (src[2]), computes the depthwise conv with the same
-+// ascending-tap FMA order as ggml_compute_forward_ssm_conv_f32, optionally folds silu, writes the conv
-+// output to dst, and writes the 1-token-shifted ring state back in place into src[3]. Threads split
-+// over channels.
-+static void ggml_compute_forward_ssm_conv_update_f32(
-+        const ggml_compute_params * params,
-+        ggml_tensor * dst) {
-+    const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs]
-+    const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
-+    const ggml_tensor * x_cur       = dst->src[2]; // [channels, 1, n_seqs]
-+    ggml_tensor       * cdst        = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
-+
-+    const int ith = params->ith;
-+    const int nth = params->nth;
-+
-+    const int64_t d_conv   = conv_kernel->ne[0];
-+    const int64_t channels = conv_kernel->ne[1];
-+    const int64_t n_seqs   = conv_states->ne[2];
-+    const bool    apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
-+
-+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
-+    GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
-+
-+    const int64_t states_seq_stride = conv_states->nb[2] / sizeof(float);
-+    const int64_t states_ch_stride  = conv_states->nb[1] / sizeof(float);
-+    const int64_t w_stride          = conv_kernel->nb[1] / sizeof(float);
-+    const int64_t x_seq_stride      = x_cur->nb[2] / sizeof(float);
-+    const int64_t dst_seq_stride    = dst->nb[2] / sizeof(float);
-+    const int64_t cdst_seq_stride   = cdst->nb[1] / sizeof(float);
-+
-+    const float * states_base = (const float *) conv_states->data;
-+    const float * w_base      = (const float *) conv_kernel->data;
-+    const float * x_base      = (const float *) x_cur->data;
-+    float *       cdst_base   = (float *) cdst->data;
-+    float *       dst_base    = (float *) dst->data;
-+
-+    const int64_t dc = (channels + nth - 1) / nth;
-+    const int64_t c0 = dc * ith;
-+    const int64_t c1 = MIN(c0 + dc, channels);
-+
-+    for (int64_t s = 0; s < n_seqs; ++s) {
-+        for (int64_t c = c0; c < c1; ++c) {
-+            const float * states_c = states_base + s * states_seq_stride + c * states_ch_stride;
-+            const float * w_c      = w_base + c * w_stride;
-+            const float   xc       = x_base[s * x_seq_stride + c];
-+
-+            // ascending-tap FMA: tap0*w0 + ... + tap_{K-2}*w_{K-2} + xc*w_{K-1} (matches ssm_conv)
-+            float sumf = 0.0f;
-+            for (int64_t j = 0; j < d_conv - 1; ++j) {
-+                sumf += states_c[j] * w_c[j];
-+            }
-+            sumf += xc * w_c[d_conv - 1];
-+            sumf += 0.0f; // matches ssm_conv `sumf += b` with b == 0
-+
-+            dst_base[s * dst_seq_stride + c] = apply_silu ? (sumf / (1.0f + expf(-sumf))) : sumf;
-+
-+            // 1-token-shifted ring write-back: [tap1 .. tap_{K-2}, xc]
-+            float * out_state = cdst_base + s * cdst_seq_stride + c * (d_conv - 1);
-+            for (int64_t j = 0; j < d_conv - 2; ++j) {
-+                out_state[j] = states_c[j + 1];
-+            }
-+            out_state[d_conv - 2] = xc;
-+        }
-+    }
-+}
-+
- void ggml_compute_forward_ssm_conv(
-         const ggml_compute_params * params,
-         ggml_tensor * dst) {
-     switch (dst->src[0]->type) {
-         case GGML_TYPE_F32:
-             {
--                ggml_compute_forward_ssm_conv_f32(params, dst);
-+                if (dst->src[3] != nullptr) {
-+                    ggml_compute_forward_ssm_conv_update_f32(params, dst);
-+                } else {
-+                    ggml_compute_forward_ssm_conv_f32(params, dst);
-+                }
-             } break;
-         default:
-             {
-diff --git a/ggml/src/ggml-cuda/ssm-conv.cu b/ggml/src/ggml-cuda/ssm-conv.cu
-index 1463169..e1af1cd 100644
---- a/ggml/src/ggml-cuda/ssm-conv.cu
-+++ b/ggml/src/ggml-cuda/ssm-conv.cu
-@@ -123,6 +123,109 @@ static __global__ void ssm_conv_long_token_f32(const float * __restrict__ src0,
-     }
- }
- 
-+// Fused decode-time depthwise causal conv1d update (one new token). Each thread owns one channel of
-+// one sequence: it assembles the width-d_conv window from the K-1 cached taps (conv_states) plus the
-+// current token (x_cur), computes the depthwise conv with the SAME ascending-tap FMA order as
-+// ssm_conv_f32 at i==0, optionally folds silu, writes the conv output, and writes the 1-token-shifted
-+// ring state back in place into conv_state_dst. Bit-identical to ssm_conv(concat) + silu + copy-back.
-+template <bool apply_silu, int d_conv>
-+static __global__ void ssm_conv_update_f32(const float * __restrict__ conv_states,
-+                                           const float * __restrict__ conv_kernel,
-+                                           const float * __restrict__ x_cur,
-+                                           float       * __restrict__ conv_state_dst,
-+                                           float       * __restrict__ dst,
-+                                           const int channels,
-+                                           const int states_seq_stride,
-+                                           const int w_stride,
-+                                           const int x_seq_stride,
-+                                           const int dst_seq_stride,
-+                                           const int cdst_seq_stride) {
-+    const int c = blockIdx.x * blockDim.x + threadIdx.x; // channel
-+    const int s = blockIdx.y;                            // sequence
-+    if (c >= channels) {
-+        return;
-+    }
-+
-+    const float * states_c = conv_states + (int64_t) s * states_seq_stride + (int64_t) c * (d_conv - 1);
-+    const float * w_c       = conv_kernel + (int64_t) c * w_stride;
-+    const float   xc        = x_cur[(int64_t) s * x_seq_stride + c];
-+
-+    // window = [tap0 .. tap_{K-2}, current-token], same ordering as the concat(conv_states, x) window
-+    float window[d_conv];
-+#pragma unroll
-+    for (int j = 0; j < d_conv - 1; j++) {
-+        window[j] = states_c[j];
-+    }
-+    window[d_conv - 1] = xc;
-+
-+    float sumf = 0.0f;
-+#pragma unroll
-+    for (int j = 0; j < d_conv; j++) {
-+        sumf += window[j] * w_c[j];
-+    }
-+    sumf += 0.0f; // matches ssm_conv_f32 `sumf += b` with b == 0 (qwen35 conv1d has no bias)
-+    dst[(int64_t) s * dst_seq_stride + c] = apply_silu ? ggml_cuda_op_silu_single(sumf) : sumf;
-+
-+    // 1-token-shifted ring write-back: drop the oldest tap, append the current token
-+    float * out_state = conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1);
-+#pragma unroll
-+    for (int j = 0; j < d_conv - 1; j++) {
-+        out_state[j] = window[j + 1];
-+    }
-+}
-+
-+static void ggml_cuda_op_ssm_conv_update(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
-+    const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs]
-+    const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
-+    const ggml_tensor * x_cur       = dst->src[2]; // [channels, 1, n_seqs]
-+    const ggml_tensor * cdst        = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
-+
-+    const int64_t d_conv   = conv_kernel->ne[0];
-+    const int64_t channels = conv_kernel->ne[1];
-+    const int64_t n_seqs   = conv_states->ne[2];
-+    const bool    apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
-+
-+    GGML_ASSERT(conv_states->type == GGML_TYPE_F32 && conv_kernel->type == GGML_TYPE_F32);
-+    GGML_ASSERT(x_cur->type == GGML_TYPE_F32 && cdst->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
-+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
-+    GGML_ASSERT(conv_states->nb[1] == (size_t) (d_conv - 1) * sizeof(float));
-+    GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
-+    GGML_ASSERT(dst->ne[0] == channels && dst->ne[1] == 1 && dst->ne[2] == n_seqs);
-+
-+    const float * states_d = (const float *) conv_states->data;
-+    const float * w_d      = (const float *) conv_kernel->data;
-+    const float * x_d      = (const float *) x_cur->data;
-+    float *       cdst_d   = (float *) cdst->data;
-+    float *       dst_d    = (float *) dst->data;
-+    cudaStream_t  stream   = ctx.stream();
-+
-+    const int states_seq_stride = (int) (conv_states->nb[2] / sizeof(float));
-+    const int w_stride          = (int) (conv_kernel->nb[1] / sizeof(float));
-+    const int x_seq_stride      = (int) (x_cur->nb[2] / sizeof(float));
-+    const int dst_seq_stride    = (int) (dst->nb[2] / sizeof(float));
-+    const int cdst_seq_stride   = (int) (cdst->nb[1] / sizeof(float));
-+
-+    const int threads = 128;
-+    const dim3 blocks((channels + threads - 1) / threads, (unsigned) n_seqs, 1);
-+
-+    auto launch = [&](auto NC) {
-+        constexpr int kNC = decltype(NC)::value;
-+        if (apply_silu) {
-+            ssm_conv_update_f32<true, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d,
-+                (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
-+        } else {
-+            ssm_conv_update_f32<false, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d,
-+                (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
-+        }
-+    };
-+
-+    switch (d_conv) {
-+        case 3: launch(std::integral_constant<int, 3>{}); break;
-+        case 4: launch(std::integral_constant<int, 4>{}); break;
-+        default: GGML_ABORT("ssm_conv_update only supports d_conv 3 or 4");
-+    }
-+}
-+
- template <bool apply_silu>
- static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1,
-                               const int src0_nb2, const int src1_nb1, float * dst, const int dst_nb0, const int dst_nb1,
-@@ -158,6 +261,15 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const floa
- }
- 
- void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * bias_add_node, ggml_tensor * silu_dst) {
-+    // Fused decode conv-update-in-place variant (ggml_ssm_conv_update_inplace): discriminated by a
-+    // non-null src[3] (the in-place ring write-back target). It folds the concat/transpose/copy-back/
-+    // silu of the decode conv path into a single kernel.
-+    if (dst->src[3] != nullptr) {
-+        GGML_ASSERT(bias_add_node == nullptr && silu_dst == nullptr);
-+        ggml_cuda_op_ssm_conv_update(ctx, dst);
-+        return;
-+    }
-+
-     const struct ggml_tensor * src0 = dst->src[0];  // conv_x
-     const struct ggml_tensor * src1 = dst->src[1];  // conv1d.weight
-     const bool fuse_bias = bias_add_node != nullptr;
-diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
-index 1762037..b777748 100644
---- a/ggml/src/ggml.c
-+++ b/ggml/src/ggml.c
-@@ -5555,6 +5555,60 @@ struct ggml_tensor * ggml_ssm_conv(
-     return result;
- }
- 
-+// ggml_ssm_conv_update_inplace
-+//
-+// Fused decode-time depthwise causal conv1d update. Reuses GGML_OP_SSM_CONV but is discriminated by a
-+// non-null src[3]. The op reads each channel's K-1 cached taps from `conv_states` and the single new
-+// token from `x_cur`, computes the depthwise conv (ascending-tap FMA, bit-identical to ggml_ssm_conv),
-+// optionally folds SiLU, writes the conv output to dst ([channels, 1, n_seqs]) and writes the
-+// 1-token-shifted ring state back in place into `conv_state_dst` (the active sequences' conv-cache
-+// slot). op_params[0] carries the fuse_silu flag. Mirrors the 0018/0019 in-place state pattern.
-+struct ggml_tensor * ggml_ssm_conv_update_inplace(
-+        struct ggml_context * ctx,
-+        struct ggml_tensor  * conv_states,
-+        struct ggml_tensor  * conv_kernel,
-+        struct ggml_tensor  * x_cur,
-+        struct ggml_tensor  * conv_state_dst,
-+        bool                  fuse_silu) {
-+    GGML_ASSERT(ggml_is_3d(conv_states));
-+    GGML_ASSERT(ggml_is_matrix(conv_kernel));
-+    GGML_ASSERT(ggml_is_3d(x_cur));
-+
-+    const int64_t d_conv   = conv_kernel->ne[0];
-+    const int64_t channels = conv_kernel->ne[1];
-+    const int64_t n_seqs   = conv_states->ne[2];
-+
-+    GGML_ASSERT(conv_states->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(conv_kernel->type    == GGML_TYPE_F32);
-+    GGML_ASSERT(x_cur->type          == GGML_TYPE_F32);
-+    GGML_ASSERT(conv_state_dst != NULL && conv_state_dst->type == GGML_TYPE_F32);
-+
-+    // conv_states: [K-1, channels, n_seqs], contiguous taps per channel
-+    GGML_ASSERT(conv_states->ne[0] == d_conv - 1);
-+    GGML_ASSERT(conv_states->ne[1] == channels);
-+    GGML_ASSERT(conv_states->nb[0] == sizeof(float));
-+    // x_cur: single decode token per sequence
-+    GGML_ASSERT(x_cur->ne[0] == channels);
-+    GGML_ASSERT(x_cur->ne[1] == 1);
-+    GGML_ASSERT(x_cur->ne[2] == n_seqs);
-+    // conv_state_dst: [(K-1)*channels, n_seqs] in-place ring write target
-+    GGML_ASSERT(conv_state_dst->ne[0] == (d_conv - 1) * channels);
-+    GGML_ASSERT(conv_state_dst->ne[1] >= n_seqs);
-+    GGML_ASSERT(conv_state_dst->nb[0] == sizeof(float));
-+
-+    struct ggml_tensor * result = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
-+
-+    ggml_set_op_params_i32(result, 0, fuse_silu ? 1 : 0);
-+
-+    result->op     = GGML_OP_SSM_CONV;
-+    result->src[0] = conv_states;
-+    result->src[1] = conv_kernel;
-+    result->src[2] = x_cur;
-+    result->src[3] = conv_state_dst;
-+
-+    return result;
-+}
-+
- // ggml_ssm_scan
- 
- struct ggml_tensor * ggml_ssm_scan(
-diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
-index 194e611..0eee804 100644
---- a/src/models/delta-net-base.cpp
-+++ b/src/models/delta-net-base.cpp
-@@ -524,6 +524,57 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
-     return conv_input;
- }
- 
-+// Fused decode conv path (patch 0021). Reads the active sequences' prior conv-state taps (the same
-+// cheap build_rs gather as build_conv_state), then fuses the depthwise conv + silu + the 1-token-
-+// shifted ring write-back into a single ggml_ssm_conv_update_inplace op. This removes the concat
-+// (concat_cont), the transpose materialization, the scalar copy-back (cpy_scalar) and the separate
-+// silu of the decode conv path. The op reads from the (disjoint) materialized taps and writes the
-+// new ring state in place into the cache slot at kv_head -- exactly the slot the baseline ggml_cpy
-+// wrote -- so it is bit-identical to build_conv_state + ggml_ssm_conv + ggml_silu.
-+ggml_tensor * llm_build_delta_net_base::build_conv_state_fused(
-+        llm_graph_input_rs * inp,
-+        ggml_tensor *        conv_states_all,
-+        ggml_tensor *        qkv_mixed,
-+        ggml_tensor *        conv_kernel,
-+        int64_t              conv_kernel_size,
-+        int64_t              conv_channels,
-+        int                  il) {
-+    const auto * mctx_cur = inp->mctx;
-+    const auto   kv_head  = mctx_cur->get_head();
-+
-+    const int64_t n_seqs       = ubatch.n_seqs;
-+    const int64_t n_seq_tokens = ubatch.n_seq_tokens;
-+
-+    GGML_ASSERT(n_seq_tokens == 1);        // single-token decode only
-+    GGML_ASSERT(cparams.n_rs_seq == 0);    // no rollback splits on this path
-+
-+    // Prior conv-state taps for the active sequences: [K-1, conv_channels, n_seqs]. Same get_rows
-+    // gather as the baseline build_conv_state read (tiny; not one of the eliminated buckets).
-+    ggml_tensor * conv_states = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs);
-+    conv_states = ggml_reshape_3d(ctx0, conv_states, conv_kernel_size - 1, conv_channels, n_seqs);
-+    cb(conv_states, "conv_states_reshaped", il);
-+
-+    // Current token, native (non-transposed) qkv_mixed: [conv_channels, 1, n_seqs].
-+    ggml_tensor * x_cur = ggml_reshape_3d(ctx0, qkv_mixed, conv_channels, n_seq_tokens, n_seqs);
-+
-+    // In-place ring write-back target = the active sequences' conv-cache slot at kv_head, exactly the
-+    // destination the baseline ggml_cpy wrote to (s_slot == 0).
-+    const int64_t row_count = (conv_kernel_size - 1) * conv_channels;
-+    const size_t  row_size  = ggml_row_size(conv_states_all->type, row_count);
-+    ggml_tensor * conv_state_dst =
-+        ggml_view_2d(ctx0, conv_states_all, row_count, n_seqs, conv_states_all->nb[1], kv_head * row_size);
-+    cb(conv_state_dst, "conv_state_update", il);
-+
-+    ggml_tensor * conv_output =
-+        ggml_ssm_conv_update_inplace(ctx0, conv_states, conv_kernel, x_cur, conv_state_dst, /*fuse_silu=*/true);
-+    cb(conv_output, "conv_output_silu", il);
-+
-+    // the ring write is a side effect of the op; pull the op into the graph via the output
-+    ggml_build_forward_expand(gf, conv_output);
-+
-+    return conv_output; // [conv_channels, 1, n_seqs], already silu'd
-+}
-+
- // Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
- // gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
- // ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
-diff --git a/src/models/models.h b/src/models/models.h
-index 98b89e9..da0dd86 100644
---- a/src/models/models.h
-+++ b/src/models/models.h
-@@ -76,6 +76,20 @@ struct llm_build_delta_net_base : public llm_graph_context {
-             int64_t              conv_channels,
-             int                  il);
- 
-+    // Fused decode-time conv path (patch 0021). Replaces the concat + transpose + ssm_conv + silu +
-+    // copy-back chain with a single ggml_ssm_conv_update_inplace op that reads the cached K-1 taps and
-+    // the current token, computes the depthwise conv, folds silu, and writes the 1-token-shifted ring
-+    // state back in place. Decode-only (n_seq_tokens == 1, n_rs_seq == 0). Returns the silu'd conv
-+    // output: (conv_channels, 1, n_seqs). Bit-identical to the build_conv_state + ggml_ssm_conv chain.
-+    ggml_tensor * build_conv_state_fused(
-+            llm_graph_input_rs * inp,
-+            ggml_tensor *        conv_states_all,
-+            ggml_tensor *        qkv_mixed,
-+            ggml_tensor *        conv_kernel,
-+            int64_t              conv_kernel_size,
-+            int64_t              conv_channels,
-+            int                  il);
-+
-     // run delta-net attention and write the new recurrent state(s) back to ssm_states_all
-     // s: (head_v_dim, head_v_dim, num_v_heads, n_seqs); returns output: (head_v_dim, num_v_heads, n_seq_tokens, n_seqs)
-     ggml_tensor * build_recurrent_attn(
-diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
-index 0874c43..b6dcc5f 100644
---- a/src/models/qwen35.cpp
-+++ b/src/models/qwen35.cpp
-@@ -383,15 +383,26 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
-     const int64_t conv_kernel_size = conv_kernel->ne[0];
-     const int64_t conv_channels    = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- 
--    ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
-+    // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
-+    // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
-+    // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
-+    const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
-+
-+    ggml_tensor * conv_qkv_mix;
-+    if (conv_decode_fused) {
-+        conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
-+                conv_kernel_size, conv_channels, il);
-+    } else {
-+        ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- 
--    ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
--    cb(conv_output_proper, "conv_output_raw", il);
-+        ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-+        cb(conv_output_proper, "conv_output_raw", il);
- 
--    ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
--    cb(conv_output_silu, "conv_output_silu", il);
-+        ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-+        cb(conv_output_silu, "conv_output_silu", il);
- 
--    ggml_tensor * conv_qkv_mix = conv_output_silu;
-+        conv_qkv_mix = conv_output_silu;
-+    }
- 
-     // Calculate the total conv dimension
-     int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
-diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
-index 1f6f643..c7c7c44 100644
---- a/src/models/qwen35moe.cpp
-+++ b/src/models/qwen35moe.cpp
-@@ -407,15 +407,26 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
-     const int64_t conv_kernel_size = conv_kernel->ne[0];
-     const int64_t conv_channels    = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- 
--    ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
-+    // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
-+    // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
-+    // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
-+    const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
-+
-+    ggml_tensor * conv_qkv_mix;
-+    if (conv_decode_fused) {
-+        conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
-+                conv_kernel_size, conv_channels, il);
-+    } else {
-+        ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- 
-    ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-    cb(conv_output_proper, "conv_output_raw", il);
-+        ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-+        cb(conv_output_proper, "conv_output_raw", il);
- 
-    ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-    cb(conv_output_silu, "conv_output_silu", il);
-+        ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-+        cb(conv_output_silu, "conv_output_silu", il);
- 
-    ggml_tensor * conv_qkv_mix = conv_output_silu;
-+        conv_qkv_mix = conv_output_silu;
-+    }
- 
-     // Calculate the total conv dimension
-     int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
-diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
-index bfdf026..92749d1 100644
--- a/src/models/qwen3next.cpp
-+++ b/src/models/qwen3next.cpp
-@@ -434,19 +434,30 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
-     const int64_t conv_kernel_size = conv_kernel->ne[0];
-     const int64_t conv_channels    = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- 
-    ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
-+    // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
-+    // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
-+    // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
-+    const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
-+
-+    ggml_tensor * conv_qkv_mix;
-+    if (conv_decode_fused) {
-+        conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
-+                conv_kernel_size, conv_channels, il);
-+    } else {
-+        ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- 
-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
-    cb(state, "state_predelta", il);
-+        ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-+        cb(conv_output_proper, "conv_output_raw", il);
- 
-    ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
-    cb(conv_output_proper, "conv_output_raw", il);
-+        ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-+        cb(conv_output_silu, "conv_output_silu", il);
- 
-    ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
-    cb(conv_output_silu, "conv_output_silu", il);
-+        conv_qkv_mix = conv_output_silu;
-+    }
- 
-    ggml_tensor * conv_qkv_mix = conv_output_silu;
-+    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
-+    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
-+    cb(state, "state_predelta", il);
- 
-     // Calculate the total conv dimension
-     int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
-diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
-index 291c275..c7348d6 100644
--- a/tests/test-backend-ops.cpp
-+++ b/tests/test-backend-ops.cpp
-@@ -3748,6 +3748,43 @@ struct test_ssm_conv_bias_silu : public test_case {
-     }
- };
- 
-+// GGML_OP_SSM_CONV fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021).
-+// Validates the conv + silu output (dst) against the CPU reference across backends. The 1-token-
-+// shifted ring write-back to conv_state_dst is a side effect (validated end-to-end by the greedy
-+// md5 gate); here it just exercises the in-place write target as an op src.
-+struct test_ssm_conv_update : public test_case {
-+    const int64_t d_conv;
-+    const int64_t channels;
-+    const int64_t n_seqs;
-+
-+    std::string op_desc(ggml_tensor * t) override {
-+        GGML_UNUSED(t);
-+        return "SSM_CONV_UPDATE";
-+    }
-+
-+    std::string vars() override {
-+        return VARS_TO_STR3(d_conv, channels, n_seqs);
-+    }
-+
-+    test_ssm_conv_update(int64_t d_conv = 4, int64_t channels = 256, int64_t n_seqs = 4)
-+        : d_conv(d_conv), channels(channels), n_seqs(n_seqs) {}
-+
-+    ggml_tensor * build_graph(ggml_context * ctx) override {
-+        ggml_tensor * conv_states    = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_conv - 1, channels, n_seqs);
-+        ggml_tensor * conv_kernel    = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, channels);
-+        ggml_tensor * x_cur          = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
-+        ggml_tensor * conv_state_dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (d_conv - 1) * channels, n_seqs);
-+        ggml_set_name(conv_states, "conv_states");
-+        ggml_set_name(conv_kernel, "conv_kernel");
-+        ggml_set_name(x_cur, "x_cur");
-+        ggml_set_name(conv_state_dst, "conv_state_dst");
-+
-+        ggml_tensor * out = ggml_ssm_conv_update_inplace(ctx, conv_states, conv_kernel, x_cur, conv_state_dst, true);
-+        ggml_set_name(out, "out");
-+        return out;
-+    }
-+};
-+
- // GGML_OP_SSM_SCAN
- struct test_ssm_scan : public test_case {
-     const ggml_type type;
-@@ -8355,6 +8392,16 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
-         }
-     }
- 
-+    // fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021). channels must be
-+    // a multiple of 128 for the CUDA SSM_CONV supports_op gate.
-+    for (int64_t d_conv : {3, 4}) {
-+        for (int64_t channels : {256, 3328}) {
-+            for (int64_t n_seqs : {1, 4, 32, 128}) {
-+                test_cases.emplace_back(new test_ssm_conv_update(d_conv, channels, n_seqs));
-+            }
-+        }
-+    }
-+
-     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1, 1024, 1, 32, 4)); // Mamba-1
-     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 128, 64, 16, 2, 32, 4)); // Mamba-2
-     test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 256, 64,  8, 2, 32, 4)); // Falcon-H1
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch
@@ -1,403 +0,0 @@
-From 8a3229f41d5b712e87901796dfae3faee1f2f07d Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 20:32:55 +0200
-Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet decode
- occupancy/coalescing retune (patch 0022)
-
-Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
-kernel. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in
-float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth
-efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak),
-the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA
-order byte-identical (md5-gateable). The bf16-state plan stays shelved.
-
-Column folding: two new template params NUM_WARPS (default 4) and COLS_PER_WARP
-(default 1). Each warp now owns COLS_PER_WARP columns of the 128x128 recurrent
-state instead of 1, looping the existing per-column body over col, col+NUM_WARPS,
-... within a per-block column tile of NUM_WARPS*COLS_PER_WARP columns;
-grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The S_v rows of every column stay sharded
-across the lanes by the same strided i = r*warp_size + lane mapping, and every
-column's per-lane FMA accumulation and warp_reduce_sum butterfly are byte-for-byte
-unchanged; only the (warp,block)->column assignment and visit order differ, which a
-column's value provably does not depend on (columns are fully independent). This
-raises per-warp memory-level parallelism ~COLS_PER_WARP-fold (independent
-state-load bursts before any reduction + interleaved butterfly reductions hiding
-each other's shfl latency), covering more DRAM latency on this bandwidth-bound
-kernel. Every global access stays identically coalesced, so it is a scheduling /
-latency-coverage win, not a coalescing change. The forbidden float4 state load
-(which would repartition a lane to 4 contiguous rows and change the reduction
-grouping) is NOT done, so the md5 stays invariant. The S_v=128 tile is
-env-selectable (GDN_NW / GDN_CPW) for one-build re-tuning; default is the measured
-GB10 winner (16, 8).
-
-GB10 (CUDA 13, sm_121, nsys CUPTI timing - HW counters perm-blocked):
-gated_delta_net B=128 decode call (805.3 MB f32 R+W) 4.02 -> 3.49 ms/call,
-200.3 -> 230.9 GB/s = 73.4% -> 84.6% of 273 GB/s peak (now above vLLM's 82.4%;
-102.6% of vLLM's recurrence bandwidth). decode S_TG t/s (npp128 ntg128, -fa on):
-dense 27b npl128 335.9 -> 373.2 (+11.1%), npl32 199.2 -> 207.6 (+4.2%); MoE
-35b-a3b npl128 688.4 -> 745.7 (+8.3%), npl32 420.6 -> 440.0 (+4.6%). Prefill
-unchanged.
-
-Bit-exact: greedy --temp 0 --seed 1 md5 byte-identical to the 0021 baseline on
-both q36-27b-nvfp4 and q36-35b-a3b-nvfp4 (winner 16x8 and 4x1 control);
-test-backend-ops -o GATED_DELTA_NET 36/36 PASS.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/src/ggml-cuda/gated_delta_net.cu | 236 +++++++++++++++++---------
- 1 file changed, 157 insertions(+), 79 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
-index 86d5e2a..d071d5a 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
-+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
-@@ -1,6 +1,8 @@
- #include "gated_delta_net.cuh"
- #include "ggml-cuda/common.cuh"
- 
-+#include <cstdlib>
-+
- // Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
- // disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
- // destination slot by the recurrence kernel and are skipped here. One block per sequence.
-@@ -29,8 +31,22 @@ static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * i
-     gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
- }
- 
-template <int S_v, bool KDA, bool keep_rs_t>
-__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
-+// Occupancy/coalescing retune (patch 0022). Each warp owns COLS_PER_WARP columns of the recurrent
-+// state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a
-+// per-block column tile of size NUM_WARPS*COLS_PER_WARP. The S_v rows of every column stay sharded
-+// across the lanes by the SAME strided mapping i = r*warp_size + lane, and every column's per-lane
-+// FMA accumulation and warp_reduce_sum<warp_size> butterfly are byte-for-byte unchanged. Only the
-+// (warp,block)->column assignment and the order a warp visits its columns differ, and a column's
-+// f32 value provably does not depend on either (columns are fully independent: column c reads only
-+// its own S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). So the result
-+// and the stored final state are bit-identical to the COLS_PER_WARP==1 baseline (md5-gateable),
-+// while per-warp memory-level parallelism rises ~COLS_PER_WARP-fold (COLS_PER_WARP independent
-+// state-load bursts issued before any reduction, and the independent butterfly reductions interleave
-+// to hide each other's shfl latency) which covers more DRAM latency on this bandwidth-bound kernel.
-+// Every individual global access stays IDENTICALLY coalesced (32 consecutive lanes -> one 128B
-+// sector), so this is a latency-coverage / scheduling win, not a coalescing change.
-+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS = 4, int COLS_PER_WARP = 1, int MIN_BLOCKS = 2>
-+__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * NUM_WARPS, MIN_BLOCKS)
- gated_delta_net_cuda(const float * q,
-                                      const float * k,
-                                      const float * v,
-@@ -59,9 +75,9 @@ gated_delta_net_cuda(const float * q,
-                                      int           rs_head) {
-     const uint32_t h_idx    = blockIdx.x;
-     const uint32_t sequence = blockIdx.y;
-    // each warp owns one column, using warp-level primitives to reduce across rows
-+    // each warp owns COLS_PER_WARP columns, using warp-level primitives to reduce across rows.
-     const int      lane     = threadIdx.x;
-    const int      col      = blockIdx.z * blockDim.y + threadIdx.y;
-+    const int      col_base = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y;
- 
-     const uint32_t iq1 = fastmodulo(h_idx, neqk1_magic);
-     const uint32_t iq3 = fastdiv(sequence, rq3_magic);
-@@ -86,20 +102,25 @@ gated_delta_net_cuda(const float * q,
-     // writing the same slot per block (identity) is race-free.
-     const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
-         ? state_dst : curr_state;
-    read_state += state_in_offset + col * S_v;
-+    read_state += state_in_offset;
-     attn_data += (sequence * n_tokens * H + h_idx) * S_v;
- 
-     constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
-     static_assert(S_v % warp_size == 0, "S_v must be a multiple of warp_size");
-     constexpr int rows_per_lane = (S_v + warp_size - 1) / warp_size;
-    float         s_shard[rows_per_lane];
-    // state is stored transposed: M[col][i] = S[i][col], row col is contiguous
-+    // per-column register shard of the recurrent state; state is stored transposed: M[col][i] = S[i][col].
-+    float         s_shard[COLS_PER_WARP][rows_per_lane];
- 
-     ggml_cuda_pdl_sync();
- #pragma unroll
-    for (int r = 0; r < rows_per_lane; r++) {
-        const int i = r * warp_size + lane;
-        s_shard[r]  = read_state[i];
-+    for (int cc = 0; cc < COLS_PER_WARP; cc++) {
-+        const int     col = col_base + cc * NUM_WARPS;
-+        const float * rs  = read_state + col * S_v;
-+#pragma unroll
-+        for (int r = 0; r < rows_per_lane; r++) {
-+            const int i   = r * warp_size + lane;
-+            s_shard[cc][r] = rs[i];
-+        }
-     }
- 
-     for (int t = 0; t < n_tokens; t++) {
-@@ -113,7 +134,7 @@ gated_delta_net_cuda(const float * q,
- 
-         const float beta_val = *beta_t;
- 
-        // Cache k and q in registers
-+        // Cache k and q in registers (shared across the COLS_PER_WARP columns of this warp).
-         float k_reg[rows_per_lane];
-         float q_reg[rows_per_lane];
- #pragma unroll
-@@ -126,59 +147,69 @@ gated_delta_net_cuda(const float * q,
-         if constexpr (!KDA) {
-             const float g_val = expf(*g_t);
- 
-            // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
-            float kv_shard = 0.0f;
- #pragma unroll
-            for (int r = 0; r < rows_per_lane; r++) {
-                kv_shard += s_shard[r] * k_reg[r];
-            }
-            float kv_col = warp_reduce_sum<warp_size>(kv_shard);
-+            for (int cc = 0; cc < COLS_PER_WARP; cc++) {
-+                const int col = col_base + cc * NUM_WARPS;
- 
-            // delta[col] = (v[col] - g * kv[col]) * beta
-            float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
-+                // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
-+                float kv_shard = 0.0f;
-+#pragma unroll
-+                for (int r = 0; r < rows_per_lane; r++) {
-+                    kv_shard += s_shard[cc][r] * k_reg[r];
-+                }
-+                float kv_col = warp_reduce_sum<warp_size>(kv_shard);
- 
-            // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
-            // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
-            float attn_partial = 0.0f;
-+                // delta[col] = (v[col] - g * kv[col]) * beta
-+                float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
-+
-+                // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
-+                // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
-+                float attn_partial = 0.0f;
- #pragma unroll
-            for (int r = 0; r < rows_per_lane; r++) {
-                s_shard[r]  = g_val * s_shard[r] + k_reg[r] * delta_col;
-                attn_partial += s_shard[r] * q_reg[r];
-            }
-+                for (int r = 0; r < rows_per_lane; r++) {
-+                    s_shard[cc][r]  = g_val * s_shard[cc][r] + k_reg[r] * delta_col;
-+                    attn_partial += s_shard[cc][r] * q_reg[r];
-+                }
- 
-            float attn_col = warp_reduce_sum<warp_size>(attn_partial);
-+                float attn_col = warp_reduce_sum<warp_size>(attn_partial);
- 
-            if (lane == 0) {
-                attn_data[col] = attn_col * scale;
-+                if (lane == 0) {
-+                    attn_data[col] = attn_col * scale;
-+                }
-             }
-         } else {
-            // kv[col] = sum_i g[i] * S[i][col] * k[i]
-            float kv_shard = 0.0f;
- #pragma unroll
-            for (int r = 0; r < rows_per_lane; r++) {
-                const int i = r * warp_size + lane;
-                kv_shard += expf(g_t[i]) * s_shard[r] * k_reg[r];
-            }
-+            for (int cc = 0; cc < COLS_PER_WARP; cc++) {
-+                const int col = col_base + cc * NUM_WARPS;
-+
-+                // kv[col] = sum_i g[i] * S[i][col] * k[i]
-+                float kv_shard = 0.0f;
-+#pragma unroll
-+                for (int r = 0; r < rows_per_lane; r++) {
-+                    const int i = r * warp_size + lane;
-+                    kv_shard += expf(g_t[i]) * s_shard[cc][r] * k_reg[r];
-+                }
- 
-            float kv_col = warp_reduce_sum<warp_size>(kv_shard);
-+                float kv_col = warp_reduce_sum<warp_size>(kv_shard);
- 
-            // delta[col] = (v[col] - kv[col]) * beta
-            float delta_col = (v_t[col] - kv_col) * beta_val;
-+                // delta[col] = (v[col] - kv[col]) * beta
-+                float delta_col = (v_t[col] - kv_col) * beta_val;
- 
-            // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
-            // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
-            float attn_partial = 0.0f;
-+                // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
-+                // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
-+                float attn_partial = 0.0f;
- #pragma unroll
-            for (int r = 0; r < rows_per_lane; r++) {
-                const int i = r * warp_size + lane;
-                s_shard[r]  = expf(g_t[i]) * s_shard[r] + k_reg[r] * delta_col;
-                attn_partial += s_shard[r] * q_reg[r];
-            }
-+                for (int r = 0; r < rows_per_lane; r++) {
-+                    const int i = r * warp_size + lane;
-+                    s_shard[cc][r]  = expf(g_t[i]) * s_shard[cc][r] + k_reg[r] * delta_col;
-+                    attn_partial += s_shard[cc][r] * q_reg[r];
-+                }
- 
-            float attn_col = warp_reduce_sum<warp_size>(attn_partial);
-+                float attn_col = warp_reduce_sum<warp_size>(attn_partial);
- 
-            if (lane == 0) {
-                attn_data[col] = attn_col * scale;
-+                if (lane == 0) {
-+                    attn_data[col] = attn_col * scale;
-+                }
-             }
-         }
- 
-@@ -190,11 +221,15 @@ gated_delta_net_cuda(const float * q,
-             const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
-             const int target_slot = (int) n_tokens - 1 - t;
-             if (target_slot >= 0 && target_slot < K) {
-                float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
- #pragma unroll
-                for (int r = 0; r < rows_per_lane; r++) {
-                    const int i = r * warp_size + lane;
-                    curr_state[col * S_v + i] = s_shard[r];
-+                for (int cc = 0; cc < COLS_PER_WARP; cc++) {
-+                    const int col = col_base + cc * NUM_WARPS;
-+                    float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
-+#pragma unroll
-+                    for (int r = 0; r < rows_per_lane; r++) {
-+                        const int i = r * warp_size + lane;
-+                        curr_state[col * S_v + i] = s_shard[cc][r];
-+                    }
-                 }
-             }
-         }
-@@ -202,13 +237,48 @@ gated_delta_net_cuda(const float * q,
- 
-     if constexpr (!keep_rs_t) {
- #pragma unroll
-        for (int r = 0; r < rows_per_lane; r++) {
-            const int i          = r * warp_size + lane;
-            state[col * S_v + i] = s_shard[r];
-+        for (int cc = 0; cc < COLS_PER_WARP; cc++) {
-+            const int col = col_base + cc * NUM_WARPS;
-+#pragma unroll
-+            for (int r = 0; r < rows_per_lane; r++) {
-+                const int i          = r * warp_size + lane;
-+                state[col * S_v + i] = s_shard[cc][r];
-+            }
-         }
-     }
- }
- 
-+// Default column-folding tile for the S_v==128 decode/prefill path (the GDN head dim of this model).
-+// Measured winner of the bit-exact occupancy sweep (patch 0022). Override at runtime for the sweep
-+// via GDN_NW / GDN_CPW; all selectable variants are bit-identical, only %peak differs.
-+#ifndef GDN_DEFAULT_NW
-+#define GDN_DEFAULT_NW 16
-+#endif
-+#ifndef GDN_DEFAULT_CPW
-+#define GDN_DEFAULT_CPW 8
-+#endif
-+
-+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS, int COLS_PER_WARP, int MIN_BLOCKS>
-+static void launch_gdn_variant(
-+        const float * q_d, const float * k_d, const float * v_d,
-+        const float * g_d, const float * b_d, const float * s_d,
-+        float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head,
-+        int64_t H, int64_t n_tokens, int64_t n_seqs,
-+        int64_t sq1, int64_t sq2, int64_t sq3,
-+        int64_t sv1, int64_t sv2, int64_t sv3,
-+        int64_t sb1, int64_t sb2, int64_t sb3,
-+        const uint3 neqk1_magic, const uint3 rq3_magic,
-+        float scale, int K, int warp_size, cudaStream_t stream) {
-+    static_assert(S_v % (NUM_WARPS * COLS_PER_WARP) == 0, "NUM_WARPS*COLS_PER_WARP must divide S_v");
-+    dim3 grid_dims(H, n_seqs, S_v / (NUM_WARPS * COLS_PER_WARP));
-+    dim3 block_dims(warp_size <= S_v ? warp_size : S_v, NUM_WARPS, 1);
-+    const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
-+    ggml_cuda_kernel_launch(gated_delta_net_cuda<S_v, KDA, keep_rs_t, NUM_WARPS, COLS_PER_WARP, MIN_BLOCKS>, launch_params,
-+        q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-+        n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-+        sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-+}
-+
- template <bool KDA, bool keep_rs_t>
- static void launch_gated_delta_net(
-         const float * q_d, const float * k_d, const float * v_d,
-@@ -223,47 +293,55 @@ static void launch_gated_delta_net(
-         float scale, int K, cudaStream_t stream) {
-     //TODO: Add chunked kernel for even faster pre-fill
-     const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
-    const int num_warps = 4;
-    dim3      grid_dims(H, n_seqs, (S_v + num_warps - 1) / num_warps);
-    dim3      block_dims(warp_size <= S_v ? warp_size : S_v, num_warps, 1);
- 
-     const uint3 neqk1_magic = init_fastdiv_values(neqk1);
-     const uint3 rq3_magic   = init_fastdiv_values(rq3);
- 
-    int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
-+#define GDN_LAUNCH_ARGS \
-+        q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
-+        H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
-+        neqk1_magic, rq3_magic, scale, K, warp_size, stream
- 
-    const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
-     switch (S_v) {
-         case 16:
-            ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-+            launch_gdn_variant<16, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
-             break;
-         case 32:
-            ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-+            launch_gdn_variant<32, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
-             break;
-        case 64: {
-            ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-+        case 64:
-+            launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
-             break;
-        }
-         case 128: {
-            ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
-                q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
-                n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
-+            // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP columns per warp
-+            // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is
-+            // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every
-+            // selectable {num_warps, cols} is bit-identical, so the sweep cannot change the md5).
-+            static const int gdn_nw  = []{ const char * e = getenv("GDN_NW");  return e ? atoi(e) : GDN_DEFAULT_NW;  }();
-+            static const int gdn_cpw = []{ const char * e = getenv("GDN_CPW"); return e ? atoi(e) : GDN_DEFAULT_CPW; }();
-+            // NUM_WARPS in {4,8,16} x COLS_PER_WARP ladder (all <=512 threads/block, no 1024-thread
-+            // .minnctapersm warnings). Measured GB10 %peak: (4,1)=73 baseline ... (16,4)=82 ...
-+            // (16,8)=84.7 winner ~ tied with (8,8)/(8,16)/(32,4); the plateau is just above vLLM (82.4).
-+            if      (gdn_nw == 4  && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 4,  1, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 4  && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 4,  2, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 4  && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 4,  4, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 8  && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 8,  1, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 8  && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 8,  2, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 8  && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 8,  4, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 8  && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 8,  8, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 16 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 16, 1, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 16 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 16, 2, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 16 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 16, 4, 2>(GDN_LAUNCH_ARGS);
-+            else if (gdn_nw == 16 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 16, 8, 2>(GDN_LAUNCH_ARGS);
-+            else                                   launch_gdn_variant<128, KDA, keep_rs_t, GDN_DEFAULT_NW, GDN_DEFAULT_CPW, 2>(GDN_LAUNCH_ARGS);
-             break;
-         }
-         default:
-             GGML_ABORT("fatal error");
-             break;
-     }
-+
-+#undef GDN_LAUNCH_ARGS
- }
- 
- void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
-- 
-2.43.0
-
--- a/backend/cpp/llama-cpp/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch
@@ -1,144 +0,0 @@
-From f7409c2de2868a6a048d3c333329468b4cc9e483 Mon Sep 17 00:00:00 2001
-From: Ettore Di Giacinto <mudler@localai.io>
-Date: Thu, 25 Jun 2026 23:47:25 +0200
-Subject: [PATCH] feat(paged): qwen35moe NVFP4 activation-quantize de-dup
- (patch 0023)
-
-Bit-exact decode/prefill lever for the MoE (qwen3.5moe) NVFP4 path. ggml's
-mul_mat_id quantizes the EXPERT-GATHERED activation rows (ne11_flat =
-ne12*n_expert_used). For the broadcast up/gate projections (ne11 == 1) every
-expert of a token receives the SAME token activation, so the stock path
-re-quantizes each token n_expert_used times. quantize_mmq_nvfp4 produces each
-block as a pure per-thread function of its 16 consecutive inputs (no cross-thread
-reduction), so the gathered blocks are byte-identical across the experts.
-
-Lever: when ne11 == 1, quantize the ne12 UNIQUE token activations once, then
-gather the resulting block_fp4_mmq rows into the expert-gathered layout keyed by
-ids_src1 with a coalesced uint4 copy (block_fp4_mmq == 9 uint4 == 144 B). Pure
-byte copy of identical blocks, so the gathered buffer is byte-for-byte identical
-to re-quantizing each gathered row; the GEMM is untouched. down_proj
-(ne11 == n_expert_used, distinct per expert) keeps the stock path.
-
-Measured GB10 (sm_121a), on top of HEAD 8a3229f (patch 0022), q36-35b-a3b-nvfp4:
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), new
-  gather_mmq_fp4 +32 ms; net -379 ms of decode GPU-time.
- S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%; prefill T_PP -4%.
- Dense q36-27b-nvfp4 byte-flat (no mul_mat_id): 373.24 t/s unchanged.
-
-Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022):
-  q36-27b-nvfp4     5951a5b4d624ce891e22ab5fca9bc439 (unchanged)
-  q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (de-dup on == off)
-  test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805.
-
-On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores the stock re-quantize path.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
- ggml/src/ggml-cuda/mmq.cu       | 21 +++++++++++++++++--
- ggml/src/ggml-cuda/quantize.cu  | 37 +++++++++++++++++++++++++++++++++
- ggml/src/ggml-cuda/quantize.cuh |  4 ++++
- 3 files changed, 60 insertions(+), 2 deletions(-)
-
-diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
-index e1add5e..9933fa6 100644
---- a/ggml/src/ggml-cuda/mmq.cu
-+++ b/ggml/src/ggml-cuda/mmq.cu
-@@ -1,3 +1,4 @@
-+#include <cstdlib>
- #include "common.cuh"
- #include "mmq.cuh"
- #include "quantize.cuh"
-@@ -197,8 +198,24 @@ void ggml_cuda_mul_mat_q(
-         const int64_t s13 = src1->nb[3] / ts_src1;
- 
-         if (use_native_fp4) {
--            quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
--                                    ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
-+            // 0023: de-dup the broadcast (up/gate) quantize. ne11==1 means src1 is shared
-+            // across experts, so quantize the ne12 unique tokens once and gather the blocks.
-+            static const bool moe_quant_dedup = []{
-+                const char * e = getenv("GGML_CUDA_MOE_QUANT_DEDUP");
-+                return e ? atoi(e) != 0 : true;  // 0023: on by default; GGML_CUDA_MOE_QUANT_DEDUP=0 disables
-+            }();
-+            if (moe_quant_dedup && ne11 == 1) {
-+                const size_t nbytes_unique = ne12*ne10_padded * sizeof(block_q8_1)/QK8_1 +
-+                    get_mmq_x_max_host(cc)*sizeof(block_q8_1_mmq);
-+                ggml_cuda_pool_alloc<char> src1_unique(ctx.pool(), nbytes_unique);
-+                quantize_mmq_fp4_cuda(src1_d, nullptr, src1_unique.get(), src0->type, ne10, s12, 0, 0,
-+                                        ne10_padded, ne12, 1, 1, stream);
-+                gather_mmq_fp4_cuda(src1_unique.get(), ids_src1.get(), src1_q8_1.get(),
-+                                    ne11_flat, ne12, ne10_padded, stream);
-+            } else {
-+                quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
-+                                        ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
-+            }
-         } else {
-             quantize_mmq_q8_1_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
-                                    ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
-diff --git a/ggml/src/ggml-cuda/quantize.cu b/ggml/src/ggml-cuda/quantize.cu
-index 39a500a..a7fd86f 100644
---- a/ggml/src/ggml-cuda/quantize.cu
-+++ b/ggml/src/ggml-cuda/quantize.cu
-@@ -419,6 +419,43 @@ void quantize_mmq_q8_1_cuda(
-     }
- }
- 
-+// MoE NVFP4 quantize de-dup (0023): for the broadcast (up/gate) expert matmuls every
-+// gathered row references one of ne12 unique token activations, so the stock path
-+// re-quantizes each token n_expert_used times. Quantize the unique tokens once, then copy
-+// the resulting block_fp4_mmq rows into the expert-gathered layout keyed by ids. This is a
-+// pure byte copy of identical blocks => the gathered buffer is byte-identical to stock.
-+static __global__ void gather_mmq_fp4(
-+        const uint4 * __restrict__ unique, const int32_t * __restrict__ ids,
-+        uint4 * __restrict__ gathered, const int ne11_flat, const int ne12_unique,
-+        const int64_t total_words) {
-+    constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4)); // 9 uint4 per 144B block
-+    const int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
-+    if (t >= total_words) {
-+        return;
-+    }
-+    const int     w   = (int) (t % W);
-+    const int64_t ib  = t / W;                 // destination block index = kb*ne11_flat + j
-+    const int     j   = (int) (ib % ne11_flat);
-+    const int     kb  = (int) (ib / ne11_flat);
-+    const int     src = ids[j];
-+    const int64_t ib_u = (int64_t) kb * ne12_unique + src;
-+    gathered[t] = unique[ib_u * W + w];
-+}
-+
-+void gather_mmq_fp4_cuda(
-+        const void * unique, const int32_t * ids, void * gathered,
-+        int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded, cudaStream_t stream) {
-+    const int     blocks_per_col = (int) ((ne0_padded + QK_K - 1) / QK_K);
-+    constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4));
-+    const int64_t total_words = ne11_flat * (int64_t) blocks_per_col * W;
-+    const int     bs = 256;
-+    const dim3    block_size(bs, 1, 1);
-+    const dim3    num_blocks((unsigned) ((total_words + bs - 1) / bs), 1, 1);
-+    gather_mmq_fp4<<<num_blocks, block_size, 0, stream>>>(
-+        (const uint4 *) unique, ids, (uint4 *) gathered,
-+        (int) ne11_flat, (int) ne12_unique, total_words);
-+}
-+
- void quantize_mmq_fp4_cuda(
-         const float * x, const int32_t * ids, void * vy, const ggml_type type_src0,
-         const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
-diff --git a/ggml/src/ggml-cuda/quantize.cuh b/ggml/src/ggml-cuda/quantize.cuh
-index 768a3ae..7f64069 100644
---- a/ggml/src/ggml-cuda/quantize.cuh
-+++ b/ggml/src/ggml-cuda/quantize.cuh
-@@ -26,6 +26,10 @@ void quantize_mmq_q8_1_cuda(
-         ggml_type type_src0, int64_t ne00, int64_t s01, int64_t s02, int64_t s03,
-         int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3, cudaStream_t stream);
- 
-+void gather_mmq_fp4_cuda(const void * unique, const int32_t * ids, void * gathered,
-+                         int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded,
-+                         cudaStream_t stream);
-+
- void quantize_mmq_fp4_cuda(const float *   x,
-                              const int32_t * ids,
-                              void *          vy,
--- 
-2.43.0
-
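The bit-exactness argument in the 0023 commit message can be checked on the CPU with a small sketch. It assumes a toy quantizer (`toy_quantize`, one 16-byte block per 16 inputs), which is hypothetical and is not the real `block_fp4_mmq` layout; only the index math (`kb * n_rows + j` for the destination, `kb * n_unique + ids[j]` for the source) is taken from the `gather_mmq_fp4` kernel in the patch. All function names here are illustrative.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy sizes, NOT the real block_fp4_mmq layout: one 16-byte block per 16 inputs.
constexpr int kBlockElems = 16;
using Block = std::array<uint8_t, kBlockElems>;

// Toy quantizer: each block is a pure function of its own 16 inputs,
// which is the property the de-dup relies on.
static Block toy_quantize(const float * x) {
    Block b{};
    for (int i = 0; i < kBlockElems; ++i) {
        b[i] = (uint8_t) (int8_t) std::lround(x[i]);
    }
    return b;
}

// Quantize n_out rows of n_cols floats. Row j reads source row ids[j]
// (or row j itself when ids is null). Output layout: [kb * n_out + j].
static std::vector<Block> quantize_rows(const std::vector<float> & x, int n_cols,
                                        const int32_t * ids, int n_out) {
    assert(n_cols % kBlockElems == 0);
    const int blocks_per_col = n_cols / kBlockElems;
    std::vector<Block> out((size_t) blocks_per_col * n_out);
    for (int kb = 0; kb < blocks_per_col; ++kb) {
        for (int j = 0; j < n_out; ++j) {
            const int src = ids ? ids[j] : j;
            out[(size_t) kb * n_out + j] =
                toy_quantize(&x[(size_t) src * n_cols + (size_t) kb * kBlockElems]);
        }
    }
    return out;
}

// Plain copy of already-quantized blocks into the expert-gathered layout.
static std::vector<Block> gather_blocks(const std::vector<Block> & unique,
                                        const std::vector<int32_t> & ids,
                                        int n_unique, int blocks_per_col) {
    const int n_flat = (int) ids.size();
    std::vector<Block> out((size_t) blocks_per_col * n_flat);
    for (int kb = 0; kb < blocks_per_col; ++kb) {
        for (int j = 0; j < n_flat; ++j) {
            out[(size_t) kb * n_flat + j] = unique[(size_t) kb * n_unique + ids[j]];
        }
    }
    return out;
}

// True when quantize-once-then-gather equals re-quantizing every gathered row.
bool dedup_matches_stock() {
    const int n_unique = 3, n_cols = 32;
    const std::vector<int32_t> ids = {0, 0, 1, 1, 2, 2};  // two experts per token
    std::vector<float> x((size_t) n_unique * n_cols);
    for (size_t i = 0; i < x.size(); ++i) {
        x[i] = (float) ((i * 7) % 15) - 7.0f;
    }
    const auto stock  = quantize_rows(x, n_cols, ids.data(), (int) ids.size());
    const auto unique = quantize_rows(x, n_cols, nullptr, n_unique);
    const auto dedup  = gather_blocks(unique, ids, n_unique, n_cols / kBlockElems);
    return stock == dedup;
}
```

In this sketch the stock path calls the quantizer once per gathered row and block (12 calls), while the de-dup path calls it once per unique token and block (6 calls) and then copies. The equality holds for any quantizer that has no cross-block state, which is the condition the commit message states for `quantize_mmq_nvfp4`.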
--- a/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md
+++ b/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md
@@ -1,347 +0,0 @@
-# A.2 - CUDA-graphing the paged decode: measured lever + gap diagnosis
-
-Phase 1 (measure, do not punt). DGX GB10 (sm_121), dev tree `~/llama-paged-dev`
-HEAD 089f78d (patch 0017), `build-cuda`. Model `q36-27b-nvfp4.gguf` (dense),
-harness `llama-batched-bench`, fusion held OFF (`LLAMA_FUSE_NVFP4_QUANT=0`) for a
-clean stock-kernel baseline. `decode_agg` = the `S_TG t/s` column.
-
-## TL;DR verdict
-
-CUDA-graphing the paged decode is **NOT a real throughput lever** (ceiling well
-under 1%). The steady decode step is **GPU-compute-bound: 99.4-99.5% GPU-busy**.
-Total GPU idle is ~0.5-0.6% of the step, split into within-step launch gaps
-(0.37%, the only thing CUDA graphs remove) and a between-step host-loop gap
-(0.24%, one ~2 ms gap per step). Graphs already engage on the default paged
-decode and do collapse the launch gaps (0.37% -> 0.11%), but the GPU stays
-99.4-99.5% busy either way, so decode_agg is unchanged. The 2.6x gap to vLLM
-(148 vs 391) lives in the per-step GPU **kernel work** (FP4 GEMM + attention at
-batch 128), not in launch overhead or the host loop.
-
-The premise that "the paged decode runs eager (graphs reused=0)" did not survive
-measurement: at the benchmarked context the default paged decode captures and
-replays graphs exactly like stock non-paged. Two measurement traps (below)
-explain the earlier "reused=0 / gap-bound" reading.
-
-## Method note: a graph-enable trap that was corrected
-
-`GGML_CUDA_DISABLE_GRAPHS` is read with `getenv(...) != nullptr`
-(`ggml/src/ggml-cuda/common.cuh:1234`), so setting it to an **empty** string
-still disables graphs. A first 4-cell pass that used
-`GGML_CUDA_DISABLE_GRAPHS=""` for the "graphs ON" cells therefore ran graphs OFF
-in all four cells (an OFF-vs-OFF comparison). The numbers below ("v2") unset the
-variable with `env -u` for the ON cells. The `-lv 99` probe is unaffected (it
-never set the variable).
-
-## Step 1 - the 4-cell decode_agg table (corrected, graphs genuinely enabled)
-
-npp 128, ntg 128, npl 32 and 128, c 40960, b/ub 2048, fa on. `S_TG t/s`:
-
-| cell             | npl 32  | npl 128 |
-|------------------|---------|---------|
-| stock_graphon    | 116.47  | 148.41  |
-| stock_graphoff   | 115.17  | 148.21  |
-| paged_graphon    | 116.21  | 148.60  |
-| paged_graphoff   | 114.62  | 147.65  |
-
-ON vs OFF (the graph win):
-
-| config | npl 32 | npl 128 |
-|--------|--------|---------|
-| stock  | +1.13% | +0.13%  |
-| paged  | +1.39% | +0.64%  |
-
-- (a) Does STOCK get a graph win? Essentially no: +0.13% at npl 128, +1.13% at
-  npl 32 (small-batch, where per-kernel launch overhead is relatively larger).
-  All within run-to-run noise (~1% at npl 32, ~0.2% at npl 128).
-- (b) Does PAGED get a graph win? Same picture: +0.64% / +1.39%. Paged is NOT
-  eager at this config (see Step 2); it captures graphs like stock.
-- (c) LEVER SIZE (proxy = stock graph win, now genuinely measured): +0.13% at
-  npl 128, +1.1% at npl 32. Negligible vs the 2.6x (=+164%) gap to vLLM.
-
-All four cells sit at ~148 (npl 128) / ~115 (npl 32) within ~1%. The ~148 wall is
-shared by stock and paged; it is not paged-specific. Calibration cross-check
-(paged ON, ntg 64): 147.64, matching the reference 148-149.
-
-## Step 2 - why the "eager" premise is wrong, and what actually mutates
-
-CUDA-graph state machine (`ggml_backend_cuda_graph_compute` in
-`ggml/src/ggml-cuda/ggml-cuda.cu`): warmup completes after a step whose node
-properties did not change vs the previous step; any later change logs
-`CUDA graph warmup reset` and reverts to eager until stable again.
-`ggml_cuda_graph_update_required` memcmps every node's full `ggml_tensor` plus
-each src's `data` ptr / `ne` / `nb`.
-
-`-lv 99` probe, short context (npp 64, ntg 32, ctx <= 96):
-- stock:  `warmup complete` x2, `warmup reset` x0.
-- paged:  `warmup complete` x2, `warmup reset` x0.
-Both capture and then replay silently. The `CUDA Graph id N reused` line stays 0
-for both because llama rebuilds the cgraph each ubatch (new `cgraph->uid`), so
-the uid fast-path never fires; the graph is still replayed via the
-`instance != nullptr` path, which logs nothing. **"reused=0" is a false negative,
-not evidence of eager execution.** (Trap #1.)
-
-Cadence probe (npp 200, ntg 320, npl 4, ctx 200->520, crosses the 256 and 512
-token boundaries), counts over ~320 decode steps:
-
-| path                          | complete | reset | interpretation                |
-|-------------------------------|----------|-------|-------------------------------|
-| paged in-kernel (default)     | 10       | 8     | resets only at 256-boundaries |
-| paged gather (KV_PAGED_GATHER)| 0        | 0     | never captures -> pure eager  |
-| stock non-paged               | 10       | 8     | identical 256-cadence         |
-
-The 8 resets cluster at the two boundary crossings (timestamps ~9.9 s and ~34 s),
-not per-step. The default paged decode is therefore captured for ~97% of steps,
-re-warming only every ~256 tokens, with the **same cadence as stock**.
-
-What mutates (the block-table / gather input):
-- in-kernel decode (default): the block-table graph input
-  `idx = ggml_new_tensor_2d(ctx0, I32, n_view, n_stream)` with
-  `n_view = GGML_PAD(n_gather, 256)` (`src/paged-attn.cpp:199,213`). Its `ne[0]`
-  steps 256 -> 512 -> 768 only when the context crosses a 256-token boundary. The
-  kq_mask input (ne0 = n_kv, also padded to 256) steps in lockstep. So the
-  property change is per-256-tokens, not per-step.
-- gather fallback (`LLAMA_KV_PAGED_GATHER=1`, transposed-V, or prefill): the
-  index input `idx = ggml_new_tensor_2d(ctx0, I32, n_gather, n_stream)`
-  (`src/paged-attn.cpp:106`) has `ne[0] = n_gather` (UNPADDED), which grows every
-  step (the unit's own comment, `src/paged-attn.cpp:28-30`: "n_gather grows every
-  step"). That changes a node property every step, warmup never completes, and
-  the path runs pure eager. This is the only "graphs reused=0" path, and it is
-  not the default decode path.
-
-`LLAMA_KV_PAGED_DEBUG` dump at ctx 201 (first 2 decode calls, identical across
-the pair): `in-kernel decode n_stream=4 n_kv=256 n_gather=201` -> block-table
-`ne[0] = GGML_PAD(201,256) = 256`, stable until n_gather crosses 256.
-
-## Step 3 - where the step time goes (nsys), and a second trap
-
-npl 128, ntg 24, ctx 56 (< 256, so the ON run stays captured after warmup).
-Idle split by gap size: within-step launch gaps < 1 ms, between-step host gaps
->= 1 ms. Steady window = 40%-97% of the trace span (excludes model load / graph
-reserve / prefill one-offs).
-
-Trap #2: `nsys --trace=cuda` does NOT emit the kernels INSIDE a replayed CUDA
-graph into `cuda_gpu_trace` by default. The graphs-ON trace had only 15,279 GPU
-rows vs 84,946 for the identical OFF workload and reported a bogus 0.3% GPU-busy.
-Re-profiling the ON case with `--cuda-graph-trace=node` restored all 84,946 rows
-and 99.5% busy. **Any "decode is idle/gap-bound" reading taken from a graphs-ON
-nsys trace without `--cuda-graph-trace=node` overstates idle time as a tracing
-artifact** - the likely source of the earlier "freed GPU time became idle gaps" conclusion.
-
-Reliable steady-state numbers:
-
-| trace                          | GPU rows | busy   | within-step idle | between-step idle | host gap/step |
-|--------------------------------|----------|--------|------------------|-------------------|---------------|
-| OFF (eager)                    | 84,946   | 99.4%  | 0.37%            | 0.24%             | ~2.0 ms       |
-| ON (captured, node-trace)      | 84,946   | 99.5%  | 0.11%            | 0.38%             | ~1.9 ms       |
-
-- CUDA graphs replay (cudaGraphLaunch=46) and collapse the launch path: ON has
-  ~15k kernel launches/run vs OFF ~80k (cudaLaunchKernel 6,024 vs 31,764, plus
-  cudaLaunchKernelExC 9,049 vs 48,165), cutting within-step launch idle from 0.37% to 0.11%.
-- But the GPU is 99.4-99.5% busy in both, so decode_agg is unchanged.
-- Between-step host idle is one ~2 ms gap per decode step (the 128-way sample +
-  update_slots + batch build), 0.24-0.38% of the ~896 ms step.
-
-Per-step decomposition at npl 128: ~896 ms/step, of which ~890 ms is GPU kernel
-compute, ~2 ms host-loop gap, ~3 ms (eager) / ~1 ms (captured) launch gaps.
-
-## The load-bearing question, answered
-
-Within-step or between-step? **Neither is large.** The steady decode is 99.4%
-GPU-busy; the entire idle budget is ~0.6% of the step. CUDA graphs already remove
-the within-step launch fraction (0.37% -> 0.11%), and the between-step host gap is
-~2 ms/step (0.24%). A host-loop rewrite has no large gap to reclaim either:
-the host loop is currently **hidden under GPU compute** (the GPU stays
-busy while the host syncs/schedules). It would only become a lever once the
-kernels are fast enough to drop GPU-busy below the host time, i.e. it is a
-second-order floor, not the present bottleneck.
-
-## Verdict
-
-1. CUDA-graphing the paged decode is not the lever. Graphs already engage on the
-   default decode; capturing reduces within-step launch idle from 0.37% to 0.11%
-   but leaves the GPU 99.4-99.5% busy, so decode_agg moves by ~0% (measured
-   +0.1% to +0.6% at npl 128, +1.1% to +1.4% at npl 32, all within noise).
-2. The between-step host loop is not the present lever either (0.24%, ~2 ms/step,
-   hidden under GPU compute). It is the candidate floor only after the kernels
-   speed up.
-3. The decode is GPU-compute-bound at this NVFP4 fusion-OFF baseline. The 2.6x
-   gap to vLLM is in the per-step GPU kernel work (FP4 GEMM + attention at batch
-   128). That, not graphs and not the host loop, is the throughput lever.
-4. Corrected premises: paged is not perpetually eager (it captures with a
-   256-token reset cadence identical to stock); "graphs reused=0" was a uid
-   fast-path false negative; and a graphs-ON nsys trace under-counts GPU-busy
-   unless `--cuda-graph-trace=node` is set.
-
-No code patch in Phase 1 (graphs are not the lever, so there is no paged
-graph-capture patch to land). Evidence: `~/bench/a2_4cell_v2/`, `~/bench/a2_probe`,
-`~/bench/a2_probe2`, `~/bench/a2_nsys/*.nsys-rep` on the DGX.
-
-# Phase 2 - the real decode lever, located (per-kernel decomposition)
-
-Phase 1 ended on "decode is GPU-compute-bound; the 2.6x gap to vLLM lives in the
-per-step GPU kernel work (FP4 GEMM + attention at batch 128)." Phase 2 measured
-that per-step GPU work directly - per kernel and per memcpy, on the Phase-1 nsys
-`.sqlite` reps - and the "FP4 GEMM + attention" attribution does not survive the
-measurement. Two corrections, then the lever.
-
-The conditional Phase 2 fix (make the paged decode graph-capturable) is moot:
-Phase 1 already showed the default paged decode captures, and the fresh re-check
-below reconfirms the graph win is noise. Neither Phase 2 branch (within-step graph
-fix / between-step host loop) is the lever; the lever is a third thing, measured
-here.
-
-## Fresh re-confirmation: graphs are not the lever
-
-Independent run (npl128, ntg32, paged, fusion off), not reusing Phase 1's table:
-
-| paged decode  | S_TG t/s | vs vLLM 391 |
-|---------------|----------|-------------|
-| graphs ON     | 146.03   | 37.3%       |
-| graphs OFF    | 144.90   | 37.1%       |
-
-+0.78%, within noise - same verdict as Phase 1's 4-cell. The ON nsys rep is also
-99.5% busy with the same ~3267 ms of memcpy as OFF: graphs capture the memcpy
-nodes too, so they cannot remove either the copies or the compute.
-
-## Correction 1: the model is a hybrid SSM, not a plain transformer
-
-`q36-27b-nvfp4.gguf` has `general.architecture = qwen35` with
-`qwen35.ssm.{conv_kernel,state_size,group_count,time_step_rank,inner_size}`. The
-decode-window kernel cadence (per step, ~19.8 steps in the window) is 48
-`gated_delta_net_cuda` + 48 `ssm_conv_f32` vs 16 `flash_attn_tile`, i.e. **48
-gated-DeltaNet linear-attention layers : 16 full-attention layers** (a 3:1
-hybrid, Qwen3-Next family). Paged attention only touches the 16 full-attention
-layers.
-
-## Correction 2: the 99.4% "busy" is ~19% D2D memcpy, not compute
-
-Interval-union sweep over the steady decode window (last 17 s of the npl128/ntg24
-OFF rep; single CUDA stream; running-max-end so it is overlap-correct):
-
-| activity set           | GPU busy | idle  |
-|------------------------|----------|-------|
-| kernels only           | 80.2%    | 19.8% |
-| kernels + memcpy (all) | 99.4%    | 0.6%  |
-
-The 969 inter-kernel gaps (>=1 ms, ~48/step) that drop kernels-only to 80% are
-filled by **D2D memcpy: 1584 copies/run (~80/step), ~230 MB each, ~2 ms each,
-356 GB moved in 17 s**. At batch 128 a ~230 MB block is the gated-DeltaNet
-recurrent state; these are the per-SSM-layer state copies. (HtoD copies = the
-paged block-table/index upload: 731/run but only 3 ms total, negligible; DtoH
-47 ms.) Phase 1's `cuda_gpu_trace`-based 99.4% counted these memcpys as "busy"
-and lumped them into "GPU kernel compute" - they are memory movement, and they
-are addressable.
-
-## Decode GPU-time decomposition (% of kernel+memcpy busy)
-
-OFF/eager rep, steady window. `/step` = instances per decode step.
-
-| share | activity                          | /step | role                          |
-|-------|-----------------------------------|-------|-------------------------------|
-| 23.4% | gated_delta_net_cuda              | 48    | linear-attn recurrence        |
-| 21.9% | k_get_rows_float                  | 97    | SSM state / conv-state gather |
-| 18.9% | MEMCPY DtoD                       | 80    | SSM recurrent-state copy      |
-| 15.5% | mul_mat_vec_q (nvfp4, ncols=1)    | 48    | FP4 GEMV                      |
-| 10.4% | mul_mat_q (nvfp4)                 | 352   | FP4 GEMM                      |
-|  1.9% | quantize_mmq_nvfp4                | 448   | act requant for MMQ           |
-|  1.0% | concat_cont                       | 48    | SSM state glue                |
-|  0.8% | ssm_conv_f32                      | 48    | SSM short conv                |
-|  0.7% | unary_gated_op silu               | 112   | SSM gating                    |
-|  0.4% | flash_attn_tile/_ext              | 16    | FULL attention (paged)        |
-
-Grouped:
-- gated-DeltaNet / SSM machinery (recurrence + get_rows gather + DtoD state copy
-  + conv + gating glue): **~67% of decode**.
-- FP4 matmul (GEMV + GEMM + requant + stream-k fixup): **~28%**.
-- Full attention - everything paged attention optimizes: **~0.4%**.
-
-## Verdict and scope of the real lever
-
-1. CUDA graphs: not the lever (Phase 1, re-confirmed: +0.78%, noise). They capture
-   the memcpy too, so they cannot touch the copies or the compute.
-2. Host loop: not the lever (true host idle in the union is 0.24%, ~41 ms/17 s).
-3. FP4 GEMM: secondary, ~28%. Consistent with Track B P2a (making the FP4 GEMM 26%
-   faster left decode_agg flat) - it was never the long pole.
-4. Paged / full attention: ~0.4% of decode. **No paged-attention change (graphs,
-   block-table stabilization, gather rewrite) can move decode_agg on this model**
-   - it optimizes under half a percent of the step. This is the structural reason
-   A.2, and the paged-decode track generally, cannot close the vLLM gap on
-   q36-27b: the model barely uses the path being optimized.
-
-The throughput lever is the ggml **qwen35 gated-DeltaNet decode**. Per SSM layer
-per step it re-materializes and D2D-copies the full recurrent state (~230 MB at
-batch 128; ~80 copies/step, ~18 GB/step) and feeds the recurrence through ~2
-`get_rows` gathers, so ~64% of decode (18.9% state copy + 21.9% state gather + 23.4% recurrence) is
-SSM state plumbing. vLLM's gated-DeltaNet decode (the flash-linear-attention
-`fused_recurrent_gated_delta_rule` path) keeps the state in place and fuses the
-gather into the scan, avoiding both the per-layer D2D copy and the gathers.
-
-Next-step scope (the real lever, to be done in the ggml/llama qwen35 SSM path -
-not paged-attn, not a graph capture, not a block-table tweak):
-1. Eliminate the per-layer recurrent-state D2D copy: update the state tensor
-   in place (or double-buffer / write-back), so the recurrence consumes and
-   produces the persistent state without a full-state copy each layer each step.
-2. Fuse the `get_rows` state / conv-state gather into the recurrent kernel.
-
-Ceiling from this rep (upper bound; assumes the work is fully removed, not just
-overlapped):
-- remove the DtoD state copy: reclaim 18.9% -> ~146 to ~180 t/s.
-- remove copy + gather: reclaim ~41% -> ~146 to ~247 t/s, which puts llama within
-  ~1.6x of vLLM 391 with the FP4 GEMM still untouched.
-
-No code patch in Phase 2 either: the lever is a gated-DeltaNet decode rewrite in
-the SSM path, too large for this measurement pass and orthogonal to paged
-attention. `patches/paged/0018` stays free. Evidence on the DGX:
-`~/bench/a2_decompose/decode_decomp.txt` (per-kernel table + reproducing SQL in
-its header), `~/bench/a2_decompose/SUMMARY.txt`, and the Phase-1 reps
-`~/bench/a2_nsys/paged_off_npl128.sqlite` / `paged_on_npl128_node.sqlite`.
-
-# A.2 final synthesis - the four-point verdict
-
-All numbers measured on the DGX (GB10, sm_121, q36-27b-nvfp4 dense, fusion OFF,
-`decode_agg` = `S_TG t/s`), npl 128 unless noted.
-
-**1. CUDA-graph lever size (measured, not guessed).** +0.13% (4-cell, stock
-ON-vs-OFF) to +0.78% (fresh paged re-check) at npl 128; +1.1% to +1.4% at npl 32.
-All inside run-to-run noise. The earlier grounding GUESSED ~10-20% from a
-94.6%-busy reading; direct measurement puts the steady decode at 99.4-99.5% busy,
-so the real graph ceiling is < 1%, not 10-20%. The guess was wrong because the
-busy-fraction it rested on was under-read (a graphs-ON nsys trace under-counts
-GPU-busy unless `--cuda-graph-trace=node` is set - trap #2).
-
-**2. Was "paged decode runs eager" fixed, and what is the decode_agg win?**
-There was nothing to fix: the premise was false. At the benchmarked context the
-DEFAULT in-kernel paged decode already captures and replays graphs, with a
-256-token reset cadence identical to stock non-paged (10 complete / 8 reset over
-~320 steps, resets clustered only at the 256/512 token boundaries). "graphs
-reused=0" was a uid fast-path false negative, not eager execution (trap #1). The
-only genuinely-eager path is the `LLAMA_KV_PAGED_GATHER=1` fallback (unpadded
-index grows every step), which is not the default decode. Because graphs were
-already engaged, the decode_agg win from "enabling" them is ~0 (+0.1% to +0.8%).
-Graphs DID collapse within-step launch idle (0.37% -> 0.11%, ~80k -> ~15k
-launches/run), but the GPU stays 99.4-99.5% busy, so throughput is unchanged.
-
-**3. New llama %-of-vLLM @npl128.** Unchanged by A.2: 146-148.6 t/s vs vLLM 391 =
-**37.3-38.0%**. Graphs ON vs OFF both land here (146.03 / 144.90 in the fresh
-re-check; 148.41 / 148.21 in the 4-cell). A.2 did not move the percentage.
-
-**4. Honest verdict - did A.2 move toward parity; residual + next lever.** No.
-A.2 closed zero of the 2.6x gap, and it provably cannot on this model: paged /
-full attention is ~0.4% of decode (16 full-attention layers vs 48 gated-DeltaNet
-layers, a 3:1 hybrid SSM), so no graph / block-table / gather change to the paged
-path can move decode_agg. The residual gap is structural and lives elsewhere:
-~67% of decode is gated-DeltaNet / SSM state plumbing (23.4% recurrence + 21.9%
-get_rows state gather + 18.9% D2D recurrent-state copy of ~230 MB per SSM layer
-per step, ~18 GB/step), and ~28% is FP4 matmul (already shown secondary by Track
-B: a 26%-faster GEMM left decode_agg flat). The within-step launch loop is solved
-(graphs) and the between-step host loop is a 0.24% second-order floor hidden under
-GPU compute - neither is the residual.
-
-The next lever is NOT in this track. It is the ggml qwen35 gated-DeltaNet decode:
-(1) eliminate the per-layer recurrent-state D2D copy (in-place / double-buffer
-write-back), and (2) fuse the get_rows gather into the recurrent kernel - mirroring
-vLLM's `fused_recurrent_gated_delta_rule`, which keeps the state in place and
-fuses the gather. Measured ceiling on this rep: remove the copy -> ~146 to ~180
-t/s; remove copy + gather -> ~146 to ~247 t/s (within ~1.6x of vLLM with FP4 GEMM
-still untouched). That work is orthogonal to paged attention; `patches/paged/0018`
-stays free.
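The ceiling figures in the A.2 synthesis are consistent with a simple time-share model: if a share f of the step time is removed and the rest of the step is unchanged, throughput scales by 1 / (1 - f). The source does not state the formula it used, so treat this model as an assumption. The sketch below plugs in the 146.03 t/s re-check figure and the 18.9% (DtoD copy) and 21.9% (`k_get_rows_float`) shares from the decomposition table; the function names are illustrative.

```cpp
#include <cassert>

// Throughput if a share `removed_share` of the step time disappears and the
// remaining work is unchanged (assumed model; an upper bound, since it treats
// the work as fully removed rather than overlapped).
double ceiling_tps(double base_tps, double removed_share) {
    assert(removed_share >= 0.0 && removed_share < 1.0);
    return base_tps / (1.0 - removed_share);
}

// Remaining gap to a reference throughput, expressed as a ratio.
double gap_ratio(double reference_tps, double tps) {
    return reference_tps / tps;
}
```

With these inputs, `ceiling_tps(146.03, 0.189)` is about 180 t/s and `ceiling_tps(146.03, 0.189 + 0.219)` is about 247 t/s, and `gap_ratio(391.0, 246.7)` is just under 1.6, matching the quoted "~146 to ~180", "~146 to ~247" and "within ~1.6x" figures.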
--- a/backend/cpp/llama-cpp/patches/paged/ADDITIVE_DESIGN.md
+++ b/backend/cpp/llama-cpp/patches/paged/ADDITIVE_DESIGN.md
@@ -1,107 +0,0 @@
-# Additive layout for the paged-KV patch series - "hook, don't edit"
-
-Goal: ship paged KV as a vendored patch series that **survives llama.cpp pin bumps with
-minimal rebase pain**. PR #22569 (the upstream draft) was rejected by maintainers as
-"slop" and is far too invasive to vendor - it rewrites core attention. Our series must be
-the opposite: **additive**. This document is the design rule and the per-patch core-touch
-budget.
-
-## The rule
-
-> Every change is either (a) **new code in a new vendored file** under `src/`, or (b) a
-> **single, env-gated hook** at one call site in a core file that delegates to the new
-> file. No logic lives in a core file. No core struct/signature is edited.
-
-Why it works: a hook is a 1-3 line diff against a core file. When upstream churns that file,
-`git apply` either still lands the hook (context unchanged) or fails *only on that tiny
-hunk*, which is trivial to re-place. Logic embedded inside a core function (the PR #22569 /
-old-0003 approach) conflicts on every bump and must be re-understood each time.
-
-This is enforceable as a **core-touch budget**: each patch declares the core files it
-touches and the line count; review rejects anything that grows logic in core.
-
-## Why it's achievable here (grounded in the pinned source)
-
-The two seams paged KV needs are both already abstract in llama.cpp at the pin
-(`LLAMA_VERSION=f3e1828`), so new behavior plugs in without editing core types:
-
-- **KV placement** - `llama_kv_cache::find_slot` already returns a `slot_info` of physical
-  cell indices. Paged placement is just *different indices*. 0002 already does this as one
-  gated block (`if (paged_mode) { ... continue; }`, 41 lines, one file). Ideal.
-- **Graph inputs** - `llm_graph_input_i` is a pure-virtual base (`set_input()`), and
-  `llm_graph_result::add_input(llm_graph_input_ptr)` lets *any* code register a new input
-  subclass. So a paged graph input (the gather index) can be **a new class in a new file**,
-  added from a one-line hook - no edit to `llm_graph_input_attn_kv` or `llama-graph.h`.
-
-## Per-patch core-touch budget
-
-| # | Patch | New files (additive) | Core hooks (gated, minimal) | Core lines |
-|---|-------|----------------------|------------------------------|-----------:|
-| 0001 | vendor manager | `paged-kv-manager.{h,cpp}` | `CMakeLists.txt` +1 | 1 |
-| 0002 | block placement | - | one `if(paged_mode){...continue;}` in `find_slot` | ~41 |
-| 0003 | gather-read | `paged-attn.{h,cpp}` | `CMakeLists.txt` +1; **one** hook in `build_attn`; 2 tiny accessors on `llama_kv_cache_context` | ~8 |
-| 0004 | on-demand alloc | (uses 0001 manager) | one branch in `find_slot` calling the manager | ~10 |
-| 0005 | continuous batching | - | **LocalAI `grpc-server.cpp`** (already a LocalAI override, not a core patch) | 0 core |
-| 0006 | prefix caching | (uses 0001 manager) | one hash-lookup hook in the 0004 alloc branch | ~6 |
-
-Net core surface for the *entire* engine: `find_slot` (placement/alloc - where physical
-cells are already chosen) + **one** line in `build_attn` + two accessors. Everything else
-is new files or the LocalAI-side server loop.
-
-## 0003 redesigned to the rule (replaces the 4-file-surgery plan)
-
-The old `0003-gather-read-plan.md` edited `llama-kv-cache.{h,cpp}` + `llama-graph.{h,cpp}`
-(including a field added to `llm_graph_input_attn_kv` and fill logic in its `set_input`).
-The additive form removes the core-struct and core-`set_input` edits entirely:
-
-**New file `src/paged-attn.{h,cpp}`** holds *all* logic:
-- `class llm_graph_input_paged_gather : public llm_graph_input_i` - owns the `I32 [n_gather]`
-  gather-index tensor and a `const llama_kv_cache_context * mctx`. Its `set_input()` fills
-  the index with the sequence's used cells (`{ i in [0,n_kv) : !cells.is_empty(i) }`, the
-  same set the `kq_mask` keeps), in the canonical order.
-- `paged_attn::gather(ctx0, res, mctx, v_trans, &k, &v, &kq_mask)` - when paged is active,
-  constructs that input via `res->add_input(...)`, and applies `ggml_get_rows` to `k`, `v`,
-  and the transposed `kq_mask` by the shared index (mask: `transpose -> get_rows ->
-  transpose`). When not active it returns immediately -> **stock path byte-identical**.
-
-**Core hooks (the whole core diff for 0003):**
-1. `src/llama-graph.cpp`, in `build_attn` right before `build_attn_mha` (~line 2357):
-   ```cpp
-   paged_attn::gather(ctx0, res, mctx_cur, v_trans, &k, &v, &kq_mask); // no-op unless LLAMA_KV_PAGED
-   ```
-   One line. No new field on `llm_graph_input_attn_kv`; the gather input is a *separate*
-   registered input, so `llama-graph.h` is untouched.
-2. `src/llama-kv-cache.{h,cpp}`: two thin accessors on `llama_kv_cache_context` so the new
-   file can read the used-cell set without reaching into internals -
-   `uint32_t get_n_gather() const;` and `void get_gather_idxs(int32_t * dst) const;`
-   (delegate to `kv`/`sinfos[i_cur]`, mirroring the existing `get_n_kv` / `set_input_k_idxs`
-   pattern). ~8 lines total, no signature changes to existing methods.
-3. `src/CMakeLists.txt`: `+ paged-attn.cpp`.
-
-First cut: gate to **flash-attn + single-stream** (`GGML_ASSERT` otherwise) - the V-transposed
-(non-FA) and multi-stream gathers are a localized follow-up entirely inside `paged-attn.cpp`,
-no new core touch. Gate 0 stays the same: `diff` of greedy `llama-simple` output, stock vs
-`LLAMA_KV_PAGED=1`, must be identical (attention is permutation-invariant over the gathered
-KV set; `n_gather < n_kv` proves compaction, not identity).
-
-## Anti-drift practices (already in `README.md`, restated as policy)
-
-- **Stacking patches, one concern each**, exported 1:1 from a dev branch via
-  `git format-patch`. On a pin bump, rebase the branch; only the conflicting small patch
-  needs a touch, and the failure names the exact step.
-- **Default-off (`LLAMA_KV_PAGED`)** until each gate is green, so a partial series never
-  changes stock behavior - and the hooks compile to a no-op branch when the env is unset.
-- **Dev tree:** `git worktree add <dev> <LLAMA_VERSION>` off any checkout that has the pin
-  (e.g. the existing llama.cpp clone), `git apply` the series, develop the next patch as one
-  commit, re-export. (Set up and verified for this pin during this work.)
-
-## Status / next step
-
-- 0001, 0002: done, additive, verified token-identical.
-- 0003: **redesigned to the additive form above** (this doc). Dev tree at the pin with
-  0001+0002 applied is ready (`paged` branch). Remaining work is the focused
-  implement-and-verify block for `paged-attn.{h,cpp}` + the one `build_attn` hook, driven to
-  the token-identical Gate 0. That is a numerical-correctness task (mask/gather alignment,
-  FA-first), not a structural one - the structure is settled here.
-- 0004-0006: follow the budget above; 0005 lands in LocalAI's `grpc-server.cpp` (no core
-  patch at all).
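
The Gate 0 argument in the design note above (attention is unchanged when `k`, `v` and the mask are gathered by one shared index over the used cells) can be illustrated with a toy single-query softmax attention. This is a minimal pure-Python sketch, not the `paged_attn::gather` implementation: `attend` and `gather` are hypothetical helpers, the sizes and the `used` cell set are made up, and ggml tensors, flash-attn and multi-stream layouts are not modeled.

```python
import math
import random

NEG_INF = float("-inf")

def attend(q, keys, values, mask):
    """Single-query softmax attention; mask[i] is 0.0 (keep) or -inf (drop)."""
    scores = [sum(a * b for a, b in zip(q, k)) + m for k, m in zip(keys, mask)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]

def gather(rows, idxs):
    """Toy stand-in for a row gather: select rows by a shared index list."""
    return [rows[i] for i in idxs]

random.seed(0)
n_kv, dim = 8, 4                      # illustrative sizes only
q = [random.uniform(-1, 1) for _ in range(dim)]
keys = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_kv)]
values = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n_kv)]

used = [0, 2, 3, 6]                   # cells the mask keeps (hypothetical)
mask = [0.0 if i in used else NEG_INF for i in range(n_kv)]

# Stock path: attend over all n_kv cells, relying on the mask to drop unused ones.
full = attend(q, keys, values, mask)

# Gathered path: K, V and the mask are all indexed by the same list.
compact = attend(q, gather(keys, used), gather(values, used), gather(mask, used))

# Same set in a different order: equal up to floating-point summation order.
shuffled = list(reversed(used))
permuted = attend(q, gather(keys, shuffled), gather(values, shuffled), gather(mask, shuffled))

n_gather = len(used)
max_abs_diff = max(abs(a - b) for a, b in zip(full, compact))
max_abs_diff_permuted = max(abs(a - b) for a, b in zip(full, permuted))
print(n_gather, n_kv, max_abs_diff, max_abs_diff_permuted)
```

In this toy the gathered result matches the masked full-width result while `n_gather < n_kv`, which mirrors the note's point that a smaller gathered set shows compaction rather than an identity copy. A reordered gather can differ in the last bits because floating-point addition is not associative.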
--- a/backend/cpp/llama-cpp/patches/paged/BENCHMARK_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/BENCHMARK_PROGRESS.md
@@ -1,92 +0,0 @@
-# FINAL apples-to-apples NVFP4 benchmark (GB10 / DGX Spark) - CLEAN env, containers stopped
-# llama 0023 clean f7409c2 | LLAMA_KV_PAGED=1, LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget; beats stock 394s->142s TTFT@npl32), CUDA graphs ON, -c 131072 --parallel 128 -b 2048 -ub 512 -fa on
-# vLLM 0.23.0 | CUDA graphs ON (no enforce-eager), util 0.85, max-model-len 4096, max-num-seqs 256, tp1
-# client h2h_cli3.py: 512-tok UNIQUE-nonce prompt (fresh full prefill, defeats prefix caching), max_tokens=256, temp0, ignore_eos, stream+usage
-# llama restarts server PER NPL (paged-pool degrades after high-npl bursts); vllm one server/combo + npl8 re-check. 1 measured pass/npl + ptok8 graph warmup. peak_gb engine = PEAK-PRE.
-# started Fri Jun 26 04:43:38 AM CEST 2026 baseline=3.29 GB
-
-[2026-06-26 04:43:38] [dense_llama] ==== START dense_llama (llama) baseline_mem=3.29 ====
-[2026-06-26 04:43:38] [dense_llama] NPL=8 launching server PRE_GB=3.29
-[2026-06-26 04:43:48] [dense_llama] NPL=8 ready LOADED_GB=47.06
-[2026-06-26 04:43:55] [dense_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1.  **Analyze User Input:** The user says "The capital of France is". This is a straightforward factual question with a clear answer.\n2.  **Identify Key Entity:** France (country)\n3.  **Identify Question Type:** Capit
-[2026-06-26 04:44:30] [dense_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2040, "prompt_tok_total": 4195, "gen_per_req": 255.0, "agg_tps": 61.8, "decode_agg_tps": 82.5, "decode_perseq_tps": 9.57, "prefill_tps": 507.3, "ttft_mean_ms": 6038.1, "ttft_max_ms": 8270.0, "wall_s": 32.999}
-[2026-06-26 04:44:30] [dense_llama] NPL=8 PEAK_GB=53.51
-[2026-06-26 04:44:35] [dense_llama] NPL=8 server stopped mem=3.31
-[2026-06-26 04:44:35] [dense_llama] NPL=32 launching server PRE_GB=3.31
-[2026-06-26 04:44:40] [dense_llama] NPL=32 ready LOADED_GB=46.96
-[2026-06-26 04:47:55] [dense_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8180, "prompt_tok_total": 16900, "gen_per_req": 255.6, "agg_tps": 43.2, "decode_agg_tps": 192.6, "decode_perseq_tps": 4.79, "prefill_tps": 115.0, "ttft_mean_ms": 133551.7, "ttft_max_ms": 147007.0, "wall_s": 189.49}
-[2026-06-26 04:47:55] [dense_llama] NPL=32 PEAK_GB=69.63
-[2026-06-26 04:48:01] [dense_llama] NPL=32 server stopped mem=3.32
-[2026-06-26 04:48:01] [dense_llama] NPL=64 launching server PRE_GB=3.32
-[2026-06-26 04:48:11] [dense_llama] NPL=64 ready LOADED_GB=46.97
-[2026-06-26 04:55:10] [dense_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16382, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 39.8, "decode_agg_tps": 277.8, "decode_perseq_tps": 3.09, "prefill_tps": 95.9, "ttft_mean_ms": 321618.8, "ttft_max_ms": 352633.6, "wall_s": 411.603}
-[2026-06-26 04:55:10] [dense_llama] NPL=64 PEAK_GB=83.96
-[2026-06-26 04:55:16] [dense_llama] NPL=64 server stopped mem=3.30
-[2026-06-26 04:55:16] [dense_llama] NPL=128 launching server PRE_GB=3.30
-[2026-06-26 04:55:21] [dense_llama] NPL=128 ready LOADED_GB=47.09
-[2026-06-26 05:13:18] [dense_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32767, "prompt_tok_total": 67969, "gen_per_req": 256.0, "agg_tps": 30.9, "decode_agg_tps": 384.6, "decode_perseq_tps": 1.86, "prefill_tps": 69.7, "ttft_mean_ms": 902762.7, "ttft_max_ms": 975832.6, "wall_s": 1061.031}
-[2026-06-26 05:13:18] [dense_llama] NPL=128 PEAK_GB=93.82
-[2026-06-26 05:13:25] [dense_llama] NPL=128 server stopped mem=3.31
-[2026-06-26 05:13:25] [dense_llama] ==== DONE dense_llama POST_GB=3.31 ====
-[2026-06-26 05:13:25] [dense_vllm] ==== START dense_vllm (vllm) baseline_mem=3.31 ====
-[2026-06-26 05:13:25] [dense_vllm] launching vllm PRE_GB=3.31
-[2026-06-26 05:21:15] [dense_vllm] vllm ready LOADED_GB=110.48
-[2026-06-26 05:21:27] [dense_vllm] GATE='Here\'s a thinking process:\n\n1.  **Analyze User Input:** The user says "The capital of France is"\n2.  **Identify Key Entity/Question:** The question is asking for the capital city of France.\n3.  **Retrieve Knowledge:** I know from general knowledge that t
-[2026-06-26 05:21:59] [dense_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1959, "prompt_tok_total": 4195, "gen_per_req": 244.9, "agg_tps": 65.6, "decode_agg_tps": 70.4, "decode_perseq_tps": 8.76, "prefill_tps": 2096.2, "ttft_mean_ms": 1861.1, "ttft_max_ms": 2000.6, "wall_s": 29.843}
-[2026-06-26 05:21:59] [dense_vllm] NPL=8 PEAK_GB=110.92
-[2026-06-26 05:22:47] [dense_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8165, "prompt_tok_total": 16900, "gen_per_req": 255.2, "agg_tps": 176.3, "decode_agg_tps": 211.8, "decode_perseq_tps": 6.28, "prefill_tps": 2182.6, "ttft_mean_ms": 5353.2, "ttft_max_ms": 7741.4, "wall_s": 46.302}
-[2026-06-26 05:22:47] [dense_vllm] NPL=32 PEAK_GB=110.87
-[2026-06-26 05:23:59] [dense_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16314, "prompt_tok_total": 33828, "gen_per_req": 254.9, "agg_tps": 236.5, "decode_agg_tps": 309.1, "decode_perseq_tps": 4.38, "prefill_tps": 2088.9, "ttft_mean_ms": 9512.4, "ttft_max_ms": 16191.0, "wall_s": 68.976}
-[2026-06-26 05:23:59] [dense_vllm] NPL=64 PEAK_GB=110.88
-[2026-06-26 05:25:57] [dense_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32640, "prompt_tok_total": 67969, "gen_per_req": 255.0, "agg_tps": 288.4, "decode_agg_tps": 418.8, "decode_perseq_tps": 2.79, "prefill_tps": 1929.1, "ttft_mean_ms": 18449.5, "ttft_max_ms": 35227.7, "wall_s": 113.162}
-[2026-06-26 05:25:57] [dense_vllm] NPL=128 PEAK_GB=110.95
-[2026-06-26 05:26:27] [dense_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 2044, "prompt_tok_total": 4187, "gen_per_req": 255.5, "agg_tps": 68.1, "decode_agg_tps": 73.4, "decode_perseq_tps": 9.07, "prefill_tps": 1921.9, "ttft_mean_ms": 1877.6, "ttft_max_ms": 2178.1, "wall_s": 30.018}
-[2026-06-26 05:26:35] [dense_vllm] ==== DONE dense_vllm POST_GB=3.53 ====
-[2026-06-26 05:26:35] [moe_llama] ==== START moe_llama (llama) baseline_mem=3.53 ====
-[2026-06-26 05:26:35] [moe_llama] NPL=8 launching server PRE_GB=3.53
-[2026-06-26 05:26:50] [moe_llama] NPL=8 ready LOADED_GB=36.42
-[2026-06-26 05:26:52] [moe_llama] GATE=' |REASON:Here\'s a thinking process:\n\n1.  **Analyze User Input:**\n   - User says: "The capital of France is"\n   - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2.  **Identify Key Information:*
-[2026-06-26 05:27:06] [moe_llama] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 2048, "prompt_tok_total": 4195, "gen_per_req": 256.0, "agg_tps": 156.8, "decode_agg_tps": 211.8, "decode_perseq_tps": 24.45, "prefill_tps": 1236.4, "ttft_mean_ms": 2477.1, "ttft_max_ms": 3392.9, "wall_s": 13.061}
-[2026-06-26 05:27:06] [moe_llama] NPL=8 PEAK_GB=39.66
-[2026-06-26 05:27:11] [moe_llama] NPL=8 server stopped mem=3.34
-[2026-06-26 05:27:11] [moe_llama] NPL=32 launching server PRE_GB=3.34
-[2026-06-26 05:27:16] [moe_llama] NPL=32 ready LOADED_GB=36.54
-[2026-06-26 05:27:54] [moe_llama] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 8192, "prompt_tok_total": 16900, "gen_per_req": 256.0, "agg_tps": 235.6, "decode_agg_tps": 393.0, "decode_perseq_tps": 10.02, "prefill_tps": 1213.9, "ttft_mean_ms": 8225.2, "ttft_max_ms": 13921.9, "wall_s": 34.768}
-[2026-06-26 05:27:54] [moe_llama] NPL=32 PEAK_GB=47.11
-[2026-06-26 05:28:00] [moe_llama] NPL=32 server stopped mem=3.30
-[2026-06-26 05:28:00] [moe_llama] NPL=64 launching server PRE_GB=3.30
-[2026-06-26 05:28:05] [moe_llama] NPL=64 ready LOADED_GB=36.39
-[2026-06-26 05:29:10] [moe_llama] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 16384, "prompt_tok_total": 33828, "gen_per_req": 256.0, "agg_tps": 271.0, "decode_agg_tps": 527.0, "decode_perseq_tps": 6.15, "prefill_tps": 1152.3, "ttft_mean_ms": 15849.5, "ttft_max_ms": 29356.9, "wall_s": 60.449}
-[2026-06-26 05:29:10] [moe_llama] NPL=64 PEAK_GB=57.13
-[2026-06-26 05:29:16] [moe_llama] NPL=64 server stopped mem=3.28
-[2026-06-26 05:29:16] [moe_llama] NPL=128 launching server PRE_GB=3.28
-[2026-06-26 05:29:21] [moe_llama] NPL=128 ready LOADED_GB=36.48
-[2026-06-26 05:34:19] [moe_llama] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 32760, "prompt_tok_total": 67969, "gen_per_req": 255.9, "agg_tps": 112.7, "decode_agg_tps": 726.4, "decode_perseq_tps": 3.73, "prefill_tps": 276.8, "ttft_mean_ms": 213017.2, "ttft_max_ms": 245528.7, "wall_s": 290.634}
-[2026-06-26 05:34:19] [moe_llama] NPL=128 PEAK_GB=61.51
-[2026-06-26 05:34:25] [moe_llama] NPL=128 server stopped mem=3.28
-[2026-06-26 05:34:25] [moe_llama] ==== DONE moe_llama POST_GB=3.28 ====
-[2026-06-26 05:34:25] [moe_vllm] ==== START moe_vllm (vllm) baseline_mem=3.28 ====
-[2026-06-26 05:34:25] [moe_vllm] launching vllm PRE_GB=3.28
-[2026-06-26 05:39:35] [moe_vllm] vllm ready LOADED_GB=109.46
-[2026-06-26 05:39:38] [moe_vllm] GATE='Here\'s a thinking process:\n\n1.  **Analyze User Input:**\n   - User says: "The capital of France is"\n   - This is a straightforward factual question, incomplete but clearly asking for the capital city of France.\n\n2.  **Identify Key Information:**\n   - C
-[2026-06-26 05:39:47] [moe_vllm] NPL=8 PASS=1 {"n": 8, "reqs": 8, "gen_total": 1900, "prompt_tok_total": 4195, "gen_per_req": 237.5, "agg_tps": 231.2, "decode_agg_tps": 256.5, "decode_perseq_tps": 31.84, "prefill_tps": 5186.5, "ttft_mean_ms": 768.8, "ttft_max_ms": 808.2, "wall_s": 8.217}
-[2026-06-26 05:39:47] [moe_vllm] NPL=8 PEAK_GB=109.62
-[2026-06-26 05:40:07] [moe_vllm] NPL=32 PASS=1 {"n": 32, "reqs": 32, "gen_total": 7794, "prompt_tok_total": 16900, "gen_per_req": 243.6, "agg_tps": 426.4, "decode_agg_tps": 500.8, "decode_perseq_tps": 14.9, "prefill_tps": 6223.4, "ttft_mean_ms": 1830.4, "ttft_max_ms": 2714.2, "wall_s": 18.28}
-[2026-06-26 05:40:07] [moe_vllm] NPL=32 PEAK_GB=109.63
-[2026-06-26 05:40:37] [moe_vllm] NPL=64 PASS=1 {"n": 64, "reqs": 64, "gen_total": 15927, "prompt_tok_total": 33828, "gen_per_req": 248.9, "agg_tps": 550.7, "decode_agg_tps": 686.1, "decode_perseq_tps": 9.83, "prefill_tps": 5926.5, "ttft_mean_ms": 3224.4, "ttft_max_ms": 5704.9, "wall_s": 28.92}
-[2026-06-26 05:40:37] [moe_vllm] NPL=64 PEAK_GB=109.63
-[2026-06-26 05:41:27] [moe_vllm] NPL=128 PASS=1 {"n": 128, "reqs": 128, "gen_total": 31795, "prompt_tok_total": 67969, "gen_per_req": 248.4, "agg_tps": 650.7, "decode_agg_tps": 882.2, "decode_perseq_tps": 6.05, "prefill_tps": 5300.5, "ttft_mean_ms": 6487.7, "ttft_max_ms": 12817.8, "wall_s": 48.863}
-[2026-06-26 05:41:27] [moe_vllm] NPL=128 PEAK_GB=109.64
-[2026-06-26 05:41:36] [moe_vllm] RECHECK_NPL8 {"n": 8, "reqs": 8, "gen_total": 1702, "prompt_tok_total": 4187, "gen_per_req": 212.8, "agg_tps": 207.2, "decode_agg_tps": 226.4, "decode_perseq_tps": 28.06, "prefill_tps": 6021.3, "ttft_mean_ms": 642.7, "ttft_max_ms": 694.8, "wall_s": 8.213}
-[2026-06-26 05:41:44] [moe_vllm] ==== DONE moe_vllm POST_GB=3.31 ====
-
-==== ALL 16 ROWS COLLECTED (2 models x 2 engines x 4 npl) ====
-decode_agg t/s (llama | vLLM | llama%vLLM):
- DENSE q36-27b-nvfp4:  npl8 82.5|70.4|117%  npl32 192.6|211.8|91%  npl64 277.8|309.1|90%  npl128 384.6|418.8|92%
- MoE   q36-35b-a3b:    npl8 211.8|256.5|83%  npl32 393.0|500.8|78%  npl64 527.0|686.1|77%  npl128 726.4|882.2|82%
-peak_gb (llama on-demand grows | vLLM fixed ~107 pool):
- DENSE llama 53.5->93.8 ; vLLM ~110.9 flat
- MoE   llama 39.7->61.5 ; vLLM ~109.6 flat
-Final CSV: final_benchmark.csv ; analysis: QWEN36_NVFP4_BENCH.md (FINAL section).
-Cleanup: no leftover server/bench PIDs; GPU free (memnow 3.28 GB); local-ai + local-ai-worker
-containers restarted (host returned). DONE.
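
The `llama%vLLM` column in the summary above can be recomputed from the `decode_agg_tps` values in the `PASS=1` rows. The following is a small sketch for checking those percentages; the dictionary layout and the helper name `llama_pct_of_vllm` are illustrative and not part of the benchmark harness.

```python
# decode_agg_tps copied from the PASS=1 rows of the log: model -> engine -> npl -> tok/s
decode_agg_tps = {
    "dense": {
        "llama": {8: 82.5, 32: 192.6, 64: 277.8, 128: 384.6},
        "vllm": {8: 70.4, 32: 211.8, 64: 309.1, 128: 418.8},
    },
    "moe": {
        "llama": {8: 211.8, 32: 393.0, 64: 527.0, 128: 726.4},
        "vllm": {8: 256.5, 32: 500.8, 64: 686.1, 128: 882.2},
    },
}

def llama_pct_of_vllm(model):
    """Return {npl: llama decode throughput as a rounded percentage of vLLM}."""
    rows = decode_agg_tps[model]
    return {
        npl: round(100 * rows["llama"][npl] / rows["vllm"][npl])
        for npl in rows["llama"]
    }

dense_pct = llama_pct_of_vllm("dense")
moe_pct = llama_pct_of_vllm("moe")
print("dense", dense_pct)
print("moe", moe_pct)
```

Only aggregate decode throughput is covered here; the prefill and TTFT columns in the log would need their own comparison.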
--- a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PLAN.md
+++ b/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PLAN.md
@@ -1,628 +0,0 @@
-# bf16 SSM-state cache: BUILD PLAN (PART C synthesis - hand this to the build agent)
-
-Status: READ-ONLY design. Lands ON TOP of patch 0021 (conv-state in-place fusion, building
-concurrently on the GPU). DEFAULT = bf16 SSM recurrent state, f32 opt-out. This PART C is the
-executive build brief: ordered edits, acceptance gate, bench targets, semantics/back-compat/risk
-register, and the de-risk-first item. PART A (cparams wiring), PART B (kernel/op plumbing) and the
-Appendix (upstream precedent + numeric safety) below are the detailed reference each step points into.
-
-The decision (settled by GDN_RECURRENCE_BYTE_GATE.md): the gated-DeltaNet recurrence is the dominant
-decode kernel (51.6% of the step, 805 MB f32 state R+W/call at 74% of GB10 peak BW) and is ALREADY
-single-pass (measured re-stream ~1.0x, hard-capped <=1.33x). The whole ~2x DRAM gap vs vLLM is purely
-f32(llama) vs bf16(vLLM) state-cache WIDTH, not extra passes. Narrowing the persisted SSM state to
-bf16 (load->f32, recurrence math in f32 UNCHANGED, store->bf16) halves the dominant term and reaches
-vLLM parity-to-ahead. vLLM's own GDN state cache is bf16, so this is a fair equal-precision change.
-
-## C.0 Synthesis decisions that OVERRIDE the per-part text
-
-1. v1 ships `type_s` = BF16 (SSM recurrent state, the 805 MB lever) and KEEPS `type_r` = F32 (conv
-   state). Reason: `ggml_concat` at prefill (`build_conv_state`, delta-net-base.cpp:472) requires
-   same-type operands; a bf16 conv cache breaks the f32 `qkv_mixed` concat. Conv state is ~12.6 MB
-   (launch-bound, ~0 ms byte benefit), so keeping it f32 costs nothing. This OVERRIDES PART A §3a/§3b,
-   which set BOTH defaults to BF16: in v1 set the `type_r` / `cache_type_conv` DEFAULT to
-   `GGML_TYPE_F32`. `type_r`=bf16 is a v2 follow-up (needs an f32 staging view before the prefill
-   concat - PART B §B.6).
-2. Keep ALL transient/scratch tensors f32: the GDN op OUTPUT scratch (ggml.c:6327), the 0019 gather
-   scratch, and the keep_rs_t prefill snapshot. ONLY the PERSISTED cache rows narrow to bf16 (the
-   src[5] read view and the src[6] in-place write view).
-3. The gate REPLACES the bit-exact md5 gate for the bf16 default: bf16 is intentionally non-bit-exact
-   vs llama f32 (it is equal precision to vLLM's bf16). The 0018/0019 md5 gate STILL applies to (a)
-   patch 0021's conv fusion and (b) verifying the f32 opt-out path is byte-identical to the pre-bf16
-   f32 baseline.
-
-## C.1 Ordered file-by-file edit list (build order, on top of 0021)
-
-Order is dependency- and de-risk-driven: prove the kernel dtype-correct in ISOLATION before flipping
-any default. Section refs point into PART A / PART B below.
-
-STEP 1 - kernel + op made dtype-generic (the load/store conversion), validated standalone:
-- 1a `ggml/src/ggml.c` - relax the F32-only state asserts to {F32,BF16} in the 3 GDN builders:
-  `ggml_gated_delta_net` (~6308), `_inplace` (~6370), `_inplace_ids` (~6430), on `state` and
-  `src_state_dst`. KEEP the op OUTPUT scratch F32 (6327). [PART B §B.2]
-- 1b `ggml/src/ggml-cuda/ggml-cuda.cu` - `supports_op` `GGML_OP_GATED_DELTA_NET` (~3096): permit a
-  BF16 `src[5]`/`src[6]`. [PART B §B.3]
-- 1c `ggml/src/ggml-cuda/gated_delta_net.cu` - template kernel+gather+launch on `bool STATE_BF16`;
-  `#include <cuda_bf16.h>`. LOAD `__bfloat162float` (~102), STORE `__float2bfloat16` (~207), GATHER
-  bf16->f32 scratch (~20). Cast `src_state`/`src_state_dst` pointers to `nv_bfloat16` on bf16; relax
-  dispatcher asserts (309-311) `sizeof(float)` -> `ggml_type_size(type)`. Keep gather scratch +
-  keep_rs_t snapshot f32. ALL recurrence math (106-200) UNCHANGED in f32 registers. [PART B §B.4,§B.8]
-- 1d `ggml/src/ggml-cpu/ops.cpp` - matching bf16 load/store branch in the GDN reference (10726/10744/
-  10891 load via `GGML_BF16_TO_FP32`, 10758-10762 store via `GGML_FP32_TO_BF16`); relax `nb[]` asserts
-  to `ggml_type_size(type)`. [PART B §B.5]
-- 1e `tests/test-backend-ops.cpp` - add a BF16-state `GATED_DELTA_NET` case covering BOTH `n_tokens==1`
-  decode AND a multi-token (prefill/chunk) + `keep_rs_t==true` path, CUDA bf16 vs CPU bf16 reference.
-  THIS IS THE DE-RISK GATE for Step 1 (see C.5). Build + pass before Step 2.
-
-STEP 2 - cparams selection wiring (llama.cpp core):
-- 2a `include/llama.h` (after :366) - add `enum ggml_type type_s;` and `type_r;` adjacent to
-  `type_k`/`type_v`, marked `[EXPERIMENTAL]`. [PART A §3a]
-- 2b `src/llama-context.cpp:3468` (`llama_context_default_params`) - add `/*.type_s =*/ GGML_TYPE_BF16,`
-  and `/*.type_r =*/ GGML_TYPE_F32,`. THIS IS THE DEFAULT CHANGE (type_r stays F32 per C.0). [PART A §3a]
-- 2c `src/llama-memory.h:19` (`struct llama_memory_params`) - add `ggml_type type_r;` and `type_s;`.
-  [PART A §3a]
-- 2d `src/llama-context.cpp:325` (`params_mem` init) - pass `params.type_r` / `params.type_s`. [PART A §3a]
-- 2e `src/llama-model.cpp` - replace the 3 hardcoded `GGML_TYPE_F32` pairs (2056-57 recurrent, 2098-99
-  hybrid_iswa, 2117-18 hybrid = the qwen35/qwen35moe path) with `params.type_r` / `params.type_s`.
-  [PART A §2/§3a]
-
-STEP 3 - back-compat for saved recurrent state (REQUIRED, the default flips):
-- 3a `src/llama-memory-recurrent.cpp` `state_read_data` - on `s_type_i_ref != live type` with both in
-  {F32,BF16}, CONVERT row-by-row during load instead of returning false (same for `r`). Bump the
-  recurrent state-file version. [PART A §5, option A]
-
-STEP 4 - CLI / llama-server surface (needed by the gate harness):
-- 4a `common/common.h:566` region - `cache_type_ssm = GGML_TYPE_BF16;` and
-  `cache_type_conv = GGML_TYPE_F32;` (conv default F32 per C.0). [PART A §3b]
-- 4b `common/common.cpp:1589` region - `cparams.type_s = params.cache_type_ssm;` and
-  `cparams.type_r = params.cache_type_conv;`. [PART A §3b]
-- 4c `common/arg.cpp` (after :2074) - add `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv`
-  via the existing `kv_cache_type_from_str` (arg.cpp:402); confirm `bf16` -> `GGML_TYPE_BF16`. The C.2
-  harness depends on `--cache-type-ssm {f32,bf16}`. [PART A §3b]
-
-STEP 5 - LocalAI gRPC / YAML (force f32 from model config):
-- 5a `backend/backend.proto` - `string CacheTypeSSM` / `CacheTypeConv` (next free tags after 64);
-  regen proto. [PART A §3c]
-- 5b `backend/cpp/llama-cpp/grpc-server.cpp:504` region - `params.cache_type_ssm =
-  kv_cache_type_from_str(request->cachetypessm());` + conv. [PART A §3c]
-- 5c `core/config/model_config.go:935` - `CacheTypeSSM`/`CacheTypeConv` yaml fields. [PART A §3c]
-- 5d `core/backend/options.go:247` - map into the request. [PART A §3c]
-- 5e `core/config/meta/registry.go` + `build_test.go` - register `cache_type_ssm`/`cache_type_conv`
-  as static fields (gate). [PART A §3c]
-
-STEP 6 - capability fallback (heterogeneous / CPU-offload safety):
-- 6a `src/llama-context.cpp:518-595` - an `auto_fgdn`-style device-match probe: if a participating
-  device lacks the bf16 GDN load/store specialization (CPU-offloaded GDN layer, non-GB10 backend),
-  demote `type_s` to F32 BEFORE alloc and log once. [PART A §4]
-
-## C.2 Acceptance gate (REPLACES the bit-exact md5 gate)
-
-bf16 is intentionally non-bit-exact, so the 0018/0019 md5 byte-equality gate does NOT apply to the
-bf16 default. The gate is teacher-forced KL-divergence + PPL-delta + greedy coherence + a
-long-context drift sweep, vs the SAME model run f32. All commands on `dgx.casa` (DO NOT run during
-this design - GPU busy). Binaries `~/llama-paged-dev/build*/bin`; models `~/bench/q36-27b-nvfp4.gguf`
-(dense) and `~/bench/q36-35b-a3b-nvfp4.gguf` (MoE); scratch `~/bench/klgate`.
-
-Why teacher-forced (not self-greedy): a self-greedy decode lets each precision pick its own argmax,
-so after the first divergence the contexts differ and per-token logits are no longer comparable (you
-measure trajectory divergence, not numeric error). `llama-perplexity --kl-divergence` feeds both
-precisions the IDENTICAL token stream and compares output distributions position-by-position; the
-greedy trajectory is validated SEPARATELY by the Same-top-p metric + a coherence read.
-
-Corpus (one-time): wikitext-2 raw test (~280k tokens) into `~/bench/klgate`. KL mode needs
->= 2*n_ctx tokens; any fixed >=8k-token UTF-8 file works as long as base AND test share it.
-
-256-token headline gate (per model; shown for dense):
-```
-M=~/bench/q36-27b-nvfp4.gguf; F=~/bench/klgate/wikitext-2-raw/wiki.test.raw; D=~/bench/klgate
-COMMON="-m $M -f $F -c 256 -b 256 -ngl 99 -fa on --seed 1 --chunks 32"
-# (a) f32 BASE: reference logits + f32 PPL
-llama-perplexity $COMMON --cache-type-ssm f32  --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32.c256.base.log
-# (b) bf16 TEST: KL(bf16||f32) + bf16 PPL + Same-top-p
-llama-perplexity $COMMON --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.bf16.c256.kl.log
-```
-Noise floor (run FIRST, mandatory - GPU reductions are not bit-deterministic, so KLD has a non-zero
-floor; bf16 is judged against BOTH the absolute threshold AND this floor):
-```
-llama-perplexity $COMMON --cache-type-ssm f32 --kl-divergence --kl-divergence-base $D/q27.f32.c256.kld | tee $D/q27.f32f32.floor.log
-```
-Record `Mean KLD_floor` and `Same-top-p_floor` (expect KLD ~1e-6..1e-5, top-p ~100%).
-
-Coherence spot-check (greedy trajectory, reuses the 0018/0019 `--temp 0 --seed 1` convention):
-```
-P="Explain how a transformer language model generates text, step by step."
-for T in f32 bf16; do llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm $T 2>/dev/null > $D/q27.greedy.$T.txt; done
-diff $D/q27.greedy.f32.txt $D/q27.greedy.bf16.txt && echo "GREEDY BYTE-IDENTICAL"
-```
-Long-context drift sweep (verifies the g<1 decay bound: bf16 state-rounding error must stay FLAT, not
-accumulate, as context grows - the GDN state spans the whole window):
-```
-for C in 256 1024 2048 4096; do
-  CMN="-m $M -f $F -c $C -b $C -ngl 99 -fa on --seed 1 --chunks 8"
-  llama-perplexity $CMN --cache-type-ssm f32  --kl-divergence-base $D/q27.f32.c$C.kld >/dev/null
-  llama-perplexity $CMN --cache-type-ssm bf16 --kl-divergence --kl-divergence-base $D/q27.f32.c$C.kld | tee $D/q27.bf16.c$C.kl.log
-done
-```
-f32 opt-out verification (the safety valve must actually select f32 and reproduce the committed f32
-greedy md5 from 0018/0019 - the bf16 default must NOT change the f32-path output):
-```
-llama-cli -m $M -ngl 99 -fa on --temp 0 --seed 1 -n 256 -p "$P" --cache-type-ssm f32 2>/dev/null | md5sum  # == 0018/0019 f32 baseline md5
-```
-Repeat the WHOLE gate verbatim for the MoE model (`M=~/bench/q36-35b-a3b-nvfp4.gguf`).
-
-PASS/FAIL (bf16 ships as DEFAULT only if ALL rows pass for BOTH dense and MoE):
-
-| metric | source | PASS threshold |
-|---|---|---|
-| Mean KLD | 256-gate (b) | **< 1e-3 nats** (hard, the brief) |
-| Mean KLD vs floor | (b) vs floor | <= ~5x `Mean KLD_floor` (bounded signal, not pure noise) |
-| Same top p | (b) | **>= 99.5%** (100% => greedy byte-identical to f32) |
-| PPL-delta `ln(PPL_bf16/PPL_f32)` | (a)+(b) | **abs < 0.005** (PPL within +-0.5%) |
-| Max / 99.9% KLD | (b) | report; flag if Max > 0.05 (tail outliers) |
-| Coherence | greedy | fluent + on-topic; byte-identical if Same-top-p=100% |
-| Long-context drift | sweep | MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0% |
-
-If any row fails for a model: keep THAT model on f32 (gallery YAML `cache_type_ssm: f32`) while the
-global default stays bf16; the cparams f32 fallback is the safety valve. MoE has fewer GDN layers
-(31 vs 48) and smaller per-head state (H_v=32 vs 48), so expected KLD <= dense; same thresholds.
-Same-top-p is the bridge to the old md5 harness: at 100% the bf16 greedy output is byte-identical to
-f32 and the 0018/0019 md5 gate would still pass - the strongest possible non-bit-exact result.
-
-## C.3 Bench targets + nsys confirmation
-
-Dense q36-27b-nvfp4 (48 GDN layers, S_v=128, H_v=48), npl128, GB10/sm_121, graphs-OFF
-apples-to-apples (the measured baseline):
-- Recurrence per call: 3.98 ms (f32, 805 MB R+W, 74% peak) -> **~2.0-3.0 ms** (bf16, ~413 MB R+W).
-  2.0 ms = 74% peak retained; 3.0 ms = conservative 50% peak on the smaller footprint.
-- Recurrence per step: 191 ms -> ~96-143 ms (save ~48-95 ms).
-- Step time: 384 ms -> **289-339 ms**.
-- Decode throughput: ~335 -> **360-443 tok/s** = parity-to-ahead of vLLM (327 ms / 391 tok/s).
-
-MoE q36-35b-a3b-nvfp4 (31 GDN layers, H_v=32): state per (seq,layer) = 128*128*32*4 = 2.0 MiB f32 ->
-per-call R+W ~537 MB f32 -> ~268 MB bf16. Fewer layers + smaller state => smaller ABSOLUTE recurrence
-savings, and MoE decode is more GEMM-bound (the `MUL_MAT_ID` expert path), so the bf16-state win is a
-smaller FRACTION of the MoE step. Target: a measurable per-call halving of the GDN recurrence time
-with the C.2 KL gate passing; no absolute MoE step target is asserted here (the MoE step is
-MUL_MAT_ID-dominated, a separate lever from this one).
-
-nsys confirmation (the measurement that proves the lever landed):
-```
-GGML_CUDA_DISABLE_GRAPHS=1 nsys profile -o ssmbf16 --force-overwrite true \
-  llama-batched-bench -m $M -npp 8 -ntg 12 -npl 128 -ub 2048
-nsys stats --report cuda_gpu_kern_sum ssmbf16.nsys-rep | grep -i gated_delta_net
-```
-Confirm: `gated_delta_net_cuda` mean duration/call drops 3.98 -> 2.0-3.0 ms; step time + tok/s land in
-the 289-339 ms / 360-443 tok/s band; the f32 opt-out reproduces the 3.98 ms f32 call. The gate is the
-JOINT condition: per-call speed in band AND KL<1e-3 - neither alone ships bf16.
-
-## C.4 Default / opt-out semantics, back-compat, risk register
-
-Semantics:
-- DEFAULT `type_s` = `GGML_TYPE_BF16` (SSM recurrent state). `type_r` = `GGML_TYPE_F32` in v1 (conv
-  state; bf16 is v2). This is the INVERSE of KV (KV is opt-IN to compression at F16 default; SSM is
-  opt-OUT to f32).
-- Opt-out: `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML) -> bit-exact f32
-  recurrence. Per-model opt-out lives in gallery YAML if a model fails the gate; the global default
-  stays bf16.
-- Silent capability fallback: the C.1 STEP 6 device-match probe demotes `type_s` to F32 before alloc
-  on devices lacking the bf16 GDN specialization (CPU offload / non-GB10) and logs once.
-
-Back-compat (the ONE real breakage): `llama-memory-recurrent.cpp` serializes the per-layer state
-dtype and HARD-matches on restore (mismatch -> `"mismatched s type"` -> returns false). The f32->bf16
-default flip makes OLD f32-saved sessions fail to restore against a bf16 build. Fix = STEP 3a: convert
-row-by-row on mismatch (both in {F32,BF16}) + bump the recurrent state-file version. KV never hit this
-because `type_k`/`type_v` were EXPERIMENTAL and never default-changed; the SSM default FLIP is what
-forces the convert/version work.
-
-Risk register:
-- **R1 numeric drift (KL gate fails).** Likelihood LOW: g<1 geometric decay contracts per-step bf16
-  rounding to a bounded series (~`eps/(1-exp(g_mean))`), f32 registers confine rounding to one
-  per-step cache write, and vLLM ships this exact config in production. Mitigation: C.2 gate +
-  per-model f32 opt-out + global f32 fallback.
-- **R2 prefill / keep_rs_t / gather state path (the silent-corruption landmine).** The conversion
-  points are documented for DECODE; the SAME kernel also runs the chunked prefill path, the keep_rs_t
-  snapshot (writes to f32 scratch while the cache is bf16), and the 0019 gather (reads bf16 cache ->
-  f32 scratch). A dtype mistake on any of these corrupts the state at the prefill->decode handoff and
-  surfaces ONLY as long-context drift, which a decode-only 256-token gate can mask. Mitigation: STEP
-  1e test-backend-ops MUST cover the multi-token prefill + keep_rs_t==true path, not just decode; the
-  C.2 long-context sweep is the second net. (This is C.5, the single biggest risk.)
-- **R3 MoE MUL_MAT_ID path.** The GDN recurrence op is IDENTICAL for dense and MoE; the MoE expert
-  GEMM (`MUL_MAT_ID`) does NOT touch the SSM state, so bf16-state is orthogonal to the expert path.
-  Residual risk: `qwen35moe` `build_recurrent_attn` must route the same bf16 state view (it shares
-  delta-net-base.cpp). Mitigation: run the full C.2 gate on the MoE model; the test-backend-ops case
-  is arch-agnostic.
-- **R4 conv-state coupling with patch 0021.** Flipping `type_r` to bf16 breaks `ggml_concat` at
-  prefill (different types). Mitigation: v1 keeps `type_r`=F32 (C.0); `type_r`=bf16 deferred to v2
-  with an f32 staging view (PART B §B.6).
-- **R5 back-compat restore failure.** Mitigation: STEP 3a convert + version bump (above).
-
-## C.5 Single biggest risk + how the build agent de-risks it FIRST
-
-Single biggest risk: **R2 - silent state corruption on the NON-decode state paths** (chunked prefill,
-the keep_rs_t snapshot, the 0019 gather). The 805 MB measurement and every conversion-point in the
-cheat-sheet describe the STEADY decode path (`n_tokens==1`, `!keep_rs_t`). But the bf16 cache is ALSO
-read/written by the multi-token prefill path and the prefill/rollback snapshot (which targets f32
-scratch while the cache is bf16). A dtype bug there does not crash and barely moves the 256-token
-decode md5; it corrupts the recurrent state at the prefill->decode boundary and shows up ONLY as
-long-context drift - exactly the failure a quick gate misses.
-
-De-risk FIRST (before ANY default flip or wiring): implement STEP 1 (kernel + op dtype-generic) and
-STEP 1e (test-backend-ops) ONLY, then prove the kernel is dtype-correct in ISOLATION by forcing a
-bf16 state allocation behind a temporary debug flag and running test-backend-ops with a case that
-exercises (a) single-token decode, (b) a multi-token prefill chunk, and (c) `keep_rs_t==true`,
-comparing CUDA bf16 against the CPU bf16 reference AND against the f32 path within tolerance. Only
-after that case is GREEN does the build agent proceed to STEP 2 (flip the default) and the C.2
-model-level gate. This decouples kernel dtype-correctness from the cparams wiring, so a Step-1 bug is
-caught by a deterministic op test in minutes instead of as a fuzzy long-context regression after the
-full stack is wired.
-
---
-
-# bf16 SSM state cache — cparams wiring (DEFAULT bf16 + f32 opt-out)
-
-Label: cparams-default-fallback (READ-ONLY design). Mirrors the KV-cache `type_k`/`type_v`
-precision plumbing exactly. Designed against HEAD-after-patch-0021 (conv-state in-place fusion).
-
-This is lever (2) of GDN_RECURRENCE_BYTE_GATE.md: the recurrent SSM state cache is the dominant
-decode byte stream (805 MB R+W/call, 51.6% of step, single-pass f32 = at the BW floor). The whole
-~2x DRAM gap vs vLLM is f32(llama) vs bf16(vLLM) state width. Storing the persisted state in bf16
-(load→f32, recurrence math in f32 UNCHANGED, store→bf16) halves the dominant term. vLLM's GDN state
-cache is bf16, so bf16-default is the fair equal-precision comparison → make it the DEFAULT.
-
---
-
-## 1. The KV-cache template we mirror (exact chain for type_k / type_v)
-
-```
-CLI   common/arg.cpp:2052     -ctk/--cache-type-k TYPE → params.cache_type_k
-                              (common_params, common/common.h:566, default GGML_TYPE_F16)
-  ↓
-glue  common/common.cpp:1589  cparams.type_k = params.cache_type_k   (cparams = llama_context_params)
-  ↓
-API   include/llama.h:365     llama_context_params.type_k  // [EXPERIMENTAL]
-      llama-context.cpp:3468  default in llama_context_default_params() = GGML_TYPE_F16
-  ↓
-mem   llama-context.cpp:326   llama_memory_params params_mem.type_k = params.type_k
-      llama-memory.h:19       struct llama_memory_params { ggml_type type_k; type_v; ... }
-  ↓
-alloc llama-model.cpp:2030    create_memory(params_mem, cparams) → KV cache uses params.type_k
-```
-
-Key facts:
- `type_k`/`type_v` are NOT stored in `struct llama_cparams` (src/llama-cparams.h). They ride in
-  `llama_context_params` → `llama_memory_params` and are consumed directly at cache-alloc time.
-  We mirror that: NO new `llama_cparams` field is needed.
- KV default is opt-IN to compression (F16 default, pass `-ctk q8_0` to shrink). SSM is the INVERSE:
-  bf16 DEFAULT, pass an explicit `f32` to opt out / restore bit-exactness.
-
-## 2. Where the SSM state type is currently hardcoded (the targets)
-
-The recurrent cache constructor already accepts the types — only the model hardcodes F32:
-
- `src/llama-memory-recurrent.cpp:22-23` ctor params `ggml_type type_r, type_s`
-  - `r_l` (line 100, `n_embd_r`) = short conv state  → `type_r` (TINY: conv_width-1 taps × conv_dim)
-  - `s_l` (line 101, `n_embd_s`) = SSM recurrent state → `type_s` (THE 805 MB/call dominant)
- `src/llama-memory-hybrid.h:32-33` ctor params `type_r, type_s` (qwen35 / qwen35moe path)
- Hardcoded `GGML_TYPE_F32` call sites in `src/llama-model.cpp::create_memory`:
-  - 2056-2057  `llama_memory_recurrent(...)`            (pure recurrent arches)
-  - 2098-2099  `llama_memory_hybrid_iswa(...)`          recurrent_type_r / recurrent_type_s
-  - 2117-2118  `llama_memory_hybrid(...)`               recurrent_type_k / recurrent_type_v (mislabeled; they are r/s)
-
-Note: `qwen35` / `qwen35moe` are HYBRID (filter_attn/filter_recr, no SWA) → they take the
-`llama_memory_hybrid` branch (2108-2118). That is the call site that matters for the parity push.
-
-## 3. New plumbing (parallel chain `type_s` / `type_r`)
-
-### 3a. Public API + cparams glue (llama.cpp side)
-
-| File | Change |
-|------|--------|
-| `include/llama.h` (after :366) | Add `enum ggml_type type_s; // data type for recurrent SSM state cache [EXPERIMENTAL]` and `enum ggml_type type_r; // data type for recurrent conv state cache [EXPERIMENTAL]`. Place adjacent to `type_k`/`type_v`. |
-| `src/llama-context.cpp:3468` (default params) | Add `/*.type_s =*/ GGML_TYPE_BF16,` and `/*.type_r =*/ GGML_TYPE_BF16,`. **This is the DEFAULT change.** |
-| `src/llama-memory.h:19` (`struct llama_memory_params`) | Add `ggml_type type_r;` and `ggml_type type_s;` next to `type_k`/`type_v`. |
-| `src/llama-context.cpp:325` (`params_mem` init) | Add `/*.type_r =*/ params.type_r,` and `/*.type_s =*/ params.type_s,`. |
-| `src/llama-model.cpp` 2056-57 / 2098-99 / 2117-18 | Replace the 3 hardcoded `GGML_TYPE_F32` pairs with `params.type_r` / `params.type_s`. |
-
-### 3b. CLI / llama-server (common side)
-
-| File | Change |
-|------|--------|
-| `common/common.h:566` region | Add `ggml_type cache_type_ssm = GGML_TYPE_BF16;` and `ggml_type cache_type_conv = GGML_TYPE_BF16;` (mirror `cache_type_k/v`; note the DEFAULT is BF16, not F16). |
-| `common/common.cpp:1589` region | Add `cparams.type_s = params.cache_type_ssm;` and `cparams.type_r = params.cache_type_conv;`. |
-| `common/arg.cpp` (after :2074) | Add `--cache-type-ssm TYPE` (`-ctssm`) → `params.cache_type_ssm = kv_cache_type_from_str(value)`, and `--cache-type-conv TYPE` (`-ctconv`). Reuse the existing `kv_cache_type_from_str` (arg.cpp:402). Help text: "recurrent SSM state cache type (default bf16; pass f32 for bit-exact recurrence)". |
-
-`kv_cache_type_from_str` is expected to accept `f32`/`bf16`/`f16` already, so no change should be
-needed; confirm that `bf16` maps to `GGML_TYPE_BF16` and add the case if it is absent.
-
-### 3c. LocalAI gRPC backend (so users can force f32 from model YAML)
-
-Mirror `CacheTypeKey` exactly:
-
-| File | Change |
-|------|--------|
-| `backend/backend.proto:419` region | Add `string CacheTypeSSM = NN;` and `string CacheTypeConv = NN;` (next free field tags). Regenerate proto. |
-| `backend/cpp/llama-cpp/grpc-server.cpp:504` region | `if (!request->cachetypessm().empty()) params.cache_type_ssm = kv_cache_type_from_str(request->cachetypessm());` and the conv equivalent. (grpc-server already has its own `kv_cache_type_from_str`; ensure it knows `bf16`.) |
-| `core/config/model_config.go:935` region | Add `CacheTypeSSM string yaml:"cache_type_ssm,omitempty"` and `CacheTypeConv string yaml:"cache_type_conv,omitempty"`. |
-| `core/backend/options.go:247` region | Add `CacheTypeSSM: c.CacheTypeSSM,` and `CacheTypeConv: c.CacheTypeConv,` to the request build. |
-| `core/config/meta/registry.go:161` + `core/config/meta/build_test.go:140` | Register `cache_type_ssm` / `cache_type_conv` as static fields (the `staticFields` slice + registry map) so the meta-config gate passes. |
-
-LocalAI semantics: leaving `cache_type_ssm` UNSET in YAML → empty gRPC string → backend keeps its
-BF16 default. Setting `cache_type_ssm: f32` → forces the f32 opt-out (bit-exact recurrence).
-
-## 4. Default / fallback semantics
-
- **DEFAULT = `GGML_TYPE_BF16`** for both SSM state (`type_s`) and conv state (`type_r`).
-  - SSM state (`type_s`) is the lever: f32→bf16 halves 805→413 MB/call → ~3.98→~2.0-3.0 ms/call.
-  - Conv state (`type_r`) is negligible bytes; default it bf16 too for consistency, but it can stay
-    f32 with zero perf cost if patch-0021's in-place conv path assumes f32 — see §6.
- **Opt-out = `GGML_TYPE_F32`** via `--cache-type-ssm f32` (CLI) or `cache_type_ssm: f32` (LocalAI YAML).
-  Restores bit-exact recurrence; use when the KL gate (<1e-3 / PPL-delta over 256-tok greedy) fails
-  for a given model, or for deterministic regression baselines.
- **Silent capability fallback**: gate the bf16 path behind a device-match probe modeled on
-  `auto_fgdn` (llama-context.cpp:518-595). If the GDN recurrence kernel's bf16 load/store
-  specialization is unavailable on a participating device (e.g. a CPU-offloaded GDN layer with no
-  bf16 op, or a non-GB10 backend), fall back to `GGML_TYPE_F32` for `type_s` BEFORE cache alloc and
-  log it once. This keeps "bf16 default" from breaking heterogeneous/CPU setups.
- The kernel contract is unchanged-math: load bf16→f32 into `s_shard` (registers stay f32), all
-  recurrence arithmetic in f32, store f32→bf16. Only the persisted cache is rounded per step;
-  geometric decay (g<1) bounds the rounding (does not accumulate unboundedly).
-
-## 5. Back-compat (the one real breakage — saved sessions / state files)
-
-`src/llama-memory-recurrent.cpp` SERIALIZES the per-layer state tensor dtype and does a HARD match
-on restore:
- write: `state_write_data` writes `s_type_i = (int32_t)s_l[il]->type` (line ~900) and the r type.
- read: `state_read_data` reads `s_type_i_ref`, compares to current `s_l[il]->type`, and on
-  mismatch logs `"mismatched s type (%d != %d, layer %d)"` and **returns false** (restore FAILS).
-  Same for `r` type.
-
-Consequence of the default flip f32→bf16:
- Sessions SAVED by an old f32-default build will FAIL to RESTORE against a new bf16-default build
-  (and vice versa), because the serialized `s_type_i_ref` (F32) ≠ the new cache type (BF16).
-
-Required handling (pick one, recommend A):
- **A (convert on mismatch, recommended)**: in `state_read_data`, when `s_type_i_ref != current`
-  and both ∈ {F32, BF16}, convert row-by-row during load (`ggml_fp32_to_bf16_row` / `ggml_bf16_to_fp32_row`)
-  instead of returning false. Same for `r`. Bump the recurrent state-file version so older readers reject
-  cleanly. Old f32 sessions then load into bf16 caches (rounded once on load) and bf16 sessions widen into f32 caches.
- **B (pin precision to the saved file)**: if a session is being restored, read `s_type_i_ref`
-  first and set `type_s`/`type_r` from it, overriding the default for that context. Keeps restore
-  working but silently disables the bf16 win for resumed sessions.
- **C (document-only)**: keep the hard match; document that bf16-default invalidates cross-version
-  saved recurrent states. Lowest effort, worst UX. Not recommended given parity is the goal.
-
-KV-cache parallel: `type_k`/`type_v` were always EXPERIMENTAL and non-default-changing, so the KV
-path never had to solve this. The SSM default-FLIP is what forces the convert/version work — call it
-out as the single most load-bearing back-compat item.
-
-## 6. Coupling notes / sequencing
-
- Land ON TOP of patch 0021 (conv-state in-place fusion). If 0021's fused conv write assumes an f32
-  conv-state tensor, either (a) extend it to the cache tensor's dtype, or (b) keep `type_r` = F32 by
-  default and make ONLY `type_s` bf16 (conv bytes are negligible, so this loses nothing perf-wise and
-  de-risks 0021). Decision: ship `type_s`=BF16 first; make `type_r`=BF16 a follow-up gated on 0021's
-  conv path being dtype-generic.
- Kernel side (separate patch, not this wiring): `ggml/src/ggml-cuda/gated_delta_net.cu` currently
-  takes `const float * curr_state` / `float * state_dst` and does `s_shard[r] = read_state[i]`
-  (line 102) — hardcoded f32. The bf16 build needs the dispatch to read `s0->type` and route a
-  bf16 load/store specialization; the gather kernel `gdn_gather_nonident_kernel` (line 7, `const
-  float * cache`) likewise needs a bf16 variant. The cparams wiring here only selects the cache
-  dtype; the kernel patch consumes it. Patches 0018 (in-place) / 0019 (gather) state asserts must be
-  relaxed from f32-only to {f32,bf16}.
- CPU mirror `ggml-cpu/ops.cpp` GDN path needs the same bf16 load/store for CI parity / fallback.
-
-## 7. Validation gate
-
- KL < 1e-3 and PPL-delta within tolerance vs the f32-state build over a 256-token greedy run, per
-  model (dense q36-27b-nvfp4, MoE q36-35b-a3b-nvfp4). If a model fails, that model sets
-  `cache_type_ssm: f32` in its gallery YAML (per-model opt-out) — the global default stays bf16.
- Add a `test-backend-ops` case for the GDN recurrence with bf16 state (mirror the 0021 harness:
-  dense text md5 + MoE byte check) to lock the load→f32→store→bf16 contract.
-
---
-
-# Appendix - label `upstream-bf16-precedent` (READ-ONLY research)
-
-Precedent + numeric-safety justification for the §1-7 wiring above. Sources: paged dev tree
-(`dgx.casa:~/llama-paged-dev`, branch `paged`) and the vLLM checkout
-(`~/vllm-bench/.../site-packages/vllm`).
-
-## A.1 Upstream llama.cpp: recurrent-cache f32 is HARDCODED (no f16/bf16 path), not a documented numeric guard
-
-The asymmetry to override: the attention KV cache type is user-tunable; the recurrent state cache is not.
-
- KV cache: `llama_context_params.type_k/type_v` default `GGML_TYPE_F16`
-  (`src/llama-context.cpp:3468-3469`), `[EXPERIMENTAL]` in `include/llama.h:365-366`, plumbed from
-  user params (`attn_type_k = params.type_k`).
- Recurrent/SSM cache: `llama_memory_recurrent(... type_r, type_s ...)` and the hybrid wrappers take
-  the recurrent types as ctor args, but EVERY call site in `src/llama-model.cpp` passes the literal
-  `GGML_TYPE_F32` (2056-2057 pure-recurrent; 2098-2099 hybrid-iswa `recurrent_type_r/s`;
-  2117-2118 hybrid `recurrent_type_k/v`). No cparams field feeds these - compile-time constants.
-  So mamba/mamba2/rwkv/falcon-h1/nemotron-h/qwen3.5 ALL get f32 recurrent + conv state unconditionally.
- Alloc: `r = ggml_new_tensor_2d(ctx, type_r, ...)`, `s = ggml_new_tensor_2d(ctx, type_s, ...)`
-  (`src/llama-memory-recurrent.cpp:100-101`). No f16 branch anywhere.
-
-Is f32 a deliberate numeric constraint? Structural, not documented:
- `ggml_ssm_conv` / `ggml_ssm_conv_update_inplace` HARD-ASSERT f32 on conv state/kernel/x_cur/dst
-  plus `nb[0]==sizeof(float)` (`ggml/src/ggml.c:5581-5584,5589,5597`). Conv path is f32-locked at the
-  builder.
- `ggml_ssm_scan` does NOT assert input state `s` dtype, but hardcodes its OUTPUT as
-  `GGML_TYPE_F32` (`ggml/src/ggml.c:5662`); scan kernels read `s` as `float *`.
- `ggml/src/ggml-cuda/gated_delta_net.cu` takes `const float * curr_state`, `float * state`,
-  `float * state_dst`; the per-(seq,head) shard `float s_shard[rows_per_lane]` is loaded/stored as raw
-  float (34-102). Same in `ggml-cpu/ops.cpp`.
- NO code comment anywhere justifies "f32 for precision". The constraint is that the ops were written
-  float-only. => recurrent-cache-f32 is a hardcoded implementation default to override deliberately:
-  the 3 literal `GGML_TYPE_F32` call-site pairs (gate behind `type_s`/`type_r` per §3), the
-  gated_delta_net.cu load/store convert, and KEEP conv f32 unless its asserts are extended (conv bytes
-  are negligible - only the temporal `type_s` state needs bf16).
-
-## A.2 vLLM: GDN temporal state cache is bf16 BY DEFAULT, fp32-accumulated in-kernel (the exact design)
-
- Dtype: `qwen3_next.py:780-787` -> `MambaStateDtypeCalculator.gated_delta_net_state_dtype` ->
-  `_mamba_state_dtype` (`mamba_utils.py:84-96`):
-  `conv_state_dtype = get_kv_cache_torch_dtype(mamba_cache_dtype, model_dtype)`;
-  `if mamba_ssm_cache_dtype == "auto": temporal_state_dtype = conv_state_dtype`.
-  With both knobs default `"auto"`, `get_kv_cache_torch_dtype("auto", model_dtype)` returns
-  `model_dtype` (`torch_utils.py:293-297`) = bf16 for Qwen3-Next => BOTH conv and temporal state are
-  bf16 by default. Explicit opt-out: `--mamba-ssm-cache-dtype float32` (mirror of our f32 fallback).
- In-kernel numerics (decode), `fla/ops/fused_recurrent.py`:
-  `b_h = tl.load(p_h0).to(tl.float32)` (303) load bf16->fp32; q/k/v/g/beta `.to(tl.float32)` (309-318);
-  recurrence in fp32 `b_h*=exp(g); b_v-=sum(b_h*b_k); b_v*=beta; b_h+=b_v*b_k; b_o=sum(b_h*b_q)`
-  (327-331); `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))` (337) store fp32->bf16. Prefill chunk path
-  identical (`b_h=tl.zeros(...,tl.float32)`, `+= load().to(fp32)`, 102/120).
-  => byte-for-byte the proposed llama lever: load bf16->f32, math in f32 (UNCHANGED order, matches
-  gated_delta_net.cu's v-g*kv -> *beta -> S-update -> S^T q), store f32->bf16; only the persisted cache
-  crosses the bf16 boundary, once per step.
- vLLM numeric guards: NONE beyond fp32 accumulation - no per-step renorm, no clamp, no Kahan. Optional
-  `use_qk_l2norm_in_kernel` normalizes q,k (keeps k unit-norm) but does not touch the state.
- KDA nuance: `kda_state_dtype` returns `(state_dtype, torch.float32)` - Kimi Delta Attention keeps a
-  fp32 secondary component. qwen3.5 is `gated_delta_net` (fully-bf16 temporal state), but this shows
-  vLLM judged a fp32 component necessary for one delta variant -> reinforces keeping the f32 toggle.
-
-Verdict: vLLM's own GDN state cache is bf16, so bf16-state in llama is a FAIR equal-precision target,
-not a regression vs the competitor. bf16 brings llama TO vLLM's precision.
-
-## A.3 Numeric-safety assessment for bf16 gated-DeltaNet state
-
-Update: `S <- S*diag(exp(g)) + beta * k (x) (v - S k)`, with
-`g = -exp(A_log)*softplus(a+dt_bias) < 0` (softplus is strictly positive), so `exp(g) in (0,1)`
-(strict geometric decay), and `beta = sigmoid(.) in (0,1)`.
-
- Decay bounds error accumulation. bf16 = 8 significant bits (7 stored + the implicit leading 1) ->
-  per-element rel rounding `eps ~= 2^-8 ~= 3.9e-3`. An error injected at step t is multiplied by
-  `exp(g)<1` every later step -> carry-error is a CONTRACTING geometric series bounded by
-  ~`eps/(1-exp(g_mean))`, a constant multiple of one step's eps (small unless `exp(g_mean)` is close
-  to 1), NOT linear/unbounded. The recurrence is a contraction map - no divergence. (The "per-step
-  renorm" framing is not a literal renorm op in either codebase; the bound IS the `exp(g)<1`
-  contraction + `beta in (0,1)` + unit-norm k from the l2norm capping `||k (x) delta||`.)
- fp32 register accumulation is the minimal-error placement: load bf16->f32, do `S k`, `v-g*kv`,
-  `*beta`, the outer-product accumulate and `S^T q` ALL in fp32 (UNCHANGED math), store f32->bf16 once.
-  Identical to vLLM, which ships this as the Qwen3-Next default with no reported quality regression -
-  the strongest empirical safety evidence.
- Dominant risk is small KL/PPL drift, not instability. Gate KL<1e-3 + PPL-delta over 256-tok greedy
-  vs the f32 build; fall back to f32 via the §3c toggle if it fails. Keep conv state f32 (ssm_conv* is
-  f32-locked, conv bytes negligible) - no reason to risk it.
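-
-A sketch of the bound behind the first bullet, under assumptions that are not part of the analysis
-above: a constant per-step decay `\gamma = exp(g) < 1`, a state norm bounded by a hypothetical
-constant `M`, one relative rounding of at most `\varepsilon` per cache store, and unit-norm `k`
-(so the factor `I - \beta k k^T` has norm at most 1 and the stored error is scaled by at most
-`\gamma` per step). `e_t` denotes the norm of the gap between the bf16-stored state and the exact
-f32 state after step t:
-
-```latex
-\begin{aligned}
-e_t &\le \gamma\, e_{t-1} + \varepsilon M, \qquad e_0 = 0, \\
-e_t &\le \varepsilon M \sum_{j=0}^{t-1} \gamma^{j}
-     = \varepsilon M \,\frac{1-\gamma^{t}}{1-\gamma}
-     \le \frac{\varepsilon M}{1-\gamma}.
-\end{aligned}
-```
-
-With `\varepsilon = 2^{-8}` the bound is independent of sequence length, but it grows as `\gamma`
-approaches 1, so heads with very slow decay are where long-context drift should be checked first.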
-
-Bottom line: (1) upstream recurrent-cache f32 is a hardcoded implementation default (conv asserts f32;
-scan/gdn kernels float-only; no numeric-rationale comments) - override via §3's `type_s`/`type_r`
-plumbing, bf16-default + f32 opt-out, touching only the temporal state. (2) vLLM's GDN temporal state
-is bf16 by default (auto->model_dtype), fp32-accumulated, with `--mamba-ssm-cache-dtype float32`
-opt-out - a fair equal-precision target. (3) bf16 GDN state is numerically safe: exp(g)<1 decay contracts
-rounding to a bounded geometric series, fp32 registers confine bf16 rounding to one per-step cache
-write, and vLLM ships this exact config in production. KL<1e-3 / PPL gate + f32 fallback is the right
-safety net.
-
---
-
-# PART B - label `bf16-kernel-plumbing` (the kernel/op edits §6 defers)
-
-Part A wires the cache DTYPE selection (cparams -> memory_params -> `s_l`/`r_l` alloc). Part B is the
-consuming half: every kernel/op that reads or writes those caches, and the exact
-load->f32->compute(f32, UNCHANGED)->store->bf16 conversion points. Traced against HEAD-after-0021 on
-`dgx.casa:~/llama-paged-dev` (branch `paged`).
-
-## B.1 Complete set of state-cache READERS/WRITERS (one op family only)
-`s_l` (ssm_states_all) reaches compute through exactly ONE op family - the gated-DeltaNet recurrence -
-via a strided VIEW from `build_rs` (graph base) that carries the cache dtype. The cache-touching srcs:
- `src[5]` `src_state` - the s0 read view (the cache, or the 0019 gather scratch).
- `src[6]` `src_state_dst` - the 0018 in-place write-back target (a view INTO the cache).
- `src[7]` `ids` - I32 seq map for the 0019 gather (no dtype concern).
-No other op reads `s_l`. `build_rs` only re-strides (dtype rides through); the 0019
-`gdn_gather_nonident_kernel` is the only other reader. So bf16 awareness localizes to: the 3 ggml.c
-builders (asserts), cuda `supports_op`, `gated_delta_net.cu`, and the CPU mirror in `ops.cpp`.
-
-## B.2 ggml.c builder asserts (relax F32-only -> {F32,BF16})
-File `ggml/src/ggml.c`:
- `ggml_gated_delta_net` (6287): line 6308 `GGML_ASSERT(state->type == GGML_TYPE_F32)` ->
-  `... == GGML_TYPE_F32 || ... == GGML_TYPE_BF16`.
- `ggml_gated_delta_net_inplace` (6349): same `state` assert (~6366-6370) + any `src_state_dst`
-  type assert -> allow BF16.
- `ggml_gated_delta_net_inplace_ids` (6417): same `state` + `src_state_dst` relax.
- KEEP the op OUTPUT scratch f32: line 6327 `ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne)` stays. The
-  `[attn_scores | new_states]` output is a TRANSIENT graph tensor; the bf16 persisted write goes
-  through `src_state_dst`/`state` (in-place). The non-in-place fallback `cpy`s scratch->cache and
-  `ggml_cpy` already type-converts f32->bf16.
-
-## B.3 CUDA supports_op
-`ggml/src/ggml-cuda/ggml-cuda.cu`, `supports_op` case `GGML_OP_GATED_DELTA_NET` (3096): allow a BF16
-`src[5]`/`src[6]` (add BF16 to the permitted state-src types).
-
-## B.4 CUDA recurrence kernel `ggml/src/ggml-cuda/gated_delta_net.cu`
-Template the kernel + gather + launch on the CACHE-pointer dtype (`bool STATE_BF16`); keep f32 valid so
-the f32 opt-out is the SAME kernel. Include `<cuda_bf16.h>`; convert with `__bfloat162float` /
-`__float2bfloat16`. ALL recurrence math (lines 106-200) stays in f32 registers, byte-for-byte UNCHANGED.
- Signatures: line 39 `const float * curr_state` -> `const STATE_T * curr_state`; line 57
-  `float * state_dst` -> `STATE_T * state_dst`; `read_state` (85-88) -> `const STATE_T * read_state`.
- LOAD (s0 -> f32 regs), lines 100-103:
-  `if constexpr (STATE_BF16) s_shard[r]=__bfloat162float(read_state[i]); else s_shard[r]=read_state[i];`
-  `s_shard` stays `float`.
- STORE-BACK (f32 regs -> bf16 cache):
-  - non-keep final write (203-208): `if constexpr (STATE_BF16) state[col*S_v+i]=__float2bfloat16(s_shard[r]); else state[col*S_v+i]=s_shard[r];` (a `?:` mixing `nv_bfloat16` and `float` operands is ambiguous)
-  - keep_rs_t snapshot (191-200) targets `dst + attn_score_elems` = the f32 OUTPUT scratch (kept f32
-    per B.2); this is the prefill/rollback path (n_rs_seq>0), decode is `!keep_rs_t`. KEEP it f32.
-    Only the CACHE pointers (`curr_state` src[5], `state_dst` src[6]) are STATE_T.
- 0019 gather `gdn_gather_nonident_kernel` (7-30): `const float * cache` -> `const STATE_T * cache`;
-  `dst[i] = STATE_BF16 ? __bfloat162float(src[i]) : src[i];`. Keep `scratch` OUTPUT f32 (pool alloc
-  326-333 stays `ggml_cuda_pool_alloc<float>`) so the non-identity read path feeds f32; the identity
-  in-place path reads bf16 directly. `read_state`'s dtype follows the branch that selected it.
- Dispatcher (270-353):
-  - casts 299/323 `(const float *)src_state->data`, 312 `(float *)src_state_dst->data` ->
-    `(const nv_bfloat16 *)` / `(nv_bfloat16 *)` when `type == GGML_TYPE_BF16`; branch launch on type.
-  - asserts 309-311: `src_state_dst->type == GGML_TYPE_F32` -> allow BF16; `nb[0] == sizeof(float)` ->
-    `== ggml_type_size(type)`; `nb[1] == S_v*S_v*H*sizeof(float)` -> `... * ggml_type_size(type)`.
-  - q/k/v/g/beta strides (348-353) are ACTIVATION (f32) strides - UNCHANGED. Kernel indexes state by
-    ELEMENT (`col*S_v+i`), so the typed pointer halves the byte stride implicitly.
-  - `launch_gated_delta_net` (212-) + S_v switch (230-260): thread `STATE_BF16` into the
-    `gated_delta_net_cuda<S_v, KDA, keep_rs_t, STATE_BF16>` instantiations.
-
-## B.5 CPU reference `ggml/src/ggml-cpu/ops.cpp` (parity / CI / CPU-offload fallback)
-`ggml_compute_forward_gated_delta_net_one_chunk` (10662) + `_f32` (10847), dispatch (10915):
- LOAD: 10726 `const float * state_in_base = (const float *)src_state->data`, the rs_head/gather read
-  10744-10745, and 10891 `const float * cache = (const float *)src_state->data` -> when
-  `src_state->type == GGML_TYPE_BF16`, read `GGML_BF16_TO_FP32(((const ggml_bf16_t*)..)[..])`.
- STORE: 10758-10762 `inplace_state_base = (float *)src_state_dst->data` -> store
-  `((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`; relax asserts `nb[0]`/`nb[1]`
-  to `ggml_type_size(type)`. Keep ONE impl, branch load/store on `src_state->type`.
-
-## B.6 Conv state (`r_l`) -> bf16 : DEFER (optional, low-value, prefill snag)
-Conv state ~12.6 MB total, LAUNCH-bound (0021 removed concat/cpy); bf16 saves ~0 ms, adds complexity:
- DECODE (0021 fused) `ggml_ssm_conv_update_inplace` (ggml.c:5566) asserts 5581-5584
-  `conv_states/conv_state_dst->type == F32`; CUDA `ssm_conv_update_f32` (ssm-conv.cu:131) + CPU
-  `ggml_compute_forward_ssm_conv_update_f32` (ops.cpp:9471) read/write f32. To bf16: relax the 2
-  asserts, template tap LOAD (`__bfloat162float`) + ring write-back STORE (`__float2bfloat16`), cast
-  `conv_states`/`conv_state_dst` ptrs in both dispatchers.
- PREFILL (non-fused) `build_conv_state` (delta-net-base.cpp:449-524): `conv_states=build_rs(...)`
-  (bf16 view) then `ggml_concat(conv_states, qkv_mixed, 0)` (472). **`ggml_concat` requires same type**
-  - qkv_mixed is f32 -> bf16 conv cache BREAKS the prefill concat (needs an f32 staging view of the
-  taps first; the ring write-back `ggml_cpy` at 496/520 already converts; concat is the blocker).
-RECOMMENDATION: keep `type_r` = F32 in v1 (matches Part A §6). Ship `type_s`=BF16 first; `type_r`=BF16
-is a follow-up that adds the f32 staging view.
-
-## B.7 Confirm UNTOUCHED: full-attn KV-cache (16 layers) + FP4 weights
- KV-cache: the `llama_kv_cache` half of `llama_memory_hybrid`, alloc with `params.type_k/type_v`
-  (llama-model.cpp 2030-2031 / 2089-2090 / 2108-2109). Part A changes ONLY the recurrent half's
-  `type_s`; `attn_type_k`/`attn_type_v` untouched. Paged-KV gather (0003-0011), flash-attn,
-  `type_k()/type_v()` accessors (kv-cache.h 161-162/381-382) unaffected.
- FP4 weights (nvfp4 dense + MoE): model weights, separate from runtime state caches; recurrence/conv
-  kernels read STATE not weights. FP4 GEMM (0017/0020) untouched.
- Activations (q/k/v/g/beta, attn-out, z) stay f32 (<1% of bytes). Only persisted `s_l` rows narrow.
-
-## B.8 Conversion-point cheat-sheet (the ONLY numeric-precision boundaries)
-1. CUDA load   `gated_delta_net.cu` ~102: `s_shard[r] = __bfloat162float(read_state[i])`.
-2. CUDA store  ~207: `state[col*S_v+i] = __float2bfloat16(s_shard[r])`.
-3. CUDA gather ~20: `dst[i] = __bfloat162float(src[i])` (bf16 cache -> f32 scratch).
-4. CPU load    `ops.cpp` ~10726/10744/10891: `GGML_BF16_TO_FP32(((ggml_bf16_t*)src_state->data)[..])`.
-5. CPU store   ~10762: `((ggml_bf16_t*)inplace_state_base)[..] = GGML_FP32_TO_BF16(s_shard)`.
-Everything between (1)/(4) and (2)/(5) is f32-register math, identical to today's f32 kernel. Only the
-persisted cache rounds to bf16 once per step; g<1 geometric decay bounds the rounding.
-
-## B.9 File-by-file edit table (Part B)
-| File | Edit |
-|---|---|
-| `ggml/src/ggml.c` | relax `state`/`src_state_dst` F32 asserts -> allow BF16 in the 3 GDN builders (6308, ~6370, ~6430); keep output scratch F32 (6327) |
-| `ggml/src/ggml-cuda/ggml-cuda.cu` | `supports_op` GATED_DELTA_NET (3096): allow BF16 state src |
-| `ggml/src/ggml-cuda/gated_delta_net.cu` | template kernel+gather+launch on STATE_BF16; `__bfloat162float` load / `__float2bfloat16` store; cast src_state/src_state_dst ptrs; relax dispatcher asserts (309-311) to `ggml_type_size(type)`; keep gather scratch + keep_rs snapshot f32 |
-| `ggml/src/ggml-cpu/ops.cpp` | bf16 load/store branch in GDN ref (10726/10744/10758-10762/10891); relax asserts |
-| `tests/test-backend-ops.cpp` | add BF16-state GATED_DELTA_NET case (CUDA bf16 vs CPU bf16) |
-| (deferred) conv: `ggml.c:5581-84`, `ssm-conv.cu:131`, `ops.cpp:9471`, `delta-net-base.cpp:472` | v2 only - f32 staging before prefill concat |
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_PROGRESS.md
@@ -1,37 +0,0 @@
-# bf16 SSM state - build/de-risk progress
-
-DECISION (user override of plan): f32 DEFAULT + bf16 OPT-IN. type_s default = GGML_TYPE_F32.
-Conv state (type_r) stays F32. Recurrence math stays f32 (load->f32, store->cache dtype).
-
-## STEP 1 (dtype-generic kernel + op) - DONE + DE-RISK GATE 1 PASSED
-Files (DGX ~/llama-paged-dev):
- ggml/src/ggml.c: 3 GDN builder asserts F32 -> {F32,BF16}; state_dst nb[0] -> ggml_type_size.
- ggml/src/ggml-cuda/gated_delta_net.cu: gdn_state_t<STATE_BF16> alias; gather + recurrence kernel +
-  launchers templated on STATE_BF16; __bfloat162float load / __float2bfloat16 store; gather scratch
-  shares cache dtype (uniform read); dispatcher detects src_state->type, GDN_DISPATCH macro 8-way.
- ggml/src/ggml-cpu/ops.cpp: byte-based read base + read_bf16 load conversion; bf16 in-place
-  convert-store after token loop; bf16 gather widening; relaxed asserts to ggml_type_size.
- ggml/src/ggml-cpu/ggml-cpu.c: work-size +S_v*S_v for bf16 in-place.
- tests/test-backend-ops.cpp: state_type field on test_gated_delta_net; 16 bf16 cases (hs 64+128 x
-  decode/prefill/keep_rs x kda).
-GATE 1: build clean (EXIT=0); test-backend-ops -o GATED_DELTA_NET = 52/52 OK (CUDA bf16 vs CPU bf16).
-
-## STEP 2/3/4 (cparams opt-in wiring) - IN PROGRESS
-f32 DEFAULT everywhere; --cache-type-ssm bf16 opts in.
-
-## STEP 2/3/4 (cparams opt-in) - DONE
- llama.h/llama-context.cpp/llama-memory.h/llama-model.cpp: type_r/type_s plumbed, DEFAULT F32.
- common.h/common.cpp/arg.cpp: cache_type_ssm/conv (F32 default) + --cache-type-ssm/-conv CLI.
- llama-memory-recurrent.cpp: convert-on-mismatch f32<->bf16 (r and s) via ggml_*_row API.
-
-## EXTRA FIX (plan B.1 miss): build_rs rs_zero clear used ggml_scale (f32-only) -> bf16 abort.
- llama-graph.cpp: f32 keeps ggml_scale_inplace (bit-exact); non-f32 uses ggml_fill_inplace.
- fill.cu + ops.cpp + ggml.c: added BF16 to ggml_fill. get_rows/cpy already bf16-capable.
-
-## DE-RISK GATE - ALL PASS
- build clean EXIT=0 (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench).
- test-backend-ops -o GATED_DELTA_NET = 52/52 (16 bf16 cases: decode/prefill/keep_rs x kda x hs64/128).
- f32 default md5: dense 5951a5b4... MoE 07db32c2... == 0023 (non-invasive; also --cache-type-ssm f32 matches).
- bf16 opt-in: coherent "Paris", no crash; byte-identical to f32 on 48-tok sample (Same-top-p 100%).
- diff backup: ~/llama-paged-dev/BF16_SSM_STATE.diff (1003 lines, 15 files). NOT committed/pushed.
-READY FOR C.2 KL GATE (GateBench).
--- a/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/BF16_SSM_STATE_RESULTS.md
@@ -1,199 +0,0 @@
-# bf16 SSM-state cache - BUILD + DE-RISK RESULTS
-
-Label: bf16-build-derisk (the GPU build agent). Lands on top of patch 0023 (HEAD f7409c2) on the DGX
-dev tree `~/llama-paged-dev` (branch `paged`). Status: **DE-RISK GATE PASSED, READY FOR THE C.2 KL
-GATE (GateBench).** Work is built into `build-cuda` and saved as `~/llama-paged-dev/BF16_SSM_STATE.diff`
-(uncommitted on the dev tree; the 0024 ship/shelve decision is gated on GateBench's KL results).
-
-## DECISION applied (user override of the plan): f32 DEFAULT + bf16 OPT-IN
-The plan defaulted bf16; the user wants f32 to stay the bit-exact DEFAULT and bf16 to be opt-IN via
-`--cache-type-ssm bf16`. So `type_s` default = `GGML_TYPE_F32`, `type_r` default = `GGML_TYPE_F32`
-(conv stays f32 always, per plan C.0). Only the persisted RECURRENT (temporal) state narrows to bf16
-when opted in; recurrence math stays f32 (load->f32, compute f32, store->cache dtype). The opt-in is
-non-invasive: with no flag the output is byte-identical to 0023.
-
-## Files changed (15; full diff = ~/llama-paged-dev/BF16_SSM_STATE.diff, 1003 lines)
-
-STEP 1 - dtype-generic kernel + op (the de-risk core):
- `ggml/src/ggml.c` - 3 GDN builder `state`/`state_dst` asserts F32 -> {F32,BF16}; `state_dst->nb[0]`
-  `sizeof(float)` -> `ggml_type_size(state_dst->type)`. Also relaxed the `ggml_fill` builder assert to
-  allow BF16 (needed by the rs_zero clear; see below).
- `ggml/src/ggml-cuda/gated_delta_net.cu` - `gdn_state_t<STATE_BF16>` alias (`nv_bfloat16`/`float`);
-  recurrence kernel + gather kernel + both launchers + the dispatcher templated on `STATE_BF16`.
-  LOAD `__bfloat162float`, STORE `__float2bfloat16`; the gather scratch is allocated at the CACHE
-  dtype so `read_state` is a single uniform dtype (no mixed-dtype read path - eliminates the plan-R2
-  landmine). The keep_rs snapshot + the non-in-place final write stay f32 (op output scratch); the
-  bf16 store happens ONLY on the in-place cache path. `supports_op` already returned `true`
-  unconditionally for GATED_DELTA_NET, so no change there.
- `ggml/src/ggml-cpu/ops.cpp` - byte-based prior-state read base + `read_bf16` load conversion
-  (`GGML_BF16_TO_FP32`); bf16 in-place convert-store after the per-(head,seq) token loop
-  (`GGML_FP32_TO_BF16`); bf16-widening non-identity gather; relaxed `nb[]` asserts to
-  `ggml_type_size`. Added a `ggml_compute_forward_fill_bf16` + dispatch case.
- `ggml/src/ggml-cuda/fill.cu` - BF16 case in the fill kernel switch.
- `ggml/src/ggml-cpu/ggml-cpu.c` - GDN work-size adds the extra `S_v*S_v` f32 buffer when the cache is
-  bf16 in-place (mirror of `need_work` in ops.cpp).
- `tests/test-backend-ops.cpp` - `state_type` field on `test_gated_delta_net`; 16 bf16-state cases
-  (head_size 64 + 128 x {decode, multi-token prefill 33/64/100, keep_rs_t K=4} x kda 0/1, n_seqs 1/2).
-
-STEP 2/3/4 - cparams opt-in wiring (f32 DEFAULT):
- `include/llama.h` - `type_r`/`type_s` in `llama_context_params` (adjacent to type_k/type_v).
- `src/llama-context.cpp` - default-params `type_r = type_s = GGML_TYPE_F32`; `params_mem` passes them.
- `src/llama-memory.h` - `type_r`/`type_s` in `llama_memory_params`.
- `src/llama-model.cpp` - the 3 hardcoded `GGML_TYPE_F32` recurrent ctor pairs (recurrent /
-  hybrid_iswa / hybrid = the qwen35/qwen35moe path) now pass `params.type_r` / `params.type_s`.
- `src/llama-memory-recurrent.cpp` - back-compat: `state_read_data` converts f32<->bf16 on type
-  mismatch (helper `recurrent_read_convert_rows` via the public `ggml_bf16_to_fp32_row` /
-  `ggml_fp32_to_bf16_row`) instead of failing, for both r and s; lets an f32-saved session restore
-  into a bf16 cache and vice versa.
- `src/llama-graph.cpp` - `build_rs` rs_zero clear: f32 keeps the exact `ggml_scale_inplace(.,0)` op
-  (bit-exactness); bf16 (and any non-f32) state uses `ggml_fill_inplace(.,0)` (CUDA scale is f32-only;
-  this was the one extra state-touching op the plan's "one op family" claim missed). get_rows + cpy
-  on the extra-states path already support bf16, so no change needed there.
- `common/common.h` / `common/common.cpp` / `common/arg.cpp` - `cache_type_ssm` / `cache_type_conv`
-  (default F32) + `--cache-type-ssm`/`-ctssm` and `--cache-type-conv`/`-ctconv` CLI (reusing the
-  existing `kv_cache_type_from_str`, which already maps `f32`/`bf16`).
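The f32<->bf16 session conversion described above can be illustrated with a generic round trip. This is a minimal sketch of round-to-nearest-even bf16 truncation, not ggml's `ggml_fp32_to_bf16_row` code; the helper names are hypothetical and NaN/Inf handling is deliberately left out.

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Round an f32 value to bf16 (round-to-nearest-even) and return the 16-bit pattern.
    Hypothetical helper; NaN/Inf are not handled in this sketch."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add 0x7FFF plus the lowest kept bit so exact ties round to an even mantissa.
    bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + bias) >> 16) & 0xFFFF

def bf16_bits_to_f32(b: int) -> float:
    """Widen a bf16 bit pattern back to f32 by zero-filling the low 16 bits (exact)."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

def roundtrip(x: float) -> float:
    return bf16_bits_to_f32(f32_to_bf16_bits(x))

if __name__ == "__main__":
    for v in (1.0, 1.0 + 2**-9, 1.0 + 2**-7, 0.1):
        print(v, "->", roundtrip(v))
```

Widening bf16 to f32 is exact, so restoring a bf16-saved session into an f32 cache loses nothing further; only the f32 -> bf16 direction rounds, to 8 significant bits.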
-
-## DE-RISK GATE - ALL PASS
-
-1. **Build clean** (build-cuda, CUDA arch 121): EXIT=0 for ggml/ggml-cuda/ggml-cpu/llama/llama-common
-   and the binaries (test-backend-ops, llama-completion, llama-cli, llama-perplexity, llama-batched-bench).
-2. **test-backend-ops -o GATED_DELTA_NET = 52/52 PASS** (CUDA backend vs CPU reference). Includes all
-   16 new bf16-state cases (CUDA bf16 vs CPU bf16) covering decode (n_tokens==1), multi-token
-   prefill/chunk (33/64/100), and keep_rs_t (K=4), with kda on/off and head_size 64 + 128 (production
-   S_v). The bf16 op test is the deterministic R2 de-risk for the load/compute/store contract.
-3. **f32-default md5 == 0023 baseline (opt-in is non-invasive):**
-   - dense  q36-27b-nvfp4: `5951a5b4d624ce891e22ab5fca9bc439` == 0023  (no flag AND `--cache-type-ssm f32`)
-   - MoE    q36-35b-a3b-nvfp4: `07db32c2bcb78d17a43ed18bc22705cd` == 0023
-   Command: `llama-completion -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1`.
-4. **bf16 opt-in coherence + engaged (dense, `--cache-type-ssm bf16`):** no crash; coherent + on-topic.
-   - 48-tok greedy ("The capital of France is"): "**Paris**." - byte-identical to f32 (md5 5951a5b4...),
-     i.e. Same-top-p = 100% over that short sample (the g<1 decay bounds the per-step rounding so the
-     argmax trajectory is unchanged at short length).
-   - 256-tok greedy ("Explain how a transformer LM generates text, step by step"): fluent, well-structured
-     step-by-step explanation, and the bf16 md5 (`fc82b4cd44f8ec999c3b6843eb3f8c61`) **DIVERGES** from
-     f32 (`554cc667a2e62a47c34a999a127ac7e5`) - definitive proof that bf16 is genuinely ENGAGED (not a
-     silent f32 fallback) and behaves as expected (non-bit-exact, coherent). The per-token divergence
-     is exactly what the C.2 teacher-forced KL gate quantifies.
-   - Independent proof bf16 is allocated: BEFORE the build_rs fill fix, decode aborted in
-     `ggml_cuda_op_scale` on the recurrent-state tensor - an f32 cache would never have reached that
-     bf16-only failure, so the opt-in demonstrably allocates bf16. Wiring is also directly traceable:
-     `--cache-type-ssm bf16` -> cache_type_ssm -> cparams.type_s -> params_mem.type_s -> the
-     llama_memory_hybrid recurrent `s_l` alloc.
-
-CONFIRM: ready for the C.2 KL-divergence + PPL-delta + long-context drift gate (GateBench).
-
-## A landmine fixed beyond the plan (record for the gate/ship agents)
-The plan B.1 asserted `s_l` reaches compute through ONLY the gated-DeltaNet op. It also flows through
-`build_rs`: (a) the rs_zero restart-slot clear used `ggml_scale_inplace(.,0)`, and `ggml_cuda_op_scale`
-hard-asserts f32 -> the first bf16 decode aborted in scale. Fixed by routing the bf16 clear through
-`ggml_fill` (with a new bf16 fill path). (b) the extra-states `ggml_get_rows` + `ggml_cpy` already
-support bf16 (verified) - no change. This is exactly the kind of non-decode state path the de-risk
-was meant to surface; it is now covered end-to-end (the bf16 coherence run exercises rs_zero on the
-fresh-sequence prompt).
-
-## NOT done in this phase (next agents)
- STEP 5 LocalAI gRPC/YAML (`CacheTypeSSM`/`CacheTypeConv` proto + grpc-server + model_config +
-  options + meta registry) - needed to force f32/bf16 from a gallery YAML; not on the de-risk gate.
- STEP 6 capability fallback (device-match probe to demote bf16->f32 before alloc on a device lacking
-  the bf16 GDN/fill path, e.g. CPU-offloaded GDN). The CPU reference DOES implement bf16 (load/store/
-  gather/fill) so a CPU fallback is numerically correct today, but the probe is the clean guard.
- The C.2 KL/PPL/long-context gate + the C.3 nsys per-call bench - GateBench (GPU gate agent, runs
-  sequentially after this build phase; binaries are pre-built in build-cuda).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-# C.2/C.3 ACCEPTANCE GATE + PARITY BENCH RESULTS (label bf16-gate-bench)
-
-Status: **GATE FAILS -> NO-SHIP. KEEP SHELVED. patch 0024 NOT created; nothing committed.**
-All runs on `dgx.casa` build-cuda binaries, wikitext-2-raw test, `-ngl 99 -fa on --seed 1`.
-Corpus: `~/bench/klgate/wikitext-2-raw/wiki.test.raw` (symlink to `~/wikitext-2-raw`, ~280k tokens).
-
-## 1. KL acceptance gate
-
-### Noise floor (f32-vs-f32, c256 chunks32) - the non-determinism floor
-| model | Mean KLD | Max KLD | Same-top-p | ln(PPL(Q)/PPL(base)) |
-|---|---|---|---|---|
-| dense q27 | -1.3e-5 | 1e-6 | 100.000% | +0.001256 |
-| MoE q35   | ~0 (-3e-7) | 5.9e-5 | 100.000% | +0.000607 |
-
-### Headline 256-token gate (bf16-vs-f32, c256 chunks32) - PASSES, but vacuously
-bf16 c256 is **byte-identical to the floor** for both models (Mean KLD -1.3e-5 dense / ~0 MoE,
-Same-top-p 100%, identical PPL). Reason: a single 256-token window is processed in ONE ubatch
-(ub512 > 256), so the recurrent state is written to the bf16 cache only ONCE at the chunk end and is
-NEVER read back to produce that window's logits. The 256-token gate therefore does NOT exercise the
-bf16 round-trip at all - it is blind to the actual cost.
-
-### Long-context drift sweep (bf16-vs-f32, chunks8) - FAILS HARD for BOTH models
-| model | ctx | Mean KLD | Same-top-p | Max KLD | 99.9% KLD |
-|---|---|---|---|---|---|
-| dense | 256  | -1.3e-5 | 100.000% | 1e-6 | 0 |
-| dense | 1024 | 0.0647 | 91.54% | 20.17 | 7.69 |
-| dense | 2048 | 0.1739 | 90.65% | 24.89 | 18.18 |
-| dense | 4096 | 0.1258 | 90.40% | 26.03 | 17.22 |
-| MoE   | 256  | ~0      | 100.000% | 5.6e-5 | 4.9e-5 |
-| MoE   | 1024 | 0.0472 | 90.04% | 5.13 | 0.95 |
-| MoE   | 2048 | 0.0442 | 90.84% | 1.85 | 1.11 |
-| MoE   | 4096 | 0.0422 | 89.97% | 2.76 | 0.83 |
-
-Gate thresholds: Mean KLD < 1e-3; Same-top-p >= 99.5%; |ln(PPL ratio)| < 0.005;
-drift MeanKLD(4096) <= 1.5x MeanKLD(256) AND Same-top-p(4096) >= 99.0%.
-Result: 256-tok PASS (vacuous); **drift FAIL by ~50-170x on Mean KLD and ~9 pts on Same-top-p**
-(top-p ~90% = roughly 1 token in 10 flips its argmax at >=1024 ctx). FAIL for both dense and MoE.
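The thresholds above can be applied to the tabulated rows as follows. This is a sketch under stated assumptions: the "~0" MoE entry is entered as 0.0, the 256-token ln(PPL ratio) values are taken from the noise-floor table (the text reports bf16 c256 as identical to the floor), and the function names are hypothetical rather than part of GateBench.

```python
# (mean_kld, same_top_p_percent) per context length, copied from the drift sweep table.
SWEEP = {
    "dense": {256: (-1.3e-5, 100.0), 4096: (0.1258, 90.40)},
    "MoE": {256: (0.0, 100.0), 4096: (0.0422, 89.97)},  # "~0" entered as 0.0 (assumption)
}
# ln(PPL(Q)/PPL(base)) at c256, from the noise-floor table.
LN_PPL_256 = {"dense": 0.001256, "MoE": 0.000607}

def headline_gate(mean_kld: float, top_p: float, ln_ppl: float) -> bool:
    """256-token gate: Mean KLD < 1e-3, Same-top-p >= 99.5%, |ln(PPL ratio)| < 0.005."""
    return mean_kld < 1e-3 and top_p >= 99.5 and abs(ln_ppl) < 0.005

def drift_gate(rows: dict) -> bool:
    """Drift gate: MeanKLD(4096) <= 1.5x MeanKLD(256) and Same-top-p(4096) >= 99.0%."""
    kld_256, _ = rows[256]
    kld_4096, top_4096 = rows[4096]
    return kld_4096 <= 1.5 * kld_256 and top_4096 >= 99.0

headline = {m: headline_gate(*SWEEP[m][256], LN_PPL_256[m]) for m in SWEEP}
drift = {m: drift_gate(SWEEP[m]) for m in SWEEP}

if __name__ == "__main__":
    for m in SWEEP:
        print(m, "headline:", "PASS" if headline[m] else "FAIL",
              "| drift:", "PASS" if drift[m] else "FAIL")
```

Both models pass the headline check and fail both halves of the drift check, matching the stated result.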
-
-### Discrimination (is it a bug or genuine bf16?) - dense c1024 chunks8
- **f32 long-context floor c1024**: Mean KLD -1.2e-5, Same-top-p 100% -> the bf16 divergence is REAL
-  signal, not a long-context measurement artifact.
- **bf16 KLD is invariant to ubatch-boundary count** (= the cross-ubatch state read-back frequency):
-  ub1024 (0 internal boundaries) 0.0642 / 91.19%; ub512 (1) 0.0647 / 91.54%; ub256 (3) 0.0639 /
-  91.17%; ub64 (15) 0.0682 / 90.97%. Flat. -> The error is INTRINSIC to bf16 over the long
-  recurrence INSIDE a single op call, NOT a chunked-prefill/keep_rs/gather handoff bug (R2 ruled out;
-  test-backend-ops 52/52 already passed). The error PLATEAUS with context (contraction), i.e. it is
-  bounded but LARGE: the gated-DeltaNet has long-memory heads (exp(g) ~ 1), so the g<1 decay does NOT
-  tightly contract the per-step bf16 rounding the way the plan's A.3 optimistically assumed.
-
-Note: this is exactly vLLM's own precision (vLLM's GDN temporal cache is bf16). vLLM users never see
-this delta because vLLM has no f32 reference; our gate exposes the full bf16-vs-f32 gap because our
-f32 path is a HIGHER bar than vLLM.
-
-## 2. Parity bench - the perf lever IS real
-
-### nsys recurrence per-call (graphs OFF, npp4 ntg32 npl128) - gated_delta_net_cuda Avg
-| model | f32 ms/call | bf16 ms/call | delta |
-|---|---|---|---|
-| dense q27 | 3.381 | 1.726 | **-49.0%** |
-| MoE q35   | 2.245 | 1.153 | **-48.6%** |
-
-The predicted 3.49 -> 2-3 ms/call lever LANDED (even beat it). Total GPU time dropped too (dense
-~12.05 -> ~9.05 s graphs-off). bf16 halving the persisted SSM-state bytes halves the dominant decode
-kernel, exactly as designed.
-
-### End-to-end decode throughput (S_TG aggregate, npp128 ntg128, graphs ON unless noted)
-| model | npl | f32 t/s | bf16 t/s | note |
-|---|---|---|---|---|
-| dense | 32  | 212 | 239 | +12.8% |
-| dense | 128 | 371-376 (stable) | 287 / 336 / 487 / 498 (BIMODAL) | clean ~490 = +31%; bad runs from a CUDA-graph instability on the dense path |
-| dense | 128 | 371.67 (graphsOFF) | 486.68 (graphsOFF) | clean +31% |
-| MoE   | 32  | 449 | 509 | +13.4% |
-| MoE   | 128 | 767 | 958 | +24.9% (clean, nsys-corroborated) |
-
-% of vLLM (391 t/s dense reference): f32 default = 95-96% (bit-exact, higher precision than vLLM);
-bf16 clean ~490 = **125%** (but unstable on dense + fails the numeric gate). MoE bf16 +25% is clean.
-
-## 3. DECISION: NO-SHIP / KEEP SHELVED
- The KL gate **fails** the long-context drift criterion for both models: bf16 SSM state changes
-  ~10% of tokens at >=1024 ctx vs our f32 (Same-top-p ~90%, Mean KLD 0.04-0.17). It is therefore NOT
-  a quality-neutral opt-in and cannot honor the project's "f32 bit-exact default" promise.
- Per the task rule (gate FAIL -> do not ship as usable): **patch 0024 was NOT created and nothing was
-  committed** (DGX tree stays uncommitted; backup `~/llama-paged-dev/BF16_SSM_STATE.diff`).
- The perf lever is genuinely real (recurrence halves; dense ~490 t/s = 125% of vLLM when clean; MoE
-  +25%) and bf16 == vLLM's own precision, so it remains a valid FUTURE option - but only if shipped as
-  an explicitly-labeled "vLLM-precision-class, NON-bit-exact" mode (never quality-neutral), AND the
-  dense CUDA-graph throughput instability (bimodal 287..498) is fixed first.
- Recommendation: keep the shipped default as f32 bit-exact (95% of vLLM at higher precision). Shelve
-  bf16. Optional follow-up lever if precision must be cut: bf16 only on the SHORT-memory heads (those
-  with exp(g) well below 1), keeping long-memory heads f32 - a mixed-precision state that could pass
-  the gate while still cutting bytes; not implemented/measured here.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/BITEXACT_VS_VLLM.md
+++ b/backend/cpp/llama-cpp/patches/paged/BITEXACT_VS_VLLM.md
@@ -1,339 +0,0 @@
-# Bit-exact vs vLLM, and the f32-preserving-parity hunt (Qwen3.5 gated-DeltaNet)
-
-Label: crossengine-bitexact (READ-ONLY, no GPU). Adversarial source+numerics study.
-Model: q36-27b-nvfp4 (dense, `Qwen3_5ForConditionalGeneration`) / q36-35b-a3b-nvfp4
-(MoE, `Qwen3_5MoeForConditionalGeneration`). Engines: llama dev `~/llama-paged-dev`,
-vLLM 0.23.0 `~/vllm-bench`. Decode B=128, enforce-eager / graphs-off, GB10 (~273 GB/s).
-
-> **CORRECTION NOTICE (supersedes the earlier draft of this file).** A prior pass concluded
-> "vLLM's GDN state cache is bf16, so the 2x recurrence-DRAM gap is f32(llama)-vs-bf16(vLLM)
-> width" (old B2/B3). **That is wrong.** It read `gated_delta_net_state_dtype(..., mamba_ssm_cache_dtype="auto")`
-> as auto->model-dtype=bf16, but it did **not** trace the Qwen3.5-specific config override that
-> reassigns `mamba_ssm_cache_dtype` from `"auto"` to `"float32"` *before* the state dtype is
-> resolved. **vLLM stores this model's gated-DeltaNet temporal state in float32**, the same width
-> as llama. Proof chain in Part B. Everything in Part C is re-derived from the corrected dtype.
->
-> **INDEPENDENT RE-VERIFICATION (this pass, live DGX source).** Separate sub-agents reached
-> *opposite* dtype readings (one f32, two bf16). The contradiction was resolved by reading every
-> link of the chain directly, not by majority vote. All eight links confirm **float32 temporal
-> state**: `config.json text_config.mamba_ssm_dtype = "float32"` (both served models);
-> `config/cache.py:129` default `mamba_ssm_cache_dtype = "auto"`; the bench scripts
-> (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`, `serve_nvfp4.sh`) pass **only**
-> `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`,
-> no `--dtype`); `config/vllm.py:847 __post_init__` -> `:856 try_verify_and_update_config()` (runs at
-> finalize, before any state-dtype resolution); `MODELS_CONFIG_MAP` (`models/config.py:622-623`) maps
-> both `Qwen3_5ForConditionalGeneration` and `Qwen3_5MoeForConditionalGeneration` ->
-> `Qwen3_5ForConditionalGenerationConfig`; its override body (`config.py:546-549`)
-> `if mamba_ssm_cache_dtype=="auto": cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype` **fires**
-> (value "float32"); `mamba_utils.py:91-94` then takes the `!= "auto"` branch ->
-> `temporal = STR_DTYPE_TO_TORCH_DTYPE["float32"] = torch.float32` (conv stays bf16);
-> `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **temporal** (2nd)
-> tuple element and allocates the cache (`:1136`) at f32; `ssm_state = self_kv_cache[1]` (`:1316/1596/1664`).
-> The two bf16 sub-agent readings are **refuted** - they stopped at the `cache.py` default "auto" and
-> never traced the `__post_init__` override. **Numeric corroboration:** at the measured vLLM duration
-> 3.62 ms/call, bf16 (402 MB) would imply 111 GB/s = 41% peak (implausibly low for a tuned BW-bound
-> Triton kernel); f32 (805 MB) implies 222 GB/s = 81% peak (the expected regime). f32 is the only
-> reading consistent with both source *and* the measured time.
-
-## Headline (two answers)
-
-1. **Bit-exact-vs-vLLM (identical logits / probabilities) is IMPOSSIBLE - for this model and for any
-   two distinct engines.** B4 = CONFIRMED. The sharpest proof is the GDN recurrence itself: the two
-   kernels evaluate an *algebraically reassociated* expression (`g.Sigma` vs `Sigma.g`) on *different
-   reduction trees*, so they diverge **even if both ran pure f32 with identical inputs**. On top of
-   that the FP4 GEMM uses different operand precision (8-bit vs 4/16-bit activations) and different
-   accumulation - a >>ULP divergence in every projection and the LM head.
-
-2. **bf16 SSM state is NOT the only way to reach vLLM decode throughput, and an f32-preserving lever
-   was missed.** vLLM reaches its throughput **with an f32 GDN state** (proven). Both engines move the
-   same ~805 MB f32/recurrence-call; the ~10% per-call gap is a bandwidth-**efficiency** gap on equal
-   bytes (llama ~74% of peak, vLLM ~81%), i.e. an occupancy/grid/coalescing lever that is **bit-exact
-   vs llama's own f32**. bf16 state is an *optional over-clock* (goes AHEAD of vLLM on the recurrence),
-   not a parity requirement. B2/B3 (as "bf16 width is the lever") = REFUTED.
-
---
-
-# The five questions, answered (synthesis)
-
-**Q1. Can llama be BIT-EXACT with vLLM? NO.** Two *binding* (>>ULP) divergence sources make
-bit-identical logits impossible on their own: **(A1)** the FP4 GEMM - llama MMQ quantizes the
-activation to **q8_1 (8-bit)** while vLLM runs cutlass **w4a4 (4-bit acts)** or marlin **w4a16
-(16-bit acts)**; different operand precision + accumulation order -> ~1e-2 relative error in *every*
-projection and the LM head; **(A2)** the GDN recurrence - llama computes `g*(Sigma round(S*k))`
-(scalar decay *after* the reduction) while vLLM computes `Sigma round(round(g*h)*k)` (decay rounded
-into each element *before* the reduction): an IEEE-754 reassociation on *different reduction trees*
-(warp butterfly vs Triton `tl.sum`) that diverges **even with identical pure-f32 state and inputs**.
-A dozen further ops (L2/RMSNorm, MRoPE, gate `exp`, flash-attn softmax) add close-but-not-equal
-rounding. Cross-engine bit-exactness is impossible *in general* (FP non-associativity across distinct
-GEMM/recurrence/norm kernel stacks); the determinism literature only buys run-to-run determinism
-*within* one engine. **Weaker form reachable:** greedy **top-1 token agreement** is the right gate
-(top-1 / KL / PPL-delta, never md5). It is probabilistic (flips at low-margin steps), **compounds**
-with length (once one token differs the SSM/KV states fork), and is *weaker here* than a
-same-precision run because of the A8-vs-A4 GEMM gap.
-
-**Q2. Is bf16 SSM state the only path to vLLM decode throughput? NO - an f32-preserving lever exists
-and bf16 is not even required for parity.** vLLM carries the **same f32 temporal state** (proven +
-re-verified), so the recurrence gap is **bandwidth EFFICIENCY on equal f32 bytes** (llama 74% vs vLLM
-81% of GB10 peak), ~10% per call, *not* a 2x width gap. The lever: **retune `gated_delta_net_cuda`
-74% -> ~81%** - it launches 196608 tiny one-column blocks (butterfly-reduce per token); fold toward
-fewer/larger `BV x BK` tiles + vectorized `f32x4` loads + better row coalescing, **keeping the
-per-column reduction order -> BIT-EXACT vs llama's own f32** (md5-gateable). **Cost vs bf16:** zero
-precision risk and bit-exact, but it can only **match** vLLM's recurrence BW (81%), never beat it;
-worth ~+5% (~335 -> ~351 tok/s, ~90% of vLLM), and it caps below 100% unless stacked with the other
-bit-exact levers (conv fusion 0021, activation fold, oproj MMQ 0020). The adversarial sweep of every
-other f32 avenue (lossless sub-f32, delta/low-rank/sparse, recompute+checkpoint, 2nd-stream/overlap,
-chunked recurrence) **FAILS** to beat it; recompute is bit-exact but only **ties** the irreducible
-one-full-state-READ floor and is now moot (vLLM also writes f32, so you match its achieved BW, you
-don't need to eliminate the write). bf16 remains the **only** lever that goes *ahead* of vLLM on the
-recurrence (~440 tok/s) - an **over-clock**, not a requirement.
-
-**Q3. Does bf16 state MATCH vLLM's precision or DEGRADE below it? It DEGRADES below vLLM.** (This
-corrects the `precision-ground-truth` sub-agent's "matching, not degrading" claim, which rested on
-the refuted bf16 reading.) vLLM keeps the **temporal/recurrent** state in **f32**; only its small
-**conv** state is bf16 (llama keeps conv f32, so llama is *more* precise there). So bf16 **temporal**
-state in llama (~8 mantissa bits) sits **below vLLM's f32 temporal** (~24 bits) - it is a deliberate
-precision-for-speed trade, KL/PPL-gated vs llama's own f32 *and* a step under vLLM's recurrent-state
-precision. A genuine "match vLLM's envelope" change would be f32 temporal (as today) + bf16 conv -
-which costs llama precision only on a tiny stream and buys almost no BW.
-
-**Q4. What can "parity" mean here? Throughput at equal precision + a distributional quality bar -
-never identical bits.** Bit-identical logits are impossible cross-engine, so "parity" = **(a)**
-throughput (tok/s in the harness) at **(b)** a quality bar measured by **top-1 greedy agreement,
-KL(llama||vLLM)/step, and PPL-delta**, never md5. Both engines already run the recurrence math in f32
-registers; at **equal** precision (llama f32 temporal == vLLM f32 temporal) the *only* open variable
-is throughput, and that gap is closable **bit-exactly** (Q2). If llama adopts bf16 temporal, "parity"
-must be restated as "throughput >= vLLM at KL/PPL within gate vs llama's own f32" and reported as the
-precision-for-speed trade it is.
-
-**Q5. Did the prior analysis get B1-B4 right? B1 mostly; B2/B3 REFUTED; B4 CONFIRMED. Overturn the
-"bf16 is required" framing - keep the bit-exact levers.**
- **B1 TRUE** (single-pass f32, load-once/store-once, 74% peak) - but its sub-claim "more efficient
-  than vLLM (41%)" is **REFUTED** (41% was the bf16 artifact; vLLM is ~81%, *more* efficient).
- **B2 REFUTED** - not an f32-vs-bf16 width gap; equal f32 bytes both sides, ~10% efficiency gap.
- **B3 REFUTED** as written - vLLM reaches its throughput **with f32 state**; a bit-exact f32
-  occupancy retune reaches vLLM's recurrence BW. bf16 is optional.
- **B4 CONFIRMED** - impossible, on two independent grounds (structural A1+A2; general FP
-  non-associativity across distinct kernel stacks).
- **Plan disposition:** do **not** overturn the conv-fusion (0021) bit-exact lever - keep it.
-  **Re-prioritize the bit-exact f32 occupancy/coalescing retune of `gated_delta_net_cuda` as the
-  parity path.** Treat bf16 temporal state as an explicitly-gated **over-clock for going beyond
-  vLLM**, reported as a precision-for-speed trade (below vLLM's f32 recurrent precision), NOT as a
-  parity-matching change.
-
---
-
-# PART A - Divergence inventory (per source: bit-identical vs close)
-
-Per decode layer the two engines run *different kernels* for: FP4 GEMMs (proj + LM head), depthwise
-conv+SiLU, q/k L2-norm, the GDN recurrence, gated RMSNorm; and on the hybrid's full-attention layers:
-RMSNorm q/k-norm, MRoPE, flash attention, a sigmoid gate.
-
-## A1. NVFP4 dequant + FP4 GEMM -- NOT bit-identical (diverges >> ULP)
-
- **llama**: MMQ (`mmq.cuh` `block_fp4_mmq`, nvfp4 block=16, 4x ue4m3 sub-scales). Host path
-  (`ggml-cuda.cu` ~1955-2014) **quantizes the activation (src1) to q8_1** (`block_q8_1_mmq`, **8-bit**,
-  block 32) and accumulates over K in the MMQ tile (DP4A / Blackwell FP4-MMA); tile order set by
-  `mmq_y`/`mmq_x` + the warp-MMA fragment layout.
- **vLLM**: `compressed_tensors_w4a4_nvfp4` -> cutlass FP4 GEMM on Blackwell (**4-bit** activations,
-  w4a4, per-group act-quant, e4m3 block scale x global FP8 tensor scale) or marlin fp4 fallback
-  (**16-bit** activations, w4a16, dequant->bf16 then bf16 GEMM). `apply_weights` -> `self.kernel`.
- **Verdict: not close.** (a) *Operand precision differs*: llama 8-bit acts vs vLLM 4-bit (cutlass) or
-  16-bit (marlin) - per-GEMM outputs differ at ~1e-2 relative, not ULP. (b) Scale-application order
-  differs. (c) Accumulation tiling/order differs (MMQ fragment vs cutlass/marlin). This is the largest
-  divergence and is present in every projection + the LM head, so logits differ materially on its own.
-
-## A2. gated-DeltaNet recurrence -- NOT bit-identical, AND provably so even in pure f32
-
-Both single-pass over an **f32** state (Part B). llama: `gated_delta_net.cu`
-`gated_delta_net_cuda<128,KDA=false>`; vLLM: `fused_recurrent.py`
-`fused_recurrent_gated_delta_rule_packed_decode_kernel`. Scalar-gate (GDA) path, `g.ne0==1`.
-With S[k][v] (llama, transposed) == h[v][k] (vLLM):
-
-```
-llama:  kv[v] = Sigma_k S_old[k][v]*k[k]      # OLD state; g applied AFTER the sum
-        delta = (v[v] - g*kv[v])*beta;  S_new = g*S_old + k(x)delta;  o[v]=Sigma_k S_new[k][v]*q[k]
-vLLM:   h' = g*h_old                          # decay rounded into EVERY element first
-        kv[v]=Sigma_k h'[v][k]*k[k]=Sigma_k round(g*h_old)*k;  b_v=(v[v]-kv[v])*beta
-        h_new = h' + b_v(x)k;  o[v]=Sigma_k h_new[v][k]*q[k]
-```
-
-Algebraically identical (g scalar). **Numerically not**, for two structural reasons that hold even
-with identical f32 state and identical inputs:
- **Reassociation:** llama forms `g*(Sigma round(S*k))` (scalar multiply *after* the reduction);
-  vLLM forms `Sigma round(round(g*h)*k)` (decay rounded into each element *before* the reduction).
-  Distributing a multiply across a sum is exact in R, not in IEEE-754. This is not a precision knob.
- **Different reduction trees:** llama `warp_reduce_sum<32>` (4 sequential per-lane FMAs + 5-step
-  butterfly) vs vLLM `tl.sum(...,1)` (Triton tree over the 128-wide BK axis).
-**Verdict: not bit-identical; cannot be made so without rewriting one kernel to the other's op order.**
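The reassociation point can be tried in isolation with a small f32 emulation. This is a sketch, not either engine's kernel: both orders use the same sequential sum (no FMA, no warp or Triton tree) so that only the placement of the `g` multiply differs; the function names and the random test data are hypothetical.

```python
import random
import struct

def f32(x: float) -> float:
    """Round a Python float (double) to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def decay_after_sum(g, S, k):
    """llama-style order: g * (Sigma round(S*k)), scalar decay applied after the reduction."""
    acc = 0.0
    for s, kk in zip(S, k):
        acc = f32(acc + f32(s * kk))
    return f32(g * acc)

def decay_before_sum(g, S, k):
    """vLLM-style order: Sigma round(round(g*h)*k), decay rounded into every element first."""
    acc = 0.0
    for s, kk in zip(S, k):
        acc = f32(acc + f32(f32(g * s) * kk))
    return acc

random.seed(0)
n = 128                                   # one state column of S_v = 128 rows
g = f32(0.97)                             # hypothetical decay value
S = [f32(random.uniform(-1, 1)) for _ in range(n)]
k = [f32(random.uniform(-1, 1)) for _ in range(n)]
a, b = decay_after_sum(g, S, k), decay_before_sum(g, S, k)

if __name__ == "__main__":
    print(a, b, "bit-identical" if a == b else "differ")
```

The two results agree to within accumulated f32 rounding and need not agree bit for bit. A different reduction tree on either side adds a second, independent source of mismatch.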
-
-## A3. Depthwise conv1d (width 4) + SiLU -- NOT bit-identical
-llama `ggml_ssm_conv` (ascending-j f32 FMA) + `ggml_silu`, conv state cached **f32**. vLLM
-`causal_conv1d_update` (Triton) + SiLU, conv state cached **bf16** (`conv_state_dtype = bf16`; only the
-*temporal* SSM state is forced f32 - Part B). Different kernel + different conv-state width + FMA order.
-(Patch 0021 fuses llama's chain bit-exactly vs *llama's own* f32 path, not vs vLLM.)
-
-## A4. q/k L2-norm + RMSNorm/RMSNormGated -- NOT bit-identical (close, ~1e-6)
-L2-norm: llama standalone `ggml_l2_norm` (f32 tree) vs vLLM `l2norm_fwd`/in-kernel fold
-(`USE_QK_L2NORM_IN_KERNEL`). RMSNorm: llama `ggml_rms_norm` vs vLLM `vllm_c` fused kernel (run log:
-`rms_norm=['vllm_c','native']`); gated out-norm `build_norm_gated`=RMS*SiLU(z) vs `RMSNormGated`.
-Different variance reduction tree / eps placement / fusion boundary.
-
-## A5. MRoPE + gate scalar pipeline -- NOT bit-identical (close)
-MRoPE: `ggml_rope_multi` (ggml sin/cos) vs vLLM rotary cos/sin cache (different theta eval + apply
-order). Gate: vLLM computes `-exp(A_log)*softplus(a+dt)` then `exp` **in-kernel**; llama computes
-`softplus(alpha+ssm_dt)*ssm_a` as split graph ops with `ssm_a` baking `-exp(A_log)` at GGUF-convert
-time (rounded once), writes/reloads the intermediate, `expf` in-kernel. Same algebra, different
-rounding points + convert-time vs runtime `exp(A_log)`.
-
-## A6. Flash attention (full-attn layers) -- NOT bit-identical (close)
-llama `ggml_flash_attn_ext` -> `fattn-mma-f16`/`fattn-vec` (online softmax, F16/F32 PV accum per
-`GGML_PREC`) vs vLLM FlashInfer/FA2. Different tiling => different running max/sum order => different
-rounding.
-
-## A7. SiLU/sigmoid primitives + fusion -- equivalent IF inputs matched (they never do)
-Both ultimately use the same hardware `expf`/`__nv_expf`; the primitives could match given identical
-inputs, but every upstream value has diverged, and vLLM fuses act+quant / norm+quant differently than
-llama's separate ops (run log `fuse_act_quant=True`), moving the rounding points.
-
-### Inventory summary
-
-| Source | bit-identical? | divergence size |
-|---|---|---|
-| FP4 GEMM (proj/LM head): MMQ q8_1(A8) vs cutlass w4a4(A4)/marlin w4a16 | **NO** | **>>ULP (~1e-2)** |
-| GDN recurrence: hand-CUDA warp-reduce vs Triton tl.sum | **NO (provable even in f32)** | reassoc + tree |
-| conv1d+SiLU: f32 conv-state vs bf16 conv-state | NO | dtype + order |
-| L2-norm / RMSNorm | NO | ~1e-6 (tree) |
-| MRoPE | NO | ~ULP-1e-6 |
-| gate softplus/exp | NO | rounding points |
-| flash attention | NO | softmax tiling |
-| silu/sigmoid primitive | identical IFF inputs equal | inputs never equal |
-
-Any single NO makes the logits differ. A1 and A2 differ by far more than ULP -> the logit vectors are
-not close-to-equal at the bit level; they agree only to a few significant digits.
-
---
-
-# PART B - The decisive f32-state correction (proof from source)
-
-The byte-gate inferred vLLM's GDN temporal state is **bf16** (402 MB/call, 41% peak) and built the
-"bf16-width is the lever" case on it. The byte count was *inferred from the dtype*; ncu byte counters
-were blocked, so only the **duration** (3.62 ms/call) was measured. The dtype inference is falsified:
-
-1. `config.json`: `architectures=["Qwen3_5ForConditionalGeneration"]`, `text_config.dtype=bfloat16`,
-   and **`text_config.mamba_ssm_dtype = "float32"`**.
-2. `models/config.py:590 MODELS_CONFIG_MAP` maps `"Qwen3_5ForConditionalGeneration"` (line 622) and
-   `"Qwen3_5MoeForConditionalGeneration"` (623) to `Qwen3_5ForConditionalGenerationConfig`.
-3. `Qwen3_5ForConditionalGenerationConfig.verify_and_update_config` (config.py:536-562):
-   `mamba_ssm_dtype = getattr(hf_text_config,"mamba_ssm_dtype")` (="float32"); if
-   `cache_config.mamba_ssm_cache_dtype == "auto"` (the default) it executes
-   **`cache_config.mamba_ssm_cache_dtype = mamba_ssm_dtype`** -> sets it to **"float32"**.
-4. This override runs at config finalization: `config/vllm.py:856` -> `try_verify_and_update_config()`
-   (vllm.py:1880-1900) looks up the arch in `MODELS_CONFIG_MAP` and calls `verify_and_update_config`.
-   It runs **before** any layer/model state-dtype resolution.
-5. The bench left it default: `h2h_dense_vllm.sh` = `vllm serve .../q36-27b-nvfp4-vllm --enforce-eager
-   --gpu-memory-utilization 0.85 --max-model-len 4096` (no `--mamba-ssm-cache-dtype`; `dl-logs/vllm_dense.log`
-   non-default args confirm none). So the override fires and the value is "float32".
-6. State dtype resolution reads the **already-overridden** value:
-   - `gdn/base.py:53-57` `get_state_dtype()` -> `gated_delta_net_state_dtype(model_dtype=bf16,
-     cache_config.mamba_cache_dtype="auto", cache_config.mamba_ssm_cache_dtype="float32")`.
-   - `qwen3_5.py:678 get_mamba_state_dtype_from_config` likewise passes
-     `vllm_config.cache_config.mamba_ssm_cache_dtype` (= "float32", post-override) - **not** a raw "auto".
-   - `mamba_utils.py _mamba_state_dtype`: conv_state = `get_kv_cache_torch_dtype("auto", bf16)` = **bf16**;
-     temporal_state, since `mamba_ssm_cache_dtype != "auto"`, = `STR_DTYPE_TO_TORCH_DTYPE["float32"]`
-     = **torch.float32** (key verified: `torch_utils.py:33 "float32": torch.float32`).
-7. `qwen_gdn_linear_attn.py:1101` `_, state_dtype = self.get_state_dtype()` takes the **second** tuple
-   element (temporal) = **float32**, allocates the cache `dtype=state_dtype`. The packed_decode kernel
-   round-trips f32: `b_h = tl.load(p_h0).to(f32)` ... `tl.store(p_ht, b_h.to(p_ht.dtype.element_ty))`
-   with `p_ht.dtype == initial_state.dtype == float32`.
-
-**=> vLLM's gated-DeltaNet temporal (recurrent) state cache for this model is float32, identical width
-to llama's f32 state.** The earlier "bf16" reading hardcoded the third arg as `"auto"` and missed the
-override at step 3-4. Only the small *conv* state is bf16 in vLLM (f32 in llama: divergence A3, tiny
-byte stream).
-
-## Re-derived efficiency table (measured duration + PROVEN f32 byte volume)
-
-| kernel | state dtype (PROVEN) | bytes R+W/call | duration/call | eff. BW | % of 273 peak |
-|---|---|---|---|---|---|
-| llama `gated_delta_net_cuda` | f32 | 805 MB | 3.98 ms | 202 GB/s | **74%** |
-| vLLM `..._packed_decode` | **f32 (not bf16)** | **805 MB (not 402)** | 3.62 ms | **222 GB/s** | **~81%** |
-
- **B1 (single-pass f32 byte floor): TRUE** (load-once/store-once `s_shard`, coalesced). *Sub-claim
-  "more BW-efficient than vLLM (41%)" REFUTED* - 41% was the bf16 artifact; at the correct f32 byte
-  count vLLM is at ~81%, i.e. **more** efficient than llama.
- **B2 ("the gap is f32-vs-bf16 width"): REFUTED.** Equal f32 bytes both sides; the ~10% per-call gap
-  is bandwidth **efficiency** on equal bytes, not width.
- **B3 ("vLLM throughput REQUIRES bf16 state"): REFUTED.** vLLM reaches it *with f32 state*.
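The efficiency table and the "41% artifact" follow from dividing bytes by measured duration. A short check of that arithmetic, using only the figures quoted in this file (the row labels are illustrative):

```python
PEAK_GBPS = 273.0  # GB10 peak DRAM bandwidth quoted in this file

def effective_bw_gbps(megabytes: float, milliseconds: float) -> float:
    """MB per ms equals GB per s (1e6 B / 1e-3 s = 1e9 B/s)."""
    return megabytes / milliseconds

ROWS = {
    "llama f32": (805.0, 3.98),
    "vLLM f32 (proven)": (805.0, 3.62),
    "vLLM bf16 (refuted reading)": (402.0, 3.62),
}

table = {
    name: (round(effective_bw_gbps(mb, ms)),
           round(100.0 * effective_bw_gbps(mb, ms) / PEAK_GBPS))
    for name, (mb, ms) in ROWS.items()
}

if __name__ == "__main__":
    for name, (gbps, pct) in table.items():
        print(f"{name}: {gbps} GB/s, {pct}% of peak")
```

Only the duration was measured, so the percentage of peak moves directly with the assumed byte count, which is why the dtype reading decides between 41% and 81%.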
-
---
-
-# PART C - The f32-preserving lever, and where recompute/bf16 land
-
-Since vLLM hits ~81% on the **same f32 byte volume** llama runs at ~74%, the missed lever is **raising
-llama's `gated_delta_net_cuda` achieved BW 74% -> ~81%**, bit-exact, NOT dtype width:
- llama grid `(H=48, n_seqs=128, ceil(S_v/4)=32) = 196608` blocks/128 thr, each warp owns ONE state
-  column + warp-reduces over 128 rows. vLLM grid `(NV=4, B*HV=6144) = 24576` programs (num_warps=1),
-  each owns a BV=32 x BK=128 tile. llama's far-finer blocking (8x more blocks, one column of work each,
-  a butterfly reduce/token) is the likely ~7-point deficit. Retune toward fewer/larger blocks (more
-  columns/block, vectorized f32x4 loads, better row coalescing) - changes thread/tile mapping + load
-  width only, **keeps the per-column reduction order -> bit-exact vs llama's own f32**.
- Upper bound: 74%->81% on ~50% of the step ~= -17 ms/step (384 -> ~367 ms), ~+5% -> ~351 tok/s (~90% of
-  vLLM 391), stacking with the landed bit-exact levers (oproj MMQ 0020 @86%, conv fusion 0021).
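The block counts quoted above can be reproduced from the launch shapes. A sketch of the arithmetic, assuming S_v = 128 (implied by `ceil(S_v/4)=32`) and that vLLM's NV = 4 is S_v / BV; the variable names are illustrative:

```python
import math

H = 48          # heads (HV on the vLLM side)
N_SEQS = 128    # decode batch B
S_V = 128       # state head size (assumption consistent with ceil(S_v/4) = 32)
BV = 32         # vLLM tile width along the value axis

# llama: grid (H, n_seqs, ceil(S_v/4)); each warp owns one state column.
llama_blocks = H * N_SEQS * math.ceil(S_V / 4)

# vLLM: grid (NV, B*HV) with NV assumed to be S_v / BV; one BV x BK tile per program.
vllm_programs = (S_V // BV) * (N_SEQS * H)

ratio = llama_blocks // vllm_programs

if __name__ == "__main__":
    print(llama_blocks, vllm_programs, ratio)
```

The 8x ratio is the "far-finer blocking" the text points to as the likely source of the bandwidth deficit.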
-
-**Other f32-preserving avenues (adversarial sweep) - none beats the simple bf16 over-clock, but the
-occupancy tune above is the real bit-exact win:**
- *Lossless sub-f32 state:* generic float compression is data-dependent (1.1-1.5x, never a guaranteed
-  2x) and breaks the 128-consecutive-f32 coalescing a BW-bound kernel depends on. The state is dense,
-  full-rank, non-symmetric (sum of `k(x)delta`, k!=delta) -> no low-rank/half-storage. FAILS.
- *Recompute (checkpoint every N + rank-1 replay):* eliminates the per-step WRITE; the per-step full
-  dense f32 READ (the `S^T k` / `S^T q` matvecs need every element; the checkpoint is itself a full
-  read) is irreducible. Optimal N~=11 -> ~473 MB/step (0.587x), realistically ~0.65-0.75x after
-  replay/latency overhead. A genuine bit-exact path but it only reaches - never beats - the read floor,
-  at large kernel/graph complexity. **Note: this was over-weighted before because vLLM was assumed
-  bf16; now that vLLM is f32 too and runs at 81%, you do NOT need to cut the write to match vLLM - you
-  need to match vLLM's achieved BW on the same f32 bytes.** Recompute is dominated.
- *2nd stream / overlap / pipelining:* DRAM BW (273) is one shared resource; the whole decode step is
-  uniformly BW-bound (state traffic + ~13.5 GB/step dense NVFP4 weight traffic both hit 273), so
-  overlapping two BW-bound phases sums to ~0. FAILS.
- *Equivalent recurrence with less decode traffic:* chunked gated-delta-rule is a prefill lever (C=1 at
-  decode); attention/materialization-free form is O(t) over the prefix. FAILS.
-
-**bf16 SSM state is therefore an OPTIONAL over-clock**, the only lever that goes *ahead* of vLLM on the
-recurrence (halving the 805 MB state traffic -> ~440 tok/s) - but it takes llama below both its own f32 and vLLM's f32
-precision, so it must be **KL/PPL-gated vs llama's own f32**, never md5. f32-only parity-class
-throughput is plausible from the SUM of bit-exact levers (recurrence occupancy + conv fusion + oproj
-MMQ + activation fold); none require bf16.
-
---
-
-# PART D - Verdict on B4 + the meaningful weaker form
-
-## Bit-exact-vs-vLLM: IMPOSSIBLE (B4 CONFIRMED) - two independent grounds
-
-1. **Structural (this model):** A1 (FP4 GEMM operand precision + accumulation) and A2 (recurrence
-   `g.Sigma` vs `Sigma.g` + different reduction trees) make per-layer outputs differ by >>ULP, so logits
-   cannot be bit-identical. A2 shows it is not a precision knob: the kernels evaluate a *reassociated
-   expression*, differing **even given identical f32 state and inputs**.
-2. **General (any two engines):** IEEE-754 add/mul are non-associative; two engines that tile, reduce,
-   fuse, and quantize differently cannot produce bit-identical results for a non-trivial transformer.
-   Field determinism work (batch-invariant / fixed-reduction kernels, "defeating nondeterminism in LLM
-   inference") delivers **run-to-run determinism WITHIN one engine**; it does **not** and cannot deliver
-   **cross-engine** bit-exactness (that needs identical kernel+tiling+reduction-order+dtype for *every*
-   op). Cross-engine bit-exactness is essentially never achieved in practice. Bit-exactness is only a
-   meaningful gate **within** an engine (how llama patches 0018-0021 are validated by md5).
-
-## Greedy-token match (argmax robustness) - the right weaker form, but probabilistic
-Because logits differ mostly in low-order bits (A4-A7) plus a few-significant-digit GEMM/recurrence gap
-(A1-A2), the **argmax** coincides whenever the top-1/top-2 logit margin exceeds the cross-engine
-noise, which is frequent. This is the only meaningful cross-engine "equivalence"; gate on **top-1 agreement /
-KL / PPL-delta**, never md5. Caveats: not guaranteed per-token (low-margin steps can flip); it
-**compounds** - once one greedy token differs the sequences fork and the KV/SSM states diverge, so
-agreement degrades with length (high on short continuations, drift on long ones); and the FP4 A4-vs-A8
-gap (A1) makes the per-step divergence *larger* here than a same-precision bf16-vs-bf16 comparison,
-weakening greedy agreement for this model specifically.
-
-**Bottom line:** target near-vLLM via KL/PPL/top-1-agreement, not bit-exactness. Reserve bit-exact
-gating for intra-llama validation (the f32 recurrence-occupancy lever and the conv fusion qualify;
-bf16 state does not and must be KL/PPL-gated vs llama's own f32).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md
+++ b/backend/cpp/llama-cpp/patches/paged/BYTEGATE_PROGRESS.md
@@ -1,53 +0,0 @@
-# GDN Recurrence Byte-Gate - Progress (agent: ncu-byte-gate)
-
-## Hard blocker on direct DRAM counters
- ncu HW perf counters: ERR_NVGPUCTRPERM (NVreg_RestrictProfilingToAdminUsers=restricted, root-only).
- nsys --gpu-metrics-devices: same ERR_NVGPUCTRPERM.
- No passwordless sudo on dgx.casa. DRAM byte counters UNOBTAINABLE without root.
- FALLBACK (decisive, no perfcounters needed): CUPTI kernel TIMING (allowed) + exact byte
-  geometry from kernel source => implied effective BW + a hard mathematical cap on re-stream factor.
-
-## Byte geometry (exact, from gated_delta_net.cu + GGUF)
- Qwen3.5 dense q36-27b-nvfp4: 48 GDN layers, H=48 v-heads, S_v=128 (square state 128x128/head).
- State per (seq,head) = 128*128 f32 = 64 KiB. Per seq = 48*64KiB = 3.0 MiB.
- Kernel is SINGLE-PASS by construction: loads s_shard[] ONCE into regs, recurrence in-register,
-  writes state ONCE (read_state coalesced 128 consecutive f32/warp; writeback coalesced).
-  l2norm/sigmoid/softplus/gate act on small q/k/g/beta (NOT the 805MB state); gather no-ops at
-  steady decode (identity seqs). => NO multi-pass state re-streaming exists to fuse away.
- Minimal bytes/call (B=128): state R+W = 128*48*16384*4*2 = 805.3 MB; +q/k/v/out ~10 MB = ~816 MB.
- Floor time @273 GB/s = 816MB/273 = 2.99 ms/call.
-
-## Measured (clean nsys CUDA timing, graphs OFF, npp8 ntg12 npl128, build-cuda-base df1cc97)
- llama gated_delta_net_cuda steady decode: 480 calls, grid(48,128,32), avg 3.98 ms/call
-  (min 3.90, max 4.33; very tight => bandwidth-bound). 48 layers => 191 ms/step (50% of 384 ms).
- Implied effective BW @1.0x bytes = 816MB/3.98ms = 205 GB/s = 75% of 273 peak.
- HARD CAP: max bytes movable in 3.98ms @273 peak = 1.087 GB = 1.33x minimal.
-  => re-stream factor in [1.0x, 1.33x]. 2x re-streaming PHYSICALLY IMPOSSIBLE.
-  Source proves single-pass+coalesced => ~1.0x, kernel at ~75% peak.
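The arithmetic behind the byte volume, the implied bandwidth, and the re-stream cap can be reproduced in a few lines. This is a minimal sketch, not code from the tree: the function names are hypothetical, and the inputs (B=128, H=48, S_v=128, f32 state, 3.98 ms/call, 273 GB/s peak, ~816 MB minimal bytes) are the figures quoted in these notes.

```cpp
#include <cassert>
#include <cstdint>

// State bytes read + written per kernel call (one GDN layer).
// Assumes a square s_v x s_v state per (sequence, head), touched once each way.
inline std::int64_t state_bytes_rw(std::int64_t n_seqs, std::int64_t n_heads,
                                   std::int64_t s_v, std::int64_t elem_bytes) {
    return n_seqs * n_heads * s_v * s_v * elem_bytes * 2; // x2 = read + write
}

// Effective bandwidth in GB/s when `bytes` are moved in `ms` milliseconds.
inline double effective_gbps(double bytes, double ms) {
    assert(ms > 0.0);
    return bytes / (ms * 1e-3) / 1e9;
}

// Upper bound on the re-stream factor: bytes movable at peak bandwidth in the
// measured time, divided by the minimal byte count.
inline double restream_cap(double peak_gbps, double ms, double min_bytes) {
    assert(min_bytes > 0.0);
    return (peak_gbps * 1e9 * ms * 1e-3) / min_bytes;
}
```

Called with the numbers above, `state_bytes_rw(128, 48, 128, 4)` gives the 805.3 MB state volume, `effective_gbps(816e6, 3.98)` lands at roughly 205 GB/s, and `restream_cap(273.0, 3.98, 816e6)` at roughly 1.33x.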
-
-## Conv-path (same trace, steady-decode region kernels, per-call):
- ssm_conv_f32: 672 calls whole-trace avg 135.9us (incl prefill); decode-region TBD
- concat_cont: 576 calls avg 169.6us ; concat_non_cont 96 calls (prefill big)
- cpy_scalar: 896 calls avg 123.7us ; gdn_gather_nonident 672 calls avg 153.9us (mostly no-op)
-
-## vLLM (apples-to-apples: NSEQ=128, enforce_eager=True; postssm_decomp/vllm_decode.sqlite)
- vLLM state dtype = model_dtype = BF16 (_mamba_state_dtype default "auto"; config dtype=bfloat16).
-  Geometry identical to llama (H=48, k/v head_dim 128, S_v 128).
- vLLM fused_recurrent_gated_delta_rule_packed_decode steady: 3.62 ms/call (grid 4x6144x1),
-  bf16 state R+W = 402.6 MB => 111 GB/s = 41% peak. SINGLE-PASS (load p_h0 once -> f32 regs ->
-  store bf16 once).
- llama 3.98 ms/call, f32 805.3 MB => 202 GB/s = 74% peak. llama kernel is MORE BW-efficient.
-
-## Conv-path (llama steady decode, per call x48 layers)
- concat_cont 169.6us (8.14 ms/step) + cpy_scalar 120.1us (5.76) + ssm_conv_f32 115.9us (5.56)
-  = ~19.5 ms/step. Conv state ~12.6 MB (tiny) => LAUNCH-bound, not byte-bound => fusion lever (~5%).
- l2_norm 6.8us, gdn_gather 1.21us (no-op identity seqs => gather does NOT re-stream state).
-
-## FINAL VERDICT (DONE)
- llama re-stream factor ~1.0x (hard cap <=1.33x; >=1.5x physically impossible @273 peak).
- NO-BUILD fused single-pass recurrence: already single-pass, coalesced, 74% peak (> vLLM 41%);
-  gate ops touch tiny q/k/g/beta, not the 805MB state => recovers ~0 state bytes.
- BUILD bf16 SSM state (design lever (2)): the 2x gap vs vLLM is 100% f32-vs-bf16 cache width.
-  805->413 MB => ~45-95 ms/step => step 384 -> 289-339 ms = parity-to-ahead of vLLM 327.
-  Non-bit-exact vs llama f32 but equal to vLLM's own bf16 precision.
- Findings written: GDN_RECURRENCE_BYTE_GATE.md (MEASUREMENT + VERDICT section appended).
--- a/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md
+++ b/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md
@@ -1,499 +0,0 @@
-# Durable scope: token-granular continuous-batch scheduler for llama-server on GB10
-
-Build-ready plan. **Not implemented in this workflow** (serving-loop rewrite). This
-document scopes the durable path to give llama-server's `update_slots()` a vLLM-v1-style
-token-granular continuous-batch scheduler, and records the single honest finding that
-re-shapes what the change can and cannot buy.
-
-Hardware: NVIDIA GB10 / DGX Spark (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
-LPDDR5x ~273 GB/s. Models: dense Qwen3.6-27B NVFP4 (`~/bench/q36-27b-nvfp4.gguf`),
-MoE Qwen3.6-35B-A3B NVFP4 (`~/bench/q36-35b-a3b-nvfp4.gguf`). Dev tree `~/llama-paged-dev`
-(branch `paged`, HEAD `151343b`, patch 0015), `build-cuda` sm_121, `LLAMA_KV_PAGED=1`.
-Scheduler code: `tools/server/server-context.cpp::update_slots()` (LocalAI override that
-`#include`s it: `backend/cpp/llama-cpp/grpc-server.cpp`).
-
-## TL;DR (the honest reframe)
-
-Three findings, read directly from the source at HEAD `151343b` and from the committed
-NVFP4 re-run (`QWEN36_NVFP4_BENCH.md`), collapse the apparent size of this work and reset
-what it is allowed to claim:
-
-1. **The unified mixed batch already exists.** `update_slots()` already builds ONE
-   `llama_batch` per step = {every ready decode token} **+** {a bounded chunk of prefill
-   tokens}, in a fixed two-phase order: Phase 1 (lines 2604-2719) appends every
-   `SLOT_STATE_GENERATING` slot's sampled token **unconditionally** (no budget gate), then
-   Phase 2 (lines 2753-3330) fills the remaining batch capacity with prompt tokens. Decode
-   is therefore **already claimed first and never dropped or capped** - the exact property
-   vLLM's "RUNNING-before-WAITING" pass works to guarantee is **free** here by construction.
-
-2. **The chunked-prefill slot state already exists and already persists across steps.** A
-   slot in `SLOT_STATE_PROCESSING_PROMPT` with `slot.prompt.n_tokens() < slot.task->n_tokens()`
-   is a partial prefill; it stays in that state and resumes next step until its prompt is
-   fully ingested, at which point it flips to `SLOT_STATE_DONE_PROMPT` -> `GENERATING`
-   (line 3252, then 3502). Multiple slots can be `PROCESSING_PROMPT` and `GENERATING`
-   simultaneously; there is **no global "one prefill at a time" gate**. So the mission's
-   "allow a slot to be mid-prefill while others decode in the same step" is **not a state
-   machine to build - it is already the behaviour.** This is the single biggest de-risking
-   fact in this document.
-
-3. **What is genuinely missing is the budget POLICY, and it is small.** Patch 0013
-   (`LLAMA_PREFILL_BUDGET`) is a single **static** per-step prefill cap, consumed greedily by
-   slots in iteration order. It is not decode-load-aware (does not subtract the live decode
-   count `D`), not adaptive (one constant across npl 8..128), and not fair (the first
-   `PROCESSING_PROMPT` slot can eat the whole budget). The durable delta is to convert that
-   static cap into vLLM's **dynamic, decode-first, per-slot-fair token budget**: one total
-   per-step budget `T`, decode claims its `D` tokens first, prefill gets the **leftover**
-   `T - D` distributed across waiting prompts with a per-slot cap. That is ~the only
-   behavioural change. **No new slot states, no batch-formation rewrite.**
-
-### The honest ceiling (this is load-bearing for how the work is scoped and sold)
-
-The committed re-run and a dedicated profiling pass (`QWEN36_NVFP4_BENCH.md`, plus
-`~/bench/stag_128.json`) establish that **the residual ~2.4x high-concurrency decode gap is a
-decode-KERNEL batch-scaling ceiling, not a scheduler defect**:
-
- At npl8 the kernels are **at parity** (dense 99%, MoE 84% of vLLM decode).
- A clean staggered full-batch-128 run, with **all 128 slots cleanly decoding and zero
-  prefill starvation**, still tops out at **decode_agg 157.4 tok/s** (dense) - the same
-  ~157-161 ceiling that four independent measurements converge on. vLLM does **390.7** at the
-  same effective batch. With a *perfect* scheduler the kernel still gives ~157. **The
-  scheduler cannot lift this.**
- Patch 0013 budget-256 **already reaches ~161** (the ceiling) at npl128. So a token-granular
-  scheduler buys **little additional steady-state decode_agg** over 0013 on the all-at-once
-  workload.
-
-Therefore this scheduler's deliverable is **NOT "match vLLM's 391/811 decode."** It is:
-
- **Close the 12x TTFT gap** (dense 305 s @ 0013 / 491 s stock -> vLLM's ~25 s, and ~2 s on
-  staggered arrival) - the genuine, large win.
- **Robustly HOLD the decode ceiling** (~161 dense / ~333 MoE @npl128) **without
-  per-workload budget tuning** - 0013 needs a hand-picked constant (256 for dense, costs MoE
-  TTFT, net-negative at low npl); the dynamic `T - D` budget is self-tuning across the whole
-  npl range and across dense vs MoE.
- **Burst-robustness**: bounded TTFT for *all* concurrently-arriving prompts (kill the
-  burst-TTFT spread), and no admission collapse under sustained load.
-
-Closing the residual 2.4x decode-throughput gap is a **separate, named lever**: the
-paged-attention **decode-kernel** batch-scaling work (patches 0009-0011 territory) and/or
-CUDA-graphed decode. It is called out explicitly in P3 and is **out of this scope's
-scheduler mandate**. We must measure and sell this work on **TTFT + burst-robustness +
-self-tuning hold of the ceiling**, never on a decode_agg number the kernel forbids.
-
-## The gap, precisely localized (recap of the committed bench)
-
-At matched NVFP4 on one GB10 box (`QWEN36_NVFP4_BENCH.md`), llama (patch 0015) vs vLLM 0.23.0,
-decode_agg tok/s | TTFT mean, npl swept 8/32/64/128:
-
-| npl | dense llama (0013 b256) | dense vLLM | MoE llama (0013 b256) | MoE vLLM |
-|----:|------------------------:|-----------:|----------------------:|---------:|
-| 8   | 63.5  / 4.3 s   | 64.3  / 2.6 s | 169.3 / 1.7 s  | 202.0 / 0.8 s |
-| 32  | 105.7 / 23.1 s  | 189.8 / 7.5 s | 239.0 / 9.0 s  | 462.0 / 2.3 s |
-| 64  | 132.0 / 109 s   | 284.2 / 13 s  | 277.0 / 16.2 s | 624.5 / 4.1 s |
-| 128 | **161.2 / 305 s** | 390.7 / 24.8 s | **333.5 / 98 s** | 811.1 / 8.0 s |
-
-Both models converge to the **same ~41% of vLLM decode at npl128** after 0013. That
-convergence is the signal: once prefill starvation is removed, a dense model and a
-12x-cheaper-prefill MoE land on the **identical** ceiling -> the residual is **not prefill**
-and **not the kernel at npl8** (which is at parity) - it is the **quality of the per-step batching
-decision** (TTFT/robustness) plus the **kernel decode ceiling** (the throughput residual).
-This scope addresses the first; it names the second as the separate lever.
-
-## What already exists (reuse, do NOT rebuild)
-
-All line numbers verified at `tools/server/server-context.cpp` HEAD `151343b`.
-
- **[A] decode-first co-batch** - Phase 1, lines 2604-2719. Iterates `slots`; every
-  `SLOT_STATE_GENERATING` slot (gated only by `can_batch_with`, line 2611) is pushed to
-  `generating[]`; line 2715-2719 `for (slot : generating) slot.update_batch(batch)` appends
-  its sampled token (+ draft tokens) via `common_batch_add`. After this loop,
-  `batch.n_tokens == D` (the decode-token count). **No budget gate** - decode always goes in.
- **[B] chunked-prefill state per slot** - the pair `slot.prompt.n_tokens()` (=
-  `num_computed_tokens`) vs `slot.task->n_tokens()` (= `num_tokens`). A `PROCESSING_PROMPT`
-  slot with `prompt.n_tokens() < task->n_tokens()` resumes next step (Phase 2 re-enters it).
-  Transition to `DONE_PROMPT` at line 3252 when the prompt is exhausted; to `GENERATING` at
-  line 3502. **This is exactly vLLM's "leave the request in `running`, advance
-  `num_computed_tokens` next step" - already implemented.**
- **[C] single shared batch + compute chunking** - one `llama_batch` holds decode+prefill;
-  the compute loop (lines ~3366-3378) `for (i=0; i<batch.n_tokens; i+=n_tokens){ n_tokens =
-  min(n_batch, batch.n_tokens-i); llama_decode(batch_view); }` runs it as one `llama_decode`
-  when `batch.n_tokens <= n_batch`; `n_ubatch` (512) splitting happens inside `llama_decode`.
- **[D] patch 0013 static prefill budget** - the thing to supersede. Read once at lines
-  2737-2747 (`n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))`, a CONSTANT for
-  the run); enforced as an extra `while` predicate at line 3188 (`n_prompt_budgeted <
-  n_prefill_budget`), counter at 3214, outer break at 3326. `0` = disabled = byte-identical
-  stock.
- **[E] productization seam** - `backend/cpp/llama-cpp/grpc-server.cpp` lines 781-791 parse
-  the model option `max_prefill_tokens` / `mpt` / `prefill_budget` and `setenv`
-  `LLAMA_PREFILL_BUDGET` before context init (same pattern as `kv_paged`). New knobs hang off
-  this seam identically.
- **[F] paged KV (patches 0001-0011)** - on-demand block allocation keyed by sequence
-  position. Batch formation only changes **which** tokens are in a step; paged alloc is
-  driven by the per-slot sequence positions, which are unchanged. Orthogonal (see Correctness).
-
-## vLLM v1 reference algorithm (the target, for fidelity)
-
-From `vllm/v1/core/sched/scheduler.py::schedule()` (0.23.0, on the box). The unifying idea:
-there is no prefill phase vs decode phase. Every request advances `num_computed_tokens`
-toward `num_tokens` by up to N this step; for a decoder N=1, for a prefiller N=remaining
-prompt. One per-step `token_budget = max_num_batched_tokens` bounds the TOTAL (decode +
-prefill). Pass 1 visits `running` first (decoders cost 1 each -> all decode claimed before
-any prefill is sized); Pass 2 admits `waiting` (new prompts) only with leftover budget, each
-chunked by `min(remaining_prompt, long_prefill_token_threshold, leftover_budget)`. Caps:
-`max_num_seqs` (concurrent sequences), `long_prefill_token_threshold` (~4% of max_model_len,
-per-request prompt-chunk cap so one giant prompt cannot monopolize a step). Net: decode batch
-maximal every step (-> the GEMM-batching throughput vLLM gets), prefill always makes bounded
-progress (-> low, flat TTFT), one `model.forward()` per step.
-
-The mapping to llama is clean because [A]+[B] already give us "running visited first" and
-"prefiller resumes next step." We are missing only: **one total budget `T`, leftover `T - D`
-sizing, and the per-request chunk cap with fair distribution.**
-
-## The unified per-step batch-formation algorithm (the design)
-
-New knobs (all default to current behaviour; env set before context init like `LLAMA_KV_PAGED`):
-
- `T` = `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step token
-  budget (decode + prefill), the analogue of `max_num_batched_tokens`. Default `n_batch`
-  (2048). Clamped `T = min(T, n_batch)` so the existing single-`llama_decode` chunking is
-  unchanged.
- `PREFILL_CAP` = `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot max prompt tokens per
-  step, the `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04 * n_ctx))`,
-  floored at `n_ubatch` (512) so a single prompt still makes a full ubatch of progress.
- Back-compat: if only the legacy `LLAMA_PREFILL_BUDGET` is set (new knobs unset), behave
-  exactly as 0013 (static cap) - 0013 is the degenerate `T = n_batch`, no-leftover case.
-
-Pseudocode, mapping to real variables and seams (the `>>` lines are the change vs today):
-
-```
-common_batch_clear(batch);                                  // line 2594
-
-// PASS 1 - DECODE FIRST (unchanged: lines 2604-2719)
-for (slot : slots) if (slot.state == GENERATING && can_batch_with) generating.push(slot);
-... speculative draft ...
-for (slot : generating) slot.update_batch(batch);           // appends decode (+draft) tokens
-
->> D = batch.n_tokens;                                       // NEW seam: decode load is now final (after 2719)
->> T = min(LLAMA_MAX_BATCH_TOKENS ? LLAMA_MAX_BATCH_TOKENS : n_batch, n_batch);  // unset (0) -> n_batch
->> prefill_budget_step  = max(0, T - D);                     // DYNAMIC leftover, auto-shrinks with D
->> prefill_cap_per_slot = PREFILL_CAP;                       // long_prefill_token_threshold analogue
->> n_prompt_budgeted    = 0;                                 // total prompt tokens added this step (subsumes 0013)
-
-// PASS 2 - PREFILL FILLS THE LEFTOVER (lines 2753-3330, budget made dynamic + per-slot fair)
-if (cont_batching || batch.n_tokens == 0) {
->>  for (k = 0; k < n_slots; ++k) {                          // round-robin start offset (fairness, see P2)
->>      slot = slots[(rr_start + k) % n_slots];
-        if (!slot.is_processing() || !can_batch_with) continue;
-        if (slot.state == STARTED) slot.state = PROCESSING_PROMPT;     // line 2782 (unchanged)
->>      slot_prompt_added = 0;                               // NEW: per-slot chunk counter (reset each slot)
-        // inner prompt-fill (lines 3187-3239), guard now triple-bounded:
-        while (slot.prompt.n_tokens() < slot.task->n_tokens()
->>             && batch.n_tokens   < T                       // was: < n_batch
->>             && n_prompt_budgeted < prefill_budget_step    // was: 0013 static n_prefill_budget
->>             && slot_prompt_added < prefill_cap_per_slot) {// NEW: per-slot cap -> fair distribution
-            common_batch_add(batch, cur_tok, pos_next, {slot.id}, need_embd);
-            slot.prompt.tokens.push_back(cur_tok);
-            slot.n_prompt_tokens_processed++;
-            n_prompt_budgeted++; slot_prompt_added++;
-            ... checkpoint-boundary breaks (unchanged) ...
-        }
-        if (slot.prompt.n_tokens() == slot.task->n_tokens()) slot.state = DONE_PROMPT;  // line 3252
-        ... checkpoint creation (unchanged) ...
->>      if (batch.n_tokens >= T) break;                      // was: >= n_batch (line 3320)
->>      if (n_prompt_budgeted >= prefill_budget_step) break; // was: 0013 break (line 3326)
-    }
-}
-
-for (i=0; i<batch.n_tokens; i+=n) { n=min(n_batch,batch.n_tokens-i); llama_decode(view); }  // unchanged
-```
-
-The whole change is: (a) compute `prefill_budget_step = T - D` at the new seam after line
-2719 instead of reading a static env constant at 2737; (b) bound the inner/outer loops by `T`
-and the dynamic budget instead of `n_batch` and the static budget; (c) add `slot_prompt_added`
-with `prefill_cap_per_slot` for per-slot fairness; (d) a round-robin start offset so the same
-early slots do not always win the leftover.
-
-**Why this holds the decode ceiling without tuning.** `T` bounds total tokens per step ->
-bounds step compute time -> decode steps fire at a steady high rate (high decode-steps/sec).
-As decode load `D` rises, `prefill_budget_step = T - D` auto-shrinks, so prefill never inflates
-the step beyond `T` even at npl128. This is the mechanism by which 0013's hand-tuned 256
-reaches 161; here it is reached **automatically across the npl range** because the budget is
-`T - D`, not a constant. **Why this closes TTFT.** Prefill gets a non-zero leftover whenever `D < T`
-(`prefill_budget_step = max(0, T - D)`, and `T` is sized so the leftover stays > 0 until the box is
-fully decode-saturated), distributed across waiting prompts by `prefill_cap_per_slot`, so every prompt
-makes bounded progress every step instead of waiting for a dedicated prefill burst.
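The budget arithmetic described above is small enough to isolate. The following is a hedged sketch, assuming an unset `LLAMA_MAX_BATCH_TOKENS` is represented as a value <= 0; `step_budget` and `compute_step_budget` are hypothetical names for illustration and do not exist in llama.cpp.

```cpp
#include <algorithm>
#include <cassert>

// Per-step budget, mirroring the names used in the pseudocode.
struct step_budget {
    int total;    // T: configured cap clamped to n_batch
    int prefill;  // max(0, T - D): leftover available to prompt tokens
};

// max_batch_tokens <= 0 is treated as "unset" and falls back to n_batch
// (an assumption of this sketch, not confirmed behaviour).
inline step_budget compute_step_budget(int max_batch_tokens, int n_batch, int n_decode) {
    assert(n_batch > 0 && n_decode >= 0);
    const int T = std::min(max_batch_tokens > 0 ? max_batch_tokens : n_batch, n_batch);
    return { T, std::max(0, T - n_decode) };
}
```

With the default `n_batch` of 2048 and 128 decoding slots, the leftover is 1920 prompt tokens. The leftover shrinks one-for-one as `D` grows and reaches 0 only once `D >= T`.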
-
-## Slot state machine changes (minimal - this is the headline de-risk)
-
-**No new states. No state-transition rewrite.** The existing 6-state machine
-(`IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT / GENERATING`, lines 67-72)
-already encodes everything:
-
- "mid-prefill while others decode" = a `PROCESSING_PROMPT` slot coexisting with `GENERATING`
-  slots in the same step. **Already happens** (Phase 1 and Phase 2 populate one batch).
- "chunked-prefill state per slot" = `(state == PROCESSING_PROMPT) && (prompt.n_tokens() <
-  task->n_tokens())`. **Already persisted** across `update_slots()` calls; Phase 2 re-enters
-  the slot and resumes from `prompt.n_tokens()`.
-
-The only **additions** are per-step scheduler scratch, not slot lifecycle state:
-
-1. `slot_prompt_added` - a per-slot, per-step counter (local to the Phase-2 loop body), for
-   the per-slot chunk cap. Not stored on the slot across steps.
-2. A `rr_start` round-robin offset (one `size_t` on the server, advanced each step) so the
-   leftover budget is distributed fairly across `PROCESSING_PROMPT` slots rather than always
-   draining the lowest-index slot first (this is what kills the burst-TTFT *spread* - without
-   it, slot 0's prompt finishes first every time and the last slots starve).
-3. Optional, P2: a per-step admission cap `K` on how many `STARTED -> PROCESSING_PROMPT`
-   transitions begin in one step. This falls out of the budget arithmetic already (a bounded
-   `prefill_budget_step` with a per-slot floor admits only `~budget/floor` prompts/step), so it
-   may need no explicit code; if made explicit it is the `max_num_seqs`-style "don't admit a
-   new prefill if the step is full" guard, mapped onto the pre-allocated `n_parallel` slots.
-
-That is the entire state-machine footprint: a per-step counter, a server-level round-robin offset, and an optional cap.
-The mission's feared "slot-state rewrite" does not materialize.
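How the per-slot cap and the round-robin offset interact can be shown on plain integers. This is an illustrative sketch only: `distribute_prefill` is a hypothetical helper, slots are reduced to their remaining prompt-token counts, and the real loop would also honour `can_batch_with` and the checkpoint-boundary breaks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// remaining[i] = prompt tokens slot i still has to ingest (0 = not prefilling).
// Returns the number of prompt tokens granted to each slot this step.
inline std::vector<int> distribute_prefill(const std::vector<int> & remaining,
                                           int budget, int cap_per_slot,
                                           std::size_t rr_start) {
    assert(budget >= 0 && cap_per_slot >= 0);
    const std::size_t n = remaining.size();
    std::vector<int> granted(n, 0);
    for (std::size_t k = 0; k < n && budget > 0; ++k) {
        const std::size_t i = (rr_start + k) % n;   // rotate the starting slot
        const int take = std::min({remaining[i], cap_per_slot, budget});
        granted[i] = take;
        budget -= take;                             // shared leftover shrinks
    }
    return granted;
}
```

With a budget of 1024 and a cap of 512, two slots are served per step. Advancing `rr_start` between steps changes which two, so the lowest-index slot does not win every time.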
-
-## How it supersedes / subsumes patch 0013
-
-| property | 0013 (static cap) | this scheduler (dynamic `T - D`) |
-|----------|-------------------|----------------------------------|
-| per-step prefill bound | constant `n_prefill_budget` | `T - D`, shrinks as decode load rises |
-| decode-load aware | no (ignores `D`) | yes (leftover after decode) |
-| works across npl 8..128 with one config | no (256 best @128, net-negative @8) | yes (self-tuning) |
-| fair across multiple waiting prompts | no (greedy, slot 0 wins) | yes (`prefill_cap_per_slot` + round-robin) |
-| TTFT on bursty arrival | raises it (defers first tokens) | bounded for all prompts |
-| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - **kept** |
-
-0013 is the **degenerate case** `T = n_batch` with `prefill_budget_step` pinned to a constant
-and no per-slot cap. The patch keeps `LLAMA_PREFILL_BUDGET` working for back-compat (when the
-new knobs are unset). When `LLAMA_MAX_BATCH_TOKENS` is set, the static path is replaced by the
-dynamic one. **Default (all knobs unset) = byte-identical stock**, exactly like 0013.
-
-## Correctness
-
- **KV cache during chunked prefill** - unchanged from today. A `PROCESSING_PROMPT` slot already
-  advances `slot.prompt.tokens` / `pos_next()` chunk by chunk across steps; we only change the
-  chunk SIZE per step, not how positions or sequence ids are assigned. `common_batch_add`
-  receives the same `(tok, pos, {slot.id})` tuples in the same order. No new KV state.
- **Determinism** - greedy (temp 0) output can differ from a single-`n_batch`-chunk run only by
-  the **intrinsic flash-attn chunk-size FP grouping** that 0013 already documented and bounded:
-  pure stock `-b256` diverges from `-b2048` the same way with this patch inactive; output stays
-  coherent and answers correctly. The op-level math per token is position-determined and
-  unchanged; only the FA reduction grouping over a step's token mix shifts. The deterministic
-  oracle is the CPU backend / the op test (bit-exact); the GB10 CUDA greedy-decode band applies
-  to end-to-end only, never to the op test.
- **Paged KV (patches 0001-0011)** - **orthogonal**. Paged on-demand block allocation is keyed
-  by sequence position and slot/stream, which this change does not touch; it changes only which
-  tokens are in a given `llama_decode`. The in-kernel paged decode read (0009-0011) operates
-  per-token via the block tables regardless of what prefill tokens are co-batched. Required gate:
-  run the full P0-P2 suite with `LLAMA_KV_PAGED=1` **and** `=0` and confirm **identical
-  scheduling decisions** (same per-step token counts, same admission order) - paged must be a
-  no-op on the scheduler.
- **`can_batch_with` constraint** (line 302) - a batch admits only slots with the same
-  `task->type` and equal LoRA. Homogeneous-completion serving (the benchmark and the dominant
-  LocalAI case) satisfies it, so the mixed decode+prefill batch forms freely. Mixed task types /
-  per-request LoRA fall back to separate batches - a pre-existing bound, not a regression; note
-  it, do not try to lift it here.
- **Checkpoint interaction (a real, orthogonal serving defect to account for)** - each slot that
-  reaches `DONE_PROMPT` may call `create_checkpoint` (line 2147), ~149 MiB per checkpoint on the
-  dense 27B, gated by `n_ctx_checkpoints > 0` (line 3133). Profiling found that under sustained
-  heavy load the checkpoint subsystem **thrashes**: admission collapsed to one slot every ~13 s,
-  zero decoding for 290 s, while `/slots` itself serialized behind a 13 s `update_slots` step.
-  This is **independent** of the decode/prefill mix but it **masks** the scheduler's win if left
-  on. **P0 must isolate it** (run with `n_ctx_checkpoints=0`), and **P2's admission decision
-  should be checkpoint-cost-aware** on the 128 GB unified box (do not admit a fresh prefill whose
-  checkpoint would thrash the pool). Treat as a named co-defect, not part of the core batching
-  change.
-
-## Phased plan P0 -> P3 (work, payoff, files, risk)
-
-| Phase | Work | Files | Expected payoff (dense / MoE @npl128 unless noted) | Risk |
-|-------|------|-----------------------------------------------------|-------|------|
-| **P0** baseline + metrics harness | Per-step effective-decode-batch poller (`/slots`), TTFT percentiles (p50/p90/p99/max), `decode_agg` over the fully-overlapped window, decode-ITL (worst freeze / median), **step-time histogram**, admission rate (slots/s reaching GENERATING), checkpoint-event log. Lock the staggered-arrival ceiling (**157.4** dense, all-128 clean) and the all-at-once burst pathology as the two reference traces. Isolate checkpoints (`n_ctx_checkpoints=0`). | dev-tree only: `~/bench/` (reuse `stag.py`, `slot_poll.py`, `h2h_cli.py`, `h2h_moe_sweep.sh`; `stag_128.json`, `h2h_real128b.json`) | **None** (gate). Locks correctness + the 157/333 ceiling so any regression is caught. | Low |
-| **P1** unified mixed-batch formation | Replace the static budget read (2737-2747) with the **dynamic `T - D`** computed at the new seam after line 2719; bound the inner/outer Phase-2 loops by `T` (3188, 3320) and `prefill_budget_step` (3326) instead of `n_batch` and the static cap. No per-slot cap, no round-robin yet (that is P2). | `tools/server/server-context.cpp` (seam @2719, knob read, 3188, 3320, 3326); mirror to `0016-paged-continuous-batch-scheduler.patch` | **TTFT**: removes the burst penalty 0013 inflicts - staggered TTFT ~2 s, burst TTFT collapses toward vLLM's ~25 s / 8 s. **Decode**: holds the ceiling **(~161 / ~333)** *without per-workload tuning* (0013 needed 256 hand-picked). No new throughput beyond the ceiling - by design. | Low-Med (loop-bound edits in a hot path; default-off gate makes it byte-identical stock) |
-| **P2** scheduling policy / fairness | Add `slot_prompt_added` + `prefill_cap_per_slot` (the `long_prefill_token_threshold` analogue) and the **round-robin start offset**; optional explicit per-step admission cap `K` + checkpoint-cost-aware admission. Tune `T`, `PREFILL_CAP` on GB10 (dense vs MoE, npl 8/32/64/128). | `server-context.cpp` (Phase-2 loop body @2753-3330, server-level `rr_start`); `grpc-server.cpp` (options `max_batch_tokens`/`mbt`, `prefill_cap` @781-791) | **TTFT spread**: bounds first-token latency for **all** concurrently-arriving prompts (kills the burst-TTFT spread, e.g. dense max 305 s -> single-digit-s on staggered, bounded on burst). **Robustness**: no admission collapse under sustained load; decode batch stays maximal so the *time-averaged* decode_agg on real (non-burst) traffic rises toward the staggered 157/333 because slots reach GENERATING fast. | Med (fairness + admission logic; e2e coherence + A/B vs 0013 required) |
-| **P3** residual decode throughput | **Honest boundary: this is the decode-KERNEL lever, NOT the scheduler.** The scheduler has delivered TTFT + robustness + ceiling-hold. Closing the residual 2.4x (161 -> 391 dense, 333 -> 811 MoE) requires paged-attention **decode-kernel** batch-scaling (patches 0009-0011 territory) and/or **CUDA-graphed decode** (the now-uniform decode-only step is graph-capturable). Scope/track separately. | (separate scope) `ggml/src/ggml-cuda/` decode-read kernels; optional CUDA-graph capture seam in `update_slots` | This is **where 391/811 would come from**; it is **out of this scope's mandate** and must not be charged against the scheduler. The scheduler makes the decode step uniform (a precondition that *helps* a future graph capture). | High (kernel work; the GB10 occupancy wall, see below) |
-
-**Per-phase payoff vs the mission targets (TTFT 25 s / 8 s, decode 391 / 811 @npl128):**
-
- **TTFT 25 s / 8 s** - reached by **P1 + P2** (the 12x gap is the scheduler's to close; on
-  staggered arrival it goes below the vLLM burst figure to ~2 s).
- **Decode 391 / 811** - **NOT a P1/P2 deliverable.** P1/P2 hold **161 / 333** (= ~41% of vLLM,
-  the kernel ceiling) robustly and tuning-free. The remaining ~2.4x is **P3 kernel**, a separate
-  lever. Pre-registering this split is the point: the scheduler is judged on TTFT + holding the
-  ceiling, the kernel on the throughput residual.
-
-## GB10 considerations
-
- **Bandwidth floor ~273 GB/s** is the *cause* of the decode ceiling (NVFP4 weight-read +
-  paged-KV gather per step). The scheduler cannot lift a bandwidth/kernel floor - it can only
-  keep the batch *at* the ceiling. Size `T ~= n_batch` (2048) so the compute step stays a single
-  `llama_decode`; `n_ubatch` (512) governs the internal split.
- **`T` is the ITL/TTFT trade knob** (vLLM's `max_num_batched_tokens`): larger `T` = more
-  prefill/step = faster TTFT but bigger per-step ITL spike; smaller `T` = smoother ITL, slower
-  TTFT. Because the budget is `T - D`, the spike is bounded at `T` regardless of decode load.
-  Default `T = n_batch`; expect to tune down toward ~1024 for ITL-sensitive serving.
- **Checkpoint ~149 MiB/slot thrash** on the 128 GB unified box - admission must be
-  checkpoint-cost-aware (P2); P0 measures with checkpoints off to isolate the batching win.
- **Memory**: paged on-demand KV (dense 52->94 GB, MoE 39->61 GB across npl) vs vLLM's flat
-  ~112 GB pre-reservation - llama's standing multi-tenant advantage, unaffected by this change.
- **Eager mode** both engines today; **CUDA-graphed decode** is the P3 kernel lever, and the
-  scheduler's uniform decode-only step is a precondition that *helps* a future capture.
-
-## Biggest risks and how to de-risk
-
-1. **"Slot-state rewrite" (the feared big risk) = actually LOW.** The
-   mid-prefill-while-others-decode state and the chunked-prefill resume already exist ([B]); we
-   add only per-step scratch (`slot_prompt_added`, `rr_start`), not lifecycle states. **De-risk**:
-   keep all 6 states untouched; gate every change behind the new knobs; default-off =
-   byte-identical to 0013/stock, verified by an A/B diff of per-step token counts.
-2. **Correctness regression in the mixed batch = the FA chunk-grouping nondeterminism.** Already
-   documented and bounded by 0013 (stock `-b256` vs `-b2048` diverge identically). **De-risk**:
-   op-test bit-exact where deterministic; greedy-coherence e2e on both models; A/B vs 0013 with
-   the new knobs set to reproduce 0013 (`T = n_batch`, no leftover) and confirm **byte-identical**
-   to 0013.
-3. **Paged-KV interaction = LOW (orthogonal positions).** **De-risk**: run the whole P0-P2 suite
-   with `LLAMA_KV_PAGED=1` and `=0`; assert identical scheduling decisions (paged must be a
-   no-op on batch formation). This is a hard gate, not a spot check.
-4. **Checkpoint thrash masks the win = MEDIUM.** A real serving defect that can swamp the
-   scheduler's signal. **De-risk**: P0 isolates it (`n_ctx_checkpoints=0`); P2 makes admission
-   checkpoint-cost-aware; report the scheduler metrics both with and without checkpoints so the
-   batching win is legible independent of the checkpoint co-defect.
-5. **Honest-payoff risk = the decode_agg number barely moves over 0013 (kernel ceiling), so the
-   work can be mis-judged as "no win."** This is the most important risk to manage. **De-risk**:
-   frame and measure on **TTFT percentiles, burst-TTFT spread, step-time histogram, admission
-   rate, and tuning-free ceiling-hold across npl/dense/MoE** - the axes the scheduler actually
-   moves - and **pre-register the decode-kernel as the separate residual-closer** (P3) so the
-   scheduler is never charged with the 391/811 number the kernel forbids.
-
-## Commit / hygiene
-
-Scope doc only (this file). **No engine change committed in this workflow.** Bench and parity
-scripts stay dev-tree-only (`~/bench/`, `~/llama-paged-dev/benches/`). When P1/P2 are implemented
-they mirror to `backend/cpp/llama-cpp/patches/paged/0016-paged-continuous-batch-scheduler.patch`
-(next free slot after 0015) and the LocalAI option lands in `grpc-server.cpp` beside
-`max_prefill_tokens`. Commit with `git commit -s`, trailer
-`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes. Do not push
-(human pushes).
-
---
-
-## Review / risk (adversarial, source-verified)
-
-Skeptical staff review against the actual source at HEAD `151343b` (server-context.cpp,
-llama-batch.cpp, llama-kv-cache.cpp, paged-*.cpp), grpc-server.cpp in this worktree, and the
-committed `QWEN36_NVFP4_BENCH.md` plus the vLLM H2H serve logs/scripts on the box.
-
-### Verdict: the scope is SOUND. GO on P0 -> P1, CONDITIONAL P2, separate-track P3.
-
-The central de-risking claims check out against the code, and the load-bearing honesty (decode
-residual is a kernel ceiling, not a scheduler defect) is correct and now further corroborated.
-Two calibration fixes are required before P1 (below); neither changes the go decision.
-
-### (1) Tractability - CONFIRMED bounded; zero libllama changes. What enables/blocks it, concretely:
-
- **Enables (already-exercised path, not new surface).** A mixed prefill+decode ubatch with
-  per-seq different `n_past` is the *existing* behaviour. `llama_batch` carries per-token `pos`
-  and `seq_id` (`common_batch_add(batch, tok, pos_next(), {slot.id}, ...)`); `llama_kv_cache` +
-  `paged_alloc::place()` place each `(seq, pos)` independently; `llama_kv_cache::init_batch`
-  (line 742) already splits the mixed batch into ubatches. **The server emits exactly this mixed
-  decode+prefill batch today** - patch 0013 ships it and produces coherent output - so the new
-  scheduler changes only the *count* of prefill tokens, never the batch *structure*. There is no
-  `llama_decode`/ubatch/KV rewrite in scope.
- **Blocks: nothing in libllama.** The only constraints are pre-existing and orthogonal to the
-  target workload: (i) `can_batch_with` (same task type + equal LoRA per batch); (ii)
-  `split_equal(sequential=true)` errors on *coupled* sequences (shared-prompt parallel sampling),
-  forcing `-kvu`. Neither is introduced by this change.
- **Correction to fold in:** the scope's [C] and the pseudocode imply contiguous `split_simple`
-  chunking. The real serving/benchmark config (`--parallel 128`, `kv_unified` default = `false`
-  -> `n_stream = n_seq_max = 128`) takes the **`split_equal(n_ubatch, sequential=true)`** path
-  (llama-kv-cache.cpp:742), which balances per-sequence rather than slicing contiguously. This
-  does not break anything (0013 already hits it) but it means the actual scheduled object is a
-  split_equal ubatch set; P0 must characterize that ubatch shape (not assume contiguous 512-chunks)
-  and the determinism band is over split_equal groupings. Lock the split path (unified vs not) in
-  the A/B so the byte-identical-to-0013 gate is meaningful. grpc seam [E] verified at
-  grpc-server.cpp:761-786 (`kv_paged`, `max_prefill_tokens`/`mpt`); new `mbt`/`prefill_cap` knobs
-  hang off it identically.
-
-### (2) Does it close the gap - the 2.4x is NOT CUDA graphs, and the TTFT root is quantified.
-
- **CUDA graphs ruled out (verified).** Both NVFP4 H2H vLLM servers ran `--enforce-eager`
-  (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`; engine logs show `enforce_eager=True`,
-  `cudagraph_mode=NONE`, `CompilationMode.NONE`). So the npl128 2.4x decode gap is a genuine
-  **eager-mode kernel + per-step host-overhead** gap (ggml graph rebuild/realloc + ~1k kernel
-  launches per step on the weak Grace cores, paged-KV gather, MoE expert gather). The scheduler
-  cannot touch it; the staggered all-128-decoding 157.4 tok/s ceiling is solid. Scope is right to
-  refuse the 391/811 number. (CUDA graphs are a future *both-sides* lever, not the current cause.)
- **The TTFT gap has a measured root the scope under-uses: prefill_tps collapse.** From the bench,
-  llama `prefill_tps` falls 1117 -> 752 -> 465 -> **125** (dense, npl 8/32/64/128) while vLLM holds
-  **flat ~1420** (MoE: 2813 -> 657 vs vLLM flat ~4263). That collapse - not a separate "scheduling
-  quality" abstraction - is the direct cause of the 491 s / 85 s TTFT, and it is exactly what the
-  dynamic `T - D` budget attacks: when decode load `D` is low (early in a burst) the leftover
-  `T - D` lets prefill take ~`n_batch` per step, and because llama's *larger per-step chunk*
-  compensates for its ~2.4x slower steps, a `T = 2048` budget can sustain prefill_tps at or above
-  vLLM's ~1420 during the drain. **So burst-TTFT parity is mechanically plausible, not just
-  "toward"** - the static budget-256 throttles prefill to 256/step (hence its weak 305 s) where the
-  dynamic budget would not. This strengthens P1's case beyond what the doc claims.
- **Mandatory calibration fix:** that TTFT win **couples to a decode-ITL knob**. Spending the full
-  `T - D` on prefill during the drain makes those steps full `T`-token (mixed) computes, so
-  co-batched decoders get 1 token per slow step (ITL spike) *during the drain* - precisely vLLM's
-  tradeoff, navigated by `T`. The 157/333 ceiling is the **post-drain steady state**, not the
-  drain phase. Therefore the scope must **co-report drain-phase decode-ITL alongside TTFT** and
-  treat `T` as the published trade knob; reporting TTFT alone would hide the cost and reporting
-  decode_agg alone would hide the win (it is averaged across drain + steady state, which is why it
-  "barely moves"). Soften "P1+P2 reach 25 s / 8 s": the defensible claim is *staggered/realistic
-  arrival ~2 s, and all-at-once burst approaching vLLM with a tunable decode-ITL cost*.
-
-### (3) Correctness - paged orthogonality confirmed at source; the real risks are config, not code.
-
- **Paged-KV is the same `llama_kv_cache` class** with `paged_alloc::` hooks inside the existing
-  find_slot/placement (llama-kv-cache.cpp:1043-1083), driven by per-slot `(seq, pos)` - which this
-  change does not touch. `init_batch`/split is paged-agnostic. The scope's "orthogonal" claim is
-  verified, not asserted. Keep the hard `LLAMA_KV_PAGED=1` vs `=0` identical-decisions gate.
- **Determinism**: the FA grouping nondeterminism is over **split_equal** ubatches in the real
-  config; the `T = n_batch` A/B-must-be-byte-identical-to-0013 gate is the right oracle and is
-  sound (default-off path is untouched).
- **Low-concurrency regression**: gated to byte-identical when knobs unset; the only live vector is
-  a **mis-tuned `T`** spiking ITL at low npl (the scope already flags `T` defaults). Config hygiene,
-  not a code risk. Add a guard/floor so `T` cannot be set below `n_ubatch`.
-
-### (4) Smaller higher-ROI step - yes, and the scope already contains it (P1).
-
-The minimal high-ROI change is **P1 alone**: replace the static read (server-context.cpp:2737-2747)
-with `prefill_budget_step = max(floor, T - batch.n_tokens)` computed after the decode-fill at line
-2719, and bound the Phase-2 loops by `T` / that budget (3188, 3320, 3326). That is a handful of
-line edits at named seams, default-off, and it captures the self-tuning + the bulk of the TTFT win.
-The even-smaller validation spike: a one-line `n_prefill_budget = max(floor, T - batch.n_tokens)`
-to confirm the prefill_tps/TTFT mechanism before writing the full P1. **P2** (round-robin +
-`prefill_cap_per_slot` + checkpoint-aware admission) is genuinely higher-effort and lower-marginal
-(it buys TTFT *spread*/tail and burst robustness, not the median); **gate P2 on P1's measured
-burst-TTFT-spread and drain-ITL**, do not commit to it up front. There is no smaller step that also
-fixes the static budget's npl-dependence - tuning 0013's constant cannot (256 is net-negative at
-npl8 and costs MoE TTFT), so P1 is the floor.
-
-### Realistic effort / payoff and sequencing
-
- **P0** ~0.5-1 wk (harness largely exists in `~/bench/`): add drain-phase decode-ITL to the metric
-  set, lock the split path, isolate checkpoints (`n_ctx_checkpoints=0`). Gate only.
- **P1** ~2-4 days: small diff + the A/B-vs-0013 byte-identical gate + the npl/dense/MoE sweep.
-  Payoff: self-tuning hold of 161/333 with no hand-picked constant; burst-TTFT 3-10x better than
-  0013 (plausibly approaching vLLM on the burst, parity on staggered), at a published `T`-tunable
-  decode-ITL cost. **This is the high-ROI core and the clean supersession of 0013.**
- **P2** ~1-2 wk, conditional: fairness/admission + checkpoint-cost-awareness + tuning. Payoff: TTFT
-  tail/spread + no admission collapse under sustained load. Worth it only if P1 metrics show a
-  residual spread/robustness problem.
- **P3** separate track, high effort: the *only* path to 391/811 is the eager-kernel + per-step
-  host-overhead residual. Highest-value probe is a **CUDA-graph capture of the steady-state
-  pure-decode step** - but note this works *independent of the scheduler* (the all-128-decoding
-  step is already fixed-shape today); the scheduler neither blocks nor specially enables it, so do
-  not credit graphs to the scheduler. The scope's "uniform decode step is a precondition" is a mild
-  over-claim; correct it to "graphs apply to the pure-decode steady state, which the scheduler does
-  not change."
-
-### Bottom line
-
-GO. The work is correctly localized to `update_slots()` batch-formation policy, requires no
-libllama changes (the mixed per-seq batch is the existing, shipping path), and supersedes 0013
-cleanly. The honest ceiling is real and well-stated; the two fixes are (a) co-report drain-phase
-decode-ITL with TTFT and stop selling/charging the decode_agg number, and (b) acknowledge the
-`split_equal`/`n_stream=128` path in the determinism and ubatch-shape analysis. Sequence
-P0 -> P1, measure, then decide P2; keep P3 (kernel/CUDA-graph) on its own track as the sole owner
-of the 2.4x throughput residual.
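Reviewer note on the scope doc above: the P1 budget rule (`prefill_budget_step = max(floor, T - batch.n_tokens)`) and the P2 round-robin start offset with `prefill_cap_per_slot` can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the `update_slots()` code: the function names `prefill_budget_step` and `grant_prefill`, and the representation of slots as a list of pending prompt-token counts, are hypothetical and exist only for this example.

```python
from typing import List, Optional


def prefill_budget_step(T: int, decode_tokens: int, floor: int = 0) -> int:
    """Leftover per-step token budget for prefill once decode tokens are placed.

    T is the total per-step token budget; decode_tokens is the number of
    decode tokens already in the batch (D in the scope doc).
    """
    return max(floor, T - decode_tokens)


def grant_prefill(pending: List[int], budget: int,
                  cap_per_slot: Optional[int] = None,
                  rr_start: int = 0) -> List[int]:
    """Split `budget` prefill tokens across slots for one step.

    Slots are visited round-robin starting at `rr_start` (hypothetical
    stand-in for the server-level offset); each slot receives at most
    `cap_per_slot` tokens when a cap is set.
    """
    n = len(pending)
    grants = [0] * n
    for k in range(n):
        if budget <= 0:
            break
        i = (rr_start + k) % n
        take = min(pending[i], budget)
        if cap_per_slot is not None:
            take = min(take, cap_per_slot)
        grants[i] = take
        budget -= take
    return grants


# One step with T = 2048 and 128 decoding slots (one decode token each).
budget = prefill_budget_step(2048, 128)
uncapped = grant_prefill([1000, 1000, 1000], budget)
capped = grant_prefill([1000, 1000, 1000], budget, cap_per_slot=512, rr_start=1)
```

Without a per-slot cap the first slot visited absorbs most of the budget; with the cap and a rotating `rr_start`, every waiting prompt advances each step, which is the TTFT-spread effect the P2 row describes.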
--- a/backend/cpp/llama-cpp/patches/paged/CONV_STATE_FUSION_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/CONV_STATE_FUSION_RESULTS.md
@@ -1,106 +0,0 @@
-# Patch 0021: qwen35 decode conv-state in-place fusion (no-regret, bit-exact)
-
-The no-regret conv-state cleanup from the GDN_RECURRENCE_BYTE_GATE design, point (3).
-After the recurrence byte-gate (NO-BUILD: the GDN recurrence is already single-pass at
-the f32 byte floor), the conv path was the only remaining bit-exact decode lever.
-
-## What changed
-
-A new fused op `ggml_ssm_conv_update_inplace` (reuses `GGML_OP_SSM_CONV`, discriminated by a
-non-null `src[3]`) that, on the single-token decode path, replaces the four-op conv chain:
-
-    qkv_mixed transpose -> ggml_concat (build width-K window)   [concat_cont 8.14 ms/step]
-    -> ggml_ssm_conv (depthwise conv)                           [ssm_conv_f32 ~8.6 ms/step]
-    -> ggml_silu                                                [folded into ssm_conv on CUDA]
-    -> ggml_cpy of the shifted ring state into the conv cache   [cpy_scalar 5.76 ms/step]
-
-with ONE kernel that, per (channel, sequence), assembles the width-K window in registers from
-the K-1 cached taps + the current `qkv_mixed` token, computes the depthwise conv with the SAME
-ascending-tap FMA order as `ssm_conv_f32` at i==0, folds silu, writes the conv output, and writes
-the 1-token-shifted ring state back IN PLACE into the conv cache slot at kv_head (the exact slot
-the baseline `ggml_cpy` wrote). Mirrors the 0018 in-place write-back + 0019 patterns. This is
-vLLM's `causal_conv1d_update`.
-
-Files:
- `ggml/include/ggml.h`, `ggml/src/ggml.c`: new builder `ggml_ssm_conv_update_inplace`
-  (src[0]=conv_states [K-1,channels,n_seqs], src[1]=conv_kernel, src[2]=x_cur [channels,1,n_seqs],
-  src[3]=conv_state_dst [(K-1)*channels,n_seqs] in-place ring; op_params[0]=fuse_silu).
- `ggml/src/ggml-cuda/ssm-conv.cu`: kernel `ssm_conv_update_f32<apply_silu,d_conv>` (one thread per
-  (channel,seq)) + `ggml_cuda_op_ssm_conv_update` + a `src[3]`-discriminated branch at the top of
-  `ggml_cuda_op_ssm_conv`.
- `ggml/src/ggml-cpu/ops.cpp`: `ggml_compute_forward_ssm_conv_update_f32` (threads split over
-  channels) + branch in `ggml_compute_forward_ssm_conv`.
- `src/models/delta-net-base.cpp` + `models.h`: `build_conv_state_fused` (keeps the cheap build_rs
-  conv-tap gather; fuses conv+silu+shifted write-back). Read source (gathered scratch) and write
-  target (cache view) are disjoint buffers -> race-free by construction; no ids/identity logic needed.
- `src/models/qwen35.cpp`, `qwen35moe.cpp`, `qwen3next.cpp`: route the single-token decode path
-  (`n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar`) to `build_conv_state_fused`; prefill/chunked/
-  rollback keep the existing concat+ssm_conv+silu+cpy chain.
- `tests/test-backend-ops.cpp`: `test_ssm_conv_update` (16 cases) comparing the fused conv output
-  vs the CPU reference across backends.
-
-## Gate: test-backend-ops (CUDA0 vs CPU reference)
-
- SSM_CONV: 45/45 OK (unchanged path intact)
- SSM_CONV_UPDATE: 16/16 OK (new op; d_conv 3/4 x channels 256/3328 x n_seqs 1/4/32/128)
- SSM_CONV_BIAS_SILU: 90/90 OK
-
-## Gate: greedy bit-exactness (--temp 0 --seed 1 --ignore-eos -n 256, -no-cnv, -fa on)
-
-Byte-identical to the clean Lever-1 (0019/0020) baseline, both models:
-
-| model              | baseline md5                     | fused md5                        | result          |
-|--------------------|----------------------------------|----------------------------------|-----------------|
-| q36-27b-nvfp4      | 675cd52265f2b3d7695c8739946d55ea | 675cd52265f2b3d7695c8739946d55ea | BYTE-IDENTICAL  |
-| q36-35b-a3b-nvfp4  | ac163882eb3812ef08d4c73e6d9a0abf | ac163882eb3812ef08d4c73e6d9a0abf | BYTE-IDENTICAL  |
-
-## decode_agg S_TG (npp128 ntg128, -fa on, -c 33000), same-session before/after
-
-Dense q36-27b-nvfp4:
-
-| mode      | npl | baseline | fused  | delta   |
-|-----------|-----|----------|--------|---------|
-| CUDA-graph| 32  | 199.76   | 202.99 | +1.6%   |
-| CUDA-graph| 128 | 336.35   | 347.14 | +3.2%   |
-| eager     | 32  | 196.07   | 197.61 | +0.8%   |
-| eager     | 128 | 333.62   | 342.97 | +2.8%   |
-
-MoE q36-35b-a3b-nvfp4:
-
-| mode      | npl | baseline | fused  | delta   |
-|-----------|-----|----------|--------|---------|
-| CUDA-graph| 32  | 421.72   | 432.39 | +2.5%   |
-| CUDA-graph| 128 | 689.74   | 713.54 | +3.5%   |
-| eager     | 32  | 421.05   | 432.46 | +2.7%   |
-| eager     | 128 | 689.15   | 713.87 | +3.6%   |
-
-Dense npl128 (production CUDA-graph) lands at 347.14 t/s, in the predicted 346-349 band, and at
-**88.8% of vLLM 391** (up from 86.0%). The lift holds in BOTH graph and eager modes.
-
-## Step time + nsys kernel delta
-
-Per-step decode time (dense npl128, T_TG / ntg=128):
- baseline 48.711 s / 128 = 380.6 ms/step
- fused    47.197 s / 128 = 368.7 ms/step  -> **-11.9 ms/step** (in line with the predicted 12-14 ms/step saving)
- MoE npl128: 185.6 -> 179.4 ms/step (-6.2 ms/step)
-
-nsys eager decode (npp128 ntg24 npl128, 24 decode steps x 48 GDN layers), conv-path kernels:
-
-| kernel              | baseline calls | fused calls | per-step (eager) |
-|---------------------|----------------|-------------|------------------|
-| concat_cont (decode)| 1152           | 0 (GONE)    | 7.95 -> 0 ms     |
-| cpy_scalar (decode) | 1152 of 3648   | 0 (GONE)    | 4.29 -> 0 ms     |
-| ssm_conv_f32 (decode)| 1152 of 2736  | 0 (prefill-only) | 8.65 -> 0 ms |
-| ssm_conv_update     | 0              | 1152        | 0 -> 7.56 ms     |
-
-Decode conv path eager GPU time: ~20.9 ms/step -> ~7.56 ms/step = ~13.3 ms/step saved. concat_cont
-and the decode cpy_scalar are eliminated; ssm_conv at decode is replaced by the fused update kernel.
-Prefill keeps the original chain (concat_non_cont 1584, ssm_conv_f32 1584 unchanged).
-
-## Verdict
-
-Bit-exact, no regression, and lifts decode: dense 336.35 -> 347.14 t/s (+3.2%, 86.0 -> 88.8% of vLLM
-391), MoE 689.74 -> 713.54 t/s (+3.5%) at npl128. Step -11.9 ms (dense). Additive and risk-free;
-de-risks the in-place conv-cache plumbing the bf16-state lever (design (2)/(4)) also touches.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
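Reviewer note on the results doc above: the per-(channel, sequence) work of the fused update (assemble the width-K window from the K-1 cached taps plus the current token, accumulate in ascending-tap order, optionally fold silu, then write the 1-token-shifted state back in place) can be illustrated with a plain-Python reference for a single sequence. This is a sketch only: the name `ssm_conv_update_ref`, the nested-list layout, and the oldest-first tap ordering are assumptions for the example and do not reproduce the tensor layout of the actual ggml op or CUDA kernel.

```python
import math
from typing import List


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def ssm_conv_update_ref(state: List[List[float]],
                        kernel: List[List[float]],
                        x_cur: List[float],
                        apply_silu: bool = True) -> List[float]:
    """Single-token depthwise conv update for one sequence.

    state[c]  : K-1 cached taps for channel c, oldest first (mutated in place)
    kernel[c] : K conv weights for channel c
    x_cur[c]  : current token's value for channel c
    """
    out = []
    for c, x in enumerate(x_cur):
        window = state[c] + [x]          # width-K window
        acc = 0.0
        for j, w in enumerate(kernel[c]):  # ascending-tap accumulation
            acc += window[j] * w
        out.append(silu(acc) if apply_silu else acc)
        state[c][:] = window[1:]         # shifted state written back in place
    return out


# One channel, K = 4: three cached taps plus the new token.
conv_state = [[1.0, 2.0, 3.0]]
weights = [[1.0, 1.0, 1.0, 1.0]]
y = ssm_conv_update_ref(conv_state, weights, [4.0], apply_silu=False)
```

After the call, `conv_state` holds the last K-1 inputs, so the next decode step can reuse it directly, which is what replaces the separate concat and copy ops in the four-op chain.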
--- a/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
+++ b/backend/cpp/llama-cpp/patches/paged/CRITICALPATH_GAP_ANALYSIS.md
@@ -1,639 +0,0 @@
-# Critical-Path Gap Analysis - GDN decode region
-
-## vllm-gdn-compare (READ-ONLY, no GPU) - vLLM decode GDN kernel inventory vs llama
-
-### Source ground truth
- Local checkout `/home/mudler/_git/vllm` and the DGX's benchmarked venv
-  `/home/mudler/vllm-bench/lib/python3.12/site-packages/vllm` are STRUCTURALLY
-  IDENTICAL (same file `qwen_gdn_linear_attn.py`, byte-for-byte same line numbers
-  1287/1344/1457/1644/1684). So the analysis below is faithful to what was actually
-  benchmarked on the GB10. Both are a recent dev build (`__version__ = "dev"`), same
-  era as the "0.23.0" reference; the GDN path is the refactored
-  `vllm/model_executor/layers/mamba/gdn/qwen_gdn_linear_attn.py`.
-
-### The headline: vLLM runs the entire GDN region at decode as 2 fused Triton kernels + a gated RMSNorm + 3 GEMMs
-Per Qwen3.5 gated-DeltaNet (linear-attn) layer, vLLM decode launches:
-
-| # | Kernel | What is folded in |
-|---|--------|-------------------|
-| 1 | `in_proj_qkvz` GEMM | (quantized matmul - shared with llama) |
-| 2 | `in_proj_ba` GEMM | (quantized matmul - shared with llama) |
-| 3 | `_causal_conv1d_update_kernel` (causal_conv1d.py:1193) | conv1d **+ silu activation fused in** (the `activation` arg) |
-| 4 | `fused_recurrent_gated_delta_rule_packed_decode_kernel` (fused_recurrent.py:256-336) | **l2norm(q), l2norm(k), scale, softplus gate, A_log decay exp(g), sigmoid(beta), the delta-rule recurrence (b_h*=exp(g); delta update), the output b_o=sum(b_h*b_q), AND the SSM state write-back** - all in one kernel |
-| 5 | `RMSNormGated` (gated rms_norm) | **output gate silu/sigmoid * z fused into the rms_norm**; the comment notes the norm+quant is further fusable by the compilation pass (`fuse_norm_quant`) |
-| 6 | `out_proj` GEMM | (quantized matmul - shared with llama) |
-
-So the GDN-region "glue" elementwise op count in vLLM is effectively ZERO separate
-launches. Everything llama runs as standalone ggml nodes - conv-silu, gate
-sigmoid, softplus, l2norm, scale, decay mul, residual add, gather - is absorbed
-into kernels #3, #4, and #5.
-
-Verified kernel bodies:
- `fused_recurrent_gated_delta_rule_packed_decode_kernel` lines 313-336:
-  `b_q/sqrt(sum(b_q^2)+eps)`, `b_k/sqrt(...)` (l2norm), `b_q*scale`,
-  `softplus_x=where(x<=thr, log(1+exp(x)), x)`, `g_val=-exp(A_log)*softplus_x`,
-  `beta_val=sigmoid(b)`, `b_h*=exp(g_val)`, `b_v-=sum(b_h*b_k)`, `b_v*=beta_val`,
-  `b_h+=b_v*b_k`, `b_o=sum(b_h*b_q)`, `tl.store(p_o,...)`, `tl.store(p_ht,...)`.
-  ONE kernel = recurrence + ALL gating + l2norm + state writeback.
- The non-packed variant `fused_sigmoid_gating_delta_rule_update_kernel`
-  (fused_sigmoid_gating.py:24-179) is the same fusion (used for the spec-decode /
-  mixed-batch path); both fold gate+l2norm+recurrence+writeback into one launch.
- Decode dispatch: `_forward_core` (line 1286-1298) routes pure non-spec decode to
-  `_forward_core_decode_non_spec` (line 1644), which calls exactly
-  `causal_conv1d_update` (#3) then `fused_recurrent_gated_delta_rule_packed_decode`
-  (#4). `_output_projection` (line 851) does `self.norm(core_attn_out, z)` (#5,
-  gated rmsnorm) then `out_proj` (#6).
-
-### vLLM ALSO captures decode in a FULL CUDA graph - the launch bubbles are gone entirely
-`vllm/v1/attention/backends/gdn_attn.py`:
- `_cudagraph_support = AttentionCGSupport.UNIFORM_BATCH` (line 82)
- `use_full_cuda_graph = cudagraph_mode.has_full_cudagraphs()` (line 113)
- `build_for_cudagraph_capture` (line 509): "only decode is supported for full
-  cudagraphs with Mamba" / "GDN only supports decode-only full CUDAGraph capture".
-
-So at decode vLLM captures the WHOLE forward (all 64 layers of q36-27b: the 48 GDN linear-attn
-layers + the 16 1-in-4 full-attn layers + projections + conv + recurrence + gated rmsnorm)
-into a single replayed CUDA graph. Per-kernel host launch latency and the
-data-dependent inter-op gaps are eliminated at replay time. Even the 2 Triton
-kernels per GDN layer incur no host-side launch bubble during graph replay.
-
-### Why this is the 62%-vs-40% explanation (not GEMM throughput)
- llama runs the GDN region as ~7-9 separate ggml nodes per layer at decode
-  (`ssm_conv`, `gated_delta_net` recurrence, `gdn_gather`, `k_bin_bcast` mul,
-  `silu`, `sigmoid`, `l2_norm`, `op_add`, `concat`), each a host-launched kernel,
-  serially data-dependent (conv -> gate -> recurrence -> gather), with the gating
-  elementwise wedged between recurrence steps. Each launch + the dependency stall
-  is a bubble ON the critical path. x48 layers x ~8 ops = ~384 launch bubbles/step.
- vLLM has 2 fused Triton kernels per GDN layer AND wraps them in a CUDA graph, so
-  the GDN-region inter-op bubble count at decode is ~0. The recurrence kernel
-  itself is already near-parity in llama (gated_delta_net 1.47 ms/call vs vLLM).
-  The gap is the surrounding launch/sync overhead, which is exactly the 60% idle
-  measured (llama ~40% busy vs vLLM 62%).
- This matches why P2a and Lever 2 were FLAT: they shrink GPU-busy kernels that are
-  already overlapped with the 42% mul_mat_q GEMM. The real wall-clock lever is the
-  SERIAL GDN gating chain's launch bubbles, which vLLM removed by (a) fusion into
-  the recurrence kernel and (b) CUDA-graph capture.
-
-### What llama would need to match vLLM (two independent wins, either helps)
-1. **Op fusion (Lever 3).** Collapse the GDN per-layer gating chain into the
-   recurrence kernel: fold conv-silu, q/k l2norm, scale, softplus+A_log gate,
-   sigmoid-beta, the exp-decay mul, the residual add, and the SSM-state write-back
-   INTO the `gated_delta_net` CUDA kernel (and fuse the output gate silu*z into the
-   final rms_norm). Target: from ~8 GDN nodes/layer down to ~2 (conv-fused +
-   recurrence-fused), mirroring vLLM's `fused_recurrent_gated_delta_rule_packed_decode`.
-   The conv silu fold and the l2norm/scale/gate fold are the high-value pieces -
-   they are pure elementwise prologues sitting ON the serial chain between conv and
-   recurrence.
-2. **CUDA-graph the decode step.** Even without fusion, capturing the decode forward
-   in a CUDA graph removes the per-node host launch latency for all ~384 nodes/step.
-   (Prior A.2 work flagged ggml-cuda graph capture as the orthogonal lever; the
-   measured GDN structure here is exactly why it should move the wall.) vLLM gets
-   BOTH; llama gets neither today.
-
-### Bottom line for the gap-analysis agent
-The candidate explanation is confirmed at the source level: vLLM's GDN decode region
-is 2 fused Triton kernels under a full CUDA graph vs llama's ~8 separate
-host-launched, serially-dependent ggml nodes. That launch/bubble delta - not GEMM
-compute - is the 62%-vs-40% busy gap. A timeline gap analysis on the existing nsys
-trace should show idle GPU between the GDN sub-ops (conv -> gate -> recurrence ->
-gather) per layer; if it does, Lever 3 (gating-into-recurrence fusion) and/or
-decode CUDA-graph capture are the levers that will move the wall, unlike P2a/Lever 2.
-
---
-
-## roofline-decode (READ-ONLY, no GPU) - decode-step roofline + bubble budget + parity target
-
-Goal: bound the q36-27b-nvfp4 decode step by the bandwidth floor and the compute floor,
-compare to measured llama 384 ms/step vs vLLM 327 ms/step, size the unexplained "bubble
-budget", and state the step-time target for parity. Cross-checks vllm-gdn-compare above.
-
-### Inputs (measured / GGUF metadata, no new GPU work)
- DGX GB10 (sm_121): LPDDR5x **273 GB/s**, dense NVFP4 MMA peak ~**500 TFLOP/s** (sparse ~1 PFLOP/s).
-  Both numbers are shared identically by llama and vLLM (same HW, same weights).
- q36-27b-nvfp4 GGUF (arch qwen35): block_count **64** (full_attention_interval 4 ->
-  **16 full-attn + 48 GDN** layers), d_model 5120, FFN 17408, attn 24 heads / 4 kv-heads,
-  head_dim 256, ssm conv_kernel 4 / state_size 128 / group_count 16 / inner_size 6144.
-  Weight file = **18.804 GB** (NVFP4 + FP8 block-scales + f16 norms/embd), fully GPU-resident.
- Measured llama decode (dense_base.out, -fa -npp128 -ntg128 -npl128, B=128, 128 TG steps):
-  T_TG 49.154 s / 128 = **384.0 ms/step** (S_TG 333.3 tok/s; matches STATE "~381 ms").
- vLLM dense ref **391 tok/s @128** => 128/391 = **327.4 ms/step**.
-
-### 1. Bandwidth floor (bytes that MUST cross LPDDR5x per step / 273 GB/s)
-| term | bytes/step | basis |
-|------|-----------|-------|
-| Weights (batched, read ONCE/step, reused across all 128 seqs) | ~18.4 GB | file 18.804 minus ~0.4 GB sparse input-embd lookup; lm_head fully read |
-| SSM state R+W (48 GDN layers x 128 seqs) | ~19 GB (bracket 10-38) | ~1.5 MB/layer/seq R+W, kernel-grounded: gated_delta_net 1.47 ms/call -> ~400 MB/call @273 GB/s; theoretical d_inner*d_state f32 doubles it |
-| KV cache read (16 attn layers, avg 192 ctx, 128 seqs, f16) | ~1.6 GB | 64 KiB/tok over 16 layers; max-ctx 256 -> 2.1 GB |
-| Activation/quantize/gate intermediates R+W | ~3 GB | quantize_mmq_nvfp4 + k_bin_bcast + silu + rms tensors @ batch 128 |
-| **TOTAL** | **~42 GB/step** | bracket 32-61 GB |
-
-**Bandwidth floor = 42/273 = ~154 ms/step** (central; bracket ~117-224 ms).
-Weight-only HARD sub-floor (unavoidable, both engines pay it): 18.4/273 = **67 ms/step**.
-
-KEY: even at batch 128 the FP4 GEMM is STILL memory-bound, not MMA-bound. AI = 2*128/0.53 B
-= ~483 FLOP/byte << GB10 ridge 500e12/273e9 = 1832 FLOP/byte. The 42% `mul_mat_q<NVFP4,m=128>`
-GPU time is weight-DRAM streaming, not tensor cores -> first-principles reason P2a (-26% MMA
-occupancy) and Lever-2 were FLAT on decode.
-
-### 2. Compute floor (FLOPs / ~500 TFLOP/s dense FP4)
-| term | FLOPs/step | floor |
-|------|-----------|-------|
-| FP4 GEMM (all dense matmuls): 2 * ~26e9 params * 128 seqs | 6.66 TFLOP | / 500e12 = **13.3 ms** (6.7 ms @ sparse 1 PFLOP) |
-| GDN recurrence (state update + read-out, 48 layers x 128 seqs) | ~0.04 TFLOP | < 0.1 ms (state-bound, not FLOP-bound) |
-| **TOTAL** | ~6.7 TFLOP | **~13 ms/step (~4% of the step)** |
-
-### 3. Verdict / bubble budget / parity target
-```
-                    compute floor   bandwidth floor    MEASURED step   x above bw-floor
-GB10 dense-FP4      ~13 ms          ~154 ms (117-224)
-vLLM dense @128                                        327 ms          ~2.1x (1.5-2.8x)
-llama dense @128                                       384 ms          ~2.5x (1.7-3.3x)
-```
- **Binding floor = bandwidth (~130-155 ms), NOT compute (~13 ms).** Compute floor is ~25x
-  below the wall -> FP4-MMA throughput is irrelevant; matches P2a/Lever-2 flatness exactly.
- **Both engines run ~2-2.8x ABOVE the bandwidth floor.** vLLM itself reaches only ~40-47%
-  LPDDR5x efficiency -> even the reference is LATENCY/occupancy bound, not byte-bound.
-  Confirms prior "decode is 2.5x above its bandwidth floor" work.
- **Bubble budget** (wall - bandwidth floor, central 154): vLLM ~**173 ms**, llama ~**230 ms**.
-  = kernel-launch latency + occupancy gaps + serial data-dependency stalls.
- **The llama-vs-vLLM gap = 384 - 327 = 57 ms/step (14.8% of the step) is 100% BUBBLE.**
-  Both engines share IDENTICAL bandwidth AND compute floors (same 18.8 GB NVFP4 weights, same
-  SSM state, same KV, same GB10 273 GB/s + 500 TFLOP/s). Bytes moved and FLOPs executed are equal,
-  so the entire 57 ms differential lives in critical-path bubble - NOT bandwidth, NOT compute.
-
-**Parity target: 327 ms/step (391 tok/s @128). llama must shave 57 ms/step = 14.8% off 384 ms.**
-Neither floor can move (both already shared with vLLM), so the 57 ms can ONLY come from
-collapsing critical-path bubbles -> structurally-correct case for Lever 3 (fuse the serial GDN
-gating chain) and/or decode CUDA-graph capture, exactly the two wins vllm-gdn-compare found vLLM
-already has. P2a/Lever-2 were flat because they freed OVERLAPPED GPU-busy time BELOW the floor.
-
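The sizing in this section is plain arithmetic on the two measured walls and the two floors. A small sketch, assuming the central 154 ms bandwidth floor and the ~26e9 parameter count quoted above; it says nothing about what the excess over the floor is made of.

```python
# Excess over the bandwidth floor and the llama-vs-vLLM gap, from the quoted figures.
BW_FLOOR_MS = 154.0        # central bandwidth floor
PARAMS = 26e9              # approximate dense parameter count
BATCH = 128
PEAK_FP4_FLOPS = 500e12
MEASURED_MS = {"vLLM": 327.0, "llama": 384.0}

# 2 FLOP per parameter per sequence, at dense FP4 peak.
compute_floor_ms = 2 * PARAMS * BATCH / PEAK_FP4_FLOPS * 1e3
print(f"compute floor {compute_floor_ms:.1f} ms")

for engine, wall_ms in MEASURED_MS.items():
    ratio = wall_ms / BW_FLOOR_MS
    excess_ms = wall_ms - BW_FLOOR_MS
    tok_per_s = BATCH / wall_ms * 1e3
    print(f"{engine:5s} wall {wall_ms:.0f} ms  x{ratio:.1f} floor  "
          f"excess {excess_ms:.0f} ms  {tok_per_s:.0f} tok/s")

gap_ms = MEASURED_MS["llama"] - MEASURED_MS["vLLM"]
gap_pct = gap_ms / MEASURED_MS["llama"] * 100
print(f"gap {gap_ms:.0f} ms/step = {gap_pct:.1f}% of the llama step")
```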
-### Cross-check / sizing for the gap-analysis (timeline) agent
- GPU-busy sum from nsysab_new (ntg24 window, /24 steps): FP4 GEMM ~243 + gated_delta_net ~76 +
-  GDN glue (k_bin_bcast mul ~49, silu ~34, concat ~19, gdn_gather ~21, ssm_conv ~12, l2_norm ~6,
-  op_add ~10) ~152 + quantize ~62 = **~555 ms GPU-busy vs 384 ms wall** -> sum >> wall by ~1.45x,
-  so heavy overlap is real and GPU-busy% buckets ARE misleading. Do NOT sum kernel times; the
-  wall is the critical path.
- Concrete budget: if the inter-kernel IDLE gaps + non-overlapped launch latency along the serial
-  GDN chain (ssm_conv -> gated_delta_net -> gating elementwise -> gdn_gather, x48 layers x N steps)
-  sum to **>= 57 ms/step**, Lever 3 is justified AND sized. If those critical-path gaps total
-  < 57 ms, parity is NOT reachable via GDN-gate fusion alone and the gap is elsewhere (GDN core
-  kernel slower than vLLM fused_recurrent, or scheduler/H2D).
- Structural corroboration (agrees with vllm-gdn-compare): vLLM runs the GDN region as 2 fused
-  Triton kernels under a full CUDA graph; llama splits it into ssm_conv + gated_delta_net +
-  gdn_gather + ~6 serially data-dependent elementwise gate kernels. ~384 host-launched nodes/step
-  on a chain that cannot overlap is precisely the mechanism that produces llama's extra ~57 ms.
-
-Floors are engine-independent lower bounds; the timeline agent owns proving the 57 ms is
-recoverable on the critical path. Roofline says: target 327 ms, shave 57 ms, and it can ONLY
-come from bubble (not bytes, not FLOPs).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
-## lever3-design (READ-ONLY, no GPU) - concrete fusion of the serial GDN gate chain into the recurrence kernel
-
-### What actually feeds/consumes the recurrence kernel today (qwen35 decode, fused_gdn_ar)
-Traced in `src/models/qwen35.cpp::build_layer_attn_linear` ->
-`src/models/delta-net-base.cpp::build_recurrent_attn` (fused !keep branch) ->
-`ggml/src/ggml-cuda/gated_delta_net.cu`. The model is GDA (g->ne[0]==1, scalar
-gate per head; kda=false in the kernel). S_v = ssm_d_state = 128, so the kernel
-runs the `<128>` template: warp_size==S_v==128, num_warps=4, rows_per_lane==1,
-grid (H, n_seqs, S_v/4=32 z-tiles). Each warp owns one output column `col`; the
-128 lanes hold the full head-vector (one element per lane).
-
-Serial pre-GDN gate chain (each a standalone host-launched ggml node, all on the
-critical path between the in-proj GEMMs and the recurrence):
-1. `beta = ggml_sigmoid(ssm_beta @ cur)`            -> kernel reads `beta_val = *beta_t`
-2. `alpha = ssm_alpha @ cur`
-3. `ggml_add(alpha, ssm_dt)`  (k_bin_bcast op_add)
-4. `ggml_softplus(...)`        (unary_op<softplus>, 1248 inst)
-5. `ggml_mul(softplus, ssm_a)` (k_bin_bcast op_mul; ssm_a = -exp(A_log), baked)  -> g; kernel does `expf(g_t)`
-6. `ssm_conv` then `ggml_silu` (conv path; may already hit the upstream SSM_CONV+SILU fuse) -> v_conv, and the q/k slices
-7. `ggml_l2_norm(q_conv)`, `ggml_l2_norm(k_conv)` (l2_norm_f32<32>, 2496 inst = 1248x2) -> kernel reads q_reg/k_reg
-
-Post-GDN gate (consumes kernel output):
-8. `build_norm_gated(output, ssm_norm, z)` = rms_norm(output)*ssm_norm (RMS_NORM+MUL fused) then `silu(z)*.` (unary_gated_op<silu>, the 5.9% bucket)
-
-### The fusion: fold steps 1,3,4,5,7 INTO gated_delta_net_cuda (a "fused-gate" mode)
-These five are exactly the per-(head) scalar gates (sigmoid beta; softplus+dt+ssm_a -> g)
-and the per-head-vector L2 norms of q/k - and the kernel ALREADY loads every
-operand it needs:
- It reads `beta_val` (scalar) -> pass RAW beta, do `beta_val = 1.f/(1.f+expf(-raw))` in-kernel. Removes node 1.
- It reads `g_t` (scalar, GDA) and does `expf(g_t)` -> pass RAW alpha + per-head `ssm_dt[h]` + per-head `ssm_a[h]`, compute `g = ssm_a[h]*op_softplus(alpha + ssm_dt[h])` in-kernel, keep the existing `expf(g)`. `op_softplus(x) = (x>20)?x:logf(1+expf(x))` (copy `ggml_compute_softplus_f32` verbatim). Removes nodes 3,4,5.
- It loads the full q/k head-vector into `q_reg[r]`/`k_reg[r]` (one element per lane at S_v==128). L2-normalize in registers: `float qss = warp_reduce_sum<128>(q_reg[0]*q_reg[0]); q_reg[0] *= rsqrtf(qss + eps* ... )` matching the l2_norm formula, same for k. Each warp redundantly recomputes the (identical) norm for its column - cheap, no shared mem, no extra launch. Removes nodes 7 (x2). `eps` (= f_norm_rms_eps) passed as a kernel float param.
-
-That collapses the pre-GDN serial chain to just: in-proj GEMMs -> build_conv_state(concat) -> ssm_conv(+silu) -> [single fused gated_delta_net kernel]. Steps 1,3,4,5,7 are folded, which removes 6 gate kernels per SSM layer per decode step (step 7 is two l2_norm launches).
-
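As a host-side reference for the math being folded, the sketch below writes the per-head gate scalars and the q/k normalization in plain Python. It is not the kernel code: the function names are illustrative, Python floats are doubles rather than fp32, and the eps placement in `l2_normalize` is an ASSUMPTION that has to be checked against the real ggml l2_norm.

```python
import math


def sigmoid(x: float) -> float:
    # No overflow guard for very negative x; acceptable for a reference sketch.
    return 1.0 / (1.0 + math.exp(-x))


def softplus(x: float) -> float:
    # Threshold form quoted in the notes: (x > 20) ? x : log(1 + exp(x))
    return x if x > 20.0 else math.log(1.0 + math.exp(x))


def gate_scalars(alpha_raw: float, beta_raw: float, ssm_dt_h: float, ssm_a_h: float):
    """Per-head scalars: returns (beta, g, decay) where decay = exp(g)."""
    beta = sigmoid(beta_raw)                       # replaces node 1
    g = ssm_a_h * softplus(alpha_raw + ssm_dt_h)   # replaces nodes 3, 4, 5
    return beta, g, math.exp(g)                    # the kernel already applies exp(g)


def l2_normalize(vec, eps: float):
    # ASSUMPTION: x * rsqrt(sum(x^2) + eps). The alternative placement
    # x / max(sqrt(sum(x^2)), eps) gives different results for tiny norms.
    sumsq = sum(x * x for x in vec)
    scale = 1.0 / math.sqrt(sumsq + eps)
    return [x * scale for x in vec]


beta, g, decay = gate_scalars(alpha_raw=0.0, beta_raw=0.0, ssm_dt_h=0.0, ssm_a_h=-1.0)
print(beta, g, decay)
print(l2_normalize([3.0, 4.0], eps=0.0))
```

A reference like this is mainly useful for a CPU-side parity check of the fused path against the unfused graph on the same raw inputs.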
-### Why the OUTPUT gate (step 8) is NOT folded into this kernel
-The output gated-rmsnorm reduces over the full head_v_dim (S_v=128) per (head,seq).
-In this kernel those 128 elements are produced by 128 DIFFERENT (warp x z-tile)
-blocks (4 warps x 32 z-tiles), so an in-kernel head-wide reduction would need a
-grid-global sync - not feasible without a grid redesign. Leave step 8 as the
-existing RMS_NORM+MUL + unary_gated<silu> fusion (already 2 launches, not in scope).
-The conv-silu (step 6) is a convolution, structurally separate; rely on the
-existing upstream SSM_CONV(+ADD)+SILU fuse rather than pulling it into the
-recurrence kernel.
-
-### Implementation scope
- `ggml/include/ggml.h`: new builder `ggml_gated_delta_net_inplace_ids_fused_gate(ctx, q_raw, k_raw, v, alpha_raw, beta_raw, cache4d, state_dst, ids, ssm_a, ssm_dt, rs_head, eps)` (or an op-param flag GDN_FUSE_GATE on the existing builder + 2 extra srcs). src budget: current op uses src[0..7]; add ssm_a -> src[8], ssm_dt -> src[9]. GGML_MAX_SRC==10, so it fits EXACTLY (zero headroom - note for review).
- `ggml/src/ggml.c`: builder + a new op-param i32 flag (e.g. params[2]=fuse_gate) + f32 param for eps; assert shapes (ssm_a/ssm_dt are [num_v_heads]).
- `ggml/src/ggml-cuda/gated_delta_net.cu`: in `gated_delta_net_cuda`, gate the in-kernel sigmoid/softplus-gate/l2norm behind a `bool FUSE_GATE` template param (4th template bool, keeps the non-fused path byte-identical and avoids register bloat when off). Read ssm_a[h_idx], ssm_dt[h_idx]; compute g per head; sigmoid raw beta; warp-reduce q_reg/k_reg sumsq -> rsqrtf scale. Plumb the 2 new src pointers + eps through `launch_gated_delta_net` and `ggml_cuda_op_gated_delta_net` (read src[8],src[9], op_param eps/flag). The `gdn_gather_nonident` path is unaffected (it gathers state, not q/k/g/beta).
- `ggml/src/ggml-cpu/ops.cpp`: mirror in `ggml_compute_forward_gated_delta_net_one_chunk` (host sigmoid/softplus/l2norm before the per-token math) for CPU parity / test-backend-ops.
- `src/models/delta-net-base.cpp::build_recurrent_attn` (the fused !keep + ids branch, and the inplace non-ids branch): call the fused-gate builder, pass raw alpha/beta/q/k + ssm_a + ssm_dt + eps.
- `src/models/qwen35.cpp` / `qwen35moe.cpp` / `qwen3next.cpp` `build_layer_attn_linear`: when the fuse flag is on, DROP `ggml_sigmoid(beta)`, `ggml_add(alpha,dt)`, `ggml_softplus`, `ggml_mul(.,ssm_a)`, and the two `ggml_l2_norm` nodes; hand the raw tensors + `model.layers[il].ssm_a`, `ssm_dt` to build_recurrent_attn. The conv-silu and z/output-gate path are unchanged.
- Guard the whole thing behind `cparams.fused_gdn_gate` / env `LLAMA_FUSE_GDN_GATE` (default OFF) so it A/Bs against the clean Lever-1 build exactly like P2a/Lever-2, and only the recurrent (GDA) qwen35 family path is touched.
-
-### Numeric considerations / bit-exactness
- sigmoid(beta), softplus(alpha+dt), and the `g = ssm_a*softplus` mul/add are pointwise fp32 with the SAME formula/order as the standalone ggml ops -> these can be **bit-exact** (no reduction). softplus must copy `(x>20)?x:logf(1+expf(x))` exactly.
- q/k l2norm is the ONE op with a reduction: the standalone `l2_norm_f32<32>` does its own warp/block reduction; the in-kernel `warp_reduce_sum<128>` tree may differ in the last ULP, and the eps placement (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) must match the ggml l2_norm exactly. Expect **near-bit-exact, not guaranteed byte-identical** greedy output. So unlike Levers 1/2, gate this on a **PPL/KL tolerance** (KL logit delta < ~1e-3, PPL delta within noise) rather than md5 identity. If byte-identity is required, exclude l2norm from the fold (keep nodes 7) and fuse only sigmoid/softplus/gate - but that drops the value to ~0.3% and is probably not worth it.
-
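The reason only the l2norm fold threatens byte-identity is that floating-point addition is not associative, so two reduction orders over the same values can round differently. A minimal illustration follows; it uses Python doubles, not the fp32 warp shuffles of the kernels, and `tree_sum` is a stand-in for a butterfly reduction, not the actual `warp_reduce_sum`.

```python
def sequential_sum(values):
    """Left-to-right accumulation."""
    total = 0.0
    for v in values:
        total += v
    return total


def tree_sum(values):
    """Pairwise reduction; the input length must be a power of two."""
    level = list(values)
    while len(level) > 1:
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


# The same three numbers, grouped two ways, round to different doubles.
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))

squares = [(0.1 * (i + 1)) ** 2 for i in range(8)]
print(sequential_sum(squares), tree_sum(squares))
```

Any difference between the two orders is bounded by rounding error, which is why a KL/PPL tolerance is the natural acceptance gate for the l2norm fold rather than an md5 match.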
-### Estimated kernels-removed-per-layer and the honest ceiling
- Removed per SSM decode layer-step: sigmoid(beta) + add(dt) + softplus + mul(ssm_a) + l2norm(q) + l2norm(k) = **6 host-launched kernels -> 0**, collapsing 7 nodes (incl. recurrence) to 1. Across 48 SSM layers = **~288 launches/step removed** (matches the instance deltas: l2_norm 2496, softplus 1248, sigmoid 1248, plus the alpha-add/ssm_a-mul share of op_add/op_mul).
- GPU-BUSY ceiling of the removed ops is small: l2_norm 1.0% + softplus ~0% + sigmoid 0.3% + (dt add + ssm_a mul share of op_add 1.7% / op_mul 8.5%). The point of Lever 3 is NOT the freed busy-time (P2a/Lever-2 proved freeing overlapped busy-time is flat) - it is removing ~288 LAUNCH BUBBLES/step that sit on the serial conv->gate->recurrence dependency where the GPU is otherwise idle. The win is wall-clock only if those specific bubbles are on the critical path.
-
-### RISK (must be settled before building)
-1. **Same trap as P2a/Lever-2 if the bubbles overlap.** If the scheduler already
-   overlaps these pre-GDN gate kernels with an adjacent layer's 42% mul_mat_q GEMM,
-   Lever 3 is FLAT. **Precondition: the timeline gap analysis must show idle GPU
-   between ssm_conv -> (sigmoid/softplus/l2norm) -> gated_delta_net per layer** at
-   batch=128 BEFORE building. If the trace shows the gate ops back-to-back with no
-   gap (overlapped), do NOT build op-fusion; go to lever (2) below.
-2. **The bigger bubbles may be elsewhere on the chain.** The large buckets are op_mul
-   8.5% and unary_gated<silu> 5.9% - much of which is the POST-GDN output gate and
-   FFN, which this fusion does NOT touch. If the gap analysis pins the dominant idle
-   to the post-GDN region or to inter-layer launch latency generally, the
-   higher-leverage Lever 3 is **decode CUDA-graph capture** (removes host launch
-   latency for ALL ~384 nodes/step at once, exactly what vLLM does), not per-op
-   fusion. CUDA-graph is the strictly larger hammer here; op-fusion only helps the
-   pre-GDN slice. Recommend measuring the per-sub-op gap first and preferring the
-   CUDA-graph lever if the bubbles are spread across the step rather than concentrated
-   in the pre-GDN gate slice.
-3. **src-slot exhaustion** (src[8],src[9] use the last 2 of GGML_MAX_SRC=10) - any
-   later op needing more srcs on this node has zero headroom; flag for review.
-
-## cudagraph-coverage (READ-ONLY, no GPU) - does the CUDA graph cover the GDN serial chain at B=128?
-
-### Verdict: YES, the graph covers GDN at batch=128 (dense model). No GDN op forces graph-disable or per-step re-instantiation.
-
-Source: `ggml/src/ggml-cuda/ggml-cuda.cu` (graph state machine), `gated_delta_net.cu`
-(fused op), `src/models/delta-net-base.cpp` (graph build), `src/llama-memory-recurrent.cpp`
-(recurrent head), all on dev tree `~/llama-paged-dev` (HEAD df1cc97, Lever-1). Cross-checked
-against the committed A2_CUDAGRAPH_DECODE.md + DECODE_PARITY_EXPLORE.md measurements.
-
-### How graph-disable / re-instantiation are decided (this fork's state machine)
- `ggml_cuda_graph_check_compability` (ggml-cuda.cu:3251) disables the graph for ONLY two
-  reasons: (a) a split-buffer src, (b) `GGML_OP_MUL_MAT_ID` with non-quantized weights OR
-  `node->ne[2] > get_mmvq_mmid_max(...)` [TAG_MUL_MAT_ID_CUDA_GRAPHS]. GATED_DELTA_NET,
-  SSM_CONV, SSM_SCAN, GET_ROWS, CONCAT, the gating elementwise ops are NOT in the disable
-  list. So no GDN op forces graph-disable.
- `ggml_cuda_graph_update_required` (3297) memcmps, per node, the full `ggml_tensor` struct
-  (incl. `op_params` and `data`) + each src's `data` ptr / `ne` / `nb`. ANY delta -> the
-  warmup state machine (ggml_backend_cuda_graph_compute:4464) resets `warmup_complete` and the
-  WHOLE graph (one key = `cgraph->nodes[0]`) runs eager that step until stable again. Buffer
-  CONTENTS are NOT compared - a contents-only change (e.g. ids values) is graph-safe.
-
-### Why the GDN region's properties are STABLE across steady decode steps
-The fused decode path is `ggml_gated_delta_net_inplace_ids` (delta-net-base.cpp:558-560):
-```
-state_dst = ggml_view_2d(ctx, ssm_states_all, n_embd_s, n_seqs, nb1,
-                         kv_head * n_embd_s * elsize);   // offset = kv_head
-ggml_gated_delta_net_inplace_ids(..., cache4d, state_dst, ids, /*rs_head=*/(int)kv_head);
-```
-Both the `state_dst` view byte-offset and the `rs_head` op_param (read back as
-`ggml_get_op_params_i32(dst,1)` in gated_delta_net.cu:330) derive from
-`kv_head = mctx_cur->get_head()`. In `llama_memory_recurrent::find_slot`
-(llama-memory-recurrent.cpp:610-689) the n_seqs used cells are SWAPPED into the contiguous
-range `[min .. min+n_seqs-1]` and `head = min`. The recurrent cache does NOT grow per token
-(one state cell per sequence, unlike the KV cache). For a steady 128-seq continuous batch the
-same sequences own the same tails every step, so `min`/`head` are constant (=0) -> state_dst
-offset constant, rs_head op_param constant. The GDN inputs (q,k,v,g,b, cache4d, ids) are
-fixed-shape (n_seqs=128, n_rs slots), so ne/nb are stable, and ggml-alloc hands out the same
-compute-buffer offsets each same-topology ubatch -> data ptrs stable. The `ids` (s_copy)
-tensor's CONTENTS change per step but its address/ne/nb do not -> graph-safe.
-
-### The fused GDN op is capture-safe (no host-sync, no capture-time cudaMalloc)
-`gated_delta_net.cu`: the op launches `gdn_gather_nonident_kernel` + `gated_delta_net_cuda`
-on `ctx.stream()` with NO `cudaStreamSynchronize` / host `cudaMemcpy` / `cudaMalloc`. The
-gather scratch is `ggml_cuda_pool_alloc` (VMM pool, served from reserved memory after warmup,
-no real cudaMalloc during capture). `gdn_gather_nonident` early-returns for identity sequences
-(`ids[s]==rs_head+s`), which is the steady-decode case, so its 3.7% is a launched-but-mostly-
-noop kernel - still captured into the graph like any other. Capture succeeds (the build runs,
-graphs engage), confirming none of these break stream capture.
-
-### The only re-instantiation is NOT GDN-driven
-A2 already measured the re-warm cadence: the graph re-instantiates every ~256 tokens because
-the FULL-ATTENTION block-table input `idx` has `ne[0]=GGML_PAD(n_kv,256)` (and kq_mask in
-lockstep) - those step at 256-token boundaries (paged-attn.cpp:199-213). ~97% of decode steps
-replay the captured graph. This is a full-attention-layer input, not a GDN op. (The unpadded
-`LLAMA_KV_PAGED_GATHER` fallback grows `ne[0]` every step and runs pure-eager, but that is not
-the default decode path and is not the GDN/SSM path.)
-
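The 256-token cadence follows directly from padding `n_kv` up to a multiple of 256: the padded length, and therefore the tensor shape the graph check sees, only changes when `n_kv` crosses a boundary. A sketch, assuming the padding behaves like a round-up to a power-of-two multiple; `shape_change_steps` is an illustrative helper and does not model the warmup state machine, so it counts shape changes only, not the eager steps that follow each one.

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the next multiple; `multiple` must be a power of two."""
    return (n + multiple - 1) & ~(multiple - 1)


def shape_change_steps(first_n_kv: int, steps: int, multiple: int = 256):
    """n_kv values whose padded length differs from the previous decode step."""
    changes = []
    prev = pad_to_multiple(first_n_kv, multiple)
    for n_kv in range(first_n_kv + 1, first_n_kv + steps):
        cur = pad_to_multiple(n_kv, multiple)
        if cur != prev:
            changes.append(n_kv)
        prev = cur
    return changes


changes = shape_change_steps(first_n_kv=1000, steps=1025)
print(changes)
print(f"{len(changes)} shape changes in 1024 decode steps")
```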
-### Reconciliation with the "~40% util / 60% idle bubbles" premise (it is refuted for GDN)
-The committed nsys sweeps (A2_CUDAGRAPH_DECODE.md, DECODE_PARITY_EXPLORE.md) show the steady
-decode is ~99.4-99.5% GPU-BUSY with graphs ON (measured with `--cuda-graph-trace=node`; a
-graphs-ON trace WITHOUT that flag under-counts GPU rows and falsely reports idle - Trap #2).
-Total exposed idle is ~0.65% of the step; the within-step launch fraction graphs remove is
-0.34% (0.37%->0.11%) and is ALREADY collapsed - the GDN sub-op launch gaps are inside the
-captured region. The "40% utilization" in the STATE is BANDWIDTH-roofline util, not idle SMs:
-decode moves ~55.5 GB/step at 2.48x the 273 GB/s floor, SSM state r+w = 66% of step bytes. The
-GDN recurrence is memory-bandwidth-bound at low occupancy (~12-16%), not launch-gap-bound. So
-"60% idle bubbles on the serial GDN chain" is not supported by the traces; the gap to vLLM is
-SSM-state memory traffic, consistent with P2a/Lever-2 being flat (freeing GPU-busy time, not
-wall-clock).
-
-### Graph-safe lever for GDN: none new
- GDN is already graph-covered; there is no "make the GDN ops graph-safe" lever to build - they
-  are already safe and captured.
- The only genuinely graph-NON-covered idle is the BETWEEN-step host gap (~2 ms/step, ~0.4%):
-  ggml rebuilds/reallocs the cgraph each step with a new `cgraph->uid`, so the uid fast-path in
-  ggml_cuda_graph_update_required never fires and the host re-dispatches ~3100 launches on the
-  Grace cores between graph launches (vLLM builds its graph once + persistent device metadata).
-  A persistent/reused cgraph across decode steps would let the uid fast-path fire and shrink the
-  host gap - but at 0.4% of the step it is second-order to the SSM bandwidth floor.
- CAVEAT (MoE, qwen35moe): MUL_MAT_ID at B=128 can trip [TAG_MUL_MAT_ID_CUDA_GRAPHS]
-  (`ne[2] > mmvq_mmid_max`), disabling the WHOLE MoE-decode graph (GDN included) into eager.
-  That is a MUL_MAT_ID disable, not a GDN break, and does not touch the dense 335 tok/s headline;
-  worth a separate confirm for the MoE model.
-
-## decode-timeline-gap (GPU, label gap-analysis) - the decisive fresh node-level measurement
-
-This is the new GPU run the analysis was waiting on. It arbitrates between the
-roofline/vllm-gdn-compare theory ("57 ms = 100% bubble, Lever 3 closes it") and the
-cudagraph-coverage source verdict ("~99.4% busy, bandwidth-bound, bubbles refuted").
-The measurement confirms the latter and refutes the former, with per-kernel numbers.
-
-### Capture (the trap the prior `--trace=cuda` fell into is now avoided)
-`nsys profile --trace=cuda --cuda-graph-trace=node` on build-cuda-base (clean
-Lever-1, HEAD df1cc97, git-clean mmq.cuh), q36-27b-nvfp4 dense,
-`-fa on -npp 128 -ntg 24 -npl 128 -c 33000`. Artifacts on DGX:
-`~/llama-paged-dev/nsysgap.{nsys-rep,sqlite}`. The decode step is a single
-CUDA graph (graphId=11, 23 replays = steps
-2-24; graphId=1 x8 = prefill). Plain `--trace=cuda` recorded each step as ONE opaque
-~380 ms block, so the widely-cited `nsysab_new.kern.txt` breakdown (mul_mat_q 42%,
-gated_delta_net 13%) is PREFILL + the single eager capture step, NOT decode. With
-node-level trace the graph expands: 168201 kernels = 91499 graph-internal + 76702
-eager prefill. **All graph kernels on stream 14 (single stream) -> strictly serial,
-no overlap, so any inter-kernel gap is pure GPU idle.**
-
-### One steady decode step (window between decode launches 22413.26 / 22796.74 ms, width 383.48 ms)
-Exactly 48 `gated_delta_net` + 16 `flash_attn` = one clean step (48 GDN + 16 attn).
-2965 kernels.
-
-| classification | ms/step | % of step |
-|---|---|---|
-| (a) inter-kernel LAUNCH gaps + (b) SERIAL-DEPENDENCY stalls (LAG sum, single stream) | **0.225** | **0.06%** |
-| (c) within-kernel time (GPU running) | 380.4 | 99.94% |
-
-Zero gaps > 5 us. Largest single gap 2.40 us. 1260 sub-1us gaps + 1700 back-to-back.
-**The decode step is 99.94% GPU-busy. There are no bubbles.** This independently
-confirms cudagraph-coverage's ~99.4% and **refutes** roofline-decode's "57 ms = 100%
-bubble" and vllm-gdn-compare's "~384 launch bubbles/step on the critical path".
-nvidia-smi's "40% util" = low SM/compute efficiency WITHIN kernels (c) (memory-latency-
-bound, ~12-16% achieved occupancy), not wall-clock idle.
-
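The idle figure in the table is a lag sum: with every kernel on one stream, idle time is the sum of the gaps between one kernel's end and the next kernel's start. A minimal sketch of that computation on synthetic intervals; the function names and the input layout are illustrative and do not reflect the nsys sqlite schema.

```python
def idle_gaps(kernels):
    """kernels: (start_us, end_us) pairs from ONE stream. Returns the gaps in us."""
    ordered = sorted(kernels)
    gaps = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Clamp at zero so touching or overlapping timestamps do not count as idle.
        gaps.append(max(0.0, next_start - prev_end))
    return gaps


def summarize(kernels, window_us: float):
    """Aggregate the gaps the same way the table above reports them."""
    gaps = idle_gaps(kernels)
    idle_us = sum(gaps)
    return {
        "idle_us": idle_us,
        "idle_pct": 100.0 * idle_us / window_us,
        "max_gap_us": max(gaps, default=0.0),
        "gaps_over_5us": sum(g > 5.0 for g in gaps),
    }


# Synthetic example: four kernels in a 50 us window.
demo = [(0.0, 10.0), (10.0, 25.0), (25.5, 40.0), (42.0, 50.0)]
print(idle_gaps(demo))
print(summarize(demo, window_us=50.0))
```

The single-stream condition matters: with several streams, a gap on one stream can be covered by work on another, and the lag sum would overstate idle time.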
-### Real decode kernel mix (% of the 380.4 ms step) - corrects the prefill-contaminated kern_sum
-| kernel | n/step | ms | % | grid CTAs | waves/48SM |
-|---|---|---|---|---|---|
-| gated_delta_net_cuda | 48 | **196.37** | **51.6** | 48x128x32 = 196608 | 4096 |
-| mul_mat_q (FP4 in/out/qkv/o proj) | 496 | 92.90 | 24.4 | 136 | 1.5 |
-| quantize_mmq_nvfp4 | 496 | 17.13 | 4.5 | 483 | 10 |
-| nvjet GEMM (lm_head) | 1 | 11.91 | 3.1 | 1944 | 40 |
-| flash_attn_ext_f16 (16 attn layers) | 16 | 11.67 | 3.1 | 48 | 1.0 |
-| concat_cont (conv-state) | 48 | 8.01 | 2.1 | 20480 | 427 |
-| cpy_scalar | 64 | 7.62 | 2.0 | 49152 | 1024 |
-| k_get_rows_float | 49 | 7.08 | 1.9 | 15098 | 315 |
-| k_bin_bcast (gate mul + add) | 720 | 6.59 | 1.7 | 3169 | 66 |
-| ssm_conv_f32 | 48 | 5.64 | 1.5 | 10240 | 213 |
-| unary_gated (silu/sigmoid) | 128 | 5.36 | 1.4 | 5888 | 123 |
-| mul_mat_q_stream_k_fixup | 304 | 3.94 | 1.0 | 192 | 4 |
-| rms_norm_f32 | 209 | 3.52 | 0.9 | 1764 | 37 |
-| l2_norm_f32 | 96 | 0.64 | 0.2 | | |
-| gdn_gather_nonident | 48 | **0.061** | 0.016 | | |
-
- `gated_delta_net` is **51.6% of the step**, the single dominant term. The
-  previously-cited "1.47 ms/call near-vLLM" was the EAGER average over 1248 calls
-  (range 0.046-4.42 ms = prefill warmups + capture); true steady decode is
-  **4.08-4.11 ms/call** (gridY=128 = the 128 seqs). 2.8x higher than believed.
- It launches 196608 CTAs / 4096 waves = NOT occupancy-starved; the cost is
-  bandwidth-bound state traffic (~384 MB read + ~384 MB write per layer for the
-  48-head x 128-seq x [state 128 x head_v 128] recurrent state, ~190 GB/s effective).
- The Lever-3 narrow target (gating glue) = k_bin_bcast 6.59 + silu/sigmoid 5.36 +
-  l2_norm 0.64 + softplus 0.13 = **12.76 ms = 3.35%** of the step. `gdn_gather` is
-  **0.06 ms** (negligible - it early-returns on identity ids as predicted).
-
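The state-traffic and wave figures for `gated_delta_net` can be re-derived from the shapes quoted above. A sketch, assuming an f32 state and exactly one read plus one write of the full state per call; the result lands slightly above the quoted ~190 GB/s when the 384 figure is taken as MiB (2^20 bytes).

```python
HEADS, SEQS, STATE_DIM, HEAD_V_DIM = 48, 128, 128, 128
BYTES_F32 = 4            # ASSUMPTION: state stored as f32
NUM_SMS = 48
LAYERS = 48
MS_PER_CALL = 4.08       # measured steady-decode time per gated_delta_net call

state_bytes = HEADS * SEQS * STATE_DIM * HEAD_V_DIM * BYTES_F32
traffic_bytes = 2 * state_bytes                      # one read + one write
effective_gbps = traffic_bytes / (MS_PER_CALL * 1e-3) / 1e9

ctas = HEADS * SEQS * 32                             # grid (H, n_seqs, 32 z-tiles)
waves = ctas // NUM_SMS                              # counting one CTA per SM per wave

print(f"state per layer : {state_bytes / 2**20:.0f} MiB")
print(f"effective rate  : {effective_gbps:.0f} GB/s")
print(f"CTAs {ctas}, waves {waves}")
print(f"48 calls x {MS_PER_CALL} ms = {LAYERS * MS_PER_CALL:.1f} ms/step")
```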
-### The three answers (with numbers)
-1. **Bubbles on the serial GDN critical path?** NO. 0.225 ms idle/step = 0.06%,
-   zero gaps > 5 us. CUDA graphs eliminated launch overhead; serial dependencies do
-   not produce idle (each kernel starts < 1 us after the previous). The premise is
-   refuted by direct measurement.
-2. **Would Lever 3 (fuse the gating chain) shrink the step or overlap away?** It
-   shrinks it, but only by its hard ceiling **12.76 ms = 3.35%** (380 -> 367 ms, 336
-   -> ~348 tok/s, 86% -> 89% of vLLM). It does NOT close the 14% / 53-57 ms gap.
-   IMPORTANT mechanism correction: the step is single-stream and 99.94% busy, so
-   there is NO overlap to absorb freed time (the lever3-design RISK #1 "same trap as
-   P2a if overlapped" does NOT apply - nothing overlaps). So removing those kernels'
-   GPU-time DOES cut wall-clock - but the win is removing their HBM byte traffic, NOT
-   launch bubbles (there are none). And the value is the measured ~12.76 ms, not the
-   "~288 launch bubbles" framing (those launches cost ~0 inside the graph). This also
-   explains P2a/Lever-2 flatness correctly: NOT "overlapped busy-time" (no overlap),
-   but P2a tuned the prefill large-M GEMM (decode GEMMs are 136-CTA tail-bound, untouched)
-   and Lever-2 relocated mandatory quantize work into the GEMM prologue (net zero).
-3. **Do CUDA graphs cover the GDN region at B=128?** YES, fully. Whole step = one
-   graph, 23 replays, ~0.2 ms host gap between steps. `gdn_gather_nonident` and the
-   in-place state ops are graph-internal nodes (graphNodeId != 0); no fragmentation.
-   Confirms cudagraph-coverage. Note: lever #2 from vllm-gdn-compare ("CUDA-graph the
-   decode step") is ALREADY IN EFFECT in this build and did not close the gap - so it
-   is spent, not pending.
-
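The Lever-3 ceiling in answer 2 converts to throughput as follows. A sketch using only the quoted figures (380.4 ms busy step, 12.76 ms gating-glue ceiling, 327 ms vLLM step); it assumes the full ceiling is recovered, which the notes themselves treat as an upper bound.

```python
BATCH = 128
STEP_MS = 380.4              # measured GPU-busy decode step
LEVER3_CEILING_MS = 12.76    # gating-glue total quoted in the notes
VLLM_STEP_MS = 327.0


def tok_per_s(step_ms: float) -> float:
    """Aggregate decode throughput for BATCH sequences advancing one token per step."""
    return BATCH / step_ms * 1e3


after_ms = STEP_MS - LEVER3_CEILING_MS
vllm_tps = tok_per_s(VLLM_STEP_MS)

print(f"before : {STEP_MS:.1f} ms  {tok_per_s(STEP_MS):.0f} tok/s  "
      f"{tok_per_s(STEP_MS) / vllm_tps:.0%} of vLLM")
print(f"after  : {after_ms:.1f} ms  {tok_per_s(after_ms):.0f} tok/s  "
      f"{tok_per_s(after_ms) / vllm_tps:.0%} of vLLM")
print(f"still above vLLM by {after_ms - VLLM_STEP_MS:.1f} ms/step")
```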
-### Verdict against roofline-decode's own sizing test
-roofline-decode stated: "if critical-path gaps total < 57 ms, parity is NOT reachable
-via GDN-gate fusion alone and the gap is elsewhere (GDN core kernel slower than vLLM
-fused_recurrent)." **Measured gaps = 0.225 ms << 57 ms.** Therefore, by that test, the
-53-57 ms / 14% gap is NOT bubble and NOT closable by gating fusion. It lives in
-**kernel GPU-time**, dominated by the `gated_delta_net` recurrence (51.6%, bandwidth-
-bound) and secondarily the FP4 GEMM + quantize stack (29%). The "57 ms = 100% bubble"
-roofline conclusion was an inference from the prefill-contaminated GPU-busy sum
-(~555 ms vs 384 ms "implies overlap"); the node-level decode-only measurement shows
-per-step GPU-busy = wall (no overlap), so that inference does not hold.
-
-### Recommendation (resized)
- The real lever is the `gated_delta_net` recurrence kernel itself (196 ms, 51.6%):
-  match vLLM's `fused_recurrent_gated_delta_rule_packed_decode` (vllm-gdn-compare
-  kernel #4) which folds l2norm + gate + decay + recurrence + state-writeback into a
-  SINGLE pass over the state, reducing HBM round-trips of the state. The win is byte
-  reduction in a memory-bound single-stream step, not bubble removal.
- The lever3-design fusion is still worth doing as a component of that (it removes
-  ~12.76 ms = 3.35% of real byte traffic, and unlike its own RISK section feared, it
-  will NOT be flat because there is no overlap), but on its own it is a ~3% lever, not
-  the gap-closer. Build it folded into a single-pass recurrence kernel, not as an
-  isolated gate fold.
- Next decisive measurement (future GPU-agent run): profile vLLM's decode step at
-  npl128 with the same node-level method and compare per-region GPU-time (GDN
-  recurrence vs GEMM vs attention) to localize exactly where vLLM spends its 53-57 ms
-  less. Both engines move near-identical bytes only if vLLM's fused recurrence does
-  not re-stream state; the per-kernel A/B will show whether the gap is the recurrence
-  pass or the GEMM/quantize stack.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## SYNTHESIS (final) - the validated decode-parity picture, ranked plan, and verdict
-
-Reconciles all six investigation sections above plus the three adversarial verdicts
-(Verify A/B/C). One sentence: **the "~60% idle" never existed; the decode step is
-99.94% GPU-busy single-stream, so the 14% gap to vLLM is kernel GPU-time, dominated by
-the bandwidth-bound `gated_delta_net` recurrence (51.6%), and the only gap-closing levers
-are byte-reduction inside that kernel - NOT launch-bubble removal.**
-
-### 1. The proven critical-path decomposition of the decode step
-
-Decisive node-level trace (`nsys --cuda-graph-trace=node`, clean Lever-1 build df1cc97,
-q36-27b-nvfp4 dense, npl128, GB10/48SM/sm_121, commit a7238525, nsysgap.sqlite). One
-steady step = single replayed CUDA graph (graphId=11, 23 replays), all 2965 kernels on
-ONE stream (stream 14, strictly serial -> every inter-kernel gap is pure idle). Window
-383.48 ms.
-
-BUBBLE CLASSIFICATION (the "where is the ~60% idle" answer - it is NOT idle):
-
-| bucket | ms/step | % step | note |
-|---|---|---|---|
-| (a) inter-kernel launch bubbles | ~0 | ~0 | graph replay collapses host launch latency |
-| (b) serial-dependency stalls (GDN chain) | included in 0.225 | 0.06 | each kernel starts < 1 us after prev; zero gaps > 5 us, max 2.40 us |
-| (a)+(b) total exposed idle (LAG sum) | **0.225** | **0.06%** | 1700 kernels back-to-back |
-| (d) between-step HOST gap (cgraph rebuild, new uid) | ~0.2 | ~0.05 | the ONLY graph-non-covered idle; ~0.4% in older eager-tail traces |
-| (c) within-kernel GPU-busy | **380.4** | **99.94%** | this is the whole step |
-
-The nvidia-smi "40%" is within-kernel SM/bandwidth efficiency (~12-16% achieved
-occupancy on memory-latency-bound kernels), NOT wall-clock idle.
-
-KERNEL GPU-TIME DECOMPOSITION of the 380.4 ms busy step (this is where the gap lives):
-
-| kernel | ms | % step | regime |
-|---|---|---|---|
-| `gated_delta_net_cuda<128>` (48x, 4.08 ms/call) | **196.37** | **51.6** | bandwidth-bound f32 recurrent-state R+W (~384 MB R + 384 MB W/layer) |
-| `mul_mat_q` FP4 GEMM (496x) | 92.90 | 24.4 | memory-bound weight stream, 136-CTA tail-bound at decode |
-| `quantize_mmq_nvfp4` (496x) | 17.13 | 4.5 | mandatory act-quant (Lever-2 only relocated it) |
-| `nvjet` lm_head GEMM | 11.91 | 3.1 | |
-| `flash_attn_ext_f16` (16 attn layers) | 11.67 | 3.1 | |
-| `concat_cont` (conv-state splice) | 8.01 | 2.1 | Lever-1 target |
-| `cpy_scalar` (conv-state writeback + dup) | 7.62 | 2.0 | Lever-1 target (the conv-state share) |
-| `k_get_rows_float` | 7.08 | 1.9 | |
-| `k_bin_bcast` (gate mul + add) | 6.59 | 1.7 | Lever-3 gate-fold target (partial - rest is residual adds) |
-| `ssm_conv_f32` | 5.64 | 1.5 | folds into Lever-1 |
-| `unary_gated` (silu/sigmoid) | 5.36 | 1.4 | mostly FFN + output-gate (Lever 3 does NOT touch) |
-| `mul_mat_q_stream_k_fixup` | 3.94 | 1.0 | |
-| `rms_norm_f32` | 3.52 | 0.9 | |
-| `l2_norm_f32` | 0.64 | 0.2 | Lever-3 gate-fold target |
-| `gdn_gather_nonident` | 0.061 | 0.016 | negligible (early-returns on identity ids) |
-
-GDN region (recurrence + conv + concat + cpy + gather + l2norm) >= 210 ms = 55%+ of the step.
-The widely-cited "gated_delta_net 13%, 1.47 ms/call near-vLLM" from nsysab_new.kern.txt was
-PREFILL + the single eager capture step contaminating the average over 1248 calls (range
-0.046-4.42 ms); true steady decode is 4.08 ms/call, 2.8x higher, 51.6% of the step.
-
-### 2. Claims A / B / C: which HOLD, which are REFUTED, and the residual uncertainty
-
-**CLAIM A** ("the ~60% decode GPU-idle is inter-op launch bubbles ON the serial GDN
-chain"): **REFUTED.** Measured idle = 0.225 ms = 0.06%, not the ~53-57 ms the claim
-requires (two-plus orders of magnitude short). Zero gaps > 5 us; CUDA-graph replay
-already collapsed launch latency; serial data-dependency does NOT equal idle when the
-graph dispatches nodes back-to-back. The "40%" was a misread of within-kernel SM
-efficiency; the "555 ms busy-sum > 384 ms wall implies overlap" was a prefill-contaminated
-`--trace=cuda` artifact (each step recorded as one opaque ~380 ms block).
-
-**CLAIM B** ("Lever 3 - gate fusion - moves the wall, unlike P2a/Lever-2, by removing
-serial launch bubbles"): **REFUTED on mechanism.** (i) There are no bubbles to remove
-(0.06%). (ii) The contrast is fictional: the step is single-stream with ZERO overlap
-anywhere, so P2a/Lever-2 were NOT flat because they "optimized overlapped work" - P2a
-tuned the prefill large-M GEMM (decode GEMMs are a different 136-CTA tail regime) and
-Lever-2 merely relocated mandatory quantize work into the GEMM prologue (net zero).
-(iii) Where the claim is trivially true (any kernel removal cuts wall in a 99.94%-busy
-single-stream step), the slice Lever 3 actually fuses ceilings at **12.76 ms = 3.35%**
-(k_bin_bcast 6.59 + silu/sigmoid 5.36 + l2_norm 0.64 + softplus 0.13 - and even that
-over-counts, since silu is mostly untouched FFN/output-gate). So the wall DOES move, but
-only ~3% (380 -> ~367 ms, 86% -> ~89% of vLLM), and NOT for the claimed reason. Lever 3
-is a component, not the gap-closer.
-
-**CLAIM C** ("the residual gap is software-closable LATENCY, not a GB10 hardware floor"):
-**REFUTED as worded** (no latency, no idle to close - same data as A). The "not a hardware
-floor" half is **UNSETTLED, not proven.** vLLM hits 327 ms on the same silicon, so it is
-not an absolute hard floor - but whether the dominant 51.6% `gated_delta_net` term is
-software-closable in BIT-EXACT form turns on one unmeasured quantity (below).
-
-RESIDUAL UNCERTAINTY (the single open question that decides everything):
- **The DRAM byte-traffic ratio of llama's recurrence vs vLLM's.** Every section above
-  ESTIMATED the GDN state bytes (~190 GB/s effective, ~70% of 273 GB/s peak); none MEASURED
-  it. If llama's `gated_delta_net_cuda<128>` moves ~2x the minimal (s0-read + s1-write)
-  bytes because the un-fused gate/l2norm/writeback/gather ops re-stream state through HBM,
-  then the 51.6% is software-closable by a single-pass fused recurrence (Claim C spirit
-  HOLDS). If llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same,
-  the recurrence is at the GB10 LPDDR5x floor for this state size -> the gap is a
-  hardware/architecture floor and is NOT closable in bit-exact form (Claim C REFUTED on
-  both halves). This is the one measurement that converts the verdict from "refuted as
-  worded" to a definitive yes/no.
- **The MoE model (qwen35moe) is untested.** At B=128 MUL_MAT_ID can trip
-  [TAG_MUL_MAT_ID_CUDA_GRAPHS] (`ne[2] > mmvq_mmid_max`) and disable the WHOLE MoE-decode
-  graph into eager, where the ~3100 per-step launches re-dispatch serially on the Grace
-  cores and inter-op bubbles WOULD reappear. For MoE only, Claim A could partially hold.
-  The dense 335 tok/s headline is fully settled.
-
-### 3. Ranked implementation plan for the remaining ~14% (57 ms/step, 384 -> 327)
-
-Every win must come from kernel GPU-time (bytes), because bubbles = 0 and both engines
-share identical bandwidth/compute floors. Ranked by expected recovery.
-
-| # | Lever | ms/step recovered | -> % of vLLM | bit-exact | tractability | gate |
-|---|---|---|---|---|---|---|
-| **1** | **Single-pass fused GDN recurrence** (fold l2norm+gate+decay+recurrence+state-writeback+gather into ONE pass over state, mirroring vLLM `fused_recurrent_gated_delta_rule_packed_decode`) - cuts state HBM round-trips | **0 to ~40** (= the byte-delta; UNKNOWN until ncu) | 86% -> up to ~98% | near (l2norm reduction; KL < ~1e-3) | HIGH (kernel rewrite) | **ncu byte-ratio test FIRST** |
-| 2 | **Conv-state concat -> ssm_conv fusion** (Lever 1): pass conv-state + new token as separate srcs, update conv state in place (vLLM `causal_conv1d_update`); removes concat_cont + the conv-state cpy | **~8-12** (concat 8.01 + cpy share of 7.62) | +2-3% | YES | MEDIUM | no-regret, build regardless |
-| 3 | **Gate-chain fold** (Lever 3 as designed): sigmoid-beta + softplus+dt+ssm_a gate + q/k l2norm into the recurrence kernel | **~12.76 ceiling** (3.35%) - but SUBSUMED by #1 | +3% | near (l2norm) | MEDIUM | build as a COMPONENT of #1, not standalone |
-| 4 | **bf16 recurrent + conv state** (Lever 5): halve the 196 ms recurrence + conv traffic; keep f32 in-register accumulation | **~70-90** (if floor-bound) | could reach/exceed parity | NO (parity-tolerance decision; must match vLLM stored dtype) | HIGH (rewrite + parity validation) | the ONLY lever that moves the floor kernel; separate precision track |
-| 5 | gdn_gather skip-launch at steady decode | ~0.06 | ~0 | YES | trivial | not worth it (micro) |
-| 6 | GDN occupancy split | 0 | 0 | - | - | NOT a lever: 196608 CTAs / 4096 waves, already saturated, bandwidth-bound |
-| 7 | quantize_mmq attack (Lever 2) | 0 | 0 | - | - | SPENT - relocated mandatory work, proven flat |
-| 8 | decode CUDA-graph capture | 0 | 0 | - | - | SPENT - ALREADY in effect (graphId=11), did not close gap |
-| 9 | persistent cgraph (uid fast-path) | ~0.2 (0.05-0.4%) | ~0 | YES | MEDIUM | second-order to the SSM floor |
-
-Levers 1, 3, and the gather of #5 are the SAME kernel rewrite: build them together as a
-single-pass recurrence. Levers 6/7/8 are dead (at-floor or already-shipped). Lever 4 is a
-distinct, bit-exactness-breaking precision track.
-
-### 4. The honest verdict and the single highest-value next step
-
-**Is true (bit-exact) decode parity reachable?** UNCERTAIN, and it hinges entirely on the
-unmeasured byte ratio:
- If llama's recurrence re-streams state (~2x bytes from un-fused ops): YES - a single-pass
-  fused recurrence (Lever 1) plus conv fusion (Lever 2) plausibly recover ~20-40 ms, taking
-  llama to ~345-365 ms = ~90-95% of vLLM, near-bit-exact (gate on KL tolerance).
- If llama is already at the GB10 bandwidth floor for f32 state: NO in bit-exact form - the
-  57 ms is a hardware floor, and only bf16 state (Lever 4, non-bit-exact) closes it.
-
-Either way, the gating-fold-alone path tops out at ~89% of vLLM, so the project should NOT
-ship the isolated gate fold as "the parity lever."
-
-**SINGLE highest-value next IMPLEMENTATION step:** build the **single-pass fused GDN
-recurrence kernel** (Lever 1 = fold gate + l2norm + state-writeback + gather into one pass
-over the recurrent state) - BUT gate the build on one cheap measurement first, because it
-is a HIGH-effort kernel rewrite that is worthless if the recurrence is already byte-minimal.
-
-**The measurement that confirms it before over-investing (one short GPU run, gap-analysis
-agent only):** `ncu` on `gated_delta_net_cuda<128>` at B=128 vs vLLM's
-`fused_recurrent_gated_delta_rule_packed_decode_kernel` for identical layer dims, two
-counters:
- `dram__bytes.sum` (actual DRAM bytes/call)
- `dram__throughput.avg.pct_of_peak_sustained_elapsed` (achieved % of 273 GB/s)
-
-Decision rule:
- llama moves ~2x minimal bytes OR vLLM moves materially fewer for the same math -> redundant
-  un-fused state round-trips -> BUILD the single-pass fused recurrence; predicted recovery
-  scales with the byte delta (up to ~40 ms). This is the gap-closer.
- llama already moves ~minimal bytes at > 85% of peak and vLLM moves the same -> the
-  recurrence is at the GB10 hardware floor -> do NOT build the fusion for throughput (only
-  the ~3% gate-fold ceiling remains); the sole remaining lever is bf16 state (Lever 4,
-  accept non-bit-exact), and bit-exact parity is NOT reachable.
-
-**No-regret parallel work** (build regardless of the ncu outcome, bit-exact, medium effort):
-the conv-state concat -> ssm_conv in-place fusion (Lever 2, ~8-12 ms = +2-3% toward parity),
-which removes concat_cont (8.01 ms) and the conv-state writeback cpy off a bandwidth-bound,
-single-stream step where their full GPU-time is wall-clock.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/DECODE_GAP_STUDY.md
+++ b/backend/cpp/llama-cpp/patches/paged/DECODE_GAP_STUDY.md
@@ -1,185 +0,0 @@
-# llama-server vs vLLM: decode-step gap decomposition (DGX Spark, GB10 / sm_121)
-
-Profiling study (no engine changes). Question: matched apples-to-apples (both
-batched servers, NVFP4-class weights, prefix caching on, both eager), why is
-`llama-server` ~4-6x slower **per decode step** than vLLM on Qwen3-32B at a
-1024-token shared-prefix / batch-32 fan-out, and what is closable vs structural.
-
-Hardware: NVIDIA GB10 (sm_121), unified LPDDR5X. Model: Qwen3-32B, 64 layers.
-llama side: `~/llama-paged-dev/build-cuda/bin/llama-server`, `q3-32b-nvfp4-dense.gguf`
-(NVFP4 weights, type-40 FP4-MMA path), `-ngl 99 --parallel 32 -c 40960 -fa on`,
-`GGML_CUDA_DISABLE_GRAPHS=1` (eager). vLLM 0.23.0 NVFP4A16 (W4A16/Marlin),
-`--enforce-eager`. Workload: 1024-token shared prefix + unique 32-token suffix,
-K=32 concurrent, generate 64. All profiling scripts are dev-tree only
-(`~/bench/decode_study/`); minimal in-code timers were not needed (server already
-reports per-slot `eval time`, which excludes prompt-eval = pure decode).
-
-## TL;DR
-
-1. **The real-server decode is GPU-BOUND, not host-bound.** During steady decode
-   the GPU is **~94.6% utilized** (nvidia-smi, real run) / 85-95% busy (nsys).
-   Per-slot CPU sampling, detokenize, and `update_slots` are fully hidden: a 5-stage
-   sampler chain gives the *identical* step time as greedy (1346 vs 1343 ms). The
-   "GPU stalls on the CPU serving loop" hypothesis is **refuted** for this workload.
-2. **At 1024 context the decode step is ~84% KV/attention, ~16% weight GEMM** - the
-   opposite of the thin-batch-GEMM story. Attention scaling with context length, not
-   the matmul, is the load-bearing cost.
-3. **The worktree's paged KV engine is a decode REGRESSION: ~1.85x slower than
-   stock** at 1024 ctx (paged 1279-1343 ms/step vs stock 650-729 ms/step). It
-   gathers K/V/mask into a contiguous buffer (`ggml_get_rows`) every layer every
-   step, then runs a dense FA kernel - paying a full extra KV read+copy that vLLM's
-   in-kernel PagedAttention never pays. Paging helps prefix-prefill memory; it hurts
-   decode latency.
-4. Even **stock** llama-server (~650-729 ms/step) is **~4-5x slower than vLLM**
-   (~120-185 ms/step). The residual gap is the **long-context decode-attention
-   kernel** and, secondarily, the **thin-batch FP4 weight GEMM** - both kernel-maturity
-   gaps vs vLLM's FlashInfer/FA paged-decode + Marlin, not serving-loop gaps.
-
-## The measured numbers (batch 32, server-reported pure-decode step time)
-
-`server_decode_step_ms` = max / mean-of-top-8 of per-slot `eval time ms-per-token`
-(the most-contended, full-batch-32 slots; excludes prompt eval).
-
-| config                                   | decode step ms (max / top8) | client wall ms/step |
-|------------------------------------------|-----------------------------|---------------------|
-| paged, ctx 1024, greedy                  | 1343 / 1279                 | 1468                |
-| paged, ctx 1024, **heavy 5-sampler**     | 1346 / 1280                 | 1470                |
-| **stock** (no paging), ctx 1024, greedy  | **729 / 650**               | 768                 |
-| paged, **ctx 64** (short), greedy        | **215 / 215**               | 253                 |
-| vLLM NVFP4A16, ctx 1024 (K=32)           | **~120-185** (270 tok/s)    | -                   |
-
-The brief's reference ~828 ms/step sits between the stock (650-729) and paged
-(1279-1343) numbers measured here; the decomposition below is what is robust. Our
-fan-out shares no prefix across the 32 slots (each slot independently prefills 1056
-tokens - confirmed in the log), so the 32 sequences are genuinely concurrent and the
-"max" slot is maximally contended, which is why our paged max runs a little above 828.
-
-### Context sweep - decode step is attention-scaling, not fixed overhead
-
-Pure-decode step vs shared-prefix length (paged, batch 32):
-
-| prefix ctx | decode step ms |
-|-----------|----------------|
-| 64        | 215            |
-| 128       | ~290           |
-| 256       | ~410           |
-| 512       | ~660           |
-| 1024      | ~1280          |
-
-Roughly linear in context length: ~1 ms of added step time per added context token.
-The **215 ms at ctx 64 is the fixed floor** (weight GEMM + activations + norm/rope +
-loop + sampling, attention negligible). Everything above it scales with KV length =
-attention + KV plumbing. At 1024 ctx the fixed floor is only ~16% of the step.
-
-## Where the ~1280 ms paged decode step goes (nsys, pure-decode window)
-
-`nsys profile --delay=70 --duration=25 --trace=cuda` windowed onto steady 32-way
-decode (`srv_decode2.nsys-rep`; an earlier 25-60s window was discarded because nsys's
-own slowdown stretched the 32 prefills into it, inflating GEMM to a misleading 58%).
-GPU busy in-window 85.5% (nsys adds gaps; the real run is ~94.6% by nvidia-smi).
-
-| bucket                         | % GPU time | abs (of ~1280 ms) | what it is |
-|--------------------------------|-----------:|------------------:|------------|
-| `flash_attn_ext_f16` ATTENTION | **47.7%**  | ~610 ms           | decode attention over the 1056-cell KV |
-| `cpy_scalar` KV copy/cast      | 18.3%      | ~234 ms           | KV write + f32->f16 casts |
-| `get_rows/set_rows` KV gather  | 17.8%      | ~228 ms           | **paged** gather of K/V/mask to contiguous |
-| `mul_mat_q` + `quantize_mmq`   | 15.7%      | ~201 ms           | NVFP4 weight GEMM (+ activation requant) |
-| rmsnorm / silu / rope / add    | ~0.6%      | ~8 ms             | elementwise |
-
-Cross-check: the GEMM bucket (~201 ms) matches the ctx-64 floor (215 ms) - i.e. the
-weight matmul is ~the entire short-context step, and is context-independent, as
-expected. KV/attention buckets (47.7+18.3+17.8 = **83.8%**) match the context-sweep
-finding that ~84% of the step scales with context.
-
-Power signature: ~33-36 W at 94% "utilization" (GB10 can pull far more). High util%
-+ low power = the kernels are **memory/latency-bound, not compute-saturated** - the
-classic decode signature (stream 19 GB of NVFP4 weights + a growing KV every step).
-
-### Stock vs paged decomposition
-
- **Stock** (~650 ms): ~215 ms GEMM floor + ~435 ms attention/KV (contiguous KV read
-  directly by the FA kernel, **no gather**).
- **Paged** (~1280 ms): same ~215 ms floor + ~610 ms attention + **~455 ms paged
-  gather/copy overhead** (the `get_rows` of K/V/mask plus the extra KV copy that
-  feeds the dense FA kernel). That ~455 ms (~36% of the step) is the paged engine's
-  self-inflicted cost and is the entire ~1.85x stock->paged regression.
-
-## vLLM decode architecture mapped onto each llama bucket
-
-vLLM at ~120-185 ms/step is faster on **every** bucket:
-
-| llama bucket (paged)        | ms    | vLLM equivalent | does vLLM avoid it? |
-|-----------------------------|-------|-----------------|---------------------|
-| paged KV gather (get_rows)  | ~228  | PagedAttention reads blocks **in-kernel** via a block table | **Yes - entirely.** No gather op exists. |
-| KV copy/cast                | ~234  | KV written once into block pool; FA reads it in place | Mostly - no per-step recopy |
-| decode attention            | ~610  | FlashInfer / FA paged-decode GQA kernel, split over KV | Same op, far faster kernel on sm_121 |
-| weight GEMM + act quant     | ~201  | fused Marlin/Machete W4A16 dequant+MMA, no separate quant pass | Faster + removes the requant kernel |
-| CPU sampling / loop         | ~0 (hidden) | on-GPU batched sampling | N/A here - already hidden on llama side too |
-
-vLLM's whole-step (~150 ms) is **less than llama's GEMM floor alone (~215 ms)**, so
-vLLM is ahead on the matmul *and* the attention *and* avoids the gather. The gap is a
-stack of kernel-efficiency wins, not one silver bullet.
-
-## Ranked levers - closable vs structural
-
-1. **Remove the paged gather regression. [Tractable, ~455 ms / ~36% on the paged
-   path; net-zero risk - it is a regression]** The worktree's paged engine makes
-   decode 1.85x slower than stock by gathering K/V/mask to contiguous every layer
-   every step (patch 0003 `ggml_get_rows`). For latency-bound decode, **do not enable
-   paged KV** - it only ever helps prefix-prefill *memory*, never decode latency.
-   Fully recovering this *and* keeping paging requires reading paged blocks
-   in-kernel like vLLM (a from-scratch paged-attention CUDA kernel) - see lever 2.
-
-2. **Long-context decode-attention kernel. [Biggest real lever, ~435 ms of stock /
-   ~610 ms of paged; partly structural]** Even stock is attention-bound at 1024 ctx.
-   llama.cpp's `flash_attn_ext_f16` decode path is ~4-5x slower than vLLM's
-   FlashInfer/FA paged-decode GQA kernel on this Blackwell-class part. This is the
-   cost that *grows with context* - exactly the regime the brief targets. Tractable in
-   principle (a proper flash-decoding / split-K-over-KV kernel, and a true in-kernel
-   paged read that also kills lever 1's gather), but it is deep CUDA work on a new
-   arch and partly gated by kernel maturity on sm_121. **Highest-impact, hardest.**
-
-3. **Thin-batch FP4 weight GEMM floor. [Tractable, ~201-215 ms / 15-30%; bounded]**
-   The NVFP4 `mul_mat_q` + separate `quantize_mmq` activation pass is memory-bound and
-   less efficient than vLLM's fused Marlin/Machete W4A16. Fusing dequant into the MMA
-   and folding the activation quant into the GEMM is tractable kernel work. Bounded
-   impact: the floor cannot drop below weight-read-bound (~19 GB / LPDDR5X BW per step).
-
-4. **Host serving loop / per-slot sampling. [NOT a lever]** Measured zero: greedy ==
-   heavy-sampler step time; GPU 94.6% busy. On-GPU/batched sampling buys nothing until
-   the kernels (levers 1-3) get fast enough to expose host overhead. Refutes the
-   "host-bound serving loop" hypothesis for this decode-bound workload.
-
-5. **Continuous-batch scheduler. [NOT the gap / structural elsewhere]** llama-server
-   already fuses all 32 slots into one decode step (one set of kernels per step over
-   batch 32 - confirmed in the trace). vLLM's continuous/chunked-prefill batching wins
-   on *mixed* prefill+decode overlap, but the steady decode-step gap measured here is
-   kernel-bound, not scheduler-bound.
-
-## Honest bottom line
-
-The ~4-6x per-step gap is **GPU-kernel-bound**, and it decomposes as:
-
- ~36% of the *paged* step is a **self-inflicted gather regression** - remove it
-  (don't run paged for decode-latency workloads).
- The remaining ~4-5x vs vLLM (true even for stock) is **kernel efficiency**:
-  llama.cpp's long-context decode-attention and thin-batch FP4 GEMM are slower than
-  vLLM's PagedAttention + Marlin on GB10. That is a **kernel project** (in-kernel
-  paged attention + flash-decoding + fused W4A16 GEMM), not a serving-loop project.
- Sampling, detokenize, `update_slots`, and the continuous-batch scheduler are **not**
-  the gap; the GPU is ~95% busy on memory-bound kernels the whole step.
-
-What is closable: lever 1 (immediately, by not paging), lever 3 (bounded, with kernel
-work). What is structural / hard: lever 2 (the decode-attention kernel + a real
-in-kernel paged read), which is where the context-scaling gap actually lives and where
-any serious effort to approach vLLM on GB10 must go.
-
-## Reproduction (dev-tree only, `~/bench/decode_study/`)
-
- `launch_srv.sh` / `runcfg.sh` - launch llama-server (paged on/off) and a config.
- `client.py` - K=32 token-id fan-out (1024 prefix + 32 suffix), `SAMP=greedy|heavy`.
- `d2drv.sh` - nsys pure-decode window (delay 70s past prefill) -> `srv_decode2.nsys-rep`.
- `cat2.py` - kernel-time categorization from the sqlite export.
- vLLM side: `~/bench/run_vllm.sh` + `vllm_prefix.py` (K=32, ~270 tok/s).
-</content>
-</invoke>
--- a/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
+++ b/backend/cpp/llama-cpp/patches/paged/DECODE_PARITY_EXPLORE.md
@@ -1,756 +0,0 @@
-# Decode parity exploration (post-SSM-fix) - per-agent findings
-
-Post-SSM-fix decode (patches 0018 in-place state write-back + 0019 fused gather):
-dense q36-27b-nvfp4 decode_agg = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact.
-The remaining +54% to parity is the question each section below probes. All numbers
-DGX GB10 (sm_121), fusion OFF baseline, `decode_agg` = `S_TG t/s`.
-
---
-
-## Section: per-token-latency (critical path / host-loop) - READ-ONLY
-
-**Verdict: the per-step critical path and host loop are NOT the residual lever.
-Post-SSM the GPU is still ~99% busy at npl128; the entire exposed-idle budget is
-~0.65% of the step (~2.4 ms), of which graphs already remove the within-step half
-(0.34%) and the between-step host gap is ~2 ms/step (~0.4% post-SSM). The 64-layer
-sequential chain does NOT under-fill the GPU at batch 128 - every kernel's grid
-saturates the SMs on its own. The +54% to parity is GPU kernel work (FP4 GEMM
-efficiency + LPDDR5x weight bandwidth), not serialization or host overhead.**
-
-### 1. Measured exposed-idle structure (a2_nsys pre-SSM rep, read-only sqlite sweep)
-
-`paged_off_npl128.sqlite`, steady window 40-97% of trace (14.78 s, ~16.5 decode
-steps at the pre-SSM ~896 ms/step). Overlap-correct interval-union sweep:
-
-| activity set            | busy %  | exposed idle |
-|-------------------------|---------|--------------|
-| kernels only            | 80.25%  | 19.74%       |
-| kernels + memcpy (all)  | 99.35%  | **0.65%**    |
-
- The 19.4% kernels-only gap = **841 big gaps (median 3.35 ms, ~51/step)** that are
-  filled by D2D memcpy. These ARE the per-layer gated-DeltaNet recurrent-state copies
-  (the `gated_delta_net -> ggml_cpy(state->cache) -> next layer reads state` chain).
-  They were a real critical-path serialization, and **patches 0018/0019 removed exactly
-  these** (D2D bucket 18.9% -> 0.23%; get_rows gather 18.8% -> 0.7%). Decode rose
-  +37.8% (186 -> 256 t/s), ~matching the work removed -> the kernels reflowed
-  back-to-back, so post-SSM these big gaps are CLOSED, not re-exposed (inferred from
-  the throughput scaling; the post-SSM nsys was not re-profiled by this read-only agent).
- The TRUE exposed idle (kernel+memcpy union) is **0.65%**: 18 host gaps >=0.5 ms,
-  **median 2.06 ms, max 2.85 ms, ~1.1/step**. This is the single between-step host gap
-  (sample-128 + `update_slots` + next-batch build) that does NOT overlap GPU compute.
- Within-step launch gaps: 24,190 micro-gaps, median 2.14 us, summing to 50.6 ms =
-  **0.34%** of the window - the pure launch overhead that CUDA graphs collapse
-  (measured 0.37% -> 0.11% in A2_CUDAGRAPH_DECODE; graphs already engage on the
-  default paged decode with a 256-token reset cadence).
-
-### 2. Post-SSM scaling of the FIXED host gap
-
-The ~2 ms/step between-step host gap is FIXED work (independent of GPU kernel time).
-As decode accelerated it grew only as a fraction of a shrinking step:
-
-| build         | step ms @npl128 | host gap | host gap % of step |
-|---------------|-----------------|----------|--------------------|
-| pre-SSM (146) | ~877            | ~2 ms    | 0.24%              |
-| post-SSM (256)| ~499            | ~2 ms    | **~0.40%**         |
-| vLLM (391)    | ~328            | (n/a)    | (would be ~0.6%)   |
-
-Even fully removing it (perfect overlap) buys ~0.4%. It is a second-order floor, not
-the lever - it only becomes material once the kernels are fast enough to drop GPU-busy
-below the host time, which is not the case at 65% of parity.
-
-### 3. The 64-layer chain does NOT under-fill the GPU at batch 128
-
-The decode is an intrinsically sequential depth-64 chain (autoregressive: layer N
-needs layer N-1; cannot be parallelized across layers). The question is whether each
-individual kernel fills the SMs at batch 128. It does:
-
- **GDN kernel** (`gated_delta_net.cu`): launch grid `dim3(H, n_seqs, ceil(S_v/4))`
-  = `48 x 128 x 32 = 196,608 blocks` (dense, H=48 value heads, S_v=128). Block
-  `(warp_size, 4, 1)`. Massively oversubscribes the GB10 SMs. Each warp loads its
-  state shard into registers once and runs a single `n_tokens==1` iteration - O(1) in
-  context (confirmed flat across 4x ctx in GDN_DECODE_VERIFY).
- **FP4 GEMM** (`mul_mat_q`, mmq_x=128): M=128 token tile, well into the M-batched
-  regime, full SM occupancy (and Track B P2a already showed it goes 2 CTA/SM).
- The 99.35% kernel+memcpy busy reading IS the direct proof there is no under-fill at
-  npl128: if the chain under-filled, busy% would be well below 99%.
-
-Under-fill only appears at LOW batch (npl32/npl4), where it manifests as the
-weight-bandwidth/GEMV regime (npl32 = 170 t/s vs npl128 = 256): fewer tokens amortize
-the same per-step weight read, NOT idle SMs. That is a bandwidth floor, not a
-host/scheduler problem.
-
-### 4. What the host actually does per step (eager rep runtime API)
-
-Steady-window `CUPTI_ACTIVITY_KIND_RUNTIME` totals (host-thread wall, overlaps GPU):
-
-| API                       |   n   | total   | avg     |
-|---------------------------|-------|---------|---------|
-| cudaStreamSynchronize     | 1723  | 7775 ms | 4513 us |
-| cudaLaunchKernelExC       | 30983 | 4045 ms | 131 us  |
-| cudaLaunchKernel          | 20385 | 2694 ms | 132 us  |
-| cudaMemcpyAsync           | 2085  |   96 ms |  46 us  |
-
-~104 stream-syncs/step and ~3100 kernel launches/step in eager mode (collapsed by
-graphs to ~900 launches/step). The 7.8 s of sync is the host BLOCKING on the busy
-GPU (it overlaps GPU compute, it is NOT exposed idle) - the GPU stays 99.4% busy. The
-sampled-token path is `cudaMemcpyAsync` (96 ms total, negligible, non-blocking). The
-only NON-overlapped residue is the ~2 ms/step between-step gap in section 1.
-
-### 5. vLLM host-loop comparison (per VLLM_DECODE_GROUNDING.md)
-
-vLLM's eager decode is host-cheap BY CONSTRUCTION and hides the host fully behind the
-async CUDA stream WITHOUT pipelined scheduling (`async_scheduling` was OFF; it won the
-2.4x with synchronous scheduling): persistent pre-allocated input buffers updated by
-vectorized numpy (no per-token Python), attention metadata `build()` once per step
-reused across all layers, no GPU->CPU sync in the hot path, sampled-token D2H
-non-blocking + event-gated, and a fixed small launch sequence (~2 ops/Linear). The
-next-step host prep overlaps the current-step GPU compute on the async stream. The key
-asymmetry vs llama: vLLM builds its graph ONCE and reuses persistent device
-KV/block metadata; ggml rebuilds/reallocates the cgraph each decode step (new
-`cgraph->uid`) and re-dispatches ~3100 launches from the loop on the weak Grace cores.
-
-But this asymmetry is hidden under GPU compute on BOTH sides at npl128: llama's host
-loop is a 0.4% exposed gap, not a 2x lever. vLLM's host cheapness is why ITS step is
-328 ms host-free, but llama's 499 ms is also ~99% GPU - the 171 ms difference is GPU
-kernel time (FP4 GEMM), not host.
-
-### 6. Is any host/serialization lever CUDA-graph or scheduler addressable?
-
- **Within-step launch idle (0.34%)**: CUDA-graph addressable, ALREADY captured by
-  default (0.37 -> 0.11%). Worth ~0% of decode_agg (measured +0.1-0.8%, noise).
-  Nothing left to win here.
- **Between-step host gap (~2 ms, ~0.4%)**: NOT removed by a graph (the graph replays
-  the forward; the host still samples + runs `update_slots` + rebuilds the batch
-  between replays). It is SCHEDULER addressable - overlap step N+1's host prep with
-  step N's GPU compute, mirroring vLLM's persistent-buffer + build-once-reuse +
-  non-blocking-D2H pattern (and ideally reuse the ggml cgraph across steps instead of
-  rebuilding it every ubatch). But the ceiling is ~0.4% of the step, so it is a
-  cleanup, not a parity lever.
- **The +54% to parity is none of the above.** It is GPU kernel work: post-SSM the FP4
-  GEMM family is ~48% of decode (the dominant residual), GDN recurrence ~22.5%, and the
-  decode is weight-bandwidth/latency-bound on LPDDR5x (Track B P2a: a -24.7% FP4-GEMM
-  kernel left decode_agg FLAT, the freed compute became idle gaps -> decode is not
-  GEMM-compute-bound but bandwidth/latency-bound). The lever lives in cutting DRAM
-  traffic per step (fused act-quant to drop the separate `quantize_mmq` pass, native
-  FP4-MMA, and/or NVFP4-dense weight quant), NOT in the host loop or CUDA graphs.
-
-### Evidence
- Read-only sqlite sweeps on `~/bench/a2_nsys/paged_off_npl128.sqlite` (this agent).
- `gated_delta_net.cu` launch grid (DGX `~/llama-paged-dev`).
- A2_CUDAGRAPH_DECODE.md, SSM_DECODE_FIX_RESULTS.md, GDN_DECODE_VERIFY.md,
-  VLLM_DECODE_GROUNDING.md, THROUGHPUT_B_P2a_POSTSSM_RESULTS.md.
-# Decode-Parity Exploration
-
-## Section: gdn-source-compare (llama gated_delta_net.cu vs vLLM fused_recurrent_gated_delta_rule)
-
-### Model config (Qwen3.5-27B dense, from vLLM config.json)
- linear_key_head_dim K = 128, linear_value_head_dim V = 128
- linear_num_key_heads = 16, linear_num_value_heads = 48 (GVA 3:1), conv_kernel = 4
- 64 layers, full_attention_interval 4 -> 48 linear (GDN) : 16 full-attn
- Recurrent state per (seq, v-head) = V*K = 128*128 = 16384 f32 = 64 KiB.
-  Per layer per seq = 48 * 64 KiB = 3 MiB. Both engines store state in f32.
-
-### Which kernels run at decode
- llama: ggml_gated_delta_net_inplace_ids -> gated_delta_net_cuda<S_v=128, KDA=false, keep_rs_t=false>.
-  Gate is SCALAR per head (graph reshapes gate/beta to ne[0]=1), so the cheaper !KDA branch runs (one expf per token, not per-channel).
- vLLM: enable_packed_recurrent_decode -> fused_recurrent_gated_delta_rule_packed_decode_kernel
-  (the dedicated single-token decode kernel, NOT the generic varlen fwd kernel).
-
-### The state HBM traffic is IDENTICAL - it is NOT the lever
-Per (seq, v-head) per decode token both engines read 64 KiB state + write 64 KiB state, f32, coalesced.
-The dominant memory term is equal. llama is NOT moving more state bytes than vLLM.
-=> The 1.46 ms/call is llama achieving LOWER effective bandwidth on the SAME bytes,
-   plus extra non-state work, NOT a fundamental HBM-traffic deficit. Hence closable.
-
-### Algorithmic / parallelization delta (the real differences)
-
-1) Reduction strategy (biggest structural difference)
-   - llama: WARP-PER-OUTPUT-COLUMN. State stored transposed M[col][i]=S[i][col]. Each warp owns
-     one V-column; the contraction over the 128 K-rows is a cross-lane warp_reduce_sum.
-     TWO warp_reduce_sum per token (one for kv = S^T@k, one for attn = S^T@q) = ~10 shuffle
-     rounds on the critical path, with n_tokens=1 they are NOT amortized.
-   - vLLM: THREAD-PER-OUTPUT-ROW. b_h is a [BV,BK]=[32,128] tile; each thread owns a FULL K-row
-     of state. sum(b_h*b_k, axis=K) and sum(b_h*b_q) are THREAD-LOCAL 128-wide reductions -
-     ZERO cross-thread shuffles. Outer-product update b_h += b_v*b_k is also thread-local.
-   Same FLOPs, but vLLM has no shuffle-reduction latency in the recurrence.
-
-2) Occupancy / launch geometry (likely the dominant bandwidth gap)
-   - llama: block = (32 lanes, 4 warps) = 128 threads; grid = (H=48, n_seqs, ceil(128/4)=32).
-     Per (head,seq) it launches 32 blocks * 128 threads = 4096 threads to touch a 16384-elem state
-     (only 4 state elems/thread). launch_bounds(128, 2) budgets registers for >=2 blocks/SM; with
-     s_shard[4]+k_reg[4]+q_reg[4]+addressing the register pressure caps it near ~2 blocks = 8 warps/SM
-     (~12-16% occupancy on GB10). A memory-bound kernel at ~8 warps/SM cannot generate enough in-flight
-     loads to saturate 273 GB/s -> low achieved bandwidth on the state read/write.
-   - vLLM: 1 warp/program (num_warps=1), grid (NV=4, B*HV), small register footprint, num_stages=3
-     software-pipelines (prefetches) the state load. Far higher memory-level parallelism per SM.
-
-3) Redundant non-state traffic in llama
-   - q,k re-loaded by EVERY column-warp: 128 column-warps/head each reload the same 128-float q and k
-     => ~128x amplified L2 loads of q/k per head/token (vLLM reloads ~4x, once per NV program).
-     Small (L2-resident) but adds load-issue + L2 pressure competing with the state stream.
-   - Output store: llama writes attn_data[col] from lane 0 only (31/32 lanes idle), scattered
-     single-float stores; vLLM stores a contiguous BV=32 vector (coalesced).
-
-4) Fusion delta (per-layer kernel-launch / HBM round-trip count)
-   - vLLM packed_decode FUSES into ONE kernel: q/k l2norm + q*scale + softplus(a+dt_bias) +
-     (-exp(A_log)) gate + sigmoid(beta) + the recurrence + state write-back.
-   - llama computes these as SEPARATE ggml ops/kernels in the graph before the GDN op:
-     ggml_l2_norm(q), ggml_l2_norm(k), ggml_add(+dt), ggml_softplus, ggml_mul(gate),
-     ggml_sigmoid(beta) (+ conv/silu), each a launch + small HBM round-trip. Plus a separate
-     gdn_gather_nonident_kernel launch per layer (a no-op at steady-state decode: every block
-     early-returns on the identity check, but still a grid launch of n_seqs blocks).
-   Across 48 linear layers this is ~6-10 extra small kernels/layer (~300-480 extra launches/token).
-   Whether this dominates depends on CUDA-graph capture (see A2_CUDAGRAPH_DECODE.md); if captured,
-   launch latency is hidden and the cost reverts to the per-op HBM round-trips + dependency gaps.
-
-### What a faster llama GDN decode kernel would need (optimization scope)
- A. Re-parallelize like vLLM: thread/lane owns a full K-row (or K-shard) so the kv and attn
-  contractions become register-local FMAs, eliminating the two warp_reduce_sum per token.
- B. Raise occupancy for the memory-bound regime: drop/raise the launch_bounds minBlocks hint
-  (the `,2)` is too low), shrink the block, cut registers, and add a software-prefetch of the next
-  state shard so more state loads are in flight per SM. This directly lifts achieved bandwidth on
-  the equal state bytes - the single highest-leverage change.
- C. Load q,k ONCE per (head,seq) into shared memory instead of 128x per-column reload; coalesce
-  the output store across the warp.
- D. Fuse the gate/l2norm/scale (softplus, exp(A_log), sigmoid, l2norm) INTO the recurrence kernel,
-  reading raw a/b/A_log/dt_bias from registers, removing ~6 elementwise passes + their HBM round-trips
-  per layer (matches vLLM's packed_decode). Drop the gather no-op kernel at steady-state decode
-  (or fold the identity check into the recurrence prologue, which it already partly does).
- E. (Longer term) bf16 state would HALVE the dominant traffic, but vLLM keeps f32 too, so this is a
-  divergence-from-reference not a parity lever.
-
-### Bottom line
-llama's GDN decode kernel is NOT moving more state HBM bytes than vLLM (the dominant term is equal),
-so the 1.46 ms/call is an EFFICIENCY gap, not a traffic floor: (1) cross-warp shuffle reductions on
-the n_tokens=1 critical path, (2) low occupancy (~8 warps/SM from launch_bounds + register pressure)
-starving memory-level parallelism so the equal state bytes move at lower effective bandwidth, plus
-(3) 128x redundant q/k L2 loads and (4) ~6-10 unfused gate/norm elementwise kernels per layer that
-vLLM folds into one packed-decode kernel. Highest-leverage fixes: raise occupancy + prefetch (B) and
-row-local reductions (A); secondary: gate/norm fusion (D) and q/k shared-mem reuse (C).
-
---
-
-## Section: validate-findings (adversarial re-derivation from raw DGX data) - READ-ONLY
-
-Re-queried `CUPTI_ACTIVITY_KIND_KERNEL` + `CUPTI_ACTIVITY_KIND_MEMCPY` directly (kernel and
-memcpy summed separately so D2D is never lumped into compute), not from summary text.
-
-### CLAIM 1 - decode decomposition
-PRE-FIX (`a2_nsys/paged_off_npl128.sqlite`, last 17s) vs `decode_decomp.txt`, shares match within ~0.1pp:
-gated_delta_net 23.40% (doc 23.43), k_get_rows 21.99% (21.88), MEMCPY-DtoD 18.89% / 382 GB /
-1583 ops (18.90 / 356 GB / 1584), mul_mat_vec_q 15.53% (15.51), mul_mat_q 10.48% (10.37).
-=> Shares CONFIRMED; the only visible mismatch is the DtoD byte total (382 GB here vs the doc's
-356 GB). gated_delta_net = largest single non-GEMM kernel; FP4-GEMM group ~28%; full attention 0.37%.
-
-D2D collapse: the only on-box post-fix decomp is `ssm_decomp/after.sqlite`; MEMCPY-DtoD there =
-526 ops / 0.9 ms / 0.05 GB = 0.008% of busy (down from 382 GB / 18.89%). => CONFIRMED, and stronger
-than the doc's "0.23%": the 382 GB state copy-back is gone. The exact "0.23%/2.93GB/734ops" is not
-reproducible: my DtoD total is 0.05 GB, and the 2.16 GB is DtoH.
-
-FLAG (refutes part of the Step-2 decomp): `after.sqlite` is a Step-1 build (patch 0018 only),
-NOT Step-2. It still shows k_get_rows_float 28.44% (gated_delta_net 28.96%, FP4-GEMM group ~33%),
-no `gdn_gather_nonident` kernel, profiled S_TG=164 (~Step-1 180, not Step-2 256); mtime 00:31
-predates the 08:48 rebuild that carried patch 0019. The Step-2 split in `SSM_DECODE_FIX_RESULTS`
-("get_rows 18.8%->0.7%, FP4-GEMM ->48%, GDN 22.5%") has NO surviving sqlite, and the script meant
-to produce it (`ssm_decomp.sh`) CRASHED (Python SyntaxError, see `ssm_decomp_after.out`). So
-"FP4-GEMM ~48%" is UNVERIFIED against raw Step-2 data: measured ~33% on Step-1; removing the 28%
-get_rows bucket lifts it to ~46% arithmetically, so ~48% is plausible but not directly measured.
-Section 1 above and SSM_DECODE_FIX_RESULTS both inherit this unverified Step-2 split.
-
-### CLAIM 2 - 146 -> ~257 ("+66%")
-146.23 baseline CONFIRMED (`ssm_decode_baseline.out`); final 256.57 / 252.50 / 254.02 across
-SSM_DECODE_FIX_RESULTS + THROUGHPUT_B_P2a, within ~1.6%. Magnitude CONFIRMED. TRAP: 146->257 is
-+76% (146->254 = +74%), NOT +66%. "66%" is the % of vLLM (257/391 = 65.7%), not the speedup.
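
The two percentages are easy to conflate, so here is a minimal check using only the figures quoted in this claim (146.23 baseline, ~257 post-fix, 391 vLLM reference). The helper names are illustrative and do not come from any bench script.

```python
def speedup_pct(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after / before - 1.0) * 100.0


def share_pct(value: float, reference: float) -> float:
    """`value` expressed as a percentage of `reference`."""
    return value / reference * 100.0


baseline, post_fix, vllm_ref = 146.23, 257.0, 391.0

gain = speedup_pct(baseline, post_fix)     # speedup over the pre-fix baseline
of_vllm = share_pct(post_fix, vllm_ref)    # fraction of the vLLM reference

print(f"speedup: +{gain:.1f}%  |  share of vLLM: {of_vllm:.1f}%")
```

The first number is the speedup (about +76%); the second is the share of the vLLM reference (about 66%), which is where the mislabeled "+66%" comes from.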
-
-### CLAIM 3 - P2a GEMM-remap FLAT on decode
-THROUGHPUT_B_P2a: dense npl128 252.50->254.02 (+0.6% noise), npl32 -0.4%, MoE flat; FP4 GEMM
-kernel itself -24.7%, PREFILL +12.7%. Pre-SSM corroborated by THROUGHPUT_B_P1. => CONFIRMED.
-
-### CLAIM 4 - 65% of vLLM (254 vs 391)
-254/391 = 64.96%, 256.57/391 = 65.6%; vLLM 391 = enforce_eager apples ref. => CONFIRMED.
-
-### Traps checked
-GGML_CUDA_DISABLE_GRAPHS set `=1` explicitly (not the empty-value trap); graphs ON/OFF within
-noise. memcpy-in-compute lumping AVOIDED (separate table sums). Decomp reps are ntg24-under-nsys
-(S_TG 149/164) - valid for SHARES only; throughput correctly from unprofiled ntg128 logs.
-
-### Net verdict
-Claim 1: pre-fix decomp CONFIRMED exact; D2D collapse CONFIRMED (stronger); Step-2 0.7%/48% split
-UNVERIFIED (producer script crashed; the only post-fix sqlite is Step-1). Claim 2: magnitude
-CONFIRMED, "+66%" label REFUTED (true +76%; 66% = % of vLLM). Claim 3: CONFIRMED. Claim 4: CONFIRMED.
-
---
-
-## Section: weight-bandwidth (whole-step DRAM budget, READ-ONLY math)
-
-Agent label: weight-bandwidth. Method: exact GGUF tensor accounting (q36-27b-nvfp4,
-arch qwen35, 64 layers) + activation-state math + existing nsys/decode_decomp; no GPU started.
-Config = the production decode number: llama-batched-bench -fa on -npp128 -ntg128 -npl 128
-(B = n_parallel = 128 sequences, S_TG = 254 t/s post-0019). GB10 LPDDR5x peak ~273 GB/s.
-
-### Exact per-step DRAM byte budget at B=128 (ctx avg ~192 over the ntg128 window)
-
-NVFP4 type-40 = 0.5625 B/weight (4-bit data + e4m3 per-16 micro-scale; verified: 5120*48*0.5625=138240).
-
-WEIGHTS (read ONCE per step, shared across all 128 seqs):
-  - NVFP4 layer weights (type40, 64 layers): 13,062.7 MB = 12.76 GB
-      (per SSM layer 205.6 MB x48 = 9867.7 MB ; per full-attn layer 199.7 MB x16 = 3195.0 MB)
-  - LM head output.weight: type 30 = **bf16, NOT quantized** = 2425 MB = 2.37 GB (read in full each step)
-  - per-layer norms/conv1d/ssm_a/dt_bias (type0 f32): 10.1 MB
-  - token_embd: EXCLUDED (get_rows gathers only 128 rows, negligible)
-  => WEIGHTS TOTAL = 15.14 GB / step
-
-PER-SEQUENCE STATE (x128 seqs, read + write every step):
-  - SSM recurrent state: inner_size(6144) x state_size(128) x 4B(f32) = 3.0 MB / layer / seq
-      x 48 SSM layers x 128 seq = 18.43 GB read + 18.43 GB write = **36.86 GB / step**
-  - conv state: conv_k(4) x conv_dim(10240) x 4B = 160 KB / layer / seq
-      x 48 x 128 = 0.96 GB read + 0.96 GB write = 1.92 GB / step
-  - KV cache (16 full-attn layers, GQA n_kv_head=4, k+v_len=512, f16):
-      4096 B/tok/layer x 16 x ~192 ctx x 128 seq = ~1.6 GB read / step
-
-  TOTAL ~= 15.14 (W) + 36.86 (SSM state) + 1.92 (conv) + 1.6 (KV) = **~55.5 GB / step**
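
The budget above can be re-derived from the stated tensor dimensions. This is a sketch under two assumptions that are mine, not the document's: per-item sizes are binary MB, and the roll-up to GB divides by 1000 (that combination is what reproduces the 36.86 and 1.92 figures). The weight total is taken as quoted rather than recomputed from the GGUF.

```python
MB_PER_GB = 1000.0   # assumption: reproduces the state figures quoted above

N_SEQ = 128          # B = n_parallel
N_SSM_LAYERS = 48
N_ATTN_LAYERS = 16
CTX_AVG = 192        # average context over the ntg128 window

# Weights, read once per step: NVFP4 layers + bf16 LM head + small f32 tensors (GB, as quoted).
weights_gb = 12.76 + 2.37 + 0.01

# SSM recurrent state: inner_size x state_size x 4 B (f32), per layer per sequence.
ssm_state_mb = 6144 * 128 * 4 / 2**20
ssm_gb = 2 * ssm_state_mb * N_SSM_LAYERS * N_SEQ / MB_PER_GB     # read + write

# Conv state: conv_k x conv_dim x 4 B (f32), per layer per sequence.
conv_state_mb = 4 * 10240 * 4 / 2**20
conv_gb = 2 * conv_state_mb * N_SSM_LAYERS * N_SEQ / MB_PER_GB   # read + write

# KV cache: 4096 B per token per full-attention layer, read only.
kv_gb = 4096 * N_ATTN_LAYERS * CTX_AVG * N_SEQ / 1e9

total_gb = weights_gb + ssm_gb + conv_gb + kv_gb
print(f"weights {weights_gb:.2f} | ssm {ssm_gb:.2f} | conv {conv_gb:.2f} | "
      f"kv {kv_gb:.2f} | total {total_gb:.1f} GB/step")
```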
-
-### Floor vs measured -- decode is NOT at the bandwidth floor
-
-  Bandwidth floor = 55.5 GB / 273 GB/s = **203 ms/step**
-  Measured llama  = 128 tok / 254 t/s   = **504 ms/step**  => **2.48x the floor** (eff BW 110 GB/s = 40% of peak)
-  vLLM 391 t/s    = 128 / 391           = **327 ms/step**  => 1.61x the floor (eff BW 170 GB/s = 62% of peak)
-
-  The SAME 55.5 GB/step floor applies to vLLM: identical NVFP4 weights, and its
-  fused_recurrent_gated_delta_rule reads+writes the identical f32 recurrent state. So both engines
-  face the same DRAM wall; vLLM simply moves those bytes at 62% of peak vs llama's 40%. The 62/40 =
-  1.55x utilization gap is EXACTLY the 254->391 (1.54x) throughput gap. => Decode-parity is a
-  bandwidth-UTILIZATION / launch-serialization problem, NOT a DRAM-traffic-volume problem. Bandwidth
-  is not the binding constraint (we sit 2.5x above the floor); confirms the GDN-kernel section above.
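
The floor comparison is plain arithmetic on three inputs: bytes per step, peak bandwidth, and measured throughput. A minimal sketch, with the function and key names chosen here for illustration:

```python
PEAK_GBPS = 273.0   # GB10 LPDDR5x peak, as quoted above
STEP_GB = 55.5      # per-step DRAM traffic at B=128
BATCH = 128


def step_report(tok_per_s: float) -> dict:
    """Per-step time, distance from the bandwidth floor, and effective bandwidth."""
    step_ms = BATCH / tok_per_s * 1000.0
    floor_ms = STEP_GB / PEAK_GBPS * 1000.0
    eff_gbps = STEP_GB / (step_ms / 1000.0)
    return {
        "step_ms": step_ms,
        "floor_ms": floor_ms,
        "x_floor": step_ms / floor_ms,
        "eff_gbps": eff_gbps,
        "util": eff_gbps / PEAK_GBPS,
    }


llama = step_report(254.0)
vllm = step_report(391.0)

for name, r in (("llama", llama), ("vLLM", vllm)):
    print(f"{name}: {r['step_ms']:.0f} ms/step, {r['x_floor']:.2f}x floor, "
          f"{r['eff_gbps']:.0f} GB/s ({r['util']:.0%} of peak)")
```

Because both engines are charged the same bytes per step, the utilization ratio is the throughput ratio by construction; the calculation only shows how far each sits from the floor.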
-
-### Traffic composition is STATE-dominated at B=128 (qualifies the "weight-quant" verdict)
-
-  SSM state r+w = 66% of step traffic; weights = 27%; conv = 3.5%; KV = 3%.
-  At B=128 weights are a minority of traffic, and we are 2.5x above the floor anyway -> NVFP4-dense
-  weight quant (the QWEN36_NVFP4 verdict's lever) cannot move batch-128 decode much. Weight-quant
-  helps PREFILL (compute/weight-bound, already +12.7% from the GEMM remap) and LOW-batch decode.
-  Cross-check at B=32: traffic ~25.2 GB/step (weights now 60%), floor 92 ms, measured 189 ms = 2.05x
-  floor. The sublinear scaling 32->128 (4x batch, only 1.5x throughput: 169->254) is fully explained
-  by per-seq state traffic growing with B while weights stay amortized -> at B=128 the step has become
-  state-traffic-heavy but is STILL 2.5x off the floor, i.e. latency/overlap-bound, not byte-bound.
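
The B=32 cross-check follows from a two-term traffic model: weights are a fixed cost per step, and the state and KV terms scale with the number of sequences. A sketch, assuming per-sequence traffic scales linearly with batch and that the average context is the same at both batch sizes (both are simplifications):

```python
WEIGHTS_GB = 15.14                        # read once per step, independent of batch
PER_SEQ_GB = (36.86 + 1.92 + 1.6) / 128   # SSM state + conv state + KV, per sequence
PEAK_GBPS = 273.0


def traffic_gb(batch: int) -> float:
    return WEIGHTS_GB + PER_SEQ_GB * batch


def floor_ms(batch: int) -> float:
    return traffic_gb(batch) / PEAK_GBPS * 1000.0


def weight_share(batch: int) -> float:
    return WEIGHTS_GB / traffic_gb(batch)


for batch, tok_per_s in ((32, 169.0), (128, 254.0)):
    measured_ms = batch / tok_per_s * 1000.0
    print(f"B={batch}: {traffic_gb(batch):.1f} GB/step, weights {weight_share(batch):.0%}, "
          f"floor {floor_ms(batch):.0f} ms, measured {measured_ms:.0f} ms "
          f"({measured_ms / floor_ms(batch):.2f}x floor)")
```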
-
-### Redundant traffic llama reads that vLLM avoids (cut list, by impact)
-
-  1. (HISTORICAL, FIXED by 0018) Redundant DtoD recurrent-state copy = +18.4 GB/step EXTRA
-     (pre-fix decode_decomp: MEMCPY-DtoD 18.9%, 80 copies/step ~230 MB each = 18.4 GB; nsys window
-     356 GB/19.8 steps). This doubled state traffic and was the dominant pre-fix waste. Verified gone
-     post-fix: the THROUGHPUT_B_P2a A/B kernel sum (npp128 ntg24 npl128) lists gated_delta_net /
-     mul_mat_q / quantize but NO MEMCPY-DtoD term. (The committed ~/bench/a2_nsys sqlites are all
-     PRE-fix S_TG~149 traces; re-profiling deferred to the designated profiler.) This single removal
-     (18.4 GB/273 ~= 67 ms/step of bytes plus the killed overlap stalls) is the bulk of 146->254.
-  2. conv state as a SEPARATE ssm_conv kernel + separate buffer: 1.92 GB r+w/step AND 48 extra kernel
-     launches/step. vLLM folds the causal conv into its recurrence kernel. Cut ~= 7 ms bytes + 48
-     launches/step of serialization.
-  3. Residual get_rows gather post-0019 (~0.7%, decode_decomp pre-fix k_get_rows was 21.9% / ~96
-     ops/step = 2/SSM-layer): vLLM indexes the per-seq state in-kernel; llama still does a small
-     gather/scatter. ~0.13 GB. 0019 already folds most of it; fold the identity check fully into the
-     recurrence prologue.
-  4. quantize_mmq_nvfp4: 448 ops/step re-quantizing activations to NVFP4 before each FP4 matmul.
-     Activation BYTES are negligible, but it is 448 extra kernel launches/step that vLLM fuses into
-     the GEMM prologue -> pure launch latency, not traffic.
-  5. NOT redundant: weight bytes (identical NVFP4 to vLLM), SSM-state r+w (inherent, vLLM pays it),
-     NVFP4 scale scalars (8 B/tensor). Note the LM head is bf16 not quantized (2.37 GB/step, 16% of
-     weight traffic) -- fp8 LM head would save ~1.2 GB/step but only matters if vLLM also quantizes it.
-
-### Bottom line (weight-bandwidth)
-At B=128, decode moves ~55.5 GB/step and runs at 2.48x the 273 GB/s floor (40% util) vs vLLM's 1.61x
-(62% util). Same bytes, same floor for both engines -> decode is bandwidth-UTILIZATION-bound, not
-traffic-bound. There is NO large redundant-byte stream left to cut post-0018/0019 (the 18.4 GB/step
-DtoD redundancy is already gone); the remaining 254->391 is recovered by raising achieved bandwidth
-(occupancy + prefetch on the GDN state loads, conv fusion to drop 48 launches/step) so the EXISTING
-55.5 GB/step moves at vLLM's 62% instead of 40%. Weight-quant (NVFP4-dense) is a PREFILL / low-batch
-lever, largely orthogonal to the batch-128 decode-parity gap.
-
---
-
-## Section: explore-other-levers (broad sweep for OTHER llama-specific decode inefficiencies) - READ-ONLY, no GPU
-
-Scope handoff: GDN-kernel internals -> `gdn-source-compare`; host loop / graphs / gaps ->
-`per-token-latency`; weight-byte / utilization -> `weight-bandwidth` section above (which already
-covers the BF16 lm_head and the "same bytes, 40% vs 62% util" framing - I concur, no need to repeat).
-This section covers the levers NONE of those own: the FP4 act-quant fusion, the M=128-vs-M=1 ggml
-fusion gate, TMA scoping, and the conv-state residual.
-
-**Terminology fix that matters for the whole doc:** in this repo's benches **"fusion OFF" means
-`LLAMA_FUSE_NVFP4_QUANT=0`** (Track A's NVFP4 act-quant producer), confirmed in
-`a2_nsys.sh`/`a2_4cell.sh`/`trackA_clean.sh`. It does NOT set `GGML_CUDA_DISABLE_FUSION`, so the
-**standard ggml-cuda elementwise/GLU/rope fusion is ON** in every result. The header's "fusion OFF
-baseline" is only about the act-quant producer.
-
-**Framing (consistent with the sections above, sharpened):** the binding constraint is bandwidth
-UTILIZATION and the kernel-dependency chain, not traffic or per-kernel compute (the P2a -24.7% GEMM
-speedup and CUDA graphs both left decode flat). The same change raises utilization AND shortens the
-chain: **fewer, fused kernels per step** - removing whole passes vLLM doesn't run. So rank levers by
-"whole pass eliminated", not "microseconds shaved".
-
-### L1. Re-test Track A act-quant fusion (`LLAMA_FUSE_NVFP4_QUANT=1`) POST-SSM. [impact ~8-11%, tractability HIGH - code exists, owned by tasks 38-41]
-`quantize_mmq_nvfp4` is a standalone full-activation requantize run once per NVFP4 GEMM at M=128
-(the weight-bandwidth section counts 448 such launches/step). vLLM has **zero** equivalent:
-`rms_quant_fusion.py:98` folds it into RMSNorm, `act_quant_fusion.py:40,128` into SiLU+mul - the
-activation never hits a temp buffer. Track A built exactly this fused producer (tasks 38-40 DONE),
-but `LLAMA_FUSE_NVFP4_QUANT=1` regressed, and EVERY post-SSM bench ran with it OFF. **The regression
-is likely stale:** pre-SSM the GPU was 99% busy on the state-copy chain, so folding act-quant into
-the norm only relocated busy work into a lower-occupancy kernel with no idle to reclaim. Post-SSM the
-chain has real idle and removing 448 launches/step both shortens the dependency chain and lifts
-utilization - exactly the post-0018/0019 bind. Highest-value CLEAN removal; needs only a re-bench
-(re-run `trackA_clean.sh` on the post-0019 build), not new code. Do not treat the prior regression
-as final.
-
-### L2. M=128 norm->matmul prologue fusion - the ggml fusion gate that does NOT fire at decode batch. [impact ~5-15% aggregate, tractability MEDIUM]
-ggml-cuda's built-in `rms_norm+mul+mul_mat_vec_q` fusion (`ggml_cuda_should_fuse_mul_mat_vec_q`,
-ggml-cuda.cu:2502) is gated to `dst->ne[1]==1` - it ONLY fires at **npl1** (M=1). At npl128
-(`mul_mat_q`, M=128) it does NOT fire, so the per-layer RMSNorm stays a separate kernel feeding the
-GEMM and the act path is unfused (L1). vLLM fuses both into the GEMM prologue at all M. This is the
-M=128 generalization of the existing M=1 fusion + L1; largest aggregate surface but real kernel work.
-Implies a regime split worth stating loudly: **npl1 single-stream latency already gets this fusion;
-the npl128 throughput number does not** - tune the two separately.
-
-### L3. TMA weight feed: a PREFILL / npl1-latency lever, NOT an npl128-decode lever.
-Answering the brief's question (is the GEMM idle time a FEED problem that TMA fixes, or is the GEMM off the critical path, where TMA cannot help?):
-P2a cut GEMM compute and the freed time became IDLE, so at npl128 the GEMM finishes early and the
-stall is BETWEEN kernels, not inside the GEMM waiting on weight tiles. TMA accelerates a
-*feed-stalled* GEMM; at npl128 the GEMM is not the binder, so TMA won't move npl128 S_TG. It pays on
-(a) **prefill** (compute/feed-bound; the remap already gave +12.7%) and (b) **npl1 decode**, a pure
-weight-feed GEMV (full model / 273 GB/s ~ 19-20 tok/s ceiling). Scope TMA to prefill + low-batch
-latency; do not bank it for batch-128 decode parity. (Consistent with the weight-bandwidth section's
-"NVFP4-dense is a prefill/low-batch lever".)
-
-### L4. In-place / `ids` conv-state - apply the 0018/0019 pattern to `ssm_conv`. [impact ~1-3%, tractability HIGH, proven pattern, bit-exact-able]
-After the SSM fix the residual D2D is the conv-state copy (`build_conv_state`,
-delta-net-base.cpp:449-525: `build_rs` reads 3 prior samples, `ggml_concat` the new token, writes
-the last 3 back), plus `ssm_conv` (~0.8-1.5%) and a per-GDN-layer `concat_cont` (48/step). The exact
-in-place + `ids`-read treatment from 0018/0019 applies to the conv state, and `ssm_conv`+`concat`
-can fold into the GDN kernel prologue (it already has `ids` plumbing). Small ceiling but bit-exact,
-low-risk, and removes ~48 launches/step from the chain - this is the "conv fusion to drop 48
-launches/step" the weight-bandwidth section calls for, made concrete via the proven patch pattern.
-
-### Deferred (covered by other sections, I concur)
- GDN occupancy / row-local reductions / gate-norm fusion -> `gdn-source-compare`. Add only: bf16
-  state halves the dominant traffic but vLLM keeps f32, so it is a divergence-from-reference, not a
-  parity lever - last priority, quality-risk.
- BF16 lm_head / weight-byte / 40%-vs-62% utilization -> `weight-bandwidth` section. lm_head NVFP4 is
-  an absolute ~1-2% trim, not a vLLM-relative gap (vLLM likely keeps it bf16 too).
- Full-attention KV path (16 attn layers, 0.4-1.8%, O(ctx) but tiny) -> CLOSED, not a lever.
-
-### Bottom line (this section's net-new)
-Ranked by "whole pass vLLM eliminated": **L1 (re-test act-quant fusion post-SSM - clean removable
-pass, code already written, just needs a post-0019 re-bench)** > **L2 (M=128 norm/act prologue
-fusion - biggest aggregate surface, real work)** > **L4 (conv-state in-place - cheap, proven 0018/0019
-pattern, -48 launches/step)**. **L3 (TMA) is mis-scoped if aimed at npl128 decode** - it is a prefill
-/ npl1-latency lever, same bucket as NVFP4-dense weight quant. Caveat inherited from
-`validate-findings`: the post-SSM act-quant absolute share (L1) is on an unverified Step-2 decomp
-(only clean post-fix sqlite is Step-1); re-measure on a clean Step-2 nsys when the profiler runs.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## Section: profile-both-engines (GROUND-TRUTH post-SSM nsys of llama AND vLLM at npl128) - THE GPU PROFILER
-
-Agent label: profile-both-engines (the only GPU agent). Fresh post-SSM nsys traces of
-BOTH engines at the same shape (128-seq decode, 128-token prompts), q36-27b-nvfp4 dense.
-llama = `build-cuda-base` (no FP4 flag, byte-identical to stock, HEAD 46d7dd8 = patch 0019
-SSM fix), `llama-batched-bench -npp128 -ntg32 -npl128 -fa on`, eager (DISABLE_GRAPHS=1) for
-a clean per-kernel trace. vLLM = 0.23.0 in-process offline (`VLLM_ENABLE_V1_MULTIPROCESSING=0`
-so cudaProfilerApi controls the worker), enforce_eager, max_num_seqs 256, 128 prompts.
-Decode-only windows (prefill excluded), overlap-correct interval-union busy, GPU-accurate
-per-call kernel durations. This is the post-SSM **Step-2** trace `validate-findings` flagged
-as having no surviving sqlite - it now exists: `~/bench/postssm_decomp/`.
-
-### 0. THROUGHPUT GROUND TRUTH (un-profiled, prefill-subtracted) - resolves the 391 reference
-
-The vLLM 391 reference is real and reproduced. Prefill-subtracted decode step (two-length
-w16/w64 timing, in-process, batch 128):
-
-| engine / mode            | ms/step | decode tok/s | notes                          |
-|--------------------------|---------|--------------|--------------------------------|
-| llama post-SSM (graphs)  | ~510-522| **245-251**  | S_TG @npl128 ntg32 (this run)  |
-| vLLM enforce_eager       | 324.9   | **394.0**    | == the ~391 ref (h2h log 371-384)|
-| vLLM cuda-graphs         | 304.9   | **419.8**    | graphs buy only +6%            |
-
- **CUDA graphs are NOT the parity lever**: vLLM is already 394 t/s EAGER; graphs add +6%
-  (394->420). llama-batched-bench already runs WITH graphs at 245. So the gap is eager-vs-eager
-  kernel work, confirming `per-token-latency` and `A2_CUDAGRAPH_DECODE`.
- TRAP I hit and corrected: the FIRST vLLM nsys window (0.35-0.99) read 468 ms/step / 273 t/s -
-  WRONG, contaminated by prefill chunked-GDN kernels AND eager-nsys host overhead. The tight
-  decode-only window (0.62-0.98) reads **326.5 ms/step**, matching the un-profiled 324.9 ms
-  exactly -> the tight window is faithful; per-kernel numbers below use it.
-
-### 1. POST-SSM per-step decode decomposition, SIDE BY SIDE (GPU-accurate, prefill-free)
-
-Both at batch 128. llama 510 ms/step (98.7% GPU-busy), vLLM 326 ms/step (97.9% GPU-busy).
-ms/step = on-device kernel time per real decode step (nsys host overhead does not inflate GPU
-kernel duration; per-step = GPU-ms / real-step-count from the decode-only GDN call count).
-
-| component (per step)        | llama ms/step | llama % | vLLM ms/step | vLLM % |
-|-----------------------------|---------------|---------|--------------|--------|
-| GDN linear-attn recurrence  | 193 (48x4.03) | 38%     | 174 (48x3.62)| 53%    |
-| FP4 matmul + act-quant      | **236**       | **46%** | **117**      | **36%**|
-|   - mul_mat_vec_q (GEMV)     | 132 (48x2.75) | 26%     | -            | -      |
-|   - mul_mat_q (GEMM)         | 88 (448 calls)| 17%     | cutlass 61   | 19%    |
-|   - quantize_mmq_nvfp4       | 16 (448)      | 3%      | nvjet 53 + cvt 2 | 17% |
-| full attention (16 layers)  | 6.6 (16)      | 1.3%    | 6.2 (16)     | 1.9%   |
-| SSM conv + glue/elementwise | ~45           | 9%      | ~22          | 7%     |
-| MEMCPY (D2D+H2D)            | 2.5 (131 MB)  | 0.5%    | 0.36 (85 MB) | 0.1%   |
-| **TOTAL**                   | **~510**      | 100%    | **~326**     | 100%   |
-
-### 2. The three load-bearing comparisons (the brief)
-
-**(1) GDN compute: llama vs vLLM = NOT the gap.** Per-call GPU duration:
-llama `gated_delta_net_cuda<128>` = **4.03 ms/call**, vLLM
-`fused_recurrent_gated_delta_rule_packed_decode` = **3.62 ms/call**. llama is only **+11%**
-slower per call (+19 ms/step). GDN is comparable; it is the largest single kernel on BOTH sides
-(38% llama, 53% vLLM) but it explains only ~19 ms of the 185 ms gap (~10%). This REFUTES the
-framing that the GDN kernel is the dominant residual lever - it is a minor overage post-0018/0019.
-(The `gdn-source-compare` occupancy/shuffle deltas are real but worth ~19 ms/step, not 1.5x.)
-
-**(2) DRAM bytes/step: llama does NOT read more.** Explicit memcpy: llama **131 MB/step** vs
-vLLM **85 MB/step** - llama moves a hair more in copies but both are <0.5% of the step. The big
-per-layer state copies are GONE (pre-SSM 18 GB/step DtoD -> post-SSM 131 MB/step) - **the SSM fix
-(0018/0019) is confirmed working in this trace.** Weight DRAM (read inside the GEMM/GEMV kernels,
-not memcpy) is the SAME ~15 GB NVFP4 for both engines; at 273 GB/s that is a ~55 ms floor, and
-BOTH engines sit far above it (326-510 ms), so BOTH are compute/kernel-bound, NOT
-weight-bandwidth-bound, and llama reads no extra bytes. The 254-vs-391 gap is NOT a byte-volume
-deficit - it is effective-bandwidth/compute-efficiency in the FP4 matmul kernels (see 3).
-
-**(3) GPU-busy% / idle structure: identical, both ~98% busy.** llama 98.7% busy (1.3% idle),
-vLLM 97.9% busy (2.1% idle). Neither engine is idle/gap/host-bound at npl128. The entire gap is
-the GPU doing MORE kernel-time per step on llama: llama's non-GDN GPU work = ~310 ms/step vs
-vLLM's ~146 ms/step. That 164 ms delta is concentrated in the FP4 matmul path.
-
-### 3. THE single biggest llama-specific overage: the FP4 matmul path (+119 ms/step = 64% of the gap)
-
-llama spends **236 ms/step** on FP4 matmul+quant; vLLM does ALL its matmul (cutlass FP4 GEMM +
-cublas nvjet + act cvt) in **117 ms/step** - even though vLLM ALSO carries ~18 ms/step of extra
-PyTorch eager elementwise glue that llama's fused ggml kernels avoid. llama is **2.0x slower on
-FP4 matmul**, and that +119 ms is **64% of the entire 185 ms/step gap**.
-
-Inside llama's FP4 path the dominant, untouched cost is **`mul_mat_vec_q` = 132 ms/step (26% of
-decode), 48 calls/step (exactly one per GDN layer), 2.75 ms/call, grid 5120x128**. This is the
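-**FP4 GEMV ("vec_q") kernel running at decode batch 128** for one gated-DeltaNet projection per layer
-(identified in the SYNTHESIS section as the `ssm_out` output projection, not the in-projection) - a
-non-tensor-core, memory-bound-style kernel doing M=128 work without GEMM-grade weight-read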
-amortization. vLLM runs the equivalent projections through cutlass batched FP4 GEMM (tensor-core,
-weight read amortized across the 128-row batch) at a fraction of the cost. **There is no
-GEMV-at-batch-128 on the vLLM side at all.**
-
-Key cross-check with Track B P2a: P2a optimized `mul_mat_q` (the 17%/88 ms tensor-core GEMM, made
-it -24.7%) and decode stayed FLAT - because the BIG FP4 cost is `mul_mat_vec_q` (26%/132 ms),
-which P2a never touched. **Track B optimized the wrong FP4 kernel.** The lever is to route this
-GDN projection (the `ssm_out` output projection, per the SYNTHESIS source read) at M=128 through a
-tensor-core GEMM (mul_mat_q / MMQ) instead of the vec_q path,
-and to fuse the act-quant (L1) + the norm prologue (L2) so the 448 `quantize_mmq_nvfp4` launches
-fold away - exactly what `explore-other-levers` L1/L2 propose. My measurement RANKS them: the
-mul_mat_vec_q->GEMM routing is the single highest-value target (132 ms), then act-quant fusion
-(16 ms + 448 launches), then the GDN +19 ms.
-
-### 4. Reconciling with the `weight-bandwidth` section (unification, not contradiction)
-
-weight-bandwidth concluded "same 55.5 GB/step, llama 40% util vs vLLM 62% util -> utilization-bound."
-My per-kernel data LOCALIZES that utilization gap: it lives in the **FP4 matmul kernels** (which
-do the bulk of the ~15 GB weight read), NOT in the GDN state traffic. GDN moves its (equal) state
-bytes at comparable rate on both engines (4.03 vs 3.62 ms/call). So the "40% vs 62%" is the
-`mul_mat_vec_q`/`mul_mat_q` weight-read efficiency vs cutlass FP4 GEMM. Raising decode parity =
-raise the FP4-matmul achieved bandwidth (tensor-core GEMM routing + act/norm prologue fusion),
-not the GDN kernel and not byte-cutting.
-
-### Verdict (profiler)
- Reproduced both engines at their true operating points: llama 245 / vLLM 394 eager / 420 graphs.
-  Graphs are not the lever (+6%). Both engines ~98% GPU-busy; gap is GPU kernel-time, not idle/host.
- GDN compute is comparable (llama +11%/call, +19 ms/step) - NOT the dominant residual.
- bytes/step: llama does not read more (131 vs 85 MB memcpy; identical weight bytes); SSM fix's
-  18 GB/step DtoD removal CONFIRMED in-trace.
- **The single biggest llama-specific overage is the FP4 matmul path: 236 vs 117 ms/step (+119 ms
-  = 64% of the 185 ms gap), dominated by `mul_mat_vec_q` (FP4 GEMV at batch 128, 132 ms/step, 26%,
-  one per GDN layer).** Highest-value lever = route that GDN projection (`ssm_out`, the output
-  projection per the SYNTHESIS source read) through a tensor-core FP4 GEMM at M=128 + fuse
-  act-quant/norm prologue (L1/L2). Track B optimized the wrong FP4 kernel.
-
-### Evidence (DGX, this agent)
- `~/bench/postssm_decomp/postssm_base.{nsys-rep,sqlite,gpu_trace.csv,run.log}` (llama post-SSM).
- `~/bench/postssm_decomp/vllm_decode.{nsys-rep,gpu_trace.csv}` (vLLM eager decode trace).
- `~/bench/postssm_decomp/vllm_decode_g1.*` (vLLM graphs run), `~/bench/vllm_tps.py` (throughput).
- Scripts: `~/bench/postssm_llama_decomp.sh`, `~/bench/vllm_nsys_run.sh`, `~/bench/decode_decomp2.py`
-  (decode-only windowed, overlap-correct, MB-memcpy, per-step reconstruction).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-## Section: SYNTHESIS (cross-check + ground-truth + ranked levers + verdict) - FINALIZED
-
-Agent label: synthesize. Read-only (no GPU). Cross-checks all sections above against the
-fresh `profile-both-engines` ground-truth, then mechanism-confirms the dominant lever by
-reading the model graph + ggml-cuda dispatch source on the DGX (`~/llama-paged-dev`, HEAD
-46d7dd8 = patch 0019). All throughput vs the vLLM 391 t/s eager apples-to-apples reference.
-
-### 0. Headline
-
-Post-SSM dense decode = 256.6 t/s @npl128 = 65.6% of vLLM 391, bit-exact. The residual is
-NOT a hardware/architecture floor and NOT the GDN recurrence kernel, the host loop, CUDA
-graphs, or DRAM byte-volume. It is ONE concrete, llama-specific kernel-routing defect:
-**the gated-DeltaNet output projection (`ssm_out`) runs as an FP4 GEMV (`mul_mat_vec_q`)
-at decode batch 128 instead of a tensor-core FP4 GEMM (MMQ), costing 132 ms/step = 26% of
-decode = the single biggest overage vs vLLM (which packs the same projection into a cutlass
-M=128 GEMM).** The fix is a ~2-line reshape, bit-exact, and is the highest-value next step.
-
-### 1. Cross-check: which prior findings HELD, were REFUTED, or are SUPERSEDED
-
-HELD (confirmed by both the adversarial re-derivation and the fresh profile):
- Pre-fix decomposition (gated_delta_net 23.4%, k_get_rows 21.9%, MEMCPY-DtoD 18.9% / 382 GB,
-  mul_mat_vec_q 15.5%, mul_mat_q 10.5%): reproduced to <=0.1pp (validate-findings).
- SSM-fix D2D collapse: the 18.4 GB/step redundant recurrent-state copy is GONE. Confirmed
-  three ways: validate (18.9% -> 0.008% on the post-fix sqlite), weight-bandwidth (A/B kernel
-  sum lists no DtoD term), and IN-TRACE by the profiler (18 GB/step DtoD -> 131 MB/step). The
-  SSM fix (0018/0019) is the real breakthrough and is working.
- P2a FP4-GEMM occupancy remap FLAT on decode (+0.6% noise) while the `mul_mat_q` kernel itself
-  shrank -24.7% and prefill rose +12.7%: confirmed. Decode is not GEMM-occupancy-bound.
- 65% of vLLM (254/391 = 64.96%, 256.6/391 = 65.6%): confirmed.
- Decode is NOT at the bandwidth floor: 55.5 GB/step moved at 2.48x the 273 GB/s floor (40% util)
-  vs vLLM 1.61x (62% util) on the SAME bytes. Confirmed + LOCALIZED below.
- Host loop / 64-layer serialization is NOT the lever: both engines ~98% GPU-busy at npl128
-  (llama 98.7%, vLLM 97.9%); the entire exposed-idle budget is ~0.65%. Confirmed by the profiler.
- CUDA graphs are NOT the lever: vLLM is 394 t/s EAGER, graphs add only +6% (420); llama already
-  runs with graphs. Confirmed by the profiler.
-
-REFUTED / CORRECTED:
- "GDN recurrence kernel is the dominant residual lever" (the STATE brief's "gated_delta_net
-  1.46 ms/call, the largest single kernel" and the gdn-source-compare framing): REFUTED. The
-  profiler's fresh side-by-side per-call duration is llama 4.03 ms vs vLLM 3.62 ms = only +11% /
-  +19 ms/step = ~10% of the 184 ms gap. It IS the largest single kernel on both sides (38% llama,
-  53% vLLM) but the largest GAP is elsewhere. (The brief's "1.46 ms/call" is a stale/narrower
-  window; the authoritative post-SSM per-call is 4.03 ms.) gdn-source-compare's occupancy/shuffle/
-  fusion anatomy is correct but addresses a SECONDARY +19 ms target, not parity.
- "+66% SSM-fix gain" label: REFUTED. 146 -> 254-257 is +74 to +76%; "66%" is the percent-of-vLLM,
-  not the speedup (validate-findings).
-
-SUPERSEDED (the gap validate-findings flagged, now filled by real data):
- The "FP4-GEMM ~48% / get_rows 0.7% / GDN 22.5%" Step-2 split had NO surviving sqlite (the
-  producer script crashed; only a Step-1 build was on the box). The profiler's fresh Step-2 trace
-  replaces it with a FINER, load-bearing breakdown: the ~46% "FP4 matmul" bucket is NOT one GEMM
-  family - it splits into `mul_mat_vec_q` 26% (the o_proj GEMV, the real culprit), `mul_mat_q` 17%
-  (the tensor-core GEMM P2a already optimized), and `quantize_mmq_nvfp4` 3%. Lumping them as
-  "48% FP4-GEMM" hid that Track B P2a optimized the 17% MMQ while the 26% MMVQ was the bind. This
-  is why P2a was flat on decode: **it optimized the wrong FP4 kernel.**
-
-### 2. Ground-truth per-step decode decomposition + the single biggest overage
-
-From the profiler's fresh post-SSM eager nsys, both at batch 128, prefill-free, GPU-accurate:
-
-| component (per decode step) | llama ms | llama% | vLLM ms | vLLM% | gap (llama-vLLM) |
-|-----------------------------|----------|--------|---------|-------|------------------|
-| GDN recurrence kernel       | 193      | 38%    | 174     | 53%   | **+19**          |
-| FP4 matmul + act-quant      | 236      | 46%    | 117     | 36%   | **+119**         |
-|   - mul_mat_vec_q (o_proj GEMV) | **132** | **26%** | 0   | -     | **+132**         |
-|   - mul_mat_q (MMQ GEMM)    | 88       | 17%    | 61 (cutlass) | 19% | +27             |
-|   - quantize_mmq_nvfp4      | 16       | 3%     | 55 (nvjet+cvt)| 17% | -39             |
-| full attention (16 layers)  | 6.6      | 1.3%   | 6.2     | 1.9%  | +0.4             |
-| SSM conv + glue/elementwise | 45       | 9%     | 22      | 7%    | +23              |
-| MEMCPY                      | 2.5      | 0.5%   | 0.36    | 0.1%  | +2               |
-| **TOTAL**                   | **~510** | 100%   | **~326**| 100%  | **+184**         |
-
-The +119 ms FP4-matmul gap is more than accounted for by the `mul_mat_vec_q` o_proj GEMV (+132):
-the MMQ adds +27, and llama is -39 on activation-quant (16 vs vLLM's heavier eager 55), so the
-sub-rows net to about +120. The one lever that matters is therefore the +132 ms/step o_proj GEMV;
-everything else nets to ~+52 ms.
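
The FP4 sub-row arithmetic can be checked directly from the table. A small sketch using only the table's rounded per-step values; the row labels are shortened here for readability:

```python
# (llama ms/step, vLLM ms/step), copied from the FP4 sub-rows of the table above.
fp4_rows = {
    "mul_mat_vec_q (o_proj GEMV)": (132.0, 0.0),
    "mul_mat_q vs cutlass GEMM": (88.0, 61.0),
    "act-quant (quantize_mmq_nvfp4 vs nvjet+cvt)": (16.0, 55.0),
}
TOTAL_GAP_MS = 184.0   # ~510 - ~326 from the TOTAL row

gaps = {name: llama - vllm for name, (llama, vllm) in fp4_rows.items()}
fp4_gap = sum(gaps.values())
largest = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    print(f"{name:45s} {gap:+7.1f} ms/step")
print(f"FP4 path net gap: {fp4_gap:+.1f} ms/step")
print(f"{largest}: {gaps[largest] / TOTAL_GAP_MS:.0%} of the ~{TOTAL_GAP_MS:.0f} ms total gap")
```

The sub-rows net to +120 while the bucket row reads +119 (236 - 117); the 1 ms difference is rounding in the table's per-step values.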
-
-**MECHANISM (confirmed by source read, not inferred).** In the dense Qwen3.5-27B GDN block
-(`src/models/qwen3next.cpp` `build_recurrent`), the recurrent core keeps the SSM layout
-`[feat, n_seq_tokens, n_seqs]`. At decode `n_seq_tokens=1, n_seqs=128`. The output projection is:
-
-```cpp
-// current code (qwen3next.cpp, end of the GDN block)
-ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm,
-                                 head_v_dim * num_v_heads, n_seq_tokens, n_seqs); // [6144, 1, 128]
-cur = build_lora_mm(model.layers[il].ssm_out, final_output);                     // <-- the matmul
-cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);                 // collapse AFTER
-```
-
-`final_output` is 3D `[6144, n_seq_tokens=1, n_seqs=128]`, so `src1->ne[1] = 1`. The ggml-cuda
-dispatch (`ggml-cuda.cu:2553`) picks MMVQ when `src1->ne[1] <= MMVQ_MAX_BATCH_SIZE (8)`, with the
-128 sequences carried in `ne[2]`. Result: a per-sequence FP4 GEMV, output rows 5120 x 128 seqs =
-**`mul_mat_vec_q`, grid 5120x128, 48 calls/step (one per GDN layer)** - matching the profiler's
-trace exactly. MMVQ does NOT amortize the `ssm_out` weight read into shared memory across the 128
-sequences (it is built for batch <=8), so each of the 128 sequences re-streams the weight tiles -
-the "40% vs 62% utilization" the weight-bandwidth section measured lives HERE, in this kernel, not
-in the GDN state traffic. vLLM packs all 128 decode tokens into one cutlass M=128 GEMM (its GDN
-kernel is literally `..._PACKED_decode`), so it has NO GEMV-at-batch-128 at all.
-
-This also pins WHY it is decode-specific: at prefill the tokens are in `ne[1]` (n_seq_tokens=prompt
-len), so `ne[1] >> 8` -> MMQ already; only the decode layout (128 seqs x 1 token, batched in ne[2])
-trips the GEMV path. The in-projection (`wqkv`) is unaffected: its input is the 2D residual stream
-`[n_embd, 128]` (reshaped to 3D only AFTER the matmul), so `ne[1]=128` -> MMQ today. The o_proj is
-the unique 3D-input matmul, which is exactly why the profiler counted one MMVQ per GDN layer.
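
The dispatch behaviour described above can be sketched as a small model. This is a simplified illustration, not the ggml-cuda source: the threshold value 8 and the rule "MMVQ when `ne[1] <= 8`" come from the text, while `pick_path`, `ne1_3d`, `ne1_2d` and `kMmvqMaxBatchSize` are hypothetical names introduced here, and the real dispatch checks more conditions than this one.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model of the dispatch rule described in the text; NOT the ggml-cuda code.
// The threshold mirrors the MMVQ_MAX_BATCH_SIZE value (8) the text cites.
constexpr int64_t kMmvqMaxBatchSize = 8;

enum class MatmulPath { MMVQ, MMQ };

// src1_ne1 is ne[1] of the activation tensor handed to the matmul.
inline MatmulPath pick_path(int64_t src1_ne1) {
    assert(src1_ne1 >= 1);
    return src1_ne1 <= kMmvqMaxBatchSize ? MatmulPath::MMVQ : MatmulPath::MMQ;
}

// 3D SSM layout [feat, n_seq_tokens, n_seqs]: ne[1] is n_seq_tokens only,
// the sequences ride in ne[2] and are invisible to the threshold test.
inline int64_t ne1_3d(int64_t n_seq_tokens, int64_t /*n_seqs*/) {
    return n_seq_tokens;
}

// Collapsed 2D layout [feat, n_seq_tokens * n_seqs]: every token counts toward ne[1].
inline int64_t ne1_2d(int64_t n_seq_tokens, int64_t n_seqs) {
    return n_seq_tokens * n_seqs;
}
```

Under this model the decode shape (1 token x 128 sequences) lands on MMVQ in the 3D layout and on MMQ once collapsed to 2D, while a prefill shape stays on MMQ in either layout.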
-
-### 3. Ranked remaining decode levers (impact x tractability, cumulative ceiling toward 391)
-
-Anchored on llama 256.6 t/s (499 ms/step) -> vLLM 391 (327 ms/step), gap 172 ms/step. Recovery
-figures past Lever 1 are ESTIMATES (the profiler measured the costs, not the post-fix kernels);
-each needs a confirming re-profile. Ceilings are cumulative.
-
-| # | lever | targets (ms/step) | est. recover | cumulative decode_agg | % of vLLM | tractability |
-|---|-------|-------------------|--------------|-----------------------|-----------|--------------|
-| 1 | **o_proj MMVQ -> MMQ** (collapse final_output to 2D before `ssm_out`) | vec_q 132 | ~100-110 | ~320-330 | **~82-85%** | **VERY HIGH** (2-line reshape, bit-exact, MMQ already proven on NVFP4 at M=128 by the in_proj) |
-| 2 | act-quant + norm prologue fusion (explore L1 `LLAMA_FUSE_NVFP4_QUANT=1` re-bench + L2 M=128 gate) | quant 16 + 448 launches/step | ~15-25 | ~345-360 | ~88-92% | MED-HIGH (producer code exists, tasks 38-40; needs post-0019 re-bench, the pre-SSM regression is stale) |
-| 3 | GDN-area fusion + occupancy (gdn A-D: row-local reduction, raise launch_bounds occupancy, fold gate/l2norm/softplus into the recurrence) | GDN +19 + glue +23 | ~25-40 | ~375-388 | ~96-99% | MED-LOW (real kernel rewrite + numeric re-validation) |
-| 4 | conv-state in-place + conv fuse (explore L4, the proven 0018/0019 pattern on `ssm_conv`/concat) | part of glue, 48 launches/step | ~5-10 | ~388-395 | ~99-101% | HIGH (bit-exact, proven pattern) |
-| - | between-step host gap / cgraph reuse | ~2 ms/step | ~2 | +~0.4% | n/a | LOW value (cleanup, not a parity lever) |
-| x | CUDA graphs | - | 0 | already on | n/a | NOT a lever (+6% even for vLLM) |
-| x | TMA weight-feed / NVFP4-dense weight-quant | prefill / npl1 | 0 at npl128 | n/a | n/a | MIS-SCOPED for batch-128 decode (prefill / low-batch levers; prefill already +12.7%) |
-
-Note on Lever 1+2 coupling: routing the o_proj to MMQ ADDS one activation-quant (q8_1/NVFP4) per
-o_proj, so Lever 2 (fusing that quant into the preceding `build_norm_gated`) compounds Lever 1
-rather than overlapping it. Lever 3's "glue +23 ms" and Lever 1's quant are the same elementwise
-passes vLLM folds into its packed kernel, so 2+3 share surface - treat the estimates as a band,
-not a sum.
-
-### 4. Verdict: is true decode parity reachable?
-
-**Yes, parity is reachable in software, and the residual is NOT a hardware/architecture floor.**
-Proof of "not a floor": both engines read identical NVFP4 weights and read+write identical f32
-recurrent state = identical 55.5 GB/step DRAM floor (203 ms) on the identical GB10 LPDDR5x; vLLM
-achieves 62% bandwidth utilization (327 ms/step) where llama achieves 40% (499 ms/step). The ~1.5x
-throughput gap (391 vs 256.6 t/s) matches the ~1.5x utilization gap, and that utilization gap is now LOCALIZED to
-specific llama kernels - chiefly the o_proj MMVQ - every one of which is closable in software. The
-GDN recurrence (the supposed floor) is only +11%/call between the two engines.
-
-How far each tier reaches:
- The first ~84% of parity (256 -> ~325) is nearly FREE: Lever 1 is a 2-line reshape that moves
-  the GDN output projection from a per-sequence FP4 GEMV to a tensor-core M=128 FP4 GEMM, bit-exact,
-  no new kernel (MMQ already runs the in-projection at this exact shape and type).
- ~84% -> ~92% (Levers 1+2) is low-effort: the fused act-quant producer already exists (tasks
-  38-40), it just needs a post-0019 re-bench because its pre-SSM regression was measured when the
-  GPU was 99% busy on the now-removed state-copy chain (no idle to reclaim then; real idle now).
- ~92% -> ~100% (Levers 3+4) is the diminishing-returns tail and the only genuinely HARD work:
-  matching vLLM's fully-fused `packed_decode` GDN kernel (row-local reductions, higher occupancy,
-  folding the gate/l2norm/softplus elementwise passes into the recurrence). This last ~8% is "hard
-  but not floored" - it is kernel engineering, not a hardware wall.
-
-**Single highest-value next step (do this first):** apply Lever 1 - collapse `final_output` to 2D
-`[head_v_dim*num_v_heads, n_seq_tokens*n_seqs]` BEFORE the `ssm_out` matmul (drop the now-redundant
-post-matmul `reshape_2d`):
-
-```cpp
-// route the GDN output projection through tensor-core MMQ at decode:
-// M = n_seq_tokens*n_seqs (=128 at decode) instead of ne[1]=1 -> MMVQ GEMV. Free, bit-exact.
-ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm,
-                                 head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-cur = build_lora_mm(model.layers[il].ssm_out, final_output); // now [n_embd, n_tokens], M=128 MMQ
-```
-
-Then the profiler re-measures the realized o_proj-as-MMQ cost on a clean post-0019 nsys (the one
-number this synthesis estimates rather than measures) and confirms the 256 -> ~320-330 lift. The
-same 3D-input-matmul pattern almost certainly affects the MoE checkpoint (q36-35b-a3b) decode and
-any other matmul that consumes a tensor still in the `[feat, 1, n_seqs]` SSM layout - grep those
-and apply the same collapse. Levers 2-4 follow in priority order; none requires a model or accuracy
-compromise, so bit-exactness is preserved throughout.
-
-### Evidence (this section)
- Source read (DGX `~/llama-paged-dev`, read-only): `src/models/qwen3next.cpp` (GDN in/out proj
-  layout, lines ~286-305 and ~518-528), `ggml/src/ggml-cuda/ggml-cuda.cu:2553` (MMVQ dispatch on
-  `ne[1]<=8`), `ggml/src/ggml-cuda/mmvq.cuh:3` (`MMVQ_MAX_BATCH_SIZE 8`), `mmq.cu:267` (NVFP4 is
-  MMQ-supported).
- All five prior sections of this doc + the profiler's `~/bench/postssm_decomp/` traces.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
+++ b/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
@@ -1,532 +0,0 @@
-# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — build-ready scope + honest go/no-go
-
-Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** Track B is the
-residual-kernel track after track A (fuse the standalone `quantize_mmq_fp4` activation-requant, the
-8.2% decode bucket — tasks 38-41, the fused `rms_norm+mul+nvfp4-quant` producer + prequantized-MMQ
-consumer) is handled separately. Track B owns the **weight GEMM**, the ~59% bucket.
-
-**The load-bearing question, restated:** at the decode batch shape (M≈128 tokens fused into one
-ubatch, NVFP4 weights), is the weight GEMM **compute-bound** (FP4-MMA throughput is the lever →
-parity reachable with a better kernel) or **bandwidth-bound** (273 GB/s weight-read is a hard floor →
-parity capped)? And given the GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach
-vLLM's **391 (dense) / 811 (MoE)** decode-agg tok/s @npl128, or only partway?
-
-Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x.
-Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the
-committed nsys decomposition + measured GB10 specs + a source read of the FP4-MMA kernel; **no new GPU
-benchmarks were run** (track A is on the box).
-
-## 0. Grounded inputs (measured, committed)
-
-| quantity | value | source |
-|---|---|---|
-| LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` |
-| LPDDR5x bandwidth (achieved, batch-1 weight read) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 study |
-| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 |
-| BF16 / INT8 peak | ~213 TFLOP/s / ~215 TOPS (INT8 == BF16 on GB10) | same §2 |
-| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); **~3% dense / ~35%-of-BW MoE at decode** | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
-| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file; ~18 GB matmul tensors | `du` on DGX |
-| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step @npl128 (~98% experts hit) | `du` on DGX |
-| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
-| Measured per-step @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
-| Aggregate decode @npl128 (the parity scoreboard) | dense **161** (llama) vs **391** (vLLM); MoE **333** vs **811** | `QWEN36_NVFP4_BENCH.md` |
-
-`decode_agg = npl / step_s = 128 / step_s`. Crossover formula throughout:
-`M* = b · peak / (2 · BW)`, `b` = bytes per weight element. Below `M*` bandwidth-bound, above it
-compute-bound.
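
As a minimal sketch, the two formulas above can be written out directly. The function names (`crossover_m`, `decode_agg`, `weight_read_floor_s`) are hypothetical helpers introduced for illustration; the inputs used in the usage note are the figures from the table in this section.

```cpp
#include <cassert>

// Crossover batch size M* = b * peak / (2 * BW).
// Below M* the GEMM is bandwidth-bound, above it compute-bound.
inline double crossover_m(double bytes_per_weight, double peak_flops, double bw_bytes_per_s) {
    assert(bw_bytes_per_s > 0.0);
    return bytes_per_weight * peak_flops / (2.0 * bw_bytes_per_s);
}

// Aggregate decode throughput in tok/s: npl parallel sequences, one token each per step.
inline double decode_agg(double npl, double step_s) {
    assert(step_s > 0.0);
    return npl / step_s;
}

// Time to stream the weights exactly once at a given bandwidth, in seconds.
inline double weight_read_floor_s(double weight_bytes, double bw_bytes_per_s) {
    assert(bw_bytes_per_s > 0.0);
    return weight_bytes / bw_bytes_per_s;
}
```

With the dense inputs (18 GB of matmul weights over 27e9 parameters, 500 TFLOP/s FP4 peak, 273 GB/s), `crossover_m` gives roughly 611 and `weight_read_floor_s` roughly 66 ms; `decode_agg(128, 0.795)` gives about 161 tok/s.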
-
---
-
-## 1. The kernel-approach decision: TUNE the existing FP4-MMA `mul_mat_q`, do NOT write a cutlass kernel
-
-This is the first thing track B must settle, and the evidence settles it decisively.
-
-| option | verdict | why |
-|---|---|---|
-| **(A) Tune the existing `mul_mat_q<NVFP4>` FP4-MMA path** | **CHOSEN — the tractable spine** | The kernel already exists, is **bit-exact** (`test-backend-ops MUL_MAT` 1103/1103), is genuine **W4A4** (below), and already **beats vLLM at batch-1 prefill** (MXFP4 1153 t/s vs vLLM's 800 W4A16 — vLLM has no FP4 cubins on sm_121). The deficit is **decode-shape scheduling**, not the math op. Host-side selection + a bounded occupancy tune respects the GB10 lessons and is build-ready against known files/lines. |
-| **(B) New cutlass-style SM120 FP4 collective** | **REJECTED** | Repeats the **proven GB10 dead-end**: the from-scratch W4A16 BF16 GEMM hit only ~9–15 TFLOP/s (¼ of MMQ) and was **STOPPED** (`W4A16_MARLIN_KERNEL_PLAN.md`) because deep `cp.async` + XOR-swizzle **collapse GB10 occupancy**. Worse, **CUTLASS's own SM120 grouped block-scaled FP4 GEMM is broken on consumer Blackwell** (garbage/init-fail — CUTLASS #3096/#2800) — it is the exact reason vLLM falls back to **BF16 Marlin** for its MoE on sm_121. "Port cutlass" is not even a working option for the MoE arm. |
-| **(C) Marlin-style W4A16 (FP4→BF16 dequant + BF16 HMMA)** | **REJECTED for the win, noted for context** | This is what **vLLM's MoE actually runs** on sm_121 (W4A16, BF16 activations, dequant-in-mainloop). On GB10 **INT8 == BF16 == ½ FP4 rate**, so a BF16-HMMA path concedes the 2× FP4 advantage llama already has. We do not want to *descend* to vLLM's slower arithmetic class; we want to keep the FP4-MMA class and schedule it better. |
-
-**Decision: track B = tune `mul_mat_q<NVFP4>` (dense, `mmq.cu`/`mmq.cuh`) + the grouped `mul_mat_q`
-id-branch (MoE, `mmid.cu` + the same `mmq.cuh`).** No new kernel, no rewrite, no descent to BF16.
-The win is kernel *engineering around an FP4-MMA llama already possesses*, so there is **no
-hardware-instruction wall** — but it is gated by whether MMQ's occupancy-bound design can be pushed
-to the bandwidth floor at the thin decode M-tile.
-
-### What "the existing path" actually is (source-read, DGX `ggml/src/ggml-cuda/`)
-
-Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens fused into one
-ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128× M=1). The NVFP4 path:
-`mmq.cu` `use_native_fp4` gate (L125) → `quantize_mmq_fp4_cuda` act-quant (L138 dense / L200 id;
-**track A's fuse target**) → `mul_mat_q` → `vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) →
-`mma_block_scaled_fp4` (`mma.cuh:1126`).
-
-**Confirmed W4A4 (this corrects an earlier "A is 8-bit-class" framing):** `block_fp4_mmq`
-(`mmq.cuh:53`) is `uint32_t d4[4]` (16 bytes of packed `ue4m3` block scales) + `int8_t qs[4*32]` = **256 FP4 (e2m1)
-values packed 2-per-byte**. `quantize_mmq_fp4_cuda` (`quantize.cu:422`) emits FP4 via
-`ggml_cuda_float_to_fp4_e2m1`. The MMA is
-`mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`
-(`mma.cuh:1145`) — **both operands e2m1, ue4m3 block scales**. So llama's dense FP4-MMA path is
-already the *same arithmetic class as vLLM's cutlass W4A4 dense*. The `sizeof(block_fp4_mmq) ==
-sizeof(block_q8_1_mmq)` static_assert is a shared-tile-footprint convention, **not** an 8-bit
-activation. **Consequence: there is no "make activations 4-bit" work to do and no activation-traffic
-halving to win — that is already banked. The entire dense deficit is scheduling/occupancy.**
-
-Geometry (`vec_dot_fp4_fp4_mma`): `MMQ_NWARPS=8`, `iter_k=MMQ_ITER_K_FP4=512`, tiles
-`tile_A<16,8,int>` (weights, 16 N-rows × 64 FP4-in-K), `tile_B<8,8,int>` (acts, 8 M-cols × 64
-FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags = MMQ_TILE_NE_K/tile_A::J`. The M loop
-is `for (j0=0; j0<mmq_x; j0 += ntx*tile_C::J)` — M tiled in steps of `tile_C::J=8`.
-
---
-
-## 2. The roofline — answering the load-bearing question
-
-**Answer: BANDWIDTH-bound on the hardware roofline, but COMPUTE-bound in practice by the kernel's own
-under-occupancy. The 273 GB/s is NOT the wall at the parity target.**
-
-### 2a. DENSE Qwen3.6-27B, M=128
-
-`b = 18e9/27e9 = 0.667 B/param`; FLOPs/step `= 2·128·27e9 = 6.91 TFLOP`.
-
- **Weight-read floor** (18 GB read ONCE for all 128 tokens): @273 GB/s = **65.9 ms → 1,942 tok/s**;
-  @216 GB/s = 83 ms → 1,542 tok/s.
- **Crossover** at FP4 peak: `M* = 0.667·500e12/(2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM
-  at decode is BANDWIDTH-bound.** At the kernel's *achieved* ~3% efficiency (14.7 TFLOP/s) the
-  effective peak collapses and drags M* to ≈18, putting the *current* kernel in self-inflicted
-  compute-bound territory.
- **Where llama sits:** GEMM = 59.2% × 795 ms = **471 ms = 14.7 TFLOP/s = 2.9% of FP4 peak = 7.1×
-  slower than the 66 ms weight-read floor.** Not a bandwidth wall — a kernel running deep in
-  compute-bound territory at single-digit efficiency.
- **Where vLLM sits:** step 328 ms ≈ llama's GEMM bucket (471 ms) alone. The **entire 2.42× gap is
-  the GEMM.**
-
-### 2b. MoE Qwen3.6-35B-A3B, M=128
-
-@npl128, 128 tok × top-8 / 256 experts ⇒ ~98% experts read ⇒ ~22 GB/step (the full weight set),
-per-expert M ≈ **4 tokens**.
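
The ~98% and M≈4 figures follow from a simple counting argument, sketched below. The sketch assumes uniform, independent routing of each token to its top-k experts, which is an idealization and not something the text measures; the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Average number of tokens routed to each expert in one decode step.
inline double tokens_per_expert(int n_tokens, int top_k, int n_experts) {
    assert(n_experts > 0);
    return static_cast<double>(n_tokens) * top_k / n_experts;
}

// Expected fraction of experts touched at least once in a step.
// ASSUMPTION: each token misses a given expert with probability 1 - top_k/n_experts,
// independently of every other token (uniform routing).
inline double expected_expert_hit_fraction(int n_tokens, int top_k, int n_experts) {
    assert(n_experts > 0 && top_k <= n_experts);
    const double miss_one_token = 1.0 - static_cast<double>(top_k) / n_experts;
    return 1.0 - std::pow(miss_one_token, n_tokens);
}
```

For 128 tokens, top-8 routing and 256 experts this yields 4 tokens per expert and a hit fraction a little over 98%, which is why a decode step reads essentially the whole expert weight set.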
-
- **Weight-read floor:** 22/273 = **80.6 ms → 1,588 tok/s** (@216: 102 ms → 1,255).
- **Compute floor:** only ~3B active params ⇒ 0.77 TFLOP ⇒ 1.5 ms @peak — **trivial. MoE decode is
-  purely bandwidth/occupancy-bound, never compute-bound.** The hard part is saturating 273 GB/s while
-  feeding ragged M≈4 tiles.
- **Where llama sits:** GEMM = 59% × 384 = **227 ms = 97 GB/s = 35% of peak BW** (occupancy/tile-fill
-  loss, not compute).
- **Where vLLM sits:** step 158 ms ≈ grouped Marlin-NvFp4 at the ~80 ms floor + ~78 ms non-GEMM —
-  already pushing the MoE BW floor.
-
-**Both weight-read floors (dense ~1,940, MoE ~1,590 tok/s) sit well ABOVE vLLM's 391/811 (~5× for
-dense, ~2× for MoE). Bandwidth is not the wall; the GB10 FP4-MMA occupancy efficiency is.**
-
---
-
-## 3. The code-level inefficiencies, and the M-tile asymmetry that drives the whole plan
-
-The selection is `mul_mat_q_case` (`mmq.cuh:4108`): it loops `mmq_x = 8..mmq_x_max(=128) step 8` and
-keeps the `mmq_x` that **minimizes `ntiles_x = ceil(ncols_max/mmq_x)`**, stopping at `ntiles_x==1`.
-`mmq_y` (the weight-row tile) is pinned at **128** by `get_mmq_y_host` (L143). This produces the
-single most important structural fact for track B:
-
-> **`mmq_x` tiles M (tokens / output columns) — shrinking it RE-READS the weights `ntiles_x` times.
-> `mmq_y` tiles N (weight rows / output rows) — shrinking it does NOT re-read weights (each weight row
-> lives in exactly one row-tile); it only lowers shared footprint and raises occupancy.** The two
-> regimes pick opposite knobs:
-
-| | dense decode (M=128, no `expert_bounds`) | MoE decode (per-expert M≈4) |
-|---|---|---|
-| selection picks | `mmq_x=128` → `ntiles_x=1` → **weights read ONCE** (the one-read optimum) | `mmq_x=128` applied **per expert** → tile ~3% filled |
-| shrink `mmq_x`? | **NO — re-reads 18 GB ×`ntiles_x`**, fatal in the BW-bound regime | **YES, FREE** — 1 col-tile/expert regardless, no re-read → strictly occupancy-positive |
-| FP4-MMA M-frag fill | **full** (128/`tile_C::J`=16 frag-groups, all live) → no fragment waste | **wasted** (~1 of 8/16 frag-groups live, rest masked tails) |
-| BW-neutral occupancy lever | **`mmq_y`↓** (more resident CTAs, weights still read once) — kernel-structure change | **`mmq_x`↓** (toward density ≈8) — host-side template switch |
-| dominant loss | **occupancy** at the heavy 128×128 tile (exposed weight-load latency) | **tile-fill** (dense-tuned M-tile applied to ragged M≈4) |
-
-This asymmetry is the spine of the plan: **MoE's lever is host-only `mmq_x`↓ (already landed as patch
-0015 auto-cap→64; ideal ≈8–16); dense's lever is `mmq_y`↓ + occupancy, a bounded kernel change.**
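
The selection rule and the re-read cost it implies can be sketched as follows. This is a simplified model of the loop as the text describes it (step 8, minimize `ntiles_x`, stop at one tile); the real `mul_mat_q_case` also applies shared-memory and granularity constraints that are omitted here, and `select_mmq_x` / `weight_bytes_read` are hypothetical names.

```cpp
#include <cassert>
#include <climits>

inline int ceil_div(int a, int b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Walk mmq_x = 8, 16, ..., mmq_x_max and keep the first value that gives the
// fewest column tiles, stopping as soon as a single tile covers all columns.
inline int select_mmq_x(int ncols_max, int mmq_x_max) {
    assert(ncols_max >= 1 && mmq_x_max >= 8);
    int best_mmq_x = 0;
    int best_ntiles = INT_MAX;
    for (int mmq_x = 8; mmq_x <= mmq_x_max && best_ntiles > 1; mmq_x += 8) {
        const int ntiles_x = ceil_div(ncols_max, mmq_x);
        if (ntiles_x < best_ntiles) {
            best_ntiles = ntiles_x;
            best_mmq_x  = mmq_x;
        }
    }
    return best_mmq_x;
}

// Each column tile streams the full weight matrix again, so the traffic
// scales with ntiles_x = ceil(ncols / mmq_x).
inline double weight_bytes_read(double weight_bytes, int ncols, int mmq_x) {
    return weight_bytes * ceil_div(ncols, mmq_x);
}
```

At dense decode (128 columns) the model picks `mmq_x = 128` and reads the 18 GB once; forcing `mmq_x = 64` doubles that to 36 GB, which is the re-read penalty the dense column of the table refers to.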
-
-The five inefficiencies, ranked:
-
-1. **Separate activation-quant pass (track A's bucket, 8.2%).** `quantize_mmq_fp4_cuda` writes the
-   whole activation tensor to `block_fp4_mmq` in a standalone kernel; vLLM fuses `scaled_fp4_quant`
-   into the preceding RMSNorm/SiLU epilogue. **Handoff (track A → B):** B must consume A's prequantized
-   `block_fp4_mmq` y-tile in place of calling `quantize_mmq_fp4_cuda`, so the fusion saves the
-   activation round-trip, not just the launch (see §4.4).
-
-2. **No weight-load software pipeline → exposed latency at thin M (the #1 dense kernel lever).**
-   `load_tiles_nvfp4_nvfp4` (`mmq.cuh:946`) does plain global→shared stores → `__syncthreads` →
-   `vec_dot_fp4_fp4_mma` (`load_ldmatrix` of A + MMA): a **load→sync→compute→repeat** cadence with **no
-   `cp.async` double-buffering** overlapping the next k-block weight load with the current MMA. At
-   M=128 the per-tile MMA work is small, so serialized weight-load latency dominates → 2.9% (dense) /
-   35%-of-BW (MoE). **Caveat (the GB10 wall):** a *deep* pipeline + XOR-swizzle collapses GB10
-   occupancy (`W4A16_MARLIN_KERNEL_PLAN.md`). The fix is **occupancy-first** (raise resident CTAs to
-   hide latency via CTA-parallelism), **shallow 2-stage prefetch second**, never Marlin's 4-stage.
-
-3. **`mmq_x` maximized for dense = occupancy-heavy, but pinned by the one-read constraint.** At dense
-   decode the 128×128 tile (8 warps, large shared) is low-occupancy on the occupancy-dominated GB10 —
-   but you cannot shrink `mmq_x` without doubling the 18 GB weight read. So the dense occupancy fix is
-   **`mmq_y`↓** (BW-neutral), not `mmq_x`↓.
-
-4. **MoE per-expert M-tile waste (the structural MoE gap).** The 128-wide (or patch-0015 64-wide)
-   tile is applied per expert at density ≈4, so the accumulator is ~3–6% filled and ~1 `tile_C`
-   frag-group is live, the rest masked `need_check` tails. The ideal `mmq_x` is the smallest tile
-   that covers tokens/expert, i.e. 8 (= `tile_C::J`).
-   At ≤1 col-tile/expert this costs **no** extra weight read → strictly occupancy-positive. (This is
-   the MoE arm of inefficiency 3; scoped in `MOE_GROUPED_GEMM_SCOPE.md`.)
-
-5. **`iter_k=512` (FP4) couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → larger
-   shared footprint → adverse in the occupancy-bound regime. A P2 tuning knob.
-
-**Ruled out (do not chase):** redundant weight reads on the *current* selection (none — dense
-`ntiles_x=1`, MoE ≤1 col-tile/expert); stream-K fixup (it *helps* fill the small GB10 grid at thin M);
-raw FP4-MMA peak rate (already beats Q4-MMQ and is BW-bound at batch 1 — latency-hiding binds first).
-
---
-
-## 4. The specific build-ready changes
-
-All against DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`. Every change is gated and defaults to exact
-stock behavior until proven.
-
-### 4.1 Dense M-tile / occupancy (the make-or-break)
-
- **Keep `mmq_x=128` at dense decode** (the one-weight-read optimum; do **not** shrink it — that
-  re-reads 18 GB). Lock this as an invariant in P0.
- **Make `mmq_y` decode-selectable** (`get_mmq_y_host`/`get_mmq_y_device`, L143/L157). Today pinned
-  128; try **64** (and 96) at decode. `mmq_y` is coupled to `nwarps × tile_C::I` via the MMQ
-  static_assert, so this is a **warp/fragment remap** (bounded kernel change), not a pure host switch:
-  fewer N-frags per warp or fewer warps → smaller per-CTA shared → **more resident CTAs → latency
-  hidden by CTA-parallelism**, with **weights still read once** (BW-neutral). This is the primary
-  dense occupancy lever and respects every GB10 rule.
- **Host-only knobs first (P1, zero kernel):** the `mmq_get_granularity_host` choice (L274 — sets
-  `rows_per_warp=2·granularity`, `ntx`), and the stream-k-vs-xy-tiling threshold (`launch_mul_mat_q`
-  ~L3954, `tiles_efficiency_percent` L4001). Plus one **empirical A/B**: does eating a 2× weight
-  re-read at `mmq_x=64` buy enough occupancy to net positive? (Diagnostic: if yes, occupancy is badly
-  broken and P2 `mmq_y`↓ has large upside; if no, the tile is already BW-saturated and P2's ceiling is
-  lower.) All behind `GGML_CUDA_FP4_MMQ_Y` / `GGML_CUDA_FP4_GRAN` / `GGML_CUDA_FP4_FORCE_STREAMK`.
-
-### 4.2 FP4-MMA fragment usage
-
- Fragments stay `tile_A<16,8,int>` / `tile_B<8,8,int>` / `tile_C<16,8,float>` — these match the
-  `m16n8k64` block-scaled FP4 MMA and must not change (they are the instruction shape). At dense M=128
-  all 16 `tile_C::J`-groups are live → **no dense fragment work needed**. The lever is *how many of
-  these tiles are resident per SM* (occupancy), set by `mmq_y`/`nwarps`/granularity, not the fragment
-  shape.
- MoE: shrink `mmq_x` toward `tile_C::J`=8 so the live frag-group count matches density (§4.3).
-
-### 4.3 MoE M-tile (`MOE_GROUPED_GEMM_SCOPE.md`, partly landed)
-
- **Patch 0015 already auto-caps `mmq_x`→64 at decode** via per-expert density in `mul_mat_q_case`
-  (the `expert_bounds != nullptr` block, L4118-4165; env `LLAMA_MOE_DECODE_TILE`,
-  `LLAMA_MOE_DENSITY_MAX`). Tighten the decode tile toward **8–16** (= density) and sweep.
- **Optional [2]: block-padded `mm_ids_helper`** (`mmid.cu`) — pad each expert segment to a multiple
-  of the tile, removing `need_check` masked tails and tightening the stream-k schedule. Medium risk
-  (scatter + write-back masking); behind `LLAMA_MOE_BLOCK_ALIGN`.
-
-### 4.4 Scale handling + the act-quant fusion handoff (the track A → B ABI contract)
-
- **Weight scales** (`ue4m3`, one per 16 weights) load in `load_tiles_nvfp4_nvfp4` into `x_sc`
-  (`x_u32 + 64 + kbx`), consumed as `scaleA` in `vec_dot_fp4_fp4_mma` and passed as the block-scale
-  operand to `mma_block_scaled_fp4`. **No change** — already a first-class MMA scale operand.
- **Activation scales** (`ue4m3`) live in the `block_fp4_mmq` y-tile `d4[4]`, consumed as `scaleB`.
- **The handoff contract:** track B must hold the **`block_fp4_mmq` y-tile layout invariant**
-  (`uint32_t d4[4]` ue4m3 scales + `int8_t qs[128]` = 256 packed FP4, `mmq.cuh:53`). Track A's fused
-  `rms_norm+mul+nvfp4-quant` producer (task 39) writes exactly this struct; track B's "prequantized
-  MMQ consumer" (task 40) makes `mul_mat_q` accept a prebuilt `src1_q8_1` buffer and **skip the
-  `quantize_mmq_fp4_cuda` call** (`mmq.cu:138`/`200`). The numerics must be **bit-identical** to the
-  unfused path (same `e2m1` rounding, same `ue4m3` block scale per 16) so the parity gate stays green
-  with the fusion on or off. B owns the consumer seam; A owns the producer kernel; the `block_fp4_mmq`
-  struct is the frozen interface between them.
-
-### 4.5 GB10-fit rules (binding constraints on every kernel change)
-
- **Small shared mem + high occupancy.** Do **not** add deep `cp.async` stages or XOR-swizzle shared
-  layouts — they are exactly what collapsed W4A16 on GB10 (`W4A16_MARLIN_KERNEL_PLAN.md`: a 16 KB
-  XOR-swizzle dropped q4_K from 6.63→2.84 TFLOPS).
- **Preserve the skew-pad** (`MMQ_MMA_TILE_X_K_FP4 = 2·MMQ_TILE_NE_K + 8 + 4`, the `% 8 == 4`
-  padding, `mmq.cuh:221/233`) — conflict-free `ldmatrix` at ~zero shared cost.
- **Stay on the FP4-MMA path** (`block_fp4_mmq` / `mma_block_scaled_fp4`) — the only path at GB10's
-  FP4 = 2× INT8/BF16 rate. Never descend to BF16/INT8 (1:1 on GB10).
- **Occupancy beats a conflict-free-but-wide layout.** Buy latency-hiding with *more resident CTAs*
-  (smaller `mmq_y`, smaller shared), not a deeper pipeline.
- Tuning is **empirical** — `nsys` (throughput) is available, **`ncu` is not** on the DGX (no driver
-  perms). Sweep configs, measure decode_agg, bracket thermals (same-session cold A/B only).
-
---
-
-## 5. Correctness / parity gate (every phase)
-
- **Primary, bit-exact:** `test-backend-ops test -o MUL_MAT -b CUDA0` and
-  `test-backend-ops test -o MUL_MAT_ID -b CUDA0` must stay **1103/1103** with the flag set **and**
-  unset, and **byte-identical** when unset. The CPU reference is the deterministic oracle; the op test
-  is exact (the GB10 greedy-decode non-determinism band applies only to end-to-end, never to the op
-  test).
- **Add decode-shape cases if absent:** `type_a ∈ {NVFP4, MXFP4}`, `type_b = F32`, dense **n=128** at
-  the real FFN K/N; for `_ID`, `n_mats=128, n_expert_used=8, n_tokens ∈ {8,32,64,128}` **plus ragged
-  small-M** (experts with 0/1/2 tokens, `n_tokens` not a multiple of `mmq_x`) — exactly where `mmq_x`/
-  `mmq_y` changes and block-pad masking can leak.
- **Fusion-handoff parity (P3):** with track A's fused producer on, the prequantized-consumer path
-  must produce dst **identical** to the unfused `quantize_mmq_fp4_cuda` path (same `e2m1`/`ue4m3`
-  rounding).
- **End-to-end:** `llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`
-  (dense) and `q36-35b-a3b-nvfp4.gguf` (MoE); confirm decode_agg climbs per §6 and output stays within
-  the documented CUDA batch-shape non-determinism band vs the CPU oracle. All scripts **dev-tree-only**.
-
---
-
-## 6. Phased plan, with expected decode_agg at each phase
-
-Per-step model used (ms @npl128): **dense 795** = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162;
-**MoE 384** = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81. `decode_agg = 128 / step_s`.
-
-### DENSE (parity target 391)
-
-| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 391** | risk |
-|---|---|---:|---:|---:|---:|---|
-| **P0** harness | Lock baseline: 1103/1103, decode n=128 perf, nsys window, the 471 ms / 2.9% eff datum. Pin `mmq_x=128` one-read invariant. | 471 | 795 | **161** | 41% | low |
-| **P1** host-only tile/grid + re-read A/B | granularity + stream-k threshold sweep; the `mmq_x=64` re-read-vs-occupancy diagnostic. **Honest: small** — `mmq_x` is pinned, so this mostly de-risks P2. | ~400 | ~724 | **~177** | ~45% | low |
-| **P2** `mmq_y`↓ + occupancy/shallow-prefetch | The make-or-break: raise resident CTAs (`mmq_y` 128→64, granularity, shallow 2-stage weight prefetch, skew-pad), push GEMM toward the **66–81 ms BW floor (17–21% FP4 eff)**. **KILL-GATE: if eff plateaus <15% (GEMM >110 ms) → dense parity OFF, report partial.** | **66–81** | 390–405 | **316–328** | **81–84%** | **med-high** |
-| **P3** co-land track A | Consume A's prequantized `block_fp4_mmq` y-tile; the 65 ms act bucket folds away. | 66–81 | **325–340** | **376–394** | **96–101%** | low |
-
-Dense climb: **161 → ~177 → 316–328 → 376–394** tok/s = **41% → 45% → 81–84% → 96–101% of vLLM 391.**
-Robust to the 273-vs-216 GB/s uncertainty (@216 GB/s P3 → ~359 tok/s = 92%). **Parity within error,
-contingent on P2 clearing the kill-gate and on A landing.**
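
The phase table is plain bucket arithmetic on the per-step model, which the sketch below reproduces. The bucket values are the ones stated in this section; `StepModel`, `step_ms` and `decode_agg_tok_s` are hypothetical names, and the projected phases remain estimates rather than measurements.

```cpp
#include <cassert>

// Per-step time buckets in milliseconds, following the per-step model in the text.
struct StepModel {
    double gemm_ms;
    double act_ms;
    double gdn_ms;
    double attn_ms;
    double rest_ms;
};

inline double step_ms(const StepModel & m) {
    return m.gemm_ms + m.act_ms + m.gdn_ms + m.attn_ms + m.rest_ms;
}

// decode_agg = npl / step_s, with the step converted from ms to seconds.
inline double decode_agg_tok_s(const StepModel & m, double npl) {
    const double total_ms = step_ms(m);
    assert(total_ms > 0.0);
    return npl / (total_ms / 1000.0);
}
```

For example, the dense baseline `{471, 65, 83, 14, 162}` sums to 795 ms and about 161 tok/s at npl=128, and the best-case P3 row `{66, 0, 83, 14, 162}` sums to 325 ms and about 394 tok/s.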
-
-### MoE (parity target 811)
-
-| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 811** | risk |
-|---|---|---:|---:|---:|---:|---|
-| **P0** harness | Lock 1103/1103 + the monotonic `85→1771` batched-bench curve + 227 ms / 35%-BW datum. | 227 | 384 | **333** | 41% | low |
-| **P1/P4** MoE `mmq_x`↓ (patch 0015 → tighten to 8–16) | Free per-expert tile shrink (no re-read); reclaim the 3–6% fill waste, raise occupancy. | ~140 | ~297 | **~431** | ~53% | low |
-| **P2** block-pad align + occupancy | Remove `need_check` tails, tighten stream-k; push toward the 80 ms floor. | ~100 | ~257 | **~498** | ~61% | med |
-| **P3** co-land track A | act bucket (31 ms) folds away; GEMM at the ~80 ms floor. | 80 | **207** | **618** | **76% — CEILING** | low |
-
-MoE climb: **333 → ~431 → ~498 → 618** tok/s = **41% → 53% → 61% → 76% of vLLM 811.** **The 76% is the
-hard ceiling from the GEMM track:** even a *perfect* weight-read-floor grouped GEMM leaves llama's
-non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) at **1.6× vLLM's whole ~78 ms non-GEMM**, so the step
-cannot drop below ~207 ms. The remaining ~49 ms to vLLM's 158 ms step is elementwise + host-loop
-(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md`), **outside track B.**
-
-### Explicitly NOT in scope (and why)
-
- A from-scratch W4A16 / CUTLASS SM120 collective — repeats the STOPPED occupancy dead-end and
-  CUTLASS's grouped FP4 is broken on sm_121.
- Deep multi-stage `cp.async` / XOR-swizzle — proven to collapse GB10 occupancy.
- "Make activations 4-bit" — already W4A4; no work, no win there.
- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity but
-  **separate tracks**; B owns the GEMM only.
-
---
-
-## 7. The honest ceiling — does B reach TRUE PARITY?
-
- **DENSE: TRUE PARITY is PLAUSIBLY REACHABLE, conditional, no margin.** The entire 2.42× gap is the
-  GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is **bandwidth-bound, not
-  hardware-capped**. **B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s = 96–101% of vLLM 391.**
-  The catch: it needs **~17–21% FP4-MMA efficiency at decode M=128**, and GB10 has only demonstrated
-  ~17% — and that at the *easier* prefill M=512 tile. It is a **reach, not a lock**, gated by the P2
-  occupancy kill-gate and contingent on track A. **GO (conditional).**
-
- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~76% of vLLM (618 vs 811)**
-  even with a perfect weight-read-floor grouped GEMM, because (1) the MoE floor is the hardest
-  grouped-GEMM regime (M≈4/expert, vLLM ships purpose-built Marlin-NvFp4) and (2) ~24% of the step is
-  non-GEMM outside this track. Worth doing (333 → ~618, a 1.85× and a real win), but it **cannot
-  deliver 811 alone.** **PARTIAL / NO-GO for parity-from-B.**
-
- **The 273 GB/s is not the ceiling — the GB10 FP4-MMA occupancy efficiency is.** Decode M=128 is a
-  *different* regime from the dead W4A16 path: bandwidth/occupancy-bound (saturate LPDDR5x at a thin
-  M-tile via resident CTAs), not compute-throughput-bound (pack MMAs). The existing path is already at
-  the BW floor at batch 1 (88 ms), so the work is **keeping it bandwidth-bound as M grows to 128**
-  (occupancy via `mmq_y`↓ + shallow prefetch), a **tune of a working path**, not the greenfield
-  rewrite. The binding risk is whether that occupancy can be bought without tripping the GB10 wall —
-  which is exactly what the P2 kill-gate measures.
-
-**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity with
-vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the top of the demonstrated efficiency envelope
-with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track alone (ceiling ~76%);
-MoE parity is a B-plus-non-GEMM program. **Verdict: GO for dense (conditional, B+A, kill-gated),
-PARTIAL for MoE.**
-
---
-
-## 8. One-paragraph summary
-
-The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read
-floors 4–6× above vLLM, so **273 GB/s is not the wall** — but llama's FP4-MMA kernel runs at ~3% of
-FP4 peak, in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). The path is already
-**W4A4** and already **beats vLLM at batch-1 prefill**, so the fix is **tuning the existing
-`mul_mat_q<NVFP4>`**, not a cutlass rewrite (a proven GB10 dead-end, and broken on sm_121 anyway). The
-M-tile asymmetry sets the levers: **dense** is pinned at `mmq_x=128` (one weight read) so its occupancy
-win is **`mmq_y`↓ + shallow prefetch** (BW-neutral), while **MoE**'s win is the free per-expert
-**`mmq_x`↓** (patch 0015). **Track B (GEMM → BW floor) + track A (fuse act-quant)** plausibly reaches
-**96–103% of vLLM dense (391)** — TRUE PARITY on the table for dense, but only at the **top of the
-demonstrated GB10 FP4-efficiency envelope (~17–21%)**, with **no margin**, gated by the P2 occupancy
-kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor
-sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense
-(conditional, B+A), PARTIAL for MoE.**
-
---
-
-## 9. Adversarial review (skeptical staff CUDA engineer, post-W4A16): the parity go / no-go
-
-Reviewer stance: I lived through the W4A16 GB10 effort that plateaued at ~9-15 TFLOP/s (~21% of the
-BF16 ceiling) after multi-week work and was STOPPED at the occupancy wall. I read this scope and the
-grounding (`QWEN36_NVFP4_BENCH`, `VLLM_DECODE_GROUNDING`, `GDN_DECODE_VERIFY`, `DECODE_GAP_STUDY`,
-`BLACKWELL_KERNEL_GAPS`, `W4A16_MARLIN_KERNEL_PLAN`) and stress-tested the verdict against them. Net:
-the plan is **directionally right and tractably scoped**, the kernel-approach decision (tune, do not
-rewrite) is correct, but the **"GO for dense, TRUE PARITY 96-103%" headline outruns its own caveats**.
-The honest landing is **dense ~80-90% (parity is the optimistic tail), MoE ~55-65% (parity not
-reachable from B)**. The decision to commit to B is nonetheless sound, for a reason the doc under-sells
-(low regret), and there is **one technical gap (TMA) and one sequencing error (A last) that must be
-fixed**.
-
-### 9.1 Is this the W4A16 wall again? No - and the batch-scaling signature proves why
-
-The decisive evidence the doc has but does not fully exploit is the **npl-sweep** (`QWEN36_NVFP4_BENCH`):
-dense llama-as-%-of-vLLM = **99 / 56 / 46 / 41** at npl 8 / 32 / 64 / 128. At **npl8 the kernels are at
-parity** (99%); the gap **opens monotonically as M grows**. Decompose this:
-
- At M=8 the dense GEMM is weight-read-bound at the floor (~88 ms, same as batch-1). llama == vLLM there,
-  so **llama's FP4-MMA kernel demonstrably HITS the weight-read floor at small M.** This is the existence
-  proof the W4A16 path never had: it is a *working, floor-reaching* FP4-MMA kernel, not a greenfield
-  build stuck at 1/4 of MMQ.
- At M=128 vLLM's GEMM **stays at ~88 ms** (flat: it amortizes the one weight read over 128 tokens and
-  hides the MMA behind the load), while **llama's balloons to 471 ms** (5.4x). llama **falls off the
-  floor** as M grows; vLLM **holds it**.
-
-So the problem is **not** "build a fast 4-bit GEMM from scratch on an occupancy-hostile part" (the dead
-W4A16 problem). It is **"keep a working FP4-MMA kernel on the bandwidth floor as the M-tile grows from 8
-to 128"** - a tune of a working path. **Verdict: this is NOT the W4A16 wall** (different regime, working
-path, dual existence proof at M=8 and from vLLM at M=128). **But it shares W4A16's one binding
-constraint:** holding the floor as M grows requires hiding LPDDR5x weight-load latency at the larger
-tile, which is the same occupancy / latency-hiding game GB10 historically loses. The doc is right that
-it is a different and more tractable regime; it under-states that the *binding risk is identical*.
-
-### 9.2 Why is vLLM 2.4x faster if both share 273 GB/s? Compute-side scheduling, and the gap is ~82% (not 100%) GEMM
-
-The load-bearing question, settled by 9.1: at M=128 the gap is **not** that vLLM beats the shared
-bandwidth floor - it is that **llama falls off the floor into self-inflicted compute/occupancy-bound
-territory while vLLM stays on it.** The lever is therefore latency-hiding at the M=128 tile
-(compute-side scheduling: occupancy, prefetch, tile shape), with the 273 GB/s weight-read floor as the
-hard target both engines share. This confirms the doc's roofline and its central claim that the kernel,
-not the hardware, is the limiter.
-
-**But the doc's "the entire 2.42x dense gap is the GEMM" is an ~82% truth, not a 100% one.** Decompose
-the dense step (numbers from the doc's own inputs):
-
-```
-llama step @npl128            795 ms   (decode_agg 161)
-vLLM step  @npl128            328 ms   (decode_agg 391)
-total gap                     467 ms
-
-llama GEMM                    471 ms
-vLLM GEMM (at the floor)      ~66-88 ms   (66 @273 GB/s spec, 88 @216 GB/s achieved)
-=> GEMM gap                   383-405 ms  = 82-87% of the 467 ms total gap
-=> non-GEMM gap                62-84 ms   = 13-18% of the total gap
-```
-
-So **B alone (GEMM -> floor) caps near ~80-84%** (step 412-390 ms = 311-328 t/s), **not parity.** Parity
-needs the non-GEMM 62-84 ms too: ~65 ms of it is track A's act-quant bucket, the residual ~0-19 ms is
-elementwise + host outside both A and B. This is the crux of the sequencing answer (9.6): **B is
-necessary but on its own lands ~80%; it is track A that tips dense over the parity line, not B.** The
-parity story is *entirely* contingent on A, which the P3 framing buries.
-
-### 9.3 The sharpest risk the doc misses: vLLM's existence proof uses the technique the doc forbids (TMA)
-
-vLLM holds the M=128 floor with **cutlass SM120 = TMA + a warp-specialized deep async producer/consumer
-pipeline** (Research 1). That deep pipeline is **exactly what the doc forbids on GB10** (rule 4.5: "do
-not add deep cp.async stages ... they collapsed W4A16"). So **B's chosen GB10-friendly route (`mmq_y`-down
-occupancy + a shallow 2-stage prefetch) is a different bet from the one that produced the existence
-proof.** Reaching the same floor by a friendlier route is plausible but **unproven**, and if the
-occupancy-only route plateaus short of the floor, B underperforms its target with no fallback in scope.
-
-The doc conflates two different things under "deep pipeline":
- **manual `cp.async` + XOR-swizzle** - register/shared-hungry, **collapsed W4A16 occupancy on GB10**
-  (correctly banned).
- **TMA (tensor-memory-accelerator) bulk async copy** - a single descriptor drives the copy, **far lower
-  register/occupancy cost**, and it is precisely how cutlass gets pipeline depth **without** the
-  occupancy hit (Research 1 says this explicitly). TMA is available on sm_120/121.
-
-**Recommendation (binding):** B must put a **TMA-driven weight feed in scope as a first-class P2 option**,
-not categorically forbid pipeline depth. The occupancy-only route is the right *first* experiment
-(cheapest, respects the W4A16 lesson), but if P2 plateaus below the floor, **TMA is the demonstrated way
-to get depth without the occupancy collapse** and is what the vLLM existence proof actually uses.
-Declaring the floor "unreachable" without trying TMA would repeat the W4A16 mistake in reverse:
-abandoning the path that works because the *manual* version of it failed.
-
-### 9.4 Tractability: bounded tune, confirmed - with the TMA caveat
-
-The proposed changes are genuinely **bounded and build-ready**, not a greenfield kernel:
- **MoE arm = DEMONSTRATED tractable.** Patch 0015 already auto-caps `mmq_x` per-expert and is committed
-  and measured. Tightening to 8-16 + block-pad is the same lever, lower risk. This is real, banked
-  evidence that the "tune `mul_mat_q`" approach works on this exact kernel family.
- **Dense arm = plausibly bounded.** `mmq_y`-down is a warp/fragment remap that touches the
-  `nwarps x tile_C::I == mmq_y` static_assert coupling, so it is a contained *kernel* edit (not a pure
-  host switch, as the doc itself notes). The host-only P1 knobs are zero-risk. The **prefetch piece is
-  where the residual occupancy risk lives** - and per 9.3, TMA belongs here.
- **Rejecting (B) cutlass-rewrite and (C) BF16-Marlin-descent is correct.** Cutlass grouped FP4 is broken
-  on sm_121 (the reason vLLM itself falls to Marlin for MoE); BF16 Marlin concedes GB10's 2x FP4 edge.
-
-**Verdict: tractable, not greenfield.** The MoE arm is proven; the dense arm is a contained edit with a
-real but bounded occupancy risk, gated by the P2 kill-gate. The one scope gap is TMA (9.3).
-
-### 9.5 Honest expected outcome (the numbers I would defend)
-
-| | B alone | B + A (median) | B + A (optimistic, spec BW) | parity? |
-|---|---:|---:|---:|---|
-| **DENSE** (target 391) | ~80-84% (311-328 t/s) | **~92-95% (360-372 t/s)** | ~101% (394 t/s) | **optimistic tail only** |
-| **MoE** (target 811) | ~53-61% (431-498 t/s) | **~70-76% (570-618 t/s)** | 76% (618 t/s, CEILING) | **no** |
-
-Reconciliation with the doc: the doc's B+A = "96-103%" uses the **spec-BW (66 ms floor)** end. At the
-**achieved 216 GB/s (88 ms floor)** the same arithmetic gives **~94%**, and that still assumes B hits the
-floor. So the honest dense median is **~92-95%, with TRUE PARITY as the upside, not the expectation**,
-contingent on a conjunction of three things: (a) P2 clears the occupancy kill-gate to the floor, (b) the
-GB10-friendly *or* TMA feed actually reaches the cutlass floor (9.3), and (c) track A lands. Three ANDs =
-tail, not median.
-
-**The low-regret point the doc under-sells (and the real reason to commit):** even the *kill-gate-tripped*
-outcome is a large win. At the doc's own 15%-FP4-eff kill threshold (GEMM ~110 ms), B+A still lands
-**~89%** (step 369 ms); at a merely-partial occupancy win (eff 3% -> 5%, GEMM ~276 ms) B+A still lands
-**~61%**. Since the M=8 parity proof guarantees the floor is reachable in principle and patch 0015 proves
-the tune works, **getting *some* improvement at M=128 is high-probability; the only open question is how
-close to the floor.** So the outcome distribution is heavily positive (very likely 60-90%, possibly
-parity) with a bounded downside - B is **low-regret**, which matters more for the go decision than whether
-the parity tail hits.
-
-### 9.6 Sequencing vs track A: land A FIRST (the doc has this backwards)
-
-The doc runs A as a parallel track merging at **P3 (last)**. That is backwards for de-risking, for three
-reasons:
-1. **A defines B's interface.** B's "prequantized-MMQ consumer" consumes A's fused `block_fp4_mmq`
-   producer (the frozen struct in 4.4). Building B against a not-yet-landed producer means B's consumer
-   seam is speculative until P3.
-2. **A defines B's baseline and the kill-gate threshold.** A alone (act-fuse, folding the 65 ms / 8.2%
-   bucket, plus any of the elementwise/host it captures) plausibly moves dense **41% -> ~50-55%** before
-   B touches a kernel. B's *true residual is the GEMM after A removed the act round-trip*, not the raw
-   59%. Running B's P2 against the stock 41% baseline mis-sizes the required GEMM speedup and the
-   <15%-eff kill-gate.
-3. **A is lower-risk and independently shippable.** It is the safe win; it should not wait behind the
-   risky kernel tune.
-
-**Recommendation:** land A (tasks 38-41) first, **re-measure** the decode_agg and the GEMM share
-post-A, **then** run B's P2 and recompute the kill-gate against the post-A number. This makes the
-make-or-break decision cheaper, better-informed, and bankable-either-way.
-
-### 9.7 Verdict (go / no-go)
-
- **DENSE: CONDITIONAL GO - commit to B, but scope and message it as "close most of the GEMM gap"
-  (expected ~80-90%, parity the upside), NOT "true parity."** Justified because: the approach is
-  bounded/tractable (9.4), it is a working-path tune with a dual existence proof (9.1), and the outcome
-  is low-regret (9.5) - even a tripped kill-gate roughly doubles today's 41%. Conditions: (i) **land A
-  first** (9.6); (ii) **gate hard at P2** (eff < 15% -> stop chasing parity, but keep the partial win);
-  (iii) **put TMA in scope** as the floor-reaching fallback before declaring the floor unreachable (9.3).
-
- **MoE: NO-GO for parity from B (confirmed).** The doc's ~76% ceiling is honest, arguably optimistic
-  (it assumes the ragged M~4/expert grouped GEMM hits its 80 ms floor, the hardest regime, where vLLM
-  ships purpose-built Marlin). Realistic B+A landing **~70-76%**, B alone ~55-61%. Still worth doing -
-  the `mmq_x`-down / block-pad work is cheap and partly landed (patch 0015) - but it must be sold as a
-  **1.7-1.85x win, not parity**; MoE parity is a **B-plus-non-GEMM** program (elementwise fusion, host
-  CUDA-graph, GDN bf16 state).
-
- **One line for the parent:** GB10 can plausibly reach **dense** decode parity with vLLM only at the
-  **top of its FP4 envelope and only as B + A together** (B alone caps ~80%; A is what tips it over),
-  and **cannot** reach **MoE** parity from the GEMM track alone (ceiling ~76%). **Commit to B** as a
-  high-value, low-regret, bounded GEMM-gap-closing tune (honest expected landing **dense ~80-90%, MoE
-  ~55-65%**), **sequence track A first**, **gate at P2**, and **add a TMA weight-feed option** so the
-  occupancy-only route is not the only shot at the floor that vLLM's TMA pipeline demonstrably reaches.
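The dense @npl128 estimates in 9.2 and 9.5 all follow from one piece of arithmetic: swap llama's 471 ms GEMM for a candidate GEMM time, optionally subtract track A's ~65 ms act-quant bucket, and convert the resulting step time to tok/s. A minimal Python sketch, using only figures quoted in the review; the function and constant names are illustrative and not part of any benchmark tooling:

```python
# Dense @npl128 step-time arithmetic from sections 9.2 and 9.5.
# Inputs are the figures quoted in the review; names are illustrative only.

NPL = 128               # concurrent sequences, one decoded token each per step
LLAMA_STEP_MS = 795.0   # llama step today
LLAMA_GEMM_MS = 471.0   # llama GEMM bucket today
ACT_QUANT_MS = 65.0     # track A's act-quant bucket
VLLM_TOK_S = 391.0      # vLLM dense decode_agg


def tok_per_s(step_ms: float) -> float:
    """Aggregate decode throughput for a given step time."""
    return NPL / (step_ms / 1000.0)


def dense_projection(gemm_ms: float, with_track_a: bool):
    """Return (step_ms, tok_s, pct_of_vllm) after replacing the GEMM bucket."""
    step_ms = LLAMA_STEP_MS - LLAMA_GEMM_MS + gemm_ms
    if with_track_a:
        step_ms -= ACT_QUANT_MS
    tps = tok_per_s(step_ms)
    return step_ms, tps, 100.0 * tps / VLLM_TOK_S


scenarios = [
    ("spec-BW floor (273 GB/s)", 66.0),
    ("achieved-BW floor (216 GB/s)", 88.0),
    ("kill-gate (15% FP4 eff)", 110.0),
    ("partial win (5% FP4 eff)", 276.0),
]
for label, gemm_ms in scenarios:
    for with_a in (False, True):
        step_ms, tps, pct = dense_projection(gemm_ms, with_a)
        track = "B + A" if with_a else "B alone"
        print(f"{label:30s} {track:8s} step {step_ms:5.0f} ms  {tps:5.0f} tok/s  {pct:5.1f}%")
```

With these inputs the kill-gate row (GEMM ~110 ms, B + A) comes out at a 369 ms step, about 89% of vLLM, and the achieved-bandwidth floor (88 ms, B + A) at about 94%, matching the figures in 9.5.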
--- a/backend/cpp/llama-cpp/patches/paged/FUTURE_LEVERS.md
+++ b/backend/cpp/llama-cpp/patches/paged/FUTURE_LEVERS.md
@@ -1,86 +0,0 @@
-# Decode-Parity: Parked Levers (future exploration)
-
-**Context.** The bit-exact decode-parity effort shipped patches **0018-0023**: dense decode
-38% -> **95% of vLLM** @npl128 on GB10 / DGX Spark (LPDDR5x ~273 GB/s), every patch
-**byte-identical to llama's own f32 output** (md5-gated). The gated-DeltaNet recurrence (the
-dominant ~50% kernel) now runs at **84.6% of peak BW = past vLLM's 82.4%**, at the DRAM floor.
-bf16 SSM state was fully built and **shelved** (real +25-31% lever but fails the f32 KL gate).
-
-The remaining non-recurrence kernels (FP4 GEMM, attention, lm_head) are at their bit-exact
-floor: any knob changes a reduction order vs the f32 reference. So further *bit-exact* decode
-gains are marginal; the levers below are the honest pick-up points, ranked by promise.
-
---
-
-## 1. Hybrid-precision SSM state (the most promising)
-
-The bf16 build (`BF16_SSM_STATE_RESULTS.md`) proved the throughput lever is large -
-recurrence **-49%/call** (dense 3.38 -> 1.73 ms), dense decode ~**490 t/s = 125% of vLLM** (clean
-runs), MoE @128 **+24.9%** - but bf16 fails the f32 KL gate (KLD 0.06-0.17 at >=1024 ctx,
-~10% argmax flips). The discrimination showed the error is **intrinsic to bf16 over the
-long-memory heads** (exp(g) ~ 1, where the per-step decay does not contract the rounding);
-short/fast-decaying heads are fine.
-
-**Lever:** a per-head (or per-channel) precision split - keep the long-memory heads (g near 1)
-in f32, store the fast-decaying heads (g well below 1, where rounding contracts) in bf16. Could
-capture most of the speedup while passing the KL gate. Needs a g-magnitude classifier at graph
-build + a mixed-dtype recurrent-state cache. **HIGH promise, moderate effort.** The bf16 kernel
-plumbing already exists (DGX `~/llama-paged-dev/BF16_SSM_STATE.diff`); this adds the per-head
-dtype selection on top.
-
-*Note:* plain bf16 (no split) is also a legitimate **opt-in for precision-tolerant deployments** -
-it is exactly vLLM's own GDN precision (vLLM's recurrent cache is bf16), so "match vLLM speed at
-vLLM precision" is one flag away if a user wants it. We declined it as the *default* because our
-f32 is a strictly higher bar.
-
-## 2. Dense CUDA-graph instability
-
-The bf16 dense decode was **bimodal** across runs (287 / 336 / 487 / 498 t/s) - a dense-path
-CUDA-graph capture/replay instability (good runs hit ~490). The f32 dense path measured stable
-(371-376) but the bimodality is a latent fragility worth root-causing; a robust graph capture on
-the dense path could stabilize and possibly lift dense decode. **Moderate promise**, diagnostic.
-
-## 3. Dense rms_norm -> fp4 producer-fold (~1.5-2.5%, parked as flat-risk)
-
-The last bit-exact bucket (`RMSNORM_FP4_FOLD.md`). Folding the standalone `quantize_mmq_nvfp4`
-into the rms_norm+mul producer at the FFN boundary (f32 output dead -> droppable) could recover
-~1.5-2.5% dense. Parked because: the Lever-2 precedent measured **flat**, it has the worst
-gain/plumbing ratio (3-op `{RMS_NORM,MUL,MUL_MAT(NVFP4)}` graph fusion + a pre-quantized-src1
-GEMM path + scratch-pool / CUDA-graph-lifetime plumbing), and the gain risks being swallowed by
-the ~0.3-0.5% bench noise floor. Revisit only with the inter-node graph-CSE plumbing built and
-proven on a same-build flag toggle (decode_agg lift above noise AND md5 == 0023). **LOW promise.**
-
-## 4. Datacenter Blackwell (sm_100)
-
-This effort targeted **consumer** Blackwell sm_12x (sm_120 RTX 50-series, sm_121 GB10). Datacenter
-Blackwell (B100/B200/GB200, sm_100 / cc 10.0) has HBM3e (much higher BW) and different MMA
-characteristics - the LPDDR5x bandwidth floor that dominates GB10 decode does **not** apply, so the
-whole calculus changes (likely compute-bound, not BW-bound; the recurrence would not be the binding
-kernel). A separate investigation if datacenter Blackwell becomes a target.
-
-## 5. Prefill / TTFT scheduler + paged-pool burst degradation (HIGH priority - the weakest benchmark number)
-
-The final benchmark (`QWEN36_NVFP4_BENCH.md`) exposed TTFT as the clear weak spot vs vLLM. Two distinct
-issues:
- **Static decode-first budget tradeoff:** the QoS budget (patches 0013/0016, `LLAMA_MAX_BATCH_TOKENS=512`)
-  maximizes decode tok/s + memory but throttles burst-prefill, so under a synchronized 128-way burst TTFT
-  climbs to **903 s dense / 213 s MoE @npl128** vs vLLM's chunked-prefill 6-18 s. A dynamic/adaptive budget
-  (by concurrency + queue depth), or matching vLLM's chunked-prefill interleave, would rebalance.
- **Paged-pool burst-degradation BUG (concrete, found in the benchmark):** after a high-npl burst, a
-  server's *subsequent lower-npl* prefill collapses (fresh npl8 = 507 t/s / 6 s TTFT; npl8 after an npl64
-  burst = 65 t/s / 64 s). Decode stays robust; only prefill degrades -> root-cause the paged-pool state
-  that persists across the burst.
-
-**HIGH promise** for the serving experience: decode (dense 90-117%, MoE 77-83% of vLLM) and memory (1.5-3x
-lower) are already strong; TTFT is the one number holding back a clean public win.
-
-## 6. MoE-specific recurrence tuning
-
-The occupancy retune (0022) was tuned on the dense path; it lifted MoE +8.3% as a side effect. The
-MoE path (`MUL_MAT_ID` grouped GEMM + the shared GDN recurrence, expert routing changes the GEMM
-shapes) may have MoE-specific occupancy headroom. Worth a MoE-targeted reprofile.
-
---
-
-*All shelved per the host handover - experiments parked. Pick up from the linked result docs in this
-directory.*
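Lever 1's per-head split could start from something as small as a threshold on each head's decay factor. A minimal sketch under stated assumptions: `decay` is a hypothetical per-head value of exp(g) in (0, 1], the 0.99 cutoff is a placeholder that would have to be calibrated against the f32 KL gate, and the 128 x 128 per-head state size is taken from `GDN_DECODE_VERIFY.md`. None of these names exist in the llama.cpp tree.

```python
# Hypothetical per-head precision split for the GDN recurrent state (Lever 1).
# Assumption: decay[h] = exp(g) for head h; values near 1 mark long-memory heads.

S_V = 128                                  # per-head state is S_V x S_V
BYTES_PER_ELEM = {"f32": 4, "bf16": 2}


def classify_heads(decay, threshold=0.99):
    """Keep long-memory heads in f32; store fast-decaying heads in bf16."""
    return ["f32" if d >= threshold else "bf16" for d in decay]


def state_bytes(dtypes):
    """Recurrent-state bytes per sequence per layer for a per-head dtype list."""
    return sum(S_V * S_V * BYTES_PER_ELEM[t] for t in dtypes)


decay = [0.999, 0.5, 0.95, 1.0]            # made-up values for illustration
dtypes = classify_heads(decay)
saved = 1.0 - state_bytes(dtypes) / state_bytes(["f32"] * len(decay))
print(dtypes, f"state bytes saved: {saved:.0%}")
```

How much of the bf16 speedup survives depends on what fraction of heads fall below the cutoff, which the result docs cited above do not report.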
--- a/backend/cpp/llama-cpp/patches/paged/GDN_DECODE_VERIFY.md
+++ b/backend/cpp/llama-cpp/patches/paged/GDN_DECODE_VERIFY.md
@@ -1,208 +0,0 @@
-# GDN decode verify: is llama.cpp's Gated-Delta-Net decode O(1) or an O(ctx) re-scan?
-
-Verdict-first, then the evidence. This closes lever 5 of `VLLM_DECODE_GROUNDING.md` ("Verify
-llama's GDN/linear-attention decode path"): on the Qwen3.6 hybrid models, is llama re-scanning the
-context (O(ctx)) in the linear-attention layers, or keeping vLLM's O(1)-in-context recurrent state?
-
-Method: GGUF-metadata + source reading on the `paged` dev tree (`~/llama-paged-dev`, build-cuda
-sm_121) on `dgx.casa`, plus nsys CUDA-kernel decode traces on `~/bench/q36-27b-nvfp4.gguf`
-(GB10 / DGX Spark, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`). Models:
-`~/bench/q36-27b-nvfp4.gguf` (dense, arch `qwen35`), `~/bench/q36-35b-a3b-nvfp4.gguf`
-(MoE, arch `qwen35moe`).
-
-## TL;DR verdict
-
-**llama.cpp's GDN decode is EFFICIENT: it is O(1)-in-context, a single fused CUDA kernel that
-reads + updates a fixed-size cached recurrent state, structurally identical to vLLM's
-`fused_recurrent_gated_delta_rule`. It is NOT a re-scan, NOT a context-scaling blowup, and NOT a
-major contributor to the ~2.4x eager-decode gap.** There is no GDN-specific bottleneck to fix, so
-the cheap model-specific lever this probe was hunting for does not exist. The 2.4x is the general
-kernel work (the FP4 weight GEMM, which dominates the step, plus the O(ctx) full-attention decode
-kernel in the minority of full-attention layers), exactly as `VLLM_DECODE_GROUNDING.md` concluded.
-
-The decisive datum: at matched batch (npl4), pure decode, 4x more context, the GDN kernel time is
-**flat** while the full-attention kernel grows ~3.1x:
-
-| kernel | ctx 1024 | ctx 4096 | ratio | meaning |
-|--------|---------:|---------:|------:|---------|
-| `gated_delta_net_cuda` (GDN linear-attn) | 10.3 us/launch | 8.0 us/launch | **0.78x (flat)** | **O(1) in ctx** |
-| `flash_attn_tile` (full-attn layers) | 27.1 us/launch | 85.0 us/launch | **3.1x** | O(ctx), as expected |
-| total ms / decode step | 84.9 | 86.0 | 1.01x | GEMM-bound, ctx-independent |
-
-Identical decode-step counts in both windows (~190 steps, ~9134 GDN launches), so this is a
-per-step like-for-like comparison: the GDN layers do **not** get more expensive as context grows.
-
-## 1. Architecture (confirmed from GGUF metadata + tensor names)
-
-Both Qwen3.6 models are hybrid: a `full_attention_interval` of 4 means every 4th layer is standard
-full attention and the other 3/4 are Gated-Delta-Net (GDN) linear attention with a recurrent state.
-
-**Dense Qwen3.6-27B (`general.architecture = qwen35`):**
- `block_count = 64`, `full_attention_interval = 4` -> **16 full-attention layers + 48 GDN layers**.
- Full-attn: `head_count = 24`, `head_count_kv = 4` (GQA), `key_length = value_length = 256`,
-  rope `freq_base = 1e7`, mrope sections `[11,11,10,0]`.
- GDN/SSM: `ssm.state_size = 128`, `ssm.conv_kernel = 4`, `ssm.group_count = 16`,
-  `ssm.time_step_rank = 48`, `ssm.inner_size = 6144`. So the recurrent state per GDN layer is
-  `[S_v=128, S_v=128, H_v=48]` per sequence (`H_v = inner_size/state_size = 6144/128 = 48` value
-  heads), i.e. a 128x128 state matrix per head, ~3.1 MB (F32) per sequence per layer.
-
-**MoE Qwen3.6-35B-A3B (`general.architecture = qwen35moe`):**
- `block_count = 41`, `full_attention_interval = 4` (~10 full-attn + ~31 GDN layers).
- `head_count = 16`, `head_count_kv = 2`, `key_length = value_length = 256`,
-  `expert_count = 256`, `expert_used_count = 8`, `expert_feed_forward_length = 512`.
- Same SSM dims: `state_size = 128`, `conv_kernel = 4`, `group_count = 16`,
-  `inner_size = 4096` -> `H_v = 32` value heads.
-
-**Tensor names confirm the op split (27B, per-layer dump):**
- GDN layers (e.g. `blk.0.*`): `ssm_alpha`, `ssm_beta`, `ssm_conv1d`, `ssm_a`, `ssm_dt.bias`,
-  `ssm_norm`, `ssm_out`, plus `attn_qkv` / `attn_gate` (the in/out projections of the linear-attn
-  block). No `attn_k/v/output`, no per-head q/k norm.
- Full-attn layers (e.g. `blk.3.*`, every 4th): `attn_q`, `attn_k`, `attn_v`, `attn_output`,
-  `attn_q_norm`, `attn_k_norm`. No `ssm_*`.
-
-llama loads the GDN layers through the **recurrent memory** (`llama-memory-recurrent`), not the KV
-cache: the conv state and the SSM state live in `conv_states_all` / `ssm_states_all` and are read
-and written every step. Only the 16/10 full-attention layers use the (paged) KV cache. This is the
-SSM-style recurrent path, not standard attention.
-
-## 2. llama.cpp GDN decode implementation: O(1) recurrent-state update (code-proven)
-
-Graph build (shared by both models): `src/models/delta-net-base.cpp`, dispatched from
-`src/models/qwen35.cpp` and `src/models/qwen35moe.cpp` (the MoE class inherits
-`llm_build_delta_net_base` and calls the same `build_recurrent_attn`, qwen35moe.cpp:472).
-
-**Decode dispatch (`build_delta_net`, delta-net-base.cpp:425-447):** when `n_seq_tokens == 1`
-(decode), it takes `build_delta_net_fused` if `cparams.fused_gdn_ar` (the default, see below), else
-`build_delta_net_autoregressive`. Both are O(1):
-
- `build_delta_net_autoregressive` (delta-net-base.cpp:289-371) is the explicit rank-1 recurrence on
-  the fixed-size state `s` shaped `[S_v, S_v, H_v, n_seqs]`: `s *= exp(g)` (decay),
-  `sk = sum_rows(s * k)`, `d = (v - sk^T) * beta`, `s += k (x) d^T` (rank-1 update),
-  `o = sum_rows(s * q)`. **No loop over past tokens, no KV read** - it touches only the state and
-  the single new token's q/k/v/g/beta. `GGML_ASSERT(n_tokens == 1)`.
- `build_delta_net_fused` (delta-net-base.cpp:373-423) collapses the same recurrence into one op,
-  `ggml_gated_delta_net(q, k, v, g, b, s, K=1)`.
-
-**State is cached across steps, not rebuilt (`build_recurrent_attn`, delta-net-base.cpp:527-606):**
-the input state `s` is read from `ssm_states_all` via `build_rs`, and the new state is copied back
-with `ggml_cpy(new_state, view(ssm_states_all, ... kv_head ...))` (lines 555-558). The causal-conv
-state is handled the same way in `build_conv_state` (449-525): the previous `conv_kernel-1 = 3`
-samples are read from `conv_states_all`, the new token is appended, and the last 3 are written back.
-So both pieces of GDN state persist in the recurrent cache exactly like a KV cache persists tokens -
-this is the recurrent analogue, fixed size, independent of context length.
-
-**Defaults (`src/llama-context.cpp:200-201`):** `cparams.fused_gdn_ar = true` and
-`fused_gdn_ch = true`. They are only auto-disabled if the fused op cannot be scheduled on the same
-device as the layer (`device_gdn != device_kv`, lines 540-595); on a single GB10 with `-ngl 99`
-that does not happen, so the **fused single-kernel path is what runs**.
-
-**The CUDA kernel (`ggml/src/ggml-cuda/gated_delta_net.cu`) is the crux, and it is unambiguously
-O(1) in context:**
- Launch grid `dim3(H, n_seqs, ceil(S_v/4))` and block `(min(warp,S_v), 4, 1)` (lines 184-185):
-  the grid spans heads x sequences x state-columns. **There is no context-length dimension and no
-  context-length argument anywhere in the kernel signature** (q/k/v/g/beta are the new token(s)
-  `[S_v, H, n_tokens, n_seqs]`; `curr_state` is the fixed `[S_v, S_v, H, n_seqs]`).
- Each warp loads its shard of the fixed-size state into registers **once** (lines 57-61), then
-  loops `for (t = 0; t < n_tokens; t++)` (line 63). At decode `n_tokens == 1`, so it is a single
-  iteration: read the one new token, do the rank-1 update
-  `s_shard[r] = g * s_shard[r] + k[i] * delta_col` and the readout `attn = S^T q` (lines 84-141),
-  then write the updated state back (lines 161-167). No second loop, no read of any past KV.
- Work per decode step is therefore proportional to `S_v * S_v * H * n_seqs` (the state size x
-  batch) and **constant in context length**. This is precisely vLLM's
-  `fused_recurrent_gated_delta_rule_packed_decode_kernel` (one batched launch updating a
-  fixed-size `[K,V]` state) cited in the grounding doc.
-
-A chunked GPU kernel for prefill is a TODO (delta-net-base.cpp:181 `//TODO: Add chunked kernel`);
-the chunked CPU/graph path (`build_delta_net_chunking`) only runs for multi-token ubatches
-(prefill), never at decode.
-
-## 3. nsys decode profiling: GDN is a small share and does not scale with context
-
-Qwen3.6-27B NVFP4, sm_121, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`, `llama-server` driven
-to steady decode by a looping completion client. Kernel time bucketed by name (full classifier and
-sqlites under `~/bench/gdn_study/`).
-
-**(a) Share at the headline batch (npl128, ctx 1024), GPU 92.7% busy:**
-
-| bucket | % of busy | us/launch |
-|--------|----------:|----------:|
-| GEMM_weight (`mul_mat_q`/`mul_mat_vec_q`) | 59.2 | - |
-| **GDN_recurrent (`gated_delta_net_cuda`)** | **8.9** | 369 |
-| GEMM_act_quant (`quantize_mmq_nvfp4`) | 8.2 | - |
-| elementwise / act_glu / norm / rope | ~13.5 | - |
-| embed_gather (`get_rows`) | 2.9 | - |
-| **ATTENTION_full (`flash_attn`, 16 layers)** | **1.8** | 107 |
-| copy_cast (`cpy`) | 1.8 | - |
-| **GDN_conv (`ssm_conv`)** | **1.5** | - |
-
-The whole GDN path (recurrent 8.9% + conv 1.5%) is ~10% of the step; full attention is ~2%; the
-**weight GEMM dominates at ~67% (59.2% GEMM + 8.2% act-quant requant)**. This is the dense model,
-where the grounding predicted the GEMM would be the lever.
-
-**(b) Share at low batch (npl4, ctx 1024), weight-bandwidth (GEMV) regime, GPU ~100%:**
-GEMM_weight 88.7%, GDN_recurrent 0.8%, ATTENTION_full 0.7%, GDN_conv 0.3%. At low batch the
-weight-read GEMV swamps everything and GDN is negligible; the GDN share tracks the batch, not the
-context.
-
-**(c) Context-scaling control (the decisive test): matched batch npl4, pure decode, ctx 1024 vs
-4096.** Small batch -> fast prefill -> a clean pure-decode capture (verified: GEMM is the M=1
-`mul_mat_vec_q` decode GEMV, and the client completed decode rounds inside the window). Identical
-decode-step counts (~190 steps, gated_delta_net launched 9141 vs 9134 times), so per-launch time is
-a true per-step comparison:
-
-| kernel / bucket | ctx 1024 | ctx 4096 | ratio |
-|-----------------|---------:|---------:|------:|
-| `gated_delta_net_cuda` us/launch | 10.3 | **8.0** | **0.78x (flat)** |
-| GDN_recurrent share | 0.6% | 0.4% | flat/down |
-| `ssm_conv` (GDN_conv) us/launch | 5.2 | 5.2 | 1.00x |
-| `flash_attn_tile` us/launch | 27.1 | **85.0** | **3.14x** |
-| ATTENTION_full share | 0.6% | 1.8% | 3.0x up |
-| total ms / decode step | 84.9 | 86.0 | 1.01x |
-
-The GDN kernel time is flat (even a hair faster) across a 4x context increase, while the
-full-attention kernel grows ~3x, exactly the O(1)-vs-O(ctx) signature. The total step time barely
-moves because at this batch the (context-independent) FP4 weight GEMM is 88% of the step. This is
-the empirical confirmation of the code analysis: **llama's GDN decode does not re-scan the context.**
-
-(An earlier npl32 ctx4096 attempt was discarded: with 32 parallel slots each independently
-prefilling ~4100 tokens, the nsys window caught prefill, not steady decode - the `mul_mat_q(M=128)`
-+ `flash_attn_ext_f16(ctx4096)` signature gave it away. The npl4 runs above avoid this by keeping
-prefill short.)
-
-## 4. Verdict and fix scope
-
-**Efficient, not a bottleneck.** llama.cpp runs the Qwen3.6 GDN/linear-attention layers as a fused,
-single-CUDA-kernel, O(1)-in-context recurrent-state update, with the conv and SSM state cached in
-the recurrent memory across decode steps. It is algorithmically the same as vLLM's O(1)
-`fused_recurrent` decode. The probe's worst case (llama re-scanning context => GDN layers ballooning
-with context and concurrency) is **falsified**: the GDN kernel is flat across 4x context, and the
-op carries no context-length parameter at all.
-
-**So the GDN path is not the cheap model-specific lever.** It is a small-to-moderate, context-flat
-share of the step (~0.4-0.8% at low batch, ~10% including conv at batch 128), and removing it would
-not dent the 2.4x. The gap is the general kernel work, confirming `VLLM_DECODE_GROUNDING.md`:
-1. the **FP4 weight GEMM** is the dominant bucket (~59% GEMM + ~8% `quantize_mmq_nvfp4` requant that
-   vLLM fuses away via native FP4-MMA / grouped Marlin); this is the biggest, hardest lever.
-2. the **full-attention decode kernel** is the O(ctx) residual (the only thing that grows with
-   context, ~3x per-launch over 4x ctx), in the minority of full-attention layers.
-
-If anything on the GDN side is ever worth touching, it is a bounded micro-optimization, not a
-complexity fix: the kernel is memory-bound on the F32 recurrent state (state read+write is
-`S_v^2 * H * batch` = ~0.79 GB/step over 273 GB/s at batch 128, hence the ~8.9% share), and this
-traffic is **intrinsic to the architecture - vLLM pays the identical state I/O**, so it is not a
-llama-specific inefficiency. A future win could keep the recurrent state in bf16 or fuse the
-`ssm_conv` + gated-norm into the delta-net kernel to shave that ~10%, but the ceiling is small and
-it does not close the 2.4x. The throughput effort stays where the grounding put it: the FP4 GEMM
-(fused act-quant + native FP4-MMA) and the full-attention decode kernel, with a CUDA-graphed
-steady-state step as the bounded host-side add-on.
-
-## Reproduce
-
- Metadata: `python3 gguf-py/gguf/scripts/gguf_dump.py --no-tensors ~/bench/q36-27b-nvfp4.gguf`.
- Code: `src/models/delta-net-base.cpp` (build_delta_net 425, autoregressive 289, fused 373,
-  build_recurrent_attn 527, build_conv_state 449); `src/llama-context.cpp:200-201,540-595`
-  (fused_gdn defaults/guard); `ggml/src/ggml-cuda/gated_delta_net.cu` (kernel 4-168, launch grid
-  184-185, dispatch 226-312).
- Profiles: `~/bench/gdn_study/drv.sh <label> <P> <K> <ctx> <delay> <dur>` runs `llama-server` under
-  nsys and drives `clientloop.py`; `catgdn.py <sqlite>` buckets kernels. Sqlites:
-  `gdn_npl128_ctx1024`, `gdn_npl32_ctx1024`, `gdn_npl4_ctx1024`, `gdn_npl4_ctx4096`.
--- a/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md
+++ b/backend/cpp/llama-cpp/patches/paged/GDN_RECURRENCE_BYTE_GATE.md
@@ -1,344 +0,0 @@
-# GDN recurrence byte gate + fused single-pass kernel design
-
-Label: llama-fused-recurrence-design (READ-ONLY, no GPU). Source-and-math design only;
-the byte-ratio measurement itself is produced by the `ncu-byte-gate` agent.
-
-## TL;DR (the correction the workflow was set up to settle)
-
-**The recurrence kernel is ALREADY single-pass on the f32 state.** `gated_delta_net_cuda<128>`
-(after patches 0018 in-place write + 0019 fused gather) loads the whole `s0` column into registers
-ONCE (`s_shard[rows_per_lane]`), runs the entire token loop in registers, and writes the new state
-back ONCE - directly into the persistent cache slot (0018) or scratch. For decode `n_tokens==1`,
-`keep_rs_t==false`: one register load, one register store, no re-read of state from DRAM.
-
-The byte-gate's working hypothesis - "un-fused l2norm/gate/decay/recurrence/state-writeback/gather
-each touching the f32 state, so a fused pass halves DRAM bytes" - is **false for the state**. Only
-the recurrence kernel touches the 3 MB/seq state. The surrounding ops (`l2_norm`, `silu`, `sigmoid`,
-the `gate` exp/softplus, `ssm_conv`, `concat`, `cpy`) all operate on the **small activations**
-(q/k/v/g/beta), which are 100-800x smaller than the state. There is no 2x state re-streaming to
-recover; the recurrence kernel is byte-minimal on state by construction.
-
-Therefore a fused single-pass kernel **cannot move the dominant 196 ms recurrence** - that cost is
-f32-state read+write bandwidth, already a single pass. The two real levers are decoupled:
-
-1. **Fold the surrounding activation ops into the kernel** (MEDIUM effort): recovers the small
-   per-op buckets (`ssm_conv` 1.5% + `silu`/`sigmoid` 1.4% + 2x `l2_norm` + `concat` 2.1% + conv
-   `cpy` 2.0%, ~6-8% of the step) plus per-op launch overhead. Bit-exact. Ceiling ~93-96% of vLLM.
-2. **bf16 state cache** (HIGH effort, NON-bit-exact): halves the dominant byte stream. The only
-   large lever on the 196 ms. Target KL < 1e-3 by keeping f32 register accumulation, storing only
-   the persisted cache in bf16.
-
-Which of (1)/(2) is worth building hinges on the `ncu-byte-gate` byte ratio (below).
-
-## Byte arithmetic (dense q36-27b-nvfp4, decode, npl128, S_v=128, H_v=48, batch=128)
-
-State per (seq, GDN layer) = S_v^2 * H_v = 128*128*48 = 786,432 f32 = **3.0 MiB**.
-
-Per kernel call (one GDN layer, full 128-seq batch), single pass:
- state read  = 786,432 * 128 * 4 = 402.65 MB
- state write = 402.65 MB
- **state R+W = 805.3 MB/call** (768 MiB)
- activations (q,k 1 MB each; v 3 MB; attn-out 3 MB; g,beta tiny) ~= 8 MB/call = **<1%**.
-
-Measured 4.08 ms/call (node-level trace) -> effective **197.4 GB/s**.
-GB10 / DGX Spark LPDDR5X peak ~= **273 GB/s** -> **~72% of peak.**
-
-48 GDN layers/step -> 38.7 GB of state traffic/step -> 196 ms = 51.6% of the 383.48 ms step. The
-~8 MB/call of activation traffic is noise; state is 99% of the recurrence bytes.
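The byte arithmetic above can be re-derived in a few lines. This is a minimal sketch that only restates the quoted geometry (S_v=128, H_v=48, batch 128, f32 state, 48 GDN layers, 4.08 ms/call, 273 GB/s peak); the constant and function names are illustrative and do not come from the llama.cpp tree.

```python
# Figures are copied from the notes above; names are illustrative only.
S_V = 128          # state head dimension
H_V = 48           # value heads per GDN layer
BATCH = 128        # sequences per decode step (npl128)
F32_BYTES = 4
GDN_LAYERS = 48
PEAK_GBPS = 273.0  # quoted LPDDR5X peak
CALL_MS = 4.08     # measured duration of one recurrence call


def state_elems_per_seq() -> int:
    """f32 elements of recurrent state per (sequence, GDN layer)."""
    return S_V * S_V * H_V


def state_rw_bytes_per_call() -> int:
    """One read plus one write of the state for the whole batch."""
    return 2 * state_elems_per_seq() * BATCH * F32_BYTES


def effective_gbps(nbytes: int, ms: float) -> float:
    """Bytes moved divided by wall time, in GB/s (1 GB = 1e9 bytes)."""
    return nbytes / (ms * 1e-3) / 1e9


if __name__ == "__main__":
    rw = state_rw_bytes_per_call()
    bw = effective_gbps(rw, CALL_MS)
    print(f"state R+W per call : {rw / 1e6:.1f} MB")
    print(f"effective bandwidth: {bw:.1f} GB/s ({100 * bw / PEAK_GBPS:.0f}% of peak)")
    print(f"state bytes per step: {rw * GDN_LAYERS / 1e9:.2f} GB")
    print(f"recurrence per step : {CALL_MS * GDN_LAYERS:.1f} ms")
```

The per-step figures are simply the per-call figures multiplied by the 48 GDN layers, which is why the state stream dominates the step.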
-
-### What this means for the open question
- The recurrence is single-pass, coalesced (transposed layout: lane reads `state[col*S_v + i]`,
-  consecutive lanes -> consecutive `i`), running at ~72% of peak BW. It is NOT at the 85% hardware
-  floor, but it is NOT re-streaming state either. The 72->85% headroom (~30 ms, bit-exact) is an
-  occupancy/coalescing tune, NOT a fusion win.
- vLLM `fused_recurrent_gated_delta_rule` does the SAME single-pass recurrence. If vLLM's recurrent
-  state cache is bf16 (model dtype) while llama's is f32, vLLM moves HALF the bytes on the dominant
-  stream - that alone is ~98 ms, i.e. essentially the whole residual decode gap. **This is the
-  single most decision-relevant number for the `ncu-byte-gate` agent to confirm: the dtype/bytes of
-  vLLM's GDN state cache vs llama's f32, plus llama's measured achieved-BW % on the recurrence
-  kernel.** If vLLM is bf16-state -> build (2). If vLLM is also f32-state and at ~85% -> llama is
-  at the floor, only (1) + coalescing remain and bit-exact parity tops out ~95%.
-
-## The fused single-pass kernel design
-
-Two deliverables, layered. Build (1) first (bit-exact, de-risks the graph), gate (2) on the byte
-verdict.
-
-### (1) `ggml_gated_delta_net_decode_fused` - fold the activation ops into the kernel
-
-Folds the pre-recurrence activation ops and the post-recurrence gated RMSNorm into the existing
-single-pass recurrence kernel, so q/k/v/g are produced and consumed in registers/shared and never
-make a separate DRAM round-trip, and the per-op launches collapse to one.
-
-Current decode op chain in `build_layer_attn_linear` (qwen35.cpp 386-461), per GDN layer:
-
-```
-wqkv GEMM -> qkv_mixed                                  (keep: GEMM, separate)
-wqkv_gate GEMM -> z                                     (keep: GEMM, separate)
-ssm_beta GEMM -> beta -> sigmoid                        [FOLD beta sigmoid]
-ssm_alpha GEMM -> alpha -> +ssm_dt -> softplus -> *ssm_a (gate) [FOLD softplus/mul -> per-head g]
-build_conv_state: reshape, transpose qkv, CONCAT, cpy   [concat/cpy -> conv-state plumbing, see note]
-ggml_ssm_conv(conv_input, conv_kernel)                  [FOLD depthwise conv, K=4]
-ggml_silu(conv_output)                                  [FOLD silu]
-views q_conv/k_conv/v_conv
-ggml_l2_norm(q_conv); ggml_l2_norm(k_conv)              [FOLD 2x l2norm]
-[repeat_4d skipped on fused path]
-ggml_gated_delta_net_inplace_ids(...)                   <-- THE recurrence kernel (196 ms)
-build_norm_gated(output, ssm_norm, z): RMSNorm + silu(z) + mul  [FOLD post gated-RMSNorm]
-ssm_out GEMM                                            (keep: GEMM, separate)
-```
-
-Fold list (what moves INTO the kernel):
- `beta` sigmoid: scalar per (head,seq); apply in-kernel when reading beta.
- `gate` g = softplus(alpha+dt)*a (GDA, g->ne0==1): scalar per (head,seq); compute/exp in-kernel.
-  The kernel already does `expf(*g_t)` (non-KDA path, line 85) - so feed RAW `alpha`+`dt` and the
-  `a` scale and do softplus+mul+exp in-kernel; removes the `add`/`softplus`/`mul` launches.
- `ssm_conv` (depthwise causal conv1d, kernel width 4) + `silu`: per channel a length-4 dot of the
-  conv state with `ssm_conv1d` then silu. This is the prologue: each warp/thread, before loading
-  state, computes its q/k/v channel by reading 3 cached conv-state taps + the current qkv_mixed
-  token, dotting the 4-wide kernel, applying silu. The conv state (conv_kernel-1=3 taps x conv_dim)
-  is tiny and already cached; fold its read here and its 1-token shift write into the epilogue
-  (replaces the `concat`+`cpy` conv-state update).
- `l2_norm` of q and k: a warp reduction over S_v of the per-head q/k vector. The recurrence kernel
-  already does warp reductions over S_v (the kv/attn dot products) - the l2norm reuses the same
-  warp-reduce primitive on q_reg/k_reg right after they are loaded, before the recurrence math.
- Post: `build_norm_gated` = RMSNorm(output, ssm_norm) * silu(z). The kernel already holds the
-  attn output `attn_col` per (head,seq,col) in registers at the end; fold an S_v warp-reduce RMS,
-  multiply by `ssm_norm` weight and by `silu(z)` (z read once), and write the final gated output -
-  removing the `rms_norm`+`silu`+`mul` launches and one activation round-trip.
-
-State traffic UNCHANGED (still one read + one write). Activation traffic for conv/silu/l2norm/norm
-collapses into the kernel's register/shared path; ~6 separate launches become 0. Expected recovery:
-the ~6-8% surrounding-op buckets + launch overhead. **Bit-exact** if the numeric ordering is held
-(see Numeric notes). Conservative ceiling ~365-375 tok/s dense (~93-96% of vLLM 391).
-
-Data flow (per (h_idx=head, sequence=seq) block, decode n_tokens=1, S_v=128, num_warps=4):
-1. PDL sync.
-2. Prologue (per channel/lane): read 3 conv-state taps + current `qkv_mixed[t]` for this channel,
-   dot with `ssm_conv1d[0..3]`, add conv bias if any, `silu`. Produces this lane's q/k/v element.
-3. l2norm q,k: warp-reduce sum(q^2), sum(k^2) over the S_v dim; scale q_reg,k_reg by rsqrt(.+eps).
-4. Load `s0` column into `s_shard` (UNCHANGED single read).
-5. Recurrence (UNCHANGED math: g-decay, kv = S^T k, delta = (v - g*kv)*beta, S = g*S + k(x)delta,
-   attn = S^T q * scale).
-6. Write `s_shard` back to cache slot ONCE (UNCHANGED single write). Write the 1-token-shifted conv
-   state back to the conv cache (replaces concat+cpy).
-7. Epilogue gated-RMSNorm: warp-reduce sum(attn^2) over S_v -> RMS; multiply by `ssm_norm[col]` and
-   by `silu(z[col])` (z loaded once); write final output element. ssm_out GEMM stays separate.
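Steps 3 and 5 of the data flow can be written as a scalar reference for one (head, sequence) pair. This is a hedged pure-Python sketch, not the CUDA kernel: `decay` is assumed to be the already-exponentiated gate, the eps placement follows the step list rather than ggml's `ggml_l2_norm`, and all function names are hypothetical.

```python
import math


def l2_normalize(x, eps=1e-6):
    """Scale x by rsqrt(sum(x^2) + eps), as in step 3 (eps placement is an assumption)."""
    inv = 1.0 / math.sqrt(sum(v * v for v in x) + eps)
    return [v * inv for v in x]


def gdn_decode_step(S, q, k, v, decay, beta, scale):
    """One-token gated delta-rule update.

    S is a list of rows indexed [key_dim][value_dim]. The order matches the notes:
    kv = S^T k, delta = (v - decay*kv) * beta, S = decay*S + k (x) delta, attn = S^T q * scale.
    """
    n_k, n_v = len(k), len(v)
    kv = [sum(S[i][j] * k[i] for i in range(n_k)) for j in range(n_v)]
    delta = [(v[j] - decay * kv[j]) * beta for j in range(n_v)]
    S_new = [[decay * S[i][j] + k[i] * delta[j] for j in range(n_v)] for i in range(n_k)]
    attn = [scale * sum(S_new[i][j] * q[i] for i in range(n_k)) for j in range(n_v)]
    return S_new, attn


if __name__ == "__main__":
    k = l2_normalize([1.0, 2.0, 3.0, 4.0], eps=0.0)
    S0 = [[0.1 * (i + 1) * (j + 1) for j in range(3)] for i in range(4)]
    v = [0.5, -1.0, 2.0]
    _, attn = gdn_decode_step(S0, q=k, k=k, v=v, decay=0.5, beta=1.0, scale=1.0)
    print(attn)  # with unit-norm k, q == k and beta == 1 the read-back reproduces v
```

With a unit-norm key and `beta == 1`, reading the updated state back with the same key returns `v` for any decay, which makes a convenient sanity check for a fused variant.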
-
-Inputs added to the op: `ssm_conv1d` weight, `ssm_norm` weight, `z`, conv-state cache view, raw
-`alpha`/`dt`/`a`, eps. This is a wider op signature (src[8..]) - acceptable; gate it behind a new
-`cparams.fused_gdn_decode` resolved exactly like `auto_fgdn` (graph_reserve + device-match probe,
-llama-context.cpp 518-595) so it silently falls back to the current op chain if any device lacks it.
-
-### (2) bf16 recurrent-state cache - the dominant-term lever (NON-bit-exact)
-
-Only build if `ncu-byte-gate` shows vLLM moves fewer state bytes (bf16) OR llama's f32 recurrence is
-already >=85% of peak (then f32 is at the floor and bf16 is the only way down).
-
- Store `ssm_states_all` (the recurrent-state cache) as bf16. Halves the dominant 805 MB/call -> at
-  the same ~197 GB/s -> ~2.04 ms/call -> ~98 ms/step saved (196 -> ~98). Dense projected
-  335 -> ~440+ tok/s (>= vLLM 391) if BW-bound holds; smaller dtype usually achieves a HIGHER % of
-  peak, so likely better.
- Kernel change: read state -> convert bf16->f32 into `s_shard` (registers stay f32); all recurrence
-  arithmetic in f32 (UNCHANGED); on write, convert f32->bf16. Accumulation precision is preserved
-  within a step; only the PERSISTED state is rounded to bf16 each step.
- Numerics: the recurrent state decays geometrically (g<1), so per-step bf16 rounding does not
-  accumulate unboundedly, but it is NOT bit-exact. Validate KL < 1e-3 vs the f32-state build over a
-  256-token greedy run; if KL fails, fall back to f32 state (keep it a cparams toggle). This is the
-  ONLY path to bit-near parity-or-better on the dominant term; bit-EXACT parity on the 196 ms is
-  unreachable because the f32 state bytes are irreducible (single pass already).
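The claim that geometric decay keeps per-step bf16 rounding bounded can be illustrated on a scalar toy recurrence. This sketch assumes a constant decay `g < 1`, emulates bf16 by rounding the f32 bit pattern to nearest-even, and uses Python doubles as a stand-in for the f32 registers; it is not the kernel and the names are hypothetical.

```python
import math
import struct


def to_bf16(x: float) -> float:
    """Round a finite, moderate-magnitude float to the nearest bf16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]  # f32 bit pattern
    bias = 0x7FFF + ((bits >> 16) & 1)                    # round-to-nearest-even on the low 16 bits
    bits = (bits + bias) & 0xFFFF0000                     # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def run(decay: float, steps: int):
    """Return (max |error|, max |bf16 state|) for the recurrence s <- decay*s + u_t."""
    s_ref = 0.0  # reference state, never rounded
    s_q = 0.0    # state persisted as bf16 between steps
    max_err = 0.0
    max_abs = 0.0
    for t in range(steps):
        u = math.sin(0.37 * t) + 0.5    # arbitrary deterministic input
        s_ref = decay * s_ref + u
        s_q = to_bf16(decay * s_q + u)  # full-precision update, bf16 only on write-back
        max_err = max(max_err, abs(s_q - s_ref))
        max_abs = max(max_abs, abs(s_q))
    return max_err, max_abs


if __name__ == "__main__":
    err, peak = run(decay=0.9, steps=256)
    bound = 2.0 ** -8 * peak / (1.0 - 0.9)
    print(f"max error {err:.4g}, geometric bound {bound:.4g}")
```

Each write-back adds at most `2^-8 * |s|` of error and the previous error is scaled by `g`, so the total stays under `2^-8 * max|s| / (1 - g)`. Whether that translates to KL < 1e-3 on the real model is exactly what the gate above has to measure.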
-
-## Numeric / bit-exactness notes (for fold (1))
- l2norm/RMS use f32 warp-reduce accumulation (matches `ggml_l2_norm`/`ggml_rms_norm` f32 sum).
-  Order of summation across lanes differs from the standalone op's sequential sum -> floating
-  reassociation. To stay bit-exact, replicate the standalone op's reduction order, OR accept a
-  tiny reassociation delta and gate on KL<1e-3 (the workflow's near-bit-exact target). Recommend:
-  ship fold (1) behind the cparams probe and assert greedy md5 match vs the current chain (0019
-  already established the harness: dense text md5, MoE byte-identical).
- Recurrence math, scale, g-exp order, beta apply: keep EXACTLY as in `gated_delta_net_cuda` /
-  `ops.cpp` reference (lines 84-141 .cu, 10685-10730 ops.cpp). Do not reorder the
-  v - g*kv -> *beta -> S update -> S^T q sequence.
- conv: depthwise dot of width-4 kernel in f32, then silu - identical to `ggml_ssm_conv`+`ggml_silu`
-  if done in the same order.
- gate softplus: `softplus(x)=log1p(exp(x))`; match ggml's `ggml_softplus` (has the >20 fast path)
-  to stay bit-exact.
-
-## Implementation scope
- (1) `.cu`: extend `gated_delta_net_cuda` with a decode-fused template specialization (or a new
-  kernel) that does conv+silu prologue, q/k l2norm, recurrence, conv-state shift write, gated-RMSNorm
-  epilogue. Add `ggml_cuda_op` dispatch. CPU mirror in `ops.cpp` for parity/CI.
- (1) `ggml.h`/`ggml.c`: new builder `ggml_gated_delta_net_decode_fused` (extra src: ssm_conv1d,
-  ssm_norm, z, conv-cache view, alpha/dt/a, eps + op_params for eps).
- (1) graph edits: `delta-net-base.cpp build_recurrent_attn` (add the decode-fused branch alongside
-  the existing fused/ids branch); `qwen35.cpp` + `qwen35moe.cpp` `build_layer_attn_linear` (route
-  the pre/post ops into the op when `cparams.fused_gdn_decode`); leave `qwen3next.cpp`,
-  `kimi-linear.cpp`, the non-fused and rollback (n_rs_seq>0) paths unchanged.
- (1) `llama-context.cpp`: `auto_fgdn`-style device-match probe to enable/disable the decode-fused
-  op (silent fallback). `cparams.h`/`cparams.fused_gdn_decode`.
- (2) bf16 state: cache dtype change in the recurrent-memory allocation + the kernel load/store
-  convert + a `cparams` toggle + KL gate. Touches `gated_delta_net.cu` load/store, the inplace/ids
-  builders' state asserts, and the recurrent cache type.
-
-## Risk register
- (1) is MEDIUM effort, bit-exact-targetable, but bounded upside (~6-8% + launches; ceiling ~95% of
-  vLLM). Worth it only if the workflow wants >90% and accepts no bf16.
- (2) is the only large lever on the dominant 196 ms but is NON-bit-exact (KL-gated). If vLLM is
-  f32-state, (2) takes llama BELOW vLLM's precision, not toward parity - a product call, not a perf
-  call.
- The widened op signature (many srcs) raises maintenance cost and the device-match probe matters
-  (CPU offload of a GDN layer must fall back cleanly).
- Do NOT expect a fused recurrence to cut the 196 ms: it is already one read + one write of f32
-  state. Re-confirm with the `ncu-byte-gate` achieved-BW number before committing HIGH effort.
-
---
-
-# MEASUREMENT + VERDICT (label ncu-byte-gate, THE GPU agent) - GATE SETTLED
-
-The design above predicted the answer; this is the decisive measurement that confirms it.
-
-## VERDICT: NO-BUILD the fused single-pass recurrence. BUILD bf16 SSM state (design's lever (2)).
-
-Deciding number: **llama re-stream factor = ~1.0x** (mathematically capped at <=1.33x; >=1.5x is
-physically impossible). llama's recurrence kernel is ALREADY single-pass, coalesced, and at
-**74% of GB10 peak BW** - MORE bandwidth-efficient than vLLM's fused triton kernel (41% of peak).
-The whole 2x DRAM gap vs vLLM is **f32 (llama) vs bf16 (vLLM) state-cache width**, not re-streaming.
-
-## ncu HW counters were BLOCKED; timing + geometry gave the byte ratio anyway
- `ncu dram__bytes` and `nsys --gpu-metrics-devices` both return `ERR_NVGPUCTRPERM`
-  (`NVreg_RestrictProfilingToAdminUsers` restricted, root-only; no passwordless sudo on dgx.casa).
-  DRAM byte counters are unobtainable on this box.
- Decisive fallback (no perf counters): CUPTI kernel TIMING (allowed) + EXACT byte geometry from
-  the kernel source. bytes_moved <= peak_BW x duration gives a HARD CAP on the re-stream factor;
-  comparing implied effective BW between llama and vLLM (same model, same B, both eager) settles it.
-
-## Measured (clean nsys CUDA timing; build-cuda-base df1cc97 Lever-1; both B=128, both eager with CUDA graphs OFF)
-llama: `llama-batched-bench -npp 8 -ntg 12 -npl 128 -ub 2048`, GGML_CUDA_DISABLE_GRAPHS=1.
-vLLM:  postssm_decomp/vllm_decode.sqlite, NSEQ=128, enforce_eager=True (apples-to-apples).
-
-| kernel | state dtype | bytes R+W/call | duration/call (steady) | eff. BW | % of 273 peak | re-stream |
-|---|---|---|---|---|---|---|
-| llama gated_delta_net_cuda          | f32  | 805.3 MB | **3.98 ms** (min 3.90 max 4.33, grid 48x128x32) | 202 GB/s | **74%** | ~1.0x |
-| vLLM fused_recurrent...packed_decode | bf16 | 402.6 MB | **3.62 ms** (min 3.53 max 3.96, grid 4x6144x1)  | 111 GB/s | **41%** | ~1.0x |
-
- llama recurrence/step = 3.98 x 48 = **191 ms** (50% of 384 ms step; matches STATE 196 ms).
- vLLM recurrence/step  = 3.62 x 48 = **174 ms**. Per-call gap llama-vs-vLLM is only +10%, NOT 2.8x.
-  The old "1.47 ms near-vLLM" was prefill-contaminated; clean decode is 3.98 ms (confirms STATE).
- Both kernels verified SINGLE-PASS in source (llama: s_shard load-once/store-once, 128 consecutive
-  f32/warp = coalesced; vLLM packed_decode: `b_h += load(p_h0).to(f32)` once, `store(p_ht, b_h.to(bf16))`
-  once). vLLM cache dtype = state_dtype = model_dtype = bf16 (`_mamba_state_dtype` default "auto" ->
-  model dtype; config.json dtype=bfloat16). Geometry identical (H=48, k/v head_dim 128, S_v 128).
-
-## Why re-stream ~1.0x (the gate number)
-Most bytes a 3.98 ms call could move at 273 GB/s peak = 1.087 GB = **1.33x the 816 MB minimal**.
-1.5x/2x re-stream would need >peak BW -> impossible. Source proves single-pass+coalesced -> 1.0x end:
-~816 MB at 202 GB/s = 74% peak. A fused single-pass rewrite recovers ~0 state bytes => NO-BUILD.
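The hard cap on the re-stream factor follows from one inequality, bytes_moved <= peak_BW x duration. A minimal sketch using only the quoted figures (3.98 ms, 273 GB/s, 816 MB minimal traffic); the function name is illustrative.

```python
def max_restream_factor(duration_ms: float, peak_gbps: float, minimal_mb: float) -> float:
    """Upper bound on (bytes actually moved) / (minimal single-pass bytes).

    No kernel can move more than peak bandwidth times its own duration.
    """
    max_bytes = peak_gbps * 1e9 * duration_ms * 1e-3
    return max_bytes / (minimal_mb * 1e6)


if __name__ == "__main__":
    cap = max_restream_factor(duration_ms=3.98, peak_gbps=273.0, minimal_mb=816.0)
    print(f"re-stream factor cannot exceed {cap:.2f}x")
```

Any hypothesis that needs a factor of 1.5x or 2x therefore implies more than peak bandwidth and can be rejected from timing alone, without DRAM counters.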
-
-## The lever: bf16 SSM state (design (2)) - confirmed, large, parity-to-ahead
-2x recurrence bytes vs vLLM = 100% f32-vs-bf16 cache. llama's kernel is the more efficient one
-(74% vs 41% peak), so bf16 state (cache + load/store bf16, f32 register compute, exactly as vLLM):
- 805.3 -> ~413 MB => at 74% peak ~2.0 ms/call => 191 -> ~96 ms/step, save ~95 ms => step ~289 ms
-  (~443 tok/s, AHEAD of vLLM's 327 ms/step = 391 tok/s). Conservative (50% peak on smaller
-  footprint): ~3.0 ms/call => save ~45 ms => step ~339 ms = vLLM parity. Range = parity-to-ahead.
- NON-bit-exact vs llama's f32 reference, but EQUAL precision to vLLM (which is bf16). Gate on
-  PPL/KL vs the f32 build, not md5. "Bit-exact parity with vLLM" was never on the table - vLLM is bf16.
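The parity-to-ahead range can be reproduced from the measured numbers. This sketch assumes throughput is batch size divided by step time (consistent with 128 / 0.327 s being about 391 tok/s) and that only the 48 recurrence calls change; the names are illustrative, and the results differ slightly from the rounded figures above because the per-call time is not rounded to 2.0 or 3.0 ms.

```python
BATCH = 128            # sequences per decode step
GDN_LAYERS = 48
STEP_MS = 384.0        # measured f32-state step
F32_CALL_MS = 3.98     # measured f32 recurrence call
PEAK_GBPS = 273.0
BF16_BYTES_MB = 413.0  # projected bytes per call with a bf16 state cache


def project(frac_of_peak: float):
    """Return (call_ms, step_ms, tok_per_s) if the bf16 kernel reaches frac_of_peak."""
    call_ms = BF16_BYTES_MB * 1e6 / (PEAK_GBPS * 1e9 * frac_of_peak) * 1e3
    step_ms = STEP_MS - GDN_LAYERS * (F32_CALL_MS - call_ms)
    tok_per_s = BATCH / (step_ms * 1e-3)
    return call_ms, step_ms, tok_per_s


if __name__ == "__main__":
    for frac in (0.74, 0.50):
        call_ms, step_ms, tok_s = project(frac)
        print(f"{frac:.0%} of peak: {call_ms:.2f} ms/call, step {step_ms:.0f} ms, {tok_s:.0f} tok/s")
```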
-
-## Conv-path (no-regret conv-fusion lever sizing), llama steady decode, per call x48
-concat_cont 169.6 us (8.14 ms/step) + cpy_scalar 120.1 us (5.76) + ssm_conv_f32 115.9 us (5.56)
-= ~19.5 ms/step (~5%). Conv STATE ~12.6 MB (tiny) -> this is LAUNCH/small-kernel overhead, not bytes
-> a FUSION lever (design (1)), secondary to bf16 state. l2_norm 6.8 us, gdn_gather 1.21 us (no-op,
-identity seqs -> confirms gather does NOT re-stream state at steady decode).
-
-## One-line answer
-llama: 805 MB/call, 74% peak, re-stream ~1.0x (<=1.33x). vLLM: 402 MB/call (bf16), 41% peak.
-conv-path: ~12.6 MB (launch-bound ~19.5 ms/step, not byte-bound).
-=> NO-BUILD fused recurrence (already single-pass, more efficient than vLLM); BUILD bf16 state
-(halves the dominant 805 MB, ~45-95 ms/step, parity-to-ahead). Deciding number: re-stream ~1.0x.
-
---
-
-# FINAL DECISION (synthesis of all four agents) - the five points
-
-This closes the workflow. Inputs: `ncu-byte-gate` (measured byte ratio), `vllm-fused-recurrence-study`
-(vLLM's single-pass boundary), `llama-fused-recurrence-design` (the fold/levers), `conv-fusion-design`
-(the no-regret conv in-place lever). They agree on every number; the decision is unambiguous.
-
-## (1) Byte-ratio verdict - the decisive number
-
-**llama is at the hardware bandwidth floor, NOT re-streaming.** Re-stream factor = **~1.0x**, hard
-capped at **<=1.33x** (the most bytes a 3.98 ms call can move at 273 GB/s peak is 1.087 GB = 1.33x
-the 816 MB minimal; >=1.5x is physically impossible). The recurrence kernel runs at **74% of GB10
-peak BW** (805.3 MB R+W / 3.98 ms = 202 GB/s) - MORE bandwidth-efficient than vLLM's fused triton
-`packed_decode` at **41% of peak** (402.6 MB / 3.62 ms = 111 GB/s). Source confirms both are
-single-pass and coalesced (llama `s_shard` load-once/store-once, 128 consecutive f32/warp; vLLM
-`b_h = load(p_h0)` once -> f32 regs -> `store(p_ht, b_h.to(bf16))` once). The entire 2x DRAM gap
-vs vLLM is **100% f32 (llama) vs bf16 (vLLM) state-cache WIDTH**, not extra passes.
-
-## (2) Fused single-pass GDN recurrence: **NO-BUILD**
-
-A fused single-pass rewrite recovers **~0 state bytes** because the kernel is already one read + one
-write of the f32 state, and the un-fused l2norm/sigmoid/softplus/gate ops act on the tiny
-q/k/g/beta projections (8 MB/call, <1%), not the 805 MB state. There is no second pass to fuse away.
-Expected ceiling if built anyway: unchanged 191 ms recurrence -> no movement on the dominant 50% of
-the step. **Do not build it.** This refutes the workflow's founding hypothesis with a measured cap.
-
-## (3) Conv-state in-place fusion (`conv-fusion-design`): **GO - confirmed, bit-exact, no-regret**
-
-This is independent of the recurrence verdict and holds regardless. Build a fused
-`ggml_ssm_conv_update_inplace` (mirrors the 0018/0019 in-place pattern) that, at decode
-(`n_seq_tokens==1 && !keep && fused-AR && n_rs_seq==0`), assembles the width-4 conv window in
-registers from the cached K-1=3 taps + the native `qkv_mixed` token, computes the depthwise conv,
-folds `silu`, and writes the 1-token-shifted ring state back in place.
- Eliminates `concat_cont` (8.14 ms/step), `cpy_scalar` (5.76 ms/step), the transpose
-  materialization, and the separate `ggml_silu`; replaces `ssm_conv` with a ~1.6x-byte fused kernel
-  (5.56 -> ~9 ms). **Net ~12-14 ms/step = +3.1 to +3.7%** -> dense 335 -> ~346-349 tok/s @npl128
-  (88.5-89.3% of vLLM 391).
- **Bit-exact**: identical ascending-j width-4 FMA order as `ssm_conv_f32` at i==0, same `silu`
-  primitive, same f32 state bytes written - only the producing node changes. Greedy output is
-  bit-identical to the 0018/0019 baseline. LOW risk, additive to everything else.
-
-## (4) Recurrence floor-mover: bf16 SSM state - **BUILD (gated product call)**, and the bit-exact question
-
-Since the recurrence is at the f32 byte floor, the **only** lever on the dominant 191 ms (50% of the
-step) is narrowing the state-cache width to bf16, exactly as vLLM does.
- Store `ssm_states_all` in bf16; load bf16->f32 into `s_shard`, run ALL recurrence arithmetic in
-  f32 (UNCHANGED), store f32->bf16. 805.3 -> ~413 MB/call -> ~2.0-3.0 ms/call -> save **~45-95 ms/
-  step** -> step 384 -> **289-339 ms** = parity-to-ahead of vLLM (327 ms / 391 tok/s; projected
-  360-443 tok/s @npl128).
- **Bit-exact parity is UNREACHABLE on this term, by construction.** The f32 state bytes are
-  irreducible (single pass already), so matching vLLM's *speed* on the recurrence requires matching
-  vLLM's *width* (bf16). bf16 state is non-bit-exact vs llama's own f32 reference, but it is **equal
-  precision to vLLM** (vLLM's state cache is itself bf16). "Bit-exact parity with vLLM" was never on
-  the table - vLLM is the less-precise reference here. Gate the build on **KL < 1e-3 / PPL-delta**
-  over a 256-token greedy run, not on md5, with a `cparams` f32 fallback. The geometric state decay
-  (g<1) bounds per-step bf16 rounding, so accumulation is well-behaved.
- Bit-exact gains that ARE reachable (vs llama f32): the conv fusion (3) and the activation-fold
-  lever (1) - together ~9-11% - but they top out near ~93-96% of vLLM and never touch the 50%
-  recurrence term.
-
-## (5) Ranked build order + the single highest-value next step
-
-1. **Conv-state in-place fusion (BUILD NEXT - no-regret).** Bit-exact, LOW risk, saves ~12-14 ms/step (~+3%),
-   reuses the proven 0018/0019 in-place op pattern. Build this first because it is risk-free, purely
-   additive, and de-risks the in-place conv-cache plumbing the bf16 work also touches.
-   Confirming measurement: nsys decode trace shows `concat_cont` and `cpy_scalar` GONE, step
-   384 -> ~370-372 ms, and greedy md5 IDENTICAL to the 0019 baseline (dense text md5, MoE
-   byte-identical).
-2. **bf16 SSM state cache (HIGHEST-VALUE lever - gated product call).** The ONLY lever on the
-   dominant 50% recurrence term: saves ~45-95 ms/step, step -> 289-339 ms = parity-to-ahead of vLLM.
-   Non-bit-exact vs llama f32, equal precision to vLLM. Confirming measurement: `gated_delta_net_cuda`
-   duration/call drops 3.98 -> 2.0-3.0 ms in nsys; **KL < 1e-3 / PPL-delta vs the f32 build over
-   256-token greedy** passes; step time and tok/s hit the 289-339 ms / 360-443 tok/s band; cparams
-   f32 fallback verified.
-3. **Activation-op fold, design lever (1) (OPTIONAL, bit-exact, diminishing).** After the conv fusion
-   (#1) takes the conv/silu buckets, the residual fold (q/k l2norm + gate softplus/sigmoid +
-   gated-RMSNorm epilogue + launch overhead) is ~3-5%; bit-exact but bounded. Build only if the goal
-   is >90% of vLLM with
-   no bf16. Confirming measurement: per-op launch count for the GDN layer collapses to ~1; greedy
-   md5 unchanged.
-
-**Single highest-value next implementation step: bf16 SSM state cache (#2)** - it is the only change
-that moves the dominant 191 ms term and reaches vLLM parity-to-ahead. Its confirming measurement is
-the `gated_delta_net_cuda` per-call time dropping to ~2.0-3.0 ms AND the KL<1e-3 gate passing.
-**Recommended immediate build: the conv fusion (#1) first** (no-regret, bit-exact) so the bf16 work
-lands on an already-cleaned conv path; ship #2 as a `cparams`-gated, KL-validated product option.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/LEVER1_OPROJ_MMQ_RESULTS.md
@@ -1,77 +0,0 @@
-# Lever 1: gated-DeltaNet output-projection MMQ reshape (patch 0020)
-
-The single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models
-(arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). A two-line,
-bit-exact tensor reshape that re-routes the per-layer SSM output projection
-from a batch-1 FP4 GEMV (MMVQ) to a batch-128 tensor-core GEMM (MMQ).
-
-## The mechanism (profiled, both engines)
-
-Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391.
-The largest llama-specific overage was the gated-DeltaNet OUTPUT projection
-(ssm_out). The GDN op left its output in SSM layout and the graph reshaped it
-to 3D `[value_dim, n_seq_tokens=1, n_seqs=128]` before the ssm_out matmul, so
-`src1->ne[1] = 1`. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the
-128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does NOT amortize
-the ssm_out weight read across the 128 sequences. vLLM packs the same projection
-into a single M=128 GEMM. The in-projection was already fine (2D input -> MMQ);
-only the output projection was in 3D SSM layout.
-
-## The fix
-
-In the GDN output path of qwen35.cpp / qwen35moe.cpp / qwen3next.cpp, collapse
-the final GDN output to 2D `[value_dim, n_seq_tokens * n_seqs]` (= [6144, 128] at
-decode) BEFORE the ssm_out `ggml_mul_mat`, so `src1->ne[1] = 128` routes to the
-MMQ M=128 GEMM. The result is then already 2D `[n_embd, n_seq_tokens * n_seqs]`,
-so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a
-2D vs 3D view => bit-identical. MMQ on NVFP4 at this exact M=128 shape was already
-proven by the in-projection.
-
-```
--    ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
-+    ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
-     ...
-     cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
--    cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-```
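Why the view change alone flips the kernel can be shown with a toy dispatch rule. This sketch models only the one condition quoted above (`src1->ne[1] <= 8` selects MMVQ); the real ggml-cuda dispatch checks more than this, and the function names here are hypothetical.

```python
MMVQ_MAX_BATCH = 8  # threshold quoted in the notes: ne[1] <= 8 routes to MMVQ


def pick_matmul_path(src1_ne) -> str:
    """Toy model of the dispatch: only the batch dimension ne[1] is inspected."""
    return "MMVQ" if src1_ne[1] <= MMVQ_MAX_BATCH else "MMQ"


def view_3d(value_dim: int, n_seq_tokens: int, n_seqs: int):
    """Shape before the patch: the sequences sit in ne[2]."""
    return (value_dim, n_seq_tokens, n_seqs)


def view_2d(value_dim: int, n_seq_tokens: int, n_seqs: int):
    """Shape after the patch: tokens and sequences are collapsed into ne[1]."""
    return (value_dim, n_seq_tokens * n_seqs)


if __name__ == "__main__":
    before = view_3d(6144, 1, 128)
    after = view_2d(6144, 1, 128)
    print(before, "->", pick_matmul_path(before))
    print(after, "->", pick_matmul_path(after))
```

Both views cover the same 6144 x 128 contiguous elements, which is why the change is bit-identical while the selected kernel differs.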
-
-## Gates (all PASS)
-
- Bit-identical greedy (--temp 0 --seed 1, -n 200, llama-completion) vs the
-  post-SSM baseline build:
-  - dense q36-27b-nvfp4: md5 b90681a7728faadc44492b0bcd6181cc (IDENTICAL)
-  - MoE   q36-35b-a3b-nvfp4: md5 f37c7ca1edd752e3bd82e99b4e8744b6 (IDENTICAL)
- test-backend-ops MUL_MAT: OK ; MUL_MAT_ID: OK
- Coherent dense + MoE output (greedy text inspected).
-
-## decode_agg (llama-batched-bench, -fa on, -npp 128 -ntg 128 -npl 32,128 -c 33000)
-
-S_TG t/s (decode aggregate):
-
-| model            | npl | baseline | Lever 1 | delta   |
-|------------------|-----|----------|---------|---------|
-| dense q36-27b    |  32 |   170.52 |  200.00 | +17.3%  |
-| dense q36-27b    | 128 |   254.92 |  335.80 | +31.7%  |
-| MoE   q36-35b-a3b|  32 |   373.28 |  420.77 | +12.7%  |
-| MoE   q36-35b-a3b| 128 |   560.66 |  691.24 | +23.3%  |
-
-Dense @128: 335.80 t/s = 85.9% of vLLM 391 (target 82-85% HIT/exceeded;
-up from 65% post-SSM).
-
-## nsys (cuda_gpu_kern_sum, -npp 128 -ntg 24 -npl 128)
-
-The o_proj FP4 batch-1 GEMV bucket is eliminated and the work moves to MMQ M=128:
-
-| kernel                              | baseline           | Lever 1          |
-|-------------------------------------|--------------------|------------------|
-| mul_mat_vec_q<NVFP4, m=1> (o_proj)  | 132.8 ms / 48 inst | 0 ms / 0 inst    |
-| mul_mat_q<NVFP4, m=128>             | 5463 ms / 8800 inst| 5827 ms /10000 inst|
-
-The 132.8 ms o_proj GEMV bucket collapses to zero; mul_mat_q M=128 absorbs it
-(+1200 instances, +363 ms over the window), and its per-call average DROPS
-(620.8 us -> 582.7 us) because the added o_proj GEMMs are individually cheaper
-than the average projection GEMM. Realized o_proj-as-MMQ marginal cost
-~363.5 ms / 1200 = ~0.30 ms/call, versus the 2.77 ms/call (132.8 ms / 48) of the
-old GEMV: the amortized weight read is the win.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
@@ -1,143 +0,0 @@
-# Patch 0015 findings: expert-density-aware MoE token-tile auto-select
-
-The durable follow-up to patch 0014 (`MOE_TOKEN_TILE_CAP.md`): replace the blunt,
-opt-in `LLAMA_MOE_MMQ_X` global cap with a host-side, **default-on** density-aware
-`mmq_x` auto-select in `mul_mat_q_case`. Companion to
-`0015-paged-expert-density-aware-moe-token-tile-auto-select.patch`. Dev tree
-`~/llama-paged-dev` (branch `paged`), `build-cuda` sm_121.
-
-Primary model: **Qwen3.6-35B-A3B NVFP4** (`~/bench/q36-35b-a3b-nvfp4.gguf`),
-**256 experts, top-8**, expert FFN 512, GDN linear attention (SSM inner 4096),
-41 layers. This is a different beast from 0014's Qwen3-Coder-30B-A3B (128 experts,
-larger expert FFN, standard attention).
-
-## What it does (vs 0014)
-
-`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max` (= `ne12`,
-the per-expert column upper bound = token count) in one column-tile, i.e. stock
-**maximizes** the tile (128 on Blackwell). Applied per expert at MoE decode, where
-per-expert density is tiny, that 128-wide tile is mostly padding.
-
-Patch 0014 capped `mmq_x` globally on the ids path via `LLAMA_MOE_MMQ_X` (decode
-**and** prefill), which cost ~1.3% prefill. Patch 0015 instead estimates the
-per-expert density host-side, from args the ids path already passes:
-
-```
-ne_get_rows = ncols_dst   = ne12 * n_expert_used        (token-expert assignments)
-n_experts   = nchannels_x = ne02
-density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
-```
-
-and caps to the small tile (default 64) **only when `density <= density_max`**, so
-the high-density prefill ubatch keeps the big 128 tile. Prefill-safe by construction.
-No new kernel: the selection only lowers the loop's upper bound to an
-already-compiled, granularity- and shared-memory-validated `mmq_x`.
-
-## The threshold matters: `density_max = 8`, not `tile/4 = 16`
-
-The cap must fire for decode but not for a prefill ubatch. Each has per-expert
-density `n_tokens * n_used / n_experts`. At the standard `n_ubatch=512`, `n_used=8`:
-
-```
-                       128 experts   256 experts
-prefill ubatch (512)        32            16
-decode npl128 (128)          8             4
-```
-
-`tile/4 = 16` (0014's first auto-select draft default) **equals the 256-expert
-prefill density** and caps prefill: measured -2.0% to -2.9% S_PP on q36-35b-a3b.
-`density_max = 8` separates the two for every `n_experts` in `[128, 511]`: decode
-density is at most 8 (the cap fires) and prefill density is at least 9 (it does
-not), so it caps decode and leaves prefill on the big tile. This single
-default change is what makes the patch prefill-safe on the 256-expert model.
-
-## Measurements (default-on vs stock, median of 5 reps)
-
-`llama-batched-bench`, q36-35b-a3b-nvfp4.gguf, `-fa on -npp 128 -ntg 128`, GB10
-sm_121. STOCK = `LLAMA_MOE_AUTO_TILE=0` (exact stock selection); 0015 = default.
-
-```
-  npl   S_TG stock  S_TG 0015   dTG%     S_PP stock  S_PP 0015   dPP%
-    8      183.59     183.18  -0.22%        1489.2     1500.1  +0.73%
-   32      264.02     263.44  -0.22%        2034.5     2033.5  -0.05%
-   64      311.76     310.41  -0.43%        2028.3     2027.6  -0.03%
-  128      336.10     337.32  +0.36%        2025.0     2027.7  +0.13%
-```
-
-Raw npl128 reps: S_TG 0015 `[337.3, 336.9, 336.4, 338.9, 338.1]` vs stock
-`[336.2, 336.1, 335.9, 336.9, 335.8]` (distributions overlap); S_PP 0015
-`[2028.6, 2023.0, 2024.9, 2028.0, 2027.7]` vs stock `[2024.9, 2025.0, 2023.2,
-2029.4, 2029.0]`.
-
-### Honest read: neutral on this model
-
-On q36-35b-a3b the decode delta is **within run-to-run noise** (npl128 +0.36%,
-npl<=64 slightly negative) and prefill is **neutral** (within +/-0.75%, well inside
-the 1% target). The `+5%` decode target from the localmaxxing reference does **not**
-materialize here. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
-256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
-lever has nothing to bite on.
-
-### npl128 decode tile sweep confirms 64 is the only useful width
-
-`density_max=8` fixed, varying `LLAMA_MOE_DECODE_TILE`, S_TG @ npl128 vs stock:
-
-```
-  TILE8   TILE16  TILE32  TILE64  TILE96
- -6.31%   -3.18%  -0.17%  +0.70%  -0.76%
-```
-
-Smaller tiles are **worse**, not better: more column-tiles per expert = more
-grid/scheduling overhead, and the FP4-MMA has a minimum efficient width. So matching
-the tile to the literal density (4) is counterproductive; 64 is the sweet spot,
-same as 0014.
-
-## Why ship it default-on anyway
-
-1. **Removes 0014's prefill cost by construction.** The cap is density-gated, not
-   global, so prefill keeps its 128 tile (S_PP neutral above).
-2. **Banks the col-tile-bound gain for free.** At npl128 the auto-select picks
-   `tile=64` for a 128-expert model (decode density 8 <= 8), i.e. exactly 0014's
-   `cap64`, so it reproduces 0014's **+4.8% @npl128 on Qwen3-Coder-30B** without the
-   -1.3% prefill cost. (That model was unavailable to re-bench here; the tile choice
-   is identical by construction.)
-3. **Prefill-safe and decode-neutral on the SSM model**, so it is harmless where it
-   does not help.
-4. **Correctness-gated** by the P0 harness (below).
-
-## Conservative by design (known limitation)
-
-A pure-density gate cannot separate two cases with the **same** per-expert density:
-Qwen3-Coder npl256 decode (density 16) and the 256-expert prefill ubatch (density
-16) are identical to the estimator. `density_max=8` therefore **forgoes 0014's
-+2.3% @npl256** on the 128-expert model to keep 256-expert prefill safe. Recovering
-it needs an `ne12`-aware (absolute token count) gate in addition to density; scoped
-as future work, not implemented.
-
-## Knobs
-
-- `LLAMA_MOE_AUTO_TILE=0` : disable the auto-select, exact stock `mmq_x` selection.
-- `LLAMA_MOE_MMQ_X=<n>` (patch 0014) : **kept** as a manual override; when > 0 it
-  forces the old blunt global cap and bypasses the auto-select (explicit A/B knob).
-- `LLAMA_MOE_DECODE_TILE=<n>` : the small tile (default 64).
-- `LLAMA_MOE_DENSITY_MAX=<n>` : the density ceiling (default 8).
-
-## P0 correctness gate
-
-`tests/test-backend-ops` `test_mul_mat_id` is extended with a ragged small-M
-NVFP4/MXFP4 MoE decode-density block: 128 experts, top-8, m=768, k=2048, n in
-`{16,33,64,128,130,200,256,512}` spanning the cap boundary (n>=130 keeps the 128
-tile at `density_max=8`, n<=128 takes tile 64) and ragged token counts (experts with
-0/1/2 tokens, n not a multiple of the tile). All 16 shapes pass the CUDA-vs-CPU
-oracle on GB10 both default-on and with `LLAMA_MOE_AUTO_TILE=0`; full `MUL_MAT_ID`
-suite 2/2 backends OK. Off the ids path nothing changes (non-MoE `mul_mat`
-byte-identical to stock).
-
-## Verdict
-
-- Correct, prefill-safe, default-on density-aware tile select; the durable design
-  0014's own doc scoped. Supersedes 0014's global cap as the default path; the
-  `LLAMA_MOE_MMQ_X` knob is retained as a manual override.
-- **Net effect on q36-35b-a3b NVFP4: neutral** (decode within noise, prefill neutral)
-  because the model is SSM/bandwidth-bound, not col-tile-bound. The lever's real win
-  lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128), banked here at zero
-  prefill cost.
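
The density gate described in MOE_DENSITY_AUTO_TILE.md above can be sketched as follows. This is a minimal illustration of the estimator and threshold as the doc states them, not the patch's actual code: the function names are hypothetical, and the real selection also filters `mmq_x` candidates by granularity and shared memory, which is omitted here.

```python
def estimate_density(ne12, n_expert_used, n_experts):
    """Tokens per expert, per the doc's formula:
    ceil(ne_get_rows / min(ne_get_rows, n_experts))."""
    ne_get_rows = ne12 * n_expert_used  # token-expert assignments
    if ne_get_rows == 0:
        return 0
    divisor = min(ne_get_rows, n_experts)
    return -(-ne_get_rows // divisor)  # integer ceiling division


def select_tile_cap(ne12, n_expert_used, n_experts,
                    stock_max=128, decode_tile=64, density_max=8):
    """Upper bound for mmq_x: the small tile only when density is low."""
    density = estimate_density(ne12, n_expert_used, n_experts)
    if density <= density_max:
        return min(stock_max, decode_tile)
    return stock_max


# Decode at npl128 vs a 512-token prefill ubatch, both with top-8 routing.
for n_experts in (128, 256):
    decode = select_tile_cap(128, 8, n_experts)
    prefill = select_tile_cap(512, 8, n_experts)
    print(f"{n_experts} experts: decode cap {decode}, prefill cap {prefill}")
```

With the defaults, both expert counts cap decode at 64 and keep prefill at 128. The known limitation also shows up: `select_tile_cap(256, 8, 128)` returns 128, because npl256 decode on a 128-expert model has the same density (16) as the 256-expert prefill ubatch.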
--- a/backend/cpp/llama-cpp/patches/paged/MOE_GROUPED_GEMM_SCOPE.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_GROUPED_GEMM_SCOPE.md
@@ -1,220 +0,0 @@
-# Durable scope: grouped FP4-MMA MoE GEMM for ggml CUDA on GB10 (sm_121)
-
-Build-ready plan. **Not implemented in this workflow** (large kernel work). This
-document scopes the durable path to match or beat vLLM MoE grouped-GEMM efficiency
-on GB10 for the Qwen3-30B-A3B-class mxfp4 MoE, and records the single honest
-finding that re-shapes the whole effort.
-
-Hardware: NVIDIA GB10 (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
-LPDDR5X ~273 GB/s. Model: Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
-(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`,
-HEAD at patch 0013), `build-cuda` sm_121.
-
-## TL;DR (the honest reframe)
-
-**The grouped GEMM the mission scoped to build from scratch already exists in
-upstream ggml, and it already runs on GB10 for mxfp4.** For mxfp4 experts on
-sm_121 `ggml_cuda_should_use_mmq()` returns true (`turing_mma_available`), so
-MUL_MAT_ID takes the **grouped mmq path**, which already contains both vLLM
-building blocks:
-
-1. a moe_align / token-sort-by-expert (`mmid.cu` `mm_ids_helper`:
-   count -> warp-scan/cumsum -> scatter into expert-sorted contiguous buffers),
-2. a **single persistent stream-k grouped FP4-MMA GEMM** (one `mul_mat_q` launch;
-   grid flattened into kbc-continuous space over expert x col-tile x row-tile x
-   k-block; native FP4 MMA via `block_fp4_mmq` under `BLACKWELL_MMA_AVAILABLE`).
-
-The per-expert host-side row-gather loop in `ggml-cuda.cu`
-`ggml_cuda_mul_mat_id()` (~L2632-2790) - the path the mission's root-cause
-analysis describes as "the cliff" - is a **fallback only reached when
-`should_use_mmq()==false`** (f16/bf16 experts, non-Blackwell). It is **never the
-GB10 mxfp4 path.**
-
-Consequence: the "npl128 MoE cliff" does not exist on the current dev HEAD.
-Re-measured batched-bench decode (`S_TG` t/s) on the mxfp4 MoE rises monotonically
-`85 / 278 / 637 / 950 / 1306 / 1771` at npl `1 / 8 / 32 / 64 / 128 / 256`. The
-original `253/505/830/620` cliff was a real high-batch regression that has since
-been **fixed upstream** (FP4-native grouped mmq + MoE stream-k balancing), not a
-batched-bench artifact.
-
-**Therefore the durable work is NOT "port moe_align + a grouped GEMM."** It is a
-**surgical fix to the one place ggml diverges from vLLM: the M-tile (token-tile)
-sizing heuristic.** This document scopes that delta, plus the optional
-block-padded align, plus the parity gate and phased plan. It also records what is
-intentionally NOT built and why (the W4A16 occupancy wall).
-
-## The one structural gap: M-tile sizing
-
-`mul_mat_q_case` / `launch_mul_mat_q` pick `mmq_x` (the token/M tile) by
-**minimizing** `ntiles_x = ceil(ncols_max / mmq_x)` over the **aggregate** token
-count (`ncols_max = ne12`). On Blackwell `get_mmq_x_max = 128`, so the heuristic
-always selects the **largest** `mmq_x` that fits shared memory. vLLM's
-CUTLASS/Triton fused_moe does the **opposite**: a small tuned `BLOCK_SIZE_M`
-(typ. 16/32/64), padded **per expert**.
-
-ggml then applies its over-large `mmq_x` **per expert**. In MoE decode the number
-of tokens per expert is tiny. For Qwen3-30B-A3B (top-8 of 128): at npl64, ~512
-assignments over ~126 activated experts ~= 4 tok/expert; at npl128, ~1024 over
-~128 ~= 8 tok/expert. So each expert's single M-tile of width 128 is **3-6%
-filled**: ragged tiny-M tiles run a dense-GEMM-tuned config, wasting MMA
-M-throughput, and (with `need_check`) every expert runs as a masked partial tail.
-
-The FP4 MMA N-fragment (`tile_C::J`) is 8, so the **ideal M-tile ~= tokens/expert
-(~8)**, 16x smaller than the 128 ggml picks. This mismatch is the durable gap.
-
-Critically for GB10: at tokens/expert <= 8 there is exactly **one col-tile per
-expert**, so a smaller `mmq_x` causes **no extra weight re-read** (weight rows are
-re-read only across multiple col-tiles, of which there is one) while it **lowers
-shared-mem footprint and raises occupancy** - strictly aligned with the GB10
-occupancy lessons.
-
-## What already exists (reuse, do NOT rebuild)
-
-Engine files on DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`:
-
-- **[A] moe_align / scatter** = `mmid.cu` `mm_ids_helper`. One CUDA block per
-  expert (`gridDim.x = n_experts`); warp counts tokens routed to this expert,
-  warp-scan for the compaction index, scatters into `ids_src1` (column gather
-  permutation, expert-sorted contiguous), `ids_dst` (output scatter), and writes
-  `expert_bounds[expert] = prefix start`, `expert_bounds[n_experts] = total`.
-  This **is** count -> cumsum -> permute; `expert_bounds` is the analogue of
-  vLLM's `num_tokens_post_padded` boundaries. No `-1` pad today because segments
-  are exact (not block-padded).
-- **[B] persistent grouped FP4 GEMM** = `mmq.cuh` `mul_mat_q` stream-k
-  (kernel ~L3542, `process_tile` ~L3447, launch ~L3943, case-select ~L4055).
-  Single launch, fixed grid (`nsm` CTAs, or `ntiles` when >=90% tile efficiency).
-  Each CTA walks a contiguous `kbc` slice of (expert `zt` via `expert_bounds`,
-  col-tile `jt`, row-tile `it`, k-block) space; the weight row-tile (`mmq_y=128`
-  x K) is loaded once per col-tile in the `process_tile` k-loop; empty col-tiles
-  past `col_diff` are SKIPPED by advancing `kbc += blocks_per_ne00`; a
-  `stream_k_fixup` pass recombines split tiles.
-- **[C] native FP4-MMA expert weights** = `block_fp4_mmq` + `MMQ_MMA_TILE_X_K_FP4`
-  (== Q8_1 tile, skew-pad +4) under `BLACKWELL_MMA_AVAILABLE`;
-  `quantize_mmq_fp4_cuda` quantizes activations to the q8-style y-layout **with
-  the `ids_src1` gather fused** (one pass, no separate row-copy).
-
-Dispatch seam: `ggml-cuda.cu` `ggml_cuda_mul_mat_id()` (~L2632-2790). For mxfp4
-with `ne2` (tokens) > 7, `should_use_mmq()` -> true -> `ggml_cuda_mul_mat_q()`
-(`mmq.cu` id-branch ~L162-225) -> `mm_ids_helper`, then ONE
-`mul_mat_q_switch_type`. The per-expert host loop below it is the gated fallback.
-
-(Below npl8, MXFP4 mmid routes through `mmvq` - `MMVQ_MAX_BATCH_SIZE=8`, mmid max
-7 for turing_plus - which is fine for thin batch and out of scope here.)
-
-## What to add (the durable delta, priority order)
-
-### [1] Expert-aware M-tile selection (host-side only, zero new kernel)
-
-In `mul_mat_q_case` / `launch_mul_mat_q`, when `ids != null`, choose `mmq_x` from
-**per-expert density** (~`ne_get_rows / n_active_experts`, derivable cheaply, or
-capped via env) instead of minimizing `ntiles` over aggregate `ncols_max`.
-
-- `mmq_x` is a **compile-time template** (switch 8..128 step 8), so this is a pure
-  host-side SELECTION change - it picks a different already-compiled instantiation.
-  **Zero new kernel. Very low risk, high leverage.** Matches vLLM `BLOCK_SIZE_M`.
-- Doubles as near-term lever-1: env-gated `LLAMA_MOE_MMQ_X` cap at the knee.
-- GB10-aligned: smaller `mmq_x` -> smaller shared mem -> higher occupancy, and at
-  tokens/expert <= 8 (one col-tile/expert) it costs no extra weight read.
-
-This is the single highest-leverage change and the seed of the durable port.
-
-### [2] Block-padded moe_align (the moe_align_block_size port proper)
-
-Extend `mm_ids_helper` to pad each expert segment up to a multiple of the chosen
-block: write a sentinel (`-1`) `ids_dst` for pad lanes, put `expert_bounds` on
-block boundaries. Then every col-tile is **full**, which:
-
-- drops the `need_check` masking + per-expert partial-tail MMA,
-- makes the stream-k `kbc` space exact (no skipped tiles, cleaner persistent
-  schedule), removing the `col_diff` skip branch.
-
-Medium risk: touches the scatter, the `col_diff`/`need_check` logic, and the
-`write_back` masking (pad rows must not write output). This is the proper
-`moe_align_block_size` analogue and the durable second step.
-
-### [3] Bespoke masked-grouped FP4 kernel - ONLY if [1]+[2] insufficient
-
-A CUTLASS/DeepGEMM-style masked-grouped FP4 kernel. **Largest risk, likely
-unnecessary** given [B] is already a persistent stream-k grouped GEMM. Listed for
-completeness; do not start without [1]+[2] measured as insufficient.
-
-## Integration into ggml_mul_mat_id (dispatch seam + gated fallback)
-
-- The seam is unchanged: `ggml_cuda_mul_mat_id()` -> `should_use_mmq()` ->
-  `ggml_cuda_mul_mat_q()`. [1] and [2] live entirely inside the mmq id-branch
-  (`mmq.cu` ~L162-225) and its callees (`mmq.cuh` selection/launch, `mmid.cu`
-  scatter). No change to the host dispatch decision.
-- **Gated fallback preserved**: the existing per-expert host loop
-  (`should_use_mmq()==false` path) stays as-is for f16/bf16 experts and
-  non-Blackwell GPUs. The new selection only fires on the grouped path.
-- **Env gates** (off = exact current behavior):
-  - `LLAMA_MOE_MMQ_X=<8..128>` - cap/override the token tile for the id-path
-    (lever-1 + [1] manual knob).
-  - `LLAMA_MOE_BLOCK_ALIGN=0|1` - enable block-padded scatter ([2]).
-  Default both off until parity + throughput proven, then flip [1]'s
-  auto-selection on by default.
-
-## Correctness / parity gate
-
-Primary: `tests/test-backend-ops.cpp` `test_mul_mat_id` (~L4181). The CPU
-reference is **deterministic** - the op test must be **bit-exact**.
-
-- Sweep `type_a` in {`MXFP4`, `NVFP4`}, `type_b = F32`, `n_mats = 128`,
-  `n_expert_used = 8`, `n_tokens` in {8, 32, 64, 128} (the decode-density band).
-- **Add ragged small-M shapes** to the harness if absent (n_tokens not a multiple
-  of mmq_x; experts with 0/1/2 tokens) - these are exactly where [1]/[2] change
-  tile geometry and where block-pad masking can leak.
-- Pass criterion: new `mmq_x` selection and padded-align produce dst **identical**
-  to current op-test output (op test is exact; the GB10 CUDA greedy-decode
-  non-determinism band applies only to end-to-end, never to the op test).
-- End-to-end sanity: `llama-batched-bench` on `~/bench/qwen3coder-mxfp4.gguf`,
-  `-fa on -npp 128 -ntg 128`, npl 8/32/64/128/256; confirm `S_TG` stays monotonic
-  and `S_PP` flat ~3050-3090. Verify greedy-decode output within the documented
-  CUDA batch-shape non-determinism band (CPU is the deterministic oracle).
-
-Bench/parity scripts stay **dev-tree-only** (`~/llama-paged-dev/benches/`).
-
-## Phased plan, expected payoff, risk per phase
-
-| Phase | Work | Expected payoff | Risk |
-|-------|------|-----------------|------|
-| **P0** harness | Add ragged small-M + MXFP4/NVFP4 mmid shapes to `test_mul_mat_id`; capture current bit-exact baseline + the monotonic batched-bench curve as the reference. | None (gate). Locks correctness + the 85->1771 t/s baseline so any regression is caught. | Low. |
-| **P1** sort op | Confirm `mm_ids_helper` is the moe_align; if [2] is pursued, prototype the block-pad scatter behind `LLAMA_MOE_BLOCK_ALIGN`. | Enables exact stream-k schedule; removes `need_check` masking (P3 payoff). | Medium (scatter + write-back masking). |
-| **P2** grouped GEMM ([1]) | Expert-aware `mmq_x` selection in `mul_mat_q_case`/launch, `LLAMA_MOE_MMQ_X` gate. | The headline: reclaim the 3-6% M-tile fill waste at npl64-128. Modeled as removing wasted MMA M-throughput on every activated expert; net throughput up at high batch with no extra weight read. | **Low** (host-side template selection, no new kernel). |
-| **P3** tune ([2] + fixup) | Land block-padded align; tune `mmq_x` per density, profile stream-k `fixup` overhead and `mmq_x`/`mmq_y` tile choice with nsys on the grouped `mul_mat_q<MXFP4>` kernel. | Remove per-expert partial-tail MMA; tighten the persistent schedule. Diminishing vs P2; this is pure micro-efficiency toward/past vLLM's saturated grouped-GEMM. | Medium-high (kernel masking paths). |
-
-**Honest payoff framing:** the npl128 "cliff" is already gone on HEAD, so there is
-no broken path to unlock. The durable win is **matching vLLM's saturated
-grouped-GEMM M-tiling** (small per-expert block) and erasing the dense-GEMM-tuned
-M-tile mismatch - a micro-efficiency gain at large effective batch, not a
-step-change. vLLM 0.23.0 cannot even serve this model on GB10 (bf16 MoE-warmup
-hang + hard reboot; GGUF loader can't map fused qwen3moe experts), and llama
-already uses the same sorted-grouped-GEMM algorithm, so structural parity is
-**already met**; this closes the residual kernel micro-gap.
-
-## The biggest risk: the GB10 W4A16 occupancy wall
-
-The dominant risk is **repeating the W4A16 dead-end** that hit only ~9 TFLOPS /
-178 t/s on GB10. GB10 is **occupancy-dominated**: deep `cp.async` pipelines and
-XOR-swizzle shared layouts **collapse occupancy** there. Any P3 kernel work MUST:
-
-- keep **small shared mem + high occupancy** (do NOT add deep `cp.async` stages
-  or XOR-swizzle - they are exactly what killed W4A16);
-- preserve the **skew-pad (+4)** tile layout already in `MMQ_MMA_TILE_X_K_FP4`;
-- stay on the **FP4-MMA path** (`block_fp4_mmq`), the only path that hits Blackwell
-  FP4 = 2x INT8/BF16 rate;
-- respect the ~273 GB/s LPDDR5X weight-read floor (dense decode is already at it;
-  MoE wins come from occupancy/tile fit, not bandwidth).
-
-Smaller `mmq_x` ([1]) is **strictly consistent** with these lessons: it reduces
-shared-mem footprint, raises occupancy, and at tokens/expert <= 8 adds no weight
-re-read. So the low-risk lever ([1]) is also the one most aligned with what GB10
-rewards - which is why it leads the plan and [3] is gated behind it.
-
-## Commit / hygiene
-
-Scope doc only (this file). No engine change committed in this workflow. Bench and
-parity scripts are dev-tree-only. Commit with `git commit -s`, trailer
-`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes.
-Do not push (human pushes). When [1]/[2] are implemented they mirror to
-`backend/cpp/llama-cpp/patches/paged/0014-*` (next free slot).
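
The count -> cumsum -> scatter structure that MOE_GROUPED_GEMM_SCOPE.md attributes to `mm_ids_helper` can be illustrated on the host. This is a simplified sketch, not the CUDA kernel: the function name `moe_align` is hypothetical, and the exact meaning of the `ids_src1` / `ids_dst` entries (here: source token index and flat token-slot output index) is an assumption made for illustration.

```python
def moe_align(ids, n_experts):
    """ids[t] lists the experts token t routes to.

    Returns (expert_bounds, ids_src1, ids_dst) where assignments are
    stored contiguously, sorted by expert, with exact (unpadded) segments.
    """
    n_used = len(ids[0]) if ids else 0

    # 1. count: assignments per expert.
    counts = [0] * n_experts
    for row in ids:
        for expert in row:
            counts[expert] += 1

    # 2. cumsum: expert_bounds[e] is the start of expert e's segment,
    #    expert_bounds[n_experts] is the total number of assignments.
    expert_bounds = [0] * (n_experts + 1)
    for expert in range(n_experts):
        expert_bounds[expert + 1] = expert_bounds[expert] + counts[expert]

    # 3. scatter: place every (token, slot) pair into its expert's segment.
    cursor = expert_bounds[:-1]  # slicing copies; next free position per expert
    total = expert_bounds[-1]
    ids_src1 = [0] * total  # which token row to gather as input
    ids_dst = [0] * total   # where the result row is written back
    for token, row in enumerate(ids):
        for slot, expert in enumerate(row):
            pos = cursor[expert]
            cursor[expert] += 1
            ids_src1[pos] = token
            ids_dst[pos] = token * n_used + slot
    return expert_bounds, ids_src1, ids_dst


# Three tokens, top-2 routing over four experts (expert 3 receives nothing).
bounds, src, dst = moe_align([[0, 2], [2, 1], [0, 2]], n_experts=4)
print(bounds, src, dst)
```

An expert with no tokens yields an empty segment (`expert_bounds[e] == expert_bounds[e + 1]`), which is the case the ragged shapes in the parity gate are meant to exercise. The block-padded variant in [2] would round each segment up to the tile width and mark pad lanes with a sentinel.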
--- a/backend/cpp/llama-cpp/patches/paged/MOE_QUANT_DEDUP_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_QUANT_DEDUP_RESULTS.md
@@ -1,71 +0,0 @@
-# MOE_QUANT_DEDUP_RESULTS.md - patch 0023 (qwen35moe NVFP4 activation-quantize de-dup)
-
-Bit-exact MoE decode/prefill lever. Built + measured on DGX GB10 (sm_121a) on top of HEAD
-8a3229f (patch 0022). Companion analysis: NONRECURRENCE_BITEXACT.md (section "nonrec-build").
-
-## What
-
-ggml `mul_mat_id` quantizes the EXPERT-GATHERED activation rows: it allocates
-`ne11_flat = ne12 * n_expert_used` rows and quantizes each via `quantize_mmq_nvfp4(..., ids_src1)`.
-For the broadcast up/gate projections the activation is the per-token hidden state, the SAME for
-every expert that token routes to (`ne11 == 1`). So the stock path re-quantizes each token
-`n_expert_used` times (8x for the top-8 q36-35b-a3b).
-
-`quantize_mmq_nvfp4` computes each `block_fp4_mmq` as a pure per-thread function of its 16
-consecutive inputs (per-thread amax, the +/-2 ue4m3 search, the e2m1 packing - NO cross-thread
-shfl/reduction). So the quantized block for a given token is byte-identical no matter which
-expert slot it lands in.
-
-## Lever
-
-When `ne11 == 1` (broadcast up/gate):
-1. Quantize the `ne12` UNIQUE token activations once into a compact buffer
-   (`quantize_mmq_fp4_cuda(src1_d, nullptr, ..., ne12, 1, 1)`, row stride `s12`).
-2. Gather the `block_fp4_mmq` rows into the expert-gathered layout keyed by `ids_src1`
-   (`gather_mmq_fp4`): `block_fp4_mmq == 9 * uint4 == 144 B`, copied with a coalesced uint4
-   kernel whose output is written fully contiguously (`gathered[t] = unique[ib_u*9 + w]`).
-
-Pure byte copy of identical blocks => the gathered buffer is byte-for-byte identical to
-re-quantizing each gathered row. The MMQ GEMM is UNTOUCHED. `down_proj`
-(`ne11 == n_expert_used`, distinct per expert) keeps the stock re-quantize path.
-
-The first gather draft (one thread copies one 144 B struct, scattered) was uncoalesced and cost
-478 ms - it ate 84% of the quantize saving and decode stayed flat. The shipped coalesced-uint4
-gather costs 32 ms.
-
-## Measurements (MoE = q36-35b-a3b-nvfp4, dense = q36-27b-nvfp4; -fa on, -npp 128 -ntg 128)
-
-nsys decode-isolated (`--cuda-graph-trace=node`, npp8 ntg128 npl128), per-run kernel sums:
-| kernel                | dedup off | dedup on                         |
-|-----------------------|-----------|----------------------------------|
-| quantize_mmq_nvfp4    | 868 ms    | 457 ms                           |
-| gather_mmq_fp4        | -         | 32 ms                            |
-| net quantize path     | 868 ms    | 489 ms (-379 ms decode GPU-time) |
-| gated_delta_net (50%) | unchanged | unchanged                        |
-| mul_mat_q<NVFP4>      | unchanged | unchanged                        |
-
-Decode S_TG (t/s), back-to-back same-build A/B (default-on vs GGML_CUDA_MOE_QUANT_DEDUP=0):
-| model           | npl32 off->on          | npl128 off->on               |
-|-----------------|------------------------|------------------------------|
-| MoE q36-35b-a3b | 440.3 -> 442.8 (+0.6%) | 745.2 -> 758.1 (+1.73%)      |
-| dense q36-27b   | 207.4 -> 206.9 (flat)  | 373.28 -> 373.24 (byte-flat) |
-
-Prefill: MoE T_PP 7.69 -> 7.38 s (~ -4% time). Dense unaffected (no `mul_mat_id`).
-
-## Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022)
-
-| model            | md5 (default on)                     | == 0022 |
-|------------------|--------------------------------------|---------|
-| q36-27b-nvfp4    | 5951a5b4d624ce891e22ab5fca9bc439     | yes (dense untouched) |
-| q36-35b-a3b-nvfp4| 07db32c2bcb78d17a43ed18bc22705cd     | yes (on == off == 0022) |
-
-test-backend-ops: MUL_MAT 1115/1115, MUL_MAT_ID 805/805 (default on).
-
-## Knob
-
-On by default. `GGML_CUDA_MOE_QUANT_DEDUP=0` restores the stock per-expert re-quantize path
-(byte-identical output, used as the A/B baseline).
-
-Commits: DGX dev tree f7409c2; worktree patch `0023-qwen35moe-nvfp4-quant-dedup.patch`.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
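
The bit-exactness argument in MOE_QUANT_DEDUP_RESULTS.md rests on one property: each quantized block is a pure function of its own 16 inputs, so quantizing once and copying gives the same bytes as quantizing every gathered row. The sketch below demonstrates that property with a toy per-block quantizer. The quantizer is a stand-in chosen for illustration and is not the NVFP4 format; all function names are hypothetical.

```python
BLOCK = 16  # elements per quantization block, as in the doc


def toy_quantize_row(row):
    """Toy block quantizer: each block depends only on its own 16 inputs."""
    blocks = []
    for start in range(0, len(row), BLOCK):
        block = row[start:start + BLOCK]
        amax = max(abs(x) for x in block)
        scale = amax / 7.0 if amax > 0 else 1.0
        blocks.append((scale, tuple(round(x / scale) for x in block)))
    return tuple(blocks)


def stock_path(activations, ids_src1):
    """Re-quantize every expert-gathered row (one call per assignment)."""
    return [toy_quantize_row(activations[t]) for t in ids_src1]


def dedup_path(activations, ids_src1):
    """Quantize each unique token once, then gather the quantized rows."""
    unique = [toy_quantize_row(row) for row in activations]
    return [unique[t] for t in ids_src1]


# Three tokens with 32 features each; six expert-sorted assignments.
activations = [[float((t * 31 + i * 7) % 13 - 6) for i in range(32)]
               for t in range(3)]
ids_src1 = [0, 2, 1, 0, 1, 2]

identical = stock_path(activations, ids_src1) == dedup_path(activations, ids_src1)
print("gathered buffers identical:", identical)
```

Here the stock path runs the quantizer six times and the dedup path three times. The argument would not hold for a quantizer whose block output depended on neighbouring rows or on a cross-thread reduction, which is why the doc points out that `quantize_mmq_nvfp4` has none.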
--- a/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
@@ -1,99 +0,0 @@
-# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)
-
-Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
-`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
-Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
-(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
-`build-cuda` sm_121.
-
-## Headline (honest): there is no npl128 cliff to erase on this build
-
-The mission premise was a 25% decode drop at npl128 (batched-bench 253/505/830/620
-@ npl 8/32/64/128). It does **not** reproduce. Stock decode is monotonic:
-
-```
-llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
-  npl        1     8    32    64   128   256
-  stock     85   282   629   935  1295  1779     <- monotonic, no knee
-```
-
-The old cliff was a real high-batch regression since fixed upstream: mxfp4 MoE
-decode on GB10 already takes the sorted grouped FP4-MMA GEMM (MUL_MAT_ID ->
-`ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper` moe_align/scatter + one
-persistent stream-k `mul_mat_q`), i.e. vLLM's algorithm. See
-`MOE_GROUPED_GEMM_SCOPE.md`.
-
-## What the knob does
-
-`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
-(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
-column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
-(top-8/128 => ~1/16 of `ne12`), so each expert's `mmq_x`-wide col-tile is only
-~6% filled: the MMA accumulator tile is `mmq_x`-wide at compile time and wastes
-throughput on the padding columns, and the larger y-tile lowers occupancy.
-
-`LLAMA_MOE_MMQ_X=<n>` caps `mmq_x` on the MUL_MAT_ID path only
-(`expert_bounds != nullptr`). It only lowers the selection-loop upper bound and
-still chooses from the same granularity/shared-memory-validated `mmq_x` set stock
-already uses for smaller batches - no new kernel configuration. Default
-(unset/<=0) = disabled => byte-identical to stock.
-
-## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)
-
-Decode throughput, S_TG t/s:
-
-```
-  npl     stock   cap16   cap32   cap64
-   1       85      85      85      85
-   8      282     280     282     282
-  32      629     623     629     628
-  64      935     915     949     934
- 128     1295    1204    1344    1357     <- cap64 +4.8% (cap16 -7%)
- 256     1779    1370    1723    1820     <- cap64 +2.3% (cap16 -23%)
-```
-
-Prefill throughput, S_PP t/s (the cost):
-
-```
-  npl     stock   cap16   cap32   cap64
- 128     3083    1817    2559    3038
- 256     3084    1818    2560    3046
-                 -41%    -17%    -1.3%
-```
-
-Reproducibility (interleaved off/cap64, two reps each):
-
-```
-  npl    off rep1/rep2   cap64 rep1/rep2
-  128    1300 / 1290     1357.5 / 1357.0
-  256    1786 / 1782     1826.3 / 1824.5
-```
-
-cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
-
-## Why 64 is the only value that helps net
-
-A 512-token prefill ubatch routes ~32 tokens/expert. cap16/cap32 force those into
-16/32-wide tiles, overflowing into extra col-tiles + weight re-reads -> prefill
-craters (-41% / -17%). cap64 still holds the prefill density in one tile (32 < 64)
-so prefill is near-neutral (-1.3%), while decode (~8 tokens/expert at npl128) gets
-the fuller, higher-occupancy tile.
-
-## Verdict
-
-- Real but **modest** high-effective-batch DECODE micro-optimization
-  (+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
-- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
-  continuous batching already scales). Shipped as an opt-in, default-off knob;
-  recommended value 64 for decode-heavy high-concurrency deployments.
-- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
-  for single-stream generation and stays coherent; thousands of capped MoE
-  matmuls at npl128/256 ran with no CUDA error / NaN.
-
-## Durable follow-up (scoped, not implemented)
-
-Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
-from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
-small tile while prefill keeps its large tile automatically (removes the ~1.3%
-prefill cost). Plus the block-padded `moe_align` in `mm_ids_helper`. See
-`MOE_GROUPED_GEMM_SCOPE.md`.
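
The effect of `LLAMA_MOE_MMQ_X` described in MOE_TOKEN_TILE_CAP.md (lowering only the upper bound of the tile-selection loop) can be sketched as below. This is a simplified model under stated assumptions: candidates are multiples of 8 up to the cap, the loop keeps the first width that strictly reduces the column-tile count, and the granularity and shared-memory checks of the real selection are omitted. The function name is hypothetical.

```python
def select_mmq_x(ncols_max, mmq_x_max=128, step=8):
    """Pick the token-tile width that minimizes the number of column tiles.

    Returns (mmq_x, ntiles_x). Lowering mmq_x_max models the env cap.
    """
    best_x, best_tiles = 0, None
    for mmq_x in range(step, mmq_x_max + 1, step):
        ntiles_x = -(-ncols_max // mmq_x)  # ceil(ncols_max / mmq_x)
        if best_tiles is None or ntiles_x < best_tiles:
            best_x, best_tiles = mmq_x, ntiles_x
        if best_tiles == 1:
            break  # one tile already covers every column
    return best_x, best_tiles


# Decode at npl128 (ncols_max = 128) and a 512-token prefill ubatch.
for label, ncols in (("decode npl128", 128), ("prefill 512", 512)):
    stock = select_mmq_x(ncols)
    cap64 = select_mmq_x(ncols, mmq_x_max=64)
    print(f"{label}: stock {stock}, cap64 {cap64}")
```

In this model the cap never introduces a width that stock could not already select for a smaller batch, which matches the doc's point that no new kernel configuration is needed.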
--- a/backend/cpp/llama-cpp/patches/paged/NONRECURRENCE_BITEXACT.md
+++ b/backend/cpp/llama-cpp/patches/paged/NONRECURRENCE_BITEXACT.md
@@ -1,323 +0,0 @@
-# NONRECURRENCE_BITEXACT.md - bit-exact non-recurrence decode levers (label nonrec-design, READ-ONLY, no GPU)
-
-Post-0022 the gated-DeltaNet recurrence is at 84.6% BW = 102.6% of vLLM (3.488 ms/call), past parity.
-The remaining ~5% to vLLM lives in the non-recurrence path. Per the node-level decode trace (nsys
-`--cuda-graph-trace=node`, clean build, q36-27b-nvfp4 dense, npl128) the decode step is ONE replayed
-CUDA graph, ALL kernels on a SINGLE stream (stream 14), strictly serial, 99.94% GPU-busy, 0.06% idle.
-That single-stream-99.94%-busy fact is load-bearing for everything below: there is NO overlap, so any
-kernel GPU-time genuinely removed (or any kernel folded away) cuts wall-clock 1:1; and conversely, if a
-"faster kernel" leaves wall-clock flat, then the kernel did NOT actually get faster at the decode shape.
-
-Post-recurrence-fix kernel mix of the ~367 ms decode step (was 380.4 pre-0022; recurrence now smaller):
-- `mul_mat_q` FP4 GEMM (496 calls/step) ~24% (the biggest non-recurrence bucket)
-- `quantize_mmq_nvfp4` (496/step) ~4.5%
-- `nvjet` lm_head GEMM ~3.1%
-- `flash_attn_ext_f16` (16 attn layers) ~3.1%
-- elementwise glue: k_bin_bcast (gate mul+add) ~1.7%, unary_gated silu/sigmoid ~1.4%, rms_norm ~0.9%,
-  l2_norm ~0.2%, plus conv-state concat_cont/cpy (Lever-1 territory, not in this scope).
-
-Files read on the DGX 0022 tree (HEAD 8a3229f): `mmq.cuh`, `mmq.cu`, `quantize.cu`, `gated_delta_net.cu`,
-`fattn.cu`, `fattn-common.cuh`.
-
---
-
-## RESOLUTION of the P2a puzzle (load-bearing) - mmq_y=64 / minblocks: bit-exact but FLAT on decode
-
-The existing P2a machinery is two NVFP4-gated, default-stock flags in `mmq.cuh`:
-- `GGML_CUDA_FP4_MMQ_Y` (L143-163): overrides the weight-row N-tile `mmq_y` 128 -> 64/96 for NVFP4 on
-  Blackwell. mmq_y tiles N (output rows); each weight row lives in exactly one row-tile, so total weight
-  traffic is unchanged. **Bit-exact**: the per-output K-reduction is the `for frag` loop in
-  `vec_dot_fp4_fp4_mma` (L1097-1108, `sum[...] += C.x[l]`), whose order is independent of mmq_y. md5-
-  verified in prior runs (1115/805 gate, byte-identical).
-- `GGML_CUDA_FP4_MINBLOCKS` (L205-216): raises the `__launch_bounds__` min-blocks operand (L3579-3585)
-  for NVFP4 so >1 CTA co-resides per SM. **Bit-exact**: register allocation / occupancy cannot change
-  results.
-
-The paradox restated: P2a made a standalone `mul_mat_q<NVFP4,m=128>` -24.7% faster (bit-exact), yet
-decode was FLAT (335->336 post-0020). The trace says decode is 99.94% single-stream busy and mul_mat_q
-is ~24% of it, so a -24.7% cut should give ~+6%. RESOLUTION (airtight, from the single-stream fact):
-
-> On a 99.94%-busy single stream, freed kernel GPU-time MUST lower the wall 1:1. Decode is flat =>
-> mmq_y=64 did NOT free per-call GPU-time at the DECODE shapes => the -24.7% was measured at a
-> NON-decode shape (a single large-N or prefill-M GEMM that runs enough waves to reach asymptotic
-> throughput). There is no contradiction; the two measurements are at different GEMM shapes.
-
-Mechanism (grounded in the launch path, `launch_mul_mat_q` L3989-4088): decode runs ONE `mul_mat_q` per
-weight with mmq_x=128 fused tokens => ntx=1, and the grid is `nty = N / mmq_y` CTAs (xy-tiling, or
-stream-k at nsm=48 when `tiles_efficiency_percent < 90`, L4044-4047). The 496 decode GEMMs have small N:
- FFN up/gate N=17408 -> nty=136 CTAs (mmq_y=128) = ceil(136/48)=3 waves, last wave 40/48=83% full
- FFN down / qkv / o-proj N~5120-6144 -> nty=40-48 CTAs = 1 wave (and eff<90 => stream-k at 48 CTAs)
-
-So EVERY decode GEMM is a 1-3 wave, 40-136 CTA kernel: it is **ramp + tail (wave-quantization) bound**,
-dominated by the first-wave weight-load latency before any MMA can start plus the fractional last wave -
-NOT by steady-state occupancy. mmq_y=64 doubles the grid (272 CTAs, 6 waves for the fat FFN) which only
-helps the ASYMPTOTIC achieved-BW the microbench measures; at 1-3 waves there is no steady state for it
-to act over, and each CTA now carries half the arithmetic-per-weight-load so the ramp is relatively MORE
-exposed. minblocks=2 is worse: the FP4 MMA is register-bound at ~255 regs/thread (the `(256,1)` bound),
-so forcing 2 CTAs/SM register-caps to ~128 regs => heavy spill => net-negative. Both are the in-wave
-occupancy lever, and the decode GEMM has no in-wave occupancy problem - it has a too-few-waves problem.
-
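The wave arithmetic in the mechanism paragraph can be reproduced with a minimal sketch. It assumes a plain xy-tiled grid of `N / mmq_y` CTAs, one resident CTA per SM per wave, and 48 SMs; the stream-k path is ignored, and `wave_stats` is a hypothetical helper name, not something in the tree.

```python
import math

def wave_stats(n_rows: int, mmq_y: int, n_sm: int = 48):
    """Return (ctas, waves, last_wave_fill) for a grid of ceil(n_rows / mmq_y) CTAs.

    Assumption: one CTA per SM per wave; stream-k is not modelled.
    """
    ctas = math.ceil(n_rows / mmq_y)
    waves = math.ceil(ctas / n_sm)
    last_wave_ctas = ctas - (waves - 1) * n_sm
    return ctas, waves, last_wave_ctas / n_sm

# Decode shapes quoted above: fat FFN (N=17408) and the thin GEMMs (N=5120..6144).
for n_rows in (17408, 6144, 5120):
    for mmq_y in (128, 64):
        ctas, waves, fill = wave_stats(n_rows, mmq_y)
        print(f"N={n_rows:6d} mmq_y={mmq_y:3d}: {ctas:3d} CTAs, "
              f"{waves} waves, last wave {fill:.0%} full")
```

Halving `mmq_y` doubles the CTA count, but at these sizes the kernel still only runs a handful of waves, which is the point the paragraph makes.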
-VERDICT: re-test P2a (mmq_y=64, and 96) and minblocks=2 ON TOP of 0022 because it is a FREE one-build
-re-test (flags already exist, default stock). **Design prediction: still ~flat (maybe +1-2% from the
-one fat-FFN N=17408 GEMM that has 3->6 waves of room; ~0% from the 1-wave thin GEMMs).** The decisive
-measurement for the reprofile agent is NOT a standalone microbench - it is the PER-CALL `mul_mat_q`
-GPU-time at the REAL decode shapes (the 496 calls), flag on vs off, summed. If per-call decode time
-drops, it ships (free bit-exact win). If per-call decode time is ~unchanged (predicted), the -24.7%
-was a large-N artifact and the GEMM has no bit-exact occupancy lever - confirming the structural wall.
-
-WHY the decode GEMM has no high-value bit-exact lever: its bottleneck is wave-quantization at a small
-grid. The only knobs that change the grid are (a) mmq_y-down [bit-exact, flat per above], (b) mmq_x-down
-[FORBIDDEN: re-reads the 18 GB weights ntiles_x times, strictly worse, and breaks the one-read invariant], (c) the
-stream-k-vs-tiling threshold [FORBIDDEN for bit-exactness: stream-k splits each output tile's K-sum
-across CTAs and re-adds via the fixup kernel - a DIFFERENT K-accumulation order than one-CTA-full-K
-tiling, so flipping the L4047 threshold changes which path a GEMM takes and breaks md5 vs the 0022
-baseline]. So at the bandwidth/wave-quant floor for these tiny grids, 3% FP4 efficiency is structural;
-no order-preserving change moves it.
-
---
-
-## RANKED bit-exact non-recurrence levers
-
-Ranked by expected bit-exact decode gain. "Bit-exact-safe" = keeps the exact reduction/FMA order; the
-gate is md5-identity to llama 0022 f32 output on both models (dense + MoE), greedy temp0.
-
-### 1. Quantize producer-fold (Track A) - bit-exact-safe - ceiling 4.5%, realistic ~2-2.5%
-Fold `quantize_mmq_nvfp4` (4.5%, ~17 ms, 496/step) into the PRODUCER epilogue (the rms_norm / silu that
-emits each GEMM's activation), so the f32 activation is quantized to `block_fp4_mmq` directly from the
-producer's registers instead of being written to HBM as f32 and re-read by a standalone quantize kernel.
- **Bit-exactness: SAFE, and unusually clean.** `quantize_mmq_nvfp4` (quantize.cu:78-171) computes
-  `amax_raw` PER-THREAD over the thread's own QK_NVFP4_SUB=16 values (L108-118) with NO cross-thread
-  shfl/reduction (unlike `quantize_mmq_q8_1` which does a warp shfl_xor). Each thread independently runs
-  the +/-2 ue4m3 scale search (L120-150) and `ggml_cuda_float_to_fp4_e2m1` packing (L155-166). So the
-  output block is a pure per-thread function of its 16 inputs. Copy that arithmetic VERBATIM into the
-  producer epilogue and the `block_fp4_mmq` bytes are identical => md5-safe. The only requirement is the
-  producer thread-layout owns contiguous 16-element K-sub-blocks (feasible for an rms_norm/silu epilogue).
- **Expected gain:** the win is removing the standalone kernel's f32 activation READ (the producer already
-  holds the f32); the quant compute + fp4 write still happen (now folded). So ~the read-half of the 17 ms,
-  ~2-2.5% of the step, and it is REAL because the step is single-stream 99.94% busy (no overlap to hide
-  the removed kernel).
- **Trap / caveat:** the SPENT "Lever-2" was a DIFFERENT fusion (quantize -> GEMM *consumer* prologue,
-  measured net-zero because the GEMM still reads the same activation bytes). Track A is the *producer*
-  fold and removes a true f32 round-trip, so it is not subject to that flatness - but it needs real
-  producer-kernel surgery + the frozen `block_fp4_mmq` ABI (mmq.cuh:53), more plumbing than the others.
- Ranked #1: largest cleanly-bit-exact non-GEMM bucket, no reduction trap (per-thread quant).
-
-### 2. Activation / op fold - POINTWISE subset only - bit-exact-safe - realistic ~1.5-2.5%
-Fold the pure pointwise glue off the single-stream chain into the adjacent kernel's epilogue/prologue:
-the GDN residual ADDs and gate MULs (`k_bin_bcast`, ~1.7%), the `silu`/`sigmoid` (`unary_gated`, ~1.4%,
-the part that is the output gate, not FFN), and the post-GDN gate MUL after the output rms_norm.
- **Bit-exactness: SAFE for the pointwise ops only.** Add/mul/silu/sigmoid are elementwise fp32 with the
-  same formula and the same op order whether standalone or folded => byte-identical. This is the bit-exact
-  half of the prior Lever-3 design.
- **THE TRAP (FORBIDDEN half):** the `rms_norm`/`l2_norm` REDUCTIONS must NOT be re-folded with a
-  different reduction tree. The standalone `l2_norm_f32<32>`/`rms_norm_f32` use a specific warp/block
-  reduction; folding the norm into a kernel with a different `warp_reduce_sum` width or eps placement
-  (`x*rsqrt(sumsq+eps)` vs `x/max(sqrt(sumsq),eps)`) changes the last ULP => breaks md5. Fold the MUL that
-  FOLLOWS the norm (pointwise, safe); do NOT fold the norm's reduction. (This is the direct analog of the
-  f32x4 lane-remap trap that blocked the recurrence's vectorized state loads: any change to a reduction's
-  grouping is forbidden.)
- **Expected gain:** ceiling ~3.3% (the Lever-3 slice), realistic ~1.5-2.5% once the norm reductions are
-  excluded. Real (single-stream, no overlap), bounded, lower plumbing than #1 (no new ABI).
- Ranked #2: smaller than #1 and the high-value pieces (norms) are off-limits.
-
-### 3. mul_mat_q occupancy retune (existing P2a: mmq_y=64/96, minblocks=2) - bit-exact-safe - ~FLAT
-See the P2a resolution above. Bit-exact-safe (N-tiling / register-cap preserve the K-reduction order;
-md5-verified). Design prediction FLAT on decode (decode GEMMs are 40-136 CTA, 1-3 wave, ramp/tail-bound;
-the -24.7% was an asymptotic large-N number). **Worth the one-build re-test only because it is free**
-(flags exist, default stock). Possible marginal +1-2% from the single N=17408 fat-FFN GEMM (3->6 waves).
-Measure PER-CALL decode-shape `mul_mat_q` time, not a microbench. Ranked #3: zero plumbing, but low/zero
-expected gain - it is the diagnostic that confirms the GEMM wall is structural, not a shippable lever.
-
-### 4. Attention occupancy (flash_attn_ext_f16) - NO bit-exact lever - NO-GO
-`flash_attn_ext_f16` is ~3.1% (11.67 ms, 16 attn layers), grid 48 CTAs = exactly ONE full wave on 48
-SMs (trace). There is no occupancy headroom (already 1 wave, perfectly filled, no tail) and no in-wave
-under-occupancy to fix. The only knobs that change the attention grid are split-KV / parallel_blocks /
-a different KV-tile (the `ncols1`/`ncols2`/`cols_per_block` selection in `fattn.cu`), and EVERY one of
-them changes the online-softmax running-max/sum RESCALING ORDER across KV blocks => NOT bit-exact
-(forbidden, the softmax-rescale analog of the reduction-tree trap). At 3.1% with one full wave the
-attention is effectively at floor. Ranked last: no bit-exact lever exists; do not pursue.
-
---
-
-## FORBIDDEN levers (require a precision or accumulation-order change - excluded by the gate)
- Stream-k vs plain-tiling threshold flip for the GEMM wave-quant tail: splits + re-adds the K-sum across
-  CTAs => different f32 accumulation order than one-CTA-full-K tiling => breaks md5.
- Vectorized / lane-remapped tile loads in the GEMM (`load_tiles_nvfp4_nvfp4` / `load_ldmatrix`): any
-  remap of which lane holds which K-element changes the MMA fragment->accumulator mapping => can change
-  the per-output sum grouping => forbidden (the f32x4 lane-remap trap, same class that blocked the
-  recurrence's vectorized state loads).
- mmq_x-down at dense decode: re-reads the 18 GB weights `ntiles_x` times. Order-preserving but strictly
-  slower and breaks the one-read invariant; not a lever.
- Folding rms_norm / l2_norm with a different reduction tree or eps placement: last-ULP change => md5 break.
- flash-attn split-KV / KV-retile: changes the online-softmax rescale order => not bit-exact.
- bf16 state / bf16 anything: precision change, SHELVED, forbidden by the gate.
-
---
-
-## One-line summary for the parent
-The remaining non-recurrence decode gap has NO single big bit-exact lever. The largest cleanly bit-exact
-win is the **quantize producer-fold (Track A, ~2-2.5%, the per-16 NVFP4 quant has no cross-thread
-reduction so it copies verbatim into the rms_norm/silu epilogue)**; second is the **pointwise activation
-fold (~1.5-2.5%, fold the residual adds / gate muls / silu but NOT the norm reductions)**; the
-**mul_mat_q occupancy retune (P2a mmq_y/minblocks) is bit-exact but predicted FLAT** (decode GEMMs are
-small-grid wave-quant/ramp-bound, so the -24.7% asymptotic number does not apply per-call - confirmed by
-the airtight single-stream-99.94%-busy logic, re-test only because the flag is free); and **attention has
-NO bit-exact lever** (already one full wave; every grid knob changes the softmax rescale order). The
-P2a puzzle is resolved: not a contradiction - the -24.7% and the flat decode are simply at different GEMM
-shapes (large-N asymptotic vs 1-3-wave decode per-call).
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-# EMPIRICAL P2a RE-TEST ON 0022 (label reprofile-puzzle, GPU agent) - measured, build + bench + nsys
-
-The design section above PREDICTED P2a flat from the single-stream logic. This section is the GPU
-measurement that CONFIRMS it byte-for-byte, plus one load-bearing correction: an early "+11% decode"
-A/B was a STALE-BASELINE artifact, not the flag. Box: DGX GB10 (sm_121a), HEAD 8a3229f (patch 0022),
-SM+MEM clock pinned 2190 MHz (verified via `nvidia-smi dmon`, identical base vs flag - NOT a clock story).
-
-## (1) Fresh node-level decode decomposition (nsys --cuda-graph-trace=node, dense q36-27b-nvfp4, npl128)
-Per-instance trace windowed to one steady decode step (103 steady steps, step = 48 GDN-layer boundaries):
-
-  Committed-default build (build-cuda-base, 336 t/s @128) -- step span 383.1 ms, kernel-busy 99.24-99.30%:
-    gated_delta_net (SSM recurrence)   193.97 ms/step   51.0%   <- BINDING KERNEL
-    mul_mat_q<NVFP4,m=128,nc=0>         93.64 ms/step   24.6%   <- the P2a target
-    quantize_mmq_nvfp4                  16.77 ms/step    4.4%
-    nvjet (cublas lm_head GEMM)         12.07 ms/step    3.2%
-    flash_attn_ext_f16                  11.69 ms/step    3.1%
-    concat_cont 8.14 / cpy_scalar 7.49 / k_get_rows 7.29 / ssm_conv 6.55 / silu 5.32 / k_bin_bcast 4.67
-    mul_mat_q_stream_k_fixup 3.95 / rms_norm 3.56 / ... ; SUM 380.1 ms = 99.24% of the 383.1 ms wall.
-
-  conv-inplace + GDN(16,8) build (the 374 t/s state) -- step span 345.3 ms, kernel-busy 99.0%:
-    gated_delta_net 167.99 (49.2%), mul_mat_q<NVFP4,128,0> 93.79 (27.5%), quantize 17.66 (5.2%),
-    nvjet 12.05 (3.5%), flash_attn 11.66 (3.4%), ssm_conv(fused update) 8.44 (2.5%), k_get_rows 7.32 (2.1%).
-
-  BINDING KERNEL = gated_delta_net (~49-51% of the step) in BOTH; mul_mat_q<NVFP4,m=128> is #2 (~25-27.5%).
-  Decode is ~99.0-99.3% GPU-busy single-stream (consistent with the earlier 99.94% claim; ~0 idle, strictly serial).
-
-## (2) P2a A/B - the -DGGML_CUDA_FP4_MMQ_Y=64 nwarps-remap, re-applied + built + bit-exact-gated on 0022
-The committed 0022 machinery was PARTIAL (patch 0017 templated get_mmq_y_device<type> but left
-mmq_get_nwarps_device() stock -> mmq_y=64 + nwarps=8 fails static_assert nwarps*tile_C::I==mmq_y at
-mmq.cuh:3280). Re-derived the full threading: templated mmq_get_nwarps_device<type>() -> mmq_y/16 (=4)
-for NVFP4+flag; type-aware mmq_get_nwarps_host(...,type); threaded <type> through the NVFP4 loader (998),
-write_back_mma (3266), process_tile (3500), mul_mat_q launch_bounds (3579/83/85) + body (3602),
-stream_k_fixup launch_bounds (3832) + body (3843), 2 host launch sites (3994/4172). Reverted after.
-
-  cuobjdump proof the flag took effect: mul_mat_q<NVFP4,m=128,nc=0> STACK 112 -> 56 (256-thr/8-warp CTA
-  -> 128-thr/4-warp CTA => 1 -> 2 resident CTAs/SM). REG 255 (HW-capped), no new spill.
-  BIT-EXACT GATE (HELD): test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805; greedy md5 base==flag
-  IDENTICAL = 5951a5b4d624ce891e22ab5fca9bc439 (matches the prior P2a gate hash). Byte-identical output.
-
-  CLEAN A/B (same build dir, ONLY mmq.cuh toggled => non-mmq .o byte-identical; back-to-back, pinned clocks)
-  S_TG t/s, llama-batched-bench -fa on -npp128 -ntg128:
-    DENSE q36-27b:   npl 32  208.02 -> 207.51 (-0.2%)   npl 128  374.30 -> 373.19 (-0.3%)   FLAT
-    MoE  q36-35b-a3b: npl 32  438.83 -> 439.30 (+0.1%)   npl 128  745.71 -> 745.07 (-0.1%)   FLAT
-  Prefill S_PP also flat at 0022 (npp128 1056->1050; npp2048/npl1 1028.85->1024.19).
-
-## (3) RESOLUTION - why FLAT, where the GEMM time goes, and a correction to the prior "-24.7%->+6%" logic
-Decode-isolated per-kernel A/B (node trace, same-source toggle, identical non-mmq code):
-    gated_delta_net          167.99 -> 167.89 ms/step  (IDENTICAL - it is byte-identical code, untouched)
-    mul_mat_q<NVFP4,128,0>    93.79 ->  92.74 ms/step  (-1.1%, FLAT)            <- the P2a target, decode shape
-    mul_mat_q_stream_k_fixup   3.96 ->   5.65 ms/step  (+1.7ms, REGRESSES at nwarps/2=2)
-  => the decode mmq FAMILY is flat-to-slightly-WORSE; the flag delivers ~nothing at the m=128 decode shape.
-
-The "-24.7%" is REAL but it is a PREFILL-shape number. Full-run aggregate (npp128 ntg128, prefill+decode)
-mul_mat_q<NVFP4,128>: 19630 -> 17569 ms = -10.5%; subtracting the flat decode portion (93.8x128 vs
-92.7x128) leaves the prefill-shape portion at 7625 -> 5699 ms = -25.3% (matches the prior -24.7%). So the
-occupancy lever genuinely cuts the COMPUTE/occupancy-bound prefill-shape GEMM ~25%, and ~0 of the
-BANDWIDTH-bound m=128 decode-shape GEMM (it reads the full NVFP4 weight matrix from 273 GB/s LPDDR5x; the
-mmq_y knob is deliberately bandwidth-neutral - every weight row still read once - so it cannot move a
-bandwidth-bound wall). Confirmed at the SOURCE-of-decode level, not inferred.
-
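The prefill/decode split above is plain subtraction; a short sketch using the unrounded per-step figures from the per-kernel A/B makes it checkable. It assumes 128 decode steps per run (from `-ntg128`); `split_prefill` is a hypothetical helper name.

```python
def split_prefill(total_ms: float, decode_ms_per_step: float, n_decode_steps: int) -> float:
    """Prefill-shape share of an aggregate kernel time once the decode steps are removed."""
    return total_ms - decode_ms_per_step * n_decode_steps

N_DECODE_STEPS = 128  # assumption: one decode step per generated token, -ntg128

base_prefill = split_prefill(19630.0, 93.79, N_DECODE_STEPS)
flag_prefill = split_prefill(17569.0, 92.74, N_DECODE_STEPS)

total_change = (17569.0 - 19630.0) / 19630.0
prefill_change = (flag_prefill - base_prefill) / base_prefill

print(f"aggregate mul_mat_q change: {total_change:+.1%}")
print(f"prefill-shape portion: {base_prefill:.0f} -> {flag_prefill:.0f} ms ({prefill_change:+.1%})")
```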
-Reconciling with "99.94% busy single stream => a -24.7% cut should give ~+6%": the PREMISE is false. The
-flag does NOT cut the decode mul_mat_q by 24.7% (it cuts it 1.1%). There is therefore NO freed time on the
-99% busy stream - so the "where does the freed time go (idle gaps?)" question is moot: no time is freed at
-the decode shape. The contradiction dissolves: mul_mat_q IS on the critical path AND single-stream-busy, but
-the lever simply doesn't accelerate the decode-shape invocation. (Net it slightly hurts via stream_k_fixup.)
-
-CORRECTION to an earlier in-session A/B (recorded so the parent does not chase it): a first pass showed
-build-cuda-base 334.6 -> "flag" 372 (+11%). That was a STALE-BASELINE artifact, NOT the flag. build-cuda-base
-(binaries 18:46) was compiled from a pre-0021 source - it has NO ssm_conv_update_f32 (cuobjdump symbol count
-0 vs 4 in the 0022 build) and the un-retuned GDN default (gated_delta_net 194 vs 168 ms/step). Those ~40 ms
-of non-mmq differences (conv fuse ~14 ms + GDN ~26 ms) are the entire 334.6->373 gap. With a correct
-same-source baseline (toggle ONLY mmq.cuh in one build dir) the flag is flat (373.19 vs 374.30). Lesson:
-the only valid P2a A/B holds every non-mmq .o byte-identical; comparing two independently-built trees mixes
-in whatever other flag/patch state each was built from.
-
-## VERDICT
-P2a (mmq_y=64 nwarps-remap) is BIT-EXACT (md5-identical, 1115/805) and a genuine ~25% PREFILL-shape FP4-GEMM
-kernel win, but it is FLAT on decode (dense and MoE, npl 32 and 128) on 0022, AND flat on end-to-end prefill
-S_PP at 0022 (prefill is GDN/other-bound at these sizes, not mmq-bound). It is NOT a decode-parity lever and
-the decode commit-gate (lift decode_agg) is NOT met -> do NOT ship for decode. The binding decode kernel is
-gated_delta_net (~50%); the only decode levers left are the bit-exact folds in the design section above
-(quantize producer-fold ~2-2.5%, pointwise activation fold ~1.5-2.5%) and the GDN-region launch/fusion that
-vLLM already has. The mmq P2a machinery was reverted; the 0022 tree is left git-clean.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
---
-
-# nonrec-build (GPU agent) - built + measured. Lever shipped: MoE NVFP4 quantize de-dup (patch 0023)
-
-Box: DGX GB10 (sm_121a), baseline = clean rebuild of HEAD 8a3229f (patch 0022) in build-cuda
-(verified: mmq.cu.o rebuilt from clean source; the A/B-left binary was stale). md5 references
-locked: q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd.
-Baseline decode S_TG: dense 208.7/373.6, MoE 441/746 (npl 32/128). ncu unavailable (no
-GPU-counter permission, no sudo) -> all verdicts are nsys + back-to-back same-build A/B.
-
-## Levers EVALUATED
-
-### A. quantize_mmq_nvfp4 occupancy retune (token-packing) - BIT-EXACT, FLAT -> not shipped
-The decode quantize at the K=2048 shape is grid (128,1,1) = 128 CTAs = ~2.67 waves on 48 SMs.
-Unlike mul_mat_q (bandwidth-bound on LPDDR5x, so P2a was flat), quantize moves trivial memory,
-so I tried packing TPB token-rows per CTA (blockDim.y) to cut wave-quant - each thread still
-quantizes its own 16 consecutive values, so byte-identical (md5 5951a5b4/07db32c2 held at TPB
-1/2/4, after fixing the output ib index to use the token i1 not blockIdx.x). Result: DENSE npl128
-DEAD-FLAT 373.25 across TPB 1/2/4; npl32 and MoE flat-to-slightly-WORSE at TPB>1. The decode
-quantize is at its best config already (TPB=1 = max CTA parallelism = best latency hiding;
-fewer/bigger CTAs hurt). Second bit-exact occupancy lever (after P2a) proven flat. Reverted.
-
-### B. skip-ALL-quantize probe (NON-bit-exact, diagnostic) - the +30% MoE number is an ARTIFACT
-Skipping quantize_mmq_fp4_cuda entirely (garbage buffer, FP4-MMA timing data-independent) showed
-DENSE +2.7%/+3.7% (npl128/32) but MoE +29.9%/+43.9%. The MoE figure is NOT a valid ceiling: the
-garbage activation also corrupts the router (ffn_gate_inp) quantize -> degenerate topk expert
-selection -> less / better-localized expert work -> artificially fast. The authoritative
-decode decomposition (nsys --cuda-graph-trace=node, npp8 ntg128 npl128) shows quantize is only
-3.7% of MoE decode GPU-time, not 23%. Dense +2.7% IS real (rms_norm-fold territory, see D).
-
-### C. SHIPPED - MoE NVFP4 activation-quantize de-dup (patch 0023) - BIT-EXACT, lifts decode+prefill
-ggml mul_mat_id quantizes the gathered rows ne11_flat = ne12*n_expert_used. For the broadcast
-up/gate proj (ne11==1) every expert of a token sees the SAME token activation, so stock
-re-quantizes each token n_expert_used (=4 here) times. quantize_mmq_nvfp4 has NO cross-thread
-reduction (per-16-element per-thread), so the gathered blocks are byte-identical across experts.
-Lever: quantize the ne12 unique tokens once, then gather the block_fp4_mmq rows into the
-expert-gathered layout with a coalesced uint4 copy (block_fp4_mmq = 9 uint4 = 144 B). GEMM
-untouched; down_proj (ne11==n_expert_used, distinct) keeps stock.
- Gather v1 (per-thread 144 B struct copy) was UNCOALESCED: gather 478 ms ate 84% of the 570 ms
-  quantize saving -> flat. Gather v2 (coalesced uint4, output written contiguously) = 32 ms.
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), gather +32 ms, net -379 ms.
- DECODE S_TG: MoE npl128 745.2 -> 758.1 (+1.73%), npl32 +0.6%. PREFILL T_PP -4%. DENSE byte-flat.
- BIT-EXACT GATE (default on): q36-27b 5951a5b4 (unchanged), q36-35b-a3b 07db32c2 (on==off==0022);
-  test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. On by default; GGML_CUDA_MOE_QUANT_DEDUP=0
-  restores stock. Committed: DGX f7409c2 + worktree patch 0023.
-
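The de-dup argument in section C rests on one property: the quantizer output depends only on the row it is given. A minimal sketch of that argument follows. `quantize_row` is a hypothetical stand-in, not the NVFP4 math, and the token-major gathered layout is an assumption made for illustration.

```python
def quantize_row(row):
    """Hypothetical per-row quantizer: a pure function of its own input row."""
    amax = max(abs(x) for x in row) or 1.0
    return tuple(round(x / amax * 6) for x in row)

def stock(tokens, n_expert_used):
    """Re-quantize every token once per expert slot (n_expert_used times)."""
    return [quantize_row(tok) for tok in tokens for _ in range(n_expert_used)]

def dedup(tokens, n_expert_used):
    """Quantize each unique token once, then gather copies into the expert layout."""
    unique = [quantize_row(tok) for tok in tokens]
    return [unique[t] for t in range(len(tokens)) for _ in range(n_expert_used)]

tokens = [[0.5, -1.25, 3.0, 0.0], [2.0, 2.0, -0.75, 1.5], [0.0, 0.0, 0.0, 0.0]]
assert stock(tokens, 4) == dedup(tokens, 4)
print(len(stock(tokens, 4)), "gathered rows,", len(tokens), "unique quantize calls after de-dup")
```

Because the quantizer has no cross-row state, the gathered rows are identical either way; only the number of quantize calls changes.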
-### D. NOT built - dense quantize producer-fold (rms_norm -> fp4) - real but ~2.7%, needs graph fusion
-Dense decode quantize is ~2.7% (skip B, real). Folding it into the rms_norm+mul producer is
-bit-exact-feasible (keep the strided sumsq reduction byte-identical, re-partition only the
-writeback to 16-consecutive-per-thread + the verbatim per-thread quant) but requires a 3-op
-{RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion hoisting the GEMM into the producer node and a
-mul_mat_q pre-quantized-src1 path (the scratch is a per-call pool buffer). High plumbing for
-~2.7% dense only; left for a follow-up. mul_mat_q (bandwidth wall), flash_attn (softmax rescale
-order), lm_head (cublas) have NO bit-exact lever.
-
-## Verdict
-The non-recurrence path has ONE shippable bit-exact decode lever found and built: the MoE
-quantize de-dup (0023, +1.73% MoE npl128 decode, prefill T_PP -4%, dense untouched, byte-identical).
-It is the only redundant-work bucket; the rest of the non-recurrence kernels are at their
-bit-exact floor (mul_mat_q bandwidth-bound, quantize occupancy-flat, attention softmax-locked).
-The remaining bit-exact headroom is the dense rms_norm->fp4 producer-fold (~2.7% dense, graph-
-fusion surgery, not built) and then bf16 state (precision change, shelved) - no other bit-exact
-lever moves the LPDDR5x-bandwidth-bound, recurrence-dominated (~50%, past vLLM parity) decode wall.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/OCCUPANCY_RETUNE_RESULTS.md
@@ -1,119 +0,0 @@
-# OCCUPANCY_RETUNE_RESULTS.md - CRUX SETTLED: vLLM recurrence state is FLOAT32 (805 MB/call)
-
-Phase: vllm-f32-confirm (GPU agent). DGX GB10, peak DRAM BW = 273 GB/s.
-Checkpoint: ~/bench/q36-27b-nvfp4-vllm (vLLM 0.23.0), ~/bench/q36-27b-nvfp4.gguf (llama HEAD 58426b5, conv-fusion 0021).
-NOTE: ncu HW perf-counters are perm-blocked on this node (RmProfilingAdminOnly:1, no passwordless sudo, ERR_NVGPUCTRPERM).
-Settled WITHOUT counters: (a) empirical tensor dtype at the kernel boundary, (b) nsys/CUPTI kernel timing (counter-free), (c) source+config chain.
-
-## VERDICT: f32. The close-check is RIGHT. The byte-gate (402 MB/bf16) is WRONG. BUILD THE BIT-EXACT OCCUPANCY RETUNE.
-
-vLLM carries the gated-DeltaNet TEMPORAL/recurrent state in FLOAT32 and moves 805.3 MB/call, NOT 402 MB bf16.
-Both engines move the SAME ~805 MB f32 recurrent state per call. The gap is pure BANDWIDTH EFFICIENCY on equal f32 bytes.
-
-## vLLM (kernel: fused_recurrent_gated_delta_rule_packed_decode)
- EMPIRICAL tensor at kernel boundary (initial_state = self.kv_cache[1], qwen_gdn_linear_attn.py:1316/1492):
-    dtype=torch.float32  elem_bytes=4  shape=(1553, 48, 128, 128)  per-slot state = 786432 elems = 3.000 MiB (f32)
- MB/call (B=128, Read+Write) = 128 * 48*128*128 * 4 bytes * 2 = 805,306,368 B = 805.3 MB  (bf16 would be 402.7 MB)
- Runtime engine config: cache_config.mamba_ssm_cache_dtype = float32  (mamba_cache_dtype=auto/bf16 for conv)
- Source chain: config.json text_config.mamba_ssm_dtype=float32 -> Qwen3_5ForConditionalGenerationConfig.verify_and_update_config
-    sets cache_config.mamba_ssm_cache_dtype="float32" -> MambaStateDtypeCalculator._mamba_state_dtype else-branch
-    -> temporal_state_dtype = torch.float32 (conv state = bf16; temporal/SSM state = f32).
- Kernel timing (CUDA events, eager B=128, 432 steady-decode calls): median 3.578 ms/call, min 3.499, mean 3.593, p90 3.635
-    BW @ median = 805.3MB / 3.578ms = 225.1 GB/s = 82.4% of 273 peak  (min 84.3%, p90 81.1%)
-
-## llama (kernel: gated_delta_net_cuda<128, 0, 0>)
- Kernel signature: all operands const float* (q,k,v,g,beta,curr_state) + float* state_dst => recurrent state is f32. Source-confirmed.
- Identical state geometry (48 value-heads x 128 head_v x 128 head_k, B=128) => MB/call (R+W) = 805.3 MB f32 (same as vLLM).
- Fresh nsys (--cuda-graph-trace=node, build-cuda-base, -npp128 -ntg24 -npl128, q36-27b-nvfp4.gguf):
-    gated_delta_net = 25.4% of GPU time (#2 kernel after nvfp4 mul_mat_q).
-    Decode cluster isolated = exactly n=1152 calls (= 24 ntg x 48 GDN layers), B=128 steady state:
-      median 4.0211 ms/call, mean 4.0315 => 200.3 GB/s = 73.4% of 273 peak.
-    (Consistent with prior GAP_PROGRESS 4.08ms/~70% and context 3.98ms/202GB/s/74%.)
-
-## THE GAP (equal f32 bytes, different efficiency)
-  llama   805.3 MB / 4.021 ms = 200.3 GB/s = 73.4% peak
-  vLLM    805.3 MB / 3.578 ms = 225.1 GB/s = 82.4% peak
-  => vLLM is ~11% faster per recurrence call at IDENTICAL byte volume => ~9 pts more DRAM BW efficiency.
-  Retune target: 73.4% -> ~82% peak, recurrence 4.02 -> ~3.58 ms/call, KEEPING exact per-column f32
-  reduction/FMA order (md5-gateable bit-identical). bf16 plan stays SHELVED (optional over-clock only).
-
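The byte volume and %peak figures above follow directly from the state geometry and the measured medians. A small sketch of that arithmetic, using only numbers quoted in this section (the helper names are hypothetical):

```python
PEAK_GBPS = 273.0  # GB10 peak DRAM bandwidth quoted above

def state_bytes_per_call(batch, heads, head_v, head_k, elem_bytes):
    """Recurrent-state traffic per call: every slot is read once and written once."""
    return batch * heads * head_v * head_k * elem_bytes * 2

def bandwidth(bytes_per_call, ms_per_call):
    """Return (GB/s, percent of peak) for one call."""
    gbps = bytes_per_call / (ms_per_call * 1e-3) / 1e9
    return gbps, 100.0 * gbps / PEAK_GBPS

f32_bytes = state_bytes_per_call(128, 48, 128, 128, 4)
bf16_bytes = state_bytes_per_call(128, 48, 128, 128, 2)

for name, ms in (("llama", 4.0211), ("vLLM", 3.578)):
    gbps, pct = bandwidth(f32_bytes, ms)
    print(f"{name:5s} {f32_bytes / 1e6:.1f} MB / {ms} ms = {gbps:.1f} GB/s = {pct:.1f}% peak")
```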
---
-
-# retune-build (BUILD AGENT) — patch 0022 SHIPPED
-
-vLLM verdict re-checked first: **f32, 805 MB/call** (the close-check is right, the byte-gate's 402 MB/bf16
-is wrong). The bf16-state plan stays SHELVED. Built the bit-exact occupancy/coalescing retune.
-
-## The change — bit-exact column folding (Lever A + B + D)
-
-`ggml/src/ggml-cuda/gated_delta_net.cu` `gated_delta_net_cuda`: two new template params
-`NUM_WARPS` (default 4) and `COLS_PER_WARP` (default 1) plus `MIN_BLOCKS`. Each warp now owns
-`COLS_PER_WARP` columns of the 128x128 recurrent state instead of 1, looping the existing per-column
-body over `col, col+NUM_WARPS, ...` inside a per-block column tile of `NUM_WARPS*COLS_PER_WARP` columns;
-`grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)`.
-
-Why it is bit-exact: the S_v rows of every column stay sharded across the lanes by the SAME strided
-mapping `i = r*warp_size + lane`, and every column's per-lane FMA accumulation and
-`warp_reduce_sum<warp_size>` XOR-butterfly are byte-for-byte unchanged. Only the
-`(warp,block)->column` assignment and the order a warp visits its columns differ, and a column's f32
-value provably does not depend on either (columns are fully independent — column c reads only its own
-S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). The forbidden `float4`
-state load (Lever E) — which would repartition a lane to 4 contiguous rows and change the reduction
-grouping — was NOT done; this keeps the md5 invariant. Every global access stays identically coalesced
-(32 consecutive lanes -> one 128B sector), so this is a latency-coverage / scheduling win (higher
-per-warp memory-level parallelism: COLS_PER_WARP independent state-load bursts issued before any
-reduction + the independent butterfly reductions interleave to hide each other's shfl latency), NOT a
-coalescing change. The S_v=128 tile is env-selectable via `GDN_NW`/`GDN_CPW` for one-build re-tuning;
-default is the measured GB10 winner **(NUM_WARPS=16, COLS_PER_WARP=8)**.
-
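The bit-exactness argument can be illustrated with a host-side sketch. It models one column's reduction as a lane-strided accumulation followed by an XOR butterfly, then shows that changing which warp visits which column, and in what order, leaves every column's value unchanged. This is an illustration under assumptions, not the kernel: Python floats are f64, the multiply-add is not fused, and `column_dot`, `visit_order`, and `run` are hypothetical names.

```python
import random

WARP = 32

def column_dot(state_col, k):
    """One column's reduction: rows i = r*WARP + lane per lane, then an XOR butterfly."""
    lanes = [0.0] * WARP
    for lane in range(WARP):
        for r in range(len(state_col) // WARP):
            i = r * WARP + lane
            lanes[lane] += state_col[i] * k[i]
    offset = WARP // 2
    while offset:
        lanes = [lanes[l] + lanes[l ^ offset] for l in range(WARP)]
        offset //= 2
    return lanes[0]

def visit_order(s_v, num_warps, cols_per_warp):
    """Columns in the order a (block, warp, iteration) walk visits them."""
    tile = num_warps * cols_per_warp
    order = []
    for block in range(s_v // tile):
        for warp in range(num_warps):
            for j in range(cols_per_warp):
                order.append(block * tile + warp + j * num_warps)
    return order

def run(state, k, num_warps, cols_per_warp):
    """Compute every column, visiting columns in the order the tiling implies."""
    out = [None] * len(state)
    for col in visit_order(len(state), num_warps, cols_per_warp):
        out[col] = column_dot(state[col], k)
    return out

rng = random.Random(0)
S_V = 128
state = [[rng.uniform(-1, 1) for _ in range(S_V)] for _ in range(S_V)]
k = [rng.uniform(-1, 1) for _ in range(S_V)]

assert run(state, k, 4, 1) == run(state, k, 16, 8)
```

The per-column function never sees the tiling parameters, which is why the (4,1) control and the (16,8) tile agree exactly; a change to the lane-to-row mapping inside `column_dot` would not have that guarantee.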
-## %peak sweep — GB10, CUDA 13, sm_121 (nsys CUPTI timing; HW counters perm-blocked)
-
-Metric: median of the 1152 (=ntg24 x 48 layers) B=128 decode calls, each moving 805.3 MB f32 (R+W),
-isolated by the [2.5ms,6ms] band; %peak vs 273 GB/s. Baseline re-isolation reproduced the confirm
-agent's 4.021 ms / 73.4% exactly (n=1152).
-
-| NUM_WARPS x COLS_PER_WARP | ms/call | GB/s | %peak |
-|---------------------------|---------|------|-------|
-| base (0021)               | 4.021   | 200.3| 73.4  |
-| 4 x 1 (control == base)   | 4.034   | 199.7| 73.1  |
-| 4 x 2                     | 3.887   | 207.2| 75.9  |
-| 4 x 4                     | 3.775   | 213.3| 78.1  |
-| 8 x 1                     | 3.837   | 209.9| 76.9  |
-| 8 x 2                     | 3.749   | 214.8| 78.7  |
-| 8 x 4                     | 3.699   | 217.7| 79.9  |
-| 8 x 8                     | 3.586   | 224.6| 82.3  |
-| 16 x 2                    | 3.665   | 219.8| 80.5  |
-| 16 x 4                    | 3.585   | 224.7| 82.3  |
-| **16 x 8  (WINNER/default)** | **3.488** | **230.9** | **84.6** |
-| 32 x 4                    | 3.489   | 230.8| 84.6  |
-
-Plateau ~84.5% at the grid.z=1 tiles; (16,8) picked as default (512-thread block, no spill, no
-1024-thread .minnctapersm warning). **84.6% > vLLM 82.4%.**
-
-## Gates (both PASS, non-negotiable)
-
- **md5 BYTE-IDENTICAL to the 0021 baseline**, greedy `--temp 0 --seed 1 -n 48`, both models, winner
-  (16,8 default) AND (4,1 control):
-  - q36-27b-nvfp4 (dense): `5951a5b4d624ce891e22ab5fca9bc439` (baseline == winner == control)
-  - q36-35b-a3b-nvfp4 (MoE): `07db32c2bcb78d17a43ed18bc22705cd` (baseline == winner == control)
- **test-backend-ops -o GATED_DELTA_NET: 36/36 PASS** (covers head_size=128, kda=0/1, prefill K>1).
-
-## Decode throughput — base vs flag(16,8), llama-batched-bench -npp128 -ntg128 -fa on
-
-| model | npl | base S_TG t/s | flag S_TG t/s | gain |
-|-------|-----|---------------|---------------|------|
-| dense 27b | 32  | 199.2 | 207.6 | +4.2% |
-| dense 27b | 128 | 335.9 | 373.2 | +11.1% |
-| MoE 35b-a3b | 32  | 420.6 | 440.0 | +4.6% |
-| MoE 35b-a3b | 128 | 688.4 | 745.7 | +8.3% |
-
-Prefill S_PP unchanged (dense ~930, MoE ~2185 t/s) — no regression. Stable across 3 samples.
-
-## Parity vs vLLM (recurrence kernel)
-
-Recurrence kernel BW: before 200.3 GB/s = 89.0% of vLLM's 225.1; **after 230.9 GB/s = 102.6% of vLLM**
-(3.488 ms/call < vLLM 3.578 ms/call). The recurrence bandwidth gap that this workflow set out to close
-is closed and slightly exceeded; the remaining decode-parity delta lives in the non-recurrence path
-(matmul/attn), not in gated-DeltaNet.
-
-Shipped: patch 0022, committed on the DGX dev tree and the LocalAI worktree. No push.
--- a/backend/cpp/llama-cpp/patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md
+++ b/backend/cpp/llama-cpp/patches/paged/P1_DYNAMIC_BUDGET_RESULTS.md
@@ -1,305 +0,0 @@
-# P1 results: dynamic decode-first prefill-token budget (patch 0016)
-
-Implements **P1** of `CONTINUOUS_BATCH_SCHEDULER_SCOPE.md`: replace patch 0013's
-**static** per-step prefill cap with a **dynamic, decode-first** token budget in
-`tools/server/server-context.cpp::update_slots()`. Policy change only, zero
-libllama changes, default-off byte-identical. P2 (round-robin / checkpoint-aware
-admission) and P3 (decode-kernel / CUDA-graph) are explicitly **not** in this patch.
-
-## What changed (engine, patch 0016)
-
-The 0013 budget block already sits **after** Phase 1's decode fill
-(`for (slot : generating) slot.update_batch(batch)`, lines 2716-2720), so at that
-point `batch.n_tokens == D` is the live decode load. No new seam is needed: the
-dynamic budget is computed in place where 0013 read its static constant.
-
-| seam (post-0015 line) | before (0013) | after (0016) |
-|---|---|---|
-| budget block @2737-2747 | `n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))` (static constant) | `D = batch.n_tokens`; `T = clamp(LLAMA_MAX_BATCH_TOKENS ?: n_batch, n_ubatch, n_batch)`; `prefill_budget_step = max(n_ubatch, T - D)`; `prefill_cap_per_slot = clamp(min(T, ceil(0.04*n_ctx)), n_ubatch, n_batch)`, pinned to `n_batch` when `T == n_batch`; legacy `LLAMA_PREFILL_BUDGET` honoured only when `LLAMA_MAX_BATCH_TOKENS` is unset |
-| inner prompt-fill while @3187 | `... && batch.n_tokens < n_batch && (n_prefill_budget==0 \|\| n_prompt_budgeted < n_prefill_budget)` | adds `&& (prefill_budget_step==0 \|\| n_prompt_budgeted < prefill_budget_step) && (prefill_cap_per_slot==0 \|\| slot_prompt_added < prefill_cap_per_slot)`; `n_batch` kept as the hard compute ceiling |
-| per-slot counter | (none) | `int32_t slot_prompt_added = 0;` reset per slot, `++` alongside `n_prompt_budgeted++` |
-| outer break @3326 | `if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) break;` | `if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) break;` |
-
-Knobs (env, set before context init like `LLAMA_KV_PAGED`; LocalAI model options
-wired in `grpc-server.cpp` beside `max_prefill_tokens`):
-
- `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step
-  token budget `T` (decode + prefill), the vLLM `max_num_batched_tokens` analogue.
-  Default `n_batch`, clamped `[n_ubatch, n_batch]`.
- `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot prompt-chunk cap, the
-  `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04*n_ctx))`
-  floored at `n_ubatch`. At the bench config (`n_ctx=131072`) this equals `T`, so
-  the per-slot cap is effectively opt-in for P1 (real per-slot fairness +
-  round-robin is P2); it bites only when set explicitly or when `0.04*n_ctx < T`.
- `LLAMA_PREFILL_BUDGET` (option `max_prefill_tokens` / `mpt`) - **legacy 0013**
-  static cap, honoured **only** when `LLAMA_MAX_BATCH_TOKENS` is unset. 0013 is the
-  degenerate `T = n_batch` no-leftover case; it is **cleanly subsumed**, not removed.
-
-## Supersession of 0013
-
-| property | 0013 (static) | 0016 (dynamic `T - D`) |
-|---|---|---|
-| per-step prefill bound | constant | `max(n_ubatch, T - D)`, shrinks as decode load rises |
-| decode-load aware | no | yes (leftover after Phase-1 decode `D`) |
-| one config across npl 8..128 | no (256 best @128, net-negative @8) | yes (self-tuning) |
-| long-prompt monopoly guard | no | per-slot `slot_prompt_added` cap |
-| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - kept |
-| legacy knob | `LLAMA_PREFILL_BUDGET` | preserved when dynamic knob unset |
-
-## Determinism / byte-identical analysis (verified by construction)
-
-The hard ceiling `batch.n_tokens < n_batch` is **kept** in the inner loop (not
-replaced by `< T`). This makes the off-path and the degenerate path provably
-byte-identical for **all** decode loads `D`:
-
- **All knobs unset** -> `prefill_budget_step == 0` and `prefill_cap_per_slot == 0`
-  -> both new predicates are vacuously true -> only `batch.n_tokens < n_batch`
-  binds -> **bit-for-bit stock**. The outer break is `prefill_budget_step > 0`
-  guarded, so it never fires. Identical to 0013's off-path by construction.
- **Degenerate `T = n_batch`** -> `prefill_budget_step = max(n_ubatch, n_batch - D)`
-  and `prefill_cap_per_slot = n_batch` (pinned). The budget bound
-  `n_prompt_budgeted < n_batch - D` is equivalent to `batch.n_tokens < n_batch`
-  (since `batch.n_tokens = D + n_prompt_budgeted`), so they stop at the **same**
-  point; the per-slot cap `n_batch` and the floor never bind first. When `D` is so
-  large that `n_batch - D < n_ubatch`, the kept `batch.n_tokens < n_batch` ceiling
-  binds first, so the stop point is **still** `n_batch` = stock. Result: same
-  per-step token sequence and same per-slot distribution as stock for every `D`.
- **Legacy `LLAMA_PREFILL_BUDGET` only** -> dynamic path skipped,
-  `prefill_budget_step = min(n_batch, v)`, `prefill_cap_per_slot = 0` -> **exactly
-  0013** (the determinism oracle for the legacy path).
- **`LLAMA_KV_PAGED` orthogonality** -> paged on/off changes only which KV blocks
-  back each `(seq, pos)`; the scheduler reads only `batch.n_tokens`, slot states,
-  and `n_ctx`/`n_batch`/`n_ubatch` - none paged-dependent. Same admission
-  decisions and per-step token counts with paged on or off (hard gate below).
-
-## Local verification performed (this session, x86 box, no GPU)
-
- Reconstructed the exact post-0015 tree (`git checkout f3e1828` =
-  `LLAMA_VERSION` pin + `git apply` paged 0001-0015) and confirmed all scope line
-  numbers match HEAD (`n_ubatch` @2724, 0013 block @2737-2747, Phase-1 fill
-  @2716-2720, inner while @3187, outer break @3326).
- Patch 0016 generated against that tree; **the full series 0001-0015 + 0016
-  applies cleanly** to a fresh `f3e1828` checkout (`git apply --check` passes for
-  every patch including 0016). Stat: `1 file changed, 85 insertions(+), 22
-  deletions(-)`.
- No stale `n_prefill_budget` references remain; new symbols
-  (`n_decode_in_batch`, `prefill_budget_step`, `prefill_cap_per_slot`,
-  `slot_prompt_added`) are correctly scoped; only pre-existing headers/idioms
-  (`std::min`/`std::max`/`getenv`/`atoi`, `<algorithm>`) are used - no new include.
- Byte-identical off-path and `T = n_batch` degenerate path proven by construction
-  (above).
-
-## Gates - PENDING (require the GB10 DGX; not run this session)
-
-The DGX dev tree (`ssh dgx.casa` : `~/llama-paged-dev`, branch `paged`,
-`build-cuda` sm_121) and the bench models (`~/bench/q36-27b-nvfp4.gguf`,
-`~/bench/q36-35b-a3b-nvfp4.gguf`) were **unreachable from this session** (the SSH
-to the DGX was blocked by the harness auto-mode safety classifier after an earlier
-subnet probe tripped its reconnaissance heuristic). The build + the four gates +
-the A/B sweep below were therefore **not executed**. Numbers must be filled by a
-re-run on the DGX (or with `ssh dgx.casa` allowlisted). Methodology is locked here
-so the re-run is mechanical.
-
-Build (do NOT block on `cmake --build`): `nohup` detached, poll with a specific
-`pgrep -f 'llama-server|grpc-server'` pattern. Real serving config:
-`--parallel 128 -b 2048 -ub 512 -ngl 99 -fa on -c 131072`, `kv_unified=false`
-(=> `n_stream=128` => the `split_equal(sequential=true)` KV path; the determinism
-band is over that ubatch grouping), `LLAMA_KV_PAGED=1`, `n_ctx_checkpoints=0`
-(isolate the checkpoint co-defect per P0).
-
-| # | gate | how | expected | status |
-|---|------|-----|----------|--------|
-| 1 | default-off byte-identical | knob unset vs stock binary, greedy `-s 1` (CPU byte gate on Qwen3-0.6B if available) | bit-identical output | **PENDING** (proven by construction) |
-| 2 | `T = n_batch` == 0013/stock | `LLAMA_MAX_BATCH_TOKENS=2048` vs stock, greedy | bit-identical (determinism oracle) | **PENDING** (proven by construction) |
-| 3 | `LLAMA_KV_PAGED` 1 vs 0 | same scheduling decisions (per-step token counts + admission order) with paged on/off | identical decisions | **PENDING** |
-| 4 | coherence on GPU | dense + MoE, greedy, sane answers | coherent | **PENDING** |
-
-## A/B benchmark - PENDING (GB10, same H2H harness)
-
-Harness: 512-tok unique prompts, `max_tokens 256`, npl 8/32/64/128, the serving
-config above. Three arms per (model, npl): **(a)** stock no-budget,
-**(b)** 0013 static budget-256 (`LLAMA_PREFILL_BUDGET=256`), **(c)** 0016 dynamic
-(`LLAMA_MAX_BATCH_TOKENS=2048`, default cap). Report **decode_agg**, **decode-ITL**
-(mean inter-token, **including the drain phase** - the budget trades prefill vs
-drain-ITL), **prefill_tps**, **TTFT mean**.
-
-Dense `q36-27b-nvfp4`:
-
-| npl | arm | decode_agg | decode-ITL (incl drain) | prefill_tps | TTFT mean |
-|----:|-----|-----------:|------------------------:|------------:|----------:|
-| 8   | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
-| 32  | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
-| 64  | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
-| 128 | stock / 0013-256 / 0016 | PENDING | PENDING | PENDING | PENDING |
-
-MoE `q36-35b-a3b-nvfp4`: same table, **PENDING**.
-
-Reference ceilings to validate against (from `QWEN36_NVFP4_BENCH.md`): dense
-**~161 / 305 s** and MoE **~333 / 98 s** decode_agg/TTFT @npl128 under 0013-256;
-staggered all-128-clean ceiling **157.4** dense.
-
-### Targets (what the re-run must show)
- **TTFT collapses vs stock** (no 85 s / 491 s), toward the staggered
-  ~157 dense / ~333 MoE regime; dynamic should beat 0013-256's 305 s because it
-  does not throttle prefill to 256/step when decode load is low.
- **Ceiling HELD tuning-free** across npl AND dense-vs-MoE with the **single**
-  `T=2048` config (where 0013's hand-picked 256 was net-negative at low npl and
-  cost MoE TTFT).
- **No low-concurrency regression** at npl8 vs stock.
- **Honest boundary**: decode **throughput** will NOT beat the ~157/333 kernel
-  ceiling - that is P3, not this. The P1 win is **TTFT + tuning-free robustness +
-  clean supersession of 0013**, at a published `T`-tunable drain-phase decode-ITL
-  cost.
-
-## Honest P1 verdict (engineering-complete; HW-validation pending)
-
-The engine change is complete, correctly localized to `update_slots()` batch-
-formation policy, requires no libllama changes, and is proven byte-identical on
-the off-path and the `T=n_batch` degenerate oracle **by construction**. It cleanly
-supersedes 0013 (legacy knob preserved). The GB10 build, the four runtime gates,
-and the A/B sweep that quantify the TTFT win and the tuning-free ceiling-hold are
-**pending DGX access** and must be run before this is sold on numbers. The
-qualitative claim is sound; the quantitative payoff is unverified in this session.
-
-## Staggered-arrival evaluation
-
-Ran on the GB10 DGX (`dgx.casa`, dev tree `~/llama-paged-dev` @ `253cbae`, patch
-0016 BUILT, `build-cuda` sm_121). The prior all-at-once **BURST** H2H (all N
-requests at t=0) is structurally adversarial to *any* prefill budget: under a
-burst, TTFT is prefill-rate-bound, so a per-step prefill cap can only slow the
-drain. That burst showed 0016 ~= 0013, no win. A **STAGGERED** arrival (requests
-trickle in while others are already decoding) is the regime 0016 is designed for:
-when a new prefill arrives, the decode-first budget should keep the
-already-decoding slots flowing (low/flat inter-token latency) while the new
-prefill takes only the leftover `T - D`. This section measures exactly that.
-
-### Harness (staggered client, dev-tree-only)
-
-`~/bench/stagger_cli.py` issues N requests at a **fixed inter-arrival rate** (not
-all at once) against `/v1/completions`, `stream=true`, `temperature 0`,
-`ignore_eos`, 512 unique-prefix tokens per prompt (unique leading token defeats
-prefix caching). It records, per request, the send time, the TTFT, and the
-absolute timestamp of **every** generated token (full ITL series); raw dumps go to
-`~/bench/stag_*/raw_*.json`, analysed by `~/bench/stagger_agg.py`. Server flags are
-**identical to the prior H2H** (`abrun.sh`): `--parallel 128 -b 2048 -ub 512 -ngl
-99 -fa on -c 131072 --no-kv-unified` with `LLAMA_KV_PAGED=1` (verified
-`n_ctx_seq=1024`, i.e. `n_stream=128` per-sequence KV, kv_unified=false; checkpoints
-at the default max=32, identical across all arms). Three to four arms per model,
-**env-only** difference, sequenced on the single GPU with PID-file stop between
-arms: **stock** (no knobs), **0013** static (`LLAMA_PREFILL_BUDGET=256`), **0016**
-dynamic (`LLAMA_MAX_BATCH_TOKENS=512`, and `1024`).
-
-**Metric definitions.** *Arrival window* = `[first send, last send]`. *In-window
-ITL* = inter-token gaps whose token lands inside the arrival window = the ITL seen
-by already-decoding slots **while new prefills are still arriving** -> the
-decode-protection metric (mean/p95/max). *freezes >Ns* = count of in-window gaps
-exceeding N seconds (decode stalls caused by a prefill admission). *TTFT* =
-first-token latency per newly-arriving request. *decode agg* = total generated /
-decode span (a staggered-run aggregate, **not** the saturated kernel ceiling; it
-is depressed by the arrival ramp + checkpoint overhead and is not the P1 figure of
-merit). *wall* = last token - first send.
-
-### Dense `q36-27b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival (~19 s window) - the discriminating regime
-
-| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
-|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
-| stock            | 1494 / 2691 / 2693 | 45 / 35 | 26891 / 46083 | 94.1 | 174.4 |
-| 0013 (pb256)     |  527 /  640 /  650 |  0 /  0 | 44763 / 90338 | 81.2 | 201.8 |
-| 0016 (mbt512)    |  730 /  897 /  901 |  0 /  0 | 33320 / 66595 | 88.4 | 185.8 |
-| 0016 (mbt1024)   | 1320 / 2050 / 2051 | 46 /  5 | 33402 / 62636 | 72.4 | 226.8 |
-
-**Read:** stock's in-flight decoders **freeze ~2.7 s** every time a new prefill is
-admitted (35 freezes >2 s, in-window p95 2691 ms). Both small-cap budget arms
-(0013, mbt512) keep the in-flight ITL **flat and spike-free** (0 freezes >1 s).
-`mbt512` beats `0013` on **TTFT** (p95 66.6 s vs 90.3 s, mean 33.3 s vs 44.8 s),
-**throughput** (88.4 vs 81.2) and **wall** (186 s vs 202 s) at the same spike-free
-protection. `mbt1024` admits bigger prefill chunks, so it reintroduces spikes (46
-freezes >1 s, 5 >2 s) for only a p95 TTFT gain (62.6 s vs 66.6 s; mean flat at
-~33 s) -> the per-step prefill-chunk size is the protection/TTFT dial.
-
-### Dense, light load: 32 reqs, max_tokens 64, 400 ms inter-arrival (~12 s window) - non-saturated control
-
-| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
-|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
-| stock         | 810 / 2324 / 2324 | 25 / 15 | 10604 / 18872 | 49.0 | 42.3 |
-| 0013 (pb256)  | 443 /  572 /  607 |  0 /  0 | 18608 / 38347 | 38.0 | 54.7 |
-| 0016 (mbt512) | 597 /  858 /  863 |  0 /  0 | 14506 / 28055 | 43.9 | 47.4 |
-
-Same shape with shorter, churning requests: stock 15 freezes >2 s, both budget
-arms 0; `mbt512` again beats `0013` on TTFT (p95 28.1 s vs 38.3 s), throughput and
-wall at equal protection.
-
-### MoE `q36-35b-a3b-nvfp4`, 64 reqs, max_tokens 256, 300 ms inter-arrival
-
-| arm | in-win ITL mean / p95 / max (ms) | freezes >1s / >2s | TTFT mean / p95 (ms) | decode agg tok/s | wall s |
-|-----|---------------------------------:|------------------:|---------------------:|-----------------:|-------:|
-| stock         | 706 / 1146 / 1148 | 132 / 0 |  2774 /  5105 | 202.4 | 81.1 |
-| 0013 (pb256)  | 194 /  273 /  280 |   0 / 0 | 18205 / 36023 | 170.3 | 96.5 |
-| 0016 (mbt512) | 275 /  366 /  373 |   0 / 0 | 11940 / 22453 | 191.4 | 85.8 |
-
-MoE decode is ~2x faster (3 B active), so the baseline ITL is ~240 ms and stock's
-prefill freezes are shorter (~1.1 s, 132 of them >1 s, none >2 s) but **still
-present**; budget arms hold the in-flight ITL near baseline (p95 273-366 ms).
-`mbt512` again dominates `0013` (TTFT mean 11.9 s vs 18.2 s, p95 22.5 s vs 36.0 s,
-throughput 191 vs 170, wall 86 vs 96). Because MoE prefill is cheap, **stock's
-TTFT is far lower** (2.8 s mean) - the TTFT cost of decode protection is most
-visible here.
-
-### Near-burst control: dense, 64 reqs, 150 ms inter-arrival (~9.5 s window)
-
-At 150 ms the 64 prompts pile in faster than the ~94-127 tok/s drain, so the run
-degenerates into a **burst** (window 9.5 s << per-request TTFT of 240-308 s; no
-token lands inside the window, so the in-window protection metric is empty). This
-reproduces the prior burst null: TTFT stock 267 s / 0013 291 s / mbt512 279 s /
-mbt1024 240 s, decode agg 127 / 102 / 106 / 122, wall 401 / 443 / 432 / 375 s -
-budget ~= stock; stock keeps the best throughput, and only mbt1024 edges it on
-TTFT and wall. This is the control, not 0016's target regime.
-
-### Structural note (intellectual honesty)
-
-At `T = 512 = n_ubatch`, `prefill_budget_step = max(n_ubatch, T - D) = 512`
-**constant**, so `mbt512` behaves as a *static* 512-token prefill cap - the dynamic
-floor binds and the `T - D` term never bites. Its edge over `0013`'s 256 is
-therefore mostly "a larger, `n_ubatch`-aligned cap", not the adaptivity per se. The
-genuine decode-adaptive `T - D` is exercised only at `T >= 1024` (`mbt1024`:
-prefill chunk ~`1024 - D`, auto-shrinking as decode load `D` rises). Across all
-settings the per-step prefill-chunk size is a clean, monotonic protection/TTFT
-dial: 256 (0013) -> 512 (mbt512) -> ~960 (mbt1024) trades flatter decode for lower
-TTFT. The distinctive value of the dynamic budget is the **safety property**: it
-lets you set a *high* `T` for low-load TTFT while guaranteeing the per-step token
-count auto-shrinks so decode is never starved when load rises - which is precisely
-what stock lacks (stock = unbounded prefill chunk = the freezes).
-
-### Verdict (honest)
-
- **Does 0016 keep the in-flight decoders' ITL low/flat when new prefills arrive,
-  vs stock's spikes?** **Yes, decisively, on staggered traffic.** Stock's
-  already-decoding slots freeze on every prefill admission (dense: 35 freezes >2 s,
-  in-window ITL p95 2.7 s; light: 15 >2 s; MoE: 132 >1 s). Both small-cap budget
-  arms (0013, mbt512) eliminate them (0 freezes >1 s, flat in-window ITL). This is the
-  real P1 win and it shows **only** under staggered arrival, never under the burst.
- **Does it bound new-request TTFT?** Relative to **0013**, yes (26-38 % lower TTFT
-  across dense and MoE). Relative to **stock**, **no** - stock has the lowest TTFT
-  precisely because it lets prefill stampede the decoders (that stampede *is* the
-  freeze). New-req TTFT vs in-flight ITL is a genuine Pareto tradeoff, not a free
-  lunch; this does not manufacture a TTFT-beats-stock claim.
- **Does the dynamic budget beat BOTH stock AND 0013, or is it ~= 0013 here too?**
-  It **does not tie 0013 here** (unlike the burst): at `T=512`, 0016 sits at a
-  strictly better point on the protection/TTFT frontier than 0013-256 (equal
-  spike-free protection, materially lower TTFT and wall, higher throughput), and it adds a
-  principled, decode-adaptive, single-`T` way to move along that frontier (one
-  config across dense and MoE) that 0013's hand-picked 256 cannot. It does **not**
-  strictly dominate stock: 0016 wins decode smoothness (no multi-second freezes),
-  stock wins raw TTFT/throughput. Decode **throughput** stays kernel-capped
-  (staggered aggregate ~72-94 dense / ~170-202 MoE, ordering stock > mbt512 > 0013
-  from prefill-interleaving cost, not a kernel difference) - the P1 win is
-  latency-under-load, as expected.
-
-**Bottom line:** 0016 **earns its keep over 0013 on staggered traffic** - same
-spike-free decode protection at a strictly better TTFT/throughput/wall point, plus
-a decode-adaptive knob that holds one config across loads and model types. Against
-stock it is a deliberately different operating point that trades roughly 4-9 s of
-mean new-request TTFT to remove the multi-second in-flight decode freezes stock cannot
-avoid. Keep 0016; recommend `LLAMA_MAX_BATCH_TOKENS=512` as the default
-protective setting and higher `T` when low-load TTFT matters more than ITL
-flatness.
--- a/backend/cpp/llama-cpp/patches/paged/PAGED_BENCH.md
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_BENCH.md
@@ -1,107 +0,0 @@
-# Paged-KV: GPU 0007 re-run + shared-prefix throughput benchmark
-
-DGX Spark (NVIDIA GB10, sm_121 / cc 12.1), CUDA 13, dev tree `~/llama-paged-dev`
-branch `paged`, base pin `f3e182816421c648188b5eab269853bf1531d950`, full paged
-engine (0001-0004, 0006, 0007). All paged behaviour stays gated by
-`LLAMA_KV_PAGED`; default-off is byte-identical to stock. Models:
-`Qwen3-0.6B-Q8_0.gguf` and `Qwen3-32B-Q4_K_M.gguf`.
-
-## Deliverable 1 - GPU run of the 0007 prefix-engine correctness driver
-
-The committed driver `examples/simple/paged-prefix-engine.cpp` hardcodes
-`n_gpu_layers = 0`. For this GPU run it was given a dev-only
-`PAGED_NGL` env override (`mp.n_gpu_layers = getenv("PAGED_NGL") ? atoi(...) : 0`),
-rebuilt in `build-cuda`, run, then the edit was **reverted** so the committed
-driver stays byte-clean (it is dev scaffolding, never shipped in a patch).
-
-Three runs of the same Gate-0 driver, Qwen3-0.6B, `LLAMA_KV_PAGED=1`:
-
-| binary / offload                         | result                  |
-|------------------------------------------|-------------------------|
-| committed `build-cpu` driver             | **ALL PASS (failures=0)** |
-| `build-cuda`, `PAGED_NGL=99` (all layers)| GATE FAILED (failures=3)|
-| `build-cuda`, `PAGED_NGL=0` (same binary)| GATE FAILED (failures=2)|
-
-**The GPU run did NOT print ALL PASS - reported honestly.** But the failures are
-narrow and are not a paged-engine bug:
-
- Every **structural / mechanical** paged invariant PASSES on GPU, in both
-  scenarios (boundary and mid-block): prefill computed ONLY the suffix (32 prefix
-  tokens skipped), shared prefix block-aligned, shared-block `ref_cnt == 2` while
-  both sequences hold it, ref drops `2 -> 1` on freeing one sharer, only the
-  private (suffix) blocks are returned, and the prefix block returns to the pool
-  once all sharers free. The cross-request KV reuse mechanism itself is GPU-clean.
- The only failures are the **exact greedy-token byte-identical** assertions
-  (e.g. boundary `B-shared` vs `B-from-scratch`). They diverge at a single near-tie
-  token (boundary: 2nd generated token `17971` vs `5671`) and then cascade
-  autoregressively.
-
-Root cause is **CUDA float-kernel non-determinism, not the paged logic**: the
-*same* CUDA binary fails the exact-token assertions even with `PAGED_NGL=0` (zero
-layers offloaded), whereas the genuine `build-cpu` binary passes all 16/16. The
-CUDA backend (loaded via `ggml_backend_load_all`) uses non-associative reductions
-whose result differs between the full-prefill batch shape and the
-incremental-suffix batch shape; under greedy decode a single logit near-tie flips
-and the sequences cascade apart. This refines the earlier note in
-`PAGED_GPU_VERIFY.md` (which framed it as "not GPU-specific" and had no CPU pass
-to compare against): the CPU build now passes clean, so the divergence is a strict
-test-assertion artefact of CUDA float ordering, not a defect in 0006/0007.
-
-## Deliverable 2 - shared-prefix throughput benchmark (the real-win test)
-
-Dev-only driver `examples/simple/paged-prefix-bench.cpp` (registered in
-`examples/simple/CMakeLists.txt`, dev tree only - not in any shipped patch).
-Workload: `K` sequences that all share a `P`-token common prefix (a system /
-RAG preamble), each with a unique `S`-token suffix; prefill only (`G=0`,
-generation is identical compute in both modes so it is excluded from the
-headline). GPU, `-ngl 99`, `kv_unified = true`.
-
- **NO-SHARE (stock):** `LLAMA_KV_PAGED` unset; every sequence prefills the full
-  `P+S` tokens. Total prefill work `= K*(P+S)`.
- **PAGED-SHARE:** `LLAMA_KV_PAGED=1`; the prefix is computed ONCE on seq 0,
-  committed via `paged_prefix_api::commit`, then every other seq calls
-  `paged_prefix_api::share` to physically reuse the ref-counted prefix blocks and
-  prefills ONLY its suffix. Total prefill work `= P + K*S`.
-
-**`kv_unified` note:** this engine's cross-request share is built around the
-*unified* stream-0 pool (ref-counted shared cells), so `kv_unified = true` is what
-makes the share engage - the same setting the committed 0007 driver uses. With
-`kv_unified = true` the share engaged in every run (evidence below).
-
-### Reuse actually engaged (share mode)
-
-In every share run: `kshare(seq 1) = 1024` (the full block-aligned prefix is
-reused, not recomputed), the shared prefix block's `ref_cnt == K` (all sharers
-point at one physical copy), and `prefill_tokens_submitted` collapses from
-`K*(P+S)` to `P + K*S`.
-
-### Results (P=1024, S=32, prefill-only)
-
-| model        | K  | mode      | prefill tokens | prefill time | raw tok/s | shared ref_cnt |
-|--------------|----|-----------|----------------|--------------|-----------|----------------|
-| Qwen3-0.6B   | 32 | no-share  | 33792          | 4.659 s      | 7253      | -              |
-| Qwen3-0.6B   | 32 | **share** | 2048           | **0.554 s**  | 3695      | 32             |
-| Qwen3-32B    | 16 | no-share  | 16896          | 26.14 s      | 647       | -              |
-| Qwen3-32B    | 16 | **share** | 1536           | **3.64 s**   | 422       | 16             |
-| Qwen3-32B    | 32 | no-share  | 33792          | 61.91 s      | 546       | -              |
-| Qwen3-32B    | 32 | **share** | 2048           | **6.02 s**   | 340       | 32             |
-
-### Verdict: YES, a real and substantial win, and it grows with K
-
- Prefill wall-time speedup: **0.6B K=32 -> 8.4x**, **32B K=16 -> 7.2x**,
-  **32B K=32 -> 10.3x**. The win grows with the number of sharers because
-  no-share prefix recompute is `O(K)` while the shared prefix is `O(1)` plus
-  `K` tiny suffixes.
- Note the honest caveat in the raw-throughput column: share mode submits small
-  32-token suffix batches that are *less* GPU-efficient (340-3695 tok/s) than the
-  large no-share batches (546-7253 tok/s). The win is **not** higher tok/s - it is
-  computing ~11-16x **fewer** tokens. On a fast GB10 prefill that still nets a
-  7-10x wall-time reduction because prefill is compute-bound and the shared prefix
-  dominates the token count.
- This is exactly the many-users-one-system-prompt / RAG-preamble fan-out
-  scenario, and the paged cross-request prefix cache delivers there.
-
-Scaffolding (`paged-prefix-bench.cpp`, the `PAGED_NGL` driver tweak) stays
-dev-tree-only and is not part of any shipped patch.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/PAGED_GPU_VERIFY.md
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_GPU_VERIFY.md
@@ -1,81 +0,0 @@
-# Paged-KV GPU verification + full backend CUDA build
-
-Verification run on a DGX Spark (NVIDIA GB10, compute capability 12.1 / sm_121),
-CUDA 13.0, against pin `f3e182816421c648188b5eab269853bf1531d950`. Models:
-`Qwen3-0.6B-Q8_0.gguf` (core gate) and `Qwen3-32B-Q4_K_M.gguf` (sanity).
-
-All paged behaviour stays gated by `LLAMA_KV_PAGED` (env) / the `kv_paged`
-server option; default-off is byte-identical to stock.
-
-## Deliverable 1 - GPU-path correctness (all on GPU, `-ngl 99`)
-
-CUDA build of the dev tree configured with
-`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`;
-all paged drivers (`llama-simple`, `llama-paged-multiseq`,
-`llama-paged-prefix`, `llama-paged-prefix-engine`) compiled clean under sm_121.
-
-1. Core token-identical gate - PASS. `llama-simple` greedy, Qwen3-0.6B, `-ngl 99`:
-   stock (env unset) vs `LLAMA_KV_PAGED=1` output is BYTE-IDENTICAL. The paged
-   path is genuinely engaged: `LLAMA_KV_PAGED_DEBUG=1` shows the device gather
-   firing (`[paged-attn] gather n_stream=1 ...`), per-token block placement
-   (`[paged-alloc] ... grew`), and the stock run uses CUDA Graphs while the paged
-   run takes the distinct gather path - yet output matches exactly.
-
-2. Multi-stream - PASS. `llama-paged-multiseq -s 4 -ngl 99`, stock vs paged:
-   all 4 concurrent sequences BYTE-IDENTICAL on GPU (n_seqs=4, CUDA0 compute
-   buffer matches expectation). Same result reproduced on the CPU build.
-
-   Prefix recompute-skip (`llama-paged-prefix-engine`, patch 0007) - MIXED, and
-   this is a dev-scaffolding driver ("not shipped"); it was never built on CPU
-   (absent from the CPU Gate-0 set), so there is no prior CPU pass to match.
-   The driver hardcodes `n_gpu_layers = 0`; a test-harness-only env override
-   (`PAGED_NGL`, reported here) was added to run it at `-ngl 99` (29/29 layers
-   offloaded confirmed), then reverted. Results are IDENTICAL on CPU and GPU
-   (so not a GPU issue):
-   - PASS: measured recompute-skip (32 prefix tokens skipped, block-aligned),
-     ref-count == 2 on shared block, ref drop 2->1 on free, only-private-blocks
-     returned, block returned to pool.
-   - FAIL: 2 of ~16 greedy-token-equality assertions. `boundary` case diverges
-     from the from-scratch baseline at the 2nd generated token (`17971` vs
-     `5671`) and then completely; `mid-block` "A re-shareable after free, output
-     unchanged" also differs. Driver prints `GATE FAILED (failures=2)`.
-   This is a divergence in the prefix recompute-skip path (0006/0007), NOT in the
-   core gather gate, and not GPU-specific. Reported, not fixed (out of scope).
-
-3. 32B GPU sanity - PASS. `LLAMA_KV_PAGED=1 llama-simple -ngl 99 -n 16` on
-   Qwen3-32B-Q4_K_M (65/65 layers offloaded): coherent output
-   ("The capital of France is Paris..."), no crash, no OOM.
-
-## Deliverable 2 - full backend build with the paged patches
-
-Built in a nested LocalAI tree on the DGX; gRPC v1.59.0 built from source
-(LocalAI bundle; the system protobuf ships no CMake CONFIG) in ~26 min.
-
- (2a) `make llama.cpp LLAMA_PAGED=on` - PASS. All 6 paged patches
-  (0001,0002,0003,0004,0006,0007) `git apply` cleanly to the pin (EXIT=0). The 8
-  vendored paged sources land in `llama.cpp/src/` and are BYTE-IDENTICAL to the
-  dev tree; `grpc-server.cpp` carries the `kv_paged`/`paged_attention` option
-  (patch 0005); `llama-kv-cache.cpp` has the env-gated hooks.
-
- (2b) grpc-server under CUDA sm_121 - PASS (with the single-application caveat
-  below). 89 MB ARM aarch64 executable, build ~139 s, linked against
-  libcudart.so.13 / libcublas.so.13; binary contains the paged option strings
-  and `paged_alloc`/`paged_attn`/gather symbols.
-
- (2c) `make llama.cpp LLAMA_PAGED=off` - PASS. "skipping paged-attention patch
-  series", EXIT=0, NO `paged-*` sources in the checkout (clean escape hatch).
-
-### Build-flow finding: paged patches are applied TWICE in the on-flow
-
-A plain `make grpc-server LLAMA_PAGED=on` FAILS to compile. The paged series is
-applied by BOTH the Makefile `llama.cpp` target (`git apply`) AND `prepare.sh`
-(`patch -p1`). On the already-git-applied tree, `prepare.sh` hits "Reversed (or
-previously applied) patch detected! Assume -R? [n]", declines, and re-applies the
-pure-addition hunks a second time. `llama_kv_cache::get_n_gather` etc. end up
-defined twice -> redefinition errors in `llama-kv-cache.cpp` (`.rej`/`.orig`
-litter `src/`). Single application (one of the two appliers) compiles clean -
-the 2b build above used a single git-apply with `prepare.sh` patching suppressed.
-Reported only; the fix (drop one of the two application sites for
-`patches/paged/`) is out of scope for this verification.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_APPLES.md
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_APPLES.md
@@ -1,111 +0,0 @@
-# Paged llama.cpp vs vLLM - apples-to-apples (batched + NVFP4 + prefix cache)
-
-Definitive matched comparison on a DGX Spark (GB10, sm_121). Both engines batched,
-both NVFP4-class weights, both with prefix caching on, both eager (no CUDA graphs).
-Workload: shared 1024-token system prefix + unique 32-token suffix, generate 64
-tokens, K requests fired concurrently (cold fan-out), one client hitting both
-OpenAI-compatible servers with identical token-id prompts.
-
-This run fixes the two confounders in the earlier comparison (a *serial* Q4_K dev
-driver vs a *batched* FP4 vLLM server). Here both sides are batched and NVFP4.
-
-## Setup
-
- llama.cpp: `llama-server` built from the paged dev tree (`~/llama-paged-dev`,
-  branch `paged`, patches 0001-0007), CUDA `build-cuda/` (sm_121).
-  `LLAMA_KV_PAGED=1`, `-ngl 99 --parallel 32 -c 40960`, model
-  `q3-32b-nvfp4-dense.gguf` (NVFP4 weights, FP4-MMA kernel). OpenAI `/completion`.
- vLLM 0.23.0: `vllm serve q3-32b-nvfp4a16/` (compressed-tensors W4A16 / Marlin),
-  `--enforce-eager --max-model-len 4096 --gpu-memory-utilization 0.9
-  --max-num-seqs 64`, APC on (default). OpenAI `/v1/completions`.
-
-## Finding 1 - the paged cross-request prefix cache does NOT engage in llama-server
-
-This is itself a key result. The paged engine has two distinct mechanisms:
-
-1. Physical paged block placement (patches 0002/0004) - runs inside
-   `llama_kv_cache::find_slot`, gated only by `LLAMA_KV_PAGED`. This DOES engage in
-   the server: with `LLAMA_KV_PAGED_DEBUG=1`, 2 concurrent shared-prefix requests
-   produced 14 `[paged-alloc] ... grew` lines, one stream per `seq`.
-
-2. Cross-request prefix recompute-skip (patch 0007) - the actual fan-out win
-   (`shares N prefix blocks ... prefix NOT recomputed`, ref-counted block sharing).
-   This is reachable ONLY through `paged_prefix_api::share/commit`
-   (`src/paged-prefix-api.cpp`), which only the standalone driver calls.
-
-Evidence it does not reach the server:
-- Static: `grep -rn "paged_prefix\|share_prefix\|LLAMA_KV_PAGED" tools/server/`
-  returns nothing; `nm` on the binary finds no `paged_prefix` symbol use from the
-  server path. Nothing in `llama_decode` or the server calls `share`/`commit`.
-- Runtime: the 2-request verify run logged **0** `shares prefix blocks` /
-  `NOT recomputed` lines. Both `seq=0` and `seq=1` independently grew to 65 blocks,
-  each allocating and recomputing the full ~972-token prefix separately - no
-  cross-slot KV block sharing, no `ref_cnt>1`.
-
-So the 0007 recompute-skip, proven in the driver, does **not** yet reach the
-server. Closing it needs server-side wiring: when admitting a slot whose prompt
-shares a prefix with another live/committed slot, the server would have to call
-the `paged_prefix_api::share` / `commit` seam. That is a future patch.
-
-Note: llama-server has its OWN native prefix reuse (the slot prompt cache /
-"context checkpoints"). In the K=32 wave the server reused the prefix cached by the
-earlier wave, so prefill was only the 32-token suffix (`prompt eval ... / 32
-tokens`). But that is a separate mechanism, it only helps prefill, and prefill is
-not the bottleneck here (see below), so it does not change the verdict.
-
-## Finding 2 - the matched comparison
-
-Both batched, both NVFP4, both prefix-cache on, both eager. Cold concurrent fan-out,
-identical token-id prompts via one client.
-
-| K  | engine   | wall (s) | aggregate gen tok/s | req/s | vLLM speedup |
-|----|----------|----------|---------------------|-------|--------------|
-| 16 | llama.cpp| 50.7     | 18.9                | 0.30  | -            |
-| 16 | vLLM     | 8.57     | 119.5               | 1.87  | ~5.9x        |
-| 32 | llama.cpp| 58.3     | 34.0                | 0.53  | -            |
-| 32 | vLLM     | 8.86     | 231.1               | 3.61  | ~6.6x        |
-
-vLLM APC confirmed engaged: prefix cache hit rate 90.9% (K=16), 95.5% (K=32),
-enforce_eager (CUDA graphs disabled), `enable_prefix_caching=True`.
-
-### Verdict: not competitive - vLLM ~6x faster, and prefix caching is not why
-
-With every confounder removed (both batched, both NVFP4, both eager, both with
-prefix caching on), vLLM is still ~6x faster end-to-end. The gap is decode-bound,
-not prefill/cache-bound:
-
-- The G=64 workload is dominated by decode. In the llama K=32 run, decode was
-  52.98s of the 58.3s wall; prefill was ~3.5s (and only the 32-token suffix, since
-  the server's native prompt cache already reused the prefix). So even perfect
-  prefix sharing - paged or native - cannot move the total much.
-- llama.cpp batched decode: **~828 ms per decode step** at batch 32
-  (1.21 tok/s per sequence).
-- vLLM batched decode: ~170 tok/s aggregate gen at 32 running reqs ->
-  **~185 ms per step**, roughly **4-5x faster per decode step**.
-- CUDA graphs are NOT the differentiator: both sides are eager (llama
-  `graphs reused = 0`, vLLM `--enforce-eager`). The win is vLLM's batched-decode
-  efficiency: PagedAttention + fused W4A16 (Marlin) GEMMs + chunked-prefill
-  scheduler, versus llama.cpp's per-step eager graph and NVFP4-GGUF decode path on
-  this Blackwell-class part.
-
-Because decode dominates, wiring the paged 0007 recompute-skip into the server
-(Finding 1) would mainly remove redundant prefill across slots - a real saving for
-short-generation / long-prefix RAG fan-out, but at G=64 it is a few seconds against
-a decode floor that is already ~6x slower than vLLM. The fan-out win does not, on
-its own, make llama.cpp competitive here; the decode kernel/batching gap is the
-load-bearing factor.
-
-## Caveats
-
-- NVFP4-GGUF is double-quant and is speed-representative (it routes onto the
-  FP4-MMA kernel); output quality is not the subject of this run.
-- vLLM side is NVFP4A16 (W4A16 / Marlin) - 4-bit weights, 16-bit activations;
-  llama side is NVFP4 weights on FP4-MMA. Both are NVFP4-weight class.
-- One llama request per run hit an intermittent HTTP 500 ("output does not match
-  the expected Content-only format" - a Qwen3 thinking-output quirk on
-  `/completion`), so llama counts were 15/16 and 31/32. The failed request returns
-  early and reduces batch contention for the rest, so a clean 16/16 / 32/32 llama
-  run would be marginally slower - i.e. the ~6x gap reported here is conservative
-  (favorable to llama.cpp).
-- Both servers cold-started; numbers are end-to-end wall from the concurrent
-  client. Disk healthy (~325 GB free), GPU otherwise idle.
--- a/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_COMPARE.md
+++ b/backend/cpp/llama-cpp/patches/paged/PAGED_VLLM_COMPARE.md
@@ -1,165 +0,0 @@
-# Paged-attention closing measurements: stock GPU determinism + vLLM comparison
-
-Two closing measurements for the paged-attention series, run on a DGX Spark
-(NVIDIA GB10, compute capability 12.1 / sm_121), CUDA 13. Dev tree
-`~/llama-paged-dev` branch `paged`, paged engine gated by env `LLAMA_KV_PAGED`
-(default-off = stock). Models: `Qwen3-0.6B-Q8_0.gguf` and
-`Qwen3-32B-Q4_K_M.gguf` (llama.cpp), `Qwen3-32B` nvfp4a16 / W4A16 HF safetensors
-(vLLM 0.23.0). All dev drivers are dev-tree-only and not shipped.
-
-## Deliverable 1: stock GPU determinism across batch shapes (no paging)
-
-Question: is the patch-0007 GPU byte-identity "failure" (a near-tie greedy token
-flips on CUDA, e.g. 17971 vs 5671) caused by paging, or is it inherent stock
-CUDA non-determinism from running the same tokens in a different batch shape?
-
-Method: a new dev-only driver `llama-paged-batchshape` (paging explicitly OFF:
-`unsetenv("LLAMA_KV_PAGED")`). For a prompt `[P+S]` it greedy-decodes two ways,
-both stock contiguous KV:
-
-- (a) `full`  - prefill the whole `[P+S]` in ONE `llama_decode`.
-- (b) `split` - prefill `P` in one `llama_decode`, then `S` in a second.
-
-The two paths write byte-identical token ids; the only difference is the
-batch shape submitted to the kernels (full prefill vs P-then-S), which changes
-the float reduction order in the GEMMs and therefore the KV values by tiny
-amounts. 5 distinct prompts, suffix S=16.
-
-### Single next token (the literal T_full vs T_split)
-
-Both CPU and CUDA returned the SAME greedy next token for all 5 prompts
-(0/5 flips). BUT the top-2 logit gap measurably changes with the batch shape on
-CUDA, proving the float order does differ:
-
-```
-CUDA, S=8:  prompt 1  T_full=1896 (gap 0.07072)   T_split=1896 (gap 0.17986)
-CUDA, S=8:  prompt 4  T_full=49584 (gap 0.93304)  T_split=49584 (gap 0.85785)
-```
-
-The argmax simply did not flip on the immediate next token for these prompts -
-the gaps, while shifting, stayed wide enough.
-
-### Generated stream (what 0007 actually byte-asserts)
-
-0007 asserts byte-identity over a *generated* token stream, where the tiny
-prefill-shape KV perturbation accumulates and eventually crosses a near-tie.
-Generating G tokens greedily from `full` vs `split` and reporting first
-divergence:
-
-| gen length | CPU diverged | CUDA diverged |
-|-----------|--------------|---------------|
-| G=24 (0007 default) | 1/5 (prompt 0 @ step 5) | 2/5 (prompt 1 @ step 3, prompt 4 @ step 6) |
-| G=64 | 2/5 (steps 5, 42) | 3/5 (steps 3, 6, 30) |
-
-Example CUDA divergence, pure stock, zero paging:
-`prompt 1: DIVERGES at gen step 3: full=1260 split=576`.
-
-### Verdict (Deliverable 1): HYPOTHESIS HELD
-
-The 0007 GPU byte-identity failure is **stock batch-shape non-determinism, not a
-paged bug**. With paging entirely OFF, stock llama.cpp produces a different
-greedy token stream when the same prompt is processed in a full-prefill batch vs
-a split (prefix-then-suffix) batch - exactly the shape difference that 0007's
-prefix-share path introduces (full B-from-scratch vs prefix-cached + suffix-only).
-
-Refinement (reported honestly): it is **not strictly CUDA-only**. CPU exhibits
-the same divergence, just less often and later (1/5 vs 2/5 at G=24, and CPU's
-flips land at later generation steps). This is exactly why 0007's small, short
-CPU scenarios happened to pass 16/16 while the CUDA run flipped: CUDA's larger
-parallel reductions reorder more aggressively, so a near-tie crosses earlier and
-more frequently. The phenomenon is floating-point GEMM-batching non-determinism,
-inherent to both backends; paging is not the cause.
-
-## Deliverable 2: vLLM vs llama.cpp+paged on a shared-prefix fan-out
-
-Workload: K requests share a 1024-token system prefix, each with a unique
-32-token suffix, then generate 64 tokens. Both engines cache the shared prefix
-(vLLM automatic prefix caching ON by default; llama.cpp via the paged
-cross-request prefix cache, `LLAMA_KV_PAGED=1`).
-
-The quantization is a realistic apples-to-oranges mismatch, reported honestly:
-- llama.cpp: Qwen3-32B **Q4_K_M** (GGUF), `-ngl 99`, CUDA dequant kernels.
-- vLLM: Qwen3-32B **nvfp4a16 (W4A16)**, served via the **Marlin FP4
-  weight-only** kernel because GB10 (sm_121) has **no native FP4 compute** -
-  i.e. vLLM is on a slower-than-ideal kernel path here. vLLM also ran
-  `enforce_eager=True` (no CUDA graphs / torch.compile; the env lacked a working
-  inductor/ninja toolchain), so the vLLM numbers are if anything **conservative**.
-
-### vLLM (automatic prefix caching), end-to-end
-
-APC hits confirmed in the engine log: **"Prefix cache hit rate: 97.0%"**,
-`prefix_cache_hits 33040/34848` (K=16) and `99344/102432` (K=32).
-
-| K | APC | prefill wall (G=1) | total wall (G=64) | throughput |
-|---|-----|--------------------|--------------------|-----------|
-| 16 | ON  | 0.749 s | 6.63 s | 2.41 req/s |
-| 16 | OFF | 20.19 s | 27.21 s | 0.59 req/s |
-| 32 | ON  | 1.13 s  | 7.56 s | 4.23 req/s |
-| 32 | OFF | 40.19 s | 48.71 s | 0.66 req/s |
-
-vLLM's APC cuts the fan-out prefill ~27x (K=16) to ~36x (K=32) vs APC-off; the
-huge ratio reflects how slow the FP4-emulation prefill is when forced to
-recompute all K prefixes.
-
-### llama.cpp + paged prefix cache (prefill phase)
-
-The paged shared-prefix bench (`llama-paged-prefix-bench`, `BENCH_GEN=0`,
-`PAGED_NGL=99`). Reuse confirmed: `kshare(seq1)=1024`, shared-block
-`ref_cnt = K` (all sequences hold the one prefix), 15360 / 31744 prefix tokens
-skipped.
-
-| K | mode | prefill tokens submitted | prefill wall | vs no-share |
-|---|------|--------------------------|--------------|-------------|
-| 16 | PAGED-SHARE | 1536  | 3.66 s  | 7.15x |
-| 16 | NO-SHARE    | 16896 | 26.17 s | 1.0x  |
-| 32 | PAGED-SHARE | 2048  | 6.04 s  | 10.3x |
-| 32 | NO-SHARE    | 33792 | 62.17 s | 1.0x  |
-
-The paged prefix cache delivers the expected **7.15x (K=16) / 10.3x (K=32)**
-prefill wall-time reduction - the headline cross-request prefix-skip win, on a
-real 32B model on GPU.
-
-### Head-to-head, both engines caching the shared prefix
-
-Prefill of the cached fan-out (vLLM G=1, ~prefill; llama.cpp G=0, pure prefill):
-
-| K | llama.cpp+paged prefill | vLLM APC prefill | vLLM faster by |
-|---|-------------------------|------------------|----------------|
-| 16 | 3.66 s | 0.749 s | ~4.9x |
-| 32 | 6.04 s | 1.13 s  | ~5.3x |
-
-### Verdict (Deliverable 2): competitive in kind, behind in absolute terms
-
-With both engines caching the shared prefix, **llama.cpp+paged is qualitatively
-competitive but absolutely behind vLLM on this GB10 box**:
-
-- **Same optimization, same order of magnitude.** llama.cpp's paged prefix cache
-  reproduces exactly the win vLLM's APC gives - skip the shared-prefix recompute
-  - and yields a 7-10x prefill reduction vs its own no-share baseline. On the
-  RAG/system-prompt fan-out the algorithmic gap is closed: llama.cpp no longer
-  pays K x prefix.
-
-- **vLLM still wins head-to-head by ~5x on the cached prefill** (0.75s vs 3.66s
-  at K=16; 1.13s vs 6.04s at K=32), and by more end-to-end because it does
-  **continuous batched decode** (all K sequences decoded in one fused step)
-  while the llama.cpp paged *dev driver* decodes each sequence serially. That
-  decode-batching gap is a property of the serving stack, not of the paged
-  prefix cache. Notably vLLM wins here while handicapped (eager mode, FP4
-  weight-only emulation with no native FP4 on GB10); a tuned vLLM would lead by
-  more.
-
-- **Honest caveats / blockers.** (1) Quant differs (Q4_K_M vs nvfp4a16). (2) The
-  comparison is prefill-vs-prefill plus vLLM end-to-end; a clean llama.cpp
-  end-to-end on this driver is blocked because its generation phase has a
-  stale-logits bug (`get_logits_ith` reads seq 0's prefill index after later
-  sequences' prefills overwrote the logits buffer -> segfault), and even fixed
-  its decode is serial, so it would not be apples-to-apples vs vLLM's batched
-  decode. The fair end-to-end llama.cpp number needs the grpc / llama-server
-  continuous-batching path, not this dev scaffold. (3) vLLM ran eager + FP4
-  emulation, making its numbers conservative.
-
-Bottom line: paged gives llama.cpp the cross-request prefix-skip that vLLM's APC
-provides, which is the categorical win and removes the K x prefix penalty on
-RAG/system-prompt fan-out. On absolute wall-time on this hardware vLLM retains a
-~5x prefill lead and a larger end-to-end lead from continuous batched decode and
-a more optimized serving stack.
--- a/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
+++ b/backend/cpp/llama-cpp/patches/paged/QWEN36_NVFP4_BENCH.md
@@ -1,464 +0,0 @@
-# Qwen3.6 NVFP4-vs-NVFP4: llama.cpp vs vLLM on GB10 (DGX Spark)
-
-Apples-to-apples benchmark. Both engines run the **same NVFP4 weights** on the **same box**
-(GB10, sm_121, LPDDR5x unified memory ~273 GB/s). The question is not "who wins the HW
-lottery" but "at matched NVFP4, on one bandwidth-limited box, does our paged llama.cpp
-(patch 0015, expert-density-aware MoE token-tile auto-select, default-on) sit at par with /
-ahead of / behind vLLM?"
-
----
-
-# FINAL shipping benchmark (patch 0023, f32 bit-exact build) — 2026-06-26
-
-This is the **publishable, plot-ready** apples-to-apples result. Both engines at their **best
-realistic config** (no handicapping either side), matched NVFP4 weights, one clean GB10 box
-(LocalAI service containers stopped for the duration, restored after). Raw rows in
-[`final_benchmark.csv`](final_benchmark.csv); per-row checkpoint log in
-[`BENCHMARK_PROGRESS.md`](BENCHMARK_PROGRESS.md).
-
-## Build under test (the clean shipping result)
-
-- **llama.cpp** = patch **0023**, dev tree `~/llama-paged-dev` HEAD **`f7409c2`**, git-clean
-  (the shelved bf16-GDN-state work was reverted; `git diff` empty at HEAD before the
-  `build-cuda` rebuild). Greedy gate confirmed canonical f32 output on both models. The bf16
-  GDN-state path is **shelved** (it fails the f32 KL gate); the shipped plateau is the
-  **95%-bit-exact f32** stack (patches 0018-0023). dense greedy md5 `5951a5b4…`, MoE
-  `07db32c2…` are the 0023 references (the *transcript* md5 also encodes llama-cli UI chrome,
-  which has since changed, so the build was verified instead via the clean git tree + full
-  rebuild + the greedy numerical gate).
-
-## Config (both engines at BEST realistic config)
-
-- **llama-server**: `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
-  `LLAMA_KV_PAGED=1`, **CUDA graphs ON** (`USE_GRAPHS=1`, default), and the QoS prefill budget
-  **`LLAMA_MAX_BATCH_TOKENS=512`** (patch 0016 decode-first dynamic budget). 512 is the
-  `n_ubatch` floor and is the best of the swept budgets: at npl32 it gives 133 s TTFT vs
-  **394 s for stock** (no budget) — lower budget = stronger decode-first = better burst TTFT,
-  and decode throughput is budget-independent.
-- **vLLM 0.23.0**: its strongest honest decode config — **CUDA graphs ON** (NOT
-  `--enforce-eager`; `cudagraph_mode=FULL_AND_PIECEWISE`), `--gpu-memory-utilization 0.85
-  --max-model-len 4096 --max-num-seqs 256 -tp 1`, chunked prefill on, prefix caching off.
-- **Client** (`h2h_cli3.py`, identical async harness both sides): 512-token **unique-nonce**
-  prompt (fresh full prefill every request, defeats all prefix caching), `max_tokens=256`,
-  `temperature=0`, `ignore_eos=True`, streaming with usage; concurrency npl 8/32/64/128.
-- **Precision asymmetry (in llama's disfavour, yet llama still competes)**: llama runs
-  **f32 GDN recurrent state + q8 activations**; vLLM runs **bf16 GDN state + w4a4**. The
-  numbers below are llama at *higher* precision.
-
-## DENSE — Qwen3.6-27B NVFP4 (`q36-27b-nvfp4`)
-
-| npl | engine | decode_agg t/s | decode_perseq t/s | prefill t/s | ttft_mean ms | peak_gb | engine_gb |
-|----:|--------|---------------:|------------------:|------------:|-------------:|--------:|----------:|
-|   8 | llama  | **82.5**  | 9.57 | 507  | 6 038    | 53.5  | 50.2  |
-|   8 | vLLM   | 70.4      | 8.76 | 2096 | 1 861    | 110.9 | 107.6 |
-|  32 | llama  | **192.6** | 4.79 | 115  | 133 552  | 69.6  | 66.3  |
-|  32 | vLLM   | 211.8     | 6.28 | 2183 | 5 353    | 110.9 | 107.6 |
-|  64 | llama  | **277.8** | 3.09 | 96   | 321 619  | 84.0  | 80.6  |
-|  64 | vLLM   | 309.1     | 4.38 | 2089 | 9 512    | 110.9 | 107.6 |
-| 128 | llama  | **384.6** | 1.86 | 70   | 902 763  | 93.8  | 90.5  |
-| 128 | vLLM   | 418.8     | 2.79 | 1929 | 18 450   | 111.0 | 107.6 |
-
-**llama decode as % of vLLM (dense):** npl8 **117%**, npl32 **91%**, npl64 **90%**, npl128 **92%**.
-
-## MoE — Qwen3.6-35B-A3B NVFP4 (`q36-35b-a3b-nvfp4`)
-
-| npl | engine | decode_agg t/s | decode_perseq t/s | prefill t/s | ttft_mean ms | peak_gb | engine_gb |
-|----:|--------|---------------:|------------------:|------------:|-------------:|--------:|----------:|
-|   8 | llama  | 211.8 | 24.45 | 1236 | 2 477   | 39.7  | 36.1  |
-|   8 | vLLM   | 256.5 | 31.84 | 5187 | 769     | 109.6 | 106.3 |
-|  32 | llama  | 393.0 | 10.02 | 1214 | 8 225   | 47.1  | 43.8  |
-|  32 | vLLM   | 500.8 | 14.90 | 6223 | 1 830   | 109.6 | 106.4 |
-|  64 | llama  | 527.0 | 6.15  | 1152 | 15 850  | 57.1  | 53.8  |
-|  64 | vLLM   | 686.1 | 9.83  | 5927 | 3 224   | 109.6 | 106.4 |
-| 128 | llama  | 726.4 | 3.73  | 277  | 213 017 | 61.5  | 58.2  |
-| 128 | vLLM   | 882.2 | 6.05  | 5301 | 6 488   | 109.6 | 106.4 |
-
-**llama decode as % of vLLM (MoE):** npl8 **83%**, npl32 **78%**, npl64 **77%**, npl128 **82%**.
-
-## Plots (decode throughput vs concurrency)
-
-Generated from [`final_benchmark.csv`](final_benchmark.csv) (matplotlib); the per-point label is
-llama as a share of vLLM decode at that concurrency.
-
-![dense decode vs npl](qwen36_dense_decode_vs_npl.png)
-
-![MoE decode vs npl](qwen36_moe_decode_vs_npl.png)
-
-## The honest public story (let the numbers speak)
-
-1. **Decode throughput — the headline.** On the dense 27B, paged llama.cpp **matches/beats
-   vLLM**: 117% of vLLM at npl8 and a steady **90-92%** across npl32-128 — at *higher*
-   precision (f32 GDN state + q8 act vs vLLM bf16 + w4a4). On the MoE 35B-A3B llama lands at
-   **77-83%** of vLLM decode — close, but vLLM's fused grouped-GEMM MoE keeps a clear edge.
-2. **Memory — a decisive llama win.** vLLM's pre-reserved pool is a **flat ~107 GB** at every
-   concurrency (the `--gpu-memory-utilization 0.85` design). llama's **on-demand paged KV**
-   uses **50-90 GB (dense)** and **36-58 GB (MoE)**, growing with load: at the operating point
-   most people actually run (npl≤32) llama uses **~1.5-3× less unified memory**, and even at
-   npl128 it stays below vLLM. This is the "fits where vLLM OOMs" axis.
-3. **TTFT — vLLM's win, llama's disclosed tradeoff.** vLLM's chunked prefill absorbs a
-   128-way simultaneous burst gracefully (6-18 s). llama's decode-first QoS budget protects
-   decode throughput by throttling burst-prefill, so TTFT climbs at high concurrency
-   (dense npl128 **903 s**, MoE npl128 **213 s**). It is *bounded relative to no-budget*
-   (stock is worse) but high in absolute terms under a synchronized burst. Under realistic
-   staggered arrival this is far milder; for a synchronized-burst benchmark it is the cost of
-   the decode-first scheduler. **Decode and memory are unaffected.**
-
-**Bottom line for the GB10 / DGX Spark page:** with matched NVFP4 weights, paged llama.cpp
-delivers **90-117% of vLLM dense decode** and **77-83% of vLLM MoE decode** at **equal-or-higher
-precision** and **1.5-3× lower memory** (on-demand paged KV vs a fixed 107 GB pool). The
-remaining gap is MoE-decode and burst-TTFT, not dense-decode or memory.
-
-## Anomalies / methodology notes (rigour)
-
-- **Paged-pool burst degradation (real, worked around).** After a high-npl burst, a llama
-  server's *subsequent lower-npl* prefill collapses (npl8 fresh = 507 t/s / 6 s TTFT; the same
-  npl8 *after* an npl64 burst = 65 t/s / 64 s TTFT). Decode is unaffected. To measure clean
-  per-config prefill/TTFT, **the llama server is restarted per npl** (cheap vs the prefill
-  cost). vLLM has no such degradation — verified by an end-of-sweep npl8 re-check that matched
-  the opening npl8 (dense 70.4→73.4, MoE 256.5→226.4) — so vLLM uses one server per combo.
-- **Fresh-prefill discipline.** Every measured request uses a unique nonce so prefill is always
-  a full fresh compute (the task's "defeat prefix caching" intent); vLLM ran with
-  `enable_prefix_caching=False`, llama with `cache_prompt:false`. Apples-to-apples.
-- **No bimodality observed.** With per-npl restart + a cheap (ptok=8) graph warmup, the early
-  two-pass checks matched within <0.5% (npl8 486/484 t/s), so the headline uses one stable
-  measured pass per (model,engine,npl).
-- **Clean environment.** The benchmark's peak (dense ~94 GB) plus the idle LocalAI worker's
-  ~30 GB resident model OOM-cycled the service containers on the first attempt and corrupted
-  one run; the `local-ai`/`local-ai-worker` containers were stopped for the measurement
-  (baseline ~3.3 GB, ~120 GB free) and **restarted afterwards** to return the host.
-- **peak_gb** is absolute unified-memory used (`MemTotal-MemAvailable`) peak; `engine_gb` =
-  peak − the ~3.3 GB OS baseline (the per-config engine footprint).
-- **Internal-consistency check (decode_agg vs perseq×npl).** `decode_agg_tps` is the steady-state
-  aggregate over the decode window; `decode_perseq_tps` is each sequence's lifetime rate (output
-  tokens ÷ total request latency, so it *includes* the TTFT queue wait). They coincide when
-  TTFT ≪ decode-window (vLLM npl8: 70.4 vs 70.1, +0.5%) and diverge exactly as TTFT grows, on
-  **both** engines (the agg−perseq×npl gap rises monotonically with `ttft_mean`: vLLM 0.5%→17%,
-  llama 8%→62% across npl8→128, mirroring its 6 s→903 s TTFT). The relationship is governed by
-  TTFT, not a measurement artifact, and the FINAL rows are distinct from the historical patch-0015
-  table (no stale-baseline carry-over).
-
----
-
-## Setup (historical — patch 0015 run; FINAL section above is the shipping 0023 result)
-
-- **Box**: GB10 / DGX Spark, sm_121, unified LPDDR5x (~273 GB/s). Memory figures are
-  unified-memory used GB (`MemTotal-MemAvailable`), so they cover weights + KV + runtime.
-- **llama.cpp**: dev tree `~/llama-paged-dev` branch `paged` HEAD `151343b` (patch 0015),
-  `build-cuda` sm_121, `LLAMA_KV_PAGED=1`, `llama-server -c 131072 --parallel 128 -b 2048
-  -ub 512 -ngl 99 -fa on`. **NOTE: run WITHOUT `max_prefill_tokens` (patch 0013) - see the
-  TTFT caveat in the verdict.**
-- **vLLM**: 0.23.0, `--enforce-eager --gpu-memory-utilization 0.85 --max-model-len 4096
-  --max-num-seqs 256 -tp 1`.
-- **Client**: identical async client for both engines. Per request: 512-token unique prompt
-  (unique leading tokens defeat cross-request prefix caching), `max_tokens=256`,
-  `temperature=0`, `ignore_eos=True`, streaming with usage. Concurrency (npl) swept 8/32/64/128.
-- **Metrics** (localmaxxing.com schema): `decode_agg_tps` (aggregate decode tok/s across all
-  live seqs), `decode_perseq_tps` (mean per-sequence decode), `prefill_tps`, `ttft_mean_ms`,
-  `PEAK_GB` (unified-memory peak).
-
-## The 4 models (NVFP4, matched weights)
-
-| Model | llama.cpp GGUF | vLLM checkpoint | Match |
-|-------|----------------|-----------------|-------|
-| DENSE Qwen3.6-27B (28B dense) | `q36-27b-nvfp4.gguf` (native Blackwell FP4) | `q36-27b-nvfp4-vllm/` (unsloth TRUE W4A4) | clean W4A4 both sides |
-| MoE Qwen3.6-35B-A3B (36B total, ~3B active) | `q36-35b-a3b-nvfp4.gguf` (241 NVFP4 tensors, nvidia weights) | `q36-35b-a3b-nvfp4-vllm/` (nvidia modelopt; vLLM picks Marlin NvFp4 MoE + FA2) | NVFP4 weight-only, identical nvidia weights |
-
----
-
-## Results (decode aggregate tok/s, per-seq, prefill, TTFT, peak GB)
-
-### MoE Qwen3.6-35B-A3B (~3B active)
-
-| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
-|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
-| 8   | llama  | 170.2 | 20.27 | 2813 | 855     | 38.98 |
-| 8   | vLLM   | 202.0 | 24.92 | 4648 | 799     | 111.49 |
-| 32  | llama  | 235.4 | 6.77  | 2005 | 4970    | 43.06 |
-| 32  | vLLM   | 462.0 | 13.59 | 4755 | 2308    | 111.26 |
-| 64  | llama  | 271.7 | 3.88  | 2389 | 7205    | 52.53 |
-| 64  | vLLM   | 624.5 | 8.90  | 4784 | 4072    | 111.46 |
-| 128 | llama  | 292.2 | 2.05  | 657  | 84800   | 61.42 |
-| 128 | vLLM   | 811.1 | 5.46  | 4263 | 7980    | 111.61 |
-
-llama decode as % of vLLM: **84 / 51 / 44 / 36** at npl 8/32/64/128.
-
-### DENSE Qwen3.6-27B
-
-| npl | engine | decode agg | decode/seq | prefill | TTFT mean ms | peak GB |
-|----:|--------|-----------:|-----------:|--------:|-------------:|--------:|
-| 8   | llama  | 63.8  | 7.60 | 1117 | 2029    | 51.72 |
-| 8   | vLLM   | 64.3  | 7.98 | 1514 | 2593    | 112.07 |
-| 32  | llama  | 108.9 | 3.08 | 752  | 13212   | 61.48 |
-| 32  | vLLM   | 189.8 | 5.57 | 1555 | 7477    | 112.09 |
-| 64  | llama  | 126.2 | 1.78 | 465  | 53818   | 74.90 |
-| 64  | vLLM   | 284.2 | 3.92 | 1526 | 12942   | 112.11 |
-| 128 | llama  | 134.6 | 0.93 | 125  | 491195  | 94.03 |
-| 128 | vLLM   | 390.7 | 2.50 | 1420 | 24806   | 112.12 |
-
-llama decode as % of vLLM: **99 / 57 / 44 / 34** at npl 8/32/64/128.
-
----
-
-## Verdict
-
-**At matched NVFP4 on one GB10 box: llama.cpp is at parity only at low concurrency; vLLM
-scales substantially better as concurrency rises.**
-
-1. **npl=8 (low concurrency): near parity.** Dense 99%, MoE 84% of vLLM decode. The MoE's
-   ~3B active shows: per-seq decode 20-25 tok/s (MoE) vs 8 tok/s (dense) on both engines.
-
-2. **npl>=32 (high concurrency): vLLM pulls decisively ahead** - decode ~2x (npl32) rising to
-   ~2.8-2.9x (npl128) on both models. vLLM scales monotonically (dense 64->391, MoE 202->811);
-   llama plateaus (dense 64->135, MoE 170->292).
-
-3. **TTFT is the clearest gap, and it is largely self-inflicted here.** llama's TTFT explodes
-   at high concurrency (dense **491 s**, MoE **85 s** at npl128) while vLLM stays bounded (25 s,
-   8 s). **This run used llama WITHOUT `max_prefill_tokens` (patch 0013)** - so 128 concurrent
-   512-token prefills starve each other and the decode. Crucially, that starvation also drags
-   `decode_agg` down: while many slots are stuck prefilling, fewer are actually decoding, so the
-   measured aggregate understates llama's steady-state decode. A re-run with `max_prefill_tokens`
-   (the QoS budget this PR already ships) is expected to bound TTFT AND raise high-concurrency
-   decode by keeping all slots live.
-
-4. **Memory: llama wins on efficiency.** vLLM pre-reserves the whole pool (~112 GB at
-   gpu-mem-util 0.85); llama grows on demand (MoE 38->61 GB, dense 52->94 GB). The paged
-   on-demand KV is materially more memory-efficient / multi-tenant-friendly.
-
-5. **vs the localmaxxing reference (259.5 MoE / 254.8 dense top-speed):** those are single-stream
-   on fast datacenter HW. GB10 per-seq decode tops out far lower (MoE ~25, dense ~8 tok/s at
-   npl8) - the LPDDR5x ~273 GB/s bandwidth floor, as expected. The reference is a ceiling, not a
-   GB10 target.
-
-### Honest bottom line
-
-The "par-or-beat vLLM" goal is **met at low concurrency but NOT at high concurrency** on these
-NVFP4 models. vLLM's continuous-batched decode + bounded prefill scheduling scale better on a
-bandwidth-limited box. Two of the three gap drivers are addressable on our side: (a) **prefill
-starvation** - re-run with `max_prefill_tokens` (patch 0013), which this PR ships; (b) **decode
-batching efficiency at high concurrency** - the runtime/scheduler lever (the small/unsaturated
-regime). The kernel itself is at parity (npl8). Next step: a fair re-run with the prefill budget
-on, plus decode-batch tuning, to get llama's true high-concurrency numbers before concluding the
-absolute gap.
-
----
-
-## Fair re-run (max_prefill_tokens on)
-
-The prior tables ran llama-server **without** the QoS prefill budget (patch 0013). This section
-re-runs the same A/B with `LLAMA_PREFILL_BUDGET` set, sweeping the per-step prompt-token cap over
-**256 / 512 / 1024**. Everything else is byte-identical to the prior run: dev-tree llama-server
-(branch paged, HEAD `151343b`), `-c 131072 --parallel 128 -b 2048 -ub 512 -ngl 99 -fa on`,
-`LLAMA_KV_PAGED=1`, same workload (512-token unique prompt, `max_tokens=256`, `temperature=0`,
-`ignore_eos`), same harness (`h2h_moe_sweep.sh` -> `h2h_cli.py`). vLLM numbers are unchanged
-(carried over from the committed dense table, not re-run).
-
-### DENSE Qwen3.6-27B - budget sweep (decode agg tok/s | TTFT mean ms | peak GB)
-
-| npl | metric | stock (no budget) | budget 256 | budget 512 | budget 1024 | vLLM |
-|----:|--------|------------------:|-----------:|-----------:|------------:|-----:|
-| 8   | decode agg | 63.8  | 63.5   | 63.8   | 63.5   | 64.3  |
-| 8   | TTFT ms    | 2029  | 4255   | 3756   | 2653   | 2593  |
-| 32  | decode agg | 108.9 | 105.7  | 107.7  | 108.8  | 189.8 |
-| 32  | TTFT ms    | 13212 | 23114  | 18934  | 13912  | 7477  |
-| 64  | decode agg | 126.2 | 132.0  | 131.2  | 118.2  | 284.2 |
-| 64  | TTFT ms    | 53818 | 109455 | 74272  | 92450  | 12942 |
-| 128 | decode agg | 134.6 | **161.2** | 146.9 | 128.3 | 390.7 |
-| 128 | TTFT ms    | 491195| **305423**| 543448| 424058| 24806 |
-
-Peak host GB is budget-independent (on-demand paged KV grows with concurrency): ~51.5 (npl8) ->
-~61.5 (npl32) -> ~74.7 (npl64) -> ~93.5 (npl128) for every budget, vs vLLM's flat ~112.1.
-
-### Best budget = 256 (only the saturated npl128 regime benefits)
-
-At the fully-saturated point (npl128), **budget 256 is the clear winner on both axes**:
-
-- **decode_agg: 134.6 -> 161.2 tok/s (+19.8%)** vs the starved stock run.
-- **TTFT mean: 491.2 s -> 305.4 s (-37.8%, -186 s)** vs stock.
-- llama decode as % of vLLM at npl128: **34.5% -> 41.3%**. TTFT still ~12x vLLM's 24.8 s.
-
-Larger budgets help less at npl128 (512 -> 146.9 tok/s; 1024 -> 128.3, i.e. ~stock) because a
-looser cap lets a long prefill grab a bigger slice per step and re-introduce decode jitter. So
-the tightest cap (256) protects in-flight decode the most when the box is saturated.
-
-### Honest caveat: this bursty workload is the worst case for TTFT
-
-At npl 8 / 32 / 64 the budget **raised** TTFT (e.g. npl8 2029 -> 4255 ms at budget 256) and left
-decode_agg roughly flat. Reason: the harness fires all N requests simultaneously, so at t=0 there
-is **no in-flight decode to protect** - capping prefill purely defers first tokens. The budget
-only pays off once enough slots are decoding that an unbounded prefill would starve them, which on
-this box happens only at npl128. Budget 1024 tracks stock closely at light load (npl8 TTFT 2653 ~
-stock 2029) because a 512-token prompt fits in one <=1024 step. In a steadier (staggered) arrival
-pattern the budget would protect decode jitter without the burst-TTFT penalty; that regime is not
-exercised here.
-
-### Bottom line (dense)
-
-The prefill budget is a **real but narrow** lever on this workload: at maximum saturation
-(npl128) budget=256 lifts decode_agg ~20% and cuts TTFT ~38% vs the starved run, moving llama
-from 34.5% to 41.3% of vLLM decode. It does **not** close the gap - vLLM still decodes ~2.4x
-faster and keeps TTFT ~12x lower at npl128, and scales monotonically where llama plateaus. At
-light/moderate concurrency the budget is net-negative for TTFT in this all-at-once workload, so it
-should be applied selectively (high-concurrency serving), not as an unconditional default.
-
-## MoE 35B-A3B fair re-run (max_prefill_tokens on)
-
-Same build (HEAD 151343b, P0+P1 patch 0015), same flags (`-c 131072 --parallel 128 -b 2048
-ub 512 -ngl 99 -fa on`, `LLAMA_KV_PAGED=1`), same all-at-once harness (512-tok unique prompt,
-gen 256, temp 0, ignore_eos). Swept the dense winner budget 256 plus neighbor 512.
-
-### Primary table - budget 256 (each cell: decode_agg tok/s / TTFT mean ms / peak host GB)
-
-| npl | stock (no budget) | budget 256 (best) | budget 512 | vLLM |
-|----:|------------------:|------------------:|-----------:|-----:|
-| 8   | 170.2 / 855   / -    | 169.3 / 1655  / 38.95 | 172.1 / 1488  / 38.82 | 202.0 / 799  |
-| 32  | 235.4 / 4970  / -    | 239.0 / 9034  / 42.93 | 234.7 / 7260  / 42.72 | 462.0 / 2308 |
-| 64  | 271.7 / 7205  / -    | 277.0 / 16249 / 51.96 | 274.5 / 13660 / 52.53 | 624.5 / 4072 |
-| 128 | 292.2 / 84800 / -    | **333.5 / 98106 / 61.42** | 300.8 / 92470 / 61.45 | 811.1 / 7980 |
-
-Peak host GB (paged KV, budget-independent): ~38.9 (npl8) -> ~42.8 (npl32) -> ~52 (npl64) ->
-~61.4 (npl128). Far below the dense run (94 GB @npl128) - only ~3B params are active, so the KV
-plus activations footprint stays light even fully saturated.
-
-### MoE inverts the dense story: the budget buys decode, NOT TTFT
-
-Unlike the dense 27B (where the stock run was prefill-starved to 491 s TTFT @npl128 and the budget
-cut it 38%), the MoE stock run was **never prefill-starved**: 3B active params make prefill cheap,
-so stock TTFT @npl128 was already only 84.8 s. Capping prefill therefore cannot rescue TTFT - it
-can only **defer first tokens to free decode steps**. Result at npl128 with budget 256:
-
- **decode_agg: 292.2 -> 333.5 tok/s (+14.1%)** vs the starved stock run.
- **TTFT mean: 84.8 s -> 98.1 s (+15.7%, WORSE)** - the budget costs latency here.
- llama decode as % of vLLM @npl128: **36.0% -> 41.1%**. TTFT now ~12.3x vLLM's 7.98 s.
-
-Budget 512 is the milder trade (decode +3.0% to 300.8, TTFT +9.0% to 92.5 s @npl128). Budget 256
-maximizes decode throughput; 512 if you want to bleed less TTFT. At npl 8/32/64 both budgets are
-net-negative or flat on decode and clearly raise TTFT (e.g. npl64 7.2 s -> 16.2 s @b256), the same
-all-at-once burst artifact seen in the dense run.
-
-### Does the ~3B-active decode scale better now? Yes - the plateau is gone
-
-The headline win is the **decode scaling curve**, not any single point:
-
-| npl step | stock decode_agg | budget-256 decode_agg |
-|---------:|-----------------:|----------------------:|
-| 8 -> 32  | 170 -> 235 (+38%) | 169 -> 239 (+41%) |
-| 32 -> 64 | 235 -> 272 (+16%) | 239 -> 277 (+16%) |
-| 64 -> 128| 272 -> 292 (**+7.4%**, plateauing) | 277 -> 333.5 (**+20.4%**, still climbing) |
-
-Stock MoE decode **plateaus** at saturation (+7.4% over the last doubling) because unbounded
-prefills keep stealing steps from the many ready decode slots. Budget 256 removes that ceiling -
-decode keeps climbing +20% into npl128, so more of the 128 slots actually decode concurrently.
-This is the cleanest evidence that patch 0013 protects in-flight decode once enough slots are live.
-
-### Bottom line (MoE)
-
-For the A3B MoE the prefill budget is a **decode-throughput lever, paid for in TTFT** - the mirror
-image of the dense case. Budget 256 lifts decode_agg +14% @npl128 and, more importantly, restores
-monotonic decode scaling (kills the stock plateau), moving llama from 36.0% to 41.1% of vLLM
-decode - the same ~41% ceiling the dense run hit. It does **not** close the gap: vLLM still decodes
-~2.4x faster (811 vs 333.5) and holds TTFT ~12x lower (8.0 s vs 98.1 s) @npl128, and scales
-monotonically and steeply where llama only partially recovers. Net: apply the budget to saturated
-MoE serving when decode throughput is the objective and some extra TTFT is acceptable; for
-latency-sensitive MoE serving leave it off (stock TTFT was already not the bottleneck here).
-
---
-
-## Fair re-run verdict
-
-This is the synthesis after patch 0013 (`max_prefill_tokens` / `LLAMA_PREFILL_BUDGET`) was turned
-on for both models. It answers three questions: how much of the apparent gap was prefill
-starvation, what genuine gap to vLLM remains after that artifact is removed, and where that leaves
-the "par-or-beat vLLM" goal.
-
-### 1. How much did patch 0013 close the gap?
-
-The original (stock) tables blamed two things on llama: an exploding TTFT and a flat decode curve
-at high concurrency. The budget re-run shows these were **two different problems with two
-different root causes**, and only one was prefill starvation.
-
-**Dense 27B - was genuinely prefill-starved.** Dense prefill is expensive (full 27B weights per
-token), so 128 simultaneous 512-token prefills truly starved both first-tokens and decode. Budget
-256 @npl128:
-
-| metric @npl128 | stock | budget 256 | vLLM | what closed |
-|----------------|------:|-----------:|-----:|-------------|
-| TTFT mean | 491.2 s | **305.4 s** (-37.8%) | 24.8 s | starvation real; -186 s recovered |
-| decode_agg | 134.6 | **161.2** (+19.8%) | 390.7 | freed slots now decode |
-| llama as % of vLLM decode | 34.5% | **41.3%** | 100% | +6.8 pts |
-
-Dense llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **99 / 56 / 46 / 41** (was 99/57/44/34).
-The fix moved only the saturated tail; npl 8/32 were never starved and are essentially unchanged.
-
-**MoE 35B-A3B - was NOT prefill-starved (the inversion).** Only ~3B active params, so prefill was
-already cheap and stock TTFT @npl128 was 84.8 s, not dense's 491 s. There was no starvation to
-rescue, so the budget could not cut TTFT - it instead converted deferred prefill into decode
-steps. Budget 256 @npl128:
-
-| metric @npl128 | stock | budget 256 | vLLM | direction |
-|----------------|------:|-----------:|-----:|-----------|
-| TTFT mean | 84.8 s | 98.1 s (+15.7%, WORSE) | 7.98 s | budget costs latency here |
-| decode_agg | 292.2 | **333.5** (+14.1%) | 811.1 | plateau removed |
-| llama as % of vLLM decode | 36.0% | **41.1%** | 100% | +5.1 pts |
-
-MoE llama-as-%-of-vLLM after the fix, npl 8/32/64/128: **84 / 52 / 44 / 41** (was 84/51/44/36).
-The decisive MoE finding is the scaling curve, not the point: stock decode plateaued over the last
-doubling (64->128 = +7.4%); budget 256 restored monotonic scaling (+20.4%), proving the stock flat
-curve was unbounded prefill stealing steps from ready decode slots, not a kernel ceiling.
-
-**Combined takeaway.** Both models converge to the **same ~41% of vLLM decode at npl128** after the
-fix. That convergence is the signal: once prefill starvation is removed, dense and a 12x-cheaper-
-prefill MoE land on the identical ceiling, which means the remaining gap is **not** about prefill
-at all - it is the decode scheduler.
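
The convergence claim rests on a few ratios that can be recomputed from the npl128 rows. The sketch below uses only the decode_agg figures quoted in this document; `summarize` and its field names are hypothetical, introduced for illustration.

```python
# decode_agg tok/s at npl128, copied from the tables in this document.
RUNS = {
    "dense": {"stock": 134.6, "budget256": 161.2, "vllm": 390.7},
    "moe": {"stock": 292.2, "budget256": 333.5, "vllm": 811.1},
}


def pct_of(value, reference):
    return 100.0 * value / reference


def pct_change(before, after):
    return 100.0 * (after - before) / before


def summarize(run):
    return {
        "stock_pct_of_vllm": round(pct_of(run["stock"], run["vllm"]), 1),
        "budget_pct_of_vllm": round(pct_of(run["budget256"], run["vllm"]), 1),
        "decode_lift_pct": round(pct_change(run["stock"], run["budget256"]), 1),
        "vllm_speedup": round(run["vllm"] / run["budget256"], 2),
    }


if __name__ == "__main__":
    for name, run in RUNS.items():
        print(name, summarize(run))
```

Both models land within a fraction of a point of each other on `budget_pct_of_vllm`, which is the convergence the takeaway refers to.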
-
-### 2. The honest remaining gap to vLLM
-
-After patch 0013, the residual gap is the **continuous-batched-decode efficiency** lever, and it is
-real, not an artifact:
-
- vLLM still decodes **~2.4x faster** at npl128 on both models (390.7 vs 161.2 dense; 811.1 vs
-  333.5 MoE).
- vLLM holds TTFT **~12x lower** at npl128 (24.8 vs 305.4 s dense; 8.0 vs 98.1 s MoE) - and does so
-  while decoding faster, i.e. no latency/throughput trade.
- **vLLM scales monotonically and steeply** (dense 64->391, MoE 202->811 across npl 8->128); llama,
-  even with the budget, only **partially** recovers its scaling (dense 64->161, MoE 170->334).
-
-The mechanism: vLLM's scheduler interleaves prefill and decode at token granularity (chunked
-prefill + paged continuous batching) every step, keeping the GPU saturated with a near-optimal mix.
-Patch 0013 is a coarser tool - a static per-step prefill **cap** - which protects in-flight decode
-but does not actively schedule the prefill/decode mix, and on the bursty all-at-once harness it
-defers first tokens (the TTFT penalty at npl 8/32/64, and the MoE TTFT regression @npl128). The gap
-that remains is the **quality of the step-by-step batching decision**, not raw kernel speed: at
-npl8 the kernels are at parity (dense 99%, MoE 84%), so the per-token math is competitive - what
-vLLM does better is keeping more sequences productively in-flight every step as concurrency rises.
-
-### 3. Where this leaves "par-or-beat vLLM", and the last lever
-
-**Where llama is competitive today (NVFP4, GB10):**
-
- **Low concurrency (npl<=8): at parity.** Dense 99%, MoE 84% of vLLM decode, comparable TTFT.
-  For single-user / few-stream local serving - LocalAI's dominant mode - llama.cpp is already
-  there on matched NVFP4.
- **Memory efficiency: llama wins outright at every concurrency.** On-demand paged KV (dense
-  52->94 GB, MoE 39->61 GB) vs vLLM's flat ~112 GB pre-reservation. On a 128 GB unified box this is
-  the difference between multi-tenant headroom and OOM - a genuine product advantage, not a
-  consolation.
-
-**Where llama is not competitive:** high-concurrency decode throughput (npl>=32), where vLLM is
-~2-2.4x ahead and the budget only lifts llama to ~41% of vLLM decode at npl128.
-
-**The last lever** is therefore *not* another prefill knob (0013 has extracted what a static cap
-can give) and *not* the kernel (at parity @npl8). It is **token-granular continuous-batch
-scheduling**: actively interleaving chunked prefill with decode every step rather than capping
-prefill, so all live slots decode while new prefills trickle in - exactly what closes vLLM's
-monotonic-scaling advantage. A staggered (non-burst) arrival pattern would also let 0013 protect
-decode jitter without the burst-TTFT penalty seen here, narrowing the practical gap for real
-serving traffic that does not arrive all-at-once.
-
-### Bottom line
-
-Patch 0013 is validated and worth shipping as a **selective, high-concurrency QoS lever**: it
-recovers dense TTFT 38% and lifts saturated decode +14-20%, converging both models to ~41% of
-vLLM. But it is honestly **not a gap-closer**. The "par-or-beat vLLM" goal is **met at low
-concurrency and on memory efficiency, and not met at high-concurrency decode throughput.** The
-remaining ~2.4x is a continuous-batched-decode scheduling gap, not a prefill-starvation or kernel
-gap - and that is the next (harder) lever, distinct from anything 0013 can touch.
--- a/backend/cpp/llama-cpp/patches/paged/RMSNORM_FP4_FOLD.md
+++ b/backend/cpp/llama-cpp/patches/paged/RMSNORM_FP4_FOLD.md
@@ -1,400 +0,0 @@
-# RMSNORM_FP4_FOLD.md - ceiling-critic verdict (label ceiling-critic, READ-ONLY, no GPU)
-
-Completeness audit of the post-0022/0023 bit-exact decode surface: is the rms_norm -> fp4
-producer-fold the BEST remaining bit-exact decode lever, or is something better being missed?
-Source: all paged/*.md verdicts + the 0019/0021/0023 patch diffs (local, read-only). No GPU touched.
-
-## Starting line (post-0023)
- Dense q36-27b-nvfp4: 373.2 t/s @ npl128 = 95.4% of vLLM 391. Dense is UNTOUCHED by 0023.
- MoE q36-35b-a3b: 758 t/s @ npl128 (0023 +1.73%).
- Decode = ONE replayed CUDA graph, single stream, 99.94% GPU-busy, 0.06% idle. Removed/folded
-  kernel GPU-time cuts wall 1:1, and DISJOINT folds STACK 1:1 (each removes a distinct kernel).
- gated_delta_net recurrence = ~50% of the step, at 84.6% peak BW (past vLLM's 82.4%), PLATEAUED.
-
-## TIER 0 - confirmed NO bit-exact lever (dead, do not pursue)
-
-(a) GDN recurrence past 84.6% - NO. The 0022 sweep is MONOTONIC toward grid.z=1: 8x4 (grid.z=4,
-    32 cols/block) = 79.9%, 16x4/8x8 (grid.z=2) = 82.3%, 16x8/32x4 (grid.z=1, all 128 cols in one
-    block = max in-flight independent state-loads per warp) = 84.6%. grid.z>1 is the WRONG direction
-    (fewer cols/block = less memory-level parallelism = lower BW), already measured worse. The only
-    thing past 84.6% is the float4/vectorized load or a different row-partition, BOTH of which
-    repartition which rows a lane sums into the warp-butterfly = a different reduction grouping =
-    breaks md5 (the exact f32x4 trap that was explicitly avoided). 84.6% (230.9 of 273 GB/s) is at
-    the practical LPDDR5x DRAM ceiling AND past vLLM. No bit-exact decomposition exists. FLOOR.
-(b) flash_attn_ext_f16 (3.1%) - NO. 48 CTAs = exactly one full wave, no occupancy headroom, no tail.
-    Every grid knob (split-KV / parallel_blocks / ncols / cols_per_block / KV-retile) changes the
-    online-softmax running-max/sum RESCALE ORDER across KV blocks = forbidden. FLOOR.
-(c) lm_head (nvjet/cublas, 3.1%) - NO. cublas-internal; any algo/kernel swap changes the K-accum
-    order vs the current f32 reference = breaks md5. Already tuned. No knob. NO lever.
-(d) mul_mat_q FP4 GEMM (~24-27%, the biggest bucket) - NO decode lever. P2a (mmq_y=64 / minblocks=2)
-    is bit-exact (1115/805, md5-identical) but MEASURED FLAT on decode (decode mmq -1.1%, stream_k
-    fixup +1.7ms = net worse). The -24.7% is a PREFILL large-N asymptotic number; the m=128 decode
-    GEMM is LPDDR5x-bandwidth-bound and mmq_y is deliberately bandwidth-neutral. FLOOR.
-
-=> Of the four largest buckets (recurrence 50% + GEMM 25% + lm_head 3% + attn 3% = ~81% of the
-   step), NONE has any bit-exact lever left. All remaining headroom lives in the ~12% of small,
-   foldable glue/quantize/gather buckets below.
-
-## TIER 1 - the bit-exact-feasible folds, RANKED by ROI (gain / plumbing+risk)
-
-Confirmed bit-exact-foldable buckets from the post-0021/0022 node trace:
- quantize_mmq_nvfp4 ........ 4.5% (dense-foldable ~2.7% ceiling; fold captures ~2-2.5%)
- k_get_rows_float .......... 1.9-2.1% (STILL LIVE post-0021; pure gather)
- pointwise glue ............ ~3.1% (k_bin_bcast 1.7% + silu/sigmoid output-gate 1.4%; ~1.5-2.5% net)
-
-Rank 1 - POINTWISE ACTIVATION FOLD (~1.5-2.5%, MEDIUM plumbing, NO new ABI). Best ROI/risk of the
-  three. Fold k_bin_bcast residual-adds + gate-muls and the silu/sigmoid output gate into adjacent
-  kernel epilogues/prologues. Pure elementwise f32, same formula+order standalone or folded =
-  byte-identical. STRICT EXCLUSION: do NOT re-fold the rms_norm/l2_norm REDUCTIONS (reduction-tree /
-  eps-placement trap). No frozen ABI, no GEMM surgery. Well-scoped already (NONRECURRENCE Lever #2).
-
-Rank 2 - rms_norm -> fp4 PRODUCER-FOLD (the proposed lever) (~2-2.5% realistic dense, HIGHEST
-  plumbing). LARGEST single clean dense bucket and HIGHEST-confidence ROI (skip-B measured dense
-  +2.7% for the whole quantize; the fold removes the f32 round-trip, keeps the quant compute, so
-  ~2-2.5%). BIT-EXACT VERDICT: SOUND, and NOT the f32x4-trap class. The trap changed a REDUCTION
-  grouping; this fold touches only (i) the sumsq block-reduce, kept BYTE-IDENTICAL, and (ii) the
-  writeback, where the post-norm normalize-MUL is pointwise (order-independent, identical out_i for
-  any thread partition) and the NVFP4 quant is per-16-consecutive PER-THREAD with NO cross-thread
-  shfl (verified in quantize.cu; 0023 already shipped on exactly this property and held the byte
-  gate). Re-partitioning the writeback to 16-consecutive-per-thread therefore changes only WHO
-  writes/quantizes each element, not the VALUES or the reduction. md5-safe. BUT it carries the worst
-  plumbing-to-ROI ratio: 3-op {RMS_NORM,MUL,MUL_MAT(NVFP4)} graph fusion + a mul_mat_q
-  prequantized-src1 path + the frozen block_fp4_mmq ABI + a per-call scratch pool. This is the
-  LAST-MILE lever, not the first.
-
-Rank 3 - GET_ROWS / STATE-GATHER FOLD (~up to 2%, LOW-MEDIUM plumbing, ZERO reduction risk -
-  but UNDER-SCOPED). k_get_rows_float is STILL 7.29-7.32 ms = ~2.1% of the step post-0021/0022; the
-  0021 author KEPT the build_rs conv-tap + recurrent-state gathers, explicitly deferring them
-  ("tiny; not one of the eliminated buckets"), NOT proving them unfoldable. A gather is a pure copy
-  with NO reduction = the SAFEST possible bit-exact fold (the exact property the 0023 dedup
-  exploited). Folding the residual build_rs gathers into their consuming kernel (read from cache via
-  ids/block-table instead of a pre-gathered f32 scratch, mirroring 0019's gather-free recurrence) is
-  bit-exact by construction. Ranked 3 only because the FOLDABLE FRACTION needs a one-pass source
-  scoping (some of the 2% may be the "tiny" conv-tap part already); the ROI is lower-confidence than
-  Rank 1/2, but the RISK is the lowest of all. THIS IS THE "SOMETHING BEING MISSED": it is a live
-  ~2% bit-exact bucket that the current plan does not address.
-
-## IS THE fp4 FOLD THE RIGHT NEXT BUILD?
-
-DEFENSIBLE, but NOT unambiguously the best by ROI. It is the largest single well-understood
-bit-exact dense bucket and the verdict is sound (no trap). HOWEVER its plumbing is the highest of
-the three folds, and the POINTWISE fold matches its realistic gain (~1.5-2.5%) at MEDIUM plumbing
-with no new ABI, while the GET_ROWS fold offers ~2% at the lowest risk (pure copy). The fp4 fold has
-the worst gain/plumbing ratio of the candidates.
-
-Recommended build order (all bit-exact, all stack 1:1 on the serial single stream):
-  1. POINTWISE activation fold first (cheapest, no ABI, ~1.5-2.5%).
-  2. GET_ROWS gather fold second, after a short source-scoping pass (~up to 2%, lowest risk).
-  3. rms_norm -> fp4 producer-fold LAST (the high-plumbing last mile, ~2-2.5% dense), built only if
-     the remaining gap to the chosen target still justifies the ABI/graph-fusion surgery.
-If the workflow insists on a SINGLE decisive lever and accepts the plumbing, the fp4 fold is the
-biggest one and a legitimate choice - but it should be sequenced after the cheap folds, not before.
-
-## HONEST BIT-EXACT CEILING
-
-The three folds remove DISJOINT kernels on a 99.94%-busy serial stream, so they STACK:
-  ~2-2.5% (fp4) + ~1.5-2.5% (pointwise) + ~2% (get_rows) = ~5.5-7% gross on dense.
-  373 t/s + ~6% = ~393-399 t/s = ~100-102% of vLLM 391.
-=> The bit-exact dense ceiling is vLLM PARITY-to-slightly-ahead (~100%), NOT 95%. Declaring the
-   ceiling at ~95% would leave ~4-5% of identified, bit-exact-FEASIBLE fold headroom unbuilt.
-   Realistic SHIPPABLE ceiling (fold inefficiency + the realistic-vs-ceiling haircut + some buckets
-   resisting clean folding): ~98-100% of vLLM dense. The recurrence (50%) is already past vLLM and
-   at the DRAM floor; attention/lm_head/mul_mat_q have no bit-exact lever; everything left is the
-   ~6% of small folds above. There is no fourth large bit-exact lever hiding anywhere.
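
The stacking arithmetic in this section can be reproduced directly. A sketch under stated assumptions: the fold ranges and the 373 / 391 t/s figures come from this verdict, and the helper names are hypothetical. The text adds the percentages to throughput; if they are instead read as fractions of step time removed from a serial stream, throughput comes out slightly higher, so both readings are shown.

```python
BASE_TPS = 373.0   # dense decode, post-0023
VLLM_TPS = 391.0   # vLLM reference

# (low %, high %) per fold, as ranked in this verdict.
FOLDS = {
    "fp4_producer_fold": (2.0, 2.5),
    "pointwise_fold": (1.5, 2.5),
    "get_rows_fold": (2.0, 2.0),
}


def stacked_range(folds):
    """Disjoint folds add, so sum the low and high ends separately."""
    low = sum(lo for lo, _ in folds.values())
    high = sum(hi for _, hi in folds.values())
    return low, high


def tps_if_throughput_gain(base_tps, pct):
    """Reading 1: pct is a throughput gain."""
    return base_tps * (1.0 + pct / 100.0)


def tps_if_step_time_cut(base_tps, pct):
    """Reading 2: pct is the share of step time removed."""
    return base_tps / (1.0 - pct / 100.0)


if __name__ == "__main__":
    low, high = stacked_range(FOLDS)
    for pct in (low, high):
        a = tps_if_throughput_gain(BASE_TPS, pct)
        b = tps_if_step_time_cut(BASE_TPS, pct)
        print(pct, round(a, 1), round(100 * a / VLLM_TPS, 1), round(b, 1))
```

Either reading puts the gross ceiling at roughly parity with the vLLM figure, which is why the shippable estimate is quoted with a haircut rather than as a point value.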
-
-Caveat that frames the whole result: vLLM 391 is a LOWER-precision reference (w4a4/w4a16 acts vs
-llama's q8_1; the recurrence is algebraically reassociated). Bit-exact-vs-vLLM is IMPOSSIBLE; the
-only meaningful cross-engine bar is throughput + top-1/KL, and llama at 373 (95%) bit-exact f32 is
-already doing strictly MORE precise arithmetic at near-equal throughput. Closing the last ~5% with
-the folds reaches throughput parity at higher precision - a strong result, but each fold is a
-diminishing 1.5-2.5% at rising plumbing. The bf16-state over-clock (shelved) is the only thing that
-goes materially AHEAD, and it is non-bit-exact (KL-gated), out of scope for this gate.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
-====================================================================================================
-
-# RMS_NORM -> NVFP4 PRODUCER-FOLD - PRECISE IMPLEMENTATION DESIGN (label fold-design, READ-ONLY, no GPU)
-
-Design-only, no GPU. Reads: DGX `~/llama-paged-dev` HEAD f7409c2 (patch 0023) + `git stash@{0}`
-(trackA1-prequant-nvfp4-fused-rmsnorm) + norm.cu/quantize.cu/mmq.cu/mmq.cuh/ggml-cuda.cu/qwen35.cpp.
-
-## 0. One-line verdict
-The fold is bit-exact-FEASIBLE, BUT the Lever-2 stash that exists as the starting point is
-(a) almost certainly bit-INEXACT and (b) was measured FLAT. The single mandatory fix is the
-reduction block_size dispatch; the single thing that makes it not-flat is de-dup-across-siblings
-+ skipping the dead f32 write at the FFN boundary. Build the FFN boundary first, gate on a measured
-per-call producer-vs-removed-quantize win before extending. Honest expectation: ~1.5-2.5% dense
-best case, real risk of flat (Lever-2 precedent). Lower-risk alternative in Section 7.
-
-## 1. Which graph nodes fuse
-Both boundaries already collapse rms_norm+gain into ONE `rms_norm_f32<bs, do_multiply=true>` kernel
-(existing fuse, ggml-cuda.cu:3675). That kernel's f32 output is the byte-exact target.
-
- FFN (STRONGEST), qwen35.cpp:188-192 + build_layer_ffn:478-487:
-  `attn_post_norm = build_norm(cur, RMS)` feeds EXACTLY `ffn_up` + `ffn_gate` (both NVFP4 MMQ at
-  m=128). NO non-NVFP4 consumer (residual = pre-norm `cur`; ffn_down eats silu(gate)*up). => the
-  f32 normed tensor is DEAD once both GEMMs read fp4 -> producer can skip the f32 write. An existing
-  `{MUL_MAT, MUL_MAT, GLU}` fuse (ggml-cuda.cu:3631) already groups up+gate+GLU -> the natural seam.
- GDN/attn (weaker), qwen35.cpp:161 + build_qkvz:228-243:
-  `attn_norm = build_norm(inpL, RMS)` feeds `wqkv` + `wqkv_gate` (NVFP4 MMQ, share src1) AND
-  `ssm_beta` + `ssm_alpha` (small N=n_v_heads -> MMVQ, READ THE f32). => f32 still live, producer
-  MUST write f32 -> smaller win.
- MoE FFN (qwen35moe.cpp) goes via mul_mat_id, already 0023-deduped -> out of scope. Fold = dense only.
-
-## 2. Byte-exact target (norm.cu rms_norm_f32<bs,true>)
-Dispatch (norm.cu:304-380): `bs = (ncols < 1024) ? 256 : 1024`, shmem 32*float.
-```
-for (int col = tid; col < ncols; col += bs) tmp += x[col] * x[col];   // (R1) strided sumsq grouping
-tmp = block_reduce<SUM, bs>(tmp, s_sum);                               // (R2) tree width depends on bs
-const float mean = tmp / ncols; const float scale = rsqrtf(mean + eps); // (R3) exact eps/div
-for (int col = tid; col < ncols; col += bs) dst[col] = scale * x[col] * mul[col]; // (W) per-channel gain, mul_col==col
-```
-(W) is per-column independent (scale block-uniform) -> writeback may be re-partitioned. (R1/R2/R3)
-are the ONLY order-sensitive parts and must stay byte-identical.
-
-## 3. Fused producer kernel (quantize.cu) - deltas vs the stash
-Start from stash `rms_norm_mul_quantize_nvfp4_kernel` + the factored `quantize_nvfp4_write_subblock`
-(verbatim per-thread NVFP4 quant). Required changes:
-1. TEMPLATE on block_size + launch `bs=(ncols<1024)?256:1024` (NOT the stash's hardcoded 256). MANDATORY.
-2. Reduction pass VERBATIM (R1/R2/R3): scalar strided sumsq, `block_reduce<SUM,bs>`, `mean=tmp/ncols`,
-   `scale=rsqrtf(mean+eps)`. Byte-identical once bs matches.
-3. Writeback re-partitioned to 16-consecutive-per-thread: `for s=tid; s<n_sub; s+=bs`, col0=s*16,
-   `v=scale*xr[col]*mul[col]` (col<ncols else 0), amax=max|v|, `quantize_nvfp4_write_subblock(vals,
-   amax, sub, y+ib)`, `ib=k_block*ne11+row`, n_sub=ncols_padded/16. x is re-read (canonical does too).
-4. `template<bool write_f32>`: FALSE at FFN (skip `dr[col]=v` -> drop the producer's f32 store),
-   TRUE at GDN (beta/alpha read it). THIS is what turns re-bucketing into a real traffic cut.
-Buffer ABI frozen: block_fp4_mmq = {uint32_t d4[4]; int8_t qs[128]} = 144B = 9 uint4 = 4*block_q8_1
-(mmq.cuh:53). Same layout quantize_mmq_fp4_cuda emits; GEMM stride
-s12=ne11*ne10_padded*sizeof(block_fp4_mmq)/(QK_K*sizeof(int)).
-
-## 4. mul_mat_q prequantized-src1 plumbing (mmq.cu/mmq.cuh)
-Re-add the stash hook on top of 0023: `ggml_cuda_mul_mat_q(..., const char* src1_prequantized=nullptr)`.
-In the NON-ids branch: if non-null, skip quantize_mmq_fp4_cuda + the local pool alloc, point mmq_args
-src1_q8_1 at it. GEMM byte-UNTOUCHED (the bit-exactness firewall). 0023 ids-branch untouched (orthogonal).
-Sharing across non-adjacent siblings:
- FFN (preferred): extend `{MUL_MAT,MUL_MAT,GLU}` to `{RMS_NORM,MUL,MUL_MAT,MUL_MAT,GLU}` super-fuse;
-  one producer (write_f32=false) + one pool buf spanning both GEMMs + GLU, all in one handler. Clean.
- GDN/general: a scratch cache keyed by the normed tensor ptr (graph-eval lifetime); defer until FFN wins.
-The stash folds only ONE consumer with a stack-scoped qbuf -> the sibling still standalone-quantizes
-(a key reason it was flat; nsys showed quantize 12896->10816, not ->0).
-
-## 5. Bit-exactness argument
-(1) NVFP4 quant of each 16-elem sub-block = PURE per-thread function, NO cross-thread shfl/reduction
-    (quantize.cu; the exact property 0023 shipped on). => writeback re-partition cannot change a byte.
-(2) v=scale*x[col]*mul[col] byte-identical iff scale identical (R1/R2/R3 preserved via bs dispatch)
-    AND expression verbatim (left-assoc, scalar). Per-column independent -> partition-invariant.
-=> produced block_fp4_mmq bytes == standalone == 0022/0023 baseline; GEMM untouched -> md5 held.
-Gate: BATCHED (ne[1]>8) md5 == 5951a5b4 dense + 1115/1115 - NOT just batch=1 (the gate Lever-2 skipped).
-
-## 6. THE TRAP
- block_size trap (the stash's latent bug): canonical = `ncols<1024?256:1024`; qwen35 n_embd is
-  1024/2560/4096 (qwen35.cpp:30-31) -> canonical is rms_norm_f32<1024> (LEVER2 nsys confirms). Stash
-  hardcodes 256 -> different strided grouping {tid,tid+256,...} vs {tid,tid+1024,...} AND 8-warp vs
-  32-warp reduce -> different f32 order -> md5 break. FIX = template+dispatch matching bs.
- f32x4 vectorize trap (recurrence class): do NOT vectorize the sumsq load or align the reduction
-  partition to the 16-consecutive writeback. Keep sumsq scalar + strided-by-bs.
- eps/assoc: `rsqrtf(mean+eps)`, `mean=tmp/ncols`, `(scale*x)*mul`. Never reassociate.
- GEMM K-reduction / stream-k / tile loads: forbidden (NONRECURRENCE FORBIDDEN list). Fold only
-  changes WHO writes src1.
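
The two halves of the argument (the reduction is order-sensitive, the writeback is not) can be illustrated on the CPU. This is a minimal sketch, not the ggml kernel: `f32` emulates float32 rounding via `struct`, `tree_reduce` is a simplified stand-in for `block_reduce` (the real warp-shuffle order is not reproduced), and the scale is a plain `1/sqrt` rather than `rsqrtf`. All function names are hypothetical.

```python
import math
import random
import struct


def f32(value):
    """Round a Python float to the nearest float32."""
    return struct.unpack("f", struct.pack("f", value))[0]


def sumsq(x, bs):
    """Per-thread strided partial sums (R1), then a pairwise tree reduce (R2 stand-in)."""
    partials = []
    for tid in range(bs):
        acc = 0.0
        for col in range(tid, len(x), bs):
            acc = f32(acc + f32(x[col] * x[col]))
        partials.append(acc)
    while len(partials) > 1:  # bs is assumed to be a power of two
        half = len(partials) // 2
        partials = [f32(partials[i] + partials[i + half]) for i in range(half)]
    return partials[0]


def writeback_strided(x, gain, scale, bs):
    out = [0.0] * len(x)
    for tid in range(bs):
        for col in range(tid, len(x), bs):
            out[col] = f32(f32(scale * x[col]) * gain[col])
    return out


def writeback_subblocks(x, gain, scale, bs, sub=16):
    """Each thread owns whole 16-element sub-blocks instead of strided columns."""
    out = [0.0] * len(x)
    n_sub = (len(x) + sub - 1) // sub
    for tid in range(bs):
        for s in range(tid, n_sub, bs):
            for col in range(s * sub, min((s + 1) * sub, len(x))):
                out[col] = f32(f32(scale * x[col]) * gain[col])
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    x = [f32(rng.uniform(-1.0, 1.0)) for _ in range(5120)]
    gain = [f32(rng.uniform(0.5, 1.5)) for _ in range(5120)]
    s256, s1024 = sumsq(x, 256), sumsq(x, 1024)
    print("sumsq bit-identical across bs:", s256 == s1024)
    scale = f32(1.0 / math.sqrt(s1024 / len(x) + 1e-6))
    same = writeback_strided(x, gain, scale, 1024) == writeback_subblocks(x, gain, scale, 1024)
    print("writeback identical across partitions:", same)
```

The writeback comparison holds by construction, since each output depends only on its own column and the shared scale. The two sums agree only to rounding and are not guaranteed to match bit for bit, which is why the block size must follow the canonical dispatch.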
-
-## 7. Contrast with Lever-2 + lower-risk alternative
-Lever-2 (stash) was FLAT (+0.3% dense) and NET-ADDED GPU-time (+2.3% fused vs -1.1% quantize -0.9%
-rms_norm) because it (a) folded only 1 of 2 siblings, (b) always wrote f32, (c) bs=256 (wrong AND
-non-canonical). It md5'd only batch=1 (fuse off) -> bit-inexactness never caught. The new fold beats
-it ONLY with de-dup-both-siblings + skip-dead-f32-at-FFN; without BOTH, expect flat again.
-LOWER-RISK alt (recommend evaluating first): dense quantize DE-DUP, no fold - keep the efficient
-standalone quantize, quantize the shared normed activation ONCE, reuse for wqkv+wqkv_gate /
-ffn_up+ffn_gate (CSE keyed by src1 ptr, the dense analog of 0023). ZERO reduction risk (rms_norm
-untouched), much less plumbing; ceiling ~<=1% (redundant half only), which the fold's de-dup half
-captures anyway. The fold's only incremental value is the f32 round-trip read, which Lever-2 showed
-is easily eaten by the fused kernel's added work.
-
-## 8. Scope + build order (the gate)
-Scope dense qwen35: quantize.cu/.cuh (templated kernel + bs dispatch), mmq.cu/.cuh (src1_prequantized
-on non-ids path), ggml-cuda.cu (FFN super-fuse, gate: NVFP4 src0 + Blackwell + ne[1]>MMVQ_MAX_BATCH_SIZE
-+ ne2==ne3==1 + per-channel gain; flag LLAMA_FUSE_NVFP4_QUANT).
-Build order: (1) FFN super-fuse only (write_f32=false + de-dup); measure per-call producer GPU-time
-vs the two removed quantizes (nsys node trace, same-build flag toggle); SHIP only if decode_agg
-actually lifts AND batched md5==0022/1115. (2) Only if (1) lifts, add the GDN boundary (write_f32=true,
-keyed scratch). Realistic: ~1.5-2.5% dense FFN best case; ceiling +2.7% (skip-ALL) is unreachable
-(fold keeps quant compute+write). If step 1 is flat, dense quantize is at its bit-exact floor -> stop.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
-====================================================================================================
-
-# RE-PROFILE TARGET MEASUREMENT (label reprofile-target, THE GPU agent) - post-0023, HEAD f7409c2
-
-Fresh node-level nsys re-profile of the DENSE decode to confirm the fold target size, foldable
-fraction, critical-path status, and the realistic recoverable ceiling, BEFORE BuildFold commits.
-
-## Build-dir correction (acted on)
-The orchestrator framed `build-cuda-base` as the clean 0023 build. It is NOT: empirically
-`build-cuda-base` = stale pre-0021 (336.71 t/s), the real post-0023 build is `build-cuda` (371.81 t/s,
-git-clean tree, no mmq.cuh P2a remap). All numbers below are from `build-cuda`. (Dense profiling is
-unaffected by the 0023 MoE de-dup knob - dense has no MoE.)
-
-## Confirmed baseline
- llama-batched-bench dense q36-27b-nvfp4 npl128 ntg128: 371.81 t/s, 344 ms/decode-step. CONFIRMS the
-  ~343 ms / ~373 t/s target. (build-cuda-base stale = 336.71 t/s.)
- nsys --cuda-graph-trace=node, 103 steady windowed steps: step span 345.0 ms, mean kernel-busy 99.0%,
-  sum-of-kernels/span = 98.9% (< 100% => no NET overlap; serial single stream, ~1.1% idle).
-
-## Dense decode decomposition (ms/step)
-gated_delta_net 168.06 (49.2%) BINDING | mul_mat_q<NVFP4,128> 93.57 (27.4%) |
-**quantize_mmq_nvfp4 17.55 (5.1%)** | nvjet 12.02 (3.5%) | flash_attn_ext 11.64 (3.4%) |
-ssm_conv 8.56 (2.5%) | k_get_rows_float 7.32 (2.1%) | silu 5.36 | k_bin_bcast(mul) 4.64 |
-stream_k_fixup 3.95 | rms_norm 3.53 (1.0%). TOTAL kernel 341.25.
-
-## quantize_mmq_nvfp4 at the dense decode shape (the answer)
- TOTAL: 17.55 ms/step = 5.1% of kernel time = 5.08% of the 345 ms wall. 496 quant calls/step (1 per
-  NVFP4 GEMM src1). CONFIRMS the verdict's 17.66 ms / ~4.5-5% (the stray "3.7%" reading was wrong).
- Decomposes EXACTLY by input dim K (graph-verified in qwen35.cpp; 64 layers = 48 GDN + 16 attn):
-  - K=5120 (368/step) FOLDABLE: GDN {wqkv, wqkv_gate, beta, alpha} + attn {wq,wk,wv} + both {ffn_up,
-    ffn_gate}. All fed by a plain rms_norm+mul (attn_norm or attn_post_norm). beta/alpha CONFIRMED
-    foldable: they read the same `cur` as wqkv (qwen35.cpp:359/366).
-  - K=6144 (64/step) UNAVOIDABLE: ssm_out (gated-norm = rms_norm + mul(ssm_norm) + mul(silu(gate)),
-    two muls break the chain) + wo (attn-gated producer).
-  - K=17408 (64/step) UNAVOIDABLE: ffn_down (silu(gate)*up producer).
-
-## Foldable portion (measured) - LARGER than the byte-model 2.7%
-The quant kernel is NOT byte-proportional: ffn_down (K=17408) measures 3.62 ms but a byte-model
-predicts 5.75 ms. Small-K quants are launch/overhead-bound (flat ~21.7 us floor, K=5120 vs 6144
-indistinguishable), so the byte model UNDER-counts the numerous small-K (foldable) calls.
- byte-model FOLDABLE  = 9.73 ms = 2.82% of step
- flat-split FOLDABLE  = 11.90 ms = 3.45% of step  (368 small-K quants, the physically correct one)
- => true FOLDABLE raw GPU-time = 9.7 - 11.9 ms = 2.8% - 3.4% of step. UNAVOIDABLE = ssm_out+wo
-  ~2.1 ms + ffn_down 3.62 ms = ~5.7 ms (1.6%).
- Sub-split for the build order: the FFN boundary alone (ffn_up+ffn_gate, f32 DEAD -> cleanest fold)
-  = 128 quants/step ~4.1 ms; the input-norm boundary (wqkv/wqkv_gate/wq/wk/wv, +beta/alpha keep f32)
-  = ~7.8 ms raw but lower net efficiency.
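
The call counts and the sub-split can be cross-checked from the layer structure given in this section. A sketch only: the layer counts, consumer lists and the 11.90 ms flat-split total are taken from the text, while the constant and function names are hypothetical, and the even per-call split is the flat-cost assumption the text itself describes for small-K quants.

```python
N_GDN, N_ATTN = 48, 16
N_LAYERS = N_GDN + N_ATTN

# (layers, NVFP4 GEMM consumers per layer), grouped by input dim K.
QUANT_CALLS = {
    5120: [(N_GDN, 4), (N_ATTN, 3), (N_LAYERS, 2)],  # wqkv/wqkv_gate/beta/alpha, wq/wk/wv, ffn_up/ffn_gate
    6144: [(N_GDN, 1), (N_ATTN, 1)],                 # ssm_out, wo
    17408: [(N_LAYERS, 1)],                          # ffn_down
}

FLAT_SPLIT_FOLDABLE_MS = 11.90  # all K=5120 quants, flat per-call cost


def calls_per_step(groups):
    return sum(layers * per_layer for layers, per_layer in groups)


def flat_share_ms(n_calls):
    """Share of the foldable time for n_calls, assuming every small-K call costs the same."""
    total_calls = calls_per_step(QUANT_CALLS[5120])
    return FLAT_SPLIT_FOLDABLE_MS * n_calls / total_calls


if __name__ == "__main__":
    counts = {k: calls_per_step(g) for k, g in QUANT_CALLS.items()}
    print(counts, sum(counts.values()))
    ffn_calls = N_LAYERS * 2
    print(round(flat_share_ms(ffn_calls), 1), round(flat_share_ms(counts[5120] - ffn_calls), 1))
```

The FFN boundary is the smaller of the two shares but the only one where the f32 store can be dropped, which is the reason it is sequenced first.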
-
-## Critical path: YES (1:1)
-98.9% kern/span, 99.0% busy, single serial stream, no net overlap. The quant kernels are inline on the
-serial decode chain; removing their GPU-time cuts the wall ~1:1. Not a gap-fill (there are no gaps).
-
-## Realistic recoverable - and the honest haircut
-RAW foldable removed = 9.7-11.9 ms. NET recoverable is LESS, for reasons the fold-design + ceiling-critic
-already flagged and this profile does not overturn:
- the fused producer KEEPS the quant compute + the fp4 write (only the f32 round-trip read is saved,
-  and the f32 write is droppable ONLY at the FFN boundary where it is dead);
- Lever-2 precedent: the existing stash fold measured FLAT (+0.3% dense) because it folded 1 of 2
-  siblings, always wrote f32, and used a non-canonical bs=256 reduction;
- TENSION TO FLAG: the critic cites a skip-B probe of only ~+2.7% for the WHOLE quantize, yet the whole
-  quantize is 5.1% on a 98.9%-serial stream (which predicts ~5.1% if cleanly 1:1). Either these small
-  kernels are not perfectly 1:1, or the skip probe is unreliable (same class as the NONREC
-  garbage-routing skip artifact). This caps the realistic NET nearer the conservative end.
-=> Realistic NET recoverable: ~1.5 - 2.5% dense (consistent with fold-design Section 8), real risk of
-   FLAT. Optimistic ceiling if the f32 round-trips fully convert: up to ~3% (371.8 -> ~383 t/s); do not
-   bank above ~2.5%.
-
-## VERDICT (GPU-measurement view)
- The target is REAL: foldable raw GPU-time 9.7-11.9 ms (2.8-3.4%, slightly LARGER than the prior 2.7%
-  byte-model floor), squarely on the single-stream critical path (1:1), bit-exact-FEASIBLE (no precision
-  change), and the largest single clean dense bucket left after the plateaued recurrence.
- BUT the NET recoverable is the contested ~1.5-2.5% with a documented FLAT risk, and this fold has the
-  HIGHEST plumbing of the three identified folds. Worst gain/plumbing ratio of the candidates.
- RECOMMENDATION: build is DEFENSIBLE but should be SEQUENCED AFTER the cheaper pointwise + get_rows
-  folds (per ceiling-critic). If built as the single decisive lever, do the FFN boundary FIRST (cleanest
-  ~4.1 ms, f32 dead), gate per-call producer-GPU-time vs the two removed quantizes, and SHIP only if
-  decode_agg actually lifts AND batched md5 == 5951a5b4 (1115/1115). Kill-switch: if the only bit-exact
-  construction forces re-partitioning the sumsq reduction (changing accumulation order), abort - not
-  bit-exact.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
-
-====================================================================================================
-
-# BUILD VERDICT (label fold-build, THE GPU agent) - post-0023, HEAD f7409c2 = patch 0023
-
-DECISION: NO BUILD. The bit-exact decode ceiling is effectively reached for any lever that justifies
-its plumbing. The proposed rms_norm -> fp4 producer-fold is NOT built (it was already built once and
-measured FLAT), and the recommended lower-risk alternative (dense quantize de-dup) does NOT have a
-clean, contained construction for the portion that matters. Tree left clean at 0023; nothing committed
-to the code; this verdict appended only.
-
-I extended the read-only agents' analysis with the two things they could not verify from the .md
-verdicts alone: (1) the prior EMPIRICAL fold attempt, and (2) the actual graph/dispatch structure in
-the source. Both kill the build.
-
-## 1. The fp4 producer-fold was ALREADY BUILT and measured FLAT (decisive)
-LEVER2_PROGRESS.md + stash@{0} (trackA1-prequant-nvfp4-fused-rmsnorm) is exactly this fold. Measured:
-  - dense q36-27b-nvfp4 npl128: 333.32 -> 334.44 t/s (+0.3%), npl32 -0.5%
-  - MoE   q36-35b-a3b   npl128: 690.23 -> 690.89 (+0.1%), npl32 -0.3%
-nsys A/B (fusion fires): quantize_mmq_nvfp4 -2080 inst (-1.1%), rms_norm_f32<1024> -2080 (-0.9%),
-NEW rms_norm_mul_quantize_nvfp4 +2080 (+2.3%). NET GPU-time = +0.3%. The fused producer ADDS BACK
-the GPU-time it removes - it RELOCATES work, it does not remove it. The +0.3% wall is exactly
-consistent with strict 1:1 wall scaling on the single serial stream (reprofile's own model). So the
-fold is not a victim of a bad implementation that a rewrite fixes - it is structurally flat: the
-producer must still read x, compute sumsq, normalize, quantize and WRITE the fp4 blocks; the only
-recoverable traffic is the f32 round-trip, which the fused kernel's extra work eats (Lever-2 proved
-this empirically; fold-design Section 7 and reprofile both predicted it). The design's two "fixes"
-(de-dup both siblings + skip dead f32 at FFN) do not change this: the skip-f32 saves one f32 write at
-the FFN boundary only (~0.5% of step), and the de-dup-both-siblings is item 2 below.
-
-## 2. The dense quantize de-dup is NOT a clean analog of 0023 (the meaningful part is infeasible)
-This is the critical finding the read-only agents missed. 0023's MoE de-dup lifted +1.73% because the
-redundancy is INTRA-CALL: inside ONE mul_mat_id, the broadcast (ne11==1) up/gate quantize repeats the
-SAME token n_expert_used times, all within a single ggml_cuda_mul_mat_q call, so de-dup is a contained
-quantize-once + gather with a stack-scoped buffer. NO precedent issue, NO cross-node lifetime.
-The DENSE redundancy is INTER-NODE and that is a different, much harder problem:
-  - The shared-src1 GEMMs are SEPARATE graph nodes. build_qkvz (qwen35.cpp:228-243) emits wqkv MM,
-    reshape, wqkv_gate MM; then ssm_beta MM, reshape, sigmoid; ssm_alpha MM, reshape, add, softplus,
-    mul. The four src1-sharing MMs are INTERSPERSED with reshape/sigmoid/softplus/add/mul - they are
-    NOT consecutive graph nodes, so ggml's consecutive-op fusion framework cannot match them. A
-    contained, single-handler de-dup (the only kind with safe buffer lifetime, like 0023) is impossible
-    for the qkvz bucket.
-  - De-duping them therefore requires graph-level CSE: recognize 2-4 non-adjacent MUL_MAT nodes share
-    src1, quantize once, and keep that pool buffer alive across the intervening nodes until the last
-    sibling GEMM consumes it - under CUDA-graph CAPTURE (buffer addresses baked at capture, the pool
-    must not recycle the buffer between siblings). This is the SAME high-plumbing scratch-pool +
-    src1_prequantized path the fold needs, with real implementation risk (graph-capture
-    non-determinism / crashes), and NO precedent in the tree. fold-design's "much less plumbing"
-    framing for this alternative is optimistic - the hard part (inter-node buffer sharing under graphs)
-    is common to both.
-  - The qkvz bucket (the big one, ~192 redundant quants/step ~= 1.4%) is exactly the inter-node case.
-  - The ONLY contained, tractable dense de-dup is the FFN {MUL_MAT,MUL_MAT,GLU} (consecutive; build_ffn
-    LLM_FFN_PAR). But that existing fusion executes ONLY via ggml_cuda_mul_mat_vec (gated on batch<=8;
-    ggml_cuda_should_fuse_mul_mat_vec_q). At npl128 (m=128) it falls through to two separate MMQ nodes.
-    Adding an MMQ-path branch to quantize src1 once captures only the FFN redundancy = ~64 quants/step
-    ~= 0.5% of the step - below the +-0.3-0.5% bench noise the runs already show, not worth a new
-    fusion code path + the risk to the byte gate.
-
-## 3. The pointwise + get_rows folds are not clean wins either
- Pointwise: the cheap ops are ALREADY fused in the tree - {RMS_NORM,MUL(,ADD)} -> rms_norm_fused
-  (ggml-cuda.cu:4194/4199), {SSM_CONV,(ADD),SILU} -> ssm_conv (4204/4209), {UNARY(silu/sigmoid/
-  softplus),MUL} -> unary_mul (4216). The residual silu 5.36 + k_bin_bcast 4.64 ms is the un-fusable
-  remainder inside the GDN gating chain feeding the 50% binding gated_delta_net kernel; GAP_PROGRESS
-  measured the whole gating-glue ceiling at only 3.35% and folding further means surgery on the binding
-  kernel. Lower-confidence, needs a GPU node-scoping pass - not a clean lever.
- get_rows: 0019 already folded the main recurrent-state gathers; the residual ~2% is an unquantified
-  mix of the conv-tap (already deferred as "tiny") and leftovers - under-scoped, not a confirmed win.
-
-## 4. Tree state / gates
- Dev tree clean at HEAD f7409c2 (git diff empty; mmq.cuh/mmq.cu/quantize.cu no uncommitted diff -
-  no P2a remap to revert). build-cuda = the clean 0023 build (371.81 t/s dense @npl128, per reprofile).
- No code change made -> no md5 gate needed (baseline 27b = 5951a5b4, 35b = 07db32c2 unchanged).
- No GPU build/bench launched (no buildable candidate clears the ROI bar; re-confirming the baseline
-  the reprofile already measured would waste the GPU window).
-
-## 5. FINAL BIT-EXACT CEILING
-Dense q36-27b-nvfp4: 371.81 t/s @npl128 = 95.0% of vLLM 391. MoE q36-35b-a3b: 758.1 @npl128 (0023).
-This is the bit-exact f32 decode plateau and there is no single decisive bit-exact lever left:
-  - gated_delta_net recurrence (~50%) is at 84.6% peak LPDDR5x BW, PAST vLLM (82.4%) - DRAM floor.
-  - mul_mat_q NVFP4 GEMM (~27%), flash_attn (~3.4%), lm_head nvjet (~3.5%) have NO bit-exact lever
-    (any knob changes a K-/softmax-reduction order vs the f32 reference).
-  - The remaining ~5% of small foldable buckets is real GPU-time on the critical path, but the largest
-    piece (the fp4 fold, ~1.5-2.5%) is EMPIRICALLY FLAT, the next (dense qkvz quant de-dup, ~1.4%) has
-    no clean inter-node construction and shares the fold's flat-risk, and the contained remainder is
-    each <=0.5% (FFN de-dup) or entangled in the binding kernel (pointwise) - none clears the
-    plumbing/risk bar for a 1:1 single-stream gain that the bench noise floor (~0.3-0.5%) can swallow.
-FRAME: vLLM 391 is a LOWER-precision (w4a4) reference; bit-exact-vs-vLLM is impossible. llama at 371.81
-bit-exact f32 is doing strictly MORE precise arithmetic at ~95% of vLLM's throughput. The only thing
-that goes materially further is bf16 state (precision change, KL-gated, out of scope, shelved).
-RECOMMENDATION: ship the 0023 plateau as the bit-exact decode result. Do not build the fp4 fold (flat).
-If a future agent insists on the dense qkvz de-dup, it must first build the inter-node graph-CSE
-scratch-pool/CUDA-graph-lifetime plumbing and prove on a same-build flag toggle that decode_agg lifts
-above the +-0.5% noise AND batched md5 == 5951a5b4 - and should expect the Lever-2 flat outcome.
-
-Assisted-by: Claude:opus-4.8 [Claude Code]
--- a/backend/cpp/llama-cpp/patches/paged/SERVER_SWEEP.md
+++ b/backend/cpp/llama-cpp/patches/paged/SERVER_SWEEP.md
@@ -1,138 +0,0 @@
-# GB10 same-day head-to-head server sweep: llama-server (paged) vs vLLM
-
-Date: 2026-06-23. Hardware: GB10 / DGX Spark (sm_121, 128 GB LPDDR5x unified, ~273 GB/s
-weight-read floor). GPU otherwise idle (sibling vLLM had exited; LocalAI docker workers
-stopped for the run).
-
-This sweep **replaces** the stale carried "~75-80% of vLLM" figure (commit 07985ba4,
-pre-co-batching, single-point). It measures *real serving* steady-state aggregate decode
-throughput across the full concurrency curve, for three model classes, with one identical
-client driving both engines.
-
-## Method
-
- **llama**: `llama-server` from the paged dev tree (`~/llama-paged-dev/build-cuda`, HEAD =
-  patch 0013 / commit 17d97cb), `LLAMA_KV_PAGED=1`, `-fa on -ngl 999 --parallel 128 -c 65536`.
- **vLLM**: 0.23.0, `vllm serve --enforce-eager --enable-prefix-caching --max-num-seqs >=128
-  --max-model-len 4096` (APC on, eager per the GB10 no-CUDA-graphs edge).
- **Client** (`sweep_client2.py`): N concurrent **non-streaming** `/v1/completions`, short
-  shared prompt, `max_tokens=min_tokens=256`, `ignore_eos=true`. Aggregate decode tok/s =
-  total generated tokens / wall. Non-streaming keeps the Python client off the critical path
-  (one JSON parse per request, not per token), so the **server** is the bottleneck. Validated:
-  vLLM pushed 4227 tok/s through the exact same client where llama topped out at 2087, so the
-  client is not the cap. Both engines use the identical client + prompt -> apples-to-apples.
- npl (concurrency) sweep: 8 / 32 / 64 / 128.
-
-Quant parity:
- Dense: llama **NVFP4-dense GGUF** (weight-only FP4, 16-bit compute) vs vLLM **NVFP4A16**
-  (weight FP4, 16-bit activation) -> matched precision class.
- Small: llama **Q8_0** vs vLLM **bf16** (closest loadable form).
- MoE: llama **mxfp4** GGUF. **vLLM could not serve this MoE on GB10 at all** (see below), so
-  there is no vLLM MoE column.
-
-## Results: aggregate decode tok/s (higher is better)
-
-### Dense 32B  (llama NVFP4-dense  vs  vLLM NVFP4A16)
-
-| npl | llama (NVFP4) | vLLM (NVFP4A16) | llama % of vLLM |
-|----:|--------------:|----------------:|----------------:|
-|   8 |          83.2 |            85.9 |          **96.9%** |
-|  32 |         228.9 |           301.3 |          76.0%  |
-|  64 |         367.1 |           507.8 |          72.3%  |
-| 128 |         520.6 |           604.0 |          86.2%  |
-
-Plateau: neither has plateaued at 128 (both still climbing, weight-read bound). llama is at
-**parity at batch-8** (97%), dips to ~72% mid-curve (npl 32-64), recovers to 86% at 128.
-
-### Small  Qwen3-0.6B  (llama Q8_0  vs  vLLM bf16)
-
-| npl | llama (Q8_0) | vLLM (bf16) | llama % of vLLM |
-|----:|-------------:|------------:|----------------:|
-|   8 |        911.3 |       923.0 |        **98.7%** |
-|  32 |       1701.6 |      2531.4 |        67.2%  |
-|  64 |       1911.7 |      3497.1 |        54.7%  |
-| 128 |       2087.6 |      4227.6 |        49.4%  |
-
-Plateau: **llama plateaus hard** at ~2.0-2.1k by npl 64-128 (+9% from 64->128). vLLM keeps
-scaling (3497 -> 4227). For a tiny runtime-bound model, vLLM's scheduler/batching amortizes
-better; llama-server's per-token host cost (sampling, detok, slot mgmt) caps it. This is the
-worst llama-vs-vLLM ratio in the sweep (down to 49%).
-
-### MoE  Qwen3-Coder-30B-A3B  (llama mxfp4; vLLM = NOT SERVABLE on GB10)
-
-| npl | llama (mxfp4) | vLLM |
-|----:|--------------:|-----:|
-|   8 |         290.0 |  n/a |
-|  32 |         582.5 |  n/a |
-|  64 |         931.8 |  n/a |
-| 128 |        1041.3 |  n/a |
-
-llama-server scales cleanly to **1041 tok/s** at npl 128 with **no npl-128 expert-activation
-cliff** (unlike the prior `llama-batched-bench` MoE numbers 253/505/830/620 that peaked at 64;
-short-prompt continuous batching in the server avoids it).
-
-**vLLM could not serve this MoE on GB10 (two independent failures):**
-1. **bf16** (`Qwen/Qwen3-Coder-30B-A3B-Instruct`, the only HF form on the box): loads the
-   56.9 GB of weights, then **hangs at the MoE warmup** (`Using MoEPrepareAndFinalize
-   NoDPEPModular` -> `Model loading took ...`), GPU 0% util, and **takes the whole box down
-   (hard reboot)**. Reproduced twice. With tight `--gpu-memory-utilization` it still hangs at
-   the same step before the API server ever comes up.
-2. **mxfp4 GGUF** (same weights llama uses): vLLM 0.23.0's GGUF loader **cannot map the fused
-   qwen3moe expert tensors** (`RuntimeError: Failed to map GGUF parameters (48):
-   ['model.layers.N.mlp.experts.gate_up_proj', ...]`). Engine init fails outright.
-
-So on GB10, llama.cpp is the *only* engine of the two that serves this 30B-A3B MoE at all -
-an availability win, independent of throughput.
-
-## Batch-8 anomaly triage (dense NVFP4) -- RESOLVED
-
-The prior mixed-load run reported llama batch-8 steady decode at **471 ms/step (~19% of vLLM
-aggregate, ~17 tok/s)**. This sweep does **not** reproduce it. Clean isolated batch-8 decode:
-
- `llama-server` batch-8 dense paged = **83.2 tok/s** aggregate = ~96 ms/step = **96.9% of
-  vLLM's 85.9** (parity, both at the LPDDR5x weight-read floor).
-
-`llama-batched-bench` cross-check, dense NVFP4, `-npp 16 -ntg 128 -npl 1,8`, the three
-hypotheses isolated (S_TG = decode tok/s aggregate at batch 8):
-
-| config                | batch-8 S_TG t/s | ms/decode-step |
-|-----------------------|-----------------:|---------------:|
-| paged,  ctx 65536     |            90.32 |          88.6  |
-| stock,  ctx 65536     |            88.39 |          90.5  |
-| paged,  ctx 163840    |            89.33 |          89.6  |
-| stock,  ctx 163840    |            87.72 |          91.2  |
-
-Conclusion: clean batch-8 dense decode is **~88-90 tok/s (~89 ms/step) regardless of all three
-suspects**:
- **Paged overhead?** No -- paged is within 2% of stock, and at ctx 65k paged is *faster*
-  (90.3 vs 88.4). The decode path is not paying a paged penalty at batch-8.
- **The 163840-token ctx allocation?** No -- ctx 163840 == ctx 65536 within 1% (89.3 vs 90.3).
-  The large allocation does not slow steady-state decode.
- **NVFP4 decode cost?** This *is* the cost -- ~89 ms/step is the GB10 weight-read floor for a
-  32B at batch-8 (it matches vLLM's 86 tok/s server and exceeds it at the kernel level: 90 vs
-  86). It is the hardware ceiling, not a bug.
-
-The 471 ms/step is ~5.3x slower than this clean floor and is explained by none of the three.
-It was a **mixed-load artifact**: the 8 decoders were time-sharing the GPU with a concurrent
-prefill (a large prompt / chunked prefill landing on the same steps). That decode-vs-prefill
-contention is exactly the stall **patch 0013 (`LLAMA_PREFILL_BUDGET`)** bounds. In steady-state
-isolated decode, batch-8 dense is at **parity with vLLM (97%)**, not 19%.
-
-## Aggregate map (replaces the carried 75-80%)
-
-llama-server (paged) as a fraction of vLLM, by regime:
-
- **Low concurrency (batch-8): parity, 97-99%** on both measurable classes. Both engines sit on
-  the LPDDR5x weight-read floor; there is nothing to win.
- **Dense 32B, mid-to-high concurrency: 72-86%.** Dips to ~72% at npl 32-64, recovers to 86% at
-  128. Both still climbing (weight-bound), neither plateaus by 128.
- **Small 0.6B, mid-to-high concurrency: 49-67%.** llama plateaus ~2.0k; vLLM scales to 4.2k.
-  Runtime/scheduler-bound regime -- vLLM's batching wins; this is llama's weakest ratio.
- **MoE 30B-A3B: llama-only.** vLLM cannot serve it on GB10 (bf16 reboots the box at MoE
-  warmup; GGUF expert tensors unmappable). llama serves it at 290 -> 1041 tok/s, scaling
-  cleanly with no npl-128 cliff.
-
-Net: the single "75-80%" number is replaced by a curve. It is *roughly* right only for the
-dense mid-band; it is too optimistic for the small model at high concurrency (49%) and moot for
-MoE (where llama is the only option). The headline is parity at low concurrency and a hardware
-(not engine) ceiling on dense decode.
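
The derived columns in the removed SERVER_SWEEP.md tables (`llama % of vLLM` and `ms/decode-step`) can be re-checked from the raw tok/s figures. The following is a minimal sketch, not part of the original commit: the helper names `pct_of_vllm` and `ms_per_step` are hypothetical, and the ms/step conversion assumes each decode step emits exactly one token per concurrent sequence.

```python
# Throughput figures copied from the SERVER_SWEEP.md tables above.
# Keys are npl (concurrency); values are (llama tok/s, vLLM tok/s).
DENSE_32B = {8: (83.2, 85.9), 32: (228.9, 301.3), 64: (367.1, 507.8), 128: (520.6, 604.0)}
SMALL_0_6B = {8: (911.3, 923.0), 32: (1701.6, 2531.4), 64: (1911.7, 3497.1), 128: (2087.6, 4227.6)}


def pct_of_vllm(llama_tps: float, vllm_tps: float) -> float:
    """llama aggregate decode throughput as a percentage of vLLM's (hypothetical helper)."""
    return round(100.0 * llama_tps / vllm_tps, 1)


def ms_per_step(batch: int, aggregate_tps: float) -> float:
    """Milliseconds per decode step.

    Assumption: one token per sequence per step, so a step yields `batch` tokens.
    """
    return round(1000.0 * batch / aggregate_tps, 1)


if __name__ == "__main__":
    for name, table in (("dense 32B", DENSE_32B), ("small 0.6B", SMALL_0_6B)):
        for npl, (llama_tps, vllm_tps) in table.items():
            print(f"{name} npl={npl}: {pct_of_vllm(llama_tps, vllm_tps)}% of vLLM")
    # Batch-8 S_TG values from the llama-batched-bench cross-check table.
    for s_tg in (90.32, 88.39, 89.33, 87.72):
        print(f"S_TG {s_tg} t/s -> {ms_per_step(8, s_tg)} ms/step")
```

Under these assumptions the recomputed values agree with the table entries to the stated precision, which suggests the percentage and ms/step columns were derived directly from the tok/s measurements.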
Author	SHA1	Message	Date
Ettore Di Giacinto	6746a6fc7e	feat(backends): make PreferDevelopmentBackends install the development image as primary When LOCALAI_PREFER_DEV_BACKENDS is set, install the -development image as the primary backend URI (keeping the released image reachable as the first fallback), instead of only reaching development as a download fallback when the released image is missing. This lets an operator force backends built from the development branch — e.g. to pick up a fix already on master before a release. Threads PreferDevelopmentBackends through SystemState so InstallBackend can see it, and reuses the same development-URI convention as the existing failure-path fallback (released tag -> branch tag + dev suffix). The unexported developmentURI helper is covered by a Ginkgo spec. Assisted-by: Claude:claude-opus-4-8 Signed-off-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-25 23:37:19 +00:00
LocalAI [bot]	d388f874de	feat(backends): darwin/Metal build for the privacy-filter backend (#10513 ) * feat(backends): darwin/Metal build for the privacy-filter backend (timeboxed try) The privacy-filter.cpp engine is already Metal-capable on Apple Silicon: it pulls ggml and never forces GGML_METAL=OFF, and ggml defaults Metal ON on Apple, so a plain Darwin build is Metal-enabled. grpc++/protobuf resolve from Homebrew via find_package(... CONFIG). It just had no darwin build path - the existing package.sh and run.sh are Linux-only and there was no make target / workflow step. Adds the bespoke darwin path, modeled on the ds4 one: - scripts/build/privacy-filter-darwin.sh: native make grpc-server, otool -L dylib bundling, create-oci-image (no Linux package.sh). - Makefile: backends/privacy-filter-darwin target (+ .NOTPARALLEL). - .github/workflows/backend_build_darwin.yml: gated build step for privacy-filter. - scripts/changed-backends.js: inferBackendPathDarwin special-case -> backend/cpp. - .github/backend-matrix.yml: includeDarwin entry (lang go, like ds4/llama-cpp). - backend/index.yaml: metal: capability + metal-privacy-filter(-development) entries. - backend/cpp/privacy-filter/run.sh: DYLD_LIBRARY_PATH branch on Darwin. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] * fix(privacy-filter): macOS proto include + bundle ggml dylibs Validated natively on an M4 (the build/package/load chain now works with Metal): - CMakeLists.txt: hw_grpc_proto compiles the generated proto/grpc sources but only linked the binary dir, so on macOS it could not find protobuf's headers (runtime_version.h) - Homebrew puts them under /opt/homebrew, not /usr/include. Link protobuf::libprotobuf + gRPC::grpc++ so their include dirs propagate. No-op on Linux (apt headers are already on the default search path). 
- privacy-filter-darwin.sh: bundle the ggml shared libs the binary @rpath-links (libggml{,-base,-cpu,-blas,-metal}); the otool -L walk only catches on-disk absolute deps and missed them. Resolved at runtime by run.sh's DYLD_LIBRARY_PATH. M4 check: arm64 grpc-server links @rpath/libggml-metal.0.dylib; with the 15 ggml dylibs + grpc/protobuf bundled, it loads clean (no dyld errors) and prints usage. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:opus-4.8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 01:18:41 +02:00
LocalAI [bot]	86677495a2	chore: ⬆️ Update ggml-org/llama.cpp to `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1` (#10514 ) ⬆️ Update ggml-org/llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-26 01:15:40 +02:00
LocalAI [bot]	253aedff06	chore: ⬆️ Update CrispStrobe/CrispASR to `8f1218141b792b8868861c1af17ba1e361b05dc0` (#10502 ) ⬆️ Update CrispStrobe/CrispASR Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-26 01:08:09 +02:00
LocalAI [bot]	74f07ecc35	fix(backends): quote $CURDIR in run.sh (fixes backends in paths with spaces) (#10519 ) fix(backends): quote $CURDIR in run.sh so backends work in paths with spaces The backend launcher scripts derive their own directory with CURDIR=$(dirname "$(realpath $0)") and then referenced it unquoted as $CURDIR (e.g. [ -f $CURDIR/lib/ld.so ], export LD_LIBRARY_PATH=$CURDIR/lib:..., exec $CURDIR/<binary> "$@"). When a backend is installed under a path that contains a space - notably macOS's ~/Library/Application Support/... - bash word-splits the unquoted $CURDIR, so the test builtin fails with "binary operator expected" and exec tries to run ".../Library/Application", yielding "No such file or directory". The backend never starts, surfacing as a gRPC "service not ready" error and an HTTP 500. Quote $CURDIR (and the realpath "$0") in every affected run.sh; no logic changes. Co-authored-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-26 01:02:48 +02:00
LocalAI [bot]	ae0da454a7	chore: pin localrecall to tagged v0.6.3 (#10518 ) #10517 pinned the pseudo-version of the postgres connection-timeout fix; mudler/LocalRecall@v0.6.3 now tags that exact commit. Use the clean release tag instead of the pseudo-version. No code change. Assisted-by: Claude:claude-opus-4-8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 01:02:15 +02:00
LocalAI [bot]	179210b970	chore: bump localrecall for postgres per-connection timeouts (#10517 ) * chore: bump localrecall for postgres per-connection timeouts Pulls mudler/LocalRecall#49: sets lock_timeout / idle_in_transaction (default on) + opt-in statement_timeout on every pooled connection, so a corrupt/wedged index (e.g. a BM25 insert spinning on a buffer-content lock) can no longer hold its relation lock forever and head-of-line block the whole vector store. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] * docs(agents): document PostgreSQL connection safety timeouts Note the POSTGRES_LOCK_TIMEOUT / POSTGRES_IDLE_IN_TRANSACTION_TIMEOUT / POSTGRES_STATEMENT_TIMEOUT env vars read by the embedded vector store, and that safe defaults are on automatically. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code] --------- Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Co-authored-by: Ettore Di Giacinto <mudler@localai.io>	2026-06-26 00:53:03 +02:00
LocalAI [bot]	6c03e46390	chore: ⬆️ Update ikawrakow/ik_llama.cpp to `b84902d2ad27c34f989f23947200c4b91b1568fd` (#10515 ) ⬆️ Update ikawrakow/ik_llama.cpp Signed-off-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: mudler <2420543+mudler@users.noreply.github.com>	2026-06-25 23:42:21 +02:00