feat(paged): qwen35 SSM decode fused recurrent-state gather (patch 0019)

Mirror of the llama-paged-dev patch 0019 engine change plus the measured results.

Step 2 of the SSM decode work: after Step 1 (in-place state write-back, patch 0018) the largest non-GEMM decode bucket was the recurrent-state get_rows gather (18.8 percent of decode GPU time). This removes that materialization, mirroring ggml_ssm_scan's ids source: ggml_gated_delta_net_inplace_ids reads each sequence's prior state directly from cache[ids[seq]] (src[5] = full cache, src[7] = ids), so combined with Step 1's in-place write the op reads AND writes the cache directly with no state materialization at all.

Race-free by construction: identity sequences (ids[seq] == rs_head + seq, the whole AR decode path) read s0 in place from the destination slot; non-identity sequences (reorder / rs_zero, e.g. multi-new-seq prefill) read from a disjoint scratch a small gather kernel populates first. ids stays a device pointer. Bit-identical to the get_rows path.

Gated to qwen35 + qwen35moe; qwen3next, kimi-linear, the non-fused and rollback paths are unchanged.

Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off):
  q36-27b-nvfp4 dense: npl32 137.64 -> 170.68 (+24.0 percent), npl128 186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391).
  q36-35b-a3b-nvfp4 MoE: npl32 299.68 -> 366.69 (+22.4 percent), npl128 409.30 -> 553.63 (+35.3 percent).

Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build (dense + MoE). nsys k_get_rows_float bucket 18.8 -> 0.7 percent. The residual decode gap to vLLM is now the FP4 GEMM (~48 percent of decode). See SSM_DECODE_FIX_RESULTS.md.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
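
The data flow described in the commit message can be modelled on the CPU. The sketch below is only an illustration under stated assumptions: the real change is a GPU op in the engine, and `stepGetRows`, `stepInPlace`, `newCache` and the toy `update` recurrence are hypothetical names invented here. It shows why reading identity sequences in place, and copying non-identity sources to a scratch area first, gives the same result as materializing every prior state with a gather.

```go
package main

import "fmt"

// stepGetRows models the reference path: gather every sequence's prior
// state into a temporary copy, update it, then write it back.
func stepGetRows(cache [][]float32, ids []int, rsHead int, update func([]float32)) {
	gathered := make([][]float32, len(ids))
	for seq, src := range ids {
		gathered[seq] = append([]float32(nil), cache[src]...)
	}
	for seq, s := range gathered {
		update(s)
		copy(cache[rsHead+seq], s)
	}
}

// stepInPlace models the fused path: identity sequences
// (ids[seq] == rsHead+seq) update their destination slot directly;
// only non-identity sources are copied to scratch before any write.
func stepInPlace(cache [][]float32, ids []int, rsHead int, update func([]float32)) {
	scratch := make(map[int][]float32)
	for seq, src := range ids {
		if src != rsHead+seq {
			scratch[seq] = append([]float32(nil), cache[src]...)
		}
	}
	for seq := range ids {
		dst := cache[rsHead+seq]
		if s, ok := scratch[seq]; ok {
			copy(dst, s) // non-identity: start from the scratch copy
		}
		update(dst) // identity: prior state is already in dst
	}
}

// newCache returns a small made-up state cache with three slots.
func newCache() [][]float32 {
	return [][]float32{{1, 2}, {3, 4}, {5, 6}}
}

func main() {
	// Toy stand-in for the recurrence; not the gated-DeltaNet math.
	update := func(s []float32) {
		for i := range s {
			s[i] = s[i]*0.5 + 1
		}
	}
	ids := []int{1, 0, 2} // sequences 0 and 1 are reordered, 2 is identity
	a, b := newCache(), newCache()
	stepGetRows(a, ids, 0, update)
	stepInPlace(b, ids, 0, update)
	fmt.Println(a)
	fmt.Println(b) // both lines print the same cache contents
}
```

In the all-identity case (the autoregressive decode path named in the message) the scratch map stays empty, which is the case where the gather cost disappears entirely.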
feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (patch 0018)
2026-06-25 00:59:28 -04:00 · 2026-06-24 23:47:51 +00:00 · 2026-06-24 22:45:49 +00:00 · 2026-06-24 21:45:42 +00:00 · 2026-06-24 21:44:22 +00:00 · 2026-06-24 21:26:16 +00:00
1027 changed files with 105533 additions and 5231 deletions
--- a/.agents/building-and-testing.md
+++ b/.agents/building-and-testing.md
@@ -15,3 +15,35 @@ Let's say the user wants to build a particular backend for a given platform. For
 - Unless the user specifies that they want you to run the command, then just print it because not all agent frontends handle long running jobs well and the output may overflow your context
 - The user may say they want to build AMD or ROCM instead of hipblas, or Intel instead of SYCL or NVIDIA insted of l4t or cublas. Ask for confirmation if there is ambiguity.
 - Sometimes the user may need extra parameters to be added to `docker build` (e.g. `--platform` for cross-platform builds or `--progress` to view the full logs), in which case you can generate the `docker build` command directly.
+
+## Test coverage gate
+
+The core Go suites (`./pkg`, `./core`, plus the in-process integration suite `./tests/e2e`) are covered by a **strict, monotonic coverage ratchet**:
+
+- `make test-coverage` — runs the suites with `covermode=atomic` instrumentation and writes a merged profile to `coverage/coverage.out`. Uses the same prerequisites as `make test`.
+  - **`--coverpkg` (`COVERAGE_COVERPKG = core/...,pkg/...`):** coverage is attributed to the core+pkg packages, not just the package under test. This is what lets the in-process `tests/e2e` suite (which drives the real HTTP server over loopback via `application.New`) credit the `core/http/endpoints/...` handlers it exercises — folding it in roughly doubled endpoint coverage (e.g. `endpoints/openai` 13.6% → 52%). The denominator is therefore *all* of `core`+`pkg` (minus generated proto, dropped via `COVERAGE_EXCLUDE_RE`), so the number isn't comparable to a plain per-package figure.
+  - **Integration suites (`COVERAGE_E2E_ROOTS = ./tests/e2e`)** run non-recursively (excludes `tests/e2e/distributed`, which needs containers) with `--label-filter=!real-models` (those need a downloaded model) against the mock backend built by `prepare-test`. `tests/integration` is deliberately excluded — it needs `make backends/local-store`, which the coverage CI job doesn't build.
+  - **Flake note:** folding integration tests into a *strict* gate means a hard e2e failure (or a spec that silently stops running) can fail the coverage gate, not just the test. `--flake-attempts` absorbs transient retryable failures; covermode=atomic keeps line coverage deterministic otherwise.
+  - **Why one ginkgo run per root (`scripts/run-coverage.sh`):** passing several recursive roots to a *single* ginkgo invocation (e.g. `ginkgo -r ./pkg ./core`) only merges **one** root's coverprofile into `--output-dir`/`--coverprofile` — the others are silently dropped. Verified with ginkgo 2.29.0: `-r ./pkg ./core` yields only `./pkg` coverage, while `-r ./core` alone yields all 34 core packages. So the script runs each root separately and concatenates the (disjoint) profiles. Don't "simplify" it back to a single multi-root invocation — that's how `core/` (including all of `core/http`, ~7.4k statements) silently vanished from the number before.
+  - **Build tags (`COVERAGE_TAGS`, passed via `GINKGO_TAGS`):** defaults to `debug auth`. The `auth` tag is required to compile the real (sqlite-backed) auth implementation and its ~150 `//go:build auth` tests — without it those files aren't built, the tests don't run, and the gate scores auth against a stub (~3.7% instead of ~38%). If you add new tag-gated tests, extend `COVERAGE_TAGS` or they won't count (and likely won't run in CI at all).
+- `make test-coverage-check` — runs `test-coverage`, then `scripts/coverage-check.sh` fails the build if total coverage is **below** the committed baseline in `coverage-baseline.txt`. The Linux job in `.github/workflows/test.yml` runs this instead of `make test`.
+- `make test-coverage-baseline` — regenerates and overwrites `coverage-baseline.txt` from the current run.
+- `make install-hooks` — sets `core.hooksPath` to the versioned `.githooks/`, whose `pre-commit` runs checks scoped to what's staged: Go changes → `make lint` + `make test-coverage-check`; `core/http/react-ui/` changes → `make test-ui-coverage-check` (Playwright e2e + UI coverage gate). A commit touching neither is skipped; bypass with `git commit --no-verify`. The hook resolves golangci-lint's new-from base to `upstream/master` → `origin/master` → `master`, so it works from a fork clone where `origin/master` is stale (passed to `make lint` via `LINT_NEW_FROM`).
+
+### React UI coverage
+
+The React UI (`core/http/react-ui/`) has **no component/unit tests** — its only tests are the Playwright e2e specs in `e2e/`, which run against the real app served by `tests/e2e-ui/ui-test-server` (the dist is `//go:embed`ed, so the server is rebuilt per coverage run). Those specs do genuinely exercise the UI (clicks, `fill`, `setInputFiles`, `getByRole`/`getByText`, visibility/value assertions).
+
+- `make test-ui-coverage` — builds an istanbul-instrumented bundle (`COVERAGE=true`, via `vite-plugin-istanbul` with `forceBuildInstrument: true` — the plugin skips production builds otherwise), re-embeds it into `ui-test-server` (the dist is `//go:embed`ed), runs the Playwright specs, and writes an `nyc` report to `core/http/react-ui/coverage/`. The specs import `{ test, expect }` from `e2e/coverage-fixtures.js` (re-exports Playwright's, plus harvests `window.__coverage__` into `.nyc_output/` after each test). Instrumentation is off unless `COVERAGE=true`, so dev/prod builds and plain `make test-ui-e2e` are unaffected (the fixture no-ops when `window.__coverage__` is absent).
+- **Browser:** the flake dev shell ships `chromium` and exports `PLAYWRIGHT_CHROMIUM_PATH`; `playwright.config.js` uses it via `launchOptions.executablePath`, and the Makefile skips `playwright install` when it's set. This avoids Playwright's downloaded browser, which can't resolve system libs (`libglib-2.0`, …) on NixOS. In CI (no `PLAYWRIGHT_CHROMIUM_PATH`) the Makefile falls back to `playwright install --with-deps chromium`.
+- The app is a React SPA, so coverage accumulates across in-app navigation within a test; a full `page.goto`/reload resets it.
+- `.nycrc.json` uses `all: true`, so **every `src/**` file is in the report**, including 0%-coverage ones — that's how you spot features with no test at all (sort the HTML report or `coverage-summary.json` by line% ascending). 
+- **UI coverage gate:** `make test-ui-coverage-check` runs the suite then `scripts/ui-coverage-check.sh`, failing if total line coverage drops more than `UI_COVERAGE_TOLERANCE` below `core/http/react-ui/coverage-baseline.txt`. `make test-ui-coverage-baseline` regenerates the baseline. Runs in CI (`tests-ui-e2e.yml`) and pre-commit on `core/http/react-ui/` changes.
+- **Why it has a tolerance (unlike the strict Go gate):** UI e2e coverage is *non-deterministic*. Specs that assert on state and end while async/lazy render work is still in flight collect those lines only when the render beats the coverage teardown — so the total drifts with machine speed/load (a fast local box reads higher than a slow CI runner), diffusely across many specs. The tolerance absorbs that drift, so set the baseline *below* the slow-CI floor, never to a fast-local `make test-ui-coverage-baseline` number, or CI flaps.
+- **Raising coverage is cheap:** a *render-smoke* spec (navigate to a route, assert its header renders) mounts a lazy page and runs its full render + initial effects, capturing most of its lines in a few lines of test — see `e2e/page-render-smoke.spec.js`. Auth is disabled in the test server (`isAdmin=true`), so `RequireAdmin`/`RequireFeature` routes render without a mock. The most *deterministic* win is removing a race: make a spec `await` a rendered element before ending (see `e2e/agents.spec.js` → AgentCreate) so its lines count every run.
+
+Rules (both gates):
+- **Install the hooks:** `make install-hooks` once per clone so lint + coverage run pre-commit. Don't lean on CI for what the hook catches.
+- **Don't work around the gate:** never `git commit --no-verify`, and never hand-lower a baseline or widen a tolerance to turn a red gate green. The ratchet only moves up.
+- If a change drops coverage, **add tests** (sort `coverage-summary.json` by line% ascending to find untested code) rather than editing the baseline. When coverage legitimately rises, commit the regenerated baseline (`make test-coverage-baseline` / `test-ui-coverage-baseline`).
+- The Go gate is **strict — no tolerance**; `covermode=atomic` keeps it deterministic. The UI gate keeps a small tolerance only because its e2e coverage isn't.
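
The two gate rules documented above (strict for Go, tolerance-based for the UI) reduce to one comparison. The following is a minimal sketch, not the contents of `scripts/coverage-check.sh` or `scripts/ui-coverage-check.sh`; `passesGate`, `ratchet` and all percentages are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// passesGate reports whether a run's total coverage (in percent) is
// acceptable. tolerance == 0 models the strict Go gate; a positive
// tolerance models the UI gate, which allows a bounded drop.
func passesGate(current, baseline, tolerance float64) bool {
	return current >= baseline-tolerance
}

// ratchet returns the baseline to commit after a run. It only moves up,
// matching the rule that a baseline is never lowered by hand.
func ratchet(current, baseline float64) float64 {
	if current > baseline {
		return current
	}
	return baseline
}

func main() {
	// Made-up numbers, not measurements from the repository.
	fmt.Println(passesGate(41.9, 42.0, 0))   // strict gate, slight drop
	fmt.Println(passesGate(49.5, 50.0, 0.5)) // tolerant gate, drop within tolerance
	fmt.Println(ratchet(43.0, 42.0))         // coverage rose, baseline follows
}
```

With a zero tolerance any drop fails, which is why the strict gate depends on deterministic coverage numbers.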
--- a/.agents/coding-style.md
+++ b/.agents/coding-style.md
@@ -50,6 +50,17 @@ Do not mix styles within a package. If you are extending tests in a package that

 This is enforced by `golangci-lint` via the `forbidigo` linter (see `.golangci.yml`); calls like `t.Errorf` / `t.Fatalf` / `t.Run` / `t.Skip` / `t.Logf` are flagged. Run `make lint` locally before submitting; the same check runs in CI (`.github/workflows/lint.yml`).

+## Outbound HTTP
+
+All outbound HTTP must go through `github.com/mudler/LocalAI/pkg/httpclient` rather than the standard library's default client. Use `httpclient.New(...)` (no body deadline — safe for streaming/SSE) or `httpclient.NewWithTimeout(d, ...)` (simple request/response). Both **refuse redirects by default** and set a TLS 1.2 floor.
+
+The reason is GHSA-3mj3-57v2-4636: the std default client follows redirects, and on a *cross-host* redirect Go forwards custom credential headers (e.g. Anthropic's `x-api-key`) to the redirect target, leaking the secret. `httpclient` fails closed instead.
+
+- Need to follow redirects (download CDNs, registry blobs, GitHub asset URLs)? Pass `httpclient.WithFollowRedirects()` — it still strips credential headers on any cross-host hop.
+- Have a custom transport (IP-pinned dialer, HTTP/2 tuning, a credential-injecting `RoundTripper`)? Pass `httpclient.WithTransport(rt)`, basing the transport on `httpclient.HardenedTransport()` to keep the TLS floor. Handed a `*http.Client` by a library? `httpclient.Harden(c)` applies the policy in place.
+
+This is enforced by `forbidigo` (see `.golangci.yml`): `http.DefaultClient` and `http.Get`/`Post`/`PostForm`/`Head` are flagged. The `&http.Client{}` composite literal can't be matched precisely by forbidigo without also flagging legitimate `*http.Client` type references, so that form is caught by review — don't construct raw clients.
+
 ## Documentation

 The project documentation is located in `docs/content`. When adding new features or changing existing functionality, it is crucial to update the documentation to reflect these changes. This helps users understand how to use the new capabilities and ensures the documentation stays relevant.
--- a/.agents/ds4-backend.md
+++ b/.agents/ds4-backend.md
@@ -44,6 +44,39 @@ maps to `DS4_THINK_HIGH`. We pass the chosen mode to `ds4_chat_append_assistant_
 via `ModelOptions.Options[] = "kv_cache_dir:/some/path"`. Format is **our own** -
 NOT bit-compatible with ds4-server's KVC files (interop is a follow-up plan).

+## Engine options (LoadModel)
+
+`LoadModel` maps `ModelOptions.Options[]` (`"key:value"`, from model-YAML
+`options:`) onto `ds4_engine_options` through a **declarative table**
+(`kEngineOptSpecs` + `apply_engine_option` in `grpc-server.cpp`). The struct is
+plain C with no reflection, so the field set is enumerated once in the table;
+adding a future engine knob is a one-line table row, not a new branch. Unknown
+keys are ignored (back-compat). A bare flag (`ssd_streaming` with no value)
+means `true`. Path-type values (`mtp_path`, `expert_profile_path`,
+`directional_steering_file`) resolve **relative to the model directory**, so a
+gallery entry can reference a companion file it downloaded by bare filename;
+absolute values pass through. `ds4_role` / `ds4_layers` / `ds4_listen` /
+`ds4_route_timeout` / `kv_cache_dir` keep their dedicated handling (validation
+and coordinator wiring) and are not in the table.
+
+Wired keys: `mtp_path`, `mtp_draft`, `mtp_margin`, `prefill_chunk`,
+`power_percent`, `warm_weights`, `quality`, `ssd_streaming`,
+`ssd_streaming_cold`, `ssd_streaming_preload_experts`,
+`ssd_streaming_cache_experts` (count or `NGB`, sets both experts+bytes via
+`ds4_parse_streaming_cache_experts_arg`), `simulate_used_memory` (`NGB` via
+`ds4_parse_gib_arg`), `expert_profile_path`, `directional_steering_file`,
+`directional_steering_attn`, `directional_steering_ffn`.
+
+## SSD streaming (running models larger than RAM)
+
+ds4's **SSD streaming** keeps non-routed weights resident and streams routed MoE
+experts from the GGUF on cache misses, turning "does it fit in RAM" into a speed
+spectrum. **Metal (Darwin) only** - it is a no-op on CUDA/CPU. Enable with
+`options: ["ssd_streaming"]`; size the routed-expert cache with
+`ssd_streaming_cache_experts:NGB` (omit for ds4's automatic 80%-of-working-set
+budget). Gallery entries built on this: `deepseek-v4-flash-q4-ssd` (153 GB Flash
+on a 128 GB Mac) and `deepseek-v4-pro-q2-ssd` (433 GB Pro, experimental).
+
 ## Build matrix

 | Build | Where | Notes |
@@ -68,6 +101,34 @@ go test -count=1 -timeout=30m -v ./tests/e2e-backends/...

 CI does not load the model; the suite is opt-in via env vars.

+## Distributed mode
+
+ds4 supports **layer-split** distributed inference (a model too big for one host,
+split by transformer layer; the GGUF must be present on every machine, each loads
+only its slice). Topology is **inverted** vs llama.cpp: the coordinator listens,
+workers dial in.
+
+- **`ds4-worker` binary**: built and packaged next to `grpc-server` (`package.sh`
+  copies it into `package/`). Links the same engine objects plus `ds4_distributed.o`;
+  **no gRPC/protobuf dependency** (speaks ds4's own TCP transport), so it builds
+  even where `grpc-server` can't. Runs the worker serving loop (`ds4_dist_run`).
+- **Coordinator wiring**: the ds4 `grpc-server` acts as coordinator when `LoadModel`
+  `ModelOptions.Options` (from model-YAML `options:`) carry:
+  - `ds4_role:coordinator` (enables distributed mode; absent → single-node, back-compat)
+  - `ds4_layers:0:19` (coordinator's own slice, inclusive; `N:output` includes the head)
+  - `ds4_listen:0.0.0.0:1234` (address workers dial into)
+  - `ds4_route_timeout:60` (optional; seconds Predict/PredictStream wait for the route
+    to form before returning gRPC `UNAVAILABLE`; default 60)
+- **Worker CLI**: `local-ai worker ds4-distributed -- <ds4-worker args>` resolves the
+  ds4 backend and execs the packaged `ds4-worker` (raw passthrough), e.g.
+  `--role worker --model /models/ds4flash.gguf --layers 20:output --coordinator <host> 1234`.
+
+Opt-in e2e in `tests/e2e-backends/backend_test.go`, gated by
+`BACKEND_TEST_DS4_DISTRIBUTED=1` (plus `BACKEND_TEST_DS4_WORKER_BINARY`,
+`BACKEND_TEST_DS4_WORKER_LAYERS`, `BACKEND_TEST_DS4_COORDINATOR_LAYERS`,
+`BACKEND_TEST_DS4_LISTEN`). Design spec:
+`docs/superpowers/specs/2026-05-30-ds4-distributed-inference-design.md`.
+
 ## Importer

 `core/gallery/importers/ds4.go` (`DS4Importer`) auto-detects ds4 weights by
--- a/.dockerignore
+++ b/.dockerignore
@@ -4,6 +4,7 @@
 .devcontainer
 models
 backends
+volumes
 examples/chatbot-ui/models
 backend/go/image/stablediffusion-ggml/build/
 backend/go/*/build
@@ -21,3 +22,36 @@ __pycache__
 # backend virtual environments
 **/venv
 backend/python/**/source
+
+# In-place llama.cpp clone + per-variant build copies. The Makefile
+# clones llama.cpp itself at the pinned LLAMA_VERSION; if a stale
+# local checkout is COPY'd into the image, the `llama.cpp:` target
+# sees the directory and skips re-cloning, so grpc-server.cpp ends
+# up compiled against whatever (likely older) commit the host had.
+backend/cpp/llama-cpp/llama.cpp
+backend/cpp/llama-cpp-*-build
+
+# privacy-filter: same in-place pattern. The Makefile fetches privacy-filter.cpp
+# at the pinned commit (or symlinks a PRIVACY_FILTER_SRC checkout for local dev).
+# A stale dir/symlink COPY'd into the image makes the clone step fail (dangling
+# symlink) or compile against the wrong commit, so keep host build state out.
+backend/cpp/privacy-filter/privacy-filter.cpp
+backend/cpp/privacy-filter/build
+backend/cpp/privacy-filter/grpc-server
+backend/cpp/privacy-filter/package
+
+# Rust backend build output (sources are tracked; target/ is generated)
+backend/rust/*/target
+
+# Local-only artifacts that bloat the build context but the image never needs.
+# Saved image tarballs, locally-installed backends, the host-built binary, and
+# assorted tool/scratch dirs. None of these are git-tracked.
+backend-images
+local-backends
+local-ai
+.crush
+protoc
+tests
+
+# Installed via npm inside the build stage; no need to ship the host copy.
+**/node_modules
--- a/.githooks/pre-commit
+++ b/.githooks/pre-commit
@@ -0,0 +1,60 @@
+#!/usr/bin/env sh
+#
+# LocalAI pre-commit hook. Install it (once per clone) with:
+#
+#     make install-hooks
+#
+# Runs only the checks relevant to what's staged:
+#   - Go files          -> make lint + make test-coverage-check
+#   - core/http/react-ui -> make test-ui-coverage-check (Playwright e2e + gate)
+# A commit touching neither is skipped entirely (docs/YAML/etc. can't change
+# lint findings, Go coverage, or the UI).
+#
+# To bypass for a single commit (e.g. a WIP checkpoint): git commit --no-verify
+set -eu
+
+repo_root="$(git rev-parse --show-toplevel)"
+cd "$repo_root"
+
+staged="$(git diff --cached --name-only --diff-filter=ACMRD)"
+
+go_changed=0
+ui_changed=0
+if echo "$staged" | grep -qE '\.go$'; then go_changed=1; fi
+if echo "$staged" | grep -qE '^core/http/react-ui/'; then ui_changed=1; fi
+
+if [ "$go_changed" -eq 0 ] && [ "$ui_changed" -eq 0 ]; then
+	echo "pre-commit: no Go or React UI changes staged — skipping."
+	exit 0
+fi
+
+if [ "$go_changed" -eq 1 ]; then
+	# Resolve the ref golangci-lint's new-from-merge-base should compare
+	# against. .golangci.yml pins origin/master, which is correct in CI
+	# (origin == the canonical repo) but wrong from a fork clone, where
+	# origin/master lags behind and lint would report the whole upstream
+	# backlog. Prefer upstream/master, then origin/master, then master.
+	lint_base=""
+	for ref in upstream/master origin/master master; do
+		if git rev-parse --verify --quiet "${ref}^{commit}" >/dev/null 2>&1; then
+			lint_base="$ref"
+			break
+		fi
+	done
+
+	echo "pre-commit ▶ golangci-lint (make lint${lint_base:+, new-from $lint_base})"
+	make lint LINT_NEW_FROM="$lint_base"
+
+	echo "pre-commit ▶ coverage gate (make test-coverage-check) — builds and runs the"
+	echo "             pkg/core suites plus tests/e2e; can take a few minutes."
+	make test-coverage-check
+fi
+
+if [ "$ui_changed" -eq 1 ]; then
+	echo "pre-commit ▶ React UI e2e + coverage gate (make test-ui-coverage-check) —"
+	echo "             rebuilds the UI + ui-test-server, runs the Playwright specs, and"
+	echo "             fails if line coverage regressed; can take a couple of minutes."
+	make test-ui-coverage-check
+fi
+
+echo "pre-commit ✓ all relevant checks passed"
--- a/.github/backend-matrix.yml
+++ b/.github/backend-matrix.yml
--- a/.github/gallery-agent/main.go
+++ b/.github/gallery-agent/main.go
@@ -3,6 +3,7 @@ package main
 import (
 	"context"
 	"encoding/json"
+	"errors"
 	"fmt"
 	"os"
 	"strconv"
@@ -113,6 +114,17 @@ func main() {
 	fmt.Println("Searching for trending models on HuggingFace...")
 	rawModels, err := client.GetTrending(searchTerm, limit)
 	if err != nil {
+		if errors.Is(err, hfapi.ErrRateLimited) {
+			fmt.Printf("HuggingFace API is rate limited after retries, skipping this run: %v\n", err)
+			writeSummary(AddedModelSummary{
+				SearchTerm:     searchTerm,
+				TotalFound:     0,
+				ModelsAdded:    0,
+				Quantization:   quantization,
+				ProcessingTime: time.Since(startTime).String(),
+			})
+			return
+		}
 		fmt.Fprintf(os.Stderr, "Error fetching models: %v\n", err)
 		os.Exit(1)
 	}
@@ -277,4 +289,3 @@ func truncateString(s string, maxLen int) string {
 	}
 	return s[:maxLen] + "..."
 }
-
--- a/.github/workflows/backend_build_darwin.yml
+++ b/.github/workflows/backend_build_darwin.yml
@@ -98,6 +98,7 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
+            /opt/homebrew/Cellar/nlohmann-json
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}

      - name: Dependencies
@@ -109,7 +110,10 @@ jobs:
          # Without explicitly installing them, a brew cache-hit run restores
          # ccache's Cellar dir but skips installing those transitive deps,
          # and ccache fails at runtime with `dyld: Library not loaded`.
-          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd
+          # nlohmann-json is header-only and required by the ds4 backend
+          # (dsml_renderer.cpp includes <nlohmann/json.hpp>); on Linux it comes
+          # from the apt-installed nlohmann-json3-dev in the build image.
+          brew install protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json
          # Force-reinstall ccache so brew re-validates its full runtime-dep
          # closure on every run. This is the durable fix: when the upstream
          # ccache formula gains a new transitive dep (as it has multiple times
@@ -128,7 +132,7 @@ jobs:
          # and decides "already installed" without re-linking, so on a cache-
          # hit run the formulas aren't on PATH. Force-link them; --overwrite
          # tolerates pre-existing symlinks from earlier installs.
-          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd 2>/dev/null || true
+          brew link --overwrite protobuf grpc make protoc-gen-go protoc-gen-go-grpc libomp llvm ccache blake3 fmt hiredis xxhash zstd nlohmann-json 2>/dev/null || true

      - name: Save Homebrew cache
        if: github.event_name != 'pull_request' && steps.brew-cache.outputs.cache-hit != 'true'
@@ -148,6 +152,7 @@ jobs:
            /opt/homebrew/Cellar/hiredis
            /opt/homebrew/Cellar/xxhash
            /opt/homebrew/Cellar/zstd
+            /opt/homebrew/Cellar/nlohmann-json
          key: brew-${{ runner.os }}-${{ runner.arch }}-v1-${{ hashFiles('.github/workflows/backend_build_darwin.yml') }}

      # ---- ccache for llama.cpp CMake builds ----
--- a/.github/workflows/bump_deps.yaml
+++ b/.github/workflows/bump_deps.yaml
@@ -26,10 +26,26 @@ jobs:
            variable: "DS4_VERSION"
            branch: "main"
            file: "backend/cpp/ds4/Makefile"
+          - repository: "localai-org/privacy-filter.cpp"
+            variable: "PRIVACY_FILTER_VERSION"
+            branch: "master"
+            file: "backend/cpp/privacy-filter/Makefile"
          - repository: "ggml-org/whisper.cpp"
            variable: "WHISPER_CPP_VERSION"
            branch: "master"
            file: "backend/go/whisper/Makefile"
+          - repository: "CrispStrobe/CrispASR"
+            variable: "CRISPASR_VERSION"
+            branch: "main"
+            file: "backend/go/crispasr/Makefile"
+          - repository: "mudler/parakeet.cpp"
+            variable: "PARAKEET_VERSION"
+            branch: "master"
+            file: "backend/go/parakeet-cpp/Makefile"
+          - repository: "mudler/depth-anything.cpp"
+            variable: "DEPTHANYTHING_VERSION"
+            branch: "master"
+            file: "backend/go/depth-anything-cpp/Makefile"
          - repository: "leejet/stable-diffusion.cpp"
            variable: "STABLEDIFFUSION_GGML_VERSION"
            branch: "master"
@@ -50,10 +66,22 @@ jobs:
            variable: "SAM3_VERSION"
            branch: "main"
            file: "backend/go/sam3-cpp/Makefile"
-          - repository: "predict-woo/qwen3-tts.cpp"
-            variable: "QWEN3TTS_CPP_VERSION"
+          - repository: "mudler/rf-detr.cpp"
+            variable: "RFDETR_VERSION"
            branch: "main"
+            file: "backend/go/rfdetr-cpp/Makefile"
+          - repository: "mudler/locate-anything.cpp"
+            variable: "LOCATEANYTHING_VERSION"
+            branch: "master"
+            file: "backend/go/locate-anything-cpp/Makefile"
+          - repository: "ServeurpersoCom/qwentts.cpp"
+            variable: "QWEN3TTS_CPP_VERSION"
+            branch: "master"
            file: "backend/go/qwen3-tts-cpp/Makefile"
+          - repository: "ServeurpersoCom/omnivoice.cpp"
+            variable: "OMNIVOICE_VERSION"
+            branch: "master"
+            file: "backend/go/omnivoice-cpp/Makefile"
          - repository: "localai-org/vibevoice.cpp"
            variable: "VIBEVOICE_CPP_VERSION"
            branch: "master"
--- a/.github/workflows/secscan.yaml
+++ b/.github/workflows/secscan.yaml
@@ -18,10 +18,13 @@ jobs:
        if: ${{ github.actor != 'dependabot[bot]' }}
      - name: Run Gosec Security Scanner
        if: ${{ github.actor != 'dependabot[bot]' }}
-        uses: securego/gosec@v2.22.9
+        uses: securego/gosec@v2.27.1
        with:
          # we let the report trigger content trigger a failure using the GitHub Security features.
-          args: '-no-fail -fmt sarif -out results.sarif ./...'
+          # backend/go/supertonic is excluded: it vendors upstream supertone-inc/supertonic
+          # (helper.go), whose findings (G304 model-file loads, G404 math/rand for flow-matching
+          # noise, G104 unhandled errors) are inherent to that upstream code, not ours to rewrite.
+          args: '-no-fail -exclude-dir=backend/go/supertonic -fmt sarif -out results.sarif ./...'
      - name: Upload SARIF file
        if: ${{ github.actor != 'dependabot[bot]' }}
        uses: github/codeql-action/upload-sarif@v4
--- a/.github/workflows/stalebot.yml
+++ b/.github/workflows/stalebot.yml
@@ -11,7 +11,7 @@ jobs:
    if: github.repository == 'mudler/LocalAI'
    runs-on: ubuntu-latest
    steps:
-      - uses: actions/stale@b5d41d4e1d5dceea10e7104786b73624c18a190f # v9
+      - uses: actions/stale@eb5cf3af3ac0a1aa4c9c45633dd1ae542a27a899 # v9
        with:
          stale-issue-message: 'This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.'
          stale-pr-message: 'This PR is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 10 days.'
--- a/.github/workflows/test-extra.yml
+++ b/.github/workflows/test-extra.yml
@@ -37,6 +37,8 @@ jobs:
      sglang: ${{ steps.detect.outputs.sglang }}
      acestep-cpp: ${{ steps.detect.outputs.acestep-cpp }}
      qwen3-tts-cpp: ${{ steps.detect.outputs.qwen3-tts-cpp }}
+      rfdetr-cpp: ${{ steps.detect.outputs.rfdetr-cpp }}
+      locate-anything-cpp: ${{ steps.detect.outputs.locate-anything-cpp }}
      vibevoice-cpp: ${{ steps.detect.outputs.vibevoice-cpp }}
      localvqe: ${{ steps.detect.outputs.localvqe }}
      voxtral: ${{ steps.detect.outputs.voxtral }}
@@ -45,6 +47,7 @@ jobs:
      speaker-recognition: ${{ steps.detect.outputs.speaker-recognition }}
      sherpa-onnx: ${{ steps.detect.outputs.sherpa-onnx }}
      whisper: ${{ steps.detect.outputs.whisper }}
+      parakeet-cpp: ${{ steps.detect.outputs.parakeet-cpp }}
    steps:
      - name: Checkout repository
        uses: actions/checkout@v6
@@ -561,7 +564,7 @@ jobs:
      - name: Run e2e-backends smoke
        env:
          BACKEND_IMAGE: quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp
-          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias
+          BACKEND_TEST_CAPS: health,load,predict,stream,logprobs,logit_bias,tokenize
        run: |
          make test-extra-backend
  # Realtime e2e with sherpa-onnx driving VAD + STT + TTS against a mocked LLM.
@@ -632,6 +635,26 @@ jobs:
      - name: Build whisper backend image and run transcription gRPC e2e tests
        run: |
          make test-extra-backend-whisper-transcription
+  # Parakeet ASR via the parakeet-cpp backend (C++/ggml port of NeMo
+  # Parakeet). Drives AudioTranscription (offline, with word timestamps) on
+  # tdt_ctc-110m + the JFK 11s clip.
+  tests-parakeet-cpp-grpc-transcription:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.parakeet-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    timeout-minutes: 90
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Setup Go
+        uses: actions/setup-go@v5
+        with:
+          go-version: '1.25.4'
+      - name: Build parakeet-cpp backend image and run transcription gRPC e2e tests
+        run: |
+          make test-extra-backend-parakeet-cpp-transcription
  # VITS TTS via the sherpa-onnx backend. Drives both TTS (file write) and
  # TTSStream (PCM chunks) on the e2e-backends harness.
  tests-sherpa-onnx-grpc-tts:
@@ -843,6 +866,81 @@ jobs:
      - name: Test qwen3-tts-cpp
        run: |
          make --jobs=5 --output-sync=target -C backend/go/qwen3-tts-cpp test
+  # Per-backend smoke for rfdetr-cpp: builds the .so + Go binary and runs
+  # `make -C backend/go/rfdetr-cpp test`. test.sh fetches the small (~20 MB)
+  # rfdetr-nano-q8_0 GGUF from the published mudler/rfdetr-cpp-nano HF repo
+  # via curl and synthesises a tiny PNG to exercise the wire protocol.
+  tests-rfdetr-cpp:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.rfdetr-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y build-essential cmake curl libopenblas-dev
+      - name: Setup Go
+        uses: actions/setup-go@v5
+      - name: Display Go version
+        run: go version
+      - name: Proto Dependencies
+        run: |
+          # Install protoc
+          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
+          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+          rm protoc.zip
+          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
+          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
+          PATH="$PATH:$HOME/go/bin" make protogen-go
+      - name: Build rfdetr-cpp
+        run: |
+          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp
+      - name: Test rfdetr-cpp
+        run: |
+          make --jobs=5 --output-sync=target -C backend/go/rfdetr-cpp test
+  # Per-backend e2e for locate-anything-cpp: builds the .so + Go binary and
+  # runs `make -C backend/go/locate-anything-cpp test`. test.sh fetches the
+  # locate-anything-q8_0 GGUF (~6.3 GB, NVIDIA LocateAnything-3B) from the
+  # published mudler/locate-anything.cpp-gguf HF repo + a COCO image, then the
+  # Go wire test loads the model and runs an open-vocabulary Detect, asserting
+  # at least one labeled box. Heavier than the other Go backends (it is a 3B model),
+  # so it is gated to changes under backend/go/locate-anything-cpp/.
+  tests-locate-anything-cpp:
+    needs: detect-changes
+    if: needs.detect-changes.outputs.locate-anything-cpp == 'true' || needs.detect-changes.outputs.run-all == 'true'
+    runs-on: ubuntu-latest
+    steps:
+      - name: Clone
+        uses: actions/checkout@v6
+        with:
+          submodules: true
+      - name: Dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y build-essential cmake curl libopenblas-dev
+      - name: Setup Go
+        uses: actions/setup-go@v5
+      - name: Display Go version
+        run: go version
+      - name: Proto Dependencies
+        run: |
+          # Install protoc
+          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
+          unzip -j -d /usr/local/bin protoc.zip bin/protoc && \
+          rm protoc.zip
+          go install google.golang.org/protobuf/cmd/protoc-gen-go@v1.34.2
+          go install google.golang.org/grpc/cmd/protoc-gen-go-grpc@1958fcbe2ca8bd93af633f11e97d44e567e945af
+          PATH="$PATH:$HOME/go/bin" make protogen-go
+      - name: Build locate-anything-cpp
+        run: |
+          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp
+      - name: Test locate-anything-cpp
+        run: |
+          make --jobs=5 --output-sync=target -C backend/go/locate-anything-cpp test
  # Per-backend smoke for vibevoice-cpp: builds the .so + Go binary and
  # runs `make -C backend/go/vibevoice-cpp test`. test.sh auto-downloads
  # the published mudler/vibevoice.cpp-models bundle (TTS Q8_0 + ASR Q4_K
--- a/.github/workflows/test.yml
+++ b/.github/workflows/test.yml
@@ -53,9 +53,22 @@ jobs:
          node-version: '22'
      - name: Build React UI
        run: make react-ui
-      - name: Test
+      # Runs the core suite with coverage and fails if total coverage dropped
+      # below the committed baseline (coverage-baseline.txt). The gate is
+      # strict — any decrease fails. Raise the baseline with
+      # `make test-coverage-baseline` and commit it when coverage rises.
+      - name: Test (with coverage gate)
        run: |
-          PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test
+          PATH="$PATH:/root/go/bin" make --jobs 5 --output-sync=target test-coverage-check
+      - name: Upload coverage report
+        if: ${{ always() }}
+        uses: actions/upload-artifact@v4
+        with:
+          name: coverage-linux
+          path: |
+            coverage/coverage.out
+            coverage/coverage.html
+          if-no-files-found: ignore
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
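Reviewer note on the coverage gate wired in above: the script behind `make test-coverage-check` (scripts/coverage-check.sh) is not shown in this section, so what follows is only a minimal Go sketch of the comparison such a gate performs. It assumes the `total:` line format printed by `go tool cover -func`. The names `parseTotal` and `checkCoverage` and the tolerance parameter are hypothetical. A tolerance of 0 corresponds to the strict "any decrease fails" behaviour the workflow comment describes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseTotal pulls the percentage out of the "total:" line that
// `go tool cover -func` prints last, e.g. "total:\t(statements)\t61.3%".
func parseTotal(funcOutput string) (float64, error) {
	for _, line := range strings.Split(funcOutput, "\n") {
		if !strings.HasPrefix(line, "total:") {
			continue
		}
		fields := strings.Fields(line)
		last := strings.TrimSuffix(fields[len(fields)-1], "%")
		return strconv.ParseFloat(last, 64)
	}
	return 0, fmt.Errorf("no total: line in cover output")
}

// checkCoverage (hypothetical) fails when current is more than tolerance
// percentage points below baseline.
func checkCoverage(current, baseline, tolerance float64) error {
	if current < baseline-tolerance {
		return fmt.Errorf("coverage %.1f%% is below baseline %.1f%% (tolerance %.1fpp)",
			current, baseline, tolerance)
	}
	return nil
}

func main() {
	// Made-up cover output, for illustration only.
	out := "pkg/foo/foo.go:10:\tBar\t\t80.0%\ntotal:\t\t\t(statements)\t61.3%\n"
	cur, err := parseTotal(out)
	if err != nil {
		panic(err)
	}
	fmt.Println(checkCoverage(cur, 62.0, 0))   // strict gate: returns an error
	fmt.Println(checkCoverage(cur, 62.0, 1.0)) // within tolerance: <nil>
}
```

Keeping the tolerance an explicit argument makes the strict and the jitter-absorbing variants the same code path, so the policy choice lives in one number rather than in two scripts.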
--- a/.github/workflows/tests-ui-e2e.yml
+++ b/.github/workflows/tests-ui-e2e.yml
@@ -37,6 +37,10 @@ jobs:
        uses: actions/setup-node@v6
        with:
          node-version: '22'
+      - name: Setup Bun
+        uses: oven-sh/setup-bun@v2
+        with:
+          bun-version: '1.3.11'
      - name: Proto Dependencies
        run: |
          curl -L -s https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-x86_64.zip -o protoc.zip && \
@@ -48,16 +52,12 @@ jobs:
        run: |
          sudo apt-get update
          sudo apt-get install -y build-essential libopus-dev
-      - name: Build UI test server
-        run: PATH="$PATH:$HOME/go/bin" make build-ui-test-server
-      - name: Install Playwright
-        working-directory: core/http/react-ui
-        run: |
-          npm install
-          npx playwright install --with-deps chromium
-      - name: Run Playwright tests
-        working-directory: core/http/react-ui
-        run: npx playwright test
+      # Builds an instrumented UI bundle, runs the Playwright specs, and fails
+      # if line coverage regressed beyond the jitter tolerance (the gate is
+      # in `make test-ui-coverage-check`). PLAYWRIGHT_CHROMIUM_PATH is unset
+      # here, so scripts/ensure-playwright-browser.sh installs Chromium via apt.
+      - name: Run UI e2e + coverage gate
+        run: PATH="$PATH:$HOME/go/bin" make test-ui-coverage-check
      - name: Upload Playwright report
        if: ${{ failure() }}
        uses: actions/upload-artifact@v7
@@ -65,6 +65,14 @@ jobs:
          name: playwright-report
          path: core/http/react-ui/playwright-report/
          retention-days: 7
+      - name: Upload UI coverage report
+        if: ${{ always() }}
+        uses: actions/upload-artifact@v7
+        with:
+          name: ui-coverage
+          path: core/http/react-ui/coverage/
+          if-no-files-found: ignore
+          retention-days: 7
      - name: Setup tmate session if tests fail
        if: ${{ failure() }}
        uses: mxschmitt/action-tmate@v3.23
--- a/.gitignore
+++ b/.gitignore
@@ -26,6 +26,10 @@ go-bert
 LocalAI
 /local-ai
 /local-ai-launcher
+# Root-level build artifacts produced by running `go build ./...` against
+# Go backend packages whose main lives under backend/go/.
+/cloud-proxy
+/local-store
 # prevent above rules from omitting the helm chart
 !charts/*
 # prevent above rules from omitting the api/localai folder
@@ -66,10 +70,17 @@ docs/static/gallery.html
 # per-developer customization files for the development container
 .devcontainer/customization/*

+# Coverage profiles (the committed baseline is coverage-baseline.txt)
+/coverage/
+
 # React UI build artifacts (keep placeholder dist/index.html)
 core/http/react-ui/node_modules/
 core/http/react-ui/dist

+# React UI coverage (vite-plugin-istanbul + nyc, via `make test-ui-coverage`)
+core/http/react-ui/.nyc_output/
+core/http/react-ui/coverage/
+
 # Extracted backend binaries for container-based testing
 local-backends/

--- a/.golangci.yml
+++ b/.golangci.yml
@@ -56,10 +56,26 @@ linters:
        # are exempt — see linters.exclusions.rules below.
        - pattern: '^os\.(Getenv|LookupEnv|Environ)$'
          msg: 'Plumb config through ApplicationConfig (or the relevant CLI struct) instead of reading env directly. CLI entry points (core/cli/) bind env vars via kong''s `env:` tag — that is the only sanctioned env→struct boundary. See .agents/coding-style.md.'
+        # Outbound HTTP must go through pkg/httpclient, which refuses redirects
+        # by default and sets a TLS floor. The std-library default client and
+        # the http.Get/Post/... convenience helpers follow redirects (up to 10)
+        # and, on a cross-host redirect, forward custom credential headers such
+        # as Anthropic's x-api-key to the redirect target — leaking the secret
+        # (GHSA-3mj3-57v2-4636). forbidigo can't precisely match the
+        # `&http.Client{}` composite literal without also flagging legitimate
+        # `*http.Client` type references, so that form is enforced by
+        # convention + review; these two patterns catch the implicit-default
+        # client, which is the common footgun.
+        - pattern: '^http\.DefaultClient$'
+          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.DefaultClient — the std client follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
+        - pattern: '^http\.(Get|Post|PostForm|Head)$'
+          msg: 'Use pkg/httpclient (httpclient.New / NewWithTimeout) instead of http.Get/Post/PostForm/Head — these use http.DefaultClient, which follows redirects and leaks credential headers cross-host (GHSA-3mj3-57v2-4636). See .agents/coding-style.md.'
  exclusions:
    paths:
      # Upstream whisper.cpp source tree fetched by the whisper backend Makefile.
      - 'backend/go/whisper/sources'
+      # Vendored upstream supertonic pipeline (supertone-inc/supertonic go/helper.go).
+      - 'backend/go/supertonic/helper.go'
      - 'docs/'
    rules:
      # CLI entry points: kong's `env:"..."` tag is the legitimate env→struct
@@ -95,3 +111,18 @@ linters:
      - path: _test\.go$
        text: 'os\.(Getenv|LookupEnv|Environ)'
        linters: [forbidigo]
+      # pkg/httpclient is the sanctioned home for outbound HTTP clients; it
+      # necessarily references net/http directly.
+      - path: ^pkg/httpclient/
+        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
+        linters: [forbidigo]
+      # Tests drive local httptest servers where redirect/TLS hardening is
+      # irrelevant; the std client is fine there.
+      - path: _test\.go$
+        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
+        linters: [forbidigo]
+      # Vendored upstream whisper.cpp Go bindings are a separate module and
+      # cannot import pkg/httpclient.
+      - path: ^backend/go/whisper/sources/
+        text: 'http\.(DefaultClient|Get|Post|PostForm|Head)'
+        linters: [forbidigo]
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -35,6 +35,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants]

 ## Quick Reference

+- **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md).
 - **Logging**: Use `github.com/mudler/xlog` (same API as slog)
 - **Go style**: Prefer `any` over `interface{}`
 - **Comments**: Explain *why*, not *what*
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -198,6 +198,7 @@ For AI-assisted development, see [`AGENTS.md`](AGENTS.md) (or the equivalent [`C

 - Prefer modern Go idioms — for example, use `any` instead of `interface{}`.
 - Use [`golangci-lint`](https://golangci-lint.run) to catch common issues before submitting a PR.
+- Run `make install-hooks` once per clone to enable the pre-commit hook: Go changes run `make lint` + the coverage gate (`make test-coverage-check`); `core/http/react-ui/` changes run the Playwright e2e suite (`make test-ui`). Bypass a single commit with `git commit --no-verify`.
 - Use [`github.com/mudler/xlog`](https://github.com/mudler/xlog) for logging (same API as `slog`). Do not use `fmt.Println` or the standard `log` package for operational logging.
 - Use tab indentation for Go files (as defined in `.editorconfig`).

@@ -265,6 +266,12 @@ The e2e tests run LocalAI in a Docker container and exercise the API:
 make test-e2e
 ```

+### React UI tests and coverage
+
+The React UI (`core/http/react-ui/`) is covered by Playwright e2e specs, gated by a **monotonic line-coverage ratchet** (`make test-ui-coverage-check`, run in CI and pre-commit). The metric is non-deterministic — a fast local box reads higher than a slow CI runner for the same code — so a small tolerance is unavoidable.
+
+**If your change lowers UI coverage, raise it back by adding specs — do not widen the tolerance or hand-lower the baseline.** A *render-smoke* spec (navigate to a page, assert its header is visible) cheaply covers an entire lazy page. See `core/http/react-ui/e2e/page-render-smoke.spec.js` and the full policy in [.agents/building-and-testing.md](.agents/building-and-testing.md#react-ui-coverage).
+
 ### Running E2E container tests

 These tests build a standard LocalAI Docker image and run it with pre-configured model configs to verify that most endpoints work correctly:
--- a/1
+++ b/1
@@ -108,6 +108,7 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
--- a/Makefile
+++ b/Makefile
@@ -1,5 +1,5 @@
 # Disable parallel execution for backend builds
-.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio
+.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter

 GOCMD=go
 GOTEST=$(GOCMD) test
@@ -69,10 +69,41 @@ else
 	GORELEASER=$(shell which goreleaser)
 endif

-TEST_PATHS?=./api/... ./pkg/... ./core/...
+TEST_PATHS?=./api/... ./pkg/... ./core/... ./backend/go/cloud-proxy/... ./backend/go/local-store/...
+
+## Coverage output and the committed baseline that CI compares against.
+## The gate is strict: total coverage must never decrease (no tolerance).
+## covermode=atomic makes line coverage deterministic regardless of test
+## ordering or flake retries, so there is no run-to-run jitter to absorb.
+COVERAGE_DIR?=$(abspath ./coverage)
+COVERAGE_PROFILE?=$(COVERAGE_DIR)/coverage.out
+COVERAGE_BASELINE?=coverage-baseline.txt
+## Coverage is collected one recursive root at a time and merged (see
+## scripts/run-coverage.sh): passing several recursive roots to a single
+## ginkgo invocation only keeps one root's coverprofile. Mirrors TEST_PATHS
+## minus ./api (which doesn't exist) and the ./backend/go/... roots.
+COVERAGE_ROOTS?=./pkg ./core
+## Build tags for the coverage build. `auth` is required to compile the real
+## auth implementation and its ~150 `//go:build auth` tests (otherwise they're
+## invisible and the gate scores auth against a stub). `debug` matches `test`.
+COVERAGE_TAGS?=debug auth
+## Coverage is attributed to these packages via --coverpkg, so the in-process
+## integration suites (COVERAGE_E2E_ROOTS) credit the core/http handlers they
+## drive over HTTP — not just their own test package.
+COVERAGE_COVERPKG?=github.com/mudler/LocalAI/core/...,github.com/mudler/LocalAI/pkg/...
+## In-process integration suites folded into coverage. Run non-recursively
+## (excludes tests/e2e/distributed, which needs containers) with the mock
+## backend built by prepare-test. real-models specs need a downloaded model,
+## so they're filtered out. NOTE: tests/integration is intentionally NOT here —
+## it needs the local-store backend built (`make backends/local-store`), which
+## the coverage CI job doesn't do.
+COVERAGE_E2E_ROOTS?=./tests/e2e
+COVERAGE_E2E_LABELS?=!real-models
+## Drop generated protobuf from the denominator (it has no tests by design).
+COVERAGE_EXCLUDE_RE?=grpc/proto/.*[.]pb[.]go


-.PHONY: all test build vendor lint lint-all
+.PHONY: all test test-coverage test-coverage-baseline test-coverage-check test-ui test-ui-coverage-baseline test-ui-coverage-check install-hooks build vendor lint lint-all

 all: help

@@ -149,7 +180,7 @@ osx-signed: build

 ## Run
 run: ## run local-ai
-	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./
+	CGO_LDFLAGS="$(CGO_LDFLAGS)" $(GOCMD) run ./cmd/local-ai

 prepare-test: protogen-go build-mock-backend

@@ -170,6 +201,36 @@ test: prepare-test
 	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) --fail-fast -v -r $(TEST_PATHS)

+## Runs the core suite ($(COVERAGE_ROOTS) plus $(COVERAGE_E2E_ROOTS)) with statement-coverage instrumentation
+## and writes a merged profile to $(COVERAGE_PROFILE). Deliberately omits
+## --fail-fast so a single failure doesn't truncate the coverage number, and
+## uses covermode=atomic so the result is deterministic. Prints the total.
+test-coverage: prepare-test
+	@echo 'Running tests with coverage'
+	GINKGO_TAGS="$(COVERAGE_TAGS)" \
+	COVERAGE_COVERPKG="$(COVERAGE_COVERPKG)" \
+	COVERAGE_E2E_ROOTS="$(COVERAGE_E2E_ROOTS)" \
+	COVERAGE_E2E_LABELS="$(COVERAGE_E2E_LABELS)" \
+	COVERAGE_EXCLUDE_RE='$(COVERAGE_EXCLUDE_RE)' \
+	OPUS_SHIM_LIBRARY=$(abspath ./pkg/opus/shim/libopusshim.so) \
+	scripts/run-coverage.sh $(COVERAGE_DIR) $(COVERAGE_PROFILE) $(TEST_FLAKES) $(COVERAGE_ROOTS)
+	@$(GOCMD) tool cover -html=$(COVERAGE_PROFILE) -o $(COVERAGE_DIR)/coverage.html
+	@$(GOCMD) tool cover -func=$(COVERAGE_PROFILE) | tail -n1
+
+## Writes the current total coverage to $(COVERAGE_BASELINE). Run this (and
+## commit the result) whenever a change legitimately raises coverage so the
+## ratchet moves up. Never lower it by hand.
+test-coverage-baseline: test-coverage
+	@$(GOCMD) tool cover -func=$(COVERAGE_PROFILE) | awk '/^total:/{gsub(/%/,"",$$NF); print $$NF}' > $(COVERAGE_BASELINE)
+	@echo "Saved coverage baseline: $$(cat $(COVERAGE_BASELINE))%"
+
+## CI gate: fails if total coverage dropped more than COVERAGE_TOLERANCE
+## (default 0.5pp) below the committed baseline. A small tolerance absorbs the
+## run-to-run jitter from the in-process tests/e2e suite folded in via
+## --coverpkg (timing-dependent which handler lines execute).
+test-coverage-check: test-coverage
+	@scripts/coverage-check.sh $(COVERAGE_PROFILE) $(COVERAGE_BASELINE)
+
 ########################################################
 ## Lint
 ########################################################
@@ -185,12 +246,17 @@ test: prepare-test
 ## everything else automatically, so new packages are scanned by default.
 LINT_EXCLUDE_DIRS_RE=/(backend/go/(piper|silero-vad|llm)|cmd/launcher)(/|$$)

+## Set LINT_NEW_FROM to a git ref to override .golangci.yml's
+## new-from-merge-base (origin/master). Useful from a fork clone where
+## origin/master is stale relative to the canonical repo — the pre-commit
+## hook passes the resolved upstream ref here so local lint matches CI.
+LINT_NEW_FROM?=
 lint:
 	@command -v golangci-lint >/dev/null 2>&1 || { \
 		echo 'golangci-lint not installed. Install: go install github.com/golangci/golangci-lint/v2/cmd/golangci-lint@latest'; \
 		exit 1; \
 	}
-	golangci-lint run $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')
+	golangci-lint run $(if $(LINT_NEW_FROM),--new-from-merge-base=$(LINT_NEW_FROM),) $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')

 ## Like `lint` but reports every issue, including the pre-existing baseline
 ## that `lint` ignores via .golangci.yml's new-from-merge-base. Use this to
@@ -202,6 +268,17 @@ lint-all:
 	}
 	golangci-lint run --new=false --new-from-merge-base= --new-from-rev= $$(go list -e -f '{{.Dir}}' ./... | grep -vE '$(LINT_EXCLUDE_DIRS_RE)')

+########################################################
+## Git hooks
+########################################################
+## Points git at the versioned .githooks/ directory so the pre-commit hook
+## (lint + coverage gate) runs locally. Run once per clone. Undo with:
+## `git config --unset core.hooksPath`. Skip a single commit with
+## `git commit --no-verify`.
+install-hooks:
+	git config core.hooksPath .githooks
+	@echo 'Installed git hooks: core.hooksPath -> .githooks (pre-commit runs lint + test-coverage-check on Go changes)'
+
 ########################################################
 ## E2E AIO tests (uses standard image with pre-configured models)
 ########################################################
@@ -232,13 +309,20 @@ run-e2e-aio: protogen-go
 	@echo 'Running e2e AIO tests'
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e-aio

+# Distributed architecture e2e (PostgreSQL + NATS via testcontainers).
+# Includes NatsJWT specs (JWT-enabled NATS). Requires Docker.
+# VLLMMultinode is excluded here; use test-e2e-vllm-multinode for that.
+test-e2e-distributed: protogen-go
+	@echo 'Running distributed e2e tests (label Distributed, incl. NatsJWT)'
+	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --label-filter='Distributed && !VLLMMultinode' --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e/distributed
+
 # vLLM multi-node DP smoke (CPU). Builds local-ai:tests and the
 # cpu-vllm backend from the current working tree, then drives a
 # head + headless follower via testcontainers-go and asserts a chat
 # completion. BuildKit caches both images, so re-runs only rebuild
 # what changed. The test lives under tests/e2e/distributed and is
 # selected by the VLLMMultinode label so it doesn't run alongside
-# the other distributed-suite tests by default.
+# test-e2e-distributed.
 test-e2e-vllm-multinode: docker-build-e2e extract-backend-vllm protogen-go
 	@echo 'Running e2e vLLM multi-node DP test'
 	LOCALAI_IMAGE=local-ai \
@@ -268,12 +352,13 @@ prepare-e2e:
 run-e2e-image:
 	docker run -p 5390:8080 -e MODELS_PATH=/models -e THREADS=1 -e DEBUG=true -d --rm -v $(TEST_DIR):/models --name e2e-tests-$(RANDOM) localai-tests

-test-e2e: build-mock-backend prepare-e2e run-e2e-image
+test-e2e: build-mock-backend build-cloud-proxy-backend prepare-e2e run-e2e-image
 	@echo 'Running e2e tests'
 	BUILD_TYPE=$(BUILD_TYPE) \
 	LOCALAI_API=http://$(E2E_BRIDGE_IP):5390 \
 	$(GOCMD) run github.com/onsi/ginkgo/v2/ginkgo --flake-attempts $(TEST_FLAKES) -v -r ./tests/e2e
 	$(MAKE) clean-mock-backend
+	$(MAKE) clean-cloud-proxy-backend
 	$(MAKE) teardown-e2e
 	docker rmi localai-tests

@@ -480,6 +565,8 @@ prepare-test-extra: protogen-python
 	$(MAKE) -C backend/python/insightface
 	$(MAKE) -C backend/python/speaker-recognition
 	$(MAKE) -C backend/rust/kokoros kokoros-grpc
+	$(MAKE) -C backend/go/rfdetr-cpp
+	$(MAKE) -C backend/go/locate-anything-cpp

 test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/transformers test
@@ -506,6 +593,10 @@ test-extra: prepare-test-extra
 	$(MAKE) -C backend/python/insightface test
 	$(MAKE) -C backend/python/speaker-recognition test
 	$(MAKE) -C backend/rust/kokoros test
+	$(MAKE) -C backend/go/rfdetr-cpp test
+	$(MAKE) -C backend/go/locate-anything-cpp test
+	$(MAKE) -C backend/go/depth-anything-cpp test
+	$(MAKE) -C backend/go/supertonic test

 ##
 ## End-to-end gRPC tests that exercise a built backend container image.
@@ -911,6 +1002,19 @@ test-extra-backend-whisper-transcription: docker-build-whisper
 	BACKEND_TEST_CAPS=health,load,transcription \
 	$(MAKE) test-extra-backend

+## Audio transcription wrapper for the parakeet-cpp (parakeet.cpp ggml port)
+## backend. Mirrors test-extra-backend-whisper-transcription: drives the
+## AudioTranscription / AudioTranscriptionStream RPCs against a published
+## Parakeet GGUF using the JFK 11s clip from whisper.cpp's CI samples. Not
+## part of the default test suite - run explicitly once the pinned model URL
+## is reachable.
+test-extra-backend-parakeet-cpp-transcription: docker-build-parakeet-cpp
+	BACKEND_IMAGE=local-ai-backend:parakeet-cpp \
+	BACKEND_TEST_MODEL_URL=https://huggingface.co/mudler/parakeet-cpp-gguf/resolve/main/tdt_ctc-110m-f16.gguf \
+	BACKEND_TEST_AUDIO_URL=https://github.com/ggml-org/whisper.cpp/raw/master/samples/jfk.wav \
+	BACKEND_TEST_CAPS=health,load,transcription \
+	$(MAKE) test-extra-backend
+
 ## LocalVQE audio transform (joint AEC + noise suppression + dereverb).
 ## Exercises the audio_transform capability end-to-end: batch transform
 ## of a real WAV fixture and bidi streaming of synthetic silent frames.
@@ -1060,21 +1164,31 @@ BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false
 # Single-model; hardware-only validation lives at tests/e2e-backends/
 # (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md.
 BACKEND_DS4 = ds4|ds4|.|false|false
+# privacy-filter wraps the standalone privacy-filter.cpp GGML engine (the
+# openai-privacy-filter PII/NER token classifier) — the TokenClassify RPC for
+# the PII redactor tier, on stock ggml with no llama.cpp carry-patches.
+BACKEND_PRIVACY_FILTER = privacy-filter|privacy-filter|.|false|false

 # Golang backends
 BACKEND_PIPER = piper|golang|.|false|true
 BACKEND_LOCAL_STORE = local-store|golang|.|false|true
+BACKEND_CLOUD_PROXY = cloud-proxy|golang|.|false|true
 BACKEND_HUGGINGFACE = huggingface|golang|.|false|true
 BACKEND_SILERO_VAD = silero-vad|golang|.|false|true
 BACKEND_STABLEDIFFUSION_GGML = stablediffusion-ggml|golang|.|--progress=plain|true
 BACKEND_WHISPER = whisper|golang|.|false|true
+BACKEND_CRISPASR = crispasr|golang|.|false|true
+BACKEND_PARAKEET_CPP = parakeet-cpp|golang|.|false|true
+BACKEND_DEPTH_ANYTHING_CPP = depth-anything-cpp|golang|.|false|true
 BACKEND_VOXTRAL = voxtral|golang|.|false|true
 BACKEND_ACESTEP_CPP = acestep-cpp|golang|.|false|true
 BACKEND_QWEN3_TTS_CPP = qwen3-tts-cpp|golang|.|false|true
+BACKEND_OMNIVOICE_CPP = omnivoice-cpp|golang|.|false|true
 BACKEND_VIBEVOICE_CPP = vibevoice-cpp|golang|.|false|true
 BACKEND_LOCALVQE = localvqe|golang|.|false|true
 BACKEND_OPUS = opus|golang|.|false|true
 BACKEND_SHERPA_ONNX = sherpa-onnx|golang|.|false|true
+BACKEND_SUPERTONIC = supertonic|golang|.|false|true

 # Python backends with root context
 BACKEND_RERANKERS = rerankers|python|.|false|true
@@ -1117,6 +1231,7 @@ BACKEND_KOKOROS = kokoros|rust|.|false|true

 # C++ backends (Go wrapper with purego)
 BACKEND_SAM3_CPP = sam3-cpp|golang|.|false|true
+BACKEND_RFDETR_CPP = rfdetr-cpp|golang|.|false|true

 # Helper function to build docker image for a backend
 # Usage: $(call docker-build-backend,BACKEND_NAME,DOCKERFILE_TYPE,BUILD_CONTEXT,PROGRESS_FLAG,NEEDS_BACKEND_ARG)
@@ -1147,12 +1262,17 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT)))
 $(eval $(call generate-docker-build-target,$(BACKEND_DS4)))
+$(eval $(call generate-docker-build-target,$(BACKEND_PRIVACY_FILTER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_PIPER)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCAL_STORE)))
+$(eval $(call generate-docker-build-target,$(BACKEND_CLOUD_PROXY)))
 $(eval $(call generate-docker-build-target,$(BACKEND_HUGGINGFACE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SILERO_VAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_STABLEDIFFUSION_GGML)))
 $(eval $(call generate-docker-build-target,$(BACKEND_WHISPER)))
+$(eval $(call generate-docker-build-target,$(BACKEND_CRISPASR)))
+$(eval $(call generate-docker-build-target,$(BACKEND_PARAKEET_CPP)))
+$(eval $(call generate-docker-build-target,$(BACKEND_DEPTH_ANYTHING_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VOXTRAL)))
 $(eval $(call generate-docker-build-target,$(BACKEND_OPUS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_RERANKERS)))
@@ -1185,6 +1305,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_WHISPERX)))
 $(eval $(call generate-docker-build-target,$(BACKEND_ACE_STEP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_ACESTEP_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_QWEN3_TTS_CPP)))
+$(eval $(call generate-docker-build-target,$(BACKEND_OMNIVOICE_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_VIBEVOICE_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_LOCALVQE)))
 $(eval $(call generate-docker-build-target,$(BACKEND_MLX)))
@@ -1195,13 +1316,15 @@ $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_QUANTIZATION)))
 $(eval $(call generate-docker-build-target,$(BACKEND_TINYGRAD)))
 $(eval $(call generate-docker-build-target,$(BACKEND_KOKOROS)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SAM3_CPP)))
+$(eval $(call generate-docker-build-target,$(BACKEND_RFDETR_CPP)))
 $(eval $(call generate-docker-build-target,$(BACKEND_SHERPA_ONNX)))
+$(eval $(call generate-docker-build-target,$(BACKEND_SUPERTONIC)))

 # Pattern rule for docker-save targets
 docker-save-%: backend-images
 	docker save local-ai-backend:$* -o backend-images/$*.tar

-docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-qwen3-tts-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx
+docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter

 ########################################################
 ### Mock Backend for E2E Tests
@@ -1213,6 +1336,12 @@ build-mock-backend: protogen-go
 clean-mock-backend:
 	rm -f tests/e2e/mock-backend/mock-backend

+build-cloud-proxy-backend: protogen-go
+	$(GOCMD) build -o tests/e2e/mock-backend/cloud-proxy ./backend/go/cloud-proxy
+
+clean-cloud-proxy-backend:
+	rm -f tests/e2e/mock-backend/cloud-proxy
+
 ########################################################
 ### UI E2E Test Server
 ########################################################
@@ -1223,6 +1352,50 @@ build-ui-test-server: build-mock-backend react-ui protogen-go
 test-ui-e2e: build-ui-test-server
 	cd core/http/react-ui && npm install && npx playwright install --with-deps chromium && npx playwright test

+## Optional Playwright worker count for the UI e2e targets below. Pass
+## UI_TEST_WORKERS=N (e.g. `make test-ui-coverage UI_TEST_WORKERS=20`) to
+## override Playwright's default (cores/2). Empty by default so Playwright
+## picks its own worker count.
+UI_TEST_WORKERS ?=
+PLAYWRIGHT_WORKERS_FLAG = $(if $(UI_TEST_WORKERS),--workers=$(UI_TEST_WORKERS),)
+
+## Fast Playwright e2e run used by the pre-commit hook on React UI changes.
+## Force-rebuilds the (non-instrumented) dist so the suite tests the working
+## tree — not a stale dist the `react-ui` skip-guard would leave — re-embeds
+## it into ui-test-server, and runs the specs. Uses the nix-provided browser
+## when PLAYWRIGHT_CHROMIUM_PATH is set (flake dev shell), else falls back to
+## downloading it as `test-ui-e2e` does.
+test-ui: build-mock-backend protogen-go
+	cd core/http/react-ui && bun install && bun run build
+	$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui
+	cd core/http/react-ui && sh $(CURDIR)/scripts/ensure-playwright-browser.sh && bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG)
+
+## React UI code coverage from the Playwright e2e suite. Builds a
+## NON-instrumented bundle with source maps (COVERAGE_V8=true), re-embeds it
+## into the ui-test-server (the dist is //go:embed'ed at compile time), runs the
+## Playwright specs which collect native Chromium V8 coverage (PW_V8_COVERAGE=1)
+## — far cheaper than istanbul's build-time counters (~40% faster end-to-end) —
+## convert it to istanbul via v8-to-istanbul in the coverage fixture, and write
+## an nyc report to core/http/react-ui/coverage/. Removes the dist afterwards so
+## normal builds aren't served source-mapped assets. (The legacy istanbul path
+## still exists: `bun run build:coverage` + unset PW_V8_COVERAGE.)
+test-ui-coverage: build-mock-backend protogen-go
+	trap 'rm -rf "$(CURDIR)/core/http/react-ui/dist"' EXIT; \
+	( cd core/http/react-ui && bun install && bun run build:coverage-v8 ) && \
+	$(GOCMD) build -o tests/e2e-ui/ui-test-server ./tests/e2e-ui && \
+	( cd core/http/react-ui && rm -rf .nyc_output coverage && \
+	    sh $(CURDIR)/scripts/ensure-playwright-browser.sh && \
+	    PW_V8_COVERAGE=1 bunx playwright test $(PLAYWRIGHT_WORKERS_FLAG) && bun run coverage:report )
+
+## UI coverage baseline (committed) and the strict gate that compares against
+## it — the React mirror of test-coverage-baseline / test-coverage-check.
+test-ui-coverage-baseline: test-ui-coverage
+	@node -e 'const fs=require("fs");process.stdout.write(String(JSON.parse(fs.readFileSync("core/http/react-ui/coverage/coverage-summary.json")).total.lines.pct))' > core/http/react-ui/coverage-baseline.txt
+	@echo "Saved UI coverage baseline: $$(cat core/http/react-ui/coverage-baseline.txt)% lines"
+
+test-ui-coverage-check: test-ui-coverage
+	sh $(CURDIR)/scripts/ui-coverage-check.sh core/http/react-ui/coverage/coverage-summary.json core/http/react-ui/coverage-baseline.txt
+
 test-ui-e2e-docker:
 	docker build -t localai-ui-e2e -f tests/e2e-ui/Dockerfile .
 	docker run --rm localai-ui-e2e
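
The `test-ui-coverage-baseline` / `test-ui-coverage-check` pair above stores `total.lines.pct` from `coverage-summary.json` and later compares a fresh run against that stored number. `scripts/ui-coverage-check.sh` is not part of this diff, so the following is only a sketch of what such a gate could look like, written in Python for illustration. The names `read_line_pct` and `check_coverage` and the `tolerance` parameter are assumptions, not the script's real interface.

```python
import json
import tempfile
from pathlib import Path


def read_line_pct(summary_path):
    """Return total.lines.pct from an nyc-style coverage-summary.json."""
    summary = json.loads(Path(summary_path).read_text())
    return float(summary["total"]["lines"]["pct"])


def check_coverage(summary_path, baseline_path, tolerance=0.0):
    """Hypothetical gate: pass when current line coverage >= baseline - tolerance.

    Returns (passed, current_pct, baseline_pct).
    """
    current = read_line_pct(summary_path)
    # The baseline file holds a bare number; strip() tolerates a trailing newline.
    baseline = float(Path(baseline_path).read_text().strip())
    return current + tolerance >= baseline, current, baseline


if __name__ == "__main__":
    # Illustrative numbers only; real values come from the Playwright run.
    with tempfile.TemporaryDirectory() as tmp:
        summary = Path(tmp) / "coverage-summary.json"
        baseline = Path(tmp) / "coverage-baseline.txt"
        summary.write_text(json.dumps({"total": {"lines": {"pct": 71.5}}}))
        baseline.write_text("70.25")
        passed, current, stored = check_coverage(summary, baseline)
        print("pass" if passed else "fail", current, stored)
```

A strict gate corresponds to `tolerance=0.0`; whether the real script allows any slack is not stated in the diff.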
--- a/README.md
+++ b/README.md
@@ -29,14 +29,32 @@
 <a href="https://trendshift.io/repositories/5539" target="_blank"><img src="https://trendshift.io/api/badge/repositories/5539" alt="mudler%2FLocalAI | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>
 </p>

+<!-- Keep these links. Translations are synced daily. -->
+<p align="center">
+<a href="https://zdoc.app/de/mudler/LocalAI">Deutsch</a> |
+<a href="https://zdoc.app/es/mudler/LocalAI">Español</a> |
+<a href="https://zdoc.app/fr/mudler/LocalAI">français</a> |
+<a href="https://zdoc.app/ja/mudler/LocalAI">日本語</a> |
+<a href="https://zdoc.app/ko/mudler/LocalAI">한국어</a> |
+<a href="https://zdoc.app/pt/mudler/LocalAI">Português</a> |
+<a href="https://zdoc.app/ru/mudler/LocalAI">Русский</a> |
+<a href="https://zdoc.app/zh/mudler/LocalAI">中文</a>
+</p>
+
 **LocalAI** is the open-source AI engine. Run any model - LLMs, vision, voice, image, video - on any hardware. No GPU required.

-- **Drop-in API compatibility** — OpenAI, Anthropic, ElevenLabs APIs
-- **36+ backends** — llama.cpp, vLLM, transformers, whisper, diffusers, MLX...
-- **Any hardware** — NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
-- **Multi-user ready** — API key auth, user quotas, role-based access
-- **Built-in AI agents** — autonomous agents with tool use, RAG, MCP, and skills
-- **Privacy-first** — your data never leaves your infrastructure
+**A small core, not a bundle.** Each backend wraps a best-in-class engine (llama.cpp, vLLM, whisper.cpp, stable-diffusion, MLX...) in its own image, pulled only when a model needs it. You install nothing you don't use.
+
+- **Composable by design**: backends are separate and pulled on demand, so you install only what your model needs
+- **Open and extensible**: load any model, or build your own backend in any language against an open interface
+- **Drop-in API compatibility**: OpenAI, Anthropic, and ElevenLabs APIs across every backend
+- **Any model, any modality**: LLMs, vision, voice, image, and video behind one API
+- **Any hardware**: NVIDIA, AMD, Intel, Apple Silicon, Vulkan, or CPU-only
+- **Multi-user ready**: API key auth, user quotas, role-based access
+- **Built-in AI agents**: autonomous agents with tool use, RAG, MCP, and skills
+- **Privacy-first**: your data never leaves your infrastructure
+
+![A small LocalAI core with backends (llama.cpp, vLLM, MLX, whisper.cpp, stable-diffusion, kokoro, parakeet.cpp...) plugged in as separate on-demand images](docs/static/images/diagrams/composable-core.png)

 Created by [Ettore Di Giacinto](https://github.com/mudler) and maintained by the [LocalAI team](#team).

@@ -143,14 +161,30 @@ local-ai run https://gist.githubusercontent.com/.../phi-2.yaml
 local-ai run oci://localai/phi-2:latest
 ```

+To test a running LocalAI server from the terminal, open an interactive chat session from another shell. Inside the prompt, `/models` lists installed models and `/model <name>` switches between them.
+
+```bash
+# Terminal 1
+local-ai run llama-3.2-1b-instruct:q4_k_m
+
+# Terminal 2
+local-ai chat --model llama-3.2-1b-instruct:q4_k_m
+```
+
 > **Automatic Backend Detection**: LocalAI automatically detects your GPU capabilities and downloads the appropriate backend. For advanced options, see [GPU Acceleration](https://localai.io/features/gpu-acceleration/).

 For more details, see the [Getting Started guide](https://localai.io/basics/getting_started/).

 ## Latest News

-- **April 2026**: [Voice recognition](https://github.com/mudler/LocalAI/pull/9500), [Face recognition, identification & liveness detection](https://github.com/mudler/LocalAI/pull/9480), [Ollama API compatibility](https://github.com/mudler/LocalAI/pull/9284), [Video generation in stable-diffusion.ggml](https://github.com/mudler/LocalAI/pull/9420), [Backend versioning with auto-upgrade](https://github.com/mudler/LocalAI/pull/9315), [Pin models & load-on-demand toggle](https://github.com/mudler/LocalAI/pull/9309), [Universal model importer](https://github.com/mudler/LocalAI/pull/9466), new backends: [sglang](https://github.com/mudler/LocalAI/pull/9359), [ik-llama-cpp](https://github.com/mudler/LocalAI/pull/9326), [TurboQuant](https://github.com/mudler/LocalAI/pull/9355), [sam.cpp](https://github.com/mudler/LocalAI/pull/9288), [Kokoros](https://github.com/mudler/LocalAI/pull/9212), [qwen3tts.cpp](https://github.com/mudler/LocalAI/pull/9316), [tinygrad multimodal](https://github.com/mudler/LocalAI/pull/9364)
-- **March 2026**: [Agent management](https://github.com/mudler/LocalAI/pull/8820), [New React UI](https://github.com/mudler/LocalAI/pull/8772), [WebRTC](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed via P2P and RDMA](https://github.com/mudler/LocalAI/pull/8801), [MCP Apps, MCP Client-side](https://github.com/mudler/LocalAI/pull/8947)
+- **June 2026**: New [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (a tiny Go client for the Realtime API with a full talk-back voice loop and tool calling), plus [streaming of the realtime LLM / TTS / transcription pipeline stages](https://github.com/mudler/LocalAI/pull/10176) and [configurable WebRTC ICE candidates](https://github.com/mudler/LocalAI/pull/10231).
+- **June 2026**: Big speech push: the [parakeet.cpp](https://github.com/mudler/parakeet.cpp) ASR engine gains [NeMo-faithful segment timestamps](https://github.com/mudler/LocalAI/pull/10207), a [multilingual streaming Nemotron-3.5 model](https://github.com/mudler/LocalAI/pull/10199), [dynamic batching for concurrent transcription](https://github.com/mudler/LocalAI/pull/10112) and [CUDA graphs](https://github.com/mudler/LocalAI/pull/10273); the new [CrispASR backend](https://github.com/mudler/LocalAI/pull/10099) adds multi-architecture ASR + TTS, and [60 Piper TTS voices across 42 languages](https://github.com/mudler/LocalAI/pull/10296) land in the gallery (plus [per-request TTS instructions and params](https://github.com/mudler/LocalAI/pull/10172)).
+- **June 2026**: New backends and models: [locate-anything.cpp](https://github.com/mudler/LocalAI/pull/10264) for open-vocabulary object detection via ggml, [Ideogram4 image generation](https://github.com/mudler/LocalAI/pull/10201) in stablediffusion-ggml, [llama.cpp video input](https://github.com/mudler/LocalAI/pull/10216), and the [Gemma 4 QAT family with MTP speculative-decoding pairs](https://github.com/mudler/LocalAI/pull/10215). Plus an [interactive CLI chat mode](https://github.com/mudler/LocalAI/pull/10226) and [RAG source citations in agent responses](https://github.com/mudler/LocalAI/pull/10228).
+- **June 2026**: Distributed mode hardening: [prefix-cache-aware routing](https://github.com/mudler/LocalAI/pull/10071), a [production-ready request router with auto-sized embedding/rerank batches](https://github.com/mudler/LocalAI/pull/10104), [ds4 layer-split distributed inference](https://github.com/mudler/LocalAI/pull/10098), [NATS JWT auth + TLS/mTLS](https://github.com/mudler/LocalAI/pull/10159), and [resumable file uploads](https://github.com/mudler/LocalAI/pull/10109).
+- **May 2026**: **LocalAI 4.3.0** - `llama.cpp` [prompt cache on by default](https://github.com/mudler/LocalAI/pull/9925) (repeated system prompts collapse from minutes to seconds), [keyless cosign signing of backend OCI images](https://github.com/mudler/LocalAI/pull/9823), [per-API-key + per-user usage attribution](https://github.com/mudler/LocalAI/pull/9920), Distributed v3 with [per-request replica routing](https://github.com/mudler/LocalAI/pull/9968). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.3.0)
+- **May 2026**: **LocalAI 4.2.0** - LocalAI sees and hears: [voice recognition](https://github.com/mudler/LocalAI/pull/9500), [face recognition + antispoofing liveness](https://github.com/mudler/LocalAI/pull/9480), speaker diarization. Plus [drop-in Ollama API](https://github.com/mudler/LocalAI/pull/9284), [video generation](https://github.com/mudler/LocalAI/pull/9420), redesigned UI with i18n + admin-configurable branding, vLLM at feature parity with llama.cpp, and 11 new backends. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.2.0)
+- **April 2026**: **LocalAI 4.1.0** - LocalAI becomes a control tower: distributed cluster mode with VRAM-aware smart routing + autoscaling, multi-user platform with OIDC and API keys, per-user quotas with predictive analytics, in-UI fine-tuning with TRL (auto-export to GGUF), on-the-fly quantization backend, visual pipeline editor. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.1.0)
+- **March 2026**: **LocalAI 4.0.0** - native agentic orchestration with the new [Agenthub](https://agenthub.localai.io) community hub, full React UI rewrite with Canvas mode, [MCP Apps + client-side](https://github.com/mudler/LocalAI/pull/8947) with tool streaming, [WebRTC realtime audio](https://github.com/mudler/LocalAI/pull/8790), [MLX-distributed](https://github.com/mudler/LocalAI/pull/8801). [Release notes](https://github.com/mudler/LocalAI/releases/tag/v4.0.0)
 - **February 2026**: [Realtime API for audio-to-audio with tool calling](https://github.com/mudler/LocalAI/pull/6245), [ACE-Step 1.5 support](https://github.com/mudler/LocalAI/pull/8396)
 - **January 2026**: **LocalAI 3.10.0** — Anthropic API support, Open Responses API, video & image generation (LTX-2), unified GPU backends, tool streaming, Moonshine, Pocket-TTS. [Release notes](https://github.com/mudler/LocalAI/releases/tag/v3.10.0)
 - **December 2025**: [Dynamic Memory Resource reclaimer](https://github.com/mudler/LocalAI/pull/7583), [Automatic multi-GPU model fitting (llama.cpp)](https://github.com/mudler/LocalAI/pull/7584), [Vibevoice backend](https://github.com/mudler/LocalAI/pull/7494)
@@ -186,10 +220,26 @@ For older news and full release notes, see [GitHub Releases](https://github.com/

 ## Supported Backends & Acceleration

-LocalAI supports **36+ backends** including llama.cpp, vLLM, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).
+LocalAI supports **60+ backends** including llama.cpp, vLLM, SGLang, transformers, whisper.cpp, diffusers, MLX, MLX-VLM, and many more. Hardware acceleration is available for **NVIDIA** (CUDA 12/13), **AMD** (ROCm), **Intel** (oneAPI/SYCL), **Apple Silicon** (Metal), **Vulkan**, and **NVIDIA Jetson** (L4T). All backends can be installed on-the-fly from the [Backend Gallery](https://localai.io/backends/).

 See the full [Backend & Model Compatibility Table](https://localai.io/model-compatibility/) and [GPU Acceleration guide](https://localai.io/features/gpu-acceleration/).

+### Backends built by us
+
+Most backends wrap a best-in-class upstream engine. A handful of them are native C/C++/GGML engines (no Python at inference) developed and maintained by the LocalAI project itself:
+
+| Backend | What it does |
+|---------|-------------|
+| [parakeet.cpp](https://github.com/mudler/parakeet.cpp) | C++/GGML port of NVIDIA NeMo Parakeet ASR (tdt/ctc/rnnt/hybrid), with cache-aware streaming transcription |
+| [voxtral.c](https://github.com/mudler/voxtral.c) | Voxtral Realtime 4B speech-to-text in pure C |
+| [vibevoice.cpp](https://github.com/mudler/vibevoice.cpp) | Native port of Microsoft VibeVoice for TTS (voice cloning) and long-form ASR with speaker diarization |
+| [rf-detr.cpp](https://github.com/mudler/rf-detr.cpp) | Native RF-DETR object detection and instance segmentation |
+| [locate-anything.cpp](https://github.com/mudler/locate-anything.cpp) | Open-vocabulary object detection and visual grounding (LocateAnything-3B) |
+| [depth-anything.cpp](https://github.com/mudler/depth-anything.cpp) | Depth Anything 3 monocular metric depth + camera pose estimation |
+| [privacy-filter.cpp](https://github.com/localai-org/privacy-filter.cpp) | Standalone GGML PII/NER token-classification engine powering LocalAI's PII redaction tier |
+| [LocalVQE](https://github.com/localai-org/LocalVQE) | Joint acoustic echo cancellation, noise suppression, and dereverberation |
+| [local-store](https://github.com/mudler/LocalAI) | Local-first vector database for embeddings (shipped in-tree) |
+
 ## Resources

 - [Documentation](https://localai.io/)
@@ -199,7 +249,7 @@ See the full [Backend & Model Compatibility Table](https://localai.io/model-comp
 - [Integrations & community projects](https://localai.io/docs/integrations/)
 - [Installation video walkthrough](https://www.youtube.com/watch?v=cMVNnlqwfw4)
 - [Media & blog posts](https://localai.io/basics/news/#media-blogs-social)
-- [Examples](https://github.com/mudler/LocalAI-examples)
+- [Examples](https://github.com/mudler/LocalAI-examples) — including the [realtime voice assistant demo](https://github.com/localai-org/localai-realtime-demo) (Go client for the Realtime API with tool calling)

 ## Team

@@ -236,11 +286,22 @@ A huge thank you to our generous sponsors who support this project covering CI e
  <a href="https://www.spectrocloud.com/" target="blank">
    <img height="200" src="https://github.com/user-attachments/assets/72eab1dd-8b93-4fc0-9ade-84db49f24962">
  </a>
+</p>
+
+<details>
+
+<summary>
+Past sponsors
+</summary>
+
+<p align="center">
  <a href="https://www.premai.io/" target="blank">
    <img height="200" src="https://github.com/mudler/LocalAI/assets/2420543/42e4ca83-661e-4f79-8e46-ae43689683d6"> <br>
  </a>
 </p>

+</details>
+
 ### Individual sponsors

 A special thanks to individual sponsors, a full list is on [GitHub](https://github.com/sponsors/mudler) and [buymeacoffee](https://buymeacoffee.com/mudler). Special shout out to [drikster80](https://github.com/drikster80) for being generous. Thank you everyone!
--- a/backend/Dockerfile.golang
+++ b/backend/Dockerfile.golang
@@ -206,6 +206,16 @@ RUN if [ "${BACKEND}" = "opus" ]; then \
    apt-get clean && rm -rf /var/lib/apt/lists/*; \
 fi

+# CrispASR's piper TTS backend dlopens libespeak-ng at runtime to phonemize
+# non-English text (the MIT-clean path; English uses a built-in G2P). Install
+# the espeak-ng runtime + its libpcaudio/libsonic deps + voice data so
+# package.sh can bundle them into the FROM scratch image.
+RUN if [ "${BACKEND}" = "crispasr" ]; then \
+    apt-get update && apt-get install -y --no-install-recommends \
+        espeak-ng-data libespeak-ng1 libpcaudio0 libsonic0 && \
+    apt-get clean && rm -rf /var/lib/apt/lists/*; \
+fi
+
 COPY . /LocalAI

 RUN git config --global --add safe.directory /LocalAI
--- a/backend/Dockerfile.privacy-filter
+++ b/backend/Dockerfile.privacy-filter
@@ -0,0 +1,109 @@
+ARG BASE_IMAGE=ubuntu:24.04
+# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses when no
+# prebuilt base is supplied; the builder-prebuilt stage is only entered when
+# BUILDER_TARGET=builder-prebuilt, so the fallback content is harmless
+# (BuildKit prunes the unreferenced builder).
+ARG BUILDER_BASE_IMAGE=${BASE_IMAGE}
+# BUILDER_TARGET selects which builder stage the scratch image copies from.
+# Declared before any FROM so it is usable in `FROM ${BUILDER_TARGET}`. The
+# backend_build workflow sets it to builder-prebuilt when the matrix entry
+# provides builder-base-image, else builder-fromsource (the local default).
+ARG BUILDER_TARGET=builder-fromsource
+ARG APT_MIRROR=""
+ARG APT_PORTS_MIRROR=""
+
+# privacy-filter: standalone GGML engine for the openai-privacy-filter PII/NER
+# token classifier, wrapped as a LocalAI gRPC backend.
+#
+# Mirrors backend/Dockerfile.llama-cpp: the build toolchain (gRPC + cmake +
+# protoc + conditional CUDA/Vulkan) comes from the shared
+# .docker/install-base-deps.sh (from-source path) or a prebuilt
+# quay.io/go-skynet/ci-cache:base-grpc-* image (CI path) — nothing GPU-specific
+# is hand-rolled here. BUILD_TYPE selects the engine backend in the Makefile:
+# "" = cpu, "cublas" -> -DPF_CUDA=ON, "vulkan" -> -DPF_VULKAN=ON.
+
+# ============================================================================
+# Stage: builder-fromsource — self-contained build. Runs the same install
+# script backend/Dockerfile.base-grpc-builder runs, so this path is
+# bit-equivalent to the prebuilt base. Used when BUILDER_TARGET=builder-fromsource
+# (the default; local `make backends/privacy-filter`).
+# ============================================================================
+FROM ${BASE_IMAGE} AS builder-fromsource
+ARG BUILD_TYPE
+ARG CUDA_MAJOR_VERSION
+ARG CUDA_MINOR_VERSION
+ARG CMAKE_FROM_SOURCE=false
+# CUDA Toolkit 13.x needs CMake 3.31.9+ for correct toolchain/arch detection.
+ARG CMAKE_VERSION=3.31.10
+ARG GRPC_VERSION=v1.65.0
+ARG GRPC_MAKEFLAGS="-j4 -Otarget"
+ARG SKIP_DRIVERS=false
+ARG TARGETARCH
+ARG UBUNTU_VERSION=2404
+ARG APT_MIRROR
+ARG APT_PORTS_MIRROR
+
+ENV BUILD_TYPE=${BUILD_TYPE} \
+    CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \
+    CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \
+    CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \
+    CMAKE_VERSION=${CMAKE_VERSION} \
+    GRPC_VERSION=${GRPC_VERSION} \
+    GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \
+    SKIP_DRIVERS=${SKIP_DRIVERS} \
+    TARGETARCH=${TARGETARCH} \
+    UBUNTU_VERSION=${UBUNTU_VERSION} \
+    APT_MIRROR=${APT_MIRROR} \
+    APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \
+    DEBIAN_FRONTEND=noninteractive
+# CUDA on PATH (a no-op when CUDA is not installed, e.g. cpu/vulkan builds).
+ENV PATH=/usr/local/cuda/bin:${PATH}
+
+WORKDIR /build
+
+# apt deps + cmake + protoc + gRPC + conditional CUDA/Vulkan, all from the
+# shared script (the source of truth that base-grpc-builder also runs).
+RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \
+    --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \
+    bash /usr/local/sbin/install-base-deps
+
+# install-base-deps installs gRPC under /opt/grpc; copy it to /usr/local so the
+# backend's find_package(gRPC CONFIG) resolves it at the canonical prefix.
+RUN cp -a /opt/grpc/. /usr/local/
+
+COPY . /LocalAI
+
+RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
+    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
+
+# ============================================================================
+# Stage: builder-prebuilt — FROM a prebuilt
+# quay.io/go-skynet/ci-cache:base-grpc-* image (gRPC at /opt/grpc + apt deps +
+# CUDA/Vulkan already installed). Used in CI when the matrix entry sets
+# builder-base-image.
+# ============================================================================
+FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt
+ARG BUILD_TYPE
+ARG TARGETARCH
+ENV BUILD_TYPE=${BUILD_TYPE}
+# CUDA on PATH (a no-op for the cpu/vulkan base images).
+ENV PATH=/usr/local/cuda/bin:${PATH}
+
+# Mirror builder-fromsource: the base-grpc image installs gRPC to /opt/grpc but
+# does not copy it to /usr/local.
+RUN cp -a /opt/grpc/. /usr/local/
+
+COPY . /LocalAI
+
+RUN --mount=type=cache,target=/root/.ccache,id=privacy-filter-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \
+    make -C /LocalAI/backend/cpp/privacy-filter BUILD_TYPE=${BUILD_TYPE} NATIVE=false grpc-server package
+
+# ============================================================================
+# Final stage — copy the package output from the selected builder. BuildKit
+# does not expand variables in `COPY --from=`, so alias the chosen builder to a
+# fixed stage name first.
+# ============================================================================
+FROM ${BUILDER_TARGET} AS builder
+
+FROM scratch
+COPY --from=builder /LocalAI/backend/cpp/privacy-filter/package/. ./
--- a/backend/Dockerfile.python
+++ b/backend/Dockerfile.python
@@ -126,6 +126,7 @@ RUN <<EOT bash
        apt-get update && \
        apt-get install -y --no-install-recommends \
            cuda-nvcc-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
+            cuda-nvrtc-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcufft-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcurand-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
            libcublas-dev-${CUDA_MAJOR_VERSION}-${CUDA_MINOR_VERSION} \
--- a/backend/backend.proto
+++ b/backend/backend.proto
@@ -24,6 +24,7 @@ service Backend {
  rpc TokenizeString(PredictOptions) returns (TokenizationResponse) {}
  rpc Status(HealthMessage) returns (StatusResponse) {}
  rpc Detect(DetectOptions) returns (DetectResponse) {}
+  rpc Depth(DepthRequest) returns (DepthResponse) {}
  rpc FaceVerify(FaceVerifyRequest) returns (FaceVerifyResponse) {}
  rpc FaceAnalyze(FaceAnalyzeRequest) returns (FaceAnalyzeResponse) {}
  rpc VoiceVerify(VoiceVerifyRequest) returns (VoiceVerifyResponse) {}
@@ -37,6 +38,22 @@ service Backend {

  rpc Rerank(RerankRequest) returns (RerankResult) {}

+  // TokenClassify runs a token-classification (NER) model on the
+  // supplied text and returns each detected entity span. Used by the
+  // PII redactor's optional NER tier — the regex tier still handles
+  // formatted hits cheaply, while this catches names, locations, and
+  // other unformatted PII that regex misses.
+  rpc TokenClassify(TokenClassifyRequest) returns (TokenClassifyResponse) {}
+
+  // Score evaluates the model's joint log-probability of each
+  // supplied candidate continuation given a shared prompt. The
+  // prompt's KV cache is computed once and reused across candidates.
+  // Used for routing-policy multi-label classification, reranking,
+  // calibrated confidence, and reward-model scoring — any task where
+  // the consumer wants the model's confidence in a pre-specified
+  // continuation rather than a generated one.
+  rpc Score(ScoreRequest) returns (ScoreResponse) {}
+
  rpc GetMetrics(MetricsRequest) returns (MetricsResponse);

  rpc VAD(VADRequest) returns (VADResponse) {}
@@ -68,6 +85,23 @@ service Backend {
  rpc QuantizationProgress(QuantizationProgressRequest) returns (stream QuantizationProgressUpdate) {}
  rpc StopQuantization(QuantizationStopRequest) returns (Result) {}

+  // Forward proxies a raw HTTP request to an upstream provider. The
+  // cloud-proxy backend implements this for passthrough-mode model
+  // configs: the client wire format is preserved end-to-end (no
+  // translation through internal proto), which means new provider
+  // fields work the day they ship. Translation-mode proxies use the
+  // standard Predict/PredictStream RPCs instead. Backends that don't
+  // support this return UNIMPLEMENTED.
+  //
+  // The RPC is bidirectionally streamed so large bodies can flow
+  // without buffering. In practice the first ForwardRequest carries
+  // path, method, headers, and the initial body chunk; subsequent
+  // messages append body chunks. The first ForwardReply carries the
+  // upstream status and response headers; subsequent messages stream
+  // body chunks (SSE frames or chunked transfer). Cancellation of the
+  // gRPC context closes the upstream connection.
+  rpc Forward(stream ForwardRequest) returns (stream ForwardReply) {}
+
 }

 // Define the empty request
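
The framing described in the `Forward` comment (metadata plus the initial body chunk in the first message, body-only messages afterwards) can be sketched as follows. The `ForwardRequest` message definition is not shown in this part of the diff, so `ForwardFrame` and its field names are hypothetical stand-ins, and the tiny `chunk_size` is only there to make the example readable.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, Tuple


@dataclass
class ForwardFrame:
    """Hypothetical stand-in for ForwardRequest; real field names may differ."""
    path: str = ""
    method: str = ""
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


def frame_request(path: str, method: str, headers: Dict[str, str],
                  body: bytes, chunk_size: int = 4) -> Iterator[ForwardFrame]:
    """Yield one metadata-carrying frame, then body-only frames."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = [body[i:i + chunk_size] for i in range(0, len(body), chunk_size)] or [b""]
    # First frame: path, method, headers, and the initial body chunk.
    yield ForwardFrame(path=path, method=method, headers=dict(headers), body=chunks[0])
    # Subsequent frames only append body chunks.
    for chunk in chunks[1:]:
        yield ForwardFrame(body=chunk)


def reassemble(frames: Iterable[ForwardFrame]) -> Tuple[str, str, Dict[str, str], bytes]:
    """Receiver side: take metadata from the first frame, concatenate all bodies."""
    frames = list(frames)
    head = frames[0]
    return head.path, head.method, head.headers, b"".join(f.body for f in frames)


if __name__ == "__main__":
    frames = list(frame_request("/v1/chat/completions", "POST",
                                {"content-type": "application/json"},
                                b'{"stream":true}'))
    print(len(frames), reassemble(frames)[3])
```

The reply direction follows the same shape according to the comment: status and response headers first, then body chunks.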
@@ -81,6 +115,76 @@ message MetricsResponse {
  int32 prompt_tokens_processed = 5;
 }

+// TokenClassifyRequest carries the text to classify plus an optional
+// score threshold. The transformers backend interprets threshold as
+// the minimum confidence to include in the response; 0 = include all.
+message TokenClassifyRequest {
+  string text = 1;
+  float threshold = 2;
+}
+
+// TokenClassifyEntity is one detected entity span. Byte offsets are
+// into the original UTF-8 text — start..end is a half-open range that
+// addresses the substring corresponding to entity_group.
+//
+// entity_group follows HuggingFace's aggregated-tag convention (e.g.
+// "PER", "LOC", "ORG", or a PII-specific label like "EMAIL" /
+// "SSN" depending on the model). The redactor's per-pattern action
+// map keys off this string.
+message TokenClassifyEntity {
+  string entity_group = 1;
+  int32 start = 2;
+  int32 end = 3;
+  float score = 4;
+  string text = 5;
+}
+
+message TokenClassifyResponse {
+  repeated TokenClassifyEntity entities = 1;
+}
+
+// ScoreRequest carries one shared prompt and one or more continuations
+// to score against it. The backend tokenises the prompt once and reuses
+// the resulting KV cache across all candidates in this request.
+message ScoreRequest {
+  string prompt = 1;
+  repeated string candidates = 2;
+  // Return per-token logprobs for each candidate when true. Default
+  // false to keep the wire response small; the joint log_prob field
+  // covers the common ranking case.
+  bool include_token_logprobs = 3;
+  // When true, the response also populates length_normalized_log_prob
+  // (joint log-prob divided by candidate token count). Useful when
+  // candidates differ in length and the consumer wants a per-token
+  // measure comparable across them (PMI-style scoring).
+  bool length_normalize = 4;
+}
+
+// CandidateScore is one row in the ScoreResponse, matching by index
+// the candidate in ScoreRequest.candidates.
+message CandidateScore {
+  // Sum of log P(token_i | prompt, candidate_token_<i) across the
+  // candidate's tokens. The primary ranking signal.
+  double log_prob = 1;
+  // log_prob / num_tokens — populated when length_normalize=true on
+  // the request.
+  double length_normalized_log_prob = 2;
+  // Per-token detail — populated when include_token_logprobs=true.
+  repeated TokenLogProb tokens = 3;
+  // Number of tokens the backend tokenised this candidate into, after
+  // any backend-specific normalisation (e.g. leading-space handling).
+  int32 num_tokens = 4;
+}
+
+message TokenLogProb {
+  string token = 1;
+  double log_prob = 2;
+}
+
+message ScoreResponse {
+  repeated CandidateScore candidates = 1;
+}
+
 message RerankRequest {
  string query = 1;
  repeated string documents = 2;
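
To make the relationship between `log_prob`, `num_tokens`, and `length_normalized_log_prob` in `CandidateScore` concrete, here is a small Python sketch. It assumes the per-token log-probabilities are already available (the values below are made up for illustration) and only shows the arithmetic the comments describe; `score_candidate` is a hypothetical helper, not a function from any backend.

```python
import math
from typing import Dict, List


def score_candidate(token_logprobs: List[float],
                    length_normalize: bool = False) -> Dict[str, float]:
    """Compute the CandidateScore fields from per-token log-probabilities."""
    num_tokens = len(token_logprobs)
    # Joint log-probability: sum of log P(token_i | prompt, earlier candidate tokens).
    log_prob = math.fsum(token_logprobs)
    result = {"log_prob": log_prob, "num_tokens": num_tokens}
    if length_normalize and num_tokens > 0:
        # Per-token average, comparable across candidates of different length.
        result["length_normalized_log_prob"] = log_prob / num_tokens
    return result


if __name__ == "__main__":
    # Illustrative values only: a one-token and a three-token candidate.
    short = score_candidate([-1.0], length_normalize=True)
    long_ = score_candidate([-0.4, -0.4, -0.4], length_normalize=True)
    print(short)
    print(long_)
```

With these made-up numbers the shorter candidate has the higher joint `log_prob` while the longer one has the higher normalized value, which is the situation `length_normalize` is meant to expose.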
@@ -325,6 +429,25 @@ message ModelOptions {
  // applied verbatim to the backend's engine constructor (e.g. vLLM AsyncEngineArgs).
  // Unknown keys produce an error at LoadModel time.
  string EngineArgs = 73;
+
+  // Proxy carries the cloud-proxy backend's per-model configuration.
+  // Empty for non-proxy backends.
+  ProxyOptions Proxy = 74;
+}
+
+// ProxyOptions configures the cloud-proxy backend. UpstreamURL and
+// Mode are always meaningful; Provider only matters in translate mode.
+// The two api_key_* fields are mutually exclusive and resolved by the
+// backend at LoadModel — core forwards the references rather than the
+// plaintext key.
+message ProxyOptions {
+  string upstream_url = 1;
+  string mode = 2;
+  string provider = 3;
+  string api_key_env = 4;
+  string api_key_file = 5;
+  string upstream_model = 6;
+  int32 request_timeout_seconds = 7;
 }

 message Result {
@@ -415,6 +538,15 @@ message TTSRequest {
  string dst = 3;
  string voice = 4;
  optional string language = 5;
+  // instructions is a free-form, per-request style/voice description (maps to
+  // the OpenAI `instructions` field). Backends that support expressive synthesis
+  // (e.g. Qwen3-TTS CustomVoice/VoiceDesign) prefer this over the static YAML
+  // option when set; backends that don't simply ignore it.
+  optional string instructions = 6;
+  // params carries optional, backend-specific per-request generation parameters
+  // (e.g. Chatterbox exaggeration/cfg_weight/temperature). Values are strings and
+  // coerced by the backend; unset leaves the backend's configured defaults.
+  map<string, string> params = 7;
 }

 message VADRequest {
@@ -539,6 +671,35 @@ message DetectResponse {
  repeated Detection Detections = 1;
 }

+// --- Depth estimation messages (Depth Anything 3) ---
+
+message DepthRequest {
+  string src = 1;                  // input image (filesystem path or base64-encoded payload)
+  string dst = 2;                  // optional output directory for exports (glb/colmap)
+  bool include_depth = 3;          // return the per-pixel metric depth map
+  bool include_confidence = 4;     // return the per-pixel confidence map (DualDPT)
+  bool include_pose = 5;           // return camera extrinsics/intrinsics (DualDPT)
+  bool include_sky = 6;            // return the per-pixel sky map (mono models)
+  bool include_points = 7;         // back-project to a 3D point cloud (DualDPT)
+  float points_conf_thresh = 8;    // keep points with confidence >= this threshold
+  repeated string exports = 9;     // requested exports: "glb", "colmap"
+}
+
+message DepthResponse {
+  int32 width = 1;                 // processed depth-map width
+  int32 height = 2;                // processed depth-map height
+  repeated float depth = 3;        // width*height row-major metric depth
+  repeated float confidence = 4;   // width*height row-major confidence (DualDPT)
+  repeated float sky = 5;          // width*height row-major sky map (mono)
+  repeated float extrinsics = 6;   // 12 floats, 3x4 row-major (world-to-camera)
+  repeated float intrinsics = 7;   // 9 floats, 3x3 row-major
+  int32 num_points = 8;            // number of 3D points
+  repeated float points = 9;       // num_points*3 xyz, world space
+  bytes point_colors = 10;         // num_points*3 uint8 rgb
+  repeated string export_paths = 11; // paths written for the requested exports
+  bool is_metric = 12;             // depth is in metric units
+}
+
 // --- Face recognition messages ---

 message FacialArea {
@@ -1002,3 +1163,32 @@ message QuantizationStopRequest {
  string job_id = 1;
 }

+// ForwardHeader is one HTTP header on the request or response. Headers
+// like Authorization are typically injected by the backend (from the
+// resolved API key) rather than passed through from the client.
+message ForwardHeader {
+  string name = 1;
+  string value = 2;
+}
+
+// ForwardRequest is a streamed HTTP request to the upstream. First
+// message carries path/method/headers; subsequent messages carry
+// body_chunk only. All fields except body_chunk are honoured on the
+// first message and ignored thereafter.
+message ForwardRequest {
+  string path = 1;                          // e.g. "/v1/chat/completions" — appended to the model's upstream_url
+  string method = 2;                        // usually "POST"
+  repeated ForwardHeader headers = 3;
+  bytes body_chunk = 4;
+}
+
+// ForwardReply is a streamed HTTP response from the upstream. First
+// message carries status/headers; subsequent messages carry body_chunk
+// only. SSE responses arrive as a sequence of body_chunk frames; the
+// caller is responsible for any parsing.
+message ForwardReply {
+  int32 status = 1;
+  repeated ForwardHeader headers = 2;
+  bytes body_chunk = 3;
+}
+
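Note on the ForwardRequest framing described above: a minimal sketch of how a caller might split one HTTP request into stream frames. It assumes the strictest reading of the comment (the first frame carries only path/method/headers, every later frame carries only body_chunk). The Frame struct and frame_request helper are hypothetical stand-ins for illustration, not the generated protobuf classes.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical plain-struct stand-in for the generated ForwardRequest message.
struct Frame {
    std::string path;
    std::string method;
    std::vector<std::pair<std::string, std::string>> headers;
    std::string body_chunk;
};

// Split one request into frames: metadata first, then body-only chunks.
inline std::vector<Frame> frame_request(
        const std::string &path, const std::string &method,
        const std::vector<std::pair<std::string, std::string>> &headers,
        const std::string &body, std::size_t chunk_size) {
    if (chunk_size == 0) chunk_size = 1; // avoid an infinite loop on a zero chunk size
    std::vector<Frame> frames;

    Frame first;                          // metadata-only frame, empty body_chunk
    first.path = path;
    first.method = method;
    first.headers = headers;
    frames.push_back(first);

    for (std::size_t off = 0; off < body.size(); off += chunk_size) {
        Frame f;                          // path/method/headers left empty on purpose
        f.body_chunk = body.substr(off, chunk_size);
        frames.push_back(f);
    }
    return frames;
}
```

Sending metadata and body in separate frames is valid under either reading of the comment, since body_chunk is the only field honoured after the first message.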
--- a/backend/cpp/ds4/.gitignore
+++ b/backend/cpp/ds4/.gitignore
@@ -2,6 +2,7 @@ ds4/
 build/
 package/
 grpc-server
+ds4-worker
 *.o
 backend.pb.cc
 backend.pb.h
--- a/backend/cpp/ds4/CMakeLists.txt
+++ b/backend/cpp/ds4/CMakeLists.txt
@@ -9,6 +9,22 @@ option(DS4_NATIVE "Compile with -march=native / -mcpu=native" ON)
 set(DS4_GPU "cpu" CACHE STRING "GPU backend: cpu, cuda, or metal")
 set(DS4_DIR "${CMAKE_CURRENT_SOURCE_DIR}/ds4" CACHE PATH "Path to cloned ds4 source")

+if(${CMAKE_SYSTEM_NAME} MATCHES "Darwin")
+    # Homebrew installs protobuf/grpc under a non-default prefix. The generated
+    # backend.pb.cc / backend.grpc.pb.cc pull in google/protobuf and grpcpp
+    # headers, but the hw_grpc_proto library links neither target, so on macOS
+    # the headers (e.g. google/protobuf/runtime_version.h) are never on the
+    # compiler's include path. Add the Homebrew prefix globally, matching the
+    # llama-cpp backend which builds on Darwin CI.
+    if(CMAKE_HOST_SYSTEM_PROCESSOR MATCHES "arm64")
+        set(HOMEBREW_DEFAULT_PREFIX "/opt/homebrew")
+    else()
+        set(HOMEBREW_DEFAULT_PREFIX "/usr/local")
+    endif()
+    link_directories("${HOMEBREW_DEFAULT_PREFIX}/lib")
+    include_directories("${HOMEBREW_DEFAULT_PREFIX}/include")
+endif()
+
 find_package(Threads REQUIRED)
 find_package(Protobuf CONFIG QUIET)
 if(NOT Protobuf_FOUND)
@@ -60,6 +76,13 @@ elseif(DS4_GPU STREQUAL "cpu")
    set(DS4_OBJS "${DS4_DIR}/ds4_cpu.o")
 endif()

+# ds4.c now references ds4_distributed.c (distributed inference) and ds4_ssd.c
+# (SSD expert-cache), each split into its own translation unit upstream. Both
+# are GPU-agnostic objects shared by every GPU mode, so link them in regardless
+# of DS4_GPU.
+list(APPEND DS4_OBJS "${DS4_DIR}/ds4_distributed.o")
+list(APPEND DS4_OBJS "${DS4_DIR}/ds4_ssd.o")
+
 add_executable(${TARGET}
    grpc-server.cpp
    dsml_parser.cpp
@@ -99,3 +122,36 @@ if(DS4_NATIVE)
        target_compile_options(${TARGET} PRIVATE -march=native)
    endif()
 endif()
+
+# ds4-worker: standalone distributed worker. Links the same ds4 engine objects
+# (including ds4_distributed.o) but has NO gRPC/protobuf dependency - it speaks
+# ds4's own TCP transport via ds4_dist_run(). Buildable wherever the engine
+# objects build, even on hosts without protobuf/grpc dev headers.
+add_executable(ds4-worker worker_main.c)
+target_include_directories(ds4-worker PRIVATE ${DS4_DIR})
+foreach(obj ${DS4_OBJS})
+    target_sources(ds4-worker PRIVATE ${obj})
+    set_source_files_properties(${obj} PROPERTIES EXTERNAL_OBJECT TRUE GENERATED TRUE)
+endforeach()
+# worker_main.c is C, but the engine objects built by nvcc (ds4_cuda.o) and the
+# Metal path (ds4_metal.o, Obj-C++) reference the C++ runtime (libstdc++). Force
+# the C++ linker driver so those symbols resolve; the C driver would not link
+# libstdc++ and the CUDA/Metal builds fail with undefined std:: references.
+set_target_properties(ds4-worker PROPERTIES LINKER_LANGUAGE CXX)
+target_link_libraries(ds4-worker PRIVATE Threads::Threads m)
+
+if(DS4_GPU STREQUAL "cuda")
+    target_link_libraries(ds4-worker PRIVATE CUDA::cudart CUDA::cublas)
+elseif(DS4_GPU STREQUAL "metal")
+    target_link_libraries(ds4-worker PRIVATE ${FOUNDATION_LIB} ${METAL_LIB})
+elseif(DS4_GPU STREQUAL "cpu")
+    target_compile_definitions(ds4-worker PRIVATE DS4_NO_GPU)
+endif()
+
+if(DS4_NATIVE)
+    if(APPLE)
+        target_compile_options(ds4-worker PRIVATE -mcpu=native)
+    else()
+        target_compile_options(ds4-worker PRIVATE -march=native)
+    endif()
+endif()
--- a/backend/cpp/ds4/Makefile
+++ b/backend/cpp/ds4/Makefile
@@ -1,10 +1,10 @@
 # ds4 backend Makefile.
 #
-# Upstream pin lives below as DS4_VERSION?=444afce822057d87f14c4dec307dce24fd49b3ee
+# Upstream pin lives below as DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
 # (.github/bump_deps.sh) can find and update it - matches the
 # llama-cpp / ik-llama-cpp / turboquant convention.

-DS4_VERSION?=444afce822057d87f14c4dec307dce24fd49b3ee
+DS4_VERSION?=80ebbc396aee40eedc1d829222f3362d10fa4c6c
 DS4_REPO?=https://github.com/antirez/ds4

 CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST))))
@@ -18,16 +18,20 @@ UNAME_S := $(shell uname -s)

 CMAKE_ARGS ?= -DCMAKE_BUILD_TYPE=Release

+# ds4_distributed.o and ds4_ssd.o are GPU-agnostic translation units that
+# ds4.c/ds4_cpu.o now reference (upstream split distributed inference and the
+# SSD expert-cache into their own .c files). Both objects are shared by every
+# GPU mode, so they are appended unconditionally below.
 ifeq ($(BUILD_TYPE),cublas)
    CMAKE_ARGS += -DDS4_GPU=cuda
-    DS4_OBJ_TARGET := ds4.o ds4_cuda.o
+    DS4_OBJ_TARGET := ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
 else ifeq ($(UNAME_S),Darwin)
    CMAKE_ARGS += -DDS4_GPU=metal
-    DS4_OBJ_TARGET := ds4.o ds4_metal.o
+    DS4_OBJ_TARGET := ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
 else
    # CPU reference path (Linux only - macOS CPU path is broken by VM bug per ds4 README).
    CMAKE_ARGS += -DDS4_GPU=cpu
-    DS4_OBJ_TARGET := ds4_cpu.o
+    DS4_OBJ_TARGET := ds4_cpu.o ds4_distributed.o ds4_ssd.o
 endif

 ifneq ($(NATIVE),true)
@@ -52,17 +56,18 @@ ds4:
 # the right per-platform compile flags (Objective-C/Metal on Darwin, nvcc on Linux+CUDA).
 ds4/ds4.o: ds4
 ifeq ($(BUILD_TYPE),cublas)
-	+$(MAKE) -C ds4 ds4.o ds4_cuda.o
+	+$(MAKE) -C ds4 ds4.o ds4_cuda.o ds4_distributed.o ds4_ssd.o
 else ifeq ($(UNAME_S),Darwin)
-	+$(MAKE) -C ds4 ds4.o ds4_metal.o
+	+$(MAKE) -C ds4 ds4.o ds4_metal.o ds4_distributed.o ds4_ssd.o
 else
-	+$(MAKE) -C ds4 ds4_cpu.o
+	+$(MAKE) -C ds4 ds4_cpu.o ds4_distributed.o ds4_ssd.o
 endif

 grpc-server: ds4/ds4.o
 	mkdir -p $(BUILD_DIR)
 	cd $(BUILD_DIR) && cmake $(CMAKE_ARGS) $(CURRENT_MAKEFILE_DIR) && cmake --build . --config Release -j $(JOBS)
 	cp $(BUILD_DIR)/grpc-server grpc-server
+	cp $(BUILD_DIR)/ds4-worker ds4-worker

 package: grpc-server
 	bash package.sh
@@ -71,7 +76,7 @@ test:
 	@echo "ds4 backend: e2e coverage at tests/e2e-backends/ (BACKEND_BINARY mode)"

 clean:
-	rm -rf $(BUILD_DIR) grpc-server package
+	rm -rf $(BUILD_DIR) grpc-server ds4-worker package
 	if [ -d ds4 ]; then $(MAKE) -C ds4 clean; fi

 purge: clean
--- a/backend/cpp/ds4/grpc-server.cpp
+++ b/backend/cpp/ds4/grpc-server.cpp
@@ -23,8 +23,13 @@ extern "C" {

 #include <atomic>
 #include <chrono>
+#include <climits>
 #include <csignal>
+#include <cstddef>
+#include <cstdint>
+#include <cstdlib>
 #include <cstring>
+#include <ctime>
 #include <iostream>
 #include <memory>
 #include <mutex>
@@ -51,6 +56,12 @@ ds4_session *g_session = nullptr;
 int g_ctx_size = 32768;
 std::string g_kv_cache_dir; // empty disables disk cache

+// Distributed coordinator state. g_distributed is set true when LoadModel is
+// given 'ds4_role:coordinator'; generation then waits for the worker route to
+// form before running. Single-node behavior is unchanged when unset.
+bool g_distributed = false;
+int g_route_timeout_sec = 60;
+
 std::atomic<Server *> g_server{nullptr};

 // Parse a "key:value" option string. Returns empty when no colon.
@@ -60,6 +71,201 @@ static std::pair<std::string, std::string> split_option(const std::string &opt)
    return {opt.substr(0, colon), opt.substr(colon + 1)};
 }

+// Parse a positive base-10 integer. Returns false (without throwing) on empty,
+// trailing garbage, non-positive, or overflow - unlike std::stoi.
+static bool parse_positive_int(const std::string &s, int *out) {
+    if (s.empty()) return false;
+    char *end = nullptr;
+    long v = std::strtol(s.c_str(), &end, 10);
+    if (!end || *end != '\0' || v <= 0 || v > INT_MAX) return false;
+    *out = static_cast<int>(v);
+    return true;
+}
+
+// Parse a ds4 layer spec "START:END" or "START:output" into the engine's
+// distributed layer fields. Returns false on malformed input.
+static bool parse_layers_spec(const std::string &spec, ds4_distributed_layers *out) {
+    auto colon = spec.find(':');
+    if (colon == std::string::npos) return false;
+    std::string lhs = spec.substr(0, colon);
+    std::string rhs = spec.substr(colon + 1);
+    if (lhs.empty() || rhs.empty()) return false;
+    char *end = nullptr;
+    long start = std::strtol(lhs.c_str(), &end, 10);
+    if (!end || *end != '\0' || start < 0) return false;
+    out->start = static_cast<uint32_t>(start);
+    out->has_output = false;
+    if (rhs == "output") {
+        out->has_output = true;
+        out->end = out->start; // engine treats has_output as "through final layer"
+    } else {
+        long e = std::strtol(rhs.c_str(), &end, 10);
+        if (!end || *end != '\0' || e < start) return false;
+        out->end = static_cast<uint32_t>(e);
+    }
+    out->set = true;
+    return true;
+}
+
+// Parse a boolean LoadModel option. An empty value (a bare flag-style option
+// like "ssd_streaming" with no colon) means true so model YAMLs can write
+// options: ["ssd_streaming"] to enable a switch.
+static bool parse_bool_option(const std::string &s, bool *out) {
+    if (s.empty() || s == "true" || s == "1" || s == "yes" || s == "on") { *out = true; return true; }
+    if (s == "false" || s == "0" || s == "no" || s == "off") { *out = false; return true; }
+    return false;
+}
+
+// Table-driven mapping from LoadModel option keys to ds4_engine_options fields.
+// ds4_engine_options is a fixed C struct with no reflection, so the field set
+// is enumerated once here; adding a future engine knob is a one-line table
+// entry rather than a new branch in LoadModel. Two fields need ds4's own typed
+// parsers (Gib, CacheExperts) so a plain string passthrough can't cover them.
+enum class DsOptType { Bool, Int, Uint, Float, Str, Gib, CacheExperts };
+
+struct DsOptSpec {
+    const char *key;
+    DsOptType   type;
+    size_t      off;      // byte offset into ds4_engine_options
+    size_t      off2;     // second offset (CacheExperts writes experts + bytes)
+    bool        is_path;  // Str values: resolve a relative value against the model dir
+};
+
+static const DsOptSpec kEngineOptSpecs[] = {
+    {"mtp_path",                      DsOptType::Str,          offsetof(ds4_engine_options, mtp_path),                      0, true},
+    {"mtp_draft",                     DsOptType::Int,          offsetof(ds4_engine_options, mtp_draft_tokens),              0},
+    {"mtp_margin",                    DsOptType::Float,        offsetof(ds4_engine_options, mtp_margin),                    0},
+    {"prefill_chunk",                 DsOptType::Uint,         offsetof(ds4_engine_options, prefill_chunk),                 0},
+    {"power_percent",                 DsOptType::Int,          offsetof(ds4_engine_options, power_percent),                 0},
+    {"warm_weights",                  DsOptType::Bool,         offsetof(ds4_engine_options, warm_weights),                  0},
+    {"quality",                       DsOptType::Bool,         offsetof(ds4_engine_options, quality),                       0},
+    {"ssd_streaming",                 DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming),                 0},
+    {"ssd_streaming_cold",            DsOptType::Bool,         offsetof(ds4_engine_options, ssd_streaming_cold),            0},
+    {"ssd_streaming_preload_experts", DsOptType::Uint,         offsetof(ds4_engine_options, ssd_streaming_preload_experts), 0},
+    {"ssd_streaming_cache_experts",   DsOptType::CacheExperts, offsetof(ds4_engine_options, ssd_streaming_cache_experts),
+                                                               offsetof(ds4_engine_options, ssd_streaming_cache_bytes)},
+    {"simulate_used_memory",          DsOptType::Gib,          offsetof(ds4_engine_options, simulate_used_memory_bytes),    0},
+    {"expert_profile_path",           DsOptType::Str,          offsetof(ds4_engine_options, expert_profile_path),           0, true},
+    {"directional_steering_file",     DsOptType::Str,          offsetof(ds4_engine_options, directional_steering_file),     0, true},
+    {"directional_steering_attn",     DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_attn),     0},
+    {"directional_steering_ffn",      DsOptType::Float,        offsetof(ds4_engine_options, directional_steering_ffn),      0},
+};
+
+// Apply a single key:value LoadModel option to the engine options struct.
+// Unknown keys are ignored (back-compat: callers pass mixed option sets).
+// String values are copied into `storage`, whose elements the engine reads by
+// pointer during ds4_engine_open; `storage` MUST have reserved capacity so
+// push_back never reallocates and dangles an earlier c_str(). Returns false
+// with `err` set when a recognized key has an invalid value.
+static bool apply_engine_option(ds4_engine_options *opt, const std::string &key,
+                                const std::string &val, const std::string &model_dir,
+                                std::vector<std::string> &storage, std::string &err) {
+    const DsOptSpec *spec = nullptr;
+    for (const auto &s : kEngineOptSpecs) {
+        if (key == s.key) { spec = &s; break; }
+    }
+    if (!spec) return true; // unknown key: ignore
+
+    char *base = reinterpret_cast<char *>(opt);
+    switch (spec->type) {
+    case DsOptType::Bool: {
+        bool b = false;
+        if (!parse_bool_option(val, &b)) { err = key + " must be true/false"; return false; }
+        *reinterpret_cast<bool *>(base + spec->off) = b;
+        return true;
+    }
+    case DsOptType::Int: {
+        char *end = nullptr;
+        long v = std::strtol(val.c_str(), &end, 10);
+        if (val.empty() || !end || *end != '\0' || v < INT_MIN || v > INT_MAX) { err = key + " must be an integer"; return false; }
+        *reinterpret_cast<int *>(base + spec->off) = static_cast<int>(v);
+        return true;
+    }
+    case DsOptType::Uint: {
+        char *end = nullptr;
+        long long v = std::strtoll(val.c_str(), &end, 10);
+        if (val.empty() || !end || *end != '\0' || v < 0 || v > static_cast<long long>(UINT32_MAX)) {
+            err = key + " must be a non-negative integer"; return false;
+        }
+        *reinterpret_cast<uint32_t *>(base + spec->off) = static_cast<uint32_t>(v);
+        return true;
+    }
+    case DsOptType::Float: {
+        char *end = nullptr;
+        float f = std::strtof(val.c_str(), &end);
+        if (val.empty() || !end || *end != '\0') { err = key + " must be a number"; return false; }
+        *reinterpret_cast<float *>(base + spec->off) = f;
+        return true;
+    }
+    case DsOptType::Str: {
+        // Resolve a relative path option (e.g. mtp_path: a sibling GGUF the
+        // gallery downloaded next to the model) against the model directory, so
+        // YAMLs reference companion files by name. Absolute values pass through.
+        if (spec->is_path && !model_dir.empty() && !val.empty() && val.front() != '/') {
+            storage.push_back(model_dir + "/" + val);
+        } else {
+            storage.push_back(val);
+        }
+        *reinterpret_cast<const char **>(base + spec->off) = storage.back().c_str();
+        return true;
+    }
+    case DsOptType::Gib: {
+        uint64_t bytes = 0;
+        if (!ds4_parse_gib_arg(val.c_str(), &bytes)) {
+            err = key + " must be a GiB value, e.g. 64GB"; return false;
+        }
+        *reinterpret_cast<uint64_t *>(base + spec->off) = bytes;
+        return true;
+    }
+    case DsOptType::CacheExperts: {
+        uint32_t experts = 0;
+        uint64_t bytes = 0;
+        if (!ds4_parse_streaming_cache_experts_arg(val.c_str(), &experts, &bytes)) {
+            err = key + " must be a positive expert count or a <number>GB budget"; return false;
+        }
+        *reinterpret_cast<uint32_t *>(base + spec->off)  = experts;
+        *reinterpret_cast<uint64_t *>(base + spec->off2) = bytes;
+        return true;
+    }
+    }
+    return true;
+}
+
+// When acting as a distributed coordinator, block until the worker route
+// covers all layers (ds4_session_distributed_route_ready == 1) or the timeout
+// elapses. Returns an empty string on success, or an error message to return
+// to the client. No-op when not distributed.
+//
+// Takes the g_engine_mu lock by reference and RELEASES it during each poll
+// sleep. The wait can span up to g_route_timeout_sec seconds while workers
+// connect; holding g_engine_mu the whole time would block the Status/Health
+// readiness probes (they also lock g_engine_mu), making LocalAI's loader treat
+// a coordinator that is still waiting for its workers as hung.
+static std::string wait_route_ready(std::unique_lock<std::mutex> &lock) {
+    if (!g_distributed) return "";
+    char err[256] = {0};
+    const long long deadline_polls = static_cast<long long>(g_route_timeout_sec) * 10; // 100ms per poll; widened so a large timeout cannot overflow int
+    for (long long i = 0; i <= deadline_polls; ++i) {
+        int ready = ds4_session_distributed_route_ready(g_session, err, sizeof(err));
+        if (ready == 1) return "";
+        if (ready < 0) {
+            return std::string("ds4 distributed route error: ") +
+                   (err[0] ? err : "unknown");
+        }
+        // Release the lock while sleeping so Status/Health and other RPCs can
+        // interleave during worker startup.
+        lock.unlock();
+        struct timespec ts = {0, 100L * 1000L * 1000L}; // 100ms
+        nanosleep(&ts, nullptr);
+        lock.lock();
+        // A concurrent Free() may have torn down the engine while we slept.
+        if (!g_engine || !g_session) {
+            return "ds4: model unloaded while waiting for distributed route";
+        }
+    }
+    return "ds4 distributed route incomplete: workers not connected (layers uncovered)";
+}
+
 static void append_token_text(ds4_engine *engine, int token, std::string &out) {
    size_t len = 0;
    const char *text = ds4_token_text(engine, token, &len);
@@ -377,6 +583,11 @@ public:
                     backend::Result *result) override {
        std::lock_guard<std::mutex> lock(g_engine_mu);

+        // Reset distributed state so a model swap (a second LoadModel without
+        // ds4_role) doesn't inherit a stale coordinator configuration.
+        g_distributed = false;
+        g_route_timeout_sec = 60;
+
        if (g_engine) {
            if (g_session) { ds4_session_free(g_session); g_session = nullptr; }
            ds4_engine_close(g_engine);
@@ -391,28 +602,10 @@ public:
            return GStatus::OK;
        }

-        std::string mtp_path;
-        int mtp_draft = 0;
-        float mtp_margin = 3.0f;
-        for (const auto &opt : request->options()) {
-            auto [k, v] = split_option(opt);
-            if (k == "mtp_path") mtp_path = v;
-            else if (k == "mtp_draft") mtp_draft = std::stoi(v);
-            else if (k == "mtp_margin") mtp_margin = std::stof(v);
-            else if (k == "kv_cache_dir") g_kv_cache_dir = v;
-        }
-
-        g_kv_cache.SetDir(g_kv_cache_dir);
-
        ds4_engine_options opt = {};
        opt.model_path = model_path.c_str();
-        opt.mtp_path = mtp_path.empty() ? nullptr : mtp_path.c_str();
        opt.n_threads = request->threads() > 0 ? request->threads() : 0;
-        opt.mtp_draft_tokens = mtp_draft;
-        opt.mtp_margin = mtp_margin;
-        opt.directional_steering_file = nullptr;
-        opt.warm_weights = false;
-        opt.quality = false;
+        opt.mtp_margin = 3.0f; // ds4 default; overridable via the mtp_margin option

 #if defined(DS4_NO_GPU)
        opt.backend = DS4_BACKEND_CPU;
@@ -422,6 +615,89 @@ public:
        opt.backend = DS4_BACKEND_CUDA;
 #endif

+        // Stable storage for string-valued engine options. The engine reads
+        // these by pointer during ds4_engine_open, so the backing strings must
+        // outlive the call and never reallocate. Each option pushes at most one
+        // string, so reserving options().size() also covers repeated keys.
+        // Static + clear() reuses it across LoadModel calls (old engine closed above).
+        static std::vector<std::string> s_opt_strings;
+        s_opt_strings.clear();
+        s_opt_strings.reserve(static_cast<size_t>(request->options().size()));
+
+        // Directory of the main model, used to resolve relative path options.
+        std::string model_dir;
+        if (auto slash = model_path.find_last_of('/'); slash != std::string::npos) {
+            model_dir = model_path.substr(0, slash);
+        }
+
+        std::string ds4_role, ds4_layers, ds4_listen;
+        for (const auto &o : request->options()) {
+            auto [k, v] = split_option(o);
+            if (k == "kv_cache_dir") { g_kv_cache_dir = v; continue; }
+            else if (k == "ds4_role") { ds4_role = v; continue; }
+            else if (k == "ds4_layers") { ds4_layers = v; continue; }
+            else if (k == "ds4_listen") { ds4_listen = v; continue; }
+            else if (k == "ds4_route_timeout") {
+                if (!parse_positive_int(v, &g_route_timeout_sec)) {
+                    result->set_success(false);
+                    result->set_message("ds4: ds4_route_timeout must be a positive integer");
+                    return GStatus::OK;
+                }
+                continue;
+            }
+            std::string err;
+            if (!apply_engine_option(&opt, k, v, model_dir, s_opt_strings, err)) {
+                result->set_success(false);
+                result->set_message("ds4: " + err);
+                return GStatus::OK;
+            }
+        }
+
+        g_kv_cache.SetDir(g_kv_cache_dir);
+
+        // Coordinator wiring. 'ds4_role:coordinator' enables layer-split
+        // distributed inference: this process listens on ds4_listen and owns
+        // the ds4_layers slice; workers dial in (see `local-ai worker
+        // ds4-distributed`). Absent ds4_role => unchanged single-node path.
+        // Must be static: opt.distributed.listen_host is a const char* the
+        // engine retains past this call, so it cannot point at a local that
+        // goes out of scope (otherwise a future "simplify to local" refactor
+        // reintroduces a dangling pointer).
+        static std::string s_listen_host;
+        if (ds4_role == "coordinator") {
+            if (ds4_layers.empty() || ds4_listen.empty()) {
+                result->set_success(false);
+                result->set_message("ds4: ds4_role:coordinator requires ds4_layers and ds4_listen");
+                return GStatus::OK;
+            }
+            // host:port for IPv4/hostname; IPv6 literals are unsupported (the
+            // first colon would split inside the address).
+            auto host_port = split_option(ds4_listen); // "host:port" -> {host, port}
+            if (host_port.second.empty()) {
+                result->set_success(false);
+                result->set_message("ds4: ds4_listen must be host:port");
+                return GStatus::OK;
+            }
+            int listen_port = 0;
+            if (!parse_positive_int(host_port.second, &listen_port)) {
+                result->set_success(false);
+                result->set_message("ds4: ds4_listen port must be a positive integer");
+                return GStatus::OK;
+            }
+            ds4_distributed_layers layers = {};
+            if (!parse_layers_spec(ds4_layers, &layers)) {
+                result->set_success(false);
+                result->set_message("ds4: invalid ds4_layers (want START:END or START:output)");
+                return GStatus::OK;
+            }
+            s_listen_host = host_port.first;
+            opt.distributed.role = DS4_DISTRIBUTED_COORDINATOR;
+            opt.distributed.layers = layers;
+            opt.distributed.listen_host = s_listen_host.c_str();
+            opt.distributed.listen_port = listen_port;
+            g_distributed = true;
+        }
+
        int rc = ds4_engine_open(&g_engine, &opt);
        if (rc != 0 || !g_engine) {
            result->set_success(false);
@@ -458,10 +734,13 @@ public:

    GStatus Predict(ServerContext *, const backend::PredictOptions *request,
                   backend::Reply *reply) override {
-        std::lock_guard<std::mutex> lock(g_engine_mu);
+        std::unique_lock<std::mutex> lock(g_engine_mu);
        if (!g_engine || !g_session) {
            return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
        }
+        if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
+            return GStatus(StatusCode::UNAVAILABLE, route_err);
+        }
        ds4_tokens prompt = {};
        build_prompt(g_engine, request, &prompt);
        int n_predict = request->tokens() > 0 ? request->tokens() : 256;
@@ -554,10 +833,13 @@ public:

    GStatus PredictStream(ServerContext *, const backend::PredictOptions *request,
                         ServerWriter<backend::Reply> *writer) override {
-        std::lock_guard<std::mutex> lock(g_engine_mu);
+        std::unique_lock<std::mutex> lock(g_engine_mu);
        if (!g_engine || !g_session) {
            return GStatus(StatusCode::FAILED_PRECONDITION, "ds4: model not loaded");
        }
+        if (std::string route_err = wait_route_ready(lock); !route_err.empty()) {
+            return GStatus(StatusCode::UNAVAILABLE, route_err);
+        }
        ds4_tokens prompt = {};
        build_prompt(g_engine, request, &prompt);
        int n_predict = request->tokens() > 0 ? request->tokens() : 256;
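Note on wait_route_ready above: the pattern of releasing the mutex during each poll sleep can be sketched in isolation as follows. This is an illustration under stated assumptions, not the patch's code: poll_until_ready, probe and still_valid are hypothetical names, and the sketch uses a steady_clock deadline and std::this_thread::sleep_for where the patch counts 100ms polls and calls nanosleep.

```cpp
#include <chrono>
#include <functional>
#include <mutex>
#include <string>
#include <thread>

// Hypothetical helper (not in the patch). `lock` must be held on entry and is
// held again on every return path.
//   probe():       1 = ready, 0 = still pending, <0 = error; called with the lock held.
//   still_valid(): re-checked after each re-lock, because another thread may
//                  have torn down shared state while the lock was released.
inline std::string poll_until_ready(std::unique_lock<std::mutex> &lock,
                                    const std::function<int()> &probe,
                                    const std::function<bool()> &still_valid,
                                    std::chrono::milliseconds timeout,
                                    std::chrono::milliseconds interval) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        const int r = probe();
        if (r == 1) return "";
        if (r < 0) return "probe error";
        if (std::chrono::steady_clock::now() >= deadline) return "timed out";

        lock.unlock();                        // let other users of the mutex run
        std::this_thread::sleep_for(interval);
        lock.lock();

        if (!still_valid()) return "state torn down while waiting";
    }
}
```

Checking the deadline before unlocking keeps the invariant simple: the caller always gets the lock back, whatever the outcome.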
--- a/backend/cpp/ds4/package.sh
+++ b/backend/cpp/ds4/package.sh
@@ -5,7 +5,8 @@ REPO_ROOT="${CURDIR}/../../.."

 mkdir -p "$CURDIR/package/lib"
 cp -avf "$CURDIR/grpc-server" "$CURDIR/package/"
-cp -rfv "$CURDIR/run.sh"     "$CURDIR/package/"
+cp -avf "$CURDIR/ds4-worker"  "$CURDIR/package/"
+cp -rfv "$CURDIR/run.sh"      "$CURDIR/package/"

 UNAME_S=$(uname -s)
 if [ "$UNAME_S" = "Darwin" ]; then
--- a/backend/cpp/ds4/worker_main.c
+++ b/backend/cpp/ds4/worker_main.c
@@ -0,0 +1,126 @@
+// ds4-worker: standalone distributed worker for the LocalAI ds4 backend.
+//
+// A ds4 distributed worker owns a slice of the model's transformer layers,
+// dials the coordinator, and serves activations for its slice. It does NOT
+// speak backend.proto - it speaks ds4's own TCP transport via ds4_dist_run().
+// This binary is intentionally minimal (no HTTP/web/kvstore/linenoise): it
+// only needs the engine objects + ds4_distributed.o, which the backend already
+// builds. It is launched by `local-ai worker ds4-distributed`.
+//
+// Usage:
+//   ds4-worker --role worker --model <gguf> --layers 20:output \
+//              --coordinator <host> <port> [--cpu|--cuda|--metal] [-c CTX] [-t N]
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <limits.h>
+
+#include "ds4.h"
+#include "ds4_distributed.h"
+
+static const char *need_arg(int *i, int argc, char **argv, const char *flag) {
+    if (*i + 1 >= argc) {
+        fprintf(stderr, "ds4-worker: missing value for %s\n", flag);
+        exit(2);
+    }
+    return argv[++(*i)];
+}
+
+static int parse_int_arg(const char *s, const char *flag) {
+    char *end = NULL;
+    long v = strtol(s, &end, 10);
+    if (!s[0] || *end || v <= 0 || v > INT_MAX) {
+        fprintf(stderr, "ds4-worker: invalid value for %s: %s\n", flag, s);
+        exit(2);
+    }
+    return (int)v;
+}
+
+static ds4_backend default_backend(void) {
+#if defined(DS4_NO_GPU)
+    return DS4_BACKEND_CPU;
+#elif defined(__APPLE__)
+    return DS4_BACKEND_METAL;
+#else
+    return DS4_BACKEND_CUDA;
+#endif
+}
+
+int main(int argc, char **argv) {
+    signal(SIGPIPE, SIG_IGN);
+
+    ds4_engine_options opt = {0};
+    opt.backend = default_backend();
+    int ctx_size = 32768;
+
+    for (int i = 1; i < argc; i++) {
+        const char *arg = argv[i];
+        if (!strcmp(arg, "-h") || !strcmp(arg, "--help")) {
+            fprintf(stdout, "ds4-worker: standalone ds4 distributed worker\n");
+            ds4_dist_usage(stdout);
+            fprintf(stdout, "  -m, --model PATH   model GGUF (the worker loads only its --layers slice)\n");
+            fprintf(stdout, "  -c, --ctx N        context size (default 32768)\n");
+            fprintf(stdout, "  -t, --threads N    CPU threads\n");
+            fprintf(stdout, "  --cpu|--cuda|--metal  backend override\n");
+            return 0;
+        }
+
+        char dist_err[256] = {0};
+        ds4_dist_cli_parse_result dist_parse =
+            ds4_dist_parse_cli_arg(arg, &i, argc, argv, &opt.distributed,
+                                   dist_err, sizeof(dist_err));
+        if (dist_parse == DS4_DIST_CLI_ERROR) {
+            fprintf(stderr, "ds4-worker: %s\n",
+                    dist_err[0] ? dist_err : "invalid distributed option");
+            return 2;
+        }
+        if (dist_parse == DS4_DIST_CLI_MATCHED) continue;
+
+        if (!strcmp(arg, "-m") || !strcmp(arg, "--model")) {
+            opt.model_path = need_arg(&i, argc, argv, arg);
+        } else if (!strcmp(arg, "-c") || !strcmp(arg, "--ctx")) {
+            ctx_size = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
+        } else if (!strcmp(arg, "-t") || !strcmp(arg, "--threads")) {
+            opt.n_threads = parse_int_arg(need_arg(&i, argc, argv, arg), arg);
+        } else if (!strcmp(arg, "--cpu")) {
+            opt.backend = DS4_BACKEND_CPU;
+        } else if (!strcmp(arg, "--cuda")) {
+            opt.backend = DS4_BACKEND_CUDA;
+        } else if (!strcmp(arg, "--metal")) {
+            opt.backend = DS4_BACKEND_METAL;
+        } else {
+            fprintf(stderr, "ds4-worker: unknown option: %s\n", arg);
+            return 2;
+        }
+    }
+
+    if (opt.distributed.role != DS4_DISTRIBUTED_WORKER) {
+        fprintf(stderr, "ds4-worker: --role worker is required\n");
+        return 2;
+    }
+    if (!opt.model_path) {
+        fprintf(stderr, "ds4-worker: --model is required\n");
+        return 2;
+    }
+
+    char prep_err[256] = {0};
+    if (ds4_dist_prepare_engine_options(&opt.distributed, &opt,
+                                        prep_err, sizeof(prep_err)) != 0) {
+        fprintf(stderr, "ds4-worker: %s\n", prep_err);
+        return 2;
+    }
+
+    ds4_engine *engine = NULL;
+    if (ds4_engine_open(&engine, &opt) != 0 || !engine) {
+        fprintf(stderr, "ds4-worker: failed to open engine\n");
+        return 1;
+    }
+
+    ds4_dist_generation_options gen = {0};
+    gen.ctx_size = ctx_size;
+    int rc = ds4_dist_run(engine, &opt.distributed, &gen);
+    ds4_engine_close(engine);
+    return rc;
+}
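Review note on `parse_int_arg` in the ds4-worker file above: it rejects empty input, trailing characters, non-positive values and values above `INT_MAX`, then exits on failure. Below is a minimal standalone sketch of the same validation that returns an error code instead of exiting. The name `parse_positive_int` and the extra `errno == ERANGE` check are assumptions for illustration and are not part of the patch; the check only matters on platforms where `long` is no wider than `int`.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, not part of the patch: parses a strictly positive
 * decimal int. Returns 0 and stores the value in *out on success; returns -1
 * on empty input, trailing characters, a non-positive value, or overflow. */
static int parse_positive_int(const char *s, int *out) {
    char *end = NULL;
    long v;

    if (s == NULL || s[0] == '\0') return -1;
    errno = 0;
    v = strtol(s, &end, 10);
    if (end == s || *end != '\0') return -1;  /* no digits, or trailing junk */
    if (errno == ERANGE) return -1;           /* value overflowed long */
    if (v <= 0 || v > INT_MAX) return -1;     /* outside (0, INT_MAX] */
    *out = (int)v;
    return 0;
}
```

A caller in the style of the worker's option loop could keep the existing behaviour by printing the flag name and calling `exit(2)` whenever the helper returns -1.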
--- a/backend/cpp/ik-llama-cpp/Makefile
+++ b/backend/cpp/ik-llama-cpp/Makefile
@@ -1,5 +1,5 @@

-IK_LLAMA_VERSION?=642c038ccdf3dd08e6d9ac6fdc3b1c311ebd8a02
+IK_LLAMA_VERSION?=b3dfb7858cfcb9166e92f366e5af87f19ebc94be
 LLAMA_REPO?=https://github.com/ikawrakow/ik_llama.cpp

 CMAKE_ARGS?=
--- a/backend/cpp/llama-cpp/Makefile
+++ b/backend/cpp/llama-cpp/Makefile
@@ -1,6 +1,14 @@

-LLAMA_VERSION?=c0c7e147e7efa6c5858754b47259ba4880f8a906
+LLAMA_VERSION?=f3e182816421c648188b5eab269853bf1531d950
 LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
+# LLAMA_PAGED controls whether the vendored paged-attention patch series
+# (patches/paged/) is applied on top of the pinned llama.cpp. Default on; set
+# LLAMA_PAGED=off to build a clean-against-upstream backend (e.g. to unblock a
+# dep-bump if an upstream change breaks a paged hook - the paged carry is then
+# fixed independently). Runtime behaviour stays gated by the LLAMA_KV_PAGED env
+# regardless, so an LLAMA_PAGED=on build is byte-identical to stock until that
+# env is set.
+LLAMA_PAGED?=on

 CMAKE_ARGS?=
 BUILD_TYPE?=
@@ -137,14 +145,28 @@ llama.cpp:
 	git remote add origin $(LLAMA_REPO)  && \
 	git fetch --all --tags && \
 	git checkout -b build $(LLAMA_VERSION) && \
-	git submodule update --init --recursive --depth 1 --single-branch
+	git submodule update --init --recursive --depth 1 --single-branch && \
+	for p in $(CURRENT_MAKEFILE_DIR)patches/0*.patch; do \
+		[ -e "$$p" ] || continue; \
+		echo "applying llama.cpp patch: $$p"; \
+		git apply --verbose "$$p" || { echo "patch failed: $$p"; exit 1; }; \
+	done && \
+	if [ "$(LLAMA_PAGED)" = "off" ]; then \
+		echo "LLAMA_PAGED=off: skipping paged-attention patch series"; \
+	else \
+		for p in $(CURRENT_MAKEFILE_DIR)patches/paged/0*.patch; do \
+			[ -e "$$p" ] || continue; \
+			echo "applying llama.cpp PAGED patch: $$p"; \
+			git apply --verbose "$$p" || { echo "paged patch failed: $$p"; exit 1; }; \
+		done; \
+	fi

 llama.cpp/tools/grpc-server: llama.cpp
 	mkdir -p llama.cpp/tools/grpc-server
-	bash prepare.sh
+	LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh

 rebuild:
-	bash prepare.sh
+	LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
 	rm -rf grpc-server
 	$(MAKE) grpc-server

--- a/backend/cpp/llama-cpp/grpc-server.cpp
+++ b/backend/cpp/llama-cpp/grpc-server.cpp
@@ -34,6 +34,7 @@
 #include <regex>
 #include <algorithm>
 #include <atomic>
+#include <cmath>
 #include <cstdlib>
 #include <fstream>
 #include <iterator>
@@ -121,6 +122,40 @@ static std::string base64_encode_bytes(const unsigned char* data, size_t len) {

 bool loaded_model; // TODO: add a mutex for this, but happens only once loading the model

+// Score bypasses the slot loop (see the comment on Score below) so it
+// must not run concurrently with any slot-loop RPC. These counters
+// are a defence-in-depth tripwire — ModelConfig.Validate already
+// rejects llama-cpp configs that mix score with chat/completion/
+// embeddings, so a healthy deployment never trips them. seq_cst is
+// load-bearing for the increment-then-check pattern below.
+static std::atomic<int> slot_loop_inflight{0};
+static std::atomic<int> score_inflight{0};
+
+// Increment-then-check, not check-then-increment: under seq_cst at least
+// one of two simultaneous racers observes the other's increment and aborts.
+// Reversed, both could see zero and proceed.
+struct conflict_guard {
+    std::atomic<int>& self;
+    conflict_guard(const char* rpc, std::atomic<int>& self_, std::atomic<int>& other, const char* other_name)
+        : self(self_) {
+        self.fetch_add(1, std::memory_order_seq_cst);
+        int o = other.load(std::memory_order_seq_cst);
+        if (o > 0) {
+            fprintf(stderr,
+                "FATAL: %s called with %s=%d. The llama-cpp backend cannot "
+                "service Score and slot-loop RPCs concurrently — Score "
+                "bypasses the slot loop and races the llama_context. Bind "
+                "Score-using features to a model dedicated to scoring "
+                "(known_usecases: [score] with no chat/completion/embeddings).\n",
+                rpc, other_name, o);
+            std::abort();
+        }
+    }
+    ~conflict_guard() {
+        self.fetch_sub(1, std::memory_order_seq_cst);
+    }
+};
+
 static std::function<void(int)> shutdown_handler;
 static std::atomic_flag is_terminating = ATOMIC_FLAG_INIT;

@@ -346,6 +381,15 @@ json parse_options(bool streaming, const backend::PredictOptions* predict, const
            });
    }

+    // for each video in the request, add the video data
+    for (int i = 0; i < predict->videos_size(); i++) {
+        data["video_data"].push_back(json
+            {
+                {"id", i},
+                {"data",    predict->videos(i)},
+            });
+    }
+
    data["stop"] = predict->stopprompts();
    // data["n_probs"] = predict->nprobs();
    //TODO: images,
@@ -447,23 +491,13 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    if (!request->draftmodel().empty()) {
        params.speculative.draft.mparams.path = request->draftmodel();
        // Default to draft type if a draft model is set but no explicit type.
-        // Upstream (post ggml-org/llama.cpp#22838) made the speculative type a
-        // vector; the turboquant fork still uses the legacy scalar. The
-        // LOCALAI_LEGACY_LLAMA_CPP_SPEC macro is injected by
-        // backend/cpp/turboquant/patch-grpc-server.sh for fork builds only.
-        // Upstream renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE
-        // in ggml-org/llama.cpp#22964; the fork still uses the old name.
-#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
-        if (params.speculative.type == COMMON_SPECULATIVE_TYPE_NONE) {
-            params.speculative.type = COMMON_SPECULATIVE_TYPE_DRAFT;
-        }
-#else
+        // Upstream made the speculative type a vector (ggml-org/llama.cpp#22838)
+        // and renamed COMMON_SPECULATIVE_TYPE_DRAFT -> ..._DRAFT_SIMPLE (#22964).
        const bool no_spec_type = params.speculative.types.empty() ||
            (params.speculative.types.size() == 1 && params.speculative.types[0] == COMMON_SPECULATIVE_TYPE_NONE);
        if (no_spec_type) {
            params.speculative.types = { COMMON_SPECULATIVE_TYPE_DRAFT_SIMPLE };
        }
-#endif
    }

    //  params.model_alias ??
@@ -535,9 +569,16 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
    // kv_unified=false or cache_ram_mib=0, so flipping kv_unified above is
    // what actually unlocks it.
    params.cache_idle_slots = true;
-    // checkpoint_every_nt: create a context checkpoint every N tokens during
-    // prefill (-1 disables). Match upstream's default (8192).
-    params.checkpoint_every_nt = 8192;
+    // checkpoint_min_step: minimum spacing between context checkpoints in
+    // tokens (0 disables the minimum). Match upstream's default (256). This
+    // field was renamed from `checkpoint_every_nt` in llama.cpp; the semantics
+    // also shifted from a fixed cadence to a minimum spacing. The turboquant
+    // fork still lacks common_params::checkpoint_min_step, so skip it there
+    // (LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP is injected by
+    // backend/cpp/turboquant/patch-grpc-server.sh).
+#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
+    params.checkpoint_min_step = 256;
+#endif

     // decode options. Options are in form optname:optvale, or if booleans only optname.
    for (int i = 0; i < request->options_size(); i++) {
@@ -691,6 +732,97 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") {
                params.kv_unified = false;
            }
+        // --- paged KV cache (experimental, off by default) ---
+        // Enables the on-demand paged KV-cache engine (vendored PagedKVManager
+        // + paged placement/gather/alloc seams). The engine is gated inside
+        // llama.cpp by the LLAMA_KV_PAGED env var, evaluated once at first use;
+        // here we expose it as a per-server model option instead of forcing the
+        // operator to export a process-wide env. When enabled we set the env
+        // BEFORE the model/context is created (later in this handler), so the
+        // engine latches on. When the option is absent we touch nothing, so an
+        // externally exported LLAMA_KV_PAGED still works as an escape hatch.
+        // Note: the engine's env check is process-wide and latches on first
+        // use, so enabling it for one model enables it for the worker process;
+        // LocalAI runs one model per llama.cpp worker, so this maps cleanly to
+        // per-server configuration. `kv_paged_debug` turns on the per-slot
+        // [paged-alloc]/free trace (LLAMA_KV_PAGED_DEBUG).
+        //
+        // The continuous-batching serving loop (update_slots) drives paged KV
+        // transparently through the existing kv-cache seams: each slot's
+        // sequence allocates paged blocks on arrival (find_slot placement) and
+        // returns them on slot release (the seq_rm free seam). This is
+        // token-identical to stock under both the unified and per-sequence
+        // caches. The per-slot allocate/free capacity benefit, however, only
+        // materialises with a per-sequence cache, since paged block ownership
+        // is keyed by stream and the unified cache collapses every slot onto a
+        // single stream. Operators who want that benefit should pair this with
+        // `kv_unified:false`; we do NOT flip kv_unified here, to keep the
+        // default serving behaviour (and the idle-slot prompt cache) unchanged.
+        } else if (!strcmp(optname, "kv_paged") || !strcmp(optname, "paged_kv") || !strcmp(optname, "paged_attention")) {
+            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
+                setenv("LLAMA_KV_PAGED", "1", 1);
+            }
+        } else if (!strcmp(optname, "kv_paged_debug") || !strcmp(optname, "paged_kv_debug")) {
+            if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") {
+                setenv("LLAMA_KV_PAGED_DEBUG", "1", 1);
+            }
+        // --- chunked-prefill QoS budget (experimental, off by default) ---
+        // Caps the number of prompt tokens any single slot may prefill per
+        // update_slots iteration, so a large prompt cannot monopolise the batch
+        // and freeze the in-flight decoders. The serving loop reads this budget
+        // from the LLAMA_PREFILL_BUDGET env var (set BEFORE context init, like
+        // kv_paged above) and splits oversized prompts across iterations,
+        // interleaving decode steps for the other slots. A 6k-token prefill that
+        // stalled 8 decoders ~3.4s drops to ~780ms at budget=512 (4.8x stall
+        // cut) with zero TTFT cost and no steady-state regression. Unset or a
+        // non-positive value leaves the env untouched, so the stock unbounded
+        // prefill behaviour is preserved (an externally exported
+        // LLAMA_PREFILL_BUDGET still works as an escape hatch).
+        } else if (!strcmp(optname, "max_prefill_tokens") || !strcmp(optname, "mpt") || !strcmp(optname, "prefill_budget")) {
+            if (optval != NULL) {
+                try {
+                    int budget = std::stoi(optval_str);
+                    if (budget > 0) {
+                        setenv("LLAMA_PREFILL_BUDGET", std::to_string(budget).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the budget unset (stock behaviour)
+                }
+            }
+        // --- dynamic decode-first prefill budget (patch 0016, continuous-batch P1) ---
+        // Supersedes max_prefill_tokens (the static patch-0013 cap) with the dynamic
+        // T - D budget read by update_slots(): a single total per-step token budget T
+        // (max_batch_tokens / mbt, the vLLM max_num_batched_tokens analogue) of which
+        // decode claims its live load D first and prefill gets the leftover, plus an
+        // optional per-slot prompt-chunk cap (prefill_cap, the long_prefill_token_
+        // threshold analogue). Both are set BEFORE context init, like kv_paged /
+        // max_prefill_tokens above. Unset leaves the env untouched, so the engine stays
+        // byte-identical to stock (an externally exported LLAMA_MAX_BATCH_TOKENS /
+        // LLAMA_PREFILL_CAP still works as an escape hatch). When max_batch_tokens is set
+        // it takes precedence over max_prefill_tokens: the engine honours the legacy
+        // LLAMA_PREFILL_BUDGET only when the dynamic knob is unset.
+        } else if (!strcmp(optname, "max_batch_tokens") || !strcmp(optname, "mbt")) {
+            if (optval != NULL) {
+                try {
+                    int mbt = std::stoi(optval_str);
+                    if (mbt > 0) {
+                        setenv("LLAMA_MAX_BATCH_TOKENS", std::to_string(mbt).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the budget unset (stock behaviour)
+                }
+            }
+        } else if (!strcmp(optname, "prefill_cap")) {
+            if (optval != NULL) {
+                try {
+                    int cap = std::stoi(optval_str);
+                    if (cap > 0) {
+                        setenv("LLAMA_PREFILL_CAP", std::to_string(cap).c_str(), 1);
+                    }
+                } catch (const std::exception& e) {
+                    // If conversion fails, leave the per-slot cap unset (engine default)
+                }
+            }
        } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) {
            if (optval != NULL) {
                try {
@@ -711,16 +843,28 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
                params.cache_idle_slots = false;
            }

-        // --- prefill checkpoint cadence (upstream -cpent / --checkpoint-every-n-tokens) ---
-        // -1 disables checkpointing during prefill.
-        } else if (!strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
+#ifndef LOCALAI_TURBOQUANT_NO_CHECKPOINT_MIN_STEP
+        // --- minimum context-checkpoint spacing (upstream -cms / --checkpoint-min-step) ---
+        // 0 disables the minimum-spacing gate. Old option names (`checkpoint_every_nt`,
+        // `checkpoint_every_n_tokens`) are kept as aliases for backward compatibility
+        // with existing user configs: upstream renamed the field and shifted its
+        // semantics from a fixed cadence to a minimum spacing.
+        //
+        // Gated out for the turboquant fork, which lacks common_params::
+        // checkpoint_min_step. The leading `}` closing the cache_idle_slots
+        // branch is removed with this block; the next `} else if` (n_ubatch)
+        // then closes cache_idle_slots, so braces stay balanced under both
+        // preprocessor branches.
+        } else if (!strcmp(optname, "checkpoint_min_step") || !strcmp(optname, "checkpoint_min_spacing") ||
+                   !strcmp(optname, "checkpoint_every_nt") || !strcmp(optname, "checkpoint_every_n_tokens")) {
            if (optval != NULL) {
                try {
-                    params.checkpoint_every_nt = std::stoi(optval_str);
+                    params.checkpoint_min_step = std::stoi(optval_str);
                } catch (const std::exception& e) {
-                    // If conversion fails, keep default value (8192)
+                    // If conversion fails, keep default value (256)
                }
            }
+#endif

        // --- physical batch size (upstream -ub / --ubatch-size) ---
        // Note: line ~482 already aliases n_ubatch to n_batch as a default; this
@@ -853,17 +997,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt

        // Speculative decoding options
        } else if (!strcmp(optname, "spec_type") || !strcmp(optname, "speculative_type")) {
-#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
-            // Fork only knows a single scalar `type`. Take the first comma-
-            // separated value and assign it via the singular helper.
-            std::string first = optval_str;
-            const auto comma = first.find(',');
-            if (comma != std::string::npos) first = first.substr(0, comma);
-            auto type = common_speculative_type_from_name(first);
-            if (type != COMMON_SPECULATIVE_TYPE_COUNT) {
-                params.speculative.type = type;
-            }
-#else
            // Upstream switched to a vector of types (comma-separated for multi-type
            // chaining via common_speculative_types_from_names). We keep accepting a
            // single value here, but also tolerate comma-separated lists.
@@ -892,7 +1025,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            if (!parsed.empty()) {
                params.speculative.types = parsed;
            }
-#endif
        } else if (!strcmp(optname, "spec_n_max") || !strcmp(optname, "draft_max")) {
            if (optval != NULL) {
                try { params.speculative.draft.n_max = std::stoi(optval_str); } catch (...) {}
@@ -930,21 +1062,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            // shares the target context size. Accept the option for backward
            // compatibility but silently ignore it.

-// Everything below relies on struct shape introduced in ggml-org/llama.cpp#22838
-// (parallel drafting): `ngram_mod`, `ngram_map_k`, `ngram_map_k4v`,
-// `ngram_cache`, and the `draft.{cache_type_*, cpuparams*, tensor_buft_overrides}`
-// fields. The turboquant fork branched before that, so its build defines
-// LOCALAI_LEGACY_LLAMA_CPP_SPEC via patch-grpc-server.sh and these option
-// keys become unrecognized (silently dropped, like any unknown opt) for it.
-//
-// The `#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC` / `#else` split below sits at the
-// closing-brace position of the `draft_ctx_size` branch on purpose: in the
-// legacy build the chain ends here (the brace closes draft_ctx_size), and in
-// the modern build the chain continues with `} else if (...)` instead, so the
-// brace count stays balanced under both branches of the preprocessor.
-#ifdef LOCALAI_LEGACY_LLAMA_CPP_SPEC
-        }
-#else
        // --- ngram_mod family (upstream --spec-ngram-mod-*) ---
        } else if (!strcmp(optname, "spec_ngram_mod_n_min")) {
            if (optval != NULL) {
@@ -1074,7 +1191,6 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            }
            if (!cur.empty()) flush(cur);
        }
-#endif // LOCALAI_LEGACY_LLAMA_CPP_SPEC — closes the `else`/`#ifdef` opened at draft_ctx_size
    }

    // Set params.n_parallel from environment variable if not set via options (fallback)
@@ -1124,6 +1240,8 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt
            params.tensor_buft_overrides.push_back({nullptr, nullptr});
        }
    }
+    // Terminate the draft tensor_buft_overrides list with a sentinel, mirroring
+    // the main-model handling above.
    if (!params.speculative.draft.tensor_buft_overrides.empty()) {
        params.speculative.draft.tensor_buft_overrides.push_back({nullptr, nullptr});
    }
@@ -1446,6 +1564,7 @@ public:
        if (params_base.model.path.empty()) {
            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
        }
+        conflict_guard guard("PredictStream", slot_loop_inflight, score_inflight, "score_inflight");
        json data = parse_options(true, request, params_base, ctx_server.get_llama_context());


@@ -1484,7 +1603,7 @@ public:
                    msg_json["role"] = msg.role();

                    bool is_last_user_msg = (i == last_user_msg_idx);
-                    bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0);
+                    bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);

                    // Handle content - can be string, null, or array
                    // For multimodal content, we'll embed images/audio from separate fields
@@ -1535,6 +1654,16 @@ public:
                                    content_array.push_back(audio_chunk);
                                }
                            }
+                            if (request->videos_size() > 0) {
+                                for (int j = 0; j < request->videos_size(); j++) {
+                                    json video_chunk;
+                                    video_chunk["type"] = "input_video";
+                                    json input_video;
+                                    input_video["data"] = request->videos(j);
+                                    video_chunk["input_video"] = input_video;
+                                    content_array.push_back(video_chunk);
+                                }
+                            }
                            msg_json["content"] = content_array;
                        } else {
                            // Use content as-is (already array or not last user message)
@@ -1569,6 +1698,16 @@ public:
                                content_array.push_back(audio_chunk);
                            }
                        }
+                        if (request->videos_size() > 0) {
+                            for (int j = 0; j < request->videos_size(); j++) {
+                                json video_chunk;
+                                video_chunk["type"] = "input_video";
+                                json input_video;
+                                input_video["data"] = request->videos(j);
+                                video_chunk["input_video"] = input_video;
+                                content_array.push_back(video_chunk);
+                            }
+                        }
                        msg_json["content"] = content_array;
                    } else if (msg.role() == "tool") {
                        // Tool role messages must have content field set, even if empty
@@ -1874,14 +2013,27 @@ public:
                    body_json["min_p"] = data["min_p"];
                }

-                // Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
+                // Forward the chat_template_kwargs the Go layer resolved (model config
+                // chat_template_kwargs + per-request metadata: enable_thinking,
+                // reasoning_effort, preserve_thinking, ...). One generic merge replaces
+                // the previous per-key handling - new template levers need no C++ change.
+                // oaicompat_chat_params_parse reads these from body_json.
                const auto& metadata = request->metadata();
-                auto et_it = metadata.find("enable_thinking");
-                if (et_it != metadata.end()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
-                        body_json["chat_template_kwargs"] = json::object();
+                auto ctk_it = metadata.find("chat_template_kwargs");
+                if (ctk_it != metadata.end() && !ctk_it->second.empty()) {
+                    try {
+                        json ctk = json::parse(ctk_it->second);
+                        if (ctk.is_object()) {
+                            if (!body_json.contains("chat_template_kwargs")) {
+                                body_json["chat_template_kwargs"] = json::object();
+                            }
+                            for (auto& el : ctk.items()) {
+                                body_json["chat_template_kwargs"][el.key()] = el.value();
+                            }
+                        }
+                    } catch (const std::exception & e) {
+                        SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
                    }
-                    body_json["chat_template_kwargs"]["enable_thinking"] = (et_it->second == "true");
                }

                // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
@@ -2009,6 +2161,16 @@ public:
                        files.push_back(decoded_data);
                    }
                }
+
+                const auto &video_data = data.find("video_data");
+                if (video_data != data.end() && video_data->is_array())
+                {
+                    for (const auto &video : *video_data)
+                    {
+                        auto decoded_data = base64_decode(video["data"].get<std::string>());
+                        files.push_back(decoded_data);
+                    }
+                }
            }

            const bool has_mtmd = ctx_server.impl->mctx != nullptr;
@@ -2144,7 +2306,15 @@ public:
        // content element — attaching to both would duplicate the first
        // token since oaicompat_msg_diffs is the same for both.
        json first_res_json = first_result->to_json();
-        if (first_res_json.is_array()) {
+        // Upstream llama.cpp (ggml-org/llama.cpp#23884) now emits an initial
+        // "begin" partial whose to_json() returns null, used only to signal the
+        // HTTP layer to flush 200 status headers before any token. gRPC has no
+        // such concept, so there is nothing to emit — the real tokens arrive in
+        // the loop below. Feeding this null into build_reply_from_json would
+        // throw (uncaught) and surface as a generic RPC error.
+        if (first_res_json.is_null()) {
+            // skip the begin-of-stream marker
+        } else if (first_res_json.is_array()) {
            for (const auto & res : first_res_json) {
                auto reply = build_reply_from_json(res, first_result.get());
                // Skip chat deltas for role-init elements (have "role" in
@@ -2174,7 +2344,10 @@ public:
            }

            json res_json = result->to_json();
-            if (res_json.is_array()) {
+            if (res_json.is_null()) {
+                // begin-of-stream marker (see note above) — nothing to emit
+                continue;
+            } else if (res_json.is_array()) {
                for (const auto & res : res_json) {
                    auto reply = build_reply_from_json(res, result.get());
                    bool is_role_init = res.contains("choices") && !res["choices"].empty() &&
@@ -2205,6 +2378,7 @@ public:
         if (params_base.model.path.empty()) {
             return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
         }
+         conflict_guard guard("Predict", slot_loop_inflight, score_inflight, "score_inflight");
         json data = parse_options(true, request, params_base, ctx_server.get_llama_context());

        data["stream"] = false;
@@ -2249,7 +2423,7 @@ public:
                    }

                    bool is_last_user_msg = (i == last_user_msg_idx);
-                    bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0);
+                    bool has_images_or_audio = (request->images_size() > 0 || request->audios_size() > 0 || request->videos_size() > 0);

                    // Handle content - can be string, null, or array
                    // For multimodal content, we'll embed images/audio from separate fields
@@ -2302,6 +2476,16 @@ public:
                                    content_array.push_back(audio_chunk);
                                }
                            }
+                            if (request->videos_size() > 0) {
+                                for (int j = 0; j < request->videos_size(); j++) {
+                                    json video_chunk;
+                                    video_chunk["type"] = "input_video";
+                                    json input_video;
+                                    input_video["data"] = request->videos(j);
+                                    video_chunk["input_video"] = input_video;
+                                    content_array.push_back(video_chunk);
+                                }
+                            }
                            msg_json["content"] = content_array;
                        } else {
                            // Use content as-is (already array or not last user message)
@@ -2341,6 +2525,16 @@ public:
                                content_array.push_back(audio_chunk);
                            }
                        }
+                        if (request->videos_size() > 0) {
+                            for (int j = 0; j < request->videos_size(); j++) {
+                                json video_chunk;
+                                video_chunk["type"] = "input_video";
+                                json input_video;
+                                input_video["data"] = request->videos(j);
+                                video_chunk["input_video"] = input_video;
+                                content_array.push_back(video_chunk);
+                            }
+                        }
                        msg_json["content"] = content_array;
                        SRV_INF("[CONTENT DEBUG] Predict: Message %d created content array with media\n", i);
                    } else if (!msg.tool_calls().empty()) {
@@ -2655,14 +2849,26 @@ public:
                    body_json["min_p"] = data["min_p"];
                }

-                // Pass enable_thinking via chat_template_kwargs (where oaicompat_chat_params_parse reads it)
+                // Forward the chat_template_kwargs the Go layer resolved (model config
+                // chat_template_kwargs + per-request metadata: enable_thinking,
+                // reasoning_effort, preserve_thinking, ...). One generic merge replaces
+                // the previous per-key handling - new template levers need no C++ change.
                const auto& predict_metadata = request->metadata();
-                auto predict_et_it = predict_metadata.find("enable_thinking");
-                if (predict_et_it != predict_metadata.end()) {
-                    if (!body_json.contains("chat_template_kwargs")) {
-                        body_json["chat_template_kwargs"] = json::object();
+                auto predict_ctk_it = predict_metadata.find("chat_template_kwargs");
+                if (predict_ctk_it != predict_metadata.end() && !predict_ctk_it->second.empty()) {
+                    try {
+                        json ctk = json::parse(predict_ctk_it->second);
+                        if (ctk.is_object()) {
+                            if (!body_json.contains("chat_template_kwargs")) {
+                                body_json["chat_template_kwargs"] = json::object();
+                            }
+                            for (auto& el : ctk.items()) {
+                                body_json["chat_template_kwargs"][el.key()] = el.value();
+                            }
+                        }
+                    } catch (const std::exception & e) {
+                        SRV_WRN("failed to parse chat_template_kwargs metadata: %s\n", e.what());
                    }
-                    body_json["chat_template_kwargs"]["enable_thinking"] = (predict_et_it->second == "true");
                }

                // Debug: Print full body_json before template processing (includes messages, tools, tool_choice, etc.)
@@ -2792,6 +2998,16 @@ public:
                        files.push_back(decoded_data);
                    }
                }
+
+                const auto &video_data = data.find("video_data");
+                if (video_data != data.end() && video_data->is_array())
+                {
+                    for (const auto &video : *video_data)
+                    {
+                        auto decoded_data = base64_decode(video["data"].get<std::string>());
+                        files.push_back(decoded_data);
+                    }
+                }
            }

            // process files
@@ -2963,6 +3179,7 @@ public:
        if (params_base.model.path.empty()) {
            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
        }
+        conflict_guard guard("Embedding", slot_loop_inflight, score_inflight, "score_inflight");
        json body = parse_options(false, request, params_base, ctx_server.get_llama_context());

        body["stream"] = false;
@@ -3070,6 +3287,8 @@ public:
            return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "\"documents\" must be a non-empty string array");
        }

+        conflict_guard guard("Rerank", slot_loop_inflight, score_inflight, "score_inflight");
+
        // Create and queue the task
        auto rd = ctx_server.get_response_reader();
        {
@@ -3142,12 +3361,218 @@ public:
        return grpc::Status::OK;
    }

+    // Score returns the model's joint log-probability of each candidate
+    // continuation given a shared prompt.
+    //
+    // WHY bypass the slot/task queue: upstream server_context exposes
+    // get_llama_context as "main thread only" and the slot loop's
+    // update_slots() owns the context whenever a task is in flight.
+    // No public synchronization primitive is available — so Score is
+    // unsafe to call concurrently with active generation through this
+    // backend. In practice routing-classifier calls happen before the
+    // request is routed to a generation backend, so the model used
+    // for Score is typically idle. Concurrent Score calls are
+    // serialised by a local mutex; KV-cache state is isolated behind
+    // a dedicated sequence ID cleared between candidates.
+    //
+    // A patch to server-context.cpp that adds SERVER_TASK_TYPE_SCORE
+    // and routes scoring through the slot loop would be the correct
+    // long-term fix; tracked as a follow-up.
+    //
+    // Perf TODO (measured: ~450 ms warm for 3 candidates on Arch-
+    // Router-1.5B Q4_K_M + Intel SYCL): the current loop re-decodes
+    // `prompt + candidate` from scratch for every candidate, throwing
+    // away the prompt's KV cache between iterations. A smarter
+    // version would:
+    //   1. Decode just the prompt once into score_seq_id.
+    //   2. Snapshot/cp that sequence (llama_memory_seq_cp) into a
+    //      per-candidate sequence id.
+    //   3. For each candidate, decode only its tokens onto the copy
+    //      (continuing from the saved prompt state), read logits.
+    //   4. llama_memory_seq_rm the copy.
+    // Estimated speedup: 3-candidate calls 450 ms -> ~150-200 ms,
+    // 6-candidate calls 630 ms -> ~220 ms. Single source-file change,
+    // no proto / Go-side changes needed. Worth doing once routing is
+    // wired into the middleware and Score is on the hot path of every
+    // chat request.
+    grpc::Status Score(ServerContext* context, const backend::ScoreRequest* request, backend::ScoreResponse* response) override {
+        auto auth = checkAuth(context);
+        if (!auth.ok()) return auth;
+        if (params_base.model.path.empty()) {
+            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
+        }
+        if (request->candidates_size() == 0) {
+            return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "candidates must be non-empty");
+        }
+
+        // Tripwire against the slot loop. Acquired before score_mutex
+        // so it fires even when this Score is queued behind another.
+        conflict_guard guard("Score", score_inflight, slot_loop_inflight, "slot_loop_inflight");
+
+        // Serialise concurrent Score calls. The slot loop is still
+        // free to race with us; see the comment on this method above.
+        static std::mutex score_mutex;
+        std::lock_guard<std::mutex> score_lock(score_mutex);
+
+        llama_context * lctx = ctx_server.get_llama_context();
+        if (lctx == nullptr) {
+            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "llama context unavailable (sleeping?)");
+        }
+        const llama_vocab * vocab = ctx_server.impl->vocab;
+        const int32_t n_vocab = llama_vocab_n_tokens(vocab);
+        const int32_t n_ctx = llama_n_ctx(lctx);
+        llama_memory_t mem = llama_get_memory(lctx);
+
+        // The KV-cache is sized to seq_to_stream.size() at load
+        // (typically equal to n_slots, often 1). Sequence IDs must
+        // be in [0, n_seq_max), so we can't pick a high-value
+        // "private" ID — we have to share with the slot. We clear
+        // the cache before AND after each candidate to keep
+        // scoring isolated from whatever state the slot held, and
+        // the static mutex above guarantees no other Score call is
+        // racing in the meantime. The slot loop is still free to
+        // race (see comment on this method) — Score must not run
+        // concurrently with generation through this backend.
+        const llama_seq_id score_seq_id = 0;
+        llama_memory_seq_rm(mem, score_seq_id, -1, -1);
+
+        // Tokenize the shared prompt once with add_special=true so
+        // BOS is prepended when the model requires it. parse_special
+        // keeps chat-template markers in the prompt intact.
+        const std::string prompt = request->prompt();
+        std::vector<llama_token> prompt_tokens = common_tokenize(vocab, prompt, /*add_special=*/true, /*parse_special=*/true);
+        const int32_t prompt_len = (int32_t) prompt_tokens.size();
+
+        for (int ci = 0; ci < request->candidates_size(); ci++) {
+            const std::string & candidate_text = request->candidates(ci);
+
+            // Re-tokenize prompt + candidate as a single string. BPE
+            // merges across the boundary can shift the tokenization
+            // versus tokenize(prompt) ++ tokenize(candidate), so we
+            // find the divergence point against prompt_tokens.
+            std::vector<llama_token> full_tokens = common_tokenize(vocab, prompt + candidate_text, /*add_special=*/true, /*parse_special=*/true);
+            int32_t divergence = prompt_len;
+            const int32_t min_len = std::min<int32_t>(prompt_len, (int32_t) full_tokens.size());
+            for (int32_t i = 0; i < min_len; i++) {
+                if (prompt_tokens[i] != full_tokens[i]) {
+                    divergence = i;
+                    break;
+                }
+            }
+            const int32_t cand_len = (int32_t) full_tokens.size() - divergence;
+            backend::CandidateScore * cs = response->add_candidates();
+            cs->set_num_tokens(cand_len);
+            if (cand_len <= 0) {
+                cs->set_log_prob(0.0);
+                if (request->length_normalize()) {
+                    cs->set_length_normalized_log_prob(0.0);
+                }
+                continue;
+            }
+            if (divergence < 1) {
+                // Need at least one prior token (typically BOS) to
+                // predict the first candidate token. A tokenizer that
+                // adds no BOS, combined with an empty prompt, lands here.
+                return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT,
+                    "Score: prompt produced no leading tokens; need at least one (e.g. BOS) to predict candidate");
+            }
+            if ((int32_t) full_tokens.size() > n_ctx) {
+                return grpc::Status(grpc::StatusCode::OUT_OF_RANGE,
+                    "Score: prompt+candidate exceeds context size (got " +
+                    std::to_string(full_tokens.size()) + ", n_ctx=" + std::to_string(n_ctx) + ")");
+            }
+
+            // Build a batch covering the entire prompt+candidate. We
+            // need logits at (divergence-1) onward — those are the
+            // predictions for each candidate token.
+            llama_batch batch = llama_batch_init((int32_t) full_tokens.size(), 0, 1);
+            for (int32_t i = 0; i < (int32_t) full_tokens.size(); i++) {
+                batch.token[i]    = full_tokens[i];
+                batch.pos[i]      = i;
+                batch.n_seq_id[i] = 1;
+                batch.seq_id[i][0] = score_seq_id;
+                // logits[i] is "do we want the prediction *for the
+                // next token*, computed from this position?"
+                // We want predictions for candidate tokens at
+                // positions divergence .. full_tokens.size()-1, which
+                // come from logits at positions (divergence-1) ..
+                // (full_tokens.size()-2).
+                bool need_logit = (i >= divergence - 1) && (i < (int32_t) full_tokens.size() - 1);
+                batch.logits[i] = need_logit ? 1 : 0;
+            }
+            batch.n_tokens = (int32_t) full_tokens.size();
+
+            // Decode the batch. If decode fails (e.g. KV slot
+            // exhaustion), surface as INTERNAL — the caller will
+            // typically fall back to a sampling-based classifier.
+            int decode_err = llama_decode(lctx, batch);
+            if (decode_err != 0) {
+                llama_batch_free(batch);
+                llama_memory_seq_rm(mem, score_seq_id, -1, -1);
+                return grpc::Status(grpc::StatusCode::INTERNAL,
+                    "llama_decode failed during Score: " + std::to_string(decode_err));
+            }
+
+            // Sum log-probabilities of the actual candidate tokens.
+            double total_log_prob = 0.0;
+            for (int32_t k = 0; k < cand_len; k++) {
+                // The k-th candidate token sits at full_tokens index
+                // (divergence + k). Its predicting logit is at batch
+                // position (divergence + k - 1).
+                int32_t logit_pos = divergence + k - 1;
+                const float * logits = llama_get_logits_ith(lctx, logit_pos);
+                if (logits == nullptr) {
+                    llama_batch_free(batch);
+                    llama_memory_seq_rm(mem, score_seq_id, -1, -1);
+                    return grpc::Status(grpc::StatusCode::INTERNAL,
+                        "llama_get_logits_ith returned null at position " + std::to_string(logit_pos));
+                }
+                llama_token target_token = full_tokens[divergence + k];
+
+                // Compute log_softmax(logits)[target_token] with the
+                // max-subtraction stability trick.
+                float max_logit = logits[0];
+                for (int32_t v = 1; v < n_vocab; v++) {
+                    if (logits[v] > max_logit) max_logit = logits[v];
+                }
+                double sum_exp = 0.0;
+                for (int32_t v = 0; v < n_vocab; v++) {
+                    sum_exp += std::exp((double)(logits[v] - max_logit));
+                }
+                double token_log_prob = (double)(logits[target_token] - max_logit) - std::log(sum_exp);
+                total_log_prob += token_log_prob;
+
+                if (request->include_token_logprobs()) {
+                    backend::TokenLogProb * tlp = cs->add_tokens();
+                    std::string piece = common_token_to_piece(lctx, target_token);
+                    tlp->set_token(piece);
+                    tlp->set_log_prob(token_log_prob);
+                }
+            }
+
+            cs->set_log_prob(total_log_prob);
+            if (request->length_normalize() && cand_len > 0) {
+                cs->set_length_normalized_log_prob(total_log_prob / (double) cand_len);
+            }
+
+            llama_batch_free(batch);
+            // Drop this candidate's KV-cache contribution so the next
+            // candidate starts from a clean state. Without this, the
+            // next decode would conflict at positions 0..N-1 for our
+            // sequence ID.
+            llama_memory_seq_rm(mem, score_seq_id, -1, -1);
+        }
+
+        return grpc::Status::OK;
+    }
+
    grpc::Status TokenizeString(ServerContext* context, const backend::PredictOptions* request, backend::TokenizationResponse* response) override {
        auto auth = checkAuth(context);
        if (!auth.ok()) return auth;
        if (params_base.model.path.empty()) {
            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Model not loaded");
        }
+        conflict_guard guard("TokenizeString", slot_loop_inflight, score_inflight, "score_inflight");
        json body = parse_options(false, request, params_base, ctx_server.get_llama_context());
        body["stream"] = false;

@@ -3155,7 +3580,7 @@ public:
        if (body.count("prompt") != 0) {
            const bool add_special = json_value(body, "add_special", false);

-            llama_tokens tokens = tokenize_mixed(ctx_server.impl->vocab, body.at("content"), add_special, true);
+            llama_tokens tokens = tokenize_mixed(ctx_server.impl->vocab, body.at("prompt"), add_special, true);


            for (const auto& token : tokens) {
@@ -3169,6 +3594,8 @@ public:

    grpc::Status GetMetrics(ServerContext* /*context*/, const backend::MetricsRequest* /*request*/, backend::MetricsResponse* response) override {

+        conflict_guard guard("GetMetrics", slot_loop_inflight, score_inflight, "score_inflight");
+
 // request slots data using task queue
        auto rd = ctx_server.get_response_reader();
        int task_id = rd.queue_tasks.get_new_id();
--- a/backend/cpp/llama-cpp/paged/.gitignore
+++ b/backend/cpp/llama-cpp/paged/.gitignore
@@ -0,0 +1,7 @@
+tests/test_free_block_queue
+tests/test_block_pool
+tests/test_paged_kv_manager
+tests/test_prefix_cache
+tests/test_ggml_paged_rw
+tests/test_ggml_paged_attn
+paged-bench
--- a/backend/cpp/llama-cpp/paged/BLACKWELL_KERNEL_GAPS.md
+++ b/backend/cpp/llama-cpp/paged/BLACKWELL_KERNEL_GAPS.md
@@ -0,0 +1,105 @@
+# Blackwell (GB10 / sm_121) kernel gaps — measured + the corrected strategy
+
+Supersedes the "greenfield tcgen05 FP4 grouped GEMM" framing in `FP4_GROUPED_MOE_KERNEL.md`. Research +
+profiling reframed the problem: the kernels we need **already exist in ggml**; they're just **untuned for
+Blackwell**. And the parity target is far lower than the headline vLLM number implied.
+
+## 1. The parity target was wrong — it's ~3,300 t/s single-stream, not 24,444
+
+vLLM's dense "24,444 t/s" is **aggregate concurrent-batch** throughput, not single-sequence. The GB10
+compute roofline caps **single-stream** Qwen3-32B prefill at **~3,300 t/s (BF16/INT8 ceiling)** / **~6,600
+(FP4 ceiling)**. So: don't chase 24,444 with one kernel. Aggregate parity = (a kernel at the ceiling) +
+(batched-prefill scheduling). The *kernel* job is to reach ~3,300 (matches vLLM, which on GB10 also runs at
+the BF16 ceiling) or ~6,600 (beats it, via FP4).
+
+## 2. GB10 per-precision DENSE peaks (measured, not spec)
+
+| precision | dense peak | vs BF16 |
+|---|---|---|
+| BF16 / FP16 | ~213 TFLOP/s | 1.0× |
+| INT8 | ~215 TOPS | **1.0×** |
+| FP4 (MXFP4/NVFP4) | ~427–500 TFLOP/s | **2.0×** |
+
+Memory: ~273 GB/s LPDDR5X (the bottleneck for *decode*; prefill is compute-bound). **Critical:** GB10 is
+**1:1:2** (BF16:INT8:FP4), NOT datacenter Blackwell's 1:2:4 — **INT8 gives ZERO speedup over BF16 here.** So
+int8-MMQ has no precision advantage; only FP4 does. (NVIDIA spec sheets still claim 1:2:4 — contradicted by
+direct GB10 measurement; on-the-record discrepancy.)
+
+## 3. Measured gaps (nsys, GB10)
+
+| path | kernel | % of prefill | achieved | % of ceiling |
+|---|---|---|---|---|
+| **Dense** Q4_K_M | `mul_mat_q<Q4_K/Q6_K>` (int8 MMQ) | 80% | ~46 TFLOP/s | **~21% of 215** |
+| **MoE** MXFP4 | `mul_mat_q<MXFP4>` (FP4 MMA) | 37% | ~22 TFLOP/s | **~4–5% of 500** (or ~10% of BF16) |
+
+Both kernels are **engaged correctly but untuned for Blackwell** — llama.cpp's MMQ was "tuned primarily for
+RTX 3000/4000" (Ampere/Ada). The headroom (4–5×) is recoverable; it's not an architectural ceiling.
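+
+As a sanity check on the table, the "% of ceiling" column is achieved throughput divided by the section 2
+peak, and the headroom is its reciprocal. This is worked arithmetic from the numbers above, not a new
+measurement; the FP4 range uses both ends of the ~427–500 peak:
+
+```math
+\begin{aligned}
+\text{dense int8 MMQ vs INT8 peak:}\quad & \frac{46}{215} \approx 0.21, \qquad \frac{215}{46} \approx 4.7\times \\
+\text{MoE FP4 MMA vs FP4 peak:}\quad & \frac{22}{500} \approx 0.044, \qquad \frac{22}{427} \approx 0.052 \\
+\text{MoE FP4 MMA vs BF16 peak:}\quad & \frac{22}{213} \approx 0.10
+\end{aligned}
+```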
+
+## 4. ggml's current quantized-matmul paths (what exists)
+
+- **MMQ** (int8): quantizes activations to Q8_1, int8 `mma.sync`/`dp4a`. Prefill path. **Untuned for sm_12x.**
+- **FP4 MMA** (#17906, merged): native MXFP4/NVFP4 `m16n8k64` block-scaled FP4 mma for cc≥12.0. Works on GB10
+  for MoE (we measured 3441 t/s MXFP4 prefill) but is underutilized (~5% of FP4 peak). On **sm_121** it is
+  held back by three issues: the `120f` build flag, an nvcc `-O3` miscompile (#18331), and capability gating.
+- **dequant→cuBLAS-FP16**: unfused fallback (materializes FP16 weights, round-trips memory). Not a fused
+  Marlin-style kernel. (Our `GGML_CUDA_FORCE_CUBLAS` env test was a no-op, so this path did not even engage
+  for Q4_K; see the correction at the end of this document.)
+- **NO fused Marlin-style W4A16 kernel** (dequant 4-bit→BF16 in-shared-mem → BF16 tensor cores). Real gap.
+
+## 5. Strategy — match vs beat (this replaces the tcgen05-greenfield plan)
+
+**To MATCH vLLM (~3,300 single-stream): FP4 is NOT required.** Because INT8 == BF16 on GB10, a tuned MMQ and
+a BF16 Marlin kernel share the *same* ceiling — and vLLM hits parity via W4A16 Marlin (BF16), since its FP4
+is also broken on sm_121.
+
+Ranked, by effort:
+1. **Probe: tune the existing int8 MMQ for Blackwell** (dense). Cheapest. We're at 21% of the ceiling —
+   recover via tile sizes, async copy (`cp.async`), double-buffered shared-mem pipeline, occupancy. Caveat:
+   the `nwarps*tile_C::I==mmq_y` static_assert (found earlier) couples the constants; and the Q8_1
+   activation-quant overhead caps pure-MMQ tuning. Bounded upside, but a fast experiment.
+2. **Build a Marlin-style W4A16 BF16 GEMM** (dense) — the robust path to ~3,300 (4.3× over today's 765).
+   Dequant 4-bit→BF16 in shared memory, MMA on BF16 tensor cores, `cp.async` multi-buffer, offline weight
+   reshuffle. Mirrors vLLM's actual GB10 path; keeps activations BF16 (better quality than int8 MMQ); fills a
+   genuine ggml gap. **This is the recommended kernel to MATCH.**
+
+**To BEAT vLLM (~6,600, 2×): fix — don't rewrite — the FP4 path on sm_121.**
+3. **Get the existing FP4 MMA (#17906/#20644) fully working + tuned on sm_121.** It already works on sm_120
+   (RTX 5090: +43–68% prefill) and on GB10 for MoE. The blockers are the `120f` arch flag, the `-O3`
+   miscompile (#18331), capability gating — **build/compiler fixes, not a new kernel.** Then tune the FP4 MMQ
+   (it's at ~5% of FP4 peak). This is where upstream momentum already is, and the only route past vLLM.
+
+**Dropped:** the from-scratch tcgen05/CUTLASS grouped GEMM (the old scaffold). It aimed past the matchable
+ceiling, duplicates work the FP4-MMA path already does, and FP4 on sm_121 is a *fix* problem not a *write*
+problem. The `fp4-grouped-moe.cu` scaffold/hook stays as a useful dispatch seam, but the kernel behind it
+should be one of (1)/(2)/(3), not a greenfield CUTLASS collective.
+
+## 6. Cheap experiment — RESULT: MXFP4 dense = free 1.44×, but not parity (kernel still untuned)
+
+Requantized Qwen3-32B dense → MXFP4 (forced attn+ffn to mxfp4 via `--tensor-type`, `--allow-requantize`,
+speed-only test) and benched prefill:
+
+| quant | kernel | pp512 (t/s) | pp2048 (t/s) | vs Q4_K (pp512) |
+|---|---|---|---|---|
+| Q4_K_M | int8-MMQ | 765 | 763 | 1.0× |
+| **MXFP4** | **FP4-MMA** | **1099** | **1153** | **1.44×** |
+
+**Findings:**
+- **MXFP4 dense is a real, free 1.44× over Q4_K** — just a requantize, the existing FP4-MMA path engages for
+  dense weights on GB10. Worth shipping as a **Blackwell dense-quant recommendation** in the gallery (no kernel).
+- **But it is NOT parity.** 1153 t/s = **~17% of the FP4 ceiling (~6,600)** / ~35% of the BF16 ceiling. So the
+  **FP4-MMA kernel is itself untuned** (consistent with the MoE measurement, ~5% of FP4 peak). MXFP4 moves dense
+  from the int8 path (765) onto the FP4 path (1153), but the FP4 kernel leaves ~4–6× on the table.
+- **So the kernel work is confirmed and now precise: tune the FP4-MMA kernel** (it's the highest-value, since it
+  serves both dense-MXFP4 and MoE, and FP4 is the only path that can *beat* vLLM). Strategy item (3) — fix +
+  tune the existing FP4-MMA on sm_121 — is the priority; a Marlin-style W4A16 BF16 kernel (2) is the alternative
+  to *match* on the BF16 ceiling if FP4 tuning stalls.
+
+Conclusion: the cheap test did NOT collapse the kernel problem (the kernels are untuned, not just the quant), but
+it (a) gives a free 1.44× to ship now, and (b) sharpens the target to **tuning the FP4-MMA kernel**.
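+
+The ratios quoted in this section follow directly from the table and the section 1 ceilings. This is worked
+arithmetic only; note that the 1.44× headline is the pp512 ratio, and pp2048 comes out slightly higher:
+
+```math
+\begin{aligned}
+\text{pp512 speedup:}\quad & \frac{1099}{765} \approx 1.44\times \\
+\text{pp2048 speedup:}\quad & \frac{1153}{763} \approx 1.51\times \\
+\text{share of FP4 ceiling:}\quad & \frac{1153}{6600} \approx 0.17 \\
+\text{share of BF16 ceiling:}\quad & \frac{1153}{3300} \approx 0.35 \\
+\text{headroom to FP4 ceiling:}\quad & \frac{6600}{1153} \approx 5.7\times
+\end{aligned}
+```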
+
+## Sources
+GB10 peaks (measured): forums.developer.nvidia.com/t/351993, /360142, /373618. Marlin: github.com/IST-DASLab/marlin,
+arxiv 2408.11743, developers.redhat.com Marlin/Machete. MMQ untuned: llama.cpp docs/build.md, discussions/16578,
+DandinPower/llama.cpp_bench. FP4 landing/sm121: llama.cpp PR #17906/#20644, issues #19662/#18331. Roofline:
+vllm.ai/blog/2026-06-01-vllm-dgx-spark, lmsys.org DGX Spark.
+
+> **Correction (measured):** the earlier `GGML_CUDA_FORCE_CUBLAS` env test was a no-op because it's a *compile-time* `#ifdef`, not a runtime flag — cuBLAS never engaged. A real rebuild with `-DGGML_CUDA_FORCE_CUBLAS=ON` shows cuBLAS is **slower** than MMQ for dense Q4 (pp2048 690 vs 750) and runs an **Ampere `cutlass_80_tensorop` FP16 kernel** — cuBLAS-13.0 has no sm_121-tuned GEMM and falls back to sm_80. So *both* MMQ and cuBLAS sit at ~46 TFLOP/s (~21% of the 213 BF16 peak); there is **no library shortcut** to the ceiling on GB10 — a hand-tuned sm_120a kernel (Marlin-style) is required.
--- a/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/CHUNKED_PREFILL_PLAN.md
@@ -0,0 +1,334 @@
+# Chunked prefill + n_batch/n_ubatch decouple — implementation plan
+
+Scope: LocalAI's llama.cpp backend (`backend/cpp/llama-cpp/`). Companion to
+`PHASED_VLLM_PARITY_PLAN.md` Phase 3. This document is the concrete, file-cited
+plan for what the brief called "chunked prefill".
+
+Line numbers below are from two trees:
+- LocalAI: `backend/cpp/llama-cpp/grpc-server.cpp`, `core/backend/options.go`,
+  `backend/backend.proto`, `core/backend/hardware_defaults.go` — exact.
+- Vendored upstream scheduler: `llama.cpp/tools/server/server-context.cpp`. The
+  build copies `llama.cpp/tools/server/*` into `tools/grpc-server/` (`prepare.sh`
+  lines 15-17) and only overrides `grpc-server.cpp` + `CMakeLists.txt`. So
+  `update_slots()` is **inherited upstream code, not LocalAI code**. Line numbers
+  cited for it are from a same-era checkout (`d12cc3d`, 2026-04-09); the pin is
+  `f3e1828` (Makefile line 2). The structure is identical; exact lines may drift
+  a few rows at the pin — match on the quoted comment strings, not the integers.
+
+---
+
+## TL;DR — the headline finding
+
+**Chunked prefill with prefill/decode interleaving is ALREADY implemented** in the
+llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
+this version. `update_slots()` in `server-context.cpp`:
+
+1. **Adds ongoing decode tokens first** — "first, add sampled tokens from any
+   ongoing sequences" (≈ line 2088). Every `SLOT_STATE_GENERATING` slot gets its
+   one sampled token into the shared `llama_batch` before any prefill is added.
+2. **Then fills the remaining `n_batch` budget with prompt (prefill) tokens** —
+   "next, batch any pending prompts without exceeding n_batch" (≈ line 2166),
+   gated by `params_base.cont_batching` (LocalAI sets `cont_batching = true` by
+   default, `grpc-server.cpp:547`). The per-slot prefill fill loop
+   (≈ line 2552) is `while (slot.prompt.n_tokens() < slot.task->n_tokens() &&
+   batch.n_tokens < n_batch)` — i.e. it caps each slot's prefill contribution to
+   the **remaining** budget and defers the rest to the next iteration.
+3. **Decodes the combined batch in one pass** (≈ line 2728-2741): decode tokens
+   and prefill-chunk tokens go through the **same `llama_decode`**, which then
+   splits internally into `n_ubatch` physical sub-batches.
+
+This is exactly the behavior the abandoned-looking draft **upstream PR #10718**
+("server : chunked prefill support") asked for — "the first task is no longer
+blocked by the second long prompt processing task." That PR is still marked OPEN
+but its goal was absorbed into the natural evolution of `update_slots()`; we do
+**not** need to port it. A long prefill no longer stalls the decode batch: decode
+slots are serviced first every iteration, prefill consumes only the leftover
+budget.
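+
+A worked example of the budget arithmetic, under illustrative assumptions (not
+measured): `n_batch = 2048`, 4 slots in `SLOT_STATE_GENERATING` for the whole
+window, and a single pending 8192-token prompt. Each iteration seats the 4
+decode tokens first, so prefill gets the remaining budget:
+
+```math
+\text{prefill budget per iteration} = n_{\text{batch}} - n_{\text{decode}} = 2048 - 4 = 2044,
+\qquad
+\text{iterations to finish the prompt} = \left\lceil \frac{8192}{2044} \right\rceil = 5
+```
+
+Under these assumptions the decoders emit a token in each of those 5 iterations
+instead of waiting for the whole prompt to be processed.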
+
+**Therefore: do not re-implement chunked prefill.** The real LocalAI gap is
+narrow and is the rest of this plan:
+
+- **Phase A (the actual gap): the `n_batch`/`n_ubatch` decouple.** LocalAI ties
+  the scheduler token budget (`n_batch`) to the physical forward width
+  (`n_ubatch`) at `grpc-server.cpp:515` + `:519`. This forces
+  `n_batch == n_ubatch`, so the logical scheduling window can never be wider than
+  one physical ubatch. You cannot keep `n_ubatch` at the Blackwell GEMM sweet
+  spot (2048) while widening `n_batch` so concurrent prefills + decodes co-batch
+  into a larger logical window. There is no first-class `batch:`/`ubatch:` split
+  on the Go side, and there is only a one-directional `ubatch` override on the C++
+  side (you can shrink ubatch below the coupled value, never grow n_batch above
+  it).
+- **Phase B (optional policy lever): a decode-headroom prefill cap.** Upstream
+  caps prefill at the full `n_batch` shared with decode. Under heavy mixed load
+  one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter
+  to the decoders sharing that forward. vLLM exposes
+  `long_prefill_token_threshold` / `max_num_partial_prefills` for this. A
+  LocalAI-specific per-iteration prefill cap (a patch to vendored `update_slots`)
+  bounds that jitter. This is genuinely not in upstream and is the only place a
+  scheduler-policy change is warranted.
+
+---
+
+## 1. Current behavior — precise citations
+
+### 1.1 The scheduler is upstream, inherited verbatim
+- `prepare.sh:15-17` copies all of `llama.cpp/tools/server/*` into the
+  `grpc-server` build dir; `grpc-server.cpp` (LocalAI) replaces only the HTTP/gRPC
+  service + `params_parse` + `parse_options`. `update_slots()`, the slot state
+  machine, and the batch builder are **upstream `server-context.cpp`**, untouched
+  by LocalAI today.
+- Slot states: `server-context.cpp:36-42` —
+  `SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT /
+  GENERATING`.
+
+### 1.2 Decode-first, then prefill-fill, one shared batch
+- `common_batch_clear(batch)` (≈ 2078) — one batch per `update_slots` iteration.
+- Decode phase (≈ 2088-2156): for each `SLOT_STATE_GENERATING` slot,
+  `common_batch_add(batch, slot.sampled, …, /*logits=*/true)` adds exactly one
+  token. Decode is guaranteed a seat before prefill runs.
+- Budget fetch (≈ 2158-2160): `n_batch = llama_n_batch(ctx)`,
+  `n_ubatch = llama_n_ubatch(ctx)`.
+- Prefill phase (≈ 2166): `if (params_base.cont_batching || batch.n_tokens == 0)`
+  → with cont_batching ON, prefill is added to the **same** batch as decode.
+- Per-slot prefill fill (≈ 2552-2597):
+  `while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch)`
+  — adds prompt tokens until the slot is done **or** the shared budget is hit.
+  Whatever does not fit stays for the next iteration (the slot remains
+  `SLOT_STATE_PROCESSING_PROMPT`).
+- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed
+  it flips to `SLOT_STATE_DONE_PROMPT`, sets `batch.logits[last] = true`, inits
+  the sampler. Next iteration it becomes `GENERATING`.
+- Budget break (≈ 2693-2695): `if (batch.n_tokens >= n_batch) break;`.
+- Decode (≈ 2728-2741): loops `batch_view` slices of `min(n_batch, remaining)` and
+  calls `llama_decode`; the physical `n_ubatch` split happens inside
+  `llama_decode`.
+
+### 1.3 The chunking is gated by `can_split()`
+- `server-context.cpp:225-231`: `can_split()` returns true unless the task needs
+  embeddings with non-LAST pooling. So **completion/generation tasks always
+  chunk-and-interleave**; only embeddings/rerank force the whole prompt into one
+  ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch
+  size" — this is exactly why LocalAI bumped `n_ubatch` for rerank, see below).
+
+### 1.4 LocalAI ties n_batch to n_ubatch (the gap)
+- `grpc-server.cpp:515` — `params.n_batch  = request->nbatch();`
+- `grpc-server.cpp:519` — `params.n_ubatch = request->nbatch();` with the comment
+  that this fixes reranking being capped at the 512 default `n_ubatch`.
+- `grpc-server.cpp:781-784` — the **only** decouple knob today: an `n_ubatch` /
+  `ubatch` option that overrides `n_ubatch` alone (added for embeddings/rerank).
+  There is **no** `batch` / `n_batch` option parse, so `n_batch` cannot be raised
+  above the coupled value from a model config. Confirmed: `grep '"n_batch"|"batch"'`
+  in `grpc-server.cpp` returns nothing.
+- Options arrive via `request->options(i)` parsed as `optname:optval`
+  (`grpc-server.cpp:584-585`); these come from `ModelOptions.Options` ⟵
+  `c.Options` (`core/backend/options.go:221`).
+
+### 1.5 Go side sends a single batch number
+- `backend/backend.proto:341` — `int32 NBatch = 4;` is the only batch field; there
+  is **no** `NUBatch`.
+- `core/backend/options.go:108-129` `EffectiveBatchSize`: returns `c.Batch` if set,
+  else context size for single-pass (score/embed/rerank), else
+  `hardwareDefaultBatchSize(512)`.
+- `core/backend/options.go:228` — `NBatch: int32(b)` (single value to the
+  backend; becomes both `n_batch` and `n_ubatch` via 1.4).
+- `core/backend/hardware_defaults.go:28,37-40` — `BlackwellBatchSize = 2048`;
+  on Blackwell an unset batch defaults to 2048, so today
+  `n_batch == n_ubatch == 2048` there.
+
+---
+
+## 2. Why the decouple matters for serving (not just rerank)
+
+Invariant: `n_ubatch <= n_batch`. `n_ubatch` is the physical forward-pass GEMM
+width (compute efficiency; GB10 sweet spot ≈ 2048). `n_batch` is the per-iteration
+**scheduler token budget** — the logical window shared by decode + prefill chunks,
+analogous to vLLM's `max_num_batched_tokens`.
+
+With `n_batch == n_ubatch` (today), the scheduling window cannot exceed one
+physical ubatch. Consequences:
+- Under concurrency, the combined (decode + multiple prefill chunks) logical batch
+  is capped at the physical ubatch, so aggregate prefill cannot grow past one
+  ubatch worth of tokens per iteration even when more slots have prompts queued.
+- A user who shrinks `batch:` for memory also shrinks the physical ubatch,
+  degrading prefill GEMM efficiency — and vice versa.
+
+Decoupling lets us hold `n_ubatch = 2048` (efficient GEMM) while setting a larger
+`n_batch` (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
+logical window, lifting aggregate prefill under mixed load — `llama_decode` still
+tiles the physical work at 2048.
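+
+A minimal sketch of that tiling, assuming a simplified model in which one logical
+batch is cut front-to-back into tiles of at most `n_ubatch` tokens.
+`split_into_ubatches` is a hypothetical helper for illustration only; it is not the
+split code inside `llama_decode`:
+
+```cpp
+#include <algorithm>
+#include <cassert>
+#include <vector>
+
+// Hypothetical helper: sizes of the physical tiles for one logical batch.
+// Simplified model; the real split happens inside llama_decode.
+std::vector<int> split_into_ubatches(int n_tokens, int n_ubatch) {
+    assert(n_ubatch > 0);
+    std::vector<int> tiles;
+    for (int done = 0; done < n_tokens; done += n_ubatch) {
+        // last tile may be partial
+        tiles.push_back(std::min(n_ubatch, n_tokens - done));
+    }
+    return tiles;
+}
+```
+
+Under this model a full 4096-token window with `n_ubatch = 2048` yields two tiles of
+2048, and a 2500-token window yields 2048 + 452.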
+
+---
+
+## 3. Phased implementation
+
+### Phase 0 — Verification harness (do first; TDD red)
+Bite-sized, no code change to the scheduler.
+- **0.1 Token-identical greedy under mixed load.** Script: start the backend with
+  `n_parallel >= 4`, greedy sampling (temp 0, fixed seed). Fire (a) several short
+  decode streams and (b) one ~8k-token prompt concurrently (the exact repro from
+  PR #10718's body works). Capture each stream's full token id sequence. Re-run
+  with the prefill request absent. **Assert the short streams' token ids are
+  byte-identical** in both runs — proves interleaving does not perturb decode
+  numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo
+  spec under the backend e2e suite.
+- **0.2 Mixed-workload throughput baseline.** Use `llama-batched-bench` (built from
+  the same tree) or a small driver hitting `/v1/chat/completions`: measure
+  aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams,
+  under the mixed workload. Record numbers for the current `n_batch==n_ubatch`
+  config. This is the "before" baseline for Phases A and B.
+
+Expected result of Phase 0: 0.1 already passes (interleave is correct today);
+0.2 gives the baseline the decouple must beat.
+
+### Phase A — Decouple n_batch from n_ubatch
+Goal: let model config set the physical ubatch independently of the logical batch,
+defaulting to today's behavior (no regression).
+
+- **A.1 C++: accept a `batch`/`n_batch` option (and keep `ubatch`).**
+  In `grpc-server.cpp`, after the existing `ubatch` branch (`:781-784`), add a
+  sibling branch:
+  ```cpp
+  } else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
+      if (optval != NULL) {
+          try { params.n_batch = std::stoi(optval_str); } catch (...) {}
+      }
+  ```
+  This is the missing direction (raise `n_batch` above the coupled value). Order
+  matters: both `:515/:519` run first (coupling as default), then option parsing
+  overrides either independently. Add a clamp note: if a user sets
+  `n_ubatch > n_batch`, llama.cpp will clamp/upbatch; log a warning. Keep the
+  `:519` aliasing for backward compat (rerank still works with no options).
+
+- **A.2 Proto: add an explicit physical ubatch field.**
+  `backend/backend.proto:341` add `int32 NUBatch = <next free tag>;` (do not reuse
+  4). Regenerate with `make protogen-go` + the C++ proto build.
+
+- **A.3 C++: honor `NUBatch` when present.**
+  In `grpc-server.cpp` `params_parse`, after `:519`, add:
+  ```cpp
+  if (request->nubatch() > 0) {
+      params.n_ubatch = request->nubatch();
+  }
+  ```
+  so an explicit physical ubatch wins over the `n_batch` alias, with the `ubatch`
+  string-option as a third path for users who only edit `options:`.
+
+- **A.4 Go: config surface + plumbing.**
+  - Add `UBatch *int` (yaml `ubatch`) to the llama config struct alongside `Batch`
+    (search `core/config` for the `Batch` field; mirror it).
+  - In `core/backend/options.go`: add `EffectiveUBatchSize(c)` mirroring
+    `EffectiveBatchSize` (return `c.UBatch` if set, else
+    `min(EffectiveBatchSize(c), BlackwellBatchSize-or-512)` so the physical ubatch
+    stays at the hardware sweet spot while `n_batch` may be larger). Set
+    `NUBatch: int32(EffectiveUBatchSize(c))` next to `NBatch:` (`:228`).
+  - Keep the default such that when neither is set, `NUBatch == NBatch` ⇒
+    byte-identical to today.
+
+- **A.5 Serving default (the lever).**
+  In `hardware_defaults.go`, introduce `BlackwellLogicalBatch = 4096` (or a
+  measured value) and let `EffectiveBatchSize` return it for **multi-slot serving**
+  configs (when `n_parallel > 1` and the model is a completion model), while
+  `EffectiveUBatchSize` stays at `BlackwellBatchSize = 2048`. Gate behind the same
+  Blackwell detection already used at `:37-40`. Single-stream/embedding/rerank
+  paths keep `n_batch == n_ubatch`. This is the only behavioral change shipped by
+  Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
+
+- **A.6 Tests.** Extend `hardware_defaults_internal_test.go` with
+  `EffectiveUBatchSize` cases; add a `grpcModelOpts` test asserting
+  `NUBatch <= NBatch` and that unset config yields `NUBatch == NBatch`. Re-run
+  0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or
+  neutral ITL) at `n_batch=4096, n_ubatch=2048`.
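+
+A minimal sketch of the resolution order Phase A describes: coupled default first,
+then explicit overrides, then the `n_ubatch <= n_batch` invariant. `BatchSizes` and
+`resolve_batch_sizes` are hypothetical names, and clamping `n_ubatch` down to
+`n_batch` is an assumption of this sketch, not a statement of what llama.cpp does:
+
+```cpp
+#include <cassert>
+
+struct BatchSizes {
+    int  n_batch;
+    int  n_ubatch;
+    bool clamped;  // true when the caller should log a warning
+};
+
+// nbatch:     the single coupled value sent as NBatch today
+// batch_opt:  explicit logical-batch override, 0 = unset
+// ubatch_opt: explicit physical-ubatch override, 0 = unset
+BatchSizes resolve_batch_sizes(int nbatch, int batch_opt, int ubatch_opt) {
+    BatchSizes r{nbatch, nbatch, false};          // coupling as the default
+    if (batch_opt  > 0) r.n_batch  = batch_opt;   // widen/narrow the logical window
+    if (ubatch_opt > 0) r.n_ubatch = ubatch_opt;  // pin the physical tile
+    if (r.n_ubatch > r.n_batch) {                 // enforce n_ubatch <= n_batch
+        r.n_ubatch = r.n_batch;
+        r.clamped  = true;
+    }
+    return r;
+}
+```
+
+With no overrides the result is `2048 / 2048` (today's behavior); with
+`batch_opt = 4096` it becomes `n_batch = 4096, n_ubatch = 2048`, the serving recipe.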
+
+### Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
+Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
+one change that touches the inherited scheduler, so it lives as a patch in
+`backend/cpp/llama-cpp/patches/` (applied by `prepare.sh:6-11` / Makefile
+`:141-145`), never as an edit to a checked-in upstream file.
+
+Policy (pseudocode; insert into `update_slots()` prefill fill loop, the
+`while (… && batch.n_tokens < n_batch)` at ≈ `server-context.cpp:2552`):
+
+```
+# token budget for THIS iteration, decode already seated:
+n_decode_in_batch = batch.n_tokens            # set after the decode phase
+prefill_budget    = n_batch                    # default == today
+
+if serving_mode and n_decode_in_batch > 0 and max_prefill_per_iter > 0:
+    # leave room so decoders are not starved/jittered by one giant prefill chunk
+    # max_prefill_per_iter == 0 disables the cap; n_ubatch (one physical tile) is the natural setting
+    prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
+
+# fill loop guard becomes:
+while slot.prompt.n_tokens() < slot.task->n_tokens()
+      and batch.n_tokens < prefill_budget:
+      ...
+```
+
+- `max_prefill_per_iter` is a new `common_params` field surfaced as an
+  `options:` knob (`max_prefill_tokens` / `mpt`) parsed in `grpc-server.cpp`
+  exactly like A.1, default `0` = disabled = today's behavior.
+- Semantics mirror vLLM `long_prefill_token_threshold`: cap the prefill share so
+  ongoing decodes keep a steady cadence; the remaining prompt rides the next
+  iteration (already supported by the state machine — slot stays
+  `PROCESSING_PROMPT`).
+- **Correctness:** unchanged KV/position path — chunk boundaries already advance
+  `slot.prompt.tokens.pos_next()` per added token (≈ 2570) and the slot resumes
+  from `slot.prompt.n_tokens()` next iteration. Capping the budget only changes
+  *how many* tokens are added this iteration, not *which* positions, so 0.1 must
+  remain token-identical.
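+
+The budget rule from the pseudocode above, as a small compilable sketch. It assumes
+that `0` disables the cap, as the `max_prefill_per_iter` bullet states;
+`prefill_budget` is a hypothetical free function, not an existing llama.cpp symbol:
+
+```cpp
+#include <algorithm>
+#include <cassert>
+
+// Token budget for the prefill fill loop in one iteration.
+// max_prefill_per_iter == 0 means "cap disabled" (today's behavior).
+int prefill_budget(int n_batch, int n_decode_in_batch,
+                   int max_prefill_per_iter, bool serving_mode) {
+    if (!serving_mode || n_decode_in_batch <= 0 || max_prefill_per_iter <= 0) {
+        return n_batch;  // unchanged: the whole logical window is available
+    }
+    // decode tokens already seated, plus at most one capped prefill share
+    return std::min(n_batch, n_decode_in_batch + max_prefill_per_iter);
+}
+```
+
+For example, with `n_batch = 4096`, 32 decode tokens seated and a cap of 2048, the
+fill loop stops at 2080 tokens; with the cap at `0` it still runs to 4096.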
+
+### Phase C — Docs + defaults rollout
+- Document `batch` / `ubatch` (and `max_prefill_tokens` if B ships) in
+  `docs/content/` model-config reference, with the serving recipe
+  (`n_parallel>1`, `n_batch=4096`, `ubatch=2048`).
+- Note the orthogonality to paged KV (below) in
+  `PHASED_VLLM_PARITY_PLAN.md` Phase 3.
+
+---
+
+## 4. Risk / correctness
+
+- **KV-cache & positions across chunks:** already handled upstream. Each prefill
+  token added advances `pos_next()` (≈ 2570) and is pushed to `slot.prompt.tokens`
+  (≈ 2573); the next iteration resumes from `slot.prompt.n_tokens()`. Chunk
+  boundaries are transparent to the KV cache because positions are absolute, not
+  per-chunk. Phase A changes only budgets, not positions; Phase B changes only the
+  per-iteration count. The 0.1 token-identical test is the guardrail.
+- **Unified KV cache (LocalAI default, `n_parallel` slots share one cache):**
+  unaffected — co-batching prefill+decode across slots is what the unified cache is
+  for; positions are per-`seq_id` (`{ slot.id }` in `common_batch_add`).
+- **`n_ubatch > n_batch`:** invalid; A.4 clamps `EffectiveUBatchSize <=
+  EffectiveBatchSize` and A.1 logs a warning if options violate it.
+- **Embeddings / rerank:** must keep `n_ubatch >= prompt length` (single pass,
+  `can_split()==false`). The existing `:519` alias + `EffectiveBatchSize`
+  context-sizing for single-pass usecases (`options.go:119-124`) must be preserved
+  — do not let the serving `BlackwellLogicalBatch` default leak into single-pass
+  configs (A.5 gates on completion + `n_parallel>1`).
+- **Turboquant fork:** the fork lacks some `common_params` fields (see
+  `LOCALAI_LEGACY_LLAMA_CPP_SPEC` precedent at `grpc-server.cpp:755`). `n_batch` /
+  `n_ubatch` are ancient fields and safe; if Phase B adds `max_prefill_per_iter`,
+  guard the new field behind a `#ifndef` like the checkpoint block does.
+
+## 5. Orthogonality to paged KV (Phase 2)
+
+Keep them independent. Paged KV (the `-kvp` / block-manager effort, draft #22569,
+and `paged/`) changes **where** KV blocks live (allocation/utilization). Chunked
+prefill / this decouple changes **how many tokens per iteration** the scheduler
+batches (the `n_batch` budget and decode/prefill interleave). They compose: paged
+KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
+scheduling window to feed those slots; neither touches the other's data structures.
+The only contact point is `update_slots()` — if both ship a vendored patch to it,
+land them as separate, ordered patches in `patches/` and keep the hunks disjoint
+(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
+budget).
+
+---
+
+## 6. Bottom line
+
+- Chunked prefill + decode interleave: **already present and correct** on the
+  pinned llama.cpp — verify (Phase 0.1), do not rebuild.
+- Real work: the **n_batch/n_ubatch decouple** (Phase A) — small, additive,
+  default-preserving — plus an **optional decode-headroom prefill cap** (Phase B)
+  if measurements show ITL jitter. Both are LocalAI-side: A in `grpc-server.cpp`
+  + proto + `options.go`; B as a vendored `patches/` hunk.
--- a/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
+++ b/backend/cpp/llama-cpp/paged/DECODE_OVERHEAD.md
@@ -0,0 +1,215 @@
+# llama.cpp multi-user decode overhead on DGX Spark (GB10, sm_121)
+
+Investigation of the Qwen3-32B concurrent-decode throughput gap (llama.cpp ~547 t/s
+vs vLLM ~667 t/s) on the GB10 box, build `~/llama.cpp-pr24423/build` (Release,
+sm_121, `LLAMA_MAX_SEQ=256`, flash-attn on), model
+`~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
+
+## TL;DR (the result overturns the brief's premise)
+
+On **this** build the prime suspect is wrong and the host-overhead premise does not
+hold:
+
+1. **CUDA graphs are NOT disabled at high concurrency.** At npl=128, 94 of 98
+   decode `graph_compute` calls **replay a captured CUDA graph** (0 resets, stable
+   key, no property churn post-warmup). The keyed-warmup gate works.
+2. **There is no ~170ms/step host hotspot here.** The GPU is **~96% active during
+   decode with graphs ON and ~96% active with graphs OFF**. Decode at npl=128 is
+   **GPU-compute-bound**, not host-bound.
+3. The brief's "20% GPU util / 66ms GPU / 170ms host per step" was measured on a
+   different/earlier build (mainline without these graph fixes). It is not
+   reproducible on `llama.cpp-pr24423`.
+4. Because the GPU is the bottleneck, re-enabling graphs cannot lift the number:
+   the clean A/B shows graphs ON vs OFF = **+1.5% at npl=128** (and +2.9% at
+   npl=32 - the benefit shrinks as concurrency rises and the GPU saturates).
+5. The real gap to vLLM is the **quantized decode GEMM kernel**: `mul_mat_q`
+   (Q4_K + Q6_K) is ~68% of decode GPU time and runs ~2.1x above the GB10
+   memory-bandwidth floor. Closing the gap requires Marlin/Machete-style int4
+   GEMM kernels, not host-side work. This is a kernel project (the direction the
+   prior session's uncommitted `marlin-w4a16.cu` / `fp4-grouped-moe.cu` already
+   started, though those target w4a16/GPTQ-int4, not the K-quants this GGUF uses).
+
+## 1. Why CUDA graphs are (not) disabled - exact code + measurement
+
+### The gate (code)
+
+PR24423 refactored the CUDA-graph path into a keyed, warmup-based scheme in
+`~/llama.cpp-pr24423/ggml/src/ggml-cuda/ggml-cuda.cu`:
+
+- `ggml_cuda_graph_get_key(cgraph)` (~L3343) keys the cached CUDA graph by
+  `cgraph->nodes[0]` (first-node pointer).
+- `ggml_cuda_graph_check_compability(cgraph)` (~L3301) disables graphs only for:
+  - **split buffers** (`ggml_backend_buft_is_cuda_split`), and
+  - **`GGML_OP_MUL_MAT_ID`** when `src0` is non-quantized **or**
+    `ne[2] > get_mmvq_mmid_max(...)` (MoE expert routing needs a stream sync).
+  Qwen3-32B is **dense** -> no `MUL_MAT_ID` -> this condition never fires.
+- `ggml_backend_cuda_graph_compute` (~L4514) warmup gate: a graph is used only
+  after **2 consecutive calls with no property change** (`warmup_complete`); any
+  property change resets warmup. `ggml_cuda_graph_update_required` (~L3347)
+  detects change by `memcmp` of the full `ggml_tensor` struct + per-src
+  data-ptr/ne/nb, with a fast path when `cgraph->uid` is unchanged.
+
+### Why it stays enabled across decode steps
+
+The graph stays stable because llama.cpp's host-side graph reuse holds during
+decode, so node pointers/props (and `cgraph->uid`) do not churn:
+
+- `llama_kv_cache::get_n_kv` (`src/llama-kv-cache.cpp` L1223-1233) **pads n_kv to
+  a multiple of 256** ("so that the graph remains constant across batches and can
+  be reused"). For ntg<=256 within the first KV block, n_kv is constant.
+- `can_reuse_kq_mask` (`src/llama-graph.cpp` L43) keeps the KQ-mask dims stable:
+  `ne=[n_kv, n_tokens/n_stream, 1, n_stream]` = `[256,1,1,128]` every decode step
+  at npl=128.
+- `can_reuse` (`src/llama-context.cpp` L1283) therefore returns true, so the
+  scheduler is **not** reset/re-split. `graph->uid` is only reassigned inside
+  `ggml_backend_sched_split_graph` (`ggml/src/ggml-backend.cpp` L1033, L1485),
+  which is skipped on the reuse path -> stable uid -> CUDA graph replays.
+
+### Measurement (instrumented build, npl=128, ntg=96)
+
+Env-gated counters added to `ggml_backend_cuda_graph_compute` /
+`ggml_cuda_graph_update_required` (since `GGML_LOG_DEBUG` is compiled out in
+Release / NDEBUG). End-of-run summary:
+
+```
+[GTRACE-SUMMARY] calls=98 notenab=0 warming=3 warmdone=1 RESET=0 USED=94 incompat=0 distinct_keys=1
+```
+
+94/98 decode `graph_compute` calls **replayed** a captured CUDA graph; **0**
+warmup resets; a **single** distinct graph key for the whole decode; no node
+property churn after warmup. Graphs are fully engaged at npl=128.
+
+(The instrumentation was reverted afterwards; the checkout is back to its
+pre-task state and the `.so` rebuilt clean.)
+
+## 2. The per-step CPU "hotspot" - there isn't one on this build
+
+GPU utilization during npl=128 decode (ntg=256):
+
+- **Graphs ON** - `nvidia-smi` sampled every 0.7s through the decode phase:
+  steady **96% GPU util**, SM clock **2184 MHz** (not throttled), 45-47 W.
+- **Graphs OFF** (`GGML_CUDA_DISABLE_GRAPHS=1`) - nsys CUDA trace, 8s window:
+  the top kernel (`mul_mat_q` Q4_K) accounts for 3,983,292,128 ns at a 51.6%
+  share, so total GPU kernel time = `3,983,292,128 ns / 0.516` = **~7.72s of
+  the 8s window = ~96% GPU-active**. Even with every kernel launched
+  individually from the host, the GPU is still ~96% busy. There are essentially
+  **no host gaps**.
+
+Per-step wall = 60.6s / 256 steps = **~237 ms/step**, and the sum of one decode
+graph's kernel times (nsys, graphs-on capture) is ~244 ms -> GPU kernel time per
+step ~= wall time per step. The host work between steps is in the low single-digit
+ms (the ~4% idle), consistent with graphs ON giving only +1.5% at npl=128.
+
+This directly contradicts the brief's 66ms-GPU / 170ms-host split, which must have
+come from a pre-graphs build.
+
+### Per-step GPU breakdown (nsys, npl=128 decode, graphs off, 8s window)
+
+| Kernel | % GPU time | ~ms/step |
+|--------|-----------:|---------:|
+| `mul_mat_q` Q4_K (type 12) | 51.6 | ~118 |
+| `flash_attn_ext_f16` | 19.3 | ~44 |
+| `mul_mat_q` Q6_K (type 14) | 16.2 | ~37 |
+| `unary_gated` silu | 4.1 | ~9 |
+| mmq stream-k fixup + quantize_q8_1 | ~5 | ~12 |
+| rms_norm / rope / set_rows / add | ~4 | ~10 |
+
+Quantized matmul = **~68%** of decode GPU time (~155 ms/step). Attention ~19%.
+
+`perf` could not profile the host (kernel `perf_event_paranoid=4`), but it is moot:
+the host is ~4% of the wall, so there is no ~170ms host hotspot to chase.
+
+## 3. Fix attempt + measured result
+
+### The requested fix (re-enable graphs / pad the decode batch) is a no-op here
+
+Graphs are already enabled and the batch is already stable (n_kv padded to 256,
+kq_mask dims constant). The clean cold A/B (cooldowns between every run):
+
+| npl | graphs ON (t/s) | graphs OFF (t/s) | delta |
+|----:|----------------:|-----------------:|------:|
+| 32  | 242.60 | 235.75 | +2.9% |
+| 64  | 398.59 | 389.06 | +2.5% |
+| 128 | 543.95 | 535.71 | +1.5% |
+
+Baseline (separate cold runs, original non-instrumented build):
+npl=32 243.9, npl=64 397.1, **npl=128 544.95** (matches the ~546 baseline).
+
+Graphs help, but the benefit **monotonically shrinks** as concurrency rises and
+the GPU saturates. At npl=128 there is only ~1.5% of host launch overhead left to
+remove, and GPU util is ~96% in both columns. **You cannot lift npl=128 decode
+toward 667 by working on graphs/host overhead - the GPU is the bottleneck.**
+
+### Where the number actually is, and the real lever
+
+- vLLM 667 t/s at this concurrency = **192 ms/step**; llama.cpp 547 = **237
+  ms/step**. The ~45 ms/step gap maps almost entirely onto the quantized matmul.
+- GB10 memory-bandwidth floor for a 32B Q4_K_M (~19.8 GB of weights, read once
+  per step and shared across the 128 sequences) at ~273 GB/s is **~72 ms/step**.
+  llama.cpp's `mul_mat_q` spends ~155 ms/step on matmul = **~2.1x the bandwidth
+  floor**. vLLM's Marlin/Machete int4 GEMMs run much closer to the floor; that
+  efficiency difference is the ~547 -> 667 gap.
+- The Q6_K matmul (`mul_mat_q` type 14) also shows pathological tail latency
+  (median 0.89 ms, max 5.5 ms) - the MMQ kernel is not well-tuned for the skinny
+  n=128 decode shape.
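+
+The arithmetic behind these figures, as a small sketch. The function names are
+hypothetical, and the floor assumes every weight byte is streamed exactly once per
+step with no other memory traffic:
+
+```cpp
+#include <cassert>
+#include <cmath>
+
+// Milliseconds per decode step when `npl` sequences each emit one token per
+// step and the aggregate rate is `tokens_per_s`.
+double step_ms(int npl, double tokens_per_s) {
+    return 1000.0 * npl / tokens_per_s;
+}
+
+// Lower bound on a step if all weights are read once per step.
+double bandwidth_floor_ms(double weights_gb, double bandwidth_gb_per_s) {
+    return 1000.0 * weights_gb / bandwidth_gb_per_s;
+}
+```
+
+`step_ms(128, 667)` is about 192 ms and `bandwidth_floor_ms(19.8, 273)` about
+72.5 ms, so ~155 ms of matmul is about 2.1x the floor. `step_ms(128, 547)` is about
+234 ms, slightly under the ~237 ms wall figure, which comes from the separate
+60.6 s / 256-step measurement.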
+
+**The lever to beat 547 is a faster quantized decode GEMM**, i.e. a Marlin-style
+int4 kernel for the decode shapes. This is exactly the direction of the prior
+session's uncommitted `ggml/src/ggml-cuda/marlin-w4a16.cu` and
+`fp4-grouped-moe.cu` (already wired via
+`if (!split && ggml_cuda_w4a16_mul_mat(...)) return;` in `ggml_cuda_mul_mat`).
+Note those target **w4a16 / GPTQ-int4**, while this GGUF is **K-quant (Q4_K/Q6_K)**,
+so they are inert for this model - a Marlin path for K-quants (or shipping the
+model in a Marlin-friendly int4 format) would be required. That is a multi-day
+kernel effort, out of scope for this session, but it is the only lever that can
+move the number.
+
+### Why the "bump LLAMA_MAX_SEQ to 1024 -> 377" data point is consistent
+
+`llama_batch_allocr` keeps `seq_cpl` as an `LLAMA_MAX_SEQ x LLAMA_MAX_SEQ` table
+(`src/llama-batch.cpp`), so per-batch seq bookkeeping scales ~O(MAX_SEQ^2). At
+MAX_SEQ=1024 that host cost becomes large enough (~70 ms/step) to dominate and
+drop decode to 377. At MAX_SEQ=256 the same term is ~4.4 ms/step (the ~1.5% that
+graphs reclaim); lowering to 128 would save ~3 ms/step (~1%). So MAX_SEQ tuning
+confirms the host term is real but tiny at 256 - not a path to 667.
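+
+A sketch of that scaling, assuming the host cost is purely quadratic in
+`LLAMA_MAX_SEQ` and anchored at the ~70 ms/step figure quoted for 1024.
+`seq_cpl_cost_ms` is a hypothetical helper, not a measured model:
+
+```cpp
+#include <cassert>
+#include <cmath>
+
+// Assumption: cost scales as MAX_SEQ^2, anchored at ~70 ms/step for 1024.
+double seq_cpl_cost_ms(int max_seq) {
+    const double ref_ms  = 70.0;
+    const double ref_seq = 1024.0;
+    const double ratio   = max_seq / ref_seq;
+    return ref_ms * ratio * ratio;
+}
+```
+
+This gives about 4.4 ms/step at 256 and about 1.1 ms/step at 128, i.e. roughly
+3 ms/step saved by halving from 256.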
+
+## How this would land in LocalAI
+
+- **No host/graph patch is warranted** for this build: graphs already engage and
+  the decode is GPU-bound. A "pad the decode batch / force graph capture" patch
+  would change nothing measurable at high concurrency.
+- The actionable upstream/vendored work is a **Marlin-style int4 decode GEMM**
+  (extend the prior `marlin-w4a16.cu` to cover K-quants, or quantize the served
+  model into a Marlin-friendly int4 layout). That is where the ~547 -> 667+ lives.
+- If a small host win is still wanted, keep `LLAMA_MAX_SEQ` no larger than the max
+  concurrency actually used (the per-batch `seq_cpl` table is O(MAX_SEQ^2)).
+
+## Reproduction
+
+```
+# baseline / A/B (cold, 30s cooldowns)
+llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -npp 16 -ntg 128 -npl 32,64,128 \
+  -ngl 99 -b 2048 -ub 2048 -fa on            # graphs on
+GGML_CUDA_DISABLE_GRAPHS=1 ...same...        # graphs off
+
+# GPU util (graphs on): sample nvidia-smi during decode -> ~96%, 2184 MHz
+# GPU active (graphs off): nsys profile -t cuda --delay=6 --duration=8 ...
+#   nsys stats --report cuda_gpu_kern_sum  -> sum/0.516 ~= 7.72s of 8s = ~96%
+```
+
+## UPDATE: NVFP4 closes most of the decode gap (no Marlin-for-K-quants needed)
+
+The diagnosis above identified a faster, more bandwidth-efficient quantized decode
+GEMM as the lever and expected it to need a multi-day Marlin-for-K-quants kernel.
+But the FP4-MMA path is already that kernel. Measured (npl=128, cold A/B, npp=16 ntg=128):
+
+| quant | decode S_TG (t/s) | vs Q4_K | vs vLLM 667 |
+|---|---|---|---|
+| Q4_K_M | 547 (548/546) | - | 82% |
+| **NVFP4** | **619 (617/622)** | **+13%** | **93%** |
+
+NVFP4's `mul_mat_q<NVFP4>` runs closer to the GB10 bandwidth floor at the thin n=128
+decode shape than Q4_K's int8-MMQ (which ran ~2.1x above it). So shipping the model
+as NVFP4 closes the decode gap from ~22% to ~7% AND wins prefill (1209 vs Q4 767 /
+vLLM 800). Net on GB10: llama.cpp+NVFP4 is ahead on prefill (1.5x) and within ~7% on
+decode. The remaining ~7% would be incremental FP4-MMA decode-kernel tuning, NOT a
+from-scratch Marlin kernel - a much smaller, optional effort. NVFP4 is the answer to
+both the prefill and the decode gap.
--- a/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/DGX_BLACKWELL_PLAN.md
@@ -0,0 +1,253 @@
+# Closing the vLLM Gap on Blackwell (GB10 / DGX Spark) — Living Plan & Results
+
+Target hardware: NVIDIA **GB10** (Grace-Blackwell, `sm_121a`, 119 GiB unified LPDDR5X), `dgx.casa`.
+Model under test: **Qwen3-Coder-30B-A3B-Instruct** (MoE, 128 experts, top-8, ~3B active).
+Engines: llama.cpp (CUDA, `~/llama.cpp-pr24423`, build `7a6ddc5`, `CMAKE_CUDA_ARCHITECTURES=121`) vs vLLM 0.23.0 (`~/vllm-bench`, torch 2.11.0+cu130).
+
+> This is a working document. Each phase appends measured numbers, what was learned, and what's next.
+> Methodology: `llama-bench` (single-stream pp/tg, built-in reps) and `llama-batched-bench` (`-npl` sweep,
+> decode-phase aggregate `S_TG`, prefill aggregate `S_PP`); vLLM via `~/bench/vllm_conc.py` (decode-phase
+> aggregate matched to `S_TG`). Same model/prompt/seed. Precision matched where possible.
+
+---
+
+## Baseline results (established)
+
+### Single-stream (B=1), matched ~8-bit
+| Engine / precision | prefill pp512 (t/s) | decode tg128 (t/s) |
+|---|---|---|
+| llama.cpp **Q8_0** | 2215 ± 15 | **54.8 / 62.2** * |
+| llama.cpp **F16** | 700 ± 24 | 32.9 ± 0.05 |
+| vLLM **FP8** | 9155 ± 308 | 52.45 ± 0.05 |
+
+\* two sessions; ~55 right after worker-stop (clocks settling), ~62 steady state. Both ≥ vLLM → **single-stream parity holds**.
+
+### Concurrency sweep (decode-phase aggregate `S_TG`, prefill aggregate)
+| B | llama Q8 prefill | vLLM FP8 prefill | llama Q8 decode | vLLM FP8 decode |
+|---|---|---|---|---|
+| 1 | 1080 | 9644 | 60.1 | 48.0 |
+| 8 | 2189 | 33373 | 160.8 | 312.4 |
+| 32 | 2198 | 99398 | 357.1 | 1171 |
+| 64 | 2194 | 151990 | 519.2 | 2064 |
+
+llama F16 prefill also flat: B=1 452 → B=8 723 → B=32 778. **Prefill flat at both precisions = kernel-throughput ceiling.**
+
+### Our paged patch (LLAMA_KV_PAGED) — concurrency effect: NONE
+Same Q8 binary, paged branch confirmed firing (137 placements at B=8), throughput identical within noise:
+| | B=1 | B=8 | B=32 |
+|---|---|---|---|
+| stock decode | 61.2 | 171.7 | 377.0 |
+| paged decode | 62.7 | 170.8 | 376.8 |
+
+The patch is a placement-only correctness prototype; it does not implement concurrency mechanics. It is single-stream-neutral and concurrency-neutral.
+
+---
+
+## Root-cause diagnosis (nsys + code audit)
+
+- **74.5% of GPU compute = `mul_mat_q`** (Q8_0 int8 MMQ GEMM, the MoE experts). Only cutlass kernel seen is `cutlass_80_tensorop` = **Ampere (sm_80)**, not Blackwell.
+- ggml-cuda has **NO FP8 path** (no e4m3/e5m2 GEMM, no cuBLASLt FP8). Q8_0 runs the **Ampere-class int8 `mma.sync s8.s8.s32`** even on GB10 (`mma.cuh:924`, dispatched unconditionally `mmq.cu:307`).
+- ggml-cuda **DOES** have a **native Blackwell FP4 path** (MXFP4 + NVFP4, `mma...kind::mxf4...e2m1`, `mma.cuh:1126`, gated `BLACKWELL_MMA_AVAILABLE`). Merged via #17906/#20644/#21074.
+- **No fused MoE grouped GEMM**, no tcgen05/wgmma (warp-level `mma.sync` only).
+- **Small per-expert GEMMs**: 512-tok ubatch → ~32 tok/expert (128 exp, top-8) → thin GEMMs, memory-bound, can't fill tensor-core tiles. vLLM processes 8192 tok/step → ~512 tok/expert → compute-bound + FP8.
+- **The 45–69× gap is partly apples-to-oranges**: we compared llama Q8 (Ampere int8) vs vLLM FP8 (Blackwell). Upstream/NVIDIA benches put the *real* FP4-vs-FP8 prefill gap at **~25–50% long-context**, not 45–69×.
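+
+The tokens-per-expert figures in the list above follow from a one-line estimate. A
+minimal sketch, assuming the router spreads tokens uniformly across experts (real
+routing is skewed); `avg_tokens_per_expert` is a hypothetical helper:
+
+```cpp
+#include <cassert>
+
+// Average tokens routed to each expert for one batch of n_tokens,
+// assuming uniform routing: each token activates top_k of n_experts.
+int avg_tokens_per_expert(int n_tokens, int top_k, int n_experts) {
+    return n_tokens * top_k / n_experts;
+}
+```
+
+With 128 experts and top-8, a 512-token ubatch gives 32 tokens per expert, while an
+8192-token step gives 512.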
+
+Key upstream refs: discussion #22042 (FP8 design: `ggml_mul_mat_ext` + scale tensors), #17906 (native MXFP4), #18250 (NVFP4-MoE closed not-planned).
+
+---
+
+## The levers (cheap → expensive) — execution log
+
+### Lever 1 — NVFP4/MXFP4 model (use existing Blackwell FP4 path) + ubatch bump
+Status: **IN PROGRESS** — single-stream done, concurrency next.
+Quant: `llama-quantize F16 -> MXFP4_MOE` (type 38), 15.9 GiB / 4.47 BPW. (No NVFP4 in llama-quantize; MXFP4_MOE puts experts in MXFP4 = Blackwell FP4 MMA.)
+
+Single-stream (llama-bench), MXFP4 vs Q8 vs vLLM-FP8:
+| metric | llama Q8 | **llama MXFP4** | vLLM FP8 |
+|---|---|---|---|
+| prefill pp512 (ub512) | 2215 | **3061 ± 22** | 9155 |
+| prefill pp2048 (ub512) | ~2200 | 3137 ± 7 | — |
+| prefill pp2048 (**ub2048**) | — | **3441 ± 14** | — |
+| decode tg128 | 62.2 | **86.4 ± 0.3** | 52.45 |
+
+Findings:
+- **MXFP4 decode 86.4 beats vLLM FP8 52.45 by 1.65×** (4-bit = less memory traffic; decode is memory-bound). llama wins decode outright.
+- MXFP4 prefill +38% over Q8; **ub2048 lifts prefill +10%** (3137→3441). Single-stream prefill gap to vLLM: 4.1× (Q8) → **2.7× (MXFP4)**.
+- Caveat: MXFP4 is 4-bit vs vLLM FP8 8-bit — not precision-matched. Fair match = vLLM NVFP4 (4-bit); pending.
+
+Concurrency (decode-phase aggregate `S_TG`, ub2048), MXFP4 vs Q8 vs vLLM-FP8:
+| B | Q8 dec | **MXFP4 dec** | vLLM dec | Q8 pp | **MXFP4 pp** | vLLM pp |
+|---|---|---|---|---|---|---|
+| 1 | 60.1 | **83.4** | 48.0 | 1080 | 1625 | 9644 |
+| 8 | 160.8 | **267.4** | 312.4 | 2189 | 3634 | 33373 |
+| 32 | 357.1 | **551.2** | 1171 | 2198 | 3651 | 99398 |
+| 64 | 519.2 | **770.2** | 2064 | 2194 | 3648 | 151990 |
+
+**Lever-1 verdict:** MXFP4 is a large, free win: decode +39–66% over Q8 (B=1 +39%, B=8 +66%, B=32 +54%, B=64 +48%) and prefill plateau +66% (2200→3650). Against vLLM, MXFP4 decode **wins at B=1 and is within ~14% at B=8** (267.4 vs 312.4); it falls behind only at high concurrency. **Prefill still plateaus (~3650)**: the MoE prefill GEMM doesn't scale with batch (no fused grouped GEMM; ubatch-limited). That plateau is the real remaining structural gap → Levers 2–3. Quality caveat unchanged (MXFP4 4-bit vs vLLM FP8 8-bit; quality not yet evaluated).
+
+### Lever 2 — `n_ubatch` / `n_batch` tuning (standalone)
+Status: **DONE + SHIPPED (auto-default implemented)**
+MXFP4 pp4096 vs ubatch: ub512=2994, **ub2048=3316**, ub4096=2820(noisy), ub8192=3180.
+**Verdict:** prefill saturates at ub=2048; larger ubatch gives nothing. The ~3300–3650 ceiling is the **MoE GEMM kernel**, not batch size. → No more free config wins; the rest is kernel work (Levers 3–5).
+**Implemented:** `core/backend/hardware_defaults.go` — `EffectiveBatchSize` now defaults the physical batch
+(n_batch→n_ubatch alias) to **2048 on Blackwell** (`xsysinfo.IsNVIDIABlackwell`, cc≥12 / sm_120/121) when the
+config leaves `batch:` unset; explicit `batch:` always wins. Detection is a shared Go helper; placed at the
+common ModelOptions builder so it covers the C++ llama.cpp backend too. Tests: `hardware_defaults_internal_test.go`.
+
+### Lever 1b — Standard Q4 vs MXFP4 (what's actually MXFP4-specific)
+**Q4_K_M** (17.3 GiB) vs **MXFP4** (15.9 GiB), ub2048:
+| metric | Q4_K_M | MXFP4 | Q8 |
+|---|---|---|---|
+| decode tg128 | **93.5** | 86.4 | 62.2 |
+| prefill pp512 | 2164 | **3061** | 2215 |
+| prefill pp2048 | 2953 | **3441** | ~2200 |
+
+**Verdict:** the **decode win is just "4-bit"**: plain Q4_K_M matches/beats MXFP4 on decode (both memory-bound).
+MXFP4's *only* real edge is **prefill (+41% over Q4_K_M at pp512, +17% at pp2048)** via Blackwell FP4 tensor cores. So for shipping,
+**"4-bit quant + ubatch=2048" captures most of the win portably**; MXFP4 is a Blackwell-only prefill extra.
+
+### Lever 3 — Fused FP4/FP8 MoE grouped GEMM (+ activation-quant fusion)
+Status: **DESIGNED + PROFILED, not built** (multi-week kernel R&D). The single biggest remaining prefill win.
+
+**Decisive measurements:**
+- Prefill does NOT scale with bigger single prompts (attention O(N²) confounds): MXFP4 pp2048=3295, pp8192=1524,
+  pp16384=2051. So the plateau is not a batch-size fix.
+- Real gap is batched many-sequence prefill: B=32 llama 3651 vs vLLM 99398 = **27×**. llama.cpp MoE prefill runs
+  at only **~22 effective TFLOP/s** on the GB10 — far below the GPU. Large headroom.
+- **nsys (MXFP4 pp2048):** `mul_mat_q<type39>` (MoE FP4 GEMM) = **37.2%**, `quantize_mmq_mxfp4` (act-quant) = 8.0%,
+  `mul_mat_q<type8>` (dense/attn, still Q8) = 10.1%, flash_attn = 8.8%. The native FP4 MMA *is* engaged — the
+  inefficiency is the **per-expert thin-tile MMQ scheduler** + **un-fused activation quant**.
+
+**Target (precise):** the ~45% in `mmq.cu`'s grouped MoE path (`ggml_cuda_mul_mat_q` + `ids`, `mmid.cu`). Replace
+the per-expert thin-tile scheduler with a CUTLASS-style grouped GEMM (full tiles regardless of tokens/expert) and
+fuse `quantize_mmq_mxfp4` into the permute/gather. Dense Q8 matmuls (10%) are the separate Lever-4 (FP8) target.
+Problem (measured): the prefill ceiling is the MoE expert GEMM. Today `ggml_cuda_mul_mat_q` with `ids`
+(`mmq.cu:127`) launches one grouped MMQ over a 3D grid (z = expert), but each expert's tile is thin
+(~tokens/expert columns) so int8/FP4 tensor cores run underfilled; throughput is memory-bound on weight
+streaming and flat vs batch.
+Approach:
+- Replace the per-expert thin-tile scheduler with a **CUTLASS-style grouped GEMM** that concatenates all
+  experts' token-blocks into one problem with per-group offsets, so tiles are always full (m16n8k64 FP4 /
+  m16n8k32 FP8) regardless of per-expert token count. Mirrors vLLM's `fused_moe` + cutlass grouped GEMM.
+- **Fuse activation quantization into the permute/gather** (the `quantize_mmq_q8_1`/FP4 quantize currently a
+  separate 3.3% kernel) so the routed activations are quantized as they're scattered into expert order.
+- Files: new kernel under `ggml/src/ggml-cuda/` (e.g. `moe-grouped-gemm.cu`) + dispatch hook in
+  `ggml_cuda_mul_mat_id` (`ggml-cuda.cu:2622`); reuse `mmid.cu` routing/`expert_bounds`.
+- Effort: high (2–4 wks expert CUDA). Risk: numerics + sm_121 tile tuning. Expected payoff: the bulk of the
+  prefill gap (vLLM's MoE prefill advantage is mostly this). Upstream: #18250 (NVFP4-MoE) was closed
+  not-planned, so this would be a LocalAI patch or a fresh upstream proposal.
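+
+A minimal sketch of the "one problem with per-group offsets" layout from the approach above, written in
+plain Python rather than CUDA. The names `group_tokens_by_expert`, `perm` and `bounds` are illustrative
+assumptions, not the `mmid.cu` API, and top-1 routing is assumed for brevity. It only shows how a counting
+sort over `ids` yields contiguous per-expert row ranges that a grouped GEMM could consume as full tiles.
+
+```python
+def group_tokens_by_expert(ids, n_experts):
+    """Counting sort of token indices by routed expert (hypothetical helper).
+
+    ids[t] is the expert chosen for token t.
+    Returns (perm, bounds): perm lists token indices in expert order, and
+    expert e owns the rows perm[bounds[e]:bounds[e + 1]].
+    """
+    counts = [0] * n_experts
+    for e in ids:
+        counts[e] += 1
+
+    # Exclusive prefix sum: bounds[e] is the first row owned by expert e.
+    bounds = [0] * (n_experts + 1)
+    for e in range(n_experts):
+        bounds[e + 1] = bounds[e] + counts[e]
+
+    # Scatter each token into its expert's range, preserving token order.
+    cursor = bounds[:-1].copy()
+    perm = [0] * len(ids)
+    for t, e in enumerate(ids):
+        perm[cursor[e]] = t
+        cursor[e] += 1
+    return perm, bounds
+
+
+ids = [2, 0, 2, 1, 0, 2]
+perm, bounds = group_tokens_by_expert(ids, n_experts=4)
+print(perm)    # [1, 4, 3, 0, 2, 5]
+print(bounds)  # [0, 2, 3, 6, 6]
+```
+
+Expert 3 receives no tokens here, so its range is empty (`bounds[3] == bounds[4]`); a grouped kernel
+would skip it instead of launching an underfilled tile.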
+
+### Lever 4 — FP8 (e4m3) GEMM for dense layers
+Status: **DESIGNED, not built** (blocked on a core ggml API change).
+Problem: ggml-cuda has no FP8 matmul (only int8/FP4). vLLM runs qkv/o_proj/lm_head in FP8 on Blackwell
+tensor cores. Our dense layers run int8-MMQ or f16-cuBLAS.
+Approach (two options):
+- (a) **cuBLASLt FP8**: route dense `mul_mat` through `cublasLtMatmul` with `CUDA_R_8F_E4M3` A/B and FP32
+  compute + scale pointers. Lowest kernel effort; gets library-tuned Blackwell FP8 immediately. Needs the
+  scale-tensor plumbing below.
+- (b) **Hand-written sm_121 `mma.sync ...e4m3.e4m3.f32`** kernels in `mma.cuh`/`mmf.cu`. More control, more work.
+- Prerequisite (both): the **`ggml_mul_mat_ext` / scale-tensor API** from upstream discussion #22042 —
+  per-tensor FP8 scales don't fit the block-scaled quant struct; `MUL_MAT`/`MUL_MAT_ID` must accept optional
+  scale tensors. This is a cross-cutting ggml change (graph + ops + all backends' fallbacks).
+- Effort: high (API change is the hard part; cuBLASLt path is then moderate). Payoff: closes dense-layer
+  prefill/compute gap; complements Lever 3. Note: for *this* MoE model the experts dominate, so Lever 3 > 4.
+
+### Lever 5 — tcgen05 / wgmma-class kernels for large-prefill tiles
+Status: **DESIGNED, not built** (very high effort; last increment).
+Problem: ggml's tensor-core path is warp-level `mma.sync` only (no `wgmma`/`tcgen05`). Blackwell's
+tensor-memory `tcgen05` MMA (what CUTLASS uses) extracts substantially more throughput at large prefill tiles.
+Approach: introduce warpgroup/tcgen05 GEMM main-loops for the FP4/FP8 paths (effectively adopting CUTLASS
+3.x collective mainloops for sm_120/121), used when tile size is large enough (prefill). Decode (thin) keeps
+`mma.sync`.
+- Effort: very high (CUTLASS-class engineering). Payoff: the final slice of large-prefill throughput; only
+  worth it after Levers 3–4 land. Realistically: depend on/upstream CUTLASS kernels rather than hand-roll.
+
+---
+
+## Paged attention — complete implementation (after kernels are fair)
+The placement prototype is insufficient (measured: zero concurrency benefit). A real implementation must close all
+four gaps listed below. The CPU foundation is already built and verified (`PagedKVManager` P0–P3, `README.md`); the in-model parts
+are unbuilt. **Build order and concrete design:**
+
+1. **On-demand block allocation from a shared pool** (capacity win — more concurrent seqs before OOM).
+   - Replace `find_slot`'s ring-buffer (`llama-kv-cache.cpp:818`) with `PagedKVManager` block allocation; the
+     KV tensor becomes a shared block pool `[n_embd, block_size*num_blocks]`, sequences draw blocks on demand
+     (already prototyped on CPU: `paged_kv_manager.{h,cpp}`, `test_ggml_paged_rw.cpp`).
+   - The win is to be measured where it counts, as max concurrent sequences before OOM (not yet benchmarked; the benchmark needs this step built first).
+2. **Gather-read** so each seq attends only its own blocks (`get_k`/`get_v` `:1145/1165` → `ggml_get_rows`
+   gather into scratch, then existing attention). Numerically proven on CPU (`test_ggml_paged_attn.cpp`,
+   7.5e-08 vs reference). Needs `build_attn_paged` branch in `llama-graph.cpp` + Gate 0 in a real model.
+3. **Continuous batching / scheduler** (no head-of-line blocking on mixed-length traffic). New scheduler in
+   the server slot path; admit/evict at block granularity; the dimension where paging beats llama.cpp's
+   current static batching. This is where the *real* concurrency win lives (vs our synthetic uniform test).
+4. **Automatic prefix sharing** (block-hash dedup; `PagedKVManager::{compute_block_hashes,get_computed_blocks}`
+   already implemented & tested). Cross-tenant shared system prompts reuse physical blocks.
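+
+A minimal sketch of block-hash prefix sharing as described in item 4. This is not the `PagedKVManager`
+implementation (its hashing scheme is not shown here); `block_hashes` and `shared_prefix_blocks` are
+hypothetical names, and chaining each block's hash to its parent is an assumption modelled on the usual
+prefix-cache design. Only full blocks are hashed, so a partial tail block is never shared.
+
+```python
+import hashlib
+
+
+def block_hashes(tokens, block_size):
+    """Return one chained hash per full block of token ids."""
+    hashes = []
+    parent = b""
+    for start in range(0, len(tokens) - block_size + 1, block_size):
+        block = tokens[start:start + block_size]
+        payload = parent + b"|" + ",".join(map(str, block)).encode()
+        parent = hashlib.sha256(payload).digest()
+        hashes.append(parent)
+    return hashes
+
+
+def shared_prefix_blocks(a, b):
+    """Count leading blocks whose chained hashes match."""
+    n = 0
+    for x, y in zip(a, b):
+        if x != y:
+            break
+        n += 1
+    return n
+
+
+system_prompt = list(range(8))               # 8 shared tokens = 2 full blocks
+req_a = system_prompt + [100, 101, 102, 103, 104]
+req_b = system_prompt + [200, 201, 202, 203]
+ha = block_hashes(req_a, block_size=4)
+hb = block_hashes(req_b, block_size=4)
+print(len(ha), len(hb), shared_prefix_blocks(ha, hb))  # 3 3 2
+```
+
+Because each hash covers its parent, two requests can only match on block `k` if they also match on
+every earlier block, which is what lets the two system-prompt blocks map to the same physical blocks.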
+
+Status: design in `2026-06-19-paged-attention-llamacpp-design.md`; CPU P0–P3 done; in-model #1–#4 unbuilt.
+**Then** measure concurrency in paging's real scenarios — **memory-pressured (max seqs before OOM)** and
+**mixed-length continuous batching** — on the MXFP4 (fair-quant) footing, not the uniform/over-provisioned
+test that (correctly) showed no benefit.
+
+> Reality check from this session's data: paged attention is a **capacity + scheduling** win, not a per-token
+> speed win. On GB10 with 119 GB unified memory and uniform requests we are not memory-bound at B≤64, so the
+> placement prototype showed nothing. Paging's value appears under memory pressure (many/long sequences) and
+> bursty mixed-length traffic. The per-token throughput gap is a **kernel** problem (Levers 1–3), separate
+> from paging.
+
+---
+
+## Implementation plan A — Lever 3: FP4 MoE GEMM to vLLM parity
+
+Goal: lift batched MoE prefill from ~3.65k t/s (B=32) toward vLLM's ~99k. Root cause (profiled):
+`mul_mat_q<MXFP4>` runs at ~22 effective TFLOP/s — warp-level `mma.sync`, not Blackwell tcgen05.
+Cheap knobs are exhausted (ubatch saturates at 2048; `GGML_CUDA_FORCE_CUBLAS` is a no-op 3419↔3423;
+tile width already full at mmq_x=128). So parity needs kernel work, done iteratively on the DGX
+(`~/llama.cpp-pr24423`, editable + rebuildable; diffs captured as `patches/`).
+
+Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build --target llama-bench` →
+`llama-bench` MXFP4 pp/concurrency → record):
+1. **Cheap kernel tweaks (low confidence, fast).** nwarps (occupancy), `mmq_y` tile, stream-k on/off,
+   FP4 load-tile path. Measure each. Likely small (<1.3x) — these don't change the warp-MMA ceiling.
+   - **Result (nwarps):** DEAD END. `nwarps` is locked by `static_assert(nwarps*tile_C::I == mmq_y)`
+     (mmq.cuh:3234) → nwarps=8 for mmq_y=128. Can't raise occupancy without co-scaling mmq_y to 256
+     (nwarps=16), which exceeds Blackwell shared-memory limits. The MMQ constants are tightly coupled;
+     they are not freely tunable. This confirms that parity needs the kernel rewrite (phase 3), not knobs.
+2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel +
+   a global round-trip. Tractable, ~1.1x.
+   - **Result:** NOT AVAILABLE as a cheap patch. `quantize_mmq_fp4_cuda` (mmq.cu:200) *already* takes
+     `ids_src1` — the gather is already fused into the quant. The only remaining fusion is quantize-on-load
+     *inside* the GEMM hot loop (intricate, ~8% ceiling, risky). ORippler's #24481 fuses the decode (MMVQ)
+     post-scale and intends a "BS>1" (prefill) follow-up — unwritten. Marginal; skip.
+
+**Upstream survey (2026-06):** there is NO tcgen05/CUTLASS grouped-GEMM MoE kernel in ggml — not merged,
+not in-flight, not a draft (Discussion #18369 is talk, no PR; #18250 closed not-planned). CUTLASS is not a
+dependency (the profile's `cutlass_80_tensorop` is cuBLAS-internal). No fork has a portable MoE kernel
+(croll83/llama.cpp-dgx is GatedDeltaNet-focused). Maintainer signal (woachk on #17906): "the path forward
+is to wait for cuTile C++." So **nothing to cherry-pick; phase 3 is genuinely from-scratch.**
+
+3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a
+   CUTLASS 3.x collective-mainloop grouped GEMM (sm_120a, `e2m1` block-scaled, tcgen05 tensor-memory MMA),
+   one problem over all experts with per-group offsets, fused act-quant. This is what vLLM/FlashInfer use.
+   Multi-week; the honest path to parity. Prefer **upstream ggml** (issue drafted) over a private patch.
+4. **Full-model low precision.** Quantize dense layers (qkv/o_proj/lm_head, the 10% Q8) to FP4/FP8 too so
+   the whole prefill runs on FP4 tensor cores, not int8-MMQ.
+
+Exit per phase: measured t/s recorded here; stop a phase when it's a dead end (recorded as such).
+Matching vLLM realistically requires phase 3; phases 1–2 are the warm-up + de-risking.
+
+## Implementation plan B — Complete paged attention (the pivot)
+
+CPU foundation done (P0–P3, `README.md`): vLLM-parity block manager + ggml write/gather + attention
+numerics + placement Gate 0 (token-identical in-model). Remaining = make it deliver the multi-tenant wins.
+Phases:
+1. **On-demand shared-block pool** — replace `find_slot` ring buffer (`llama-kv-cache.cpp:818`) with
+   `PagedKVManager` block allocation; KV tensor = `[n_embd, block_size*num_blocks]` shared pool. Win:
+   fit more concurrent seqs before OOM. Test: max concurrent seqs at fixed budget vs contiguous.
+2. **Gather-read** (`get_k/get_v` `:1145/1165` → `ggml_get_rows` into scratch) + `build_attn_paged` branch
+   in `llama-graph.cpp`. Numerically proven on CPU (7.5e-08). Gate 0: token-identical multi-seq.
+3. **Continuous batching / scheduler** — admit/evict at block granularity in the server slot path. The
+   real concurrency win on mixed-length traffic (where the placement prototype showed nothing).
+4. **Automatic prefix sharing** — block-hash dedup (`PagedKVManager::{compute_block_hashes,get_computed_blocks}`
+   already implemented + tested). Cross-tenant shared system prompts reuse physical blocks.
+
+Then benchmark in paging's real regimes — **memory-pressured** + **mixed-length continuous batching** — on
+the MXFP4 (fair-quant) footing. Note: GB10's 119 GB unified memory means win-1 needs genuine pressure
+(long/many seqs) to show; the win is capacity + scheduling, not per-token speed.
+
+## Honest scope note
+Levers 3–5 and the complete paged implementation are each substantial (weeks of expert CUDA/systems work). This doc tracks what is **measured** vs **designed** vs **not-yet-built**, and never claims a number that wasn't run on the box.
--- a/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md
+++ b/backend/cpp/llama-cpp/paged/FP4_GROUPED_MOE_KERNEL.md
@@ -0,0 +1,59 @@
+# FP4 grouped-GEMM MoE kernel (Lever 3) — scaffold + implementation plan
+
+This is the one piece of work that actually closes the vLLM gap on Blackwell (GB10/sm_121). Prefill and decode are both
+bottlenecked by the same kernel: `mul_mat_q<MXFP4>` (warp-level `mma.sync` grouped MMQ, ~22 TFLOP/s) is
+**37%** of prefill and **54.6%** of decode-at-B=64 GPU time (`BENCHMARKS.md`). Paged attention can't touch
+it (proven). The fix is a CUTLASS-3.x collective-mainloop grouped GEMM with block-scaled `e2m1` operands via
+tcgen05 tensor-memory MMA — what vLLM/FlashInfer/TRT-LLM use.
+
+## Scaffold (DONE — builds clean, default byte-identical)
+
+Lives in the DGX checkout `~/llama.cpp-pr24423/ggml/src/ggml-cuda/` (to be rebased onto the pin as a patch /
+upstreamed). Captured diff: `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`.
+
+- `fp4-grouped-moe.{cuh,cu}` — entry `ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst) -> bool`
+  (true = handled, false = fall back to MMQ). Gated behind env `GGML_CUDA_FP4_GROUPED`. Currently always
+  returns false → **default build unchanged**.
+- Hook in `ggml_cuda_mul_mat_id` (the MoE dispatch), before the `ggml_cuda_mul_mat_q(...ids...)` call:
+  `if (ggml_cuda_fp4_grouped_moe(...)) return;`. Builds via the `file(GLOB "*.cu")` (re-run cmake configure
+  after adding the file — GLOB is configure-time).
+
+This is the integration seam. The kernel fills the stub.
+
+## Implementation phases (each: build on GB10 → numerical parity vs `mul_mat_q<MXFP4>` → bench)
+
+1. **Reference grouped GEMM (correctness first, slow OK).** Per-expert problem sizes + offsets from `ids`;
+   dequant `e2m1`+scales → BF16; loop CUTLASS (or cuBLAS) per group. Gate: output matches MMQ within fp tol
+   on a 2-expert toy + the real model (token-identical greedy). Establishes the harness + the data plumbing.
+2. **CUTLASS GemmGrouped, sm_120a, BF16 operands.** Replace the loop with one `cutlass::gemm::device::
+   GemmGrouped` launch over all experts (per-group offsets). Measures the grouping win alone.
+3. **Block-scaled FP4 operands (the real lever).** `e2m1` A/B with `e8m0`(MX)/`e4m3`(NV) block scales via the
+   Blackwell scaled-MMA collective (tcgen05 tensor-memory). This is where the TFLOP/s jumps. Needs CUTLASS
+   3.x + sm_120a; verify the block-scale layout matches ggml's MXFP4/NVFP4 packing.
+4. **Fuse activation quant** (the F32→FP4 of src1) into the gather/permute prologue.
+5. **Enable by default** on sm_120/121 when parity holds + faster; keep the env as an escape hatch.
+
+## Dependencies / decisions
+
+- **CUTLASS is not currently a ggml dependency** (the profile's `cutlass_80_tensorop` is cuBLAS-internal).
+  Adding it = submodule/fetch + include dir, gated to CUDA sm_120+. Float the approach with ggml maintainers
+  early (Discussion #18369 is the home; JohannesGaessler asked to discuss arch before big kernel work).
+- Target sm_120a/121a (consumer Blackwell). Datacenter Blackwell (sm_100) is a separate tile config.
+- Risk: needs ncu-driven iteration on the GB10; this is multi-week, expert-CUDA. No upstream base to fork
+  (exhaustive search confirmed). Net-new value upstream.
+
+## DENSE scope — RESOLVED (TODO #28, benchmarked): dense needs an FP4 GEMM too
+
+Benchmarked Qwen3-32B dense, vLLM W4A16 vs llama.cpp Q4_K_M (`BENCHMARKS.md`). **Dense prefill is 7.6–32×
+behind** (llama int8-MMQ plateaus ~765 t/s; vLLM FP4 scales to 24.4k); decode ~parity at B=1, 2.2× at B=64.
+So the kernel track is **two kernels, not one**:
+
+- **(a) Dense FP4 GEMM** — a plain non-grouped CUTLASS/tcgen05 block-scaled FP4 GEMM. **Simpler than grouped;
+  land this FIRST** — it's the easier first kernel, benefits every dense model, and de-risks the FP4 collective
+  before the grouped variant. Hook: the non-MoE `ggml_cuda_mul_mat_q` (no `ids`) path.
+- **(b) MoE grouped FP4 GEMM** — the scaffold above (`ggml_cuda_fp4_grouped_moe`), per-expert offsets.
+
+Both share the same block-scaled `e2m1` collective; (a) is (b) with one group. Suggested order: build (a),
+prove the FP4 collective + parity harness, then generalize to (b). (Aside: full W4A4 NVFP4 doesn't run on
+GB10 today — FlashInfer ships no FP4 cubins for sm_121, so the dense `mm_fp4` kernel hangs/returns zeros; the
+W4A16 Marlin path is the fast, correct one and is the fair comparison. See `BENCHMARKS.md` for the root cause.)
--- a/backend/cpp/llama-cpp/paged/MXFP4_QUALITY.md
+++ b/backend/cpp/llama-cpp/paged/MXFP4_QUALITY.md
@@ -0,0 +1,140 @@
+# MXFP4-dense vs Q4_K_M quality check (Qwen3, GB10 / DGX Spark)
+
+## Question
+
+MXFP4-quantized **dense** Qwen3-32B is measurably faster on GB10 (Blackwell) than
+Q4_K_M: ~1.58x concurrent prefill, ~1.2x decode, for free (just a requantize that
+routes onto the FP4-MMA kernel). Before LocalAI recommends MXFP4-dense as a Blackwell
+default, we must confirm its **quality is acceptable versus Q4_K** (Q4_K is normally the
+stronger 4-bit format).
+
+Critical caveat going in: the pre-existing `~/bench/q3-32b-mxfp4-dense.gguf` was built
+with `--allow-requantize`, so it was suspected to be **double-quantized** (Q4_K_M ->
+MXFP4), which would unfairly penalize MXFP4. The goal here was a *fair* answer.
+
+## Verdict
+
+**Do NOT recommend MXFP4-dense as a quality-equivalent replacement for Q4_K on
+Blackwell.** A clean apples-to-apples test (same BF16 source, both 4-bit, no imatrix)
+shows MXFP4-dense carries a **large** quality penalty that Q4_K does not:
+
+- Q4_K_M costs **+2.6%** perplexity vs the BF16 baseline.
+- MXFP4-dense costs **+30.8%** perplexity vs the BF16 baseline (i.e. **+27.5% worse
+  than Q4_K**).
+
+The double-quant suspicion was correct but it was **not** the main culprit: even a clean
+MXFP4-from-BF16 is dramatically worse than Q4_K. The ~1.58x prefill / ~1.2x decode
+speedup is real, but it is not free on quality. MXFP4-dense output is still coherent (not
+gibberish), so it is usable where raw throughput dominates and a quality hit is
+acceptable, but it must not be presented as a drop-in, quality-neutral Q4_K replacement.
+
+## Evidence
+
+### 1. Provenance of the existing 32B MXFP4 (it is double-quant)
+
+`~/dense_mxfp4.sh` (mtime matches the `q3-32b-mxfp4-dense.gguf` mtime, Jun 20 09:47)
+created it:
+
+```
+SRC=$HOME/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf      # <-- source is Q4_K_M, not F16/BF16
+OUT=$HOME/bench/q3-32b-mxfp4-dense.gguf
+$QB --allow-requantize --tensor-type "attn=mxfp4" --tensor-type "ffn=mxfp4" \
+    "$SRC" "$OUT" MXFP4_MOE
+```
+
+Confirmed **double-quantized** (Q4_K_M -> MXFP4). Any PPL measured on this file
+overstates MXFP4's true penalty, so the 32B number below is a loose upper bound, not the
+fair answer.
+
+### 2. 32B quick read (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99)
+
+`llama-perplexity`, PR build `~/llama.cpp-pr24423/build` (sm_121):
+
+| 32B model | PPL | vs Q4_K |
+|---|---|---|
+| Qwen3-32B-Q4_K_M | **7.3865** +/- 0.177 | - |
+| q3-32b-mxfp4-dense (double-quant) | **8.4638** +/- 0.206 | +14.6% |
+
+MXFP4 is much worse than Q4_K here, **but** the file is double-quantized, so this quick read is
+unfair to MXFP4. The question was therefore escalated to a clean small-model comparison.
+
+### 3. Fair comparison: clean small dense model (Qwen3-4B BF16)
+
+The MXFP4-vs-Q4_K delta is primarily a *format* property (its magnitude can still vary with model size), so a
+small model gives a fast, clean answer. Downloaded `Qwen3-4B-BF16.gguf` (unsloth, ~7.7
+GiB) and quantized it **from that same BF16 source** to both formats with the identical
+recipe used for the 32B (no `--allow-requantize` needed, no imatrix on either side):
+
+```
+llama-quantize  q3-4b-bf16.gguf  q3-4b-q4km.gguf   Q4_K_M
+llama-quantize --tensor-type attn=mxfp4 --tensor-type ffn=mxfp4 \
+               q3-4b-bf16.gguf  q3-4b-mxfp4.gguf  MXFP4_MOE
+```
+
+Perplexity (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99):
+
+| Qwen3-4B | size | PPL | vs BF16 | vs Q4_K |
+|---|---|---|---|---|
+| BF16 (baseline) | 7672 MiB | **13.3188** +/- 0.416 | - | - |
+| Q4_K_M | 2497 MiB | **13.6605** +/- 0.426 | **+2.57%** | - |
+| MXFP4 (clean) | 2236 MiB (4.66 BPW) | **17.4183** +/- 0.561 | **+30.78%** | **+27.5%** |
+
+This is the apples-to-apples quality answer: **clean MXFP4-from-BF16 is ~12x more lossy
+than Q4_K relative to the BF16 baseline** (30.8% vs 2.6%). Notably the clean-4B MXFP4-vs-
+Q4_K gap (+27.5%) is *wider* than the 32B double-quant gap (+14.6%), consistent with
+smaller models being more quantization-sensitive - the double-quant did not invent the
+problem, it is intrinsic to the format as quantized by `llama-quantize`.
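+
+The percentages in the table can be re-derived from the three measured PPL values. The snippet below is
+only a worked check of that arithmetic; `pct_over` is a helper name introduced here, not part of any tool.
+
+```python
+def pct_over(value, baseline):
+    """Relative increase of value over baseline, in percent."""
+    return (value / baseline - 1.0) * 100.0
+
+
+# Measured perplexities from the Qwen3-4B table above.
+ppl = {"bf16": 13.3188, "q4_k_m": 13.6605, "mxfp4": 17.4183}
+
+q4_vs_bf16 = pct_over(ppl["q4_k_m"], ppl["bf16"])
+mx_vs_bf16 = pct_over(ppl["mxfp4"], ppl["bf16"])
+mx_vs_q4 = pct_over(ppl["mxfp4"], ppl["q4_k_m"])
+lossiness_ratio = mx_vs_bf16 / q4_vs_bf16
+
+print(f"{q4_vs_bf16:.2f} {mx_vs_bf16:.2f} {mx_vs_q4:.1f} {lossiness_ratio:.1f}")
+# 2.57 30.78 27.5 12.0
+```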
+
+### 4. Coherence spot-check (32B, llama-simple, n=60)
+
+MXFP4-dense 32B is fully coherent, not degraded gibberish:
+
+- "The capital of France is" -> MXFP4: "...Paris, is located near the Seine River..."
+  (correct); Q4_K similar.
+- "Q: What is 17 multiplied by 23? A:" -> MXFP4 reasons via the distributive property
+  (sound); Q4_K answers 391 directly (correct).
+- "def fibonacci(n):" -> both emit valid Python.
+
+So the quality cost shows up as measurably higher perplexity (and would surface on harder
+/ longer tasks), not as obviously broken text at short generation lengths.
+
+## Why
+
+`MXFP4_MOE` is a 4-bit float format (E2M1 values, shared E8M0 scale per block of 32,
+round-to-nearest) designed for MoE expert tensors (gpt-oss et al.) with a coarse
+per-block scale. Q4_K uses 6-bit superblock scales plus per-sub-block mins - materially
+better for dense attention/FFN weights. Forcing MXFP4 onto dense layers to reach the FP4
+kernel trades ~1.58x prefill for a large accuracy loss. The FP4-MMA speed path is real,
+but the weights it accepts (MXFP4 here) are lossy for dense.
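+
+A toy illustration of why a single power-of-two scale per block is coarse. This is a sketch under stated
+assumptions, not llama.cpp's quantizer: the shared exponent follows the OCP MX convention
+(`floor(log2(amax))` minus the E2M1 maximum exponent of 2), rounding is nearest-value, and
+`quantize_block` is a hypothetical name. A real MXFP4 block has 32 elements; four are used here.
+
+```python
+import math
+
+# Magnitudes representable by a 4-bit E2M1 element (sign handled separately).
+E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
+
+
+def quantize_block(xs):
+    """Quantize one block with a shared power-of-two scale; return (exponent, dequantized values)."""
+    amax = max(abs(x) for x in xs)
+    if amax == 0.0:
+        return 0, [0.0] * len(xs)
+    exponent = math.floor(math.log2(amax)) - 2   # assumption: OCP MX shared-exponent rule
+    scale = 2.0 ** exponent
+    out = []
+    for x in xs:
+        # Round the scaled magnitude to the nearest representable E2M1 value.
+        mag = min(E2M1_MAGNITUDES, key=lambda v: abs(abs(x) / scale - v))
+        out.append(math.copysign(mag * scale, x))
+    return exponent, out
+
+
+exponent, deq = quantize_block([6.0, -1.2, 0.3, 0.2])
+print(exponent, deq)  # 0 [6.0, -1.0, 0.5, 0.0]
+```
+
+One large weight fixes the scale for the whole block, so the small weights next to it land on a grid with
+a step of 0.5 and are rounded heavily (0.2 becomes 0). Q4_K's sub-block scales and mins reduce that effect.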
+
+## Caveat, stated precisely
+
+This measures **llama.cpp's `llama-quantize` MXFP4** (OCP MX FP4, RTN, **no imatrix**)
+against **llama.cpp's Q4_K_M** (k-quant superblocks, also no imatrix here). It is a fair
+format-vs-format comparison of exactly what LocalAI would ship if it routed a requantize
+through this path. It does **not** claim FP4 is fundamentally unviable on Blackwell:
+
+- An imatrix-aware MXFP4, or a better FP4 format with two-level scaling
+  (**NVFP4** - there are already `q3-32b-nvfp4` / `q3-32b-nvfp4a16` dirs on the box),
+  may close much of this gap and is the more promising Blackwell FP4 path to evaluate.
+- The result is for Qwen3 dense; other families may differ in magnitude but the
+  format-level disadvantage of plain MXFP4 RTN vs Q4_K is expected to hold.
+
+## Recommendation
+
+- **Do not** ship a blanket "use MXFP4-dense on Blackwell" recommendation as a Q4_K
+  quality equivalent. The ~1.58x prefill / ~1.2x decode win comes with a real ~30% PPL
+  inflation (vs ~2.6% for Q4_K). Q4_K_M stays the right dense default on Blackwell.
+- If exposing MXFP4-dense at all, gate it as an explicit **throughput-over-quality**
+  option with the perplexity caveat surfaced, not a default.
+- MXFP4/FP4 remains correct where the model is trained for it (MoE / gpt-oss-style).
+  Pursue **NVFP4** (and/or imatrix-aware FP4) as the quality-competitive Blackwell FP4
+  format before making any FP4-dense recommendation.
+
+## Reproduction (DGX Spark, GB10, build `~/llama.cpp-pr24423/build`, sm_121)
+
+- Dataset: `~/wikitext-2-raw/wiki.test.raw` (wikitext-2-raw-v1 test).
+- 32B: `~/ppl32b.sh` -> `~/ppl32b.out`; coherence `~/coh32b.sh` -> `~/coh32b.out`.
+- Clean 4B: `~/fair4b.sh` -> `~/fair4b.out` (quantize + 3x perplexity).
+- All runs `-ngl 99`, `--chunks 50`, `-c 512`. GB10 thermal-throttles but PPL is a
+  correctness metric, so thermal state does not affect these numbers.
--- a/backend/cpp/llama-cpp/paged/Makefile
+++ b/backend/cpp/llama-cpp/paged/Makefile
@@ -0,0 +1,41 @@
+CXX ?= g++
+CXXFLAGS ?= -std=c++17 -O2 -Wall -Wextra -I.
+
+TESTS = test_free_block_queue test_block_pool test_paged_kv_manager test_prefix_cache
+BINS  = $(addprefix tests/,$(TESTS))
+
+all: $(BINS)
+
+tests/%: tests/%.cpp paged_kv_manager.cpp paged_kv_manager.h
+	$(CXX) $(CXXFLAGS) -o $@ $< paged_kv_manager.cpp
+
+check: all
+	@for t in $(BINS); do echo "== $$t =="; ./$$t || exit 1; done
+
+paged-bench: paged-bench.cpp paged_kv_manager.cpp paged_kv_manager.h
+	$(CXX) $(CXXFLAGS) -o $@ paged-bench.cpp paged_kv_manager.cpp
+
+bench: paged-bench
+	./paged-bench
+
+# --- Optional ggml integration test (Phase 1: paged write/gather mechanism) ---
+# Requires a built ggml. Override these to point at your checkout / build:
+#   make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build>
+GGML_SRC   ?= ../../llama-cpp-fallback-build/llama.cpp/ggml
+GGML_BUILD ?= /tmp/ggml-build
+GGML_LIBDIR = $(GGML_BUILD)/src
+
+GGML_TESTS = test_ggml_paged_rw test_ggml_paged_attn
+GGML_BINS  = $(addprefix tests/,$(GGML_TESTS))
+
+tests/test_ggml_%: tests/test_ggml_%.cpp paged_kv_manager.cpp paged_kv_manager.h
+	$(CXX) $(CXXFLAGS) -I$(GGML_SRC)/include -o $@ $< paged_kv_manager.cpp \
+		-L$(GGML_LIBDIR) -lggml -lggml-base -lggml-cpu -Wl,-rpath,$(GGML_LIBDIR)
+
+ggml-check: $(GGML_BINS)
+	@for t in $(GGML_BINS); do echo "== $$t =="; ./$$t || exit 1; done
+
+clean:
+	rm -f $(BINS) $(GGML_BINS) paged-bench
+
+.PHONY: all check bench ggml-check clean
--- a/backend/cpp/llama-cpp/paged/NVFP4_TEST.md
+++ b/backend/cpp/llama-cpp/paged/NVFP4_TEST.md
@@ -0,0 +1,114 @@
+# NVFP4-dense on DGX Spark (GB10, sm_121): is it the quality-preserving FP4 win MXFP4 wasn't?
+
+Test rig: DGX Spark GB10 (sm_121), `~/llama.cpp-pr24423/build` (PR #24423, FP4 MMA + NVFP4
+kernel), wikitext-2-raw, clean BF16 source `q3-4b-bf16.gguf` (the same source used for the
+established MXFP4 / Q4_K fair test). NVFP4 and all comparison quants were produced clean from
+BF16, no imatrix.
+
+## Verdict (short)
+
+YES on all the load-bearing questions, with one honest caveat:
+
+1. llama.cpp CAN produce an NVFP4 GGUF.
+2. NVFP4 quality is Q4_K-class, NOT MXFP4-class: +7.4% PPL vs BF16 (MXFP4 was +30.8%). It is
+   slightly behind Q4_K (+4.8% relative) but in the same ballpark, not on the quality cliff.
+3. NVFP4 routes onto the FP4 MMA kernel and gets the FP4 prefill speedup: ~1.29x Q4_K on the
+   4B, tracking MXFP4 to within 5% (MXFP4 hit 1.58x on the 32B; NVFP4 should track it there too).
+4. Output is coherent.
+
+Bottom line: NVFP4-dense IS the quality-preserving FP4 win MXFP4 wasn't. It delivers
+essentially the full FP4 prefill speedup at roughly Q4_K quality, where MXFP4 paid a ~27.5% PPL
+penalty relative to Q4_K for the same speed. LocalAI can support/recommend NVFP4-dense on Blackwell for prefill-bound
+workloads, with the caveat that it is marginally (~5%) behind Q4_K on perplexity; an imatrix-guided
+NVFP4 quant would likely close most of that remaining gap.
+
+## 1. Feasibility: can llama-quantize produce an NVFP4 GGUF? YES
+
+- The type exists with a full quantize path, not just a kernel:
+  - `GGML_TYPE_NVFP4 = 40` (`ggml.h`), `GGML_FTYPE_MOSTLY_NVFP4 = 26`
+  - `quantize_nvfp4` / `quantize_row_nvfp4_ref` / `dequantize_row_nvfp4` registered in `ggml.c`
+  - type_name is `"nvfp4"`, block `QK_NVFP4` (per-16 FP8/E4M3 block scale + global scale)
+- NVFP4 is NOT a top-level `llama-quantize` ftype (no `NVFP4` entry in the allowed-types list,
+  no reference in `tools/quantize/quantize.cpp` or `src/llama-quant.cpp`), BUT
+  `--tensor-type name=nvfp4` resolves it: `parse_ggml_type` matches the arg against
+  `ggml_type_name(...)`, which returns `"nvfp4"`. This is the exact same mechanism that produced
+  MXFP4-dense.
+- Recipe used (mirrors the MXFP4-dense GGUF byte-for-byte in structure: token_embd Q8_0, all
+  norms F32, all 2D attn+ffn weights to FP4):
+
+  ```
+  llama-quantize --tensor-type "attn=nvfp4" --tensor-type "ffn=nvfp4" \
+                 q3-4b-bf16.gguf q3-4b-nvfp4.gguf Q8_0
+  ```
+
+  Result: `q3-4b-nvfp4.gguf`, 2343.93 MiB, 4.89 BPW, ~5 s. (MXFP4-dense was 2350 MiB; same shape.)
+  Every `blk.N.attn_*` and `blk.N.ffn_*` reported `converting to nvfp4`; token_embd Q8_0; norms F32.
+
+The on-box `~/bench/q3-32b-nvfp4*` dirs are vLLM HF safetensors (already 4-bit), not GGUF, so they
+cannot be loaded by llama.cpp and are irrelevant to this test (confirmed).
+
+## 2. Quality (decisive): NVFP4 is Q4_K-class, not MXFP4-class
+
+`llama-perplexity -f wiki.test.raw --chunks 50 -c 512 -ngl 99`, all clean from the same BF16 4B:
+
+| Quant   | PPL    | vs BF16  | vs Q4_K  |
+|---------|--------|----------|----------|
+| BF16    | 13.32  | -        | -        |
+| Q4_K_M  | 13.66  | +2.6%    | -        |
+| NVFP4   | 14.31  | +7.4%    | +4.8%    |
+| MXFP4   | 17.42  | +30.8%   | +27.5%   |
+
+(NVFP4 measured this run: Final PPL = 14.3097 +/- 0.4457.)
+
+NVFP4 lands much closer to Q4_K (gap 0.65 PPL) than to MXFP4 (gap 3.11 PPL). As MXFP4's finer-grained
+sibling, it delivers: the two-level scaling (per-16 FP8 block scale + global scale) recovers almost
+all of the quality MXFP4's coarse per-32 E8M0 scale threw away. It is not quite Q4_K, but it is
+firmly in the "acceptable 4-bit" regime, not the lossy one.
+
+## 3. Speed: NVFP4 routes onto the FP4 MMA kernel
+
+No clean BF16 32B was on the box (only the vLLM NVFP4 safetensors and the Q4_K/MXFP4 32B GGUFs),
+so per the brief this is the 4B speed signal - a 3-way cold A/B on the SAME 4B model, 45 s
+cooldowns between runs (`-npp 512 -ntg 128 -npl 8,32,64 -b 2048 -ub 2048 -ngl 99`):
+
+Prefill S_PP (t/s):
+
+| B   | Q4_K   | NVFP4  | MXFP4  | NVFP4 / Q4_K | NVFP4 / MXFP4 |
+|-----|--------|--------|--------|--------------|---------------|
+| 8   | 4862   | 6313   | 6602   | 1.30x        | 0.96x         |
+| 32  | 5020   | 6497   | 6836   | 1.29x        | 0.95x         |
+| 64  | 5031   | 6490   | 6831   | 1.29x        | 0.95x         |
+
+- NVFP4 prefill is within ~5% of MXFP4 at every batch size -> both land on the same FP4 MMA
+  kernel. NVFP4 does NOT fall back to a slow path.
+- NVFP4 beats Q4_K's int8-MMQ prefill by ~1.29x on the 4B. The established 32B figures were
+  Q4_K S_PP ~767 and MXFP4 ~1209 (1.58x); since NVFP4 tracks MXFP4 to within 5%, NVFP4 on the
+  32B should likewise approach ~1.5x. (The 4B shows a smaller multiplier than the 32B because a
+  smaller model spends proportionally less time in the matmul the FP4 kernel accelerates.)
+- Token-gen (S_TG) is comparable across all three (memory-bound), as expected.
+
+## 4. Coherence
+
+`llama-simple` (llama-cli hangs - avoided), NVFP4 4B:
+- "The capital of France is" -> "...Paris. ...Germany is in Berlin. ...Italy is in Rome.
+  ...Spain is in Madrid. ...Netherlands is in Amsterdam." (all correct)
+- "Q: What is 17 plus 25? A:" -> "42." (correct)
+
+Coherent and factually accurate.
+
+## Recommendation for LocalAI on Blackwell
+
+Support and recommend NVFP4-dense as the FP4 prefill option on Blackwell (sm_120/121), produced
+via `--tensor-type attn=nvfp4 --tensor-type ffn=nvfp4` over a BF16 source (token_embd Q8_0,
+norms F32). It gives ~the full FP4 prefill speedup (FP4 MMA kernel, ~1.3x Q4_K on 4B and
+expected ~1.5x on larger models) at roughly Q4_K quality (+7.4% PPL vs BF16). This is the win
+MXFP4 failed to deliver: MXFP4 paid a +30.8% PPL penalty vs BF16 for the same speed and was rejected.
+
+Caveats / follow-ups:
+- NVFP4 is still ~4.8% behind Q4_K on PPL. For quality-first deployments where the prefill win
+  does not matter, Q4_K_M remains the better pick.
+- These NVFP4/Q4_K numbers are clean (no imatrix). An imatrix-guided NVFP4 quant is the obvious
+  next step and would likely close most of the remaining gap to Q4_K - worth measuring before a
+  blanket recommendation.
+- A direct 32B NVFP4-vs-Q4_K speed run (needs a clean BF16 32B GGUF, not on the box) would
+  confirm the projected ~1.5x; the 4B signal plus the MXFP4-tracking already make this very likely.
--- a/backend/cpp/llama-cpp/paged/PAGED_KV_HIGH_CONCURRENCY.md
+++ b/backend/cpp/llama-cpp/paged/PAGED_KV_HIGH_CONCURRENCY.md
@@ -0,0 +1,115 @@
+# Paged KV at high concurrency on a single GB10 - the datacenter-scale test
+
+Closes the open question left by `PR22569_EVAL.md`: that eval could not test the
+"paged KV unlocks thousands of sequences" thesis because **both** KV paths hit the
+`LLAMA_MAX_SEQ=256` compile cap, and the 32B-dense model it used is compute-bound
+(plateaus by npl=128 for an unrelated reason). This run removes both confounders:
+**recompiled `LLAMA_MAX_SEQ=2048`** and used a **bandwidth-bound model (Qwen3-1.7B-Q8_0)**
+where decode aggregate is free to keep climbing with concurrency.
+
+Hardware: NVIDIA GB10 (sm_121, 119 GiB unified LPDDR5X, ~273 GB/s). Build:
+`~/llama.cpp-pr22569` (PR #22569 paged path + the reshape fix), `LLAMA_MAX_SEQ=2048`,
+sm_121 Release. Contiguous = `llama-batched-bench` (unified KV) `S_TG`. Paged =
+`llama-paged -kvp --fit off` `aggregate tps`. `npp=16, ntg/n_predict=128, b=ub=2048,
+-ngl 99`. Cold runs, 12 s cooldowns.
+
+## TL;DR for the decision
+
+**On a single GB10, paged KV does NOT deliver a throughput or concurrency win - the
+aggregate-decode ceiling is set by the hardware, not the KV layout, and contiguous KV
+already reaches it.** Measured across two model regimes and concurrency up to 2048
+sequences:
+
+- Aggregate decode **plateaus** once the GPU saturates - for both KV layouts:
+  - 32B-dense (compute-bound): ~540 t/s, flat from npl=128 (prior eval).
+  - 1.7B (bandwidth-bound): ~3,200-3,700 t/s, flat from npl=512 (this run).
+- Paged and contiguous land at the **same ceiling**; PR #22569's paged op was 12-13%
+  *slower* than the mature contiguous flash-attention path at equal concurrency on 32B.
+- Pushing concurrency past the plateau is **actively harmful to UX**: per-sequence
+  throughput collapses (23 -> 1.9 tok/s) and TTFT explodes (0.6 s -> 4.3 s avg, **64 s
+  max**) while aggregate stays flat.
+
+**vLLM's ~24k aggregate headline is unreachable on a single GB10 with these models
+regardless of KV layout** - it needs aggregate memory bandwidth / compute that one GB10
+does not have (i.e. many GPUs). Paged KV is a **memory-capacity / anti-fragmentation /
+prefix-sharing** feature, not a single-node throughput-ceiling feature. The static
+single-model benchmark deliberately does not create the memory-pressure regime where
+paging pays off, which is exactly why no win appears.
+
+## The numbers
+
+### Aggregate decode vs concurrency, Qwen3-1.7B-Q8_0 (bandwidth-bound), `LLAMA_MAX_SEQ=2048`
+
+| npl | contiguous `S_TG` (t/s) | paged `aggregate tps` (t/s) | paged per-seq tps | paged TTFT avg / max |
+|----:|------------------------:|----------------------------:|------------------:|---------------------:|
+| 128 | 2,643 | 2,887 | 23-25 | - |
+| 256 | 2,925 | - | - | - |
+| 512 | 3,215 | 3,637 | 7.2-7.8 | 0.57 s / 0.90 s |
+| 1024 | 3,118 | 3,695 | 3.7-4.2 | 1.17 s / 2.37 s |
+| 2048 | (not run) | 3,608 | 1.9-14.6 | 4.28 s / **63.8 s** |
+
+Both paths flatten by npl~512. 8x more concurrency (128->1024) buys contiguous only
+**+18%** and paged **+28%**, then both stop. (The two tools meter slightly differently -
+`llama-paged` aggregate vs `batched-bench` decode-only `S_TG` - so the small paged-vs-
+contiguous offset is not a real paged advantage; the prior apples-to-apples 32B eval had
+paged 12-13% *behind*.)
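+
+The percentages follow from the table rows at npl=128 and npl=1024. A minimal Python
+sketch of that arithmetic (the helper name `gain_pct` is illustrative, not part of
+either benchmark tool):
+
+```python
+def gain_pct(before: float, after: float) -> float:
+    """Percentage change in aggregate throughput between two concurrency levels."""
+    return (after / before - 1.0) * 100.0
+
+# Aggregate t/s taken from the table above (npl=128 -> npl=1024).
+contiguous_gain = gain_pct(2643, 3118)
+paged_gain = gain_pct(2887, 3695)
+print(f"contiguous: +{contiguous_gain:.0f}%  paged: +{paged_gain:.0f}%")
+```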
+
+### Why it plateaus (the hardware ceiling, not the KV layout)
+
+Decode is memory-bandwidth-bound: each step reads the model weights once and shares that
+read across the whole batch. Once concurrency is high enough that the shared weight-read
+is amortized, the per-step cost is dominated by KV reads + attention + host work, none of
+which paging makes cheaper. The GB10's ~273 GB/s sets a floor on per-step time; at the
+plateau the GPU is ~saturated. Adding sequences past that point cannot raise aggregate - it only divides
+the same throughput across more users (per-seq tps falls, TTFT rises). The 32B-dense case
+plateaus even earlier (npl=128) because it saturates on **compute** (weight matmuls), not
+bandwidth - the kernel decomposition is in `VLLM_DECOMPOSITION.md`.
+
+## What paged KV is actually for (the honest, deliverable value)
+
+Paging never helps a static, uniform-length, single-model benchmark on a GPU with memory
+to spare - there is no fragmentation and no over-reservation to reclaim. Its real wins,
+which require the regime this hardware+benchmark does not exercise, are:
+
+1. **Concurrent-tenant capacity under memory pressure.** Block KV fits more *diverse*
+   in-flight sequences (variable, dynamically arriving/leaving contexts) without the
+   contiguous path's per-slot reservation/fragmentation. Pays off when KV memory, not
+   compute/bandwidth, is the binding constraint - i.e. at multi-GPU datacenter scale or
+   with very long/variable contexts.
+2. **Cross-request prefix sharing.** A chained-hash block cache shares identical system
+   prompts / RAG preambles across requests (vLLM's `block_pool.py` + block-hash map). A
+   real token-budget win for shared-prefix workloads; PR #22569 defers this to a
+   non-existent Phase 2 (our from-scratch P0 has the machinery).
+
+These are measured as **max concurrent distinct tenants** and **KV memory saved**, not as
+aggregate tok/s on one model. They do not move the single-GB10 throughput ceiling.
+
+## Recommendation
+
+- **Do not pitch paged KV as a single-GB10 throughput lever** - it is measured flat to
+  the contiguous ceiling (and PR #22569 is slower). Doing so would not survive a
+  benchmark.
+- **The single-GB10 throughput story is already strong without paging:** llama.cpp is
+  ahead of vLLM single-stream (MXFP4 1153 > 800) and at ~70-81% of vLLM aggregate at
+  npl<=128 with a near-identical batching multiplier (`VLLM_DECOMPOSITION.md`). Ship the
+  MXFP4/NVFP4-dense prefill win (`NVFP4_TEST.md`) - that is the cheap, real, defensible
+  Blackwell number.
+- **If datacenter-scale (thousands of concurrent tenants) is the genuine target,** the
+  lever is **multiple GPUs** plus paged KV's **capacity + prefix-sharing** features -
+  framed and measured as concurrent-tenant capacity and KV memory saved, on a
+  variable-context / shared-prefix workload. A single GB10 cannot produce the ~24k
+  aggregate regardless of KV layout; that is a fleet-level result.
+
+## Reproduction (DGX, `~/llama.cpp-pr22569`, `LLAMA_MAX_SEQ=2048`)
+
+```sh
+M=~/bench/draft17/Qwen3-1.7B-Q8_0.gguf
+# contiguous (unified KV): decode-only S_TG per concurrency level
+for NPL in 128 256 512 1024; do
+  ./build/bin/llama-batched-bench -m "$M" -npp 16 -ntg 128 -npl "$NPL" -ngl 99 \
+    -b 2048 -ub 2048 -fa on -c $((NPL*160))
+done
+# paged: aggregate tps per concurrency level
+for NPL in 512 1024 2048; do
+  ./build/bin/llama-paged -m "$M" -kvp --fit off -ngpub 32768 -ncpub 128 \
+    -np "$NPL" -ns "$NPL" -n 128 -b 2048 -ub 2048 -ngl 99
+done
+```
--- a/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md
+++ b/backend/cpp/llama-cpp/paged/PAGED_KV_TARGET_READINESS.md
@@ -0,0 +1,170 @@
+# Paged KV: target-readiness (correctness, dynamic benchmark, 2xH200 projection)
+
+Target hardware: **~2x H200** (281 GB HBM3e total, ~4.8 TB/s per GPU). The GB10 box is
+the *test* rig, not the target - and several earlier "no win" findings are GB10-specific
+artifacts (low bandwidth caps throughput before KV memory ever binds). This document
+delivers the three things needed to push paged KV toward the real target:
+
+1. **Correctness** of the paged path - verified (and a blocking bug found + fixed).
+2. **A dynamic-load benchmark** that actually exercises where paging wins (`paged-loadgen.cpp`).
+3. **A projection** of the paged-KV payoff on 2x H200, grounded in measured GB10 numbers.
+
+---
+
+## 1. Correctness: PASS (after fixing the auto-fit OOM)
+
+`test-paged-kv-e2e` checks the paged decode path against the contiguous reference
+(greedy argmax + top-5 set overlap >= 4). On the box it was previously **unverified** -
+it aborted at context creation. Root cause found:
+
+- `common_fit_paged_kv_blocks` (`common/common.cpp:1144`) **unconditionally overrides**
+  `n_gpu_blocks` from `ggml_backend_dev_memory`, which **over-reports free VRAM on the
+  GB10 integrated/unified device** (it sized **~245 GB of KV on a 119 GB box** ->
+  `cudaMalloc` OOM -> `GGML_ASSERT` abort in `llama-kv-cache-paged.cpp:74`). The test's
+  explicit `n_gpu_blocks=64` was being clobbered because `params.fit_params` defaults on.
+
+**Fix (item-1 patch, applied on the box):**
+
+```diff
+--- a/tests/test-paged-kv-e2e.cpp
+++ b/tests/test-paged-kv-e2e.cpp
+@@ run_paged()
+     params.kv_paged      = true;
++    params.fit_params    = false;  // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
+     params.n_gpu_blocks  = 64;
+```
+
+**Result (Qwen3-0.6B-Q8_0, GB10):**
+
+```
+test-paged-kv-e2e: top-5 argmax match: ref=3743 paged=3743
+test-paged-kv-e2e: top-5 set overlap: 5/5 (require >= 4)
+test-paged-kv-e2e: PASSED
+```
+
+The paged op is **numerically greedy-equivalent to the contiguous path**. The fix for the
+reshape bug from `PR22569_EVAL.md` (decoupled head_dim) is already applied in the checkout.
+
+**Target-readiness caveat (the durable fix, not just the test):** the auto-fit itself is
+brittle and must be hardened before it runs on a real serving box - even though
+`ggml_backend_dev_memory` reports correctly on a discrete H200, the function should still
+(a) early-return when `!params.fit_params`, (b) **clamp** the computed `n_gpu_blocks` so
+`n_gpu_blocks * block_bytes <= free_vram - margin` using the *actual* KV element size, and
+(c) not override an explicitly-set value. One-screen change in `common_fit_paged_kv_blocks`.
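+
+The three rules can be sketched in a few lines. This is a hedged Python illustration
+of the intended logic only, not the actual `common_fit_paged_kv_blocks` code: the
+function names, the "0 means unset" convention and the 16-token block size are all
+assumptions made for the example.
+
+```python
+def block_bytes(block_tokens: int, kv_bytes_per_token: int) -> int:
+    """Size of one KV block, derived from the actual per-token KV footprint."""
+    return block_tokens * kv_bytes_per_token
+
+def fit_gpu_blocks(requested_blocks: int, fit_params: bool,
+                   free_vram: int, margin: int, blk_bytes: int) -> int:
+    # (a) auto-fit disabled: return the caller's value untouched
+    if not fit_params:
+        return requested_blocks
+    # (c) an explicit request (assumed here: any value > 0) is never overridden
+    if requested_blocks > 0:
+        return requested_blocks
+    # (b) clamp so that n_gpu_blocks * blk_bytes <= free_vram - margin
+    usable = max(free_vram - margin, 0)
+    return usable // blk_bytes
+
+GIB = 1024**3
+blk = block_bytes(16, 256 * 1024)  # assumed 16-token blocks, 256 KiB/token
+print(fit_gpu_blocks(64, True, 85 * GIB, 8 * GIB, blk))  # explicit 64 is kept
+print(fit_gpu_blocks(0, True, 85 * GIB, 8 * GIB, blk))   # auto-sized and clamped
+```
+
+With these rules the test's explicit `n_gpu_blocks=64` would survive even with
+`fit_params` left at its default, and an over-reported free-VRAM figure could at worst
+fill the reported budget minus the margin.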
+
+---
+
+## 2. Dynamic-load benchmark - `paged-loadgen.cpp`
+
+**Why the existing tools show no paged win:** `llama-batched-bench` and the stock
+`examples/paged/paged.cpp` both run **fixed-length, all-arrive-at-once, single-prompt**
+load. That has no over-reservation and no fragmentation, so contiguous KV is already
+memory-optimal and paging has nothing to reclaim (`PAGED_KV_HIGH_CONCURRENCY.md`). The
+paged win only exists under **variable lengths + continuous arrival + shared prefixes** -
+the real serving regime. No tool in the tree creates it.
+
+`paged-loadgen.cpp` (committed here) does, via the confirmed `llama_paged_scheduler_*`
+API:
+
+- **shared system prefix** (`LG_PREFIX` tokens) prepended to every request -> exercises
+  cross-request prefix sharing,
+- **variable prompt length** (`LG_SUFMIN..LG_SUFMAX` unique suffix),
+- **bimodal generation length** (`LG_GENLONG` for `LG_LONGPCT`% of requests, else
+  `LG_GENSHORT`) - the over-reservation driver,
+- **continuous arrival**: keeps `LG_INFLIGHT` requests live, admitting a new one each time
+  one finishes.
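+
+A rough Python sketch of the length distribution such a generator draws from. It only
+illustrates the shape of the workload: the function name and every default value
+except the 15% long share are assumptions, and it is not the `paged-loadgen.cpp`
+implementation.
+
+```python
+import random
+
+def sample_request(rng, prefix=256, suf_min=16, suf_max=512,
+                   gen_short=64, gen_long=2048, long_pct=15):
+    """Draw one synthetic request as (prompt_tokens, gen_tokens)."""
+    # Shared prefix plus a unique suffix of variable length.
+    prompt = prefix + rng.randint(suf_min, suf_max)
+    # Bimodal generation length: long for long_pct% of requests, else short.
+    gen = gen_long if rng.random() * 100 < long_pct else gen_short
+    return prompt, gen
+
+rng = random.Random(1)
+requests = [sample_request(rng) for _ in range(10_000)]
+long_share = sum(1 for _, g in requests if g == 2048) / len(requests)
+print(f"long generations: {long_share:.1%}")
+```
+
+The sketch covers lengths only; `paged-loadgen` adds continuous arrival on top by
+keeping a fixed number of such requests in flight.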
+
+It reports the load-bearing number for the buy decision - the **capacity ratio**:
+
+```
+paged peak KV      = sum over live seqs of ceil(used/block)*block * kv_bytes_per_token
+contiguous reserve = peak_inflight * max_ctx * kv_bytes_per_token   (worst-case per slot)
+CAPACITY RATIO     = contiguous_reserve / paged_peak   (+ prefix sharing on top)
+```
+
+`kv_bytes_per_token = 2 * n_layer * n_head_kv * head_dim * sizeof(f16)` - confirmed against
+`llama-kv-cache-paged.cpp` (e.g. Qwen3-32B: 2*64*8*128*2 = **256 KiB/token**).
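+
+A small Python sketch of the two formulas above, as a sanity check on the arithmetic.
+The live-set snapshot, `max_ctx=4096` and the 16-token block size are illustrative
+assumptions, and a single snapshot stands in for the peak; `paged-loadgen` measures
+the real figure over the run.
+
+```python
+import math
+
+def kv_bytes_per_token(n_layer, n_head_kv, head_dim, elem_bytes=2):
+    # K and V (factor 2) for every layer and KV head; f16 elements by default.
+    return 2 * n_layer * n_head_kv * head_dim * elem_bytes
+
+def capacity_ratio(used_tokens, max_ctx, block_tokens, bytes_per_token):
+    # Paged: each live sequence holds its used tokens rounded up to whole blocks.
+    paged_peak = sum(math.ceil(u / block_tokens) * block_tokens
+                     for u in used_tokens) * bytes_per_token
+    # Contiguous: every slot reserves the worst-case context.
+    contiguous_reserve = len(used_tokens) * max_ctx * bytes_per_token
+    return contiguous_reserve / paged_peak
+
+bpt = kv_bytes_per_token(n_layer=64, n_head_kv=8, head_dim=128)  # Qwen3-32B
+live = [600] * 85 + [2600] * 15  # assumed snapshot: 85 short, 15 long sequences
+ratio = capacity_ratio(live, max_ctx=4096, block_tokens=16, bytes_per_token=bpt)
+print(f"{bpt // 1024} KiB/token, capacity ratio {ratio:.2f}x")
+```
+
+Prefix sharing is not modelled here; it would reduce `paged_peak` further for
+shared-prefix workloads.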
+
+**How to run (on the target):** drop into PR #22569's `examples/paged/`, add to its
+CMakeLists next to `llama-paged`, build, then e.g.
+`LG_INFLIGHT=2048 LG_LONGPCT=15 paged-loadgen -m <model> -kvp --fit off -ngpub <N> -ncpub <M> -ngl 99`.
+Sweep `LG_INFLIGHT` to the throughput plateau and read the capacity ratio at that point.
+It is written to run on the target (2x H200) where the regime exists; on GB10 it runs but
+the ratio is uninteresting because throughput plateaus before memory binds (see below).
+
+---
+
+## 3. Projection to 2x H200 (grounded in measured GB10 numbers)
+
+### Measured on GB10 (this work)
+
+| model | decode plateau (aggregate) | plateau concurrency | bound by |
+|---|---|---|---|
+| Qwen3-32B-Q4_K_M (dense) | ~540 t/s | npl ~128 | compute |
+| Qwen3-1.7B-Q8_0 | ~3,200 t/s | npl ~512 | bandwidth |
+
+### Hardware ratios (per GPU, then 2x TP at ~85% scaling)
+
+| | GB10 | H200 | per-GPU x | 2x H200 (TP) x |
+|---|---|---|---|---|
+| mem bandwidth | 273 GB/s | ~4.8 TB/s | 17.6 | ~30 |
+| BF16 compute | ~213 TFLOP/s | ~989 TFLOP/s | 4.6 | ~8 |
+| memory capacity | 119 GB | 141 GB | 1.18 | 2.4 (281 GB) |
+
+Decode is bandwidth-bound, so **both the aggregate ceiling and the concurrency at which it
+is reached scale with bandwidth (~30x on 2x H200)**:
+
+- **32B-dense aggregate decode ceiling:** 540 x 30 ~= **16,000 t/s**, reached at
+  ~128 x 30 ~= **3,800 concurrent sequences**.
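+
+The projection arithmetic, as a hedged Python sketch. It assumes, as the text does,
+that both the ceiling and the plateau concurrency scale linearly with memory
+bandwidth and that 2-way TP reaches ~85% efficiency; the function name is
+illustrative.
+
+```python
+def scale_factor(gb10: float, h200: float, n_gpus: int = 2,
+                 tp_efficiency: float = 0.85) -> float:
+    """Per-GPU hardware ratio, scaled to n_gpus at the given TP efficiency."""
+    return h200 / gb10 * n_gpus * tp_efficiency
+
+bw_x = scale_factor(273, 4800)  # memory bandwidth in GB/s
+ceiling_tps = 540 * bw_x        # measured GB10 plateau (t/s) x bandwidth ratio
+plateau_seqs = 128 * bw_x       # measured GB10 plateau concurrency x ratio
+print(f"x{bw_x:.1f}: ~{ceiling_tps:,.0f} t/s at ~{plateau_seqs:,.0f} sequences")
+```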
+
+### Why paged KV becomes the binding lever on 2x H200 (and didn't on GB10)
+
+To reach that ~16k t/s ceiling you must hold **~3,800 sequences** of KV. The memory math:
+
+- 32B weights (FP8) ~= 32 GB, sharded over 2 GPUs -> ~250 GB HBM free for KV.
+- 32B KV = 256 KiB/token. At an avg held context of 2,000 tokens, **per seq ~= 500 MiB**.
+- Contiguous unified KV (reserve for the live set) fits ~250 GB / 500 MiB ~= **~490
+  sequences** - **8x short of the 3,800 needed to reach the throughput ceiling.**
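+
+The same memory math as a short Python sketch. The exact results shift slightly with
+rounding (2,000 vs 2,048 tokens, GB vs GiB), so treat the text's figures as rounded;
+the variable names are illustrative.
+
+```python
+KIB, MIB = 1024, 1024**2
+
+kv_per_token = 256 * KIB           # Qwen3-32B, f16 KV
+avg_ctx = 2_000                    # average held context, tokens
+per_seq = kv_per_token * avg_ctx   # bytes of KV held per live sequence
+free_for_kv = 250e9                # ~250 GB of HBM left after weights
+fits = free_for_kv / per_seq       # sequences contiguous KV can hold
+shortfall = 3_800 / fits           # vs the plateau concurrency
+print(f"per-seq KV: {per_seq / MIB:.0f} MiB, fits ~{fits:.0f} seqs, "
+      f"{shortfall:.1f}x short")
+```
+
+Either rounding gives a few hundred sequences, which is the roughly 8x gap cited
+above.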
+
+So on 2x H200 **KV memory is the binding constraint at the throughput-optimal concurrency**,
+and contiguous KV strands most of the bandwidth (you'd run at a fraction of 16k t/s). This
+is the gap paged KV closes. On GB10 it never appeared because GB10's 30x-lower bandwidth
+caps decode at npl ~128, whose KV fits in memory trivially - the constraint order is
+inverted on the real target.
+
+### Magnitude of the paged win
+
+Paging recovers concurrency two ways, both multiplicative on achievable throughput:
+
+1. **No over-reservation.** Contiguous must back `max_ctx` per slot; paging uses
+   `ceil(actual/block)`. For a realistic bimodal workload (most generations short, ~15%
+   long, prompts ~512) the average held context is several-fold below `max_ctx` ->
+   `paged-loadgen` capacity ratio typically **~4-10x** (it measures the exact number for
+   your workload's length distribution).
+2. **Cross-request prefix sharing** of shared system prompts / RAG preambles - additional,
+   workload-dependent (chained-hash block cache; vLLM's `block_pool.py`).
+
+Net: on 2x H200, paged KV is plausibly the difference between serving **~500 and ~3,800**
+concurrent 32B sequences in HBM, i.e. between a fraction of and ~all of the **~16k t/s**
+decode ceiling. **That is the datacenter payoff, and it is real on the target even though
+GB10 cannot exhibit it.**
+
+### Honest caveats for the buy case
+
+- These are **projections** from GB10 + spec ratios; the capacity multiplier depends on the
+  workload's context-length distribution (more variable -> bigger paged win) and TP
+  efficiency. `paged-loadgen` measures it directly once you have target-GPU time.
+- The **paged op itself still needs work**: PR #22569's `ggml_paged_attn` was 12-13%
+  *slower* than the mature contiguous flash-attention path at equal concurrency
+  (`PR22569_EVAL.md`), lacks prefix sharing (deferred to a non-existent Phase 2), and has
+  the fit-robustness bug above. Adopting paged KV for the target means either hardening
+  #22569 or finishing the from-scratch P4 - the capacity win above assumes a *correct,
+  competitive* op, which is the remaining engineering.
+- Prefill on either KV layout is compute-capped, not a paged concern.
+
+**Bottom line for the decision:** paged KV **is** the right lever for the 2x H200 target -
+the GB10 "no win" result is a bandwidth artifact, not a verdict. The paged path is now
+**correctness-verified**, the **benchmark to size the win exists**, and the projection
+says the payoff is **~5-10x concurrent-tenant capacity -> several-fold higher aggregate
+decode** on the target. The remaining work is hardening/finishing the paged op, not
+proving the thesis.
--- a/backend/cpp/llama-cpp/paged/PHASED_VLLM_PARITY_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/PHASED_VLLM_PARITY_PLAN.md
@@ -0,0 +1,55 @@
+# Making llama.cpp/LocalAI a viable vLLM alternative — phased plan
+
+Goal: close the practical gap to vLLM for both single-user *speed* and multi-user *throughput*, while keeping
+quality (no lossy quant). Grounded in measured benchmarks + research (`BENCHMARKS.md`, `BLACKWELL_KERNEL_GAPS.md`,
+`VLLM_THROUGHPUT_GAP.md`). The gap is NOT one thing — each phase targets a distinct, independent lever.
+
+## Where vLLM actually leads (measured, GB10 / Qwen3-32B)
+
+- **Single-user decode:** ~parity (10.2 vs 11.7) — bandwidth-bound. vLLM's edge is **spec-dec** (lossless).
+- **Multi-user decode:** gap grows to ~2.2× at B=64 (kernel + scheduler).
+- **Prefill aggregate:** llama plateaus ~765, vLLM scales to 24k — **paged KV + chunked prefill + kernel**.
+- Note: on GB10 vLLM's FP4 trump card is *broken* (falls back to Marlin); llama.cpp runs reliably — a real
+  viability point. vLLM is structurally ahead mainly via **paged KV, chunked prefill, cross-request prefix cache**.
+
+## Phases
+
+### Phase 1 — Hardware-tuned config (PR #10411) — DONE
+Folded into the hardware-defaults path (`core/config/hardware_defaults.go`):
+- Blackwell physical batch (n_ubatch) = 2048.
+- **VRAM-scaled `n_parallel` default** (>=32GiB→8, >=8→4, >=4→2): turns on concurrency + continuous batching,
+  which the backend leaves OFF at its `n_parallel=1` default. Unified KV → slots share the budget (no extra
+  KV memory). Single-host (local GPU) + distributed router (per node). Already-good defaults confirmed:
+  flash-attn=auto, context=4096.
+
+### Phase 2 — Paged / block KV cache  ← biggest structural multi-user lever
+vLLM's PagedAttention lifts KV utilization ~20-38% → ~96%. llama.cpp's own A10G data (draft PR #22569):
+contiguous OOMs at 26 seqs / 496 t/s → paged 247 seqs / 1256 t/s (**~9.5× concurrency, 2.5× aggregate**).
+- Build on / complete **upstream draft PR #22569** (`-kvp`, block manager + paged-attn ggml op, FCFS scheduler)
+  rather than the from-scratch series we prototyped (`paged/`). Our CPU-verified block manager + gather-read
+  design informs the review/port; the upstream momentum is the place to land it.
+- Phase 2b: cross-request prefix sharing (block-hash dedup) — our `PagedKVManager` already implements it.
+
+### Phase 3 — Prefill amortization (chunked prefill + n_batch/n_ubatch split)
+llama aggregate prefill plateaus because (a) one prompt saturates compute, (b) the per-forward GEMM M-dim is
+capped at `n_ubatch`=512, (c) no scheduler chunked prefill (draft #10718 abandoned).
+- Split logical `n_batch` from physical `n_ubatch` (LocalAI ties them today) so concurrent prefills batch into
+  a larger logical batch while keeping ubatch at the Blackwell sweet spot (2048).
+- Chunked prefill + prefill/decode co-batching in the server slot scheduler.
+
+### Phase 4 — Batched-GEMM kernel tuning (the decode 2.2× + prefill height)
+Per `BLACKWELL_KERNEL_GAPS.md`: dense int8-MMQ at ~21% of ceiling, MoE FP4-MMA at ~5%. Both untuned for
+Blackwell. To MATCH: tune MMQ or a Marlin-style W4A16 BF16 GEMM (FP4 not required — GB10 is INT8==BF16). To
+BEAT (2×): fix+tune the existing FP4-MMA on sm_121 (build-flag/`-O3`-miscompile, not greenfield).
+
+### Phase 5 — Backend GPU sampling
+CPU per-sequence sampling caps GPU util ~60% beyond n_parallel ~8-16 (upstream PR #17004). Track/adopt.
+
+### Cross-cutting — Speculative decoding (single-user speed, quality-preserving)
+Dense ≥14B: lossless ~1.8-3×. llama.cpp has `-md`/`--spec-draft-*`. Wire a draft-model field in the model
+config + ship Qwen3 target+draft (1.7B) pairs in the gallery. NOT for MoE-A3B (nothing to amortize).
+
+## Sequencing rationale
+Phase 1 (config) ships now — biggest immediate multi-user win for zero kernel work (concurrency was OFF).
+Phase 2 (paged KV) is the highest-leverage structural build and has upstream momentum. Phases 3-4 are deeper
+(scheduler + kernel). Spec-dec is independent and can land any time for single-user speed.
--- a/backend/cpp/llama-cpp/paged/PR17004_EVAL.md
+++ b/backend/cpp/llama-cpp/paged/PR17004_EVAL.md
@@ -0,0 +1,90 @@
+# PR #17004 (backend / GPU sampling) evaluation on DGX Spark (GB10, sm_121)
+
+Date: 2026-06-21. Hardware: NVIDIA GB10 (sm_121), CUDA 13.0, cmake 3.28.
+Model: `Qwen3-32B-Q4_K_M.gguf`. LocalAI pin: `LLAMA_VERSION=f3e182816421c648188b5eab269853bf1531d950` (2026-06-17).
+
+## TL;DR (clean negative)
+
+1. **PR #17004 is MERGED and is ALREADY present in our pinned llama.cpp `f3e1828`.** There is nothing to apply / cherry-pick / patch. The `-bs/--backend-sampling` CLI arg, the `llama_set_sampler` / `llama_get_sampled_*` API, and the GPU argsort/top-k/cumsum/softmax kernels are all in the pin.
+2. **The prescribed benchmark cannot test the fix.** `llama-batched-bench` does ZERO sampling - it feeds random tokens (`std::rand() % n_vocab`). Its ~540 t/s plateau is therefore **not** sampling-bound, and enabling backend sampling does nothing to it. The valid tool is `llama-batched` (examples/batched), which the PR updated to drive per-sequence sampler chains and which actually exercises `-bs`.
+3. **In a controlled real-sampling A/B (same `llama-batched` harness, CPU vs GPU sampler), GPU sampling gave only +25% at np=32, +3% at np=64, and CRASHED (`GGML_ASSERT(obj_new)`, graph-context alloc) at np=128 and np=256** - exactly the multi-user regime the investigation cares about.
+4. **nsys at np=64: GPU kernel profile and GPU-busy time are essentially identical with and without the fix** (CPU 392.5 t/s / GPU 404.2 t/s; total GPU kernel+memop time ~4.05 s in both). Sampling kernels do not even appear among the top GPU contributors. GPU utilization did **not** rise.
+5. **Conclusion: PR #17004, in the state shipped by our pin, does NOT break the ~540 plateau and does not move decode aggregate toward the ~2700 GPU-bound ceiling or past vLLM's 667.** It is modest at low parallelism and unusable (crash) at the high parallelism in question. The PR's own guidance ("recommended `--parallel 1`", "will take time to mature") matches what we measured.
+
+## 1. What PR #17004 does + state
+
+- Title: "sampling : add support for backend sampling". **State: MERGED** into `master` (PR head branch `gpu-sampling`). 44 files, +4133/-296.
+- `libllama`: new `llama_context_params.samplers` / `n_samplers`, `llama_set_sampler`, `llama_get_sampled_*`, `llama_sampler_seq_config`, updated `llama_sampler_i`. Sampler chain can now run inside the compute graph on the backend (GPU) instead of on the CPU after `llama_decode`.
+- CUDA: optimized/new `argsort`, `top-k`, `cumsum`, `softmax` kernels; CMake option `-DGGML_CUDA_CUB_3DOT2=ON` (builds a CCCL v3.2 prerelease for faster top-k).
+- Tools: new `-bs, --backend-sampling` arg in `common/arg.cpp` (line 1921); server (`server-context.cpp`) per-slot wiring; `examples/batched/batched.cpp` updated.
+- Supported backend samplers: `top-k`, `top-p`, `min-p`, `temp` (+ dist). **Limitations (from the PR): not compatible with grammar sampling; single output per sequence per batch; no save/load of sampling state; recommended only with `--parallel 1` and CUB_3DOT2.** Open follow-ups: #18547 (avoid graph reallocations), #18550 (skip inactive samplers in parallel decode).
+- It DOES target the CPU-side per-sequence sampling stall we hypothesised - the mechanism is correct. Maturity is the problem.
+
+Note: the GitHub API reports `mergedAt: 2026-01-04`, but the PR contains June 2026 upstream-merge commits and the feature is verified present in our 2026-06-17 pin, so treat the date field as a metadata quirk. What matters: the code is in `f3e1828`.
+
+## 2/3. Apply + build
+
+No apply needed (already in pin). Built from a clean `git worktree` at `f3e1828` (`~/llama-pr17004`), to avoid disturbing the existing diffusion build:
+
+```
+cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
+  -DCMAKE_CUDA_ARCHITECTURES=121 -DLLAMA_MAX_SEQ=256 \
+  -DGGML_CUDA_CUB_3DOT2=ON -DLLAMA_CURL=OFF
+cmake --build build --target llama-batched llama-batched-bench -j20
+```
+
+**Build: SUCCESS** (CUB_3DOT2=ON FetchContent fetched and compiled despite a flaky network connection; sm_121; LLAMA_MAX_SEQ=256). `-bs/--backend-sampling` confirmed present in `llama-batched --help`.
+
+## 4. Decode aggregate: fix vs baseline vs vLLM
+
+### 4a. `llama-batched-bench` (NO sampling - reconfirms the plateau, unaffected by the fix)
+`-npp 16 -ntg 128 -npl 32,64,128,256 -c 40960 -b 2048 -ub 2048`
+
+| npl | S_TG t/s |
+|-----|----------|
+| 32  | 241.8 |
+| 64  | 395.1 |
+| 128 | 542.6 |
+| 256 | 567.2 |
+
+Reproduces the ~540 plateau. Because this tool never samples, `-bs` is irrelevant here - the plateau is decode/host-overhead-bound, not sampling-bound.
+
+### 4b. `llama-batched` real-sampling A/B (CPU sampler vs `-bs` GPU sampler, identical harness)
+`-kvu -n 128 -np {32,64,128,256} -c 40960 --seed 1` (samplers: top-k 40 / top-p 0.95 / temp 0.8)
+
+| np  | CPU sampling t/s | GPU `-bs` sampling t/s | delta |
+|-----|------------------|------------------------|-------|
+| 32  | 174.1 | 217.5 | +25% |
+| 64  | 390.5 | 403.4 | +3.3% |
+| 128 | 497.9 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
+| 256 | 396.7 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
+
+(`llama-batched` absolute t/s is lower than `batched-bench` because it does real sampling plus per-token detokenize/string/stream work; the A/B *within* this harness isolates the sampler cost.)
+
+**Does the fix break the plateau? No.** GPU sampling helps only at low parallelism and the gain shrinks as np rises (+25% -> +3%), then the path crashes at np>=128 - i.e. it fails in exactly the multi-user regime where the plateau matters. It does not approach the ~2700 ceiling and does not pass vLLM's 667. The CPU-sampling curve itself peaks at np=128 (498) and *drops* at np=256 (397), confirming CPU sampling is a scaling wall - but PR #17004 as shipped does not lift it because the GPU path is unstable there.
+
+## 5. GPU-utilization mechanism (nsys, np=64, the highest np where `-bs` survives)
+
+`nsys profile -t cuda ... -n 96 -np 64`
+
+| mode | decode t/s | total GPU kernel+memop time | top GPU contributors |
+|------|-----------|------------------------------|----------------------|
+| CPU sampling | 392.5 | ~4.07 s | mul_mat_q (55%+17%), flash_attn (5.7%), mul_mat_vec (2%) |
+| GPU `-bs`    | 404.2 | ~4.04 s | identical set; sampling kernels not in top contributors |
+
+GPU-busy time and the kernel mix are **essentially unchanged** between modes. The argsort/top-k/cumsum/softmax sampling kernels are negligible in the timeline; the only visible difference is H2D memcpy *instances* rising 1,495 -> 7,076 (pinned-memory sampler transfers) at ~unchanged total memcpy time. **GPU utilization did not rise.** This directly refutes the idea that, for this workload, GPU idle time is dominated by CPU sampler arithmetic - moving the sampler onto the GPU barely changed throughput (+3%) and did not raise GPU occupancy. The ~80% idle time measured elsewhere is dominated by something other than the sampler math (host-side batch construction / synchronization / detokenization), which PR #17004 does not address.
+
+(np=256 nsys "with fix" could not be captured: `-bs` aborts there. Fixing the crash needs the unmerged follow-ups #18547/#18550, not in our pin.)
+
+## LocalAI adoption path
+
+**The code arrives transparently with a version bump; enabling it is not transparent.**
+
+- `backend/cpp/llama-cpp/prepare.sh` copies all of upstream `llama.cpp/tools/server/*` (including the #17004-modified `server-context.cpp` / `server-task.cpp` / `server-common.cpp`) into `tools/grpc-server/`, and `grpc-server.cpp` `#include`s them. So once `LLAMA_VERSION` points at a commit containing #17004 (our pin `f3e1828` already does), the backend-sampling machinery compiles into `grpc-server` automatically. **No vendored patch in `patches/` is required for the code.**
+- The vendored `server-context.cpp` already does the per-slot wiring (around line 1615): `backend_sampling &= task.params.sampling.backend_sampling`, also disabled for speculative decode and for pre-sampling logits (`n_probs>0`), then `llama_set_sampler(ctx_tgt, slot.id, common_sampler_get(slot.smpl))`.
+- **But it is OFF unless `task.params.sampling.backend_sampling == true`.** LocalAI's `grpc-server` builds `params` itself from the gRPC request and never sets this flag (and does not pass the upstream `--backend-sampling` CLI arg). So as-is, LocalAI compiles the feature but never uses it. **A small grpc-server change is needed**: read a LocalAI model option / env and set `params.sampling.backend_sampling = true` (global or per-request).
+- For performant CUDA top-k, add `-DGGML_CUDA_CUB_3DOT2=ON` to the llama-cpp CUDA `CMAKE_ARGS` in the Makefile (optional; a non-CUB fallback exists).
+- **Caveats that blunt the benefit for LocalAI specifically:** grammar-constrained requests (JSON-schema / tool calls - a large share of LocalAI traffic), `logprobs`/`n_probs>0`, and speculative decoding all fall back to CPU sampling by the gating above; and the GPU path crashes at np>=128 in this pin. So even after wiring the flag, the multi-user throughput case would not benefit (and would crash) until the follow-up PRs (#18547/#18550) land and stabilise high-parallelism backend sampling.
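+
+The gating described in this list can be summarised as a single predicate. This is a
+hedged Python sketch of the conditions as stated here, not the `server-context.cpp`
+code; the `SamplingRequest` type and its field names are hypothetical.
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class SamplingRequest:
+    backend_sampling: bool = False  # the flag LocalAI's grpc-server never sets today
+    has_grammar: bool = False       # JSON-schema / tool-call constrained output
+    n_probs: int = 0                # logprobs requested when > 0
+    speculative: bool = False       # speculative decoding active for the slot
+
+def uses_gpu_sampler(req: SamplingRequest) -> bool:
+    """True only when the request opts in and no fallback condition applies."""
+    return (req.backend_sampling
+            and not req.has_grammar
+            and req.n_probs == 0
+            and not req.speculative)
+
+print(uses_gpu_sampler(SamplingRequest()))                       # default: CPU path
+print(uses_gpu_sampler(SamplingRequest(backend_sampling=True)))  # opted in
+```
+
+Any one of the three fallback conditions is enough to route a request back to the
+CPU sampler, even once the flag is wired through.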
+
+### Recommendation
+Do **not** adopt PR #17004 as the multi-user throughput fix yet. It is already in the tree but is immature at the parallelism that matters (crashes at np>=128, modest gains below). The measured bottleneck at this workload is not the sampler arithmetic (nsys shows GPU-busy unchanged when sampling moves to GPU). Re-evaluate after #18547/#18550 merge into a future pin; revisit the host-side decode/batch-construction overhead as the more likely real lever.
--- a/backend/cpp/llama-cpp/paged/PR22569_EVAL.md
+++ b/backend/cpp/llama-cpp/paged/PR22569_EVAL.md
@@ -0,0 +1,136 @@
+# Evaluation: llama.cpp PR #22569 (paged KV cache, `-kvp`) on DGX Spark (GB10, sm_121)
+
+Question: is upstream draft PR #22569 the right base to give LocalAI vLLM-class
+high-concurrency GPU throughput, or should we finish our own from-scratch P4
+(`backend/cpp/llama-cpp/paged/`)?
+
+Date: 2026-06-21. Hardware: NVIDIA GB10 (compute 12.1 / sm_121), 122502 MiB unified
+memory, CUDA 13.0, gcc 13.3. Models: `Qwen3-32B-Q4_K_M.gguf` (18.4 GB, 64 layers,
+n_head 64 / n_head_kv 8 / head_dim 128 / n_embd 5120) and `Qwen3-0.6B-Q8_0.gguf` for
+the correctness gate.
+
+## TL;DR verdict: DO NOT adopt #22569. Finish our own P4.
+
+On GB10 with a 32B dense model, PR #22569 delivers **no throughput win and no concurrency
+win** - it is ~12% *slower* than the existing contiguous path and hits the *same*
+256-sequence ceiling. The "scale to thousands of sequences like vLLM" premise does not
+hold for this PR or this hardware/model. On top of that it is broken out of the box,
+wired to the wrong integration surface, and a contested draft.
+
+## 1. Builds? Correct?
+
+- **Builds: YES.** Cloned `matiaslin/llama.cpp@paged_attention` (PR #22569, single commit
+  `0b0f7bd...`, base = current master). Clean CUDA build for sm_121
+  (`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`).
+  `llama-paged`, `llama-batched-bench`, `test-paged-kv`, `test-paged-kv-e2e` all link.
+  It is self-contained (ships its own CPU+CUDA `ggml_paged_attn` op) and does **not**
+  depend on the competing CUDA PR #17579 (ericcurtin, `--pagedattention`).
+
+- **Runs out of the box: NO.** `llama-paged -kvp` on Qwen3-32B *and* Qwen3-0.6B crashes
+  at context creation:
+  `build_attn(llm_graph_input_attn_kv_paged*) -> ggml_reshape_2d ->`
+  `GGML_ASSERT(ggml_nelements(a) == ne0*ne1)` (src/llama-graph.cpp:2556). Same crash with
+  `--fit off` (so it is the real graph, not just the memory probe).
+  **Root cause:** the paged path hardcodes `ggml_reshape_2d(cur, hparams.n_embd, ...)`,
+  wrong for any model where `n_head*head_dim != n_embd`. Qwen3 decouples head_dim:
+  32B = 64*128 = **8192** vs n_embd 5120; 0.6B = 16*128 = **2048** vs 1024. The PR's
+  "qwen3 verified" claim does **not** hold against current Qwen3 GGUFs. Fix is ~1 line
+  (use the real attention width `cur->ne[0]*cur->ne[1]`); applied for the rest of the eval.
+
+- **`fit_params` (`-ngpub` auto-sizing) also crashed on GB10** in the same reshape path
+  during the device-memory probe (before the fix). After the reshape fix, paged
+  auto-fit works (sized 96624 GPU blocks on the 0.6B from 85 GiB free).
+
+- **Correctness after the reshape fix:** paged decode runs and produces **coherent**
+  output on Qwen3-32B (sensible mercury / miso-soup / Starry-Night answers across 128 and
+  256 concurrent sequences), indicating the `ggml_paged_attn` op is functionally roughly
+  correct. The PR's own greedy/top-K equivalence test (`test-paged-kv-e2e`, top-K argmax +
+  top-5 overlap >= 4 + first-4-token greedy match vs non-paged) on Qwen3-0.6B did
+  **not** reach a PASS/FAIL verdict on GB10: its paged auto-fit grabs ~88 GiB
+  (96531 blocks) and the run then stalls at cache init (a third GB10 fit-robustness
+  issue, distinct from the reshape bug). So the formal greedy-equivalence gate is
+  **unverified on this box**, but the qualitative evidence (coherent multi-sequence 32B
+  output with explicit small `-ngpub`) indicates the fixed op is roughly correct. This
+  does not change the verdict, which is decided by throughput below.
+
+## 2. Throughput: paged vs contiguous on GB10 (Qwen3-32B-Q4_K_M)
+
+Contiguous = `llama-batched-bench` (unified KV, continuous batching), S_TG decode tok/s.
+Paged = `llama-paged -kvp --fit off` (its scheduler-driven continuous-batching loop),
+`aggregate tps`. Both use `npp~16, ntg/n_predict=128, n_batch=n_ubatch=2048, -ngl 99`.
+
+| npl  | contiguous (S_TG t/s) | paged `-kvp` (agg t/s) | outcome |
+|------|----------------------|------------------------|---------|
+| 128  | **537** (S 553)      | **477**                | both run; paged ~12% slower |
+| 256  | **541** (S 550)      | **471**                | both run; paged ~13% slower; neither gains over 128 |
+| 512  | FAIL                 | FAIL                   | **both** die: `n_seq_max must be <= 256` |
+| 1024 | FAIL                 | FAIL                   | **both** die: `n_seq_max must be <= 256` |
+
+### The decisive facts
+
+1. **PR #22569 does NOT lift the 256-sequence ceiling.** Both contiguous and paged fail
+   identically at npl 512/1024 with `n_seq_max must be <= 256` (llama.cpp's compile-time
+   `LLAMA_MAX_SEQ`). It is **not** an OOM - GB10 has 119 GiB and at npl=256 contiguous KV
+   is only 16 GiB. Paging gives **zero** concurrency headroom over contiguous here. The
+   "paged unlocks thousands of seqs" premise is false for this PR.
+
+2. **Paged is slower, not faster.** The fresh `ggml_paged_attn` op (477/471 t/s) loses to
+   the mature CUDA flash-attention contiguous path (537/541 t/s) by ~11-13% at equal
+   concurrency. The PR's A10G "2.5x" came entirely from contiguous OOMing at 26 seqs on a
+   24 GiB card; that lever does not exist on GB10's 119 GiB.
+
+3. **The 32B dense model is compute-bound and plateaus by npl=128 on GB10.** Aggregate is
+   flat from 128->256 (contiguous 537->541; paged 477->471). Doubling concurrency buys
+   nothing because the GPU is already saturated on the 32B weight matmuls. Even if we
+   recompiled with a larger `LLAMA_MAX_SEQ`, aggregate would not climb - so vLLM-class
+   ~24k aggregate is **unreachable for 32B-dense on a single GB10 regardless of KV
+   layout**. The throughput gap to vLLM at this model/hardware is a compute/bandwidth
+   problem, not a KV-fragmentation problem.
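+
+The percentages and the plateau in facts 2 and 3 follow directly from the table. A minimal
+sketch that recomputes them, using only the four measured tok/s values above (the helper
+name `slowdown_pct` is illustrative, not part of any tool):
+
+```python
+# Decode throughput (tok/s) copied from the table above.
+contiguous = {128: 537, 256: 541}
+paged = {128: 477, 256: 471}
+
+def slowdown_pct(base, other):
+    """Percent by which `other` trails `base`."""
+    return 100.0 * (base - other) / base
+
+for npl in (128, 256):
+    print(f"npl={npl}: paged is {slowdown_pct(contiguous[npl], paged[npl]):.1f}% slower")
+
+# Scaling from npl=128 to npl=256: a ratio near 1.0 means a plateau.
+contiguous_scaling = contiguous[256] / contiguous[128]
+paged_scaling = paged[256] / paged[128]
+print(f"contiguous 128->256: x{contiguous_scaling:.3f}")
+print(f"paged      128->256: x{paged_scaling:.3f}")
+```
+
+Both scaling ratios land within about 1% of 1.0, which is what "flat from 128->256" means here.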
+
+## 3. Verdict and reasoning: finish our own P4
+
+**Do not adopt #22569 as the base.** Reasons:
+
+- **No win on target hardware.** Even fully completed, on GB10 + 32B it is slower than
+  what we already have and capped at the same 256 seqs. There is no throughput or
+  concurrency dividend to harvest here.
+- **Wrong integration surface.** Paged is driven only by a brand-new parallel C API
+  (`llama_paged_scheduler_init/add_request/prepare_batch/get_batch_info/update/...`) and a
+  bespoke `examples/paged` loop. `-kvp`/`--kv-paged` is gated to `LLAMA_EXAMPLE_PAGED`
+  only - it is NOT wired into `llama-server`/`batched-bench`/`parallel`, i.e. NOT the path
+  LocalAI's grpc-server derives from. Adopting it means rewriting LocalAI's serving loop
+  around the new scheduler API.
+- **Broken / restricted.** Crashes out of the box on all current Qwen3 (and any
+  decoupled-head-dim model); fit_params crashed; Phase-1 restrictions enforced at context
+  creation: single CUDA device, full offload only, `n_batch == n_ubatch`, no SWA
+  (gemma3/llama4/etc. unsupported), no CoW / prefix-caching, no
+  `seq_cp`/`seq_keep`/`seq_div`/`seq_add`, no state save/load.
+- **Contested draft.** Unmerged; the author is openly asking maintainers whether the C
+  API is even the right design; maintainers are skeptical of paged for single-node use.
+
+**What P4 should actually target (re-scoped by this data).** The aggregate-throughput
+gap to vLLM on a compute-bound dense model on one GB10 is not addressable by paged KV.
+The durable, real LocalAI wins from paging are the ones our from-scratch P0 already
+implements the machinery for and that #22569 explicitly omits:
+- **on-demand KV sizing** (fit more *diverse* concurrent tenants without per-seq
+  over-reservation), and
+- **automatic cross-tenant prefix sharing** (chained-hash block cache - shared system
+  prompts / RAG preambles), which #22569 defers to a non-existent Phase 2.
+
+Finish our own P4 (CPU gather-read + a CUDA gather-read) against these capacity/
+prefix-sharing objectives - measured as max concurrent *distinct* tenants and KV memory
+saved, not single-model aggregate tok/s. To chase raw aggregate, the levers are lifting
+`LLAMA_MAX_SEQ` and smaller/MoE models in memory-bandwidth-bound regimes - orthogonal to
+paged attention. The ~1-line reshape fix found here (and the GB10 fit_params crash) are
+worth upstreaming to #22569 regardless, but the PR is not our base.
+
+### Reproduction (DGX, `~/llama.cpp-pr22569`)
+```sh
+export PATH=/usr/local/cuda/bin:$PATH
+# contiguous
+./build/bin/llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -npp 16 -ntg 128 \
+  -npl 128 -c 20480 -b 2048 -ub 2048        # 256/512/1024 -> n_seq_max must be <= 256
+# paged (needs the src/llama-graph.cpp:2556 reshape fix: hparams.n_embd -> cur->ne[0]*cur->ne[1])
+./build/bin/llama-paged -m Qwen3-32B-Q4_K_M.gguf -kvp --fit off -ngpub 2048 -ncpub 128 \
+  -np 128 -ns 128 -n 128 -b 2048 -ub 2048 -ngl 99   # 512/1024 -> n_seq_max must be <= 256
+```
--- a/backend/cpp/llama-cpp/paged/README.md
+++ b/backend/cpp/llama-cpp/paged/README.md
@@ -0,0 +1,95 @@
+# Paged Attention for llama.cpp (vLLM-parity), CPU-first
+
+A from-scratch port of vLLM V1's paged KV-cache model into the llama.cpp / ggml
+world, built CPU-first and verified incrementally. The host-side block manager is
+a faithful port of vLLM; the compute stays in ggml (no new op — the read path
+gathers blocks with `ggml_get_rows` and feeds the existing attention ops).
+
+Design: `docs/superpowers/specs/2026-06-19-paged-attention-llamacpp-design.md`
+Plan:   `docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md`
+
+## Status
+
+| Phase | What | State |
+|------|------|-------|
+| P0 | vLLM-parity host block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache) | ✅ verified — `make check`, 4/4 suites |
+| P1 | ggml paged write/gather mechanism (`set_rows` by slot_mapping → `get_rows` gather) | ✅ verified — `make ggml-check`, non-contiguous blocks `[2,1,5]` round-trip + isolation |
+| P2 (core) | attention over gathered paged KV matches independent host reference | ✅ verified — max abs err **7.5e-08** |
+| P3 (partial) | capacity & prefix-sharing wins | ✅ measured — `make bench`: **9.2×** more concurrent seqs, **11.3×** less KV memory |
+| **P3 (in-model placement)** | **paged, non-contiguous block KV placement in the real model** | ✅ **Gate 0 PASSED** — Qwen3-0.6B token-identical (`patches/0001-paged-kv-block-placement.patch`) |
+| P4 (in-model compute) | gather-read (`build_attn_paged`, read only a seq's blocks) + win-2 throughput + multi-seq | ⛔ remaining |
+
+The design's central risk — *does paged (non-contiguous) KV produce correct attention?* —
+is **retired at two levels**: (1) at the ggml-op level (P2, 7.5e-08 vs reference) and
+(2) **in a real model** (P3): with KV physically scattered across permuted, non-contiguous
+blocks (cells `0-15, 144-159, 32-47, …`), Qwen3-0.6B greedy generation is **token-for-token
+identical** to the contiguous cache. Reproduce:
+
+```sh
+# from backend/cpp/llama-cpp-fallback-build/llama.cpp (patch applied, CPU build)
+B=build-cpu/bin/llama-simple; M=<Qwen3-0.6B.Q4_K_M.gguf>; P="...long prompt..."
+"$B" -m "$M" -n 40 "$P"                         > base.txt
+LLAMA_KV_PAGED=1 "$B" -m "$M" -n 40 "$P"        > paged.txt
+diff base.txt paged.txt && echo TOKEN-IDENTICAL
+# LLAMA_KV_PAGED_DEBUG=1 prints the permuted physical cells per step
+```
+
+This proves the **storage/placement** layer of paged attention in-model. What remains (P4)
+is the **compute** optimization that yields the throughput win: a gather-read that attends
+only a sequence's own blocks (instead of scanning `[0,n_kv)` with a mask), plus the
+multi-sequence driver to measure tok/s vs concurrency. The patch is single-sequence scope.
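+
+To make the difference between placement and gather-read concrete, here is a small sketch of
+the block-table arithmetic. It assumes a block size of 16 (inferred from the 16-cell runs
+`0-15, 144-159, 32-47` quoted above) and uses a plain Python list as a stand-in for the KV
+cache; `slot_mapping` here is an illustrative helper, not the function in
+`paged_kv_manager.cpp`.
+
+```python
+BLOCK_SIZE = 16  # assumed: inferred from the 16-cell runs quoted above
+
+def slot_mapping(block_table, n_tokens, block_size=BLOCK_SIZE):
+    """Physical cell for each logical token position of one sequence."""
+    return [block_table[pos // block_size] * block_size + pos % block_size
+            for pos in range(n_tokens)]
+
+# Physical blocks 0, 9, 2 correspond to cells 0-15, 144-159, 32-47.
+block_table = [0, 9, 2]
+slots = slot_mapping(block_table, n_tokens=40)
+
+# A gather-read touches only these cells, in logical order,
+# instead of scanning every cell in [0, n_kv) under a mask.
+kv_cache = [f"kv@{cell}" for cell in range(160)]  # stand-in for real K/V rows
+gathered = [kv_cache[s] for s in slots]
+
+print(slots[0], slots[15], slots[16], slots[31], slots[32], slots[39])
+```
+
+In this toy case the gather reads 40 cells, while a masked scan would have to cover all 160.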
+
+## Build & test
+
+```sh
+make check                     # P0 host-manager unit suites (pure C++, no deps)
+make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build>   # P1/P2 ggml tests
+make bench                     # P3 capacity + prefix-sharing numbers
+```
+
+`ggml-check` needs a built ggml. To build one CPU-only from a llama.cpp checkout:
+`cmake -S <llama.cpp>/ggml -B /tmp/ggml-build -DGGML_CUDA=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build /tmp/ggml-build -j`
+(if it complains about a missing `ggml.pc.in`, add a minimal pkg-config stub).
+
+## Files
+
+- `paged_kv_manager.{h,cpp}` — the vLLM-parity block manager (no ggml/llama dep).
+- `tests/test_free_block_queue.cpp` — intrusive LRU free list.
+- `tests/test_block_pool.cpp` — alloc/touch/free/evict/cache.
+- `tests/test_paged_kv_manager.cpp` — allocate/block_table/slot_mapping/free.
+- `tests/test_prefix_cache.cpp` — chained block hashing + first-miss cache hit.
+- `tests/test_ggml_paged_rw.cpp` — paged write/gather through real ggml ops.
+- `tests/test_ggml_paged_attn.cpp` — attention over paged KV vs host reference.
+- `paged-bench.cpp` — capacity (win 1) + prefix-sharing (win 3) measurements.
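+
+As a rough illustration of what "chained block hashing" means, the sketch below hashes each
+full block together with the hash of the block before it, so two sequences can share a
+cached block only when their entire prefix matches. The block size, the SHA-256 choice, and
+both function names are assumptions made for this example; the real logic lives in
+`paged_kv_manager.{h,cpp}` and may differ.
+
+```python
+import hashlib
+
+BLOCK_SIZE = 16  # hypothetical block size for this sketch
+
+def block_hashes(tokens, block_size=BLOCK_SIZE):
+    """Chained hashes of the full blocks of a token sequence.
+
+    Each block's hash covers its own tokens plus the previous block's hash,
+    so two sequences share a hash only if their whole prefix matches.
+    """
+    hashes, parent = [], b""
+    n_full = len(tokens) // block_size
+    for i in range(n_full):
+        block = tokens[i * block_size:(i + 1) * block_size]
+        payload = parent + b"|" + ",".join(map(str, block)).encode()
+        parent = hashlib.sha256(payload).digest()
+        hashes.append(parent)
+    return hashes
+
+def shared_prefix_blocks(a, b):
+    """Number of leading blocks two sequences could share in the cache."""
+    n = 0
+    for ha, hb in zip(block_hashes(a), block_hashes(b)):
+        if ha != hb:
+            break
+        n += 1
+    return n
+
+system_prompt = list(range(100, 132))           # 32 tokens = 2 full blocks
+tenant_a = system_prompt + list(range(1, 20))   # same preamble, different tail
+tenant_b = system_prompt + list(range(50, 69))
+print(shared_prefix_blocks(tenant_a, tenant_b))
+```
+
+The two tenants share the two blocks that hold the common system prompt and diverge at the
+third, which is the cross-tenant prefix-sharing case the block cache is meant to serve.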
+
+## Remaining work — integration map (for the next session)
+
+Target: a paged read path active behind a flag, producing **token-identical** greedy
+output vs the contiguous cache on a real model (Gate 0), then `paged-bench` win 2.
+
+Exact seams in the vendored llama.cpp (`backend/cpp/llama-cpp-fallback-build/llama.cpp`,
+the pinned build fetches `LLAMA_VERSION=f3e182816421…`):
+
+1. **Memory type** — `src/llama-model.cpp:2070` `create_memory()` constructs `llama_kv_cache`.
+   Add a paged variant (or a flag on the existing cache) implementing `llama_memory_i`
+   (`src/llama-memory.h`), backed by `PagedKVManager`.
+2. **Allocation** — `src/llama-kv-cache.cpp:818` `find_slot()` produces `slot_info.idxs`.
+   Replace the ring-buffer scan with block-aligned allocation from `PagedKVManager`.
+3. **Read path** — `src/llama-kv-cache.cpp:1145/1165` `get_k`/`get_v` return a contiguous
+   `[0,n_kv)` view. For paged, gather the sequence's blocks (`ggml_get_rows`) into scratch.
+   The new branch lives alongside `build_attn` in `src/llama-graph.cpp` (`build_attn_mha`).
+4. **Mask** — `src/llama-graph.cpp` `build_attn_inp_kq_mask` sizes the mask to the gathered
+   length per sequence.
+5. **Gate 0 driver** — `build-cpu/bin/llama-simple` (greedy argmax) on
+   `Qwen3-0.6B.Q4_K_M.gguf`; assert paged output == contiguous output token-for-token.
+
+### Honest caveats (from the maintainer discussion + reading `find_slot`)
+
+- llama.cpp's **unified cache already shares one KV pool** across sequences and already
+  tolerates non-contiguous slots. So win-1 vs *unified* is smaller than vs per-seq
+  reservation (stream mode). The durable LocalAI wins are **on-demand sizing** and
+  **automatic cross-tenant prefix sharing** (P0 implements the block-hash machinery).
+- vLLM's classic `paged_attention_v1/v2` CUDA kernel is **deprecated**; the live path is
+  FlashAttention/FlashInfer over a block table. The port targets that pattern, not the
+  old kernel. Upstream draft PRs #22569 (new `ggml_paged_attn` op) and #17579 (CUDA) are
+  unmerged; maintainers are skeptical for single-user use.
--- a/backend/cpp/llama-cpp/paged/UPSTREAM_GGML_ISSUE.md
+++ b/backend/cpp/llama-cpp/paged/UPSTREAM_GGML_ISSUE.md
@@ -0,0 +1,78 @@
+# Upstream ggml issue draft: MXFP4 MoE prefill underutilizes Blackwell (GB10) — ~22 TFLOP/s, ~27× behind vLLM
+
+**Title:** CUDA: MXFP4 MoE prefill runs the Ampere-class warp `mma.sync`, far below Blackwell FP4 peak (GB10 / sm_121)
+
+## Summary
+
+On a GB10 (DGX Spark, sm_121), MXFP4 MoE prefill for Qwen3-Coder-30B-A3B is bottlenecked by
+`mul_mat_q<MXFP4>` (the per-expert grouped MMQ), which runs at only **~22 effective TFLOP/s** — a small
+fraction of the GPU's FP4 capability. Batched prefill plateaus at ~3.65k tok/s (B=32) vs vLLM FP8 ~99k
+on the same box (~27×). The native FP4 block-scaled `mma.sync` path (PR #17906 et al.) *is* engaged — the
+limit is that it's a warp-level MMA kernel, not a tcgen05/CUTLASS-class grouped GEMM.
+
+## Hardware / build
+
+- NVIDIA GB10, compute capability 12.1, 119 GiB unified LPDDR5X.
+- llama.cpp built `-DCMAKE_CUDA_ARCHITECTURES=121` (sm_121a/compute_121a confirmed in cubins).
+- Model: Qwen3-Coder-30B-A3B-Instruct, `MXFP4_MOE` (15.9 GiB, 4.47 BPW).
+
+## Measurements
+
+Single-stream (`llama-bench`, ub2048):
+
+| metric | Q8_0 | MXFP4 | vLLM FP8 |
+|---|---|---|---|
+| prefill pp2048 | ~2200 | 3441 | — |
+| decode tg128 | 62 | 86 | 52 |
+
+Batched (decode-phase aggregate `S_TG`; prefill aggregate `S_PP`):
+
+| B | llama MXFP4 prefill | vLLM FP8 prefill | llama MXFP4 decode | vLLM FP8 decode |
+|---|---|---|---|---|
+| 1 | 1625 | 9644 | 83 | 48 |
+| 8 | 3634 | 33373 | 267 | 312 |
+| 32 | 3651 | 99398 | 551 | 1171 |
+| 64 | 3648 | 151990 | 770 | 2064 |
+
+Decode is competitive (we win at B=1). **Prefill plateaus and is the gap.**
+
+## Profiling (nsys, MXFP4 pp2048 kernel time)
+
+| kernel | % |
+|---|---|
+| `mul_mat_q<(ggml_type)39>` (MXFP4 MoE GEMM) | **37.2** |
+| `mul_mat_q<(ggml_type)8>` (dense/attn, still Q8) | 10.1 |
+| `flash_attn_ext_f16` | 8.8 |
+| `quantize_mmq_mxfp4` (activation quant) | 8.0 |
+
+Only cutlass kernel present is `cutlass_80_tensorop` (Ampere). No tcgen05 / wgmma anywhere.
+
+## What we ruled out (so it's the kernel, not config)
+
+- **ubatch**: saturates at 2048 (pp4096: ub512 2994 → ub2048 3316 → ub8192 3180).
+- **tile width**: `mmq_x` already selects the full 128-wide tile at ub2048 (~128 tokens/expert).
+- **cuBLAS fallback**: `GGML_CUDA_FORCE_CUBLAS` is a no-op (3419 ↔ 3423 t/s) — dequant→cuBLAS-FP16 neither
+  helps nor hurts, i.e. the FP4 MMQ kernel isn't worse than FP16 cuBLAS, both hit a common ceiling.
+- prefill does **not** scale with bigger single prompts (attention O(N²) confounds): pp2048 3295, pp8192
+  1524, pp16384 2051 — so it's the many-sequence batched MoE GEMM, not batch size.
+
+## Proposal
+
+A tcgen05 / CUTLASS-3.x grouped-GEMM path for FP4 (MXFP4 + NVFP4) MoE on sm_120/121:
+- One grouped GEMM over all experts with per-group token offsets (full tiles regardless of tokens/expert),
+  vs today's per-expert MMQ scheduler.
+- Block-scaled `e2m1` operands via tcgen05 tensor-memory MMA (`mma.sync.aligned.kind::mxf4…` is the
+  warp-level form; the collective-mainloop/tcgen05 form is what extracts Blackwell throughput at prefill
+  tile sizes).
+- Fuse activation quantization (`quantize_mmq_mxfp4`, ~8%) into the permute/gather.
+- Optionally extend to dense layers (qkv/o_proj/lm_head) so full-model prefill is FP4/FP8.
+
+This mirrors what vLLM/FlashInfer/TensorRT-LLM do for Blackwell MoE. Happy to test iterations on the GB10.
+
+## Repro
+
+```sh
+llama-quantize qwen3coder-f16.gguf qwen3coder-mxfp4.gguf MXFP4_MOE
+llama-bench -m qwen3coder-mxfp4.gguf -ngl 99 -p 2048 -n 0 -ub 2048
+llama-batched-bench -m qwen3coder-mxfp4.gguf -ngl 99 -c 45056 -b 2048 -ub 2048 -npp 512 -ntg 128 -npl 1,8,32,64
+```
--- a/backend/cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md
+++ b/backend/cpp/llama-cpp/paged/VLLM_DECOMPOSITION.md
@@ -0,0 +1,83 @@
+# What makes vLLM fast on GB10 — kernel vs scheduler (code-grounded, measured)
+
+Decisive analysis (vLLM v0.23.0, torch 2.11+cu130, sm_121, model `RedHatAI/Qwen3-32B-NVFP4A16`, source at tag
+`v0.23.0`). **Answer: it's the scheduler, not the kernel.** This closes the kernel track and opens the
+scheduler track.
+
+## The decomposition (measured on the DGX, prefix-cache OFF, unique prompts)
+
+| | vLLM W4A16 Marlin | llama.cpp | verdict |
+|---|---|---|---|
+| **single-stream prefill** | ~800 t/s (~52 TFLOPS) | 718 MMQ / **1153 MXFP4** | **tied; llama.cpp MXFP4 wins** |
+| decode batch-1 | 11.8 t/s | ~similar | bandwidth-bound (≈190/273 GB/s); no kernel helps |
+| **aggregate decode** | 328 (N32) / 569 (N64) / **667 (N128)** | the gap | **~56× multiplier = scheduler** |
+
+vLLM's single-stream Marlin is **not** at the roofline — it's in the same ~4×-under regime as MMQ. The 24k
+headline is entirely the aggregate decode multiplier.
+
+## The kernel vLLM actually runs on sm_121 (W4A16, forced)
+
+Dispatch (vLLM v0.23.0): `compressed_tensors.py:704` (NVFP4 + no input-quant → `W4A4Fp4(use_a16=True)`) →
+`compressed_tensors_w4a4_nvfp4.py:28` → `kernels/linear/__init__.py:894` (`if use_a16: force_kernel =
+MarlinNvFp4LinearKernel`, **unconditional, no cc gate**) → `nvfp4/marlin.py` → `marlin_utils_fp4.py:182`
+`ops.marlin_gemm(b_q_type=float4_e2m1f)`, activations FP16/BF16. csrc: `csrc/quantization/marlin/marlin.cu`,
+`marlin_template.h`, and `marlin.cuh`.
+
+Techniques = **exactly the playbook we proved loses on GB10**: XOR shared swizzle (`marlin_template.h:722
+^ (row%8)`), 4-stage cp.async pipeline (`marlin.cu:396 stages=4`, `cp_async_wait<stages-2>`), ldmatrix+mma,
+FP16/BF16 acts. Native FP4 (`FlashInferB12xNvFp4LinearKernel`) needs `Sm120BlockScaledDenseGemm` cubins absent
+on GB10 → W4A4 hangs → forced W4A16 Marlin fallback. **Nothing to port; vLLM's kernel is occupancy-blocked too.**
+
+## The scheduler (the real multiplier) — what llama.cpp lacks
+
+- **Paged KV cache** (`vllm/v1/core/kv_cache_manager.py`, `block_pool.py`): block KV, no fragmentation → very
+  high concurrent batch. **llama.cpp: NO** (contiguous per-slot KV → fragmentation caps real concurrency).
+- **Chunked prefill** (`config/scheduler.py:84 enable_chunked_prefill=True`, default ON): interleaves prefill
+  chunks with decode so decode batches stay full. **llama.cpp: NO** (a long prefill stalls the decode batch).
+- **Continuous batching** (`v1/core/sched/scheduler.py`): per-step admit/evict. **llama.cpp: YES** (`n_parallel`,
+  rudimentary — we enabled VRAM-scaled slots in #10411).
+
+## Sizing the scheduler gap — MEASURED (llama.cpp aggregate, the surprise)
+
+`llama-batched-bench` Qwen3-32B-Q4_K_M, npp=128 ntg=128, npl scaling (DGX):
+
+| npl | S_PP (agg prefill) | **S_TG (agg decode)** | vLLM decode | llama % of vLLM |
+|---|---|---|---|---|
+| 1 | 628 | 10.2 | 11.8 | 86% |
+| 8 | 773 | 59.8 | - | - |
+| 32 | 763 | **235** | **328** | **72%** |
+| 64 | 761 | **391** | **569** | **69%** |
+| 128 | 762 | **540** | **667** | **81%** |
+
+**The "30x gap" headline is wrong for realistic concurrency.** llama.cpp's continuous batching already
+captures **~70-81% of vLLM's aggregate decode** at npl<=128, with a near-identical multiplier (10.2 -> 540 =
+**53x**, vs vLLM's 56x). And it is still climbing at 128, if sublinearly (1.66x for 32->64, 1.38x for 64->128; not plateaued). Combined with llama.cpp being
+*ahead* single-stream (MXFP4 1153 > vLLM 800), **llama.cpp is already broadly competitive with vLLM on GB10 at
+self-hosted concurrency.**
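+
+The ratios quoted in this section can be rechecked from the table. A minimal sketch, using
+only the measured values above (the `rows` dictionary is just a transcription of the table):
+
+```python
+# (S_TG llama.cpp, vLLM decode) in tok/s, copied from the table above.
+rows = {1: (10.2, 11.8), 32: (235, 328), 64: (391, 569), 128: (540, 667)}
+
+for npl, (llama, vllm) in rows.items():
+    print(f"npl={npl:>3}: llama.cpp is {100 * llama / vllm:.0f}% of vLLM")
+
+# Aggregate-decode multiplier from npl=1 to npl=128 for each engine.
+llama_mult = rows[128][0] / rows[1][0]
+vllm_mult = rows[128][1] / rows[1][1]
+print(f"multiplier: llama.cpp x{llama_mult:.1f}, vLLM x{vllm_mult:.1f}")
+
+# Gain per doubling of npl; a value of 2.0 would be perfectly linear scaling.
+print(f"32->64: x{rows[64][0] / rows[32][0]:.2f}, 64->128: x{rows[128][0] / rows[64][0]:.2f}")
+```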
+
+Two real findings remain:
+1. **Aggregate prefill is flat ~760** regardless of npl - but that is the **GB10 compute roofline** (vLLM single-
+   stream is ~800; neither can prefill faster aggregate, it is compute-bound). So prefill is **not a throughput
+   gap**; chunked prefill is a **latency/TTFT** win (stop a long prefill stalling the decode batch), not a
+   throughput one.
+2. **vLLM's ~24k headline lives at thousands-of-sequences concurrency**, which **paged KV** unlocks (block KV,
+   no fragmentation). llama.cpp's contiguous KV caps how far npl can scale before memory/fragmentation bite. So
+   paged KV is the **high-concurrency (datacenter) lever**, not a moderate-concurrency one.
+
+## Recommendation
+
+**Pivot to the scheduler; treat the GEMM kernel as good-enough / roofline-blocked on GB10.**
+Now that the gap is measured, ROI-ordered:
+1. **Ship the MXFP4-dense win** — 1153 t/s single-stream beats vLLM's 800; a Blackwell dense-quant
+   recommendation (requantize, no kernel work). Already documented in `BLACKWELL_KERNEL_GAPS.md` §6. Cheapest.
+2. **Chunked prefill** — the tractable scheduler win: interleave prefill chunks with decode so a long prompt
+   doesn't stall the decode batch. Payoff is **latency/TTFT under mixed load** (and steadier decode batches),
+   not aggregate prefill throughput (that's GB10-compute-capped at ~760-800 for both engines). A grpc-server
+   scheduler change; no KV-layout rewrite.
+3. **Paged KV** — the **high-concurrency (thousands-of-seqs) lever** that unlocks vLLM's 24k regime. Heavy
+   (block KV manager; contested upstream PR #22569 / vendored `patches/`). Worth it only if datacenter-scale
+   concurrency is a target; at self-hosted concurrency (npl<=128) llama.cpp is already ~70-81% of vLLM.
+
+**Reframed expectation:** llama.cpp on GB10 is NOT 30x behind vLLM. It is ahead single-stream (MXFP4) and
+~70-81% of vLLM aggregate at npl<=128. The genuine differentiator vLLM still has is **scaling to very high
+concurrency via paged KV**. Kernel tracks (W4A16 178 t/s; FP4-MMA) stay **banked** - not the lever.
--- a/backend/cpp/llama-cpp/paged/VLLM_THROUGHPUT_GAP.md
+++ b/backend/cpp/llama-cpp/paged/VLLM_THROUGHPUT_GAP.md
@@ -0,0 +1,59 @@
+# Where vLLM beats llama.cpp on a DGX Spark (GB10), and how to close it — keeping quality
+
+The question: "vLLM is faster at the end — what do we improve, while keeping good quality?" Answer: the
+gap is **three independent things**, and the biggest *per-user, quality-preserving* one is **speculative
+decoding**, which llama.cpp already supports.
+
+## Decomposition (measured + researched)
+
+| vLLM advantage | helps single user? | llama.cpp answer | quality cost | status |
+|---|---|---|---|---|
+| **Per-user decode speed** | **yes** | **speculative decoding** (Qwen3 draft / EAGLE3) | **none** (target-verified, lossless) | mature in llama.cpp; **the main lever** |
+| Prefill / TTFT | no (it's first-token latency) | tune FP4-MMA / Marlin W4A16 kernel | none | hard; `BLACKWELL_KERNEL_GAPS.md` |
+| Aggregate throughput @ concurrency | no (per-user = 0) | continuous batching (paged engine) | none | also kernel-bound |
+
+Key measured fact: **single-user decode is already at parity** (Qwen3-32B: llama 10.2 vs vLLM 11.7 t/s) —
+both hit GB10's ~273 GB/s bandwidth wall (~15 t/s ceiling) **without** spec-dec. So vLLM's real per-user
+speed edge is spec-dec, not architecture.
+
+## Why spec-dec is THE lever here (and quality-safe)
+
+- **Lossless:** the 32B target verifies every drafted token (accept/reject) — output distribution is
+  identical to no-drafting. So you keep **Q4_K_M quality** (no lossy MXFP4 needed) *and* get speed.
+- **GB10 is best-case for it:** decode is bandwidth-bound (one ~17 GB weight-read per token) with huge idle
+  compute. Spec-dec verifies K drafted tokens in **one** weight-read → converts the loop to compute-bound,
+  where GB10 has headroom. Realized speedup ≈ mean accepted length.
+- **Measured (others, same model class):** llama.cpp Qwen2.5-32B dense + 0.5B draft = **2.9×** (13→38 t/s);
+  vLLM EAGLE3 on Qwen3-32B = ~1.8–2.5× general, up to ~3× code/structured. **Competitive.**
+- **Regime caveat:** spec-dec gives **~nothing for MoE-A3B** models (only ~3B active → not bandwidth-bound,
+  nothing to amortize). It shines for **dense** 27–32B — the opposite regime. So this lever is *dense-model*
+  specific.
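+
+The bandwidth argument above can be written as a back-of-envelope model. The sketch below is
+an idealized upper bound only: it takes the ~273 GB/s and ~17 GB figures from this note,
+ignores KV reads, draft-model cost, and verification overhead (all assumptions), and uses a
+hypothetical helper name `decode_ceiling`.
+
+```python
+BANDWIDTH_GB_S = 273.0   # GB10 memory bandwidth quoted above
+WEIGHT_READ_GB = 17.0    # approximate weight bytes read per decode step
+
+def decode_ceiling(tokens_per_weight_read=1.0):
+    """Idealized tok/s when every decode step streams the weights once."""
+    return BANDWIDTH_GB_S / WEIGHT_READ_GB * tokens_per_weight_read
+
+baseline = decode_ceiling()
+print(f"no drafting: {baseline:.1f} tok/s upper bound")
+
+# With speculative decoding one verification pass yields several tokens.
+# Draft-model cost and verification overhead are ignored here (assumption).
+for mean_accepted in (2.0, 2.5, 3.0):
+    print(f"mean accepted length {mean_accepted}: {decode_ceiling(mean_accepted):.1f} tok/s")
+```
+
+The no-drafting bound comes out near 16 tok/s, the same ballpark as the ~15 t/s ceiling cited
+earlier; real speedups will be lower than the mean accepted length once draft cost is counted.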
+
+## Qwen3-32B specifics
+
+- **No native MTP head** (MTP is a Qwen3-*Next*/MoE feature). Options: a **same-family draft**
+  (Qwen3-0.6B or **1.7B** — same tokenizer, llama.cpp vocab check passes) or an external **EAGLE3 head**
+  (RedHatAI/AngelSlim Qwen3-32B-eagle3, accept length 2.15–2.49).
+- Draft pick: **lean Qwen3-1.7B** (0.6B had ~60% lower acceptance in AWS's test; on a bandwidth-bound box the
+  32B weight-read dwarfs the draft cost, so maximize acceptance). `--spec-draft-n-max 5–8`.
+
+## Recommended LocalAI actions (quality-preserving, ranked)
+
+1. **Make speculative decoding easy/recommended for dense ≥14B models on Blackwell** — a draft-model field in
+   the model config (`-md` / `--spec-draft-*`), with a suggested Qwen3-1.7B draft for the Qwen3 family. This
+   is the biggest per-user speed win, lossless, available **now** (no kernel). Gallery: ship target+draft pairs.
+2. Kernel work (FP4-MMA tuning / Marlin W4A16) — improves **prefill/TTFT**, separate metric.
+3. Continuous batching (paged engine) — **aggregate** concurrency only; per-user = 0.
+
+## Honesty / status
+
+The research conclusion is solid (sources below). **Our own empirical spec-dec run on the DGX is pending** —
+the box rebooted mid-session and `llama-cli` now hangs at 0% GPU (while `llama-bench` works), plus the network
+is dropping ssh mid-command. Drafts (Qwen3-0.6B/1.7B Q8) are downloaded and the spec-dec flags are confirmed;
+re-run `llama-cli -m Qwen3-32B-Q4_K_M -md Qwen3-1.7B-Q8_0 -ngl 99 -ngld 99 --spec-draft-n-max 8` when the box
+is stable to confirm the ~2× locally. The conclusion does not depend on it (others have measured and
+reproduced it on this model class), but we should bank our own number.
+
+Sources: llama.cpp Discussion #10466 (Qwen2.5-32B+0.5B = 2.9×), #16578 (DGX Spark), DandinPower/llama.cpp_bench
+(32B = 10.7 t/s, bandwidth-bound); vLLM MTP docs + Red Hat EAGLE3 article (lossless, up to 2.5×); AWS spec-dec
+blog (Qwen3-32B+1.7B up to 3×, 0.6B ~60% lower accept); RedHatAI/AngelSlim Qwen3-32B-eagle3 heads.
--- a/backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md
+++ b/backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md
@@ -0,0 +1,176 @@
+# W4A16 Marlin-style GEMM for ggml-cuda on Blackwell (sm_120/121) — implementation plan
+
+> **STOPPED (2026-06-21): the kernel is NOT the lever — validated by a code-grounded vLLM analysis.**
+> Measured on the DGX: vLLM's single-stream W4A16 prefill on GB10 = **~800 t/s (~52 TFLOPS), statistically TIED
+> with llama.cpp MMQ (718/47)** — and vLLM uses the *exact* XOR-swizzle + 4-stage cp.async Marlin we proved
+> collapses GB10 occupancy (vLLM even warns at load that Marlin "may degrade performance for compute-heavy
+> workloads"). There is no kernel trick to port. Moreover llama.cpp's **MXFP4 path (1153 t/s) already BEATS
+> vLLM single-stream (800)** — vLLM has no FP4 cubins on sm_121 and falls back to slower W4A16 Marlin, so
+> llama.cpp is *ahead* on the kernel. **vLLM's entire 24k headline is the aggregate decode multiplier (~56×)
+> from paged KV + chunked prefill + continuous batching — a SCHEDULER win.** llama.cpp lacks paged KV +
+> chunked prefill. **Effort pivots to the scheduler** (see the paged-attention work). This kernel work is
+> banked + resumable (178 t/s, P0/P1/P2/P3/P3b committed) but is not the throughput lever on GB10. Detail:
+> `VLLM_DECOMPOSITION.md`.
+
+The committed multi-week kernel. Goal: get 4-bit-weight dense matmul to the GB10 **BF16 ceiling (~213
+TFLOP/s ≈ ~3,300 t/s prefill on Qwen3-32B)**, ~4.3× over today's 765. This is the *match-vLLM* path; vLLM's
+own GB10 dense throughput runs on W4A16 Marlin (its FP4 path is broken on sm_121).
+
+## Why a custom kernel (validated, not assumed)
+
+On GB10 (sm_121), measured: **both** llama-MMQ (int8, Ampere-tuned) **and** cuBLAS-FP16 sit at ~46 TFLOP/s
+(~21% of peak). cuBLAS falls back to an Ampere `cutlass_80_tensorop` kernel (CUDA-13 has no sm_121 GEMM for
+these shapes); rebuilt with `-DGGML_CUDA_FORCE_CUBLAS=ON` it's *slower* than MMQ (690 vs 750). **No library
+path reaches the ceiling on consumer Blackwell** — a hand-tuned sm_120a kernel is required. `mmapeak` measures
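+the 213 BF16 peak as reachable, and at planning time vLLM's Marlin was assumed to hit it (the later measurement in the STOPPED note above puts it at ~52 TFLOPS), so the ceiling is real; the work is reaching it.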
+
+## What Marlin does (the design we mirror)
+
+Weights stored 4-bit, **dequantized in-register/shared-mem** in-flight; GEMM math on **FP16/BF16 tensor
+cores** (`mma.sync m16n8k16`). Speed comes from: `cp.async` global→shared with a **multi-stage double-buffered
+pipeline**, **offline weight reshuffle** into the MMA-friendly layout, activations kept resident in registers,
+and **Stream-K** partitioning. Sources: IST-DASLab/marlin, arXiv 2408.11743, vLLM machete (Hopper successor).
+
+## Phases (each ends with: numerical parity vs MMQ + a prefill benchmark)
+
+### P0 — Harness + baseline — DONE
+- **Correctness gate (GREEN):** `test-backend-ops test -o MUL_MAT -b CUDA0` → **1103/1103 passed** (CUDA vs CPU
+  reference, covers Q4_0/Q4_K at the real FFN shapes m=4096,k=14336,n=1..512). This is *the* parity check the
+  W4A16 kernel must keep green at every phase — it tests the CUDA MUL_MAT path the kernel will hook. The
+  `not supported` lines are `type_b=f16` combos (irrelevant; prefill uses f32 activations).
+- **Perf baseline:** `llama-bench` dense Q4_K prefill = **~750 t/s (pp512 718 / pp2048 750) ≈ 46 TFLOP/s ≈ 21%
+  of the 213 BF16 ceiling**. The kernel must beat this toward ~3,300. (`test-backend-ops perf -o MUL_MAT` gives
+  per-shape GFLOPS too; build it once with the harness.)
+- **Op-level baseline (the canonical kernel target), `test-backend-ops perf -o MUL_MAT`, m=4096 k=14336 (FFN):**
+  | n (tokens) | q4_0 | q4_K | regime |
+  |---|---|---|---|
+  | 1 | 817 GFLOPS | 761 GFLOPS | decode / mat-vec (memory-bound) |
+  | 8 | 5.77 TFLOPS | 4.11 TFLOPS | small-batch |
+  | **512** | **49.5 TFLOPS** | **47.1 TFLOPS** | **prefill GEMM — ~22% of the 213 ceiling** |
+
+  So the prefill GEMM target: lift q4_K n=512 from **47 → toward ~213 TFLOPS** (~4.5×). This per-shape number
+  is cleaner than end-to-end for kernel iteration.
+- **Harness script:** `~/p0harness.sh` on the DGX (build test-backend-ops + correctness + perf). Reusable each
+  phase: `test-backend-ops test -o MUL_MAT -b CUDA0` must stay 1103/1103; the q4_K n=512 perf must climb from 47.
+- test-backend-ops needed `-DLLAMA_BUILD_TESTS=ON`; now built in `~/llama.cpp-pr24423/build`.
+
+### P1 — Dispatch seam (no behavior change) — DONE
+- `marlin-w4a16.{cuh,cu}` + a gated hook in `ggml_cuda_mul_mat` (dense, non-ids path), behind
+  `GGML_CUDA_W4A16` + sm_120/121 (`cc >= GGML_CUDA_CC_BLACKWELL`) + type∈{Q4_0,Q4_K} + f32 activations.
+  Returns false → falls back to MMQ. Source + apply instructions: `kernel/w4a16/` (`HOOK.md`).
+- **Verified on GB10:** clean build; `test-backend-ops MUL_MAT` = **1103/1103** (byte-identical default);
+  `llama-bench` dense Q4 pp512 unchanged (717.77 default / 718.26 with flag); `GGML_CUDA_W4A16=1` reaches the
+  seam (stderr `[w4a16] ... P1 seam - using MMQ`) and falls back. This is the empty frame that P2/P3 fill.
+
+### P2 — Correctness-first kernel (slow OK) — DONE
+- **Kernel:** `marlin-w4a16.cu` replaces the P1 TODO with a real W4A16 GEMM. In-kernel dequant Q4→BF16 into
+  shared mem, `mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32` via ggml's `mma.cuh` tile abstractions
+  (`tile<16,8,nv_bfloat162>` A, `tile<8,8,nv_bfloat162>` B, `tile<16,8,float>` C), F32 accumulate, F32 write.
+  One warp per 16(M)x8(N) output tile, K looped in steps of 16. Both src0 (weights, row m) and src1 (acts,
+  row n) are row-major `[row][k]`, so A and B load symmetrically via `load_generic`; the mma does the dot over k.
+- **Types handled:** Q4_0 and Q4_K. Q4_0 dequant `w=d*(q-8)` inline; Q4_K via the superblock decode mirrored
+  from `convert.cu` (`get_scale_min_k4`, 8x32 sub-blocks, `d*q-m`).
+- **Shape classes handled:** contiguous 2D GEMM (the prefill path), `ne2==ne3==1`, f32 activations, K%16==0
+  (always true: Q4_0 K%32, Q4_K K%256). **Falls back to MMQ (returns false)** for batched (bs!=[1,1]),
+  broadcast (nr!=[1,1]), permuted / non-contiguous (per!=[0,1,2,3]), and any non-f32 activation (e.g. f16) -
+  keeps the gate green. M / N boundaries are zero-padded in-kernel (handles M not %16, N not %8).
+- **Parity (the gate):** `GGML_CUDA_W4A16=1 test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103 passed**
+  (the Q4_0/Q4_K f32 contiguous shapes run the kernel and match the CPU reference; batched/permuted/f16 fall
+  back). Default (flag-unset) build still **1103/1103** (byte-identical, seam returns false).
+- **Model sanity / P2 perf:** `GGML_CUDA_W4A16=1 llama-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -p 512 -n 16
+  -ub 2048` runs clean: **pp512 = 31.75 t/s**, tg16 = 6.28 t/s. Slow as expected (naive 1-warp/tile, weights
+  re-dequantized per n-tile, no pipeline) - this is the correctness checkpoint; P3 brings the speedup. The real
+  Q4_K model matmul path engages the kernel without error.
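+
+For reference, the Q4_0 rule `w=d*(q-8)` can be shown on the host in a few lines. This is a
+sketch, not the kernel: it assumes the usual ggml Q4_0 block layout (an fp16 scale followed
+by 16 bytes holding 32 nibbles, low nibbles first half, high nibbles second half) and
+produces Python floats rather than bf16. The function name is illustrative.
+
+```python
+import struct
+
+QK4_0 = 32  # weights per Q4_0 block
+
+def dequant_q4_0_block(block: bytes):
+    """Decode one 18-byte Q4_0 block: fp16 scale d, then 16 bytes of nibbles."""
+    (d,) = struct.unpack_from("<e", block, 0)
+    qs = block[2:2 + QK4_0 // 2]
+    out = [0.0] * QK4_0
+    for j, byte in enumerate(qs):
+        out[j] = d * ((byte & 0x0F) - 8)               # low nibble -> first half
+        out[j + QK4_0 // 2] = d * ((byte >> 4) - 8)    # high nibble -> second half
+    return out
+
+# Build a block with d = 0.5 whose low nibbles count 0..15 and high nibbles are 8.
+block = struct.pack("<e", 0.5) + bytes((8 << 4) | q for q in range(16))
+weights = dequant_q4_0_block(block)
+print(weights[:4], weights[16:20])
+```
+
+A nibble value of 8 decodes to zero, which is why the second half of this example block is
+all zeros; Q4_K adds per-sub-block scales and minimums on top of the same idea.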
+
+### P3 — The Marlin pipeline (the speedup) — STEP 1 + SKEW-PAD/TILING LANDED; PREPACK + PIPELINE + STREAM-K DEFERRED
+Goal: `cp.async` double/triple-buffered global->shared; offline weight reshuffle (a one-time repack of the Q4
+tensor into the mma+pipeline layout); register-resident activation tiles; Stream-K split for the prefill M.
+Target: >=150 TFLOP/s (>=~2,300 t/s), then ~213. **MMQ baseline to beat: 47.1 TFLOPS (q4_K n=512) / pp512 718.**
+
+**Kernel structure now (committed P3b):** block-tiled multi-warp GEMM with a CONFLICT-FREE shared feed via skew
+padding. `blockDim=(32, WM*WN)` so `threadIdx.x` is the warp lane (required by `mma.cuh` get_i/get_j) and
+`threadIdx.y` is the warp index; the original 1-warp P2 launch put 128 threads on `threadIdx.x` and exploded
+`get_j` into an out-of-bounds shared read (found via compute-sanitizer). `WM*WN` warps compute a
+`BM(=WM*FM*16) x BN(=WN*FN*8)` output tile; each warp owns an `FM x FN` grid of m16n8k16 mma fragments
+accumulated in F32. Per k-step (16-deep): all warps cooperatively dequant the `BM x 16` Q4 weight strip + load
+the `BN x 16` f32->bf16 activation strip into shared, one `__syncthreads`, then `ldmatrix.x4` (A) / `ldmatrix.x2`
+(B) fragments + `FM*FN` mmas. The shared rows hold 8 bf162 of data but are stored at a PADDED stride of 12 bf162
+(`W4A16_SPAD`): ldmatrix's per-lane address is `row*stride`, and the natural stride 8 (a divisor of the
+32-bank / 128-byte cycle) collides rows 0,4,8,12 into a 2-way bank conflict; skewing to 12 (48 bytes, a
+multiple of 16, so ldmatrix's 16-byte alignment holds) makes `{r*12 mod 32}` hit 8 distinct bank-quads for r in 0..7, so both
+halves of ldmatrix are conflict-free at only +50% on the small staged tile (~12 KB at the shipping tile).
+Shipping config `WM=4,WN=4,FM=2,FN=4` -> `BM=128, BN=128`, 16 warps, 8 m16n8 C-tiles per warp (keeping
+register pressure low is what lets BN grow without an occupancy cliff). M/N tails zero-padded in-kernel; still
+gated to contiguous 2D Q4_0/Q4_K f32 prefill, else falls back to MMQ.
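+
+A worked check of the skew-pad arithmetic above, as a sketch rather than a measurement. It assumes 32 shared
+banks of 4 bytes (one bf162 per bank) and 16 bytes (4 banks) read per row by ldmatrix; $b_s(r)$, the first bank
+touched by row $r$ at stride $s$ bf162, and $S$, the staged shared footprint, are names introduced here.
+
+```latex
+b_s(r) = (r \cdot s) \bmod 32, \qquad r = 0,\dots,7
+
+s = 8:\quad  b_8(r)    = 0,\ 8,\ 16,\ 24,\ 0,\ 8,\ 16,\ 24
+             \;\Rightarrow\; \text{rows } r \text{ and } r+4 \text{ share banks (2-way conflict)}
+
+s = 12:\quad b_{12}(r) = 0,\ 12,\ 24,\ 4,\ 16,\ 28,\ 8,\ 20
+             \;\Rightarrow\; \text{eight disjoint quads } [b,\, b+3] \text{ covering all 32 banks}
+
+\text{row pitch} = 12 \cdot 4\,\mathrm{B} = 48\,\mathrm{B} = 3 \cdot 16\,\mathrm{B}
+             \;\Rightarrow\; \text{every row start stays 16-byte aligned}
+
+S = (BM + BN) \cdot s \cdot 4\,\mathrm{B}:\quad
+    (128 + 128) \cdot 8 \cdot 4 = 8192\,\mathrm{B}
+    \;\to\; (128 + 128) \cdot 12 \cdot 4 = 12288\,\mathrm{B} = 12\,\mathrm{KB} \quad (+50\%)
+```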
+
+**Per-step results (q4_K n=512 via `test-backend-ops perf`; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M):**
+
+| step | q4_K n=512 | q4_0 n=512 | pp512 | pp2048 | vs MMQ 47 / 718 | notes |
+|---|---|---|---|---|---|---|
+| P2 (1 warp/tile) | ~2 TFLOPS | - | 31.75 | - | 0.04x | correctness checkpoint |
+| Step 1: block tiling (load_generic, BM64/4w) | 6.63 (cold) | 7.53 | 119 | 123 | 0.14x | original committed kernel |
+| P3b-1: skew-pad ldmatrix + BM128/8w | 8.50 (cold) | 10.56 | 148.5 | 153.9 | 0.18x | +28% q4_K, +40% q4_0 over step 1 |
+| **P3b-2: + BN128/16w (current)** | **9.92 (cold)** | **11.68** | **177.6** | **185.0** | **0.21x** | +17% q4_K, +20% pp512 over P3b-1 (+49% pp512 over step 1) |
+
+Parity gate **1103/1103** at every step, flag set and unset (byte-identical when unset). All P3b numbers above
+are from thermally-bracketed cold A/B sessions (the committed kernel measured immediately before AND after each
+candidate, consistent both times -> the deltas are real, not thermal). Pairs are q4_K/q4_0 TFLOPS. P3b-1 cold A/B:
+6.63/7.53 vs 8.52/10.49. P3b-2 cold A/B: BN64/8w 8.50/10.56 then 8.45/10.51 (bracket) vs BN128/16w 9.92/11.68.
+
+**What landed / what was tried (honest):**
+- **P3b - LANDED (committed).** Two combined changes lift the prior committed kernel: (1) **skew-pad
+  conflict-free ldmatrix** (shared row stride 8->12 bf162; makes `ldmatrix.x4`/`.x2` bank-conflict-free at near
+  zero occupancy cost) and (2) **bigger tile / more warps** (`BM=128, BN=64`, 8 warps). Cold A/B: q4_K
+  6.63->8.52 (+28%), q4_0 7.53->10.49 (+40%), pp512 119->148.5 (+25%). **Still ~5.5x under MMQ (47) per-op and
+  ~4.8x under pp512 718 - does NOT beat MMQ.** This is forward progress, not the finish line.
+- **The XOR-swizzle-FIRST plan was tested and is WRONG for this GPU - documented so it is not re-tried.** A
+  wide-row (BK=64, 128-byte rows) XOR swizzle `seg ^ (row&7)` IS conflict-free, but the 16 KB shared it needs
+  collapsed occupancy and dropped q4_K n=512 to **2.84 TFLOPS** (worse than the unswizzled 6.63) - the same
+  occupancy cliff P3 hit with a 32 KB pipeline. The conflict-free feed must be bought WITHOUT widening shared:
+  skew padding (above) does exactly that (6 KB), which is why it is the committed form. Lesson: on GB10 occupancy
+  dominates bank-conflict latency; never trade occupancy for a conflict-free layout.
+- **Conflict-free feed alone did NOT beat the unswizzled kernel - the limiter moved.** At the SAME BM64/4w tile,
+  skew-pad ldmatrix (6.70) ~= load_generic (6.63): removing bank conflicts bought ~nothing. The win came only
+  when the tile grew (BM128/8w). A 5-config tile sweep then split the two quant types:
+  - **q4_0 SCALES with warps/tiles** (7.7 -> 10.5 -> **15.8 TFLOPS at BM128/16w**): feed/global-traffic bound,
+    helped by cutting redundant activation re-reads (more BM = fewer M-blocks each re-reading the act column).
+  - **q4_K is largely DEQUANT-COMPUTE bound** (the BM64/16w tile gives q4_0=15.8 but q4_K=6.8 - they diverge
+    hard). This **refines P3's "within 12%" finding**: that held only in the low-throughput memory-bound regime;
+    once the feed is unblocked, q4_K's per-element 6-bit superblock decode (`get_scale_min_k4` + superblock
+    indexing, redone every k-step AND re-done by every N-block) becomes the wall. BM256 regressed both (too few
+    blocks / register pressure).
+- **Growing BN partly relieves the q4_K dequant wall (P3b-2).** Because every N-block re-decodes the same
+  weight strip, halving the N-block count (BN 64->128) halves that redundant q4_K decode - but only when BN is
+  spread across MORE WARPS (16w, 8 C-tiles/warp), not more fragments-per-warp: the FN=8 / FM=4 variants (16
+  C-tiles/warp) regressed to ~6.6 on register pressure, while WM=4,WN=4,FM=2,FN=4 (16w, 8 tiles/warp) lifted
+  q4_K 8.5->9.9 and q4_0 10.6->11.7 cold. BN=256 was no better and costs more shared. **BN128/16w is the
+  shipping tile.**
+- **Next blocker (the remaining q4_K unlock) = offline prepack.** BN growth only divides the redundant decode by
+  the N-block count; it cannot remove the per-k-step decode itself. The full fix is the **one-time offline
+  repack** - decode the Q4 tensor ONCE into a cached device buffer keyed off the tensor data pointer, in a layout
+  with the scale/min pre-applied (store reshuffled 4-bit + per-subblock bf16 d,m, ~1.25x the q4 size, NOT a full
+  bf16 blow-up which would be ~4x), so the in-kernel path becomes a cheap `q*d - m` with coalesced loads. Then
+  `cp.async` multi-stage (sized to NOT widen shared past the occupancy cliff) and **Stream-K** over M. These
+  remain the multi-week core; **prepack is the highest-value next step for q4_K specifically** (it should let
+  q4_K join q4_0 on the feed-bound scaling curve instead of plateauing at ~10).
+- **Methodology note (unchanged):** the box thermally throttles under sustained perf+bench runs (identical code
+  ~8.8 cold vs ~6.6 hot earlier), so only same-session A/Bs are trustworthy. The P3b deltas above were taken in
+  one bracketed cold session for exactly this reason.
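+
+A back-of-envelope count of the redundant q4_K decode discussed in the bullets above, as a sketch under stated
+assumptions: the q4_K perf shape is taken to be m=4096, k=14336, n=512, every block re-decodes its full
+`BM x K` weight strip, and M is a multiple of BM. $D$ (weight elements decoded per matmul) is a name introduced here.
+
+```latex
+D(BN) = M \cdot K \cdot \lceil N / BN \rceil, \qquad M K = 4096 \cdot 14336 = 58{,}720{,}256
+
+D(64)  = 58{,}720{,}256 \cdot 8 = 469{,}762{,}048
+
+D(128) = 58{,}720{,}256 \cdot 4 = 234{,}881{,}024 \quad (\text{half of } D(64))
+
+\text{offline prepack: the scale/min decode runs once per tensor, over } M K = 58{,}720{,}256
+\text{ elements, independent of } N
+```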
+
+### P4 — Tune
+- Tile (mmq_x/y analogues), warps, pipeline depth, occupancy. We have nsys (throughput) but **not ncu** on the
+  DGX — tuning is empirical (sweep configs, measure t/s). Note ncu would need sudo/driver perms we lack.
+
+### P5 — Enable
+- Default on for sm_120/121 + Q4_0/Q4_K dense when parity holds + faster; keep the flag as an escape hatch.
+  Ship as a LocalAI llama.cpp patch (the patches/ series) and/or upstream (ggml has no Marlin-equivalent —
+  issue #1519 — so it's net-new upstream value; float it with maintainers first).
+
+## Risks / notes
+- **Multi-week, expert-CUDA, DGX-only** (GB10 is the only sm_121). The session's network flakiness +
+  `llama-cli` hang make `llama-bench`/`test-backend-ops` the reliable verification tools (both work).
+- Quantization correctness: Q4_K's superblock structure (256-elem, 6-bit scales) is more complex to dequant
+  in-kernel than Q4_0; consider landing Q4_0 first, then Q4_K.
+- **Beat-path follow-on:** the FP4-MMA path (`mul_mat_q<MXFP4>`, ~5% of FP4 peak) tuned/fixed on sm_121 reaches
+  ~6,600 (2× BF16). Separate track; this W4A16 kernel is the match-path foundation.
+- Reuse ggml's `mma.cuh` tile abstractions (MMQ already uses them) rather than raw PTX where possible.
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/HOOK.md
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/HOOK.md
@@ -0,0 +1,31 @@
+# W4A16 seam — how to apply to a llama.cpp / ggml-cuda checkout
+
+Two source files + two one-line edits to `ggml/src/ggml-cuda/ggml-cuda.cu`. The build picks up the
+new `.cu` via the existing `file(GLOB)` after a `cmake -S . -B build` reconfigure (no CMakeLists edit).
+
+## Files (copy into `ggml/src/ggml-cuda/`)
+- `marlin-w4a16.cuh`
+- `marlin-w4a16.cu`
+
+## Edit `ggml/src/ggml-cuda/ggml-cuda.cu`
+
+1. **Include** — after the existing `#include "ggml-cuda/fp4-grouped-moe.cuh"` (sibling-header style):
+   ```cpp
+   #include "ggml-cuda/marlin-w4a16.cuh"
+   ```
+
+2. **Dispatch hook** — immediately before the dense dispatch chain, i.e. before
+   `if (!split && use_mul_mat_vec_f) {` in `ggml_cuda_mul_mat(...)` (after `const int cc = ...`):
+   ```cpp
+   if (!split && ggml_cuda_w4a16_mul_mat(ctx, src0, src1, dst)) { return; }
+   ```
+
+## Verify (P1 acceptance — met)
+- `cmake --build build --target test-backend-ops llama-bench` → builds clean.
+- `test-backend-ops test -o MUL_MAT -b CUDA0` → **1103/1103** (byte-identical default).
+- `llama-bench` dense Q4 pp512 → unchanged (~718, MMQ).
+- `GGML_CUDA_W4A16=1 llama-bench` → unchanged + stderr `[w4a16] ... P1 seam - using MMQ` (seam reached,
+  gating passes on sm_121, falls back).
+
+The kernel body (P2 correctness → P3 Marlin pipeline) replaces the `TODO(P2/P3)` block in `marlin-w4a16.cu`
+and returns `true` once parity holds.
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/SUBAGENT_BRIEFS.md
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/SUBAGENT_BRIEFS.md
@@ -0,0 +1,66 @@
+# W4A16 kernel - subagent dispatch briefs (P3, P4, P5)
+
+**Dispatch strategy.** Each phase = one fresh **Opus-4.8** subagent handed a complete zero-context brief.
+Phases are **sequential** (P3 needs P2's correct kernel; P4 needs P3's pipeline; P5 needs P4's tuned kernel),
+so dispatch phase N+1 only after phase N's commit lands, and before dispatching, splice phase N's *actual*
+deliverable (final kernel shape, configs, fallback set) into the next brief. P2's brief (already dispatched)
+is the template; reuse the COMMON section below verbatim in every dispatch.
+
+---
+
+## COMMON (paste into every phase brief)
+
+- **Kernel dev is on the remote DGX** (GB10, sm_121): `ssh -o ConnectTimeout=25 -o ServerAliveInterval=10 -o ServerAliveCountMax=10 dgx.casa '<cmd>'`. Network is FLAKY (re-poll on drop; nohup jobs survive). `llama-cli` HANGS - never use it. Only `llama-bench` + `test-backend-ops` work.
+- Checkout `~/llama.cpp-pr24423`, build `~/llama.cpp-pr24423/build` (sm_121, `-DLLAMA_BUILD_TESTS=ON`). Kernel file `ggml/src/ggml-cuda/marlin-w4a16.cu`. Build auto-GLOBs it; no CMakeLists edits. Hook already in `ggml-cuda.cu`, gated behind env `GGML_CUDA_W4A16`.
+- Dense test model: `~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
+- **Builds run detached + poll** (never blocking foreground): write a `~/pN.sh` that builds `--target test-backend-ops llama-bench`, echoes `RC=$?`, runs the gate, echoes `PN_DONE`; `nohup` it; poll `for i in $(seq 1 90); do grep -q PN_DONE ~/pN.out && break; sleep 20; done; tail ~/pN.out`.
+- **GPU hygiene:** check `docker ps | grep local-ai` + `nvidia-smi`; `docker stop` a running localai worker if present (authorized); never pkill native procs; never start model servers.
+- **Parity gate (must stay green every step):** `GGML_CUDA_W4A16=1 CUDA_VISIBLE_DEVICES=0 ./build/bin/test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103**; and flag-unset stays 1103/1103 (byte-identical). A wrong result is worse than a fallback - return false for any shape you can't do correctly.
+- **Perf measurement:** `test-backend-ops perf -o MUL_MAT -b CUDA0` (per-shape GFLOPS; the canonical target is q4_K m=4096 k=14336 **n=512**, baseline **47.1 TFLOPS**, ceiling ~213) + `llama-bench -m <model> -ngl 99 -p 512,2048 -n 0 -ub 2048` (baseline pp512 ~718).
+- **LocalAI repo (commit here; you do NOT inherit cwd - `cd` explicitly):** `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`. Plan: `backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md`. Source mirror: `backend/cpp/llama-cpp/paged/kernel/w4a16/`. After a phase passes: fetch the final `marlin-w4a16.cu` from the DGX (`ssh ... 'cat ...'`), overwrite the mirror, update the plan (mark the phase DONE with numbers), `git commit -s` (DCO sign-off; user is Ettore Di Giacinto <mudler@localai.io>). **No `Co-Authored-By`. No em-dashes anywhere. Trailer `Assisted-by: Claude:opus-4.8 [Claude Code]`. Do NOT push.**
+- Final message = the result (gate ?/1103, the perf delta, blockers + resolutions, commit hash). A precise partial result beats a vague success claim.
+
+---
+
+## P3 brief - the Marlin pipeline (the speedup)
+
+**Goal.** Take P2's correct-but-slow kernel (~2 TFLOPS) past the 47.1 TFLOPS MMQ baseline toward ~150+ TFLOPS (then ~213) on the q4_K n=512 prefill GEMM, **without ever breaking parity**. This is the Marlin design: the math is the same BF16 mma; the speed comes from feeding the tensor cores without stalling.
+
+**Implement, incrementally (re-run the parity gate after each):**
+1. **`cp.async` multi-stage pipeline** - double/triple-buffer global->shared loads of both the Q4 weight tiles and the activation tiles so dequant+mma on stage k overlaps the load of stage k+1. (Study `mma.cuh` + how `mmq.cu`/`mmf.cu` stage shared memory; ggml already uses `cp.async`/`__pipeline_*`.)
+2. **Offline weight reshuffle** - repack the Q4 weights once into the mma+pipeline-friendly layout (Marlin's interleave) so loads are coalesced and the mma fragment maps directly. Do this as a load-time transform of src0 (a new prepacked buffer keyed off the tensor) - NOT per-call. Document where the repack lives + its memory cost.
+3. **Register-resident activation tiles + Stream-K** split of the M dimension across blocks for the prefill (large-M) case so all SMs stay busy.
+
+**Acceptance.** Parity gate stays **1103/1103** at every commit; `test-backend-ops perf` q4_K n=512 climbs materially above 47 TFLOPS (target >=150) and `llama-bench` pp512 climbs above ~718. Report the TFLOPS + t/s after each of the 3 steps so the contribution of each is visible. If a step regresses parity, revert it and report why.
+
+**Reference.** IST-DASLab/marlin (github), arXiv 2408.11743, vLLM machete. Mirror `mmf.cu`'s BF16 GEMM structure; Marlin = that + Q4 dequant-on-load + the pipeline/reshuffle.
+
+**Splice before dispatch:** P2's final kernel structure (tile sizes, which types/shapes it handles vs falls back, helper functions it defined).
+
+---
+
+## P4 brief - tune to the ceiling
+
+**Goal.** Drive the P3 kernel as close to the ~213 TFLOPS ceiling as empirical tuning allows. **No `ncu` on this box** (no driver perms) - tune by throughput: `test-backend-ops perf` + `llama-bench` + `nsys` (throughput only).
+
+**Do.** Parametrize the kernel (template params / constants) over: tile M/N/K, warps per block, pipeline depth (stages), and occupancy (regs, shared-mem budget). Sweep systematically (a script that rebuilds + benches each config, logs q4_K n=512 TFLOPS + pp512/pp2048 t/s), pick the best, hard-set it (with a short comment on the sweep). Check both prefill shapes (n=512 and n=2048) and confirm decode (n=1) didn't regress (it should still route to mat-vec, not this kernel - verify the gating).
+
+**Acceptance.** Best config maximizes q4_K n=512 TFLOPS (stretch ~150-213) with parity **1103/1103** intact; the sweep table (config -> TFLOPS/t-s) is recorded in the plan's P4 section. Report the chosen config + the final pp512/pp2048 t/s vs the 718/750 baseline and vs vLLM's ~3300 single-stream target.
+
+**Splice before dispatch:** P3's pipeline structure + the perf it reached + which knobs are already fixed vs free.
+
+---
+
+## P5 brief - enable + package + (maybe) upstream
+
+**Goal.** Make W4A16 the default dense-Q4 path on Blackwell and ship it through LocalAI.
+
+**Do.**
+1. **Flip the gate:** default-ON for sm_120/121 + Q4_0/Q4_K dense when faster, keep an opt-out env (e.g. `GGML_CUDA_W4A16=0`) as an escape hatch. The existing return-false-on-unhandled-shape path is the correctness safety net; keep it. Verify the default (no env) build now runs W4A16 for dense Q4, gate green, faster than the old MMQ baseline.
+2. **Package as a LocalAI llama.cpp patch:** produce `backend/cpp/llama-cpp/paged/patches/kernel/0002-w4a16-marlin.patch` (the new files + the `ggml-cuda.cu` hook + the gate flip) that applies cleanly to the pinned llama.cpp, mirroring the existing `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`. Confirm LocalAI's `make backends/llama-cpp` build path can consume it (read `.agents/llama-cpp-backend.md` + the build memory: `make -C backend/cpp/llama-cpp clean` before rebuilds).
+3. **Docs:** update `BLACKWELL_KERNEL_GAPS.md` + the plan with the shipped result; add a short note to the LocalAI docs if there's a Blackwell/performance page.
+4. **Upstream decision (do NOT open without surfacing first):** ggml has no Marlin-equivalent (issue #1519) so this is net-new upstream value. Draft (do not submit) an upstream PR description + note the sm_121 build-flag caveats; report it for the user to decide.
+
+**Acceptance.** Default Blackwell build uses W4A16 for dense Q4, parity 1103/1103, measurably faster than MMQ; the patch applies + the LocalAI llama-cpp backend builds with it (verify or, if the full backend build is too heavy, document the exact build command + that the patch applies cleanly). Report the end-to-end LocalAI dense-Q4 prefill number vs the start-of-project 765 t/s.
+
+**Splice before dispatch:** P4's final kernel + config + the measured ceiling reached; the exact enable condition decided.
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cu
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cu
@@ -0,0 +1,258 @@
+#include "marlin-w4a16.cuh"
+#include "mma.cuh"
+
+#include <cstdio>
+#include <cstdlib>
+#include <cuda_bf16.h>
+
+// W4A16 Marlin-style GEMM.
+//
+// In-kernel dequantize Q4 weights -> BF16, multiply against BF16-converted F32
+// activations using mma.sync m16n8k16 BF16 tensor-core ops, accumulate in F32,
+// write F32 output. Handles only the contiguous 2D GEMM (prefill) case for
+// Q4_0 / Q4_K; everything else returns false and falls back to MMQ.
+//
+// ggml MUL_MAT convention: dst[m,n] = sum_k src0[k,m] * src1[k,n].
+//   src0 (weights): ne0=K (contiguous), ne1=M  -> row m is K contiguous quants.
+//   src1 (acts,f32): ne0=K (contiguous), ne1=N -> row n is K contiguous floats.
+//   dst  (f32):      ne0=M (contiguous), ne1=N -> element (m,n) at m + n*M.
+// Both operands are row-major [row][k]; m16n8k16 computes C[m,n] += sum_k A[m,k]*B[n,k].
+//
+// Thread layout: blockDim = (32, WM*WN). threadIdx.x is the warp lane (0..31,
+// required by mma.cuh get_i/get_j), threadIdx.y is the warp index.
+//
+// P3b step 1 - conflict-free shared layout via SKEW PADDING:
+//  - WM*WN warps compute a BM(=WM*FM*16) x BN(=WN*FN*8) output tile; each warp
+//    owns an FM x FN grid of m16n8k16 mma fragments accumulated in F32.
+//  - Per 16-deep k-step the warps cooperatively dequant the BM x 16 Q4 weight
+//    strip + load the BN x 16 f32->bf16 activation strip into shared, then feed
+//    the tensor cores with ldmatrix.x4 (A) / ldmatrix.x2 (B).
+//  - The shared rows are PADDED to SPAD(=12) bf162 instead of the natural 8.
+//    ldmatrix's per-lane address is row*stride; with the natural stride 8 (a
+//    divisor of the 32-bank / 128-byte cycle) rows 0,4,8,12 collide -> 2-way
+//    bank conflict on every fragment load (this is why P3 measured a plain
+//    ldmatrix swap as neutral). Skewing the stride to 12 (48 bytes, a multiple
+//    of 16, so ldmatrix's 16-byte row alignment holds) makes {r*12 mod 32} hit
+//    8 distinct bank-quads for r in 0..7, so both halves of ldmatrix.x4 and
+//    ldmatrix.x2 are conflict-free. The pad costs only +50% of the small staged
+//    tile (8 KB -> 12 KB at BM=BN=128), so unlike a 128-byte-row XOR swizzle it
+//    does NOT collapse occupancy on GB10 (that swizzle: 16 KB, ~2.8 TFLOPS).
+//
+// Dead-ends already proven (do not re-try): a double-buffered KSTAGE=64 cp.async
+// pipeline collapsed occupancy (32 KB shared -> 2.7 TFLOPS); a plain ldmatrix on
+// the UNpadded layout was neutral (bank conflicts); a wide-row (BK=64) XOR swizzle
+// was conflict-free but occupancy-starved (16 KB shared -> 2.8 TFLOPS). Skew
+// padding gets the conflict-free feed at near-zero occupancy cost.
+
+using namespace ggml_cuda_mma;
+
+typedef tile<16, 8, nv_bfloat162> tile_A; // 16(M) x 16(K)
+typedef tile< 8, 8, nv_bfloat162> tile_B; //  8(N) x 16(K)
+typedef tile<16, 8, float>        tile_C; // 16(M) x  8(N)
+
+// bf162 columns actually live per shared row (16 k-values = 8 bf162) ...
+#define W4A16_KP   8
+// ... padded to this stride to bank-skew the ldmatrix row addresses.
+#define W4A16_SPAD 12
+
+static bool w4a16_enabled() {
+    static const bool en = (std::getenv("GGML_CUDA_W4A16") != nullptr); // any value enables, even "0"
+    return en;
+}
+
+// 6-bit packed scale/min decode for Q4_K (mirrors convert.cu get_scale_min_k4).
+static __device__ __forceinline__ void w4a16_scale_min_k4(int j, const uint8_t * q, uint8_t & d, uint8_t & m) {
+    if (j < 4) {
+        d = q[j] & 63; m = q[j + 4] & 63;
+    } else {
+        d = (q[j+4] & 0xF) | ((q[j-4] >> 6) << 4);
+        m = (q[j+4] >>  4) | ((q[j-0] >> 6) << 4);
+    }
+}
+
+// Dequantize a single Q4_0 weight at column k of a row.
+static __device__ __forceinline__ float w4a16_dq_q4_0(const char * row, int k) {
+    const block_q4_0 * blk = (const block_q4_0 *) row + (k / QK4_0);
+    const int j = k % QK4_0;
+    const float d = __half2float(blk->d);
+    const int q = (j < QK4_0/2) ? (blk->qs[j] & 0xF) : (blk->qs[j - QK4_0/2] >> 4);
+    return (q - 8) * d;
+}
+
+// Dequantize a single Q4_K weight at column k of a row.
+static __device__ __forceinline__ float w4a16_dq_q4_K(const char * row, int k) {
+    const block_q4_K * blk = (const block_q4_K *) row + (k / QK_K);
+    const int e = k % QK_K;
+    const int il     = e / 64;        // 0..3
+    const int within = e % 64;
+    const int half   = within / 32;   // 0..1
+    const int pos    = within % 32;
+    const int ir     = pos / 4;       // 0..7
+    const int l      = pos % 4;       // 0..3
+    const int is     = 2*il + half;
+    const float dall = __low2half (blk->dm);
+    const float dmin = __high2half(blk->dm);
+    uint8_t sc, mn;
+    w4a16_scale_min_k4(is, blk->scales, sc, mn);
+    const float d = dall * sc;
+    const float m = dmin * mn;
+    const uint8_t qb = blk->qs[32*il + 4*ir + l];
+    const int q = (half == 0) ? (qb & 0xF) : (qb >> 4);
+    return d * q - m;
+}
+
+template <bool IS_Q4_K, int WM, int WN, int FM, int FN>
+static __global__ void __launch_bounds__(WM*WN*32, 1)
+w4a16_gemm_kernel(
+        const char * __restrict__ src0,
+        const char * __restrict__ src1,
+        float      * __restrict__ dst,
+        const int M, const int N, const int K,
+        const int64_t nb01, const int64_t nb11, const int64_t dst_ne0) {
+    constexpr int KP   = W4A16_KP;      // 8 bf162 = 16 k per row
+    constexpr int SPAD = W4A16_SPAD;    // padded row stride (bank skew)
+    constexpr int BM  = WM*FM*16;
+    constexpr int BN  = WN*FN*8;
+    constexpr int NTH = WM*WN*32;
+
+    const int m0 = blockIdx.x * BM;
+    const int n0 = blockIdx.y * BN;
+
+    const int warp_id = threadIdx.y;        // 0 .. WM*WN-1
+    const int warp_n  = warp_id % WN;
+    const int warp_m  = warp_id / WN;
+    const int tid     = threadIdx.y*32 + threadIdx.x;
+
+    __shared__ nv_bfloat162 sW[BM*SPAD]; // [m][kpair], padded row stride SPAD
+    __shared__ nv_bfloat162 sB[BN*SPAD]; // [n][kpair], padded row stride SPAD
+
+    tile_C C[FM][FN]; // zero-initialized accumulators
+
+    for (int k0 = 0; k0 < K; k0 += 16) {
+        // Dequantize the BM x 16 weight strip once; reused across the block's BN span.
+        #pragma unroll
+        for (int idx = tid; idx < BM*KP; idx += NTH) {
+            const int m  = idx / KP;
+            const int kk = idx % KP;
+            const int k  = k0 + 2*kk;
+            float w0 = 0.0f, w1 = 0.0f;
+            if (m0 + m < M) {
+                const char * row = src0 + (int64_t)(m0 + m) * nb01;
+                if (IS_Q4_K) { w0 = w4a16_dq_q4_K(row, k); w1 = w4a16_dq_q4_K(row, k + 1); }
+                else         { w0 = w4a16_dq_q4_0(row, k); w1 = w4a16_dq_q4_0(row, k + 1); }
+            }
+            sW[m*SPAD + kk] = __floats2bfloat162_rn(w0, w1);
+        }
+        // Load the BN x 16 activation strip (f32 -> bf16).
+        #pragma unroll
+        for (int idx = tid; idx < BN*KP; idx += NTH) {
+            const int n  = idx / KP;
+            const int kk = idx % KP;
+            const int k  = k0 + 2*kk;
+            float a0 = 0.0f, a1 = 0.0f;
+            if (n0 + n < N) {
+                const float * arow = (const float *)(src1 + (int64_t)(n0 + n) * nb11);
+                a0 = arow[k]; a1 = arow[k + 1];
+            }
+            sB[n*SPAD + kk] = __floats2bfloat162_rn(a0, a1);
+        }
+        __syncthreads();
+
+        tile_A Af[FM];
+        tile_B Bf[FN];
+        #pragma unroll
+        for (int fm = 0; fm < FM; ++fm) {
+            const int mrow = (warp_m*FM + fm) * 16;
+            load_ldmatrix(Af[fm], sW + mrow*SPAD, SPAD);
+        }
+        #pragma unroll
+        for (int fn = 0; fn < FN; ++fn) {
+            const int ncol = (warp_n*FN + fn) * 8;
+            load_ldmatrix(Bf[fn], sB + ncol*SPAD, SPAD);
+        }
+        #pragma unroll
+        for (int fm = 0; fm < FM; ++fm) {
+            #pragma unroll
+            for (int fn = 0; fn < FN; ++fn) {
+                mma(C[fm][fn], Af[fm], Bf[fn]);
+            }
+        }
+        __syncthreads();
+    }
+
+    #pragma unroll
+    for (int fm = 0; fm < FM; ++fm) {
+        #pragma unroll
+        for (int fn = 0; fn < FN; ++fn) {
+            const int mbase = m0 + (warp_m*FM + fm) * 16;
+            const int nbase = n0 + (warp_n*FN + fn) * 8;
+            #pragma unroll
+            for (int l = 0; l < tile_C::ne; ++l) {
+                const int m = mbase + tile_C::get_i(l);
+                const int n = nbase + tile_C::get_j(l);
+                if (m < M && n < N) {
+                    dst[(int64_t)n * dst_ne0 + m] = C[fm][fn].x[l];
+                }
+            }
+        }
+    }
+}
+
+bool ggml_cuda_w4a16_mul_mat(
+        ggml_backend_cuda_context & ctx,
+        const ggml_tensor * src0,
+        const ggml_tensor * src1,
+        ggml_tensor       * dst) {
+    if (!w4a16_enabled()) {
+        return false;
+    }
+    if (src0->type != GGML_TYPE_Q4_0 && src0->type != GGML_TYPE_Q4_K) {
+        return false;
+    }
+    if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
+        return false;
+    }
+    const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+    if (!GGML_CUDA_CC_IS_NVIDIA(cc) || cc < GGML_CUDA_CC_BLACKWELL) {
+        return false; // needs cc >= GGML_CUDA_CC_BLACKWELL (built for consumer sm_120/121)
+    }
+
+    if (src0->ne[2] != 1 || src0->ne[3] != 1 ||
+        src1->ne[2] != 1 || src1->ne[3] != 1 ||
+        dst->ne[2]  != 1 || dst->ne[3]  != 1) {
+        return false;
+    }
+    if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(src1) || !ggml_is_contiguous(dst)) {
+        return false;
+    }
+
+    const int64_t K = src0->ne[0];
+    const int64_t M = src0->ne[1];
+    const int64_t N = src1->ne[1];
+    if (src1->ne[0] != K || dst->ne[0] != M || dst->ne[1] != N) {
+        return false;
+    }
+    if (K % 16 != 0) {
+        return false;
+    }
+
+    cudaStream_t stream = ctx.stream();
+
+    // Block tile config: WM*WN warps compute BM(=WM*FM*16) x BN(=WN*FN*8).
+    constexpr int WM = 4, WN = 4, FM = 2, FN = 4; // BM=128, BN=128, 16 warps
+    constexpr int BM = WM*FM*16;
+    constexpr int BN = WN*FN*8;
+    const dim3 grid((unsigned)((M + BM - 1) / BM), (unsigned)((N + BN - 1) / BN), 1);
+    const dim3 block(32, WM*WN, 1);
+
+    if (src0->type == GGML_TYPE_Q4_K) {
+        w4a16_gemm_kernel<true, WM, WN, FM, FN><<<grid, block, 0, stream>>>(
+            (const char *) src0->data, (const char *) src1->data, (float *) dst->data,
+            (int) M, (int) N, (int) K, src0->nb[1], src1->nb[1], dst->ne[0]);
+    } else {
+        w4a16_gemm_kernel<false, WM, WN, FM, FN><<<grid, block, 0, stream>>>(
+            (const char *) src0->data, (const char *) src1->data, (float *) dst->data,
+            (int) M, (int) N, (int) K, src0->nb[1], src1->nb[1], dst->ne[0]);
+    }
+    return true;
+}
--- a/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cuh
+++ b/backend/cpp/llama-cpp/paged/kernel/w4a16/marlin-w4a16.cuh
@@ -0,0 +1,14 @@
+#pragma once
+
+#include "common.cuh"
+
+// W4A16 Marlin-style BF16 GEMM for NVIDIA Blackwell consumer GPUs (sm_120/121).
+// Dense (non-MoE) 4-bit-weight matmul run on BF16 tensor cores, the path meant to
+// approach the GB10 BF16 ceiling where MMQ (int8, Ampere-tuned) and cuBLAS (sm_80
+// fallback) both plateau at ~22% of it. Returns true if it handled the op; false
+// to fall back to MMQ. Gated behind GGML_CUDA_W4A16 until correct + faster.
+bool ggml_cuda_w4a16_mul_mat(
+        ggml_backend_cuda_context & ctx,
+        const ggml_tensor * src0,   // 4-bit weights (Q4_0/Q4_K)
+        const ggml_tensor * src1,   // F32 activations
+        ggml_tensor       * dst);   // F32 output
--- a/backend/cpp/llama-cpp/paged/paged-bench.cpp
+++ b/backend/cpp/llama-cpp/paged/paged-bench.cpp
@@ -0,0 +1,130 @@
+// paged-bench: quantify the multi-tenant wins of paged KV allocation that are
+// properties of the host-side block model (vLLM-parity), independent of the
+// in-model compute path.
+//
+//   Win 1 (capacity):       on-demand block allocation vs contiguous per-seq
+//                           reservation, under a fixed KV block budget.
+//   Win 3 (prefix sharing): automatic cross-tenant prefix dedup via block
+//                           hashing.
+//
+// Win 2 (throughput) is intentionally NOT here: it requires the paged read
+// path wired into llama-graph.cpp (Gate 0). Measuring it at this layer would
+// be dishonest, so it is reported as pending.
+
+#include "paged_kv_manager.h"
+
+#include <cstdint>
+#include <cstdio>
+#include <vector>
+#include <numeric>
+
+using namespace paged;
+
+// A deterministic LCG: sequence lengths vary, but every run reproduces the same draw for a given seed.
+struct Lcg {
+    uint64_t s;
+    explicit Lcg(uint64_t seed) : s(seed) {}
+    uint32_t next() { s = s * 6364136223846793005ULL + 1442695040888963407ULL; return (uint32_t)(s >> 33); }
+    int range(int lo, int hi) { return lo + (int)(next() % (uint32_t)(hi - lo + 1)); }
+};
+
+static size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
+
+int main() {
+    const int block_size = 16;
+    const int n_ctx      = 2048;   // max context a sequence could use
+    const int num_blocks = 512;    // fixed KV budget: 512 blocks * 16 = 8192 cells
+
+    printf("paged-bench  (block_size=%d, n_ctx=%d, budget=%d blocks = %d cells)\n\n",
+           block_size, n_ctx, num_blocks, num_blocks * block_size);
+
+    // ---------------------------------------------------------------------
+    // WIN 1: concurrency capacity. Sequences have realistic, VARYING lengths
+    // (most short, a few long) - the regime where reserving n_ctx per seq
+    // wastes the most. Count how many fit under the same block budget.
+    // ---------------------------------------------------------------------
+    {
+        Lcg rng(12345);
+        const int blocks_per_ctx = (int) cdiv(n_ctx, block_size); // contiguous reserves this per seq
+
+        // Contiguous (stream-style) reservation: every seq reserves n_ctx worth.
+        int contiguous_fit = num_blocks / blocks_per_ctx;
+
+        // Paged on-demand: draw real lengths until the pool is exhausted.
+        PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
+        int paged_fit = 0;
+        long total_tokens = 0;
+        for (int seq = 0; ; ++seq) {
+            // 80% short (8-128 tok), 20% long (up to n_ctx)
+            int len = (rng.range(0, 99) < 80) ? rng.range(8, 128) : rng.range(128, n_ctx);
+            if (!m.allocate(seq, (size_t) len)) break;
+            paged_fit++;
+            total_tokens += len;
+        }
+
+        printf("WIN 1  concurrency capacity @ %d-block budget\n", num_blocks);
+        printf("  contiguous (reserve n_ctx/seq): %d sequences\n", contiguous_fit);
+        printf("  paged (on-demand blocks):       %d sequences  (avg %ld tok/seq)\n",
+               paged_fit, paged_fit ? total_tokens / paged_fit : 0);
+        printf("  --> paged fits %.1fx more concurrent sequences\n\n",
+               contiguous_fit ? (double) paged_fit / contiguous_fit : 0.0);
+    }
+
+    // ---------------------------------------------------------------------
+    // WIN 3: cross-tenant prefix sharing. N tenants share a long system
+    // prompt / RAG context, then diverge. Compare physical blocks consumed
+    // with prefix caching on vs off.
+    // ---------------------------------------------------------------------
+    {
+        const int n_tenants    = 32;
+        const int shared_len   = 1024;  // shared system prompt (64 blocks)
+        const int distinct_len = 64;    // per-tenant suffix (4 blocks)
+
+        // Shared prefix token ids (identical across tenants -> identical block hashes).
+        std::vector<int> shared(shared_len);
+        for (int i = 0; i < shared_len; ++i) shared[i] = 1000 + i;
+
+        // --- prefix caching OFF: every tenant pays for the whole prefix ---
+        long blocks_off = 0;
+        {
+            PagedKVManager m(num_blocks * 8, block_size, /*enable_caching=*/false);
+            for (int t = 0; t < n_tenants; ++t) {
+                m.allocate(t, (size_t) (shared_len + distinct_len));
+                blocks_off += m.block_table(t).size();
+            }
+        }
+
+        // --- prefix caching ON: shared blocks are deduped to one physical copy ---
+        long blocks_on = 0;
+        {
+            PagedKVManager m(num_blocks * 8, block_size, /*enable_caching=*/true);
+            // tenant 0 fills + caches the shared prefix
+            auto h = m.compute_block_hashes(shared);
+            m.allocate(0, (size_t) (shared_len + distinct_len));
+            m.cache_blocks(0, h, (size_t) shared_len);
+            long physical = m.block_table(0).size();
+            // tenants 1..N-1 hit the cached prefix; only their distinct suffix is new
+            for (int t = 1; t < n_tenants; ++t) {
+                size_t cached_tokens = m.get_computed_blocks(h); // shared blocks reused
+                size_t new_tokens = (shared_len - cached_tokens) + distinct_len;
+                m.allocate(t, (size_t) (shared_len + distinct_len));
+                // modeled, not measured: allocate() does not reuse the hit blocks, so count only the non-resident tokens
+                physical += (long) cdiv(new_tokens, block_size);
+            }
+            blocks_on = physical;
+        }
+
+        printf("WIN 3  cross-tenant prefix sharing (%d tenants, %d-tok shared prefix)\n",
+               n_tenants, shared_len);
+        printf("  prefix-cache OFF: %ld physical blocks\n", blocks_off);
+        printf("  prefix-cache ON:  %ld physical blocks\n", blocks_on);
+        printf("  --> %.1fx less KV memory for the shared workload\n\n",
+               blocks_on ? (double) blocks_off / blocks_on : 0.0);
+    }
+
+    printf("WIN 2  aggregate throughput under load: PENDING\n");
+    printf("  Requires the paged gather-read path wired into llama-graph.cpp\n");
+    printf("  (Gate 0) to measure tok/s vs concurrency. Not measurable at the\n");
+    printf("  allocation layer; not reported here to avoid overclaiming.\n");
+    return 0;
+}
--- a/backend/cpp/llama-cpp/paged/paged-loadgen.cpp
+++ b/backend/cpp/llama-cpp/paged/paged-loadgen.cpp
@@ -0,0 +1,169 @@
+// paged-loadgen: a dynamic-load benchmark for paged KV that actually exercises the
+// regime where paging wins - variable prompt lengths, variable generation lengths,
+// staggered (continuous) arrival, and a shared system prefix. The stock
+// examples/paged/paged.cpp adds all requests up front with a fixed n_predict from a
+// 20-prompt pool, so it never creates KV-memory pressure or fragmentation and
+// therefore never shows a paged advantage (see PAGED_KV_HIGH_CONCURRENCY.md).
+//
+// Build: drop into PR #22569's examples/paged/ and add to its CMakeLists.txt next to
+// llama-paged (it uses the same llama_paged_scheduler_* API). Run on the TARGET GPU
+// (e.g. 2xH200) where bandwidth lets decode scale to thousands of sequences and KV
+// memory becomes the binding constraint - that is where paged KV pays off and where
+// this harness produces a meaningful number. On a low-bandwidth box (GB10) throughput
+// plateaus long before memory binds, so the win is not observable there regardless.
+//
+// Metrics reported:
+//   - goodput (decode tokens/s aggregate) under the dynamic load
+//   - peak concurrent in-flight sequences actually sustained
+//   - paged peak KV bytes used  vs  the contiguous reservation a unified cache needs
+//     (n_seq_peak * max_ctx), i.e. the capacity ratio = the headroom paging unlocks
+//
+// The capacity ratio is the load-bearing number for the buy decision: it is how many
+// more concurrent tenants a fixed HBM budget serves with paging than without.
+
+#include "common.h"
+#include "llama.h"
+
+#include <cmath>
+#include <cstdio>
+#include <cstdlib>
+#include <random>
+#include <string>
+#include <vector>
+
+// ---- workload knobs (env-overridable so the harness is sweepable without rebuilds) ----
+static int env_int(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }
+
+struct workload_cfg {
+    int    total_requests  = env_int("LG_TOTAL",    2000); // total requests to serve
+    int    target_inflight = env_int("LG_INFLIGHT",  256); // continuous-batching concurrency target
+    int    prefix_tokens   = env_int("LG_PREFIX",    512); // shared system-prompt prefix (prefix-cache target)
+    int    suffix_min      = env_int("LG_SUFMIN",     16); // per-request unique prompt suffix range
+    int    suffix_max      = env_int("LG_SUFMAX",    768);
+    int    gen_short       = env_int("LG_GENSHORT",   32); // bimodal generation: most short...
+    int    gen_long        = env_int("LG_GENLONG",  1024); // ...some long (the over-reservation driver)
+    int    gen_long_pct    = env_int("LG_LONGPCT",    15); // % of requests that are long
+    int    block_size      = env_int("LG_BLOCK",      16); // must match -kvbls
+    unsigned seed          = (unsigned) env_int("LG_SEED", 1234);
+};
+
+// Per-request plan drawn from the workload distribution.
+struct req_plan { int prompt_len; int gen_len; };
+
+int main(int argc, char ** argv) {
+    common_params params;
+    params.n_predict = -1; // per-request, controlled by the plan below
+    if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_PAGED)) {
+        fprintf(stderr, "usage: %s -m <model> -kvp --fit off -ngpub N -ncpub M -ngl 99\n", argv[0]);
+        return 1;
+    }
+    params.kv_paged = true;
+
+    common_init_result init = common_init_from_params(params);
+    llama_model *   model = init.model.get();
+    llama_context * ctx   = init.context.get();
+    if (!model || !ctx) { fprintf(stderr, "load failed\n"); return 1; }
+    const llama_vocab * vocab = llama_model_get_vocab(model);
+
+    workload_cfg cfg;
+    std::mt19937 rng(cfg.seed);
+    std::uniform_int_distribution<int> suf(cfg.suffix_min, cfg.suffix_max);
+    std::uniform_int_distribution<int> pct(1, 100);
+
+    // KV bytes/token = 2(K,V) * n_layers * n_head_kv * head_dim * sizeof(f16). Confirmed
+    // against llama-kv-cache-paged.cpp (block_bytes formula). Used for the capacity ratio.
+    const int n_layers   = llama_model_n_layer(model);
+    const int n_head_kv  = llama_model_n_head_kv(model);
+    const int head_dim   = llama_model_n_embd(model) / llama_model_n_head(model);
+    const size_t kv_bytes_per_token = (size_t)2 * n_layers * n_head_kv * head_dim * sizeof(uint16_t);
+
+    // Shared system prefix reused by every request (the prefix-cache target). The string is prefix_tokens CHARACTERS long; the tokenizer may yield a different token count.
+    std::vector<llama_token> prefix = common_tokenize(ctx, std::string(cfg.prefix_tokens, 'x'), true);
+
+    // Pre-draw all request plans so paged peak usage and the contiguous reservation are
+    // computed from the SAME workload.
+    std::vector<req_plan> plans(cfg.total_requests);
+    int max_ctx = 0;
+    for (auto & p : plans) {
+        p.prompt_len = cfg.prefix_tokens + suf(rng);
+        p.gen_len    = (pct(rng) <= cfg.gen_long_pct) ? cfg.gen_long : cfg.gen_short;
+        max_ctx      = std::max(max_ctx, p.prompt_len + p.gen_len);
+    }
+
+    llama_paged_scheduler * sched = llama_paged_scheduler_init(ctx);
+    if (!sched) { fprintf(stderr, "scheduler init failed\n"); return 1; }
+
+    // ---- continuous-arrival loop: keep ~target_inflight requests live at all times ----
+    int    next_req = 0, done = 0, inflight = 0, peak_inflight = 0;
+    long   total_decoded = 0;
+    size_t peak_kv_bytes_paged = 0;   // peak of ceil(live_used_tokens/block)*block*kv_bytes (aggregate approximation)
+    size_t live_used_tokens = 0;      // running sum of actual KV tokens held by live seqs
+
+    auto admit = [&](int rid) {
+        const req_plan & p = plans[rid];
+        std::vector<llama_token> toks = prefix; // shared prefix...
+        std::vector<llama_token> suff = common_tokenize(ctx, std::string(p.prompt_len - cfg.prefix_tokens, 'y'), false);
+        toks.insert(toks.end(), suff.begin(), suff.end()); // ...+ unique suffix
+        if (llama_paged_scheduler_add_request(sched, toks.data(), toks.size(), rid)) {
+            inflight++; peak_inflight = std::max(peak_inflight, inflight);
+            live_used_tokens += p.prompt_len;
+        }
+    };
+
+    const int64_t t0 = ggml_time_us();
+    for (int i = 0; i < cfg.target_inflight && next_req < cfg.total_requests; ++i) admit(next_req++);
+
+    llama_batch batch = {};
+    std::vector<llama_token> sampled; std::vector<int8_t> stop_flags;
+
+    while (done < cfg.total_requests) {
+        if (!llama_paged_scheduler_prepare_batch(sched, &batch)) break;
+        const llama_paged_batch_info * info = llama_paged_scheduler_get_batch_info(sched);
+        sampled.assign(info->n_seq, 0); stop_flags.assign(info->n_seq, 0);
+
+        // (decode is done inside the scheduler/update path in PR #22569; greedy here)
+        for (int i = 0; i < info->n_seq; ++i) {
+            const int rid = info->seq_ids[i];
+            llama_paged_seq_state st{};
+            llama_paged_scheduler_get_seq_state(sched, rid, &st);
+            // greedy argmax from the i-th row of logits
+            const float * lg = llama_get_logits_ith(ctx, i);
+            int best = 0; float bv = lg[0];
+            for (int t = 1; t < llama_vocab_n_tokens(vocab); ++t) if (lg[t] > bv) { bv = lg[t]; best = t; }
+            sampled[i] = best;
+            const bool stop = llama_vocab_is_eog(vocab, best) || st.n_decoded + 1 >= plans[rid].gen_len;
+            stop_flags[i] = stop ? 1 : 0;
+            if (!stop) { total_decoded++; live_used_tokens++; }
+            if (stop) {
+                done++; inflight--;
+                live_used_tokens -= (plans[rid].prompt_len + st.n_decoded);
+                if (next_req < cfg.total_requests) admit(next_req++); // continuous arrival
+            }
+        }
+        // paged peak KV: blocks are allocated per live seq = ceil(used/block); approximate
+        // current paged footprint from live_used_tokens rounded up per the block size.
+        const size_t paged_now = (size_t)std::ceil((double)live_used_tokens / cfg.block_size)
+                                 * cfg.block_size * kv_bytes_per_token;
+        peak_kv_bytes_paged = std::max(peak_kv_bytes_paged, paged_now);
+
+        llama_paged_scheduler_update(sched, &batch, sampled.data(), stop_flags.data());
+    }
+    const double secs = (ggml_time_us() - t0) / 1e6;
+
+    // Contiguous unified-KV reservation needed to serve the SAME peak concurrency without
+    // mid-generation eviction: every live slot must be backed for the worst-case context.
+    const size_t contig_reserve = (size_t)peak_inflight * max_ctx * kv_bytes_per_token;
+
+    printf("\n==== paged-loadgen ====\n");
+    printf("requests served      : %d  (target inflight %d, peak inflight %d)\n", done, cfg.target_inflight, peak_inflight);
+    printf("goodput (decode)     : %.1f tok/s   (%ld tokens / %.2f s)\n", total_decoded / secs, total_decoded, secs);
+    printf("kv bytes / token     : %zu (n_layer=%d n_head_kv=%d head_dim=%d f16)\n", kv_bytes_per_token, n_layers, n_head_kv, head_dim);
+    printf("paged peak KV        : %.2f GiB (allocated on demand)\n", peak_kv_bytes_paged / 1073741824.0);
+    printf("contiguous reserve   : %.2f GiB (peak_inflight * max_ctx %d)\n", contig_reserve / 1073741824.0, max_ctx);
+    printf("CAPACITY RATIO       : %.2fx  <- tenants-per-HBM paging unlocks\n",
+           peak_kv_bytes_paged ? (double)contig_reserve / peak_kv_bytes_paged : 0.0);
+    printf("  (plus cross-request prefix sharing of the %d-token shared prefix, not counted above)\n", cfg.prefix_tokens);
+
+    llama_paged_scheduler_free(sched);
+    return 0;
+}
--- a/backend/cpp/llama-cpp/paged/paged_kv_manager.cpp
+++ b/backend/cpp/llama-cpp/paged/paged_kv_manager.cpp
@@ -0,0 +1,296 @@
+#include "paged_kv_manager.h"
+#include <cassert>
+#include <stdexcept>
+
+namespace paged {
+
+// ---------------------------------------------------------------------------
+// FreeBlockQueue  (port of kv_cache_utils.py FreeKVCacheBlockQueue)
+// ---------------------------------------------------------------------------
+
+FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
+    num_free_blocks = blocks.size();
+    for (size_t i = 0; i < blocks.size(); ++i) {
+        if (i > 0)                  blocks[i]->prev_free = blocks[i - 1];
+        if (i + 1 < blocks.size())  blocks[i]->next_free = blocks[i + 1];
+    }
+    if (!blocks.empty()) {
+        fake_head.next_free = blocks.front();
+        blocks.front()->prev_free = &fake_head;
+        fake_tail.prev_free = blocks.back();
+        blocks.back()->next_free = &fake_tail;
+    } else {
+        fake_head.next_free = &fake_tail;
+        fake_tail.prev_free = &fake_head;
+    }
+}
+
+KVCacheBlock* FreeBlockQueue::popleft() {
+    KVCacheBlock* first = fake_head.next_free;
+    if (first == &fake_tail || first == nullptr) {
+        assert(num_free_blocks == 0);
+        throw std::runtime_error("No free blocks available");
+    }
+    fake_head.next_free = first->next_free;
+    first->next_free->prev_free = &fake_head;
+    first->prev_free = first->next_free = nullptr;
+    num_free_blocks--;
+    return first;
+}
+
+std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
+    std::vector<KVCacheBlock*> ret;
+    if (n == 0) return ret;
+    assert(num_free_blocks >= n);
+    num_free_blocks -= n;
+    KVCacheBlock* curr = fake_head.next_free;
+    ret.reserve(n);
+    for (size_t i = 0; i < n; ++i) {
+        assert(curr != nullptr);
+        ret.push_back(curr);
+        KVCacheBlock* last = curr;
+        curr = curr->next_free;
+        last->prev_free = last->next_free = nullptr;
+    }
+    if (curr != nullptr) {
+        fake_head.next_free = curr;
+        curr->prev_free = &fake_head;
+    }
+    return ret;
+}
+
+void FreeBlockQueue::remove(KVCacheBlock* block) {
+    if (!block->prev_free || !block->next_free)
+        throw std::runtime_error("remove() called on an invalid block");
+    block->prev_free->next_free = block->next_free;
+    block->next_free->prev_free = block->prev_free;
+    block->prev_free = block->next_free = nullptr;
+    num_free_blocks--;
+}
+
+void FreeBlockQueue::append(KVCacheBlock* block) {
+    KVCacheBlock* last = fake_tail.prev_free;
+    last->next_free = block;
+    block->prev_free = last;
+    block->next_free = &fake_tail;
+    fake_tail.prev_free = block;
+    num_free_blocks++;
+}
+
+void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
+    if (blocks.empty()) return;
+    KVCacheBlock* last = fake_tail.prev_free;
+    for (KVCacheBlock* b : blocks) {
+        b->prev_free = last;
+        last->next_free = b;
+        last = b;
+    }
+    last->next_free = &fake_tail;
+    fake_tail.prev_free = last;
+    num_free_blocks += blocks.size();
+}
+
+void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
+    if (blocks.empty()) return;
+    KVCacheBlock* first = fake_head.next_free;
+    KVCacheBlock* prev = &fake_head;
+    for (KVCacheBlock* b : blocks) {
+        b->prev_free = prev;
+        prev->next_free = b;
+        prev = b;
+    }
+    prev->next_free = first;
+    first->prev_free = prev;
+    num_free_blocks += blocks.size();
+}
+
+std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
+    std::vector<KVCacheBlock*> ret;
+    const KVCacheBlock* curr = fake_head.next_free;
+    while (curr && curr->next_free != nullptr) {
+        ret.push_back(const_cast<KVCacheBlock*>(curr));
+        curr = curr->next_free;
+    }
+    return ret;
+}
+
+// ---------------------------------------------------------------------------
+// BlockPool  (port of block_pool.py)
+// ---------------------------------------------------------------------------
+
+static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
+    std::vector<KVCacheBlock*> p;
+    p.reserve(v.size());
+    for (auto& b : v) p.push_back(&b);
+    return p;
+}
+
+static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
+    std::vector<KVCacheBlock> v;
+    v.reserve(num_blocks);
+    for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
+    return v;
+}
+
+BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
+    : enable_caching_(enable_caching),
+      blocks_(make_block_vec(num_blocks)),
+      ptrs_(make_ptrs(blocks_)),
+      free_queue_(ptrs_) {
+    // vLLM reserves block_id 0 as the null block (never cached).
+    null_block = free_queue_.popleft();
+    null_block->is_null = true;
+}
+
+bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
+    if (!block->has_hash) return false;
+    auto it = cached_block_hash_to_block_.find(block->block_hash);
+    if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
+    cached_block_hash_to_block_.erase(it);
+    block->reset_hash();
+    return true;
+}
+
+std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
+    if (n > get_num_free_blocks())
+        throw std::runtime_error("Cannot get free blocks from pool");
+    auto ret = free_queue_.popleft_n(n);
+    for (KVCacheBlock* b : ret) {
+        if (enable_caching_) maybe_evict_cached_block(b);
+        assert(b->ref_cnt == 0);
+        b->ref_cnt += 1;
+    }
+    return ret;
+}
+
+KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
+    auto it = cached_block_hash_to_block_.find(block_hash);
+    return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
+}
+
+void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
+    for (KVCacheBlock* b : blocks) {
+        // ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
+        if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
+        b->ref_cnt += 1;
+    }
+}
+
+void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
+    std::vector<KVCacheBlock*> without_hash, with_hash;
+    for (KVCacheBlock* b : ordered_blocks) {
+        if (b->is_null) continue;
+        b->ref_cnt -= 1;
+        if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
+    }
+    free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
+    free_queue_.append_n(with_hash);     // hashed: kept warm (tail)
+}
+
+void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+                                  size_t num_cached_blocks, size_t num_full_blocks,
+                                  const std::vector<uint64_t>& block_hashes) {
+    for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
+        KVCacheBlock* blk = req_blocks[i];
+        if (blk->has_hash) continue;
+        blk->has_hash = true;
+        blk->block_hash = block_hashes[i];
+        cached_block_hash_to_block_[blk->block_hash] = blk;
+    }
+}
+
+// ---------------------------------------------------------------------------
+// PagedKVManager  (port of SingleTypeKVCacheManager / FullAttentionManager)
+// ---------------------------------------------------------------------------
+
+static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
+
+PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
+    : block_size_(block_size), pool_(num_blocks, enable_caching) {}
+
+bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
+    auto& req = req_to_blocks_[seq_id];
+    size_t need = cdiv(total_tokens, block_size_);
+    if (need <= req.size()) return true;
+    size_t add = need - req.size();
+    if (add > pool_.get_num_free_blocks()) return false; // OOM
+    auto nb = pool_.get_new_blocks(add);
+    req.insert(req.end(), nb.begin(), nb.end());
+    return true;
+}
+
+std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
+    std::vector<int32_t> bt;
+    auto it = req_to_blocks_.find(seq_id);
+    if (it == req_to_blocks_.end()) return bt;
+    bt.reserve(it->second.size());
+    for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
+    return bt;
+}
+
+int64_t PagedKVManager::slot(int seq_id, int pos) const {
+    const auto& req = req_to_blocks_.at(seq_id);
+    int32_t phys = req[pos / block_size_]->block_id;
+    return (int64_t)phys * block_size_ + (pos % block_size_);
+}
+
+std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
+    std::vector<int64_t> sm;
+    sm.reserve(positions.size());
+    for (int p : positions) sm.push_back(slot(seq_id, p));
+    return sm;
+}
+
+void PagedKVManager::free(int seq_id) {
+    auto it = req_to_blocks_.find(seq_id);
+    if (it == req_to_blocks_.end()) return;
+    // Free in reverse so the tail of the block chain is evicted first (vLLM order).
+    std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
+    pool_.free_blocks(ordered);
+    req_to_blocks_.erase(it);
+}
+
+// FNV-1a-style chained block hash (mixes one 32-bit token id per step, not one byte).
+// Deterministic and prefix-sensitive; folds the parent hash into the seed so each block hash
+// transitively encodes its whole prefix (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
+uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
+    uint64_t h = 14695981039346656037ull ^ parent_hash; // FNV-1a 64-bit offset basis
+    for (int t : token_ids) {
+        h ^= (uint64_t)(uint32_t)t;
+        h *= 1099511628211ull;
+    }
+    if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
+    return h;
+}
+
+std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
+    std::vector<uint64_t> hashes;
+    uint64_t parent = 0; // NONE_HASH analogue
+    size_t n_full = token_ids.size() / block_size_;
+    for (size_t i = 0; i < n_full; ++i) {
+        std::vector<int> blk(token_ids.begin() + i * block_size_,
+                             token_ids.begin() + (i + 1) * block_size_);
+        parent = hash_block(parent, blk);
+        hashes.push_back(parent);
+    }
+    return hashes;
+}
+
+size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
+    std::vector<KVCacheBlock*> hits;
+    for (uint64_t bh : block_hashes) {        // stop at first miss (prefix property)
+        KVCacheBlock* cb = pool_.get_cached_block(bh);
+        if (!cb) break;
+        hits.push_back(cb);
+    }
+    pool_.touch(hits);                        // ++ref_cnt, pull from free list; hits are not recorded in any block table, so free() will not release them
+    return hits.size() * (size_t)block_size_;
+}
+
+void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
+    auto& req = req_to_blocks_[seq_id];
+    size_t n_full = num_tokens / block_size_;
+    pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
+}
+
+} // namespace paged
--- a/backend/cpp/llama-cpp/paged/paged_kv_manager.h
+++ b/backend/cpp/llama-cpp/paged/paged_kv_manager.h
@@ -0,0 +1,108 @@
+#pragma once
+// Paged KV cache block manager for llama.cpp (CPU-first prototype).
+//
+// Host-side block management is a faithful port of vLLM V1:
+//   vllm/v1/core/kv_cache_utils.py            (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
+//   vllm/v1/core/block_pool.py                (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
+//   vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
+//
+// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
+// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
+// dependency so it can be unit-tested in isolation.
+
+#include <cstdint>
+#include <vector>
+#include <unordered_map>
+#include <map>
+
+namespace paged {
+
+// vLLM KVCacheBlock (kv_cache_utils.py).
+struct KVCacheBlock {
+    int32_t  block_id   = 0;
+    int      ref_cnt    = 0;
+    bool     has_hash   = false;   // vLLM: _block_hash is set only when full+cached
+    uint64_t block_hash = 0;
+    bool     is_null    = false;
+    KVCacheBlock* prev_free = nullptr;
+    KVCacheBlock* next_free = nullptr;
+
+    explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
+    void reset_hash() { has_hash = false; block_hash = 0; }
+};
+
+// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
+// O(1) middle removal is required so touch() can pull a warm cached block out of the
+// free list when a later request hits its prefix.
+class FreeBlockQueue {
+public:
+    size_t num_free_blocks = 0;
+
+    explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
+    KVCacheBlock* popleft();
+    std::vector<KVCacheBlock*> popleft_n(size_t n);
+    void remove(KVCacheBlock* block);
+    void append(KVCacheBlock* block);
+    void append_n(const std::vector<KVCacheBlock*>& blocks);
+    void prepend_n(const std::vector<KVCacheBlock*>& blocks);
+    std::vector<KVCacheBlock*> get_all_free_blocks() const;
+
+private:
+    KVCacheBlock fake_head{-1};
+    KVCacheBlock fake_tail{-1};
+};
+
+// vLLM BlockPool (block_pool.py).
+class BlockPool {
+public:
+    KVCacheBlock* null_block = nullptr;
+
+    BlockPool(int32_t num_blocks, bool enable_caching);
+    std::vector<KVCacheBlock*> get_new_blocks(size_t n);
+    KVCacheBlock* get_cached_block(uint64_t block_hash);
+    void touch(const std::vector<KVCacheBlock*>& blocks);
+    void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
+    void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+                           size_t num_cached_blocks, size_t num_full_blocks,
+                           const std::vector<uint64_t>& block_hashes);
+    size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
+
+private:
+    bool maybe_evict_cached_block(KVCacheBlock* block);
+
+    bool enable_caching_;
+    std::vector<KVCacheBlock> blocks_;     // owns all block descriptors
+    std::vector<KVCacheBlock*> ptrs_;
+    FreeBlockQueue free_queue_;
+    // vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
+    // prototype keeps the last writer (single KV-cache group is sufficient for the wins).
+    std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
+};
+
+// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
+// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
+class PagedKVManager {
+public:
+    PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
+
+    // Grow seq_id to cover total_tokens slots. Returns false, allocating nothing, if the pool has too few free blocks.
+    bool allocate(int seq_id, size_t total_tokens);
+    std::vector<int32_t> block_table(int seq_id) const;
+    int64_t slot(int seq_id, int pos) const;
+    std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
+    void free(int seq_id);
+    int block_size() const { return block_size_; }
+
+    // Prefix caching (win 3).
+    static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
+    std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
+    size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
+    void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
+
+protected:
+    int block_size_;
+    BlockPool pool_;
+    std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
+};
+
+} // namespace paged
--- a/backend/cpp/llama-cpp/paged/patches/0001-paged-kv-block-placement.patch
+++ b/backend/cpp/llama-cpp/paged/patches/0001-paged-kv-block-placement.patch
@@ -0,0 +1,59 @@
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index a49a055a6..d95102bbd 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -11,6 +11,8 @@
+ #include <cstring>
+ #include <limits>
+ #include <map>
+#include <numeric>
+#include <cstdlib>
+ #include <stdexcept>
+ 
+ static bool ggml_is_power_of_2(int n) {
+@@ -931,6 +933,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
+             return { };
+         }
+ 
+        // [paged, experimental] Place this sequence's tokens at permuted,
+        // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
+        // This validates that attention is invariant to physical KV placement -
+        // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
+        // Single-sequence scope (uses get_used() as the logical base); falls back
+        // to the normal allocator if the permuted cells aren't available.
+        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+        if (paged_mode) {
+            const uint32_t bs   = 16;                 // block size (tokens/block)
+            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
+            if (nblk >= 2) {
+                // stride coprime to nblk => block-index permutation is a bijection
+                uint32_t k = 1;
+                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
+                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
+                }
+                const uint32_t base = cells.get_used();
+                bool ok = true;
+                for (uint32_t i = 0; i < n_tokens; ++i) {
+                    const uint32_t L    = base + i;
+                    const uint32_t b    = L / bs;
+                    const uint32_t off  = L % bs;
+                    if (b >= nblk) { ok = false; break; }
+                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
+                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
+                    res.idxs[s].push_back(phys);
+                }
+                if (ok && res.idxs[s].size() == n_tokens) {
+                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
+                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+                    }
+                    continue; // paged placement succeeded for this sequence
+                }
+                res.idxs[s].clear(); // fall back to the normal allocator
+            }
+        }
+
+         uint32_t n_tested = 0;
+ 
+         // for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
--- a/backend/cpp/llama-cpp/paged/patches/0002-paged-e2e-disable-broken-autofit.patch
+++ b/backend/cpp/llama-cpp/paged/patches/0002-paged-e2e-disable-broken-autofit.patch
@@ -0,0 +1,12 @@
+diff --git a/tests/test-paged-kv-e2e.cpp b/tests/test-paged-kv-e2e.cpp
+index 5a352e3..06ead50 100644
+--- a/tests/test-paged-kv-e2e.cpp
+++ b/tests/test-paged-kv-e2e.cpp
+@@ -115,6 +115,7 @@ static path_result run_paged(const std::string & model_path) {
+     params.sampling.temp = 0.0f;  // greedy
+     params.warmup        = false;
+     params.kv_paged      = true;
++    params.fit_params    = false;  // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
+     params.n_gpu_blocks  = 64;
+     params.n_cpu_blocks  = 16;
+     params.n_sequences   = 1;
--- a/backend/cpp/llama-cpp/paged/tests/test_block_pool.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_block_pool.cpp
@@ -0,0 +1,42 @@
+#include "../paged_kv_manager.h"
+#include <cassert>
+#include <cstdio>
+using namespace paged;
+
+int main() {
+    BlockPool pool(/*num_blocks=*/8, /*enable_caching=*/true);
+    // block 0 is reserved as null_block (vLLM pops one at init)
+    assert(pool.null_block != nullptr && pool.null_block->block_id == 0);
+    assert(pool.get_num_free_blocks() == 7);
+
+    // get_new_blocks sets ref_cnt=1 and removes from free list
+    auto b = pool.get_new_blocks(2);
+    assert(b.size() == 2 && b[0]->ref_cnt == 1 && b[1]->ref_cnt == 1);
+    assert(pool.get_num_free_blocks() == 5);
+
+    // cache two full blocks with chained hashes, then look them up
+    std::vector<uint64_t> hashes = {1111, 2222};
+    pool.cache_full_blocks(b, /*num_cached=*/0, /*num_full=*/2, hashes);
+    assert(b[0]->has_hash && b[0]->block_hash == 1111);
+    assert(pool.get_cached_block(1111) == b[0]);
+    assert(pool.get_cached_block(2222) == b[1]);
+    assert(pool.get_cached_block(9999) == nullptr);
+
+    // free: hashed blocks go to tail (kept warm), so they remain queryable.
+    pool.free_blocks(b);
+    assert(b[0]->ref_cnt == 0);
+    assert(pool.get_num_free_blocks() == 7);
+    assert(pool.get_cached_block(1111) == b[0]); // still cached/warm
+
+    // touch a warm cached block: pulls it out of free list, ++ref_cnt
+    pool.touch({b[0]});
+    assert(b[0]->ref_cnt == 1);
+    assert(pool.get_num_free_blocks() == 6);
+
+    // exhausting the pool then allocating evicts a warm cached hash
+    auto rest = pool.get_new_blocks(pool.get_num_free_blocks());
+    (void) rest;
+    assert(pool.get_cached_block(2222) == nullptr); // evicted on reuse
+    printf("test_block_pool: OK\n");
+    return 0;
+}
--- a/backend/cpp/llama-cpp/paged/tests/test_free_block_queue.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_free_block_queue.cpp
@@ -0,0 +1,44 @@
+#include "../paged_kv_manager.h"
+#include <cassert>
+#include <cstdio>
+#include <vector>
+
+using namespace paged;
+
+static std::vector<KVCacheBlock> make_blocks(int n) {
+    std::vector<KVCacheBlock> v;
+    v.reserve(n);
+    for (int i = 0; i < n; ++i) v.push_back(KVCacheBlock{i});
+    return v;
+}
+
+int main() {
+    // ordered 0..9 at init; popleft yields ascending block_ids
+    auto blocks = make_blocks(10);
+    std::vector<KVCacheBlock*> ptrs;
+    for (auto& b : blocks) ptrs.push_back(&b);
+    FreeBlockQueue q(ptrs);
+    assert(q.num_free_blocks == 10);
+
+    KVCacheBlock* b0 = q.popleft();
+    assert(b0->block_id == 0);
+    assert(q.num_free_blocks == 9);
+
+    auto two = q.popleft_n(2);            // {1,2}
+    assert(two.size() == 2 && two[0]->block_id == 1 && two[1]->block_id == 2);
+    assert(q.num_free_blocks == 7);
+
+    // O(1) middle removal: remove block 5 (currently free), count drops
+    q.remove(ptrs[5]);
+    assert(q.num_free_blocks == 6);       // free: 3,4,6,7,8,9
+
+    // append puts a block at the tail; it comes back out only after the rest
+    q.append(b0);                          // free order now: 3,4,6,7,8,9,0
+    assert(q.num_free_blocks == 7);
+    auto all = q.get_all_free_blocks();
+    assert(all.front()->block_id == 3);
+    assert(all.back()->block_id == 0);
+
+    printf("test_free_block_queue: OK\n");
+    return 0;
+}
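Review note on the "O(1) middle removal" the test above relies on: one common way to get that property is an intrusive doubly linked list with sentinel ends, where unlinking touches only the two neighbours. The sketch below is an assumption about the general technique, not a copy of `FreeBlockQueue`; `node` and `free_list` are hypothetical names and the real `KVCacheBlock` layout may differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical intrusive node; stands in for a block with prev/next links.
struct node {
    int    id;
    node * prev = nullptr;
    node * next = nullptr;
};

// Doubly linked free list with sentinel head/tail, so unlinking never special-cases the ends.
struct free_list {
    node   head{-1};
    node   tail{-1};
    size_t count = 0;

    free_list() { head.next = &tail; tail.prev = &head; }

    void append(node * n) {   // push at the tail (most recently freed)
        n->prev = tail.prev;
        n->next = &tail;
        tail.prev->next = n;
        tail.prev = n;
        ++count;
    }

    void remove(node * n) {   // O(1): only the two neighbours are rewritten
        n->prev->next = n->next;
        n->next->prev = n->prev;
        n->prev = n->next = nullptr;
        --count;
    }

    node * popleft() {        // oldest entry first
        node * n = head.next;
        assert(n != &tail && "list is empty");
        remove(n);
        return n;
    }
};
```

Appending freed blocks at the tail and popping from the head gives the eviction order the test checks: a block that was just returned is the last one to be handed out again.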
--- a/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_attn.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_attn.cpp
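@@ -0,0 +1,134 @@
+#undef NDEBUG // keep assert() active in Release builds; the asserts below wrap calls with side effects
+// Phase 2 (core numeric de-risk): attention over GATHERED paged KV must equal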
+// an independent host-computed reference.
+//
+// This answers the central risk in the design: feeding gather-to-scratch KV
+// (a sequence whose blocks are non-contiguous in the shared pool) into ggml's
+// standard attention ops (mul_mat -> soft_max_ext -> mul_mat) produces correct
+// attention. If this holds, the paged read path is numerically sound; the
+// remaining work is wiring it into llama-graph.cpp (Gate 0 in a real model).
+
+#include "../paged_kv_manager.h"
+
+#include "ggml.h"
+#include "ggml-cpu.h"
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+
+#include <cassert>
+#include <cstdio>
+#include <cmath>
+#include <vector>
+
+using namespace paged;
+
+int main() {
+    const int d          = 8;     // head dim
+    const int n_kv       = 48;    // 3 blocks worth of KV tokens
+    const int n_q        = 4;     // query tokens
+    const int block_size = 16;
+    const int num_blocks = 8;
+    const int total_slots = block_size * num_blocks;
+    const float scale = 1.0f / std::sqrt((float) d);
+
+    // Non-contiguous physical layout for the KV sequence (blocks [2,1,5]).
+    PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
+    assert(m.allocate(0, 2 * block_size));
+    assert(m.allocate(1, 2 * block_size));
+    m.free(0);
+    assert(m.allocate(2, n_kv));
+    std::vector<int> positions(n_kv);
+    for (int i = 0; i < n_kv; ++i) positions[i] = i;
+    auto slots64 = m.slot_mapping(2, positions);
+    std::vector<int32_t> slots32(slots64.begin(), slots64.end());
+
+    // Deterministic K, V, Q in logical [d, n] layout (column-major: col = token).
+    std::vector<float> K(d * n_kv), V(d * n_kv), Q(d * n_q);
+    for (int t = 0; t < n_kv; ++t)
+        for (int e = 0; e < d; ++e) {
+            K[t * d + e] = std::sin(0.1f * t + 0.3f * e);
+            V[t * d + e] = std::cos(0.2f * t - 0.1f * e);
+        }
+    for (int q = 0; q < n_q; ++q)
+        for (int e = 0; e < d; ++e) Q[q * d + e] = std::sin(0.05f * q + 0.7f * e);
+
+    // ---- Independent host reference attention -------------------------------
+    std::vector<float> ref(d * n_q, 0.0f);
+    for (int q = 0; q < n_q; ++q) {
+        std::vector<float> score(n_kv);
+        float mx = -1e30f;
+        for (int t = 0; t < n_kv; ++t) {
+            float dot = 0.0f;
+            for (int e = 0; e < d; ++e) dot += K[t * d + e] * Q[q * d + e];
+            score[t] = dot * scale;
+            mx = std::fmax(mx, score[t]);
+        }
+        float sum = 0.0f;
+        for (int t = 0; t < n_kv; ++t) { score[t] = std::exp(score[t] - mx); sum += score[t]; }
+        for (int t = 0; t < n_kv; ++t) {
+            float p = score[t] / sum;
+            for (int e = 0; e < d; ++e) ref[q * d + e] += p * V[t * d + e];
+        }
+    }
+
+    // ---- ggml paged path ----------------------------------------------------
+    ggml_backend_t backend = ggml_backend_cpu_init();
+    struct ggml_init_params dp = { ggml_tensor_overhead() * 16, NULL, true };
+    struct ggml_context * ctx_data = ggml_init(dp);
+
+    struct ggml_tensor * poolK = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, total_slots);
+    struct ggml_tensor * poolV = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, total_slots);
+    struct ggml_tensor * kSrc  = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_kv);
+    struct ggml_tensor * vSrc  = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_kv);
+    struct ggml_tensor * qT    = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_q);
+    struct ggml_tensor * wIdx  = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, n_kv);
+    struct ggml_tensor * gIdx  = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I32, n_kv);
+
+    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_data, backend);
+    std::vector<float> zeros(d * total_slots, 0.0f);
+    ggml_backend_tensor_set(poolK, zeros.data(), 0, ggml_nbytes(poolK));
+    ggml_backend_tensor_set(poolV, zeros.data(), 0, ggml_nbytes(poolV));
+    ggml_backend_tensor_set(kSrc, K.data(), 0, ggml_nbytes(kSrc));
+    ggml_backend_tensor_set(vSrc, V.data(), 0, ggml_nbytes(vSrc));
+    ggml_backend_tensor_set(qT,   Q.data(), 0, ggml_nbytes(qT));
+    ggml_backend_tensor_set(wIdx, slots64.data(), 0, ggml_nbytes(wIdx));
+    ggml_backend_tensor_set(gIdx, slots32.data(), 0, ggml_nbytes(gIdx));
+
+    struct ggml_init_params cp = { ggml_tensor_overhead() * 64 + ggml_graph_overhead(), NULL, true };
+    struct ggml_context * ctx = ggml_init(cp);
+
+    struct ggml_tensor * wroteK = ggml_set_rows(ctx, poolK, kSrc, wIdx);
+    struct ggml_tensor * wroteV = ggml_set_rows(ctx, poolV, vSrc, wIdx);
+    struct ggml_tensor * gK = ggml_get_rows(ctx, wroteK, gIdx);          // [d, n_kv]
+    struct ggml_tensor * gV = ggml_get_rows(ctx, wroteV, gIdx);          // [d, n_kv]
+
+    struct ggml_tensor * kq    = ggml_mul_mat(ctx, gK, qT);              // [n_kv, n_q]
+    struct ggml_tensor * probs = ggml_soft_max_ext(ctx, kq, NULL, scale, 0.0f);
+    struct ggml_tensor * vT    = ggml_cont(ctx, ggml_transpose(ctx, gV)); // [n_kv, d]
+    struct ggml_tensor * out   = ggml_mul_mat(ctx, vT, probs);           // [d, n_q]
+    ggml_set_output(out);
+
+    struct ggml_cgraph * gf = ggml_new_graph(ctx);
+    ggml_build_forward_expand(gf, out);
+    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
+    assert(ggml_gallocr_alloc_graph(galloc, gf));
+    assert(ggml_backend_graph_compute(backend, gf) == GGML_STATUS_SUCCESS);
+
+    std::vector<float> got(d * n_q);
+    ggml_backend_tensor_get(out, got.data(), 0, ggml_nbytes(out));
+
+    // ---- compare ------------------------------------------------------------
+    double max_err = 0.0;
+    for (int i = 0; i < d * n_q; ++i) max_err = std::fmax(max_err, std::fabs(got[i] - ref[i]));
+    printf("paged attention max abs err vs host reference: %.3e\n", max_err);
+    assert(max_err < 1e-4 && "paged-gathered attention must match host reference");
+
+    ggml_gallocr_free(galloc);
+    ggml_free(ctx);
+    ggml_free(ctx_data);
+    ggml_backend_buffer_free(buf);
+    ggml_backend_free(backend);
+
+    printf("test_ggml_paged_attn: OK (attention over non-contiguous paged KV matches reference)\n");
+    return 0;
+}
--- a/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_rw.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_ggml_paged_rw.cpp
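@@ -0,0 +1,143 @@
+#undef NDEBUG // keep assert() active in Release builds; the asserts below wrap calls with side effects
+// Phase 1 integration test: prove the paged KV write+read MECHANISM at the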
+// ggml-op level, driven by PagedKVManager.
+//
+//   write:  ggml_set_rows(pool, k_src, slot_mapping)   // scatter by slot
+//   read:   ggml_get_rows(pool, gather_idx)            // gather seq's slots
+//
+// The decisive property: a sequence's physical blocks are NON-CONTIGUOUS and
+// OUT-OF-ORDER (forced via allocate/free/reallocate), yet gather(write(x)) == x,
+// and a second sequence written into disjoint blocks does not contaminate it.
+// This is exactly how a paged read path feeds contiguous scratch to attention.
+
+#include "../paged_kv_manager.h"
+
+#include "ggml.h"
+#include "ggml-cpu.h"
+#include "ggml-alloc.h"
+#include "ggml-backend.h"
+
+#include <cassert>
+#include <cstdio>
+#include <cmath>
+#include <vector>
+
+using namespace paged;
+
+int main() {
+    const int n_embd      = 8;
+    const int block_size  = 16;
+    const int num_blocks  = 8;                       // block 0 reserved as null
+    const int total_slots = block_size * num_blocks; // 128
+
+    // --- Force a non-contiguous, out-of-order block layout for seqC ----------
+    PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
+    assert(m.allocate(/*seqA=*/0, 2 * block_size)); // blocks {1,2}
+    assert(m.allocate(/*seqB=*/1, 2 * block_size)); // blocks {3,4}
+    m.free(0);                                       // returns {1,2} to free list
+    assert(m.allocate(/*seqC=*/2, 3 * block_size));  // reuses freed blocks, reordered
+
+    auto btC = m.block_table(2);
+    auto btB = m.block_table(1);
+    printf("seqC block_table = [");
+    for (size_t i = 0; i < btC.size(); ++i) printf("%s%d", i ? "," : "", btC[i]);
+    printf("]\n");
+    assert(btC.size() == 3);
+    // sanity: seqC and seqB occupy disjoint physical blocks
+    for (int cb : btC) for (int bb : btB) assert(cb != bb);
+
+    const int n_tokens = 3 * block_size; // 48 tokens for seqC
+
+    // slot_mapping for seqC positions 0..n_tokens-1
+    std::vector<int> positions(n_tokens);
+    for (int i = 0; i < n_tokens; ++i) positions[i] = i;
+    std::vector<int64_t> slots64 = m.slot_mapping(2, positions); // I64 for set_rows
+    std::vector<int32_t> slots32(slots64.begin(), slots64.end()); // I32 for get_rows
+
+    // seqB occupies different blocks; write a sentinel there to prove isolation.
+    std::vector<int> posB(2 * block_size);
+    for (size_t i = 0; i < posB.size(); ++i) posB[i] = (int) i;
+    std::vector<int64_t> slotsB64 = m.slot_mapping(1, posB);
+
+    // --- ggml backend + persistent (statically allocated) tensors ------------
+    ggml_backend_t backend = ggml_backend_cpu_init();
+    assert(backend);
+
+    struct ggml_init_params dp = { /*mem_size=*/ ggml_tensor_overhead() * 16,
+                                   /*mem_buffer=*/ NULL, /*no_alloc=*/ true };
+    struct ggml_context * ctx_data = ggml_init(dp);
+
+    // The shared paged KV pool: one flat block pool, exactly like a paged layer.
+    struct ggml_tensor * pool    = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, total_slots);
+    struct ggml_tensor * k_src   = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, n_tokens);
+    struct ggml_tensor * w_idx   = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, n_tokens);
+    struct ggml_tensor * g_idx   = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I32, n_tokens);
+    struct ggml_tensor * kB_src  = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, (int) posB.size());
+    struct ggml_tensor * wB_idx  = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, (int) posB.size());
+
+    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_data, backend);
+    assert(buf);
+
+    // pool starts zeroed
+    std::vector<float> zeros(n_embd * total_slots, 0.0f);
+    ggml_backend_tensor_set(pool, zeros.data(), 0, ggml_nbytes(pool));
+
+    // token t carries the value (float) t in every embedding lane -> easy to verify
+    std::vector<float> ksrc(n_embd * n_tokens);
+    for (int t = 0; t < n_tokens; ++t)
+        for (int e = 0; e < n_embd; ++e) ksrc[t * n_embd + e] = (float) t;
+    ggml_backend_tensor_set(k_src, ksrc.data(), 0, ggml_nbytes(k_src));
+    ggml_backend_tensor_set(w_idx, slots64.data(), 0, ggml_nbytes(w_idx));
+    ggml_backend_tensor_set(g_idx, slots32.data(), 0, ggml_nbytes(g_idx));
+
+    // seqB sentinel = 999 everywhere
+    std::vector<float> kBsrc(n_embd * posB.size(), 999.0f);
+    ggml_backend_tensor_set(kB_src, kBsrc.data(), 0, ggml_nbytes(kB_src));
+    ggml_backend_tensor_set(wB_idx, slotsB64.data(), 0, ggml_nbytes(wB_idx));
+
+    // --- compute graph: write seqB, write seqC, then gather seqC -------------
+    struct ggml_init_params cp = { /*mem_size=*/ ggml_tensor_overhead() * 32 + ggml_graph_overhead(),
+                                   /*mem_buffer=*/ NULL, /*no_alloc=*/ true };
+    struct ggml_context * ctx = ggml_init(cp);
+
+    struct ggml_tensor * wroteB = ggml_set_rows(ctx, pool,   kB_src, wB_idx); // view(pool)
+    struct ggml_tensor * wroteC = ggml_set_rows(ctx, wroteB, k_src,  w_idx);  // chain so order is fixed
+    struct ggml_tensor * gathered = ggml_get_rows(ctx, wroteC, g_idx);
+    ggml_set_output(gathered);
+
+    struct ggml_cgraph * gf = ggml_new_graph(ctx);
+    ggml_build_forward_expand(gf, gathered);
+
+    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
+    assert(ggml_gallocr_alloc_graph(galloc, gf));
+
+    assert(ggml_backend_graph_compute(backend, gf) == GGML_STATUS_SUCCESS);
+
+    // --- verify gather(write(x)) == x for the non-contiguous sequence --------
+    std::vector<float> out(n_embd * n_tokens);
+    ggml_backend_tensor_get(gathered, out.data(), 0, ggml_nbytes(gathered));
+
+    int mism = 0;
+    for (int t = 0; t < n_tokens; ++t)
+        for (int e = 0; e < n_embd; ++e)
+            if (std::fabs(out[t * n_embd + e] - (float) t) > 1e-6f) mism++;
+    assert(mism == 0 && "gathered paged KV must equal source (round-trip)");
+
+    // --- verify isolation: read seqC slots directly from pool, unaffected by seqB
+    std::vector<float> pool_host(n_embd * total_slots);
+    ggml_backend_tensor_get(pool, pool_host.data(), 0, ggml_nbytes(pool));
+    for (int t = 0; t < n_tokens; ++t) {
+        int slot = (int) slots64[t];
+        for (int e = 0; e < n_embd; ++e)
+            assert(std::fabs(pool_host[slot * n_embd + e] - (float) t) < 1e-6f);
+    }
+
+    ggml_gallocr_free(galloc);
+    ggml_free(ctx);
+    ggml_free(ctx_data);
+    ggml_backend_buffer_free(buf);
+    ggml_backend_free(backend);
+
+    printf("test_ggml_paged_rw: OK (non-contiguous paged write/gather round-trip)\n");
+    return 0;
+}
--- a/backend/cpp/llama-cpp/paged/tests/test_paged_kv_manager.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_paged_kv_manager.cpp
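@@ -0,0 +1,33 @@
+#include "../paged_kv_manager.h"
+#undef NDEBUG // keep assert() active in Release builds; the asserts below wrap calls with side effects
+#include <cassert>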
+#include <cstdio>
+using namespace paged;
+
+int main() {
+    PagedKVManager m(/*num_blocks=*/8, /*block_size=*/16, /*enable_caching=*/false);
+    // 20 tokens -> ceil(20/16)=2 blocks
+    assert(m.allocate(/*seq=*/0, 20));
+    auto bt = m.block_table(0);
+    assert(bt.size() == 2);
+
+    // slot arithmetic: pos 0 -> block bt[0]*16 + 0 ; pos 17 -> bt[1]*16 + 1
+    assert(m.slot(0, 0)  == (int64_t)bt[0] * 16 + 0);
+    assert(m.slot(0, 17) == (int64_t)bt[1] * 16 + 1);
+
+    auto sm = m.slot_mapping(0, {0, 16, 17});
+    assert(sm.size() == 3 && sm[1] == (int64_t)bt[1] * 16 + 0);
+
+    // growing the same seq reuses existing blocks, adds only new ones
+    assert(m.allocate(0, 40)); // ceil(40/16)=3 -> +1 block
+    assert(m.block_table(0).size() == 3);
+
+    // OOM: blocks left = 8 - 1(null) - 3 = 4 blocks; ask for 5 blocks
+    assert(m.allocate(1, 5 * 16) == false);
+
+    // free returns blocks to the pool for reuse
+    m.free(0);
+    assert(m.allocate(1, 5 * 16)); // now fits
+    printf("test_paged_kv_manager: OK\n");
+    return 0;
+}
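Review note on the slot arithmetic the test above asserts: a token position maps to a physical slot by looking up its logical block in the block table and adding the in-block offset. The sketch below restates that arithmetic as free functions; `blocks_needed` and `slot_of` are hypothetical names for illustration and are not the `PagedKVManager` interface.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of blocks needed to hold n_tokens, i.e. ceil(n_tokens / block_size).
static int blocks_needed(int n_tokens, int block_size) {
    return (n_tokens + block_size - 1) / block_size;
}

// Hypothetical helper: physical slot for a token position.
// block_table[i] is the physical block id backing logical block i of the sequence.
static int64_t slot_of(const std::vector<int> & block_table, int block_size, int pos) {
    const int logical_block = pos / block_size;
    const int offset        = pos % block_size;
    return (int64_t) block_table[logical_block] * block_size + offset;
}
```

For a block table of `{3, 1}` and a block size of 16, position 0 maps to slot 48 and position 17 maps to slot 17: the logical order of a sequence is independent of where its blocks sit in the pool.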
--- a/backend/cpp/llama-cpp/paged/tests/test_prefix_cache.cpp
+++ b/backend/cpp/llama-cpp/paged/tests/test_prefix_cache.cpp
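@@ -0,0 +1,36 @@
+#include "../paged_kv_manager.h"
+#undef NDEBUG // keep assert() active in Release builds; the asserts below wrap calls with side effects
+#include <cassert>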
+#include <cstdio>
+#include <vector>
+using namespace paged;
+
+int main() {
+    PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*enable_caching=*/true);
+
+    // shared prefix of 32 tokens (2 full blocks) + distinct suffix
+    std::vector<int> shared(32);
+    for (int i = 0; i < 32; ++i) shared[i] = 100 + i;
+
+    // chained hashing is deterministic and prefix-sensitive
+    auto h = m.compute_block_hashes(shared);
+    assert(h.size() == 2);
+    auto h2 = m.compute_block_hashes(shared);
+    assert(h == h2);                          // deterministic
+    std::vector<int> other = shared; other[0] = 999;
+    assert(m.compute_block_hashes(other)[0] != h[0]); // sensitive to content
+
+    // seq 0: cold, no cache hit yet
+    assert(m.get_computed_blocks(h) == 0);
+    assert(m.allocate(0, 32));
+    m.cache_blocks(0, h, 32);
+
+    // seq 1: warm — the 2 shared blocks are a cache hit (32 tokens)
+    assert(m.get_computed_blocks(h) == 32);
+
+    // first-miss stop: a chain that diverges after block 1 hits only 1 block
+    auto hmix = h; hmix[1] = 0xDEADBEEF;
+    assert(m.get_computed_blocks(hmix) == 16);
+    printf("test_prefix_cache: OK\n");
+    return 0;
+}
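Review note on the two properties the test above checks, chained hashing and the first-miss stop. In a chained scheme each full block's hash depends on its parent block's hash, so a lookup can stop at the first block that is not cached. The sketch below is a minimal illustration under loud assumptions: the FNV-1a style mixing is a stand-in chosen for brevity, `chain_hashes` and `cached_prefix_tokens` are hypothetical names, and the real `compute_block_hashes` / `get_computed_blocks` may be implemented differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical chained hash: each full block mixes the parent block's hash with its token ids.
// Trailing tokens that do not fill a whole block are not hashed.
static std::vector<uint64_t> chain_hashes(const std::vector<int> & tokens, size_t block_size) {
    std::vector<uint64_t> out;
    uint64_t parent = 0;
    for (size_t start = 0; start + block_size <= tokens.size(); start += block_size) {
        uint64_t h = 1469598103934665603ull ^ parent;   // FNV offset basis, seeded by the parent
        for (size_t i = 0; i < block_size; ++i) {
            h ^= (uint64_t) (uint32_t) tokens[start + i];
            h *= 1099511628211ull;                      // FNV prime
        }
        out.push_back(h);
        parent = h;
    }
    return out;
}

// Count cached tokens: walk the chain and stop at the first hash that is not in the cache.
static size_t cached_prefix_tokens(const std::vector<uint64_t> & hashes,
                                   const std::unordered_set<uint64_t> & cache,
                                   size_t block_size) {
    size_t n = 0;
    for (uint64_t h : hashes) {
        if (cache.count(h) == 0) break;
        n += block_size;
    }
    return n;
}
```

Because the parent hash feeds into every later block, changing a token in block 0 changes all downstream hashes, which is what makes a prefix hit safe to reuse.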
--- a/backend/cpp/llama-cpp/patches/BENCHMARKS.md
+++ b/backend/cpp/llama-cpp/patches/BENCHMARKS.md
@@ -0,0 +1,106 @@
+# Paged-attention / parity benchmarks (GB10 / DGX Spark)
+
+Goal of the series: vLLM parity. This records the measured gap so the parity claim is data-backed, not asserted.
+
+**Setup:** GB10 (sm_121, 119 GiB unified). Model Qwen3-Coder-30B-A3B. llama.cpp = pinned base + this series
+(MXFP4_MOE, `-fa 1 -b 2048 -ub 2048`, `llama-batched-bench`, PP=512 TG=128). vLLM = 0.23.0 FP8 (recorded
+prior run, same box/model). S_PP / S_TG are aggregate prefill / decode tok/s across B streams.
+
+## Fresh llama.cpp (this series, MXFP4) vs vLLM (FP8)
+
+| B | llama S_PP | vLLM S_PP | PP gap | llama S_TG | vLLM S_TG | TG gap |
+|---|-----------|-----------|--------|-----------|-----------|--------|
+| 1 | 1565 | 9644 | 6.2× | **83** | 48 | **llama wins** |
+| 8 | 3648 | 33373 | 9.1× | 126 | 312 | 2.5× |
+| 32 | 2074 | 99398 | 48× | 319 | 1171 | 3.7× |
+| 64 | 3643 | 151990 | 42× | 771 | 2064 | 2.7× |
+
+## Verdict — two distinct gaps, only one is the engine's
+
+1. **Prefill (S_PP): 6–48× behind, and it does NOT scale with B** (plateaus ~3.6k). This is the **FP4 MoE
+   GEMM kernel** (`mul_mat_q<MXFP4>` ~22 TFLOP/s), confirmed earlier. **Paged attention cannot close this** —
+   it's per-token compute. Needs the tcgen05/CUTLASS grouped-GEMM (Lever 3, multi-week, no upstream base).
+2. **Decode at concurrency (S_TG): 2.5–3.7× behind for B≥8** (we *win* at B=1). This gap IS partly the
+   engine's domain — vLLM's block-paged KV + continuous batching pack more concurrent decode work per step.
+   **This is what patches 0003–0006 target.** The win here is realistic; the prefill win is not (kernel).
+
+## CORRECTION — decode-phase profile (B=64, decode-dominated nsys)
+
+The "decode gap is engine-addressable" read above was **wrong**. Profiling a decode-dominated B=64 run:
+
+| kernel | % GPU time |
+|---|---|
+| `mul_mat_q<MXFP4>` (MoE GEMM) | **54.6** |
+| `flash_attn_ext` (attention) | 19.8 |
+| `mul_mat_q<Q8>` (dense) | 10.9 |
+| KV writes / quant / norms / rest | ~15 |
+
+**Decode at concurrency is ALSO dominated by the FP4 MoE GEMM (54.6%)** — the same Lever-3 kernel as prefill.
+Attention (the only thing paging optimizes) is ~20%, and the gather-read reclaims only the *masked-cell*
+fraction of that. So **the paged series (0003–0006) cannot close the vLLM gap in either phase** — both are
+MoE-kernel-bound. vLLM's concurrency advantage is its MoE/attention *kernels*, not (mainly) its KV management.
+
+### What the paged series IS still good for (just not throughput parity)
+
+- **Capacity**: block-granular + on-demand allocation → fit more/longer concurrent sequences in fixed VRAM.
+- **Prefix sharing**: cross-request block dedup → lower TTFT + memory on shared system prompts / RAG.
+
+These are real wins on *memory-pressured* and *shared-prefix* workloads — but they are not tok/s parity, and
+batched-bench (fresh, non-fragmented, no shared prefix) won't show them.
+
+## DENSE model parity (Qwen3-32B) — does the kernel gap exist for dense too? YES.
+
+The MoE work above is about the grouped MoE GEMM. Dense models use a different (non-grouped) matmul path,
+so we benchmarked a dense 32B head-to-head.
+
+**Headline comparison — vLLM NVFP4 W4A16 vs llama.cpp Q4_K_M.** This is the *correct apples-to-apples on
+DGX Spark*: both are **4-bit weights / 16-bit activations** (same quant class). vLLM = `Qwen3-32B-NVFP4A16`
+(FlashInfer Marlin W4A16 kernel); llama.cpp = `Qwen3-32B-Q4_K_M` (int8-MMQ compute). The only difference is
+the compute kernel — which is exactly what we're measuring. (Full **W4A4** NVFP4 does not run on GB10 today;
+root cause below — and it would *not* be a fair comparison even if it did, since Q4_K_M is also weight-only-4-bit.)
+
+| B | llama Q4_K_M PP | vLLM W4A16 PP | PP gap | llama decode | vLLM decode | TG gap |
+|---|---|---|---|---|---|---|
+| 1 | 708 | 5367 | 7.6× | 10.2 | 11.7 | ~parity |
+| 8 | 761 | 14941 | 20× | 58 | 92 | 1.6× |
+| 32 | 763 | 21952 | 29× | 205 | 330 | 1.6× |
+| 64 | 765 | 24444 | 32× | 253 | 569 | 2.2× |
+
+**Findings:**
+1. **Dense prefill has the SAME kind of kernel gap (7.6–32×).** llama dense prefill plateaus at ~765 t/s regardless of
+   B; vLLM scales to 24.4k (32×). Both read 4-bit weights — the gap is the compute kernel: vLLM's FP4 Marlin
+   tensor-core GEMM vs llama's int8-MMQ. (Note: on consumer Blackwell, W4A16 Marlin is also reported *faster*
+   than the experimental W4A4 path, so W4A16 isn't a handicapped stand-in — it's the fast path.)
+2. **Decode is ~parity at B=1** (10.2 vs 11.7 — both weight-bandwidth-bound reading 4-bit weights), and the
+   gap grows with batch (compute starts to matter → the kernel gap reappears: 2.2× at B=64).
+3. **Scope decision (the reason for this benchmark): the Lever-3 kernel track must also deliver a NON-grouped
+   block-scaled FP4 GEMM for dense**, not only the MoE grouped GEMM. The dense GEMM is the simpler of the two
+   (a plain CUTLASS dense GEMM), so it's a good first kernel to land — and it benefits every dense model.
+   - **No cheap lever:** `GGML_CUDA_FORCE_CUBLAS` is a **no-op for dense too** (Q4_K pp512: 720.8 vs 721.8) —
+     dequant→cuBLAS-BF16 doesn't engage / isn't faster than int8-MMQ on GB10. With ubatch (saturates) and
+     nwarps (static_assert) already ruled out for MoE, **every config/flag lever is now exhausted** for both
+     model classes. Parity is strictly the FP4 tensor-core kernel.
+4. **Why full W4A4 NVFP4 hangs on GB10 (root cause, researched).** This is a *known consumer-Blackwell
+   limitation, not a misconfiguration*. **FlashInfer ships no FP4 cubins for sm_120/sm_121** — its precompiled
+   kernels are all datacenter `Sm100a/Sm103a` (B200/B300). So on GB10 the dense `mm_fp4` W4A4 GEMM has no
+   working kernel: the optimized path is gated off for sm_121 (heuristic checks `minor==0`; 12.1 fails), the
+   CUTLASS dense FP4 fallback is documented to silently return **all-zeros**, and TRT-LLM errors at capability
+   120. Our exact symptom — loads weights, then stalls at the first profiling forward pass with
+   `enable_flashinfer_autotune=True` at 0–3% GPU — is the **FlashInfer FP4 autotuner/JIT spinning on an arch
+   with no FP4 cubins** (matches vllm #30163/#26381, flashinfer #2577/#3294). The "NVFP4 on DGX Spark" story
+   everyone cites is about *quantization + memory footprint + W4A16/MoE*, **not dense W4A4 inference**, which
+   isn't validated on sm_121 yet (where people patched it working, it was slower than W4A16 anyway).
+   **Therefore W4A16 vs Q4_K_M above is the right, reproducible apples-to-apples** for DGX Spark today.
+   Optional W4A4 retry (verify output isn't zeros first): `VLLM_SKIP_FLASHINFER_AUTOTUNE=1` +
+   `VLLM_NVFP4_GEMM_BACKEND=cutlass` + `--enforce-eager`, or NVIDIA's `vllm/vllm-openai:cu130-nightly` container.
+
+## So, honestly, where parity stands
+
+- **Decode single-stream: already at/above parity** (B=1: 83 vs 48).
+- **Decode concurrency: 2.5–3.7× behind at B≥8, and (per the CORRECTION above) also kernel-bound.** The FP4 MoE
+  GEMM is 54.6% of decode GPU time at B=64, so the paged series (0003–0006) cannot close this gap by itself.
+- **Prefill: kernel-bound, not engine-bound.** No amount of paging reaches vLLM here; that's a separate track.
+
+**Series status when measured:** 0001 (vendor) + 0002 (placement, token-identical) done; 0003 (gather-read)
+turn-key-planned, not yet implemented. These numbers are the *baseline* for the engine patches at B≥8
+decode; re-run this table after 0004/0005 to measure how much of the concurrency gap they actually recover.
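Review note on the CORRECTION section of BENCHMARKS.md: the claim that paging cannot close the decode gap can be sanity-checked with Amdahl's law, using only the profile numbers quoted there. The sketch below assumes that share of GPU time is a fair proxy for share of decode wall-clock time, which the document does not state; `amdahl_speedup` and `amdahl_limit` are hypothetical helper names.

```cpp
#include <cassert>
#include <cmath>

// Amdahl's law: overall speedup when a fraction f of the runtime is made s times faster.
static double amdahl_speedup(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}

// Upper bound when that fraction is removed entirely (s -> infinity).
static double amdahl_limit(double f) {
    return 1.0 / (1.0 - f);
}
```

With attention at 19.8% of decode GPU time, `amdahl_limit(0.198)` is about 1.25×. Even a free attention step would therefore leave most of the 2.7× decode gap measured at B=64 in place, which agrees with the conclusion that the remaining gap sits in the GEMM kernels.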
--- a/backend/cpp/llama-cpp/patches/README.md
+++ b/backend/cpp/llama-cpp/patches/README.md
@@ -0,0 +1,82 @@
+# llama.cpp patch series — paged attention (vLLM-parity engine)
+
+A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
+in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
+(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
+`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us
+**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
+patch needs fixing, and the failure points at exactly which step the upstream change touched.
+
+## Base
+
+- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
+  the pin = re-run the regen workflow below and fix only the patches that no longer apply.
+
+## The series (phases → patches)
+
+| # | Patch | What | Verifies |
+|---|-------|------|----------|
+| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately under `../paged/` |
+| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
+| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
+| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
+| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
+| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
+
+Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off
+(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so a partial series never changes stock behavior.
+
+## Regen workflow (the anti-drift recipe)
+
+```sh
+# 1. check out the exact pin into a dev tree
+git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
+git checkout <LLAMA_VERSION from ../Makefile>
+git checkout -b paged
+
+# 2. apply the current series (each becomes a commit), or develop the next patch
+git am /path/to/backend/cpp/llama-cpp/patches/00*.patch     # or `git apply` + commit per patch
+
+# 3. iterate a phase as ONE commit, then export the whole series 1:1
+git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp/patches/ --zero-commit -N
+
+# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
+```
+
+## Build integration
+
+`../Makefile`'s `llama.cpp:` target runs, after `git checkout -b build $(LLAMA_VERSION)`:
+```
+for p in $(CURRENT_MAKEFILE_DIR)/patches/0*.patch; do git apply --verbose "$p"; done
+```
+All variants (avx/avx2/avx512/cuda/…) copy the patched `llama.cpp/` tree, so the series ships everywhere.
+
+## Status
+
+- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`.
+- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
+  **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
+- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
+  (`ADDITIVE_DESIGN.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
+  subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
+  `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
+  `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
+  **9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
+  compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
+  - **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
+    flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
+    scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
+    the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit-
+    identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what
+    makes paged placement token-identical under flash-attn.**
+- 0004–0006 follow.
+
+### Honest parity note (important)
+
+This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
+by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
+(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
+touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
+gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
+`../paged/UPSTREAM_GGML_ISSUE.md` and `DGX_BLACKWELL_PLAN.md`). So full vLLM parity = this series **AND** the
+kernel; neither alone suffices.
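Note on the key correctness finding in the Status section above: the position-sorted gather can be illustrated with a small standalone sketch. This is not the patch's `get_gather_idxs`; the `cell_pos` vector (token position per physical cell, `-1` for an empty cell) and the function name are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical model of one KV stream: cell_pos[i] is the token position held by
// physical cell i, or -1 when the cell is empty. Not the real llama.cpp cell type.
std::vector<int32_t> sorted_gather_idxs(const std::vector<int32_t> & cell_pos) {
    std::vector<int32_t> idxs;
    for (int32_t i = 0; i < static_cast<int32_t>(cell_pos.size()); ++i) {
        if (cell_pos[i] >= 0) {
            idxs.push_back(i); // keep only non-empty cells (compaction)
        }
    }
    // Order by token position rather than physical index, so a reduction that walks
    // the gathered cells front to back visits tokens in the same order as a
    // contiguous (stock) layout would.
    std::stable_sort(idxs.begin(), idxs.end(),
                     [&cell_pos](int32_t a, int32_t b) { return cell_pos[a] < cell_pos[b]; });
    return idxs;
}
```

With scattered placement, physical order and position order differ, which is why the sort matters for an order-sensitive floating-point reduction such as an online softmax.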
--- a/backend/cpp/llama-cpp/patches/kernel/0001-fp4-grouped-moe-scaffold.patch
+++ b/backend/cpp/llama-cpp/patches/kernel/0001-fp4-grouped-moe-scaffold.patch
@@ -0,0 +1,91 @@
+diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cu b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
+new file mode 100644
+index 0000000..5f5a782
+--- /dev/null
+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
+@@ -0,0 +1,46 @@
+#include "fp4-grouped-moe.cuh"
+
+#include <cstdlib>
+#include <cstdio>
+
+// SCAFFOLD for the FP4 grouped-GEMM MoE kernel (Lever 3).
+//
+// Why: on GB10 (sm_121) the MoE matmul runs mul_mat_q<MXFP4> - a warp-level mma.sync grouped MMQ -
+// at ~22 effective TFLOP/s, ~27x behind vLLM prefill, and it also dominates decode at concurrency
+// (54.6% of GPU time at B=64). It is the single bottleneck to vLLM parity in BOTH phases; paged
+// attention cannot touch it (proven by profiling). The fix is a CUTLASS-3.x collective-mainloop
+// grouped GEMM over all experts, block-scaled e2m1 operands via tcgen05 tensor-memory MMA.
+//
+// This file is the integration seam. It is currently a no-op that always falls back to MMQ, so the
+// default build is byte-identical. The kernel is filled in over the phases in the design doc.
+
+static bool fp4_grouped_enabled() {
+    static const bool en = (std::getenv("GGML_CUDA_FP4_GROUPED") != nullptr);
+    return en;
+}
+
+bool ggml_cuda_fp4_grouped_moe(
+        ggml_backend_cuda_context & ctx,
+        const ggml_tensor * src0,
+        const ggml_tensor * src1,
+        const ggml_tensor * ids,
+        ggml_tensor       * dst) {
+    GGML_UNUSED(ctx); GGML_UNUSED(src1); GGML_UNUSED(ids); GGML_UNUSED(dst);
+
+    if (!fp4_grouped_enabled()) {
+        return false; // default: existing MMQ path
+    }
+    if (src0->type != GGML_TYPE_MXFP4 && src0->type != GGML_TYPE_NVFP4) {
+        return false;
+    }
+
+    // TODO(kernel - see kernel design doc): CUTLASS 3.x GemmGrouped, sm_120a, block-scaled e2m1,
+    // tcgen05 MMA; per-expert problem offsets from `ids`; fused activation quant; numerical parity
+    // vs mul_mat_q<MXFP4> before enabling by default.
+    static bool warned = false;
+    if (!warned) {
+        warned = true;
+        fprintf(stderr, "[fp4-grouped] GGML_CUDA_FP4_GROUPED set, kernel not yet implemented - using MMQ\n");
+    }
+    return false; // scaffold: fall back until the kernel lands
+}
+diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cuh b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
+new file mode 100644
+index 0000000..29e1b5a
+--- /dev/null
+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
+@@ -0,0 +1,13 @@
+#pragma once
+
+#include "common.cuh"
+
+// Entry point for the tcgen05/CUTLASS block-scaled FP4 (MXFP4/NVFP4) grouped-GEMM MoE kernel for
+// Blackwell consumer GPUs (sm_120/121). Returns true if it handled the op; false to fall back to
+// the existing warp-mma MMQ path. Gated behind GGML_CUDA_FP4_GROUPED until correct + faster.
+bool ggml_cuda_fp4_grouped_moe(
+        ggml_backend_cuda_context & ctx,
+        const ggml_tensor * src0,   // expert weights, MXFP4/NVFP4 [n_embd, n_ff, n_expert]
+        const ggml_tensor * src1,   // activations, F32 [n_embd, n_tokens, ...]
+        const ggml_tensor * ids,    // expert routing, I32
+        ggml_tensor       * dst);   // F32 output
+diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
+index 8ea462a..104d131 100644
+--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
+@@ -30,6 +30,7 @@
+ #include "ggml-cuda/im2col.cuh"
+ #include "ggml-cuda/mmf.cuh"
+ #include "ggml-cuda/mmq.cuh"
+#include "ggml-cuda/fp4-grouped-moe.cuh"
+ #include "ggml-cuda/mmvf.cuh"
+ #include "ggml-cuda/mmvq.cuh"
+ #include "ggml-cuda/norm.cuh"
+@@ -2701,6 +2702,7 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
+         }
+ 
+         if (ggml_cuda_should_use_mmq(src0->type, cc, ne12, /*n_experts=*/ne02)) {
+            if (ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst)) { return; }
+             ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
+             return;
+         }
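Note on the scaffold's TODO ("per-expert problem offsets from `ids`"): a grouped GEMM generally needs, for each expert, how many rows are routed to it and where its group starts. The host-side sketch below shows one way to derive that with a count plus an exclusive prefix sum. It is an assumption about what the eventual kernel setup might look like, not code from the patch; `ExpertGroups` and `group_by_expert` are hypothetical names, and `ids` is modeled as a flat list of expert indices.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical result type: counts[e] rows go to expert e, and expert e's rows
// occupy [offsets[e], offsets[e + 1]) once rows are sorted by expert.
struct ExpertGroups {
    std::vector<int32_t> counts;
    std::vector<int32_t> offsets;
};

ExpertGroups group_by_expert(const std::vector<int32_t> & ids, int32_t n_expert) {
    ExpertGroups g;
    g.counts.assign(n_expert, 0);
    g.offsets.assign(n_expert + 1, 0);
    for (int32_t e : ids) {
        if (e >= 0 && e < n_expert) {
            g.counts[e]++; // out-of-range ids are ignored in this sketch
        }
    }
    for (int32_t e = 0; e < n_expert; ++e) {
        g.offsets[e + 1] = g.offsets[e] + g.counts[e]; // exclusive prefix sum
    }
    return g;
}
```

Experts with a zero count produce an empty range, which a grouped launch can skip.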
--- a/backend/cpp/llama-cpp/patches/paged/0001-vendor-paged-kv-manager.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0001-vendor-paged-kv-manager.patch
@@ -0,0 +1,447 @@
+From bef64835d444a44ed8391bc395cdab38164229d5 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Fri, 19 Jun 2026 22:54:49 +0000
+Subject: [PATCH] vendor paged kv manager
+
+vLLM-parity host-side KV block manager (FreeBlockQueue, BlockPool,
+PagedKVManager, chained-hash prefix cache). Pure C++17, no behavior change -
+nothing uses it yet; wired in by later patches in the series.
+---
+ src/CMakeLists.txt       |   1 +
+ src/paged-kv-manager.cpp | 296 +++++++++++++++++++++++++++++++++++++++
+ src/paged-kv-manager.h   | 108 ++++++++++++++
+ 3 files changed, 405 insertions(+)
+ create mode 100644 src/paged-kv-manager.cpp
+ create mode 100644 src/paged-kv-manager.h
+
+diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
+index d15ccfd99..a030940b8 100644
+--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
+@@ -24,6 +24,7 @@ add_library(llama
+             llama-io.cpp
+             llama-kv-cache.cpp
+             llama-kv-cache-iswa.cpp
+            paged-kv-manager.cpp
+             llama-kv-cache-dsa.cpp
+             llama-memory.cpp
+             llama-memory-hybrid.cpp
+diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
+new file mode 100644
+index 000000000..ca0dcd83a
+--- /dev/null
+++ b/src/paged-kv-manager.cpp
+@@ -0,0 +1,296 @@
+#include "paged-kv-manager.h"
+#include <cassert>
+#include <stdexcept>
+
+namespace paged {
+
+// ---------------------------------------------------------------------------
+// FreeBlockQueue  (port of kv_cache_utils.py FreeKVCacheBlockQueue)
+// ---------------------------------------------------------------------------
+
+FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
+    num_free_blocks = blocks.size();
+    for (size_t i = 0; i < blocks.size(); ++i) {
+        if (i > 0)                  blocks[i]->prev_free = blocks[i - 1];
+        if (i + 1 < blocks.size())  blocks[i]->next_free = blocks[i + 1];
+    }
+    if (!blocks.empty()) {
+        fake_head.next_free = blocks.front();
+        blocks.front()->prev_free = &fake_head;
+        fake_tail.prev_free = blocks.back();
+        blocks.back()->next_free = &fake_tail;
+    } else {
+        fake_head.next_free = &fake_tail;
+        fake_tail.prev_free = &fake_head;
+    }
+}
+
+KVCacheBlock* FreeBlockQueue::popleft() {
+    KVCacheBlock* first = fake_head.next_free;
+    if (first == &fake_tail || first == nullptr) {
+        assert(num_free_blocks == 0);
+        throw std::runtime_error("No free blocks available");
+    }
+    fake_head.next_free = first->next_free;
+    first->next_free->prev_free = &fake_head;
+    first->prev_free = first->next_free = nullptr;
+    num_free_blocks--;
+    return first;
+}
+
+std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
+    std::vector<KVCacheBlock*> ret;
+    if (n == 0) return ret;
+    assert(num_free_blocks >= n);
+    num_free_blocks -= n;
+    KVCacheBlock* curr = fake_head.next_free;
+    ret.reserve(n);
+    for (size_t i = 0; i < n; ++i) {
+        assert(curr != nullptr);
+        ret.push_back(curr);
+        KVCacheBlock* last = curr;
+        curr = curr->next_free;
+        last->prev_free = last->next_free = nullptr;
+    }
+    if (curr != nullptr) {
+        fake_head.next_free = curr;
+        curr->prev_free = &fake_head;
+    }
+    return ret;
+}
+
+void FreeBlockQueue::remove(KVCacheBlock* block) {
+    if (!block->prev_free || !block->next_free)
+        throw std::runtime_error("remove() called on an invalid block");
+    block->prev_free->next_free = block->next_free;
+    block->next_free->prev_free = block->prev_free;
+    block->prev_free = block->next_free = nullptr;
+    num_free_blocks--;
+}
+
+void FreeBlockQueue::append(KVCacheBlock* block) {
+    KVCacheBlock* last = fake_tail.prev_free;
+    last->next_free = block;
+    block->prev_free = last;
+    block->next_free = &fake_tail;
+    fake_tail.prev_free = block;
+    num_free_blocks++;
+}
+
+void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
+    if (blocks.empty()) return;
+    KVCacheBlock* last = fake_tail.prev_free;
+    for (KVCacheBlock* b : blocks) {
+        b->prev_free = last;
+        last->next_free = b;
+        last = b;
+    }
+    last->next_free = &fake_tail;
+    fake_tail.prev_free = last;
+    num_free_blocks += blocks.size();
+}
+
+void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
+    if (blocks.empty()) return;
+    KVCacheBlock* first = fake_head.next_free;
+    KVCacheBlock* prev = &fake_head;
+    for (KVCacheBlock* b : blocks) {
+        b->prev_free = prev;
+        prev->next_free = b;
+        prev = b;
+    }
+    prev->next_free = first;
+    first->prev_free = prev;
+    num_free_blocks += blocks.size();
+}
+
+std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
+    std::vector<KVCacheBlock*> ret;
+    const KVCacheBlock* curr = fake_head.next_free;
+    while (curr && curr->next_free != nullptr) {
+        ret.push_back(const_cast<KVCacheBlock*>(curr));
+        curr = curr->next_free;
+    }
+    return ret;
+}
+
+// ---------------------------------------------------------------------------
+// BlockPool  (port of block_pool.py)
+// ---------------------------------------------------------------------------
+
+static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
+    std::vector<KVCacheBlock*> p;
+    p.reserve(v.size());
+    for (auto& b : v) p.push_back(&b);
+    return p;
+}
+
+static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
+    std::vector<KVCacheBlock> v;
+    v.reserve(num_blocks);
+    for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
+    return v;
+}
+
+BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
+    : enable_caching_(enable_caching),
+      blocks_(make_block_vec(num_blocks)),
+      ptrs_(make_ptrs(blocks_)),
+      free_queue_(ptrs_) {
+    // vLLM reserves block_id 0 as the null block (never cached).
+    null_block = free_queue_.popleft();
+    null_block->is_null = true;
+}
+
+bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
+    if (!block->has_hash) return false;
+    auto it = cached_block_hash_to_block_.find(block->block_hash);
+    if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
+    cached_block_hash_to_block_.erase(it);
+    block->reset_hash();
+    return true;
+}
+
+std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
+    if (n > get_num_free_blocks())
+        throw std::runtime_error("Cannot get free blocks from pool");
+    auto ret = free_queue_.popleft_n(n);
+    for (KVCacheBlock* b : ret) {
+        if (enable_caching_) maybe_evict_cached_block(b);
+        assert(b->ref_cnt == 0);
+        b->ref_cnt += 1;
+    }
+    return ret;
+}
+
+KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
+    auto it = cached_block_hash_to_block_.find(block_hash);
+    return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
+}
+
+void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
+    for (KVCacheBlock* b : blocks) {
+        // ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
+        if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
+        b->ref_cnt += 1;
+    }
+}
+
+void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
+    std::vector<KVCacheBlock*> without_hash, with_hash;
+    for (KVCacheBlock* b : ordered_blocks) {
+        if (b->is_null) continue;
+        b->ref_cnt -= 1;
+        if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
+    }
+    free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
+    free_queue_.append_n(with_hash);     // hashed: kept warm (tail)
+}
+
+void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+                                  size_t num_cached_blocks, size_t num_full_blocks,
+                                  const std::vector<uint64_t>& block_hashes) {
+    for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
+        KVCacheBlock* blk = req_blocks[i];
+        if (blk->has_hash) continue;
+        blk->has_hash = true;
+        blk->block_hash = block_hashes[i];
+        cached_block_hash_to_block_[blk->block_hash] = blk;
+    }
+}
+
+// ---------------------------------------------------------------------------
+// PagedKVManager  (port of SingleTypeKVCacheManager / FullAttentionManager)
+// ---------------------------------------------------------------------------
+
+static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
+
+PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
+    : block_size_(block_size), pool_(num_blocks, enable_caching) {}
+
+bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
+    auto& req = req_to_blocks_[seq_id];
+    size_t need = cdiv(total_tokens, block_size_);
+    if (need <= req.size()) return true;
+    size_t add = need - req.size();
+    if (add > pool_.get_num_free_blocks()) return false; // OOM
+    auto nb = pool_.get_new_blocks(add);
+    req.insert(req.end(), nb.begin(), nb.end());
+    return true;
+}
+
+std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
+    std::vector<int32_t> bt;
+    auto it = req_to_blocks_.find(seq_id);
+    if (it == req_to_blocks_.end()) return bt;
+    bt.reserve(it->second.size());
+    for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
+    return bt;
+}
+
+int64_t PagedKVManager::slot(int seq_id, int pos) const {
+    const auto& req = req_to_blocks_.at(seq_id);
+    int32_t phys = req[pos / block_size_]->block_id;
+    return (int64_t)phys * block_size_ + (pos % block_size_);
+}
+
+std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
+    std::vector<int64_t> sm;
+    sm.reserve(positions.size());
+    for (int p : positions) sm.push_back(slot(seq_id, p));
+    return sm;
+}
+
+void PagedKVManager::free(int seq_id) {
+    auto it = req_to_blocks_.find(seq_id);
+    if (it == req_to_blocks_.end()) return;
+    // Free in reverse so the tail of the block chain is evicted first (vLLM order).
+    std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
+    pool_.free_blocks(ordered);
+    req_to_blocks_.erase(it);
+}
+
+// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
+// hash into the seed so each block hash transitively encodes its whole prefix
+// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
+uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
+    uint64_t h = 1469598103934665603ull ^ parent_hash;
+    for (int t : token_ids) {
+        h ^= (uint64_t)(uint32_t)t;
+        h *= 1099511628211ull;
+    }
+    if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
+    return h;
+}
+
+std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
+    std::vector<uint64_t> hashes;
+    uint64_t parent = 0; // NONE_HASH analogue
+    size_t n_full = token_ids.size() / block_size_;
+    for (size_t i = 0; i < n_full; ++i) {
+        std::vector<int> blk(token_ids.begin() + i * block_size_,
+                             token_ids.begin() + (i + 1) * block_size_);
+        parent = hash_block(parent, blk);
+        hashes.push_back(parent);
+    }
+    return hashes;
+}
+
+size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
+    std::vector<KVCacheBlock*> hits;
+    for (uint64_t bh : block_hashes) {        // stop at first miss (prefix property)
+        KVCacheBlock* cb = pool_.get_cached_block(bh);
+        if (!cb) break;
+        hits.push_back(cb);
+    }
+    pool_.touch(hits);                        // ++ref_cnt, pull from free list
+    return hits.size() * (size_t)block_size_;
+}
+
+void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
+    auto& req = req_to_blocks_[seq_id];
+    size_t n_full = num_tokens / block_size_;
+    pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
+}
+
+} // namespace paged
+diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
+new file mode 100644
+index 000000000..740280a7f
+--- /dev/null
+++ b/src/paged-kv-manager.h
+@@ -0,0 +1,108 @@
+#pragma once
+// Paged KV cache block manager for llama.cpp (CPU-first prototype).
+//
+// Host-side block management is a faithful port of vLLM V1:
+//   vllm/v1/core/kv_cache_utils.py            (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
+//   vllm/v1/core/block_pool.py                (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
+//   vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
+//
+// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
+// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
+// dependency so it can be unit-tested in isolation.
+
+#include <cstdint>
+#include <vector>
+#include <unordered_map>
+#include <map>
+
+namespace paged {
+
+// vLLM KVCacheBlock (kv_cache_utils.py).
+struct KVCacheBlock {
+    int32_t  block_id   = 0;
+    int      ref_cnt    = 0;
+    bool     has_hash   = false;   // vLLM: _block_hash is set only when full+cached
+    uint64_t block_hash = 0;
+    bool     is_null    = false;
+    KVCacheBlock* prev_free = nullptr;
+    KVCacheBlock* next_free = nullptr;
+
+    explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
+    void reset_hash() { has_hash = false; block_hash = 0; }
+};
+
+// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
+// O(1) middle removal is required so touch() can pull a warm cached block out of the
+// free list when a later request hits its prefix.
+class FreeBlockQueue {
+public:
+    size_t num_free_blocks = 0;
+
+    explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
+    KVCacheBlock* popleft();
+    std::vector<KVCacheBlock*> popleft_n(size_t n);
+    void remove(KVCacheBlock* block);
+    void append(KVCacheBlock* block);
+    void append_n(const std::vector<KVCacheBlock*>& blocks);
+    void prepend_n(const std::vector<KVCacheBlock*>& blocks);
+    std::vector<KVCacheBlock*> get_all_free_blocks() const;
+
+private:
+    KVCacheBlock fake_head{-1};
+    KVCacheBlock fake_tail{-1};
+};
+
+// vLLM BlockPool (block_pool.py).
+class BlockPool {
+public:
+    KVCacheBlock* null_block = nullptr;
+
+    BlockPool(int32_t num_blocks, bool enable_caching);
+    std::vector<KVCacheBlock*> get_new_blocks(size_t n);
+    KVCacheBlock* get_cached_block(uint64_t block_hash);
+    void touch(const std::vector<KVCacheBlock*>& blocks);
+    void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
+    void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+                           size_t num_cached_blocks, size_t num_full_blocks,
+                           const std::vector<uint64_t>& block_hashes);
+    size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
+
+private:
+    bool maybe_evict_cached_block(KVCacheBlock* block);
+
+    bool enable_caching_;
+    std::vector<KVCacheBlock> blocks_;     // owns all block descriptors
+    std::vector<KVCacheBlock*> ptrs_;
+    FreeBlockQueue free_queue_;
+    // vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
+    // prototype keeps the last writer (single KV-cache group is sufficient for the wins).
+    std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
+};
+
+// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
+// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
+class PagedKVManager {
+public:
+    PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
+
+    // Grow seq_id to cover total_tokens slots. Returns false on OOM (fewer free blocks than needed).
+    bool allocate(int seq_id, size_t total_tokens);
+    std::vector<int32_t> block_table(int seq_id) const;
+    int64_t slot(int seq_id, int pos) const;
+    std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
+    void free(int seq_id);
+    int block_size() const { return block_size_; }
+
+    // Prefix caching (win 3).
+    static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
+    std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
+    size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
+    void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
+
+protected:
+    int block_size_;
+    BlockPool pool_;
+    std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
+};
+
+} // namespace paged
+-- 
+2.43.0
+
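Note on the chained block hash in the patch above: because each block hash folds in its parent's hash, two token sequences share a cached prefix exactly up to the first block whose chained hash differs. The standalone sketch below copies the FNV-1a chaining from `PagedKVManager::hash_block` and adds two helpers, `chain_hashes` and `shared_prefix_blocks`, which are hypothetical names introduced here for illustration and are not part of the patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Same FNV-1a chaining as PagedKVManager::hash_block in the patch above.
uint64_t hash_block(uint64_t parent_hash, const std::vector<int> & token_ids) {
    uint64_t h = 1469598103934665603ull ^ parent_hash;
    for (int t : token_ids) {
        h ^= (uint64_t)(uint32_t)t;
        h *= 1099511628211ull;
    }
    if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
    return h;
}

// Hash every full block; a trailing partial block is not hashed.
std::vector<uint64_t> chain_hashes(const std::vector<int> & tokens, size_t block_size) {
    std::vector<uint64_t> hashes;
    if (block_size == 0) return hashes;
    uint64_t parent = 0;
    for (size_t i = 0; i + block_size <= tokens.size(); i += block_size) {
        std::vector<int> blk(tokens.begin() + i, tokens.begin() + i + block_size);
        parent = hash_block(parent, blk);
        hashes.push_back(parent);
    }
    return hashes;
}

// Number of leading blocks two sequences have in common (stop at first mismatch).
size_t shared_prefix_blocks(const std::vector<uint64_t> & a, const std::vector<uint64_t> & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}
```

The first-mismatch stop mirrors the first-miss stop in `get_computed_blocks`: once one block differs, every later chained hash differs too, so there is nothing further to look up.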
--- a/backend/cpp/llama-cpp/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,75 @@
+From 5c9c709e6c6b07e0399b75fd4e46e752d418a9a8 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Fri, 19 Jun 2026 23:04:17 +0000
+Subject: [PATCH] paged kv block placement (env LLAMA_KV_PAGED)
+
+Place each sequence's tokens at permuted, non-contiguous fixed-size block
+positions in find_slot, proving attention is invariant to physical KV placement
+(token-identical greedy generation). Default off; single-sequence scope; falls
+back to the normal allocator. The paged-placement substrate for the gather-read.
+---
+ src/llama-kv-cache.cpp | 41 +++++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 41 insertions(+)
+
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 2802103bd..999e2ae61 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -11,6 +11,8 @@
+ #include <cstring>
+ #include <limits>
+ #include <map>
+#include <numeric>
+#include <cstdlib>
+ #include <stdexcept>
+ 
+ static bool ggml_is_power_of_2(int n) {
+@@ -1020,6 +1022,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
+             return { };
+         }
+ 
+        // [paged, experimental] Place this sequence's tokens at permuted,
+        // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
+        // This validates that attention is invariant to physical KV placement -
+        // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
+        // Single-sequence scope (uses get_used() as the logical base); falls back
+        // to the normal allocator if the permuted cells aren't available.
+        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+        if (paged_mode) {
+            const uint32_t bs   = 16;                 // block size (tokens/block)
+            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
+            if (nblk >= 2) {
+                // stride coprime to nblk => block-index permutation is a bijection
+                uint32_t k = 1;
+                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
+                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
+                }
+                const uint32_t base = cells.get_used();
+                bool ok = true;
+                for (uint32_t i = 0; i < n_tokens; ++i) {
+                    const uint32_t L    = base + i;
+                    const uint32_t b    = L / bs;
+                    const uint32_t off  = L % bs;
+                    if (b >= nblk) { ok = false; break; }
+                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
+                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
+                    res.idxs[s].push_back(phys);
+                }
+                if (ok && res.idxs[s].size() == n_tokens) {
+                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
+                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+                    }
+                    continue; // paged placement succeeded for this sequence
+                }
+                res.idxs[s].clear(); // fall back to the normal allocator
+            }
+        }
+
+         uint32_t n_tested = 0;
+ 
+         // for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
+-- 
+2.43.0
+
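Note on the block permutation in the patch above: the placement relies on the stride `k` being coprime to `nblk`, so that `b -> (b * k) % nblk` visits every block exactly once. The sketch below reproduces the stride search and the cell arithmetic from the patch in isolation and adds a brute-force bijection check. The function names are hypothetical, and `std::gcd` requires C++17.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Same search as the patch: first odd candidate >= nblk/2 that is coprime to nblk.
uint32_t pick_stride(uint32_t nblk) {
    uint32_t k = 1;
    for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
        if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
    }
    return k;
}

// Logical cell index -> physical cell index (permuted block, same in-block offset).
// Caller must pass bs > 0 and nblk > 0.
uint32_t phys_cell(uint32_t logical, uint32_t bs, uint32_t nblk) {
    const uint32_t k = pick_stride(nblk);
    return (((logical / bs) * k) % nblk) * bs + (logical % bs);
}

// Brute-force check that the block permutation hits each block exactly once.
bool is_bijection(uint32_t nblk) {
    const uint32_t k = pick_stride(nblk);
    std::vector<bool> seen(nblk, false);
    for (uint32_t b = 0; b < nblk; ++b) {
        const uint32_t p = (b * k) % nblk;
        if (seen[p]) return false;
        seen[p] = true;
    }
    return true;
}
```

For example, with 16 blocks of 16 cells the search picks `k = 9`, so logical cell 17 (block 1, offset 1) lands in physical block 9 at cell 145.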
--- a/backend/cpp/llama-cpp/patches/paged/0003-gather-read-plan.md
+++ b/backend/cpp/llama-cpp/patches/paged/0003-gather-read-plan.md
@@ -0,0 +1,102 @@
+# Patch 0003 — paged gather-read: exact implementation plan
+
+**Goal:** a sequence attends only its own (compacted) cells via `ggml_get_rows`, instead of the scattered
+`[0,n_kv)` window. Token-identical (attention is permutation-invariant over the KV set). **Gated**: stock
+path stays byte-identical (no new ops unless `LLAMA_KV_PAGED`).
+
+**Base:** applies on top of 0001+0002 at the pin. Dev tree: `backend/cpp/llama-cpp-paged-dev` (branch `paged`).
+
+## Design
+
+The gather is keyed off one runtime index list (the sequence's used cells, in a fixed order), exposed as a
+graph input (mirroring `k_idxs`). In `build_attn`, gather K, V **and the kq_mask** by that same index, so all
+three stay aligned. `n_gathered` replaces `n_kv` for the attention. Only active when the cache is in paged
+mode (a new `is_paged()` flag set when `LLAMA_KV_PAGED`/find_slot used permuted placement).
+
+ggml note: `ggml_get_rows(a,b)` gathers `a`'s **ne1** by `b` (I32). Raw K is `[n_embd_k_gqa, kv_size, n_stream]`
+→ ne1 = cells → direct. The mask is `[n_kv, n_tokens, 1, n_stream]` → n_kv is **ne0**, so gather as
+`transpose → get_rows → transpose`.
+
+### KEY CORRECTIONS (found while implementing — these change the edits)
+
+1. **Gather index = ALL used (non-empty) cells in `[0,n_kv)`, NOT `sinfo.idxs`.** `sinfo.idxs` is only the
+   *current ubatch's write slots*; attention reads the *full history*. The cells each token may attend to
+   are already selected by `kq_mask`, so gathering the union of all used cells and gathering the mask the same
+   way is token-identical and drops exactly the empty (already-masked) cells. So: `gather = { i in [0,n_kv) : !cells.is_empty(i) }`.
+
+2. **Static-graph size is fine because llama.cpp rebuilds the graph every ubatch.** `n_gather` (used-cell
+   count) is therefore a build-time constant for that ubatch — `build_input_gather_idxs` sizes the I32
+   tensor to `get_n_gather()` computed at build, `set_input_gather_idxs` fills the identical cell list. They
+   MUST use the same loop (`for i in [0,n_kv): if !is_empty(i) push i`) so build-order == fill-order.
+
+3. **K/V gather can live entirely in `build_attn`, no cache get_k change.** The `get_k` 4d view is contiguous
+   in `[ne0,ne1,ne2]` from cell 0 (nb2 == n_embd_head*n_head_kv*elemsz), so for **single stream (ns==1)**:
+   `reshape_3d(k, n_embd_head*n_head_kv, n_kv, 1) → get_rows(., gi) → reshape_4d(., n_embd_head, n_head_kv, n_gather, 1)`.
+   Multi-stream (ns>1) breaks contiguity (nb3 uses kv_size) → gate to ns==1 first, multi-stream follow-up.
+
+4. So the ONLY cache additions are `is_paged()`, `get_n_gather(n_kv)`, `build/set_input_gather_idxs(n_kv)`;
+   everything else (K/V/mask gather) is in `build_attn`. `set_input_kq_mask` is **unchanged** (built over
+   n_kv, then gathered). Smaller than the 7-edit estimate above.
+
+## Edits
+
+### 1. `src/llama-kv-cache.h` — declare gather infra (in `llama_kv_cache`)
+```cpp
+    bool        is_paged() const { return paged_active; }            // near get_size()
+    ggml_tensor * build_input_gather_idxs(ggml_context * ctx, const slot_info & sinfo) const;
+    void          set_input_gather_idxs (ggml_tensor * dst, const slot_info & sinfo) const;
+    uint32_t      get_n_gather(const slot_info & sinfo) const;       // == sum of used cells gathered
+```
+Add member `mutable bool paged_active = false;` and in `llama_kv_cache_context` forward the three (like
+`build_input_k_idxs`/`get_n_kv`).
+
+### 2. `src/llama-kv-cache.cpp`
+- In `find_slot`, in the paged branch (0002), set `paged_active = true;` on success.
+- `get_n_gather(sinfo)` = `sinfo.idxs[s].size()` summed over streams `s` (the count actually placed).
+- `build_input_gather_idxs`: `ggml_new_tensor_1d(ctx, GGML_TYPE_I32, get_n_gather(sinfo)); ggml_set_input(...)`.
+- `set_input_gather_idxs`: fill `data[k++] = strm_off + sinfo.idxs[s][i]` for every placed cell (same order
+  the mask/k/v will see). This is the canonical gather order.
+
+### 3. `src/llama-graph.h` — `llm_graph_input_attn_kv`
+Add `ggml_tensor * gather_idxs = nullptr;` + `ggml_tensor * get_gather_idxs() const { return gather_idxs; }`.
+
+### 4. `src/llama-graph.cpp`
+- `llm_graph_input_attn_kv::set_input`: if `mctx->is_paged()` → `mctx->set_input_gather_idxs(gather_idxs, ...)`.
+- `build_attn_inp_kv` (creates the input): if `mctx_cur->is_paged()` → `inp->gather_idxs =
+  mctx_cur->build_input_gather_idxs(ctx0, ...)`.
+- `build_attn` (the kv overload, ~2356): after `k`,`v`,`kq_mask`:
+```cpp
+if (ggml_tensor * gi = inp->get_gather_idxs()) {
+    k = ggml_get_rows(ctx0, k, gi);                                   // [d, n_gather, ...] (reshape view ok)
+    GGML_ASSERT(!v_trans); v = ggml_get_rows(ctx0, v, gi);            // first cut: flash-attn only (see edit 5)
+    ggml_tensor * m = ggml_cont(ctx0, ggml_transpose(ctx0, kq_mask)); // [n_tokens, n_kv]
+    m = ggml_get_rows(ctx0, m, gi);                                   // [n_tokens, n_gather]
+    kq_mask = ggml_cont(ctx0, ggml_transpose(ctx0, m));              // [n_gather, n_tokens]
+}
+ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
+```
+Note: `get_k` returns the reshaped 4d view; gather must run on a cell-major shape. Simplest: add a paged
+variant `get_k(ctx,il)` that returns `ggml_get_rows` of the **raw** `layers[ikv].k` then reshapes to
+`[n_embd_head, n_head_kv, n_gather, ns]`. Do the gather in the cache, not the graph, for K/V; keep only the
+mask gather in the graph. (Cleaner — revisit during impl.)
+
+### 5. V-transposed path
+When `!flash_attn`, V is stored transposed `[kv_size, n_embd_v_gqa]`, so gathering its **rows** (ne1 = n_embd)
+won't work. Either gather columns via the same idx on the non-transposed store, or force `is_paged()` to
+require flash-attn for the first cut (`GGML_ASSERT`) and handle v_trans in a follow-up.
+
+## Verification (the gate)
+```sh
+cmake --build build-cpu --target llama-simple -j
+M=Qwen3-0.6B.Q4_K_M.gguf ; P="<the 0002 prompt>"
+build-cpu/bin/llama-simple -m $M -n 64 "$P" > a.txt                    # stock
+LLAMA_KV_PAGED=1 build-cpu/bin/llama-simple -m $M -n 64 "$P" > b.txt   # paged gather-read
+diff a.txt b.txt        # MUST be identical
+```
+Also assert (debug) that `n_gather < n_kv` on a multi-chunk sequence (proves compaction, not identity).
+Export only when identical: `git format-patch HEAD~1 -o patches/ --start-number 3 -N`.
+
+## Risks
+- Mask transpose/layout: if `b.txt` diverges, dump the gathered mask vs expected for token 0; off-by-order
+  means the `set_input_gather_idxs` order ≠ the get_k gather order — they MUST use the identical loop.
+- flash-attn vs not: do flash-attn first (simpler mask), then v_trans.
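The verification gate in the plan above rests on one property: cells whose mask is `-inf` contribute `exp(-inf) = 0`, so dropping them leaves an attention row unchanged as long as scores, values, and the mask are gathered by the same index list in the same order. The sketch below checks that on a single row with scalar values. It is only an illustration: `gather_mask_columns`, `gather_cells`, and `attn_row` are hypothetical helpers written for this note, not ggml or llama.cpp functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not a ggml call): keep only the columns listed in idx
// from a row-major [n_tokens x n_kv] mask, preserving the order of idx.
std::vector<float> gather_mask_columns(const std::vector<float> & mask,
                                       size_t n_tokens, size_t n_kv,
                                       const std::vector<int32_t> & idx) {
    assert(mask.size() == n_tokens * n_kv);
    std::vector<float> out;
    out.reserve(n_tokens * idx.size());
    for (size_t t = 0; t < n_tokens; ++t) {
        for (int32_t c : idx) {
            assert(c >= 0 && (size_t) c < n_kv);
            out.push_back(mask[t * n_kv + (size_t) c]);
        }
    }
    return out;
}

// Hypothetical helper: gather per-cell scalars (scores or values) by idx.
std::vector<float> gather_cells(const std::vector<float> & x,
                                const std::vector<int32_t> & idx) {
    std::vector<float> out;
    out.reserve(idx.size());
    for (int32_t c : idx) {
        assert(c >= 0 && (size_t) c < x.size());
        out.push_back(x[(size_t) c]);
    }
    return out;
}

// One attention row with scalar values: softmax(score + mask) . value.
float attn_row(const std::vector<float> & score,
               const std::vector<float> & mask,
               const std::vector<float> & value) {
    assert(score.size() == mask.size() && score.size() == value.size());
    float mx = -std::numeric_limits<float>::infinity();
    for (size_t i = 0; i < score.size(); ++i) {
        mx = std::max(mx, score[i] + mask[i]);
    }
    assert(std::isfinite(mx)); // at least one unmasked cell is required
    float den = 0.0f;
    float num = 0.0f;
    for (size_t i = 0; i < score.size(); ++i) {
        const float w = std::exp(score[i] + mask[i] - mx); // 0 for masked cells
        den += w;
        num += w * value[i];
    }
    return num / den;
}
```

Calling `attn_row` once on the full row and once on the row gathered with the indices of the non-empty cells should give the same result. If the mask were gathered with a different index order than the scores and values, the two results would diverge, which is the failure mode listed under Risks.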
--- a/backend/cpp/llama-cpp/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,369 @@
+From c1de00f4cc1eb0dd25993880bb4c8562be1937d4 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 10:24:22 +0200
+Subject: [PATCH] paged gather-read (env LLAMA_KV_PAGED) - patch 0003
+
+Gather K, V and the kq_mask down to each sequence stream's non-empty cells
+before build_attn_mha. Position-sorted per stream so the flash-attn online
+softmax reduction order matches stock byte-for-byte. Multi-stream: one index
+column per stream over k->ne[3], padded to the max non-empty count with a
+masked (empty) cell. Gated behind LLAMA_KV_PAGED; no-op when unset.
+---
+ src/CMakeLists.txt     |   1 +
+ src/llama-graph.cpp    |   9 ++-
+ src/llama-kv-cache.cpp |  74 ++++++++++++++++++++++++
+ src/llama-kv-cache.h   |  11 ++++
+ src/paged-attn.cpp     | 128 +++++++++++++++++++++++++++++++++++++++++
+ src/paged-attn.h       |  40 +++++++++++++
+ 6 files changed, 262 insertions(+), 1 deletion(-)
+ create mode 100644 src/paged-attn.cpp
+ create mode 100644 src/paged-attn.h
+
+diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
+index a030940..58083b3 100644
+--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
+@@ -25,6 +25,7 @@ add_library(llama
+             llama-kv-cache.cpp
+             llama-kv-cache-iswa.cpp
+             paged-kv-manager.cpp
+            paged-attn.cpp
+             llama-kv-cache-dsa.cpp
+             llama-memory.cpp
+             llama-memory-hybrid.cpp
+diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
+index 68c9e60..b59d2a5 100644
+--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
+@@ -6,6 +6,8 @@
+ #include "llama-cparams.h"
+ 
+ #include "llama-kv-cache.h"
+
+#include "paged-attn.h"
+ #include "llama-kv-cache-iswa.h"
+ #include "llama-kv-cache-dsa.h"
+ #include "llama-memory-hybrid.h"
+@@ -2356,7 +2358,12 @@ ggml_tensor * llm_graph_context::build_attn(
+     ggml_tensor * k = mctx_cur->get_k(ctx0, il);
+     ggml_tensor * v = mctx_cur->get_v(ctx0, il);
+ 
+-    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
+    // [paged 0003] gather K, V and the mask to the sequence's used cells only
+    //   (no-op unless env LLAMA_KV_PAGED is set).
+    ggml_tensor * kq_mask_g = kq_mask;
+    paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+
+    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
+     cb(cur, "kqv_out", il);
+ 
+     if (inp->self_v_rot) {
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 999e2ae..30d02d7 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -1,4 +1,6 @@
+ #include "llama-kv-cache.h"
+#include <vector>
+#include <utility>
+ 
+ #include "llama-impl.h"
+ #include "llama-io.h"
+@@ -1329,6 +1331,70 @@ ggml_tensor * llama_kv_cache::get_v(ggml_context * ctx, int32_t il, uint32_t n_k
+             ggml_row_size(v->type, kv_size*n_embd_v_gqa)*sinfo.s0);
+ }
+ 
+// [paged 0003] gather-read: enumerate the non-empty cells in [0, n_kv) for the
+// single stream addressed by sinfo. With paged placement (patch 0002) these are
+// the sequence's scattered block cells; gathering K/V/mask by this index list
+// compacts the attention read while preserving every unmasked (token,cell) pair.
+uint32_t llama_kv_cache::get_n_gather(uint32_t n_kv, const slot_info & sinfo) const {
+    // Multi-stream: the gathered K/V/mask tensors are rectangular [.., n_gather,
+    // n_stream], so n_gather is the MAX non-empty count across the batch streams.
+    // Streams with fewer cells are padded (see get_gather_idxs) with a masked
+    // (empty) cell index, which contributes exp(-inf)=0 and is thus a no-op.
+    // K is laid out over physical streams [s0, s1]; index v_cells the same way.
+    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+    uint32_t mx = 0;
+    for (uint32_t j = 0; j < ns; ++j) {
+        const auto & cells = v_cells[sinfo.s0 + j];
+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+        uint32_t cnt = 0;
+        for (uint32_t i = 0; i < n; ++i) {
+            if (!cells.is_empty(i)) {
+                ++cnt;
+            }
+        }
+        mx = std::max(mx, cnt);
+    }
+    return mx;
+}
+
+void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const {
+    const uint32_t ns       = sinfo.s1 - sinfo.s0 + 1;
+    const uint32_t n_gather = get_n_gather(n_kv, sinfo);
+    // dst is [n_gather, n_stream] (ne0 = n_gather): column s at dst[s*n_gather..].
+    for (uint32_t j = 0; j < ns; ++j) {
+        const auto & cells = v_cells[sinfo.s0 + j];
+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+        // Collect the non-empty cells, then order them by token POSITION (not by
+        // physical cell index). The attention reduction (flash-attn online
+        // softmax, and the non-flash soft_max) runs over cells in array order and
+        // is order-sensitive in floating point. Stock (contiguous) placement
+        // happens to store cells in position order, so emitting the gathered
+        // indices in position order reproduces stock's exact reduction order -
+        // making the paged read bit-identical, not merely math-equivalent.
+        std::vector<std::pair<llama_pos, int32_t>> pc;
+        pc.reserve(n);
+        int32_t pad = -1;
+        for (uint32_t i = 0; i < n; ++i) {
+            if (!cells.is_empty(i)) {
+                pc.emplace_back(cells.pos_get(i), (int32_t) i);
+            } else if (pad < 0) {
+                pad = (int32_t) i; // first empty cell: its mask is -inf -> safe pad
+            }
+        }
+        std::sort(pc.begin(), pc.end());
+        int32_t * col = dst + (size_t) j * n_gather;
+        for (size_t k = 0; k < pc.size(); ++k) {
+            col[k] = pc[k].second;
+        }
+        // Pad the tail to n_gather with a masked (empty) cell so the rectangular
+        // gather drops to zero contribution for streams shorter than the max.
+        const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
+        for (uint32_t k = (uint32_t) pc.size(); k < n_gather; ++k) {
+            col[k] = padv;
+        }
+    }
+}
+
+ ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
+     GGML_UNUSED(sinfo);
+ 
+@@ -2620,6 +2686,14 @@ ggml_tensor * llama_kv_cache_context::get_v(ggml_context * ctx, int32_t il) cons
+     return kv->get_v(ctx, il, n_kv, sinfos[i_cur]);
+ }
+ 
+uint32_t llama_kv_cache_context::get_n_gather() const {
+    return kv->get_n_gather(n_kv, sinfos[i_cur]);
+}
+
+void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
+    kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
+}
+
+ ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
+     return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
+ }
+diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
+index 3d68f98..494c0fb 100644
+--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
+@@ -171,6 +171,12 @@ public:
+     ggml_tensor * get_k(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
+     ggml_tensor * get_v(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
+ 
+    // [paged 0003] count / list the non-empty cells in [0, n_kv) per stream of
+    //   sinfo (position-sorted, padded across streams). Used by paged-attn
+    //   gather-read. get_n_gather returns the max count across streams.
+    uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
+    void     get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
+
+     // store k_cur and v_cur in the cache based on the provided head location
+     ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
+     ggml_tensor * cpy_v(ggml_context * ctx, ggml_tensor * v_cur, ggml_tensor * v_idxs, int32_t il, const slot_info & sinfo) const;
+@@ -368,6 +374,11 @@ public:
+     ggml_tensor * get_k(ggml_context * ctx, int32_t il) const;
+     ggml_tensor * get_v(ggml_context * ctx, int32_t il) const;
+ 
+    // [paged 0003] gather-read helpers (delegate to the kv cache for the
+    //   current ubatch's stream).
+    uint32_t get_n_gather() const;
+    void     get_gather_idxs(int32_t * dst) const;
+
+     // store k_cur and v_cur in the cache based on the provided head location
+     // note: the heads in k_cur and v_cur should be laid out contiguously in memory
+     //   - k_cur  [n_embd_head_k, n_head_k, n_tokens]
+diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
+new file mode 100644
+index 0000000..ade75e8
+--- /dev/null
+++ b/src/paged-attn.cpp
+@@ -0,0 +1,128 @@
+#include "paged-attn.h"
+
+#include "llama-graph.h"
+#include "llama-kv-cache.h"
+
+#include "ggml.h"
+#include "ggml-backend.h"
+
+#include <cstdlib>
+#include <cstdio>
+
+namespace paged_attn {
+
+bool active() {
+    static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+    return a;
+}
+
+static bool debug() {
+    static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
+    return d;
+}
+
+namespace {
+
+// Graph input that, at set_input time, fills an I32 [n_gather, n_stream] tensor
+// with each stream's non-empty cell indices (position-sorted, padded with a
+// masked/empty cell) by delegating to the kv-cache context. Private to this
+// unit; default can_reuse()==false keeps the graph from being reused across
+// decodes (n_gather grows every step).
+class input_gather_idxs : public llm_graph_input_i {
+public:
+    input_gather_idxs(const llama_kv_cache_context * mctx, ggml_tensor * idxs)
+        : mctx(mctx), idxs(idxs) {}
+
+    void set_input(const llama_ubatch * ubatch) override {
+        GGML_UNUSED(ubatch);
+        GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
+        mctx->get_gather_idxs((int32_t *) idxs->data);
+    }
+
+    const llama_kv_cache_context * mctx;
+    ggml_tensor * idxs;
+};
+
+} // namespace
+
+void gather(ggml_context * ctx0,
+            llm_graph_result * res,
+            const llama_kv_cache_context * mctx,
+            ggml_tensor ** k,
+            ggml_tensor ** v,
+            ggml_tensor ** kq_mask) {
+    if (!active()) {
+        return;
+    }
+
+    ggml_tensor * K = *k;
+    ggml_tensor * V = *v;
+    ggml_tensor * M = *kq_mask;
+
+    // Number of streams (sequences) in the unified batch. K is laid out
+    // [d, h, n_kv, n_stream] and the mask is [n_kv, n_tps, 1, n_stream]; the
+    // gather is per-stream (one index column per stream), so a single
+    // ggml_get_rows over the stream axis handles 1..N streams uniformly.
+    const int64_t n_stream = K->ne[3];
+    GGML_ASSERT(M->ne[3] == n_stream);
+
+    const int64_t n_gather = (int64_t) mctx->get_n_gather();
+    if (n_gather <= 0) {
+        // Worst-case graph reserve (empty cache) or nothing placed yet: leave
+        // the full [0, n_kv) read untouched so buffer sizing stays worst-case.
+        return;
+    }
+
+    if (debug()) {
+        static int64_t once = 0;
+        if (once++ < 2) {
+            fprintf(stderr, "[paged-attn] gather n_stream=%lld n_kv=%lld n_gather=%lld\n",
+                    (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
+        }
+    }
+
+    // Per-stream index tensor [n_gather, n_stream], filled at set_input from
+    // each stream's non-empty cells. ggml_get_rows broadcasts along ne[1]==
+    // n_stream, so column s gathers from stream s of the source.
+    ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_gather, n_stream);
+    ggml_set_input(idx);
+    res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, idx)));
+
+    // --- gather K: collapse (head_dim, n_head) so cells become the row axis ---
+    {
+        ggml_tensor * t = ggml_cont(ctx0, K);                                          // [d, h, n_kv, ns]
+        t = ggml_reshape_3d(ctx0, t, K->ne[0]*K->ne[1], K->ne[2], n_stream);           // [d*h, n_kv, ns]
+        t = ggml_get_rows(ctx0, t, idx);                                               // [d*h, n_gather, ns]
+        *k = ggml_reshape_4d(ctx0, t, K->ne[0], K->ne[1], n_gather, n_stream);         // [d, h, n_gather, ns]
+    }
+
+    // --- gather V ---
+    // Normalize to a non-transposed [d, h, n_kv, ns] view first, so the gathered
+    // result is contiguous and build_attn_mha sees a consistent v_trans==false.
+    {
+        const bool v_trans = V->nb[1] > V->nb[2];
+        ggml_tensor * vsrc = v_trans
+            ? ggml_permute(ctx0, V, 2, 1, 0, 3)   // [n_kv, h, d, ns] -> [d, h, n_kv, ns]
+            : V;                                  // already [d, h, n_kv, ns]
+        ggml_tensor * t = ggml_cont(ctx0, vsrc);                                       // [d, h, n_kv, ns]
+        t = ggml_reshape_3d(ctx0, t, vsrc->ne[0]*vsrc->ne[1], vsrc->ne[2], n_stream);  // [d*h, n_kv, ns]
+        t = ggml_get_rows(ctx0, t, idx);                                               // [d*h, n_gather, ns]
+        *v = ggml_reshape_4d(ctx0, t, vsrc->ne[0], vsrc->ne[1], n_gather, n_stream);   // [d, h, n_gather, ns]
+    }
+
+    // --- gather mask (cells are ne0): transpose so cells become the row axis,
+    //     gather per stream, transpose back ---
+    {
+        ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream);      // [n_kv, n_tps, ns]
+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));                                  // [n_tps, n_kv, ns]
+        m = ggml_get_rows(ctx0, m, idx);                                               // [n_tps, n_gather, ns] (F32)
+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));                                  // [n_gather, n_tps, ns]
+        m = ggml_reshape_4d(ctx0, m, n_gather, M->ne[1], 1, n_stream);
+        if (M->type != m->type) {
+            m = ggml_cast(ctx0, m, M->type);   // flash-attn requires an F16 mask
+        }
+        *kq_mask = m;
+    }
+}
+
+} // namespace paged_attn
+diff --git a/src/paged-attn.h b/src/paged-attn.h
+new file mode 100644
+index 0000000..c5b7bd7
+--- /dev/null
+++ b/src/paged-attn.h
+@@ -0,0 +1,40 @@
+#pragma once
+// Paged attention gather-read (patch 0003, experimental).
+//
+// Companion to the paged block placement in llama_kv_cache::find_slot (patch
+// 0002). Patch 0002 places a sequence's tokens at permuted, non-contiguous
+// fixed-size block cells, but attention still reads the whole [0, n_kv) window
+// (empty cells masked to -inf). This unit compacts that read: it gathers K, V
+// and the kq_mask down to ONLY the sequence's used (non-empty) cells before
+// build_attn_mha.
+//
+// Correctness: attention is permutation-invariant over the KV set, and dropping
+// already-masked empty cells removes only exp(-inf)=0 terms - so greedy output
+// is identical to stock. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
+//
+// All logic lives here to keep the core files additive: build_attn gets one
+// call, llama_kv_cache_context gets two thin accessors, CMake gets one line.
+
+#include <cstdint>
+
+struct ggml_context;
+struct ggml_tensor;
+class  llm_graph_result;
+class  llama_kv_cache_context;
+
+namespace paged_attn {
+
+// true iff env LLAMA_KV_PAGED is set (evaluated once).
+bool active();
+
+// Gather K, V and the kq_mask down to the current sequence's non-empty cells.
+// No-op (returns immediately) unless active(). On return *k, *v and *kq_mask
+// point at the compacted tensors; pass them straight to build_attn_mha.
+void gather(ggml_context * ctx0,
+            llm_graph_result * res,
+            const llama_kv_cache_context * mctx,
+            ggml_tensor ** k,
+            ggml_tensor ** v,
+            ggml_tensor ** kq_mask);
+
+} // namespace paged_attn
+-- 
+2.43.0
+
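The index construction in `get_gather_idxs` (patch 0003 above) can be exercised outside the engine. The following is a minimal sketch, assuming a hypothetical `Cell` struct in place of the real cell store and ignoring the `n_kv` clamp: it collects each stream's non-empty cells, sorts them by token position, and pads shorter streams with an empty (masked) cell so the result is rectangular. `Cell`, `max_used`, and `build_gather_idxs` are illustrative names, not part of the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one KV cell: empty flag plus token position.
struct Cell {
    bool    empty;
    int32_t pos;
};

// Max number of non-empty cells across streams (the rectangular n_gather).
size_t max_used(const std::vector<std::vector<Cell>> & streams) {
    size_t mx = 0;
    for (const auto & cells : streams) {
        size_t cnt = 0;
        for (const Cell & c : cells) {
            if (!c.empty) {
                ++cnt;
            }
        }
        mx = std::max(mx, cnt);
    }
    return mx;
}

// Returns a flat [n_gather, n_stream] index table: column s starts at s * n_gather.
std::vector<int32_t> build_gather_idxs(const std::vector<std::vector<Cell>> & streams) {
    const size_t n_gather = max_used(streams);
    std::vector<int32_t> dst(n_gather * streams.size(), 0);
    for (size_t s = 0; s < streams.size(); ++s) {
        const auto & cells = streams[s];
        std::vector<std::pair<int32_t, int32_t>> pc; // (position, physical cell)
        int32_t pad = -1;
        for (size_t i = 0; i < cells.size(); ++i) {
            if (!cells[i].empty) {
                pc.emplace_back(cells[i].pos, (int32_t) i);
            } else if (pad < 0) {
                pad = (int32_t) i; // first empty cell: masked, so a safe pad
            }
        }
        std::sort(pc.begin(), pc.end()); // order by position, not by cell index
        for (size_t k = 0; k < pc.size(); ++k) {
            dst[s * n_gather + k] = pc[k].second;
        }
        // Same fallback as the patch: reuse the last used cell if no empty one exists.
        const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
        for (size_t k = pc.size(); k < n_gather; ++k) {
            dst[s * n_gather + k] = padv;
        }
    }
    return dst;
}
```

For a stream whose cells 0, 2, 3 hold positions 2, 0, 1 (cell 1 empty), the column comes out as cells 2, 3, 0, which is position order rather than physical order. A second stream with a single used cell is padded with its first empty cell up to the same length.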
--- a/backend/cpp/llama-cpp/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,298 @@
+From 7c294973de28d1ac991505638d726acfb371d541 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 10:50:35 +0200
+Subject: [PATCH] paged on-demand block allocation (env LLAMA_KV_PAGED) - patch
+ 0004
+
+Drive the paged placement in find_slot through the vendored PagedKVManager
+(patch 0001) instead of a fixed full-pool permutation. Blocks are popped from a
+free pool on demand as the sequence crosses block boundaries (peak << full
+reservation) and returned on sequence end (seq_rm full removal / clear). One
+manager per (kv-cache, stream); all state lives in the new src/paged-alloc unit,
+so the core kv-cache struct is untouched - find_slot/clear/seq_rm gain only a
+gated call. Default off; stock path byte-identical.
+---
+ src/CMakeLists.txt     |   1 +
+ src/llama-kv-cache.cpp |  69 +++++++++++++++++----------
+ src/paged-alloc.cpp    | 106 +++++++++++++++++++++++++++++++++++++++++
+ src/paged-alloc.h      |  39 +++++++++++++++
+ 4 files changed, 190 insertions(+), 25 deletions(-)
+ create mode 100644 src/paged-alloc.cpp
+ create mode 100644 src/paged-alloc.h
+
+diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
+index 58083b3..4d9d7d1 100644
+--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
+@@ -26,6 +26,7 @@ add_library(llama
+             llama-kv-cache-iswa.cpp
+             paged-kv-manager.cpp
+             paged-attn.cpp
+            paged-alloc.cpp
+             llama-kv-cache-dsa.cpp
+             llama-memory.cpp
+             llama-memory-hybrid.cpp
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 30d02d7..1125d9a 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -1,4 +1,5 @@
+ #include "llama-kv-cache.h"
+#include "paged-alloc.h"
+ #include <vector>
+ #include <utility>
+ 
+@@ -381,6 +382,11 @@ llama_kv_cache::llama_kv_cache(
+ }
+ 
+ void llama_kv_cache::clear(bool data) {
+    // [paged 0004] return all on-demand blocks to the pool on cache clear.
+    if (paged_alloc::active()) {
+        paged_alloc::release_all(this);
+    }
+
+     for (uint32_t s = 0; s < n_stream; ++s) {
+         v_cells[s].reset();
+         v_heads[s] = 0;
+@@ -409,6 +415,16 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
+         p1 = std::numeric_limits<llama_pos>::max();
+     }
+ 
+    // [paged 0004] free a stream's on-demand blocks when its whole sequence is
+    // removed (sequence end), so they return to the pool for reuse.
+    if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
+        if (seq_id >= 0) {
+            paged_alloc::release(this, (int) seq_to_stream[seq_id]);
+        } else {
+            paged_alloc::release_all(this);
+        }
+    }
+
+     if (seq_id >= 0) {
+         auto & cells = v_cells[seq_to_stream[seq_id]];
+         auto & head  = v_heads[seq_to_stream[seq_id]];
+@@ -1030,36 +1046,39 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
+         // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
+         // Single-sequence scope (uses get_used() as the logical base); falls back
+         // to the normal allocator if the permuted cells aren't available.
+-        static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+-        if (paged_mode) {
+        // [paged 0004] On-demand block allocation. Patch 0002 proved attention is
+        // invariant to physical KV placement; here that placement is driven by
+        // the vendored PagedKVManager (patch 0001): blocks are popped from a free
+        // pool only as the sequence crosses block boundaries (peak << full
+        // reservation) and returned on sequence end. Enabled via LLAMA_KV_PAGED;
+        // falls back to the normal allocator on pool exhaustion or any conflict.
+        if (paged_alloc::active()) {
+             const uint32_t bs   = 16;                 // block size (tokens/block)
+-            const uint32_t nblk = cells.size() / bs;  // blocks in this stream's pool
+            const uint32_t nblk = cells.size() / bs;  // this stream's block budget
+             if (nblk >= 2) {
+-                // stride coprime to nblk => block-index permutation is a bijection
+-                uint32_t k = 1;
+-                for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
+-                    if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
+-                }
+                 const uint32_t base = cells.get_used();
+-                bool ok = true;
+-                for (uint32_t i = 0; i < n_tokens; ++i) {
+-                    const uint32_t L    = base + i;
+-                    const uint32_t b    = L / bs;
+-                    const uint32_t off  = L % bs;
+-                    if (b >= nblk) { ok = false; break; }
+-                    const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
+-                    if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
+-                    res.idxs[s].push_back(phys);
+-                }
+-                if (ok && res.idxs[s].size() == n_tokens) {
+-                    if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+-                        fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
+-                        for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+-                        fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+                const int      strm = (int) seq_to_stream[seq_id];
+                std::vector<uint32_t> placed;
+                if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
+                    bool ok = (placed.size() == n_tokens);
+                    for (uint32_t i = 0; ok && i < n_tokens; ++i) {
+                        if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
+                            ok = false;
+                        }
+                    }
+                    if (ok) {
+                        for (uint32_t phys : placed) {
+                            res.idxs[s].push_back(phys);
+                        }
+                        if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+                            fprintf(stderr, "[paged] stream %d placed %u tok at cells:", strm, n_tokens);
+                            for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+                            fprintf(stderr, " (nblk=%u base=%u)\n", nblk, base);
+                        }
+                        continue; // on-demand paged placement succeeded
+                     }
+-                    continue; // paged placement succeeded for this sequence
+                    res.idxs[s].clear(); // fall back to the normal allocator
+                 }
+-                res.idxs[s].clear(); // fall back to the normal allocator
+             }
+         }
+ 
+diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
+new file mode 100644
+index 0000000..1d13f9c
+--- /dev/null
+++ b/src/paged-alloc.cpp
+@@ -0,0 +1,106 @@
+#include "paged-alloc.h"
+#include "paged-kv-manager.h"
+
+#include <cstdlib>
+#include <cstdio>
+#include <map>
+#include <memory>
+#include <utility>
+
+namespace paged_alloc {
+
+bool active() {
+    static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+    return a;
+}
+
+static bool debug() {
+    static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
+    return d;
+}
+
+namespace {
+
+using key_t = std::pair<const void *, int>;
+
+// One PagedKVManager per (kv-cache, stream): each stream owns a separate
+// physical pool of cells.size() cells, so a manager's block ids map directly to
+// cell ranges within that stream's pool. The internal request id is always 0.
+std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
+
+paged::PagedKVManager * get_mgr(const void * cache, int stream,
+                                uint32_t pool_blocks, uint32_t block_size) {
+    const key_t k{cache, stream};
+    auto it = g_managers.find(k);
+    if (it == g_managers.end()) {
+        // enable_caching=false: prefix caching is a later patch; 0004 exercises
+        // only on-demand allocate / free.
+        auto mgr = std::make_unique<paged::PagedKVManager>(
+            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
+        it = g_managers.emplace(k, std::move(mgr)).first;
+    }
+    return it->second.get();
+}
+
+} // namespace
+
+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+           uint32_t block_size, uint32_t pool_blocks,
+           std::vector<uint32_t> & out) {
+    if (n_tokens == 0) {
+        return true;
+    }
+
+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+
+    const size_t before = mgr->block_table(0).size();
+
+    // Grow the request to cover the highest logical position. The manager pops
+    // free blocks only for the boundaries actually crossed - that is the on-
+    // demand behavior; an already-covered range adds nothing.
+    if (!mgr->allocate(0, (size_t) base + n_tokens)) {
+        return false; // pool exhausted -> caller falls back to the stock path
+    }
+
+    out.reserve(out.size() + n_tokens);
+    for (uint32_t i = 0; i < n_tokens; ++i) {
+        const int64_t s = mgr->slot(0, (int) (base + i));
+        out.push_back((uint32_t) s);
+    }
+
+    if (debug()) {
+        const size_t after = mgr->block_table(0).size();
+        if (after != before) {
+            fprintf(stderr,
+                    "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
+                    "(budget=%u; base=%u +%u tok)\n",
+                    cache, stream, before, after, pool_blocks, base, n_tokens);
+        }
+    }
+
+    return true;
+}
+
+void release(const void * cache, int stream) {
+    auto it = g_managers.find({cache, stream});
+    if (it == g_managers.end()) {
+        return;
+    }
+    it->second->free(0);
+    g_managers.erase(it);
+    if (debug()) {
+        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
+    }
+}
+
+void release_all(const void * cache) {
+    for (auto it = g_managers.begin(); it != g_managers.end(); ) {
+        if (it->first.first == cache) {
+            it = g_managers.erase(it);
+        } else {
+            ++it;
+        }
+    }
+}
+
+} // namespace paged_alloc
+diff --git a/src/paged-alloc.h b/src/paged-alloc.h
+new file mode 100644
+index 0000000..bf66665
+--- /dev/null
+++ b/src/paged-alloc.h
+@@ -0,0 +1,39 @@
+#pragma once
+// On-demand paged KV block allocation (patch 0004, experimental).
+//
+// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
+// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
+// sequence's logical positions onto a fixed full-pool permutation, blocks are
+// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
+// and returned to the pool on sequence end. This is where the paged memory-
+// capacity benefit begins: a short sequence holds only a few blocks, not the
+// whole reserved window.
+//
+// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
+// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
+// struct stays untouched - find_slot only gains a gated call.
+
+#include <cstdint>
+#include <vector>
+
+namespace paged_alloc {
+
+// true iff env LLAMA_KV_PAGED is set (evaluated once).
+bool active();
+
+// Place n_tokens logical positions [base, base+n_tokens) of one stream on
+// demand, appending their physical cell indices to `out`. pool_blocks =
+// cells.size()/block_size is this stream's block budget. Returns false (leaving
+// `out` unchanged) on pool exhaustion, so the caller falls back to the stock
+// allocator. The caller still validates each returned cell is empty.
+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+           uint32_t block_size, uint32_t pool_blocks,
+           std::vector<uint32_t> & out);
+
+// Return a stream's blocks to the pool (sequence end).
+void release(const void * cache, int stream);
+
+// Return every stream's blocks for a kv-cache (clear() / teardown).
+void release_all(const void * cache);
+
+} // namespace paged_alloc
+-- 
+2.43.0
+
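The on-demand behavior described in patch 0004 (pop a block only when the sequence crosses a block boundary, fail cleanly on exhaustion, return blocks on sequence end) can be illustrated with a toy free-list allocator. This is a sketch under stated assumptions: `BlockPool` and its methods are hypothetical and do not reproduce the vendored `PagedKVManager` API; only the block-table arithmetic (`block * block_size + offset`) follows the patch description.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-sequence block pool (not the vendored PagedKVManager).
class BlockPool {
public:
    BlockPool(uint32_t n_blocks, uint32_t block_size) : bs_(block_size) {
        assert(block_size > 0);
        // Push in reverse so block 0 is popped first.
        for (uint32_t b = n_blocks; b > 0; --b) {
            free_.push_back(b - 1);
        }
    }

    // Grow the block table to cover n_tokens logical positions.
    // Returns false, changing nothing, if the pool cannot cover the request.
    bool allocate(size_t n_tokens) {
        const size_t need = (n_tokens + bs_ - 1) / bs_;
        if (need <= table_.size()) {
            return true; // range already covered: no new block
        }
        const size_t add = need - table_.size();
        if (add > free_.size()) {
            return false; // pool exhausted: caller falls back
        }
        for (size_t i = 0; i < add; ++i) {
            table_.push_back(free_.back());
            free_.pop_back();
        }
        return true;
    }

    // Physical cell for logical position pos.
    uint32_t slot(uint32_t pos) const {
        assert(pos / bs_ < table_.size());
        return table_[pos / bs_] * bs_ + pos % bs_;
    }

    // Return every held block to the free list (sequence end).
    void release() {
        while (!table_.empty()) {
            free_.push_back(table_.back());
            table_.pop_back();
        }
    }

    size_t n_held() const { return table_.size(); }
    size_t n_free() const { return free_.size(); }

private:
    uint32_t              bs_;
    std::vector<uint32_t> free_;  // free block ids
    std::vector<uint32_t> table_; // logical block -> physical block
};
```

With a block size of 16, growing a sequence from 10 to 16 tokens keeps a single block, and the 17th token pops a second one. A request larger than the pool returns false without touching the table, which corresponds to the fallback to the stock allocator in `find_slot`.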
--- a/backend/cpp/llama-cpp/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,143 @@
+From 141029beec609e87f24f6f6bba3ec842d7037862 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 12:13:44 +0200
+Subject: [PATCH] paged cross-request prefix caching (env LLAMA_KV_PAGED) -
+ patch 0006
+
+Add host-side cross-request prefix sharing to the vendored PagedKVManager
+(patches 0001-0004): on placement, hash a new sequence's prefix blocks, reuse the
+matching cached physical blocks (ref_cnt++) for the shared prefix and allocate
+fresh blocks only for the divergent suffix. A shared block is freed only at
+ref 0; copy-on-write privatises a still-shared (ref>1) block before a divergent
+write so co-owners stay byte-correct. All logic lives in the vendored
+src/paged-kv-manager unit (place_with_prefix / cow_block / ref-counting); the
+core kv-cache files are untouched. Default off; gated behind LLAMA_KV_PAGED.
+
+Wiring the physical-cell reuse into find_slot so the engine itself skips
+recompute needs core seq-membership changes and is left to a later patch.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/paged-kv-manager.cpp | 65 ++++++++++++++++++++++++++++++++++++++++
+ src/paged-kv-manager.h   | 23 ++++++++++++++
+ 2 files changed, 88 insertions(+)
+
+diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
+index ca0dcd8..4c6ee4c 100644
+--- a/src/paged-kv-manager.cpp
+++ b/src/paged-kv-manager.cpp
+@@ -293,4 +293,69 @@ void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block
+     pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
+ }
+ 
+// ---------------------------------------------------------------------------
+// Cross-request prefix caching + copy-on-write  (patch 0006)
+// ---------------------------------------------------------------------------
+
+size_t PagedKVManager::place_with_prefix(int seq_id, const std::vector<int>& token_ids) {
+    auto& req = req_to_blocks_[seq_id];
+
+    // Longest cached prefix: hash the full blocks and stop at the first miss.
+    // A block hash transitively encodes its whole prefix (FNV chaining), so the
+    // first miss bounds the reusable prefix (vLLM find_longest_cache_hit).
+    const std::vector<uint64_t> hashes = compute_block_hashes(token_ids);
+    std::vector<KVCacheBlock*> hits;
+    for (uint64_t bh : hashes) {
+        KVCacheBlock* cb = pool_.get_cached_block(bh);
+        if (!cb) break;
+        hits.push_back(cb);
+    }
+
+    // Reuse: ++ref_cnt (pulling warm blocks back out of the free list) then
+    // splice the shared physical blocks into this sequence's block table.
+    pool_.touch(hits);
+    req.insert(req.end(), hits.begin(), hits.end());
+
+    // Allocate fresh blocks only for the divergent suffix.
+    const size_t need = cdiv(token_ids.size(), block_size_);
+    if (need > req.size()) {
+        const size_t add = need - req.size();
+        if (add > pool_.get_num_free_blocks()) {
+            // OOM: roll the sequence back (un-touch the shared prefix so no ref
+            // leaks) and report no placement; the caller falls back to stock.
+            std::vector<KVCacheBlock*> ordered(req.rbegin(), req.rend());
+            pool_.free_blocks(ordered);
+            req.clear();
+            return 0;
+        }
+        auto nb = pool_.get_new_blocks(add);
+        req.insert(req.end(), nb.begin(), nb.end());
+    }
+    return hits.size();
+}
+
+std::pair<int32_t, int32_t> PagedKVManager::cow_block(int seq_id, size_t bi) {
+    auto& req = req_to_blocks_.at(seq_id);
+    KVCacheBlock* old = req.at(bi);
+    if (old->ref_cnt <= 1) {
+        return { old->block_id, old->block_id }; // already private - no copy
+    }
+    // Private copy for this sequence. get_new_blocks sets the fresh block's
+    // ref_cnt to 1; free_blocks decrements the shared block, which stays >0 so
+    // it is NOT returned to the pool and the other owners are left untouched.
+    KVCacheBlock* fresh = pool_.get_new_blocks(1).front();
+    pool_.free_blocks({ old });
+    req[bi] = fresh;
+    return { old->block_id, fresh->block_id };
+}
+
+int PagedKVManager::block_ref_cnt_at(int seq_id, size_t bi) const {
+    return req_to_blocks_.at(seq_id).at(bi)->ref_cnt;
+}
+
+size_t PagedKVManager::num_blocks(int seq_id) const {
+    auto it = req_to_blocks_.find(seq_id);
+    return it == req_to_blocks_.end() ? 0 : it->second.size();
+}
+
+ } // namespace paged
+diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
+index 740280a..34decbc 100644
+--- a/src/paged-kv-manager.h
+++ b/src/paged-kv-manager.h
+@@ -14,6 +14,7 @@
+ #include <vector>
+ #include <unordered_map>
+ #include <map>
+#include <utility>
+ 
+ namespace paged {
+ 
+@@ -99,6 +100,28 @@ public:
+     size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
+     void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
+ 
+    // Cross-request prefix caching + copy-on-write (patch 0006).
+    //
+    // Splice the longest cached prefix of token_ids into seq_id (reuse the
+    // shared physical blocks, ref_cnt++ so a block frees only at ref 0) and
+    // allocate fresh blocks only for the divergent suffix. Returns the number of
+    // shared (reused) blocks; the caller skips recomputing those tokens. On pool
+    // exhaustion the sequence is rolled back (no ref leak) and 0 is returned.
+    size_t place_with_prefix(int seq_id, const std::vector<int>& token_ids);
+
+    // Copy-on-write the block at logical index bi of seq_id. If that block is
+    // shared (ref_cnt>1), allocate a fresh private block, drop this seq's ref on
+    // the shared one (other owners keep it, content untouched) and install the
+    // fresh block at bi. Returns {old_block_id, new_block_id}; new==old when the
+    // block was already private (ref_cnt<=1) and no copy is needed. The caller
+    // copies the physical cell contents old_block_id -> new_block_id.
+    std::pair<int32_t, int32_t> cow_block(int seq_id, size_t bi);
+
+    // Introspection for the prefix-share gate (debug/tests).
+    int    block_ref_cnt_at(int seq_id, size_t bi) const;
+    size_t num_blocks(int seq_id) const;
+    size_t num_free_blocks() const { return pool_.get_num_free_blocks(); }
+
+ protected:
+     int block_size_;
+     BlockPool pool_;
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,531 @@
+From da20c1c0571e84bc76202d915d4bb82892a3392b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 12:46:28 +0200
+Subject: [PATCH] paged engine prefix recompute-skip (env LLAMA_KV_PAGED) -
+ patch 0007
+
+Wire the host-side cross-request prefix cache (patch 0006) into the engine so a
+new sequence physically SHARES the cached prefix blocks and skips recomputing the
+shared prefix. That is the actual compute win: 0006 only proved the host-side
+machinery and realised reuse via the stock seq_cp; it did not yet deliver the win
+from the paged path itself.
+
+Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
+
+  * paged-alloc reworked from a per-stream, request-0, destroyed-on-free manager
+    into ONE persistent caching PagedKVManager per (kv-cache, stream) whose
+    requests are keyed by the real llama_seq_id. free(seq) now releases exactly
+    one sequence, so ref-counted shared blocks survive while another sharer holds
+    them. New seams: share_prefix (place_with_prefix -> shared prefix tokens),
+    slot, commit (publish a sequence into the content cache), ref-counted release,
+    plus ref/num-free introspection.
+
+  * Two gated llama_kv_cache methods (the core seq-membership handling 0007 needs):
+    paged_prefix_share() reuses the longest cached content prefix for a sequence
+    and marks the shared physical cells as belonging to it (cells.seq_add) so the
+    engine's attention mask includes the already-computed prefix KV; the caller
+    then decodes ONLY the divergent suffix. paged_prefix_commit() publishes a
+    sequence's full blocks for later reuse.
+
+  * find_slot's paged branch anchors placement on each sequence's own logical base
+    (ubatch.pos) and keys the manager request by seq_id, so an independently-freed
+    sequence and a shared prefix coexist in one unified pool. seq_rm/clear free
+    per-sequence (ref-counted) instead of nuking the whole stream.
+
+  * paged-prefix-api: a thin gated shim so a caller holding only the public
+    llama.h can reach the seam and the introspection without the internal headers.
+
+Core existing-file touch: src/llama-kv-cache.{cpp,h}, +71 -3. Everything else is
+additive vendored units. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): a
+sequence B sharing A's prefix decodes greedy tokens byte-identical to B from
+scratch with the prefill computing ONLY the suffix (32 prefix tokens skipped) at
+a block boundary AND mid-block; the shared block carries ref_cnt 2 while both
+hold it, drops to 1 when one sharer is removed (survivor intact, re-shareable, no
+use-after-free) and returns to the pool only when all sharers are freed. The
+0004 serving gate (unified and non-unified) stays byte-identical stock vs paged.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/CMakeLists.txt       |   1 +
+ src/llama-kv-cache.cpp   |  66 +++++++++++++++++++++++--
+ src/llama-kv-cache.h     |   8 +++
+ src/paged-alloc.cpp      | 104 ++++++++++++++++++++++++++++++---------
+ src/paged-alloc.h        |  69 +++++++++++++++++++-------
+ src/paged-prefix-api.cpp |  48 ++++++++++++++++++
+ src/paged-prefix-api.h   |  27 ++++++++++
+ 7 files changed, 280 insertions(+), 43 deletions(-)
+ create mode 100644 src/paged-prefix-api.cpp
+ create mode 100644 src/paged-prefix-api.h
+
+diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
+index 4d9d7d1..432f42d 100644
+--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
+@@ -27,6 +27,7 @@ add_library(llama
+             paged-kv-manager.cpp
+             paged-attn.cpp
+             paged-alloc.cpp
+            paged-prefix-api.cpp
+             llama-kv-cache-dsa.cpp
+             llama-memory.cpp
+             llama-memory-hybrid.cpp
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 1125d9a..7510ff9 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -419,7 +419,7 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
+     // removed (sequence end), so they return to the pool for reuse.
+     if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
+         if (seq_id >= 0) {
+-            paged_alloc::release(this, (int) seq_to_stream[seq_id]);
+            paged_alloc::release(this, (int) seq_to_stream[seq_id], (int) seq_id);
+         } else {
+             paged_alloc::release_all(this);
+         }
+@@ -1056,10 +1056,15 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
+             const uint32_t bs   = 16;                 // block size (tokens/block)
+             const uint32_t nblk = cells.size() / bs;  // this stream's block budget
+             if (nblk >= 2) {
+-                const uint32_t base = cells.get_used();
+                // [paged 0007] Anchor placement on this sequence's own logical
+                // base position (ubatch.pos), not the shared used-count, and key
+                // the manager request by the real seq_id. slot(seq,pos) is then
+                // stable per sequence, so an independently-freed (ref-counted)
+                // sequence and a shared prefix can coexist in one unified pool.
+                const uint32_t base = (uint32_t) ubatch.pos[s*n_tokens];
+                 const int      strm = (int) seq_to_stream[seq_id];
+                 std::vector<uint32_t> placed;
+-                if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
+                if (paged_alloc::place(this, strm, (int) seq_id, base, n_tokens, bs, nblk, placed)) {
+                     bool ok = (placed.size() == n_tokens);
+                     for (uint32_t i = 0; ok && i < n_tokens; ++i) {
+                         if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
+@@ -1165,6 +1170,61 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
+     return res;
+ }
+ 
+// [paged 0007] Cross-request prefix recompute-skip.
+//
+// Reuse a cached content prefix for seq_id: share_prefix() splices the longest
+// matching cached physical blocks into seq_id (ref_cnt++) and reserves fresh
+// blocks for the divergent suffix. We then mark the shared physical cells as
+// belonging to seq_id - those cells already hold the owner's computed KV at the
+// matching logical positions, so the caller decodes ONLY the suffix and the
+// prefix is never recomputed. Returns the number of shared prefix tokens.
+// Gated behind LLAMA_KV_PAGED; a no-op (returns 0) otherwise.
+int32_t llama_kv_cache::paged_prefix_share(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
+    if (!paged_alloc::active() || tokens.empty()) {
+        return 0;
+    }
+    const uint32_t bs   = 16;
+    const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
+    auto & cells = v_cells[strm];
+    const uint32_t nblk = cells.size() / bs;
+    if (nblk < 2) {
+        return 0;
+    }
+
+    std::vector<int> toks(tokens.begin(), tokens.end());
+    const size_t kshare = paged_alloc::share_prefix(this, (int) strm, (int) seq_id, toks, bs, nblk);
+
+    for (size_t p = 0; p < kshare; ++p) {
+        const int64_t cell = paged_alloc::slot(this, (int) strm, (int) seq_id, (int) p);
+        if (cell < 0 || (uint32_t) cell >= cells.size() ||
+            cells.is_empty((uint32_t) cell) ||
+            cells.pos_get((uint32_t) cell) != (llama_pos) p) {
+            // Owner cell missing / repurposed: cannot safely share. Roll the
+            // sequence back so the caller recomputes the whole prompt.
+            paged_alloc::release(this, (int) strm, (int) seq_id);
+            return 0;
+        }
+        if (!cells.seq_has((uint32_t) cell, seq_id)) {
+            cells.seq_add((uint32_t) cell, seq_id);
+        }
+    }
+    return (int32_t) kshare;
+}
+
+// [paged 0007] Publish a sequence's full blocks into the content cache so a
+// later paged_prefix_share() can reuse them. Call after the sequence KV is
+// computed (its prefill decode has run).
+void llama_kv_cache::paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
+    if (!paged_alloc::active() || tokens.empty()) {
+        return;
+    }
+    const uint32_t bs   = 16;
+    const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
+    const uint32_t nblk = v_cells[strm].size() / bs;
+    std::vector<int> toks(tokens.begin(), tokens.end());
+    paged_alloc::commit(this, (int) strm, (int) seq_id, toks, bs, nblk);
+}
+
+ void llama_kv_cache::apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch) {
+     // TODO: refactor [TAG_KV_CACHE_SHARE_CELLS]
+     if (other) {
+diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
+index 494c0fb..f374ac6 100644
+--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
+@@ -199,6 +199,14 @@ public:
+     // emplace the ubatch context into slot: [sinfo.idxs[0...ubatch.n_tokens - 1]]
+     void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
+ 
+    // [paged 0007] Cross-request prefix recompute-skip (experimental, gated by
+    // env LLAMA_KV_PAGED). paged_prefix_share() reuses a cached content prefix
+    // for seq_id and returns the number of shared prefix tokens (the caller
+    // decodes only the suffix); paged_prefix_commit() publishes a sequence into
+    // the content cache for later reuse. No-ops when LLAMA_KV_PAGED is unset.
+    int32_t paged_prefix_share (llama_seq_id seq_id, const std::vector<llama_token> & tokens);
+    void    paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens);
+
+     //
+     // input API
+     //
+diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
+index 1d13f9c..c1027fb 100644
+--- a/src/paged-alloc.cpp
+++ b/src/paged-alloc.cpp
+@@ -23,9 +23,13 @@ namespace {
+ 
+ using key_t = std::pair<const void *, int>;
+ 
+-// One PagedKVManager per (kv-cache, stream): each stream owns a separate
+-// physical pool of cells.size() cells, so a manager's block ids map directly to
+-// cell ranges within that stream's pool. The internal request id is always 0.
+// One persistent PagedKVManager per (kv-cache, stream): each stream owns a
+// separate physical pool of cells.size() cells, so a manager's block ids map
+// directly to cell ranges within that stream's pool. Requests inside a manager
+// are keyed by the real llama_seq_id (NOT a fixed 0), so free(seq) releases one
+// sequence and shared blocks survive at ref>0 - this is what makes ref-counted
+// cross-request prefix sharing (0007) possible. Caching is enabled so commit()
+// can publish blocks and share_prefix() can hit them.
+ std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
+ 
+ paged::PagedKVManager * get_mgr(const void * cache, int stream,
+@@ -33,18 +37,21 @@ paged::PagedKVManager * get_mgr(const void * cache, int stream,
+     const key_t k{cache, stream};
+     auto it = g_managers.find(k);
+     if (it == g_managers.end()) {
+-        // enable_caching=false: prefix caching is a later patch; 0004 exercises
+-        // only on-demand allocate / free.
+         auto mgr = std::make_unique<paged::PagedKVManager>(
+-            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
+            (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/true);
+         it = g_managers.emplace(k, std::move(mgr)).first;
+     }
+     return it->second.get();
+ }
+ 
+paged::PagedKVManager * find_mgr(const void * cache, int stream) {
+    auto it = g_managers.find({cache, stream});
+    return it == g_managers.end() ? nullptr : it->second.get();
+}
+
+ } // namespace
+ 
+-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
+            uint32_t block_size, uint32_t pool_blocks,
+            std::vector<uint32_t> & out) {
+     if (n_tokens == 0) {
+@@ -53,43 +60,79 @@ bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+ 
+     paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+ 
+-    const size_t before = mgr->block_table(0).size();
+    const size_t before = mgr->block_table(seq).size();
+ 
+-    // Grow the request to cover the highest logical position. The manager pops
+-    // free blocks only for the boundaries actually crossed - that is the on-
+-    // demand behavior; an already-covered range adds nothing.
+-    if (!mgr->allocate(0, (size_t) base + n_tokens)) {
+    // Grow this sequence's request to cover its highest logical position. The
+    // manager pops free blocks only for boundaries actually crossed; if
+    // share_prefix() already reserved these blocks, this is a no-op.
+    if (!mgr->allocate(seq, (size_t) base + n_tokens)) {
+         return false; // pool exhausted -> caller falls back to the stock path
+     }
+ 
+     out.reserve(out.size() + n_tokens);
+     for (uint32_t i = 0; i < n_tokens; ++i) {
+-        const int64_t s = mgr->slot(0, (int) (base + i));
+        const int64_t s = mgr->slot(seq, (int) (base + i));
+         out.push_back((uint32_t) s);
+     }
+ 
+     if (debug()) {
+-        const size_t after = mgr->block_table(0).size();
+        const size_t after = mgr->block_table(seq).size();
+         if (after != before) {
+             fprintf(stderr,
+-                    "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
+                    "[paged-alloc] cache=%p stream=%d seq=%d grew %zu->%zu blocks "
+                     "(budget=%u; base=%u +%u tok)\n",
+-                    cache, stream, before, after, pool_blocks, base, n_tokens);
+                    cache, stream, seq, before, after, pool_blocks, base, n_tokens);
+         }
+     }
+ 
+     return true;
+ }
+ 
+-void release(const void * cache, int stream) {
+-    auto it = g_managers.find({cache, stream});
+-    if (it == g_managers.end()) {
+size_t share_prefix(const void * cache, int stream, int seq,
+                    const std::vector<int> & tokens,
+                    uint32_t block_size, uint32_t pool_blocks) {
+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+    const size_t shared_blocks = mgr->place_with_prefix(seq, tokens);
+    const size_t shared_tokens = shared_blocks * (size_t) block_size;
+    if (debug() && shared_blocks > 0) {
+        fprintf(stderr,
+                "[paged-alloc] cache=%p stream=%d seq=%d shares %zu prefix blocks "
+                "(%zu tokens) - prefix NOT recomputed\n",
+                cache, stream, seq, shared_blocks, shared_tokens);
+    }
+    return shared_tokens;
+}
+
+int64_t slot(const void * cache, int stream, int seq, int pos) {
+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
+    if (!mgr) {
+        return -1;
+    }
+    if ((size_t) (pos / mgr->block_size()) >= mgr->num_blocks(seq)) {
+        return -1;
+    }
+    return mgr->slot(seq, pos);
+}
+
+void commit(const void * cache, int stream, int seq,
+            const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks) {
+    paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+    mgr->cache_blocks(seq, mgr->compute_block_hashes(tokens), tokens.size());
+    if (debug()) {
+        fprintf(stderr, "[paged-alloc] cache=%p stream=%d seq=%d committed %zu tokens\n",
+                cache, stream, seq, tokens.size());
+    }
+}
+
+void release(const void * cache, int stream, int seq) {
+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
+    if (!mgr) {
+         return;
+     }
+-    it->second->free(0);
+-    g_managers.erase(it);
+    mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
+     if (debug()) {
+-        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
+        fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
+                cache, stream, seq, mgr->num_free_blocks());
+     }
+ }
+ 
+@@ -103,4 +146,21 @@ void release_all(const void * cache) {
+     }
+ }
+ 
+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size) {
+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
+    if (!mgr) {
+        return -1;
+    }
+    const size_t bi = (size_t) pos / block_size;
+    if (bi >= mgr->num_blocks(seq)) {
+        return -1;
+    }
+    return mgr->block_ref_cnt_at(seq, bi);
+}
+
+size_t num_free(const void * cache, int stream) {
+    paged::PagedKVManager * mgr = find_mgr(cache, stream);
+    return mgr ? mgr->num_free_blocks() : 0;
+}
+
+ } // namespace paged_alloc
+diff --git a/src/paged-alloc.h b/src/paged-alloc.h
+index bf66665..88dedef 100644
+--- a/src/paged-alloc.h
+++ b/src/paged-alloc.h
+@@ -1,17 +1,27 @@
+ #pragma once
+-// On-demand paged KV block allocation (patch 0004, experimental).
+// On-demand paged KV block allocation + cross-request prefix reuse
+// (patches 0004 + 0007, experimental).
+ //
+-// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
+-// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
+-// sequence's logical positions onto a fixed full-pool permutation, blocks are
+-// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
+-// and returned to the pool on sequence end. This is where the paged memory-
+-// capacity benefit begins: a short sequence holds only a few blocks, not the
+-// whole reserved window.
+// Backs the paged placement in llama_kv_cache::find_slot with the vendored
+// host-side PagedKVManager (patch 0001). Two responsibilities:
+ //
+-// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
+-// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
+-// struct stays untouched - find_slot only gains a gated call.
+//   * On-demand allocation (0004): a sequence's logical positions are mapped to
+//     physical cells block-by-block, popped from a free pool only as the
+//     sequence grows and returned on sequence end.
+//
+//   * Cross-request prefix reuse (0007): before a new sequence's suffix is
+//     decoded, share_prefix() reuses the cached physical blocks of a matching
+//     content prefix (ref_cnt++), so the engine shares the already-computed KV
+//     cells and the caller decodes ONLY the divergent suffix - the prefix is not
+//     recomputed. commit() publishes a sequence's full blocks into the content
+//     cache so later sequences can hit them. Freeing is ref-counted: a shared
+//     block returns to the pool only when every sharer has been released.
+//
+// One persistent PagedKVManager per (kv-cache, stream); requests inside it are
+// keyed by the real llama_seq_id, so free(seq) releases exactly one sequence and
+// shared blocks survive at ref>0. All state lives in this unit (a static
+// registry), so the core kv-cache struct stays untouched - find_slot gains only
+// gated calls. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
+ 
+ #include <cstdint>
+ #include <vector>
+@@ -21,19 +31,42 @@ namespace paged_alloc {
+ // true iff env LLAMA_KV_PAGED is set (evaluated once).
+ bool active();
+ 
+-// Place n_tokens logical positions [base, base+n_tokens) of one stream on
+-// demand, appending their physical cell indices to `out`. pool_blocks =
+-// cells.size()/block_size is this stream's block budget. Returns false (leaving
+// Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
+// on demand, appending their physical cell indices to `out`. pool_blocks =
+// cells.size()/block_size is the stream's block budget. Returns false (leaving
+ // `out` unchanged) on pool exhaustion, so the caller falls back to the stock
+ // allocator. The caller still validates each returned cell is empty.
+-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
+            uint32_t block_size, uint32_t pool_blocks,
+            std::vector<uint32_t> & out);
+ 
+-// Return a stream's blocks to the pool (sequence end).
+-void release(const void * cache, int stream);
+// [0007] Reuse the longest cached content prefix of `tokens` for (cache,stream,
+// seq): splice the shared physical blocks into seq (ref_cnt++) and reserve fresh
+// blocks for the divergent suffix. Returns the number of shared PREFIX TOKENS
+// (block-aligned); the caller marks those cells for seq and decodes only the
+// suffix. 0 if nothing matched or on pool exhaustion (sequence rolled back).
+size_t share_prefix(const void * cache, int stream, int seq,
+                    const std::vector<int> & tokens,
+                    uint32_t block_size, uint32_t pool_blocks);
+
+// [0007] Physical cell backing logical position `pos` of (cache,stream,seq), or
+// -1 if seq is unknown or pos is beyond its blocks. Maps a shared prefix position to its cell.
+int64_t slot(const void * cache, int stream, int seq, int pos);
+ 
+-// Return every stream's blocks for a kv-cache (clear() / teardown).
+// [0007] Publish seq's full (block-aligned) blocks into the content cache so a
+// later share_prefix() can reuse them. Call after the sequence's KV is computed.
+void commit(const void * cache, int stream, int seq,
+            const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
+
+// Return one sequence's blocks to the pool (ref-counted; sequence end).
+void release(const void * cache, int stream, int seq);
+
+// Drop every manager for a kv-cache (clear() / teardown).
+ void release_all(const void * cache);
+ 
+// Introspection for the prefix-share gate (debug/tests). ref_cnt_at returns the
+// ref count of the block backing logical position `pos`, or -1 if unknown.
+int    ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
+size_t num_free(const void * cache, int stream);
+
+ } // namespace paged_alloc
+diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
+new file mode 100644
+index 0000000..8573cd2
+--- /dev/null
+++ b/src/paged-prefix-api.cpp
+@@ -0,0 +1,48 @@
+#include "paged-prefix-api.h"
+#include "paged-alloc.h"
+#include "llama-kv-cache.h"
+
+#include <vector>
+
+namespace paged_prefix_api {
+
+static llama_kv_cache * kv_of(llama_context * ctx) {
+    // The driver targets a plain unified KV-cache model; dynamic_cast yields null
+    // for wrapped caches (iSWA / hybrid), where cross-request cell sharing does
+    // not apply, so the shim degrades to a safe no-op.
+    return dynamic_cast<llama_kv_cache *>(llama_get_memory(ctx));
+}
+
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
+    llama_kv_cache * kv = kv_of(ctx);
+    if (!kv || n <= 0) {
+        return 0;
+    }
+    return kv->paged_prefix_share(seq, std::vector<llama_token>(tokens, tokens + n));
+}
+
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
+    llama_kv_cache * kv = kv_of(ctx);
+    if (!kv || n <= 0) {
+        return;
+    }
+    kv->paged_prefix_commit(seq, std::vector<llama_token>(tokens, tokens + n));
+}
+
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos) {
+    llama_kv_cache * kv = kv_of(ctx);
+    if (!kv) {
+        return -1;
+    }
+    return paged_alloc::ref_cnt_at((const void *) kv, /*stream=*/0, (int) seq, pos, /*block_size=*/16);
+}
+
+long num_free(llama_context * ctx) {
+    llama_kv_cache * kv = kv_of(ctx);
+    if (!kv) {
+        return 0;
+    }
+    return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
+}
+
+} // namespace paged_prefix_api
+diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
+new file mode 100644
+index 0000000..78a3864
+--- /dev/null
+++ b/src/paged-prefix-api.h
+@@ -0,0 +1,27 @@
+#pragma once
+// Thin test/diagnostic shim over the paged cross-request prefix engine seam
+// (patch 0007). Lets a driver that only includes the public llama.h reach the
+// gated llama_kv_cache::paged_prefix_* methods and the paged-alloc introspection
+// without pulling in the internal kv-cache headers. All entry points are no-ops
+// (share/num_free return 0, ref_at -1) unless env LLAMA_KV_PAGED is set. Experimental; not a stable API.
+
+#include "llama.h"
+
+namespace paged_prefix_api {
+
+// Reuse the longest cached content prefix of [tokens, tokens+n) for `seq` and
+// return the number of shared prefix tokens (the caller decodes only the
+// suffix). 0 if nothing was shared.
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+
+// Publish `seq`'s full blocks into the content cache (call after its KV is computed).
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+
+// Ref count of the paged block backing logical position `pos` of `seq` (unified
+// stream 0), or -1 if unknown.
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
+
+// Number of free blocks in the unified stream-0 pool, or 0 if no manager.
+long num_free(llama_context * ctx);
+
+} // namespace paged_prefix_api
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,130 @@
+From 088d58f3a0160cbc706226ac2e77ecfeae4c164a Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 17:02:22 +0200
+Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
+ - patch 0008
+
+Wire the paged cross-request prefix recompute-skip (patch 0007's engine seam,
+paged_prefix_api::share/commit) into the llama-server continuous-batching loop
+(update_slots) so CONCURRENT requests that share a long prefix physically reuse
+one committed copy of the prefix blocks and prefill only their divergent suffix.
+Patch 0007 proved the engine seam correct via a standalone driver, but the server
+never called it: two concurrent shared-prefix requests each recomputed the full
+prefix. The server's native prompt cache only reuses a slot's OWN prior prompt
+(longest-common-prefix vs slot.prompt.tokens) - it does not share across distinct
+concurrent slots. 0008 adds that cross-slot share.
+
+Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
+
+  * In update_slots prompt-processing, after the native n_past is computed and
+    only for a FRESH slot (n_past < one block, i.e. the native cache did not
+    already cover the prefix), call paged_prefix_api::share() to splice the
+    longest committed cross-request prefix into this sequence (ref_cnt++ on the
+    shared physical blocks) and advance n_past past it, so the batch fill computes
+    ONLY the suffix. The slot's own divergent tail cells are removed first so the
+    shared cells own [n_past, kshare) without colliding (the native path removes
+    these later anyway). The n_past < block gate guarantees any block-aligned
+    share the engine returns is strictly larger than n_past and therefore always
+    adopted, so the engine's reservation always matches the suffix-only batch and
+    never leaves stale blocks (which otherwise fragment the paged pool).
+
+  * When a slot finishes prefill (SLOT_STATE_DONE_PROMPT -> GENERATING, the prefix
+    KV just computed), call paged_prefix_api::commit() to publish its prefix so
+    concurrent/later sharers can reuse it.
+
+The share() / commit() entry points are forward-declared (defined in libllama,
+src/paged-prefix-api.cpp) to avoid pulling internal kv-cache headers into the
+server translation unit.
+
+Verified in the server (32B NVFP4, CUDA, --kv-unified): with a live sequence
+holding the prefix, K=16/32 concurrent shared-prefix requests prefill only their
+~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens;
+K=16 23.9s -> 1.5s, K=32 57.9s -> 2.3s), the engine logs "shares ... prefix
+blocks - NOT recomputed" with ref_cnt>1, and greedy output stays within the
+documented CUDA batch-shape non-determinism band (stock native prompt-caching
+shows the same magnitude). Cross-request sharing requires the unified KV cache.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ tools/server/server-context.cpp | 50 +++++++++++++++++++++++++++++++++
+ 1 file changed, 50 insertions(+)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index da6a475..04c6361 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -15,6 +15,16 @@
+ #include "mtmd.h"
+ #include "mtmd-helper.h"
+ 
+// [paged 0008] Cross-request prefix recompute-skip shim. share()/commit() are
+// defined in libllama (src/paged-prefix-api.cpp, patch 0007) and are no-ops
+// unless env LLAMA_KV_PAGED is set. Declared here so the paged cross-slot prefix
+// cache wires into update_slots() without pulling in internal kv-cache headers.
+// Fully gated; stock (paged off) is byte-identical.
+namespace paged_prefix_api {
+    int32_t share (llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+    void    commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+}
+
+ #include <algorithm>
+ #include <cstddef>
+ #include <cinttypes>
+@@ -3007,6 +3017,37 @@ private:
+                             }
+                         }
+ 
+                        // [paged 0008] Cross-request prefix recompute-skip. The native prompt cache
+                        // above only reuses THIS slot's own prior prompt; when the paged KV
+                        // engine is active, also reuse a committed CROSS-slot prefix so
+                        // concurrent requests sharing a long prefix skip recompute. Gated on
+                        // LLAMA_KV_PAGED (paged_kv_share static); stock stays byte-identical.
+                        static const bool paged_kv_share = getenv("LLAMA_KV_PAGED") != nullptr;
+                        // Only attempt the cross-request share on a FRESH slot (the native
+                        // cache above did not already cover the prefix). With n_past < one block
+                        // (the literal 16 below), any block-aligned share the engine returns is
+                        // strictly larger than n_past and is therefore always adopted below - so
+                        // the engine's full-prompt reservation always matches the suffix-only
+                        // submission and never leaves stale blocks (which fragmented the
+                        // paged pool and crashed the server under high fan-out otherwise).
+                        if (paged_kv_share && n_past < 16 && slot.task->params.cache_prompt && !input_tokens.has_mtmd) {
+                            const llama_tokens ptoks = input_tokens.get_text_tokens();
+                            // Drop this slot's own cells beyond the natively-cached prefix before
+                            // splicing the shared physical prefix in, so the shared cells can own
+                            // [n_past, kshare) without colliding (the native path removes exactly
+                            // these later; a no-op for a fresh slot).
+                            common_context_seq_rm(ctx_tgt, slot.id, n_past, -1);
+                            const int32_t kshare = paged_prefix_api::share(ctx_tgt, slot.id, ptoks.data(), (int) ptoks.size());
+                            if (kshare > n_past) {
+                                slot.prompt.tokens.keep_first(n_past);
+                                for (int i = n_past; i < kshare; ++i) {
+                                    slot.prompt.tokens.push_back(ptoks[i]);
+                                }
+                                n_past = kshare;
+                                SLT_INF(slot, "paged: reusing %d cross-request shared prefix tokens - not recomputed\n", n_past);
+                            }
+                        }
+
+                         // [TAG_PROMPT_LOGITS]
+                         if (n_past == slot.task->n_tokens() && n_past > 0) {
+                             SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
+@@ -3427,6 +3468,15 @@ private:
+                     // prompt evaluated for next-token prediction
+                     slot.state = SLOT_STATE_GENERATING;
+ 
+                    // [paged 0008] Publish this slot's computed prefix so concurrent/later
+                    // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
+                    // for [0, n_tokens) has just run, so the prefix KV is computed.
+                    static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
+                    if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
+                        const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
+                        paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
+                    }
+
+                     if (slot.can_speculate()) {
+                         common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
+                     }
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch
@@ -0,0 +1,609 @@
+From 59490d82e4d0d4ad05ffb5ca3cccc668f4a75281 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 20:03:17 +0200
+Subject: [PATCH] paged in-kernel decode read (env LLAMA_KV_PAGED) - patch 0009
+
+Replace the per-layer per-step gather (patch 0003: ggml_get_rows of K/V into a
+contiguous buffer) with an in-kernel paged read on the decode step. build_attn
+passes the UNMODIFIED physical K/V views plus a block table (src[5] of
+ggml_flash_attn_ext: an I32 [n_view, n_stream] position-ordered physical-cell
+index, padded to FATTN_KQ_STRIDE). The CUDA fattn vec kernel and the CPU
+reference map logical KV index j -> physical cell block_table[seq*ne11+j] and
+read K_base+cell*nb11 / V_base+cell*nb21 in place, so the get_rows of K and V
+(the bulk of the gather) is gone. The mask stays a small compacted [n_view]
+causal mask in the same position order; KV_max / parallel_blocks / stream_k
+split-K are unchanged. The decode shape is forced onto the vec kernel (the only
+one wired for the block table); a nullptr block table => the stock contiguous
+read, byte-identical.
+
+Token-POSITION ordering keeps the flash-attn reduction order identical to stock,
+so CPU-paged logits == CPU-stock bit-for-bit (verified: 4-stream FA greedy, 64
+tokens). On GPU paged(vec) == stock(vec) at batch 1; at batch>1 it stays within
+the documented vec-vs-mma non-determinism band. Decode step at batch 32 / 1024
+ctx on GB10 (Qwen3-32B NVFP4): paged-gather 1279 ms -> in-kernel 696 ms (-46%),
+recovering the gather regression to within ~8% of stock (647 ms). Gated behind
+LLAMA_KV_PAGED; no-op (stock byte-identical) when unset.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/include/ggml.h                  |   6 ++
+ ggml/src/ggml-cpu/ops.cpp            |  10 ++-
+ ggml/src/ggml-cuda/fattn-common.cuh  |   8 +-
+ ggml/src/ggml-cuda/fattn-mma-f16.cuh |   4 +-
+ ggml/src/ggml-cuda/fattn-tile.cuh    |   4 +-
+ ggml/src/ggml-cuda/fattn-vec.cuh     |  25 +++++--
+ ggml/src/ggml-cuda/fattn-wmma-f16.cu |   4 +-
+ ggml/src/ggml-cuda/fattn.cu          |   9 +++
+ ggml/src/ggml.c                      |  14 ++++
+ src/llama-graph.cpp                  |  23 ++++--
+ src/llama-graph.h                    |   3 +-
+ src/llama-kv-cache.cpp               |  31 ++++++++
+ src/llama-kv-cache.h                 |   4 +
+ src/paged-attn.cpp                   | 107 +++++++++++++++++++++++++++
+ src/paged-attn.h                     |  18 +++++
+ 15 files changed, 248 insertions(+), 22 deletions(-)
+
+diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
+index d6807b6..823f5a9 100644
+--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
+@@ -2427,6 +2427,12 @@ extern "C" {
+             struct ggml_tensor * a,
+             struct ggml_tensor * sinks);
+ 
+    // [paged] optional block table in src[5]: I32 [n_kv_logical, n_stream]; maps each
+    // logical KV index to the physical cell within K/V. nullptr => stock contiguous read.
+    GGML_API void ggml_flash_attn_ext_set_block_table(
+            struct ggml_tensor * a,
+            struct ggml_tensor * block_table);
+
+     // TODO: needs to be adapted to ggml_flash_attn_ext
+     GGML_API struct ggml_tensor * ggml_flash_attn_back(
+            struct ggml_context * ctx,
+diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
+index 74611dc..63c07a2 100644
+--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
+@@ -8330,6 +8330,8 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
+     const ggml_tensor * v     = dst->src[2];
+     const ggml_tensor * mask  = dst->src[3];
+     const ggml_tensor * sinks = dst->src[4];
+    const ggml_tensor * block_table = dst->src[5]; // [paged] logical->physical cell map (src[5])
+    const int32_t     * bt    = block_table ? (const int32_t *) block_table->data : nullptr;
+ 
+     GGML_TENSOR_LOCALS(int64_t, neq, q,   ne)
+     GGML_TENSOR_LOCALS(size_t,  nbq, q,   nb)
+@@ -8449,7 +8451,9 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
+ 
+             float s; // KQ value
+ 
+-            const char * k_data = (const char *) k->data + ( ic*nbk1 + ik2*nbk2 + ik3*nbk3);
+            // [paged] map the logical KV index ic to its physical cell via the block table.
+            const int64_t ic_phys = bt ? (int64_t) bt[ik3*nek1 + ic] : ic;
+            const char * k_data = (const char *) k->data + ( ic_phys*nbk1 + ik2*nbk2 + ik3*nbk3);
+             kq_vec_dot(DK, &s, 0, k_data, 0, Q_q, 0, 1);
+ 
+             s = s*scale; // scale KQ value
+@@ -8465,7 +8469,7 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
+             float ms = 1.0f; // upon new higher max val, scale VKQ and KQ sum with this value
+             float vs = 1.0f; // post-softmax KQ value, expf(s - M)
+ 
+-            const char * v_data = ((const char *) v->data + (ic*nbv1 + iv2*nbv2 + iv3*nbv3));
+            const char * v_data = ((const char *) v->data + (ic_phys*nbv1 + iv2*nbv2 + iv3*nbv3));
+ 
+             if (v->type == GGML_TYPE_F16) {
+                 if (s > M) {
+@@ -9021,7 +9025,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
+         const int64_t dr = (nr + nchunk - 1) / nchunk;
+ 
+         static constexpr int64_t Q_TILE_SZ  = ggml_fa_tile_config::Q;
+-        bool use_tiled = !use_ref &&
+        bool use_tiled = !use_ref && dst->src[5] == nullptr && // [paged] one_chunk honors the block table
+                                (q->type == GGML_TYPE_F32 &&
+                                 kv_is_f32_or_f16 &&
+                                 k->type == v->type &&
+diff --git a/ggml/src/ggml-cuda/fattn-common.cuh b/ggml/src/ggml-cuda/fattn-common.cuh
+index 8dfa51a..3c6ddd5 100644
+--- a/ggml/src/ggml-cuda/fattn-common.cuh
+++ b/ggml/src/ggml-cuda/fattn-common.cuh
+@@ -39,7 +39,8 @@ typedef void (* fattn_kernel_t)(
+                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
+                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+-                            const int32_t nb31, const int32_t nb32, const int64_t nb33);
+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
+        const int  * __restrict__ block_table);
+ 
+ typedef float (*vec_dot_KQ_t)(
+     const char * __restrict__ K_c, const void * __restrict__ Q_v, const int * __restrict__ Q_q8 , const void * __restrict__ Q_ds);
+@@ -981,6 +982,8 @@ void launch_fattn(
+ 
+     const ggml_tensor * mask  = dst->src[3];
+     const ggml_tensor * sinks = dst->src[4];
+    const ggml_tensor * block_table = dst->src[5]; // [paged] optional logical->physical map
+    const int * bt_ptr = block_table ? (const int *) block_table->data : nullptr;
+ 
+     ggml_tensor * KQV = dst;
+ 
+@@ -1217,7 +1220,8 @@ void launch_fattn(
+         K->ne[0], K->ne[1], K->ne[2], K->ne[3], nb11, nb12, nb13,
+         nb21, nb22, nb23,
+         mask ? mask->ne[1] : 0, mask ? mask->ne[2] : 0, mask ? mask->ne[3] : 0,
+-        mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0
+        mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0,
+        bt_ptr
+     );
+     CUDA_CHECK(cudaGetLastError());
+ 
+diff --git a/ggml/src/ggml-cuda/fattn-mma-f16.cuh b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+index 83478a0..0a92cd6 100644
+--- a/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+++ b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+@@ -1723,7 +1723,9 @@ static __global__ void flash_attn_ext_f16(
+                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
+                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
+        const int  * __restrict__ block_table) {
+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
+     ggml_cuda_pdl_sync(); // TODO optimize placement
+ #if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE))
+     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
+diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
+index 0a09981..0ff14e6 100644
+--- a/ggml/src/ggml-cuda/fattn-tile.cuh
+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
+@@ -808,7 +808,9 @@ static __global__ void flash_attn_tile(
+                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
+                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
+        const int  * __restrict__ block_table) {
+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
+ #ifdef FLASH_ATTN_AVAILABLE
+     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
+     const char * GGML_CUDA_RESTRICT K        = K_ptr;
+diff --git a/ggml/src/ggml-cuda/fattn-vec.cuh b/ggml/src/ggml-cuda/fattn-vec.cuh
+index 69dd936..a09e2fb 100644
+--- a/ggml/src/ggml-cuda/fattn-vec.cuh
+++ b/ggml/src/ggml-cuda/fattn-vec.cuh
+@@ -39,7 +39,8 @@ static __global__ void flash_attn_ext_vec(
+                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
+                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
+        const int  * __restrict__ block_table) {
+     ggml_cuda_pdl_lc();
+ #ifdef FLASH_ATTN_AVAILABLE
+     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
+@@ -61,7 +62,7 @@ static __global__ void flash_attn_ext_vec(
+                   nb11, nb12, nb13,
+                   nb21, nb22, nb23,
+                   ne31, ne32, ne33,
+-                  nb31, nb32, nb33);
+                  nb31, nb32, nb33, block_table);
+         NO_DEVICE_CODE;
+         return;
+     }
+@@ -110,6 +111,14 @@ static __global__ void flash_attn_ext_vec(
+     K += nb13*sequence + nb12*(head / gqa_ratio);
+     V += nb23*sequence + nb22*(head / gqa_ratio);
+ 
+    // [paged] in-kernel block-table read: logical KV index j -> physical cell
+    // block_table[sequence*ne11 + j]; read K0 + cell*nb11 / V0 + cell*nb21. The
+    // mask/KV_max stay logical (the table is in token-position order). nullptr =>
+    // the stock contiguous read below.
+    const char * GGML_CUDA_RESTRICT K0 = K;
+    const char * GGML_CUDA_RESTRICT V0 = V;
+    const int  * GGML_CUDA_RESTRICT bt = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
+
+     const half * maskh  = (const half  *) (mask + nb33*(sequence % ne33) + nb31*ic0);
+ 
+     const float slope = get_alibi_slope(max_bias, head, n_head_log2, m0, m1);
+@@ -267,10 +276,11 @@ static __global__ void flash_attn_ext_vec(
+ #pragma unroll
+         for (int i_KQ_0 = 0; i_KQ_0 < nthreads_KQ; ++i_KQ_0) {
+             const int i_KQ = threadIdx.y*WARP_SIZE + (nthreads_KQ == WARP_SIZE ? 0 : (threadIdx.x & ~(nthreads_KQ-1))) + i_KQ_0;
+            const char * GGML_CUDA_RESTRICT K_blk = bt ? (K0 + (int64_t) bt[k_VKQ_0 + i_KQ]*nb11) : (K + i_KQ*nb11);
+ 
+ #pragma unroll
+             for (int j = 0; j < ncols; ++j) {
+-                float sum = vec_dot_KQ(K + i_KQ*nb11, Q_reg[j], Q_i32[j], Q_ds[j]);
+                float sum = vec_dot_KQ(K_blk, Q_reg[j], Q_i32[j], Q_ds[j]);
+                 sum = warp_reduce_sum<nthreads_KQ>(sum);
+ 
+                 if (use_logit_softcap) {
+@@ -324,6 +334,7 @@ static __global__ void flash_attn_ext_vec(
+ #pragma unroll
+         for (int k0 = 0; k0 < WARP_SIZE; k0 += V_cols_per_iter) {
+             const int k = threadIdx.y*WARP_SIZE + k0 + (nthreads_V == WARP_SIZE ? 0 : threadIdx.x / nthreads_V);
+            const char * GGML_CUDA_RESTRICT V_blk = bt ? (V0 + (int64_t) bt[k_VKQ_0 + k]*nb21) : (V + k*nb21);
+ 
+ #ifdef V_DOT2_F32_F16_AVAILABLE
+             half2 KQ_k[ncols];
+@@ -336,14 +347,14 @@ static __global__ void flash_attn_ext_vec(
+                 half2 tmp[V_rows_per_thread/2];
+                 if constexpr (type_V == GGML_TYPE_BF16) {
+                     float2 tmp_f[V_rows_per_thread/2];
+-                    dequantize_V(V + k*nb21, tmp_f,
+                    dequantize_V(V_blk, tmp_f,
+                         2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
+ #pragma unroll
+                     for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
+                         tmp[i_VKQ_1] = __float22half2_rn(tmp_f[i_VKQ_1]);
+                     }
+                 } else {
+-                    dequantize_V(V + k*nb21, tmp,
+                    dequantize_V(V_blk, tmp,
+                         2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
+                 }
+ #pragma unroll
+@@ -363,7 +374,7 @@ static __global__ void flash_attn_ext_vec(
+ #pragma unroll
+             for (int i_VKQ_0 = 0; i_VKQ_0 < D/2; i_VKQ_0 += nthreads_V*V_rows_per_thread/2) {
+                 float2 tmp[V_rows_per_thread/2];
+-                dequantize_V(V + k*nb21, tmp,
+                dequantize_V(V_blk, tmp,
+                     2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
+ #pragma unroll
+                 for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
+@@ -522,7 +533,7 @@ static __global__ void flash_attn_ext_vec(
+               nb11, nb12, nb13,
+               nb21, nb22, nb23,
+               ne31, ne32, ne33,
+-              nb31, nb32, nb33);
+              nb31, nb32, nb33, block_table);
+     NO_DEVICE_CODE;
+ #endif // FLASH_ATTN_AVAILABLE
+ }
+diff --git a/ggml/src/ggml-cuda/fattn-wmma-f16.cu b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
+index 6850716..5357849 100644
+--- a/ggml/src/ggml-cuda/fattn-wmma-f16.cu
+++ b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
+@@ -44,7 +44,9 @@ static __global__ void flash_attn_ext_f16(
+                             const int32_t nb11, const int32_t nb12, const int64_t nb13,
+                             const int32_t nb21, const int32_t nb22, const int64_t nb23,
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+-                            const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+                            const int32_t nb31, const int32_t nb32, const int64_t nb33,
+        const int  * __restrict__ block_table) {
+    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
+ #if defined(FLASH_ATTN_AVAILABLE) && (defined(GGML_HIP_ROCWMMA_FATTN) && defined(GGML_USE_WMMA_FATTN))
+     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
+     const char * GGML_CUDA_RESTRICT K        = K_ptr;
+diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
+index d6c501b..e3771ee 100644
+--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
+@@ -574,6 +574,15 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
+ 
+ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+     ggml_cuda_set_device(ctx.device);
+
+    // [paged] the block table (src[5]) is only honored by the vec kernel's
+    // in-kernel read; force it. build_attn only sets it for a vec-supported
+    // 1-token-per-stream decode shape.
+    if (dst->src[5] != nullptr) {
+        ggml_cuda_flash_attn_ext_vec(ctx, dst);
+        return;
+    }
+
+     switch (ggml_cuda_get_best_fattn_kernel(ggml_cuda_get_device(), dst)) {
+         case BEST_FATTN_KERNEL_NONE:
+             GGML_ABORT("fatal error");
+diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
+index b43016c..adbe52b 100644
+--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
+@@ -5442,6 +5442,20 @@ void ggml_flash_attn_ext_add_sinks(
+     a->src[4] = sinks;
+ }
+ 
+void ggml_flash_attn_ext_set_block_table(
+        struct ggml_tensor * a,
+        struct ggml_tensor * block_table) {
+    if (!block_table) {
+        a->src[5] = NULL;
+        return;
+    }
+
+    GGML_ASSERT(a->op == GGML_OP_FLASH_ATTN_EXT);
+    GGML_ASSERT(block_table->type == GGML_TYPE_I32);
+
+    a->src[5] = block_table;
+}
+
+ // ggml_flash_attn_back
+ 
+ struct ggml_tensor * ggml_flash_attn_back(
+diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
+index b59d2a5..abdb48d 100644
+--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
+@@ -2074,7 +2074,8 @@ ggml_tensor * llm_graph_context::build_attn_mha(
+          ggml_tensor * sinks,
+          ggml_tensor * v_mla,
+                float   kq_scale,
+-                 int   il) const {
+                 int   il,
+         ggml_tensor * block_table) const {
+     const bool v_trans = v->nb[1] > v->nb[2];
+ 
+     // split the batch into streams if needed
+@@ -2109,6 +2110,9 @@ ggml_tensor * llm_graph_context::build_attn_mha(
+                                   hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
+         cb(cur, LLAMA_TENSOR_NAME_FATTN, il);
+ 
+        if (block_table) {
+            ggml_flash_attn_ext_set_block_table(cur, block_table);
+        }
+         ggml_flash_attn_ext_add_sinks(cur, sinks);
+         ggml_flash_attn_ext_set_prec (cur, GGML_PREC_F32);
+ 
+@@ -2358,12 +2362,19 @@ ggml_tensor * llm_graph_context::build_attn(
+     ggml_tensor * k = mctx_cur->get_k(ctx0, il);
+     ggml_tensor * v = mctx_cur->get_v(ctx0, il);
+ 
+-    // [paged 0003] gather K, V and the mask to the sequence's used cells only
+-    //   (no-op unless env LLAMA_KV_PAGED is set).
+-    ggml_tensor * kq_mask_g = kq_mask;
+-    paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+    // [paged] decode read: when paging is active and this is a 1-token-per-stream
+    //   decode step, present K/V as n_gather views + a block table so the fattn
+    //   kernel reads the sequence's cells in-kernel (no get_rows of K/V). Else
+    //   fall back to the gather-read (prefill, transposed V, or env off). All a
+    //   no-op unless env LLAMA_KV_PAGED is set => stock byte-identical.
+    ggml_tensor * kq_mask_g   = kq_mask;
+    ggml_tensor * block_table = nullptr;
+    const bool is_decode = (q_cur->ne[2] == k->ne[3]); // 1 query token per stream
+    if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, &k, &v, &kq_mask_g, &block_table))) {
+        paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+    }
+ 
+-    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
+    ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il, block_table);
+     cb(cur, "kqv_out", il);
+ 
+     if (inp->self_v_rot) {
+diff --git a/src/llama-graph.h b/src/llama-graph.h
+index 5e8a658..c95ae49 100644
+--- a/src/llama-graph.h
+++ b/src/llama-graph.h
+@@ -969,7 +969,8 @@ struct llm_graph_context {
+             ggml_tensor * sinks,   // [n_head_q]
+             ggml_tensor * v_mla,   // [n_embd_head_v_mla, n_embd_head_v, n_head_v]
+                   float   kq_scale,
+-                    int   il) const;
+                    int   il,
+            ggml_tensor * block_table = nullptr) const; // [paged] optional src[5] block table
+ 
+     llm_graph_input_attn_no_cache * build_attn_inp_no_cache() const;
+ 
+diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
+index 7510ff9..0351f86 100644
+--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
+@@ -1474,6 +1474,33 @@ void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_in
+     }
+ }
+ 
+void llama_kv_cache::get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const {
+    const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+    for (uint32_t j = 0; j < ns; ++j) {
+        const auto & cells = v_cells[sinfo.s0 + j];
+        const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+        std::vector<std::pair<llama_pos, int32_t>> pc;
+        pc.reserve(n);
+        int32_t pad = -1;
+        for (uint32_t i = 0; i < n; ++i) {
+            if (!cells.is_empty(i)) {
+                pc.emplace_back(cells.pos_get(i), (int32_t) i);
+            } else if (pad < 0) {
+                pad = (int32_t) i;
+            }
+        }
+        std::sort(pc.begin(), pc.end());
+        int32_t * col = dst + (size_t) j * n_blk;
+        for (size_t k = 0; k < pc.size(); ++k) {
+            col[k] = pc[k].second;
+        }
+        const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
+        for (uint32_t k = (uint32_t) pc.size(); k < n_blk; ++k) {
+            col[k] = padv;
+        }
+    }
+}
+
+ ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
+     GGML_UNUSED(sinfo);
+ 
+@@ -2773,6 +2800,10 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
+     kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
+ }
+ 
+void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const {
+    kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]);
+}
+
+ ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
+     return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
+ }
+diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
+index f374ac6..e9980b6 100644
+--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
+@@ -176,6 +176,9 @@ public:
+     //   gather-read. get_n_gather returns the max count across streams.
+     uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
+     void     get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
+    // [paged inc1] block table [n_blk, n_stream] (position order, padded to n_blk
+    //   per column with a masked empty cell) for the in-kernel paged read.
+    void     get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const;
+ 
+     // store k_cur and v_cur in the cache based on the provided head location
+     ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
+@@ -386,6 +389,7 @@ public:
+     //   current ubatch's stream).
+     uint32_t get_n_gather() const;
+     void     get_gather_idxs(int32_t * dst) const;
+    void     get_block_table(int32_t * dst, uint32_t n_blk) const;
+ 
+     // store k_cur and v_cur in the cache based on the provided head location
+     // note: the heads in k_cur and v_cur should be laid out contiguously in memory
+diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
+index ade75e8..8eebeaa 100644
+--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
+@@ -43,6 +43,25 @@ public:
+     ggml_tensor * idxs;
+ };
+ 
+// Block table filler for the in-kernel paged read: fills an I32 [n_blk, n_stream]
+// tensor with each stream's position-ordered cells, padded to n_blk (per column)
+// with a masked empty cell, by delegating to the kv-cache context.
+class input_block_table : public llm_graph_input_i {
+public:
+    input_block_table(const llama_kv_cache_context * mctx, ggml_tensor * idxs, uint32_t n_blk)
+        : mctx(mctx), idxs(idxs), n_blk(n_blk) {}
+
+    void set_input(const llama_ubatch * ubatch) override {
+        GGML_UNUSED(ubatch);
+        GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
+        mctx->get_block_table((int32_t *) idxs->data, n_blk);
+    }
+
+    const llama_kv_cache_context * mctx;
+    ggml_tensor * idxs;
+    uint32_t n_blk;
+};
+
+ } // namespace
+ 
+ void gather(ggml_context * ctx0,
+@@ -125,4 +144,92 @@ void gather(ggml_context * ctx0,
+     }
+ }
+ 
+bool in_kernel_decode(ggml_context * ctx0,
+                      llm_graph_result * res,
+                      const llama_kv_cache_context * mctx,
+                      ggml_tensor ** k,
+                      ggml_tensor ** v,
+                      ggml_tensor ** kq_mask,
+                      ggml_tensor ** block_table) {
+    if (!active()) {
+        return false;
+    }
+    // Bench escape hatch: LLAMA_KV_PAGED_GATHER=1 forces the old gather-read decode
+    // path (for a same-build BEFORE/AFTER decode-step comparison). Dev-only.
+    static const bool force_gather = (std::getenv("LLAMA_KV_PAGED_GATHER") != nullptr);
+    if (force_gather) {
+        return false;
+    }
+
+    ggml_tensor * K = *k;
+    ggml_tensor * V = *v;
+    ggml_tensor * M = *kq_mask;
+
+    const int64_t n_stream = K->ne[3];
+    GGML_ASSERT(M->ne[3] == n_stream);
+
+    const int64_t n_gather = (int64_t) mctx->get_n_gather();
+    if (n_gather <= 0) {
+        // Worst-case reserve / nothing placed yet: keep the dense [0,n_kv) read.
+        return false;
+    }
+
+    // The in-kernel read addresses V along its d-major (non-transposed) axis. If
+    // the cache stores V transposed, fall back to gather() (which normalizes it).
+    if (V->nb[1] > V->nb[2]) {
+        return false;
+    }
+
+    if (debug()) {
+        static int64_t once = 0;
+        if (once++ < 2) {
+            fprintf(stderr, "[paged-attn] in-kernel decode n_stream=%lld n_kv=%lld n_gather=%lld\n",
+                    (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
+        }
+    }
+
+    // Block table [n_view, n_stream]: column s holds stream s's non-empty cells
+    // in token-POSITION order (identical to the gather index, so the reduction
+    // order matches stock bit-for-bit), padded with a masked empty cell. Filled
+    // at set_input from the kv-cache (get_gather_idxs), exactly like the gather.
+    // Pad the logical length to FATTN_KQ_STRIDE (256) so the CUDA fattn vec kernel
+    // reads fixed 128-wide KV blocks without overrun and the KV_max mask scan
+    // engages; padded entries point at a masked empty cell (0 contribution). Stays
+    // <= n_kv since n_kv is itself padded to 256 and n_gather <= n_kv.
+    int64_t n_view = GGML_PAD(n_gather, 256);
+    if (n_view > K->ne[2]) {
+        n_view = K->ne[2];
+    }
+
+    ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
+    ggml_set_input(idx);
+    res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
+
+    // Present K and V as [d, h, n_view, ns] VIEWS of the full physical window:
+    // identical per-cell (nb1,nb2) and per-stream (nb3) strides, only the cell
+    // dim shrinks to n_view. NOT materialized - the kernel reads in place.
+    *k = ggml_view_4d(ctx0, K, K->ne[0], K->ne[1], n_view, n_stream,
+                      K->nb[1], K->nb[2], K->nb[3], 0);
+    *v = ggml_view_4d(ctx0, V, V->ne[0], V->ne[1], n_view, n_stream,
+                      V->nb[1], V->nb[2], V->nb[3], 0);
+
+    // Compact the mask to [n_view, n_tps, 1, ns] in the same position order so
+    // the kernel's logical mask index aligns with the block table. Cheap: the
+    // mask is ~(d*h) smaller than K/V, which is why only its get_rows remains.
+    {
+        ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream);
+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
+        m = ggml_get_rows(ctx0, m, idx);
+        m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
+        m = ggml_reshape_4d(ctx0, m, n_view, M->ne[1], 1, n_stream);
+        if (M->type != m->type) {
+            m = ggml_cast(ctx0, m, M->type);
+        }
+        *kq_mask = m;
+    }
+
+    *block_table = idx;
+    return true;
+}
+
+ } // namespace paged_attn
+diff --git a/src/paged-attn.h b/src/paged-attn.h
+index c5b7bd7..23e2184 100644
+--- a/src/paged-attn.h
+++ b/src/paged-attn.h
+@@ -37,4 +37,22 @@ void gather(ggml_context * ctx0,
+             ggml_tensor ** v,
+             ggml_tensor ** kq_mask);
+ 
+// [paged inc1] In-kernel paged decode read. Instead of materializing the
+// sequence's cells (gather()), present K and V as padded-n_gather-length VIEWS of the
+// full physical window and return the position-ordered physical-cell index list
+// as a block table (src[5] of ggml_flash_attn_ext). The fattn kernel/op then
+// reads K_base + block_table[j]*nb in-kernel, removing the get_rows of K and V
+// (the bulk of the gather). On return (true): *k,*v point at the views, *kq_mask
+// at the compacted mask, *block_table at the I32 [padded n_gather, n_stream] index.
+// Returns false (leaving *k,*v,*kq_mask untouched) when the in-kernel path does
+// not apply - env off, nothing placed, or a transposed V cache - so the caller
+// keeps the dense gather()/contiguous read.
+bool in_kernel_decode(ggml_context * ctx0,
+                      llm_graph_result * res,
+                      const llama_kv_cache_context * mctx,
+                      ggml_tensor ** k,
+                      ggml_tensor ** v,
+                      ggml_tensor ** kq_mask,
+                      ggml_tensor ** block_table);
+
+ } // namespace paged_attn
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch
@@ -0,0 +1,269 @@
+From 9ac56933abd5de4a1f349c811c2d74aab09f7ab1 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Mon, 22 Jun 2026 22:36:09 +0200
+Subject: [PATCH] paged tile in-kernel decode read + dispatch guard (env
+ LLAMA_KV_PAGED) - patch 0010
+
+Increment 2 (robustness, ~0 headline ms): make the paged in-kernel decode read
+safe against silent mis-routing, and plumb the same read into the tile kernel
+for the increment-3 GQA head-group work.
+
+fattn-tile.cuh: graft the patch-0009 phys(j) block-table read (mirror of
+fattn-vec.cuh). Both flash_attn_tile_load_tile overloads, flash_attn_tile_iter_KQ
+(K) and flash_attn_tile_iter (V) take an optional per-sequence block table; a row
+i is read from base + block_table[row_base + i]*stride instead of base + i*stride.
+The table defaults to nullptr (default args + a null bt_seq when src[5] is unset),
+so every existing non-paged caller is byte-identical to stock. The mask / KV_max
+stay logical (token-position order), as in vec.
+
+fattn.cu: DISPATCH GUARD. When the block table (src[5]) is present, route ONLY to
+the vec or tile kernel and never fall through to the best-kernel switch. The
+mma/wmma kernels GGML_UNUSED the table and would silently read the wrong
+(contiguous physical) cells; the guard makes that unreachable. The vec dispatcher
+GGML_ABORTs for an unsupported D/type rather than mis-reading. Default route is vec
+(the inc-1 byte-validated path). LLAMA_KV_PAGED_DISPATCH_LOG=1 prints the routed
+kernel once.
+
+Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B, build-cpu) PASS. GPU
+vec-paged == stock at -s 1 PASS. Dispatch confirmed VEC for the real decode shape:
+Qwen3-0.6B Q ne=[128,1,16,1] and Qwen3-32B NVFP4 Q ne=[128,1,64,N] both route to
+vec, matching the nsys profile (flash_attn_ext_vec).
+
+The tile graft is plumbed for increment-3 GQA head-group reuse but is EXPERIMENTAL
+and NOT yet byte-validated (LLAMA_KV_PAGED_TILE=1). A tile-vs-tile gate shows
+tile-paged diverging from tile-stock at the first cross-tile KV depth: the
+GQA-grouped (ncols2>1) tile path reads a full nbatch_fa-row tile with
+oob_check=false while the compacted paged mask is not padded to cover the tile, so
+past-end rows leak. vec bounds its KV walk by KV_max and is unaffected. Bounding
+the tile path is increment-3 work; the default vec route and all stock paths are
+untouched.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/fattn-tile.cuh | 45 ++++++++++++++++++++-----------
+ ggml/src/ggml-cuda/fattn.cu       | 38 +++++++++++++++++++++++---
+ 2 files changed, 64 insertions(+), 19 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
+index 0ff14e6..bb84d61 100644
+--- a/ggml/src/ggml-cuda/fattn-tile.cuh
+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
+@@ -373,7 +373,8 @@ static constexpr __device__ int ggml_cuda_fattn_tile_get_nbatch_K(const int DKQ,
+ // TODO: deduplicate with mma-f16
+ template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
+ static __device__ __forceinline__ void flash_attn_tile_load_tile(
+-        const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
+        const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
+        const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
+     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
+     constexpr int cpy_ne = cpy_nb / 4;
+ 
+@@ -402,9 +403,11 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
+                     const int j = j0*cpy_ne + (stride_j == warp_size ? threadIdx.x : threadIdx.x % stride_j)*cpy_ne;
+ 
+                     const __align__(16) half2 zero[cpy_ne] = {{0.0f, 0.0f}};
+                    // [paged] remap the row through the block table (nullptr => stock contiguous read).
+                    const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
+                     ggml_cuda_memcpy_1<cpy_nb>(
+                         tile_KV + i*(J/2 + J_padding) + j,
+-                        !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
+                        !oob_check || i < i_sup ? KV_row + j : zero);
+                 }
+             }
+         }
+@@ -423,7 +426,8 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
+ 
+ template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
+ static __device__ __forceinline__ void flash_attn_tile_load_tile(
+-        const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
+        const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
+        const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
+     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
+     constexpr int cpy_ne = cpy_nb / 4;
+ 
+@@ -453,8 +457,10 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
+ 
+                     const half2 zero[cpy_ne/2] = {{0.0f, 0.0f}};
+                     __align__(16) half2 tmp_h2[cpy_ne/2];
+                    // [paged] remap the row through the block table (nullptr => stock contiguous read).
+                    const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
+                     ggml_cuda_memcpy_1<sizeof(tmp_h2)>(
+-                        tmp_h2, !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
+                        tmp_h2, !oob_check || i < i_sup ? KV_row + j : zero);
+ 
+                     __align__(16) float2 tmp_f2[cpy_ne/2];
+ #pragma unroll
+@@ -487,6 +493,7 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
+         const int k_VKQ_0,
+         const int k_VKQ_sup,
+         const int k_KQ_0,
+        const int * const __restrict__ block_table,
+         float * KQ_acc) {
+     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
+     constexpr int cpy_ne = cpy_nb / 4;
+@@ -495,8 +502,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
+     constexpr int cpw   = ncols > nwarps ? ncols/nwarps : 1; // Q columns per warp
+     constexpr int np    = nwarps > ncols ? nwarps/ncols : 1; // number of parallel warps per Q column
+ 
+    // [paged] when block_table is set K_h2 is the un-offset base; the table supplies the row.
+    const half2 * const K_base = block_table ? (K_h2 + k_KQ_0/2) : (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2);
+     flash_attn_tile_load_tile<warp_size, nwarps, nbatch_fa, nbatch_K, cpy_ne, oob_check>
+-        (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2, KV_tmp, stride_K2, k_VKQ_sup);
+        (K_base, KV_tmp, stride_K2, k_VKQ_sup, block_table, k_VKQ_0);
+     __syncthreads();
+ 
+ #ifdef FAST_FP16_AVAILABLE
+@@ -572,7 +581,8 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
+         T_acc * const VKQ,
+         const int k_VKQ_0,
+         const int k_VKQ_max,
+-        const int col_Q_0) {
+        const int col_Q_0,
+        const int * const __restrict__ block_table) {
+     constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
+     constexpr int cpy_ne = cpy_nb / 4;
+ 
+@@ -605,12 +615,12 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
+ #pragma unroll
+     for (int k_KQ_0 = 0; k_KQ_0 < DKQ - nbatch_K_last; k_KQ_0 += nbatch_K) {
+         flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>(
+-            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
+            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
+     }
+     if (nbatch_K_last > 0) {
+         constexpr int k_KQ_0 = DKQ - nbatch_K_last;
+         flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K_last, use_logit_softcap, oob_check>(
+-            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
+            Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
+     }
+ 
+     // Apply logit softcap + mask, update KQ_max:
+@@ -715,8 +725,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
+     static_assert(nbatch_V % np == 0, "bad nbatch_V");
+ #pragma unroll
+     for (int k0 = 0; k0 < nbatch_fa; k0 += nbatch_V) {
+        // [paged] when block_table is set V_h2 is the un-offset base; the table supplies the row.
+        const half2 * const V_base = block_table ? V_h2 : (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2);
+         flash_attn_tile_load_tile<warp_size, nwarps, nbatch_V, DV, 0, oob_check>
+-            (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2, KV_tmp, stride_V2, k_VKQ_sup - k0);
+            (V_base, KV_tmp, stride_V2, k_VKQ_sup - k0, block_table, k_VKQ_0 + k0);
+         __syncthreads();
+ 
+ #ifdef FAST_FP16_AVAILABLE
+@@ -810,7 +822,6 @@ static __global__ void flash_attn_tile(
+                             const int32_t ne31, const int32_t ne32, const int32_t ne33,
+                             const int32_t nb31, const int32_t nb32, const int64_t nb33,
+         const int  * __restrict__ block_table) {
+-    GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
+ #ifdef FLASH_ATTN_AVAILABLE
+     const char * GGML_CUDA_RESTRICT Q        = Q_ptr;
+     const char * GGML_CUDA_RESTRICT K        = K_ptr;
+@@ -837,7 +848,7 @@ static __global__ void flash_attn_tile(
+                   nb11, nb12, nb13,
+                   nb21, nb22, nb23,
+                   ne31, ne32, ne33,
+-                  nb31, nb32, nb33);
+                  nb31, nb32, nb33, block_table);
+         NO_DEVICE_CODE;
+         return;
+     }
+@@ -861,6 +872,10 @@ static __global__ void flash_attn_tile(
+     const half2 * K_h2 = (const half2 *) (K + nb13*sequence + nb12*(head0 / gqa_ratio));
+     const half2 * V_h2 = (const half2 *) (V + nb23*sequence + nb22*(head0 / gqa_ratio)); // K and V have same shape
+ 
+    // [paged] per-sequence logical->physical block table in token-position order
+    // (mask/KV_max stay logical); nullptr => the stock contiguous read.
+    const int * const __restrict__ bt_seq = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
+
+     const half * maskh = mask ? (const half *) (mask + nb33*(sequence % ne33)) : nullptr;
+ 
+     const int stride_K2   = nb11 / sizeof(half2);
+@@ -963,14 +978,14 @@ static __global__ void flash_attn_tile(
+             constexpr bool oob_check = false;
+             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
+                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
+-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
+             k_VKQ_0 += gridDim.y*nbatch_fa;
+         }
+         if (k_VKQ_0 < k_VKQ_max) {
+             constexpr bool oob_check = true;
+             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
+                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
+-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
+         }
+     } else {
+         // Branch without out-of-bounds checks.
+@@ -978,7 +993,7 @@ static __global__ void flash_attn_tile(
+             constexpr bool oob_check = false;
+             flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
+                 (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
+-                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+                stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
+         }
+     }
+ 
+@@ -1144,7 +1159,7 @@ static __global__ void flash_attn_tile(
+               nb11, nb12, nb13,
+               nb21, nb22, nb23,
+               ne31, ne32, ne33,
+-              nb31, nb32, nb33);
+              nb31, nb32, nb33, block_table);
+     NO_DEVICE_CODE;
+ #endif // FLASH_ATTN_AVAILABLE
+ }
+diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
+index e3771ee..afcafa2 100644
+--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
+@@ -575,11 +575,41 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
+ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+     ggml_cuda_set_device(ctx.device);
+ 
+-    // [paged] the block table (src[5]) is only honored by the vec kernel's
+-    // in-kernel read; force it. build_attn only sets it for a vec-supported
+-    // 1-token-per-stream decode shape.
+    // [paged] DISPATCH GUARD. The block table (src[5]) is read in-kernel ONLY by
+    // the vec and tile kernels; the mma/wmma kernels GGML_UNUSED it and would
+    // silently read the wrong (contiguous physical) cells. So when a block table
+    // is present we route here and NEVER fall through to the best-kernel switch
+    // below - no decode shape can silently reach an mma/wmma misread. build_attn
+    // only sets src[5] for the 1-token-per-stream decode shape; the vec
+    // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
+    // and any shape that should not be paged must take the host-side gather path
+    // (LLAMA_KV_PAGED_GATHER=1) instead.
+    //
+    // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
+    // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
+    // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
+    // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
+    // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
+    // with oob_check=false while the compacted paged mask is not padded to cover
+    // it, so it diverges from stock. Not for production paged decode until
+    // increment-3 bounds that path; the default vec route is unaffected.
+     if (dst->src[5] != nullptr) {
+-        ggml_cuda_flash_attn_ext_vec(ctx, dst);
+        static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
+        if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
+            static bool logged = false;
+            if (!logged) {
+                logged = true;
+                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
+                    paged_tile ? "TILE(experimental)" : "VEC",
+                    (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
+                    (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
+            }
+        }
+        if (paged_tile) {
+            ggml_cuda_flash_attn_ext_tile(ctx, dst);
+        } else {
+            ggml_cuda_flash_attn_ext_vec(ctx, dst);
+        }
+         return;
+     }
+ 
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch
@@ -0,0 +1,147 @@
+From d5ca5cd756e42214d0003bca815ca91943679b0d Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 00:18:35 +0200
+Subject: [PATCH] paged decode: route GQA-grouped tile kernel by default (F16,
+ gqa>=2) - patch 0011
+
+Increment 3 (the attention lever). In fattn.cu's paged dispatch guard, route the
+in-kernel decode to the tile kernel for the common grouped-query F16 case, and
+keep the inc-1 vec kernel for everything else.
+
+The tile kernel carries native GQA head-group reuse: its ncols2 axis groups the
+q-heads that share one kv-head, so each K/V row is loaded once for the whole
+group instead of once per q-head. vec re-streams each kv-head's K/V once per
+q-head (8x for Qwen3-32B's n_head 64 / n_head_kv 8) and runs at 168 regs ->
+3 blocks/SM = 25% occupancy on GB10; tile is 108-128 regs with native grouping.
+The inc-2 phys(j) block-table read was already plumbed into tile (patch 0010);
+this patch makes it the default for {F16 K and V, gqa_ratio >= 2}.
+
+Routing guard (why conditional): the tile kernel has no K/V type template - it
+loads half2 - so a non-F16 cache (BF16 / quantized) would be converted by
+launch_fattn to a contiguous F16 copy, which breaks the in-kernel block-table
+read (the table indexes the original paged layout, not the copy). So tile is
+correct only for an F16 cache; non-F16 caches and the non-grouped gqa==1 shape
+fall back to the inc-1 vec path, exactly as before this change. The head-group
+reuse also only helps at gqa_ratio >= 2. LLAMA_KV_PAGED_VEC=1 forces vec for A/B.
+Note: paged decode is currently exercised with an F16 cache only; quantized +
+paged is a separate pre-existing limitation, independent of this change
+(verified: stock + q8_0 cache works, but paged + q8_0 aborts both before and
+after this patch, since both route the non-F16 cache to vec).
+
+Measured GB10 (sm_121, 48 SM), Qwen3-32B NVFP4 dense, F16 cache, gqa 8, batch 32,
+1024 ctx, llama-batched-bench npp=1024 ntg=128 npl=32, GGML_CUDA_DISABLE_GRAPHS=1,
+same build, env-toggled:
+  STOCK (mma)            174.8 ms/step  183.1 t/s
+  PAGED-VEC  (inc-1)     186.3 ms/step  171.8 t/s   (+6.6% vs stock)
+  PAGED-TILE (inc-3)     177.9 ms/step  179.8 t/s   (+1.8% vs stock)
+GQA grouping recovers 8.4 ms/step (-4.5%) over the inc-1 vec default and brings
+paged decode to within 1.8% of stock. The win grows with context (npl=8, tile vs
+vec decode step): 1024 -2.3%, 4096 -3.3%, 8192 and 16384 wider, as attention
+takes a larger share of the step.
+
+Why not the split-K tune: the vec decode grid is already block-saturated
+(1 x parallel_blocks 3 x 2048 = 6144 blocks ~ 43 waves over 144 resident on 48
+SM), so raising parallel_blocks / KV_max adds no SM fill. The under-saturation is
+intra-SM (occupancy + the 8x KV re-streaming), which GQA grouping attacks
+directly; more split-K does not.
+
+Correctness (greedy, GGML_CUDA_DISABLE_GRAPHS=1):
+  - CPU plumbing gate (Qwen3-0.6B, build-cpu, paged-on vs off): BYTE-IDENTICAL.
+  - GPU 0.6B gqa=2, 8 seq x 48 tok: tile is token-identical to the inc-1 vec path
+    in 7/8 sequences; the 8th diverges at token 5, within the same kernel-noise
+    band where vec also drifts from stock. Stock uses the mma kernel for this
+    multi-stream GQA shape, so a different kernel = different rounding =
+    autoregressive token drift; vec and tile agree with each other while both
+    differ from stock (both pick 15678 where stock picks 38835), confirming the
+    drift is kernel choice, not a paging error.
+  - GPU 32B gqa=8, 4 seq x 40 tok: tile tracks stock at least as well as vec
+    (seq3: tile == stock == 624 at the token where vec picked 13).
+
+Stock is byte-identical: the dispatch guard only diverts when the block table
+(src[5]) is set; the non-paged best-kernel switch is untouched. The ncols2>1 tile
+path reads the last nbatch_fa tile with oob_check=false and relies on the mask
+-inf padding - the same pattern stock uses for ncols2>1 - and the compacted paged
+mask is gathered to the n_view (GGML_PAD 256) width so it carries that padding.
+
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+Assisted-by: Claude:opus-4.8 [Claude Code]
+---
+ ggml/src/ggml-cuda/fattn.cu | 51 ++++++++++++++++++++++++++-----------
+ 1 file changed, 36 insertions(+), 15 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
+index afcafa2..6b15810 100644
+--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
+@@ -580,32 +580,53 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
+     // silently read the wrong (contiguous physical) cells. So when a block table
+     // is present we route here and NEVER fall through to the best-kernel switch
+     // below - no decode shape can silently reach an mma/wmma misread. build_attn
+-    // only sets src[5] for the 1-token-per-stream decode shape; the vec
+    // only sets src[5] for the 1-token-per-stream decode shape; the vec/tile
+     // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
+     // and any shape that should not be paged must take the host-side gather path
+     // (LLAMA_KV_PAGED_GATHER=1) instead.
+     //
+-    // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
+-    // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
+-    // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
+-    // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
+-    // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
+-    // with oob_check=false while the compacted paged mask is not padded to cover
+-    // it, so it diverges from stock. Not for production paged decode until
+-    // increment-3 bounds that path; the default vec route is unaffected.
+    // Default route = the GQA-grouped TILE kernel (inc-3) WHEN it is both correct
+    // and a win, else the inc-1 vec path. Tile groups the q-heads that share one
+    // kv-head (ncols2), loading each K/V row once for the whole group instead of
+    // once per q-head, and runs at higher occupancy than vec (108-128 regs vs 168).
+    // Two constraints make this conditional: (1) the tile kernel has no K/V type
+    // template - it loads half2 - so a non-F16 cache (BF16/quantized) would be
+    // converted by launch_fattn to a contiguous F16 copy, which breaks the
+    // in-kernel block-table read (the table indexes the original paged layout, not
+    // the copy); vec instead reads the original cache with in-kernel dequant, so it
+    // is the only correct paged path for non-F16 caches. (2) the head-group reuse
+    // only helps when gqa_ratio>=2. So route to tile only for {F16 K and V,
+    // gqa_ratio>=2}; everything else stays on vec, matching stock (which also sends
+    // quantized-cache decode to the vector kernel). Measured on GB10 (Qwen3-32B
+    // nvfp4, F16 cache, gqa 8, batch 32, 1024 ctx): tile 177.9 ms/step vs vec 186.3
+    // vs stock 174.8 - GQA grouping recovers ~4.5% over the inc-1 vec default and
+    // brings paged decode to within ~1.8% of stock. Validated token-coherent with vec:
+    // 0.6B 8-seq 7/8 identical (8th within the kernel-noise band where vec also
+    // drifts from stock), 32B gqa=8 tile tracks stock at least as well as vec, CPU
+    // plumbing gate byte-identical. The ncols2>1 tile path reads the last nbatch_fa
+    // tile with oob_check=false relying on mask -inf padding (the SAME pattern stock
+    // uses for ncols2>1); the compacted paged mask is gathered to the n_view
+    // (GGML_PAD 256) width so it carries that padding. LLAMA_KV_PAGED_VEC=1 forces
+    // the inc-1 vec path for A/B.
+     if (dst->src[5] != nullptr) {
+-        static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
+        const ggml_tensor * Qp = dst->src[0];
+        const ggml_tensor * Kp = dst->src[1];
+        const ggml_tensor * Vp = dst->src[2];
+        const bool kv_f16    = Kp->type == GGML_TYPE_F16 && Vp->type == GGML_TYPE_F16;
+        const int64_t gqa_ratio = Kp->ne[2] > 0 ? Qp->ne[2] / Kp->ne[2] : 1;
+        const bool force_vec = getenv("LLAMA_KV_PAGED_VEC") != nullptr;
+        const bool use_tile  = !force_vec && kv_f16 && gqa_ratio >= 2;
+         if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
+             static bool logged = false;
+             if (!logged) {
+                 logged = true;
+-                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
+-                    paged_tile ? "TILE(experimental)" : "VEC",
+-                    (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
+-                    (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
+                fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld] gqa=%ld kv_f16=%d)\n",
+                    use_tile ? "TILE(gqa)" : "VEC",
+                    (long) Qp->ne[0], (long) Qp->ne[1], (long) Qp->ne[2], (long) Qp->ne[3],
+                    (long) gqa_ratio, (int) kv_f16);
+             }
+         }
+-        if (paged_tile) {
+        if (use_tile) {
+             ggml_cuda_flash_attn_ext_tile(ctx, dst);
+         } else {
+             ggml_cuda_flash_attn_ext_vec(ctx, dst);
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0012-paged-mask-pad-invariant-assert.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0012-paged-mask-pad-invariant-assert.patch
@@ -0,0 +1,50 @@
+From 6e3e976e2b11adb05519f31dd5aad0c204678f5c Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 11:12:05 +0200
+Subject: [PATCH] feat(paged): assert mask-pad invariant for the paged tile
+ route (patch 0012)
+
+The now-default paged decode route (GQA-grouped fattn-tile kernel) does not
+leak past-end KV rows only because the compacted mask/block-table length is
+padded to a whole number of flash-attn KV tiles: n_view = GGML_PAD(n_gather,
+256), and the tile (nbatch_fa = 64 for head_dim 128) divides 256, so the last
+tile sits entirely inside the -inf pad window. That invariant was implicit.
+
+Add a defensive GGML_ASSERT(n_view % 64 == 0) right after the pad/clamp so a
+future change to the pad (e.g. < 256) or the tile (> 256) that broke the
+whole-tile property cannot silently reintroduce the leak. Additive only, no
+behaviour change.
+
+Verified: build-cpu compiles, and the paged CPU byte gate (LLAMA_KV_PAGED off
+vs on, Qwen3-0.6B-Q8_0, greedy, -ngl 0) stays byte-identical while the assert
+stays silent (n_view remains a whole number of tiles across all decode steps).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ src/paged-attn.cpp | 9 +++++++++
+ 1 file changed, 9 insertions(+)
+
+diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
+index 8eebeaa..fed8ca9 100644
+--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
+@@ -201,6 +201,15 @@ bool in_kernel_decode(ggml_context * ctx0,
+         n_view = K->ne[2];
+     }
+ 
+    // The flash-attn KV tile is 64 rows wide (nbatch_fa for head_dim 128). n_view must be
+    // a whole number of such tiles so the in-kernel decode never reads past the gathered
+    // rows: the trailing pad cells [n_gather, n_view) are all -inf, so any tile straddling
+    // the boundary still contributes zero. This holds today only because the pad (256) is a
+    // multiple of the tile; a future pad < 256 (or nbatch_fa > 256) that broke it would
+    // silently reintroduce a past-end KV leak, so assert it rather than trust it.
+    // pad must be a multiple of the flash-attn KV tile so the last tile is fully inside the -inf pad
+    GGML_ASSERT(n_view % 64 == 0);
+
+     ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
+     ggml_set_input(idx);
+     res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0013-paged-decoupled-prefill-token-budget.patch
@@ -0,0 +1,137 @@
+From 17d97cb74e3e8c93751afd33f5c183e57056fde9 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 11:52:45 +0200
+Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
+ 0013)
+
+llama-server already co-batches decode with chunked prefill: update_slots()
+appends every generating slot's sampled token first, then fills the rest of the
+n_batch budget with prompt tokens, deferring the overflow to the next step. But
+the prefill chunk size is hard-wired to n_batch (default 2048): one slot's
+~2048-token prefill chunk lands in a single compute-heavy step, and every decode
+co-batched into that step sees a multi-second inter-token-latency (ITL) spike.
+Lowering n_batch shrinks the chunk but also caps decode-concurrency width and
+prefill throughput, because they are coupled.
+
+Add LLAMA_PREFILL_BUDGET: a per-step prefill-token budget decoupled from n_batch
+(the analogue of vLLM's --max-num-batched-tokens / long_prefill_token_threshold).
+The prompt-fill loop and the outer slot loop now also stop once this many prompt
+tokens have been added in the current update_slots() step, so a long prefill is
+split across more steps that each still advance in-flight decode. Default (env
+unset or <= 0) = disabled, so stock behaviour is byte-identical. Orthogonal to
+LLAMA_KV_PAGED: this is a pure scheduler knob and works with paged off.
+
+Measured on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8 steady decode
+streams with one 6000-token prefill injected mid-stream; same binary, only
+LLAMA_PREFILL_BUDGET differs:
+
+  metric                            stock(off)  budget=256   budget=512
+  worst decode freeze (ms)             3380      482 (7.0x)   778 (4.3x)
+  median decode ITL in window (ms)     2264      411 (5.5x)   689
+  decode_stall (ms)                    3285      387 (8.5x)   684 (4.8x)
+  decode steps during prefill            38      201 (5.3x)   108
+  injected-req TTFT (ms)               8493     10172 (+20%)  8432 (~0%)
+  steady-state baseline ITL (ms)         94        95          94
+
+This is a LATENCY/fairness lever, not an aggregate-throughput lever: it flattens
+the decode ITL spike a long prefill inflicts on co-batched decoders (8.5x smaller
+decode stall, 7.0x smaller worst freeze and 5.3x more decode steps during the
+prefill at budget=256), in exchange for a modest TTFT rise on the long request
+(the classic chunked-prefill trade-off; budget=512 still cuts the stall 4.8x at
+~no TTFT cost). Steady aggregate decode is unchanged: it is bandwidth/weight-capped
+on GB10 (the NVFP4 weight-read floor), which the scheduler cannot lift.
+
+Correctness (same model, greedy temp 0, fa on):
+- budget unset or >= n_batch: byte-identical to stock (the added break never
+  fires before the existing n_batch break; the off-path is a no-op by
+  construction).
+- short prompt (<= budget): byte-identical to stock.
+- the knob is exactly equivalent to stock's native -b chunking: budget=512 ==
+  stock -b512 and budget=256 == stock -b256, both BYTE-IDENTICAL, while keeping
+  n_batch=2048 for decode width.
+- on a prompt larger than the budget the chunked greedy output diverges from the
+  single n_batch chunk only by intrinsic flash-attn chunk-size FP grouping: PURE
+  stock -b256 diverges from stock -b2048 the same way with the patch inactive,
+  and the output stays coherent and answers correctly.
+
+Productisation (LocalAI): surface as a model options knob (max_prefill_tokens /
+mpt) parsed in grpc-server.cpp, default 0 = disabled, per CHUNKED_PREFILL_PLAN
+Phase B; the vendored update_slots() hunk here is that plan's scheduler patch and
+stays disjoint from the paged allocation hunks.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ tools/server/server-context.cpp | 35 ++++++++++++++++++++++++++++++++-
+ 1 file changed, 34 insertions(+), 1 deletion(-)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index 04c6361..5d83b30 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -2723,6 +2723,29 @@ private:
+         int32_t n_batch  = llama_n_batch(ctx_tgt);
+         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
+ 
+        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
+        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
+        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
+        // sampled decode tokens of every generating slot are appended FIRST, then prompt
+        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
+        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
+        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
+        // tokens added per step independently of n_batch, splitting a long prefill across
+        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
+        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
+        // (this is a pure scheduler knob; works with paged off).
+        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+        {
+            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
+            if (env_pb) {
+                const int v = atoi(env_pb);
+                if (v > 0) {
+                    n_prefill_budget = std::min(n_batch, std::max(1, v));
+                }
+            }
+        }
+        int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
+
+         float  alora_scale       = -1.0f;
+         size_t alora_disabled_id = 0;
+ 
+@@ -3159,7 +3182,10 @@ private:
+                     const bool n_before_user_known = n_before_user > 0;
+ 
+                     // add prompt tokens for processing in the current batch
+-                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch) {
+                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
+                    // prompt is split across more steps and leaves batch room for co-batched decode
+                    while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
+                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
+                         // get next token to process
+                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
+                         if (cur_tok == LLAMA_TOKEN_NULL) {
+@@ -3185,6 +3211,7 @@ private:
+                         slot.prompt.tokens.push_back(cur_tok);
+ 
+                         slot.n_prompt_tokens_processed++;
+                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
+ 
+                         // stop the prompt batch exactly before the latest user input, so a checkpoint
+                         // can be created after the previous messages
+@@ -3293,6 +3320,12 @@ private:
+                 if (batch.n_tokens >= n_batch) {
+                     break;
+                 }
+
+                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
+                // leaving the remaining batch capacity for co-batched decode of other slots
+                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+                    break;
+                }
+             }
+         }
+ 
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch
@@ -0,0 +1,140 @@
+From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 15:47:06 +0200
+Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)
+
+On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
+sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
+mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
+originally reported npl128 throughput cliff does NOT reproduce on this build.
+llama-batched-bench decode (S_TG t/s) is monotonic across batch:
+
+  npl        1     8    32    64   128   256
+  S_TG     85   282   629   935  1295  1779   (stock, mxfp4 MoE, -fa on)
+
+There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
+at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.
+
+What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
+token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
+column upper bound = token count, up to 128) in one column-tile. At MoE decode
+the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
+ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
+col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
+time and burns throughput on the padding columns while the larger y-tile lowers
+occupancy. Stock picks the LARGEST tile (128) where the SMALLEST tile that still
+covers the density would raise fill + occupancy at no extra weight read (at
+tokens/expert <= mmq_x there is exactly one non-empty col-tile per expert; the
+fully empty tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k
+kernel) - the inverse of vLLM's small per-expert BLOCK_SIZE_M.
+
+Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
+(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
+selection, and therefore every kernel launched, is byte-identical to stock. The
+cap only ever lowers the loop's upper bound and still selects from the same
+granularity- and shared-memory-validated mmq_x set stock already uses for
+smaller batches, so no new kernel configuration is exercised.
+
+Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
+only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):
+
+  npl     stock S_TG   cap64 S_TG    d%     stock S_PP   cap64 S_PP
+   64        936          938      +0.1       2924         2883
+  128       1295         1357      +4.8       3075         3038
+  256       1784         1825      +2.3       3085         3046
+
+  (reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)
+
+cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
+npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
+cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
+tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
+re-reads), so 64 is the recommended value and the only one that helps net.
+
+Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
+throughput unlock (llama-server continuous batching already scales). It is a
+modest high-effective-batch DECODE micro-optimization that matches vLLM's
+smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
+durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
+ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
+patches/paged/MOE_GROUPED_GEMM_SCOPE.md.
+
+Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
+stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
+prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
+npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
+ 1 file changed, 36 insertions(+), 1 deletion(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index edf546d..cff608e 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -6,6 +6,7 @@
+ 
+ #include <climits>
+ #include <cstdint>
+#include <cstdlib>
+ 
+ using namespace ggml_cuda_mma;
+ 
+@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     }
+ }
+ 
+// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+static inline int ggml_cuda_moe_mmq_x_cap() {
+    static const int cap = []() -> int {
+        const char * s = getenv("LLAMA_MOE_MMQ_X");
+        return s ? atoi(s) : 0;
+    }();
+    return cap;
+}
+
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int    id     = ggml_cuda_get_device();
+@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+     const int mmq_y = get_mmq_y_host(cc);
+ 
+    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+    // per-expert density raises tile fill + occupancy with no extra weight reads (at
+    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+    // off the ids path the cap never applies.
+    int mmq_x_lim = mmq_x_max;
+    if (args.expert_bounds != nullptr) {
+        const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+        if (moe_cap > 0) {
+            const int cap = moe_cap < 8 ? 8 : moe_cap;
+            mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+        }
+    }
+
+     int mmq_x_best  = 0;
+     int ntiles_x_best = INT_MAX;
+ 
+-    for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
+    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
+         const int granularity = mmq_get_granularity_host(mmq_x, cc);
+ 
+         if (mmq_x % granularity != 0 || mmq_get_nbytes_shared<type>(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch
@@ -0,0 +1,238 @@
+From 151343bc8c7b956c99eafc855704b70d44637a3b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Tue, 23 Jun 2026 21:03:00 +0200
+Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
+ (patch 0015)
+
+The durable follow-up to patch 0014's blunt LLAMA_MOE_MMQ_X global cap (which the
+0014 doc itself scoped): replace the manual env cap with a host-side, default-on
+auto-select inside mul_mat_q_case that picks a small token-tile (mmq_x) for the
+MUL_MAT_ID grouped FP4-MMA GEMM only when the per-expert token density is low
+(decode), and keeps the large 128-wide tile when density is high (prefill). No new
+kernel: the selection only lowers the loop's upper bound to an already-compiled,
+granularity- and shared-memory-validated mmq_x.
+
+Density is estimated host-side from the args the ids path already passes:
+  ne_get_rows = ncols_dst   = ne12 * n_expert_used   (token-expert assignments)
+  n_experts   = nchannels_x = ne02
+  density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
+Cap to the small tile (default 64) only when density <= density_max. Unlike 0014's
+global cap, the high-density prefill ubatch stays on the big tile, so S_PP does not
+regress by construction.
+
+density_max default = 8 (not tile/4 = 16). The cap must fire for decode but not for
+a prefill ubatch, and each has per-expert density n_tokens*n_used/n_experts. At the
+standard n_ubatch=512, n_used=8: prefill density = 4096/n_experts (32 at 128 experts,
+16 at 256), decode at npl<=128 is <= 1024/n_experts (8 at 128, 4 at 256). The gate is
+density <= 8, so decode <= 8 < prefill for every n_experts in [128,511]: it caps
+decode and leaves prefill on the big tile. tile/4 (=16) equalled the 256-expert
+prefill density and cost it ~2% S_PP, the regression this threshold exists to avoid.
+
+Measured on GB10 (sm_121), Qwen3.6-35B-A3B NVFP4 (256 experts, top-8, GDN linear
+attention), llama-batched-bench -fa on -npp 128 -ntg 128, default-on vs stock
+(LLAMA_MOE_AUTO_TILE=0), median of 5 reps:
+
+  npl   S_TG stock  S_TG 0015   dTG%    S_PP stock  S_PP 0015   dPP%
+    8      183.59     183.18  -0.22%       1489.2     1500.1  +0.73%
+   32      264.02     263.44  -0.22%       2034.5     2033.5  -0.05%
+   64      311.76     310.41  -0.43%       2028.3     2027.6  -0.03%
+  128      336.10     337.32  +0.36%       2025.0     2027.7  +0.13%
+
+Honest read: on THIS model the decode effect is within run-to-run noise (neutral)
+and prefill is neutral. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
+256 tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
+lever (worth +4.8% @npl128 on Qwen3-Coder-30B, 128 larger experts, patch 0014
+cap64) does not move it. An npl128 tile sweep on this model confirms 64 is the only
+useful width (TILE8 -6.3%, TILE16 -3.2%, TILE32 -0.2%, TILE64 +0.7%, TILE96 -0.8%):
+smaller tiles lose to grid/scheduling overhead and the FP4-MMA minimum width.
+
+Value banked default-on: (1) removes 0014's ~1.3% prefill cost by construction
+(density-gated, not global); (2) auto-selects the small tile for col-tile-bound MoE
+decode, reproducing 0014 cap64's tile=64 at npl128 by construction, so it preserves
+the +4.8% on Qwen3-Coder-30B without the prefill cost; (3) prefill-safe and decode-
+neutral on the SSM model, harmless where it does not help. Conservative by design:
+at npl256 the qwen3coder decode density (16) equals the 256-expert prefill density
+(16), indistinguishable to a pure-density gate, so density_max=8 forgoes 0014's
+2.3% @npl256 to keep 256-expert prefill safe; an ne12-aware refinement is future
+work.
+
+LLAMA_MOE_MMQ_X (patch 0014) is KEPT as a manual override that, when > 0, forces the
+old blunt global cap and bypasses the auto-select (explicit A/B knob). The auto-
+select is the default; LLAMA_MOE_AUTO_TILE=0 restores exact stock mmq_x selection.
+LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX tune the small tile / threshold.
+
+Correctness: extends tests/test-backend-ops test_mul_mat_id with a ragged small-M
+NVFP4/MXFP4 MoE decode-density gate (128 experts, top-8, m=768, k=2048, n in
+{16,33,64,128,130,200,256,512} spanning the cap boundary and ragged token counts).
+All 16 shapes pass CUDA-vs-CPU oracle on GB10 both default-on and with
+LLAMA_MOE_AUTO_TILE=0; full MUL_MAT_ID suite 2/2 backends OK. Off the ids path
+nothing changes (non-MoE mul_mat byte-identical to stock).
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cuh | 100 ++++++++++++++++++++++++++++++-------
+ tests/test-backend-ops.cpp |  16 ++++++
+ 2 files changed, 99 insertions(+), 17 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index cff608e..9718b12 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -4053,10 +4053,11 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     }
+ }
+ 
+-// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+-// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+-// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+-// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+// [paged patch 0014] MoE token-tile (mmq_x) MANUAL cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (fall through to the patch-0015 auto-select).
+// When > 0 it forces a blunt GLOBAL cap on the per-expert column-tile width for the MUL_MAT_ID
+// grouped-GEMM path (decode AND prefill), overriding the density-aware auto-select below. Kept
+// as an explicit override / A-B knob; the default path is now the auto-select.
+ static inline int ggml_cuda_moe_mmq_x_cap() {
+     static const int cap = []() -> int {
+         const char * s = getenv("LLAMA_MOE_MMQ_X");
+@@ -4065,6 +4066,43 @@ static inline int ggml_cuda_moe_mmq_x_cap() {
+     return cap;
+ }
+ 
+// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select knobs (DEFAULT-ON).
+// LLAMA_MOE_AUTO_TILE=0 disables the auto-select => exact stock mmq_x selection.
+static inline bool ggml_cuda_moe_auto_tile_enabled() {
+    static const bool en = []() -> bool {
+        const char * s = getenv("LLAMA_MOE_AUTO_TILE");
+        return !(s && atoi(s) == 0);
+    }();
+    return en;
+}
+// The small high-occupancy token-tile chosen for low-density (decode) MoE matmuls. Default 64:
+// the measured GB10 sweet spot (full per-expert fill with >=4x routing-imbalance headroom).
+static inline int ggml_cuda_moe_decode_tile() {
+    static const int t = []() -> int {
+        const char * s = getenv("LLAMA_MOE_DECODE_TILE");
+        const int v = s ? atoi(s) : 0;
+        return v >= 8 ? v : 64;
+    }();
+    return t;
+}
+// Per-expert token-density ceiling under which the small tile is selected. Default 8: the cap must
+// fire for decode but NOT for a prefill ubatch, and the per-expert density of each is
+// n_tokens*n_used/n_experts. For the standard n_ubatch=512, n_used=8 the prefill density is
+// 4096/n_experts (= 32 at 128 experts, 16 at 256 experts); decode at npl<=128 is <=1024/n_experts
+// (= 8 at 128 experts, 4 at 256). Default 8 sits strictly between the two for every n_experts in
+// [128,511], so it caps decode and leaves the prefill ubatch on the big 128 tile - whereas the old
+// tile/4 (=16) equalled the 256-expert prefill density and cratered its S_PP by ~2% (measured on
+// Qwen3.6-35B-A3B NVFP4). 8 also keeps >=8x fill headroom at tile 64 so an imbalanced expert
+// segment never splits into an extra col-tile.
+static inline int ggml_cuda_moe_density_max() {
+    static const int d = []() -> int {
+        const char * s = getenv("LLAMA_MOE_DENSITY_MAX");
+        const int v = s ? atoi(s) : 0;
+        return v > 0 ? v : 8;
+    }();
+    return d;
+}
+
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int    id     = ggml_cuda_get_device();
+@@ -4076,25 +4114,53 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+     const int mmq_y = get_mmq_y_host(cc);
+ 
+-    // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+-    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+-    // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+-    // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+-    // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+-    // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+-    // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+-    // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+-    // per-expert density raises tile fill + occupancy with no extra weight reads (at
+-    // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+-    // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+-    // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+-    // off the ids path the cap never applies.
+    // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
+    // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
+    // sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, up to 128)
+    // in a single column-tile, i.e. it MAXIMIZES the tile (128 on Blackwell) for the aggregate
+    // batch. But the tile is then applied PER EXPERT, and at MoE decode the per-expert token
+    // density is tiny (top-k of many experts), so each expert's single 128-wide col-tile is mostly
+    // empty: the MMA accumulator tile is mmq_x-wide at compile time and burns throughput on the
+    // padding columns while the larger y-tile lowers occupancy. vLLM's fused-MoE does the opposite
+    // (a small per-expert BLOCK_SIZE_M). We reproduce that here, host-side only, by picking a
+    // SMALLER mmq_x when - and only when - the per-expert density is low:
+    //
+    //   ne_get_rows  = args.ncols_dst    = ne12 * n_expert_used  (total token-expert assignments)
+    //   n_experts    = args.nchannels_x  = ne02
+    //   n_active_est = min(n_experts, ne_get_rows)               (upper bound on active experts)
+    //   density      = ceil(ne_get_rows / n_active_est)          (avg tokens per active expert)
+    //
+    // Cap to the small tile (default 64) only when density <= density_max (default 8). 8 sits below
+    // every prefill-ubatch density and at or above every decode density for n_experts in [128,511]
+    // at the standard n_ubatch=512 (prefill 4096/n_experts, decode <=1024/n_experts), with >=8x fill
+    // headroom so a capped expert segment never splits a col-tile. Decode (per-expert density 4 at
+    // 256 experts, 8 at 128 experts @npl128) gets the fuller high-occupancy tile; the prefill ubatch
+    // (density 16 at 256 / 32 at 128 experts) stays ABOVE the threshold and keeps the big 128
+    // compute tile - so unlike the blunt global cap (LLAMA_MOE_MMQ_X / patch 0014) this is
+    // prefill-safe by construction. The selection only ever picks an already-compiled, granularity-
+    // and shared-memory-validated mmq_x that the loop below would consider for a smaller batch; no
+    // new kernel. Off the ids path (expert_bounds == nullptr) nothing changes => non-MoE mul_mat
+    // and the gated f16/bf16 host-loop fallback stay byte-identical to stock.
+    //   - LLAMA_MOE_MMQ_X=<n>   : manual blunt global cap, overrides the auto-select (patch 0014).
+    //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
+    //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
+     int mmq_x_lim = mmq_x_max;
+     if (args.expert_bounds != nullptr) {
+         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+         if (moe_cap > 0) {
+             const int cap = moe_cap < 8 ? 8 : moe_cap;
+             mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+        } else if (ggml_cuda_moe_auto_tile_enabled()) {
+            const int64_t ne_get_rows = args.ncols_dst;
+            const int64_t n_experts   = args.nchannels_x;
+            if (ne_get_rows > 0 && n_experts > 0) {
+                const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
+                const int64_t density  = (ne_get_rows + n_active - 1) / n_active;
+                const int     tile     = ggml_cuda_moe_decode_tile();
+                if (density <= (int64_t) ggml_cuda_moe_density_max() && tile < mmq_x_max) {
+                    mmq_x_lim = tile;
+                }
+            }
+         }
+     }
+ 
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index 15ae389..f219309 100644
+--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
+@@ -8575,6 +8575,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+     // gpt-oss issue with Vulkan mmq_id
+     test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
+ 
+    // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
+    // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
+    // m=768, k=2048) so this exercises the exact grouped FP4-MMA mmq kernel the model runs.
+    // Per-expert token density = n*n_used/n_mats = n/16; cover the decode band (density 1/4/8/16
+    // at n 16/64/128/256), ragged token counts (n 33/130/200: experts with 0/1/2 tokens, n not a
+    // multiple of the tile) where the tiny-M col-tiles change geometry and any masking can leak,
+    // and a prefill-density shape (n 512 => density 32) the auto-select must leave on the large
+    // 128 tile. n=128 (density 8) is where stock picks mmq_x=128 and the auto-select picks 64, so
+    // the op-test (CPU oracle vs CUDA, deterministic) is the bit-exact regression gate for P1: it
+    // must pass with the auto-select on (default) and with LLAMA_MOE_AUTO_TILE=0 (stock selection).
+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+        for (int n : {16, 33, 64, 128, 130, 200, 256, 512}) {
+            test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 128, 8, false, 768, n, 2048));
+        }
+    }
+
+     for (ggml_type type_a : all_types) {
+         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
+     }
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch
@@ -0,0 +1,205 @@
+From 0a2677c6e6c608f9c0ec657faa0ff04a03370aa6 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Wed, 24 Jun 2026 07:44:25 +0000
+Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
+ 0016, continuous-batch P1)
+
+Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
+decode-first token budget: the P1 of the token-granular continuous-batch
+scheduler scoped in CONTINUOUS_BATCH_SCHEDULER_SCOPE.md. This is a POLICY
+change only inside update_slots(): no new slot states, no batch-formation
+rewrite, zero libllama changes. llama-server already emits one unified
+mixed prefill+decode batch per step (Phase 1 appends every ready decode
+token unconditionally; Phase 2 fills prefill into the same batch); 0013
+already ships that mixed ubatch. 0016 only changes the COUNT of prefill
+tokens admitted per step.
+
+The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
+== D (the live decode load) is known there. Instead of 0013's constant
+LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets
+one long prompt monopolise the step), compute a dynamic budget:
+
+  T  = min(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_batch), floored at
+       n_ubatch (the vLLM max_num_batched_tokens analogue / ITL trade knob)
+  prefill_budget_step  = max(n_ubatch, T - D)   (leftover after decode,
+       auto-shrinks as decode load rises so the step never inflates past T)
+  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch
+       (the long_prefill_token_threshold analogue: one long prompt cannot
+       eat the whole leftover; LLAMA_PREFILL_CAP overrides)
+
+Phase 2's inner prompt-fill loop and outer admission break are bounded by
+prefill_budget_step (across slots) and a new per-slot slot_prompt_added
+counter (per-slot cap), instead of the static 0013 cap; the n_batch hard
+ceiling stays as the compute bound. Decode is structurally claimed first
+and never capped (Phase 1), so the decode-first guarantee is free.
+
+Why it supersedes 0013: 0013 needs a hand-picked constant (256 for dense)
+that is net-negative at low npl and costs MoE TTFT; the T - D budget is
+self-tuning across npl 8..128 and across dense vs MoE, holding the GB10
+decode ceiling (~161 dense / ~333 MoE tok/s @npl128) WITHOUT per-workload
+tuning while collapsing burst TTFT. Steady-state decode throughput is NOT
+lifted (that is the decode-kernel ceiling, scoped as P3); the P1 win is
+TTFT + tuning-free robustness + clean supersession of 0013.
+
+DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
+to stock. The degenerate T == n_batch case is byte-identical to stock/0013
+(the determinism oracle): the leftover max(n_ubatch, n_batch - D) and the
+n_batch per-slot cap both reach the existing `batch.n_tokens < n_batch`
+ceiling at the same point, so no new bound fires. The legacy
+LLAMA_PREFILL_BUDGET path is preserved exactly (honoured only when
+LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly subsumed. Orthogonal
+to LLAMA_KV_PAGED: pure scheduler policy, identical decisions paged on/off.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ tools/server/server-context.cpp | 107 +++++++++++++++++++++++++-------
+ 1 file changed, 85 insertions(+), 22 deletions(-)
+
+diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
+index 5d83b30..f7a114c 100644
+--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
+@@ -2723,24 +2723,78 @@ private:
+         int32_t n_batch  = llama_n_batch(ctx_tgt);
+         int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
+ 
+-        // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
+-        // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
+-        // tokens ingested per update_slots() step at n_batch only; with cont_batching the
+-        // sampled decode tokens of every generating slot are appended FIRST, then prompt
+-        // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
+-        // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
+-        // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
+-        // tokens added per step independently of n_batch, splitting a long prefill across
+-        // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
+-        // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
+-        // (this is a pure scheduler knob; works with paged off).
+-        int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+        // PAGED serving lever (patch 0016, supersedes 0013): dynamic decode-first
+        // per-step prefill-token budget (continuous-batch scheduler P1). llama-server
+        // already builds ONE mixed batch per update_slots() step: Phase 1 (just above)
+        // appended every generating slot's sampled token UNCONDITIONALLY, so at this point
+        // batch.n_tokens == D is the live decode load; Phase 2 (below) fills the remaining
+        // batch capacity with prompt tokens. Patch 0013 capped Phase 2 with a STATIC
+        // constant (LLAMA_PREFILL_BUDGET) that ignores D, needs per-workload tuning, and
+        // lets one long prompt monopolise the step.
+        //
+        // This computes a DYNAMIC budget instead, the vLLM v1 token-budget analogue:
+        // a single total per-step token budget T, decode claims its D tokens first
+        // (already in the batch), and prefill gets the leftover T - D distributed across
+        // waiting prompts with a per-slot chunk cap. As decode load D rises the prefill
+        // leftover auto-shrinks, so the step never inflates past T at any concurrency:
+        // the budget self-tunes across the npl range and across dense vs MoE without a
+        // hand-picked constant (the 161/333 tok/s GB10 decode ceiling is held tuning-free
+        // instead of via 0013's hand-tuned 256). Decode is structurally claimed first and
+        // never capped (Phase 1), so the decode-first guarantee is free here.
+        //
+        //   LLAMA_MAX_BATCH_TOKENS (T)  total per-step token budget (decode + prefill),
+        //                               default n_batch, clamped to [n_ubatch, n_batch] so
+        //                               the compute loop stays a single llama_decode and
+        //                               prefill keeps an n_ubatch floor of progress.
+        //   LLAMA_PREFILL_CAP           per-slot max prompt tokens per step (the
+        //                               long_prefill_token_threshold analogue), default
+        //                               min(T, ceil(0.04*n_ctx)) floored at n_ubatch, so
+        //                               one long prompt cannot eat the whole leftover.
+        //   LLAMA_PREFILL_BUDGET        legacy static cap (patch 0013); honoured ONLY when
+        //                               LLAMA_MAX_BATCH_TOKENS is unset, for back-compat.
+        //
+        // DEFAULT-OFF BYTE-IDENTICAL: with all three knobs unset, and in the degenerate
+        // T == n_batch case, behaviour is byte-identical to stock. At T == n_batch the
+        // dynamic leftover max(n_ubatch, n_batch - D) and the n_batch per-slot cap both
+        // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
+        // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
+        // scheduler policy, identical decisions with paged on or off.
+        const int32_t n_decode_in_batch = batch.n_tokens; // D: Phase 1 appended D decode tokens above
+        int32_t prefill_budget_step  = 0; // 0 = disabled (stock n_batch-only chunking)
+        int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
+         {
+-            const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
+-            if (env_pb) {
+            int32_t mbt = 0;
+            if (const char * env_mbt = getenv("LLAMA_MAX_BATCH_TOKENS")) {
+                mbt = atoi(env_mbt);
+            }
+            if (mbt > 0) {
+                // dynamic decode-first budget (P1): T clamped to [n_ubatch, n_batch]
+                int32_t T = std::min(n_batch, mbt);
+                T = std::max(T, n_ubatch);
+                // leftover after decode, floored at n_ubatch so prefill never fully starves
+                prefill_budget_step = std::max(n_ubatch, T - n_decode_in_batch);
+                // per-slot prompt-chunk cap (long_prefill_token_threshold analogue)
+                int32_t cap = 0;
+                if (const char * env_cap = getenv("LLAMA_PREFILL_CAP")) {
+                    cap = atoi(env_cap);
+                }
+                if (cap <= 0) {
+                    const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
+                    cap = std::min(T, std::max(n_ubatch, pct4));
+                }
+                cap = std::min(n_batch, std::max(n_ubatch, cap));
+                // at T == n_batch the leftover and cap both reach the n_batch ceiling
+                // together; pin the cap to n_batch so this case stays byte-identical
+                if (T >= n_batch) {
+                    cap = n_batch;
+                }
+                prefill_cap_per_slot = cap;
+            } else if (const char * env_pb = getenv("LLAMA_PREFILL_BUDGET")) {
+                // legacy static budget (patch 0013), kept for back-compat when the
+                // dynamic knob is unset: a constant per-step prefill cap, no per-slot cap
+                 const int v = atoi(env_pb);
+                 if (v > 0) {
+-                    n_prefill_budget = std::min(n_batch, std::max(1, v));
+                    prefill_budget_step = std::min(n_batch, std::max(1, v));
+                 }
+             }
+         }
+@@ -3181,11 +3235,18 @@ private:
+                     const int32_t n_before_user = slot.task->params.n_before_user;
+                     const bool n_before_user_known = n_before_user > 0;
+ 
+                    // (patch 0016) per-slot prompt tokens added this step, for the per-slot
+                    // chunk cap (resets each slot); n_batch stays the hard compute ceiling
+                    int32_t slot_prompt_added = 0;
+
+                     // add prompt tokens for processing in the current batch
+-                    // (patch 0013) also stop once the per-step prefill budget is spent, so a long
+-                    // prompt is split across more steps and leaves batch room for co-batched decode
+                    // (patch 0016) also stop once (a) the dynamic per-step prefill budget
+                    // (the T - D leftover) is spent across all slots, or (b) this slot's
+                    // per-slot chunk cap is hit, so a long prompt is split across more steps
+                    // and leaves batch room for co-batched decode of the other slots
+                     while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch &&
+-                           (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
+                           (prefill_budget_step  == 0 || n_prompt_budgeted < prefill_budget_step) &&
+                           (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
+                         // get next token to process
+                         llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
+                         if (cur_tok == LLAMA_TOKEN_NULL) {
+@@ -3211,7 +3272,8 @@ private:
+                         slot.prompt.tokens.push_back(cur_tok);
+ 
+                         slot.n_prompt_tokens_processed++;
+-                        n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
+                        n_prompt_budgeted++;  // (patch 0016) toward the dynamic per-step prefill budget
+                        slot_prompt_added++;  // (patch 0016) toward this slot's per-step chunk cap
+ 
+                         // stop the prompt batch exactly before the latest user input, so a checkpoint
+                         // can be created after the previous messages
+@@ -3321,9 +3383,10 @@ private:
+                     break;
+                 }
+ 
+-                // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
+-                // leaving the remaining batch capacity for co-batched decode of other slots
+-                if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+                // (patch 0016) stop admitting prompts once the dynamic per-step prefill
+                // budget (the T - D leftover) is spent, leaving the remaining batch
+                // capacity for co-batched decode of the other slots
+                if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
+                     break;
+                 }
+             }
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0017-fp4-gemm-decode-tile-tune.patch
@@ -0,0 +1,245 @@
+From 089f78d2a2c04465a566d499dbe0a67c008435a8 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Wed, 24 Jun 2026 19:56:05 +0200
+Subject: [PATCH] feat(paged): FP4 decode GEMM track-B P0 gate + default-off
+ occupancy instrumentation (patch 0017)
+
+Track B targets the dense NVFP4 weight GEMM (~59% of the GB10 decode step). This lands the P0
+bit-exact parity gate and the P1 occupancy levers (default-off / byte-identical) and records the
+honest P1 result: the cheap host/occupancy tuning does NOT lift decode_agg on GB10 (sm_121) - the
+kill-gate tripped - so nothing is enabled by default.
+
+P0 gate (tests/test-backend-ops.cpp): NVFP4/MXFP4 dense decode-shape MUL_MAT cases at the weight-
+row tiling boundary (m in {2048,1600,2050} = exact + ragged vs mmq_y 64/128, n in {32,128} = decode
+M, k=2048), so the bit-exact CPU-vs-CUDA oracle covers the mmq_y / min-blocks paths. Green at
+default and with every lever on: MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 fail.
+
+P1 levers (ggml/src/ggml-cuda/mmq.cuh), all default-off => default build byte-identical to stock:
+  - GGML_CUDA_FP4_MMQ_Y (default 128): type-aware get_mmq_y_host/device plumbing for an NVFP4
+    weight-row tile override. mmq_y is rigidly nwarps*tile_C::I (=8*16=128, the mmq.cuh
+    static_assert), so mmq_y<128 also needs nwarps-down (a warp-remap through the shared vec_dot/loader),
+    left as the P2 kernel change; the host/device plumbing is in place and inert.
+  - GGML_CUDA_FP4_MINBLOCKS (default 1): NVFP4-only __launch_bounds__ min-resident-CTAs lever
+    (register-cap the FP4-MMA kernel so >1 CTA co-resides) - the bounded occupancy probe.
+  - GGML_CUDA_FP4_DENSE_MMQ_X (env, default off): dense col-tile re-read occupancy diagnostic.
+
+Measured GB10 (llama-batched-bench -fa on -npp 128 -ntg 128 -npl 32,128), decode_agg (S_TG):
+  DENSE q36-27b-nvfp4 @npl128: P0 149.5 -> MINBLOCKS=2 147.9 (-1.1%) -> DENSE_MMQ_X=64 144.3
+    (-3.5%) -> =32 141.7 (-5.2%). Every occupancy probe regresses.
+  MoE q36-35b-a3b-nvfp4 @npl128: stock 336.3, MINBLOCKS=2 337.7 (+0.4%, noise), TILE16 324.0
+    (-3.7%), TILE8 316.6 (-5.9%). mmq_x-down regresses (reproduces patch 0015; GDN/BW-bound).
+
+nsys (kill-gate evidence): the decode FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s -> 3.025s
+(avg 608us -> 661us, +8.7% slower) under MINBLOCKS=2 - register-capping spills, so occupancy did
+not usefully rise. Verdict: the dense M=128 tile is already weight-read/one-read-optimal at
+mmq_x=128, NOT occupancy-starved via the cheap levers; the only untested lever is the structural
+mmq_y-down (nwarps=4 warp-remap), deferred to P2. Bit-exact gate holds throughout.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/src/ggml-cuda/mmq.cuh | 85 ++++++++++++++++++++++++++++++++++----
+ tests/test-backend-ops.cpp | 16 +++++++
+ 2 files changed, 92 insertions(+), 9 deletions(-)
+
+diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
+index 9718b12..b53e38a 100644
+--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
+@@ -140,7 +140,24 @@ static constexpr __device__ int get_mmq_x_max_device() {
+ #endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
+ }
+ 
+-static int get_mmq_y_host(const int cc) {
+// [paged patch 0017 / track B] Dense NVFP4 decode mmq_y (weight-row tile) override.
+// mmq_y tiles the N (weight-row) dimension of the FP4-MMA weight GEMM. Lowering it raises the
+// number of resident CTAs (smaller per-CTA shared footprint + smaller per-thread accumulator) to
+// hide LPDDR5x weight-load latency at the M=128 decode tile, WITHOUT re-reading weights: every
+// weight row lives in exactly one row-tile, so total weight traffic is unchanged (bandwidth-
+// neutral) - the dense-decode occupancy lever from FP4_GEMM_SCOPE_B.md s3/s4.1. mmq_y is a PURE
+// N-row tiling knob: the per-output reduction over K is identical for any mmq_y, so the result
+// stays BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4 decode shapes). Default 128 == exact
+// stock behaviour (a default build is byte-identical to stock); build -DGGML_CUDA_FP4_MMQ_Y=64
+// (or 96) to enable the tune. Applies ONLY to NVFP4 on Blackwell; every other type/arch untouched.
+#ifndef GGML_CUDA_FP4_MMQ_Y
+#define GGML_CUDA_FP4_MMQ_Y 128
+#endif
+
+static int get_mmq_y_host(const int cc, const ggml_type type = GGML_TYPE_COUNT) {
+    if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)) {
+        return GGML_CUDA_FP4_MMQ_Y;
+    }
+     return GGML_CUDA_CC_IS_AMD(cc) ? (GGML_CUDA_CC_IS_RDNA1(cc) ? 64 : 128) :
+         ((GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ? 128 : 64);
+ }
+@@ -154,7 +171,13 @@ if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) {
+     return MMQ_ITER_K;
+ }
+ 
+template <ggml_type type = GGML_TYPE_COUNT>
+ static constexpr __device__ int get_mmq_y_device() {
+#if defined(BLACKWELL_MMA_AVAILABLE)
+    if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MMQ_Y != 128) {
+        return GGML_CUDA_FP4_MMQ_Y;
+    }
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
+ #if defined(GGML_USE_HIP)
+ #if defined(RDNA1)
+     return 64;
+@@ -170,6 +193,28 @@ static constexpr __device__ int get_mmq_y_device() {
+ #endif // defined(GGML_USE_HIP)
+ }
+ 
+// [paged patch 0017 / track B] Dense NVFP4 decode occupancy lever: min resident CTAs per SM.
+// The FP4-MMA mul_mat_q is REGISTER-bound to 1 CTA/SM (__launch_bounds__(256,1) => ~255 regs/thread
+// => one resident block, the under-occupancy that strands the kernel at ~3% of FP4 peak at M=128).
+// Raising the __launch_bounds__ min-blocks operand register-caps the compiler so N CTAs co-reside,
+// hiding LPDDR5x weight-load latency by CTA-parallelism (the scope s4.1 occupancy goal) WITHOUT a
+// structural mmq_y/nwarps change and WITHOUT extra weight reads (each weight tile still read once).
+// Register allocation cannot change results => BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4).
+// Default 1 == exact stock behaviour (byte-identical); build -DGGML_CUDA_FP4_MINBLOCKS=2 to enable.
+// Applies ONLY to NVFP4 on Blackwell; every other type/arch keeps the stock min-blocks.
+#ifndef GGML_CUDA_FP4_MINBLOCKS
+#define GGML_CUDA_FP4_MINBLOCKS 1
+#endif
+template <ggml_type type = GGML_TYPE_COUNT>
+static constexpr __device__ int mmq_get_min_blocks_device(const int stock) {
+#if defined(BLACKWELL_MMA_AVAILABLE)
+    if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MINBLOCKS != 1) {
+        return GGML_CUDA_FP4_MINBLOCKS;
+    }
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
+    return stock;
+}
+
+ // Decouple shared memory tile sizes from WARP_SIZE to allow for different warp sizes.
+ // The K dimension of the tiles has either,
+ // 1*MMQ_TILE_NE_K==32 (always for TILE_Y_K) or 2*MMQ_TILE_NE_K==64 (typically for TILE_X_K),
+@@ -3454,7 +3499,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
+     constexpr int              warp_size  = ggml_cuda_get_physical_warp_size();
+     constexpr int              nwarps     = mmq_get_nwarps_device();
+     constexpr int              qk         = ggml_cuda_type_traits<type>::qk;
+-    constexpr int              mmq_y      = get_mmq_y_device();
+    constexpr int              mmq_y      = get_mmq_y_device<type>();
+     constexpr load_tiles_mmq_t load_tiles = mmq_type_traits<mmq_x, mmq_y, need_check, type>::load_tiles;
+ 
+     extern __shared__ int data_mul_mat_q[];
+@@ -3531,13 +3576,13 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
+ template <ggml_type type, int mmq_x, bool need_check>
+ #if defined(GGML_USE_HIP)
+ #if defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
+-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
+ #endif // defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
+ #else
+ #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
+-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 1)
+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(1))
+ #else
+-    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
+    __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
+ #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
+ #endif // defined(GGML_USE_HIP)
+ static __global__ void mul_mat_q(
+@@ -3558,7 +3603,7 @@ static __global__ void mul_mat_q(
+     constexpr int warp_size = ggml_cuda_get_physical_warp_size();
+ 
+     constexpr int qk    = ggml_cuda_type_traits<type>::qk;
+-    constexpr int mmq_y = get_mmq_y_device();
+    constexpr int mmq_y = get_mmq_y_device<type>();
+ 
+     const uint32_t nty = (nrows_x + mmq_y - 1) / mmq_y; // Number of tiles y
+ 
+@@ -3790,7 +3835,7 @@ static __global__ void mul_mat_q_stream_k_fixup(
+         float * __restrict__ tmp_last_tile, const uint3 blocks_per_ne00, const int nrows_x, const int ncols_dst,
+         const int stride_col_dst, const uint3 nchannels_y, const int stride_channel_dst, const uint3 nsamples_y,
+         const int stride_sample_dst, const uint3 ntx) {
+-    constexpr int mmq_y           = get_mmq_y_device();
+    constexpr int mmq_y           = get_mmq_y_device<type>();
+     constexpr int qk              = ggml_cuda_type_traits<type>::qk;
+     constexpr int ITER_K          = get_iter_k(type);
+     constexpr int blocks_per_iter = ITER_K / qk;
+@@ -3947,7 +3992,7 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
+     const int nsm = ggml_cuda_info().devices[id].nsm;
+     const int warp_size = ggml_cuda_info().devices[id].warp_size;
+     const int nwarps = mmq_get_nwarps_host(cc, warp_size);
+-    const int mmq_y = get_mmq_y_host(cc);
+    const int mmq_y = get_mmq_y_host(cc, type);
+ 
+     const dim3 block_dims(warp_size, nwarps, 1);
+ 
+@@ -4103,6 +4148,21 @@ static inline int ggml_cuda_moe_density_max() {
+     return d;
+ }
+ 
+// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off).
+// GGML_CUDA_FP4_DENSE_MMQ_X=<n> caps the dense (non-MoE) NVFP4 col-tile to <n>, splitting the M=128
+// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost
+// in the BW-bound regime) but multiplies resident CTAs. This is the scope s4.1 A/B probe: if
+// decode_agg RISES with cap=64 despite the 2x weight read, occupancy is badly broken (the kernel is
+// compute/occupancy-bound, so mmq_y-down / min-blocks has large upside); if it FALLS, the tile is
+// already bandwidth-saturated and the occupancy ceiling is lower. Unset/<=0 => stock selection.
+static inline int ggml_cuda_fp4_dense_mmq_x_cap() {
+    static const int c = []() -> int {
+        const char * s = getenv("GGML_CUDA_FP4_DENSE_MMQ_X");
+        return s ? atoi(s) : 0;
+    }();
+    return c;
+}
+
+ template <ggml_type type>
+ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
+     const int    id     = ggml_cuda_get_device();
+@@ -4112,7 +4172,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     const int nwarps    = mmq_get_nwarps_host(cc, warp_size);
+ 
+     const int mmq_x_max = get_mmq_x_max_host(cc);
+-    const int mmq_y = get_mmq_y_host(cc);
+    const int mmq_y = get_mmq_y_host(cc, type);
+ 
+     // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
+     // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
+@@ -4145,6 +4205,13 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
+     //   - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
+     //   - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
+     int mmq_x_lim = mmq_x_max;
+    if (args.expert_bounds == nullptr && type == GGML_TYPE_NVFP4) {
+        // dense NVFP4 decode mmq_x re-read occupancy diagnostic (see ggml_cuda_fp4_dense_mmq_x_cap).
+        const int cap = ggml_cuda_fp4_dense_mmq_x_cap();
+        if (cap > 0 && cap < mmq_x_max) {
+            mmq_x_lim = cap < 8 ? 8 : cap;
+        }
+    }
+     if (args.expert_bounds != nullptr) {
+         const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+         if (moe_cap > 0) {
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index f219309..291c275 100644
+--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
+@@ -8591,6 +8591,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
+         }
+     }
+ 
+    // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate.
+    // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the
+    // NVFP4 decode path to raise resident-CTA occupancy. mmq_y is a pure N-row tiling knob, so a
+    // smaller mmq_y must stay BIT-EXACT (identical per-output reduction over K) - this gate proves
+    // it. m = weight rows (N, tiled by mmq_y): 2048 (exact at mmq_y 64 & 128), 1600 (ragged vs 128),
+    // 2050 (ragged vs both 64 & 128 -> exercises the need_check last-row-tile at both). n = decode
+    // token count M = 32 and 128 (the scope decode shapes, tiled by mmq_x). k = 2048 hidden. Must
+    // pass with the default build (mmq_y=128) AND a mmq_y=64 build, CUDA-vs-CPU oracle, bit-exact.
+    for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+        for (int64_t m : {2048, 1600, 2050}) {
+            for (int64_t n : {32, 128}) {
+                test_cases.emplace_back(new test_mul_mat(type_a, GGML_TYPE_F32, m, n, 2048, {1, 1}, {1, 1}));
+            }
+        }
+    }
+
+     for (ggml_type type_a : all_types) {
+         test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
+     }
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch
@@ -0,0 +1,349 @@
+From 17f16e8f6d8dbc689d5151c44759792d683c957b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 00:44:13 +0200
+Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state
+ write-back (patch 0018)
+
+Decode on the Qwen3.6 hybrid-SSM models (arch qwen35, 48 gated-DeltaNet :
+16 full-attention layers) was dominated by recurrent-state plumbing, not the
+FP4 GEMM. Per SSM layer per step the fused gated_delta_net op wrote its new
+recurrent state into graph scratch, then a separate ggml_cpy persisted it into
+the recurrent-state cache. nsys attributed 18.9% of decode GPU time to that
+~225 MB/copy D2D memcpy (1584 ops, 356 GB over the A2 decompose window).
+
+This mirrors vLLM's fused_recurrent_gated_delta_rule (state kept in place):
+ggml_gated_delta_net_inplace writes the final recurrent state directly into the
+active sequences' contiguous cache slot (at kv_head), removing the copy-back. The
+op output then carries only the attention scores; the SSM arithmetic is
+unchanged (bit-identical greedy output vs the copy-back baseline).
+
+- new op builder ggml_gated_delta_net_inplace (src[6] = state_dst cache view)
+- CUDA + CPU honor src[6]; final-state (K==1, keep_rs off) write redirected there
+- delta-net-base build_recurrent_attn uses it on the fused decode/prefill path,
+  dropping the ggml_cpy; rollback (n_rs_seq>0) path unchanged
+
+Measured (q36-27b-nvfp4, decode_agg S_TG, npp128 ntg128, -fa on, paged on):
+  npl 32 : 113.74 -> 136.39 t/s (+19.9 percent)
+  npl 128: 146.23 -> 180.53 t/s (+23.5 percent, = predicted copy-removal ceiling)
+MoE q36-35b-a3b-nvfp4: npl128 313.36 -> 372.62 t/s (+18.9 percent).
+nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share
+(391 @128) 37.4 -> 46.2 percent. get_rows state gather (now 18.8 percent) is the
+next lever.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ ggml/include/ggml.h                   | 14 ++++++
+ ggml/src/ggml-cpu/ops.cpp             | 13 ++++-
+ ggml/src/ggml-cuda/gated_delta_net.cu | 39 ++++++++++-----
+ ggml/src/ggml.c                       | 68 +++++++++++++++++++++++++++
+ src/models/delta-net-base.cpp         | 30 ++++++++++++
+ 5 files changed, 152 insertions(+), 12 deletions(-)
+
+diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
+index 823f5a9..4e7ab32 100644
+--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
+@@ -2579,6 +2579,20 @@ extern "C" {
+             struct ggml_tensor  * state,
+             int64_t               K);
+ 
+    // same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+    // in place into state_dst (a view into the recurrent-state cache) instead of being appended to
+    // the op output, eliminating the per-step state copy-back during decode. state_dst must be a
+    // contiguous [S_v*S_v*H, n_seqs] view (per-seq stride == dense state size).
+    GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * q,
+            struct ggml_tensor  * k,
+            struct ggml_tensor  * v,
+            struct ggml_tensor  * g,
+            struct ggml_tensor  * beta,
+            struct ggml_tensor  * state,
+            struct ggml_tensor  * state_dst);
+
+     // custom operators
+ 
+     typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
+diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
+index 63c07a2..9457add 100644
+--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
+@@ -10600,6 +10600,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+     ggml_tensor * src_g     = dst->src[3];
+     ggml_tensor * src_beta  = dst->src[4];
+     ggml_tensor * src_state = dst->src[5];
+    ggml_tensor * src_state_dst = dst->src[6]; // optional in-place final-state write-back target
+ 
+     const int64_t S_v      = src_v->ne[0];
+     const int64_t H        = src_v->ne[1];
+@@ -10660,6 +10661,16 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+ 
+     const float scale = 1.0f / sqrtf((float) S_v);
+ 
+    // when src_state_dst is provided (in-place decode write-back) the final state is written
+    // directly into the persistent cache view, removing the separate state copy-back node.
+    float * inplace_state_base = nullptr;
+    if (src_state_dst != nullptr) {
+        GGML_ASSERT(K == 1);
+        GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+        GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+        inplace_state_base = (float *) src_state_dst->data;
+    }
+
+     for (int64_t ir = ir0; ir < ir1; ++ir) {
+         const int64_t iv1 = ir % H; // head_index
+         const int64_t iv3 = ir / H; // sequence
+@@ -10674,7 +10685,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+         // For K>1, work in scratch and copy out per-token when the slot is in range.
+         float * s_out = (K > 1)
+             ? state_work
+-            : state_out_base + (iv3 * H + iv1) * S_v * S_v;
+            : (inplace_state_base ? inplace_state_base : state_out_base) + (iv3 * H + iv1) * S_v * S_v;
+ 
+         // copy input state into the working buffer and operate in-place
+         // state layout [S_v, S_v, H, n_seqs]: seq iv3 starts at iv3 * state_seq_stride.
+diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
+index a547360..61a2b91 100644
+--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
+@@ -25,7 +25,8 @@ gated_delta_net_cuda(const float * q,
+                                      const uint3   neqk1_magic,
+                                      const uint3   rq3_magic,
+                                      float         scale,
+-                                     int           K) {
+                                     int           K,
+                                     float *       state_dst) {
+     const uint32_t h_idx    = blockIdx.x;
+     const uint32_t sequence = blockIdx.y;
+     // each warp owns one column, using warp-level primitives to reduce across rows
+@@ -37,7 +38,10 @@ gated_delta_net_cuda(const float * q,
+ 
+     const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
+     float *       attn_data        = dst;
+-    float *       state            = dst + attn_score_elems;
+    // when state_dst is provided (in-place decode write-back) the final recurrent state is written
+    // directly into the persistent cache view instead of being appended to the op output; this
+    // eliminates the per-layer per-step D2D state copy-back. Only used when keep_rs_t == false.
+    float *       state            = (state_dst != nullptr) ? state_dst : (dst + attn_score_elems);
+ 
+     // input state holds s0 only: [S_v, S_v, H, n_seqs] — seq stride is D = H * S_v * S_v.
+     // output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before.
+@@ -171,7 +175,7 @@ template <bool KDA, bool keep_rs_t>
+ static void launch_gated_delta_net(
+         const float * q_d, const float * k_d, const float * v_d,
+         const float * g_d, const float * b_d, const float * s_d,
+-        float * dst_d,
+        float * dst_d, float * state_dst_d,
+         int64_t S_v,   int64_t H, int64_t n_tokens, int64_t n_seqs,
+         int64_t sq1,   int64_t sq2, int64_t sq3,
+         int64_t sv1,   int64_t sv2, int64_t sv3,
+@@ -195,26 +199,26 @@ static void launch_gated_delta_net(
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         case 32:
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         case 64: {
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         }
+         case 128: {
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+             break;
+         }
+         default:
+@@ -230,6 +234,7 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     ggml_tensor * src_g     = dst->src[3];
+     ggml_tensor * src_beta  = dst->src[4];
+     ggml_tensor * src_state = dst->src[5];
+    ggml_tensor * src_state_dst = dst->src[6]; // optional in-place state write-back target
+ 
+     GGML_TENSOR_LOCALS(int64_t, neq, src_q, ne);
+     GGML_TENSOR_LOCALS(size_t , nbq, src_q, nb);
+@@ -260,6 +265,15 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     const float * s_d   = (const float *) src_state->data;
+     float *       dst_d = (float *) dst->data;
+ 
+    float * state_dst_d = nullptr;
+    if (src_state_dst != nullptr) {
+        // in-place final-state cache view: per-seq stride must be the dense state size D = S_v*S_v*H
+        GGML_ASSERT(src_state_dst->type == GGML_TYPE_F32);
+        GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+        GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+        state_dst_d = (float *) src_state_dst->data;
+    }
+
+     GGML_ASSERT(ggml_is_contiguous_rows(src_q));
+     GGML_ASSERT(ggml_is_contiguous_rows(src_k));
+     GGML_ASSERT(ggml_is_contiguous_rows(src_v));
+@@ -288,23 +302,26 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     const int K = ggml_get_op_params_i32(dst, 0);
+     const bool keep_rs = K > 1;
+ 
+    // in-place write-back is only valid for the single-snapshot (final-state) case
+    GGML_ASSERT(state_dst_d == nullptr || !keep_rs);
+
+     if (kda) {
+         if (keep_rs) {
+-            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         } else {
+-            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         }
+     } else {
+         if (keep_rs) {
+-            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         } else {
+-            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         }
+diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
+index adbe52b..b8d34bf 100644
+--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
+@@ -6285,6 +6285,74 @@ struct ggml_tensor * ggml_gated_delta_net(
+     return result;
+ }
+ 
+// ggml_gated_delta_net_inplace
+//
+// Same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+// in place into `state_dst` (a view into the persistent recurrent-state cache) instead of being
+// appended to the op output. This removes the per-layer per-step D2D state copy-back during decode.
+// The op output holds ONLY the attention scores; the state region is still allocated (unused) so
+// the attention-output view layout is identical to ggml_gated_delta_net.
+struct ggml_tensor * ggml_gated_delta_net_inplace(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * q,
+        struct ggml_tensor  * k,
+        struct ggml_tensor  * v,
+        struct ggml_tensor  * g,
+        struct ggml_tensor  * beta,
+        struct ggml_tensor  * state,
+        struct ggml_tensor  * state_dst) {
+    GGML_ASSERT(ggml_is_contiguous_rows(q));
+    GGML_ASSERT(ggml_is_contiguous_rows(k));
+    GGML_ASSERT(ggml_is_contiguous_rows(v));
+    GGML_ASSERT(ggml_is_contiguous(g));
+    GGML_ASSERT(ggml_is_contiguous(beta));
+    GGML_ASSERT(ggml_is_contiguous(state));
+
+    GGML_ASSERT(q->type == GGML_TYPE_F32);
+    GGML_ASSERT(k->type == GGML_TYPE_F32);
+    GGML_ASSERT(v->type == GGML_TYPE_F32);
+    GGML_ASSERT(g->type == GGML_TYPE_F32);
+    GGML_ASSERT(beta->type == GGML_TYPE_F32);
+    GGML_ASSERT(state->type == GGML_TYPE_F32);
+    GGML_ASSERT(state_dst != NULL);
+    GGML_ASSERT(state_dst->type == GGML_TYPE_F32);
+
+    const int64_t S_v      = v->ne[0];
+    const int64_t H        = v->ne[1];
+    const int64_t n_tokens = v->ne[2];
+    const int64_t n_seqs   = v->ne[3];
+
+    GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
+    GGML_ASSERT(beta->ne[0] == 1);
+
+    GGML_ASSERT(state->ne[0] == S_v);
+    GGML_ASSERT(state->ne[1] == S_v);
+    GGML_ASSERT(state->ne[2] == H);
+    GGML_ASSERT(state->ne[3] == n_seqs);
+
+    // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
+    GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
+    GGML_ASSERT(state_dst->ne[1] >= n_seqs);
+    GGML_ASSERT(state_dst->nb[0] == sizeof(float));
+
+    const int64_t state_rows = S_v * n_seqs; // K == 1
+    const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
+    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+    ggml_set_op_params_i32(result, 0, 1); // K == 1
+
+    result->op     = GGML_OP_GATED_DELTA_NET;
+    result->src[0] = q;
+    result->src[1] = k;
+    result->src[2] = v;
+    result->src[3] = g;
+    result->src[4] = beta;
+    result->src[5] = state;
+    result->src[6] = state_dst;
+
+    return result;
+}
+
+ ////////////////////////////////////////////////////////////////////////////////
+ 
+ struct ggml_hash_set ggml_hash_set_new(size_t size) {
+diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
+index ad9ce77..26a718b 100644
+--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
+@@ -546,6 +546,36 @@ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
+     const bool keep = cparams.n_rs_seq > 0;
+ 
+     if (!keep) {
+        const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
+
+        if (fused) {
+            // In-place state write-back: the fused gated-DeltaNet op writes the new recurrent state
+            // directly into the persistent cache slot for the active sequences (a contiguous block
+            // at kv_head), eliminating the per-layer per-step ~full-state D2D copy-back that
+            // dominated decode. The op output then carries only the attention scores.
+            ggml_tensor * state_dst = ggml_view_2d(ctx0, ssm_states_all, hparams.n_embd_s(), n_seqs,
+                    ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
+
+            ggml_tensor * result = ggml_gated_delta_net_inplace(ctx0, q, k, v, g, b, s, state_dst);
+            if (n_seq_tokens == 1) {
+                cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
+            } else {
+                cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
+            }
+
+            ggml_tensor * output = ggml_view_4d(ctx0, result,
+                    S_v, H_v, n_seq_tokens, n_seqs,
+                    ggml_row_size(result->type, S_v),
+                    ggml_row_size(result->type, S_v * H_v),
+                    ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
+            cb(output, "attn_output", il);
+
+            // the state write is a side effect of the op; pull the op into the graph via the output
+            ggml_build_forward_expand(gf, output);
+
+            return output;
+        }
+
+         auto attn_out = build_delta_net(q, k, v, g, b, s, il);
+         ggml_tensor * output    = attn_out.first;
+         ggml_tensor * new_state = attn_out.second;
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
+++ b/backend/cpp/llama-cpp/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
@@ -0,0 +1,678 @@
+From 46d7dd80bbce7f3c1dbf9363d6527c8c9b687a6b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto <mudler@localai.io>
+Date: Thu, 25 Jun 2026 01:45:02 +0200
+Subject: [PATCH] feat(paged): qwen35 SSM decode fused recurrent-state gather
+ (patch 0019)
+
+Step 2 of the SSM decode-throughput work. After Step 1 (in-place state
+write-back, patch 0018) the largest non-GEMM decode bucket was the recurrent-
+state get_rows gather (18.8% of decode GPU time): build_rs materialized each
+sequence's prior state into a contiguous scratch via ggml_get_rows before the
+gated-DeltaNet op read it.
+
+This eliminates that materialization, mirroring ggml_ssm_scan's ids source.
+ggml_gated_delta_net_inplace_ids takes the FULL recurrent-state cache plus the
+s_copy ids (src[5] = full cache, src[7] = ids, op_param[1] = rs_head) and reads
+each sequence's prior state directly from cache[ids[seq]]. Combined with Step 1's
+in-place write the op now reads AND writes the cache directly: no recurrent-state
+materialization at all. build_recurrent_attn feeds the full cache + ids through
+the build_rs get_state_rows lambda exactly like mamba-base, keeping the rs_zero
+clear and the extra-states copy around the op.
+
+Race-free by construction on CUDA. In-place write plus an ids read of the same
+cache is only safe when read slot == write slot; s_copy is identity
+(rs_head + s) for stable continuing sequences (the whole AR decode path) but can
+remap on reorder or rs_zero (e.g. multiple new sequences in one prefill ubatch).
+The recurrence kernel handles both per (seq, head) block on device: identity
+sequences read s0 in place from the destination slot (the kernel loads all of s0
+into registers before writing, so reading and writing the same slot is safe),
+and non-identity sequences read from a disjoint scratch that a small gather
+kernel copies from cache[ids[seq]] first, so the recurrence never reads a slot
+another block writes. The CPU op mirrors this (host identity check + a serial
+gather in the dispatcher). ids stays a device pointer (read only in-kernel; it is
+device-resident at op-execute time). Bit-identical to the get_rows path in every
+case.
+
+- new builder ggml_gated_delta_net_inplace_ids; CUDA gather kernel
+  (gdn_gather_nonident) + per-block read-base select in gated_delta_net_cuda;
+  CPU identity guard + serial gather fallback in the dispatcher
+- delta-net-base build_recurrent_attn gains a gather-free overload; qwen35 and
+  qwen35moe drop the pre-gather. qwen3next, kimi-linear, the non-fused path and
+  the rollback (n_rs_seq > 0) path are unchanged.
+
+Measured (decode_agg S_TG in t/s, npp128 ntg128, -fa on, paged on, fusion off):
+  dense q36-27b-nvfp4 : npl 32  137.64 -> 170.68 (+24.0%)
+                        npl 128 186.25 -> 256.57 (+37.8%, 47.6% -> 65.6% of vLLM's 391 t/s)
+  MoE   q36-35b-a3b-nvfp4: npl 32  299.68 -> 366.69 (+22.4%)
+                           npl 128 409.30 -> 553.63 (+35.3%)
+Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build
+(dense model text md5 match, MoE byte-identical, Step-2 dense run1 == run2). nsys
+k_get_rows_float bucket 18.8% -> 0.7%; the new gdn_gather_nonident kernel
+is 1.7% (no-op at decode, median ~1.2 us). The residual decode gap to vLLM
+is now the FP4 GEMM (~48% of decode), a separate kernel track.
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
+---
+ SSM_DECODE_FIX_RESULTS.md             | 86 +++++++++++++++++++++++++++
+ ggml/include/ggml.h                   | 17 ++++++
+ ggml/src/ggml-cpu/ops.cpp             | 49 ++++++++++++++-
+ ggml/src/ggml-cuda/gated_delta_net.cu | 85 ++++++++++++++++++++++----
+ ggml/src/ggml.c                       | 76 +++++++++++++++++++++++
+ src/models/delta-net-base.cpp         | 63 ++++++++++++++++++++
+ src/models/models.h                   | 13 ++++
+ src/models/qwen35.cpp                 |  6 +-
+ src/models/qwen35moe.cpp              |  6 +-
+ 9 files changed, 378 insertions(+), 23 deletions(-)
+
+diff --git a/SSM_DECODE_FIX_RESULTS.md b/SSM_DECODE_FIX_RESULTS.md
+index 2e7c8c2..77879e4 100644
+--- a/SSM_DECODE_FIX_RESULTS.md
+++ b/SSM_DECODE_FIX_RESULTS.md
+@@ -96,3 +96,89 @@ precedent (`ssm_scan` `ids`) and is the clear next move. The residual gap to vLL
+ after both SSM steps is the FP4 GEMM (~37% of decode), which is a separate kernel
+ track. No paged/graph/block-table change can move decode on this model (full
+ attention is 0.4% of decode).
+
+## STEP 2 (patch 0019): fuse the recurrent-state gather into the op
+
+After Step 1 the largest non-GEMM decode bucket was the recurrent-state
+`get_rows` gather (18.8% of decode GPU time): `build_rs` materialized each
+sequence's prior state into a contiguous scratch via `ggml_get_rows` before the
+gated-DeltaNet op read it. Step 2 eliminates that materialization, mirroring
+`ggml_ssm_scan`'s `ids` source.
+
+`ggml_gated_delta_net_inplace_ids` takes the FULL recurrent-state cache plus the
+`s_copy` ids (`src[5]` = full cache `[S_v, S_v, H, n_rs_slots]`, `src[7]` = ids,
+`op_param[1]` = `rs_head`) and reads each sequence's prior state directly from
+`cache[ids[seq]]`. Combined with Step 1's in-place write the op now reads AND
+writes the cache directly: no recurrent-state materialization at all. The
+`build_recurrent_attn` fused path feeds the full cache and ids through the
+`build_rs` `get_state_rows` lambda exactly like `mamba-base.cpp`, keeping the
+`rs_zero` clear and the extra-states copy around the op.
+
+### Race-free by construction (CUDA)
+
+In-place write plus an ids read of the same cache is only safe when the read slot
+equals the write slot. `s_copy(s) = cells[s + head].src0`, which is identity
+(`rs_head + s`) for stable continuing sequences (the entire AR decode path) but
+can remap on sequence reorder or `rs_zero` (e.g. multiple new sequences in one
+prefill ubatch). The kernel handles both per (seq, head) block on device:
+
+- identity sequences read `s0` in place from the destination slot `state_dst`
+  (the kernel loads all of `s0` into registers before it writes the new state,
+  so reading and writing the same slot is race-free) -- no materialization;
+- non-identity sequences read from a disjoint scratch that a small
+  `gdn_gather_nonident_kernel` copies from `cache[ids[seq]]` first, so the
+  recurrence never reads a slot another block writes.
+
+`ids` stays a device pointer (dereferenced only in the kernels; the input is
+device-resident at op-execute time, so a host read segfaults). The CPU op
+mirrors the same logic (host identity check + a serial gather in the dispatcher
+for the non-identity case). The math is unchanged, so the result is bit-identical
+to the `get_rows` path in every case.
+
+Gated to the `qwen35` / `qwen35moe` fused decode/prefill path; `qwen3next`,
+`kimi-linear`, the non-fused path and the rollback (`n_rs_seq > 0`) path are
+untouched (they keep the materialized-state overload).
+
+### Measured decode_agg (`S_TG` t/s, npp 128, ntg 128, -fa on, paged on, fusion off)
+
+Dense `q36-27b-nvfp4`:
+
+| npl | Step 1 (baseline) | Step 2   | delta   | % of vLLM (391 t/s @ npl 128) |
+|-----|-------------------|----------|---------|-------------------------------|
+| 32  | 137.64            | 170.68   | +24.0%  | -                             |
+| 128 | 186.25            | 256.57   | +37.8%  | 47.6% -> 65.6%                |
+
+The npl-128 result (256.57 t/s) beats the predicted ~247 t/s Step-2 ceiling.
+
+MoE `q36-35b-a3b-nvfp4`:
+
+| npl | Step 1 (baseline) | Step 2   | delta   |
+|-----|-------------------|----------|---------|
+| 32  | 299.68            | 366.69   | +22.4%  |
+| 128 | 409.30            | 553.63   | +35.3%  |
+
+(Step-1 baselines re-measured in the same session; the brief's reference figures
+were 136 / 180 dense and 279 / 373 MoE.)
+
+### Bit-exact gate
+
+Greedy (`--temp 0 --seed 1`) `llama-completion` output (256 tokens, paged on,
+fusion off) vs the Step-1 build:
+
+- dense `q36-27b-nvfp4`: model text byte-identical (md5 match);
+- MoE `q36-35b-a3b-nvfp4`: byte-identical;
+- Step-2 dense run1 == run2 (deterministic, no race).
+
+### nsys confirmation (npp 128, ntg 24, npl 128, fusion off, eager)
+
+The recurrent-state gather bucket collapsed:
+
+| kernel                     | Step 1   | Step 2                                  |
+|----------------------------|----------|-----------------------------------------|
+| `k_get_rows_float`         | 18.8%    | 0.7% (residual: embeddings / conv-state)|
+| `gdn_gather_nonident`      | -        | 1.7% (no-op at decode, median ~1.2 us)  |
+| `gated_delta_net_cuda`     | 26.0%    | 22.5%                                   |
+| FP4 GEMM family            | ~37.5%   | ~48% (now the dominant residual)        |
+
+The SSM state gather is effectively eliminated. The residual decode gap to vLLM
+is now the FP4 GEMM (~48% of decode), a separate kernel track.
+diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
+index 4e7ab32..951dd21 100644
+--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
+@@ -2593,6 +2593,23 @@ extern "C" {
+             struct ggml_tensor  * state,
+             struct ggml_tensor  * state_dst);
+ 
+    // Step 2: same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read
+    // directly from the full state cache via per-sequence indices (ids == s_copy), mirroring
+    // ggml_ssm_scan, instead of from a materialized ggml_get_rows gather. `state` is the FULL cache
+    // [S_v, S_v, H, n_rs_slots]; `ids` are the per-seq source slots; `rs_head` is the destination
+    // base slot. Eliminates the recurrent-state gather on the decode path.
+    GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
+            struct ggml_context * ctx,
+            struct ggml_tensor  * q,
+            struct ggml_tensor  * k,
+            struct ggml_tensor  * v,
+            struct ggml_tensor  * g,
+            struct ggml_tensor  * beta,
+            struct ggml_tensor  * state,
+            struct ggml_tensor  * state_dst,
+            struct ggml_tensor  * ids,
+            int                   rs_head);
+
+     // custom operators
+ 
+     typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
+diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
+index 9457add..b6a1976 100644
+--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
+@@ -10633,7 +10633,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+     const int64_t K = ggml_get_op_params_i32(dst, 0);
+     GGML_ASSERT(K >= 1);
+     // per-seq stride in floats (seq s starts at state + s * seq_stride)
+-    const int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
+    int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
+ 
+     const int64_t per_thread = S_v + (K > 1 ? S_v * S_v : 0);
+     const int ith = params->ith;
+@@ -10654,6 +10654,26 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
+ 
+     const float * state_in_base = (const float *)src_state->data;
+ 
+    // Step 2: fused recurrent-state gather (ids == s_copy in src[7]). Read the prior state directly
+    // from the full cache at cache[ids[seq]] instead of from a materialized gather. For the identity
+    // decode case the prior state is the in-place destination block [rs_head, rs_head+n_seqs);
+    // otherwise the dispatcher has gathered cache[ids[seq]] into the (unused) output-state scratch
+    // region. Bit-identical to the get_rows path.
+    ggml_tensor * src_ids = dst->src[7];
+    if (src_ids != nullptr) {
+        const int64_t   D       = S_v * S_v * H;
+        const int32_t   rs_head = ggml_get_op_params_i32(dst, 1);
+        const int32_t * ids     = (const int32_t *) src_ids->data;
+        bool identity = true;
+        for (int64_t s = 0; s < n_seqs; ++s) {
+            if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
+        }
+        state_seq_stride = D;
+        state_in_base = identity
+            ? (const float *) src_state->data + (int64_t) rs_head * D
+            : (const float *) state_out_base; // gathered by the dispatcher (non-identity)
+    }
+
+   //const int64_t rq1 = nev1 / neq1;
+   //const int64_t rk1 = nev1 / nek1;
+     const int64_t rq3 = nev3 / neq3;
+@@ -10777,6 +10797,33 @@ static void ggml_compute_forward_gated_delta_net_f32(
+ 
+     if (ith == 0) {
+       ggml_threadpool_chunk_set(params->threadpool, nth);
+
+      // Step 2: non-identity ids fallback -- serially gather each sequence's prior state from
+      // cache[ids[seq]] into the (otherwise unused) output-state scratch region before the parallel
+      // recurrence, so the in-place write never aliases another sequence's read.
+      ggml_tensor * src_ids = dst->src[7];
+      if (src_ids != nullptr) {
+          const ggml_tensor * src_state = dst->src[5];
+          const int64_t S_v      = V->ne[0];
+          const int64_t H        = V->ne[1];
+          const int64_t n_tokens = V->ne[2];
+          const int64_t n_seqs   = V->ne[3];
+          const int64_t D        = S_v * S_v * H;
+          const int32_t   rs_head = ggml_get_op_params_i32(dst, 1);
+          const int32_t * ids     = (const int32_t *) src_ids->data;
+          bool identity = true;
+          for (int64_t s = 0; s < n_seqs; ++s) {
+              if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
+          }
+          if (!identity) {
+              const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
+              const float * cache   = (const float *) src_state->data;
+              float *       scratch = (float *) dst->data + attn_score_elems;
+              for (int64_t s = 0; s < n_seqs; ++s) {
+                  memcpy(scratch + s * D, cache + (int64_t) ids[s] * D, D * sizeof(float));
+              }
+          }
+      }
+     }
+ 
+     ggml_barrier(params->threadpool);
+diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
+index 61a2b91..86d5e2a 100644
+--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
+@@ -1,6 +1,34 @@
+ #include "gated_delta_net.cuh"
+ #include "ggml-cuda/common.cuh"
+ 
+// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
+// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+// destination slot by the recurrence kernel and are skipped here. One block per sequence.
+__global__ void gdn_gather_nonident_kernel(const float * cache, const int32_t * ids, int rs_head,
+                                           float * scratch, int64_t D, int n_seqs) {
+    const int s = blockIdx.x;
+    if (s >= n_seqs) {
+        return;
+    }
+    const int r = ids[s];
+    if (r == rs_head + s) {
+        return; // identity: prior state already lives in the in-place destination slot
+    }
+    const float * src = cache   + (int64_t) r * D;
+    float *       dst = scratch + (int64_t) s * D;
+    for (int64_t i = threadIdx.x; i < D; i += blockDim.x) {
+        dst[i] = src[i];
+    }
+}
+
+static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * ids, int rs_head,
+                                          float * scratch, int64_t D, int64_t n_seqs, cudaStream_t stream) {
+    if (n_seqs <= 0) {
+        return;
+    }
+    gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
+}
+
+ template <int S_v, bool KDA, bool keep_rs_t>
+ __global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
+ gated_delta_net_cuda(const float * q,
+@@ -26,7 +54,9 @@ gated_delta_net_cuda(const float * q,
+                                      const uint3   rq3_magic,
+                                      float         scale,
+                                      int           K,
+-                                     float *       state_dst) {
+                                     float *       state_dst,
+                                     const int32_t * ids,
+                                     int           rs_head) {
+     const uint32_t h_idx    = blockIdx.x;
+     const uint32_t sequence = blockIdx.y;
+     // each warp owns one column, using warp-level primitives to reduce across rows
+@@ -48,7 +78,15 @@ gated_delta_net_cuda(const float * q,
+     const int64_t state_in_offset      = sequence * H * S_v * S_v + h_idx * S_v * S_v;
+     const int64_t state_out_offset     = (sequence * H + h_idx) * S_v * S_v;
+     state += state_out_offset;
+-    curr_state += state_in_offset + col * S_v;
+    // Step 2: select the prior-state read base per sequence. For the ids variant, identity
+    // sequences (ids[seq] == rs_head + seq) read s0 directly from the in-place destination slot
+    // state_dst (no materialization); non-identity sequences read from the pre-gathered scratch
+    // (curr_state). state_in_offset == state_out_offset, so both bases use the same per-(seq,head)
+    // offset. The whole s0 is loaded into registers before the new state is written, so reading and
+    // writing the same slot per block (identity) is race-free.
+    const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
+        ? state_dst : curr_state;
+    read_state += state_in_offset + col * S_v;
+     attn_data += (sequence * n_tokens * H + h_idx) * S_v;
+ 
+     constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
+@@ -61,7 +99,7 @@ gated_delta_net_cuda(const float * q,
+ #pragma unroll
+     for (int r = 0; r < rows_per_lane; r++) {
+         const int i = r * warp_size + lane;
+-        s_shard[r]  = curr_state[i];
+        s_shard[r]  = read_state[i];
+     }
+ 
+     for (int t = 0; t < n_tokens; t++) {
+@@ -176,6 +214,7 @@ static void launch_gated_delta_net(
+         const float * q_d, const float * k_d, const float * v_d,
+         const float * g_d, const float * b_d, const float * s_d,
+         float * dst_d, float * state_dst_d,
+        const int32_t * ids_d, int rs_head,
+         int64_t S_v,   int64_t H, int64_t n_tokens, int64_t n_seqs,
+         int64_t sq1,   int64_t sq2, int64_t sq3,
+         int64_t sv1,   int64_t sv2, int64_t sv3,
+@@ -199,26 +238,26 @@ static void launch_gated_delta_net(
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+             break;
+         case 32:
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+             break;
+         case 64: {
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+             break;
+         }
+         case 128: {
+             ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
+                 q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+                 n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+-                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+                sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+             break;
+         }
+         default:
+@@ -262,7 +301,6 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+     const float * g_d = (const float *) src_g->data;
+     const float * b_d = (const float *) src_beta->data;
+ 
+-    const float * s_d   = (const float *) src_state->data;
+     float *       dst_d = (float *) dst->data;
+ 
+     float * state_dst_d = nullptr;
+@@ -274,6 +312,29 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+         state_dst_d = (float *) src_state_dst->data;
+     }
+ 
+    // Step 2: fused recurrent-state gather (src[7] = ids == s_copy). Read the prior state directly
+    // from the full cache via ids instead of from a materialized ggml_get_rows gather. The recurrence
+    // kernel reads identity sequences (ids[seq] == rs_head + seq) in place from state_dst (no
+    // materialization at all); any non-identity sequence (reorder / rs_zero remap) is gathered here
+    // into a disjoint scratch that the kernel reads instead. The gather writes a disjoint buffer and
+    // the recurrence never reads a slot another block writes, so it is race-free and bit-identical to
+    // the get_rows path. ids stays a DEVICE pointer (dereferenced only inside the kernels).
+    ggml_tensor * src_ids = dst->src[7];
+    const float *   s_d     = (const float *) src_state->data;
+    const int32_t * ids_d   = nullptr;
+    int             rs_head = 0;
+    ggml_cuda_pool_alloc<float> ids_state_scratch(ctx.pool());
+    if (src_ids != nullptr) {
+        GGML_ASSERT(state_dst_d != nullptr);
+        GGML_ASSERT(src_ids->type == GGML_TYPE_I32);
+        rs_head = ggml_get_op_params_i32(dst, 1);
+        ids_d   = (const int32_t *) src_ids->data;
+        const int64_t D = S_v * S_v * H;
+        float * scratch = ids_state_scratch.alloc((size_t) D * n_seqs);
+        ggml_cuda_gdn_gather_nonident(s_d, ids_d, rs_head, scratch, D, n_seqs, ctx.stream());
+        s_d = scratch;
+    }
+
+     GGML_ASSERT(ggml_is_contiguous_rows(src_q));
+     GGML_ASSERT(ggml_is_contiguous_rows(src_k));
+     GGML_ASSERT(ggml_is_contiguous_rows(src_v));
+@@ -307,21 +368,21 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
+ 
+     if (kda) {
+         if (keep_rs) {
+-            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+            launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         } else {
+-            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+            launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         }
+     } else {
+         if (keep_rs) {
+-            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+            launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         } else {
+-            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+            launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
+                 S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+                 sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
+         }
+diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
+index b8d34bf..1762037 100644
+--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
+@@ -6353,6 +6353,82 @@ struct ggml_tensor * ggml_gated_delta_net_inplace(
+     return result;
+ }
+ 
+// ggml_gated_delta_net_inplace_ids
+//
+// Same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read directly
+// from the FULL state cache `state` ([S_v, S_v, H, n_rs_slots]) at cache[ids[seq]] (mirroring
+// ggml_ssm_scan's ids source) instead of from a materialized ggml_get_rows gather. `rs_head` is the
+// destination base slot, used by the backend to detect the common identity case (ids[s] == rs_head
+// + s), where the prior state already lives in the in-place destination slots.
+struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
+        struct ggml_context * ctx,
+        struct ggml_tensor  * q,
+        struct ggml_tensor  * k,
+        struct ggml_tensor  * v,
+        struct ggml_tensor  * g,
+        struct ggml_tensor  * beta,
+        struct ggml_tensor  * state,
+        struct ggml_tensor  * state_dst,
+        struct ggml_tensor  * ids,
+        int                   rs_head) {
+    GGML_ASSERT(ggml_is_contiguous_rows(q));
+    GGML_ASSERT(ggml_is_contiguous_rows(k));
+    GGML_ASSERT(ggml_is_contiguous_rows(v));
+    GGML_ASSERT(ggml_is_contiguous(g));
+    GGML_ASSERT(ggml_is_contiguous(beta));
+    GGML_ASSERT(ggml_is_contiguous(state));
+
+    GGML_ASSERT(q->type    == GGML_TYPE_F32);
+    GGML_ASSERT(k->type    == GGML_TYPE_F32);
+    GGML_ASSERT(v->type    == GGML_TYPE_F32);
+    GGML_ASSERT(g->type    == GGML_TYPE_F32);
+    GGML_ASSERT(beta->type == GGML_TYPE_F32);
+    GGML_ASSERT(state->type == GGML_TYPE_F32);
+    GGML_ASSERT(state_dst != NULL && state_dst->type == GGML_TYPE_F32);
+    GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32);
+
+    const int64_t S_v      = v->ne[0];
+    const int64_t H        = v->ne[1];
+    const int64_t n_tokens = v->ne[2];
+    const int64_t n_seqs   = v->ne[3];
+
+    GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
+    GGML_ASSERT(beta->ne[0] == 1);
+
+    // state is the FULL recurrent-state cache: [S_v, S_v, H, n_rs_slots], n_rs_slots >= n_seqs
+    GGML_ASSERT(state->ne[0] == S_v);
+    GGML_ASSERT(state->ne[1] == S_v);
+    GGML_ASSERT(state->ne[2] == H);
+    GGML_ASSERT(state->ne[3] >= n_seqs);
+
+    // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
+    GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
+    GGML_ASSERT(state_dst->ne[1] >= n_seqs);
+    GGML_ASSERT(state_dst->nb[0] == sizeof(float));
+
+    // ids: per-seq source slot into the full cache (s_copy_main)
+    GGML_ASSERT(ids->ne[0] >= n_seqs);
+
+    const int64_t state_rows = S_v * n_seqs; // K == 1
+    const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
+    struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+    ggml_set_op_params_i32(result, 0, 1);       // K == 1
+    ggml_set_op_params_i32(result, 1, rs_head); // destination base slot (for the ids identity check)
+
+    result->op     = GGML_OP_GATED_DELTA_NET;
+    result->src[0] = q;
+    result->src[1] = k;
+    result->src[2] = v;
+    result->src[3] = g;
+    result->src[4] = beta;
+    result->src[5] = state;     // FULL cache (read via ids)
+    result->src[6] = state_dst; // in-place final-state write-back target
+    result->src[7] = ids;       // per-seq source slots (s_copy)
+
+    return result;
+}
+
+ ////////////////////////////////////////////////////////////////////////////////
+ 
+ struct ggml_hash_set ggml_hash_set_new(size_t size) {
+diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
+index 26a718b..194e611 100644
+--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
+@@ -524,6 +524,69 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
+     return conv_input;
+ }
+ 
+// Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
+// gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
+// ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
+// and rollback paths fall back to materializing the prior state and delegating below.
+ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
+        llm_graph_input_rs * inp,
+        ggml_tensor *        ssm_states_all,
+        ggml_tensor *        q,
+        ggml_tensor *        k,
+        ggml_tensor *        v,
+        ggml_tensor *        g,
+        ggml_tensor *        b,
+        int                  il) {
+    const auto * mctx_cur = inp->mctx;
+    const auto   kv_head  = mctx_cur->get_head();
+
+    const int64_t S_v          = v->ne[0];
+    const int64_t H_v          = v->ne[1];
+    const int64_t n_seqs       = v->ne[3];
+    const int64_t n_seq_tokens = q->ne[2];
+
+    const bool keep  = cparams.n_rs_seq > 0;
+    const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
+
+    if (!keep && fused) {
+        // build_rs feeds the FULL state cache + the s_copy ids into the op (via the get_state_rows
+        // lambda, exactly like mamba-base's ggml_ssm_scan) and still performs the rs_zero clear and
+        // the extra-states copy around it. The op reads curr_state from cache[ids[seq]] and writes
+        // the final state in place at kv_head; no recurrent-state materialization at all.
+        auto get_state_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * {
+            ggml_tensor * cache4d = ggml_reshape_4d(ctx, states, S_v, S_v, H_v, states->ne[1]);
+            ggml_tensor * state_dst = ggml_view_2d(ctx, ssm_states_all, hparams.n_embd_s(), n_seqs,
+                    ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
+            return ggml_gated_delta_net_inplace_ids(ctx, q, k, v, g, b, cache4d, state_dst, ids, (int) kv_head);
+        };
+
+        ggml_tensor * result = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs, get_state_op);
+        if (n_seq_tokens == 1) {
+            cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
+        } else {
+            cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
+        }
+
+        ggml_tensor * output = ggml_view_4d(ctx0, result,
+                S_v, H_v, n_seq_tokens, n_seqs,
+                ggml_row_size(result->type, S_v),
+                ggml_row_size(result->type, S_v * H_v),
+                ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
+        cb(output, "attn_output", il);
+
+        // the state write is a side effect of the op; pull the op into the graph via the output
+        ggml_build_forward_expand(gf, output);
+
+        return output;
+    }
+
+    // non-fused / rollback: materialize the prior state via gather and delegate to the
+    // state-taking overload (its fused !keep branch performs the Step-1 in-place write).
+    ggml_tensor * s = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+    s = ggml_reshape_4d(ctx0, s, S_v, S_v, H_v, n_seqs);
+    return build_recurrent_attn(inp, ssm_states_all, q, k, v, g, b, s, il);
+}
+
+ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
+         llm_graph_input_rs * inp,
+         ggml_tensor *        ssm_states_all,
+diff --git a/src/models/models.h b/src/models/models.h
+index 2ac8415..98b89e9 100644
+--- a/src/models/models.h
+++ b/src/models/models.h
+@@ -88,6 +88,19 @@ struct llm_build_delta_net_base : public llm_graph_context {
+             ggml_tensor *        b,
+             ggml_tensor *        s,
+             int                  il);
+
+    // Step 2: gather-free variant. Reads the prior recurrent state directly from the full cache via
+    // the s_copy ids (no ggml_get_rows materialization) on the fused decode/prefill path, and
+    // delegates to the state-taking overload for the non-fused and rollback paths.
+    ggml_tensor * build_recurrent_attn(
+            llm_graph_input_rs * inp,
+            ggml_tensor *        ssm_states_all,
+            ggml_tensor *        q,
+            ggml_tensor *        k,
+            ggml_tensor *        v,
+            ggml_tensor *        g,
+            ggml_tensor *        b,
+            int                  il);
+ };
+ 
+ struct llm_build_rwkv6_base : public llm_graph_context {
+diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
+index 6783d98..0be3247 100644
+--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
+@@ -385,10 +385,6 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
+ 
+     ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ 
+-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
+-    cb(state, "state_predelta", il);
+-
+     ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+     cb(conv_output_proper, "conv_output_raw", il);
+ 
+@@ -445,7 +441,7 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
+     cb(k_conv, "k_conv_predelta", il);
+     cb(v_conv, "v_conv_predelta", il);
+ 
+-    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
+    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
+ 
+     // z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
+     ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
+diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
+index eb5e9a4..2995f04 100644
+--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
+@@ -409,10 +409,6 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
+ 
+     ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ 
+-    ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+-    state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
+-    cb(state, "state_predelta", il);
+-
+     ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+     cb(conv_output_proper, "conv_output_raw", il);
+ 
+@@ -469,7 +465,7 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
+     cb(k_conv, "k_conv_predelta", il);
+     cb(v_conv, "v_conv_predelta", il);
+ 
+-    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
+    ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
+ 
+     // z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
+     ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
+-- 
+2.43.0
+
--- a/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md
+++ b/backend/cpp/llama-cpp/patches/paged/A2_CUDAGRAPH_DECODE.md
@@ -0,0 +1,347 @@
+# A.2 - CUDA-graphing the paged decode: measured lever + gap diagnosis
+
+Phase 1 (measure, do not punt). DGX GB10 (sm_121), dev tree `~/llama-paged-dev`
+HEAD 089f78d (patch 0017), `build-cuda`. Model `q36-27b-nvfp4.gguf` (dense),
+harness `llama-batched-bench`, fusion held OFF (`LLAMA_FUSE_NVFP4_QUANT=0`) for a
+clean stock-kernel baseline. `decode_agg` = the `S_TG t/s` column.
+
+## TL;DR verdict
+
+CUDA-graphing the paged decode is **NOT a real throughput lever** (ceiling well
+under 1%). The steady decode step is **GPU-compute-bound: 99.4-99.5% GPU-busy**.
+Total GPU idle is ~0.5-0.6% of the step, split into within-step launch gaps
+(0.37%, the only thing CUDA graphs remove) and a between-step host-loop gap
+(0.24%, one ~2 ms gap per step). Graphs already engage on the default paged
+decode and do collapse the launch gaps (0.37% -> 0.11%), but the GPU stays
+99.4-99.5% busy either way, so decode_agg is unchanged. The 2.6x gap to vLLM
+(148 vs 391) lives in the per-step GPU **kernel work** (FP4 GEMM + attention at
+batch 128), not in launch overhead or the host loop.
+
+The premise that "the paged decode runs eager (graphs reused=0)" did not survive
+measurement: at the benchmarked context the default paged decode captures and
+replays graphs exactly like stock non-paged. Two measurement traps (below)
+explain the earlier "reused=0 / gap-bound" reading.
+
+## Method note: a graph-enable trap that was corrected
+
+`GGML_CUDA_DISABLE_GRAPHS` is read with `getenv(...) != nullptr`
+(`ggml/src/ggml-cuda/common.cuh:1234`), so setting it to an **empty** string
+still disables graphs. A first 4-cell pass that used
+`GGML_CUDA_DISABLE_GRAPHS=""` for the "graphs ON" cells therefore ran graphs OFF
+in all four cells (an OFF-vs-OFF comparison). The numbers below ("v2") unset the
+variable with `env -u` for the ON cells. The `-lv 99` probe is unaffected (it
+never set the variable).
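+
+A minimal sketch of the trap, assuming a hypothetical helper `env_flag_is_set`
+(not a ggml function) that mirrors the `getenv(...) != nullptr` test. Presence
+alone counts, so an empty value still reads as set; only removing the variable
+(what `env -u` does) clears it. `setenv` / `unsetenv` are POSIX.
+
+```cpp
+#include <cstdlib>
+
+// Hypothetical helper: presence-only check, like `getenv(...) != nullptr`.
+static bool env_flag_is_set(const char * name) {
+    return std::getenv(name) != nullptr;
+}
+
+// Returns true only if an empty value counts as set and unsetting clears it.
+static bool demo_empty_value_still_set() {
+    const char * name = "GGML_CUDA_DISABLE_GRAPHS";
+    setenv(name, "", 1);                        // like VAR="" on the command line
+    const bool empty_counts = env_flag_is_set(name);
+    unsetenv(name);                             // like `env -u VAR`
+    const bool unset_clears = !env_flag_is_set(name);
+    return empty_counts && unset_clears;
+}
+```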
+
+## Step 1 - the 4-cell decode_agg table (corrected, graphs genuinely enabled)
+
+npp 128, ntg 128, npl 32 and 128, c 40960, b/ub 2048, fa on. `S_TG t/s`:
+
+| cell             | npl 32  | npl 128 |
+|------------------|---------|---------|
+| stock_graphon    | 116.47  | 148.41  |
+| stock_graphoff   | 115.17  | 148.21  |
+| paged_graphon    | 116.21  | 148.60  |
+| paged_graphoff   | 114.62  | 147.65  |
+
+ON vs OFF (the graph win):
+
+| config | npl 32 | npl 128 |
+|--------|--------|---------|
+| stock  | +1.13% | +0.13%  |
+| paged  | +1.39% | +0.64%  |
+
+- (a) Does STOCK get a graph win? Essentially no: +0.13% at npl 128, +1.13% at
+  npl 32 (small-batch, where per-kernel launch overhead is relatively larger).
+  All within run-to-run noise (~1% at npl 32, ~0.2% at npl 128).
+- (b) Does PAGED get a graph win? Same picture: +0.64% / +1.39%. Paged is NOT
+  eager at this config (see Step 2); it captures graphs like stock.
+- (c) LEVER SIZE (proxy = stock graph win, now genuinely measured): +0.13% at
+  npl 128, +1.1% at npl 32. Negligible vs the 2.6x (=+164%) gap to vLLM.
+
+All four cells sit at ~148 (npl 128) / ~115 (npl 32) within ~1%. The ~148 wall is
+shared by stock and paged; it is not paged-specific. Calibration cross-check
+(paged ON, ntg 64): 147.64, matching the reference 148-149.
+
+## Step 2 - why the "eager" premise is wrong, and what actually mutates
+
+CUDA-graph state machine (`ggml_backend_cuda_graph_compute` in
+`ggml/src/ggml-cuda/ggml-cuda.cu`): warmup completes after a step whose node
+properties did not change vs the previous step; any later change logs
+`CUDA graph warmup reset` and reverts to eager until stable again.
+`ggml_cuda_graph_update_required` memcmps every node's full `ggml_tensor` plus
+each src's `data` ptr / `ne` / `nb`.
+
+`-lv 99` probe, short context (npp 64, ntg 32, ctx <= 96):
+- stock:  `warmup complete` x2, `warmup reset` x0.
+- paged:  `warmup complete` x2, `warmup reset` x0.
+
+Both capture and then replay silently. The `CUDA Graph id N reused` line stays 0
+for both because llama rebuilds the cgraph each ubatch (new `cgraph->uid`), so
+the uid fast-path never fires; the graph is still replayed via the
+`instance != nullptr` path, which logs nothing. **"reused=0" is a false negative,
+not evidence of eager execution.** (Trap #1.)
+
+Cadence probe (npp 200, ntg 320, npl 4, ctx 200->520, crosses the 256 and 512
+token boundaries), counts over ~320 decode steps:
+
+| path                          | complete | reset | interpretation                |
+|-------------------------------|----------|-------|-------------------------------|
+| paged in-kernel (default)     | 10       | 8     | resets only at 256-boundaries |
+| paged gather (KV_PAGED_GATHER)| 0        | 0     | never captures -> pure eager  |
+| stock non-paged               | 10       | 8     | identical 256-cadence         |
+
+The 8 resets cluster at the two boundary crossings (timestamps ~9.9 s and ~34 s),
+not per-step. The default paged decode is therefore captured for ~97% of steps,
+re-warming only every ~256 tokens, with the **same cadence as stock**.
+
+What mutates (the block-table / gather input):
+- in-kernel decode (default): the block-table graph input
+  `idx = ggml_new_tensor_2d(ctx0, I32, n_view, n_stream)` with
+  `n_view = GGML_PAD(n_gather, 256)` (`src/paged-attn.cpp:199,213`). Its `ne[0]`
+  steps 256 -> 512 -> 768 only when the context crosses a 256-token boundary. The
+  kq_mask input (ne0 = n_kv, also padded to 256) steps in lockstep. So the
+  property change is per-256-tokens, not per-step.
+- gather fallback (`LLAMA_KV_PAGED_GATHER=1`, transposed-V, or prefill): the
+  index input `idx = ggml_new_tensor_2d(ctx0, I32, n_gather, n_stream)`
+  (`src/paged-attn.cpp:106`) has `ne[0] = n_gather` (UNPADDED), which grows every
+  step (the unit's own comment, `src/paged-attn.cpp:28-30`: "n_gather grows every
+  step"). That changes a node property every step, warmup never completes, and
+  the path runs pure eager. This is the only "graphs reused=0" path, and it is
+  not the default decode path.
+
+`LLAMA_KV_PAGED_DEBUG` dump at ctx 201 (first 2 decode calls, identical across
+the pair): `in-kernel decode n_stream=4 n_kv=256 n_gather=201` -> block-table
+`ne[0] = GGML_PAD(201,256) = 256`, stable until n_gather crosses 256.
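+
+A small sketch of why the padded length is stable between boundaries, assuming
+`GGML_PAD(x, n)` rounds `x` up to the next multiple of a power-of-two `n`
+(`pad_to` is an illustrative stand-in, not the ggml macro itself):
+
+```cpp
+#include <cstdint>
+
+// Illustrative stand-in for the rounding assumed above: round x up to a
+// multiple of n, where n is a power of two.
+static int64_t pad_to(int64_t x, int64_t n) {
+    return (x + n - 1) & ~(n - 1);
+}
+```
+
+Under that assumption `pad_to(201, 256)` and `pad_to(256, 256)` both give 256,
+and the value first changes at `pad_to(257, 256)`, which gives 512.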
+
+## Step 3 - where the step time goes (nsys), and a second trap
+
+npl 128, ntg 24, ctx 56 (< 256, so the ON run stays captured after warmup).
+Idle split by gap size: within-step launch gaps < 1 ms, between-step host
+gaps >= 1 ms. Steady window = 40%-97% of the trace span (excludes model load /
+graph reserve / prefill one-offs).
+
+Trap #2: `nsys --trace=cuda` does NOT emit the kernels INSIDE a replayed CUDA
+graph into `cuda_gpu_trace` by default. The graphs-ON trace had only 15,279 GPU
+rows vs 84,946 for the identical OFF workload and reported a bogus 0.3% GPU-busy.
+Re-profiling the ON case with `--cuda-graph-trace=node` restored all 84,946 rows
+and 99.5% busy. **Any "decode is idle/gap-bound" reading taken from a graphs-ON
+nsys trace without `--cuda-graph-trace=node` is artifactually idle-inflated** -
+the likely source of the earlier "freed GPU time became idle gaps" conclusion.
+
+Reliable steady-state numbers:
+
+| trace                          | GPU rows | busy   | within-step idle | between-step idle | host gap/step |
+|--------------------------------|----------|--------|------------------|-------------------|---------------|
+| OFF (eager)                    | 84,946   | 99.4%  | 0.37%            | 0.24%             | ~2.0 ms       |
+| ON (captured, node-trace)      | 84,946   | 99.5%  | 0.11%            | 0.38%             | ~1.9 ms       |
+
+- CUDA graphs replay (cudaGraphLaunch=46) and collapse the launch path: ON has
+  ~15k kernel launches/run vs OFF ~80k (cudaLaunchKernel 6,024 vs 31,764, plus
+  cudaLaunchKernelExC 9,049 vs 48,165). That cuts within-step launch idle from
+  0.37% to 0.11%.
+- But the GPU is 99.4-99.5% busy in both, so decode_agg is unchanged.
+- Between-step host idle is one ~2 ms gap per decode step (the 128-way sample +
+  update_slots + batch build), 0.24-0.38% of the ~896 ms step.
+
+Per-step decomposition at npl 128: ~896 ms/step, of which ~890 ms is GPU kernel
+compute, ~2 ms host-loop gap, ~3 ms (eager) / ~1 ms (captured) launch gaps.
+
+## The load-bearing question, answered
+
+Within-step or between-step? **Neither is large.** The steady decode is 99.4%
+GPU-busy; the entire idle budget is ~0.6% of the step. CUDA graphs already remove
+the within-step launch fraction (0.37% -> 0.11%), and the between-step host gap is
+~2 ms/step (0.24%). A host-loop rewrite has no large gap to reclaim either;
+the host loop is currently **hidden under GPU compute** (the GPU stays
+busy while the host syncs/schedules). It would only become a lever once the
+kernels are fast enough to drop GPU-busy below the host time, i.e. it is a
+second-order floor, not the present bottleneck.
+
+## Verdict
+
+1. CUDA-graphing the paged decode is not the lever. Graphs already engage on the
+   default decode; capturing reduces within-step launch idle from 0.37% to 0.11%
+   but leaves the GPU 99.4-99.5% busy, so decode_agg moves by ~0% (measured
+   +0.1% to +0.6% at npl 128, +1.1% to +1.4% at npl 32, all within noise).
+2. The between-step host loop is not the present lever either (0.24%, ~2 ms/step,
+   hidden under GPU compute). It is the candidate floor only after the kernels
+   speed up.
+3. The decode is GPU-compute-bound at this NVFP4 fusion-OFF baseline. The 2.6x
+   gap to vLLM is in the per-step GPU kernel work (FP4 GEMM + attention at batch
+   128). That, not graphs and not the host loop, is the throughput lever.
+4. Corrected premises: paged is not perpetually eager (it captures with a
+   256-token reset cadence identical to stock); "graphs reused=0" was a uid
+   fast-path false negative; and a graphs-ON nsys trace under-counts GPU-busy
+   unless `--cuda-graph-trace=node` is set.
+
+No code patch in Phase 1 (graphs are not the lever, so there is no paged
+graph-capture patch to land). Evidence: `~/bench/a2_4cell_v2/`, `~/bench/a2_probe`,
+`~/bench/a2_probe2`, `~/bench/a2_nsys/*.nsys-rep` on the DGX.
+
+# Phase 2 - the real decode lever, located (per-kernel decomposition)
+
+Phase 1 ended on "decode is GPU-compute-bound; the 2.6x gap to vLLM lives in the
+per-step GPU kernel work (FP4 GEMM + attention at batch 128)." Phase 2 measured
+that per-step GPU work directly - per kernel and per memcpy, on the Phase-1 nsys
+`.sqlite` reps - and the "FP4 GEMM + attention" attribution does not survive the
+measurement. Two corrections, then the lever.
+
+The conditional Phase 2 fix (make the paged decode graph-capturable) is moot:
+Phase 1 already showed the default paged decode captures, and the fresh re-check
+below reconfirms the graph win is noise. Neither Phase 2 branch (within-step graph
+fix / between-step host loop) is the lever; the lever is a third thing, measured
+here.
+
+## Fresh re-confirmation: graphs are not the lever
+
+Independent run (npl128, ntg32, paged, fusion off), not reusing Phase 1's table:
+
+| paged decode  | S_TG t/s | vs vLLM 391 |
+|---------------|----------|-------------|
+| graphs ON     | 146.03   | 37.3%       |
+| graphs OFF    | 144.90   | 37.1%       |
+
+0.78%, within noise - same verdict as Phase 1's 4-cell. The ON nsys rep is also
+99.5% busy with the same ~3267 ms of memcpy as OFF: graphs capture the memcpy
+nodes too, so they cannot remove either the copies or the compute.
+
+## Correction 1: the model is a hybrid SSM, not a plain transformer
+
+`q36-27b-nvfp4.gguf` has `general.architecture = qwen35` with
+`qwen35.ssm.{conv_kernel,state_size,group_count,time_step_rank,inner_size}`. The
+decode-window kernel cadence (per step, ~19.8 steps in the window) is 48
+`gated_delta_net_cuda` + 48 `ssm_conv_f32` vs 16 `flash_attn_tile`, i.e. **48
+gated-DeltaNet linear-attention layers : 16 full-attention layers** (a 3:1
+hybrid, Qwen3-Next family). Paged attention only touches the 16 full-attention
+layers.
+
+## Correction 2: the 99.4% "busy" is ~19% D2D memcpy, not compute
+
+Interval-union sweep over the steady decode window (last 17 s of the npl128/ntg24
+OFF rep; single CUDA stream; running-max-end so it is overlap-correct):
+
+| activity set           | GPU busy | idle  |
+|------------------------|----------|-------|
+| kernels only           | 80.2%    | 19.8% |
+| kernels + memcpy (all) | 99.4%    | 0.6%  |
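+
+The sweep can be sketched as below. This is an assumed reconstruction, not the
+script behind the table: `Interval` and `union_busy_ns` are illustrative names,
+and the inputs stand for the (start, end) timestamps of the kernel and memcpy
+rows. Tracking the running maximum end keeps overlapping rows from being
+counted twice.
+
+```cpp
+#include <algorithm>
+#include <cassert>
+#include <cstdint>
+#include <limits>
+#include <vector>
+
+// One GPU activity row (kernel or memcpy); timestamps in nanoseconds.
+struct Interval {
+    int64_t start_ns;
+    int64_t end_ns;
+};
+
+// Length of the union of all rows: time during which at least one is active.
+static int64_t union_busy_ns(std::vector<Interval> rows) {
+    std::sort(rows.begin(), rows.end(),
+              [](const Interval & a, const Interval & b) { return a.start_ns < b.start_ns; });
+
+    int64_t busy    = 0;
+    int64_t max_end = std::numeric_limits<int64_t>::min(); // running max of end times
+    for (const Interval & r : rows) {
+        assert(r.start_ns <= r.end_ns);
+        if (r.end_ns > max_end) {
+            // Count only the part that extends past everything seen so far.
+            busy += r.end_ns - std::max(r.start_ns, max_end);
+            max_end = r.end_ns;
+        }
+    }
+    return busy;
+}
+```
+
+Busy share is the returned total divided by the window length; idle is the
+remainder.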
+
+The 969 inter-kernel gaps (>=1 ms, ~48/step) that drop kernels-only to 80% are
+filled by **D2D memcpy: 1584 copies/run (~80/step), ~230 MB each, ~2 ms each,
+356 GB moved in 17 s**. At batch 128 a ~230 MB block is the gated-DeltaNet
+recurrent state; these are the per-SSM-layer state copies. (HtoD copies = the
+paged block-table/index upload: 731/run but only 3 ms total, negligible; DtoH
+47 ms.) Phase 1's `cuda_gpu_trace`-based 99.4% counted these memcpys as "busy"
+and lumped them into "GPU kernel compute" - they are memory movement, and they
+are addressable.
+
+## Decode GPU-time decomposition (% of kernel+memcpy busy)
+
+OFF/eager rep, steady window. `/step` = instances per decode step.
+
+| share | activity                          | /step | role                          |
+|-------|-----------------------------------|-------|-------------------------------|
+| 23.4% | gated_delta_net_cuda              | 48    | linear-attn recurrence        |
+| 21.9% | k_get_rows_float                  | 97    | SSM state / conv-state gather |
+| 18.9% | MEMCPY DtoD                       | 80    | SSM recurrent-state copy      |
+| 15.5% | mul_mat_vec_q (nvfp4, ncols=1)    | 48    | FP4 GEMV                      |
+| 10.4% | mul_mat_q (nvfp4)                 | 352   | FP4 GEMM                      |
+|  1.9% | quantize_mmq_nvfp4                | 448   | act requant for MMQ           |
+|  1.0% | concat_cont                       | 48    | SSM state glue                |
+|  0.8% | ssm_conv_f32                      | 48    | SSM short conv                |
+|  0.7% | unary_gated_op silu               | 112   | SSM gating                    |
+|  0.4% | flash_attn_tile/_ext              | 16    | FULL attention (paged)        |
+
+Grouped:
+- gated-DeltaNet / SSM machinery (recurrence + get_rows gather + DtoD state copy
+  + conv + gating glue): **~67% of decode**.
+- FP4 matmul (GEMV + GEMM + requant + stream-k fixup): **~28%**.
+- Full attention - everything paged attention optimizes: **~0.4%**.
+
+## Verdict and scope of the real lever
+
+1. CUDA graphs: not the lever (Phase 1, re-confirmed: +0.78%, noise). They capture
+   the memcpy too, so they cannot touch the copies or the compute.
+2. Host loop: not the lever (true host idle in the union is 0.24%, ~41 ms/17 s).
+3. FP4 GEMM: secondary, ~28%. Consistent with Track B P2a (making the FP4 GEMM 26%
+   faster left decode_agg flat) - it was never the long pole.
+4. Paged / full attention: ~0.4% of decode. **No paged-attention change (graphs,
+   block-table stabilization, gather rewrite) can move decode_agg on this model**:
+   it optimizes under half a percent of the step. This is the structural reason
+   A.2, and the paged-decode track generally, cannot close the vLLM gap on
+   q36-27b: the model barely uses the path being optimized.
+
+The throughput lever is the ggml **qwen35 gated-DeltaNet decode**. Per SSM layer
+per step it re-materializes and D2D-copies the full recurrent state (~230 MB at
+batch 128; ~80 copies/step, ~18 GB/step) and feeds the recurrence through ~2
+`get_rows` gathers, so ~61% of decode (state copy + state gather + recurrence) is
+SSM state plumbing. vLLM's gated-DeltaNet decode (the flash-linear-attention
+`fused_recurrent_gated_delta_rule` path) keeps the state in place and fuses the
+gather into the scan, avoiding both the per-layer D2D copy and the gathers.
+
+Next-step scope (the real lever, to be done in the ggml/llama qwen35 SSM path -
+not paged-attn, not a graph capture, not a block-table tweak):
+1. Eliminate the per-layer recurrent-state D2D copy: update the state tensor
+   in place (or double-buffer / write-back), so the recurrence consumes and
+   produces the persistent state without a full-state copy each layer each step.
+2. Fuse the `get_rows` state / conv-state gather into the recurrent kernel.
+
+Ceiling from this rep (upper bound; assumes the work is fully removed, not just
+overlapped):
+- remove the DtoD state copy: reclaim 18.9% -> ~146 to ~180 t/s.
+- remove copy + gather: reclaim ~41% -> ~146 to ~247 t/s, which puts llama within
+  ~1.6x of vLLM 391 with the FP4 GEMM still untouched.
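+
+The ceiling arithmetic, as a sketch: if a share `s` of the per-step GPU time is
+removed outright and nothing else changes, throughput scales by `1 / (1 - s)`.
+`ceiling_tps` is an illustrative helper, not part of the bench tooling.
+
+```cpp
+#include <cassert>
+
+// Upper-bound throughput after removing `removed_share` (0 <= share < 1) of
+// the per-step GPU time, with the rest of the step unchanged.
+static double ceiling_tps(double base_tps, double removed_share) {
+    assert(removed_share >= 0.0 && removed_share < 1.0);
+    return base_tps / (1.0 - removed_share);
+}
+```
+
+With the 146 t/s baseline, `ceiling_tps(146.0, 0.189)` is about 180 and
+`ceiling_tps(146.0, 0.189 + 0.219)` is about 247.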
+
+No code patch in Phase 2 either: the lever is a gated-DeltaNet decode rewrite in
+the SSM path, too large for this measurement pass and orthogonal to paged
+attention. `patches/paged/0018` stays free. Evidence on the DGX:
+`~/bench/a2_decompose/decode_decomp.txt` (per-kernel table + reproducing SQL in
+its header), `~/bench/a2_decompose/SUMMARY.txt`, and the Phase-1 reps
+`~/bench/a2_nsys/paged_off_npl128.sqlite` / `paged_on_npl128_node.sqlite`.
+
+# A.2 final synthesis - the four-point verdict
+
+All numbers measured on the DGX (GB10, sm_121, q36-27b-nvfp4 dense, fusion OFF,
+`decode_agg` = `S_TG t/s`), npl 128 unless noted.
+
+**1. CUDA-graph lever size (measured, not guessed).** +0.13% (4-cell, stock
+ON-vs-OFF) to +0.78% (fresh paged re-check) at npl 128; +1.1% to +1.4% at npl 32.
+All inside run-to-run noise. The earlier grounding GUESSED ~10-20% from a
+94.6%-busy reading; direct measurement puts the steady decode at 99.4-99.5% busy,
+so the real graph ceiling is < 1%, not 10-20%. The guess was wrong because the
+busy-fraction it rested on was under-read (a graphs-ON nsys trace under-counts
+GPU-busy unless `--cuda-graph-trace=node` is set - trap #2).
+
+**2. Was "paged decode runs eager" fixed, and what is the decode_agg win?**
+There was nothing to fix: the premise was false. At the benchmarked context the
+DEFAULT in-kernel paged decode already captures and replays graphs, with a
+256-token reset cadence identical to stock non-paged (10 complete / 8 reset over
+~320 steps, resets clustered only at the 256/512 token boundaries). "graphs
+reused=0" was a uid fast-path false negative, not eager execution (trap #1). The
+only genuinely-eager path is the `LLAMA_KV_PAGED_GATHER=1` fallback (unpadded
+index grows every step), which is not the default decode. Because graphs were
+already engaged, the decode_agg win from "enabling" them is ~0 (+0.1% to +0.8%).
+Graphs DID collapse within-step launch idle (0.37% -> 0.11%, ~80k -> ~15k
+launches/run), but the GPU stays 99.4-99.5% busy, so throughput is unchanged.
+
+**3. New llama %-of-vLLM @npl128.** Unchanged by A.2: 146-148.6 t/s vs vLLM 391 =
+**37.3-38.0%**. Graphs ON vs OFF both land here (146.03 / 144.90 in the fresh
+re-check; 148.41 / 148.21 in the 4-cell). A.2 did not move the percentage.
+
+**4. Honest verdict - did A.2 move toward parity; residual + next lever.** No.
+A.2 closed zero of the 2.6x gap, and it provably cannot on this model: paged /
+full attention is ~0.4% of decode (16 full-attention layers vs 48 gated-DeltaNet
+layers, a 3:1 hybrid SSM), so no graph / block-table / gather change to the paged
+path can move decode_agg. The residual gap is structural and lives elsewhere:
+~64% of decode is gated-DeltaNet / SSM state plumbing (23.4% recurrence + 21.9%
+get_rows state gather + 18.9% D2D recurrent-state copy of ~230 MB per SSM layer
+per step, ~18 GB/step), and ~28% is FP4 matmul (already shown secondary by Track
+B: a 26%-faster GEMM left decode_agg flat). The within-step launch loop is solved
+(graphs) and the between-step host loop is a 0.24% second-order floor hidden under
+GPU compute - neither is the residual.
+
+The next lever is NOT in this track. It is the ggml qwen35 gated-DeltaNet decode:
+(1) eliminate the per-layer recurrent-state D2D copy (in-place / double-buffer
+write-back), and (2) fuse the get_rows gather into the recurrent kernel - mirroring
+vLLM's `fused_recurrent_gated_delta_rule`, which keeps the state in place and
+fuses the gather. Measured ceiling on this rep: remove the copy -> ~146 to ~180
+t/s; remove copy + gather -> ~146 to ~247 t/s (within ~1.6x of vLLM with FP4 GEMM
+still untouched). That work is orthogonal to paged attention; `patches/paged/0018`
+stays free.
--- a/backend/cpp/llama-cpp/patches/paged/ADDITIVE_DESIGN.md
+++ b/backend/cpp/llama-cpp/patches/paged/ADDITIVE_DESIGN.md
@@ -0,0 +1,107 @@
+# Additive layout for the paged-KV patch series - "hook, don't edit"
+
+Goal: ship paged KV as a vendored patch series that **survives llama.cpp pin bumps with
+minimal rebase pain**. PR #22569 (the upstream draft) was rejected by maintainers as
+"slop" and is far too invasive to vendor - it rewrites core attention. Our series must be
+the opposite: **additive**. This document is the design rule and the per-patch core-touch
+budget.
+
+## The rule
+
+> Every change is either (a) **new code in a new vendored file** under `src/`, or (b) a
+> **single, env-gated hook** at one call site in a core file that delegates to the new
+> file. No logic lives in a core file. No core struct/signature is edited.
+
+Why it works: a hook is a 1-3 line diff against a core file. When upstream churns that file,
+`git apply` either still lands the hook (context unchanged) or fails *only on that tiny
+hunk*, which is trivial to re-place. Logic embedded inside a core function (the PR #22569 /
+old-0003 approach) conflicts on every bump and must be re-understood each time.
+
+This is enforceable as a **core-touch budget**: each patch declares the core files it
+touches and the line count; review rejects anything that grows logic in core.
+
+## Why it's achievable here (grounded in the pinned source)
+
+The two seams paged KV needs are both already abstract in llama.cpp at the pin
+(`LLAMA_VERSION=f3e1828`), so new behavior plugs in without editing core types:
+
+- **KV placement** - `llama_kv_cache::find_slot` already returns a `slot_info` of physical
+  cell indices. Paged placement is just *different indices*. 0002 already does this as one
+  gated block (`if (paged_mode) { ... continue; }`, 41 lines, one file). Ideal.
+- **Graph inputs** - `llm_graph_input_i` is a pure-virtual base (`set_input()`), and
+  `llm_graph_result::add_input(llm_graph_input_ptr)` lets *any* code register a new input
+  subclass. So a paged graph input (the gather index) can be **a new class in a new file**,
+  added from a one-line hook - no edit to `llm_graph_input_attn_kv` or `llama-graph.h`.
+
+## Per-patch core-touch budget
+
+| # | Patch | New files (additive) | Core hooks (gated, minimal) | Core lines |
+|---|-------|----------------------|------------------------------|-----------:|
+| 0001 | vendor manager | `paged-kv-manager.{h,cpp}` | `CMakeLists.txt` +1 | 1 |
+| 0002 | block placement | - | one `if(paged_mode){...continue;}` in `find_slot` | ~41 |
+| 0003 | gather-read | `paged-attn.{h,cpp}` | `CMakeLists.txt` +1; **one** hook in `build_attn`; 2 tiny accessors on `llama_kv_cache_context` | ~8 |
+| 0004 | on-demand alloc | (uses 0001 manager) | one branch in `find_slot` calling the manager | ~10 |
+| 0005 | continuous batching | - | **LocalAI `grpc-server.cpp`** (already a LocalAI override, not a core patch) | 0 core |
+| 0006 | prefix caching | (uses 0001 manager) | one hash-lookup hook in the 0004 alloc branch | ~6 |
+
+Net core surface for the *entire* engine: `find_slot` (placement/alloc - where physical
+cells are already chosen) + **one** line in `build_attn` + two accessors. Everything else
+is new files or the LocalAI-side server loop.
+
+## 0003 redesigned to the rule (replaces the 4-file-surgery plan)
+
+The old `0003-gather-read-plan.md` edited `llama-kv-cache.{h,cpp}` + `llama-graph.{h,cpp}`
+(including a field added to `llm_graph_input_attn_kv` and fill logic in its `set_input`).
+The additive form removes the core-struct and core-`set_input` edits entirely:
+
+**New file `src/paged-attn.{h,cpp}`** holds *all* logic:
+- `class llm_graph_input_paged_gather : public llm_graph_input_i` - owns the `I32 [n_gather]`
+  gather-index tensor and a `const llama_kv_cache_context * mctx`. Its `set_input()` fills
+  the index with the sequence's used cells (`{ i in [0,n_kv) : !cells.is_empty(i) }`, the
+  same set the `kq_mask` keeps), in the canonical order.
+- `paged_attn::gather(ctx0, res, mctx, v_trans, &k, &v, &kq_mask)` - when paged is active,
+  constructs that input via `res->add_input(...)`, and applies `ggml_get_rows` to `k`, `v`,
+  and the transposed `kq_mask` by the shared index (mask: `transpose -> get_rows ->
+  transpose`). When not active it returns immediately -> **stock path byte-identical**.
+
+**Core hooks (the whole core diff for 0003):**
+1. `src/llama-graph.cpp`, in `build_attn` right before `build_attn_mha` (~line 2357):
+   ```cpp
+   paged_attn::gather(ctx0, res, mctx_cur, v_trans, &k, &v, &kq_mask); // no-op unless LLAMA_KV_PAGED
+   ```
+   One line. No new field on `llm_graph_input_attn_kv`; the gather input is a *separate*
+   registered input, so `llama-graph.h` is untouched.
+2. `src/llama-kv-cache.{h,cpp}`: two thin accessors on `llama_kv_cache_context` so the new
+   file can read the used-cell set without reaching into internals -
+   `uint32_t get_n_gather() const;` and `void get_gather_idxs(int32_t * dst) const;`
+   (delegate to `kv`/`sinfos[i_cur]`, mirroring the existing `get_n_kv` / `set_input_k_idxs`
+   pattern). ~8 lines total, no signature changes to existing methods.
+3. `src/CMakeLists.txt`: `+ paged-attn.cpp`.
+
+First cut: gate to **flash-attn + single-stream** (`GGML_ASSERT` otherwise) - the V-transposed
+(non-FA) and multi-stream gathers are a localized follow-up entirely inside `paged-attn.cpp`,
+no new core touch. Gate 0 stays the same: `diff` of greedy `llama-simple` output, stock vs
+`LLAMA_KV_PAGED=1`, must be identical (attention is permutation-invariant over the gathered
+KV set; `n_gather < n_kv` proves compaction, not identity).
+
+## Anti-drift practices (already in `README.md`, restated as policy)
+
+- **Stacking patches, one concern each**, exported 1:1 from a dev branch via
+  `git format-patch`. On a pin bump, rebase the branch; only the conflicting small patch
+  needs a touch, and the failure names the exact step.
+- **Default-off (`LLAMA_KV_PAGED`)** until each gate is green, so a partial series never
+  changes stock behavior - and the hooks compile to a no-op branch when the env is unset.
+- **Dev tree:** `git worktree add <dev> <LLAMA_VERSION>` off any checkout that has the pin
+  (e.g. the existing llama.cpp clone), `git apply` the series, develop the next patch as one
+  commit, re-export. (Set up and verified for this pin during this work.)
+
+## Status / next step
+
+- 0001, 0002: done, additive, verified token-identical.
+- 0003: **redesigned to the additive form above** (this doc). Dev tree at the pin with
+  0001+0002 applied is ready (`paged` branch). Remaining work is the focused
+  implement-and-verify block for `paged-attn.{h,cpp}` + the one `build_attn` hook, driven to
+  the token-identical Gate 0. That is a numerical-correctness task (mask/gather alignment,
+  FA-first), not a structural one - the structure is settled here.
+- 0004-0006: follow the budget above; 0005 lands in LocalAI's `grpc-server.cpp` (no core
+  patch at all).
--- a/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md
+++ b/backend/cpp/llama-cpp/patches/paged/CONTINUOUS_BATCH_SCHEDULER_SCOPE.md
@@ -0,0 +1,499 @@
+# Durable scope: token-granular continuous-batch scheduler for llama-server on GB10
+
+Build-ready plan. **Not implemented in this workflow** (serving-loop rewrite). This
+document scopes the durable path to give llama-server's `update_slots()` a vLLM-v1-style
+token-granular continuous-batch scheduler, and records the single honest finding that
+re-shapes what the change can and cannot buy.
+
+Hardware: NVIDIA GB10 / DGX Spark (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
+LPDDR5x ~273 GB/s. Models: dense Qwen3.6-27B NVFP4 (`~/bench/q36-27b-nvfp4.gguf`),
+MoE Qwen3.6-35B-A3B NVFP4 (`~/bench/q36-35b-a3b-nvfp4.gguf`). Dev tree `~/llama-paged-dev`
+(branch `paged`, HEAD `151343b`, patch 0015), `build-cuda` sm_121, `LLAMA_KV_PAGED=1`.
+Scheduler code: `tools/server/server-context.cpp::update_slots()` (LocalAI override that
+`#include`s it: `backend/cpp/llama-cpp/grpc-server.cpp`).
+
+## TL;DR (the honest reframe)
+
+Three findings, read directly from the source at HEAD `151343b` and from the committed
+NVFP4 re-run (`QWEN36_NVFP4_BENCH.md`), collapse the apparent size of this work and reset
+what it is allowed to claim:
+
+1. **The unified mixed batch already exists.** `update_slots()` already builds ONE
+   `llama_batch` per step = {every ready decode token} **+** {a bounded chunk of prefill
+   tokens}, in a fixed two-phase order: Phase 1 (lines 2604-2719) appends every
+   `SLOT_STATE_GENERATING` slot's sampled token **unconditionally** (no budget gate), then
+   Phase 2 (lines 2753-3330) fills the remaining batch capacity with prompt tokens. Decode
+   is therefore **already claimed first and never dropped or capped** - the exact property
+   vLLM's "RUNNING-before-WAITING" pass works to guarantee is **free** here by construction.
+
+2. **The chunked-prefill slot state already exists and already persists across steps.** A
+   slot in `SLOT_STATE_PROCESSING_PROMPT` with `slot.prompt.n_tokens() < slot.task->n_tokens()`
+   is a partial prefill; it stays in that state and resumes next step until its prompt is
+   fully ingested, at which point it flips to `SLOT_STATE_DONE_PROMPT` -> `GENERATING`
+   (line 3252, then 3502). Multiple slots can be `PROCESSING_PROMPT` and `GENERATING`
+   simultaneously; there is **no global "one prefill at a time" gate**. So the mission's
+   "allow a slot to be mid-prefill while others decode in the same step" is **not a state
+   machine to build - it is already the behaviour.** This is the single biggest de-risking
+   fact in this document.
+
+3. **What is genuinely missing is the budget POLICY, and it is small.** Patch 0013
+   (`LLAMA_PREFILL_BUDGET`) is a single **static** per-step prefill cap, consumed greedily by
+   slots in iteration order. It is not decode-load-aware (does not subtract the live decode
+   count `D`), not adaptive (one constant across npl 8..128), and not fair (the first
+   `PROCESSING_PROMPT` slot can eat the whole budget). The durable delta is to convert that
+   static cap into vLLM's **dynamic, decode-first, per-slot-fair token budget**: one total
+   per-step budget `T`, decode claims its `D` tokens first, prefill gets the **leftover**
+   `T - D` distributed across waiting prompts with a per-slot cap. That is ~the only
+   behavioural change. **No new slot states, no batch-formation rewrite.**
+
+### The honest ceiling (this is load-bearing for how the work is scoped and sold)
+
+The committed re-run and a dedicated profiling pass (`QWEN36_NVFP4_BENCH.md`, plus
+`~/bench/stag_128.json`) establish that **the residual ~2.4x high-concurrency decode gap is a
+decode-KERNEL batch-scaling ceiling, not a scheduler defect**:
+
+- At npl8 the kernels are **at parity** (dense 99%, MoE 84% of vLLM decode).
+- A clean staggered full-batch-128 run, with **all 128 slots cleanly decoding and zero
+  prefill starvation**, still tops out at **decode_agg 157.4 tok/s** (dense) - the same
+  ~157-161 ceiling that four independent measurements converge on. vLLM does **390.7** at the
+  same effective batch. With a *perfect* scheduler the kernel still gives ~157. **The
+  scheduler cannot lift this.**
+- Patch 0013 budget-256 **already reaches ~161** (the ceiling) at npl128. So a token-granular
+  scheduler buys **little additional steady-state decode_agg** over 0013 on the all-at-once
+  workload.
+
+Therefore this scheduler's deliverable is **NOT "match vLLM's 391/811 decode."** It is:
+
+- **Close the 12x TTFT gap** (dense 305 s @ 0013 / 491 s stock -> vLLM's ~25 s, and ~2 s on
+  staggered arrival) - the genuine, large win.
+- **Robustly HOLD the decode ceiling** (~161 dense / ~333 MoE @npl128) **without
+  per-workload budget tuning** - 0013 needs a hand-picked constant (256 for dense, costs MoE
+  TTFT, net-negative at low npl); the dynamic `T - D` budget is self-tuning across the whole
+  npl range and across dense vs MoE.
+- **Burst-robustness**: bounded TTFT for *all* concurrently-arriving prompts (kill the
+  burst-TTFT spread), and no admission collapse under sustained load.
+
+Closing the residual 2.4x decode-throughput gap is a **separate, named lever**: the
+paged-attention **decode-kernel** batch-scaling work (patches 0009-0011 territory) and/or
+CUDA-graphed decode. It is called out explicitly in P3 and is **out of this scope's
+scheduler mandate**. We must measure and sell this work on **TTFT + burst-robustness +
+self-tuning hold of the ceiling**, never on a decode_agg number the kernel forbids.
+
+## The gap, precisely localized (recap of the committed bench)
+
+At matched NVFP4 on one GB10 box (`QWEN36_NVFP4_BENCH.md`), llama (patch 0015) vs vLLM 0.23.0,
+decode_agg tok/s | TTFT mean, npl swept 8/32/64/128:
+
+| npl | dense llama (0013 b256) | dense vLLM | MoE llama (0013 b256) | MoE vLLM |
+|----:|------------------------:|-----------:|----------------------:|---------:|
+| 8   | 63.5  / 4.3 s   | 64.3  / 2.6 s | 169.3 / 1.7 s  | 202.0 / 0.8 s |
+| 32  | 105.7 / 23.1 s  | 189.8 / 7.5 s | 239.0 / 9.0 s  | 462.0 / 2.3 s |
+| 64  | 132.0 / 109 s   | 284.2 / 13 s  | 277.0 / 16.2 s | 624.5 / 4.1 s |
+| 128 | **161.2 / 305 s** | 390.7 / 24.8 s | **333.5 / 98 s** | 811.1 / 8.0 s |
+
+Both models converge to the **same ~41% of vLLM decode at npl128** after 0013. That
+convergence is the signal: once prefill starvation is removed, a dense model and a
+12x-cheaper-prefill MoE land on the **identical** ceiling -> the residual is **not prefill**
+and **not the kernel-at-parity-@npl8** - it is the **quality of the per-step batching
+decision** (TTFT/robustness) plus the **kernel decode ceiling** (the throughput residual).
+This scope addresses the first; it names the second as the separate lever.
+
+## What already exists (reuse, do NOT rebuild)
+
+All line numbers verified at `tools/server/server-context.cpp` HEAD `151343b`.
+
+- **[A] decode-first co-batch** - Phase 1, lines 2604-2719. Iterates `slots`; every
+  `SLOT_STATE_GENERATING` slot (gated only by `can_batch_with`, line 2611) is pushed to
+  `generating[]`; lines 2715-2719 (`for (slot : generating) slot.update_batch(batch)`) append
+  each slot's sampled token (+ draft tokens) via `common_batch_add`. After this loop,
+  `batch.n_tokens == D` (the decode-token count). **No budget gate** - decode always goes in.
+- **[B] chunked-prefill state per slot** - the pair `slot.prompt.n_tokens()` (=
+  `num_computed_tokens`) vs `slot.task->n_tokens()` (= `num_tokens`). A `PROCESSING_PROMPT`
+  slot with `prompt.n_tokens() < task->n_tokens()` resumes next step (Phase 2 re-enters it).
+  Transition to `DONE_PROMPT` at line 3252 when the prompt is exhausted; to `GENERATING` at
+  line 3502. **This is exactly vLLM's "leave the request in `running`, advance
+  `num_computed_tokens` next step" - already implemented.**
+- **[C] single shared batch + compute chunking** - one `llama_batch` holds decode+prefill;
+  the compute loop (lines ~3366-3378) `for (i=0; i<batch.n_tokens; i+=n_tokens){ n_tokens =
+  min(n_batch, batch.n_tokens-i); llama_decode(batch_view); }` runs it as one `llama_decode`
+  when `batch.n_tokens <= n_batch`; `n_ubatch` (512) splitting happens inside `llama_decode`.
+- **[D] patch 0013 static prefill budget** - the thing to supersede. Read once at lines
+  2737-2747 (`n_prefill_budget = min(n_batch, atoi(LLAMA_PREFILL_BUDGET))`, a CONSTANT for
+  the run); enforced as an extra `while` predicate at line 3188 (`n_prompt_budgeted <
+  n_prefill_budget`), counter at 3214, outer break at 3326. `0` = disabled = byte-identical
+  stock.
+- **[E] productization seam** - `backend/cpp/llama-cpp/grpc-server.cpp` lines 781-791 parse
+  the model option `max_prefill_tokens` / `mpt` / `prefill_budget` and `setenv`
+  `LLAMA_PREFILL_BUDGET` before context init (same pattern as `kv_paged`). New knobs hang off
+  this seam identically.
+- **[F] paged KV (patches 0001-0011)** - on-demand block allocation keyed by sequence
+  position. Batch formation only changes **which** tokens are in a step; paged alloc is
+  driven by the per-slot sequence positions, which are unchanged. Orthogonal (see Correctness).
+
+## vLLM v1 reference algorithm (the target, for fidelity)
+
+From `vllm/v1/core/sched/scheduler.py::schedule()` (0.23.0, on the box). The unifying idea:
+there is no prefill phase vs decode phase. Every request advances `num_computed_tokens`
+toward `num_tokens` by up to N this step; for a decoder N=1, for a prefiller N=remaining
+prompt. One per-step `token_budget = max_num_batched_tokens` bounds the TOTAL (decode +
+prefill). Pass 1 visits `running` first (decoders cost 1 each -> all decode claimed before
+any prefill is sized); Pass 2 admits `waiting` (new prompts) only with leftover budget, each
+chunked by `min(remaining_prompt, long_prefill_token_threshold, leftover_budget)`. Caps:
+`max_num_seqs` (concurrent sequences), `long_prefill_token_threshold` (~4% of max_model_len,
+per-request prompt-chunk cap so one giant prompt cannot monopolize a step). Net: decode batch
+maximal every step (-> the GEMM-batching throughput vLLM gets), prefill always makes bounded
+progress (-> low, flat TTFT), one `model.forward()` per step.
+
+The mapping to llama is clean because [A]+[B] already give us "running visited first" and
+"prefiller resumes next step." We are missing only: **one total budget `T`, leftover `T - D`
+sizing, and the per-request chunk cap with fair distribution.**
+
+## The unified per-step batch-formation algorithm (the design)
+
+New knobs (all default to current behaviour; env set before context init like `LLAMA_KV_PAGED`):
+
+- `T` = `LLAMA_MAX_BATCH_TOKENS` (option `max_batch_tokens` / `mbt`) - total per-step token
+  budget (decode + prefill), the analogue of `max_num_batched_tokens`. Default `n_batch`
+  (2048). Clamped `T = min(T, n_batch)` so the existing single-`llama_decode` chunking is
+  unchanged.
+- `PREFILL_CAP` = `LLAMA_PREFILL_CAP` (option `prefill_cap`) - per-slot max prompt tokens per
+  step, the `long_prefill_token_threshold` analogue. Default `min(T, ceil(0.04 * n_ctx))`,
+  floored at `n_ubatch` (512) so a single prompt still makes a full ubatch of progress.
+- Back-compat: if only the legacy `LLAMA_PREFILL_BUDGET` is set (new knobs unset), behave
+  exactly as 0013 (static cap) - 0013 is the degenerate `T = n_batch`, no-leftover case.
+
+Pseudocode, mapping to real variables and seams (the `>>` lines are the change vs today):
+
+```
+common_batch_clear(batch);                                  // line 2594
+
+// PASS 1 - DECODE FIRST (unchanged: lines 2604-2719)
+for (slot : slots) if (slot.state == GENERATING && can_batch_with) generating.push(slot);
+... speculative draft ...
+for (slot : generating) slot.update_batch(batch);           // appends decode (+draft) tokens
+
+>> D = batch.n_tokens;                                       // NEW seam: decode load is now final (after 2719)
+>> T = min(LLAMA_MAX_BATCH_TOKENS ? : n_batch, n_batch);
+>> prefill_budget_step  = max(0, T - D);                     // DYNAMIC leftover, auto-shrinks with D
+>> prefill_cap_per_slot = PREFILL_CAP;                       // long_prefill_token_threshold analogue
+>> n_prompt_budgeted    = 0;                                 // total prompt tokens added this step (subsumes 0013)
+
+// PASS 2 - PREFILL FILLS THE LEFTOVER (lines 2753-3330, budget made dynamic + per-slot fair)
+if (cont_batching || batch.n_tokens == 0) {
+>>  for (k = 0; k < n_slots; ++k) {                          // round-robin start offset (fairness, see P2)
+>>      slot = slots[(rr_start + k) % n_slots];
+        if (!slot.is_processing() || !can_batch_with) continue;
+        if (slot.state == STARTED) slot.state = PROCESSING_PROMPT;     // line 2782 (unchanged)
+>>      slot_prompt_added = 0;                               // NEW: per-slot chunk counter (reset each slot)
+        // inner prompt-fill (lines 3187-3239), guard now triple-bounded:
+        while (slot.prompt.n_tokens() < slot.task->n_tokens()
+>>             && batch.n_tokens   < T                       // was: < n_batch
+>>             && n_prompt_budgeted < prefill_budget_step    // was: 0013 static n_prefill_budget
+>>             && slot_prompt_added < prefill_cap_per_slot) {// NEW: per-slot cap -> fair distribution
+            common_batch_add(batch, cur_tok, pos_next, {slot.id}, need_embd);
+            slot.prompt.tokens.push_back(cur_tok);
+            slot.n_prompt_tokens_processed++;
+            n_prompt_budgeted++; slot_prompt_added++;
+            ... checkpoint-boundary breaks (unchanged) ...
+        }
+        if (slot.prompt.n_tokens() == slot.task->n_tokens()) slot.state = DONE_PROMPT;  // line 3252
+        ... checkpoint creation (unchanged) ...
+>>      if (batch.n_tokens >= T) break;                      // was: >= n_batch (line 3320)
+>>      if (n_prompt_budgeted >= prefill_budget_step) break; // was: 0013 break (line 3326)
+    }
+}
+
+for (i=0; i<batch.n_tokens; i+=n) { n=min(n_batch,batch.n_tokens-i); llama_decode(view); }  // unchanged
+```
+
+The whole change is: (a) compute `prefill_budget_step = T - D` at the new seam after line
+2719 instead of reading a static env constant at 2737; (b) bound the inner/outer loops by `T`
+and the dynamic budget instead of `n_batch` and the static budget; (c) add `slot_prompt_added`
+with `prefill_cap_per_slot` for per-slot fairness; (d) a round-robin start offset so the same
+early slots do not always win the leftover.
+
+**Why this holds the decode ceiling without tuning.** `T` bounds total tokens per step ->
+bounds step compute time -> decode steps fire at a steady high rate (high decode-steps/sec).
+As decode load `D` rises, `prefill_budget_step = T - D` auto-shrinks, so prefill never inflates
+the step beyond `T` even at npl128. This is the mechanism by which 0013's hand-tuned 256
+reaches 161; here it is reached **automatically across the npl range** because the budget is
+`T - D`, not a constant. **Why this closes TTFT.** The leftover is never negative
+(`prefill_budget_step >= 0`), and `T` is sized so that it stays > 0 until the box is fully
+decode-saturated; it is distributed across waiting prompts by `prefill_cap_per_slot`, so every
+prompt makes bounded progress every step instead of waiting for a dedicated prefill burst.
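+
+A worked instance of the leftover arithmetic, as an illustrative sketch only. It assumes `T`
+at the default `n_batch` (2048), a decode load `D = 128`, and a per-slot cap equal to the
+`n_ubatch` floor (512); the real cap depends on `n_ctx`, which is not fixed here. `B` denotes
+`prefill_budget_step`.
+
+$$
+B = \max(0,\, T - D) = 2048 - 128 = 1920 = 3 \cdot 512 + 384
+$$
+
+Under those assumptions, and provided each waiting prompt still has that many tokens left and
+no checkpoint-boundary break fires, one step carries all 128 decode tokens plus three full
+512-token prompt chunks and one 384-token partial chunk. Four waiting prompts advance in that
+step rather than one prompt taking the whole leftover, and the batch lands exactly on `T`.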
+
+## Slot state machine changes (minimal - this is the headline de-risk)
+
+**No new states. No state-transition rewrite.** The existing 6-state machine
+(`IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT / GENERATING`, lines 67-72)
+already encodes everything:
+
+- "mid-prefill while others decode" = a `PROCESSING_PROMPT` slot coexisting with `GENERATING`
+  slots in the same step. **Already happens** (Phase 1 and Phase 2 populate one batch).
+- "chunked-prefill state per slot" = `(state == PROCESSING_PROMPT) && (prompt.n_tokens() <
+  task->n_tokens())`. **Already persisted** across `update_slots()` calls; Phase 2 re-enters
+  the slot and resumes from `prompt.n_tokens()`.
+
+The only **additions** are per-step scheduler scratch, not slot lifecycle state:
+
+1. `slot_prompt_added` - a per-slot, per-step counter (local to the Phase-2 loop body), for
+   the per-slot chunk cap. Not stored on the slot across steps.
+2. A `rr_start` round-robin offset (one `size_t` on the server, advanced each step) so the
+   leftover budget is distributed fairly across `PROCESSING_PROMPT` slots rather than always
+   draining the lowest-index slot first (this is what kills the burst-TTFT *spread* - without
+   it, slot 0's prompt finishes first every time and the last slots starve).
+3. Optional, P2: a per-step admission cap `K` on how many `STARTED -> PROCESSING_PROMPT`
+   transitions begin in one step. This falls out of the budget arithmetic already (a bounded
+   `prefill_budget_step` with a per-slot floor admits only `~budget/floor` prompts/step), so it
+   may need no explicit code; if made explicit it is the `max_num_seqs`-style "don't admit a
+   new prefill if the step is full" guard, mapped onto the pre-allocated `n_parallel` slots.
+
+That is the entire state-machine footprint: two pieces of per-step scratch and an optional cap.
+The mission's feared "slot-state rewrite" does not materialize.
+
+## How it supersedes / subsumes patch 0013
+
+| property | 0013 (static cap) | this scheduler (dynamic `T - D`) |
+|----------|-------------------|----------------------------------|
+| per-step prefill bound | constant `n_prefill_budget` | `T - D`, shrinks as decode load rises |
+| decode-load aware | no (ignores `D`) | yes (leftover after decode) |
+| works across npl 8..128 with one config | no (256 best @128, net-negative @8) | yes (self-tuning) |
+| fair across multiple waiting prompts | no (greedy, slot 0 wins) | yes (`prefill_cap_per_slot` + round-robin) |
+| TTFT on bursty arrival | raises it (defers first tokens) | bounded for all prompts |
+| decode-first guarantee | structural (Phase 1) | structural (Phase 1) - **kept** |
+
+0013 is the **degenerate case** `T = n_batch` with `prefill_budget_step` pinned to a constant
+and no per-slot cap. The patch keeps `LLAMA_PREFILL_BUDGET` working for back-compat (when the
+new knobs are unset). When `LLAMA_MAX_BATCH_TOKENS` is set, the static path is replaced by the
+dynamic one. **Default (all knobs unset) = byte-identical stock**, exactly like 0013.
+
+## Correctness
+
+- **KV cache during chunked prefill** - unchanged from today. A `PROCESSING_PROMPT` slot already
+  advances `slot.prompt.tokens` / `pos_next()` chunk by chunk across steps; we only change the
+  chunk SIZE per step, not how positions or sequence ids are assigned. `common_batch_add`
+  receives the same `(tok, pos, {slot.id})` tuples in the same order. No new KV state.
+- **Determinism** - greedy (temp 0) output can differ from a single-`n_batch`-chunk run only by
+  the **intrinsic flash-attn chunk-size FP grouping** that 0013 already documented and bounded:
+  pure stock `-b256` diverges from `-b2048` the same way with this patch inactive; output stays
+  coherent and answers correctly. The op-level math per token is position-determined and
+  unchanged; only the FA reduction grouping over a step's token mix shifts. The deterministic
+  oracle is the CPU backend / the op test (bit-exact); the GB10 CUDA greedy-decode band applies
+  to end-to-end only, never to the op test.
+- **Paged KV (patches 0001-0011)** - **orthogonal**. Paged on-demand block allocation is keyed
+  by sequence position and slot/stream, which this change does not touch; it changes only which
+  tokens are in a given `llama_decode`. The in-kernel paged decode read (0009-0011) operates
+  per-token via the block tables regardless of what prefill tokens are co-batched. Required gate:
+  run the full P0-P2 suite with `LLAMA_KV_PAGED=1` **and** `=0` and confirm **identical
+  scheduling decisions** (same per-step token counts, same admission order) - paged must be a
+  no-op on the scheduler.
+- **`can_batch_with` constraint** (line 302) - a batch admits only slots with the same
+  `task->type` and equal LoRA. Homogeneous-completion serving (the benchmark and the dominant
+  LocalAI case) satisfies it, so the mixed decode+prefill batch forms freely. Mixed task types /
+  per-request LoRA fall back to separate batches - a pre-existing bound, not a regression; note
+  it, do not try to lift it here.
+- **Checkpoint interaction (a real, orthogonal serving defect to account for)** - each slot that
+  reaches `DONE_PROMPT` may call `create_checkpoint` (line 2147), ~149 MiB per checkpoint on the
+  dense 27B, gated by `n_ctx_checkpoints > 0` (line 3133). Profiling found that under sustained
+  heavy load the checkpoint subsystem **thrashes**: admission collapsed to one slot every ~13 s,
+  zero decoding for 290 s, while `/slots` itself serialized behind a 13 s `update_slots` step.
+  This is **independent** of the decode/prefill mix but it **masks** the scheduler's win if left
+  on. **P0 must isolate it** (run with `n_ctx_checkpoints=0`), and **P2's admission decision
+  should be checkpoint-cost-aware** on the 128 GB unified box (do not admit a fresh prefill whose
+  checkpoint would thrash the pool). Treat as a named co-defect, not part of the core batching
+  change.
+
+## Phased plan P0 -> P3 (work, payoff, files, risk)
+
+| Phase | Work | Files | Expected payoff (dense / MoE @npl128 unless noted) | Risk |
+|-------|------|-------|-----------------------------------------------------|------|
+| **P0** baseline + metrics harness | Per-step effective-decode-batch poller (`/slots`), TTFT percentiles (p50/p90/p99/max), `decode_agg` over the fully-overlapped window, decode-ITL (worst freeze / median), **step-time histogram**, admission rate (slots/s reaching GENERATING), checkpoint-event log. Lock the staggered-arrival ceiling (**157.4** dense, all-128 clean) and the all-at-once burst pathology as the two reference traces. Isolate checkpoints (`n_ctx_checkpoints=0`). | dev-tree only: `~/bench/` (reuse `stag.py`, `slot_poll.py`, `h2h_cli.py`, `h2h_moe_sweep.sh`; `stag_128.json`, `h2h_real128b.json`) | **None** (gate). Locks correctness + the 157/333 ceiling so any regression is caught. | Low |
+| **P1** unified mixed-batch formation | Replace the static budget read (2737-2747) with the **dynamic `T - D`** computed at the new seam after line 2719; bound the inner/outer Phase-2 loops by `T` (3188, 3320) and `prefill_budget_step` (3326) instead of `n_batch` and the static cap. No per-slot cap, no round-robin yet (that is P2). | `tools/server/server-context.cpp` (seam @2719, knob read, 3188, 3320, 3326); mirror to `0016-paged-continuous-batch-scheduler.patch` | **TTFT**: removes the burst penalty 0013 inflicts - staggered TTFT ~2 s, burst TTFT collapses toward vLLM's ~25 s / 8 s. **Decode**: holds the ceiling **(~161 / ~333)** *without per-workload tuning* (0013 needed 256 hand-picked). No new throughput beyond the ceiling - by design. | Low-Med (loop-bound edits in a hot path; default-off gate makes it byte-identical stock) |
+| **P2** scheduling policy / fairness | Add `slot_prompt_added` + `prefill_cap_per_slot` (the `long_prefill_token_threshold` analogue) and the **round-robin start offset**; optional explicit per-step admission cap `K` + checkpoint-cost-aware admission. Tune `T`, `PREFILL_CAP` on GB10 (dense vs MoE, npl 8/32/64/128). | `server-context.cpp` (Phase-2 loop body @2753-3330, server-level `rr_start`); `grpc-server.cpp` (options `max_batch_tokens`/`mbt`, `prefill_cap` @781-791) | **TTFT spread**: bounds first-token latency for **all** concurrently-arriving prompts (kills the burst-TTFT spread, e.g. dense max 305 s -> single-digit-s on staggered, bounded on burst). **Robustness**: no admission collapse under sustained load; decode batch stays maximal so the *time-averaged* decode_agg on real (non-burst) traffic rises toward the staggered 157/333 because slots reach GENERATING fast. | Med (fairness + admission logic; e2e coherence + A/B vs 0013 required) |
+| **P3** residual decode throughput | **Honest boundary: this is the decode-KERNEL lever, NOT the scheduler.** The scheduler has delivered TTFT + robustness + ceiling-hold. Closing the residual 2.4x (161 -> 391 dense, 333 -> 811 MoE) requires paged-attention **decode-kernel** batch-scaling (patches 0009-0011 territory) and/or **CUDA-graphed decode** (the now-uniform decode-only step is graph-capturable). Scope/track separately. | (separate scope) `ggml/src/ggml-cuda/` decode-read kernels; optional CUDA-graph capture seam in `update_slots` | This is **where 391/811 would come from**; it is **out of this scope's mandate** and must not be charged against the scheduler. The scheduler makes the decode step uniform (a precondition that *helps* a future graph capture). | High (kernel work; the GB10 occupancy wall, see below) |
+
+**Per-phase payoff vs the mission targets (TTFT 25 s / 8 s, decode 391 / 811 @npl128):**
+
+- **TTFT 25 s / 8 s** - reached by **P1 + P2** (the 12x gap is the scheduler's to close; on
+  staggered arrival it goes below the vLLM burst figure to ~2 s).
+- **Decode 391 / 811** - **NOT a P1/P2 deliverable.** P1/P2 hold **161 / 333** (= ~41% of vLLM,
+  the kernel ceiling) robustly and tuning-free. The remaining ~2.4x is **P3 kernel**, a separate
+  lever. Pre-registering this split is the point: the scheduler is judged on TTFT + holding the
+  ceiling, the kernel on the throughput residual.
+
+## GB10 considerations
+
+- **Bandwidth floor ~273 GB/s** is the *cause* of the decode ceiling (NVFP4 weight-read +
+  paged-KV gather per step). The scheduler cannot lift a bandwidth/kernel floor - it can only
+  keep the batch *at* the ceiling. Size `T ~= n_batch` (2048) so the compute step stays a single
+  `llama_decode`; `n_ubatch` (512) governs the internal split.
+- **`T` is the ITL/TTFT trade knob** (vLLM's `max_num_batched_tokens`): larger `T` = more
+  prefill/step = faster TTFT but bigger per-step ITL spike; smaller `T` = smoother ITL, slower
+  TTFT. Because the budget is `T - D`, the spike is bounded at `T` regardless of decode load.
+  Default `T = n_batch`; expect to tune down toward ~1024 for ITL-sensitive serving.
+- **Checkpoint ~149 MiB/slot thrash** on the 128 GB unified box - admission must be
+  checkpoint-cost-aware (P2); P0 measures with checkpoints off to isolate the batching win.
+- **Memory**: paged on-demand KV (dense 52->94 GB, MoE 39->61 GB across npl) vs vLLM's flat
+  ~112 GB pre-reservation - llama's standing multi-tenant advantage, unaffected by this change.
+- Both engines run in **eager mode** today; **CUDA-graphed decode** is the P3 kernel lever, and the
+  scheduler's uniform decode-only step is a precondition that *helps* a future capture.
+
+## Biggest risks and how to de-risk
+
+1. **"Slot-state rewrite" (the feared big risk) = actually LOW.** The
+   mid-prefill-while-others-decode state and the chunked-prefill resume already exist ([B]); we add
+   only per-step scratch
+   (`slot_prompt_added`, `rr_start`), not lifecycle states. **De-risk**: keep all 6 states
+   untouched; gate every change behind the new knobs; default-off = byte-identical 0013/stock,
+   verified by an A/B diff of per-step token counts.
+2. **Correctness regression in the mixed batch = the FA chunk-grouping nondeterminism.** Already
+   documented and bounded by 0013 (stock `-b256` vs `-b2048` diverge identically). **De-risk**:
+   op-test bit-exact where deterministic; greedy-coherence e2e on both models; A/B vs 0013 with
+   the new knobs set to reproduce 0013 (`T = n_batch`, no leftover) and confirm **byte-identical**
+   to 0013.
+3. **Paged-KV interaction = LOW (orthogonal positions).** **De-risk**: run the whole P0-P2 suite
+   with `LLAMA_KV_PAGED=1` and `=0`; assert identical scheduling decisions (paged must be a
+   no-op on batch formation). This is a hard gate, not a spot check.
+4. **Checkpoint thrash masks the win = MEDIUM.** A real serving defect that can swamp the
+   scheduler's signal. **De-risk**: P0 isolates it (`n_ctx_checkpoints=0`); P2 makes admission
+   checkpoint-cost-aware; report the scheduler metrics both with and without checkpoints so the
+   batching win is legible independent of the checkpoint co-defect.
+5. **Honest-payoff risk = the decode_agg number barely moves over 0013 (kernel ceiling), so the
+   work can be mis-judged as "no win."** This is the most important risk to manage. **De-risk**:
+   frame and measure on **TTFT percentiles, burst-TTFT spread, step-time histogram, admission
+   rate, and tuning-free ceiling-hold across npl/dense/MoE** - the axes the scheduler actually
+   moves - and **pre-register the decode-kernel as the separate residual-closer** (P3) so the
+   scheduler is never charged with the 391/811 number the kernel forbids.
+
+## Commit / hygiene
+
+Scope doc only (this file). **No engine change committed in this workflow.** Bench and parity
+scripts stay dev-tree-only (`~/bench/`, `~/llama-paged-dev/benches/`). When P1/P2 are
+implemented they mirror to
+`backend/cpp/llama-cpp/patches/paged/0016-paged-continuous-batch-scheduler.patch`
+(next free slot after 0015) and the LocalAI option lands in `grpc-server.cpp`
+beside `max_prefill_tokens`. Commit with `git commit -s`, trailer
+`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes. Do not push
+(human pushes).
+
+---
+
+## Review / risk (adversarial, source-verified)
+
+Skeptical staff review against the actual source at HEAD `151343b` (server-context.cpp,
+llama-batch.cpp, llama-kv-cache.cpp, paged-*.cpp), grpc-server.cpp in this worktree, and the
+committed `QWEN36_NVFP4_BENCH.md` plus the vLLM H2H serve logs/scripts on the box.
+
+### Verdict: the scope is SOUND. GO on P0 -> P1, CONDITIONAL P2, separate-track P3.
+
+The central de-risking claims check out against the code, and the load-bearing honesty (decode
+residual is a kernel ceiling, not a scheduler defect) is correct and now further corroborated.
+Two calibration fixes are required before P1 (below); neither changes the go decision.
+
+### (1) Tractability - CONFIRMED bounded; zero libllama changes. What enables/blocks it, concretely:
+
+- **Enables (already-exercised path, not new surface).** A mixed prefill+decode ubatch with
+  per-seq different `n_past` is the *existing* behaviour. `llama_batch` carries per-token `pos`
+  and `seq_id` (`common_batch_add(batch, tok, pos_next(), {slot.id}, ...)`); `llama_kv_cache` +
+  `paged_alloc::place()` place each `(seq, pos)` independently; `llama_kv_cache::init_batch`
+  (line 742) already splits the mixed batch into ubatches. **The server emits exactly this mixed
+  decode+prefill batch today** - patch 0013 ships it and produces coherent output - so the new
+  scheduler changes only the *count* of prefill tokens, never the batch *structure*. There is no
+  `llama_decode`/ubatch/KV rewrite in scope.
+- **Blocks: nothing in libllama.** The only constraints are pre-existing and orthogonal to the
+  target workload: (i) `can_batch_with` (same task type + equal LoRA per batch); (ii)
+  `split_equal(sequential=true)` errors on *coupled* sequences (shared-prompt parallel sampling),
+  forcing `-kvu`. Neither is introduced by this change.
+- **Correction to fold in:** the scope's [C] and the pseudocode imply contiguous `split_simple`
+  chunking. The real serving/benchmark config (`--parallel 128`, `kv_unified` default = `false`
+  -> `n_stream = n_seq_max = 128`) takes the **`split_equal(n_ubatch, sequential=true)`** path
+  (llama-kv-cache.cpp:742), which balances per-sequence rather than slicing contiguously. This
+  does not break anything (0013 already hits it) but it means the actual scheduled object is a
+  split_equal ubatch set; P0 must characterize that ubatch shape (not assume contiguous 512-chunks)
+  and the determinism band is over split_equal groupings. Lock the split path (unified vs not) in
+  the A/B so the byte-identical-to-0013 gate is meaningful. grpc seam [E] verified at
+  grpc-server.cpp:761-786 (`kv_paged`, `max_prefill_tokens`/`mpt`); new `mbt`/`prefill_cap` knobs
+  hang off it identically.
+
+### (2) Does it close the gap - the 2.4x is NOT CUDA graphs, and the TTFT root is quantified.
+
+- **CUDA graphs ruled out (verified).** Both NVFP4 H2H vLLM servers ran `--enforce-eager`
+  (`h2h_dense_vllm.sh`, `h2h_moe_serve_vllm.sh`; engine logs show `enforce_eager=True`,
+  `cudagraph_mode=NONE`, `CompilationMode.NONE`). So the npl128 2.4x decode gap is a genuine
+  **eager-mode kernel + per-step host-overhead** gap (ggml graph rebuild/realloc + ~1k kernel
+  launches per step on the weak Grace cores, paged-KV gather, MoE expert gather). The scheduler
+  cannot touch it; the staggered all-128-decoding 157.4 tok/s ceiling is solid. Scope is right to
+  refuse the 391/811 number. (CUDA graphs are a future *both-sides* lever, not the current cause.)
+- **The TTFT gap has a measured root the scope under-uses: prefill_tps collapse.** From the bench,
+  llama `prefill_tps` falls 1117 -> 752 -> 465 -> **125** (dense, npl 8/32/64/128) while vLLM holds
+  **flat ~1420** (MoE: 2813 -> 657 vs vLLM flat ~4263). That collapse - not a separate "scheduling
+  quality" abstraction - is the direct cause of the 491 s / 85 s TTFT, and it is exactly what the
+  dynamic `T - D` budget attacks: when decode load `D` is low (early in a burst) the leftover
+  `T - D` lets prefill take ~`n_batch` per step, and because llama's *larger per-step chunk*
+  compensates for its ~2.4x slower steps, a `T = 2048` budget can sustain prefill_tps at or above
+  vLLM's ~1420 during the drain. **So burst-TTFT parity is mechanically plausible, not just
+  "toward"** - the static budget-256 throttles prefill to 256/step (hence its weak 305 s) where the
+  dynamic budget would not. This strengthens P1's case beyond what the doc claims.
+- **Mandatory calibration fix:** that TTFT win **couples to a decode-ITL knob**. Spending the full
+  `T - D` on prefill during the drain makes those steps full `T`-token (mixed) computes, so
+  co-batched decoders get 1 token per slow step (ITL spike) *during the drain* - precisely vLLM's
+  tradeoff, navigated by `T`. The 157/333 ceiling is the **post-drain steady state**, not the
+  drain phase. Therefore the scope must **co-report drain-phase decode-ITL alongside TTFT** and
+  treat `T` as the published trade knob; reporting TTFT alone would hide the cost and reporting
+  decode_agg alone would hide the win (it is averaged across drain + steady state, which is why it
+  "barely moves"). Soften "P1+P2 reach 25 s / 8 s": the defensible claim is *staggered/realistic
+  arrival ~2 s, and all-at-once burst approaching vLLM with a tunable decode-ITL cost*.
+
+### (3) Correctness - paged orthogonality confirmed at source; the real risks are config, not code.
+
+- **Paged-KV is the same `llama_kv_cache` class** with `paged_alloc::` hooks inside the existing
+  find_slot/placement (llama-kv-cache.cpp:1043-1083), driven by per-slot `(seq, pos)` - which this
+  change does not touch. `init_batch`/split is paged-agnostic. The scope's "orthogonal" claim is
+  verified, not asserted. Keep the hard `LLAMA_KV_PAGED=1` vs `=0` identical-decisions gate.
+- **Determinism**: the FA grouping nondeterminism is over **split_equal** ubatches in the real
+  config; the `T = n_batch` A/B-must-be-byte-identical-to-0013 gate is the right oracle and is
+  sound (default-off path is untouched).
+- **Low-concurrency regression**: gated to byte-identical when knobs unset; the only live vector is
+  a **mis-tuned `T`** spiking ITL at low npl (the scope already flags `T` defaults). Config hygiene,
+  not a code risk. Add a guard/floor so `T` cannot be set below `n_ubatch`.
+
+### (4) Smaller higher-ROI step - yes, and the scope already contains it (P1).
+
+The minimal high-ROI change is **P1 alone**: replace the static read (server-context.cpp:2737-2747)
+with `prefill_budget_step = max(floor, T - batch.n_tokens)` computed after the decode-fill at line
+2719, and bound the Phase-2 loops by `T` / that budget (3188, 3320, 3326). That is a handful of
+line edits at named seams, default-off, and it captures the self-tuning + the bulk of the TTFT win.
+The even-smaller validation spike: a one-line `n_prefill_budget = max(floor, T - batch.n_tokens)`
+to confirm the prefill_tps/TTFT mechanism before writing the full P1. **P2** (round-robin +
+`prefill_cap_per_slot` + checkpoint-aware admission) is genuinely higher-effort and lower-marginal
+(it buys TTFT *spread*/tail and burst robustness, not the median); **gate P2 on P1's measured
+burst-TTFT-spread and drain-ITL**, do not commit to it up front. There is no smaller step that also
+fixes the static budget's npl-dependence - tuning 0013's constant cannot (256 is net-negative at
+npl8 and costs MoE TTFT), so P1 is the floor.
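+
+The budget rule above can be modelled in a few lines. This is a minimal Python sketch of the
+policy, not the `server-context.cpp` code: `prefill_budget_step` reuses the doc's variable name,
+while `floor_tokens`, the helper `clamp_T`, and the example values (`T = 2048`, `n_ubatch = 512`)
+are assumptions for illustration only.
+
+```python
+def prefill_budget_step(T: int, decode_tokens: int, floor_tokens: int = 0) -> int:
+    """Dynamic per-step prefill budget: what is left of T after the decode fill.
+
+    decode_tokens plays the role of batch.n_tokens right after the decode fill (D).
+    floor_tokens is a hypothetical minimum so prefill is never starved entirely.
+    """
+    return max(floor_tokens, T - decode_tokens)
+
+
+def clamp_T(T: int, n_ubatch: int) -> int:
+    """Guard suggested in the review: T may not be set below n_ubatch."""
+    return max(T, n_ubatch)
+
+
+T = clamp_T(2048, n_ubatch=512)
+for D in (0, 32, 128):
+    # The budget shrinks as the decode load grows, so the step stays bounded by T.
+    print(f"D={D:4d}  prefill budget={prefill_budget_step(T, D)}")
+```
+
+Note that a non-zero floor lets a step exceed `T` when `D` is close to `T`; whether that is
+acceptable is a policy choice the sketch leaves open.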
+
+### Realistic effort / payoff and sequencing
+
+- **P0** ~0.5-1 wk (harness largely exists in `~/bench/`): add drain-phase decode-ITL to the metric
+  set, lock the split path, isolate checkpoints (`n_ctx_checkpoints=0`). Gate only.
+- **P1** ~2-4 days: small diff + the A/B-vs-0013 byte-identical gate + the npl/dense/MoE sweep.
+  Payoff: self-tuning hold of 161/333 with no hand-picked constant; burst-TTFT 3-10x better than
+  0013 (plausibly approaching vLLM on the burst, parity on staggered), at a published `T`-tunable
+  decode-ITL cost. **This is the high-ROI core and the clean supersession of 0013.**
+- **P2** ~1-2 wk, conditional: fairness/admission + checkpoint-cost-awareness + tuning. Payoff: TTFT
+  tail/spread + no admission collapse under sustained load. Worth it only if P1 metrics show a
+  residual spread/robustness problem.
+- **P3** separate track, high effort: the *only* path to 391/811 is the eager-kernel + per-step
+  host-overhead residual. Highest-value probe is a **CUDA-graph capture of the steady-state
+  pure-decode step** - but note this works *independent of the scheduler* (the all-128-decoding
+  step is already fixed-shape today); the scheduler neither blocks nor specially enables it, so do
+  not credit graphs to the scheduler. The scope's "uniform decode step is a precondition" is a mild
+  over-claim; correct it to "graphs apply to the pure-decode steady state, which the scheduler does
+  not change."
+
+### Bottom line
+
+GO. The work is correctly localized to `update_slots()` batch-formation policy, requires no
+libllama changes (the mixed per-seq batch is the existing, shipping path), and supersedes 0013
+cleanly. The honest ceiling is real and well-stated; the two fixes are (a) co-report drain-phase
+decode-ITL with TTFT and stop selling/charging the decode_agg number, and (b) acknowledge the
+`split_equal`/`n_stream=128` path in the determinism and ubatch-shape analysis. Sequence
+P0 -> P1, measure, then decide P2; keep P3 (kernel/CUDA-graph) on its own track as the sole owner
+of the 2.4x throughput residual.
--- a/backend/cpp/llama-cpp/patches/paged/DECODE_GAP_STUDY.md
+++ b/backend/cpp/llama-cpp/patches/paged/DECODE_GAP_STUDY.md
@@ -0,0 +1,185 @@
+# llama-server vs vLLM: decode-step gap decomposition (DGX Spark, GB10 / sm_121)
+
+Profiling study (no engine changes). Question: in a matched, apples-to-apples
+comparison (both batched servers, NVFP4-class weights, prefix caching on, both
+eager), why is `llama-server` ~4-6x slower **per decode step** than vLLM on
+Qwen3-32B at a 1024-token shared-prefix / batch-32 fan-out, and what is closable
+vs structural?
+
+Hardware: NVIDIA GB10 (sm_121), unified LPDDR5X. Model: Qwen3-32B, 64 layers.
+llama side: `~/llama-paged-dev/build-cuda/bin/llama-server`, `q3-32b-nvfp4-dense.gguf`
+(NVFP4 weights, type-40 FP4-MMA path), `-ngl 99 --parallel 32 -c 40960 -fa on`,
+`GGML_CUDA_DISABLE_GRAPHS=1` (eager). vLLM 0.23.0 NVFP4A16 (W4A16/Marlin),
+`--enforce-eager`. Workload: 1024-token shared prefix + unique 32-token suffix,
+K=32 concurrent, generate 64. All profiling scripts are dev-tree only
+(`~/bench/decode_study/`); minimal in-code timers were not needed (server already
+reports per-slot `eval time`, which excludes prompt-eval = pure decode).
+
+## TL;DR
+
+1. **The real-server decode is GPU-BOUND, not host-bound.** During steady decode
+   the GPU is **~94.6% utilized** (nvidia-smi, real run) / 85-95% busy (nsys).
+   Per-slot CPU sampling, detokenize, and `update_slots` are fully hidden: a 5-stage
+   sampler chain gives the *identical* step time as greedy (1346 vs 1343 ms). The
+   "GPU stalls on the CPU serving loop" hypothesis is **refuted** for this workload.
+2. **At 1024 context the decode step is ~84% KV/attention, ~16% weight GEMM** - the
+   opposite of the thin-batch-GEMM story. Attention scaling with context length, not
+   the matmul, is the load-bearing cost.
+3. **The worktree's paged KV engine is a decode REGRESSION: ~1.85x slower than
+   stock** at 1024 ctx (paged 1279-1343 ms/step vs stock 650-729 ms/step). It
+   gathers K/V/mask into a contiguous buffer (`ggml_get_rows`) every layer every
+   step, then runs a dense FA kernel - paying a full extra KV read+copy that vLLM's
+   in-kernel PagedAttention never pays. Paging helps prefix-prefill memory; it hurts
+   decode latency.
+4. Even **stock** llama-server (~650-729 ms/step) is **~4-5x slower than vLLM**
+   (~120-185 ms/step). The residual gap is the **long-context decode-attention
+   kernel** and, secondarily, the **thin-batch FP4 weight GEMM** - both kernel-maturity
+   gaps vs vLLM's FlashInfer/FA paged-decode + Marlin, not serving-loop gaps.
+
+## The measured numbers (batch 32, server-reported pure-decode step time)
+
+`server_decode_step_ms` = max / mean-of-top-8 of per-slot `eval time ms-per-token`
+(the most-contended, full-batch-32 slots; excludes prompt eval).
+
+| config                                   | decode step ms (max / top8) | client wall ms/step |
+|------------------------------------------|-----------------------------|---------------------|
+| paged, ctx 1024, greedy                  | 1343 / 1279                 | 1468                |
+| paged, ctx 1024, **heavy 5-sampler**     | 1346 / 1280                 | 1470                |
+| **stock** (no paging), ctx 1024, greedy  | **729 / 650**               | 768                 |
+| paged, **ctx 64** (short), greedy        | **215 / 215**               | 253                 |
+| vLLM NVFP4A16, ctx 1024 (K=32)           | **~120-185** (270 tok/s)    | -                   |
+
+The brief's reference ~828 ms/step sits between the stock (650-729) and paged
+(1279-1343) numbers measured here; the decomposition below is what is robust. In our
+runs no prefix KV was reused across the 32 slots (each slot independently prefills
+all 1056 tokens - confirmed in the log), so the 32 sequences are genuinely concurrent
+and the "max" slot is maximally contended, which is why our paged max (1343) runs
+above 828.
+
+### Context sweep - decode step is attention-scaling, not fixed overhead
+
+Pure-decode step vs shared-prefix length (paged, batch 32):
+
+| prefix ctx | decode step ms |
+|-----------|----------------|
+| 64        | 215            |
+| 128       | ~290           |
+| 256       | ~410           |
+| 512       | ~660           |
+| 1024      | ~1280          |
+
+Roughly linear in context length: ~1 ms of added step time per added context token.
+The **215 ms at ctx 64 is the fixed floor** (weight GEMM + activations + norm/rope +
+loop + sampling, attention negligible). Everything above it scales with KV length =
+attention + KV plumbing. At 1024 ctx the fixed floor is only ~16% of the step.
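+
+The slope and floor share quoted above can be rechecked from the table's endpoints. A
+minimal sketch, assuming the rounded table values are exact; the variable names are
+illustrative and not part of the bench scripts.
+
+```python
+# (prefix ctx tokens, decode step ms) copied from the context-sweep table
+sweep = [(64, 215), (128, 290), (256, 410), (512, 660), (1024, 1280)]
+
+(ctx_lo, ms_lo), (ctx_hi, ms_hi) = sweep[0], sweep[-1]
+
+# Endpoint slope: added step time per added context token.
+slope_ms_per_token = (ms_hi - ms_lo) / (ctx_hi - ctx_lo)
+
+# Share of the ctx-1024 step that the ctx-64 floor accounts for.
+floor_share = ms_lo / ms_hi
+
+print(f"slope ~ {slope_ms_per_token:.2f} ms per context token")
+print(f"ctx-64 floor is {floor_share:.1%} of the ctx-1024 step")
+```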
+
+## Where the ~1280 ms paged decode step goes (nsys, pure-decode window)
+
+`nsys profile --delay=70 --duration=25 --trace=cuda` windowed onto steady 32-way
+decode (`srv_decode2.nsys-rep`; an earlier 25-60s window was discarded because nsys's
+own slowdown stretched the 32 prefills into it, inflating GEMM to a misleading 58%).
+GPU busy in-window 85.5% (nsys adds gaps; the real run is ~94.6% by nvidia-smi).
+
+| bucket                         | % GPU time | abs (of ~1280 ms) | what it is |
+|--------------------------------|-----------:|------------------:|------------|
+| `flash_attn_ext_f16` ATTENTION | **47.7%**  | ~610 ms           | decode attention over the 1056-cell KV |
+| `cpy_scalar` KV copy/cast      | 18.3%      | ~234 ms           | KV write + f32->f16 casts |
+| `get_rows/set_rows` KV gather  | 17.8%      | ~228 ms           | **paged** gather of K/V/mask to contiguous |
+| `mul_mat_q` + `quantize_mmq`   | 15.7%      | ~201 ms           | NVFP4 weight GEMM (+ activation requant) |
+| rmsnorm / silu / rope / add    | ~0.6%      | ~8 ms             | elementwise |
+
+Cross-check: the GEMM bucket (~201 ms) matches the ctx-64 floor (215 ms) - i.e. the
+weight matmul is ~the entire short-context step, and is context-independent, as
+expected. KV/attention buckets (47.7+18.3+17.8 = **83.8%**) match the context-sweep
+finding that ~84% of the step scales with context.
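+
+The bucket arithmetic in the table and cross-check can be reproduced as follows. A
+minimal sketch, assuming the ~1280 ms step and the percentages from the table above;
+the dictionary keys are shorthand labels, not exact kernel names.
+
+```python
+STEP_MS = 1280  # approximate paged decode step at ctx 1024
+
+buckets_pct = {
+    "flash_attn_ext_f16": 47.7,
+    "cpy_scalar": 18.3,
+    "get_rows/set_rows": 17.8,
+    "mul_mat_q+quantize_mmq": 15.7,
+    "elementwise": 0.6,
+}
+
+# Convert each share of GPU time into absolute milliseconds of the step.
+buckets_ms = {name: pct / 100 * STEP_MS for name, pct in buckets_pct.items()}
+
+# The three context-scaling buckets: attention, KV copy/cast, paged gather.
+kv_keys = ("flash_attn_ext_f16", "cpy_scalar", "get_rows/set_rows")
+kv_attention_pct = sum(buckets_pct[k] for k in kv_keys)
+
+for name, ms in buckets_ms.items():
+    print(f"{name:24s} {ms:6.0f} ms")
+print(f"KV/attention share: {kv_attention_pct:.1f}%")
+```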
+
+Power signature: ~33-36 W at 94% "utilization" (GB10 can pull far more). High
+utilization with low power means the kernels are **memory/latency-bound, not
+compute-saturated** - the classic decode signature (stream 19 GB of NVFP4 weights
+plus a growing KV every step).
+
+### Stock vs paged decomposition
+
+- **Stock** (~650 ms): ~215 ms GEMM floor + ~435 ms attention/KV (contiguous KV read
+  directly by the FA kernel, **no gather**).
+- **Paged** (~1280 ms): same ~215 ms floor + ~610 ms attention + **~455 ms paged
+  gather/copy overhead** (the `get_rows` of K/V/mask plus the extra KV copy that
+  feeds the dense FA kernel). That ~455 ms (~36% of the step) is the paged engine's
+  self-inflicted cost and is the entire ~1.85x stock->paged regression.
+
+## vLLM decode architecture mapped onto each llama bucket
+
+vLLM at ~120-185 ms/step is faster on **every** bucket:
+
+| llama bucket (paged)        | ms    | vLLM equivalent | does vLLM avoid it? |
+|-----------------------------|-------|-----------------|---------------------|
+| paged KV gather (get_rows)  | ~228  | PagedAttention reads blocks **in-kernel** via a block table | **Yes - entirely.** No gather op exists. |
+| KV copy/cast                | ~234  | KV written once into block pool; FA reads it in place | Mostly - no per-step recopy |
+| decode attention            | ~610  | FlashInfer / FA paged-decode GQA kernel, split over KV | Same op, far faster kernel on sm_121 |
+| weight GEMM + act quant     | ~201  | fused Marlin/Machete W4A16 dequant+MMA, no separate quant pass | Faster + removes the requant kernel |
+| CPU sampling / loop         | ~0 (hidden) | on-GPU batched sampling | N/A here - already hidden on llama side too |
+
+vLLM's whole-step (~150 ms) is **less than llama's GEMM floor alone (~215 ms)**, so
+vLLM is ahead on the matmul *and* the attention *and* avoids the gather. The gap is a
+stack of kernel-efficiency wins, not one silver bullet.
+
+## Ranked levers - closable vs structural
+
+1. **Remove the paged gather regression. [Tractable, ~455 ms / ~36% on the paged
+   path; net-zero risk - it is a regression]** The worktree's paged engine makes
+   decode 1.85x slower than stock by gathering K/V/mask to contiguous every layer
+   every step (patch 0003 `ggml_get_rows`). For latency-bound decode, **do not enable
+   paged KV** - it only ever helps prefix-prefill *memory*, never decode latency.
+   Fully recovering this *and* keeping paging requires reading paged blocks
+   in-kernel like vLLM (a from-scratch paged-attention CUDA kernel) - see lever 2.
+
+2. **Long-context decode-attention kernel. [Biggest real lever, ~435 ms of stock /
+   ~610 ms of paged; partly structural]** Even stock is attention-bound at 1024 ctx.
+   llama.cpp's `flash_attn_ext_f16` decode path is ~4-5x slower than vLLM's
+   FlashInfer/FA paged-decode GQA kernel on this Blackwell-class part. This is the
+   cost that *grows with context* - exactly the regime the brief targets. Tractable in
+   principle (a proper flash-decoding / split-K-over-KV kernel, and a true in-kernel
+   paged read that also kills lever 1's gather), but it is deep CUDA work on a new
+   arch and partly gated by kernel maturity on sm_121. **Highest-impact, hardest.**
+
+3. **Thin-batch FP4 weight GEMM floor. [Tractable, ~201-215 ms / 15-30%; bounded]**
+   The NVFP4 `mul_mat_q` + separate `quantize_mmq` activation pass is memory-bound and
+   less efficient than vLLM's fused Marlin/Machete W4A16. Fusing dequant into the MMA
+   and folding the activation quant into the GEMM is tractable kernel work. Bounded
+   impact: the floor cannot drop below weight-read-bound (~19 GB / LPDDR5X BW per step).
+
+4. **Host serving loop / per-slot sampling. [NOT a lever]** Measured zero: greedy ==
+   heavy-sampler step time; GPU 94.6% busy. On-GPU/batched sampling buys nothing until
+   the kernels (levers 1-3) get fast enough to expose host overhead. Refutes the
+   "host-bound serving loop" hypothesis for this decode-bound workload.
+
+5. **Continuous-batch scheduler. [NOT the gap / structural elsewhere]** llama-server
+   already fuses all 32 slots into one decode step (one set of kernels per step over
+   batch 32 - confirmed in the trace). vLLM's continuous/chunked-prefill batching wins
+   on *mixed* prefill+decode overlap, but the steady decode-step gap measured here is
+   kernel-bound, not scheduler-bound.
+
+## Honest bottom line
+
+The ~4-6x per-step gap is **GPU-kernel-bound**, and it decomposes as:
+
+- ~36% of the *paged* step is a **self-inflicted gather regression** - remove it
+  (don't run paged for decode-latency workloads).
+- The remaining ~4-5x vs vLLM (true even for stock) is **kernel efficiency**:
+  llama.cpp's long-context decode-attention and thin-batch FP4 GEMM are slower than
+  vLLM's PagedAttention + Marlin on GB10. That is a **kernel project** (in-kernel
+  paged attention + flash-decoding + fused W4A16 GEMM), not a serving-loop project.
+- Sampling, detokenize, `update_slots`, and the continuous-batch scheduler are **not**
+  the gap; the GPU is ~95% busy on memory-bound kernels the whole step.
+
+What is closable: lever 1 (immediately, by not paging), lever 3 (bounded, with kernel
+work). What is structural / hard: lever 2 (the decode-attention kernel + a real
+in-kernel paged read), which is where the context-scaling gap actually lives and where
+any serious effort to approach vLLM on GB10 must go.
+
+## Reproduction (dev-tree only, `~/bench/decode_study/`)
+
+- `launch_srv.sh` / `runcfg.sh` - launch llama-server (paged on/off) and a config.
+- `client.py` - K=32 token-id fan-out (1024 prefix + 32 suffix), `SAMP=greedy|heavy`.
+- `d2drv.sh` - nsys pure-decode window (delay 70s past prefill) -> `srv_decode2.nsys-rep`.
+- `cat2.py` - kernel-time categorization from the sqlite export.
+- vLLM side: `~/bench/run_vllm.sh` + `vllm_prefix.py` (K=32, ~270 tok/s).
--- a/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
+++ b/backend/cpp/llama-cpp/patches/paged/FP4_GEMM_SCOPE_B.md
@@ -0,0 +1,532 @@
+# Track B: the FP4-MMA weight-GEMM for GB10 decode parity with vLLM — build-ready scope + honest go/no-go
+
+Scope only (build-ready plan + honest verdict). **Not implemented in this workflow.** Track B is the
+residual-kernel track after track A (fuse the standalone `quantize_mmq_fp4` activation-requant, the
+8.2% decode bucket — tasks 38-41, the fused `rms_norm+mul+nvfp4-quant` producer + prequantized-MMQ
+consumer) is handled separately. Track B owns the **weight GEMM**, the ~59% bucket.
+
+**The load-bearing question, restated:** at the decode batch shape (M≈128 tokens fused into one
+ubatch, NVFP4 weights), is the weight GEMM **compute-bound** (FP4-MMA throughput is the lever →
+parity reachable with a better kernel) or **bandwidth-bound** (273 GB/s weight-read is a hard floor →
+parity capped)? And given the GB10 occupancy history, can a better FP4-MMA decode GEMM actually reach
+vLLM's **391 (dense) / 811 (MoE)** decode-agg tok/s @npl128, or only partway?
+
+Hardware: NVIDIA GB10 / DGX Spark, sm_121 (CC 1210 = `GGML_CUDA_CC_DGX_SPARK`), unified LPDDR5x.
+Dev tree `~/llama-paged-dev` (branch `paged`, build-cuda sm_121). All numbers are reasoned from the
+committed nsys decomposition + measured GB10 specs + a source read of the FP4-MMA kernel; **no new GPU
+benchmarks were run** (track A is on the box).
+
+## 0. Grounded inputs (measured, committed)
+
+| quantity | value | source |
+|---|---|---|
+| LPDDR5x bandwidth (spec) | **273 GB/s** | `BLACKWELL_KERNEL_GAPS.md`, `VLLM_DECODE_GROUNDING.md` |
+| LPDDR5x bandwidth (achieved, batch-1 weight read) | **~216 GB/s** (19 GB / ~88 ms irreducible) | prior batch-1 study |
+| FP4 (NVFP4/MXFP4) dense peak | **~427–500 TFLOP/s** (2× BF16; GB10 is 1:1:2 BF16:INT8:FP4) | `BLACKWELL_KERNEL_GAPS.md` §2 |
+| BF16 / INT8 peak | ~213 TFLOP/s / ~215 TOPS (INT8 == BF16 on GB10) | same §2 |
+| Demonstrated GB10 FP4-MMA efficiency | **~17%** of FP4 peak at prefill M=512 (MXFP4 dense 1153 t/s); **~3% dense / ~35%-of-BW MoE at decode** | `BLACKWELL_KERNEL_GAPS.md` §6, `GDN_DECODE_VERIFY.md` |
+| Dense Qwen3.6-27B NVFP4 weights | **18.8 GB** file; ~18 GB matmul tensors | `du` on DGX |
+| MoE Qwen3.6-35B-A3B NVFP4 weights | **23.85 GB** file; ~22 GB read/step @npl128 (~98% experts hit) | `du` on DGX |
+| Decode step decomposition (dense npl128, nsys, GPU 92.7% busy) | GEMM_weight **59.2%**, act_quant 8.2%, GDN 10.4%, full-attn 1.8%, elementwise/norm/rope 13.5%, embed 2.9%, copy 1.8% | `GDN_DECODE_VERIFY.md` §3a |
+| Measured per-step @npl128 | dense **~795 ms** (llama) → **~328 ms** (vLLM); MoE **~384 ms** → **~158 ms** | `VLLM_DECODE_GROUNDING.md` |
+| Aggregate decode @npl128 (the parity scoreboard) | dense **161** (llama) vs **391** (vLLM); MoE **333** vs **811** | `QWEN36_NVFP4_BENCH.md` |
+
+`decode_agg = npl / step_s = 128 / step_s`. Crossover formula throughout:
+`M* = b · peak / (2 · BW)`, `b` = bytes per weight element. Below `M*` bandwidth-bound, above it
+compute-bound.
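+
+As a worked instance of the crossover formula, the sketch below plugs in the dense figures used in
+§2a (18 GB of matmul weights over 27e9 parameters, the 500 TFLOP/s upper FP4 peak, 273 GB/s). The
+function name `crossover_m` is illustrative only and does not exist in the tree.
+
```python
def crossover_m(bytes_per_weight: float, peak_flops: float, bandwidth: float) -> float:
    """Batch size M* at which math time equals weight-read time.

    Per weight element, the read costs bytes_per_weight / bandwidth seconds and
    M tokens cost 2 * M / peak_flops seconds (one multiply-add per token).
    Equating the two gives M* = b * peak / (2 * BW).
    """
    return bytes_per_weight * peak_flops / (2.0 * bandwidth)


# Dense inputs (assumed from the document): 18 GB matmul tensors, 27e9 params.
b_dense = 18e9 / 27e9
m_star = crossover_m(b_dense, peak_flops=500e12, bandwidth=273e9)
regime = "bandwidth-bound" if 128 < m_star else "compute-bound"
print(f"b = {b_dense:.3f} B/param, M* = {m_star:.1f}, M=128 is {regime}")
```
+
+Swapping in the lower 427 TFLOP/s peak or the achieved 216 GB/s moves `M*` but keeps it far above
+128, so the ideal-kernel regime does not depend on which end of either range is used.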
+
+---
+
+## 1. The kernel-approach decision: TUNE the existing FP4-MMA `mul_mat_q`, do NOT write a cutlass kernel
+
+This is the first thing track B must settle, and the evidence settles it decisively.
+
+| option | verdict | why |
+|---|---|---|
+| **(A) Tune the existing `mul_mat_q<NVFP4>` FP4-MMA path** | **CHOSEN — the tractable spine** | The kernel already exists, is **bit-exact** (`test-backend-ops MUL_MAT` 1103/1103), is genuine **W4A4** (below), and already **beats vLLM at batch-1 prefill** (MXFP4 1153 t/s vs vLLM's 800 W4A16 — vLLM has no FP4 cubins on sm_121). The deficit is **decode-shape scheduling**, not the math op. Host-side selection + a bounded occupancy tune respects the GB10 lessons and is build-ready against known files/lines. |
+| **(B) New cutlass-style SM120 FP4 collective** | **REJECTED** | Repeats the **proven GB10 dead-end**: the from-scratch W4A16 BF16 GEMM hit only ~9–15 TFLOP/s (¼ of MMQ) and was **STOPPED** (`W4A16_MARLIN_KERNEL_PLAN.md`) because deep `cp.async` + XOR-swizzle **collapse GB10 occupancy**. Worse, **CUTLASS's own SM120 grouped block-scaled FP4 GEMM is broken on consumer Blackwell** (garbage/init-fail — CUTLASS #3096/#2800) — it is the exact reason vLLM falls back to **BF16 Marlin** for its MoE on sm_121. "Port cutlass" is not even a working option for the MoE arm. |
+| **(C) Marlin-style W4A16 (FP4→BF16 dequant + BF16 HMMA)** | **REJECTED for the win, noted for context** | This is what **vLLM's MoE actually runs** on sm_121 (W4A16, BF16 activations, dequant-in-mainloop). On GB10 **INT8 == BF16 == ½ FP4 rate**, so a BF16-HMMA path concedes the 2× FP4 advantage llama already has. We do not want to *descend* to vLLM's slower arithmetic class; we want to keep the FP4-MMA class and schedule it better. |
+
+**Decision: track B = tune `mul_mat_q<NVFP4>` (dense, `mmq.cu`/`mmq.cuh`) + the grouped `mul_mat_q`
+id-branch (MoE, `mmid.cu` + the same `mmq.cuh`).** No new kernel, no rewrite, no descent to BF16.
+The win is kernel *engineering around an FP4-MMA llama already possesses*, so there is **no
+hardware-instruction wall** — but it is gated by whether MMQ's occupancy-bound design can be pushed
+to the bandwidth floor at the thin decode M-tile.
+
+### What "the existing path" actually is (source-read, DGX `ggml/src/ggml-cuda/`)
+
+Decode runs **one `mul_mat_q` per weight, M=128** (all 128 slots' single tokens fused into one
+ubatch — confirmed `mul_mat_q(M=128)` in `GDN_DECODE_VERIFY.md`, not 128× M=1). The NVFP4 path:
+`mmq.cu` `use_native_fp4` gate (L125) → `quantize_mmq_fp4_cuda` act-quant (L138 dense / L200 id;
+**track A's fuse target**) → `mul_mat_q` → `vec_dot_fp4_fp4_mma` (`mmq.cuh:997`) →
+`mma_block_scaled_fp4` (`mma.cuh:1126`).
+
+**Confirmed W4A4 (this corrects an earlier "A is 8-bit-class" framing):** `block_fp4_mmq`
+(`mmq.cuh:53`) is `uint32_t d4[4]` (four `ue4m3` block scales) + `int8_t qs[4*32]` = **256 FP4 (e2m1)
+values packed 2-per-byte**. `quantize_mmq_fp4_cuda` (`quantize.cu:422`) emits FP4 via
+`ggml_cuda_float_to_fp4_e2m1`. The MMA is
+`mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3`
+(`mma.cuh:1145`) — **both operands e2m1, ue4m3 block scales**. So llama's dense FP4-MMA path is
+already the *same arithmetic class as vLLM's cutlass W4A4 dense*. The `sizeof(block_fp4_mmq) ==
+sizeof(block_q8_1_mmq)` static_assert is a shared-tile-footprint convention, **not** an 8-bit
+activation. **Consequence: there is no "make activations 4-bit" work to do and no activation-traffic
+halving to win — that is already banked. The entire dense deficit is scheduling/occupancy.**
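+
+The footprint claim can be checked on the host without touching the CUDA headers. The sketch below
+rebuilds only the field list quoted above as a `ctypes` stand-in (the class name `BlockFp4Mmq` is
+hypothetical, and the real struct may carry alignment attributes this mirror does not model).
+
```python
import ctypes


class BlockFp4Mmq(ctypes.Structure):
    """Host-side stand-in for the quoted field list, not the real header."""

    _fields_ = [
        ("d4", ctypes.c_uint32 * 4),        # ue4m3 block scales, 16 bytes in total
        ("qs", ctypes.c_int8 * (4 * 32)),   # 128 bytes, two e2m1 values per byte
    ]


block_bytes = ctypes.sizeof(BlockFp4Mmq)
fp4_values = 2 * BlockFp4Mmq.qs.size          # two 4-bit values per byte
bits_per_value = 8 * block_bytes / fp4_values  # payload plus scale overhead

print(f"{block_bytes} bytes per block, {fp4_values} FP4 values, "
      f"{bits_per_value} bits per value including scales")
```
+
+The 16 scale bytes are consistent with one one-byte scale per 16 values (256 / 16 = 16), which is
+an inference from the sizes rather than something read from the source.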
+
+Geometry (`vec_dot_fp4_fp4_mma`): `MMQ_NWARPS=8`, `iter_k=MMQ_ITER_K_FP4=512`, tiles
+`tile_A<16,8,int>` (weights, 16 N-rows × 64 FP4-in-K), `tile_B<8,8,int>` (acts, 8 M-cols × 64
+FP4-in-K), `tile_C<16,8,float>` (16 N-rows × 8 M-cols), `nfrags = MMQ_TILE_NE_K/tile_A::J`. The M loop
+is `for (j0=0; j0<mmq_x; j0 += ntx*tile_C::J)` — M tiled in steps of `tile_C::J=8`.
+
+---
+
+## 2. The roofline — answering the load-bearing question
+
+**Answer: BANDWIDTH-bound on the hardware roofline, but COMPUTE-bound in practice by the kernel's own
+under-occupancy. The 273 GB/s is NOT the wall at the parity target.**
+
+### 2a. DENSE Qwen3.6-27B, M=128
+
+`b = 18e9/27e9 = 0.667 B/param`; FLOPs/step `= 2·128·27e9 = 6.91 TFLOP`.
+
+- **Weight-read floor** (18 GB read ONCE for all 128 tokens): @273 GB/s = **65.9 ms → 1,942 tok/s**;
+  @216 GB/s = 83 ms → 1,542 tok/s.
+- **Crossover** at FP4 peak: `M* = 0.667·500e12/(2·273e9) = 611`. **M=128 ≪ 611 → an ideal FP4 GEMM
+  at decode is BANDWIDTH-bound.** At the kernel's *achieved* ~3% efficiency the effective peak
+  collapses to ~14.7 TFLOP/s and drags M* down to ≈18 (`0.667·14.7e12/(2·273e9)`), so M=128 sits
+  well above it: the *current* kernel is in self-inflicted compute-bound territory.
+- **Where llama sits:** GEMM = 59.2% × 795 ms = **471 ms = 14.7 TFLOP/s = 2.9% of FP4 peak = 7.1×
+  slower than the 66 ms weight-read floor.** Not a bandwidth wall — a kernel running deep in
+  compute-bound territory at single-digit efficiency.
+- **Where vLLM sits:** vLLM's whole step (328 ms) is shorter than llama's GEMM bucket alone
+  (471 ms). The **entire 2.42× gap (795 vs 328 ms) is the GEMM.**
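+
+The dense figures above follow from a handful of inputs. A minimal sketch that reproduces them,
+assuming the 59.2% GEMM share, the 795 ms step, the 500 TFLOP/s FP4 peak and the 273 GB/s spec
+bandwidth quoted in this document (all constant names are illustrative):
+
```python
NPL = 128              # decode slots fused into one ubatch
PARAMS = 27e9          # dense parameter count
WEIGHT_BYTES = 18e9    # matmul tensors read once per step
STEP_MS = 795.0        # measured llama step at npl128
GEMM_SHARE = 0.592     # nsys GEMM_weight bucket
FP4_PEAK = 500e12      # upper end of the quoted FP4 peak
BW_SPEC = 273e9        # LPDDR5x spec bandwidth

flops_per_step = 2 * NPL * PARAMS                  # one multiply-add per weight per token
gemm_ms = GEMM_SHARE * STEP_MS                     # time spent in the weight GEMM
achieved_flops = flops_per_step / (gemm_ms / 1e3)  # delivered FP4 throughput
efficiency = achieved_flops / FP4_PEAK
floor_ms = WEIGHT_BYTES / BW_SPEC * 1e3            # time to read the weights once
floor_tok_s = NPL / (floor_ms / 1e3)
slowdown = gemm_ms / floor_ms

print(f"GEMM {gemm_ms:.0f} ms = {achieved_flops / 1e12:.1f} TFLOP/s "
      f"({100 * efficiency:.1f}% of peak)")
print(f"weight-read floor {floor_ms:.1f} ms ({floor_tok_s:.0f} tok/s), "
      f"GEMM is {slowdown:.1f}x the floor")
```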
+
+### 2b. MoE Qwen3.6-35B-A3B, M=128
+
+@npl128, 128 tok × top-8 / 256 experts ⇒ ~98% experts read ⇒ ~22 GB/step (the full weight set), per-
+expert M ≈ **4 tokens**.
+
+- **Weight-read floor:** 22/273 = **80.6 ms → 1,588 tok/s** (@216: 102 ms → 1,255).
+- **Compute floor:** only ~3B active params ⇒ 0.77 TFLOP ⇒ 1.5 ms @peak — **trivial. MoE decode is
+  purely bandwidth/occupancy-bound, never compute-bound.** The hard part is saturating 273 GB/s while
+  feeding ragged M≈4 tiles.
+- **Where llama sits:** GEMM = 59% × 384 = **227 ms = 97 GB/s = 35% of peak BW** (occupancy/tile-fill
+  loss, not compute).
+- **Where vLLM sits:** step 158 ms ≈ grouped Marlin-NvFp4 at the ~80 ms floor + ~78 ms non-GEMM —
+  already pushing the MoE BW floor.
+
+**Both weight-read floors (dense ~1,940, MoE ~1,590 tok/s) sit well ABOVE vLLM's 391/811: ~5× for
+dense, ~2× for MoE. Bandwidth is not the wall; the GB10 FP4-MMA occupancy efficiency is.**
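+
+The "~98% experts read" and "M ≈ 4 per expert" figures can be reproduced under a simplifying
+assumption of uniform, independent routing (real routers are not uniform, so treat this as an
+estimate). The helper name `expected_hit_fraction` is illustrative.
+
```python
def expected_hit_fraction(n_tokens: int, top_k: int, n_experts: int) -> float:
    """Expected fraction of experts touched by at least one token.

    Under uniform routing a given expert is in one token's top-k with
    probability top_k / n_experts; tokens are assumed independent.
    """
    return 1.0 - (1.0 - top_k / n_experts) ** n_tokens


hit = expected_hit_fraction(n_tokens=128, top_k=8, n_experts=256)
tokens_per_expert = 128 * 8 / 256        # mean M seen by each expert
floor_ms = 22e9 / 273e9 * 1e3            # read ~22 GB once at spec bandwidth
achieved_gb_s = 22e9 / 0.227 / 1e9       # 22 GB over the 227 ms GEMM bucket

print(f"experts hit: {100 * hit:.1f}%, mean tokens/expert: {tokens_per_expert}")
print(f"weight-read floor: {floor_ms:.1f} ms, achieved: {achieved_gb_s:.0f} GB/s "
      f"({100 * achieved_gb_s / 273:.0f}% of spec)")
```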
+
+---
+
+## 3. The code-level inefficiencies, and the M-tile asymmetry that drives the whole plan
+
+The selection is `mul_mat_q_case` (`mmq.cuh:4108`): it loops `mmq_x = 8..mmq_x_max(=128) step 8` and
+keeps the `mmq_x` that **minimizes `ntiles_x = ceil(ncols_max/mmq_x)`**, stopping at `ntiles_x==1`.
+`mmq_y` (the weight-row tile) is pinned at **128** by `get_mmq_y_host` (L143). This produces the
+single most important structural fact for track B:
+
+> **`mmq_x` tiles M (tokens / output columns) — shrinking it RE-READS the weights `ntiles_x` times.
+> `mmq_y` tiles N (weight rows / output rows) — shrinking it does NOT re-read weights (each weight row
+> lives in exactly one row-tile); it only lowers shared footprint and raises occupancy.** The two
+> regimes pick opposite knobs:
+
+| | dense decode (M=128, no `expert_bounds`) | MoE decode (per-expert M≈4) |
+|---|---|---|
+| selection picks | `mmq_x=128` → `ntiles_x=1` → **weights read ONCE** (the one-read optimum) | `mmq_x=128` applied **per expert** → tile ~3% filled |
+| shrink `mmq_x`? | **NO — re-reads 18 GB ×`ntiles_x`**, fatal in the BW-bound regime | **YES, FREE** — 1 col-tile/expert regardless, no re-read → strictly occupancy-positive |
+| FP4-MMA M-frag fill | **full** (128/`tile_C::J`=16 frag-groups, all live) → no fragment waste | **wasted** (~1 of 8/16 frag-groups live, rest masked tails) |
+| BW-neutral occupancy lever | **`mmq_y`↓** (more resident CTAs, weights still read once) — kernel-structure change | **`mmq_x`↓** (toward density ≈8) — host-side template switch |
+| dominant loss | **occupancy** at the heavy 128×128 tile (exposed weight-load latency) | **tile-fill** (dense-tuned M-tile applied to ragged M≈4) |
+
+This asymmetry is the spine of the plan: **MoE's lever is host-only `mmq_x`↓ (already landed as patch
+0015 auto-cap→64; ideal ≈8–16); dense's lever is `mmq_y`↓ + occupancy, a bounded kernel change.**
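+
+A simplified host-side model of the selection rule described above makes the asymmetry concrete.
+It is a sketch only: the real `mul_mat_q_case` also applies constraints (such as shared-memory
+fit) that are omitted here, and `select_mmq_x` is a hypothetical name.
+
```python
import math


def select_mmq_x(ncols_max: int, mmq_x_max: int = 128, step: int = 8):
    """Pick the M-tile width that minimizes the number of column tiles.

    Walks mmq_x = step..mmq_x_max and keeps the first width that lowers
    ntiles_x = ceil(ncols_max / mmq_x), stopping once a single tile suffices.
    """
    best_x, best_ntiles = 0, math.inf
    for mmq_x in range(step, mmq_x_max + 1, step):
        ntiles_x = math.ceil(ncols_max / mmq_x)
        if ntiles_x < best_ntiles:
            best_x, best_ntiles = mmq_x, ntiles_x
        if best_ntiles == 1:
            break
    return best_x, best_ntiles


WEIGHT_GB = 18  # dense matmul weights, read once per column tile

for cap in (128, 64):
    mmq_x, ntiles_x = select_mmq_x(ncols_max=128, mmq_x_max=cap)
    print(f"dense, cap {cap}: mmq_x={mmq_x}, ntiles_x={ntiles_x}, "
          f"weight traffic {ntiles_x * WEIGHT_GB} GB")

# MoE: a 128-wide tile applied to an expert holding ~4 tokens.
print(f"MoE tile fill at mmq_x=128: {100 * 4 / 128:.1f}%")
```
+
+Capping the dense tile at 64 doubles the weight traffic, while the MoE tile can shrink toward 8
+without adding a column tile, which is the asymmetry the table records.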
+
+The five inefficiencies, ranked:
+
+1. **Separate activation-quant pass (track A's bucket, 8.2%).** `quantize_mmq_fp4_cuda` writes the
+   whole activation tensor to `block_fp4_mmq` in a standalone kernel; vLLM fuses `scaled_fp4_quant`
+   into the preceding RMSNorm/SiLU epilogue. **Handoff (track A → B):** B must consume A's prequantized
+   `block_fp4_mmq` y-tile in place of calling `quantize_mmq_fp4_cuda`, so the fusion saves the
+   activation round-trip, not just the launch (see §4.4).
+
+2. **No weight-load software pipeline → exposed latency at thin M (the #1 dense kernel lever).**
+   `load_tiles_nvfp4_nvfp4` (`mmq.cuh:946`) does plain global→shared stores → `__syncthreads` →
+   `vec_dot_fp4_fp4_mma` (`load_ldmatrix` of A + MMA): a **load→sync→compute→repeat** cadence with **no
+   `cp.async` double-buffering** overlapping the next k-block weight load with the current MMA. At
+   M=128 the per-tile MMA work is small, so serialized weight-load latency dominates → 2.9% (dense) /
+   35%-of-BW (MoE). **Caveat (the GB10 wall):** a *deep* pipeline + XOR-swizzle collapses GB10
+   occupancy (`W4A16_MARLIN_KERNEL_PLAN.md`). The fix is **occupancy-first** (raise resident CTAs to
+   hide latency via CTA-parallelism), **shallow 2-stage prefetch second**, never Marlin's 4-stage.
+
+3. **`mmq_x` maximized for dense = occupancy-heavy, but pinned by the one-read constraint.** At dense
+   decode the 128×128 tile (8 warps, large shared) is low-occupancy on the occupancy-dominated GB10,
+   but `mmq_x` cannot shrink without re-reading the 18 GB of weights `ntiles_x` times (at least 2×).
+   So the dense occupancy fix is **`mmq_y`↓** (BW-neutral), not `mmq_x`↓.
+
+4. **MoE per-expert M-tile waste (the structural MoE gap).** The 128-wide (or patch-0015 64-wide)
+   tile is applied per expert at density ≈4, so the accumulator is only ~3–6% filled: ~1 `tile_C`
+   frag-group is live and the rest are masked `need_check` tails. The ideal `mmq_x` is the
+   per-expert token count (≈4) rounded up to the minimum tile width, i.e. 8 (= `tile_C::J`). At ≤1
+   col-tile/expert this costs **no** extra weight read → strictly occupancy-positive. (This is the
+   MoE arm of inefficiency 3; scoped in `MOE_GROUPED_GEMM_SCOPE.md`.)
+
+5. **`iter_k=512` (FP4) couples to occupancy.** The FP4 main loop stages 512 K-elements/iter → larger
+   shared footprint → adverse in the occupancy-bound regime. A P2 tuning knob.
+
+**Ruled out (do not chase):** redundant weight reads on the *current* selection (none — dense
+`ntiles_x=1`, MoE ≤1 col-tile/expert); stream-K fixup (it *helps* fill the small GB10 grid at thin M);
+raw FP4-MMA peak rate (already beats Q4-MMQ and is BW-bound at batch 1 — latency-hiding binds first).
+
+---
+
+## 4. The specific build-ready changes
+
+All against DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`. Every change is gated and defaults to exact
+stock behavior until proven.
+
+### 4.1 Dense M-tile / occupancy (the make-or-break)
+
+- **Keep `mmq_x=128` at dense decode** (the one-weight-read optimum; do **not** shrink it — that
+  re-reads 18 GB). Lock this as an invariant in P0.
+- **Make `mmq_y` decode-selectable** (`get_mmq_y_host`/`get_mmq_y_device`, L143/L157). Today pinned
+  128; try **64** (and 96) at decode. `mmq_y` is coupled to `nwarps × tile_C::I` via the MMQ
+  static_assert, so this is a **warp/fragment remap** (bounded kernel change), not a pure host switch:
+  fewer N-frags per warp or fewer warps → smaller per-CTA shared → **more resident CTAs → latency
+  hidden by CTA-parallelism**, with **weights still read once** (BW-neutral). This is the primary
+  dense occupancy lever and respects every GB10 rule.
+- **Host-only knobs first (P1, zero kernel):** the `mmq_get_granularity_host` choice (L274 — sets
+  `rows_per_warp=2·granularity`, `ntx`), and the stream-k-vs-xy-tiling threshold (`launch_mul_mat_q`
+  ~L3954, `tiles_efficiency_percent` L4001). Plus one **empirical A/B**: does eating a 2× weight
+  re-read at `mmq_x=64` buy enough occupancy to net positive? (Diagnostic: if yes, occupancy is badly
+  broken and P2 `mmq_y`↓ has large upside; if no, the tile is already BW-saturated and P2's ceiling is
+  lower.) All behind `GGML_CUDA_FP4_MMQ_Y` / `GGML_CUDA_FP4_GRAN` / `GGML_CUDA_FP4_FORCE_STREAMK`.
+
+### 4.2 FP4-MMA fragment usage
+
+- Fragments stay `tile_A<16,8,int>` / `tile_B<8,8,int>` / `tile_C<16,8,float>` — these match the
+  `m16n8k64` block-scaled FP4 MMA and must not change (they are the instruction shape). At dense M=128
+  all 16 `tile_C::J`-groups are live → **no dense fragment work needed**. The lever is *how many of
+  these tiles are resident per SM* (occupancy), set by `mmq_y`/`nwarps`/granularity, not the fragment
+  shape.
+- MoE: shrink `mmq_x` toward `tile_C::J`=8 so the live frag-group count matches density (§4.3).
+
+### 4.3 MoE M-tile (`MOE_GROUPED_GEMM_SCOPE.md`, partly landed)
+
+- **Patch 0015 already auto-caps `mmq_x`→64 at decode** via per-expert density in `mul_mat_q_case`
+  (the `expert_bounds != nullptr` block, L4118-4165; env `LLAMA_MOE_DECODE_TILE`,
+  `LLAMA_MOE_DENSITY_MAX`). Tighten the decode tile toward **8–16** (= density) and sweep.
+- **Optional [2]: block-padded `mm_ids_helper`** (`mmid.cu`) — pad each expert segment to a multiple
+  of the tile, removing `need_check` masked tails and tightening the stream-k schedule. Medium risk
+  (scatter + write-back masking); behind `LLAMA_MOE_BLOCK_ALIGN`.
+
+### 4.4 Scale handling + the act-quant fusion handoff (the track A → B ABI contract)
+
+- **Weight scales** (`ue4m3`, one per 16 weights) load in `load_tiles_nvfp4_nvfp4` into `x_sc`
+  (`x_u32 + 64 + kbx`), consumed as `scaleA` in `vec_dot_fp4_fp4_mma` and passed as the block-scale
+  operand to `mma_block_scaled_fp4`. **No change** — already a first-class MMA scale operand.
+- **Activation scales** (`ue4m3`) live in the `block_fp4_mmq` y-tile `d4[4]`, consumed as `scaleB`.
+- **The handoff contract:** track B must hold the **`block_fp4_mmq` y-tile layout invariant**
+  (`uint32_t d4[4]` ue4m3 scales + `int8_t qs[128]` = 256 packed FP4, `mmq.cuh:53`). Track A's fused
+  `rms_norm+mul+nvfp4-quant` producer (task 39) writes exactly this struct; track B's "prequantized
+  MMQ consumer" (task 40) makes `mul_mat_q` accept a prebuilt `src1_q8_1` buffer and **skip the
+  `quantize_mmq_fp4_cuda` call** (`mmq.cu:138`/`200`). The numerics must be **bit-identical** to the
+  unfused path (same `e2m1` rounding, same `ue4m3` block scale per 16) so the parity gate stays green
+  with the fusion on or off. B owns the consumer seam; A owns the producer kernel; the `block_fp4_mmq`
+  struct is the frozen interface between them.
+
+### 4.5 GB10-fit rules (binding constraints on every kernel change)
+
+- **Small shared mem + high occupancy.** Do **not** add deep `cp.async` stages or XOR-swizzle shared
+  layouts — they are exactly what collapsed W4A16 on GB10 (`W4A16_MARLIN_KERNEL_PLAN.md`: a 16 KB
+  XOR-swizzle dropped q4_K from 6.63→2.84 TFLOPS).
+- **Preserve the skew-pad** (`MMQ_MMA_TILE_X_K_FP4 = 2·MMQ_TILE_NE_K + 8 + 4`, the `% 8 == 4`
+  padding, `mmq.cuh:221/233`) — conflict-free `ldmatrix` at ~zero shared cost.
+- **Stay on the FP4-MMA path** (`block_fp4_mmq` / `mma_block_scaled_fp4`) — the only path at GB10's
+  FP4 = 2× INT8/BF16 rate. Never descend to BF16/INT8 (1:1 on GB10).
+- **Occupancy beats a conflict-free-but-wide layout.** Buy latency-hiding with *more resident CTAs*
+  (smaller `mmq_y`, smaller shared), not a deeper pipeline.
+- Tuning is **empirical** — `nsys` (throughput) is available, **`ncu` is not** on the DGX (no driver
+  perms). Sweep configs, measure decode_agg, bracket thermals (same-session cold A/B only).
+
+---
+
+## 5. Correctness / parity gate (every phase)
+
+- **Primary, bit-exact:** `test-backend-ops test -o MUL_MAT -b CUDA0` and
+  `test-backend-ops test -o MUL_MAT_ID -b CUDA0` must stay **1103/1103** with the flag set **and**
+  unset, and **byte-identical** when unset. The CPU reference is the deterministic oracle; the op test
+  is exact (the GB10 greedy-decode non-determinism band applies only to end-to-end, never to the op
+  test).
+- **Add decode-shape cases if absent:** `type_a ∈ {NVFP4, MXFP4}`, `type_b = F32`, dense **n=128** at
+  the real FFN K/N; for `_ID`, `n_mats=128, n_expert_used=8, n_tokens ∈ {8,32,64,128}` **plus ragged
+  small-M** (experts with 0/1/2 tokens, `n_tokens` not a multiple of `mmq_x`) — exactly where `mmq_x`/
+  `mmq_y` changes and block-pad masking can leak.
+- **Fusion-handoff parity (P3):** with track A's fused producer on, the prequantized-consumer path
+  must produce dst **identical** to the unfused `quantize_mmq_fp4_cuda` path (same `e2m1`/`ue4m3`
+  rounding).
+- **End-to-end:** `llama-batched-bench -fa on -npp 512 -ntg 256 -npl 128` on `q36-27b-nvfp4.gguf`
+  (dense) and `q36-35b-a3b-nvfp4.gguf` (MoE); confirm decode_agg climbs per §6 and output stays within
+  the documented CUDA batch-shape non-determinism band vs the CPU oracle. All scripts **dev-tree-only**.
+
+---
+
+## 6. Phased plan, with expected decode_agg at each phase
+
+Per-step model used (ms @npl128): **dense 795** = GEMM 471 + act 65 + GDN 83 + attn 14 + rest 162;
+**MoE 384** = GEMM 227 + act 31 + GDN 38 + attn 8 + rest 81. `decode_agg = 128 / step_s`.
+
+### DENSE (parity target 391)
+
+| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 391** | risk |
+|---|---|---:|---:|---:|---:|---|
+| **P0** harness | Lock baseline: 1103/1103, decode n=128 perf, nsys window, the 471 ms / 2.9% eff datum. Pin `mmq_x=128` one-read invariant. | 471 | 795 | **161** | 41% | low |
+| **P1** host-only tile/grid + re-read A/B | granularity + stream-k threshold sweep; the `mmq_x=64` re-read-vs-occupancy diagnostic. **Honest: small** — `mmq_x` is pinned, so this mostly de-risks P2. | ~400 | ~724 | **~177** | ~45% | low |
+| **P2** `mmq_y`↓ + occupancy/shallow-prefetch | The make-or-break: raise resident CTAs (`mmq_y` 128→64, granularity, shallow 2-stage weight prefetch, skew-pad), push GEMM toward the **66–81 ms BW floor (17–21% FP4 eff)**. **KILL-GATE: if eff plateaus <15% (GEMM >110 ms) → dense parity OFF, report partial.** | **66–81** | 390–405 | **316–328** | **81–84%** | **med-high** |
+| **P3** co-land track A | Consume A's prequantized `block_fp4_mmq` y-tile; the 65 ms act bucket folds away. | 66–81 | **325–340** | **376–394** | **96–101%** | low |
+
+Dense climb: **161 → ~177 → 316–328 → 376–394** tok/s = **41% → 45% → 81–84% → 96–101% of vLLM 391.**
+Robust to the 273-vs-216 GB/s uncertainty (@216 GB/s, 88 ms floor: P3 → ~369 tok/s = 94%). **Parity within error,
+contingent on P2 clearing the kill-gate and on A landing.**
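+
+The dense table is a direct application of the per-step model. A minimal sketch that recomputes
+the P0, P2 and P3 rows from the bucket times given above (names such as `DENSE_MS` and `step_ms`
+are illustrative):
+
```python
NPL = 128
VLLM_DENSE = 391.0
# Dense per-step buckets in ms at npl128, as listed in the per-step model.
DENSE_MS = {"gemm": 471.0, "act": 65.0, "gdn": 83.0, "attn": 14.0, "rest": 162.0}


def decode_agg(step: float) -> float:
    """Aggregate decode tok/s for a step time in milliseconds."""
    return NPL / (step / 1e3)


def step_ms(gemm_ms: float, act_fused: bool = False) -> float:
    """Step time with a replaced GEMM bucket; act_fused drops the act-quant bucket."""
    buckets = dict(DENSE_MS, gemm=gemm_ms)
    if act_fused:
        buckets["act"] = 0.0
    return sum(buckets.values())


phases = {
    "P0 baseline": step_ms(471.0),
    "P2 GEMM 66 ms": step_ms(66.0),
    "P2 GEMM 81 ms": step_ms(81.0),
    "P3 GEMM 66 ms": step_ms(66.0, act_fused=True),
    "P3 GEMM 81 ms": step_ms(81.0, act_fused=True),
}
for name, ms in phases.items():
    agg = decode_agg(ms)
    print(f"{name:14s} step {ms:4.0f} ms  {agg:4.0f} tok/s  "
          f"{100 * agg / VLLM_DENSE:5.1f}% of vLLM")
```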
+
+### MoE (parity target 811)
+
+| phase | work | GEMM ms | step ms | **decode_agg** | **% of vLLM 811** | risk |
+|---|---|---:|---:|---:|---:|---|
+| **P0** harness | Lock 1103/1103 + the monotonic `85→1771` batched-bench curve + 227 ms / 35%-BW datum. | 227 | 384 | **333** | 41% | low |
+| **P1/P4** MoE `mmq_x`↓ (patch 0015 → tighten to 8–16) | Free per-expert tile shrink (no re-read); reclaim the 3–6% fill waste, raise occupancy. | ~140 | ~297 | **~431** | ~53% | low |
+| **P2** block-pad align + occupancy | Remove `need_check` tails, tighten stream-k; push toward the 80 ms floor. | ~100 | ~257 | **~498** | ~61% | med |
+| **P3** co-land track A | act bucket (31 ms) folds away; GEMM at the ~80 ms floor. | 80 | **207** | **618** | **76% — CEILING** | low |
+
+MoE climb: **333 → ~431 → ~498 → 618** tok/s = **41% → 53% → 61% → 76% of vLLM 811.** **The 76% is the
+hard ceiling from the GEMM track:** even a *perfect* weight-read-floor grouped GEMM leaves llama's
+non-GEMM (GDN 38 + attn 8 + rest 81 = 127 ms) at **1.6× vLLM's whole ~78 ms non-GEMM**, so the step
+cannot drop below ~207 ms. The remaining ~49 ms to vLLM's 158 ms step is elementwise + host-loop
+(GDN state I/O is intrinsic and vLLM pays it identically — `GDN_DECODE_VERIFY.md`), **outside track B.**
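+
+Another way to see the ceiling is to ask how much GEMM time parity would leave. The sketch below
+assumes the non-GEMM buckets stay at 127 ms after track A lands and uses the 273 GB/s spec
+bandwidth; the constant names are illustrative.
+
```python
NPL = 128
VLLM_MOE = 811.0
NON_GEMM_MS = 38.0 + 8.0 + 81.0   # GDN + attn + rest, act bucket already fused away
WEIGHT_BYTES = 22e9               # ~98% of experts touched per step
BW_SPEC = 273e9

parity_step_ms = NPL / VLLM_MOE * 1e3           # step time that equals vLLM's rate
gemm_budget_ms = parity_step_ms - NON_GEMM_MS   # what is left for the GEMM
floor_ms = WEIGHT_BYTES / BW_SPEC * 1e3         # fastest possible weight read
ceiling_tok_s = NPL / ((floor_ms + NON_GEMM_MS) / 1e3)

print(f"parity step {parity_step_ms:.1f} ms leaves {gemm_budget_ms:.1f} ms for the GEMM")
print(f"weight-read floor is {floor_ms:.1f} ms, so the GEMM track tops out near "
      f"{ceiling_tok_s:.0f} tok/s ({100 * ceiling_tok_s / VLLM_MOE:.0f}% of vLLM)")
```
+
+The GEMM budget at parity comes out well below the weight-read floor, so no GEMM-only change can
+close the gap while the non-GEMM buckets stay where they are.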
+
+### Explicitly NOT in scope (and why)
+
+- A from-scratch W4A16 / CUTLASS SM120 collective — repeats the STOPPED occupancy dead-end and
+  CUTLASS's grouped FP4 is broken on sm_121.
+- Deep multi-stage `cp.async` / XOR-swizzle — proven to collapse GB10 occupancy.
+- "Make activations 4-bit" — already W4A4; no work, no win there.
+- The non-GEMM MoE residual (elementwise, host CUDA-graph, GDN bf16 state) — needed for MoE parity but
+  **separate tracks**; B owns the GEMM only.
+
+---
+
+## 7. The honest ceiling — does B reach TRUE PARITY?
+
+- **DENSE: TRUE PARITY is PLAUSIBLY REACHABLE, conditional, no margin.** The entire 2.42× gap is the
+  GEMM bucket; its ideal floor (66 ms) is 7× below the current 471 ms and is **bandwidth-bound, not
+  hardware-capped**. **B (GEMM → BW floor) + A (act-fuse) lands 376–394 tok/s = 96–101% of vLLM 391.**
+  The catch: it needs **~17–21% FP4-MMA efficiency at decode M=128**, and GB10 has only demonstrated
+  ~17% — and that at the *easier* prefill M=512 tile. It is a **reach, not a lock**, gated by the P2
+  occupancy kill-gate and contingent on track A. **GO (conditional).**
+
+- **MoE: full parity is NOT reachable from track B.** Realistic ceiling **~76% of vLLM (618 vs 811)**
+  even with a perfect weight-read-floor grouped GEMM, because (1) the MoE floor is the hardest
+  grouped-GEMM regime (M≈4/expert, vLLM ships purpose-built Marlin-NvFp4) and (2) ~24% of the
+  ~207 ms floor step (~49 ms) is non-GEMM excess outside this track. Worth doing (333 → ~618, a 1.85× and a real win), but it **cannot
+  deliver 811 alone.** **PARTIAL / NO-GO for parity-from-B.**
+
+- **The 273 GB/s is not the ceiling — the GB10 FP4-MMA occupancy efficiency is.** Decode M=128 is a
+  *different* regime from the dead W4A16 path: bandwidth/occupancy-bound (saturate LPDDR5x at a thin
+  M-tile via resident CTAs), not compute-throughput-bound (pack MMAs). The existing path is already at
+  the BW floor at batch 1 (88 ms), so the work is **keeping it bandwidth-bound as M grows to 128**
+  (occupancy via `mmq_y`↓ + shallow prefetch), a **tune of a working path**, not the greenfield
+  rewrite. The binding risk is whether that occupancy can be bought without tripping the GB10 wall —
+  which is exactly what the P2 kill-gate measures.
+
+**Bottom line for the "TRUE PARITY" ask:** GB10 **can** plausibly deliver **dense** decode parity with
+vLLM via a tuned FP4-MMA decode GEMM **+ track A**, at the top of the demonstrated efficiency envelope
+with no margin. GB10 **cannot** deliver **MoE** decode parity from the GEMM track alone (ceiling ~76%);
+MoE parity is a B-plus-non-GEMM program. **Verdict: GO for dense (conditional, B+A, kill-gated),
+PARTIAL for MoE.**
+
+---
+
+## 8. One-paragraph summary
+
+The decode GEMM at M=128 is **bandwidth-bound on paper** (crossover M*≈611 ≫ 128) with weight-read
+floors ~5× (dense) and ~2× (MoE) above vLLM, so **273 GB/s is not the wall**. But llama's FP4-MMA
+kernel runs at ~3% of FP4 peak, in **self-inflicted compute-bound territory** (471 ms vs a 66 ms floor). The path is already
+**W4A4** and already **beats vLLM at batch-1 prefill**, so the fix is **tuning the existing
+`mul_mat_q<NVFP4>`**, not a cutlass rewrite (a proven GB10 dead-end, and broken on sm_121 anyway). The
+M-tile asymmetry sets the levers: **dense** is pinned at `mmq_x=128` (one weight read) so its occupancy
+win is **`mmq_y`↓ + shallow prefetch** (BW-neutral), while **MoE**'s win is the free per-expert
+**`mmq_x`↓** (patch 0015). **Track B (GEMM → BW floor) + track A (fuse act-quant)** plausibly reaches
+**96–101% of vLLM dense (391)**: TRUE PARITY is on the table for dense, but only at the **top of the
+demonstrated GB10 FP4-efficiency envelope (~17–21%)**, with **no margin**, gated by the P2 occupancy
+kill-gate. **MoE parity is not reachable from the GEMM alone** (ceiling ~76% of 811), because its floor
+sits in the hardest grouped-GEMM regime and ~24% of its step is non-GEMM. **Verdict: GO for dense
+(conditional, B+A), PARTIAL for MoE.**
+
+---
+
+## 9. Adversarial review (skeptical staff CUDA engineer, post-W4A16): the parity go / no-go
+
+Reviewer stance: I lived through the W4A16 GB10 effort that plateaued at ~9-15 TFLOP/s (~21% of the
+BF16 ceiling) after multi-week work and was STOPPED at the occupancy wall. I read this scope and the
+grounding (`QWEN36_NVFP4_BENCH`, `VLLM_DECODE_GROUNDING`, `GDN_DECODE_VERIFY`, `DECODE_GAP_STUDY`,
+`BLACKWELL_KERNEL_GAPS`, `W4A16_MARLIN_KERNEL_PLAN`) and stress-tested the verdict against them. Net:
+the plan is **directionally right and tractably scoped**, the kernel-approach decision (tune, do not
+rewrite) is correct, but the **"GO for dense, TRUE PARITY 96-101%" headline outruns its own caveats**.
+The honest landing is **dense ~80-90% (parity is the optimistic tail), MoE ~55-65% (parity not
+reachable from B)**. The decision to commit to B is nonetheless sound, for a reason the doc under-sells
+(low regret), and there is **one technical gap (TMA) and one sequencing error (A last) that must be
+fixed**.
+
+### 9.1 Is this the W4A16 wall again? No - and the batch-scaling signature proves why
+
+The decisive evidence the doc has but does not fully exploit is the **npl-sweep** (`QWEN36_NVFP4_BENCH`):
+dense llama-as-%-of-vLLM = **99 / 56 / 46 / 41** at npl 8 / 32 / 64 / 128. At **npl8 the kernels are at
+parity** (99%); the gap **opens monotonically as M grows**. Decompose this:
+
+- At M=8 the dense GEMM is weight-read-bound at the floor (~88 ms, same as batch-1). llama == vLLM there,
+  so **llama's FP4-MMA kernel demonstrably HITS the weight-read floor at small M.** This is the existence
+  proof the W4A16 path never had: it is a *working, floor-reaching* FP4-MMA kernel, not a greenfield
+  build stuck at 1/4 of MMQ.
+- At M=128 vLLM's GEMM **stays at ~88 ms** (flat: it amortizes the one weight read over 128 tokens and
+  hides the MMA behind the load), while **llama's balloons to 471 ms** (5.4x). llama **falls off the
+  floor** as M grows; vLLM **holds it**.
+
+So the problem is **not** "build a fast 4-bit GEMM from scratch on an occupancy-hostile part" (the dead
+W4A16 problem). It is **"keep a working FP4-MMA kernel on the bandwidth floor as the M-tile grows from 8
+to 128"** - a tune of a working path. **Verdict: this is NOT the W4A16 wall** (different regime, working
+path, dual existence proof at M=8 and from vLLM at M=128). **But it shares W4A16's one binding
+constraint:** holding the floor as M grows requires hiding LPDDR5x weight-load latency at the larger
+tile, which is the same occupancy / latency-hiding game GB10 historically loses. The doc is right that
+it is a different and more tractable regime; it under-states that the *binding risk is identical*.
+
+### 9.2 Why is vLLM 2.4x faster if both share 273 GB/s? Compute-side scheduling, and the gap is ~82% (not 100%) GEMM
+
+The load-bearing question, settled by 9.1: at M=128 the gap is **not** that vLLM beats the shared
+bandwidth floor - it is that **llama falls off the floor into self-inflicted compute/occupancy-bound
+territory while vLLM stays on it.** The lever is therefore latency-hiding at the M=128 tile
+(compute-side scheduling: occupancy, prefetch, tile shape), with the 273 GB/s weight-read floor as the
+hard target both engines share. This confirms the doc's roofline and its central claim that the kernel,
+not the hardware, is the limiter.
+
+**But the doc's "the entire 2.42x dense gap is the GEMM" is an ~82% truth, not a 100% one.** Decompose
+the dense step (numbers from the doc's own inputs):
+
+```
+llama step @npl128            795 ms   (decode_agg 161)
+vLLM step  @npl128            328 ms   (decode_agg 391)
+total gap                     467 ms
+
+llama GEMM                    471 ms
+vLLM GEMM (at the floor)      ~66-88 ms   (66 @273 GB/s spec, 88 @216 GB/s achieved)
+=> GEMM gap                   383-405 ms  = 82-87% of the 467 ms total gap
+=> non-GEMM gap                62-84 ms   = 13-18% of the total gap
+```
+
+So **B alone (GEMM -> floor) caps near ~80-84%** (step 412-390 ms = 311-328 t/s), **not parity.** Parity
+needs the non-GEMM 62-84 ms too: ~65 ms of it is track A's act-quant bucket, the residual ~0-19 ms is
+elementwise + host outside both A and B. This is the crux of the sequencing answer (9.6): **B is
+necessary but on its own lands ~80%; it is track A that tips dense over the parity line, not B.** The
+parity story is *entirely* contingent on A, which the P3 framing buries.
+
+### 9.3 The sharpest risk the doc misses: vLLM's existence proof uses the technique the doc forbids (TMA)
+
+vLLM holds the M=128 floor with **cutlass SM120 = TMA + a warp-specialized deep async producer/consumer
+pipeline** (Research 1). That deep pipeline is **exactly what the doc forbids on GB10** (rule 4.5: "do
+not add deep cp.async stages ... they collapsed W4A16"). So **B's chosen GB10-friendly route (`mmq_y`-down
+occupancy + a shallow 2-stage prefetch) is a different bet from the one that produced the existence
+proof.** Reaching the same floor by a friendlier route is plausible but **unproven**, and if the
+occupancy-only route plateaus short of the floor, B underperforms its target with no fallback in scope.
+
+The doc conflates two different things under "deep pipeline":
+- **manual `cp.async` + XOR-swizzle** - register/shared-hungry, **collapsed W4A16 occupancy on GB10**
+  (correctly banned).
+- **TMA (tensor-memory-accelerator) bulk async copy** - a single descriptor drives the copy, **far lower
+  register/occupancy cost**, and it is precisely how cutlass gets pipeline depth **without** the
+  occupancy hit (Research 1 says this explicitly). TMA is available on sm_120/121.
+
+**Recommendation (binding):** B must put a **TMA-driven weight feed in scope as a first-class P2 option**,
+not categorically forbid pipeline depth. The occupancy-only route is the right *first* experiment
+(cheapest, respects the W4A16 lesson), but if P2 plateaus below the floor, **TMA is the demonstrated way
+to get depth without the occupancy collapse** and is what the vLLM existence proof actually uses.
+Declaring the floor "unreachable" without trying TMA would repeat the W4A16 mistake in reverse:
+abandoning the path that works because the *manual* version of it failed.
+
+### 9.4 Tractability: bounded tune, confirmed - with the TMA caveat
+
+The proposed changes are genuinely **bounded and build-ready**, not a greenfield kernel:
+- **MoE arm = DEMONSTRATED tractable.** Patch 0015 already auto-caps `mmq_x` per-expert and is committed
+  and measured. Tightening to 8-16 + block-pad is the same lever, lower risk. This is real, banked
+  evidence that the "tune `mul_mat_q`" approach works on this exact kernel family.
+- **Dense arm = plausibly bounded.** `mmq_y`-down is a warp/fragment remap that touches the
+  `nwarps x tile_C::I == mmq_y` static_assert coupling, so it is a contained *kernel* edit (not a pure
+  host switch, as the doc itself notes). The host-only P1 knobs are zero-risk. The **prefetch piece is
+  where the residual occupancy risk lives** - and per 9.3, TMA belongs here.
+- **Rejecting (B) cutlass-rewrite and (C) BF16-Marlin-descent is correct.** Cutlass grouped FP4 is broken
+  on sm_121 (the reason vLLM itself falls to Marlin for MoE); BF16 Marlin concedes GB10's 2x FP4 edge.
+
+**Verdict: tractable, not greenfield.** The MoE arm is proven; the dense arm is a contained edit with a
+real but bounded occupancy risk, gated by the P2 kill-gate. The one scope gap is TMA (9.3).
+
+### 9.5 Honest expected outcome (the numbers I would defend)
+
+| | B alone | B + A (median) | B + A (optimistic, spec BW) | parity? |
+|---|---:|---:|---:|---|
+| **DENSE** (target 391) | ~80-84% (311-328 t/s) | **~92-95% (360-372 t/s)** | ~101% (394 t/s) | **optimistic tail only** |
+| **MoE** (target 811) | ~53-61% (431-498 t/s) | **~70-76% (570-618 t/s)** | 76% (618 t/s, CEILING) | **no** |
+
+Reconciliation with the doc: the doc's B+A = "96-103%" uses the **spec-BW (66 ms floor)** end. At the
+**achieved 216 GB/s (88 ms floor)** the same arithmetic gives **~94%**, and that still assumes B hits the
+floor. So the honest dense median is **~92-95%, with TRUE PARITY as the upside, not the expectation**,
+contingent on a conjunction of three things: (a) P2 clears the occupancy kill-gate to the floor, (b) the
+GB10-friendly *or* TMA feed actually reaches the cutlass floor (9.3), and (c) track A lands. Three ANDs =
+tail, not median.
+
+**The low-regret point the doc under-sells (and the real reason to commit):** even the *kill-gate-tripped*
+outcome is a large win. At the doc's own 15%-FP4-eff kill threshold (GEMM ~110 ms), B+A still lands
+**~89%** (step 369 ms); at a merely-partial occupancy win (eff 3% -> 5%, GEMM ~276 ms) B+A still lands
+**~61%**. Since the M=8 parity proof guarantees the floor is reachable in principle and patch 0015 proves
+the tune works, **getting *some* improvement at M=128 is high-probability; the only open question is how
+close to the floor.** So the outcome distribution is heavily positive (very likely 60-90%, possibly
+parity) with a bounded downside - B is **low-regret**, which matters more for the go decision than whether
+the parity tail hits.
+
+### 9.6 Sequencing vs track A: land A FIRST (the doc has this backwards)
+
+The doc runs A as a parallel track merging at **P3 (last)**. That is backwards for de-risking, for three
+reasons:
+1. **A defines B's interface.** B's "prequantized-MMQ consumer" consumes A's fused `block_fp4_mmq`
+   producer (the frozen struct in 4.4). Building B against a not-yet-landed producer means B's consumer
+   seam is speculative until P3.
+2. **A defines B's baseline and the kill-gate threshold.** A alone (act-fuse, folding the 65 ms / 8.2%
+   bucket, plus any of the elementwise/host it captures) plausibly moves dense **41% -> ~50-55%** before
+   B touches a kernel. B's *true residual is the GEMM after A removed the act round-trip*, not the raw
+   59%. Running B's P2 against the stock 41% baseline mis-sizes the required GEMM speedup and the
+   <15%-eff kill-gate.
+3. **A is lower-risk and independently shippable.** It is the safe win; it should not wait behind the
+   risky kernel tune.
+
+**Recommendation:** land A (tasks 38-41) first, **re-measure** the decode_agg and the GEMM share
+post-A, **then** run B's P2 and recompute the kill-gate against the post-A number. This makes the
+make-or-break decision cheaper, better-informed, and bankable-either-way.
+
+### 9.7 Verdict (go / no-go)
+
+- **DENSE: CONDITIONAL GO - commit to B, but scope and message it as "close most of the GEMM gap"
+  (expected ~80-90%, parity the upside), NOT "true parity."** Justified because: the approach is
+  bounded/tractable (9.4), it is a working-path tune with a dual existence proof (9.1), and the outcome
+  is low-regret (9.5) - even a tripped kill-gate roughly doubles today's 41%. Conditions: (i) **land A
+  first** (9.6); (ii) **gate hard at P2** (eff < 15% -> stop chasing parity, but keep the partial win);
+  (iii) **put TMA in scope** as the floor-reaching fallback before declaring the floor unreachable (9.3).
+
+- **MoE: NO-GO for parity from B (confirmed).** The doc's ~76% ceiling is honest, arguably optimistic
+  (it assumes the ragged M~4/expert grouped GEMM hits its 80 ms floor, the hardest regime, where vLLM
+  ships purpose-built Marlin). Realistic B+A landing **~70-76%**, B alone ~55-61%. Still worth doing -
+  the `mmq_x`-down / block-pad work is cheap and partly landed (patch 0015) - but it must be sold as a
+  **1.7-1.85x win, not parity**; MoE parity is a **B-plus-non-GEMM** program (elementwise fusion, host
+  CUDA-graph, GDN bf16 state).
+
+- **One line for the parent:** GB10 can plausibly reach **dense** decode parity with vLLM only at the
+  **top of its FP4 envelope and only as B + A together** (B alone caps ~80%; A is what tips it over),
+  and **cannot** reach **MoE** parity from the GEMM track alone (ceiling ~76%). **Commit to B** as a
+  high-value, low-regret, bounded GEMM-gap-closing tune (honest expected landing **dense ~80-90%, MoE
+  ~55-65%**), **sequence track A first**, **gate at P2**, and **add a TMA weight-feed option** so the
+  occupancy-only route is not the only shot at the floor that vLLM's TMA pipeline demonstrably reaches.
--- a/backend/cpp/llama-cpp/patches/paged/GDN_DECODE_VERIFY.md
+++ b/backend/cpp/llama-cpp/patches/paged/GDN_DECODE_VERIFY.md
@@ -0,0 +1,208 @@
+# GDN decode verify: is llama.cpp's Gated-Delta-Net decode O(1) or an O(ctx) re-scan?
+
+Verdict-first, then the evidence. This closes lever 5 of `VLLM_DECODE_GROUNDING.md` ("Verify
+llama's GDN/linear-attention decode path"): on the Qwen3.6 hybrid models, is llama re-scanning the
+context (O(ctx)) in the linear-attention layers, or keeping vLLM's O(1)-in-context recurrent state?
+
+Method: GGUF-metadata + source reading on the `paged` dev tree (`~/llama-paged-dev`, build-cuda
+sm_121) on `dgx.casa`, plus nsys CUDA-kernel decode traces on `~/bench/q36-27b-nvfp4.gguf`
+(GB10 / DGX Spark, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`). Models:
+`~/bench/q36-27b-nvfp4.gguf` (dense, arch `qwen35`), `~/bench/q36-35b-a3b-nvfp4.gguf`
+(MoE, arch `qwen35moe`).
+
+## TL;DR verdict
+
+**llama.cpp's GDN decode is EFFICIENT: it is O(1)-in-context, a single fused CUDA kernel that
+reads + updates a fixed-size cached recurrent state, structurally identical to vLLM's
+`fused_recurrent_gated_delta_rule`. It is NOT a re-scan, NOT a context-scaling blowup, and NOT a
+major contributor to the ~2.4x eager-decode gap.** There is no GDN-specific bottleneck to fix, so
+the cheap model-specific lever this probe was hunting for does not exist. The 2.4x is the general
+kernel work (the FP4 weight GEMM, which dominates the step, plus the O(ctx) full-attention decode
+kernel in the minority of full-attention layers), exactly as `VLLM_DECODE_GROUNDING.md` concluded.
+
+The decisive datum: at matched batch (npl4), pure decode, 4x more context, the GDN kernel time is
+**flat** while the full-attention kernel grows ~3.1x:
+
+| kernel | ctx 1024 | ctx 4096 | ratio | meaning |
+|--------|---------:|---------:|------:|---------|
+| `gated_delta_net_cuda` (GDN linear-attn) | 10.3 us/launch | 8.0 us/launch | **0.78x (flat)** | **O(1) in ctx** |
+| `flash_attn_tile` (full-attn layers) | 27.1 us/launch | 85.0 us/launch | **3.1x** | O(ctx), as expected |
+| total ms / decode step | 84.9 | 86.0 | 1.01x | GEMM-bound, ctx-independent |
+
+Identical decode-step counts in both windows (~190 steps, ~9134 GDN launches), so this is a
+per-step like-for-like comparison: the GDN layers do **not** get more expensive as context grows.
+
+## 1. Architecture (confirmed from GGUF metadata + tensor names)
+
+Both Qwen3.6 models are hybrid: a `full_attention_interval` of 4 means every 4th layer is standard
+full attention and the other 3/4 are Gated-Delta-Net (GDN) linear attention with a recurrent state.
+
+**Dense Qwen3.6-27B (`general.architecture = qwen35`):**
+- `block_count = 64`, `full_attention_interval = 4` -> **16 full-attention layers + 48 GDN layers**.
+- Full-attn: `head_count = 24`, `head_count_kv = 4` (GQA), `key_length = value_length = 256`,
+  rope `freq_base = 1e7`, mrope sections `[11,11,10,0]`.
+- GDN/SSM: `ssm.state_size = 128`, `ssm.conv_kernel = 4`, `ssm.group_count = 16`,
+  `ssm.time_step_rank = 48`, `ssm.inner_size = 6144`. So the recurrent state per GDN layer is
+  `[S_v=128, S_v=128, H_v=48]` per sequence (`H_v = inner_size/state_size = 6144/128 = 48` value
+  heads), i.e. a 128x128 state matrix per head, ~3.1 MB (F32) per sequence per layer.
+
+**MoE Qwen3.6-35B-A3B (`general.architecture = qwen35moe`):**
+- `block_count = 41`, `full_attention_interval = 4` (~10 full-attn + ~31 GDN layers).
+- `head_count = 16`, `head_count_kv = 2`, `key_length = value_length = 256`,
+  `expert_count = 256`, `expert_used_count = 8`, `expert_feed_forward_length = 512`.
+- Same SSM dims: `state_size = 128`, `conv_kernel = 4`, `group_count = 16`,
+  `inner_size = 4096` -> `H_v = 32` value heads.
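+
+The state-size arithmetic above can be checked with a few lines of Python. This is a minimal
+sketch, not engine code: `gdn_state_bytes` is a hypothetical helper, and it assumes the MoE model
+uses the same `[S_v, S_v, H_v]` F32 layout as the dense one (which gives ~2.1 MB per sequence per
+layer, a derived figure rather than a measured one).
+
+```python
+def gdn_state_bytes(inner_size, state_size, bytes_per_elem=4):
+    """Per-sequence, per-layer recurrent state size for an [S_v, S_v, H_v] state."""
+    h_v = inner_size // state_size                      # number of value heads
+    return h_v, state_size * state_size * h_v * bytes_per_elem
+
+
+dense_heads, dense_bytes = gdn_state_bytes(6144, 128)   # qwen35 dense metadata
+moe_heads, moe_bytes = gdn_state_bytes(4096, 128)       # qwen35moe metadata
+
+print(dense_heads, dense_bytes / 1e6)                   # heads, MB (F32)
+print(moe_heads, moe_bytes / 1e6)
+```
+
+Neither figure depends on the context length, only on the metadata dimensions.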
+
+**Tensor names confirm the op split (27B, per-layer dump):**
+- GDN layers (e.g. `blk.0.*`): `ssm_alpha`, `ssm_beta`, `ssm_conv1d`, `ssm_a`, `ssm_dt.bias`,
+  `ssm_norm`, `ssm_out`, plus `attn_qkv` / `attn_gate` (the in/out projections of the linear-attn
+  block). No `attn_k/v/output`, no per-head q/k norm.
+- Full-attn layers (e.g. `blk.3.*`, every 4th): `attn_q`, `attn_k`, `attn_v`, `attn_output`,
+  `attn_q_norm`, `attn_k_norm`. No `ssm_*`.
+
+llama loads the GDN layers through the **recurrent memory** (`llama-memory-recurrent`), not the KV
+cache: the conv state and the SSM state live in `conv_states_all` / `ssm_states_all` and are read
+and written every step. Only the full-attention layers (16 dense / ~10 MoE) use the (paged) KV cache. This is the
+SSM-style recurrent path, not standard attention.
+
+## 2. llama.cpp GDN decode implementation: O(1) recurrent-state update (code-proven)
+
+Graph build (shared by both models): `src/models/delta-net-base.cpp`, dispatched from
+`src/models/qwen35.cpp` and `src/models/qwen35moe.cpp` (the MoE class inherits
+`llm_build_delta_net_base` and calls the same `build_recurrent_attn`, qwen35moe.cpp:472).
+
+**Decode dispatch (`build_delta_net`, delta-net-base.cpp:425-447):** when `n_seq_tokens == 1`
+(decode), it takes `build_delta_net_fused` if `cparams.fused_gdn_ar` (the default, see below), else
+`build_delta_net_autoregressive`. Both are O(1):
+
+- `build_delta_net_autoregressive` (delta-net-base.cpp:289-371) is the explicit rank-1 recurrence on
+  the fixed-size state `s` shaped `[S_v, S_v, H_v, n_seqs]`: `s *= exp(g)` (decay),
+  `sk = sum_rows(s * k)`, `d = (v - sk^T) * beta`, `s += k (x) d^T` (rank-1 update),
+  `o = sum_rows(s * q)`. **No loop over past tokens, no KV read** - it touches only the state and
+  the single new token's q/k/v/g/beta. `GGML_ASSERT(n_tokens == 1)`.
+- `build_delta_net_fused` (delta-net-base.cpp:373-423) collapses the same recurrence into one op,
+  `ggml_gated_delta_net(q, k, v, g, b, s, K=1)`.
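+
+The recurrence in the bullets above can be sketched for a single head in plain Python. This is
+an illustration under stated assumptions, not the ggml graph: `gdn_decode_step` is a hypothetical
+name, and the index convention (`s[i][j]` with `i` the key index and `j` the value index) is
+chosen for readability rather than taken from the source.
+
+```python
+import math
+
+
+def gdn_decode_step(s, q, k, v, g, beta):
+    """One gated delta-rule step for one head and one new token.
+
+    s is a fixed-size state (list of rows, s[i][j]); q and k have len(s)
+    entries, v has len(s[0]) entries; g and beta are scalars.
+    Returns (output, new_state). No context length is involved.
+    """
+    n_k, n_v = len(k), len(v)
+    decay = math.exp(g)
+    s = [[decay * x for x in row] for row in s]                           # s *= exp(g)
+    sk = [sum(s[i][j] * k[i] for i in range(n_k)) for j in range(n_v)]    # s^T k
+    d = [(v[j] - sk[j]) * beta for j in range(n_v)]                       # delta
+    s = [[s[i][j] + k[i] * d[j] for j in range(n_v)] for i in range(n_k)]  # rank-1 update
+    o = [sum(s[i][j] * q[i] for i in range(n_k)) for j in range(n_v)]     # s^T q
+    return o, s
+
+
+# With no decay (g = 0), beta = 1 and a unit-norm key, reading back with q = k returns v.
+state = [[0.0, 0.0], [0.0, 0.0]]
+o1, state = gdn_decode_step(state, [1.0, 0.0], [1.0, 0.0], [3.0, -2.0], 0.0, 1.0)
+o2, state = gdn_decode_step(state, [0.6, 0.8], [0.6, 0.8], [1.0, 5.0], 0.0, 1.0)
+```
+
+The function touches only the state and the new token's inputs, so its cost is set by the state
+shape, which is the O(1)-in-context property the section argues for.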
+
+**State is cached across steps, not rebuilt (`build_recurrent_attn`, delta-net-base.cpp:527-606):**
+the input state `s` is read from `ssm_states_all` via `build_rs`, and the new state is copied back
+with `ggml_cpy(new_state, view(ssm_states_all, ... kv_head ...))` (lines 555-558). The causal-conv
+state is handled the same way in `build_conv_state` (449-525): the previous `conv_kernel-1 = 3`
+samples are read from `conv_states_all`, the new token is appended, and the last 3 are written back.
+So both pieces of GDN state persist in the recurrent cache exactly like a KV cache persists tokens -
+this is the recurrent analogue, fixed size, independent of context length.
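+
+The conv-state handling described above amounts to a fixed-length rolling window. A minimal
+single-channel sketch follows; `conv_state_step` is a hypothetical helper, and the weight ordering
+(oldest sample first) and the absence of bias or activation are simplifying assumptions.
+
+```python
+def conv_state_step(conv_state, x_new, weights):
+    """Causal conv for one channel at decode.
+
+    conv_state holds the previous len(weights) - 1 samples, oldest first.
+    Returns (y, new_state); new_state keeps the same fixed length.
+    """
+    window = conv_state + [x_new]                  # append the new token's sample
+    assert len(window) == len(weights)
+    y = sum(w * x for w, x in zip(weights, window))
+    return y, window[1:]                           # keep only the last conv_kernel - 1
+
+
+# conv_kernel = 4, so the cached state is 3 samples.
+y, new_state = conv_state_step([1.0, 2.0, 3.0], 4.0, [0.5, 0.5, 0.5, 0.5])
+```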
+
+**Defaults (`src/llama-context.cpp:200-201`):** `cparams.fused_gdn_ar = true` and
+`fused_gdn_ch = true`. They are only auto-disabled if the fused op cannot be scheduled on the same
+device as the layer (`device_gdn != device_kv`, lines 540-595); on a single GB10 with `-ngl 99`
+that does not happen, so the **fused single-kernel path is what runs**.
+
+**The CUDA kernel (`ggml/src/ggml-cuda/gated_delta_net.cu`) is the crux, and it is unambiguously
+O(1) in context:**
+- Launch grid `dim3(H, n_seqs, ceil(S_v/4))` and block `(min(warp,S_v), 4, 1)` (lines 184-185):
+  the grid spans heads x sequences x state-columns. **There is no context-length dimension and no
+  context-length argument anywhere in the kernel signature** (q/k/v/g/beta are the new token(s)
+  `[S_v, H, n_tokens, n_seqs]`; `curr_state` is the fixed `[S_v, S_v, H, n_seqs]`).
+- Each warp loads its shard of the fixed-size state into registers **once** (lines 57-61), then
+  loops `for (t = 0; t < n_tokens; t++)` (line 63). At decode `n_tokens == 1`, so it is a single
+  iteration: read the one new token, do the rank-1 update
+  `s_shard[r] = g * s_shard[r] + k[i] * delta_col` and the readout `attn = S^T q` (lines 84-141),
+  then write the updated state back (lines 161-167). No second loop, no read of any past KV.
+- Work per decode step is therefore proportional to `S_v * S_v * H * n_seqs` (the state size x
+  batch) and **constant in context length**. This is precisely vLLM's
+  `fused_recurrent_gated_delta_rule_packed_decode_kernel` (one batched launch updating a
+  fixed-size `[K,V]` state) cited in the grounding doc.
+
+A chunked GPU kernel for prefill is a TODO (delta-net-base.cpp:181 `//TODO: Add chunked kernel`);
+the chunked CPU/graph path (`build_delta_net_chunking`) only runs for multi-token ubatches
+(prefill), never at decode.
+
+## 3. nsys decode profiling: GDN is a small share and does not scale with context
+
+Qwen3.6-27B NVFP4, sm_121, `GGML_CUDA_DISABLE_GRAPHS=1`, paged KV, `-fa on`, `llama-server` driven
+to steady decode by a looping completion client. Kernel time bucketed by name (full classifier and
+sqlites under `~/bench/gdn_study/`).
+
+**(a) Share at the headline batch (npl128, ctx 1024), GPU 92.7% busy:**
+
+| bucket | % of busy | us/launch |
+|--------|----------:|----------:|
+| GEMM_weight (`mul_mat_q`/`mul_mat_vec_q`) | 59.2 | - |
+| **GDN_recurrent (`gated_delta_net_cuda`)** | **8.9** | 369 |
+| GEMM_act_quant (`quantize_mmq_nvfp4`) | 8.2 | - |
+| elementwise / act_glu / norm / rope | ~13.5 | - |
+| embed_gather (`get_rows`) | 2.9 | - |
+| **ATTENTION_full (`flash_attn`, 16 layers)** | **1.8** | 107 |
+| copy_cast (`cpy`) | 1.8 | - |
+| **GDN_conv (`ssm_conv`)** | **1.5** | - |
+
+The whole GDN path (recurrent 8.9% + conv 1.5%) is ~10% of the step; full attention is ~2%; the
+**weight GEMM dominates at ~67% (59.2% GEMM + 8.2% act-quant requant)**. This is the dense model,
+where the grounding predicted the GEMM would be the lever.
+
+**(b) Share at low batch (npl32, ctx 1024), weight-bandwidth (GEMV) regime, GPU ~100%:**
+GEMM_weight 88.7%, GDN_recurrent 0.8%, ATTENTION_full 0.7%, GDN_conv 0.3%. At low batch the
+weight-read GEMV swamps everything and GDN is negligible; the GDN share tracks the batch, not the
+context.
+
+**(c) Context-scaling control (the decisive test): matched batch npl4, pure decode, ctx 1024 vs
+4096.** Small batch -> fast prefill -> a clean pure-decode capture (verified: GEMM is the M=1
+`mul_mat_vec_q` decode GEMV, and the client completed decode rounds inside the window). Identical
+decode-step counts (~190 steps, gated_delta_net launched 9141 vs 9134 times), so per-launch time is
+a true per-step comparison:
+
+| kernel / bucket | ctx 1024 | ctx 4096 | ratio |
+|-----------------|---------:|---------:|------:|
+| `gated_delta_net_cuda` us/launch | 10.3 | **8.0** | **0.78x (flat)** |
+| GDN_recurrent share | 0.6% | 0.4% | flat/down |
+| `ssm_conv` (GDN_conv) us/launch | 5.2 | 5.2 | 1.00x |
+| `flash_attn_tile` us/launch | 27.1 | **85.0** | **3.14x** |
+| ATTENTION_full share | 0.6% | 1.8% | 3.0x up |
+| total ms / decode step | 84.9 | 86.0 | 1.01x |
+
+The GDN kernel time is flat (even a hair faster) across a 4x context increase, while the
+full-attention kernel grows ~3x, exactly the O(1)-vs-O(ctx) signature. The total step time barely
+moves because at this batch the (context-independent) FP4 weight GEMM is 88% of the step. This is
+the empirical confirmation of the code analysis: **llama's GDN decode does not re-scan the context.**
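+
+The ratio column can be recomputed from the per-launch numbers in the table. The snippet below
+is only a convenience check; the dictionary layout is illustrative and the values are copied from
+the table above.
+
+```python
+# (ctx 1024, ctx 4096) values from the context-scaling table.
+measurements = {
+    "gated_delta_net_cuda us/launch": (10.3, 8.0),
+    "ssm_conv us/launch": (5.2, 5.2),
+    "flash_attn_tile us/launch": (27.1, 85.0),
+    "total ms / decode step": (84.9, 86.0),
+}
+
+ratios = {name: round(c4096 / c1024, 2) for name, (c1024, c4096) in measurements.items()}
+
+for name, ratio in ratios.items():
+    print(f"{name}: {ratio}x over a 4x context increase")
+```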
+
+(An earlier npl32 ctx4096 attempt was discarded: with 32 parallel slots each independently
+prefilling ~4100 tokens, the nsys window caught prefill, not steady decode - the `mul_mat_q(M=128)`
+plus `flash_attn_ext_f16(ctx4096)` signature gave it away. The npl4 runs above avoid this by keeping
+prefill short.)
+
+## 4. Verdict and fix scope
+
+**Efficient, not a bottleneck.** llama.cpp runs the Qwen3.6 GDN/linear-attention layers as a fused,
+single-CUDA-kernel, O(1)-in-context recurrent-state update, with the conv and SSM state cached in
+the recurrent memory across decode steps. It is algorithmically the same as vLLM's O(1)
+`fused_recurrent` decode. The probe's worst case (llama re-scanning context => GDN layers ballooning
+with context and concurrency) is **falsified**: the GDN kernel is flat across 4x context, and the
+op carries no context-length parameter at all.
+
+**So the GDN path is not the cheap model-specific lever.** It is a small-to-moderate, context-flat
+share of the step (~0.4-0.8% at low batch, ~10% including conv at batch 128), and removing it would
+not dent the 2.4x. The gap is the general kernel work, confirming `VLLM_DECODE_GROUNDING.md`:
+1. the **FP4 weight GEMM** is the dominant bucket (~59% GEMM + ~8% `quantize_mmq_nvfp4` requant that
+   vLLM fuses away via native FP4-MMA / grouped Marlin); this is the biggest, hardest lever.
+2. the **full-attention decode kernel** is the O(ctx) residual (the only thing that grows with
+   context, ~3x per-launch over 4x ctx), in the minority of full-attention layers.
+
+If anything on the GDN side is ever worth touching, it is a bounded micro-optimization, not a
+complexity fix: the kernel is memory-bound on the F32 recurrent state (state read+write is
+`S_v^2 * H * batch` = ~0.79 GB/step over 273 GB/s at batch 128, hence the ~8.9% share), and this
+traffic is **intrinsic to the architecture - vLLM pays the identical state I/O**, so it is not a
+llama-specific inefficiency. A future win could keep the recurrent state in bf16 or fuse the
+`ssm_conv` + gated-norm into the delta-net kernel to shave that ~10%, but the ceiling is small and
+it does not close the 2.4x. The throughput effort stays where the grounding put it: the FP4 GEMM
+(fused act-quant + native FP4-MMA) and the full-attention decode kernel, with a CUDA-graphed
+steady-state step as the bounded host-side add-on.
+
+## Reproduce
+
+- Metadata: `python3 gguf-py/gguf/scripts/gguf_dump.py --no-tensors ~/bench/q36-27b-nvfp4.gguf`.
+- Code: `src/models/delta-net-base.cpp` (build_delta_net 425, autoregressive 289, fused 373,
+  build_recurrent_attn 527, build_conv_state 449); `src/llama-context.cpp:200-201,540-595`
+  (fused_gdn defaults/guard); `ggml/src/ggml-cuda/gated_delta_net.cu` (kernel 4-168, launch grid
+  184-185, dispatch 226-312).
+- Profiles: `~/bench/gdn_study/drv.sh <label> <P> <K> <ctx> <delay> <dur>` runs `llama-server` under
+  nsys and drives `clientloop.py`; `catgdn.py <sqlite>` buckets kernels. Sqlites:
+  `gdn_npl128_ctx1024`, `gdn_npl32_ctx1024`, `gdn_npl4_ctx1024`, `gdn_npl4_ctx4096`.
--- a/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_DENSITY_AUTO_TILE.md
@@ -0,0 +1,143 @@
+# Patch 0015 findings: expert-density-aware MoE token-tile auto-select
+
+The durable follow-up to patch 0014 (`MOE_TOKEN_TILE_CAP.md`): replace the blunt,
+opt-in `LLAMA_MOE_MMQ_X` global cap with a host-side, **default-on** density-aware
+`mmq_x` auto-select in `mul_mat_q_case`. Companion to
+`0015-paged-expert-density-aware-moe-token-tile-auto-select.patch`. Dev tree
+`~/llama-paged-dev` (branch `paged`), `build-cuda` sm_121.
+
+Primary model: **Qwen3.6-35B-A3B NVFP4** (`~/bench/q36-35b-a3b-nvfp4.gguf`),
+**256 experts, top-8**, expert FFN 512, GDN linear attention (SSM inner 4096),
+41 layers. This is a different beast from 0014's Qwen3-Coder-30B-A3B (128 experts,
+larger expert FFN, standard attention).
+
+## What it does (vs 0014)
+
+`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max` (= `ne12`,
+the per-expert column upper bound = token count) in one column-tile, i.e. stock
+**maximizes** the tile (128 on Blackwell). Applied per expert at MoE decode, where
+per-expert density is tiny, that 128-wide tile is mostly padding.
+
+Patch 0014 capped `mmq_x` globally on the ids path via `LLAMA_MOE_MMQ_X` (decode
+**and** prefill), which cost ~1.3% prefill. Patch 0015 instead estimates the
+per-expert density host-side, from args the ids path already passes:
+
+```
+ne_get_rows = ncols_dst   = ne12 * n_expert_used        (token-expert assignments)
+n_experts   = nchannels_x = ne02
+density     = ceil(ne_get_rows / min(ne_get_rows, n_experts))   (tokens/expert)
+```
+
+and caps to the small tile (default 64) **only when `density <= density_max`**, so
+the high-density prefill ubatch keeps the big 128 tile. Prefill-safe by construction.
+No new kernel: the selection only lowers the loop's upper bound to an
+already-compiled, granularity- and shared-memory-validated `mmq_x`.
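+
+The selection logic described above can be sketched in Python. This is a model of the
+host-side decision, not the patch itself: `expert_density` and `select_mmq_x` are hypothetical
+names, and the stock selection is simplified to a constant 128 (the real `mul_mat_q_case` loop
+also validates granularity and shared memory).
+
+```python
+def expert_density(ne12, n_expert_used, n_experts):
+    """Estimated tokens per expert; assumes ne12 >= 1 and n_experts >= 1."""
+    ne_get_rows = ne12 * n_expert_used               # token-expert assignments
+    active = min(ne_get_rows, n_experts)
+    return -(-ne_get_rows // active)                 # ceil division
+
+
+def select_mmq_x(ne12, n_expert_used, n_experts,
+                 stock_mmq_x=128, decode_tile=64, density_max=8):
+    """Cap to the small tile only when the estimated density is low."""
+    if expert_density(ne12, n_expert_used, n_experts) <= density_max:
+        return min(stock_mmq_x, decode_tile)
+    return stock_mmq_x
+
+
+decode_256 = select_mmq_x(128, 8, 256)    # decode npl128, 256 experts
+prefill_256 = select_mmq_x(512, 8, 256)   # prefill ubatch 512, 256 experts
+```
+
+Under these assumptions the decode case takes the 64 tile and the prefill ubatch keeps 128.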
+
+## The threshold matters: `density_max = 8`, not `tile/4 = 16`
+
+The cap must fire for decode but not for a prefill ubatch. Each has per-expert
+density `n_tokens * n_used / n_experts`. At the standard `n_ubatch=512`, `n_used=8`:
+
+```
+                       128 experts   256 experts
+prefill ubatch (512)        32            16
+decode npl128 (128)          8             4
+```
+
+`tile/4 = 16` (0014's first auto-select draft default) **equals the 256-expert
+prefill density** and caps prefill: measured -2.0% to -2.9% S_PP on q36-35b-a3b.
+`density_max = 8` separates the two (decode density <= 8 < prefill density) for every `n_experts`
+in `[128, 511]`, so it caps decode and leaves prefill on the big tile. This single
+default change is what makes the patch prefill-safe on the 256-expert model.
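+
+The `[128, 511]` range can be confirmed by brute force. The sketch below assumes the same
+density estimator as the formula above and the standard shapes from the table (decode npl128,
+prefill ubatch 512, `n_used=8`); `density` and `separates` are hypothetical helper names.
+
+```python
+def density(n_tokens, n_used, n_experts):
+    rows = n_tokens * n_used
+    return -(-rows // min(rows, n_experts))          # ceil division
+
+
+def separates(n_experts, density_max=8, n_used=8,
+              decode_tokens=128, prefill_tokens=512):
+    """True when the cap fires for decode but not for the prefill ubatch."""
+    decode = density(decode_tokens, n_used, n_experts)
+    prefill = density(prefill_tokens, n_used, n_experts)
+    return decode <= density_max < prefill
+
+
+safe = [n for n in range(1, 1025) if separates(n)]
+print(safe[0], safe[-1])
+
+# The earlier tile/4 = 16 default does not separate the 256-expert model.
+old_default_ok = separates(256, density_max=16)
+```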
+
+## Measurements (default-on vs stock, median of 5 reps)
+
+`llama-batched-bench`, q36-35b-a3b-nvfp4.gguf, `-fa on -npp 128 -ntg 128`, GB10
+sm_121. STOCK = `LLAMA_MOE_AUTO_TILE=0` (exact stock selection); 0015 = default.
+
+```
+  npl   S_TG stock  S_TG 0015   dTG%     S_PP stock  S_PP 0015   dPP%
+    8      183.59     183.18  -0.22%        1489.2     1500.1  +0.73%
+   32      264.02     263.44  -0.22%        2034.5     2033.5  -0.05%
+   64      311.76     310.41  -0.43%        2028.3     2027.6  -0.03%
+  128      336.10     337.32  +0.36%        2025.0     2027.7  +0.13%
+```
+
+Raw npl128 reps: S_TG 0015 `[337.3, 336.9, 336.4, 338.9, 338.1]` vs stock
+`[336.2, 336.1, 335.9, 336.9, 335.8]` (distributions overlap); S_PP 0015
+`[2028.6, 2023.0, 2024.9, 2028.0, 2027.7]` vs stock `[2024.9, 2025.0, 2023.2,
+2029.4, 2029.0]`.
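+
+The medians and the overlap claim can be reproduced from the raw reps with the standard
+library. The variable names are illustrative; the values are the rounded reps listed above, so
+the recomputed delta can differ slightly from the table's.
+
+```python
+from statistics import median
+
+s_tg_0015 = [337.3, 336.9, 336.4, 338.9, 338.1]
+s_tg_stock = [336.2, 336.1, 335.9, 336.9, 335.8]
+s_pp_0015 = [2028.6, 2023.0, 2024.9, 2028.0, 2027.7]
+s_pp_stock = [2024.9, 2025.0, 2023.2, 2029.4, 2029.0]
+
+tg_delta_pct = 100.0 * (median(s_tg_0015) / median(s_tg_stock) - 1.0)
+pp_delta_pct = 100.0 * (median(s_pp_0015) / median(s_pp_stock) - 1.0)
+
+# The ranges overlap when the slowest 0015 rep is not faster than the fastest stock rep.
+tg_overlap = min(s_tg_0015) <= max(s_tg_stock)
+
+print(f"S_TG delta {tg_delta_pct:+.2f}%, S_PP delta {pp_delta_pct:+.2f}%, overlap={tg_overlap}")
+```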
+
+### Honest read: neutral on this model
+
+On q36-35b-a3b the decode delta is **within run-to-run noise** (npl128 +0.36%,
+npl<=64 slightly negative) and prefill is **neutral** (largest delta +0.73%, inside
+the 1% target). The `+5%` decode target from the localmaxxing reference does **not**
+materialize here. q36-35b-a3b decode is bound by the GDN/SSM recurrence and
+256-tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
+lever has nothing to bite on.
+
+### npl128 decode tile sweep confirms 64 is the only useful width
+
+`density_max=8` fixed, varying `LLAMA_MOE_DECODE_TILE`, S_TG @ npl128 vs stock:
+
+```
+  TILE8   TILE16  TILE32  TILE64  TILE96
+ -6.31%   -3.18%  -0.17%  +0.70%  -0.76%
+```
+
+Smaller tiles are **worse**, not better: more column-tiles per expert = more
+grid/scheduling overhead, and the FP4-MMA has a minimum efficient width. So matching
+the tile to the literal density (4) is counterproductive; 64 is the sweet spot,
+same as 0014.
+
+## Why ship it default-on anyway
+
+1. **Removes 0014's prefill cost by construction.** The cap is density-gated, not
+   global, so prefill keeps its 128 tile (S_PP neutral above).
+2. **Banks the col-tile-bound gain for free.** At npl128 the auto-select picks
+   `tile=64` for a 128-expert model (decode density 8 <= 8), i.e. exactly 0014's
+   `cap64`, so it reproduces 0014's **+4.8% @npl128 on Qwen3-Coder-30B** without the
+   -1.3% prefill cost. (That model was unavailable to re-bench here; the tile choice
+   is identical by construction.)
+3. **Prefill-safe and decode-neutral on the SSM model**, so it is harmless where it
+   does not help.
+4. **Correctness-gated** by the P0 harness (below).
+
+## Conservative by design (known limitation)
+
+A pure-density gate cannot separate two cases with the **same** per-expert density:
+Qwen3-Coder npl256 decode (density 16) and the 256-expert prefill ubatch (density
+16) are identical to the estimator. `density_max=8` therefore **forgoes 0014's +2.3%
+@npl256** on the 128-expert model to keep 256-expert prefill safe. Recovering
+it needs an `ne12`-aware (absolute token count) gate in addition to density; scoped
+as future work, not implemented.
+
+## Knobs
+
+- `LLAMA_MOE_AUTO_TILE=0` : disable the auto-select, exact stock `mmq_x` selection.
+- `LLAMA_MOE_MMQ_X=<n>` (patch 0014) : **kept** as a manual override; when > 0 it
+  forces the old blunt global cap and bypasses the auto-select (explicit A/B knob).
+- `LLAMA_MOE_DECODE_TILE=<n>` : the small tile (default 64).
+- `LLAMA_MOE_DENSITY_MAX=<n>` : the density ceiling (default 8).
+
+## P0 correctness gate
+
+`tests/test-backend-ops` `test_mul_mat_id` is extended with a ragged small-M
+NVFP4/MXFP4 MoE decode-density block: 128 experts, top-8, m=768, k=2048, n in
+`{16,33,64,128,130,200,256,512}` spanning the cap boundary (n>=130 keeps the 128
+tile at `density_max=8`, n<=128 takes tile 64) and ragged token counts (experts with
+0/1/2 tokens, n not a multiple of the tile). All 16 shapes pass the CUDA-vs-CPU
+oracle on GB10 both default-on and with `LLAMA_MOE_AUTO_TILE=0`; full `MUL_MAT_ID`
+suite 2/2 backends OK. Off the ids path nothing changes (non-MoE `mul_mat`
+byte-identical to stock).
+
+## Verdict
+
+- Correct, prefill-safe, default-on density-aware tile select; the durable design
+  0014's own doc scoped. Supersedes 0014's global cap as the default path; the
+  `LLAMA_MOE_MMQ_X` knob is retained as a manual override.
+- **Net effect on q36-35b-a3b NVFP4: neutral** (decode within noise, prefill neutral)
+  because the model is SSM/bandwidth-bound, not col-tile-bound. The lever's real win
+  lives on col-tile-bound MoE (Qwen3-Coder-30B, +4.8% @npl128), banked here at zero
+  prefill cost.
--- a/backend/cpp/llama-cpp/patches/paged/MOE_GROUPED_GEMM_SCOPE.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_GROUPED_GEMM_SCOPE.md
@@ -0,0 +1,220 @@
+# Durable scope: grouped FP4-MMA MoE GEMM for ggml CUDA on GB10 (sm_121)
+
+Build-ready plan. **Not implemented in this workflow** (large kernel work). This
+document scopes the durable path to match or beat vLLM MoE grouped-GEMM efficiency
+on GB10 for the Qwen3-30B-A3B-class mxfp4 MoE, and records the single honest
+finding that re-shapes the whole effort.
+
+Hardware: NVIDIA GB10 (sm_121, CC=1210 = `GGML_CUDA_CC_DGX_SPARK`), unified
+LPDDR5X ~273 GB/s. Model: Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
+(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`,
+HEAD at patch 0013), `build-cuda` sm_121.
+
+## TL;DR (the honest reframe)
+
+**The grouped GEMM the mission scoped to build from scratch already exists in
+upstream ggml, and it already runs on GB10 for mxfp4.** For mxfp4 experts on
+sm_121 `ggml_cuda_should_use_mmq()` returns true (`turing_mma_available`), so
+MUL_MAT_ID takes the **grouped mmq path**, which already contains both vLLM
+building blocks:
+
+1. a moe_align / token-sort-by-expert (`mmid.cu` `mm_ids_helper`:
+   count -> warp-scan/cumsum -> scatter into expert-sorted contiguous buffers),
+2. a **single persistent stream-k grouped FP4-MMA GEMM** (one `mul_mat_q` launch;
+   grid flattened into kbc-continuous space over expert x col-tile x row-tile x
+   k-block; native FP4 MMA via `block_fp4_mmq` under `BLACKWELL_MMA_AVAILABLE`).
+
+The per-expert host-side row-gather loop in `ggml-cuda.cu`
+`ggml_cuda_mul_mat_id()` (~L2632-2790) - the path the mission's root-cause
+analysis describes as "the cliff" - is a **fallback only reached when
+`should_use_mmq()==false`** (f16/bf16 experts, non-Blackwell). It is **never the
+GB10 mxfp4 path.**
+
+Consequence: the "npl128 MoE cliff" does not exist on the current dev HEAD.
+Re-measured batched-bench decode (`S_TG` t/s) on the mxfp4 MoE rises monotonically
+`85 / 278 / 637 / 950 / 1306 / 1771` at npl `1 / 8 / 32 / 64 / 128 / 256`. The
+original `253/505/830/620` cliff was a real high-batch regression that has since
+been **fixed upstream** (FP4-native grouped mmq + MoE stream-k balancing), not a
+batched-bench artifact.
+
+**Therefore the durable work is NOT "port moe_align + a grouped GEMM."** It is a
+**surgical fix to the one place ggml diverges from vLLM: the M-tile (token-tile)
+sizing heuristic.** This document scopes that delta, plus the optional
+block-padded align, plus the parity gate and phased plan. It also records what is
+intentionally NOT built and why (the W4A16 occupancy wall).
+
+## The one structural gap: M-tile sizing
+
+`mul_mat_q_case` / `launch_mul_mat_q` pick `mmq_x` (the token/M tile) by
+**minimizing** `ntiles_x = ceil(ncols_max / mmq_x)` over the **aggregate** token
+count (`ncols_max = ne12`). On Blackwell `get_mmq_x_max = 128`, so the heuristic
+always selects the **largest** `mmq_x` that fits shared memory. vLLM's
+CUTLASS/Triton fused_moe does the **opposite**: a small tuned `BLOCK_SIZE_M`
+(typ. 16/32/64), padded **per expert**.
+
+ggml then applies its over-large `mmq_x` **per expert**. In MoE decode the token count
+per expert is tiny - Qwen3-30B-A3B top-8 of 128: at npl64 ~512 assignments over
+~126 activated experts ~= 4 tok/expert; at npl128 ~1024 over ~128 ~= 8 tok/expert.
+So each expert's single M-tile of width 128 is **3-6% filled** -> ragged tiny-M
+tiles run a dense-GEMM-tuned config, wasting MMA M-throughput, and (with
+`need_check`) every expert runs as a masked partial tail.
+
+The FP4 MMA N-fragment (`tile_C::J`) is 8, so the **ideal M-tile ~= tokens/expert
+(~8)**, 16x smaller than the 128 ggml picks. This mismatch is the durable gap.
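+
+The fill figures above follow from simple arithmetic, sketched here in Python. `tile_fill`
+is a hypothetical helper, and the activated-expert counts (~126 and ~128) are the
+approximate values quoted in the text, not measured inputs.
+
+```python
+def tile_fill(npl, n_expert_used, n_active_experts, mmq_x=128):
+    """Return (tokens per expert, fraction of one mmq_x-wide tile that is used)."""
+    tokens_per_expert = npl * n_expert_used / n_active_experts
+    return tokens_per_expert, tokens_per_expert / mmq_x
+
+
+tok64, fill64 = tile_fill(64, 8, 126)      # npl64
+tok128, fill128 = tile_fill(128, 8, 128)   # npl128
+
+print(f"npl64:  {tok64:.1f} tok/expert, {100 * fill64:.1f}% of a 128-wide tile")
+print(f"npl128: {tok128:.1f} tok/expert, {100 * fill128:.1f}% of a 128-wide tile")
+```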
+
+Critically for GB10: at tokens/expert <= 8 there is exactly **one col-tile per
+expert**, so a smaller `mmq_x` causes **no extra weight re-read** (weight rows are
+re-read only across multiple col-tiles, of which there is one) while it **lowers
+shared-mem footprint and raises occupancy** - strictly aligned with the GB10
+occupancy lessons.
+
+## What already exists (reuse, do NOT rebuild)
+
+Engine files on DGX `~/llama-paged-dev/ggml/src/ggml-cuda/`:
+
+- **[A] moe_align / scatter** = `mmid.cu` `mm_ids_helper`. One CUDA block per
+  expert (`gridDim.x = n_experts`); warp counts tokens routed to this expert,
+  warp-scan for the compaction index, scatters into `ids_src1` (column gather
+  permutation, expert-sorted contiguous), `ids_dst` (output scatter), and writes
+  `expert_bounds[expert] = prefix start`, `expert_bounds[n_experts] = total`.
+  This **is** count -> cumsum -> permute; `expert_bounds` is the analogue of
+  vLLM's `num_tokens_post_padded` boundaries. No `-1` pad today because segments
+  are exact (not block-padded).
+- **[B] persistent grouped FP4 GEMM** = `mmq.cuh` `mul_mat_q` stream-k
+  (kernel ~L3542, `process_tile` ~L3447, launch ~L3943, case-select ~L4055).
+  Single launch, fixed grid (`nsm` CTAs, or `ntiles` when >=90% tile efficiency).
+  Each CTA walks a contiguous `kbc` slice of (expert `zt` via `expert_bounds`,
+  col-tile `jt`, row-tile `it`, k-block) space; the weight row-tile (`mmq_y=128`
+  x K) is loaded once per col-tile in the `process_tile` k-loop; empty col-tiles
+  past `col_diff` are SKIPPED by advancing `kbc += blocks_per_ne00`; a
+  `stream_k_fixup` pass recombines split tiles.
+- **[C] native FP4-MMA expert weights** = `block_fp4_mmq` + `MMQ_MMA_TILE_X_K_FP4`
+  (== Q8_1 tile, skew-pad +4) under `BLACKWELL_MMA_AVAILABLE`;
+  `quantize_mmq_fp4_cuda` quantizes activations to the q8-style y-layout **with
+  the `ids_src1` gather fused** (one pass, no separate row-copy).
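+
+The count -> cumsum -> scatter structure of [A] can be illustrated with a sequential Python
+sketch. It is not the CUDA helper: `mm_ids_sketch` is a hypothetical name, the real kernel does
+this with one block per expert and a warp scan, and the exact contents of `ids_src1` / `ids_dst`
+used here (token index and flattened token-slot index) are an assumption for illustration.
+
+```python
+def mm_ids_sketch(ids, n_experts):
+    """ids[t] lists the experts chosen for token t (same length for every token)."""
+    n_used = len(ids[0])
+    counts = [0] * n_experts
+    for experts in ids:                               # count
+        for e in experts:
+            counts[e] += 1
+
+    expert_bounds = [0] * (n_experts + 1)             # cumsum (prefix starts)
+    for e in range(n_experts):
+        expert_bounds[e + 1] = expert_bounds[e] + counts[e]
+
+    cursor = expert_bounds[:n_experts]                # next free slot per expert
+    total = expert_bounds[n_experts]
+    ids_src1 = [0] * total
+    ids_dst = [0] * total
+    for t, experts in enumerate(ids):                 # scatter, expert-sorted
+        for slot, e in enumerate(experts):
+            pos = cursor[e]
+            cursor[e] += 1
+            ids_src1[pos] = t                         # which activation column to gather
+            ids_dst[pos] = t * n_used + slot          # where the result goes
+    return ids_src1, ids_dst, expert_bounds
+
+
+ids_src1, ids_dst, expert_bounds = mm_ids_sketch([[0, 2], [2, 1], [0, 2]], n_experts=4)
+```
+
+Each expert's tokens end up contiguous between `expert_bounds[e]` and `expert_bounds[e + 1]`,
+which is what lets a single grouped GEMM walk the experts without a host-side loop.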
+
+Dispatch seam: `ggml-cuda.cu` `ggml_cuda_mul_mat_id()` (~L2632-2790). For mxfp4
+with `ne2` (tokens) > 7, `should_use_mmq()` -> true -> `ggml_cuda_mul_mat_q()`
+(`mmq.cu` id-branch ~L162-225) -> `mm_ids_helper` then ONE
+`mul_mat_q_switch_type`. The per-expert host loop below it is the gated fallback.
+
+(Below npl8, MXFP4 mmid routes through `mmvq` - `MMVQ_MAX_BATCH_SIZE=8`, mmid max
+7 for turing_plus - which is fine for thin batch and out of scope here.)
+
+## What to add (the durable delta, priority order)
+
+### [1] Expert-aware M-tile selection (host-side only, zero new kernel)
+
+In `mul_mat_q_case` / `launch_mul_mat_q`, when `ids != nullptr`, choose `mmq_x` from
+**per-expert density** (~`ne_get_rows / n_active_experts`, derivable cheaply, or
+capped via env) instead of minimizing `ntiles` over aggregate `ncols_max`.
+
+- `mmq_x` is a **compile-time template parameter** (switch 8..128 step 8), so this
+  is a pure host-side SELECTION change - it picks a different already-compiled
+  instantiation.
+  **Zero new kernel. Very low risk, high leverage.** Matches vLLM `BLOCK_SIZE_M`.
+- Doubles as near-term lever-1: env-gated `LLAMA_MOE_MMQ_X` cap at the knee.
+- GB10-aligned: smaller `mmq_x` -> smaller shared mem -> higher occupancy, and at
+  tokens/expert <= 8 (one col-tile/expert) it costs no extra weight read.
+
+This is the single highest-leverage change and the seed of the durable port.
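+
+A minimal sketch of the selection shape described above, in Python for brevity
+rather than the engine's C++. `pick_mmq_x` and its parameters are hypothetical
+names, and the density is assumed to be total routed rows divided by active
+experts. It ignores the granularity and shared-memory validation the real
+selection performs, and a real heuristic would likely need headroom for routing
+imbalance.
+
+```python
+import math
+
+MMQ_X_STEP = 8    # the compiled mmq_x switch covers 8..128 in steps of 8
+MMQ_X_MAX = 128
+
+
+def pick_mmq_x(n_routed_rows, n_active_experts, cap=0):
+    """Hypothetical helper: smallest compiled mmq_x covering the mean density."""
+    density = n_routed_rows / max(n_active_experts, 1)
+    mmq_x = MMQ_X_STEP * max(1, math.ceil(density / MMQ_X_STEP))
+    mmq_x = min(mmq_x, MMQ_X_MAX)
+    if cap > 0:
+        # Assumed gate semantics: cap <= 0 is "off"; round the cap down to a
+        # compiled instantiation so the result is always a multiple of 8.
+        cap = max(MMQ_X_STEP, cap // MMQ_X_STEP * MMQ_X_STEP)
+        mmq_x = min(mmq_x, cap)
+    return mmq_x
+
+
+# Decode at npl128, top-8 over 128 experts: 1024 routed rows, density 8.
+print(pick_mmq_x(128 * 8, 128))
+# A 512-token prefill ubatch: 4096 routed rows, density 32.
+print(pick_mmq_x(512 * 8, 128))
+```
+
+The point of the sketch is only that the choice is made on the host from a
+cheap density estimate and lands on an instantiation that already exists.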
+
+### [2] Block-padded moe_align (the moe_align_block_size port proper)
+
+Extend `mm_ids_helper` to pad each expert segment up to a multiple of the chosen
+block: write a sentinel (`-1`) `ids_dst` for pad lanes, put `expert_bounds` on
+block boundaries. Then every col-tile is **full**, which:
+
+- drops the `need_check` masking + per-expert partial-tail MMA,
+- makes the stream-k `kbc` space exact (no skipped tiles, cleaner persistent
+  schedule), removing the `col_diff` skip branch.
+
+Medium risk: touches the scatter, the `col_diff`/`need_check` logic, and the
+`write_back` masking (pad rows must not write output). This is the proper
+`moe_align_block_size` analogue and the durable second step.
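+
+The count -> pad -> cumsum -> permute flow with a `-1` sentinel can be modeled
+on the host as below. This is a toy Python model, not the `mm_ids_helper`
+kernel: `moe_align_block` is a hypothetical name, and it emits a single
+permutation where the real helper writes both `ids_src1` and `ids_dst`.
+
+```python
+def moe_align_block(expert_of_row, n_experts, block):
+    """Toy model: expert-sorted permutation with block-padded segments."""
+    counts = [0] * n_experts
+    for e in expert_of_row:
+        counts[e] += 1
+
+    # Round every expert segment up to a multiple of `block`.
+    padded = [-(-c // block) * block for c in counts]
+
+    # Exclusive prefix sum: bounds[e] is the segment start, bounds[-1] the total.
+    bounds = [0]
+    for p in padded:
+        bounds.append(bounds[-1] + p)
+
+    # -1 marks pad lanes; a consumer must skip them when writing output.
+    ids = [-1] * bounds[-1]
+    cursor = bounds[:-1].copy()
+    for row, e in enumerate(expert_of_row):
+        ids[cursor[e]] = row
+        cursor[e] += 1
+    return ids, bounds
+
+
+ids, bounds = moe_align_block([0, 2, 0, 2, 2], n_experts=3, block=4)
+print(ids)     # [0, 2, -1, -1, 1, 3, 4, -1]
+print(bounds)  # [0, 4, 4, 8]
+```
+
+Every segment length is a multiple of `block`, so each col-tile is full and the
+only masking left is the sentinel check on write-back.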
+
+### [3] Bespoke masked-grouped FP4 kernel - ONLY if [1]+[2] insufficient
+
+A CUTLASS/DeepGEMM-style masked-grouped FP4 kernel. **Largest risk, likely
+unnecessary** given [B] is already a persistent stream-k grouped GEMM. Listed for
+completeness; do not start without [1]+[2] measured as insufficient.
+
+## Integration into ggml_mul_mat_id (dispatch seam + gated fallback)
+
+- The seam is unchanged: `ggml_cuda_mul_mat_id()` -> `should_use_mmq()` ->
+  `ggml_cuda_mul_mat_q()`. [1] and [2] live entirely inside the mmq id-branch
+  (`mmq.cu` ~L162-225) and its callees (`mmq.cuh` selection/launch, `mmid.cu`
+  scatter). No change to the host dispatch decision.
+- **Gated fallback preserved**: the existing per-expert host loop
+  (`should_use_mmq()==false` path) stays as-is for f16/bf16 experts and
+  non-Blackwell GPUs. The new selection only fires on the grouped path.
+- **Env gates** (off = exact current behavior):
+  - `LLAMA_MOE_MMQ_X=<8..128>` - cap/override the token tile for the id-path
+    (lever-1 + [1] manual knob).
+  - `LLAMA_MOE_BLOCK_ALIGN=0|1` - enable block-padded scatter ([2]).
+  Default both off until parity + throughput proven, then flip [1]'s
+  auto-selection on by default.
+
+## Correctness / parity gate
+
+Primary: `tests/test-backend-ops.cpp` `test_mul_mat_id` (~L4181). The CPU
+reference is **deterministic** - the op test must be **bit-exact**.
+
+- Sweep `type_a` in {`MXFP4`, `NVFP4`}, `type_b = F32`, `n_mats = 128`,
+  `n_expert_used = 8`, `n_tokens` in {8, 32, 64, 128} (the decode-density band).
+- **Add ragged small-M shapes** to the harness if absent (n_tokens not a multiple
+  of mmq_x; experts with 0/1/2 tokens) - these are exactly where [1]/[2] change
+  tile geometry and where block-pad masking can leak.
+- Pass criterion: new `mmq_x` selection and padded-align produce dst **identical**
+  to current op-test output (op test is exact; the GB10 CUDA greedy-decode
+  non-determinism band applies only to end-to-end, never to the op test).
+- End-to-end sanity: `llama-batched-bench` on `~/bench/qwen3coder-mxfp4.gguf`,
+  `-fa on -npp 128 -ntg 128`, npl 8/32/64/128/256; confirm `S_TG` stays monotonic
+  and `S_PP` flat ~3050-3090. Verify greedy-decode output within the documented
+  CUDA batch-shape non-determinism band (CPU is the deterministic oracle).
+
+Bench/parity scripts stay **dev-tree-only** (`~/llama-paged-dev/benches/`).
+
+## Phased plan, expected payoff, risk per phase
+
+| Phase | Work | Expected payoff | Risk |
+|-------|------|-----------------|------|
+| **P0** harness | Add ragged small-M + MXFP4/NVFP4 mmid shapes to `test_mul_mat_id`; capture current bit-exact baseline + the monotonic batched-bench curve as the reference. | None (gate). Locks correctness + the 85->1771 t/s baseline so any regression is caught. | Low. |
+| **P1** sort op | Confirm `mm_ids_helper` is the moe_align; if [2] is pursued, prototype the block-pad scatter behind `LLAMA_MOE_BLOCK_ALIGN`. | Enables exact stream-k schedule; removes `need_check` masking (P3 payoff). | Medium (scatter + write-back masking). |
+| **P2** grouped GEMM ([1]) | Expert-aware `mmq_x` selection in `mul_mat_q_case`/launch, `LLAMA_MOE_MMQ_X` gate. | The headline: reclaim the 3-6% M-tile fill waste at npl64-128. Modeled as removing wasted MMA M-throughput on every activated expert; net throughput up at high batch with no extra weight read. | **Low** (host-side template selection, no new kernel). |
+| **P3** tune ([2] + fixup) | Land block-padded align; tune `mmq_x` per density, profile stream-k `fixup` overhead and `mmq_x`/`mmq_y` tile choice with nsys on the grouped `mul_mat_q<MXFP4>` kernel. | Remove per-expert partial-tail MMA; tighten the persistent schedule. Diminishing vs P2; this is pure micro-efficiency toward/past vLLM's saturated grouped-GEMM. | Medium-high (kernel masking paths). |
+
+**Honest payoff framing:** the npl128 "cliff" is already gone on HEAD, so there is
+no broken path to unlock. The durable win is **matching vLLM's saturated
+grouped-GEMM M-tiling** (small per-expert block) and erasing the dense-GEMM-tuned
+M-tile mismatch - a micro-efficiency gain at large effective batch, not a
+step-change. vLLM 0.23.0 cannot even serve this model on GB10 (bf16 MoE-warmup
+hang + hard reboot; GGUF loader can't map fused qwen3moe experts), and llama
+already uses the same sorted-grouped-GEMM algorithm, so structural parity is
+**already met**; this closes the residual kernel micro-gap.
+
+## The biggest risk: the GB10 W4A16 occupancy wall
+
+The dominant risk is **repeating the W4A16 dead-end** that hit only ~9 TFLOPS /
+178 t/s on GB10. GB10 is **occupancy-dominated**: deep `cp.async` pipelines and
+XOR-swizzle shared layouts **collapse occupancy** there. Any P3 kernel work MUST:
+
+- keep **small shared mem + high occupancy** (do NOT add deep `cp.async` stages
+  or XOR-swizzle - they are exactly what killed W4A16);
+- preserve the **skew-pad (+4)** tile layout already in `MMQ_MMA_TILE_X_K_FP4`;
+- stay on the **FP4-MMA path** (`block_fp4_mmq`), the only path that reaches the
+  Blackwell FP4 rate (2x the INT8/BF16 rate);
+- respect the ~273 GB/s LPDDR5X weight-read floor (dense decode is already at it;
+  MoE wins come from occupancy/tile fit, not bandwidth).
+
+Smaller `mmq_x` ([1]) is **strictly consistent** with these lessons: it reduces
+shared-mem footprint, raises occupancy, and at tokens/expert <= 8 adds no weight
+re-read. So the low-risk lever ([1]) is also the one most aligned with what GB10
+rewards - which is why it leads the plan and [3] is gated behind it.
+
+## Commit / hygiene
+
+Scope doc only (this file). No engine change committed in this workflow. Bench and
+parity scripts are dev-tree-only. Commit with `git commit -s`, trailer
+`Assisted-by: Claude:opus-4.8 [Claude Code]`, no `Co-Authored-By`, no em-dashes.
+Do not push (human pushes). When [1]/[2] are implemented they mirror to
+`backend/cpp/llama-cpp/patches/paged/0014-*` (next free slot).
--- a/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
+++ b/backend/cpp/llama-cpp/patches/paged/MOE_TOKEN_TILE_CAP.md
@@ -0,0 +1,99 @@
+# Patch 0014 findings: expert-aware MoE token-tile cap (LLAMA_MOE_MMQ_X)
+
+Near-term lever for the MoE-vs-vLLM workflow on GB10 (sm_121). Companion to
+`0014-paged-expert-aware-moe-token-tile-cap.patch`. Model:
+Qwen3-Coder-30B-A3B, 128 experts, top-8, mxfp4 experts
+(`~/bench/qwen3coder-mxfp4.gguf`). Dev tree `~/llama-paged-dev` (branch `paged`),
+`build-cuda` sm_121.
+
+## Headline (honest): there is no npl128 cliff to erase on this build
+
+The mission premise was a 25% decode drop at npl128 (batched-bench 253/505/830/620
+@ npl 8/32/64/128). It does **not** reproduce. Stock decode is monotonic:
+
+```
+llama-batched-bench, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, S_TG t/s
+  npl        1     8    32    64   128   256
+  stock     85   282   629   935  1295  1779     <- monotonic, no knee
+```
+
+The old cliff was a real high-batch regression that has since been fixed
+upstream: mxfp4 MoE decode on GB10 already takes the sorted grouped FP4-MMA GEMM
+(MUL_MAT_ID ->
+`ggml_cuda_mul_mat_q` ids branch: `mm_ids_helper` moe_align/scatter + one
+persistent stream-k `mul_mat_q`), i.e. vLLM's algorithm. See
+`MOE_GROUPED_GEMM_SCOPE.md`.
+
+## What the knob does
+
+`mul_mat_q_case` picks the token-tile width `mmq_x` to cover `ncols_max`
+(= `ne12`, the per-expert column upper bound = token count, up to 128) in one
+column-tile. At MoE decode the per-expert density is `~ne12*k/n_experts`
+(top-8/128 => ~1/16 of `ne12`), so each expert's `mmq_x`-wide col-tile is only
+~6% filled: the MMA accumulator tile is `mmq_x`-wide at compile time and wastes
+throughput on the padding columns, and the larger y-tile lowers occupancy.
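+
+The fill figure follows from simple arithmetic, sketched here in Python.
+`tile_fill` is a hypothetical helper that assumes perfectly even routing, which
+real routing is not.
+
+```python
+def tile_fill(ne12, top_k, n_experts, mmq_x):
+    """Fraction of an mmq_x-wide col-tile holding real tokens (even routing)."""
+    density = ne12 * top_k / n_experts   # mean tokens per expert
+    return min(1.0, density / mmq_x)
+
+
+# npl128 decode, top-8 of 128 experts, mmq_x sized to cover ne12 = 128.
+print(tile_fill(128, 8, 128, mmq_x=128))   # 0.0625
+# The same batch with a 64-wide tile.
+print(tile_fill(128, 8, 128, mmq_x=64))    # 0.125
+```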
+
+`LLAMA_MOE_MMQ_X=<n>` caps `mmq_x` on the MUL_MAT_ID path only
+(`expert_bounds != nullptr`). It only lowers the selection-loop upper bound and
+still chooses from the same granularity/shared-memory-validated `mmq_x` set stock
+already uses for smaller batches - no new kernel configuration. Default
+(unset/<=0) = disabled => byte-identical to stock.
+
+## Measurements (same binary, only LLAMA_MOE_MMQ_X differs)
+
+Decode throughput, S_TG t/s:
+
+```
+  npl     stock   cap16   cap32   cap64
+   1       85      85      85      85
+   8      282     280     282     282
+  32      629     623     629     628
+  64      935     915     949     934
+ 128     1295    1204    1344    1357     <- cap64 +4.8% (cap16 -7%)
+ 256     1779    1370    1723    1820     <- cap64 +2.3% (cap16 -23%)
+```
+
+Prefill throughput, S_PP t/s (the cost):
+
+```
+  npl     stock   cap16   cap32   cap64
+ 128     3083    1817    2559    3038
+ 256     3084    1818    2560    3046
+                 -41%    -17%    -1.3%
+```
+
+Reproducibility (interleaved off/cap64, two reps each):
+
+```
+  npl    off rep1/rep2   cap64 rep1/rep2
+  128    1300 / 1290     1357.5 / 1357.0
+  256    1786 / 1782     1826.3 / 1824.5
+```
+
+cap64 is stable to <0.1% and the gain sits well above the ~1% run-to-run band.
+
+## Why 64 is the only tested value with a net gain
+
+A 512-token prefill ubatch routes ~32 tokens/expert. cap16/cap32 force those into
+16/32-wide tiles, overflowing into extra col-tiles + weight re-reads -> prefill
+craters (-41% / -17%). cap64 still holds the prefill density in one tile (32 < 64)
+so prefill is near-neutral (-1.3%), while decode (~8 tokens/expert at npl128) gets
+the fuller, higher-occupancy tile.
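+
+The col-tile counts behind this can be checked with a small Python sketch.
+`col_tiles` is a hypothetical helper, and the 40-token expert is an
+illustrative assumption standing in for routing imbalance above the ~32 mean,
+not a measured value.
+
+```python
+import math
+
+
+def col_tiles(tokens_for_expert, mmq_x):
+    """Col-tiles one expert needs; each extra tile re-reads its weight rows."""
+    return math.ceil(tokens_for_expert / mmq_x)
+
+
+for cap in (16, 32, 64):
+    # Mean prefill density (32) and an assumed busier expert (40).
+    print(cap, col_tiles(32, cap), col_tiles(40, cap))
+```
+
+Under these assumptions cap16 needs 2 to 3 tiles per expert, cap32 needs 1 to
+2, and cap64 stays at one tile, matching the ordering of the measured prefill
+costs.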
+
+## Verdict
+
+- Real but **modest** high-effective-batch DECODE micro-optimization
+  (+4.8% npl128, +2.3% npl256), neutral at npl<=64, ~1.3% prefill cost at cap64.
+- **Not** a cliff fix (no cliff) and **not** a real-server unlock (llama-server
+  continuous batching already scales). Shipped as an opt-in, default-off knob;
+  recommended value 64 for decode-heavy high-concurrency deployments.
+- Correctness: greedy temp-0 server output with cap64 is byte-identical to stock
+  for single-stream generation and stays coherent; thousands of capped MoE
+  matmuls at npl128/256 ran with no CUDA error / NaN.
+
+## Durable follow-up (scoped, not implemented)
+
+Replace the blunt global cap with a density-aware auto-select: choose `mmq_x`
+from `ne_get_rows / n_active_experts` inside `mul_mat_q_case` so decode gets the
+small tile while prefill keeps its large tile automatically (removes the ~1.3%
+prefill cost). Plus the block-padded `moe_align` in `mm_ids_helper`. See
+`MOE_GROUPED_GEMM_SCOPE.md`.