refactor(paged): stock llama-cpp is patch-free; paged backend owns its patch series

Move ALL paged-attention content out of the stock backend/cpp/llama-cpp
backend and into backend/cpp/llama-cpp-localai-paged, so the stock backend is
pure upstream llama.cpp and the paged backend owns and applies its own vendored
patch series.

- Delete the dead early-exploration scaffold backend/cpp/llama-cpp/paged/
  (kernel/w4a16 Marlin scaffold, standalone paged_kv_manager, bench/loadgen,
  its own 0001-0002 patches, dense-era design docs, tests). Zero references
  repo-wide.
- Move backend/cpp/llama-cpp/patches/ (the 28-patch paged series + paged/README
  + 3 operational docs, plus the kernel/ scaffold patch and the top-level paged
  README/BENCHMARKS) to backend/cpp/llama-cpp-localai-paged/patches/. The stock
  backend keeps no patches/ dir; it had no non-paged base patches.
- Purify the stock backend: remove the LLAMA_PAGED make variable, the
  patches/paged apply loop, and the LLAMA_PAGED passthrough to prepare.sh;
  remove the paged-series handling from prepare.sh. The stock llama.cpp target
  now only clones the pin and applies its own (currently empty) base patches/
  series. The runtime paged option hooks in the shared grpc-server.cpp are
  untouched (inert without the patches).
- The paged backend's Makefile now applies its OWN patches/paged/0*.patch onto
  each freshly cloned tree via strict git apply (apply-paged-patches), after the
  copied stock infra clones the pin and applies base patches.
- Repoint every reference to the old patches/paged path: the upstream canary
  workflow + apply script, bump_deps.yaml, gallery/index.yaml, the docs,
  backend/index.yaml, backend-matrix.yml, the top-level Makefile comments, and
  the moved PIN_SYNC / README docs. Drop the now-removed LLAMA_PAGED=on
  build-toggle from comments.

Verified: the full 28-patch series applies strict-clean (git apply, exit 0) to
a clean ggml-org/llama.cpp checkout at the pinned c299a92c, and the repointed
canary apply script resolves and applies the series end to end.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-27 11:01:22 +00:00
parent fb2dc33d52
commit 78fac9a28f
87 changed files with 109 additions and 3997 deletions


@@ -6,14 +6,6 @@
# bump and is advanced only by the manual PIN_SYNC process.
LLAMA_VERSION?=9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1
LLAMA_REPO?=https://github.com/ggerganov/llama.cpp
# LLAMA_PAGED controls whether the vendored paged-attention patch series
# (patches/paged/) is applied on top of the pinned llama.cpp. Default on; set
# LLAMA_PAGED=off to build a clean-against-upstream backend (e.g. to unblock a
# dep-bump if an upstream change breaks a paged hook - the paged carry is then
# fixed independently). Runtime behaviour stays gated by the LLAMA_KV_PAGED env
# regardless, so an LLAMA_PAGED=on build is byte-identical to stock until that
# env is set.
LLAMA_PAGED?=on
CMAKE_ARGS?=
BUILD_TYPE?=
@@ -187,23 +179,14 @@ llama.cpp:
[ -e "$$p" ] || continue; \
echo "applying llama.cpp patch: $$p"; \
git apply --verbose "$$p" || { echo "patch failed: $$p"; exit 1; }; \
done && \
if [ "$(LLAMA_PAGED)" = "off" ]; then \
echo "LLAMA_PAGED=off: skipping paged-attention patch series"; \
else \
for p in $(CURRENT_MAKEFILE_DIR)patches/paged/0*.patch; do \
[ -e "$$p" ] || continue; \
echo "applying llama.cpp PAGED patch: $$p"; \
git apply --verbose "$$p" || { echo "paged patch failed: $$p"; exit 1; }; \
done; \
fi
done
llama.cpp/tools/grpc-server: llama.cpp
mkdir -p llama.cpp/tools/grpc-server
LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
bash prepare.sh
rebuild:
LLAMA_PAGED=$(LLAMA_PAGED) bash prepare.sh
bash prepare.sh
rm -rf grpc-server
$(MAKE) grpc-server


@@ -1,7 +0,0 @@
tests/test_free_block_queue
tests/test_block_pool
tests/test_paged_kv_manager
tests/test_prefix_cache
tests/test_ggml_paged_rw
tests/test_ggml_paged_attn
paged-bench


@@ -1,105 +0,0 @@
# Blackwell (GB10 / sm_121) kernel gaps — measured + the corrected strategy
Supersedes the "greenfield tcgen05 FP4 grouped GEMM" framing in `FP4_GROUPED_MOE_KERNEL.md`. Research +
profiling reframed the problem: the kernels we need **already exist in ggml**; they're just **untuned for
Blackwell**. And the parity target is far lower than the headline vLLM number implied.
## 1. The parity target was wrong — it's ~3,300 t/s single-stream, not 24,444
vLLM's dense "24,444 t/s" is **aggregate concurrent-batch** throughput, not single-sequence. The GB10
compute roofline caps **single-stream** Qwen3-32B prefill at **~3,300 t/s (BF16/INT8 ceiling)** / **~6,600
(FP4 ceiling)**. So: don't chase 24,444 with one kernel. Aggregate parity = (a kernel at the ceiling) +
(batched-prefill scheduling). The *kernel* job is to reach ~3,300 (matches vLLM, which on GB10 also runs at
the BF16 ceiling) or ~6,600 (beats it, via FP4).
## 2. GB10 per-precision DENSE peaks (measured, not spec)
| precision | dense peak | vs BF16 |
|---|---|---|
| BF16 / FP16 | ~213 TFLOP/s | 1.0× |
| INT8 | ~215 TOPS | **1.0×** |
| FP4 (MXFP4/NVFP4) | ~427-500 TFLOP/s | **2.0×** |
Memory: ~273 GB/s LPDDR5X (the bottleneck for *decode*; prefill is compute-bound). **Critical:** GB10 is
**1:1:2** (BF16:INT8:FP4), NOT datacenter Blackwell's 1:2:4 — **INT8 gives ZERO speedup over BF16 here.** So
int8-MMQ has no precision advantage; only FP4 does. (NVIDIA spec sheets still claim 1:2:4 — contradicted by
direct GB10 measurement; on-the-record discrepancy.)
## 3. Measured gaps (nsys, GB10)
| path | kernel | % of prefill | achieved | % of ceiling |
|---|---|---|---|---|
| **Dense** Q4_K_M | `mul_mat_q<Q4_K/Q6_K>` (int8 MMQ) | 80% | ~46 TFLOP/s | **~21% of 215** |
| **MoE** MXFP4 | `mul_mat_q<MXFP4>` (FP4 MMA) | 37% | ~22 TFLOP/s | **~4-5% of 500** (or ~10% of BF16) |
Both kernels are **engaged correctly but untuned for Blackwell** — llama.cpp's MMQ was "tuned primarily for
RTX 3000/4000" (Ampere/Ada). The headroom (4-5×) is recoverable; it's not an architectural ceiling.
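The "% of ceiling" column is achieved throughput divided by the measured peak. A minimal sketch of that arithmetic, using only the figures from the two tables above; the helper name `fraction_of_peak` is illustrative and not part of ggml:

```cpp
#include <cstdio>

// Fraction of a measured peak that a kernel actually achieves.
// Illustrative helper; the inputs below are the table values quoted above.
double fraction_of_peak(double achieved_tflops, double peak_tflops) {
    return achieved_tflops / peak_tflops;
}

void print_gaps() {
    // Dense Q4_K_M: ~46 TFLOP/s against the ~215 TOPS INT8 peak.
    std::printf("dense vs INT8 peak: %.1f%%\n", 100.0 * fraction_of_peak(46.0, 215.0));
    // MoE MXFP4: ~22 TFLOP/s against the ~500 TFLOP/s FP4 figure.
    std::printf("moe vs FP4 peak:    %.1f%%\n", 100.0 * fraction_of_peak(22.0, 500.0));
    // Same MoE kernel measured against the ~213 TFLOP/s BF16 peak.
    std::printf("moe vs BF16 peak:   %.1f%%\n", 100.0 * fraction_of_peak(22.0, 213.0));
}
```

Calling `print_gaps()` reproduces the table's ratios from the raw throughput numbers, which makes it easy to re-check them when a new measurement comes in.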
## 4. ggml's current quantized-matmul paths (what exists)
- **MMQ** (int8): quantizes activations to Q8_1, int8 `mma.sync`/`dp4a`. Prefill path. **Untuned for sm_12x.**
- **FP4 MMA** (#17906, merged): native MXFP4/NVFP4 `m16n8k64` block-scaled FP4 mma for cc≥12.0. Works on GB10
for MoE (we measured 3441 t/s MXFP4 prefill) — but underutilized (~5% of FP4 peak). On **sm_121** it's hit
by build-flag (`120f`) + nvcc `-O3` miscompile (#18331) + capability-gating issues.
- **dequant→cuBLAS-FP16**: unfused fallback (materializes FP16 weights, round-trips memory). Not a fused
Marlin. (Our `GGML_CUDA_FORCE_CUBLAS` no-op = this didn't even engage for Q4_K.)
- **NO fused Marlin-style W4A16 kernel** (dequant 4-bit→BF16 in-shared-mem → BF16 tensor cores). Real gap.
## 5. Strategy — match vs beat (this replaces the tcgen05-greenfield plan)
**To MATCH vLLM (~3,300 single-stream): FP4 is NOT required.** Because INT8 == BF16 on GB10, a tuned MMQ and
a BF16 Marlin kernel share the *same* ceiling — and vLLM hits parity via W4A16 Marlin (BF16), since its FP4
is also broken on sm_121.
Ranked, by effort:
1. **Probe: tune the existing int8 MMQ for Blackwell** (dense). Cheapest. We're at 21% of the ceiling —
recover via tile sizes, async copy (`cp.async`), double-buffered shared-mem pipeline, occupancy. Caveat:
the `nwarps*tile_C::I==mmq_y` static_assert (found earlier) couples the constants; and the Q8_1
activation-quant overhead caps pure-MMQ tuning. Bounded upside, but a fast experiment.
2. **Build a Marlin-style W4A16 BF16 GEMM** (dense) — the robust path to ~3,300 (4.3× over today's 765).
Dequant 4-bit→BF16 in shared memory, MMA on BF16 tensor cores, `cp.async` multi-buffer, offline weight
reshuffle. Mirrors vLLM's actual GB10 path; keeps activations BF16 (better quality than int8 MMQ); fills a
genuine ggml gap. **This is the recommended kernel to MATCH.**
**To BEAT vLLM (~6,600, 2×): fix — don't rewrite — the FP4 path on sm_121.**
3. **Get the existing FP4 MMA (#17906/#20644) fully working + tuned on sm_121.** It already works on sm_120
(RTX 5090: +43-68% prefill) and on GB10 for MoE. The blockers are the `120f` arch flag, the `-O3`
miscompile (#18331), capability gating — **build/compiler fixes, not a new kernel.** Then tune the FP4 MMQ
(it's at ~5% of FP4 peak). This is where upstream momentum already is, and the only route past vLLM.
**Dropped:** the from-scratch tcgen05/CUTLASS grouped GEMM (the old scaffold). It aimed past the matchable
ceiling, duplicates work the FP4-MMA path already does, and FP4 on sm_121 is a *fix* problem not a *write*
problem. The `fp4-grouped-moe.cu` scaffold/hook stays as a useful dispatch seam, but the kernel behind it
should be one of (1)/(2)/(3), not a greenfield CUTLASS collective.
## 6. Cheap experiment — RESULT: MXFP4 dense = free 1.44×, but not parity (kernel still untuned)
Requantized Qwen3-32B dense → MXFP4 (forced attn+ffn to mxfp4 via `--tensor-type`, `--allow-requantize`,
speed-only test) and benched prefill:
| quant | kernel | pp512 | pp2048 | vs Q4_K |
|---|---|---|---|---|
| Q4_K_M | int8-MMQ | 765 | 763 | 1.0× |
| **MXFP4** | **FP4-MMA** | **1099** | **1153** | **1.44×** |
**Findings:**
- **MXFP4 dense is a real, free 1.44× over Q4_K** — just a requantize, the existing FP4-MMA path engages for
dense weights on GB10. Worth shipping as a **Blackwell dense-quant recommendation** in the gallery (no kernel).
- **But it is NOT parity.** 1153 t/s = **~17% of the FP4 ceiling (~6,600)** / ~35% of the BF16 ceiling. So the
**FP4-MMA kernel is itself untuned** (consistent with the MoE measurement, ~5% of FP4 peak). MXFP4 moves dense
from the int8 path (765) onto the FP4 path (1153), but the FP4 kernel leaves ~4-6× on the table.
- **So the kernel work is confirmed and now precise: tune the FP4-MMA kernel** (it's the highest-value, since it
serves both dense-MXFP4 and MoE, and FP4 is the only path that can *beat* vLLM). Strategy item (3) — fix +
tune the existing FP4-MMA on sm_121 — is the priority; a Marlin-style W4A16 BF16 kernel (2) is the alternative
to *match* on the BF16 ceiling if FP4 tuning stalls.
Conclusion: the cheap test did NOT collapse the kernel problem (the kernels are untuned, not just the quant), but
it (a) gives a free 1.44× to ship now, and (b) sharpens the target to **tuning the FP4-MMA kernel**.
## Sources
GB10 peaks (measured): forums.developer.nvidia.com/t/351993, /360142, /373618. Marlin: github.com/IST-DASLab/marlin,
arxiv 2408.11743, developers.redhat.com Marlin/Machete. MMQ untuned: llama.cpp docs/build.md, discussions/16578,
DandinPower/llama.cpp_bench. FP4 landing/sm121: llama.cpp PR #17906/#20644, issues #19662/#18331. Roofline:
vllm.ai/blog/2026-06-01-vllm-dgx-spark, lmsys.org DGX Spark.
> **Correction (measured):** the earlier `GGML_CUDA_FORCE_CUBLAS` env test was a no-op because it's a *compile-time* `#ifdef`, not a runtime flag — cuBLAS never engaged. A real rebuild with `-DGGML_CUDA_FORCE_CUBLAS=ON` shows cuBLAS is **slower** than MMQ for dense Q4 (pp2048 690 vs 750) and runs an **Ampere `cutlass_80_tensorop` FP16 kernel** — cuBLAS-13.0 has no sm_121-tuned GEMM and falls back to sm_80. So *both* MMQ and cuBLAS sit at ~46 TFLOP/s (~21% of the 213 BF16 peak); there is **no library shortcut** to the ceiling on GB10 — a hand-tuned sm_120a kernel (Marlin-style) is required.


@@ -1,334 +0,0 @@
# Chunked prefill + n_batch/n_ubatch decouple — implementation plan
Scope: LocalAI's llama.cpp backend (`backend/cpp/llama-cpp/`). Companion to
`PHASED_VLLM_PARITY_PLAN.md` Phase 3. This document is the concrete, file-cited
plan for what the brief called "chunked prefill".
Line numbers below are from two trees:
- LocalAI: `backend/cpp/llama-cpp/grpc-server.cpp`, `core/backend/options.go`,
`backend/backend.proto`, `core/backend/hardware_defaults.go` — exact.
- Vendored upstream scheduler: `llama.cpp/tools/server/server-context.cpp`. The
build copies `llama.cpp/tools/server/*` into `tools/grpc-server/` (`prepare.sh`
lines 15-17) and only overrides `grpc-server.cpp` + `CMakeLists.txt`. So
`update_slots()` is **inherited upstream code, not LocalAI code**. Line numbers
cited for it are from a same-era checkout (`d12cc3d`, 2026-04-09); the pin is
`f3e1828` (Makefile line 2). The structure is identical; exact lines may drift
a few rows at the pin — match on the quoted comment strings, not the integers.
---
## TL;DR — the headline finding
**Chunked prefill with prefill/decode interleaving is ALREADY implemented** in the
llama.cpp server scheduler that LocalAI vendors. It is not a missing feature on
this version. `update_slots()` in `server-context.cpp`:
1. **Adds ongoing decode tokens first** — "first, add sampled tokens from any
ongoing sequences" (≈ line 2088). Every `SLOT_STATE_GENERATING` slot gets its
one sampled token into the shared `llama_batch` before any prefill is added.
2. **Then fills the remaining `n_batch` budget with prompt (prefill) tokens**:
"next, batch any pending prompts without exceeding n_batch" (≈ line 2166),
gated by `params_base.cont_batching` (LocalAI sets `cont_batching = true` by
default, `grpc-server.cpp:547`). The per-slot prefill fill loop
(≈ line 2552) is `while (slot.prompt.n_tokens() < slot.task->n_tokens() &&
batch.n_tokens < n_batch)` — i.e. it caps each slot's prefill contribution to
the **remaining** budget and defers the rest to the next iteration.
3. **Decodes the combined batch in one pass** (≈ line 2728-2741): decode tokens
and prefill-chunk tokens go through the **same `llama_decode`**, which then
splits internally into `n_ubatch` physical sub-batches.
This is exactly the behavior the abandoned-looking draft **upstream PR #10718**
("server : chunked prefill support") asked for — "the first task is no longer
blocked by the second long prompt processing task." That PR is still marked OPEN
but its goal was absorbed into the natural evolution of `update_slots()`; we do
**not** need to port it. A long prefill no longer stalls the decode batch: decode
slots are serviced first every iteration, prefill consumes only the leftover
budget.
**Therefore: do not re-implement chunked prefill.** The real LocalAI gap is
narrow and is the rest of this plan:
- **Phase A (the actual gap): the `n_batch`/`n_ubatch` decouple.** LocalAI ties
the scheduler token budget (`n_batch`) to the physical forward width
(`n_ubatch`) at `grpc-server.cpp:515` + `:519`. This forces
`n_batch == n_ubatch`, so the logical scheduling window can never be wider than
one physical ubatch. You cannot keep `n_ubatch` at the Blackwell GEMM sweet
spot (2048) while widening `n_batch` so concurrent prefills + decodes co-batch
into a larger logical window. There is no first-class `batch:`/`ubatch:` split
on the Go side, and there is only a one-directional `ubatch` override on the C++
side (you can shrink ubatch below the coupled value, never grow n_batch above
it).
- **Phase B (optional policy lever): a decode-headroom prefill cap.** Upstream
caps prefill at the full `n_batch` shared with decode. Under heavy mixed load
one fat prefill chunk per iteration still adds inter-token latency (ITL) jitter
to the decoders sharing that forward. vLLM exposes
`long_prefill_token_threshold` / `max_num_partial_prefills` for this. A
LocalAI-specific per-iteration prefill cap (a patch to vendored `update_slots`)
bounds that jitter. This is genuinely not in upstream and is the only place a
scheduler-policy change is warranted.
---
## 1. Current behavior — precise citations
### 1.1 The scheduler is upstream, inherited verbatim
- `prepare.sh:15-17` copies all of `llama.cpp/tools/server/*` into the
`grpc-server` build dir; `grpc-server.cpp` (LocalAI) replaces only the HTTP/gRPC
service + `params_parse` + `parse_options`. `update_slots()`, the slot state
machine, and the batch builder are **upstream `server-context.cpp`**, untouched
by LocalAI today.
- Slot states (`server-context.cpp:36-42`):
`SLOT_STATE_IDLE / WAIT_OTHER / STARTED / PROCESSING_PROMPT / DONE_PROMPT /
GENERATING`.
### 1.2 Decode-first, then prefill-fill, one shared batch
- `common_batch_clear(batch)` (≈ 2078) — one batch per `update_slots` iteration.
- Decode phase (≈ 2088-2156): for each `SLOT_STATE_GENERATING` slot,
`common_batch_add(batch, slot.sampled, …, /*logits=*/true)` adds exactly one
token. Decode is guaranteed a seat before prefill runs.
- Budget fetch (≈ 2158-2160): `n_batch = llama_n_batch(ctx)`,
`n_ubatch = llama_n_ubatch(ctx)`.
- Prefill phase (≈ 2166): `if (params_base.cont_batching || batch.n_tokens == 0)`
→ with cont_batching ON, prefill is added to the **same** batch as decode.
- Per-slot prefill fill (≈ 2552-2597):
`while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.n_tokens < n_batch)`
— adds prompt tokens until the slot is done **or** the shared budget is hit.
Whatever does not fit stays for the next iteration (the slot remains
`SLOT_STATE_PROCESSING_PROMPT`).
- Whole-prompt completion (≈ 2603-2615): when the slot's prompt is fully consumed
it flips to `SLOT_STATE_DONE_PROMPT`, sets `batch.logits[last] = true`, inits
the sampler. Next iteration it becomes `GENERATING`.
- Budget break (≈ 2693-2695): `if (batch.n_tokens >= n_batch) break;`.
- Decode (≈ 2728-2741): loops `batch_view` slices of `min(n_batch, remaining)` and
calls `llama_decode`; the physical `n_ubatch` split happens inside
`llama_decode`.
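The decode-first, then prefill-fill ordering above can be sketched as a toy model of one scheduler iteration. This is a simplified stand-in, not the real `update_slots()`: `ToySlot`, `ToyBatch` and `build_batch` are hypothetical names, and the sketch ignores slot state transitions, logits flags and KV positions.

```cpp
#include <algorithm>
#include <vector>

// Toy stand-ins for the server's slot and batch; not the real structures.
struct ToySlot {
    bool generating;        // true: decoding; false: still processing its prompt
    int  prompt_remaining;  // prompt tokens not yet fed (unused when generating)
};

struct ToyBatch {
    int decode_tokens  = 0;
    int prefill_tokens = 0;
    int total() const { return decode_tokens + prefill_tokens; }
};

// Build the shared batch for one iteration under a token budget of n_batch.
ToyBatch build_batch(std::vector<ToySlot> & slots, int n_batch) {
    ToyBatch batch;
    // Phase 1: every generating slot contributes exactly one sampled token.
    for (const ToySlot & s : slots) {
        if (s.generating) {
            batch.decode_tokens += 1;
        }
    }
    // Phase 2: pending prompts fill whatever budget is left; the rest waits
    // for the next iteration.
    for (ToySlot & s : slots) {
        if (s.generating || s.prompt_remaining == 0) {
            continue;
        }
        if (batch.total() >= n_batch) {
            break;
        }
        const int take = std::min(s.prompt_remaining, n_batch - batch.total());
        s.prompt_remaining -= take;
        batch.prefill_tokens += take;
    }
    return batch;
}
```

With two decoding slots, one 8000-token prompt and `n_batch = 2048`, each iteration seats both decode tokens first and gives the prompt the remaining 2046 tokens, so the long prompt is consumed over several iterations without displacing the decoders.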
### 1.3 The chunking is gated by `can_split()`
- `server-context.cpp:225-231`: `can_split()` returns true unless the task needs
embeddings with non-LAST pooling. So **completion/generation tasks always
chunk-and-interleave**; only embeddings/rerank force the whole prompt into one
ubatch (≈ 2234-2244 raises "input is too large… increase the physical batch
size" — this is exactly why LocalAI bumped `n_ubatch` for rerank, see below).
### 1.4 LocalAI ties n_batch to n_ubatch (the gap)
- `grpc-server.cpp:515`: `params.n_batch = request->nbatch();`
- `grpc-server.cpp:519`: `params.n_ubatch = request->nbatch();` with the comment
that this fixes reranking being capped at the 512 default `n_ubatch`.
- `grpc-server.cpp:781-784` — the **only** decouple knob today: an `n_ubatch` /
`ubatch` option that overrides `n_ubatch` alone (added for embeddings/rerank).
There is **no** `batch` / `n_batch` option parse, so `n_batch` cannot be raised
above the coupled value from a model config. Confirmed: `grep '"n_batch"|"batch"'`
in `grpc-server.cpp` returns nothing.
- Options arrive via `request->options(i)` parsed as `optname:optval`
(`grpc-server.cpp:584-585`); these come from `ModelOptions.Options`, populated from
`c.Options` (`core/backend/options.go:221`).
### 1.5 Go side sends a single batch number
- `backend/backend.proto:341`: `int32 NBatch = 4;` is the only batch field; there
is **no** `NUBatch`.
- `core/backend/options.go:108-129` `EffectiveBatchSize`: returns `c.Batch` if set,
else context size for single-pass (score/embed/rerank), else
`hardwareDefaultBatchSize(512)`.
- `core/backend/options.go:228`: `NBatch: int32(b)` (single value to the
backend; becomes both `n_batch` and `n_ubatch` via 1.4).
- `core/backend/hardware_defaults.go:28,37-40`: `BlackwellBatchSize = 2048`;
on Blackwell an unset batch defaults to 2048, so today
`n_batch == n_ubatch == 2048` there.
---
## 2. Why the decouple matters for serving (not just rerank)
Invariant: `n_ubatch <= n_batch`. `n_ubatch` is the physical forward-pass GEMM
width (compute efficiency; GB10 sweet spot ≈ 2048). `n_batch` is the per-iteration
**scheduler token budget** — the logical window shared by decode + prefill chunks,
analogous to vLLM's `max_num_batched_tokens`.
With `n_batch == n_ubatch` (today), the scheduling window cannot exceed one
physical ubatch. Consequences:
- Under concurrency, the combined (decode + multiple prefill chunks) logical batch
is capped at the physical ubatch, so aggregate prefill cannot grow past one
ubatch worth of tokens per iteration even when more slots have prompts queued.
- A user who shrinks `batch:` for memory also shrinks the physical ubatch,
degrading prefill GEMM efficiency — and vice versa.
Decoupling lets us hold `n_ubatch = 2048` (efficient GEMM) while setting a larger
`n_batch` (e.g. 4096) so more concurrent prefill+decode tokens co-schedule into one
logical window, lifting aggregate prefill under mixed load — `llama_decode` still
tiles the physical work at 2048.
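To make the logical/physical distinction concrete, here is a minimal sketch of how a logical batch is cut into physical tiles. The function `ubatch_sizes` is hypothetical; in llama.cpp the split happens inside `llama_decode`, whose internals are not reproduced here.

```cpp
#include <algorithm>
#include <vector>

// Split a logical batch of n_tokens into physical tiles of at most n_ubatch
// tokens each. Assumes n_ubatch > 0. Illustrative only.
std::vector<int> ubatch_sizes(int n_tokens, int n_ubatch) {
    std::vector<int> tiles;
    for (int done = 0; done < n_tokens; done += n_ubatch) {
        tiles.push_back(std::min(n_ubatch, n_tokens - done));
    }
    return tiles;
}
```

A full 4096-token logical batch with `n_ubatch = 2048` yields two 2048-token tiles, so the GEMM width stays at the hardware sweet spot while the scheduler window doubles.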
---
## 3. Phased implementation
### Phase 0 — Verification harness (do first; TDD red)
Bite-sized, no code change to the scheduler.
- **0.1 Token-identical greedy under mixed load.** Script: start the backend with
`n_parallel >= 4`, greedy sampling (temp 0, fixed seed). Fire (a) several short
decode streams and (b) one ~8k-token prompt concurrently (the exact repro from
PR #10718's body works). Capture each stream's full token id sequence. Re-run
with the prefill request absent. **Assert the short streams' token ids are
byte-identical** in both runs — proves interleaving does not perturb decode
numerics (KV/position correctness across chunk boundaries). Wire as a Ginkgo
spec under the backend e2e suite.
- **0.2 Mixed-workload throughput baseline.** Use `llama-batched-bench` (built from
the same tree) or a small driver hitting `/v1/chat/completions`: measure
aggregate prefill tok/s and decode tok/s, and p50/p99 ITL of the decode streams,
under the mixed workload. Record numbers for the current `n_batch==n_ubatch`
config. This is the before of Phase A/B.
Expected result of Phase 0: 0.1 already passes (interleave is correct today);
0.2 gives the baseline the decouple must beat.
### Phase A — Decouple n_batch from n_ubatch
Goal: let model config set the physical ubatch independently of the logical batch,
defaulting to today's behavior (no regression).
- **A.1 C++: accept a `batch`/`n_batch` option (and keep `ubatch`).**
In `grpc-server.cpp`, after the existing `ubatch` branch (`:781-784`), add a
sibling branch:
```cpp
} else if (!strcmp(optname, "n_batch") || !strcmp(optname, "batch")) {
if (optval != NULL) {
try { params.n_batch = std::stoi(optval_str); } catch (...) {}
}
```
This is the missing direction (raise `n_batch` above the coupled value). Order
matters: both `:515/:519` run first (coupling as default), then option parsing
overrides either independently. Add a clamp note: if a user sets
`n_ubatch > n_batch`, llama.cpp will clamp/upbatch; log a warning. Keep the
`:519` aliasing for backward compat (rerank still works with no options).
- **A.2 Proto: add an explicit physical ubatch field.**
`backend/backend.proto:341` add `int32 NUBatch = <next free tag>;` (do not reuse
4). Regenerate with `make protogen-go` + the C++ proto build.
- **A.3 C++: honor `NUBatch` when present.**
In `grpc-server.cpp` `params_parse`, after `:519`, add:
```cpp
if (request->nubatch() > 0) {
params.n_ubatch = request->nubatch();
}
```
so an explicit physical ubatch wins over the `n_batch` alias, with the `ubatch`
string-option as a third path for users who only edit `options:`.
- **A.4 Go: config surface + plumbing.**
- Add `UBatch *int` (yaml `ubatch`) to the llama config struct alongside `Batch`
(search `core/config` for the `Batch` field; mirror it).
- In `core/backend/options.go`: add `EffectiveUBatchSize(c)` mirroring
`EffectiveBatchSize` (return `c.UBatch` if set, else
`min(EffectiveBatchSize(c), BlackwellBatchSize-or-512)` so the physical ubatch
stays at the hardware sweet spot while `n_batch` may be larger). Set
`NUBatch: int32(EffectiveUBatchSize(c))` next to `NBatch:` (`:228`).
- Keep the default such that when neither is set, `NUBatch == NBatch` ⇒
byte-identical to today.
- **A.5 Serving default (the lever).**
In `hardware_defaults.go`, introduce `BlackwellLogicalBatch = 4096` (or a
measured value) and let `EffectiveBatchSize` return it for **multi-slot serving**
configs (when `n_parallel > 1` and the model is a completion model), while
`EffectiveUBatchSize` stays at `BlackwellBatchSize = 2048`. Gate behind the same
Blackwell detection already used at `:37-40`. Single-stream/embedding/rerank
paths keep `n_batch == n_ubatch`. This is the only behavioral change shipped by
Phase A; Phase 0.2 must show it is net-positive before defaulting it on.
- **A.6 Tests.** Extend `hardware_defaults_internal_test.go` with
`EffectiveUBatchSize` cases; add a `grpcModelOpts` test asserting
`NUBatch <= NBatch` and that unset config yields `NUBatch == NBatch`. Re-run
0.1 (must still be token-identical) and 0.2 (must show aggregate-prefill gain or
neutral ITL) at `n_batch=4096, n_ubatch=2048`.
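The precedence described in A.1 through A.4 can be sketched as a single resolution step. This is an assumption-laden illustration, not the proposed patch: `resolve_batch` and `BatchParams` are hypothetical names, `0` stands for "not set", and clamping `n_ubatch` down to `n_batch` is one possible reading of the clamp note in A.1.

```cpp
#include <cstdio>

struct BatchParams {
    int n_batch;
    int n_ubatch;
};

// nbatch:     the existing NBatch request field.
// nubatch:    the proposed NUBatch field, 0 when unset.
// opt_batch:  value of a batch / n_batch option, 0 when absent.
// opt_ubatch: value of a ubatch / n_ubatch option, 0 when absent.
BatchParams resolve_batch(int nbatch, int nubatch, int opt_batch, int opt_ubatch) {
    BatchParams p{nbatch, nbatch};             // coupled default, as today
    if (nubatch > 0)    p.n_ubatch = nubatch;  // explicit proto field wins over the alias
    if (opt_batch > 0)  p.n_batch  = opt_batch;
    if (opt_ubatch > 0) p.n_ubatch = opt_ubatch;
    if (p.n_ubatch > p.n_batch) {
        // Assumed policy: keep the invariant n_ubatch <= n_batch and warn.
        std::fprintf(stderr, "n_ubatch (%d) > n_batch (%d); clamping n_ubatch\n",
                     p.n_ubatch, p.n_batch);
        p.n_ubatch = p.n_batch;
    }
    return p;
}
```

With nothing set beyond `nbatch`, the result is identical to today's coupled behavior, which is the no-regression property A.4 asks for.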
### Phase B — Decode-headroom prefill cap (optional policy, vendored patch)
Only if Phase 0.2 / A shows decode ITL jitter from fat prefill chunks. This is the
one change that touches the inherited scheduler, so it lives as a patch in
`backend/cpp/llama-cpp/patches/` (applied by `prepare.sh:6-11` / Makefile
`:141-145`), never as an edit to a checked-in upstream file.
Policy (pseudocode; insert into `update_slots()` prefill fill loop, the
`while (… && batch.n_tokens < n_batch)` at ≈ `server-context.cpp:2552`):
```
# token budget for THIS iteration, decode already seated:
n_decode_in_batch = batch.n_tokens # set after the decode phase
prefill_budget = n_batch # default == today
if serving_mode and n_decode_in_batch > 0:
# leave room so decoders are not starved/jittered by one giant prefill chunk
# max_prefill_per_iter defaults to n_ubatch (one physical tile) when decode active
prefill_budget = min(n_batch, n_decode_in_batch + max_prefill_per_iter)
# fill loop guard becomes:
while slot.prompt.n_tokens() < slot.task->n_tokens()
and batch.n_tokens < prefill_budget:
...
```
- `max_prefill_per_iter` is a new `common_params` field surfaced as an
`options:` knob (`max_prefill_tokens` / `mpt`) parsed in `grpc-server.cpp`
exactly like A.1, default `0` = disabled = today's behavior.
- Semantics mirror vLLM `long_prefill_token_threshold`: cap the prefill share so
ongoing decodes keep a steady cadence; the remaining prompt rides the next
iteration (already supported by the state machine — slot stays
`PROCESSING_PROMPT`).
- **Correctness:** unchanged KV/position path — chunk boundaries already advance
`slot.prompt.tokens.pos_next()` per added token (≈ 2570) and the slot resumes
from `slot.prompt.n_tokens()` next iteration. Capping the budget only changes
*how many* tokens are added this iteration, not *which* positions, so 0.1 must
remain token-identical.
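The budget rule from the pseudocode above can be written as a small pure function. This is a sketch under stated assumptions: `prefill_budget` is a hypothetical helper, the `serving_mode` flag is folded into the "cap is enabled" check, and a cap of `0` means disabled, matching the default described in the first bullet.

```cpp
#include <algorithm>

// Token budget for the prefill fill loop in one iteration.
// n_decode_in_batch is batch.n_tokens right after the decode phase.
// max_prefill_per_iter <= 0 disables the cap (today's behavior).
int prefill_budget(int n_batch, int n_decode_in_batch, int max_prefill_per_iter) {
    if (max_prefill_per_iter <= 0 || n_decode_in_batch <= 0) {
        return n_batch;
    }
    // Leave headroom so one large prefill chunk cannot dominate the forward pass.
    return std::min(n_batch, n_decode_in_batch + max_prefill_per_iter);
}
```

For example, with `n_batch = 4096`, 8 decode tokens already seated and a cap of 2048, the fill loop stops at 2056 tokens; with no active decoders the full 4096 budget is available.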
### Phase C — Docs + defaults rollout
- Document `batch` / `ubatch` (and `max_prefill_tokens` if B ships) in
`docs/content/` model-config reference, with the serving recipe
(`n_parallel>1`, `n_batch=4096`, `ubatch=2048`).
- Note the orthogonality to paged KV (below) in
`PHASED_VLLM_PARITY_PLAN.md` Phase 3.
---
## 4. Risk / correctness
- **KV-cache & positions across chunks:** already handled upstream. Each prefill
token added advances `pos_next()` (≈ 2570) and is pushed to `slot.prompt.tokens`
(≈ 2573); the next iteration resumes from `slot.prompt.n_tokens()`. Chunk
boundaries are transparent to the KV cache because positions are absolute, not
per-chunk. Phase A changes only budgets, not positions; Phase B changes only the
per-iteration count. The 0.1 token-identical test is the guardrail.
- **Unified KV cache (LocalAI default, `n_parallel` slots share one cache):**
unaffected — co-batching prefill+decode across slots is what the unified cache is
for; positions are per-`seq_id` (`{ slot.id }` in `common_batch_add`).
- **`n_ubatch > n_batch`:** invalid; A.4 clamps `EffectiveUBatchSize <=
EffectiveBatchSize` and A.1 logs a warning if options violate it.
- **Embeddings / rerank:** must keep `n_ubatch >= prompt length` (single pass,
`can_split()==false`). The existing `:519` alias + `EffectiveBatchSize`
context-sizing for single-pass usecases (`options.go:119-124`) must be preserved
— do not let the serving `BlackwellLogicalBatch` default leak into single-pass
configs (A.5 gates on completion + `n_parallel>1`).
- **Turboquant fork:** the fork lacks some `common_params` fields (see
`LOCALAI_LEGACY_LLAMA_CPP_SPEC` precedent at `grpc-server.cpp:755`). `n_batch` /
`n_ubatch` are ancient fields and safe; if Phase B adds `max_prefill_per_iter`,
guard the new field behind a `#ifndef` like the checkpoint block does.
## 5. Orthogonality to paged KV (Phase 2)
Keep them independent. Paged KV (the `-kvp` / block-manager effort, draft #22569,
and `paged/`) changes **where** KV blocks live (allocation/utilization). Chunked
prefill / this decouple changes **how many tokens per iteration** the scheduler
batches (the `n_batch` budget and decode/prefill interleave). They compose: paged
KV raises the concurrency ceiling (more slots), the decouple widens the per-iter
scheduling window to feed those slots; neither touches the other's data structures.
The only contact point is `update_slots()` — if both ship a vendored patch to it,
land them as separate, ordered patches in `patches/` and keep the hunks disjoint
(paged touches allocation/seq_rm; chunked-prefill Phase B touches the prefill fill
budget).
---
## 6. Bottom line
- Chunked prefill + decode interleave: **already present and correct** on the
pinned llama.cpp — verify (Phase 0.1), do not rebuild.
- Real work: the **n_batch/n_ubatch decouple** (Phase A) — small, additive,
default-preserving — plus an **optional decode-headroom prefill cap** (Phase B)
if measurements show ITL jitter. Both are LocalAI-side: A in `grpc-server.cpp`
+ proto + `options.go`; B as a vendored `patches/` hunk.


@@ -1,215 +0,0 @@
# llama.cpp multi-user decode overhead on DGX Spark (GB10, sm_121)
Investigation of the Qwen3-32B concurrent-decode throughput gap (llama.cpp ~547 t/s
vs vLLM ~667 t/s) on the GB10 box, build `~/llama.cpp-pr24423/build` (Release,
sm_121, `LLAMA_MAX_SEQ=256`, flash-attn on), model
`~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
## TL;DR (the result overturns the brief's premise)
On **this** build the prime suspect is wrong and the host-overhead premise does not
hold:
1. **CUDA graphs are NOT disabled at high concurrency.** At npl=128, 94 of 98
decode `graph_compute` calls **replay a captured CUDA graph** (0 resets, stable
key, no property churn post-warmup). The keyed-warmup gate works.
2. **There is no ~170ms/step host hotspot here.** The GPU is **~96% active during
decode with graphs ON and ~96% active with graphs OFF**. Decode at npl=128 is
**GPU-compute-bound**, not host-bound.
3. The brief's "20% GPU util / 66ms GPU / 170ms host per step" was measured on a
different/earlier build (mainline without these graph fixes). It is not
reproducible on `llama.cpp-pr24423`.
4. Because the GPU is the bottleneck, re-enabling graphs cannot lift the number:
the clean A/B shows graphs ON vs OFF = **+1.5% at npl=128** (and +2.9% at
npl=32 - the benefit shrinks as concurrency rises and the GPU saturates).
5. The real gap to vLLM is the **quantized decode GEMM kernel**: `mul_mat_q`
(Q4_K + Q6_K) is ~68% of decode GPU time and runs ~2.1x above the GB10
memory-bandwidth floor. Closing the gap requires Marlin/Machete-style int4
GEMM kernels, not host-side work. This is a kernel project (the direction the
prior session's uncommitted `marlin-w4a16.cu` / `fp4-grouped-moe.cu` already
started, though those target w4a16/GPTQ-int4, not the K-quants this GGUF uses).
## 1. Why CUDA graphs are (not) disabled - exact code + measurement
### The gate (code)
PR24423 refactored the CUDA-graph path into a keyed, warmup-based scheme in
`~/llama.cpp-pr24423/ggml/src/ggml-cuda/ggml-cuda.cu`:
- `ggml_cuda_graph_get_key(cgraph)` (~L3343) keys the cached CUDA graph by
`cgraph->nodes[0]` (first-node pointer).
- `ggml_cuda_graph_check_compability(cgraph)` (~L3301) disables graphs only for:
- **split buffers** (`ggml_backend_buft_is_cuda_split`), and
- **`GGML_OP_MUL_MAT_ID`** when `src0` is non-quantized **or**
`ne[2] > get_mmvq_mmid_max(...)` (MoE expert routing needs a stream sync).
Qwen3-32B is **dense** -> no `MUL_MAT_ID` -> this condition never fires.
- `ggml_backend_cuda_graph_compute` (~L4514) warmup gate: a graph is used only
after **2 consecutive calls with no property change** (`warmup_complete`); any
property change resets warmup. `ggml_cuda_graph_update_required` (~L3347)
detects change by `memcmp` of the full `ggml_tensor` struct + per-src
data-ptr/ne/nb, with a fast path when `cgraph->uid` is unchanged.
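The warmup rule described above can be modelled as a small state machine. This is a toy sketch only: the class name `GraphWarmupGate`, its fields, and the exact counting are assumptions for illustration and do not mirror the ggml source line by line.

```python
class GraphWarmupGate:
    """Toy model of a warmup gate: replay only after N stable calls in a row."""

    REQUIRED_STABLE_CALLS = 2  # assumption taken from the prose above

    def __init__(self):
        self.last_props = None   # stand-in for the compared node properties
        self.stable_calls = 0    # consecutive calls with unchanged properties
        self.resets = 0          # how often a change restarted the warmup

    def step(self, props):
        """Return True when a captured graph may be replayed for this call."""
        if props != self.last_props:
            if self.last_props is not None:
                self.resets += 1
            self.last_props = props
            self.stable_calls = 0
            return False
        self.stable_calls += 1
        return self.stable_calls >= self.REQUIRED_STABLE_CALLS


gate = GraphWarmupGate()
decisions = [gate.step(("first-node-ptr", 128)) for _ in range(5)]
print(decisions, gate.resets)
```

With stable properties the gate opens after the warmup calls and stays open; any property change sends it back to warmup, which is the behaviour the `RESET=0` counter in the measurement rules out.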
### Why it stays enabled across decode steps
The graph stays stable because llama.cpp's host-side graph reuse holds during
decode, so node pointers/props (and `cgraph->uid`) do not churn:
- `llama_kv_cache::get_n_kv` (`src/llama-kv-cache.cpp` L1223-1233) **pads n_kv to
a multiple of 256** ("so that the graph remains constant across batches and can
be reused"). For ntg<=256 within the first KV block, n_kv is constant.
- `can_reuse_kq_mask` (`src/llama-graph.cpp` L43) keeps the KQ-mask dims stable:
`ne=[n_kv, n_tokens/n_stream, 1, n_stream]` = `[256,1,1,128]` every decode step
at npl=128.
- `can_reuse` (`src/llama-context.cpp` L1283) therefore returns true, so the
scheduler is **not** reset/re-split. `graph->uid` is only reassigned inside
`ggml_backend_sched_split_graph` (`ggml/src/ggml-backend.cpp` L1033, L1485),
which is skipped on the reuse path -> stable uid -> CUDA graph replays.
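The padding argument can be checked with a few lines of arithmetic. This is a simplified sketch: `pad_to_multiple` and `kq_mask_shape` are hypothetical helpers, and the real `get_n_kv` also clamps against the cache size, which is ignored here.

```python
def pad_to_multiple(n, multiple=256):
    """Round n up to the next multiple (same idea as a GGML_PAD-style macro)."""
    return ((n + multiple - 1) // multiple) * multiple


def kq_mask_shape(n_kv_used, n_tokens, n_stream, pad=256):
    """KQ-mask dims in the [n_kv, n_tokens/n_stream, 1, n_stream] layout."""
    return [pad_to_multiple(n_kv_used, pad), n_tokens // n_stream, 1, n_stream]


# npl=128 decode: one token per sequence, 128 streams.
for used in (17, 112, 256, 257):
    print(used, kq_mask_shape(used, n_tokens=128, n_stream=128))
```

As long as the used KV length stays within the first 256 cells, the shape does not change from step to step, so the graph can be reused; crossing into the next 256-cell block changes `n_kv` once.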
### Measurement (instrumented build, npl=128, ntg=96)
Env-gated counters added to `ggml_backend_cuda_graph_compute` /
`ggml_cuda_graph_update_required` (since `GGML_LOG_DEBUG` is compiled out in
Release / NDEBUG). End-of-run summary:
```
[GTRACE-SUMMARY] calls=98 notenab=0 warming=3 warmdone=1 RESET=0 USED=94 incompat=0 distinct_keys=1
```
94/98 decode `graph_compute` calls **replayed** a captured CUDA graph; **0**
warmup resets; a **single** distinct graph key for the whole decode; no node
property churn after warmup. Graphs are fully engaged at npl=128.
(The instrumentation was reverted afterwards; the checkout is back to its
pre-task state and the `.so` rebuilt clean.)
## 2. The per-step CPU "hotspot" - there isn't one on this build
GPU utilization during npl=128 decode (ntg=256):
- **Graphs ON** - `nvidia-smi` sampled every 0.7s through the decode phase:
steady **96% GPU util**, SM clock **2184 MHz** (not throttled), 45-47 W.
- **Graphs OFF** (`GGML_CUDA_DISABLE_GRAPHS=1`) - nsys CUDA trace, 8s window:
total GPU kernel time = `3,983,292,128 ns / 0.516` (the Q4_K `mul_mat_q` time divided by its 51.6% share) = **~7.72s of the 8s
window = ~96% GPU-active**. Even with every kernel launched individually from
the host, the GPU is still ~96% busy. There are essentially **no host gaps**.
Per-step wall = 60.6s / 256 steps = **~237 ms/step**, and the sum of one decode
graph's kernel times (nsys, graphs-on capture) is ~244 ms -> GPU kernel time per
step ~= wall time per step. The host work between steps is in the low single-digit
ms (the ~4% idle), consistent with graphs ON giving only +1.5% at npl=128.
This directly contradicts the brief's 66ms-GPU / 170ms-host split, which must have
come from a pre-graphs build.
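The utilization and per-step figures above follow from simple arithmetic. The sketch below assumes that the 3,983,292,128 ns value is the time of the kernel holding the 51.6% share, so dividing by the share recovers the total kernel time; the variable names are illustrative.

```python
TOP_KERNEL_NS = 3_983_292_128   # nsys kernel time quoted above
TOP_KERNEL_SHARE = 0.516        # its share of total GPU kernel time
WINDOW_S = 8.0                  # nsys capture window

total_kernel_s = TOP_KERNEL_NS / TOP_KERNEL_SHARE / 1e9
gpu_active = total_kernel_s / WINDOW_S

DECODE_WALL_S = 60.6            # wall time of the ntg=256 decode
STEPS = 256
step_ms = DECODE_WALL_S / STEPS * 1000

print(f"total kernel time: {total_kernel_s:.2f} s")
print(f"GPU active:        {gpu_active:.1%}")
print(f"wall per step:     {step_ms:.0f} ms")
```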
### Per-step GPU breakdown (nsys, npl=128 decode, graphs off, 8s window)
| Kernel | % GPU time | ~ms/step |
|--------|-----------:|---------:|
| `mul_mat_q` Q4_K (type 12) | 51.6 | ~118 |
| `flash_attn_ext_f16` | 19.3 | ~44 |
| `mul_mat_q` Q6_K (type 14) | 16.2 | ~37 |
| `unary_gated` silu | 4.1 | ~9 |
| mmq stream-k fixup + quantize_q8_1 | ~5 | ~12 |
| rms_norm / rope / set_rows / add | ~4 | ~10 |
Quantized matmul = **~68%** of decode GPU time (~155 ms/step). Attention ~19%.
`perf` could not profile the host (kernel `perf_event_paranoid=4`), but it is moot:
the host is ~4% of the wall, so there is no ~170ms host hotspot to chase.
## 3. Fix attempt + measured result
### The requested fix (re-enable graphs / pad the decode batch) is a no-op here
Graphs are already enabled and the batch is already stable (n_kv padded to 256,
kq_mask dims constant). The clean cold A/B (cooldowns between every run):
| npl | graphs ON (t/s) | graphs OFF (t/s) | delta |
|----:|----------------:|-----------------:|------:|
| 32 | 242.60 | 235.75 | +2.9% |
| 64 | 398.59 | 389.06 | +2.5% |
| 128 | 543.95 | 535.71 | +1.5% |
Baseline (separate cold runs, original non-instrumented build):
npl=32 243.9, npl=64 397.1, **npl=128 544.95** (matches the ~546 baseline).
Graphs help, but the benefit **monotonically shrinks** as concurrency rises and
the GPU saturates. At npl=128 there is only ~1.5% of host launch overhead left to
remove, and GPU util is ~96% in both columns. **You cannot lift npl=128 decode
toward 667 by working on graphs/host overhead - the GPU is the bottleneck.**
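The deltas in the A/B table can be recomputed from the two throughput columns. A short sketch using only the numbers already listed:

```python
# npl -> (graphs ON t/s, graphs OFF t/s), copied from the A/B table.
ab = {
    32: (242.60, 235.75),
    64: (398.59, 389.06),
    128: (543.95, 535.71),
}

delta_pct = {npl: (on / off - 1) * 100 for npl, (on, off) in ab.items()}
for npl, d in delta_pct.items():
    print(f"npl={npl:>3}: graphs ON is +{d:.2f}% over OFF")
```

The gain falls at every step up in concurrency, which is the pattern expected when launch overhead is a fixed host cost and the GPU time per step keeps growing.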
### Where the number actually is, and the real lever
- vLLM 667 t/s at this concurrency = **192 ms/step**; llama.cpp 547 = **237
ms/step**. The ~45 ms/step gap maps almost entirely onto the quantized matmul.
- GB10 memory-bandwidth floor for a 32B Q4_K_M (~19.8 GB of weights, read once
per step and shared across the 128 sequences) at ~273 GB/s is **~72 ms/step**.
llama.cpp's `mul_mat_q` spends ~155 ms/step on matmul = **~2.1x the bandwidth
floor**. vLLM's Marlin/Machete int4 GEMMs run much closer to the floor; that
efficiency difference is the ~547 -> 667 gap.
- The Q6_K matmul (`mul_mat_q` type 14) also shows pathological tail latency
(median 0.89 ms, max 5.5 ms) - the MMQ kernel is not well-tuned for the skinny
n=128 decode shape.
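The bandwidth-floor comparison in the list above reduces to three divisions. A minimal sketch with the quoted figures (weight size, bandwidth, and matmul time per step are taken from the text, not measured here):

```python
WEIGHTS_GB = 19.8        # Q4_K_M weights read once per decode step
BANDWIDTH_GBPS = 273.0   # approximate GB10 memory bandwidth
MATMUL_MS = 155.0        # measured mul_mat_q time per step
NPL = 128

floor_ms = WEIGHTS_GB / BANDWIDTH_GBPS * 1000
ratio_to_floor = MATMUL_MS / floor_ms
vllm_step_ms = NPL / 667 * 1000   # step time implied by 667 t/s at npl=128

print(f"bandwidth floor: {floor_ms:.1f} ms/step")
print(f"matmul vs floor: {ratio_to_floor:.2f}x")
print(f"vLLM step time:  {vllm_step_ms:.0f} ms")
```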
**The lever to beat 547 is a faster quantized decode GEMM**, i.e. a Marlin-style
int4 kernel for the decode shapes. This is exactly the direction of the prior
session's uncommitted `ggml/src/ggml-cuda/marlin-w4a16.cu` and
`fp4-grouped-moe.cu` (already wired via
`if (!split && ggml_cuda_w4a16_mul_mat(...)) return;` in `ggml_cuda_mul_mat`).
Note those target **w4a16 / GPTQ-int4**, while this GGUF is **K-quant (Q4_K/Q6_K)**,
so they are inert for this model - a Marlin path for K-quants (or shipping the
model in a Marlin-friendly int4 format) would be required. That is a multi-day
kernel effort, out of scope for this session, but it is the only lever that can
move the number.
### Why the "bump LLAMA_MAX_SEQ to 1024 -> 377" data point is consistent
`llama_batch_allocr` keeps `seq_cpl` as an `LLAMA_MAX_SEQ x LLAMA_MAX_SEQ` table
(`src/llama-batch.cpp`), so per-batch seq bookkeeping scales ~O(MAX_SEQ^2). At
MAX_SEQ=1024 that host cost becomes large enough (~70 ms/step) to dominate and
drop decode to 377. At MAX_SEQ=256 the same term is ~4.4 ms/step (the ~1.5% that
graphs reclaim); lowering to 128 would save ~3 ms/step (~1%). So MAX_SEQ tuning
confirms the host term is real but tiny at 256 - not a path to 667.
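The scaling claim can be sanity-checked by assuming the host cost is purely quadratic in `LLAMA_MAX_SEQ` and anchoring it at the ~70 ms/step observed for 1024. The function name is hypothetical and the pure-quadratic model is an assumption.

```python
def seq_bookkeeping_ms(max_seq, ref_max_seq=1024, ref_ms=70.0):
    """Host cost per step under a pure O(MAX_SEQ^2) model."""
    return ref_ms * (max_seq / ref_max_seq) ** 2


cost_256 = seq_bookkeeping_ms(256)
cost_128 = seq_bookkeeping_ms(128)
print(f"MAX_SEQ=256: {cost_256:.1f} ms/step")
print(f"MAX_SEQ=128: {cost_128:.1f} ms/step")
print(f"saving 256 -> 128: {cost_256 - cost_128:.1f} ms/step")
```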
## How this would land in LocalAI
- **No host/graph patch is warranted** for this build: graphs already engage and
the decode is GPU-bound. A "pad the decode batch / force graph capture" patch
would change nothing measurable at high concurrency.
- The actionable upstream/vendored work is a **Marlin-style int4 decode GEMM**
(extend the prior `marlin-w4a16.cu` to cover K-quants, or quantize the served
model into a Marlin-friendly int4 layout). That is where the ~547 -> 667+ lives.
- If a small host win is still wanted, keep `LLAMA_MAX_SEQ` no larger than the max
concurrency actually used (the per-batch `seq_cpl` table is O(MAX_SEQ^2)).
## Reproduction
```
# baseline / A/B (cold, 30s cooldowns)
llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -npp 16 -ntg 128 -npl 32,64,128 \
-ngl 99 -b 2048 -ub 2048 -fa on # graphs on
GGML_CUDA_DISABLE_GRAPHS=1 ...same... # graphs off
# GPU util (graphs on): sample nvidia-smi during decode -> ~96%, 2184 MHz
# GPU active (graphs off): nsys profile -t cuda --delay=6 --duration=8 ...
# nsys stats --report cuda_gpu_kern_sum -> sum/0.516 ~= 7.72s of 8s = ~96%
```
## UPDATE: NVFP4 closes most of the decode gap (no Marlin-for-K-quants needed)
The diagnosis above said the lever is "a more bandwidth-efficient int4 decode GEMM"
and feared a multi-day Marlin-for-K-quants kernel. But the FP4-MMA path is already
that kernel. Measured (npl=128, cold A/B, npp=16 ntg=128):
| quant | decode S_TG (t/s) | vs Q4_K | vs vLLM 667 |
|---|---|---|---|
| Q4_K_M | 547 (548/546) | - | 82% |
| **NVFP4** | **619 (617/622)** | **+13%** | **93%** |
NVFP4's `mul_mat_q<NVFP4>` runs closer to the GB10 bandwidth floor at the thin n=128
decode shape than Q4_K's int8-MMQ (which ran ~2.1x above it). So shipping the model
as NVFP4 closes the decode gap from ~22% to ~7% AND wins prefill (1209 vs Q4 767 /
vLLM 800). Net on GB10: llama.cpp+NVFP4 is ahead on prefill (1.5x) and within ~7% on
decode. The remaining ~7% would be incremental FP4-MMA decode-kernel tuning, NOT a
from-scratch Marlin kernel - a much smaller, optional effort. NVFP4 is the answer to
both the prefill and the decode gap.
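The percentages in the table follow from the three throughput figures. A short check using only the numbers quoted above:

```python
Q4K_TPS, NVFP4_TPS, VLLM_TPS = 547.0, 619.0, 667.0

q4k_vs_vllm = Q4K_TPS / VLLM_TPS
nvfp4_vs_vllm = NVFP4_TPS / VLLM_TPS
nvfp4_gain = NVFP4_TPS / Q4K_TPS - 1
prefill_lead = 1209 / 800   # NVFP4 prefill vs vLLM prefill

print(f"Q4_K_M vs vLLM: {q4k_vs_vllm:.0%}")
print(f"NVFP4 vs vLLM:  {nvfp4_vs_vllm:.0%}")
print(f"NVFP4 vs Q4_K:  +{nvfp4_gain:.0%}")
print(f"prefill lead:   {prefill_lead:.2f}x")
```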


@@ -1,253 +0,0 @@
# Closing the vLLM Gap on Blackwell (GB10 / DGX Spark) — Living Plan & Results
Target hardware: NVIDIA **GB10** (Grace-Blackwell, `sm_121a`, 119 GiB unified LPDDR5X), `dgx.casa`.
Model under test: **Qwen3-Coder-30B-A3B-Instruct** (MoE, 128 experts, top-8, ~3B active).
Engines: llama.cpp (CUDA, `~/llama.cpp-pr24423`, build `7a6ddc5`, `CMAKE_CUDA_ARCHITECTURES=121`) vs vLLM 0.23.0 (`~/vllm-bench`, torch 2.11.0+cu130).
> This is a working document. Each phase appends measured numbers, what was learned, and what's next.
> Methodology: `llama-bench` (single-stream pp/tg, built-in reps) and `llama-batched-bench` (`-npl` sweep,
> decode-phase aggregate `S_TG`, prefill aggregate `S_PP`); vLLM via `~/bench/vllm_conc.py` (decode-phase
> aggregate matched to `S_TG`). Same model/prompt/seed. Precision matched where possible.
---
## Baseline results (established)
### Single-stream (B=1), matched ~8-bit
| Engine / precision | prefill pp512 (t/s) | decode tg128 (t/s) |
|---|---|---|
| llama.cpp **Q8_0** | 2215 ± 15 | **54.8 / 62.2** * |
| llama.cpp **F16** | 700 ± 24 | 32.9 ± 0.05 |
| vLLM **FP8** | 9155 ± 308 | 52.45 ± 0.05 |
\* two sessions; ~55 right after worker-stop (clocks settling), ~62 steady state. Both ≥ vLLM → **single-stream parity holds**.
### Concurrency sweep (decode-phase aggregate `S_TG`, prefill aggregate)
| B | llama Q8 prefill | vLLM FP8 prefill | llama Q8 decode | vLLM FP8 decode |
|---|---|---|---|---|
| 1 | 1080 | 9644 | 60.1 | 48.0 |
| 8 | 2189 | 33373 | 160.8 | 312.4 |
| 32 | 2198 | 99398 | 357.1 | 1171 |
| 64 | 2194 | 151990 | 519.2 | 2064 |
llama F16 prefill also flat: B=1 452 → B=8 723 → B=32 778. **Prefill flat at both precisions = kernel-throughput ceiling.**
### Our paged patch (LLAMA_KV_PAGED) — concurrency effect: NONE
Same Q8 binary, paged branch confirmed firing (137 placements at B=8), throughput identical within noise:
| | B=1 | B=8 | B=32 |
|---|---|---|---|
| stock decode | 61.2 | 171.7 | 377.0 |
| paged decode | 62.7 | 170.8 | 376.8 |
The patch is a placement-only correctness prototype; it doesn't implement concurrency mechanics. It is single-stream-neutral and concurrency-neutral.
---
## Root-cause diagnosis (nsys + code audit)
- **74.5% of GPU compute = `mul_mat_q`** (Q8_0 int8 MMQ GEMM, the MoE experts). Only cutlass kernel seen is `cutlass_80_tensorop` = **Ampere (sm_80)**, not Blackwell.
- ggml-cuda has **NO FP8 path** (no e4m3/e5m2 GEMM, no cuBLASLt FP8). Q8_0 runs the **Ampere-class int8 `mma.sync s8.s8.s32`** even on GB10 (`mma.cuh:924`, dispatched unconditionally `mmq.cu:307`).
- ggml-cuda **DOES** have a **native Blackwell FP4 path** (MXFP4 + NVFP4, `mma...kind::mxf4...e2m1`, `mma.cuh:1126`, gated `BLACKWELL_MMA_AVAILABLE`). Merged via #17906/#20644/#21074.
- **No fused MoE grouped GEMM**, no tcgen05/wgmma (warp-level `mma.sync` only).
- **Small per-expert GEMMs**: 512-tok ubatch → ~32 tok/expert (128 exp, top-8) → thin GEMMs, memory-bound, can't fill tensor-core tiles. vLLM processes 8192 tok/step → ~512 tok/expert → compute-bound + FP8.
- **The 45-69× gap is partly apples-to-oranges**: we compared llama Q8 (Ampere int8) vs vLLM FP8 (Blackwell). Upstream/NVIDIA benches put the *real* FP4-vs-FP8 prefill gap at **~25-50% long-context**, not 45-69×.
Key upstream refs: discussion #22042 (FP8 design: `ggml_mul_mat_ext` + scale tensors), #17906 (native MXFP4), #18250 (NVFP4-MoE closed not-planned).
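Two of the figures in the diagnosis can be reproduced directly: the tokens-per-expert estimate and the prefill ratio from the concurrency sweep. The helper name is hypothetical, and the estimate assumes routing is spread evenly over the experts.

```python
def tokens_per_expert(tokens_per_step, top_k, n_expert):
    """Average routed rows per expert under perfectly uniform routing."""
    return tokens_per_step * top_k / n_expert


print(tokens_per_expert(512, top_k=8, n_expert=128))    # llama.cpp ubatch
print(tokens_per_expert(8192, top_k=8, n_expert=128))   # vLLM step

# Prefill ratio vLLM FP8 / llama Q8 from the sweep table.
gap_b32 = 99398 / 2198
gap_b64 = 151990 / 2194
print(f"B=32: {gap_b32:.0f}x, B=64: {gap_b64:.0f}x")
```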
---
## The levers (cheap → expensive) — execution log
### Lever 1 — NVFP4/MXFP4 model (use existing Blackwell FP4 path) + ubatch bump
Status: **IN PROGRESS** — single-stream done, concurrency next.
Quant: `llama-quantize F16 -> MXFP4_MOE` (type 38), 15.9 GiB / 4.47 BPW. (No NVFP4 in llama-quantize; MXFP4_MOE puts experts in MXFP4 = Blackwell FP4 MMA.)
Single-stream (llama-bench), MXFP4 vs Q8 vs vLLM-FP8:
| metric | llama Q8 | **llama MXFP4** | vLLM FP8 |
|---|---|---|---|
| prefill pp512 (ub512) | 2215 | **3061 ± 22** | 9155 |
| prefill pp2048 (ub512) | ~2200 | 3137 ± 7 | — |
| prefill pp2048 (**ub2048**) | — | **3441 ± 14** | — |
| decode tg128 | 62.2 | **86.4 ± 0.3** | 52.45 |
Findings:
- **MXFP4 decode 86.4 beats vLLM FP8 52.45 by 1.65×** (4-bit = less memory traffic; decode is memory-bound). llama wins decode outright.
- MXFP4 prefill +38% over Q8; **ub2048 lifts prefill +10%** (3137→3441). Single-stream prefill gap to vLLM: 4.1× (Q8) → **2.7× (MXFP4)**.
- Caveat: MXFP4 is 4-bit vs vLLM FP8 8-bit — not precision-matched. Fair match = vLLM NVFP4 (4-bit); pending.
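The ratios in the findings can be recomputed from the single-stream table. A short sketch using only the table values:

```python
vllm_decode, mxfp4_decode = 52.45, 86.4
q8_pp512, mxfp4_pp512, vllm_pp512 = 2215, 3061, 9155
mxfp4_pp2048_ub512, mxfp4_pp2048_ub2048 = 3137, 3441

decode_lead = mxfp4_decode / vllm_decode
prefill_gain = mxfp4_pp512 / q8_pp512 - 1
ubatch_gain = mxfp4_pp2048_ub2048 / mxfp4_pp2048_ub512 - 1
gap_q8 = vllm_pp512 / q8_pp512
gap_mxfp4 = vllm_pp512 / mxfp4_pp2048_ub2048

print(f"MXFP4 decode vs vLLM:   {decode_lead:.2f}x")
print(f"MXFP4 prefill over Q8:  +{prefill_gain:.0%}")
print(f"ub2048 over ub512:      +{ubatch_gain:.0%}")
print(f"prefill gap to vLLM:    {gap_q8:.1f}x -> {gap_mxfp4:.1f}x")
```

Note that the 2.7× figure compares vLLM pp512 against the best MXFP4 prefill number (pp2048 at ub2048), so it is the most favourable pairing available in the table.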
Concurrency (decode-phase aggregate `S_TG`, ub2048), MXFP4 vs Q8 vs vLLM-FP8:
| B | Q8 dec | **MXFP4 dec** | vLLM dec | Q8 pp | **MXFP4 pp** | vLLM pp |
|---|---|---|---|---|---|---|
| 1 | 60.1 | **83.4** | 48.0 | 1080 | 1625 | 9644 |
| 8 | 160.8 | **267.4** | 312.4 | 2189 | 3634 | 33373 |
| 32 | 357.1 | **551.2** | 1171 | 2198 | 3651 | 99398 |
| 64 | 519.2 | **770.2** | 2064 | 2194 | 3648 | 151990 |
**Lever-1 verdict:** MXFP4 is a large, free win — decode +50-66% over Q8, prefill plateau +66% (2200→3650). MXFP4 decode **wins at B=1, near-parity at B=8** vs vLLM; only falls behind at high concurrency. **Prefill still plateaus (~3650)** — the MoE prefill GEMM doesn't scale with batch (no fused grouped GEMM; ubatch-limited). That plateau is the real remaining structural gap → Levers 2-3. Quality caveat unchanged (MXFP4 4-bit vs vLLM FP8 8-bit; quality not yet evaluated).
### Lever 2 — `n_ubatch` / `n_batch` tuning (standalone)
Status: **DONE + SHIPPED (auto-default implemented)**
MXFP4 pp4096 vs ubatch: ub512=2994, **ub2048=3316**, ub4096=2820 (noisy), ub8192=3180.
**Verdict:** prefill saturates at ub=2048; larger ubatch gives nothing. The ~3300-3650 ceiling is the **MoE GEMM kernel**, not batch size. → No more free config wins; the rest is kernel work (Levers 3-5).
**Implemented:** `core/backend/hardware_defaults.go` → `EffectiveBatchSize` now defaults the physical batch
(n_batch→n_ubatch alias) to **2048 on Blackwell** (`xsysinfo.IsNVIDIABlackwell`, cc≥12 / sm_120/121) when the
config leaves `batch:` unset; explicit `batch:` always wins. Detection is a shared Go helper; placed at the
common ModelOptions builder so it covers the C++ llama.cpp backend too. Tests: `hardware_defaults_internal_test.go`.
### Lever 1b — Standard Q4 vs MXFP4 (what's actually MXFP4-specific)
**Q4_K_M** (17.3 GiB) vs **MXFP4** (15.9 GiB), ub2048:
| metric | Q4_K_M | MXFP4 | Q8 |
|---|---|---|---|
| decode tg128 | **93.5** | 86.4 | 62.2 |
| prefill pp512 | 2164 | **3061** | 2215 |
| prefill pp2048 | 2953 | **3441** | ~2200 |
**Verdict:** the **decode win is just "4-bit"** — plain Q4_K_M matches/beats MXFP4 on decode (both memory-bound).
MXFP4's *only* real edge is **prefill (+41% over Q4_K_M)** via Blackwell FP4 tensor cores. So for shipping,
**"4-bit quant + ubatch=2048" captures most of the win portably**; MXFP4 is a Blackwell-only prefill extra.
### Lever 3 — Fused FP4/FP8 MoE grouped GEMM (+ activation-quant fusion)
Status: **DESIGNED + PROFILED, not built** (multi-week kernel R&D). The single biggest remaining prefill win.
**Decisive measurements:**
- Prefill does NOT scale with bigger single prompts (attention O(N²) confounds): MXFP4 pp2048=3295, pp8192=1524,
pp16384=2051. So the plateau is not a batch-size fix.
- Real gap is batched many-sequence prefill: B=32 llama 3651 vs vLLM 99398 = **27×**. llama.cpp MoE prefill runs
at only **~22 effective TFLOP/s** on the GB10 — far below the GPU. Large headroom.
- **nsys (MXFP4 pp2048):** `mul_mat_q<type39>` (MoE FP4 GEMM) = **37.2%**, `quantize_mmq_mxfp4` (act-quant) = 8.0%,
`mul_mat_q<type8>` (dense/attn, still Q8) = 10.1%, flash_attn = 8.8%. The native FP4 MMA *is* engaged — the
inefficiency is the **per-expert thin-tile MMQ scheduler** + **un-fused activation quant**.
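The source does not say how the ~22 effective TFLOP/s figure was derived. One back-of-the-envelope route that lands on a similar number is shown below; it assumes roughly 2 FLOPs per active parameter per token and ~3B active parameters, both of which are approximations rather than measured values.

```python
ACTIVE_PARAMS = 3e9          # "~3B active" from the model description
FLOPS_PER_PARAM_TOKEN = 2    # one multiply + one add per weight (assumption)
LLAMA_PREFILL_TPS = 3651     # MXFP4 prefill at B=32
VLLM_PREFILL_TPS = 99398     # vLLM FP8 prefill at B=32

effective_tflops = ACTIVE_PARAMS * FLOPS_PER_PARAM_TOKEN * LLAMA_PREFILL_TPS / 1e12
prefill_gap = VLLM_PREFILL_TPS / LLAMA_PREFILL_TPS

print(f"effective throughput: ~{effective_tflops:.1f} TFLOP/s")
print(f"batched prefill gap:  ~{prefill_gap:.0f}x")
```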
**Target (precise):** the ~45% in `mmq.cu`'s grouped MoE path (`ggml_cuda_mul_mat_q` + `ids`, `mmid.cu`). Replace
the per-expert thin-tile scheduler with a CUTLASS-style grouped GEMM (full tiles regardless of tokens/expert) and
fuse `quantize_mmq_mxfp4` into the permute/gather. Dense Q8 matmuls (10%) are the separate Lever-4 (FP8) target.
Problem (measured): the prefill ceiling is the MoE expert GEMM. Today `ggml_cuda_mul_mat_q` with `ids`
(`mmq.cu:127`) launches one grouped MMQ over a 3D grid (z = expert), but each expert's tile is thin
(~tokens/expert columns) so int8/FP4 tensor cores run underfilled; throughput is memory-bound on weight
streaming and flat vs batch.
Approach:
- Replace the per-expert thin-tile scheduler with a **CUTLASS-style grouped GEMM** that concatenates all
experts' token-blocks into one problem with per-group offsets, so tiles are always full (m16n8k64 FP4 /
m16n8k32 FP8) regardless of per-expert token count. Mirrors vLLM's `fused_moe` + cutlass grouped GEMM.
- **Fuse activation quantization into the permute/gather** (the `quantize_mmq_q8_1`/FP4 quantize currently a
separate 3.3% kernel) so the routed activations are quantized as they're scattered into expert order.
- Files: new kernel under `ggml/src/ggml-cuda/` (e.g. `moe-grouped-gemm.cu`) + dispatch hook in
`ggml_cuda_mul_mat_id` (`ggml-cuda.cu:2622`); reuse `mmid.cu` routing/`expert_bounds`.
- Effort: high (2-4 wks expert CUDA). Risk: numerics + sm_121 tile tuning. Expected payoff: the bulk of the
prefill gap (vLLM's MoE prefill advantage is mostly this). Upstream: #18250 (NVFP4-MoE) was closed
not-planned, so this would be a LocalAI patch or a fresh upstream proposal.
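The "one problem with per-group offsets" idea rests on a counting sort of the routed rows by expert. The host-side sketch below illustrates the bookkeeping only; `group_by_expert` is a hypothetical name, and it is not the `mmid.cu` routing code or a CUTLASS API.

```python
def group_by_expert(expert_ids, n_expert):
    """Sort routed rows by expert and return (offsets, order).

    expert_ids[i] is the expert chosen for routed row i (one row per
    token/slot pair). Rows of expert e end up at order[offsets[e]:offsets[e+1]].
    """
    counts = [0] * n_expert
    for e in expert_ids:
        counts[e] += 1

    # Exclusive prefix sum gives each group's start in the packed buffer.
    offsets = [0] * (n_expert + 1)
    for e in range(n_expert):
        offsets[e + 1] = offsets[e] + counts[e]

    cursor = list(offsets[:-1])
    order = [0] * len(expert_ids)
    for row, e in enumerate(expert_ids):
        order[cursor[e]] = row
        cursor[e] += 1
    return offsets, order


offsets, order = group_by_expert([2, 0, 2, 1, 0, 2], n_expert=3)
print(offsets, order)
```

A grouped GEMM would then tile over the packed rows using `offsets` as group boundaries, so tile occupancy depends on the total row count rather than on each expert's share.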
### Lever 4 — FP8 (e4m3) GEMM for dense layers
Status: **DESIGNED, not built** (blocked on a core ggml API change).
Problem: ggml-cuda has no FP8 matmul (only int8/FP4). vLLM runs qkv/o_proj/lm_head in FP8 on Blackwell
tensor cores. Our dense layers run int8-MMQ or f16-cuBLAS.
Approach (two options):
- (a) **cuBLASLt FP8**: route dense `mul_mat` through `cublasLtMatmul` with `CUDA_R_8F_E4M3` A/B and FP32
compute + scale pointers. Lowest kernel effort; gets library-tuned Blackwell FP8 immediately. Needs the
scale-tensor plumbing below.
- (b) **Hand-written sm_121 `mma.sync ...e4m3.e4m3.f32`** kernels in `mma.cuh`/`mmf.cu`. More control, more work.
- Prerequisite (both): the **`ggml_mul_mat_ext` / scale-tensor API** from upstream discussion #22042:
per-tensor FP8 scales don't fit the block-scaled quant struct; `MUL_MAT`/`MUL_MAT_ID` must accept optional
scale tensors. This is a cross-cutting ggml change (graph + ops + all backends' fallbacks).
- Effort: high (API change is the hard part; cuBLASLt path is then moderate). Payoff: closes dense-layer
prefill/compute gap; complements Lever 3. Note: for *this* MoE model the experts dominate, so Lever 3 > 4.
### Lever 5 — tcgen05 / wgmma-class kernels for large-prefill tiles
Status: **DESIGNED, not built** (very high effort; last increment).
Problem: ggml's tensor-core path is warp-level `mma.sync` only (no `wgmma`/`tcgen05`). Blackwell's
tensor-memory `tcgen05` MMA (what CUTLASS uses) extracts substantially more throughput at large prefill tiles.
Approach: introduce warpgroup/tcgen05 GEMM main-loops for the FP4/FP8 paths (effectively adopting CUTLASS
3.x collective mainloops for sm_120/121), used when tile size is large enough (prefill). Decode (thin) keeps
`mma.sync`.
- Effort: very high (CUTLASS-class engineering). Payoff: the final slice of large-prefill throughput; only
worth it after Levers 3-4 land. Realistically: depend on/upstream CUTLASS kernels rather than hand-roll.
---
## Paged attention — complete implementation (after kernels are fair)
The placement prototype is insufficient (measured: zero concurrency benefit). A real implementation needs all
four gaps. CPU foundation already built & verified (`PagedKVManager` P0-P3, `README.md`); the in-model parts
are unbuilt. **Build order and concrete design:**
1. **On-demand block allocation from a shared pool** (capacity win — more concurrent seqs before OOM).
- Replace `find_slot`'s ring-buffer (`llama-kv-cache.cpp:818`) with `PagedKVManager` block allocation; the
KV tensor becomes a shared block pool `[n_embd, block_size*num_blocks]`, sequences draw blocks on demand
(already prototyped on CPU: `paged_kv_manager.{h,cpp}`, `test_ggml_paged_rw.cpp`).
- Win measured where it counts: max concurrent sequences before OOM (not yet benchmarked — needs this).
2. **Gather-read** so each seq attends only its own blocks (`get_k`/`get_v` `:1145/1165` → `ggml_get_rows`
gather into scratch, then existing attention). Numerically proven on CPU (`test_ggml_paged_attn.cpp`,
7.5e-08 vs reference). Needs `build_attn_paged` branch in `llama-graph.cpp` + Gate 0 in a real model.
3. **Continuous batching / scheduler** (no head-of-line blocking on mixed-length traffic). New scheduler in
the server slot path; admit/evict at block granularity; the dimension where paging beats llama.cpp's
current static batching. This is where the *real* concurrency win lives (vs our synthetic uniform test).
4. **Automatic prefix sharing** (block-hash dedup; `PagedKVManager::{compute_block_hashes,get_computed_blocks}`
already implemented & tested). Cross-tenant shared system prompts reuse physical blocks.
Status: design in `2026-06-19-paged-attention-llamacpp-design.md`; CPU P0-P3 done; in-model #1-#4 unbuilt.
**Then** measure concurrency in paging's real scenarios — **memory-pressured (max seqs before OOM)** and
**mixed-length continuous batching** — on the MXFP4 (fair-quant) footing, not the uniform/over-provisioned
test that (correctly) showed no benefit.
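On-demand allocation from a shared pool (item 1 above) can be illustrated with a minimal block table. This is a sketch under stated assumptions: `BlockPool` and its methods are hypothetical and are not the `PagedKVManager` interface, which is not shown in this document.

```python
class BlockPool:
    """Toy shared KV block pool: sequences draw fixed-size blocks on demand."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks - 1, -1, -1))  # pop() yields 0, 1, 2, ...
        self.tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}   # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve one KV cell for seq_id and return its flat cell index."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n == len(table) * self.block_size:      # last block is full
            if not self.free:
                raise MemoryError("KV block pool exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        block = table[n // self.block_size]
        return block * self.block_size + n % self.block_size

    def release(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free.extend(reversed(self.tables.pop(seq_id, [])))
        self.lengths.pop(seq_id, None)


pool = BlockPool(num_blocks=4, block_size=2)
print([pool.append_token(0) for _ in range(3)], pool.append_token(1))
```

Capacity is then bounded by total tokens in flight rather than by a per-sequence reservation, which is the property the max-sequences-before-OOM benchmark would measure.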
> Reality check from this session's data: paged attention is a **capacity + scheduling** win, not a per-token
> speed win. On GB10 with 119 GB unified memory and uniform requests we are not memory-bound at B≤64, so the
> placement prototype showed nothing. Paging's value appears under memory pressure (many/long sequences) and
> bursty mixed-length traffic. The per-token throughput gap is a **kernel** problem (Levers 1-3), separate
> from paging.
---
## Implementation plan A — Lever 3: FP4 MoE GEMM to vLLM parity
Goal: lift batched MoE prefill from ~3.65k t/s (B=32) toward vLLM's ~99k. Root cause (profiled):
`mul_mat_q<MXFP4>` runs at ~22 effective TFLOP/s — warp-level `mma.sync`, not Blackwell tcgen05.
Cheap knobs are exhausted (ubatch saturates at 2048; `GGML_CUDA_FORCE_CUBLAS` is a no-op 3419↔3423;
tile width already full at mmq_x=128). So parity needs kernel work, done iteratively on the DGX
(`~/llama.cpp-pr24423`, editable + rebuildable; diffs captured as `patches/`).
Phases (each: hypothesis → edit `ggml/src/ggml-cuda/` → `cmake --build build --target llama-bench` →
`llama-bench` MXFP4 pp/concurrency → record):
1. **Cheap kernel tweaks (low confidence, fast).** nwarps (occupancy), `mmq_y` tile, stream-k on/off,
FP4 load-tile path. Measure each. Likely small (<1.3x) — these don't change the warp-MMA ceiling.
- **Result (nwarps):** DEAD END. `nwarps` is locked by `static_assert(nwarps*tile_C::I == mmq_y)`
(mmq.cuh:3234) → nwarps=8 for mmq_y=128. Can't raise occupancy without co-scaling mmq_y to 256
(nwarps=16), which blows Blackwell shared-memory limits. The MMQ constants are tightly coupled;
it is not freely tunable. Confirms parity needs the kernel rewrite (phase 3), not knobs.
2. **Fuse activation quant** (`quantize_mmq_mxfp4`, 8%) into the permute/gather. Removes a kernel +
a global round-trip. Tractable, ~1.1x.
- **Result:** NOT AVAILABLE as a cheap patch. `quantize_mmq_fp4_cuda` (mmq.cu:200) *already* takes
`ids_src1` — the gather is already fused into the quant. The only remaining fusion is quantize-on-load
*inside* the GEMM hot loop (intricate, ~8% ceiling, risky). ORippler's #24481 fuses the decode (MMVQ)
post-scale and intends a "BS>1" (prefill) follow-up — unwritten. Marginal; skip.
**Upstream survey (2026-06):** there is NO tcgen05/CUTLASS grouped-GEMM MoE kernel in ggml — not merged,
not in-flight, not a draft (Discussion #18369 is talk, no PR; #18250 closed not-planned). CUTLASS is not a
dependency (the profile's `cutlass_80_tensorop` is cuBLAS-internal). No fork has a portable MoE kernel
(croll83/llama.cpp-dgx is GatedDeltaNet-focused). Maintainer signal (woachk on #17906): "the path forward
is to wait for cuTile C++." So **nothing to cherry-pick; phase 3 is genuinely from-scratch.**
3. **The real lever — tcgen05 / CUTLASS FP4 grouped GEMM.** Replace the per-expert MMQ scheduler with a
CUTLASS 3.x collective-mainloop grouped GEMM (sm_120a, `e2m1` block-scaled, tcgen05 tensor-memory MMA),
one problem over all experts with per-group offsets, fused act-quant. This is what vLLM/FlashInfer use.
Multi-week; the honest path to parity. Prefer **upstream ggml** (issue drafted) over a private patch.
4. **Full-model low precision.** Quantize dense layers (qkv/o_proj/lm_head, the 10% Q8) to FP4/FP8 too so
the whole prefill runs on FP4 tensor cores, not int8-MMQ.
Exit per phase: measured t/s recorded here; stop a phase when it's a dead end (recorded as such).
Matching vLLM realistically requires phase 3; phases 1-2 are the warm-up + de-risking.
## Implementation plan B — Complete paged attention (the pivot)
CPU foundation done (P0-P3, `README.md`): vLLM-parity block manager + ggml write/gather + attention
numerics + placement Gate 0 (token-identical in-model). Remaining = make it deliver the multi-tenant wins.
Phases:
1. **On-demand shared-block pool** — replace `find_slot` ring buffer (`llama-kv-cache.cpp:818`) with
`PagedKVManager` block allocation; KV tensor = `[n_embd, block_size*num_blocks]` shared pool. Win:
fit more concurrent seqs before OOM. Test: max concurrent seqs at fixed budget vs contiguous.
2. **Gather-read** (`get_k/get_v` `:1145/1165` → `ggml_get_rows` into scratch) + `build_attn_paged` branch
in `llama-graph.cpp`. Numerically proven on CPU (7.5e-08). Gate 0: token-identical multi-seq.
3. **Continuous batching / scheduler** — admit/evict at block granularity in the server slot path. The
real concurrency win on mixed-length traffic (where the placement prototype showed nothing).
4. **Automatic prefix sharing** — block-hash dedup (`PagedKVManager::{compute_block_hashes,get_computed_blocks}`
already implemented + tested). Cross-tenant shared system prompts reuse physical blocks.
Then benchmark in paging's real regimes — **memory-pressured** + **mixed-length continuous batching** — on
the MXFP4 (fair-quant) footing. Note: GB10's 119 GB unified memory means win-1 needs genuine pressure
(long/many seqs) to show; the win is capacity + scheduling, not per-token speed.
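Block-hash prefix sharing (phase 4) is commonly done by chaining a hash over each full block, so that a block's hash depends on everything before it. The sketch below shows that idea only; the function names and the choice of SHA-256 are assumptions and say nothing about how `compute_block_hashes` is actually implemented.

```python
import hashlib


def block_hashes(tokens, block_size):
    """Chained hash for every full block of a token list."""
    hashes = []
    parent = b""
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest   # the next block's hash covers the whole prefix
    return hashes


def shared_prefix_blocks(tokens_a, tokens_b, block_size):
    """Number of leading blocks two sequences could share physically."""
    shared = 0
    for ha, hb in zip(block_hashes(tokens_a, block_size),
                      block_hashes(tokens_b, block_size)):
        if ha != hb:
            break
        shared += 1
    return shared


system_prompt = [1, 2, 3, 4]
print(shared_prefix_blocks(system_prompt + [5, 6, 7],
                           system_prompt + [9, 9, 9, 9], block_size=2))
```

Only full blocks are hashed, so a partially filled trailing block is never shared; two tenants with the same system prompt share exactly the blocks that prompt fills completely.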
## Honest scope note
Levers 3-5 and the complete paged implementation are each substantial (weeks of expert CUDA/systems work). This doc tracks what is **measured** vs **designed** vs **not-yet-built**, and never claims a number that wasn't run on the box.


@@ -1,59 +0,0 @@
# FP4 grouped-GEMM MoE kernel (Lever 3) — scaffold + implementation plan
The one piece of work that actually closes the vLLM gap on Blackwell (GB10/sm_121). Both phases are
bottlenecked by the same kernel: `mul_mat_q<MXFP4>` (warp-level `mma.sync` grouped MMQ, ~22 TFLOP/s) is
**37%** of prefill and **54.6%** of decode-at-B=64 GPU time (`BENCHMARKS.md`). Paged attention can't touch
it (proven). The fix is a CUTLASS-3.x collective-mainloop grouped GEMM with block-scaled `e2m1` operands via
tcgen05 tensor-memory MMA — what vLLM/FlashInfer/TRT-LLM use.
## Scaffold (DONE — builds clean, default byte-identical)
Lives in the DGX checkout `~/llama.cpp-pr24423/ggml/src/ggml-cuda/` (to be rebased onto the pin as a patch /
upstreamed). Captured diff: `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`.
- `fp4-grouped-moe.{cuh,cu}` — entry `ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst) -> bool`
(true = handled, false = fall back to MMQ). Gated behind env `GGML_CUDA_FP4_GROUPED`. Currently always
returns false → **default build unchanged**.
- Hook in `ggml_cuda_mul_mat_id` (the MoE dispatch), before the `ggml_cuda_mul_mat_q(...ids...)` call:
`if (ggml_cuda_fp4_grouped_moe(...)) return;`. Builds via the `file(GLOB "*.cu")` (re-run cmake configure
after adding the file — GLOB is configure-time).
This is the integration seam. The kernel fills the stub.
## Implementation phases (each: build on GB10 → numerical parity vs `mul_mat_q<MXFP4>` → bench)
1. **Reference grouped GEMM (correctness first, slow OK).** Per-expert problem sizes + offsets from `ids`;
dequant `e2m1`+scales → BF16; loop CUTLASS (or cuBLAS) per group. Gate: output matches MMQ within fp tol
on a 2-expert toy + the real model (token-identical greedy). Establishes the harness + the data plumbing.
2. **CUTLASS GemmGrouped, sm_120a, BF16 operands.** Replace the loop with one `cutlass::gemm::device::
GemmGrouped` launch over all experts (per-group offsets). Measures the grouping win alone.
3. **Block-scaled FP4 operands (the real lever).** `e2m1` A/B with `e8m0`(MX)/`e4m3`(NV) block scales via the
Blackwell scaled-MMA collective (tcgen05 tensor-memory). This is where the TFLOP/s jumps. Needs CUTLASS
3.x + sm_120a; verify the block-scale layout matches ggml's MXFP4/NVFP4 packing.
4. **Fuse activation quant** (the F32→FP4 of src1) into the gather/permute prologue.
5. **Enable by default** on sm_120/121 when parity holds + faster; keep the env as an escape hatch.
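For the phase-1 reference path, the dequantization step is small enough to write out. The sketch below decodes `e2m1` codes with an `e8m0` block scale as defined for MX formats (sign bit, 2 exponent bits, 1 mantissa bit; scale is a power of two with bias 127). It takes already-unpacked 4-bit codes: how ggml packs nibbles inside a block, and any scale convention ggml applies on top, are not shown in this document and are left out on purpose.

```python
# Magnitudes of the 8 non-negative e2m1 codes (bit 3 is the sign).
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code):
    """Decode one 4-bit e2m1 code (0..15) to a float."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def e8m0_to_float(exponent_byte):
    """Decode an e8m0 scale: 2^(e - 127); 255 is reserved for NaN."""
    if exponent_byte == 255:
        return float("nan")
    return 2.0 ** (exponent_byte - 127)


def dequant_block(codes, scale_byte):
    """Dequantize unpacked e2m1 codes that share one e8m0 scale."""
    scale = e8m0_to_float(scale_byte)
    return [decode_e2m1(c) * scale for c in codes]


print(dequant_block([0b0001, 0b0111, 0b1001], scale_byte=128))
```

A reference like this is slow but gives the parity harness an unambiguous ground truth to compare kernel output against.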
## Dependencies / decisions
- **CUTLASS is not currently a ggml dependency** (the profile's `cutlass_80_tensorop` is cuBLAS-internal).
Adding it = submodule/fetch + include dir, gated to CUDA sm_120+. Float the approach with ggml maintainers
early (Discussion #18369 is the home; JohannesGaessler asked to discuss arch before big kernel work).
- Target sm_120a/121a (consumer Blackwell). Datacenter Blackwell (sm_100) is a separate tile config.
- Risk: needs ncu-driven iteration on the GB10; this is multi-week, expert-CUDA. No upstream base to fork
(exhaustive search confirmed). Net-new value upstream.
## DENSE scope — RESOLVED (TODO #28, benchmarked): dense needs an FP4 GEMM too
Benchmarked Qwen3-32B dense, vLLM W4A16 vs llama.cpp Q4_K_M (`BENCHMARKS.md`). **Dense prefill is 7.6-32×
behind** (llama int8-MMQ plateaus ~765 t/s; vLLM FP4 scales to 24.4k); decode ~parity at B=1, 2.2× at B=64.
So the kernel track is **two kernels, not one**:
- **(a) Dense FP4 GEMM** — a plain non-grouped CUTLASS/tcgen05 block-scaled FP4 GEMM. **Simpler than grouped;
land this FIRST** — it's the easier first kernel, benefits every dense model, and de-risks the FP4 collective
before the grouped variant. Hook: the non-MoE `ggml_cuda_mul_mat_q` (no `ids`) path.
- **(b) MoE grouped FP4 GEMM** — the scaffold above (`ggml_cuda_fp4_grouped_moe`), per-expert offsets.
Both share the same block-scaled `e2m1` collective; (a) is (b) with one group. Suggested order: build (a),
prove the FP4 collective + parity harness, then generalize to (b). (Aside: full W4A4 NVFP4 doesn't run on
GB10 today — FlashInfer ships no FP4 cubins for sm_121, so the dense `mm_fp4` kernel hangs/returns zeros; the
W4A16 Marlin path is the fast, correct one and is the fair comparison. See `BENCHMARKS.md` for the root cause.)


@@ -1,140 +0,0 @@
# MXFP4-dense vs Q4_K_M quality check (Qwen3, GB10 / DGX Spark)
## Question
MXFP4-quantized **dense** Qwen3-32B is measurably faster on GB10 (Blackwell) than
Q4_K_M: ~1.58x concurrent prefill, ~1.2x decode, for free (just a requantize that
routes onto the FP4-MMA kernel). Before LocalAI recommends MXFP4-dense as a Blackwell
default, we must confirm its **quality is acceptable versus Q4_K** (Q4_K is normally the
stronger 4-bit format).
Critical caveat going in: the pre-existing `~/bench/q3-32b-mxfp4-dense.gguf` was built
with `--allow-requantize`, so it was suspected to be **double-quantized** (Q4_K_M ->
MXFP4), which would unfairly penalize MXFP4. The goal here was a *fair* answer.
## Verdict
**Do NOT recommend MXFP4-dense as a quality-equivalent replacement for Q4_K on
Blackwell.** A clean apples-to-apples test (same BF16 source, both 4-bit, no imatrix)
shows MXFP4-dense carries a **large** quality penalty that Q4_K does not:
- Q4_K_M costs **+2.6%** perplexity vs the BF16 baseline.
- MXFP4-dense costs **+30.8%** perplexity vs the BF16 baseline (i.e. **+27.5% worse
than Q4_K**).
The double-quant suspicion was correct but it was **not** the main culprit: even a clean
MXFP4-from-BF16 is dramatically worse than Q4_K. The ~1.58x prefill / ~1.2x decode
speedup is real, but it is not free on quality. MXFP4-dense output is still coherent (not
gibberish), so it is usable where raw throughput dominates and a quality hit is
acceptable, but it must not be presented as a drop-in, quality-neutral Q4_K replacement.
## Evidence
### 1. Provenance of the existing 32B MXFP4 (it is double-quant)
`~/dense_mxfp4.sh` (mtime matches the `q3-32b-mxfp4-dense.gguf` mtime, Jun 20 09:47)
created it:
```
SRC=$HOME/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf # <-- source is Q4_K_M, not F16/BF16
OUT=$HOME/bench/q3-32b-mxfp4-dense.gguf
$QB --allow-requantize --tensor-type "attn=mxfp4" --tensor-type "ffn=mxfp4" \
"$SRC" "$OUT" MXFP4_MOE
```
Confirmed **double-quantized** (Q4_K_M -> MXFP4). Any PPL measured on this file
overstates MXFP4's true penalty, so the 32B number below is a loose upper bound, not the
fair answer.
### 2. 32B quick read (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99)
`llama-perplexity`, PR build `~/llama.cpp-pr24423/build` (sm_121):
| 32B model | PPL | vs Q4_K |
|---|---|---|
| Qwen3-32B-Q4_K_M | **7.3865** +/- 0.177 | - |
| q3-32b-mxfp4-dense (double-quant) | **8.4638** +/- 0.206 | +14.6% |
MXFP4 is much worse than Q4_K here, **and** it is double-quant, so the quick read is
unfair -> escalated to a clean small-model comparison.
### 3. Fair comparison: clean small dense model (Qwen3-4B BF16)
The MXFP4-vs-Q4_K delta is a *format* property and roughly model-size-independent, so a
small model gives a fast, clean answer. Downloaded `Qwen3-4B-BF16.gguf` (unsloth, ~7.7
GiB) and quantized it **from that same BF16 source** to both formats with the identical
recipe used for the 32B (no `--allow-requantize` needed, no imatrix on either side):
```
llama-quantize q3-4b-bf16.gguf q3-4b-q4km.gguf Q4_K_M
llama-quantize --tensor-type attn=mxfp4 --tensor-type ffn=mxfp4 \
q3-4b-bf16.gguf q3-4b-mxfp4.gguf MXFP4_MOE
```
Perplexity (wikitext-2-raw test, 50 chunks, ctx 512, ngl 99):
| Qwen3-4B | size | PPL | vs BF16 | vs Q4_K |
|---|---|---|---|---|
| BF16 (baseline) | 7672 MiB | **13.3188** +/- 0.416 | - | - |
| Q4_K_M | 2497 MiB | **13.6605** +/- 0.426 | **+2.57%** | - |
| MXFP4 (clean) | 2236 MiB (4.66 BPW) | **17.4183** +/- 0.561 | **+30.78%** | **+27.5%** |
This is the apples-to-apples quality answer: **clean MXFP4-from-BF16 is ~12x more lossy
than Q4_K relative to the BF16 baseline** (30.8% vs 2.6%). Notably the clean-4B MXFP4-vs-
Q4_K gap (+27.5%) is *wider* than the 32B double-quant gap (+14.6%), consistent with
smaller models being more quantization-sensitive - the double-quant did not invent the
problem, it is intrinsic to the format as quantized by `llama-quantize`.
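The percentages in the table follow directly from the PPL means. A minimal Python check, using only the values reported above (the helper name `rel_increase` is illustrative and the error bars are ignored):

```python
# Perplexity means from the Qwen3-4B table above (error bars ignored).
PPL = {"BF16": 13.3188, "Q4_K_M": 13.6605, "MXFP4": 17.4183}


def rel_increase(ppl: float, baseline: float) -> float:
    """Relative perplexity increase over a baseline, in percent."""
    return (ppl / baseline - 1.0) * 100.0


q4k_vs_bf16 = rel_increase(PPL["Q4_K_M"], PPL["BF16"])
mxfp4_vs_bf16 = rel_increase(PPL["MXFP4"], PPL["BF16"])
mxfp4_vs_q4k = rel_increase(PPL["MXFP4"], PPL["Q4_K_M"])

print(f"Q4_K_M vs BF16: +{q4k_vs_bf16:.2f}%")
print(f"MXFP4  vs BF16: +{mxfp4_vs_bf16:.2f}%")
print(f"MXFP4  vs Q4_K: +{mxfp4_vs_q4k:.1f}%")
print(f"MXFP4 loss / Q4_K loss: {mxfp4_vs_bf16 / q4k_vs_bf16:.1f}x")
```

The last line is the "~12x more lossy" figure: the ratio of the two losses relative to BF16.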
### 4. Coherence spot-check (32B, llama-simple, n=60)
MXFP4-dense 32B is fully coherent, not degraded gibberish:
- "The capital of France is" -> MXFP4: "...Paris, is located near the Seine River..."
(correct); Q4_K similar.
- "Q: What is 17 multiplied by 23? A:" -> MXFP4 reasons via the distributive property
(sound); Q4_K answers 391 directly (correct).
- "def fibonacci(n):" -> both emit valid Python.
So the quality cost shows up as measurably higher perplexity (and would surface on harder
/ longer tasks), not as obviously broken text at short generation lengths.
## Why
`MXFP4_MOE` is a 4-bit float format (E2M1 values, shared E8M0 scale per block of 32,
round-to-nearest) designed for MoE expert tensors (gpt-oss et al.) with a coarse
per-block scale. Q4_K uses 6-bit superblock scales plus per-sub-block mins - materially
better for dense attention/FFN weights. Forcing MXFP4 onto dense layers to reach the FP4
kernel trades ~1.58x prefill for a large accuracy loss. The FP4-MMA speed path is real,
but the weights it accepts (MXFP4 here) are lossy for dense.
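To make the "coarse per-block scale" point concrete, here is a small Python sketch of an MXFP4-style round trip. It is an illustration under stated assumptions, not llama.cpp's quantizer: the shared scale follows the OCP MX convention `2**(floor(log2(amax)) - 2)`, ties round toward the smaller level, and the function name is hypothetical.

```python
import math

# Magnitudes representable by one E2M1 (4-bit float) element.
E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_block_mxfp4(block):
    """Quantize and dequantize one block with a shared power-of-two scale.

    Assumption: scale = 2**(floor(log2(amax)) - 2), where 2 is the largest
    E2M1 exponent. Elements round to the nearest E2M1 level and saturate
    at 6. llama.cpp's scale selection may differ in detail.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        # Nearest representable magnitude; min() picks the lower level on ties.
        mag = min(E2M1_LEVELS, key=lambda level: abs(abs(x) / scale - level))
        out.append(math.copysign(mag * scale, x))
    return out


# 31 small weights plus one outlier in a block of 32: the outlier sets the scale.
block = [0.01] * 31 + [1.0]
deq = quantize_block_mxfp4(block)
print(sum(1 for v in deq if v == 0.0), "of", len(block), "elements flushed to zero")
```

With a single scale per 32 elements and only eight magnitudes, one outlier pushes every small weight in its block to zero. Q4_K's per-sub-block scales and mins are the mechanism that limits this kind of loss.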
## Caveat, stated precisely
This measures **llama.cpp's `llama-quantize` MXFP4** (OCP MX FP4, RTN, **no imatrix**)
against **llama.cpp's Q4_K_M** (k-quant superblocks, also no imatrix here). It is a fair
format-vs-format comparison of exactly what LocalAI would ship if it routed a requantize
through this path. It does **not** claim FP4 is fundamentally unviable on Blackwell:
- An imatrix-aware MXFP4, or a better FP4 format with two-level scaling
(**NVFP4** - there are already `q3-32b-nvfp4` / `q3-32b-nvfp4a16` dirs on the box),
may close much of this gap and is the more promising Blackwell FP4 path to evaluate.
- The result is for Qwen3 dense; other families may differ in magnitude but the
format-level disadvantage of plain MXFP4 RTN vs Q4_K is expected to hold.
## Recommendation
- **Do not** ship a blanket "use MXFP4-dense on Blackwell" recommendation as a Q4_K
quality equivalent. The ~1.58x prefill / ~1.2x decode win comes with a real ~30% PPL
inflation (vs ~2.6% for Q4_K). Q4_K_M stays the right dense default on Blackwell.
- If exposing MXFP4-dense at all, gate it as an explicit **throughput-over-quality**
option with the perplexity caveat surfaced, not a default.
- MXFP4/FP4 remains correct where the model is trained for it (MoE / gpt-oss-style).
Pursue **NVFP4** (and/or imatrix-aware FP4) as the quality-competitive Blackwell FP4
format before making any FP4-dense recommendation.
## Reproduction (DGX Spark, GB10, build `~/llama.cpp-pr24423/build`, sm_121)
- Dataset: `~/wikitext-2-raw/wiki.test.raw` (wikitext-2-raw-v1 test).
- 32B: `~/ppl32b.sh` -> `~/ppl32b.out`; coherence `~/coh32b.sh` -> `~/coh32b.out`.
- Clean 4B: `~/fair4b.sh` -> `~/fair4b.out` (quantize + 3x perplexity).
- All runs `-ngl 99`, `--chunks 50`, `-c 512`. GB10 thermal-throttles but PPL is a
correctness metric, so thermal state does not affect these numbers.


@@ -1,41 +0,0 @@
CXX ?= g++
CXXFLAGS ?= -std=c++17 -O2 -Wall -Wextra -I.

TESTS = test_free_block_queue test_block_pool test_paged_kv_manager test_prefix_cache
BINS = $(addprefix tests/,$(TESTS))

all: $(BINS)

tests/%: tests/%.cpp paged_kv_manager.cpp paged_kv_manager.h
	$(CXX) $(CXXFLAGS) -o $@ $< paged_kv_manager.cpp

check: all
	@for t in $(BINS); do echo "== $$t =="; ./$$t || exit 1; done

paged-bench: paged-bench.cpp paged_kv_manager.cpp paged_kv_manager.h
	$(CXX) $(CXXFLAGS) -o $@ paged-bench.cpp paged_kv_manager.cpp

bench: paged-bench
	./paged-bench

# --- Optional ggml integration test (Phase 1: paged write/gather mechanism) ---
# Requires a built ggml. Override these to point at your checkout / build:
# make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build>
GGML_SRC ?= ../../llama-cpp-fallback-build/llama.cpp/ggml
GGML_BUILD ?= /tmp/ggml-build
GGML_LIBDIR = $(GGML_BUILD)/src
GGML_TESTS = test_ggml_paged_rw test_ggml_paged_attn
GGML_BINS = $(addprefix tests/,$(GGML_TESTS))

tests/test_ggml_%: tests/test_ggml_%.cpp paged_kv_manager.cpp paged_kv_manager.h
	$(CXX) $(CXXFLAGS) -I$(GGML_SRC)/include -o $@ $< paged_kv_manager.cpp \
		-L$(GGML_LIBDIR) -lggml -lggml-base -lggml-cpu -Wl,-rpath,$(GGML_LIBDIR)

ggml-check: $(GGML_BINS)
	@for t in $(GGML_BINS); do echo "== $$t =="; ./$$t || exit 1; done

clean:
	rm -f $(BINS) $(GGML_BINS) paged-bench

.PHONY: all check ggml-check clean


@@ -1,114 +0,0 @@
# NVFP4-dense on DGX Spark (GB10, sm_121): is it the quality-preserving FP4 win MXFP4 wasn't?
Test rig: DGX Spark GB10 (sm_121), `~/llama.cpp-pr24423/build` (PR #24423, FP4 MMA + NVFP4
kernel), wikitext-2-raw, clean BF16 source `q3-4b-bf16.gguf` (the same source used for the
established MXFP4 / Q4_K fair test). NVFP4 and all comparison quants were produced clean from
BF16, no imatrix.
## Verdict (short)
YES on all the load-bearing questions, with one honest caveat:
1. llama.cpp CAN produce an NVFP4 GGUF.
2. NVFP4 quality is Q4_K-class, NOT MXFP4-class: +7.4% PPL vs BF16 (MXFP4 was +30.8%). It is
slightly behind Q4_K (+4.8% relative) but in the same ballpark, not on the quality cliff.
3. NVFP4 routes onto the FP4 MMA kernel and gets the FP4 prefill speedup: ~1.29x Q4_K on the
4B, tracking MXFP4 to within 5% (MXFP4 hit 1.58x on the 32B; NVFP4 should track it there too).
4. Output is coherent.
Bottom line: NVFP4-dense IS the quality-preserving FP4 win MXFP4 wasn't. It delivers
essentially the full FP4 prefill speedup at roughly Q4_K quality, where MXFP4 paid a 27% quality
tax for the same speed. LocalAI can support/recommend NVFP4-dense on Blackwell for prefill-bound
workloads, with the caveat that it is marginally (~5%) behind Q4_K on perplexity; an imatrix-guided
NVFP4 quant would likely close most of that remaining gap.
## 1. Feasibility: can llama-quantize produce an NVFP4 GGUF? YES
- The type exists with a full quantize path, not just a kernel:
- `GGML_TYPE_NVFP4 = 40` (`ggml.h`), `GGML_FTYPE_MOSTLY_NVFP4 = 26`
- `quantize_nvfp4` / `quantize_row_nvfp4_ref` / `dequantize_row_nvfp4` registered in `ggml.c`
- type_name is `"nvfp4"`, block `QK_NVFP4` (per-16 FP8/E4M3 block scale + global scale)
- NVFP4 is NOT a top-level `llama-quantize` ftype (no `NVFP4` entry in the allowed-types list,
no reference in `tools/quantize/quantize.cpp` or `src/llama-quant.cpp`), BUT
`--tensor-type name=nvfp4` resolves it: `parse_ggml_type` matches the arg against
`ggml_type_name(...)`, which returns `"nvfp4"`. This is the exact same mechanism that produced
MXFP4-dense.
- Recipe used (mirrors the structure of the MXFP4-dense GGUF exactly: token_embd Q8_0, all
norms F32, all 2D attn+ffn weights to FP4):
```
llama-quantize --tensor-type "attn=nvfp4" --tensor-type "ffn=nvfp4" \
q3-4b-bf16.gguf q3-4b-nvfp4.gguf Q8_0
```
Result: `q3-4b-nvfp4.gguf`, 2343.93 MiB, 4.89 BPW, ~5 s. (MXFP4-dense was 2350 MiB; same shape.)
Every `blk.N.attn_*` and `blk.N.ffn_*` reported `converting to nvfp4`; token_embd Q8_0; norms F32.
The on-box `~/bench/q3-32b-nvfp4*` dirs are vLLM HF safetensors (already 4-bit), not GGUF, and
do not feed llama.cpp - confirmed and irrelevant.
## 2. Quality (decisive): NVFP4 is Q4_K-class, not MXFP4-class
`llama-perplexity -f wiki.test.raw --chunks 50 -c 512 -ngl 99`, all clean from the same BF16 4B:
| Quant | PPL | vs BF16 | vs Q4_K |
|---------|--------|----------|----------|
| BF16 | 13.32 | - | - |
| Q4_K_M | 13.66 | +2.6% | - |
| NVFP4 | 14.31 | +7.4% | +4.8% |
| MXFP4 | 17.42 | +30.8% | +27.5% |
(NVFP4 measured this run: Final PPL = 14.3097 +/- 0.4457.)
NVFP4 lands much closer to Q4_K (gap 0.65 PPL) than to MXFP4 (gap 3.11 PPL). MXFP4's finer
sibling delivers: the two-level scaling (per-16 FP8 block scale + global scale) recovers almost
all of the quality MXFP4's coarse per-32 E8M0 scale threw away. It is not quite Q4_K, but it is
firmly in the "acceptable 4-bit" regime, not the lossy one.
## 3. Speed: NVFP4 routes onto the FP4 MMA kernel
No clean BF16 32B was on the box (only the vLLM NVFP4 safetensors and the Q4_K/MXFP4 32B GGUFs),
so per the brief this is the 4B speed signal - a 3-way cold A/B on the SAME 4B model, 45 s
cooldowns between runs (`-npp 512 -ntg 128 -npl 8,32,64 -b 2048 -ub 2048 -ngl 99`):
Prefill S_PP (t/s):
| B | Q4_K | NVFP4 | MXFP4 | NVFP4 / Q4_K | NVFP4 / MXFP4 |
|-----|--------|--------|--------|--------------|---------------|
| 8 | 4862 | 6313 | 6602 | 1.30x | 0.96x |
| 32 | 5020 | 6497 | 6836 | 1.29x | 0.95x |
| 64 | 5031 | 6490 | 6831 | 1.29x | 0.95x |
- NVFP4 prefill is within ~5% of MXFP4 at every batch size -> both land on the same FP4 MMA
kernel. NVFP4 does NOT fall back to a slow path.
- NVFP4 beats Q4_K's int8-MMQ prefill by ~1.29x on the 4B. The established 32B figures were
Q4_K S_PP ~767 and MXFP4 ~1209 (1.58x); since NVFP4 tracks MXFP4 to within 5%, NVFP4 on the
32B should likewise approach ~1.5x. (The 4B shows a smaller multiplier than the 32B because a
smaller model spends proportionally less time in the matmul the FP4 kernel accelerates.)
- Token-gen (S_TG) is comparable across all three (memory-bound), as expected.
## 4. Coherence
`llama-simple` (used instead of `llama-cli`, which hangs), NVFP4 4B:
- "The capital of France is" -> "...Paris. ...Germany is in Berlin. ...Italy is in Rome.
...Spain is in Madrid. ...Netherlands is in Amsterdam." (all correct)
- "Q: What is 17 plus 25? A:" -> "42." (correct)
Coherent and factually accurate.
## Recommendation for LocalAI on Blackwell
Support and recommend NVFP4-dense as the FP4 prefill option on Blackwell (sm_120/121), produced
via `--tensor-type attn=nvfp4 --tensor-type ffn=nvfp4` over a BF16 source (token_embd Q8_0,
norms F32). It gives ~the full FP4 prefill speedup (FP4 MMA kernel, ~1.3x Q4_K on 4B and
expected ~1.5x on larger models) at roughly Q4_K quality (+7.4% PPL vs BF16). This is the win
MXFP4 failed to deliver: MXFP4 paid a +30.8% quality tax for the same speed and was rejected.
Caveats / follow-ups:
- NVFP4 is still ~4.8% behind Q4_K on PPL. For quality-first deployments where the prefill win
does not matter, Q4_K_M remains the better pick.
- These NVFP4/Q4_K numbers are clean (no imatrix). An imatrix-guided NVFP4 quant is the obvious
next step and would likely close most of the remaining gap to Q4_K - worth measuring before a
blanket recommendation.
- A direct 32B NVFP4-vs-Q4_K speed run (needs a clean BF16 32B GGUF, not on the box) would
confirm the projected ~1.5x; the 4B signal plus the MXFP4-tracking already make this very likely.


@@ -1,115 +0,0 @@
# Paged KV at high concurrency on a single GB10 - the datacenter-scale test
Closes the open question left by `PR22569_EVAL.md`: that eval could not test the
"paged KV unlocks thousands of sequences" thesis because **both** KV paths hit the
`LLAMA_MAX_SEQ=256` compile cap, and the 32B-dense model it used is compute-bound
(plateaus by npl=128 for an unrelated reason). This run removes both confounders:
**recompiled `LLAMA_MAX_SEQ=2048`** and used a **bandwidth-bound model (Qwen3-1.7B-Q8_0)**
where decode aggregate is free to keep climbing with concurrency.
Hardware: NVIDIA GB10 (sm_121, 119 GiB unified LPDDR5X, ~273 GB/s). Build:
`~/llama.cpp-pr22569` (PR #22569 paged path + the reshape fix), `LLAMA_MAX_SEQ=2048`,
sm_121 Release. Contiguous = `llama-batched-bench` (unified KV) `S_TG`. Paged =
`llama-paged -kvp --fit off` `aggregate tps`. `npp=16, ntg/n_predict=128, b=ub=2048,
-ngl 99`. Cold runs, 12 s cooldowns.
## TL;DR for the decision
**On a single GB10, paged KV does NOT deliver a throughput or concurrency win - the
aggregate-decode ceiling is set by the hardware, not the KV layout, and contiguous KV
already reaches it.** Measured across two model regimes and concurrency up to 2048
sequences:
- Aggregate decode **plateaus** once the GPU saturates - for both KV layouts:
- 32B-dense (compute-bound): ~540 t/s, flat from npl=128 (prior eval).
- 1.7B (bandwidth-bound): ~3,200-3,700 t/s, flat from npl=512 (this run).
- Paged and contiguous land at the **same ceiling**; PR #22569's paged op was 12-13%
*slower* than the mature contiguous flash-attention path at equal concurrency on 32B.
- Pushing concurrency past the plateau is **actively harmful to UX**: per-sequence
throughput collapses (23 -> 1.9 tok/s) and TTFT explodes (0.6 s -> 4.3 s avg, **64 s
max**) while aggregate stays flat.
**vLLM's ~24k aggregate headline is unreachable on a single GB10 with these models
regardless of KV layout** - it needs aggregate memory bandwidth / compute that one GB10
does not have (i.e. many GPUs). Paged KV is a **memory-capacity / anti-fragmentation /
prefix-sharing** feature, not a single-node throughput-ceiling feature. The static
single-model benchmark deliberately does not create the memory-pressure regime where
paging pays off, which is exactly why no win appears.
## The numbers
### Aggregate decode vs concurrency, Qwen3-1.7B-Q8_0 (bandwidth-bound), `LLAMA_MAX_SEQ=2048`
| npl | contiguous `S_TG` (t/s) | paged `aggregate tps` (t/s) | paged per-seq tps | paged TTFT avg / max |
|----:|------------------------:|----------------------------:|------------------:|---------------------:|
| 128 | 2,643 | 2,887 | 23-25 | - |
| 256 | 2,925 | - | - | - |
| 512 | 3,215 | 3,637 | 7.2-7.8 | 0.57 s / 0.90 s |
| 1024 | 3,118 | 3,695 | 3.7-4.2 | 1.17 s / 2.37 s |
| 2048 | (not run) | 3,608 | 1.9-14.6 | 4.28 s / **63.8 s** |
Both paths flatten by npl~512. 8x more concurrency (128->1024) buys contiguous only
**+18%** and paged **+28%**, then both stop. (The two tools meter slightly differently -
`llama-paged` aggregate vs `batched-bench` decode-only `S_TG` - so the small paged-vs-
contiguous offset is not a real paged advantage; the prior apples-to-apples 32B eval had
paged 12-13% *behind*.)
### Why it plateaus (the hardware ceiling, not the KV layout)
Decode is memory-bandwidth-bound: each step reads the model weights once and shares that
read across the whole batch. Once concurrency is high enough that the shared weight-read
is amortized, the per-step cost is dominated by KV reads + attention + host work, none of
which paging makes cheaper. The GB10's ~273 GB/s sets the floor; at the plateau the GPU
is ~saturated. Adding sequences past that point cannot raise aggregate - it only divides
the same throughput across more users (per-seq tps falls, TTFT rises). The 32B-dense case
plateaus even earlier (npl=128) because it saturates on **compute** (weight matmuls), not
bandwidth - the kernel decomposition is in `VLLM_DECOMPOSITION.md`.
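The plateau described above can be reproduced with a two-term step model. This is a sketch under assumptions, not a fit to the measurements: the ~1.8 GB weight size and the per-sequence cost `T_PER_SEQ` are hypothetical values chosen so the ceiling lands in the measured range.

```python
def aggregate_tps(n_seqs: int, t_weights_s: float, t_per_seq_s: float) -> float:
    """Aggregate decode tokens/s for a batch of n_seqs sequences.

    Step time = one shared weight read + a per-sequence cost (KV reads,
    attention, host work). Both timings are illustrative assumptions.
    """
    step_s = t_weights_s + n_seqs * t_per_seq_s
    return n_seqs / step_s


# Hypothetical inputs: ~1.8 GB of Q8_0 weights read at ~273 GB/s, and a
# hand-picked per-sequence cost (not measured).
T_WEIGHTS = 1.8e9 / 273e9   # seconds per step spent reading weights
T_PER_SEQ = 0.27e-3         # seconds per step added by each sequence
CEILING = 1.0 / T_PER_SEQ   # asymptote as n_seqs grows

for npl in (1, 8, 128, 512, 1024, 2048):
    tps = aggregate_tps(npl, T_WEIGHTS, T_PER_SEQ)
    print(f"npl={npl:5d}  aggregate={tps:7.0f} t/s  per-seq={tps / npl:6.1f} t/s")
```

In this model the weight read is amortized quickly, after which aggregate throughput approaches `1 / T_PER_SEQ` and extra sequences only lower the per-sequence rate. Paging changes neither term, which is why both KV layouts reach the same ceiling.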
## What paged KV is actually for (the honest, deliverable value)
Paging never helps a static, uniform-length, single-model benchmark on a GPU with memory
to spare - there is no fragmentation and no over-reservation to reclaim. Its real wins,
which require the regime this hardware+benchmark does not exercise, are:
1. **Concurrent-tenant capacity under memory pressure.** Block KV fits more *diverse*
in-flight sequences (variable, dynamically arriving/leaving contexts) without the
contiguous path's per-slot reservation/fragmentation. Pays off when KV memory, not
compute/bandwidth, is the binding constraint - i.e. at multi-GPU datacenter scale or
with very long/variable contexts.
2. **Cross-request prefix sharing.** A chained-hash block cache shares identical system
prompts / RAG preambles across requests (vLLM's `block_pool.py` + block-hash map). A
real token-budget win for shared-prefix workloads; PR #22569 defers this to a
non-existent Phase 2 (our from-scratch P0 has the machinery).
These are measured as **max concurrent distinct tenants** and **KV memory saved**, not as
aggregate tok/s on one model. They do not move the single-GB10 throughput ceiling.
## Recommendation
- **Do not pitch paged KV as a single-GB10 throughput lever** - it is measured flat to
the contiguous ceiling (and PR #22569 is slower). Doing so would not survive a
benchmark.
- **The single-GB10 throughput story is already strong without paging:** llama.cpp is
ahead of vLLM single-stream (MXFP4 1153 > 800) and at ~70-81% of vLLM aggregate at
npl<=128 with a near-identical batching multiplier (`VLLM_DECOMPOSITION.md`). Ship the
MXFP4/NVFP4-dense prefill win (`NVFP4_TEST.md`) - that is the cheap, real, defensible
Blackwell number.
- **If datacenter-scale (thousands of concurrent tenants) is the genuine target,** the
lever is **multiple GPUs** plus paged KV's **capacity + prefix-sharing** features -
framed and measured as concurrent-tenant capacity and KV memory saved, on a
variable-context / shared-prefix workload. A single GB10 cannot produce the ~24k
aggregate regardless of KV layout; that is a fleet-level result.
## Reproduction (DGX, `~/llama.cpp-pr22569`, `LLAMA_MAX_SEQ=2048`)
```sh
M=~/bench/draft17/Qwen3-1.7B-Q8_0.gguf
# contiguous
for NPL in 128 256 512 1024; do
./build/bin/llama-batched-bench -m $M -npp 16 -ntg 128 -npl $NPL -ngl 99 \
-b 2048 -ub 2048 -fa on -c $((NPL*160)); done
# paged
for NPL in 512 1024 2048; do
./build/bin/llama-paged -m $M -kvp --fit off -ngpub 32768 -ncpub 128 \
-np $NPL -ns $NPL -n 128 -b 2048 -ub 2048 -ngl 99; done
```


@@ -1,170 +0,0 @@
# Paged KV: target-readiness (correctness, dynamic benchmark, 2xH200 projection)
Target hardware: **~2x H200** (282 GB HBM3e total, ~4.8 TB/s per GPU). The GB10 box is
the *test* rig, not the target - and several earlier "no win" findings are GB10-specific
artifacts (low bandwidth caps throughput before KV memory ever binds). This document
delivers the three things needed to push paged KV toward the real target:
1. **Correctness** of the paged path - verified (and a blocking bug found + fixed).
2. **A dynamic-load benchmark** that actually exercises where paging wins (`paged-loadgen.cpp`).
3. **A projection** of the paged-KV payoff on 2x H200, grounded in measured GB10 numbers.
---
## 1. Correctness: PASS (after fixing the auto-fit OOM)
`test-paged-kv-e2e` checks the paged decode path against the contiguous reference
(greedy argmax + top-5 set overlap >= 4). On the box it was previously **unverified** -
it aborted at context creation. Root cause found:
- `common_fit_paged_kv_blocks` (`common/common.cpp:1144`) **unconditionally overrides**
`n_gpu_blocks` from `ggml_backend_dev_memory`, which **over-reports free VRAM on the
GB10 integrated/unified device** (it sized **~245 GB of KV on a 119 GB box** ->
`cudaMalloc` OOM -> `GGML_ASSERT` abort in `llama-kv-cache-paged.cpp:74`). The test's
explicit `n_gpu_blocks=64` was being clobbered because `params.fit_params` defaults on.
**Fix (item-1 patch, applied on the box):**
```diff
--- a/tests/test-paged-kv-e2e.cpp
+++ b/tests/test-paged-kv-e2e.cpp
@@ run_paged()
params.kv_paged = true;
+ params.fit_params = false; // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
params.n_gpu_blocks = 64;
```
**Result (Qwen3-0.6B-Q8_0, GB10):**
```
test-paged-kv-e2e: top-5 argmax match: ref=3743 paged=3743
test-paged-kv-e2e: top-5 set overlap: 5/5 (require >= 4)
test-paged-kv-e2e: PASSED
```
The paged op is **numerically greedy-equivalent to the contiguous path**. The fix for the reshape
bug from `PR22569_EVAL.md` (decoupled head_dim) is already applied in the checkout.
**Target-readiness caveat (the durable fix, not just the test):** the auto-fit itself is
brittle and must be hardened before it runs on a real serving box - even though
`ggml_backend_dev_memory` reports correctly on a discrete H200, the function should still
(a) early-return when `!params.fit_params`, (b) **clamp** the computed `n_gpu_blocks` so
`n_gpu_blocks * block_bytes <= free_vram - margin` using the *actual* KV element size, and
(c) not override an explicitly-set value. One-screen change in `common_fit_paged_kv_blocks`.
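The three hardening rules (a)-(c) can be sketched as follows. This is a Python illustration of the intended logic only; the function name, parameters, and the example block size are hypothetical and do not reflect the actual C++ signature of `common_fit_paged_kv_blocks`.

```python
def fit_paged_kv_blocks(current_blocks, explicitly_set, fit_enabled,
                        free_vram_bytes, block_bytes, margin_bytes):
    """Hypothetical hardened block-count fit (not the PR's actual code).

    block_bytes must be computed from the actual KV element size.
    Returns the number of GPU KV blocks to allocate.
    """
    # (a) auto-fit disabled: leave the caller's value alone.
    if not fit_enabled:
        return current_blocks
    # (c) an explicitly set value is never overridden.
    if explicitly_set:
        return current_blocks
    # (b) clamp so that n_blocks * block_bytes <= free_vram - margin.
    budget = max(free_vram_bytes - margin_bytes, 0)
    return budget // block_bytes


GIB = 1024 ** 3
# Example numbers are assumptions: 16-token blocks at 256 KiB/token.
block_bytes = 16 * 256 * 1024
n = fit_paged_kv_blocks(0, False, True, 119 * GIB, block_bytes, 8 * GIB)
print(n, "blocks,", n * block_bytes / GIB, "GiB of KV")
```

Integer division keeps the result at or below the budget, so an over-reporting device can at worst waste the margin rather than trigger an OOM abort, provided the reported free memory is itself bounded by physical memory.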
---
## 2. Dynamic-load benchmark - `paged-loadgen.cpp`
**Why the existing tools show no paged win:** `llama-batched-bench` and the stock
`examples/paged/paged.cpp` both run **fixed-length, all-arrive-at-once, single-prompt**
load. That has no over-reservation and no fragmentation, so contiguous KV is already
memory-optimal and paging has nothing to reclaim (`PAGED_KV_HIGH_CONCURRENCY.md`). The
paged win only exists under **variable lengths + continuous arrival + shared prefixes** -
the real serving regime. No tool in the tree creates it.
`paged-loadgen.cpp` (committed here) does, via the confirmed `llama_paged_scheduler_*`
API:
- **shared system prefix** (`LG_PREFIX` tokens) prepended to every request -> exercises
cross-request prefix sharing,
- **variable prompt length** (`LG_SUFMIN..LG_SUFMAX` unique suffix),
- **bimodal generation length** (`LG_GENLONG` for `LG_LONGPCT`% of requests, else
`LG_GENSHORT`) - the over-reservation driver,
- **continuous arrival**: keeps `LG_INFLIGHT` requests live, admitting a new one each time
one finishes.
It reports the load-bearing number for the buy decision - the **capacity ratio**:
```
paged peak KV = sum over live seqs of ceil(used/block)*block * kv_bytes_per_token
contiguous reserve = peak_inflight * max_ctx * kv_bytes_per_token (worst-case per slot)
CAPACITY RATIO = contiguous_reserve / paged_peak (+ prefix sharing on top)
```
`kv_bytes_per_token = 2 * n_layer * n_head_kv * head_dim * sizeof(f16)` - confirmed against
`llama-kv-cache-paged.cpp` (e.g. Qwen3-32B: 2*64*8*128*2 = **256 KiB/token**).
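The formula and the capacity ratio can be checked with a few lines of Python. The helper names and the example live set (85 short and 15 long sequences, `max_ctx` 4096, 16-token blocks) are assumptions for illustration, and the ratio is taken over a single snapshot rather than the peak that `paged-loadgen` tracks.

```python
import math


def kv_bytes_per_token(n_layer, n_head_kv, head_dim, elem_bytes=2):
    """K and V for every layer and KV head; elem_bytes=2 for f16."""
    return 2 * n_layer * n_head_kv * head_dim * elem_bytes


def capacity_ratio(used_tokens, max_ctx, block_tokens, bytes_per_token):
    """Contiguous reserve divided by paged KV for one snapshot of live seqs."""
    paged_tokens = sum(math.ceil(u / block_tokens) * block_tokens
                       for u in used_tokens)
    contiguous_tokens = len(used_tokens) * max_ctx
    return (contiguous_tokens * bytes_per_token) / (paged_tokens * bytes_per_token)


# Qwen3-32B shape from the text: 64 layers, 8 KV heads, head_dim 128, f16.
bpt = kv_bytes_per_token(64, 8, 128)
print(bpt // 1024, "KiB/token")

# Hypothetical bimodal live set: 85 sequences at 600 tokens, 15 at 3000.
live = [600] * 85 + [3000] * 15
print(f"capacity ratio: {capacity_ratio(live, 4096, 16, bpt):.2f}x")
```

`bytes_per_token` cancels in the ratio, so the result depends only on the length distribution, `max_ctx`, and the block size. It is kept in the signature to mirror the formula above.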
**How to run (on the target):** drop into PR #22569's `examples/paged/`, add to its
CMakeLists next to `llama-paged`, build, then e.g.
`LG_INFLIGHT=2048 LG_LONGPCT=15 paged-loadgen -m <model> -kvp --fit off -ngpub <N> -ncpub <M> -ngl 99`.
Sweep `LG_INFLIGHT` to the throughput plateau and read the capacity ratio at that point.
It is written to run on the target (2x H200) where the regime exists; on GB10 it runs but
the ratio is uninteresting because throughput plateaus before memory binds (see below).
---
## 3. Projection to 2x H200 (grounded in measured GB10 numbers)
### Measured on GB10 (this work)
| model | decode plateau (aggregate) | plateau concurrency | bound by |
|---|---|---|---|
| Qwen3-32B-Q4_K_M (dense) | ~540 t/s | npl ~128 | compute |
| Qwen3-1.7B-Q8_0 | ~3,200 t/s | npl ~512 | bandwidth |
### Hardware ratios (per GPU, then 2x TP at ~85% scaling)
| | GB10 | H200 | per-GPU x | 2x H200 (TP) x |
|---|---|---|---|---|
| mem bandwidth | 273 GB/s | ~4.8 TB/s | 17.6 | ~30 |
| BF16 compute | ~213 TFLOP | ~989 TFLOP | 4.6 | ~8 |
| memory capacity | 119 GB (unified LPDDR5X) | 141 GB (HBM3e) | 1.18 | 2.4 (282 GB) |
Decode is bandwidth-bound, so **both the aggregate ceiling and the concurrency at which it
is reached scale with bandwidth (~30x on 2x H200)**:
- **32B-dense aggregate decode ceiling:** 540 x 30 ~= **16,000 t/s**, reached at
~128 x 30 ~= **3,800 concurrent sequences**.
### Why paged KV becomes the binding lever on 2x H200 (and didn't on GB10)
To reach that ~16k t/s ceiling you must hold **~3,800 sequences** of KV. The memory math:
- 32B weights (FP8) ~= 32 GB, sharded over 2 GPUs -> ~250 GB HBM free for KV.
- 32B KV = 256 KiB/token. At an avg held context of 2,000 tokens, **per seq = 512 MiB**.
- Contiguous unified KV (reserve for the live set) fits ~250 GB / 512 MiB ~= **~490
sequences** - **8x short of the 3,800 needed to reach the throughput ceiling.**
So on 2x H200 **KV memory is the binding constraint at the throughput-optimal concurrency**,
and contiguous KV strands most of the bandwidth (you'd run at a fraction of 16k t/s). This
is the gap paged KV closes. On GB10 it never appeared because GB10's 30x-lower bandwidth
caps decode at npl ~128, whose KV fits in memory trivially - the constraint order is
inverted on the real target.
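The projection arithmetic above, written out as a Python sketch. Every input comes from the text; the 85% tensor-parallel efficiency and the ~2,000-token held context are the document's own assumptions, and the variable names are illustrative.

```python
GB10_BW_GBS = 273.0
H200_BW_GBS = 4800.0
TP_GPUS = 2
TP_EFFICIENCY = 0.85          # assumed tensor-parallel scaling

# Bandwidth ratio of 2x H200 (TP) over one GB10.
bw_scale = H200_BW_GBS / GB10_BW_GBS * TP_GPUS * TP_EFFICIENCY

# Measured GB10 plateau for 32B dense, scaled by bandwidth as the text does.
ceiling_tps = 540 * bw_scale
seqs_needed = 128 * bw_scale

# KV memory: 256 KiB/token, ~2,000 held tokens per sequence, ~250 GB free.
kv_per_seq_bytes = 256 * 1024 * 2000
seqs_fit = 250e9 / kv_per_seq_bytes

print(f"bandwidth scale:     ~{bw_scale:.0f}x")
print(f"decode ceiling:      ~{ceiling_tps:,.0f} t/s")
print(f"sequences needed:    ~{seqs_needed:,.0f}")
print(f"sequences that fit:  ~{seqs_fit:,.0f}")
print(f"shortfall:           ~{seqs_needed / seqs_fit:.0f}x")
```

With exact byte units the contiguous fit comes out slightly below the ~490 quoted above (the text mixes GB and MiB), but the ~8x shortfall is unchanged.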
### Magnitude of the paged win
Paging recovers concurrency two ways, both multiplicative on achievable throughput:
1. **No over-reservation.** Contiguous must back `max_ctx` per slot; paging uses
`ceil(actual/block)`. For a realistic bimodal workload (most generations short, ~15%
long, prompts ~512) the average held context is several-fold below `max_ctx` ->
`paged-loadgen` capacity ratio typically **~4-10x** (it measures the exact number for
your workload's length distribution).
2. **Cross-request prefix sharing** of shared system prompts / RAG preambles - additional,
workload-dependent (chained-hash block cache; vLLM's `block_pool.py`).
Net: on 2x H200, paged KV is plausibly the difference between serving **~500 and ~3,800**
concurrent 32B sequences in HBM, i.e. between a fraction of and ~all of the **~16k t/s**
decode ceiling. **That is the datacenter payoff, and it is real on the target even though
GB10 cannot exhibit it.**
### Honest caveats for the buy case
- These are **projections** from GB10 + spec ratios; the capacity multiplier depends on the
workload's context-length distribution (more variable -> bigger paged win) and TP
efficiency. `paged-loadgen` measures it directly once you have target-GPU time.
- The **paged op itself still needs work**: PR #22569's `ggml_paged_attn` was 12-13%
*slower* than the mature contiguous flash-attention path at equal concurrency
(`PR22569_EVAL.md`), lacks prefix sharing (deferred to a non-existent Phase 2), and has
the fit-robustness bug above. Adopting paged KV for the target means either hardening
#22569 or finishing the from-scratch P4 - the capacity win above assumes a *correct,
competitive* op, which is the remaining engineering.
- Prefill on either KV layout is compute-capped, not a paged concern.
**Bottom line for the decision:** paged KV **is** the right lever for the 2x H200 target -
the GB10 "no win" result is a bandwidth artifact, not a verdict. The paged path is now
**correctness-verified**, the **benchmark to size the win exists**, and the projection
says the payoff is **~4-10x concurrent-tenant capacity -> several-fold higher aggregate
decode** on the target. The remaining work is hardening/finishing the paged op, not
proving the thesis.


@@ -1,55 +0,0 @@
# Making llama.cpp/LocalAI a viable vLLM alternative — phased plan
Goal: close the practical gap to vLLM for both single-user *speed* and multi-user *throughput*, while keeping
quality (no lossy quant). Grounded in measured benchmarks + research (`BENCHMARKS.md`, `BLACKWELL_KERNEL_GAPS.md`,
`VLLM_THROUGHPUT_GAP.md`). The gap is NOT one thing — each phase targets a distinct, independent lever.
## Where vLLM actually leads (measured, GB10 / Qwen3-32B)
- **Single-user decode:** ~parity (10.2 vs 11.7) — bandwidth-bound. vLLM's edge is **spec-dec** (lossless).
- **Multi-user decode:** gap grows to ~2.2× at B=64 (kernel + scheduler).
- **Prefill aggregate:** llama plateaus ~765, vLLM scales to 24k — **paged KV + chunked prefill + kernel**.
- Note: on GB10 vLLM's FP4 trump card is *broken* (falls back to Marlin); llama.cpp runs reliably — a real
viability point. vLLM is structurally ahead mainly via **paged KV, chunked prefill, cross-request prefix cache**.
## Phases
### Phase 1 — Hardware-tuned config (PR #10411) — DONE
Folded into the hardware-defaults path (`core/config/hardware_defaults.go`):
- Blackwell physical batch (n_ubatch) = 2048.
- **VRAM-scaled `n_parallel` default** (>=32 GiB→8, >=8 GiB→4, >=4 GiB→2): turns on concurrency + continuous batching,
which the backend leaves OFF at its `n_parallel=1` default. Unified KV → slots share the budget (no extra
KV memory). Single-host (local GPU) + distributed router (per node). Already-good defaults confirmed:
flash-attn=auto, context=4096.
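The VRAM tiers above amount to a small threshold table. The Python sketch below only illustrates that mapping; the real logic lives in Go in `core/config/hardware_defaults.go`, and the function name and the fallback of 1 below 4 GiB (taken from the backend's `n_parallel=1` default) are assumptions.

```python
def default_n_parallel(vram_gib: float) -> int:
    """Illustrative VRAM-scaled n_parallel default (not the Go implementation)."""
    if vram_gib >= 32:
        return 8
    if vram_gib >= 8:
        return 4
    if vram_gib >= 4:
        return 2
    # Assumed fallback: the backend's own default of one slot.
    return 1


for vram in (2, 4, 8, 24, 32, 119):
    print(f"{vram:4d} GiB -> n_parallel={default_n_parallel(vram)}")
```

Because the KV cache is unified, raising `n_parallel` splits the existing KV budget across slots instead of allocating more memory per slot.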
### Phase 2 — Paged / block KV cache ← biggest structural multi-user lever
vLLM's PagedAttention lifts KV utilization ~20-38% → ~96%. llama.cpp's own A10G data (draft PR #22569):
contiguous OOMs at 26 seqs / 496 t/s → paged 247 seqs / 1256 t/s (**~9.5× concurrency, 2.5× aggregate**).
- Build on / complete **upstream draft PR #22569** (`-kvp`, block manager + paged-attn ggml op, FCFS scheduler)
rather than the from-scratch series we prototyped (`paged/`). Our CPU-verified block manager + gather-read
design informs the review/port; the upstream momentum is the place to land it.
- Phase 2b: cross-request prefix sharing (block-hash dedup) — our `PagedKVManager` already implements it.
### Phase 3 — Prefill amortization (chunked prefill + n_batch/n_ubatch split)
llama aggregate prefill plateaus because (a) one prompt saturates compute, (b) the per-forward GEMM M-dim is
capped at `n_ubatch`=512, (c) no scheduler chunked prefill (draft #10718 abandoned).
- Split logical `n_batch` from physical `n_ubatch` (LocalAI ties them today) so concurrent prefills batch into
a larger logical batch while keeping ubatch at the Blackwell sweet spot (2048).
- Chunked prefill + prefill/decode co-batching in the server slot scheduler.
### Phase 4 — Batched-GEMM kernel tuning (the decode 2.2× + prefill height)
Per `BLACKWELL_KERNEL_GAPS.md`: dense int8-MMQ at ~21% of ceiling, MoE FP4-MMA at ~5%. Both untuned for
Blackwell. To MATCH: tune MMQ or a Marlin-style W4A16 BF16 GEMM (FP4 not required: on GB10, INT8 and BF16
peak rates are equal). To BEAT (2×): fix+tune the existing FP4-MMA on sm_121 (a build-flag/`-O3`-miscompile
issue, not greenfield work).
### Phase 5 — Backend GPU sampling
CPU per-sequence sampling caps GPU utilization at ~60% beyond n_parallel ~8-16 (upstream PR #17004). Track/adopt.
### Cross-cutting — Speculative decoding (single-user speed, quality-preserving)
Dense ≥14B: lossless ~1.8-3×. llama.cpp has `-md`/`--spec-draft-*`. Wire a draft-model field in the model
config + ship Qwen3 target+draft (1.7B) pairs in the gallery. NOT for MoE-A3B (nothing to amortize).
## Sequencing rationale
Phase 1 (config) ships now — biggest immediate multi-user win for zero kernel work (concurrency was OFF).
Phase 2 (paged KV) is the highest-leverage structural build and has upstream momentum. Phases 3-4 are deeper
(scheduler + kernel). Spec-dec is independent and can land any time for single-user speed.

@@ -1,90 +0,0 @@
# PR #17004 (backend / GPU sampling) evaluation on DGX Spark (GB10, sm_121)
Date: 2026-06-21. Hardware: NVIDIA GB10 (sm_121), CUDA 13.0, cmake 3.28.
Model: `Qwen3-32B-Q4_K_M.gguf`. LocalAI pin: `LLAMA_VERSION=f3e182816421c648188b5eab269853bf1531d950` (2026-06-17).
## TL;DR (clean negative)
1. **PR #17004 is MERGED and is ALREADY present in our pinned llama.cpp `f3e1828`.** There is nothing to apply / cherry-pick / patch. The `-bs/--backend-sampling` CLI arg, the `llama_set_sampler` / `llama_get_sampled_*` API, and the GPU argsort/top-k/cumsum/softmax kernels are all in the pin.
2. **The prescribed benchmark cannot test the fix.** `llama-batched-bench` does ZERO sampling - it feeds random tokens (`std::rand() % n_vocab`). Its ~540 t/s plateau is therefore **not** sampling-bound, and enabling backend sampling does nothing to it. The valid tool is `llama-batched` (examples/batched), which the PR updated to drive per-sequence sampler chains and which actually exercises `-bs`.
3. **In a controlled real-sampling A/B (same `llama-batched` harness, CPU vs GPU sampler), GPU sampling gave only +25% at np=32, +3% at np=64, and CRASHED (`GGML_ASSERT(obj_new)`, graph-context alloc) at np=128 and np=256** - exactly the multi-user regime the investigation cares about.
4. **nsys at np=64: GPU kernel profile and GPU-busy time are essentially identical with and without the fix** (CPU 392.5 t/s / GPU 404.2 t/s; total GPU kernel+memop time ~4.05 s in both). Sampling kernels do not even appear among the top GPU contributors. GPU utilization did **not** rise.
5. **Conclusion: PR #17004, in the state shipped by our pin, does NOT break the ~540 plateau and does not move decode aggregate toward the ~2700 GPU-bound ceiling or past vLLM's 667.** It is modest at low parallelism and unusable (crash) at the high parallelism in question. The PR's own guidance ("recommended `--parallel 1`", "will take time to mature") matches what we measured.
## 1. What PR #17004 does + state
- Title: "sampling : add support for backend sampling". **State: MERGED** into `master` (PR head branch `gpu-sampling`). 44 files, +4133/-296.
- `libllama`: new `llama_context_params.samplers` / `n_samplers`, `llama_set_sampler`, `llama_get_sampled_*`, `llama_sampler_seq_config`, updated `llama_sampler_i`. Sampler chain can now run inside the compute graph on the backend (GPU) instead of on the CPU after `llama_decode`.
- CUDA: optimized/new `argsort`, `top-k`, `cumsum`, `softmax` kernels; CMake option `-DGGML_CUDA_CUB_3DOT2=ON` (builds a CCCL v3.2 prerelease for faster top-k).
- Tools: new `-bs, --backend-sampling` arg in `common/arg.cpp` (line 1921); server (`server-context.cpp`) per-slot wiring; `examples/batched/batched.cpp` updated.
- Supported backend samplers: `top-k`, `top-p`, `min-p`, `temp` (+ dist). **Limitations (from the PR): not compatible with grammar sampling; single output per sequence per batch; no save/load of sampling state; recommended only with `--parallel 1` and CUB_3DOT2.** Open follow-ups: #18547 (avoid graph reallocations), #18550 (skip inactive samplers in parallel decode).
- It DOES target the CPU-side per-sequence sampling stall we hypothesised - the mechanism is correct. Maturity is the problem.
Note: the GitHub API reports `mergedAt: 2026-01-04`, but the PR contains June 2026 upstream-merge commits and the feature is verified present in our 2026-06-17 pin, so treat the date field as a metadata quirk. What matters: the code is in `f3e1828`.
## 2/3. Apply + build
No apply needed (already in pin). Built from a clean `git worktree` at `f3e1828` (`~/llama-pr17004`), to avoid disturbing the existing diffusion build:
```sh
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=121 -DLLAMA_MAX_SEQ=256 \
  -DGGML_CUDA_CUB_3DOT2=ON -DLLAMA_CURL=OFF
cmake --build build --target llama-batched llama-batched-bench -j20
```
**Build: SUCCESS** (CUB_3DOT2=ON FetchContent fetched and compiled despite flaky net; sm_121; LLAMA_MAX_SEQ=256). `-bs/--backend-sampling` confirmed present in `llama-batched --help`.
## 4. Decode aggregate: fix vs baseline vs vLLM
### 4a. `llama-batched-bench` (NO sampling - reconfirms the plateau, unaffected by the fix)
`-npp 16 -ntg 128 -npl 32,64,128,256 -c 40960 -b 2048 -ub 2048`
| npl | S_TG t/s |
|-----|----------|
| 32 | 241.8 |
| 64 | 395.1 |
| 128 | 542.6 |
| 256 | 567.2 |
Reproduces the ~540 plateau. Because this tool never samples, `-bs` is irrelevant here - the plateau is decode/host-overhead-bound, not sampling-bound.
### 4b. `llama-batched` real-sampling A/B (CPU sampler vs `-bs` GPU sampler, identical harness)
`-kvu -n 128 -np {32,64,128,256} -c 40960 --seed 1` (samplers: top-k 40 / top-p 0.95 / temp 0.8)
| np | CPU sampling t/s | GPU `-bs` sampling t/s | delta |
|-----|------------------|------------------------|-------|
| 32 | 174.1 | 217.5 | +25% |
| 64 | 390.5 | 403.4 | +3.3% |
| 128 | 497.9 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
| 256 | 396.7 | **CRASH** `GGML_ASSERT(obj_new) ggml.c:1768` | - |
(`llama-batched` absolute t/s is lower than `batched-bench` because it does real sampling plus per-token detokenize/string/stream work; the A/B *within* this harness isolates the sampler cost.)
**Does the fix break the plateau? No.** GPU sampling helps only at low parallelism and the gain shrinks as np rises (+25% -> +3%), then the path crashes at np>=128 - i.e. it fails in exactly the multi-user regime where the plateau matters. It does not approach the ~2700 ceiling and does not pass vLLM's 667. The CPU-sampling curve itself peaks at np=128 (498) and *drops* at np=256 (397), confirming CPU sampling is a scaling wall - but PR #17004 as shipped does not lift it because the GPU path is unstable there.
## 5. GPU-utilization mechanism (nsys, np=64, the highest np where `-bs` survives)
`nsys profile -t cuda ... -n 96 -np 64`
| mode | decode t/s | total GPU kernel+memop time | top GPU contributors |
|------|-----------|------------------------------|----------------------|
| CPU sampling | 392.5 | ~4.07 s | mul_mat_q (55%+17%), flash_attn (5.7%), mul_mat_vec (2%) |
| GPU `-bs` | 404.2 | ~4.04 s | identical set; sampling kernels not in top contributors |
GPU-busy time and the kernel mix are **essentially unchanged** between modes. The argsort/top-k/cumsum/softmax sampling kernels are negligible in the timeline; the only visible difference is H2D memcpy *instances* rising 1,495 -> 7,076 (pinned-memory sampler transfers) at ~unchanged total memcpy time. **GPU utilization did not rise.** This directly refutes the idea that, at this workload, the GPU idle is dominated by CPU sampler arithmetic - moving the sampler onto the GPU barely changed throughput (+3%) and did not raise GPU occupancy. The ~80% idle measured elsewhere is dominated by something other than the sampler math (host-side batch construction / synchronization / detokenize), which PR #17004 does not address.
(np=256 nsys "with fix" could not be captured: `-bs` aborts there. Fixing the crash needs the unmerged follow-ups #18547/#18550, not in our pin.)
## LocalAI adoption path
**The code arrives transparently with a version bump; enabling it is not transparent.**
- `backend/cpp/llama-cpp/prepare.sh` copies all of upstream `llama.cpp/tools/server/*` (including the #17004-modified `server-context.cpp` / `server-task.cpp` / `server-common.cpp`) into `tools/grpc-server/`, and `grpc-server.cpp` `#include`s them. So once `LLAMA_VERSION` points at a commit containing #17004 (our pin `f3e1828` already does), the backend-sampling machinery compiles into `grpc-server` automatically. **No vendored patch in `patches/` is required for the code.**
- The vendored `server-context.cpp` already does the per-slot wiring (around line 1615): `backend_sampling &= task.params.sampling.backend_sampling`, also disabled for speculative decode and for pre-sampling logits (`n_probs>0`), then `llama_set_sampler(ctx_tgt, slot.id, common_sampler_get(slot.smpl))`.
- **But it is OFF unless `task.params.sampling.backend_sampling == true`.** LocalAI's `grpc-server` builds `params` itself from the gRPC request and never sets this flag (and does not pass the upstream `--backend-sampling` CLI arg). So as-is, LocalAI compiles the feature but never uses it. **A small grpc-server change is needed**: read a LocalAI model option / env and set `params.sampling.backend_sampling = true` (global or per-request).
- For performant CUDA top-k, add `-DGGML_CUDA_CUB_3DOT2=ON` to the llama-cpp CUDA `CMAKE_ARGS` in the Makefile (optional; a non-CUB fallback exists).
- **Caveats that blunt the benefit for LocalAI specifically:** grammar-constrained requests (JSON-schema / tool calls - a large share of LocalAI traffic), `logprobs`/`n_probs>0`, and speculative decoding all fall back to CPU sampling by the gating above; and the GPU path crashes at np>=128 in this pin. So even after wiring the flag, the multi-user throughput case would not benefit (and would crash) until the follow-up PRs (#18547/#18550) land and stabilise high-parallelism backend sampling.
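
The gating described in the bullets above can be modelled as a single predicate. This is a simplified sketch, not the code in `server-context.cpp`: the `Request` class and the `has_grammar` / `speculative` field names are hypothetical, while `backend_sampling` and `n_probs` follow the names quoted above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    backend_sampling: bool = False  # the flag LocalAI currently never sets
    has_grammar: bool = False       # JSON-schema / tool-call constrained output
    n_probs: int = 0                # > 0 when logprobs are requested
    speculative: bool = False       # draft-model decoding active


def uses_gpu_sampler(req: Request) -> bool:
    """True only when the flag is set and no CPU-only feature is requested."""
    return (
        req.backend_sampling
        and not req.has_grammar
        and req.n_probs == 0
        and not req.speculative
    )


print(uses_gpu_sampler(Request()))                                         # today's LocalAI default
print(uses_gpu_sampler(Request(backend_sampling=True)))                    # after wiring the flag
print(uses_gpu_sampler(Request(backend_sampling=True, has_grammar=True)))  # tool-call request
```

Under this model, wiring the flag only helps plain sampling requests; grammar, logprobs, and speculative requests stay on the CPU sampler.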
### Recommendation
Do **not** adopt PR #17004 as the multi-user throughput fix yet. It is already in the tree but is immature at the parallelism that matters (crashes at np>=128, modest gains below). The measured bottleneck at this workload is not the sampler arithmetic (nsys shows GPU-busy unchanged when sampling moves to GPU). Re-evaluate after #18547/#18550 merge into a future pin; revisit the host-side decode/batch-construction overhead as the more likely real lever.

@@ -1,136 +0,0 @@
# Evaluation: llama.cpp PR #22569 (paged KV cache, `-kvp`) on DGX Spark (GB10, sm_121)
Question: is upstream draft PR #22569 the right base to give LocalAI vLLM-class
high-concurrency GPU throughput, or should we finish our own from-scratch P4
(`backend/cpp/llama-cpp/paged/`)?
Date: 2026-06-21. Hardware: NVIDIA GB10 (compute 12.1 / sm_121), 122502 MiB unified
memory, CUDA 13.0, gcc 13.3. Models: `Qwen3-32B-Q4_K_M.gguf` (18.4 GB, 64 layers,
n_head 64 / n_head_kv 8 / head_dim 128 / n_embd 5120) and `Qwen3-0.6B-Q8_0.gguf` for
the correctness gate.
## TL;DR verdict: DO NOT adopt #22569. Finish our own P4.
On GB10 with a 32B dense model, PR #22569 delivers **no throughput win and no concurrency
win** - it is ~12% *slower* than the existing contiguous path and hits the *same*
256-sequence ceiling. The "scale to thousands of sequences like vLLM" premise does not
hold for this PR or this hardware/model. On top of that it is broken out of the box,
wired to the wrong integration surface, and a contested draft.
## 1. Builds? Correct?
- **Builds: YES.** Cloned `matiaslin/llama.cpp@paged_attention` (PR #22569, single commit
`0b0f7bd...`, base = current master). Clean CUDA build for sm_121
(`-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=121 -DCMAKE_BUILD_TYPE=Release`).
`llama-paged`, `llama-batched-bench`, `test-paged-kv`, `test-paged-kv-e2e` all link.
It is self-contained (ships its own CPU+CUDA `ggml_paged_attn` op) and does **not**
depend on the competing CUDA PR #17579 (ericcurtin, `--pagedattention`).
- **Runs out of the box: NO.** `llama-paged -kvp` on Qwen3-32B *and* Qwen3-0.6B crashes
at context creation:
`build_attn(llm_graph_input_attn_kv_paged*) -> ggml_reshape_2d ->`
`GGML_ASSERT(ggml_nelements(a) == ne0*ne1)` (src/llama-graph.cpp:2556). Same crash with
`--fit off` (so it is the real graph, not just the memory probe).
**Root cause:** the paged path hardcodes `ggml_reshape_2d(cur, hparams.n_embd, ...)`,
wrong for any model where `n_head*head_dim != n_embd`. Qwen3 decouples head_dim:
32B = 64*128 = **8192** vs n_embd 5120; 0.6B = 16*128 = **2048** vs 1024. The PR's
"qwen3 verified" claim does **not** hold against current Qwen3 GGUFs. Fix is ~1 line
(use the real attention width `cur->ne[0]*cur->ne[1]`); applied for the rest of the eval.
- **`fit_params` (`-ngpub` auto-sizing) also crashed on GB10** in the same reshape path
during the device-memory probe (before the fix). After the reshape fix, paged
auto-fit works (sized 96624 GPU blocks on the 0.6B from 85 GiB free).
- **Correctness after the reshape fix:** paged decode runs and produces **coherent**
output on Qwen3-32B (sensible mercury / miso-soup / Starry-Night answers across 128 and
256 concurrent sequences), indicating the `ggml_paged_attn` op is functionally roughly
correct. The PR's own greedy/top-K equivalence test (`test-paged-kv-e2e`, top-K argmax +
top-5 overlap >= 4 + first-4-token greedy match vs non-paged) on Qwen3-0.6B did
**not** reach a PASS/FAIL verdict on GB10: its paged auto-fit grabs ~88 GiB
(96531 blocks) and the run then stalls at cache init (a third GB10 fit-robustness
issue, distinct from the reshape bug). So the formal greedy-equivalence gate is
**unverified on this box**, but the qualitative evidence (coherent multi-sequence 32B
output with explicit small `-ngpub`) indicates the fixed op is roughly correct. This
does not change the verdict, which is decided by throughput below.
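
The reshape root cause is plain arithmetic on the model dimensions quoted above. The sketch below checks only the element-count precondition of a 2D reshape; `reshape_ok` is a hypothetical stand-in for the assert, and `n_tokens = 4` is an arbitrary illustrative value.

```python
MODELS = {
    "Qwen3-32B":  {"n_head": 64, "head_dim": 128, "n_embd": 5120},
    "Qwen3-0.6B": {"n_head": 16, "head_dim": 128, "n_embd": 1024},
}


def reshape_ok(n_elements, ne0, ne1):
    """Element-count precondition of a 2D reshape: total must equal ne0 * ne1."""
    return n_elements == ne0 * ne1


n_tokens = 4
for name, m in MODELS.items():
    attn_width = m["n_head"] * m["head_dim"]   # real width of the attention output
    n_elements = attn_width * n_tokens
    print(
        name,
        "attn_width =", attn_width,
        "| reshape to n_embd ok:", reshape_ok(n_elements, m["n_embd"], n_tokens),
        "| reshape to attn_width ok:", reshape_ok(n_elements, attn_width, n_tokens),
    )
```

Any model whose `n_head * head_dim` equals `n_embd` passes both checks, which is why the hardcoded width goes unnoticed on such models.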
## 2. Throughput: paged vs contiguous on GB10 (Qwen3-32B-Q4_K_M)
Contiguous = `llama-batched-bench` (unified KV, continuous batching), S_TG decode tok/s.
Paged = `llama-paged -kvp --fit off` (its scheduler-driven continuous-batching loop),
`aggregate tps`. Both `npp~16, ntg/n_predict=128, n_batch=n_ubatch=2048, -ngl 99`.
| npl | contiguous (S_TG t/s) | paged `-kvp` (agg t/s) | outcome |
|------|----------------------|------------------------|---------|
| 128 | **537** (S 553) | **477** | both run; paged ~12% slower |
| 256 | **541** (S 550) | **471** | both run; paged ~13% slower; neither gains over 128 |
| 512 | FAIL | FAIL | **both** die: `n_seq_max must be <= 256` |
| 1024 | FAIL | FAIL | **both** die: `n_seq_max must be <= 256` |
### The decisive facts
1. **PR #22569 does NOT lift the 256-sequence ceiling.** Both contiguous and paged fail
identically at npl 512/1024 with `n_seq_max must be <= 256` (llama.cpp's compile-time
`LLAMA_MAX_SEQ`). It is **not** an OOM - GB10 has 119 GiB and at npl=256 contiguous KV
is only 16 GiB. Paging gives **zero** concurrency headroom over contiguous here. The
"paged unlocks thousands of seqs" premise is false for this PR.
2. **Paged is slower, not faster.** The fresh `ggml_paged_attn` op (477/471 t/s) loses to
the mature CUDA flash-attention contiguous path (537/541 t/s) by ~12-13% at equal
concurrency. The PR's A10G "2.5x" came entirely from contiguous OOMing at 26 seqs on a
24 GiB card; that lever does not exist on GB10's 119 GiB.
3. **The 32B dense model is compute-bound and plateaus by npl=128 on GB10.** Aggregate is
flat from 128->256 (contiguous 537->541; paged 477->471). Doubling concurrency buys
nothing because the GPU is already saturated on the 32B weight matmuls. Even if we
recompiled with a larger `LLAMA_MAX_SEQ`, aggregate would not climb - so vLLM-class
~24k aggregate is **unreachable for 32B-dense on a single GB10 regardless of KV
layout**. The throughput gap to vLLM at this model/hardware is a compute/bandwidth
problem, not a KV-fragmentation problem.
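
A back-of-the-envelope check of the "not an OOM" point, using the Qwen3-32B dimensions listed at the top of this evaluation. It assumes an f16 K and V cache; the 256 x 256 token split is only one combination that reproduces the 16 GiB figure, since the per-sequence context of that run is not stated here.

```python
# Model dimensions quoted in this evaluation (Qwen3-32B).
N_LAYERS, N_HEAD_KV, HEAD_DIM = 64, 8, 128
BYTES_PER_ELEM = 2  # assumption: f16 K and V cache


def kv_bytes_per_token():
    # K and V each store n_head_kv * head_dim elements per layer.
    return 2 * N_LAYERS * N_HEAD_KV * HEAD_DIM * BYTES_PER_ELEM


def kv_gib(n_tokens):
    """KV cache size in GiB for n_tokens cached tokens."""
    return n_tokens * kv_bytes_per_token() / 2**30


print(kv_bytes_per_token() // 1024, "KiB per cached token")
print(kv_gib(256 * 256), "GiB for 256 sequences x 256 tokens (illustrative split)")
```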
## 3. Verdict and reasoning: finish our own P4
**Do not adopt #22569 as the base.** Reasons:
- **No win on target hardware.** Even fully completed, on GB10 + 32B it is slower than
what we already have and capped at the same 256 seqs. There is no throughput or
concurrency dividend to harvest here.
- **Wrong integration surface.** Paged is driven only by a brand-new parallel C API
(`llama_paged_scheduler_init/add_request/prepare_batch/get_batch_info/update/...`) and a
bespoke `examples/paged` loop. `-kvp`/`--kv-paged` is gated to `LLAMA_EXAMPLE_PAGED`
only - it is NOT wired into `llama-server`/`batched-bench`/`parallel`, i.e. NOT the path
LocalAI's grpc-server derives from. Adopting it means rewriting LocalAI's serving loop
around the new scheduler API.
- **Broken / restricted.** Crashes out of the box on all current Qwen3 (and any
decoupled-head-dim model); fit_params crashed; Phase-1 restrictions enforced at context
creation: single CUDA device, full offload only, `n_batch == n_ubatch`, no SWA
(gemma3/llama4/etc. unsupported), no CoW / prefix-caching, no
`seq_cp`/`seq_keep`/`seq_div`/`seq_add`, no state save/load.
- **Contested draft.** Unmerged; the author is openly asking maintainers whether the C
API is even the right design; maintainers are skeptical of paged for single-node use.
**What P4 should actually target (re-scoped by this data).** The aggregate-throughput
gap to vLLM on a compute-bound dense model on one GB10 is not addressable by paged KV.
The durable, real LocalAI wins from paging are the ones our from-scratch P0 already
implements the machinery for and that #22569 explicitly omits:
- **on-demand KV sizing** (fit more *diverse* concurrent tenants without per-seq
over-reservation), and
- **automatic cross-tenant prefix sharing** (chained-hash block cache - shared system
prompts / RAG preambles), which #22569 defers to a non-existent Phase 2.
Finish our own P4 (CPU gather-read + a CUDA gather-read) against these capacity/
prefix-sharing objectives - measured as max concurrent *distinct* tenants and KV memory
saved, not single-model aggregate tok/s. To chase raw aggregate, the levers are lifting
`LLAMA_MAX_SEQ` and smaller/MoE models in memory-bandwidth-bound regimes - orthogonal to
paged attention. The ~1-line reshape fix found here (and the GB10 fit_params crash) are
worth upstreaming to #22569 regardless, but the PR is not our base.
### Reproduction (DGX, `~/llama.cpp-pr22569`)
```sh
export PATH=/usr/local/cuda/bin:$PATH
# contiguous
./build/bin/llama-batched-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -npp 16 -ntg 128 \
-npl 128 -c 20480 -b 2048 -ub 2048 # 256/512/1024 -> n_seq_max must be <= 256
# paged (needs the src/llama-graph.cpp:2556 reshape fix: hparams.n_embd -> cur->ne[0]*cur->ne[1])
./build/bin/llama-paged -m Qwen3-32B-Q4_K_M.gguf -kvp --fit off -ngpub 2048 -ncpub 128 \
-np 128 -ns 128 -n 128 -b 2048 -ub 2048 -ngl 99 # 512/1024 -> n_seq_max must be <= 256
```

@@ -1,95 +0,0 @@
# Paged Attention for llama.cpp (vLLM-parity), CPU-first
A from-scratch port of vLLM V1's paged KV-cache model into the llama.cpp / ggml
world, built CPU-first and verified incrementally. The host-side block manager is
a faithful port of vLLM; the compute stays in ggml (no new op — the read path
gathers blocks with `ggml_get_rows` and feeds the existing attention ops).
Design: `docs/superpowers/specs/2026-06-19-paged-attention-llamacpp-design.md`
Plan: `docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md`
## Status
| Phase | What | State |
|------|------|-------|
| P0 | vLLM-parity host block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache) | ✅ verified — `make check`, 4/4 suites |
| P1 | ggml paged write/gather mechanism (`set_rows` by slot_mapping → `get_rows` gather) | ✅ verified — `make ggml-check`, non-contiguous blocks `[2,1,5]` round-trip + isolation |
| P2 (core) | attention over gathered paged KV matches independent host reference | ✅ verified — max abs err **7.5e-08** |
| P3 (partial) | capacity & prefix-sharing wins | ✅ measured — `make bench`: **9.2×** more concurrent seqs, **11.3×** less KV memory |
| **P3 (in-model placement)** | **paged, non-contiguous block KV placement in the real model** | ✅ **Gate 0 PASSED** — Qwen3-0.6B token-identical (`patches/0001-paged-kv-block-placement.patch`) |
| P4 (in-model compute) | gather-read (`build_attn_paged`, read only a seq's blocks) + win-2 throughput + multi-seq | ⛔ remaining |
The design's central risk — *does paged (non-contiguous) KV produce correct attention?*
is **retired at two levels**: (1) at the ggml-op level (P2, 7.5e-08 vs reference) and
(2) **in a real model** (P3): with KV physically scattered across permuted, non-contiguous
blocks (cells `0-15, 144-159, 32-47, …`), Qwen3-0.6B greedy generation is **token-for-token
identical** to the contiguous cache. Reproduce:
```sh
# from backend/cpp/llama-cpp-fallback-build/llama.cpp (patch applied, CPU build)
B=build-cpu/bin/llama-simple; M=<Qwen3-0.6B.Q4_K_M.gguf>; P="...long prompt..."
"$B" -m "$M" -n 40 "$P" > base.txt
LLAMA_KV_PAGED=1 "$B" -m "$M" -n 40 "$P" > paged.txt
diff base.txt paged.txt && echo TOKEN-IDENTICAL
# LLAMA_KV_PAGED_DEBUG=1 prints the permuted physical cells per step
```
This proves the **storage/placement** layer of paged attention in-model. What remains (P4)
is the **compute** optimization that yields the throughput win: a gather-read that attends
only a sequence's own blocks (instead of scanning `[0,n_kv)` with a mask), plus the
multi-sequence driver to measure tok/s vs concurrency. The patch is single-sequence scope.
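
The placement layer reduces to a mapping from logical token position to physical cell. The sketch below is illustrative only: the block size of 16 is inferred from the cell ranges quoted above, the block table `[0, 9, 2]` is chosen to reproduce them, and plain Python lists stand in for the `set_rows` write and `get_rows` gather.

```python
BLOCK_SIZE = 16  # assumption: inferred from the cell ranges 0-15, 144-159, 32-47


def slot_mapping(block_table, n_tokens, block_size=BLOCK_SIZE):
    """Physical cell index for each logical token position 0..n_tokens-1."""
    return [
        block_table[pos // block_size] * block_size + pos % block_size
        for pos in range(n_tokens)
    ]


block_table = [0, 9, 2]            # logical block i -> physical block
cells = slot_mapping(block_table, 48)

kv_pool = [None] * (10 * BLOCK_SIZE)
for pos, cell in enumerate(cells):
    kv_pool[cell] = f"kv{pos}"     # scattered write (set_rows analogue)

gathered = [kv_pool[c] for c in cells]   # gather in logical order (get_rows analogue)
print(cells[0], cells[15], cells[16], cells[31], cells[32], cells[47])
print(gathered[:3], gathered[-1])
```

Reading back through the same mapping restores logical order, which is the property the token-identical check exercises end to end.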
## Build & test
```sh
make check # P0 host-manager unit suites (pure C++, no deps)
make ggml-check GGML_SRC=<llama.cpp>/ggml GGML_BUILD=<ggml-build> # P1/P2 ggml tests
make bench # P3 capacity + prefix-sharing numbers
```
`ggml-check` needs a built ggml. To build one CPU-only from a llama.cpp checkout:
`cmake -S <llama.cpp>/ggml -B /tmp/ggml-build -DGGML_CUDA=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build /tmp/ggml-build -j`
(if it complains about a missing `ggml.pc.in`, add a minimal pkg-config stub).
## Files
- `paged_kv_manager.{h,cpp}` — the vLLM-parity block manager (no ggml/llama dep).
- `tests/test_free_block_queue.cpp` — intrusive LRU free list.
- `tests/test_block_pool.cpp` — alloc/touch/free/evict/cache.
- `tests/test_paged_kv_manager.cpp` — allocate/block_table/slot_mapping/free.
- `tests/test_prefix_cache.cpp` — chained block hashing + first-miss cache hit.
- `tests/test_ggml_paged_rw.cpp` — paged write/gather through real ggml ops.
- `tests/test_ggml_paged_attn.cpp` — attention over paged KV vs host reference.
- `paged-bench.cpp` — capacity (win 1) + prefix-sharing (win 3) measurements.
## Remaining work — integration map (for the next session)
Target: a paged read path active behind a flag, producing **token-identical** greedy
output vs the contiguous cache on a real model (Gate 0), then `paged-bench` win 2.
Exact seams in the vendored llama.cpp (`backend/cpp/llama-cpp-fallback-build/llama.cpp`,
the pinned build fetches `LLAMA_VERSION=f3e182816421…`):
1. **Memory type**: `src/llama-model.cpp:2070` `create_memory()` constructs `llama_kv_cache`.
Add a paged variant (or a flag on the existing cache) implementing `llama_memory_i`
(`src/llama-memory.h`), backed by `PagedKVManager`.
2. **Allocation**: `src/llama-kv-cache.cpp:818` `find_slot()` produces `slot_info.idxs`.
Replace the ring-buffer scan with block-aligned allocation from `PagedKVManager`.
3. **Read path**: `src/llama-kv-cache.cpp:1145/1165` `get_k`/`get_v` return a contiguous
`[0,n_kv)` view. For paged, gather the sequence's blocks (`ggml_get_rows`) into scratch.
The new branch lives alongside `build_attn` in `src/llama-graph.cpp` (`build_attn_mha`).
4. **Mask**: `src/llama-graph.cpp` `build_attn_inp_kq_mask` sizes the mask to the gathered
length per sequence.
5. **Gate 0 driver**: `build-cpu/bin/llama-simple` (greedy argmax) on
`Qwen3-0.6B.Q4_K_M.gguf`; assert paged output == contiguous output token-for-token.
### Honest caveats (from the maintainer discussion + reading `find_slot`)
- llama.cpp's **unified cache already shares one KV pool** across sequences and already
tolerates non-contiguous slots. So win-1 vs *unified* is smaller than vs per-seq
reservation (stream mode). The durable LocalAI wins are **on-demand sizing** and
**automatic cross-tenant prefix sharing** (P0 implements the block-hash machinery).
- vLLM's classic `paged_attention_v1/v2` CUDA kernel is **deprecated**; the live path is
FlashAttention/FlashInfer over a block table. The port targets that pattern, not the
old kernel. Upstream draft PRs #22569 (new `ggml_paged_attn` op) and #17579 (CUDA) are
unmerged; maintainers are skeptical for single-user use.

@@ -1,78 +0,0 @@
# Upstream ggml issue draft: MXFP4 MoE prefill underutilizes Blackwell (GB10) — ~22 TFLOP/s, ~27× behind vLLM
**Title:** CUDA: MXFP4 MoE prefill runs the Ampere-class warp `mma.sync`, far below Blackwell FP4 peak (GB10 / sm_121)
## Summary
On a GB10 (DGX Spark, sm_121), MXFP4 MoE prefill for Qwen3-Coder-30B-A3B is bottlenecked by
`mul_mat_q<MXFP4>` (the per-expert grouped MMQ), which runs at only **~22 effective TFLOP/s** — a small
fraction of the GPU's FP4 capability. Batched prefill plateaus at ~3.65k tok/s (B=32) vs vLLM FP8 ~99k
on the same box (~27×). The native FP4 block-scaled `mma.sync` path (PR #17906 et al.) *is* engaged — the
limit is that it's a warp-level MMA kernel, not a tcgen05/CUTLASS-class grouped GEMM.
## Hardware / build
- NVIDIA GB10, compute capability 12.1, 119 GiB unified LPDDR5X.
- llama.cpp built `-DCMAKE_CUDA_ARCHITECTURES=121` (sm_121a/compute_121a confirmed in cubins).
- Model: Qwen3-Coder-30B-A3B-Instruct, `MXFP4_MOE` (15.9 GiB, 4.47 BPW).
## Measurements
Single-stream (`llama-bench`, ub2048):
| metric | Q8_0 | MXFP4 | vLLM FP8 |
|---|---|---|---|
| prefill pp2048 | ~2200 | 3441 | — |
| decode tg128 | 62 | 86 | 52 |
Batched (decode-phase aggregate `S_TG`; prefill aggregate `S_PP`):
| B | llama MXFP4 prefill | vLLM FP8 prefill | llama MXFP4 decode | vLLM FP8 decode |
|---|---|---|---|---|
| 1 | 1625 | 9644 | 83 | 48 |
| 8 | 3634 | 33373 | 267 | 312 |
| 32 | 3651 | 99398 | 551 | 1171 |
| 64 | 3648 | 151990 | 770 | 2064 |
Decode is competitive (we win at B=1). **Prefill plateaus and is the gap.**
## Profiling (nsys, MXFP4 pp2048 kernel time)
| kernel | % |
|---|---|
| `mul_mat_q<(ggml_type)39>` (MXFP4 MoE GEMM) | **37.2** |
| `mul_mat_q<(ggml_type)8>` (dense/attn, still Q8) | 10.1 |
| `flash_attn_ext_f16` | 8.8 |
| `quantize_mmq_mxfp4` (activation quant) | 8.0 |
Only cutlass kernel present is `cutlass_80_tensorop` (Ampere). No tcgen05 / wgmma anywhere.
## What we ruled out (so it's the kernel, not config)
- **ubatch**: saturates at 2048 (pp4096: ub512 2994 → ub2048 3316 → ub8192 3180).
- **tile width**: `mmq_x` already selects the full 128-wide tile at ub2048 (~128 tokens/expert).
- **cuBLAS fallback**: `GGML_CUDA_FORCE_CUBLAS` is a no-op (3419 ↔ 3423 t/s) — dequant→cuBLAS-FP16 neither
helps nor hurts, i.e. the FP4 MMQ kernel isn't worse than FP16 cuBLAS, both hit a common ceiling.
- prefill does **not** scale with bigger single prompts (attention O(N²) confounds): pp2048 3295, pp8192
1524, pp16384 2051 — so it's the many-sequence batched MoE GEMM, not batch size.
## Proposal
A tcgen05 / CUTLASS-3.x grouped-GEMM path for FP4 (MXFP4 + NVFP4) MoE on sm_120/121:
- One grouped GEMM over all experts with per-group token offsets (full tiles regardless of tokens/expert),
vs today's per-expert MMQ scheduler.
- Block-scaled `e2m1` operands via tcgen05 tensor-memory MMA (`mma.sync.aligned.kind::mxf4…` is the
warp-level form; the collective-mainloop/tcgen05 form is what extracts Blackwell throughput at prefill
tile sizes).
- Fuse activation quantization (`quantize_mmq_mxfp4`, ~8%) into the permute/gather.
- Optionally extend to dense layers (qkv/o_proj/lm_head) so full-model prefill is FP4/FP8.
This mirrors what vLLM/FlashInfer/TensorRT-LLM do for Blackwell MoE. Happy to test iterations on the GB10.
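
The "per-group token offsets" input to a grouped GEMM can be sketched on the host side as a counting sort. This is an illustration under simplifying assumptions (top-1 routing, one expert per token); `group_tokens_by_expert` is a hypothetical helper, not ggml or CUTLASS code.

```python
def group_tokens_by_expert(expert_ids, n_experts):
    """Return (perm, offsets): token indices sorted by expert, plus each expert's start offset."""
    counts = [0] * n_experts
    for e in expert_ids:
        counts[e] += 1

    # Exclusive prefix sum: expert e owns rows offsets[e]..offsets[e+1]-1.
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)

    cursor = offsets[:-1].copy()
    perm = [0] * len(expert_ids)
    for tok, e in enumerate(expert_ids):
        perm[cursor[e]] = tok      # stable: tokens keep their order within an expert
        cursor[e] += 1
    return perm, offsets


perm, offsets = group_tokens_by_expert([2, 0, 2, 1, 0, 2], n_experts=4)
print("perm   :", perm)
print("offsets:", offsets)
```

With the rows permuted this way, a single launch can walk all experts by offset, and an expert with zero tokens simply contributes an empty range.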
## Repro
```sh
llama-quantize qwen3coder-f16.gguf qwen3coder-mxfp4.gguf MXFP4_MOE
llama-bench -m qwen3coder-mxfp4.gguf -ngl 99 -p 2048 -n 0 -ub 2048
llama-batched-bench -m qwen3coder-mxfp4.gguf -ngl 99 -c 45056 -b 2048 -ub 2048 -npp 512 -ntg 128 -npl 1,8,32,64
```

@@ -1,83 +0,0 @@
# What makes vLLM fast on GB10 — kernel vs scheduler (code-grounded, measured)
Decisive analysis (vLLM v0.23.0, torch 2.11+cu130, sm_121, model `RedHatAI/Qwen3-32B-NVFP4A16`, source at tag
`v0.23.0`). **Answer: it's the scheduler, not the kernel.** This closes the kernel track and opens the
scheduler track.
## The decomposition (measured on the DGX, prefix-cache OFF, unique prompts)
| | vLLM W4A16 Marlin | llama.cpp | verdict |
|---|---|---|---|
| **single-stream prefill** | ~800 t/s (~52 TFLOPS) | 718 MMQ / **1153 MXFP4** | **tied; llama.cpp MXFP4 wins** |
| decode batch-1 | 11.8 t/s | ~similar | bandwidth-bound (≈190/273 GB/s); no kernel helps |
| **aggregate decode** | 328 (N32) / 569 (N64) / **667 (N128)** | the gap | **~56× multiplier = scheduler** |
vLLM's single-stream Marlin is **not** at the roofline — it's in the same ~4×-under regime as MMQ. The 24k
headline is entirely the aggregate decode multiplier.
## The kernel vLLM actually runs on sm_121 (W4A16, forced)
Dispatch (vLLM v0.23.0): `compressed_tensors.py:704` (NVFP4 + no input-quant → `W4A4Fp4(use_a16=True)`) →
`compressed_tensors_w4a4_nvfp4.py:28` → `kernels/linear/__init__.py:894` (`if use_a16: force_kernel =
MarlinNvFp4LinearKernel`, **unconditional, no cc gate**) → `nvfp4/marlin.py` → `marlin_utils_fp4.py:182`
`ops.marlin_gemm(b_q_type=float4_e2m1f)`, activations FP16/BF16. csrc: `csrc/quantization/marlin/marlin.cu`
+ `marlin_template.h` + `marlin.cuh`.
Techniques = **exactly the playbook we proved loses on GB10**: XOR shared swizzle (`marlin_template.h:722
^ (row%8)`), 4-stage cp.async pipeline (`marlin.cu:396 stages=4`, `cp_async_wait<stages-2>`), ldmatrix+mma,
FP16/BF16 acts. Native FP4 (`FlashInferB12xNvFp4LinearKernel`) needs `Sm120BlockScaledDenseGemm` cubins absent
on GB10 → W4A4 hangs → forced W4A16 Marlin fallback. **Nothing to port; vLLM's kernel is occupancy-blocked too.**
## The scheduler (the real multiplier) — what llama.cpp lacks
- **Paged KV cache** (`vllm/v1/core/kv_cache_manager.py`, `block_pool.py`): block KV, no fragmentation → very
high concurrent batch. **llama.cpp: NO** (contiguous per-slot KV → fragmentation caps real concurrency).
- **Chunked prefill** (`config/scheduler.py:84 enable_chunked_prefill=True`, default ON): interleaves prefill
chunks with decode so decode batches stay full. **llama.cpp: NO** (a long prefill stalls the decode batch).
- **Continuous batching** (`v1/core/sched/scheduler.py`): per-step admit/evict. **llama.cpp: YES** (`n_parallel`,
rudimentary — we enabled VRAM-scaled slots in #10411).
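
Chunked prefill can be illustrated with a toy step builder: every decoding sequence gets its one token first, and the leftover token budget goes to a slice of a pending prompt. This is a sketch of the idea only, not vLLM's or llama.cpp's scheduler; `schedule_step` and the 2048-token budget are assumptions.

```python
def schedule_step(decoding, prefill_left, budget):
    """Build one step as (seq_id, n_tokens) pairs; mutates prefill_left."""
    batch = [(sid, 1) for sid in decoding[:budget]]   # decode tokens go first
    room = budget - len(batch)
    for sid in list(prefill_left):
        if room == 0:
            break
        chunk = min(room, prefill_left[sid])          # prefill fills what is left
        batch.append((sid, chunk))
        room -= chunk
        prefill_left[sid] -= chunk
        if prefill_left[sid] == 0:
            del prefill_left[sid]                     # prompt fully ingested
    return batch


decoding = ["a", "b", "c"]
prefill_left = {"long": 5000}
step = 0
while prefill_left:
    step += 1
    print(step, schedule_step(decoding, prefill_left, budget=2048))
```

In this toy run the long prompt is spread over several steps while `a`, `b`, and `c` keep producing a token on every step. Promoting a finished prompt into the decoding set is omitted for brevity.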
## Sizing the scheduler gap — MEASURED (llama.cpp aggregate, the surprise)
`llama-batched-bench` Qwen3-32B-Q4_K_M, npp=128 ntg=128, npl scaling (DGX):
| npl | S_PP (agg prefill) | **S_TG (agg decode)** | vLLM decode | llama % of vLLM |
|---|---|---|---|---|
| 1 | 628 | 10.2 | 11.8 | 86% |
| 8 | 773 | 59.8 | - | - |
| 32 | 763 | **235** | **328** | **72%** |
| 64 | 761 | **391** | **569** | **69%** |
| 128 | 762 | **540** | **667** | **81%** |
**The "30x gap" headline is wrong for realistic concurrency.** llama.cpp's continuous batching already
captures **~70-81% of vLLM's aggregate decode** at npl<=128, with a near-identical multiplier (10.2 -> 540 =
**53x**, vs vLLM's 56x). And it is still climbing at npl=128 (391 -> 540), not yet plateaued. Combined with llama.cpp being
*ahead* single-stream (MXFP4 1153 > vLLM 800), **llama.cpp is already broadly competitive with vLLM on GB10 at
self-hosted concurrency.**
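
The ratios quoted above follow directly from the table. A small script to recompute them, using only the S_TG and vLLM decode figures listed there (npl=8 is skipped because no vLLM number was measured):

```python
# Aggregate decode t/s from the table above.
llama_tg = {1: 10.2, 32: 235, 64: 391, 128: 540}
vllm_tg = {1: 11.8, 32: 328, 64: 569, 128: 667}

# llama.cpp throughput as a rounded percentage of vLLM at each concurrency.
pct_of_vllm = {npl: round(100 * llama_tg[npl] / vllm_tg[npl]) for npl in llama_tg}

# Scaling multiplier from a single stream to npl=128 for each engine.
llama_mult = llama_tg[128] / llama_tg[1]
vllm_mult = vllm_tg[128] / vllm_tg[1]

for npl, pct in pct_of_vllm.items():
    print(f"npl={npl:>3}: llama.cpp at {pct}% of vLLM")
print(f"llama.cpp multiplier: {llama_mult:.1f}x, vLLM multiplier: {vllm_mult:.1f}x")
```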
Two real findings remain:
1. **Aggregate prefill is flat ~760** regardless of npl - but that is the **GB10 compute roofline** (vLLM single-
stream is ~800; neither can prefill faster aggregate, it is compute-bound). So prefill is **not a throughput
gap**; chunked prefill is a **latency/TTFT** win (stop a long prefill stalling the decode batch), not a
throughput one.
2. **vLLM's ~24k headline lives at thousands-of-sequences concurrency**, which **paged KV** unlocks (block KV,
no fragmentation). llama.cpp's contiguous KV caps how far npl can scale before memory/fragmentation bite. So
paged KV is the **high-concurrency (datacenter) lever**, not a moderate-concurrency one.
## Recommendation
**Pivot to the scheduler; treat the GEMM kernel as good-enough / roofline-blocked on GB10.**
Now that the gap is measured, ROI-ordered:
1. **Ship the MXFP4-dense win** — 1153 t/s single-stream beats vLLM's 800; a Blackwell dense-quant
recommendation (requantize, no kernel work). Already documented in `BLACKWELL_KERNEL_GAPS.md` §6. Cheapest.
2. **Chunked prefill** — the tractable scheduler win: interleave prefill chunks with decode so a long prompt
doesn't stall the decode batch. Payoff is **latency/TTFT under mixed load** (and steadier decode batches),
not aggregate prefill throughput (that's GB10-compute-capped at ~760-800 for both engines). A grpc-server
scheduler change; no KV-layout rewrite.
3. **Paged KV** — the **high-concurrency (thousands-of-seqs) lever** that unlocks vLLM's 24k regime. Heavy
(block KV manager; contested upstream PR #22569 / vendored `patches/`). Worth it only if datacenter-scale
concurrency is a target; at self-hosted concurrency (npl<=128) llama.cpp is already ~70-81% of vLLM.
**Reframed expectation:** llama.cpp on GB10 is NOT 30x behind vLLM. It is ahead single-stream (MXFP4) and
~70-81% of vLLM aggregate at npl<=128. The genuine differentiator vLLM still has is **scaling to very high
concurrency via paged KV**. Kernel tracks (W4A16 178 t/s; FP4-MMA) stay **banked** - not the lever.


@@ -1,59 +0,0 @@
# Where vLLM beats llama.cpp on a DGX Spark (GB10), and how to close it — keeping quality
The question: "vLLM is faster at the end — what do we improve, while keeping good quality?" Answer: the
gap is **three independent things**, and the biggest *per-user, quality-preserving* one is **speculative
decoding**, which llama.cpp already supports.
## Decomposition (measured + researched)
| vLLM advantage | helps single user? | llama.cpp answer | quality cost | status |
|---|---|---|---|---|
| **Per-user decode speed** | **yes** | **speculative decoding** (Qwen3 draft / EAGLE3) | **none** (target-verified, lossless) | mature in llama.cpp; **the main lever** |
| Prefill / TTFT | no (it's first-token latency) | tune FP4-MMA / Marlin W4A16 kernel | none | hard; `BLACKWELL_KERNEL_GAPS.md` |
| Aggregate throughput @ concurrency | no (per-user = 0) | continuous batching (paged engine) | none | also kernel-bound |
Key measured fact: **single-user decode is already at parity** (Qwen3-32B: llama 10.2 vs vLLM 11.7 t/s) —
both hit GB10's ~273 GB/s bandwidth wall (~15 t/s ceiling) **without** spec-dec. So vLLM's real per-user
speed edge is spec-dec, not architecture.
## Why spec-dec is THE lever here (and quality-safe)
- **Lossless:** the 32B target verifies every drafted token (accept/reject) — output distribution is
identical to no-drafting. So you keep **Q4_K_M quality** (no lossy MXFP4 needed) *and* get speed.
- **GB10 is best-case for it:** decode is bandwidth-bound (one ~17 GB weight-read per token) with huge idle
compute. Spec-dec verifies K drafted tokens in **one** weight-read → converts the loop to compute-bound,
where GB10 has headroom. Realized speedup ≈ mean accepted length.
- **Measured (others, same model class):** llama.cpp Qwen2.5-32B dense + 0.5B draft = **2.9×** (13→38 t/s);
vLLM EAGLE3 on Qwen3-32B = ~1.8-2.5× general, up to ~3× code/structured. **Competitive.**
- **Regime caveat:** spec-dec gives **~nothing for MoE-A3B** models (only ~3B active → not bandwidth-bound,
nothing to amortize). It shines for **dense** 27-32B, the opposite regime. So this lever is *dense-model*
specific.
## Qwen3-32B specifics
- **No native MTP head** (MTP is a Qwen3-*Next*/MoE feature). Options: a **same-family draft**
(Qwen3-0.6B or **1.7B** — same tokenizer, llama.cpp vocab check passes) or an external **EAGLE3 head**
(RedHatAI/AngelSlim Qwen3-32B-eagle3, accept length 2.15-2.49).
- Draft pick: **lean Qwen3-1.7B** (0.6B had ~60% lower acceptance in AWS's test; on a bandwidth-bound box the
32B weight-read dwarfs the draft cost, so maximize acceptance). `--spec-draft-n-max` 5-8.
## Recommended LocalAI actions (quality-preserving, ranked)
1. **Make speculative decoding easy/recommended for dense ≥14B models on Blackwell** — a draft-model field in
the model config (`-md` / `--spec-draft-*`), with a suggested Qwen3-1.7B draft for the Qwen3 family. This
is the biggest per-user speed win, lossless, available **now** (no kernel). Gallery: ship target+draft pairs.
2. Kernel work (FP4-MMA tuning / Marlin W4A16) — improves **prefill/TTFT**, separate metric.
3. Continuous batching (paged engine) — **aggregate** concurrency only; per-user = 0.
## Honesty / status
The research conclusion is solid (sources below). **Our own empirical spec-dec run on the DGX is pending**:
the box rebooted mid-session and `llama-cli` now hangs at 0% GPU (while `llama-bench` works), plus the network
is dropping ssh mid-command. Drafts (Qwen3-0.6B/1.7B Q8) are downloaded and the spec-dec flags are confirmed;
re-run `llama-cli -m Qwen3-32B-Q4_K_M -md Qwen3-1.7B-Q8_0 -ngl 99 -ngld 99 --spec-draft-n-max 8` when the box
is stable to confirm the ~2× locally. The conclusion does not depend on it (it's measured-reproducible by
others on this exact model class), but we should bank our own number.
Sources: llama.cpp Discussion #10466 (Qwen2.5-32B+0.5B = 2.9×), #16578 (DGX Spark), DandinPower/llama.cpp_bench
(32B = 10.7 t/s, bandwidth-bound); vLLM MTP docs + Red Hat EAGLE3 article (lossless, up to 2.5×); AWS spec-dec
blog (Qwen3-32B+1.7B up to 3×, 0.6B ~60% lower accept); RedHatAI/AngelSlim Qwen3-32B-eagle3 heads.


@@ -1,176 +0,0 @@
# W4A16 Marlin-style GEMM for ggml-cuda on Blackwell (sm_120/121) — implementation plan
> **STOPPED (2026-06-21): the kernel is NOT the lever — validated by a code-grounded vLLM analysis.**
> Measured on the DGX: vLLM's single-stream W4A16 prefill on GB10 = **~800 t/s (~52 TFLOPS), statistically TIED
> with llama.cpp MMQ (718/47)** — and vLLM uses the *exact* XOR-swizzle + 4-stage cp.async Marlin we proved
> collapses GB10 occupancy (vLLM even warns at load that Marlin "may degrade performance for compute-heavy
> workloads"). There is no kernel trick to port. Moreover llama.cpp's **MXFP4 path (1153 t/s) already BEATS
> vLLM single-stream (800)** — vLLM has no FP4 cubins on sm_121 and falls back to slower W4A16 Marlin, so
> llama.cpp is *ahead* on the kernel. **vLLM's entire 24k headline is the aggregate decode multiplier (~56×)
> from paged KV + chunked prefill + continuous batching — a SCHEDULER win.** llama.cpp lacks paged KV +
> chunked prefill. **Effort pivots to the scheduler** (see the paged-attention work). This kernel work is
> banked + resumable (178 t/s, P0/P1/P2/P3/P3b committed) but is not the throughput lever on GB10. Detail:
> `VLLM_DECOMPOSITION.md`.
The committed multi-week kernel. Goal: get 4-bit-weight dense matmul to the GB10 **BF16 ceiling (~213
TFLOP/s ≈ ~3,300 t/s prefill on Qwen3-32B)**, ~4.3× over today's 765. This is the *match-vLLM* path; vLLM's
own GB10 dense throughput runs on W4A16 Marlin (its FP4 path is broken on sm_121).
## Why a custom kernel (validated, not assumed)
On GB10 (sm_121), measured: **both** llama-MMQ (int8, Ampere-tuned) **and** cuBLAS-FP16 sit at ~46 TFLOP/s
(~21% of peak). cuBLAS falls back to an Ampere `cutlass_80_tensorop` kernel (CUDA-13 has no sm_121 GEMM for
these shapes); rebuilt with `-DGGML_CUDA_FORCE_CUBLAS=ON` it's *slower* than MMQ (690 vs 750). **No library
path reaches the ceiling on consumer Blackwell** — a hand-tuned sm_120a kernel is required. `mmapeak` measures
the 213 BF16 peak as reachable, and vLLM's Marlin hits it, so the ceiling is real; the work is reaching it.
## What Marlin does (the design we mirror)
Weights stored 4-bit, **dequantized in-register/shared-mem** in-flight; GEMM math on **FP16/BF16 tensor
cores** (`mma.sync m16n8k16`). Speed comes from: `cp.async` global→shared with a **multi-stage double-buffered
pipeline**, **offline weight reshuffle** into the MMA-friendly layout, activations kept resident in registers,
and **Stream-K** partitioning. Sources: IST-DASLab/marlin, arXiv 2408.11743, vLLM machete (Hopper successor).
## Phases (each ends with: numerical parity vs MMQ + a prefill benchmark)
### P0 — Harness + baseline — DONE
- **Correctness gate (GREEN):** `test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103 passed** (CUDA vs CPU
reference, covers Q4_0/Q4_K at the real FFN shapes m=4096,k=14336,n=1..512). This is *the* parity check the
W4A16 kernel must keep green at every phase — it tests the CUDA MUL_MAT path the kernel will hook. The
`not supported` lines are `type_b=f16` combos (irrelevant; prefill uses f32 activations).
- **Perf baseline:** `llama-bench` dense Q4_K prefill = **~750 t/s (pp512 718 / pp2048 750) ≈ 46 TFLOP/s ≈ 21%
of the 213 BF16 ceiling**. The kernel must beat this toward ~3,300. (`test-backend-ops perf -o MUL_MAT` gives
per-shape GFLOPS too; build it once with the harness.)
- **Op-level baseline (the canonical kernel target), `test-backend-ops perf -o MUL_MAT`, m=4096 k=14336 (FFN):**
| n (tokens) | q4_0 | q4_K | regime |
|---|---|---|---|
| 1 | 817 GFLOPS | 761 GFLOPS | decode / mat-vec (memory-bound) |
| 8 | 5.77 TFLOPS | 4.11 TFLOPS | small-batch |
| **512** | **49.5 TFLOPS** | **47.1 TFLOPS** | **prefill GEMM — ~22% of the 213 ceiling** |
So the prefill GEMM target: lift q4_K n=512 from **47 → toward ~213 TFLOPS** (~4.5×). This per-shape number
is cleaner than end-to-end for kernel iteration.
- **Harness script:** `~/p0harness.sh` on the DGX (build test-backend-ops + correctness + perf). Reusable each
phase: `test-backend-ops test -o MUL_MAT -b CUDA0` must stay 1103/1103; the q4_K n=512 perf must climb from 47.
- test-backend-ops needed `-DLLAMA_BUILD_TESTS=ON`; now built in `~/llama.cpp-pr24423/build`.
### P1 — Dispatch seam (no behavior change) — DONE
- `marlin-w4a16.{cuh,cu}` + a gated hook in `ggml_cuda_mul_mat` (dense, non-ids path), behind
`GGML_CUDA_W4A16` + sm_120/121 (`cc >= GGML_CUDA_CC_BLACKWELL`) + type∈{Q4_0,Q4_K} + f32 activations.
Returns false → falls back to MMQ. Source + apply instructions: `kernel/w4a16/` (`HOOK.md`).
- **Verified on GB10:** clean build; `test-backend-ops MUL_MAT` = **1103/1103** (byte-identical default);
`llama-bench` dense Q4 pp512 unchanged (717.77 default / 718.26 with flag); `GGML_CUDA_W4A16=1` reaches the
seam (stderr `[w4a16] ... P1 seam - using MMQ`) and falls back. The empty frame P2/P3 fills.
### P2 — Correctness-first kernel (slow OK) — DONE
- **Kernel:** `marlin-w4a16.cu` replaces the P1 TODO with a real W4A16 GEMM. In-kernel dequant Q4→BF16 into
shared mem, `mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32` via ggml's `mma.cuh` tile abstractions
(`tile<16,8,nv_bfloat162>` A, `tile<8,8,nv_bfloat162>` B, `tile<16,8,float>` C), F32 accumulate, F32 write.
One warp per 16(M)x8(N) output tile, K looped in steps of 16. Both src0 (weights, row m) and src1 (acts,
row n) are row-major `[row][k]`, so A and B load symmetrically via `load_generic`; the mma does the dot over k.
- **Types handled:** Q4_0 and Q4_K. Q4_0 dequant `w=d*(q-8)` inline; Q4_K via the superblock decode mirrored
from `convert.cu` (`get_scale_min_k4`, 8x32 sub-blocks, `d*q-m`).
- **Shape classes handled:** contiguous 2D GEMM (the prefill path), `ne2==ne3==1`, f32 activations, K%16==0
(always true: Q4_0 K%32, Q4_K K%256). **Falls back to MMQ (returns false)** for batched (bs!=[1,1]),
broadcast (nr!=[1,1]), permuted / non-contiguous (per!=[0,1,2,3]), and any non-f32 activation (e.g. f16) -
keeps the gate green. M / N boundaries are zero-padded in-kernel (handles M not %16, N not %8).
- **Parity (the gate):** `GGML_CUDA_W4A16=1 test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103 passed**
(the Q4_0/Q4_K f32 contiguous shapes run the kernel and match the CPU reference; batched/permuted/f16 fall
back). Default (flag-unset) build still **1103/1103** (byte-identical, seam returns false).
- **Model sanity / P2 perf:** `GGML_CUDA_W4A16=1 llama-bench -m Qwen3-32B-Q4_K_M.gguf -ngl 99 -p 512 -n 16
-ub 2048` runs clean: **pp512 = 31.75 t/s**, tg16 = 6.28 t/s. Slow as expected (naive 1-warp/tile, weights
re-dequantized per n-tile, no pipeline) - this is the correctness checkpoint; P3 brings the speedup. The real
Q4_K model matmul path engages the kernel without error.
### P3 — The Marlin pipeline (the speedup) — STEP 1 + SKEW-PAD/TILING LANDED; PREPACK + PIPELINE + STREAM-K DEFERRED
Goal: `cp.async` double/triple-buffered global->shared; offline weight reshuffle (a one-time repack of the Q4
tensor into the mma+pipeline layout); register-resident activation tiles; Stream-K split for the prefill M.
Target: >=150 TFLOP/s (>=~2,300 t/s), then ~213. **MMQ baseline to beat: 47.1 TFLOPS (q4_K n=512) / pp512 718.**
**Kernel structure now (committed P3b):** block-tiled multi-warp GEMM with a CONFLICT-FREE shared feed via skew
padding. `blockDim=(32, WM*WN)` so `threadIdx.x` is the warp lane (required by `mma.cuh` get_i/get_j) and
`threadIdx.y` is the warp index; the original 1-warp P2 launch put 128 threads on `threadIdx.x` and exploded
`get_j` into an out-of-bounds shared read (found via compute-sanitizer). `WM*WN` warps compute a
`BM(=WM*FM*16) x BN(=WN*FN*8)` output tile; each warp owns an `FM x FN` grid of m16n8k16 mma fragments
accumulated in F32. Per k-step (16-deep): all warps cooperatively dequant the `BM x 16` Q4 weight strip + load
the `BN x 16` f32->bf16 activation strip into shared, one `__syncthreads`, then `ldmatrix.x4` (A) / `ldmatrix.x2`
(B) fragments + `FM*FN` mmas. The shared rows hold 8 bf162 of data but are stored at a PADDED stride of 12 bf162
(`W4A16_SPAD`): ldmatrix's per-lane address is `row*stride`, and the natural stride 8 (a divisor of the
32-bank / 128-byte cycle) collides rows 0,4,8,12 into a 2-way bank conflict; skewing to 12 (a multiple of 4 bf162, i.e. 48-byte rows, so
ldmatrix's 16-byte alignment holds) makes `{r*12 mod 32}` hit 8 distinct bank-quads for r in 0..7, so both
halves of ldmatrix are conflict-free at only +50% on the small staged tile (~12 KB at the shipping tile).
Shipping config `WM=4,WN=4,FM=2,FN=4` -> `BM=128, BN=128`, 16 warps, 8 m16n8 C-tiles per warp (keeping
register pressure low is what lets BN grow without an occupancy cliff). M/N tails zero-padded in-kernel; still
gated to contiguous 2D Q4_0/Q4_K f32 prefill, else falls back to MMQ.
**Per-step results (q4_K n=512 via `test-backend-ops perf`; pp512/pp2048 via llama-bench Qwen3-32B-Q4_K_M):**
| step | q4_K n=512 | q4_0 n=512 | pp512 | pp2048 | vs MMQ 47 / 718 | notes |
|---|---|---|---|---|---|---|
| P2 (1 warp/tile) | ~2 TFLOPS | - | 31.75 | - | 0.04x | correctness checkpoint |
| Step 1: block tiling (load_generic, BM64/4w) | 6.63 (cold) | 7.53 | 119 | 123 | 0.14x | original committed kernel |
| P3b-1: skew-pad ldmatrix + BM128/8w | 8.50 (cold) | 10.56 | 148.5 | 153.9 | 0.18x | +28% q4_K, +40% q4_0 over step 1 |
| **P3b-2: + BN128/16w (current)** | **9.92 (cold)** | **11.68** | **177.6** | **185.0** | **0.21x** | +17% q4_K, +20% pp512 over P3b-1 (+49% pp512 over step 1) |
Parity gate **1103/1103** at every step, flag set and unset (byte-identical when unset). All P3b numbers above
are from thermally-bracketed cold A/B sessions (committed measured immediately before AND after each candidate,
identical both times -> the deltas are real, not thermal). P3b-1 cold A/B (q4_K/q4_0): 6.63/7.53 vs 8.52/10.49. P3b-2 cold
A/B (q4_0/q4_K): BN64/8w 10.56/8.50 then 10.51/8.45 (bracket) vs BN128/16w 11.68/9.92.
**What landed / what was tried (honest):**
- **P3b - LANDED (committed).** Two combined changes lift the prior committed kernel: (1) **skew-pad
conflict-free ldmatrix** (shared row stride 8->12 bf162; makes `ldmatrix.x4`/`.x2` bank-conflict-free at near
zero occupancy cost) and (2) **bigger tile / more warps** (`BM=128, BN=64`, 8 warps). Cold A/B: q4_K
6.63->8.52 (+28%), q4_0 7.53->10.49 (+40%), pp512 119->148.5 (+25%). **Still ~5.5x under MMQ (47) per-op and
~4.8x under pp512 718 - does NOT beat MMQ.** This is forward progress, not the finish line.
- **The XOR-swizzle-FIRST plan was tested and is WRONG for this GPU - documented so it is not re-tried.** A
wide-row (BK=64, 128-byte rows) XOR swizzle `seg ^ (row&7)` IS conflict-free, but the 16 KB shared it needs
collapsed occupancy and dropped q4_K n=512 to **2.84 TFLOPS** (worse than the unswizzled 6.63) - the same
occupancy cliff P3 hit with a 32 KB pipeline. The conflict-free feed must be bought WITHOUT widening shared:
skew padding (above) does exactly that (6 KB), which is why it is the committed form. Lesson: on GB10 occupancy
dominates bank-conflict latency; never trade occupancy for a conflict-free layout.
- **Conflict-free feed alone did NOT beat the unswizzled kernel - the limiter moved.** At the SAME BM64/4w tile,
skew-pad ldmatrix (6.70) ~= load_generic (6.63): removing bank conflicts bought ~nothing. The win came only
when the tile grew (BM128/8w). A 5-config tile sweep then split the two quant types:
- **q4_0 SCALES with warps/tiles** (7.7 -> 10.5 -> **15.8 TFLOPS at BM128/16w**): feed/global-traffic bound,
helped by cutting redundant activation re-reads (more BM = fewer M-blocks each re-reading the act column).
- **q4_K is largely DEQUANT-COMPUTE bound** (the BM64/16w tile gives q4_0=15.8 but q4_K=6.8 - they diverge
hard). This **refines P3's "within 12%" finding**: that held only in the low-throughput memory-bound regime;
once the feed is unblocked, q4_K's per-element 6-bit superblock decode (`get_scale_min_k4` + superblock
indexing, redone every k-step AND re-done by every N-block) becomes the wall. BM256 regressed both (too few
blocks / register pressure).
- **Growing BN partly relieves the q4_K dequant wall (P3b-2).** Because every N-block re-decodes the same
weight strip, halving the N-block count (BN 64->128) halves that redundant q4_K decode - but only when BN is
spread across MORE WARPS (16w, 8 C-tiles/warp), not more fragments-per-warp: the FN=8 / FM=4 variants (16
C-tiles/warp) regressed to ~6.6 on register pressure, while WM=4,WN=4,FM=2,FN=4 (16w, 8 tiles/warp) lifted
q4_K 8.5->9.9 and q4_0 10.6->11.7 cold. BN=256 was no better and costs more shared. **BN128/16w is the
shipping tile.**
- **Next blocker (the remaining q4_K unlock) = offline prepack.** BN growth only divides the redundant decode by
the N-block count; it cannot remove the per-k-step decode itself. The full fix is the **one-time offline
repack** - decode the Q4 tensor ONCE into a cached device buffer keyed off the tensor data pointer, in a layout
with the scale/min pre-applied (store reshuffled 4-bit + per-subblock bf16 d,m, ~1.25x the q4 size, NOT a full
bf16 blow-up which would be ~4x), so the in-kernel path becomes a cheap `q*d - m` with coalesced loads. Then
`cp.async` multi-stage (sized to NOT widen shared past the occupancy cliff) and **Stream-K** over M. These
remain the multi-week core; **prepack is the highest-value next step for q4_K specifically** (it should let
q4_K join q4_0 on the feed-bound scaling curve instead of plateauing at ~10).
- **Methodology note (unchanged):** the box thermally throttles under sustained perf+bench runs (identical code
~8.8 cold vs ~6.6 hot earlier), so only same-session A/Bs are trustworthy. The P3b deltas above were taken in
one bracketed cold session for exactly this reason.
### P4 — Tune
- Tile (mmq_x/y analogues), warps, pipeline depth, occupancy. We have nsys (throughput) but **not ncu** on the
DGX — tuning is empirical (sweep configs, measure t/s). Note ncu would need sudo/driver perms we lack.
### P5 — Enable
- Default on for sm_120/121 + Q4_0/Q4_K dense when parity holds + faster; keep the flag as an escape hatch.
Ship as a LocalAI llama.cpp patch (the patches/ series) and/or upstream (ggml has no Marlin-equivalent —
issue #1519 — so it's net-new upstream value; float it with maintainers first).
## Risks / notes
- **Multi-week, expert-CUDA, DGX-only** (GB10 is the only sm_121). The session's network flakiness +
`llama-cli` hang make `llama-bench`/`test-backend-ops` the reliable verification tools (both work).
- Quantization correctness: Q4_K's superblock structure (256-elem, 6-bit scales) is more complex to dequant
in-kernel than Q4_0; consider landing Q4_0 first, then Q4_K.
- **Beat-path follow-on:** the FP4-MMA path (`mul_mat_q<MXFP4>`, ~5% of FP4 peak) tuned/fixed on sm_121 reaches
~6,600 (2× BF16). Separate track; this W4A16 kernel is the match-path foundation.
- Reuse ggml's `mma.cuh` tile abstractions (MMQ already uses them) rather than raw PTX where possible.


@@ -1,31 +0,0 @@
# W4A16 seam — how to apply to a llama.cpp / ggml-cuda checkout
Two source files + two one-line edits to `ggml/src/ggml-cuda/ggml-cuda.cu`. The build picks up the
new `.cu` via the existing `file(GLOB)` after a `cmake -S . -B build` reconfigure (no CMakeLists edit).
## Files (copy into `ggml/src/ggml-cuda/`)
- `marlin-w4a16.cuh`
- `marlin-w4a16.cu`
## Edit `ggml/src/ggml-cuda/ggml-cuda.cu`
1. **Include** — after the existing `#include "ggml-cuda/fp4-grouped-moe.cuh"` (sibling-header style):
```cpp
#include "ggml-cuda/marlin-w4a16.cuh"
```
2. **Dispatch hook** — immediately before the dense dispatch chain, i.e. before
`if (!split && use_mul_mat_vec_f) {` in `ggml_cuda_mul_mat(...)` (after `const int cc = ...`):
```cpp
if (!split && ggml_cuda_w4a16_mul_mat(ctx, src0, src1, dst)) { return; }
```
## Verify (P1 acceptance — met)
- `cmake --build build --target test-backend-ops llama-bench` → builds clean.
- `test-backend-ops test -o MUL_MAT -b CUDA0` → **1103/1103** (byte-identical default).
- `llama-bench` dense Q4 pp512 → unchanged (~718, MMQ).
- `GGML_CUDA_W4A16=1 llama-bench` → unchanged + stderr `[w4a16] ... P1 seam - using MMQ` (seam reached,
gating passes on sm_121, falls back).
The kernel body (P2 correctness → P3 Marlin pipeline) replaces the `TODO(P2/P3)` block in `marlin-w4a16.cu`
and returns `true` once parity holds.


@@ -1,66 +0,0 @@
# W4A16 kernel - subagent dispatch briefs (P3, P4, P5)
**Dispatch strategy.** Each phase = one fresh **Opus-4.8** subagent handed a complete zero-context brief.
Phases are **sequential** (P3 needs P2's correct kernel; P4 needs P3's pipeline; P5 needs P4's tuned kernel),
so dispatch phase N+1 only after phase N's commit lands, and before dispatching, splice phase N's *actual*
deliverable (final kernel shape, configs, fallback set) into the next brief. P2's brief (already dispatched)
is the template; reuse the COMMON section below verbatim in every dispatch.
---
## COMMON (paste into every phase brief)
- **Kernel dev is on the remote DGX** (GB10, sm_121): `ssh -o ConnectTimeout=25 -o ServerAliveInterval=10 -o ServerAliveCountMax=10 dgx.casa '<cmd>'`. Network is FLAKY (re-poll on drop; nohup jobs survive). `llama-cli` HANGS - never use it. Only `llama-bench` + `test-backend-ops` work.
- Checkout `~/llama.cpp-pr24423`, build `~/llama.cpp-pr24423/build` (sm_121, `-DLLAMA_BUILD_TESTS=ON`). Kernel file `ggml/src/ggml-cuda/marlin-w4a16.cu`. Build auto-GLOBs it; no CMakeLists edits. Hook already in `ggml-cuda.cu`, gated behind env `GGML_CUDA_W4A16`.
- Dense test model: `~/bench/q3-32b-gguf/Qwen3-32B-Q4_K_M.gguf`.
- **Builds run detached + poll** (never blocking foreground): write a `~/pN.sh` that builds `--target test-backend-ops llama-bench`, echoes `RC=$?`, runs the gate, echoes `PN_DONE`; `nohup` it; poll `for i in $(seq 1 90); do grep -q PN_DONE ~/pN.out && break; sleep 20; done; tail ~/pN.out`.
- **GPU hygiene:** check `docker ps | grep local-ai` + `nvidia-smi`; `docker stop` a running localai worker if present (authorized); never pkill native procs; never start model servers.
- **Parity gate (must stay green every step):** `GGML_CUDA_W4A16=1 CUDA_VISIBLE_DEVICES=0 ./build/bin/test-backend-ops test -o MUL_MAT -b CUDA0` = **1103/1103**; and flag-unset stays 1103/1103 (byte-identical). A wrong result is worse than a fallback - return false for any shape you can't do correctly.
- **Perf measurement:** `test-backend-ops perf -o MUL_MAT -b CUDA0` (per-shape GFLOPS; the canonical target is q4_K m=4096 k=14336 **n=512**, baseline **47.1 TFLOPS**, ceiling ~213) + `llama-bench -m <model> -ngl 99 -p 512,2048 -n 0 -ub 2048` (baseline pp512 ~718).
- **LocalAI repo (commit here; you do NOT inherit cwd - `cd` explicitly):** `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`. Plan: `backend/cpp/llama-cpp/paged/W4A16_MARLIN_KERNEL_PLAN.md`. Source mirror: `backend/cpp/llama-cpp/paged/kernel/w4a16/`. After a phase passes: fetch the final `marlin-w4a16.cu` from the DGX (`ssh ... 'cat ...'`), overwrite the mirror, update the plan (mark the phase DONE with numbers), `git commit -s` (DCO sign-off; user is Ettore Di Giacinto <mudler@localai.io>). **No `Co-Authored-By`. No em-dashes anywhere. Trailer `Assisted-by: Claude:opus-4.8 [Claude Code]`. Do NOT push.**
- Final message = the result (gate ?/1103, the perf delta, blockers + resolutions, commit hash). A precise partial result beats a vague success claim.
---
## P3 brief - the Marlin pipeline (the speedup)
**Goal.** Take P2's correct-but-slow kernel (~2 TFLOPS) past the 47 TFLOPS MMQ baseline toward ~150+ TFLOPS (then ~213) on the q4_K n=512 prefill GEMM, **without ever breaking parity**. This is the Marlin design: the math is the same BF16 mma; the speed comes from feeding the tensor cores without stalling.
**Implement, incrementally (re-run the parity gate after each):**
1. **`cp.async` multi-stage pipeline** - double/triple-buffer global->shared loads of both the Q4 weight tiles and the activation tiles so dequant+mma on stage k overlaps the load of stage k+1. (Study `mma.cuh` + how `mmq.cu`/`mmf.cu` stage shared memory; ggml already uses `cp.async`/`__pipeline_*`.)
2. **Offline weight reshuffle** - repack the Q4 weights once into the mma+pipeline-friendly layout (Marlin's interleave) so loads are coalesced and the mma fragment maps directly. Do this as a load-time transform of src0 (a new prepacked buffer keyed off the tensor) - NOT per-call. Document where the repack lives + its memory cost.
3. **Register-resident activation tiles + Stream-K** split of the M dimension across blocks for the prefill (large-M) case so all SMs stay busy.
**Acceptance.** Parity gate stays **1103/1103** at every commit; `test-backend-ops perf` q4_K n=512 climbs materially above 47 TFLOPS (target >=150) and `llama-bench` pp512 climbs above ~718. Report the TFLOPS + t/s after each of the 3 steps so the contribution of each is visible. If a step regresses parity, revert it and report why.
**Reference.** IST-DASLab/marlin (github), arXiv 2408.11743, vLLM machete. Mirror `mmf.cu`'s BF16 GEMM structure; Marlin = that + Q4 dequant-on-load + the pipeline/reshuffle.
**Splice before dispatch:** P2's final kernel structure (tile sizes, which types/shapes it handles vs falls back, helper functions it defined).
---
## P4 brief - tune to the ceiling
**Goal.** Drive the P3 kernel as close to the ~213 TFLOPS ceiling as empirical tuning allows. **No `ncu` on this box** (no driver perms) - tune by throughput: `test-backend-ops perf` + `llama-bench` + `nsys` (throughput only).
**Do.** Parametrize the kernel (template params / constants) over: tile M/N/K, warps per block, pipeline depth (stages), and occupancy (regs, shared-mem budget). Sweep systematically (a script that rebuilds + benches each config, logs q4_K n=512 TFLOPS + pp512/pp2048 t/s), pick the best, hard-set it (with a short comment on the sweep). Check both prefill shapes (n=512 and n=2048) and confirm decode (n=1) didn't regress (it should still route to mat-vec, not this kernel - verify the gating).
**Acceptance.** Best config maximizes q4_K n=512 TFLOPS (stretch ~150-213) with parity **1103/1103** intact; the sweep table (config -> TFLOPS/t-s) is recorded in the plan's P4 section. Report the chosen config + the final pp512/pp2048 t/s vs the 718/750 baseline and vs vLLM's ~3300 single-stream target.
**Splice before dispatch:** P3's pipeline structure + the perf it reached + which knobs are already fixed vs free.
---
## P5 brief - enable + package + (maybe) upstream
**Goal.** Make W4A16 the default dense-Q4 path on Blackwell and ship it through LocalAI.
**Do.**
1. **Flip the gate:** default-ON for sm_120/121 + Q4_0/Q4_K dense when faster, keep an opt-out env (e.g. `GGML_CUDA_W4A16=0`) as an escape hatch. The existing return-false-on-unhandled-shape path is the correctness safety net; keep it. Verify the default (no env) build now runs W4A16 for dense Q4, gate green, faster than the old MMQ baseline.
2. **Package as a LocalAI llama.cpp patch:** produce `backend/cpp/llama-cpp/paged/patches/kernel/0002-w4a16-marlin.patch` (the new files + the `ggml-cuda.cu` hook + the gate flip) that applies cleanly to the pinned llama.cpp, mirroring the existing `patches/kernel/0001-fp4-grouped-moe-scaffold.patch`. Confirm LocalAI's `make backends/llama-cpp` build path can consume it (read `.agents/llama-cpp-backend.md` + the build memory: `make -C backend/cpp/llama-cpp clean` before rebuilds).
3. **Docs:** update `BLACKWELL_KERNEL_GAPS.md` + the plan with the shipped result; add a short note to the LocalAI docs if there's a Blackwell/performance page.
4. **Upstream decision (do NOT open without surfacing first):** ggml has no Marlin-equivalent (issue #1519) so this is net-new upstream value. Draft (do not submit) an upstream PR description + note the sm_121 build-flag caveats; report it for the user to decide.
**Acceptance.** Default Blackwell build uses W4A16 for dense Q4, parity 1103/1103, measurably faster than MMQ; the patch applies + the LocalAI llama-cpp backend builds with it (verify or, if the full backend build is too heavy, document the exact build command + that the patch applies cleanly). Report the end-to-end LocalAI dense-Q4 prefill number vs the start-of-project 765 t/s.
**Splice before dispatch:** P4's final kernel + config + the measured ceiling reached; the exact enable condition decided.


@@ -1,258 +0,0 @@
#include "marlin-w4a16.cuh"
#include "mma.cuh"
#include <cstdio>
#include <cstdlib>
#include <cuda_bf16.h>
// W4A16 Marlin-style GEMM.
//
// In-kernel dequantize Q4 weights -> BF16, multiply against BF16-converted F32
// activations using mma.sync m16n8k16 BF16 tensor-core ops, accumulate in F32,
// write F32 output. Handles only the contiguous 2D GEMM (prefill) case for
// Q4_0 / Q4_K; everything else returns false and falls back to MMQ.
//
// ggml MUL_MAT convention: dst[m,n] = sum_k src0[k,m] * src1[k,n].
// src0 (weights): ne0=K (contiguous), ne1=M -> row m is K contiguous quants.
// src1 (acts,f32): ne0=K (contiguous), ne1=N -> row n is K contiguous floats.
// dst (f32): ne0=M (contiguous), ne1=N -> element (m,n) at m + n*M.
// Both operands are row-major [row][k]; m16n8k16 computes C[m,n] += sum_k A[m,k]*B[n,k].
//
// Thread layout: blockDim = (32, WM*WN). threadIdx.x is the warp lane (0..31,
// required by mma.cuh get_i/get_j), threadIdx.y is the warp index.
//
// P3b step 1 - conflict-free shared layout via SKEW PADDING:
// - WM*WN warps compute a BM(=WM*FM*16) x BN(=WN*FN*8) output tile; each warp
// owns an FM x FN grid of m16n8k16 mma fragments accumulated in F32.
// - Per 16-deep k-step the warps cooperatively dequant the BM x 16 Q4 weight
// strip + load the BN x 16 f32->bf16 activation strip into shared, then feed
// the tensor cores with ldmatrix.x4 (A) / ldmatrix.x2 (B).
// - The shared rows are PADDED to SPAD(=12) bf162 instead of the natural 8.
// ldmatrix's per-lane address is row*stride; with the natural stride 8 (a
// divisor of the 32-bank / 128-byte cycle) rows 0,4,8,12 collide -> 2-way
// bank conflict on every fragment load (this is why P3 measured a plain
// ldmatrix swap as neutral). Skewing the stride to 12 (4-byte aligned, so
// ldmatrix's 16-byte alignment holds) makes {r*12 mod 32} hit 8 distinct
// bank-quads for r in 0..7, so both halves of ldmatrix.x4 and ldmatrix.x2 are
// conflict-free. The pad costs only +50% on the small (~4 KB) staged tile, so
// unlike a 128-byte-row XOR swizzle it does NOT collapse occupancy on GB10
// (a wide-row swizzle pushed shared to 16 KB and dropped this to ~2.8 TFLOPS).
//
// Dead-ends already proven (do not re-try): a double-buffered KSTAGE=64 cp.async
// pipeline collapsed occupancy (32 KB shared -> 2.7 TFLOPS); a plain ldmatrix on
// the UNpadded layout was neutral (bank conflicts); a wide-row (BK=64) XOR swizzle
// was conflict-free but occupancy-starved (16 KB shared -> 2.8 TFLOPS). Skew
// padding gets the conflict-free feed at near-zero occupancy cost.
using namespace ggml_cuda_mma;
typedef tile<16, 8, nv_bfloat162> tile_A; // 16(M) x 16(K)
typedef tile< 8, 8, nv_bfloat162> tile_B; // 8(N) x 16(K)
typedef tile<16, 8, float> tile_C; // 16(M) x 8(N)
// bf162 columns actually live per shared row (16 k-values = 8 bf162) ...
#define W4A16_KP 8
// ... padded to this stride to bank-skew the ldmatrix row addresses.
#define W4A16_SPAD 12
static bool w4a16_enabled() {
static const bool en = (std::getenv("GGML_CUDA_W4A16") != nullptr);
return en;
}
// 6-bit packed scale/min decode for Q4_K (mirrors convert.cu get_scale_min_k4).
static __device__ __forceinline__ void w4a16_scale_min_k4(int j, const uint8_t * q, uint8_t & d, uint8_t & m) {
if (j < 4) {
d = q[j] & 63; m = q[j + 4] & 63;
} else {
d = (q[j+4] & 0xF) | ((q[j-4] >> 6) << 4);
m = (q[j+4] >> 4) | ((q[j-0] >> 6) << 4);
}
}
// Dequantize a single Q4_0 weight at column k of a row.
static __device__ __forceinline__ float w4a16_dq_q4_0(const char * row, int k) {
const block_q4_0 * blk = (const block_q4_0 *) row + (k / QK4_0);
const int j = k % QK4_0;
const float d = __half2float(blk->d);
const int q = (j < QK4_0/2) ? (blk->qs[j] & 0xF) : (blk->qs[j - QK4_0/2] >> 4);
return (q - 8) * d;
}
// Dequantize a single Q4_K weight at column k of a row.
static __device__ __forceinline__ float w4a16_dq_q4_K(const char * row, int k) {
const block_q4_K * blk = (const block_q4_K *) row + (k / QK_K);
const int e = k % QK_K;
const int il = e / 64; // 0..3
const int within = e % 64;
const int half = within / 32; // 0..1
const int pos = within % 32;
const int ir = pos / 4; // 0..7
const int l = pos % 4; // 0..3
const int is = 2*il + half;
const float dall = __low2half (blk->dm);
const float dmin = __high2half(blk->dm);
uint8_t sc, mn;
w4a16_scale_min_k4(is, blk->scales, sc, mn);
const float d = dall * sc;
const float m = dmin * mn;
const uint8_t qb = blk->qs[32*il + 4*ir + l];
const int q = (half == 0) ? (qb & 0xF) : (qb >> 4);
return d * q - m;
}
template <bool IS_Q4_K, int WM, int WN, int FM, int FN>
static __global__ void __launch_bounds__(WM*WN*32, 1)
w4a16_gemm_kernel(
const char * __restrict__ src0,
const char * __restrict__ src1,
float * __restrict__ dst,
const int M, const int N, const int K,
const int64_t nb01, const int64_t nb11, const int64_t dst_ne0) {
constexpr int KP = W4A16_KP; // 8 bf162 = 16 k per row
constexpr int SPAD = W4A16_SPAD; // padded row stride (bank skew)
constexpr int BM = WM*FM*16;
constexpr int BN = WN*FN*8;
constexpr int NTH = WM*WN*32;
const int m0 = blockIdx.x * BM;
const int n0 = blockIdx.y * BN;
const int warp_id = threadIdx.y; // 0 .. WM*WN-1
const int warp_n = warp_id % WN;
const int warp_m = warp_id / WN;
const int tid = threadIdx.y*32 + threadIdx.x;
__shared__ nv_bfloat162 sW[BM*SPAD]; // [m][kpair], padded row stride SPAD
__shared__ nv_bfloat162 sB[BN*SPAD]; // [n][kpair], padded row stride SPAD
tile_C C[FM][FN]; // zero-initialized accumulators
for (int k0 = 0; k0 < K; k0 += 16) {
// Dequantize the BM x 16 weight strip once; reused across the block's BN span.
#pragma unroll
for (int idx = tid; idx < BM*KP; idx += NTH) {
const int m = idx / KP;
const int kk = idx % KP;
const int k = k0 + 2*kk;
float w0 = 0.0f, w1 = 0.0f;
if (m0 + m < M) {
const char * row = src0 + (int64_t)(m0 + m) * nb01;
if (IS_Q4_K) { w0 = w4a16_dq_q4_K(row, k); w1 = w4a16_dq_q4_K(row, k + 1); }
else { w0 = w4a16_dq_q4_0(row, k); w1 = w4a16_dq_q4_0(row, k + 1); }
}
sW[m*SPAD + kk] = __floats2bfloat162_rn(w0, w1);
}
// Load the BN x 16 activation strip (f32 -> bf16).
#pragma unroll
for (int idx = tid; idx < BN*KP; idx += NTH) {
const int n = idx / KP;
const int kk = idx % KP;
const int k = k0 + 2*kk;
float a0 = 0.0f, a1 = 0.0f;
if (n0 + n < N) {
const float * arow = (const float *)(src1 + (int64_t)(n0 + n) * nb11);
a0 = arow[k]; a1 = arow[k + 1];
}
sB[n*SPAD + kk] = __floats2bfloat162_rn(a0, a1);
}
__syncthreads();
tile_A Af[FM];
tile_B Bf[FN];
#pragma unroll
for (int fm = 0; fm < FM; ++fm) {
const int mrow = (warp_m*FM + fm) * 16;
load_ldmatrix(Af[fm], sW + mrow*SPAD, SPAD);
}
#pragma unroll
for (int fn = 0; fn < FN; ++fn) {
const int ncol = (warp_n*FN + fn) * 8;
load_ldmatrix(Bf[fn], sB + ncol*SPAD, SPAD);
}
#pragma unroll
for (int fm = 0; fm < FM; ++fm) {
#pragma unroll
for (int fn = 0; fn < FN; ++fn) {
mma(C[fm][fn], Af[fm], Bf[fn]);
}
}
__syncthreads();
}
#pragma unroll
for (int fm = 0; fm < FM; ++fm) {
#pragma unroll
for (int fn = 0; fn < FN; ++fn) {
const int mbase = m0 + (warp_m*FM + fm) * 16;
const int nbase = n0 + (warp_n*FN + fn) * 8;
#pragma unroll
for (int l = 0; l < tile_C::ne; ++l) {
const int m = mbase + tile_C::get_i(l);
const int n = nbase + tile_C::get_j(l);
if (m < M && n < N) {
dst[(int64_t)n * dst_ne0 + m] = C[fm][fn].x[l];
}
}
}
}
}
bool ggml_cuda_w4a16_mul_mat(
ggml_backend_cuda_context & ctx,
const ggml_tensor * src0,
const ggml_tensor * src1,
ggml_tensor * dst) {
if (!w4a16_enabled()) {
return false;
}
if (src0->type != GGML_TYPE_Q4_0 && src0->type != GGML_TYPE_Q4_K) {
return false;
}
if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) {
return false;
}
const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
if (!GGML_CUDA_CC_IS_NVIDIA(cc) || cc < GGML_CUDA_CC_BLACKWELL) {
return false; // targets consumer Blackwell (sm_120/121); older architectures fall back to MMQ
}
if (src0->ne[2] != 1 || src0->ne[3] != 1 ||
src1->ne[2] != 1 || src1->ne[3] != 1 ||
dst->ne[2] != 1 || dst->ne[3] != 1) {
return false;
}
if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(src1) || !ggml_is_contiguous(dst)) {
return false;
}
const int64_t K = src0->ne[0];
const int64_t M = src0->ne[1];
const int64_t N = src1->ne[1];
if (src1->ne[0] != K || dst->ne[0] != M || dst->ne[1] != N) {
return false;
}
if (K % 16 != 0) {
return false;
}
cudaStream_t stream = ctx.stream();
// Block tile config: WM*WN warps compute BM(=WM*FM*16) x BN(=WN*FN*8).
constexpr int WM = 4, WN = 4, FM = 2, FN = 4; // BM=128, BN=128, 16 warps
constexpr int BM = WM*FM*16;
constexpr int BN = WN*FN*8;
const dim3 grid((unsigned)((M + BM - 1) / BM), (unsigned)((N + BN - 1) / BN), 1);
const dim3 block(32, WM*WN, 1);
if (src0->type == GGML_TYPE_Q4_K) {
w4a16_gemm_kernel<true, WM, WN, FM, FN><<<grid, block, 0, stream>>>(
(const char *) src0->data, (const char *) src1->data, (float *) dst->data,
(int) M, (int) N, (int) K, src0->nb[1], src1->nb[1], dst->ne[0]);
} else {
w4a16_gemm_kernel<false, WM, WN, FM, FN><<<grid, block, 0, stream>>>(
(const char *) src0->data, (const char *) src1->data, (float *) dst->data,
(int) M, (int) N, (int) K, src0->nb[1], src1->nb[1], dst->ne[0]);
}
return true;
}
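The skew-padding argument in the header comment of this file can be checked on the host. The sketch below is illustrative only: `distinct_bank_quads` is a hypothetical helper, not part of the patch, and it assumes 32 shared-memory banks of 4 bytes with each ldmatrix row read covering 16 bytes (one "bank-quad").

```cpp
#include <cassert>
#include <set>

// Count the distinct 16-byte bank-quads touched by the 8 row addresses of one
// ldmatrix 8x8 phase. stride_words is the shared-memory row stride in 4-byte
// words (one nv_bfloat162 occupies one word, i.e. one bank).
// Assumption: 32 banks x 4 bytes, so addresses wrap every 32 words.
int distinct_bank_quads(int stride_words) {
    // Rows must stay 16-byte aligned for ldmatrix, i.e. a multiple of 4 words.
    assert(stride_words > 0 && stride_words % 4 == 0);
    std::set<int> quads;
    for (int r = 0; r < 8; ++r) {
        quads.insert(((r * stride_words) % 32) / 4);
    }
    return (int) quads.size();
}
```

With the natural stride of 8 words the eight rows land on only 4 bank-quads (a 2-way conflict); with the padded stride of 12 they land on 8 distinct bank-quads, which is the conflict-free case the comment describes.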

@@ -1,14 +0,0 @@
#pragma once
#include "common.cuh"
// W4A16 Marlin-style BF16 GEMM for NVIDIA Blackwell consumer GPUs (sm_120/121).
// Dense (non-MoE) 4-bit-weight matmul run on BF16 tensor cores, the path that
// reaches the GB10 BF16 ceiling where MMQ (int8, Ampere-tuned) and cuBLAS (sm_80
// fallback) both plateau at ~22% of it. Returns true if it handled the op; false
// to fall back to MMQ. Gated behind GGML_CUDA_W4A16 until correct + faster.
bool ggml_cuda_w4a16_mul_mat(
ggml_backend_cuda_context & ctx,
const ggml_tensor * src0, // 4-bit weights (Q4_0/Q4_K)
const ggml_tensor * src1, // F32 activations
ggml_tensor * dst); // F32 output
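Because the kernel is gated "until correct + faster", a host-side reference is a natural companion. The following is a minimal sketch of the ggml MUL_MAT indexing stated in the kernel comments (`dst[m + n*M] = sum_k src0[k,m] * src1[k,n]`). `mul_mat_ref` is a hypothetical name, and it assumes the weights have already been dequantized to floats; it does not model the BF16 rounding the kernel applies.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the contiguous 2D case.
//   w:   M rows of K contiguous (already dequantized) weights, w[k + m*K]
//   x:   N rows of K contiguous activations,                   x[k + n*K]
//   dst: element (m, n) stored at dst[m + n*M]
std::vector<float> mul_mat_ref(const std::vector<float> & w,
                               const std::vector<float> & x,
                               int M, int N, int K) {
    assert(w.size() == (size_t) M * K && x.size() == (size_t) N * K);
    std::vector<float> dst((size_t) M * N, 0.0f);
    for (int n = 0; n < N; ++n) {
        for (int m = 0; m < M; ++m) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                acc += w[(size_t) m * K + k] * x[(size_t) n * K + k];
            }
            dst[(size_t) n * M + m] = acc;
        }
    }
    return dst;
}
```

Comparing the kernel output against such a reference needs a tolerance, since the GPU path rounds both operands to BF16 before multiplying.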

@@ -1,129 +0,0 @@
// paged-bench: quantify the multi-tenant wins of paged KV allocation that are
// properties of the host-side block model (vLLM-parity), independent of the
// in-model compute path.
//
// Win 1 (capacity): on-demand block allocation vs contiguous per-seq
// reservation, under a fixed KV block budget.
// Win 3 (prefix sharing): automatic cross-tenant prefix dedup via block
// hashing.
//
// Win 2 (throughput) is intentionally NOT here: it requires the paged read
// path wired into llama-graph.cpp (Gate 0). Measuring it at this layer would
// be dishonest, so it is reported as pending.
#include "paged_kv_manager.h"
#include <cstdio>
#include <vector>
#include <numeric>
using namespace paged;
// A deterministic LCG so sequence lengths vary but every run draws the same sequence.
struct Lcg {
uint64_t s;
explicit Lcg(uint64_t seed) : s(seed) {}
uint32_t next() { s = s * 6364136223846793005ULL + 1442695040888963407ULL; return (uint32_t)(s >> 33); }
int range(int lo, int hi) { return lo + (int)(next() % (uint32_t)(hi - lo + 1)); }
};
static size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
int main() {
const int block_size = 16;
const int n_ctx = 2048; // max context a sequence could use
const int num_blocks = 512; // fixed KV budget: 512 blocks * 16 = 8192 cells
printf("paged-bench (block_size=%d, n_ctx=%d, budget=%d blocks = %d cells)\n\n",
block_size, n_ctx, num_blocks, num_blocks * block_size);
// ---------------------------------------------------------------------
// WIN 1: concurrency capacity. Sequences have realistic, VARYING lengths
// (most short, a few long) - the regime where reserving n_ctx per seq
// wastes the most. Count how many fit under the same block budget.
// ---------------------------------------------------------------------
{
Lcg rng(12345);
const int blocks_per_ctx = (int) cdiv(n_ctx, block_size); // contiguous reserves this per seq
// Contiguous (stream-style) reservation: every seq reserves n_ctx worth.
int contiguous_fit = num_blocks / blocks_per_ctx;
// Paged on-demand: draw real lengths until the pool is exhausted.
PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
int paged_fit = 0;
long total_tokens = 0;
for (int seq = 0; ; ++seq) {
// 80% short (8-128 tok), 20% long (up to n_ctx)
int len = (rng.range(0, 99) < 80) ? rng.range(8, 128) : rng.range(128, n_ctx);
if (!m.allocate(seq, (size_t) len)) break;
paged_fit++;
total_tokens += len;
}
printf("WIN 1 concurrency capacity @ %d-block budget\n", num_blocks);
printf(" contiguous (reserve n_ctx/seq): %d sequences\n", contiguous_fit);
printf(" paged (on-demand blocks): %d sequences (avg %ld tok/seq)\n",
paged_fit, paged_fit ? total_tokens / paged_fit : 0);
printf(" --> paged fits %.1fx more concurrent sequences\n\n",
contiguous_fit ? (double) paged_fit / contiguous_fit : 0.0);
}
// ---------------------------------------------------------------------
// WIN 3: cross-tenant prefix sharing. N tenants share a long system
// prompt / RAG context, then diverge. Compare physical blocks consumed
// with prefix caching on vs off.
// ---------------------------------------------------------------------
{
const int n_tenants = 32;
const int shared_len = 1024; // shared system prompt (64 blocks)
const int distinct_len = 64; // per-tenant suffix (4 blocks)
// Shared prefix token ids (identical across tenants -> identical block hashes).
std::vector<int> shared(shared_len);
for (int i = 0; i < shared_len; ++i) shared[i] = 1000 + i;
// --- prefix caching OFF: every tenant pays for the whole prefix ---
long blocks_off = 0;
{
PagedKVManager m(num_blocks * 8, block_size, /*enable_caching=*/false);
for (int t = 0; t < n_tenants; ++t) {
m.allocate(t, (size_t) (shared_len + distinct_len));
blocks_off += m.block_table(t).size();
}
}
// --- prefix caching ON: shared blocks are deduped to one physical copy ---
long blocks_on = 0;
{
PagedKVManager m(num_blocks * 8, block_size, /*enable_caching=*/true);
// tenant 0 fills + caches the shared prefix
auto h = m.compute_block_hashes(shared);
m.allocate(0, (size_t) (shared_len + distinct_len));
m.cache_blocks(0, h, (size_t) shared_len);
long physical = m.block_table(0).size();
// tenants 1..N-1 hit the cached prefix; only their distinct suffix is new
for (int t = 1; t < n_tenants; ++t) {
size_t cached_tokens = m.get_computed_blocks(h); // shared blocks reused
size_t new_tokens = (shared_len - cached_tokens) + distinct_len;
m.allocate(t, (size_t) (shared_len + distinct_len));
// physically new blocks = only what wasn't already resident
physical += (long) cdiv(new_tokens, block_size);
}
blocks_on = physical;
}
printf("WIN 3 cross-tenant prefix sharing (%d tenants, %d-tok shared prefix)\n",
n_tenants, shared_len);
printf(" prefix-cache OFF: %ld physical blocks\n", blocks_off);
printf(" prefix-cache ON: %ld physical blocks\n", blocks_on);
printf(" --> %.1fx less KV memory for the shared workload\n\n",
blocks_on ? (double) blocks_off / blocks_on : 0.0);
}
printf("WIN 2 aggregate throughput under load: PENDING\n");
printf(" Requires the paged gather-read path wired into llama-graph.cpp\n");
printf(" (Gate 0) to measure tok/s vs concurrency. Not measurable at the\n");
printf(" allocation layer; not reported here to avoid overclaiming.\n");
return 0;
}
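The block accounting behind WIN 1 (contiguous side) and WIN 3 can be reproduced without the manager. This is a sketch of the arithmetic only, using hypothetical helper names; it assumes the shared prefix length is a whole number of blocks, as it is in the bench (1024 tokens at block size 16).

```cpp
#include <cassert>
#include <cstddef>

size_t cdiv(size_t a, size_t b) {
    assert(b > 0);
    return (a + b - 1) / b;
}

// Sequences that fit when every sequence reserves a full n_ctx window.
size_t contiguous_fit(size_t num_blocks, size_t n_ctx, size_t block_size) {
    return num_blocks / cdiv(n_ctx, block_size);
}

// Physical blocks when every tenant stores its own copy of the shared prefix.
size_t blocks_without_sharing(size_t n_tenants, size_t shared_len,
                              size_t distinct_len, size_t block_size) {
    return n_tenants * cdiv(shared_len + distinct_len, block_size);
}

// Physical blocks when the prefix is stored once and each later tenant only
// adds blocks for its distinct suffix.
size_t blocks_with_sharing(size_t n_tenants, size_t shared_len,
                           size_t distinct_len, size_t block_size) {
    assert(n_tenants > 0 && shared_len % block_size == 0);
    const size_t first = cdiv(shared_len + distinct_len, block_size);
    const size_t rest  = (n_tenants - 1) * cdiv(distinct_len, block_size);
    return first + rest;
}
```

For the bench's constants this gives 4 contiguous sequences under the 512-block budget, and 2176 versus 192 physical blocks for the 32-tenant shared-prefix case.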

@@ -1,169 +0,0 @@
// paged-loadgen: a dynamic-load benchmark for paged KV that actually exercises the
// regime where paging wins - variable prompt lengths, variable generation lengths,
// staggered (continuous) arrival, and a shared system prefix. The stock
// examples/paged/paged.cpp adds all requests up front with a fixed n_predict from a
// 20-prompt pool, so it never creates KV-memory pressure or fragmentation and
// therefore never shows a paged advantage (see PAGED_KV_HIGH_CONCURRENCY.md).
//
// Build: drop into PR #22569's examples/paged/ and add to its CMakeLists.txt next to
// llama-paged (it uses the same llama_paged_scheduler_* API). Run on the TARGET GPU
// (e.g. 2xH200) where bandwidth lets decode scale to thousands of sequences and KV
// memory becomes the binding constraint - that is where paged KV pays off and where
// this harness produces a meaningful number. On a low-bandwidth box (GB10) throughput
// plateaus long before memory binds, so the win is not observable there regardless.
//
// Metrics reported:
// - goodput (decode tokens/s aggregate) under the dynamic load
// - peak concurrent in-flight sequences actually sustained
// - paged peak KV bytes used vs the contiguous reservation a unified cache needs
// (n_seq_peak * max_ctx), i.e. the capacity ratio = the headroom paging unlocks
//
// The capacity ratio is the load-bearing number for the buy decision: it is how many
// more concurrent tenants a fixed HBM budget serves with paging than without.
#include "common.h"
#include "llama.h"
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <random>
#include <string>
#include <vector>
// ---- workload knobs (env-overridable so the harness is sweepable without rebuilds) ----
static int env_int(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }
struct workload_cfg {
int total_requests = env_int("LG_TOTAL", 2000); // total requests to serve
int target_inflight = env_int("LG_INFLIGHT", 256); // continuous-batching concurrency target
int prefix_tokens = env_int("LG_PREFIX", 512); // shared system-prompt prefix (prefix-cache target)
int suffix_min = env_int("LG_SUFMIN", 16); // per-request unique prompt suffix range
int suffix_max = env_int("LG_SUFMAX", 768);
int gen_short = env_int("LG_GENSHORT", 32); // bimodal generation: most short...
int gen_long = env_int("LG_GENLONG", 1024); // ...some long (the over-reservation driver)
int gen_long_pct = env_int("LG_LONGPCT", 15); // % of requests that are long
int block_size = env_int("LG_BLOCK", 16); // must match -kvbls
unsigned seed = (unsigned) env_int("LG_SEED", 1234);
};
// Per-request plan drawn from the workload distribution.
struct req_plan { int prompt_len; int gen_len; };
int main(int argc, char ** argv) {
common_params params;
params.n_predict = -1; // per-request, controlled by the plan below
if (!common_params_parse(argc, argv, params, LLAMA_EXAMPLE_PAGED)) {
fprintf(stderr, "usage: %s -m <model> -kvp --fit off -ngpub N -ncpub M -ngl 99\n", argv[0]);
return 1;
}
params.kv_paged = true;
common_init_result init = common_init_from_params(params);
llama_model * model = init.model.get();
llama_context * ctx = init.context.get();
if (!model || !ctx) { fprintf(stderr, "load failed\n"); return 1; }
const llama_vocab * vocab = llama_model_get_vocab(model);
workload_cfg cfg;
std::mt19937 rng(cfg.seed);
std::uniform_int_distribution<int> suf(cfg.suffix_min, cfg.suffix_max);
std::uniform_int_distribution<int> pct(1, 100);
// KV bytes/token = 2(K,V) * n_layers * n_head_kv * head_dim * sizeof(f16). Confirmed
// against llama-kv-cache-paged.cpp (block_bytes formula). Used for the capacity ratio.
const int n_layers = llama_model_n_layer(model);
const int n_head_kv = llama_model_n_head_kv(model);
const int head_dim = llama_model_n_embd(model) / llama_model_n_head(model);
const size_t kv_bytes_per_token = (size_t)2 * n_layers * n_head_kv * head_dim * sizeof(uint16_t);
// A long shared system prefix that every request reuses (the prefix-cache target).
std::vector<llama_token> prefix = common_tokenize(ctx, std::string(cfg.prefix_tokens, 'x'), true);
// Pre-draw all request plans so paged peak usage and the contiguous reservation are
// computed from the SAME workload.
std::vector<req_plan> plans(cfg.total_requests);
int max_ctx = 0;
for (auto & p : plans) {
p.prompt_len = cfg.prefix_tokens + suf(rng);
p.gen_len = (pct(rng) <= cfg.gen_long_pct) ? cfg.gen_long : cfg.gen_short;
max_ctx = std::max(max_ctx, p.prompt_len + p.gen_len);
}
llama_paged_scheduler * sched = llama_paged_scheduler_init(ctx);
if (!sched) { fprintf(stderr, "scheduler init failed\n"); return 1; }
// ---- continuous-arrival loop: keep ~target_inflight requests live at all times ----
int next_req = 0, done = 0, inflight = 0, peak_inflight = 0;
long total_decoded = 0;
size_t peak_kv_bytes_paged = 0; // peak of ceil(live_used_tokens/block)*block*kv_bytes (aggregate approximation)
size_t live_used_tokens = 0; // running sum of actual KV tokens held by live seqs
auto admit = [&](int rid) {
const req_plan & p = plans[rid];
std::vector<llama_token> toks = prefix; // shared prefix...
std::vector<llama_token> suff = common_tokenize(ctx, std::string(p.prompt_len - cfg.prefix_tokens, 'y'), false);
toks.insert(toks.end(), suff.begin(), suff.end()); // ...+ unique suffix
if (llama_paged_scheduler_add_request(sched, toks.data(), toks.size(), rid)) {
inflight++; peak_inflight = std::max(peak_inflight, inflight);
live_used_tokens += p.prompt_len;
}
};
const int64_t t0 = ggml_time_us();
for (int i = 0; i < cfg.target_inflight && next_req < cfg.total_requests; ++i) admit(next_req++);
llama_batch batch = {};
std::vector<llama_token> sampled; std::vector<int8_t> stop_flags;
while (done < cfg.total_requests) {
if (!llama_paged_scheduler_prepare_batch(sched, &batch)) break;
const llama_paged_batch_info * info = llama_paged_scheduler_get_batch_info(sched);
sampled.assign(info->n_seq, 0); stop_flags.assign(info->n_seq, 0);
// (decode is done inside the scheduler/update path in PR #22569; greedy here)
for (int i = 0; i < info->n_seq; ++i) {
const int rid = info->seq_ids[i];
llama_paged_seq_state st{};
llama_paged_scheduler_get_seq_state(sched, rid, &st);
// greedy argmax from the i-th row of logits
const float * lg = llama_get_logits_ith(ctx, i);
int best = 0; float bv = lg[0];
for (int t = 1; t < llama_vocab_n_tokens(vocab); ++t) if (lg[t] > bv) { bv = lg[t]; best = t; }
sampled[i] = best;
const bool stop = llama_vocab_is_eog(vocab, best) || st.n_decoded + 1 >= plans[rid].gen_len;
stop_flags[i] = stop ? 1 : 0;
if (!stop) { total_decoded++; live_used_tokens++; }
if (stop) {
done++; inflight--;
live_used_tokens -= (plans[rid].prompt_len + st.n_decoded);
if (next_req < cfg.total_requests) admit(next_req++); // continuous arrival
}
}
// paged peak KV: blocks are allocated per live seq = ceil(used/block); approximate
// current paged footprint from live_used_tokens rounded up per the block size.
const size_t paged_now = (size_t)std::ceil((double)live_used_tokens / cfg.block_size)
* cfg.block_size * kv_bytes_per_token;
peak_kv_bytes_paged = std::max(peak_kv_bytes_paged, paged_now);
llama_paged_scheduler_update(sched, &batch, sampled.data(), stop_flags.data());
}
const double secs = (ggml_time_us() - t0) / 1e6;
// Contiguous unified-KV reservation needed to serve the SAME peak concurrency without
// mid-generation eviction: every live slot must be backed for the worst-case context.
const size_t contig_reserve = (size_t)peak_inflight * max_ctx * kv_bytes_per_token;
printf("\n==== paged-loadgen ====\n");
printf("requests served : %d (target inflight %d, peak inflight %d)\n", done, cfg.target_inflight, peak_inflight);
printf("goodput (decode) : %.1f tok/s (%ld tokens / %.2f s)\n", total_decoded / secs, total_decoded, secs);
printf("kv bytes / token : %zu (n_layer=%d n_head_kv=%d head_dim=%d f16)\n", kv_bytes_per_token, n_layers, n_head_kv, head_dim);
printf("paged peak KV : %.2f GiB (allocated on demand)\n", peak_kv_bytes_paged / 1073741824.0);
printf("contiguous reserve : %.2f GiB (peak_inflight * max_ctx %d)\n", contig_reserve / 1073741824.0, max_ctx);
printf("CAPACITY RATIO : %.2fx <- tenants-per-HBM paging unlocks\n",
peak_kv_bytes_paged ? (double)contig_reserve / peak_kv_bytes_paged : 0.0);
printf(" (plus cross-request prefix sharing of the %d-token shared prefix, not counted above)\n", cfg.prefix_tokens);
llama_paged_scheduler_free(sched);
return 0;
}
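The capacity ratio reported by the harness follows from three small formulas, sketched here in isolation. The function names are hypothetical, and the model dimensions used in the usage note are illustrative values, not measurements from any run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// KV bytes per token = 2 (K and V) * layers * KV heads * head dim * sizeof(f16).
size_t kv_bytes_per_token(size_t n_layers, size_t n_head_kv, size_t head_dim) {
    return 2 * n_layers * n_head_kv * head_dim * sizeof(uint16_t);
}

// Paged footprint: live tokens rounded up to whole blocks (the same aggregate
// approximation the harness uses, not a per-sequence round-up).
size_t paged_footprint_bytes(size_t live_tokens, size_t block_size,
                             size_t bytes_per_token) {
    assert(block_size > 0);
    const size_t blocks = (live_tokens + block_size - 1) / block_size;
    return blocks * block_size * bytes_per_token;
}

// Contiguous reservation (peak_inflight * max_ctx tokens) over the paged peak.
double capacity_ratio(size_t peak_inflight, size_t max_ctx,
                      size_t bytes_per_token, size_t paged_peak_bytes) {
    if (paged_peak_bytes == 0) return 0.0;
    const size_t contiguous = peak_inflight * max_ctx * bytes_per_token;
    return (double) contiguous / (double) paged_peak_bytes;
}
```

As an illustration, a model with 32 layers, 8 KV heads and head dimension 128 needs 131072 bytes per token; 4 in-flight sequences with a worst-case context of 2048 but only 1024 live tokens in total would yield a ratio of 8.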

@@ -1,296 +0,0 @@
#include "paged_kv_manager.h"
#include <cassert>
#include <stdexcept>
namespace paged {
// ---------------------------------------------------------------------------
// FreeBlockQueue (port of kv_cache_utils.py FreeKVCacheBlockQueue)
// ---------------------------------------------------------------------------
FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
num_free_blocks = blocks.size();
for (size_t i = 0; i < blocks.size(); ++i) {
if (i > 0) blocks[i]->prev_free = blocks[i - 1];
if (i + 1 < blocks.size()) blocks[i]->next_free = blocks[i + 1];
}
if (!blocks.empty()) {
fake_head.next_free = blocks.front();
blocks.front()->prev_free = &fake_head;
fake_tail.prev_free = blocks.back();
blocks.back()->next_free = &fake_tail;
} else {
fake_head.next_free = &fake_tail;
fake_tail.prev_free = &fake_head;
}
}
KVCacheBlock* FreeBlockQueue::popleft() {
KVCacheBlock* first = fake_head.next_free;
if (first == &fake_tail || first == nullptr) {
assert(num_free_blocks == 0);
throw std::runtime_error("No free blocks available");
}
fake_head.next_free = first->next_free;
first->next_free->prev_free = &fake_head;
first->prev_free = first->next_free = nullptr;
num_free_blocks--;
return first;
}
std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
std::vector<KVCacheBlock*> ret;
if (n == 0) return ret;
assert(num_free_blocks >= n);
num_free_blocks -= n;
KVCacheBlock* curr = fake_head.next_free;
ret.reserve(n);
for (size_t i = 0; i < n; ++i) {
assert(curr != nullptr);
ret.push_back(curr);
KVCacheBlock* last = curr;
curr = curr->next_free;
last->prev_free = last->next_free = nullptr;
}
if (curr != nullptr) {
fake_head.next_free = curr;
curr->prev_free = &fake_head;
}
return ret;
}
void FreeBlockQueue::remove(KVCacheBlock* block) {
if (!block->prev_free || !block->next_free)
throw std::runtime_error("remove() called on an invalid block");
block->prev_free->next_free = block->next_free;
block->next_free->prev_free = block->prev_free;
block->prev_free = block->next_free = nullptr;
num_free_blocks--;
}
void FreeBlockQueue::append(KVCacheBlock* block) {
KVCacheBlock* last = fake_tail.prev_free;
last->next_free = block;
block->prev_free = last;
block->next_free = &fake_tail;
fake_tail.prev_free = block;
num_free_blocks++;
}
void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
if (blocks.empty()) return;
KVCacheBlock* last = fake_tail.prev_free;
for (KVCacheBlock* b : blocks) {
b->prev_free = last;
last->next_free = b;
last = b;
}
last->next_free = &fake_tail;
fake_tail.prev_free = last;
num_free_blocks += blocks.size();
}
void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
if (blocks.empty()) return;
KVCacheBlock* first = fake_head.next_free;
KVCacheBlock* prev = &fake_head;
for (KVCacheBlock* b : blocks) {
b->prev_free = prev;
prev->next_free = b;
prev = b;
}
prev->next_free = first;
first->prev_free = prev;
num_free_blocks += blocks.size();
}
std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
std::vector<KVCacheBlock*> ret;
const KVCacheBlock* curr = fake_head.next_free;
while (curr && curr->next_free != nullptr) {
ret.push_back(const_cast<KVCacheBlock*>(curr));
curr = curr->next_free;
}
return ret;
}
// ---------------------------------------------------------------------------
// BlockPool (port of block_pool.py)
// ---------------------------------------------------------------------------
static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
std::vector<KVCacheBlock*> p;
p.reserve(v.size());
for (auto& b : v) p.push_back(&b);
return p;
}
static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
std::vector<KVCacheBlock> v;
v.reserve(num_blocks);
for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
return v;
}
BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
: enable_caching_(enable_caching),
blocks_(make_block_vec(num_blocks)),
ptrs_(make_ptrs(blocks_)),
free_queue_(ptrs_) {
// vLLM reserves block_id 0 as the null block (never cached).
null_block = free_queue_.popleft();
null_block->is_null = true;
}
bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
if (!block->has_hash) return false;
auto it = cached_block_hash_to_block_.find(block->block_hash);
if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
cached_block_hash_to_block_.erase(it);
block->reset_hash();
return true;
}
std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
if (n > get_num_free_blocks())
throw std::runtime_error("Cannot get free blocks from pool");
auto ret = free_queue_.popleft_n(n);
for (KVCacheBlock* b : ret) {
if (enable_caching_) maybe_evict_cached_block(b);
assert(b->ref_cnt == 0);
b->ref_cnt += 1;
}
return ret;
}
KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
auto it = cached_block_hash_to_block_.find(block_hash);
return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
}
void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
for (KVCacheBlock* b : blocks) {
// ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
b->ref_cnt += 1;
}
}
void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
std::vector<KVCacheBlock*> without_hash, with_hash;
for (KVCacheBlock* b : ordered_blocks) {
if (b->is_null) continue;
b->ref_cnt -= 1;
if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
}
free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
free_queue_.append_n(with_hash); // hashed: kept warm (tail)
}
void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
size_t num_cached_blocks, size_t num_full_blocks,
const std::vector<uint64_t>& block_hashes) {
for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
KVCacheBlock* blk = req_blocks[i];
if (blk->has_hash) continue;
blk->has_hash = true;
blk->block_hash = block_hashes[i];
cached_block_hash_to_block_[blk->block_hash] = blk;
}
}
// ---------------------------------------------------------------------------
// PagedKVManager (port of SingleTypeKVCacheManager / FullAttentionManager)
// ---------------------------------------------------------------------------
static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
: block_size_(block_size), pool_(num_blocks, enable_caching) {}
bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
auto& req = req_to_blocks_[seq_id];
size_t need = cdiv(total_tokens, block_size_);
if (need <= req.size()) return true;
size_t add = need - req.size();
if (add > pool_.get_num_free_blocks()) return false; // OOM
auto nb = pool_.get_new_blocks(add);
req.insert(req.end(), nb.begin(), nb.end());
return true;
}
std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
std::vector<int32_t> bt;
auto it = req_to_blocks_.find(seq_id);
if (it == req_to_blocks_.end()) return bt;
bt.reserve(it->second.size());
for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
return bt;
}
int64_t PagedKVManager::slot(int seq_id, int pos) const {
const auto& req = req_to_blocks_.at(seq_id);
int32_t phys = req[pos / block_size_]->block_id;
return (int64_t)phys * block_size_ + (pos % block_size_);
}
std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
std::vector<int64_t> sm;
sm.reserve(positions.size());
for (int p : positions) sm.push_back(slot(seq_id, p));
return sm;
}
void PagedKVManager::free(int seq_id) {
auto it = req_to_blocks_.find(seq_id);
if (it == req_to_blocks_.end()) return;
// Free in reverse so the tail of the block chain is evicted first (vLLM order).
std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
pool_.free_blocks(ordered);
req_to_blocks_.erase(it);
}
// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
// hash into the seed so each block hash transitively encodes its whole prefix
// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
    uint64_t h = 14695981039346656037ull ^ parent_hash; // canonical 64-bit FNV offset basis
for (int t : token_ids) {
h ^= (uint64_t)(uint32_t)t;
h *= 1099511628211ull;
}
if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
return h;
}
std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
std::vector<uint64_t> hashes;
uint64_t parent = 0; // NONE_HASH analogue
size_t n_full = token_ids.size() / block_size_;
for (size_t i = 0; i < n_full; ++i) {
std::vector<int> blk(token_ids.begin() + i * block_size_,
token_ids.begin() + (i + 1) * block_size_);
parent = hash_block(parent, blk);
hashes.push_back(parent);
}
return hashes;
}
size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
std::vector<KVCacheBlock*> hits;
for (uint64_t bh : block_hashes) { // stop at first miss (prefix property)
KVCacheBlock* cb = pool_.get_cached_block(bh);
if (!cb) break;
hits.push_back(cb);
}
pool_.touch(hits); // ++ref_cnt, pull from free list
return hits.size() * (size_t)block_size_;
}
void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
auto& req = req_to_blocks_[seq_id];
size_t n_full = num_tokens / block_size_;
pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
}
} // namespace paged
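The prefix property that `get_computed_blocks` relies on (stop at the first miss) comes from folding the parent hash into each block hash. The standalone sketch below illustrates that chaining outside the manager. The helper names are hypothetical, and it assumes the standard 64-bit FNV-1a offset basis and prime, mixing one token id per step rather than one byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// FNV-1a-style hash of one block, seeded with the parent block's hash.
uint64_t hash_block(uint64_t parent, const int * tok, size_t n) {
    uint64_t h = 14695981039346656037ull ^ parent;
    for (size_t i = 0; i < n; ++i) {
        h ^= (uint64_t) (uint32_t) tok[i];
        h *= 1099511628211ull;
    }
    return h;
}

// One hash per FULL block; a trailing partial block is not hashed.
std::vector<uint64_t> chain_hashes(const std::vector<int> & tokens, size_t block_size) {
    assert(block_size > 0);
    std::vector<uint64_t> out;
    uint64_t parent = 0;
    const size_t n_full = tokens.size() / block_size;
    for (size_t b = 0; b < n_full; ++b) {
        parent = hash_block(parent, tokens.data() + b * block_size, block_size);
        out.push_back(parent);
    }
    return out;
}

// Number of leading blocks two sequences could share (first-miss stop).
size_t shared_prefix_blocks(const std::vector<uint64_t> & a,
                            const std::vector<uint64_t> & b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}
```

Changing a single token in the second block changes that block's hash and, through the parent seed, every hash after it, so only the first block remains shareable.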

@@ -1,108 +0,0 @@
#pragma once
// Paged KV cache block manager for llama.cpp (CPU-first prototype).
//
// Host-side block management is a faithful port of vLLM V1:
// vllm/v1/core/kv_cache_utils.py (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
// vllm/v1/core/block_pool.py (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
// vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
//
// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
// dependency so it can be unit-tested in isolation.
#include <cstdint>
#include <vector>
#include <unordered_map>
#include <map>
namespace paged {
// vLLM KVCacheBlock (kv_cache_utils.py).
struct KVCacheBlock {
int32_t block_id = 0;
int ref_cnt = 0;
bool has_hash = false; // vLLM: _block_hash is set only when full+cached
uint64_t block_hash = 0;
bool is_null = false;
KVCacheBlock* prev_free = nullptr;
KVCacheBlock* next_free = nullptr;
explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
void reset_hash() { has_hash = false; block_hash = 0; }
};
// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
// O(1) middle removal is required so touch() can pull a warm cached block out of the
// free list when a later request hits its prefix.
class FreeBlockQueue {
public:
size_t num_free_blocks = 0;
explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
KVCacheBlock* popleft();
std::vector<KVCacheBlock*> popleft_n(size_t n);
void remove(KVCacheBlock* block);
void append(KVCacheBlock* block);
void append_n(const std::vector<KVCacheBlock*>& blocks);
void prepend_n(const std::vector<KVCacheBlock*>& blocks);
std::vector<KVCacheBlock*> get_all_free_blocks() const;
private:
KVCacheBlock fake_head{-1};
KVCacheBlock fake_tail{-1};
};
// vLLM BlockPool (block_pool.py).
class BlockPool {
public:
KVCacheBlock* null_block = nullptr;
BlockPool(int32_t num_blocks, bool enable_caching);
std::vector<KVCacheBlock*> get_new_blocks(size_t n);
KVCacheBlock* get_cached_block(uint64_t block_hash);
void touch(const std::vector<KVCacheBlock*>& blocks);
void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
size_t num_cached_blocks, size_t num_full_blocks,
const std::vector<uint64_t>& block_hashes);
size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
private:
bool maybe_evict_cached_block(KVCacheBlock* block);
bool enable_caching_;
std::vector<KVCacheBlock> blocks_; // owns all block descriptors
std::vector<KVCacheBlock*> ptrs_;
FreeBlockQueue free_queue_;
// vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
// prototype keeps the last writer (single KV-cache group is sufficient for the wins).
std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
};
// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
class PagedKVManager {
public:
PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
// Grow seq_id to cover total_tokens slots. Returns false on OOM (free queue empty).
bool allocate(int seq_id, size_t total_tokens);
std::vector<int32_t> block_table(int seq_id) const;
int64_t slot(int seq_id, int pos) const;
std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
void free(int seq_id);
int block_size() const { return block_size_; }
// Prefix caching (win 3).
static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
protected:
int block_size_;
BlockPool pool_;
std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
};
} // namespace paged
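
The chained hashing and first-miss lookup declared above can be sketched in isolation. This is a minimal illustration, not the manager's actual code: `sketch_hash_block` is a hypothetical FNV-1a stand-in (the real `hash_block` body is not shown here), and a plain `std::unordered_set` stands in for `BlockPool`'s hash table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for PagedKVManager::hash_block: FNV-1a over the
// parent hash followed by the block's token ids (assumption, not the real hash).
inline uint64_t sketch_hash_block(uint64_t parent, const std::vector<int>& tokens) {
    uint64_t h = 1469598103934665603ull;       // FNV-1a 64-bit offset basis
    auto mix = [&h](uint64_t v) {
        for (int i = 0; i < 8; ++i) {
            h ^= (v >> (8 * i)) & 0xffu;       // fold in one byte at a time
            h *= 1099511628211ull;             // FNV-1a 64-bit prime
        }
    };
    mix(parent);
    for (int t : tokens) mix(static_cast<uint64_t>(static_cast<uint32_t>(t)));
    return h;
}

// One hash per FULL block. Each hash folds in its parent, so hash i
// identifies the whole prefix [0, (i + 1) * block_size), not just block i.
inline std::vector<uint64_t> sketch_block_hashes(const std::vector<int>& tokens,
                                                 size_t block_size) {
    assert(block_size > 0);
    std::vector<uint64_t> out;
    uint64_t parent = 0;                       // root of the chain
    for (size_t i = 0; i + block_size <= tokens.size(); i += block_size) {
        std::vector<int> blk(tokens.begin() + i, tokens.begin() + i + block_size);
        parent = sketch_hash_block(parent, blk);
        out.push_back(parent);
    }
    return out;
}

// Number of cached tokens: walk the chain and stop at the first miss,
// because a later block can only be reused if its whole prefix matched.
inline size_t sketch_cached_tokens(const std::vector<uint64_t>& hashes,
                                   const std::unordered_set<uint64_t>& cache,
                                   size_t block_size) {
    size_t hits = 0;
    for (uint64_t h : hashes) {
        if (cache.count(h) == 0) break;
        ++hits;
    }
    return hits * block_size;
}
```

Two prompts that share their first block but diverge in the second get the same first hash and different second hashes, so a lookup for the second prompt reports exactly one block of cached tokens.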

View File

@@ -1,59 +0,0 @@
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index a49a055a6..d95102bbd 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -11,6 +11,8 @@
#include <cstring>
#include <limits>
#include <map>
+#include <numeric>
+#include <cstdlib>
#include <stdexcept>
static bool ggml_is_power_of_2(int n) {
@@ -931,6 +933,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
return { };
}
+ // [paged, experimental] Place this sequence's tokens at permuted,
+ // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
+ // This validates that attention is invariant to physical KV placement -
+ // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
+ // Single-sequence scope (uses get_used() as the logical base); falls back
+ // to the normal allocator if the permuted cells aren't available.
+ static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+ if (paged_mode) {
+ const uint32_t bs = 16; // block size (tokens/block)
+ const uint32_t nblk = cells.size() / bs; // blocks in this stream's pool
+ if (nblk >= 2) {
+ // stride coprime to nblk => block-index permutation is a bijection
+ uint32_t k = 1;
+ for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
+ if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
+ }
+ const uint32_t base = cells.get_used();
+ bool ok = true;
+ for (uint32_t i = 0; i < n_tokens; ++i) {
+ const uint32_t L = base + i;
+ const uint32_t b = L / bs;
+ const uint32_t off = L % bs;
+ if (b >= nblk) { ok = false; break; }
+ const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
+ if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
+ res.idxs[s].push_back(phys);
+ }
+ if (ok && res.idxs[s].size() == n_tokens) {
+ if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+ fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
+ for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+ fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+ }
+ continue; // paged placement succeeded for this sequence
+ }
+ res.idxs[s].clear(); // fall back to the normal allocator
+ }
+ }
+
uint32_t n_tested = 0;
// for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
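
The block permutation in the hunk above relies on the stride `k` being coprime to `nblk`. The following standalone sketch mirrors that stride search and checks the bijection by brute force. The helper names (`pick_stride`, `physical_cell`, `is_bijection`) are made up for illustration, and the multiplication is widened to 64 bits here as a precaution that the patch itself does not take.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>   // std::gcd (C++17)
#include <vector>

// Same search as the patch: the first odd candidate >= nblk/2 that is
// coprime to nblk; stays at 1 (identity permutation) if none is found.
inline uint32_t pick_stride(uint32_t nblk) {
    uint32_t k = 1;
    for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
        if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
    }
    return k;
}

// Logical cell -> permuted physical cell; the offset inside a block is kept.
inline uint32_t physical_cell(uint32_t logical, uint32_t k, uint32_t nblk, uint32_t bs) {
    const uint32_t b   = logical / bs;
    const uint32_t off = logical % bs;
    assert(b < nblk);
    return static_cast<uint32_t>((static_cast<uint64_t>(b) * k) % nblk) * bs + off;
}

// True if b -> (b * k) % nblk visits every block index exactly once.
inline bool is_bijection(uint32_t k, uint32_t nblk) {
    std::vector<bool> seen(nblk, false);
    for (uint32_t b = 0; b < nblk; ++b) {
        const uint32_t p = static_cast<uint32_t>((static_cast<uint64_t>(b) * k) % nblk);
        if (seen[p]) return false;
        seen[p] = true;
    }
    return true;
}
```

Block 0 always maps to itself under this scheme, so the first `bs` tokens of a sequence land in their usual cells and only later blocks are scattered.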

@@ -1,12 +0,0 @@
diff --git a/tests/test-paged-kv-e2e.cpp b/tests/test-paged-kv-e2e.cpp
index 5a352e3..06ead50 100644
--- a/tests/test-paged-kv-e2e.cpp
+++ b/tests/test-paged-kv-e2e.cpp
@@ -115,6 +115,7 @@ static path_result run_paged(const std::string & model_path) {
params.sampling.temp = 0.0f; // greedy
params.warmup = false;
params.kv_paged = true;
+ params.fit_params = false; // honor explicit n_gpu_blocks; GB10 dev_memory over-reports free VRAM
params.n_gpu_blocks = 64;
params.n_cpu_blocks = 16;
params.n_sequences = 1;

@@ -1,42 +0,0 @@
#include "../paged_kv_manager.h"
#include <cassert>
#include <cstdio>
using namespace paged;
int main() {
BlockPool pool(/*num_blocks=*/8, /*enable_caching=*/true);
// block 0 is reserved as null_block (vLLM pops one at init)
assert(pool.null_block != nullptr && pool.null_block->block_id == 0);
assert(pool.get_num_free_blocks() == 7);
// get_new_blocks sets ref_cnt=1 and removes from free list
auto b = pool.get_new_blocks(2);
assert(b.size() == 2 && b[0]->ref_cnt == 1 && b[1]->ref_cnt == 1);
assert(pool.get_num_free_blocks() == 5);
// cache two full blocks with chained hashes, then look them up
std::vector<uint64_t> hashes = {1111, 2222};
pool.cache_full_blocks(b, /*num_cached=*/0, /*num_full=*/2, hashes);
assert(b[0]->has_hash && b[0]->block_hash == 1111);
assert(pool.get_cached_block(1111) == b[0]);
assert(pool.get_cached_block(2222) == b[1]);
assert(pool.get_cached_block(9999) == nullptr);
// free: hashed blocks go to tail (kept warm), so they remain queryable.
pool.free_blocks(b);
assert(b[0]->ref_cnt == 0);
assert(pool.get_num_free_blocks() == 7);
assert(pool.get_cached_block(1111) == b[0]); // still cached/warm
// touch a warm cached block: pulls it out of free list, ++ref_cnt
pool.touch({b[0]});
assert(b[0]->ref_cnt == 1);
assert(pool.get_num_free_blocks() == 6);
// exhausting the pool then allocating evicts a warm cached hash
auto rest = pool.get_new_blocks(pool.get_num_free_blocks());
(void) rest;
assert(pool.get_cached_block(2222) == nullptr); // evicted on reuse
printf("test_block_pool: OK\n");
return 0;
}

@@ -1,44 +0,0 @@
#include "../paged_kv_manager.h"
#include <cassert>
#include <cstdio>
#include <vector>
using namespace paged;
static std::vector<KVCacheBlock> make_blocks(int n) {
std::vector<KVCacheBlock> v;
v.reserve(n);
for (int i = 0; i < n; ++i) v.push_back(KVCacheBlock{i});
return v;
}
int main() {
// ordered 0..9 at init; popleft yields ascending block_ids
auto blocks = make_blocks(10);
std::vector<KVCacheBlock*> ptrs;
for (auto& b : blocks) ptrs.push_back(&b);
FreeBlockQueue q(ptrs);
assert(q.num_free_blocks == 10);
KVCacheBlock* b0 = q.popleft();
assert(b0->block_id == 0);
assert(q.num_free_blocks == 9);
auto two = q.popleft_n(2); // {1,2}
assert(two.size() == 2 && two[0]->block_id == 1 && two[1]->block_id == 2);
assert(q.num_free_blocks == 7);
// O(1) middle removal: remove block 5 (currently free), count drops
q.remove(ptrs[5]);
assert(q.num_free_blocks == 6); // free: 3,4,6,7,8,9
// append puts a block at the tail; it comes back out only after the rest
q.append(b0); // free order now: 3,4,6,7,8,9,0
assert(q.num_free_blocks == 7);
auto all = q.get_all_free_blocks();
assert(all.front()->block_id == 3);
assert(all.back()->block_id == 0);
printf("test_free_block_queue: OK\n");
return 0;
}

@@ -1,133 +0,0 @@
// Phase 2 (core numeric de-risk): attention over GATHERED paged KV must equal
// an independent host-computed reference.
//
// This answers the central risk in the design: feeding gather-to-scratch KV
// (a sequence whose blocks are non-contiguous in the shared pool) into ggml's
// standard attention ops (mul_mat -> soft_max_ext -> mul_mat) produces correct
// attention. If this holds, the paged read path is numerically sound; the
// remaining work is wiring it into llama-graph.cpp (Gate 0 in a real model).
#include "../paged_kv_manager.h"
#include "ggml.h"
#include "ggml-cpu.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include <cassert>
#include <cstdio>
#include <cmath>
#include <vector>
using namespace paged;
int main() {
const int d = 8; // head dim
const int n_kv = 48; // 3 blocks worth of KV tokens
const int n_q = 4; // query tokens
const int block_size = 16;
const int num_blocks = 8;
const int total_slots = block_size * num_blocks;
const float scale = 1.0f / std::sqrt((float) d);
// Non-contiguous physical layout for the KV sequence (blocks [2,1,5]).
PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
assert(m.allocate(0, 2 * block_size));
assert(m.allocate(1, 2 * block_size));
m.free(0);
assert(m.allocate(2, n_kv));
std::vector<int> positions(n_kv);
for (int i = 0; i < n_kv; ++i) positions[i] = i;
auto slots64 = m.slot_mapping(2, positions);
std::vector<int32_t> slots32(slots64.begin(), slots64.end());
// Deterministic K, V, Q in logical [d, n] layout (column-major: col = token).
std::vector<float> K(d * n_kv), V(d * n_kv), Q(d * n_q);
for (int t = 0; t < n_kv; ++t)
for (int e = 0; e < d; ++e) {
K[t * d + e] = std::sin(0.1f * t + 0.3f * e);
V[t * d + e] = std::cos(0.2f * t - 0.1f * e);
}
for (int q = 0; q < n_q; ++q)
for (int e = 0; e < d; ++e) Q[q * d + e] = std::sin(0.05f * q + 0.7f * e);
// ---- Independent host reference attention -------------------------------
std::vector<float> ref(d * n_q, 0.0f);
for (int q = 0; q < n_q; ++q) {
std::vector<float> score(n_kv);
float mx = -1e30f;
for (int t = 0; t < n_kv; ++t) {
float dot = 0.0f;
for (int e = 0; e < d; ++e) dot += K[t * d + e] * Q[q * d + e];
score[t] = dot * scale;
mx = std::fmax(mx, score[t]);
}
float sum = 0.0f;
for (int t = 0; t < n_kv; ++t) { score[t] = std::exp(score[t] - mx); sum += score[t]; }
for (int t = 0; t < n_kv; ++t) {
float p = score[t] / sum;
for (int e = 0; e < d; ++e) ref[q * d + e] += p * V[t * d + e];
}
}
// ---- ggml paged path ----------------------------------------------------
ggml_backend_t backend = ggml_backend_cpu_init();
struct ggml_init_params dp = { ggml_tensor_overhead() * 16, NULL, true };
struct ggml_context * ctx_data = ggml_init(dp);
struct ggml_tensor * poolK = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, total_slots);
struct ggml_tensor * poolV = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, total_slots);
struct ggml_tensor * kSrc = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_kv);
struct ggml_tensor * vSrc = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_kv);
struct ggml_tensor * qT = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, d, n_q);
struct ggml_tensor * wIdx = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, n_kv);
struct ggml_tensor * gIdx = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I32, n_kv);
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_data, backend);
std::vector<float> zeros(d * total_slots, 0.0f);
ggml_backend_tensor_set(poolK, zeros.data(), 0, ggml_nbytes(poolK));
ggml_backend_tensor_set(poolV, zeros.data(), 0, ggml_nbytes(poolV));
ggml_backend_tensor_set(kSrc, K.data(), 0, ggml_nbytes(kSrc));
ggml_backend_tensor_set(vSrc, V.data(), 0, ggml_nbytes(vSrc));
ggml_backend_tensor_set(qT, Q.data(), 0, ggml_nbytes(qT));
ggml_backend_tensor_set(wIdx, slots64.data(), 0, ggml_nbytes(wIdx));
ggml_backend_tensor_set(gIdx, slots32.data(), 0, ggml_nbytes(gIdx));
struct ggml_init_params cp = { ggml_tensor_overhead() * 64 + ggml_graph_overhead(), NULL, true };
struct ggml_context * ctx = ggml_init(cp);
struct ggml_tensor * wroteK = ggml_set_rows(ctx, poolK, kSrc, wIdx);
struct ggml_tensor * wroteV = ggml_set_rows(ctx, poolV, vSrc, wIdx);
struct ggml_tensor * gK = ggml_get_rows(ctx, wroteK, gIdx); // [d, n_kv]
struct ggml_tensor * gV = ggml_get_rows(ctx, wroteV, gIdx); // [d, n_kv]
struct ggml_tensor * kq = ggml_mul_mat(ctx, gK, qT); // [n_kv, n_q]
struct ggml_tensor * probs = ggml_soft_max_ext(ctx, kq, NULL, scale, 0.0f);
struct ggml_tensor * vT = ggml_cont(ctx, ggml_transpose(ctx, gV)); // [n_kv, d]
struct ggml_tensor * out = ggml_mul_mat(ctx, vT, probs); // [d, n_q]
ggml_set_output(out);
struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, out);
ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
assert(ggml_gallocr_alloc_graph(galloc, gf));
assert(ggml_backend_graph_compute(backend, gf) == GGML_STATUS_SUCCESS);
std::vector<float> got(d * n_q);
ggml_backend_tensor_get(out, got.data(), 0, ggml_nbytes(out));
// ---- compare ------------------------------------------------------------
double max_err = 0.0;
for (int i = 0; i < d * n_q; ++i) max_err = std::fmax(max_err, std::fabs(got[i] - ref[i]));
printf("paged attention max abs err vs host reference: %.3e\n", max_err);
assert(max_err < 1e-4 && "paged-gathered attention must match host reference");
ggml_gallocr_free(galloc);
ggml_free(ctx);
ggml_free(ctx_data);
ggml_backend_buffer_free(buf);
ggml_backend_free(backend);
printf("test_ggml_paged_attn: OK (attention over non-contiguous paged KV matches reference)\n");
return 0;
}

@@ -1,142 +0,0 @@
// Phase 1 integration test: prove the paged KV write+read MECHANISM at the
// ggml-op level, driven by PagedKVManager.
//
// write: ggml_set_rows(pool, k_src, slot_mapping) // scatter by slot
// read: ggml_get_rows(pool, gather_idx) // gather seq's slots
//
// The decisive property: a sequence's physical blocks are NON-CONTIGUOUS and
// OUT-OF-ORDER (forced via allocate/free/reallocate), yet gather(write(x)) == x,
// and a second sequence written into disjoint blocks does not contaminate it.
// This is exactly how a paged read path feeds contiguous scratch to attention.
#include "../paged_kv_manager.h"
#include "ggml.h"
#include "ggml-cpu.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include <cassert>
#include <cstdio>
#include <cmath>
#include <vector>
using namespace paged;
int main() {
const int n_embd = 8;
const int block_size = 16;
const int num_blocks = 8; // block 0 reserved as null
const int total_slots = block_size * num_blocks; // 128
// --- Force a non-contiguous, out-of-order block layout for seqC ----------
PagedKVManager m(num_blocks, block_size, /*enable_caching=*/false);
assert(m.allocate(/*seqA=*/0, 2 * block_size)); // blocks {1,2}
assert(m.allocate(/*seqB=*/1, 2 * block_size)); // blocks {3,4}
m.free(0); // returns {1,2} to free list
assert(m.allocate(/*seqC=*/2, 3 * block_size)); // reuses freed blocks, reordered
auto btC = m.block_table(2);
auto btB = m.block_table(1);
printf("seqC block_table = [");
for (size_t i = 0; i < btC.size(); ++i) printf("%s%d", i ? "," : "", btC[i]);
printf("]\n");
assert(btC.size() == 3);
// sanity: seqC and seqB occupy disjoint physical blocks
for (int cb : btC) for (int bb : btB) assert(cb != bb);
const int n_tokens = 3 * block_size; // 48 tokens for seqC
// slot_mapping for seqC positions 0..n_tokens-1
std::vector<int> positions(n_tokens);
for (int i = 0; i < n_tokens; ++i) positions[i] = i;
std::vector<int64_t> slots64 = m.slot_mapping(2, positions); // I64 for set_rows
std::vector<int32_t> slots32(slots64.begin(), slots64.end()); // I32 for get_rows
// seqB occupies different blocks; write a sentinel there to prove isolation.
std::vector<int> posB(2 * block_size);
for (size_t i = 0; i < posB.size(); ++i) posB[i] = (int) i;
std::vector<int64_t> slotsB64 = m.slot_mapping(1, posB);
// --- ggml backend + persistent (statically allocated) tensors ------------
ggml_backend_t backend = ggml_backend_cpu_init();
assert(backend);
struct ggml_init_params dp = { /*mem_size=*/ ggml_tensor_overhead() * 16,
/*mem_buffer=*/ NULL, /*no_alloc=*/ true };
struct ggml_context * ctx_data = ggml_init(dp);
// The shared paged KV pool: one flat block pool, exactly like a paged layer.
struct ggml_tensor * pool = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, total_slots);
struct ggml_tensor * k_src = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, n_tokens);
struct ggml_tensor * w_idx = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, n_tokens);
struct ggml_tensor * g_idx = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I32, n_tokens);
struct ggml_tensor * kB_src = ggml_new_tensor_2d(ctx_data, GGML_TYPE_F32, n_embd, (int) posB.size());
struct ggml_tensor * wB_idx = ggml_new_tensor_1d(ctx_data, GGML_TYPE_I64, (int) posB.size());
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx_data, backend);
assert(buf);
// pool starts zeroed
std::vector<float> zeros(n_embd * total_slots, 0.0f);
ggml_backend_tensor_set(pool, zeros.data(), 0, ggml_nbytes(pool));
// token t carries the value (float) t in every embedding lane -> easy to verify
std::vector<float> ksrc(n_embd * n_tokens);
for (int t = 0; t < n_tokens; ++t)
for (int e = 0; e < n_embd; ++e) ksrc[t * n_embd + e] = (float) t;
ggml_backend_tensor_set(k_src, ksrc.data(), 0, ggml_nbytes(k_src));
ggml_backend_tensor_set(w_idx, slots64.data(), 0, ggml_nbytes(w_idx));
ggml_backend_tensor_set(g_idx, slots32.data(), 0, ggml_nbytes(g_idx));
// seqB sentinel = 999 everywhere
std::vector<float> kBsrc(n_embd * posB.size(), 999.0f);
ggml_backend_tensor_set(kB_src, kBsrc.data(), 0, ggml_nbytes(kB_src));
ggml_backend_tensor_set(wB_idx, slotsB64.data(), 0, ggml_nbytes(wB_idx));
// --- compute graph: write seqB, write seqC, then gather seqC -------------
struct ggml_init_params cp = { /*mem_size=*/ ggml_tensor_overhead() * 32 + ggml_graph_overhead(),
/*mem_buffer=*/ NULL, /*no_alloc=*/ true };
struct ggml_context * ctx = ggml_init(cp);
struct ggml_tensor * wroteB = ggml_set_rows(ctx, pool, kB_src, wB_idx); // view(pool)
struct ggml_tensor * wroteC = ggml_set_rows(ctx, wroteB, k_src, w_idx); // chain so order is fixed
struct ggml_tensor * gathered = ggml_get_rows(ctx, wroteC, g_idx);
ggml_set_output(gathered);
struct ggml_cgraph * gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, gathered);
ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_cpu_buffer_type());
assert(ggml_gallocr_alloc_graph(galloc, gf));
assert(ggml_backend_graph_compute(backend, gf) == GGML_STATUS_SUCCESS);
// --- verify gather(write(x)) == x for the non-contiguous sequence --------
std::vector<float> out(n_embd * n_tokens);
ggml_backend_tensor_get(gathered, out.data(), 0, ggml_nbytes(gathered));
int mism = 0;
for (int t = 0; t < n_tokens; ++t)
for (int e = 0; e < n_embd; ++e)
if (std::fabs(out[t * n_embd + e] - (float) t) > 1e-6f) mism++;
assert(mism == 0 && "gathered paged KV must equal source (round-trip)");
// --- verify isolation: read seqC slots directly from pool, unaffected by seqB
std::vector<float> pool_host(n_embd * total_slots);
ggml_backend_tensor_get(pool, pool_host.data(), 0, ggml_nbytes(pool));
for (int t = 0; t < n_tokens; ++t) {
int slot = (int) slots64[t];
for (int e = 0; e < n_embd; ++e)
assert(std::fabs(pool_host[slot * n_embd + e] - (float) t) < 1e-6f);
}
ggml_gallocr_free(galloc);
ggml_free(ctx);
ggml_free(ctx_data);
ggml_backend_buffer_free(buf);
ggml_backend_free(backend);
printf("test_ggml_paged_rw: OK (non-contiguous paged write/gather round-trip)\n");
return 0;
}

@@ -1,32 +0,0 @@
#include "../paged_kv_manager.h"
#include <cassert>
#include <cstdio>
using namespace paged;
int main() {
PagedKVManager m(/*num_blocks=*/8, /*block_size=*/16, /*enable_caching=*/false);
// 20 tokens -> ceil(20/16)=2 blocks
assert(m.allocate(/*seq=*/0, 20));
auto bt = m.block_table(0);
assert(bt.size() == 2);
// slot arithmetic: pos 0 -> block bt[0]*16 + 0 ; pos 17 -> bt[1]*16 + 1
assert(m.slot(0, 0) == (int64_t)bt[0] * 16 + 0);
assert(m.slot(0, 17) == (int64_t)bt[1] * 16 + 1);
auto sm = m.slot_mapping(0, {0, 16, 17});
assert(sm.size() == 3 && sm[1] == (int64_t)bt[1] * 16 + 0);
// growing the same seq reuses existing blocks, adds only new ones
assert(m.allocate(0, 40)); // ceil(40/16)=3 -> +1 block
assert(m.block_table(0).size() == 3);
// OOM: blocks left = 8 - 1(null) - 3 = 4 blocks; ask for 5 blocks
assert(m.allocate(1, 5 * 16) == false);
// free returns blocks to the pool for reuse
m.free(0);
assert(m.allocate(1, 5 * 16)); // now fits
printf("test_paged_kv_manager: OK\n");
return 0;
}

@@ -1,35 +0,0 @@
#include "../paged_kv_manager.h"
#include <cassert>
#include <cstdio>
#include <vector>
using namespace paged;
int main() {
PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*enable_caching=*/true);
// shared prefix of 32 tokens (2 full blocks) + distinct suffix
std::vector<int> shared(32);
for (int i = 0; i < 32; ++i) shared[i] = 100 + i;
// chained hashing is deterministic and prefix-sensitive
auto h = m.compute_block_hashes(shared);
assert(h.size() == 2);
auto h2 = m.compute_block_hashes(shared);
assert(h == h2); // deterministic
std::vector<int> other = shared; other[0] = 999;
assert(m.compute_block_hashes(other)[0] != h[0]); // sensitive to content
// seq 0: cold, no cache hit yet
assert(m.get_computed_blocks(h) == 0);
assert(m.allocate(0, 32));
m.cache_blocks(0, h, 32);
// seq 1: warm — the 2 shared blocks are a cache hit (32 tokens)
assert(m.get_computed_blocks(h) == 32);
// first-miss stop: a chain that diverges after block 1 hits only 1 block
auto hmix = h; hmix[1] = 0xDEADBEEF;
assert(m.get_computed_blocks(hmix) == 16);
printf("test_prefix_cache: OK\n");
return 0;
}

@@ -1,106 +0,0 @@
# Paged-attention / parity benchmarks (GB10 / DGX Spark)
Goal of the series: vLLM parity. This records the measured gap so the parity claim is data-backed, not asserted.
**Setup:** GB10 (sm_121, 119 GiB unified). Model Qwen3-Coder-30B-A3B. llama.cpp = pinned base + this series
(MXFP4_MOE, `-fa 1 -b 2048 -ub 2048`, `llama-batched-bench`, PP=512 TG=128). vLLM = 0.23.0 FP8 (recorded
prior run, same box/model). S_PP / S_TG are aggregate prefill / decode tok/s across B streams.
## Fresh llama.cpp (this series, MXFP4) vs vLLM (FP8)
| B | llama S_PP | vLLM S_PP | PP gap | llama S_TG | vLLM S_TG | TG gap |
|---|-----------|-----------|--------|-----------|-----------|--------|
| 1 | 1565 | 9644 | 6.2× | **83** | 48 | **llama wins** |
| 8 | 3648 | 33373 | 9.1× | 126 | 312 | 2.5× |
| 32 | 2074 | 99398 | 48× | 319 | 1171 | 3.7× |
| 64 | 3643 | 151990 | 42× | 771 | 2064 | 2.7× |
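
The gap columns read as "vLLM throughput divided by llama.cpp throughput". A small sketch of that arithmetic, using only the numbers from the table above (the struct and function names are illustrative, not part of any benchmark tool):

```cpp
#include <cassert>
#include <cstdio>

// One row of the MoE table above: aggregate tok/s at batch size b.
struct BenchRow { int b; double llama_pp, vllm_pp, llama_tg, vllm_tg; };

static const BenchRow kMoeRows[] = {
    { 1, 1565.0,   9644.0,  83.0,   48.0},
    { 8, 3648.0,  33373.0, 126.0,  312.0},
    {32, 2074.0,  99398.0, 319.0, 1171.0},
    {64, 3643.0, 151990.0, 771.0, 2064.0},
};

// Gap = how many times faster vLLM is; a value below 1 means llama.cpp wins.
inline double gap(double llama_tok_s, double vllm_tok_s) {
    assert(llama_tok_s > 0.0);
    return vllm_tok_s / llama_tok_s;
}

// Print both gaps for every row, rounded to one decimal like the table.
inline void print_gaps() {
    for (const BenchRow& r : kMoeRows) {
        std::printf("B=%-2d  PP gap %.1fx  TG gap %.1fx\n",
                    r.b, gap(r.llama_pp, r.vllm_pp), gap(r.llama_tg, r.vllm_tg));
    }
}
```

At B=1 the decode ratio comes out below 1, which is the "llama wins" cell.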
## Verdict — two distinct gaps, only one is the engine's
1. **Prefill (S_PP): 6–48× behind, and it does NOT scale with B** (plateaus ~3.6k). This is the **FP4 MoE
GEMM kernel** (`mul_mat_q<MXFP4>` ~22 TFLOP/s), confirmed earlier. **Paged attention cannot close this**:
it's per-token compute. Needs the tcgen05/CUTLASS grouped-GEMM (Lever 3, multi-week, no upstream base).
2. **Decode at concurrency (S_TG): 2.5–3.7× behind for B≥8** (we *win* at B=1). This gap IS partly the
engine's domain — vLLM's block-paged KV + continuous batching pack more concurrent decode work per step.
**This is what patches 0003–0006 target.** The win here is realistic; the prefill win is not (kernel).
## CORRECTION — decode-phase profile (B=64, decode-dominated nsys)
The "decode gap is engine-addressable" read above was **wrong**. Profiling a decode-dominated B=64 run:
| kernel | % GPU time |
|---|---|
| `mul_mat_q<MXFP4>` (MoE GEMM) | **54.6** |
| `flash_attn_ext` (attention) | 19.8 |
| `mul_mat_q<Q8>` (dense) | 10.9 |
| KV writes / quant / norms / rest | ~15 |
**Decode at concurrency is ALSO dominated by the FP4 MoE GEMM (54.6%)** — the same Lever-3 kernel as prefill.
Attention (the only thing paging optimizes) is ~20%, and the gather-read reclaims only the *masked-cell*
fraction of that. So **the paged series (0003–0006) cannot close the vLLM gap in either phase**: both are
MoE-kernel-bound. vLLM's concurrency advantage is its MoE/attention *kernels*, not (mainly) its KV management.
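
The "attention is only ~20%" argument is an Amdahl's-law bound. A minimal sketch of that calculation follows, under the simplifying assumption that the percentage of GPU time in the profile above can be treated as a fraction of end-to-end decode time (the function names are illustrative):

```cpp
#include <cassert>

// Amdahl's law: overall speedup when a `fraction` of the runtime is made
// `factor` times faster and everything else is left unchanged.
inline double amdahl_speedup(double fraction, double factor) {
    assert(fraction >= 0.0 && fraction < 1.0 && factor > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / factor);
}

// Upper bound as factor -> infinity, i.e. the optimized part costs nothing.
inline double amdahl_limit(double fraction) {
    assert(fraction >= 0.0 && fraction < 1.0);
    return 1.0 / (1.0 - fraction);
}
```

With `flash_attn_ext` at 19.8% of GPU time, `amdahl_limit(0.198)` is roughly 1.25×, so even a free attention kernel would fall well short of the 2.7× decode gap measured at B=64.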
### What the paged series IS still good for (just not throughput parity)
- **Capacity**: block-granular + on-demand allocation → fit more/longer concurrent sequences in fixed VRAM.
- **Prefix sharing**: cross-request block dedup → lower TTFT + memory on shared system prompts / RAG.
These are real wins on *memory-pressured* and *shared-prefix* workloads — but they are not tok/s parity, and
batched-bench (fresh, non-fragmented, no shared prefix) won't show them.
## DENSE model parity (Qwen3-32B) — does the kernel gap exist for dense too? YES.
The MoE work above is about the grouped MoE GEMM. Dense models use a different (non-grouped) matmul path,
so we benchmarked a dense 32B head-to-head.
**Headline comparison — vLLM NVFP4 W4A16 vs llama.cpp Q4_K_M.** This is the *correct apples-to-apples on
DGX Spark*: both are **4-bit weights / 16-bit activations** (same quant class). vLLM = `Qwen3-32B-NVFP4A16`
(FlashInfer Marlin W4A16 kernel); llama.cpp = `Qwen3-32B-Q4_K_M` (int8-MMQ compute). The only difference is
the compute kernel — which is exactly what we're measuring. (Full **W4A4** NVFP4 does not run on GB10 today;
root cause below — and it would *not* be a fair comparison even if it did, since Q4_K_M is also weight-only-4-bit.)
| B | llama Q4_K_M PP | vLLM W4A16 PP | PP gap | llama decode | vLLM decode | TG gap |
|---|---|---|---|---|---|---|
| 1 | 708 | 5367 | 7.6× | 10.2 | 11.7 | ~parity |
| 8 | 761 | 14941 | 20× | 58 | 92 | 1.6× |
| 32 | 763 | 21952 | 29× | 205 | 330 | 1.6× |
| 64 | 765 | 24444 | 32× | 253 | 569 | 2.2× |
**Findings:**
1. **Dense prefill has the SAME (larger) kernel gap.** llama dense prefill plateaus at ~765 t/s regardless of
B; vLLM scales to 24.4k (32×). Both read 4-bit weights — the gap is the compute kernel: vLLM's FP4 Marlin
tensor-core GEMM vs llama's int8-MMQ. (Note: on consumer Blackwell, W4A16 Marlin is also reported *faster*
than the experimental W4A4 path, so W4A16 isn't a handicapped stand-in — it's the fast path.)
2. **Decode is ~parity at B=1** (10.2 vs 11.7 — both weight-bandwidth-bound reading 4-bit weights), and the
gap grows with batch (compute starts to matter → the kernel gap reappears: 2.2× at B=64).
3. **Scope decision (the reason for this benchmark): the Lever-3 kernel track must also deliver a NON-grouped
block-scaled FP4 GEMM for dense**, not only the MoE grouped GEMM. The dense GEMM is the simpler of the two
(a plain CUTLASS dense GEMM), so it's a good first kernel to land — and it benefits every dense model.
- **No cheap lever:** `GGML_CUDA_FORCE_CUBLAS` is a **no-op for dense too** (Q4_K pp512: 720.8 vs 721.8) —
dequant→cuBLAS-BF16 doesn't engage / isn't faster than int8-MMQ on GB10. With ubatch (saturates) and
nwarps (static_assert) already ruled out for MoE, **every config/flag lever is now exhausted** for both
model classes. Parity is strictly the FP4 tensor-core kernel.
4. **Why full W4A4 NVFP4 hangs on GB10 (root cause, researched).** This is a *known consumer-Blackwell
limitation, not a misconfiguration*. **FlashInfer ships no FP4 cubins for sm_120/sm_121** — its precompiled
kernels are all datacenter `Sm100a/Sm103a` (B200/B300). So on GB10 the dense `mm_fp4` W4A4 GEMM has no
working kernel: the optimized path is gated off for sm_121 (heuristic checks `minor==0`; 12.1 fails), the
CUTLASS dense FP4 fallback is documented to silently return **all-zeros**, and TRT-LLM errors at capability
120. Our exact symptom (loads weights, then stalls at the first profiling forward pass with
`enable_flashinfer_autotune=True` at 0–3% GPU) is the **FlashInfer FP4 autotuner/JIT spinning on an arch
with no FP4 cubins** (matches vllm #30163/#26381, flashinfer #2577/#3294). The "NVFP4 on DGX Spark" story
everyone cites is about *quantization + memory footprint + W4A16/MoE*, **not dense W4A4 inference**, which
isn't validated on sm_121 yet (where people patched it working, it was slower than W4A16 anyway).
**Therefore W4A16 vs Q4_K_M above is the right, reproducible apples-to-apples** for DGX Spark today.
Optional W4A4 retry (verify output isn't zeros first): `VLLM_SKIP_FLASHINFER_AUTOTUNE=1` +
`VLLM_NVFP4_GEMM_BACKEND=cutlass` + `--enforce-eager`, or NVIDIA's `vllm/vllm-openai:cu130-nightly` container.
## So, honestly, where parity stands
- **Decode single-stream: already at/above parity** (B=1: 83 vs 48).
- **Decode concurrency: a real, engine-addressable gap** the paged series can narrow (0004 on-demand pool +
0005 continuous batching). Target: close the 2.5–3.7× at B≥8.
- **Prefill: kernel-bound, not engine-bound.** No amount of paging reaches vLLM here; that's a separate track.
**Series status when measured:** 0001 (vendor) + 0002 (placement, token-identical) done; 0003 (gather-read)
turn-key-planned, not yet implemented. These numbers are the *baseline* the engine patches must improve on at
B≥8 decode — re-run this table after 0004/0005 to show the concurrency gap closing.

@@ -1,82 +0,0 @@
# llama.cpp patch series — paged attention (vLLM-parity engine)
A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us
**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
patch needs fixing, and the failure points at exactly which step the upstream change touched.
## Base
- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
the pin = re-run the regen workflow below and fix only the patches that no longer apply.
## The series (phases → patches)
| # | Patch | What | Verifies |
|---|-------|------|----------|
| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately under `../paged/` |
| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off
(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so a partial series never changes stock behavior.
## Regen workflow (the anti-drift recipe)
```sh
# 1. check out the exact pin into a dev tree
git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
git checkout <LLAMA_VERSION from ../Makefile>
git checkout -b paged
# 2. apply the current series (each becomes a commit), or develop the next patch
git am /path/to/backend/cpp/llama-cpp/patches/00*.patch # or `git apply` + commit per patch
# 3. iterate a phase as ONE commit, then export the whole series 1:1
git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp/patches/ --zero-commit -N
# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
```
## Build integration
`../Makefile`'s `llama.cpp:` target runs, after `git checkout -b build $(LLAMA_VERSION)`:
```
for p in $(CURRENT_MAKEFILE_DIR)/patches/0*.patch; do git apply --verbose "$p"; done
```
All variants (avx/avx2/avx512/cuda/…) copy the patched `llama.cpp/` tree, so the series ships everywhere.
## Status
- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`.
- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
**token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
(see `paged/README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
`llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
`llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
**9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
- **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order, so the result is
bit-identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what
makes paged placement token-identical under flash-attn.**
- 0004–0006 follow.
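
The position-sorted gather described in the key correctness finding can be sketched on its own. This is a simplified illustration under stated assumptions, not the code from patch 0003: `gather_sorted_by_pos` is a hypothetical name, and the KV cell array is modeled as a plain vector in which `-1` marks an empty cell. The real `get_gather_idxs` additionally handles multiple streams and pads short streams with a masked cell.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one stream's KV cells: cell_pos[i] is the token
// position stored in physical cell i, or -1 if that cell is empty.
// Returns the physical indices of the non-empty cells, ordered by token position.
static std::vector<int32_t> gather_sorted_by_pos(const std::vector<int32_t>& cell_pos) {
    std::vector<std::pair<int32_t, int32_t>> pc; // (token position, physical cell index)
    pc.reserve(cell_pos.size());
    for (std::size_t i = 0; i < cell_pos.size(); ++i) {
        if (cell_pos[i] >= 0) {
            pc.emplace_back(cell_pos[i], (int32_t) i);
        }
    }
    // Pairs compare by position first, so this restores logical token order
    // regardless of where the blocks were physically placed.
    std::sort(pc.begin(), pc.end());

    std::vector<int32_t> idxs;
    idxs.reserve(pc.size());
    for (const auto& p : pc) {
        idxs.push_back(p.second);
    }
    return idxs;
}
```

For cells holding positions `{2, -1, 0, 3, -1, 1}`, the gather order is physical cells `2, 5, 0, 3`: the attention read then visits tokens 0, 1, 2, 3 in the same order a contiguous layout would, which is the property the finding relies on for an identical floating-point reduction order.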
### Honest parity note (important)
This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
`../paged/UPSTREAM_GGML_ISSUE.md` and `DGX_BLACKWELL_PLAN.md`). So full vLLM parity = this series **AND** the
kernel; neither alone suffices.


@@ -1,91 +0,0 @@
diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cu b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
new file mode 100644
index 0000000..5f5a782
--- /dev/null
+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cu
@@ -0,0 +1,46 @@
+#include "fp4-grouped-moe.cuh"
+
+#include <cstdlib>
+#include <cstdio>
+
+// SCAFFOLD for the FP4 grouped-GEMM MoE kernel (Lever 3).
+//
+// Why: on GB10 (sm_121) the MoE matmul runs mul_mat_q<MXFP4> - a warp-level mma.sync grouped MMQ -
+// at ~22 effective TFLOP/s, ~27x behind vLLM prefill, and it also dominates decode at concurrency
+// (54.6% of GPU time at B=64). It is the single bottleneck to vLLM parity in BOTH phases; paged
+// attention cannot touch it (proven by profiling). The fix is a CUTLASS-3.x collective-mainloop
+// grouped GEMM over all experts, block-scaled e2m1 operands via tcgen05 tensor-memory MMA.
+//
+// This file is the integration seam. It is currently a no-op that always falls back to MMQ, so the
+// default build is byte-identical. The kernel is filled in over the phases in the design doc.
+
+static bool fp4_grouped_enabled() {
+ static const bool en = (std::getenv("GGML_CUDA_FP4_GROUPED") != nullptr);
+ return en;
+}
+
+bool ggml_cuda_fp4_grouped_moe(
+ ggml_backend_cuda_context & ctx,
+ const ggml_tensor * src0,
+ const ggml_tensor * src1,
+ const ggml_tensor * ids,
+ ggml_tensor * dst) {
+ GGML_UNUSED(ctx); GGML_UNUSED(src1); GGML_UNUSED(ids); GGML_UNUSED(dst);
+
+ if (!fp4_grouped_enabled()) {
+ return false; // default: existing MMQ path
+ }
+ if (src0->type != GGML_TYPE_MXFP4 && src0->type != GGML_TYPE_NVFP4) {
+ return false;
+ }
+
+ // TODO(kernel - see kernel design doc): CUTLASS 3.x GemmGrouped, sm_120a, block-scaled e2m1,
+ // tcgen05 MMA; per-expert problem offsets from `ids`; fused activation quant; numerical parity
+ // vs mul_mat_q<MXFP4> before enabling by default.
+ static bool warned = false;
+ if (!warned) {
+ warned = true;
+ fprintf(stderr, "[fp4-grouped] GGML_CUDA_FP4_GROUPED set, kernel not yet implemented - using MMQ\n");
+ }
+ return false; // scaffold: fall back until the kernel lands
+}
diff --git a/ggml/src/ggml-cuda/fp4-grouped-moe.cuh b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
new file mode 100644
index 0000000..29e1b5a
--- /dev/null
+++ b/ggml/src/ggml-cuda/fp4-grouped-moe.cuh
@@ -0,0 +1,13 @@
+#pragma once
+
+#include "common.cuh"
+
+// Entry point for the tcgen05/CUTLASS block-scaled FP4 (MXFP4/NVFP4) grouped-GEMM MoE kernel for
+// Blackwell consumer GPUs (sm_120/121). Returns true if it handled the op; false to fall back to
+// the existing warp-mma MMQ path. Gated behind GGML_CUDA_FP4_GROUPED until correct + faster.
+bool ggml_cuda_fp4_grouped_moe(
+ ggml_backend_cuda_context & ctx,
+ const ggml_tensor * src0, // expert weights, MXFP4/NVFP4 [n_embd, n_ff, n_expert]
+ const ggml_tensor * src1, // activations, F32 [n_embd, n_tokens, ...]
+ const ggml_tensor * ids, // expert routing, I32
+ ggml_tensor * dst); // F32 output
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index 8ea462a..104d131 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -30,6 +30,7 @@
#include "ggml-cuda/im2col.cuh"
#include "ggml-cuda/mmf.cuh"
#include "ggml-cuda/mmq.cuh"
+#include "ggml-cuda/fp4-grouped-moe.cuh"
#include "ggml-cuda/mmvf.cuh"
#include "ggml-cuda/mmvq.cuh"
#include "ggml-cuda/norm.cuh"
@@ -2701,6 +2702,7 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor *
}
if (ggml_cuda_should_use_mmq(src0->type, cc, ne12, /*n_experts=*/ne02)) {
+ if (ggml_cuda_fp4_grouped_moe(ctx, src0, src1, ids, dst)) { return; }
ggml_cuda_mul_mat_q(ctx, src0, src1, ids, dst);
return;
}


@@ -1,447 +0,0 @@
From bef64835d444a44ed8391bc395cdab38164229d5 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 19 Jun 2026 22:54:49 +0000
Subject: [PATCH] vendor paged kv manager
vLLM-parity host-side KV block manager (FreeBlockQueue, BlockPool,
PagedKVManager, chained-hash prefix cache). Pure C++17, no behavior change -
nothing uses it yet; wired in by later patches in the series.
---
src/CMakeLists.txt | 1 +
src/paged-kv-manager.cpp | 296 +++++++++++++++++++++++++++++++++++++++
src/paged-kv-manager.h | 108 ++++++++++++++
3 files changed, 405 insertions(+)
create mode 100644 src/paged-kv-manager.cpp
create mode 100644 src/paged-kv-manager.h
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index d15ccfd99..a030940b8 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -24,6 +24,7 @@ add_library(llama
llama-io.cpp
llama-kv-cache.cpp
llama-kv-cache-iswa.cpp
+ paged-kv-manager.cpp
llama-kv-cache-dsa.cpp
llama-memory.cpp
llama-memory-hybrid.cpp
diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
new file mode 100644
index 000000000..ca0dcd83a
--- /dev/null
+++ b/src/paged-kv-manager.cpp
@@ -0,0 +1,296 @@
+#include "paged-kv-manager.h"
+#include <cassert>
+#include <stdexcept>
+
+namespace paged {
+
+// ---------------------------------------------------------------------------
+// FreeBlockQueue (port of kv_cache_utils.py FreeKVCacheBlockQueue)
+// ---------------------------------------------------------------------------
+
+FreeBlockQueue::FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks) {
+ num_free_blocks = blocks.size();
+ for (size_t i = 0; i < blocks.size(); ++i) {
+ if (i > 0) blocks[i]->prev_free = blocks[i - 1];
+ if (i + 1 < blocks.size()) blocks[i]->next_free = blocks[i + 1];
+ }
+ if (!blocks.empty()) {
+ fake_head.next_free = blocks.front();
+ blocks.front()->prev_free = &fake_head;
+ fake_tail.prev_free = blocks.back();
+ blocks.back()->next_free = &fake_tail;
+ } else {
+ fake_head.next_free = &fake_tail;
+ fake_tail.prev_free = &fake_head;
+ }
+}
+
+KVCacheBlock* FreeBlockQueue::popleft() {
+ KVCacheBlock* first = fake_head.next_free;
+ if (first == &fake_tail || first == nullptr) {
+ assert(num_free_blocks == 0);
+ throw std::runtime_error("No free blocks available");
+ }
+ fake_head.next_free = first->next_free;
+ first->next_free->prev_free = &fake_head;
+ first->prev_free = first->next_free = nullptr;
+ num_free_blocks--;
+ return first;
+}
+
+std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) {
+ std::vector<KVCacheBlock*> ret;
+ if (n == 0) return ret;
+ assert(num_free_blocks >= n);
+ num_free_blocks -= n;
+ KVCacheBlock* curr = fake_head.next_free;
+ ret.reserve(n);
+ for (size_t i = 0; i < n; ++i) {
+ assert(curr != nullptr);
+ ret.push_back(curr);
+ KVCacheBlock* last = curr;
+ curr = curr->next_free;
+ last->prev_free = last->next_free = nullptr;
+ }
+ if (curr != nullptr) {
+ fake_head.next_free = curr;
+ curr->prev_free = &fake_head;
+ }
+ return ret;
+}
+
+void FreeBlockQueue::remove(KVCacheBlock* block) {
+ if (!block->prev_free || !block->next_free)
+ throw std::runtime_error("remove() called on an invalid block");
+ block->prev_free->next_free = block->next_free;
+ block->next_free->prev_free = block->prev_free;
+ block->prev_free = block->next_free = nullptr;
+ num_free_blocks--;
+}
+
+void FreeBlockQueue::append(KVCacheBlock* block) {
+ KVCacheBlock* last = fake_tail.prev_free;
+ last->next_free = block;
+ block->prev_free = last;
+ block->next_free = &fake_tail;
+ fake_tail.prev_free = block;
+ num_free_blocks++;
+}
+
+void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) {
+ if (blocks.empty()) return;
+ KVCacheBlock* last = fake_tail.prev_free;
+ for (KVCacheBlock* b : blocks) {
+ b->prev_free = last;
+ last->next_free = b;
+ last = b;
+ }
+ last->next_free = &fake_tail;
+ fake_tail.prev_free = last;
+ num_free_blocks += blocks.size();
+}
+
+void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
+ if (blocks.empty()) return;
+ KVCacheBlock* first = fake_head.next_free;
+ KVCacheBlock* prev = &fake_head;
+ for (KVCacheBlock* b : blocks) {
+ b->prev_free = prev;
+ prev->next_free = b;
+ prev = b;
+ }
+ prev->next_free = first;
+ first->prev_free = prev;
+ num_free_blocks += blocks.size();
+}
+
+std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
+ std::vector<KVCacheBlock*> ret;
+ const KVCacheBlock* curr = fake_head.next_free;
+ while (curr && curr->next_free != nullptr) {
+ ret.push_back(const_cast<KVCacheBlock*>(curr));
+ curr = curr->next_free;
+ }
+ return ret;
+}
+
+// ---------------------------------------------------------------------------
+// BlockPool (port of block_pool.py)
+// ---------------------------------------------------------------------------
+
+static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) {
+ std::vector<KVCacheBlock*> p;
+ p.reserve(v.size());
+ for (auto& b : v) p.push_back(&b);
+ return p;
+}
+
+static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) {
+ std::vector<KVCacheBlock> v;
+ v.reserve(num_blocks);
+ for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i);
+ return v;
+}
+
+BlockPool::BlockPool(int32_t num_blocks, bool enable_caching)
+ : enable_caching_(enable_caching),
+ blocks_(make_block_vec(num_blocks)),
+ ptrs_(make_ptrs(blocks_)),
+ free_queue_(ptrs_) {
+ // vLLM reserves block_id 0 as the null block (never cached).
+ null_block = free_queue_.popleft();
+ null_block->is_null = true;
+}
+
+bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) {
+ if (!block->has_hash) return false;
+ auto it = cached_block_hash_to_block_.find(block->block_hash);
+ if (it == cached_block_hash_to_block_.end() || it->second != block) return false;
+ cached_block_hash_to_block_.erase(it);
+ block->reset_hash();
+ return true;
+}
+
+std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) {
+ if (n > get_num_free_blocks())
+ throw std::runtime_error("Cannot get free blocks from pool");
+ auto ret = free_queue_.popleft_n(n);
+ for (KVCacheBlock* b : ret) {
+ if (enable_caching_) maybe_evict_cached_block(b);
+ assert(b->ref_cnt == 0);
+ b->ref_cnt += 1;
+ }
+ return ret;
+}
+
+KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) {
+ auto it = cached_block_hash_to_block_.find(block_hash);
+ return it == cached_block_hash_to_block_.end() ? nullptr : it->second;
+}
+
+void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) {
+ for (KVCacheBlock* b : blocks) {
+ // ref_cnt==0 means the block is a free-list eviction candidate; pull it out.
+ if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b);
+ b->ref_cnt += 1;
+ }
+}
+
+void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) {
+ std::vector<KVCacheBlock*> without_hash, with_hash;
+ for (KVCacheBlock* b : ordered_blocks) {
+ if (b->is_null) continue;
+ b->ref_cnt -= 1;
+ if (b->ref_cnt == 0) (b->has_hash ? with_hash : without_hash).push_back(b);
+ }
+ free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front)
+ free_queue_.append_n(with_hash); // hashed: kept warm (tail)
+}
+
+void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+ size_t num_cached_blocks, size_t num_full_blocks,
+ const std::vector<uint64_t>& block_hashes) {
+ for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) {
+ KVCacheBlock* blk = req_blocks[i];
+ if (blk->has_hash) continue;
+ blk->has_hash = true;
+ blk->block_hash = block_hashes[i];
+ cached_block_hash_to_block_[blk->block_hash] = blk;
+ }
+}
+
+// ---------------------------------------------------------------------------
+// PagedKVManager (port of SingleTypeKVCacheManager / FullAttentionManager)
+// ---------------------------------------------------------------------------
+
+static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; }
+
+PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching)
+ : block_size_(block_size), pool_(num_blocks, enable_caching) {}
+
+bool PagedKVManager::allocate(int seq_id, size_t total_tokens) {
+ auto& req = req_to_blocks_[seq_id];
+ size_t need = cdiv(total_tokens, block_size_);
+ if (need <= req.size()) return true;
+ size_t add = need - req.size();
+ if (add > pool_.get_num_free_blocks()) return false; // OOM
+ auto nb = pool_.get_new_blocks(add);
+ req.insert(req.end(), nb.begin(), nb.end());
+ return true;
+}
+
+std::vector<int32_t> PagedKVManager::block_table(int seq_id) const {
+ std::vector<int32_t> bt;
+ auto it = req_to_blocks_.find(seq_id);
+ if (it == req_to_blocks_.end()) return bt;
+ bt.reserve(it->second.size());
+ for (KVCacheBlock* b : it->second) bt.push_back(b->block_id);
+ return bt;
+}
+
+int64_t PagedKVManager::slot(int seq_id, int pos) const {
+ const auto& req = req_to_blocks_.at(seq_id);
+ int32_t phys = req[pos / block_size_]->block_id;
+ return (int64_t)phys * block_size_ + (pos % block_size_);
+}
+
+std::vector<int64_t> PagedKVManager::slot_mapping(int seq_id, const std::vector<int>& positions) const {
+ std::vector<int64_t> sm;
+ sm.reserve(positions.size());
+ for (int p : positions) sm.push_back(slot(seq_id, p));
+ return sm;
+}
+
+void PagedKVManager::free(int seq_id) {
+ auto it = req_to_blocks_.find(seq_id);
+ if (it == req_to_blocks_.end()) return;
+ // Free in reverse so the tail of the block chain is evicted first (vLLM order).
+ std::vector<KVCacheBlock*> ordered(it->second.rbegin(), it->second.rend());
+ pool_.free_blocks(ordered);
+ req_to_blocks_.erase(it);
+}
+
+// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
+// hash into the seed so each block hash transitively encodes its whole prefix
+// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
+uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector<int>& token_ids) {
+ uint64_t h = 1469598103934665603ull ^ parent_hash;
+ for (int t : token_ids) {
+ h ^= (uint64_t)(uint32_t)t;
+ h *= 1099511628211ull;
+ }
+ if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash")
+ return h;
+}
+
+std::vector<uint64_t> PagedKVManager::compute_block_hashes(const std::vector<int>& token_ids) const {
+ std::vector<uint64_t> hashes;
+ uint64_t parent = 0; // NONE_HASH analogue
+ size_t n_full = token_ids.size() / block_size_;
+ for (size_t i = 0; i < n_full; ++i) {
+ std::vector<int> blk(token_ids.begin() + i * block_size_,
+ token_ids.begin() + (i + 1) * block_size_);
+ parent = hash_block(parent, blk);
+ hashes.push_back(parent);
+ }
+ return hashes;
+}
+
+size_t PagedKVManager::get_computed_blocks(const std::vector<uint64_t>& block_hashes) {
+ std::vector<KVCacheBlock*> hits;
+ for (uint64_t bh : block_hashes) { // stop at first miss (prefix property)
+ KVCacheBlock* cb = pool_.get_cached_block(bh);
+ if (!cb) break;
+ hits.push_back(cb);
+ }
+ pool_.touch(hits); // ++ref_cnt, pull from free list
+ return hits.size() * (size_t)block_size_;
+}
+
+void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens) {
+ auto& req = req_to_blocks_[seq_id];
+ size_t n_full = num_tokens / block_size_;
+ pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
+}
+
+} // namespace paged
diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
new file mode 100644
index 000000000..740280a7f
--- /dev/null
+++ b/src/paged-kv-manager.h
@@ -0,0 +1,108 @@
+#pragma once
+// Paged KV cache block manager for llama.cpp (CPU-first prototype).
+//
+// Host-side block management is a faithful port of vLLM V1:
+// vllm/v1/core/kv_cache_utils.py (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens)
+// vllm/v1/core/block_pool.py (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks)
+// vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit)
+//
+// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting,
+// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp
+// dependency so it can be unit-tested in isolation.
+
+#include <cstdint>
+#include <vector>
+#include <unordered_map>
+#include <map>
+
+namespace paged {
+
+// vLLM KVCacheBlock (kv_cache_utils.py).
+struct KVCacheBlock {
+ int32_t block_id = 0;
+ int ref_cnt = 0;
+ bool has_hash = false; // vLLM: _block_hash is set only when full+cached
+ uint64_t block_hash = 0;
+ bool is_null = false;
+ KVCacheBlock* prev_free = nullptr;
+ KVCacheBlock* next_free = nullptr;
+
+ explicit KVCacheBlock(int32_t id = 0) : block_id(id) {}
+ void reset_hash() { has_hash = false; block_hash = 0; }
+};
+
+// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue).
+// O(1) middle removal is required so touch() can pull a warm cached block out of the
+// free list when a later request hits its prefix.
+class FreeBlockQueue {
+public:
+ size_t num_free_blocks = 0;
+
+ explicit FreeBlockQueue(const std::vector<KVCacheBlock*>& blocks);
+ KVCacheBlock* popleft();
+ std::vector<KVCacheBlock*> popleft_n(size_t n);
+ void remove(KVCacheBlock* block);
+ void append(KVCacheBlock* block);
+ void append_n(const std::vector<KVCacheBlock*>& blocks);
+ void prepend_n(const std::vector<KVCacheBlock*>& blocks);
+ std::vector<KVCacheBlock*> get_all_free_blocks() const;
+
+private:
+ KVCacheBlock fake_head{-1};
+ KVCacheBlock fake_tail{-1};
+};
+
+// vLLM BlockPool (block_pool.py).
+class BlockPool {
+public:
+ KVCacheBlock* null_block = nullptr;
+
+ BlockPool(int32_t num_blocks, bool enable_caching);
+ std::vector<KVCacheBlock*> get_new_blocks(size_t n);
+ KVCacheBlock* get_cached_block(uint64_t block_hash);
+ void touch(const std::vector<KVCacheBlock*>& blocks);
+ void free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks);
+ void cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
+ size_t num_cached_blocks, size_t num_full_blocks,
+ const std::vector<uint64_t>& block_hashes);
+ size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
+
+private:
+ bool maybe_evict_cached_block(KVCacheBlock* block);
+
+ bool enable_caching_;
+ std::vector<KVCacheBlock> blocks_; // owns all block descriptors
+ std::vector<KVCacheBlock*> ptrs_;
+ FreeBlockQueue free_queue_;
+ // vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the
+ // prototype keeps the last writer (single KV-cache group is sufficient for the wins).
+ std::unordered_map<uint64_t, KVCacheBlock*> cached_block_hash_to_block_;
+};
+
+// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager /
+// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode.
+class PagedKVManager {
+public:
+ PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching);
+
+ // Grow seq_id to cover total_tokens slots. Returns false on OOM (free queue empty).
+ bool allocate(int seq_id, size_t total_tokens);
+ std::vector<int32_t> block_table(int seq_id) const;
+ int64_t slot(int seq_id, int pos) const;
+ std::vector<int64_t> slot_mapping(int seq_id, const std::vector<int>& positions) const;
+ void free(int seq_id);
+ int block_size() const { return block_size_; }
+
+ // Prefix caching (win 3).
+ static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
+ std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
+ size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
+ void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
+
+protected:
+ int block_size_;
+ BlockPool pool_;
+ std::map<int, std::vector<KVCacheBlock*>> req_to_blocks_;
+};
+
+} // namespace paged
--
2.43.0


@@ -1,75 +0,0 @@
From 5c9c709e6c6b07e0399b75fd4e46e752d418a9a8 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 19 Jun 2026 23:04:17 +0000
Subject: [PATCH] paged kv block placement (env LLAMA_KV_PAGED)
Place each sequence's tokens at permuted, non-contiguous fixed-size block
positions in find_slot, proving attention is invariant to physical KV placement
(token-identical greedy generation). Default off; single-sequence scope; falls
back to the normal allocator. The paged-placement substrate for the gather-read.
---
src/llama-kv-cache.cpp | 41 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 41 insertions(+)
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 2802103bd..999e2ae61 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -11,6 +11,8 @@
#include <cstring>
#include <limits>
#include <map>
+#include <numeric>
+#include <cstdlib>
#include <stdexcept>
static bool ggml_is_power_of_2(int n) {
@@ -1020,6 +1022,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
return { };
}
+ // [paged, experimental] Place this sequence's tokens at permuted,
+ // non-contiguous fixed-size BLOCK positions instead of a contiguous run.
+ // This validates that attention is invariant to physical KV placement -
+ // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
+ // Single-sequence scope (uses get_used() as the logical base); falls back
+ // to the normal allocator if the permuted cells aren't available.
+ static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+ if (paged_mode) {
+ const uint32_t bs = 16; // block size (tokens/block)
+ const uint32_t nblk = cells.size() / bs; // blocks in this stream's pool
+ if (nblk >= 2) {
+ // stride coprime to nblk => block-index permutation is a bijection
+ uint32_t k = 1;
+ for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
+ if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
+ }
+ const uint32_t base = cells.get_used();
+ bool ok = true;
+ for (uint32_t i = 0; i < n_tokens; ++i) {
+ const uint32_t L = base + i;
+ const uint32_t b = L / bs;
+ const uint32_t off = L % bs;
+ if (b >= nblk) { ok = false; break; }
+ const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
+ if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
+ res.idxs[s].push_back(phys);
+ }
+ if (ok && res.idxs[s].size() == n_tokens) {
+ if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+ fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
+ for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+ fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+ }
+ continue; // paged placement succeeded for this sequence
+ }
+ res.idxs[s].clear(); // fall back to the normal allocator
+ }
+ }
+
uint32_t n_tested = 0;
// for continuous slots, we test that all tokens in the ubatch fit, starting from the current head
--
2.43.0


@@ -1,369 +0,0 @@
From c1de00f4cc1eb0dd25993880bb4c8562be1937d4 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 10:24:22 +0200
Subject: [PATCH] paged gather-read (env LLAMA_KV_PAGED) - patch 0003
Gather K, V and the kq_mask down to each sequence stream's non-empty cells
before build_attn_mha. Position-sorted per stream so the flash-attn online
softmax reduction order matches stock byte-for-byte. Multi-stream: one index
column per stream over k->ne[3], padded to the max non-empty count with a
masked (empty) cell. Gated behind LLAMA_KV_PAGED; no-op when unset.
---
src/CMakeLists.txt | 1 +
src/llama-graph.cpp | 9 ++-
src/llama-kv-cache.cpp | 74 ++++++++++++++++++++++++
src/llama-kv-cache.h | 11 ++++
src/paged-attn.cpp | 128 +++++++++++++++++++++++++++++++++++++++++
src/paged-attn.h | 40 +++++++++++++
6 files changed, 262 insertions(+), 1 deletion(-)
create mode 100644 src/paged-attn.cpp
create mode 100644 src/paged-attn.h
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index a030940..58083b3 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -25,6 +25,7 @@ add_library(llama
llama-kv-cache.cpp
llama-kv-cache-iswa.cpp
paged-kv-manager.cpp
+ paged-attn.cpp
llama-kv-cache-dsa.cpp
llama-memory.cpp
llama-memory-hybrid.cpp
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index 68c9e60..b59d2a5 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -6,6 +6,8 @@
#include "llama-cparams.h"
#include "llama-kv-cache.h"
+
+#include "paged-attn.h"
#include "llama-kv-cache-iswa.h"
#include "llama-kv-cache-dsa.h"
#include "llama-memory-hybrid.h"
@@ -2356,7 +2358,12 @@ ggml_tensor * llm_graph_context::build_attn(
ggml_tensor * k = mctx_cur->get_k(ctx0, il);
ggml_tensor * v = mctx_cur->get_v(ctx0, il);
- ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il);
+ // [paged 0003] gather K, V and the mask to the sequence's used cells only
+ // (no-op unless env LLAMA_KV_PAGED is set).
+ ggml_tensor * kq_mask_g = kq_mask;
+ paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+
+ ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
cb(cur, "kqv_out", il);
if (inp->self_v_rot) {
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 999e2ae..30d02d7 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -1,4 +1,6 @@
#include "llama-kv-cache.h"
+#include <vector>
+#include <utility>
#include "llama-impl.h"
#include "llama-io.h"
@@ -1329,6 +1331,70 @@ ggml_tensor * llama_kv_cache::get_v(ggml_context * ctx, int32_t il, uint32_t n_k
ggml_row_size(v->type, kv_size*n_embd_v_gqa)*sinfo.s0);
}
+// [paged 0003] gather-read: enumerate the non-empty cells in [0, n_kv) for the
+// single stream addressed by sinfo. With paged placement (patch 0002) these are
+// the sequence's scattered block cells; gathering K/V/mask by this index list
+// compacts the attention read while preserving every unmasked (token,cell) pair.
+uint32_t llama_kv_cache::get_n_gather(uint32_t n_kv, const slot_info & sinfo) const {
+ // Multi-stream: the gathered K/V/mask tensors are rectangular [.., n_gather,
+ // n_stream], so n_gather is the MAX non-empty count across the batch streams.
+ // Streams with fewer cells are padded (see get_gather_idxs) with a masked
+ // (empty) cell index, which contributes exp(-inf)=0 and is thus a no-op.
+ // K is laid out over physical streams [s0, s1]; index v_cells the same way.
+ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+ uint32_t mx = 0;
+ for (uint32_t j = 0; j < ns; ++j) {
+ const auto & cells = v_cells[sinfo.s0 + j];
+ const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+ uint32_t cnt = 0;
+ for (uint32_t i = 0; i < n; ++i) {
+ if (!cells.is_empty(i)) {
+ ++cnt;
+ }
+ }
+ mx = std::max(mx, cnt);
+ }
+ return mx;
+}
+
+void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const {
+ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+ const uint32_t n_gather = get_n_gather(n_kv, sinfo);
+ // dst is [n_gather, n_stream] (ne0 = n_gather): column s at dst[s*n_gather..].
+ for (uint32_t j = 0; j < ns; ++j) {
+ const auto & cells = v_cells[sinfo.s0 + j];
+ const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+ // Collect the non-empty cells, then order them by token POSITION (not by
+ // physical cell index). The attention reduction (flash-attn online
+ // softmax, and the non-flash soft_max) runs over cells in array order and
+ // is order-sensitive in floating point. Stock (contiguous) placement
+ // happens to store cells in position order, so emitting the gathered
+ // indices in position order reproduces stock's exact reduction order -
+ // making the paged read bit-identical, not merely math-equivalent.
+ std::vector<std::pair<llama_pos, int32_t>> pc;
+ pc.reserve(n);
+ int32_t pad = -1;
+ for (uint32_t i = 0; i < n; ++i) {
+ if (!cells.is_empty(i)) {
+ pc.emplace_back(cells.pos_get(i), (int32_t) i);
+ } else if (pad < 0) {
+ pad = (int32_t) i; // first empty cell: its mask is -inf -> safe pad
+ }
+ }
+ std::sort(pc.begin(), pc.end());
+ int32_t * col = dst + (size_t) j * n_gather;
+ for (size_t k = 0; k < pc.size(); ++k) {
+ col[k] = pc[k].second;
+ }
+ // Pad the tail to n_gather with a masked (empty) cell so the rectangular
+ // gather drops to zero contribution for streams shorter than the max.
+ const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
+ for (uint32_t k = (uint32_t) pc.size(); k < n_gather; ++k) {
+ col[k] = padv;
+ }
+ }
+}
+
ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
GGML_UNUSED(sinfo);
@@ -2620,6 +2686,14 @@ ggml_tensor * llama_kv_cache_context::get_v(ggml_context * ctx, int32_t il) cons
return kv->get_v(ctx, il, n_kv, sinfos[i_cur]);
}
+uint32_t llama_kv_cache_context::get_n_gather() const {
+ return kv->get_n_gather(n_kv, sinfos[i_cur]);
+}
+
+void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
+ kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
+}
+
ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
}
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index 3d68f98..494c0fb 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -171,6 +171,12 @@ public:
ggml_tensor * get_k(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
ggml_tensor * get_v(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const;
+ // [paged 0003] count / list the non-empty cells in [0, n_kv) per stream of
+ // sinfo (position-sorted, padded across streams). Used by paged-attn
+ // gather-read. get_n_gather returns the max count across streams.
+ uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
+ void get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
+
// store k_cur and v_cur in the cache based on the provided head location
ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
ggml_tensor * cpy_v(ggml_context * ctx, ggml_tensor * v_cur, ggml_tensor * v_idxs, int32_t il, const slot_info & sinfo) const;
@@ -368,6 +374,11 @@ public:
ggml_tensor * get_k(ggml_context * ctx, int32_t il) const;
ggml_tensor * get_v(ggml_context * ctx, int32_t il) const;
+ // [paged 0003] gather-read helpers (delegate to the kv cache for the
+ // current ubatch's stream).
+ uint32_t get_n_gather() const;
+ void get_gather_idxs(int32_t * dst) const;
+
// store k_cur and v_cur in the cache based on the provided head location
// note: the heads in k_cur and v_cur should be laid out contiguously in memory
// - k_cur [n_embd_head_k, n_head_k, n_tokens]
diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
new file mode 100644
index 0000000..ade75e8
--- /dev/null
+++ b/src/paged-attn.cpp
@@ -0,0 +1,128 @@
+#include "paged-attn.h"
+
+#include "llama-graph.h"
+#include "llama-kv-cache.h"
+
+#include "ggml.h"
+#include "ggml-backend.h"
+
+#include <cstdlib>
+#include <cstdio>
+
+namespace paged_attn {
+
+bool active() {
+ static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+ return a;
+}
+
+static bool debug() {
+ static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
+ return d;
+}
+
+namespace {
+
+// Graph input that, at set_input time, fills an I32 [n_gather, n_stream] tensor
+// with each stream's non-empty cell indices (position-sorted, padded with a
+// masked/empty cell) by delegating to the kv-cache context. Private to this
+// unit; default can_reuse()==false keeps the graph from being reused across
+// decodes (n_gather grows every step).
+class input_gather_idxs : public llm_graph_input_i {
+public:
+ input_gather_idxs(const llama_kv_cache_context * mctx, ggml_tensor * idxs)
+ : mctx(mctx), idxs(idxs) {}
+
+ void set_input(const llama_ubatch * ubatch) override {
+ GGML_UNUSED(ubatch);
+ GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
+ mctx->get_gather_idxs((int32_t *) idxs->data);
+ }
+
+ const llama_kv_cache_context * mctx;
+ ggml_tensor * idxs;
+};
+
+} // namespace
+
+void gather(ggml_context * ctx0,
+ llm_graph_result * res,
+ const llama_kv_cache_context * mctx,
+ ggml_tensor ** k,
+ ggml_tensor ** v,
+ ggml_tensor ** kq_mask) {
+ if (!active()) {
+ return;
+ }
+
+ ggml_tensor * K = *k;
+ ggml_tensor * V = *v;
+ ggml_tensor * M = *kq_mask;
+
+ // Number of streams (sequences) in the unified batch. K is laid out
+ // [d, h, n_kv, n_stream] and the mask is [n_kv, n_tps, 1, n_stream]; the
+ // gather is per-stream (one index column per stream), so a single
+ // ggml_get_rows over the stream axis handles 1..N streams uniformly.
+ const int64_t n_stream = K->ne[3];
+ GGML_ASSERT(M->ne[3] == n_stream);
+
+ const int64_t n_gather = (int64_t) mctx->get_n_gather();
+ if (n_gather <= 0) {
+ // Worst-case graph reserve (empty cache) or nothing placed yet: leave
+ // the full [0, n_kv) read untouched so buffer sizing stays worst-case.
+ return;
+ }
+
+ if (debug()) {
+ static int64_t once = 0;
+ if (once++ < 2) {
+ fprintf(stderr, "[paged-attn] gather n_stream=%lld n_kv=%lld n_gather=%lld\n",
+ (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
+ }
+ }
+
+ // Per-stream index tensor [n_gather, n_stream], filled at set_input from
+ // each stream's non-empty cells. ggml_get_rows broadcasts along ne[1]==
+ // n_stream, so column s gathers from stream s of the source.
+ ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_gather, n_stream);
+ ggml_set_input(idx);
+ res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, idx)));
+
+ // --- gather K: collapse (head_dim, n_head) so cells become the row axis ---
+ {
+ ggml_tensor * t = ggml_cont(ctx0, K); // [d, h, n_kv, ns]
+ t = ggml_reshape_3d(ctx0, t, K->ne[0]*K->ne[1], K->ne[2], n_stream); // [d*h, n_kv, ns]
+ t = ggml_get_rows(ctx0, t, idx); // [d*h, n_gather, ns]
+ *k = ggml_reshape_4d(ctx0, t, K->ne[0], K->ne[1], n_gather, n_stream); // [d, h, n_gather, ns]
+ }
+
+ // --- gather V ---
+ // Normalize to a non-transposed [d, h, n_kv, ns] view first, so the gathered
+ // result is contiguous and build_attn_mha sees a consistent v_trans==false.
+ {
+ const bool v_trans = V->nb[1] > V->nb[2];
+ ggml_tensor * vsrc = v_trans
+ ? ggml_permute(ctx0, V, 2, 1, 0, 3) // [n_kv, h, d, ns] -> [d, h, n_kv, ns]
+ : V; // already [d, h, n_kv, ns]
+ ggml_tensor * t = ggml_cont(ctx0, vsrc); // [d, h, n_kv, ns]
+ t = ggml_reshape_3d(ctx0, t, vsrc->ne[0]*vsrc->ne[1], vsrc->ne[2], n_stream); // [d*h, n_kv, ns]
+ t = ggml_get_rows(ctx0, t, idx); // [d*h, n_gather, ns]
+ *v = ggml_reshape_4d(ctx0, t, vsrc->ne[0], vsrc->ne[1], n_gather, n_stream); // [d, h, n_gather, ns]
+ }
+
+ // --- gather mask (cells are ne0): transpose so cells become the row axis,
+ // gather per stream, transpose back ---
+ {
+ ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream); // [n_kv, n_tps, ns]
+ m = ggml_cont(ctx0, ggml_transpose(ctx0, m)); // [n_tps, n_kv, ns]
+ m = ggml_get_rows(ctx0, m, idx); // [n_tps, n_gather, ns] (F32)
+ m = ggml_cont(ctx0, ggml_transpose(ctx0, m)); // [n_gather, n_tps, ns]
+ m = ggml_reshape_4d(ctx0, m, n_gather, M->ne[1], 1, n_stream);
+ if (M->type != m->type) {
+ m = ggml_cast(ctx0, m, M->type); // flash-attn requires an F16 mask
+ }
+ *kq_mask = m;
+ }
+}
+
+} // namespace paged_attn
diff --git a/src/paged-attn.h b/src/paged-attn.h
new file mode 100644
index 0000000..c5b7bd7
--- /dev/null
+++ b/src/paged-attn.h
@@ -0,0 +1,40 @@
+#pragma once
+// Paged attention gather-read (patch 0003, experimental).
+//
+// Companion to the paged block placement in llama_kv_cache::find_slot (patch
+// 0002). Patch 0002 places a sequence's tokens at permuted, non-contiguous
+// fixed-size block cells, but attention still reads the whole [0, n_kv) window
+// (empty cells masked to -inf). This unit compacts that read: it gathers K, V
+// and the kq_mask down to ONLY the sequence's used (non-empty) cells before
+// build_attn_mha.
+//
+// Correctness: dropping already-masked empty cells removes only exp(-inf)=0
+// terms, and cells are gathered in position order (stock's reduction order), so
+// greedy output matches stock. Gated behind env LLAMA_KV_PAGED; no-op if unset.
+//
+// All logic lives here to keep the core files additive: build_attn gets one
+// call, llama_kv_cache_context gets two thin accessors, CMake gets one line.
+
+#include <cstdint>
+
+struct ggml_context;
+struct ggml_tensor;
+class llm_graph_result;
+class llama_kv_cache_context;
+
+namespace paged_attn {
+
+// true iff env LLAMA_KV_PAGED is set (evaluated once).
+bool active();
+
+// Gather K, V and the kq_mask down to the current sequence's non-empty cells.
+// No-op (returns immediately) unless active(). On return *k, *v and *kq_mask
+// point at the compacted tensors; pass them straight to build_attn_mha.
+void gather(ggml_context * ctx0,
+ llm_graph_result * res,
+ const llama_kv_cache_context * mctx,
+ ggml_tensor ** k,
+ ggml_tensor ** v,
+ ggml_tensor ** kq_mask);
+
+} // namespace paged_attn
--
2.43.0
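The position-ordered gather index that patch 0003 builds in get_gather_idxs can be illustrated in isolation. The following is a minimal sketch, not the patch's code: `build_gather_column` and the convention that a negative entry in `pos` marks an empty cell are assumptions made for this example, standing in for the `cells.is_empty()` / `cells.pos_get()` calls used in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one stream's cell metadata:
// pos[i] is the token position stored in cell i, or a negative value if the
// cell is empty. Returns one gather column of length n_gather.
std::vector<int32_t> build_gather_column(const std::vector<int32_t> & pos,
                                         uint32_t n_gather) {
    std::vector<std::pair<int32_t, int32_t>> pc; // (position, cell index)
    int32_t pad = -1;                            // first empty cell, if any

    for (size_t i = 0; i < pos.size(); ++i) {
        if (pos[i] >= 0) {
            pc.emplace_back(pos[i], (int32_t) i);
        } else if (pad < 0) {
            pad = (int32_t) i;
        }
    }
    assert(pc.size() <= n_gather);

    // Order by token position, not by physical cell index.
    std::sort(pc.begin(), pc.end());

    std::vector<int32_t> col;
    col.reserve(n_gather);
    for (const auto & p : pc) {
        col.push_back(p.second);
    }

    // Pad the tail with an empty (masked) cell when one exists; otherwise
    // repeat the last used cell, or 0 if nothing is placed.
    const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
    while (col.size() < n_gather) {
        col.push_back(padv);
    }
    return col;
}
```

With `pos = {2, -1, 0, 1}` and `n_gather = 4`, the column is `{2, 3, 0, 1}`: the cells holding positions 0, 1 and 2, followed by the empty cell 1 as padding.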

@@ -1,298 +0,0 @@
From 7c294973de28d1ac991505638d726acfb371d541 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 10:50:35 +0200
Subject: [PATCH] paged on-demand block allocation (env LLAMA_KV_PAGED) - patch
0004
Drive the paged placement in find_slot through the vendored PagedKVManager
(patch 0001) instead of a fixed full-pool permutation. Blocks are popped from a
free pool on demand as the sequence crosses block boundaries (peak << full
reservation) and returned on sequence end (seq_rm full removal / clear). One
manager per (kv-cache, stream); all state lives in the new src/paged-alloc unit,
so the core kv-cache struct is untouched - find_slot/clear/seq_rm gain only a
gated call. Default off; stock path byte-identical.
---
src/CMakeLists.txt | 1 +
src/llama-kv-cache.cpp | 69 +++++++++++++++++----------
src/paged-alloc.cpp | 106 +++++++++++++++++++++++++++++++++++++++++
src/paged-alloc.h | 39 +++++++++++++++
4 files changed, 190 insertions(+), 25 deletions(-)
create mode 100644 src/paged-alloc.cpp
create mode 100644 src/paged-alloc.h
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 58083b3..4d9d7d1 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -26,6 +26,7 @@ add_library(llama
llama-kv-cache-iswa.cpp
paged-kv-manager.cpp
paged-attn.cpp
+ paged-alloc.cpp
llama-kv-cache-dsa.cpp
llama-memory.cpp
llama-memory-hybrid.cpp
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 30d02d7..1125d9a 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -1,4 +1,5 @@
#include "llama-kv-cache.h"
+#include "paged-alloc.h"
#include <vector>
#include <utility>
@@ -381,6 +382,11 @@ llama_kv_cache::llama_kv_cache(
}
void llama_kv_cache::clear(bool data) {
+ // [paged 0004] return all on-demand blocks to the pool on cache clear.
+ if (paged_alloc::active()) {
+ paged_alloc::release_all(this);
+ }
+
for (uint32_t s = 0; s < n_stream; ++s) {
v_cells[s].reset();
v_heads[s] = 0;
@@ -409,6 +415,16 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
p1 = std::numeric_limits<llama_pos>::max();
}
+ // [paged 0004] free a stream's on-demand blocks when its whole sequence is
+ // removed (sequence end), so they return to the pool for reuse.
+ if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
+ if (seq_id >= 0) {
+ paged_alloc::release(this, (int) seq_to_stream[seq_id]);
+ } else {
+ paged_alloc::release_all(this);
+ }
+ }
+
if (seq_id >= 0) {
auto & cells = v_cells[seq_to_stream[seq_id]];
auto & head = v_heads[seq_to_stream[seq_id]];
@@ -1030,36 +1046,39 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
// the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED.
// Single-sequence scope (uses get_used() as the logical base); falls back
// to the normal allocator if the permuted cells aren't available.
- static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr);
- if (paged_mode) {
+ // [paged 0004] On-demand block allocation. Patch 0002 proved attention is
+ // invariant to physical KV placement; here that placement is driven by
+ // the vendored PagedKVManager (patch 0001): blocks are popped from a free
+ // pool only as the sequence crosses block boundaries (peak << full
+ // reservation) and returned on sequence end. Enabled via LLAMA_KV_PAGED;
+ // falls back to the normal allocator on pool exhaustion or any conflict.
+ if (paged_alloc::active()) {
const uint32_t bs = 16; // block size (tokens/block)
- const uint32_t nblk = cells.size() / bs; // blocks in this stream's pool
+ const uint32_t nblk = cells.size() / bs; // this stream's block budget
if (nblk >= 2) {
- // stride coprime to nblk => block-index permutation is a bijection
- uint32_t k = 1;
- for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) {
- if (std::gcd(cand, nblk) == 1u) { k = cand; break; }
- }
const uint32_t base = cells.get_used();
- bool ok = true;
- for (uint32_t i = 0; i < n_tokens; ++i) {
- const uint32_t L = base + i;
- const uint32_t b = L / bs;
- const uint32_t off = L % bs;
- if (b >= nblk) { ok = false; break; }
- const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block
- if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; }
- res.idxs[s].push_back(phys);
- }
- if (ok && res.idxs[s].size() == n_tokens) {
- if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
- fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens);
- for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
- fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base);
+ const int strm = (int) seq_to_stream[seq_id];
+ std::vector<uint32_t> placed;
+ if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
+ bool ok = (placed.size() == n_tokens);
+ for (uint32_t i = 0; ok && i < n_tokens; ++i) {
+ if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
+ ok = false;
+ }
+ }
+ if (ok) {
+ for (uint32_t phys : placed) {
+ res.idxs[s].push_back(phys);
+ }
+ if (std::getenv("LLAMA_KV_PAGED_DEBUG")) {
+ fprintf(stderr, "[paged] stream %d placed %u tok at cells:", strm, n_tokens);
+ for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]);
+ fprintf(stderr, " (nblk=%u base=%u)\n", nblk, base);
+ }
+ continue; // on-demand paged placement succeeded
}
- continue; // paged placement succeeded for this sequence
+ res.idxs[s].clear(); // fall back to the normal allocator
}
- res.idxs[s].clear(); // fall back to the normal allocator
}
}
diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
new file mode 100644
index 0000000..1d13f9c
--- /dev/null
+++ b/src/paged-alloc.cpp
@@ -0,0 +1,106 @@
+#include "paged-alloc.h"
+#include "paged-kv-manager.h"
+
+#include <cstdlib>
+#include <cstdio>
+#include <map>
+#include <memory>
+#include <utility>
+
+namespace paged_alloc {
+
+bool active() {
+ static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr);
+ return a;
+}
+
+static bool debug() {
+ static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
+ return d;
+}
+
+namespace {
+
+using key_t = std::pair<const void *, int>;
+
+// One PagedKVManager per (kv-cache, stream): each stream owns a separate
+// physical pool of cells.size() cells, so a manager's block ids map directly to
+// cell ranges within that stream's pool. The internal request id is always 0.
+std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
+
+paged::PagedKVManager * get_mgr(const void * cache, int stream,
+ uint32_t pool_blocks, uint32_t block_size) {
+ const key_t k{cache, stream};
+ auto it = g_managers.find(k);
+ if (it == g_managers.end()) {
+ // enable_caching=false: prefix caching is a later patch; 0004 exercises
+ // only on-demand allocate / free.
+ auto mgr = std::make_unique<paged::PagedKVManager>(
+ (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
+ it = g_managers.emplace(k, std::move(mgr)).first;
+ }
+ return it->second.get();
+}
+
+} // namespace
+
+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+ uint32_t block_size, uint32_t pool_blocks,
+ std::vector<uint32_t> & out) {
+ if (n_tokens == 0) {
+ return true;
+ }
+
+ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+
+ const size_t before = mgr->block_table(0).size();
+
+ // Grow the request to cover the highest logical position. The manager pops
+ // free blocks only for the boundaries actually crossed - that is the on-
+ // demand behavior; an already-covered range adds nothing.
+ if (!mgr->allocate(0, (size_t) base + n_tokens)) {
+ return false; // pool exhausted -> caller falls back to the stock path
+ }
+
+ out.reserve(out.size() + n_tokens);
+ for (uint32_t i = 0; i < n_tokens; ++i) {
+ const int64_t s = mgr->slot(0, (int) (base + i));
+ out.push_back((uint32_t) s);
+ }
+
+ if (debug()) {
+ const size_t after = mgr->block_table(0).size();
+ if (after != before) {
+ fprintf(stderr,
+ "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
+ "(budget=%u; base=%u +%u tok)\n",
+ cache, stream, before, after, pool_blocks, base, n_tokens);
+ }
+ }
+
+ return true;
+}
+
+void release(const void * cache, int stream) {
+ auto it = g_managers.find({cache, stream});
+ if (it == g_managers.end()) {
+ return;
+ }
+ it->second->free(0);
+ g_managers.erase(it);
+ if (debug()) {
+ fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
+ }
+}
+
+void release_all(const void * cache) {
+ for (auto it = g_managers.begin(); it != g_managers.end(); ) {
+ if (it->first.first == cache) {
+ it = g_managers.erase(it);
+ } else {
+ ++it;
+ }
+ }
+}
+
+} // namespace paged_alloc
diff --git a/src/paged-alloc.h b/src/paged-alloc.h
new file mode 100644
index 0000000..bf66665
--- /dev/null
+++ b/src/paged-alloc.h
@@ -0,0 +1,39 @@
+#pragma once
+// On-demand paged KV block allocation (patch 0004, experimental).
+//
+// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
+// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
+// sequence's logical positions onto a fixed full-pool permutation, blocks are
+// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
+// and returned to the pool on sequence end. This is where the paged memory-
+// capacity benefit begins: a short sequence holds only a few blocks, not the
+// whole reserved window.
+//
+// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
+// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
+// struct stays untouched - find_slot only gains a gated call.
+
+#include <cstdint>
+#include <vector>
+
+namespace paged_alloc {
+
+// true iff env LLAMA_KV_PAGED is set (evaluated once).
+bool active();
+
+// Place n_tokens logical positions [base, base+n_tokens) of one stream on
+// demand, appending their physical cell indices to `out`. pool_blocks =
+// cells.size()/block_size is this stream's block budget. Returns false (leaving
+// `out` unchanged) on pool exhaustion, so the caller falls back to the stock
+// allocator. The caller still validates each returned cell is empty.
+bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+ uint32_t block_size, uint32_t pool_blocks,
+ std::vector<uint32_t> & out);
+
+// Return a stream's blocks to the pool (sequence end).
+void release(const void * cache, int stream);
+
+// Return every stream's blocks for a kv-cache (clear() / teardown).
+void release_all(const void * cache);
+
+} // namespace paged_alloc
--
2.43.0
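The on-demand behavior described in patch 0004 (blocks popped from a free pool only when a sequence crosses a block boundary, and returned on sequence end) can be sketched with a toy block table. This is an illustration under stated assumptions, not the vendored PagedKVManager: the `ToyPagedSeq` type and its members are hypothetical names, and the real manager's internals are not shown in this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy single-sequence paged allocator (hypothetical, for illustration only).
struct ToyPagedSeq {
    uint32_t block_size;
    std::vector<uint32_t> free_blocks; // free pool of physical block ids
    std::vector<uint32_t> table;       // logical block index -> physical block id

    ToyPagedSeq(uint32_t n_blocks, uint32_t bs) : block_size(bs) {
        assert(bs > 0);
        // Push in reverse so that block 0 is popped first.
        for (uint32_t b = n_blocks; b > 0; --b) {
            free_blocks.push_back(b - 1);
        }
    }

    // Grow the table to cover n_tokens logical positions. Pops a block only
    // for each boundary actually crossed. Returns false, leaving the state
    // unchanged, when the pool cannot cover the request.
    bool allocate(size_t n_tokens) {
        const size_t need = (n_tokens + block_size - 1) / block_size;
        if (need <= table.size()) {
            return true; // range already covered
        }
        if (need - table.size() > free_blocks.size()) {
            return false; // pool exhausted
        }
        while (table.size() < need) {
            table.push_back(free_blocks.back());
            free_blocks.pop_back();
        }
        return true;
    }

    // Physical cell index for a logical position.
    uint32_t slot(uint32_t pos) const {
        assert(pos / block_size < table.size());
        return table[pos / block_size] * block_size + pos % block_size;
    }

    // Return every block of the sequence to the pool (sequence end).
    void release() {
        while (!table.empty()) {
            free_blocks.push_back(table.back());
            table.pop_back();
        }
    }
};
```

In this sketch a sequence of 10 tokens with a block size of 16 holds one block, and growing it to 17 tokens pops exactly one more. That is the capacity effect the patch header describes: a short sequence holds only a few blocks rather than the whole reserved window.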

@@ -1,143 +0,0 @@
From 141029beec609e87f24f6f6bba3ec842d7037862 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 12:13:44 +0200
Subject: [PATCH] paged cross-request prefix caching (env LLAMA_KV_PAGED) -
patch 0006
Add host-side cross-request prefix sharing to the vendored PagedKVManager
(patches 0001-0004): on placement, hash a new sequence's prefix blocks, reuse the
matching cached physical blocks (ref_cnt++) for the shared prefix and allocate
fresh blocks only for the divergent suffix. A shared block is freed only at
ref 0; copy-on-write privatises a still-shared (ref>1) block before a divergent
write so co-owners stay byte-correct. All logic lives in the vendored
src/paged-kv-manager unit (place_with_prefix / cow_block / ref-counting); the
core kv-cache files are untouched. Default off; gated behind LLAMA_KV_PAGED.
Wiring the physical-cell reuse into find_slot so the engine itself skips
recompute needs core seq-membership changes and is left to a later patch.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/paged-kv-manager.cpp | 65 ++++++++++++++++++++++++++++++++++++++++
src/paged-kv-manager.h | 23 ++++++++++++++
2 files changed, 88 insertions(+)
diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
index ca0dcd8..4c6ee4c 100644
--- a/src/paged-kv-manager.cpp
+++ b/src/paged-kv-manager.cpp
@@ -293,4 +293,69 @@ void PagedKVManager::cache_blocks(int seq_id, const std::vector<uint64_t>& block
pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes);
}
+// ---------------------------------------------------------------------------
+// Cross-request prefix caching + copy-on-write (patch 0006)
+// ---------------------------------------------------------------------------
+
+size_t PagedKVManager::place_with_prefix(int seq_id, const std::vector<int>& token_ids) {
+ auto& req = req_to_blocks_[seq_id];
+
+ // Longest cached prefix: hash the full blocks and stop at the first miss.
+ // A block hash transitively encodes its whole prefix (FNV chaining), so the
+ // first miss bounds the reusable prefix (vLLM find_longest_cache_hit).
+ const std::vector<uint64_t> hashes = compute_block_hashes(token_ids);
+ std::vector<KVCacheBlock*> hits;
+ for (uint64_t bh : hashes) {
+ KVCacheBlock* cb = pool_.get_cached_block(bh);
+ if (!cb) break;
+ hits.push_back(cb);
+ }
+
+ // Reuse: ++ref_cnt (pulling warm blocks back out of the free list) then
+ // splice the shared physical blocks into this sequence's block table.
+ pool_.touch(hits);
+ req.insert(req.end(), hits.begin(), hits.end());
+
+ // Allocate fresh blocks only for the divergent suffix.
+ const size_t need = cdiv(token_ids.size(), block_size_);
+ if (need > req.size()) {
+ const size_t add = need - req.size();
+ if (add > pool_.get_num_free_blocks()) {
+ // OOM: roll the sequence back (un-touch the shared prefix so no ref
+ // leaks) and report no placement; the caller falls back to stock.
+ std::vector<KVCacheBlock*> ordered(req.rbegin(), req.rend());
+ pool_.free_blocks(ordered);
+ req.clear();
+ return 0;
+ }
+ auto nb = pool_.get_new_blocks(add);
+ req.insert(req.end(), nb.begin(), nb.end());
+ }
+ return hits.size();
+}
+
+std::pair<int32_t, int32_t> PagedKVManager::cow_block(int seq_id, size_t bi) {
+ auto& req = req_to_blocks_.at(seq_id);
+ KVCacheBlock* old = req.at(bi);
+ if (old->ref_cnt <= 1) {
+ return { old->block_id, old->block_id }; // already private - no copy
+ }
+ // Private copy for this sequence. get_new_blocks sets the fresh block's
+ // ref_cnt to 1; free_blocks decrements the shared block, which stays >0 so
+ // it is NOT returned to the pool and the other owners are left untouched.
+ KVCacheBlock* fresh = pool_.get_new_blocks(1).front();
+ pool_.free_blocks({ old });
+ req[bi] = fresh;
+ return { old->block_id, fresh->block_id };
+}
+
+int PagedKVManager::block_ref_cnt_at(int seq_id, size_t bi) const {
+ return req_to_blocks_.at(seq_id).at(bi)->ref_cnt;
+}
+
+size_t PagedKVManager::num_blocks(int seq_id) const {
+ auto it = req_to_blocks_.find(seq_id);
+ return it == req_to_blocks_.end() ? 0 : it->second.size();
+}
+
} // namespace paged
diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
index 740280a..34decbc 100644
--- a/src/paged-kv-manager.h
+++ b/src/paged-kv-manager.h
@@ -14,6 +14,7 @@
#include <vector>
#include <unordered_map>
#include <map>
+#include <utility>
namespace paged {
@@ -99,6 +100,28 @@ public:
size_t get_computed_blocks(const std::vector<uint64_t>& block_hashes); // returns num cached tokens
void cache_blocks(int seq_id, const std::vector<uint64_t>& block_hashes, size_t num_tokens);
+ // Cross-request prefix caching + copy-on-write (patch 0006).
+ //
+ // Splice the longest cached prefix of token_ids into seq_id (reuse the
+ // shared physical blocks, ref_cnt++ so a block frees only at ref 0) and
+ // allocate fresh blocks only for the divergent suffix. Returns the number of
+ // shared (reused) blocks; the caller skips recomputing those tokens. On pool
+ // exhaustion the sequence is rolled back (no ref leak) and 0 is returned.
+ size_t place_with_prefix(int seq_id, const std::vector<int>& token_ids);
+
+ // Copy-on-write the block at logical index bi of seq_id. If that block is
+ // shared (ref_cnt>1), allocate a fresh private block, drop this seq's ref on
+ // the shared one (other owners keep it, content untouched) and install the
+ // fresh block at bi. Returns {old_block_id, new_block_id}; new==old when the
+ // block was already private (ref_cnt<=1) and no copy is needed. The caller
+ // copies the physical cell contents old_block_id -> new_block_id.
+ std::pair<int32_t, int32_t> cow_block(int seq_id, size_t bi);
+
+ // Introspection for the prefix-share gate (debug/tests).
+ int block_ref_cnt_at(int seq_id, size_t bi) const;
+ size_t num_blocks(int seq_id) const;
+ size_t num_free_blocks() const { return pool_.get_num_free_blocks(); }
+
protected:
int block_size_;
BlockPool pool_;
--
2.43.0
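Patch 0006 relies on each block hash encoding its whole prefix, so the first cache miss bounds the reusable prefix. The sketch below shows one way such chaining can work, assuming 64-bit FNV-1a over the token ids of each full block. `chained_block_hashes` and `longest_cached_prefix` are hypothetical helpers written for this example; the actual compute_block_hashes and BlockPool lookup are not shown in this patch and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// One hash per FULL block. The running FNV-1a state is carried from block to
// block, so hash b depends on every token in blocks 0..b (chaining).
std::vector<uint64_t> chained_block_hashes(const std::vector<int> & tokens,
                                           size_t block_size) {
    assert(block_size > 0);
    std::vector<uint64_t> out;
    uint64_t h = 0xcbf29ce484222325ull; // FNV-1a 64-bit offset basis
    for (size_t b = 0; (b + 1) * block_size <= tokens.size(); ++b) {
        for (size_t i = 0; i < block_size; ++i) {
            const uint32_t t = (uint32_t) tokens[b * block_size + i];
            for (int k = 0; k < 4; ++k) {
                h ^= (t >> (8 * k)) & 0xffu; // mix one byte of the token id
                h *= 0x100000001b3ull;       // FNV-1a 64-bit prime
            }
        }
        out.push_back(h);
    }
    return out;
}

// Number of leading blocks whose hash is present in the cache; stops at the
// first miss. The mapped value stands in for a cached physical block id.
size_t longest_cached_prefix(const std::unordered_map<uint64_t, int> & cache,
                             const std::vector<uint64_t> & hashes) {
    size_t n = 0;
    for (uint64_t h : hashes) {
        if (cache.find(h) == cache.end()) {
            break;
        }
        ++n;
    }
    return n;
}
```

Because the hash state is chained, two sequences whose second blocks hold the same tokens but whose first blocks differ produce different second-block hashes, so a block is only ever reused behind an identical prefix.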

@@ -1,531 +0,0 @@
From da20c1c0571e84bc76202d915d4bb82892a3392b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 12:46:28 +0200
Subject: [PATCH] paged engine prefix recompute-skip (env LLAMA_KV_PAGED) -
patch 0007
Wire the host-side cross-request prefix cache (patch 0006) into the engine so a
new sequence physically SHARES the cached prefix blocks and skips recomputing the
shared prefix - the actual compute win that 0006 (which only proved the host-side
machinery + realised reuse via the stock seq_cp) did not yet deliver from the
paged path itself.
Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):
* paged-alloc reworked from a per-stream, request-0, destroyed-on-free manager
into ONE persistent caching PagedKVManager per (kv-cache, stream) whose
requests are keyed by the real llama_seq_id. free(seq) now releases exactly
one sequence, so ref-counted shared blocks survive while another sharer holds
them. New seams: share_prefix (place_with_prefix -> shared prefix tokens),
slot, commit (publish a sequence into the content cache), ref-counted release,
plus ref/num-free introspection.
* Two gated llama_kv_cache methods (the core seq-membership handling 0007 needs):
paged_prefix_share() reuses the longest cached content prefix for a sequence
and marks the shared physical cells as belonging to it (cells.seq_add) so the
engine's attention mask includes the already-computed prefix KV; the caller
then decodes ONLY the divergent suffix. paged_prefix_commit() publishes a
sequence's full blocks for later reuse.
* find_slot's paged branch anchors placement on each sequence's own logical base
(ubatch.pos) and keys the manager request by seq_id, so an independently-freed
sequence and a shared prefix coexist in one unified pool. seq_rm/clear free
per-sequence (ref-counted) instead of nuking the whole stream.
* paged-prefix-api: a thin gated shim so a caller holding only the public
llama.h can reach the seam and the introspection without the internal headers.
Core existing-file touch: src/llama-kv-cache.{cpp,h}, +71 -3. Everything else is
additive vendored units. Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): a
sequence B sharing A's prefix decodes greedy tokens byte-identical to B from
scratch with the prefill computing ONLY the suffix (32 prefix tokens skipped) at
a block boundary AND mid-block; the shared block carries ref_cnt 2 while both
hold it, drops to 1 when one sharer is removed (survivor intact, re-shareable, no
use-after-free) and returns to the pool only when all sharers are freed. The
0004 serving gate (unified and non-unified) stays byte-identical stock vs paged.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/CMakeLists.txt | 1 +
src/llama-kv-cache.cpp | 66 +++++++++++++++++++++++--
src/llama-kv-cache.h | 8 +++
src/paged-alloc.cpp | 104 ++++++++++++++++++++++++++++++---------
src/paged-alloc.h | 69 +++++++++++++++++++-------
src/paged-prefix-api.cpp | 48 ++++++++++++++++++
src/paged-prefix-api.h | 27 ++++++++++
7 files changed, 280 insertions(+), 43 deletions(-)
create mode 100644 src/paged-prefix-api.cpp
create mode 100644 src/paged-prefix-api.h
diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt
index 4d9d7d1..432f42d 100644
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -27,6 +27,7 @@ add_library(llama
paged-kv-manager.cpp
paged-attn.cpp
paged-alloc.cpp
+ paged-prefix-api.cpp
llama-kv-cache-dsa.cpp
llama-memory.cpp
llama-memory-hybrid.cpp
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 1125d9a..7510ff9 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -419,7 +419,7 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
// removed (sequence end), so they return to the pool for reuse.
if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits<llama_pos>::max()) {
if (seq_id >= 0) {
- paged_alloc::release(this, (int) seq_to_stream[seq_id]);
+ paged_alloc::release(this, (int) seq_to_stream[seq_id], (int) seq_id);
} else {
paged_alloc::release_all(this);
}
@@ -1056,10 +1056,15 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
const uint32_t bs = 16; // block size (tokens/block)
const uint32_t nblk = cells.size() / bs; // this stream's block budget
if (nblk >= 2) {
- const uint32_t base = cells.get_used();
+ // [paged 0007] Anchor placement on this sequence's own logical
+ // base position (ubatch.pos), not the shared used-count, and key
+ // the manager request by the real seq_id. slot(seq,pos) is then
+ // stable per sequence, so an independently-freed (ref-counted)
+ // sequence and a shared prefix can coexist in one unified pool.
+ const uint32_t base = (uint32_t) ubatch.pos[s*n_tokens];
const int strm = (int) seq_to_stream[seq_id];
std::vector<uint32_t> placed;
- if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) {
+ if (paged_alloc::place(this, strm, (int) seq_id, base, n_tokens, bs, nblk, placed)) {
bool ok = (placed.size() == n_tokens);
for (uint32_t i = 0; ok && i < n_tokens; ++i) {
if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) {
@@ -1165,6 +1170,61 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch,
return res;
}
+// [paged 0007] Cross-request prefix recompute-skip.
+//
+// Reuse a cached content prefix for seq_id: share_prefix() splices the longest
+// matching cached physical blocks into seq_id (ref_cnt++) and reserves fresh
+// blocks for the divergent suffix. We then mark the shared physical cells as
+// belonging to seq_id - those cells already hold the owner's computed KV at the
+// matching logical positions, so the caller decodes ONLY the suffix and the
+// prefix is never recomputed. Returns the number of shared prefix tokens.
+// Gated behind LLAMA_KV_PAGED; a no-op (returns 0) otherwise.
+int32_t llama_kv_cache::paged_prefix_share(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
+ if (!paged_alloc::active() || tokens.empty()) {
+ return 0;
+ }
+ const uint32_t bs = 16;
+ const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
+ auto & cells = v_cells[strm];
+ const uint32_t nblk = cells.size() / bs;
+ if (nblk < 2) {
+ return 0;
+ }
+
+ std::vector<int> toks(tokens.begin(), tokens.end());
+ const size_t kshare = paged_alloc::share_prefix(this, (int) strm, (int) seq_id, toks, bs, nblk);
+
+ for (size_t p = 0; p < kshare; ++p) {
+ const int64_t cell = paged_alloc::slot(this, (int) strm, (int) seq_id, (int) p);
+ if (cell < 0 || (uint32_t) cell >= cells.size() ||
+ cells.is_empty((uint32_t) cell) ||
+ cells.pos_get((uint32_t) cell) != (llama_pos) p) {
+ // Owner cell missing / repurposed: cannot safely share. Roll the
+ // sequence back so the caller recomputes the whole prompt.
+ paged_alloc::release(this, (int) strm, (int) seq_id);
+ return 0;
+ }
+ if (!cells.seq_has((uint32_t) cell, seq_id)) {
+ cells.seq_add((uint32_t) cell, seq_id);
+ }
+ }
+ return (int32_t) kshare;
+}
+
+// [paged 0007] Publish a sequence's full blocks into the content cache so a
+// later paged_prefix_share() can reuse them. Call after the sequence KV is
+// computed (its prefill decode has run).
+void llama_kv_cache::paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens) {
+ if (!paged_alloc::active() || tokens.empty()) {
+ return;
+ }
+ const uint32_t bs = 16;
+ const uint32_t strm = (uint32_t) seq_to_stream[seq_id];
+ const uint32_t nblk = v_cells[strm].size() / bs;
+ std::vector<int> toks(tokens.begin(), tokens.end());
+ paged_alloc::commit(this, (int) strm, (int) seq_id, toks, bs, nblk);
+}
+
void llama_kv_cache::apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch) {
// TODO: refactor [TAG_KV_CACHE_SHARE_CELLS]
if (other) {
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index 494c0fb..f374ac6 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -199,6 +199,14 @@ public:
// emplace the ubatch context into slot: [sinfo.idxs[0...ubatch.n_tokens - 1]]
void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch);
+ // [paged 0007] Cross-request prefix recompute-skip (experimental, gated by
+ // env LLAMA_KV_PAGED). paged_prefix_share() reuses a cached content prefix
+ // for seq_id and returns the number of shared prefix tokens (the caller
+ // decodes only the suffix); paged_prefix_commit() publishes a sequence into
+ // the content cache for later reuse. No-ops when LLAMA_KV_PAGED is unset.
+ int32_t paged_prefix_share (llama_seq_id seq_id, const std::vector<llama_token> & tokens);
+ void paged_prefix_commit(llama_seq_id seq_id, const std::vector<llama_token> & tokens);
+
//
// input API
//
diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
index 1d13f9c..c1027fb 100644
--- a/src/paged-alloc.cpp
+++ b/src/paged-alloc.cpp
@@ -23,9 +23,13 @@ namespace {
using key_t = std::pair<const void *, int>;
-// One PagedKVManager per (kv-cache, stream): each stream owns a separate
-// physical pool of cells.size() cells, so a manager's block ids map directly to
-// cell ranges within that stream's pool. The internal request id is always 0.
+// One persistent PagedKVManager per (kv-cache, stream): each stream owns a
+// separate physical pool of cells.size() cells, so a manager's block ids map
+// directly to cell ranges within that stream's pool. Requests inside a manager
+// are keyed by the real llama_seq_id (NOT a fixed 0), so free(seq) releases one
+// sequence and shared blocks survive at ref>0 - this is what makes ref-counted
+// cross-request prefix sharing (0007) possible. Caching is enabled so commit()
+// can publish blocks and share_prefix() can hit them.
std::map<key_t, std::unique_ptr<paged::PagedKVManager>> g_managers;
paged::PagedKVManager * get_mgr(const void * cache, int stream,
@@ -33,18 +37,21 @@ paged::PagedKVManager * get_mgr(const void * cache, int stream,
const key_t k{cache, stream};
auto it = g_managers.find(k);
if (it == g_managers.end()) {
- // enable_caching=false: prefix caching is a later patch; 0004 exercises
- // only on-demand allocate / free.
auto mgr = std::make_unique<paged::PagedKVManager>(
- (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false);
+ (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/true);
it = g_managers.emplace(k, std::move(mgr)).first;
}
return it->second.get();
}
+paged::PagedKVManager * find_mgr(const void * cache, int stream) {
+ auto it = g_managers.find({cache, stream});
+ return it == g_managers.end() ? nullptr : it->second.get();
+}
+
} // namespace
-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
uint32_t block_size, uint32_t pool_blocks,
std::vector<uint32_t> & out) {
if (n_tokens == 0) {
@@ -53,43 +60,79 @@ bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
- const size_t before = mgr->block_table(0).size();
+ const size_t before = mgr->block_table(seq).size();
- // Grow the request to cover the highest logical position. The manager pops
- // free blocks only for the boundaries actually crossed - that is the on-
- // demand behavior; an already-covered range adds nothing.
- if (!mgr->allocate(0, (size_t) base + n_tokens)) {
+ // Grow this sequence's request to cover its highest logical position. The
+ // manager pops free blocks only for boundaries actually crossed; if
+ // share_prefix() already reserved these blocks, this is a no-op.
+ if (!mgr->allocate(seq, (size_t) base + n_tokens)) {
return false; // pool exhausted -> caller falls back to the stock path
}
out.reserve(out.size() + n_tokens);
for (uint32_t i = 0; i < n_tokens; ++i) {
- const int64_t s = mgr->slot(0, (int) (base + i));
+ const int64_t s = mgr->slot(seq, (int) (base + i));
out.push_back((uint32_t) s);
}
if (debug()) {
- const size_t after = mgr->block_table(0).size();
+ const size_t after = mgr->block_table(seq).size();
if (after != before) {
fprintf(stderr,
- "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks "
+ "[paged-alloc] cache=%p stream=%d seq=%d grew %zu->%zu blocks "
"(budget=%u; base=%u +%u tok)\n",
- cache, stream, before, after, pool_blocks, base, n_tokens);
+ cache, stream, seq, before, after, pool_blocks, base, n_tokens);
}
}
return true;
}
-void release(const void * cache, int stream) {
- auto it = g_managers.find({cache, stream});
- if (it == g_managers.end()) {
+size_t share_prefix(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens,
+ uint32_t block_size, uint32_t pool_blocks) {
+ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+ const size_t shared_blocks = mgr->place_with_prefix(seq, tokens);
+ const size_t shared_tokens = shared_blocks * (size_t) block_size;
+ if (debug() && shared_blocks > 0) {
+ fprintf(stderr,
+ "[paged-alloc] cache=%p stream=%d seq=%d shares %zu prefix blocks "
+ "(%zu tokens) - prefix NOT recomputed\n",
+ cache, stream, seq, shared_blocks, shared_tokens);
+ }
+ return shared_tokens;
+}
+
+int64_t slot(const void * cache, int stream, int seq, int pos) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ if (!mgr) {
+ return -1;
+ }
+ if ((size_t) (pos / mgr->block_size()) >= mgr->num_blocks(seq)) {
+ return -1;
+ }
+ return mgr->slot(seq, pos);
+}
+
+void commit(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks) {
+ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size);
+ mgr->cache_blocks(seq, mgr->compute_block_hashes(tokens), tokens.size());
+ if (debug()) {
+ fprintf(stderr, "[paged-alloc] cache=%p stream=%d seq=%d committed %zu tokens\n",
+ cache, stream, seq, tokens.size());
+ }
+}
+
+void release(const void * cache, int stream, int seq) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ if (!mgr) {
return;
}
- it->second->free(0);
- g_managers.erase(it);
+ mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
if (debug()) {
- fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream);
+ fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
+ cache, stream, seq, mgr->num_free_blocks());
}
}
@@ -103,4 +146,21 @@ void release_all(const void * cache) {
}
}
+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ if (!mgr) {
+ return -1;
+ }
+ const size_t bi = (size_t) pos / block_size;
+ if (bi >= mgr->num_blocks(seq)) {
+ return -1;
+ }
+ return mgr->block_ref_cnt_at(seq, bi);
+}
+
+size_t num_free(const void * cache, int stream) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ return mgr ? mgr->num_free_blocks() : 0;
+}
+
} // namespace paged_alloc
diff --git a/src/paged-alloc.h b/src/paged-alloc.h
index bf66665..88dedef 100644
--- a/src/paged-alloc.h
+++ b/src/paged-alloc.h
@@ -1,17 +1,27 @@
#pragma once
-// On-demand paged KV block allocation (patch 0004, experimental).
+// On-demand paged KV block allocation + cross-request prefix reuse
+// (patches 0004 + 0007, experimental).
//
-// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the
-// vendored host-side PagedKVManager (patch 0001). Instead of mapping a
-// sequence's logical positions onto a fixed full-pool permutation, blocks are
-// popped from a free pool ON DEMAND as the sequence crosses block boundaries,
-// and returned to the pool on sequence end. This is where the paged memory-
-// capacity benefit begins: a short sequence holds only a few blocks, not the
-// whole reserved window.
+// Backs the paged placement in llama_kv_cache::find_slot with the vendored
+// host-side PagedKVManager (patch 0001). Two responsibilities:
//
-// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this
-// unit (a static registry keyed by kv-cache + stream), so the core kv-cache
-// struct stays untouched - find_slot only gains a gated call.
+// * On-demand allocation (0004): a sequence's logical positions are mapped to
+// physical cells block-by-block, popped from a free pool only as the
+// sequence grows and returned on sequence end.
+//
+// * Cross-request prefix reuse (0007): before a new sequence's suffix is
+// decoded, share_prefix() reuses the cached physical blocks of a matching
+// content prefix (ref_cnt++), so the engine shares the already-computed KV
+// cells and the caller decodes ONLY the divergent suffix - the prefix is not
+// recomputed. commit() publishes a sequence's full blocks into the content
+// cache so later sequences can hit them. Freeing is ref-counted: a shared
+// block returns to the pool only when every sharer has been released.
+//
+// One persistent PagedKVManager per (kv-cache, stream); requests inside it are
+// keyed by the real llama_seq_id, so free(seq) releases exactly one sequence and
+// shared blocks survive at ref>0. All state lives in this unit (a static
+// registry), so the core kv-cache struct stays untouched - find_slot gains only
+// gated calls. Gated behind env LLAMA_KV_PAGED; a no-op when unset.
#include <cstdint>
#include <vector>
@@ -21,19 +31,42 @@ namespace paged_alloc {
// true iff env LLAMA_KV_PAGED is set (evaluated once).
bool active();
-// Place n_tokens logical positions [base, base+n_tokens) of one stream on
-// demand, appending their physical cell indices to `out`. pool_blocks =
-// cells.size()/block_size is this stream's block budget. Returns false (leaving
+// Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
+// on demand, appending their physical cell indices to `out`. pool_blocks =
+// cells.size()/block_size is the stream's block budget. Returns false (leaving
// `out` unchanged) on pool exhaustion, so the caller falls back to the stock
// allocator. The caller still validates each returned cell is empty.
-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens,
+bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens,
uint32_t block_size, uint32_t pool_blocks,
std::vector<uint32_t> & out);
-// Return a stream's blocks to the pool (sequence end).
-void release(const void * cache, int stream);
+// [0007] Reuse the longest cached content prefix of `tokens` for (cache,stream,
+// seq): splice the shared physical blocks into seq (ref_cnt++) and reserve fresh
+// blocks for the divergent suffix. Returns the number of shared PREFIX TOKENS
+// (block-aligned); the caller marks those cells for seq and decodes only the
+// suffix. 0 if nothing matched or on pool exhaustion (sequence rolled back).
+size_t share_prefix(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens,
+ uint32_t block_size, uint32_t pool_blocks);
+
+// [0007] Physical cell backing logical position `pos` of (cache,stream,seq), or
+// -1 if no manager exists or `pos` is past seq's block table. Maps a shared prefix position to its cell.
+int64_t slot(const void * cache, int stream, int seq, int pos);
-// Return every stream's blocks for a kv-cache (clear() / teardown).
+// [0007] Publish seq's full (block-aligned) blocks into the content cache so a
+// later share_prefix() can reuse them. Call after the sequence's KV is computed.
+void commit(const void * cache, int stream, int seq,
+ const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
+
+// Return one sequence's blocks to the pool (ref-counted; sequence end).
+void release(const void * cache, int stream, int seq);
+
+// Drop every manager for a kv-cache (clear() / teardown).
void release_all(const void * cache);
+// Introspection for the prefix-share gate (debug/tests). ref_cnt_at returns the
+// ref count of the block backing logical position `pos`, or -1 if unknown.
+int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
+size_t num_free(const void * cache, int stream);
+
} // namespace paged_alloc
diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
new file mode 100644
index 0000000..8573cd2
--- /dev/null
+++ b/src/paged-prefix-api.cpp
@@ -0,0 +1,48 @@
+#include "paged-prefix-api.h"
+#include "paged-alloc.h"
+#include "llama-kv-cache.h"
+
+#include <vector>
+
+namespace paged_prefix_api {
+
+static llama_kv_cache * kv_of(llama_context * ctx) {
+ // The driver targets a plain unified KV-cache model; dynamic_cast yields null
+ // for wrapped caches (iSWA / hybrid), where cross-request cell sharing does
+ // not apply, so the shim degrades to a safe no-op.
+ return dynamic_cast<llama_kv_cache *>(llama_get_memory(ctx));
+}
+
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv || n <= 0) {
+ return 0;
+ }
+ return kv->paged_prefix_share(seq, std::vector<llama_token>(tokens, tokens + n));
+}
+
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv || n <= 0) {
+ return;
+ }
+ kv->paged_prefix_commit(seq, std::vector<llama_token>(tokens, tokens + n));
+}
+
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv) {
+ return -1;
+ }
+ return paged_alloc::ref_cnt_at((const void *) kv, /*stream=*/0, (int) seq, pos, /*block_size=*/16);
+}
+
+long num_free(llama_context * ctx) {
+ llama_kv_cache * kv = kv_of(ctx);
+ if (!kv) {
+ return 0;
+ }
+ return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
+}
+
+} // namespace paged_prefix_api
diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
new file mode 100644
index 0000000..78a3864
--- /dev/null
+++ b/src/paged-prefix-api.h
@@ -0,0 +1,27 @@
+#pragma once
+// Thin test/diagnostic shim over the paged cross-request prefix engine seam
+// (patch 0007). Lets a driver that only includes the public llama.h reach the
+// gated llama_kv_cache::paged_prefix_* methods and the paged-alloc introspection
+// without pulling in the internal kv-cache headers. Unless env LLAMA_KV_PAGED is
+// set, share()/num_free() return 0, ref_at() returns -1, and commit() does nothing. Experimental; not a stable API.
+
+#include "llama.h"
+
+namespace paged_prefix_api {
+
+// Reuse the longest cached content prefix of [tokens, tokens+n) for `seq` and
+// return the number of shared prefix tokens (the caller decodes only the
+// suffix). 0 if nothing was shared.
+int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+
+// Publish `seq`'s full blocks into the content cache (call after its KV is computed).
+void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+
+// Ref count of the paged block backing logical position `pos` of `seq` (unified
+// stream 0), or -1 if unknown.
+int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
+
+// Number of free blocks in the unified stream-0 pool, or 0 if no manager.
+long num_free(llama_context * ctx);
+
+} // namespace paged_prefix_api
--
2.43.0


@@ -1,130 +0,0 @@
From 240758ef7e144619c750aaf1d3339051ecc29098 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 17:02:22 +0200
Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED)
- patch 0008
Wire the paged cross-request prefix recompute-skip (patch 0007's engine seam,
paged_prefix_api::share/commit) into the llama-server continuous-batching loop
(update_slots) so CONCURRENT requests that share a long prefix physically reuse
one committed copy of the prefix blocks and prefill only their divergent suffix.

Patch 0007 proved the engine seam correct via a standalone driver, but the server
never called it: two concurrent shared-prefix requests each recomputed the full
prefix. The server's native prompt cache only reuses a slot's OWN prior prompt
(longest-common-prefix vs slot.prompt.tokens) - it does not share across distinct
concurrent slots. 0008 adds that cross-slot share.

Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical):

* In update_slots prompt-processing, after the native n_past is computed and
  only for a FRESH slot (n_past < one block, i.e. the native cache did not
  already cover the prefix), call paged_prefix_api::share() to splice the
  longest committed cross-request prefix into this sequence (ref_cnt++ on the
  shared physical blocks) and advance n_past past it, so the batch fill computes
  ONLY the suffix. The slot's own divergent tail cells are removed first so the
  shared cells own [n_past, kshare) without colliding (the native path removes
  these later anyway). The n_past < block gate guarantees any block-aligned
  share the engine returns is strictly larger than n_past and therefore always
  adopted, so the engine's reservation always matches the suffix-only batch and
  never leaves stale blocks (which otherwise fragment the paged pool).
* When a slot finishes prefill (SLOT_STATE_DONE_PROMPT -> GENERATING, the prefix
  KV just computed), call paged_prefix_api::commit() to publish its prefix so
  concurrent/later sharers can reuse it.
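
The gate argument in the first bullet can be checked in isolation. The sketch
below is illustrative only: share_gate_open() and adopt_share() are hypothetical
helpers that do not exist in the patch, and the block size of 16 is assumed from
the constant hard-coded in the patch. It shows that, for n_past < one block, any
non-zero block-aligned share exceeds n_past and is therefore adopted.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: block size of 16 tokens, matching the constant hard-coded in the patch.
constexpr int32_t kBlockSize = 16;

// Hypothetical helper: the share is only attempted on a "fresh" slot, i.e. one
// whose natively cached prefix is shorter than a single block.
inline bool share_gate_open(int32_t n_past) {
    return n_past < kBlockSize;
}

// Hypothetical helper: adopt the engine's share only if it extends past what
// the native cache already covers. kshare is assumed block-aligned
// (0, 16, 32, ...), which is what the engine is described as returning.
inline int32_t adopt_share(int32_t n_past, int32_t kshare) {
    return kshare > n_past ? kshare : n_past;
}
```

With the gate open, a returned share of 0 leaves n_past unchanged and any other
value replaces it, so there is no case where the engine reserved a shared prefix
that the server then ignores.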
The share() / commit() entry points are forward-declared (defined in libllama,
src/paged-prefix-api.cpp) to avoid pulling internal kv-cache headers into the
server translation unit.

Verified in the server (32B NVFP4, CUDA, --kv-unified): with a live sequence
holding the prefix, K=16/32 concurrent shared-prefix requests prefill only their
~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens;
K=16 23.9s -> 1.5s, K=32 57.9s -> 2.3s), the engine logs "shares ... prefix
blocks - NOT recomputed" with ref_cnt>1, and greedy output stays within the
documented CUDA batch-shape non-determinism band (stock native prompt-caching
shows the same magnitude). Cross-request sharing requires the unified KV cache.
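
The ref_cnt>1 observation follows from ref-counted block ownership: a shared
block returns to the free pool only when its last holder is released. A minimal
sketch of that behavior, assuming a toy pool (ToyPool and all of its methods are
hypothetical and are NOT the vendored PagedKVManager API):

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Toy ref-counted block pool, for illustration only.
class ToyPool {
public:
    explicit ToyPool(int n_blocks) : ref_(n_blocks, 0) {
        for (int b = n_blocks - 1; b >= 0; --b) free_.push_back(b);
    }

    // Give `seq` n fresh blocks; returns false and changes nothing if the pool is short.
    bool allocate(int seq, size_t n) {
        if (free_.size() < n) return false;
        for (size_t i = 0; i < n; ++i) {
            const int b = free_.back();
            free_.pop_back();
            ref_[b] = 1;
            table_[seq].push_back(b);
        }
        return true;
    }

    // Splice the first n blocks of `owner` into `seq`, bumping each ref count.
    void share(int owner, int seq, size_t n) {
        const std::vector<int> src = table_.at(owner);  // copy: owner may equal seq
        for (size_t i = 0; i < n && i < src.size(); ++i) {
            ++ref_[src[i]];
            table_[seq].push_back(src[i]);
        }
    }

    // Drop one sequence; a block is freed only when its ref count reaches 0.
    void release(int seq) {
        auto it = table_.find(seq);
        if (it == table_.end()) return;
        for (int b : it->second) {
            if (--ref_[b] == 0) free_.push_back(b);
        }
        table_.erase(it);
    }

    size_t num_free() const { return free_.size(); }
    int ref_cnt(int block) const { return ref_[block]; }
    const std::vector<int> & blocks(int seq) const { return table_.at(seq); }

private:
    std::vector<int> ref_;                    // ref count per physical block
    std::vector<int> free_;                   // free block ids
    std::map<int, std::vector<int>> table_;   // seq -> block table
};
```

Releasing the original owner while a sharer is still live leaves the shared
blocks allocated; only the final release returns them to the pool.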
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
tools/server/server-context.cpp | 50 +++++++++++++++++++++++++++++++++
1 file changed, 50 insertions(+)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index 39b7eb2..b5f9d37 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -16,6 +16,16 @@
#include "mtmd.h"
#include "mtmd-helper.h"
+// [paged 0008] Cross-request prefix recompute-skip shim. share()/commit() are
+// defined in libllama (src/paged-prefix-api.cpp, patch 0007) and are no-ops
+// unless env LLAMA_KV_PAGED is set. Declared here so the paged cross-slot prefix
+// cache wires into update_slots() without pulling in internal kv-cache headers.
+// Fully gated; stock (paged off) is byte-identical.
+namespace paged_prefix_api {
+ int32_t share (llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+ void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n);
+}
+
#include <algorithm>
#include <cstddef>
#include <cinttypes>
@@ -3335,6 +3345,37 @@ private:
}
}
+ // [paged 0008] Cross-request prefix recompute-skip. The native prompt cache
+ // above only reuses THIS slot's own prior prompt; when the paged KV
+ // engine is active, also reuse a committed CROSS-slot prefix so
+ // concurrent requests sharing a long prefix skip recompute. Gated on
+ // LLAMA_KV_PAGED (paged_kv_share static); stock stays byte-identical.
+ static const bool paged_kv_share = getenv("LLAMA_KV_PAGED") != nullptr;
+ // Only attempt the cross-request share on a FRESH slot (the native
+ // cache above did not already cover the prefix). With n_past < a
+ // block, any block-aligned share the engine returns is strictly
+ // larger than n_past and is therefore always adopted below - so the
+ // engine's full-prompt reservation always matches the suffix-only
+ // submission and never leaves stale blocks (which fragmented the
+ // paged pool and crashed the server under high fan-out otherwise).
+ if (paged_kv_share && n_past < 16 && slot.task->params.cache_prompt && !input_tokens.has_mtmd) {
+ const llama_tokens ptoks = input_tokens.get_text_tokens();
+ // Drop this slot's own cells beyond the natively-cached prefix before
+ // splicing the shared physical prefix in, so the shared cells can own
+ // [n_past, kshare) without colliding (the native path removes exactly
+ // these later; a no-op for a fresh slot).
+ common_context_seq_rm(ctx_tgt, slot.id, n_past, -1);
+ const int32_t kshare = paged_prefix_api::share(ctx_tgt, slot.id, ptoks.data(), (int) ptoks.size());
+ if (kshare > n_past) {
+ slot.prompt.tokens.keep_first(n_past);
+ for (int i = n_past; i < kshare; ++i) {
+ slot.prompt.tokens.push_back(ptoks[i]);
+ }
+ n_past = kshare;
+ SLT_INF(slot, "paged: reusing %d cross-request shared prefix tokens - not recomputed\n", n_past);
+ }
+ }
+
// [TAG_PROMPT_LOGITS]
if (n_past == slot.task->n_tokens() && n_past > 0) {
SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens());
@@ -3741,6 +3782,15 @@ private:
// prompt evaluated for next-token prediction
slot.state = SLOT_STATE_GENERATING;
+ // [paged 0008] Publish this slot's computed prefix so concurrent/later
+ // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode
+ // for [0, n_tokens) has just run, so the prefix KV is computed.
+ static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr;
+ if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) {
+ const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens();
+ paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size());
+ }
+
if (slot.can_speculate()) {
common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens());
}
--
2.43.0


@@ -1,609 +0,0 @@
From 59490d82e4d0d4ad05ffb5ca3cccc668f4a75281 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 20:03:17 +0200
Subject: [PATCH] paged in-kernel decode read (env LLAMA_KV_PAGED) - patch 0009
Replace the per-layer per-step gather (patch 0003: ggml_get_rows of K/V into a
contiguous buffer) with an in-kernel paged read on the decode step. build_attn
passes the UNMODIFIED physical K/V views plus a block table (src[5] of
ggml_flash_attn_ext: an I32 [n_view, n_stream] position-ordered physical-cell
index, padded to FATTN_KQ_STRIDE). The CUDA fattn vec kernel and the CPU
reference map logical KV index j -> physical cell block_table[seq*ne11+j] and
read K_base+cell*nb11 / V_base+cell*nb21 in place, so the get_rows of K and V
(the bulk of the gather) is gone. The mask stays a small compacted [n_view]
causal mask in the same position order; KV_max / parallel_blocks / stream_k
split-K are unchanged. The decode shape is forced onto the vec kernel (the only
one wired for the block table); a nullptr block table => the stock contiguous
read, byte-identical.

Token-POSITION ordering keeps the flash-attn reduction order identical to stock,
so CPU-paged logits == CPU-stock bit-for-bit (verified: 4-stream FA greedy, 64
tokens). On GPU paged(vec) == stock(vec) at batch 1; at batch>1 it stays within
the documented vec-vs-mma non-determinism band. Decode step at batch 32 / 1024
ctx on GB10 (Qwen3-32B NVFP4): paged-gather 1279 ms -> in-kernel 696 ms (-46%),
recovering most of the gather regression and landing within about 8% of stock
(647 ms). Gated behind LLAMA_KV_PAGED; no-op (stock byte-identical) when unset.
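
The logical -> physical indirection described above can be sketched on the host
side as follows. This is a simplified illustration, not the kernel code:
phys_cell() and k_row() are hypothetical names, and k_base is assumed to already
point at the start of the stream's K rows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Resolve logical KV index j of stream `seq` to a physical cell. A null block
// table means the stock contiguous layout, where logical == physical.
inline int64_t phys_cell(const int32_t * block_table, int64_t seq, int64_t ne11, int64_t j) {
    return block_table ? (int64_t) block_table[seq * ne11 + j] : j;
}

// Address of the K row for logical index j: base + cell * nb11, where nb11 is
// the byte stride between consecutive K rows.
inline const char * k_row(const char * k_base, size_t nb11,
                          const int32_t * block_table, int64_t seq, int64_t ne11, int64_t j) {
    return k_base + phys_cell(block_table, seq, ne11, j) * (int64_t) nb11;
}
```

Because the table is ordered by token position, iterating j = 0..n-1 visits the
rows in the same logical order as the contiguous read; only the address
computation changes.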
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/include/ggml.h | 6 ++
ggml/src/ggml-cpu/ops.cpp | 10 ++-
ggml/src/ggml-cuda/fattn-common.cuh | 8 +-
ggml/src/ggml-cuda/fattn-mma-f16.cuh | 4 +-
ggml/src/ggml-cuda/fattn-tile.cuh | 4 +-
ggml/src/ggml-cuda/fattn-vec.cuh | 25 +++++--
ggml/src/ggml-cuda/fattn-wmma-f16.cu | 4 +-
ggml/src/ggml-cuda/fattn.cu | 9 +++
ggml/src/ggml.c | 14 ++++
src/llama-graph.cpp | 23 ++++--
src/llama-graph.h | 3 +-
src/llama-kv-cache.cpp | 31 ++++++++
src/llama-kv-cache.h | 4 +
src/paged-attn.cpp | 107 +++++++++++++++++++++++++++
src/paged-attn.h | 18 +++++
15 files changed, 248 insertions(+), 22 deletions(-)
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index d6807b6..823f5a9 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2427,6 +2427,12 @@ extern "C" {
struct ggml_tensor * a,
struct ggml_tensor * sinks);
+ // [paged] optional block table in src[5]: I32 [n_kv_logical, n_stream]; maps each
+ // logical KV index to the physical cell within K/V. nullptr => stock contiguous read.
+ GGML_API void ggml_flash_attn_ext_set_block_table(
+ struct ggml_tensor * a,
+ struct ggml_tensor * block_table);
+
// TODO: needs to be adapted to ggml_flash_attn_ext
GGML_API struct ggml_tensor * ggml_flash_attn_back(
struct ggml_context * ctx,
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 74611dc..63c07a2 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -8330,6 +8330,8 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
const ggml_tensor * v = dst->src[2];
const ggml_tensor * mask = dst->src[3];
const ggml_tensor * sinks = dst->src[4];
+ const ggml_tensor * block_table = dst->src[5]; // [paged] logical->physical cell map (src[5])
+ const int32_t * bt = block_table ? (const int32_t *) block_table->data : nullptr;
GGML_TENSOR_LOCALS(int64_t, neq, q, ne)
GGML_TENSOR_LOCALS(size_t, nbq, q, nb)
@@ -8449,7 +8451,9 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
float s; // KQ value
- const char * k_data = (const char *) k->data + ( ic*nbk1 + ik2*nbk2 + ik3*nbk3);
+ // [paged] map the logical KV index ic to its physical cell via the block table.
+ const int64_t ic_phys = bt ? (int64_t) bt[ik3*nek1 + ic] : ic;
+ const char * k_data = (const char *) k->data + ( ic_phys*nbk1 + ik2*nbk2 + ik3*nbk3);
kq_vec_dot(DK, &s, 0, k_data, 0, Q_q, 0, 1);
s = s*scale; // scale KQ value
@@ -8465,7 +8469,7 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk(
float ms = 1.0f; // upon new higher max val, scale VKQ and KQ sum with this value
float vs = 1.0f; // post-softmax KQ value, expf(s - M)
- const char * v_data = ((const char *) v->data + (ic*nbv1 + iv2*nbv2 + iv3*nbv3));
+ const char * v_data = ((const char *) v->data + (ic_phys*nbv1 + iv2*nbv2 + iv3*nbv3));
if (v->type == GGML_TYPE_F16) {
if (s > M) {
@@ -9021,7 +9025,7 @@ static void ggml_compute_forward_flash_attn_ext_f16(
const int64_t dr = (nr + nchunk - 1) / nchunk;
static constexpr int64_t Q_TILE_SZ = ggml_fa_tile_config::Q;
- bool use_tiled = !use_ref &&
+ bool use_tiled = !use_ref && dst->src[5] == nullptr && // [paged] one_chunk honors the block table
(q->type == GGML_TYPE_F32 &&
kv_is_f32_or_f16 &&
k->type == v->type &&
diff --git a/ggml/src/ggml-cuda/fattn-common.cuh b/ggml/src/ggml-cuda/fattn-common.cuh
index 8dfa51a..3c6ddd5 100644
--- a/ggml/src/ggml-cuda/fattn-common.cuh
+++ b/ggml/src/ggml-cuda/fattn-common.cuh
@@ -39,7 +39,8 @@ typedef void (* fattn_kernel_t)(
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
- const int32_t nb31, const int32_t nb32, const int64_t nb33);
+ const int32_t nb31, const int32_t nb32, const int64_t nb33,
+ const int * __restrict__ block_table);
typedef float (*vec_dot_KQ_t)(
const char * __restrict__ K_c, const void * __restrict__ Q_v, const int * __restrict__ Q_q8 , const void * __restrict__ Q_ds);
@@ -981,6 +982,8 @@ void launch_fattn(
const ggml_tensor * mask = dst->src[3];
const ggml_tensor * sinks = dst->src[4];
+ const ggml_tensor * block_table = dst->src[5]; // [paged] optional logical->physical map
+ const int * bt_ptr = block_table ? (const int *) block_table->data : nullptr;
ggml_tensor * KQV = dst;
@@ -1217,7 +1220,8 @@ void launch_fattn(
K->ne[0], K->ne[1], K->ne[2], K->ne[3], nb11, nb12, nb13,
nb21, nb22, nb23,
mask ? mask->ne[1] : 0, mask ? mask->ne[2] : 0, mask ? mask->ne[3] : 0,
- mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0
+ mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0,
+ bt_ptr
);
CUDA_CHECK(cudaGetLastError());
diff --git a/ggml/src/ggml-cuda/fattn-mma-f16.cuh b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
index 83478a0..0a92cd6 100644
--- a/ggml/src/ggml-cuda/fattn-mma-f16.cuh
+++ b/ggml/src/ggml-cuda/fattn-mma-f16.cuh
@@ -1723,7 +1723,9 @@ static __global__ void flash_attn_ext_f16(
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
- const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+ const int32_t nb31, const int32_t nb32, const int64_t nb33,
+ const int * __restrict__ block_table) {
+ GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
ggml_cuda_pdl_sync(); // TODO optimize placement
#if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE))
const char * GGML_CUDA_RESTRICT Q = Q_ptr;
diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
index 0a09981..0ff14e6 100644
--- a/ggml/src/ggml-cuda/fattn-tile.cuh
+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
@@ -808,7 +808,9 @@ static __global__ void flash_attn_tile(
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
- const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+ const int32_t nb31, const int32_t nb32, const int64_t nb33,
+ const int * __restrict__ block_table) {
+ GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
#ifdef FLASH_ATTN_AVAILABLE
const char * GGML_CUDA_RESTRICT Q = Q_ptr;
const char * GGML_CUDA_RESTRICT K = K_ptr;
diff --git a/ggml/src/ggml-cuda/fattn-vec.cuh b/ggml/src/ggml-cuda/fattn-vec.cuh
index 69dd936..a09e2fb 100644
--- a/ggml/src/ggml-cuda/fattn-vec.cuh
+++ b/ggml/src/ggml-cuda/fattn-vec.cuh
@@ -39,7 +39,8 @@ static __global__ void flash_attn_ext_vec(
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
- const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+ const int32_t nb31, const int32_t nb32, const int64_t nb33,
+ const int * __restrict__ block_table) {
ggml_cuda_pdl_lc();
#ifdef FLASH_ATTN_AVAILABLE
const char * GGML_CUDA_RESTRICT Q = Q_ptr;
@@ -61,7 +62,7 @@ static __global__ void flash_attn_ext_vec(
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
- nb31, nb32, nb33);
+ nb31, nb32, nb33, block_table);
NO_DEVICE_CODE;
return;
}
@@ -110,6 +111,14 @@ static __global__ void flash_attn_ext_vec(
K += nb13*sequence + nb12*(head / gqa_ratio);
V += nb23*sequence + nb22*(head / gqa_ratio);
+ // [paged] in-kernel block-table read: logical KV index j -> physical cell
+ // block_table[sequence*ne11 + j]; read K0 + cell*nb11 / V0 + cell*nb21. The
+ // mask/KV_max stay logical (the table is in token-position order). nullptr =>
+ // the stock contiguous read below.
+ const char * GGML_CUDA_RESTRICT K0 = K;
+ const char * GGML_CUDA_RESTRICT V0 = V;
+ const int * GGML_CUDA_RESTRICT bt = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
+
const half * maskh = (const half *) (mask + nb33*(sequence % ne33) + nb31*ic0);
const float slope = get_alibi_slope(max_bias, head, n_head_log2, m0, m1);
@@ -267,10 +276,11 @@ static __global__ void flash_attn_ext_vec(
#pragma unroll
for (int i_KQ_0 = 0; i_KQ_0 < nthreads_KQ; ++i_KQ_0) {
const int i_KQ = threadIdx.y*WARP_SIZE + (nthreads_KQ == WARP_SIZE ? 0 : (threadIdx.x & ~(nthreads_KQ-1))) + i_KQ_0;
+ const char * GGML_CUDA_RESTRICT K_blk = bt ? (K0 + (int64_t) bt[k_VKQ_0 + i_KQ]*nb11) : (K + i_KQ*nb11);
#pragma unroll
for (int j = 0; j < ncols; ++j) {
- float sum = vec_dot_KQ(K + i_KQ*nb11, Q_reg[j], Q_i32[j], Q_ds[j]);
+ float sum = vec_dot_KQ(K_blk, Q_reg[j], Q_i32[j], Q_ds[j]);
sum = warp_reduce_sum<nthreads_KQ>(sum);
if (use_logit_softcap) {
@@ -324,6 +334,7 @@ static __global__ void flash_attn_ext_vec(
#pragma unroll
for (int k0 = 0; k0 < WARP_SIZE; k0 += V_cols_per_iter) {
const int k = threadIdx.y*WARP_SIZE + k0 + (nthreads_V == WARP_SIZE ? 0 : threadIdx.x / nthreads_V);
+ const char * GGML_CUDA_RESTRICT V_blk = bt ? (V0 + (int64_t) bt[k_VKQ_0 + k]*nb21) : (V + k*nb21);
#ifdef V_DOT2_F32_F16_AVAILABLE
half2 KQ_k[ncols];
@@ -336,14 +347,14 @@ static __global__ void flash_attn_ext_vec(
half2 tmp[V_rows_per_thread/2];
if constexpr (type_V == GGML_TYPE_BF16) {
float2 tmp_f[V_rows_per_thread/2];
- dequantize_V(V + k*nb21, tmp_f,
+ dequantize_V(V_blk, tmp_f,
2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
#pragma unroll
for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
tmp[i_VKQ_1] = __float22half2_rn(tmp_f[i_VKQ_1]);
}
} else {
- dequantize_V(V + k*nb21, tmp,
+ dequantize_V(V_blk, tmp,
2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
}
#pragma unroll
@@ -363,7 +374,7 @@ static __global__ void flash_attn_ext_vec(
#pragma unroll
for (int i_VKQ_0 = 0; i_VKQ_0 < D/2; i_VKQ_0 += nthreads_V*V_rows_per_thread/2) {
float2 tmp[V_rows_per_thread/2];
- dequantize_V(V + k*nb21, tmp,
+ dequantize_V(V_blk, tmp,
2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread);
#pragma unroll
for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) {
@@ -522,7 +533,7 @@ static __global__ void flash_attn_ext_vec(
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
- nb31, nb32, nb33);
+ nb31, nb32, nb33, block_table);
NO_DEVICE_CODE;
#endif // FLASH_ATTN_AVAILABLE
}
diff --git a/ggml/src/ggml-cuda/fattn-wmma-f16.cu b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
index 6850716..5357849 100644
--- a/ggml/src/ggml-cuda/fattn-wmma-f16.cu
+++ b/ggml/src/ggml-cuda/fattn-wmma-f16.cu
@@ -44,7 +44,9 @@ static __global__ void flash_attn_ext_f16(
const int32_t nb11, const int32_t nb12, const int64_t nb13,
const int32_t nb21, const int32_t nb22, const int64_t nb23,
const int32_t ne31, const int32_t ne32, const int32_t ne33,
- const int32_t nb31, const int32_t nb32, const int64_t nb33) {
+ const int32_t nb31, const int32_t nb32, const int64_t nb33,
+ const int * __restrict__ block_table) {
+ GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
#if defined(FLASH_ATTN_AVAILABLE) && (defined(GGML_HIP_ROCWMMA_FATTN) && defined(GGML_USE_WMMA_FATTN))
const char * GGML_CUDA_RESTRICT Q = Q_ptr;
const char * GGML_CUDA_RESTRICT K = K_ptr;
diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
index d6c501b..e3771ee 100644
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -574,6 +574,15 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
ggml_cuda_set_device(ctx.device);
+
+ // [paged] the block table (src[5]) is only honored by the vec kernel's
+ // in-kernel read; force it. build_attn only sets it for a vec-supported
+ // 1-token-per-stream decode shape.
+ if (dst->src[5] != nullptr) {
+ ggml_cuda_flash_attn_ext_vec(ctx, dst);
+ return;
+ }
+
switch (ggml_cuda_get_best_fattn_kernel(ggml_cuda_get_device(), dst)) {
case BEST_FATTN_KERNEL_NONE:
GGML_ABORT("fatal error");
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index b43016c..adbe52b 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -5442,6 +5442,20 @@ void ggml_flash_attn_ext_add_sinks(
a->src[4] = sinks;
}
+void ggml_flash_attn_ext_set_block_table(
+ struct ggml_tensor * a,
+ struct ggml_tensor * block_table) {
+ if (!block_table) {
+ a->src[5] = NULL;
+ return;
+ }
+
+ GGML_ASSERT(a->op == GGML_OP_FLASH_ATTN_EXT);
+ GGML_ASSERT(block_table->type == GGML_TYPE_I32);
+
+ a->src[5] = block_table;
+}
+
// ggml_flash_attn_back
struct ggml_tensor * ggml_flash_attn_back(
diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp
index b59d2a5..abdb48d 100644
--- a/src/llama-graph.cpp
+++ b/src/llama-graph.cpp
@@ -2074,7 +2074,8 @@ ggml_tensor * llm_graph_context::build_attn_mha(
ggml_tensor * sinks,
ggml_tensor * v_mla,
float kq_scale,
- int il) const {
+ int il,
+ ggml_tensor * block_table) const {
const bool v_trans = v->nb[1] > v->nb[2];
// split the batch into streams if needed
@@ -2109,6 +2110,9 @@ ggml_tensor * llm_graph_context::build_attn_mha(
hparams.attn_soft_cap ? hparams.f_attn_logit_softcapping : 0.0f);
cb(cur, LLAMA_TENSOR_NAME_FATTN, il);
+ if (block_table) {
+ ggml_flash_attn_ext_set_block_table(cur, block_table);
+ }
ggml_flash_attn_ext_add_sinks(cur, sinks);
ggml_flash_attn_ext_set_prec (cur, GGML_PREC_F32);
@@ -2358,12 +2362,19 @@ ggml_tensor * llm_graph_context::build_attn(
ggml_tensor * k = mctx_cur->get_k(ctx0, il);
ggml_tensor * v = mctx_cur->get_v(ctx0, il);
- // [paged 0003] gather K, V and the mask to the sequence's used cells only
- // (no-op unless env LLAMA_KV_PAGED is set).
- ggml_tensor * kq_mask_g = kq_mask;
- paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+ // [paged] decode read: when paging is active and this is a 1-token-per-stream
+ // decode step, present K/V as n_gather views + a block table so the fattn
+ // kernel reads the sequence's cells in-kernel (no get_rows of K/V). Else
+ // fall back to the gather-read (prefill, transposed V, or env off). All a
+ // no-op unless env LLAMA_KV_PAGED is set => stock byte-identical.
+ ggml_tensor * kq_mask_g = kq_mask;
+ ggml_tensor * block_table = nullptr;
+ const bool is_decode = (q_cur->ne[2] == k->ne[3]); // 1 query token per stream
+ if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, &k, &v, &kq_mask_g, &block_table))) {
+ paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g);
+ }
- ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il);
+ ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il, block_table);
cb(cur, "kqv_out", il);
if (inp->self_v_rot) {
diff --git a/src/llama-graph.h b/src/llama-graph.h
index 5e8a658..c95ae49 100644
--- a/src/llama-graph.h
+++ b/src/llama-graph.h
@@ -969,7 +969,8 @@ struct llm_graph_context {
ggml_tensor * sinks, // [n_head_q]
ggml_tensor * v_mla, // [n_embd_head_v_mla, n_embd_head_v, n_head_v]
float kq_scale,
- int il) const;
+ int il,
+ ggml_tensor * block_table = nullptr) const; // [paged] optional src[5] block table
llm_graph_input_attn_no_cache * build_attn_inp_no_cache() const;
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 7510ff9..0351f86 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -1474,6 +1474,33 @@ void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_in
}
}
+void llama_kv_cache::get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const {
+ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+ for (uint32_t j = 0; j < ns; ++j) {
+ const auto & cells = v_cells[sinfo.s0 + j];
+ const uint32_t n = std::min<uint32_t>(n_kv, cells.size());
+ std::vector<std::pair<llama_pos, int32_t>> pc;
+ pc.reserve(n);
+ int32_t pad = -1;
+ for (uint32_t i = 0; i < n; ++i) {
+ if (!cells.is_empty(i)) {
+ pc.emplace_back(cells.pos_get(i), (int32_t) i);
+ } else if (pad < 0) {
+ pad = (int32_t) i;
+ }
+ }
+ std::sort(pc.begin(), pc.end());
+ int32_t * col = dst + (size_t) j * n_blk;
+ for (size_t k = 0; k < pc.size(); ++k) {
+ col[k] = pc[k].second;
+ }
+ const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
+ for (uint32_t k = (uint32_t) pc.size(); k < n_blk; ++k) {
+ col[k] = padv;
+ }
+ }
+}
+
ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const {
GGML_UNUSED(sinfo);
@@ -2773,6 +2800,10 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]);
}
+void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const {
+ kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]);
+}
+
ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]);
}
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index f374ac6..e9980b6 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -176,6 +176,9 @@ public:
// gather-read. get_n_gather returns the max count across streams.
uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const;
void get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const;
+ // [paged inc1] block table [n_blk, n_stream] (position order, padded to n_blk
+ // per column with a masked empty cell) for the in-kernel paged read.
+ void get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const;
// store k_cur and v_cur in the cache based on the provided head location
ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const;
@@ -386,6 +389,7 @@ public:
// current ubatch's stream).
uint32_t get_n_gather() const;
void get_gather_idxs(int32_t * dst) const;
+ void get_block_table(int32_t * dst, uint32_t n_blk) const;
// store k_cur and v_cur in the cache based on the provided head location
// note: the heads in k_cur and v_cur should be laid out contiguously in memory
diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
index ade75e8..8eebeaa 100644
--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
@@ -43,6 +43,25 @@ public:
ggml_tensor * idxs;
};
+// Block table filler for the in-kernel paged read: fills an I32 [n_blk, n_stream]
+// tensor with each stream's position-ordered cells, padded to n_blk (per column)
+// with a masked empty cell, by delegating to the kv-cache context.
+class input_block_table : public llm_graph_input_i {
+public:
+ input_block_table(const llama_kv_cache_context * mctx, ggml_tensor * idxs, uint32_t n_blk)
+ : mctx(mctx), idxs(idxs), n_blk(n_blk) {}
+
+ void set_input(const llama_ubatch * ubatch) override {
+ GGML_UNUSED(ubatch);
+ GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
+ mctx->get_block_table((int32_t *) idxs->data, n_blk);
+ }
+
+ const llama_kv_cache_context * mctx;
+ ggml_tensor * idxs;
+ uint32_t n_blk;
+};
+
} // namespace
void gather(ggml_context * ctx0,
@@ -125,4 +144,92 @@ void gather(ggml_context * ctx0,
}
}
+bool in_kernel_decode(ggml_context * ctx0,
+ llm_graph_result * res,
+ const llama_kv_cache_context * mctx,
+ ggml_tensor ** k,
+ ggml_tensor ** v,
+ ggml_tensor ** kq_mask,
+ ggml_tensor ** block_table) {
+ if (!active()) {
+ return false;
+ }
+ // Bench escape hatch: LLAMA_KV_PAGED_GATHER=1 forces the old gather-read decode
+ // path (for a same-build BEFORE/AFTER decode-step comparison). Dev-only.
+ static const bool force_gather = (std::getenv("LLAMA_KV_PAGED_GATHER") != nullptr);
+ if (force_gather) {
+ return false;
+ }
+
+ ggml_tensor * K = *k;
+ ggml_tensor * V = *v;
+ ggml_tensor * M = *kq_mask;
+
+ const int64_t n_stream = K->ne[3];
+ GGML_ASSERT(M->ne[3] == n_stream);
+
+ const int64_t n_gather = (int64_t) mctx->get_n_gather();
+ if (n_gather <= 0) {
+ // Worst-case reserve / nothing placed yet: keep the dense [0,n_kv) read.
+ return false;
+ }
+
+ // The in-kernel read addresses V along its d-major (non-transposed) axis. If
+ // the cache stores V transposed, fall back to gather() (which normalizes it).
+ if (V->nb[1] > V->nb[2]) {
+ return false;
+ }
+
+ if (debug()) {
+ static int64_t once = 0;
+ if (once++ < 2) {
+ fprintf(stderr, "[paged-attn] in-kernel decode n_stream=%lld n_kv=%lld n_gather=%lld\n",
+ (long long) n_stream, (long long) K->ne[2], (long long) n_gather);
+ }
+ }
+
+ // Block table [n_gather, n_stream]: column s holds stream s's non-empty cells
+ // in token-POSITION order (identical to the gather index, so the reduction
+ // order matches stock bit-for-bit), padded with a masked empty cell. Filled
+ // at set_input from the kv-cache (get_block_table), exactly like the gather.
+ // Pad the logical length to FATTN_KQ_STRIDE (256) so the CUDA fattn vec kernel
+ // reads fixed 128-wide KV blocks without overrun and the KV_max mask scan
+ // engages; padded entries point at a masked empty cell (0 contribution). Stays
+ // <= n_kv since n_kv is itself padded to 256 and n_gather <= n_kv.
+ int64_t n_view = GGML_PAD(n_gather, 256);
+ if (n_view > K->ne[2]) {
+ n_view = K->ne[2];
+ }
+
+ ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
+ ggml_set_input(idx);
+ res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
+
+ // Present K and V as [d, h, n_view, ns] VIEWS of the full physical window:
+ // identical per-cell (nb1,nb2) and per-stream (nb3) strides, only the cell
+ // dim shrinks to n_view. NOT materialized - the kernel reads in place.
+ *k = ggml_view_4d(ctx0, K, K->ne[0], K->ne[1], n_view, n_stream,
+ K->nb[1], K->nb[2], K->nb[3], 0);
+ *v = ggml_view_4d(ctx0, V, V->ne[0], V->ne[1], n_view, n_stream,
+ V->nb[1], V->nb[2], V->nb[3], 0);
+
+ // Compact the mask to [n_gather, n_tps, 1, ns] in the same position order so
+ // the kernel's logical mask index aligns with the block table. Cheap: the
+ // mask is ~(d*h) smaller than K/V, which is why only its get_rows remains.
+ {
+ ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream);
+ m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
+ m = ggml_get_rows(ctx0, m, idx);
+ m = ggml_cont(ctx0, ggml_transpose(ctx0, m));
+ m = ggml_reshape_4d(ctx0, m, n_view, M->ne[1], 1, n_stream);
+ if (M->type != m->type) {
+ m = ggml_cast(ctx0, m, M->type);
+ }
+ *kq_mask = m;
+ }
+
+ *block_table = idx;
+ return true;
+}
+
} // namespace paged_attn
diff --git a/src/paged-attn.h b/src/paged-attn.h
index c5b7bd7..23e2184 100644
--- a/src/paged-attn.h
+++ b/src/paged-attn.h
@@ -37,4 +37,22 @@ void gather(ggml_context * ctx0,
ggml_tensor ** v,
ggml_tensor ** kq_mask);
+// [paged inc1] In-kernel paged decode read. Instead of materializing the
+// sequence's cells (gather()), present K and V as n_gather-length VIEWS of the
+// full physical window and return the position-ordered physical-cell index list
+// as a block table (src[5] of ggml_flash_attn_ext). The fattn kernel/op then
+// reads K_base + block_table[j]*nb in-kernel, removing the get_rows of K and V
+// (the bulk of the gather). On return (true): *k,*v point at the views, *kq_mask
+// at the compacted mask, *block_table at the I32 [n_gather, n_stream] index.
+// Returns false (leaving *k,*v,*kq_mask untouched) when the in-kernel path does
+// not apply - env off, nothing placed, or a transposed V cache - so the caller
+// keeps the dense gather()/contiguous read.
+bool in_kernel_decode(ggml_context * ctx0,
+ llm_graph_result * res,
+ const llama_kv_cache_context * mctx,
+ ggml_tensor ** k,
+ ggml_tensor ** v,
+ ggml_tensor ** kq_mask,
+ ggml_tensor ** block_table);
+
} // namespace paged_attn
--
2.43.0

@@ -1,269 +0,0 @@
From 9ac56933abd5de4a1f349c811c2d74aab09f7ab1 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Mon, 22 Jun 2026 22:36:09 +0200
Subject: [PATCH] paged tile in-kernel decode read + dispatch guard (env
LLAMA_KV_PAGED) - patch 0010
Increment 2 (robustness, ~0 headline ms): make the paged in-kernel decode read
safe against silent mis-routing, and plumb the same read into the tile kernel
for the increment-3 GQA head-group work.

fattn-tile.cuh: graft the patch-0009 phys(j) block-table read (mirror of
fattn-vec.cuh). Both flash_attn_tile_load_tile overloads, flash_attn_tile_iter_KQ
(K) and flash_attn_tile_iter (V) take an optional per-sequence block table; a row
i is read from base + block_table[row_base + i]*stride instead of base + i*stride.
The table defaults to nullptr (default args + a null bt_seq when src[5] is unset),
so every existing non-paged caller is byte-identical to stock. The mask / KV_max
stay logical (token-position order), as in vec.
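
The row remap described above can be sketched on the host side. This is a
minimal illustration, not the CUDA code from the patch: paged_row and its
float-based buffer are hypothetical names chosen for the example, and only the
addressing rule (base + block_table[row_base + i]*stride, with nullptr meaning
the stock contiguous read) is taken from the text.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of the paged row lookup (not the CUDA kernel).
//   base        : start of a K or V buffer holding `stride` floats per cell
//   block_table : logical row -> physical cell, or nullptr for the stock layout
//   row_base    : logical index of the first row of the current tile
//   i           : row offset inside the tile
inline const float * paged_row(const float * base, const int32_t * block_table,
                               int row_base, int i, int64_t stride) {
    assert(base != nullptr && stride > 0);
    if (block_table == nullptr) {
        // Stock contiguous read: the caller has already advanced `base` to the tile.
        return base + (int64_t) i * stride;
    }
    // Paged read: `base` stays un-offset; the table supplies the physical cell.
    return base + (int64_t) block_table[row_base + i] * stride;
}
```

Note the asymmetry in how the caller prepares base: in the paged case it passes
the un-offset buffer start plus row_base, while in the stock case it folds the
tile offset into base and the table arguments are ignored.
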

fattn.cu: DISPATCH GUARD. When the block table (src[5]) is present, route ONLY to
the vec or tile kernel and never fall through to the best-kernel switch. The
mma/wmma kernels GGML_UNUSED the table and would silently read the wrong
(contiguous physical) cells; the guard makes that unreachable. The vec dispatcher
GGML_ABORTs for an unsupported D/type rather than mis-reading. Default route is vec
(the inc-1 byte-validated path). LLAMA_KV_PAGED_DISPATCH_LOG=1 prints the routed
kernel once.

Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B, build-cpu) PASS. GPU
vec-paged == stock at -s 1 PASS. Dispatch confirmed VEC for the real decode shape:
Qwen3-0.6B Q ne=[128,1,16,1] and Qwen3-32B NVFP4 Q ne=[128,1,64,N] both route to
vec, matching the nsys profile (flash_attn_ext_vec).

The tile graft is plumbed for increment-3 GQA head-group reuse but is EXPERIMENTAL
and NOT yet byte-validated (LLAMA_KV_PAGED_TILE=1). A tile-vs-tile gate shows
tile-paged diverging from tile-stock at the first cross-tile KV depth: the
GQA-grouped (ncols2>1) tile path reads a full nbatch_fa-row tile with
oob_check=false while the compacted paged mask is not padded to cover the tile, so
past-end rows leak. vec bounds its KV walk by KV_max and is unaffected. Bounding
the tile path is increment-3 work; the default vec route and all stock paths are
untouched.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/fattn-tile.cuh | 45 ++++++++++++++++++++-----------
ggml/src/ggml-cuda/fattn.cu | 38 +++++++++++++++++++++++---
2 files changed, 64 insertions(+), 19 deletions(-)
diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh
index 0ff14e6..bb84d61 100644
--- a/ggml/src/ggml-cuda/fattn-tile.cuh
+++ b/ggml/src/ggml-cuda/fattn-tile.cuh
@@ -373,7 +373,8 @@ static constexpr __device__ int ggml_cuda_fattn_tile_get_nbatch_K(const int DKQ,
// TODO: deduplicate with mma-f16
template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
static __device__ __forceinline__ void flash_attn_tile_load_tile(
- const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
+ const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
+ const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
@@ -402,9 +403,11 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
const int j = j0*cpy_ne + (stride_j == warp_size ? threadIdx.x : threadIdx.x % stride_j)*cpy_ne;
const __align__(16) half2 zero[cpy_ne] = {{0.0f, 0.0f}};
+ // [paged] remap the row through the block table (nullptr => stock contiguous read).
+ const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
ggml_cuda_memcpy_1<cpy_nb>(
tile_KV + i*(J/2 + J_padding) + j,
- !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
+ !oob_check || i < i_sup ? KV_row + j : zero);
}
}
}
@@ -423,7 +426,8 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
template<int warp_size, int nwarps, int I, int J, int J_padding, bool oob_check>
static __device__ __forceinline__ void flash_attn_tile_load_tile(
- const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup) {
+ const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup,
+ const int * const __restrict__ block_table = nullptr, const int row_base = 0) {
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
@@ -453,8 +457,10 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile(
const half2 zero[cpy_ne/2] = {{0.0f, 0.0f}};
__align__(16) half2 tmp_h2[cpy_ne/2];
+ // [paged] remap the row through the block table (nullptr => stock contiguous read).
+ const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV;
ggml_cuda_memcpy_1<sizeof(tmp_h2)>(
- tmp_h2, !oob_check || i < i_sup ? KV + i*stride_KV + j : zero);
+ tmp_h2, !oob_check || i < i_sup ? KV_row + j : zero);
__align__(16) float2 tmp_f2[cpy_ne/2];
#pragma unroll
@@ -487,6 +493,7 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
const int k_VKQ_0,
const int k_VKQ_sup,
const int k_KQ_0,
+ const int * const __restrict__ block_table,
float * KQ_acc) {
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
@@ -495,8 +502,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ(
constexpr int cpw = ncols > nwarps ? ncols/nwarps : 1; // Q columns per warp
constexpr int np = nwarps > ncols ? nwarps/ncols : 1; // number of parallel warps per Q column
+ // [paged] when block_table is set K_h2 is the un-offset base; the table supplies the row.
+ const half2 * const K_base = block_table ? (K_h2 + k_KQ_0/2) : (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2);
flash_attn_tile_load_tile<warp_size, nwarps, nbatch_fa, nbatch_K, cpy_ne, oob_check>
- (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2, KV_tmp, stride_K2, k_VKQ_sup);
+ (K_base, KV_tmp, stride_K2, k_VKQ_sup, block_table, k_VKQ_0);
__syncthreads();
#ifdef FAST_FP16_AVAILABLE
@@ -572,7 +581,8 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
T_acc * const VKQ,
const int k_VKQ_0,
const int k_VKQ_max,
- const int col_Q_0) {
+ const int col_Q_0,
+ const int * const __restrict__ block_table) {
constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes();
constexpr int cpy_ne = cpy_nb / 4;
@@ -605,12 +615,12 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
#pragma unroll
for (int k_KQ_0 = 0; k_KQ_0 < DKQ - nbatch_K_last; k_KQ_0 += nbatch_K) {
flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>(
- Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
+ Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
}
if (nbatch_K_last > 0) {
constexpr int k_KQ_0 = DKQ - nbatch_K_last;
flash_attn_tile_iter_KQ<warp_size, nwarps, ncols1, ncols2, DKQ, nbatch_fa, nbatch_K_last, use_logit_softcap, oob_check>(
- Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc);
+ Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc);
}
// Apply logit softcap + mask, update KQ_max:
@@ -715,8 +725,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter(
static_assert(nbatch_V % np == 0, "bad nbatch_V");
#pragma unroll
for (int k0 = 0; k0 < nbatch_fa; k0 += nbatch_V) {
+ // [paged] when block_table is set V_h2 is the un-offset base; the table supplies the row.
+ const half2 * const V_base = block_table ? V_h2 : (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2);
flash_attn_tile_load_tile<warp_size, nwarps, nbatch_V, DV, 0, oob_check>
- (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2, KV_tmp, stride_V2, k_VKQ_sup - k0);
+ (V_base, KV_tmp, stride_V2, k_VKQ_sup - k0, block_table, k_VKQ_0 + k0);
__syncthreads();
#ifdef FAST_FP16_AVAILABLE
@@ -810,7 +822,6 @@ static __global__ void flash_attn_tile(
const int32_t ne31, const int32_t ne32, const int32_t ne33,
const int32_t nb31, const int32_t nb32, const int64_t nb33,
const int * __restrict__ block_table) {
- GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel
#ifdef FLASH_ATTN_AVAILABLE
const char * GGML_CUDA_RESTRICT Q = Q_ptr;
const char * GGML_CUDA_RESTRICT K = K_ptr;
@@ -837,7 +848,7 @@ static __global__ void flash_attn_tile(
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
- nb31, nb32, nb33);
+ nb31, nb32, nb33, block_table);
NO_DEVICE_CODE;
return;
}
@@ -861,6 +872,10 @@ static __global__ void flash_attn_tile(
const half2 * K_h2 = (const half2 *) (K + nb13*sequence + nb12*(head0 / gqa_ratio));
const half2 * V_h2 = (const half2 *) (V + nb23*sequence + nb22*(head0 / gqa_ratio)); // K and V have same shape
+ // [paged] per-sequence logical->physical block table in token-position order
+ // (mask/KV_max stay logical); nullptr => the stock contiguous read.
+ const int * const __restrict__ bt_seq = block_table ? block_table + (size_t) sequence*ne11 : nullptr;
+
const half * maskh = mask ? (const half *) (mask + nb33*(sequence % ne33)) : nullptr;
const int stride_K2 = nb11 / sizeof(half2);
@@ -963,14 +978,14 @@ static __global__ void flash_attn_tile(
constexpr bool oob_check = false;
flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
(Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
- stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+ stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
k_VKQ_0 += gridDim.y*nbatch_fa;
}
if (k_VKQ_0 < k_VKQ_max) {
constexpr bool oob_check = true;
flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
(Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
- stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+ stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
}
} else {
// Branch without out-of-bounds checks.
@@ -978,7 +993,7 @@ static __global__ void flash_attn_tile(
constexpr bool oob_check = false;
flash_attn_tile_iter<warp_size, nwarps, ncols1, ncols2, DKQ, DV, nbatch_fa, nbatch_K, use_logit_softcap, oob_check>
(Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp,
- stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0);
+ stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq);
}
}
@@ -1144,7 +1159,7 @@ static __global__ void flash_attn_tile(
nb11, nb12, nb13,
nb21, nb22, nb23,
ne31, ne32, ne33,
- nb31, nb32, nb33);
+ nb31, nb32, nb33, block_table);
NO_DEVICE_CODE;
#endif // FLASH_ATTN_AVAILABLE
}
diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
index e3771ee..afcafa2 100644
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -575,11 +575,41 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d
void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
ggml_cuda_set_device(ctx.device);
- // [paged] the block table (src[5]) is only honored by the vec kernel's
- // in-kernel read; force it. build_attn only sets it for a vec-supported
- // 1-token-per-stream decode shape.
+ // [paged] DISPATCH GUARD. The block table (src[5]) is read in-kernel ONLY by
+ // the vec and tile kernels; the mma/wmma kernels GGML_UNUSED it and would
+ // silently read the wrong (contiguous physical) cells. So when a block table
+ // is present we route here and NEVER fall through to the best-kernel switch
+ // below - no decode shape can silently reach an mma/wmma misread. build_attn
+ // only sets src[5] for the 1-token-per-stream decode shape; the vec
+ // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
+ // and any shape that should not be paged must take the host-side gather path
+ // (LLAMA_KV_PAGED_GATHER=1) instead.
+ //
+ // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
+ // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
+ // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
+ // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
+ // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
+ // with oob_check=false while the compacted paged mask is not padded to cover
+ // it, so it diverges from stock. Not for production paged decode until
+ // increment-3 bounds that path; the default vec route is unaffected.
if (dst->src[5] != nullptr) {
- ggml_cuda_flash_attn_ext_vec(ctx, dst);
+ static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
+ if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
+ static bool logged = false;
+ if (!logged) {
+ logged = true;
+ fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
+ paged_tile ? "TILE(experimental)" : "VEC",
+ (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
+ (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
+ }
+ }
+ if (paged_tile) {
+ ggml_cuda_flash_attn_ext_tile(ctx, dst);
+ } else {
+ ggml_cuda_flash_attn_ext_vec(ctx, dst);
+ }
return;
}
--
2.43.0

@@ -1,147 +0,0 @@
From d5ca5cd756e42214d0003bca815ca91943679b0d Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 00:18:35 +0200
Subject: [PATCH] paged decode: route GQA-grouped tile kernel by default (F16,
gqa>=2) - patch 0011
Increment 3 (the attention lever). In fattn.cu's paged dispatch guard, route the
in-kernel decode to the tile kernel for the common grouped-query F16 case, and
keep the inc-1 vec kernel for everything else.

The tile kernel carries native GQA head-group reuse: its ncols2 axis groups the
q-heads that share one kv-head, so each K/V row is loaded once for the whole
group instead of once per q-head. vec re-streams each kv-head's K/V once per
q-head (8x for Qwen3-32B's n_head 64 / n_head_kv 8) and runs at 168 regs ->
3 blocks/SM = 25% occupancy on GB10; tile is 108-128 regs with native grouping.
The inc-2 phys(j) block-table read was already plumbed into tile (patch 0010);
this patch makes it the default for {F16 K and V, gqa_ratio >= 2}.

Routing guard (why conditional): the tile kernel has no K/V type template - it
loads half2 - so a non-F16 cache (BF16 / quantized) would be converted by
launch_fattn to a contiguous F16 copy, which breaks the in-kernel block-table
read (the table indexes the original paged layout, not the copy). So tile is
correct only for an F16 cache; non-F16 caches and the non-grouped gqa==1 shape
fall back to the inc-1 vec path, exactly as before this change. The head-group
reuse also only helps at gqa_ratio >= 2. LLAMA_KV_PAGED_VEC=1 forces vec for A/B.
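
The routing rule in the paragraph above can be restated as a small decision
function. This is a sketch under stated assumptions, not the code in fattn.cu:
kv_type, paged_kernel and route_paged_decode are hypothetical names, and the
force_vec flag stands in for the LLAMA_KV_PAGED_VEC=1 override.

```cpp
#include <cassert>

enum class kv_type      { F16, BF16, Q8_0 };  // hypothetical subset of cache types
enum class paged_kernel { VEC, TILE };

// Hypothetical restatement of the paged routing rule:
// tile only for {F16 K and V, gqa_ratio >= 2}; vec for everything else.
inline paged_kernel route_paged_decode(kv_type k, kv_type v,
                                       int n_head, int n_head_kv, bool force_vec) {
    assert(n_head_kv > 0 && n_head % n_head_kv == 0);
    const int  gqa_ratio = n_head / n_head_kv;
    const bool f16_cache = (k == kv_type::F16) && (v == kv_type::F16);
    if (!force_vec && f16_cache && gqa_ratio >= 2) {
        return paged_kernel::TILE;  // head-group reuse applies, table indexes the real cache
    }
    return paged_kernel::VEC;       // non-F16 cache, gqa == 1, or A/B override
}
```

For the Qwen3-32B shape quoted above (n_head 64, n_head_kv 8, F16 cache) this
sketch selects the tile kernel; a quantized cache or a gqa==1 model stays on vec.
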
Note: paged decode is currently exercised with an F16 cache only; quantized +
paged is a separate pre-existing limitation, independent of this change
(verified: stock + q8_0 cache works, but paged + q8_0 aborts both before and
after this patch, since both route the non-F16 cache to vec).
Measured GB10 (sm_121, 48 SM), Qwen3-32B NVFP4 dense, F16 cache, gqa 8, batch 32,
1024 ctx, llama-batched-bench npp=1024 ntg=128 npl=32, GGML_CUDA_DISABLE_GRAPHS=1,
same build, env-toggled:
STOCK (mma) 174.8 ms/step 183.1 t/s
PAGED-VEC (inc-1) 186.3 ms/step 171.8 t/s (+6.6% vs stock)
PAGED-TILE (inc-3) 177.9 ms/step 179.8 t/s (+1.8% vs stock)
GQA grouping recovers 8.4 ms/step (-4.5%) over the inc-1 vec default and brings
paged decode to within 1.8% of stock. The win grows with context (npl=8, tile vs
vec decode step): 1024 -2.3%, 4096 -3.3%, 8192 and 16384 wider, as attention
takes a larger share of the step.
Why not the split-K tune: the vec decode grid is already block-saturated
(1 x parallel_blocks 3 x 2048 = 6144 blocks ~ 43 waves over 144 resident on 48
SM), so raising parallel_blocks / KV_max adds no SM fill. The under-saturation is
intra-SM (occupancy + the 8x KV re-streaming), which GQA grouping attacks
directly; more split-K does not.
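
The saturation arithmetic above can be checked with a small sketch. One assumption is not stated in the text: the 2048 grid dimension is read here as batch 32 x n_head 64. The helper names are hypothetical.

```cpp
#include <cstdint>

// Total thread blocks in the vec decode grid: 1 x parallel_blocks x (batch * n_head).
// ASSUMPTION: the 2048 factor quoted above is batch (32) times q-heads (64).
constexpr int64_t grid_blocks(int64_t parallel_blocks, int64_t batch, int64_t n_head) {
    return 1 * parallel_blocks * (batch * n_head);
}

// Blocks that can be resident at once across the whole GPU.
constexpr int64_t resident_blocks(int64_t blocks_per_sm, int64_t n_sm) {
    return blocks_per_sm * n_sm;
}

// Number of waves needed to drain the grid, rounded up.
constexpr int64_t waves(int64_t blocks, int64_t resident) {
    return (blocks + resident - 1) / resident;
}

static_assert(grid_blocks(3, 32, 64) == 6144, "1 x 3 x 2048 blocks");
static_assert(resident_blocks(3, 48) == 144, "3 blocks/SM on 48 SMs");
static_assert(waves(6144, 144) == 43, "about 43 waves, so the grid is far past one wave");
```

With tens of waves already queued, adding more split-K blocks only lengthens the queue, which is the point the paragraph makes.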
Correctness (greedy, GGML_CUDA_DISABLE_GRAPHS=1):
- CPU plumbing gate (Qwen3-0.6B, build-cpu, paged-on vs off): BYTE-IDENTICAL.
- GPU 0.6B gqa=2, 8 seq x 48 tok: tile is token-identical to the inc-1 vec path
in 7/8 sequences; the 8th diverges at token 5, within the same kernel-noise
band where vec also drifts from stock. Stock uses the mma kernel for this
multi-stream GQA shape, so a different kernel = different rounding =
autoregressive token drift; vec and tile agree with each other while both
differ from stock (both pick 15678 where stock picks 38835), confirming the
drift is kernel choice, not a paging error.
- GPU 32B gqa=8, 4 seq x 40 tok: tile tracks stock at least as well as vec
(seq3: tile == stock == 624 at the token where vec picked 13).
Stock is byte-identical: the dispatch guard only diverts when the block table
(src[5]) is set; the non-paged best-kernel switch is untouched. The ncols2>1 tile
path reads the last nbatch_fa tile with oob_check=false and relies on the mask
-inf padding - the same pattern stock uses for ncols2>1 - and the compacted paged
mask is gathered to the n_view (GGML_PAD 256) width so it carries that padding.
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]
---
ggml/src/ggml-cuda/fattn.cu | 51 ++++++++++++++++++++++++++-----------
1 file changed, 36 insertions(+), 15 deletions(-)
diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
index afcafa2..6b15810 100644
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -580,32 +580,53 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
// silently read the wrong (contiguous physical) cells. So when a block table
// is present we route here and NEVER fall through to the best-kernel switch
// below - no decode shape can silently reach an mma/wmma misread. build_attn
- // only sets src[5] for the 1-token-per-stream decode shape; the vec
+ // only sets src[5] for the 1-token-per-stream decode shape; the vec/tile
// dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading,
// and any shape that should not be paged must take the host-side gather path
// (LLAMA_KV_PAGED_GATHER=1) instead.
//
- // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and
- // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the
- // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the
- // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte-
- // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile
- // with oob_check=false while the compacted paged mask is not padded to cover
- // it, so it diverges from stock. Not for production paged decode until
- // increment-3 bounds that path; the default vec route is unaffected.
+ // Default route = the GQA-grouped TILE kernel (inc-3) WHEN it is both correct
+ // and a win, else the inc-1 vec path. Tile groups the q-heads that share one
+ // kv-head (ncols2), loading each K/V row once for the whole group instead of
+ // once per q-head, and runs at higher occupancy than vec (108-128 regs vs 168).
+ // Two constraints make this conditional: (1) the tile kernel has no K/V type
+ // template - it loads half2 - so a non-F16 cache (BF16/quantized) would be
+ // converted by launch_fattn to a contiguous F16 copy, which breaks the
+ // in-kernel block-table read (the table indexes the original paged layout, not
+ // the copy); vec instead reads the original cache with in-kernel dequant, so it
+ // is the only correct paged path for non-F16 caches. (2) the head-group reuse
+ // only helps when gqa_ratio>=2. So route to tile only for {F16 K and V,
+ // gqa_ratio>=2}; everything else stays on vec, matching stock (which also sends
+ // quantized-cache decode to the vector kernel). Measured on GB10 (Qwen3-32B
+ // nvfp4, F16 cache, gqa 8, batch 32, 1024 ctx): tile 177.9 ms/step vs vec 186.3
+ // vs stock 174.8 - GQA grouping recovers ~4.5% over the inc-1 vec default and
+ // brings paged decode to ~1.8% of stock. Validated token-coherent with vec:
+ // 0.6B 8-seq 7/8 identical (8th within the kernel-noise band where vec also
+ // drifts from stock), 32B gqa=8 tile tracks stock at least as well as vec, CPU
+ // plumbing gate byte-identical. The ncols2>1 tile path reads the last nbatch_fa
+ // tile with oob_check=false relying on mask -inf padding (the SAME pattern stock
+ // uses for ncols2>1); the compacted paged mask is gathered to the n_view
+ // (GGML_PAD 256) width so it carries that padding. LLAMA_KV_PAGED_VEC=1 forces
+ // the inc-1 vec path for A/B.
if (dst->src[5] != nullptr) {
- static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr;
+ const ggml_tensor * Qp = dst->src[0];
+ const ggml_tensor * Kp = dst->src[1];
+ const ggml_tensor * Vp = dst->src[2];
+ const bool kv_f16 = Kp->type == GGML_TYPE_F16 && Vp->type == GGML_TYPE_F16;
+ const int64_t gqa_ratio = Kp->ne[2] > 0 ? Qp->ne[2] / Kp->ne[2] : 1;
+ const bool force_vec = getenv("LLAMA_KV_PAGED_VEC") != nullptr;
+ const bool use_tile = !force_vec && kv_f16 && gqa_ratio >= 2;
if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) {
static bool logged = false;
if (!logged) {
logged = true;
- fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n",
- paged_tile ? "TILE(experimental)" : "VEC",
- (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1],
- (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]);
+ fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld] gqa=%ld kv_f16=%d)\n",
+ use_tile ? "TILE(gqa)" : "VEC",
+ (long) Qp->ne[0], (long) Qp->ne[1], (long) Qp->ne[2], (long) Qp->ne[3],
+ (long) gqa_ratio, (int) kv_f16);
}
}
- if (paged_tile) {
+ if (use_tile) {
ggml_cuda_flash_attn_ext_tile(ctx, dst);
} else {
ggml_cuda_flash_attn_ext_vec(ctx, dst);
--
2.43.0


@@ -1,50 +0,0 @@
From 6e3e976e2b11adb05519f31dd5aad0c204678f5c Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 11:12:05 +0200
Subject: [PATCH] feat(paged): assert mask-pad invariant for the paged tile
route (patch 0012)
The now-default paged decode route (GQA-grouped fattn-tile kernel) does not
leak past-end KV rows only because the compacted mask/block-table length is
padded to a whole number of flash-attn KV tiles: n_view = GGML_PAD(n_gather,
256), and the tile (nbatch_fa = 64 for head_dim 128) divides 256, so the last
tile sits entirely inside the -inf pad window. That invariant was implicit.
Add a defensive GGML_ASSERT(n_view % 64 == 0) right after the pad/clamp so a
future change to the pad (e.g. < 256) or the tile (> 256) that broke the
whole-tile property cannot silently reintroduce the leak. Additive only, no
behaviour change.
Verified: build-cpu compiles, and the paged CPU byte gate (LLAMA_KV_PAGED off
vs on, Qwen3-0.6B-Q8_0, greedy, -ngl 0) stays byte-identical while the assert
stays silent (n_view remains a whole number of tiles across all decode steps).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/paged-attn.cpp | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
index 8eebeaa..fed8ca9 100644
--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
@@ -201,6 +201,15 @@ bool in_kernel_decode(ggml_context * ctx0,
n_view = K->ne[2];
}
+ // The flash-attn KV tile is 64 rows wide (nbatch_fa for head_dim 128). n_view must be
+ // a whole number of such tiles so the in-kernel decode never reads past the gathered
+ // rows: the trailing pad cells [n_gather, n_view) are all -inf, so any tile straddling
+ // the boundary still contributes zero. This holds today only because the pad (256) is a
+ // multiple of the tile; a future pad < 256 (or nbatch_fa > 256) that broke it would
+ // silently reintroduce a past-end KV leak, so assert it rather than trust it.
+ // pad must be a multiple of the flash-attn KV tile so the last tile is fully inside the -inf pad
+ GGML_ASSERT(n_view % 64 == 0);
+
ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream);
ggml_set_input(idx);
res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view)));
--
2.43.0


@@ -1,136 +0,0 @@
From 6d3743105c1bbfbf9cd16c0c0ba39bfaac74216e Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 11:52:45 +0200
Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch
0013)
llama-server already co-batches decode with chunked prefill: update_slots()
appends every generating slot's sampled token first, then fills the rest of the
n_batch budget with prompt tokens, deferring the overflow to the next step. But
the prefill chunk size is hard-wired to n_batch (default 2048): one slot's
~2048-token prefill chunk lands in a single compute-heavy step, and every decode
co-batched into that step sees a multi-second inter-token-latency (ITL) spike.
Lowering n_batch shrinks the chunk but also caps decode-concurrency width and
prefill throughput, because they are coupled.
Add LLAMA_PREFILL_BUDGET: a per-step prefill-token budget decoupled from n_batch
(the analogue of vLLM's --max-num-batched-tokens / long_prefill_token_threshold).
The prompt-fill loop and the outer slot loop now also stop once this many prompt
tokens have been added in the current update_slots() step, so a long prefill is
split across more steps that each still advance in-flight decode. Default (env
unset or <= 0) = disabled, so stock behaviour is byte-identical. Orthogonal to
LLAMA_KV_PAGED: this is a pure scheduler knob and works with paged off.
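
The effect of the budget on step count can be sketched with a toy model of the fill loop. This is an illustration under simplifying assumptions (a single prefilling slot, a fixed number of decode tokens appended first each step, n_decode < n_batch, no n_ubatch splitting), not the server's scheduler; `steps_to_prefill` is a hypothetical name and its results are not meant to reproduce the measured step counts below.

```cpp
#include <cstdint>

// Count how many steps it takes to ingest one prompt when each step first
// appends n_decode decode tokens, then prompt tokens until either the batch
// is full (n_batch) or the per-step prefill budget is spent.
// budget == 0 means disabled (n_batch-only chunking). Requires n_decode < n_batch.
constexpr int32_t steps_to_prefill(int32_t n_prompt, int32_t n_batch,
                                   int32_t n_decode, int32_t budget) {
    int32_t steps = 0;
    int32_t done  = 0;
    while (done < n_prompt) {
        int32_t batch    = n_decode;  // decode tokens go in first
        int32_t budgeted = 0;         // prompt tokens added this step
        while (done < n_prompt && batch < n_batch &&
               (budget == 0 || budgeted < budget)) {
            ++batch;
            ++done;
            ++budgeted;
        }
        ++steps;
    }
    return steps;
}

// 6000-token prompt, n_batch 2048, 8 decode streams:
static_assert(steps_to_prefill(6000, 2048, 8, 0) == 3, "budget off: ~2040 prompt tokens per step");
static_assert(steps_to_prefill(6000, 2048, 8, 256) == 24, "budget 256: many small chunks");
```

In this model every one of those steps also carries the 8 decode tokens, so a smaller budget means more, shorter steps in which decode keeps advancing.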
Measured on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8 steady decode
streams with one 6000-token prefill injected mid-stream; same binary, only
LLAMA_PREFILL_BUDGET differs:
  metric                        stock(off)   budget=256      budget=512
  worst decode freeze (ms)      3380         482 (7.0x)      778 (4.3x)
  median decode ITL in window   2264         411 (5.5x)      689
  decode_stall (ms)             3285         387 (8.5x)      684 (4.8x)
  decode steps during prefill   38           201 (5.3x)      108
  injected-req TTFT (ms)        8493         10172 (+20%)    8432 (~0%)
  steady-state baseline ITL     94           95              94
This is a LATENCY/fairness lever, not an aggregate-throughput lever: it flattens
the decode ITL spike a long prefill inflicts on co-batched decoders (7.0x smaller
worst freeze, 8.5x smaller decode_stall and 5.3x more decode progress during the
prefill at budget=256), in exchange for a modest TTFT rise on the long request
(the classic chunked-prefill trade-off; budget=512 buys a 4.8x smaller
decode_stall with ~no TTFT cost). Steady aggregate decode is unchanged: it is
bandwidth/weight-capped on GB10 (the NVFP4 weight-read floor), which the
scheduler cannot lift.
Correctness (same model, greedy temp 0, fa on):
- budget unset or >= n_batch: byte-identical to stock (the added break never
fires before the existing n_batch break; the off-path is a no-op by
construction).
- short prompt (<= budget): byte-identical to stock.
- the knob is exactly equivalent to stock's native -b chunking: budget=512 ==
stock -b512 and budget=256 == stock -b256, both BYTE-IDENTICAL, while keeping
n_batch=2048 for decode width.
- on a prompt larger than the budget the chunked greedy output diverges from the
single n_batch chunk only by intrinsic flash-attn chunk-size FP grouping: PURE
stock -b256 diverges from stock -b2048 the same way with the patch inactive,
and the output stays coherent and answers correctly.
Productisation (LocalAI): surface as a model options knob (max_prefill_tokens /
mpt) parsed in grpc-server.cpp, default 0 = disabled, per CHUNKED_PREFILL_PLAN
Phase B; the vendored update_slots() hunk here is that plan's scheduler patch and
stays disjoint from the paged allocation hunks.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
tools/server/server-context.cpp | 34 ++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index b5f9d37..afcdebe 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -3043,6 +3043,29 @@ private:
int32_t n_batch = llama_n_batch(ctx_tgt);
int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
+ // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
+ // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
+ // tokens ingested per update_slots() step at n_batch only; with cont_batching the
+ // sampled decode tokens of every generating slot are appended FIRST, then prompt
+ // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
+ // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
+ // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
+ // tokens added per step independently of n_batch, splitting a long prefill across
+ // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
+ // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
+ // (this is a pure scheduler knob; works with paged off).
+ int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+ {
+ const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
+ if (env_pb) {
+ const int v = atoi(env_pb);
+ if (v > 0) {
+ n_prefill_budget = std::min(n_batch, std::max(1, v));
+ }
+ }
+ }
+ int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots)
+
auto & alora_scale = batch.alora_scale;
auto & alora_disabled_id = batch.alora_disabled_id;
@@ -3487,7 +3510,10 @@ private:
const auto last_user_pos = spans.last_user_message_pos();
// add prompt tokens for processing in the current batch
- while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch) {
+ // (patch 0013) also stop once the per-step prefill budget is spent, so a long
+ // prompt is split across more steps and leaves batch room for co-batched decode
+ while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
+ (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
// get next token to process
llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
if (cur_tok == LLAMA_TOKEN_NULL) {
@@ -3512,6 +3538,7 @@ private:
slot.prompt.tokens.push_back(cur_tok);
slot.n_prompt_tokens_processed++;
+ n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
// stop the prompt batch exactly before a user message
if (spans.is_user_start(slot.prompt.n_tokens())) {
@@ -3597,6 +3624,11 @@ private:
if (!slot_batched) {
slot_batched = &slot;
}
+ // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
+ // leaving the remaining batch capacity for co-batched decode of other slots
+ if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+ add_ok = false;
+ }
});
}
}
--
2.43.0


@@ -1,140 +0,0 @@
From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 15:47:06 +0200
Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014)
On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the
sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch:
mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the
originally reported npl128 throughput cliff does NOT reproduce on this build.
llama-batched-bench decode (S_TG t/s) is monotonic across batch:
  npl    1    8     32    64    128    256
  S_TG   85   282   629   935   1295   1779   (stock, mxfp4 MoE, -fa on)
There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s
at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing.
What remains is a pure tile-shape micro-inefficiency. In mul_mat_q_case the
token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert
column upper bound = token count, up to 128) in one column-tile. At MoE decode
the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of
ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide
col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile
time and burns throughput on the padding columns while the larger y-tile lowers
occupancy. Stock picks the LARGEST tile (128) where the SMALLEST tile that still
covers the density would raise fill + occupancy at no extra weight read (at
tokens/expert <= mmq_x there is exactly one non-empty col-tile per expert; the
emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k
kernel) - the inverse of vLLM's small per-expert BLOCK_SIZE_M.
Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only
(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x
selection, and therefore every kernel launched, is byte-identical to stock. The
cap only ever lowers the loop's upper bound and still selects from the same
granularity- and shared-memory-validated mmq_x set stock already uses for
smaller batches, so no new kernel configuration is exercised.
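
The tile-fill argument and the cap can be sketched as follows. The helper names are hypothetical; the floor of 8 and the "only ever lowers the upper bound" behaviour follow the description and the hunk below, and routing is assumed to be perfectly balanced for the density estimate.

```cpp
// Average tokens routed to each expert, assuming balanced top-k routing.
constexpr double tokens_per_expert(int ne12, int top_k, int n_experts) {
    return static_cast<double>(ne12) * top_k / n_experts;
}

// Fraction of an mmq_x-wide column tile that holds real tokens.
constexpr double tile_fill(double tokens, int mmq_x) {
    return tokens / mmq_x;
}

// Upper bound for the mmq_x search loop: the cap applies only on the
// MUL_MAT_ID path, is floored at 8, and can never exceed mmq_x_max.
constexpr int mmq_x_limit(int mmq_x_max, int moe_cap, bool ids_path) {
    return (!ids_path || moe_cap <= 0) ? mmq_x_max
         : (moe_cap < 8 ? 8 : moe_cap) < mmq_x_max ? (moe_cap < 8 ? 8 : moe_cap)
         : mmq_x_max;
}

// npl128, top-8 of 128 experts -> 8 tokens/expert.
static_assert(tokens_per_expert(128, 8, 128) == 8.0, "~8 tokens per expert");
static_assert(tile_fill(8.0, 128) == 0.0625, "128-wide tile is ~6% filled");
static_assert(tile_fill(8.0, 64) == 0.125, "64-wide tile doubles the fill");
static_assert(mmq_x_limit(128, 64, true) == 64, "cap64 lowers the bound on the ids path");
static_assert(mmq_x_limit(128, 0, true) == 128, "unset cap leaves stock selection");
```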
Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary,
only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s):
  npl   stock S_TG   cap64 S_TG   d%     stock S_PP   cap64 S_PP
  64    936          938          +0.1   2924         2883
  128   1295         1357         +4.8   3075         3038
  256   1784         1825         +2.3   3085         3046
(reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable)
cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at
npl <= 64, for a consistent ~1.3% prefill cost. Smaller caps are net-negative:
cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32
tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight
re-reads), so 64 is the recommended value and the only one that helps net.
Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server
throughput unlock (llama-server continuous batching already scales). It is a
modest high-effective-batch DECODE micro-optimization that matches vLLM's
smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The
durable density-aware auto-select (drop the blunt global cap, choose mmq_x from
ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in
patches/paged/MOE_GROUPED_GEMM_SCOPE.md.
Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to
stock for single-stream generation (fibonacci / capital-of-France / photosynthesis
prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at
npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++-
1 file changed, 36 insertions(+), 1 deletion(-)
diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
index edf546d..cff608e 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
@@ -6,6 +6,7 @@
#include <climits>
#include <cstdint>
+#include <cstdlib>
using namespace ggml_cuda_mma;
@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
}
}
+// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
+// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
+// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+static inline int ggml_cuda_moe_mmq_x_cap() {
+ static const int cap = []() -> int {
+ const char * s = getenv("LLAMA_MOE_MMQ_X");
+ return s ? atoi(s) : 0;
+ }();
+ return cap;
+}
+
template <ggml_type type>
void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
const int id = ggml_cuda_get_device();
@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
const int mmq_x_max = get_mmq_x_max_host(cc);
const int mmq_y = get_mmq_y_host(cc);
+ // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
+ // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
+ // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
+ // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
+ // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
+ // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
+ // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
+ // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
+ // per-expert density raises tile fill + occupancy with no extra weight reads (at
+ // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
+ // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
+ // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
+ // off the ids path the cap never applies.
+ int mmq_x_lim = mmq_x_max;
+ if (args.expert_bounds != nullptr) {
+ const int moe_cap = ggml_cuda_moe_mmq_x_cap();
+ if (moe_cap > 0) {
+ const int cap = moe_cap < 8 ? 8 : moe_cap;
+ mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+ }
+ }
+
int mmq_x_best = 0;
int ntiles_x_best = INT_MAX;
- for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) {
+ for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
const int granularity = mmq_get_granularity_host(mmq_x, cc);
if (mmq_x % granularity != 0 || mmq_get_nbytes_shared<type>(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) {
--
2.43.0


@@ -1,238 +0,0 @@
From 5349f8231b1e11214f5e8a668129397fb6e2f9ac Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Tue, 23 Jun 2026 21:03:00 +0200
Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select
(patch 0015)
The durable follow-up to patch 0014's blunt LLAMA_MOE_MMQ_X global cap (which the
0014 doc itself scoped): replace the manual env cap with a host-side, default-on
auto-select inside mul_mat_q_case that picks a small token-tile (mmq_x) for the
MUL_MAT_ID grouped FP4-MMA GEMM only when the per-expert token density is low
(decode), and keeps the large 128-wide tile when density is high (prefill). No new
kernel: the selection only lowers the loop's upper bound to an already-compiled,
granularity- and shared-memory-validated mmq_x.
Density is estimated host-side from the args the ids path already passes:
ne_get_rows = ncols_dst = ne12 * n_expert_used (token-expert assignments)
n_experts = nchannels_x = ne02
density = ceil(ne_get_rows / min(ne_get_rows, n_experts)) (tokens/expert)
Cap to the small tile (default 64) only when density <= density_max. Unlike 0014's
global cap, the high-density prefill ubatch stays on the big tile, so S_PP does not
regress by construction.
density_max default = 8 (not tile/4 = 16). The cap must fire for decode but not for
a prefill ubatch, and each has per-expert density n_tokens*n_used/n_experts. At the
standard n_ubatch=512, n_used=8: prefill density = 4096/n_experts (32 at 128 experts,
16 at 256), decode at npl<=128 is <= 1024/n_experts (8 at 128, 4 at 256). Default 8
is >= every such decode density and < every prefill density for n_experts in
[128,511] (the gate is density <= density_max), so it caps decode and leaves
prefill on the big tile. tile/4 (=16) equalled the 256-expert prefill density and
cratered its S_PP by ~2%, the regression this threshold exists to avoid.
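
The density estimate and gate above can be sketched as below. This is an illustration of the arithmetic only, with hypothetical function names; the defaults (density_max 8) and the formula come from the description, and the zero-row guard is an addition for the sketch.

```cpp
#include <cstdint>

// density = ceil(ne_get_rows / min(ne_get_rows, n_experts)), in tokens/expert.
// Returns 0 for an empty batch (guard added for this sketch).
constexpr int64_t moe_density(int64_t ne_get_rows, int64_t n_experts) {
    return ne_get_rows <= 0 ? 0
         : (ne_get_rows + (ne_get_rows < n_experts ? ne_get_rows : n_experts) - 1)
           / (ne_get_rows < n_experts ? ne_get_rows : n_experts);
}

// Select the small tile only when the per-expert density is at or below the
// threshold; ne_get_rows = n_tokens * n_used (token-expert assignments).
constexpr bool use_small_tile(int64_t n_tokens, int64_t n_used,
                              int64_t n_experts, int64_t density_max = 8) {
    return moe_density(n_tokens * n_used, n_experts) <= density_max;
}

// Decode at npl128, top-8 of 128 experts: density 8 -> small tile.
static_assert(use_small_tile(128, 8, 128), "decode is capped");
// Prefill ubatch of 512 tokens: density 32 (128 experts) or 16 (256) -> big tile.
static_assert(!use_small_tile(512, 8, 128), "prefill keeps the 128 tile");
static_assert(!use_small_tile(512, 8, 256), "256-expert prefill keeps the 128 tile");
```

Passing `density_max = 16` to the same function shows why the earlier tile/4 threshold was unsafe: the 256-expert prefill case would then be capped as well.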
Measured on GB10 (sm_121), Qwen3.6-35B-A3B NVFP4 (256 experts, top-8, GDN linear
attention), llama-batched-bench -fa on -npp 128 -ntg 128, default-on vs stock
(LLAMA_MOE_AUTO_TILE=0), median of 5 reps:
  npl   S_TG stock   S_TG 0015   dTG%     S_PP stock   S_PP 0015   dPP%
  8     183.59       183.18      -0.22%   1489.2       1500.1      +0.73%
  32    264.02       263.44      -0.22%   2034.5       2033.5      -0.05%
  64    311.76       310.41      -0.43%   2028.3       2027.6      -0.03%
  128   336.10       337.32      +0.36%   2025.0       2027.7      +0.13%
Honest read: on THIS model the decode effect is within run-to-run noise (neutral)
and prefill is neutral. Qwen3.6-35B-A3B decode is bound by the GDN/SSM recurrence and
256 tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile
lever (worth +4.8% @npl128 on Qwen3-Coder-30B, 128 larger experts, patch 0014
cap64) does not move it. A npl128 tile sweep on this model confirms 64 is the only
useful width (TILE8 -6.3%, TILE16 -3.2%, TILE32 -0.2%, TILE64 +0.7%, TILE96 -0.8%):
smaller tiles lose to grid/scheduling overhead and the FP4-MMA minimum width.
Value banked default-on: (1) removes 0014's ~1.3% prefill cost by construction
(density-gated, not global); (2) auto-selects the small tile for col-tile-bound MoE
decode, reproducing 0014 cap64's tile=64 at npl128 by construction, so it preserves
the +4.8% on Qwen3-Coder-30B without the prefill cost; (3) prefill-safe and decode-
neutral on the SSM model, harmless where it does not help. Conservative by design:
at npl256 the qwen3coder decode density (16) equals the 256-expert prefill density
(16), indistinguishable to a pure-density gate, so density_max=8 forgoes 0014's
+2.3% @npl256 to keep 256-expert prefill safe; an ne12-aware refinement is future
work.
LLAMA_MOE_MMQ_X (patch 0014) is KEPT as a manual override that, when > 0, forces the
old blunt global cap and bypasses the auto-select (explicit A/B knob). The auto-
select is the default; LLAMA_MOE_AUTO_TILE=0 restores exact stock mmq_x selection.
LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX tune the small tile / threshold.
Correctness: extends tests/test-backend-ops test_mul_mat_id with a ragged small-M
NVFP4/MXFP4 MoE decode-density gate (128 experts, top-8, m=768, k=2048, n in
{16,33,64,128,130,200,256,512} spanning the cap boundary and ragged token counts).
All 16 shapes pass CUDA-vs-CPU oracle on GB10 both default-on and with
LLAMA_MOE_AUTO_TILE=0; full MUL_MAT_ID suite 2/2 backends OK. Off the ids path
nothing changes (non-MoE mul_mat byte-identical to stock).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/mmq.cuh | 100 ++++++++++++++++++++++++++++++-------
tests/test-backend-ops.cpp | 16 ++++++
2 files changed, 99 insertions(+), 17 deletions(-)
diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
index cff608e..9718b12 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
@@ -4053,10 +4053,11 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
}
}
-// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X.
-// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical).
-// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the
-// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case).
+// [paged patch 0014] MoE token-tile (mmq_x) MANUAL cap, read once from env LLAMA_MOE_MMQ_X.
+// Returns 0 when unset / non-positive => disabled (fall through to the patch-0015 auto-select).
+// When > 0 it forces a blunt GLOBAL cap on the per-expert column-tile width for the MUL_MAT_ID
+// grouped-GEMM path (decode AND prefill), overriding the density-aware auto-select below. Kept
+// as an explicit override / A-B knob; the default path is now the auto-select.
static inline int ggml_cuda_moe_mmq_x_cap() {
static const int cap = []() -> int {
const char * s = getenv("LLAMA_MOE_MMQ_X");
@@ -4065,6 +4066,43 @@ static inline int ggml_cuda_moe_mmq_x_cap() {
return cap;
}
+// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select knobs (DEFAULT-ON).
+// LLAMA_MOE_AUTO_TILE=0 disables the auto-select => exact stock mmq_x selection.
+static inline bool ggml_cuda_moe_auto_tile_enabled() {
+ static const bool en = []() -> bool {
+ const char * s = getenv("LLAMA_MOE_AUTO_TILE");
+ return !(s && atoi(s) == 0);
+ }();
+ return en;
+}
+// The small high-occupancy token-tile chosen for low-density (decode) MoE matmuls. Default 64:
+// the measured GB10 sweet spot (full per-expert fill with >=4x routing-imbalance headroom).
+static inline int ggml_cuda_moe_decode_tile() {
+ static const int t = []() -> int {
+ const char * s = getenv("LLAMA_MOE_DECODE_TILE");
+ const int v = s ? atoi(s) : 0;
+ return v >= 8 ? v : 64;
+ }();
+ return t;
+}
+// Per-expert token-density ceiling under which the small tile is selected. Default 8: the cap must
+// fire for decode but NOT for a prefill ubatch, and the per-expert density of each is
+// n_tokens*n_used/n_experts. For the standard n_ubatch=512, n_used=8 the prefill density is
+// 4096/n_experts (= 32 at 128 experts, 16 at 256 experts); decode at npl<=128 is <=1024/n_experts
+// (= 8 at 128 experts, 4 at 256). Default 8 sits strictly between the two for every n_experts in
+// [128,511], so it caps decode and leaves the prefill ubatch on the big 128 tile - whereas the old
+// tile/4 (=16) equalled the 256-expert prefill density and cratered its S_PP by ~2% (measured on
+// Qwen3.6-35B-A3B NVFP4). 8 also keeps >=8x fill headroom at tile 64 so an imbalanced expert
+// segment never splits into an extra col-tile.
+static inline int ggml_cuda_moe_density_max() {
+ static const int d = []() -> int {
+ const char * s = getenv("LLAMA_MOE_DENSITY_MAX");
+ const int v = s ? atoi(s) : 0;
+ return v > 0 ? v : 8;
+ }();
+ return d;
+}
+
template <ggml_type type>
void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
const int id = ggml_cuda_get_device();
@@ -4076,25 +4114,53 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
const int mmq_x_max = get_mmq_x_max_host(cc);
const int mmq_y = get_mmq_y_host(cc);
- // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap.
- // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are
- // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count,
- // up to 128) in a single column-tile. At MoE decode the per-expert token density is low
- // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for
- // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty:
- // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the
- // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the
- // per-expert density raises tile fill + occupancy with no extra weight reads (at
- // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the
- // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel).
- // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock;
- // off the ids path the cap never applies.
+ // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
+ // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
+ // sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, up to 128)
+ // in a single column-tile, i.e. it MAXIMIZES the tile (128 on Blackwell) for the aggregate
+ // batch. But the tile is then applied PER EXPERT, and at MoE decode the per-expert token
+ // density is tiny (top-k of many experts), so each expert's single 128-wide col-tile is mostly
+ // empty: the MMA accumulator tile is mmq_x-wide at compile time and burns throughput on the
+ // padding columns while the larger y-tile lowers occupancy. vLLM's fused-MoE does the opposite
+ // (a small per-expert BLOCK_SIZE_M). We reproduce that here, host-side only, by picking a
+ // SMALLER mmq_x when - and only when - the per-expert density is low:
+ //
+ // ne_get_rows = args.ncols_dst = ne12 * n_expert_used (total token-expert assignments)
+ // n_experts = args.nchannels_x = ne02
+ // n_active_est = min(n_experts, ne_get_rows) (upper bound on active experts)
+ // density = ceil(ne_get_rows / n_active_est) (avg tokens per active expert)
+ //
+ // Cap to the small tile (default 64) only when density <= density_max (default 8). 8 sits below
+ // every prefill-ubatch density and above every decode density for n_experts in [128,511] at the
+ // standard n_ubatch=512 (prefill 4096/n_experts, decode <=1024/n_experts), with >=8x fill headroom
+ // so a capped expert segment never splits a col-tile. Decode (per-expert density 4 at 256 experts,
+ // 8 at 128 experts @npl128) gets the fuller high-occupancy tile; the prefill ubatch (density 16 at
+ // 256 / 32 at 128 experts) stays ABOVE the threshold and keeps the big
+ // 128 compute tile - so unlike the blunt global cap (LLAMA_MOE_MMQ_X / patch 0014) this is
+ // prefill-safe by construction. The selection only ever picks an already-compiled, granularity-
+ // and shared-memory-validated mmq_x that the loop below would consider for a smaller batch; no
+ // new kernel. Off the ids path (expert_bounds == nullptr) nothing changes => non-MoE mul_mat
+ // and the gated f16/bf16 host-loop fallback stay byte-identical to stock.
+ // - LLAMA_MOE_MMQ_X=<n> : manual blunt global cap, overrides the auto-select (patch 0014).
+ // - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
+ // - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
int mmq_x_lim = mmq_x_max;
if (args.expert_bounds != nullptr) {
const int moe_cap = ggml_cuda_moe_mmq_x_cap();
if (moe_cap > 0) {
const int cap = moe_cap < 8 ? 8 : moe_cap;
mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max;
+ } else if (ggml_cuda_moe_auto_tile_enabled()) {
+ const int64_t ne_get_rows = args.ncols_dst;
+ const int64_t n_experts = args.nchannels_x;
+ if (ne_get_rows > 0 && n_experts > 0) {
+ const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
+ const int64_t density = (ne_get_rows + n_active - 1) / n_active;
+ const int tile = ggml_cuda_moe_decode_tile();
+ if (density <= (int64_t) ggml_cuda_moe_density_max() && tile < mmq_x_max) {
+ mmq_x_lim = tile;
+ }
+ }
}
}
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index c83e91f..62a0989 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -8603,6 +8603,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880));
+ // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert-
+ // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8,
+ // m=768, k=2048) so this exercises the exact grouped FP4-MMA mmq kernel the model runs.
+ // Per-expert token density = n*n_used/n_mats = n/16; cover the decode band (density 1/4/8/16
+ // at n 16/64/128/256), ragged token counts (n 33/130/200: experts with 0/1/2 tokens, n not a
+ // multiple of the tile) where the tiny-M col-tiles change geometry and any masking can leak,
+ // and a prefill-density shape (n 512 => density 32) the auto-select must leave on the large
+ // 128 tile. n=128 (density 8) is where stock picks mmq_x=128 and the auto-select picks 64, so the
+ // op-test (CPU oracle vs CUDA, deterministic) is the bit-exact regression gate for P1: it must
+ // pass with the auto-select on (default) and with LLAMA_MOE_AUTO_TILE=0 (stock selection).
+ for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+ for (int n : {16, 33, 64, 128, 130, 200, 256, 512}) {
+ test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 128, 8, false, 768, n, 2048));
+ }
+ }
+
for (ggml_type type_a : all_types) {
test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
}
--
2.43.0
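The density rule that patch 0015 adds to mul_mat_q_case can be sketched in isolation. This is a minimal illustration under stated assumptions, not the patched function: `moe_mmq_x_limit` is a hypothetical name, the env-backed helpers are replaced by plain parameters carrying the documented defaults (tile 64, density_max 8), the manual LLAMA_MOE_MMQ_X override is omitted, and only the upper limit (mmq_x_lim) is computed, since the loop that picks the final mmq_x is not part of this hunk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical standalone mirror of the auto-select branch (patch 0015).
// ne_get_rows : total token-expert assignments (ne12 * n_expert_used)
// n_experts   : number of experts (ne02)
// mmq_x_max   : largest column tile the device supports (128 in the patch's example)
// tile, density_max : assumed to equal the documented defaults (64 and 8)
static int moe_mmq_x_limit(int64_t ne_get_rows, int64_t n_experts, int mmq_x_max,
                           int tile = 64, int density_max = 8) {
    assert(mmq_x_max > 0);
    if (ne_get_rows <= 0 || n_experts <= 0) {
        return mmq_x_max; // nothing to estimate: keep the stock limit
    }
    // upper bound on the number of experts that can be active
    const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts;
    // average tokens per active expert, rounded up
    const int64_t density = (ne_get_rows + n_active - 1) / n_active;
    if (density <= density_max && tile < mmq_x_max) {
        return tile; // low per-expert density: cap to the small tile
    }
    return mmq_x_max; // dense enough (e.g. prefill): keep the big tile
}
```

For the regression-gate shape (128 experts, top-8, mmq_x_max 128), 128 tokens give density 8 and the limit drops to 64, while 130 tokens (density 9) and 512 tokens (density 32) keep 128.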


@@ -1,191 +0,0 @@
From 02fa0473a9324b7e12f9b203d221cc4ac80cfd33 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 24 Jun 2026 10:11:48 +0200
Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch
0016, continuous-batch P1)

Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC,
decode-first token budget: this is P1 of the token-granular continuous-batch
scheduler. POLICY change only inside update_slots(): no new slot states, no
batch-formation rewrite, zero libllama changes. llama-server already emits one
unified mixed prefill+decode batch per step (Phase 1 appends every ready decode
token unconditionally; Phase 2 fills prefill into the same batch). 0016 only
changes the COUNT of prefill tokens admitted per step.

The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens
== D (the live decode load) is known there. Instead of 0013's constant
LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets one
long prompt monopolise the step), compute a dynamic budget:

  T = clamp(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_ubatch, n_batch)
  prefill_budget_step = max(n_ubatch, T - D)
      (leftover after decode, floored at n_ubatch; auto-shrinks as decode load
      rises, so the step stays within T while D <= T - n_ubatch)
  prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch,
      pinned to n_batch when T == n_batch (LLAMA_PREFILL_CAP overrides)

Phase 2's inner prompt-fill loop and outer admission break are bounded by
prefill_budget_step (across slots) and a new per-slot slot_prompt_added
counter; the n_batch hard ceiling stays as the compute bound. Decode is
structurally claimed first and never capped (Phase 1), so the decode-first
guarantee is free.

DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical
to stock. The degenerate T == n_batch case is byte-identical to stock/0013 (the
determinism oracle). The legacy LLAMA_PREFILL_BUDGET path is preserved exactly
(honoured only when LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly
subsumed. Orthogonal to LLAMA_KV_PAGED: pure scheduler policy, identical
decisions paged on or off.

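The budget arithmetic described in this message can be sketched as a pure function. This is a minimal illustration under stated assumptions, not the code in update_slots(): `PrefillLimits` and `prefill_limits` are hypothetical names, `max_batch_tokens` and `prefill_cap` stand in for the already-parsed LLAMA_MAX_BATCH_TOKENS and LLAMA_PREFILL_CAP values (<= 0 meaning unset), and the legacy LLAMA_PREFILL_BUDGET fallback is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result pair; 0 means "bound disabled", as in the patch.
struct PrefillLimits {
    int32_t budget_step;  // prompt tokens admitted per step, across all slots
    int32_t cap_per_slot; // prompt tokens admitted per step for a single slot
};

// n_decode is D: the decode tokens Phase 1 already appended to the batch.
static PrefillLimits prefill_limits(int32_t n_batch, int32_t n_ubatch, int32_t n_ctx,
                                    int32_t n_decode, int32_t max_batch_tokens,
                                    int32_t prefill_cap = 0) {
    assert(n_ubatch > 0 && n_ubatch <= n_batch);
    PrefillLimits out{0, 0};
    if (max_batch_tokens <= 0) {
        return out; // dynamic knob unset: no new bound applies
    }
    // T clamped to [n_ubatch, n_batch]
    int32_t T = std::min(n_batch, max_batch_tokens);
    T = std::max(T, n_ubatch);
    // leftover after decode, floored at n_ubatch so prefill never fully starves
    out.budget_step = std::max(n_ubatch, T - n_decode);
    // per-slot cap: explicit override, else min(T, ceil(0.04 * n_ctx)) floored at n_ubatch
    int32_t cap = prefill_cap;
    if (cap <= 0) {
        const int32_t pct4 = (n_ctx + 24) / 25; // ceil(n_ctx / 25)
        cap = std::min(T, std::max(n_ubatch, pct4));
    }
    cap = std::min(n_batch, std::max(n_ubatch, cap));
    if (T >= n_batch) {
        cap = n_batch; // degenerate case: the n_batch ceiling is the only bound
    }
    out.cap_per_slot = cap;
    return out;
}
```

With n_batch 2048, n_ubatch 512, n_ctx 32768 and T 1024, a decode load of 128 leaves a prefill budget of 896, while a decode load of 900 hits the n_ubatch floor of 512.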
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
tools/server/server-context.cpp | 107 +++++++++++++++++++++++++-------
1 file changed, 85 insertions(+), 22 deletions(-)
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index afcdebe..b8b8f00 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -3043,24 +3043,78 @@ private:
int32_t n_batch = llama_n_batch(ctx_tgt);
int32_t n_ubatch = llama_n_ubatch(ctx_tgt);
- // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget.
- // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt
- // tokens ingested per update_slots() step at n_batch only; with cont_batching the
- // sampled decode tokens of every generating slot are appended FIRST, then prompt
- // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch
- // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every
- // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt
- // tokens added per step independently of n_batch, splitting a long prefill across
- // more steps so in-flight decode keeps advancing smoothly. Default (env unset or
- // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED
- // (this is a pure scheduler knob; works with paged off).
- int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking)
+ // PAGED serving lever (patch 0016, supersedes 0013): dynamic decode-first
+ // per-step prefill-token budget (continuous-batch scheduler P1). llama-server
+ // already builds ONE mixed batch per update_slots() step: Phase 1 (just above)
+ // appended every generating slot's sampled token UNCONDITIONALLY, so at this point
+ // batch.n_tokens == D is the live decode load; Phase 2 (below) fills the remaining
+ // batch capacity with prompt tokens. Patch 0013 capped Phase 2 with a STATIC
+ // constant (LLAMA_PREFILL_BUDGET) that ignores D, needs per-workload tuning, and
+ // lets one long prompt monopolise the step.
+ //
+ // This computes a DYNAMIC budget instead, the vLLM v1 token-budget analogue:
+ // a single total per-step token budget T, decode claims its D tokens first
+ // (already in the batch), and prefill gets the leftover T - D distributed across
+ // waiting prompts with a per-slot chunk cap. As decode load D rises the prefill
+ // leftover auto-shrinks, so the step never inflates past T at any concurrency:
+ // the budget self-tunes across the npl range and across dense vs MoE without a
+ // hand-picked constant (the 161/333 tok/s GB10 decode ceiling is held tuning-free
+ // instead of via 0013's hand-tuned 256). Decode is structurally claimed first and
+ // never capped (Phase 1), so the decode-first guarantee is free here.
+ //
+ // LLAMA_MAX_BATCH_TOKENS (T) total per-step token budget (decode + prefill),
+ // default n_batch, clamped to [n_ubatch, n_batch] so
+ // the compute loop stays a single llama_decode and
+ // prefill keeps an n_ubatch floor of progress.
+ // LLAMA_PREFILL_CAP per-slot max prompt tokens per step (the
+ // long_prefill_token_threshold analogue), default
+ // min(T, ceil(0.04*n_ctx)) floored at n_ubatch, so
+ // one long prompt cannot eat the whole leftover.
+ // LLAMA_PREFILL_BUDGET legacy static cap (patch 0013); honoured ONLY when
+ // LLAMA_MAX_BATCH_TOKENS is unset, for back-compat.
+ //
+ // DEFAULT-OFF BYTE-IDENTICAL: with all three knobs unset, and in the degenerate
+ // T == n_batch case, behaviour is byte-identical to stock. At T == n_batch the
+ // dynamic leftover max(n_ubatch, n_batch - D) and the n_batch per-slot cap both
+ // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no
+ // new bound fires (the determinism oracle). Orthogonal to LLAMA_KV_PAGED: pure
+ // scheduler policy, identical decisions with paged on or off.
+ const int32_t n_decode_in_batch = batch.size(); // D: Phase 1 appended D decode tokens above
+ int32_t prefill_budget_step = 0; // 0 = disabled (stock n_batch-only chunking)
+ int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap)
{
- const char * env_pb = getenv("LLAMA_PREFILL_BUDGET");
- if (env_pb) {
+ int32_t mbt = 0;
+ if (const char * env_mbt = getenv("LLAMA_MAX_BATCH_TOKENS")) {
+ mbt = atoi(env_mbt);
+ }
+ if (mbt > 0) {
+ // dynamic decode-first budget (P1): T clamped to [n_ubatch, n_batch]
+ int32_t T = std::min(n_batch, mbt);
+ T = std::max(T, n_ubatch);
+ // leftover after decode, floored at n_ubatch so prefill never fully starves
+ prefill_budget_step = std::max(n_ubatch, T - n_decode_in_batch);
+ // per-slot prompt-chunk cap (long_prefill_token_threshold analogue)
+ int32_t cap = 0;
+ if (const char * env_cap = getenv("LLAMA_PREFILL_CAP")) {
+ cap = atoi(env_cap);
+ }
+ if (cap <= 0) {
+ const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx)
+ cap = std::min(T, std::max(n_ubatch, pct4));
+ }
+ cap = std::min(n_batch, std::max(n_ubatch, cap));
+ // at T == n_batch the leftover and cap both reach the n_batch ceiling
+ // together; pin the cap to n_batch so this case stays byte-identical
+ if (T >= n_batch) {
+ cap = n_batch;
+ }
+ prefill_cap_per_slot = cap;
+ } else if (const char * env_pb = getenv("LLAMA_PREFILL_BUDGET")) {
+ // legacy static budget (patch 0013), kept for back-compat when the
+ // dynamic knob is unset: a constant per-step prefill cap, no per-slot cap
const int v = atoi(env_pb);
if (v > 0) {
- n_prefill_budget = std::min(n_batch, std::max(1, v));
+ prefill_budget_step = std::min(n_batch, std::max(1, v));
}
}
}
@@ -3509,11 +3563,18 @@ private:
const auto & spans = slot.task->params.message_spans;
const auto last_user_pos = spans.last_user_message_pos();
+ // (patch 0016) per-slot prompt tokens added this step, for the per-slot
+ // chunk cap (resets each slot); n_batch stays the hard compute ceiling
+ int32_t slot_prompt_added = 0;
+
// add prompt tokens for processing in the current batch
- // (patch 0013) also stop once the per-step prefill budget is spent, so a long
- // prompt is split across more steps and leaves batch room for co-batched decode
+ // (patch 0016) also stop once (a) the dynamic per-step prefill budget
+ // (the T - D leftover) is spent across all slots, or (b) this slot's
+ // per-slot chunk cap is hit, so a long prompt is split across more steps
+ // and leaves batch room for co-batched decode of the other slots
while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch &&
- (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) {
+ (prefill_budget_step == 0 || n_prompt_budgeted < prefill_budget_step) &&
+ (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) {
// get next token to process
llama_token cur_tok = input_tokens[slot.prompt.n_tokens()];
if (cur_tok == LLAMA_TOKEN_NULL) {
@@ -3538,7 +3599,8 @@ private:
slot.prompt.tokens.push_back(cur_tok);
slot.n_prompt_tokens_processed++;
- n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget
+ n_prompt_budgeted++; // (patch 0016) toward the dynamic per-step prefill budget
+ slot_prompt_added++; // (patch 0016) toward this slot's per-step chunk cap
// stop the prompt batch exactly before a user message
if (spans.is_user_start(slot.prompt.n_tokens())) {
@@ -3624,9 +3686,10 @@ private:
if (!slot_batched) {
slot_batched = &slot;
}
- // (patch 0013) stop adding prompts once the per-step prefill budget is spent,
- // leaving the remaining batch capacity for co-batched decode of other slots
- if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) {
+ // (patch 0016) stop admitting prompts once the dynamic per-step prefill
+ // budget (the T - D leftover) is spent, leaving the remaining batch
+ // capacity for co-batched decode of the other slots
+ if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) {
add_ok = false;
}
});
--
2.43.0


@@ -1,245 +0,0 @@
From 089f78d2a2c04465a566d499dbe0a67c008435a8 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Wed, 24 Jun 2026 19:56:05 +0200
Subject: [PATCH] feat(paged): FP4 decode GEMM track-B P0 gate + default-off
occupancy instrumentation (patch 0017)

Track B targets the dense NVFP4 weight GEMM (~59% of the GB10 decode step). This lands the P0
bit-exact parity gate and the P1 occupancy levers (default-off / byte-identical) and records the
honest P1 result: the cheap host/occupancy tuning does NOT lift decode_agg on GB10 (sm_121) - the
kill-gate tripped - so nothing is enabled by default.

P0 gate (tests/test-backend-ops.cpp): NVFP4/MXFP4 dense decode-shape MUL_MAT cases at the
weight-row tiling boundary (m in {2048,1600,2050} = exact + ragged vs mmq_y 64/128, n in {32,128}
= decode M, k=2048), so the bit-exact CPU-vs-CUDA oracle covers the mmq_y / min-blocks paths.
Green at default and with every lever on: MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 fail.

P1 levers (ggml/src/ggml-cuda/mmq.cuh), all default-off => default build byte-identical to stock:
- GGML_CUDA_FP4_MMQ_Y (default 128): type-aware get_mmq_y_host/device plumbing for an NVFP4
  weight-row tile override. mmq_y is rigidly nwarps*tile_C::I (=8*16=128, the mmq.cuh
  static_assert), so mmq_y<128 also needs nwarps-down (a warp-remap through the shared
  vec_dot/loader), left as the P2 kernel change; the host/device plumbing is in place and inert.
- GGML_CUDA_FP4_MINBLOCKS (default 1): NVFP4-only __launch_bounds__ min-resident-CTAs lever
  (register-cap the FP4-MMA kernel so >1 CTA co-resides) - the bounded occupancy probe.
- GGML_CUDA_FP4_DENSE_MMQ_X (env, default off): dense col-tile re-read occupancy diagnostic.

Measured GB10 (llama-batched-bench -fa on -npp 128 -ntg 128 -npl 32,128), decode_agg (S_TG):
  DENSE q36-27b-nvfp4 @npl128: P0 149.5 -> MINBLOCKS=2 147.9 (-1.1%) -> DENSE_MMQ_X=64 144.3
    (-3.5%) -> =32 141.7 (-5.2%). Every occupancy probe regresses.
  MoE q36-35b-a3b-nvfp4 @npl128: stock 336.3, MINBLOCKS=2 337.7 (+0.4%, noise), TILE16 324.0
    (-3.7%), TILE8 316.6 (-5.9%). mmq_x-down regresses (reproduces patch 0015; GDN/BW-bound).

nsys (kill-gate evidence): the decode FP4 GEMM mul_mat_q<NVFP4,128,0> went 2.782s -> 3.025s
(avg 608us -> 661us, +8.7% slower) under MINBLOCKS=2 - register-capping spills, so occupancy did
not usefully rise. Verdict: the dense M=128 tile is already weight-read/one-read-optimal at
mmq_x=128, NOT occupancy-starved via the cheap levers; the only untested lever is the structural
mmq_y-down (nwarps=4 warp-remap), deferred to P2. Bit-exact gate holds throughout.

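Why the P0 gate picks m in {2048,1600,2050} can be checked with the row-tile arithmetic the kernel uses (nty = ceil(nrows_x / mmq_y)). The helper below is a hypothetical sketch for illustration only: `RowTiling` and `row_tiling` are not names from llama.cpp, and it models just the tile count and the size of the last (possibly partial) row tile.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of how nrows weight rows split into mmq_y-high row tiles.
struct RowTiling {
    int64_t n_tiles;        // ceil(nrows / mmq_y), same formula as nty in mul_mat_q
    int64_t last_tile_rows; // rows covered by the final tile
    bool    ragged;         // true when the final tile is only partially filled
};

static RowTiling row_tiling(int64_t nrows, int64_t mmq_y) {
    assert(nrows > 0 && mmq_y > 0);
    const int64_t n_tiles = (nrows + mmq_y - 1) / mmq_y;
    const int64_t rem     = nrows % mmq_y;
    return { n_tiles, rem == 0 ? mmq_y : rem, rem != 0 };
}
```

Under this arithmetic 2048 rows tile exactly at both 64 and 128, 1600 rows are exact at 64 but leave a 64-row partial tile at 128, and 2050 rows leave a 2-row partial tile at both sizes, which matches the "exact + ragged" coverage described for the gate.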
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/mmq.cuh | 85 ++++++++++++++++++++++++++++++++++----
tests/test-backend-ops.cpp | 16 +++++++
2 files changed, 92 insertions(+), 9 deletions(-)
diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh
index 9718b12..b53e38a 100644
--- a/ggml/src/ggml-cuda/mmq.cuh
+++ b/ggml/src/ggml-cuda/mmq.cuh
@@ -140,7 +140,24 @@ static constexpr __device__ int get_mmq_x_max_device() {
#endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE)
}
-static int get_mmq_y_host(const int cc) {
+// [paged patch 0017 / track B] Dense NVFP4 decode mmq_y (weight-row tile) override.
+// mmq_y tiles the N (weight-row) dimension of the FP4-MMA weight GEMM. Lowering it raises the
+// number of resident CTAs (smaller per-CTA shared footprint + smaller per-thread accumulator) to
+// hide LPDDR5x weight-load latency at the M=128 decode tile, WITHOUT re-reading weights: every
+// weight row lives in exactly one row-tile, so total weight traffic is unchanged (bandwidth-
+// neutral) - the dense-decode occupancy lever from FP4_GEMM_SCOPE_B.md s3/s4.1. mmq_y is a PURE
+// N-row tiling knob: the per-output reduction over K is identical for any mmq_y, so the result
+// stays BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4 decode shapes). Default 128 == exact
+// stock behaviour (a default build is byte-identical to stock); build -DGGML_CUDA_FP4_MMQ_Y=64
+// (or 96) to enable the tune. Applies ONLY to NVFP4 on Blackwell; every other type/arch untouched.
+#ifndef GGML_CUDA_FP4_MMQ_Y
+#define GGML_CUDA_FP4_MMQ_Y 128
+#endif
+
+static int get_mmq_y_host(const int cc, const ggml_type type = GGML_TYPE_COUNT) {
+ if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)) {
+ return GGML_CUDA_FP4_MMQ_Y;
+ }
return GGML_CUDA_CC_IS_AMD(cc) ? (GGML_CUDA_CC_IS_RDNA1(cc) ? 64 : 128) :
((GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ? 128 : 64);
}
@@ -154,7 +171,13 @@ if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) {
return MMQ_ITER_K;
}
+template <ggml_type type = GGML_TYPE_COUNT>
static constexpr __device__ int get_mmq_y_device() {
+#if defined(BLACKWELL_MMA_AVAILABLE)
+ if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MMQ_Y != 128) {
+ return GGML_CUDA_FP4_MMQ_Y;
+ }
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
#if defined(GGML_USE_HIP)
#if defined(RDNA1)
return 64;
@@ -170,6 +193,28 @@ static constexpr __device__ int get_mmq_y_device() {
#endif // defined(GGML_USE_HIP)
}
+// [paged patch 0017 / track B] Dense NVFP4 decode occupancy lever: min resident CTAs per SM.
+// The FP4-MMA mul_mat_q is REGISTER-bound to 1 CTA/SM (__launch_bounds__(256,1) => ~255 regs/thread
+// => one resident block, the under-occupancy that strands the kernel at ~3% of FP4 peak at M=128).
+// Raising the __launch_bounds__ min-blocks operand register-caps the compiler so N CTAs co-reside,
+// hiding LPDDR5x weight-load latency by CTA-parallelism (the scope s4.1 occupancy goal) WITHOUT a
+// structural mmq_y/nwarps change and WITHOUT extra weight reads (each weight tile still read once).
+// Register allocation cannot change results => BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4).
+// Default 1 == exact stock behaviour (byte-identical); build -DGGML_CUDA_FP4_MINBLOCKS=2 to enable.
+// Applies ONLY to NVFP4 on Blackwell; every other type/arch keeps the stock min-blocks.
+#ifndef GGML_CUDA_FP4_MINBLOCKS
+#define GGML_CUDA_FP4_MINBLOCKS 1
+#endif
+template <ggml_type type = GGML_TYPE_COUNT>
+static constexpr __device__ int mmq_get_min_blocks_device(const int stock) {
+#if defined(BLACKWELL_MMA_AVAILABLE)
+ if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MINBLOCKS != 1) {
+ return GGML_CUDA_FP4_MINBLOCKS;
+ }
+#endif // defined(BLACKWELL_MMA_AVAILABLE)
+ return stock;
+}
+
// Decouple shared memory tile sizes from WARP_SIZE to allow for different warp sizes.
// The K dimension of the tiles has either,
// 1*MMQ_TILE_NE_K==32 (always for TILE_Y_K) or 2*MMQ_TILE_NE_K==64 (typically for TILE_X_K),
@@ -3454,7 +3499,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
constexpr int nwarps = mmq_get_nwarps_device();
constexpr int qk = ggml_cuda_type_traits<type>::qk;
- constexpr int mmq_y = get_mmq_y_device();
+ constexpr int mmq_y = get_mmq_y_device<type>();
constexpr load_tiles_mmq_t load_tiles = mmq_type_traits<mmq_x, mmq_y, need_check, type>::load_tiles;
extern __shared__ int data_mul_mat_q[];
@@ -3531,13 +3576,13 @@ static __device__ __forceinline__ void mul_mat_q_process_tile(
template <ggml_type type, int mmq_x, bool need_check>
#if defined(GGML_USE_HIP)
#if defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
+ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
#endif // defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN)
#else
#if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 1)
+ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(1))
#else
- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2)
+ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device<type>(2))
#endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA
#endif // defined(GGML_USE_HIP)
static __global__ void mul_mat_q(
@@ -3558,7 +3603,7 @@ static __global__ void mul_mat_q(
constexpr int warp_size = ggml_cuda_get_physical_warp_size();
constexpr int qk = ggml_cuda_type_traits<type>::qk;
- constexpr int mmq_y = get_mmq_y_device();
+ constexpr int mmq_y = get_mmq_y_device<type>();
const uint32_t nty = (nrows_x + mmq_y - 1) / mmq_y; // Number of tiles y
@@ -3790,7 +3835,7 @@ static __global__ void mul_mat_q_stream_k_fixup(
float * __restrict__ tmp_last_tile, const uint3 blocks_per_ne00, const int nrows_x, const int ncols_dst,
const int stride_col_dst, const uint3 nchannels_y, const int stride_channel_dst, const uint3 nsamples_y,
const int stride_sample_dst, const uint3 ntx) {
- constexpr int mmq_y = get_mmq_y_device();
+ constexpr int mmq_y = get_mmq_y_device<type>();
constexpr int qk = ggml_cuda_type_traits<type>::qk;
constexpr int ITER_K = get_iter_k(type);
constexpr int blocks_per_iter = ITER_K / qk;
@@ -3947,7 +3992,7 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a
const int nsm = ggml_cuda_info().devices[id].nsm;
const int warp_size = ggml_cuda_info().devices[id].warp_size;
const int nwarps = mmq_get_nwarps_host(cc, warp_size);
- const int mmq_y = get_mmq_y_host(cc);
+ const int mmq_y = get_mmq_y_host(cc, type);
const dim3 block_dims(warp_size, nwarps, 1);
@@ -4103,6 +4148,21 @@ static inline int ggml_cuda_moe_density_max() {
return d;
}
+// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off).
+// GGML_CUDA_FP4_DENSE_MMQ_X=<n> caps the dense (non-MoE) NVFP4 col-tile to <n>, splitting the M=128
+// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost
+// in the BW-bound regime) but multiplies resident CTAs. This is the scope s4.1 A/B probe: if
+// decode_agg RISES with cap=64 despite the 2x weight read, occupancy is badly broken (the kernel is
+// compute/occupancy-bound, so mmq_y-down / min-blocks has large upside); if it FALLS, the tile is
+// already bandwidth-saturated and the occupancy ceiling is lower. Unset/<=0 => stock selection.
+static inline int ggml_cuda_fp4_dense_mmq_x_cap() {
+ static const int c = []() -> int {
+ const char * s = getenv("GGML_CUDA_FP4_DENSE_MMQ_X");
+ return s ? atoi(s) : 0;
+ }();
+ return c;
+}
+
template <ggml_type type>
void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) {
const int id = ggml_cuda_get_device();
@@ -4112,7 +4172,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
const int nwarps = mmq_get_nwarps_host(cc, warp_size);
const int mmq_x_max = get_mmq_x_max_host(cc);
- const int mmq_y = get_mmq_y_host(cc);
+ const int mmq_y = get_mmq_y_host(cc, type);
// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON).
// On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens
@@ -4145,6 +4205,13 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda
// - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection).
// - LLAMA_MOE_DECODE_TILE=<n>, LLAMA_MOE_DENSITY_MAX=<n> : tune the tile / threshold.
int mmq_x_lim = mmq_x_max;
+ if (args.expert_bounds == nullptr && type == GGML_TYPE_NVFP4) {
+ // dense NVFP4 decode mmq_x re-read occupancy diagnostic (see ggml_cuda_fp4_dense_mmq_x_cap).
+ const int cap = ggml_cuda_fp4_dense_mmq_x_cap();
+ if (cap > 0 && cap < mmq_x_max) {
+ mmq_x_lim = cap < 8 ? 8 : cap;
+ }
+ }
if (args.expert_bounds != nullptr) {
const int moe_cap = ggml_cuda_moe_mmq_x_cap();
if (moe_cap > 0) {
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index f219309..291c275 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -8591,6 +8591,22 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
}
}
+ // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate.
+ // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the
+ // NVFP4 decode path to raise resident-CTA occupancy. mmq_y is a pure N-row tiling knob, so a
+ // smaller mmq_y must stay BIT-EXACT (identical per-output reduction over K) - this gate proves
+ // it. m = weight rows (N, tiled by mmq_y): 2048 (exact at mmq_y 64 & 128), 1600 (ragged vs 128),
+ // 2050 (ragged vs both 64 & 128 -> exercises the need_check last-row-tile at both). n = decode
+ // token count M = 32 and 128 (the scope decode shapes, tiled by mmq_x). k = 2048 hidden. Must
+ // pass with the default build (mmq_y=128) AND a mmq_y=64 build, CUDA-vs-CPU oracle, bit-exact.
+ for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) {
+ for (int64_t m : {2048, 1600, 2050}) {
+ for (int64_t n : {32, 128}) {
+ test_cases.emplace_back(new test_mul_mat(type_a, GGML_TYPE_F32, m, n, 2048, {1, 1}, {1, 1}));
+ }
+ }
+ }
+
for (ggml_type type_a : all_types) {
test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a)));
}
--
2.43.0


@@ -1,349 +0,0 @@
From 17f16e8f6d8dbc689d5151c44759792d683c957b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 00:44:13 +0200
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state
write-back (patch 0018)

Decode on the Qwen3.6 hybrid-SSM models (arch qwen35, 48 gated-DeltaNet :
16 full-attention layers) was dominated by recurrent-state plumbing, not the
FP4 GEMM. Per SSM layer per step the fused gated_delta_net op wrote its new
recurrent state into graph scratch, then a separate ggml_cpy persisted it into
the recurrent-state cache. nsys attributed 18.9% of decode GPU time to that
~225 MB/copy D2D memcpy (1584 ops, 356 GB over the A2 decompose window).

This mirrors vLLM's fused_recurrent_gated_delta_rule (state kept in place):
ggml_gated_delta_net_inplace writes the final recurrent state directly into the
active sequences' contiguous cache slot (at kv_head), removing the copy-back. The
op output then carries only the attention scores; the SSM arithmetic is
unchanged (bit-identical greedy output vs the copy-back baseline).

- new op builder ggml_gated_delta_net_inplace (src[6] = state_dst cache view)
- CUDA + CPU honor src[6]; final-state (K==1, keep_rs off) write redirected there
- delta-net-base build_recurrent_attn uses it on the fused decode/prefill path,
  dropping the ggml_cpy; rollback (n_rs_seq>0) path unchanged

Measured (q36-27b-nvfp4, decode_agg S_TG, npp128 ntg128, -fa on, paged on):
  npl 32 : 113.74 -> 136.39 t/s (+19.9 percent)
  npl 128: 146.23 -> 180.53 t/s (+23.5 percent, = predicted copy-removal ceiling)
  MoE q36-35b-a3b-nvfp4: npl128 313.36 -> 372.62 t/s (+18.9 percent).

nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share
(391 @128) 37.4 -> 46.2 percent. get_rows state gather (now 18.8 percent) is the
next lever.

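The redirect itself is only a choice of base pointer for the final-state write. The sketch below isolates that choice for the K == 1 case under stated assumptions: `final_state_dst` is a hypothetical helper (the patch inlines the expression in the CPU op), both buffers are assumed to use the dense [S_v, S_v, H, n_seqs] float layout, and passing nullptr for the cache view models the op being built without src[6].

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: where the final recurrent state of head iv1 in sequence iv3 is written.
// state_out_base     : state region appended to the op output (copy-back baseline)
// inplace_state_base : view into the persistent recurrent-state cache, or nullptr when absent
static float * final_state_dst(float * state_out_base, float * inplace_state_base,
                               int64_t iv3, int64_t iv1, int64_t H, int64_t S_v) {
    assert(state_out_base != nullptr || inplace_state_base != nullptr);
    // prefer the cache view when present, so no separate copy node is needed afterwards
    float * base = inplace_state_base ? inplace_state_base : state_out_base;
    // one S_v x S_v state matrix per (sequence, head) pair
    return base + (iv3 * H + iv1) * S_v * S_v;
}
```

Because the per-(sequence, head) offset is identical for both bases, only the destination buffer changes, which is consistent with the arithmetic and the greedy output staying the same as in the copy-back baseline.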
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/include/ggml.h | 14 ++++++
ggml/src/ggml-cpu/ops.cpp | 13 ++++-
ggml/src/ggml-cuda/gated_delta_net.cu | 39 ++++++++++-----
ggml/src/ggml.c | 68 +++++++++++++++++++++++++++
src/models/delta-net-base.cpp | 30 ++++++++++++
5 files changed, 152 insertions(+), 12 deletions(-)
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 823f5a9..4e7ab32 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2579,6 +2579,20 @@ extern "C" {
struct ggml_tensor * state,
int64_t K);
+ // same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+ // in place into state_dst (a view into the recurrent-state cache) instead of being appended to
+ // the op output, eliminating the per-step state copy-back during decode. state_dst must be a
+ // contiguous [S_v*S_v*H, n_seqs] view (per-seq stride == dense state size).
+ GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace(
+ struct ggml_context * ctx,
+ struct ggml_tensor * q,
+ struct ggml_tensor * k,
+ struct ggml_tensor * v,
+ struct ggml_tensor * g,
+ struct ggml_tensor * beta,
+ struct ggml_tensor * state,
+ struct ggml_tensor * state_dst);
+
// custom operators
typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 63c07a2..9457add 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -10600,6 +10600,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
ggml_tensor * src_g = dst->src[3];
ggml_tensor * src_beta = dst->src[4];
ggml_tensor * src_state = dst->src[5];
+ ggml_tensor * src_state_dst = dst->src[6]; // optional in-place final-state write-back target
const int64_t S_v = src_v->ne[0];
const int64_t H = src_v->ne[1];
@@ -10660,6 +10661,16 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
const float scale = 1.0f / sqrtf((float) S_v);
+ // when src_state_dst is provided (in-place decode write-back) the final state is written
+ // directly into the persistent cache view, removing the separate state copy-back node.
+ float * inplace_state_base = nullptr;
+ if (src_state_dst != nullptr) {
+ GGML_ASSERT(K == 1);
+ GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+ GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+ inplace_state_base = (float *) src_state_dst->data;
+ }
+
for (int64_t ir = ir0; ir < ir1; ++ir) {
const int64_t iv1 = ir % H; // head_index
const int64_t iv3 = ir / H; // sequence
@@ -10674,7 +10685,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
// For K>1, work in scratch and copy out per-token when the slot is in range.
float * s_out = (K > 1)
? state_work
- : state_out_base + (iv3 * H + iv1) * S_v * S_v;
+ : (inplace_state_base ? inplace_state_base : state_out_base) + (iv3 * H + iv1) * S_v * S_v;
// copy input state into the working buffer and operate in-place
// state layout [S_v, S_v, H, n_seqs]: seq iv3 starts at iv3 * state_seq_stride.
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index a547360..61a2b91 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -25,7 +25,8 @@ gated_delta_net_cuda(const float * q,
const uint3 neqk1_magic,
const uint3 rq3_magic,
float scale,
- int K) {
+ int K,
+ float * state_dst) {
const uint32_t h_idx = blockIdx.x;
const uint32_t sequence = blockIdx.y;
// each warp owns one column, using warp-level primitives to reduce across rows
@@ -37,7 +38,10 @@ gated_delta_net_cuda(const float * q,
const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
float * attn_data = dst;
- float * state = dst + attn_score_elems;
+ // when state_dst is provided (in-place decode write-back) the final recurrent state is written
+ // directly into the persistent cache view instead of being appended to the op output; this
+ // eliminates the per-layer per-step D2D state copy-back. Only used when keep_rs_t == false.
+ float * state = (state_dst != nullptr) ? state_dst : (dst + attn_score_elems);
// input state holds s0 only: [S_v, S_v, H, n_seqs] — seq stride is D = H * S_v * S_v.
// output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before.
@@ -171,7 +175,7 @@ template <bool KDA, bool keep_rs_t>
static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
const float * g_d, const float * b_d, const float * s_d,
- float * dst_d,
+ float * dst_d, float * state_dst_d,
int64_t S_v, int64_t H, int64_t n_tokens, int64_t n_seqs,
int64_t sq1, int64_t sq2, int64_t sq3,
int64_t sv1, int64_t sv2, int64_t sv3,
@@ -195,26 +199,26 @@ static void launch_gated_delta_net(
ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
case 32:
ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
case 64: {
ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
}
case 128: {
ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
break;
}
default:
@@ -230,6 +234,7 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
ggml_tensor * src_g = dst->src[3];
ggml_tensor * src_beta = dst->src[4];
ggml_tensor * src_state = dst->src[5];
+ ggml_tensor * src_state_dst = dst->src[6]; // optional in-place state write-back target
GGML_TENSOR_LOCALS(int64_t, neq, src_q, ne);
GGML_TENSOR_LOCALS(size_t , nbq, src_q, nb);
@@ -260,6 +265,15 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
const float * s_d = (const float *) src_state->data;
float * dst_d = (float *) dst->data;
+ float * state_dst_d = nullptr;
+ if (src_state_dst != nullptr) {
+ // in-place final-state cache view: per-seq stride must be the dense state size D = S_v*S_v*H
+ GGML_ASSERT(src_state_dst->type == GGML_TYPE_F32);
+ GGML_ASSERT(src_state_dst->nb[0] == sizeof(float));
+ GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float));
+ state_dst_d = (float *) src_state_dst->data;
+ }
+
GGML_ASSERT(ggml_is_contiguous_rows(src_q));
GGML_ASSERT(ggml_is_contiguous_rows(src_k));
GGML_ASSERT(ggml_is_contiguous_rows(src_v));
@@ -288,23 +302,26 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
const int K = ggml_get_op_params_i32(dst, 0);
const bool keep_rs = K > 1;
+ // in-place write-back is only valid for the single-snapshot (final-state) case
+ GGML_ASSERT(state_dst_d == nullptr || !keep_rs);
+
if (kda) {
if (keep_rs) {
- launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
- launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
} else {
if (keep_rs) {
- launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
- launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d,
+ launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index adbe52b..b8d34bf 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -6285,6 +6285,74 @@ struct ggml_tensor * ggml_gated_delta_net(
return result;
}
+// ggml_gated_delta_net_inplace
+//
+// Same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written
+// in place into `state_dst` (a view into the persistent recurrent-state cache) instead of being
+// appended to the op output. This removes the per-layer per-step D2D state copy-back during decode.
+// The op output holds ONLY the attention scores; the state region is still allocated (unused) so
+// the attention-output view layout is identical to ggml_gated_delta_net.
+struct ggml_tensor * ggml_gated_delta_net_inplace(
+ struct ggml_context * ctx,
+ struct ggml_tensor * q,
+ struct ggml_tensor * k,
+ struct ggml_tensor * v,
+ struct ggml_tensor * g,
+ struct ggml_tensor * beta,
+ struct ggml_tensor * state,
+ struct ggml_tensor * state_dst) {
+ GGML_ASSERT(ggml_is_contiguous_rows(q));
+ GGML_ASSERT(ggml_is_contiguous_rows(k));
+ GGML_ASSERT(ggml_is_contiguous_rows(v));
+ GGML_ASSERT(ggml_is_contiguous(g));
+ GGML_ASSERT(ggml_is_contiguous(beta));
+ GGML_ASSERT(ggml_is_contiguous(state));
+
+ GGML_ASSERT(q->type == GGML_TYPE_F32);
+ GGML_ASSERT(k->type == GGML_TYPE_F32);
+ GGML_ASSERT(v->type == GGML_TYPE_F32);
+ GGML_ASSERT(g->type == GGML_TYPE_F32);
+ GGML_ASSERT(beta->type == GGML_TYPE_F32);
+ GGML_ASSERT(state->type == GGML_TYPE_F32);
+ GGML_ASSERT(state_dst != NULL);
+ GGML_ASSERT(state_dst->type == GGML_TYPE_F32);
+
+ const int64_t S_v = v->ne[0];
+ const int64_t H = v->ne[1];
+ const int64_t n_tokens = v->ne[2];
+ const int64_t n_seqs = v->ne[3];
+
+ GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
+ GGML_ASSERT(beta->ne[0] == 1);
+
+ GGML_ASSERT(state->ne[0] == S_v);
+ GGML_ASSERT(state->ne[1] == S_v);
+ GGML_ASSERT(state->ne[2] == H);
+ GGML_ASSERT(state->ne[3] == n_seqs);
+
+ // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
+ GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
+ GGML_ASSERT(state_dst->ne[1] >= n_seqs);
+ GGML_ASSERT(state_dst->nb[0] == sizeof(float));
+
+ const int64_t state_rows = S_v * n_seqs; // K == 1
+ const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
+ struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+ ggml_set_op_params_i32(result, 0, 1); // K == 1
+
+ result->op = GGML_OP_GATED_DELTA_NET;
+ result->src[0] = q;
+ result->src[1] = k;
+ result->src[2] = v;
+ result->src[3] = g;
+ result->src[4] = beta;
+ result->src[5] = state;
+ result->src[6] = state_dst;
+
+ return result;
+}
+
////////////////////////////////////////////////////////////////////////////////
struct ggml_hash_set ggml_hash_set_new(size_t size) {
diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
index ad9ce77..26a718b 100644
--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
@@ -546,6 +546,36 @@ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
const bool keep = cparams.n_rs_seq > 0;
if (!keep) {
+ const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
+
+ if (fused) {
+ // In-place state write-back: the fused gated-DeltaNet op writes the new recurrent state
+ // directly into the persistent cache slot for the active sequences (a contiguous block
+ // at kv_head), eliminating the per-layer per-step ~full-state D2D copy-back that
+ // dominated decode. The op output then carries only the attention scores.
+ ggml_tensor * state_dst = ggml_view_2d(ctx0, ssm_states_all, hparams.n_embd_s(), n_seqs,
+ ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
+
+ ggml_tensor * result = ggml_gated_delta_net_inplace(ctx0, q, k, v, g, b, s, state_dst);
+ if (n_seq_tokens == 1) {
+ cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
+ } else {
+ cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
+ }
+
+ ggml_tensor * output = ggml_view_4d(ctx0, result,
+ S_v, H_v, n_seq_tokens, n_seqs,
+ ggml_row_size(result->type, S_v),
+ ggml_row_size(result->type, S_v * H_v),
+ ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
+ cb(output, "attn_output", il);
+
+ // the state write is a side effect of the op; pull the op into the graph via the output
+ ggml_build_forward_expand(gf, output);
+
+ return output;
+ }
+
auto attn_out = build_delta_net(q, k, v, g, b, s, il);
ggml_tensor * output = attn_out.first;
ggml_tensor * new_state = attn_out.second;
--
2.43.0

@@ -1,583 +0,0 @@
From 46d7dd80bbce7f3c1dbf9363d6527c8c9b687a6b Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 01:45:02 +0200
Subject: [PATCH] feat(paged): qwen35 SSM decode fused recurrent-state gather
(patch 0019)
Step 2 of the SSM decode-throughput work. After Step 1 (in-place state
write-back, patch 0018) the largest non-GEMM decode bucket was the recurrent-
state get_rows gather (18.8% of decode GPU time): build_rs materialized each
sequence's prior state into a contiguous scratch via ggml_get_rows before the
gated-DeltaNet op read it.
This eliminates that materialization, mirroring ggml_ssm_scan's ids source.
ggml_gated_delta_net_inplace_ids takes the FULL recurrent-state cache plus the
s_copy ids (src[5] = full cache, src[7] = ids, op_param[1] = rs_head) and reads
each sequence's prior state directly from cache[ids[seq]]. Combined with Step 1's
in-place write the op now reads AND writes the cache directly: no recurrent-state
materialization at all. build_recurrent_attn feeds the full cache + ids through
the build_rs get_state_rows lambda exactly like mamba-base, keeping the rs_zero
clear and the extra-states copy around the op.
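The difference between the two read paths can be sketched on plain host arrays. This is a minimal illustration under stated assumptions, not the ggml implementation: `gather_rows`, `read_slot` and `paths_agree` are hypothetical helpers, the cache is modelled as a flat `std::vector<float>` holding `n_slots * D` floats, and `ids` plays the role of s_copy.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper, old path: copy cache[ids[s]] for every sequence into a
// contiguous scratch buffer (the materialization that get_rows performed).
std::vector<float> gather_rows(const std::vector<float> & cache,
                               const std::vector<int32_t> & ids, int64_t D) {
    const int64_t n_seqs = (int64_t) ids.size();
    std::vector<float> scratch((size_t) (n_seqs * D));
    for (int64_t s = 0; s < n_seqs; ++s) {
        std::memcpy(scratch.data() + s * D,
                    cache.data() + (int64_t) ids[(size_t) s] * D,
                    (size_t) D * sizeof(float));
    }
    return scratch;
}

// Hypothetical helper, new path: return a pointer into the cache, no copy.
const float * read_slot(const std::vector<float> & cache,
                        const std::vector<int32_t> & ids, int64_t D, int64_t seq) {
    return cache.data() + (int64_t) ids[(size_t) seq] * D;
}

// True when both paths observe the same prior state for every sequence.
bool paths_agree(const std::vector<float> & cache,
                 const std::vector<int32_t> & ids, int64_t D) {
    const std::vector<float> scratch = gather_rows(cache, ids, D);
    for (int64_t s = 0; s < (int64_t) ids.size(); ++s) {
        if (std::memcmp(scratch.data() + s * D, read_slot(cache, ids, D, s),
                        (size_t) D * sizeof(float)) != 0) {
            return false;
        }
    }
    return true;
}
```

The indexed read sees the same bytes as the gathered copy, so skipping the copy changes the memory traffic but not the values the recurrence consumes.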
Race-free by construction on CUDA. In-place write plus an ids read of the same
cache is only safe when read slot == write slot; s_copy is identity
(rs_head + s) for stable continuing sequences (the whole AR decode path) but can
remap on reorder or rs_zero (e.g. multiple new sequences in one prefill ubatch).
The recurrence kernel handles both per (seq, head) block on device: identity
sequences read s0 in place from the destination slot (the kernel loads all of s0
into registers before writing, so reading and writing the same slot is safe),
and non-identity sequences read from a disjoint scratch that a small gather
kernel copies from cache[ids[seq]] first, so the recurrence never reads a slot
another block writes. The CPU op mirrors this (host identity check + a serial
gather in the dispatcher). ids stays a device pointer (read only in-kernel; it is
device-resident at op-execute time). Bit-identical to the get_rows path in every
case.
- new builder ggml_gated_delta_net_inplace_ids; CUDA gather kernel
(gdn_gather_nonident) + per-block read-base select in gated_delta_net_cuda;
CPU identity guard + serial gather fallback in the dispatcher
- delta-net-base build_recurrent_attn gains a gather-free overload; qwen35 and
qwen35moe drop the pre-gather. qwen3next, kimi-linear, the non-fused path and
the rollback (n_rs_seq > 0) path are unchanged.
Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off):
  dense q36-27b-nvfp4  : npl  32  137.64 -> 170.68 (+24.0 percent)
                         npl 128  186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391)
  MoE q36-35b-a3b-nvfp4: npl  32  299.68 -> 366.69 (+22.4 percent)
                         npl 128  409.30 -> 553.63 (+35.3 percent)
Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build
(dense model text md5 match, MoE byte-identical, step2 run1 == run2). nsys
k_get_rows_float bucket 18.8 -> 0.7 percent; the new gdn_gather_nonident kernel
is 1.7 percent (no-op at decode, median 1.2 us). The residual decode gap to vLLM
is now the FP4 GEMM (~48 percent of decode), a separate kernel track.
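The percentages quoted above can be re-derived from the raw throughput figures. A small sketch, where `pct_gain` and `pct_of` are hypothetical helper names introduced only for this check:

```cpp
#include <cassert>
#include <cmath>

// Relative gain of `after` over `before`, in percent.
double pct_gain(double before, double after) {
    return (after / before - 1.0) * 100.0;
}

// `value` expressed as a percentage of `reference`.
double pct_of(double value, double reference) {
    return value / reference * 100.0;
}
```

For example, `pct_gain(186.25, 256.57)` rounds to 37.8 and `pct_of(256.57, 391.0)` rounds to 65.6, consistent with the dense npl 128 row.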
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/include/ggml.h | 17 ++++++
ggml/src/ggml-cpu/ops.cpp | 49 ++++++++++++++-
ggml/src/ggml-cuda/gated_delta_net.cu | 85 ++++++++++++++++++++++----
ggml/src/ggml.c | 76 +++++++++++++++++++++++
src/models/delta-net-base.cpp | 63 ++++++++++++++++++++
src/models/models.h | 13 ++++
src/models/qwen35.cpp | 6 +-
src/models/qwen35moe.cpp | 6 +-
8 files changed, 292 insertions(+), 23 deletions(-)
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 4e7ab32..951dd21 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2593,6 +2593,23 @@ extern "C" {
struct ggml_tensor * state,
struct ggml_tensor * state_dst);
+ // Step 2: same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read
+ // directly from the full state cache via per-sequence indices (ids == s_copy), mirroring
+ // ggml_ssm_scan, instead of from a materialized ggml_get_rows gather. `state` is the FULL cache
+ // [S_v, S_v, H, n_rs_slots]; `ids` are the per-seq source slots; `rs_head` is the destination
+ // base slot. Eliminates the recurrent-state gather on the decode path.
+ GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
+ struct ggml_context * ctx,
+ struct ggml_tensor * q,
+ struct ggml_tensor * k,
+ struct ggml_tensor * v,
+ struct ggml_tensor * g,
+ struct ggml_tensor * beta,
+ struct ggml_tensor * state,
+ struct ggml_tensor * state_dst,
+ struct ggml_tensor * ids,
+ int rs_head);
+
// custom operators
typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata);
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 9457add..b6a1976 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -10633,7 +10633,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
const int64_t K = ggml_get_op_params_i32(dst, 0);
GGML_ASSERT(K >= 1);
// per-seq stride in floats (seq s starts at state + s * seq_stride)
- const int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
+ int64_t state_seq_stride = src_state->nb[3] / sizeof(float);
const int64_t per_thread = S_v + (K > 1 ? S_v * S_v : 0);
const int ith = params->ith;
@@ -10654,6 +10654,26 @@ static void ggml_compute_forward_gated_delta_net_one_chunk(
const float * state_in_base = (const float *)src_state->data;
+ // Step 2: fused recurrent-state gather (ids == s_copy in src[7]). Read the prior state directly
+ // from the full cache at cache[ids[seq]] instead of from a materialized gather. For the identity
+ // decode case the prior state is the in-place destination block [rs_head, rs_head+n_seqs);
+ // otherwise the dispatcher has gathered cache[ids[seq]] into the (unused) output-state scratch
+ // region. Bit-identical to the get_rows path.
+ ggml_tensor * src_ids = dst->src[7];
+ if (src_ids != nullptr) {
+ const int64_t D = S_v * S_v * H;
+ const int32_t rs_head = ggml_get_op_params_i32(dst, 1);
+ const int32_t * ids = (const int32_t *) src_ids->data;
+ bool identity = true;
+ for (int64_t s = 0; s < n_seqs; ++s) {
+ if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
+ }
+ state_seq_stride = D;
+ state_in_base = identity
+ ? (const float *) src_state->data + (int64_t) rs_head * D
+ : (const float *) state_out_base; // gathered by the dispatcher (non-identity)
+ }
+
//const int64_t rq1 = nev1 / neq1;
//const int64_t rk1 = nev1 / nek1;
const int64_t rq3 = nev3 / neq3;
@@ -10777,6 +10797,33 @@ static void ggml_compute_forward_gated_delta_net_f32(
if (ith == 0) {
ggml_threadpool_chunk_set(params->threadpool, nth);
+
+ // Step 2: non-identity ids fallback -- serially gather each sequence's prior state from
+ // cache[ids[seq]] into the (otherwise unused) output-state scratch region before the parallel
+ // recurrence, so the in-place write never aliases another sequence's read.
+ ggml_tensor * src_ids = dst->src[7];
+ if (src_ids != nullptr) {
+ const ggml_tensor * src_state = dst->src[5];
+ const int64_t S_v = V->ne[0];
+ const int64_t H = V->ne[1];
+ const int64_t n_tokens = V->ne[2];
+ const int64_t n_seqs = V->ne[3];
+ const int64_t D = S_v * S_v * H;
+ const int32_t rs_head = ggml_get_op_params_i32(dst, 1);
+ const int32_t * ids = (const int32_t *) src_ids->data;
+ bool identity = true;
+ for (int64_t s = 0; s < n_seqs; ++s) {
+ if (ids[s] != rs_head + (int32_t) s) { identity = false; break; }
+ }
+ if (!identity) {
+ const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs;
+ const float * cache = (const float *) src_state->data;
+ float * scratch = (float *) dst->data + attn_score_elems;
+ for (int64_t s = 0; s < n_seqs; ++s) {
+ memcpy(scratch + s * D, cache + (int64_t) ids[s] * D, D * sizeof(float));
+ }
+ }
+ }
}
ggml_barrier(params->threadpool);
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index 61a2b91..86d5e2a 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -1,6 +1,34 @@
#include "gated_delta_net.cuh"
#include "ggml-cuda/common.cuh"
+// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
+// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+// destination slot by the recurrence kernel and are skipped here. One block per sequence.
+__global__ void gdn_gather_nonident_kernel(const float * cache, const int32_t * ids, int rs_head,
+ float * scratch, int64_t D, int n_seqs) {
+ const int s = blockIdx.x;
+ if (s >= n_seqs) {
+ return;
+ }
+ const int r = ids[s];
+ if (r == rs_head + s) {
+ return; // identity: prior state already lives in the in-place destination slot
+ }
+ const float * src = cache + (int64_t) r * D;
+ float * dst = scratch + (int64_t) s * D;
+ for (int64_t i = threadIdx.x; i < D; i += blockDim.x) {
+ dst[i] = src[i];
+ }
+}
+
+static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * ids, int rs_head,
+ float * scratch, int64_t D, int64_t n_seqs, cudaStream_t stream) {
+ if (n_seqs <= 0) {
+ return;
+ }
+ gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
+}
+
template <int S_v, bool KDA, bool keep_rs_t>
__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
gated_delta_net_cuda(const float * q,
@@ -26,7 +54,9 @@ gated_delta_net_cuda(const float * q,
const uint3 rq3_magic,
float scale,
int K,
- float * state_dst) {
+ float * state_dst,
+ const int32_t * ids,
+ int rs_head) {
const uint32_t h_idx = blockIdx.x;
const uint32_t sequence = blockIdx.y;
// each warp owns one column, using warp-level primitives to reduce across rows
@@ -48,7 +78,15 @@ gated_delta_net_cuda(const float * q,
const int64_t state_in_offset = sequence * H * S_v * S_v + h_idx * S_v * S_v;
const int64_t state_out_offset = (sequence * H + h_idx) * S_v * S_v;
state += state_out_offset;
- curr_state += state_in_offset + col * S_v;
+ // Step 2: select the prior-state read base per sequence. For the ids variant, identity
+ // sequences (ids[seq] == rs_head + seq) read s0 directly from the in-place destination slot
+ // state_dst (no materialization); non-identity sequences read from the pre-gathered scratch
+ // (curr_state). state_in_offset == state_out_offset, so both bases use the same per-(seq,head)
+ // offset. The whole s0 is loaded into registers before the new state is written, so reading and
+ // writing the same slot per block (identity) is race-free.
+ const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
+ ? state_dst : curr_state;
+ read_state += state_in_offset + col * S_v;
attn_data += (sequence * n_tokens * H + h_idx) * S_v;
constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
@@ -61,7 +99,7 @@ gated_delta_net_cuda(const float * q,
#pragma unroll
for (int r = 0; r < rows_per_lane; r++) {
const int i = r * warp_size + lane;
- s_shard[r] = curr_state[i];
+ s_shard[r] = read_state[i];
}
for (int t = 0; t < n_tokens; t++) {
@@ -176,6 +214,7 @@ static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
const float * g_d, const float * b_d, const float * s_d,
float * dst_d, float * state_dst_d,
+ const int32_t * ids_d, int rs_head,
int64_t S_v, int64_t H, int64_t n_tokens, int64_t n_seqs,
int64_t sq1, int64_t sq2, int64_t sq3,
int64_t sv1, int64_t sv2, int64_t sv3,
@@ -199,26 +238,26 @@ static void launch_gated_delta_net(
ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
break;
case 32:
ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
break;
case 64: {
ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
break;
}
case 128: {
ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d);
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
break;
}
default:
@@ -262,7 +301,6 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
const float * g_d = (const float *) src_g->data;
const float * b_d = (const float *) src_beta->data;
- const float * s_d = (const float *) src_state->data;
float * dst_d = (float *) dst->data;
float * state_dst_d = nullptr;
@@ -274,6 +312,29 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
state_dst_d = (float *) src_state_dst->data;
}
+ // Step 2: fused recurrent-state gather (src[7] = ids == s_copy). Read the prior state directly
+ // from the full cache via ids instead of from a materialized ggml_get_rows gather. The recurrence
+ // kernel reads identity sequences (ids[seq] == rs_head + seq) in place from state_dst (no
+ // materialization at all); any non-identity sequence (reorder / rs_zero remap) is gathered here
+ // into a disjoint scratch that the kernel reads instead. The gather writes a disjoint buffer and
+ // the recurrence never reads a slot another block writes, so it is race-free and bit-identical to
+ // the get_rows path. ids stays a DEVICE pointer (dereferenced only inside the kernels).
+ ggml_tensor * src_ids = dst->src[7];
+ const float * s_d = (const float *) src_state->data;
+ const int32_t * ids_d = nullptr;
+ int rs_head = 0;
+ ggml_cuda_pool_alloc<float> ids_state_scratch(ctx.pool());
+ if (src_ids != nullptr) {
+ GGML_ASSERT(state_dst_d != nullptr);
+ GGML_ASSERT(src_ids->type == GGML_TYPE_I32);
+ rs_head = ggml_get_op_params_i32(dst, 1);
+ ids_d = (const int32_t *) src_ids->data;
+ const int64_t D = S_v * S_v * H;
+ float * scratch = ids_state_scratch.alloc((size_t) D * n_seqs);
+ ggml_cuda_gdn_gather_nonident(s_d, ids_d, rs_head, scratch, D, n_seqs, ctx.stream());
+ s_d = scratch;
+ }
+
GGML_ASSERT(ggml_is_contiguous_rows(src_q));
GGML_ASSERT(ggml_is_contiguous_rows(src_k));
GGML_ASSERT(ggml_is_contiguous_rows(src_v));
@@ -307,21 +368,21 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor *
if (kda) {
if (keep_rs) {
- launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+ launch_gated_delta_net<true, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
- launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+ launch_gated_delta_net<true, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
} else {
if (keep_rs) {
- launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+ launch_gated_delta_net<false, true>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
} else {
- launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d,
+ launch_gated_delta_net<false, false>(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head,
S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
sb1, sb2, sb3, neqk1, rq3, scale, K, stream);
}
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index b8d34bf..1762037 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -6353,6 +6353,82 @@ struct ggml_tensor * ggml_gated_delta_net_inplace(
return result;
}
+// ggml_gated_delta_net_inplace_ids
+//
+// Same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read directly
+// from the FULL state cache `state` ([S_v, S_v, H, n_rs_slots]) at cache[ids[seq]] (mirroring
+// ggml_ssm_scan's ids source) instead of from a materialized ggml_get_rows gather. `rs_head` is the
+// destination base slot, used by the backend to detect the common identity case (ids[s] == rs_head
+// + s), where the prior state already lives in the in-place destination slots.
+struct ggml_tensor * ggml_gated_delta_net_inplace_ids(
+ struct ggml_context * ctx,
+ struct ggml_tensor * q,
+ struct ggml_tensor * k,
+ struct ggml_tensor * v,
+ struct ggml_tensor * g,
+ struct ggml_tensor * beta,
+ struct ggml_tensor * state,
+ struct ggml_tensor * state_dst,
+ struct ggml_tensor * ids,
+ int rs_head) {
+ GGML_ASSERT(ggml_is_contiguous_rows(q));
+ GGML_ASSERT(ggml_is_contiguous_rows(k));
+ GGML_ASSERT(ggml_is_contiguous_rows(v));
+ GGML_ASSERT(ggml_is_contiguous(g));
+ GGML_ASSERT(ggml_is_contiguous(beta));
+ GGML_ASSERT(ggml_is_contiguous(state));
+
+ GGML_ASSERT(q->type == GGML_TYPE_F32);
+ GGML_ASSERT(k->type == GGML_TYPE_F32);
+ GGML_ASSERT(v->type == GGML_TYPE_F32);
+ GGML_ASSERT(g->type == GGML_TYPE_F32);
+ GGML_ASSERT(beta->type == GGML_TYPE_F32);
+ GGML_ASSERT(state->type == GGML_TYPE_F32);
+ GGML_ASSERT(state_dst != NULL && state_dst->type == GGML_TYPE_F32);
+ GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32);
+
+ const int64_t S_v = v->ne[0];
+ const int64_t H = v->ne[1];
+ const int64_t n_tokens = v->ne[2];
+ const int64_t n_seqs = v->ne[3];
+
+ GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v);
+ GGML_ASSERT(beta->ne[0] == 1);
+
+ // state is the FULL recurrent-state cache: [S_v, S_v, H, n_rs_slots], n_rs_slots >= n_seqs
+ GGML_ASSERT(state->ne[0] == S_v);
+ GGML_ASSERT(state->ne[1] == S_v);
+ GGML_ASSERT(state->ne[2] == H);
+ GGML_ASSERT(state->ne[3] >= n_seqs);
+
+ // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs]
+ GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H);
+ GGML_ASSERT(state_dst->ne[1] >= n_seqs);
+ GGML_ASSERT(state_dst->nb[0] == sizeof(float));
+
+ // ids: per-seq source slot into the full cache (s_copy_main)
+ GGML_ASSERT(ids->ne[0] >= n_seqs);
+
+ const int64_t state_rows = S_v * n_seqs; // K == 1
+ const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 };
+ struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne);
+
+ ggml_set_op_params_i32(result, 0, 1); // K == 1
+ ggml_set_op_params_i32(result, 1, rs_head); // destination base slot (for the ids identity check)
+
+ result->op = GGML_OP_GATED_DELTA_NET;
+ result->src[0] = q;
+ result->src[1] = k;
+ result->src[2] = v;
+ result->src[3] = g;
+ result->src[4] = beta;
+ result->src[5] = state; // FULL cache (read via ids)
+ result->src[6] = state_dst; // in-place final-state write-back target
+ result->src[7] = ids; // per-seq source slots (s_copy)
+
+ return result;
+}
+
////////////////////////////////////////////////////////////////////////////////
struct ggml_hash_set ggml_hash_set_new(size_t size) {
diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
index 26a718b..194e611 100644
--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
@@ -524,6 +524,69 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
return conv_input;
}
+// Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
+// gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
+// ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
+// and rollback paths fall back to materializing the prior state and delegating below.
+ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
+ llm_graph_input_rs * inp,
+ ggml_tensor * ssm_states_all,
+ ggml_tensor * q,
+ ggml_tensor * k,
+ ggml_tensor * v,
+ ggml_tensor * g,
+ ggml_tensor * b,
+ int il) {
+ const auto * mctx_cur = inp->mctx;
+ const auto kv_head = mctx_cur->get_head();
+
+ const int64_t S_v = v->ne[0];
+ const int64_t H_v = v->ne[1];
+ const int64_t n_seqs = v->ne[3];
+ const int64_t n_seq_tokens = q->ne[2];
+
+ const bool keep = cparams.n_rs_seq > 0;
+ const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch;
+
+ if (!keep && fused) {
+ // build_rs feeds the FULL state cache + the s_copy ids into the op (via the get_state_rows
+ // lambda, exactly like mamba-base's ggml_ssm_scan) and still performs the rs_zero clear and
+ // the extra-states copy around it. The op reads curr_state from cache[ids[seq]] and writes
+ // the final state in place at kv_head; no recurrent-state materialization at all.
+ auto get_state_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * {
+ ggml_tensor * cache4d = ggml_reshape_4d(ctx, states, S_v, S_v, H_v, states->ne[1]);
+ ggml_tensor * state_dst = ggml_view_2d(ctx, ssm_states_all, hparams.n_embd_s(), n_seqs,
+ ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all));
+ return ggml_gated_delta_net_inplace_ids(ctx, q, k, v, g, b, cache4d, state_dst, ids, (int) kv_head);
+ };
+
+ ggml_tensor * result = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs, get_state_op);
+ if (n_seq_tokens == 1) {
+ cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il);
+ } else {
+ cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il);
+ }
+
+ ggml_tensor * output = ggml_view_4d(ctx0, result,
+ S_v, H_v, n_seq_tokens, n_seqs,
+ ggml_row_size(result->type, S_v),
+ ggml_row_size(result->type, S_v * H_v),
+ ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0);
+ cb(output, "attn_output", il);
+
+ // the state write is a side effect of the op; pull the op into the graph via the output
+ ggml_build_forward_expand(gf, output);
+
+ return output;
+ }
+
+ // non-fused / rollback: materialize the prior state via gather and delegate to the
+ // state-taking overload (its fused !keep branch performs the Step-1 in-place write).
+ ggml_tensor * s = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+ s = ggml_reshape_4d(ctx0, s, S_v, S_v, H_v, n_seqs);
+ return build_recurrent_attn(inp, ssm_states_all, q, k, v, g, b, s, il);
+}
+
ggml_tensor * llm_build_delta_net_base::build_recurrent_attn(
llm_graph_input_rs * inp,
ggml_tensor * ssm_states_all,
diff --git a/src/models/models.h b/src/models/models.h
index 2ac8415..98b89e9 100644
--- a/src/models/models.h
+++ b/src/models/models.h
@@ -88,6 +88,19 @@ struct llm_build_delta_net_base : public llm_graph_context {
ggml_tensor * b,
ggml_tensor * s,
int il);
+
+ // Step 2: gather-free variant. Reads the prior recurrent state directly from the full cache via
+ // the s_copy ids (no ggml_get_rows materialization) on the fused decode/prefill path, and
+ // delegates to the state-taking overload for the non-fused and rollback paths.
+ ggml_tensor * build_recurrent_attn(
+ llm_graph_input_rs * inp,
+ ggml_tensor * ssm_states_all,
+ ggml_tensor * q,
+ ggml_tensor * k,
+ ggml_tensor * v,
+ ggml_tensor * g,
+ ggml_tensor * b,
+ int il);
};
struct llm_build_rwkv6_base : public llm_graph_context {
diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
index 6783d98..0be3247 100644
--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
@@ -385,10 +385,6 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
- state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
- cb(state, "state_predelta", il);
-
ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
cb(conv_output_proper, "conv_output_raw", il);
@@ -445,7 +441,7 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
cb(k_conv, "k_conv_predelta", il);
cb(v_conv, "v_conv_predelta", il);
- ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
+ ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
// z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
index eb5e9a4..2995f04 100644
--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
@@ -409,10 +409,6 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
- state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
- cb(state, "state_predelta", il);
-
ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
cb(conv_output_proper, "conv_output_raw", il);
@@ -469,7 +465,7 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
cb(k_conv, "k_conv_predelta", il);
cb(v_conv, "v_conv_predelta", il);
- ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il);
+ ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il);
// z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim]
ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs);
--
2.43.0

@@ -1,140 +0,0 @@
From df1cc97b68df048834ab735c944b71c3a2e8737e Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 12:40:49 +0200
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape
(patch 0020)
Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM
models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). Post-SSM
(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling
both engines pinned the largest llama-specific overage to the gated-DeltaNet
OUTPUT projection (ssm_out).
The GDN op left its output in SSM layout and the graph reshaped it to 3D
[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so
src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128
sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the
ssm_out weight read across the 128 sequences (one 5120x128 grid, 48 calls/step,
the 40%-vs-62% GPU-utilization gap). vLLM packs the same projection into one
M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.
The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs]
(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128
routes to the MMQ M=128 tensor-core GEMM (which amortizes the weight read across
all 128 tokens). The result is then already 2D, so the redundant post-matmul
reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical.
Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs
untouched.
Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both
q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical.
test-backend-ops MUL_MAT and MUL_MAT_ID OK.
decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):
dense q36-27b: 170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
MoE q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)
Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).
nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses
to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) with a LOWER
per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call
vs 2.77 ms/call for the old GEMV.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/models/qwen35.cpp | 13 ++++---
src/models/qwen35moe.cpp | 13 ++++---
src/models/qwen3next.cpp | 13 ++++---
3 files changed, 21 insertions(+), 18 deletions(-)
diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
index 0be3247..0874c43 100644
--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
@@ -449,17 +449,18 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
// Apply gated normalization: self.norm(core_attn_out, z)
ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+ // data, just a 2D vs 3D view, so the result is bit-identical.
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cb(final_output, "final_output", il);
- // Output projection
+ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
cb(cur, "linear_attn_out", il);
- // Reshape back to original dimensions
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
return cur;
}
diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
index 2995f04..1f6f643 100644
--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
@@ -473,17 +473,18 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
// Apply gated normalization: self.norm(core_attn_out, z)
ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+ // data, just a 2D vs 3D view, so the result is bit-identical.
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cb(final_output, "final_output", il);
- // Output projection
+ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s);
cb(cur, "linear_attn_out", il);
- // Reshape back to original dimensions
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
return cur;
}
diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
index 97200a4..bfdf026 100644
--- a/src/models/qwen3next.cpp
+++ b/src/models/qwen3next.cpp
@@ -519,17 +519,18 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
// Apply gated normalization: self.norm(core_attn_out, z)
ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il);
- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim]
- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs);
+ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the
+ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior
+ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ
+ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous
+ // data, just a 2D vs 3D view, so the result is bit-identical.
+ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs);
cb(final_output, "final_output", il);
- // Output projection
+ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs])
cur = build_lora_mm(model.layers[il].ssm_out, final_output);
cb(cur, "linear_attn_out", il);
- // Reshape back to original dimensions
- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs);
-
return cur;
}
--
2.43.0

@@ -1,655 +0,0 @@
From 58426b58aaf5431a59d499d513b2fe2d6ab990d8 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 18:55:54 +0200
Subject: [PATCH] feat(paged): qwen35 decode conv-state in-place fusion (patch
0021)
The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate
design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet
recurrence is already single-pass at the f32 byte floor), the decode conv path
was the only remaining bit-exact lever.
New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated
by a non-null src[3]). On the single-token decode path it replaces the four-op
conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu
+ ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per
(channel, sequence), assembles the width-K window in registers from the K-1 cached
taps plus the current qkv_mixed token, computes the depthwise conv with the SAME
ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv
output, and writes the 1-token-shifted ring state back IN PLACE into the conv
cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018
in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and
write target (the cache view) are disjoint buffers, so it is race-free by
construction with no ids/identity logic.
- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel,
src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring;
op_params[0]=fuse_silu)
- ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel +
ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv
- ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels)
+ branch in ggml_compute_forward_ssm_conv
- delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs
conv-tap gather; fuses conv+silu+shifted write-back)
- qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path
(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep
the original chain
- tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference
test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90.
Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1
(0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5
ac163882... both BYTE-IDENTICAL.
decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session:
dense q36-27b-nvfp4 : npl 32 199.76 -> 202.99 (+1.6%)
npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391)
MoE q36-35b-a3b : npl 32 421.72 -> 432.39 (+2.5%)
npl 128 689.74 -> 713.54 (+3.5%)
Lift holds in eager too (dense npl128 333.62 -> 342.97). Step -11.9 ms/step
(dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the
decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152);
conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state
conv-cache plumbing.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/include/ggml.h | 16 +++++
ggml/src/ggml-cpu/ops.cpp | 73 ++++++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 112 +++++++++++++++++++++++++++++++++
ggml/src/ggml.c | 54 ++++++++++++++++
src/models/delta-net-base.cpp | 51 +++++++++++++++
src/models/models.h | 14 +++++
src/models/qwen35.cpp | 23 +++++--
src/models/qwen35moe.cpp | 23 +++++--
src/models/qwen3next.cpp | 29 ++++++---
tests/test-backend-ops.cpp | 47 ++++++++++++++
10 files changed, 420 insertions(+), 22 deletions(-)
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 951dd21..76fa401 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2447,6 +2447,22 @@ extern "C" {
struct ggml_tensor * sx,
struct ggml_tensor * c);
+ // Fused decode-time depthwise causal conv1d update (mirrors vLLM causal_conv1d_update). Assembles
+ // the width-K conv window in registers from the cached K-1 taps (`conv_states`, [K-1, channels,
+ // n_seqs]) plus the single current token (`x_cur`, [channels, 1, n_seqs]), computes the depthwise
+ // conv with the SAME ascending-tap FMA order as ggml_ssm_conv, optionally folds SiLU, and writes
+ // the 1-token-shifted ring state back IN PLACE into `conv_state_dst` (a [(K-1)*channels, n_seqs]
+ // view into the conv-state cache). This eliminates the concat + transpose + scalar copy-back +
+ // separate silu of the decode conv path. Output: [channels, 1, n_seqs]. Reuses GGML_OP_SSM_CONV;
+ // detected by the backends via a non-null src[3]. n_seq_tokens must be 1 (single-token decode).
+ GGML_API struct ggml_tensor * ggml_ssm_conv_update_inplace(
+ struct ggml_context * ctx,
+ struct ggml_tensor * conv_states,
+ struct ggml_tensor * conv_kernel,
+ struct ggml_tensor * x_cur,
+ struct ggml_tensor * conv_state_dst,
+ bool fuse_silu);
+
GGML_API struct ggml_tensor * ggml_ssm_scan(
struct ggml_context * ctx,
struct ggml_tensor * s,
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index b6a1976..f9cd850 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -9463,13 +9463,84 @@ static void ggml_compute_forward_ssm_conv_f32(
}
}
+// Fused decode-time depthwise causal conv1d update (mirror of the CUDA ssm_conv_update_f32). Reads the
+// K-1 cached taps (src[0]) and the single new token (src[2]), computes the depthwise conv with the same
+// ascending-tap FMA order as ggml_compute_forward_ssm_conv_f32, optionally folds silu, writes the conv
+// output to dst, and writes the 1-token-shifted ring state back in place into src[3]. Threads split
+// over channels.
+static void ggml_compute_forward_ssm_conv_update_f32(
+ const ggml_compute_params * params,
+ ggml_tensor * dst) {
+ const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs]
+ const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
+ const ggml_tensor * x_cur = dst->src[2]; // [channels, 1, n_seqs]
+ ggml_tensor * cdst = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
+
+ const int ith = params->ith;
+ const int nth = params->nth;
+
+ const int64_t d_conv = conv_kernel->ne[0];
+ const int64_t channels = conv_kernel->ne[1];
+ const int64_t n_seqs = conv_states->ne[2];
+ const bool apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
+
+ GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+ GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
+
+ const int64_t states_seq_stride = conv_states->nb[2] / sizeof(float);
+ const int64_t states_ch_stride = conv_states->nb[1] / sizeof(float);
+ const int64_t w_stride = conv_kernel->nb[1] / sizeof(float);
+ const int64_t x_seq_stride = x_cur->nb[2] / sizeof(float);
+ const int64_t dst_seq_stride = dst->nb[2] / sizeof(float);
+ const int64_t cdst_seq_stride = cdst->nb[1] / sizeof(float);
+
+ const float * states_base = (const float *) conv_states->data;
+ const float * w_base = (const float *) conv_kernel->data;
+ const float * x_base = (const float *) x_cur->data;
+ float * cdst_base = (float *) cdst->data;
+ float * dst_base = (float *) dst->data;
+
+ const int64_t dc = (channels + nth - 1) / nth;
+ const int64_t c0 = dc * ith;
+ const int64_t c1 = MIN(c0 + dc, channels);
+
+ for (int64_t s = 0; s < n_seqs; ++s) {
+ for (int64_t c = c0; c < c1; ++c) {
+ const float * states_c = states_base + s * states_seq_stride + c * states_ch_stride;
+ const float * w_c = w_base + c * w_stride;
+ const float xc = x_base[s * x_seq_stride + c];
+
+ // ascending-tap FMA: tap0*w0 + ... + tap_{K-2}*w_{K-2} + xc*w_{K-1} (matches ssm_conv)
+ float sumf = 0.0f;
+ for (int64_t j = 0; j < d_conv - 1; ++j) {
+ sumf += states_c[j] * w_c[j];
+ }
+ sumf += xc * w_c[d_conv - 1];
+ sumf += 0.0f; // matches ssm_conv `sumf += b` with b == 0
+
+ dst_base[s * dst_seq_stride + c] = apply_silu ? (sumf / (1.0f + expf(-sumf))) : sumf;
+
+ // 1-token-shifted ring write-back: [tap1 .. tap_{K-2}, xc]
+ float * out_state = cdst_base + s * cdst_seq_stride + c * (d_conv - 1);
+ for (int64_t j = 0; j < d_conv - 2; ++j) {
+ out_state[j] = states_c[j + 1];
+ }
+ out_state[d_conv - 2] = xc;
+ }
+ }
+}
+
void ggml_compute_forward_ssm_conv(
const ggml_compute_params * params,
ggml_tensor * dst) {
switch (dst->src[0]->type) {
case GGML_TYPE_F32:
{
- ggml_compute_forward_ssm_conv_f32(params, dst);
+ if (dst->src[3] != nullptr) {
+ ggml_compute_forward_ssm_conv_update_f32(params, dst);
+ } else {
+ ggml_compute_forward_ssm_conv_f32(params, dst);
+ }
} break;
default:
{
diff --git a/ggml/src/ggml-cuda/ssm-conv.cu b/ggml/src/ggml-cuda/ssm-conv.cu
index 1463169..e1af1cd 100644
--- a/ggml/src/ggml-cuda/ssm-conv.cu
+++ b/ggml/src/ggml-cuda/ssm-conv.cu
@@ -123,6 +123,109 @@ static __global__ void ssm_conv_long_token_f32(const float * __restrict__ src0,
}
}
+// Fused decode-time depthwise causal conv1d update (one new token). Each thread owns one channel of
+// one sequence: it assembles the width-d_conv window from the K-1 cached taps (conv_states) plus the
+// current token (x_cur), computes the depthwise conv with the SAME ascending-tap FMA order as
+// ssm_conv_f32 at i==0, optionally folds silu, writes the conv output, and writes the 1-token-shifted
+// ring state back in place into conv_state_dst. Bit-identical to ssm_conv(concat) + silu + copy-back.
+template <bool apply_silu, int d_conv>
+static __global__ void ssm_conv_update_f32(const float * __restrict__ conv_states,
+ const float * __restrict__ conv_kernel,
+ const float * __restrict__ x_cur,
+ float * __restrict__ conv_state_dst,
+ float * __restrict__ dst,
+ const int channels,
+ const int states_seq_stride,
+ const int w_stride,
+ const int x_seq_stride,
+ const int dst_seq_stride,
+ const int cdst_seq_stride) {
+ const int c = blockIdx.x * blockDim.x + threadIdx.x; // channel
+ const int s = blockIdx.y; // sequence
+ if (c >= channels) {
+ return;
+ }
+
+ const float * states_c = conv_states + (int64_t) s * states_seq_stride + (int64_t) c * (d_conv - 1);
+ const float * w_c = conv_kernel + (int64_t) c * w_stride;
+ const float xc = x_cur[(int64_t) s * x_seq_stride + c];
+
+ // window = [tap0 .. tap_{K-2}, current-token], same ordering as the concat(conv_states, x) window
+ float window[d_conv];
+#pragma unroll
+ for (int j = 0; j < d_conv - 1; j++) {
+ window[j] = states_c[j];
+ }
+ window[d_conv - 1] = xc;
+
+ float sumf = 0.0f;
+#pragma unroll
+ for (int j = 0; j < d_conv; j++) {
+ sumf += window[j] * w_c[j];
+ }
+ sumf += 0.0f; // matches ssm_conv_f32 `sumf += b` with b == 0 (qwen35 conv1d has no bias)
+ dst[(int64_t) s * dst_seq_stride + c] = apply_silu ? ggml_cuda_op_silu_single(sumf) : sumf;
+
+ // 1-token-shifted ring write-back: drop the oldest tap, append the current token
+ float * out_state = conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1);
+#pragma unroll
+ for (int j = 0; j < d_conv - 1; j++) {
+ out_state[j] = window[j + 1];
+ }
+}
+
+static void ggml_cuda_op_ssm_conv_update(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+ const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs]
+ const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
+ const ggml_tensor * x_cur = dst->src[2]; // [channels, 1, n_seqs]
+ const ggml_tensor * cdst = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
+
+ const int64_t d_conv = conv_kernel->ne[0];
+ const int64_t channels = conv_kernel->ne[1];
+ const int64_t n_seqs = conv_states->ne[2];
+ const bool apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
+
+ GGML_ASSERT(conv_states->type == GGML_TYPE_F32 && conv_kernel->type == GGML_TYPE_F32);
+ GGML_ASSERT(x_cur->type == GGML_TYPE_F32 && cdst->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
+ GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+ GGML_ASSERT(conv_states->nb[1] == (size_t) (d_conv - 1) * sizeof(float));
+ GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
+ GGML_ASSERT(dst->ne[0] == channels && dst->ne[1] == 1 && dst->ne[2] == n_seqs);
+
+ const float * states_d = (const float *) conv_states->data;
+ const float * w_d = (const float *) conv_kernel->data;
+ const float * x_d = (const float *) x_cur->data;
+ float * cdst_d = (float *) cdst->data;
+ float * dst_d = (float *) dst->data;
+ cudaStream_t stream = ctx.stream();
+
+ const int states_seq_stride = (int) (conv_states->nb[2] / sizeof(float));
+ const int w_stride = (int) (conv_kernel->nb[1] / sizeof(float));
+ const int x_seq_stride = (int) (x_cur->nb[2] / sizeof(float));
+ const int dst_seq_stride = (int) (dst->nb[2] / sizeof(float));
+ const int cdst_seq_stride = (int) (cdst->nb[1] / sizeof(float));
+
+ const int threads = 128;
+ const dim3 blocks((channels + threads - 1) / threads, (unsigned) n_seqs, 1);
+
+ auto launch = [&](auto NC) {
+ constexpr int kNC = decltype(NC)::value;
+ if (apply_silu) {
+ ssm_conv_update_f32<true, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d,
+ (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
+ } else {
+ ssm_conv_update_f32<false, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d,
+ (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
+ }
+ };
+
+ switch (d_conv) {
+ case 3: launch(std::integral_constant<int, 3>{}); break;
+ case 4: launch(std::integral_constant<int, 4>{}); break;
+ default: GGML_ABORT("ssm_conv_update only supports d_conv 3 or 4");
+ }
+}
+
template <bool apply_silu>
static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1,
const int src0_nb2, const int src1_nb1, float * dst, const int dst_nb0, const int dst_nb1,
@@ -158,6 +261,15 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const floa
}
void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * bias_add_node, ggml_tensor * silu_dst) {
+ // Fused decode conv-update-in-place variant (ggml_ssm_conv_update_inplace): discriminated by a
+ // non-null src[3] (the in-place ring write-back target). It folds the concat/transpose/copy-back/
+ // silu of the decode conv path into a single kernel.
+ if (dst->src[3] != nullptr) {
+ GGML_ASSERT(bias_add_node == nullptr && silu_dst == nullptr);
+ ggml_cuda_op_ssm_conv_update(ctx, dst);
+ return;
+ }
+
const struct ggml_tensor * src0 = dst->src[0]; // conv_x
const struct ggml_tensor * src1 = dst->src[1]; // conv1d.weight
const bool fuse_bias = bias_add_node != nullptr;
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index 1762037..b777748 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -5555,6 +5555,60 @@ struct ggml_tensor * ggml_ssm_conv(
return result;
}
+// ggml_ssm_conv_update_inplace
+//
+// Fused decode-time depthwise causal conv1d update. Reuses GGML_OP_SSM_CONV but is discriminated by a
+// non-null src[3]. The op reads each channel's K-1 cached taps from `conv_states` and the single new
+// token from `x_cur`, computes the depthwise conv (ascending-tap FMA, bit-identical to ggml_ssm_conv),
+// optionally folds SiLU, writes the conv output to dst ([channels, 1, n_seqs]) and writes the
+// 1-token-shifted ring state back in place into `conv_state_dst` (the active sequences' conv-cache
+// slot). op_params[0] carries the fuse_silu flag. Mirrors the 0018/0019 in-place state pattern.
+struct ggml_tensor * ggml_ssm_conv_update_inplace(
+ struct ggml_context * ctx,
+ struct ggml_tensor * conv_states,
+ struct ggml_tensor * conv_kernel,
+ struct ggml_tensor * x_cur,
+ struct ggml_tensor * conv_state_dst,
+ bool fuse_silu) {
+ GGML_ASSERT(ggml_is_3d(conv_states));
+ GGML_ASSERT(ggml_is_matrix(conv_kernel));
+ GGML_ASSERT(ggml_is_3d(x_cur));
+
+ const int64_t d_conv = conv_kernel->ne[0];
+ const int64_t channels = conv_kernel->ne[1];
+ const int64_t n_seqs = conv_states->ne[2];
+
+ GGML_ASSERT(conv_states->type == GGML_TYPE_F32);
+ GGML_ASSERT(conv_kernel->type == GGML_TYPE_F32);
+ GGML_ASSERT(x_cur->type == GGML_TYPE_F32);
+ GGML_ASSERT(conv_state_dst != NULL && conv_state_dst->type == GGML_TYPE_F32);
+
+ // conv_states: [K-1, channels, n_seqs], contiguous taps per channel
+ GGML_ASSERT(conv_states->ne[0] == d_conv - 1);
+ GGML_ASSERT(conv_states->ne[1] == channels);
+ GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+ // x_cur: single decode token per sequence
+ GGML_ASSERT(x_cur->ne[0] == channels);
+ GGML_ASSERT(x_cur->ne[1] == 1);
+ GGML_ASSERT(x_cur->ne[2] == n_seqs);
+ // conv_state_dst: [(K-1)*channels, n_seqs] in-place ring write target
+ GGML_ASSERT(conv_state_dst->ne[0] == (d_conv - 1) * channels);
+ GGML_ASSERT(conv_state_dst->ne[1] >= n_seqs);
+ GGML_ASSERT(conv_state_dst->nb[0] == sizeof(float));
+
+ struct ggml_tensor * result = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
+
+ ggml_set_op_params_i32(result, 0, fuse_silu ? 1 : 0);
+
+ result->op = GGML_OP_SSM_CONV;
+ result->src[0] = conv_states;
+ result->src[1] = conv_kernel;
+ result->src[2] = x_cur;
+ result->src[3] = conv_state_dst;
+
+ return result;
+}
+
// ggml_ssm_scan
struct ggml_tensor * ggml_ssm_scan(
diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
index 194e611..0eee804 100644
--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
@@ -524,6 +524,57 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state(
return conv_input;
}
+// Fused decode conv path (patch 0021). Reads the active sequences' prior conv-state taps (the same
+// cheap build_rs gather as build_conv_state), then fuses the depthwise conv + silu + the 1-token-
+// shifted ring write-back into a single ggml_ssm_conv_update_inplace op. This removes the concat
+// (concat_cont), the transpose materialization, the scalar copy-back (cpy_scalar) and the separate
+// silu of the decode conv path. The op reads from the (disjoint) materialized taps and writes the
+// new ring state in place into the cache slot at kv_head -- exactly the slot the baseline ggml_cpy
+// wrote -- so it is bit-identical to build_conv_state + ggml_ssm_conv + ggml_silu.
+ggml_tensor * llm_build_delta_net_base::build_conv_state_fused(
+ llm_graph_input_rs * inp,
+ ggml_tensor * conv_states_all,
+ ggml_tensor * qkv_mixed,
+ ggml_tensor * conv_kernel,
+ int64_t conv_kernel_size,
+ int64_t conv_channels,
+ int il) {
+ const auto * mctx_cur = inp->mctx;
+ const auto kv_head = mctx_cur->get_head();
+
+ const int64_t n_seqs = ubatch.n_seqs;
+ const int64_t n_seq_tokens = ubatch.n_seq_tokens;
+
+ GGML_ASSERT(n_seq_tokens == 1); // single-token decode only
+ GGML_ASSERT(cparams.n_rs_seq == 0); // no rollback splits on this path
+
+ // Prior conv-state taps for the active sequences: [K-1, conv_channels, n_seqs]. Same get_rows
+ // gather as the baseline build_conv_state read (tiny; not one of the eliminated buckets).
+ ggml_tensor * conv_states = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs);
+ conv_states = ggml_reshape_3d(ctx0, conv_states, conv_kernel_size - 1, conv_channels, n_seqs);
+ cb(conv_states, "conv_states_reshaped", il);
+
+ // Current token, native (non-transposed) qkv_mixed: [conv_channels, 1, n_seqs].
+ ggml_tensor * x_cur = ggml_reshape_3d(ctx0, qkv_mixed, conv_channels, n_seq_tokens, n_seqs);
+
+ // In-place ring write-back target = the active sequences' conv-cache slot at kv_head, exactly the
+ // destination the baseline ggml_cpy wrote to (s_slot == 0).
+ const int64_t row_count = (conv_kernel_size - 1) * conv_channels;
+ const size_t row_size = ggml_row_size(conv_states_all->type, row_count);
+ ggml_tensor * conv_state_dst =
+ ggml_view_2d(ctx0, conv_states_all, row_count, n_seqs, conv_states_all->nb[1], kv_head * row_size);
+ cb(conv_state_dst, "conv_state_update", il);
+
+ ggml_tensor * conv_output =
+ ggml_ssm_conv_update_inplace(ctx0, conv_states, conv_kernel, x_cur, conv_state_dst, /*fuse_silu=*/true);
+ cb(conv_output, "conv_output_silu", il);
+
+ // the ring write is a side effect of the op; pull the op into the graph via the output
+ ggml_build_forward_expand(gf, conv_output);
+
+ return conv_output; // [conv_channels, 1, n_seqs], already silu'd
+}
+
// Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused
// gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy
// ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused
diff --git a/src/models/models.h b/src/models/models.h
index 98b89e9..da0dd86 100644
--- a/src/models/models.h
+++ b/src/models/models.h
@@ -76,6 +76,20 @@ struct llm_build_delta_net_base : public llm_graph_context {
int64_t conv_channels,
int il);
+ // Fused decode-time conv path (patch 0021). Replaces the concat + transpose + ssm_conv + silu +
+ // copy-back chain with a single ggml_ssm_conv_update_inplace op that reads the cached K-1 taps and
+ // the current token, computes the depthwise conv, folds silu, and writes the 1-token-shifted ring
+ // state back in place. Decode-only (n_seq_tokens == 1, n_rs_seq == 0). Returns the silu'd conv
+ // output: (conv_channels, 1, n_seqs). Bit-identical to the build_conv_state + ggml_ssm_conv chain.
+ ggml_tensor * build_conv_state_fused(
+ llm_graph_input_rs * inp,
+ ggml_tensor * conv_states_all,
+ ggml_tensor * qkv_mixed,
+ ggml_tensor * conv_kernel,
+ int64_t conv_kernel_size,
+ int64_t conv_channels,
+ int il);
+
// run delta-net attention and write the new recurrent state(s) back to ssm_states_all
// s: (head_v_dim, head_v_dim, num_v_heads, n_seqs); returns output: (head_v_dim, num_v_heads, n_seq_tokens, n_seqs)
ggml_tensor * build_recurrent_attn(
diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp
index 0874c43..b6dcc5f 100644
--- a/src/models/qwen35.cpp
+++ b/src/models/qwen35.cpp
@@ -383,15 +383,26 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear(
const int64_t conv_kernel_size = conv_kernel->ne[0];
const int64_t conv_channels = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
+ // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
+ // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
+ const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
+
+ ggml_tensor * conv_qkv_mix;
+ if (conv_decode_fused) {
+ conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
+ conv_kernel_size, conv_channels, il);
+ } else {
+ ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
- cb(conv_output_proper, "conv_output_raw", il);
+ ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+ cb(conv_output_proper, "conv_output_raw", il);
- ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
- cb(conv_output_silu, "conv_output_silu", il);
+ ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+ cb(conv_output_silu, "conv_output_silu", il);
- ggml_tensor * conv_qkv_mix = conv_output_silu;
+ conv_qkv_mix = conv_output_silu;
+ }
// Calculate the total conv dimension
int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp
index 1f6f643..c7c7c44 100644
--- a/src/models/qwen35moe.cpp
+++ b/src/models/qwen35moe.cpp
@@ -407,15 +407,26 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear(
const int64_t conv_kernel_size = conv_kernel->ne[0];
const int64_t conv_channels = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
+ // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
+ // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
+ const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
+
+ ggml_tensor * conv_qkv_mix;
+ if (conv_decode_fused) {
+ conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
+ conv_kernel_size, conv_channels, il);
+ } else {
+ ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
- cb(conv_output_proper, "conv_output_raw", il);
+ ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+ cb(conv_output_proper, "conv_output_raw", il);
- ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
- cb(conv_output_silu, "conv_output_silu", il);
+ ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+ cb(conv_output_silu, "conv_output_silu", il);
- ggml_tensor * conv_qkv_mix = conv_output_silu;
+ conv_qkv_mix = conv_output_silu;
+ }
// Calculate the total conv dimension
int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp
index bfdf026..92749d1 100644
--- a/src/models/qwen3next.cpp
+++ b/src/models/qwen3next.cpp
@@ -434,19 +434,30 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear(
const int64_t conv_kernel_size = conv_kernel->ne[0];
const int64_t conv_channels = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state;
- ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
+ // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv +
+ // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the
+ // transpose materialization, cpy_scalar and the separate silu). Bit-identical to the chain below.
+ const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar;
+
+ ggml_tensor * conv_qkv_mix;
+ if (conv_decode_fused) {
+ conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel,
+ conv_kernel_size, conv_channels, il);
+ } else {
+ ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il);
- ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
- state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
- cb(state, "state_predelta", il);
+ ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
+ cb(conv_output_proper, "conv_output_raw", il);
- ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel);
- cb(conv_output_proper, "conv_output_raw", il);
+ ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
+ cb(conv_output_silu, "conv_output_silu", il);
- ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper);
- cb(conv_output_silu, "conv_output_silu", il);
+ conv_qkv_mix = conv_output_silu;
+ }
- ggml_tensor * conv_qkv_mix = conv_output_silu;
+ ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs);
+ state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs);
+ cb(state, "state_predelta", il);
// Calculate the total conv dimension
int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads;
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 291c275..c7348d6 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -3748,6 +3748,43 @@ struct test_ssm_conv_bias_silu : public test_case {
}
};
+// GGML_OP_SSM_CONV fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021).
+// Validates the conv + silu output (dst) against the CPU reference across backends. The 1-token-
+// shifted ring write-back to conv_state_dst is a side effect (validated end-to-end by the greedy
+// md5 gate); here it just exercises the in-place write target as an op src.
+struct test_ssm_conv_update : public test_case {
+ const int64_t d_conv;
+ const int64_t channels;
+ const int64_t n_seqs;
+
+ std::string op_desc(ggml_tensor * t) override {
+ GGML_UNUSED(t);
+ return "SSM_CONV_UPDATE";
+ }
+
+ std::string vars() override {
+ return VARS_TO_STR3(d_conv, channels, n_seqs);
+ }
+
+ test_ssm_conv_update(int64_t d_conv = 4, int64_t channels = 256, int64_t n_seqs = 4)
+ : d_conv(d_conv), channels(channels), n_seqs(n_seqs) {}
+
+ ggml_tensor * build_graph(ggml_context * ctx) override {
+ ggml_tensor * conv_states = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_conv - 1, channels, n_seqs);
+ ggml_tensor * conv_kernel = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, channels);
+ ggml_tensor * x_cur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
+ ggml_tensor * conv_state_dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (d_conv - 1) * channels, n_seqs);
+ ggml_set_name(conv_states, "conv_states");
+ ggml_set_name(conv_kernel, "conv_kernel");
+ ggml_set_name(x_cur, "x_cur");
+ ggml_set_name(conv_state_dst, "conv_state_dst");
+
+ ggml_tensor * out = ggml_ssm_conv_update_inplace(ctx, conv_states, conv_kernel, x_cur, conv_state_dst, true);
+ ggml_set_name(out, "out");
+ return out;
+ }
+};
+
// GGML_OP_SSM_SCAN
struct test_ssm_scan : public test_case {
const ggml_type type;
@@ -8355,6 +8392,16 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
}
}
+ // fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021). channels must be
+ // a multiple of 128 for the CUDA SSM_CONV supports_op gate.
+ for (int64_t d_conv : {3, 4}) {
+ for (int64_t channels : {256, 3328}) {
+ for (int64_t n_seqs : {1, 4, 32, 128}) {
+ test_cases.emplace_back(new test_ssm_conv_update(d_conv, channels, n_seqs));
+ }
+ }
+ }
+
test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1, 1024, 1, 32, 4)); // Mamba-1
test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 128, 64, 16, 2, 32, 4)); // Mamba-2
test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 256, 64, 8, 2, 32, 4)); // Falcon-H1
--
2.43.0
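
Editor's sketch for patch 0021 above: a minimal CPU model of the fused decode conv step for one channel of one sequence, set against the unfused concat + conv + silu + copy-back chain. It is not code from the patch. The function names, the oldest-first tap layout and the use of `std::vector` are assumptions made for illustration. The point it tries to show is that both paths accumulate the window in the same order, which is what the patch's bit-identity claim relies on.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helpers (illustrative names, not the ggml API).
// `taps` holds the K-1 cached inputs, oldest first; `w` holds the K kernel weights.
static float silu(float v) { return v / (1.0f + std::exp(-v)); }

// Unfused chain: materialize the K-wide window (concat), convolve, apply silu,
// then copy the newest K-1 entries back as the next state.
static float conv_step_unfused(std::vector<float> & taps, const std::vector<float> & w, float x) {
    std::vector<float> window(taps);
    window.push_back(x);
    float acc = 0.0f;
    for (std::size_t k = 0; k < w.size(); ++k) {
        acc += window[k] * w[k];
    }
    taps.assign(window.begin() + 1, window.end());
    return silu(acc);
}

// Fused step: same accumulation order, no materialized window, and the ring
// state is shifted by one token in place. Requires w.size() >= 2.
static float conv_step_fused(std::vector<float> & taps, const std::vector<float> & w, float x) {
    const std::size_t K = w.size();
    float acc = 0.0f;
    for (std::size_t k = 0; k + 1 < K; ++k) {
        acc += taps[k] * w[k];
    }
    acc += x * w[K - 1];
    for (std::size_t k = 0; k + 2 < K; ++k) {
        taps[k] = taps[k + 1];
    }
    taps[K - 2] = x;
    return silu(acc);
}
```

Calling both functions on copies of the same taps with the same weights and token should give equal outputs and equal updated states; the real op does this per channel and per sequence on the device.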

@@ -1,403 +0,0 @@
From 8a3229f41d5b712e87901796dfae3faee1f2f07d Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 20:32:55 +0200
Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet decode
occupancy/coalescing retune (patch 0022)
Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence
kernel. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in
float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth
efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak),
the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA
order byte-identical (md5-gateable). The bf16-state plan stays shelved.
Column folding: two new template params NUM_WARPS (default 4) and COLS_PER_WARP
(default 1). Each warp now owns COLS_PER_WARP columns of the 128x128 recurrent
state instead of 1, looping the existing per-column body over col, col+NUM_WARPS,
... within a per-block column tile of NUM_WARPS*COLS_PER_WARP columns;
grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The S_v rows of every column stay sharded
across the lanes by the same strided i = r*warp_size + lane mapping, and every
column's per-lane FMA accumulation and warp_reduce_sum butterfly are byte-for-byte
unchanged; only the (warp,block)->column assignment and visit order differ, which a
column's value provably does not depend on (columns are fully independent). This
raises per-warp memory-level parallelism ~COLS_PER_WARP-fold (independent
state-load bursts before any reduction + interleaved butterfly reductions hiding
each other's shfl latency), covering more DRAM latency on this bandwidth-bound
kernel. Every global access stays identically coalesced, so it is a scheduling /
latency-coverage win, not a coalescing change. The forbidden float4 state load
(which would repartition a lane to 4 contiguous rows and change the reduction
grouping) is NOT done, so the md5 stays invariant. The S_v=128 tile is
env-selectable (GDN_NW / GDN_CPW) for one-build re-tuning; default is the measured
GB10 winner (16, 8).
GB10 (CUDA 13, sm_121, nsys CUPTI timing - HW counters perm-blocked):
gated_delta_net B=128 decode call (805.3 MB f32 R+W) 4.02 -> 3.49 ms/call,
200.3 -> 230.9 GB/s = 73.4% -> 84.6% of 273 GB/s peak (now above vLLM's 82.4%;
102.6% of vLLM's recurrence bandwidth). decode S_TG t/s (npp128 ntg128, -fa on):
dense 27b npl128 335.9 -> 373.2 (+11.1%), npl32 199.2 -> 207.6 (+4.2%); MoE
35b-a3b npl128 688.4 -> 745.7 (+8.3%), npl32 420.6 -> 440.0 (+4.6%). Prefill
unchanged.
Bit-exact: greedy --temp 0 --seed 1 md5 byte-identical to the 0021 baseline on
both q36-27b-nvfp4 and q36-35b-a3b-nvfp4 (winner 16x8 and 4x1 control);
test-backend-ops -o GATED_DELTA_NET 36/36 PASS.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/gated_delta_net.cu | 236 +++++++++++++++++---------
1 file changed, 157 insertions(+), 79 deletions(-)
diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu
index 86d5e2a..d071d5a 100644
--- a/ggml/src/ggml-cuda/gated_delta_net.cu
+++ b/ggml/src/ggml-cuda/gated_delta_net.cu
@@ -1,6 +1,8 @@
#include "gated_delta_net.cuh"
#include "ggml-cuda/common.cuh"
+#include <cstdlib>
+
// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a
// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
// destination slot by the recurrence kernel and are skipped here. One block per sequence.
@@ -29,8 +31,22 @@ static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * i
gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs);
}
-template <int S_v, bool KDA, bool keep_rs_t>
-__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2)
+// Occupancy/coalescing retune (patch 0022). Each warp owns COLS_PER_WARP columns of the recurrent
+// state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a
+// per-block column tile of size NUM_WARPS*COLS_PER_WARP. The S_v rows of every column stay sharded
+// across the lanes by the SAME strided mapping i = r*warp_size + lane, and every column's per-lane
+// FMA accumulation and warp_reduce_sum<warp_size> butterfly are byte-for-byte unchanged. Only the
+// (warp,block)->column assignment and the order a warp visits its columns differ, and a column's
+// f32 value provably does not depend on either (columns are fully independent: column c reads only
+// its own S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). So the result
+// and the stored final state are bit-identical to the COLS_PER_WARP==1 baseline (md5-gateable),
+// while per-warp memory-level parallelism rises ~COLS_PER_WARP-fold (COLS_PER_WARP independent
+// state-load bursts issued before any reduction, and the independent butterfly reductions interleave
+// to hide each other's shfl latency) which covers more DRAM latency on this bandwidth-bound kernel.
+// Every individual global access stays IDENTICALLY coalesced (32 consecutive lanes -> one 128B
+// sector), so this is a latency-coverage / scheduling win, not a coalescing change.
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS = 4, int COLS_PER_WARP = 1, int MIN_BLOCKS = 2>
+__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * NUM_WARPS, MIN_BLOCKS)
gated_delta_net_cuda(const float * q,
const float * k,
const float * v,
@@ -59,9 +75,9 @@ gated_delta_net_cuda(const float * q,
int rs_head) {
const uint32_t h_idx = blockIdx.x;
const uint32_t sequence = blockIdx.y;
- // each warp owns one column, using warp-level primitives to reduce across rows
+ // each warp owns COLS_PER_WARP columns, using warp-level primitives to reduce across rows.
const int lane = threadIdx.x;
- const int col = blockIdx.z * blockDim.y + threadIdx.y;
+ const int col_base = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y;
const uint32_t iq1 = fastmodulo(h_idx, neqk1_magic);
const uint32_t iq3 = fastdiv(sequence, rq3_magic);
@@ -86,20 +102,25 @@ gated_delta_net_cuda(const float * q,
// writing the same slot per block (identity) is race-free.
const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence)
? state_dst : curr_state;
- read_state += state_in_offset + col * S_v;
+ read_state += state_in_offset;
attn_data += (sequence * n_tokens * H + h_idx) * S_v;
constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v;
static_assert(S_v % warp_size == 0, "S_v must be a multiple of warp_size");
constexpr int rows_per_lane = (S_v + warp_size - 1) / warp_size;
- float s_shard[rows_per_lane];
- // state is stored transposed: M[col][i] = S[i][col], row col is contiguous
+ // per-column register shard of the recurrent state; state is stored transposed: M[col][i] = S[i][col].
+ float s_shard[COLS_PER_WARP][rows_per_lane];
ggml_cuda_pdl_sync();
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- s_shard[r] = read_state[i];
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+ const float * rs = read_state + col * S_v;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ s_shard[cc][r] = rs[i];
+ }
}
for (int t = 0; t < n_tokens; t++) {
@@ -113,7 +134,7 @@ gated_delta_net_cuda(const float * q,
const float beta_val = *beta_t;
- // Cache k and q in registers
+ // Cache k and q in registers (shared across the COLS_PER_WARP columns of this warp).
float k_reg[rows_per_lane];
float q_reg[rows_per_lane];
#pragma unroll
@@ -126,59 +147,69 @@ gated_delta_net_cuda(const float * q,
if constexpr (!KDA) {
const float g_val = expf(*g_t);
- // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
- float kv_shard = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- kv_shard += s_shard[r] * k_reg[r];
- }
- float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
- // delta[col] = (v[col] - g * kv[col]) * beta
- float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
+ // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i]
+ float kv_shard = 0.0f;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ kv_shard += s_shard[cc][r] * k_reg[r];
+ }
+ float kv_col = warp_reduce_sum<warp_size>(kv_shard);
- // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
- float attn_partial = 0.0f;
+ // delta[col] = (v[col] - g * kv[col]) * beta
+ float delta_col = (v_t[col] - g_val * kv_col) * beta_val;
+
+ // fused: S[i][col] = g * S[i][col] + k[i] * delta[col]
+ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+ float attn_partial = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- s_shard[r] = g_val * s_shard[r] + k_reg[r] * delta_col;
- attn_partial += s_shard[r] * q_reg[r];
- }
+ for (int r = 0; r < rows_per_lane; r++) {
+ s_shard[cc][r] = g_val * s_shard[cc][r] + k_reg[r] * delta_col;
+ attn_partial += s_shard[cc][r] * q_reg[r];
+ }
- float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+ float attn_col = warp_reduce_sum<warp_size>(attn_partial);
- if (lane == 0) {
- attn_data[col] = attn_col * scale;
+ if (lane == 0) {
+ attn_data[col] = attn_col * scale;
+ }
}
} else {
- // kv[col] = sum_i g[i] * S[i][col] * k[i]
- float kv_shard = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- kv_shard += expf(g_t[i]) * s_shard[r] * k_reg[r];
- }
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+
+ // kv[col] = sum_i g[i] * S[i][col] * k[i]
+ float kv_shard = 0.0f;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ kv_shard += expf(g_t[i]) * s_shard[cc][r] * k_reg[r];
+ }
- float kv_col = warp_reduce_sum<warp_size>(kv_shard);
+ float kv_col = warp_reduce_sum<warp_size>(kv_shard);
- // delta[col] = (v[col] - kv[col]) * beta
- float delta_col = (v_t[col] - kv_col) * beta_val;
+ // delta[col] = (v[col] - kv[col]) * beta
+ float delta_col = (v_t[col] - kv_col) * beta_val;
- // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
- float attn_partial = 0.0f;
+ // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col]
+ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i]
+ float attn_partial = 0.0f;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- s_shard[r] = expf(g_t[i]) * s_shard[r] + k_reg[r] * delta_col;
- attn_partial += s_shard[r] * q_reg[r];
- }
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ s_shard[cc][r] = expf(g_t[i]) * s_shard[cc][r] + k_reg[r] * delta_col;
+ attn_partial += s_shard[cc][r] * q_reg[r];
+ }
- float attn_col = warp_reduce_sum<warp_size>(attn_partial);
+ float attn_col = warp_reduce_sum<warp_size>(attn_partial);
- if (lane == 0) {
- attn_data[col] = attn_col * scale;
+ if (lane == 0) {
+ attn_data[col] = attn_col * scale;
+ }
}
}
@@ -190,11 +221,15 @@ gated_delta_net_cuda(const float * q,
const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output
const int target_slot = (int) n_tokens - 1 - t;
if (target_slot >= 0 && target_slot < K) {
- float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- curr_state[col * S_v + i] = s_shard[r];
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+ float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ curr_state[col * S_v + i] = s_shard[cc][r];
+ }
}
}
}
@@ -202,13 +237,48 @@ gated_delta_net_cuda(const float * q,
if constexpr (!keep_rs_t) {
#pragma unroll
- for (int r = 0; r < rows_per_lane; r++) {
- const int i = r * warp_size + lane;
- state[col * S_v + i] = s_shard[r];
+ for (int cc = 0; cc < COLS_PER_WARP; cc++) {
+ const int col = col_base + cc * NUM_WARPS;
+#pragma unroll
+ for (int r = 0; r < rows_per_lane; r++) {
+ const int i = r * warp_size + lane;
+ state[col * S_v + i] = s_shard[cc][r];
+ }
}
}
}
+// Default column-folding tile for the S_v==128 decode/prefill path (the GDN head dim of this model).
+// Measured winner of the bit-exact occupancy sweep (patch 0022). Override at runtime for the sweep
+// via GDN_NW / GDN_CPW; all selectable variants are bit-identical, only %peak differs.
+#ifndef GDN_DEFAULT_NW
+#define GDN_DEFAULT_NW 16
+#endif
+#ifndef GDN_DEFAULT_CPW
+#define GDN_DEFAULT_CPW 8
+#endif
+
+template <int S_v, bool KDA, bool keep_rs_t, int NUM_WARPS, int COLS_PER_WARP, int MIN_BLOCKS>
+static void launch_gdn_variant(
+ const float * q_d, const float * k_d, const float * v_d,
+ const float * g_d, const float * b_d, const float * s_d,
+ float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head,
+ int64_t H, int64_t n_tokens, int64_t n_seqs,
+ int64_t sq1, int64_t sq2, int64_t sq3,
+ int64_t sv1, int64_t sv2, int64_t sv3,
+ int64_t sb1, int64_t sb2, int64_t sb3,
+ const uint3 neqk1_magic, const uint3 rq3_magic,
+ float scale, int K, int warp_size, cudaStream_t stream) {
+ static_assert(S_v % (NUM_WARPS * COLS_PER_WARP) == 0, "NUM_WARPS*COLS_PER_WARP must divide S_v");
+ dim3 grid_dims(H, n_seqs, S_v / (NUM_WARPS * COLS_PER_WARP));
+ dim3 block_dims(warp_size <= S_v ? warp_size : S_v, NUM_WARPS, 1);
+ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
+ ggml_cuda_kernel_launch(gated_delta_net_cuda<S_v, KDA, keep_rs_t, NUM_WARPS, COLS_PER_WARP, MIN_BLOCKS>, launch_params,
+ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
+ n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
+ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+}
+
template <bool KDA, bool keep_rs_t>
static void launch_gated_delta_net(
const float * q_d, const float * k_d, const float * v_d,
@@ -223,47 +293,55 @@ static void launch_gated_delta_net(
float scale, int K, cudaStream_t stream) {
//TODO: Add chunked kernel for even faster pre-fill
const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size;
- const int num_warps = 4;
- dim3 grid_dims(H, n_seqs, (S_v + num_warps - 1) / num_warps);
- dim3 block_dims(warp_size <= S_v ? warp_size : S_v, num_warps, 1);
const uint3 neqk1_magic = init_fastdiv_values(neqk1);
const uint3 rq3_magic = init_fastdiv_values(rq3);
- int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
+#define GDN_LAUNCH_ARGS \
+ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \
+ H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
+ neqk1_magic, rq3_magic, scale, K, warp_size, stream
- const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream);
switch (S_v) {
case 16:
- ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ launch_gdn_variant<16, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
break;
case 32:
- ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ launch_gdn_variant<32, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
break;
- case 64: {
- ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ case 64:
+ launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
break;
- }
case 128: {
- ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params,
- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H,
- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3,
- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head);
+ // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP columns per warp
+ // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is
+ // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every
+ // selectable {num_warps, cols} is bit-identical, so the sweep cannot change the md5).
+ static const int gdn_nw = []{ const char * e = getenv("GDN_NW"); return e ? atoi(e) : GDN_DEFAULT_NW; }();
+ static const int gdn_cpw = []{ const char * e = getenv("GDN_CPW"); return e ? atoi(e) : GDN_DEFAULT_CPW; }();
+ // NUM_WARPS in {4,8,16} x COLS_PER_WARP ladder (all <=512 threads/block, no 1024-thread
+ // .minnctapersm warnings). Measured GB10 %peak: (4,1)=73 baseline ... (16,4)=82 ...
+ // (16,8)=84.7 winner ~ tied with (8,8)/(8,16)/(32,4); the plateau is just above vLLM (82.4).
+ if (gdn_nw == 4 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 4 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 4, 2, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 4 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 4, 4, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 8, 1, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 8, 2, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 8, 4, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 8 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 8, 8, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 16, 1, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 16, 2, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 16, 4, 2>(GDN_LAUNCH_ARGS);
+ else if (gdn_nw == 16 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 16, 8, 2>(GDN_LAUNCH_ARGS);
+ else launch_gdn_variant<128, KDA, keep_rs_t, GDN_DEFAULT_NW, GDN_DEFAULT_CPW, 2>(GDN_LAUNCH_ARGS);
break;
}
default:
GGML_ABORT("fatal error");
break;
}
+
+#undef GDN_LAUNCH_ARGS
}
void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
--
2.43.0
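
Editor's sketch for patch 0022 above: a small host-side model of the column-folding index math, assuming the mapping `col = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y + cc * NUM_WARPS` from the kernel. The helper names are hypothetical and nothing here runs on the GPU. It only checks that every column of the state is assigned to exactly one (block, warp, fold step) triple, which is the precondition for the argument that changing the assignment cannot change any column's value.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the patch's column assignment (not code from the patch).
// Returns how many times each of the s_v columns is visited across all blocks,
// warps and fold steps.
static std::vector<int> column_visits(int s_v, int num_warps, int cols_per_warp) {
    const int tile = num_warps * cols_per_warp;
    std::vector<int> visits(s_v, 0);
    if (tile <= 0 || s_v % tile != 0) {
        return visits;  // the patch rejects this case at compile time via static_assert
    }
    for (int z = 0; z < s_v / tile; ++z) {             // blockIdx.z
        for (int y = 0; y < num_warps; ++y) {          // threadIdx.y (warp index)
            for (int cc = 0; cc < cols_per_warp; ++cc) {
                ++visits[z * tile + y + cc * num_warps];
            }
        }
    }
    return visits;
}

// True when every column is owned by exactly one (block, warp, fold step).
static bool covers_each_column_once(int s_v, int num_warps, int cols_per_warp) {
    for (int v : column_visits(s_v, num_warps, cols_per_warp)) {
        if (v != 1) {
            return false;
        }
    }
    return true;
}
```

For example, `covers_each_column_once(128, 16, 8)` models the default tile, where a single block along z covers all 128 columns, and `covers_each_column_once(128, 4, 1)` models the pre-patch layout.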

@@ -1,144 +0,0 @@
From f7409c2de2868a6a048d3c333329468b4cc9e483 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Thu, 25 Jun 2026 23:47:25 +0200
Subject: [PATCH] feat(paged): qwen35moe NVFP4 activation-quantize de-dup
(patch 0023)
Bit-exact decode/prefill lever for the MoE (qwen3.5moe) NVFP4 path. ggml's
mul_mat_id quantizes the EXPERT-GATHERED activation rows (ne11_flat =
ne12*n_expert_used). For the broadcast up/gate projections (ne11 == 1) every
expert of a token receives the SAME token activation, so the stock path
re-quantizes each token n_expert_used times. quantize_mmq_nvfp4 produces each
block as a pure per-thread function of its 16 consecutive inputs (no cross-thread
reduction), so the gathered blocks are byte-identical across the experts.
Lever: when ne11 == 1, quantize the ne12 UNIQUE token activations once, then
gather the resulting block_fp4_mmq rows into the expert-gathered layout keyed by
ids_src1 with a coalesced uint4 copy (block_fp4_mmq == 9 uint4 == 144 B). Pure
byte copy of identical blocks, so the gathered buffer is byte-for-byte identical
to re-quantizing each gathered row; the GEMM is untouched. down_proj
(ne11 == n_expert_used, distinct per expert) keeps the stock path.
Measured GB10 (sm_121a), on top of HEAD 8a3229f (patch 0022), q36-35b-a3b-nvfp4:
- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), new
gather_mmq_fp4 +32 ms; net -379 ms of decode GPU-time.
- S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%; prefill T_PP -4%.
- Dense q36-27b-nvfp4 byte-flat (no mul_mat_id): 373.24 t/s unchanged.
Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022):
q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 (unchanged)
q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (de-dup on == off)
test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805.
On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores the stock re-quantize path.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/mmq.cu | 21 +++++++++++++++++--
ggml/src/ggml-cuda/quantize.cu | 37 +++++++++++++++++++++++++++++++++
ggml/src/ggml-cuda/quantize.cuh | 4 ++++
3 files changed, 60 insertions(+), 2 deletions(-)
diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu
index e1add5e..9933fa6 100644
--- a/ggml/src/ggml-cuda/mmq.cu
+++ b/ggml/src/ggml-cuda/mmq.cu
@@ -1,3 +1,4 @@
+#include <cstdlib>
#include "common.cuh"
#include "mmq.cuh"
#include "quantize.cuh"
@@ -197,8 +198,24 @@ void ggml_cuda_mul_mat_q(
const int64_t s13 = src1->nb[3] / ts_src1;
if (use_native_fp4) {
- quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
- ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+ // 0023: de-dup the broadcast (up/gate) quantize. ne11==1 means src1 is shared
+ // across experts, so quantize the ne12 unique tokens once and gather the blocks.
+ static const bool moe_quant_dedup = []{
+ const char * e = getenv("GGML_CUDA_MOE_QUANT_DEDUP");
+ return e ? atoi(e) != 0 : true; // 0023: on by default; GGML_CUDA_MOE_QUANT_DEDUP=0 disables
+ }();
+ if (moe_quant_dedup && ne11 == 1) {
+ const size_t nbytes_unique = ne12*ne10_padded * sizeof(block_q8_1)/QK8_1 +
+ get_mmq_x_max_host(cc)*sizeof(block_q8_1_mmq);
+ ggml_cuda_pool_alloc<char> src1_unique(ctx.pool(), nbytes_unique);
+ quantize_mmq_fp4_cuda(src1_d, nullptr, src1_unique.get(), src0->type, ne10, s12, 0, 0,
+ ne10_padded, ne12, 1, 1, stream);
+ gather_mmq_fp4_cuda(src1_unique.get(), ids_src1.get(), src1_q8_1.get(),
+ ne11_flat, ne12, ne10_padded, stream);
+ } else {
+ quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
+ ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
+ }
} else {
quantize_mmq_q8_1_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13,
ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream);
diff --git a/ggml/src/ggml-cuda/quantize.cu b/ggml/src/ggml-cuda/quantize.cu
index 39a500a..a7fd86f 100644
--- a/ggml/src/ggml-cuda/quantize.cu
+++ b/ggml/src/ggml-cuda/quantize.cu
@@ -419,6 +419,43 @@ void quantize_mmq_q8_1_cuda(
}
}
+// MoE NVFP4 quantize de-dup (0023): for the broadcast (up/gate) expert matmuls every
+// gathered row references one of ne12 unique token activations, so the stock path
+// re-quantizes each token n_expert_used times. Quantize the unique tokens once, then copy
+// the resulting block_fp4_mmq rows into the expert-gathered layout keyed by ids. This is a
+// pure byte copy of identical blocks => the gathered buffer is byte-identical to stock.
+static __global__ void gather_mmq_fp4(
+ const uint4 * __restrict__ unique, const int32_t * __restrict__ ids,
+ uint4 * __restrict__ gathered, const int ne11_flat, const int ne12_unique,
+ const int64_t total_words) {
+ constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4)); // 9 uint4 per 144B block
+ const int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
+ if (t >= total_words) {
+ return;
+ }
+ const int w = (int) (t % W);
+ const int64_t ib = t / W; // destination block index = kb*ne11_flat + j
+ const int j = (int) (ib % ne11_flat);
+ const int kb = (int) (ib / ne11_flat);
+ const int src = ids[j];
+ const int64_t ib_u = (int64_t) kb * ne12_unique + src;
+ gathered[t] = unique[ib_u * W + w];
+}
+
+void gather_mmq_fp4_cuda(
+ const void * unique, const int32_t * ids, void * gathered,
+ int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded, cudaStream_t stream) {
+ const int blocks_per_col = (int) ((ne0_padded + QK_K - 1) / QK_K);
+ constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4));
+ const int64_t total_words = ne11_flat * (int64_t) blocks_per_col * W;
+ const int bs = 256;
+ const dim3 block_size(bs, 1, 1);
+ const dim3 num_blocks((unsigned) ((total_words + bs - 1) / bs), 1, 1);
+ gather_mmq_fp4<<<num_blocks, block_size, 0, stream>>>(
+ (const uint4 *) unique, ids, (uint4 *) gathered,
+ (int) ne11_flat, (int) ne12_unique, total_words);
+}
+
void quantize_mmq_fp4_cuda(
const float * x, const int32_t * ids, void * vy, const ggml_type type_src0,
const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03,
diff --git a/ggml/src/ggml-cuda/quantize.cuh b/ggml/src/ggml-cuda/quantize.cuh
index 768a3ae..7f64069 100644
--- a/ggml/src/ggml-cuda/quantize.cuh
+++ b/ggml/src/ggml-cuda/quantize.cuh
@@ -26,6 +26,10 @@ void quantize_mmq_q8_1_cuda(
ggml_type type_src0, int64_t ne00, int64_t s01, int64_t s02, int64_t s03,
int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3, cudaStream_t stream);
+void gather_mmq_fp4_cuda(const void * unique, const int32_t * ids, void * gathered,
+ int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded,
+ cudaStream_t stream);
+
void quantize_mmq_fp4_cuda(const float * x,
const int32_t * ids,
void * vy,
--
2.43.0
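
Illustrative note (not part of the patch series): the gather step of patch 0023 can be modelled on the host. The sketch below is a minimal C++ reference of the index mapping that gather_mmq_fp4 performs, assuming the layout implied by the kernel: the unique buffer is indexed as [kb][token][word] and the gathered buffer as [kb][row][word]. The function name gather_blocks_ref and the use of uint32_t words in place of uint4 are hypothetical simplifications, not names from the patch.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of the gather index mapping in patch 0023.
// `unique` holds n_kblocks * ne12_unique blocks of `words_per_block` words,
// laid out [kb][token][word]. ids[j] is assumed to be the unique-token index
// of gathered row j. Returns the gathered buffer laid out [kb][row][word].
std::vector<uint32_t> gather_blocks_ref(const std::vector<uint32_t> & unique,
                                        const std::vector<int32_t> & ids,
                                        int n_kblocks, int ne12_unique, int words_per_block) {
    const int64_t ne11_flat   = (int64_t) ids.size();
    const int64_t total_words = ne11_flat * n_kblocks * words_per_block;
    std::vector<uint32_t> gathered((size_t) total_words);
    for (int64_t t = 0; t < total_words; ++t) {      // one iteration per GPU thread
        const int64_t w    = t % words_per_block;    // word inside the block
        const int64_t ib   = t / words_per_block;    // destination block = kb*ne11_flat + j
        const int64_t j    = ib % ne11_flat;         // gathered row
        const int64_t kb   = ib / ne11_flat;         // block index along the row
        const int64_t ib_u = kb * ne12_unique + ids[(size_t) j];
        gathered[(size_t) t] = unique[(size_t) (ib_u * words_per_block + w)];
    }
    return gathered;
}
```

Because each destination word is a plain copy of a source word, the gathered buffer can only differ from the re-quantized one if the quantizer itself were not a pure function of its inputs, which is the property the commit message relies on.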


@@ -1,357 +0,0 @@
From a8a9d129ae2226a08a12c30ece697865c0fc85c4 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 26 Jun 2026 12:41:49 +0200
Subject: [PATCH] feat(paged): paged-pool burst-reclaim (truncate + defrag +
slot release) (patch 0024)
Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C
Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill
burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill
draws from a depleted/fragmented pool and its throughput collapses (the benchmark's
"restart per npl" crutch). Decode is unaffected. The fix changes only host-side
block accounting and placement, never KV values or compute, and is gated behind
LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior).
Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every
block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for
the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly.
Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the
free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving
content-cache hashes.
Fix-3 release on slot completion: server_slot::release() issues prompt_clear()
under the paged engine so a finished-idle slot returns its blocks promptly.
Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023):
- Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM
(5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops
unaffected (no ggml op touched).
- Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores
ascending popleft order. UNIT PASS.
- Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998;
64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527).
- Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses
488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to
532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96
(Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both
arms (bit-exact serving).
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/llama-kv-cache.cpp | 13 ++++++++++
src/paged-alloc.cpp | 31 +++++++++++++++++++++++
src/paged-alloc.h | 18 +++++++++++++
src/paged-kv-manager.cpp | 45 +++++++++++++++++++++++++++++++++
src/paged-kv-manager.h | 24 ++++++++++++++++++
src/paged-prefix-api.cpp | 8 ++++++
src/paged-prefix-api.h | 6 +++++
tools/server/server-context.cpp | 17 +++++++++++++
8 files changed, 162 insertions(+)
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 0351f86..21b8f1e 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -425,6 +425,19 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) {
}
}
+ // [paged 0024 Fix-1] Reclaim trailing blocks on a partial TAIL truncation
+ // (p1 == MAX, p0 > 0). llama-server issues seq_rm(slot, n_past, -1) on every
+ // reused slot and before a cross-request prefix splice; the kv-cache frees the
+ // cells [p0, end) but, without this, the paged manager keeps owning those
+ // blocks - the reclamation gap that leaks and fragments the pool across a
+ // burst. truncate() frees the blocks beyond ceil(p0/bs) so the manager's
+ // accounting tracks the kv-cache exactly. Gated so LLAMA_PAGED_NO_RECLAIM
+ // restores the pre-fix behavior for A/B.
+ if (paged_alloc::active() && paged_alloc::reclaim_active() && seq_id >= 0 &&
+ p0 > 0 && p1 == std::numeric_limits<llama_pos>::max()) {
+ paged_alloc::truncate(this, (int) seq_to_stream[seq_id], (int) seq_id, (uint32_t) p0);
+ }
+
if (seq_id >= 0) {
auto & cells = v_cells[seq_to_stream[seq_id]];
auto & head = v_heads[seq_to_stream[seq_id]];
diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp
index c1027fb..ba98dd5 100644
--- a/src/paged-alloc.cpp
+++ b/src/paged-alloc.cpp
@@ -14,6 +14,11 @@ bool active() {
return a;
}
+bool reclaim_active() {
+ static const bool off = (std::getenv("LLAMA_PAGED_NO_RECLAIM") != nullptr);
+ return !off;
+}
+
static bool debug() {
static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr);
return d;
@@ -124,12 +129,28 @@ void commit(const void * cache, int stream, int seq,
}
}
+void truncate(const void * cache, int stream, int seq, uint32_t n_keep) {
+ paged::PagedKVManager * mgr = find_mgr(cache, stream);
+ if (!mgr) {
+ return;
+ }
+ mgr->truncate(seq, (size_t) n_keep); // Fix-1: reclaim trailing blocks
+ mgr->defrag_free_pool(); // Fix-2: compact iff the pool emptied
+ if (debug()) {
+ fprintf(stderr, "[paged-alloc] truncate cache=%p stream=%d seq=%d keep<=%u (free=%zu)\n",
+ cache, stream, seq, n_keep, mgr->num_free_blocks());
+ }
+}
+
void release(const void * cache, int stream, int seq) {
paged::PagedKVManager * mgr = find_mgr(cache, stream);
if (!mgr) {
return;
}
mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them
+ if (reclaim_active()) {
+ mgr->defrag_free_pool(); // Fix-2: compact iff the pool emptied
+ }
if (debug()) {
fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n",
cache, stream, seq, mgr->num_free_blocks());
@@ -163,4 +184,14 @@ size_t num_free(const void * cache, int stream) {
return mgr ? mgr->num_free_blocks() : 0;
}
+size_t num_free_global() {
+ size_t total = 0;
+ for (auto & kv : g_managers) total += kv.second->num_free_blocks();
+ return total;
+}
+
+size_t num_managers() {
+ return g_managers.size();
+}
+
} // namespace paged_alloc
diff --git a/src/paged-alloc.h b/src/paged-alloc.h
index 88dedef..bfaf45b 100644
--- a/src/paged-alloc.h
+++ b/src/paged-alloc.h
@@ -31,6 +31,12 @@ namespace paged_alloc {
// true iff env LLAMA_KV_PAGED is set (evaluated once).
bool active();
+// [paged 0024] The burst-reclaim fix (truncate + defrag-on-empty + slot release)
+// is on by default whenever the paged engine is active. LLAMA_PAGED_NO_RECLAIM=1
+// restores the pre-fix behavior (no trailing-block reclaim, no compaction) for
+// A/B measurement. Evaluated once.
+bool reclaim_active();
+
// Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq)
// on demand, appending their physical cell indices to `out`. pool_blocks =
// cells.size()/block_size is the stream's block budget. Returns false (leaving
@@ -58,6 +64,12 @@ int64_t slot(const void * cache, int stream, int seq, int pos);
void commit(const void * cache, int stream, int seq,
const std::vector<int> & tokens, uint32_t block_size, uint32_t pool_blocks);
+// [paged 0024 Fix-1] Reclaim the trailing blocks of (cache,stream,seq) beyond
+// logical position n_keep (ref-counted), mirroring a partial kv-cache seq_rm
+// [n_keep, end). When the stream's pool empties as a result, its free queue is
+// defragged to pristine contiguous order (Fix-2). No-op if no manager exists.
+void truncate(const void * cache, int stream, int seq, uint32_t n_keep);
+
// Return one sequence's blocks to the pool (ref-counted; sequence end).
void release(const void * cache, int stream, int seq);
@@ -69,4 +81,10 @@ void release_all(const void * cache);
int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size);
size_t num_free(const void * cache, int stream);
+// [paged 0024] Total free blocks summed across every live manager (all caches /
+// streams). Wrapper-agnostic, so it reports the real pool for hybrid / iSWA
+// models whose outer memory is not a llama_kv_cache. Diagnostics only.
+size_t num_free_global();
+size_t num_managers();
+
} // namespace paged_alloc
diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp
index 4c6ee4c..738b332 100644
--- a/src/paged-kv-manager.cpp
+++ b/src/paged-kv-manager.cpp
@@ -104,6 +104,22 @@ void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) {
num_free_blocks += blocks.size();
}
+void FreeBlockQueue::rebuild(const std::vector<KVCacheBlock*>& blocks) {
+ // Relink the intrusive list using THIS queue's stable fake head/tail nodes.
+ num_free_blocks = blocks.size();
+ for (size_t i = 0; i < blocks.size(); ++i) {
+ blocks[i]->prev_free = (i == 0) ? &fake_head : blocks[i - 1];
+ blocks[i]->next_free = (i + 1 < blocks.size()) ? blocks[i + 1] : &fake_tail;
+ }
+ if (!blocks.empty()) {
+ fake_head.next_free = blocks.front();
+ fake_tail.prev_free = blocks.back();
+ } else {
+ fake_head.next_free = &fake_tail;
+ fake_tail.prev_free = &fake_head;
+ }
+}
+
std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const {
std::vector<KVCacheBlock*> ret;
const KVCacheBlock* curr = fake_head.next_free;
@@ -199,6 +215,20 @@ void BlockPool::cache_full_blocks(const std::vector<KVCacheBlock*>& req_blocks,
}
}
+void BlockPool::defrag_free_queue() {
+ // Pool is fully idle: every non-null block is free (ref_cnt 0). Rebuild the
+ // free list in ascending block_id order so popleft hands out physically
+ // contiguous blocks again. Hashes / the content-cache map are left intact so
+ // a warm committed prefix stays re-hittable.
+ std::vector<KVCacheBlock*> ordered;
+ ordered.reserve(ptrs_.size());
+ for (KVCacheBlock* b : ptrs_) {
+ if (b->is_null) continue;
+ ordered.push_back(b);
+ }
+ free_queue_.rebuild(ordered);
+}
+
// ---------------------------------------------------------------------------
// PagedKVManager (port of SingleTypeKVCacheManager / FullAttentionManager)
// ---------------------------------------------------------------------------
@@ -250,6 +280,21 @@ void PagedKVManager::free(int seq_id) {
req_to_blocks_.erase(it);
}
+void PagedKVManager::truncate(int seq_id, size_t n_keep) {
+ auto it = req_to_blocks_.find(seq_id);
+ if (it == req_to_blocks_.end()) return;
+ auto & blocks = it->second;
+ const size_t keep = cdiv(n_keep, block_size_); // blocks covering [0, n_keep)
+ if (keep >= blocks.size()) return; // nothing trailing to reclaim
+ // Free the trailing blocks [keep, end) tail-first (vLLM eviction order). Their
+ // cells were just cleared by the partial seq_rm, so they are safe to reuse.
+ std::vector<KVCacheBlock*> ordered(blocks.rbegin(),
+ blocks.rbegin() + (blocks.size() - keep));
+ pool_.free_blocks(ordered);
+ blocks.resize(keep);
+ if (blocks.empty()) req_to_blocks_.erase(it);
+}
+
// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent
// hash into the seed so each block hash transitively encodes its whole prefix
// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes).
diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h
index 34decbc..e410d58 100644
--- a/src/paged-kv-manager.h
+++ b/src/paged-kv-manager.h
@@ -47,6 +47,11 @@ public:
void append_n(const std::vector<KVCacheBlock*>& blocks);
void prepend_n(const std::vector<KVCacheBlock*>& blocks);
std::vector<KVCacheBlock*> get_all_free_blocks() const;
+ // [paged 0024 Fix-2] Relink the intrusive free list to the given order using
+ // THIS queue's fake head/tail (the nodes' addresses are stable; a temporary
+ // FreeBlockQueue would leave dangling fake-node pointers). Used to restore a
+ // pristine, contiguous popleft order after a fragmenting burst drains.
+ void rebuild(const std::vector<KVCacheBlock*>& blocks);
private:
KVCacheBlock fake_head{-1};
@@ -67,6 +72,14 @@ public:
size_t num_cached_blocks, size_t num_full_blocks,
const std::vector<uint64_t>& block_hashes);
size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; }
+ // [paged 0024 Fix-2] Total non-null blocks, and whether the pool is fully
+ // idle (every non-null block back in the free queue). defrag_free_queue()
+ // relinks the free queue into pristine ascending-block-id order; only valid
+ // when all_free() so no live request's block table is disturbed. Block hashes
+ // are preserved, so a warm committed prefix stays re-hittable.
+ size_t total_blocks() const { return blocks_.size(); }
+ bool all_free() const { return free_queue_.num_free_blocks + 1 == blocks_.size(); }
+ void defrag_free_queue();
private:
bool maybe_evict_cached_block(KVCacheBlock* block);
@@ -94,6 +107,17 @@ public:
void free(int seq_id);
int block_size() const { return block_size_; }
+ // [paged 0024 Fix-1] Reclaim the trailing blocks of seq_id beyond logical
+ // position n_keep: free every block at index >= ceil(n_keep/bs) (ref-counted,
+ // mirroring vLLM's free of a truncated block suffix). Called on a partial tail
+ // seq_rm [n_keep, end) so the manager's block accounting tracks the kv-cache
+ // exactly instead of stranding the blocks whose cells were just cleared.
+ void truncate(int seq_id, size_t n_keep);
+
+ // [paged 0024 Fix-2] When no live request holds a block, relink the free
+ // queue into pristine contiguous order (undo a burst's scrambled free order).
+ void defrag_free_pool() { if (pool_.all_free()) pool_.defrag_free_queue(); }
+
// Prefix caching (win 3).
static uint64_t hash_block(uint64_t parent_hash, const std::vector<int>& token_ids);
std::vector<uint64_t> compute_block_hashes(const std::vector<int>& token_ids) const;
diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp
index 8573cd2..209cee8 100644
--- a/src/paged-prefix-api.cpp
+++ b/src/paged-prefix-api.cpp
@@ -45,4 +45,12 @@ long num_free(llama_context * ctx) {
return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0);
}
+long num_free_global() {
+ return (long) paged_alloc::num_free_global();
+}
+
+long num_managers() {
+ return (long) paged_alloc::num_managers();
+}
+
} // namespace paged_prefix_api
diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h
index 78a3864..8dd817e 100644
--- a/src/paged-prefix-api.h
+++ b/src/paged-prefix-api.h
@@ -24,4 +24,10 @@ int ref_at(llama_context * ctx, llama_seq_id seq, int pos);
// Number of free blocks in the unified stream-0 pool, or 0 if no manager.
long num_free(llama_context * ctx);
+// [paged 0024] Total free blocks across every live paged manager (all caches /
+// streams). Wrapper-agnostic, so it reports the real pool for hybrid / iSWA
+// models whose outer memory is not a llama_kv_cache. Diagnostics only.
+long num_free_global();
+long num_managers();
+
} // namespace paged_prefix_api
diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp
index f7a114c..8c19cfb 100644
--- a/tools/server/server-context.cpp
+++ b/tools/server/server-context.cpp
@@ -411,6 +411,23 @@ struct server_slot {
reset();
+ // [paged 0024 Fix-3] Return this finished slot's paged blocks to the
+ // pool promptly. Stock llama-server keeps an idle slot's KV for its own
+ // next-prompt cache, but under the paged engine that strands blocks in
+ // idle slots after a high-fan-out burst, so a later low-npl run sees a
+ // depleted, fragmented pool and its prefill collapses. prompt_clear()
+ // issues a full seq_rm (clearing the cells AND, via the paged hook,
+ // releasing + defragging the blocks) and clears the slot-local prompt
+ // cache so the next reuse recomputes from a pristine pool; cross-request
+ // reuse still works through the committed paged content cache. Gated on
+ // LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM opts out for A/B); stock
+ // (paged off) is byte-identical.
+ static const bool paged_release_on_idle =
+ getenv("LLAMA_KV_PAGED") != nullptr && getenv("LLAMA_PAGED_NO_RECLAIM") == nullptr;
+ if (paged_release_on_idle && prompt.n_tokens() > 0) {
+ prompt_clear(false);
+ }
+
callback_on_release(id);
}
}
--
2.43.0
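
Illustrative note (not part of the patch series): the block accounting behind Fix-1 and Fix-2 of patch 0024 can be sketched with plain vectors. This is a simplified model under stated assumptions: block tables are vectors of integer block ids, ref-counting and the content-cache hashes are omitted, and the pool's reserved null block is ignored (the real all_free() compares num_free_blocks + 1 against the block count). The names truncate_blocks and defrag_if_idle are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Ceiling division: number of blocks needed to cover n positions.
inline size_t cdiv(size_t n, size_t d) { return (n + d - 1) / d; }

// Hypothetical model of Fix-1: drop every block beyond ceil(n_keep / block_size)
// from a sequence's block table and return the freed ids tail-first.
std::vector<int> truncate_blocks(std::vector<int> & table, size_t n_keep, size_t block_size) {
    const size_t keep = cdiv(n_keep, block_size);   // blocks covering [0, n_keep)
    if (keep >= table.size()) {
        return {};                                  // nothing trailing to reclaim
    }
    std::vector<int> freed(table.rbegin(),
                           table.rbegin() + (std::ptrdiff_t) (table.size() - keep));
    table.resize(keep);
    return freed;
}

// Hypothetical model of Fix-2: only when every block is free, restore ascending
// block-id order so later allocations are handed out contiguously again.
bool defrag_if_idle(std::vector<int> & free_list, size_t total_blocks) {
    if (free_list.size() != total_blocks) {
        return false;                               // a live sequence still holds blocks
    }
    std::sort(free_list.begin(), free_list.end());
    return true;
}
```

For example, with a block size of 16 a table of four blocks truncated to n_keep = 17 keeps two blocks and frees the last two in reverse order; the defrag step is skipped until the free list again contains every block.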


@@ -1,56 +0,0 @@
From 2f4f5ab7c9050f890ee1137ef9c8ee09dfcd9ae7 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 26 Jun 2026 16:52:21 +0200
Subject: [PATCH] feat(paged): qwen35moe NVFP4 MoE-decode re-graph
(should_use_mmq graph-safe id-path) (patch 0025)
The MUL_MAT_ID CUDA-graph guard (ggml-cuda.cu [TAG_MUL_MAT_ID_CUDA_GRAPHS]) disables CUDA graphs for
the whole decode step whenever a MUL_MAT_ID node has ne[2] > mmvq_mmid_max (8 for NVFP4 on sm_121),
because the per-expert host-loop fallback synchronizes the stream. But on Blackwell NVFP4 the path
actually taken is should_use_mmq()==true -> the grouped stream-k mul_mat_q id-branch, which launches
on one stream with NO host sync (no cudaStreamSynchronize/Memcpy in mmq.cu/mmid.cu). The disable is
therefore conservative; graphs are safe for the grouped path.
Env-gated (LLAMA_MOE_FORCE_GRAPHS, default-off = byte-identical to stock): when set and the node
takes the grouped MMQ path, keep CUDA graphs on for the MoE decode step.
Measured (DGX GB10 sm_121, q36-35b-a3b-nvfp4, llama-batched-bench -fa on -npp128 -ntg128, decode_agg):
npl 8 226.0 -> 226.4 +0.2% (noise; ne2<=8 already on the MMVQ-graphed path)
npl 32 433.8 -> 452.7 +4.4%
npl 64 589.0 -> 605.9 +2.9%
npl 128 743.1 -> 757.1 +1.9%
Bit-exact (graph replay re-issues identical kernels): test-backend-ops MUL_MAT_ID 806/806 CUDA0 OK;
parallel-greedy np16 (ne2=16>8) generated content byte-identical ON==OFF.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/src/ggml-cuda/ggml-cuda.cu | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu
index cca7059..254d2e0 100644
--- a/ggml/src/ggml-cuda/ggml-cuda.cu
+++ b/ggml/src/ggml-cuda/ggml-cuda.cu
@@ -3275,7 +3275,17 @@ static bool ggml_cuda_graph_check_compability(ggml_cgraph * cgraph) {
if (node->op == GGML_OP_MUL_MAT_ID) {
const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc;
const int mmvq_mmid_max = get_mmvq_mmid_max_batch(node->src[0]->type, cc);
- if (!ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max) {
+ bool mmid_needs_sync = !ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max;
+ // PROBE (bit-exact, env LLAMA_MOE_FORCE_GRAPHS): the grouped stream-k MMQ id-path is
+ // launched on-stream with no host sync (only the per-expert host-loop fallback syncs);
+ // when should_use_mmq() is true (Blackwell NVFP4 grouped path) the op is graph-safe
+ // even for ne[2] > mmvq_mmid_max, so graphs need not be disabled for the whole step.
+ if (mmid_needs_sync && ggml_is_quantized(node->src[0]->type) &&
+ getenv("LLAMA_MOE_FORCE_GRAPHS") != nullptr &&
+ ggml_cuda_should_use_mmq(node->src[0]->type, cc, node->src[1]->ne[2], node->src[0]->ne[2])) {
+ mmid_needs_sync = false;
+ }
+ if (mmid_needs_sync) {
// under these conditions, the mul_mat_id operation will need to synchronize the stream, so we cannot use CUDA graphs
// TODO: figure out a way to enable for larger batch sizes, without hurting performance
// ref: https://github.com/ggml-org/llama.cpp/pull/18958
--
2.43.0
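
Illustrative note (not part of the patch series): the decision patch 0025 adds to the MUL_MAT_ID graph guard reduces to a small predicate. The sketch below restates it as a pure function so the cases are easy to enumerate. The function name mmid_disables_graphs and its boolean parameters are hypothetical stand-ins for ggml_is_quantized(), the LLAMA_MOE_FORCE_GRAPHS environment check, and ggml_cuda_should_use_mmq().

```cpp
// Hypothetical restatement of the guard in patch 0025. Returns true when the
// MUL_MAT_ID node forces CUDA graphs off for the whole decode step.
bool mmid_disables_graphs(bool src0_quantized, long ne2, long mmvq_mmid_max,
                          bool force_graphs_env, bool uses_grouped_mmq) {
    // Stock rule: unquantized weights or a batch above the MMVQ limit need a host sync.
    bool needs_sync = !src0_quantized || ne2 > mmvq_mmid_max;
    // Patch rule: a quantized node on the grouped MMQ path launches on-stream
    // without a host sync, so the opt-in env var may keep graphs enabled.
    if (needs_sync && src0_quantized && force_graphs_env && uses_grouped_mmq) {
        needs_sync = false;
    }
    return needs_sync;
}
```

Note that the override never applies to unquantized weights, and that without the environment variable the result matches the stock rule in every case.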


File diff suppressed because it is too large


@@ -1,578 +0,0 @@
From fafe8785c8595f53a51efec20cf84f9146437e0c Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Fri, 26 Jun 2026 22:58:47 +0200
Subject: [PATCH] feat(paged): qwen35 recurrent-state gather fusion (patch
0028)
The MoE-gap groundtruth found k_get_rows_float to be the single biggest decode
kernel vLLM has no equivalent of (~5.2 ms/step MoE; also dense): vLLM updates its
gated-DeltaNet recurrent state in place, while llama ran a separate ggml_get_rows
gather. Patch 0019 fused the SSM-state gather; patch 0021 fused the conv compute
but kept a build_rs gather for the conv taps. This closes that residual.
nsys located the residual k_get_rows as the conv-state tap gather in
build_conv_state_fused: a 24576-float (= n_embd_r = (d_conv-1)*(d_inner +
2*n_group*d_state)) row x 128 sequences, once per GDN layer per decode step
(~720 big ~115 us gathers / 24-step window). The SSM-state gather is already
fused by 0019, so this conv gather is the last k_get_rows in the GDN decode path.
New op ggml_ssm_conv_update_inplace_ids (reuses GGML_OP_SSM_CONV, discriminated
by a non-null src[4] = ids) takes the FULL conv cache + the s_copy ids and reads
each active sequence's prior taps directly from cache[ids[s]] in the kernel (no
ggml_get_rows). Identity sequences (ids[s] == rs_head + s, the AR-decode path)
read in place from the conv_state_dst write slot (the whole window is loaded into
registers before the ring write-back, so read==write is race-free); non-identity
sequences (reorder / rs_zero) are gathered into a disjoint scratch by a small
ssm_conv_gather_nonident_kernel first. Mirrors the 0019 in-place + ids gather
fusion. The read VALUES are unchanged; only the read path (gather -> indexed
in-kernel read) changes, so it is bit-identical to the build_rs gather + 0021 op.
build_conv_state_fused now feeds the full cache + ids through the build_rs
get_state_rows lambda (rs_zero clear + extra-states copy still run around it).
Helps BOTH dense and MoE (shared GDN conv path).
GATE test-backend-ops (CUDA0 vs CPU, 2/2 backends): SSM_CONV_UPDATE_IDS OK (new),
SSM_CONV_UPDATE OK, SSM_CONV OK, GATED_DELTA_NET OK, GET_ROWS OK.
GATE greedy md5 (--temp 0 --seed 1 -n 48) BYTE-IDENTICAL both models:
q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4
07db32c2bcb78d17a43ed18bc22705cd (== baseline).
nsys: k_get_rows_float float,float 10174 -> 9454 instances (720 fewer = 30 GDN
layers x 24 steps), 186.3 -> 102.8 ms; the 720 ~115 us conv gathers replaced by a
720 x ~1.1 us no-op ssm_conv_gather_nonident (all identity at steady decode).
MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
ggml/include/ggml.h | 20 ++++
ggml/src/ggml-cpu/ops.cpp | 90 +++++++++++++++++-
ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++-
ggml/src/ggml.c | 62 +++++++++++++
src/models/delta-net-base.cpp | 26 ++++--
tests/test-backend-ops.cpp | 69 ++++++++++++++
6 files changed, 411 insertions(+), 11 deletions(-)
diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h
index 2a5cbce..5fa220a 100644
--- a/ggml/include/ggml.h
+++ b/ggml/include/ggml.h
@@ -2463,6 +2463,26 @@ extern "C" {
struct ggml_tensor * conv_state_dst,
bool fuse_silu);
+ // Gather-free variant of ggml_ssm_conv_update_inplace (patch 0028). Instead of a pre-gathered
+ // per-sequence tap scratch, it takes the FULL conv-state cache (`conv_states` = [K-1, channels,
+ // n_cells]) plus the per-sequence `ids` ([n_seqs], I32, = the recurrent-state s_copy) and reads
+ // each active sequence's prior taps directly from cache[ids[s]] inside the kernel -- no
+ // ggml_get_rows materialization (mirrors ggml_gated_delta_net_inplace_ids). Identity sequences
+ // (ids[s] == rs_head + s) are read in place from `conv_state_dst` (the write slot); any
+ // non-identity sequence (reorder / rs_zero remap) is gathered into a disjoint scratch by the
+ // backend first, so the read never aliases another sequence's in-place ring write -> race-free
+ // and bit-identical to the get_rows + ggml_ssm_conv_update_inplace path. op_params[0]=fuse_silu,
+ // op_params[1]=rs_head. Reuses GGML_OP_SSM_CONV, discriminated by a non-null src[4].
+ GGML_API struct ggml_tensor * ggml_ssm_conv_update_inplace_ids(
+ struct ggml_context * ctx,
+ struct ggml_tensor * conv_states,
+ struct ggml_tensor * conv_kernel,
+ struct ggml_tensor * x_cur,
+ struct ggml_tensor * conv_state_dst,
+ struct ggml_tensor * ids,
+ int rs_head,
+ bool fuse_silu);
+
GGML_API struct ggml_tensor * ggml_ssm_scan(
struct ggml_context * ctx,
struct ggml_tensor * s,
diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp
index 07ab9e5..515aae4 100644
--- a/ggml/src/ggml-cpu/ops.cpp
+++ b/ggml/src/ggml-cpu/ops.cpp
@@ -9580,6 +9580,90 @@ static void ggml_compute_forward_ssm_conv_update_f32(
}
}
+// Patch 0028: CPU reference for ggml_ssm_conv_update_inplace_ids (mirror of the CUDA
+// ssm_conv_update_ids_f32). Reads each active sequence's prior K-1 taps directly from the FULL conv
+// cache (src[0]) via ids (src[4]) -- identity sequences (ids[s] == rs_head + s) read in place from the
+// destination slot src[3], non-identity from cache[ids[s]] -- computes the depthwise conv with the
+// same ascending-tap FMA order, optionally folds silu, writes the conv output to dst, and writes the
+// 1-token-shifted ring state back in place into src[3]. The window is copied to a local before the
+// write so the identity (read == write slot) case is correct. Threads split over channels.
+static void ggml_compute_forward_ssm_conv_update_ids_f32(
+ const ggml_compute_params * params,
+ ggml_tensor * dst) {
+ const ggml_tensor * conv_states = dst->src[0]; // FULL cache [K-1, channels, n_cells]
+ const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
+ const ggml_tensor * x_cur = dst->src[2]; // [channels, 1, n_seqs]
+ ggml_tensor * cdst = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
+ const ggml_tensor * ids = dst->src[4]; // [n_seqs] I32 slot indices (s_copy)
+
+ const int ith = params->ith;
+ const int nth = params->nth;
+
+ const int64_t d_conv = conv_kernel->ne[0];
+ const int64_t channels = conv_kernel->ne[1];
+ const int64_t n_seqs = x_cur->ne[2];
+ const bool apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
+ const int32_t rs_head = ggml_get_op_params_i32(dst, 1);
+
+ GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+ GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
+ GGML_ASSERT(ids->type == GGML_TYPE_I32);
+ GGML_ASSERT(d_conv <= 8);
+
+ const int64_t cache_row_stride = conv_states->nb[2] / sizeof(float); // (K-1)*channels
+ const int64_t w_stride = conv_kernel->nb[1] / sizeof(float);
+ const int64_t x_seq_stride = x_cur->nb[2] / sizeof(float);
+ const int64_t dst_seq_stride = dst->nb[2] / sizeof(float);
+ const int64_t cdst_seq_stride = cdst->nb[1] / sizeof(float);
+
+ const float * cache_base = (const float *) conv_states->data;
+ const float * w_base = (const float *) conv_kernel->data;
+ const float * x_base = (const float *) x_cur->data;
+ float * cdst_base = (float *) cdst->data;
+ float * dst_base = (float *) dst->data;
+ const int32_t * ids_base = (const int32_t *) ids->data;
+
+ const int64_t dc = (channels + nth - 1) / nth;
+ const int64_t c0 = dc * ith;
+ const int64_t c1 = MIN(c0 + dc, channels);
+
+ for (int64_t s = 0; s < n_seqs; ++s) {
+ const int32_t r = ids_base[s];
+ const bool ident = (r == rs_head + (int32_t) s);
+ // identity reads the K-1 taps in place from the destination slot; non-identity from cache[r].
+ const float * states_seq = ident
+ ? (cdst_base + s * cdst_seq_stride)
+ : (cache_base + (int64_t) r * cache_row_stride);
+ for (int64_t c = c0; c < c1; ++c) {
+ const float * states_c = states_seq + c * (d_conv - 1);
+ const float * w_c = w_base + c * w_stride;
+ const float xc = x_base[s * x_seq_stride + c];
+
+ // window = [tap0 .. tap_{K-2}, xc], copied to a local before the (possibly aliasing) write
+ float window[8];
+ for (int64_t j = 0; j < d_conv - 1; ++j) {
+ window[j] = states_c[j];
+ }
+ window[d_conv - 1] = xc;
+
+ // ascending-tap FMA: tap0*w0 + ... + tap_{K-2}*w_{K-2} + xc*w_{K-1} (matches ssm_conv)
+ float sumf = 0.0f;
+ for (int64_t j = 0; j < d_conv; ++j) {
+ sumf += window[j] * w_c[j];
+ }
+ sumf += 0.0f; // matches ssm_conv `sumf += b` with b == 0
+
+ dst_base[s * dst_seq_stride + c] = apply_silu ? (sumf / (1.0f + expf(-sumf))) : sumf;
+
+ // 1-token-shifted ring write-back: [tap1 .. tap_{K-2}, xc]
+ float * out_state = cdst_base + s * cdst_seq_stride + c * (d_conv - 1);
+ for (int64_t j = 0; j < d_conv - 1; ++j) {
+ out_state[j] = window[j + 1];
+ }
+ }
+ }
+}
+
void ggml_compute_forward_ssm_conv(
const ggml_compute_params * params,
ggml_tensor * dst) {
@@ -9587,7 +9671,11 @@ void ggml_compute_forward_ssm_conv(
case GGML_TYPE_F32:
{
if (dst->src[3] != nullptr) {
- ggml_compute_forward_ssm_conv_update_f32(params, dst);
+ if (dst->src[4] != nullptr) {
+ ggml_compute_forward_ssm_conv_update_ids_f32(params, dst);
+ } else {
+ ggml_compute_forward_ssm_conv_update_f32(params, dst);
+ }
} else {
ggml_compute_forward_ssm_conv_f32(params, dst);
}
diff --git a/ggml/src/ggml-cuda/ssm-conv.cu b/ggml/src/ggml-cuda/ssm-conv.cu
index e1af1cd..28b3cce 100644
--- a/ggml/src/ggml-cuda/ssm-conv.cu
+++ b/ggml/src/ggml-cuda/ssm-conv.cu
@@ -226,6 +226,153 @@ static void ggml_cuda_op_ssm_conv_update(ggml_backend_cuda_context & ctx, ggml_t
}
}
+// Patch 0028: gather only the NON-identity sequences' prior conv taps from the FULL conv cache into a
+// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the
+// destination slot by the update kernel and are skipped here. One block per sequence. Mirrors
+// gdn_gather_nonident_kernel (the 0019 recurrent-state gather fusion).
+static __global__ void ssm_conv_gather_nonident_kernel(const float * __restrict__ cache,
+ const int32_t * __restrict__ ids, int rs_head,
+ float * __restrict__ scratch, int row_stride, int n_seqs) {
+ const int s = blockIdx.x;
+ if (s >= n_seqs) {
+ return;
+ }
+ const int r = ids[s];
+ if (r == rs_head + s) {
+ return; // identity: prior taps already live in the in-place destination slot
+ }
+ const float * src = cache + (int64_t) r * row_stride;
+ float * dst = scratch + (int64_t) s * row_stride;
+ for (int i = threadIdx.x; i < row_stride; i += blockDim.x) {
+ dst[i] = src[i];
+ }
+}
+
+// Patch 0028: gather-free fused conv update. Per (channel, sequence), read the K-1 prior taps from the
+// active sequence's cache slot via ids -- identity (ids[s] == rs_head + s) reads in place from
+// conv_state_dst (the same slot it writes; the whole window is loaded into registers before any write,
+// so it is race-free), non-identity reads the pre-gathered disjoint scratch -- then computes the
+// depthwise conv with the SAME ascending-tap FMA order as ssm_conv_update_f32, folds silu, writes the
+// conv output, and writes the 1-token-shifted ring state back in place. Bit-identical to the get_rows +
+// ssm_conv_update_f32 path: the read VALUES are the same; only the read POINTER changes.
+template <bool apply_silu, int d_conv>
+static __global__ void ssm_conv_update_ids_f32(const float * __restrict__ nonident_scratch,
+ const float * __restrict__ conv_kernel,
+ const float * __restrict__ x_cur,
+ float * __restrict__ conv_state_dst,
+ float * __restrict__ dst,
+ const int32_t * __restrict__ ids,
+ const int rs_head,
+ const int channels,
+ const int scratch_seq_stride,
+ const int w_stride,
+ const int x_seq_stride,
+ const int dst_seq_stride,
+ const int cdst_seq_stride) {
+ const int c = blockIdx.x * blockDim.x + threadIdx.x; // channel
+ const int s = blockIdx.y; // sequence
+ if (c >= channels) {
+ return;
+ }
+
+ const bool ident = (ids[s] == rs_head + s);
+ const float * states_c = ident
+ ? conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1)
+ : nonident_scratch + (int64_t) s * scratch_seq_stride + (int64_t) c * (d_conv - 1);
+ const float * w_c = conv_kernel + (int64_t) c * w_stride;
+ const float xc = x_cur[(int64_t) s * x_seq_stride + c];
+
+ // window = [tap0 .. tap_{K-2}, current-token], same ordering as ssm_conv_update_f32
+ float window[d_conv];
+#pragma unroll
+ for (int j = 0; j < d_conv - 1; j++) {
+ window[j] = states_c[j];
+ }
+ window[d_conv - 1] = xc;
+
+ float sumf = 0.0f;
+#pragma unroll
+ for (int j = 0; j < d_conv; j++) {
+ sumf += window[j] * w_c[j];
+ }
+ sumf += 0.0f; // matches ssm_conv_f32 `sumf += b` with b == 0 (qwen35 conv1d has no bias)
+ dst[(int64_t) s * dst_seq_stride + c] = apply_silu ? ggml_cuda_op_silu_single(sumf) : sumf;
+
+ // 1-token-shifted ring write-back: drop the oldest tap, append the current token
+ float * out_state = conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1);
+#pragma unroll
+ for (int j = 0; j < d_conv - 1; j++) {
+ out_state[j] = window[j + 1];
+ }
+}
+
+static void ggml_cuda_op_ssm_conv_update_ids(ggml_backend_cuda_context & ctx, ggml_tensor * dst) {
+ const ggml_tensor * conv_states = dst->src[0]; // FULL cache [K-1, channels, n_cells]
+ const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels]
+ const ggml_tensor * x_cur = dst->src[2]; // [channels, 1, n_seqs]
+ const ggml_tensor * cdst = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target
+ const ggml_tensor * ids = dst->src[4]; // [n_seqs] I32 slot indices (s_copy)
+
+ const int64_t d_conv = conv_kernel->ne[0];
+ const int64_t channels = conv_kernel->ne[1];
+ const int64_t n_seqs = x_cur->ne[2];
+ const bool apply_silu = ggml_get_op_params_i32(dst, 0) != 0;
+ const int rs_head = ggml_get_op_params_i32(dst, 1);
+
+ GGML_ASSERT(conv_states->type == GGML_TYPE_F32 && conv_kernel->type == GGML_TYPE_F32);
+ GGML_ASSERT(x_cur->type == GGML_TYPE_F32 && cdst->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32);
+ GGML_ASSERT(ids->type == GGML_TYPE_I32);
+ GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+ GGML_ASSERT(conv_states->nb[1] == (size_t) (d_conv - 1) * sizeof(float));
+ GGML_ASSERT(conv_kernel->nb[0] == sizeof(float));
+ GGML_ASSERT(dst->ne[0] == channels && dst->ne[1] == 1 && dst->ne[2] == n_seqs);
+
+ const float * cache_d = (const float *) conv_states->data;
+ const float * w_d = (const float *) conv_kernel->data;
+ const float * x_d = (const float *) x_cur->data;
+ float * cdst_d = (float *) cdst->data;
+ float * dst_d = (float *) dst->data;
+ const int32_t * ids_d = (const int32_t *) ids->data;
+ cudaStream_t stream = ctx.stream();
+
+ // n_embd_r = (K-1)*channels: the per-cell row stride of the full conv cache.
+ const int cache_row_stride = (int) (conv_states->nb[2] / sizeof(float));
+ const int w_stride = (int) (conv_kernel->nb[1] / sizeof(float));
+ const int x_seq_stride = (int) (x_cur->nb[2] / sizeof(float));
+ const int dst_seq_stride = (int) (dst->nb[2] / sizeof(float));
+ const int cdst_seq_stride = (int) (cdst->nb[1] / sizeof(float));
+
+ // Gather only the non-identity sequences' prior taps into a disjoint scratch (identity sequences
+ // read in place from cdst). The scratch is written here and read-only by the update kernel, so the
+ // update kernel never reads a slot another block writes -> race-free. No-op at steady AR decode.
+ ggml_cuda_pool_alloc<float> nonident_scratch(ctx.pool());
+ float * scratch = nonident_scratch.alloc((size_t) cache_row_stride * n_seqs);
+ if (n_seqs > 0) {
+ ssm_conv_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(
+ cache_d, ids_d, rs_head, scratch, cache_row_stride, (int) n_seqs);
+ }
+
+ const int threads = 128;
+ const dim3 blocks((channels + threads - 1) / threads, (unsigned) n_seqs, 1);
+
+ auto launch = [&](auto NC) {
+ constexpr int kNC = decltype(NC)::value;
+ if (apply_silu) {
+ ssm_conv_update_ids_f32<true, kNC><<<blocks, threads, 0, stream>>>(scratch, w_d, x_d, cdst_d, dst_d,
+ ids_d, rs_head, (int) channels, cache_row_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
+ } else {
+ ssm_conv_update_ids_f32<false, kNC><<<blocks, threads, 0, stream>>>(scratch, w_d, x_d, cdst_d, dst_d,
+ ids_d, rs_head, (int) channels, cache_row_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride);
+ }
+ };
+
+ switch (d_conv) {
+ case 3: launch(std::integral_constant<int, 3>{}); break;
+ case 4: launch(std::integral_constant<int, 4>{}); break;
+ default: GGML_ABORT("ssm_conv_update_ids only supports d_conv 3 or 4");
+ }
+}
+
template <bool apply_silu>
static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1,
const int src0_nb2, const int src1_nb1, float * dst, const int dst_nb0, const int dst_nb1,
@@ -266,7 +413,13 @@ void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, g
// silu of the decode conv path into a single kernel.
if (dst->src[3] != nullptr) {
GGML_ASSERT(bias_add_node == nullptr && silu_dst == nullptr);
- ggml_cuda_op_ssm_conv_update(ctx, dst);
+ // Patch 0028: a non-null src[4] (ids) selects the gather-free variant that reads each
+ // sequence's prior taps directly from the full cache via ids (no get_rows materialization).
+ if (dst->src[4] != nullptr) {
+ ggml_cuda_op_ssm_conv_update_ids(ctx, dst);
+ } else {
+ ggml_cuda_op_ssm_conv_update(ctx, dst);
+ }
return;
}
diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c
index 16b180f..dcc09bd 100644
--- a/ggml/src/ggml.c
+++ b/ggml/src/ggml.c
@@ -5606,6 +5606,68 @@ struct ggml_tensor * ggml_ssm_conv_update_inplace(
return result;
}
+// ggml_ssm_conv_update_inplace_ids
+//
+// Gather-free variant of ggml_ssm_conv_update_inplace (patch 0028). Instead of a pre-gathered
+// per-sequence tap scratch, it takes the FULL conv-state cache (`conv_states` = [K-1, channels,
+// n_cells]) plus the per-sequence `ids` (the recurrent-state s_copy) and reads each active sequence's
+// prior taps directly from cache[ids[s]] inside the kernel (no ggml_get_rows). Identity sequences
+// (ids[s] == rs_head + s) read in place from the `conv_state_dst` write slot; non-identity sequences
+// are gathered into a disjoint scratch by the backend first. Bit-identical to the get_rows +
+// ggml_ssm_conv_update_inplace path. Reuses GGML_OP_SSM_CONV, discriminated by a non-null src[4].
+// op_params[1] carries rs_head. Mirrors the 0019 ggml_gated_delta_net_inplace_ids gather fusion.
+struct ggml_tensor * ggml_ssm_conv_update_inplace_ids(
+ struct ggml_context * ctx,
+ struct ggml_tensor * conv_states,
+ struct ggml_tensor * conv_kernel,
+ struct ggml_tensor * x_cur,
+ struct ggml_tensor * conv_state_dst,
+ struct ggml_tensor * ids,
+ int rs_head,
+ bool fuse_silu) {
+ GGML_ASSERT(ggml_is_3d(conv_states));
+ GGML_ASSERT(ggml_is_matrix(conv_kernel));
+ GGML_ASSERT(ggml_is_3d(x_cur));
+ GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32);
+
+ const int64_t d_conv = conv_kernel->ne[0];
+ const int64_t channels = conv_kernel->ne[1];
+ const int64_t n_seqs = x_cur->ne[2];
+
+ GGML_ASSERT(conv_states->type == GGML_TYPE_F32);
+ GGML_ASSERT(conv_kernel->type == GGML_TYPE_F32);
+ GGML_ASSERT(x_cur->type == GGML_TYPE_F32);
+ GGML_ASSERT(conv_state_dst != NULL && conv_state_dst->type == GGML_TYPE_F32);
+
+ // conv_states: FULL cache [K-1, channels, n_cells], contiguous taps per channel
+ GGML_ASSERT(conv_states->ne[0] == d_conv - 1);
+ GGML_ASSERT(conv_states->ne[1] == channels);
+ GGML_ASSERT(conv_states->nb[0] == sizeof(float));
+ // x_cur: single decode token per sequence
+ GGML_ASSERT(x_cur->ne[0] == channels);
+ GGML_ASSERT(x_cur->ne[1] == 1);
+ // ids: one slot index per active sequence
+ GGML_ASSERT(ids->ne[0] == n_seqs);
+ // conv_state_dst: [(K-1)*channels, n_seqs] in-place ring write target
+ GGML_ASSERT(conv_state_dst->ne[0] == (d_conv - 1) * channels);
+ GGML_ASSERT(conv_state_dst->ne[1] >= n_seqs);
+ GGML_ASSERT(conv_state_dst->nb[0] == sizeof(float));
+
+ struct ggml_tensor * result = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
+
+ ggml_set_op_params_i32(result, 0, fuse_silu ? 1 : 0);
+ ggml_set_op_params_i32(result, 1, rs_head);
+
+ result->op = GGML_OP_SSM_CONV;
+ result->src[0] = conv_states;
+ result->src[1] = conv_kernel;
+ result->src[2] = x_cur;
+ result->src[3] = conv_state_dst;
+ result->src[4] = ids;
+
+ return result;
+}
+
// ggml_ssm_scan
struct ggml_tensor * ggml_ssm_scan(
diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp
index 58f3d0c..962f5eb 100644
--- a/src/models/delta-net-base.cpp
+++ b/src/models/delta-net-base.cpp
@@ -548,25 +548,33 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state_fused(
GGML_ASSERT(n_seq_tokens == 1); // single-token decode only
GGML_ASSERT(cparams.n_rs_seq == 0); // no rollback splits on this path
- // Prior conv-state taps for the active sequences: [K-1, conv_channels, n_seqs]. Same get_rows
- // gather as the baseline build_conv_state read (tiny; not one of the eliminated buckets).
- ggml_tensor * conv_states = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs);
- conv_states = ggml_reshape_3d(ctx0, conv_states, conv_kernel_size - 1, conv_channels, n_seqs);
- cb(conv_states, "conv_states_reshaped", il);
-
// Current token, native (non-transposed) qkv_mixed: [conv_channels, 1, n_seqs].
ggml_tensor * x_cur = ggml_reshape_3d(ctx0, qkv_mixed, conv_channels, n_seq_tokens, n_seqs);
// In-place ring write-back target = the active sequences' conv-cache slot at kv_head, exactly the
// destination the baseline ggml_cpy wrote to (s_slot == 0).
- const int64_t row_count = (conv_kernel_size - 1) * conv_channels;
+ const int64_t row_count = (conv_kernel_size - 1) * conv_channels; // = n_embd_r
const size_t row_size = ggml_row_size(conv_states_all->type, row_count);
ggml_tensor * conv_state_dst =
ggml_view_2d(ctx0, conv_states_all, row_count, n_seqs, conv_states_all->nb[1], kv_head * row_size);
cb(conv_state_dst, "conv_state_update", il);
- ggml_tensor * conv_output =
- ggml_ssm_conv_update_inplace(ctx0, conv_states, conv_kernel, x_cur, conv_state_dst, /*fuse_silu=*/true);
+ // Patch 0028: fuse the residual conv-state tap gather (the k_get_rows that build_conv_state's
+ // build_rs left firing; roughly the largest single residual decode kernel, see MOE_GAP_VS_VLLM.md).
+ // Exactly like the 0019 SSM-state gather fusion, build_rs feeds the FULL conv cache + the s_copy
+ // ids into the op (via the get_state_rows lambda) and still performs the rs_zero clear and the
+ // extra-states copy around it; the op reads each active sequence's prior taps directly from
+ // cache[ids[s]] (identity sequences read in place from conv_state_dst), so the separate
+ // ggml_get_rows materialization is eliminated. The read VALUES are unchanged, only the read path
+ // (gather -> indexed in-kernel read) changes, so it is bit-identical to the build_rs gather.
+ auto get_conv_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * {
+ // states = full conv-state cache reshaped 2d [n_embd_r, n_cells]
+ ggml_tensor * cache3d = ggml_reshape_3d(ctx, states, conv_kernel_size - 1, conv_channels, states->ne[1]);
+ return ggml_ssm_conv_update_inplace_ids(ctx, cache3d, conv_kernel, x_cur, conv_state_dst,
+ ids, (int) kv_head, /*fuse_silu=*/true);
+ };
+
+ ggml_tensor * conv_output = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs, get_conv_op);
cb(conv_output, "conv_output_silu", il);
// the ring write is a side effect of the op; pull the op into the graph via the output
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index b5e3048..302975f 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -3793,6 +3793,65 @@ struct test_ssm_conv_update : public test_case {
}
};
+// GGML_OP_SSM_CONV gather-free fused decode conv-update via ids (ggml_ssm_conv_update_inplace_ids,
+// patch 0028). conv_states is the FULL cache; ids (a shuffled permutation of [0,n_seqs), rs_head=0)
+// selects each sequence's slot. A shuffle can hit BOTH the identity in-place read (ids[s]==s) and
+// the non-identity read (neither guaranteed per run). Validates conv + silu output (dst) against CPU.
+struct test_ssm_conv_update_ids : public test_case {
+ const int64_t d_conv;
+ const int64_t channels;
+ const int64_t n_seqs;
+
+ std::string op_desc(ggml_tensor * t) override {
+ GGML_UNUSED(t);
+ return "SSM_CONV_UPDATE_IDS";
+ }
+
+ std::string vars() override {
+ return VARS_TO_STR3(d_conv, channels, n_seqs);
+ }
+
+ test_ssm_conv_update_ids(int64_t d_conv = 4, int64_t channels = 256, int64_t n_seqs = 4)
+ : d_conv(d_conv), channels(channels), n_seqs(n_seqs) {}
+
+ ggml_tensor * build_graph(ggml_context * ctx) override {
+ ggml_tensor * conv_states = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_conv - 1, channels, n_seqs);
+ ggml_tensor * conv_kernel = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, channels);
+ ggml_tensor * x_cur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs);
+ ggml_tensor * conv_state_dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (d_conv - 1) * channels, n_seqs);
+ ggml_tensor * ids = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_seqs);
+ ggml_set_name(conv_states, "conv_states");
+ ggml_set_name(conv_kernel, "conv_kernel");
+ ggml_set_name(x_cur, "x_cur");
+ ggml_set_name(conv_state_dst, "conv_state_dst");
+ ggml_set_name(ids, "ids");
+
+ ggml_tensor * out = ggml_ssm_conv_update_inplace_ids(ctx, conv_states, conv_kernel, x_cur,
+ conv_state_dst, ids, /*rs_head=*/0, /*fuse_silu=*/true);
+ ggml_set_name(out, "out");
+ return out;
+ }
+
+ void initialize_tensors(ggml_context * ctx) override {
+ std::random_device rd;
+ std::default_random_engine rng(rd());
+ for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
+ if (t->type == GGML_TYPE_I32) {
+ // ids: shuffled permutation of [0, n_seqs) into the full cache (rs_head == 0), so some
+ // sequences are identity (ids[s] == s, in-place read) and some are not (scratch read).
+ std::vector<int32_t> data(t->ne[0]);
+ for (int i = 0; i < t->ne[0]; i++) {
+ data[i] = i;
+ }
+ std::shuffle(data.begin(), data.end(), rng);
+ ggml_backend_tensor_set(t, data.data(), 0, t->ne[0] * sizeof(int32_t));
+ } else {
+ init_tensor_uniform(t);
+ }
+ }
+ }
+};
+
// GGML_OP_SSM_SCAN
struct test_ssm_scan : public test_case {
const ggml_type type;
@@ -8504,6 +8563,16 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() {
}
}
+ // gather-free fused decode conv-update via ids (ggml_ssm_conv_update_inplace_ids, patch 0028).
+ // channels must be a multiple of 128 for the CUDA SSM_CONV supports_op gate.
+ for (int64_t d_conv : {3, 4}) {
+ for (int64_t channels : {256, 3328}) {
+ for (int64_t n_seqs : {1, 4, 32, 128}) {
+ test_cases.emplace_back(new test_ssm_conv_update_ids(d_conv, channels, n_seqs));
+ }
+ }
+ }
+
test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1, 1024, 1, 32, 4)); // Mamba-1
test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 128, 64, 16, 2, 32, 4)); // Mamba-2
test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 256, 64, 8, 2, 32, 4)); // Falcon-H1
--
2.43.0
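
The per-channel arithmetic that patch 0028 describes (build the window from the K-1 prior taps plus the current token, accumulate in ascending tap order, then write back the 1-token-shifted state) can be sketched in plain C++. This is a minimal illustration under stated assumptions, not the ggml kernel: `conv_ring_step` and its `std::vector` signature are hypothetical names introduced here, and the sketch covers one channel of one sequence with no slot indexing.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ggml): one decode step for a single channel.
// `state` holds the K-1 prior taps, oldest first; `w` holds the K kernel weights.
inline float conv_ring_step(std::vector<float> & state, const std::vector<float> & w,
                            float x, bool apply_silu) {
    const std::size_t k = w.size();
    assert(k >= 1 && state.size() == k - 1);

    // Copy the taps to a local window before writing, so the write-back
    // below may target the same storage the taps were read from.
    std::vector<float> window(state.begin(), state.end());
    window.push_back(x);

    // Ascending-tap accumulation: tap0*w0 + ... + tap_{K-2}*w_{K-2} + x*w_{K-1}.
    float sum = 0.0f;
    for (std::size_t j = 0; j < k; ++j) {
        sum += window[j] * w[j];
    }

    // 1-token-shifted write-back: drop the oldest tap, append the current token.
    for (std::size_t j = 0; j + 1 < k; ++j) {
        state[j] = window[j + 1];
    }

    return apply_silu ? sum / (1.0f + std::exp(-sum)) : sum;
}
```

With taps `{1, 2, 3}`, unit weights and `x = 4`, the un-activated output is the plain sum of the window and the stored taps become `{2, 3, 4}`. Copying the window first is what makes the identity case safe when the read slot and the write slot are the same memory.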


@@ -1,176 +0,0 @@
From e2acb3bca4d12ecef4964a214d397fc91ecfcebc Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sat, 27 Jun 2026 03:45:19 +0200
Subject: [PATCH] feat(paged): block-table within-step host cache (patch 0029)

Lever 5 (host pipeline). get_block_table() is called once per full-attention
layer per decode step, but the KV cell layout (and therefore the block table)
is fixed for the whole step: it only changes in apply() when the ubatch's slots
are committed. The old path recomputed the full table on every layer.

This caches the table the first time it is built in a step and reuses the bytes
(memcpy) for every subsequent full-attention layer, invalidating the cache in
apply(). The reused bytes are identical to a fresh compute, so the change is
bit-exact. Toggle off by setting LLAMA_PAGED_NO_BT_CACHE (e.g. =1; the check is
getenv() != nullptr, so any value disables the cache).

Measured host-side get_block_table time (llama-batched-bench, npp128 ntg128
npl128, cache OFF -> ON):
- MoE q36-35b-a3b-nvfp4: 112.94 -> 14.82 ms (-87%)
- dense q36-27b-nvfp4 : 193.78 -> 16.90 ms (-91%)

Throughput: dense is partly host-bound and gains (TG 364.8 -> 374.7 t/s,
+2.7%, ~95.8% of the vLLM 391 t/s reference @npl128). MoE decode is
compute-bound (FP4 GEMM dominates) so the saved host time is off the critical
path and TG is flat (752.2 -> 757.0 t/s). The cache is therefore a pure
pipeline cleanup, not a numeric change.

Bit-exact, per path (llama-completion --temp 0 --seed 1, 48 tok):
- non-paged MoE = 07db32c2bcb78d17a43ed18bc22705cd (unchanged baseline)
- paged MoE = 8cb0ce23777bf55f92f63d0292c756b0 (paged baseline)
- paged MoE cache OFF == cache ON (both 8cb0ce23)
- dense non-paged == dense paged = 5951a5b4d624ce891e22ab5fca9bc439

The paged-MoE md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a
benign FP-accumulation-order difference of the paged attention reduction, not a
bug: KL-divergence vs the f16 reference (16 chunks, c512) gives KLD(paged||f16)
= 0.13600 <= KLD(nonpaged||f16) = 0.13660 and PPL(paged) = 7.4009 ~
PPL(nonpaged) = 7.3896 (within +/- 0.29). See PAGED_BITEXACT_NOTE.md and
LEVER5_HOSTPIPE_RESULTS.md.

Includes the [L5INSTR] host-timing instrumentation used to measure the lever.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/llama-context.cpp | 7 +++++++
src/llama-kv-cache.cpp | 28 +++++++++++++++++++++++++++-
src/llama-kv-cache.h | 9 +++++++++
src/paged-attn.cpp | 9 +++++++++
4 files changed, 52 insertions(+), 1 deletion(-)
diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index 5c90c48..ad7939e 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -1306,7 +1306,11 @@ bool llama_context::set_adapter_cvec(
return res;
}
+extern "C" void l5_add_setinp(double ns);
+extern "C" void l5_add_hostproc(double ns);
+static inline double l5c_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; }
llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, llm_graph_type gtype, llama_memory_context_i * mctx, ggml_status & ret) {
+ double _l5_t0=l5c_now_ns();
if (mctx && !mctx->apply()) {
LLAMA_LOG_ERROR("%s: failed to apply memory context\n", __func__);
ret = GGML_STATUS_FAILED;
@@ -1361,11 +1365,14 @@ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, ll
//const auto t_start_us = ggml_time_us();
// FIXME this call causes a crash if any model inputs were not used in the graph and were therefore not allocated
+ double _l5_si=l5c_now_ns();
res->set_inputs(&ubatch);
+ l5_add_setinp(l5c_now_ns()-_l5_si);
//LLAMA_LOG_INFO("graph set inputs time: %.3f ms\n", (ggml_time_us() - t_start_us)/1000.0);
}
+ l5_add_hostproc(l5c_now_ns()-_l5_t0);
const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
if (status != GGML_STATUS_SUCCESS) {
LLAMA_LOG_ERROR("%s: failed to compute graph, compute status: %d\n", __func__, status);
diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp
index 21b8f1e..17aaf40 100644
--- a/src/llama-kv-cache.cpp
+++ b/src/llama-kv-cache.cpp
@@ -2772,6 +2772,9 @@ bool llama_kv_cache_context::apply() {
kv->apply_ubatch(sinfos[i_cur], ubatches[i_cur]);
n_kv = kv->get_n_kv(sinfos[i_cur]);
+ // the cells for this ubatch just changed -> drop the cached block table
+ bt_cache_valid = false;
+
return true;
}
@@ -2814,7 +2817,30 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const {
}
void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const {
- kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]);
+ const auto & sinfo = sinfos[i_cur];
+ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1;
+ const size_t total = (size_t) ns * n_blk;
+
+ // within-step reuse: all full-attention layers of a step request the same
+ // table (same i_cur/n_blk, cells fixed since apply()). The bytes are
+ // identical to a fresh compute, so this is bit-exact.
+ static const bool nocache = (getenv("LLAMA_PAGED_NO_BT_CACHE") != nullptr);
+ if (nocache) {
+ kv->get_block_table(dst, n_blk, n_kv, sinfo);
+ return;
+ }
+
+ if (bt_cache_valid && bt_cache_n_blk == n_blk && bt_cache.size() == total) {
+ memcpy(dst, bt_cache.data(), total * sizeof(int32_t));
+ return;
+ }
+
+ kv->get_block_table(dst, n_blk, n_kv, sinfo);
+
+ bt_cache.resize(total);
+ memcpy(bt_cache.data(), dst, total * sizeof(int32_t));
+ bt_cache_n_blk = n_blk;
+ bt_cache_valid = true;
}
ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const {
diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h
index e9980b6..b03de78 100644
--- a/src/llama-kv-cache.h
+++ b/src/llama-kv-cache.h
@@ -451,4 +451,13 @@ private:
// a heuristic, to avoid attending the full cache if it is not yet utilized
// as the cache gets filled, the benefit from this heuristic disappears
int32_t n_kv;
+
+ // [paged L5] within-step block-table cache. get_block_table() is called once
+ // per full-attention layer per decode step, but the cell layout (and hence
+ // the table) is identical across all layers of a step. Compute it on the
+ // first call and reuse the bytes for the rest; invalidated in apply() when
+ // the ubatch's slots are committed (the only host-side mutation per step).
+ mutable std::vector<int32_t> bt_cache;
+ mutable uint32_t bt_cache_n_blk = 0;
+ mutable bool bt_cache_valid = false;
};
diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp
index fed8ca9..ebd92be 100644
--- a/src/paged-attn.cpp
+++ b/src/paged-attn.cpp
@@ -8,6 +8,13 @@
#include <cstdlib>
#include <cstdio>
+#include <ctime>
+namespace { static inline double l5_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; } }
+double g_l5_t_gbt=0, g_l5_t_setinp=0, g_l5_t_hostproc=0; long g_l5_n_gbt=0, g_l5_n_setinp=0, g_l5_n_hostproc=0;
+extern "C" void l5_add_setinp(double ns){ g_l5_t_setinp+=ns; g_l5_n_setinp++; }
+extern "C" void l5_add_hostproc(double ns){ g_l5_t_hostproc+=ns; g_l5_n_hostproc++; }
+namespace { struct L5Printer { ~L5Printer(){ fprintf(stderr,"[L5INSTR] get_block_table n=%ld sum=%.2fms mean=%.4fms | set_inputs n=%ld sum=%.2fms mean=%.4fms | hostproc n=%ld sum=%.2fms mean=%.4fms\n", g_l5_n_gbt, g_l5_t_gbt/1e6, g_l5_n_gbt? g_l5_t_gbt/1e6/g_l5_n_gbt:0.0, g_l5_n_setinp, g_l5_t_setinp/1e6, g_l5_n_setinp? g_l5_t_setinp/1e6/g_l5_n_setinp:0.0, g_l5_n_hostproc, g_l5_t_hostproc/1e6, g_l5_n_hostproc? g_l5_t_hostproc/1e6/g_l5_n_hostproc:0.0 ); } } g_l5_printer; }
+
namespace paged_attn {
@@ -54,7 +61,9 @@ public:
void set_input(const llama_ubatch * ubatch) override {
GGML_UNUSED(ubatch);
GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer));
+ double _t=l5_now_ns();
mctx->get_block_table((int32_t *) idxs->data, n_blk);
+ g_l5_t_gbt += l5_now_ns()-_t; g_l5_n_gbt++;
}
const llama_kv_cache_context * mctx;
--
2.43.0
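
The within-step reuse in patch 0029 is a small compute-once, copy-afterwards cache with explicit invalidation. The sketch below illustrates that pattern in isolation. It is an assumption-laden stand-in, not the llama.cpp code: `BlockTableCache`, `ComputeFn`, `get` and `invalidate` are hypothetical names, and the real table computation is replaced by a caller-supplied callback.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the cached state the patch adds to the KV-cache context.
struct BlockTableCache {
    using ComputeFn = std::function<void(int32_t * dst, uint32_t n_blk)>;

    std::vector<int32_t> bytes;   // last computed table
    uint32_t n_blk = 0;           // n_blk the cached table was built for
    bool     valid = false;       // cleared whenever the cell layout changes

    // Fill dst with n_seqs * req_n_blk entries, computing at most once per step.
    void get(int32_t * dst, uint32_t n_seqs, uint32_t req_n_blk, const ComputeFn & compute) {
        assert(dst != nullptr);
        const std::size_t total = (std::size_t) n_seqs * req_n_blk;

        // Hit: same shape as the cached table and nothing invalidated it.
        if (valid && n_blk == req_n_blk && bytes.size() == total) {
            std::copy(bytes.begin(), bytes.end(), dst);
            return;
        }

        // Miss: compute into dst, then remember the result for later layers.
        compute(dst, req_n_blk);
        bytes.assign(dst, dst + total);
        n_blk = req_n_blk;
        valid = true;
    }

    // Call when the slots for a new ubatch are committed.
    void invalidate() { valid = false; }
};
```

Because a hit copies exactly the values a fresh compute would have produced, callers see the same table either way; the shape check (`n_blk` and total size) keeps a stale table of a different size from being handed out.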


@@ -1,106 +0,0 @@
From a095f4ebeefafd16dd54c514eb86148fa46daef3 Mon Sep 17 00:00:00 2001
From: Ettore Di Giacinto <mudler@localai.io>
Date: Sat, 27 Jun 2026 07:30:43 +0000
Subject: [PATCH] feat(paged): backend-gate fused GDN/discriminated SSM_CONV
emission (patch 0030)

Closes the latent silent-miscompute hazard (audit RISKY-1). The fused/in-place
Gated Delta Net op (0018/0019/0026: ggml_gated_delta_net_inplace[_ids][_hybrid])
and the discriminated SSM_CONV decode op (0021/0028: ggml_ssm_conv_update_inplace
[_ids], which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET with extra src
slots - a non-null src[3]/src[4] ring/ids discriminator) are emitted DEFAULT-ON
(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) but are implemented for the
CUDA-family TU (CUDA / HIP "ROCm" / "MUSA", hipified ggml-cuda) and the CPU
reference ONLY.

The hazard: a compute backend that supports PLAIN GGML_OP_SSM_CONV but ignores
the src[3]/src[4] discriminator (Vulkan/SYCL/Metal) reports supports_op==true for
the node and the scheduler assigns the discriminated conv to it; it then runs the
wrong plain conv => SILENT corruption (not a crash). The upstream auto_fgdn
device-mismatch resolution only inspects GATED_DELTA_NET nodes, so the
discriminated-SSM_CONV safety was only incidentally covered (it happened to share
backend coverage with the GDN op); it becomes live the moment a non-CUDA paged
build of a gated-DeltaNet model exists.

FIX: gate the fused-op emission on the active compute backend type. Before the
auto_fgdn resolution in llama_context::sched_reserve(), if any non-CPU compute
backend is not CUDA-family (reg name != "CUDA"/"ROCm"/"MUSA"), force
fused_gdn_ar = fused_gdn_ch = auto_fgdn = false. Every emission site keys off
these flags (conv_decode_fused = ... && fused_gdn_ar; fused = ... fused_gdn_ar/ch),
so disabling them routes the graph to the upstream non-fused path: a PLAIN
ggml_ssm_conv (no discriminator) + ggml_silu, which every backend handles
correctly. This makes the discriminated-op safety explicit and decoupled from the
GDN-op device-mismatch heuristic.

INVARIANT (CUDA byte-identical): on a CUDA backend the reg name is "CUDA", so
fgdn_backend_ok stays true, the flags are left untouched, and the emitted decode
graph is unchanged - byte-identical to pre-0030. The fix only changes behavior on
non-CUDA/non-CPU backends.

GATE compile: CPU-only build (GGML_CUDA=OFF) of the full series (pin 9d5d882d +
0001-0029 + this) links libllama.so and test-backend-ops with 0 errors; the
edited llama-context.cpp compiles clean (uses only already-included <cstring> +
backend-reg API already used in this TU). test-backend-ops correctness for
SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET is a
CUDA0-vs-CPU comparison (CPU-only run skips CPU-vs-CPU); the test cases are
registered and exercised on the CUDA DGX run.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
---
src/llama-context.cpp | 39 +++++++++++++++++++++++++++++++++++++++
1 file changed, 39 insertions(+)
diff --git a/src/llama-context.cpp b/src/llama-context.cpp
index ad7939e..c408eef 100644
--- a/src/llama-context.cpp
+++ b/src/llama-context.cpp
@@ -521,6 +521,45 @@ void llama_context::sched_reserve() {
cparams.auto_fa = false;
}
+ // RISKY-1 guard: the fused/in-place Gated Delta Net op and the discriminated
+ // SSM_CONV (which reuse GGML_OP_GATED_DELTA_NET / GGML_OP_SSM_CONV with extra
+ // src slots - a non-null src[3]/src[4] ring/ids discriminator) are only
+ // implemented for the CUDA-family backends (CUDA / HIP "ROCm" / "MUSA" - all
+ // built from the hipified ggml-cuda TU) and the CPU reference. Any other
+ // compute backend (Vulkan/SYCL/Metal/...) that supports *plain* SSM_CONV but
+ // ignores the discriminator src would silently run the WRONG conv. The
+ // upstream auto_fgdn device-mismatch check below only inspects
+ // GATED_DELTA_NET nodes, so couple the discriminated-SSM_CONV safety
+ // explicitly to the backend type here: keep the fused path enabled only when
+ // every non-CPU compute backend is CUDA-family. On CUDA this leaves the flags
+ // untouched, so the emitted decode graph is byte-identical.
+ if (cparams.fused_gdn_ar || cparams.fused_gdn_ch) {
+ bool fgdn_backend_ok = true;
+ for (auto & backend : backends) {
+ ggml_backend_dev_t dev = ggml_backend_get_device(backend.get());
+ if (!dev || ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) {
+ // CPU reference handles the fused/discriminated ops
+ continue;
+ }
+ ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev);
+ const char * name = reg ? ggml_backend_reg_name(reg) : "";
+ // GGML_CUDA_NAME is "CUDA" / "ROCm" (HIP) / "MUSA"; all three are the
+ // same ggml-cuda TU that carries the discriminated-op handling.
+ if (strcmp(name, "CUDA") != 0 && strcmp(name, "ROCm") != 0 && strcmp(name, "MUSA") != 0) {
+ fgdn_backend_ok = false;
+ break;
+ }
+ }
+
+ if (!fgdn_backend_ok) {
+ cparams.fused_gdn_ar = false;
+ cparams.fused_gdn_ch = false;
+ cparams.auto_fgdn = false;
+ LLAMA_LOG_INFO("%s: fused Gated Delta Net / discriminated SSM_CONV disabled "
+ "(compute backend is not CUDA/HIP/MUSA/CPU)\n", __func__);
+ }
+ }
+
if (cparams.auto_fgdn) {
LLAMA_LOG_INFO("%s: resolving fused Gated Delta Net support:\n", __func__);
--
2.43.0


@@ -1,507 +0,0 @@
# Plan: ship the paged llama.cpp as its OWN backend + NVFP4 Qwen3.6 gallery items
Scoping deliverable only. NOTHING is changed by this document. It is grounded in the
actual repo structure (read 2026-06-26 in worktree feat+paged-attention), not assumptions.
================================================================================
0. GROUND TRUTH (what the repo actually does today)
================================================================================
The paged patchset is ALREADY integrated into the stock llama-cpp backend in this
worktree. Two mechanisms, both already present:
(a) BUILD: backend/cpp/llama-cpp/Makefile has `LLAMA_PAGED?=on`. The `llama.cpp:`
target git-applies patches/0*.patch (base series) then, when LLAMA_PAGED != off,
patches/paged/0*.patch (the 0018-0023 paged series + the earlier 0001-0017).
prepare.sh has a fallback `patch`-based apply guarded by a sentinel
(llama.cpp/src/paged-kv-manager.cpp). So a stock `make backends/llama-cpp` TODAY
already ships the paged engine compiled in.
(b) RUNTIME GATING: backend/cpp/llama-cpp/grpc-server.cpp ALREADY carries the option
hooks (lines ~752-842). They only call setenv() before context init:
- option `kv_paged` / `paged_kv` / `paged_attention` -> setenv LLAMA_KV_PAGED=1
- option `kv_paged_debug` / `paged_kv_debug` -> setenv LLAMA_KV_PAGED_DEBUG=1
- option `max_prefill_tokens` / `mpt` / `prefill_budget` -> setenv LLAMA_PREFILL_BUDGET
- option `max_batch_tokens` / `mbt` -> setenv LLAMA_MAX_BATCH_TOKENS
- option `prefill_cap` -> setenv LLAMA_PREFILL_CAP
Against UNPATCHED llama.cpp these setenv() calls are inert (nothing reads the env),
so grpc-server.cpp is byte-safe to share between a clean build and a paged build.
The paged engine itself lives entirely inside the patched llama.cpp lib
(paged-kv-manager.cpp etc.), NOT in grpc-server.cpp.
Conclusion: "stock llama-cpp + paged patchset, runtime-gated" is the CURRENT state of
ONE backend. The task is to SPLIT that into two backends:
- llama-cpp = clean upstream llama.cpp (de-risked: a dep-bump can never break on a
paged hook), grpc-server.cpp keeps the dormant hooks.
- <newname> = stock grpc-server.cpp + paged patch series applied + paged on.
The turboquant backend is the EXACT precedent for "a llama.cpp variant that reuses the
backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile + its own Dockerfile
+ its own matrix rows". Copy turboquant's shape, with two simplifications (see section 1).
CPU_ALL_VARIANTS reuse: backend/cpp/llama-cpp/Makefile already has `llama-cpp-cpu-all`
(one grpc-server + dlopen libggml-cpu-*.so via -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS,
SHARED_LIBS=ON make-var). turboquant mirrors it with `turboquant-cpu-all`. The new backend
gets the same single-build CPU target for free by reusing the same Makefile machinery.
--------------------------------------------------------------------------------
RECOMMENDED BACKEND NAME: `llama-cpp-paged` (see section 4 for the full rationale)
--------------------------------------------------------------------------------
Everywhere below, NAME = llama-cpp-paged, DOCKERFILE = Dockerfile.llama-cpp-paged,
SRC DIR = backend/cpp/llama-cpp-paged/, MAKE VAR = BACKEND_LLAMA_CPP_PAGED.
DO NOT use the dotted working name `localai-llama.cpp`: a dot in Dockerfile.<suffix> and
in the tag-suffix is unprecedented (every sibling is hyphenated: llama-cpp, ik-llama-cpp,
turboquant, ds4) and complicates the changed-backends.js endsWith() suffix matching.
================================================================================
1. NEW BACKEND - file by file
================================================================================
--------------------------------------------------------------------------------
1.1 backend/cpp/llama-cpp/Makefile (the ONE necessary touch to stock)
--------------------------------------------------------------------------------
Change exactly one default so the STOCK image ships clean against upstream:
-LLAMA_PAGED?=on
+LLAMA_PAGED?=off
Why: this is the entire point of the split - stock llama-cpp must build clean so an
upstream LLAMA_VERSION bump can never fail on a paged hook. The runtime hooks in
grpc-server.cpp stay (inert). The new backend forces LLAMA_PAGED=on explicitly (1.2), so
it does not depend on this default. NOTE this DOES change stock's shipped artifact (it
currently ships paged-compiled-in-but-gated); that is intended de-risking, call it out in
the PR. If the team prefers stock literally untouched, the alternative is to leave
`?=on` and accept that stock keeps carrying the patch series - but then "clean stock" is
not achieved. Recommendation: flip to off.
(No other change to backend/cpp/llama-cpp/ - grpc-server.cpp, CMakeLists.txt, prepare.sh,
patches/, patches/paged/ are all reused as-is by the new backend.)
--------------------------------------------------------------------------------
1.2 backend/cpp/llama-cpp-paged/Makefile (NEW - thin wrapper, model on turboquant)
--------------------------------------------------------------------------------
Mirror backend/cpp/turboquant/Makefile, but SIMPLER (two things turboquant needs that we
do NOT):
- turboquant overrides LLAMA_REPO/LLAMA_VERSION to a fork. We use the SAME upstream pin
as stock (it lives in backend/cpp/llama-cpp/Makefile, already auto-bumped). So we do
NOT set LLAMA_VERSION here -> no bump_deps.yaml entry needed (big simplification vs
turboquant). We only force LLAMA_PAGED=on.
- turboquant runs patch-grpc-server.sh (augments the KV-cache type allow-list) and
apply-patches.sh (fork catch-up). We need NEITHER: grpc-server.cpp already has the
paged hooks, and the paged patch series is applied by the copied llama-cpp Makefile's
own `llama.cpp:` target when LLAMA_PAGED=on.
Shape (one flavor shown; replicate the turboquant flavor set: avx/avx2/avx512/fallback/
cpu-all/grpc/rpc-server):
LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
define paged-build # $(1)=flavor $(2)=cmake flags $(3)=target
rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build purge
# clone upstream + apply base AND paged patch series (LLAMA_PAGED=on forces it)
LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build llama.cpp
CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_PAGED=on \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build/grpc-server llama-cpp-paged-$(1)
endef
llama-cpp-paged-cpu-all:
# identical to turboquant-cpu-all: SHARED_LIBS=ON + GGML_BACKEND_DL + CPU_ALL_VARIANTS
# + --target ggml; then collect ggml-shared-libs/ for package.sh to bundle.
... LLAMA_PAGED=on SHARED_LIBS=ON \
EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" \
TARGET="--target grpc-server --target ggml" ...
package: ; bash package.sh
purge: ; rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-*-build; rm -rf llama-cpp-paged-* package
clean: purge
Binaries are named llama-cpp-paged-{cpu-all,fallback,grpc,rpc-server,...} so run.sh and
package.sh glob them.
--------------------------------------------------------------------------------
1.3 backend/cpp/llama-cpp-paged/run.sh (NEW - copy turboquant/run.sh, rename binaries)
--------------------------------------------------------------------------------
s/turboquant/llama-cpp-paged/g. Prefers llama-cpp-paged-cpu-all if present, falls back to
llama-cpp-paged-fallback; llama-cpp-paged-grpc when LLAMACPP_GRPC_SERVERS set; Darwin
DYLD_LIBRARY_PATH branch; lib/ld.so launch. Keep verbatim otherwise.
--------------------------------------------------------------------------------
1.4 backend/cpp/llama-cpp-paged/package.sh (NEW - copy turboquant/package.sh, rename)
--------------------------------------------------------------------------------
s/turboquant/llama-cpp-paged/g. Copies llama-cpp-paged-* into package/, bundles
ggml-shared-libs/*.so* into package/lib (the CPU_ALL_VARIANTS dlopen set), copies run.sh,
and the per-arch libc/ld.so set (unchanged).
--------------------------------------------------------------------------------
1.5 backend/Dockerfile.llama-cpp-paged (NEW - copy Dockerfile.turboquant, swap paths)
--------------------------------------------------------------------------------
Identical 3-stage structure (builder-fromsource / builder-prebuilt / FROM scratch). Edits:
- bind/run .docker/llama-cpp-paged-compile.sh (new, 1.6) instead of turboquant-compile.sh
- ccache id: id=llama-cpp-paged-ccache-${TARGETARCH}-${BUILD_TYPE}
(OPTIONAL OPTIMIZATION: set id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE} to SHARE
stock llama-cpp's ccache - the paged TUs are mostly byte-identical to stock, so a warm
stock cache would give the paged build near-free object reuse. Trade-off: a regression
in one could surface as a cold miss in the other. Recommend sharing; revisit if noisy.)
- both `make -BC /LocalAI/backend/cpp/llama-cpp-paged package`
- final COPY --from=builder /LocalAI/backend/cpp/llama-cpp-paged/package/. ./
--------------------------------------------------------------------------------
1.6 .docker/llama-cpp-paged-compile.sh (NEW - copy llama-cpp-compile.sh, swap make targets)
--------------------------------------------------------------------------------
Identical to .docker/llama-cpp-compile.sh except `cd .../llama-cpp-paged` and call
`make llama-cpp-paged-cpu-all` (BUILD_TYPE empty / CPU) or `make llama-cpp-paged-fallback`
(GPU), then `make llama-cpp-paged-grpc` + `make llama-cpp-paged-rpc-server`. Keep the
arm64 gcc-14 apt step (CPU_ALL_VARIANTS armv9.2 SME needs gcc-14). ccache export unchanged.
--------------------------------------------------------------------------------
1.7 Makefile (top-level) - 6 edits, mirror the turboquant lines
--------------------------------------------------------------------------------
a) .NOTPARALLEL (line 2): append `backends/llama-cpp-paged`
b) Backend def (after BACKEND_TURBOQUANT, line ~1172):
# llama-cpp-paged = stock llama.cpp grpc-server + LocalAI paged-attention patch
# series (LLAMA_PAGED=on). Reuses backend/cpp/llama-cpp sources via a thin wrapper.
BACKEND_LLAMA_CPP_PAGED = llama-cpp-paged|llama-cpp-paged|.|false|false
(lang field `llama-cpp-paged` -> Dockerfile.llama-cpp-paged, matching the
llama-cpp / ik-llama-cpp / turboquant convention where lang==backend name.)
c) generate-docker-build-target eval (after BACKEND_TURBOQUANT, line ~1273):
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_PAGED)))
d) docker-build-backends (line ~1337): append docker-build-llama-cpp-paged
e) test-extra-backend-llama-cpp-paged target (mirror test-extra-backend-turboquant,
line ~673): BACKEND_IMAGE=local-ai-backend:llama-cpp-paged $(MAKE) test-extra-backend
f) (optional) backends/llama-cpp-paged-darwin target if shipping metal (mirror
backends/llama-cpp-darwin at line 1124; see 1.11).
--------------------------------------------------------------------------------
1.8 .github/backend-matrix.yml - add rows (mirror every llama-cpp row, swap names)
--------------------------------------------------------------------------------
For EACH variant you choose to ship (see phased recommendation in section 4), add a row
copied from the corresponding llama-cpp row with:
- backend: "llama-cpp-paged"
- dockerfile: "./backend/Dockerfile.llama-cpp-paged"
- tag-suffix: swap `-llama-cpp` -> `-llama-cpp-paged`
(e.g. -cpu-llama-cpp -> -cpu-llama-cpp-paged;
-gpu-nvidia-cuda-12-llama-cpp -> -gpu-nvidia-cuda-12-llama-cpp-paged; etc.)
- builder-base-image: UNCHANGED - reuse the same base-grpc-* tags as llama-cpp
(this backend compiles the same gRPC + same toolchain; no new base-images.yml variant
is needed, so NO base-images bootstrap step). This is the cheap-variant payoff.
- CPU: TWO per-arch rows (amd64 ubuntu-latest + arm64 ubuntu-24.04-arm) sharing
tag-suffix '-cpu-llama-cpp-paged' so changed-backends.js emits a merge-matrix entry and
backend-merge-jobs assembles the manifest list. Same per-arch native + manifest-merge
pattern as -cpu-llama-cpp.
- Darwin (if shipping): add to includeDarwin:
- backend: "llama-cpp-paged"
tag-suffix: "-metal-darwin-arm64-llama-cpp-paged"
lang: "go"
(omit build-type, exactly like the llama-cpp darwin row at line 4908.)
REMINDER: the CI path filter only builds a backend on a PR when a file under its dir
changes. The PR that adds this backend touches backend/cpp/llama-cpp-paged/* so it self-
triggers. But also add the cross-trigger in 1.9 so future edits to backend/cpp/llama-cpp/
(the shared source) retrigger this backend too.
--------------------------------------------------------------------------------
1.9 scripts/changed-backends.js - three edits (mirror turboquant exactly)
--------------------------------------------------------------------------------
a) inferBackendPath(): add BEFORE the generic `endsWith("llama-cpp")` branch (line 56),
next to the turboquant branch (line 45):
if (item.dockerfile.endsWith("llama-cpp-paged")) {
// reuses backend/cpp/llama-cpp sources via a thin wrapper Makefile
return `backend/cpp/llama-cpp-paged/`;
}
ORDER: "Dockerfile.llama-cpp-paged".endsWith("llama-cpp") is false, so the generic branch
cannot shadow this one today; keep the specific branch first anyway (defensive).
b) inferBackendPathDarwin(): add a case (next to the llama-cpp one at line 66):
if (item.backend === "llama-cpp-paged") { return `backend/cpp/llama-cpp-paged/`; }
c) Per-backend cross-trigger (line 274-278, mirror the turboquant block):
if (backend === "llama-cpp-paged" && !changed) {
changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
}
Verify: node -e "... e.dockerfile.endsWith('llama-cpp-paged') ..." per adding-backends.md.
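The three edits above can be sketched in Python to make the matching and the cross-trigger easy to reason about. This is only a re-expression of the described behaviour: the real logic lives in scripts/changed-backends.js, the function names below are hypothetical, and the path returned by the generic llama-cpp branch is an assumption.

```python
from typing import List, Optional


def infer_backend_path(dockerfile: str) -> Optional[str]:
    """Map a Dockerfile path to the backend source dir (hypothetical helper)."""
    # Specific suffix first, generic suffix second.
    if dockerfile.endswith("llama-cpp-paged"):
        return "backend/cpp/llama-cpp-paged/"
    # Assumption: the generic branch resolves to the stock backend dir.
    if dockerfile.endswith("llama-cpp"):
        return "backend/cpp/llama-cpp/"
    return None


def is_changed(backend: str, dockerfile: str, changed_files: List[str]) -> bool:
    """Decide whether a backend should rebuild for a set of changed files."""
    path = infer_backend_path(dockerfile)
    changed = path is not None and any(f.startswith(path) for f in changed_files)
    # Cross-trigger: the paged backend reuses the stock llama-cpp sources,
    # so an edit under backend/cpp/llama-cpp/ must rebuild it as well.
    if backend == "llama-cpp-paged" and not changed:
        changed = any(f.startswith("backend/cpp/llama-cpp/") for f in changed_files)
    return changed
```

The trailing slash in each prefix is what keeps the two directories apart: a file under backend/cpp/llama-cpp-paged/ does not start with "backend/cpp/llama-cpp/", so an edit to the paged wrapper alone does not retrigger stock.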
--------------------------------------------------------------------------------
1.10 backend/index.yaml - meta + image entries (META-BACKEND - capabilities map, NO uri)
--------------------------------------------------------------------------------
GOTCHA (project_backend_meta_gotcha): a backend that ships per-platform images MUST be a
meta backend = an anchor with a `capabilities:` map and NO top-level `uri:`; the concrete
per-platform entries carry the uri. Copy the *llamacpp anchor (lines 3-31).
Step a - meta anchor in `## metas` (after *turboquant, ~line 74):
- &llamacpppaged
name: "llama-cpp-paged"
alias: "llama-cpp-paged"
license: mit
icon: <same as llama-cpp>
description: |
LocalAI's paged-attention llama.cpp: on-demand paged KV cache + decode-first
prefill budget. Stock llama.cpp grpc-server + the LocalAI paged patch series.
Tuned for NVFP4 dense/MoE on Blackwell/GB10. Reuses the llama-cpp gRPC server.
urls: [ https://github.com/ggerganov/llama.cpp ]
tags: [ text-to-text, LLM, CPU, GPU, CUDA, Metal, paged-attention, nvfp4 ]
capabilities:
default: "cpu-llama-cpp-paged"
nvidia: "cuda12-llama-cpp-paged"
nvidia-cuda-12: "cuda12-llama-cpp-paged"
nvidia-cuda-13: "cuda13-llama-cpp-paged"
nvidia-l4t: "nvidia-l4t-arm64-llama-cpp-paged"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-paged"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-paged"
metal: "metal-llama-cpp-paged"
# add amd/intel/vulkan keys ONLY for variants you actually build (section 4)
Step b - a `-development` meta (mirror llama-cpp-development, line 1611) with the same
capabilities map pointing at the `*-development` image names.
Step c - concrete image entries at end of file (mirror the llama-cpp block lines
2106-2200), one latest + one development per variant, each as:
- !!merge <<: *llamacpppaged
name: "cpu-llama-cpp-paged"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-paged"
mirrors: [ localai/localai-backends:latest-cpu-llama-cpp-paged ]
- !!merge <<: *llamacpppaged
name: "cpu-llama-cpp-paged-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp-paged"
mirrors: [ localai/localai-backends:master-cpu-llama-cpp-paged ]
...repeat for cuda12 / cuda13 / l4t / metal etc.
The `latest-` / `master-` uri prefix + tag-suffix MUST match the matrix tag-suffix exactly.
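A small consistency check for that rule could look like the sketch below. It assumes an index entry has been loaded into a dict with `name` and `uri` keys and that the matrix tag-suffix is available as a string; the helper names are hypothetical and not part of the repo.

```python
REGISTRY = "quay.io/go-skynet/local-ai-backends"


def expected_uri(tag_suffix: str, development: bool = False) -> str:
    """Build the image uri implied by a matrix tag-suffix such as '-cpu-llama-cpp-paged'."""
    prefix = "master" if development else "latest"
    return f"{REGISTRY}:{prefix}{tag_suffix}"


def entry_matches(entry: dict, tag_suffix: str) -> bool:
    """Check one index entry against the tag-suffix of its matrix row."""
    development = entry["name"].endswith("-development")
    return entry["uri"] == expected_uri(tag_suffix, development)


entry = {
    "name": "cpu-llama-cpp-paged",
    "uri": "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-paged",
}
print(entry_matches(entry, "-cpu-llama-cpp-paged"))
```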
--------------------------------------------------------------------------------
1.11 Darwin (only if shipping metal; the NVFP4 target is CUDA, so metal is optional/phase 2)
--------------------------------------------------------------------------------
If metal is shipped, also:
- scripts/build/llama-cpp-paged-darwin.sh (copy scripts/build/llama-cpp-darwin.sh; it
drives the 3 CMake variants + otool dylib bundling). Ensure it forces LLAMA_PAGED=on.
- Makefile `backends/llama-cpp-paged-darwin` target (mirror backends/llama-cpp-darwin).
- backend_build_darwin.yml: add the llama-cpp-paged branch (mirror the llama-cpp-specific
step that calls `make backends/llama-cpp-darwin`).
- index.yaml metal-llama-cpp-paged / -development image entries (already in 1.10).
- C++ proto gotcha already handled (reuses llama-cpp CMakeLists.txt with hw_grpc_proto
linking protobuf/grpc++), so no Homebrew-include failure.
--------------------------------------------------------------------------------
1.12 Importer / /backends/known dropdown (drop-in, NOT a new importer)
--------------------------------------------------------------------------------
This backend consumes GGUF exactly like llama-cpp -> extend the EXISTING importer, do not
add a new one (per adding-backends.md rule 2). Edit core/gallery/importers/llama-cpp.go:
- AdditionalBackends() (line 37): append
{Name: "llama-cpp-paged", Modality: "text",
Description: "Paged-attention llama.cpp (on-demand paged KV + decode-first budget)"}
- Import() backend allow-list (line 133): add "llama-cpp-paged" to the switch case so a
preferences.backend == "llama-cpp-paged" is honored:
case "ik-llama-cpp", "turboquant", "llama-cpp-paged": backend = b
- core/gallery/importers/importers_test.go: add a table case asserting the preference
override emits backend: llama-cpp-paged (Ginkgo/Gomega; reuse an existing public GGUF
HF fixture). Run `go test ./core/gallery/importers/...`.
--------------------------------------------------------------------------------
1.13 Docs
--------------------------------------------------------------------------------
- docs/content/features/backends.md: add llama-cpp-paged to the text-to-text/LLM list,
one line noting paged KV + NVFP4 Blackwell tuning. (Not an in-house from-scratch engine
-> it is a llama.cpp variant -> do NOT add to the README maintained-engines table.)
--------------------------------------------------------------------------------
1.14 Does grpc-server.cpp need the paged hooks? YES - already present, reused unchanged.
--------------------------------------------------------------------------------
The hooks (kv_paged / max_batch_tokens / prefill_budget / prefill_cap) are already in the
SHARED backend/cpp/llama-cpp/grpc-server.cpp. The paged backend reuses that file verbatim
(via the Makefile copy). No patch-grpc-server.sh step is needed (unlike turboquant). The
hooks are what translate the gallery `options:` (section 2) into the LLAMA_KV_PAGED /
LLAMA_MAX_BATCH_TOKENS env that the paged llama.cpp lib reads.
================================================================================
2. GALLERY ITEMS - NVFP4 Qwen3.6 dense + MoE
================================================================================
Add two entries to gallery/index.yaml. Schema (verified against existing GGUF items and
the LocalAI config structs): backend selection via `overrides.backend`; runtime knobs via
either typed config fields (context_size/f16/flash_attention/gpu_layers/batch) or the
`options:` string list (key:value, parsed by grpc-server.cpp set_option).
--------------------------------------------------------------------------------
2.1 Benchmark llama-server flags -> LocalAI model-config mapping
--------------------------------------------------------------------------------
-c 131072 -> context_size: 131072 (LLMConfig.ContextSize, yaml context_size)
-fa on -> flash_attention: "on" (LLMConfig.FlashAttention, yaml flash_attention; string)
-ngl 99 -> gpu_layers: 99 (LLMConfig.NGPULayers, yaml gpu_layers; or omit -> DefaultNGPULayers offloads all)
-b 2048 -> batch: 2048 (schema.PredictionOptions.Batch, yaml batch) [see caveat]
--parallel 128 -> options: ["parallel:128"] (grpc-server.cpp:629; alias n_parallel)
LLAMA_KV_PAGED=1 -> options: ["paged_kv:true"] (grpc-server.cpp:778)
LLAMA_MAX_BATCH_TOKENS=512 -> options: ["max_batch_tokens:512"] (grpc-server.cpp:821; alias mbt)
f16 KV -> f16: true (LLMConfig.F16, yaml f16)
(recommended for paged) -> options: ["kv_unified:false"] (grpc-server.cpp:746 - the per-slot paged
capacity/memory benefit only materializes with a per-sequence cache;
the patch comment explicitly recommends pairing paged with kv_unified:false)
CAVEAT (-ub 512): LocalAI sets params.n_ubatch = params.n_batch = request->nbatch()
(grpc-server.cpp:528,532). There is NO separate config field for n_ubatch, so the
benchmark's `-b 2048 -ub 512` split is NOT exactly reproducible. Options:
(i) set batch: 512 -> n_batch=n_ubatch=512 (matches -ub; the decode-first
max_batch_tokens=512 budget is the dominant prefill lever anyway, and the
benchmark states decode throughput is budget-independent), OR
(ii) set batch: 2048 -> n_ubatch also 2048 (bigger physical batch, more KV scratch).
RECOMMEND (i) batch: 512 for the shipped gallery config (closest to the measured run +
lighter memory). Flag separately: a tiny grpc-server.cpp option `n_ubatch`/`ubatch` could
be added later to honor -b/-ub independently (not required to ship).
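The mapping and option (i) can be sketched as a small Python helper. This is illustrative only: the function name and signature are hypothetical, and the dict simply mirrors the yaml keys listed above rather than any LocalAI API.

```python
def gallery_overrides(ctx: int, batch: int, ubatch: int,
                      parallel: int, max_batch_tokens: int) -> dict:
    """Translate benchmark llama-server flags into gallery `overrides` (sketch)."""
    return {
        "context_size": ctx,          # -c
        "flash_attention": "on",      # -fa on
        "gpu_layers": 99,             # -ngl 99
        "f16": True,                  # f16 KV
        # Option (i): n_ubatch follows n_batch, so ship the smaller -ub value.
        "batch": min(batch, ubatch),
        "options": [
            "paged_kv:true",                          # LLAMA_KV_PAGED=1
            f"max_batch_tokens:{max_batch_tokens}",   # LLAMA_MAX_BATCH_TOKENS
            "kv_unified:false",
            f"parallel:{parallel}",                   # --parallel
        ],
    }


# Benchmark flags: -c 131072 -b 2048 -ub 512 --parallel 128, budget 512.
cfg = gallery_overrides(131072, 2048, 512, 128, 512)
print(cfg["batch"], cfg["options"])
```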
--------------------------------------------------------------------------------
2.2 gallery/index.yaml entry - DENSE q36-27b-nvfp4
--------------------------------------------------------------------------------
- name: "qwen3.6-27b-nvfp4-paged"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF # placeholder, section 3
description: |
Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF. Configured for LocalAI's
paged-attention llama.cpp backend: on-demand paged KV + decode-first prefill budget.
Benchmarked on GB10/DGX Spark at 90-117% of vLLM dense decode at 1.5-3x lower memory.
license: "apache-2.0" # confirm vs Qwen license
tags: [ llm, gguf, nvfp4, reasoning ]
icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
overrides:
backend: llama-cpp-paged
f16: true
flash_attention: "on"
context_size: 131072
gpu_layers: 99
batch: 512 # see -ub caveat 2.1; matches the 512 ubatch floor
known_usecases: [ chat ]
options:
- use_jinja:true
- paged_kv:true # LLAMA_KV_PAGED=1
- max_batch_tokens:512 # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget)
- kv_unified:false # enables the per-slot paged capacity/memory benefit
- parallel:128 # --parallel 128 serving slots
parameters:
model: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
template:
use_tokenizer_template: true
files:
- filename: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
sha256: <FILL after publish>
uri: https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF/resolve/main/q36-27b-nvfp4.gguf
--------------------------------------------------------------------------------
2.3 gallery/index.yaml entry - MoE q36-35b-a3b-nvfp4
--------------------------------------------------------------------------------
Same shape; the MoE is lighter on memory (~3B active). parallel:128 + budget 256 was the
MoE decode-throughput sweet spot in the sweep, but 512 is fine as a default; if optimizing
purely for saturated MoE decode use max_batch_tokens:256.
- name: "qwen3.6-35b-a3b-nvfp4-paged"
urls: [ https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF ]
...
overrides:
backend: llama-cpp-paged
f16: true
flash_attention: "on"
context_size: 131072
batch: 512
options:
- use_jinja:true
- paged_kv:true
- max_batch_tokens:512 # or 256 for max saturated MoE decode (sweep winner)
- kv_unified:false
- parallel:128
parameters:
model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
files:
- filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
sha256: <FILL after publish>
uri: https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF/resolve/main/q36-35b-a3b-nvfp4.gguf
Note: these are the BENCHMARK serving configs. For an interactive single-user default you
may want a second lighter gallery variant (context_size 16384, parallel 4, drop the budget)
- optional, not required to ship the benchmark reproduction.
================================================================================
3. GGUF PUBLISHING (so the gallery uri: resolves)
================================================================================
The two GGUFs already exist on the DGX dev box (final_benchmark.csv references
q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; README.md "Models" + "Benchmarks"
document provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
tensors from nvidia modelopt weights). To publish:
1. HF repos (suggest two, under the org that owns the gallery-referenced weights):
<ORG>/Qwen3.6-27B-NVFP4-GGUF (single q36-27b-nvfp4.gguf)
<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF (single q36-35b-a3b-nvfp4.gguf)
ORG = localai-org (brand) or mudler (personal); pick per ownership of the conversions.
2. Upload each .gguf; compute sha256 (sha256sum) and paste into the gallery `files:` sha256
(LocalAI verifies it on download). Without sha256 the entry still works but loses the
integrity check - fill it.
3. Model card metadata: base_model Qwen/Qwen3.6-*, library_name gguf, quantization NVFP4,
pipeline_tag text-generation, license (confirm Qwen3.6 license terms - apache-2.0 vs
Qwen community license), a note that it REQUIRES the llama-cpp-paged backend (NVFP4 +
paged), and the GB10 benchmark table (link README.md "Benchmarks" numbers).
4. NVFP4 requires a llama.cpp new enough to read the NVFP4 GGUF type. Confirm the pinned
LLAMA_VERSION in backend/cpp/llama-cpp/Makefile supports NVFP4 tensor types (the dev
tree that produced the GGUFs did). If the current pin predates NVFP4 GGUF support, the
backend pin must be bumped OR the paged patch series must carry the NVFP4 reader. THIS
IS A GATING CHECK before the gallery items are usable - verify on a GPU box.
5. Provenance/licensing: the dense conversion derives from unsloth; the MoE from nvidia
modelopt weights. Ensure redistribution of the converted GGUFs is permitted and
attribute upstream in the card.
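For step 2, `sha256sum` is enough; where a script is more convenient, the digest can be computed with the Python standard library as in this minimal sketch. It reads the file in chunks so a multi-gigabyte GGUF does not need to fit in memory; the function name and the 1 MiB chunk size are arbitrary choices.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex sha256 of a file, read in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel stops at EOF, when read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The returned string is what goes into the gallery `files:` sha256 field.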
================================================================================
4. OPEN DECISIONS / BLOCKERS / BUILD COST
================================================================================
BACKEND NAME - RECOMMEND `llama-cpp-paged`.
- llama-cpp-paged (RECOMMENDED): descriptive (it IS the paged variant), hyphenated like
every sibling (llama-cpp/ik-llama-cpp/turboquant/ds4), collision-free in the
changed-backends.js endsWith() suffix scheme, self-documenting in the /backends/known
importer dropdown. Reads correctly next to "turboquant" and "ik-llama-cpp".
- localai-llama-cpp (branding alternative, ACCEPTABLE): keeps the LocalAI brand without a
dot; hyphenated and safe. Use this if marketing wants "LocalAI's own llama.cpp" framing.
Slightly less self-explanatory about WHAT differs (paged) in the dropdown.
- localai-llama.cpp (the working name; NOT RECOMMENDED): the dot makes Dockerfile.localai-
llama.cpp and tag-suffix -cpu-localai-llama.cpp the only dotted ones in the repo, and
".cpp" looks like a file extension to the suffix matcher. Avoid.
BLOCKERS / GATING CHECKS (cannot be closed read-only, no GPU here):
1. NVFP4 GGUF read support in the pinned LLAMA_VERSION (section 3.4). Must verify on GPU.
If unsupported, bump the pin (which also affects stock llama-cpp) or carry the reader.
2. The two GGUFs are not yet on HF (section 3). Gallery uri + sha256 are placeholders
until upload. Blocks gallery validation only, not the backend build.
3. -ub vs -b split (section 2.1) is not exactly reproducible without a tiny grpc-server
option; shipped config uses batch:512. Minor, not a blocker.
4. Flipping stock LLAMA_PAGED?=off changes stock's shipped artifact (de-risking, intended)
- get explicit sign-off since it alters a heavily-used backend's build.
PLATFORM SHIP MATRIX (RECOMMENDED PHASING - the variant is cheap because it reuses the same
base-grpc-* prebuilt bases and the same compile machinery, so each row is just CI minutes):
Phase 1 (the benchmark target - GB10/Blackwell is CUDA):
- cuda12 amd64, cuda13 amd64, cuda13 arm64 (sbsa), l4t-cuda-12 arm64 (NVFP4/paged win)
- cpu-all amd64 + cpu-all arm64 (the single CPU_ALL_VARIANTS build; baseline coverage)
Phase 2 (parity with stock llama-cpp coverage, only if demand):
- metal-darwin-arm64 (1.11), vulkan amd64/arm64, rocm amd64, intel sycl f16/f32
Defer rocm/sycl/vulkan/metal unless asked - the paged + NVFP4 story is GPU/CUDA-centric
and these add CI cost without a clear consumer.
BUILD-COST ESTIMATE PER PLATFORM (with warm base-grpc-* base + ccache; the paged TUs are
~byte-identical to stock so a SHARED ccache id makes most objects free):
- CPU_ALL_VARIANTS (per arch): ~15-30 min warm / ~35-50 min cold. arm64 adds a gcc-14
apt step. Two arches + a merge job.
- CUDA (per arch): ~25-45 min warm / ~45-75 min cold (nvcc dominates; ccache helps less
across CUDA arch flag changes). amd64 cuda12 + cuda13, arm64 cuda13 + l4t = 4 jobs.
- Metal/Darwin (if Phase 2): native macos-14 runner, ~20-35 min with the ccache cache.
- No base-images.yml change and no bootstrap dispatch (reuses existing base-grpc-* tags),
so the only new CI cost is the per-row build minutes above. PR builds read cache, don't
write; first master build per row pays the cold cost once, then warm.
VERIFICATION (post-implementation, needs a GPU box - out of scope here):
- `make backends/llama-cpp-paged` builds + installs locally (from-source path).
- Confirm stock `make backends/llama-cpp` now builds clean (no paged-kv-manager.cpp in the
checkout) - proves the split.
- Load a published NVFP4 GGUF via the gallery entry, hit /v1/chat/completions, confirm the
server log shows LLAMA_KV_PAGED engaged (LLAMA_KV_PAGED_DEBUG trace) and the configured
max_batch_tokens/parallel took effect.
- go test ./core/gallery/importers/... green (importer drop-in case).
- node scripts/changed-backends.js dry-run: editing backend/cpp/llama-cpp/* retriggers
llama-cpp-paged (cross-trigger), editing backend/cpp/llama-cpp-paged/* triggers it too.
================================================================================
END OF PLAN
================================================================================


@@ -1,75 +0,0 @@
# Paged bit-exactness gate - per path (canonical references)
## TL;DR
The greedy decode of the **paged** path does not byte-match the **non-paged**
path for the MoE model. This is a **benign FP-accumulation-order difference of
the paged attention reduction**, KL-validated against the f16 reference. It is
**not a bug**. The bit-exactness gate is therefore **per path**:
| path | model | canonical md5 |
|------|-------|---------------|
| non-paged | MoE q36-35b-a3b-nvfp4 | `07db32c2bcb78d17a43ed18bc22705cd` |
| paged | MoE q36-35b-a3b-nvfp4 | `8cb0ce23777bf55f92f63d0292c756b0` |
| non-paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` |
| paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) |
Gate command (chat-template / conversation path):
```
llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
-n 48 --temp 0 --seed 1
# paged: prefix with LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
```
Note: use the default chat-template path (do **not** pass `-no-cnv`; raw
completion lands in a different md5 namespace).
**Future paged-MoE regressions compare to the PAGED reference `8cb0ce23`, not to
the non-paged `07db32c2`.** Dense is bit-exact across paths, so dense uses the
single reference `5951a5b4`.
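
As a minimal sketch of what "per path" means in practice, the lookup below picks the reference md5 from the (model, path) pair before comparing. The md5 values and model names are copied from the table above; the function names `canonical_md5` and `gate_passes` are hypothetical and not part of any shipped script.

```cpp
#include <cassert>
#include <string>

// Canonical greedy-decode md5 per (model, path), copied from the table above.
// Function names are illustrative only (assumption, not an existing tool).
inline std::string canonical_md5(const std::string & model, bool paged) {
    if (model == "q36-27b-nvfp4") {
        // Dense is bit-exact across paths: one reference serves both.
        return "5951a5b4d624ce891e22ab5fca9bc439";
    }
    if (model == "q36-35b-a3b-nvfp4") {
        // MoE has one reference per path.
        return paged ? "8cb0ce23777bf55f92f63d0292c756b0"
                     : "07db32c2bcb78d17a43ed18bc22705cd";
    }
    return "";  // unknown model: no reference recorded
}

// The gate passes only when the observed md5 matches the reference for the
// same path; a paged-MoE run is never compared to the non-paged reference.
inline bool gate_passes(const std::string & model, bool paged,
                        const std::string & observed_md5) {
    const std::string ref = canonical_md5(model, paged);
    return !ref.empty() && ref == observed_md5;
}
```

With this shape, a paged-MoE run that produces the non-paged md5 `07db32c2...` fails the gate, which is the intended behavior: it would mean the paged path was not actually engaged or its reduction order changed.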
## Why dense is bit-exact but MoE is not
Dense paged decode reproduces the non-paged reduction order exactly, so dense
greedy md5 is identical across paths. The MoE path runs additional kernels (the
NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs
between the paged and non-paged attention layouts. Over a long greedy decode this
flips a small number of near-tied argmaxes, changing the byte stream. The same
divergence is present on the 0028 baseline, with `LLAMA_MOE_FORCE_GRAPHS` on or
off, and with the patch-0029 block-table cache on or off - it is a property of
the paged attention path, not of any one lever.
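
The underlying effect can be reproduced in isolation. The sketch below is not the attention kernel; it only shows, under the assumption of IEEE-754 single precision with no fast-math reordering, that summing the same three floats in two different orders yields two different results.

```cpp
#include <cassert>
#include <vector>

// Sum strictly in the given order with an f32 accumulator.
// Assumes IEEE-754 float arithmetic and no -ffast-math style reassociation.
inline float sum_in_order(const std::vector<float> & xs) {
    float acc = 0.0f;
    for (float x : xs) {
        acc += x;  // each step rounds to f32, so the order matters
    }
    return acc;
}
```

For example, `sum_in_order({1e8f, 1.0f, -1e8f})` absorbs the `1.0f` into the rounding of `1e8f` and returns `0.0f`, while `sum_in_order({1e8f, -1e8f, 1.0f})` returns `1.0f`. Both orders are valid roundings of the same sum; when two logits are nearly tied, a difference of this kind is enough to flip the argmax.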
## KL evidence that the paged path is sound (the load-bearing check)
`llama-perplexity --kl-divergence` on `q36-35b-a3b-nvfp4.gguf`, 16 chunks,
`-c 512 -ngl 99 --seed 1`, base logits from the f16 reference
(`darwin_36b_opus/f16.gguf`, PPL 7.3734):
| comparison | PPL(Q) | KL divergence | Same top p | Cor |
|------------|-------:|--------------:|-----------:|----:|
| f16 reference | 7.3734 | - | - | - |
| **non-paged** vs f16 | 7.3896 | 0.136597 +/- 0.003157 | 84.314% | 97.68% |
| **paged** vs f16 | 7.4009 | 0.136000 +/- 0.003285 | 84.828% | 97.58% |
| paged vs non-paged (direct) | 7.4009 (base 7.3818) | 0.050011 +/- 0.001653 | 89.044% | 99.04% |
Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%.
### Verdict: BENIGN
- **Paged does not diverge from the f16 ground truth more than non-paged does.**
KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) =
7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29
error bars). A real paged-MoE correctness bug would push paged measurably
*further* from f16; it does not (it is marginally closer).
- **Paged and non-paged cluster together.** They agree with each other (KLD 0.050,
89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, ~84% same-top-p),
with essentially zero probability bias. That is the signature of two equivalent
FP-reorderings of the same quantized model, both equally approximating the f16
ground truth - not a quality regression.
- The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that
heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model
logit near-ties are abundant, so a different-but-equivalent reduction order
flips ~11% of argmaxes with no quality cost (proven by the equal KLD-to-f16 and
zero Delta-p bias).
Therefore the canonical gate is per path, and `8cb0ce23` is the validated paged
reference for the MoE deployment path.
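
For readers unfamiliar with the two statistics used above, here is a minimal sketch of how a KL divergence and a same-top-token fraction can be computed from per-position probability vectors. It is an illustration of the definitions only, not a reimplementation of `llama-perplexity`; the type alias `Dist` and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iterator>
#include <vector>

using Dist = std::vector<double>;  // one probability vector over the vocab

// KL(p || q) = sum_i p_i * log(p_i / q_i); terms with p_i == 0 contribute 0.
// Assumes p and q have the same length and q_i > 0 wherever p_i > 0.
inline double kl_divergence(const Dist & p, const Dist & q) {
    double kl = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) {
        if (p[i] > 0.0) {
            kl += p[i] * std::log(p[i] / q[i]);
        }
    }
    return kl;
}

// Fraction of positions at which both models rank the same token first.
inline double same_top_fraction(const std::vector<Dist> & a,
                                const std::vector<Dist> & b) {
    if (a.empty() || a.size() != b.size()) return 0.0;
    std::size_t same = 0;
    for (std::size_t t = 0; t < a.size(); ++t) {
        const auto ia = std::distance(a[t].begin(), std::max_element(a[t].begin(), a[t].end()));
        const auto ib = std::distance(b[t].begin(), std::max_element(b[t].begin(), b[t].end()));
        if (ia == ib) ++same;
    }
    return static_cast<double>(same) / static_cast<double>(a.size());
}
```

The verdict above rests on comparing these quantities pairwise: each quantized path against the f16 base, and the two quantized paths against each other.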

@@ -1,100 +0,0 @@
# Pin-sync: paged patch-stack -> llama.cpp c299a92c
Status: COMPLETE. The shipped source-only paged patch series (`0001`-`0030`,
28 `.patch` files) was advanced from llama.cpp `9d5d882d` to `c299a92c`
("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
## Upstream jump
- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
("model : Add label for LFM2.5-230M (#25008)")
- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
**zero patch changes**. The already-shipped source-only series (the result of the
`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
`git apply`** (the `llama.cpp` target in `backend/cpp/llama-cpp/Makefile`:
`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
28 patches reported "Applied patch ... cleanly", the sentinel
`src/paged-kv-manager.cpp` was created, and there are **zero** stray
`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
intact). git apply tolerates `@@` line-number offsets, which absorbed the
upstream drift; no hunk context broke.
Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
patch tarball used for the verification has
`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
## Clean build
Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
28 patches applied as working-tree changes, then:
```
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
-DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
-DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build-cuda --target llama-completion test-backend-ops -j20
```
Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
## GATE: ALL GREEN
Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
`9d5d882d` build too):
```
llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
-n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
# paged dense: prefix LLAMA_KV_PAGED=1
# paged MoE: prefix LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
```
(a) greedy md5 - all four paths PASS:
| path | model | md5 @ c299a92c | baseline | verdict |
|------|-------|----------------|----------|---------|
| non-paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
| paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| paged | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
(b) `test-backend-ops` (Backend CUDA0) - all PASS:
| op | result |
|----|--------|
| SSM_CONV | 45/45 OK |
| SSM_CONV_UPDATE | 16/16 OK |
| SSM_CONV_UPDATE_IDS | 16/16 OK |
| GATED_DELTA_NET | 84/84 OK |
| MUL_MAT | 1146/1146 OK |
| MUL_MAT_ID | 806/806 OK |
(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
series now carries patches `0026`/`0028`'s added per-head/gather test cases; all
pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
Bit-exactness preserved across the 23-commit upstream jump.
## Canary
`.github/workflows/llama-cpp-paged-canary.yml` and
`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
series is source-only and applies strict-clean with no `--exclude`, the canary's
`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
the shipped series) and may be removed on a future canary touch; left in place
here to keep the pin-bump diff minimal.
## Source of truth
The shipped `.patch` files under `backend/cpp/llama-cpp/patches/paged/` are the
source of truth and are unchanged by this bump. The DGX dev tree
(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.

@@ -1,317 +0,0 @@
# LocalAI paged-attention llama.cpp patch series
This directory holds the vendored patch series that turns stock llama.cpp into
LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The patches are
applied on top of a pinned upstream llama.cpp at build time; nothing here is a
fork - it is a source-only `*.patch` stack plus this single canonical doc.
> One-file rule: this README is the canonical reference for the patch series. The
> only other docs kept in this directory are operational and linked below:
> - [`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md) - the current pin-sync record (referenced by the canary workflow + scripts).
> - [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference).
> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items.
---
## 1. What it is
`llama-cpp-localai-paged` is the LocalAI paged-attention llama.cpp backend: a
vendored patch series over upstream llama.cpp that adds
- a **paged KV cache** (vLLM-style block manager: on-demand fixed-size blocks,
free pool, ref-counted blocks) with a **block-table flash-attention** read so
the attention kernels index physical cells instead of a contiguous buffer;
- **cross-request prefix sharing** - concurrent requests that share a long
prefix physically reuse one committed copy of the prefix blocks and prefill
only their divergent suffix;
- a **decode-first prefill scheduler** - a dynamic per-step prefill-token budget
decoupled from `n_batch`, so a long prefill never freezes co-batched decode;
- **GB10 / Blackwell NVFP4 decode optimizations** for the Qwen3.6 hybrid
gated-DeltaNet (SSM) models, where the recurrent-state plumbing - not the FP4
GEMM - dominates the decode step.
It is **pinned to llama.cpp `c299a92c`** ("binaries : Improve rpc-server and
export-graph-ops names", #25045) and advanced only by a manual, bit-exact-gated
[pin-sync process](PIN_SYNC_c299a92c.md), decoupled from the nightly auto-bumper
(see section 7).
The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is
enabled per-model at runtime via the gallery `options:` knobs (`paged_kv:true`,
`max_batch_tokens:`, `kv_unified:false`, ...). Against unpatched llama.cpp the
runtime hooks are inert, so a single `grpc-server.cpp` is shared between the
clean and the paged build.
---
## 2. Architecture
The decode step on these models breaks into three cost centers; the patch series
attacks each one.
**Paged KV manager + block-table flash-attn.** A host-side `PagedKVManager`
(`FreeBlockQueue` / `BlockPool` / chained-hash content cache) hands out
fixed-size KV blocks on demand and reclaims them per-sequence (ref-counted, with
copy-on-write for shared prefixes). The attention path reads through a **block
table** - an `I32 [n_view, n_stream]` position-ordered physical-cell index passed
as `src[5]` of `ggml_flash_attn_ext` - so the CUDA fattn vec/tile kernels and the
CPU reference map logical KV index `j` to physical cell `block_table[seq*ne11+j]`
and read K/V in place. Token-position ordering keeps the flash-attn online-softmax
reduction order identical to stock. A null block table is the stock contiguous
read, byte-identical.
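
The index mapping described above can be sketched on the host as follows. This is a simplified CPU illustration, assuming one scalar per KV cell and a row-major `[n_stream][ne11]` table; the function names are hypothetical and this is not the CUDA kernel code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map logical KV index j of sequence seq to its physical cell, using the
// indexing stated in the text: block_table[seq*ne11 + j].
inline int32_t physical_cell(const std::vector<int32_t> & block_table,
                             int seq, int ne11, int j) {
    return block_table[static_cast<std::size_t>(seq) * ne11 + j];
}

// Read ne11 cells for one sequence in logical (token-position) order.
// A null block table means the contiguous read: logical j is cell j.
// One float stands in for a whole K/V row in this toy version.
inline std::vector<float> read_cells(const std::vector<float> & cells,
                                     const std::vector<int32_t> * block_table,
                                     int seq, int ne11) {
    std::vector<float> out(static_cast<std::size_t>(ne11));
    for (int j = 0; j < ne11; ++j) {
        const int32_t cell = block_table ? physical_cell(*block_table, seq, ne11, j) : j;
        out[static_cast<std::size_t>(j)] = cells[static_cast<std::size_t>(cell)];
    }
    return out;
}
```

Because the table is position-ordered, the values come back in the same logical order regardless of where they physically live, which is why the reduction order can stay the same as the contiguous read.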
**The gated-DeltaNet (GDN / SSM) decode path.** The Qwen3.6 hybrid models are 48
gated-DeltaNet (linear-attention / SSM) layers + 16 full-attention layers. On
GB10 the recurrent-state plumbing, not the weight GEMM, is the dominant decode
cost. The series fuses that plumbing to mirror vLLM's
`fused_recurrent_gated_delta_rule`: the recurrent state is read from and written
to its cache slot in place (no copy-back, no `get_rows` materialization), the
conv state is updated in place, the output projection is reshaped to route to the
tensor-core MMQ GEMM, and the recurrence kernel is occupancy-retuned - all
bit-exact (md5-gateable) against the f32 baseline.
**NVFP4 native FP4-MMA on Blackwell.** The NVFP4 dense/expert weight GEMM uses
Blackwell's native FP4-MMA. The series removes a redundant activation-requantize
in the MoE broadcast projections (bit-exact byte copy of identical blocks) and
keeps CUDA graphs on for the grouped-MMQ MoE decode step. These are the only
NVFP4-specific optimizations; on non-Blackwell hardware the FP4 path falls back
to dequant.
**The prefill/decode scheduler.** `update_slots()` already emits one unified
mixed prefill+decode batch per step. The scheduler patches change only the *count*
of prefill tokens admitted per step: decode tokens are claimed first
(decode-first), then a dynamic budget `max(n_ubatch, T - D)` (where `D` is the
live decode load and `T` is `LLAMA_MAX_BATCH_TOKENS`) admits prefill, auto-
shrinking as decode load rises. Pure scheduler policy, byte-identical when off,
orthogonal to the paged allocator.
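
The budget formula is small enough to state directly. The sketch below assumes integer token counts; the function and parameter names are illustrative, and the numbers in the usage note are hypothetical, not measured settings.

```cpp
#include <algorithm>
#include <cassert>

// Dynamic decode-first prefill budget: max(n_ubatch, T - D).
//   n_ubatch         - micro-batch size (the floor for prefill progress)
//   max_batch_tokens - T, the per-step token ceiling (LLAMA_MAX_BATCH_TOKENS)
//   n_decode         - D, decode tokens already claimed this step
inline int prefill_budget(int n_ubatch, int max_batch_tokens, int n_decode) {
    return std::max(n_ubatch, max_batch_tokens - n_decode);
}
```

With hypothetical values `n_ubatch = 512` and `T = 2048`, an idle server (`D = 0`) admits 2048 prefill tokens, `D = 128` admits 1920, and once `T - D` drops below 512 the floor of `n_ubatch` keeps prefill from starving.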
---
## 3. Patch series (0001-0030)
28 patches (0005 and 0027 are intentionally unused). "Bit-exact" = greedy md5 /
`test-backend-ops` byte-identical to the relevant baseline; the gate methodology
is in section 5.
### Paged-KV core (0001-0012)
| # | What it does | Bit-exact |
|---|---|---|
| 0001 | Vendor the host-side paged KV block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache). Pure C++17, nothing uses it yet. | n/a (no behavior) |
| 0002 | Place each sequence at permuted, non-contiguous block positions in `find_slot` (proves attention is invariant to physical KV placement). | yes (token-identical) |
| 0003 | Gather K/V/mask down to each stream's non-empty cells before `build_attn_mha`, position-sorted so the FA reduction order matches stock. | yes |
| 0004 | Drive paged placement through the vendored manager: blocks popped on demand, returned on seq end. Core kv-cache struct untouched. | yes (stock path byte-identical) |
| 0006 | Host-side cross-request prefix caching: hash prefix blocks, reuse matching physical blocks (ref-count++), COW-privatise before a divergent write. | yes (default off) |
| 0007 | Wire the prefix cache into the engine so a new sequence physically shares cached prefix blocks and skips recomputing the shared prefix. | yes (verified byte-identical) |
| 0008 | Wire cross-request prefix share into the llama-server continuous-batch loop so concurrent shared-prefix requests prefill only the suffix (36x fewer prefill tokens at K=32). | within CUDA batch-shape non-determinism band |
| 0009 | Replace the per-step gather with an **in-kernel paged read** (block table as `src[5]`); the K/V `get_rows` is gone. Decode step at batch32 691->696ms (was 1279ms gathered). | yes on CPU/batch1; GPU batch>1 within vec-vs-mma band |
| 0010 | Graft the block-table read into the tile kernel; add a dispatch guard so a present block table routes ONLY to vec/tile (never the mma/wmma kernels that ignore it). | yes (CPU byte-identical; vec route) |
| 0011 | Route the GQA-grouped F16 decode to the **tile kernel** (native head-group reuse) by default; vec for everything else. Paged decode to within 1.8% of stock. | vs stock-mma: different-kernel rounding; bit-exact vs vec |
| 0012 | Defensive `GGML_ASSERT(n_view % 64 == 0)` so a future pad/tile change can't silently reintroduce a past-end KV leak on the tile route. | yes (additive assert) |
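
To make the block-manager vocabulary in this table concrete (free queue, on-demand allocation, ref-counted sharing, copy-on-write), here is a toy pool. It is a sketch under simplifying assumptions: block ids only, no KV data, no hashing; `ToyBlockPool` and its methods are hypothetical and do not mirror the vendored `BlockPool` / `PagedKVManager` API.

```cpp
#include <cassert>
#include <deque>
#include <vector>

class ToyBlockPool {
public:
    explicit ToyBlockPool(int n_blocks) : refs_(n_blocks, 0) {
        for (int b = 0; b < n_blocks; ++b) free_.push_back(b);
    }
    // Pop a free block on demand; returns -1 when the pool is exhausted.
    int alloc() {
        if (free_.empty()) return -1;
        const int b = free_.front();
        free_.pop_front();
        refs_[b] = 1;
        return b;
    }
    // Share an already-committed block (prefix reuse): bump its ref count.
    void share(int b) { ++refs_[b]; }
    // Drop one reference; the block returns to the free queue at zero.
    void release(int b) {
        if (--refs_[b] == 0) free_.push_back(b);
    }
    // Copy-on-write bookkeeping: a writer holding a shared block gets a
    // private one and gives up its reference to the shared block. Returns the
    // block to write to (b itself if it was not shared, -1 if the pool is
    // empty). Copying the block contents is out of scope for this toy.
    int make_private(int b) {
        if (refs_[b] <= 1) return b;
        const int fresh = alloc();
        if (fresh >= 0) --refs_[b];
        return fresh;
    }
    int num_free() const { return static_cast<int>(free_.size()); }
    int ref_count(int b) const { return refs_[b]; }
private:
    std::vector<int> refs_;  // reference count per physical block
    std::deque<int>  free_;  // FIFO queue of free physical block ids
};
```

A sequence ending corresponds to calling `release` on each of its blocks; a second request with the same prefix corresponds to `share` on the prefix blocks followed by `make_private` before its first divergent write.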
### Decode-first scheduler (0013, 0016)
| # | What it does | Bit-exact |
|---|---|---|
| 0013 | `LLAMA_PREFILL_BUDGET`: a static per-step prefill-token budget decoupled from `n_batch` (vLLM `--max-num-batched-tokens` analogue). Flattens the decode ITL spike a long prefill inflicts (8.5x smaller worst freeze). | yes (off/short = byte-identical; == `-b` chunking) |
| 0016 | Supersede 0013 with a **dynamic decode-first** budget: `max(n_ubatch, T-D)`, auto-shrinking as decode load `D` rises. Policy-only inside `update_slots()`, zero libllama changes. | yes (default-off byte-identical) |
(0014/0015 are the MoE token-tile levers: 0014 adds `LLAMA_MOE_MMQ_X` (opt-in
high-batch decode micro-opt, +4.8% on Qwen3-Coder-30B), 0015 makes it a
default-on, density-aware auto-select that is prefill-safe by construction. Both
bit-exact. 0017 is the dense FP4-GEMM occupancy-tune track: bit-exact gate green,
but every cheap occupancy lever regressed on GB10, so nothing is enabled - it
ships as the parity gate + default-off instrumentation only.)
### SSM (gated-DeltaNet) decode levers (0018-0022, 0028)
These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact.
| # | What it does | Effect (dense q36-27b / MoE q36-35b-a3b @npl128) |
|---|---|---|
| 0018 | **In-place SSM state write-back** - the recurrence writes its final state directly into the cache slot, removing the ~225MB/copy D2D memcpy (18.9% of decode time). | dense +23.5% / MoE +18.9% |
| 0019 | **Fused recurrent-state gather** - the op reads each sequence's prior state directly from `cache[ids[seq]]` (no `get_rows` materialization); race-free in-place + ids read. | dense +37.8% / MoE +35.3% |
| 0020 | **o_proj MMVQ->MMQ reshape** - collapse the GDN output to 2D so the output projection routes to the M=128 tensor-core MMQ GEMM (was a batch<=8 MMVQ GEMV). The single biggest decode-parity lever. | dense +31.7% (->85.9% of vLLM) / MoE +23.3% |
| 0021 | **Conv-state in-place fusion** - one `ggml_ssm_conv_update_inplace` op replaces the 4-op conv chain (transpose+concat+conv+silu+ring-cpy), writing the shifted ring state in place. | dense +3.2% / MoE +3.5% |
| 0022 | **GDN recurrence occupancy/coalescing retune** - column-folding (NUM_WARPS/COLS_PER_WARP) raises memory-level parallelism on the bandwidth-bound B=128 recurrence kernel; per-column f32 FMA order unchanged. 73.4%->84.6% of GB10 peak BW. | dense +11.1% / MoE +8.3% |
| 0028 | **Recurrent conv-tap gather fusion** - the last `k_get_rows` in the GDN decode path (the conv-state tap gather) becomes an indexed in-kernel read. | dense ~377 t/s / MoE ~784 t/s |
### MoE NVFP4 quant (0023, 0025)
| # | What it does | Bit-exact |
|---|---|---|
| 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; the patch quantizes each unique token activation once and byte-copies it into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) |
| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Env-gated `LLAMA_MOE_FORCE_GRAPHS`. | yes (graph replay re-issues identical kernels) |
### Pool reclaim, block-table cache, backend gate, opt-in bf16-SSM
| # | What it does | Bit-exact |
|---|---|---|
| 0024 | **Paged-pool burst-reclaim** - truncate trailing blocks on partial-tail `seq_rm`, defrag the free queue when idle, release blocks on slot completion. Fixes the long-server burst-degradation bug (post-burst prefill collapse 488->44 t/s, restored to 532). Host-side accounting only. | yes |
| 0029 | **Block-table within-step host cache** - the block table is fixed for the whole step; cache it on first build and memcpy it for the other full-attention layers (get_block_table -87%/-91%). | yes, per path (paged-MoE ref `8cb0ce23`) |
| 0030 | **Fused-op backend gate** - the fused GDN / discriminated SSM_CONV ops are CUDA-family + CPU only; force them off on any non-CUDA compute backend so a Vulkan/SYCL/Metal build can't silently run the wrong plain-conv kernel. | yes on CUDA (byte-identical pre-0030); safety gate elsewhere |
| 0026 | **Hybrid per-head bf16 SSM state (opt-in)** - `--ssm-bf16-tau` / option `ssm_bf16_tau`: fast-decaying GDN heads (memory length below the tau threshold) persist state as bf16, halving that head's decode byte stream (~+12% decode). | default tau=0 = f32 = **bit-exact**; the bf16 mode is **NOT** bit-exact (~91% same-top-p) |
---
## 4. Benchmarks
Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense
**Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg`
S_TG (t/s) from `llama-batched-bench`, `-fa on`, `npp 128 / ntg 128`, swept over
serving width `npl`. Plots: [`qwen36_dense_decode_vs_npl.png`](qwen36_dense_decode_vs_npl.png),
[`qwen36_moe_decode_vs_npl.png`](qwen36_moe_decode_vs_npl.png); raw data
[`final_benchmark.csv`](final_benchmark.csv).
### (a) + (b) Patched vs stock vs vLLM
The **stock** and **patched** columns are the same binary, env-toggled, on the
**same harness** (`llama-batched-bench`) - so "x over stock" is an exact
apples-to-apples measure of the patch series' contribution. The **vLLM** column
is a **different harness** (vLLM server + client continuous batching), so the
cross-engine "% of vLLM" is **indicative, not apples-to-apples**.
**Dense Qwen3.6-27B-NVFP4** (t/s):
| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
|----:|------:|--------:|-----:|------------------:|---------------------:|
| 8 | 65.7 | 84.0 | 71.1 | 118% | 1.28x |
| 32 | 113.7 | 204.0 | 207.6 | 98% | 1.79x |
| 64 | 134.3 | 294.9 | 309.7 | 95% | 2.20x |
| 128 | 143.5 | 371.2 | 422.4 | 88% | 2.59x |
**MoE Qwen3.6-35B-A3B-NVFP4** (t/s):
| npl | stock | patched | vLLM | patched % of vLLM | patched x over stock |
|----:|------:|--------:|------:|-----------------:|---------------------:|
| 8 | 181.4 | 227.4 | 315.1 | 72% | 1.25x |
| 32 | 260.8 | 455.7 | 681.9 | 67% | 1.75x |
| 64 | 306.8 | 612.3 | 765.5 | 80% | 2.00x |
| 128 | 331.3 | 772.6 | 1011.7 | 76% | 2.33x |
**Caveat on the vLLM column.** Besides the different harness, the vLLM MoE
@npl128 number here (1011.7 t/s at 128/128) is *higher* than the 901 t/s reference
config (512/256), so the MoE "% of vLLM" reads **76% here vs ~86% against the
reference config**. Memory: llama.cpp uses **1.5-3x less** memory than vLLM.
**Takeaway.** The patch series gives up to **2.59x (dense) / 2.33x (MoE)** over
stock on the same harness. Dense is **parity-to-ahead of vLLM**; MoE trails - the
remaining gap is structural (see section 5).
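
The derived columns in the two tables are plain ratios of the t/s columns. The helper below is a small sketch for re-deriving them from the raw numbers; the function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// "patched x over stock": ratio of patched to stock throughput (same harness).
inline double speedup_over_stock(double patched_tps, double stock_tps) {
    return patched_tps / stock_tps;
}

// "patched % of vLLM": patched throughput as a percentage of a reference.
inline double percent_of(double patched_tps, double reference_tps) {
    return 100.0 * patched_tps / reference_tps;
}
```

Using the npl=128 rows above, dense gives 371.2 / 143.5 and MoE gives 772.6 / 331.3 for the speedup column, and 772.6 against the 901 t/s reference config gives the "~86%" quoted in the caveat.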
### (c) Apple M4 (16GB) - for curiosity only
No t/s numbers: the 24GB NVFP4 GGUF did not finish downloading, and it would not fit
in 16GB of RAM anyway (it would page to SSD). Architectural findings:
- Metal `supports_op` **excludes NVFP4** from `MUL_MAT` / `MUL_MAT_ID` /
`GET_ROWS`, so the FP4 matmuls **fall back to CPU** - there is no Apple
FP4-MMA.
- `GATED_DELTA_NET` and `SSM_CONV` / `SSM_SCAN` **do** have Metal kernels.
Verdict: NVFP4 Qwen3.6 needs **Blackwell FP4-MMA + >24GB RAM**; a 16GB M4 is not
a fit. A Metal-native Q4_K Qwen3.6 would be a different artifact.
---
## 5. Dev notes - what we learned
**Bit-exact methodology.** Every bit-exact patch is gated two ways: (1) a greedy
md5 gate - `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France
is" -n 48 --temp 0 --seed 1 | md5sum`, paged paths prefixed with
`LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged MoE), on the default
chat-template path; and (2) `test-backend-ops` (CUDA0 vs CPU oracle) for every
touched op (`SSM_CONV*`, `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md)).
Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5
(`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this
is a benign FP-accumulation-order difference of the paged attention reduction,
**KL-validated** against the f16 reference: KLD(paged||f16) 0.13600 <=
KLD(nonpaged||f16) 0.13660, PPL within +/-0.29, ~zero probability bias - two
equivalent FP-reorderings of the same quantized model, not a regression. Future
paged-MoE regressions therefore compare to `8cb0ce23`, not `07db32c2`.
**MoE-parity conclusion** (the residual gap is structural). The two heaviest MoE
decode kernels - the GDN-SSM recurrence and the NVFP4-expert GEMM - are llama
**wins** after this series (the recurrence runs at 102.6% of vLLM's bandwidth;
the GEMM ties vLLM at the LPDDR5x BW floor). The residual gap is **bf16-projection
bandwidth + the host scheduling loop**, both at the LPDDR5x floor - not a kernel
llama is losing. The MoE GEMM kernel is *not* where the gap lives.
**Rejected / flat levers** (recorded so they are not re-tried):
- **Lever 2 - graph/stream coverage: FLAT.** Bit-exact graph coverage was
exhausted by 0025; more graph/stream overlap is a no-op or small regression on
this model.
- **Lever 3 - act-quant fusion: FLAT.** The W4A4 act-quant tax is removable only
by W4A16 (a precision change, rejected) or a structural kernel rewrite; no
further bit-exact lever clears it. 0023 already banks the de-dup.
- **Lever 4 - quantize the bf16 GDN/attn projections to NVFP4: REJECTED (KL-gate fail).**
Quantizing the projections to NVFP4 costs ~+6% PPL; vLLM deliberately keeps the
same bf16 projections. No-ship.
- **W4A16-Marlin MoE GEMM: REJECTED.** It would buy a precision upgrade nobody
needs at the cost of a ~5% slower kernel; both kernels are already at the BW floor.
(The dense verdict, "the win was NVFP4-dense-quant, not the Marlin kernel",
carries over to MoE.)
**Opt-in bf16-SSM fast mode** (patch 0026, `ssm_bf16_tau`). The design premise -
that bf16 KL error concentrates in long-memory heads and can be removed by
keeping them f32 - is **empirically refuted**: the error scales with the bf16
head *count* and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any useful
byte saving, and the carry is byte-exact (genuine bf16 rounding, not a bug). The
byte-saving (and ~+12% decode) is real but cannot meet a strict KL bar, so it
ships **default-off (f32, bit-exact)** and opt-in only. Do not put a hybrid tau
in a recommended/gallery config.
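
For intuition about what "genuine bf16 rounding" costs, the sketch below rounds an f32 value to bf16 precision and widens it back. It uses the common round-to-nearest-even bit trick and ignores NaN/Inf handling; it is an illustration of the number format, not the patch's state-persistence code, and `round_to_bf16` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Round an f32 to bf16 precision (round-to-nearest-even on the upper 16 bits)
// and return it widened back to f32. NaN/Inf are not handled specially.
inline float round_to_bf16(float x) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const uint32_t lsb = (bits >> 16) & 1u;  // lowest bit that bf16 keeps
    bits += 0x7FFFu + lsb;                   // bias so ties round to even
    bits &= 0xFFFF0000u;                     // drop the 16 low mantissa bits
    float y;
    std::memcpy(&y, &bits, sizeof y);
    return y;
}
```

bf16 keeps only 7 explicit mantissa bits, so an increment of 2^-8 relative to the stored value is lost on every write-back (for example, `1.00390625f` rounds back to `1.0f`). A recurrent state that is rounded this way each step accumulates that error, which is consistent with the mode being real byte saving but not bit-exact.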
---
## 6. Architecture and quant generality
(From the arch-generality and quant-generality audits.)
- **15 of 16 optimizations are quant-AGNOSTIC.** Only **0023** (NVFP4
activation-quantize de-dup) is NVFP4-specific. The SSM/paged/MMQ optimizations
help **any quant** of these models (the GDN recurrence, conv, gather and
o_proj-MMQ levers operate on the f32 recurrent state and the routing layout,
not on the weight dtype).
- **Arch-safe to build everywhere.** NVFP4 use is Blackwell-gated and falls back
to dequant on other hardware; the GB10-tuned occupancy params (0022) are
perf-only and env-selectable (`GDN_NW` / `GDN_CPW`), so they never change
correctness on other GPUs. Patch 0030 makes the fused-op emission CUDA-family +
CPU only, so a non-CUDA paged build routes to the safe upstream non-fused path.
---
## 7. Pin + maintenance policy
- **Pinned to llama.cpp `c299a92c`.** The pin is advanced **only** by the manual
[`PIN_SYNC`](PIN_SYNC_c299a92c.md) process: rebase the source-only patch series
onto the new tip, rebuild on GPU, and pass the bit-exact gate on every path
(dense + MoE, paged + non-paged) plus `test-backend-ops`. The `9d5d882d ->
c299a92c` jump (23 upstream commits) needed zero patch changes and did not
change decode output.
- **Decoupled from the nightly auto-bumper.** There is deliberately **no**
`bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could
silently shift the tree out from under the patches.
- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../../../.github/workflows/llama-cpp-paged-canary.yml)
(via [`.github/scripts/paged-canary-apply.sh`](../../../../../.github/scripts/paged-canary-apply.sh))
tries the patch series against the latest upstream tip with the build's own
strict `git apply`. **Red = upstream drifted past the series -> run a
PIN_SYNC** (do not bump the pin blindly). The canary references
[`PIN_SYNC_c299a92c.md`](PIN_SYNC_c299a92c.md).
---
## 8. Models
The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery:
| Gallery entry | Weights (HuggingFace) | Notes |
|---|---|---|
| `qwen3.6-27b-nvfp4-paged` | [`mudler/Qwen3.6-27B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF) | Dense, native Blackwell NVFP4 (FP4-MMA). |
| `qwen3.6-35b-a3b-nvfp4-paged` | [`mudler/Qwen3.6-35B-A3B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF) | MoE (256 experts, top-8), `file_type MOSTLY_NVFP4`. |
Both gallery entries set `backend: llama-cpp-localai-paged` and the paged serving config
(`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`,
`flash_attention:on`, `context_size`). They intentionally stay bit-exact (no
`ssm_bf16_tau`). The full backend-split + gallery plan is in
[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](LOCALAI_LLAMACPP_BACKEND_PLAN.md).

@@ -1,17 +0,0 @@
model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64

@@ -1,217 +0,0 @@
// Paged-pool burst-degradation repro (patch 0024). DEV SCAFFOLDING ONLY.
//
// Reproduces, at the libllama level, the two host-side defects behind the
// "later lower-npl prefill collapses, decode fine, restart cures it" benchmark
// signature:
//
// * RECLAMATION GAP (Fix-1): a partial tail seq_rm(seq, p0>0, -1) - exactly
// what llama-server issues on every reused slot - frees the kv-cache CELLS
// but the paged manager keeps owning the trailing BLOCKS. The manager's
// free pool silently shrinks. Test A measures the reclaimed-block delta.
//
// * FRAGMENTATION / NO COMPACTION (Fix-2): a high-fan-out burst that allocates
// many sequences and frees them in a scrambled order leaves the free queue a
// scrambled permutation of physical block ids. A later low-npl prefill then
// pops physically scattered blocks, so its KV scatter-write + in-kernel
// paged-attention gather lose locality and prefill throughput collapses;
// decode (single-token append) barely notices. Test B times an npl8 prefill
// on a FRESH pool vs an npl8 prefill AFTER a scrambling burst+drain.
//
// PASS (post-fix): Test A reclaims ceil((PP-KEEP)/bs) trailing blocks on the
// partial seq_rm (0 pre-fix); Test B's post-burst npl8 prefill_tps is within ~10%
// of the fresh npl8 and num_free returns to the pristine value after the drain.
//
// Run with LLAMA_KV_PAGED=1. Env: BURST_NSLOT(64) NPL(8) PP(512) KEEP(256)
// GEN(4) PAGED_NGL(99). All sequences use distinct content so nothing is shared.
#include "llama.h"
#include "paged-prefix-api.h"
#include <chrono>
#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>
static int env_i(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }
using clk = std::chrono::steady_clock;
static double secs(clk::time_point a, clk::time_point b) {
return std::chrono::duration<double>(b - a).count();
}
struct Ctx { llama_context * ctx; llama_memory_t mem; llama_batch batch; int n_vocab; };
// Deterministic, content-distinct token for (seq, pos): keeps every sequence's
// blocks unique so no cross-request prefix sharing masks the accounting.
static llama_token tok_of(int seq, int pos, int n_vocab) {
return (llama_token) (((seq * 1000003 + pos * 131 + 7) % (n_vocab - 200)) + 100);
}
// Prefill n tokens of seq at [pos0, pos0+n) in one ubatch (n <= n_batch).
// Returns wall seconds (sync'd).
static double prefill(Ctx & C, int seq, int pos0, int n) {
clk::time_point t0 = clk::now();
C.batch.n_tokens = 0;
for (int j = 0; j < n; ++j) {
int i = C.batch.n_tokens;
C.batch.token[i] = tok_of(seq, pos0 + j, C.n_vocab);
C.batch.pos[i] = pos0 + j;
C.batch.n_seq_id[i] = 1;
C.batch.seq_id[i][0]= seq;
C.batch.logits[i] = (j + 1 == n) ? 1 : 0;
C.batch.n_tokens++;
}
if (llama_decode(C.ctx, C.batch)) { fprintf(stderr, "prefill decode failed seq=%d\n", seq); return -1; }
llama_synchronize(C.ctx);
return secs(t0, clk::now());
}
// One decode step (single token) for seq at pos.
static void decode1(Ctx & C, int seq, int pos) {
C.batch.n_tokens = 1;
C.batch.token[0] = tok_of(seq, pos, C.n_vocab);
C.batch.pos[0] = pos; C.batch.n_seq_id[0] = 1; C.batch.seq_id[0][0] = seq; C.batch.logits[0] = 1;
if (llama_decode(C.ctx, C.batch)) fprintf(stderr, "decode1 failed seq=%d\n", seq);
}
int main(int argc, char ** argv) {
std::setlocale(LC_NUMERIC, "C");
const char * model_path = nullptr;
for (int i = 1; i < argc; ++i) if (!strcmp(argv[i], "-m") && i + 1 < argc) model_path = argv[++i];
if (!model_path) { fprintf(stderr, "usage: %s -m model.gguf\n", argv[0]); return 2; }
const int NSLOT = env_i("BURST_NSLOT", 64);
const int NPL = env_i("NPL", 8);
const int PP = env_i("PP", 512);
const int KEEP = env_i("KEEP", 256);
const int GEN = env_i("GEN", 4);
const int ngl = env_i("PAGED_NGL", 99);
const bool paged = getenv("LLAMA_KV_PAGED") != nullptr;
ggml_backend_load_all();
llama_model_params mp = llama_model_default_params();
mp.n_gpu_layers = ngl;
llama_model * model = llama_model_load_from_file(model_path, mp);
if (!model) { fprintf(stderr, "model load failed\n"); return 1; }
const llama_vocab * vocab = llama_model_get_vocab(model);
const int n_vocab = llama_vocab_n_tokens(vocab);
// Pool sized for the burst plus headroom so the burst fits but a later npl
// run draws from whatever the burst's churn left behind.
const long cells = (long) (NSLOT + NPL + 4) * (PP + GEN + 16);
llama_context_params cp = llama_context_default_params();
cp.n_ctx = (uint32_t) cells;
cp.n_batch = (uint32_t) (PP + 16);
cp.n_ubatch = (uint32_t) (PP + 16);
cp.n_seq_max = NSLOT + NPL + 2;
cp.kv_unified = true; // one unified stream-0 pool -> num_free(ctx) is the whole pool
cp.no_perf = true;
llama_context * ctx = llama_init_from_model(model, cp);
if (!ctx) { fprintf(stderr, "ctx init failed (cells=%ld)\n", cells); return 1; }
Ctx C; C.ctx = ctx; C.mem = llama_get_memory(ctx); C.n_vocab = n_vocab;
C.batch = llama_batch_init(cp.n_batch, 0, 1);
printf("== paged-burst-bench == paged=%d NSLOT=%d NPL=%d PP=%d KEEP=%d GEN=%d n_ctx=%ld\n",
paged, NSLOT, NPL, PP, KEEP, GEN, cells);
llama_memory_clear(C.mem, true);
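// NOTE: read before any prefill. num_free is 0 while no manager is registered
// (see Test C), so F_start may not reflect the full pool size.
const long F_start = paged_prefix_api::num_free_global();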
// ---- Test A: Fix-1 reclamation gap on a partial tail seq_rm --------------
{
if (prefill(C, 0, 0, PP) < 0) return 1;
const long f_after_prefill = paged_prefix_api::num_free_global();
llama_memory_seq_rm(C.mem, 0, KEEP, -1); // partial tail removal
const long f_after_rm = paged_prefix_api::num_free_global();
llama_memory_seq_rm(C.mem, 0, -1, -1); // full free -> pristine
const long f_after_full = paged_prefix_api::num_free_global();
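const long bs = 16; // block size is hard-coded here; it must match the paged manager's block size
const long expect = (PP + bs - 1)/bs - (KEEP + bs - 1)/bs; // ceil(PP/bs) - ceil(KEEP/bs) trailing blocks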
printf("[TEST-A Fix-1] start=%ld afterPrefill=%ld afterPartialRm=%ld reclaimed=%ld "
"(expect %ld post-fix, 0 pre-fix) afterFullFree=%ld\n",
F_start, f_after_prefill, f_after_rm, f_after_rm - f_after_prefill, expect, f_after_full);
}
// ---- Test B: fragmentation -> npl prefill collapse -----------------------
// Fresh npl prefill baseline on a pristine pool.
llama_memory_clear(C.mem, true);
double tps_fresh;
{
clk::time_point t0 = clk::now();
long ntok = 0;
for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
tps_fresh = ntok / secs(t0, clk::now());
for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
}
const long F_pristine = paged_prefix_api::num_free_global();
// High-fan-out burst: allocate NSLOT sequences, each prefilled + a few decode
// steps (mixed alloc), then drain them in a scrambled order (odd ids first,
// then even, each truncated before the full free) so the free queue becomes a
// scrambled permutation - the fragmentation the bug never compacts.
for (int s = 0; s < NSLOT; ++s) {
if (prefill(C, NPL + s, 0, PP) < 0) return 1;
for (int g = 0; g < GEN; ++g) decode1(C, NPL + s, PP + g);
}
const long F_during_burst = paged_prefix_api::num_free_global();
// Drain: partial tail seq_rm (the reused-slot pattern) then full free, in a
// scrambled slot order to scramble the physical free order.
for (int parity = 1; parity >= 0; --parity)
for (int s = 0; s < NSLOT; ++s) if ((s & 1) == parity) {
llama_memory_seq_rm(C.mem, NPL + s, KEEP, -1); // partial (Fix-1 path)
llama_memory_seq_rm(C.mem, NPL + s, -1, -1); // full free
}
const long F_after_drain = paged_prefix_api::num_free_global();
// Post-burst npl prefill: pops from the (pre-fix scrambled / post-fix
// defragged) free queue.
double tps_post;
{
clk::time_point t0 = clk::now();
long ntok = 0;
for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
tps_post = ntok / secs(t0, clk::now());
for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
}
const double ratio = tps_fresh > 0 ? tps_post / tps_fresh : 0;
printf("[TEST-B frag] num_free: start=%ld pristine=%ld duringBurst=%ld afterDrain=%ld "
"(afterDrain==pristine? %s)\n",
F_start, F_pristine, F_during_burst, F_after_drain,
F_after_drain == F_pristine ? "YES" : "NO");
printf("[TEST-B frag] prefill_tps fresh=%.1f post-burst=%.1f ratio=%.3f "
"(PASS if >=0.90)\n", tps_fresh, tps_post, ratio);
// ---- Test C: idle-slot retention leak -> reclaim (the Fix-3 scenario) -----
// Burst NSLOT sequences and leave them IDLE (stock llama-server keeps an idle
// slot's KV; the blocks are stranded). F_idle shows the depleted pool a later
// low-npl run would see. Then full-seq_rm each (exactly what Fix-3's
// prompt_clear() issues at slot.release): F_reclaimed must return to pristine.
llama_memory_clear(C.mem, true);
// Touch the pool once so the manager exists, then read the full-pool size
// (num_free is 0 while no manager is registered).
if (prefill(C, 0, 0, 16) < 0) return 1;
llama_memory_seq_rm(C.mem, 0, -1, -1);
const long F_pre_c = paged_prefix_api::num_free_global();
for (int s = 0; s < NSLOT; ++s) { if (prefill(C, NPL + s, 0, PP) < 0) return 1; }
const long F_idle = paged_prefix_api::num_free_global();
for (int s = 0; s < NSLOT; ++s) llama_memory_seq_rm(C.mem, NPL + s, -1, -1); // Fix-3 release
const long F_reclaimed = paged_prefix_api::num_free_global();
printf("[TEST-C idle] pristine=%ld idle_after_burst=%ld (leaked=%ld) reclaimed=%ld "
"(returns_to_fresh? %s)\n",
F_pre_c, F_idle, F_pre_c - F_idle, F_reclaimed,
F_reclaimed == F_pre_c ? "YES" : "NO");
printf("RESULT paged=%d frag_fix2_ratio=%.3f drain_numfree_returns=%s idle_reclaim_returns=%s\n",
paged, ratio,
F_after_drain == F_pristine ? "YES" : "NO",
F_reclaimed == F_pre_c ? "YES" : "NO");
llama_batch_free(C.batch);
llama_free(ctx);
llama_model_free(model);
return 0;
}

@@ -1,59 +0,0 @@
// Host-side unit test for the paged-pool burst-reclaim fix (patch 0024).
// Compiles paged-kv-manager.cpp directly; no ggml / llama / GPU dependency.
//
// Fix-1 PagedKVManager::truncate(seq, n_keep) reclaims the trailing blocks
// beyond ceil(n_keep/bs) (ref-counted), so a partial tail seq_rm no
// longer strands blocks whose cells were cleared.
// Fix-2 defrag_free_pool() relinks the free queue into ascending block-id
// order once the pool is fully idle, undoing a burst's scrambled frees
// so a later prefill pops physically contiguous blocks again.
#include "paged-kv-manager.h"
#include <cstdio>
using paged::PagedKVManager;
int main() {
int rc = 0;
// ---- Fix-1: truncate reclaims the trailing block suffix -----------------
{
PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/true);
const size_t f0 = m.num_free_blocks(); // 63 (block 0 reserved as null)
m.allocate(0, 512); // ceil(512/16)=32 blocks
const size_t f1 = m.num_free_blocks(); // 31
m.truncate(0, 256); // keep ceil(256/16)=16, free 16
const size_t f2 = m.num_free_blocks(); // 47
printf("[unit Fix-1] free=%zu alloc512=%zu truncate256=%zu reclaimed=%zu (expect 16)\n",
f0, f1, f2, f2 - f1);
if (f2 - f1 != 16) rc = 1;
m.truncate(0, 16); // keep 1 block, free 15 more
const size_t f3 = m.num_free_blocks(); // 62
printf("[unit Fix-1] truncate16=%zu (expect %zu)\n", f3, f0 - 1);
if (f3 != f0 - 1) rc = 1;
m.free(0);
if (m.num_free_blocks() != f0) { printf("[unit Fix-1] free mismatch\n"); rc = 1; }
}
// ---- Fix-2: defrag restores ascending popleft order ---------------------
{
PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/false);
for (int s = 0; s < 8; ++s) m.allocate(s, 16); // pop blocks 1..8
const int scrambled[8] = {3, 7, 1, 5, 0, 6, 2, 4}; // free out of order
for (int i = 0; i < 8; ++i) m.free(scrambled[i]);
m.defrag_free_pool(); // all idle -> compact
m.allocate(100, 16 * 3); // pop 3 blocks
const auto bt = m.block_table(100);
bool asc = true;
printf("[unit Fix-2] post-defrag block_table:");
for (size_t i = 0; i < bt.size(); ++i) {
printf(" %d", bt[i]);
if (i && bt[i] < bt[i - 1]) asc = false;
}
printf(" ascending=%s (expect YES)\n", asc ? "YES" : "NO");
if (!asc) rc = 1;
}
printf("UNIT %s\n", rc == 0 ? "PASS" : "FAIL");
return rc;
}

@@ -2,30 +2,18 @@
 ## Patches
-## Apply patches: the base `patches/` series, then the gated `patches/paged/`
-## series (default on; LLAMA_PAGED=off skips it). Only *.patch files are applied
-## (docs/dirs like patches/paged/ and *.md are skipped). The Makefile `llama.cpp`
-## target already `git apply`s these at checkout, so each apply is guarded by a
-## sentinel and skipped when already present - re-applying git-format patches with
-## `patch` fuzzily duplicates hunks (redefinition errors). This block only does
-## real work if prepare.sh is run against an unpatched checkout.
+## Apply the base `patches/` series (top-level *.patch only; *.md/dirs skipped).
+## The stock llama-cpp backend is patch-free by default, so this normally does
+## nothing. The Makefile `llama.cpp` target already `git apply`s any base patch
+## at checkout, so each apply here is `-N` (skip already-applied): re-applying a
+## git-format patch with `patch` would fuzzily duplicate hunks. This block only
+## does real work if prepare.sh is run against an unpatched checkout.
 if [ -d "patches" ]; then
 for patch in patches/*.patch; do
 [ -e "$patch" ] || continue
 echo "Applying patch $patch"
 patch -d llama.cpp/ -p1 -N -r - < "$patch" || true
 done
-if [ "${LLAMA_PAGED:-on}" != "off" ] && [ -d "patches/paged" ]; then
-if [ -f llama.cpp/src/paged-kv-manager.cpp ]; then
-echo "paged-attention patch series already applied (sentinel present) - skipping re-apply"
-else
-for patch in patches/paged/*.patch; do
-[ -e "$patch" ] || continue
-echo "Applying paged patch $patch"
-patch -d llama.cpp/ -p1 -N -r - < "$patch" || true
-done
-fi
-fi
 fi
 set -e