Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers).

Post-SSM (patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling both engines pinned the largest llama-specific overage to the gated-DeltaNet output projection (ssm_out). The GDN op left its output in SSM layout and the graph reshaped it to 3D [value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128 sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the ssm_out weight read across the 128 sequences. vLLM packs the same projection into one M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D.

The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs] (= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128 routes to the MMQ M=128 tensor-core GEMM. The result is then already 2D, so the redundant post-matmul reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical. Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs untouched.

Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical. test-backend-ops MUL_MAT and MUL_MAT_ID OK.

decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128):

- dense q36-27b: 170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%)
- MoE q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%)

Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit).

nsys: the o_proj mul_mat_vec_q<NVFP4,m=1> bucket (132.8 ms / 48 inst) collapses to zero; mul_mat_q<NVFP4,m=128> absorbs it (+1200 inst, +363 ms) at a LOWER per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call vs 2.77 ms/call for the old GEMV.
Mirrors DGX dev-tree commit df1cc97.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
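
The shape argument in the commit message can be sketched as a toy model. This is a deliberately simplified illustration, not the ggml-cuda dispatch code: it models only the ne[1] <= 8 threshold quoted above, and the names pick_kernel and collapse_to_2d are hypothetical helpers introduced for this example.

```python
# Simplified model of the dispatch described in the commit message.
# Assumption: only the "ne[1] <= 8 -> MMVQ" rule quoted in the text is
# modeled; the real ggml-cuda selection logic has further conditions.
MMVQ_MAX_BATCH = 8


def pick_kernel(src1_ne):
    """src1_ne is the activation shape in ggml order (ne[0], ne[1], ...)."""
    return "MMVQ" if src1_ne[1] <= MMVQ_MAX_BATCH else "MMQ"


def collapse_to_2d(ne):
    """[value_dim, n_seq_tokens, n_seqs] -> [value_dim, n_seq_tokens * n_seqs]."""
    value_dim, n_seq_tokens, n_seqs = ne
    return (value_dim, n_seq_tokens * n_seqs)


before = (6144, 1, 128)          # 3D view: the 128 sequences sit in ne[2]
after = collapse_to_2d(before)   # 2D view of the same contiguous data

print(before, "->", pick_kernel(before))
print(after, "->", pick_kernel(after))
```

The element count is unchanged by the collapse (6144 * 1 * 128 = 6144 * 128), which is why the commit can describe the change as a different view of the same data rather than a copy.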
llama.cpp patch series — paged attention (vLLM-parity engine)
A stacking series: each patch is a small, self-contained, independently-buildable step toward an
in-model paged-attention engine. They apply in numeric order on top of the pinned LLAMA_VERSION
(backend/cpp/llama-cpp/Makefile). The build applies them automatically after checkout (see the
llama.cpp: target). Keeping the work as ordered patches — rather than one big diff — is what lets us
rebase cleanly across llama.cpp bumps and avoid drift: when a patch stops applying, only that small
patch needs fixing, and the failure points at exactly which step the upstream change touched.
Base
LLAMA_VERSION pin in ../Makefile. All patches are generated against that exact commit. Bumping the pin = re-run the regen workflow below and fix only the patches that no longer apply.
The series (phases → patches)
| # | Patch | What | Verifies |
|---|---|---|---|
| 0001 | 0001-vendor-paged-kv-manager.patch | Add src/paged-kv-manager.{h,cpp} (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately under ../paged/ |
| 0002 | 0002-paged-kv-storage.patch | Shared block-pool KV tensor + set_rows-by-slot writes, behind LLAMA_KV_PAGED | builds; write/gather round-trip |
| 0003 | 0003-paged-gather-read.patch | build_attn_paged gather-read in llama-graph.cpp | Gate 0: token-identical greedy gen, single + multi-seq |
| 0004 | 0004-paged-ondemand-alloc.patch | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
| 0005 | 0005-paged-continuous-batching.patch | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
| 0006 | 0006-paged-prefix-caching.patch | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
Each row is a separate git commit on the dev branch (below), exported 1:1 as a patch. The paged path stays default off
(behind LLAMA_KV_PAGED) until Gate 0 (0003) is green, so a partial series never changes stock behavior.
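
To make the ideas behind rows 0004 (on-demand block allocation) and 0006 (block-hash prefix dedup) concrete, here is a toy sketch. Everything in it is hypothetical: ToyBlockManager, block_hashes and shared_blocks are names invented for illustration and are not the PagedKVManager API, the SHA-256 chaining is an assumed hashing scheme, and the block size of 16 is taken from the 16-cell blocks mentioned in the status notes.

```python
import hashlib

BLOCK_SIZE = 16  # cells per block (assumption, from the 16-cell blocks in Status)


class ToyBlockManager:
    """Hypothetical stand-in for a paged KV block manager."""

    def __init__(self, n_blocks):
        self.free = list(range(n_blocks))  # physical block ids not in use
        self.tables = {}                   # seq_id -> list of physical block ids
        self.lengths = {}                  # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a cell for one more token; allocate a block only when needed."""
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n == len(table) * BLOCK_SIZE:   # all blocks owned by this seq are full
            if not self.free:
                raise MemoryError("block pool exhausted")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1
        block = table[n // BLOCK_SIZE]
        return block * BLOCK_SIZE + n % BLOCK_SIZE  # physical cell index


def block_hashes(tokens):
    """One chained hash per full block; each hash covers the whole prefix so far."""
    hashes, prev = [], b""
    n_full = len(tokens) - len(tokens) % BLOCK_SIZE
    for i in range(0, n_full, BLOCK_SIZE):
        chunk = ",".join(map(str, tokens[i:i + BLOCK_SIZE])).encode()
        prev = hashlib.sha256(prev + chunk).digest()
        hashes.append(prev)
    return hashes


def shared_blocks(a, b):
    """Number of leading full blocks two token sequences could share."""
    n = 0
    for ha, hb in zip(block_hashes(a), block_hashes(b)):
        if ha != hb:
            break
        n += 1
    return n
```

In this sketch a sequence holds only as many blocks as its length requires, so the number of concurrent sequences is bounded by actual usage rather than by a fixed per-sequence window. Chaining each block hash over the previous one means two requests can share a block only if their entire prefix up to that block matches.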
Regen workflow (the anti-drift recipe)
# 1. check out the exact pin into a dev tree
git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
git checkout <LLAMA_VERSION from ../Makefile>
git checkout -b paged
# 2. apply the current series (each becomes a commit), or develop the next patch
git am /path/to/backend/cpp/llama-cpp/patches/00*.patch # or `git apply` + commit per patch
# 3. iterate a phase as ONE commit, then export the whole series 1:1
git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp/patches/ --zero-commit -N
# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
Build integration
../Makefile's llama.cpp: target runs, after git checkout -b build $(LLAMA_VERSION):
for p in $(CURRENT_MAKEFILE_DIR)/patches/0*.patch; do git apply --verbose "$p"; done
All variants (avx/avx2/avx512/cuda/…) copy the patched llama.cpp/ tree, so the series ships everywhere.
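
As an illustration of the "failure points at exactly which step" property, the following sketch applies a series in numeric order and reports the first patch that does not apply. It is not part of the build: ordered, git_apply and first_failing are hypothetical helpers, and the only git invocation assumed is the plain git apply already used by the Makefile loop above.

```python
import re
import subprocess
from pathlib import Path


def ordered(patch_names):
    """Sort patches by their numeric NNNN- prefix."""
    def prefix(name):
        return int(re.match(r"(\d+)-", Path(name).name).group(1))
    return sorted(patch_names, key=prefix)


def git_apply(patch, tree):
    """Apply one patch inside `tree`; True on success.

    Pass an absolute patch path, since `git -C` changes directory first.
    """
    result = subprocess.run(["git", "-C", tree, "apply", patch])
    return result.returncode == 0


def first_failing(patches, apply):
    """Apply patches in order; return the first one that fails, else None.

    `apply` is any callable(patch) -> bool, e.g.
    lambda p: git_apply(p, "/tmp/llama-dev").
    """
    for patch in ordered(patches):
        if not apply(patch):
            return patch
    return None
```

Because the series stacks, each patch is applied before the next one is tried; checking every patch against the unpatched tree would report false failures for patches that depend on earlier ones.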
Status
- 0001 vendor manager: DONE. Applies clean to the pin; builds into libllama.
- 0002 block placement: DONE + VERIFIED. Built llama-simple at the pin; greedy generation is token-identical stock vs LLAMA_KV_PAGED=1 (Qwen3-0.6B), paged branch confirmed firing.
- 0003 gather-read: DONE + VERIFIED (Gate 0 green). Implemented in the additive form (ADDITIVE_DESIGN.md): all logic in new src/paged-attn.{h,cpp} (a llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line in build_attn + two thin accessors on llama_kv_cache_context + 1 CMake line (216 insertions; no edit to llm_graph_input_attn_kv or llama-graph.h). Greedy generation is token-identical stock vs LLAMA_KV_PAGED=1 (Qwen3-0.6B, 9/9 across 3 prompts × {32,96,128} tokens), with n_gather=71 < n_kv=256 confirming real compaction. Patch: 0003-paged-gather-read-env-LLAMA_KV_PAGED.patch.
  - Key correctness finding: get_gather_idxs must emit cells sorted by token position. The CPU flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's scattered placement alone (full-window read, no gather) diverges from stock once a sequence crosses the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit-identical, not merely mathematically equivalent. So 0002 is the placement substrate; 0003 is what makes paged placement token-identical under flash-attn.
- 0004–0006 follow.
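
The key correctness finding for 0003 rests on floating-point addition not being associative. The toy example below shows the effect with a plain running sum standing in for the order-sensitive reduction; it is not the flash-attn online softmax, and the cell-to-position mapping and values are made up purely to make the rounding visible.

```python
# Hypothetical scattered placement: physical cell index -> token position.
cell_pos = {1: 0, 3: 2, 6: 1, 8: 3}

# Per-token contributions indexed by token position (illustrative values).
contrib = [1e16, 1.0, -1e16, 1.0]


def reduce_cells(cells):
    """Accumulate contributions in the order the cells are visited."""
    acc = 0.0
    for c in cells:
        acc += contrib[cell_pos[c]]
    return acc


def reduce_positions():
    """Stock behaviour in this model: visit tokens in position order."""
    acc = 0.0
    for v in contrib:
        acc += v
    return acc


stock = reduce_positions()

# Scattered placement read in physical-array order (no gather).
physical = reduce_cells(sorted(cell_pos))

# Gather indices sorted by token position reproduce the stock order.
gather_idxs = sorted(cell_pos, key=cell_pos.get)
gathered = reduce_cells(gather_idxs)

print("stock:", stock, "physical order:", physical, "sorted gather:", gathered)
```

An explicit loop is used instead of Python's built-in sum because newer Python versions apply compensated summation to floats, which would hide the order dependence this sketch is meant to show.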
Honest parity note (important)
This series delivers the paged-attention engine (capacity + scheduling + prefix sharing). It does not
by itself reach vLLM throughput parity, because the measured prefill bottleneck is the FP4 MoE GEMM kernel
(Lever 3: mul_mat_q<MXFP4> ~22 TFLOP/s, ~27× behind vLLM) — a per-token compute gap that paging does not
touch. Paged attention closes the concurrency/memory gap (more sequences, prefix reuse); the prefill/throughput
gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
../paged/UPSTREAM_GGML_ISSUE.md and DGX_BLACKWELL_PLAN.md). So full vLLM parity = this series AND the
kernel; neither alone suffices.