This is the no-regret, bit-exact conv-state cleanup from the GDN recurrence
byte-gate design (point 3). After the recurrence verdict (NO-BUILD: the
gated-DeltaNet recurrence is already single-pass at the f32 byte floor), the
decode conv path was the only remaining bit-exact lever.
New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated
by a non-null src[3]). On the single-token decode path it replaces the four-op
conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu
+ ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per
(channel, sequence), assembles the width-K window in registers from the K-1 cached
taps plus the current qkv_mixed token, computes the depthwise conv with the SAME
ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv
output, and writes the 1-token-shifted ring state back IN PLACE into the conv
cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018
in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and
write target (the cache view) are disjoint buffers, so it is race-free by
construction with no ids/identity logic.
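The per-(channel, sequence) work described above can be sketched as a scalar CPU reference. This is a minimal illustration, not the ggml kernel: the function name `conv_update_ref`, its parameters, and the use of `std::vector` are assumptions made for the example, and `std::fma` stands in for the ascending-tap FMA accumulation. It makes no claim of matching the real kernel's bytes.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference for ONE (channel, sequence) pair.
//   taps      : the K-1 cached inputs, oldest first (read source)
//   kernel    : the K depthwise conv weights for this channel
//   x_cur     : the current token's value for this channel
//   state_out : receives the 1-token-shifted state (write target)
// Returns the conv output, optionally passed through silu.
float conv_update_ref(const std::vector<float>& taps,
                      const std::vector<float>& kernel,
                      float x_cur,
                      bool apply_silu,
                      std::vector<float>& state_out) {
    const std::size_t K = kernel.size();
    assert(K >= 1 && taps.size() == K - 1);

    // Assemble the width-K window: K-1 cached taps, then the new token.
    std::vector<float> window(taps);
    window.push_back(x_cur);

    // Depthwise conv, accumulated in ascending tap order with explicit FMAs.
    float acc = 0.0f;
    for (std::size_t k = 0; k < K; ++k) {
        acc = std::fma(window[k], kernel[k], acc);
    }

    // Fold silu: x * sigmoid(x).
    if (apply_silu) {
        acc = acc / (1.0f + std::exp(-acc));
    }

    // Shifted ring state: drop the oldest tap, keep the rest plus x_cur.
    state_out.assign(window.begin() + 1, window.end());
    return acc;
}
```

With taps `{1, 2, 3}`, an all-ones kernel of width 4, and `x_cur = 4`, the sketch returns 10 without silu and leaves `{2, 3, 4}` as the next state. Because the window is copied before the state is written, the read and write sides never alias. The patch gets the same property by keeping the tap gather and the cache view in disjoint buffers.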
- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel,
src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring;
op_params[0]=fuse_silu)
- ggml-cuda/ssm-conv.cu: ssm_conv_update_f32<apply_silu,d_conv> kernel +
ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv
- ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels)
+ branch in ggml_compute_forward_ssm_conv
- delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs
conv-tap gather; fuses conv+silu+shifted write-back)
- qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path
(n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep
the original chain
- tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference
test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90.
Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1
(0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5
ac163882... both BYTE-IDENTICAL.
decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session:
  dense q36-27b-nvfp4 : npl  32  199.76 -> 202.99 (+1.6%)
                        npl 128  336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391)
  MoE   q36-35b-a3b   : npl  32  421.72 -> 432.39 (+2.5%)
                        npl 128  689.74 -> 713.54 (+3.5%)
Lift holds in eager too (dense npl128 333.62 -> 342.97). Step -11.9 ms/step
(dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the
decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152);
conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state
conv-cache plumbing.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
llama.cpp patch series — paged attention (vLLM-parity engine)
A stacking series: each patch is a small, self-contained, independently-buildable step toward an
in-model paged-attention engine. They apply in numeric order on top of the pinned LLAMA_VERSION
(backend/cpp/llama-cpp/Makefile). The build applies them automatically after checkout (see the
llama.cpp: target). Keeping the work as ordered patches — rather than one big diff — is what lets us
rebase cleanly across llama.cpp bumps and avoid drift: when a patch stops applying, only that small
patch needs fixing, and the failure points at exactly which step the upstream change touched.
Base
LLAMA_VERSION pin in ../Makefile. All patches are generated against that exact commit. Bumping the pin = re-run the regen workflow below and fix only the patches that no longer apply.
The series (phases → patches)
| # | Patch | What | Verifies |
|---|---|---|---|
| 0001 | 0001-vendor-paged-kv-manager.patch | Add src/paged-kv-manager.{h,cpp} (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately under ../paged/ |
| 0002 | 0002-paged-kv-storage.patch | Shared block-pool KV tensor + set_rows-by-slot writes, behind LLAMA_KV_PAGED | builds; write/gather round-trip |
| 0003 | 0003-paged-gather-read.patch | build_attn_paged gather-read in llama-graph.cpp | Gate 0: token-identical greedy gen, single + multi-seq |
| 0004 | 0004-paged-ondemand-alloc.patch | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
| 0005 | 0005-paged-continuous-batching.patch | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
| 0006 | 0006-paged-prefix-caching.patch | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
Each row is a separate git commit on the dev branch (below), exported 1:1 as a patch. The paged path
stays default-off (gated by LLAMA_KV_PAGED) until Gate 0 (0003) is green, so a partial series never
changes stock behavior.
Regen workflow (the anti-drift recipe)
# 1. check out the exact pin into a dev tree
git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
git checkout <LLAMA_VERSION from ../Makefile>
git checkout -b paged
# 2. apply the current series (each becomes a commit), or develop the next patch
git am /path/to/backend/cpp/llama-cpp/patches/00*.patch # or `git apply` + commit per patch
# 3. iterate a phase as ONE commit, then export the whole series 1:1
git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp/patches/ --zero-commit -N
# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
Build integration
../Makefile's llama.cpp: target runs, after git checkout -b build $(LLAMA_VERSION):
for p in $(CURRENT_MAKEFILE_DIR)/patches/0*.patch; do git apply --verbose "$p"; done
All variants (avx/avx2/avx512/cuda/…) copy the patched llama.cpp/ tree, so the series ships everywhere.
Status
- 0001 vendor manager: DONE. Applies clean to the pin; builds into libllama.
- 0002 block placement: DONE + VERIFIED. Built llama-simple at the pin; greedy generation is token-identical stock vs LLAMA_KV_PAGED=1 (Qwen3-0.6B), paged branch confirmed firing.
- 0003 gather-read: DONE + VERIFIED (Gate 0 green). Implemented in the additive form (ADDITIVE_DESIGN.md): all logic in new src/paged-attn.{h,cpp} (a llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line in build_attn + two thin accessors on llama_kv_cache_context + 1 CMake line (216 insertions; no edit to llm_graph_input_attn_kv or llama-graph.h). Greedy generation is token-identical stock vs LLAMA_KV_PAGED=1 (Qwen3-0.6B, 9/9 across 3 prompts × {32,96,128} tokens), with n_gather=71 < n_kv=256 confirming real compaction. Patch: 0003-paged-gather-read-env-LLAMA_KV_PAGED.patch.
  - Key correctness finding: get_gather_idxs must emit cells sorted by token position. The CPU flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's scattered placement alone (full-window read, no gather) diverges from stock once a sequence crosses the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit-identical, not merely mathematically equivalent. So 0002 is the placement substrate; 0003 is what makes paged placement token-identical under flash-attn.
- 0004–0006 follow.
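
The correctness finding for 0003 can be illustrated with a small sketch. The names `gather_idxs_sorted`, `reduce_in_order`, and `cell_pos` are hypothetical and chosen for this example; the real `get_gather_idxs` signature is not reproduced here. A plain sequential float sum stands in for the order-sensitive online-softmax reduction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: return the physical indices of occupied cells,
// ordered by the token position each cell holds.
// cell_pos[i] is the token position stored in physical cell i, or -1 if empty.
std::vector<int32_t> gather_idxs_sorted(const std::vector<int32_t>& cell_pos) {
    std::vector<int32_t> idxs;
    for (int32_t i = 0; i < static_cast<int32_t>(cell_pos.size()); ++i) {
        if (cell_pos[i] >= 0) {
            idxs.push_back(i);
        }
    }
    std::stable_sort(idxs.begin(), idxs.end(),
                     [&](int32_t a, int32_t b) { return cell_pos[a] < cell_pos[b]; });
    return idxs;
}

// Sequential float reduction over cells in the given order. This is a
// stand-in for any reduction whose result depends on operand order.
float reduce_in_order(const std::vector<float>& cell_val,
                      const std::vector<int32_t>& order) {
    float acc = 0.0f;
    for (int32_t i : order) {
        acc += cell_val[i];
    }
    return acc;
}
```

As a usage example, take `cell_pos = {2, -1, 0, 1}` with values `{-1e8f, 0.0f, 1e8f, 1.0f}`. The sorted gather yields indices `{2, 3, 0}` and reduces to `0.0f`, while walking the occupied cells in physical order `{0, 2, 3}` reduces to `1.0f`. The two sums are mathematically equal but differ in float arithmetic, which is why position order is needed for bit-identical output.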
Honest parity note (important)
This series delivers the paged-attention engine (capacity + scheduling + prefix sharing). It does not
by itself reach vLLM throughput parity, because the measured prefill bottleneck is the FP4 MoE GEMM kernel
(Lever 3: mul_mat_q<MXFP4> ~22 TFLOP/s, ~27× behind vLLM) — a per-token compute gap that paging does not
touch. Paged attention closes the concurrency/memory gap (more sequences, prefix reuse); the prefill/throughput
gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
../paged/UPSTREAM_GGML_ISSUE.md and DGX_BLACKWELL_PLAN.md). So full vLLM parity = this series AND the
kernel; neither alone suffices.