mirror of https://github.com/mudler/LocalAI.git synced 2026-06-23 16:19:07 -04:00

Files

Ettore Di Giacinto e983919516 feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011)

Increment 3 attention lever. In the paged in-kernel decode dispatch, route the
common grouped-query F16 case to the tile kernel and keep the inc-1 vec kernel
for everything else. Tile groups the q-heads that share a kv-head (ncols2) so
each K/V row is loaded once per group instead of once per q-head, and runs at
higher occupancy (108-128 regs vs vec 168 -> 25%). On GB10 (Qwen3-32B NVFP4,
F16 cache, gqa 8, batch 32, 1024 ctx, same build, env-toggled) this cuts the
decode step from 186.3 to 177.9 ms/step (-4.5%), within 1.8% of stock (174.8).
The win grows with context (tile vs vec decode step, npl=8): 1024 -2.3%, 4096
-3.3%, 8192 -4.1%, 16384 -6.1%, as attention takes a larger share of the step.

Routing guard: tile has no K/V type template (loads half2), so a non-F16 cache
would be converted to a contiguous F16 copy by launch_fattn, breaking the
in-kernel block-table read. So tile is correct only for an F16 cache, and the
grouping only helps at gqa>=2. tile is used only for {F16 K and V, gqa_ratio>=2};
everything else falls back to the inc-1 vec path, exactly as before this change.
LLAMA_KV_PAGED_VEC=1 forces vec for A/B. The inc-2 phys(j) tile read (patch 0010)
was already plumbed; this only adds the default route. (Paged decode currently
needs an F16 cache; quantized + paged is a pre-existing limitation unaffected by
this change: stock+q8_0 works, paged+q8_0 aborts both before and after.)

Split-K was ruled out: the vec decode grid is already block-saturated (~43 waves
over 144 resident on 48 SM), so more parallel_blocks adds no SM fill; the
under-saturation is intra-SM occupancy + 8x KV re-streaming, which GQA grouping
attacks directly.

Validated (greedy): CPU plumbing gate (0.6B, build-cpu, paged-on vs off)
byte-identical; GPU 0.6B gqa=2 tile token-coherent with the inc-1 vec path
(7/8 sequences identical, 8th in the same kernel-noise band where vec also
drifts from stock); 32B gqa=8 tile tracks stock at least as well as vec. Stock
(no block table) is byte-identical: the dispatch guard only diverts on src[5].
Full rationale and numbers in the patch header.

Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Assisted-by: Claude:opus-4.8 [Claude Code]

2026-06-22 22:38:28 +00:00

kernel

kernel(fp4-grouped-moe): scaffold the FP4 grouped-GEMM MoE dispatch (Lever 3)

2026-06-19 23:44:31 +00:00

paged

feat(paged): route GQA-grouped tile kernel by default for paged decode (patch 0011)

2026-06-22 22:38:28 +00:00

BENCHMARKS.md

bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the headline

2026-06-20 06:59:50 +00:00

README.md

feat(paged): patch 0003 gather-read - Gate 0 green, token-identical, additive

2026-06-22 08:26:46 +00:00

README.md

llama.cpp patch series — paged attention (vLLM-parity engine)

A stacking series: each patch is a small, self-contained, independently-buildable step toward an in-model paged-attention engine. They apply in numeric order on top of the pinned LLAMA_VERSION (backend/cpp/llama-cpp/Makefile). The build applies them automatically after checkout (see the llama.cpp: target). Keeping the work as ordered patches — rather than one big diff — is what lets us rebase cleanly across llama.cpp bumps and avoid drift: when a patch stops applying, only that small patch needs fixing, and the failure points at exactly which step the upstream change touched.

Base

LLAMA_VERSION pin in ../Makefile. All patches are generated against that exact commit. Bumping the pin = re-run the regen workflow below and fix only the patches that no longer apply.

The series (phases → patches)

#	Patch	What	Verifies
0001	`0001-vendor-paged-kv-manager.patch`	Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change	builds; unit-tested separately under `../paged/`
0002	`0002-paged-kv-storage.patch`	Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED`	builds; write/gather round-trip
0003	`0003-paged-gather-read.patch`	`build_attn_paged` gather-read in `llama-graph.cpp`	Gate 0: token-identical greedy gen, single + multi-seq
0004	`0004-paged-ondemand-alloc.patch`	On-demand block allocation via PagedKVManager	max concurrent seqs before OOM
0005	`0005-paged-continuous-batching.patch`	Block-granular admit/evict in the server slot path	tok/s vs concurrency, mixed-length
0006	`0006-paged-prefix-caching.patch`	Block-hash cross-request prefix dedup	TTFT + memory on shared prefixes

Each row is a separate git commit on the dev branch (below), exported 1:1 as a patch. Default off (LLAMA_KV_PAGED) until Gate 0 (0003) is green, so partial series never changes stock behavior.

Regen workflow (the anti-drift recipe)

# 1. check out the exact pin into a dev tree
git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
git checkout <LLAMA_VERSION from ../Makefile>
git checkout -b paged

# 2. apply the current series (each becomes a commit), or develop the next patch
git am /path/to/backend/cpp/llama-cpp/patches/00*.patch     # or `git apply` + commit per patch

# 3. iterate a phase as ONE commit, then export the whole series 1:1
git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp/patches/ --zero-commit -N

# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.

Build integration

../Makefile's llama.cpp: target runs, after git checkout -b build $(LLAMA_VERSION):

for p in $(CURRENT_MAKEFILE_DIR)/patches/0*.patch; do git apply --verbose "$p"; done

All variants (avx/avx2/avx512/cuda/…) copy the patched llama.cpp/ tree, so the series ships everywhere.

Status

0001 vendor manager — DONE. Applies clean to the pin; builds into libllama.
0002 block placement — DONE + VERIFIED. Built llama-simple at the pin; greedy generation is token-identical stock vs LLAMA_KV_PAGED=1 (Qwen3-0.6B), paged branch confirmed firing.
0003 gather-read — DONE + VERIFIED (Gate 0 green). Implemented in the additive form (ADDITIVE_DESIGN.md): all logic in new src/paged-attn.{h,cpp} (a llm_graph_input_i gather-index subclass + the K/V/mask gather), hooked by one line in build_attn + two thin accessors on llama_kv_cache_context + 1 CMake line (216 insertions; no edit to llm_graph_input_attn_kv or llama-graph.h). Greedy generation is token-identical stock vs LLAMA_KV_PAGED=1 (Qwen3-0.6B, 9/9 across 3 prompts × {32,96,128} tokens), with n_gather=71 < n_kv=256 confirming real compaction. Patch: 0003-paged-gather-read-env-LLAMA_KV_PAGED.patch.
- Key correctness finding: get_gather_idxs must emit cells sorted by token position. The CPU flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's scattered placement alone (full-window read, no gather) diverges from stock once a sequence crosses the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit- identical, not merely mathematically equivalent. So 0002 is the placement substrate; 0003 is what makes paged placement token-identical under flash-attn.
0004–0006 follow.

Honest parity note (important)

This series delivers the paged-attention engine (capacity + scheduling + prefix sharing). It does not by itself reach vLLM throughput parity, because the measured prefill bottleneck is the FP4 MoE GEMM kernel (Lever 3: mul_mat_q<MXFP4> ~22 TFLOP/s, ~27× behind vLLM) — a per-token compute gap that paging does not touch. Paged attention closes the concurrency/memory gap (more sequences, prefix reuse); the prefill/throughput gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see ../paged/UPSTREAM_GGML_ISSUE.md and DGX_BLACKWELL_PLAN.md). So full vLLM parity = this series AND the kernel; neither alone suffices.

README.md Unescape Escape

llama.cpp patch series — paged attention (vLLM-parity engine)

Base

The series (phases → patches)

Regen workflow (the anti-drift recipe)

Build integration

Status

Honest parity note (important)

README.md