chore(paged): keep patches/ patch-only; README to backend root, docs to docs/

The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.

Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
  final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)

Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)

Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.

The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.

Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
Ettore Di Giacinto
2026-06-27 13:20:05 +00:00
parent db14006fcd
commit 08b754f910
21 changed files with 41 additions and 235 deletions


@@ -0,0 +1,514 @@
# Plan: ship the paged llama.cpp as its OWN backend + NVFP4 Qwen3.6 gallery items
Scoping deliverable only. NOTHING is changed by this document. It is grounded in the
actual repo structure (read 2026-06-26 in worktree feat+paged-attention), not assumptions.
SHIPPED REALITY (update 2026-06-27): the backend ships CUDA-only. The matrix rows and
the index.yaml meta-backend keep ONLY the CUDA/cublas variants (cuda-12, cuda-13, and
the nvidia-l4t arm64 cuda-12/cuda-13 Jetson rows). The cpu / vulkan / sycl / hipblas /
metal-darwin variants discussed below as optional/phase-2 were NOT shipped (and the
darwin row was removed): off-CUDA the patchset's wins gate off, so it is neutral-to-
negative there and non-CUDA users should use the stock llama-cpp backend (README 4c).
================================================================================
0. GROUND TRUTH (what the repo actually does today)
================================================================================
The paged patchset is ALREADY integrated into the stock llama-cpp backend in this
worktree. Two mechanisms, both already present:
(a) BUILD: backend/cpp/llama-cpp/Makefile has `LLAMA_PAGED?=on`. The `llama.cpp:`
target git-applies patches/0*.patch (base series) then, when LLAMA_PAGED != off,
patches/paged/0*.patch (the 0018-0023 paged series + the earlier 0001-0017).
prepare.sh has a fallback `patch`-based apply guarded by a sentinel
(llama.cpp/src/paged-kv-manager.cpp). So a stock `make backends/llama-cpp` TODAY
already ships the paged engine compiled in.
(b) RUNTIME GATING: backend/cpp/llama-cpp/grpc-server.cpp ALREADY carries the option
hooks (lines ~752-842). They only call setenv() before context init:
- option `kv_paged` / `paged_kv` / `paged_attention` -> setenv LLAMA_KV_PAGED=1
- option `kv_paged_debug` / `paged_kv_debug` -> setenv LLAMA_KV_PAGED_DEBUG=1
- option `max_prefill_tokens` / `mpt` / `prefill_budget` -> setenv LLAMA_PREFILL_BUDGET
- option `max_batch_tokens` / `mbt` -> setenv LLAMA_MAX_BATCH_TOKENS
- option `prefill_cap` -> setenv LLAMA_PREFILL_CAP
Against UNPATCHED llama.cpp these setenv() calls are inert (nothing reads the env),
so grpc-server.cpp is byte-safe to share between a clean build and a paged build.
The paged engine itself lives entirely inside the patched llama.cpp lib
(paged-kv-manager.cpp etc.), NOT in grpc-server.cpp.
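A minimal sketch of the option-to-env translation described in (b), for illustration only. The real hooks are C++ in grpc-server.cpp; the alias and env names below are the ones listed in this plan, while the helper names (`parseOption`, `optionsToEnv`) and the rule for which values count as "true" are assumptions.

```javascript
// Illustrative model of the option hooks. Aliases and env names come from the
// list above; how boolean values are parsed is an ASSUMPTION of this sketch.
const OPTION_ENV = [
  { aliases: ["kv_paged", "paged_kv", "paged_attention"], env: "LLAMA_KV_PAGED", flag: true },
  { aliases: ["kv_paged_debug", "paged_kv_debug"], env: "LLAMA_KV_PAGED_DEBUG", flag: true },
  { aliases: ["max_prefill_tokens", "mpt", "prefill_budget"], env: "LLAMA_PREFILL_BUDGET", flag: false },
  { aliases: ["max_batch_tokens", "mbt"], env: "LLAMA_MAX_BATCH_TOKENS", flag: false },
  { aliases: ["prefill_cap"], env: "LLAMA_PREFILL_CAP", flag: false },
];

// Split a "key:value" option string on the first colon only.
function parseOption(opt) {
  const i = opt.indexOf(":");
  return i < 0 ? [opt, ""] : [opt.slice(0, i), opt.slice(i + 1)];
}

// Return the env assignments a list of options would produce.
function optionsToEnv(options) {
  const env = {};
  for (const opt of options) {
    const [key, value] = parseOption(opt);
    const hook = OPTION_ENV.find((h) => h.aliases.includes(key));
    if (!hook) continue; // options without a paged hook are ignored here
    if (hook.flag) {
      // Assumption: only a truthy value turns the flag on.
      if (["true", "1", "on", "yes"].includes(value.toLowerCase())) env[hook.env] = "1";
    } else {
      env[hook.env] = value;
    }
  }
  return env;
}

console.log(optionsToEnv(["paged_kv:true", "mbt:512", "parallel:128"]));
```

Against an unpatched llama.cpp the resulting env entries would simply go unread, which is the "inert" property the paragraph above relies on.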
Conclusion: "stock llama-cpp + paged patchset, runtime-gated" is the CURRENT state of
ONE backend. The task is to SPLIT that into two backends:
- llama-cpp = clean upstream llama.cpp (de-risked: a dep-bump can never break on a
paged hook), grpc-server.cpp keeps the dormant hooks.
- <newname> = stock grpc-server.cpp + paged patch series applied + paged on.
The turboquant backend is the EXACT precedent for "a llama.cpp variant that reuses the
backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile + its own Dockerfile
+ its own matrix rows". Copy turboquant's shape, with two simplifications (see section 1).
CPU_ALL_VARIANTS reuse: backend/cpp/llama-cpp/Makefile already has `llama-cpp-cpu-all`
(one grpc-server + dlopen libggml-cpu-*.so via -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS,
SHARED_LIBS=ON make-var). turboquant mirrors it with `turboquant-cpu-all`. The new backend
gets the same single-build CPU target for free by reusing the same Makefile machinery.
--------------------------------------------------------------------------------
RECOMMENDED BACKEND NAME: `llama-cpp-paged` (see section 4 for the full rationale)
--------------------------------------------------------------------------------
Everywhere below, NAME = llama-cpp-paged, DOCKERFILE = Dockerfile.llama-cpp-paged,
SRC DIR = backend/cpp/llama-cpp-paged/, MAKE VAR = BACKEND_LLAMA_CPP_PAGED.
DO NOT use the dotted working name `localai-llama.cpp`: a dot in Dockerfile.<suffix> and
in the tag-suffix is unprecedented (every sibling is hyphenated: llama-cpp, ik-llama-cpp,
turboquant, ds4) and complicates the changed-backends.js endsWith() suffix matching.
================================================================================
1. NEW BACKEND - file by file
================================================================================
--------------------------------------------------------------------------------
1.1 backend/cpp/llama-cpp/Makefile (the ONE necessary touch to stock)
--------------------------------------------------------------------------------
Change exactly one default so the STOCK image ships clean against upstream:
-LLAMA_PAGED?=on
+LLAMA_PAGED?=off
Why: this is the entire point of the split - stock llama-cpp must build clean so an
upstream LLAMA_VERSION bump can never fail on a paged hook. The runtime hooks in
grpc-server.cpp stay (inert). The new backend forces LLAMA_PAGED=on explicitly (1.2), so
it does not depend on this default. NOTE this DOES change stock's shipped artifact (it
currently ships paged-compiled-in-but-gated); that is intended de-risking, call it out in
the PR. If the team prefers stock literally untouched, the alternative is to leave
`?=on` and accept that stock keeps carrying the patch series - but then "clean stock" is
not achieved. Recommendation: flip to off.
(No other change to backend/cpp/llama-cpp/ - grpc-server.cpp, CMakeLists.txt, prepare.sh,
patches/, patches/paged/ are all reused as-is by the new backend.)
--------------------------------------------------------------------------------
1.2 backend/cpp/llama-cpp-paged/Makefile (NEW - thin wrapper, model on turboquant)
--------------------------------------------------------------------------------
Mirror backend/cpp/turboquant/Makefile, but SIMPLER (two things turboquant needs that we
do NOT):
- turboquant overrides LLAMA_REPO/LLAMA_VERSION to a fork. We use the SAME upstream pin
as stock (it lives in backend/cpp/llama-cpp/Makefile, already auto-bumped). So we do
NOT set LLAMA_VERSION here -> no bump_deps.yaml entry needed (big simplification vs
turboquant). We only force LLAMA_PAGED=on.
- turboquant runs patch-grpc-server.sh (augments the KV-cache type allow-list) and
apply-patches.sh (fork catch-up). We need NEITHER: grpc-server.cpp already has the
paged hooks, and the paged patch series is applied by the copied llama-cpp Makefile's
own `llama.cpp:` target when LLAMA_PAGED=on.
Shape (one flavor shown; replicate the turboquant flavor set: avx/avx2/avx512/fallback/
cpu-all/grpc/rpc-server):
LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp
define paged-build # $(1)=flavor $(2)=cmake flags $(3)=target
rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build purge
# clone upstream + apply base AND paged patch series (LLAMA_PAGED=on forces it)
LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build llama.cpp
CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_PAGED=on \
$(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build grpc-server
cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build/grpc-server llama-cpp-paged-$(1)
endef
llama-cpp-paged-cpu-all:
# identical to turboquant-cpu-all: SHARED_LIBS=ON + GGML_BACKEND_DL + CPU_ALL_VARIANTS
# + --target ggml; then collect ggml-shared-libs/ for package.sh to bundle.
... LLAMA_PAGED=on SHARED_LIBS=ON \
EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" \
TARGET="--target grpc-server --target ggml" ...
package: ; bash package.sh
purge: ; rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-*-build; rm -rf llama-cpp-paged-* package
clean: purge
Binaries are named llama-cpp-paged-{cpu-all,fallback,grpc,rpc-server,...} so run.sh and
package.sh glob them.
--------------------------------------------------------------------------------
1.3 backend/cpp/llama-cpp-paged/run.sh (NEW - copy turboquant/run.sh, rename binaries)
--------------------------------------------------------------------------------
s/turboquant/llama-cpp-paged/g. Behavior stays the same: prefer llama-cpp-paged-cpu-all if
present, else fall back to llama-cpp-paged-fallback; use llama-cpp-paged-grpc when
LLAMACPP_GRPC_SERVERS is set; keep the Darwin DYLD_LIBRARY_PATH branch and the lib/ld.so
launch. Keep verbatim otherwise.
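The binary selection in run.sh can be modeled as below. This is a sketch, not the script itself (run.sh is shell); the function name `pickBinary` is hypothetical, and the assumption that the grpc binary takes precedence over the CPU choice when LLAMACPP_GRPC_SERVERS is set should be checked against turboquant/run.sh.

```javascript
// Illustrative model of the launcher choice described for run.sh.
// ASSUMPTION: the grpc branch is evaluated last and wins when it applies.
function pickBinary(present, env) {
  // present: Set of binary names that exist next to run.sh
  let binary = "llama-cpp-paged-fallback";
  if (present.has("llama-cpp-paged-cpu-all")) {
    binary = "llama-cpp-paged-cpu-all";
  }
  if (env.LLAMACPP_GRPC_SERVERS && present.has("llama-cpp-paged-grpc")) {
    binary = "llama-cpp-paged-grpc";
  }
  return binary;
}

console.log(pickBinary(new Set(["llama-cpp-paged-cpu-all", "llama-cpp-paged-fallback"]), {}));
```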
--------------------------------------------------------------------------------
1.4 backend/cpp/llama-cpp-paged/package.sh (NEW - copy turboquant/package.sh, rename)
--------------------------------------------------------------------------------
s/turboquant/llama-cpp-paged/g. Copies llama-cpp-paged-* into package/, bundles
ggml-shared-libs/*.so* into package/lib (the CPU_ALL_VARIANTS dlopen set), copies run.sh,
and the per-arch libc/ld.so set (unchanged).
--------------------------------------------------------------------------------
1.5 backend/Dockerfile.llama-cpp-paged (NEW - copy Dockerfile.turboquant, swap paths)
--------------------------------------------------------------------------------
Identical 3-stage structure (builder-fromsource / builder-prebuilt / FROM scratch). Edits:
- bind/run .docker/llama-cpp-paged-compile.sh (new, 1.6) instead of turboquant-compile.sh
- ccache id: id=llama-cpp-paged-ccache-${TARGETARCH}-${BUILD_TYPE}
(OPTIONAL OPTIMIZATION: set id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE} to SHARE
stock llama-cpp's ccache - the paged TUs are mostly byte-identical to stock, so a warm
stock cache would give the paged build near-free object reuse. Trade-off: a regression
in one could surface as a cold miss in the other. Recommend sharing; revisit if noisy.)
- both builder stages run `make -BC /LocalAI/backend/cpp/llama-cpp-paged package`
- final COPY --from=builder /LocalAI/backend/cpp/llama-cpp-paged/package/. ./
--------------------------------------------------------------------------------
1.6 .docker/llama-cpp-paged-compile.sh (NEW - copy llama-cpp-compile.sh, swap make targets)
--------------------------------------------------------------------------------
Identical to .docker/llama-cpp-compile.sh except `cd .../llama-cpp-paged` and call
`make llama-cpp-paged-cpu-all` (BUILD_TYPE empty / CPU) or `make llama-cpp-paged-fallback`
(GPU), then `make llama-cpp-paged-grpc` + `make llama-cpp-paged-rpc-server`. Keep the
arm64 gcc-14 apt step (CPU_ALL_VARIANTS armv9.2 SME needs gcc-14). ccache export unchanged.
--------------------------------------------------------------------------------
1.7 Makefile (top-level) - 6 edits, mirror the turboquant lines
--------------------------------------------------------------------------------
a) .NOTPARALLEL (line 2): append `backends/llama-cpp-paged`
b) Backend def (after BACKEND_TURBOQUANT, line ~1172):
# llama-cpp-paged = stock llama.cpp grpc-server + LocalAI paged-attention patch
# series (LLAMA_PAGED=on). Reuses backend/cpp/llama-cpp sources via a thin wrapper.
BACKEND_LLAMA_CPP_PAGED = llama-cpp-paged|llama-cpp-paged|.|false|false
(lang field `llama-cpp-paged` -> Dockerfile.llama-cpp-paged, matching the
llama-cpp / ik-llama-cpp / turboquant convention where lang==backend name.)
c) generate-docker-build-target eval (after BACKEND_TURBOQUANT, line ~1273):
$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_PAGED)))
d) docker-build-backends (line ~1337): append docker-build-llama-cpp-paged
e) test-extra-backend-llama-cpp-paged target (mirror test-extra-backend-turboquant,
line ~673): BACKEND_IMAGE=local-ai-backend:llama-cpp-paged $(MAKE) test-extra-backend
f) (optional) backends/llama-cpp-paged-darwin target if shipping metal (mirror
backends/llama-cpp-darwin at line 1124; see 1.11).
--------------------------------------------------------------------------------
1.8 .github/backend-matrix.yml - add rows (mirror every llama-cpp row, swap names)
--------------------------------------------------------------------------------
For EACH variant you choose to ship (see phased recommendation in section 4), add a row
copied from the corresponding llama-cpp row with:
- backend: "llama-cpp-paged"
- dockerfile: "./backend/Dockerfile.llama-cpp-paged"
- tag-suffix: swap `-llama-cpp` -> `-llama-cpp-paged`
(e.g. -cpu-llama-cpp -> -cpu-llama-cpp-paged;
-gpu-nvidia-cuda-12-llama-cpp -> -gpu-nvidia-cuda-12-llama-cpp-paged; etc.)
- builder-base-image: UNCHANGED - reuse the same base-grpc-* tags as llama-cpp
(this backend compiles the same gRPC + same toolchain; no new base-images.yml variant
is needed, so NO base-images bootstrap step). This is the cheap-variant payoff.
- CPU: TWO per-arch rows (amd64 ubuntu-latest + arm64 ubuntu-24.04-arm) sharing
tag-suffix '-cpu-llama-cpp-paged' so changed-backends.js emits a merge-matrix entry and
backend-merge-jobs assembles the manifest list. Same per-arch native + manifest-merge
pattern as -cpu-llama-cpp.
- Darwin (if shipping): add to includeDarwin:
- backend: "llama-cpp-paged"
tag-suffix: "-metal-darwin-arm64-llama-cpp-paged"
lang: "go"
(omit build-type, exactly like the llama-cpp darwin row at line 4908.)
REMINDER: the CI path filter only builds a backend on a PR when a file under its dir
changes. The PR that adds this backend touches backend/cpp/llama-cpp-paged/* so it self-
triggers. But also add the cross-trigger in 1.9 so future edits to backend/cpp/llama-cpp/
(the shared source) retrigger this backend too.
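The row swap above is mechanical, so it can be sketched as a small transform. Field names (`backend`, `dockerfile`, `tag-suffix`, `builder-base-image`) follow this section; the function name `toPagedRow` and the sample base-image value are hypothetical.

```javascript
// Derive a llama-cpp-paged matrix row from an existing llama-cpp row.
// Only backend, dockerfile and tag-suffix change; everything else is reused.
function toPagedRow(row) {
  if (row.backend !== "llama-cpp") {
    throw new Error(`not a llama-cpp row: ${row.backend}`);
  }
  return {
    ...row, // builder-base-image and the remaining keys stay unchanged
    backend: "llama-cpp-paged",
    dockerfile: "./backend/Dockerfile.llama-cpp-paged",
    "tag-suffix": `${row["tag-suffix"]}-paged`,
  };
}

const stockRow = {
  backend: "llama-cpp",
  dockerfile: "./backend/Dockerfile.llama-cpp",
  "tag-suffix": "-gpu-nvidia-cuda-12-llama-cpp",
  "builder-base-image": "base-grpc-example", // hypothetical value
};
console.log(toPagedRow(stockRow));
```

Guarding on `backend` rather than on the suffix keeps sibling rows such as ik-llama-cpp, whose tag-suffix also ends in `-llama-cpp`, out of the transform.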
--------------------------------------------------------------------------------
1.9 scripts/changed-backends.js - three edits (mirror turboquant exactly)
--------------------------------------------------------------------------------
a) inferBackendPath(): add BEFORE the generic `endsWith("llama-cpp")` branch (line 56),
next to the turboquant branch (line 45):
if (item.dockerfile.endsWith("llama-cpp-paged")) {
// reuses backend/cpp/llama-cpp sources via a thin wrapper Makefile
return `backend/cpp/llama-cpp-paged/`;
}
ORDER: "Dockerfile.llama-cpp-paged".endsWith("llama-cpp") is false, so the generic branch
cannot shadow this one today. Keep the specific branch first regardless (defensive: it
still returns the right path if the generic match is ever loosened).
b) inferBackendPathDarwin(): add a case (next to the llama-cpp one at line 66):
if (item.backend === "llama-cpp-paged") { return `backend/cpp/llama-cpp-paged/`; }
c) Per-backend cross-trigger (lines 274-278, mirror the turboquant block):
if (backend === "llama-cpp-paged" && !changed) {
changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
}
Verify: node -e "... e.dockerfile.endsWith('llama-cpp-paged') ..." per adding-backends.md.
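The suffix check and the cross-trigger from this section can be exercised in isolation. The sketch below reuses the names from the snippets above; `isChanged` is a hypothetical wrapper, the generic branch's return value is assumed, and the real script has more branches (ik-llama-cpp, turboquant, ...) that are omitted here.

```javascript
// Sketch of the two checks from this section, not the real changed-backends.js.
function inferBackendPath(item) {
  // Specific branch first, generic llama-cpp branch after it.
  if (item.dockerfile.endsWith("llama-cpp-paged")) {
    return "backend/cpp/llama-cpp-paged/";
  }
  if (item.dockerfile.endsWith("llama-cpp")) {
    return "backend/cpp/llama-cpp/"; // ASSUMED return value of the generic branch
  }
  return null; // other backends omitted in this sketch
}

function isChanged(backend, backendPath, changedFiles) {
  let changed = changedFiles.some((file) => file.startsWith(backendPath));
  // Cross-trigger: edits to the shared llama-cpp sources rebuild the paged backend.
  if (backend === "llama-cpp-paged" && !changed) {
    changed = changedFiles.some((file) => file.startsWith("backend/cpp/llama-cpp/"));
  }
  return changed;
}

console.log("Dockerfile.llama-cpp-paged".endsWith("llama-cpp"));
console.log(isChanged("llama-cpp-paged", "backend/cpp/llama-cpp-paged/",
  ["backend/cpp/llama-cpp/grpc-server.cpp"]));
```

The trailing slash in `backend/cpp/llama-cpp/` matters: without it, an edit under backend/cpp/llama-cpp-paged/ would also count as a change to stock llama-cpp.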
--------------------------------------------------------------------------------
1.10 backend/index.yaml - meta + image entries (META-BACKEND - capabilities map, NO uri)
--------------------------------------------------------------------------------
GOTCHA (project_backend_meta_gotcha): a backend that ships per-platform images MUST be a
meta backend = an anchor with a `capabilities:` map and NO top-level `uri:`; the concrete
per-platform entries carry the uri. Copy the *llamacpp anchor (lines 3-31).
Step a - meta anchor in `## metas` (after *turboquant, ~line 74):
- &llamacpppaged
name: "llama-cpp-paged"
alias: "llama-cpp-paged"
license: mit
icon: <same as llama-cpp>
description: |
LocalAI's paged-attention llama.cpp: on-demand paged KV cache + decode-first
prefill budget. Stock llama.cpp grpc-server + the LocalAI paged patch series.
Tuned for NVFP4 dense/MoE on Blackwell/GB10. Reuses the llama-cpp gRPC server.
urls: [ https://github.com/ggerganov/llama.cpp ]
tags: [ text-to-text, LLM, CPU, GPU, CUDA, Metal, paged-attention, nvfp4 ]
capabilities:
default: "cpu-llama-cpp-paged"
nvidia: "cuda12-llama-cpp-paged"
nvidia-cuda-12: "cuda12-llama-cpp-paged"
nvidia-cuda-13: "cuda13-llama-cpp-paged"
nvidia-l4t: "nvidia-l4t-arm64-llama-cpp-paged"
nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-paged"
nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-paged"
metal: "metal-llama-cpp-paged"
# add amd/intel/vulkan keys ONLY for variants you actually build (section 4)
Step b - a `-development` meta (mirror llama-cpp-development, line 1611) with the same
capabilities map pointing at the `*-development` image names.
Step c - concrete image entries at end of file (mirror the llama-cpp block lines
2106-2200), one latest + one development per variant, each as:
- !!merge <<: *llamacpppaged
name: "cpu-llama-cpp-paged"
uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-paged"
mirrors: [ localai/localai-backends:latest-cpu-llama-cpp-paged ]
- !!merge <<: *llamacpppaged
name: "cpu-llama-cpp-paged-development"
uri: "quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp-paged"
mirrors: [ localai/localai-backends:master-cpu-llama-cpp-paged ]
...repeat for cuda12 / cuda13 / l4t / metal etc.
The `latest-` / `master-` uri prefix + tag-suffix MUST match the matrix tag-suffix exactly.
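A small consistency check for that rule, as a sketch. It assumes the image tag is exactly `latest` or `master` followed by the matrix tag-suffix, as in the entries above; `uriMatchesSuffix` is a hypothetical helper, not something in the repo.

```javascript
// Check that an index.yaml image uri matches a matrix tag-suffix.
// ASSUMPTION: tag == "<latest|master>" + tag-suffix, with nothing in between.
function uriMatchesSuffix(uri, tagSuffix) {
  const tag = uri.slice(uri.lastIndexOf(":") + 1);
  return tag === `latest${tagSuffix}` || tag === `master${tagSuffix}`;
}

console.log(uriMatchesSuffix(
  "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-paged",
  "-cpu-llama-cpp-paged"));
```

Running a check like this over every concrete entry would catch a uri that still carries the stock `-cpu-llama-cpp` suffix after a copy-paste.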
--------------------------------------------------------------------------------
1.11 Darwin (only if shipping metal; the NVFP4 target is CUDA, so metal is optional/phase 2)
--------------------------------------------------------------------------------
If metal is shipped, also:
- scripts/build/llama-cpp-paged-darwin.sh (copy scripts/build/llama-cpp-darwin.sh; it
drives the 3 CMake variants + otool dylib bundling). Ensure it forces LLAMA_PAGED=on.
- Makefile `backends/llama-cpp-paged-darwin` target (mirror backends/llama-cpp-darwin).
- backend_build_darwin.yml: add the llama-cpp-paged branch (mirror the llama-cpp-specific
step that calls `make backends/llama-cpp-darwin`).
- index.yaml metal-llama-cpp-paged / -development image entries (already in 1.10).
- C++ proto gotcha already handled (reuses llama-cpp CMakeLists.txt with hw_grpc_proto
linking protobuf/grpc++), so no Homebrew-include failure.
--------------------------------------------------------------------------------
1.12 Importer / /backends/known dropdown (drop-in, NOT a new importer)
--------------------------------------------------------------------------------
This backend consumes GGUF exactly like llama-cpp -> extend the EXISTING importer, do not
add a new one (per adding-backends.md rule 2). Edit core/gallery/importers/llama-cpp.go:
- AdditionalBackends() (line 37): append
{Name: "llama-cpp-paged", Modality: "text",
Description: "Paged-attention llama.cpp (on-demand paged KV + decode-first budget)"}
- Import() backend allow-list (line 133): add "llama-cpp-paged" to the switch case so a
preferences.backend == "llama-cpp-paged" is honored:
case "ik-llama-cpp", "turboquant", "llama-cpp-paged": backend = b
- core/gallery/importers/importers_test.go: add a table case asserting the preference
override emits backend: llama-cpp-paged (Ginkgo/Gomega; reuse an existing public GGUF
HF fixture). Run `go test ./core/gallery/importers/...`.
--------------------------------------------------------------------------------
1.13 Docs
--------------------------------------------------------------------------------
- docs/content/features/backends.md: add llama-cpp-paged to the text-to-text/LLM list,
one line noting paged KV + NVFP4 Blackwell tuning. (Not an in-house from-scratch engine
-> it is a llama.cpp variant -> do NOT add to the README maintained-engines table.)
--------------------------------------------------------------------------------
1.14 Does grpc-server.cpp need the paged hooks? YES - already present, reused unchanged.
--------------------------------------------------------------------------------
The hooks (kv_paged / max_batch_tokens / prefill_budget / prefill_cap) are already in the
SHARED backend/cpp/llama-cpp/grpc-server.cpp. The paged backend reuses that file verbatim
(via the Makefile copy). No patch-grpc-server.sh step is needed (unlike turboquant). The
hooks are what translate the gallery `options:` (section 2) into the LLAMA_KV_PAGED /
LLAMA_MAX_BATCH_TOKENS env that the paged llama.cpp lib reads.
================================================================================
2. GALLERY ITEMS - NVFP4 Qwen3.6 dense + MoE
================================================================================
Add two entries to gallery/index.yaml. Schema (verified against existing GGUF items and
the LocalAI config structs): backend selection via `overrides.backend`; runtime knobs via
either typed config fields (context_size/f16/flash_attention/gpu_layers/batch) or the
`options:` string list (key:value, parsed by grpc-server.cpp set_option).
--------------------------------------------------------------------------------
2.1 Benchmark llama-server flags -> LocalAI model-config mapping
--------------------------------------------------------------------------------
-c 131072 -> context_size: 131072 (LLMConfig.ContextSize, yaml context_size)
-fa on -> flash_attention: "on" (LLMConfig.FlashAttention, yaml flash_attention; string)
-ngl 99 -> gpu_layers: 99 (LLMConfig.NGPULayers, yaml gpu_layers; or omit -> DefaultNGPULayers offloads all)
-b 2048 -> batch: 2048 (schema.PredictionOptions.Batch, yaml batch) [see caveat]
--parallel 128 -> options: ["parallel:128"] (grpc-server.cpp:629; alias n_parallel)
LLAMA_KV_PAGED=1 -> options: ["paged_kv:true"] (grpc-server.cpp:778)
LLAMA_MAX_BATCH_TOKENS=512 -> options: ["max_batch_tokens:512"] (grpc-server.cpp:821; alias mbt)
f16 KV -> f16: true (LLMConfig.F16, yaml f16)
(recommended for paged) -> options: ["kv_unified:false"] (grpc-server.cpp:746 - the per-slot paged
capacity/memory benefit only materializes with a per-sequence cache;
the patch comment explicitly recommends pairing paged with kv_unified:false)
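The mapping above, restated as a function for illustration. The input keys mirror the benchmark flags and env vars from the table; `benchmarkToConfig` is a hypothetical helper, and the output follows the `overrides` layout used by the gallery entries in 2.2.

```javascript
// Restates the mapping table as data: benchmark flags/env in, LocalAI
// model-config overrides out. Not part of LocalAI; a sketch only.
function benchmarkToConfig(bench) {
  return {
    overrides: {
      context_size: bench.c,       // -c
      flash_attention: bench.fa,   // -fa (string)
      gpu_layers: bench.ngl,       // -ngl
      batch: bench.b,              // -b
      f16: true,                   // f16 KV
      options: [
        `parallel:${bench.parallel}`,
        ...(bench.LLAMA_KV_PAGED === "1" ? ["paged_kv:true"] : []),
        `max_batch_tokens:${bench.LLAMA_MAX_BATCH_TOKENS}`,
        "kv_unified:false", // recommended pairing for paged
      ],
    },
  };
}

console.log(JSON.stringify(benchmarkToConfig({
  c: 131072, fa: "on", ngl: 99, b: 2048, parallel: 128,
  LLAMA_KV_PAGED: "1", LLAMA_MAX_BATCH_TOKENS: 512,
}), null, 2));
```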
CAVEAT (-ub 512): LocalAI sets params.n_ubatch = params.n_batch = request->nbatch()
(grpc-server.cpp:528,532). There is NO separate config field for n_ubatch, so the
benchmark's `-b 2048 -ub 512` split is NOT exactly reproducible. Options:
(i) set batch: 512 -> n_batch=n_ubatch=512 (matches -ub; the decode-first
max_batch_tokens=512 budget is the dominant prefill lever anyway, and the
benchmark states decode throughput is budget-independent), OR
(ii) set batch: 2048 -> n_ubatch also 2048 (bigger physical batch, more KV scratch).
RECOMMEND (i) batch: 512 for the shipped gallery config (closest to the measured run +
lighter memory). Flag separately: a tiny grpc-server.cpp option `n_ubatch`/`ubatch` could
be added later to honor -b/-ub independently (not required to ship).
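The caveat reduces to a one-line model: with a single `batch` field, n_batch and n_ubatch always move together. The sketch below (hypothetical helper `effectiveBatch`) shows that neither option reproduces both benchmark values at once.

```javascript
// Model of the caveat: no separate n_ubatch field, so both follow `batch`.
function effectiveBatch(batch) {
  return { n_batch: batch, n_ubatch: batch };
}

const benchmarkRun = { n_batch: 2048, n_ubatch: 512 }; // -b 2048 -ub 512
for (const batch of [512, 2048]) {
  const eff = effectiveBatch(batch);
  console.log(`batch: ${batch}`, {
    matchesB: eff.n_batch === benchmarkRun.n_batch,
    matchesUb: eff.n_ubatch === benchmarkRun.n_ubatch,
  });
}
```

Option (i) matches only the -ub side and option (ii) only the -b side, which is why a dedicated `n_ubatch` option would be needed for an exact reproduction.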
--------------------------------------------------------------------------------
2.2 gallery/index.yaml entry - DENSE q36-27b-nvfp4
--------------------------------------------------------------------------------
- name: "qwen3.6-27b-nvfp4-paged"
url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
urls:
- https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF # placeholder, section 3
description: |
Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF. Configured for LocalAI's
paged-attention llama.cpp backend: on-demand paged KV + decode-first prefill budget.
Benchmarked on GB10/DGX Spark at 90-117% of vLLM dense decode at 1.5-3x lower memory.
license: "apache-2.0" # confirm vs Qwen license
tags: [ llm, gguf, nvfp4, reasoning ]
icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
overrides:
backend: llama-cpp-paged
f16: true
flash_attention: "on"
context_size: 131072
gpu_layers: 99
batch: 512 # see -ub caveat 2.1; matches the 512 ubatch floor
known_usecases: [ chat ]
options:
- use_jinja:true
- paged_kv:true # LLAMA_KV_PAGED=1
- max_batch_tokens:512 # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget)
- kv_unified:false # enables the per-slot paged capacity/memory benefit
- parallel:128 # --parallel 128 serving slots
parameters:
model: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
template:
use_tokenizer_template: true
files:
- filename: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
sha256: <FILL after publish>
uri: https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF/resolve/main/q36-27b-nvfp4.gguf
--------------------------------------------------------------------------------
2.3 gallery/index.yaml entry - MoE q36-35b-a3b-nvfp4
--------------------------------------------------------------------------------
Same shape; the MoE is lighter on memory (~3B active). parallel:128 + budget 256 was the
MoE decode-throughput sweet spot in the sweep, but 512 is fine as a default; if optimizing
purely for saturated MoE decode use max_batch_tokens:256.
- name: "qwen3.6-35b-a3b-nvfp4-paged"
urls: [ https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF ]
...
overrides:
backend: llama-cpp-paged
f16: true
flash_attention: "on"
context_size: 131072
batch: 512
options:
- use_jinja:true
- paged_kv:true
- max_batch_tokens:512 # or 256 for max saturated MoE decode (sweep winner)
- kv_unified:false
- parallel:128
parameters:
model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
files:
- filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
sha256: <FILL after publish>
uri: https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF/resolve/main/q36-35b-a3b-nvfp4.gguf
Note: these are the BENCHMARK serving configs. For an interactive single-user default you
may want a second lighter gallery variant (context_size 16384, parallel 4, drop the budget)
- optional, not required to ship the benchmark reproduction.
================================================================================
3. GGUF PUBLISHING (so the gallery uri: resolves)
================================================================================
The two GGUFs already exist on the DGX dev box (final_benchmark.csv references
q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; README.md "Models" + "Benchmarks"
document provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
tensors from nvidia modelopt weights). To publish:
1. HF repos (suggest two, under the org that owns the gallery-referenced weights):
<ORG>/Qwen3.6-27B-NVFP4-GGUF (single q36-27b-nvfp4.gguf)
<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF (single q36-35b-a3b-nvfp4.gguf)
ORG = localai-org (brand) or mudler (personal); pick per ownership of the conversions.
2. Upload each .gguf; compute sha256 (sha256sum) and paste into the gallery `files:` sha256
(LocalAI verifies it on download). Without sha256 the entry still works but loses the
integrity check - fill it.
3. Model card metadata: base_model Qwen/Qwen3.6-*, library_name gguf, quantization NVFP4,
pipeline_tag text-generation, license (confirm Qwen3.6 license terms - apache-2.0 vs
Qwen community license), a note that it REQUIRES the llama-cpp-paged backend (NVFP4 +
paged), and the GB10 benchmark table (link README.md "Benchmarks" numbers).
4. NVFP4 requires a llama.cpp new enough to read the NVFP4 GGUF type. Confirm the pinned
LLAMA_VERSION in backend/cpp/llama-cpp/Makefile supports NVFP4 tensor types (the dev
tree that produced the GGUFs did). If the current pin predates NVFP4 GGUF support, the
backend pin must be bumped OR the paged patch series must carry the NVFP4 reader. THIS
IS A GATING CHECK before the gallery items are usable - verify on a GPU box.
5. Provenance/licensing: the dense conversion derives from unsloth; the MoE from nvidia
modelopt weights. Ensure redistribution of the converted GGUFs is permitted and
attribute upstream in the card.
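For step 2, `sha256sum` is enough. If the digest has to be computed from a Node script instead (for example in a publish helper), a chunked read avoids loading a multi-GB GGUF into memory. A sketch; `sha256File` is a hypothetical helper, not an existing repo script.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");

// Hash a file in fixed-size chunks so large GGUFs do not need to fit in memory.
function sha256File(path, chunkSize = 1 << 20) {
  const hash = crypto.createHash("sha256");
  const buf = Buffer.allocUnsafe(chunkSize);
  const fd = fs.openSync(path, "r");
  try {
    let n;
    // position=null reads sequentially from the current file offset
    while ((n = fs.readSync(fd, buf, 0, chunkSize, null)) > 0) {
      hash.update(buf.subarray(0, n));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest("hex");
}
```

The hex string it returns is what goes into the `sha256:` field of the gallery `files:` entry.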
================================================================================
4. OPEN DECISIONS / BLOCKERS / BUILD COST
================================================================================
BACKEND NAME - RECOMMEND `llama-cpp-paged`.
- llama-cpp-paged (RECOMMENDED): descriptive (it IS the paged variant), hyphenated like
every sibling (llama-cpp/ik-llama-cpp/turboquant/ds4), collision-free in the
changed-backends.js endsWith() suffix scheme, self-documenting in the /backends/known
importer dropdown. Reads correctly next to "turboquant" and "ik-llama-cpp".
- localai-llama-cpp (branding alternative, ACCEPTABLE): keeps the LocalAI brand without a
dot; hyphenated and safe. Use this if marketing wants "LocalAI's own llama.cpp" framing.
Slightly less self-explanatory about WHAT differs (paged) in the dropdown.
- localai-llama.cpp (the working name; NOT RECOMMENDED): the dot makes
Dockerfile.localai-llama.cpp and tag-suffix -cpu-localai-llama.cpp the only dotted ones
in the repo, and ".cpp" looks like a file extension to the suffix matcher. Avoid.
BLOCKERS / GATING CHECKS (cannot be closed read-only, no GPU here):
1. NVFP4 GGUF read support in the pinned LLAMA_VERSION (section 3.4). Must verify on GPU.
If unsupported, bump the pin (which also affects stock llama-cpp) or carry the reader.
2. The two GGUFs are not yet on HF (section 3). Gallery uri + sha256 are placeholders
until upload. Blocks gallery validation only, not the backend build.
3. -ub vs -b split (section 2.1) is not exactly reproducible without a tiny grpc-server
option; shipped config uses batch:512. Minor, not a blocker.
4. Flipping stock LLAMA_PAGED?=off changes stock's shipped artifact (de-risking, intended)
- get explicit sign-off since it alters a heavily-used backend's build.
PLATFORM SHIP MATRIX (RECOMMENDED PHASING - the variant is cheap because it reuses the same
base-grpc-* prebuilt bases and the same compile machinery, so each row is just CI minutes):
Phase 1 (the benchmark target - GB10/Blackwell is CUDA):
- cuda12 amd64, cuda13 amd64, cuda13 arm64 (sbsa), l4t-cuda-12 arm64 (NVFP4/paged win)
- cpu-all amd64 + cpu-all arm64 (the single CPU_ALL_VARIANTS build; baseline coverage)
Phase 2 (parity with stock llama-cpp coverage, only if demand):
- metal-darwin-arm64 (1.11), vulkan amd64/arm64, rocm amd64, intel sycl f16/f32
Defer rocm/sycl/vulkan/metal unless asked - the paged + NVFP4 story is GPU/CUDA-centric
and these add CI cost without a clear consumer.
BUILD-COST ESTIMATE PER PLATFORM (with warm base-grpc-* base + ccache; the paged TUs are
~byte-identical to stock so a SHARED ccache id makes most objects free):
- CPU_ALL_VARIANTS (per arch): ~15-30 min warm / ~35-50 min cold. arm64 adds a gcc-14
apt step. Two arches + a merge job.
- CUDA (per arch): ~25-45 min warm / ~45-75 min cold (nvcc dominates; ccache helps less
across CUDA arch flag changes). amd64 cuda12 + cuda13, arm64 cuda13 + l4t = 4 jobs.
- Metal/Darwin (if Phase 2): native macos-14 runner, ~20-35 min with the ccache cache.
- No base-images.yml change and no bootstrap dispatch (reuses existing base-grpc-* tags),
so the only new CI cost is the per-row build minutes above. PR builds read cache, don't
write; first master build per row pays the cold cost once, then warm.
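As a rough cross-check of the Phase 1 budget, the per-job ranges above can be summed over the matrix rows. This is a minimal sketch, assuming the two cpu-all arches and the four CUDA rows are the only billable jobs; it ignores the merge job and the arm64 gcc-14 apt step, and it counts runner minutes rather than wall-clock time. The names `CPU_ALL`, `CUDA`, and `phase1_minutes` are illustrative, not part of the repo.

```python
# Per-job minute ranges copied from the estimate above:
# (warm_lo, warm_hi, cold_lo, cold_hi).
CPU_ALL = (15, 30, 35, 50)
CUDA = (25, 45, 45, 75)

# Phase 1 job counts from the ship matrix: 2 cpu-all arches, 4 CUDA rows.
PHASE1_JOBS = [(CPU_ALL, 2), (CUDA, 4)]


def phase1_minutes(jobs):
    """Sum per-job ranges into total (warm_lo, warm_hi, cold_lo, cold_hi)."""
    totals = [0, 0, 0, 0]
    for rng, count in jobs:
        for i, minutes in enumerate(rng):
            totals[i] += minutes * count
    return tuple(totals)


warm_lo, warm_hi, cold_lo, cold_hi = phase1_minutes(PHASE1_JOBS)
print(f"warm: {warm_lo}-{warm_hi} min, cold: {cold_lo}-{cold_hi} min")
```

Under these assumptions a full Phase 1 master build costs on the order of two to four runner-hours warm, and roughly four to seven cold.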
VERIFICATION (post-implementation, needs a GPU box - out of scope here):
- `make backends/llama-cpp-paged` builds + installs locally (from-source path).
- Confirm stock `make backends/llama-cpp` now builds clean (no paged-kv-manager.cpp in the
checkout) - proves the split.
- Load a published NVFP4 GGUF via the gallery entry, hit /v1/chat/completions, confirm the
server log shows LLAMA_KV_PAGED engaged (LLAMA_KV_PAGED_DEBUG trace) and the configured
max_batch_tokens/parallel took effect.
- go test ./core/gallery/importers/... green (importer drop-in case).
- node scripts/changed-backends.js dry-run: editing backend/cpp/llama-cpp/* retriggers
llama-cpp-paged (cross-trigger), editing backend/cpp/llama-cpp-paged/* triggers it too.
================================================================================
END OF PLAN
================================================================================

@@ -0,0 +1,75 @@
# Paged bit-exactness gate - per path (canonical references)
## TL;DR
The greedy decode of the **paged** path does not byte-match the **non-paged**
path for the MoE model. This is a **benign FP-accumulation-order difference of
the paged attention reduction**, KL-validated against the f16 reference. It is
**not a bug**. The bit-exactness gate is therefore **per path**:
| path | model | canonical md5 |
|------|-------|---------------|
| non-paged | MoE q36-35b-a3b-nvfp4 | `07db32c2bcb78d17a43ed18bc22705cd` |
| paged | MoE q36-35b-a3b-nvfp4 | `8cb0ce23777bf55f92f63d0292c756b0` |
| non-paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` |
| paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) |
Gate command (chat-template / conversation path):
```
llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
-n 48 --temp 0 --seed 1
# paged: prefix with LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
```
Note: use the default chat-template path (do **not** pass `-no-cnv`; raw
completion lands in a different md5 namespace).
**Future paged-MoE regression checks must compare against the PAGED reference
`8cb0ce23`, not the non-paged `07db32c2`.** Dense is bit-exact across paths, so dense uses the
single reference `5951a5b4`.
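The per-path rule can be expressed as a lookup keyed by (path, model). The sketch below is illustrative only: the md5 values are copied from the table above, while `CANONICAL_MD5` and `gate_passes` are hypothetical names, and it assumes the caller has already captured the stdout bytes of the gate command.

```python
import hashlib

# Canonical greedy-decode md5s from the table above, keyed by (path, model).
CANONICAL_MD5 = {
    ("non-paged", "q36-35b-a3b-nvfp4"): "07db32c2bcb78d17a43ed18bc22705cd",
    ("paged", "q36-35b-a3b-nvfp4"): "8cb0ce23777bf55f92f63d0292c756b0",
    ("non-paged", "q36-27b-nvfp4"): "5951a5b4d624ce891e22ab5fca9bc439",
    ("paged", "q36-27b-nvfp4"): "5951a5b4d624ce891e22ab5fca9bc439",
}


def gate_passes(path, model, output_bytes):
    """Compare the md5 of captured stdout with the reference for this path."""
    expected = CANONICAL_MD5[(path, model)]
    return hashlib.md5(output_bytes).hexdigest() == expected
```

Keying on the path makes it impossible to compare a paged MoE run against the non-paged reference by accident.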
## Why dense is bit-exact but MoE is not
Dense paged decode reproduces the non-paged reduction order exactly, so dense
greedy md5 is identical across paths. The MoE path runs additional kernels (the
NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs
between the paged and non-paged attention layouts. Over a long greedy decode this
flips a small number of near-tied argmaxes, changing the byte stream. The same
divergence is present on the 0028 baseline, with `LLAMA_MOE_FORCE_GRAPHS` on or
off, and with the patch-0029 block-table cache on or off - it is a property of
the paged attention path, not of any one lever.
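The mechanism is easy to reproduce in isolation. The toy example below uses deliberately extreme magnitudes (far larger than anything in a real logit computation) so the effect is visible in one step: the same three terms summed in two orders give different results, and the argmax against a hypothetical rival logit flips. It is an illustration of FP non-associativity, not a model of the actual kernels.

```python
def accumulate(terms):
    """Left-to-right floating-point sum, like a serial reduction."""
    total = 0.0
    for t in terms:
        total += t
    return total


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Same three terms, two reduction orders.
order_a = [1e16, 1.0, -1e16]  # the 1.0 is absorbed by 1e16 and lost
order_b = [1e16, -1e16, 1.0]  # the large terms cancel first, 1.0 survives

logit_a = accumulate(order_a)
logit_b = accumulate(order_b)

rival = 0.5  # hypothetical competing token logit
print(logit_a, logit_b)
print(argmax([logit_a, rival]), argmax([logit_b, rival]))
```

In the real model the per-step differences are tiny, but near-tied logits need only a tiny difference to change the chosen token.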
## KL evidence that the paged path is sound (the load-bearing check)
`llama-perplexity --kl-divergence` on `q36-35b-a3b-nvfp4.gguf`, 16 chunks,
`-c 512 -ngl 99 --seed 1`, base logits from the f16 reference
(`darwin_36b_opus/f16.gguf`, PPL 7.3734):
| comparison | PPL(Q) | KL divergence | Same top p | Cor |
|------------|-------:|--------------:|-----------:|----:|
| f16 reference | 7.3734 | - | - | - |
| **non-paged** vs f16 | 7.3896 | 0.136597 +/- 0.003157 | 84.314% | 97.68% |
| **paged** vs f16 | 7.4009 | 0.136000 +/- 0.003285 | 84.828% | 97.58% |
| paged vs non-paged (direct) | 7.4009 (base 7.3818) | 0.050011 +/- 0.001653 | 89.044% | 99.04% |
Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%.
### Verdict: BENIGN
- **Paged does not diverge from the f16 ground truth more than non-paged does.**
KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) =
7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29
error bars). A real paged-MoE correctness bug would push paged measurably
*further* from f16; it does not (it is marginally closer).
- **Paged and non-paged cluster together.** They agree with each other (KLD 0.050,
89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, ~84% same-top-p),
with essentially zero probability bias. That is the signature of two equivalent
FP-reorderings of the same quantized model, both equally approximating the f16
ground truth - not a quality regression.
- The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that
heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model
logit near-ties are abundant, so a different-but-equivalent reduction order
flips ~11% of argmaxes with no quality cost (shown by the matching KLD-to-f16 and
the near-zero mean Delta-p of 0.079%).
Therefore the canonical gate is per path, and `8cb0ce23` is the validated paged
reference for the MoE deployment path.
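The arithmetic behind the verdict can be re-derived from the table. The sketch below copies the KLD values and error bars from the table; the decision rule in `paged_not_worse` (difference within the sum of the two error bars) is an assumption made for illustration, not the exact criterion used by the author.

```python
# Values copied from the KL table above.
KLD_NONPAGED, KLD_NONPAGED_ERR = 0.136597, 0.003157
KLD_PAGED, KLD_PAGED_ERR = 0.136000, 0.003285
SAME_TOP_DIRECT = 89.044  # percent, paged vs non-paged (direct)


def paged_not_worse(kld_paged, kld_nonpaged, err_paged, err_nonpaged):
    """True if paged is no further from f16 than non-paged, within error bars.

    Hypothetical rule: the KLD gap must not exceed the summed error bars.
    """
    return kld_paged - kld_nonpaged <= err_paged + err_nonpaged


# Share of positions whose top token differs between the two paths.
flip_rate = 100.0 - SAME_TOP_DIRECT

print(paged_not_worse(KLD_PAGED, KLD_NONPAGED, KLD_PAGED_ERR, KLD_NONPAGED_ERR))
print(round(flip_rate, 3))
```

The flip rate of about 11% matches the figure quoted in the verdict.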

@@ -0,0 +1,86 @@
# llama.cpp patch series — paged attention (vLLM-parity engine)
A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us
**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
patch needs fixing, and the failure points at exactly which step the upstream change touched.
## Base
- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
the pin = re-run the regen workflow below and fix only the patches that no longer apply.
## The series (phases → patches)
| # | Patch | What | Verifies |
|---|-------|------|----------|
| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately |
| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off
(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so a partial series never changes stock behavior.
## Regen workflow (the anti-drift recipe)
```sh
# 1. check out the exact pin into a dev tree
git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
git checkout <LLAMA_VERSION from ../Makefile>
git checkout -b paged
# 2. apply the current series (each becomes a commit), or develop the next patch
git am /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/00*.patch # or `git apply` + commit per patch
# 3. iterate a phase as ONE commit, then export the whole series 1:1
git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/ --zero-commit -N
# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
```
## Build integration
The series is owned by this backend (`backend/cpp/llama-cpp-localai-paged`), not by the stock
`llama-cpp` backend, which is pure upstream. `../Makefile` (the paged wrapper) clones the pinned
`llama.cpp` via the copied stock build infra, then applies this series onto the cloned tree with the
same strict `git apply` the stock build uses for base patches:
```
for p in $(PAGED_PATCHES_DIR)/0*.patch; do git apply --verbose "$p" || exit 1; done
```
All variants (avx/avx2/avx512/cuda/…) clone + apply into their own build copy, so the series ships
everywhere without ever touching the stock `llama-cpp` source tree.
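The loop relies on the shell expanding the glob in sorted order, which equals numeric order only because the patch prefixes are zero-padded. The sketch below illustrates that property with hypothetical file names; `apply_order` is an illustrative helper that approximates shell glob ordering with a plain lexicographic sort.

```python
import fnmatch

# Hypothetical file names; only the numeric prefixes matter here.
padded = [f"{n:04d}-example.patch" for n in (1, 2, 9, 10, 30)]
unpadded = [f"{n}-example.patch" for n in (1, 2, 9, 10, 30)]


def apply_order(filenames, pattern):
    """Names matching the glob, in lexicographic (shell-like) order."""
    return sorted(n for n in filenames if fnmatch.fnmatch(n, pattern))


print(apply_order(padded, "0*.patch"))    # numeric order preserved
print(apply_order(unpadded, "*.patch"))   # 10 sorts before 2
```

Keeping the four-digit prefix is therefore part of the apply contract, not just a naming convention.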
## Status
- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`.
- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
**token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
(see `../README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
`llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
`llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
**9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
- **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order ->
bit-identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what
makes paged placement token-identical under flash-attn.**
- 0004-0006 follow.
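The position-sorted gather from the key correctness finding can be sketched as follows. This is a toy single-sequence model: the name `get_gather_idxs` comes from the text above, but the signature and the `cell_pos` representation (token position per physical cell, `None` for an empty cell) are assumptions made for illustration.

```python
def get_gather_idxs(cell_pos):
    """Return physical cell indices ordered by token position.

    cell_pos[i] is the token position stored in physical cell i,
    or None if the cell is empty.
    """
    used = [(pos, idx) for idx, pos in enumerate(cell_pos) if pos is not None]
    return [idx for pos, idx in sorted(used)]


# Scattered placement: positions 0..4 landed in non-contiguous cells.
cell_pos = [3, None, 0, 4, None, 1, 2, None]
print(get_gather_idxs(cell_pos))
```

Reading the cells in this order presents the attention reduction with the same position order a contiguous (stock) cache would, regardless of where the blocks physically landed.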
### Honest parity note (important)
This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
`../README.md`). So full vLLM parity = this series **AND** the
kernel; neither alone suffices.

@@ -0,0 +1,101 @@
# Pin-sync: paged patch-stack -> llama.cpp c299a92c
Status: COMPLETE. The shipped source-only paged patch series (`0001`-`0030`,
28 `.patch` files) was advanced from llama.cpp `9d5d882d` to `c299a92c`
("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
## Upstream jump
- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
("model : Add label for LFM2.5-230M (#25008)")
- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
**zero patch changes**. The already-shipped source-only series (the result of the
`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
`git apply`** (the `apply-paged-patches` step in
`backend/cpp/llama-cpp-localai-paged/Makefile`:
`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
28 patches reported "Applied patch ... cleanly", the sentinel
`src/paged-kv-manager.cpp` was created, and there are **zero** stray
`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
intact). `git apply` tolerates `@@` line-number offsets, which absorbed the
upstream drift; no hunk context broke.
Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
patch tarball used for the verification has
`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
## Clean build
Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
28 patches applied as working-tree changes, then:
```
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
-DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
-DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build-cuda --target llama-completion test-backend-ops -j20
```
Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
## GATE: ALL GREEN
Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
`9d5d882d` build too):
```
llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
-n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
# paged dense: prefix LLAMA_KV_PAGED=1
# paged MoE: prefix LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
```
(a) greedy md5 - all four paths PASS:
| path | model | md5 @ c299a92c | baseline | verdict |
|------|-------|----------------|----------|---------|
| non-paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
| paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| paged | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
(b) `test-backend-ops` (Backend CUDA0) - all PASS:
| op | result |
|----|--------|
| SSM_CONV | 45/45 OK |
| SSM_CONV_UPDATE | 16/16 OK |
| SSM_CONV_UPDATE_IDS | 16/16 OK |
| GATED_DELTA_NET | 84/84 OK |
| MUL_MAT | 1146/1146 OK |
| MUL_MAT_ID | 806/806 OK |
(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
series now carries the per-head/gather test cases added by patches `0026`/`0028`;
all pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
Bit-exactness preserved across the 23-commit upstream jump.
## Canary
`.github/workflows/llama-cpp-paged-canary.yml` and
`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
series is source-only and applies strict-clean with no `--exclude`, the canary's
`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
the shipped series) and may be removed on a future canary touch; left in place
here to keep the pin-bump diff minimal.
## Source of truth
The shipped `.patch` files under `backend/cpp/llama-cpp-localai-paged/patches/paged/` are the
source of truth and are unchanged by this bump. The DGX dev tree
(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.

@@ -0,0 +1,337 @@
# Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL
Source-only analysis (no GPU, no build) of what it would take to give the
gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute
backends, so the patch-series decode win extends past CUDA-family hardware.
In our changeset (patches 0018-0030) these fusions ship with CUDA native kernels
+ CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan /
SYCL (a CPU-fallback fused op would regress via the device round-trip, and a
backend that ran the plain op on the discriminated node would silently
miscompute). "Layer 2" is the upstream work that adds the missing native kernels.
This doc was written against the ggml backend trees in
`backend/cpp/llama-cpp-paged-dev` (upstream base #24732, one commit OLDER than the
series pin `c299a92c` #25045, with only the two paged-KV patches applied - neither
touches GDN/SSM). So every "kernel already exists" statement below is a
conservative lower bound: the pin has at least these kernels.
--------------------------------------------------------------------------------
## 0. Headline finding (correct a stale assumption first)
The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel
upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there."
**That is now stale.** All three backends already carry the BASE compute ops:
| op | Metal | Vulkan | SYCL |
|------------------------|------------------------------------|------------------------------------------|---------------------------------|
| GGML_OP_GATED_DELTA_NET| `kernel_gated_delta_net_impl` (f32, NSG 1/2/4) | `gated_delta_net.comp` (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | `gated_delta_net.cpp` (`launch_gated_delta_net<KDA,keep_rs>`) |
| GGML_OP_SSM_CONV | `kernel_ssm_conv_f32_f32` (+ `_4`, + batched) | `ssm_conv.comp` (+ APPLY_BIAS, APPLY_SILU specialization consts) | `ssm_conv.cpp` (`kernel_ssm_conv`) |
| GGML_OP_SSM_SCAN | yes | `ssm_scan.comp` (mamba2) | `ssm_scan.cpp` (mamba2) |
Verified: Vulkan `gated_delta_net.comp` was last touched at the upstream base
commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on
Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today
(via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is
the decode SPEEDUP from the fusions, NOT enabling the model to run at all.
Consequence: the GDN-compute op being "partly there" is true on every backend,
not just Metal. What is still missing per backend is only the FUSION plumbing
(in-place write-back target, the ids gather read, and the conv-update kernel) -
a materially smaller scope than "port GDN from scratch."
--------------------------------------------------------------------------------
## 1. Per-op semantics (the four fusions to port)
All four reuse an existing GGML_OP enum with extra `src[]` slots as a
discriminator; none adds a new enum value. f32 throughout. The arithmetic core
is IDENTICAL to the upstream non-fused op; only the read source and/or the write
target are redirected. That single fact drives the whole bit-exactness story
(section 3).
### OP A - `ggml_gated_delta_net_inplace` (patch 0018)
- Enum `GGML_OP_GATED_DELTA_NET`, discriminated by a non-null `src[6]` =
`state_dst` (a contiguous `[S_v*S_v*H, n_seqs]` view into the recurrent-state
cache at `kv_head`). K == 1 only.
- Semantics: run the standard GDN recurrence, but write the FINAL recurrent state
directly into `state_dst` instead of appending it to the op output. The op
output then carries only the attention scores. Removes the per-layer per-step
~full-state D2D copy-back (the 0018 win).
- Race (in-place read == write): each (seq, head) block owns a disjoint cache
slot. The kernel loads the whole prior state `s0` into per-thread registers
(`s_shard` on CUDA, `ls[NSG]` on Metal, the column shard on Vulkan/SYCL)
BEFORE the ring write, so reading and writing the same slot is safe.
### OP B - `ggml_gated_delta_net_inplace_ids` (patch 0019)
- Adds `src[5]` = FULL state cache `[S_v,S_v,H,n_rs_slots]`, `src[7]` = `ids`
(I32, per-seq source slot == the recurrent-state `s_copy`), `op_param[1]` =
`rs_head` (destination base slot). Still has the OP-A `src[6]` in-place target.
- Semantics: read each sequence's prior state directly from `cache[ids[seq]]`
(mirrors `ggml_ssm_scan`'s ids source), eliminating the `ggml_get_rows`
materialization. Combined with OP A the op now reads AND writes the cache in
place.
- Race: identity sequences (`ids[s] == rs_head + s`, the steady AR-decode case)
read s0 in place from the destination slot (safe via the register snapshot
above). Non-identity sequences (reorder / rs_zero remap) are first copied by a
TINY separate gather kernel (`gdn_gather_nonident`, one block/seq) into a
DISJOINT scratch that the recurrence then reads, so the recurrence never reads
a slot another block is writing. Value-preserving memcpy -> bit-identical to
the get_rows path.
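The identity / non-identity split can be modeled on the host. In the sketch below a plain loop stands in for the parallel blocks, `step` is a placeholder for the GDN recurrence, and all function names are hypothetical. It shows why non-identity rows need the scratch copy: without it, a sequence can read a slot that another sequence has already overwritten.

```python
def step(state):
    """Stand-in for the GDN recurrence: any pure function of prior state."""
    return state + 1


def fused_update(cache, ids, rs_head):
    """In-place read/write with a scratch copy for non-identity rows."""
    # Gather pass: snapshot every non-identity source before any write.
    scratch = {s: cache[src] for s, src in enumerate(ids) if src != rs_head + s}
    for s in range(len(ids)):
        prior = scratch[s] if s in scratch else cache[rs_head + s]
        cache[rs_head + s] = step(prior)
    return cache


def naive_update(cache, ids, rs_head):
    """Same loop without the scratch: later reads may see overwritten slots."""
    for s, src in enumerate(ids):
        cache[rs_head + s] = step(cache[src])
    return cache


def reference_update(cache, ids, rs_head):
    """get_rows-style reference: gather everything first, then write."""
    gathered = [cache[src] for src in ids]
    for s, prior in enumerate(gathered):
        cache[rs_head + s] = step(prior)
    return cache


print(fused_update([10, 20], [1, 0], 0))      # swapped sources
print(naive_update([10, 20], [1, 0], 0))      # second read sees the new slot 0
```

For the swapped case the fused model matches the gather-first reference, while the naive loop does not; for identity ids all three agree.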
### OP C - `ggml_ssm_conv_update_inplace` (patch 0021)
- Enum `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]` =
`conv_state_dst` (`[(K-1)*channels, n_seqs]` in-place ring view).
`src[0]` = conv_states `[K-1, channels, n_seqs]`, `src[1]` = conv_kernel
`[K, channels]`, `src[2]` = x_cur `[channels, 1, n_seqs]`. `op_param[0]` =
fuse_silu.
- Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the
width-K conv window in registers from the K-1 cached taps + the current token,
compute the depthwise conv with the SAME ascending-tap FMA order as plain
`ssm_conv` (`tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` to match plain conv's
`sumf += b` with b==0), optionally fold SiLU, write the conv output
`[channels,1,n_seqs]`, and write the 1-token-shifted ring state back in place.
Replaces the 4-op decode conv chain (transpose + concat + conv + silu + ring
cpy).
- Race: read source (gathered taps) and write target (cache view) are disjoint
buffers -> race-free by construction, no ids/identity logic.
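One decode step of the conv-update for a single (channel, sequence) pair can be sketched as below. This is a scalar model under stated assumptions: Python floats are doubles whereas the kernels are f32, and `conv_update` with its argument layout (taps oldest first) is a hypothetical helper, not the ggml builder.

```python
import math


def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def conv_update(taps, x_cur, weights, fuse_silu=False):
    """One decode step for one (channel, sequence) pair.

    taps: the K-1 cached inputs, oldest first; x_cur: the current token's
    input; weights: the K conv weights. Returns (output, new_taps).
    """
    window = list(taps) + [x_cur]            # assemble the width-K window locally
    acc = 0.0
    for w_in, w_k in zip(window, weights):   # ascending-tap accumulation order
        acc += w_in * w_k
    acc += 0.0                               # mirrors the plain conv's zero-bias add
    out = silu(acc) if fuse_silu else acc
    return out, window[1:]                   # ring state shifted by one token


out, new_taps = conv_update([1.0, 2.0, 3.0], 4.0, [1.0, 0.5, 0.25, 2.0])
print(out, new_taps)
```

The window is built in a local list before anything is returned, which is the same copy-before-write discipline the in-place kernels need.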
### OP D - `ggml_ssm_conv_update_inplace_ids` (patch 0028)
- Same enum, discriminated by a non-null `src[4]` = `ids`; `src[0]` becomes the
FULL conv cache `[K-1, channels, n_cells]`; `op_param[1]` = rs_head.
- Semantics: gather-free conv-update - read each sequence's prior taps from
`cache[ids[s]]` in-kernel (no get_rows). Identity reads in place from
`conv_state_dst`; non-identity gathered into a disjoint scratch first by a tiny
`ssm_conv_gather_nonident` kernel. The window is copied to a local array
BEFORE the (possibly aliasing) ring write so the identity read==write slot is
correct. Bit-identical to get_rows + OP C.
### Net new kernels vs reuse, per op
- OP A: NOT a new compute kernel - a write-target redirection of the EXISTING
GDN kernel + 1 buffer binding + a supports_op/op-handler branch.
- OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) +
1 ids binding + rs_head param + 1 tiny gather kernel.
- OP C: a GENUINELY NEW kernel on each backend. The existing `ssm_conv` computes
a windowed reduction over a PRE-concatenated input; it does not assemble the
window from cached taps + the current token, fold silu, or write the shifted
ring state. This is the largest net-new piece.
- OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1
tiny conv gather kernel.
The `ggml.h` / `ggml.c` builders, the CPU reference kernels, the model-graph
emission (`delta-net-base.cpp`, qwen35*), and the `test-backend-ops` cases are
SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend
work is the kernel(s) + the backend wiring.
--------------------------------------------------------------------------------
## 2. Per-backend: authoring model, effort, gotchas, wiring
### 2.1 Metal (MSL)
Authoring model: `.metal` MSL source (`ggml-metal.metal`), function-constant
specialization (e.g. `FC_GATED_DELTA_NET`), kernels templated on `NSG`; host
glue split across `ggml-metal-ops.cpp` (`ggml_metal_op_*` encode), the pipeline
lookup in `ggml-metal-device.cpp`/`.m`, the kargs struct in `ggml-metal-impl.h`,
and `supports_op` in `ggml-metal-device.m`. Threadgroup model; Apple GPU
simdgroup width is a FIXED 32, `simd_sum` for the per-column reduce.
Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist
and are ergonomic to extend. OP A is a write-base redirect of the existing
`kernel_gated_delta_net_impl` (its tail already does
`dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j]` after
loading `ls[]` into registers - just point `dst_state` at the `state_dst` buffer
and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu
ssm_conv variant today - only plain + `_4` + batched - so the silu-fold and ring
write are both new). Host glue spans 3-4 files.
Gotchas:
- In-place race: the existing kernel ALREADY snapshots the state column into
`ls[NSG]` registers before writing, so OP A/B are safe with no barrier; OP C/D
must mirror the `float window[K]` local-copy-before-write that CPU/CUDA use.
- Discriminated SSM_CONV: `supports_op` for `GGML_OP_SSM_CONV` currently returns
`has_simdgroup_reduction` with NO check of `src[3]`/`src[4]`; GDN returns
`has_simdgroup_reduction && src[2]->ne[0] % 32 == 0` with NO check of
`src[6]`/`src[7]`. Both must be tightened (accept the discriminated variant
only once the kernel exists) AND `ggml_metal_op_ssm_conv` /
`ggml_metal_op_gated_delta_net` must branch on the extra src to pick the kernel.
- Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three -
the fused variant only redirects addresses, so it is bit-identical to Metal's
own non-fused path by construction (the conv per-channel FMA needs the exact
ascending order + the `+0.0f`).
- The kargs struct grows by the `state_dst` / `ids` / `rs_head` fields; a new
pipeline name (or a function-constant branch) distinguishes the variants.
### 2.2 Vulkan (GLSL .comp -> SPIR-V)
Authoring model: GLSL `.comp` in `vulkan-shaders/`, compiled at build time by
`vulkan-shaders-gen` into embedded SPIR-V byte arrays (`gated_delta_net_f32_data`
etc.); pipeline creation in `ggml-vulkan.cpp` declares the binding count +
push-constant size; a push-constant struct per op; host dispatch `ggml_vk_*`
binds subbuffers; `supports_op` in the device support function. Subgroup size
VARIES by vendor (NVIDIA 32, AMD 64, Intel 8/16/32).
Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as
Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a
new `ggml_vk_create_pipeline` registration with an explicit binding count and
push-constant size, a new/extended push-constant struct (add `rs_head`), and
GROWING the descriptor binding set from the current 7 (`src[0..5]` + dst) to 8-9
(`state_dst`, `ids`). The GDN host dispatch hardcodes a 6-src bind loop and the
pipeline is created with `"main", 7, ...` - both must change.
Gotchas:
- Subgroup variance interacts with the EXISTING variant matrix: the GDN comp
already ships shmem / cluster / nocluster variants keyed on subgroup size and
relies on `S_V % COLS_PER_WG == 0`. The OP-A/B read/write redirect must be
applied across ALL of those variants, and re-validated per vendor.
- In-place race: GLSL must read the full column shard into local registers before
the ring write (same pattern); confirm the SPIR-V memory model is not relied on
for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)).
OP C/D need the explicit window-to-local copy.
- Discriminated SSM_CONV: `supports_op` returns `op->src[0]->type == F32` with NO
discriminator check; GDN loops `src[0..5]` F32 with NO `src[6]`/`src[7]` check.
Both must be tightened. This is the backend where the 0030 hazard is most
concrete (a present plain-conv kernel + a permissive supports_op = silent
miscompute) - Vulkan is the exact case 0030 was written for.
- conv-update is per-channel (one invocation per channel) so it is
subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden.
- Vulkan's `ssm_conv.comp` ALREADY has APPLY_SILU + APPLY_BIAS specialization
constants, so the silu-fold half of OP C is partly precedented here (unlike
Metal); the ring write-back + tap-window assembly are still new.
### 2.3 SYCL (single-source DPC++)
Authoring model: plain C++ `.cpp`/`.hpp` per op (`gated_delta_net.cpp`,
`ssm_conv.cpp`); a SYCL `queue.parallel_for` over an `nd_range` with
`reqd_sub_group_size(WARP_SIZE)`; sub-group reductions (`warp_reduce_sum`);
`supports_op` in `ggml-sycl.cpp`. NO separate shader-compile step (single
source).
Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are
near-VERBATIM mirrors of the CUDA ones (`launch_gated_delta_net<KDA,keep_rs>`,
`s_shard`, `curr_state`, `state = dst + attn_score_elems`, `warp_reduce_sum`) -
a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port
almost line-for-line: add the `state_dst` param, the `ids`/`rs_head` params, the
read-base select, the two tiny gather kernels, and the new conv-update kernel.
No pipeline/push-constant/binding bookkeeping.
Gotchas:
- In-place race: the `s_shard[]` / window arrays are per-work-item private, so
the register-snapshot-before-write pattern carries over directly. Safe.
- Discriminated SSM_CONV: `supports_op` checks `src[0]`/`src[1]` F32 with NO
discriminator check; GDN returns a BARE `true` (the MOST permissive, so the
hazard is worst here). Both must be tightened, and `ggml_sycl_op_ssm_conv` /
`ggml_sycl_op_gated_delta_net` must branch on the extra src.
- Bit-exactness: `WARP_SIZE` is fixed at compile time (Intel sub-groups are 8/16/32),
the same situation as CUDA; the fused variant matches SYCL's own non-fused path by
construction. conv-update is per-channel -> subgroup-agnostic.
### 2.4 Common wiring (all three) + the 0030 emission-gate change
Per backend, four wiring touch-points beyond the kernel body:
1. `supports_op`: tighten the `GGML_OP_SSM_CONV` and `GGML_OP_GATED_DELTA_NET`
entries so the discriminated/extra-src node is reported supported ONLY when
the new kernel handles it (and rejected otherwise, instead of today's
silently-true-for-the-plain-kernel).
2. op handler: branch on `src[3]`/`src[4]` (conv) and `src[6]`/`src[7]` (GDN) to
dispatch the fused kernel.
3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor
bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions).
4. The patch-0030 gate in `src/llama-context.cpp`.
The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg
name is not `"CUDA"`/`"ROCm"`/`"MUSA"` forces `fused_gdn_ar = fused_gdn_ch =
auto_fgdn = false`. As each backend gains kernels this must become capability-
driven, in one of two ways:
- minimal: add the backend's reg name (e.g. `"Metal"`) to the allow-list once its
kernels + tightened supports_op ship; OR
- clean (recommended upstream form): DELETE the name allow-list and make
`supports_op` authoritative - have the `auto_fgdn` resolution probe
`ggml_backend_dev_supports_op` on a representative node that carries the
discriminated `src[]` slots. Then routing falls out of the normal scheduler
fallback and no backend name is ever hard-coded. This also fixes 0030's stated
weakness that the upstream `auto_fgdn` check only inspects GATED_DELTA_NET
nodes and covered the discriminated SSM_CONV only incidentally.
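The two gating forms can be contrasted in a few lines. This is a hedged sketch: the allow-list names are the ones quoted above, but `gate_forces_off`, `fused_enabled_by_capability`, and the string node labels are hypothetical stand-ins for the logic in `src/llama-context.cpp` and for a `ggml_backend_dev_supports_op` probe.

```python
ALLOW_LIST = {"CUDA", "ROCm", "MUSA"}  # reg names quoted in the text above


def gate_forces_off(backend_reg_name, is_cpu):
    """Today's form: non-CPU backends outside the allow-list lose the fusions."""
    return (not is_cpu) and backend_reg_name not in ALLOW_LIST


def fused_enabled_by_capability(supports_op, probe_nodes):
    """Capability-driven form: enable only if every probe node is accepted.

    supports_op is a hypothetical callable standing in for a device query
    on a node that carries the discriminated src[] slots.
    """
    return all(supports_op(node) for node in probe_nodes)


# Hypothetical probe labels for the discriminated variants.
PROBES = ["GATED_DELTA_NET+ids", "SSM_CONV+ids"]


def plain_only(node):
    """A backend that only implements the plain (non-fused) ops."""
    return node in {"GATED_DELTA_NET", "SSM_CONV"}


def with_fusions(node):
    """A backend that also accepts the discriminated variants."""
    return node in set(PROBES)


print(gate_forces_off("Metal", False), gate_forces_off("CUDA", False))
print(fused_enabled_by_capability(plain_only, PROBES),
      fused_enabled_by_capability(with_fusions, PROBES))
```

In the capability-driven form a backend opts in simply by reporting the discriminated nodes as supported, so no name list needs maintaining.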
--------------------------------------------------------------------------------
## 3. Bit-exactness per backend (the md5 gate question)
Feasible on ALL THREE, and not actually constraining, because of how the gate is
scoped:
- The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has
its own f32 reduction order (Metal `simd_sum`, Vulkan subgroup reduce, SYCL
`warp_reduce_sum`) that differs from CUDA's and from CPU's. There is no
cross-backend md5 and none is expected.
- The relevant per-backend invariant is: the FUSED variant must equal that
backend's OWN non-fused path. The fusions change only the read source
(gather -> indexed read; the gather is a value-preserving memcpy) and the write
target (appended output -> in-place cache slot). They do NOT touch the
per-column FMA/reduce order. So the fused op is bit-identical to the
non-fused op on the same backend BY CONSTRUCTION.
- Two arithmetic details each port MUST preserve exactly: (a) the conv
ascending-tap order plus the `+0.0f` that matches plain `ssm_conv`'s
`sumf += b` with b==0; (b) the existing GDN per-column subgroup reduce (do not
re-order it). Get those right and `test-backend-ops` (backendX-vs-CPU, already
registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS /
GATED_DELTA_NET) is the per-backend gate.
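
Two of the claims above can be checked on the host with a small sketch. It is an illustration under stated assumptions, not any backend's kernel: `reduce_gathered`, `reduce_indexed`, `add_zero_bias` and `f32_bits` are hypothetical names, the reduction is a plain ascending scalar loop rather than a subgroup reduce, and `std::fma` is used explicitly so both paths perform the same arithmetic regardless of compiler contraction settings.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Raw bit pattern of an f32, for bit-exact comparison.
static std::uint32_t f32_bits(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    return u;
}

// Non-fused shape: gather the selected cache entries into a scratch buffer
// (a value-preserving copy), then reduce in ascending order.
static float reduce_gathered(const std::vector<float> & cache,
                             const std::vector<int>   & ids,
                             const std::vector<float> & w) {
    assert(ids.size() == w.size());
    std::vector<float> scratch(ids.size());
    for (std::size_t i = 0; i < ids.size(); ++i) {
        scratch[i] = cache[static_cast<std::size_t>(ids[i])];
    }
    float sumf = 0.0f;
    for (std::size_t i = 0; i < scratch.size(); ++i) {
        sumf = std::fma(scratch[i], w[i], sumf);
    }
    return sumf;
}

// Fused shape: read through ids directly; the FMA order is unchanged.
static float reduce_indexed(const std::vector<float> & cache,
                            const std::vector<int>   & ids,
                            const std::vector<float> & w) {
    assert(ids.size() == w.size());
    float sumf = 0.0f;
    for (std::size_t i = 0; i < ids.size(); ++i) {
        sumf = std::fma(cache[static_cast<std::size_t>(ids[i])], w[i], sumf);
    }
    return sumf;
}

// Why a "+0.0f" must be kept: under round-to-nearest, (-0.0f) + (+0.0f) is
// +0.0f, so dropping the add changes the sign bit of a negative-zero sum.
static float add_zero_bias(float sumf) {
    volatile float b = 0.0f;  // volatile keeps the add from being folded away
    return sumf + b;
}
```

Comparing `f32_bits(reduce_gathered(...))` against `f32_bits(reduce_indexed(...))` on the same inputs is the host-side analogue of the fused-vs-non-fused invariant; `add_zero_bias(-0.0f)` shows the one case where omitting the zero add would be visible in the output bits.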

--------------------------------------------------------------------------------

## 4. Upstream path and ranked recommendation

### Ops-first, then one PR per backend (NOT one big PR)

Recommended sequence:
1. PR #1 - OPS (already essentially done, upstreamable as-is): the `ggml.h`/
`ggml.c` builders, the CPU reference kernels, the CUDA kernels, the
`test-backend-ops` cases, and the capability-driven gate (the clean
`supports_op`-authoritative version of 0030). This is independently mergeable
and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed
that way).
2. PR #2 - Metal kernels + wiring.
3. PR #3 - SYCL kernels + wiring.
4. PR #4 - Vulkan kernels + wiring.

Do NOT bundle the backends: each needs its own hardware to validate
`test-backend-ops`, reviewers are backend-specialized, and a regression in one
must not block the others.

### Value x effort ranking (which backend first)

| backend | user base / value | author effort | bit-exact difficulty | net rank |
|---------|----------------------------|---------------|----------------------|----------|
| Metal | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32-wide simdgroup) | **1st** |
| SYCL | LOW-MED (Intel GPU) | LOWEST (near-verbatim CUDA mirror) | LOW | **2nd** |
| Vulkan | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | **3rd** |

Recommendation: **Metal first.** It banks the biggest user-facing decode win at
medium effort, the base GDN + conv kernels already exist, and Apple's fixed
simdgroup width makes bit-exactness the simplest. **SYCL second** as a cheap,
nearly mechanical follow-on (the port is a near-verbatim CUDA mirror, so it is
low-cost insurance even though the Intel-GPU audience is smaller). **Vulkan last**:
it reaches the widest hardware (AMD + Intel + anything with a Vulkan driver), but
the shader-gen pipeline, the existing variant matrix, the subgroup-width variance,
and the per-vendor validation burden make it the right capstone once the pattern
is proven on Metal + SYCL.

A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR
(both are register-snapshot ports with no shader-gen step) and treat Vulkan as a
separate later effort.

--------------------------------------------------------------------------------

## 5. Summary

- GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL
(the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all
three today via the non-fused path; Layer-2 is about the decode SPEEDUP.
- Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids
read (OP B) to the existing GDN kernel, write ONE new conv-update kernel
(OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten
supports_op + the op-handler branch + (Vulkan) the pipeline/push-constant/
descriptor wiring. The builders, CPU refs, model graph and tests are shared and
already done.
- Bit-exactness is feasible everywhere and per-backend by construction (the
fusions redirect addresses, not the f32 reduction order); `test-backend-ops`
(backendX-vs-CPU) is the gate.
- Sequence: ops-first PR (incl. the capability-driven replacement for 0030's
name allow-list), then Metal, then SYCL, then Vulkan.


@@ -0,0 +1,17 @@
model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64


@@ -0,0 +1,217 @@
// Paged-pool burst-degradation repro (patch 0024). DEV SCAFFOLDING ONLY.
//
// Reproduces, at the libllama level, the two host-side defects behind the
// "later lower-npl prefill collapses, decode fine, restart cures it" benchmark
// signature:
//
// * RECLAMATION GAP (Fix-1): a partial tail seq_rm(seq, p0>0, -1) - exactly
// what llama-server issues on every reused slot - frees the kv-cache CELLS
// but the paged manager keeps owning the trailing BLOCKS. The manager's
// free pool silently shrinks. Test A measures the reclaimed-block delta.
//
// * FRAGMENTATION / NO COMPACTION (Fix-2): a high-fan-out burst that allocates
// many sequences and frees them in a scrambled order leaves the free queue a
// scrambled permutation of physical block ids. A later low-npl prefill then
// pops physically scattered blocks, so its KV scatter-write + in-kernel
// paged-attention gather lose locality and prefill throughput collapses;
// decode (single-token append) barely notices. Test B times an npl8 prefill
// on a FRESH pool vs an npl8 prefill AFTER a scrambling burst+drain.
//
// PASS (post-fix): Test A reclaims ceil((PP-KEEP)/bs) trailing blocks on the
// partial seq_rm (0 pre-fix); Test B's post-burst npl8 prefill_tps is within ~10%
// of the fresh npl8 and num_free returns to the pristine value after the drain.
//
// Run with LLAMA_KV_PAGED=1. Env: BURST_NSLOT(64) NPL(8) PP(512) KEEP(256)
// GEN(4) PAGED_NGL(99). All sequences use distinct content so nothing is shared.
#include "llama.h"
#include "paged-prefix-api.h"

#include <chrono>
#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

static int env_i(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }

using clk = std::chrono::steady_clock;
static double secs(clk::time_point a, clk::time_point b) {
    return std::chrono::duration<double>(b - a).count();
}

struct Ctx { llama_context * ctx; llama_memory_t mem; llama_batch batch; int n_vocab; };

// Deterministic, content-distinct token for (seq, pos): keeps every sequence's
// blocks unique so no cross-request prefix sharing masks the accounting.
static llama_token tok_of(int seq, int pos, int n_vocab) {
    return (llama_token) (((seq * 1000003 + pos * 131 + 7) % (n_vocab - 200)) + 100);
}

// Prefill n tokens of seq at [pos0, pos0+n) in one ubatch (n <= n_batch).
// Returns wall seconds (sync'd), or -1 if llama_decode fails.
static double prefill(Ctx & C, int seq, int pos0, int n) {
    clk::time_point t0 = clk::now();
    C.batch.n_tokens = 0;
    for (int j = 0; j < n; ++j) {
        int i = C.batch.n_tokens;
        C.batch.token[i]    = tok_of(seq, pos0 + j, C.n_vocab);
        C.batch.pos[i]      = pos0 + j;
        C.batch.n_seq_id[i] = 1;
        C.batch.seq_id[i][0]= seq;
        C.batch.logits[i]   = (j + 1 == n) ? 1 : 0;
        C.batch.n_tokens++;
    }
    if (llama_decode(C.ctx, C.batch)) { fprintf(stderr, "prefill decode failed seq=%d\n", seq); return -1; }
    llama_synchronize(C.ctx);
    return secs(t0, clk::now());
}

// One decode step (single token) for seq at pos.
static void decode1(Ctx & C, int seq, int pos) {
    C.batch.n_tokens = 1;
    C.batch.token[0] = tok_of(seq, pos, C.n_vocab);
    C.batch.pos[0] = pos; C.batch.n_seq_id[0] = 1; C.batch.seq_id[0][0] = seq; C.batch.logits[0] = 1;
    if (llama_decode(C.ctx, C.batch)) fprintf(stderr, "decode1 failed seq=%d\n", seq);
}

int main(int argc, char ** argv) {
    std::setlocale(LC_NUMERIC, "C");
    const char * model_path = nullptr;
    for (int i = 1; i < argc; ++i) if (!strcmp(argv[i], "-m") && i + 1 < argc) model_path = argv[++i];
    if (!model_path) { fprintf(stderr, "usage: %s -m model.gguf\n", argv[0]); return 2; }

    const int NSLOT = env_i("BURST_NSLOT", 64);
    const int NPL = env_i("NPL", 8);
    const int PP = env_i("PP", 512);
    const int KEEP = env_i("KEEP", 256);
    const int GEN = env_i("GEN", 4);
    const int ngl = env_i("PAGED_NGL", 99);
    const bool paged = getenv("LLAMA_KV_PAGED") != nullptr;

    ggml_backend_load_all();
    llama_model_params mp = llama_model_default_params();
    mp.n_gpu_layers = ngl;
    llama_model * model = llama_model_load_from_file(model_path, mp);
    if (!model) { fprintf(stderr, "model load failed\n"); return 1; }
    const llama_vocab * vocab = llama_model_get_vocab(model);
    const int n_vocab = llama_vocab_n_tokens(vocab);

    // Pool sized for the burst plus headroom so the burst fits but a later npl
    // run draws from whatever the burst's churn left behind.
    const long cells = (long) (NSLOT + NPL + 4) * (PP + GEN + 16);
    llama_context_params cp = llama_context_default_params();
    cp.n_ctx = (uint32_t) cells;
    cp.n_batch = (uint32_t) (PP + 16);
    cp.n_ubatch = (uint32_t) (PP + 16);
    cp.n_seq_max = NSLOT + NPL + 2;
    cp.kv_unified = true; // one unified stream-0 pool -> num_free(ctx) is the whole pool
    cp.no_perf = true;
    llama_context * ctx = llama_init_from_model(model, cp);
    if (!ctx) { fprintf(stderr, "ctx init failed (cells=%ld)\n", cells); return 1; }

    Ctx C; C.ctx = ctx; C.mem = llama_get_memory(ctx); C.n_vocab = n_vocab;
    C.batch = llama_batch_init(cp.n_batch, 0, 1);

    printf("== paged-burst-bench == paged=%d NSLOT=%d NPL=%d PP=%d KEEP=%d GEN=%d n_ctx=%ld\n",
           paged, NSLOT, NPL, PP, KEEP, GEN, cells);

    llama_memory_clear(C.mem, true);
    const long F_start = paged_prefix_api::num_free_global();

    // ---- Test A: Fix-1 reclamation gap on a partial tail seq_rm --------------
    {
        // Bail out on a failed prefill, as Tests B and C do, so the block
        // counts below are never read after a failed decode.
        if (prefill(C, 0, 0, PP) < 0) return 1;
        const long f_after_prefill = paged_prefix_api::num_free_global();
        llama_memory_seq_rm(C.mem, 0, KEEP, -1); // partial tail removal
        const long f_after_rm = paged_prefix_api::num_free_global();
        llama_memory_seq_rm(C.mem, 0, -1, -1); // full free -> pristine
        const long f_after_full = paged_prefix_api::num_free_global();
        const long bs = 16; // block size assumed by this harness
        const long expect = (PP + bs - 1)/bs - (KEEP + bs - 1)/bs; // trailing blocks
        printf("[TEST-A Fix-1] start=%ld afterPrefill=%ld afterPartialRm=%ld reclaimed=%ld "
               "(expect %ld post-fix, 0 pre-fix) afterFullFree=%ld\n",
               F_start, f_after_prefill, f_after_rm, f_after_rm - f_after_prefill, expect, f_after_full);
    }

    // ---- Test B: fragmentation -> npl prefill collapse -----------------------
    // Fresh npl prefill baseline on a pristine pool.
    llama_memory_clear(C.mem, true);
    double tps_fresh;
    {
        clk::time_point t0 = clk::now();
        long ntok = 0;
        for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
        tps_fresh = ntok / secs(t0, clk::now());
        for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
    }
    const long F_pristine = paged_prefix_api::num_free_global();

    // High-fan-out burst: allocate NSLOT sequences, each prefilled + a few decode
    // steps (mixed alloc), then drain them in a scrambled order (odd ids first,
    // then even, each truncated before the full free) so the free queue becomes a
    // scrambled permutation - the fragmentation the bug never compacts.
    for (int s = 0; s < NSLOT; ++s) {
        if (prefill(C, NPL + s, 0, PP) < 0) return 1;
        for (int g = 0; g < GEN; ++g) decode1(C, NPL + s, PP + g);
    }
    const long F_during_burst = paged_prefix_api::num_free_global();

    // Drain: partial tail seq_rm (the reused-slot pattern) then full free, in a
    // scrambled slot order to scramble the physical free order.
    for (int parity = 1; parity >= 0; --parity)
        for (int s = 0; s < NSLOT; ++s) if ((s & 1) == parity) {
            llama_memory_seq_rm(C.mem, NPL + s, KEEP, -1); // partial (Fix-1 path)
            llama_memory_seq_rm(C.mem, NPL + s, -1, -1); // full free
        }
    const long F_after_drain = paged_prefix_api::num_free_global();

    // Post-burst npl prefill: pops from the (pre-fix scrambled / post-fix
    // defragged) free queue.
    double tps_post;
    {
        clk::time_point t0 = clk::now();
        long ntok = 0;
        for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
        tps_post = ntok / secs(t0, clk::now());
        for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
    }
    const double ratio = tps_fresh > 0 ? tps_post / tps_fresh : 0;
    printf("[TEST-B frag] num_free: start=%ld pristine=%ld duringBurst=%ld afterDrain=%ld "
           "(afterDrain==pristine? %s)\n",
           F_start, F_pristine, F_during_burst, F_after_drain,
           F_after_drain == F_pristine ? "YES" : "NO");
    printf("[TEST-B frag] prefill_tps fresh=%.1f post-burst=%.1f ratio=%.3f "
           "(PASS if >=0.90)\n", tps_fresh, tps_post, ratio);

    // ---- Test C: idle-slot retention leak -> reclaim (the Fix-3 scenario) -----
    // Burst NSLOT sequences and leave them IDLE (stock llama-server keeps an idle
    // slot's KV; the blocks are stranded). F_idle shows the depleted pool a later
    // low-npl run would see. Then full-seq_rm each (exactly what Fix-3's
    // prompt_clear() issues at slot.release): F_reclaimed must return to pristine.
    llama_memory_clear(C.mem, true);
    // Touch the pool once so the manager exists, then read the full-pool size
    // (num_free is 0 while no manager is registered).
    if (prefill(C, 0, 0, 16) < 0) return 1;
    llama_memory_seq_rm(C.mem, 0, -1, -1);
    const long F_pre_c = paged_prefix_api::num_free_global();
    for (int s = 0; s < NSLOT; ++s) { if (prefill(C, NPL + s, 0, PP) < 0) return 1; }
    const long F_idle = paged_prefix_api::num_free_global();
    for (int s = 0; s < NSLOT; ++s) llama_memory_seq_rm(C.mem, NPL + s, -1, -1); // Fix-3 release
    const long F_reclaimed = paged_prefix_api::num_free_global();
    printf("[TEST-C idle] pristine=%ld idle_after_burst=%ld (leaked=%ld) reclaimed=%ld "
           "(returns_to_fresh? %s)\n",
           F_pre_c, F_idle, F_pre_c - F_idle, F_reclaimed,
           F_reclaimed == F_pre_c ? "YES" : "NO");

    printf("RESULT paged=%d frag_fix2_ratio=%.3f drain_numfree_returns=%s idle_reclaim_returns=%s\n",
           paged, ratio,
           F_after_drain == F_pristine ? "YES" : "NO",
           F_reclaimed == F_pre_c ? "YES" : "NO");

    llama_batch_free(C.batch);
    llama_free(ctx);
    llama_model_free(model);
    return 0;
}


@@ -0,0 +1,59 @@
// Host-side unit test for the paged-pool burst-reclaim fix (patch 0024).
// Compiles paged-kv-manager.cpp directly; no ggml / llama / GPU dependency.
//
// Fix-1 PagedKVManager::truncate(seq, n_keep) reclaims the trailing blocks
// beyond ceil(n_keep/bs) (ref-counted), so a partial tail seq_rm no
// longer strands blocks whose cells were cleared.
// Fix-2 defrag_free_pool() relinks the free queue into ascending block-id
// order once the pool is fully idle, undoing a burst's scrambled frees
// so a later prefill pops physically contiguous blocks again.
#include "paged-kv-manager.h"

#include <cstdio>

using paged::PagedKVManager;

int main() {
    int rc = 0;

    // ---- Fix-1: truncate reclaims the trailing block suffix -----------------
    {
        PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/true);
        const size_t f0 = m.num_free_blocks(); // 63 (block 0 reserved as null)
        m.allocate(0, 512); // ceil(512/16)=32 blocks
        const size_t f1 = m.num_free_blocks(); // 31
        m.truncate(0, 256); // keep ceil(256/16)=16, free 16
        const size_t f2 = m.num_free_blocks(); // 47
        printf("[unit Fix-1] free=%zu alloc512=%zu truncate256=%zu reclaimed=%zu (expect 16)\n",
               f0, f1, f2, f2 - f1);
        if (f2 - f1 != 16) rc = 1;
        m.truncate(0, 16); // keep 1 block, free 15 more
        const size_t f3 = m.num_free_blocks(); // 62
        printf("[unit Fix-1] truncate16=%zu (expect %zu)\n", f3, f0 - 1);
        if (f3 != f0 - 1) rc = 1;
        m.free(0);
        if (m.num_free_blocks() != f0) { printf("[unit Fix-1] free mismatch\n"); rc = 1; }
    }

    // ---- Fix-2: defrag restores ascending popleft order ---------------------
    {
        PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/false);
        for (int s = 0; s < 8; ++s) m.allocate(s, 16); // pop blocks 1..8
        const int scrambled[8] = {3, 7, 1, 5, 0, 6, 2, 4}; // free out of order
        for (int i = 0; i < 8; ++i) m.free(scrambled[i]);
        m.defrag_free_pool(); // all idle -> compact
        m.allocate(100, 16 * 3); // pop 3 blocks
        const auto bt = m.block_table(100);
        bool asc = true;
        printf("[unit Fix-2] post-defrag block_table:");
        for (size_t i = 0; i < bt.size(); ++i) {
            printf(" %d", bt[i]);
            if (i && bt[i] < bt[i - 1]) asc = false;
        }
        printf(" ascending=%s (expect YES)\n", asc ? "YES" : "NO");
        if (!asc) rc = 1;
    }

    printf("UNIT %s\n", rc == 0 ? "PASS" : "FAIL");
    return rc;
}

Binary file not shown (image, 88 KiB).

Binary file not shown (image, 89 KiB).