mirror of
https://github.com/mudler/LocalAI.git
synced 2026-06-27 18:06:58 -04:00
chore(paged): keep patches/ patch-only; README to backend root, docs to docs/
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv,
dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv.
Restore the invariant that patches/ holds only the .patch series.
Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md,
final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)
Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)
Repoint every reference to the moved files: README internal links (docs/ + the
.github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md,
.github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml,
the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml,
docs/content/features/backends.md, gallery/index.yaml.
The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged)
is unchanged and still resolves to the 28 patches.
Assisted-by: Claude:opus-4.8 [Claude Code]
Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
@@ -0,0 +1,514 @@
# Plan: ship the paged llama.cpp as its OWN backend + NVFP4 Qwen3.6 gallery items

Scoping deliverable only. NOTHING is changed by this document. It is grounded in the
actual repo structure (read 2026-06-26 in worktree feat+paged-attention), not assumptions.

SHIPPED REALITY (update 2026-06-27): the backend ships CUDA-only. The matrix rows and
the index.yaml meta-backend keep ONLY the CUDA/cublas variants (cuda-12, cuda-13, and
the nvidia-l4t arm64 cuda-12/cuda-13 Jetson rows). The cpu / vulkan / sycl / hipblas /
metal-darwin variants discussed below as optional/phase-2 were NOT shipped (and the
darwin row was removed): off CUDA the patchset's gains are gated off, so it is
neutral-to-negative there, and non-CUDA users should use the stock llama-cpp backend
(README 4c).

================================================================================
0. GROUND TRUTH (what the repo actually does today)
================================================================================

The paged patchset is ALREADY integrated into the stock llama-cpp backend in this
worktree. Two mechanisms, both already present:

(a) BUILD: backend/cpp/llama-cpp/Makefile has `LLAMA_PAGED?=on`. The `llama.cpp:`
    target git-applies patches/0*.patch (base series) then, when LLAMA_PAGED != off,
    patches/paged/0*.patch (the 0018-0023 paged series + the earlier 0001-0017).
    prepare.sh has a fallback `patch`-based apply guarded by a sentinel
    (llama.cpp/src/paged-kv-manager.cpp). So a stock `make backends/llama-cpp` TODAY
    already ships the paged engine compiled in.

(b) RUNTIME GATING: backend/cpp/llama-cpp/grpc-server.cpp ALREADY carries the option
    hooks (lines ~752-842). They only call setenv() before context init:
      - option `kv_paged` / `paged_kv` / `paged_attention`     -> setenv LLAMA_KV_PAGED=1
      - option `kv_paged_debug` / `paged_kv_debug`             -> setenv LLAMA_KV_PAGED_DEBUG=1
      - option `max_prefill_tokens` / `mpt` / `prefill_budget` -> setenv LLAMA_PREFILL_BUDGET
      - option `max_batch_tokens` / `mbt`                      -> setenv LLAMA_MAX_BATCH_TOKENS
      - option `prefill_cap`                                   -> setenv LLAMA_PREFILL_CAP
    Against UNPATCHED llama.cpp these setenv() calls are inert (nothing reads the env),
    so grpc-server.cpp is byte-safe to share between a clean build and a paged build.
    The paged engine itself lives entirely inside the patched llama.cpp lib
    (paged-kv-manager.cpp etc.), NOT in grpc-server.cpp.
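
The alias-to-environment mapping in (b) can be pictured as a lookup table. The sketch
below is illustrative JavaScript, not the C++ in grpc-server.cpp. The helper name
`optionsToEnv` is hypothetical, and two behaviours are assumptions made for the
example: flag options export "1" only for the values "true" or "1", and value options
pass their string through unchanged. The source only states which env var each alias
sets.

```javascript
// Illustrative model of the option -> env mapping listed in (b).
// ASSUMPTION: flag options export "1" only when the value is "true" or "1".
const FLAG_OPTIONS = new Map([
  ["kv_paged", "LLAMA_KV_PAGED"],
  ["paged_kv", "LLAMA_KV_PAGED"],
  ["paged_attention", "LLAMA_KV_PAGED"],
  ["kv_paged_debug", "LLAMA_KV_PAGED_DEBUG"],
  ["paged_kv_debug", "LLAMA_KV_PAGED_DEBUG"],
]);

// ASSUMPTION: value options forward the raw string after the first ":".
const VALUE_OPTIONS = new Map([
  ["max_prefill_tokens", "LLAMA_PREFILL_BUDGET"],
  ["mpt", "LLAMA_PREFILL_BUDGET"],
  ["prefill_budget", "LLAMA_PREFILL_BUDGET"],
  ["max_batch_tokens", "LLAMA_MAX_BATCH_TOKENS"],
  ["mbt", "LLAMA_MAX_BATCH_TOKENS"],
  ["prefill_cap", "LLAMA_PREFILL_CAP"],
]);

// optionsToEnv (hypothetical helper) takes "key:value" strings and returns the
// environment variables that would be set before context init.
function optionsToEnv(options) {
  const env = {};
  for (const opt of options) {
    const sep = opt.indexOf(":");
    const key = sep === -1 ? opt : opt.slice(0, sep);
    const value = sep === -1 ? "" : opt.slice(sep + 1);
    if (FLAG_OPTIONS.has(key)) {
      if (value === "true" || value === "1") env[FLAG_OPTIONS.get(key)] = "1";
    } else if (VALUE_OPTIONS.has(key)) {
      env[VALUE_OPTIONS.get(key)] = value;
    }
    // Any other option (for example parallel) sets no env var in this model.
  }
  return env;
}
```

For instance, `optionsToEnv(["paged_kv:true", "mbt:512", "parallel:128"])` yields only
LLAMA_KV_PAGED and LLAMA_MAX_BATCH_TOKENS. Against an unpatched llama.cpp nothing
reads either variable, which is why the hooks are safe to share.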

Conclusion: "stock llama-cpp + paged patchset, runtime-gated" is the CURRENT state of
ONE backend. The task is to SPLIT that into two backends:
  - llama-cpp = clean upstream llama.cpp (de-risked: a dep-bump can never break on a
    paged hook), grpc-server.cpp keeps the dormant hooks.
  - <newname> = stock grpc-server.cpp + paged patch series applied + paged on.

The turboquant backend is the EXACT precedent for "a llama.cpp variant that reuses the
backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile + its own Dockerfile
+ its own matrix rows". Copy turboquant's shape, with two simplifications (see section 1).

CPU_ALL_VARIANTS reuse: backend/cpp/llama-cpp/Makefile already has `llama-cpp-cpu-all`
(one grpc-server + dlopen libggml-cpu-*.so via -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS,
SHARED_LIBS=ON make-var). turboquant mirrors it with `turboquant-cpu-all`. The new backend
gets the same single-build CPU target for free by reusing the same Makefile machinery.

--------------------------------------------------------------------------------
RECOMMENDED BACKEND NAME: `llama-cpp-paged` (see section 4 for the full rationale)
--------------------------------------------------------------------------------
Everywhere below, NAME = llama-cpp-paged, DOCKERFILE = Dockerfile.llama-cpp-paged,
SRC DIR = backend/cpp/llama-cpp-paged/, MAKE VAR = BACKEND_LLAMA_CPP_PAGED.
DO NOT use the dotted working name `localai-llama.cpp`: a dot in Dockerfile.<suffix> and
in the tag-suffix is unprecedented (every sibling is hyphenated: llama-cpp, ik-llama-cpp,
turboquant, ds4) and complicates the changed-backends.js endsWith() suffix matching.

================================================================================
1. NEW BACKEND - file by file
================================================================================

--------------------------------------------------------------------------------
1.1 backend/cpp/llama-cpp/Makefile (the ONE necessary touch to stock)
--------------------------------------------------------------------------------
Change exactly one default so the STOCK image ships clean against upstream:

    -LLAMA_PAGED?=on
    +LLAMA_PAGED?=off

Why: this is the entire point of the split - stock llama-cpp must build clean so an
upstream LLAMA_VERSION bump can never fail on a paged hook. The runtime hooks in
grpc-server.cpp stay (inert). The new backend forces LLAMA_PAGED=on explicitly (1.2), so
it does not depend on this default. NOTE this DOES change stock's shipped artifact (it
currently ships paged-compiled-in-but-gated); that is intended de-risking, call it out in
the PR. If the team prefers stock literally untouched, the alternative is to leave
`?=on` and accept that stock keeps carrying the patch series - but then "clean stock" is
not achieved. Recommendation: flip to off.

(No other change to backend/cpp/llama-cpp/ - grpc-server.cpp, CMakeLists.txt, prepare.sh,
patches/, patches/paged/ are all reused as-is by the new backend.)

--------------------------------------------------------------------------------
1.2 backend/cpp/llama-cpp-paged/Makefile (NEW - thin wrapper, model on turboquant)
--------------------------------------------------------------------------------
Mirror backend/cpp/turboquant/Makefile, but SIMPLER (two things turboquant needs that we
do NOT):
  - turboquant overrides LLAMA_REPO/LLAMA_VERSION to a fork. We use the SAME upstream pin
    as stock (it lives in backend/cpp/llama-cpp/Makefile, already auto-bumped). So we do
    NOT set LLAMA_VERSION here -> no bump_deps.yaml entry needed (big simplification vs
    turboquant). We only force LLAMA_PAGED=on.
  - turboquant runs patch-grpc-server.sh (augments the KV-cache type allow-list) and
    apply-patches.sh (fork catch-up). We need NEITHER: grpc-server.cpp already has the
    paged hooks, and the paged patch series is applied by the copied llama-cpp Makefile's
    own `llama.cpp:` target when LLAMA_PAGED=on.

Shape (one flavor shown; replicate the turboquant flavor set: avx/avx2/avx512/fallback/
cpu-all/grpc/rpc-server):

    LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp

    define paged-build # $(1)=flavor $(2)=cmake flags $(3)=target
        rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
        cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build
        $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build purge
        # clone upstream + apply base AND paged patch series (LLAMA_PAGED=on forces it)
        LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build llama.cpp
        CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_PAGED=on \
            $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build grpc-server
        cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build/grpc-server llama-cpp-paged-$(1)
    endef

    llama-cpp-paged-cpu-all:
        # identical to turboquant-cpu-all: SHARED_LIBS=ON + GGML_BACKEND_DL + CPU_ALL_VARIANTS
        # + --target ggml; then collect ggml-shared-libs/ for package.sh to bundle.
        ... LLAMA_PAGED=on SHARED_LIBS=ON \
            EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" \
            TARGET="--target grpc-server --target ggml" ...

    package: ; bash package.sh
    purge: ; rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-*-build; rm -rf llama-cpp-paged-* package
    clean: purge

Binaries are named llama-cpp-paged-{cpu-all,fallback,grpc,rpc-server,...} so run.sh and
package.sh glob them.

--------------------------------------------------------------------------------
1.3 backend/cpp/llama-cpp-paged/run.sh (NEW - copy turboquant/run.sh, rename binaries)
--------------------------------------------------------------------------------
s/turboquant/llama-cpp-paged/g. Prefers llama-cpp-paged-cpu-all if present, falls back to
llama-cpp-paged-fallback; uses llama-cpp-paged-grpc when LLAMACPP_GRPC_SERVERS is set;
keeps the Darwin DYLD_LIBRARY_PATH branch and the lib/ld.so launch. Keep verbatim otherwise.

--------------------------------------------------------------------------------
1.4 backend/cpp/llama-cpp-paged/package.sh (NEW - copy turboquant/package.sh, rename)
--------------------------------------------------------------------------------
s/turboquant/llama-cpp-paged/g. Copies llama-cpp-paged-* into package/, bundles
ggml-shared-libs/*.so* into package/lib (the CPU_ALL_VARIANTS dlopen set), copies run.sh,
and the per-arch libc/ld.so set (unchanged).

--------------------------------------------------------------------------------
1.5 backend/Dockerfile.llama-cpp-paged (NEW - copy Dockerfile.turboquant, swap paths)
--------------------------------------------------------------------------------
Identical 3-stage structure (builder-fromsource / builder-prebuilt / FROM scratch). Edits:
  - bind/run .docker/llama-cpp-paged-compile.sh (new, 1.6) instead of turboquant-compile.sh
  - ccache id: id=llama-cpp-paged-ccache-${TARGETARCH}-${BUILD_TYPE}
    (OPTIONAL OPTIMIZATION: set id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE} to SHARE
    stock llama-cpp's ccache - the paged TUs are mostly byte-identical to stock, so a warm
    stock cache would give the paged build near-free object reuse. Trade-off: a regression
    in one could surface as a cold miss in the other. Recommend sharing; revisit if noisy.)
  - both `make -BC /LocalAI/backend/cpp/llama-cpp-paged package` invocations
  - final COPY --from=builder /LocalAI/backend/cpp/llama-cpp-paged/package/. ./

--------------------------------------------------------------------------------
1.6 .docker/llama-cpp-paged-compile.sh (NEW - copy llama-cpp-compile.sh, swap make targets)
--------------------------------------------------------------------------------
Identical to .docker/llama-cpp-compile.sh except `cd .../llama-cpp-paged` and call
`make llama-cpp-paged-cpu-all` (BUILD_TYPE empty / CPU) or `make llama-cpp-paged-fallback`
(GPU), then `make llama-cpp-paged-grpc` + `make llama-cpp-paged-rpc-server`. Keep the
arm64 gcc-14 apt step (CPU_ALL_VARIANTS armv9.2 SME needs gcc-14). ccache export unchanged.

--------------------------------------------------------------------------------
1.7 Makefile (top-level) - 6 edits, mirror the turboquant lines
--------------------------------------------------------------------------------
a) .NOTPARALLEL (line 2): append `backends/llama-cpp-paged`
b) Backend def (after BACKEND_TURBOQUANT, line ~1172):
       # llama-cpp-paged = stock llama.cpp grpc-server + LocalAI paged-attention patch
       # series (LLAMA_PAGED=on). Reuses backend/cpp/llama-cpp sources via a thin wrapper.
       BACKEND_LLAMA_CPP_PAGED = llama-cpp-paged|llama-cpp-paged|.|false|false
   (lang field `llama-cpp-paged` -> Dockerfile.llama-cpp-paged, matching the
   llama-cpp / ik-llama-cpp / turboquant convention where lang==backend name.)
c) generate-docker-build-target eval (after BACKEND_TURBOQUANT, line ~1273):
       $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_PAGED)))
d) docker-build-backends (line ~1337): append docker-build-llama-cpp-paged
e) test-extra-backend-llama-cpp-paged target (mirror test-extra-backend-turboquant,
   line ~673): BACKEND_IMAGE=local-ai-backend:llama-cpp-paged $(MAKE) test-extra-backend
f) (optional) backends/llama-cpp-paged-darwin target if shipping metal (mirror
   backends/llama-cpp-darwin at line 1124; see 1.11).

--------------------------------------------------------------------------------
1.8 .github/backend-matrix.yml - add rows (mirror every llama-cpp row, swap names)
--------------------------------------------------------------------------------
For EACH variant you choose to ship (see phased recommendation in section 4), add a row
copied from the corresponding llama-cpp row with:
  - backend: "llama-cpp-paged"
  - dockerfile: "./backend/Dockerfile.llama-cpp-paged"
  - tag-suffix: swap `-llama-cpp` -> `-llama-cpp-paged`
    (e.g. -cpu-llama-cpp -> -cpu-llama-cpp-paged;
    -gpu-nvidia-cuda-12-llama-cpp -> -gpu-nvidia-cuda-12-llama-cpp-paged; etc.)
  - builder-base-image: UNCHANGED - reuse the same base-grpc-* tags as llama-cpp
    (this backend compiles the same gRPC + same toolchain; no new base-images.yml variant
    is needed, so NO base-images bootstrap step). This is the cheap-variant payoff.
  - CPU: TWO per-arch rows (amd64 ubuntu-latest + arm64 ubuntu-24.04-arm) sharing
    tag-suffix '-cpu-llama-cpp-paged' so changed-backends.js emits a merge-matrix entry and
    backend-merge-jobs assembles the manifest list. Same per-arch native + manifest-merge
    pattern as -cpu-llama-cpp.
  - Darwin (if shipping): add to includeDarwin:
        - backend: "llama-cpp-paged"
          tag-suffix: "-metal-darwin-arm64-llama-cpp-paged"
          lang: "go"
    (omit build-type, exactly like the llama-cpp darwin row at line 4908.)

REMINDER: the CI path filter only builds a backend on a PR when a file under its dir
changes. The PR that adds this backend touches backend/cpp/llama-cpp-paged/* so it
self-triggers. But also add the cross-trigger in 1.9 so future edits to
backend/cpp/llama-cpp/ (the shared source) retrigger this backend too.
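
The row-copy rule in 1.8 is mechanical enough to sketch. The helper `toPagedRow`
below is hypothetical, and the sample row in the usage note is made up for illustration
(it is not a copy of a real entry in .github/backend-matrix.yml). Only the three fields
named in the bullets are rewritten; everything else, including builder-base-image, is
carried over untouched.

```javascript
// Hypothetical helper: derive a llama-cpp-paged matrix row from a llama-cpp row.
function toPagedRow(row) {
  const suffix = row["tag-suffix"];
  // Guard: only rows whose tag-suffix ends in "-llama-cpp" qualify for the swap.
  if (!suffix.endsWith("-llama-cpp")) {
    throw new Error(`not a llama-cpp row: ${suffix}`);
  }
  return {
    ...row, // builder-base-image and all other fields stay unchanged
    backend: "llama-cpp-paged",
    dockerfile: "./backend/Dockerfile.llama-cpp-paged",
    "tag-suffix": `${suffix}-paged`,
  };
}
```

Given a sample row with tag-suffix "-gpu-nvidia-cuda-12-llama-cpp", the result carries
"-gpu-nvidia-cuda-12-llama-cpp-paged" and the same builder-base-image as the input.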

--------------------------------------------------------------------------------
1.9 scripts/changed-backends.js - three edits (mirror turboquant exactly)
--------------------------------------------------------------------------------
a) inferBackendPath(): add BEFORE the generic `endsWith("llama-cpp")` branch (line 56),
   next to the turboquant branch (line 45):
       if (item.dockerfile.endsWith("llama-cpp-paged")) {
         // reuses backend/cpp/llama-cpp sources via a thin wrapper Makefile
         return `backend/cpp/llama-cpp-paged/`;
       }
   ORDER MATTERS: "Dockerfile.llama-cpp-paged".endsWith("llama-cpp") is false today, but
   keep the specific branch first regardless (defensive, and returns the right path).
b) inferBackendPathDarwin(): add a case (next to the llama-cpp one at line 66):
       if (item.backend === "llama-cpp-paged") { return `backend/cpp/llama-cpp-paged/`; }
c) Per-backend cross-trigger (line 274-278, mirror the turboquant block):
       if (backend === "llama-cpp-paged" && !changed) {
         changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/"));
       }
Verify: node -e "... e.dockerfile.endsWith('llama-cpp-paged') ..." per adding-backends.md.
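
The suffix behaviour claimed in (a) and the cross-trigger in (c) can be checked in
isolation. The sketch below is a simplified stand-in, not the real
scripts/changed-backends.js: it keeps only the two branches discussed here, the path
returned by the generic llama-cpp branch is an assumption, and `isPagedChanged` is a
hypothetical name that folds the backend's own path check and the cross-trigger into
one function.

```javascript
// Simplified stand-in for inferBackendPath(); the real function has more branches.
function inferBackendPath(item) {
  // Specific branch first, as recommended in (a).
  if (item.dockerfile.endsWith("llama-cpp-paged")) {
    return "backend/cpp/llama-cpp-paged/";
  }
  // ASSUMPTION: the generic branch maps to the stock source directory.
  if (item.dockerfile.endsWith("llama-cpp")) {
    return "backend/cpp/llama-cpp/";
  }
  return null;
}

// Hypothetical helper combining the backend's own path filter with the
// cross-trigger from (c): edits to the shared stock sources also count.
function isPagedChanged(changedFiles) {
  const own = changedFiles.some((f) => f.startsWith("backend/cpp/llama-cpp-paged/"));
  const shared = changedFiles.some((f) => f.startsWith("backend/cpp/llama-cpp/"));
  return own || shared;
}
```

The trailing slash in "backend/cpp/llama-cpp/" matters: without it, the shared-source
prefix would also match every file under backend/cpp/llama-cpp-paged/.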

--------------------------------------------------------------------------------
1.10 backend/index.yaml - meta + image entries (META-BACKEND - capabilities map, NO uri)
--------------------------------------------------------------------------------
GOTCHA (project_backend_meta_gotcha): a backend that ships per-platform images MUST be a
meta backend = an anchor with a `capabilities:` map and NO top-level `uri:`; the concrete
per-platform entries carry the uri. Copy the *llamacpp anchor (lines 3-31).

Step a - meta anchor in `## metas` (after *turboquant, ~line 74):
    - &llamacpppaged
      name: "llama-cpp-paged"
      alias: "llama-cpp-paged"
      license: mit
      icon: <same as llama-cpp>
      description: |
        LocalAI's paged-attention llama.cpp: on-demand paged KV cache + decode-first
        prefill budget. Stock llama.cpp grpc-server + the LocalAI paged patch series.
        Tuned for NVFP4 dense/MoE on Blackwell/GB10. Reuses the llama-cpp gRPC server.
      urls: [ https://github.com/ggerganov/llama.cpp ]
      tags: [ text-to-text, LLM, CPU, GPU, CUDA, Metal, paged-attention, nvfp4 ]
      capabilities:
        default: "cpu-llama-cpp-paged"
        nvidia: "cuda12-llama-cpp-paged"
        nvidia-cuda-12: "cuda12-llama-cpp-paged"
        nvidia-cuda-13: "cuda13-llama-cpp-paged"
        nvidia-l4t: "nvidia-l4t-arm64-llama-cpp-paged"
        nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-paged"
        nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-paged"
        metal: "metal-llama-cpp-paged"
        # add amd/intel/vulkan keys ONLY for variants you actually build (section 4)

Step b - a `-development` meta (mirror llama-cpp-development, line 1611) with the same
capabilities map pointing at the `*-development` image names.

Step c - concrete image entries at end of file (mirror the llama-cpp block lines
2106-2200), one latest + one development per variant, each as:
    - !!merge <<: *llamacpppaged
      name: "cpu-llama-cpp-paged"
      uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-paged"
      mirrors: [ localai/localai-backends:latest-cpu-llama-cpp-paged ]
    - !!merge <<: *llamacpppaged
      name: "cpu-llama-cpp-paged-development"
      uri: "quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp-paged"
      mirrors: [ localai/localai-backends:master-cpu-llama-cpp-paged ]
    ...repeat for cuda12 / cuda13 / l4t / metal etc.
The `latest-` / `master-` uri prefix + tag-suffix MUST match the matrix tag-suffix exactly.
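
Because the uri is just the registry, a channel prefix, and the matrix tag-suffix, the
"MUST match" rule can be expressed as a small check. This is a sketch under stated
assumptions: `imageUri` and `checkEntry` are hypothetical helpers, and the rule that a
name ending in "-development" maps to the "master" channel is inferred from the two
example entries above rather than taken from any LocalAI tooling.

```javascript
// Registry taken from the example entries in step c.
const REGISTRY = "quay.io/go-skynet/local-ai-backends";

// Build the expected image uri for a channel ("latest" or "master") and a
// matrix tag-suffix such as "-cpu-llama-cpp-paged".
function imageUri(channel, tagSuffix) {
  return `${REGISTRY}:${channel}${tagSuffix}`;
}

// ASSUMPTION: "-development" entries track master, all others track latest.
function checkEntry(entry, tagSuffix) {
  const channel = entry.name.endsWith("-development") ? "master" : "latest";
  return entry.uri === imageUri(channel, tagSuffix);
}
```

Running `checkEntry` over each concrete entry with its matrix tag-suffix would flag a
mistyped suffix before the image pull fails at install time.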

--------------------------------------------------------------------------------
1.11 Darwin (only if shipping metal; the NVFP4 target is CUDA, so metal is optional/phase 2)
--------------------------------------------------------------------------------
If metal is shipped, also:
  - scripts/build/llama-cpp-paged-darwin.sh (copy scripts/build/llama-cpp-darwin.sh; it
    drives the 3 CMake variants + otool dylib bundling). Ensure it forces LLAMA_PAGED=on.
  - Makefile `backends/llama-cpp-paged-darwin` target (mirror backends/llama-cpp-darwin).
  - backend_build_darwin.yml: add the llama-cpp-paged branch (mirror the llama-cpp-specific
    step that calls `make backends/llama-cpp-darwin`).
  - index.yaml metal-llama-cpp-paged / -development image entries (already in 1.10).
  - C++ proto gotcha already handled (reuses llama-cpp CMakeLists.txt with hw_grpc_proto
    linking protobuf/grpc++), so no Homebrew-include failure.

--------------------------------------------------------------------------------
1.12 Importer / /backends/known dropdown (drop-in, NOT a new importer)
--------------------------------------------------------------------------------
This backend consumes GGUF exactly like llama-cpp -> extend the EXISTING importer, do not
add a new one (per adding-backends.md rule 2). Edit core/gallery/importers/llama-cpp.go:
  - AdditionalBackends() (line 37): append
        {Name: "llama-cpp-paged", Modality: "text",
         Description: "Paged-attention llama.cpp (on-demand paged KV + decode-first budget)"}
  - Import() backend allow-list (line 133): add "llama-cpp-paged" to the switch case so a
    preferences.backend == "llama-cpp-paged" is honored:
        case "ik-llama-cpp", "turboquant", "llama-cpp-paged": backend = b
  - core/gallery/importers/importers_test.go: add a table case asserting the preference
    override emits backend: llama-cpp-paged (Ginkgo/Gomega; reuse an existing public GGUF
    HF fixture). Run `go test ./core/gallery/importers/...`.

--------------------------------------------------------------------------------
1.13 Docs
--------------------------------------------------------------------------------
  - docs/content/features/backends.md: add llama-cpp-paged to the text-to-text/LLM list,
    one line noting paged KV + NVFP4 Blackwell tuning. (Not an in-house from-scratch engine
    -> it is a llama.cpp variant -> do NOT add to the README maintained-engines table.)

--------------------------------------------------------------------------------
1.14 Does grpc-server.cpp need the paged hooks? YES - already present, reused unchanged.
--------------------------------------------------------------------------------
The hooks (kv_paged / max_batch_tokens / prefill_budget / prefill_cap) are already in the
SHARED backend/cpp/llama-cpp/grpc-server.cpp. The paged backend reuses that file verbatim
(via the Makefile copy). No patch-grpc-server.sh step is needed (unlike turboquant). The
hooks are what translate the gallery `options:` (section 2) into the LLAMA_KV_PAGED /
LLAMA_MAX_BATCH_TOKENS env that the paged llama.cpp lib reads.

================================================================================
2. GALLERY ITEMS - NVFP4 Qwen3.6 dense + MoE
================================================================================

Add two entries to gallery/index.yaml. Schema (verified against existing GGUF items and
the LocalAI config structs): backend selection via `overrides.backend`; runtime knobs via
either typed config fields (context_size/f16/flash_attention/gpu_layers/batch) or the
`options:` string list (key:value, parsed by grpc-server.cpp set_option).

--------------------------------------------------------------------------------
2.1 Benchmark llama-server flags -> LocalAI model-config mapping
--------------------------------------------------------------------------------
  -c 131072                  -> context_size: 131072               (LLMConfig.ContextSize, yaml context_size)
  -fa on                     -> flash_attention: "on"              (LLMConfig.FlashAttention, yaml flash_attention; string)
  -ngl 99                    -> gpu_layers: 99                     (LLMConfig.NGPULayers, yaml gpu_layers; or omit -> DefaultNGPULayers offloads all)
  -b 2048                    -> batch: 2048                        (schema.PredictionOptions.Batch, yaml batch) [see caveat]
  --parallel 128             -> options: ["parallel:128"]          (grpc-server.cpp:629; alias n_parallel)
  LLAMA_KV_PAGED=1           -> options: ["paged_kv:true"]         (grpc-server.cpp:778)
  LLAMA_MAX_BATCH_TOKENS=512 -> options: ["max_batch_tokens:512"]  (grpc-server.cpp:821; alias mbt)
  f16 KV                     -> f16: true                          (LLMConfig.F16, yaml f16)
  (recommended for paged)    -> options: ["kv_unified:false"]      (grpc-server.cpp:746 - the per-slot paged
                                                                    capacity/memory benefit only materializes with a per-sequence cache;
                                                                    the patch comment explicitly recommends pairing paged with kv_unified:false)

CAVEAT (-ub 512): LocalAI sets params.n_ubatch = params.n_batch = request->nbatch()
(grpc-server.cpp:528,532). There is NO separate config field for n_ubatch, so the
benchmark's `-b 2048 -ub 512` split is NOT exactly reproducible. Options:
  (i)  set batch: 512 -> n_batch=n_ubatch=512 (matches -ub; the decode-first
       max_batch_tokens=512 budget is the dominant prefill lever anyway, and the
       benchmark states decode throughput is budget-independent), OR
  (ii) set batch: 2048 -> n_ubatch also 2048 (bigger physical batch, more KV scratch).
RECOMMEND (i) batch: 512 for the shipped gallery config (closest to the measured run +
lighter memory). Flag separately: a tiny grpc-server.cpp option `n_ubatch`/`ubatch` could
be added later to honor -b/-ub independently (not required to ship).
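
The trade-off between options (i) and (ii) comes down to which of the two benchmark
values a single `batch:` field can match. The sketch below models that coupling in
JavaScript; `effectiveBatch` and `diffFromBenchmark` are hypothetical names, and the
model covers only the coupling described in the caveat, not the real C++ parameter
handling.

```javascript
// One config value feeds both fields, as described in the caveat.
function effectiveBatch(configBatch) {
  return { n_batch: configBatch, n_ubatch: configBatch };
}

// Signed distance from the benchmark's -b / -ub values (0 means exact match).
function diffFromBenchmark(configBatch, bench) {
  const eff = effectiveBatch(configBatch);
  return {
    n_batch: eff.n_batch - bench.b,
    n_ubatch: eff.n_ubatch - bench.ub,
  };
}
```

With the benchmark at b=2048, ub=512, `batch: 512` matches n_ubatch exactly and
undershoots n_batch by 1536, while `batch: 2048` matches n_batch and overshoots
n_ubatch by 1536. Neither setting can zero both differences.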

--------------------------------------------------------------------------------
2.2 gallery/index.yaml entry - DENSE q36-27b-nvfp4
--------------------------------------------------------------------------------
    - name: "qwen3.6-27b-nvfp4-paged"
      url: "github:mudler/LocalAI/gallery/virtual.yaml@master"
      urls:
        - https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF # placeholder, section 3
      description: |
        Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF. Configured for LocalAI's
        paged-attention llama.cpp backend: on-demand paged KV + decode-first prefill budget.
        Benchmarked on GB10/DGX Spark at 90-117% of vLLM dense decode at 1.5-3x lower memory.
      license: "apache-2.0" # confirm vs Qwen license
      tags: [ llm, gguf, nvfp4, reasoning ]
      icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png
      overrides:
        backend: llama-cpp-paged
        f16: true
        flash_attention: "on"
        context_size: 131072
        gpu_layers: 99
        batch: 512 # see -ub caveat 2.1; matches the 512 ubatch floor
        known_usecases: [ chat ]
        options:
          - use_jinja:true
          - paged_kv:true # LLAMA_KV_PAGED=1
          - max_batch_tokens:512 # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget)
          - kv_unified:false # enables the per-slot paged capacity/memory benefit
          - parallel:128 # --parallel 128 serving slots
        parameters:
          model: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
        template:
          use_tokenizer_template: true
      files:
        - filename: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf
          sha256: <FILL after publish>
          uri: https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF/resolve/main/q36-27b-nvfp4.gguf

--------------------------------------------------------------------------------
2.3 gallery/index.yaml entry - MoE q36-35b-a3b-nvfp4
--------------------------------------------------------------------------------
Same shape; the MoE is lighter on memory (~3B active). parallel:128 + budget 256 was the
MoE decode-throughput sweet spot in the sweep, but 512 is fine as a default; if optimizing
purely for saturated MoE decode use max_batch_tokens:256.
    - name: "qwen3.6-35b-a3b-nvfp4-paged"
      urls: [ https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF ]
      ...
      overrides:
        backend: llama-cpp-paged
        f16: true
        flash_attention: "on"
        context_size: 131072
        batch: 512
        options:
          - use_jinja:true
          - paged_kv:true
          - max_batch_tokens:512 # or 256 for max saturated MoE decode (sweep winner)
          - kv_unified:false
          - parallel:128
        parameters:
          model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
      files:
        - filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf
          sha256: <FILL after publish>
          uri: https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF/resolve/main/q36-35b-a3b-nvfp4.gguf

Note: these are the BENCHMARK serving configs. For an interactive single-user default you
may want a second lighter gallery variant (context_size 16384, parallel 4, drop the budget)
- optional, not required to ship the benchmark reproduction.

================================================================================
3. GGUF PUBLISHING (so the gallery uri: resolves)
================================================================================

The two GGUFs already exist on the DGX dev box (final_benchmark.csv references
q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; README.md "Models" + "Benchmarks"
document provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4
tensors from nvidia modelopt weights). To publish:

1. HF repos (suggest two, under the org that owns the gallery-referenced weights):
       <ORG>/Qwen3.6-27B-NVFP4-GGUF     (single q36-27b-nvfp4.gguf)
       <ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF (single q36-35b-a3b-nvfp4.gguf)
   ORG = localai-org (brand) or mudler (personal); pick per ownership of the conversions.
2. Upload each .gguf; compute sha256 (sha256sum) and paste into the gallery `files:` sha256
   (LocalAI verifies it on download). Without sha256 the entry still works but loses the
   integrity check - fill it.
3. Model card metadata: base_model Qwen/Qwen3.6-*, library_name gguf, quantization NVFP4,
   pipeline_tag text-generation, license (confirm Qwen3.6 license terms - apache-2.0 vs
   Qwen community license), a note that it REQUIRES the llama-cpp-paged backend (NVFP4 +
   paged), and the GB10 benchmark table (link README.md "Benchmarks" numbers).
4. NVFP4 requires a llama.cpp new enough to read the NVFP4 GGUF type. Confirm the pinned
   LLAMA_VERSION in backend/cpp/llama-cpp/Makefile supports NVFP4 tensor types (the dev
   tree that produced the GGUFs did). If the current pin predates NVFP4 GGUF support, the
   backend pin must be bumped OR the paged patch series must carry the NVFP4 reader. THIS
   IS A GATING CHECK before the gallery items are usable - verify on a GPU box.
5. Provenance/licensing: the dense conversion derives from unsloth; the MoE from nvidia
   modelopt weights. Ensure redistribution of the converted GGUFs is permitted and
   attribute upstream in the card.
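
For step 2, `sha256sum` is the simplest route. Where the digest has to be produced from
a script instead, the following Node.js sketch does the same job with the built-in
crypto module. The helper names `sha256Hex` and `sha256File` are hypothetical; nothing
here is part of LocalAI, and the file path passed to `sha256File` is whatever local
path the .gguf was written to.

```javascript
const crypto = require("node:crypto");
const fs = require("node:fs");

// Hash an in-memory string or Buffer; useful for checking the helper itself.
function sha256Hex(data) {
  return crypto.createHash("sha256").update(data).digest("hex");
}

// Stream a large file through the hash so a multi-GB .gguf is never held
// in memory at once. Resolves with the lowercase hex digest.
function sha256File(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash("sha256");
    fs.createReadStream(path)
      .on("data", (chunk) => hash.update(chunk))
      .on("error", reject)
      .on("end", () => resolve(hash.digest("hex")));
  });
}
```

The hex string returned by `sha256File` is the value to paste into the `sha256:` field
of the gallery `files:` entry.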
|
||||
|
||||
================================================================================
|
||||
4. OPEN DECISIONS / BLOCKERS / BUILD COST
|
||||
================================================================================
|
||||
|
||||
BACKEND NAME - RECOMMEND `llama-cpp-paged`.
|
||||
- llama-cpp-paged (RECOMMENDED): descriptive (it IS the paged variant), hyphenated like
|
||||
every sibling (llama-cpp/ik-llama-cpp/turboquant/ds4), collision-free in the
|
||||
changed-backends.js endsWith() suffix scheme, self-documenting in the /backends/known
|
||||
importer dropdown. Reads correctly next to "turboquant" and "ik-llama-cpp".
|
||||
- localai-llama-cpp (branding alternative, ACCEPTABLE): keeps the LocalAI brand without a
|
||||
dot; hyphenated and safe. Use this if marketing wants "LocalAI's own llama.cpp" framing.
|
||||
Slightly less self-explanatory about WHAT differs (paged) in the dropdown.
|
||||
- localai-llama.cpp (the working name; NOT RECOMMENDED): the dot makes Dockerfile.localai-
|
||||
llama.cpp and tag-suffix -cpu-localai-llama.cpp the only dotted ones in the repo, and
|
||||
".cpp" looks like a file extension to the suffix matcher. Avoid.
|
||||
|
||||
BLOCKERS / GATING CHECKS (cannot be closed read-only, no GPU here):
|
||||
1. NVFP4 GGUF read support in the pinned LLAMA_VERSION (section 3.4). Must verify on GPU.
|
||||
If unsupported, bump the pin (which also affects stock llama-cpp) or carry the reader.
|
||||
2. The two GGUFs are not yet on HF (section 3). Gallery uri + sha256 are placeholders
|
||||
until upload. Blocks gallery validation only, not the backend build.
|
||||
3. -ub vs -b split (section 2.1) is not exactly reproducible without a tiny grpc-server
|
||||
option; shipped config uses batch:512. Minor, not a blocker.
|
||||
4. Flipping stock LLAMA_PAGED?=off changes stock's shipped artifact (de-risking, intended);
get explicit sign-off since it alters a heavily-used backend's build.
|
||||
|
||||
PLATFORM SHIP MATRIX (RECOMMENDED PHASING - the variant is cheap because it reuses the same
|
||||
base-grpc-* prebuilt bases and the same compile machinery, so each row is just CI minutes):
|
||||
Phase 1 (the benchmark target - GB10/Blackwell is CUDA):
|
||||
- cuda12 amd64, cuda13 amd64, cuda13 arm64 (sbsa), l4t-cuda-12 arm64 (NVFP4/paged win)
|
||||
- cpu-all amd64 + cpu-all arm64 (the single CPU_ALL_VARIANTS build; baseline coverage)
|
||||
Phase 2 (parity with stock llama-cpp coverage, only if demand):
|
||||
- metal-darwin-arm64 (1.11), vulkan amd64/arm64, rocm amd64, intel sycl f16/f32
|
||||
Defer rocm/sycl/vulkan/metal unless asked - the paged + NVFP4 story is GPU/CUDA-centric
|
||||
and these add CI cost without a clear consumer.
|
||||
|
||||
BUILD-COST ESTIMATE PER PLATFORM (with warm base-grpc-* base + ccache; the paged TUs are
|
||||
~byte-identical to stock so a SHARED ccache id makes most objects free):
|
||||
- CPU_ALL_VARIANTS (per arch): ~15-30 min warm / ~35-50 min cold. arm64 adds a gcc-14
|
||||
apt step. Two arches + a merge job.
|
||||
- CUDA (per arch): ~25-45 min warm / ~45-75 min cold (nvcc dominates; ccache helps less
|
||||
across CUDA arch flag changes). amd64 cuda12 + cuda13, arm64 cuda13 + l4t = 4 jobs.
|
||||
- Metal/Darwin (if Phase 2): native macos-14 runner, ~20-35 min with the ccache cache.
|
||||
- No base-images.yml change and no bootstrap dispatch (reuses existing base-grpc-* tags),
|
||||
so the only new CI cost is the per-row build minutes above. PR builds read cache, don't
|
||||
write; first master build per row pays the cold cost once, then warm.
|
||||
|
||||
VERIFICATION (post-implementation, needs a GPU box - out of scope here):
|
||||
- `make backends/llama-cpp-paged` builds + installs locally (from-source path).
|
||||
- Confirm stock `make backends/llama-cpp` now builds clean (no paged-kv-manager.cpp in the
|
||||
checkout) - proves the split.
|
||||
- Load a published NVFP4 GGUF via the gallery entry, hit /v1/chat/completions, confirm the
|
||||
server log shows LLAMA_KV_PAGED engaged (LLAMA_KV_PAGED_DEBUG trace) and the configured
|
||||
max_batch_tokens/parallel took effect.
|
||||
- go test ./core/gallery/importers/... green (importer drop-in case).
|
||||
- node scripts/changed-backends.js dry-run: editing backend/cpp/llama-cpp/* retriggers
|
||||
llama-cpp-paged (cross-trigger), editing backend/cpp/llama-cpp-paged/* triggers it too.
|
||||
|
||||
================================================================================
|
||||
END OF PLAN
|
||||
================================================================================
|
||||
@@ -0,0 +1,75 @@
|
||||
# Paged bit-exactness gate - per path (canonical references)
|
||||
|
||||
## TL;DR
|
||||
|
||||
The greedy decode of the **paged** path does not byte-match the **non-paged**
|
||||
path for the MoE model. This is a **benign FP-accumulation-order difference of
|
||||
the paged attention reduction**, KL-validated against the f16 reference. It is
|
||||
**not a bug**. The bit-exactness gate is therefore **per path**:
|
||||
|
||||
| path | model | canonical md5 |
|------|-------|---------------|
| non-paged | MoE q36-35b-a3b-nvfp4 | `07db32c2bcb78d17a43ed18bc22705cd` |
| paged | MoE q36-35b-a3b-nvfp4 | `8cb0ce23777bf55f92f63d0292c756b0` |
| non-paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` |
| paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) |
|
||||
|
||||
Gate command (chat-template / conversation path):
|
||||
```
llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
    -n 48 --temp 0 --seed 1
# paged: prefix with LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
```
|
||||
Note: use the default chat-template path (do **not** pass `-no-cnv`; raw
|
||||
completion lands in a different md5 namespace).
|
||||
|
||||
**Future paged-MoE regressions compare to the PAGED reference `8cb0ce23`, not to
|
||||
the non-paged `07db32c2`.** Dense is bit-exact across paths, so dense uses the
|
||||
single reference `5951a5b4`.
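
The per-path rule can be sketched as a lookup keyed on (path, model), assuming the gate compares the md5 of the raw decode bytes against the canonical values in the table above. `CANONICAL_MD5` and `gate_passes` are hypothetical names for illustration; the real gate is the `llama-completion ... | md5sum` command.

```python
import hashlib

# Canonical greedy-decode md5s, copied from the table in this note.
CANONICAL_MD5 = {
    ("non-paged", "q36-35b-a3b-nvfp4"): "07db32c2bcb78d17a43ed18bc22705cd",
    ("paged", "q36-35b-a3b-nvfp4"): "8cb0ce23777bf55f92f63d0292c756b0",
    ("non-paged", "q36-27b-nvfp4"): "5951a5b4d624ce891e22ab5fca9bc439",
    ("paged", "q36-27b-nvfp4"): "5951a5b4d624ce891e22ab5fca9bc439",
}


def gate_passes(path, model, decode_output):
    """Compare a decode's md5 to the reference for ITS OWN path, not the other one."""
    expected = CANONICAL_MD5[(path, model)]
    return hashlib.md5(decode_output).hexdigest() == expected
```

Keying on the path is the whole point: a paged-MoE run is never checked against the non-paged MoE digest.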
|
||||
|
||||
## Why dense is bit-exact but MoE is not
|
||||
|
||||
Dense paged decode reproduces the non-paged reduction order exactly, so dense
|
||||
greedy md5 is identical across paths. The MoE path runs additional kernels (the
|
||||
NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs
|
||||
between the paged and non-paged attention layouts. Over a long greedy decode this
|
||||
flips a small number of near-tied argmaxes, changing the byte stream. The same
|
||||
divergence is present on the 0028 baseline, with `LLAMA_MOE_FORCE_GRAPHS` on or
|
||||
off, and with the patch-0029 block-table cache on or off - it is a property of
|
||||
the paged attention path, not of any one lever.
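
A toy illustration of the mechanism described above, using Python doubles rather than the actual GPU kernels: the same three terms summed in two groupings give results one ulp apart, and that is enough to flip an argmax when another logit sits on the near-tie. The logit values are made up for the demonstration.

```python
def argmax(values):
    # max() keeps the first index on ties, like a left-to-right scan.
    return max(range(len(values)), key=values.__getitem__)


terms = [0.1, 0.2, 0.3]
left_to_right = (terms[0] + terms[1]) + terms[2]   # one reduction order
right_to_left = terms[0] + (terms[1] + terms[2])   # same math, other order

# Token 0 has a fixed logit; token 1's logit is the reduction above.
logits_order_a = [0.6, left_to_right]
logits_order_b = [0.6, right_to_left]

print(left_to_right == right_to_left)                  # False
print(argmax(logits_order_a), argmax(logits_order_b))  # 1 0
```

Both orders are equally valid roundings of the same sum, which is why the note treats the resulting md5 change as a reordering effect rather than a defect.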
|
||||
|
||||
## KL evidence that the paged path is sound (the load-bearing check)
|
||||
|
||||
`llama-perplexity --kl-divergence` on `q36-35b-a3b-nvfp4.gguf`, 16 chunks,
|
||||
`-c 512 -ngl 99 --seed 1`, base logits from the f16 reference
|
||||
(`darwin_36b_opus/f16.gguf`, PPL 7.3734):
|
||||
|
||||
| comparison | PPL(Q) | KL divergence | Same top p | Cor |
|------------|-------:|--------------:|-----------:|----:|
| f16 reference | 7.3734 | - | - | - |
| **non-paged** vs f16 | 7.3896 | 0.136597 +/- 0.003157 | 84.314% | 97.68% |
| **paged** vs f16 | 7.4009 | 0.136000 +/- 0.003285 | 84.828% | 97.58% |
| paged vs non-paged (direct) | 7.4009 (base 7.3818) | 0.050011 +/- 0.001653 | 89.044% | 99.04% |
|
||||
|
||||
Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%.
|
||||
|
||||
### Verdict: BENIGN
|
||||
|
||||
- **Paged does not diverge from the f16 ground truth more than non-paged does.**
|
||||
KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) =
|
||||
7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29
|
||||
error bars). A real paged-MoE correctness bug would push paged measurably
|
||||
*further* from f16; it does not (it is marginally closer).
|
||||
- **Paged and non-paged cluster together.** They agree with each other (KLD 0.050,
|
||||
89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, ~84% same-top-p),
|
||||
with essentially zero probability bias. That is the signature of two equivalent
|
||||
FP-reorderings of the same quantized model, both equally approximating the f16
|
||||
ground truth - not a quality regression.
|
||||
- The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that
|
||||
heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model
|
||||
logit near-ties are abundant, so a different-but-equivalent reduction order
|
||||
flips ~11% of argmaxes with no quality cost (proven by the equal KLD-to-f16 and
|
||||
zero Delta-p bias).
|
||||
|
||||
Therefore the canonical gate is per path, and `8cb0ce23` is the validated paged
|
||||
reference for the MoE deployment path.
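
For readers unfamiliar with the two headline metrics, here is a rough sketch of how a mean KL divergence and a same-top fraction can be computed from per-token logits. It assumes KL is taken as KL(base || test) in nats and that "same top p" means the two argmax tokens agree; `llama-perplexity` is the authoritative implementation and may differ in detail. All function names are illustrative.

```python
import math


def softmax(logits):
    peak = max(logits)                       # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; p is the base distribution, q the one under test."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def same_top(p, q):
    """True when both distributions pick the same most-likely token."""
    return p.index(max(p)) == q.index(max(q))


def summarize(pairs):
    """pairs: iterable of (base_probs, test_probs), one pair per token."""
    pairs = list(pairs)
    mean_kl = sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)
    top_rate = sum(same_top(p, q) for p, q in pairs) / len(pairs)
    return mean_kl, top_rate
```

The verdict above rests on exactly this pairing: a small direct KL with a lower same-top rate is what many near-tied argmaxes look like.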
|
||||
@@ -0,0 +1,86 @@
|
||||
# llama.cpp patch series — paged attention (vLLM-parity engine)
|
||||
|
||||
A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an
|
||||
in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION`
|
||||
(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the
|
||||
`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us
|
||||
**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small
|
||||
patch needs fixing, and the failure points at exactly which step the upstream change touched.
|
||||
|
||||
## Base
|
||||
|
||||
- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping
|
||||
the pin = re-run the regen workflow below and fix only the patches that no longer apply.
|
||||
|
||||
## The series (phases → patches)
|
||||
|
||||
| # | Patch | What | Verifies |
|---|-------|------|----------|
| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately |
| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip |
| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq |
| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM |
| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length |
| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes |
|
||||
|
||||
Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off
|
||||
(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so a partial series never changes stock behavior.
|
||||
|
||||
## Regen workflow (the anti-drift recipe)
|
||||
|
||||
```sh
# 1. check out the exact pin into a dev tree
git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev
git checkout <LLAMA_VERSION from ../Makefile>
git checkout -b paged

# 2. apply the current series (each becomes a commit), or develop the next patch
git am /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/0*.patch   # or `git apply` + commit per patch

# 3. iterate a phase as ONE commit, then export the whole series 1:1
git format-patch <LLAMA_VERSION>..paged -o /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/ --zero-commit -N

# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export.
```
|
||||
|
||||
## Build integration
|
||||
|
||||
The series is owned by this backend (`backend/cpp/llama-cpp-localai-paged`), not by the stock
|
||||
`llama-cpp` backend, which is pure upstream. `../Makefile` (the paged wrapper) clones the pinned
|
||||
`llama.cpp` via the copied stock build infra, then applies this series onto the cloned tree with the
|
||||
same strict `git apply` the stock build uses for base patches:
|
||||
```
for p in $(PAGED_PATCHES_DIR)/0*.patch; do git apply --verbose "$p" || exit 1; done
```
|
||||
All variants (avx/avx2/avx512/cuda/…) clone + apply into their own build copy, so the series ships
|
||||
everywhere without ever touching the stock `llama-cpp` source tree.
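
The strict apply loop can be modeled as follows, purely as a sketch for reasoning about failure behavior; the real build uses the shell loop shown above. `ordered_patches` and `apply_series` are hypothetical helpers, and the injectable `run` argument exists only so the logic can be exercised without a llama.cpp checkout.

```python
import subprocess
from pathlib import Path


def ordered_patches(patches_dir):
    """Patches matching 0*.patch; zero-padded names sort in numeric order."""
    return sorted(Path(patches_dir).glob("0*.patch"))


def apply_series(tree, patches, run=subprocess.run):
    """Apply patches in order; return the first patch that fails, or None."""
    for patch in patches:
        cmd = ["git", "apply", "--verbose", str(Path(patch).resolve())]
        result = run(cmd, cwd=tree)
        if result.returncode != 0:
            return patch  # stop here: later patches stack on this one
    return None
```

Returning the first failing patch mirrors the `|| exit 1` in the shell loop: the failure names the single step that an upstream change touched.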
|
||||
|
||||
## Status
|
||||
|
||||
- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`.
|
||||
- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is
|
||||
**token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing.
|
||||
- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form
|
||||
(see `../README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index
|
||||
subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on
|
||||
`llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or
|
||||
`llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B,
|
||||
**9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real
|
||||
compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`.
|
||||
- **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU
|
||||
flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's
|
||||
scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses
|
||||
the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit-
|
||||
identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what
|
||||
makes paged placement token-identical under flash-attn.**
|
||||
- 0004–0006 follow.
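
The key correctness finding above (gather indices sorted by token position) can be illustrated with a small model. This is a sketch only: `gather_idxs` here is a stand-in for the real `get_gather_idxs`, and the cell layout and values are invented for the example.

```python
def gather_idxs(cell_pos):
    """Return occupied cell indices ordered by token position.

    cell_pos[i] is the token position stored in physical cell i,
    or None when the cell is free.
    """
    occupied = [i for i, pos in enumerate(cell_pos) if pos is not None]
    return sorted(occupied, key=lambda i: cell_pos[i])


# Scattered paged placement: physical cell -> token position.
cell_pos = [2, None, 0, 3, None, 1]
values_by_pos = [0.1, 0.2, 0.3, 0.4]          # what a contiguous cache would hold
cells = [None if p is None else values_by_pos[p] for p in cell_pos]

idxs = gather_idxs(cell_pos)
gathered = [cells[i] for i in idxs]

# The gathered view has the same element order as the contiguous cache,
# so any left-to-right reduction over it rounds identically.
print(idxs, gathered == values_by_pos)
```

Because the gathered order matches the contiguous order element for element, an order-sensitive reduction sees the same operands in the same sequence.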
|
||||
|
||||
### Honest parity note (important)
|
||||
|
||||
This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not**
|
||||
by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel**
|
||||
(Lever 3: `mul_mat_q<MXFP4>` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not
|
||||
touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput
|
||||
gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see
|
||||
`../README.md`). So full vLLM parity = this series **AND** the
|
||||
kernel; neither alone suffices.
|
||||
101
backend/cpp/llama-cpp-localai-paged/docs/PIN_SYNC_c299a92c.md
Normal file
@@ -0,0 +1,101 @@
|
||||
# Pin-sync: paged patch-stack -> llama.cpp c299a92c
|
||||
|
||||
Status: COMPLETE. The shipped source-only paged patch series (`0001`-`0030`,
|
||||
28 `.patch` files) was advanced from llama.cpp `9d5d882d` to `c299a92c`
|
||||
("binaries : Improve rpc-server and export-graph-ops names. (#25045)"),
|
||||
GPU-rebuilt clean (CUDA sm_121 / GB10), and the bit-exact gate is GREEN on every
|
||||
path (dense + MoE, paged + non-paged) plus `test-backend-ops`. The 23-commit
|
||||
upstream jump `9d5d882d..c299a92c` did NOT change our decode output.
|
||||
|
||||
## Upstream jump
|
||||
|
||||
- OLD LocalAI paged pin: `9d5d882d8cd0f0a9283d87ed5e6fe3ee0d925fb1`
|
||||
("model : Add label for LFM2.5-230M (#25008)")
|
||||
- NEW LocalAI paged pin: `c299a92c38b6de6a1139617652b66081828648db`
|
||||
("binaries : Improve rpc-server and export-graph-ops names. (#25045)")
|
||||
- Upstream jump `9d5d882d..c299a92c` = **23 commits**.
|
||||
|
||||
## Re-export decision: NONE NEEDED - the source-only series applies STRICT-CLEAN at c299a92c
|
||||
|
||||
Unlike the `9d5d882d` sync (which needed 4 patch re-exports), this bump required
|
||||
**zero patch changes**. The already-shipped source-only series (the result of the
|
||||
`7e1832b8` strip that removed all stray dev-doc hunks) applies to a fresh clean
|
||||
`ggml-org/llama.cpp` checkout at `c299a92c` with the build's own **strict
|
||||
`git apply`** (the `apply-paged-patches` step in
|
||||
`backend/cpp/llama-cpp-localai-paged/Makefile`:
|
||||
`git apply --verbose "$p" || exit 1`) and reaches **exit 0** - every one of the
|
||||
28 patches reported "Applied patch ... cleanly", the sentinel
|
||||
`src/paged-kv-manager.cpp` was created, and there are **zero** stray
|
||||
`*_RESULTS.md` / `*_PROGRESS.md` in the resulting tree (source-only invariant
|
||||
intact). `git apply` tolerates `@@` line-number offsets, which absorbed the
|
||||
upstream drift; no hunk context broke.
|
||||
|
||||
Therefore the shipped `.patch` files are kept **byte-identical** (no churn). The
|
||||
patch tarball used for the verification has
|
||||
`sha256(cat 0*.patch | sort -V) = a99cc1fe4b66a7d0f4adcf9786bf2f9cda40792d7a6a01f36c4619369509114c`.
|
||||
|
||||
## Clean build
|
||||
|
||||
Fresh clone `~/llama-paged-c299/llama.cpp` @ `c299a92c` (NOT the dev tree), the
|
||||
28 patches applied as working-tree changes, then:
|
||||
|
||||
```
cmake -B build-cuda -DGGML_CUDA=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
    -DCMAKE_CUDA_ARCHITECTURES=121 -DGGML_CUDA_NCCL=ON -DGGML_CUDA_FA=ON \
    -DGGML_CUDA_GRAPHS=ON -DGGML_CCACHE=ON -DLLAMA_CURL=OFF -DCMAKE_BUILD_TYPE=Release
cmake --build build-cuda --target llama-completion test-backend-ops -j20
```
|
||||
|
||||
Result: configure exit 0 (ggml 0.15.3, commit `c299a92-dirty`), build exit 0,
|
||||
`build-cuda/bin/llama-completion` + `build-cuda/bin/test-backend-ops` produced.
|
||||
|
||||
## GATE: ALL GREEN
|
||||
|
||||
Gate command (locked - reproduces the dense baseline byte-for-byte on the OLD
|
||||
`9d5d882d` build too):
|
||||
```
llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \
    -n 48 --temp 0 --seed 1 </dev/null 2>/dev/null | md5sum
# paged dense: prefix LLAMA_KV_PAGED=1
# paged MoE: prefix LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1
```
|
||||
|
||||
(a) greedy md5 - all four paths PASS:
|
||||
| path | model | md5 @ c299a92c | baseline | verdict |
|------|-------|----------------|----------|---------|
| non-paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| non-paged | MoE `q36-35b-a3b-nvfp4` | `07db32c2bcb78d17a43ed18bc22705cd` | `07db32c2bcb78d17a43ed18bc22705cd` | PASS |
| paged | dense `q36-27b-nvfp4` | `5951a5b4d624ce891e22ab5fca9bc439` | `5951a5b4d624ce891e22ab5fca9bc439` | PASS |
| paged | MoE `q36-35b-a3b-nvfp4` | `8cb0ce23777bf55f92f63d0292c756b0` | `8cb0ce23777bf55f92f63d0292c756b0` (PAGED_BITEXACT_NOTE) | PASS |
|
||||
|
||||
(b) `test-backend-ops` (Backend CUDA0) - all PASS:
|
||||
| op | result |
|----|--------|
| SSM_CONV | 45/45 OK |
| SSM_CONV_UPDATE | 16/16 OK |
| SSM_CONV_UPDATE_IDS | 16/16 OK |
| GATED_DELTA_NET | 84/84 OK |
| MUL_MAT | 1146/1146 OK |
| MUL_MAT_ID | 806/806 OK |
|
||||
|
||||
(GATED_DELTA_NET grew 36/36 -> 84/84 vs the `9d5d882d` sync because the shipped
|
||||
series now carries patches `0026`/`0028`'s added per-head/gather test cases; all
|
||||
pass. SSM_CONV/MUL_MAT/MUL_MAT_ID counts match the prior sync exactly.)
|
||||
|
||||
Bit-exactness preserved across the 23-commit upstream jump.
|
||||
|
||||
## Canary
|
||||
|
||||
`.github/workflows/llama-cpp-paged-canary.yml` and
|
||||
`.github/scripts/paged-canary-apply.sh` now reference this doc. Because the
|
||||
series is source-only and applies strict-clean with no `--exclude`, the canary's
|
||||
`SSM_DECODE_FIX_RESULTS.md` workaround is now inert (the glob matches nothing in
|
||||
the shipped series) and may be removed on a future canary touch; left in place
|
||||
here to keep the pin-bump diff minimal.
|
||||
|
||||
## Source of truth
|
||||
|
||||
The shipped `.patch` files under `backend/cpp/llama-cpp-localai-paged/patches/paged/` are the
|
||||
source of truth and are unchanged by this bump. The DGX dev tree
|
||||
(`~/llama-paged-dev`, branch `paged`) was advanced to `c299a92c` for consistency;
|
||||
the pre-bump state is retained at `paged-prebump-9d5d882d-backup`.
|
||||
@@ -0,0 +1,337 @@
|
||||
# Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL
|
||||
|
||||
Source-only analysis (no GPU, no build) of what it would take to give the
|
||||
gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute
|
||||
backends, so the patch-series decode win extends past CUDA-family hardware.
|
||||
|
||||
In our changeset (patches 0018-0030) these fusions ship with CUDA native kernels
|
||||
+ CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan /
|
||||
SYCL (a CPU-fallback fused op would regress via the device round-trip, and a
|
||||
backend that ran the plain op on the discriminated node would silently
|
||||
miscompute). "Layer 2" is the upstream work that adds the missing native kernels.
|
||||
|
||||
This doc was written against the ggml backend trees in
|
||||
`backend/cpp/llama-cpp-paged-dev` (upstream base #24732, one commit OLDER than the
|
||||
series pin `c299a92c` #25045, with only the two paged-KV patches applied - neither
|
||||
touches GDN/SSM). So every "kernel already exists" statement below is a
|
||||
conservative lower bound: the pin has at least these kernels.
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
## 0. Headline finding (correct a stale assumption first)
|
||||
|
||||
The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel
|
||||
upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there."
|
||||
**That is now stale.** All three backends already carry the BASE compute ops:
|
||||
|
||||
| op | Metal | Vulkan | SYCL |
|----|-------|--------|------|
| GGML_OP_GATED_DELTA_NET | `kernel_gated_delta_net_impl` (f32, NSG 1/2/4) | `gated_delta_net.comp` (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | `gated_delta_net.cpp` (`launch_gated_delta_net<KDA,keep_rs>`) |
| GGML_OP_SSM_CONV | `kernel_ssm_conv_f32_f32` (+ `_4`, + batched) | `ssm_conv.comp` (+ APPLY_BIAS, APPLY_SILU specialization consts) | `ssm_conv.cpp` (`kernel_ssm_conv`) |
| GGML_OP_SSM_SCAN | yes | `ssm_scan.comp` (mamba2) | `ssm_scan.cpp` (mamba2) |
|
||||
|
||||
Verified: Vulkan `gated_delta_net.comp` was last touched at the upstream base
|
||||
commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on
|
||||
Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today
|
||||
(via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is
|
||||
the decode SPEEDUP from the fusions, NOT enabling the model to run at all.
|
||||
|
||||
Consequence: the GDN-compute op being "partly there" is true on every backend,
|
||||
not just Metal. What is still missing per backend is only the FUSION plumbing
|
||||
(in-place write-back target, the ids gather read, and the conv-update kernel) -
|
||||
a materially smaller scope than "port GDN from scratch."
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
## 1. Per-op semantics (the four fusions to port)
|
||||
|
||||
All four reuse an existing GGML_OP enum with extra `src[]` slots as a
|
||||
discriminator; none adds a new enum value. f32 throughout. The arithmetic core
|
||||
is IDENTICAL to the upstream non-fused op; only the read source and/or the write
|
||||
target are redirected. That single fact drives the whole bit-exactness story
|
||||
(section 3).
|
||||
|
||||
### OP A - `ggml_gated_delta_net_inplace` (patch 0018)
|
||||
- Enum `GGML_OP_GATED_DELTA_NET`, discriminated by a non-null `src[6]` =
|
||||
`state_dst` (a contiguous `[S_v*S_v*H, n_seqs]` view into the recurrent-state
|
||||
cache at `kv_head`). K == 1 only.
|
||||
- Semantics: run the standard GDN recurrence, but write the FINAL recurrent state
|
||||
directly into `state_dst` instead of appending it to the op output. The op
|
||||
output then carries only the attention scores. Removes the per-layer per-step
|
||||
~full-state D2D copy-back (the 0018 win).
|
||||
- Race (in-place read == write): each (seq, head) block owns a disjoint cache
|
||||
slot. The kernel loads the whole prior state `s0` into per-thread registers
|
||||
(`s_shard` on CUDA, `ls[NSG]` on Metal, the column shard on Vulkan/SYCL)
|
||||
BEFORE the ring write, so reading and writing the same slot is safe.
|
||||
|
||||
### OP B - `ggml_gated_delta_net_inplace_ids` (patch 0019)
|
||||
- Adds `src[5]` = FULL state cache `[S_v,S_v,H,n_rs_slots]`, `src[7]` = `ids`
|
||||
(I32, per-seq source slot == the recurrent-state `s_copy`), `op_param[1]` =
|
||||
`rs_head` (destination base slot). Still has the OP-A `src[6]` in-place target.
|
||||
- Semantics: read each sequence's prior state directly from `cache[ids[seq]]`
|
||||
(mirrors `ggml_ssm_scan`'s ids source), eliminating the `ggml_get_rows`
|
||||
materialization. Combined with OP A the op now reads AND writes the cache in
|
||||
place.
|
||||
- Race: identity sequences (`ids[s] == rs_head + s`, the steady AR-decode case)
|
||||
read s0 in place from the destination slot (safe via the register snapshot
|
||||
above). Non-identity sequences (reorder / rs_zero remap) are first copied by a
|
||||
TINY separate gather kernel (`gdn_gather_nonident`, one block/seq) into a
|
||||
DISJOINT scratch that the recurrence then reads, so the recurrence never reads
|
||||
a slot another block is writing. Value-preserving memcpy -> bit-identical to
|
||||
the get_rows path.
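
The OP B read rule (identity reads in place, non-identity goes through a disjoint scratch) can be modeled sequentially as below. This is a simplified sketch, not the CUDA kernel: slots are plain Python lists, `update` stands in for the GDN recurrence, and both function names are hypothetical.

```python
def read_prior_states(cache, ids, rs_head):
    """Collect each sequence's prior state before any slot is overwritten."""
    scratch = {}
    for seq, src_slot in enumerate(ids):
        if src_slot != rs_head + seq:
            # Non-identity: copy into a disjoint scratch (the tiny gather pass).
            scratch[seq] = list(cache[src_slot])
    priors = []
    for seq in range(len(ids)):
        if seq in scratch:
            priors.append(scratch[seq])
        else:
            # Identity: snapshot the destination slot, like the register load.
            priors.append(list(cache[rs_head + seq]))
    return priors


def step_inplace(cache, ids, rs_head, update):
    """Read via ids, then write each new state into slot rs_head + seq."""
    priors = read_prior_states(cache, ids, rs_head)
    for seq, prior in enumerate(priors):
        cache[rs_head + seq] = update(prior)
```

With `ids = [1, 0]` and `rs_head = 0` the two sequences swap slots; gathering first is what keeps each one from reading a slot the other has already overwritten.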
|
||||
|
||||
### OP C - `ggml_ssm_conv_update_inplace` (patch 0021)
|
||||
- Enum `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]` =
|
||||
`conv_state_dst` (`[(K-1)*channels, n_seqs]` in-place ring view).
|
||||
`src[0]` = conv_states `[K-1, channels, n_seqs]`, `src[1]` = conv_kernel
|
||||
`[K, channels]`, `src[2]` = x_cur `[channels, 1, n_seqs]`. `op_param[0]` =
|
||||
fuse_silu.
|
||||
- Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the
|
||||
width-K conv window in registers from the K-1 cached taps + the current token,
|
||||
compute the depthwise conv with the SAME ascending-tap FMA order as plain
|
||||
`ssm_conv` (`tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` to match plain conv's
|
||||
`sumf += b` with b==0), optionally fold SiLU, write the conv output
|
||||
`[channels,1,n_seqs]`, and write the 1-token-shifted ring state back in place.
|
||||
Replaces the 4-op decode conv chain (transpose + concat + conv + silu + ring
|
||||
cpy).
|
||||
- Race: read source (gathered taps) and write target (cache view) are disjoint
|
||||
buffers -> race-free by construction, no ids/identity logic.
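
A scalar sketch of the OP C decode step for one (channel, sequence), following the semantics listed above: build the width-K window, accumulate in ascending tap order, add `0.0`, optionally apply SiLU, and return the shifted ring state. The function names are illustrative, and Python doubles stand in for the f32 arithmetic of the real kernels.

```python
import math


def silu(x):
    return x / (1.0 + math.exp(-x))


def conv_update(taps, x_cur, weights, fuse_silu=False):
    """One decode step for one (channel, sequence).

    taps:    the K-1 cached inputs, oldest first
    x_cur:   the current token's input
    weights: the K depthwise kernel weights
    Returns (output, new_taps).
    """
    window = list(taps) + [x_cur]        # local copy, built before any write
    acc = 0.0
    for value, weight in zip(window, weights):   # ascending tap order
        acc += value * weight
    acc += 0.0                           # mirrors plain conv's `sumf += b`, b == 0
    if fuse_silu:
        acc = silu(acc)
    return acc, window[1:]               # ring state shifted by one token
```

For example, `conv_update([1.0, 2.0, 3.0], 4.0, [0.5, 0.25, 0.125, 1.0])` yields an output of `5.375` and new taps `[2.0, 3.0, 4.0]`.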
|
||||
|
||||
### OP D - `ggml_ssm_conv_update_inplace_ids` (patch 0028)
|
||||
- Same enum, discriminated by a non-null `src[4]` = `ids`; `src[0]` becomes the
|
||||
FULL conv cache `[K-1, channels, n_cells]`; `op_param[1]` = rs_head.
|
||||
- Semantics: gather-free conv-update - read each sequence's prior taps from
|
||||
`cache[ids[s]]` in-kernel (no get_rows). Identity reads in place from
|
||||
`conv_state_dst`; non-identity gathered into a disjoint scratch first by a tiny
|
||||
`ssm_conv_gather_nonident` kernel. The window is copied to a local array
|
||||
BEFORE the (possibly aliasing) ring write so the identity read==write slot is
|
||||
correct. Bit-identical to get_rows + OP C.
|
||||
|
||||
### Net new kernels vs reuse, per op
|
||||
- OP A: NOT a new compute kernel - a write-target redirection of the EXISTING
|
||||
GDN kernel + 1 buffer binding + a supports_op/op-handler branch.
|
||||
- OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) +
|
||||
1 ids binding + rs_head param + 1 tiny gather kernel.
|
||||
- OP C: a GENUINELY NEW kernel on each backend. The existing `ssm_conv` computes
|
||||
a windowed reduction over a PRE-concatenated input; it does not assemble the
|
||||
window from cached taps + the current token, fold silu, or write the shifted
|
||||
ring state. This is the largest net-new piece.
|
||||
- OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1
|
||||
tiny conv gather kernel.
|
||||
|
||||
The `ggml.h` / `ggml.c` builders, the CPU reference kernels, the model-graph
|
||||
emission (`delta-net-base.cpp`, qwen35*), and the `test-backend-ops` cases are
|
||||
SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend
|
||||
work is the kernel(s) + the backend wiring.
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
## 2. Per-backend: authoring model, effort, gotchas, wiring
|
||||
|
||||
### 2.1 Metal (MSL)
|
||||
|
||||
Authoring model: `.metal` MSL source (`ggml-metal.metal`), function-constant
|
||||
specialization (e.g. `FC_GATED_DELTA_NET`), kernels templated on `NSG`; host
|
||||
glue split across `ggml-metal-ops.cpp` (`ggml_metal_op_*` encode), the pipeline
|
||||
lookup in `ggml-metal-device.cpp`/`.m`, the kargs struct in `ggml-metal-impl.h`,
|
||||
and `supports_op` in `ggml-metal-device.m`. Threadgroup model; Apple GPU
|
||||
simdgroup width is a FIXED 32, `simd_sum` for the per-column reduce.
|
||||
|
||||
Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist
|
||||
and are ergonomic to extend. OP A is a write-base redirect of the existing
|
||||
`kernel_gated_delta_net_impl` (its tail already does
|
||||
`dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j]` after
|
||||
loading `ls[]` into registers - just point `dst_state` at the `state_dst` buffer
|
||||
and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu
|
||||
ssm_conv variant today - only plain + `_4` + batched - so the silu-fold and ring
|
||||
write are both new). Host glue spans 3-4 files.
|
||||
|
||||
Gotchas:
|
||||
- In-place race: the existing kernel ALREADY snapshots the state column into
|
||||
`ls[NSG]` registers before writing, so OP A/B are safe with no barrier; OP C/D
|
||||
must mirror the `float window[K]` local-copy-before-write that CPU/CUDA use.
|
||||
- Discriminated SSM_CONV: `supports_op` for `GGML_OP_SSM_CONV` currently returns
|
||||
`has_simdgroup_reduction` with NO check of `src[3]`/`src[4]`; GDN returns
|
||||
`has_simdgroup_reduction && src[2]->ne[0] % 32 == 0` with NO check of
|
||||
`src[6]`/`src[7]`. Both must be tightened (accept the discriminated variant
|
||||
only once the kernel exists) AND `ggml_metal_op_ssm_conv` /
|
||||
`ggml_metal_op_gated_delta_net` must branch on the extra src to pick the kernel.
|
||||
- Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three -
|
||||
the fused variant only redirects addresses, so it is bit-identical to Metal's
|
||||
own non-fused path by construction (the conv per-channel FMA needs the exact
|
||||
ascending order + the `+0.0f`).
|
||||
- The kargs struct grows by the `state_dst` / `ids` / `rs_head` fields; a new
|
||||
pipeline name (or a function-constant branch) distinguishes the variants.
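
The `supports_op` tightening described in the gotchas above amounts to one extra predicate. The sketch below models it in Python under loud assumptions: `src` is a list whose unbound slots are `None`, and `has_fused_kernel` is a hypothetical flag meaning "the fused Metal kernel has been written". Neither name exists in ggml; the real check lives in `ggml-metal-device.m`.

```python
def is_discriminated(src, slots):
    """True when any of the extra src slots is bound (non-None)."""
    return any(i < len(src) and src[i] is not None for i in slots)


def supports_ssm_conv(src, has_simdgroup_reduction, has_fused_kernel):
    """Plain SSM_CONV keeps today's rule; fused variants need their own kernel."""
    if is_discriminated(src, (3, 4)) and not has_fused_kernel:
        # Refuse the node rather than run the plain kernel on it.
        return False
    return has_simdgroup_reduction
```

Rejecting the discriminated node until the kernel exists is what prevents the silent-miscompute hazard that patch 0030 currently guards against by gating the fusions off.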

### 2.2 Vulkan (GLSL .comp -> SPIR-V)

Authoring model: GLSL `.comp` in `vulkan-shaders/`, compiled at build time by
`vulkan-shaders-gen` into embedded SPIR-V byte arrays (`gated_delta_net_f32_data`
etc.); pipeline creation in `ggml-vulkan.cpp` declares the binding count +
push-constant size; a push-constant struct per op; host dispatch `ggml_vk_*`
binds subbuffers; `supports_op` in the device support function. Subgroup size
VARIES by vendor (NVIDIA 32, AMD 64, Intel 8/16/32).

Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as
Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a
new `ggml_vk_create_pipeline` registration with an explicit binding count and
push-constant size, a new/extended push-constant struct (add `rs_head`), and
GROWING the descriptor binding set from the current 7 (`src[0..5]` + dst) to 8-9
(`state_dst`, `ids`). The GDN host dispatch hardcodes a 6-src bind loop and the
pipeline is created with `"main", 7, ...` - both must change.

Gotchas:
- Subgroup variance interacts with the EXISTING variant matrix: the GDN comp
  already ships shmem / cluster / nocluster variants keyed on subgroup size and
  relies on `S_V % COLS_PER_WG == 0`. The OP-A/B read/write redirect must be
  applied across ALL of those variants, and re-validated per vendor.
- In-place race: GLSL must read the full column shard into local registers before
  the ring write (same pattern); confirm the SPIR-V memory model is not relied on
  for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)).
  OP C/D need the explicit window-to-local copy.
- Discriminated SSM_CONV: `supports_op` returns `op->src[0]->type == F32` with NO
  discriminator check; GDN loops `src[0..5]` F32 with NO `src[6]`/`src[7]` check.
  Both must be tightened. This is the backend where the 0030 hazard is most
  concrete (a present plain-conv kernel + a permissive supports_op = silent
  miscompute) - Vulkan is the exact case 0030 was written for.
- conv-update is per-channel (one invocation per channel) so it is
  subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden.
- Vulkan's `ssm_conv.comp` ALREADY has APPLY_SILU + APPLY_BIAS specialization
  constants, so the silu-fold half of OP C is partly precedented here (unlike
  Metal); the ring write-back + tap-window assembly are still new.

### 2.3 SYCL (single-source DPC++)

Authoring model: plain C++ `.cpp`/`.hpp` per op (`gated_delta_net.cpp`,
`ssm_conv.cpp`); a SYCL `queue.parallel_for` over an `nd_range` with
`reqd_sub_group_size(WARP_SIZE)`; sub-group reductions (`warp_reduce_sum`);
`supports_op` in `ggml-sycl.cpp`. NO separate shader-compile step (single
source).

Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are
near-VERBATIM mirrors of the CUDA ones (`launch_gated_delta_net<KDA,keep_rs>`,
`s_shard`, `curr_state`, `state = dst + attn_score_elems`, `warp_reduce_sum`) -
a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port
almost line-for-line: add the `state_dst` param, the `ids`/`rs_head` params, the
read-base select, the two tiny gather kernels, and the new conv-update kernel.
No pipeline/push-constant/binding bookkeeping.
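
The "read-base select" can be pictured as choosing where a row's base pointer comes from. This is a host-side C++ sketch under stated assumptions: the flat `cache` layout, the `ids` indirection and all function names are hypothetical, not the CUDA or SYCL code. It contrasts the non-fused path (gather rows into a contiguous buffer, then reduce) with the fused path (reduce straight through `ids`), and both share the same per-row reduction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Non-fused path: copy the selected rows into a contiguous buffer first.
// The copy is a plain memcpy, so every value is preserved bit for bit.
std::vector<float> gather_rows(const std::vector<float> & cache, size_t row_len,
                               const std::vector<int32_t> & ids) {
    assert(row_len > 0);
    std::vector<float> out(ids.size() * row_len);
    for (size_t s = 0; s < ids.size(); ++s) {
        std::memcpy(out.data() + s * row_len,
                    cache.data() + (size_t) ids[s] * row_len,
                    row_len * sizeof(float));
    }
    return out;
}

// Shared per-row reduction in ascending order; both paths call this.
float reduce_row(const float * row, size_t row_len) {
    float acc = 0.0f;
    for (size_t i = 0; i < row_len; ++i) {
        acc += row[i];
    }
    return acc;
}

// Fused path: select the read base through ids and reduce in place,
// with no intermediate gather buffer.
float reduce_indexed(const std::vector<float> & cache, size_t row_len,
                     const std::vector<int32_t> & ids, size_t s) {
    const float * base = cache.data() + (size_t) ids[s] * row_len;
    return reduce_row(base, row_len);
}
```

Because only the base address differs between the two paths, the results compare equal with `memcmp`, which is the property the fused variants rely on.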

Gotchas:
- In-place race: the `s_shard[]` / window arrays are per-work-item private, so
  the register-snapshot-before-write pattern carries over directly. Safe.
- Discriminated SSM_CONV: `supports_op` checks `src[0]`/`src[1]` F32 with NO
  discriminator check; GDN returns a BARE `true` (the MOST permissive, so the
  hazard is worst here). Both must be tightened, and `ggml_sycl_op_ssm_conv` /
  `ggml_sycl_op_gated_delta_net` must branch on the extra src.
- Bit-exactness: `WARP_SIZE` is compile-fixed (Intel sub-group 8/16/32), same
  situation as CUDA; the fused variant matches SYCL's own non-fused path by
  construction. conv-update is per-channel -> subgroup-agnostic.

### 2.4 Common wiring (all three) + the 0030 emission-gate change

Per backend, four wiring touch-points beyond the kernel body:
1. `supports_op`: tighten the `GGML_OP_SSM_CONV` and `GGML_OP_GATED_DELTA_NET`
   entries so the discriminated/extra-src node is reported supported ONLY when
   the new kernel handles it (and rejected otherwise, instead of today's
   silently-true-for-the-plain-kernel).
2. op handler: branch on `src[3]`/`src[4]` (conv) and `src[6]`/`src[7]` (GDN) to
   dispatch the fused kernel.
3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor
   bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions).
4. The patch-0030 gate in `src/llama-context.cpp`.
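
Touch-points 1 and 2 can be sketched with stand-in types. Everything below is hypothetical (`Node`, `BackendCaps`, `Op` and the function names are NOT the real ggml structs or API); only the `src[3]`/`src[4]` and `src[6]`/`src[7]` discriminators come from the text above. The point is that a node carrying the extra sources is reported as supported only when the fused kernel exists, instead of being accepted on the strength of the plain kernel.

```cpp
#include <cassert>

// Hypothetical stand-ins for the node fields discussed above.
enum class Op { SSM_CONV, GATED_DELTA_NET, OTHER };

struct Node {
    Op   op;
    bool has_src[8];   // has_src[i] is true when src[i] is present
};

struct BackendCaps {
    bool plain_conv;   // plain SSM_CONV kernel exists
    bool fused_conv;   // conv-update kernel (OP C/D) exists
    bool plain_gdn;    // plain GATED_DELTA_NET kernel exists
    bool fused_gdn;    // GDN kernel honours state_dst / ids (OP A/B)
};

// A node is "discriminated" when it carries the extra sources.
bool is_discriminated(const Node & n) {
    switch (n.op) {
        case Op::SSM_CONV:        return n.has_src[3] || n.has_src[4];
        case Op::GATED_DELTA_NET: return n.has_src[6] || n.has_src[7];
        default:                  return false;
    }
}

// Tightened supports_op: the plain kernel never vouches for a fused node.
bool supports_op(const BackendCaps & caps, const Node & n) {
    const bool fused = is_discriminated(n);
    switch (n.op) {
        case Op::SSM_CONV:        return fused ? caps.fused_conv : caps.plain_conv;
        case Op::GATED_DELTA_NET: return fused ? caps.fused_gdn  : caps.plain_gdn;
        default:                  return false;
    }
}
```

The op handler would branch on the same `is_discriminated` predicate, so the support check and the dispatch cannot drift apart.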

The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg
name is not `"CUDA"`/`"ROCm"`/`"MUSA"` forces `fused_gdn_ar = fused_gdn_ch =
auto_fgdn = false`. As each backend gains kernels this must become
capability-driven, in one of two ways:
- minimal: add the backend's reg name (e.g. `"Metal"`) to the allow-list once its
  kernels + tightened supports_op ship; OR
- clean (recommended upstream form): DELETE the name allow-list and make
  `supports_op` authoritative - have the `auto_fgdn` resolution probe
  `ggml_backend_dev_supports_op` on a representative node that carries the
  discriminated `src[]` slots. Then routing falls out of the normal scheduler
  fallback and no backend name is ever hard-coded. This also fixes 0030's stated
  weakness that the upstream `auto_fgdn` check only inspects GATED_DELTA_NET
  nodes and covered the discriminated SSM_CONV only incidentally.
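
The difference between the two forms can be sketched as two predicates over the device list. This is an illustrative C++ model only: `Device`, `probe_fused` and both function names are hypothetical, and `probe_fused` merely stands in for whatever a `supports_op` probe on a representative discriminated node would report.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical view of one compute device.
struct Device {
    std::string reg_name;     // backend registry name, e.g. "CUDA"
    bool        is_cpu;
    bool        probe_fused;  // assumed result of probing a fused node
};

// "minimal" form: fused emission stays on only if every non-CPU device's
// reg name is on a hard-coded allow-list.
bool fused_allowed_by_name(const std::vector<Device> & devs,
                           const std::vector<std::string> & allow) {
    for (const Device & d : devs) {
        if (d.is_cpu) continue;
        bool listed = false;
        for (const std::string & name : allow) {
            if (d.reg_name == name) { listed = true; break; }
        }
        if (!listed) return false;
    }
    return true;
}

// "clean" form: no names at all; each non-CPU device answers for itself.
bool fused_allowed_by_probe(const std::vector<Device> & devs) {
    for (const Device & d : devs) {
        if (!d.is_cpu && !d.probe_fused) return false;
    }
    return true;
}
```

With the name list, a backend that gains kernels stays gated off until someone edits the list; with the probe, the gate follows the backend's own answer.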

--------------------------------------------------------------------------------
## 3. Bit-exactness per backend (the md5 gate question)

Feasible on ALL THREE, and not actually constraining, because of how the gate is
scoped:

- The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has
  its own f32 reduction order (Metal `simd_sum`, Vulkan subgroup reduce, SYCL
  `warp_reduce_sum`) that differs from CUDA's and from CPU's. There is no
  cross-backend md5 and none is expected.
- The relevant per-backend invariant is: the FUSED variant must equal that
  backend's OWN non-fused path. The fusions change only the read source
  (gather -> indexed read; the gather is a value-preserving memcpy) and the write
  target (appended output -> in-place cache slot). They do NOT touch the
  per-column FMA/reduce order. So the fused op is bit-identical to the
  non-fused op on the same backend BY CONSTRUCTION.
- Two arithmetic details each port MUST preserve exactly: (a) the conv
  ascending-tap order plus the `+0.0f` that matches plain `ssm_conv`'s
  `sumf += b` with b==0; (b) the existing GDN per-column subgroup reduce (do not
  re-order it). Get those right and `test-backend-ops` (backendX-vs-CPU, already
  registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS /
  GATED_DELTA_NET) is the per-backend gate.
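
Why these two details matter at the bit level can be shown with plain IEEE-754 f32 arithmetic. The snippet below is a generic C++ illustration, not kernel code, and the helper names are made up; it assumes default round-to-nearest and no fast-math flags. It shows that summing the same values in a different order can change the result, and that adding `0.0f` is not a bit-level identity (it turns `-0.0f` into `+0.0f`), which is one reason an apparently redundant `+ 0.0f` has to be mirrored.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Raw bit pattern of a float, for "exact bits" comparisons.
uint32_t bits_of(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

// Accumulate front to back.
float sum_ascending(const float * v, int n) {
    assert(n >= 0);
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += v[i];
    return s;
}

// Accumulate back to front: same values, different rounding steps.
float sum_descending(const float * v, int n) {
    assert(n >= 0);
    float s = 0.0f;
    for (int i = n - 1; i >= 0; --i) s += v[i];
    return s;
}

// The "bias is zero" addition: leaves the value alone but can flip the
// sign bit of a negative zero.
float add_zero(float x) {
    return x + 0.0f;
}
```

For `{1e8f, -1e8f, 1.0f}` the ascending sum keeps the trailing `1.0f`, while the descending sum loses it when `1.0f` is absorbed into `-1e8f`, so an md5 over the outputs would differ between the two orders.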

--------------------------------------------------------------------------------
## 4. Upstream path and ranked recommendation

### Ops-first, then one PR per backend (NOT one big PR)

Recommended sequence:

1. PR #1 - OPS (already essentially done, upstreamable as-is): the `ggml.h`/
   `ggml.c` builders, the CPU reference kernels, the CUDA kernels, the
   `test-backend-ops` cases, and the capability-driven gate (the clean
   `supports_op`-authoritative version of 0030). This is independently mergeable
   and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed
   that way).
2. PR #2 - Metal kernels + wiring.
3. PR #3 - SYCL kernels + wiring.
4. PR #4 - Vulkan kernels + wiring.

Do NOT bundle the backends: each needs its own hardware to validate
`test-backend-ops`, reviewers are backend-specialized, and a regression in one
must not block the others.

### Value x effort ranking (which backend first)

| backend | user base / value | author effort | bit-exact difficulty | net rank |
|---------|-------------------|---------------|----------------------|----------|
| Metal | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32 simdgroup) | **1st** |
| SYCL | LOW-MED (Intel GPU) | LOWEST (near-verbatim CUDA mirror) | LOW | **2nd** |
| Vulkan | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | **3rd** |

Recommendation: **Metal first.** It banks the biggest user-facing decode win at
medium effort, the base GDN + conv kernels already exist, and Apple's fixed
simdgroup width makes bit-exactness the simplest. **SYCL second** as a cheap,
nearly mechanical follow-on (the port is a near line-for-line CUDA mirror, so it
is low-cost insurance even though the Intel-GPU audience is smaller). **Vulkan
last** as the high-effort / high-breadth capstone - it reaches the widest hardware
(AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the
existing variant matrix, the subgroup-width variance, and the per-vendor
validation burden mean it is best tackled once the pattern is proven on
Metal + SYCL.

A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR
(both are register-snapshot ports with no shader-gen step) and treat Vulkan as a
separate later effort.

--------------------------------------------------------------------------------
## 5. Summary

- GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL
  (the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all
  three today via the non-fused path; Layer-2 is about the decode SPEEDUP.
- Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids
  read (OP B) to the existing GDN kernel, write ONE new conv-update kernel
  (OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten
  supports_op + the op-handler branch + (Vulkan) the
  pipeline/push-constant/descriptor wiring. The builders, CPU refs, model graph
  and tests are shared and already done.
- Bit-exactness is feasible everywhere and per-backend by construction (the
  fusions redirect addresses, not the f32 reduction order); `test-backend-ops`
  (backendX-vs-CPU) is the gate.
- Sequence: ops-first PR (incl. the capability-driven replacement for 0030's
  name allow-list), then Metal, then SYCL, then Vulkan.
17
backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv
Normal file
@@ -0,0 +1,17 @@
model,engine,npl,decode_agg_tps,decode_perseq_tps,prefill_tps,ttft_mean_ms,peak_gb
q36-27b-nvfp4,llama,8,82.5,9.57,507.3,6038.1,53.51
q36-27b-nvfp4,llama,32,192.6,4.79,115.0,133551.7,69.63
q36-27b-nvfp4,llama,64,277.8,3.09,95.9,321618.8,83.96
q36-27b-nvfp4,llama,128,384.6,1.86,69.7,902762.7,93.82
q36-27b-nvfp4,vllm,8,70.4,8.76,2096.2,1861.1,110.92
q36-27b-nvfp4,vllm,32,211.8,6.28,2182.6,5353.2,110.87
q36-27b-nvfp4,vllm,64,309.1,4.38,2088.9,9512.4,110.88
q36-27b-nvfp4,vllm,128,418.8,2.79,1929.1,18449.5,110.95
q36-35b-a3b-nvfp4,llama,8,211.8,24.45,1236.4,2477.1,39.66
q36-35b-a3b-nvfp4,llama,32,393.0,10.02,1213.9,8225.2,47.11
q36-35b-a3b-nvfp4,llama,64,527.0,6.15,1152.3,15849.5,57.13
q36-35b-a3b-nvfp4,llama,128,726.4,3.73,276.8,213017.2,61.51
q36-35b-a3b-nvfp4,vllm,8,256.5,31.84,5186.5,768.8,109.62
q36-35b-a3b-nvfp4,vllm,32,500.8,14.90,6223.4,1830.4,109.63
q36-35b-a3b-nvfp4,vllm,64,686.1,9.83,5926.5,3224.4,109.63
q36-35b-a3b-nvfp4,vllm,128,882.2,6.05,5300.5,6487.7,109.64
217
backend/cpp/llama-cpp-localai-paged/docs/paged-burst-bench.cpp
Normal file
@@ -0,0 +1,217 @@
// Paged-pool burst-degradation repro (patch 0024). DEV SCAFFOLDING ONLY.
//
// Reproduces, at the libllama level, the two host-side defects behind the
// "later lower-npl prefill collapses, decode fine, restart cures it" benchmark
// signature:
//
//  * RECLAMATION GAP (Fix-1): a partial tail seq_rm(seq, p0>0, -1) - exactly
//    what llama-server issues on every reused slot - frees the kv-cache CELLS
//    but the paged manager keeps owning the trailing BLOCKS. The manager's
//    free pool silently shrinks. Test A measures the reclaimed-block delta.
//
//  * FRAGMENTATION / NO COMPACTION (Fix-2): a high-fan-out burst that allocates
//    many sequences and frees them in a scrambled order leaves the free queue a
//    scrambled permutation of physical block ids. A later low-npl prefill then
//    pops physically scattered blocks, so its KV scatter-write + in-kernel
//    paged-attention gather lose locality and prefill throughput collapses;
//    decode (single-token append) barely notices. Test B times an npl8 prefill
//    on a FRESH pool vs an npl8 prefill AFTER a scrambling burst+drain.
//
// PASS (post-fix): Test A reclaims ceil(PP/bs) - ceil(KEEP/bs) trailing blocks
// on the partial seq_rm (0 pre-fix); Test B's post-burst npl8 prefill_tps is
// within ~10% of the fresh npl8 and num_free returns to the pristine value
// after the drain.
//
// Run with LLAMA_KV_PAGED=1. Env: BURST_NSLOT(64) NPL(8) PP(512) KEEP(256)
// GEN(4) PAGED_NGL(99). All sequences use distinct content so nothing is shared.

#include "llama.h"
#include "paged-prefix-api.h"

#include <chrono>
#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <vector>

static int env_i(const char * k, int dflt) { const char * v = getenv(k); return v ? atoi(v) : dflt; }

using clk = std::chrono::steady_clock;
static double secs(clk::time_point a, clk::time_point b) {
    return std::chrono::duration<double>(b - a).count();
}

struct Ctx { llama_context * ctx; llama_memory_t mem; llama_batch batch; int n_vocab; };

// Deterministic, content-distinct token for (seq, pos): keeps every sequence's
// blocks unique so no cross-request prefix sharing masks the accounting.
static llama_token tok_of(int seq, int pos, int n_vocab) {
    return (llama_token) (((seq * 1000003 + pos * 131 + 7) % (n_vocab - 200)) + 100);
}

// Prefill n tokens of seq at [pos0, pos0+n) in one ubatch (n <= n_batch).
// Returns wall seconds (sync'd).
static double prefill(Ctx & C, int seq, int pos0, int n) {
    clk::time_point t0 = clk::now();
    C.batch.n_tokens = 0;
    for (int j = 0; j < n; ++j) {
        int i = C.batch.n_tokens;
        C.batch.token[i]     = tok_of(seq, pos0 + j, C.n_vocab);
        C.batch.pos[i]       = pos0 + j;
        C.batch.n_seq_id[i]  = 1;
        C.batch.seq_id[i][0] = seq;
        C.batch.logits[i]    = (j + 1 == n) ? 1 : 0;
        C.batch.n_tokens++;
    }
    if (llama_decode(C.ctx, C.batch)) { fprintf(stderr, "prefill decode failed seq=%d\n", seq); return -1; }
    llama_synchronize(C.ctx);
    return secs(t0, clk::now());
}

// One decode step (single token) for seq at pos.
static void decode1(Ctx & C, int seq, int pos) {
    C.batch.n_tokens = 1;
    C.batch.token[0] = tok_of(seq, pos, C.n_vocab);
    C.batch.pos[0] = pos; C.batch.n_seq_id[0] = 1; C.batch.seq_id[0][0] = seq; C.batch.logits[0] = 1;
    if (llama_decode(C.ctx, C.batch)) fprintf(stderr, "decode1 failed seq=%d\n", seq);
}

int main(int argc, char ** argv) {
    std::setlocale(LC_NUMERIC, "C");
    const char * model_path = nullptr;
    for (int i = 1; i < argc; ++i) if (!strcmp(argv[i], "-m") && i + 1 < argc) model_path = argv[++i];
    if (!model_path) { fprintf(stderr, "usage: %s -m model.gguf\n", argv[0]); return 2; }

    const int  NSLOT = env_i("BURST_NSLOT", 64);
    const int  NPL   = env_i("NPL", 8);
    const int  PP    = env_i("PP", 512);
    const int  KEEP  = env_i("KEEP", 256);
    const int  GEN   = env_i("GEN", 4);
    const int  ngl   = env_i("PAGED_NGL", 99);
    const bool paged = getenv("LLAMA_KV_PAGED") != nullptr;

    ggml_backend_load_all();
    llama_model_params mp = llama_model_default_params();
    mp.n_gpu_layers = ngl;
    llama_model * model = llama_model_load_from_file(model_path, mp);
    if (!model) { fprintf(stderr, "model load failed\n"); return 1; }
    const llama_vocab * vocab = llama_model_get_vocab(model);
    const int n_vocab = llama_vocab_n_tokens(vocab);

    // Pool sized for the burst plus headroom so the burst fits but a later npl
    // run draws from whatever the burst's churn left behind.
    const long cells = (long) (NSLOT + NPL + 4) * (PP + GEN + 16);
    llama_context_params cp = llama_context_default_params();
    cp.n_ctx      = (uint32_t) cells;
    cp.n_batch    = (uint32_t) (PP + 16);
    cp.n_ubatch   = (uint32_t) (PP + 16);
    cp.n_seq_max  = NSLOT + NPL + 2;
    cp.kv_unified = true;   // one unified stream-0 pool -> num_free(ctx) is the whole pool
    cp.no_perf    = true;
    llama_context * ctx = llama_init_from_model(model, cp);
    if (!ctx) { fprintf(stderr, "ctx init failed (cells=%ld)\n", cells); return 1; }

    Ctx C; C.ctx = ctx; C.mem = llama_get_memory(ctx); C.n_vocab = n_vocab;
    C.batch = llama_batch_init(cp.n_batch, 0, 1);

    printf("== paged-burst-bench == paged=%d NSLOT=%d NPL=%d PP=%d KEEP=%d GEN=%d n_ctx=%ld\n",
           paged, NSLOT, NPL, PP, KEEP, GEN, cells);

    llama_memory_clear(C.mem, true);
    const long F_start = paged_prefix_api::num_free_global();

    // ---- Test A: Fix-1 reclamation gap on a partial tail seq_rm --------------
    {
        // A failed prefill would make every count below meaningless, so bail out.
        if (prefill(C, 0, 0, PP) < 0) return 1;
        const long f_after_prefill = paged_prefix_api::num_free_global();
        llama_memory_seq_rm(C.mem, 0, KEEP, -1);   // partial tail removal
        const long f_after_rm = paged_prefix_api::num_free_global();
        llama_memory_seq_rm(C.mem, 0, -1, -1);     // full free -> pristine
        const long f_after_full = paged_prefix_api::num_free_global();
        const long bs = 16;                        // block size, hard-coded in this harness
        const long expect = (PP + bs - 1)/bs - (KEEP + bs - 1)/bs;  // trailing blocks
        printf("[TEST-A Fix-1] start=%ld afterPrefill=%ld afterPartialRm=%ld reclaimed=%ld "
               "(expect %ld post-fix, 0 pre-fix) afterFullFree=%ld\n",
               F_start, f_after_prefill, f_after_rm, f_after_rm - f_after_prefill, expect, f_after_full);
    }

    // ---- Test B: fragmentation -> npl prefill collapse -----------------------
    // Fresh npl prefill baseline on a pristine pool.
    llama_memory_clear(C.mem, true);
    double tps_fresh;
    {
        clk::time_point t0 = clk::now();
        long ntok = 0;
        for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
        tps_fresh = ntok / secs(t0, clk::now());
        for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
    }
    const long F_pristine = paged_prefix_api::num_free_global();

    // High-fan-out burst: allocate NSLOT sequences, each prefilled + a few decode
    // steps (mixed alloc), then drain them in a scrambled order (odd ids first,
    // then even, each truncated before the full free) so the free queue becomes a
    // scrambled permutation - the fragmentation the bug never compacts.
    for (int s = 0; s < NSLOT; ++s) {
        if (prefill(C, NPL + s, 0, PP) < 0) return 1;
        for (int g = 0; g < GEN; ++g) decode1(C, NPL + s, PP + g);
    }
    const long F_during_burst = paged_prefix_api::num_free_global();
    // Drain: partial tail seq_rm (the reused-slot pattern) then full free, in a
    // scrambled slot order to scramble the physical free order.
    for (int parity = 1; parity >= 0; --parity)
        for (int s = 0; s < NSLOT; ++s) if ((s & 1) == parity) {
            llama_memory_seq_rm(C.mem, NPL + s, KEEP, -1);   // partial (Fix-1 path)
            llama_memory_seq_rm(C.mem, NPL + s, -1, -1);     // full free
        }
    const long F_after_drain = paged_prefix_api::num_free_global();

    // Post-burst npl prefill: pops from the (pre-fix scrambled / post-fix
    // defragged) free queue.
    double tps_post;
    {
        clk::time_point t0 = clk::now();
        long ntok = 0;
        for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; }
        tps_post = ntok / secs(t0, clk::now());
        for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1);
    }

    const double ratio = tps_fresh > 0 ? tps_post / tps_fresh : 0;
    printf("[TEST-B frag] num_free: start=%ld pristine=%ld duringBurst=%ld afterDrain=%ld "
           "(afterDrain==pristine? %s)\n",
           F_start, F_pristine, F_during_burst, F_after_drain,
           F_after_drain == F_pristine ? "YES" : "NO");
    printf("[TEST-B frag] prefill_tps fresh=%.1f post-burst=%.1f ratio=%.3f "
           "(PASS if >=0.90)\n", tps_fresh, tps_post, ratio);

    // ---- Test C: idle-slot retention leak -> reclaim (the Fix-3 scenario) -----
    // Burst NSLOT sequences and leave them IDLE (stock llama-server keeps an idle
    // slot's KV; the blocks are stranded). F_idle shows the depleted pool a later
    // low-npl run would see. Then full-seq_rm each (exactly what Fix-3's
    // prompt_clear() issues at slot.release): F_reclaimed must return to pristine.
    llama_memory_clear(C.mem, true);
    // Touch the pool once so the manager exists, then read the full-pool size
    // (num_free is 0 while no manager is registered).
    if (prefill(C, 0, 0, 16) < 0) return 1;
    llama_memory_seq_rm(C.mem, 0, -1, -1);
    const long F_pre_c = paged_prefix_api::num_free_global();
    for (int s = 0; s < NSLOT; ++s) { if (prefill(C, NPL + s, 0, PP) < 0) return 1; }
    const long F_idle = paged_prefix_api::num_free_global();
    for (int s = 0; s < NSLOT; ++s) llama_memory_seq_rm(C.mem, NPL + s, -1, -1);   // Fix-3 release
    const long F_reclaimed = paged_prefix_api::num_free_global();
    printf("[TEST-C idle] pristine=%ld idle_after_burst=%ld (leaked=%ld) reclaimed=%ld "
           "(returns_to_fresh? %s)\n",
           F_pre_c, F_idle, F_pre_c - F_idle, F_reclaimed,
           F_reclaimed == F_pre_c ? "YES" : "NO");

    printf("RESULT paged=%d frag_fix2_ratio=%.3f drain_numfree_returns=%s idle_reclaim_returns=%s\n",
           paged, ratio,
           F_after_drain == F_pristine ? "YES" : "NO",
           F_reclaimed == F_pre_c ? "YES" : "NO");

    llama_batch_free(C.batch);
    llama_free(ctx);
    llama_model_free(model);
    return 0;
}
@@ -0,0 +1,59 @@
// Host-side unit test for the paged-pool burst-reclaim fix (patch 0024).
// Compiles paged-kv-manager.cpp directly; no ggml / llama / GPU dependency.
//
//   Fix-1  PagedKVManager::truncate(seq, n_keep) reclaims the trailing blocks
//          beyond ceil(n_keep/bs) (ref-counted), so a partial tail seq_rm no
//          longer strands blocks whose cells were cleared.
//   Fix-2  defrag_free_pool() relinks the free queue into ascending block-id
//          order once the pool is fully idle, undoing a burst's scrambled frees
//          so a later prefill pops physically contiguous blocks again.

#include "paged-kv-manager.h"
#include <cstdio>

using paged::PagedKVManager;

int main() {
    int rc = 0;

    // ---- Fix-1: truncate reclaims the trailing block suffix -----------------
    {
        PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/true);
        const size_t f0 = m.num_free_blocks();   // 63 (block 0 reserved as null)
        m.allocate(0, 512);                      // ceil(512/16)=32 blocks
        const size_t f1 = m.num_free_blocks();   // 31
        m.truncate(0, 256);                      // keep ceil(256/16)=16, free 16
        const size_t f2 = m.num_free_blocks();   // 47
        printf("[unit Fix-1] free=%zu alloc512=%zu truncate256=%zu reclaimed=%zu (expect 16)\n",
               f0, f1, f2, f2 - f1);
        if (f2 - f1 != 16) rc = 1;
        m.truncate(0, 16);                       // keep 1 block, free 15 more
        const size_t f3 = m.num_free_blocks();   // 62
        printf("[unit Fix-1] truncate16=%zu (expect %zu)\n", f3, f0 - 1);
        if (f3 != f0 - 1) rc = 1;
        m.free(0);
        if (m.num_free_blocks() != f0) { printf("[unit Fix-1] free mismatch\n"); rc = 1; }
    }

    // ---- Fix-2: defrag restores ascending popleft order ---------------------
    {
        PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/false);
        for (int s = 0; s < 8; ++s) m.allocate(s, 16);        // pop blocks 1..8
        const int scrambled[8] = {3, 7, 1, 5, 0, 6, 2, 4};    // free out of order
        for (int i = 0; i < 8; ++i) m.free(scrambled[i]);
        m.defrag_free_pool();                                 // all idle -> compact
        m.allocate(100, 16 * 3);                              // pop 3 blocks
        const auto bt = m.block_table(100);
        bool asc = true;
        printf("[unit Fix-2] post-defrag block_table:");
        for (size_t i = 0; i < bt.size(); ++i) {
            printf(" %d", bt[i]);
            if (i && bt[i] < bt[i - 1]) asc = false;
        }
        printf(" ascending=%s (expect YES)\n", asc ? "YES" : "NO");
        if (!asc) rc = 1;
    }

    printf("UNIT %s\n", rc == 0 ? "PASS" : "FAIL");
    return rc;
}
Binary file not shown.
|
After Width: | Height: | Size: 88 KiB |
Binary file not shown.
|
After Width: | Height: | Size: 89 KiB |